Nish Tahir
nish.nishtahir.com.ap.brid.gy

🌉 bridged from https://nishtahir.com/ on the fediverse by https://fed.brid.gy/
GMKTec Evo-X2 Ryzen AI Max 395+ Benchmarks
I recently got my hands on a GMKTec Evo-X2 for local model inference. Here are the hardware details:

```
nish@gmktec-evo-x2:~$ sudo lshw -short
H/W path      Device  Class      Description
=========================================================
                      system     NucBox_EVO-X2 (EVO-X2-001)
/0                    bus        GMKtec
/0/0                  memory     64KiB BIOS
/0/b                  memory     1280KiB L1 cache
/0/c                  memory     16MiB L2 cache
/0/d                  memory     64MiB L3 cache
/0/e                  processor  AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
/0/11                 memory     128GiB System Memory
```

The box came with Windows 11 Pro preinstalled, which I didn't bother with and quickly replaced with Ubuntu Server.

```
nish@gmktec-evo-x2:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.3 LTS
Release:        24.04
Codename:       noble
```

# Out of the box performance

I installed `ollama` and tested a few models using the verbose option. It's worth noting these were out-of-the-box runs with no additional drivers or tooling installed. My prompt was "What is the distance between the earth and the Sun?"

I started with `gpt-oss:20b`.

```
total duration:       11.558212329s
load duration:        98.524563ms
prompt eval count:    76 token(s)
prompt eval duration: 39.185462ms
prompt eval rate:     1939.49 tokens/s
eval count:           270 token(s)
eval duration:        11.346835974s
eval rate:            23.80 tokens/s
```

`gpt-oss:120b` was next, which showed decent performance.

```
total duration:       24.107760366s
load duration:        218.341745ms
prompt eval count:    77 token(s)
prompt eval duration: 4.975562972s
prompt eval rate:     15.48 tokens/s
eval count:           277 token(s)
eval duration:        18.757952199s
eval rate:            14.77 tokens/s
```

I tested `qwen3:32b` and was quite disappointed with the performance.

```
total duration:       5m16.79871007s
load duration:        47.442582ms
prompt eval count:    19 token(s)
prompt eval duration: 924.354931ms
prompt eval rate:     20.55 tokens/s
eval count:           1393 token(s)
eval duration:        5m15.38993452s
eval rate:            4.42 tokens/s
```

# ROCm and AMD GPU driver installation

Next I installed ROCm and the AMD GPU driver by following the instructions here. I was quite surprised that the ROCm installation required 23GB of disk space.

```
$ sudo apt install rocm
...
Need to get 5345 MB of archives.
After this operation, 23.0 GB of additional disk space will be used.
Do you want to continue? [Y/n]
```

I verified the installation using `rocm-smi`.

```
$ rocm-smi
======================================== ROCm System Management Interface ========================================
================================================== Concise Info ==================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK  MCLK  Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
==================================================================================================================
0       1     0x1586,   40251  27.0°C  5.083W    N/A, N/A, 0         N/A   N/A   0%   auto  N/A     0%     0%
==================================================================================================================
============================================== End of ROCm SMI Log ===============================================
```

Testing `qwen3:32b` showed improved performance. I assume this is the result of the updated drivers.

```
total duration:       1m49.043363047s
load duration:        51.078806ms
prompt eval count:    20 token(s)
prompt eval duration: 202.439512ms
prompt eval rate:     98.79 tokens/s
eval count:           1021 token(s)
eval duration:        1m48.316184545s
eval rate:            9.43 tokens/s
```

`gpt-oss:120b` also showed improved performance.
```
total duration:       9.300572016s
load duration:        100.106345ms
prompt eval count:    77 token(s)
prompt eval duration: 144.640986ms
prompt eval rate:     532.35 tokens/s
eval count:           295 token(s)
eval duration:        8.925786695s
eval rate:            33.05 tokens/s
```

Just under 50 tps for `gpt-oss:20b`!

```
total duration:       7.016576027s
load duration:        96.902471ms
prompt eval count:    77 token(s)
prompt eval duration: 159.437642ms
prompt eval rate:     482.95 tokens/s
eval count:           305 token(s)
eval duration:        6.602954724s
eval rate:            46.19 tokens/s
```

# Llama.cpp

To build llama.cpp from source, I first installed the build dependencies.

```
sudo apt install build-essential cmake libcurl4-openssl-dev
```

When building llama.cpp for AMD GPUs, the HIP build instructions require an `AMDGPU_TARGET` to be set. I found this using `rocminfo`.

```
$ rocminfo
ROCk module version 6.14.14 is loaded
...
Agent 2
*******
  Name:                    gfx1151
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
```

Then I ran a build using:

```
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
    -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
    && cmake --build build --config Release -- -j 16
```

Then I tested it against `ggml-org/gemma-3-1b-it-GGUF`.

```
$ ./build/bin/llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

llama_perf_sampler_print:    sampling time =      26.93 ms /   366 runs   (    0.07 ms per token, 13590.29 tokens per second)
llama_perf_context_print:        load time =     491.51 ms
llama_perf_context_print: prompt eval time =      34.05 ms /    19 tokens (    1.79 ms per token,   557.99 tokens per second)
llama_perf_context_print:        eval time =    2141.20 ms /   347 runs   (    6.17 ms per token,   162.06 tokens per second)
llama_perf_context_print:       total time =   14215.85 ms /   366 tokens
llama_perf_context_print:    graphs reused =        345
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self  model  context  compute  unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 63876 + (1314 =  762 +     38 +     514) +      345 |
llama_memory_breakdown_print: |   - Host               |                  318 =  306 +      0 +      12              |
```

I pulled the same model using ollama to compare. I believe llama.cpp uses the `Q4_K_M` quant by default, so it should be a fair comparison.
```
$ ollama run gemma3:1b --verbose
total duration:       1.979890959s
load duration:        124.143692ms
prompt eval count:    19 token(s)
prompt eval duration: 35.132589ms
prompt eval rate:     540.81 tokens/s
eval count:           271 token(s)
eval duration:        1.718437609s
eval rate:            157.70 tokens/s
```

`gpt-oss:120b` managed to hit 45 tps.

```
llama_perf_sampler_print:    sampling time =      27.84 ms /   281 runs   (    0.10 ms per token, 10092.67 tokens per second)
llama_perf_context_print:        load time =   11317.00 ms
llama_perf_context_print: prompt eval time =     138.78 ms /    16 tokens (    8.67 ms per token,   115.29 tokens per second)
llama_perf_context_print:        eval time =    5828.50 ms /   264 runs   (   22.08 ms per token,    45.29 tokens per second)
llama_perf_context_print:       total time =  593306.52 ms /   280 tokens
llama_perf_context_print:    graphs reused =        262
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free   self   model  context  compute  unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 4734 + (60421 = 59851 +  171 +     398) +       380 |
llama_memory_breakdown_print: |   - Host               |                  601 =   586 +    0 +      15              |
```

`gpt-oss:20b` reports 65 tps.

```
llama_perf_sampler_print:    sampling time =      12.83 ms /   263 runs   (    0.05 ms per token, 20495.64 tokens per second)
llama_perf_context_print:        load time =    1017.53 ms
llama_perf_context_print: prompt eval time =      72.33 ms /    16 tokens (    4.52 ms per token,   221.20 tokens per second)
llama_perf_context_print:        eval time =    3754.42 ms /   246 runs   (   15.26 ms per token,    65.52 tokens per second)
llama_perf_context_print:       total time =  186207.11 ms /   262 tokens
llama_perf_context_print:    graphs reused =        244
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self    model  context  compute  unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 53694 + (11461 = 10949 +  114 +     398) +       380 |
llama_memory_breakdown_print: |   - Host               |                   601 =   586 +    0 +      15              |
```

# iGPU tweaks

To ensure that all of the VRAM (128GB) is addressable for bigger models, I made a few adjustments in the BIOS sourced from here.

1. Set UMA frame buffer size to 1G (this was the minimum in my BIOS). Interestingly, this is the value that gets reported by `rocm-smi --showmeminfo vram`[1]

   ```
   ============================ ROCm System Management Interface ============================
   ================================== Memory Usage (Bytes) ==================================
   GPU[0]          : VRAM Total Memory (B): 1073741824
   GPU[0]          : VRAM Total Used Memory (B): 163188736
   ==========================================================================================
   ================================== End of ROCm SMI Log ===================================
   ```

2. Disable IOMMU.

Next I added the following kernel boot options to GRUB to set the GTT and TTM sizes.
```
$ sudo nano /etc/default/grub

# Update GRUB_CMDLINE_LINUX_DEFAULT to
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
```

I verified this using

```
$ sudo dmesg | grep -i gtt
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    0.068142] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] Configuring gttsize via module parameter is deprecated, please use ttm.pages_limit
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] GTT size has been set as 137438953472 but TTM size has been set as 66813538304, this is unusual
[    3.527605] [drm] amdgpu: 131072M of GTT memory ready.
```

and

```
$ sudo dmesg | grep -i ttm
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    0.068142] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] Configuring gttsize via module parameter is deprecated, please use ttm.pages_limit
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] GTT size has been set as 137438953472 but TTM size has been set as 66813538304, this is unusual
```
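As a quick sanity check on those numbers (a sketch; the only assumption is the stock 4 KiB x86-64 page size): `amdgpu.gttsize` is specified in MiB and `ttm.pages_limit` in memory pages, so both values work out to the machine's full 128 GiB, which matches the `137438953472` GTT size reported in the dmesg output above.

```python
gtt_bytes = 131072 * 1024 * 1024  # amdgpu.gttsize=131072 (MiB)
ttm_bytes = 33554432 * 4096       # ttm.pages_limit=33554432 (4 KiB pages, assumed page size)

print(gtt_bytes, ttm_bytes)       # 137438953472 137438953472
assert gtt_bytes == ttm_bytes == 128 * 1024**3
```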
# Llama bench

To compare performance with the DGX Spark, I ran `llama-bench` with params I found here.

```
./build/bin/llama-bench -m model.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 --mmap 0
```

They all have the preamble

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
```

and end with

```
build: 03792ad9 (6816)
```

## gpt-oss:20b

model | size | params | backend | ngl | n_ubatch | fa | test | t/s
---|---|---|---|---|---|---|---|---
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 | 1621.75 ± 122.61
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 | 65.73 ± 0.07
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 1172.54 ± 1.82
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 59.53 ± 0.06
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 950.99 ± 1.95
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 57.25 ± 0.06
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d16384 | 695.44 ± 0.78
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d16384 | 53.79 ± 0.05
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d32768 | 451.42 ± 0.54
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d32768 | 47.71 ± 0.06

## gpt-oss:120b

model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s
---|---|---|---|---|---|---|---|---|---
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 818.11 ± 9.03
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 46.05 ± 0.18
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 650.83 ± 2.16
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 42.45 ± 0.03
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 542.66 ± 1.71
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 40.88 ± 0.04
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 411.87 ± 1.60
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 38.39 ± 0.06
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 274.69 ± 0.65
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 34.15 ± 0.01

## Qwen3 Coder 30B A3B

model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s
---|---|---|---|---|---|---|---|---|---
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 773.45 ± 44.44
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 50.16 ± 0.22
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 534.51 ± 1.19
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 44.28 ± 0.03
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 407.36 ± 0.54
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 40.25 ± 0.03
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 274.46 ± 0.34
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 34.89 ± 0.03
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 166.77 ± 0.24
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 27.59 ± 0.01

* * *

1. From what I can tell `rocm-smi` doesn't report VRAM usage correctly. A more accurate reflection of VRAM consumption seems to be given by `free -h`. ↩︎
nishtahir.com
October 23, 2025 at 9:38 PM
On the DORA 2025 AI Report - AI Adoption and use
The DORA State of AI-Assisted Software Development report came out recently. It's a massive 142-page report that details analysis done by the DORA team, capturing trends and observations from a 5,000-participant study focused on AI adoption and tool use in the software industry. It's a long and detailed report, so I'll focus on areas I think are interesting and summarize as I go, following the flow of the original report and leaving commentary along the way. These notes follow my progress in trying to understand this stuff myself, and I invite any thoughts or perspectives on the topic. The original report is available here. The survey questions are published here.

# Foreword

The foreword highlights a decade-long evolution in software development practices, emphasizing Google's DORA research on DevOps and its recent pivot to address AI's impact. The author is bullish on vibe coding, as evidenced by their upcoming book of the same name, and takes the stance of having seen AI produce extremely positive outcomes, going so far as to label last year's report, which showed a correlation between increased AI use and reduced software stability/throughput, the "2024 DORA anomaly".

> Steve and I have seen how using vibe coding can go wrong, resulting in deleted tests, outages, and even deleted code repositories. But we've concluded that this was because the engineering instincts that served us well for decades were now proving woefully insufficient.

They blame the issues they've experienced with vibe coding on current engineering instincts being "woefully insufficient". The author comes from the position that this is a paradigm shift that needs new patterns to keep up with.

> Suppose the fastest you've ever traveled is walking at four miles per hour, and someone asks you to drive a car at 50 miles per hour. Without practice and training, you will undoubtedly wreck the car.

I think the broader technical industry has seen its stance evolve on vibe coding over the year. There remain unresolved issues with surrendering your understanding and your solution becoming a black box[1]. It's possible that the author has solutions that they will be revealing in their book. They support their claim that training and patterns are the issue through two case studies, starting with Adidas:

> Fernando Cornago, global vice-president, Digital and E-Commerce Technology, Adidas, oversees nearly a thousand developers. In their generative AI (gen AI) pilot, they found that teams who worked in loosely coupled architectures and had fast feedback loops "experienced productivity gains of 20% to 30%, as measured by increases in commits, pull requests, and overall feature-delivery velocity," and had a "50% increase in 'Happy Time'"—more hands-on coding and less administrative toil.

And Booking.com:

> We also appreciated the case study from Bruno Passos, group product manager, Developer Experience, Booking.com, which has a team of more than 3,000 developers. In their gen AI innovation efforts, they found that, "developer uptake of vibe coding and coding assistant tools was uneven ... Bruno's team soon realized the missing ingredient was training. When developers learned how to give their coding assistant more explicit instructions and more effective context, they found up to 30% increases in merge requests and higher job satisfaction.

They conclude by pointing out that this report includes data from 5,000 participants and aims to uncover groundbreaking insights similar to past DevOps breakthroughs.
# AI adoption and use

The report defines AI adoption as the intersection between reliance, trust, and reflexive use, and tweaked the survey questions to measure those key facets. The results show that AI has seen overwhelming adoption: 90% of respondents say they use AI at work in some capacity. It's worth noting that this is roughly in line with the 84% reported in the Stack Overflow developer survey[2]. Unfortunately my excitement for this is somewhat tempered by the AI tool-use mandates that have become more common[3]. It's difficult to say how much of the adoption is purely organic and how much is the result of mandates pushing for greater use.

The next section points out that, in aggregate, 60% of users report reflexively using AI half the time or more.

> Although AI use is nearly ubiquitous in our sample, reflexive use—the default employment of AI when facing a problem—is not. Among AI users, only 7% report "always" using AI when faced with a problem to solve or a task to complete, while 39% only "sometimes" seek AI for help. Still, a full 60% of AI users in our survey employ AI "about half the time" or more when encountering a problem to solve or task to complete, suggesting that AI has become a frequent part of the development process.

## Perception of productivity

Some of the stats focus on perception. Respondents report a perception of increased productivity and code quality.

> More than 80% of this year's survey respondents report a perception that AI has increased their productivity. Although more than 40% report that their productivity has increased only "slightly," fewer than 10% of respondents perceive AI contributing to any decrease in their productivity.

> In addition to perceiving positive impacts on their productivity, a majority (59%) of survey respondents also observe that AI has positively impacted their code quality. 31% perceive this increase to be only "slight" and another 30% observe neither positive nor negative impacts. However, just 10% of respondents perceive any negative impacts on their code quality as a result of AI use.

While the data here is interesting, I think its weakness is that it primarily relies on self-reported data, which makes it difficult to establish a causal relationship to the tools. Do the devs feel more productive because the tool is actually making them more productive, or is there an illusion of productivity because they are typing messages to an LLM? The METR study[4] that came out this year made an attempt to measure this.

> Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower[4:1]

There is of course nuance to it, but I argue that it's enough evidence to cast at least some doubt on self-reported metrics in this context[5]. Speaking anecdotally, whether or not I see gains from AI depends on the task, how much the requirements have been figured out, how much of the solution is "cookie cutter", and so on. What I have found it consistently does is save me from typing as much. Anecdotes published by others seem to mirror my experience[6][7], but they also point out some of the dangers that come from not understanding the nuance:

> These claims wouldn't matter if the topic weren't so deadly serious. Tech leaders everywhere are buying into the FOMO, convinced their competitors are getting massive gains they're missing out on.
> This drives them to rebrand as AI-First companies, justify layoffs with newfound productivity narratives, and lowball developer salaries under the assumption that AI has fundamentally changed the value equation.[6:1]

## Trust

Overall, 46% of developers "somewhat" trust AI-generated output, 20% say "a lot", and 4% say "a great deal". There isn't a direct analog in the Stack Overflow survey; however, there is a section on "Accuracy of AI tools" we can use as a reference.

> More developers actively distrust the accuracy of AI tools (46%) than trust it (33%), and only a fraction (3%) report "highly trusting" the output. Experienced developers are the most cautious, with the lowest "highly trust" rate (2.6%) and the highest "highly distrust" rate (20%), indicating a widespread need for human verification for those in roles with accountability.[2:1]

The results line up quite well, pointing to shared frustrations with the reliability of the tools. They offer some advice on building trust in AI[8].

> Importantly, developers who trust gen AI more reap more positive productivity benefits from its use. In a logs-based exploration of Google developers' trust in AI code completion, our EPR team found that developers who frequently accepted suggestions from a gen AI-assisted coding tool submitted more change lists (CLs) and spent less time seeking information than developers who infrequently accepted suggestions from the same tool. This was true even when controlling for confounding factors, including job level, tenure, development type, programming language, and CL count. Put simply, **developers who trust gen AI more are more productive**.

Something that stands out to me is that this comes from the perspective that using gen AI is a fixed productivity gain. That seems true based on the self-reported data, but it still depends heavily on who you ask[9][10]. The five pieces of advice they offer to increase trust all seem like good ideas regardless of how you feel about the tech or adoption.

1. Establish a policy about acceptable gen AI use, even if your developers are good corporate citizens.

   > ... establishing clear guidelines encouraging acceptable use of gen AI will likely also promote cautious and responsible developers to use gen AI, by assuaging fears of unknowingly acting irresponsibly

2. Double-down on fast, high-quality feedback, like code reviews and automated testing, using gen AI as appropriate.

   > ... appropriate safeguards assuring them that any errors that may be introduced by gen AI-generated code will be detected before it is deployed to production.

3. Provide opportunities for developers to gain exposure to gen AI, especially those which support using their preferred programming language.

   > Providing opportunities to gain exposure to gen AI, like training, unstructured activities, or slack time devoted to trying gen AI, will help increase trust, _especially if such activities can be performed in developers' preferred programming language_ in which they are best equipped to evaluate gen AI's quality

4. Encourage gen AI use, but don't force it.

   > One approach to encouraging gen AI use in a manner that prioritizes developers' sense of control is to promote the spread of knowledge organically, by building community structures to foster conversations about gen AI

5. Help developers think beyond automating their day-to-day work and envision what the future of their role might look like.
   > ... without a clear vision for what the transformed role of a developer working at a higher level of abstraction in which these repetitive tasks are delegated to gen AI resembles, it will be hard to assuage fears of unemployment.

I think trust here covers the fact that there is a learning curve to the tools. How and when you use them can be the difference between good and bad outcomes; however, there is pressure to adopt them whenever and wherever possible, which can backfire and erode trust in the tools.

# Conclusion

The report concludes this section by opining that although respondents express concerns about trust in AI-generated code, they report positive impacts to productivity and code quality.

> But, whether social pressure is a logical motivation to adopt a new technology is debatable. While our data shows many positive outcomes of AI adoption, we have also documented notable drawbacks.
>
> For this reason, we caution against interpreting these findings of AI's ubiquity as an indication that all organizations should rapidly move to adopt AI, regardless of their specific needs. Rather, we interpret these findings as a strong signal that everyone engaged in software development—whether an individual contributor, team manager, or executive leader—should think deeply about whether, where, and how AI can and should be applied in their work.

They note that their data also points to considerable drawbacks when exploring the impact of AI adoption, most notably that higher rates of AI adoption predict increased software delivery instability and developer burnout.

Overall, I think this is a great report, but one with deep nuance that is hurt by the fact that there are a lot of mixed signals coming from different sources that often show opposing data. It acknowledges a lot of things and behaviors we don't yet understand, but it also makes sweeping generalizations about productivity that the reader must approach with that nuance in mind. If you haven't already, I recommend taking the time to read the report.

* * *

1. Goel, N. (2025) Karpathy's 'vibe coding' movement considered harmful. nmn.gl. Available at: https://nmn.gl/blog/dangers-vibe-coding (Accessed: 2025-10-21). ↩︎
2. (no date) 2025 Stack Overflow developer survey. survey.stackoverflow.co. Available at: https://survey.stackoverflow.co/2025/ai (Accessed: 2025-10-21). ↩︎ ↩︎
3. (no date) www.reddit.com. Available at: https://www.reddit.com/r/ExperiencedDevs/comments/1j7aqsx/ai_coding_mandates_at_work/ (Accessed: 2025-10-21). ↩︎
4. METR. (2025) Measuring the impact of early-2025 AI on experienced open-source developer productivity. metr.org. Available at: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ (Accessed: 2025-10-21). ↩︎ ↩︎
5. Reading further, it looks like they agree: "These mixed signals indicate to us that more evidence-based work should be done to evaluate the true impact of AI on product development, especially given the sheer scale of AI investment and adoption. We believe that the developer community and employers should be setting realistic expectations, and gaining a clear perspective on AI's actual impact is the first step toward managing those expectations responsibly". ↩︎
6. Judge, M. (2025) Where's the shovelware? Why AI coding claims don't add up. mikelovesrobots.substack.com. Available at: https://mikelovesrobots.substack.com/p/wheres-the-shovelware-why-ai-coding (Accessed: 2025-10-21). ↩︎ ↩︎
7. (no date) Where's the shovelware? Why AI coding claims don't add up. news.ycombinator.com. Available at: https://news.ycombinator.com/item?id=45120517 (Accessed: 2025-10-21). ↩︎
8. Storer, KM. et al. (no date) Fostering trust in AI. dora.dev. Available at: https://dora.dev/research/ai/trust-in-ai/ (Accessed: 2025-10-21). ↩︎
9. Kapani, C. (2025) AI coding assistants aren't really making devs feel more productive. leaddev.com. Available at: https://leaddev.com/velocity/ai-coding-assistants-arent-really-making-devs-feel-more-productive (Accessed: 2025-10-21). ↩︎
10. (no date) www.reddit.com. Available at: https://www.reddit.com/r/ExperiencedDevs/comments/1lml3ti/did_ai_increase_productivity_in_your_company/ (Accessed: 2025-10-21). ↩︎
nishtahir.com
October 21, 2025 at 3:06 AM
Notes on OpenAI's AppSDK
OpenAI's dev day was today. While I wrote up a short summary of what was announced on Bluesky, one of the major announcements was the AppSDK for ChatGPT. It looks like OpenAI plans to position ChatGPT as a platform for the future, not unlike the Google Play and Apple App Stores, except within ChatGPT. The platform builds on MCP, encouraging developers to expose MCP servers that ChatGPT can discover for capabilities, but goes further in allowing developers to inject custom UI components that customers can interact with.

The general workflow appears to be:

1. Your MCP server backend exposes tools ChatGPT can call. Each tool has a JSON schema interface that defines inputs and outputs, along with additional widget metadata.
2. A user interacts with ChatGPT and invokes your app (usually by name), which causes ChatGPT to make a tool call to your MCP server. This is where you are expected to handle the business logic.
3. Your MCP server now has the option to respond with widget output data, which ChatGPT can embed inline in the conversation.

OpenAI provides some design guidelines they expect from apps:

> **Conversational**: Experiences should feel like a natural extension of ChatGPT, fitting seamlessly into the conversational flow and UI.
>
> **Intelligent**: Tools should be aware of conversation context, supporting and anticipating user intent. Responses and UI should feel individually relevant.
>
> **Simple**: Each interaction should focus on a single clear action or outcome. Information and UI should be reduced to the absolute minimum to support the context.
>
> **Responsive**: Tools should feel fast and lightweight, enhancing conversation rather than overwhelming it.
>
> **Accessible**: Designs must support a wide range of users, including those who rely on assistive technologies.

It's worth noting that this comes right after the announcement of the Agentic Commerce Protocol, which I assume this builds on in some way, although I didn't see the reference when browsing through. That creates an incentive for developers to build new experiences on the platform. This is interesting and legitimizes MCP in a way that we haven't seen yet. Before you run off and begin rolling out your own MCP server (a minimal sketch of what one looks like is at the end of this post), it's worth noting that MCP has a pretty large attack surface[1] and deployments must be designed with security in mind.

* * *

1. Bithead, XL. (2025) MCP security exposed: What You Need to know now. live.paloaltonetworks.com. Available at: https://live.paloaltonetworks.com/t5/community-blogs/mcp-security-exposed-what-you-need-to-know-now/ba-p/1227143 (Accessed: 2025-10-7). ↩︎
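To make the MCP side of the workflow a bit more concrete, here is a minimal, hypothetical tool server in Python using the official MCP SDK's `FastMCP` helper. The server name and tool are made up for illustration, and the Apps SDK-specific widget metadata is deliberately omitted since I haven't verified its exact shape.

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical server; "pizza-finder" and its tool are made up for illustration.
mcp = FastMCP("pizza-finder")

@mcp.tool()
def find_pizzerias(city: str, max_results: int = 5) -> list[dict]:
    """Return pizzerias for a city (stubbed data for illustration)."""
    return [{"name": f"Pizzeria {i}", "city": city} for i in range(max_results)]

if __name__ == "__main__":
    # An MCP client (ChatGPT, in the AppSDK case) connects to this server,
    # reads the JSON schema derived from the type hints above, and calls the
    # tool by name; the return value is what a widget would render.
    mcp.run()
```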
nishtahir.com
October 7, 2025 at 2:39 AM
Vector Norms
A norm of a vector \\(\vec{v}\\) describes the magnitude or size of the vector. It is usually denoted as \\(||\vec{v}||\\). There are a few common norms worth discussing.

## Euclidean Norm (L2 Norm)

Let's consider a vector `v = [3, 4]`. We can calculate the Euclidean norm as

\\[ ||\vec{v}|| = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5 \\]

The intuition here is that the norm calculates the hypotenuse of a right triangle, or in this case the shortest path to travel from `[0, 0]` to `[3, 4]`. We can formalize the process as

\\[ ||\vec{v}|| = \sqrt{\sum_{i=1}^{n} v_i^2} \\]

where

\\[ \vec{v} = [v_1, v_2, ..., v_n] \\]

which allows us to define a function for it. In code, this is quite straightforward with Python's math library.

```python
import math

def euclidean_norm(v):
    return math.sqrt(sum(x**2 for x in v))

v = [3, 4]
norm = euclidean_norm(v)
print(norm)  # 5.0
```

## Manhattan Norm (L1 Norm)

Let's consider the vector `v = [3, 4]`. We can calculate the Manhattan norm as

\\[ ||\vec{v}|| = |3| + |4| = 7 \\]

Intuitively, this computes how many lines we have to traverse in the `x` direction (3) and the `y` direction (4) in order to move from `[0, 0]` to `[3, 4]`. This is often called the taxicab distance. We can formalize the process as

\\[ ||\vec{v}|| = \sum_{i=1}^{n} |v_i| \\]

where

\\[ \vec{v} = [v_1, v_2, ..., v_n] \\]

```python
def manhattan_norm(v):
    return sum(abs(x) for x in v)

v = [3, 4]
norm = manhattan_norm(v)
print(norm)  # 7
```

## Generalized Norm (Lp Norm)

We can generalize the norm for any `p >= 1` as

\\[ ||\vec{v}|| = \left(\sum_{i=1}^{n} |v_i|^p\right)^{\frac{1}{p}} \\]

where

\\[ \vec{v} = [v_1, v_2, ..., v_n] \\]

In code this can be expressed as

```python
def generalized_norm(v, p):
    return (sum(abs(x)**p for x in v))**(1/p)

v = [3, 4]
norm = generalized_norm(v, 2)
print(norm)  # 5.0
```

## Axioms

### 1. Positive Definiteness

The norm of a vector is always non-negative. It is equal to zero if and only if the vector itself is the zero vector.

\\[ ||\vec{v}|| \ge 0 \text{, and } ||\vec{v}|| = 0 \iff \vec{v} = \vec{0} \\]

The intuition here is that a vector can't have a negative length. The only vector with a length of zero is the zero vector, which has no magnitude. We can test our function against this.

```python
zero_v = [0, 0]
norm = generalized_norm(zero_v, 2)
print(norm)  # 0.0
```

### 2. Absolute Homogeneity

If you scale a vector by a scalar value \\(\alpha\\), its norm is scaled by the absolute value of that scalar. Intuitively, it means that the hypotenuse of a triangle should scale with its sides.

\\[ ||\alpha \vec{v}|| = |\alpha| \, ||\vec{v}|| \\]

We can test our function against this. To scale our vector `v = [3, 4]` by a scalar, we multiply each element by the scalar value. We then compute the norm of the scaled vector and verify that it equals the original norm multiplied by the absolute value of the scalar.

```python
v = [3, 4]
alpha = 2

v_norm = generalized_norm(v, 2)
print(v_norm)  # 5.0

alpha_v = [alpha * x for x in v]
print(alpha_v)  # [6, 8]

alpha_v_norm = generalized_norm(alpha_v, 2)
print(alpha_v_norm)  # 10.0

assert abs(alpha) * v_norm == alpha_v_norm  # True
```

### 3. Triangle Inequality

The triangle inequality states that the norm of the sum of two vectors is less than or equal to the sum of their individual norms.

\\[ ||\vec{v} + \vec{w}|| \le ||\vec{v}|| + ||\vec{w}|| \\]

This can be visualized as the idea that the shortest distance between two points is a straight line. To add two vectors together, we sum the items at each position.
We can then take the norm of the result and compare it to the sum of the norms of each vector.

```python
v = [1, 2, 3]
w = [4, 5, 6]

# ||v + w||
v_plus_w = [v[i] + w[i] for i in range(len(v))]  # [5, 7, 9]
norm_v_plus_w = generalized_norm(v_plus_w, 2)
print(norm_v_plus_w)  # 12.449899597988733

# ||v|| + ||w||
norm_v_plus_norm_w = generalized_norm(v, 2) + generalized_norm(w, 2)
print(norm_v_plus_norm_w)  # 12.516621774166063
```

A fun exercise might be testing these axioms to see if they hold for higher `p` norms.
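Here is a quick sketch of that exercise with `p = 3` chosen arbitrarily (the `generalized_norm` definition is repeated so the snippet runs on its own):

```python
import math

def generalized_norm(v, p):
    return (sum(abs(x)**p for x in v))**(1/p)

p = 3
v = [1, 2, 3]
w = [4, 5, 6]
alpha = -2

# 1. Positive definiteness
assert generalized_norm(v, p) > 0
assert generalized_norm([0, 0, 0], p) == 0

# 2. Absolute homogeneity (compared with a tolerance to allow for float error)
scaled_norm = generalized_norm([alpha * x for x in v], p)
assert math.isclose(scaled_norm, abs(alpha) * generalized_norm(v, p))

# 3. Triangle inequality
v_plus_w = [a + b for a, b in zip(v, w)]
assert generalized_norm(v_plus_w, p) <= generalized_norm(v, p) + generalized_norm(w, p)
```

The same checks should pass for any `p >= 1`; for `p < 1` it is the triangle inequality that breaks down, which is why those are not true norms.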
nishtahir.com
October 6, 2025 at 3:42 AM
Notes on - Why do LLMs freak out over the seahorse emoji?
A fantastic deep dive into the seahorse emoji phenomenon[1] was recently published by Theia[2]. It's engaging, well presented, and worth reading. The post presents its case using `meta-llama/Llama-3.3-70B-Instruct`. However, I wanted to verify this behavior with smaller models, which unsurprisingly fail the challenge as well. I specifically tested `microsoft/Phi-4-mini-Instruct` and `HuggingFaceTB/SmolLM2-135M-Instruct`.

I started with the code sample Theia provided here and made some modifications:

1. Added a CLI using `typer` to make it easier to iterate
2. Printed tables using `rich` for nicer formatting
3. Saved and compared activations

A few things stood out to me when testing. Using `microsoft/Phi-4-mini-Instruct`, it's kind of interesting to see the early activations tend a bit more toward unsafe content before converging on the output[3]. This is a bit more obvious when contrasted against SmolLM2, whose output looks a bit more arbitrary to me.

We can compare the layer activations of a few different queries generated using `microsoft/Phi-4-mini-Instruct` to see how they differ (a minimal sketch of this kind of comparison is at the end of this post). My goal was to determine whether the queries are processed the same way. Intuitively, we can see that at first the queries are processed similarly, but they diverge around layer 20 as the model begins to converge on an output.

Here's a plot generated with SmolLM2. Similarly, it all starts off the same and begins to diverge around layer 23.

I made a git repo that anyone can build off of here.

* * *

1. (no date) www.reddit.com. Available at: https://www.reddit.com/r/GeminiAI/comments/1nglzed/gemini_loses_its_mind_after_failing_to_produce_a/ (Accessed: 2025-10-5). ↩︎
2. (no date) Why do LLMs freak out over the seahorse emoji?. vgel.me. Available at: http://vgel.me/posts/seahorse/ (Accessed: 2025-10-5). ↩︎
3. COVID being mentioned is not where I expected this experiment to go. It's fun surprises like this that keep things interesting. ↩︎
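As referenced above, this is the kind of comparison I mean: a minimal sketch (not the code from the linked repo) that computes per-layer cosine similarity of the last-token hidden states for two prompts. The model choice and prompts are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def last_token_states(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, hidden]
    return [h[0, -1] for h in out.hidden_states]

a = last_token_states("Is there a seahorse emoji?")
b = last_token_states("Is there a dolphin emoji?")

for layer, (ha, hb) in enumerate(zip(a, b)):
    sim = torch.nn.functional.cosine_similarity(ha, hb, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity {sim:.3f}")
```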
nishtahir.com
October 5, 2025 at 6:37 PM
I've been thinking about what it would take to migrate to this as my primary Mastodon account. I don't know if this is a supported use case or how well it would work.
September 23, 2025 at 12:27 AM
How LLM Structured Decoding works
Last week I happened to be in a discussion that involved getting an LLM to generate JSON reliably. A major frustration expressed was that, no matter how much they tried, the model would often fail to follow instructions during generation. I pointed out that most major vendors support some variant of "structured output" parsing, which allows the user to provide an output schema. That happened to be a good solution to the problem, but I wanted to take a moment to write up some notes about how and why it works so well.

All language models have a vocabulary, which is essentially a map of _token_ to _token ID_. Before making a prediction, strings are broken up into these tokens and mapped to numbers the model can work with. A snippet of the Phi-4-mini-instruct vocabulary looks like this.

```json
{
    "\u0120NSError": 85268,
    "\u0120filtro": 85269,
    "\u0120vyt": 85270,
    "\u0120Prefeitura": 85271,
    "*sizeof": 85272,
    "\u0120Continental": 85273,
    "\u0120Enfin": 85274,
    "???\u010a\u010a": 85275,
    "-best": 85276,
    "\u0120tolle": 85277,
    "\u00e8\u012d\u00b9\u00e6\u0140\u013e\u00e7\u012b\u012a": 85278,
    "\u0120\u00d8\u00a7\u00d9\u0126\u00d8\u00b5\u00d9\u012a\u00d8\u00b1": 85279,
    "\u0120\u00c3\u00a9nerg": 85280,
    "icester": 85281,
    "\u0120abbiamo": 85282,
    ...
}
```

We can tokenize a string using an instance of the tokenizer, which will give us a sequence of token IDs.

```python
from transformers import AutoTokenizer

prompt = "Write a json object with the following keys: name, age, city must be an object that starts with { and ends with }"

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4-mini-instruct")
inputs = tokenizer(prompt, return_tensors="pt")
print(inputs)
```

Output:

```
{'input_ids': tensor([[10930, 261, 5701, 2817, 483, 290, 3992, 12994, 25, 1308, 11, 5744, 11, 5030, 2804, 413, 448, 2817, 484, 13217, 483, 354, 326, 17095, 483, 388]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
```

A prediction from the model outputs a probability distribution over the entire vocabulary. This means that for each possible token, you get a score for how likely that token is to appear next in the sequence. We can make a prediction to visualize this.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4-mini-instruct")
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
```

Now let's grab the logits for the next token and see what the highest-probability candidates are.

```python
next_token_logits = logits[0, -1]
torch.topk(next_token_logits.softmax(dim=-1), 10)
```

Output:

```
torch.return_types.topk(
values=tensor([0.1901, 0.1205, 0.0866, 0.0475, 0.0461, 0.0414, 0.0316, 0.0183, 0.0180, 0.0171]),
indices=tensor([ 326,  483,   13,  887, 1366, 2804,  558, 2238,  350,  290]))
```

The highest-probability token here is 326, which happens to be `and`. Next is 483, which is `with`. So it's clear the model is really just trying to "complete" the prompt.

Since we have access to the predictions, structured decoding in this context means making more intelligent decisions about which predictions to accept from the model, based on the rules or criteria we wish to apply to our output. For example, a string must follow a strict grammar in order to be valid JSON[1].

1. A valid JSON object must start with a `{`
2. A valid JSON array must start with a `[`
3. A valid JSON primitive can be a string, number, `true`, `false`, `null`

So the first token of a valid JSON value can only come from a finite set of possibilities.
This means that when sampling the next token from the model's predictions, we can reject any token that would not be valid and only sample from a pool of valid alternatives[2].

```python
valid_starts = ["{", "["]
valid_ids = [tokenizer.encode(tok, add_special_tokens=False, return_tensors="pt")[0] for tok in valid_starts]
print(valid_ids)

# Mask out everything except our valid ids
mask = torch.full_like(next_token_logits, float("-inf"))
for vid in valid_ids:
    mask[vid] = next_token_logits[vid]

# Take the highest probability token from the pool of valid tokens
next_token_id = torch.argmax(mask).item()
next_token = tokenizer.decode([next_token_id])

# Print for visualization
print("Chosen token:", next_token)
```

Output:

```
Chosen token: {
```

In this example, I'm constraining the valid tokens to `{` and `[`. We assume every other token is invalid and mask them out. Then we take the highest-probability token from the pool of valid tokens.

For more elaborate control over what is valid, we need a way to define a grammar and partially match the completion against it. Most of the major LLM vendors provide some way to define a JSON schema[3][4][5], but more sophisticated APIs allow some sort of BNF-like notation or regexes for matching on the output (a sketch of the vendor-API route is at the end of this post). A manual regex-based implementation might look something like this.

```python
# Note: the original snippet relied on `pattern`, `token_ids`, and `token_strs`
# being defined elsewhere; the definitions below are my assumptions about what
# they looked like. The third-party `regex` module is used because it supports
# partial matching, which the standard `re` module does not.
import regex

# A simplified pattern for a single-key JSON object.
pattern = regex.compile(r'\s*\{\s*"[^"]*"\s*:\s*"[^"]*"\s*\}\s*')

# The string form of every token in the vocabulary, so we can check which
# tokens keep the completion a valid partial match. (Slow, but illustrative.)
token_ids = list(range(len(tokenizer)))
token_strs = [tokenizer.decode([tid]) for tid in token_ids]

prompt = "Write a valid json object with a single test key and value"
completion = ""

for i in range(20):
    inputs = tokenizer(f"<|user|>{prompt}<|end|><|assistant|>{completion}", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    next_token_logits = logits[0, -1]
    mask = torch.full_like(next_token_logits, float("-inf"))

    for token_id, token_str in zip(token_ids, token_strs):
        expected_completion = completion + token_str
        if pattern.fullmatch(expected_completion, partial=True):
            mask[token_id] = next_token_logits[token_id]

    if torch.all(mask == float("-inf")):
        # No valid alternative
        break

    next_token_id = torch.argmax(mask).item()
    next_token = tokenizer.decode([next_token_id])
    print(next_token)
    completion += next_token
```

Output:

```
{ " test ": " This " }
```

Hopefully this shows why structured output is actually guaranteed to conform to the schema, barring any bugs that occur during sampling. This is a great option if you have information about the expected structure that might not be immediately clear to the model, or if you intend for the output to be consumed by other tools.

* * *

1. This is an example. For full details on the grammar, the full standard is available here: https://www.json.org/json-en.html. ↩︎
2. It's not a coincidence that this is extremely similar to DFAs and other parsing techniques. You are effectively streaming lexemes and need to make decisions about what fits and what does not. ↩︎
3. (no date) Structured output. ai.google.dev. Available at: https://ai.google.dev/gemini-api/docs/structured-output (Accessed: 2025-9-13). ↩︎
4. (no date) OpenAI platform. platform.openai.com. Available at: https://platform.openai.com/docs/guides/structured-outputs (Accessed: 2025-9-13). ↩︎
5. (no date) Structured outputs. docs.vllm.ai. Available at: https://docs.vllm.ai/en/v0.9.2/features/structured_outputs.html (Accessed: 2025-9-13). ↩︎
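As referenced above, here is roughly what the vendor-managed route looks like: a sketch based on OpenAI's structured outputs documentation[4]. I haven't run this exact snippet, and the schema and model name are illustrative; under the hood the provider is presumably doing the same kind of constrained sampling shown earlier, just against a grammar compiled from the schema.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "Write a json object with the following keys: name, age, city"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"},
                },
                "required": ["name", "age", "city"],
                "additionalProperties": False,
            },
        },
    },
)

# The content is constrained to parse as the schema above.
print(response.choices[0].message.content)
```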
nishtahir.com
September 13, 2025 at 5:17 PM
This is incredible: Genie 3 generates interactive worlds in real time with persistent memory. I can't find a paper on it yet, but the preview is stunning.

https://www.theverge.com/news/718723/google-ai-genie-3-model-video-game-worlds-real-time
August 5, 2025 at 3:39 PM