
00 — Why Structured Generation Exists

For a downstream program, an LLM that "almost always" returns valid JSON is indistinguishable from one that "never returns valid JSON": the first failure breaks the pipeline. Structured generation replaces "almost always" with "always, by construction", and the only honest way to guarantee that is to constrain the logits step by step.

This is the motivation page. Read it before the math pages — the formulas are easy; understanding why we bother is the part that takes time.

The 99% trap

You prompt a model: "Reply with JSON: {verb: ..., tense: ..., person: ...}." It usually does. Maybe 99 times out of 100. The 100th time it says, "Sure! Here is the conjugation: json{...} Hope that helps!". Or it forgets a quote. Or it emits "tense": "past" when the schema expects "past_simple". Your downstream parser raises.

A 99% success rate looks fine in a demo and is catastrophic in production. If you call this LLM 1000 times a day, you have ten failures per day. If each failure triggers a retry or a fallback path, you have ten incidents per day that need handling. A 99.9% rate has one per day. A 99.99% rate has one per ten days. Every additional nine cuts the failure count by a factor of ten, but no finite number of nines brings it to zero.
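The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes calls fail independently; the function names are illustrative and not part of the course code.

```python
# Expected failures per day for a given per-call success rate.
# Assumption: every call fails independently with the same probability.
CALLS_PER_DAY = 1000  # the figure used in the text


def expected_failures_per_day(success_rate: float, calls: int = CALLS_PER_DAY) -> float:
    """Expected number of failed calls in one day."""
    return calls * (1.0 - success_rate)


def prob_at_least_one_failure(success_rate: float, calls: int = CALLS_PER_DAY) -> float:
    """Probability that at least one call fails in one day."""
    return 1.0 - success_rate ** calls


for rate in (0.99, 0.999, 0.9999):
    per_day = expected_failures_per_day(rate)
    any_fail = prob_at_least_one_failure(rate)
    print(f"success={rate:.2%}  failures/day={per_day:.2f}  P(any failure today)={any_fail:.3f}")
```

The second function makes the point of the section concrete: at 99%, a day with zero failures is vanishingly rare.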

You cannot prompt your way to infinite nines. The model is a probability distribution over tokens; there is always some non-zero mass on illegal continuations. The mass shrinks as the model is trained more and the prompt is engineered better — but it never reaches zero, because the model has no notion of "legal" beyond vibes.

The two ways out

You can either (a) validate after and retry on failure, or (b) constrain during and never produce an invalid output. Option (a) is what most production stacks did until ~2023: emit, parse, retry. It works, but it has three downsides:

Latency. A failed generation that retries at least doubles the wall-clock time for that request.
Cost. If you're paying per token, you pay for the bad output, then again for the retry.
Worst-case unboundedness. Theoretically the retry could fail too. You set a max retries; you accept the failure floor.
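A minimal sketch of option (a) shows where the three downsides come from. Here `fake_llm` is a hypothetical stand-in for a real model call, and the required keys are taken from the prompt in the example above; none of this is the course's actual code.

```python
import json
import random

REQUIRED_KEYS = {"verb", "tense", "person"}


def fake_llm(rng: random.Random, failure_rate: float) -> str:
    """Hypothetical model call: returns valid JSON unless it 'fails'."""
    if rng.random() < failure_rate:
        # Chatter followed by truncated JSON: the parser will reject this.
        return 'Sure! Here is the conjugation: {"verb": "write"'
    return '{"verb": "write", "tense": "past_simple", "person": "3sg"}'


def generate_with_retry(rng, failure_rate=0.01, max_retries=3):
    """Emit, parse, retry. Returns (parsed_output_or_None, attempts_used)."""
    attempts = max_retries + 1  # the first try plus the retries
    for attempt in range(1, attempts + 1):
        raw = fake_llm(rng, failure_rate)  # every attempt costs latency and tokens
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys():
            return parsed, attempt
    return None, attempts  # the failure floor: all attempts failed


def residual_failure_rate(failure_rate: float, max_retries: int) -> float:
    """Chance that every attempt fails, assuming independent attempts."""
    return failure_rate ** (max_retries + 1)


result, used = generate_with_retry(random.Random(0))
print(result, used)
print(residual_failure_rate(0.01, 3))
```

Retries shrink the failure rate geometrically, but the floor stays above zero for any finite `max_retries`, and each retry is paid for in both time and tokens.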

Option (b) — constrain during — makes the failure rate exactly zero by construction. At each decode step, we look at which next tokens could lead to a legal completion of the grammar, and we set the logits of all other tokens to \(-\infty\). After softmax, the illegal tokens have probability exactly zero. They cannot be sampled. The output, by construction, parses.
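In symbols, the step above can be sketched as follows. The notation is introduced here for illustration and is not from the original text: \(z_t(v)\) is the model's logit for token \(v\) at step \(t\), \(V\) is the vocabulary, and \(A_t \subseteq V\) is the set of tokens that can still lead to a legal completion, assumed non-empty.

\[
m_t(v) =
\begin{cases}
0 & \text{if } v \in A_t \\
-\infty & \text{if } v \notin A_t
\end{cases}
\qquad
p_t(v) = \frac{\exp\big(z_t(v) + m_t(v)\big)}{\sum_{u \in V} \exp\big(z_t(u) + m_t(u)\big)}
=
\begin{cases}
\dfrac{\exp z_t(v)}{\sum_{u \in A_t} \exp z_t(u)} & \text{if } v \in A_t \\[2ex]
0 & \text{if } v \notin A_t
\end{cases}
\]

Because \(\exp(-\infty) = 0\), illegal tokens drop out of both numerator and denominator, so the surviving probabilities are the model's own preferences renormalised over \(A_t\).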

This is what Outlines, lm-format-enforcer, jsonformer, OpenAI's "JSON mode", llama.cpp's GBNF grammars, and the structured-output features of every major LLM provider do under the hood. They differ in which grammar, how cleverly the mask is precomputed, and what they expose to the user. They do not differ in the basic mechanism.

The pedagogical claim

If you understand logit masking, you understand structured generation. The rest is engineering: how to write down a grammar, how to compile it to an automaton, how to make the mask construction fast enough to not double the per-token cost.

We will derive masking from first principles in 02-logit-masks.md. We will implement it naively in lab/00-regex-mask.md and lab/01-json-schema-mask.md. We will measure its cost in lab/03-mask-overhead.md. We will not implement a production-grade tokenizer-aware DFA for arbitrary grammars, because doing so would consume an entire phase by itself; we will describe the algorithm in 03-grammar-as-dfa.md so Borja can read Outlines' source code afterward and recognize what it's doing.

Why this matters for the grammar tutor

The Phase 32 capstone agent is a grammar tutor: it reads an English sentence, identifies any tense / person / agreement error, and proposes a correction. Mechanically it is a function:

tutor: English sentence  ->  {verb, tense, person, correction?, spanish?}
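One way to write that signature down in Python is a `TypedDict` with three required and two optional keys. This is only a sketch: the class names, the `str` field types, and the `shape_errors` helper are assumptions made for illustration, not the locked schema from the course repository.

```python
from typing import TypedDict


class _RequiredFields(TypedDict):
    # Field types are assumed to be plain strings for this sketch.
    verb: str
    tense: str
    person: str


class TutorOutput(_RequiredFields, total=False):
    # The "?" fields in the signature: may be absent.
    correction: str
    spanish: str


def shape_errors(candidate: dict) -> list:
    """List the ways `candidate` deviates from the TutorOutput key set."""
    required = set(TutorOutput.__required_keys__)
    allowed = required | set(TutorOutput.__optional_keys__)
    present = set(candidate)
    errors = [f"missing required field: {k}" for k in sorted(required - present)]
    errors += [f"unexpected field: {k}" for k in sorted(present - allowed)]
    return errors


print(shape_errors({"verb": "write", "tense": "past_simple", "person": "3sg"}))
print(shape_errors({"verb": "write", "mood": "indicative"}))
```

A check like this only looks at key names; it says nothing about whether the values are legal, which is what the enumerations discussed later add.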

The codomain is a fixed schema. Phase 32 cannot afford a parse failure: an agent that emits invalid JSON once per 100 requests is an agent that breaks the rest of the pipeline once per 100 requests. Validate-and-retry (option (a)) is not enough: the agent has tools to call (Phase 31), a sandbox to coordinate with, and downstream consumers (Phase 33's serving layer, Phase 34's observability) that all expect structured input.

So Phase 30 is the contract definition phase for the rest of the agent stack. The output schema we pick here is the type signature of everything downstream. Treat it as a public API.

The §A13 universe makes this almost trivial

The §A13 scope (5 tenses × 3 persons × 20 verbs) makes the conjugation schema a closed enumeration. There are exactly 20 legal verbs, exactly 5 legal tenses, exactly 3 legal persons. The mask precomputation is therefore essentially free: a table of size \(20 + 5 + 3 = 28\) legal values. Most of the work in production stacks is handling open-vocabulary string fields (free-form names, addresses); we don't have any. That's a feature of the microscopic curriculum, not a coincidence.
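The counting argument can be made concrete with a JSON-Schema-style dictionary whose three fields are all enumerations. The value lists below are placeholders (this page does not list the actual §A13 verbs, tenses, or persons), so only the sizes are meaningful; the schema layout is likewise an assumption, not the locked schema.

```python
# PLACEHOLDER values: only the counts (20, 5, 3) come from the text.
VERBS = [f"verb_{i:02d}" for i in range(20)]
TENSES = [f"tense_{i}" for i in range(5)]
PERSONS = [f"person_{i}" for i in range(3)]

# A closed-enumeration schema: every field is an "enum", none is a free string.
CONJUGATION_SCHEMA = {
    "type": "object",
    "properties": {
        "verb": {"enum": VERBS},
        "tense": {"enum": TENSES},
        "person": {"enum": PERSONS},
    },
    "required": ["verb", "tense", "person"],
    "additionalProperties": False,
}

# Size of the lookup table of legal values (one entry per value).
legal_values = sum(len(spec["enum"]) for spec in CONJUGATION_SCHEMA["properties"].values())

# Number of distinct schema-valid outputs (one per verb/tense/person triple).
legal_triples = len(VERBS) * len(TENSES) * len(PERSONS)

print(legal_values, legal_triples)  # 28 300
```

The table grows with the sum of the field sizes while the output space grows with their product, which is why precomputation stays cheap even though there are 300 distinct valid outputs.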

This is why Phase 30 ships a complete implementation rather than a stub — the universe is small enough that we can be exhaustive.

A note on what we aren't solving

Structured generation guarantees that the output parses against the schema. It does NOT guarantee that the content is correct. The agent can confidently emit {"verb": "eat", "tense": "past_simple", "person": "3sg"} for the sentence "I am eating now" — schema-valid, semantically wrong — and the parser will accept it. Correctness is a Phase 20 / Phase 28 / Phase 37 problem (evaluation, fine-tuning, adversarial probing). Phase 30 only ensures that the answer is the right shape.

This is genuinely useful. Most production systems fail at parse time, not at content time. Eliminating parse failures alone can reduce the operational cost of the system by 10×–100×.

What "the mask" looks like

A logit mask is an array of length \(|V|\) (vocabulary size). Each entry is either \(0\) (token is legal at this step) or \(-\infty\) (token would break the grammar). We add this mask to the model's logits before softmax. After softmax, illegal tokens have probability \(0\).

logits         = [ 1.2,  0.3,  4.1, -0.5,  2.0,  ...]    ← from the model
mask           = [   0, -inf,    0, -inf, -inf,  ...]    ← from the grammar
masked_logits  = [ 1.2, -inf,  4.1, -inf, -inf,  ...]
probs          = softmax(masked_logits)
               = [0.052,  0.0, 0.948,  0.0,  0.0,  ...]
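The worked example above can be reproduced in plain Python. This sketch truncates the vocabulary to the five entries shown, which amounts to assuming that every elided token is masked; the helper names are illustrative.

```python
import math

NEG_INF = float("-inf")


def apply_mask(logits, mask):
    """Add the mask to the logits: legal entries unchanged, illegal become -inf."""
    return [z + m for z, m in zip(logits, mask)]


def softmax(xs):
    """Numerically stable softmax that tolerates -inf entries."""
    peak = max(xs)
    if peak == NEG_INF:
        raise ValueError("the mask removed every token")
    exps = [math.exp(x - peak) for x in xs]  # math.exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.2, 0.3, 4.1, -0.5, 2.0]            # from the model
mask = [0.0, NEG_INF, 0.0, NEG_INF, NEG_INF]   # from the grammar
probs = softmax(apply_mask(logits, mask))

print([round(p, 3) for p in probs])  # [0.052, 0.0, 0.948, 0.0, 0.0]
```

Note that the token with logit 2.0 would have been the model's second choice; the mask removes it entirely instead of merely making it less likely.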

We then sample from probs using whatever sampling strategy Phase 21 introduced (greedy, top-p, temperature). The sampler doesn't care that some entries are zero; it just doesn't pick them.

The work of structured generation is: given the partial output so far, compute the mask. The rest is plumbing.
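For a closed set of legal strings, "compute the mask from the partial output" reduces to a prefix check. The sketch below uses a toy character-level vocabulary so that one token is one character; a real BPE vocabulary has multi-character tokens and needs more care. Apart from "past_simple", the legal values are hypothetical.

```python
NEG_INF = float("-inf")

# Toy vocabulary: one token per character, plus an end-of-sequence token.
VOCAB = list("abcdefghijklmnopqrstuvwxyz_") + ["<eos>"]

# "past_simple" appears in the text; the other two values are made up.
LEGAL_VALUES = ["past_simple", "past_perfect", "present"]


def compute_mask(prefix, legal_values=LEGAL_VALUES, vocab=VOCAB):
    """Return 0.0 for tokens that keep `prefix` extendable to a legal value, else -inf."""
    allowed = set()
    for value in legal_values:
        if value == prefix:
            allowed.add("<eos>")             # a complete value may stop here
        elif value.startswith(prefix):
            allowed.add(value[len(prefix)])  # the next character of a candidate
    return [0.0 if tok in allowed else NEG_INF for tok in vocab]


def legal_tokens(prefix):
    """Convenience view: the tokens whose mask entry is 0.0."""
    return sorted(tok for tok, m in zip(VOCAB, compute_mask(prefix)) if m == 0.0)


print(legal_tokens(""))         # ['p']
print(legal_tokens("p"))        # ['a', 'r']
print(legal_tokens("past_"))    # ['p', 's']
print(legal_tokens("present"))  # ['<eos>']
```

This version rescans every legal value at every step, which is fine for a handful of values; compiling the same information into an automaton ahead of time is the kind of precomputation the section refers to.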

A note on temperature

Temperature scaling, top-k, top-p, repetition penalties — all of these compose with masking. The mask is applied first (set illegal to \(-\infty\)), then any other modifications (temperature, top-p) operate on the surviving logits. This composition order matters and is derived in 02-logit-masks.md §"composition with sampling".
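A small sketch of that ordering, with temperature and top-p applied after the mask. The function name and the specific top-p cutoff rule are assumptions for illustration; this is not the derivation or the implementation from 02-logit-masks.md.

```python
import math

NEG_INF = float("-inf")


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def masked_distribution(logits, mask, temperature=1.0, top_p=1.0):
    """Mask first, then temperature, then top-p. Requires temperature > 0."""
    # 1. Mask: illegal tokens become -inf before anything else sees them.
    masked = [z + m for z, m in zip(logits, mask)]
    # 2. Temperature: -inf divided by a positive number is still -inf.
    probs = softmax([z / temperature for z in masked])
    # 3. Top-p: keep the most probable tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        if probs[i] == 0.0:
            break  # never readmit a masked token
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


logits = [1.2, 0.3, 4.1, -0.5, 2.0]
mask = [0.0, NEG_INF, 0.0, NEG_INF, NEG_INF]

print(masked_distribution(logits, mask, temperature=1.0, top_p=0.9))
print(masked_distribution(logits, mask, temperature=100.0, top_p=1.0))
```

One intuition for the ordering: if top-p ran on the unmasked distribution, the nucleus could be filled with tokens the grammar forbids, leaving little or nothing legal to sample from. Masking first means every later step only redistributes mass among legal tokens, even at very high temperature.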

Where this is going

By the end of Phase 30, you have:

A LogitMask abstraction (src/ministruct/mask.py) that takes "what's been emitted so far" and returns "which next tokens are legal".
A concrete JSONSchemaMask (and RegexMask for the warm-up) implementing this for the conjugation schema.
A modified decoding loop (src/miniinfer/generate.py) that respects the mask.
An end-to-end CLI: python scripts/conjugate_structured.py "She wrote a book" → valid {"verb": "write", "tense": "past_simple", "person": "3sg"}.
The locked schema in src/ministruct/schemas.py — the public API for Phases 31, 32, 33.
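To make the first three deliverables less abstract, here is a toy end-to-end sketch: a mask interface, one closed-set implementation, and a greedy loop that respects it. Everything here is hypothetical. The class and method names, the character-level vocabulary, the person labels other than "3sg", and the constant "model" are invented for illustration and are not the contents of src/ministruct/mask.py or src/miniinfer/generate.py.

```python
from abc import ABC, abstractmethod

NEG_INF = float("-inf")


class LogitMask(ABC):
    """Hypothetical interface: emitted token ids in, additive mask out."""

    @abstractmethod
    def mask(self, emitted):
        """Return one entry per vocabulary token: 0.0 if legal, -inf if not."""


class ClosedSetMask(LogitMask):
    """Allows only tokens that keep the output a prefix of some legal value."""

    def __init__(self, vocab, legal_values, eos="<eos>"):
        self.vocab, self.legal_values, self.eos = vocab, legal_values, eos

    def mask(self, emitted):
        prefix = "".join(self.vocab[i] for i in emitted)
        allowed = set()
        for value in self.legal_values:
            if value == prefix:
                allowed.add(self.eos)
            elif value.startswith(prefix):
                allowed.add(value[len(prefix)])
        return [0.0 if tok in allowed else NEG_INF for tok in self.vocab]


def greedy_decode(logits_fn, logit_mask, eos_id, max_steps=32):
    """Greedy loop: mask the logits, take the argmax, stop at end-of-sequence."""
    emitted = []
    for _ in range(max_steps):  # hitting max_steps would truncate the output
        masked = [z + m for z, m in zip(logits_fn(emitted), logit_mask.mask(emitted))]
        next_id = max(range(len(masked)), key=masked.__getitem__)
        if masked[next_id] == NEG_INF:
            raise ValueError("mask left no legal token")
        if next_id == eos_id:
            break
        emitted.append(next_id)
    return emitted


VOCAB = ["1", "3", "s", "g", "p", "l", "x", "<eos>"]
PERSONS = ["1sg", "3sg", "3pl"]  # "3sg" is from the text; the others are made up


def toy_logits(emitted):
    """Stand-in model: always prefers the illegal token "x" by a wide margin."""
    return [0.1, 0.5, 0.2, 0.3, 0.4, 0.0, 9.0, 0.05]


ids = greedy_decode(toy_logits, ClosedSetMask(VOCAB, PERSONS), eos_id=VOCAB.index("<eos>"))
decoded = "".join(VOCAB[i] for i in ids)
print(decoded)  # 3pl
```

The stand-in model wants to emit "x" at every step, yet the decoded string is a legal value. Whether it is the right value for a given sentence is a separate question that the mask cannot answer.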

At Phase 31, the tools layer will use this schema to validate tool-call arguments and return values (conjugate(verb, tense, person) takes exactly these types). At Phase 32, the grammar-tutor agent will use the CLI as one of its tools. At Phase 33, the serving layer will expose it over HTTP.

What this phase does NOT cover

Correctness of the model's conjugations. Phase 20 / Phase 28 territory.
Constrained beam search. Beam × mask state interaction is a separate subject; we only do greedy / top-p here.
GBNF parser implementation. We describe the format; reading llama.cpp's parser is a stretch goal.
General-purpose tokenizer-aware DFAs. Our DFA targets the §A13 BPE vocabulary, not arbitrary tokenizers.
Streaming / SSE mask updates. Phase 33 concern.

Next: theory/01-jsonmode-vs-grammar.md — the spectrum of "ask nicely" → "JSON mode" → "GBNF grammar", and what each gives up.