English · Español
03 — Grammar as DFA: Why Production Implementations Precompile¶
🇪🇸 La implementación naïf (validar carácter por carácter por cada token candidato, en cada paso) es correcta pero cuadrática. Los productos reales (Outlines, lm-format-enforcer, llama.cpp GBNF) precompilan la gramática en un autómata finito y precomputan un trie de tokens × estados. El resultado: O(1) por paso. Phase 30 entiende la teoría y la describe, pero no la implementa.
This page explains the production-grade structure so Borja can read Outlines' source code afterward and recognize it. It is theory-only — no lab implements this.
The naive cost¶
The Phase-30 mask is computed per step by iterating over the vocabulary and running the JSON parser one character at a time on each candidate token. Cost per step: O(\(|V| \cdot \bar{L} \cdot \text{parser-step}\)). With \(|V| = 512\), \(\bar{L} = 3\), parser-step ≈ 10 ops: ~15k ops per step. Negligible at small scale.
At LLM scale (\(|V| = 100\text{k}\), \(\bar{L} = 4\), parser-step ≈ 100 ops): 40M ops per step. At decode rates of ~10 tokens/sec on CPU and ~100/sec on GPU, mask construction would dominate. Production needs O(1) per step.
The trick: precompute the mask per state¶
Observe: at any moment, the relevant grammar state is finite (often very small — for our JSON schema, ~10 states). The mask depends only on the current state. So:
- Enumerate all reachable grammar states.
- For each state, for each token in the vocabulary, decide once whether emitting that token leads to a legal state. Store as a bit-array
mask[state][token]. - At decode time, look up
mask[current_state][:]— O(\(|V|\)) memory read, but a single one — and add to logits. - After emitting, look up the next state
next_state[current_state][token]— O(1).
Total preprocessing: O(|states| × \(|V| \cdot \bar{L}\)). Total per-step cost: O(\(|V|\)) (just the mask add). The per-step cost has been driven from O(\(|V| \cdot \bar{L}\)) to O(\(|V|\)) — the constant factor is improved by \(\bar{L}\), but more importantly the parser logic is gone from the hot path. With bit-packing the mask, you can drive it even lower.
State enumeration¶
For a JSON schema with n keys, the state set is roughly the product of:
- Position in the JSON skeleton: object-open, expecting-key, in-key, after-key, expecting-value, in-value, after-value, comma-or-close.
- Which keys have been emitted (subset of
{key_1, ..., key_n}). - Which key is currently being valued (one of
{key_1, ..., key_n}). - Within-value progress (depends on the value type: enum, string, integer).
Naively this is \(2^n \times n \times (\text{value states})\) states. For our 4-key conjugation schema (verb, tense, person, spanish?) with closed enums on three of the four keys, that's roughly \(2^4 \times 4 \times 30 \approx 2000\) states (loose upper bound). Tractable; for §A13's closed enums it can be cut dramatically by pruning. Phase 30 documents the count; the labs only build the naive validator-per-token implementation.
For a CFG (Level 4), the "state" is a stack snapshot. The state space can be infinite (think nested expressions). Production implementations handle this via pushdown automata + careful caching of stack-top transitions — a substantial implementation effort. Phase 30 stays at JSON-schema (DFA-equivalent) precisely to avoid this.
Token-level vs character-level: why the trie¶
A subtle point. The grammar's transitions are defined on characters; the model emits tokens (which are byte sequences). Bridging this requires asking, for each (state, token) pair: "does decoding this token from this state lead to a legal new state?".
Production implementations build a character-level trie of the vocabulary, and intersect it with the grammar DFA:
For each grammar state s:
For each branch of the vocabulary trie:
Walk the branch one character at a time, simultaneously walking the DFA from s.
If at any point the DFA rejects, prune this trie subtree (no token starting with this prefix is legal in state s).
If the trie branch ends at a token boundary AND the DFA is in a legal state, mark (s, token) legal.
This pruning is what makes the precomputation tractable on huge vocabularies. Without it, you'd be checking \(|V|\) tokens × \(|\text{states}|\) states × character-by-character — which doesn't scale.
Outlines' implementation of this lives in their outlines.fsm.guide module. Reading it after this phase is a strongly recommended stretch goal — it's elegant and worth the time.
The lookahead problem¶
A trickier issue: tokens that don't fit cleanly into grammar transitions. Suppose the grammar expects "verb": followed by a value. The tokenizer might have ","tense":" as a single token. This token would close one key-value pair and open the next. The naive mask says "is this legal to emit at this state?" — but the question is really "is there any path through grammar states that this token represents?".
Most production implementations handle this correctly by walking the token's character expansion through the DFA. Some buggy ones don't, and you get the "model sometimes emits a token that locally looks legal but globally desyncs the parser" failure mode. Outlines handles it correctly; some early lm-format-enforcer versions did not.
Phase 30's naive implementation handles it correctly because it walks character-by-character per candidate token. The production-grade precomputed version handles it correctly because it prunes during trie construction. Either way, you have to think about it.
Why we don't implement this¶
Three reasons:
- Pedagogical scope. Phase 30 is about the mask. Adding "and we built a trie-intersection compiler" doubles the phase's surface area for marginal teaching gain.
- Test coverage. The naive implementation is short enough to verify by exhaustive test (every state, every token). The precomputed implementation requires the same verification plus "did the precompilation produce the right table?". Double the surface.
- Building before abstracting. Per CLAUDE.md §0.4, we build the slow correct version first. If a later phase needs speed (Phase 33 serving, maybe), we revisit and add the precomputed cache then. Until then, slow + correct wins.
Reading guide: what to look at after Phase 30¶
If Borja wants to see this in production code, read in this order:
- Outlines
outlines.fsm.guide— the trie-intersection algorithm in pure Python. Most readable production-grade implementation. - llama.cpp's
grammar.cpp— the GBNF parser and the per-step mask computation. Lower-level but instructive. - vLLM's
LogitsProcessorinterface — how a serving system consumes a mask source. Phase 33 of this curriculum touches this. - The
lm-format-enforcerREADME — a different design choice (token-by-token state machine) — useful contrast.
The notes from this reading would live in docs/phase-30-structured-generation/notes/outlines-source.md (created at phase open if Borja does the stretch goal).
A practical aside: caching the mask¶
Even within the naive implementation, you can cache: if the grammar state hasn't changed step-to-step (the model is mid-string, still emitting characters of a value), the mask is the same. Cache by state hash, look up if hit. Phase 30 may implement this if Borja wants to take the lab further; the DoD does not require it.
What "compile time" means for grammars¶
The grammar compilation is per schema, not per request. You compile once at server boot:
mask_table = compile_schema(conjugation_schema, tokenizer) # heavy: minutes maybe
# ... serve forever ...
for request in stream:
masks = mask_table.run(generated_tokens) # cheap: O(1) per step
sampled = sample(logits, mask=masks.current())
The compile step is amortized across all requests that share the schema. For agent frameworks where every tool has its own schema, this means you compile N tables at startup. Memory: \(|V| \times |\text{states per schema}|\) bits per schema. For 100k vocab × 100 states, that's 10 Mbits = 1.25 MB per schema. Cheap.
Where this connects forward¶
- Phase 31 (tools/MCP). Tool argument schemas are JSON schemas.
conjugate(verb, tense, person),lookup_irregular_verb(verb),lookup_spanish(english_form)all take typed arguments. Constrained decoding makes tool calls parse by construction. Phase 31 wires this up. - Phase 32 (agents). The grammar tutor's output format uses Phase 30's
JSONSchemaMask. Every step the agent takes ends in a constrained-decode call that emits a conjugation triple. - Phase 33 (serving). A real serving stack exposes
response_formatas an API parameter. Implementing this requires (a) aLogitsProcessorplug-in point in the decoding loop, (b) per-request mask state. Phase 33 design discusses this.
What this phase does NOT cover¶
- Precompiled token-trie × DFA intersection. Described, not coded. Outlines does it; we read their source as a stretch goal.
- Pushdown automata for CFGs. GBNF / lark grammars need this; out of scope.
- Constrained-decode-aware fine-tuning. Teaching the model to natively emit the schema so \(\mathrm{KL}(p_\text{constr} \| p) \to 0\) is a Phase 28 (LoRA) idea.
Done with theory. On to lab/00-regex-mask.md — the warm-up.