Skip to content

English · Español

04 — CFG vs Regex vs JSON-Schema: when each mask is correct

🇪🇸 Tres maneras de restringir lo que un modelo puede emitir: una gramática libre de contexto (CFG), una expresión regular, o una máscara derivada de un JSON-Schema. Las tres reducen el espacio de salida, pero no son intercambiables. Aquí derivamos cuál es matemáticamente correcta para cada caso y por qué.

Anchors: theory/01-jsonmode-vs-grammar.md, theory/02-logit-masks.md, theory/03-grammar-as-dfa.md.


The three mask families

A decoding mask is a function mask: (state) → set[token] that returns the set of tokens allowed at the current decoding state. Outside that set, logits are set to -inf before softmax.

The three families differ in what state they track:

Family State Expressivity Compilation cost Runtime cost
Regex DFA state index q ∈ Q (a single integer) regular languages NFA→DFA: O(2^ R
CFG parser stack + current rule context-free languages LL(k) / Earley: O( G
JSON-Schema structural state of partial JSON (object/array/value) + schema position structurally context-free but typed precompile the JSON automaton at schema load O(1)-O(log d) per token, d = JSON depth

The taxonomy is real: regex is a strict subset of CFG, which is a strict subset of context-sensitive languages. JSON-Schema is not context-free in general ($ref resolution and recursive schemas need a stack-of-stacks), but most production schemas are.

When each is correct

Case 1: "the model must emit a 5-digit zip code"

The grammar is [0-9]{5}. This is regular — a 6-state DFA accepts exactly this language.

  • Regex → ✅ correct, fast.
  • CFG → also correct but you're paying for parser-stack overhead that's never used. Wasteful.
  • JSON-Schema → only correct if the zip code is inside a JSON object ({"zip": "12345"}). For a raw token-stream emission, JSON-Schema is the wrong tool.

Case 2: "the model must emit a balanced-parenthesis expression"

Expr → "(" Expr* ")" | "atom". This is context-free, not regular — the language is "all balanced strings", which the pumping lemma proves a regex cannot match.

  • Regex → ❌ incorrect. There is no DFA that accepts arbitrarily nested balanced parens. A regex limited to depth-d works, but it's no longer the same language.
  • CFG → ✅ correct.
  • JSON-Schema → wrong tool; not a JSON structure.

Case 3: "the model must emit a JSON object matching {type: object, properties: {verb: {type: string, enum: [...]}, ...}}"

The language is "structurally valid JSON conforming to schema X". This is structurally context-free — at any point the decoder must know "am I expecting a key, a value, a comma, a closing brace?".

  • Regex → ❌ incorrect. JSON's bracket-balancing is not regular. A regex can validate fragments, not the whole document.
  • CFG → ✅ correct, but writing the CFG by hand for every JSON-Schema variant is tedious.
  • JSON-Schema-derived mask → ✅ correct and convenient. The schema is the structure; the mask is generated from the schema. Production libraries (Outlines, lm-format-enforcer, Guidance) take this path.

Case 4: "the model must emit <input>verb_form</input> where verb_form is one of 600 conjugated forms"

A regex with 600 alternations is correct but blows up the DFA. A CFG with a single terminal "verb_form" referring to a lookup table is cleaner. A JSON-Schema with enum: [...600 strings...] works if the parent is JSON. For our §A13 grammar tutor, this is the exact pattern.

Why expressivity matters even for non-pathological cases

A common production mistake: use a regex when you needed a CFG. The symptom is "the model emits invalid output 5% of the time" — that 5% is the cases the regex couldn't constrain (e.g., truncated nested objects). The regex passes on the typical case; it fails on the structural-boundary case. The CFG mask would have caught it.

Conversely, a common over-engineering mistake: a full CFG when a regex suffices. The 600-string enum in case 4 above — if every form is a fixed string, a flat regex verb_1|verb_2|...|verb_600 is technically regular and the DFA-state machine works fine. The CFG buys you nothing.

The decoder integration: same logit-mask interface

All three families plug into the same per-step decoder loop. Pseudocode (theory/02-logit-masks.md derives it formally):

def constrained_decode(model, mask_engine, prompt):
    state = mask_engine.start_state()
    tokens = list(tokenize(prompt))
    while not state.is_terminal():
        logits = model.forward(tokens)[-1]          # (vocab_size,)
        allowed = mask_engine.allowed_tokens(state) # set[int]
        logits[~mask_for(allowed)] = -inf
        next_token = sample(logits)
        state = mask_engine.advance(state, next_token)
        tokens.append(next_token)
    return decode(tokens)

The only thing that changes between mask families is mask_engine. The decoder is invariant. This is the abstraction that makes constrained-decoding libraries portable across grammar formats.

When each is fast

For Mini-GPT decoding at ~250 tok/s, the per-token mask-construction budget is ~4 ms. Measurements on Borja's i5-8250U (predicted, to be confirmed in lab/03-mask-overhead.md):

Mask family Per-token mask cost Overhead vs unmasked decode
Precompiled regex DFA ≈ 0.01 ms < 1%
LL(1) CFG parser ≈ 0.5 ms ≈ 10%
JSON-Schema (precompiled DFA) ≈ 0.05 ms ≈ 1%
JSON-Schema (re-derived per token) ≈ 5 ms ≈ 200% (dominates)

The "precompiled vs per-token" axis matters more than the family. lm-format-enforcer and Outlines both precompile the schema once at load time and cache the DFA. A naive implementation that re-evaluates the schema at each step is the wrong implementation, regardless of mask family.

The right choice for the §A13 grammar tutor

Phase 32's grammar tutor emits structured corrections in JSON:

{
  "original": "He goed to school",
  "verb": "go",
  "tense": "past simple",
  "person": "3rd singular",
  "correct_form": "went",
  "spanish": "fue"
}

This is JSON-Schema's natural home. The schema:

{
  "type": "object",
  "required": ["original", "verb", "tense", "person", "correct_form", "spanish"],
  "properties": {
    "verb":         {"enum": ["work", "play", "walk", "talk", "listen", "watch",
                              "study", "finish", "start", "look", "want", "like",
                              "be", "have", "do", "go", "come", "see", "eat", "write"]},
    "tense":        {"enum": ["infinitive", "present simple", "past simple",
                              "past participle", "simple future"]},
    "person":       {"enum": ["1st singular", "2nd singular", "3rd singular"]},
    "correct_form": {"type": "string", "maxLength": 30},
    "original":     {"type": "string", "maxLength": 200},
    "spanish":      {"type": "string", "maxLength": 30}
  }
}

The enum constraints carry the §A13 microscopic-scope guarantee into the mask. The model cannot emit "running" as a verb value — it's not in the enum, the mask blocks the prefix that would lead there.

This is the load-bearing reason structured generation matters for our agent. Without the schema mask, Phase 32's tutor would occasionally hallucinate a verb outside §A13 ("she dranked the water") and lose the §A13 invariant. With the mask, that's mechanically impossible.

Trade-off summary

Trait Regex CFG JSON-Schema
Expressivity ceiling regular context-free structural JSON (effectively CFG with $ref)
Best for flat tokens, enums, IDs nested non-JSON DSLs tool-call args, structured output
Compile cost one-time, small one-time, moderate one-time, moderate
Runtime cost (precompiled) O(1) O(parser stack depth) O(1) per token, O(d) per state transition
Library examples re, flashtext Outlines, Lark, Guidance grammars lm-format-enforcer, Outlines (JSON mode), llama.cpp grammars
Failure mode accepts pathological inputs outside the regular fragment of the target over-engineering for flat data wrong tool when output is not JSON

Citations

  • Willard & Louf, Efficient Guided Generation for Large Language Models (Outlines paper), 2023. arXiv:2307.09702 — the regex/CFG-to-FSM compilation pipeline.
  • Beurer-Kellner, Fischer, Vechev, lm-format-enforcer (GitHub) — the JSON-Schema-driven prefix automaton.
  • Hopcroft, Motwani, Ullman, Introduction to Automata Theory (3rd ed.) — Chomsky hierarchy as the formal anchor.

One-paragraph recap

Regex masks are correct for regular outputs (zip codes, flat enums); CFG masks are correct for context-free outputs (nested DSLs, balanced delimiters); JSON-Schema masks are correct for typed JSON outputs and are the natural choice for tool-call arguments. The three plug into the same decoder via a logit-mask interface; the differentiator is which automaton tracks the parsing state. Choose by expressivity needed, not by what's easiest to write — using a regex on a context-free language is a silently-incorrect mask that fails on structural-boundary cases. For the §A13 grammar tutor, JSON-Schema with enum constraints carries the microscopic-scope invariant into the decoding loop mechanically.

Next: lab/02-end-to-end-conjugate.md to wire the JSON-Schema mask into Mini-GPT.