01 — The Spectrum: "Ask Nicely" → "JSON Mode" → "Grammar-Constrained"

Summary: there are three levels of guarantee for structured output: asking for it in the prompt (no guarantee), JSON mode (guarantees the output parses, but not its shape), and grammar-constrained decoding (guarantees the exact shape). Only the third is a type guarantee in the strong sense.

This page situates the technique of Phase 30 (logit masking) inside the broader landscape of techniques people use to get structured output from LLMs. The point is to know which level of guarantee each gives you, and why we go straight to the strongest one.


Level 0: ask nicely (zero guarantee)

You are a helpful grammar tutor. Reply ONLY in JSON with the keys 'verb', 'tense', 'person'.
Do not include any prose. Do not include markdown fences.

The model usually complies. How often it complies scales with model size, training data, and prompt engineering effort, but the failure rate is never zero. The failures are correlated with adversarial inputs, unusual phrasings, and edge cases: exactly the cases your downstream code is least prepared for.

Guarantee: none. Cost: zero implementation effort. Use when: prototyping; the consumer can tolerate retries.
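To make the "zero guarantee" concrete, here is a minimal sketch of what a downstream parser sees when the model almost complies. The helper `try_parse` and the three sample strings are illustrative assumptions, not project code or real model output.

```python
import json


def try_parse(raw: str):
    """Return the parsed JSON value, or None if `raw` is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


# Hypothetical model outputs (made up for illustration).
clean = '{"verb": "go", "tense": "past_simple", "person": "3sg"}'
chatty = "Sure! Here is the JSON:\n" + clean   # prose before the object
truncated = clean[:-1]                         # generation stopped one char early

for label, raw in [("clean", clean), ("chatty", chatty), ("truncated", truncated)]:
    print(label, "->", "parsed" if try_parse(raw) is not None else "rejected")
```

Only the first string parses; the other two differ from it by a polite preamble and a single missing character.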

Level 1: post-hoc parsing + retry

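import json

# model, validates_against_schema, ValidationError, MAX_RETRIES and GiveUp
# are assumed to be defined elsewhere.
def generate_with_retry(prompt, schema):
    for attempt in range(MAX_RETRIES):
        out = model.generate(prompt)
        try:
            parsed = json.loads(out)  # raises json.JSONDecodeError on bad syntax
            if validates_against_schema(parsed, schema):
                return parsed
        except (json.JSONDecodeError, ValidationError):
            continue  # malformed or invalid output: try again
    raise GiveUp()  # every attempt failed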

This is what most production stacks did until ~2023. It works, but the arithmetic is only as good as its independence assumption. If attempts fail independently, the overall failure rate is the per-attempt failure rate raised to the power MAX_RETRIES. With MAX_RETRIES=3 and a 99% per-attempt success rate, that is \((1 - 0.99)^3 = 10^{-6}\): six nines. With MAX_RETRIES=3 and a 90% per-attempt success rate, it is \((1 - 0.90)^3 = 10^{-3}\): three nines. But retries are not independent: the same model on the same prompt tends to fail the same way, so the real failure rate can be far higher than these figures suggest.

Guarantee: probabilistic; depends on per-attempt rate and MAX_RETRIES. Cost: N× latency in the failure path; partial work discarded. Use when: schema is loose; latency budget allows.
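The retry arithmetic can be checked with a short sketch. The first function is the independent-attempts formula from the text. The second is an assumed toy mixture model (not from the source) in which a fixed fraction of prompts fails on every attempt, which is one simple way to represent correlated failures.

```python
def independent_failure_rate(per_attempt_success: float, max_retries: int) -> float:
    """Overall failure rate if every attempt fails independently."""
    return (1.0 - per_attempt_success) ** max_retries


def correlated_failure_rate(hard_fraction: float, easy_success: float,
                            max_retries: int) -> float:
    """Overall failure rate under an assumed mixture model:
    `hard_fraction` of prompts fail on every attempt, and the remaining
    prompts fail independently with per-attempt rate 1 - easy_success."""
    easy_failure = (1.0 - easy_success) ** max_retries
    return hard_fraction + (1.0 - hard_fraction) * easy_failure


print(independent_failure_rate(0.99, 3))       # about 1e-6
print(independent_failure_rate(0.90, 3))       # about 1e-3
print(correlated_failure_rate(0.01, 1.0, 3))   # stays at 0.01
```

In the last call the per-attempt success rate is still 99% overall, yet three retries buy nothing, because all the failures sit on the same 1% of prompts.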

Level 2: "JSON mode" (provider-side mask)

OpenAI's response_format: {type: "json_object"} and equivalents. The provider applies a grammar — some grammar — internally. Output is guaranteed to be syntactically valid JSON. But:

  • It is not guaranteed to match your schema. You still need post-hoc schema validation.
  • The exact grammar is provider-controlled and may change.
  • It typically guarantees a JSON object but cannot express constraints on individual fields (e.g., "this field must be one of these 20 verb enum values").

Guarantee: JSON parses; schema match is still your problem. Cost: minimal latency overhead (provider already paid for the mask). Use when: the schema is loose enough that "parses as JSON" is enough.
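The gap between "parses as JSON" and "matches the schema" can be seen in a minimal sketch. The reduced two-field schema `ALLOWED` and the checker `matches_schema` are hypothetical stand-ins, not the project's validator.

```python
import json

# Hypothetical reduced schema: two required, enum-valued fields.
ALLOWED = {
    "verb": {"go", "be", "eat"},
    "tense": {"present_simple", "past_simple"},
}


def matches_schema(obj: object) -> bool:
    """True if obj is a dict with exactly the required keys and legal enum values."""
    if not isinstance(obj, dict):
        return False
    if set(obj) != set(ALLOWED):
        return False
    return all(isinstance(obj[k], str) and obj[k] in ALLOWED[k] for k in ALLOWED)


# Each string below is syntactically valid JSON, which is all JSON mode promises.
samples = [
    '{"verb": "go", "tense": "past_simple"}',    # valid JSON, valid schema
    '{"verb": "goed", "tense": "past_simple"}',  # valid JSON, illegal enum value
    '{"verb": "go"}',                            # valid JSON, missing field
]
for raw in samples:
    print(raw, "->", matches_schema(json.loads(raw)))
```

All three samples survive `json.loads`; only the first survives the schema check, so the post-hoc validation step from Level 1 is still needed.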

Level 3: schema-constrained ("structured outputs")

OpenAI's structured outputs (response_format: {type: "json_schema", json_schema: {...}}), Anthropic's tool-use forcing, Outlines' outlines.generate.json(model, schema). These compile your schema into a token-level mask. The output is guaranteed to validate against the schema: every field present, every type correct, every enum value drawn from the declared set.

Guarantee: parse + schema validation, by construction. Cost: some compile-time cost for the schema → automaton step; per-step mask lookup is O(1) after that. Use when: you need a hard contract.

This is the level Phase 30 implements (manually, in NumPy, for one specific schema).
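As a rough illustration of what a token-level mask does at one decoding step, the sketch below sets the logits of disallowed tokens to negative infinity before the softmax, so they receive probability exactly zero. Plain Python lists stand in for NumPy arrays to keep the example dependency-free. The function name and the four-token example are assumptions, not the Phase 30 code.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax over `logits` restricted to positions where `allowed` is True."""
    if not any(allowed):
        raise ValueError("mask allows no tokens")
    # Disallowed tokens get -inf, which exponentiates to exactly 0.0.
    masked = [x if ok else -math.inf for x, ok in zip(logits, allowed)]
    peak = max(masked)                       # subtract the max for stability
    weights = [math.exp(x - peak) for x in masked]
    total = sum(weights)
    return [w / total for w in weights]


# Toy step: token 3 has the highest raw logit but is not legal in this state.
logits = [2.0, 1.0, 0.5, 3.0]
allowed = [True, False, True, False]
probs = masked_softmax(logits, allowed)
print(probs)
```

Whatever the model prefers, sampling from `probs` can only produce a token the mask permits, which is where the "by construction" guarantee comes from.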

Level 4: grammar-constrained (GBNF and friends)

llama.cpp's GBNF grammars, lark grammars, EBNF. You write a context-free grammar; the implementation compiles it to a pushdown automaton; logit masks are derived from the current parser state. This is the most expressive level — you can constrain to "valid English sentence", "valid SQL", "valid SMTP responses", anything you can write as a CFG.

Guarantee: output is a member of the grammar's language. Cost: compilation is non-trivial; the automaton can be large; tokenizer-aware compilation is hard. Use when: the schema is too complex for JSON-schema and too important to leave loose.

We describe this level in 03-grammar-as-dfa.md. We do not implement it.

What we choose for Phase 30

Level 3, hand-built, NumPy-only. The conjugation schema is small enough that we can implement the mask as a Python state machine:

  • Schema (frozen at phase close in src/ministruct/schemas.py):
    {
      "verb":   enum-of-20  (work, play, walk, talk, listen, watch, study,
                             finish, start, look, want, like,
                             be, have, do, go, come, see, eat, write),
      "tense":  enum-of-5   (infinitive, present_simple, past_simple,
                             past_participle, simple_future),
      "person": enum-of-3   (1sg, 2sg, 3sg),
      "spanish": optional string  (Spanish translation of the conjugated form)
    }
    
  • States: expecting-{, expecting-quote-for-key, expecting-key-name-from-{verb,tense,person,spanish}-minus-already-seen, expecting-quote-close, expecting-colon, expecting-value-from-the-key's-enum, expecting-comma-or-brace-close.
  • Transitions: each step, given the state and the token being emitted, compute the new state. Only tokens that lead to a valid new state are unmasked.
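The states and transitions listed above can be sketched as a character-level machine. This is a simplified illustration under stated assumptions: a reduced two-field schema, every field required, no whitespace, and no optional "spanish" string. The class and function names are hypothetical and this is not the code in src/ministruct/schemas.py.

```python
SCHEMA = {"verb": ["go", "be", "eat"], "person": ["1sg", "2sg", "3sg"]}  # reduced


def next_chars(options, partial):
    """Characters extending `partial` toward an option; '"' closes a complete one."""
    out = {o[len(partial)] for o in options
           if o.startswith(partial) and len(o) > len(partial)}
    if partial in options:
        out.add('"')
    return out


class EnumObjectMachine:
    def __init__(self, schema):
        self.schema = schema
        self.state = "open_brace"
        self.seen = []        # keys whose values are already complete
        self.key = None       # key currently being filled
        self.partial = ""     # characters of the current key or value so far

    def allowed(self):
        """Set of characters that lead to a valid next state."""
        s = self.state
        if s == "open_brace":
            return {"{"}
        if s in ("key_quote", "value_quote"):
            return {'"'}
        if s == "key_name":   # keys minus already-seen
            remaining = [k for k in self.schema if k not in self.seen]
            return next_chars(remaining, self.partial)
        if s == "colon":
            return {":"}
        if s == "value":      # value from the key's enum
            return next_chars(self.schema[self.key], self.partial)
        if s == "comma_or_close":
            return {","} if len(self.seen) < len(self.schema) else {"}"}
        return set()          # "done": nothing may follow

    def step(self, ch):
        if ch not in self.allowed():
            raise ValueError(f"{ch!r} not allowed in state {self.state}")
        s = self.state
        if s == "open_brace":
            self.state = "key_quote"
        elif s == "key_quote":
            self.state, self.partial = "key_name", ""
        elif s == "key_name":
            if ch == '"':
                self.key, self.state = self.partial, "colon"
            else:
                self.partial += ch
        elif s == "colon":
            self.state = "value_quote"
        elif s == "value_quote":
            self.state, self.partial = "value", ""
        elif s == "value":
            if ch == '"':
                self.seen.append(self.key)
                self.state = "comma_or_close"
            else:
                self.partial += ch
        elif s == "comma_or_close":
            self.state = "key_quote" if ch == "," else "done"


def accepts(text, schema=SCHEMA):
    """True if `text` drives the machine from its start state to 'done'."""
    machine = EnumObjectMachine(schema)
    try:
        for ch in text:
            machine.step(ch)
    except ValueError:
        return False
    return machine.state == "done"


print(accepts('{"verb":"go","person":"3sg"}'))
print(accepts('{"verb":"goed","person":"3sg"}'))
```

In a decoder, `allowed()` would be consulted before each emission rather than after the fact, so an illegal string is never produced in the first place.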

We write this by hand because (a) it's small, (b) seeing the state machine as code is more instructive than seeing it as a generated artifact, (c) the bugs we'll hit (token-spanning, tokenizer-grammar mismatch) are the bugs production implementations also hit, and (d) we cannot import outlines without violating CLAUDE.md §0.4 ("build before abstracting").

The token-vs-character problem (preview)

A subtle point we will hit in lab/01: the model emits tokens, but the grammar reasons about characters. The BPE vocabulary contains tokens like "verb":, {", ",". A single token can span multiple grammar transitions. The mask must check: "if I emit this token, where does the grammar parser end up?". This requires simulating the parse of the candidate token's character expansion.

In our small vocabulary (≤ 512 tokens) this is cheap: for each token, decode it, advance the parser, check legality, mark mask. Total work per step: \(O(|V| \cdot \text{avg-token-length})\). For 512 tokens × 3 chars per token, that's ~1500 operations per step. Negligible.
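The per-token simulation can be sketched with a toy setup. Here a prefix check against a tiny enumerated language stands in for the character-level parser, and the vocabulary is hand-picked to include multi-character tokens that span several grammar transitions. `LEGAL_OUTPUTS`, `VOCAB`, and the function names are all illustrative assumptions, not the lab/01 code.

```python
# Toy closed language: every complete legal output, enumerated.
LEGAL_OUTPUTS = ['{"verb":"go"}', '{"verb":"be"}', '{"verb":"eat"}']

# Hypothetical BPE-like vocabulary with tokens that cross grammar boundaries.
VOCAB = ['{', '}', '"', ':', '{"', '"verb":', '":"', 'verb',
         'go', 'be', 'eat', 'e', 'at', '"}', 'x']


def is_legal_prefix(text: str) -> bool:
    """Stand-in for the parser: can `text` still grow into a legal output?"""
    return any(output.startswith(text) for output in LEGAL_OUTPUTS)


def token_mask(emitted: str, vocab=VOCAB):
    """One bool per token: True if emitting the token's full character
    expansion after `emitted` keeps the text inside the language."""
    return [is_legal_prefix(emitted + token) for token in vocab]


def allowed_tokens(emitted: str, vocab=VOCAB):
    """The unmasked tokens themselves, in vocabulary order."""
    return [tok for tok, ok in zip(vocab, token_mask(emitted, vocab)) if ok]


print(allowed_tokens(''))
print(allowed_tokens('{'))
print(allowed_tokens('{"verb":"'))
print(allowed_tokens('{"verb":"go'))
```

Note that after `{` both the single quote character and the whole `"verb":` token are legal: the second one crosses four grammar states in a single emission, which is exactly the token-spanning case described above.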

In a real LLM vocabulary (50k–200k tokens), this is too expensive. Production implementations precompute a trie of tokens intersected with the grammar's DFA, giving O(1) per-step lookup. We describe this in 03-grammar-as-dfa.md.

The closed-enum advantage

§A13's universe is closed: every legal value of every field is one of a small enumerated set. There are no free-form strings in the required fields, no integers in unbounded ranges. This is the easiest case in all of structured generation — easier than what most production stacks have to handle.

Concretely, the entire state-space of legal completions is bounded by:

states_count  ≈  (#fields_open_set) × (#partial_value_states_per_field)
              ≤  4 × 30  =  120 states

Compare to a typical real-world JSON schema with free-form name: string and description: string fields, where the state machine has to handle arbitrary text. Our universe lets us be exhaustive. We pre-compute every transition.

Comparison table

| Level | Guarantee | Implementation cost | Latency overhead | Phase 30 implements |
|---|---|---|---|---|
| 0: Ask nicely | None | Zero | Zero | No |
| 1: Retry + validate | Probabilistic | Tiny | N× on failure | No |
| 2: JSON mode | Parses as JSON | Provider's | Negligible | No |
| 3: Schema-constrained | Matches schema | Moderate | Per-step mask | Yes |
| 4: Grammar-constrained | In grammar's language | High | Per-step mask + parser | Described only |

Where this leaves the agent (Phase 32)

The agent will use Level 3 (Phase 30's JSONSchemaMask) for its output. It does not need Level 4. The conjugation schema is fixed, simple, and known in advance. We pay the cost of writing a JSON-schema-aware mask once, and the rest of the project consumes it.

If we ever wanted the agent to emit a corrected English sentence directly (rather than just identify the correction in structured form), we'd want Level 4 — an English-sentence grammar. We do not need that. The agent emits the structured correction ({"verb": "go", "tense": "past_simple", "person": "3sg"} for the input "Yesterday he goed home"); the natural-language sentence rendering is done by a separate templating step that consumes the structured output. No grammar needed there because the structure dictates the template.

What this page does NOT cover

  • The math of why masking works — see 02-logit-masks.md.
  • The implementation of the DFA — see 03-grammar-as-dfa.md.
  • How tokenizers and grammars actually intersect at the token-trie level — described in 03-grammar-as-dfa.md, not implemented in Phase 30.

Next: theory/02-logit-masks.md — the derivation and the math.