Break 00 — Disable the JSON-Schema mask on a structured-output task¶
We ask Mini-GPT to return the Spanish past simple of work in JSON. With the JSON-Schema mask active, the output is always parseable. Without it, the output alternates between prose, markdown, incomplete sentences, and JSON with trailing commas. We count the percentage of failed parses over 50 samples; that number is the lesson.
This /break exercise targets the mechanical invariance structured generation provides. The bug is one boolean; the failure mode is a measurable parse-failure rate.
Anchors: theory/01-jsonmode-vs-grammar.md, theory/04-cfg-vs-regex-vs-jsonschema.md, .claude/commands/break.md.
Hypothesis¶
The learner predicts: "When I ask Mini-GPT 'give me the Spanish past simple of work in JSON', with the schema mask active, every output parses as valid JSON matching the schema. Without the mask, the model — trained on a tiny natural-language corpus, not a JSON-heavy one — emits a mix of prose, partial JSON, and JSON-with-trailing-commas. The parse failure rate jumps from 0% to ~40-70%."
The break¶
In src/minimask/conjugate_cli.py:
 def conjugate(verb: str, tense: str, person: str) -> dict:
     prompt = build_prompt(verb, tense, person)
-    output = generate_with_mask(model, prompt, schema=CONJUGATE_SCHEMA)
+    output = generate_freeform(model, prompt)  # /break: no schema mask
     return json.loads(output)  # will raise on most outputs
One swap: generate_with_mask → generate_freeform. The json.loads call stays — and that's where the failures will surface.
Predict, then run¶
The prompt template:
Return ONLY JSON: {"verb": "work", "tense": "past simple", "person": "1st singular",
"english": "<form>", "spanish": "<form>"}
Verb: work
Tense: past simple
Person: 1st singular
JSON:
Mini-GPT's response with the mask: always {"verb": "work", "tense": "past simple", "person": "1st singular", "english": "worked", "spanish": "trabajé"}. Parseable on every sample.
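For reference, a schema matching the prompt template might look like the sketch below. The project's real CONJUGATE_SCHEMA is not shown in this exercise, so EXAMPLE_SCHEMA and the matches_schema helper are hypothetical stand-ins. The check uses only the standard library and covers just the subset of JSON Schema that appears here: required keys, string values, and no extra keys.

```python
import json

# Hypothetical stand-in for CONJUGATE_SCHEMA (the real schema is not shown here).
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "verb": {"type": "string"},
        "tense": {"type": "string"},
        "person": {"type": "string"},
        "english": {"type": "string"},
        "spanish": {"type": "string"},
    },
    "required": ["verb", "tense", "person", "english", "spanish"],
    "additionalProperties": False,
}


def matches_schema(obj: object, schema: dict = EXAMPLE_SCHEMA) -> bool:
    """Minimal check: a dict with exactly the required string-valued keys.

    Handles only the schema features used above, not the full JSON Schema spec.
    """
    if not isinstance(obj, dict):
        return False
    required = set(schema["required"])
    if not required <= set(obj):
        return False  # a required field is missing
    extra = set(obj) - set(schema["properties"])
    if schema.get("additionalProperties") is False and extra:
        return False  # an unexpected field is present
    return all(isinstance(obj[key], str) for key in required)


masked_output = (
    '{"verb": "work", "tense": "past simple", "person": "1st singular", '
    '"english": "worked", "spanish": "trabajé"}'
)
print(matches_schema(json.loads(masked_output)))
```

Note that a check like this runs after generation. The mask enforces the same contract during generation, which is why its failure rate is zero by construction rather than by measurement.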
Without the mask, expected behaviors over 50 samples (temperature 0.7):
- ~10 samples: valid JSON, correct (the model does know JSON syntax from minimal corpus exposure).
- ~15 samples: valid JSON, but a field is missing, extra, or renamed (for example, "spanish" written as "español").
- ~10 samples: JSON with trailing comma (...,}), rejected by json.loads but recoverable.
- ~10 samples: prose / markdown explanation instead of JSON ("The past simple of 'work' is 'worked' (yo trabajé).").
- ~5 samples: truncated JSON ({"verb": "work").
Predictions¶
- Parse failure rate (raw json.loads): ≈ 60-70%.
- Schema-validation failure rate (JSON valid but wrong keys): an additional ~10% on top of the parse failures.
- Total useful samples: ~10-20%.
- With the mask: parse failures 0% by construction; schema-validation failures 0% by construction.
Write your predictions in learners/borja/phase-30/notes/breaks.md before running.
Observe¶
Run the conjugate CLI 50 times with each mode:
# Broken (no mask)
for i in $(seq 1 50); do just exp 30-conjugate --mode freeform --seed $i; done \
> experiments/30-freeform/results.jsonl
# Baseline (mask on)
for i in $(seq 1 50); do just exp 30-conjugate --mode masked --seed $i; done \
> experiments/30-masked/results.jsonl
Diagnostics:
- Bar chart: parse_failures / 50 for each mode. Expected: 30/50 freeform vs 0/50 masked.
- Categorize the 30 freeform failures: trailing comma, prose, missing field, truncation. Stacked bar.
- Side-by-side first 10 outputs of each mode. The contrast is the lesson.
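The first diagnostic can be produced without a plotting library. This sketch assumes you already have the raw output strings for each mode in memory; how the harness stores them in results.jsonl is not specified here, so loading is left out. The parse_failures and text_bar helpers are hypothetical, and the sample strings are made up for illustration, not experiment results.

```python
import json


def parse_failures(raw_outputs: list[str]) -> int:
    """Count the outputs that json.loads rejects."""
    failures = 0
    for raw in raw_outputs:
        try:
            json.loads(raw)
        except json.JSONDecodeError:
            failures += 1
    return failures


def text_bar(label: str, failures: int, total: int, width: int = 50) -> str:
    """Render one row of a text bar chart: label, bar, then failures/total."""
    filled = round(width * failures / total) if total else 0
    return f"{label:<9}{'#' * filled:<{width}} {failures}/{total}"


# Made-up stand-ins for the real per-mode outputs.
samples = {
    "freeform": ['{"verb": "work",}', "The past simple is worked.", '{"verb": "work"}'],
    "masked": ['{"verb": "work"}', '{"verb": "work"}', '{"verb": "work"}'],
}
for mode, outputs in samples.items():
    print(text_bar(mode, parse_failures(outputs), len(outputs), width=30))
```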
Symptom Borja will see¶
- json.JSONDecodeError raised on a majority of freeform samples.
- The pretty outputs (when they parse) often have field-name drift: spanish becomes español or es.
- The masked outputs are boringly identical in structure, which is exactly the point: the structure is deterministic, while the content of the english / spanish fields can still vary.
Hidden cause (one sentence)¶
The decoder is no longer constrained by the JSON-Schema mask, so the model's natural distribution over English-grammar tokens (mostly prose) competes with the JSON syntax target, and the prose wins ~60% of the time.
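To make the mechanism concrete, here is a toy sketch of what a mask does at a single decoding step. It is not the project's generate_with_mask: the four-token vocabulary, the logits, and the apply_mask helper are all made up for illustration. The point is only that tokens the schema forbids get a logit of negative infinity, so they end up with probability zero before sampling.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities; -inf logits become exactly 0.0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_mask(logits: list[float], allowed: set[int]) -> list[float]:
    """Set the logit of every token index outside `allowed` to -inf."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


# Made-up vocabulary and logits for the first token after "JSON:".
vocab = ["The", "{", "Sure", "Here"]
logits = [2.0, 1.0, 1.5, 0.5]

# At the start of a JSON object, only "{" is a legal next token.
allowed = {vocab.index("{")}

free = softmax(logits)
masked = softmax(apply_mask(logits, allowed))

for token, p_free, p_masked in zip(vocab, free, masked):
    print(f"{token!r:>7}  freeform={p_free:.2f}  masked={p_masked:.2f}")
```

In the freeform column the prose-like token "The" has the highest probability; in the masked column all of the probability mass sits on "{". A real implementation recomputes the allowed set at every step from the schema and the tokens emitted so far.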
Hint cascade¶
- Print the first 5 raw outputs from generate_freeform. How many are valid JSON?
- Compare to generate_with_mask. What is structurally invariant between them?
- Re-read theory/04-cfg-vs-regex-vs-jsonschema.md §"The right choice for the §A13 grammar tutor". Why is the schema mask the load-bearing component here?
Fix diff¶
 def conjugate(verb: str, tense: str, person: str) -> dict:
     prompt = build_prompt(verb, tense, person)
-    output = generate_freeform(model, prompt)
+    output = generate_with_mask(model, prompt, schema=CONJUGATE_SCHEMA)
     return json.loads(output)
Restore the masked path. The fix is also a one-line change.
Why this teaches the concept¶
The whole pitch of structured generation: "you don't trust the model to follow the format; you make it mechanically impossible to deviate." This break makes that claim load-bearing on a real model. With the mask, the JSON contract is a type system: it cannot be violated. Without the mask, the contract is a prayer: the model honours it some of the time, but fails often enough to be catastrophic for downstream parsers expecting structured input.
The downstream impact: any tool-calling agent (Phase 31) that depends on JSON output from the LLM is fragile without a schema mask. Production agents either use schema-constrained decoding or rely on post-hoc retry-and-validate loops (which add latency and cost). Phase 32's grammar tutor uses the mask path; this break shows why.
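The retry-and-validate alternative mentioned above can be sketched as follows. This is a minimal illustration, not how any particular agent framework does it: generate_with_retries is a hypothetical helper, and the model is replaced by a canned list of outputs so the example runs on its own.

```python
import json
from typing import Callable

# Keys taken from the prompt template in this exercise.
REQUIRED_KEYS = {"verb", "tense", "person", "english", "spanish"}


def generate_with_retries(
    generate: Callable[[], str], max_attempts: int = 5
) -> tuple[dict, int]:
    """Call `generate` until its output parses and has the expected keys.

    Returns the parsed object and the number of model calls it took.
    Raises ValueError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        raw = generate()
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable: pay for another model call
        if isinstance(obj, dict) and set(obj) == REQUIRED_KEYS:
            return obj, attempt
    raise ValueError(f"no valid output after {max_attempts} attempts")


# Stand-in for an unconstrained model: two bad outputs, then a good one.
canned = iter([
    "The past simple of 'work' is 'worked' (yo trabajé).",
    '{"verb": "work"',
    '{"verb": "work", "tense": "past simple", "person": "1st singular", '
    '"english": "worked", "spanish": "trabajé"}',
])
result, calls = generate_with_retries(lambda: next(canned))
print(calls, result["spanish"])
```

Every retry is a full model call. If attempts are independent and a fraction p of samples is usable, the expected number of calls per valid result is 1/p, so a usable rate of 10-20% would mean roughly 5 to 10 calls on average.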
Reference¶
- Willard & Louf, Efficient Guided Generation for Large Language Models (arXiv:2307.09702): the paper behind the Outlines library; describes how constrained decoding masks invalid tokens using finite-state machines built from regexes and grammars.
- OpenAI, Structured Outputs documentation: the production API exposes schema-constrained decoding for exactly this reason.
Next: restore the mask path and run lab/02-end-to-end-conjugate.md to measure the masked-decoder latency overhead.