Break 00 — Disable the JSON-Schema mask on a structured-output task

We ask Mini-GPT to return the Spanish past simple of work as JSON. With the JSON-Schema mask active, the output always parses. Without it, the output alternates between prose, markdown, incomplete sentences, and JSON with trailing commas. We count the percentage of failed parses over 50 samples; that number is the lesson.

This /break exercise targets the mechanical invariance that structured generation provides. The bug is a one-line swap; the failure mode is a measurable parse-failure rate.

Anchors: theory/01-jsonmode-vs-grammar.md, theory/04-cfg-vs-regex-vs-jsonschema.md, .claude/commands/break.md.


Hypothesis

The learner predicts: "When I ask Mini-GPT 'give me the Spanish past simple of work in JSON', with the schema mask active, every output parses as valid JSON matching the schema. Without the mask, the model — trained on a tiny natural-language corpus, not a JSON-heavy one — emits a mix of prose, partial JSON, and JSON-with-trailing-commas. The parse failure rate jumps from 0% to ~40-70%."

The break

In src/minimask/conjugate_cli.py:

 def conjugate(verb: str, tense: str, person: str) -> dict:
     prompt = build_prompt(verb, tense, person)
-    output = generate_with_mask(model, prompt, schema=CONJUGATE_SCHEMA)
+    output = generate_freeform(model, prompt)   # /break: no schema mask
     return json.loads(output)        # will raise on most outputs

One swap: generate_with_mask → generate_freeform. The json.loads call stays — and that's where the failures will surface.

Predict, then run

The prompt template:

Return ONLY JSON: {"verb": "work", "tense": "past simple", "person": "1st singular",
                   "english": "<form>", "spanish": "<form>"}

Verb: work
Tense: past simple
Person: 1st singular
JSON:

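Mini-GPT's response with the mask always has the shape {"verb": "work", "tense": "past simple", "person": "1st singular", "english": "worked", "spanish": "trabajé"}. It parses on every sample; the mask guarantees the structure, while the english and spanish values are still whatever the model generates.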

Without the mask, expected behaviors over 50 samples (temperature 0.7):

  • ~10 samples: valid JSON, correct (the model does know JSON syntax from minimal corpus exposure).
  • ~15 samples: valid JSON, but a field is missing, renamed, or extra (for example "spanish" written as "español").
  • ~10 samples: JSON with a trailing comma (...,}), which json.loads rejects but which is recoverable.
  • ~10 samples: prose / markdown explanation instead of JSON ("The past simple of 'work' is 'worked' (yo trabajé).").
  • ~5 samples: truncated JSON ({"verb": "work").
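
The five buckets above can be assigned mechanically. The sketch below is one possible classifier, not part of the course code: `categorize` and `REQUIRED_KEYS` are hypothetical names, and the key set is inferred from the prompt template rather than taken from the real CONJUGATE_SCHEMA. The trailing-comma repair is a regex heuristic that could also touch commas inside string values.

```python
import json
import re

# Assumption: key set inferred from the prompt template; the real schema may differ.
REQUIRED_KEYS = {"verb", "tense", "person", "english", "spanish"}


def categorize(raw: str) -> str:
    """Sort one raw model output into a bucket (hypothetical helper)."""
    text = raw.strip()
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        pass  # fall through to the failure buckets below
    else:
        if isinstance(obj, dict) and set(obj) == REQUIRED_KEYS:
            return "valid"
        return "schema_mismatch"  # parses, but keys are missing, renamed, or extra

    if not text.startswith("{"):
        return "prose"

    # Heuristic repair: drop a comma that directly precedes a closing brace/bracket.
    repaired = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        json.loads(repaired)
        return "trailing_comma"
    except json.JSONDecodeError:
        # An object that never closes is treated as truncated; the rest is a catch-all.
        return "truncated" if not text.endswith("}") else "other_malformed"
```

Only "valid" counts as a useful sample; "schema_mismatch" passes json.loads but would still fail schema validation.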

Predictions

  • Parse failure rate (raw json.loads): ≈ 50%, i.e. 25 of 50 in the breakdown above (trailing comma, prose, truncated).
  • Schema-validation failure rate (JSON valid but wrong keys): on top of the parse failures, an additional ~30% (15 of 50).
  • Total useful samples: ~20% (10 of 50).
  • With the mask: parse failures 0% by construction; schema-validation failures 0% by construction.

Write your predictions in learners/borja/phase-30/notes/breaks.md before running.

Observe

Run the conjugate CLI 50 times with each mode:

# Broken (no mask)
for i in $(seq 1 50); do just exp 30-conjugate --mode freeform --seed $i; done \
    > experiments/30-freeform/results.jsonl
# Baseline (mask on)
for i in $(seq 1 50); do just exp 30-conjugate --mode masked --seed $i; done \
    > experiments/30-masked/results.jsonl

Diagnostics:

  1. Bar chart: parse_failures / 50 for each mode. Expected: about half of the freeform samples fail vs 0/50 masked.
  2. Categorize the freeform failures: trailing comma, prose, and truncation (parse failures), plus missing or renamed field (schema failures). Stacked bar.
  3. Side-by-side first 10 outputs of each mode. The contrast is the lesson.
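
For the first diagnostic, the number to plot is just the fraction of outputs that json.loads rejects. A minimal sketch follows. It assumes each line of results.jsonl is a JSON record whose raw model text sits in a field called `raw`; that layout and both helper names are assumptions, not something the experiment runner is known to produce.

```python
import json


def parse_failure_rate(raw_outputs: list[str]) -> float:
    """Fraction of outputs that json.loads rejects (0.0 for an empty list)."""
    failures = 0
    for raw in raw_outputs:
        try:
            json.loads(raw)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(raw_outputs) if raw_outputs else 0.0


def load_raw_outputs(path: str, field: str = "raw") -> list[str]:
    """Read one JSON record per line and return the raw model text.

    Assumption: the record layout and the `raw` field name are hypothetical.
    """
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line)[field] for line in fh if line.strip()]
```

Calling `parse_failure_rate(load_raw_outputs(path))` once per mode gives the two bar heights.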

Symptom Borja will see

  • json.JSONDecodeError raised on a large share of freeform samples (about half, if the breakdown above holds).
  • The freeform outputs that do parse often show field-name drift: spanish becomes español or es.
  • The masked outputs are boringly uniform in structure, which is exactly the point: the structure is deterministic, while the content of the english and spanish fields can still vary.

Hidden cause (one sentence)

The decoder is no longer constrained by the JSON-Schema mask, so the model's natural distribution over next tokens (mostly English prose) competes with the JSON syntax target, and the JSON target loses on a large share of samples.

Hint cascade

  1. Print the first 5 raw outputs from generate_freeform. How many are valid JSON?
  2. Compare to generate_with_mask. What is structurally invariant between them?
  3. Re-read theory/04-cfg-vs-regex-vs-jsonschema.md §"The right choice for the §A13 grammar tutor". Why is the schema mask the load-bearing component here?

Fix diff

 def conjugate(verb: str, tense: str, person: str) -> dict:
     prompt = build_prompt(verb, tense, person)
-    output = generate_freeform(model, prompt)
+    output = generate_with_mask(model, prompt, schema=CONJUGATE_SCHEMA)
     return json.loads(output)

Restore the masked path. The fix is also a one-line change.

Why this teaches the concept

The whole pitch of structured generation: "you don't trust the model to follow the format; you make it mechanically impossible to deviate." This break tests that claim on a real model. With the mask, the JSON contract behaves like a type system: it cannot be violated. Without the mask, the contract is a prayer: the model honours it some of the time, but fails often enough to be catastrophic for downstream parsers expecting structured input.
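
"Mechanically impossible" has a concrete meaning at the sampling step: tokens the grammar forbids are removed before sampling, so their probability is exactly zero whatever the model prefers. The toy sketch below illustrates that single step. It is not the implementation of generate_with_mask; `masked_sample`, the three-token vocabulary, and the logit values are invented for illustration.

```python
import math
import random


def masked_sample(logits: dict[str, float], allowed: set[str], rng: random.Random) -> str:
    """Sample one token after dropping every token the grammar forbids."""
    # Forbidden tokens never reach the softmax, so they cannot be sampled.
    candidates = {tok: lg for tok, lg in logits.items() if tok in allowed}
    if not candidates:
        raise ValueError("mask removed every token")
    peak = max(candidates.values())  # subtract the max for numerical stability
    weights = [math.exp(lg - peak) for lg in candidates.values()]
    return rng.choices(list(candidates), weights=weights, k=1)[0]


# Toy logits: the model strongly prefers prose openers over an opening brace.
toy_logits = {"The": 5.0, "Sure": 4.0, "{": 1.0}
rng = random.Random(0)
first_tokens = {masked_sample(toy_logits, {"{"}, rng) for _ in range(100)}
```

With the mask limited to "{", every one of the 100 draws is "{" even though the unmasked model would almost always start with prose. A real schema mask recomputes the allowed set after each token.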

The downstream impact: any tool-calling agent (Phase 31) that depends on JSON output from the LLM is fragile without a schema mask. Production agents either use schema-constrained decoding or rely on post-hoc retry-and-validate loops (which add latency and cost). Phase 32's grammar tutor uses the mask path; this break shows why.
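
The retry-and-validate alternative mentioned above can be sketched as a loop around any generator. `generate_with_retries` is a hypothetical helper, and the exact-key check stands in for full schema validation; the point is only to make the extra model calls visible.

```python
import json
from typing import Callable


def generate_with_retries(
    generate: Callable[[], str],
    required_keys: set[str],
    max_attempts: int = 3,
) -> tuple[dict, int]:
    """Call `generate` until the output parses and has exactly the required keys.

    Returns the parsed object and the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        raw = generate()  # each attempt is one full model call
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable: pay for another call
        if isinstance(obj, dict) and set(obj) == required_keys:
            return obj, attempt
    raise ValueError(f"no valid output after {max_attempts} attempts")
```

If only about 20% of freeform samples are usable and samples are independent, an uncapped loop needs 1 / 0.2 = 5 model calls on average per answer, which is where the latency and cost come from.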

Reference

  • Willard & Louf, "Efficient Guided Generation for Large Language Models" (arXiv:2307.09702), the paper behind Outlines: describes how to constrain decoding to a regex or grammar by masking disallowed tokens at each step.
  • OpenAI, Structured Outputs documentation — the production API exposes schema-constrained decoding for exactly this reason.

Next: restore the mask path and run lab/02-end-to-end-conjugate.md to measure the masked-decoder latency overhead.