Lab 01: JSON-Schema-Constrained Decoding
Goal: generalize the warm-up mask to a real JSON schema; produce `{verb, tense, person}` outputs that parse 100% of the time.
Estimated time: 4–6 hours.
Prereq: lab 00 (regex mask) committed.
What you produce
A directory `experiments/30-conjugation-schema/` containing:
- `conjugation_schema.json`: the formal schema your mask conforms to (a copy of the canonical schema for record-keeping; the canonical source lives in `src/ministruct/schemas.py`).
- `mask_driver.py`: runs MiniGPT on the eval probe set with `JSONSchemaMask` engaged; collects outputs.
- `results.json`: `{n_samples, n_parsed_ok, n_schema_valid, kl_per_step_avg}`.
- `outputs.jsonl`: every generated string, one per line.
- `parse_failures.md`: should be empty. If non-empty, those are bugs to fix.
- `manifest.json`.
Plus, in `src/ministruct/`:
- `schemas.py`: the canonical conjugation schema as a Python dataclass-like spec.
- `dfa.py`: schema → state machine compiler.
- `mask.py`: extended with `JSONSchemaMask`.
The schema

    {
      "type": "object",
      "additionalProperties": false,
      "required": ["verb", "tense", "person"],
      "properties": {
        "verb": {
          "type": "string",
          "enum": ["work", "play", "walk", "talk", "listen", "watch", "study",
                   "finish", "start", "look", "want", "like",
                   "be", "have", "do", "go", "come", "see", "eat", "write"]
        },
        "tense": {
          "type": "string",
          "enum": ["infinitive", "present_simple", "past_simple",
                   "past_participle", "simple_future"]
        },
        "person": {
          "type": "string",
          "enum": ["1sg", "2sg", "3sg"]
        },
        "spanish": {
          "type": "string",
          "maxLength": 30
        }
      }
    }

The enums on verb, tense, person are the canonical English-verb-grammar scope (per LYNX_CORTEX_ADDENDUM.md §A13). spanish is the optional Spanish translation of the resulting conjugated form (e.g., for {verb: "eat", tense: "past_simple", person: "3sg"} the value is "comió").
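
One way the canonical source in `schemas.py` could mirror this schema is a small frozen dataclass that renders to the JSON Schema dict. This is a minimal sketch only: the class name `ConjugationSchema`, its field names, and the `to_json_schema` method are illustrative assumptions, not the lab's actual API.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class ConjugationSchema:
    """Hypothetical dataclass-like spec; all names here are assumptions."""

    verbs: Tuple[str, ...] = (
        "work", "play", "walk", "talk", "listen", "watch", "study",
        "finish", "start", "look", "want", "like",
        "be", "have", "do", "go", "come", "see", "eat", "write",
    )
    tenses: Tuple[str, ...] = (
        "infinitive", "present_simple", "past_simple",
        "past_participle", "simple_future",
    )
    persons: Tuple[str, ...] = ("1sg", "2sg", "3sg")
    spanish_max_length: int = 30

    def to_json_schema(self) -> Dict[str, Any]:
        """Render the spec as a JSON Schema dict with the same shape as the schema above."""
        return {
            "type": "object",
            "additionalProperties": False,
            "required": ["verb", "tense", "person"],
            "properties": {
                "verb": {"type": "string", "enum": list(self.verbs)},
                "tense": {"type": "string", "enum": list(self.tenses)},
                "person": {"type": "string", "enum": list(self.persons)},
                "spanish": {"type": "string", "maxLength": self.spanish_max_length},
            },
        }


SCHEMA = ConjugationSchema().to_json_schema()
# Text that could be written to conjugation_schema.json for record-keeping.
SCHEMA_TEXT = json.dumps(SCHEMA, indent=2)
```

Keeping the enums as tuples on a frozen dataclass makes the spec immutable, so the mask and the validation step cannot drift apart by mutating a shared dict.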
TODOs
Block A: schema → states
In `src/ministruct/dfa.py`:
- Parse the schema (stdlib `json` is fine; do NOT use the `jsonschema` library for the mask itself).
- Build a state machine. States are documented in `theory/02-logit-masks.md` §"Computing the mask".
- Each state holds: (a) the parser's position in the JSON skeleton, (b) which keys have been emitted so far, (c) which key is currently being valued, (d) within-value progress.
- Implement `transition(state, char) -> state | None` (None = illegal). Test exhaustively on a hand-written valid example.
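
The steps above can be sketched as a character-level machine over alternating literal and enum segments. This is a simplified illustration, not the reference solution: it assumes the canonical minified form with fixed key order, omits the optional `spanish` key, and hard-codes the enums instead of compiling them from the schema. The names `SEGMENTS`, `State`, `START`, `DONE`, and `run` are hypothetical.

```python
from typing import NamedTuple, Optional

VERBS = ("work", "play", "walk", "talk", "listen", "watch", "study",
         "finish", "start", "look", "want", "like",
         "be", "have", "do", "go", "come", "see", "eat", "write")
TENSES = ("infinitive", "present_simple", "past_simple",
          "past_participle", "simple_future")
PERSONS = ("1sg", "2sg", "3sg")

# Skeleton of the minified object: fixed text interleaved with enum values.
# Each literal after an enum begins with the quote that closes that value.
SEGMENTS = (
    ("lit", '{"verb":"'), ("enum", VERBS),
    ("lit", '","tense":"'), ("enum", TENSES),
    ("lit", '","person":"'), ("enum", PERSONS),
    ("lit", '"}'),
)


class State(NamedTuple):
    seg: int  # position in the skeleton; len(SEGMENTS) means DONE
    buf: str  # characters consumed so far inside the current segment


START = State(0, "")
DONE = State(len(SEGMENTS), "")


def transition(state: State, char: str) -> Optional[State]:
    """Advance by one character; return None if the character is illegal."""
    if state.seg >= len(SEGMENTS):
        return None  # nothing is legal after DONE
    kind, spec = SEGMENTS[state.seg]
    buf = state.buf + char
    if kind == "lit":
        if not spec.startswith(buf):
            return None
        return State(state.seg + 1, "") if buf == spec else State(state.seg, buf)
    # Enum value: a closing quote is legal only after a complete enum member.
    if char == '"':
        return State(state.seg + 1, '"') if state.buf in spec else None
    if any(value.startswith(buf) for value in spec):
        return State(state.seg, buf)
    return None


def run(text: str) -> Optional[State]:
    """Fold transition over a string, starting from START."""
    state: Optional[State] = START
    for ch in text:
        state = transition(state, ch)
        if state is None:
            return None
    return state
```

In this sketch `seg` plays the role of the skeleton position and `buf` the within-value progress; because key order is fixed, the set of emitted keys and the key currently being valued are both implied by `seg`.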
Block B: token-level mask
In `src/ministruct/mask.py`, implement `JSONSchemaMask`:
- Constructor: `JSONSchemaMask(tokenizer, schema_dict)`.
- On `step(last_token_id)`:
  - If `last_token_id` is not None, decode it to characters and advance the state machine through each character. If any character is rejected, the parser is in an invalid state; flag this as a bug (the previous step's mask was wrong; this should never trigger if the implementation is correct).
  - Compute the mask: for each token in the vocabulary, decode it, simulate the state machine forward from the current state, and accept the token iff the simulation never hits an illegal state.
  - Return the mask array.
- On `is_done()`: True iff the state machine has reached the terminal DONE state.
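
The `step` / `is_done` contract above might look roughly like the following. It is a sketch under stated assumptions: the tokenizer is replaced by a plain list in which `vocab[i]` is the decoded text of token id `i`, the grammar is a toy machine accepting exactly one string, and the class name `CharMachineMask` and its constructor arguments are hypothetical rather than the lab's `JSONSchemaMask` signature.

```python
from typing import Callable, List, Optional, Sequence

TARGET = '{"verb":"go"}'  # toy grammar: exactly one legal output


def toy_transition(state: int, char: str) -> Optional[int]:
    """Character-level step for the toy grammar; None means illegal."""
    if state < len(TARGET) and TARGET[state] == char:
        return state + 1
    return None


class CharMachineMask:
    """Hypothetical mask driver built on any character-level transition."""

    def __init__(self, vocab: Sequence[str],
                 transition: Callable[[int, str], Optional[int]],
                 start: int, done: int) -> None:
        self.vocab = list(vocab)
        self.transition = transition
        self.start = start
        self.done = done
        self.state = start

    def reset(self) -> None:
        self.state = self.start

    def _walk(self, state: int, text: str) -> Optional[int]:
        # Feed the token one character at a time; stop at the first illegal one.
        for ch in text:
            nxt = self.transition(state, ch)
            if nxt is None:
                return None
            state = nxt
        return state

    def step(self, last_token_id: Optional[int] = None) -> List[bool]:
        if last_token_id is not None:
            nxt = self._walk(self.state, self.vocab[last_token_id])
            if nxt is None:
                # The previous mask should have excluded this token.
                raise RuntimeError("previous mask admitted an illegal token")
            self.state = nxt
        # Empty tokens are rejected so that decoding always makes progress.
        return [bool(tok) and self._walk(self.state, tok) is not None
                for tok in self.vocab]

    def is_done(self) -> bool:
        return self.state == self.done


VOCAB = ['{', '{"verb":"', 'go', '"}', 'x', '"', 'g', 'o"}', '']
mask = CharMachineMask(VOCAB, toy_transition, start=0, done=len(TARGET))
legal_at_start = [VOCAB[i] for i, ok in enumerate(mask.step()) if ok]
```

Recomputing the mask over the whole vocabulary at every step is the simplest correct approach; caching per-state masks is an optimization that can come later, since the mask depends only on (state, vocab, schema).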
Block C: multi-character tokens
This is the subtle part. A token like `","` spans multiple JSON characters. Your `transition` must be called per-character, not per-token, during the mask-construction loop. Reference: `theory/03-grammar-as-dfa.md` §"Token-level vs character-level".
- Verify with a hand-written test: a token `"` followed by token `verb` followed by token `"` produces the correct sequence of state transitions when decoded one char at a time.
- Verify with a test: a token `,"tense":"` (a multi-character BPE blob, if present) walks through 4 state transitions in one step (comma, key, colon, opening quote of the value; 10 characters in total) and is accepted iff the final state is legal AND every intermediate state is legal.
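
Both checks can be prototyped against a stand-in machine before the real DFA exists. The sketch below assumes a toy machine that accepts only the fragment `"verb","tense":"`; `FRAGMENT`, `transition`, `walk`, and the two test names are hypothetical, and in the lab the transition would come from `dfa.py`.

```python
from typing import List, Optional

FRAGMENT = '"verb","tense":"'  # toy stand-in for part of the minified object


def transition(state: int, char: str) -> Optional[int]:
    """Toy character-level machine: the state is the number of characters matched."""
    if state < len(FRAGMENT) and FRAGMENT[state] == char:
        return state + 1
    return None


def walk(state: int, token_text: str) -> Optional[List[int]]:
    """Return every intermediate state, or None at the first illegal character."""
    trace: List[int] = []
    for ch in token_text:
        nxt = transition(state, ch)
        if nxt is None:
            return None
        state = nxt
        trace.append(state)
    return trace


def test_split_tokens_reach_same_state_as_one_blob() -> None:
    state = 0
    for tok in ['"', 'verb', '"']:
        trace = walk(state, tok)
        assert trace is not None
        state = trace[-1]
    blob = walk(0, '"verb"')
    assert blob is not None and blob[-1] == state == 6


def test_blob_is_walked_character_by_character() -> None:
    trace = walk(6, ',"tense":"')
    # Ten characters, so ten character-level states, all of them legal.
    assert trace == list(range(7, 17))
    # One bad character in the middle rejects the whole token.
    assert walk(6, ',"tensx":"') is None
```

Returning the full trace rather than only the final state makes a failing test show exactly which character of the token was rejected.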
Block D: wire into decoder + eval
- `mask_driver.py`: load MiniGPT, load the eval probe set (`data/eval/conjugation_probes.jsonl`, the §A13 probe set from Phase 20), for each probe build a prompt asking for the conjugation triple, generate with `JSONSchemaMask`, and collect the output.
- Validate every output: `json.loads(output)` succeeds AND `jsonschema.validate(parsed, schema)` succeeds. For validation only, the `jsonschema` library IS allowed; it's not part of the mask.
- Compute KL per step (using \(Z = \sum_{t \in \mathcal{L}_i} p(t)\) from `theory/02-logit-masks.md`).
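
For the KL diagnostic, a plausible reading is the divergence from the unmasked distribution \(p\) to the renormalized masked distribution \(q(t) = p(t)/Z\) for \(t \in \mathcal{L}_i\). Under that assumption (the theory note may define it differently), \(\mathrm{KL}(q \,\|\, p) = \sum_{t \in \mathcal{L}_i} q(t) \log \frac{q(t)}{p(t)} = -\log Z\), so each step needs only the legal probability mass. The function names below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step_kl(logits: Sequence[float], legal: Sequence[bool]) -> float:
    """KL(q || p) for one step, assuming q is p renormalized over legal tokens."""
    p = softmax(logits)
    z = sum(pi for pi, ok in zip(p, legal) if ok)
    if z <= 0.0:
        raise ValueError("mask admits no token with nonzero probability")
    return -math.log(z)


def kl_per_step_avg(steps: Sequence[Tuple[Sequence[float], Sequence[bool]]]) -> float:
    """Mean of step_kl over all (logits, mask) pairs collected during decoding."""
    if not steps:
        raise ValueError("no decoding steps recorded")
    return sum(step_kl(logits, legal) for logits, legal in steps) / len(steps)
```

A value near zero means the model already placed almost all of its mass on legal tokens; with uniform logits and half the vocabulary masked out, the step KL is \(\log 2\).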
Block E: tests
In `tests/test_ministruct_mask.py`:
- `test_schema_first_token_is_open_brace`: at step 0 of `JSONSchemaMask`, the only legal tokens are those whose decoded string starts with `{`.
- `test_verb_enum_after_verb_key`: after emitting `{"verb":"`, only tokens whose decoded string is a prefix of one of the 20 verb-enum strings are legal (plus multi-character tokens that complete a verb and continue legally, per Block C).
- `test_tense_enum_after_tense_key`: same check for the 5 tense values.
- `test_person_enum_after_person_key`: same check for the 3 person values.
- `test_done_after_close_brace`: after a complete valid object, `is_done()` is True.
- `test_reset_between_requests`: generate two distinct outputs from the same mask instance, calling `reset()` between them. Both should parse.
- `test_no_extra_keys_admitted`: the mask should reject any key not in `{verb, tense, person, spanish}`.
- `test_no_repeat_keys`: the mask should reject a key that has already been emitted in the current object.
- `test_spanish_optional`: output without the `spanish` key passes; output with the `spanish` key passes too.
Constraints
- Schema parsing only. No `jsonschema` for the masking logic. The `jsonschema` library may be used for validation of outputs in Block D (post-hoc, not as part of the mask).
- One file per concern. Schema constants in `schemas.py`; DFA in `dfa.py`; mask in `mask.py`; tests in `tests/test_ministruct_mask.py`.
- Determinism. The mask for a given (state, vocab, schema) is deterministic. No randomness in mask construction.
Stop conditions
Done when:
- All Block E tests pass.
- `results.json` reports `n_parsed_ok == n_samples` AND `n_schema_valid == n_samples`. Hard contract.
- `parse_failures.md` is empty.
- `README.md` includes a brief discussion of average KL per step. Is it small (model already knew the format)? Large (model was being coerced)? What does that imply for fine-tuning in Phase 28?
Pitfalls
- Whitespace. JSON allows arbitrary whitespace between elements. The easiest option is to forbid whitespace in your mask (require the canonical minified form). Document this in the README.
- Escape characters in strings. A `"` inside a Spanish translation (e.g., the Spanish word for some quote-containing form) is rare in §A13's scope but possible. Either disallow escape sequences in `spanish` (simplest), or implement escape logic carefully.
- Key ordering. JSON allows any key order; your mask either (a) enforces a canonical order (simpler, fewer states), or (b) allows any order (more states, more bugs). Recommended: enforce canonical order `verb → tense → person → spanish?`. Document this.
- Token-trie surprises. A token like `"verb":` might be a single token in your BPE. The mask must walk all 7 of its characters through the state machine in one step. If your test never sees such a token, mock one to force the code path.
- `spanish` is optional, but not "skippable mid-stream". If you started emitting `"spanish":"`, you must finish it. The mask must enforce that.
When to consult solutions/
After 100% parse rate is achieved on the probe set. The solution will cross-check your state-machine structure and probably your KL diagnostic interpretation.
Next lab: lab/02-end-to-end-conjugate.md.