
Lab 01 — JSON-Schema-Constrained Decoding

Goal: generalize the warm-up mask to a real JSON schema; produce {verb, tense, person} outputs that parse 100% of the time.

Estimated time: 4–6 hours.

Prereq: lab 00 (regex mask) committed.

What you produce

A directory experiments/30-conjugation-schema/ containing:

conjugation_schema.json — the formal schema your mask conforms to (a copy of the canonical schema for record-keeping; the canonical source lives in src/ministruct/schemas.py).
mask_driver.py — runs MiniGPT on the eval probe set with JSONSchemaMask engaged; collects outputs.
results.json — {n_samples, n_parsed_ok, n_schema_valid, kl_per_step_avg}.
outputs.jsonl — every generated string, one per line.
parse_failures.md — should be empty. If non-empty, those are bugs to fix.
manifest.json.

Plus, in src/ministruct/:

schemas.py — the canonical conjugation schema as a Python dataclass-like spec.
dfa.py — schema → state machine compiler.
mask.py — extended with JSONSchemaMask.

The schema

{
  "type": "object",
  "additionalProperties": false,
  "required": ["verb", "tense", "person"],
  "properties": {
    "verb": {
      "type": "string",
      "enum": ["work", "play", "walk", "talk", "listen", "watch", "study",
               "finish", "start", "look", "want", "like",
               "be", "have", "do", "go", "come", "see", "eat", "write"]
    },
    "tense": {
      "type": "string",
      "enum": ["infinitive", "present_simple", "past_simple",
               "past_participle", "simple_future"]
    },
    "person": {
      "type": "string",
      "enum": ["1sg", "2sg", "3sg"]
    },
    "spanish": {
      "type": "string",
      "maxLength": 30
    }
  }
}

The enums on verb, tense, person are the canonical English-verb-grammar scope (per LYNX_CORTEX_ADDENDUM.md §A13). spanish is the optional Spanish translation of the resulting conjugated form (e.g., for {verb: "eat", tense: "past_simple", person: "3sg"} the value is "comió").
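As a rough illustration of what a "dataclass-like spec" in schemas.py might look like, here is a minimal sketch. The class name ConjugationSchema, its field names, and the to_json_schema method are hypothetical choices for this example, not the canonical API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConjugationSchema:
    """Hypothetical spec object; names are illustrative, not the canonical API."""

    verbs: tuple = (
        "work", "play", "walk", "talk", "listen", "watch", "study",
        "finish", "start", "look", "want", "like",
        "be", "have", "do", "go", "come", "see", "eat", "write",
    )
    tenses: tuple = (
        "infinitive", "present_simple", "past_simple",
        "past_participle", "simple_future",
    )
    persons: tuple = ("1sg", "2sg", "3sg")
    spanish_max_length: int = 30

    def to_json_schema(self) -> dict:
        """Render the spec as a JSON Schema dict matching the schema above."""
        return {
            "type": "object",
            "additionalProperties": False,
            "required": ["verb", "tense", "person"],
            "properties": {
                "verb": {"type": "string", "enum": list(self.verbs)},
                "tense": {"type": "string", "enum": list(self.tenses)},
                "person": {"type": "string", "enum": list(self.persons)},
                "spanish": {"type": "string",
                            "maxLength": self.spanish_max_length},
            },
        }


SCHEMA = ConjugationSchema()
```

Under this layout, the record-keeping copy conjugation_schema.json could be written by dumping SCHEMA.to_json_schema() with the stdlib json module.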

TODOs

Block A — schema → states

In src/ministruct/dfa.py:

Parse the schema (stdlib json is fine; do NOT use the jsonschema library for the mask itself).
Build a state machine. States are documented in theory/02-logit-masks.md §"Computing the mask".
Each state holds: (a) the parser's position in the JSON skeleton, (b) which keys have been emitted so far, (c) which key is currently being valued, (d) within-value progress.
Implement transition(state, char) -> state | None (None = illegal). Test exhaustively on a hand-written valid example.
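One way to approach Block A is sketched below. It is not the reference design from theory/02-logit-masks.md. It assumes the minified form and the canonical key order verb → tense → person, and it omits the optional spanish key for brevity. The class name ConjugationDFA, the segment representation, and the run helper are hypothetical; SCHEMA is a trimmed copy of the schema holding only the enums.

```python
import json
from typing import Optional, Tuple

State = Tuple[int, str]  # (segment index, characters consumed in that segment)

SCHEMA = {"properties": {
    "verb": {"enum": ["work", "play", "walk", "talk", "listen", "watch",
                      "study", "finish", "start", "look", "want", "like",
                      "be", "have", "do", "go", "come", "see", "eat", "write"]},
    "tense": {"enum": ["infinitive", "present_simple", "past_simple",
                       "past_participle", "simple_future"]},
    "person": {"enum": ["1sg", "2sg", "3sg"]},
}}


class ConjugationDFA:
    """Hypothetical compiler output: alternating literal and enum segments."""

    def __init__(self, schema: dict, key_order=("verb", "tense", "person")):
        self.segments = []
        literal = "{"
        for i, key in enumerate(key_order):
            literal += ("," if i else "") + json.dumps(key) + ':"'
            self.segments.append(("lit", literal))
            self.segments.append(
                ("enum", tuple(schema["properties"][key]["enum"])))
            literal = '"'  # closing quote of the value just emitted
        self.segments.append(("lit", literal + "}"))
        self.start: State = (0, "")
        self.done: State = (len(self.segments), "")

    def transition(self, state: State, ch: str) -> Optional[State]:
        idx, seen = state
        if idx == len(self.segments):      # DONE: nothing may follow
            return None
        kind, payload = self.segments[idx]
        if kind == "lit":
            if payload[len(seen)] != ch:
                return None
            seen += ch
            return (idx + 1, "") if seen == payload else (idx, seen)
        # Enum segment: stay inside while some value still matches.
        if any(value.startswith(seen + ch) for value in payload):
            return (idx, seen + ch)
        if seen in payload:                # value complete; ch starts next literal
            return self.transition((idx + 1, ""), ch)
        return None


def run(dfa: ConjugationDFA, text: str) -> Optional[State]:
    """Feed text one character at a time; None means some character was illegal."""
    state = dfa.start
    for ch in text:
        state = dfa.transition(state, ch)
        if state is None:
            return None
    return state


dfa = ConjugationDFA(SCHEMA)
VALID = '{"verb":"eat","tense":"past_simple","person":"3sg"}'
print(run(dfa, VALID) == dfa.done)
```

The state here folds (a) through (d) into a segment index plus a within-segment string, which is only possible because the key order is fixed; allowing any key order would need an explicit set of emitted keys.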

Block B — token-level mask

In src/ministruct/mask.py, implement JSONSchemaMask:

Constructor: JSONSchemaMask(tokenizer, schema_dict).
On step(last_token_id):
If last_token_id is not None, decode it to characters, advance the state machine through each character. If any character is rejected, the parser is in an invalid state — flag this as a bug (the previous step's mask was wrong; should never trigger if the implementation is correct).
Compute mask: for each token in the vocabulary, decode it, simulate the state machine forward, accept if the simulation never hits an illegal state.
Return mask array.
On is_done(): True iff the state machine reached the terminal DONE state.
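The step / is_done contract above can be sketched with a toy stand-in for the state machine. In this sketch the "state" is simply the text emitted so far, and a character is legal iff the extended text is still a prefix of some string in a small finite set of legal outputs. PrefixMask, its vocab list (standing in for a tokenizer's decode table), and the two-string language are all assumptions made for illustration; a real JSONSchemaMask would delegate to dfa.py instead.

```python
from typing import List, Optional


class PrefixMask:
    """Toy mask: legal outputs are given explicitly as a finite set of strings."""

    def __init__(self, vocab: List[str], legal_outputs: List[str]):
        self.vocab = list(vocab)          # token id -> decoded string
        self.legal = tuple(legal_outputs)
        self.reset()

    def reset(self) -> None:
        self.text = ""

    def _transition(self, text: str, ch: str) -> Optional[str]:
        nxt = text + ch
        return nxt if any(t.startswith(nxt) for t in self.legal) else None

    def _walk(self, text: str, piece: str) -> Optional[str]:
        # Per-character simulation: every intermediate state must be legal.
        for ch in piece:
            text = self._transition(text, ch)
            if text is None:
                return None
        return text

    def step(self, last_token_id: Optional[int] = None) -> List[bool]:
        if last_token_id is not None:
            advanced = self._walk(self.text, self.vocab[last_token_id])
            if advanced is None:
                raise RuntimeError("previous mask admitted an illegal token")
            self.text = advanced
        # Tokens that decode to "" are rejected here so decoding always advances.
        return [bool(p) and self._walk(self.text, p) is not None
                for p in self.vocab]

    def is_done(self) -> bool:
        return self.text in self.legal


vocab = ["{", '{"verb":"', "go", "be", '"}', " ", "", '":"g', "g", "o"]
mask = PrefixMask(vocab, ['{"verb":"go"}', '{"verb":"be"}'])
first = mask.step(None)          # only "{" and '{"verb":"' are legal
second = mask.step(1)            # after '{"verb":"': "go", "be", "g" are legal
```

Rejecting empty-string tokens is a design choice of this sketch, not something the lab text requires; how special tokens such as end-of-sequence are handled is left to your implementation.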

Block C — multi-character tokens

This is the subtle part. A token like "," (closing quote, comma, opening quote) spans multiple JSON characters. Your transition must be called per character, not per token, during the mask-construction loop. Reference: theory/03-grammar-as-dfa.md §"Token-level vs character-level".

Verify with a hand-written test: a token " followed by token verb followed by token " produces the correct sequence of state transitions when decoded one char at a time.
Verify with a test: a token ,"tense":" (a multi-character BPE blob, if present) walks through 4 skeleton-level state transitions (10 character-level transition calls) in one step and is accepted iff the final state is legal AND every intermediate state is legal.
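The two checks above can be prototyped against a toy machine before the real DFA exists. In this sketch the state is just the number of characters matched against one fixed skeleton string; SKELETON, transition, and trace are hypothetical names used only here. The point it illustrates is that a multi-character token is accepted only if every character, not just the last one, yields a legal state.

```python
from typing import List, Optional

SKELETON = '{"verb":"go","tense":"infinitive","person":"1sg"}'


def transition(state: int, ch: str) -> Optional[int]:
    """Toy machine: state = number of skeleton characters already matched."""
    if state < len(SKELETON) and SKELETON[state] == ch:
        return state + 1
    return None


def trace(state: int, piece: str) -> Optional[List[int]]:
    """Return every intermediate state for a token, or None if any char fails."""
    states = []
    for ch in piece:
        state = transition(state, ch)
        if state is None:
            return None
        states.append(state)
    return states


# Three tokens after the opening brace: '"', 'verb', '"'.
print(trace(1, '"'), trace(2, "verb"), trace(6, '"'))

# One BPE blob covering comma, key, colon, and the value's opening quote.
blob = ',"tense":"'
print(trace(12, blob))           # ten states, one per character
print(trace(12, ',"tense":x'))   # None: the last character is illegal
```

If your tokenizer has no such blob in its vocabulary, the same trace call works with a mocked token string, which is enough to force the multi-character code path.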

Block D — wire into decoder + eval

mask_driver.py: load MiniGPT, load eval probe set (data/eval/conjugation_probes.jsonl — the §A13 probe set from Phase 20), for each probe build a prompt asking for the conjugation triple, generate with JSONSchemaMask, collect the output.
Validate every output: json.loads(output) succeeds AND jsonschema.validate(parsed, schema) succeeds. For validation only, the jsonschema library IS allowed — it's not part of the mask.
Compute KL per step (using \(Z = \sum_{t \in \mathcal{L}_i} p(t)\) from theory/02-logit-masks.md).
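For the KL diagnostic, one common reading is the following: if the masked distribution at step \(i\) is the renormalized \(q_i(t) = p(t)/Z\) for \(t \in \mathcal{L}_i\) (and 0 elsewhere), then \(\mathrm{KL}(q_i \,\|\, p) = \sum_{t \in \mathcal{L}_i} q_i(t) \log \frac{q_i(t)}{p(t)} = -\log Z\). The sketch below assumes that direction of the KL and that definition of \(q_i\); check both against theory/02-logit-masks.md. The function names are hypothetical, and only the stdlib is used.

```python
import math
from typing import List, Sequence


def masked_kl(logits: Sequence[float], mask: Sequence[bool]) -> float:
    """KL(q || p) for one step, where q is p renormalized over legal tokens."""
    top = max(logits)                              # subtract max for stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    z = sum(e for e, ok in zip(exps, mask) if ok) / total
    if z == 0.0:
        raise ValueError("mask admits no token with nonzero probability")
    return -math.log(z)


def kl_per_step_avg(step_kls: List[float]) -> float:
    """Average the per-step values into the kl_per_step_avg field."""
    return sum(step_kls) / len(step_kls)


# Uniform logits over four tokens with two legal ones: Z = 0.5.
kl = masked_kl([0.0, 0.0, 0.0, 0.0], [True, True, False, False])
print(kl)
```

A step where the mask removes almost no probability mass has \(Z\) near 1 and contributes almost nothing; a step where the model put most of its mass on illegal tokens has small \(Z\) and a large contribution.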

Block E — tests

In tests/test_ministruct_mask.py:

test_schema_first_token_is_open_brace — at step 0 of JSONSchemaMask, the only legal token is one whose decoded string starts with {.
test_verb_enum_after_verb_key — after emitting {"verb":", a token is legal only if its decoded string is a prefix of one of the 20 verb-enum strings, or completes one and then continues legally (e.g., go").
test_tense_enum_after_tense_key — same check for the 5 tense values.
test_person_enum_after_person_key — same check for the 3 person values.
test_done_after_close_brace — after a complete valid object, is_done() is True.
test_reset_between_requests — generate two distinct outputs from the same mask instance after calling reset(). Both should parse.
test_no_extra_keys_admitted — the mask should reject any key not in {verb, tense, person, spanish}.
test_no_repeat_keys — the mask should reject a key that has already been emitted in the current object.
test_spanish_optional — output without the spanish key passes; output with the spanish key passes too.

Constraints

Schema parsing only. No jsonschema for the masking logic. The jsonschema library may be used for validation of outputs in Block D (post-hoc, not as part of the mask).
One file per concern. Schema constants in schemas.py; DFA in dfa.py; mask in mask.py; tests in tests/test_ministruct_mask.py.
Determinism. The mask is a deterministic function of (state, vocab, schema). No randomness in mask construction.

Stop conditions

Done when:

All Block E tests pass.
results.json reports n_parsed_ok == n_samples AND n_schema_valid == n_samples. Hard contract.
parse_failures.md is empty.
README.md includes a brief discussion of average KL per step. Is it small (model already knew the format)? Large (model was being coerced)? What does that imply for fine-tuning in Phase 28?

Pitfalls

Whitespace. JSON allows arbitrary whitespace between elements. The easiest approach is to forbid whitespace in your mask (require the canonical minified form). Document this in the README.
Escape characters in strings. A literal " inside a Spanish translation is rare in §A13's scope but possible. Either disallow escape sequences in spanish (simplest) or implement the escape logic carefully.
Key ordering. JSON allows any key order; your mask either (a) enforces a canonical order (simpler, fewer states), or (b) allows any order (more states, more bugs). Recommended: enforce canonical order verb → tense → person → spanish?. Document this.
Token-trie surprises. A token like "verb": might be a single token in your BPE. The mask must walk all 7 of its characters through the state machine in one step. If your test never sees such a token, mock one to force the code path.
spanish optional, but not "skippable mid-stream". If you started emitting "spanish":", you must finish it. The mask must enforce that.
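For the spanish value specifically, the "disallow escapes" option combined with maxLength 30 can be sketched as a tiny per-character machine. This is an illustration under stated assumptions, not the reference solution: the state is the number of characters emitted inside the string, CLOSED is a hypothetical sentinel for "closing quote seen", and backslashes and control characters are rejected outright rather than handled as escapes.

```python
from typing import Optional

MAX_LEN = 30   # maxLength from the schema
CLOSED = -1    # hypothetical sentinel: the closing quote has been consumed


def spanish_transition(n_chars: int, ch: str) -> Optional[int]:
    """Advance the within-value state for the spanish string by one character."""
    if n_chars == CLOSED:
        return None                    # nothing may follow inside this value
    if ch == '"':
        return CLOSED                  # unescaped quote always ends the string
    if ch == "\\" or ord(ch) < 0x20:
        return None                    # no escapes, no raw control characters
    if n_chars >= MAX_LEN:
        return None                    # a 31st character would break maxLength
    return n_chars + 1


def walk(text: str) -> Optional[int]:
    """Feed the characters after the opening quote, one at a time."""
    state: Optional[int] = 0
    for ch in text:
        state = spanish_transition(state, ch)
        if state is None:
            return None
    return state


print(walk('comió"') == CLOSED)
```

Because the only exits from this machine are the closing quote or rejection, a generation that has started the spanish value cannot abandon it midway, which is the behavior the last pitfall asks the mask to enforce.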

When to consult `solutions/`

After 100% parse rate is achieved on the probe set. The solution will cross-check your state-machine structure and probably your KL diagnostic interpretation.

Next lab: lab/02-end-to-end-conjugate.md.