Lab 01 — JSON-Schema-Constrained Decoding

Goal: generalize the warm-up mask to a real JSON schema; produce {verb, tense, person} outputs that parse 100% of the time.

Estimated time: 4–6 hours.

Prereq: lab 00 (regex mask) committed.


What you produce

A directory experiments/30-conjugation-schema/ containing:

  • conjugation_schema.json — the formal schema your mask conforms to (a copy of the canonical schema for record-keeping; the canonical source lives in src/ministruct/schemas.py).
  • mask_driver.py — runs MiniGPT on the eval probe set with JSONSchemaMask engaged; collects outputs.
  • results.json with fields {n_samples, n_parsed_ok, n_schema_valid, kl_per_step_avg}.
  • outputs.jsonl — every generated string, one per line.
  • parse_failures.md — should be empty. If non-empty, those are bugs to fix.
  • manifest.json.

Plus, in src/ministruct/:

  • schemas.py — the canonical conjugation schema as a Python dataclass-like spec.
  • dfa.py — schema → state machine compiler.
  • mask.py — extended with JSONSchemaMask.

The schema

{
  "type": "object",
  "additionalProperties": false,
  "required": ["verb", "tense", "person"],
  "properties": {
    "verb": {
      "type": "string",
      "enum": ["work", "play", "walk", "talk", "listen", "watch", "study",
               "finish", "start", "look", "want", "like",
               "be", "have", "do", "go", "come", "see", "eat", "write"]
    },
    "tense": {
      "type": "string",
      "enum": ["infinitive", "present_simple", "past_simple",
               "past_participle", "simple_future"]
    },
    "person": {
      "type": "string",
      "enum": ["1sg", "2sg", "3sg"]
    },
    "spanish": {
      "type": "string",
      "maxLength": 30
    }
  }
}

The enums on verb, tense, person are the canonical English-verb-grammar scope (per LYNX_CORTEX_ADDENDUM.md §A13). spanish is the optional Spanish translation of the resulting conjugated form (e.g., for {verb: "eat", tense: "past_simple", person: "3sg"} the value is "comió").
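One way schemas.py might hold the canonical schema is sketched below. The lab only asks for a "dataclass-like spec", so the names ConjugationSchema, CONJUGATION_SCHEMA, and to_json_schema are hypothetical, chosen for illustration; only the enum values and maxLength come from the schema above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConjugationSchema:
    """Hypothetical container for the canonical schema constants."""
    verbs: tuple
    tenses: tuple
    persons: tuple
    spanish_max_length: int = 30

    def to_json_schema(self) -> dict:
        """Render the spec as a plain JSON-schema dict."""
        return {
            "type": "object",
            "additionalProperties": False,
            "required": ["verb", "tense", "person"],
            "properties": {
                "verb": {"type": "string", "enum": list(self.verbs)},
                "tense": {"type": "string", "enum": list(self.tenses)},
                "person": {"type": "string", "enum": list(self.persons)},
                "spanish": {"type": "string",
                            "maxLength": self.spanish_max_length},
            },
        }


CONJUGATION_SCHEMA = ConjugationSchema(
    verbs=("work", "play", "walk", "talk", "listen", "watch", "study",
           "finish", "start", "look", "want", "like",
           "be", "have", "do", "go", "come", "see", "eat", "write"),
    tenses=("infinitive", "present_simple", "past_simple",
            "past_participle", "simple_future"),
    persons=("1sg", "2sg", "3sg"),
)
```

With this layout, conjugation_schema.json could be produced by dumping CONJUGATION_SCHEMA.to_json_schema() with the stdlib json module, which keeps the record-keeping copy in sync with the canonical source.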

TODOs

Block A — schema → states

In src/ministruct/dfa.py:

  • Parse the schema (stdlib json is fine; do NOT use the jsonschema library for the mask itself).
  • Build a state machine. States are documented in theory/02-logit-masks.md §"Computing the mask".
  • Each state holds: (a) the parser's position in the JSON skeleton, (b) which keys have been emitted so far, (c) which key is currently being valued, (d) within-value progress.
  • Implement transition(state, char) -> state | None (None = illegal). Test exhaustively on a hand-written valid example.
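The steps above can be sketched as a small character-level machine. This is a minimal sketch, not the reference design from theory/02-logit-masks.md: it assumes canonical key order, minified output, no escape sequences, string-valued properties, and at most one optional key after the required ones. The helper names compile_stages, enter, and run are hypothetical, and the state here is a simplified (stage index, progress) pair in which the stage index implies which keys have been emitted.

```python
import json

SCHEMA = {
    "type": "object", "additionalProperties": False,
    "required": ["verb", "tense", "person"],
    "properties": {
        "verb": {"type": "string", "enum": [
            "work", "play", "walk", "talk", "listen", "watch", "study",
            "finish", "start", "look", "want", "like",
            "be", "have", "do", "go", "come", "see", "eat", "write"]},
        "tense": {"type": "string", "enum": [
            "infinitive", "present_simple", "past_simple",
            "past_participle", "simple_future"]},
        "person": {"type": "string", "enum": ["1sg", "2sg", "3sg"]},
        "spanish": {"type": "string", "maxLength": 30},
    },
}
DONE = (-1, 0)  # terminal state


def compile_stages(schema):
    """Flatten the schema into an ordered list of (kind, arg) stages."""
    props, required = schema["properties"], schema["required"]
    optional = [k for k in props if k not in required]
    assert len(optional) <= 1, "sketch handles at most one optional key"
    stages = []

    def add_key(key, lead):
        # Literal skeleton up to and including the value's opening quote.
        stages.append(("lit", lead + json.dumps(key) + ':"'))
        spec = props[key]
        if "enum" in spec:
            stages.append(("enum", tuple(spec["enum"])))
        else:
            stages.append(("free", spec.get("maxLength")))

    for n, key in enumerate(required):
        add_key(key, "{" if n == 0 else ",")
    if optional:
        stages.append(("branch", None))
        add_key(optional[0], ",")
    stages.append(("lit", "}"))
    return stages


def enter(stages, i):
    """Initial state on entering stage i (DONE past the last stage)."""
    if i == len(stages):
        return DONE
    return (i, "" if stages[i][0] == "enum" else 0)


def transition(stages, state, ch):
    """Advance one character; return the next state, or None if illegal."""
    if state is None or state == DONE:
        return None
    i, prog = state
    kind, arg = stages[i]
    if kind == "lit":
        if arg[prog] != ch:
            return None
        return enter(stages, i + 1) if prog + 1 == len(arg) else (i, prog + 1)
    if kind == "enum":
        if ch == '"':  # closing quote is legal only after a full enum value
            return enter(stages, i + 1) if prog in arg else None
        cand = prog + ch
        return (i, cand) if any(v.startswith(cand) for v in arg) else None
    if kind == "free":
        if ch == '"':
            return enter(stages, i + 1)
        if ch == "\\" or ord(ch) < 0x20:  # no escapes, no control chars
            return None
        if arg is not None and prog >= arg:  # maxLength reached
            return None
        return (i, prog + 1)
    # kind == "branch": close the object, or start the optional key.
    if ch == "}":
        return DONE
    if ch == ",":
        return (i + 1, 1)  # the comma is character 0 of the next literal
    return None


def run(stages, text):
    """Fold transition over text; None as soon as a character is illegal."""
    state = enter(stages, 0)
    for ch in text:
        state = transition(stages, state, ch)
        if state is None:
            return None
    return state
```

For example, run(compile_stages(SCHEMA), '{"verb":"eat","tense":"past_simple","person":"3sg"}') reaches DONE, while any string that begins '{"verb":"fly' is rejected at the character "l". States are plain tuples, so they are hashable and cheap to copy during mask simulation.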

Block B — token-level mask

In src/ministruct/mask.py, implement JSONSchemaMask:

  • Constructor: JSONSchemaMask(tokenizer, schema_dict).
  • On step(last_token_id):
      1. If last_token_id is not None, decode it to characters and advance the state machine through each character. If any character is rejected, the parser is in an invalid state. Flag this as a bug: the previous step's mask was wrong, which should never happen if the implementation is correct.
      2. Compute the mask: for each token in the vocabulary, decode it and simulate the state machine forward; accept the token iff the simulation never hits an illegal state.
      3. Return the mask array.
  • On is_done(): True iff the state machine reached the terminal DONE state.
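The step/is_done contract above can be sketched with a stand-in state machine. Everything here is an assumption made for illustration: ToyTokenizer, MaskSketch, and the prefix-based transition over a tiny fixed set of legal strings are hypothetical, and the machine is injected rather than compiled from schema_dict as the real JSONSchemaMask constructor would do. The mask is returned as a list of booleans; a real implementation would more likely return a NumPy or tensor array.

```python
LEGAL = {'{"verb":"go"}', '{"verb":"eat"}'}  # toy language, not the schema


def transition(state, ch):
    """Toy machine: the state is the prefix emitted so far."""
    cand = state + ch
    return cand if any(s.startswith(cand) for s in LEGAL) else None


class ToyTokenizer:
    """Hypothetical tokenizer: token id -> fixed string."""
    def __init__(self, vocab):
        self.vocab = list(vocab)
        self.vocab_size = len(self.vocab)

    def decode(self, token_id):
        return self.vocab[token_id]


class MaskSketch:
    def __init__(self, tokenizer, transition, start, is_final):
        self.tokenizer = tokenizer
        self.transition = transition
        self.start = start
        self.is_final = is_final
        self.reset()

    def reset(self):
        self.state = self.start

    def _walk(self, state, text):
        """Advance character by character; None if any character is illegal."""
        for ch in text:
            state = self.transition(state, ch)
            if state is None:
                return None
        return state

    def step(self, last_token_id=None):
        if last_token_id is not None:
            nxt = self._walk(self.state, self.tokenizer.decode(last_token_id))
            if nxt is None:
                # The previous mask admitted a token it should have rejected.
                raise RuntimeError("illegal token reached the state machine")
            self.state = nxt
        mask = []
        for token_id in range(self.tokenizer.vocab_size):
            text = self.tokenizer.decode(token_id)
            # Empty tokens make no progress, so they are never admitted.
            ok = text != "" and self._walk(self.state, text) is not None
            mask.append(ok)
        return mask

    def is_done(self):
        return self.is_final(self.state)
```

A usage sketch: build MaskSketch(ToyTokenizer(vocab), transition, "", lambda s: s in LEGAL), call step(None) for the first mask, then pass each sampled token id to the next step call until is_done() is true. Rejecting empty-string tokens is a design choice of this sketch that prevents the decoder from stalling.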

Block C — multi-character tokens

This is the subtle part. A token like "," (closing quote, comma, opening quote) spans multiple JSON characters. Your transition must be called per character, not per token, during the mask-construction loop. Reference: theory/03-grammar-as-dfa.md §"Token-level vs character-level".

  • Verify with a hand-written test: a token " followed by token verb followed by token " produces the correct sequence of state transitions when decoded one char at a time.
  • Verify with a test: a token ,"tense":" (a multi-character BPE blob, if present) walks through its 10 characters (4 structural transitions: comma, key, colon, opening quote) in one step and is accepted iff the final state is legal AND every intermediate state is legal.
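The two checks above might look like the following. This is a sketch under stated assumptions: the prefix-based transition over a tiny fixed set of legal strings stands in for your real state machine, and walk_trace is a hypothetical helper that records every intermediate state so a test can inspect them.

```python
LEGAL = {
    '{"verb":"go","tense":"infinitive"}',
    '{"verb":"go","tense":"past_simple"}',
}  # toy language standing in for the compiled schema


def transition(state, ch):
    """Toy machine: the state is the prefix emitted so far."""
    cand = state + ch
    return cand if any(s.startswith(cand) for s in LEGAL) else None


def walk_trace(state, text):
    """Return the states visited, one per character, or None if any is illegal."""
    trace = []
    for ch in text:
        state = transition(state, ch)
        if state is None:
            return None
        trace.append(state)
    return trace


def test_three_tokens_match_one_blob():
    # Tokens '"', 'verb', '"' fed one at a time from the state after '{'.
    state = "{"
    for token in ['"', "verb", '"']:
        state = walk_trace(state, token)[-1]
    # A single mocked token '"verb"' must land in the same state.
    assert state == walk_trace("{", '"verb"')[-1]


def test_multichar_blob_checks_every_intermediate_state():
    start = '{"verb":"go"'
    good = walk_trace(start, ',"tense":"')
    assert good is not None and len(good) == len(',"tense":"')
    # The first six characters are legal, but the token is rejected as a whole.
    assert walk_trace(start, ',"tensx":"') is None
```

Mocking the blob token directly, as the second test does, forces the multi-character code path even when the real BPE vocabulary has no such token.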

Block D — wire into decoder + eval

  • mask_driver.py: load MiniGPT, load eval probe set (data/eval/conjugation_probes.jsonl — the §A13 probe set from Phase 20), for each probe build a prompt asking for the conjugation triple, generate with JSONSchemaMask, collect the output.
  • Validate every output: json.loads(output) succeeds AND jsonschema.validate(parsed, schema) succeeds. For validation only, the jsonschema library IS allowed — it's not part of the mask.
  • Compute KL per step (using \(Z = \sum_{t \in \mathcal{L}_i} p(t)\) from theory/02-logit-masks.md).
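A minimal sketch of the per-step diagnostic, assuming the KL is taken from the masked, renormalized distribution \(q\) to the unmasked distribution \(p\); theory/02-logit-masks.md is authoritative on the direction. Under that assumption \(q(t) = p(t)/Z\) on the legal set, so \(\log(q(t)/p(t)) = -\log Z\) for every legal token and \(\mathrm{KL}(q \,\|\, p) = -\log Z\). The function names are hypothetical.

```python
import math


def masked_kl_step(logits, mask):
    """KL(q || p) for one decoding step, computed as -log Z.

    p is softmax(logits); q is p restricted to the legal tokens and
    renormalized; Z is the probability mass p assigns to the legal set.
    """
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    z = sum(e for e, ok in zip(exps, mask) if ok) / total
    if z <= 0.0:
        raise ValueError("mask admits no token with nonzero probability")
    return -math.log(z)


def kl_per_step_avg(step_kls):
    """Mean of the per-step values, as reported in results.json."""
    return sum(step_kls) / len(step_kls)
```

A value near zero at a step means the model already placed almost all of its mass on legal tokens; a large value means the mask removed most of what the model wanted to emit.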

Block E — tests

In tests/test_ministruct_mask.py:

  • test_schema_first_token_is_open_brace — at step 0 of JSONSchemaMask, the only legal token is one whose decoded string starts with {.
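  • test_verb_enum_after_verb_key — after emitting {"verb":", a token is legal only if its decoded string keeps the value a prefix of one of the 20 verb-enum strings (a token may run past the closing quote, provided every later character is also legal).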
  • test_tense_enum_after_tense_key — same check for the 5 tense values.
  • test_person_enum_after_person_key — same check for the 3 person values.
  • test_done_after_close_brace — after a complete valid object, is_done() is True.
  • test_reset_between_requests — generate two distinct outputs from the same mask instance after calling reset(). Both should parse.
  • test_no_extra_keys_admitted — the mask should reject any key not in {verb, tense, person, spanish}.
  • test_no_repeat_keys — the mask should reject a key that has already been emitted in the current object.
  • test_spanish_optional — output without the spanish key passes; output with the spanish key passes too.

Constraints

  • Schema parsing only. No jsonschema for the masking logic. The jsonschema library may be used for validation of outputs in Block D (post-hoc, not as part of the mask).
  • One file per concern. Schema constants in schemas.py; DFA in dfa.py; mask in mask.py; tests in tests/test_ministruct_mask.py.
  • Determinism. The mask is a deterministic function of (state, vocab, schema). No randomness in mask construction.

Stop conditions

Done when:

  1. All Block E tests pass.
  2. results.json reports n_parsed_ok == n_samples AND n_schema_valid == n_samples. Hard contract.
  3. parse_failures.md is empty.
  4. README.md includes a brief discussion of average KL per step. Is it small (model already knew the format)? Large (model was being coerced)? What does that imply for fine-tuning in Phase 28?

Pitfalls

  • Whitespace. JSON allows arbitrary whitespace between elements. The easiest approach is to forbid whitespace in your mask (require the canonical minified form). Document this in the README.
  • Escape characters in strings. A " inside a Spanish translation is rare in §A13's scope but possible, and it would have to be escaped. Either disallow escape sequences in spanish (simplest), or implement escape logic carefully.
  • Key ordering. JSON allows any key order; your mask either (a) enforces a canonical order (simpler, fewer states), or (b) allows any order (more states, more bugs). Recommended: enforce canonical order verb → tense → person → spanish?. Document this.
  • Token-trie surprises. A token like "verb": might be a single token in your BPE. The mask must walk all 7 of its characters through the state machine in one step. If your test never sees such a token, mock one to force the code path.
  • spanish optional, but not "skippable mid-stream". If you started emitting "spanish":", you must finish it. The mask must enforce that.
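If you adopt the minified form and the recommended key order, expected strings for tests can be built with the stdlib json module. A minimal sketch follows; the helper name canonical is hypothetical. It relies on two documented behaviours: json.dumps accepts separators=(",", ":") to drop whitespace, and Python dicts preserve insertion order.

```python
import json


def canonical(verb, tense, person, spanish=None):
    """Minified JSON in the order verb, tense, person, then optional spanish."""
    obj = {"verb": verb, "tense": tense, "person": person}
    if spanish is not None:
        obj["spanish"] = spanish
    # ensure_ascii=False keeps characters such as "ó" unescaped.
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
```

Note that json.dumps escapes any " or backslash inside spanish, so a mask that disallows escape sequences will reject such strings. That is consistent with the simplest option above.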

When to consult solutions/

After 100% parse rate is achieved on the probe set. The solution will cross-check your state-machine structure and probably your KL diagnostic interpretation.


Next lab: lab/02-end-to-end-conjugate.md.