Lab 02: Correctness (With-Cache Equals Without-Cache, Byte for Byte)
Goal: prove that `generate(prompt, cache=True)` produces exactly the same tokens as `generate(prompt, cache=False)` for arbitrary prompts and seeds. Subtle cache bugs are silent; only an exact-equality test surfaces them.
Estimated time: 2–4 hours.
Prereq: `lab/01-implement-cache.md` complete. `src/miniinfer/generate.py` from Phase 21 in place.
What you produce
A directory experiments/22-cache-correctness/ containing:
- `property_test.py`: your property test runner.
- `results.json`: pass/fail per prompt, divergence step if any.
- `manifest.json`.
- `README.md`: 2–3 paragraphs. If any test failed, document the bug you found and how you fixed it.
A second directory experiments/22-yesterday-worked/ containing the flagship slot-level dump:
- `dump.py`: script that prefills `"Yesterday I"`, decodes one token, then separately runs a full recompute on `"Yesterday I worked"` (or whichever past-simple form the model emitted), and dumps the K-row and V-row for the position-of-`"I"` slot from both runs.
- `slots.npz`: the dumped K, V rows from both paths.
- `report.md`: assertion that every byte of the cached path's `"I"` row equals every byte of the recomputed path's `"I"` row, for K and for V, for every layer and head.
- `manifest.json`.
The property
For a fixed model (MiniGPT, Phase 17) and a fixed sampling seed, the following must hold:
```python
seed_everything(42)
tokens_cached = generate(prompt, max_new_tokens=64, cache=True)
seed_everything(42)
tokens_uncached = generate(prompt, max_new_tokens=64, cache=False)
assert tokens_cached == tokens_uncached  # byte-identical token sequence
```
This must hold for 50 distinct prompts drawn from a fixed distribution (define it in your `property_test.py`).
Determinism note: `seed_everything` must be re-applied before each path because sampling consumes the RNG, so the second run would otherwise start from whatever state the first run left behind. If `cache=True` calls the model fewer times (it does; that's the whole point), the RNG state diverges unless reseeded. This is the most common source of false-positive "correctness bug" reports; build the test to handle it from the start.
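The effect of the reseed can be seen in a few lines. This is a minimal sketch, not the course implementation: the `seed_everything` below is an assumed stand-in that seeds Python's `random` and NumPy's global generator, and `sample_tokens` is a hypothetical toy sampler that plays the role of `generate`.

```python
import random

import numpy as np


def seed_everything(seed: int) -> None:
    """Assumed stand-in: seed every global RNG the sampler might touch."""
    random.seed(seed)
    np.random.seed(seed)


def sample_tokens(n: int, vocab_size: int = 600) -> list[int]:
    """Hypothetical toy sampler: draws n token ids from NumPy's global RNG."""
    return [int(np.random.randint(vocab_size)) for _ in range(n)]


seed_everything(42)
first = sample_tokens(8)
second_without_reseed = sample_tokens(8)  # RNG state has already advanced

seed_everything(42)
second_with_reseed = sample_tokens(8)  # starts from the same state as `first`

assert first == second_with_reseed
```

Without the second `seed_everything(42)`, the two runs draw different ids even though nothing is wrong with the sampler, which is exactly the false positive described above.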
TODOs
Block A: write the test runner
- Load Phase-17 MiniGPT weights once. The model is trained on the §A13 verb-grammar corpus; its tokens are English (and Spanish) verb forms.
- Sample 50 prompts: pick from the grammar corpus's natural distribution. Suggested mix: (a) 20 prompts of the form `"<time-adverbial> <pronoun>"` (e.g. `"Yesterday I"`, `"Tomorrow he"`, `"Now you"`), (b) 20 prompts of length 4–8 that are valid partial sentences (e.g. `"I am going to"`), (c) 10 longer prompts that mix tenses to stress causal masking. Seed the prompt sampler with a different seed than the generation seed.
- For each prompt:
  - `seed_everything(gen_seed_for_this_prompt)`
  - `t_cached = generate(prompt, max_new_tokens=64, cache=True)`
  - `seed_everything(gen_seed_for_this_prompt)`
  - `t_uncached = generate(prompt, max_new_tokens=64, cache=False)`
  - If `t_cached != t_uncached`: record the first divergence index.
- Tally pass/fail. Write `results.json`.
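The loop above could be sketched as follows. The helper names (`first_divergence`, `run_property_test`, `toy_generate`), the `base_seed` value, and the per-prompt record layout are all assumptions made for illustration; `toy_generate` only stands in for the lab's `generate` so the sketch runs on its own.

```python
import json
import random
from typing import Callable, Optional


def first_divergence(a: list[int], b: list[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one sequence is a strict prefix of the other
    return None


def run_property_test(
    prompts: list[str],
    generate: Callable[..., list[int]],
    seed_everything: Callable[[int], None],
    base_seed: int = 1000,  # hypothetical; any fixed value works
) -> list[dict]:
    results = []
    for idx, prompt in enumerate(prompts):
        gen_seed = base_seed + idx  # deterministic from the prompt index
        seed_everything(gen_seed)
        t_cached = generate(prompt, max_new_tokens=64, cache=True)
        seed_everything(gen_seed)
        t_uncached = generate(prompt, max_new_tokens=64, cache=False)
        step = first_divergence(t_cached, t_uncached)
        results.append({"prompt": prompt, "passed": step is None, "divergence_step": step})
    return results


def toy_generate(prompt: str, max_new_tokens: int, cache: bool) -> list[int]:
    """Stand-in for the lab's generate(): samples ids from the global RNG."""
    return [random.randrange(600) for _ in range(max_new_tokens)]


results = run_property_test(["Yesterday I", "Tomorrow he"], toy_generate, random.seed)
print(json.dumps(results[0]))
```

In the real runner, the list returned by `run_property_test` would be written to `results.json` with `json.dump`. Note that no exception handling wraps the `generate` calls, so a crash in either path stops the run instead of being counted as a skip.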
Block B: interpret failures
If any prompt diverges, the test alone tells you where (token index) but not why. Your job in this block:
- Re-run that prompt with `cache=True` and dump per-layer attention outputs at the divergence step.
- Re-run the same prompt with `cache=False`, same dumps.
- Compare: find the layer (and head?) where they first differ.
- Trace it back to the cache code. Common culprits:
  - Cursor off-by-one (storing current token's K, V before computing attention).
  - Mask shape wrong for `q_len=1` decode (no mask needed, but Phase-15 code might still apply one).
  - Layer index swapped (using `cache.read(layer=0)` everywhere).
  - dtype mismatch (cache stored fp32, but reads cast to fp64 mid-attention).
Document the bug + fix in README.md.
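The comparison step might look like the sketch below. It assumes each dump is a list with one NumPy array per layer, of shape `(H, d_h)`, holding the attention output for the token at the divergence step; that layout and the name `first_differing_layer` are illustrative choices, not something the lab prescribes.

```python
import numpy as np


def first_differing_layer(dump_cached, dump_uncached):
    """Return (layer, head) of the first exact mismatch, or None if all match.

    Assumed layout: one array per layer, shape (H, d_h).
    """
    for layer, (a, b) in enumerate(zip(dump_cached, dump_uncached)):
        if a.shape != b.shape or a.dtype != b.dtype:
            return layer, None  # a shape or dtype mismatch is itself a finding
        if not np.array_equal(a, b):
            # Heads in which at least one component differs.
            bad_heads = np.flatnonzero((a != b).any(axis=-1))
            return layer, int(bad_heads[0])
    return None


# Toy dumps: 3 layers, 4 heads, head dimension 8.
rng = np.random.default_rng(0)
reference = [rng.standard_normal((4, 8)).astype(np.float32) for _ in range(3)]
corrupted = [x.copy() for x in reference]
corrupted[1][2, 5] += 1e-3  # inject a small error in layer 1, head 2

print(first_differing_layer(reference, reference))  # None
print(first_differing_layer(reference, corrupted))  # (1, 2)
```

The check reports a dtype mismatch before comparing values, since the fp32/fp64 culprit listed above can otherwise hide behind values that print identically.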
Block C: extend the test
Once 50 prompts pass:
- Try a longer generation: 256 new tokens. Still byte-identical? (Note: the model trained on a 600-form vocabulary will start cycling / repeating well before 256 tokens; that's fine. The equivalence property is what's tested.)
- Try `batch=4` parallel sequences. Each must independently produce the same tokens with and without the cache. (This catches batch-dim bugs that single-stream tests miss.)
- Try `q_len > 1` (multi-token prefill resume). Edge case: if you ever do "warm-start decode from a long prompt + 5 tokens", does the prefill path use the cache correctly?
Block C-flagship: the "Yesterday I worked" slot-level dump
This is the human-visible artifact tying §A13 to KV-cache mechanics. Produce it in experiments/22-yesterday-worked/:
- Run path A: `prefill("Yesterday I")` populates the cache for slots 0 and 1. Decode one new token; record what it was (likely `"worked"`, `"played"`, etc.). Save `cache.read(layer=ℓ)[..., :2, :]` for every layer.
- Run path B: from scratch, run the model on the full 3-token sequence `"Yesterday I <decoded_token>"`, with no cache, taking the K and V projections at positions 0 and 1.
- Assert: path A's K row for slot 1 == path B's K row for position 1, byte-identical. Same for V. Same for slot 0. Repeat for all layers.
- If any byte differs: that's a positional-encoding leak (RoPE phase wrong in the decode path), or a layer-norm ordering bug, or a cursor off-by-one. The dump localizes the bug to a (layer, slot, head) triple.
- Save K, V dumps to `slots.npz`. Write a 1-page `report.md`.
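The byte-level assertion and the dump could be sketched as below. The key naming (`k_layer0`, `v_layer0`, ...), the helper names, and the toy shapes are assumptions for illustration; the arrays stand in for what `cache.read(layer=ℓ)[..., :2, :]` and the recompute path would produce.

```python
import numpy as np


def byte_identical(a: np.ndarray, b: np.ndarray) -> bool:
    """True only if shape, dtype, and raw bytes all match."""
    return a.shape == b.shape and a.dtype == b.dtype and a.tobytes() == b.tobytes()


def mismatched_keys(cached: dict, recomputed: dict) -> list[str]:
    """Keys whose arrays are not byte-identical between the two paths."""
    return [k for k in sorted(cached) if not byte_identical(cached[k], recomputed[k])]


def save_slots(path, cached: dict, recomputed: dict) -> None:
    """Store both paths in one .npz archive under prefixed keys."""
    arrays = {f"cached_{k}": v for k, v in cached.items()}
    arrays.update({f"recomputed_{k}": v for k, v in recomputed.items()})
    np.savez(path, **arrays)


# Toy data: 2 layers, each array of shape (B=1, H=4, slots=2, d_h=8).
rng = np.random.default_rng(0)
cached = {
    f"{kind}_layer{layer}": rng.standard_normal((1, 4, 2, 8)).astype(np.float32)
    for layer in range(2)
    for kind in ("k", "v")
}
recomputed = {k: v.copy() for k, v in cached.items()}
print(mismatched_keys(cached, recomputed))  # []

# Flip the sign of one component in layer 1, head 3, slot 1 of K.
recomputed["k_layer1"][0, 3, 1, 0] *= -1
print(mismatched_keys(cached, recomputed))  # ['k_layer1']
```

Comparing `tobytes()` is stricter than `==` on values: it also distinguishes `0.0` from `-0.0`, which matches the "every byte" wording of the assertion. A call such as `save_slots("slots.npz", cached, recomputed)` would then write the archive.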
Block D: manifest
```json
{
  "experiment": "22-cache-correctness",
  "date": "YYYY-MM-DD",
  "seed_prompt_sampler": 1,
  "seed_generation_per_prompt": "deterministic_from_prompt_idx",
  "versions": {"python": "3.11.x", "numpy": "X.Y.Z"},
  "config": {
    "model": "miniGPT-phase17",
    "n_prompts": 50,
    "prompt_len_range": [8, 32],
    "max_new_tokens": 64,
    "batch_size": 1
  },
  "results_summary": {
    "passed": null,
    "failed": null,
    "first_divergence_step_min": null,
    "first_divergence_step_max": null
  }
}
```
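One way to fill in `results_summary` from the per-prompt results is sketched below. The record layout (a `passed` flag and a `divergence_step` that is `None` on success) is an assumption; adapt it to whatever your `results.json` actually stores.

```python
import json


def summarize(results: list[dict]) -> dict:
    """Build the results_summary block from per-prompt records."""
    steps = [r["divergence_step"] for r in results if r["divergence_step"] is not None]
    return {
        "passed": sum(1 for r in results if r["passed"]),
        "failed": sum(1 for r in results if not r["passed"]),
        "first_divergence_step_min": min(steps) if steps else None,
        "first_divergence_step_max": max(steps) if steps else None,
    }


# Hypothetical records, for illustration only.
example_results = [
    {"passed": True, "divergence_step": None},
    {"passed": False, "divergence_step": 12},
    {"passed": False, "divergence_step": 3},
]
summary = summarize(example_results)
print(json.dumps(summary, indent=2))
```

When every prompt passes, both divergence fields stay `None`, which serializes to `null` and matches the template above.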
Constraints
- No fuzz. Tests must be deterministic. Same seed → same prompts → same outputs.
- No `try/except` to "skip failures". Every divergence is a bug. Surface them all.
- Reset the RNG between paths. As noted above.
Stop conditions
Done when:
- 50/50 prompts pass byte-identically over 64 new tokens, single-stream.
- The extended tests (256 tokens, batch=4) pass.
- `manifest.json` committed with `passed: 50, failed: 0`.
- `README.md` documents either "no bugs found" or the bug + fix.
If any test still fails after 4 hours of debugging, write up the symptom and stop for /phase-checkpoint — don't grind.
Pitfalls (read before debugging)
- "Off by one in token 1." Almost always: storing K, V for the current token before computing attention, so it attends to itself with full strength. Append K, V after the attention computation (or use a mask that excludes the current row — but then the cache has dead bytes; just append after).
- "Off by hundreds in token 30." Slow drift: numerical error accumulating from non-associative floating-point arithmetic. Differences up to ~1e-6 are acceptable in isolation, but they eventually cause a divergence when the drift tips a sampling decision from one token to another. Either: (a) match the exact operation order in the cached and uncached paths, or (b) accept divergence at long horizons and document the bound.
- "Diverges only with batch>1." Each layer's `cache.read()` has shape `(B, H, S, d_h)`. Broadcasting in the matmul is unforgiving; double-check axes.
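The non-associativity behind the slow-drift pitfall is easy to reproduce in isolation. The sketch below uses plain Python doubles rather than the model's tensors; it only illustrates that regrouping a sum can change the result, which is what happens when the cached and uncached paths accumulate in a different order.

```python
# Same three numbers, two groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6

# The gap is tiny, but an exact-equality test sees it.
gap = abs(left - right)
assert 0.0 < gap < 1e-15
```

A gap this small is harmless on its own; it matters only once it moves a logit across the point where sampling picks a different token, which is why option (a) above asks for an identical operation order in both paths.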
When to consult solutions/
After 50/50 pass. The reference at solutions/02-correctness-test-ref.md documents the canonical bugs encountered during reference-implementation development.
Next lab: lab/03-cost-curves.md.