English · Español

Lab 02 — Correctness: With-Cache Equals Without-Cache, Byte for Byte¶

Goal: prove that generate(prompt, cache=True) produces the exact same tokens as generate(prompt, cache=False) for arbitrary prompts and seeds. Subtle cache bugs are silent; only an exact-equality test surfaces them.

Estimated time: 2–4 hours.

Prereq: lab/01-implement-cache.md complete. src/miniinfer/generate.py from Phase 21 in place.

What you produce¶

A directory experiments/22-cache-correctness/ containing:

property_test.py — your property test runner.
results.json — pass/fail per prompt, divergence step if any.
manifest.json.
README.md — 2–3 paragraphs. If any test failed, document the bug you found and how you fixed it.

A second directory experiments/22-yesterday-worked/ containing the flagship slot-level dump:

dump.py — script that prefills "Yesterday I", decodes one token, then separately runs a full recompute on "Yesterday I worked" (or whichever past-simple form the model emitted), and dumps the K-row and V-row for the position-of-"I" slot from both runs.
slots.npz — the dumped K, V rows from both paths.
report.md — assertion: every byte of the cached path's "I" row equals every byte of the recomputed path's "I" row, for K and for V, for every layer and head.
manifest.json.

The property¶

For a fixed model (MiniGPT, Phase 17) and a fixed sampling seed, the following must hold:

seed_everything(42)
tokens_cached = generate(prompt, max_new_tokens=64, cache=True)

seed_everything(42)
tokens_uncached = generate(prompt, max_new_tokens=64, cache=False)

assert tokens_cached == tokens_uncached  # byte-identical token sequence

For 50 distinct prompts drawn from a fixed distribution (define in your property_test.py).

Determinism note: seed_everything must be re-applied before each path because sampling consumes the RNG. If cache=True calls the model fewer times (it does — that's the whole point), the RNG state diverges unless reseeded. This is the most common source of false-positive "correctness bug" reports; build the test to handle it from the start.

TODOs¶

Block A — write the test runner¶

Load Phase-17 MiniGPT weights once. The model is trained on the §A13 verb-grammar corpus; its tokens are English (and Spanish) verb forms.
Sample 50 prompts: pick from the grammar corpus's natural distribution. Suggested mix: (a) 20 prompts of the form "<time-adverbial> <pronoun>" (e.g. "Yesterday I", "Tomorrow he", "Now you"), (b) 20 prompts of length 4–8 that are valid partial sentences (e.g. "I am going to"), © 10 longer prompts that mix tenses to stress causal masking. Seed the prompt sampler — different seed than the generation seed.
For each prompt:
seed_everything(gen_seed_for_this_prompt)
t_cached = generate(prompt, max_new_tokens=64, cache=True)
seed_everything(gen_seed_for_this_prompt)
t_uncached = generate(prompt, max_new_tokens=64, cache=False)
If t_cached != t_uncached: record the first divergence index.
Tally pass/fail. Write results.json.

Block B — interpret failures¶

If any prompt diverges, the test alone tells you where (token index) but not why. Your job in this block:

Re-run that prompt with cache=True and dump per-layer attention outputs at the divergence step.
Re-run the same prompt with cache=False, same dumps.
Compare: find the layer (and head?) where they first differ.
Trace it back to the cache code. Common culprits:
Cursor off-by-one (storing current token's K, V before computing attention).
Mask shape wrong for q_len=1 decode (no mask needed, but Phase-15 code might still apply one).
Layer index swapped (using cache.read(layer=0) everywhere).
dtype mismatch (cache stored fp32, but reads cast to fp64 mid-attention).

Document the bug + fix in README.md.

Block C — extend the test¶

Once 50 prompts pass:

Try a longer generation: 256 new tokens. Still byte-identical? (Note: the model trained on a 600-form vocabulary will start cycling / repeating well before 256 tokens — that's fine. The equivalence property is what's tested.)
Try batch=4 parallel sequences. Each must independently produce the same with/without cache. (This catches batch-dim bugs that single-stream tests miss.)
Try q_len > 1 (multi-token prefill resume). Edge case: if you ever do "warm-start decode from a long prompt + 5 tokens", does the prefill path use the cache correctly?

Block C-flagship — the "Yesterday I worked" slot-level dump¶

This is the human-visible artifact tying §A13 to KV-cache mechanics. Produce it in experiments/22-yesterday-worked/:

Run path A: prefill("Yesterday I") populates the cache for slots 0 and 1. Decode one new token; record what it was (likely "worked" / "played" / etc.). Save cache.read(layer=ℓ)[..., :2, :] for every layer.
Run path B: from scratch, run the model on the full 3-token sequence "Yesterday I <decoded_token>", with no cache, taking the K and V projections at positions 0 and 1.
Assert: path A's K row for slot 1 == path B's K row for position 1, byte-identical. Same for V. Same for slot 0. Repeat for all layers.
If any byte differs: that's a positional-encoding leak (RoPE phase wrong in the decode path), or a layer-norm ordering bug, or a cursor off-by-one. The dump localizes the bug to a (layer, slot, head) triple.
Save K, V dumps to slots.npz. Write a 1-page report.md.

Block D — manifest¶

{
  "experiment": "22-cache-correctness",
  "date": "YYYY-MM-DD",
  "seed_prompt_sampler": 1,
  "seed_generation_per_prompt": "deterministic_from_prompt_idx",
  "versions": {"python": "3.11.x", "numpy": "X.Y.Z"},
  "config": {
    "model": "miniGPT-phase17",
    "n_prompts": 50,
    "prompt_len_range": [8, 32],
    "max_new_tokens": 64,
    "batch_size": 1
  },
  "results_summary": {
    "passed": null,
    "failed": null,
    "first_divergence_step_min": null,
    "first_divergence_step_max": null
  }
}

Constraints¶

No fuzz. Tests must be deterministic. Same seed → same prompts → same outputs.
No try/except to "skip failures". Every divergence is a bug. Surface them all.
Reset the RNG between paths. As noted above.

Stop conditions¶

Done when:

50/50 prompts pass byte-identically over 64 new tokens, single-stream.
The extended tests (256 tokens, batch=4) pass.
manifest.json committed with passed: 50, failed: 0.
README.md documents either "no bugs found" or the bug + fix.

If any test still fails after 4 hours of debugging, write up the symptom and stop for /phase-checkpoint — don't grind.

Pitfalls (read before debugging)¶

"Off by one in token 1." Almost always: storing K, V for the current token before computing attention, so it attends to itself with full strength. Append K, V after the attention computation (or use a mask that excludes the current row — but then the cache has dead bytes; just append after).
"Off by hundreds in token 30." Slow drift — accumulating numerical error from non-associative fp arithmetic. Acceptable up to ~1e-6, but causes a divergence eventually when sampling crosses a token boundary. Either: (a) match the exact operation order in cached vs uncached paths, or (b) accept divergence at long horizons and document the bound.
"Diverges only with batch>1." Layer's cache.read() shape is (B, H, S, d_h). Broadcasting in the matmul is unforgiving; double-check axes.

When to consult `solutions/`¶

After 50/50 pass. The reference at solutions/02-correctness-test-ref.md documents the canonical bugs encountered during reference-implementation development.

Next lab: lab/03-cost-curves.md.