English · Español
Break — KV cache with off-by-one in the position index; silent corruption¶
🇪🇸 El bug más sutil del KV cache: escribir K, V de la posición
ten el slott+1. La pérdida no explota — solo crece silenciosamente con la longitud de la secuencia. Lo causamos a propósito y mostramos el patrón que distingue "modelo subentrenado" de "cache corrupto".
Symptom Borja will see¶
Two runs with the same model checkpoint, same prompts, same sampling settings:
- Run A (control): cache writes K, V for the newly-computed token at position \(t\).
- Run B (break): cache writes K, V for the newly-computed token at position \(t + 1\).
Both run cleanly — no exceptions, no NaN, no obvious crashes. The dashboards look fine for the first few generations. But when you score on probes of varying length:
| Probe length | Run A CCR | Run B CCR |
|---|---|---|
| 5 tokens | 89% | 89% |
| 10 tokens | 87% | 82% |
| 20 tokens | 85% | 71% |
| 40 tokens | 83% | 52% |
| 80 tokens | 81% | 34% |
The pattern: for short sequences the bug is invisible; for long sequences accuracy collapses. By 80 tokens, the bugged run is barely better than chance.
If you naively measure CCR at the §A13-default probe length of 8 tokens, both runs report 88%. The bug escapes the harness.
The break, mechanically¶
In src/miniinfer/kv_cache.py:
# Run A (control)
def append(self, layer_idx: int, k: Tensor, v: Tensor) -> None:
pos = self.current_pos[layer_idx]
self.k_cache[layer_idx, :, pos, :] = k
self.v_cache[layer_idx, :, pos, :] = v
self.current_pos[layer_idx] += 1
# Run B (break) — one line
def append(self, layer_idx: int, k: Tensor, v: Tensor) -> None:
pos = self.current_pos[layer_idx]
self.k_cache[layer_idx, :, pos + 1, :] = k # <-- off-by-one
self.v_cache[layer_idx, :, pos + 1, :] = v # <-- off-by-one
self.current_pos[layer_idx] += 1
The whole break is +1 in two places. The code still type-checks, still runs, still terminates.
Why this teaches the concept¶
When you generate token \(t+1\):
- The cache holds K, V for positions \(0, 1, ..., t-1\) (from the prefill phase + prior decode steps).
- The model computes K, V for the new position \(t\) — based on the most recently emitted token.
- The new K, V get stored in slot \(t\) of the cache.
- Attention then computes \(Q_t \cdot K_{0:t+1}\) — i.e., the new query attends to all positions up through \(t\).
With the off-by-one:
- Step 3 stores K, V in slot \(t+1\) instead of \(t\).
- Slot \(t\) is left with stale data (whatever was there before — usually zeros at initialization, but possibly leftover from a previous run).
- Step 4's attention computes \(Q_t \cdot K_{0:t+1}\) where the entry for position \(t\) is stale.
- The model's effective attention sees a gap in its history at position \(t\).
For short sequences, the gap is tolerable — the model still has 4-7 positions of correct history. The next-token prediction is plausible.
For long sequences, the gaps accumulate: position \(t-1\) is stale, position \(t-2\) is stale (from the previous generation step), and so on. By position 80, every historical K, V slot is stale. The model is effectively generating with random history.
The corruption is silent because attention is robust to small history corruptions — the model averages over the remaining positions. But the corruption is cumulative — every decode step contributes one more stale slot.
The diagnostic ladder Borja should walk¶
- First check: the prefill is fine — accuracy is the same as no-cache for short sequences. So the prefill code path is correct. The bug is in the decode path.
- Second check: print the cache state after generating 5 tokens. Compare to the no-cache baseline (rerun the same prompt without the cache, capture all K, V tensors at each step). They should be identical; they aren't. Specifically, the cache has every other position stale.
- Third check: the
current_posvariable. Print it. After 5 generations, it reads 5. The slots written are... slots 1, 2, 3, 4, 5. Slot 0 (the prefill's last position) is correct, but the first decode slot is wrong. Now find slot 0's content — it should match the prefill's last token's K, V. It does. Slot 1's content matches the first decode token's K, V — but slot 1 should hold the prefill's second-to-last position. The shift is one off. - Diagnosis: off-by-one. The
+1in the cache.append call has shifted every decode write by one slot.
Reproducer¶
# Control
just phase-22-eval cache=correct probe_length=80
# Break
just phase-22-eval cache=broken probe_length=80
# Compare
just phase-22-eval-compare experiments/22-control experiments/22-break
The compare script prints a length-vs-CCR plot. The two curves diverge at length ~10 and the broken curve collapses by length 40.
Hint cascade¶
- (Mild) "Run the eval at multiple probe lengths. Plot CCR vs length. What's the pattern?"
- (Medium) "The bug is in the decode path, not prefill. What does the cache code do during a single decode step?"
- (Direct) "Print the K cache slot indices used during the first 5 decode steps. They should be
T_prefill, T_prefill+1, .... What do you see?"
Fix¶
Change pos + 1 to pos in kv_cache.append(). Re-run, confirm CCR-vs-length curve matches the control.
What makes this break educational¶
This is the paradigmatic silent-corruption bug. The compiler doesn't catch it (the types are fine). The unit tests pass (most unit tests check 5-token sequences). The dashboard doesn't scream (no NaN, no spike). It only shows up under a specific evaluation regime (long sequences).
Production lesson: when you implement a KV cache, your test suite must include a long-sequence equivalence test: generate \(N = 100\) tokens with and without the cache, assert the outputs are identical. This single test catches the off-by-one, the dtype mismatch, and the batch-dim flip. Phase 22's lab/02-correctness-test.md makes this the centerpiece.
What this break is NOT¶
- Not a model bug.
- Not a numerical-precision bug.
- Not a tokenizer bug.
It is an indexing bug in stateful inference. The category that costs the most engineer-hours per bug because the symptoms are silent and the discovery requires the right eval regime.
Cross-refs¶
theory/05-mini-gpt-memory-worked-example.md— the math being indexed.lab/02-correctness-test.md— the equivalence test that catches this in production.theory/01-prefill-vs-decode.md— the prefill/decode boundary that the bug crosses.