Skip to content

English · Español

Break — KV cache with off-by-one in the position index; silent corruption

🇪🇸 El bug más sutil del KV cache: escribir K, V de la posición t en el slot t+1. La pérdida no explota — solo crece silenciosamente con la longitud de la secuencia. Lo causamos a propósito y mostramos el patrón que distingue "modelo subentrenado" de "cache corrupto".


Symptom Borja will see

Two runs with the same model checkpoint, same prompts, same sampling settings:

  • Run A (control): cache writes K, V for the newly-computed token at position \(t\).
  • Run B (break): cache writes K, V for the newly-computed token at position \(t + 1\).

Both run cleanly — no exceptions, no NaN, no obvious crashes. The dashboards look fine for the first few generations. But when you score on probes of varying length:

Probe length Run A CCR Run B CCR
5 tokens 89% 89%
10 tokens 87% 82%
20 tokens 85% 71%
40 tokens 83% 52%
80 tokens 81% 34%

The pattern: for short sequences the bug is invisible; for long sequences accuracy collapses. By 80 tokens, the bugged run is barely better than chance.

If you naively measure CCR at the §A13-default probe length of 8 tokens, both runs report 88%. The bug escapes the harness.

The break, mechanically

In src/miniinfer/kv_cache.py:

# Run A (control)
def append(self, layer_idx: int, k: Tensor, v: Tensor) -> None:
    pos = self.current_pos[layer_idx]
    self.k_cache[layer_idx, :, pos, :] = k
    self.v_cache[layer_idx, :, pos, :] = v
    self.current_pos[layer_idx] += 1

# Run B (break) — one line
def append(self, layer_idx: int, k: Tensor, v: Tensor) -> None:
    pos = self.current_pos[layer_idx]
    self.k_cache[layer_idx, :, pos + 1, :] = k   # <-- off-by-one
    self.v_cache[layer_idx, :, pos + 1, :] = v   # <-- off-by-one
    self.current_pos[layer_idx] += 1

The whole break is +1 in two places. The code still type-checks, still runs, still terminates.

Why this teaches the concept

When you generate token \(t+1\):

  1. The cache holds K, V for positions \(0, 1, ..., t-1\) (from the prefill phase + prior decode steps).
  2. The model computes K, V for the new position \(t\) — based on the most recently emitted token.
  3. The new K, V get stored in slot \(t\) of the cache.
  4. Attention then computes \(Q_t \cdot K_{0:t+1}\) — i.e., the new query attends to all positions up through \(t\).

With the off-by-one:

  • Step 3 stores K, V in slot \(t+1\) instead of \(t\).
  • Slot \(t\) is left with stale data (whatever was there before — usually zeros at initialization, but possibly leftover from a previous run).
  • Step 4's attention computes \(Q_t \cdot K_{0:t+1}\) where the entry for position \(t\) is stale.
  • The model's effective attention sees a gap in its history at position \(t\).

For short sequences, the gap is tolerable — the model still has 4-7 positions of correct history. The next-token prediction is plausible.

For long sequences, the gaps accumulate: position \(t-1\) is stale, position \(t-2\) is stale (from the previous generation step), and so on. By position 80, every historical K, V slot is stale. The model is effectively generating with random history.

The corruption is silent because attention is robust to small history corruptions — the model averages over the remaining positions. But the corruption is cumulative — every decode step contributes one more stale slot.

The diagnostic ladder Borja should walk

  1. First check: the prefill is fine — accuracy is the same as no-cache for short sequences. So the prefill code path is correct. The bug is in the decode path.
  2. Second check: print the cache state after generating 5 tokens. Compare to the no-cache baseline (rerun the same prompt without the cache, capture all K, V tensors at each step). They should be identical; they aren't. Specifically, the cache has every other position stale.
  3. Third check: the current_pos variable. Print it. After 5 generations, it reads 5. The slots written are... slots 1, 2, 3, 4, 5. Slot 0 (the prefill's last position) is correct, but the first decode slot is wrong. Now find slot 0's content — it should match the prefill's last token's K, V. It does. Slot 1's content matches the first decode token's K, V — but slot 1 should hold the prefill's second-to-last position. The shift is one off.
  4. Diagnosis: off-by-one. The +1 in the cache.append call has shifted every decode write by one slot.

Reproducer

# Control
just phase-22-eval cache=correct probe_length=80

# Break
just phase-22-eval cache=broken probe_length=80

# Compare
just phase-22-eval-compare experiments/22-control experiments/22-break

The compare script prints a length-vs-CCR plot. The two curves diverge at length ~10 and the broken curve collapses by length 40.

Hint cascade

  1. (Mild) "Run the eval at multiple probe lengths. Plot CCR vs length. What's the pattern?"
  2. (Medium) "The bug is in the decode path, not prefill. What does the cache code do during a single decode step?"
  3. (Direct) "Print the K cache slot indices used during the first 5 decode steps. They should be T_prefill, T_prefill+1, .... What do you see?"

Fix

Change pos + 1 to pos in kv_cache.append(). Re-run, confirm CCR-vs-length curve matches the control.

What makes this break educational

This is the paradigmatic silent-corruption bug. The compiler doesn't catch it (the types are fine). The unit tests pass (most unit tests check 5-token sequences). The dashboard doesn't scream (no NaN, no spike). It only shows up under a specific evaluation regime (long sequences).

Production lesson: when you implement a KV cache, your test suite must include a long-sequence equivalence test: generate \(N = 100\) tokens with and without the cache, assert the outputs are identical. This single test catches the off-by-one, the dtype mismatch, and the batch-dim flip. Phase 22's lab/02-correctness-test.md makes this the centerpiece.

What this break is NOT

  • Not a model bug.
  • Not a numerical-precision bug.
  • Not a tokenizer bug.

It is an indexing bug in stateful inference. The category that costs the most engineer-hours per bug because the symptoms are silent and the discovery requires the right eval regime.

Cross-refs

  • theory/05-mini-gpt-memory-worked-example.md — the math being indexed.
  • lab/02-correctness-test.md — the equivalence test that catches this in production.
  • theory/01-prefill-vs-decode.md — the prefill/decode boundary that the bug crosses.