Phase 22 — Quizzes (mirror)
The canonical questions live in data/quizzes/phase-22-kv-cache.yaml.
q-22-01 — Memory math (mini-GPT)
Prompt (EN): For the Phase-17 mini-GPT (\(L=2, H=4, d_h=16\)) in fp16, what is the KV-cache size for \(B=4, S=64\)?
- A. 8 KiB
- B. 32 KiB
- C. 128 KiB
- D. 512 KiB
Correct: C. With \(s = 2\) bytes per fp16 element and the leading factor of 2 covering both K and V, \(2 \cdot L \cdot H \cdot d_h \cdot S \cdot B \cdot s = 2 \cdot 2 \cdot 4 \cdot 16 \cdot 64 \cdot 4 \cdot 2 = 131{,}072\) bytes = 128 KiB.
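The arithmetic above can be checked with a small helper. This is a minimal sketch: the function name `kv_cache_bytes` and its parameter names are illustrative and not taken from the course code.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_elem):
    """Total bytes held by the K and V tensors across all layers."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem


# Phase-17 mini-GPT values from the prompt: L=2, H=4, d_h=16, fp16 (2 bytes).
size = kv_cache_bytes(n_layers=2, n_kv_heads=4, d_head=16,
                      seq_len=64, batch=4, bytes_per_elem=2)
print(size, "bytes =", size // 1024, "KiB")  # 131072 bytes = 128 KiB
```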
q-22-02 — Off-by-one detection
Prompt (EN): A KV-cache implementation passes all tests at sequence length 8 but accuracy collapses on length-80 probes. What is the most likely bug?
- A. fp16 overflow.
- B. Off-by-one in position index (or similar silent stateful indexing bug).
- C. Wrong attention head count.
- D. Model checkpoint loaded with mismatched architecture.
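Correct: B. The pattern "fine at short lengths, collapses at long lengths" is the fingerprint of an off-by-one or similar stateful indexing bug: the corruption accumulates with sequence length, so short sequences hide it. A wrong head count or a mismatched checkpoint would typically break short sequences too.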
q-22-03 — Prefill vs decode distinction
Prompt (EN): In one or two sentences, explain the difference between prefill and decode and why the KV cache helps the decode phase but not the prefill phase.
Free response. Expected mentions: prefill processes the whole prompt in parallel (all positions at once, so there is no previously computed K, V to reuse); decode generates one token at a time and reuses the cached K, V of all earlier positions instead of recomputing them.
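The distinction can be made concrete by counting how many token positions need their K, V projections computed. This is a rough sketch under simplifying assumptions: it counts positions only, ignores the attention computation itself, and assumes at least one generated token. Both function names are hypothetical.

```python
def positions_without_cache(prompt_len, new_tokens):
    """Positions projected when every step re-runs the full sequence so far."""
    # Step i (0-based) runs a forward pass over the prompt plus i generated tokens.
    return sum(prompt_len + i for i in range(new_tokens))


def positions_with_cache(prompt_len, new_tokens):
    """Positions projected when past K, V are kept in a cache."""
    # Prefill projects the whole prompt once and yields the first new token;
    # each later decode step projects only the single newest token.
    return prompt_len + (new_tokens - 1)


print(positions_without_cache(64, 16))  # 1144
print(positions_with_cache(64, 16))     # 79
```

In both cases the prefill term (`prompt_len`) is paid in full, which is why the cache saves nothing there; the saving comes entirely from the decode steps.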
q-22-04 — Cache size scaling levers
Prompt (EN): Select every change that reduces KV-cache memory.
- A. Switching from fp32 to fp16.
- B. Reducing batch size \(B\).
- C. Using Grouped-Query Attention (GQA) with fewer KV heads than Q heads.
- D. Increasing \(d_\text{ff}\) (the FFN inner dimension).
Correct: A, B, C. The FFN inner dimension does not appear in the KV-cache formula; changing it doesn't affect cache size.
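Each lever can be tried against the same formula. The sketch below assumes an fp32 baseline with the Phase-17 mini-GPT shape at \(B=4, S=64\); the baseline, the GQA setting of one KV head, and the function name are illustrative choices, not values from the quiz data. Note that \(d_\text{ff}\) is not a parameter at all.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_elem):
    """Total bytes for K and V; only KV heads count, not query heads."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem


# Assumed baseline: fp32 (4 bytes), 4 KV heads, batch 4, sequence length 64.
base = dict(n_layers=2, n_kv_heads=4, d_head=16, seq_len=64, batch=4, bytes_per_elem=4)

baseline = kv_cache_bytes(**base)
fp16 = kv_cache_bytes(**{**base, "bytes_per_elem": 2})   # lever A
small_batch = kv_cache_bytes(**{**base, "batch": 2})     # lever B
gqa = kv_cache_bytes(**{**base, "n_kv_heads": 1})        # lever C, 4 Q heads share 1 KV head

print(baseline, fp16, small_batch, gqa)  # 262144 131072 131072 65536
```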
q-22-05 — Per-token marginal cost
Prompt (EN): For the Phase-17 mini-GPT in fp16, what is the per-generated-token marginal increase in cache memory (for \(B = 1\))?
- A. 64 bytes
- B. 256 bytes
- C. 512 bytes
- D. 1 KiB
Correct: C. \(2 \cdot L \cdot H \cdot d_h \cdot s = 2 \cdot 2 \cdot 4 \cdot 16 \cdot 2 = 512\) bytes per token (with \(B = 1\)).
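The marginal cost is the cache formula with the sequence length factored out, so total cache size grows linearly in the number of cached tokens. A minimal sketch, with a hypothetical function name:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_elem, batch=1):
    """Bytes added to the KV cache for each additional token position."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem * batch


per_token = kv_bytes_per_token(n_layers=2, n_kv_heads=4, d_head=16, bytes_per_elem=2)
print(per_token)               # 512
print(per_token * 64 // 1024)  # 32 (KiB after 64 cached tokens at B = 1)
```

Multiplying the 32 KiB figure by \(B = 4\) recovers the 128 KiB answer from q-22-01.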