
Phase 22 — Quizzes (mirror)

The canonical questions live in data/quizzes/phase-22-kv-cache.yaml.

q-22-01 — Memory math (mini-GPT)

Prompt (EN): For the Phase-17 mini-GPT (\(L=2, H=4, d_h=16\)) in fp16, what is the KV-cache size for \(B=4, S=64\)?

A. 8 KiB
B. 32 KiB
C. 128 KiB
D. 512 KiB

Correct: C. \(2 \cdot L \cdot H \cdot d_h \cdot S \cdot B \cdot s = 2 \cdot 2 \cdot 4 \cdot 16 \cdot 64 \cdot 4 \cdot 2 = 131{,}072\) bytes = 128 KiB. The leading 2 counts the K and V tensors, and \(s = 2\) is the number of bytes per fp16 element.
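The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `kv_cache_bytes` and its parameter names are illustrative choices, not identifiers from the course code.

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, batch, bytes_per_elem):
    """KV-cache size in bytes: 2 (K and V) * L * H * d_h * S * B * s."""
    return 2 * n_layers * n_heads * d_head * seq_len * batch * bytes_per_elem


# Phase-17 mini-GPT values from the prompt: L=2, H=4, d_h=16, fp16, B=4, S=64.
size = kv_cache_bytes(n_layers=2, n_heads=4, d_head=16,
                      seq_len=64, batch=4, bytes_per_elem=2)
print(size)          # 131072
print(size // 1024)  # 128 (KiB)
```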

q-22-02 — Off-by-one detection

Prompt (EN): A KV-cache implementation passes all tests at sequence length 8 but accuracy collapses on length-80 probes. What is the most likely bug?

A. fp16 overflow.
B. Off-by-one in position index (or similar silent stateful indexing bug).
C. Wrong attention head count.
D. Model checkpoint loaded with mismatched architecture.

Correct: B. "Fine at short length, collapses at long length" is the fingerprint of an off-by-one or similar stateful indexing bug: the corruption accumulates with sequence length, so short sequences hide it.
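One way to see why short tests hide this kind of bug is to compare a cached decode against full recomputation at several lengths. The sketch below is a toy under stated assumptions, not the course implementation: "attention" uses uniform weights, so the output is just the mean of the cached values. `ToyCache`, `drift`, the `buggy` flag, and `LOOSE_TOLERANCE` are all hypothetical names. The injected bug overwrites the newest slot instead of appending.

```python
def reference_output(values):
    """Full recomputation: uniform attention over every position."""
    return sum(values) / len(values)


class ToyCache:
    """Preallocated cache with an explicit write index (hypothetical toy)."""

    def __init__(self, capacity, buggy=False):
        self.slots = [0.0] * capacity
        self.length = 0
        self.buggy = buggy

    def prefill(self, values):
        for v in values:
            self.slots[self.length] = v
            self.length += 1

    def decode_step(self, value):
        if self.buggy:
            # Deliberate off-by-one: overwrite the newest slot, never grow.
            self.slots[self.length - 1] = value
        else:
            self.slots[self.length] = value
            self.length += 1
        live = self.slots[: self.length]
        return sum(live) / len(live)


def drift(total_len, prompt_len=4, buggy=False):
    """Absolute gap between the cached and recomputed final output."""
    values = [float(i) for i in range(total_len)]
    cache = ToyCache(capacity=total_len, buggy=buggy)
    cache.prefill(values[:prompt_len])
    out = None
    for v in values[prompt_len:]:
        out = cache.decode_step(v)
    return abs(out - reference_output(values))


LOOSE_TOLERANCE = 2.0  # hypothetical threshold for a loose accuracy-style check

for n in (8, 80):
    d = drift(n, buggy=True)
    print(n, d, "pass" if d < LOOSE_TOLERANCE else "FAIL")
# 8 1.0 pass
# 80 19.0 FAIL
```

In this toy, the correct cache has zero drift at both lengths. An exact parity check would flag the bug even at length 8, but a loose tolerance lets it through until the sequence grows.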

q-22-03 — Prefill vs decode distinction

Prompt (EN): In one or two sentences, explain the difference between prefill and decode and why the KV cache helps the decode phase but not the prefill phase.

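Free response. Expected mentions: prefill processes the whole prompt in parallel, computing K and V for every position in one pass, so there is nothing cached yet to reuse; decode generates one token at a time and reuses the cached K, V of all earlier positions instead of recomputing them.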
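The saving can be made concrete by counting how many positions need their K and V projected, with and without a cache. This is a rough sketch under simplifying assumptions: it counts per-position K/V projections only (per layer and head), ignores the cost of the attention scores themselves, and assumes at least one generated token. The function names are illustrative.

```python
def kv_projections_without_cache(prompt_len, new_tokens):
    """Every generation step re-projects K and V for the whole sequence so far."""
    return sum(prompt_len + step for step in range(new_tokens))


def kv_projections_with_cache(prompt_len, new_tokens):
    """Prefill projects the prompt once; each later step projects one new token."""
    prefill = prompt_len
    decode = new_tokens - 1  # the first new token comes out of the prefill pass
    return prefill + decode


# A 64-token prompt followed by 16 generated tokens.
print(kv_projections_without_cache(64, 16))  # 1144
print(kv_projections_with_cache(64, 16))     # 79
```

The prefill term (64) is the same in both counts. All of the difference comes from the decode steps, which is where the cache is reused.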

q-22-04 — Cache size scaling levers

Prompt (EN): Select every change that reduces KV-cache memory.

A. Switching from fp32 to fp16.
B. Reducing batch size \(B\).
C. Using Grouped-Query Attention (GQA) with fewer KV heads than Q heads.
D. Increasing \(d_\text{ff}\) (the FFN inner dimension).

Correct: A, B, C. The FFN inner dimension does not appear in the KV-cache formula; changing it doesn't affect cache size.
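Each option can be checked against the formula by changing one field at a time. In this sketch the config keys (`n_kv_heads`, `d_ff`, and so on) are illustrative names, the baseline is the mini-GPT shape stored in fp32, and the `d_ff` values are hypothetical placeholders; the formula never reads that key.

```python
def kv_cache_bytes(cfg):
    """2 (K and V) * L * KV heads * d_h * S * B * bytes per element."""
    return (2 * cfg["n_layers"] * cfg["n_kv_heads"] * cfg["d_head"]
            * cfg["seq_len"] * cfg["batch"] * cfg["bytes_per_elem"])


baseline = {
    "n_layers": 2, "n_kv_heads": 4, "d_head": 16,
    "seq_len": 64, "batch": 4,
    "bytes_per_elem": 4,  # fp32
    "d_ff": 256,          # hypothetical FFN width, unused by the formula
}

variants = {
    "A: fp32 -> fp16": {**baseline, "bytes_per_elem": 2},
    "B: batch 4 -> 2": {**baseline, "batch": 2},
    "C: GQA, 4 -> 2 KV heads": {**baseline, "n_kv_heads": 2},
    "D: d_ff 256 -> 1024": {**baseline, "d_ff": 1024},
}

print("baseline", kv_cache_bytes(baseline))  # baseline 262144
for name, cfg in variants.items():
    print(name, kv_cache_bytes(cfg))
# A, B and C each print 131072; D prints 262144 (unchanged)
```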

q-22-05 — Per-token marginal cost

Prompt (EN): For the Phase-17 mini-GPT in fp16, what is the per-generated-token marginal increase in cache memory (for \(B = 1\))?

A. 64 bytes
B. 256 bytes
C. 512 bytes
D. 1 KiB

Correct: C. \(2 \cdot L \cdot H \cdot d_h \cdot s = 2 \cdot 2 \cdot 4 \cdot 16 \cdot 2 = 512\) bytes per token (with \(B = 1\)).
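Because the per-token cost is constant, the cache grows linearly with the number of tokens held. A small sketch, with illustrative function and parameter names:

```python
def kv_bytes_per_token(n_layers, n_heads, d_head, bytes_per_elem, batch=1):
    """Cache bytes added per sequence position: 2 (K and V) * L * H * d_h * s * B."""
    return 2 * n_layers * n_heads * d_head * bytes_per_elem * batch


per_token = kv_bytes_per_token(n_layers=2, n_heads=4, d_head=16, bytes_per_elem=2)
print(per_token)  # 512

# Total cache size after a given number of tokens (prompt plus generated), B = 1.
for tokens in (8, 64, 80):
    print(tokens, per_token * tokens)
# 8 4096
# 64 32768
# 80 40960
```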