Phase 23 — Quizzes (mirror)
The canonical questions live in
data/quizzes/phase-23-gpu-fundamentals.yaml.
q-23-01 — Machine balance, i5-8250U
Prompt (EN): The i5-8250U has \(\pi \approx 250\) GFLOPS (fp32 sustained) and \(\beta \approx 16\) GB/s. What is its machine balance (FLOPs / byte)?
- A. 4
- B. ~16
- C. 64
- D. 256
Correct: B. \(250 / 16 \approx 15.6\) FLOPs/byte. Operators whose arithmetic intensity falls below this value are memory-bound on this CPU; those above it are compute-bound.
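The arithmetic above can be sketched in a few lines. This is a minimal illustration, not part of the canonical quiz: the function names are hypothetical, and the operator intensities assume fp32 operands and ideal memory traffic (each operand read or written exactly once).

```python
# Machine balance = peak compute / peak memory bandwidth (FLOPs per byte).
PEAK_GFLOPS = 250.0    # pi, fp32 sustained, taken from the prompt
BANDWIDTH_GBS = 16.0   # beta, taken from the prompt


def machine_balance(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Return FLOPs per byte at which compute and bandwidth limits meet."""
    return peak_gflops / bandwidth_gbs


def vector_add_intensity(bytes_per_elem: int = 4) -> float:
    """c[i] = a[i] + b[i]: 1 FLOP per element, two loads and one store."""
    return 1.0 / (3 * bytes_per_elem)


def matmul_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """n x n matmul: 2*n^3 FLOPs; ideal traffic touches A, B, C once each."""
    return (2.0 * n**3) / (3 * n * n * bytes_per_elem)


balance = machine_balance(PEAK_GFLOPS, BANDWIDTH_GBS)
print(f"machine balance: {balance:.1f} FLOPs/byte")

operators = [
    ("vector add", vector_add_intensity()),
    ("matmul n=64", matmul_intensity(64)),
    ("matmul n=1024", matmul_intensity(1024)),
]
for name, intensity in operators:
    bound = "memory-bound" if intensity < balance else "compute-bound"
    print(f"{name}: {intensity:.2f} FLOPs/byte -> {bound}")
```

Under these assumptions, only the large matmul clears the balance point; the small matmul and the vector add sit on the bandwidth side.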
q-23-02 — Coalescing penalty pattern
Prompt (EN): A hand-written CPU matmul with loop order (i, j, k) runs ~11× slower than the same matmul with loop order (i, k, j). What is the root cause?
- A. The arithmetic is different.
- B. The strided access pattern in the inner loop produces a cache miss on every iteration.
- C. The compiler vectorizes only the second version.
- D. Floating-point reordering changes the result.
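Correct: B. With row-major storage, the (i, j, k) order walks one operand down a column in the inner loop, so consecutive iterations touch addresses a full row apart and miss the cache. The (i, k, j) order makes every inner-loop access contiguous. C is partly true as well (the compiler vectorizes the contiguous version more effectively), but the dominant cost is the cache misses.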
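To make the access pattern concrete, the sketch below computes the inner-loop stride (in elements) for each matrix in `C[i][j] += A[i][k] * B[k][j]`. It assumes square, row-major matrices; `inner_loop_strides` is a hypothetical helper written for this illustration and only models addresses, it does not time anything.

```python
def inner_loop_strides(order: str, n: int) -> dict:
    """Stride between consecutive inner-loop iterations, per matrix.

    Models C[i][j] += A[i][k] * B[k][j] with row-major n x n matrices,
    where the flat index of X[r][c] is r * n + c.
    """
    inner = order[-1]  # the last letter names the innermost loop variable
    index_vars = {"A": ("i", "k"), "B": ("k", "j"), "C": ("i", "j")}
    strides = {}
    for name, (row_var, col_var) in index_vars.items():
        if inner == row_var:
            strides[name] = n   # stepping the row index jumps a whole row
        elif inner == col_var:
            strides[name] = 1   # stepping the column index is contiguous
        else:
            strides[name] = 0   # element is invariant in the inner loop
    return strides


for order in ("ijk", "ikj"):
    print(order, inner_loop_strides(order, n=1024))
```

For `ijk` the stride on `B` equals `n`, which is the strided access that option B describes; for `ikj` every non-zero stride is 1.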
q-23-03 — Roofline shape
Prompt (EN): In one or two sentences, explain why the roofline equation \(\text{perf} = \min(\pi, I \cdot \beta)\) has the same shape on CPU and GPU, even though the constants differ by orders of magnitude.
Free response. Expected mentions: peak compute (π) ceiling; bandwidth (β) slope; arithmetic intensity (I) determines which ceiling binds.
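One way to see the shared shape is to evaluate the same function with two sets of constants. The sketch below is illustrative only: the CPU figures come from q-23-01, while the GPU figures are assumed, roughly A100-class fp32 and HBM numbers chosen for the example rather than taken from this course.

```python
def attainable_gflops(intensity: float, peak_gflops: float,
                      bandwidth_gbs: float) -> float:
    """Roofline: perf = min(pi, I * beta)."""
    return min(peak_gflops, intensity * bandwidth_gbs)


MACHINES = {
    "i5-8250U": (250.0, 16.0),               # (pi, beta) from q-23-01
    "hypothetical GPU": (19500.0, 1555.0),   # assumed constants
}

for name, (peak, bandwidth) in MACHINES.items():
    ridge = peak / bandwidth  # intensity where the two ceilings meet
    print(f"{name}: ridge point at {ridge:.1f} FLOPs/byte")
    for scale in (0.1, 1.0, 10.0):
        intensity = scale * ridge
        perf = attainable_gflops(intensity, peak, bandwidth)
        regime = "bandwidth slope" if intensity < ridge else "compute ceiling"
        print(f"  I = {intensity:8.2f} -> {perf:10.1f} GFLOPS ({regime})")
```

On both machines the curve rises linearly with slope β and then flattens at π; only the position of the ridge point and the height of the ceiling change.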
q-23-04 — Operators that benefit most from GPU
Prompt (EN): Which characteristic of an operator means it benefits most from moving from CPU to GPU?
- A. High arithmetic intensity, large enough to saturate the GPU's compute ceiling.
- B. Very low arithmetic intensity (pure memory bandwidth bound).
- C. Small operand size that fits in L1.
- D. Branchy control flow.
Correct: A. Bandwidth-bound operators gain roughly 100× from the GPU's higher memory bandwidth, whereas compute-bound operators can gain up to roughly 1000× from Tensor Cores. Branchy operators (D) actually run worse on GPU because of warp divergence.
q-23-05 — GPU at §A13 scale
Prompt (EN): For the Phase-17 mini-GPT on the §A13 corpus, would moving training from i5-8250U to A100 GPU be worthwhile?
- A. Yes, training would be ~100× faster on the GPU.
- B. No, the kernels are too small to amortize GPU launch overhead.
- C. Yes, because attention is always faster on GPU.
- D. The model wouldn't fit on the GPU.
Correct: B. At §A13 scale, the kernels are tiny (\(d_\text{model} = 64\), batch 8). GPU kernel launch overhead (~5–10 μs per kernel) is comparable to or larger than the work each kernel does, so it eats the wall-clock savings. The pedagogical takeaway: a GPU helps at scale, not unconditionally.
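A back-of-the-envelope estimate shows why launch overhead dominates at this size. The sketch below is a rough model under stated assumptions: `SEQ_LEN` is a hypothetical value (the sequence length is not given here), the GPU peak is an assumed A100-class fp32 figure, and both devices are treated as running at peak, which is optimistic for each.

```python
D_MODEL = 64          # from the answer above
BATCH = 8             # from the answer above
SEQ_LEN = 32          # hypothetical: sequence length is not stated
CPU_GFLOPS = 250.0    # i5-8250U figure from q-23-01
GPU_GFLOPS = 19500.0  # assumed A100-class fp32 peak
LAUNCH_US = 5.0       # lower end of the ~5-10 microsecond range


def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add."""
    return 2 * m * k * n


def compute_time_us(flops: int, gflops: float) -> float:
    """Time in microseconds to execute `flops` at a rate of `gflops`."""
    return flops / (gflops * 1e9) * 1e6


# One d_model x d_model projection applied to every token in the batch.
flops = matmul_flops(BATCH * SEQ_LEN, D_MODEL, D_MODEL)

cpu_us = compute_time_us(flops, CPU_GFLOPS)
gpu_compute_us = compute_time_us(flops, GPU_GFLOPS)
gpu_total_us = LAUNCH_US + gpu_compute_us
speedup = cpu_us / gpu_total_us

print(f"FLOPs per kernel:   {flops}")
print(f"CPU time:           {cpu_us:.2f} us")
print(f"GPU compute time:   {gpu_compute_us:.2f} us")
print(f"GPU time + launch:  {gpu_total_us:.2f} us")
print(f"Effective speedup:  {speedup:.2f}x")
```

In this model the GPU finishes the arithmetic in a small fraction of a microsecond, so nearly all of its per-kernel time is launch overhead and the effective speedup stays far below the ratio of peak throughputs.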