Phase 23 — Quizzes (mirror)

The canonical questions live in data/quizzes/phase-23-gpu-fundamentals.yaml.

q-23-01 — Machine balance, i5-8250U

Prompt (EN): The i5-8250U has \(\pi \approx 250\) GFLOPS (fp32 sustained) and \(\beta \approx 16\) GB/s. What is its machine balance (FLOPs / byte)?

A. 4
B. ~16
C. 64
D. 256

Correct: B. \(\pi / \beta = 250 / 16 \approx 15.6\) FLOPs/byte. Operators whose arithmetic intensity falls below this value are memory-bound on this CPU.
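The arithmetic can be checked with a minimal sketch. The two constants come from the prompt; the helper names and the vector-add example (1 FLOP per 12 bytes of fp32 traffic) are illustrative assumptions, not part of the quiz data.

```python
# Machine balance = peak compute / peak memory bandwidth (FLOPs per byte).
PEAK_GFLOPS = 250.0    # pi, fp32 sustained, from the prompt
BANDWIDTH_GBS = 16.0   # beta, from the prompt


def machine_balance(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity at which the compute and bandwidth limits meet."""
    return peak_gflops / bandwidth_gbs


def is_memory_bound(intensity: float, balance: float) -> bool:
    """An operator with intensity below the balance is memory-bound."""
    return intensity < balance


balance = machine_balance(PEAK_GFLOPS, BANDWIDTH_GBS)
print(f"machine balance = {balance:.3f} FLOPs/byte")

# Hypothetical operator: fp32 vector add, 1 FLOP per 12 bytes moved
# (two 4-byte loads and one 4-byte store per element).
vector_add_intensity = 1 / 12
print("vector add memory-bound:", is_memory_bound(vector_add_intensity, balance))
```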

q-23-02 — Coalescing penalty pattern

Prompt (EN): A hand-written CPU matmul with loop order (i, j, k) runs ~11× slower than the same matmul with loop order (i, k, j). What is the root cause?

A. The arithmetic is different.
B. The strided access pattern in the inner loop produces a cache miss on every iteration.
C. The compiler vectorizes only the second version.
D. Floating-point reordering changes the result.

Correct: B. C can also contribute (the compiler vectorizes the contiguous-access version more easily), but the dominant cost is the cache misses caused by the strided inner loop.
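A rough way to see the difference is to count how often consecutive reads of the right-hand matrix land on a new cache line during one pass of the inner loop. The sketch below assumes row-major storage, fp32 elements, and 64-byte cache lines; it is an address-counting model, not a cache simulator, and the function name is hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed line size, typical for x86 CPUs
ELEM_BYTES = 4         # fp32


def line_changes(n: int, inner: str) -> int:
    """Count cache-line changes between consecutive reads of B[k][j]
    during one pass of the inner loop, for a row-major n x n matrix B."""
    if inner == "k":    # loop order (i, j, k): k varies, stride of n elements
        offsets = [k * n * ELEM_BYTES for k in range(n)]
    elif inner == "j":  # loop order (i, k, j): j varies, stride of 1 element
        offsets = [j * ELEM_BYTES for j in range(n)]
    else:
        raise ValueError("inner must be 'k' or 'j'")
    lines = [off // CACHE_LINE_BYTES for off in offsets]
    return sum(1 for a, b in zip(lines, lines[1:]) if a != b)


n = 1024
print("(i, j, k) inner loop:", line_changes(n, "k"), "line changes")
print("(i, k, j) inner loop:", line_changes(n, "j"), "line changes")
```

With the strided order every read touches a different line, while the contiguous order reuses each line for 16 consecutive elements. The measured slowdown depends on cache sizes and prefetching, so this count only indicates the direction of the effect.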

q-23-03 — Roofline shape

Prompt (EN): In one or two sentences, explain why the roofline equation perf = min(π, I·β) has the same shape on CPU and GPU, even though the constants differ by orders of magnitude.

Free response. Expected mentions: peak compute (π) ceiling; bandwidth (β) slope; arithmetic intensity (I) determines which ceiling binds.
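One way to illustrate the expected answer: once intensity is expressed as a multiple of the ridge point \(\pi / \beta\) and performance as a fraction of \(\pi\), every machine follows the same curve. The sketch below uses the i5-8250U constants from q-23-01 and a hypothetical GPU with made-up round numbers; the function names are assumptions.

```python
def roofline(intensity: float, peak: float, bandwidth: float) -> float:
    """Attainable FLOP/s under perf = min(pi, I * beta)."""
    return min(peak, intensity * bandwidth)


def normalized(x: float, peak: float, bandwidth: float) -> float:
    """Performance as a fraction of peak, with intensity given as a
    multiple x of the ridge point pi / beta."""
    ridge = peak / bandwidth
    return roofline(x * ridge, peak, bandwidth) / peak


MACHINES = {
    "i5-8250U": (250e9, 16e9),           # pi, beta from q-23-01
    "hypothetical GPU": (100e12, 1e12),  # made-up round numbers
}

for name, (peak, bandwidth) in MACHINES.items():
    curve = [round(normalized(x, peak, bandwidth), 6) for x in (0.25, 0.5, 1.0, 4.0)]
    print(f"{name}: ridge = {peak / bandwidth:.1f} FLOPs/byte, curve = {curve}")
```

Both machines print the same normalized curve: a slope up to the ridge point, then a flat ceiling. Only the ridge position and the absolute scale differ.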

q-23-04 — Operators that benefit most from GPU

Prompt (EN): Which characteristic of an operator means it benefits most from moving from CPU to GPU?

A. High arithmetic intensity, large enough to saturate the GPU's compute ceiling.
B. Very low arithmetic intensity (pure memory bandwidth bound).
C. Small operand size that fits in L1.
D. Branchy control flow.

Correct: A. Relative to the CPU, bandwidth-bound operators gain ~100× from the GPU's higher memory bandwidth, but compute-bound operators gain up to ~1000× from Tensor Cores. Branchy operators (D) tend to do worse on a GPU because of warp divergence.
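The two ratios follow from applying the roofline model to both machines. The sketch below takes the CPU constants from q-23-01 and assumes A100-class nominal figures for the GPU (about 312 TFLOPS dense fp16 on Tensor Cores and about 1555 GB/s); those GPU numbers are assumptions for illustration and mix precisions with the fp32 CPU figure.

```python
CPU = {"peak": 250e9, "bandwidth": 16e9}      # from q-23-01
GPU = {"peak": 312e12, "bandwidth": 1555e9}   # assumed A100-class nominal values


def attainable(intensity: float, machine: dict) -> float:
    """Roofline: min(pi, I * beta) in FLOP/s."""
    return min(machine["peak"], intensity * machine["bandwidth"])


def speedup(intensity: float) -> float:
    """Ideal GPU-over-CPU speedup at a given arithmetic intensity."""
    return attainable(intensity, GPU) / attainable(intensity, CPU)


for intensity in (0.1, 1.0, 100.0, 1000.0):
    print(f"I = {intensity:>6} FLOPs/byte -> ideal speedup {speedup(intensity):7.1f}x")
```

At low intensity both machines are bandwidth-bound and the speedup equals the bandwidth ratio (just under 100×). At high intensity both are compute-bound and the speedup equals the peak-compute ratio (about 1250×). These are upper bounds from the model, not measurements.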

q-23-05 — GPU at §A13 scale

Prompt (EN): For the Phase-17 mini-GPT on the §A13 corpus, would moving training from i5-8250U to A100 GPU be worthwhile?

A. Yes, training would be ~100× faster on the GPU.
B. No, the kernels are too small to amortize GPU launch overhead.
C. Yes, because attention is always faster on GPU.
D. The model wouldn't fit on the GPU.

Correct: B. At §A13 scale, the kernels are tiny (\(d_\text{model} = 64\), batch 8). GPU kernel launch overhead (~5–10 μs per launch) consumes the wall-clock savings. The takeaway: a GPU helps at scale, not in every case.
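A back-of-the-envelope estimate shows why launch overhead dominates. The sketch below is a simplification under stated assumptions: one \(8 \times 64\) by \(64 \times 64\) matmul per kernel (the sequence length is not given here, so it is left out), both devices running at peak compute, a 5 μs launch cost from the low end of the quoted range, and an assumed A100-class peak of 312 TFLOPS.

```python
CPU_PEAK = 250e9           # FLOP/s, from q-23-01
GPU_PEAK = 312e12          # FLOP/s, assumed A100-class nominal value
LAUNCH_OVERHEAD_S = 5e-6   # low end of the ~5-10 us range quoted above


def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * k * n


def cpu_time(flops: float) -> float:
    """Idealized CPU time: no launch cost, peak compute."""
    return flops / CPU_PEAK


def gpu_time(flops: float, launches: int = 1) -> float:
    """Idealized GPU time: fixed cost per launch plus peak compute."""
    return launches * LAUNCH_OVERHEAD_S + flops / GPU_PEAK


flops = matmul_flops(8, 64, 64)  # batch 8, d_model 64 (simplified shape)
print(f"FLOPs per kernel: {flops}")
print(f"CPU estimate: {cpu_time(flops) * 1e6:.3f} us")
print(f"GPU estimate: {gpu_time(flops) * 1e6:.3f} us")
```

Under these assumptions the GPU estimate is almost entirely launch overhead and exceeds the CPU estimate. Real kernels reach neither peak, so the numbers indicate the trend rather than predict a benchmark.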