Phase 23 — Quizzes (mirror)

The canonical questions live in data/quizzes/phase-23-gpu-fundamentals.yaml.


q-23-01 — Machine balance, i5-8250U

Prompt (EN): The i5-8250U has \(\pi \approx 250\) GFLOPS (fp32 sustained) and \(\beta \approx 16\) GB/s. What is its machine balance (FLOPs / byte)?

  • A. 4
  • B. ~16
  • C. 64
  • D. 256

Correct: B. \(250 / 16 \approx 15.6\) FLOPs/byte. Operators whose arithmetic intensity falls below this value are memory-bound on this CPU.
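
The arithmetic can be checked with a minimal Python sketch. The function names `machine_balance` and `is_memory_bound` are illustrative, not part of the course code; the elementwise-add example assumes fp32 operands (two 4-byte reads and one 4-byte write per FLOP).

```python
def machine_balance(peak_gflops: float, bandwidth_gbs: float) -> float:
    """FLOPs per byte at which the compute and bandwidth ceilings meet."""
    return peak_gflops / bandwidth_gbs


def is_memory_bound(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> bool:
    """An operator is memory-bound when its intensity is below the machine balance."""
    return intensity < machine_balance(peak_gflops, bandwidth_gbs)


# Constants from the prompt: pi ~ 250 GFLOPS, beta ~ 16 GB/s.
balance = machine_balance(250.0, 16.0)
print(f"machine balance: {balance:.3f} FLOPs/byte")  # 15.625

# fp32 elementwise add: 1 FLOP per 12 bytes moved (assumed traffic model).
add_intensity = 1.0 / 12.0
print("elementwise add memory-bound:", is_memory_bound(add_intensity, 250.0, 16.0))
```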


q-23-02 — Coalescing penalty pattern

Prompt (EN): A hand-written CPU matmul with loop order (i, j, k) runs ~11× slower than the same matmul with loop order (i, k, j). What is the root cause?

  • A. The arithmetic is different.
  • B. The strided access pattern in the inner loop produces a cache miss on every iteration.
  • C. The compiler vectorizes only the second version.
  • D. Floating-point reordering changes the result.

Correct: B. C is partly true as well (the compiler can vectorize the contiguous-access version more effectively), but the dominant cost is the cache misses caused by the strided inner loop.
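
The two loop orders can be sketched as follows, assuming square row-major matrices stored as flat lists. Pure Python does not reproduce the cache behavior or the ~11× gap; the sketch only makes the inner-loop index pattern visible and confirms that both orders compute the same result. The names `matmul_ijk` and `matmul_ikj` are illustrative.

```python
def matmul_ijk(A, B, n):
    """Loop order (i, j, k): the inner loop walks down a column of B."""
    C = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                # B index k*n + j jumps by n elements each iteration (strided).
                acc += A[i * n + k] * B[k * n + j]
            C[i * n + j] = acc
    return C


def matmul_ikj(A, B, n):
    """Loop order (i, k, j): the inner loop walks along a row of B and C."""
    C = [0.0] * (n * n)
    for i in range(n):
        for k in range(n):
            a = A[i * n + k]
            for j in range(n):
                # B index k*n + j and C index i*n + j advance by 1 (contiguous).
                C[i * n + j] += a * B[k * n + j]
    return C


n = 4
A = [float(x) for x in range(n * n)]
B = [float(x % 5) for x in range(n * n)]

# Inner-loop stride through B, in elements, for each loop order.
stride_ijk = ((1 * n + 0) - (0 * n + 0))  # k -> k+1 with j fixed
stride_ikj = ((0 * n + 1) - (0 * n + 0))  # j -> j+1 with k fixed
print("inner-loop stride in B:", stride_ijk, "vs", stride_ikj)
print("same result:", matmul_ijk(A, B, n) == matmul_ikj(A, B, n))
```

Both versions accumulate each output element over k in the same order, so the results match exactly, which is consistent with options A and D being wrong.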


q-23-03 — Roofline shape

Prompt (EN): In one or two sentences, explain why the roofline equation perf = min(π, I·β) has the same shape on CPU and GPU, even though the constants differ by orders of magnitude.

Free response. Expected mentions: peak compute (π) ceiling; bandwidth (β) slope; arithmetic intensity (I) determines which ceiling binds.
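
As an illustration of the expected answer, the roofline equation can be sketched in Python. The CPU constants come from q-23-01; the GPU constants are approximate published A100 figures (fp32 without Tensor Cores, 40 GB variant) and are used here only as assumptions. The function name `roofline` is illustrative.

```python
def roofline(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Attainable GFLOPS: perf = min(pi, I * beta)."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# (pi in GFLOPS, beta in GB/s)
CPU = (250.0, 16.0)        # i5-8250U, from q-23-01
GPU = (19500.0, 1555.0)    # assumed approximate A100 figures

for name, (peak, bw) in (("CPU", CPU), ("GPU", GPU)):
    ridge = peak / bw  # intensity where the slope meets the flat ceiling
    low = roofline(1.0, peak, bw)      # bandwidth slope binds
    high = roofline(100.0, peak, bw)   # compute ceiling binds
    print(f"{name}: ridge={ridge:.2f} FLOPs/byte, I=1 -> {low:.0f}, I=100 -> {high:.0f} GFLOPS")
```

The same function describes both machines: only the two constants change, which moves the ridge point and scales the ceilings without changing the shape.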


q-23-04 — Operators that benefit most from GPU

Prompt (EN): Which characteristic of an operator means it benefits most from moving from CPU to GPU?

  • A. High arithmetic intensity, large enough to saturate the GPU's compute ceiling.
  • B. Very low arithmetic intensity (pure memory bandwidth bound).
  • C. Small operand size that fits in L1.
  • D. Branchy control flow.

Correct: A. Bandwidth-bound operators gain roughly 100× from the GPU's higher memory bandwidth, whereas compute-bound operators can gain up to ~1000× from Tensor Cores. Branchy operators (D) actually do worse on GPU because of warp divergence.


q-23-05 — GPU at §A13 scale

Prompt (EN): For the Phase-17 mini-GPT on the §A13 corpus, would moving training from i5-8250U to A100 GPU be worthwhile?

  • A. Yes, training would be ~100× faster on the GPU.
  • B. No, the kernels are too small to amortize GPU launch overhead.
  • C. Yes, because attention is always faster on GPU.
  • D. The model wouldn't fit on the GPU.

Correct: B. At §A13 scale, the kernels are tiny (\(d_\text{model} = 64\), batch 8). GPU kernel launch overhead (~5-10 μs per launch) cancels out the wall-clock savings. The pedagogical takeaway: a GPU helps at scale, not in every case.
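
A rough back-of-the-envelope sketch of this argument, in Python. The sequence length of 128 is a hypothetical value chosen for illustration (the quiz does not state it), the A100 fp32 peak of 19.5 TFLOPS is the published non-Tensor-Core figure, and the model ignores memory traffic entirely, so it gives a best case for the GPU.

```python
A100_FP32_PEAK_FLOPS = 19.5e12  # published fp32 peak, used as an assumption
LAUNCH_OVERHEAD_S = 5e-6        # low end of the ~5-10 us range quoted above


def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * k * n


d_model = 64
batch = 8
seq_len = 128  # hypothetical; not given in the quiz

# One d_model x d_model projection applied to every token in the batch.
flops = matmul_flops(batch * seq_len, d_model, d_model)
ideal_compute_s = flops / A100_FP32_PEAK_FLOPS
overhead_ratio = LAUNCH_OVERHEAD_S / ideal_compute_s

print(f"FLOPs per kernel:       {flops}")
print(f"ideal compute time:     {ideal_compute_s * 1e6:.2f} us")
print(f"launch overhead / work: {overhead_ratio:.1f}x")
```

Under these assumptions the launch overhead is several times larger than the ideal compute time of the kernel itself, which is the effect the answer describes.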