Phase 23: Quizzes (mirror)

The canonical questions live in data/quizzes/phase-23-gpu-fundamentals.yaml.

q-23-01: Machine balance, i5-8250U

Prompt (EN): The i5-8250U has \(\pi \approx 250\) GFLOPS (fp32 sustained) and \(\beta \approx 16\) GB/s. What is its machine balance (FLOPs / byte)?

A. 4
B. ~16
C. 64
D. 256

Correct: B. \(250 / 16 \approx 15.6\) FLOPs/byte. Operators whose arithmetic intensity falls below this value are memory-bound on this CPU.
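
The arithmetic behind this answer can be sketched in a few lines of Python. The function names are illustrative and do not come from the course code. The fp32 vector-add intensity (1 FLOP per 12 bytes: two 4-byte loads and one 4-byte store) is included only as a familiar low-intensity example.

```python
def machine_balance(peak_gflops: float, bandwidth_gb_s: float) -> float:
    """FLOPs per byte at which the compute and memory ceilings meet."""
    return peak_gflops / bandwidth_gb_s


def is_memory_bound(intensity: float, balance: float) -> bool:
    """An operator is memory-bound when its intensity is below the balance."""
    return intensity < balance


# Constants from the prompt: 250 GFLOPS sustained fp32, 16 GB/s.
balance = machine_balance(250.0, 16.0)
print(f"machine balance: {balance:.1f} FLOPs/byte")

# fp32 vector add: 1 FLOP per element, 12 bytes moved per element.
vector_add_intensity = 1.0 / 12.0
print("vector add memory-bound:", is_memory_bound(vector_add_intensity, balance))
```

Any operator with an intensity under roughly 16 FLOPs/byte lands on the bandwidth slope of this CPU's roofline.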

q-23-02: Coalescing penalty pattern

Prompt (EN): A hand-written CPU matmul with loop order (i, j, k) runs ~11× slower than the same matmul with loop order (i, k, j). What is the root cause?

A. The arithmetic is different.
B. The strided access pattern in the inner loop produces a cache miss on every iteration.
C. The compiler vectorizes only the second version.
D. Floating-point reordering changes the result.

Correct: B. C is also partly true (the compiler can vectorize the coalesced version more effectively with SIMD), but the dominant cost is the cache misses.
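
The two loop orders can be written out as a minimal sketch. Pure-Python lists are not a contiguous row-major buffer, so this illustrates the access pattern and not the ~11× timing gap. The helper `inner_stride_of_B` is a hypothetical name that models a C-style row-major layout, where `B[k][j]` sits at flat index `k * n + j`.

```python
def matmul_ijk(A, B, n):
    """Loop order (i, j, k): the inner loop walks B down a column."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_ikj(A, B, n):
    """Loop order (i, k, j): the inner loop walks B and C along a row."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C


def inner_stride_of_B(order: str, n: int) -> int:
    """Elements between consecutive B[k][j] accesses in the innermost loop."""
    inner = order[-1]
    if inner == "k":
        return n   # k changes: jump a whole row each iteration
    if inner == "j":
        return 1   # j changes: neighbouring elements
    return 0       # i changes: B[k][j] is reused


n = 4
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j + 1) for j in range(n)] for i in range(n)]

same_result = matmul_ijk(A, B, n) == matmul_ikj(A, B, n)
print("same result:", same_result)
print("inner stride of B, ijk, n=512:", inner_stride_of_B("ijk", 512))
print("inner stride of B, ikj, n=512:", inner_stride_of_B("ikj", 512))
```

Both versions perform the same multiplications and additions, which rules out option A; only the order in which memory is touched changes.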

q-23-03: Shape of the roofline

Prompt (EN): In one or two sentences, explain why the roofline equation perf = min(π, I·β) has the same shape on CPU and GPU, even though the constants differ by orders of magnitude.

Free response. Expected mentions: peak compute ceiling (π); bandwidth slope (β); arithmetic intensity (I) determines which ceiling applies.
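
As a small illustration of why the shape is the same, the sketch below evaluates the roofline for the i5-8250U constants from q-23-01 and for a hypothetical machine whose π and β are both 100× larger. That second machine is an assumption made up for this example and is not a real GPU. The curve keeps its two segments and its ridge point; only the scale changes.

```python
import math


def roofline(intensity: float, peak: float, bandwidth: float) -> float:
    """Attainable performance: perf = min(pi, I * beta)."""
    return min(peak, intensity * bandwidth)


CPU_PEAK, CPU_BW = 250.0, 16.0              # GFLOPS, GB/s (from q-23-01)
BIG_PEAK, BIG_BW = 25_000.0, 1_600.0        # hypothetical: both 100x larger

intensities = [0.25, 1.0, 4.0, 16.0, 64.0]  # FLOPs/byte
cpu_curve = [roofline(i, CPU_PEAK, CPU_BW) for i in intensities]
big_curve = [roofline(i, BIG_PEAK, BIG_BW) for i in intensities]

for i, c, b in zip(intensities, cpu_curve, big_curve):
    print(f"I={i:6.2f}  cpu={c:8.1f}  big={b:10.1f}  GFLOPS")

# The ridge point pi / beta is where the two ceilings meet.
cpu_ridge = CPU_PEAK / CPU_BW
big_ridge = BIG_PEAK / BIG_BW
```

Real CPU and GPU constants do not scale by the same factor, so the ridge point shifts between machines, but the curve is still a bandwidth slope that flattens into a compute ceiling.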

q-23-04: Operators that benefit most from the GPU

Prompt (EN): Which characteristic of an operator means it benefits most from moving from CPU to GPU?

A. High arithmetic intensity, large enough to saturate the GPU's compute ceiling.
B. Very low arithmetic intensity (pure memory bandwidth bound).
C. Small operand size that fits in L1.
D. Branchy control flow.

Correct: A. Bandwidth-bound operators gain ~100× from GPU memory bandwidth, but compute-bound operators gain up to ~1000× from Tensor Cores. Branchy operators (D) actually fare worse on GPU because of warp divergence.
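
The two speedup figures can be roughly reproduced by dividing one roofline by the other. The CPU constants come from q-23-01. The GPU constants are assumptions, taken from approximate public A100 (40 GB) figures: about 312 TFLOPS dense FP16 on Tensor Cores and about 1555 GB/s of memory bandwidth. Treat this as a sketch, since real kernels rarely reach either peak.

```python
def roofline(intensity: float, peak: float, bandwidth: float) -> float:
    """Attainable performance: min(pi, I * beta)."""
    return min(peak, intensity * bandwidth)


CPU_PEAK, CPU_BW = 250.0, 16.0          # GFLOPS, GB/s (from q-23-01)
GPU_PEAK, GPU_BW = 312_000.0, 1_555.0   # assumed A100 Tensor Core peak, HBM bandwidth


def speedup(intensity: float) -> float:
    """Ideal GPU-over-CPU speedup at a given arithmetic intensity."""
    gpu = roofline(intensity, GPU_PEAK, GPU_BW)
    cpu = roofline(intensity, CPU_PEAK, CPU_BW)
    return gpu / cpu


low = speedup(0.1)       # memory-bound on both machines
high = speedup(1000.0)   # compute-bound on both machines
print(f"I = 0.1    -> ~{low:.0f}x")
print(f"I = 1000.0 -> ~{high:.0f}x")
```

At low intensity the ratio collapses to the bandwidth ratio, and at high intensity to the peak-compute ratio, which is why high-intensity operators gain the most.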

q-23-05: GPU at §A13 scale

Prompt (EN): For the Phase-17 mini-GPT on the §A13 corpus, would moving training from i5-8250U to A100 GPU be worthwhile?

A. Yes, training would be ~100× faster on the GPU.
B. No, the kernels are too small to amortize GPU launch overhead.
C. Yes, because attention is always faster on GPU.
D. The model wouldn't fit on the GPU.

Correct: B. At §A13 scale the kernels are tiny (\(d_\text{model} = 64\), batch 8). GPU kernel launch overhead (~5-10 μs each) eats the wall-clock savings. The pedagogical takeaway: the GPU helps at scale, not always.
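
A back-of-the-envelope estimate shows how lopsided this is. The sketch below compares launch overhead with the ideal compute time of one projection matmul. Only \(d_\text{model} = 64\), batch 8, and the 5 μs lower bound come from the text. The sequence length of 64 and the ~19.5 TFLOPS fp32 peak for the A100 are assumptions made for illustration.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


D_MODEL = 64               # from the text
BATCH = 8                  # from the text
SEQ_LEN = 64               # assumption: not stated in the quiz
GPU_PEAK_FLOPS = 19.5e12   # assumption: approximate A100 fp32 peak
LAUNCH_OVERHEAD_S = 5e-6   # lower end of the ~5-10 microsecond range

# One (batch * seq, d_model) @ (d_model, d_model) projection.
flops = matmul_flops(BATCH * SEQ_LEN, D_MODEL, D_MODEL)
compute_s = flops / GPU_PEAK_FLOPS   # ideal time at peak throughput
ratio = LAUNCH_OVERHEAD_S / compute_s

print(f"FLOPs per kernel:    {flops}")
print(f"ideal compute time:  {compute_s * 1e6:.3f} us")
print(f"launch / compute:    {ratio:.1f}x")
```

Even under this optimistic assumption of peak throughput, the launch cost is more than ten times the compute time, so each kernel is dominated by overhead.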