Phase 1 — Quizzes
Readable mirror of
data/quizzes/phase-01-hardware-substrate.yaml. Designed so that a thoughtful learner gets at least one of these wrong on the first attempt.
Source of truth: data/quizzes/phase-01-hardware-substrate.yaml.
q-01-01 — What does the roofline ridge point represent?
- The minimum intensity at which the kernel becomes compute-bound.
- The maximum cache size in bytes per FLOP.
- The number of FLOPs per CPU cycle.
- The instruction-level parallelism limit of the machine.
Answer
**Choice 1.** The ridge x-coordinate is `π / β` (peak FLOP rate over peak memory bandwidth): the arithmetic intensity above which the kernel can saturate the FPUs and below which it cannot, because memory cannot feed them fast enough.
q-01-02 — Cache-line effects on stride (multi-choice)
A 64-byte line holds 8 doubles. You scan a 1M-element array with stride S.
- Stride 1 makes optimal use of the line: 8 useful loads per fetched line.
- Stride 8 wastes the line: 1 useful element per 64 bytes fetched.
- Stride 16 uses fewer cache lines than stride 1 and therefore is faster.
- Hardware prefetchers help unit stride more than large strides.
- Doubling the stride from 8 to 16 halves the number of cache lines fetched, so wall time halves.
Answer
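The distinction between lines fetched and useful elements can be made concrete with a small counting model. This is a minimal sketch, not a benchmark: it assumes the array starts on a cache-line boundary, takes "1M" to mean 2^20 elements, and the helper name `lines_per_element` is hypothetical. It counts only distinct cache lines touched and says nothing about prefetchers or the TLB.

```python
LINE_BYTES = 64  # cache-line size from the question
ELEM_BYTES = 8   # one double

def lines_per_element(n_elements, stride):
    """Distinct cache lines touched per element accessed (line-aligned array assumed)."""
    indices = range(0, n_elements, stride)
    lines = {(i * ELEM_BYTES) // LINE_BYTES for i in indices}
    return len(lines) / len(indices)

N = 1 << 20  # assumption: "1M elements" read as 2**20

for stride in (1, 2, 4, 8, 16, 32):
    print(stride, lines_per_element(N, stride))
```

In this model the ratio climbs from 0.125 at stride 1 to 1.0 at stride 8 and then stays at 1.0 for larger strides.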
**Choices 1, 2, 4.** Choices 3 and 5 confuse "lines fetched per element" with "wall time." Beyond stride 8, every useful element still costs one full cache line, so the time per accessed element stays roughly flat (or gets worse, because TLB pressure and prefetcher mispredictions kick in).
q-01-03 — Roofline calculation (free)
A kernel does 2N FLOPs and moves 8N bytes. Machine: π = 100 GFLOPS, β = 25 GB/s. Compute I, the ridge, attainable perf, and state the regime.
Answer
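The arithmetic for this question can be checked with a few lines of code. The sketch below assumes the basic roofline model, attainable = min(π, I × β). The function name `roofline` and the concrete value of N are illustrative and do not come from the quiz source.

```python
def roofline(flops, bytes_moved, peak_gflops, bandwidth_gbs):
    """Return (intensity, ridge, attainable GFLOPS, regime) for the basic roofline model."""
    intensity = flops / bytes_moved        # FLOPs per byte
    ridge = peak_gflops / bandwidth_gbs    # intensity where the two roofs meet
    attainable = min(peak_gflops, intensity * bandwidth_gbs)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    return intensity, ridge, attainable, regime

N = 1_000_000  # arbitrary: N cancels out of the intensity

intensity, ridge, attainable, regime = roofline(
    flops=2 * N, bytes_moved=8 * N, peak_gflops=100.0, bandwidth_gbs=25.0
)
print(intensity, ridge, attainable, regime)
# prints: 0.25 4.0 6.25 memory-bound
```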
- **I** = 2N / 8N = **0.25 FLOPs/byte**.
- **Ridge** = π/β = **4.0 FLOPs/byte**.
- **I < ridge** → **memory-bound**.
- **Attainable perf** = I × β = 0.25 × 25 = **6.25 GFLOPS** (the FPUs run ~94% idle).
q-01-04 — Why is matmul compute-bound but elementwise add memory-bound? (free)
Answer
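The two intensities can be tabulated for a few sizes to see the scaling. This is a minimal sketch under stated assumptions: 4-byte elements (which is what reproduces the 12N² byte count), each N×N matrix moved to or from memory exactly once, and a ridge of 4.0 FLOPs/byte borrowed from q-01-03 as a hypothetical machine. The function names are illustrative.

```python
BYTES_PER_ELEMENT = 4  # assumption: float32, matching the 12N^2 byte count
RIDGE = 4.0            # assumption: ridge (FLOPs/byte) of the machine in q-01-03

def matmul_intensity(n):
    flops = 2 * n**3                            # n multiply-adds for each of n^2 outputs
    bytes_moved = 3 * n**2 * BYTES_PER_ELEMENT  # read A and B, write C, once each
    return flops / bytes_moved

def add_intensity(n):
    flops = n**2                                # one add per output element
    bytes_moved = 3 * n**2 * BYTES_PER_ELEMENT  # read A and B, write C, once each
    return flops / bytes_moved

for n in (6, 24, 96, 384):
    mm = matmul_intensity(n)
    print(n, mm, add_intensity(n), "compute-bound" if mm > RIDGE else "memory-bound")
```

Under these assumptions matmul crosses the ridge once N exceeds 24, while the elementwise add sits at 1/12 FLOPs/byte for every N.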
Matmul intensity = `2N³ / 12N² = N/6`, which grows linearly with N. Elementwise-add intensity = `N² / 12N² = 1/12`, a constant. So matmul rewards SIMD/blocking at large N (it becomes compute-bound), while elementwise add stays memory-bound forever; SIMD on it only helps until the memory subsystem saturates.
q-01-05 — NUMA: dominant cost of remote access
- The remote DRAM chip is physically slower.
- The request must traverse the inter-socket interconnect (UPI/QPI), adding latency and consuming shared bandwidth.
- The OS marks remote pages as read-only by default.
- Cache coherence is disabled across sockets.