Phase 1 — Quizzes

Human-readable mirror of data/quizzes/phase-01-hardware-substrate.yaml. Designed so that a thoughtful learner gets at least one of these wrong on the first attempt.

Source of truth: data/quizzes/phase-01-hardware-substrate.yaml.

q-01-01 — What does the roofline ridge point represent?

1. The minimum intensity at which the kernel becomes compute-bound.
2. The maximum cache size in bytes per FLOP.
3. The number of FLOPs per CPU cycle.
4. The instruction-level parallelism limit of the machine.

Answer

**Choice 1.** The ridge x-coordinate is `π / β` — the arithmetic intensity above which the kernel can saturate FLOPs, below which it cannot because memory cannot feed the FPUs fast enough.
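
Written out, the roofline model behind this answer takes the following form. The names $P(I)$ for attainable performance and $I_{\text{ridge}}$ for the ridge intensity are labels introduced here for illustration; $\pi$ is peak compute throughput and $\beta$ is peak memory bandwidth, as in the answer above.

$$
P(I) = \min(\pi,\ \beta I), \qquad I_{\text{ridge}} = \frac{\pi}{\beta}
$$

For $I < I_{\text{ridge}}$ the bandwidth term $\beta I$ is the smaller one (memory-bound); for $I \ge I_{\text{ridge}}$ the flat roof $\pi$ applies (compute-bound).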

q-01-02 — Cache-line effects on stride (multi-choice)

A 64-byte line holds 8 doubles. You scan a 1M-element array with stride S.

1. Stride 1 makes optimal use of the line: 8 useful loads per fetched line.
2. Stride 8 wastes the line: 1 useful element per 64 bytes fetched.
3. Stride 16 uses fewer cache lines than stride 1 and therefore is faster.
4. Hardware prefetchers help unit stride more than large strides.
5. Doubling the stride from 8 to 16 halves the number of cache lines fetched, so wall time halves.

Answer

**Choices 1, 2, 4.** Choices 3 and 5 confuse "lines fetched per element" with "wall time." Beyond stride 8, every useful element still costs 1 cache line — wall time stays roughly flat (or worse, because TLB pressure and prefetcher mispredictions kick in).
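
The "one cache line per useful element" point can be checked with a small counting model. This is a sketch, not a benchmark: it assumes the array starts on a cache-line boundary, ignores prefetching and TLB effects, and the function name `stride_stats` is made up for this example.

```python
LINE_BYTES = 64   # cache-line size from the question
ELEM_BYTES = 8    # one double


def stride_stats(n_elems: int, stride: int) -> dict:
    """Count elements read and distinct cache lines fetched by a strided scan.

    Assumes the array is aligned to a cache-line boundary.
    """
    elems_per_line = LINE_BYTES // ELEM_BYTES        # 8 doubles per line
    touched = range(0, n_elems, stride)              # indices actually read
    lines = {i // elems_per_line for i in touched}   # distinct lines fetched
    n_touched = len(touched)
    return {
        "elements": n_touched,
        "lines": len(lines),
        "lines_per_element": len(lines) / n_touched,
        # Fraction of fetched bytes that the scan actually uses.
        "useful_fraction": n_touched * ELEM_BYTES / (len(lines) * LINE_BYTES),
    }


for s in (1, 8, 16):
    print(s, stride_stats(1_000_000, s))
```

In this model, strides 8 and 16 both cost one full line per element read and use one eighth of the fetched bytes, while stride 1 costs one eighth of a line per element.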

q-01-03 — Roofline calculation (free)

A kernel does 2N FLOPs and moves 8N bytes. Machine: π = 100 GFLOPS, β = 25 GB/s. Compute I, the ridge, attainable perf, and state the regime.

Answer

- **I** = 2N / 8N = **0.25 FLOPs/byte**.
- **Ridge** = π/β = **4.0**.
- **I < ridge** → **memory-bound**.
- **Attainable perf** = I × β = 0.25 × 25 = **6.25 GFLOPS** (the FPUs run ~94% idle).
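
The same arithmetic as a minimal Python sketch. The function name `roofline` and its parameter names are illustrative, and the value of `N` is arbitrary because it cancels in the intensity.

```python
def roofline(flops: float, bytes_moved: float,
             peak_gflops: float, bw_gbs: float) -> dict:
    """Apply the basic roofline model: attainable = min(pi, I * beta)."""
    intensity = flops / bytes_moved               # FLOPs per byte
    ridge = peak_gflops / bw_gbs                  # intensity where the roofs meet
    attainable = min(peak_gflops, intensity * bw_gbs)
    regime = "compute-bound" if intensity >= ridge else "memory-bound"
    return {
        "intensity": intensity,
        "ridge": ridge,
        "attainable_gflops": attainable,
        "regime": regime,
        "fraction_of_peak": attainable / peak_gflops,
    }


N = 1_000_000  # arbitrary problem size; it cancels out of the intensity
result = roofline(flops=2 * N, bytes_moved=8 * N, peak_gflops=100.0, bw_gbs=25.0)
print(result)
```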

q-01-04 — Why is matmul compute-bound but elementwise add memory-bound? (free)

Answer

Matmul intensity = `2N³ / 12N² = N/6`, which grows linearly with N. Elementwise-add intensity = `N² / 12N² = 1/12`, which is constant. (The `12N²` byte count is consistent with three N×N matrices of 4-byte floats, each moved once.) So matmul rewards SIMD/blocking at large N, where it becomes compute-bound; elementwise add stays memory-bound forever, and SIMD on it only helps until the memory subsystem saturates.
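
A short sketch of how the two intensities behave as N grows. It assumes 4-byte elements and that each of the three N×N matrices is moved exactly once, which is the idealized traffic that `12N²` implies; the function names are illustrative.

```python
BYTES_PER_ELEM = 4  # assumption: float32, which yields the 12N^2 byte count


def matmul_intensity(n: int) -> float:
    """FLOPs per byte for an N x N matmul: 2N^3 FLOPs over 3 matrices."""
    return (2 * n**3) / (3 * n**2 * BYTES_PER_ELEM)


def add_intensity(n: int) -> float:
    """FLOPs per byte for an N x N elementwise add: N^2 FLOPs over 3 matrices."""
    return (n**2) / (3 * n**2 * BYTES_PER_ELEM)


for n in (8, 64, 512, 4096):
    print(f"N={n:5d}  matmul={matmul_intensity(n):8.2f}  add={add_intensity(n):.4f}")
```

The matmul column grows as N/6, so it crosses any fixed ridge point once N is large enough, while the add column stays at 1/12 regardless of N.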

q-01-05 — NUMA: dominant cost of remote access

1. The remote DRAM chip is physically slower.
2. The request must traverse the inter-socket interconnect (UPI/QPI), adding latency and consuming shared bandwidth.
3. The OS marks remote pages as read-only by default.
4. Cache coherence is disabled across sockets.

Answer

**Choice 2.** The DRAM chip itself is no slower; the interconnect hop is the cost. NUMA-aware allocation (first-touch policy, `numactl`) keeps threads and data on the same socket.
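
One way to keep a process on a single socket from inside Python is sketched below. It is Linux-only and rests on assumptions: it reads the node's CPU list from `/sys/devices/system/node/nodeK/cpulist`, handles only the plain `a-b,c` list format, and relies on the kernel's default local (first-touch) allocation so that pages first written after pinning land on the local node. The helper names `parse_cpulist` and `pin_to_node` are made up for this example.

```python
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node: int) -> set[int]:
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus


if __name__ == "__main__":
    try:
        print("pinned to CPUs:", sorted(pin_to_node(0)))
    except (OSError, AttributeError) as exc:
        # No NUMA sysfs entry, a restricted CPU set, or a non-Linux platform.
        print("could not pin:", exc)
```

Memory should be allocated and initialized after the pin so that the first write to each page happens on the chosen node.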