Phase 1 — Quizzes

Readable mirror of data/quizzes/phase-01-hardware-substrate.yaml. Designed so that a thoughtful learner gets at least one of these wrong on the first attempt.

Source of truth: data/quizzes/phase-01-hardware-substrate.yaml.


q-01-01 — What does the roofline ridge point represent?

  1. The minimum intensity at which the kernel becomes compute-bound.
  2. The maximum cache size in bytes per FLOP.
  3. The number of FLOPs per CPU cycle.
  4. The instruction-level parallelism limit of the machine.
Answer **Choice 1.** The ridge's x-coordinate is `π / β`, peak FLOP rate divided by peak memory bandwidth, in FLOPs/byte. Above that arithmetic intensity the kernel can saturate the FPUs; below it, it cannot, because memory cannot feed the FPUs fast enough.

q-01-02 — Cache-line effects on stride (multi-choice)

A 64-byte line holds 8 doubles. You scan a 1M-element array with stride S.

  1. Stride 1 makes optimal use of the line: 8 useful loads per fetched line.
  2. Stride 8 wastes the line: 1 useful element per 64 bytes fetched.
  3. Stride 16 uses fewer cache lines than stride 1 and therefore is faster.
  4. Hardware prefetchers help unit stride more than large strides.
  5. Doubling the stride from 8 to 16 halves the number of cache lines fetched, so wall time halves.
Answer **Choices 1, 2, 4.** Choices 3 and 5 confuse "lines fetched per element" with "wall time." At stride 8 and beyond, every useful element still costs one full cache line, so wall time per useful element stays roughly flat (or gets worse, because TLB pressure and prefetcher mispredictions kick in).
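
To make the "lines fetched per element" count concrete, here is a minimal sketch that counts reads and distinct cache lines for a strided scan. It assumes the array starts on a cache-line boundary and models no timing at all (no prefetcher, no TLB); `scan_cost` is a hypothetical helper name, not something from the quiz source.

```python
LINE_BYTES = 64   # cache-line size from the question
ELEM_BYTES = 8    # one double


def scan_cost(n_elems: int, stride: int) -> tuple[int, int]:
    """Return (elements read, distinct cache lines touched) for a strided scan.

    Assumes the array starts on a cache-line boundary.
    """
    indices = range(0, n_elems, stride)
    # Map each accessed element index to the cache line that holds it.
    lines = {(i * ELEM_BYTES) // LINE_BYTES for i in indices}
    return len(indices), len(lines)


if __name__ == "__main__":
    n = 1 << 20  # roughly the 1M-element array from the question
    for stride in (1, 2, 4, 8, 16, 32):
        reads, lines = scan_cost(n, stride)
        print(f"stride {stride:2d}: {reads:8d} reads, {lines:7d} lines, "
              f"{lines / reads:.3f} lines per read")
```

The lines-per-read ratio climbs from 0.125 at stride 1 to 1.0 at stride 8 and then stays at 1.0, which is the plateau the answer describes.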

q-01-03 — Roofline calculation (free)

A kernel does 2N FLOPs and moves 8N bytes. Machine: π = 100 GFLOPS, β = 25 GB/s. Compute I, the ridge, attainable perf, and state the regime.

Answer

- **I** = 2N / 8N = **0.25 FLOPs/byte**.
- **Ridge** = π/β = 100 / 25 = **4.0 FLOPs/byte**.
- **I < ridge** → **memory-bound**.
- **Attainable perf** = I × β = 0.25 × 25 = **6.25 GFLOPS** (the FPUs run ~94% idle).
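
The same arithmetic can be checked with a short sketch of the basic roofline model, where attainable performance is the smaller of peak compute and intensity times bandwidth. The function name `roofline` and its parameter names are illustrative assumptions; the numbers are the ones given in the question.

```python
def roofline(flops: float, bytes_moved: float,
             peak_gflops: float, bandwidth_gbs: float):
    """Return (intensity, ridge, attainable GFLOPS, regime) for one kernel."""
    intensity = flops / bytes_moved              # FLOPs per byte
    ridge = peak_gflops / bandwidth_gbs          # FLOPs per byte at the ridge
    attainable = min(peak_gflops, intensity * bandwidth_gbs)
    regime = "compute-bound" if intensity >= ridge else "memory-bound"
    return intensity, ridge, attainable, regime


if __name__ == "__main__":
    # N cancels in the intensity, so N = 1 is enough: 2 FLOPs, 8 bytes.
    i, ridge, perf, regime = roofline(2, 8, peak_gflops=100.0, bandwidth_gbs=25.0)
    print(f"I = {i} FLOPs/byte, ridge = {ridge}, "
          f"attainable = {perf} GFLOPS, {regime}")
    print(f"FPU utilisation: {perf / 100.0:.2%}")
```

Taking the `min` is what produces the two roof segments: the sloped bandwidth line below the ridge and the flat compute ceiling above it.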

q-01-04 — Why is matmul compute-bound but elementwise add memory-bound? (free)

Answer Matmul intensity = `2N³ / 12N² = N/6`, which grows linearly with N. Elementwise-add intensity = `N² / 12N² = 1/12`, a constant. (The `12N²` bytes are consistent with three N×N matrices of 4-byte floats, each moved once.) So matmul rewards SIMD/blocking at large N, where it becomes compute-bound; elementwise add stays memory-bound at every N, and SIMD on it only helps until the memory subsystem saturates.
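
A small sketch of the two intensity formulas, under the assumption that each of the three N×N matrices is moved exactly once and holds 4-byte floats (that is what makes the denominator `12N²`). The function names are hypothetical, and real kernels move more data than this ideal count unless they are blocked for cache.

```python
BYTES_PER_ELEM = 4  # assumption: float32, which yields the 12N² in the answer


def matmul_intensity(n: int) -> float:
    """FLOPs per byte for an N x N matmul: 2N^3 FLOPs over three matrices."""
    return (2 * n**3) / (3 * n**2 * BYTES_PER_ELEM)


def add_intensity(n: int) -> float:
    """FLOPs per byte for an N x N elementwise add: N^2 FLOPs, same traffic."""
    return (n**2) / (3 * n**2 * BYTES_PER_ELEM)


if __name__ == "__main__":
    for n in (6, 24, 600, 6000):
        print(f"N = {n:5d}: matmul I = {matmul_intensity(n):8.2f}, "
              f"add I = {add_intensity(n):.4f}")
```

On the machine from q-01-03, whose ridge is 4.0 FLOPs/byte, this ideal matmul count crosses into the compute-bound regime at N = 24, while the add never leaves 1/12.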

q-01-05 — NUMA: dominant cost of remote access

  1. The remote DRAM chip is physically slower.
  2. The request must traverse the inter-socket interconnect (UPI/QPI), adding latency and consuming shared bandwidth.
  3. The OS marks remote pages as read-only by default.
  4. Cache coherence is disabled across sockets.
Answer **Choice 2.** The DRAM chip itself is no slower; the interconnect hop is the cost. NUMA-aware allocation (first-touch policy, `numactl`) keeps threads and data on the same socket.