Skip to content

English · Español

03 — The Roofline Model

🇪🇸 La ecuación que une todo: perf_alcanzable = min(FLOPS_pico, BW_pico × intensidad_aritmética). La gráfica resultante (líneas log-log que forman un "tejado") es el mapa donde colocas cualquier kernel para diagnosticar si está limitado por cómputo o por memoria.

This file is the most important page in Phase 1. Read it once, read it again, and don't move on until you can re-derive the formula and sketch the plot without looking.


The setup

A kernel does some work. We want to know how fast it can run on a given machine. Two ceilings constrain it:

  1. Compute ceiling. The machine can do at most π FLOPS (peak floating-point operations per second). No matter how perfectly cached, the kernel can't go faster than that.
  2. Memory ceiling. The machine can move at most β bytes per second from main memory to the registers. If the kernel reads more bytes than it does FLOPs, the FPUs sit idle waiting.

These ceilings combine through arithmetic intensity I — the ratio of FLOPs to bytes the kernel does.

Deriving the roofline formula

Let: - F = total FLOPs the kernel performs. - B = total bytes the kernel moves between DRAM and cache (we'll formalize this in a moment). - π = peak FLOPS of the machine. - β = peak bandwidth of the machine, bytes/sec.

We want performance = FLOPs per second. Two routes from the kernel to the wall clock:

Compute time = F / π. The kernel needs at least F / π seconds just to do the math. Memory time = B / β. The kernel needs at least B / β seconds just to move the bytes.

The kernel can overlap these — while waiting on memory, the FPUs can do other math (out-of-order execution makes this real). In the best case, the kernel runs at the max of the two times:

T_min = max(F / π, B / β)

Performance is F / T:

perf_max = F / max(F/π, B/β)
        = min(π, F × β / B)
        = min(π, I × β)         where I = F / B (arithmetic intensity)

This is the roofline equation:

attainable_perf = min(π, I × β)

Two regimes: - If I < π / βmemory-bound: perf = I × β. Doubling intensity doubles performance. - If I > π / βcompute-bound: perf = π. Already saturating the FPUs; more intensity buys nothing.

The crossover happens at I_crit = π / β — the machine balance point. For an i5-8250U:

π ≈ 200 GFLOPS  (AVX2 thermal-limited)
β ≈ 20 GB/s
I_crit = π / β = 200 GF / 20 GB = 10 FLOPs/byte

Any kernel below 10 FLOPs/byte on this machine is memory-bound. Naive matmul does 2 FLOPs/byte. It's bandwidth-bound by 5×.

Drawing the roofline

The plot is log-log: arithmetic intensity on the x-axis (FLOPs/byte), performance on the y-axis (GFLOPS).

GFLOPS (log)
   ^
 π ┤                ╭─────────────────────  ← compute ceiling
   │               ╱
   │              ╱
   │             ╱
   │            ╱
   │           ╱
   │          ╱   ← memory ceiling: slope = β
   │         ╱
   │        ╱
   │       ╱
   └───────┴──────────────────────────────→
           I_crit            arithmetic intensity (log)

The two segments form a "roof" — hence roofline. The corner is at I_crit. Any kernel plots as a single (intensity, performance) dot; the closer the dot is to the roof, the better-tuned the kernel.

Where naive matmul lives

For C = A × B with all matrices N×N fp32:

  • FLOPs: 2 × N³ (one multiply + one accumulate per inner iteration; N³ inner iterations).
  • Bytes (worst case, no reuse): read each element of A N times (N rows of A × N columns of B), read each element of B N times, write C once. Total reads = 2 × N × N² = 2 N³ fp32 = 8 N³ bytes. Writes negligible.
  • Intensity: 2 N³ FLOPs / 8 N³ bytes = 0.25 FLOPs/byte.

A quarter of a FLOP per byte. The i5-8250U's roofline puts that at 0.25 × 20 GB/s = 5 GFLOPS. 2.5% of peak.

But if you block the matmul so each block fits in L1, you read each element of A only N/B times (B = block size), not N. Intensity scales with block size. At a 64×64 block (16 KiB, comfortably in 32 KiB L1), intensity ≈ 2N³ / (3 × 64²) ≈ N / 6 FLOPs/byte. For N = 1024, intensity ≈ 170 — well above I_crit = 10. Compute-bound. Naive 5 GF → tuned 200 GF. 40× speedup.

This is why np.matmul is 50× faster than three nested Python loops. The math is identical. The memory access pattern is not.

Three honest caveats about the roofline

The roofline is a first-order model. Three things it doesn't tell you:

  1. Which cache level β represents. The plot above uses DRAM bandwidth. You can draw separate roofs for L1-BW, L2-BW, L3-BW, DRAM-BW, each a higher ceiling. A kernel whose working set fits in L1 runs against the L1 roof, not the DRAM roof. Real-world plots stack the rooflines.
  2. Latency vs bandwidth. The model assumes you can saturate β — i.e., enough parallel memory requests in flight. A pointer-chasing workload sees β_effective ≈ 1 byte / 70 ns ≈ 14 MB/s instead of 20 GB/s. Different roof.
  3. Mixed precision. If half your kernel is fp32 and half is fp16, π is a weighted average, and the roof bends. For most of the curriculum, single precision dominates. We'll revisit in Phase 26 (quantization).

Why this is the mental model for AI hardware

Every accelerator pitch is implicitly a roofline argument:

  • "H100 has 2× the bandwidth of A100" → the memory-ceiling slope is 2× steeper. Memory-bound kernels (most inference) get ~2×. Compute-bound kernels (training with large batch) get ~1.5× from the FLOPS bump.
  • "Flash attention is 3× faster" → it doesn't change the FLOPs; it lowers B by keeping the working set in SRAM. Same roof, higher intensity → higher dot.
  • "Quantized inference is faster" → reduces bytes per weight, raising intensity for the same kernel. Same compute, different memory traffic.
  • "This kernel is GEMM-bound" → GEMM is compute-bound; the rest of the model is memory-bound. The roofline plot shows GEMM near the corner and everything else hugging the slope.

Once you can read these one-liners as roofline statements, you've graduated Phase 1.

How to actually compute these for your machine

There are three tractable paths:

  1. Hand calculation from lscpu + datasheet. Use the formula π = clock × cores × SIMD_width × FMA_factor. Use the memory channel count + DDR4 speed × 8 bytes for β. This is what lab 03 does as the default path.
  2. Microbenchmarks. Run a tight loop that's purely arithmetic (saturates FPUs) and one that's purely memcpy (saturates DRAM). Time them. Compute π and β from the timings. Labs 01 and 03 together do this.
  3. A tool. peakperf, Empirical Roofline Tool (ERT), LIKWID-perfctr. Worth knowing exists. Not assumed installed in lab 03.

Lab 03 has you do (1) and (2) and compare. (3) is optional.

Drill problems (work these before lab)

Solutions in solutions/03-roofline-ref.md — written at phase open, not visible during pre-write. Try them by reasoning, not running.

  1. A kernel does dot(a, b) for two length-N vectors. FLOPs = 2N - 1. Bytes moved (no reuse) = 8N (two fp32 vectors). Intensity? Bandwidth- or compute-bound on the i5-8250U?
  2. softmax(x) for length-N requires: read x (N fp32), compute max (N FLOPs), subtract max (N FLOPs), exp (N FLOPs), sum (N FLOPs), divide (N FLOPs), write result (N fp32). Total FLOPs = 5N (roughly). Bytes = 8N (one read + one write). Intensity? Why is softmax memory-bound on every machine ever built?
  3. Naive matmul(N×N, N×N) intensity = 0.25 FLOPs/byte. Show that a B×B-tiled matmul with B chosen so that 3·B² fp32 ≤ L1_size has intensity ≈ B/6. What's the largest B on Borja's i5-8250U (32 KiB L1)? What intensity does that give? Is it compute-bound?

If you can answer all three from memory + back-of-envelope arithmetic, you understand the roofline. If any one feels wobbly, re-read the relevant section above.

One-paragraph recap

The roofline equation perf = min(π, I × β) plots a "roof" with a flat compute ceiling and a sloped memory ceiling, joined at the machine-balance intensity I_crit = π / β. Every kernel is a dot on this plot. Kernels with intensity below I_crit are bandwidth-bound; above, compute-bound. Naive matmul lives near I = 0.25, deeply bandwidth-bound; blocked matmul raises intensity to ~N/6, compute-bound. This single picture explains why memory hierarchy optimization is the dominant kind of optimization in AI, and lets you predict — before running anything — which side of the roof your code will land on.


Done with theory. On to lab/00-machine-profile.md.