05 — CPU-only roofline on the i5-8250U, and how GPU shifts the ceiling¶
Before talking about GPUs, we measure the ceiling of the only hardware Borja has on hand (Intel i5-8250U, 4C/8T). We compute FLOPS and bandwidth, place the mini-GPT kernels on the plot, and see which dimension an A100 would change. The shape of the roofline does not change; the numbers do, by a lot.
This page is the CPU-only depth-pass companion to theory/04-gpu-roofline.md. It anchors the GPU roofline equation in the concrete CPU Borja has on hand, and quantifies the shift a GPU would produce.
The i5-8250U on paper¶
From learners/borja/profile.md:
- Architecture: Kaby Lake R (2018), Intel Core i5-8250U.
- Cores / threads: 4 cores / 8 threads.
- Base / turbo clock: 1.6 GHz base / 3.4 GHz turbo (single-core).
- L1d cache: 32 KiB / core, 8-way.
- L2 cache: 256 KiB / core, 4-way.
- L3 cache: 6 MiB shared.
- Memory: DDR4-2400, dual channel, 19.2 GB/s peak theoretical.
- AVX2: yes (256-bit SIMD).
- AVX-512: no.
Computing the peak FLOPS¶
A Kaby Lake R core can do:
- 2 fused multiply-add (FMA) units per core.
- Each FMA handles 256-bit AVX2 = 8 single-precision floats or 4 double-precision.
- FMA is "2 FLOPs per element" (1 mul + 1 add).
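Single-core peak fp32 throughput, at the 3.4 GHz single-core turbo:
\[ 2 \text{ FMA units} \times 8 \text{ lanes} \times 2 \text{ FLOPs} \times 3.4 \text{ GHz} = 108.8 \text{ GFLOPS} \]
For 4 cores at sustained turbo (~2.8 GHz under load, not all-core turbo):
\[ 4 \text{ cores} \times 32 \text{ FLOPs/cycle} \times 2.8 \text{ GHz} = 358.4 \text{ GFLOPS} \]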
In practice, AVX2 frequency throttling and thermal limits on a U-series CPU drop this to ~250-300 GFLOPS sustained. Call it \(\pi \approx 250\) GFLOPS for our roofline.
For fp64, halve the lane count: \(\pi_\text{fp64} \approx 125\) GFLOPS.
For bf16/fp16, the i5-8250U has no native support (no AVX-512-BF16, no AMX). Software emulation runs at fp32 throughput or worse — so \(\pi_\text{bf16} \approx \pi_\text{fp32}\) in the best case, often slower.
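The peak-throughput arithmetic in this section can be checked with a few lines of Python. This is a minimal sketch: the function name `peak_gflops` and its parameters are illustrative, and the clock values are the ones quoted above rather than measurements.

```python
def peak_gflops(cores: int, ghz: float, fma_units: int = 2,
                simd_bits: int = 256, elem_bits: int = 32) -> float:
    """Theoretical peak GFLOPS: cores x FMA units x SIMD lanes x 2 FLOPs x clock."""
    lanes = simd_bits // elem_bits            # 8 lanes for fp32, 4 for fp64 on AVX2
    flops_per_cycle = fma_units * lanes * 2   # one FMA counts as a mul plus an add
    return cores * flops_per_cycle * ghz


single_core_fp32 = peak_gflops(cores=1, ghz=3.4)               # single-core turbo
four_core_fp32 = peak_gflops(cores=4, ghz=2.8)                 # sustained clock under load
four_core_fp64 = peak_gflops(cores=4, ghz=2.8, elem_bits=64)   # half the lanes

print(f"1 core  fp32 @ 3.4 GHz: {single_core_fp32:.1f} GFLOPS")
print(f"4 cores fp32 @ 2.8 GHz: {four_core_fp32:.1f} GFLOPS")
print(f"4 cores fp64 @ 2.8 GHz: {four_core_fp64:.1f} GFLOPS")
```

These are paper ceilings. The roofline on this page uses the derated \(\pi \approx 250\) GFLOPS, which already accounts for AVX2 throttling and thermal limits.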
Computing the peak bandwidth¶
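DDR4-2400 performs \(2400 \times 10^6\) transfers per second over a 64-bit (8-byte) channel:
\[ 2400 \times 10^{6} \text{ transfers/s} \times 8 \text{ B/transfer} = 19.2 \text{ GB/s} \]
This is the per-channel figure and matches the 19.2 GB/s quoted in the profile; with both channels populated the nominal peak is twice that (38.4 GB/s). In practice, after memory-controller overhead, sustained streaming reads measure 15-19 GB/s on this class of CPU. Call it \(\beta \approx 16\) GB/s for our roofline.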
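As a quick cross-check of the bandwidth arithmetic, the sketch below computes nominal DDR bandwidth from the transfer rate. The helper name `ddr_peak_gbs` is hypothetical, and the 8-byte bus width is the standard 64-bit DDR4 channel; the number of populated channels is an input you should confirm on the actual machine.

```python
def ddr_peak_gbs(mega_transfers_per_s: float, channels: int = 1,
                 bus_bytes: int = 8) -> float:
    """Nominal DDR bandwidth in GB/s: transfers/s x bytes per transfer x channels."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


per_channel = ddr_peak_gbs(2400)                # one 64-bit DDR4-2400 channel
two_channels = ddr_peak_gbs(2400, channels=2)   # both channels populated
beta = 16.0                                     # sustained value used on this page

print(f"nominal, 1 channel : {per_channel:.1f} GB/s")
print(f"nominal, 2 channels: {two_channels:.1f} GB/s")
print(f"beta / 1-channel nominal: {beta / per_channel:.0%}")
```

The 19.2 GB/s peak quoted on this page corresponds to the one-channel result, which is why the STREAM measurement in the lab is the number to trust for \(\beta\).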
Machine balance¶
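\[ \frac{\pi}{\beta} \approx \frac{250 \text{ GFLOPS}}{16 \text{ GB/s}} \approx 15.6 \text{ FLOPs/byte} \]
The crossover: an operator with arithmetic intensity below ~16 FLOPs/byte is memory-bound on this CPU; above it, the operator is compute-bound.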
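The crossover rule can be written directly as the roofline bound \(\min(\pi, I \beta)\). The sketch below uses this page's constants; the function names and the sample intensities are illustrative only.

```python
PI_GFLOPS = 250.0   # derated fp32 compute ceiling used on this page
BETA_GBS = 16.0     # sustained memory bandwidth used on this page


def machine_balance(pi_gflops: float, beta_gbs: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which the two ceilings meet."""
    return pi_gflops / beta_gbs


def attainable_gflops(intensity: float, pi_gflops: float = PI_GFLOPS,
                      beta_gbs: float = BETA_GBS) -> float:
    """Roofline bound: min(pi, I * beta)."""
    return min(pi_gflops, intensity * beta_gbs)


balance = machine_balance(PI_GFLOPS, BETA_GBS)
print(f"machine balance: {balance:.3f} FLOPs/byte")

# Sample intensities chosen for illustration, not taken from a real kernel.
for intensity in (0.5, 4.0, balance, 64.0):
    regime = "memory-bound" if intensity < balance else "compute-bound"
    print(f"I = {intensity:6.2f} -> {attainable_gflops(intensity):6.1f} GFLOPS ({regime})")
```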
Where the mini-GPT operators land¶
Recall Phase-17 mini-GPT: \(d_\text{model} = 64, n_\text{heads} = 4, d_h = 16, L = 2, d_\text{ff} = 256\).
Embedding lookup¶
Operation: gather \(T\) rows from a \((V, d_\text{model}) = (512, 64)\) table.
- FLOPs: 0 (it's a memory op).
- Bytes: \(T \times d_\text{model} \times s = T \cdot 64 \cdot 4 = 256 T\) bytes.
- Arithmetic intensity: 0 FLOPs/byte. Pure memory-bound. On the roofline this sits at \(I = 0\), so throughput is set entirely by \(\beta = 16\) GB/s. Generating one token requires reading 256 B; that's \(16 \times 10^9 / 256 = 6.25 \times 10^7\) tokens/s if memory-bound, vastly more than we need.
Linear layer (Q projection, \(d_\text{model} \to d_\text{model}\))¶
For a single token forward pass:
- Op: matmul \((1, d_\text{model}) \times (d_\text{model}, d_\text{model})\).
- FLOPs: \(2 \cdot 1 \cdot 64 \cdot 64 = 8192\) FLOPs.
- Bytes (weights + input + output): \(64 \cdot 64 \cdot 4 + 64 \cdot 4 + 64 \cdot 4 = 16{,}896\) bytes (weights dominate).
- Arithmetic intensity: \(8192 / 16{,}896 \approx 0.48\) FLOPs/byte. Memory-bound. Performance limited to \(\sim 0.48 \times 16 \text{ GB/s} = 7.7\) GFLOPS.
For a batch of \(B\) tokens:
- FLOPs: \(2 \cdot B \cdot 64 \cdot 64 = 8192 B\).
- Bytes: \(64 \cdot 64 \cdot 4 + B \cdot 64 \cdot 4 + B \cdot 64 \cdot 4 = 16384 + 512 B\).
- Arithmetic intensity: \(8192 B / (16384 + 512 B)\).
- At \(B = 8\): \(65{,}536 / (16384 + 4096) = 65{,}536 / 20480 \approx 3.2\) FLOPs/byte. Still memory-bound.
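- At \(B = 128\): \(1{,}048{,}576 / (16384 + 65{,}536) = 1{,}048{,}576 / 81{,}920 = 12.8\) FLOPs/byte. Approaching compute-bound but not yet there; as \(B\) grows the ratio saturates at \(8192 / 512 = 16\) FLOPs/byte, only just past the machine balance.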
This is why batched inference is so much more efficient than single-sample: amortizing the weight load across many tokens raises the arithmetic intensity.
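To see the amortization effect numerically, the sketch below evaluates the intensity formula from the bullets above for several batch sizes. The function name `linear_intensity` is illustrative, and the byte count assumes each tensor is moved exactly once, as in the hand calculation.

```python
def linear_intensity(batch: int, d_in: int = 64, d_out: int = 64,
                     elem_bytes: int = 4) -> float:
    """FLOPs per byte for a (batch, d_in) x (d_in, d_out) matmul, each tensor moved once."""
    flops = 2 * batch * d_in * d_out
    weight_bytes = d_in * d_out * elem_bytes
    activation_bytes = batch * (d_in + d_out) * elem_bytes  # input read + output write
    return flops / (weight_bytes + activation_bytes)


for batch in (1, 8, 128, 4096):
    print(f"B = {batch:5d}: {linear_intensity(batch):5.2f} FLOPs/byte")

# As B grows the weight term is amortized away and the ratio tends to
# 2*d_in*d_out / ((d_in + d_out) * elem_bytes), i.e. 8192 / 512 = 16 for 64 -> 64 in fp32.
```

Under this counting model a 64-wide fp32 layer can never exceed 16 FLOPs/byte, however large the batch; only a wider hidden dimension raises that limit.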
Attention score \(QK^T\)¶
For one sequence of \(S\) tokens, \(H = 4\) heads:
- Op: \((S, d_h) \times (d_h, S)\) per head.
- FLOPs per head: \(2 \cdot S \cdot d_h \cdot S = 2 S^2 d_h\).
- Bytes per head: \(2 S \cdot d_h \cdot s\) (read Q, K) + \(S^2 \cdot s\) (write scores) = \(S \cdot d_h \cdot 2s + S^2 \cdot s\).
- At \(S = 64, d_h = 16, s = 4\): FLOPs \(= 2 \cdot 4096 \cdot 16 = 131{,}072\); bytes \(= 64 \cdot 16 \cdot 8 + 4096 \cdot 4 = 8192 + 16384 = 24576\). AI \(= 131{,}072 / 24576 \approx 5.3\) FLOPs/byte. Memory-bound on i5-8250U.
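The per-head count above generalizes to any sequence length. The following is a small sketch with a hypothetical helper, `attention_score_intensity`, using the same traffic model as the bullets (read Q and K once, write the score matrix once).

```python
def attention_score_intensity(seq_len: int, d_head: int = 16,
                              elem_bytes: int = 4) -> float:
    """FLOPs per byte for one head of Q @ K^T under the read-Q/K, write-scores model."""
    flops = 2 * seq_len * seq_len * d_head
    qk_bytes = 2 * seq_len * d_head * elem_bytes      # read Q and K
    score_bytes = seq_len * seq_len * elem_bytes      # write the S x S scores
    return flops / (qk_bytes + score_bytes)


for seq_len in (16, 64, 256, 1024):
    print(f"S = {seq_len:5d}: {attention_score_intensity(seq_len):4.2f} FLOPs/byte")

# For large S the score write dominates and the ratio tends to
# 2 * d_head / elem_bytes, i.e. 32 / 4 = 8 FLOPs/byte for d_head = 16 in fp32.
```

With \(d_h = 16\) the limit of 8 FLOPs/byte sits below the CPU's machine balance, so under this model the score computation stays memory-bound at every sequence length.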
Roofline diagram (mental)¶
    250 GFLOPS ── ── ── ── ── ── ── ── ── ── ──┐ compute ceiling
                                               │
                                              ╱
                                             ╱
                                            ╱ ←─── Linear, B=128
                                           ╱
                                          ╱
                                         ╱ ←─── Attention, S=64
                                        ╱
                                       ╱
                                      ╱ ←─── Linear, B=8
                                     ╱
                                    ╱
                                   ╱ ←─── Linear, B=1 (memory-bound)
                                   ↑
                               I = 0.48
Every mini-GPT kernel analyzed here sits to the left of the machine-balance crossover, so every one is memory-bound. The compute ceiling is 250 GFLOPS, but the bandwidth bound \(I \beta\) caps Linear at \(B = 1\) near 7.7 GFLOPS, Linear at \(B = 8\) near 51 GFLOPS, and attention at \(S = 64\) near 85 GFLOPS.
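The diagram's points can be turned into coordinates. The sketch below takes the FLOP and byte counts worked out in the operator sections and applies \(\min(\pi, I \beta)\) with this page's constants; the table layout is only one possible presentation.

```python
PI_GFLOPS = 250.0   # compute ceiling used on this page
BETA_GBS = 16.0     # bandwidth ceiling used on this page

# (name, FLOPs, bytes moved), copied from the operator sections of this page.
kernels = [
    ("Linear, B=1", 8_192, 16_896),
    ("Linear, B=8", 65_536, 20_480),
    ("Attention QK^T, S=64", 131_072, 24_576),
    ("Linear, B=128", 8_192 * 128, 16_384 + 512 * 128),
]

rows = []
for name, flops, nbytes in kernels:
    intensity = flops / nbytes
    bound = min(PI_GFLOPS, intensity * BETA_GBS)   # roofline bound in GFLOPS
    rows.append((name, intensity, bound))

# Print left to right along the intensity axis.
for name, intensity, bound in sorted(rows, key=lambda r: r[1]):
    share = 100 * bound / PI_GFLOPS
    print(f"{name:<22} I = {intensity:5.2f} FLOPs/byte  "
          f"bound = {bound:6.1f} GFLOPS ({share:4.1f}% of peak)")
```

These are upper bounds from the model, not measurements; real kernels at this size also pay call and loop overhead that the roofline does not capture.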
How GPU shifts the roofline¶
An NVIDIA A100 (Phase 23-24 target):
- \(\pi_\text{fp32}\) = 19.5 TFLOPS = ~78× the i5-8250U.
- \(\pi_\text{fp16}\) = 312 TFLOPS (Tensor Cores) = ~1250× the i5-8250U fp32.
- \(\beta\) = 1.5 TB/s = ~94× the i5-8250U.
- Machine balance fp32: \(19500 / 1500 = 13\) FLOPs/byte (similar to i5-8250U).
- Machine balance fp16 with Tensor Cores: \(312000 / 1500 = 208\) FLOPs/byte (much higher).
What this means:
- For fp32 GEMMs, the machine balance is similar on CPU and GPU — operators that are memory-bound on CPU stay memory-bound on GPU, but the absolute throughput is ~100× higher.
- For fp16 GEMMs with Tensor Cores, the machine balance is much higher, so operators with intensity between roughly 13 and 208 FLOPs/byte, which would be compute-bound at the fp32 ratio, become memory-bound at the new one. To stay compute-bound on Tensor Cores, you need arithmetic intensity of 200+ FLOPs/byte, which means very large batch × seq × hidden products.
- The shape of the roofline doesn't change — same equation, \(\min(\pi, I \beta)\). Only the ceilings and the crossover.
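The comparison above can be reproduced with one small data structure. This is a sketch using the ceilings quoted on this page; the `Machine` class and the probe intensity of 100 FLOPs/byte are illustrative choices, not part of the course code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    pi_gflops: float   # compute ceiling
    beta_gbs: float    # memory bandwidth ceiling

    @property
    def balance(self) -> float:
        """Crossover intensity in FLOPs/byte."""
        return self.pi_gflops / self.beta_gbs

    def bound(self, intensity: float) -> float:
        """Roofline bound min(pi, I * beta), in GFLOPS."""
        return min(self.pi_gflops, intensity * self.beta_gbs)


cpu = Machine("i5-8250U fp32", 250.0, 16.0)
a100_fp32 = Machine("A100 fp32", 19_500.0, 1_500.0)
a100_fp16 = Machine("A100 fp16 Tensor Cores", 312_000.0, 1_500.0)

probe = 100.0  # hypothetical kernel intensity, FLOPs/byte
for m in (cpu, a100_fp32, a100_fp16):
    regime = "memory-bound" if probe < m.balance else "compute-bound"
    print(f"{m.name:<24} balance = {m.balance:6.1f}  "
          f"bound at I=100: {m.bound(probe):9.1f} GFLOPS ({regime})")
```

A kernel at 100 FLOPs/byte is compute-bound against both fp32 ceilings but memory-bound against the Tensor Core ceiling, which is the shift the second bullet describes.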
Practical bench Borja should run¶
Phase 23 lab 02-bandwidth-test.md runs STREAM on the i5-8250U. Expected output: ~15-18 GB/s on the Triad benchmark. Compare to the 19.2 GB/s peak theoretical and you get the achievable fraction (~80-90%).
Then run a single-thread fp32 matmul benchmark on a \(1024 \times 1024\) matrix. Expected: 3-5 GFLOPS sustained, far below the 108 GFLOPS single-core peak, because three \(1024 \times 1024\) fp32 matrices take 12 MiB and exceed even the 6 MiB L3, so an unblocked kernel streams its operands from memory and becomes memory-bound. On a \(128 \times 128\) matrix (192 KiB for all three operands, which fits in the 256 KiB L2 but not in the 32 KiB L1d), expect 30-50 GFLOPS.
These two numbers — STREAM bandwidth, in-cache matmul GFLOPS — are your CPU roofline. Plot the matmul at multiple sizes; the curve traces the roofline.
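When the lab produces timings, they need converting into roofline coordinates. The helpers below are a sketch of that bookkeeping, not the lab's own code: the names are hypothetical, the cache sizes come from the profile listed earlier, and the Triad byte count follows STREAM's convention of three arrays of traffic per element.

```python
KIB = 1024
# Per-core L1d and L2, shared L3, as listed in the profile on this page.
CACHES = [("L1d", 32 * KIB), ("L2", 256 * KIB), ("L3", 6 * 1024 * KIB)]


def matmul_working_set(n: int, elem_bytes: int = 4) -> int:
    """Bytes occupied by A, B and C for an n x n matmul."""
    return 3 * n * n * elem_bytes


def smallest_fitting_cache(nbytes: int) -> str:
    """First cache level large enough for nbytes, else 'DRAM'."""
    for name, size in CACHES:
        if nbytes <= size:
            return name
    return "DRAM"


def matmul_gflops(n: int, seconds: float) -> float:
    """Achieved GFLOPS for an n x n x n matmul (2*n^3 FLOPs) timed at `seconds`."""
    return 2 * n**3 / seconds / 1e9


def triad_gbs(n_elems: int, seconds: float, elem_bytes: int = 8) -> float:
    """Achieved GB/s for Triad a[i] = b[i] + q*c[i]: two reads and one write per element."""
    return 3 * n_elems * elem_bytes / seconds / 1e9


for n in (128, 1024):
    ws = matmul_working_set(n)
    print(f"n = {n:4d}: working set {ws / KIB:8.0f} KiB -> {smallest_fitting_cache(ws)}")
```

Feed the measured wall-clock times into `matmul_gflops` and `triad_gbs` to get the two numbers that define the roofline.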
The §A13 lesson¶
Every kernel in the §A13 mini-GPT lives in the memory-bound region of the CPU roofline. We could run on a GPU and the kernels would still be memory-bound (relative to the much higher GPU ceiling). The microscopic scope lets you see this clearly without the complexity of large models. When you move to GPT-3-scale in Phase 23+ cloud labs, you'll see the same kernels behave fundamentally differently — large hidden dimensions push arithmetic intensity into the compute-bound regime, and Tensor Cores extract orders of magnitude more throughput.
The point of this page: you measure the CPU roofline by hand once, and the methodology never changes. Only the constants.
Citation¶
Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. CACM 52(4), 65-76. The original roofline paper. Phase 1 cited it for CPU; Phase 23 cites it again for GPU. Same equation.
One-paragraph recap¶
The i5-8250U has \(\pi \approx 250\) GFLOPS (fp32, AVX2 across 4 cores) and \(\beta \approx 16\) GB/s (DDR4-2400 dual channel), giving a machine balance of ~16 FLOPs/byte. Every kernel in the §A13 mini-GPT lives to the left of this crossover: Linear at \(B = 1\) has AI = 0.48 FLOPs/byte, and Linear at \(B = 128\) reaches ~12.8. The A100 GPU has \(\pi\) ~78× higher and \(\beta\) ~94× higher; the fp32 machine balance is similar, so the same kernels remain memory-bound at vastly higher absolute throughput. The roofline equation is invariant across CPU and GPU; only the constants change.
Cross-refs: theory/04-gpu-roofline.md (the same equation on GPU constants), Phase 1 theory/03-roofline-model.md (the original CPU derivation), lab/02-bandwidth-test.md (the STREAM benchmark you run on the i5).