
05 — CPU-only roofline on the i5-8250U, and how GPU shifts the ceiling¶

Before talking about GPUs, we measure the ceiling of the only hardware Borja has on hand (Intel i5-8250U, 4C/8T). We compute FLOPS and bandwidth, place the mini-GPT kernels on the plot, and see which dimension an A100 would change. The shape of the roofline does not change; the numbers do, by a lot.

This page is the CPU-only depth-pass companion to theory/04-gpu-roofline.md. It anchors the GPU roofline equation in the concrete CPU Borja has on hand, and quantifies the shift a GPU would produce.

The i5-8250U on paper¶

From learners/borja/profile.md:

Architecture: Kaby Lake R (2018), Intel Core i5-8250U.
Cores / threads: 4 cores / 8 threads.
Base / turbo clock: 1.6 GHz base / 3.4 GHz turbo (single-core).
L1d cache: 32 KiB / core, 8-way.
L2 cache: 256 KiB / core, 4-way.
L3 cache: 6 MiB shared.
Memory: DDR4-2400, dual channel, 19.2 GB/s per channel (38.4 GB/s peak theoretical).
AVX2: yes (256-bit SIMD).
AVX-512: no.

Computing the peak FLOPS¶

A Kaby Lake R core can do:

2 fused multiply-add (FMA) units per core.
Each FMA handles 256-bit AVX2 = 8 single-precision floats or 4 double-precision.
FMA is "2 FLOPs per element" (1 mul + 1 add).

Single-core peak fp32 throughput:

\[\pi_\text{core,fp32} = 2 \text{ FMAs} \times 8 \text{ floats} \times 2 \text{ FLOPs/float} \times 3.4 \text{ GHz} = 108.8 \text{ GFLOPS}\]

For 4 cores at sustained turbo (~2.8 GHz under load, not all-core turbo):

\[\pi_\text{chip,fp32} \approx 4 \times 2 \times 8 \times 2 \times 2.8 \text{ GHz} \approx 358 \text{ GFLOPS}\]

In practice, AVX2 frequency throttling and thermal limits on a U-series CPU drop this to ~250-300 GFLOPS sustained. Call it \(\pi \approx 250\) GFLOPS for our roofline.

For fp64, halve the lane count: \(\pi_\text{fp64} \approx 125\) GFLOPS.

For bf16/fp16, the i5-8250U has no native support (no AVX-512-BF16, no AMX). Software emulation runs at fp32 throughput or worse — so \(\pi_\text{bf16} \approx \pi_\text{fp32}\) in the best case, often slower.
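The peak-FLOPS arithmetic in this section can be sketched as a small helper. This is a minimal illustration, not a measurement: the function name `peak_gflops` and its parameters are made up for this example, and the 2.8 GHz all-core clock is the assumed sustained value used on this page. It returns the theoretical figure before the thermal and AVX2 derating that takes the estimate down to ~250 GFLOPS.

```python
def peak_gflops(cores, fma_units, simd_bits, dtype_bits, clock_ghz):
    """Theoretical peak in GFLOPS: cores x FMA units x SIMD lanes x 2 FLOPs x GHz."""
    lanes = simd_bits // dtype_bits  # elements processed per vector instruction
    return cores * fma_units * lanes * 2 * clock_ghz


# Single core at the 3.4 GHz single-core turbo.
single_core_fp32 = peak_gflops(1, 2, 256, 32, 3.4)
# Four cores at the assumed ~2.8 GHz sustained clock (before derating).
chip_fp32 = peak_gflops(4, 2, 256, 32, 2.8)
# fp64 halves the lane count: 4 lanes instead of 8.
chip_fp64 = peak_gflops(4, 2, 256, 64, 2.8)

print(f"single-core fp32: {single_core_fp32:.1f} GFLOPS")
print(f"4-core fp32:      {chip_fp32:.1f} GFLOPS")
print(f"4-core fp64:      {chip_fp64:.1f} GFLOPS")
```

Changing `simd_bits` or `fma_units` lets you redo the same estimate for another microarchitecture, provided you look up its real port layout first.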

Computing the peak bandwidth¶

DDR4-2400, dual channel:

\[\beta = 2 \text{ channels} \times 8 \text{ bytes} \times 2400 \text{ MT/s} = 38.4 \text{ GB/s peak}\]

In practice, with two memory channels and the chipset overhead, sustained streaming reads measure 15-19 GB/s on this class of CPU. Call it \(\beta \approx 16\) GB/s for our roofline.

Machine balance¶

\[\text{machine balance} = \pi / \beta = 250 \text{ GFLOPS} / 16 \text{ GB/s} = 15.6 \text{ FLOPs/byte}\]

The crossover: an operator with arithmetic intensity below ~16 FLOPs/byte is memory-bound on this CPU; above, it's compute-bound.
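The bandwidth and machine-balance arithmetic above can be reproduced in a few lines. A minimal sketch: the names `peak_bandwidth_gbs` and `regime` are hypothetical helpers for this example, and `PI_GFLOPS` / `BETA_GBS` are the derated values chosen on this page rather than measurements.

```python
PI_GFLOPS = 250.0  # derated fp32 peak used on this page
BETA_GBS = 16.0    # sustained bandwidth used on this page


def peak_bandwidth_gbs(channels, bytes_per_transfer, mega_transfers_per_s):
    """Theoretical DRAM bandwidth in GB/s."""
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000.0


def regime(intensity, pi_gflops=PI_GFLOPS, beta_gbs=BETA_GBS):
    """Classify an operator by comparing its intensity to the machine balance."""
    return "memory-bound" if intensity < pi_gflops / beta_gbs else "compute-bound"


theoretical_beta = peak_bandwidth_gbs(2, 8, 2400)  # DDR4-2400, dual channel
balance = PI_GFLOPS / BETA_GBS                     # FLOPs per byte

print(f"theoretical peak bandwidth: {theoretical_beta:.1f} GB/s")
print(f"machine balance:            {balance:.1f} FLOPs/byte")
print(f"I = 0.48 -> {regime(0.48)}")
print(f"I = 20   -> {regime(20.0)}")
```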

Where the mini-GPT operators land¶

Recall Phase-17 mini-GPT: \(d_\text{model} = 64, n_\text{heads} = 4, d_h = 16, L = 2, d_\text{ff} = 256\).

Embedding lookup¶

Operation: gather \(T\) rows from a \((V, d_\text{model}) = (512, 64)\) table.

FLOPs: 0 (it's a memory op).
Bytes: \(T \times d_\text{model} \times s = T \cdot 64 \cdot 4 = 256 T\) bytes.
Arithmetic intensity: 0 FLOPs/byte. Pure memory-bound. On the roofline this sits at \(I = 0\), where the attainable FLOP rate is zero and the only meaningful limit is the bandwidth \(\beta = 16\) GB/s. Generating one token requires reading 256 B, so the bandwidth ceiling allows \(16 \times 10^9 / 256 = 6.25 \times 10^7\) tokens/s, vastly more than we need.

Linear layer (Q projection, \(d_\text{model} \to d_\text{model}\))¶

For a single token forward pass:

Op: matmul \((1, d_\text{model}) \times (d_\text{model}, d_\text{model})\).
FLOPs: \(2 \cdot 1 \cdot 64 \cdot 64 = 8192\) FLOPs.
Bytes (weights + input + output): \(64 \cdot 64 \cdot 4 + 64 \cdot 4 + 64 \cdot 4 = 16{,}896\) bytes (weights dominate).
Arithmetic intensity: \(8192 / 16{,}896 \approx 0.48\) FLOPs/byte. Memory-bound. Performance limited to \(\sim 0.48 \times 16 \text{ GB/s} = 7.7\) GFLOPS.

For a batch of \(B\) tokens:

FLOPs: \(2 \cdot B \cdot 64 \cdot 64 = 8192 B\).
Bytes: \(64 \cdot 64 \cdot 4 + B \cdot 64 \cdot 4 + B \cdot 64 \cdot 4 = 16384 + 512 B\).
Arithmetic intensity: \(8192 B / (16384 + 512 B)\).
At \(B = 8\): \(65{,}536 / (16384 + 4096) = 65{,}536 / 20480 \approx 3.2\) FLOPs/byte. Still memory-bound.
At \(B = 128\): \(1{,}048{,}576 / (16384 + 65536) = 1{,}048{,}576 / 81920 = 12.8\) FLOPs/byte. Approaching compute-bound but not yet there.

This is why batched inference is so much more efficient than single-sample: amortizing the weight load across many tokens raises the arithmetic intensity.
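The batch-size dependence can be checked numerically. The sketch below uses a hypothetical helper, `linear_intensity`, that applies the same byte accounting as this section (weights read once, inputs and outputs moved once per token). As \(B\) grows, the weight term is amortized away and the intensity approaches \(2 \cdot 64 \cdot 64 / ((64 + 64) \cdot 4) = 16\) FLOPs/byte, just above the ~15.6 machine balance.

```python
def linear_intensity(batch, d_in=64, d_out=64, dtype_bytes=4):
    """Arithmetic intensity (FLOPs/byte) of a (batch, d_in) x (d_in, d_out) matmul."""
    flops = 2 * batch * d_in * d_out
    weight_bytes = d_in * d_out * dtype_bytes                 # read once, shared by the batch
    activation_bytes = batch * (d_in + d_out) * dtype_bytes   # inputs read + outputs written
    return flops / (weight_bytes + activation_bytes)


for b in (1, 8, 128, 1024):
    print(f"B = {b:5d}: {linear_intensity(b):6.2f} FLOPs/byte")

# Large-batch limit: the weight term vanishes relative to the activations.
limit = 2 * 64 * 64 / ((64 + 64) * 4)
print(f"limit as B grows: {limit:.1f} FLOPs/byte")
```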

Attention score \(QK^T\)¶

For one sequence of \(S\) tokens, \(H = 4\) heads:

Op: \((S, d_h) \times (d_h, S)\) per head.
FLOPs per head: \(2 \cdot S \cdot d_h \cdot S = 2 S^2 d_h\).
Bytes per head: \(2 S \cdot d_h \cdot s\) (read Q, K) + \(S^2 \cdot s\) (write scores) = \(S \cdot d_h \cdot 2s + S^2 \cdot s\).
At \(S = 64, d_h = 16, s = 4\): FLOPs \(= 2 \cdot 4096 \cdot 16 = 131{,}072\); bytes \(= 64 \cdot 16 \cdot 8 + 4096 \cdot 4 = 8192 + 16384 = 24576\). AI \(= 131{,}072 / 24576 \approx 5.3\) FLOPs/byte. Memory-bound on i5-8250U.
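The same accounting can be written as a function of sequence length. This is a sketch under the assumptions of this section only (Q and K read once, the full score matrix written once, fp32); the name `attention_scores_intensity` is made up for the example. Because the \(S^2\) score write grows as fast as the FLOPs, the intensity saturates at \(2 d_h / s = 8\) FLOPs/byte for \(d_h = 16\), so longer sequences alone never cross the CPU's machine balance.

```python
def attention_scores_intensity(seq_len, d_head=16, dtype_bytes=4):
    """Arithmetic intensity (FLOPs/byte) of one head's Q @ K^T score computation."""
    flops = 2 * seq_len * seq_len * d_head
    qk_bytes = 2 * seq_len * d_head * dtype_bytes   # read Q and K
    score_bytes = seq_len * seq_len * dtype_bytes   # write the S x S score matrix
    return flops / (qk_bytes + score_bytes)


for s in (16, 64, 256, 4096):
    print(f"S = {s:5d}: {attention_scores_intensity(s):5.2f} FLOPs/byte")

# Long-sequence limit: 2 * d_head / dtype_bytes.
limit = 2 * 16 / 4
print(f"limit as S grows: {limit:.1f} FLOPs/byte")
```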

Roofline diagram (mental)¶

       250 GFLOPS  ── ── ── ── ── ── ── ── ── ── ──┐   compute ceiling
                                                   │
                                                ╱
                                            ╱
                                        ╱   ←─── Linear, B=128
                                    ╱
                                ╱
                            ╱   ←─── Attention, S=64
                        ╱
                    ╱
                ╱   ←─── Linear, B=8
            ╱
        ╱
    ╱   ←─── Linear, B=1 (memory-bound)
       ↑
       I = 0.48

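Every kernel analyzed above sits to the left of the machine balance crossover, so every one is memory-bound. The CPU has 250 GFLOPS, but the bandwidth ceiling \(I \beta\) caps these kernels well below that: roughly 8 GFLOPS for Linear at \(B = 1\), 51 GFLOPS for Linear at \(B = 8\), 85 GFLOPS for Attention at \(S = 64\), and 205 GFLOPS for Linear at \(B = 128\).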
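To put numbers on the diagram, the sketch below evaluates \(\min(\pi, I \beta)\) for each kernel, using the FLOP and byte counts derived earlier on this page. The dictionary layout and variable names are illustrative choices, and the results are model ceilings, not measured throughput.

```python
PI_GFLOPS = 250.0  # derated fp32 peak used on this page
BETA_GBS = 16.0    # sustained bandwidth used on this page

# Arithmetic intensity = FLOPs / bytes, taken from the derivations above.
kernels = {
    "Embedding lookup": 0.0,
    "Linear, B=1": 8192 / 16896,
    "Linear, B=8": 65536 / 20480,
    "Attention, S=64": 131072 / 24576,
    "Linear, B=128": 1048576 / 81920,
}

balance = PI_GFLOPS / BETA_GBS
bounds = {name: min(PI_GFLOPS, ai * BETA_GBS) for name, ai in kernels.items()}

for name, ai in kernels.items():
    side = "memory-bound" if ai < balance else "compute-bound"
    print(f"{name:18s} I = {ai:5.2f}  ceiling = {bounds[name]:6.1f} GFLOPS  ({side})")
```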

How GPU shifts the roofline¶

An NVIDIA A100 (Phase 23-24 target):

\(\pi_\text{fp32}\) = 19.5 TFLOPS = ~78× the i5-8250U.
\(\pi_\text{fp16}\) = 312 TFLOPS (Tensor Cores) = ~1250× the i5-8250U fp32.
\(\beta\) = 1.5 TB/s = ~94× the i5-8250U.
Machine balance fp32: \(19500 / 1500 = 13\) FLOPs/byte (similar to i5-8250U).
Machine balance fp16 with Tensor Cores: \(312000 / 1500 = 208\) FLOPs/byte (much higher).

What this means:

For fp32 GEMMs, the machine balance is similar on CPU and GPU: operators that are memory-bound on the CPU stay memory-bound on the GPU, but their absolute throughput scales with the bandwidth ratio, ~94× higher.
For fp16 GEMMs with Tensor Cores, the machine balance is much higher: operators with arithmetic intensity between ~13 and ~208 FLOPs/byte, which would be compute-bound at fp32, become memory-bound at the new ratio. To stay compute-bound on Tensor Cores, you need arithmetic intensity of 200+ FLOPs/byte, which means very large batch × seq × hidden products.
The shape of the roofline doesn't change — same equation, \(\min(\pi, I \beta)\). Only the ceilings and the crossover.
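The comparison above can be sketched by evaluating the same roofline equation under three sets of constants. The `Machine` class is a hypothetical container for this example, the A100 figures are the ones quoted in this section, and \(I = 100\) is an arbitrary illustrative intensity chosen to fall between the fp32 and Tensor Core balances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    pi_gflops: float  # compute ceiling
    beta_gbs: float   # bandwidth ceiling

    @property
    def balance(self):
        """Crossover intensity in FLOPs/byte."""
        return self.pi_gflops / self.beta_gbs

    def attainable(self, intensity):
        """Roofline bound min(pi, I * beta) in GFLOPS."""
        return min(self.pi_gflops, intensity * self.beta_gbs)


cpu = Machine("i5-8250U fp32", 250.0, 16.0)
a100_fp32 = Machine("A100 fp32", 19500.0, 1500.0)
a100_fp16 = Machine("A100 fp16 Tensor Cores", 312000.0, 1500.0)

for intensity in (12.8, 100.0):  # Linear at B=128, then an illustrative larger op
    for m in (cpu, a100_fp32, a100_fp16):
        side = "memory-bound" if intensity < m.balance else "compute-bound"
        print(f"I = {intensity:5.1f}  {m.name:24s} "
              f"{m.attainable(intensity):9.1f} GFLOPS  ({side})")
```

At \(I = 100\) the operator is compute-bound on both fp32 machines but memory-bound under the Tensor Core ceiling, which is the shift described in the second bullet.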

Practical bench Borja should run¶

Phase 23 lab 02-bandwidth-test.md runs STREAM on the i5-8250U. Expected output: ~15-18 GB/s on the Triad benchmark. Compare to the 38.4 GB/s dual-channel theoretical peak (19.2 GB/s per channel) to get the achievable fraction: roughly 40-47% of the dual-channel peak, or 80-90% of a single channel.

Then run a single-thread fp32 matmul benchmark on a \(1024 \times 1024\) matrix. Expected: 3-5 GFLOPS sustained, far below the 108.8 GFLOPS single-core peak: each matrix is 4 MiB, so the three operands (12 MiB) exceed even the 6 MiB L3, and a kernel without cache blocking streams from DRAM and becomes memory-bound. On a \(128 \times 128\) matrix (64 KiB per operand, 192 KiB for all three, which fits in the 256 KiB L2 but not in the 32 KiB L1d), expect 30-50 GFLOPS.

These two numbers — STREAM bandwidth, in-cache matmul GFLOPS — are your CPU roofline. Plot the matmul at multiple sizes; the curve traces the roofline.
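As a rough stand-in for the two measurements, here is a NumPy sketch. It is not the lab's STREAM binary, and it makes assumptions worth stating: the helper names are invented for this example, the triad is done in two NumPy passes (so the byte count below is five array passes rather than STREAM's three), and NumPy's matmul calls whatever BLAS it was built against, which is cache-blocked and may be multithreaded. Expect its numbers to differ from the single-thread expectations above.

```python
import time

import numpy as np


def best_time(fn, repeats=5):
    """Smallest wall-clock time over several runs, in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)


def triad_gbs(n=4_000_000):
    """STREAM-like triad a = b + 3*c on float64 arrays; returns approximate GB/s."""
    b = np.random.rand(n)
    c = np.random.rand(n)
    a = np.empty_like(b)

    def run():
        np.multiply(c, 3.0, out=a)  # pass 1: read c, write a
        np.add(a, b, out=a)         # pass 2: read a and b, write a

    bytes_moved = 5 * n * a.itemsize  # five array passes, ignoring write-allocate
    return bytes_moved / best_time(run) / 1e9


def matmul_gflops(n):
    """fp32 n x n matmul through NumPy's BLAS; returns GFLOPS."""
    x = np.random.rand(n, n).astype(np.float32)
    y = np.random.rand(n, n).astype(np.float32)
    return 2 * n**3 / best_time(lambda: x @ y) / 1e9


if __name__ == "__main__":
    print(f"triad: {triad_gbs():.1f} GB/s")
    for size in (64, 128, 256, 512, 1024):
        print(f"matmul {size:4d}: {matmul_gflops(size):.1f} GFLOPS")
```

Whether `OMP_NUM_THREADS=1` (or the equivalent for your BLAS) pins the matmul to one thread depends on the BLAS backend, so check its documentation before comparing against the single-core peak.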

The §A13 lesson¶

Every kernel in the §A13 mini-GPT lives in the memory-bound region of the CPU roofline. We could run on a GPU and the kernels would still be memory-bound (relative to the much higher GPU ceiling). The microscopic scope lets you see this clearly without the complexity of large models. When you move to GPT-3-scale in Phase 23+ cloud labs, you'll see the same kernels behave fundamentally differently — large hidden dimensions push arithmetic intensity into the compute-bound regime, and Tensor Cores extract orders of magnitude more throughput.

The point of this page: you measure the CPU roofline by hand once, and the methodology never changes. Only the constants.

Citation¶

Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. CACM 52(4), 65-76. The original roofline paper. Phase 1 cited it for CPU; Phase 23 cites it again for GPU. Same equation.

One-paragraph recap¶

The i5-8250U has \(\pi \approx 250\) GFLOPS (fp32, AVX2 across 4 cores) and \(\beta \approx 16\) GB/s (DDR4-2400 dual channel), giving a machine balance of ~16 FLOPs/byte. Every kernel in the §A13 mini-GPT lives to the left of this crossover: Linear at \(B = 1\) has AI = 0.48 FLOPs/byte, and Linear at \(B = 128\) reaches ~12.8. The A100 GPU has \(\pi\) ~78× higher and \(\beta\) ~94× higher; the fp32 machine balance is similar, so the same kernels remain memory-bound at vastly higher absolute throughput. The roofline equation is invariant across CPU and GPU; only the constants change.

Cross-refs: theory/04-gpu-roofline.md (the same equation on GPU constants), Phase 1 theory/03-roofline-model.md (the original CPU derivation), lab/02-bandwidth-test.md (the STREAM benchmark you run on the i5).