
00: Why a Separate Phase for the GPU Mental Model

CPU intuition (caches, branch prediction, speculative execution, a few heavyweight threads) does not transfer to the GPU. The GPU is a different machine, not a faster CPU. This phase builds the new mental model before you write any kernel.

This page exists to fight a single recurring failure mode: treating the GPU as "a CPU but with more cores". Every line of theory/01..04 is calibrated against that misconception. If you internalize this page, the rest is mechanical.

The bait-and-switch

A CPU has 4–96 cores. Each core is a thick execution unit: out-of-order issue, branch prediction, speculative execution, 32+ KiB of L1, up to a few MB of private L2, and a complex memory disambiguation engine. The CPU is built to make one thread fast on unpredictable code. To go faster, you parallelize across cores and SIMD lanes, both of which are scarce.

A GPU has 60–150 SMs (streaming multiprocessors). Each SM has 32–128 thin execution units (CUDA cores). Each thin unit has minimal private state. The GPU is built to keep tens to hundreds of thousands of threads in flight at once, drawn from launches of millions, on predictable code. To go faster, you launch more threads, which are abundant.

If you bring the CPU mental model to a GPU, you'll write code that:

Spawns "only a few" threads (1,024 instead of 1,000,000), thinking that's "efficient". The GPU sits at roughly 1% occupancy or less.
Has data-dependent branches that split a warp. The warp executes both paths one after the other with the inactive lanes masked off, so a two-way split throws away about half the work.
Uses indirect / pointer-chasing memory access. In the worst case the GPU issues one memory transaction per thread instead of one per warp. Effective bandwidth: as low as 1/32 of peak.
Allocates and frees a lot. Each cudaMalloc is a synchronous call costing on the order of 50 microseconds. A loop doing this is dead.
Measures wall-clock time by reading a Python timer, with no cudaDeviceSynchronize(). Kernel launches are asynchronous, so you measure launch overhead, not work.

Each of those mistakes is invisible to CPU intuition. Each costs 10–100×. Phase 23 is the phase where you replace the intuition. Phase 24 is when you'd otherwise pay for not having done so.
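To make the uncoalesced-access penalty concrete, here is a minimal sketch that counts how many memory sectors a single warp-wide load touches. The 32-byte sector size, the 128-byte alternative, and the helper names (`sectors_touched`, `bandwidth_efficiency`) are assumptions chosen for illustration, not values read from any particular device, and the model ignores caching entirely.

```python
WARP_SIZE = 32
SECTOR_BYTES = 32  # assumed transaction granularity; check your device's documentation


def sectors_touched(addresses, access_bytes=4, sector_bytes=SECTOR_BYTES):
    """Count the distinct sectors touched by one warp-wide load."""
    sectors = set()
    for addr in addresses:
        first = addr // sector_bytes
        last = (addr + access_bytes - 1) // sector_bytes
        sectors.update(range(first, last + 1))
    return len(sectors)


def bandwidth_efficiency(addresses, access_bytes=4, sector_bytes=SECTOR_BYTES):
    """Useful bytes requested divided by bytes actually moved."""
    useful = len(addresses) * access_bytes
    moved = sectors_touched(addresses, access_bytes, sector_bytes) * sector_bytes
    return useful / moved


# Adjacent threads read adjacent 4-byte values: the coalesced case.
coalesced = [4 * lane for lane in range(WARP_SIZE)]
# Every thread reads from a different 4 KiB region: the pointer-chasing case.
scattered = [4096 * lane for lane in range(WARP_SIZE)]

print(sectors_touched(coalesced), bandwidth_efficiency(coalesced))  # 4 1.0
print(sectors_touched(scattered), bandwidth_efficiency(scattered))  # 32 0.125
# With an assumed 128-byte transaction, the scattered case drops to 1/32.
print(bandwidth_efficiency(scattered, sector_bytes=128))            # 0.03125
```

Under these assumptions the scattered pattern wastes between 7/8 and 31/32 of the bytes moved, depending on the transaction size; the exact figure on real hardware depends on the cache path the load takes.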

What stays the same

Some Phase-1 intuitions transfer directly:

Memory hierarchy is a hierarchy. GPU has its own version: HBM (think DRAM, slower-than-cache, larger-than-cache), L2 (shared across SMs), SMEM (per-block, hand-controlled — not automatically managed like CPU L1), registers (per-thread). Numbers are different. Shape is the same.
Arithmetic intensity is the right diagnostic. The roofline equation \(\text{perf} = \min(\pi, I \beta)\) is identical. Only the numbers (peak FLOPS, peak BW) and dtype-multiplicity (fp64/fp32/fp16/bf16/fp8/int8 each have their own \(\pi\) on a Tensor-Core GPU) differ.
Coalesced access matters. "Adjacent threads access adjacent memory" is the GPU's version of "linear memory access wins on CPU". Both are about transaction count.
Roofline reasoning works. Phase 22's decode-attention sits on the slope of the GPU roofline at I≈1 (fp16), just as it did on the CPU roofline. Same diagnosis (memory-bound), different absolute numbers.
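The roofline equation above is short enough to write down directly. The sketch below evaluates \(\min(\pi, I \beta)\) and the machine balance \(I_\text{crit} = \pi / \beta\). The device numbers (`GPU_PEAK_FLOPS`, `GPU_PEAK_BW`) are hypothetical placeholders, not the specification of any real GPU; substitute the values from your own device profile.

```python
def attainable_flops(intensity, peak_flops, peak_bw):
    """Roofline: perf = min(pi, I * beta), in FLOP/s."""
    return min(peak_flops, intensity * peak_bw)


def machine_balance(peak_flops, peak_bw):
    """I_crit = pi / beta, in FLOPs per byte."""
    return peak_flops / peak_bw


def regime(intensity, peak_flops, peak_bw):
    """Classify an operator by which side of I_crit it falls on."""
    if intensity >= machine_balance(peak_flops, peak_bw):
        return "compute-bound"
    return "memory-bound"


# Hypothetical placeholder numbers, NOT a real device's spec sheet.
GPU_PEAK_FLOPS = 300e12  # fp16 FLOP/s
GPU_PEAK_BW = 1.5e12     # bytes/s

i_crit = machine_balance(GPU_PEAK_FLOPS, GPU_PEAK_BW)
decode_perf = attainable_flops(1.0, GPU_PEAK_FLOPS, GPU_PEAK_BW)

print(f"I_crit = {i_crit:.0f} FLOPs/byte")
print(f"decode attention at I=1: {decode_perf / 1e12:.1f} TFLOP/s, "
      f"{regime(1.0, GPU_PEAK_FLOPS, GPU_PEAK_BW)}")
```

With these placeholder numbers an operator at \(I \approx 1\) reaches well under 1% of peak compute, which is the "same diagnosis, different absolute numbers" point in miniature.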

What changes

Three things change qualitatively and need new mental models. Each is a separate theory page:

Execution model: SIMT, not OoO. A warp of 32 threads executes the same instruction in lockstep. Branches divert lanes; lanes not on the active path are masked off and sit idle until the paths reconverge. This is the most counter-CPU concept and is covered in 01-gpu-vs-cpu-mental-model.md.
Memory hierarchy: manual SMEM. SMEM (shared memory) is not auto-managed cache. The programmer chooses what to cache in SMEM, when to load, when to evict. This is power; it's also the source of most "this kernel is slow because I forgot to use SMEM" mistakes. Covered in 02-gpu-memory-hierarchy.md.
Throughput via occupancy, not via "fewer faster threads". GPU performance comes from the SM staying busy by switching between warps every cycle. Occupancy is bounded by register count per thread, SMEM per block, and threads per block. Each of these is a budget you allocate. Covered in 03-warps-and-occupancy.md.
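The occupancy budget in the last item can be sketched as a small calculator: each resource caps the number of blocks an SM can hold, and the tightest cap wins. The per-SM limits in `SMLimits` are assumed round numbers for illustration only (read the real ones from your device profile), and the model deliberately ignores allocation granularity, so treat its output as an estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Assumed per-SM limits; hypothetical values, not a specific GPU."""
    max_threads: int = 2048
    max_blocks: int = 32
    registers: int = 65536
    smem_bytes: int = 100 * 1024
    warp_size: int = 32


def occupancy(threads_per_block, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Return (fraction of max resident warps, name of the limiting resource)."""
    warps_per_block = -(-threads_per_block // sm.warp_size)  # ceiling division
    threads_alloc = warps_per_block * sm.warp_size
    blocks_allowed = {
        "threads": sm.max_threads // threads_alloc,
        "blocks": sm.max_blocks,
        "registers": sm.registers // (regs_per_thread * threads_alloc),
        "smem": sm.smem_bytes // smem_per_block if smem_per_block else sm.max_blocks,
    }
    limiter = min(blocks_allowed, key=blocks_allowed.get)
    resident_warps = blocks_allowed[limiter] * warps_per_block
    max_warps = sm.max_threads // sm.warp_size
    return resident_warps / max_warps, limiter


print(occupancy(256, 32, 0))          # light kernel: thread limit reached first
print(occupancy(256, 128, 0))         # heavy register use
print(occupancy(256, 32, 48 * 1024))  # heavy shared-memory use
```

Under these assumed limits, quadrupling registers per thread or claiming 48 KiB of SMEM per block each cuts occupancy to a quarter, which is why the text treats them as budgets rather than free resources.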

What you should be able to do by the end of Phase 23

State, from the device profile of your specific cloud GPU: peak fp16 TFLOPS (Tensor Core), peak HBM bandwidth, L2 size, max threads/block, registers/SM, SMEM/SM, compute capability.
Compute the machine balance \(I_\text{crit}\) for fp16 on your GPU. Compare to the i5-8250U's 10 FLOPs/byte and the Phase-22 decode operator's ~1 FLOP/byte. State the regime.
Sketch the memory hierarchy with annotated bandwidths.
Predict, from device specs alone, how long a cudaMemcpy of N bytes will take, in microseconds, for small N and for large N.
Identify two reasons a warp can run at <100% efficiency (divergence, uncoalesced access) and one reason an SM can run at <100% occupancy (register pressure).
Provision a cloud GPU instance, run a benchmark, terminate it, all in under 30 minutes, and produce a cost log.

That last item is operational, not intellectual — but it's what makes Phase 24+ possible without cost overruns.
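For the cudaMemcpy prediction in the list above, a common first-order model is a fixed per-call latency plus a bandwidth term. The sketch below assumes that model; the latency and bandwidth values are hypothetical placeholders, not measurements, and the function names are illustrative.

```python
def memcpy_time_us(n_bytes, latency_us, bandwidth_gb_s):
    """First-order model: t(N) = latency + N / bandwidth, in microseconds."""
    bytes_per_us = bandwidth_gb_s * 1e3  # 1 GB/s = 1e9 bytes/s = 1e3 bytes/us
    return latency_us + n_bytes / bytes_per_us


def crossover_bytes(latency_us, bandwidth_gb_s):
    """Transfer size at which the latency and bandwidth terms are equal."""
    return latency_us * bandwidth_gb_s * 1e3


# Hypothetical link parameters; measure your own before trusting any prediction.
LATENCY_US = 10.0
BANDWIDTH_GB_S = 25.0

for n in (4 * 1024, 1024**2, 1024**3):
    t = memcpy_time_us(n, LATENCY_US, BANDWIDTH_GB_S)
    print(f"{n:>12d} bytes -> {t:12.2f} us")

print(f"crossover at {crossover_bytes(LATENCY_US, BANDWIDTH_GB_S):.0f} bytes")
```

Below the crossover size the copy time is essentially the fixed latency; above it, time grows linearly with N. Fitting the two parameters to a handful of measured sizes is one way to check the prediction against the device.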

The Phase-22 connection

The most useful thing about Phase 22 (KV cache) for Phase 23 is its operator menu. By the end of Phase 22 Borja has measured:

Prefill: compute-bound on CPU, with intensity scaling in \(P\).
Decode-step FFN: weight-read-bound on CPU.
Decode-step attention: ~0.5–1.0 FLOPs/byte, memory-bound on CPU.

Phase 23 takes these same operators and places them on the GPU roofline. The diagnosis (which is compute-bound, which is memory-bound) is the same qualitatively. The numbers (how fast, what fraction of peak) are wildly different.
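One way to see where a figure like ~0.5–1.0 FLOPs/byte for decode-step attention can come from is to count FLOPs and bytes for a single head. The sketch below is a simplified accounting under stated assumptions (K and V are each read once, softmax FLOPs and the query/output traffic are ignored); the Phase-22 measurements may count things differently.

```python
def decode_attention_intensity(context_len, head_dim, bytes_per_elem):
    """FLOPs per byte for one decode step of one attention head.

    Assumptions: q.K^T and p.V each cost 2 FLOPs per (position, dim) pair,
    and the only memory traffic counted is one read of K and one read of V.
    """
    flops = 2 * (2 * context_len * head_dim)
    bytes_moved = 2 * (context_len * head_dim * bytes_per_elem)
    return flops / bytes_moved


for name, width in (("fp32", 4), ("fp16", 2)):
    i = decode_attention_intensity(context_len=4096, head_dim=128, bytes_per_elem=width)
    print(f"{name}: {i} FLOPs/byte")
```

Under this accounting the intensity is 2 divided by the element width in bytes, independent of context length and head dimension: 0.5 for fp32 and 1.0 for fp16. Both sit far below the machine balance of a CPU or a GPU, so the operator is memory-bound on either.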

That continuity is pedagogically important. The roofline isn't a per-machine ritual — it's the one mental model that explains both machines. Same equation, different constants. By Phase 24, when Borja writes a fused softmax kernel and measures it, that kernel's dot lands on this Phase-23 roofline. Phase 23 builds the plot; Phase 24 starts using it as a tool.

What this phase deliberately doesn't do

Doesn't write a kernel. That is Phase 24. Phase 23 uses only pre-built operations (cudaMemcpy, cuBLAS GEMM).
Doesn't import PyTorch. That is Phase 24. Phase 23's GPU-array convenience is CuPy (NumPy-on-GPU), which is a substrate library, not a framework.
Doesn't tune. Phase 23 measures; tuning is Phase 24.
Doesn't do multi-GPU. Phase 35.

The phase is intentionally a "measurement and orientation" phase, exactly like Phase 1 was for CPU.

Next: theory/01-gpu-vs-cpu-mental-model.md — the execution model differences, in detail.