English · Español

02 — GPU Memory Hierarchy¶

🇪🇸 Jerarquía de memoria GPU: HBM (lenta, grande, ~1–3 TB/s, ~40–80 GiB) → L2 (compartida entre SMs, ~3–10 TB/s, ~40–100 MiB) → SMEM (manual, por bloque, ~10–20 TB/s, ~64–164 KiB) → registros (por hilo, ~muchos TB/s, ~256 KiB por SM repartidos entre todos los hilos residentes). Cada nivel es ~3–10× más rápido y ~10–100× más pequeño que el anterior.

This page gives the GPU's memory hierarchy the same level of detail Phase-1's 02-memory-hierarchy.md gave to CPU caches — but with the GPU-specific differences called out explicitly. After this page you should be able to draw the hierarchy from memory with bandwidths labeled.

The hierarchy, top-down¶

For an A100-class GPU (numbers from the A100 whitepaper; representative of "modern data-center GPU"):

Level	Size	Bandwidth	Latency	Scope	Managed by
HBM (DRAM)	40–80 GiB	1.5–3 TB/s	400–800 cycles	Device-wide	Allocator (`cudaMalloc`)
L2 cache	40 MiB	~5 TB/s	200–250 cycles	Shared across SMs	Hardware
SMEM (shared mem.)	164 KiB / SM	~19 TB/s	20–30 cycles	Per thread-block	Programmer (`__shared__`)
L1 cache	shared with SMEM (164 KiB)	~19 TB/s	20–30 cycles	Per SM	Hardware (or repartition with SMEM)
Register file	256 KiB / SM	one register / cycle / lane	1 cycle	Per thread	Compiler

(H100 numbers are ~30–50% higher across the board; same hierarchy.)

For comparison, CPU's L1/L2/L3/DRAM are roughly 32 KiB / 256 KiB / 30 MiB / 64 GiB, with bandwidths ~1 TB/s / 200 GB/s / 70 GB/s / 20 GB/s on a modern CPU. The GPU's HBM is two orders of magnitude faster than CPU DRAM. The GPU's L2 is half a magnitude faster than CPU L3 and 5× larger. The GPU is a memory-bandwidth machine.

HBM: the global memory tier¶

HBM (High Bandwidth Memory) is the GPU's equivalent of "main memory". Everything that doesn't fit in cache lives here. All cudaMalloc allocations come from HBM. The KV cache (Phase 22) — when ported to GPU in Phase 24 — lives in HBM.

Three things to know:

HBM is bandwidth-fast, not latency-fast. A single load from HBM takes 400–800 cycles. Latency is hidden via warp switching (theory/01). To saturate HBM bandwidth, you need many concurrent memory transactions in flight.
HBM is the rate-limiter for memory-bound kernels. Phase-22 decode attention's intensity (~1 FLOP/byte fp16) is far below the machine balance (~150 FLOPs/byte on A100). The GPU's FPUs sit idle waiting on HBM. Same diagnosis as CPU, faster numbers.
HBM bandwidth is per-direction. Quoted as e.g. "2 TB/s" — that's aggregate bidirectional. A read-only workload sees ~half of it. A read-modify-write sees less.

L2: the implicit shared cache¶

L2 sits between HBM and the SMs. Hardware-managed (like CPU L2/L3). All SMs share it.

Key property: L2 is large enough to hold one or two layers' worth of KV cache at modest model sizes. This is why "L2 hit rate for cache reads" is a thing serving systems optimize. A well-laid-out cache that re-reads the same K, V rows across the layers of one forward pass can land them in L2 for the second-and-onward layer access.

Practical note: you can't allocate "in L2" — you can only allocate in HBM and hope L2 catches a working set. There are L2-persisting-access hints (cudaStreamAttribute*) for advanced control. Phase 23 doesn't use them; Phase 24 might.

SMEM: the programmer-managed scratchpad¶

This is the one that breaks CPU intuition.

SMEM is on-chip, fast (~19 TB/s aggregate across the SM), per-thread-block, and explicitly managed. You declare __shared__ float tile[32][32] in the kernel; the compiler reserves 4 KiB of SMEM per block; you write a load loop that copies a tile of global data into tile, you __syncthreads(), and now all threads in the block can access tile at SMEM speeds.

There is no "SMEM autoloaded from HBM" mechanism. SMEM is what you put there. If you load nothing, you have nothing. If you load the wrong tile, your kernel is wrong.

This is power and footgun. Power: you can structure data movement exactly for your access pattern, avoiding the wasted bytes that CPU caches always bring along. Footgun: get the tile size wrong and you either waste SMEM or thrash with too many global loads.

Every "fast" GPU kernel — GEMM, attention, convolution — has an SMEM dance. Phase 24's labs build this dance from scratch on a fused softmax.

Quirk: SMEM and L1 share the same physical SRAM on most modern GPUs. You configure the split at launch (e.g., 100 KiB SMEM + 28 KiB L1, vs 64 KiB SMEM + 64 KiB L1). Defaults are usually fine; advanced kernels twiddle this.

Registers: per-thread state¶

Each thread has its own registers, carved out of the SM's 256-KiB register file. The compiler decides how many registers each thread uses.

Tradeoff: - More registers per thread = more state held per thread = potentially faster (avoid spilling) but fewer threads can be resident on the SM (lower occupancy). - Fewer registers per thread = some local variables spill to local memory (slow!) but more threads can be resident (higher occupancy).

There's no perfect answer. cuBLAS and the like profile-tune this. For hand-written kernels, you'll learn to read the --ptxas-options=-v output for register count and tune via __launch_bounds__. Phase 24, not Phase 23.

For Phase 23, the takeaway is: occupancy depends on register count, and register count depends on the kernel's logic.

The bandwidth pyramid (for an A100)¶

                  ▲
        Registers  │  256 KiB/SM, per-thread, 1-cycle access
                  │
            SMEM  │  164 KiB/SM, per-block, ~19 TB/s, ~20 cycles
                  │
            L2    │  40 MiB, device-wide, ~5 TB/s, ~200 cycles
                  │
            HBM   │  40 GiB, device-wide, ~1.5-3 TB/s, ~500 cycles
                  ▼

The factor between adjacent tiers is roughly: - HBM → L2: 2× - L2 → SMEM: 4× - SMEM → reg: 100×+

So the biggest bandwidth discontinuity is between HBM and the rest. The biggest latency discontinuity is between L2/HBM and SMEM/registers. Together they explain why "tile to SMEM" is the universal GPU optimization.

Memory coalescing: the formal rule¶

(Set up in theory/01; formalized here.)

The GPU's load-store unit serves a warp by issuing memory transactions. A transaction is one access to one cache line of HBM (cache line = 128 bytes on modern GPUs). To serve a warp's 32 load instructions:

If the 32 addresses fall within one 128-byte cache line, one transaction serves all 32 threads. Coalesced. Bandwidth-efficient.
If they fall across two cache lines, two transactions. Less efficient by 2×.
If they fall across N cache lines (worst case N=32), N transactions. Uncoalesced. Bandwidth-efficient at 1/N.

Practical rule for fp32: thread \(i\) in a warp accessing address base + i * 4 (i.e., contiguous floats) is always coalesced.

For fp16/bf16 (\(i \cdot 2\)): also coalesced — 32 × 2 = 64 bytes, half a cache line, one transaction.

For int8 (\(i \cdot 1\)): 32 × 1 = 32 bytes, quarter of a line, one transaction. Even more efficient.

The footgun is stride access: base + (i * S) * sizeof(T) for stride \(S > 1\). Now the 32 addresses are spread across S cache lines (best case). Effective bandwidth: 1/S of peak. This is why padding is sometimes worth wasting memory for (avoiding bad strides).

What this hierarchy means for Phase 22's operators¶

Re-place the Phase-22 operators on the GPU hierarchy:

Prefill attention (\(P \times P\) per layer). The \(K, V\) matrices are tile-friendly. Flash-Attention's whole trick is keeping the working set in SMEM. HBM → SMEM tiling reduces HBM reads by O(P) factor. Phase 24's lab.
Decode attention (\(1 \times S\) per layer). The cache \(K, V\) for the current sequence is potentially 100 MiB → doesn't fit in SMEM. Streams through L2 (sometimes hits) and HBM (mostly misses). HBM-bound; occupancy and coalescing determine the constant factor.
FFN matmul (\(1 \times d\) against \(d \times 4d\)). For a 7B model, \(4d = 16384\), the FFN weight matrix is 4096×16384 fp16 = 134 MiB — doesn't fit in L2. Streams from HBM. Same diagnosis: bandwidth-bound on weight read.
KV cache append. Tiny write of new K, V row to HBM. Negligible.
Sampling (argmax / multinomial over vocab). Reads logits (vocab-sized vector, e.g., 32K fp16 = 64 KiB) — fits in L2. Compute-tiny. Negligible.

Notice that for every single operator, the diagnosis is "where does the working set live" — HBM, L2, SMEM. The roofline plot is the answer; the hierarchy is the vocabulary.

Drill problems¶

The Phase-22 decode-attention reads \(2 L S d s\) bytes of cache per step. On A100 fp16 HBM (2 TB/s), Llama-2-7B (L=32, d=4096), S=4096: how long is the cache-read leg? Compare to the measured per-token latency people quote (10–20 ms).
A kernel does dst[i] = src[i * 7] for stride 7, fp32. Coalesced fraction? Effective bandwidth?
The Phase-22 cache for MiniGPT is 16 MiB (4k context). Does it fit in A100's L2 (40 MiB)? What about for Llama-2-7B at 4k context (2 GiB)?
Why is "global atomic increment" 100× slower than "shared atomic increment" on most GPUs? (Hint: where does the atomic operation serialize?)

What you should now be able to do¶

Draw the hierarchy with numbers.
State the coalescing rule and apply it to a code snippet.
Predict which tier each Phase-22 operator's working set lands in.
Explain why SMEM is programmer-managed, not auto-cached, and why that's powerful.

What this page does NOT cover¶

HBM3 / HBM3e specifics. H100 / B100 numbers are flagged where they differ from A100 baseline but this page targets the shape of the hierarchy, not the bleeding edge.
SMEM bank conflicts in depth. Mentioned; the full analysis is Phase 24 when you actually allocate SMEM.
TMA (Tensor Memory Accelerator) on H100+. Phase 27.
Multi-GPU memory (NVLink, NCCL). Phase 35.

Next: theory/03-warps-and-occupancy.md — the warp execution model formalized; occupancy as a resource allocation problem.