02 — Memory hierarchy: caches, DRAM, NUMA, PCIe, SSD

The memory hierarchy is the backbone of all performance intuition in AI. Each level has a latency (how quickly it starts) and a bandwidth (how much it sustains). Memorize the orders of magnitude; the rest of your career is corollaries.


The numbers you must memorize

This is the table of Phase 1. If you internalize one thing from the curriculum's first hundred pages, make it this. All numbers are order-of-magnitude on a 2018-era laptop CPU (close to Borja's i5-8250U).

| Level | Capacity | Latency (cycles) | Latency (ns) | Bandwidth |
|---|---|---|---|---|
| Register | ~16 × 64-bit | 0 | 0 | |
| L1 cache | 32 KiB / core | 4 | ~1 | ~1 TB/s |
| L2 cache | 256 KiB / core | 12 | ~4 | ~500 GB/s |
| L3 cache | 6 MiB shared | 40 | ~12 | ~200 GB/s |
| DRAM (main RAM) | 16–64 GiB | ~200 | ~70 | ~20 GB/s |
| NVMe SSD | 256 GiB–4 TiB | ~30,000 | ~10,000 (10 µs) | ~3 GB/s read |
| Network (gigabit) | | ~300,000 | ~100,000 (100 µs) | ~125 MB/s |

The latencies span four orders of magnitude between L1 and SSD (five between L1 and the network). The bandwidths span between two and three orders of magnitude from L1 to SSD (about four to the network). Every performance argument in AI hardware works because of this gap.
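A quick way to check the spans is to take the log of the ratios. This is a minimal sketch that only re-uses the table's order-of-magnitude numbers; the dictionary and function names are made up for illustration.

```python
import math

# Latency (ns) and bandwidth (bytes/s), copied from the table above.
LATENCY_NS = {"L1": 1, "L2": 4, "L3": 12, "DRAM": 70,
              "NVMe": 10_000, "Network": 100_000}
BANDWIDTH = {"L1": 1e12, "L2": 500e9, "L3": 200e9, "DRAM": 20e9,
             "NVMe": 3e9, "Network": 125e6}

def orders_of_magnitude(table, a, b):
    """log10 of the ratio between two levels (always non-negative)."""
    return abs(math.log10(table[a] / table[b]))

for far in ("DRAM", "NVMe", "Network"):
    lat = orders_of_magnitude(LATENCY_NS, "L1", far)
    bw = orders_of_magnitude(BANDWIDTH, "L1", far)
    print(f"L1 -> {far}: latency 10^{lat:.1f}, bandwidth 10^{bw:.1f}")
```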

Drill: before reading further, close this file. Reconstruct the table from memory. Get within 2× on every number. If you can't, you don't know the table — and you won't be able to make a roofline argument without it.

Why caches exist (re-derivation, since you forgot it)

Main RAM is 70 ns away. A 3 GHz CPU has a cycle time of 333 ps. So a single uncached RAM read costs ~210 CPU cycles. In those 210 cycles, one AVX2-equipped core could have done ~3,400 fp32 multiplies (8 fp32 lanes × 2 FMA units = 16 fp32/cycle × 210 cycles ≈ 3,360).

If every access went to DRAM and each load fed only one cycle of useful work, the CPU would spend ~210× more time waiting than computing.
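The arithmetic above is small enough to re-derive in a few lines. This sketch assumes the same round numbers as the text (70 ns, 3 GHz, 8 fp32 lanes, 2 FMA units per core); the function names are illustrative only.

```python
DRAM_LATENCY_NS = 70    # from the table
CLOCK_GHZ = 3.0         # cycles per nanosecond
SIMD_LANES_FP32 = 8     # AVX2: 256-bit register / 32-bit float
FMA_UNITS = 2           # assumption taken from the text: two FMA units per core

def stall_cycles(latency_ns, clock_ghz):
    """Cycles that elapse while one uncached load is outstanding."""
    return round(latency_ns * clock_ghz)

def lost_multiplies(latency_ns, clock_ghz,
                    lanes=SIMD_LANES_FP32, units=FMA_UNITS):
    """fp32 multiplies one core could have issued during the stall."""
    return stall_cycles(latency_ns, clock_ghz) * lanes * units

print(stall_cycles(DRAM_LATENCY_NS, CLOCK_GHZ))     # cycles per DRAM read
print(lost_multiplies(DRAM_LATENCY_NS, CLOCK_GHZ))  # multiplies forgone
```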

Caches close the gap by keeping recently-used data closer to the ALU. The principle is locality:

  1. Temporal locality. If you read address A, you'll probably read A again soon. (Most code re-uses variables.)
  2. Spatial locality. If you read address A, you'll probably read A+1, A+2, … soon. (Most code walks arrays / structs sequentially.)

So caches store data in lines (typically 64 bytes), not individual bytes. Loading one fp32 value loads its 15 neighbors for free. This is why stride-1 array access is fast and stride-1024 array access is slow — you're paying for 15 fp32 you don't use.
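To see the "15 neighbors" cost without running a benchmark, you can count how many cache lines a strided walk touches. This is a back-of-the-envelope model, not a measurement: it assumes 64-byte lines, an array starting on a line boundary, and no prefetcher. The helper names are hypothetical.

```python
LINE_BYTES = 64

def lines_touched(n_accesses, stride_elems, elem_bytes=4):
    """Distinct cache lines hit by n accesses, starting at a line-aligned address."""
    return len({(i * stride_elems * elem_bytes) // LINE_BYTES
                for i in range(n_accesses)})

def useful_fraction(n_accesses, stride_elems, elem_bytes=4):
    """Bytes actually used divided by bytes the cache had to fetch."""
    fetched = lines_touched(n_accesses, stride_elems, elem_bytes) * LINE_BYTES
    return (n_accesses * elem_bytes) / fetched

for stride in (1, 4, 16, 1024):
    print(stride, lines_touched(1024, stride), useful_fraction(1024, stride))
```

At stride 1 every fetched byte is used; from stride 16 upward each fp32 costs a whole line, so only 1/16 of the traffic is useful.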

Lab 02 makes this visible.

Cache hit, cache miss, cache line, and "the working set"

A cache hit means the data was in cache; the access cost is the cache latency (1–12 ns). A cache miss means the data wasn't there; the access cost is the next level's latency, plus the time to evict an old line.

A cache line is the unit of transfer. 64 bytes on x86. When you load one byte, the CPU fetches the whole 64-byte aligned chunk containing it. This is why false sharing (two threads writing to different variables in the same cache line) is catastrophic — the line bounces between cores' L1s, each invalidating the other's copy.
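Whether two variables can false-share comes down to whether their addresses fall in the same 64-byte aligned chunk. The sketch below models that with plain integer offsets (it does not inspect real addresses), and shows the usual fix of padding each per-thread slot to a full line. All names are illustrative.

```python
LINE_BYTES = 64

def line_index(addr):
    """Index of the 64-byte aligned chunk containing this byte address."""
    return addr // LINE_BYTES

def share_line(addr_a, addr_b):
    """True if two addresses live in the same cache line."""
    return line_index(addr_a) == line_index(addr_b)

def pad_to_line(size_bytes):
    """Round a slot size up to a whole number of cache lines."""
    return -(-size_bytes // LINE_BYTES) * LINE_BYTES

# Two 8-byte per-thread counters packed back to back: same line, false sharing.
print(share_line(0, 8))
# Same counters, each padded to its own line: different lines.
slot = pad_to_line(8)
print(slot, share_line(0 * slot, 1 * slot))
```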

The working set of a kernel is the unique data it touches in a tight loop. If working set fits in L1 (32 KiB), the kernel runs at L1 speed. If it fits in L2 only, it runs at L2 speed. If it overflows L3, it runs at DRAM speed.
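As a first-order check, you can compare a kernel's working set against the capacities in the table. This is a deliberately crude sketch: it assumes the table's sizes, ignores that L3 is shared and that other data competes for the same cache, and the function name is made up.

```python
KIB = 1024
MIB = 1024 * KIB

# Capacities from the table, smallest first.
LEVELS = [("L1", 32 * KIB), ("L2", 256 * KIB), ("L3", 6 * MIB)]

def bounding_level(working_set_bytes):
    """Smallest level whose capacity holds the whole working set."""
    for name, capacity in LEVELS:
        if working_set_bytes <= capacity:
            return name
    return "DRAM"

FP32 = 4
print(bounding_level(3 * 2048 * FP32))       # three 2048-element vectors
print(bounding_level(3 * 64 * 64 * FP32))    # three 64x64 tiles
print(bounding_level(1024 * 1024 * FP32))    # one 1024x1024 matrix
print(bounding_level(4096 * 4096 * FP32))    # one 4096x4096 matrix
```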

This is the entire reason blocked / tiled matmul beats naive matmul by 50×: blocked matmul keeps each tile in L1, so each fp32 is read from DRAM once per tile, not once per inner-loop iteration. Phase 3 derives the block size from this exact reasoning.
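The loop structure of a blocked matmul looks roughly like this. It is a sketch of the technique in pure Python, so it shows the tiling order, not the speedup (interpreter overhead hides cache effects entirely). The tile size of 4 is arbitrary here and is not the block size Phase 3 derives.

```python
def matmul_naive(A, B):
    """Textbook triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

def matmul_blocked(A, B, tile=4):
    """Same arithmetic, but visited tile by tile so each tile stays hot."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Work confined to one (tile x tile) block of A, B and C.
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            C[i][j] += a * B[p][j]
    return C

A = [[float(i + p) for p in range(7)] for i in range(5)]
B = [[float(p * j + 1) for j in range(3)] for p in range(7)]
print(matmul_blocked(A, B) == matmul_naive(A, B))
```

The inputs are integer-valued floats so both loop orders give bit-identical sums; with arbitrary floats the results can differ in the last bits because the additions happen in a different order.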

Associativity (briefly, since lscpu shows it)

Caches can't store any line at any slot — that would require checking every slot on every access. Instead, each line address maps to a small set of slots (a set). An "8-way set-associative" cache means each line has 8 candidate slots; the CPU checks all 8 in parallel.

Higher associativity = fewer conflict misses but more chip area. L1 is usually 8-way. L2 is 4–16-way. L3 is 16-way+.
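The mapping from address to set can be sketched with integer arithmetic. This assumes the simplest scheme (set index = line number modulo number of sets) and the table's 32 KiB L1 with 8 ways; real CPUs may index or hash differently, especially in L3, so treat it as a model rather than a description of any specific chip.

```python
LINE_BYTES = 64
L1_BYTES = 32 * 1024
L1_WAYS = 8

def num_sets(cache_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets = total lines / lines per set."""
    return cache_bytes // (ways * line_bytes)

def set_index(addr, cache_bytes, ways, line_bytes=LINE_BYTES):
    """Set a byte address maps to, under simple modulo indexing (assumption)."""
    return (addr // line_bytes) % num_sets(cache_bytes, ways, line_bytes)

sets = num_sets(L1_BYTES, L1_WAYS)
conflict_stride = sets * LINE_BYTES   # addresses this far apart share a set
addrs = [i * conflict_stride for i in range(9)]
print(sets, conflict_stride)
print({set_index(a, L1_BYTES, L1_WAYS) for a in addrs})
```

Nine lines spaced 4096 bytes apart all land in one set of an 8-way cache, so they cannot all stay resident even though the cache is otherwise nearly empty. That is a conflict miss.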

You will not optimize for associativity in Phase 1. We mention it because lscpu prints it and you should know what it means.

DRAM, not "RAM"

"RAM", the consumer label, is DRAM internally: Dynamic Random Access Memory. Each bit is a tiny capacitor that leaks charge, so every row must be refreshed roughly every 64 ms. This design is why DRAM is slow: every access starts with finding the right row in the right bank, opening it (~30 ns), and only then reading.

DRAM has banks (groups of rows that can be accessed in parallel) and channels (parallel buses to the CPU). i5-8250U has 2 channels × 1 DIMM × 64 banks/DIMM ≈ 128 banks in flight. The 20 GB/s peak bandwidth assumes you spread your accesses across them. Worst case (one bank, a different row on every access, so each access has to close one row and open another): ~2 GB/s.

Implication: a kernel that touches DRAM randomly gets ~10× less bandwidth than one that touches it sequentially. Sequential access is your friend.

NUMA (mentioned for vocabulary)

A NUMA (Non-Uniform Memory Access) system has multiple CPU sockets, each with its own DRAM controllers. Accessing memory on your own socket is fast; accessing memory on the other socket goes over the inter-socket interconnect (UPI on Intel) and is 2–3× slower.

Borja's laptop is single-socket, so there is no NUMA in practice. Phase 1 mentions NUMA because:

  • Every datacenter GPU paper talks about it.
  • Pinning processes to a NUMA node (numactl --cpunodebind=0 --membind=0 ./script) is a common Phase 35 (distributed) trick.

For Phase 1 labs, no NUMA experiment. Just know the word.

PCIe (path to the iGPU and to NVMe)

PCIe is the bus connecting the CPU to peripherals — discrete GPUs, NVMe SSDs, network cards. Generations matter:

| Gen | Bandwidth per lane | Typical link to a GPU |
|---|---|---|
| 3.0 | 1 GB/s | 16 lanes ≈ 16 GB/s |
| 4.0 | 2 GB/s | 16 lanes ≈ 32 GB/s |
| 5.0 | 4 GB/s | 16 lanes ≈ 64 GB/s |

For Borja's iGPU (Intel UHD 620), there is no PCIe link to the GPU — the iGPU sits on the same die as the CPU and shares the same DRAM. This is unusual and worth holding in mind for Phase 23, when "GPU memory" suddenly means "a separate, faster, smaller pool across PCIe."

NVMe SSDs typically use 4 PCIe lanes — 4 lanes × 1 GB/s (Gen 3) = 4 GB/s theoretical, ~3 GB/s practical.
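Link bandwidth is just lanes × per-lane rate, which makes the table easy to extend to other link widths. This sketch uses the rounded per-lane figures from the table (per direction, ignoring protocol overhead); the names are illustrative.

```python
# Rounded per-lane bandwidth in GB/s, keyed by PCIe generation (from the table).
PER_LANE_GBS = {3: 1.0, 4: 2.0, 5: 4.0}

def link_gbs(gen, lanes):
    """Theoretical one-direction bandwidth of a PCIe link in GB/s."""
    return PER_LANE_GBS[gen] * lanes

print(link_gbs(3, 16))  # Gen 3 x16 GPU slot
print(link_gbs(3, 4))   # Gen 3 x4 NVMe SSD
print(link_gbs(5, 16))  # Gen 5 x16 GPU slot
```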

Datacenter NVIDIA GPUs are linked to each other by NVLink, a proprietary interconnect much faster than PCIe (~600 GB/s aggregate per GPU on A100, ~900 GB/s on H100). Multi-GPU training (Phase 35) exploits NVLink heavily. Phase 1 mentions it so the word doesn't appear unannounced in Phase 24.

Latency vs throughput — a worked example

A common confusion: "I have 20 GB/s of bandwidth, so 70 ns latency doesn't matter."

It matters if your access pattern doesn't allow parallelism. Bandwidth assumes you have many outstanding requests at once (pipelined). Latency dominates if requests are serial — each one waits for the previous to finish.

Concretely:

  • Sequential memcpy of a 1 GB buffer: bandwidth wins. ~50 ms at 20 GB/s.
  • Linked-list traversal of 10M nodes (each load depends on the previous): latency wins. 10M × 70 ns = 700 ms. That is 14× slower while moving fewer bytes: even at one full 64-byte cache line per node, the traversal pulls in only 640 MB.
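The two estimates use different formulas, which is the whole point: one divides bytes by bandwidth, the other multiplies accesses by latency. A minimal sketch with the table's DRAM numbers (function names are made up):

```python
DRAM_BANDWIDTH = 20e9     # bytes/s, from the table
DRAM_LATENCY_NS = 70      # ns, from the table

def streaming_seconds(n_bytes, bandwidth=DRAM_BANDWIDTH):
    """Bandwidth-bound estimate: many requests in flight at once."""
    return n_bytes / bandwidth

def pointer_chase_seconds(n_nodes, latency_ns=DRAM_LATENCY_NS):
    """Latency-bound estimate: each load waits for the previous one."""
    return n_nodes * latency_ns * 1e-9

memcpy_s = streaming_seconds(1e9)           # 1 GB buffer
chase_s = pointer_chase_seconds(10_000_000)  # 10M dependent loads
print(f"memcpy: {memcpy_s * 1e3:.0f} ms")
print(f"linked list: {chase_s * 1e3:.0f} ms")
print(f"ratio: {chase_s / memcpy_s:.0f}x")
```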

This is Little's law applied to memory: throughput = parallelism / latency. If you can't expose parallelism (no prefetching, no out-of-order issue), throughput collapses to 1 / latency accesses per second.
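Turning Little's law around tells you how much parallelism the memory system needs before the quoted bandwidth is reachable. The sketch below assumes every request is one 64-byte cache line and uses the table's DRAM figures; it is a rough model, not a description of how any particular memory controller queues requests.

```python
LINE_BYTES = 64

def lines_in_flight(bandwidth_bytes_per_s, latency_ns, line_bytes=LINE_BYTES):
    """Outstanding cache-line requests needed to sustain a given bandwidth."""
    return bandwidth_bytes_per_s * latency_ns * 1e-9 / line_bytes

def throughput_bytes_per_s(outstanding_lines, latency_ns, line_bytes=LINE_BYTES):
    """Little's law: throughput = parallelism / latency."""
    return outstanding_lines * line_bytes / (latency_ns * 1e-9)

need = lines_in_flight(20e9, 70)
serial = throughput_bytes_per_s(1, 70)
print(f"lines in flight for 20 GB/s: {need:.1f}")
print(f"bandwidth with 1 line in flight: {serial / 1e9:.2f} GB/s")
```

With these numbers, about 22 cache lines must be in flight to reach 20 GB/s, and a fully serial access pattern gets under 1 GB/s of useful line traffic.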

Lab 02 makes this concrete by measuring both a sequential walk and a random walk on the same buffer.

Implications for AI

A few more applications of the table, building on theory/00-motivation.md:

  • Weight matrices that fit in L3 (~6 MiB = 1.5M fp32) run fast. Weight matrices that don't fit in L3 are DRAM-bound. A small Llama embedding (32k vocab × 4k dim = 128M fp32) is way larger than L3 → DRAM-bound during forward pass.
  • Attention's K-V cache (Phase 22) gets large quickly for long sequences. Once K-V doesn't fit in cache, every token's attention is DRAM-bound. This is why long-context inference latency grows with sequence length: each new token has to stream the whole K-V cache from DRAM, so memory traffic, not the linearly growing FLOP count, sets the time per token.
  • Activations during training are bigger than weights. The reason gradient checkpointing exists (Phase 18) is that activations spill to DRAM and dominate memory traffic.
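The first bullet's comparison is a two-line calculation. This sketch assumes fp32 storage and reads "32k × 4k" as 32·1024 × 4·1024 so the product matches the 128M in the text; the helper names are hypothetical.

```python
from math import prod

L3_BYTES = 6 * 1024**2   # 6 MiB shared L3, from the table
FP32_BYTES = 4

def fp32_bytes(*shape):
    """Bytes needed to store an fp32 tensor of the given shape."""
    return prod(shape) * FP32_BYTES

def fits_in_l3(n_bytes):
    return n_bytes <= L3_BYTES

embedding = fp32_bytes(32 * 1024, 4 * 1024)   # vocab x dim
print(L3_BYTES // FP32_BYTES)                  # fp32 values that fit in L3
print(embedding // 1024**2, "MiB")             # embedding footprint
print(fits_in_l3(embedding), embedding / L3_BYTES)
```

The embedding alone is roughly 85× the size of L3, so no access pattern will keep it cache-resident.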

Every one of these is a roofline argument. The roofline plot is the next file.

One-paragraph recap

The memory hierarchy is a ladder: L1 / L2 / L3 / DRAM / SSD / network, with each rung roughly 3–10× slower than the previous (the DRAM-to-SSD step is closer to 100×) and about 10× or more larger. Caches close the L1↔DRAM gap by exploiting temporal and spatial locality at 64-byte line granularity. Sequential access gets bandwidth; random access gets latency. Every AI performance optimization either keeps the working set in a higher cache level or hides latency with parallelism. The roofline model (next) is the unified picture.


Next: theory/03-roofline-model.md.