English · Español

04 — NUMA, thread pinning, and the "single-socket lie"¶

🇪🇸 Tu laptop tiene un socket — sin NUMA real. Pero el modelo mental NUMA-aware ya te ayuda en una sola CPU: hilos en cores distintos comparten L3 y compiten por ancho de banda; pinear hilos a cores y ligar memoria a páginas concretas es la versión "barata" de lo que en servidores de dos sockets es vida o muerte.

The Phase 1 motivation and roofline pages established what the hardware can do. This page covers where the data and threads live, which on a multi-socket server is the difference between 1× and 0.4× peak performance. On Borja's i5-8250U (single socket) it is a smaller effect — but the vocabulary and the measurement habits are the same. Phase 23 (cloud GPU) and Phase 35 (distributed training) reuse them.

§1 What NUMA is¶

NUMA = Non-Uniform Memory Access. On a multi-socket machine, each CPU socket has its own attached DRAM (and its own memory controller). A thread on socket 0 accessing a page on socket 1 pays:

An extra latency hop through the inter-socket interconnect (Intel UPI/QPI, AMD Infinity Fabric). Typical: 60–150 ns on top of the ~80 ns local DRAM latency — so a remote access can be 2× slower in latency.
A bandwidth share of the interconnect, which is much slower than the per-socket memory controller. Saturation is easy.

The OS allocates pages using a first-touch policy by default: whichever socket's CPU first writes to a page owns it. So if a single thread initializes a 4 GiB tensor with np.zeros(...) and then 16 worker threads (8 on each socket) consume it, half the workers are paying the remote-access tax.

§2 Why this matters even on a single-socket laptop¶

The i5-8250U is one socket, four cores, eight threads (SMT/Hyper-Threading). Strictly, there is no NUMA. But three single-socket pathologies share the same flavor:

L3 sharing. All four cores share one L3 (6 MiB). If thread A streams a 10 MiB array, it evicts thread B's working set. The bandwidth A consumes is bandwidth B does not get.
SMT siblings. Hyper-threads on the same physical core share L1, L2, and the execution units. Putting two compute-bound threads on SMT siblings gives roughly 1.1–1.3× speedup, not 2×.
Frequency throttling. Sustained all-core load drops the all-core turbo frequency (base 1.6 GHz, single-core turbo 3.4 GHz, all-core turbo ~2.6 GHz on this chip). Wall-time benchmarks see this as "more threads → less speedup than expected."

The fix on a laptop is the same intellectual move as fixing NUMA on a server: bind work to specific cores, watch the throttling and cache-miss counters, and avoid forcing your scheduler to make decisions you could make explicitly.

§3 Tools you will use¶

Tool	What it does
`lscpu`	Reports sockets, cores, threads, cache sizes per level.
`numactl --hardware`	Reports NUMA nodes and distances. On a single-socket laptop: one node, distance 10.
`taskset -c 0,1,2,3 ./prog`	Pins a process to specific cores (helps avoid SMT-sibling collisions).
`OMP_NUM_THREADS=N`	Caps OpenMP / BLAS thread count. Setting it to the physical core count (4 on this laptop), not logical (8), often improves throughput by 5–15%.
`perf stat -e ...`	Reads PMU counters: cache misses, branch mispredicts, IPC.

§4 Worked exercise — set OMP and watch¶

The single experiment that demonstrates this on a laptop:

for T in 1 2 4 8; do
  OMP_NUM_THREADS=$T uv run python -c "
import numpy as np, time
A = np.random.randn(2000, 2000).astype(np.float32)
B = np.random.randn(2000, 2000).astype(np.float32)
t0 = time.perf_counter()
for _ in range(5): C = A @ B
dt = (time.perf_counter() - t0) / 5
print(f'OMP_NUM_THREADS={$T}  {dt*1000:7.1f} ms/iter')
"
done

Expected on the i5-8250U:

Threads	ms/iter	Speedup vs T=1
1	~120	1.0×
2	~65	1.85×
4	~40	3.0×
8	~38	3.2×

Going from 4 to 8 buys almost nothing because threads 5–8 are SMT siblings of 1–4. The matmul kernel is already keeping the execution units busy; SMT helps when the kernel has idle slots (memory-stall-heavy code), not when it is compute-saturated.

§5 The mental model for Phase 23+¶

On a cloud GPU node (8× H100, 2 sockets), the NUMA hierarchy is:

NIC ⟷ socket-0 PCIe ⟷ socket-0 DRAM
        ↕  UPI                    socket-0 cores ⟷ 4 GPUs over NVLink ↕
NIC ⟷ socket-1 PCIe ⟷ socket-1 DRAM
                                  socket-1 cores ⟷ 4 GPUs over NVLink ↕

Throwing a tensor into "host memory" without specifying which socket allocates it can cost a 30% throughput hit on the H100 feed. The single-socket discipline you learned today — be specific about where memory and threads live — is the exact same discipline at 100× the cost.

§6 What you can ignore for now¶

The Linux kernel's numactl page-migration policies (interleave, bind, preferred) — relevant on real NUMA. On a laptop, the default is fine.
CPU affinity for I/O threads vs compute threads — relevant for inference servers (Phase 33), not now.
NUMA-aware allocators (jemalloc, mimalloc) — these mostly help when many threads malloc concurrently from many sockets; not a laptop issue.

§7 References¶

Intel 64 and IA-32 Architectures Optimization Reference Manual, §11 (multi-socket considerations).
Drepper, What Every Programmer Should Know About Memory (2007), §5 (NUMA support).
Linux man-pages: numactl(8), taskset(1), lscpu(1).

§8 Read next¶

→ The existing roofline plot lab (lab/03-roofline-plot.md) — re-run it pinning OMP to physical core count, and document the difference.