04: NUMA, thread pinning, and the "single-socket lie"
Your laptop has a single socket, so there is no real NUMA. But the NUMA-aware mental model already helps on one CPU: threads on different cores share L3 and compete for bandwidth; pinning threads to cores and binding memory to specific pages is the "cheap" version of what is life or death on two-socket servers.
The Phase 1 motivation and roofline pages established what the hardware can do. This page covers where the data and threads live, which on a multi-socket server is the difference between 1× and 0.4× peak performance. On Borja's i5-8250U (single socket) it is a smaller effect — but the vocabulary and the measurement habits are the same. Phase 23 (cloud GPU) and Phase 35 (distributed training) reuse them.
§1 What NUMA is
NUMA = Non-Uniform Memory Access. On a multi-socket machine, each CPU socket has its own attached DRAM (and its own memory controller). A thread on socket 0 accessing a page on socket 1 pays:
- An extra latency hop through the inter-socket interconnect (Intel UPI/QPI, AMD Infinity Fabric). Typical: 60–150 ns on top of the ~80 ns local DRAM latency, so a remote access takes roughly 140–230 ns in total, or about 1.75–2.9× the local latency.
- A bandwidth share of the interconnect, which is much slower than the per-socket memory controller. Saturation is easy.
The OS allocates pages using a first-touch policy by default: a page is placed on the NUMA node of the CPU that first writes to it. So if a single thread initializes a 4 GiB tensor (for example with np.ones(...); np.zeros(...) typically defers the physical allocation until the first write) and then 16 worker threads (8 on each socket) consume it, half the workers are paying the remote-access tax.
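To make the remote-access tax concrete, here is a back-of-the-envelope sketch using the latency figures quoted in this section. The function name and the simple weighted-average model are assumptions for illustration; real workloads overlap misses and hide part of this latency.

```python
def mean_dram_latency_ns(local_ns, remote_extra_ns, remote_fraction):
    """Average DRAM latency when a fraction of accesses crosses the interconnect."""
    remote_ns = local_ns + remote_extra_ns
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


LOCAL_NS = 80.0  # approximate local DRAM latency quoted above

# All pages live on socket 0; 8 of the 16 workers run on socket 1,
# so half of all accesses (averaged over workers) are remote.
for extra_ns in (60.0, 150.0):
    avg = mean_dram_latency_ns(LOCAL_NS, extra_ns, remote_fraction=0.5)
    print(f"interconnect hop = {extra_ns:5.0f} ns -> mean latency = {avg:6.1f} ns")
```

The mean hides the imbalance: the socket-0 workers still see about 80 ns, while the socket-1 workers see the full remote latency, so the slow half sets the pace of any step that waits for all workers.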
§2 Why this matters even on a single-socket laptop
The i5-8250U is one socket, four cores, eight threads (SMT/Hyper-Threading). Strictly, there is no NUMA. But three single-socket pathologies share the same flavor:
- L3 sharing. All four cores share one L3 (6 MiB). If thread A streams a 10 MiB array, it evicts thread B's working set. The bandwidth A consumes is bandwidth B does not get.
- SMT siblings. Hyper-threads on the same physical core share L1, L2, and the execution units. Putting two compute-bound threads on SMT siblings gives roughly 1.1–1.3× speedup, not 2×.
- Frequency throttling. Under sustained all-core load the clock settles at the all-core turbo rather than the single-core turbo (base 1.6 GHz, single-core turbo 3.4 GHz, all-core turbo ~2.6 GHz on this chip). Wall-time benchmarks see this as "more threads → less speedup than expected."
The fix on a laptop is the same intellectual move as fixing NUMA on a server: bind work to specific cores, watch the throttling and cache-miss counters, and avoid forcing your scheduler to make decisions you could make explicitly.
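Binding work to specific cores can also be done from inside a Python process. The sketch below uses `os.sched_setaffinity`, which the standard library only provides on platforms that support it (Linux, not macOS or Windows); the helper name `pin_current_process` is made up for this example.

```python
import os


def pin_current_process(cpus):
    """Restrict the caller to `cpus` and return the resulting affinity set.

    Returns None on platforms without os.sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, set(cpus))  # pid 0 means "the caller"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    allowed = sorted(os.sched_getaffinity(0))
    print("allowed before:", allowed)
    # Assumption for the demo: pin to the first allowed CPU only.
    print("allowed after: ", sorted(pin_current_process(allowed[:1])))
    pin_current_process(allowed)  # restore the original mask
else:
    print("sched_setaffinity is not available on this platform")
```

Threads started after the call inherit the mask, so pinning before importing NumPy or starting a worker pool is the simplest ordering.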
§3 Tools you will use
| Tool | What it does |
|---|---|
| `lscpu` | Reports sockets, cores, threads, cache sizes per level. |
| `numactl --hardware` | Reports NUMA nodes and distances. On a single-socket laptop: one node, distance 10. |
| `taskset -c 0,1,2,3 ./prog` | Pins a process to specific cores (helps avoid SMT-sibling collisions). |
| `OMP_NUM_THREADS=N` | Caps OpenMP / BLAS thread count. Setting it to the physical core count (4 on this laptop), not logical (8), often improves throughput by 5–15%. |
| `perf stat -e ...` | Reads PMU counters: cache misses, branch mispredicts, IPC. |
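Which logical CPU numbers are SMT siblings varies between machines, so a `taskset` list such as `0,1,2,3` is worth checking before trusting it. A minimal sketch, assuming a Linux system that exposes `thread_siblings_list` under `/sys/devices/system/cpu`; the helper names are invented for this example.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,4' or '0-3,8' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU from each group of SMT siblings."""
    groups = {tuple(parse_cpu_list(s)) for s in sibling_lists}
    return sorted(g[0] for g in groups if g)


def read_sibling_lists(root="/sys/devices/system/cpu"):
    """Read every CPU's sibling list; returns [] if sysfs is not available."""
    paths = Path(root).glob("cpu[0-9]*/topology/thread_siblings_list")
    return [p.read_text() for p in paths]


if __name__ == "__main__":
    cpus = one_cpu_per_core(read_sibling_lists())
    if cpus:
        print("taskset -c " + ",".join(map(str, cpus)) + " ./prog")
    else:
        print("no CPU topology found in sysfs")
```

The printed command pins to one logical CPU per physical core, which is the placement the `taskset` row above is aiming for.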
§4 Worked exercise: set OMP and watch
The single experiment that demonstrates this on a laptop:
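```bash
# Time a 2000x2000 float32 matmul at 1, 2, 4, and 8 BLAS threads.
for T in 1 2 4 8; do
  OMP_NUM_THREADS=$T uv run python -c "
import os, time
import numpy as np
A = np.random.randn(2000, 2000).astype(np.float32)
B = np.random.randn(2000, 2000).astype(np.float32)
A @ B  # warm-up so thread-pool start-up is not timed
t0 = time.perf_counter()
for _ in range(5):
    C = A @ B
dt = (time.perf_counter() - t0) / 5
n = os.environ['OMP_NUM_THREADS']  # read back the value the shell exported
print(f'OMP_NUM_THREADS={n} {dt*1000:7.1f} ms/iter')
"
done
```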
Expected on the i5-8250U:
| Threads | ms/iter | Speedup vs T=1 |
|---|---|---|
| 1 | ~120 | 1.0× |
| 2 | ~65 | 1.85× |
| 4 | ~40 | 3.0× |
| 8 | ~38 | 3.2× |
Going from 4 to 8 buys almost nothing because threads 5–8 are SMT siblings of 1–4. The matmul kernel is already keeping the execution units busy; SMT helps when the kernel has idle slots (memory-stall-heavy code), not when it is compute-saturated.
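The speedup column is just the one-thread time divided by the T-thread time; dividing again by T gives parallel efficiency, which makes the SMT plateau easier to see. A small sketch using the approximate timings from the table above (substitute your own measurements); the function name is illustrative.

```python
def scaling_table(ms_by_threads):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread time."""
    t1 = ms_by_threads[1]
    rows = []
    for threads in sorted(ms_by_threads):
        speedup = t1 / ms_by_threads[threads]
        rows.append((threads, speedup, speedup / threads))
    return rows


# Approximate expected timings from the table above, in ms per iteration.
expected_ms = {1: 120.0, 2: 65.0, 4: 40.0, 8: 38.0}

for threads, speedup, efficiency in scaling_table(expected_ms):
    print(f"T={threads}  speedup={speedup:4.2f}x  efficiency={efficiency:6.1%}")
```

With these numbers, efficiency is 75% at four threads and falls to roughly 39% at eight, which is the quantitative form of "almost nothing" from the extra SMT threads.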
§5 The mental model for Phase 23+
On a cloud GPU node (8× H100, 2 sockets), the NUMA hierarchy is:
```text
NIC ⟷ socket-0 PCIe ⟷ socket-0 DRAM
  ↕ UPI               socket-0 cores ⟷ 4 GPUs over NVLink ↕
NIC ⟷ socket-1 PCIe ⟷ socket-1 DRAM
                      socket-1 cores ⟷ 4 GPUs over NVLink ↕
```
Throwing a tensor into "host memory" without specifying which socket allocates it can cost a 30% throughput hit on the H100 feed. The single-socket discipline you learned today — be specific about where memory and threads live — is the exact same discipline at 100× the cost.
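Being specific about placement on such a node usually means launching each GPU's feeder process with its CPUs and memory bound to the GPU's local socket. The sketch below is hypothetical throughout: the GPU-to-node and node-to-CPU maps are invented, and a real node's layout should be read from the machine (for example with `nvidia-smi topo -m` and `numactl --hardware`). The `--cpunodebind` and `--membind` flags are standard `numactl` options.

```python
# Hypothetical layout: GPUs 0-3 hang off socket 0, GPUs 4-7 off socket 1,
# with 32 logical CPUs per socket. Real machines will differ.
GPU_TO_NODE = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
NODE_TO_CPUS = {0: range(0, 32), 1: range(32, 64)}


def cpus_for_gpu(gpu, gpu_to_node=GPU_TO_NODE, node_to_cpus=NODE_TO_CPUS):
    """Logical CPUs on the socket that is local to `gpu`."""
    return sorted(node_to_cpus[gpu_to_node[gpu]])


def launch_prefix(gpu, gpu_to_node=GPU_TO_NODE):
    """numactl prefix that keeps both threads and memory on the GPU's socket."""
    node = gpu_to_node[gpu]
    return f"numactl --cpunodebind={node} --membind={node}"


for gpu in (0, 5):
    cpus = cpus_for_gpu(gpu)
    print(f"GPU {gpu}: CPUs {cpus[0]}-{cpus[-1]}, launch with: {launch_prefix(gpu)}")
```

The point of the sketch is the pairing: the same node index appears in both flags, so the host buffer that feeds a GPU is allocated on the socket whose PCIe lanes reach that GPU.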
§6 What you can ignore for now
- The Linux kernel's memory-placement policies set with `numactl` (interleave, bind, preferred): relevant on real NUMA. On a laptop, the default is fine.
- CPU affinity for I/O threads vs compute threads: relevant for inference servers (Phase 33), not now.
- NUMA-aware allocators (jemalloc, mimalloc): these mostly help when many threads `malloc` concurrently from many sockets; not a laptop issue.
§7 References
- Intel 64 and IA-32 Architectures Optimization Reference Manual, §11 (multi-socket considerations).
- Drepper, What Every Programmer Should Know About Memory (2007), §5 (NUMA support).
- Linux man-pages: `numactl(8)`, `taskset(1)`, `lscpu(1)`.
§8 Read next
→ The existing roofline plot lab (lab/03-roofline-plot.md) — re-run it pinning OMP to physical core count, and document the difference.