Skip to content

English · Español

04 — NUMA, thread pinning, and the "single-socket lie"

🇪🇸 Tu laptop tiene un socket — sin NUMA real. Pero el modelo mental NUMA-aware ya te ayuda en una sola CPU: hilos en cores distintos comparten L3 y compiten por ancho de banda; pinear hilos a cores y ligar memoria a páginas concretas es la versión "barata" de lo que en servidores de dos sockets es vida o muerte.

The Phase 1 motivation and roofline pages established what the hardware can do. This page covers where the data and threads live, which on a multi-socket server is the difference between 1× and 0.4× peak performance. On Borja's i5-8250U (single socket) it is a smaller effect — but the vocabulary and the measurement habits are the same. Phase 23 (cloud GPU) and Phase 35 (distributed training) reuse them.


§1 What NUMA is

NUMA = Non-Uniform Memory Access. On a multi-socket machine, each CPU socket has its own attached DRAM (and its own memory controller). A thread on socket 0 accessing a page on socket 1 pays:

  • An extra latency hop through the inter-socket interconnect (Intel UPI/QPI, AMD Infinity Fabric). Typical: 60–150 ns on top of the ~80 ns local DRAM latency — so a remote access can be 2× slower in latency.
  • A bandwidth share of the interconnect, which is much slower than the per-socket memory controller. Saturation is easy.

The OS allocates pages using a first-touch policy by default: whichever socket's CPU first writes to a page owns it. So if a single thread initializes a 4 GiB tensor with np.zeros(...) and then 16 worker threads (8 on each socket) consume it, half the workers are paying the remote-access tax.

§2 Why this matters even on a single-socket laptop

The i5-8250U is one socket, four cores, eight threads (SMT/Hyper-Threading). Strictly, there is no NUMA. But three single-socket pathologies share the same flavor:

  1. L3 sharing. All four cores share one L3 (6 MiB). If thread A streams a 10 MiB array, it evicts thread B's working set. The bandwidth A consumes is bandwidth B does not get.
  2. SMT siblings. Hyper-threads on the same physical core share L1, L2, and the execution units. Putting two compute-bound threads on SMT siblings gives roughly 1.1–1.3× speedup, not 2×.
  3. Frequency throttling. Sustained all-core load drops the all-core turbo frequency (base 1.6 GHz, single-core turbo 3.4 GHz, all-core turbo ~2.6 GHz on this chip). Wall-time benchmarks see this as "more threads → less speedup than expected."

The fix on a laptop is the same intellectual move as fixing NUMA on a server: bind work to specific cores, watch the throttling and cache-miss counters, and avoid forcing your scheduler to make decisions you could make explicitly.

§3 Tools you will use

Tool What it does
lscpu Reports sockets, cores, threads, cache sizes per level.
numactl --hardware Reports NUMA nodes and distances. On a single-socket laptop: one node, distance 10.
taskset -c 0,1,2,3 ./prog Pins a process to specific cores (helps avoid SMT-sibling collisions).
OMP_NUM_THREADS=N Caps OpenMP / BLAS thread count. Setting it to the physical core count (4 on this laptop), not logical (8), often improves throughput by 5–15%.
perf stat -e ... Reads PMU counters: cache misses, branch mispredicts, IPC.

§4 Worked exercise — set OMP and watch

The single experiment that demonstrates this on a laptop:

for T in 1 2 4 8; do
  OMP_NUM_THREADS=$T uv run python -c "
import numpy as np, time
A = np.random.randn(2000, 2000).astype(np.float32)
B = np.random.randn(2000, 2000).astype(np.float32)
t0 = time.perf_counter()
for _ in range(5): C = A @ B
dt = (time.perf_counter() - t0) / 5
print(f'OMP_NUM_THREADS={$T}  {dt*1000:7.1f} ms/iter')
"
done

Expected on the i5-8250U:

Threads ms/iter Speedup vs T=1
1 ~120 1.0×
2 ~65 1.85×
4 ~40 3.0×
8 ~38 3.2×

Going from 4 to 8 buys almost nothing because threads 5–8 are SMT siblings of 1–4. The matmul kernel is already keeping the execution units busy; SMT helps when the kernel has idle slots (memory-stall-heavy code), not when it is compute-saturated.

§5 The mental model for Phase 23+

On a cloud GPU node (8× H100, 2 sockets), the NUMA hierarchy is:

NIC ⟷ socket-0 PCIe ⟷ socket-0 DRAM
        ↕  UPI                    socket-0 cores ⟷ 4 GPUs over NVLink ↕
NIC ⟷ socket-1 PCIe ⟷ socket-1 DRAM
                                  socket-1 cores ⟷ 4 GPUs over NVLink ↕

Throwing a tensor into "host memory" without specifying which socket allocates it can cost a 30% throughput hit on the H100 feed. The single-socket discipline you learned today — be specific about where memory and threads live — is the exact same discipline at 100× the cost.

§6 What you can ignore for now

  • The Linux kernel's numactl page-migration policies (interleave, bind, preferred) — relevant on real NUMA. On a laptop, the default is fine.
  • CPU affinity for I/O threads vs compute threads — relevant for inference servers (Phase 33), not now.
  • NUMA-aware allocators (jemalloc, mimalloc) — these mostly help when many threads malloc concurrently from many sockets; not a laptop issue.

§7 References

  • Intel 64 and IA-32 Architectures Optimization Reference Manual, §11 (multi-socket considerations).
  • Drepper, What Every Programmer Should Know About Memory (2007), §5 (NUMA support).
  • Linux man-pages: numactl(8), taskset(1), lscpu(1).

→ The existing roofline plot lab (lab/03-roofline-plot.md) — re-run it pinning OMP to physical core count, and document the difference.