English · Español
Break — Blow the L1 cache with a stride switch¶
🇪🇸 Una sola línea cambia el paso de un loop de 1 a 1024 y la velocidad cae más de 10×. Sirve para ver la latencia de la memoria, no leerla en una tabla.
This /break is for the cache-walk lab (lab/02-cache-walks.md). It teaches by deliberately making memory the bottleneck and then watching the wall clock.
Hypothesis¶
The learner predicts: "Increasing the stride of a 1M-element double-array sum from 1 to 1024 will increase wall time by at least an order of magnitude, even though the number of additions performed drops by 1024×."
The point is that fewer additions per second can be slower in wall time, because each addition now costs one full cache-line fetch from DRAM.
The break¶
In your cache-walk script (benchmarks/cache_walk.py if you wrote one in lab 02, or a fresh file otherwise), change the inner loop's stride from 1 to 1024:
Do not reduce N at the same time — the goal is to keep array size fixed and only vary the access pattern.
Run procedure¶
uv run python -c "
import time, numpy as np
N = 1 << 22 # 4M doubles = 32 MiB, comfortably bigger than L2
arr = np.ones(N, dtype=np.float64)
for stride in (1, 8, 64, 512, 1024):
t0 = time.perf_counter()
s = 0.0
for i in range(0, N, stride):
s += arr[i]
dt = time.perf_counter() - t0
elems = N // stride
print(f'stride={stride:5d} elems={elems:8d} wall={dt*1000:8.2f} ms ns/elem={dt*1e9/elems:7.1f}')
"
Note: a pure-Python loop is fine here because the goal is to isolate memory latency, not to measure peak throughput. (For a vectorized version, use arr[::stride].sum() — but then the Python interpreter overhead is gone and you may want a numba.njit to surface the stride effect cleanly.)
Expected failure mode¶
On the i5-8250U (L1d = 32 KiB, L2 = 256 KiB, L3 = 6 MiB, DRAM latency ~80 ns):
- stride=1 — sequential, prefetcher-friendly.
ns/elem ≈ 3–5 ns(the Python loop is the bottleneck, not memory). - stride=8 — one element per 64-byte line, prefetcher still on.
ns/elem ≈ 10–20 ns. - stride=64 — eight elements per line skipped, mid-range.
ns/elem ≈ 50–80 ns. - stride=1024 — each element on a different page; TLB starts missing.
ns/elem ≈ 200–400 ns.
The wall time drop is the surprise: stride 1024 visits 1024× fewer elements but each is ~100× slower per visit, so wall time decreases by only ~10×, not 1024×.
Diagnostic¶
From logs alone, the smoking gun is the ns/elem column rising as stride grows. A naive (wrong) mental model would predict ns/elem to be roughly constant across strides — that's only true while you're inside L1.
Cross-check with perf stat if available:
A 10× rise in cache-miss rate confirms the diagnosis: memory is the bottleneck, not the loop.
Lesson¶
Two lessons in one:
- "Less work" is not always "less time." When the work is dominated by memory accesses, the pattern of accesses dwarfs the count. Sequential beats strided even when strided does 1000× less arithmetic.
- The L1 / L2 / L3 / DRAM cliffs are real. They are not textbook abstractions — you just measured them with a stopwatch on your laptop.
When Phase 3's matmul lab asks you to compare a transposed-loop-order matmul to a tiled matmul, you will already know why the difference is so large: it is this experiment writ slightly larger.
References¶
- Drepper, What Every Programmer Should Know About Memory (2007) — the canonical reference, esp. §3 (cache hierarchy).
- Intel 64 and IA-32 Architectures Optimization Reference Manual, §2.5 (cache organization on Skylake/Kaby Lake).