English · Español

Break — Blow the L1 cache with a stride switch¶

🇪🇸 Una sola línea cambia el paso de un loop de 1 a 1024 y la velocidad cae más de 10×. Sirve para ver la latencia de la memoria, no leerla en una tabla.

This /break is for the cache-walk lab (lab/02-cache-walks.md). It teaches by deliberately making memory the bottleneck and then watching the wall clock.

Hypothesis¶

The learner predicts: "Increasing the stride of a 1M-element double-array sum from 1 to 1024 will increase wall time by at least an order of magnitude, even though the number of additions performed drops by 1024×."

The point is that fewer additions per second can be slower in wall time, because each addition now costs one full cache-line fetch from DRAM.

The break¶

In your cache-walk script (benchmarks/cache_walk.py if you wrote one in lab 02, or a fresh file otherwise), change the inner loop's stride from 1 to 1024:

-for i in range(0, N):
+for i in range(0, N, 1024):
     total += arr[i]

Do not reduce N at the same time — the goal is to keep array size fixed and only vary the access pattern.

Run procedure¶

uv run python -c "
import time, numpy as np
N = 1 << 22  # 4M doubles = 32 MiB, comfortably bigger than L2
arr = np.ones(N, dtype=np.float64)

for stride in (1, 8, 64, 512, 1024):
    t0 = time.perf_counter()
    s = 0.0
    for i in range(0, N, stride):
        s += arr[i]
    dt = time.perf_counter() - t0
    elems = N // stride
    print(f'stride={stride:5d}  elems={elems:8d}  wall={dt*1000:8.2f} ms  ns/elem={dt*1e9/elems:7.1f}')
"

Note: a pure-Python loop is fine here because the goal is to isolate memory latency, not to measure peak throughput. (For a vectorized version, use arr[::stride].sum() — but then the Python interpreter overhead is gone and you may want a numba.njit to surface the stride effect cleanly.)

Expected failure mode¶

On the i5-8250U (L1d = 32 KiB, L2 = 256 KiB, L3 = 6 MiB, DRAM latency ~80 ns):

stride=1 — sequential, prefetcher-friendly. ns/elem ≈ 3–5 ns (the Python loop is the bottleneck, not memory).
stride=8 — one element per 64-byte line, prefetcher still on. ns/elem ≈ 10–20 ns.
stride=64 — eight elements per line skipped, mid-range. ns/elem ≈ 50–80 ns.
stride=1024 — each element on a different page; TLB starts missing. ns/elem ≈ 200–400 ns.

The wall time drop is the surprise: stride 1024 visits 1024× fewer elements but each is ~100× slower per visit, so wall time decreases by only ~10×, not 1024×.

Diagnostic¶

From logs alone, the smoking gun is the ns/elem column rising as stride grows. A naive (wrong) mental model would predict ns/elem to be roughly constant across strides — that's only true while you're inside L1.

Cross-check with perf stat if available:

perf stat -e cache-misses,cache-references uv run python your_script.py

A 10× rise in cache-miss rate confirms the diagnosis: memory is the bottleneck, not the loop.

Lesson¶

Two lessons in one:

"Less work" is not always "less time." When the work is dominated by memory accesses, the pattern of accesses dwarfs the count. Sequential beats strided even when strided does 1000× less arithmetic.
The L1 / L2 / L3 / DRAM cliffs are real. They are not textbook abstractions — you just measured them with a stopwatch on your laptop.

When Phase 3's matmul lab asks you to compare a transposed-loop-order matmul to a tiled matmul, you will already know why the difference is so large: it is this experiment writ slightly larger.

References¶

Drepper, What Every Programmer Should Know About Memory (2007) — the canonical reference, esp. §3 (cache hierarchy).
Intel 64 and IA-32 Architectures Optimization Reference Manual, §2.5 (cache organization on Skylake/Kaby Lake).