English · Español
Lab 02 — See the cache hierarchy via strided access¶
Goal: turn the "cache line / spatial locality" abstraction from
theory/02-memory-hierarchy.mdinto a curve you generated yourself.Estimated time: 60–90 minutes.
Prereq: lab 00 + lab 01 committed.
What you produce¶
A directory experiments/01-cache-walks/ containing:
walk.py— the stride-walk benchmark.results.json— per-stride access time.cache_walk.png— log-log plot of stride vs ns/access.manifest.json.README.mdinterpreting the curve.
The kernel¶
Allocate a large np.float32 buffer (say 128 MiB — bigger than L3). Walk it with a fixed stride S, summing every S-th element. Time how long one full walk takes. Compute ns per access.
Repeat for many strides: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024.
The plot of stride vs ns/access reveals the cache line size and (with the right buffer size choice) the cache levels themselves.
TODOs¶
Block A — choose dimensions carefully¶
- Buffer size: bigger than L3 (≥ 64 MiB).
- Element type:
np.float32(4 bytes — matches the typical cache-line discussion). - Strides: at minimum
[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024](i.e.2^(0..10)). Note: a stride of 16 fp32 = 64 bytes = one cache line on x86. - Per measurement: target ≥ 100 ms wall time. Adjust
n_itersaccordingly.
Block B — write the walk¶
The naive form:
def walk(buf: np.ndarray, stride: int) -> float:
n = len(buf)
n_accesses = n // stride
total = np.float32(0.0)
for i in range(0, n, stride):
total += buf[i]
return total
But this is a Python loop, so it measures interpreter overhead, not memory. You need an idiomatic NumPy form that's still memory-bound at the chosen stride.
Two options:
- total = buf[::stride].sum() — clean, but [::stride] creates a view, not a copy; sum() walks it.
- total = np.add.reduce(buf[::stride]) — same thing, slightly faster.
Pick one. Justify in README.md why this isolates memory cost (and what it doesn't isolate).
Block C — time correctly¶
- Warm up once.
- Time with
perf_counter_ns(). - Repeat enough times to get ≥ 100 ms total per stride.
- Compute
ns_per_access = elapsed_ns / (n_accesses × iters). - Record in
results.json.
Block D — plot¶
- Log-log: x = stride (elements), y = ns/access.
- Annotate cache line size (16 fp32 = 64 bytes) as a vertical line — the curve should show a clear slope change there.
- Annotate L1/L2/L3 boundaries if your buffer-size choice exposes them (you may want to repeat the lab at three different total buffer sizes if you want all three).
Block E — interpret¶
Answer in README.md:
- Where does the curve "elbow"? The first elbow is the cache line: below stride = line_size, you're reusing the line you already loaded; above, every access is a fresh line.
- Why is the curve flat above some larger stride? When stride exceeds cache associativity × set count, every access is a miss; further increases buy you nothing.
- What's the ratio of "best access time" to "worst access time"? It should be roughly the cache-to-DRAM latency ratio.
Constraints¶
- Pure Python + NumPy. No C extensions, no
cython, noctypes. - One buffer per experiment — don't accidentally allocate a fresh one inside the timing loop.
- Single thread.
Stop conditions¶
Done when:
- All five files in the directory.
- The plot shows a visible elbow near stride = 16 (one fp32 cache line).
README.mdanswers all three Block E questions.- Numbers in
results.jsonare reproducible: re-runningwalk.pygives ns/access within ±15% of the first run.
Pitfalls¶
- NumPy striding fuses into SIMD differently.
buf[::1]is contiguous; NumPy SIMDs it.buf[::2]is not; NumPy falls back to a scalar gather. The transition from stride 1 → stride 2 is partly SIMD-related, not purely memory. Note this in your interpretation. - The curve isn't smooth. It will have wobbles. That's fine.
- At very large strides, the curve might dip. Out-of-order execution can hide DRAM latency when every request is a miss (because they all go in parallel). This is the "memory-level parallelism" effect — interesting, distracting; mention it but don't optimize for it.
When to consult solutions/¶
After committing your five files. Solution at solutions/02-cache-walks-ref.md (written at phase open).
Next lab: lab/03-roofline-plot.md. This is the integration lab.