English · Español

Lab 02 — See the cache hierarchy via strided access¶

Goal: turn the "cache line / spatial locality" abstraction from theory/02-memory-hierarchy.md into a curve you generated yourself.

Estimated time: 60–90 minutes.

Prereq: lab 00 + lab 01 committed.

What you produce¶

A directory experiments/01-cache-walks/ containing:

walk.py — the stride-walk benchmark.
results.json — per-stride access time.
cache_walk.png — log-log plot of stride vs ns/access.
manifest.json.
README.md interpreting the curve.

The kernel¶

Allocate a large np.float32 buffer (say 128 MiB — bigger than L3). Walk it with a fixed stride S, summing every S-th element. Time how long one full walk takes. Compute ns per access.

Repeat for many strides: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024.

The plot of stride vs ns/access reveals the cache line size and (with the right buffer size choice) the cache levels themselves.

TODOs¶

Block A — choose dimensions carefully¶

Buffer size: bigger than L3 (≥ 64 MiB).
Element type: np.float32 (4 bytes — matches the typical cache-line discussion).
Strides: at minimum [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024] (i.e. 2^(0..10)). Note: a stride of 16 fp32 = 64 bytes = one cache line on x86.
Per measurement: target ≥ 100 ms wall time. Adjust n_iters accordingly.

Block B — write the walk¶

The naive form:

def walk(buf: np.ndarray, stride: int) -> float:
    n = len(buf)
    n_accesses = n // stride
    total = np.float32(0.0)
    for i in range(0, n, stride):
        total += buf[i]
    return total

But this is a Python loop, so it measures interpreter overhead, not memory. You need an idiomatic NumPy form that's still memory-bound at the chosen stride.

Two options: - total = buf[::stride].sum() — clean, but [::stride] creates a view, not a copy; sum() walks it. - total = np.add.reduce(buf[::stride]) — same thing, slightly faster.

Pick one. Justify in README.md why this isolates memory cost (and what it doesn't isolate).

Block C — time correctly¶

Warm up once.
Time with perf_counter_ns().
Repeat enough times to get ≥ 100 ms total per stride.
Compute ns_per_access = elapsed_ns / (n_accesses × iters).
Record in results.json.

Block D — plot¶

Log-log: x = stride (elements), y = ns/access.
Annotate cache line size (16 fp32 = 64 bytes) as a vertical line — the curve should show a clear slope change there.
Annotate L1/L2/L3 boundaries if your buffer-size choice exposes them (you may want to repeat the lab at three different total buffer sizes if you want all three).

Block E — interpret¶

Answer in README.md:

Where does the curve "elbow"? The first elbow is the cache line: below stride = line_size, you're reusing the line you already loaded; above, every access is a fresh line.
Why is the curve flat above some larger stride? When stride exceeds cache associativity × set count, every access is a miss; further increases buy you nothing.
What's the ratio of "best access time" to "worst access time"? It should be roughly the cache-to-DRAM latency ratio.

Constraints¶

Pure Python + NumPy. No C extensions, no cython, no ctypes.
One buffer per experiment — don't accidentally allocate a fresh one inside the timing loop.
Single thread.

Stop conditions¶

Done when:

All five files in the directory.
The plot shows a visible elbow near stride = 16 (one fp32 cache line).
README.md answers all three Block E questions.
Numbers in results.json are reproducible: re-running walk.py gives ns/access within ±15% of the first run.

Pitfalls¶

NumPy striding fuses into SIMD differently. buf[::1] is contiguous; NumPy SIMDs it. buf[::2] is not; NumPy falls back to a scalar gather. The transition from stride 1 → stride 2 is partly SIMD-related, not purely memory. Note this in your interpretation.
The curve isn't smooth. It will have wobbles. That's fine.
At very large strides, the curve might dip. Out-of-order execution can hide DRAM latency when every request is a miss (because they all go in parallel). This is the "memory-level parallelism" effect — interesting, distracting; mention it but don't optimize for it.

When to consult `solutions/`¶

After committing your five files. Solution at solutions/02-cache-walks-ref.md (written at phase open).

Next lab: lab/03-roofline-plot.md. This is the integration lab.