Skip to content

English · Español

Lab 02 — See the cache hierarchy via strided access

Goal: turn the "cache line / spatial locality" abstraction from theory/02-memory-hierarchy.md into a curve you generated yourself.

Estimated time: 60–90 minutes.

Prereq: lab 00 + lab 01 committed.


What you produce

A directory experiments/01-cache-walks/ containing:

  • walk.py — the stride-walk benchmark.
  • results.json — per-stride access time.
  • cache_walk.png — log-log plot of stride vs ns/access.
  • manifest.json.
  • README.md interpreting the curve.

The kernel

Allocate a large np.float32 buffer (say 128 MiB — bigger than L3). Walk it with a fixed stride S, summing every S-th element. Time how long one full walk takes. Compute ns per access.

Repeat for many strides: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024.

The plot of stride vs ns/access reveals the cache line size and (with the right buffer size choice) the cache levels themselves.

TODOs

Block A — choose dimensions carefully

  • Buffer size: bigger than L3 (≥ 64 MiB).
  • Element type: np.float32 (4 bytes — matches the typical cache-line discussion).
  • Strides: at minimum [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024] (i.e. 2^(0..10)). Note: a stride of 16 fp32 = 64 bytes = one cache line on x86.
  • Per measurement: target ≥ 100 ms wall time. Adjust n_iters accordingly.

Block B — write the walk

The naive form:

def walk(buf: np.ndarray, stride: int) -> float:
    n = len(buf)
    n_accesses = n // stride
    total = np.float32(0.0)
    for i in range(0, n, stride):
        total += buf[i]
    return total

But this is a Python loop, so it measures interpreter overhead, not memory. You need an idiomatic NumPy form that's still memory-bound at the chosen stride.

Two options: - total = buf[::stride].sum() — clean, but [::stride] creates a view, not a copy; sum() walks it. - total = np.add.reduce(buf[::stride]) — same thing, slightly faster.

Pick one. Justify in README.md why this isolates memory cost (and what it doesn't isolate).

Block C — time correctly

  • Warm up once.
  • Time with perf_counter_ns().
  • Repeat enough times to get ≥ 100 ms total per stride.
  • Compute ns_per_access = elapsed_ns / (n_accesses × iters).
  • Record in results.json.

Block D — plot

  • Log-log: x = stride (elements), y = ns/access.
  • Annotate cache line size (16 fp32 = 64 bytes) as a vertical line — the curve should show a clear slope change there.
  • Annotate L1/L2/L3 boundaries if your buffer-size choice exposes them (you may want to repeat the lab at three different total buffer sizes if you want all three).

Block E — interpret

Answer in README.md:

  1. Where does the curve "elbow"? The first elbow is the cache line: below stride = line_size, you're reusing the line you already loaded; above, every access is a fresh line.
  2. Why is the curve flat above some larger stride? When stride exceeds cache associativity × set count, every access is a miss; further increases buy you nothing.
  3. What's the ratio of "best access time" to "worst access time"? It should be roughly the cache-to-DRAM latency ratio.

Constraints

  • Pure Python + NumPy. No C extensions, no cython, no ctypes.
  • One buffer per experiment — don't accidentally allocate a fresh one inside the timing loop.
  • Single thread.

Stop conditions

Done when:

  1. All five files in the directory.
  2. The plot shows a visible elbow near stride = 16 (one fp32 cache line).
  3. README.md answers all three Block E questions.
  4. Numbers in results.json are reproducible: re-running walk.py gives ns/access within ±15% of the first run.

Pitfalls

  • NumPy striding fuses into SIMD differently. buf[::1] is contiguous; NumPy SIMDs it. buf[::2] is not; NumPy falls back to a scalar gather. The transition from stride 1 → stride 2 is partly SIMD-related, not purely memory. Note this in your interpretation.
  • The curve isn't smooth. It will have wobbles. That's fine.
  • At very large strides, the curve might dip. Out-of-order execution can hide DRAM latency when every request is a miss (because they all go in parallel). This is the "memory-level parallelism" effect — interesting, distracting; mention it but don't optimize for it.

When to consult solutions/

After committing your five files. Solution at solutions/02-cache-walks-ref.md (written at phase open).


Next lab: lab/03-roofline-plot.md. This is the integration lab.