English · Español

Lab 01 — Measure RAM bandwidth empirically¶

Goal: prove the DRAM bandwidth ceiling exists, by hitting it.

Estimated time: 60–90 minutes.

Prereq: lab 00 (machine profile) must be committed first.

What you produce¶

A directory experiments/01-memcpy-bandwidth/ containing:

bench.py — your measurement script (Borja writes; NO peeking at NumPy's source).
results.json — measured throughput at multiple buffer sizes.
bandwidth.png — plot of buffer size vs throughput, log-x axis.
manifest.json — {seed, versions, config, hardware} per LYNX_CORTEX.md §5.
README.md (2–3 paragraphs) explaining what you measured, how, and what the curve shape tells you.

The kernel¶

The "kernel" of this lab is the simplest memory-bound operation possible: copy a buffer of N fp32 values to another buffer of N fp32 values, and time it.

Each iteration reads 4N bytes and writes 4N bytes — 8N bytes moved total. Throughput in GB/s = 8N / time_seconds / 10⁹.

You want to measure this for buffer sizes spanning L1 → L2 → L3 → DRAM, i.e. from ~1 KiB to ~1 GiB. The plot should show three plateaus corresponding to the three caches, then settle at the DRAM bandwidth.

TODOs¶

Block A — write the kernel¶

Use numpy arrays (np.empty(N, dtype=np.float32)). NOT Python lists.
Use np.copyto(dst, src) for the actual copy. (Don't write your own loop — Python loops over fp32 measure interpreter overhead, not memory.)
Time with time.perf_counter_ns(). Repeat each measurement enough times that one full repetition takes ≥ 100 ms (to swamp timer noise); typically 5–500 repetitions depending on N.
One warm-up iteration before timing, to populate caches and avoid page faults.
Buffer sizes: at least 12 points, log-spaced from 1 KiB to 1 GiB. Suggested: [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144, 1048576] KiB (i.e. 2^(0..20) KiB).
At each size, record: bytes_per_iter, iters, elapsed_s, throughput_GBs. Save as results.json.

Block B — plot¶

matplotlib. x-axis: buffer size in KiB, log scale. y-axis: throughput in GB/s, linear.
Annotate the L1/L2/L3 boundaries on the plot, using sizes from your profile.md. (Vertical dashed lines + labels.)
Annotate the expected DRAM ceiling from profile.md as a horizontal dashed line.
Save as bandwidth.png and reference from README.md.

Block C — interpret¶

Three questions to answer in README.md:

At what buffer size does throughput drop sharply? Compare to L1, L2, L3 sizes. They should line up.
Where does the curve plateau on the right? That's your measured DRAM bandwidth. How close is it to the theoretical β_peak from lab 00?
Why is throughput at small buffer sizes (in L1) less than the L1 bandwidth of ~1 TB/s? (Hint: think about what overhead dominates a 1 KiB measurement.)

Block D — manifest¶

manifest.json schema:

{
  "experiment": "01-memcpy-bandwidth",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "versions": {
    "python": "3.11.x",
    "numpy": "X.Y.Z",
    "matplotlib": "X.Y.Z",
    "linux_kernel": "..."
  },
  "hardware": {
    "cpu_model": "Intel Core i5-8250U",
    "cores_threads": "4/8",
    "ram_gib": 62,
    "cpu_governor_at_run": "performance"
  },
  "config": {
    "sizes_KiB": [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144, 1048576],
    "min_elapsed_ms_per_size": 100,
    "warmup_iters": 1
  },
  "results_summary": {
    "peak_GBs_measured": null,
    "L1_plateau_GBs_measured": null,
    "DRAM_floor_GBs_measured": null
  }
}

Fill results_summary after you've plotted.

Constraints¶

No mlflow, no dvc, no wandb. Phase 0 deferred these per §A8. A directory + a JSON file is the manifest.
No threading yet. Single-thread benchmarks only. Multi-threaded bandwidth is a Phase 35 topic.
CPU governor: performance. Set it before running: sudo cpupower frequency-set -g performance. Revert after. Record in the manifest.
Run on AC power, not battery. Battery throttles aggressively.
Close other apps. Background memory traffic pollutes the measurement.

Stop conditions¶

Done when:

The directory has all six files.
bandwidth.png shows three visible plateaus (L1, L2, L3) and a DRAM tail.
Measured DRAM bandwidth is within 30% of β_peak from lab 00. (Outside that range → likely Turbo Boost off or governor not set; re-check before peeking at the solution.)
The README.md answers all three Block C questions.

Pitfalls (read before debugging)¶

Throughput at 1 KiB looks like 50 GB/s, not 1 TB/s. Yes. Timer overhead dominates. That's a feature, not a bug — note it in your README.
Throughput rises again at 1 GiB. Probably you're hitting OS file cache or other artifacts. Re-check with a fresh allocation each iteration (or np.copyto(dst, src) where both are pre-allocated).
The L2 plateau is invisible. Try more closely-spaced sizes around L1_size × 1.5 to L2_size × 1.5. The transition is gradual.
Bandwidth is suspiciously low everywhere. Check cpupower frequency-info — powersave halves your numbers.

When to consult `solutions/`¶

After you have committed your six files and answered the Block C questions. The solution at solutions/01-memcpy-bandwidth-ref.md (written at phase open) compares your numbers and your code structure.

Next lab: lab/02-cache-walks.md.