Lab 01: Measure RAM bandwidth empirically
Goal: prove the DRAM bandwidth ceiling exists, by hitting it.
Estimated time: 60–90 minutes.
Prereq: lab 00 (machine profile) must be committed first.
What you produce
A directory experiments/01-memcpy-bandwidth/ containing:
- bench.py: your measurement script (Borja writes; NO peeking at NumPy's source).
- results.json: measured throughput at multiple buffer sizes.
- bandwidth.png: plot of buffer size vs throughput, log-x axis.
- manifest.json: {seed, versions, config, hardware} per LYNX_CORTEX.md §5.
- README.md (2–3 paragraphs) explaining what you measured, how, and what the curve shape tells you.
The kernel
The "kernel" of this lab is the simplest memory-bound operation possible: copy a buffer of N fp32 values to another buffer of N fp32 values, and time it.
Each iteration reads 4N bytes and writes 4N bytes — 8N bytes moved total. Throughput in GB/s = 8N / time_seconds / 10⁹.
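As a quick sanity check of the formula, here is a minimal sketch. The function name throughput_gbs and the 0.2 s timing are illustrative assumptions, not measurements.

```python
def throughput_gbs(n_elements: int, elapsed_s: float, bytes_per_element: int = 4) -> float:
    """Copy throughput in GB/s (1 GB = 10**9 bytes) for one copy of n_elements values."""
    # One copy reads and writes every element: 2 * 4 * N = 8N bytes for fp32.
    bytes_moved = 2 * bytes_per_element * n_elements
    return bytes_moved / elapsed_s / 1e9


# Hypothetical example: 2**28 fp32 values (1 GiB per buffer) copied in 0.2 s.
print(f"{throughput_gbs(2**28, 0.2):.2f} GB/s")  # prints "10.74 GB/s"
```

Note the mixed units: buffer sizes are quoted in KiB/GiB (powers of two), while throughput uses GB/s with 10⁹ bytes per GB.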
You want to measure this for buffer sizes spanning L1 → L2 → L3 → DRAM, i.e. from ~1 KiB to ~1 GiB. The plot should show three plateaus corresponding to the three caches, then settle at the DRAM bandwidth.
TODOs
Block A: write the kernel
- Use numpy arrays (np.empty(N, dtype=np.float32)). NOT Python lists.
- Use np.copyto(dst, src) for the actual copy. (Don't write your own loop: Python loops over fp32 values measure interpreter overhead, not memory.)
- Time with time.perf_counter_ns(). Repeat the copy enough times that the whole timed run takes ≥ 100 ms (to swamp timer noise). The repetition count depends strongly on N: a handful at 1 GiB, many thousands at 1 KiB.
- One warm-up iteration before timing, to populate caches and take the page faults outside the timed region.
- Buffer sizes: at least 12 points, log-spaced from 1 KiB to 1 GiB. Suggested starting grid: [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144, 1048576] KiB (i.e. 4^(0..10) KiB, which is 11 points); add at least one intermediate size to reach 12.
- At each size, record: bytes_per_iter, iters, elapsed_s, throughput_GBs. Save as results.json.
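The steps above could be sketched roughly as follows. This is one possible shape for the timing loop, not the reference solution. The helper name measure_copy and the doubling strategy for picking the repetition count are assumptions of this sketch. It also fills src with ones rather than leaving it as np.empty, on the assumption that untouched pages may not be backed by distinct physical memory on some systems.

```python
import time

import numpy as np


def measure_copy(n_elements: int, min_elapsed_s: float = 0.1) -> dict:
    """Time np.copyto between two fp32 buffers of n_elements values each."""
    src = np.ones(n_elements, dtype=np.float32)   # written once, so its pages are real
    dst = np.empty(n_elements, dtype=np.float32)
    np.copyto(dst, src)  # warm-up: takes page faults and primes caches

    iters = 1
    while True:
        start = time.perf_counter_ns()
        for _ in range(iters):
            np.copyto(dst, src)
        elapsed_s = (time.perf_counter_ns() - start) / 1e9
        if elapsed_s >= min_elapsed_s:
            break
        iters *= 2  # run was too short: double the repetitions and re-time

    bytes_per_iter = 2 * src.nbytes  # each copy reads 4N bytes and writes 4N bytes
    return {
        "bytes_per_iter": bytes_per_iter,
        "iters": iters,
        "elapsed_s": elapsed_s,
        "throughput_GBs": bytes_per_iter * iters / elapsed_s / 1e9,
    }


# The suggested grid from the list above, in KiB per buffer.
sizes_kib = [4**k for k in range(11)]
```

A buffer of s KiB holds s * 1024 // 4 fp32 values, so a sweep would call measure_copy(s * 1024 // 4) for each s in sizes_kib and dump the resulting list with json.dump. Remember that the 1 GiB point allocates two 1 GiB buffers.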
Block B: plot
- matplotlib. x-axis: buffer size in KiB, log scale. y-axis: throughput in GB/s, linear.
- Annotate the L1/L2/L3 boundaries on the plot, using sizes from your profile.md. (Vertical dashed lines + labels.)
- Annotate the expected DRAM ceiling from profile.md as a horizontal dashed line.
- Save as bandwidth.png and reference from README.md.
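A plot along these lines might look like the sketch below. The function name plot_bandwidth and its parameters are hypothetical. The cache sizes and the DRAM ceiling are passed in by the caller, since they should come from your own profile.md.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, so the script also runs without a display
import matplotlib.pyplot as plt


def plot_bandwidth(sizes_kib, throughput_gbs, cache_kib, dram_peak_gbs,
                   out_path="bandwidth.png"):
    """Plot throughput vs buffer size with cache and DRAM annotations.

    cache_kib maps a label (e.g. "L1") to that cache's size in KiB.
    """
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(sizes_kib, throughput_gbs, marker="o")
    ax.set_xscale("log")
    ax.set_xlabel("Buffer size (KiB)")
    ax.set_ylabel("Throughput (GB/s)")

    for label, size in cache_kib.items():
        ax.axvline(size, linestyle="--", color="gray")
        # x in data coordinates, y as a fraction of the axes height
        ax.text(size, 0.97, label, transform=ax.get_xaxis_transform(),
                ha="right", va="top")

    ax.axhline(dram_peak_gbs, linestyle="--", color="red", label="expected DRAM ceiling")
    ax.legend()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return out_path
```

Feed it the sizes and throughput_GBs values read back from results.json, so the plot can be regenerated without re-running the benchmark.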
Block C: interpret
Three questions to answer in README.md:
- At what buffer size does throughput drop sharply? Compare to L1, L2, L3 sizes. They should line up.
- Where does the curve plateau on the right? That's your measured DRAM bandwidth. How close is it to the theoretical β_peak from lab 00?
- Why is throughput at small buffer sizes (in L1) less than the L1 bandwidth of ~1 TB/s? (Hint: think about what overhead dominates a 1 KiB measurement.)
Block D: manifest
manifest.json schema:
{
  "experiment": "01-memcpy-bandwidth",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "versions": {
    "python": "3.11.x",
    "numpy": "X.Y.Z",
    "matplotlib": "X.Y.Z",
    "linux_kernel": "..."
  },
  "hardware": {
    "cpu_model": "Intel Core i5-8250U",
    "cores_threads": "4/8",
    "ram_gib": 62,
    "cpu_governor_at_run": "performance"
  },
  "config": {
    "sizes_KiB": [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144, 1048576],
    "min_elapsed_ms_per_size": 100,
    "warmup_iters": 1
  },
  "results_summary": {
    "peak_GBs_measured": null,
    "L1_plateau_GBs_measured": null,
    "DRAM_floor_GBs_measured": null
  }
}
Fill results_summary after you've plotted.
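Rather than typing versions by hand, the manifest can be assembled in code. The following is a minimal sketch using only the standard library. The helper names (package_version, current_governor, build_manifest, write_manifest) are hypothetical, the static hardware fields are assumed to be copied from your profile.md, and the governor is read from the usual Linux cpufreq sysfs file, falling back to "unknown" when that file is absent.

```python
import json
import platform
from datetime import date
from importlib import metadata
from pathlib import Path


def package_version(name: str) -> str:
    """Installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def current_governor() -> str:
    """Read cpu0's scaling governor from sysfs (Linux only)."""
    path = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
    try:
        return path.read_text().strip()
    except OSError:
        return "unknown"


def build_manifest(sizes_kib, hardware, min_elapsed_ms=100, warmup_iters=1) -> dict:
    """Assemble the manifest; results_summary stays empty until after plotting."""
    return {
        "experiment": "01-memcpy-bandwidth",
        "date": date.today().isoformat(),
        "seed": 42,
        "versions": {
            "python": platform.python_version(),
            "numpy": package_version("numpy"),
            "matplotlib": package_version("matplotlib"),
            "linux_kernel": platform.release(),
        },
        # Static fields come from profile.md; the governor is sampled at run time.
        "hardware": {**hardware, "cpu_governor_at_run": current_governor()},
        "config": {
            "sizes_KiB": list(sizes_kib),
            "min_elapsed_ms_per_size": min_elapsed_ms,
            "warmup_iters": warmup_iters,
        },
        "results_summary": {
            "peak_GBs_measured": None,
            "L1_plateau_GBs_measured": None,
            "DRAM_floor_GBs_measured": None,
        },
    }


def write_manifest(manifest: dict, path: str = "manifest.json") -> None:
    Path(path).write_text(json.dumps(manifest, indent=2) + "\n")
```

Sampling the governor at run time means the manifest records what was actually set, not what you intended to set.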
Constraints
- No mlflow, no dvc, no wandb. Phase 0 deferred these per §A8. A directory + a JSON file is the manifest.
- No threading yet. Single-thread benchmarks only. Multi-threaded bandwidth is a Phase 35 topic.
- CPU governor: performance. Set it before running: sudo cpupower frequency-set -g performance. Revert after. Record in the manifest.
- Run on AC power, not battery. Battery throttles aggressively.
- Close other apps. Background memory traffic pollutes the measurement.
Stop conditions
Done when:
- The directory has every file listed under "What you produce".
- bandwidth.png shows three visible plateaus (L1, L2, L3) and a DRAM tail.
- Measured DRAM bandwidth is within 30% of β_peak from lab 00. (Outside that range → likely Turbo Boost off or governor not set; re-check before peeking at the solution.)
- The README.md answers all three Block C questions.
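The 30% criterion can be checked mechanically. A small sketch, where the function name within_tolerance and the example figures are hypothetical and your own β_peak comes from lab 00:

```python
def within_tolerance(measured_gbs: float, beta_peak_gbs: float, tol: float = 0.30) -> bool:
    """True if the measured bandwidth is within tol (as a fraction) of beta_peak."""
    return abs(measured_gbs - beta_peak_gbs) <= tol * beta_peak_gbs


# Hypothetical numbers: a 38.4 GB/s theoretical peak against two measurements.
print(within_tolerance(30.0, 38.4))  # True: off by 8.4 GB/s, allowed 11.52
print(within_tolerance(20.0, 38.4))  # False: off by 18.4 GB/s
```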
Pitfalls (read before debugging)
- Throughput at 1 KiB looks like 50 GB/s, not 1 TB/s. Yes. Fixed per-call overhead (the timer and the NumPy call itself) dominates. That's a feature, not a bug: note it in your README.
- Throughput rises again at 1 GiB. Probably you're hitting OS file cache or other artifacts. Re-check with a fresh allocation each iteration (or np.copyto(dst, src) where both are pre-allocated).
- The L2 plateau is invisible. Try more closely-spaced sizes around L1_size × 1.5 to L2_size × 1.5. The transition is gradual.
- Bandwidth is suspiciously low everywhere. Check cpupower frequency-info: the powersave governor halves your numbers.
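The first pitfall above can be made concrete with a toy model: time per copy = fixed overhead + bytes / peak bandwidth. The sketch below is illustrative only. The function name effective_gbs is hypothetical, and the 40 ns overhead is an assumed value picked to show the effect, not a measurement from any machine.

```python
def effective_gbs(buffer_bytes: int, peak_gbs: float, overhead_ns: float) -> float:
    """Apparent throughput when each copy pays a fixed overhead on top of the transfer."""
    bytes_moved = 2 * buffer_bytes            # read + write
    transfer_ns = bytes_moved / peak_gbs      # 1 GB/s equals 1 byte per nanosecond
    return bytes_moved / (transfer_ns + overhead_ns)


# Assumed: 1 TB/s peak and 40 ns of fixed overhead per copy.
print(f"{effective_gbs(1024, 1000.0, 40.0):.1f} GB/s")   # 1 KiB buffer: about 48.7 GB/s
print(f"{effective_gbs(2**30, 1000.0, 40.0):.1f} GB/s")  # 1 GiB buffer: overhead vanishes
```

Under these assumptions the 1 KiB point sits far below the peak even though the data never leaves L1, which is the shape the left edge of your curve should show.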
When to consult solutions/
After you have committed your files and answered the Block C questions. The reference at solutions/01-memcpy-bandwidth-ref.md (written at phase open) is what you compare your numbers and your code structure against.
Next lab: lab/02-cache-walks.md.