
Lab 03: Plot the roofline for your machine

Goal: produce the canonical Phase 1 artefact — a roofline plot for your CPU, with your measured kernels placed on it. This is the file you'll reference from every later phase that discusses performance.

Estimated time: 90–120 minutes.

Prereq: labs 00, 01, 02 all committed.

What you produce

A directory experiments/01-roofline/ containing:

roofline.py — script that draws the plot.
roofline.png — the plot itself. This is the artefact PHASE_01_REPORT.md cites.
kernels.json — per-kernel intensity and measured performance.
manifest.json.
README.md interpreting the plot.

The plot to produce

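Log-log axes:
- x-axis: arithmetic intensity (FLOPs / byte), log10 from 0.01 to 100.
- y-axis: performance (GFLOPS), log10 from 0.1 to peak × 1.5. Lower the bottom limit if a measured dot (typically the pure-Python naive matmul) would otherwise fall below 0.1 GFLOPS.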

Lines to draw:
1. Memory ceiling. Diagonal: y = β_peak × x. Use measured β from lab 01.
2. Compute ceiling. Horizontal: y = π_peak. Use a measured peak if you have one (see Block C); otherwise the derived value from lab 00 with that noted.
3. Optional second memory ceiling for L3 bandwidth (from lab 01 plateau), shown as a fainter line above the DRAM line.

Dots to place:
- memcpy at intensity ~0.0. It does no FLOPs, so it sits off the left edge; by convention, drop a dot at the leftmost x-tick with y = your measured throughput as a sanity reference.
- dot product of length 10⁶ fp32 vectors.
- naive matmul, fp32, measured at 128×128 (see Block B; 1024×1024 is impractically slow in pure Python).
- NumPy matmul 1024×1024 fp32.

For each dot, compute intensity (theory file §3) and measure performance, then place.
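The intensity arithmetic for the two hand-counted kernels can be sketched as follows. This is a minimal illustration under the "no reuse" byte-counting assumption used in Block B; the function names are made up for this example and are not taken from the theory file.

```python
FP32_BYTES = 4  # size of one single-precision float


def dot_intensity(n: int) -> float:
    """FLOPs per byte for a length-n fp32 dot product with no reuse."""
    flops = 2 * n - 1                 # n multiplies, n - 1 adds
    bytes_moved = 2 * n * FP32_BYTES  # read a and b once each
    return flops / bytes_moved


def naive_matmul_intensity(n: int) -> float:
    """FLOPs per byte for an n x n fp32 matmul when nothing is reused."""
    flops = 2 * n**3                     # one multiply and one add per inner step
    bytes_moved = 2 * n**3 * FP32_BYTES  # two fresh operand loads per inner step
    return flops / bytes_moved


print(f"dot, N = 10**6:        {dot_intensity(10**6):.4f} FLOPs/byte")
print(f"naive matmul, N = 128: {naive_matmul_intensity(128):.4f} FLOPs/byte")
```

Both kernels land at about 0.25 FLOPs/byte, which is why they share an x position on the plot.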

TODOs

Block A: gather machine constants

β_peak: take the right-tail asymptote of experiments/01-memcpy-bandwidth/bandwidth.png (or the relevant entry in its results.json).
π_peak: from experiments/01-machine-profile/profile.md. Note this is derived, not measured. Block C below makes it measured if you want.

Block B: measure four kernels

memcpy — re-use lab 01's largest-buffer throughput. No new measurement needed; cite results.json directly.
dot(a, b) for length 10⁶ fp32. FLOPs = 2N - 1 ≈ 2 × 10⁶. Bytes moved (no reuse) = 8N = 8 × 10⁶. Intensity = 0.25 FLOPs/byte. Measure with np.dot(a, b), repeat ≥ 100 times, compute GFLOPS = 2N / mean_seconds / 10⁹.
naive matmul. Implement it with three nested Python loops (yes, Python; deliberately slow). Use N = 128; run time grows as N³, so much larger sizes quickly become impractical. FLOPs = 2N³. Intensity ≈ 0.25, the same as the dot product: with no reuse, each inner-loop step loads two fp32 operands (8 bytes) for 2 FLOPs. Measure time. Compute GFLOPS.
np.matmul. Call np.matmul(A, B) for N = 1024. FLOPs = 2N³. Measure time. Compute GFLOPS. For intensity, place this dot at the effective intensity it achieves: roughly the operational intensity of the blocked algorithm in the BLAS library NumPy links against (MKL, OpenBLAS, or similar). You can leave this as "best-fit on the roofline" without proving it.
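One way the three new measurements could be taken is sketched below. It is an illustration, not the reference solution: `mean_seconds`, `gflops`, and the `results` dict are names invented here, the repeat counts for the two matmuls are arbitrary choices, and the naive loop runs on nested lists, so its values are Python floats rather than true fp32.

```python
import time

import numpy as np


def mean_seconds(fn, repeats):
    """Mean wall-clock seconds per call of fn(), after one untimed warm-up."""
    fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def gflops(flops, seconds):
    return flops / seconds / 1e9


def naive_matmul(A, B):
    """Square matmul with three nested Python loops; deliberately slow."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C


rng = np.random.default_rng(0)
results = {}

# dot product: N = 10**6 fp32, FLOPs counted as 2N, 100 timed repeats
n_dot = 10**6
a = rng.random(n_dot, dtype=np.float32)
b = rng.random(n_dot, dtype=np.float32)
results["dot"] = gflops(2 * n_dot, mean_seconds(lambda: np.dot(a, b), 100))

# naive matmul: N = 128, FLOPs = 2N^3; one timed run is already slow
n_naive = 128
A_list = rng.random((n_naive, n_naive), dtype=np.float32).tolist()
B_list = rng.random((n_naive, n_naive), dtype=np.float32).tolist()
results["naive_matmul"] = gflops(
    2 * n_naive**3, mean_seconds(lambda: naive_matmul(A_list, B_list), 1)
)

# np.matmul: N = 1024 fp32, FLOPs = 2N^3
n_blas = 1024
A_big = rng.random((n_blas, n_blas), dtype=np.float32)
B_big = rng.random((n_blas, n_blas), dtype=np.float32)
results["np_matmul"] = gflops(
    2 * n_blas**3, mean_seconds(lambda: np.matmul(A_big, B_big), 10)
)

for name, perf in results.items():
    print(f"{name:>13}: {perf:.4f} GFLOPS")
```

The warm-up call keeps first-call effects such as page faults and lazy library initialisation out of the mean.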

Block C: (optional, recommended) measure peak FLOPS

Replace the derived π_peak with a measured one:

Write a tight loop that does fused multiply-adds on AVX2-friendly inputs.
The cleanest portable approximation in pure NumPy is c = a * a + a on a small array sized to fit L1, repeated K times.
Time it. Compute GFLOPS. This is your measured π.
If it's within 30% of derived, use derived (round number); if it's far off, use measured and note why.
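The pure-NumPy approximation described above might look like the sketch below. The array length of 2048 fp32 elements (8 KiB per array, three arrays in flight) is an assumption chosen to fit a 32 KiB L1 data cache; adjust it for your chip. The sketch writes into preallocated buffers with `out=` instead of evaluating `c = a * a + a` literally, which does the same arithmetic without allocating temporaries. Python call overhead is included in the timing, so treat the result as a lower bound on π.

```python
import time

import numpy as np


def measure_peak_gflops(n=2048, k=20_000):
    """Time c = a * a + a on a small fp32 array, repeated k times."""
    a = np.full(n, 1.0001, dtype=np.float32)
    tmp = np.empty_like(a)
    c = np.empty_like(a)
    start = time.perf_counter()
    for _ in range(k):
        np.multiply(a, a, out=tmp)  # 1 FLOP per element
        np.add(tmp, a, out=c)       # 1 FLOP per element
    seconds = time.perf_counter() - start
    return 2 * n * k / seconds / 1e9  # 2 FLOPs per element per repeat


print(f"measured pi ~ {measure_peak_gflops():.2f} GFLOPS")
```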

Note: if the measured FLOPS are much lower than the derived value (by 50% or more), the chip is probably thermally throttled; Intel U-series parts do this routinely. Document it in README.md.

Block D: draw the plot

matplotlib. Both axes log.
Memory ceiling: diagonal line y = β × x from (x_min, β × x_min) to (π/β, π) — i.e. until it hits the compute ceiling at the corner.
Compute ceiling: horizontal y = π from the corner rightward.
Dots: scatter at (I_kernel, perf_kernel) with labels.
Annotate: I_crit = π / β at the corner.
Save as roofline.png (high DPI; this goes in the docs site).
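The drawing steps above could be assembled roughly as follows. This is a sketch under stated assumptions: β is in GB/s and π in GFLOPS, `kernels` is a hypothetical dict mapping a label to an (intensity, GFLOPS) pair, the corner is assumed to fall inside the 0.01 to 100 x-range, and the function names are invented for this example. The y-axis floor is lowered automatically when a dot would fall below 0.1.

```python
def roofline_points(beta, pi, x_min=0.01, x_max=100.0):
    """Return (memory ceiling, compute ceiling, I_crit) as plottable points."""
    i_crit = pi / beta
    memory = ([x_min, i_crit], [beta * x_min, pi])  # diagonal up to the corner
    compute = ([i_crit, x_max], [pi, pi])           # flat from the corner
    return memory, compute, i_crit


def draw_roofline(beta, pi, kernels, path="roofline.png"):
    # Imported here so roofline_points stays usable without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt

    memory, compute, i_crit = roofline_points(beta, pi)
    fig, ax = plt.subplots(figsize=(7, 5))
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.plot(*memory, label="memory ceiling (y = beta * x)")
    ax.plot(*compute, label="compute ceiling (y = pi)")
    for name, (intensity, perf) in kernels.items():
        ax.scatter(intensity, perf, zorder=3)
        ax.annotate(name, (intensity, perf),
                    textcoords="offset points", xytext=(5, 5))
    ax.annotate(f"I_crit = {i_crit:.2f}", (i_crit, pi),
                textcoords="offset points", xytext=(5, -15))
    lowest = min([0.1] + [0.5 * perf for _, perf in kernels.values()])
    ax.set_xlim(0.01, 100.0)
    ax.set_ylim(lowest, pi * 1.5)
    ax.set_xlabel("Arithmetic intensity (FLOPs / byte)")
    ax.set_ylabel("Performance (GFLOPS)")
    ax.legend(loc="lower right")
    fig.savefig(path, dpi=200)
    plt.close(fig)


# Placeholder constants, not measurements: substitute your own beta and pi.
print(roofline_points(beta=20.0, pi=200.0))
```

Calling `draw_roofline(beta, pi, kernels)` with your own measured values writes roofline.png to the current directory.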

Block E: interpret

In README.md:

State the machine balance. "I_crit = X FLOPs/byte. Any kernel below this is memory-bound on this machine."
Place each measured kernel. "memcpy is at perf = β = ~Y GB/s. Naive matmul is at 0.25 FLOPs/byte and only ~Z GFLOPS — well below the memory ceiling, indicating Python overhead, not memory. np.matmul is near the corner at ~W GFLOPS."
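Make the killer observation. Naive matmul does the same math as np.matmul: the same 2N³ FLOPs for a given N. The gap in GFLOPS comes from how the work is executed (memory hierarchy and SIMD, plus interpreter overhead in the pure-Python version), not from the algorithm. Report the ratio you actually measured; with a pure-Python loop, expect it to be well above the 50–100× that memory hierarchy and SIMD alone would explain. State this; this is the headline takeaway of Phase 1.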
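The per-kernel statements for README.md follow mechanically from the roofline formula, attainable = min(β × I, π). A small helper along these lines can generate them; the names are invented for this sketch and the β and π values in the example calls are placeholders, not measurements.

```python
def attainable(intensity, beta, pi):
    """Roofline ceiling at a given intensity: min(beta * I, pi)."""
    return min(beta * intensity, pi)


def interpret(name, intensity, perf, beta, pi):
    """One-sentence diagnosis of where a kernel sits relative to the roofline."""
    i_crit = pi / beta
    bound = "memory-bound" if intensity < i_crit else "compute-bound"
    fraction = perf / attainable(intensity, beta, pi)
    return (f"{name}: {bound} (I = {intensity:.2f}, I_crit = {i_crit:.2f}), "
            f"at {fraction:.1%} of its ceiling")


# Placeholder machine constants: beta = 20 GB/s, pi = 200 GFLOPS.
print(interpret("dot", 0.25, 4.0, beta=20.0, pi=200.0))
print(interpret("naive matmul", 0.25, 0.01, beta=20.0, pi=200.0))
```

A kernel far below its ceiling, as the naive matmul is here, is limited by something the roofline does not model, such as interpreter overhead.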

Constraints

The roofline plot must be drawn from your own measurements (β from lab 01, π from Block A or Block C of this lab, kernel dots from your own runs). No vendor numbers, no online datasets.
Save kernel measurements in kernels.json with the same manifest.json discipline.
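A possible shape for kernels.json is sketched below. The field names and the top-level "kernels" key are assumptions made for illustration; the lab does not prescribe a schema, so align them with whatever your manifest.json conventions already use. The timing in the example record is a placeholder, not a measurement.

```python
import json


def kernel_record(name, n, flops, bytes_moved, mean_seconds, repeats):
    """Bundle one kernel's inputs and derived numbers into a JSON-ready dict."""
    return {
        "name": name,
        "n": n,
        "flops": flops,
        "bytes_moved": bytes_moved,
        "intensity_flops_per_byte": flops / bytes_moved,
        "mean_seconds": mean_seconds,
        "repeats": repeats,
        "gflops": flops / mean_seconds / 1e9,
    }


def write_kernels(records, path="kernels.json"):
    """Write all kernel records to a single JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"kernels": records}, f, indent=2)


# Placeholder timing (1 ms), shown only to illustrate the record layout.
example = kernel_record("dot", 10**6, 2 * 10**6, 8 * 10**6, 1e-3, 100)
print(json.dumps(example, indent=2))
```

Storing the raw FLOP and byte counts next to the derived intensity lets a reader re-check every dot on the plot from the file alone.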

Stop conditions

Done when:

roofline.png exists.
It has the two ceilings, the corner annotation, and all four kernel dots.
README.md makes the "naive vs np.matmul is memory, not math" argument explicitly.
You can show the plot to yourself, point to any dot, and explain in one sentence what bottleneck applies.
PHASE_01_REPORT.md (written separately at phase close) references roofline.png.

Pitfalls

Naive matmul in Python at N=1024 is hours. Use N=128 and scale FLOPs accordingly. The point isn't to time large naive matmul; it's to put the dot in the right zone of the plot.
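np.matmul's "intensity" is fuzzy. It's blocked at multiple cache levels, so the effective intensity depends on N. For N = 1024, a tuned BLAS such as MKL typically reaches roughly 70% of peak; at that level, place the dot at (I = 0.7 × π / β, perf = 0.7 π), which puts it on the memory diagonal just left of the corner. More generally, use your measured perf for y and I = perf / β for x. Note this approximation in README.md.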
The plot looks empty. Log axes with few points always look sparse. The shape (corner, slope, ceiling) is the deliverable, not dot density.
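The np.matmul placement in the pitfall above amounts to putting the dot on the memory diagonal at the measured performance, so I = perf / β; at 70% of peak that gives 0.7 × π / β. A minimal sketch, with an invented function name and placeholder constants, and with the corner used as a cap if the measurement reaches or exceeds π:

```python
def best_fit_intensity(perf, beta, pi):
    """x position that puts a measured dot on the roofline.

    On the memory diagonal, perf = beta * I, so I = perf / beta.
    The result is capped at the corner, I_crit = pi / beta.
    """
    return min(perf / beta, pi / beta)


# Placeholder constants: beta = 20 GB/s, pi = 200 GFLOPS, perf = 70% of pi.
print(best_fit_intensity(perf=140.0, beta=20.0, pi=200.0))
```

Because the x position is derived from the measurement rather than counted, say so in README.md, as the pitfall asks.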

When to consult `solutions/`

After committing all five files in this directory. Solution at solutions/03-roofline-ref.md (written at phase open).

Phase 1 lab work is complete. Next: /quiz 01, then PHASE_01_REPORT.md, then reflection, then proceed.