Lab 03: Plot the roofline for your machine
Goal: produce the canonical Phase 1 artefact — a roofline plot for your CPU, with your measured kernels placed on it. This is the file you'll reference from every later phase that discusses performance.
Estimated time: 90–120 minutes.
Prereq: labs 00, 01, 02 all committed.
What you produce
A directory experiments/01-roofline/ containing:
- roofline.py: script that draws the plot.
- roofline.png: the plot itself. This is the artefact PHASE_01_REPORT.md cites.
- kernels.json: per-kernel intensity and measured performance.
- manifest.json.
- README.md interpreting the plot.
The plot to produce
Log-log axes:
- x-axis: arithmetic intensity (FLOPs / byte), log10 from 0.01 to 100.
- y-axis: performance (GFLOPS), log10 from 0.1 to peak × 1.5.
Lines to draw:
1. Memory ceiling. Diagonal: y = β_peak × x. Use measured β from lab 01.
2. Compute ceiling. Horizontal: y = π_peak. Use a measured peak if you have one (see Block C); otherwise the derived value from lab 00 with that noted.
3. Optional second memory ceiling for L3 bandwidth (from lab 01 plateau), shown as a fainter line above the DRAM line.
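The two ceilings above combine into one attainable-performance bound, min(π_peak, β_peak × I). A minimal sketch, assuming β is expressed in GB/s and π in GFLOPS; the function name and the sample constants are illustrative placeholders, not measurements:

```python
import numpy as np

def attainable_gflops(intensity, beta_gbs, pi_gflops):
    """Roofline bound: the lower of the memory and compute ceilings.

    intensity  -- arithmetic intensity in FLOPs/byte (scalar or sequence)
    beta_gbs   -- memory bandwidth in GB/s
    pi_gflops  -- peak compute in GFLOPS
    """
    x = np.asarray(intensity, dtype=float)
    return np.minimum(pi_gflops, beta_gbs * x)

# Placeholder constants for illustration only; substitute your own.
beta, pi = 20.0, 100.0
print(attainable_gflops([0.25, 5.0, 50.0], beta, pi))
```

With these placeholders the corner sits at I = π / β = 5 FLOPs/byte: below it the diagonal applies, above it the horizontal line does.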
Dots to place:
- memcpy at intensity 0 (it does no FLOPs, so it falls off the left edge of a log axis; by convention, drop a dot at the leftmost x-tick with y = your measured throughput as a sanity reference).
- dot product of two fp32 vectors of length 10⁶.
- naive matmul, fp32, measured at N = 128 (Block B explains the smaller size).
- NumPy matmul 1024×1024 fp32.
For each dot, compute intensity (theory file §3) and measure performance, then place.
TODOs
Block A: gather machine constants
- β_peak: take the right-tail asymptote of experiments/01-memcpy-bandwidth/bandwidth.png (or the relevant entry in its results.json).
- π_peak: from experiments/01-machine-profile/profile.md. Note this is derived, not measured. Block C below makes it measured if you want.
Block B: measure four kernels
- memcpy: re-use lab 01's largest-buffer throughput. No new measurement needed; cite results.json directly.
- dot(a, b) for length 10⁶ fp32. FLOPs = 2N - 1 ≈ 2 × 10⁶. Bytes moved (no reuse) = 8N = 8 × 10⁶. Intensity = 0.25 FLOPs/byte. Measure with np.dot(a, b), repeat ≥ 100 times, compute GFLOPS = 2N / mean_seconds / 10⁹.
- naive matmul: implement with three nested Python loops (yes, Python; deliberately slow). Use N = 128; run time grows as N³, and N = 1024 takes hours. FLOPs = 2N³. Intensity ≈ 0.25 (same as the dot product: with no reuse, each multiply-add loads two fp32 operands, i.e. 2 FLOPs per 8 bytes). Measure time. Compute GFLOPS.
- np.matmul: np.matmul(A, B) for N = 1024. FLOPs = 2N³. Measure time. Compute GFLOPS. For intensity, place this dot at the effective intensity it achieves. This is roughly the operational intensity of the blocked algorithm in NumPy's BLAS backend (MKL or OpenBLAS, depending on the build), which you can leave as "best-fit on the roofline" without proving it.
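The three new measurements in Block B can be sketched as follows. This is one possible layout, not a reference solution: the helper names (mean_seconds, measure_dot, measure_matmul) and the result-dictionary keys are hypothetical, and the naive kernel indexes NumPy arrays element by element so the operands stay fp32.

```python
import time
import numpy as np

def mean_seconds(fn, repeats):
    """Mean wall-clock seconds per call of fn(), after one untimed warm-up."""
    fn()  # warm-up so first-touch page faults and library start-up are not timed
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

def measure_dot(n=10**6, repeats=100):
    rng = np.random.default_rng(0)
    a = rng.random(n, dtype=np.float32)
    b = rng.random(n, dtype=np.float32)
    t = mean_seconds(lambda: np.dot(a, b), repeats)
    # 2N FLOPs over 8N bytes (two fp32 vectors, no reuse) -> 0.25 FLOPs/byte
    return {"name": "dot", "intensity": 2 * n / (8 * n), "gflops": 2 * n / t / 1e9}

def naive_matmul(A, B):
    """Three nested Python loops; deliberately slow."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

def measure_matmul(kernel, name, n, repeats, intensity):
    rng = np.random.default_rng(0)
    A = rng.random((n, n), dtype=np.float32)
    B = rng.random((n, n), dtype=np.float32)
    t = mean_seconds(lambda: kernel(A, B), repeats)
    return {"name": name, "intensity": intensity, "gflops": 2 * n**3 / t / 1e9}

if __name__ == "__main__":
    results = [
        measure_dot(),
        measure_matmul(naive_matmul, "naive matmul", n=128, repeats=1, intensity=0.25),
        # intensity=None: the np.matmul dot is placed by best fit, not computed here
        measure_matmul(np.matmul, "np.matmul", n=1024, repeats=10, intensity=None),
    ]
    for r in results:
        print(r)
```

The warm-up call means the naive kernel runs twice at N = 128, which should still finish in seconds on most machines.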
Block C: (optional, recommended) measure peak FLOPS
Replace the derived π_peak with a measured one:
- Write a tight loop that does fused multiply-adds on AVX2-friendly inputs.
- The cleanest portable approximation in pure NumPy is c = a * a + a on a small array sized to fit L1, repeated K times. NumPy runs the multiply and the add as two separate passes rather than one fused instruction, so treat the result as a lower bound.
- Time it. Compute GFLOPS. This is your measured π.
- If it's within 30% of derived, use derived (round number); if it's far off, use measured and note why.
Note: if the measured FLOPS are far below the derived value (lower by 50% or more), the chip is probably thermally throttled; Intel U-series parts do this routinely. Document it in README.md.
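A minimal sketch of the pure-NumPy approximation from Block C. The array size (2048 fp32 values, 8 KiB) assumes a typical 32 KiB L1 data cache, and the function names are hypothetical. Per-call interpreter overhead is significant on arrays this small, so expect the number to understate the hardware peak.

```python
import time
import numpy as np

def measure_peak_gflops(n=2048, k=100_000):
    """Estimate GFLOPS from k repetitions of c = a * a + a on an L1-sized array."""
    a = np.random.default_rng(0).random(n, dtype=np.float32)
    c = np.empty_like(a)
    start = time.perf_counter()
    for _ in range(k):
        np.multiply(a, a, out=c)  # 1 FLOP per element
        np.add(c, a, out=c)       # 1 FLOP per element
    elapsed = time.perf_counter() - start
    return 2 * n * k / elapsed / 1e9

def choose_peak(measured, derived, tolerance=0.30):
    """Apply the 30% rule: keep the derived value when the two roughly agree.

    Assumption: "within 30%" is read as relative to the derived value.
    """
    if abs(measured - derived) <= tolerance * derived:
        return derived, "derived"
    return measured, "measured"

if __name__ == "__main__":
    print(f"measured peak: {measure_peak_gflops():.1f} GFLOPS")
```

Writing into a preallocated output with out= keeps memory allocation out of the timed loop.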
Block D: draw the plot
- matplotlib. Both axes log.
- Memory ceiling: diagonal line y = β × x from (x_min, β × x_min) to (π/β, π), i.e. until it hits the compute ceiling at the corner.
- Compute ceiling: horizontal y = π from the corner rightward.
- Dots: scatter at (I_kernel, perf_kernel) with labels.
- Annotate: I_crit = π / β at the corner.
- Save as roofline.png (high DPI; this goes in the docs site).
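The Block D steps could be put together roughly like this. It is a sketch under stated assumptions: draw_roofline and its parameters are hypothetical names, kernels are passed as (label, intensity, gflops) tuples, and the constants in the usage section are placeholders. It makes two choices of its own: the lower y-limit drops below 0.1 if a dot would otherwise fall off the plot, and a kernel with intensity 0 (memcpy) is clamped to the left edge.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt

def draw_roofline(beta, pi, kernels, path="roofline.png", x_min=0.01, x_max=100.0):
    """beta in GB/s, pi in GFLOPS, kernels as (label, intensity, gflops) tuples."""
    i_crit = pi / beta
    fig, ax = plt.subplots(figsize=(7, 5))
    ax.set_xscale("log")
    ax.set_yscale("log")
    # Memory ceiling: diagonal from the left edge up to the corner.
    ax.plot([x_min, i_crit], [beta * x_min, pi], label="memory ceiling")
    # Compute ceiling: horizontal from the corner rightward.
    ax.plot([i_crit, x_max], [pi, pi], label="compute ceiling")
    for label, intensity, gflops in kernels:
        x = max(intensity, x_min)  # clamp intensity 0 to the leftmost tick
        ax.scatter([x], [gflops], zorder=3)
        ax.annotate(label, (x, gflops), textcoords="offset points", xytext=(5, 5))
    ax.annotate(f"I_crit = {i_crit:.2f}", (i_crit, pi),
                textcoords="offset points", xytext=(5, -15))
    # Extend the lower y-limit if any dot would fall below 0.1 GFLOPS.
    y_min = min([0.1] + [0.5 * g for _, _, g in kernels if g > 0])
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, pi * 1.5)
    ax.set_xlabel("Arithmetic intensity (FLOPs / byte)")
    ax.set_ylabel("Performance (GFLOPS)")
    ax.legend(loc="lower right")
    fig.savefig(path, dpi=200)
    plt.close(fig)
    return i_crit

if __name__ == "__main__":
    # Placeholder numbers for illustration only; substitute your own.
    draw_roofline(20.0, 100.0, [("memcpy", 0.0, 20.0), ("dot", 0.25, 4.0)])
```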
Block E: interpret
In README.md:
- State the machine balance. "I_crit = X FLOPs/byte. Any kernel below this is memory-bound on this machine."
- Place each measured kernel. "memcpy is at perf = β = ~Y GB/s. Naive matmul is at 0.25 FLOPs/byte and only ~Z GFLOPS, well below the memory ceiling, indicating Python overhead, not memory. np.matmul is near the corner at ~W GFLOPS."
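- Make the killer observation. Naive matmul does the same math as np.matmul. The gap between them (50–100× or more) comes from how the work is executed (memory hierarchy and SIMD, plus interpreter overhead in the pure-Python version), not from the algorithm's arithmetic. State this; this is the headline takeaway of Phase 1.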
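To back the README statements with numbers, each kernel can be compared against the roofline bound at its own intensity. A small sketch, assuming β in GB/s and π in GFLOPS; classify is a hypothetical helper and the values are placeholders:

```python
def classify(intensity, gflops, beta, pi):
    """Return (regime, fraction of the roofline bound reached) for one kernel."""
    i_crit = pi / beta
    roof = min(pi, beta * intensity)
    regime = "memory-bound" if intensity < i_crit else "compute-bound"
    fraction = gflops / roof if roof > 0 else float("nan")  # memcpy has intensity 0
    return regime, fraction

# Placeholder values for illustration only; substitute your own measurements.
beta, pi = 20.0, 100.0
kernels = [("dot", 0.25, 4.0), ("naive matmul", 0.25, 0.005)]
print(f"I_crit = {pi / beta:.2f} FLOPs/byte")
for name, intensity, gflops in kernels:
    regime, fraction = classify(intensity, gflops, beta, pi)
    print(f"{name}: {regime} zone, {fraction:.1%} of the roofline bound")
```

A kernel that sits in the memory-bound zone but reaches only a tiny fraction of the bound is limited by something other than bandwidth, which is the situation the naive matmul sentence describes.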
Constraints
- The roofline plot must be drawn from your own measurements (β from lab 01, π from Block A or Block C of this lab, kernel dots from your own runs). No vendor numbers, no online datasets.
- Save kernel measurements in kernels.json with the same manifest.json discipline.
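One possible shape for kernels.json is sketched below. The schema (a top-level "kernels" list with name, intensity and gflops fields) is an assumption made for illustration; the lab does not prescribe one, and the manifest.json contents are left to the conventions from earlier labs.

```python
import json

def save_kernels(kernels, path="kernels.json"):
    """Write per-kernel results to JSON; `kernels` is a list of dicts."""
    payload = {
        "units": {"intensity": "FLOPs/byte", "gflops": "GFLOPS"},
        "kernels": kernels,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)

def load_kernels(path="kernels.json"):
    """Read the list of kernel dicts back, e.g. from roofline.py."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["kernels"]
```

Keeping the measurements in a file, rather than hard-coded in roofline.py, lets the plot be regenerated without re-running the slow kernels.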
Stop conditions
Done when:
- roofline.png exists.
- It has the two ceilings, the corner annotation, and all four kernel dots.
- README.md makes the "naive vs np.matmul is memory, not math" argument explicitly.
- You can show the plot to yourself, point to any dot, and explain in one sentence what bottleneck applies.
- PHASE_01_REPORT.md (written separately at phase close) references roofline.png.
Pitfalls
- Naive matmul in Python at N=1024 is hours. Use N=128 and scale FLOPs accordingly. The point isn't to time large naive matmul; it's to put the dot in the right zone of the plot.
- np.matmul's "intensity" is fuzzy. It's blocked at multiple cache levels; the effective intensity depends on N. For N = 1024, an optimized BLAS such as MKL achieves ~70% of peak; place the dot at (I = 0.7 × π / β, perf = 0.7 π) as a fair representation. Note this approximation in README.md.
- The plot looks empty. Log axes with few points always look sparse. The shape (corner, slope, ceiling) is the deliverable, not dot density.
When to consult solutions/
After committing all five files in this directory. Solution at solutions/03-roofline-ref.md (written at phase open).
Phase 1 lab work is complete. Next: /quiz 01, then PHASE_01_REPORT.md, then reflection, then proceed.