English · Español
Lab 01 — Matmul performance: naive vs blocked vs einsum vs BLAS¶
Goal: measure the 50× gap between Python triple-loop matmul and
np.matmul, and explain it as a roofline fact (Phase 1) compounded by vectorisation + interpreter avoidance.Estimated time: 90–120 minutes.
Prereq:
theory/02-matmul-and-shapes.md,lab/00-shapes-by-hand.mdcomplete, Phase 1 roofline result accessible fromexperiments/01-*/.
What you produce¶
A directory experiments/03-matmul-perf/ containing:
bench.py— your benchmark script.results.json— measurements at multipleN.perf.png— log-log plot of GFLOPS vs N, four curves + roofline overlay.manifest.json.README.md— interpretation, 2-3 paragraphs.
No src/ module is introduced this phase. The four matmul variants live as plain functions in bench.py (or a sibling helper file inside the experiment directory). The implementation gets promoted to src/minigrad/linalg.py in Phase 7 when the autograd consumer arrives — until then, scratch-script status is correct.
Part A — Implement the four matmul variants¶
Inside experiments/03-matmul-perf/:
-
matmul_naive(A, B)— triple-loop pure Python (with NumPy arrays as containers, no Python lists; but the inner scalar multiply-add is interpreted). FP32 throughout. -
matmul_blocked(A, B, block_size=64)— six-loop blocked version. Document the block size as a parameter. -
matmul_einsum(A, B)— singlenp.einsum('ik,kj->ij', A, B)call. -
np.matmul(A, B)— the reference. - Verify the four agree with
A @ Btortol=1e-5, atol=1e-6on random(M=K=N=64)inputs. (Matmul accumulator differences are real but bounded.)
Constraints¶
- Pure NumPy. No
scipy.linalg, notorch, nonumba, nocython. - No early vectorisation in
matmul_naive. If you writeC[i, k] += A[i, k] * B[k, j]inside threeforloops, you've done it right. If you writeC[i] += A[i, k] * B[k](broadcasting acrossj), you've accidentally vectorised the inner loop — the measured GFLOPS will be too high and the lab fails its purpose. - Determinism.
np.random.default_rng(42).standard_normal(...).astype(np.float32)for the inputs. bench.pyalso writesmanifest.jsonwith seed + versions + CPU governor + thread settings.
Part B — The benchmark¶
bench.py measures wall time of each method at several N:
- Sizes:
N ∈ {64, 128, 256, 512, 1024}. Square matrices. fp32. - Methods:
naive,blocked,einsum,np.matmul. Optionally alsoblocked_32,blocked_128for tuning curiosity. - Iterations: enough to swamp timer noise — at minimum 5 reps, with the total elapsed time per (method, N) at least 1 second. For
naiveat N=1024 you may need to run fewer iterations (or skip the largest size and extrapolate from N=512 byT_naive(N) ≈ T_naive(N/2) × 8since cost isO(N³)). - Warm-up: one untimed iteration per (method, N) before measurement.
- GFLOPS:
(2 · N³) / elapsed_seconds / 1e9. Save asresults.json.
Sanity bound¶
You should see roughly:
| Method | N=128 | N=1024 |
|---|---|---|
naive |
~3 MFLOPS | ~3 MFLOPS (Python interpreter dominates regardless of N) |
blocked |
~3-5 MFLOPS | similar (still Python interpreter) |
einsum |
~10-30 GFLOPS | ~20-60 GFLOPS |
np.matmul |
~20-50 GFLOPS | ~50-100 GFLOPS |
If naive runs at >100 MFLOPS, you accidentally vectorised — re-check there are no NumPy broadcast ops in the inner loop.
If np.matmul runs at <10 GFLOPS, OpenBLAS isn't linked or you're throttled. Check np.show_config() and the CPU governor.
Part C — Plot¶
perf.png:
- x-axis:
N(log scale, integer ticks at 64, 128, 256, 512, 1024). - y-axis: GFLOPS (log scale).
- Four curves, one per method.
- Overlay: horizontal line at the i5-8250U's measured peak fp32 GFLOPS from Phase 1's roofline experiment. Compute the crossover with the memory ceiling. Annotate.
- Title: "Matmul performance on i5-8250U, fp32, single-threaded."
Part D — README interpretation¶
In README.md, answer:
- What's the speedup of
np.matmulovermatmul_naiveat N=1024? State as a multiplier. - Decompose the speedup into "BLAS magic" vs "no Python interpreter". Hint:
einsumtypically captures most of the BLAS speedup (it dispatches to the same routines), butnp.matmulmay still pull ahead at smallNbecause of optimiser-pass differences. So roughly: interpreter-elimination ≈ 100×, SIMD ≈ 8×, cache blocking ≈ 5-10×, multi-thread (off here) would be another 4×. - Where does
np.matmulsit on the Phase 1 roofline? It should land near or above the memory ceiling at N=1024 (because BLAS reuses cache aggressively, raising the effective arithmetic intensity well above the naive0.25 FLOP/bytefloor). State the arithmetic intensity it achieves. - Does
matmul_blockedbeatmatmul_naivein Python? If not, why not? (Hint: Python interpreter overhead dwarfs the cache-miss cost the blocking is meant to amortise. Blocking is a C/asm optimisation; in pure Python it's noise.)
§A13 anchor — matmul as embedding lookup¶
Even though this lab uses random (N, N) matrices, remember the §A13 anchor from theory/02-matmul-and-shapes.md: the embedding lookup E @ one_hot(i) is a matrix-vector multiply with E.shape = (V, D) = (600, 64). At our §A13 scale that's 2 × 600 × 64 = 76,800 FLOPs per token — trivially dominated by the actual gather (E[i], zero FLOPs). At MiniGPT scale (V = 600, D = 64, B = 32, T = 16) per forward pass the vocabulary projection x @ E^T is 2 × 32 × 16 × 64 × 600 ≈ 40M FLOPs, which lands comfortably in BLAS territory. Knowing where your operation sits on this curve tells you whether to worry about Python overhead at all.
Constraints¶
- Single-threaded. Set
OMP_NUM_THREADS=1,OPENBLAS_NUM_THREADS=1,MKL_NUM_THREADS=1before launchingbench.py. Record the env vars inmanifest.json. Multi-thread benchmarks are Phase 35. - CPU governor
performance+ AC power. Same protocol as Phase 1 lab 01. - No
cython/numba/torch. Pure NumPy / pure Python. - Same seed for all runs (so memory access patterns don't differ between methods).
Pitfalls¶
np.matmulis non-deterministic across runs by tiny amounts (BLAS may reorder for cache reasons). Don't expect bit-equal results between runs; expectrtol=1e-5between methods at the same run.- N=2048 might OOM if you allocate too many buffers. Reuse the same
A, B, Cacross iterations. - Naive at N=1024 takes ~5-10 minutes per iteration in pure Python. Plan accordingly. Extrapolating from N=512 is acceptable if you document it in the README.
np.einsum('ik,kj->ij', A, B)is not identical toA @ Bin code path — both go to BLAS, but through different optimiser paths. Both are valid measurements; record both.- First call to a BLAS routine in a Python process pays a one-time setup cost. The warm-up iteration handles it; don't measure cold-start.
Stop conditions¶
- All four functions implemented in
bench.py; the agreement check passes atrtol=1e-5, atol=1e-6onN=64. - Plot exists with all four curves + roofline overlay.
- README answers all four interpretation questions.
- Speedup of
np.matmulovermatmul_naiveat the largest measuredNis ≥ 50×. (Realistically it will be 10⁴-10⁵×; 50× is the floor.) If not, debug.
When to consult solutions/¶
After committing the plot + README. solutions/01-matmul-perf-ref.md (at phase open) walks through expected numbers on the i5-8250U and discusses BLAS internals at a curriculum level.
Next lab: lab/02-svd-compression.md.