English · Español

Lab 01 — Matmul performance: naive vs blocked vs einsum vs BLAS¶

Goal: measure the 50× gap between Python triple-loop matmul and np.matmul, and explain it as a roofline fact (Phase 1) compounded by vectorisation + interpreter avoidance.

Estimated time: 90–120 minutes.

Prereq: theory/02-matmul-and-shapes.md, lab/00-shapes-by-hand.md complete, Phase 1 roofline result accessible from experiments/01-*/.

What you produce¶

A directory experiments/03-matmul-perf/ containing:

bench.py — your benchmark script.
results.json — measurements at multiple N.
perf.png — log-log plot of GFLOPS vs N, four curves + roofline overlay.
manifest.json.
README.md — interpretation, 2-3 paragraphs.

No src/ module is introduced this phase. The four matmul variants live as plain functions in bench.py (or a sibling helper file inside the experiment directory). The implementation gets promoted to src/minigrad/linalg.py in Phase 7 when the autograd consumer arrives — until then, scratch-script status is correct.

Part A — Implement the four matmul variants¶

Inside experiments/03-matmul-perf/:

matmul_naive(A, B) — triple-loop pure Python (with NumPy arrays as containers, no Python lists; but the inner scalar multiply-add is interpreted). FP32 throughout.
matmul_blocked(A, B, block_size=64) — six-loop blocked version. Document the block size as a parameter.
matmul_einsum(A, B) — single np.einsum('ik,kj->ij', A, B) call.
np.matmul(A, B) — the reference.
Verify the four agree with A @ B to rtol=1e-5, atol=1e-6 on random (M=K=N=64) inputs. (Matmul accumulator differences are real but bounded.)

Constraints¶

Pure NumPy. No scipy.linalg, no torch, no numba, no cython.
No early vectorisation in matmul_naive. If you write C[i, k] += A[i, k] * B[k, j] inside three for loops, you've done it right. If you write C[i] += A[i, k] * B[k] (broadcasting across j), you've accidentally vectorised the inner loop — the measured GFLOPS will be too high and the lab fails its purpose.
Determinism. np.random.default_rng(42).standard_normal(...).astype(np.float32) for the inputs.
bench.py also writes manifest.json with seed + versions + CPU governor + thread settings.

Part B — The benchmark¶

bench.py measures wall time of each method at several N:

Sizes: N ∈ {64, 128, 256, 512, 1024}. Square matrices. fp32.
Methods: naive, blocked, einsum, np.matmul. Optionally also blocked_32, blocked_128 for tuning curiosity.
Iterations: enough to swamp timer noise — at minimum 5 reps, with the total elapsed time per (method, N) at least 1 second. For naive at N=1024 you may need to run fewer iterations (or skip the largest size and extrapolate from N=512 by T_naive(N) ≈ T_naive(N/2) × 8 since cost is O(N³)).
Warm-up: one untimed iteration per (method, N) before measurement.
GFLOPS: (2 · N³) / elapsed_seconds / 1e9. Save as results.json.

Sanity bound¶

You should see roughly:

Method	N=128	N=1024
`naive`	~3 MFLOPS	~3 MFLOPS (Python interpreter dominates regardless of `N`)
`blocked`	~3-5 MFLOPS	similar (still Python interpreter)
`einsum`	~10-30 GFLOPS	~20-60 GFLOPS
`np.matmul`	~20-50 GFLOPS	~50-100 GFLOPS

If naive runs at >100 MFLOPS, you accidentally vectorised — re-check there are no NumPy broadcast ops in the inner loop. If np.matmul runs at <10 GFLOPS, OpenBLAS isn't linked or you're throttled. Check np.show_config() and the CPU governor.

Part C — Plot¶

perf.png:

x-axis: N (log scale, integer ticks at 64, 128, 256, 512, 1024).
y-axis: GFLOPS (log scale).
Four curves, one per method.
Overlay: horizontal line at the i5-8250U's measured peak fp32 GFLOPS from Phase 1's roofline experiment. Compute the crossover with the memory ceiling. Annotate.
Title: "Matmul performance on i5-8250U, fp32, single-threaded."

Part D — README interpretation¶

In README.md, answer:

What's the speedup of np.matmul over matmul_naive at N=1024? State as a multiplier.
Decompose the speedup into "BLAS magic" vs "no Python interpreter". Hint: einsum typically captures most of the BLAS speedup (it dispatches to the same routines), but np.matmul may still pull ahead at small N because of optimiser-pass differences. So roughly: interpreter-elimination ≈ 100×, SIMD ≈ 8×, cache blocking ≈ 5-10×, multi-thread (off here) would be another 4×.
Where does np.matmul sit on the Phase 1 roofline? It should land near or above the memory ceiling at N=1024 (because BLAS reuses cache aggressively, raising the effective arithmetic intensity well above the naive 0.25 FLOP/byte floor). State the arithmetic intensity it achieves.
Does matmul_blocked beat matmul_naive in Python? If not, why not? (Hint: Python interpreter overhead dwarfs the cache-miss cost the blocking is meant to amortise. Blocking is a C/asm optimisation; in pure Python it's noise.)

§A13 anchor — matmul as embedding lookup¶

Even though this lab uses random (N, N) matrices, remember the §A13 anchor from theory/02-matmul-and-shapes.md: the embedding lookup E @ one_hot(i) is a matrix-vector multiply with E.shape = (V, D) = (600, 64). At our §A13 scale that's 2 × 600 × 64 = 76,800 FLOPs per token — trivially dominated by the actual gather (E[i], zero FLOPs). At MiniGPT scale (V = 600, D = 64, B = 32, T = 16) per forward pass the vocabulary projection x @ E^T is 2 × 32 × 16 × 64 × 600 ≈ 40M FLOPs, which lands comfortably in BLAS territory. Knowing where your operation sits on this curve tells you whether to worry about Python overhead at all.

Constraints¶

Single-threaded. Set OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1 before launching bench.py. Record the env vars in manifest.json. Multi-thread benchmarks are Phase 35.
CPU governor performance + AC power. Same protocol as Phase 1 lab 01.
No cython / numba / torch. Pure NumPy / pure Python.
Same seed for all runs (so memory access patterns don't differ between methods).

Pitfalls¶

np.matmul is non-deterministic across runs by tiny amounts (BLAS may reorder for cache reasons). Don't expect bit-equal results between runs; expect rtol=1e-5 between methods at the same run.
N=2048 might OOM if you allocate too many buffers. Reuse the same A, B, C across iterations.
Naive at N=1024 takes ~5-10 minutes per iteration in pure Python. Plan accordingly. Extrapolating from N=512 is acceptable if you document it in the README.
np.einsum('ik,kj->ij', A, B) is not identical to A @ B in code path — both go to BLAS, but through different optimiser paths. Both are valid measurements; record both.
First call to a BLAS routine in a Python process pays a one-time setup cost. The warm-up iteration handles it; don't measure cold-start.

Stop conditions¶

All four functions implemented in bench.py; the agreement check passes at rtol=1e-5, atol=1e-6 on N=64.
Plot exists with all four curves + roofline overlay.
README answers all four interpretation questions.
Speedup of np.matmul over matmul_naive at the largest measured N is ≥ 50×. (Realistically it will be 10⁴-10⁵×; 50× is the floor.) If not, debug.

When to consult `solutions/`¶

After committing the plot + README. solutions/01-matmul-perf-ref.md (at phase open) walks through expected numbers on the i5-8250U and discusses BLAS internals at a curriculum level.

Next lab: lab/02-svd-compression.md.

Lab 01 — Matmul performance: naive vs blocked vs einsum vs BLAS¶

What you produce¶

Part A — Implement the four matmul variants¶

Constraints¶

Part B — The benchmark¶

Sanity bound¶

Part C — Plot¶

Part D — README interpretation¶

§A13 anchor — matmul as embedding lookup¶

Constraints¶

Pitfalls¶

Stop conditions¶

When to consult solutions/¶

When to consult `solutions/`¶