Skip to content

English · Español

Lab 01 — Matmul performance: naive vs blocked vs einsum vs BLAS

Goal: measure the 50× gap between Python triple-loop matmul and np.matmul, and explain it as a roofline fact (Phase 1) compounded by vectorisation + interpreter avoidance.

Estimated time: 90–120 minutes.

Prereq: theory/02-matmul-and-shapes.md, lab/00-shapes-by-hand.md complete, Phase 1 roofline result accessible from experiments/01-*/.


What you produce

A directory experiments/03-matmul-perf/ containing:

  • bench.py — your benchmark script.
  • results.json — measurements at multiple N.
  • perf.png — log-log plot of GFLOPS vs N, four curves + roofline overlay.
  • manifest.json.
  • README.md — interpretation, 2-3 paragraphs.

No src/ module is introduced this phase. The four matmul variants live as plain functions in bench.py (or a sibling helper file inside the experiment directory). The implementation gets promoted to src/minigrad/linalg.py in Phase 7 when the autograd consumer arrives — until then, scratch-script status is correct.

Part A — Implement the four matmul variants

Inside experiments/03-matmul-perf/:

  • matmul_naive(A, B) — triple-loop pure Python (with NumPy arrays as containers, no Python lists; but the inner scalar multiply-add is interpreted). FP32 throughout.
  • matmul_blocked(A, B, block_size=64) — six-loop blocked version. Document the block size as a parameter.
  • matmul_einsum(A, B) — single np.einsum('ik,kj->ij', A, B) call.
  • np.matmul(A, B) — the reference.
  • Verify the four agree with A @ B to rtol=1e-5, atol=1e-6 on random (M=K=N=64) inputs. (Matmul accumulator differences are real but bounded.)

Constraints

  • Pure NumPy. No scipy.linalg, no torch, no numba, no cython.
  • No early vectorisation in matmul_naive. If you write C[i, k] += A[i, k] * B[k, j] inside three for loops, you've done it right. If you write C[i] += A[i, k] * B[k] (broadcasting across j), you've accidentally vectorised the inner loop — the measured GFLOPS will be too high and the lab fails its purpose.
  • Determinism. np.random.default_rng(42).standard_normal(...).astype(np.float32) for the inputs.
  • bench.py also writes manifest.json with seed + versions + CPU governor + thread settings.

Part B — The benchmark

bench.py measures wall time of each method at several N:

  • Sizes: N ∈ {64, 128, 256, 512, 1024}. Square matrices. fp32.
  • Methods: naive, blocked, einsum, np.matmul. Optionally also blocked_32, blocked_128 for tuning curiosity.
  • Iterations: enough to swamp timer noise — at minimum 5 reps, with the total elapsed time per (method, N) at least 1 second. For naive at N=1024 you may need to run fewer iterations (or skip the largest size and extrapolate from N=512 by T_naive(N) ≈ T_naive(N/2) × 8 since cost is O(N³)).
  • Warm-up: one untimed iteration per (method, N) before measurement.
  • GFLOPS: (2 · N³) / elapsed_seconds / 1e9. Save as results.json.

Sanity bound

You should see roughly:

Method N=128 N=1024
naive ~3 MFLOPS ~3 MFLOPS (Python interpreter dominates regardless of N)
blocked ~3-5 MFLOPS similar (still Python interpreter)
einsum ~10-30 GFLOPS ~20-60 GFLOPS
np.matmul ~20-50 GFLOPS ~50-100 GFLOPS

If naive runs at >100 MFLOPS, you accidentally vectorised — re-check there are no NumPy broadcast ops in the inner loop. If np.matmul runs at <10 GFLOPS, OpenBLAS isn't linked or you're throttled. Check np.show_config() and the CPU governor.

Part C — Plot

perf.png:

  • x-axis: N (log scale, integer ticks at 64, 128, 256, 512, 1024).
  • y-axis: GFLOPS (log scale).
  • Four curves, one per method.
  • Overlay: horizontal line at the i5-8250U's measured peak fp32 GFLOPS from Phase 1's roofline experiment. Compute the crossover with the memory ceiling. Annotate.
  • Title: "Matmul performance on i5-8250U, fp32, single-threaded."

Part D — README interpretation

In README.md, answer:

  1. What's the speedup of np.matmul over matmul_naive at N=1024? State as a multiplier.
  2. Decompose the speedup into "BLAS magic" vs "no Python interpreter". Hint: einsum typically captures most of the BLAS speedup (it dispatches to the same routines), but np.matmul may still pull ahead at small N because of optimiser-pass differences. So roughly: interpreter-elimination ≈ 100×, SIMD ≈ 8×, cache blocking ≈ 5-10×, multi-thread (off here) would be another 4×.
  3. Where does np.matmul sit on the Phase 1 roofline? It should land near or above the memory ceiling at N=1024 (because BLAS reuses cache aggressively, raising the effective arithmetic intensity well above the naive 0.25 FLOP/byte floor). State the arithmetic intensity it achieves.
  4. Does matmul_blocked beat matmul_naive in Python? If not, why not? (Hint: Python interpreter overhead dwarfs the cache-miss cost the blocking is meant to amortise. Blocking is a C/asm optimisation; in pure Python it's noise.)

§A13 anchor — matmul as embedding lookup

Even though this lab uses random (N, N) matrices, remember the §A13 anchor from theory/02-matmul-and-shapes.md: the embedding lookup E @ one_hot(i) is a matrix-vector multiply with E.shape = (V, D) = (600, 64). At our §A13 scale that's 2 × 600 × 64 = 76,800 FLOPs per token — trivially dominated by the actual gather (E[i], zero FLOPs). At MiniGPT scale (V = 600, D = 64, B = 32, T = 16) per forward pass the vocabulary projection x @ E^T is 2 × 32 × 16 × 64 × 600 ≈ 40M FLOPs, which lands comfortably in BLAS territory. Knowing where your operation sits on this curve tells you whether to worry about Python overhead at all.

Constraints

  • Single-threaded. Set OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1 before launching bench.py. Record the env vars in manifest.json. Multi-thread benchmarks are Phase 35.
  • CPU governor performance + AC power. Same protocol as Phase 1 lab 01.
  • No cython / numba / torch. Pure NumPy / pure Python.
  • Same seed for all runs (so memory access patterns don't differ between methods).

Pitfalls

  • np.matmul is non-deterministic across runs by tiny amounts (BLAS may reorder for cache reasons). Don't expect bit-equal results between runs; expect rtol=1e-5 between methods at the same run.
  • N=2048 might OOM if you allocate too many buffers. Reuse the same A, B, C across iterations.
  • Naive at N=1024 takes ~5-10 minutes per iteration in pure Python. Plan accordingly. Extrapolating from N=512 is acceptable if you document it in the README.
  • np.einsum('ik,kj->ij', A, B) is not identical to A @ B in code path — both go to BLAS, but through different optimiser paths. Both are valid measurements; record both.
  • First call to a BLAS routine in a Python process pays a one-time setup cost. The warm-up iteration handles it; don't measure cold-start.

Stop conditions

  • All four functions implemented in bench.py; the agreement check passes at rtol=1e-5, atol=1e-6 on N=64.
  • Plot exists with all four curves + roofline overlay.
  • README answers all four interpretation questions.
  • Speedup of np.matmul over matmul_naive at the largest measured N is ≥ 50×. (Realistically it will be 10⁴-10⁵×; 50× is the floor.) If not, debug.

When to consult solutions/

After committing the plot + README. solutions/01-matmul-perf-ref.md (at phase open) walks through expected numbers on the i5-8250U and discusses BLAS internals at a curriculum level.


Next lab: lab/02-svd-compression.md.