Skip to content

English · Español

Break — ignore memory-coalescing in a hand-written CPU matmul; benchmark the slowdown

🇪🇸 Memory coalescing es la propiedad central que un kernel — en CPU o GPU — debe respetar para no perder un orden de magnitud de bandwidth. La hacemos visible en el único hardware que Borja tiene: la CPU. La diferencia entre el orden de bucles (i, k, j) y (i, j, k) en un matmul C-style es la diferencia entre 5 GFLOPS y 0.5 GFLOPS.


Symptom Borja will see

Two implementations of \(C = A B\) for \(1024 \times 1024\) fp32 matrices, on the i5-8250U:

  • Run A (coalesced): loop order (i, k, j), inner loop accesses B[k][j] and C[i][j] sequentially in j (row-major friendly).
  • Run B (broken): loop order (i, j, k), inner loop accesses B[k][j] with j fixed and k varying — strided access across rows.

Benchmarks (median of 10 runs, single-threaded, fp32):

Run Time GFLOPS
A (coalesced) 0.42 s 5.1
B (broken) 4.85 s 0.44

The broken version is ~11× slower. Same arithmetic, same outputs (modulo floating-point reordering), same memory footprint. Just one loop reorder.

The break, mechanically

# Run A (control, coalesced inner loop)
def matmul_ikj(A, B, C):
    N = A.shape[0]
    for i in range(N):
        for k in range(N):
            a_ik = A[i, k]
            for j in range(N):
                C[i, j] += a_ik * B[k, j]

# Run B (break, non-coalesced inner loop)
def matmul_ijk(A, B, C):
    N = A.shape[0]
    for i in range(N):
        for j in range(N):
            acc = 0.0
            for k in range(N):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc

In NumPy you'd never write this by hand — np.dot does it right. But the educational point is to show the failure mode in code small enough to read.

In real CPU/GPU kernel code, this pattern is everywhere:

  • A CPU loop that strides B[k][j] over k: each iteration loads a different cache line (B's rows are spaced by 1024×4 = 4096 bytes apart, far beyond a 64-byte cache line).
  • A GPU kernel where threads in a warp access non-consecutive global memory addresses: each access is a separate memory transaction, killing throughput.

Why this teaches the concept

A cache line on x86 is 64 bytes = 16 fp32 elements. When you access B[k][j] with j increasing inside the loop and k fixed, you read 16 consecutive j values per cache line load — that's memory coalescing. The hardware prefetcher detects the pattern and starts pulling the next cache line before you ask for it.

When you access B[k][j] with k increasing inside the loop and j fixed, each access is to a different row of B. Each row is 4096 bytes apart. Every access pulls a new cache line. The prefetcher can't help — it sees the access pattern, but it can't pull 16 different lines simultaneously, and the L1d cache (32 KiB = 512 fp32 elements) thrashes within a few inner-loop iterations.

The result: instead of one cache line load per 16 floats fetched (the coalesced case), you do one cache line load per 1 float fetched. Memory bandwidth utilization drops 16×. Combined with the loss of prefetcher help, the slowdown reaches ~11× in practice.

This is the same lesson as GPU memory coalescing (Phase 24). On a GPU, the analog is "threads in a warp accessing consecutive global memory addresses" vs "strided accesses". The penalty is bigger on GPU (32× or more), but the underlying physics is identical: memory hardware likes sequential access.

Diagnostic ladder Borja should walk

  1. First check: time both implementations. The 11× gap is obvious.
  2. Second check: use perf stat -e cache-misses,cache-references ./matmul_a vs ./matmul_b. Run B has ~16× more cache misses for the same workload.
  3. Third check: look at the loop order. Identify which array's accesses are strided in the inner loop.
  4. Diagnosis: B[k][j] with j outer and k inner is the strided pattern. Reorder to put j inner.

Reproducer

just phase-23-benchmark-matmul

# Outputs:
#   matmul_ikj (coalesced): 0.42 s, 5.1 GFLOPS
#   matmul_ijk (strided):   4.85 s, 0.44 GFLOPS
#   slowdown: 11.5x

Also run with perf:

perf stat -e L1-dcache-load-misses,L1-dcache-loads python -c "from minicpu.matmul import matmul_ijk; ..."

Compare the miss rates.

Hint cascade

  1. (Mild) "Both runs produce the same output. Why is one 11× slower?"
  2. (Medium) "Use perf stat -e cache-misses. What's the ratio?"
  3. (Direct) "In Run B's inner loop, what is the memory access pattern of B[k][j]? How does that interact with cache lines?"

Fix

Reorder the loops to (i, k, j). Or, more practically, use a library: np.dot, numpy @, or BLAS via scipy.linalg.blas.sgemm. These do tiling, blocking, and SIMD that hand-written Python can't match. But the educational point is: even the library internally must respect coalescing; the trick scales.

What this break is NOT

  • Not a correctness bug — the two outputs match (up to floating-point reordering noise).
  • Not a numerical bug.
  • Not an algorithmic complexity bug — both are \(O(N^3)\).

It is a memory hierarchy bug. The same data, the same arithmetic, in a different order — and the wall-clock cost differs by a factor of 11. This is the kind of bug you cannot find by reading the code; you find it by measuring.

Why this matters before Phase 24

Phase 24 will introduce CUDA and Triton. In those, memory coalescing is the first thing any kernel optimizer asks about. A GPU kernel with non-coalesced accesses can be 32× slower than the coalesced version. By doing this break on the CPU first, you internalize the concept in an environment where:

  • The build/run loop is fast (no CUDA toolchain).
  • The metrics are visible (perf works locally).
  • The numbers are reasonable (the cache hierarchy is documented and predictable).

Then on the GPU, you transfer the concept and just learn the new vocabulary (warps, banks, transactions).

Cross-refs

  • theory/05-cpu-only-roofline-i5-8250u.md — why the CPU is memory-bound for our kernels.
  • theory/02-gpu-memory-hierarchy.md — the GPU analog.
  • Phase 24 theory/02-from-naive-to-tiled.md — what happens when you fix coalescing on a GPU.