00: Why a hardware phase before any AI
The core intuition: the bottleneck in AI is not the multiplications, it is moving the data. If you don't understand the memory hierarchy, everything else (slow matmul, I/O-bound training, an expensive KV cache) looks like magic.
The lie textbooks tell
A linear algebra textbook says matmul is O(N³). That's true as a count of multiplications. It is false as a predictor of wall-clock time on real hardware. Try it: write naive fp32 matmul in Python with three nested loops, run it on a 1024×1024 input, then run np.matmul on the same input.
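The naive version is at least 50× slower than NumPy, and in pure Python usually far more. Both do the same number of multiplications. Some of that gap is interpreter overhead, but a naive loop in a compiled language is still many times slower than a tuned BLAS, and a large share of what remains is the cost of moving data through the memory hierarchy.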
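The experiment above can be sketched as follows. This is a minimal version under stated assumptions: the helper names `naive_matmul` and `compare` are illustrative, and the default size is 128 rather than 1024 so it finishes in well under a second (a pure-Python triple loop at 1024 runs for minutes). The measured ratio depends on your machine and your BLAS build, so no expected timing is given.

```python
import time

import numpy as np


def naive_matmul(a, b):
    """Textbook triple loop over plain Python lists of floats."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply, one accumulate
            c[i][j] = acc
    return c


def compare(n=128, seed=0):
    """Time the naive loop against np.matmul on the same n x n fp32 input.

    Returns (naive_seconds, numpy_seconds, max_abs_difference).
    """
    rng = np.random.default_rng(seed)
    a = rng.random((n, n), dtype=np.float32)
    b = rng.random((n, n), dtype=np.float32)
    a_list, b_list = a.tolist(), b.tolist()  # converted outside the timed region

    t0 = time.perf_counter()
    c_naive = naive_matmul(a_list, b_list)
    t_naive = time.perf_counter() - t0

    t0 = time.perf_counter()
    c_np = np.matmul(a, b)
    t_np = time.perf_counter() - t0

    max_err = float(np.max(np.abs(np.array(c_naive) - c_np)))
    return t_naive, t_np, max_err


if __name__ == "__main__":
    t_naive, t_np, err = compare(128)
    print(f"naive: {t_naive:.4f} s  numpy: {t_np:.6f} s  max |diff|: {err:.2e}")
```

A single timed call is noisy. Repeating each measurement a few times and keeping the minimum gives a steadier ratio.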
The textbook model treats memory access as free. Real hardware does not. Until you have measured this yourself, every later phase has a leak — you'll explain training as "the optimizer finds a minimum" without ever wondering why a forward pass takes 50 ms instead of 50 µs.
The thesis of Phase 1
Phase 1 trains one habit:
Whenever you see a kernel that's slower than expected, your first question is "is it bandwidth-bound or compute-bound?", not "is the algorithm right?".
Algorithmic optimization is rare. Memory optimization is constant. By the end of Phase 1, you should be able to read a kernel and predict which side of the roofline it lives on — before you run it.
What "bandwidth-bound" means, concretely
A modern CPU can do roughly 100 GFLOPS of fp32 arithmetic. Reading data from main RAM, the same CPU sustains roughly 20 GB/s. fp32 is 4 bytes. So in the time it takes to load one fp32 value from main memory, the CPU could have done 20 floating-point operations.
If your kernel does fewer than ~20 FLOPs per fp32 value loaded (about 5 FLOPs per byte), you are memory-bound: the FPUs are idle, waiting for data. Naive matmul does exactly 2 FLOPs per fp32 loaded (one multiply, one accumulate). It's bandwidth-bound by a factor of ten before you even start optimizing.
The number that captures this is arithmetic intensity I = FLOPs / bytes_moved, and the plot that visualizes it is the roofline. Both are introduced in theory/03-roofline-model.md.
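The arithmetic in this section can be checked with a few lines. This sketch uses the round numbers from the text (100 GFLOPS, 20 GB/s), not measurements. The function names are illustrative, and the `min(peak, I * bandwidth)` ceiling is the standard roofline formula, stated here as an assumption about what the roofline theory page covers.

```python
PEAK_GFLOPS = 100.0    # round number from the text, not a measurement
BANDWIDTH_GBS = 20.0   # round number from the text, not a measurement
BYTES_PER_FP32 = 4


def arithmetic_intensity(flops, bytes_moved):
    """I = FLOPs / bytes_moved."""
    return flops / bytes_moved


def attainable_gflops(intensity, peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GBS):
    """Roofline ceiling: limited by compute or by bandwidth, whichever is lower."""
    return min(peak, intensity * bandwidth)


# Intensity at which the two limits meet (the "ridge point").
ridge_flops_per_byte = PEAK_GFLOPS / BANDWIDTH_GBS             # 5 FLOPs per byte
ridge_flops_per_fp32 = ridge_flops_per_byte * BYTES_PER_FP32   # 20 FLOPs per fp32

# Naive matmul: 2 FLOPs for every fp32 value loaded.
naive_intensity = arithmetic_intensity(2, BYTES_PER_FP32)      # 0.5 FLOPs per byte
naive_ceiling = attainable_gflops(naive_intensity)             # 10 GFLOPS

print(f"ridge: {ridge_flops_per_byte} FLOPs/byte = {ridge_flops_per_fp32} FLOPs/fp32")
print(f"naive matmul ceiling: {naive_ceiling} of {PEAK_GFLOPS} GFLOPS")
```

The naive kernel's ceiling comes out at one tenth of peak, which is the "factor of ten" above. Substituting your own measured peak and bandwidth moves the ridge point but not the reasoning.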
Why this matters for AI specifically
Five claims that should make sense after Phase 1, but probably look like jargon now:
- GEMM (matmul) is the bottleneck of training and inference. The reason "matmul speed" is the headline performance metric for any AI accelerator is that everything else — softmax, layer norm, residual add — is so memory-bound that GEMM efficiency dominates total runtime.
- The transformer is memory-bound during inference, compute-bound during training. During training, batch sizes are large, so each weight load is amortized over many activations: high arithmetic intensity. During inference with batch size 1 (chat), each weight is loaded for one token: low intensity. This shift is why the KV-cache's memory traffic matters so much at inference time (Phase 22).
- The A100-to-H100 upgrade is mostly about memory bandwidth, not FLOPS. The H100 has roughly 1.6× to 2× the memory bandwidth of an A100, depending on which variants you compare, and that matters more than the FLOPS bump for memory-bound workloads. Vendors lead with FLOPS because it's a bigger number.
- Quantization (Phase 26) speeds up inference primarily by reducing bytes moved, not multiplications. INT8 multiplies aren't 4× faster than FP32 multiplies on most hardware. But INT8 weights are 4× smaller, and you're loading them from memory.
- Flash Attention (Phase 27) is not a smarter algorithm; it's the same algorithm rearranged to fit in SRAM instead of HBM. That's a memory-hierarchy optimization, not a math optimization.
Every one of those statements is a roofline argument. You will not be able to evaluate any architecture or accelerator until you can make these arguments yourself, from measurements.
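As one illustration of such an argument, the sketch below estimates the arithmetic intensity of a single linear layer for different batch sizes and weight widths. It is a simplified model under loud assumptions: every weight and activation is moved exactly once, the layer size 4096 is an arbitrary example, and `linear_layer_intensity` is a hypothetical helper, not part of any library.

```python
def linear_layer_intensity(batch, d_in, d_out, weight_bytes=4, act_bytes=4):
    """FLOPs per byte for y = x @ W, with x of shape (batch, d_in).

    Assumes ideal reuse: each weight, input and output element crosses
    the memory bus exactly once.
    """
    flops = 2 * batch * d_in * d_out                 # one multiply + one add per term
    weight_traffic = d_in * d_out * weight_bytes
    activation_traffic = batch * (d_in + d_out) * act_bytes
    return flops / (weight_traffic + activation_traffic)


D = 4096  # arbitrary example width

decode_fp32 = linear_layer_intensity(batch=1, d_in=D, d_out=D)       # about 0.5
train_fp32 = linear_layer_intensity(batch=512, d_in=D, d_out=D)      # about 205
decode_int8 = linear_layer_intensity(batch=1, d_in=D, d_out=D,
                                     weight_bytes=1)                 # about 2.0

print(f"batch 1,   fp32 weights: {decode_fp32:.2f} FLOPs/byte")
print(f"batch 512, fp32 weights: {train_fp32:.2f} FLOPs/byte")
print(f"batch 1,   int8 weights: {decode_int8:.2f} FLOPs/byte")
```

Against the illustrative 5 FLOPs/byte ridge from earlier in this page, batch 1 sits deep on the bandwidth side and batch 512 sits on the compute side. Shrinking the weights to one byte raises batch-1 intensity by almost 4× without changing the FLOP count, which is the quantization claim in the list above.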
The path through Phase 1
- Theory 01 zooms in: what is a CPU, mechanically? From transistors to a pipelined out-of-order execution engine. The point is not to recreate a digital logic course but to name the abstractions that the next theory pages assume.
- Theory 02 zooms out: caches, DRAM, NUMA, PCIe, disk. Each level has a latency in nanoseconds and a bandwidth in GB/s. The ratios between them are the substrate of every performance argument in the curriculum.
- Theory 03 unifies the two with the roofline model.
- Labs 00–03 make you do the measurements on your own machine. Every plot is yours, not a textbook screenshot. The roofline you commit at the end is the roofline of this laptop, which makes the curriculum's later GPU comparisons (Phase 23+) concrete instead of abstract.
Stop here if
You are tempted to skip Phase 1 because "I already know about caches." Don't. The test is not "can you say the word L1." The test is: can you predict which side of the roofline your code lives on, before you run it? If you can't yet, Phase 1 is for you.
Next: theory/01-from-transistor-to-cpu.md.