00 — Why Write Kernels at All¶
After Phase 23 you know what an SM and a warp are, and how the HBM/L2/SMEM/register hierarchy fits together. After Phase 22 you know that decode attention is memory-bound. Phase 24 is where you write the kernel that turns the theory into practice: a kernel of your own that you can read, profile, and tune, written in CUDA C and in Triton, and compared against cuBLAS. It is the acid test of understanding.
This is the motivation page. It answers: why isn't it enough to "just call torch.nn.functional"? Why a whole phase on writing kernels by hand? And — why now (Phase 24, not Phase 5 or Phase 40)?
"Just call PyTorch" — when it's fine¶
PyTorch / cuBLAS / cuDNN / Flash-Attention are great at the kernels they cover. For 95% of work, calling torch.matmul or F.scaled_dot_product_attention is correct, fast, and the right choice. Phase 24 doesn't argue against using framework kernels; it argues against being unable to write one when needed.
Three places where framework kernels stop being enough:
- A new operator the framework doesn't have. Custom RoPE variants, fused QKV+bias+RoPE, custom quantization formats (Phase 26), audio-spectrogram preprocessing — none of these come pre-built. Real production work writes custom kernels routinely.
- A fused operator the framework doesn't fuse. PyTorch's eager mode runs softmax(x).mul(y) as two kernel launches plus an intermediate tensor allocation. A fused kernel does it in one launch with no intermediate. Latency-critical paths care.
- A kernel where the framework's choice is wrong. cuBLAS picks one of several GEMM implementations by heuristic. For unusual shapes (e.g., tall-skinny matrices common in MoE routing), the heuristic chooses poorly and you write your own.
You will hit each of these later: in Phase 26 (custom quant), in Phase 27 (paged attention has no out-of-the-box framework kernel that matches your exact memory layout), and in Phase 33 (compilation passes). Phase 24 is the prerequisite skill.
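To make the fusion case concrete, here is a minimal pure-Python sketch, not the lab's kernel. The function names and the traffic model are illustrative assumptions: the model assumes a row stays on-chip within one kernel and that nothing is cached between kernels, so every intermediate tensor makes a round trip through HBM.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def unfused_softmax_mul(x, y):
    """Two 'kernels': softmax writes an intermediate, mul reads it back."""
    s = softmax(x)                             # kernel 1: read x, write s
    return [si * yi for si, yi in zip(s, y)]   # kernel 2: read s and y, write out

def fused_softmax_mul(x, y):
    """One 'kernel': each softmax value is consumed as soon as it is produced."""
    m = max(x)
    total = sum(math.exp(v - m) for v in x)
    return [math.exp(xi - m) / total * yi for xi, yi in zip(x, y)]

def hbm_elements_moved(n, fused):
    """Toy traffic model (an assumption, not a measurement) for a row of length n."""
    if fused:
        return 3 * n   # read x, read y, write out
    return 5 * n       # read x, write s, read s, read y, write out
```

Under this toy model the fused version moves 3n elements instead of 5n and skips one launch. Real savings depend on cache behavior and launch overhead, which is what the roofline and the profiler are for.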
Why now (Phase 24)?¶
Three reasons it lands here, not earlier and not later:
- You can finally measure. Phase 23 built the GPU roofline. Phase 24 puts kernels on it. Without the roofline, "fast" is a feeling; with it, "fast" is a percentage of peak.
- You have a real operator menu. Phase 22 named decode attention, prefill GEMM, FFN — concrete operators. Phase 24's kernel is one of those, not a synthetic benchmark.
- PyTorch is introduced here deliberately. All prior code is NumPy. Phase 24 introduces PyTorch after the substrate (Phase 23) is understood and after the math (every prior phase) is solid. The framework is the tool, presented after the problem is in your hands. Doing it earlier would have turned the framework into a magic box.
If you skipped Phase 24 entirely, you'd be a competent "framework user" but unable to debug or extend. The point of the curriculum is to never be in that position.
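The first reason above ("fast" as a percentage of peak) can be sketched in a few lines. The hardware numbers below are hypothetical round figures chosen for easy arithmetic, not the specs of any real GPU, and the function names are illustrative.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline ceiling in FLOP/s: the lower of the compute and memory roofs.

    intensity is arithmetic intensity in FLOP per byte moved to or from HBM.
    """
    return min(peak_flops, intensity * peak_bandwidth)

def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """True when the kernel sits left of the ridge point."""
    return intensity < peak_flops / peak_bandwidth

def percent_of_roofline(achieved_flops, intensity, peak_flops, peak_bandwidth):
    """Achieved throughput as a percentage of the roofline ceiling."""
    ceiling = attainable_flops(intensity, peak_flops, peak_bandwidth)
    return 100.0 * achieved_flops / ceiling

# Hypothetical GPU (assumed values, for illustration only).
PEAK_FLOPS = 100e12        # 100 TFLOP/s
PEAK_BANDWIDTH = 2e12      # 2 TB/s
ridge = PEAK_FLOPS / PEAK_BANDWIDTH   # 50 FLOP/byte

# A kernel at 2 FLOP/byte is memory-bound; its ceiling is 4 TFLOP/s.
# Measuring 1 TFLOP/s therefore means 25% of what the roofline allows.
pct = percent_of_roofline(1e12, 2.0, PEAK_FLOPS, PEAK_BANDWIDTH)
```

The point of the percentage is that it is relative to the ceiling for that kernel's intensity, not to the headline FLOP/s of the card.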
The one-kernel principle¶
A common failure pattern in kernel courses: cover ten kernels superficially. The student leaves with vague pattern-matching ("I think Flash-Attention uses tiling") and no ability to write a new kernel.
The Phase 24 alternative: one kernel, three implementations, deep. Same operator (fused softmax by default, GEMM as alternative) implemented in:
- Naive CUDA C: one thread per output, no SMEM, nothing fancy. Correct, slow. Establishes "what is the kernel doing".
- Tuned CUDA C: coalesced loads, SMEM tile, occupancy tuning. Reaches ≥30% of cuBLAS / F.softmax. Establishes "what makes it fast".
- Triton: Python-like kernel with autotune. Establishes "what does Triton automate vs what does it still require you to specify".
Plus a fourth implementation as the comparison line:
4. cuBLAS / torch.nn.functional — the framework reference. The thing you're chasing.
The four are placed side-by-side on the GPU roofline. The student leaves with a single kernel they have profiled, tuned, and explained. That single kernel is reusable as the lens for understanding every future kernel they read about.
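As a rough illustration of why the naive and tuned versions differ, the sketch below emulates both in plain Python and counts global-memory loads. It is an assumption about what "one thread per output, no SMEM" implies for softmax, not the lab's actual kernels: here each emulated thread recomputes the row max and sum itself, and a local list stands in for an SMEM tile. `CountingRow` and both function names are hypothetical.

```python
import math

class CountingRow:
    """A row in emulated global memory that counts every element load."""
    def __init__(self, values):
        self._values = list(values)
        self.loads = 0

    def __len__(self):
        return len(self._values)

    def load(self, i):
        self.loads += 1
        return self._values[i]

def softmax_one_thread_per_output(row):
    """Each emulated thread j re-reads the whole row to get max and sum."""
    n = len(row)
    out = []
    for j in range(n):                                   # one iteration per thread
        m = max(row.load(i) for i in range(n))           # pass 1: row max
        total = sum(math.exp(row.load(i) - m) for i in range(n))  # pass 2: row sum
        out.append(math.exp(row.load(j) - m) / total)    # this thread's element
    return out

def softmax_staged(row):
    """The row is loaded once into a local buffer standing in for SMEM."""
    n = len(row)
    tile = [row.load(i) for i in range(n)]               # n global loads in total
    m = max(tile)
    total = sum(math.exp(v - m) for v in tile)
    return [math.exp(v - m) / total for v in tile]
```

For a row of length n, the emulated naive version issues n(2n + 1) loads and the staged version issues n. On real hardware L1/L2 caching absorbs part of that redundancy, so the count overstates HBM traffic, but the direction of the gap is the reason the tuned kernel stages the row.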
The PyTorch introduction, deliberately late¶
This phase is also where torch first gets imported in this codebase. The choice to delay it is not because PyTorch is bad — it's because learning the substrate first makes the framework comprehensible.
By Phase 23, Borja knows:
- The math of every operator the model needs (Phases 7–17).
- The memory hierarchy of CPU (Phase 1) and GPU (Phase 23).
- The arithmetic intensity of decode (Phase 22).
So when PyTorch is finally imported in Phase 24, every line of torch_minigpt.py is a known operator in a familiar shape, just with .cuda() and nn.Module decoration. There's no "I don't know what this layer does" moment. The framework is thin in conception — it routes operators to backends, maintains an autograd graph, and provides ergonomics. Phase 25 cracks open even those.
This contrasts with the typical learning path (PyTorch tutorial → "my model trains, I don't know why" → mostly works → silent bugs at scale). Borja's path is "every operator from scratch → frameworks are obvious → no magic".
What you'll feel by the end of Phase 24¶
Three things should feel viscerally true after Phase 24, not just intellectually:
- A kernel is a chunk of code, not magic. When nsight-compute says "your kernel hits 23% of peak HBM bandwidth", you know exactly which lines of your CUDA C are responsible and how to change them.
- Triton is "CUDA for the 80% case". Triton's autotune + Pythonic syntax cover most kernels with ~10× less code, at ~80–95% of hand-tuned CUDA performance. The remaining 5–20% gap is where raw CUDA still wins.
- PyTorch is plumbing. An nn.Module is a registry; a Tensor is a Storage + view metadata; a forward pass is a sequence of dispatch calls. Phase 25 will make this fully explicit, but Phase 24 plants the intuition.
If those three feelings are missing at phase close, redo lab 02 (the tuning loop) — the visceral understanding comes from the tuning loop, not from theory.
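The "Tensor is a Storage + view metadata" idea above can be sketched in a few lines of plain Python. `TinyTensor` is a hypothetical toy class, not PyTorch's implementation; it only mirrors the concept that PyTorch exposes through `Tensor.stride()` and `Tensor.storage_offset()`.

```python
class TinyTensor:
    """Hypothetical 2-D tensor: a shared flat storage plus shape, strides, offset."""
    def __init__(self, storage, shape, strides, offset=0):
        self.storage = storage    # flat Python list, shared between views
        self.shape = shape
        self.strides = strides    # elements to skip per step along each dim
        self.offset = offset

    def __getitem__(self, idx):
        i, j = idx
        return self.storage[self.offset + i * self.strides[0] + j * self.strides[1]]

    def transpose(self):
        """Swap shape and strides; the storage is not copied."""
        return TinyTensor(self.storage, self.shape[::-1], self.strides[::-1], self.offset)

    def row(self, i):
        """View of row i as a 1 x ncols tensor, again without copying."""
        return TinyTensor(self.storage, (1, self.shape[1]), self.strides,
                          self.offset + i * self.strides[0])

data = [0, 1, 2, 3, 4, 5]
t = TinyTensor(data, shape=(2, 3), strides=(3, 1))   # row-major 2x3
tt = t.transpose()                                   # 3x2 view of the same storage
```

Because `t` and `tt` share one storage, writing to `data[5]` is visible through both `t[1, 2]` and `tt[2, 1]`. That is the sense in which a view is only metadata.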
What you should be able to do by the end of Phase 24¶
- Write a CUDA C kernel for a row-wise operator (softmax, layernorm, RMSNorm) that:
  - Correctly handles arbitrary row count and row length.
  - Uses SMEM for the row.
  - Coalesces global loads.
  - Achieves ≥30% of the framework reference (cuBLAS / F.softmax) at the representative size.
- Rewrite the same operator in Triton; explain what the autotune does.
- Profile the kernel with ncu, identify the dominant stall reason (memory-bound vs compute-bound vs scheduler stall), and propose an improvement.
- Load Phase-17 MiniGPT in PyTorch (torch_minigpt.py), run a forward pass on GPU, confirm byte-identical to the NumPy reference at fp64.
- Explain, without looking at the source, what model.cuda() does to the underlying tensors and storages.
Next: theory/01-cuda-programming-model.md — the CUDA programming model, formalized.