English · Español
05 — Triton vs raw CUDA: what each gives you, and why we waited until Phase 24¶
🇪🇸 La regla §0.1 de CLAUDE.md dice: construye antes de abstraer. PyTorch llega en la Fase 24. Pero dentro de la Fase 24 hay otra elección — ¿escribimos CUDA directo o usamos Triton? Esta página justifica la respuesta, y explica por qué escribir CUDA primero aunque sea brevemente es no negociable.
This file is the depth-pass companion to theory/03-triton.md. We compare CUDA C++ and Triton head-to-head on a single kernel (vector add), show what Triton hides and what it doesn't, and explain why the build-before-abstract discipline of CLAUDE.md §0.1 says CUDA first, Triton next, framework-internal kernels last.
What CUDA gives you¶
CUDA C++ exposes the GPU's execution model directly:
- Threads organized into blocks, blocks into a grid.
- Shared memory explicitly declared and indexed.
- Warp-level primitives (shuffles, vote, ballot).
- Memory hierarchy controlled by hand (global, shared, registers, L1 / L2, texture).
- Synchronization via
__syncthreads(), atomics, barriers.
You write __global__ void kernel(...) { int idx = blockIdx.x * blockDim.x + threadIdx.x; ... }. Every concept maps directly to hardware. The compiler (nvcc) does almost no high-level transformation — what you write is what runs.
Cost: verbose. A "good" hand-tuned matmul in CUDA is 200-400 lines for a 2-D tile with double buffering. A "decent" version is 50-80 lines. The toy 5-line version performs at 1-5% of peak.
What Triton gives you¶
Triton (Tillet et al. 2019) is a Python-embedded DSL that compiles to PTX. Key differences:
- Block-level abstractions. Instead of thread-level indexing, you operate on tiles (e.g., a \(128 \times 128\) block of a matrix). The compiler decides how to map tiles to threads / warps.
- Auto-tuning. Block sizes are declarative parameters; you provide hints, the compiler enumerates configurations.
- Implicit pipelining. Software pipelining (overlapping memory loads with compute) is generated automatically; in CUDA you write it by hand with
__pipeline_*intrinsics. - Shared-memory management. Allocated, indexed, and bank-conflict-resolved automatically.
Cost: less control. For 95% of kernels, "compiler picks good defaults" is faster than "Borja hand-tunes". For the remaining 5% — usually the most performance-critical — Triton's abstractions hide things you might need to control (warp specialization, async copies, custom indexing).
Concrete comparison: vector add¶
CUDA C++¶
__global__ void vec_add(const float* a, const float* b, float* c, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
c[idx] = a[idx] + b[idx];
}
}
// Launch
int threads = 256;
int blocks = (N + threads - 1) / threads;
vec_add<<<blocks, threads>>>(a, b, c, N);
8 lines kernel + 3 lines launch. Every concept (thread, block, bounds check) is explicit.
Triton¶
import triton
import triton.language as tl
@triton.jit
def vec_add(a_ptr, b_ptr, c_ptr, N, BLOCK_SIZE: tl.constexpr):
pid = tl.program_id(0)
offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
mask = offsets < N
a = tl.load(a_ptr + offsets, mask=mask)
b = tl.load(b_ptr + offsets, mask=mask)
tl.store(c_ptr + offsets, a + b, mask=mask)
# Launch
grid = lambda meta: ((N + meta["BLOCK_SIZE"] - 1) // meta["BLOCK_SIZE"],)
vec_add[grid](a, b, c, N, BLOCK_SIZE=1024)
8 lines kernel + 2 lines launch. Similar surface area, but:
tl.arange(0, BLOCK_SIZE)gives you a vector of indices, not a scalar thread index.tl.load(..., mask=mask)handles the bounds check internally.BLOCK_SIZEis a compile-time constant — the compiler specializes the kernel for that value.
For a kernel this simple, Triton is not dramatically more concise. The win is at the matmul / attention level, where 200 lines of CUDA → 30 lines of Triton.
When CUDA wins¶
- Warp specialization. Treating warps within a block as having different roles (e.g., producer-consumer in attention). Triton has limited support.
- Tensor Core layout fiddling. When the data layout (row-major vs col-major vs swizzled) interacts with the Tensor Core MMA instruction, hand control matters.
- Non-uniform indexing. If your kernel needs threads in a warp to do different things based on data, CUDA gives you the divergence control; Triton's vector abstraction assumes uniformity.
When Triton wins¶
- Block matmul / attention. The kernels where 80% of the speed comes from getting the tile size right; Triton auto-tunes this.
- Quick prototyping. Get a kernel running in minutes, then iterate.
- Maintainability. A Triton kernel is easier to read 6 months later than a CUDA kernel.
Why we waited until Phase 24¶
Per CLAUDE.md §0.1 ("build before abstracting"):
- Phase 0-8: NumPy. Everything by hand. See how a matmul, a softmax, a layer-norm decompose into loops.
- Phase 9-17: Composing modules in raw Python/NumPy. The attention mechanism, the MLP, the embedding lookup — all visible.
- Phase 18-22: Training, evaluation, inference. Still NumPy. The optimizer, the eval harness, the KV cache — all visible.
- Phase 23: GPU mental model. No GPU code yet. Just why GPUs differ from CPUs.
- Phase 24: First GPU kernel. CUDA first (one kernel, in the lab), then Triton for the rest.
- Phase 25: PyTorch's autograd / dispatcher. Now we can read PyTorch's source because we already wrote everything by hand.
The order matters. If we'd started with PyTorch in Phase 9, every subsequent debugging exercise would have a "the framework does it" mystery. If we'd started with Triton in Phase 24, the compiler-generated optimizations would be invisible. By writing CUDA's vector-add by hand first, the abstractions Triton provides become visible — you know what they're hiding.
A diff in mental cost:
| Stage | Concept you must hold | New abstraction |
|---|---|---|
| Phase 0-8 NumPy matmul | array indexing, broadcast | none |
| Phase 23 mental model | warp, occupancy, memory hierarchy | hardware vocabulary |
| Phase 24 CUDA (one kernel) | thread/block index, shared mem | language: __global__, <<<>>> |
| Phase 24 Triton (rest) | tile, mask, block size | DSL on top of CUDA |
| Phase 25 PyTorch internals | dispatcher, autograd | framework on top of CUDA / Triton |
Each step is one new layer. If you skip a layer, the next layer becomes much harder to debug.
What this doesn't mean¶
It doesn't mean "always use CUDA in production". It means:
- Learn CUDA first by writing one or two kernels by hand.
- Then use Triton for the rest of Phase 24's labs.
- Then use PyTorch's built-ins (
torch.nn.functional,torch.compile) from Phase 25 onward.
In production, the layering is the same: Triton kernels for hot paths, PyTorch built-ins for the rest, raw CUDA only for the 1-2 places where Triton's abstractions are limiting. But the learning sequence is bottom-up, even if the deployment sequence is top-down.
The §A13 caveat¶
The §A13 mini-GPT does not benefit from GPU acceleration at this scale (see Phase 23 quiz q-23-05). The Phase 24 kernels are exercises in writing GPU code, not production tools for our model. The pedagogical artifact is:
- One CUDA kernel (vector-add or scalar matmul) in
lab/00-hello-cuda.md. - A Triton port of the same kernel in
lab/01-naive-kernel.md. - A tiled Triton matmul in
lab/02-tuned-kernel.md. - PyTorch wiring in
lab/03-triton-and-pytorch.md.
That's the entire Phase 24 scope. We don't write FlashAttention from scratch; that's an EDUCATIONAL_STUB per CLAUDE.md §0.1.8.
Citation¶
Tillet, P., Kung, H. T., & Cox, D. (2019). Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. MAPL'19. https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf — the original Triton paper, section 3 on the tile-level IR is the source for the framing of "what Triton gives you" here.
One-paragraph recap¶
CUDA exposes the GPU's thread/block/shared-memory model directly; Triton abstracts over tiles and auto-tunes block sizes. For simple kernels, the line-count is comparable; for matmul/attention, Triton is 5-10× shorter. We write one CUDA kernel by hand in Phase 24 to ground the abstractions Triton provides, then use Triton for the rest. The build-before-abstract discipline of CLAUDE.md §0.1 produces this layering: NumPy → CUDA → Triton → PyTorch → framework helpers. Each layer is meaningful only if the one below it was first written by hand.
Cross-refs: theory/01-cuda-programming-model.md, theory/03-triton.md, theory/04-pytorch-as-substrate.md, CLAUDE.md §0.1 (the build-before-abstract rule), Phase 23 theory/05-cpu-only-roofline-i5-8250u.md (the hardware context).