English · Español

05 — Triton vs raw CUDA: what each gives you, and why we waited until Phase 24¶

🇪🇸 La regla §0.1 de CLAUDE.md dice: construye antes de abstraer. PyTorch llega en la Fase 24. Pero dentro de la Fase 24 hay otra elección — ¿escribimos CUDA directo o usamos Triton? Esta página justifica la respuesta, y explica por qué escribir CUDA primero aunque sea brevemente es no negociable.

This file is the depth-pass companion to theory/03-triton.md. We compare CUDA C++ and Triton head-to-head on a single kernel (vector add), show what Triton hides and what it doesn't, and explain why the build-before-abstract discipline of CLAUDE.md §0.1 says CUDA first, Triton next, framework-internal kernels last.

What CUDA gives you¶

CUDA C++ exposes the GPU's execution model directly:

Threads organized into blocks, blocks into a grid.
Shared memory explicitly declared and indexed.
Warp-level primitives (shuffles, vote, ballot).
Memory hierarchy controlled by hand (global, shared, registers, L1 / L2, texture).
Synchronization via __syncthreads(), atomics, barriers.

You write __global__ void kernel(...) { int idx = blockIdx.x * blockDim.x + threadIdx.x; ... }. Every concept maps directly to hardware. The compiler (nvcc) does almost no high-level transformation — what you write is what runs.

Cost: verbose. A "good" hand-tuned matmul in CUDA is 200-400 lines for a 2-D tile with double buffering. A "decent" version is 50-80 lines. The toy 5-line version performs at 1-5% of peak.

What Triton gives you¶

Triton (Tillet et al. 2019) is a Python-embedded DSL that compiles to PTX. Key differences:

Block-level abstractions. Instead of thread-level indexing, you operate on tiles (e.g., a \(128 \times 128\) block of a matrix). The compiler decides how to map tiles to threads / warps.
Auto-tuning. Block sizes are declarative parameters; you provide hints, the compiler enumerates configurations.
Implicit pipelining. Software pipelining (overlapping memory loads with compute) is generated automatically; in CUDA you write it by hand with __pipeline_* intrinsics.
Shared-memory management. Allocated, indexed, and bank-conflict-resolved automatically.

Cost: less control. For 95% of kernels, "compiler picks good defaults" is faster than "Borja hand-tunes". For the remaining 5% — usually the most performance-critical — Triton's abstractions hide things you might need to control (warp specialization, async copies, custom indexing).

Concrete comparison: vector add¶

CUDA C++¶

__global__ void vec_add(const float* a, const float* b, float* c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        c[idx] = a[idx] + b[idx];
    }
}

// Launch
int threads = 256;
int blocks = (N + threads - 1) / threads;
vec_add<<<blocks, threads>>>(a, b, c, N);

8 lines kernel + 3 lines launch. Every concept (thread, block, bounds check) is explicit.

Triton¶

import triton
import triton.language as tl

@triton.jit
def vec_add(a_ptr, b_ptr, c_ptr, N, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N
    a = tl.load(a_ptr + offsets, mask=mask)
    b = tl.load(b_ptr + offsets, mask=mask)
    tl.store(c_ptr + offsets, a + b, mask=mask)

# Launch
grid = lambda meta: ((N + meta["BLOCK_SIZE"] - 1) // meta["BLOCK_SIZE"],)
vec_add[grid](a, b, c, N, BLOCK_SIZE=1024)

8 lines kernel + 2 lines launch. Similar surface area, but:

tl.arange(0, BLOCK_SIZE) gives you a vector of indices, not a scalar thread index.
tl.load(..., mask=mask) handles the bounds check internally.
BLOCK_SIZE is a compile-time constant — the compiler specializes the kernel for that value.

For a kernel this simple, Triton is not dramatically more concise. The win is at the matmul / attention level, where 200 lines of CUDA → 30 lines of Triton.

When CUDA wins¶

Warp specialization. Treating warps within a block as having different roles (e.g., producer-consumer in attention). Triton has limited support.
Tensor Core layout fiddling. When the data layout (row-major vs col-major vs swizzled) interacts with the Tensor Core MMA instruction, hand control matters.
Non-uniform indexing. If your kernel needs threads in a warp to do different things based on data, CUDA gives you the divergence control; Triton's vector abstraction assumes uniformity.

When Triton wins¶

Block matmul / attention. The kernels where 80% of the speed comes from getting the tile size right; Triton auto-tunes this.
Quick prototyping. Get a kernel running in minutes, then iterate.
Maintainability. A Triton kernel is easier to read 6 months later than a CUDA kernel.

Why we waited until Phase 24¶

Per CLAUDE.md §0.1 ("build before abstracting"):

Phase 0-8: NumPy. Everything by hand. See how a matmul, a softmax, a layer-norm decompose into loops.
Phase 9-17: Composing modules in raw Python/NumPy. The attention mechanism, the MLP, the embedding lookup — all visible.
Phase 18-22: Training, evaluation, inference. Still NumPy. The optimizer, the eval harness, the KV cache — all visible.
Phase 23: GPU mental model. No GPU code yet. Just why GPUs differ from CPUs.
Phase 24: First GPU kernel. CUDA first (one kernel, in the lab), then Triton for the rest.
Phase 25: PyTorch's autograd / dispatcher. Now we can read PyTorch's source because we already wrote everything by hand.

The order matters. If we'd started with PyTorch in Phase 9, every subsequent debugging exercise would have a "the framework does it" mystery. If we'd started with Triton in Phase 24, the compiler-generated optimizations would be invisible. By writing CUDA's vector-add by hand first, the abstractions Triton provides become visible — you know what they're hiding.

A diff in mental cost:

Stage	Concept you must hold	New abstraction
Phase 0-8 NumPy matmul	array indexing, broadcast	none
Phase 23 mental model	warp, occupancy, memory hierarchy	hardware vocabulary
Phase 24 CUDA (one kernel)	thread/block index, shared mem	language: `__global__`, `<<<>>>`
Phase 24 Triton (rest)	tile, mask, block size	DSL on top of CUDA
Phase 25 PyTorch internals	dispatcher, autograd	framework on top of CUDA / Triton

Each step is one new layer. If you skip a layer, the next layer becomes much harder to debug.

What this doesn't mean¶

It doesn't mean "always use CUDA in production". It means:

Learn CUDA first by writing one or two kernels by hand.
Then use Triton for the rest of Phase 24's labs.
Then use PyTorch's built-ins (torch.nn.functional, torch.compile) from Phase 25 onward.

In production, the layering is the same: Triton kernels for hot paths, PyTorch built-ins for the rest, raw CUDA only for the 1-2 places where Triton's abstractions are limiting. But the learning sequence is bottom-up, even if the deployment sequence is top-down.

The §A13 caveat¶

The §A13 mini-GPT does not benefit from GPU acceleration at this scale (see Phase 23 quiz q-23-05). The Phase 24 kernels are exercises in writing GPU code, not production tools for our model. The pedagogical artifact is:

One CUDA kernel (vector-add or scalar matmul) in lab/00-hello-cuda.md.
A Triton port of the same kernel in lab/01-naive-kernel.md.
A tiled Triton matmul in lab/02-tuned-kernel.md.
PyTorch wiring in lab/03-triton-and-pytorch.md.

That's the entire Phase 24 scope. We don't write FlashAttention from scratch; that's an EDUCATIONAL_STUB per CLAUDE.md §0.1.8.

Citation¶

Tillet, P., Kung, H. T., & Cox, D. (2019). Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. MAPL'19. https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf — the original Triton paper, section 3 on the tile-level IR is the source for the framing of "what Triton gives you" here.

One-paragraph recap¶

CUDA exposes the GPU's thread/block/shared-memory model directly; Triton abstracts over tiles and auto-tunes block sizes. For simple kernels, the line-count is comparable; for matmul/attention, Triton is 5-10× shorter. We write one CUDA kernel by hand in Phase 24 to ground the abstractions Triton provides, then use Triton for the rest. The build-before-abstract discipline of CLAUDE.md §0.1 produces this layering: NumPy → CUDA → Triton → PyTorch → framework helpers. Each layer is meaningful only if the one below it was first written by hand.

Cross-refs: theory/01-cuda-programming-model.md, theory/03-triton.md, theory/04-pytorch-as-substrate.md, CLAUDE.md §0.1 (the build-before-abstract rule), Phase 23 theory/05-cpu-only-roofline-i5-8250u.md (the hardware context).