Skip to content

English · Español

05 — Triton vs raw CUDA: what each gives you, and why we waited until Phase 24

🇪🇸 La regla §0.1 de CLAUDE.md dice: construye antes de abstraer. PyTorch llega en la Fase 24. Pero dentro de la Fase 24 hay otra elección — ¿escribimos CUDA directo o usamos Triton? Esta página justifica la respuesta, y explica por qué escribir CUDA primero aunque sea brevemente es no negociable.

This file is the depth-pass companion to theory/03-triton.md. We compare CUDA C++ and Triton head-to-head on a single kernel (vector add), show what Triton hides and what it doesn't, and explain why the build-before-abstract discipline of CLAUDE.md §0.1 says CUDA first, Triton next, framework-internal kernels last.


What CUDA gives you

CUDA C++ exposes the GPU's execution model directly:

  • Threads organized into blocks, blocks into a grid.
  • Shared memory explicitly declared and indexed.
  • Warp-level primitives (shuffles, vote, ballot).
  • Memory hierarchy controlled by hand (global, shared, registers, L1 / L2, texture).
  • Synchronization via __syncthreads(), atomics, barriers.

You write __global__ void kernel(...) { int idx = blockIdx.x * blockDim.x + threadIdx.x; ... }. Every concept maps directly to hardware. The compiler (nvcc) does almost no high-level transformation — what you write is what runs.

Cost: verbose. A "good" hand-tuned matmul in CUDA is 200-400 lines for a 2-D tile with double buffering. A "decent" version is 50-80 lines. The toy 5-line version performs at 1-5% of peak.

What Triton gives you

Triton (Tillet et al. 2019) is a Python-embedded DSL that compiles to PTX. Key differences:

  • Block-level abstractions. Instead of thread-level indexing, you operate on tiles (e.g., a \(128 \times 128\) block of a matrix). The compiler decides how to map tiles to threads / warps.
  • Auto-tuning. Block sizes are declarative parameters; you provide hints, the compiler enumerates configurations.
  • Implicit pipelining. Software pipelining (overlapping memory loads with compute) is generated automatically; in CUDA you write it by hand with __pipeline_* intrinsics.
  • Shared-memory management. Allocated, indexed, and bank-conflict-resolved automatically.

Cost: less control. For 95% of kernels, "compiler picks good defaults" is faster than "Borja hand-tunes". For the remaining 5% — usually the most performance-critical — Triton's abstractions hide things you might need to control (warp specialization, async copies, custom indexing).

Concrete comparison: vector add

CUDA C++

__global__ void vec_add(const float* a, const float* b, float* c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        c[idx] = a[idx] + b[idx];
    }
}

// Launch
int threads = 256;
int blocks = (N + threads - 1) / threads;
vec_add<<<blocks, threads>>>(a, b, c, N);

8 lines kernel + 3 lines launch. Every concept (thread, block, bounds check) is explicit.

Triton

import triton
import triton.language as tl

@triton.jit
def vec_add(a_ptr, b_ptr, c_ptr, N, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N
    a = tl.load(a_ptr + offsets, mask=mask)
    b = tl.load(b_ptr + offsets, mask=mask)
    tl.store(c_ptr + offsets, a + b, mask=mask)

# Launch
grid = lambda meta: ((N + meta["BLOCK_SIZE"] - 1) // meta["BLOCK_SIZE"],)
vec_add[grid](a, b, c, N, BLOCK_SIZE=1024)

8 lines kernel + 2 lines launch. Similar surface area, but:

  • tl.arange(0, BLOCK_SIZE) gives you a vector of indices, not a scalar thread index.
  • tl.load(..., mask=mask) handles the bounds check internally.
  • BLOCK_SIZE is a compile-time constant — the compiler specializes the kernel for that value.

For a kernel this simple, Triton is not dramatically more concise. The win is at the matmul / attention level, where 200 lines of CUDA → 30 lines of Triton.

When CUDA wins

  • Warp specialization. Treating warps within a block as having different roles (e.g., producer-consumer in attention). Triton has limited support.
  • Tensor Core layout fiddling. When the data layout (row-major vs col-major vs swizzled) interacts with the Tensor Core MMA instruction, hand control matters.
  • Non-uniform indexing. If your kernel needs threads in a warp to do different things based on data, CUDA gives you the divergence control; Triton's vector abstraction assumes uniformity.

When Triton wins

  • Block matmul / attention. The kernels where 80% of the speed comes from getting the tile size right; Triton auto-tunes this.
  • Quick prototyping. Get a kernel running in minutes, then iterate.
  • Maintainability. A Triton kernel is easier to read 6 months later than a CUDA kernel.

Why we waited until Phase 24

Per CLAUDE.md §0.1 ("build before abstracting"):

  1. Phase 0-8: NumPy. Everything by hand. See how a matmul, a softmax, a layer-norm decompose into loops.
  2. Phase 9-17: Composing modules in raw Python/NumPy. The attention mechanism, the MLP, the embedding lookup — all visible.
  3. Phase 18-22: Training, evaluation, inference. Still NumPy. The optimizer, the eval harness, the KV cache — all visible.
  4. Phase 23: GPU mental model. No GPU code yet. Just why GPUs differ from CPUs.
  5. Phase 24: First GPU kernel. CUDA first (one kernel, in the lab), then Triton for the rest.
  6. Phase 25: PyTorch's autograd / dispatcher. Now we can read PyTorch's source because we already wrote everything by hand.

The order matters. If we'd started with PyTorch in Phase 9, every subsequent debugging exercise would have a "the framework does it" mystery. If we'd started with Triton in Phase 24, the compiler-generated optimizations would be invisible. By writing CUDA's vector-add by hand first, the abstractions Triton provides become visible — you know what they're hiding.

A diff in mental cost:

Stage Concept you must hold New abstraction
Phase 0-8 NumPy matmul array indexing, broadcast none
Phase 23 mental model warp, occupancy, memory hierarchy hardware vocabulary
Phase 24 CUDA (one kernel) thread/block index, shared mem language: __global__, <<<>>>
Phase 24 Triton (rest) tile, mask, block size DSL on top of CUDA
Phase 25 PyTorch internals dispatcher, autograd framework on top of CUDA / Triton

Each step is one new layer. If you skip a layer, the next layer becomes much harder to debug.

What this doesn't mean

It doesn't mean "always use CUDA in production". It means:

  • Learn CUDA first by writing one or two kernels by hand.
  • Then use Triton for the rest of Phase 24's labs.
  • Then use PyTorch's built-ins (torch.nn.functional, torch.compile) from Phase 25 onward.

In production, the layering is the same: Triton kernels for hot paths, PyTorch built-ins for the rest, raw CUDA only for the 1-2 places where Triton's abstractions are limiting. But the learning sequence is bottom-up, even if the deployment sequence is top-down.

The §A13 caveat

The §A13 mini-GPT does not benefit from GPU acceleration at this scale (see Phase 23 quiz q-23-05). The Phase 24 kernels are exercises in writing GPU code, not production tools for our model. The pedagogical artifact is:

  • One CUDA kernel (vector-add or scalar matmul) in lab/00-hello-cuda.md.
  • A Triton port of the same kernel in lab/01-naive-kernel.md.
  • A tiled Triton matmul in lab/02-tuned-kernel.md.
  • PyTorch wiring in lab/03-triton-and-pytorch.md.

That's the entire Phase 24 scope. We don't write FlashAttention from scratch; that's an EDUCATIONAL_STUB per CLAUDE.md §0.1.8.

Citation

Tillet, P., Kung, H. T., & Cox, D. (2019). Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. MAPL'19. https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf — the original Triton paper, section 3 on the tile-level IR is the source for the framing of "what Triton gives you" here.

One-paragraph recap

CUDA exposes the GPU's thread/block/shared-memory model directly; Triton abstracts over tiles and auto-tunes block sizes. For simple kernels, the line-count is comparable; for matmul/attention, Triton is 5-10× shorter. We write one CUDA kernel by hand in Phase 24 to ground the abstractions Triton provides, then use Triton for the rest. The build-before-abstract discipline of CLAUDE.md §0.1 produces this layering: NumPy → CUDA → Triton → PyTorch → framework helpers. Each layer is meaningful only if the one below it was first written by hand.


Cross-refs: theory/01-cuda-programming-model.md, theory/03-triton.md, theory/04-pytorch-as-substrate.md, CLAUDE.md §0.1 (the build-before-abstract rule), Phase 23 theory/05-cpu-only-roofline-i5-8250u.md (the hardware context).