Skip to content

English · Español

03 — Triton: CUDA for the 80% Case

🇪🇸 Triton es un DSL de Python — un lenguaje que parece NumPy pero compila a PTX (lenguaje GPU). Su valor: el 80% de los kernels que escribirías en CUDA C los escribes en Triton con ~10× menos código, y el autotuner busca el block-size / num-warps / num-stages óptimo por ti. El 20% restante (Flash-Attention, GEMM ultraoptimizado) sigue siendo terreno de CUDA C / CUTLASS.

This page introduces Triton — what it is, what it automates, what it doesn't, and how to read its autotune. By the end you can write the same softmax-over-grammar-vocab kernel from theory/02 in ~30 lines of Triton and predict roughly where it'll land on the roofline relative to your tuned CUDA C version.


What Triton is

Triton is a Python-like DSL embedded in Python, developed at OpenAI, compiled by an MLIR-based stack down to PTX (then to SASS by ptxas). You write a function with @triton.jit, decorate it with autotune configs, and call it like a regular Python function. The compiler infers vectorization, register allocation, and basic memory scheduling; you provide the algorithm and the space of tile sizes.

A minimal Triton kernel:

import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, y_ptr, V, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < V
    x = tl.load(x_ptr + row * V + cols, mask=mask, other=-float('inf'))
    m = tl.max(x, axis=0)
    e = tl.exp(x - m)
    s = tl.sum(e, axis=0)
    y = e / s
    tl.store(y_ptr + row * V + cols, y, mask=mask)

That's a complete fused softmax kernel for one row per "program" (Triton's name for a thread block). For grammar MiniGPT's \(V \approx 600\) vocab, you'd set BLOCK=1024 (next power of 2 ≥ 600), mask out the tail.

Compared to the ~80-line CUDA C tuned version, Triton hides:

  • Coalescing — tl.load is coalesced automatically.
  • SMEM allocation — you don't write __shared__; the compiler decides.
  • Reductions — tl.max and tl.sum are warp- and block-level efficient.
  • Synchronization — no __syncthreads().

What Triton does not hide:

  • Block / program size. You pick BLOCK.
  • Algorithm. Online-softmax vs naive 3-pass is your decision.
  • Memory layout (row-major vs column-major access patterns).
  • Register usage. Heavy use of tl.where and arithmetic compiles to registers; spills if you push too hard.

The autotuner

Triton's killer feature is the autotuner:

@triton.autotune(
    configs=[
        triton.Config({'BLOCK': 256},  num_warps=4),
        triton.Config({'BLOCK': 512},  num_warps=4),
        triton.Config({'BLOCK': 1024}, num_warps=8),
        triton.Config({'BLOCK': 2048}, num_warps=8),
    ],
    key=['V'],  # re-tune when V changes
)
@triton.jit
def softmax_kernel(...): ...

First call with a given V: the autotuner benchmarks each config, picks the fastest, caches the choice. Subsequent calls with the same V hit the cache.

What the autotuner explores:

  • BLOCK: tile size (powers of 2 typically).
  • num_warps: warps per block (1, 2, 4, 8, 16). Controls occupancy budget.
  • num_stages: software pipeline stages (relevant for matmul, less so for softmax).

What the autotuner does not explore:

  • The algorithm (online vs 3-pass).
  • The memory layout of inputs.
  • Whether to use Tensor Cores (Triton uses them when applicable for tl.dot matmuls; for softmax, irrelevant).

Implication: a bad algorithm autotuned is still a bad algorithm. Triton autotune saves you from guessing tile sizes, not from poor design.

When Triton wins, when CUDA C wins

Situation Choice
Custom elementwise or reduction kernel (softmax, layernorm, RMSNorm, RoPE) Triton. Cuts dev time 10×.
GEMM-like with tight Tensor Core utilization cuBLAS / CUTLASS. Triton's tl.dot is good but not state-of-the-art for the gnarliest shapes.
New research kernel (Flash-Attention v3, MLA) CUDA C + CUTLASS. Triton can approximate but the top 10% of perf usually needs raw control.
Quick prototype to test an algorithm Triton. Hands-down.
Production kernel for the most-hot path CUDA C if Triton is within 5%; Triton if it's within 15%. Maintenance cost matters.

For Phase 24's softmax-over-vocab: Triton is the right tool. Lab 03 has you write it after lab 02's CUDA C tuned version, specifically to feel the contrast.

How Triton compiles

The pipeline (you don't use this, but knowing it helps debugging):

Triton DSL (Python AST)
   ↓ (Triton MLIR dialect)
ttir.mlir (Triton IR)
   ↓ (lowering passes: fuse, vectorize, schedule)
ttgir.mlir (Triton GPU IR)
   ↓ (PTX emission)
ptx
   ↓ (ptxas, NVIDIA's PTX→SASS assembler)
SASS / cubin

triton.compile(...) exposes the intermediate IRs. If a kernel is slow, dumping ttgir tells you what the scheduler decided. Most kernel users never need to look at IR; advanced tuning does.

Cross-checking with ncu

Profiling a Triton kernel uses the same ncu tools as CUDA C. The reports look identical — same metric names (achieved occupancy, HBM throughput, SMEM bank conflicts). The differences:

  • Triton kernel names are mangled — Triton appends a hash to disambiguate autotune variants. softmax_kernel_0d1d2c3c is normal.
  • Source attribution: ncu --source on shows PTX; mapping back to Triton DSL requires ncu's Triton source view (Triton 3.x+).

For Phase 24's deliverable: profile both the CUDA C tuned and the Triton autotuned versions, place both on the same roofline, comment on the gap.

CPU mode: Triton's interpreter for local dev

Triton 3.x ships an interpreter mode (triton.runtime.driver.set_active(...)) that runs @triton.jit functions on CPU using NumPy semantics. Slow but useful for:

  • Correctness checks without a GPU (Borja's machine).
  • Stepping through the algorithm in pdb.
  • CI without GPU runners.

This is not the production CPU fallback (src/minikernel/dispatch.py uses NumPy directly for that — simpler, no Triton dependency). But for development on Borja's laptop, Triton interpreter mode is the closest thing to "run the GPU code locally".

A Triton softmax with autotune (full)

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({'BLOCK': 256},  num_warps=2),
        triton.Config({'BLOCK': 512},  num_warps=4),
        triton.Config({'BLOCK': 1024}, num_warps=8),
    ],
    key=['V'],
)
@triton.jit
def softmax_kernel(x_ptr, y_ptr, V, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < V
    x = tl.load(x_ptr + row * V + cols, mask=mask, other=-float('inf'))
    m = tl.max(x, axis=0)
    e = tl.exp(x - m)
    s = tl.sum(e, axis=0)
    y = e / s
    tl.store(y_ptr + row * V + cols, y, mask=mask)

def softmax(x):  # Python wrapper
    B, V = x.shape
    y = torch.empty_like(x)
    softmax_kernel[(B,)](x, y, V)
    return y

That's everything. Lab 03 builds this incrementally.

What you should be able to do

  1. Read any Triton kernel and explain what each line does.
  2. Decide for a given new operator whether to start in Triton or CUDA C.
  3. Read a @triton.autotune block and predict which configs the tuner will likely pick at common sizes.
  4. Use triton.compile(...) to dump intermediate IR for debugging.

What this page does NOT cover

  • Triton's matmul primitives in depth. tl.dot and its Tensor Core path are a topic on their own; we don't write a matmul kernel in Phase 24 (the softmax is enough).
  • Triton 3.x specific features (warp specialization, async TMA on Hopper). Phase 27 if relevant.
  • Triton internals (the MLIR passes). Out of scope; you're a user, not a compiler engineer.

Next: theory/04-pytorch-as-substrate.md — PyTorch's first appearance in this curriculum.