English · Español

03 — Triton: CUDA for the 80% Case¶

🇪🇸 Triton es un DSL de Python — un lenguaje que parece NumPy pero compila a PTX (lenguaje GPU). Su valor: el 80% de los kernels que escribirías en CUDA C los escribes en Triton con ~10× menos código, y el autotuner busca el block-size / num-warps / num-stages óptimo por ti. El 20% restante (Flash-Attention, GEMM ultraoptimizado) sigue siendo terreno de CUDA C / CUTLASS.

This page introduces Triton — what it is, what it automates, what it doesn't, and how to read its autotune. By the end you can write the same softmax-over-grammar-vocab kernel from theory/02 in ~30 lines of Triton and predict roughly where it'll land on the roofline relative to your tuned CUDA C version.

What Triton is¶

Triton is a Python-like DSL embedded in Python, developed at OpenAI, compiled by an MLIR-based stack down to PTX (then to SASS by ptxas). You write a function with @triton.jit, decorate it with autotune configs, and call it like a regular Python function. The compiler infers vectorization, register allocation, and basic memory scheduling; you provide the algorithm and the space of tile sizes.

A minimal Triton kernel:

import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, y_ptr, V, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < V
    x = tl.load(x_ptr + row * V + cols, mask=mask, other=-float('inf'))
    m = tl.max(x, axis=0)
    e = tl.exp(x - m)
    s = tl.sum(e, axis=0)
    y = e / s
    tl.store(y_ptr + row * V + cols, y, mask=mask)

That's a complete fused softmax kernel for one row per "program" (Triton's name for a thread block). For grammar MiniGPT's \(V \approx 600\) vocab, you'd set BLOCK=1024 (next power of 2 ≥ 600), mask out the tail.

Compared to the ~80-line CUDA C tuned version, Triton hides:

Coalescing — tl.load is coalesced automatically.
SMEM allocation — you don't write __shared__; the compiler decides.
Reductions — tl.max and tl.sum are warp- and block-level efficient.
Synchronization — no __syncthreads().

What Triton does not hide:

Block / program size. You pick BLOCK.
Algorithm. Online-softmax vs naive 3-pass is your decision.
Memory layout (row-major vs column-major access patterns).
Register usage. Heavy use of tl.where and arithmetic compiles to registers; spills if you push too hard.

The autotuner¶

Triton's killer feature is the autotuner:

@triton.autotune(
    configs=[
        triton.Config({'BLOCK': 256},  num_warps=4),
        triton.Config({'BLOCK': 512},  num_warps=4),
        triton.Config({'BLOCK': 1024}, num_warps=8),
        triton.Config({'BLOCK': 2048}, num_warps=8),
    ],
    key=['V'],  # re-tune when V changes
)
@triton.jit
def softmax_kernel(...): ...

First call with a given V: the autotuner benchmarks each config, picks the fastest, caches the choice. Subsequent calls with the same V hit the cache.

What the autotuner explores:

BLOCK: tile size (powers of 2 typically).
num_warps: warps per block (1, 2, 4, 8, 16). Controls occupancy budget.
num_stages: software pipeline stages (relevant for matmul, less so for softmax).

What the autotuner does not explore:

The algorithm (online vs 3-pass).
The memory layout of inputs.
Whether to use Tensor Cores (Triton uses them when applicable for tl.dot matmuls; for softmax, irrelevant).

Implication: a bad algorithm autotuned is still a bad algorithm. Triton autotune saves you from guessing tile sizes, not from poor design.

When Triton wins, when CUDA C wins¶

Situation	Choice
Custom elementwise or reduction kernel (softmax, layernorm, RMSNorm, RoPE)	Triton. Cuts dev time 10×.
GEMM-like with tight Tensor Core utilization	cuBLAS / CUTLASS. Triton's `tl.dot` is good but not state-of-the-art for the gnarliest shapes.
New research kernel (Flash-Attention v3, MLA)	CUDA C + CUTLASS. Triton can approximate but the top 10% of perf usually needs raw control.
Quick prototype to test an algorithm	Triton. Hands-down.
Production kernel for the most-hot path	CUDA C if Triton is within 5%; Triton if it's within 15%. Maintenance cost matters.

For Phase 24's softmax-over-vocab: Triton is the right tool. Lab 03 has you write it after lab 02's CUDA C tuned version, specifically to feel the contrast.

How Triton compiles¶

The pipeline (you don't use this, but knowing it helps debugging):

Triton DSL (Python AST)
   ↓ (Triton MLIR dialect)
ttir.mlir (Triton IR)
   ↓ (lowering passes: fuse, vectorize, schedule)
ttgir.mlir (Triton GPU IR)
   ↓ (PTX emission)
ptx
   ↓ (ptxas, NVIDIA's PTX→SASS assembler)
SASS / cubin

triton.compile(...) exposes the intermediate IRs. If a kernel is slow, dumping ttgir tells you what the scheduler decided. Most kernel users never need to look at IR; advanced tuning does.

Cross-checking with `ncu`¶

Profiling a Triton kernel uses the same ncu tools as CUDA C. The reports look identical — same metric names (achieved occupancy, HBM throughput, SMEM bank conflicts). The differences:

Triton kernel names are mangled — Triton appends a hash to disambiguate autotune variants. softmax_kernel_0d1d2c3c is normal.
Source attribution: ncu --source on shows PTX; mapping back to Triton DSL requires ncu's Triton source view (Triton 3.x+).

For Phase 24's deliverable: profile both the CUDA C tuned and the Triton autotuned versions, place both on the same roofline, comment on the gap.

CPU mode: Triton's `interpreter` for local dev¶

Triton 3.x ships an interpreter mode (triton.runtime.driver.set_active(...)) that runs @triton.jit functions on CPU using NumPy semantics. Slow but useful for:

Correctness checks without a GPU (Borja's machine).
Stepping through the algorithm in pdb.
CI without GPU runners.

This is not the production CPU fallback (src/minikernel/dispatch.py uses NumPy directly for that — simpler, no Triton dependency). But for development on Borja's laptop, Triton interpreter mode is the closest thing to "run the GPU code locally".

A Triton softmax with autotune (full)¶

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({'BLOCK': 256},  num_warps=2),
        triton.Config({'BLOCK': 512},  num_warps=4),
        triton.Config({'BLOCK': 1024}, num_warps=8),
    ],
    key=['V'],
)
@triton.jit
def softmax_kernel(x_ptr, y_ptr, V, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < V
    x = tl.load(x_ptr + row * V + cols, mask=mask, other=-float('inf'))
    m = tl.max(x, axis=0)
    e = tl.exp(x - m)
    s = tl.sum(e, axis=0)
    y = e / s
    tl.store(y_ptr + row * V + cols, y, mask=mask)

def softmax(x):  # Python wrapper
    B, V = x.shape
    y = torch.empty_like(x)
    softmax_kernel[(B,)](x, y, V)
    return y

That's everything. Lab 03 builds this incrementally.

What you should be able to do¶

Read any Triton kernel and explain what each line does.
Decide for a given new operator whether to start in Triton or CUDA C.
Read a @triton.autotune block and predict which configs the tuner will likely pick at common sizes.
Use triton.compile(...) to dump intermediate IR for debugging.

What this page does NOT cover¶

Triton's matmul primitives in depth. tl.dot and its Tensor Core path are a topic on their own; we don't write a matmul kernel in Phase 24 (the softmax is enough).
Triton 3.x specific features (warp specialization, async TMA on Hopper). Phase 27 if relevant.
Triton internals (the MLIR passes). Out of scope; you're a user, not a compiler engineer.

Next: theory/04-pytorch-as-substrate.md — PyTorch's first appearance in this curriculum.