English · Español
03 — Triton: CUDA for the 80% Case¶
🇪🇸 Triton es un DSL de Python — un lenguaje que parece NumPy pero compila a PTX (lenguaje GPU). Su valor: el 80% de los kernels que escribirías en CUDA C los escribes en Triton con ~10× menos código, y el autotuner busca el block-size / num-warps / num-stages óptimo por ti. El 20% restante (Flash-Attention, GEMM ultraoptimizado) sigue siendo terreno de CUDA C / CUTLASS.
This page introduces Triton — what it is, what it automates, what it doesn't, and how to read its autotune. By the end you can write the same softmax-over-grammar-vocab kernel from theory/02 in ~30 lines of Triton and predict roughly where it'll land on the roofline relative to your tuned CUDA C version.
What Triton is¶
Triton is a Python-like DSL embedded in Python, developed at OpenAI, compiled by an MLIR-based stack down to PTX (then to SASS by ptxas). You write a function with @triton.jit, decorate it with autotune configs, and call it like a regular Python function. The compiler infers vectorization, register allocation, and basic memory scheduling; you provide the algorithm and the space of tile sizes.
A minimal Triton kernel:
import triton
import triton.language as tl
@triton.jit
def softmax_kernel(x_ptr, y_ptr, V, BLOCK: tl.constexpr):
row = tl.program_id(0)
cols = tl.arange(0, BLOCK)
mask = cols < V
x = tl.load(x_ptr + row * V + cols, mask=mask, other=-float('inf'))
m = tl.max(x, axis=0)
e = tl.exp(x - m)
s = tl.sum(e, axis=0)
y = e / s
tl.store(y_ptr + row * V + cols, y, mask=mask)
That's a complete fused softmax kernel for one row per "program" (Triton's name for a thread block). For grammar MiniGPT's \(V \approx 600\) vocab, you'd set BLOCK=1024 (next power of 2 ≥ 600), mask out the tail.
Compared to the ~80-line CUDA C tuned version, Triton hides:
- Coalescing —
tl.loadis coalesced automatically. - SMEM allocation — you don't write
__shared__; the compiler decides. - Reductions —
tl.maxandtl.sumare warp- and block-level efficient. - Synchronization — no
__syncthreads().
What Triton does not hide:
- Block / program size. You pick
BLOCK. - Algorithm. Online-softmax vs naive 3-pass is your decision.
- Memory layout (row-major vs column-major access patterns).
- Register usage. Heavy use of
tl.whereand arithmetic compiles to registers; spills if you push too hard.
The autotuner¶
Triton's killer feature is the autotuner:
@triton.autotune(
configs=[
triton.Config({'BLOCK': 256}, num_warps=4),
triton.Config({'BLOCK': 512}, num_warps=4),
triton.Config({'BLOCK': 1024}, num_warps=8),
triton.Config({'BLOCK': 2048}, num_warps=8),
],
key=['V'], # re-tune when V changes
)
@triton.jit
def softmax_kernel(...): ...
First call with a given V: the autotuner benchmarks each config, picks the fastest, caches the choice. Subsequent calls with the same V hit the cache.
What the autotuner explores:
BLOCK: tile size (powers of 2 typically).num_warps: warps per block (1, 2, 4, 8, 16). Controls occupancy budget.num_stages: software pipeline stages (relevant for matmul, less so for softmax).
What the autotuner does not explore:
- The algorithm (online vs 3-pass).
- The memory layout of inputs.
- Whether to use Tensor Cores (Triton uses them when applicable for
tl.dotmatmuls; for softmax, irrelevant).
Implication: a bad algorithm autotuned is still a bad algorithm. Triton autotune saves you from guessing tile sizes, not from poor design.
When Triton wins, when CUDA C wins¶
| Situation | Choice |
|---|---|
| Custom elementwise or reduction kernel (softmax, layernorm, RMSNorm, RoPE) | Triton. Cuts dev time 10×. |
| GEMM-like with tight Tensor Core utilization | cuBLAS / CUTLASS. Triton's tl.dot is good but not state-of-the-art for the gnarliest shapes. |
| New research kernel (Flash-Attention v3, MLA) | CUDA C + CUTLASS. Triton can approximate but the top 10% of perf usually needs raw control. |
| Quick prototype to test an algorithm | Triton. Hands-down. |
| Production kernel for the most-hot path | CUDA C if Triton is within 5%; Triton if it's within 15%. Maintenance cost matters. |
For Phase 24's softmax-over-vocab: Triton is the right tool. Lab 03 has you write it after lab 02's CUDA C tuned version, specifically to feel the contrast.
How Triton compiles¶
The pipeline (you don't use this, but knowing it helps debugging):
Triton DSL (Python AST)
↓ (Triton MLIR dialect)
ttir.mlir (Triton IR)
↓ (lowering passes: fuse, vectorize, schedule)
ttgir.mlir (Triton GPU IR)
↓ (PTX emission)
ptx
↓ (ptxas, NVIDIA's PTX→SASS assembler)
SASS / cubin
triton.compile(...) exposes the intermediate IRs. If a kernel is slow, dumping ttgir tells you what the scheduler decided. Most kernel users never need to look at IR; advanced tuning does.
Cross-checking with ncu¶
Profiling a Triton kernel uses the same ncu tools as CUDA C. The reports look identical — same metric names (achieved occupancy, HBM throughput, SMEM bank conflicts). The differences:
- Triton kernel names are mangled — Triton appends a hash to disambiguate autotune variants.
softmax_kernel_0d1d2c3cis normal. - Source attribution:
ncu --source onshows PTX; mapping back to Triton DSL requiresncu's Triton source view (Triton 3.x+).
For Phase 24's deliverable: profile both the CUDA C tuned and the Triton autotuned versions, place both on the same roofline, comment on the gap.
CPU mode: Triton's interpreter for local dev¶
Triton 3.x ships an interpreter mode (triton.runtime.driver.set_active(...)) that runs @triton.jit functions on CPU using NumPy semantics. Slow but useful for:
- Correctness checks without a GPU (Borja's machine).
- Stepping through the algorithm in pdb.
- CI without GPU runners.
This is not the production CPU fallback (src/minikernel/dispatch.py uses NumPy directly for that — simpler, no Triton dependency). But for development on Borja's laptop, Triton interpreter mode is the closest thing to "run the GPU code locally".
A Triton softmax with autotune (full)¶
import triton
import triton.language as tl
@triton.autotune(
configs=[
triton.Config({'BLOCK': 256}, num_warps=2),
triton.Config({'BLOCK': 512}, num_warps=4),
triton.Config({'BLOCK': 1024}, num_warps=8),
],
key=['V'],
)
@triton.jit
def softmax_kernel(x_ptr, y_ptr, V, BLOCK: tl.constexpr):
row = tl.program_id(0)
cols = tl.arange(0, BLOCK)
mask = cols < V
x = tl.load(x_ptr + row * V + cols, mask=mask, other=-float('inf'))
m = tl.max(x, axis=0)
e = tl.exp(x - m)
s = tl.sum(e, axis=0)
y = e / s
tl.store(y_ptr + row * V + cols, y, mask=mask)
def softmax(x): # Python wrapper
B, V = x.shape
y = torch.empty_like(x)
softmax_kernel[(B,)](x, y, V)
return y
That's everything. Lab 03 builds this incrementally.
What you should be able to do¶
- Read any Triton kernel and explain what each line does.
- Decide for a given new operator whether to start in Triton or CUDA C.
- Read a
@triton.autotuneblock and predict which configs the tuner will likely pick at common sizes. - Use
triton.compile(...)to dump intermediate IR for debugging.
What this page does NOT cover¶
- Triton's matmul primitives in depth.
tl.dotand its Tensor Core path are a topic on their own; we don't write a matmul kernel in Phase 24 (the softmax is enough). - Triton 3.x specific features (warp specialization, async TMA on Hopper). Phase 27 if relevant.
- Triton internals (the MLIR passes). Out of scope; you're a user, not a compiler engineer.
Next: theory/04-pytorch-as-substrate.md — PyTorch's first appearance in this curriculum.