
Phase 24: CUDA & Triton Hands-On

Requires: 23, GPU Architecture Fundamentals
Teaches: cuda · triton · kernels · shared-memory · tiling
Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab statements are stable drafts. The kernel choice (default: fused softmax) is the load-bearing open question at phase open.

This is the kernels phase. You write one kernel (fused softmax or GEMM) in CUDA C, rewrite it in Triton, compare it against cuBLAS / torch.nn.functional, and understand why each version runs at the speed it does. You also import PyTorch for the first time, deliberately after Phase 23.

Goal

Take one kernel from naive to "≥30% of cuBLAS at a representative size", in three implementations (CUDA C, Triton, framework reference). At the same time, introduce PyTorch, the framework that all subsequent phases assume. The introduction is deliberate: by Phase 24 Borja already understands the substrate (Phases 1 and 23), the operators (everything else), and the math (Phases 15 and 22 in particular). PyTorch is presented as the tool that fits the model you already have, not as a magic black box.

Read order

theory/00-motivation.md — why a separate phase for kernels; why now (post-Phase-23).
theory/01-cuda-programming-model.md — __global__, blockIdx/threadIdx, kernel launch, memory spaces, synchronization.
theory/02-from-naive-to-tiled.md — the canonical example (matmul or softmax — same lessons) from naive → coalesced → tiled to SMEM → using Tensor Cores.
theory/03-triton.md — what Triton is, why it exists, how autotuning closes the gap. Compare to hand-written CUDA C.
theory/04-pytorch-as-substrate.md — first PyTorch theory page in the curriculum. Tensor vs Storage, device dispatch, autograd as a graph builder (preview of Phase 25). What PyTorch is and what it isn't.
lab/00-hello-cuda.md — vector-add as warm-up: write, compile, launch. Confirm the toolchain works.
lab/01-naive-kernel.md — first version of the chosen kernel (fused softmax by default). Correct, slow.
lab/02-tuned-kernel.md — tile, coalesce, SMEM, occupancy. Iterate until ≥30% of cuBLAS / F.softmax.
lab/03-triton-and-pytorch.md — same kernel in Triton; PyTorch port of MiniGPT.
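
The default kernel in lab/01 is a fused softmax. As a point of orientation, here is a minimal pure-Python sketch of a row-wise, numerically stable softmax that could serve as a correctness oracle on tiny inputs. The names `softmax_row` and `softmax_2d` are hypothetical and not part of the lab files; the labs themselves compare against F.softmax.

```python
import math


def softmax_row(row):
    """Numerically stable softmax of one row (a list of floats).

    Hypothetical helper for illustration; not taken from the lab files.
    """
    # Step 1: find the row maximum so the largest exponent is exp(0) = 1.
    row_max = max(row)
    # Step 2: exponentiate the shifted values and accumulate their sum.
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    # Step 3: normalize so the row sums to 1.
    return [e / total for e in exps]


def softmax_2d(matrix):
    """Apply softmax_row independently to every row of a list of lists."""
    return [softmax_row(row) for row in matrix]


if __name__ == "__main__":
    # Without the max shift, exp(1000.0) would overflow a float.
    print(softmax_row([1000.0] * 4))  # each entry is 0.25
    print(softmax_2d([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]))
```

Subtracting the row maximum leaves the result mathematically unchanged while keeping `exp` from overflowing. The three steps (max, sum of exponentials, normalize) are the per-row work that a GPU softmax kernel also has to perform.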

solutions/ is empty during pre-write — populated at phase open after kernel choice is confirmed.

Definition of Done

See PHASE_24_PLAN.md §6. Briefly:

A CUDA C kernel and a Triton kernel for the same op, both numerically correct, with the CUDA C version at ≥30% of the reference (cuBLAS for GEMM, F.softmax for the default fused softmax).
An ncu profile of the tuned kernel, annotated to identify the bottleneck.
A PyTorch port of Phase-17 MiniGPT, byte-equivalent to the NumPy version at fp64.
learners/borja/profile.md updated to reflect PyTorch internalization.
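
The first item combines two checks: numerical correctness and speed relative to a reference. A minimal sketch of both follows, under stated assumptions: "≥30% of the reference" is read here as the ratio of reference time to kernel time on the same workload, the tolerance `atol=1e-6` is an illustrative value, and all function names and timings are hypothetical. PHASE_24_PLAN.md §6 remains the authoritative definition.

```python
def max_abs_error(candidate, reference):
    """Largest elementwise absolute difference between two flat sequences."""
    if len(candidate) != len(reference):
        raise ValueError("outputs must have the same length")
    return max(abs(c - r) for c, r in zip(candidate, reference))


def is_numerically_correct(candidate, reference, atol=1e-6):
    """True if every element is within atol of the reference.

    atol is an assumed tolerance, not a value from the phase plan.
    """
    return max_abs_error(candidate, reference) <= atol


def fraction_of_reference(kernel_ms, reference_ms):
    """Kernel speed as a fraction of the reference on the same workload.

    1.0 means equally fast; 0.30 means the kernel takes 1/0.30 times as long.
    """
    if kernel_ms <= 0 or reference_ms <= 0:
        raise ValueError("timings must be positive")
    return reference_ms / kernel_ms


def meets_target(kernel_ms, reference_ms, target=0.30):
    """True if the kernel reaches at least `target` of the reference speed."""
    return fraction_of_reference(kernel_ms, reference_ms) >= target


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, for illustration only.
    kernel_ms, reference_ms = 0.50, 0.20
    frac = fraction_of_reference(kernel_ms, reference_ms)
    print(f"fraction of reference: {frac:.0%}")
    print("meets 30% target:", meets_target(kernel_ms, reference_ms))
```

Measuring both kernels at the same problem size and dtype keeps the ratio meaningful; the timings themselves would come from a profiler or event timers rather than from this helper.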

What this phase intentionally does NOT cover

Flash-Attention or PagedAttention kernels. Phase 27. Phase 24's kernel is one small op, deeply understood, not a from-scratch Flash-Attention.
PyTorch internals (__torch_dispatch__, custom autograd, allocator behavior). Phase 25.
Multi-GPU kernels (NCCL, all-reduce). Phase 35.
torch.compile / Inductor. Phase 25 (briefly) and Phase 33 (in depth).
TensorRT / nvinfer. Out of scope for the entire curriculum: TensorRT is a black-box optimizer, which conflicts with the spirit of the "no langchain/llama-index" anti-goal.
CUDA Graphs. Phase 33 (serving).
The transformers library. Per CLAUDE.md §0.4: never.
Old CUDA (pre-Ampere). Some idioms (sm_60/sm_70) are no longer optimal. We target Ampere/Ada/Hopper.

Phase 24's scope is: one kernel, three implementations, one ncu profile, one PyTorch port. Nothing more.

Next phase preview: docs/phase-25-pytorch-internals/ — autograd engine internals, __torch_dispatch__, custom ops.