00 — Why Open the Framework's Hood¶

After 24 phases of building the network from NumPy plus a little CUDA, you arrive at Phase 25 with a clear model: PyTorch is plumbing. But "plumbing" does not mean "trivial". Behind torch.matmul there is a dispatcher with thousands of entries, an autograd engine that captures graphs at runtime, and a compiler (Inductor) that rewrites forward passes into Triton/C++. This phase takes those covers off. The goal is not to become a PyTorch engineer, but to know where to look when something goes wrong.

This is the orientation page for Phase 25. It answers: why a separate phase on PyTorch internals (Phase 24 already introduced PyTorch as a tool)? Why now? What does "internals" buy you that "user-level PyTorch" doesn't?

The "framework user" trap¶

A common engineering trajectory:

Learn PyTorch from tutorials. Build a model. Train. It works.
Hit a perf problem. model = torch.compile(model) makes it 1.5× faster. Don't know why.
Hit a correctness bug at fp16. Spend a week. Eventually find a buggy custom layer. Still don't know why fp16 made it worse.
Hit an OOM at scale. Try find_unused_parameters=True. Sometimes works. Don't know why.

Each "don't know why" is a debt. Phase 25 pays the debt. After this phase, the following statements are not magic:

"PyTorch dispatched aten::linear to the CUDA backend with fp32 inputs and no autocast key in the key set."
"The grad_fn for y is MmBackward0 because y = x @ W.T and PyTorch's autograd records the gemm node."
"torch.compile fused the linear + softmax via Inductor; the generated Triton kernel is 60 lines and lives in /tmp/torchinductor_<user>/...."
"DDP wraps the model with nn.parallel.DistributedDataParallel, which registers autograd hooks on the parameters so that, during .backward(), gradients are all-reduced (in buckets) as soon as they have been computed."

None of these require building PyTorch from source. They require reading the right diagnostic output and knowing what it means.
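The DDP statement above rests on autograd hooks, and the hook mechanism can be observed in a single process. The following is a minimal sketch, not DDP itself: it assumes only that Tensor.register_hook fires when a tensor's gradient has been computed, and it records the event where a real DDP run would hand the gradient to communication. The names fired and make_hook are hypothetical.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(64, 600)
fired = []  # hypothetical log: (parameter name, gradient shape) in arrival order


def make_hook(name):
    def hook(grad):
        # In a multi-process DDP run, this is roughly the point at which a
        # gradient becomes available for all_reduce. Here we only record it.
        fired.append((name, tuple(grad.shape)))
        return None  # returning None leaves the gradient unchanged

    return hook


handles = [p.register_hook(make_hook(n)) for n, p in model.named_parameters()]

x = torch.randn(2, 64)
model(x).sum().backward()

for name, shape in fired:
    print(name, shape)

# Remove the hooks so the module behaves normally afterwards.
for h in handles:
    h.remove()
```

Both parameters report a gradient during the single .backward() call, which is the event a data-parallel wrapper needs in order to overlap communication with the rest of the backward pass.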

Phase 24 introduced; Phase 25 dismantles¶

Phase 24's mental model was the two-line summary:

nn.Module  =  parameters registry + forward-method-as-graph
Tensor     =  Storage (bytes) + view metadata
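Both halves of that summary can be checked from the Python side. The sketch below is illustrative only: it lists the parameters an nn.Linear(64, 600) registers, then shows that a transpose is a view, meaning new shape and stride metadata over the same bytes. The variable names are hypothetical.

```python
import torch

lin = torch.nn.Linear(64, 600)

# nn.Module as a parameter registry: names mapped to Parameter tensors.
registry = {name: tuple(p.shape) for name, p in lin.named_parameters()}
print(registry)

# Tensor as storage plus view metadata: transposing changes only the metadata.
W = lin.weight.detach()
Wt = W.t()

print(tuple(W.shape), W.stride())
print(tuple(Wt.shape), Wt.stride())

# Same starting address, so no bytes were copied to build the transpose.
same_bytes = Wt.data_ptr() == W.data_ptr()
print(same_bytes)
```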

Phase 25 dismantles each piece:

Dispatcher (theory/01): how torch.matmul chooses between cuBLAS, MKL, custom kernels. The table lookup. The key set.
ATen (theory/01): the operator library (aten::matmul, aten::linear, aten::softmax, and so on) that the dispatcher dispatches to. ATen is the C++ backend kernel registry; the PyTorch frontend wraps it.
Autograd engine (theory/02): how requires_grad=True causes a graph to be built; how .backward() traverses it. The grad_fn chain.
Custom-op registration (theory/02 + lab/02): torch.library.custom_op — how to add a new op that the dispatcher routes to, with autograd, with torch.compile integration.
Compile pipeline (theory/03): Dynamo (Python-bytecode-to-FX-graph), AOTAutograd (forward+backward joint capture), Inductor (lowering to Triton or C++ kernels).
Distributed (theory/03): the four canonical patterns (DDP, FSDP, tensor-parallel, pipeline-parallel) and what each one is without implementing it.
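The dispatcher and ATen items above can be made visible from user code. The sketch below assumes torch.utils._python_dispatch.TorchDispatchMode, a utility module whose leading underscore signals it may change between releases, and uses it to log which ATen operators a single functional.linear call reaches. OpLogger is a hypothetical helper name. The exact list printed depends on the PyTorch version and backend, so no expected output is shown.

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode


class OpLogger(TorchDispatchMode):
    """Hypothetical helper: records every ATen op that reaches Python dispatch."""

    def __init__(self):
        super().__init__()
        self.ops = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.ops.append(str(func))  # an OpOverload, printed like "aten.<op>.<overload>"
        return func(*args, **(kwargs or {}))


W = torch.randn(600, 64)
b = torch.randn(600)
x = torch.randn(2, 64)

with OpLogger() as logger:
    y = torch.nn.functional.linear(x, W, b)

print(logger.ops)

# The same operator is reachable directly through the torch.ops namespace.
y_direct = torch.ops.aten.linear.default(x, W, b)
print(torch.allclose(y, y_direct))
```

The logged names are the entries the dispatcher routed to after any composite operator was broken down, which is why the list may not contain aten.linear itself.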

The running example: `nn.Linear(64, 600)`¶

Every internals topic is taught against the LM head of the grammar MiniGPT: nn.Linear(64, 600). Why this op?

Concrete. Real shapes, real weights from Phase 17. No synthetic toy.
Small enough to inspect. 64 × 600 = 38,400 weights. The autograd graph for one forward pass has about 5 nodes, each kernel launch takes microseconds, and the whole thing fits in a debugger session.
Connects to §A13. The 600 output columns are the grammar vocabulary. The forward pass produces a probability distribution over English/Spanish verb forms. The backward pass gives gradients used in training (Phase 18). Everything you derive in Phase 25 you can see affect the next-token prediction.

A typical Phase 25 lab session looks like:

import torch
W = torch.randn(600, 64, requires_grad=True)
b = torch.randn(600, requires_grad=True)
x = torch.randn(2, 64, requires_grad=True)
y = torch.nn.functional.linear(x, W, b)  # y.shape = (2, 600)
print(y.grad_fn)  # → <AddmmBackward0 object at 0x...>

That AddmmBackward0 is the autograd node. Lab 01 will walk through what it does in detail.
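Before the lab, the graph behind that node can already be walked with grad_fn.next_functions. The sketch below repeats the session's tensors and prints the node types reachable from y.grad_fn; walk is a hypothetical helper, and the intermediate node names may vary across PyTorch versions.

```python
import torch

W = torch.randn(600, 64, requires_grad=True)
b = torch.randn(600, requires_grad=True)
x = torch.randn(2, 64, requires_grad=True)
y = torch.nn.functional.linear(x, W, b)


def walk(node, depth=0, seen=None):
    """Depth-first print of the autograd graph, returning the node type names."""
    seen = [] if seen is None else seen
    if node is None:  # inputs that do not require grad have no node
        return seen
    name = type(node).__name__
    seen.append(name)
    print("  " * depth + name)
    for parent, _input_index in node.next_functions:
        walk(parent, depth + 1, seen)
    return seen


names = walk(y.grad_fn)
```

Each AccumulateGrad leaf corresponds to one tensor created with requires_grad=True; it is the node that writes into that tensor's .grad during .backward().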

Why now (Phase 25)?¶

Three reasons:

Phase 24 just imported torch for the first time. It would be disorienting to dive into internals before having a feel for the surface API. Phase 24 builds the feel; Phase 25 dismantles the box.
Phase 26 (quantization) and Phase 27 (Flash-Attention) both require custom-op registration. If Phase 25 didn't cover it, Phase 26 would have to detour. Better to learn it cleanly here.
torch.compile matters for Phase 33 (serving). Understanding the compile pipeline now means the serving phase can focus on systems concerns, not "what does Inductor do".

What this phase is not¶

Not a PyTorch contribution course. You won't open aten/src/ATen/native/cuda/Softmax.cu. The dispatcher is read from the user side (logs, registrations, custom ops), not the contributor side.
Not a complete framework tour. PyTorch has hundreds of subsystems (torch.jit, torch.fx, torch.func, torch.profiler, torch.utils.data, ...). Phase 25 covers four: dispatcher, autograd, compile, distributed. The rest is treated as "you can find it when you need it".
Not a benchmark. No "X% speedup with this trick". Phase 25 is for understanding; perf is a Phase 33 / 24 / 35 topic.

What you should feel after Phase 25¶

Three things should become viscerally true:

PyTorch is a table-lookup machine. Every torch.<op> is dispatcher[(op, key_set)] → backend kernel. The framework's apparent complexity is the size of the table, not its mechanism.
Autograd is two functions. Forward records into a graph; backward traverses it in reverse, calling each node's backward formula. Phase 7/8's scalar/tensor autograd already implemented exactly this; PyTorch's version is the same idea at scale.
torch.compile is a tracer + optimizer + emitter. It traces your Python, builds a graph, optimizes it (fusion, dead-code elimination, layout), and emits Triton or C++ kernels. Same pattern as any compiler.

If those don't feel true at phase close, do lab 01 again. The autograd-by-hand exercise is the load-bearing one.
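As a preview of that exercise, the backward formulas for y = x Wᵀ + b can be written out and compared against PyTorch. With an upstream gradient G = ∂L/∂y, the chain rule gives ∂L/∂x = G W, ∂L/∂W = Gᵀ x, and ∂L/∂b = the sum of G over the batch dimension. The sketch below is not the lab itself: it uses a random G as a stand-in for a real loss gradient and float64 for a tight comparison.

```python
import torch

torch.manual_seed(0)
W = torch.randn(600, 64, dtype=torch.float64, requires_grad=True)
b = torch.randn(600, dtype=torch.float64, requires_grad=True)
x = torch.randn(2, 64, dtype=torch.float64, requires_grad=True)

y = torch.nn.functional.linear(x, W, b)  # shape (2, 600)
G = torch.randn_like(y)                  # stand-in for dL/dy
y.backward(G)                            # PyTorch fills x.grad, W.grad, b.grad

# The same gradients derived by hand from y = x @ W.T + b.
with torch.no_grad():
    grad_x = G @ W        # (2, 600) @ (600, 64) -> (2, 64)
    grad_W = G.t() @ x    # (600, 2) @ (2, 64)   -> (600, 64)
    grad_b = G.sum(dim=0)  # sum over the batch  -> (600,)

print(torch.allclose(grad_x, x.grad))
print(torch.allclose(grad_W, W.grad))
print(torch.allclose(grad_b, b.grad))
```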

What you should be able to do by the end of Phase 25¶

Trace a forward pass through the dispatcher. Name the dispatch key for any given op call.
Walk an autograd graph by hand. For linear(x, W, b), derive each grad_fn's backward formula and verify against PyTorch's computation.
Register a custom op (torch.library.custom_op) with autograd. Pass gradcheck and run it inside a torch.compile'd model.
Dump and read Inductor-generated Triton/C++ from a torch.compile'd forward pass.
State the difference between DDP, FSDP, tensor-parallel, and pipeline-parallel in 2 sentences each.

What this page does NOT cover¶

PyTorch source code reading. You don't need to read aten/ or c10/ — this is a user-side internals phase.
CUDA-side internals (cudnn, cublasLt, etc.). Black-boxed; we trust them.
PyTorch C++ frontend. Out of scope.

Next: theory/01-dispatcher-and-aten.md — the dispatch table, the keys, the routing decision.