Phase 25 — PyTorch Internals

Requires: 08 — Tensor Autograd from Scratch · 24 — CUDA & Triton Hands-On
Teaches: dispatcher · autograd-graph · torch-compile · custom-ops
Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab statements are stable drafts. The torch version pin is the load-bearing open question at phase open.

This is the phase where the black box gets opened. Phase 24 presented PyTorch as a tool; here we take the gears apart: the dispatcher (a routing table), the autograd engine (graph capture plus reverse traversal), torch.compile / Inductor (capture + lowering + Triton/C++), and a survey of torch.distributed (DDP, FSDP, tensor/pipeline parallelism), all applied to the nn.Linear(64, 600) LM head of the grammar MiniGPT.

Goal

Take the PyTorch port of Phase 17/24 — concretely the nn.Linear(64, 600) LM head over the §A13 grammar vocabulary — and dismantle the framework's behavior around it. By phase end Borja can trace a forward pass through the dispatcher, walk the autograd graph by hand, register a custom op with backward, read torch.compile Inductor output, and distinguish the major distributed patterns without implementing them yet.

This is a theory-heavy phase with hands-on experiments, not a new module. Phase 26 (src/miniquant/) is the next module phase.
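As a rough preview of what "walking the autograd graph" means in practice, the sketch below follows `grad_fn.next_functions` from the output of a linear call back to its leaves. The batch size of 8, the random tensors, and the helper name `walk` are illustrative assumptions, not part of the phase material. The node names printed depend on the installed torch version and backend, so none are listed here.

```python
import torch
import torch.nn.functional as F


def walk(fn, depth=0, out=None):
    """Depth-first walk over the backward graph, collecting (depth, node name)."""
    if out is None:
        out = []
    if fn is None:
        # Inputs that do not require grad appear as None in next_functions.
        return out
    out.append((depth, type(fn).__name__))
    for next_fn, _input_index in fn.next_functions:
        walk(next_fn, depth + 1, out)
    return out


torch.manual_seed(0)
x = torch.randn(8, 64)                        # batch of 8 is an arbitrary choice
W = torch.randn(600, 64, requires_grad=True)  # same shape as nn.Linear(64, 600).weight
b = torch.randn(600, requires_grad=True)

y = F.linear(x, W, b)
nodes = walk(y.grad_fn)

# Print the graph as an indented tree, root (the output's grad_fn) first.
for depth, name in nodes:
    print("  " * depth + name)
```

Each leaf that requires grad (here `W` and `b`) ends in an `AccumulateGrad` node, which is where `.grad` gets written during the reverse traversal.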

Read order

theory/00-motivation.md — why a separate phase for framework internals; the "no magic boxes" principle that closes Phase 24's introduction.
theory/01-dispatcher-and-aten.md — the dispatcher as a table lookup; ATen as the backend kernel library; how torch.matmul becomes a cuBLAS call.
theory/02-autograd-engine.md — graph capture in forward, reverse traversal in backward, grad_fn chain explicitly walked for linear(x, W, b).
theory/03-compile-and-distributed.md — torch.compile capture pipeline (Dynamo → AOTAutograd → Inductor); distributed survey (DDP/FSDP/TP/PP) — concepts only.
lab/00-dispatcher-trace.md — instrument a linear(x, W, b) call and log every dispatch decision.
lab/01-autograd-by-hand.md — derive gradients for nn.Linear(64, 600) by hand and match PyTorch's autograd to within 1e-7.
lab/02-custom-op.md — register the Phase-24 Triton softmax as torch.library.custom_op with backward; verify under gradcheck and torch.compile.
lab/03-compile-and-distributed.md — run torch.compile on grammar MiniGPT; dump Inductor output; write the 1-page distributed survey.
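One possible starting point for the dispatcher-trace lab is a `TorchDispatchMode` subclass that records each ATen op as it passes through the Python dispatch key. This is a minimal sketch, not the lab's prescribed instrumentation: the class name `LogOps` and the tensor shapes are assumptions, and `torch.utils._python_dispatch` is an underscore-prefixed module whose behavior may shift between torch versions. Which ops appear (and whether `linear` shows up decomposed) depends on the version and backend.

```python
import torch
import torch.nn.functional as F
from torch.utils._python_dispatch import TorchDispatchMode


class LogOps(TorchDispatchMode):
    """Record every op overload that reaches the Python dispatch key."""

    def __init__(self):
        super().__init__()
        self.ops = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.ops.append(str(func))               # e.g. an "aten.<op>.<overload>" string
        return func(*args, **(kwargs or {}))     # forward to the real kernel


# Create the inputs outside the mode so their construction is not logged.
x = torch.randn(8, 64)
W = torch.randn(600, 64)
b = torch.randn(600)

with LogOps() as log:
    y = F.linear(x, W, b)

print(log.ops)
```

This only shows the ops that survive down to the Python key; the per-key routing decisions (Autograd, backend selection, and so on) that the theory chapter describes happen before that point and need other tooling to observe.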

Definition of Done

See PHASE_25_PLAN.md §6. Briefly:

Hand-derived gradients for nn.Linear(64, 600) match PyTorch's .backward() to 1e-7.
Custom-op-registered Triton softmax passes gradcheck and works under torch.compile.
One Inductor-generated kernel from the grammar MiniGPT forward pass is identified and explained.
1-page distributed-survey README distinguishes DDP/FSDP/TP/PP.
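The first criterion can be sketched as follows. For $y = xW^\top + b$ with upstream gradient $G = \partial L / \partial y$, the hand-derived gradients are $\partial L/\partial W = G^\top x$, $\partial L/\partial b = \sum_i G_i$ (summed over the batch), and $\partial L/\partial x = G W$. The sketch assumes float64 throughout, because 1e-7 sits at roughly float32 machine epsilon and a float32 comparison may not reliably meet it; the batch size and the random upstream gradient are also assumptions.

```python
import torch

torch.manual_seed(0)

# float64 is an assumption made so that a 1e-7 tolerance is comfortably reachable.
lin = torch.nn.Linear(64, 600, dtype=torch.float64)
x = torch.randn(8, 64, dtype=torch.float64, requires_grad=True)

y = lin(x)                      # y = x @ W.T + b, shape (8, 600)
G = torch.randn_like(y)         # arbitrary upstream gradient dL/dy
y.backward(G)                   # autograd fills .grad on x, weight, bias

with torch.no_grad():
    dW = G.T @ x                # (600, 8) @ (8, 64)  -> (600, 64)
    db = G.sum(dim=0)           # sum over the batch  -> (600,)
    dx = G @ lin.weight         # (8, 600) @ (600, 64) -> (8, 64)

max_err = max(
    (dW - lin.weight.grad).abs().max().item(),
    (db - lin.bias.grad).abs().max().item(),
    (dx - x.grad).abs().max().item(),
)
print(f"max abs error vs autograd: {max_err:.3e}")
```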

What this phase intentionally does NOT cover

Distributed training implementation. Phase 35 builds DDP / FSDP for real.
TorchScript. Legacy graph capture, superseded by torch.compile. Mentioned in theory 03 only as historical context.
JIT scripting of custom ops. TorchScript-era; not the modern path.
Custom ATen kernel registration (C++ / TORCH_LIBRARY). Mentioned; the lab uses the Python-side torch.library.custom_op API instead.
CUDA Graphs. Phase 33 (serving).
TensorRT, ONNX, OpenVINO export. Out of curriculum scope.
functorch / torch.func deep dive. Mentioned as the "functional transform" API; Phase 38 may revisit.
Quantization. Phase 26.
transformers library. Per CLAUDE.md §0.4: not before Phase 24, and not used here — we use raw PyTorch only.

Phase 25's scope is: one operator dismantled four ways (dispatcher trace, hand-derived autograd, custom-op registration, compile capture). The distributed section is a survey, not a build.

Next phase preview: docs/phase-26-quantization/ — src/miniquant/, int8 / int4 weight quantization, post-training and quantization-aware training. The grammar MiniGPT is quantized to int8 and compared at the §A13 prompt-prediction accuracy level.