Phase 25: PyTorch Internals
Requires: 08: Tensor Autograd from Scratch · 24: CUDA & Triton Hands-On
Teaches: dispatcher · autograd-graph · torch-compile · custom-ops
Jump to any chapter from the phase reference index.
Chapter map
Pre-written per A12. Theory and lab statements are stable drafts. The torch version pin is the load-bearing open question at phase open.
This is the phase where the black box gets opened. Phase 24 presented PyTorch as a tool; here we take the gears apart: the dispatcher (a routing table), the autograd engine (graph capture + reverse traversal), torch.compile / Inductor (capture + lowering + Triton/C++), and a survey of torch.distributed (DDP, FSDP, tensor/pipeline parallel), all applied to the nn.Linear(64, 600) LM head of the grammar MiniGPT.
Goal
Take the PyTorch port of Phase 17/24 — concretely the nn.Linear(64, 600) LM head over the §A13 grammar vocabulary — and dismantle the framework's behavior around it. By phase end Borja can trace a forward pass through the dispatcher, walk the autograd graph by hand, register a custom op with backward, read torch.compile Inductor output, and distinguish the major distributed patterns without implementing them yet.
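Walking the autograd graph by hand can start from the grad_fn attribute of a loss. The following is a minimal sketch, assuming a CPU build of PyTorch; the batch size of 4, the .sum() loss, and the helper name walk are illustrative choices, not part of the phase material. Node names differ across PyTorch versions and backends, so the script prints them instead of assuming them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
head = nn.Linear(64, 600)      # LM head shape taken from the text
x = torch.randn(4, 64)         # batch of 4 is an arbitrary choice
loss = head(x).sum()           # any scalar works; sum() keeps the graph small


def walk(fn, depth=0, out=None):
    """Depth-first walk over the backward graph via next_functions."""
    if out is None:
        out = []
    if fn is None:             # inputs that do not require grad show up as None
        return out
    out.append((depth, type(fn).__name__))
    for next_fn, _input_index in fn.next_functions:
        walk(next_fn, depth + 1, out)
    return out


nodes = walk(loss.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)
```

The leaves named AccumulateGrad are where gradients land in head.weight.grad and head.bias.grad; every other node is one backward formula that the lab asks you to reproduce by hand.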
This is a theory-heavy phase with hands-on experiments, not a new module. Phase 26 (src/miniquant/) is the next module phase.
Read order
- theory/00-motivation.md: why a separate phase for framework internals; the "no magic boxes" principle that closes Phase 24's introduction.
- theory/01-dispatcher-and-aten.md: the dispatcher as a table lookup; ATen as the backend kernel library; how torch.matmul becomes a cuBLAS call.
- theory/02-autograd-engine.md: graph capture in forward, reverse traversal in backward, grad_fn chain explicitly walked for linear(x, W, b).
- theory/03-compile-and-distributed.md: torch.compile capture pipeline (Dynamo → AOTAutograd → Inductor); distributed survey (DDP/FSDP/TP/PP), concepts only.
- lab/00-dispatcher-trace.md: instrument a linear(x, W, b) call and log every dispatch decision.
- lab/01-autograd-by-hand.md: derive gradients for nn.Linear(64, 600) by hand and match PyTorch's autograd to within 1e-7.
- lab/02-custom-op.md: register the Phase-24 Triton softmax as a torch.library.custom_op with backward; verify under gradcheck and torch.compile.
- lab/03-compile-and-distributed.md: run torch.compile on the grammar MiniGPT; dump Inductor output; write the 1-page distributed survey.
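One possible starting point for the dispatcher-trace lab is a TorchDispatchMode subclass that records each ATen operator as it is called. This is a sketch under assumptions, not the lab's prescribed method: the class name LogOps is hypothetical, torch.utils._python_dispatch is a private module whose location may change between releases, and the mode only sees operators that reach the Python dispatch key, which sits below autograd, so it is a partial view of the dispatch decisions.

```python
import torch
import torch.nn.functional as F
from torch.utils._python_dispatch import TorchDispatchMode  # private module


class LogOps(TorchDispatchMode):
    """Record the name of every ATen op that reaches the Python dispatch key."""

    def __init__(self):
        super().__init__()
        self.ops = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.ops.append(str(func))               # e.g. "aten.<op>.<overload>"
        return func(*args, **(kwargs or {}))     # forward to the real kernel


x = torch.randn(4, 64)
W = torch.randn(600, 64)
b = torch.randn(600)

with LogOps() as log:
    y = F.linear(x, W, b)

for op in log.ops:
    print(op)
```

Composite operators are decomposed before they reach this level, so the printed list may contain lower-level ops than the single linear call you wrote. That gap between what was called and what was dispatched is the thing the lab asks you to explain.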
Definition of Done
See PHASE_25_PLAN.md §6. Briefly:
- Hand-derived gradients for nn.Linear(64, 600) match PyTorch's .backward() to 1e-7.
- Custom-op-registered Triton softmax passes gradcheck and works under torch.compile.
- One Inductor-generated kernel from the grammar MiniGPT forward pass is identified and explained.
- 1-page distributed-survey README distinguishes DDP/FSDP/TP/PP.
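The first criterion can be checked with a few lines. For y = x Wᵀ + b with upstream gradient g = ∂L/∂y, the hand-derived formulas are ∂L/∂W = gᵀ x, ∂L/∂b = Σ over the batch of g, and ∂L/∂x = g W. The sketch below assumes float64 (so that a 1e-7 tolerance is comfortably reachable), a batch size of 8, and a random upstream gradient; none of these are fixed by the phase plan.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
head = nn.Linear(64, 600).double()       # float64 is an assumption, see lead-in
x = torch.randn(8, 64, dtype=torch.float64, requires_grad=True)

y = head(x)                              # y = x @ W.T + b, shape (8, 600)
g = torch.randn_like(y)                  # stand-in for the upstream gradient dL/dy
y.backward(g)                            # autograd's answer lands in .grad

with torch.no_grad():
    grad_W = g.T @ x                     # (600, 8) @ (8, 64)   -> (600, 64)
    grad_b = g.sum(dim=0)                # sum over the batch   -> (600,)
    grad_x = g @ head.weight             # (8, 600) @ (600, 64) -> (8, 64)

max_err = max(
    (grad_W - head.weight.grad).abs().max().item(),
    (grad_b - head.bias.grad).abs().max().item(),
    (grad_x - x.grad).abs().max().item(),
)
print(f"max abs error vs autograd: {max_err:.3e}")
```

If the lab is run in float32 instead, rounding differences between the two matmul orderings may or may not stay under 1e-7, which is one reason the dtype is worth stating next to the tolerance.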
What this phase intentionally does NOT cover
- Distributed training implementation. Phase 35 builds DDP / FSDP for real.
- TorchScript. Legacy graph capture, superseded by torch.compile. Mentioned in theory 03 only as historical context.
- JIT scripting of custom ops. TorchScript-era; not the modern path.
- Custom ATen kernel registration (C++ / TORCH_LIBRARY). Mentioned; the lab uses the Python-side torch.library.custom_op API instead.
- CUDA Graphs. Phase 33 (serving).
- TensorRT, ONNX, OpenVINO export. Out of curriculum scope.
- functorch / torch.func deep dive. Mentioned as the "functional transform" API; Phase 38 may revisit.
- Quantization. Phase 26.
- transformers library. Per CLAUDE.md §0.4: not before Phase 24, and not used here; we use raw PyTorch only.
Phase 25's scope is: one operator dismantled four ways (dispatcher trace, hand-derived autograd, custom-op registration, compile capture). The distributed section is a survey, not a build.
Next phase preview: docs/phase-26-quantization/ — src/miniquant/, int8 / int4 weight quantization, post-training and quantization-aware training. The grammar MiniGPT is quantized to int8 and compared at the §A13 prompt-prediction accuracy level.
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 PyTorch: An Imperative Style, High-Performance Deep Learning Library. Paszke et al., 2019. The design decisions you now trace through.
- ✍️ PyTorch internals. Edward Z. Yang, 2019. The dispatcher and autograd graph, demystified.