Phase 35 — Distributed Training & Inference¶
Requires: 18 — Training Loop, Mixed Precision Preview, Checkpointing · 34 — Observability, Cost & Capacity
Teaches: distributed-training · ddp · fsdp · tensor-parallel · allreduce · nccl
Jump to any chapter from the phase reference index.
Chapter map¶
Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.
The only phase in the curriculum where Borja actually works with multiple GPUs. That is why every experiment starts by computing its cost in USD before anything is launched, and a budget guard (budget_guard.py) cuts the run off if spend goes past the $5 ceiling.
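The budget guard described above could look roughly like the following. This is a minimal sketch, not the actual contents of budget_guard.py: the class name `BudgetGuard`, its methods, and the $2 per GPU-hour rate in the usage lines are all hypothetical. Only the $5 ceiling comes from the text.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a planned run would push total spend past the ceiling."""


@dataclass
class BudgetGuard:
    # Hypothetical helper; the real budget_guard.py may be structured differently.
    ceiling_usd: float = 5.00  # phase-wide ceiling stated in the text
    spent_usd: float = 0.0     # running total of recorded spend

    def estimate(self, usd_per_gpu_hour: float, gpus: int, minutes: float) -> float:
        """Projected cost of one run in USD."""
        return usd_per_gpu_hour * gpus * (minutes / 60.0)

    def approve(self, usd_per_gpu_hour: float, gpus: int, minutes: float) -> float:
        """Return the projected cost, or raise if it would break the ceiling."""
        cost = self.estimate(usd_per_gpu_hour, gpus, minutes)
        if self.spent_usd + cost > self.ceiling_usd:
            raise BudgetExceeded(
                f"projected ${cost:.2f} on top of ${self.spent_usd:.2f} "
                f"exceeds the ${self.ceiling_usd:.2f} ceiling"
            )
        return cost

    def record(self, actual_usd: float) -> None:
        """Add the billed amount of a finished run to the running total."""
        self.spent_usd += actual_usd


if __name__ == "__main__":
    guard = BudgetGuard()
    # Assumed rate for illustration only: $2.00 per GPU-hour, 2 GPUs, 30 minutes.
    planned = guard.approve(usd_per_gpu_hour=2.0, gpus=2, minutes=30)
    guard.record(planned)
    print(f"approved ${planned:.2f}, total so far ${guard.spent_usd:.2f}")
```

Calling `approve` before every launch and `record` after every bill keeps the estimate-first habit mechanical: a run that does not fit under the remaining budget fails before any instance is started.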
Goal¶
Vocabulary + mental models + one finger on the wallet. By end of phase Borja can:
- Sketch how a transformer block is sharded under tensor parallel (TP), with QKV column-parallel, output row-parallel, MLP up column-parallel, MLP down row-parallel.
- Compute the bubble fraction of a pipeline-parallel (PP) schedule.
- Pick the right parallelism strategy (DDP / ZeRO-N / FSDP / TP / PP) given a model size + cluster topology constraint.
- Read Megatron-LM and PyTorch FSDP source with confidence.
- Know what he spent. Total cloud bill for Phase 35 ≤ $5.
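For the pipeline-parallel goal above, the bubble fraction can be sketched as a one-line formula. The sketch assumes a GPipe-style schedule in which every stage spends the same time on every microbatch; under that assumption, with p stages and m microbatches, the idle share of a step is (p − 1) / (m + p − 1). The function name is illustrative.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of one pipeline step.

    Assumes equal compute time per stage per microbatch, so a step takes
    (microbatches + stages - 1) time slots and each stage idles for
    (stages - 1) of them.
    """
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be >= 1")
    return (stages - 1) / (microbatches + stages - 1)


if __name__ == "__main__":
    # More microbatches per step shrink the bubble for a fixed stage count.
    for m in (4, 8, 32):
        print(f"p=4, m={m}: bubble = {bubble_fraction(4, m):.3f}")
```

The formula makes the usual trade-off visible: the bubble shrinks as the microbatch count grows relative to the stage count, which is why pipeline schedules push for many small microbatches.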
Read order¶
- theory/00-motivation.md — why distributed at all. Two-bottleneck framing.
- theory/01-data-parallel-and-zero.md — DDP, ZeRO-1/2/3, FSDP. The "shard X, all-reduce" family.
- theory/02-parallelism-flavors.md — TP, PP, sequence parallel, expert parallel. The "shard model, communicate at boundaries" family.
- theory/03-collectives-and-cost.md — NCCL collectives (all-reduce, reduce-scatter, all-gather), ring vs tree algorithms, bandwidth math.
- theory/04-distributed-inference.md — TP for inference, disaggregated prefill/decode, why it's different from training.
- lab/00-cloud-budget-and-tooling.md — vendor survey + cost estimates + budget guard setup. Run before any cloud spinup.
- lab/01-ddp-on-cpu.md — DDP across 2 CPU processes locally. Free.
- lab/02-tp-inference-cloud.md — 2-GPU TP inference. Hard $3 cap.
- lab/03-megatron-fsdp-reading.md — annotated reading. Free.
solutions/ is empty during pre-write — populated at phase open after Borja's Phase 25 PyTorch internals are in.
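As a preview of the bandwidth math listed for theory/03-collectives-and-cost.md, here is a rough cost model for a ring all-reduce. It is a sketch under stated assumptions, not the chapter's own derivation: N ranks in a ring, a buffer split into N equal chunks, N − 1 reduce-scatter steps followed by N − 1 all-gather steps, and one fixed link bandwidth. The function and parameter names are illustrative.

```python
def ring_allreduce_seconds(
    size_bytes: float,
    ranks: int,
    link_bytes_per_s: float,
    step_latency_s: float = 0.0,
) -> float:
    """Estimated wall time of a ring all-reduce over `ranks` devices.

    Each of the 2 * (ranks - 1) steps moves one chunk of size_bytes / ranks
    over a single link, so every rank sends 2 * (ranks - 1) / ranks of the
    buffer in total.
    """
    if ranks < 1:
        raise ValueError("ranks must be >= 1")
    if ranks == 1:
        return 0.0  # nothing to communicate
    steps = 2 * (ranks - 1)      # reduce-scatter phase + all-gather phase
    chunk = size_bytes / ranks   # bytes sent per step per rank
    return steps * (chunk / link_bytes_per_s + step_latency_s)


if __name__ == "__main__":
    # Illustrative numbers only: a 1 GB gradient buffer over 1 GB/s links.
    for n in (2, 4, 8):
        print(f"{n} ranks: {ring_allreduce_seconds(1e9, n, 1e9):.3f} s")
```

With latency ignored, the estimate approaches twice the buffer size divided by the link bandwidth as the rank count grows, which is the property that makes the ring algorithm bandwidth-efficient.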
Definition of Done¶
See PHASE_35_PLAN.md §6. Briefly:
- DDP-CPU experiment runs locally, all-reduce timing recorded.
- TP-cloud experiment runs on ≤ $3 of spend; speedup curve committed.
- Megatron + FSDP reading notes, with ≥ 5 design citations each.
- Total Phase 35 cloud spend ≤ $5, recorded in experiments/35-cloud-budget.md.
- TP-block-sharding mermaid diagram committed.
- /quiz 35 ≥ 70%.
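The column-parallel then row-parallel pairing that the TP-block-sharding diagram has to show can be sanity-checked numerically. The sketch below does this for the MLP half of a block in plain Python, with lists standing in for tensors and a sum standing in for the NCCL all-reduce. It uses ReLU and small integer weights so the comparison is exact; both choices are assumptions for illustration, and all helper names are made up here.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise ReLU; elementwise ops commute with column sharding."""
    return [[max(0, v) for v in row] for row in m]


def split_cols(m, parts):
    """Column-parallel split: each rank keeps a slice of the output features."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Row-parallel split: each rank keeps a slice of the input features."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: elementwise sum of per-rank matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_reference(x, w_up, w_down):
    """Unsharded MLP: relu(x @ w_up) @ w_down."""
    return matmul(relu(matmul(x, w_up)), w_down)


def mlp_tensor_parallel(x, w_up, w_down, ranks=2):
    """MLP up column-parallel, MLP down row-parallel, one all-reduce at the end."""
    up_shards = split_cols(w_up, ranks)
    down_shards = split_rows(w_down, ranks)
    # Each rank works on its own shards with no communication in between.
    partials = [
        matmul(relu(matmul(x, up)), down)
        for up, down in zip(up_shards, down_shards)
    ]
    return all_reduce_sum(partials)


if __name__ == "__main__":
    x = [[1, -2], [3, 4]]
    w_up = [[1, 0, -1, 2], [0, 1, 2, -1]]
    w_down = [[1, 2], [3, 4], [-1, 0], [2, 1]]
    assert mlp_tensor_parallel(x, w_up, w_down) == mlp_reference(x, w_up, w_down)
    print("sharded and unsharded MLP outputs match")
```

The attention half follows the same shape: QKV column-parallel, output projection row-parallel, one all-reduce after the row-parallel matmul. The pairing works because the nonlinearity sits between the two matmuls and acts elementwise, so no communication is needed until the partial sums are combined.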
What this phase intentionally does NOT cover¶
- Implementing a production-grade DDP/TP/PP from scratch. Reading plus a minimal educational version is the bar. Borja will use the real frameworks for the capstone.
- Multi-node clusters. Single-node, 2 GPUs only. Multi-node InfiniBand benchmarking is out of budget and out of scope.
- MoE expert parallelism in depth. Touched in Phase 36 (frontier architectures), not here.
- 3D parallelism (DP × TP × PP). Concept only; the math gets messy and the budget says no.
- Training a frontier model. Not the goal. The goal is understanding what those teams do.
- Disaggregated prefill/decode implementation. Concept only — covered in theory file 04.
Phase 35's scope is distributed parallelism vocabulary + one small hands-on touch, on budget. Nothing more.
Further reading¶
Optional — enrichment, not required to pass the phase.
- 📄 ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — Rajbhandari et al. · 2019. The sharding scheme behind FSDP and DeepSpeed.
- 📄 Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Shoeybi et al. · 2019. Tensor parallelism, formalized.
- 📄 Ring Attention with Blockwise Transformers for Near-Infinite Context — Liu, Zaharia, Abbeel · 2023. Context parallelism for million-token windows.