
Phase 35 — Distributed Training & Inference

Requires: 18 — Training Loop, Mixed Precision Preview, Checkpointing · 34 — Observability, Cost & Capacity
Teaches: distributed-training · ddp · fsdp · tensor-parallel · allreduce · nccl

Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

This is the only phase in the curriculum where Borja actually touches multiple GPUs. That is why every experiment starts by computing its cost in USD before anything is launched, and a budget guard (budget_guard.py) cuts the run off if spend would exceed the $5 ceiling.
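The cost-first workflow can be sketched as a small guard. This is a hypothetical illustration, not the contents of budget_guard.py: the class name `BudgetGuard`, its methods, and the $0.50 per GPU-hour rate are assumptions made for the example. Only the $5 ceiling comes from this page.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a planned run would push total spend past the ceiling."""


@dataclass
class BudgetGuard:
    # Hypothetical guard: approve a run only if its estimated cost still fits.
    ceiling_usd: float        # hard cap for the phase ($5 on this page)
    spent_usd: float = 0.0    # spend recorded so far

    def estimate(self, gpus: int, usd_per_gpu_hour: float, hours: float) -> float:
        # Simple linear estimate: GPUs x hourly rate x wall-clock hours.
        return gpus * usd_per_gpu_hour * hours

    def approve(self, gpus: int, usd_per_gpu_hour: float, hours: float) -> float:
        # Return the estimated cost, or refuse before anything is launched.
        cost = self.estimate(gpus, usd_per_gpu_hour, hours)
        if self.spent_usd + cost > self.ceiling_usd:
            raise BudgetExceeded(
                f"estimated ${cost:.2f} on top of ${self.spent_usd:.2f} "
                f"exceeds the ${self.ceiling_usd:.2f} ceiling"
            )
        return cost

    def record(self, actual_usd: float) -> None:
        # Call after the run with the amount actually billed.
        self.spent_usd += actual_usd


if __name__ == "__main__":
    guard = BudgetGuard(ceiling_usd=5.0)
    # The rate below is a placeholder, not a vendor quote.
    planned = guard.approve(gpus=2, usd_per_gpu_hour=0.50, hours=1.0)
    print(f"approved run, estimated ${planned:.2f}")
    guard.record(planned)
```

Estimating before launch and recording after keeps the two numbers separate, so the spend log can reflect the real bill rather than the plan.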

Goal

Vocabulary + mental models + one finger on the wallet. By the end of the phase, Borja can:

Sketch how a transformer block is sharded under tensor parallel (TP), with QKV column-parallel, output row-parallel, MLP up column-parallel, MLP down row-parallel.
Compute the bubble fraction of a pipeline-parallel (PP) schedule.
Pick the right parallelism strategy (DDP / ZeRO-N / FSDP / TP / PP) given a model size + cluster topology constraint.
Read Megatron-LM and PyTorch FSDP source with confidence.
Know what he spent. Total cloud bill for Phase 35 ≤ $5.
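The MLP half of the sharding pattern in the first goal can be checked numerically with a toy example. The sketch below is pure Python with hypothetical helper names, uses ReLU as a stand-in activation, and simulates the ranks in a loop rather than on real devices. It splits the up projection by columns and the down projection by rows, then sums the per-rank partial outputs, which is the role an all-reduce plays in a real run.

```python
def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    # Elementwise activation (stand-in for whatever the real block uses).
    return [[max(0.0, v) for v in row] for row in m]


def split_cols(m, parts):
    # Column-parallel split: each rank gets a slice of the output features.
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    # Row-parallel split: each rank gets a slice of the input features.
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mlp_reference(x, w_up, w_down):
    # Unsharded MLP: down(act(up(x))).
    return matmul(relu(matmul(x, w_up)), w_down)


def mlp_tensor_parallel(x, w_up, w_down, world_size):
    up_shards = split_cols(w_up, world_size)      # MLP up: column-parallel
    down_shards = split_rows(w_down, world_size)  # MLP down: row-parallel
    partials = []
    for up, down in zip(up_shards, down_shards):  # one iteration per simulated rank
        hidden = relu(matmul(x, up))   # local slice of the hidden layer, no communication
        partials.append(matmul(hidden, down))
    out = partials[0]
    for p in partials[1:]:             # summing partials stands in for the all-reduce
        out = add(out, p)
    return out


x = [[1.0, -2.0], [3.0, 0.5]]
w_up = [[1.0, -1.0, 2.0, 0.0], [0.0, 1.0, -1.0, 3.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 2.0]]

if __name__ == "__main__":
    print(mlp_reference(x, w_up, w_down))           # [[9.0, -4.0], [12.5, -2.5]]
    print(mlp_tensor_parallel(x, w_up, w_down, 2))  # [[9.0, -4.0], [12.5, -2.5]]
```

Because the activation is elementwise, each rank can apply it to its own column slice without talking to the others, so the block needs a single sum at the end. The attention half listed in the goal (QKV column-parallel, output row-parallel) follows the same column-then-row shape.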

Read order

theory/00-motivation.md — why distributed at all. Two-bottleneck framing.
theory/01-data-parallel-and-zero.md — DDP, ZeRO-1/2/3, FSDP. The "shard X, all-reduce" family.
theory/02-parallelism-flavors.md — TP, PP, sequence parallel, expert parallel. The "shard model, communicate at boundaries" family.
theory/03-collectives-and-cost.md — NCCL collectives (all-reduce, reduce-scatter, all-gather), ring vs tree algorithms, bandwidth math.
theory/04-distributed-inference.md — TP for inference, disaggregated prefill/decode, why it's different from training.
lab/00-cloud-budget-and-tooling.md — vendor survey + cost estimates + budget guard setup. Run before any cloud spinup.
lab/01-ddp-on-cpu.md — DDP across 2 CPU processes locally. Free.
lab/02-tp-inference-cloud.md — 2-GPU TP inference. Hard $3 cap.
lab/03-megatron-fsdp-reading.md — annotated reading. Free.
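Two pieces of back-of-envelope math named in this list, the pipeline bubble and ring all-reduce bandwidth cost, fit in a few lines. The sketch below rests on stated simplifications and is not taken from the theory files: the bubble formula assumes a non-interleaved schedule with equal-time stages and microbatches, and the ring cost ignores latency and counts only bytes over the slowest link. The function names are hypothetical.

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    # Fraction of a step each stage sits idle, assuming equal-time stages:
    # (p - 1) idle slots out of (m + p - 1) total slots.
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be >= 1")
    return (stages - 1) / (microbatches + stages - 1)


def ring_allreduce_bytes_per_rank(message_bytes: float, world_size: int) -> float:
    # A ring all-reduce is a reduce-scatter followed by an all-gather; each
    # phase moves (N - 1) / N of the message through every rank.
    if world_size < 1:
        raise ValueError("world_size must be >= 1")
    return 2 * (world_size - 1) / world_size * message_bytes


def ring_allreduce_seconds(message_bytes: float, world_size: int,
                           link_bytes_per_s: float) -> float:
    # Bandwidth-only estimate; per-hop latency is deliberately ignored.
    return ring_allreduce_bytes_per_rank(message_bytes, world_size) / link_bytes_per_s


if __name__ == "__main__":
    # 4 stages, 8 microbatches: 3 / 11 of the step is bubble.
    print(f"bubble: {pipeline_bubble_fraction(4, 8):.3f}")
    # 1 GB of gradients on 2 ranks over a placeholder 10 GB/s link.
    print(f"all-reduce: {ring_allreduce_seconds(1e9, 2, 10e9):.3f} s")
```

Raising the microbatch count shrinks the bubble, while the per-rank all-reduce traffic approaches twice the message size as the ring grows, so it stays nearly flat in the number of ranks.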

solutions/ is empty during pre-write — populated at phase open after Borja's Phase 25 PyTorch internals are in.

Definition of Done

See PHASE_35_PLAN.md §6. Briefly:

DDP-CPU experiment runs locally, all-reduce timing recorded.
TP-cloud experiment runs within ≤ $3 of spend; speedup curve committed.
Megatron and FSDP reading notes each contain ≥ 5 design citations.
Total Phase 35 cloud spend ≤ $5, recorded in experiments/35-cloud-budget.md.
TP-block-sharding mermaid diagram committed.
/quiz 35 ≥ 70%.

What this phase intentionally does NOT cover

Implementing a production-grade DDP/TP/PP from scratch. Reading plus a minimal educational version is the bar. Borja will use the real frameworks for the capstone.
Multi-node clusters. Single-node, 2 GPUs only. Multi-node InfiniBand benchmarking is out-of-budget and out-of-scope.
MoE expert parallelism in depth. Touched in Phase 36 (frontier architectures), not here.
3D parallelism (DP × TP × PP). Concept only; the math gets messy and the budget says no.
Training a frontier model. Not the goal. The goal is understanding what those teams do.
Disaggregated prefill/decode implementation. Concept only — covered in theory file 04.

Phase 35's scope is distributed parallelism vocabulary + one small hands-on touch, on budget. Nothing more.