Phase 35 — Distributed Training & Inference

Requires: 18 — Training Loop, Mixed Precision Preview, Checkpointing · 34 — Observability, Cost & Capacity

Teaches: distributed-training · ddp · fsdp · tensor-parallel · allreduce · nccl

Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

This is the only phase in the curriculum where Borja actually touches multiple GPUs. That is why every experiment starts by computing its cost in USD before anything is launched, and a budget guard (budget_guard.py) cuts the run off if spend would exceed the $5 ceiling.
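The "estimate first, then guard" workflow can be sketched in a few lines. This is only an illustration of the idea, not the contents of budget_guard.py: the function names, the `BudgetExceeded` exception, and the per-GPU-hour rate in the usage note are all hypothetical, and only the $5 ceiling comes from this page.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a planned run would push total spend past the ceiling."""


def estimate_cost_usd(hourly_rate_usd, num_gpus, hours):
    """Estimated cost of a run, assuming a flat per-GPU-hour price."""
    return hourly_rate_usd * num_gpus * hours


def check_budget(spent_usd, planned_usd, ceiling_usd=5.00):
    """Return the budget left after the planned run, or refuse to proceed.

    The 5.00 default mirrors the phase ceiling; everything else here is
    a hypothetical sketch, not the real guard.
    """
    projected = spent_usd + planned_usd
    if projected > ceiling_usd:
        raise BudgetExceeded(
            f"projected ${projected:.2f} exceeds ceiling ${ceiling_usd:.2f}"
        )
    return ceiling_usd - projected
```

For example, with an assumed rate of $1.20 per GPU-hour, `check_budget(0.0, estimate_cost_usd(1.20, 2, 0.5))` passes, while the same run on top of $4.50 already spent raises `BudgetExceeded`. Raising an exception rather than returning a flag means a launch script stops by default unless it deliberately handles the error.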


Goal

Vocabulary + mental models + one finger on the wallet. By the end of the phase Borja can:

  1. Sketch how a transformer block is sharded under tensor parallel (TP), with QKV column-parallel, output row-parallel, MLP up column-parallel, MLP down row-parallel.
  2. Compute the bubble fraction of a pipeline-parallel (PP) schedule.
  3. Pick the right parallelism strategy (DDP / ZeRO-N / FSDP / TP / PP) given a model size + cluster topology constraint.
  4. Read Megatron-LM and PyTorch FSDP source with confidence.
  5. Know what he spent. Total cloud bill for Phase 35 ≤ $5.
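The column-parallel / row-parallel pairing in item 1 can be checked numerically on the MLP half of the block. The sketch below is a single-process toy, not Megatron-LM code: the helper names are made up for illustration, ReLU stands in for whatever activation the real model uses, and the "all-reduce" is just an elementwise sum over the per-rank partial outputs. The point it illustrates is that splitting the up projection by columns and the down projection by rows lets each rank work independently until one final sum.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise activation (assumption: ReLU as a simple stand-in)."""
    return [[max(0, v) for v in row] for row in m]


def split_columns(w, parts):
    """Column-parallel split: each rank gets a block of output columns."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def split_rows(w, parts):
    """Row-parallel split: each rank gets a block of input rows."""
    height = len(w) // parts
    return [w[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: elementwise sum of every rank's partial."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_reference(x, w_up, w_down):
    """Unsharded MLP: down(act(up(x)))."""
    return matmul(relu(matmul(x, w_up)), w_down)


def mlp_tensor_parallel(x, w_up, w_down, tp=2):
    """Same MLP with w_up split by columns and w_down split by rows."""
    partials = []
    for up, down in zip(split_columns(w_up, tp), split_rows(w_down, tp)):
        hidden = relu(matmul(x, up))           # local to the rank, no communication
        partials.append(matmul(hidden, down))  # partial sum of the final output
    return all_reduce_sum(partials)            # the one communication step


x = [[1, -2], [3, 4]]
w_up = [[1, -1, 2, 0], [0, 3, -1, 1]]        # 2 x 4
w_down = [[1, 0], [2, -1], [0, 1], [-1, 2]]  # 4 x 2
assert mlp_tensor_parallel(x, w_up, w_down) == mlp_reference(x, w_up, w_down)
```

The split works because the activation is elementwise, so applying it to a column block of the hidden layer gives the same values as applying it to the full hidden layer and then slicing. Integer inputs are used so the equality check is exact.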

Read order

  1. theory/00-motivation.md — why distributed at all. Two-bottleneck framing.
  2. theory/01-data-parallel-and-zero.md — DDP, ZeRO-1/2/3, FSDP. The "shard X, all-reduce" family.
  3. theory/02-parallelism-flavors.md — TP, PP, sequence parallel, expert parallel. The "shard model, communicate at boundaries" family.
  4. theory/03-collectives-and-cost.md — NCCL collectives (all-reduce, reduce-scatter, all-gather), ring vs tree algorithms, bandwidth math.
  5. theory/04-distributed-inference.md — TP for inference, disaggregated prefill/decode, why it's different from training.
  6. lab/00-cloud-budget-and-tooling.md — vendor survey + cost estimates + budget guard setup. Run before any cloud spinup.
  7. lab/01-ddp-on-cpu.md — DDP across 2 CPU processes locally. Free.
  8. lab/02-tp-inference-cloud.md — 2-GPU TP inference. Hard $3 cap.
  9. lab/03-megatron-fsdp-reading.md — annotated reading. Free.
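Two back-of-envelope formulas recur across the theory files on PP and on collectives, and they are small enough to keep as helpers. Both are textbook idealizations rather than anything taken from the theory files themselves: the bubble formula assumes a non-interleaved schedule with equal-time stages, the ring formula ignores latency and assumes every link has the same bandwidth, and the function names are illustrative.

```python
from fractions import Fraction


def bubble_fraction(stages, microbatches):
    """Idle fraction of a pipeline step: (p - 1) / (m + p - 1).

    Assumes a non-interleaved schedule (GPipe or 1F1B style) in which
    every stage takes the same time per microbatch.
    """
    return Fraction(stages - 1, microbatches + stages - 1)


def ring_allreduce_bytes_per_rank(ranks, payload_bytes):
    """Bytes each rank sends in a ring all-reduce: 2 * (N - 1) / N * S.

    The factor of 2 covers the reduce-scatter pass plus the all-gather pass.
    """
    return 2 * (ranks - 1) / ranks * payload_bytes


def ring_allreduce_seconds(ranks, payload_bytes, link_bytes_per_second):
    """Bandwidth-only time estimate; per-hop latency is ignored."""
    return ring_allreduce_bytes_per_rank(ranks, payload_bytes) / link_bytes_per_second
```

As a usage example, `bubble_fraction(4, 8)` returns `Fraction(3, 11)`, so roughly 27% of the step is idle, and raising the microbatch count shrinks that share. For the ring, the per-rank traffic approaches twice the payload as the number of ranks grows, which is why the cost is nearly independent of cluster size in the bandwidth-bound regime.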

solutions/ is empty during pre-write — populated at phase open after Borja's Phase 25 PyTorch internals are in.

Definition of Done

See PHASE_35_PLAN.md §6. Briefly:

  • DDP-CPU experiment runs locally, all-reduce timing recorded.
  • TP-cloud experiment runs ≤ $3 spend; speedup curve committed.
  • Megatron and FSDP reading notes each contain ≥ 5 design citations.
  • Total Phase 35 cloud spend ≤ $5, recorded in experiments/35-cloud-budget.md.
  • TP-block-sharding mermaid diagram committed.
  • /quiz 35 ≥ 70%.

What this phase intentionally does NOT cover

  • Implementing a production-grade DDP/TP/PP from scratch. Reading plus a minimal educational version is the bar. Borja will use the real frameworks for the capstone.
  • Multi-node clusters. Single-node, 2 GPUs only. Multi-node InfiniBand benchmarking is out-of-budget and out-of-scope.
  • MoE expert parallelism in depth. Touched in Phase 36 (frontier architectures), not here.
  • 3D parallelism (DP × TP × PP). Concept only; the math gets messy and the budget says no.
  • Training a frontier model. Not the goal. The goal is understanding what those teams do.
  • Disaggregated prefill/decode implementation. Concept only — covered in theory file 04.

Phase 35's scope is distributed parallelism vocabulary + one small hands-on touch, on budget. Nothing more.

Further reading

Optional — enrichment, not required to pass the phase.