Skip to content

English · Español

00 — Motivation: why distributed

🇪🇸 Hay exactamente dos razones para distribuir: (1) el modelo no cabe en una GPU (presión de memoria), (2) el cómputo tarda demasiado (presión de tiempo). Cada estrategia (DDP / ZeRO / FSDP / TP / PP) ataca una de las dos. Confundir las razones es la causa #1 de elegir mal la paralelización.

You can run most of this curriculum on a single CPU. You ran Mini-GPT training there. Why does the field even bother with multi-GPU clusters?

Two reasons. Exactly two. Confuse them and you'll pick the wrong parallelism strategy every time.

Reason 1: the model doesn't fit

A model has parameters \(\theta\) (the weights) and activations (intermediate tensors during forward and backward). Both consume GPU memory.

For training, the memory footprint is roughly:

\[M_{\text{train}} \approx \underbrace{|\theta| \cdot b_\theta}_{\text{weights}} + \underbrace{|\theta| \cdot b_g}_{\text{gradients}} + \underbrace{2 \cdot |\theta| \cdot b_{\text{opt}}}_{\text{optimizer state (Adam)}} + \underbrace{A(\text{batch}, \text{seq}, \text{model})}_{\text{activations}}\]

For a 7B-parameter model in fp32 (yes, that's wasteful — humor me), without considering activations:

\[M \approx 7\text{B} \cdot 4 + 7\text{B} \cdot 4 + 2 \cdot 7\text{B} \cdot 4 = 7\text{B} \cdot 16\,\text{bytes} = 112\,\text{GB}\]

An A100 has 40 or 80 GB of HBM. You cannot train a 7B model on one A100 at fp32, full stop. Even with mixed precision (~half the number), activations push you over the edge.

For inference, gradients and optimizer state go away:

\[M_{\text{inf}} \approx |\theta| \cdot b_\theta + \text{KV cache}\]

7B fp16 weights = 14 GB. Fits on a 24 GB consumer GPU with KV cache headroom. Inference of 7B is one-GPU territory; training is not.

When the model itself doesn't fit on one GPU, the only options are:

  • Shrink the model's footprint per GPU. Sharding strategies: ZeRO-1 (shards optimizer state), ZeRO-2 (+ gradients), ZeRO-3 / FSDP (+ parameters). Each GPU still does one batch's worth of work but holds 1/N of the model's "state."
  • Split the model across GPUs. Tensor parallel (TP) splits within each layer; pipeline parallel (PP) puts different layers on different GPUs.

Pick one (or combine — 3D parallelism). The choice depends on which axis the model is "too big" along: too many parameters (TP / ZeRO-3 / FSDP), too many layers (PP), or simply too much optimizer state for one GPU (ZeRO-½).

Reason 2: the compute takes too long

Separate from "doesn't fit": the model fits but training one epoch takes a month and you have 3 days.

The fix is data parallel (DDP): every GPU has a full copy of the model, each gets a different batch shard, gradients are all-reduced across GPUs at the end of each step. Each GPU does less work per step (smaller per-GPU batch); total compute is faster.

DDP scales well to a point, then communication starts to dominate (more on this in theory/03-collectives-and-cost.md). For very large clusters, DDP combines with ZeRO-N to share the memory pressure of holding the model.

When the two reasons conflict

A team has a 70B-parameter model. It doesn't fit (Reason 1) and they want it to train fast (Reason 2). Now what?

This is the 3D parallelism problem. The textbook answer:

  • Use TP within a node (intra-node bandwidth is high — NVLink at 600 GB/s).
  • Use PP across nodes (inter-node bandwidth is lower — InfiniBand at 200 Gb/s ≈ 25 GB/s).
  • Use DDP at the outermost level (shards the data across "TP × PP" groups).

That gives you "DP=N, TP=8, PP=4" type configurations. The math gets thorny. Phase 35 covers the pieces; the combination is conceptual only.

What does NOT belong in this phase

A failure mode of teaching distributed training is starting with the cluster. Don't. Start with why a single GPU isn't enough for your particular workload, and let that drive the strategy.

For Borja's curriculum:

  • MiniGPT (≤500k params, fp32) fits on a calculator. It never needs distributed for memory reasons.
  • Training it distributed is only a teaching exercise — what does the wire protocol look like, what does an all-reduce cost.
  • Inference of MiniGPT under TP is also pedagogical — what does the comm pattern look like.

Borja will not in this curriculum train a frontier model. The point of Phase 35 is vocabulary and mental models so that when he reads a real distributed training log six months from now, the names map to mechanisms.

The cost dimension

Phase 35 is the only phase that spends real cloud money in the curriculum (Phase 23 and Phase 24 might — depends on Borja's setup — but Phase 35 will).

Every cloud experiment in this phase starts with a sentence like:

"This will spin up 2× RTX 4090 spot on RunPod at ~$0.35/hr each = $0.70/hr total, run for ~3 hours including setup tax = $2.10. Budget: $3.00. Confirmed acceptable, $0.90 buffer."

If Borja can't write that sentence about a lab, he doesn't run the lab. Phase 35 introduces src/distributed/budget_guard.py precisely to enforce this — a wrapper that refuses to launch any job whose estimated cost would exceed the configured ceiling.

The total Phase 35 cloud bill ceiling is $5. That's a learner budget, not a real cluster budget; the goal is education, not benchmarks.

How this connects to the wider curriculum

  • Phase 33 built single-node serving. Phase 35 introduces the concept of multi-node serving (TP for inference), implementation deferred.
  • Phase 34's observability stack extends here: each worker emits its own metrics; the dashboard aggregates. GPU USE metrics from DCGM enter now.
  • Phase 36 covers MoE; expert parallelism naturally piggybacks on this phase's parallelism vocabulary.
  • Phase 37 raises the threat surface — model weights cross network boundaries; KB-derived data hits multiple workers; the safetensors integrity story scales.
  • Phase 38 does cost-optimal cluster sizing; built on Phase 35's cost math.

So even though Phase 35 is a "concept-heavy, light implementation" phase, what lands here flows into the back half of the curriculum.

One-paragraph recap

Distribute because (1) the model doesn't fit on one GPU or (2) the compute takes too long. Strategy choice follows from which: TP/PP/ZeRO-3/FSDP for memory pressure, DDP for time pressure, 3D combinations for both. MiniGPT fits on a calculator, so Phase 35 is pedagogical — vocabulary, source-reading, one minimal cloud touch. Cost discipline is mandatory: every cloud experiment estimates spend up front, total Phase 35 ceiling is $5, budget_guard.py enforces it.


Next: theory/01-data-parallel-and-zero.md.