English · Español
05 — Ring all-reduce derivation; when each parallelism strategy applies¶
🇪🇸 El "2S" del ring all-reduce no es magia: sale de un argumento de información — cada worker tiene que enviar y recibir aproximadamente todo el tensor para que la suma esté en todas partes. Esta página deriva el coste, lo conecta con el deep-dive de hardware (X4) y da la tabla de decisión para elegir DP / TP / PP / ZeRO / FSDP según el tamaño del modelo y el batch.
Cross-ref to X4 (do not duplicate)¶
The hardware-deep-dive extension module already covers the NVLink / InfiniBand bandwidth side. Re-read:
docs/extension-track/X4-hardware-deep-dive/theory/03-interconnects-and-topology.md— link bandwidth numbers, topology classes.docs/extension-track/X4-hardware-deep-dive/theory/04-datacenter-economics.md— $ per byte-second.
This chapter is the algorithm-side companion: where the 2S factor comes from and why it's the same on any topology that admits a ring traversal.
The information-theoretic lower bound¶
Consider \(N\) workers each holding a tensor \(x_i \in \mathbb{R}^d\) of size \(S\) bytes. The all-reduce target is for every worker to end up holding \(y = \sum_i x_i\).
Claim. Any all-reduce algorithm, on any communication topology, requires at least \(\frac{N-1}{N} S \approx S\) bytes sent per worker (for large \(N\)).
Sketch. Worker \(i\) ends up with information about every \(x_{j \neq i}\). The only way that information arrives is via bytes received. By a counting argument, at minimum \(\sum_{j \neq i} \text{bytes}_{ij} \ge S\) — i.e. worker \(i\) must receive ~\(S\) bytes. By symmetry of the protocol, it must also send ~\(S\) bytes (the network is the same set of bytes from the other side).
So \(S\) sent + \(S\) received = \(2S\) per worker is the floor. Ring all-reduce achieves this floor (asymptotically). That is why it is the canonical algorithm.
The ring construction — explicit derivation¶
Workers arranged in a logical ring: \(0 \to 1 \to \ldots \to N-1 \to 0\). Split each \(x_i\) into \(N\) chunks of size \(S/N\), indexed \(x_i[0], x_i[1], \ldots, x_i[N-1]\).
Phase 1: reduce-scatter (N-1 steps)¶
At step \(k\) (for \(k = 0, \ldots, N-2\)), worker \(i\) sends chunk \(x_i[(i - k) \bmod N]\) to its right neighbor, receives chunk \(x_{i-1}[(i - 1 - k) \bmod N]\) from its left neighbor, and adds it into its local copy of that chunk.
After \(N-1\) steps, every chunk has accumulated contributions from every worker. Specifically, worker \(i\) now holds the fully-reduced chunk \((i + 1) \bmod N\).
Bytes sent per worker in Phase 1: \((N-1) \cdot S/N \approx S\) for large \(N\).
Phase 2: all-gather (N-1 more steps)¶
Now distribute the fully-reduced chunks. At step \(k\), worker \(i\) sends its current fully-reduced chunk to its right neighbor, receives the next fully-reduced chunk from its left. After \(N-1\) steps, every worker has all \(N\) fully-reduced chunks → all-reduce complete.
Bytes sent per worker in Phase 2: \((N-1) \cdot S/N \approx S\).
Total per worker: \(\approx 2S\). Matches the lower bound.
What about latency?¶
Each phase has \(N-1\) sequential steps with a network round-trip each: latency \(= 2 (N-1) \cdot \tau\) where \(\tau\) is the per-step latency. For small \(S\) this dominates. The tree all-reduce trades the optimal bandwidth for \(O(\log N)\) latency; NCCL picks between them based on payload size (see X4 §03).
Why this matters for the §A13 grammar tutor¶
At our scale, \(|\theta| = 500\)k params, \(S = 1\) MB. On gloo / loopback the per-step latency \(\tau \approx 50\,\mu s\), network bandwidth \(\approx 1\) GB/s effective. For \(N = 4\) CPU workers:
- Bytes per worker per all-reduce: \(\approx 2\) MB.
- Bandwidth time: \(2 / 1000 = 2\) ms.
- Latency time: \(2 \cdot 3 \cdot 50\,\mu s = 300\,\mu s\).
Total all-reduce: ~2.3 ms. Per step. For a forward+backward that takes ~50 ms, the comm overhead is ~5%. Acceptable.
For the hypothetical 7B model on the cluster (the X1 / X4 worked example): the same formula gives ~30 ms NVLink-only, 1+ s on slow Ethernet, which is the original "DDP doesn't scale past N=8 on cheap networks" anecdote.
The form is constant; only the numbers change. That is why deriving it once, on the microscopic model, transfers.
When does each parallelism strategy apply?¶
A decision table for the working engineer:
| Strategy | Best when | Per-step comm | Memory profile |
|---|---|---|---|
| Data parallel (DDP) | Model fits on one worker; you have batch to spare | \(2S_\theta\) per step (all-reduce) | Each worker holds full params + grads + states |
| ZeRO-1 / FSDP-light | Optimizer states dominate memory (Adam = 8×params); model still fits | \(2S_\theta\) but staged | Optimizer sharded across N workers |
| ZeRO-2 / FSDP-mid | Optimizer + grads dominate | \(2S_\theta\) + grad-scatter | Grads + opt sharded |
| ZeRO-3 / FSDP-full | Model itself does not fit | More all-gathers per step | Params, grads, opt all sharded |
| Tensor parallel | A single matmul does not fit on one worker (large d_model); intra-node only |
All-reduce inside the layer | Params sliced across the matmul axis |
| Pipeline parallel | Model has many layers and they do not fit; inter-node tolerable | Activation send/recv per stage | One layer block per worker |
| 3D parallel | Frontier-scale (≥ 100B params on ≥ 256 GPUs); requires careful comm scheduling | All of the above | All of the above |
The standard recipe:
- Start with DDP (or its single-host moral equivalent).
- When optimizer states OOM you, add ZeRO-1.
- When grads also OOM, ZeRO-2.
- When params don't fit, ZeRO-3 / FSDP.
- When a single matmul is too big for one device, add TP within a node.
- When the model has too many layers to fit even with the above, add PP across nodes.
The §A13 grammar tutor never reaches step 2. The 7B-on-8-H100s example (X4 §04) sits at step 3-4. Frontier 100B+ runs (X1) sit at step 6.
The auxiliary cost: code complexity¶
A subtle factor not in the bandwidth math: each step adds engineering complexity. FSDP correctness requires gradient bucket alignment, ZeRO-3 requires param-prefetch ordering, TP requires careful collective placement in the layer forward. Each step roughly doubles the surface area of "things that can silently produce wrong gradients."
This is why the Phase 35 lab is vocabulary, not implementation: the cost-of-bugs scales with the strategy choice, and choosing wrong is much more expensive than the bandwidth difference would suggest.
What this chapter does NOT cover¶
- NCCL kernel internals: GPU-direct DMA, ring-vs-tree adaptive selection, SHARP. Production / X4 territory.
- Hybrid sharding policies (FSDP HYBRID_SHARD): node-local FSDP + inter-node DDP. Real, used in frontier training (X1).
- Gradient compression (1-bit Adam, signSGD): trades compression error for comm reduction. Research-only at our scale.
- Pipeline bubble scheduling (1F1B, interleaved 1F1B): the activation-send/recv ordering that makes PP usable. Worth a separate read; X1 mentions it.
Reference¶
- Patarasuk & Yuan, "Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations" (J. Parallel Distrib. Comput., 2009). The original derivation of the ring's bandwidth optimality.
- Rajbhandari et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" (SC'20). The three ZeRO stages with the exact memory-vs-comm trade-off.
- Shoeybi et al., "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2019). The TP recipe.