English · Español
03 — Collective ops, NCCL, and cost math¶
🇪🇸 Las "primitivas" de la comunicación distribuida son cinco: broadcast, reduce, all-reduce, reduce-scatter, all-gather. Cada estrategia (DDP, ZeRO-N, TP, PP) se compone de estas. Saber su coste — bytes movidos por worker, latencia mínima, dependencia de N — es como saber sumar antes de multiplicar.
Every distributed strategy in theory/01-data-parallel-and-zero.md and theory/02-parallelism-flavors.md is implemented using a small set of collective operations. Knowing them — and their costs — is what lets you predict whether a distribution plan will work before you run it.
This page is the mechanics of the five collectives, the algorithms NCCL/gloo use to implement them, and the bandwidth math you need to plug into roofline-style estimates.
The five collectives¶
For \(N\) workers, each holding a tensor \(x_i\) of size \(S\) (bytes):
| Op | Input | Output |
|---|---|---|
| broadcast(src) | Worker src holds \(x\); others hold nothing |
All workers hold \(x\) |
| reduce(dst, op) | Each worker has \(x_i\) | Worker dst holds \(\bigoplus_i x_i\) (e.g., sum) |
| all-reduce(op) | Each worker has \(x_i\) | All workers hold \(\bigoplus_i x_i\) |
| reduce-scatter(op) | Each worker has \(x_i\) of size \(S\) | Each worker holds \(1/N\) of \(\bigoplus_i x_i\) |
| all-gather | Each worker has \(x_i\) of size \(S/N\) | All workers hold the full concatenation of size \(S\) |
Two non-trivial identities:
- \(\textbf{all-reduce} = \textbf{reduce-scatter} + \textbf{all-gather}\).
- \(\textbf{broadcast} = \textbf{reduce}\) with a single non-identity input.
These identities matter because real implementations exploit them: a ring all-reduce is a ring reduce-scatter followed by a ring all-gather.
Ring all-reduce — the canonical algorithm¶
Workers form a logical ring: \(0 \to 1 \to 2 \to \ldots \to N-1 \to 0\). The ring all-reduce of a tensor of size \(S\) proceeds in two phases of \(N-1\) steps each:
Phase 1: reduce-scatter. The tensor is split into \(N\) chunks of size \(S/N\). Each worker \(i\) starts owning chunk \(i\). In step \(k\) (for \(k = 0..N-2\)), every worker sends one chunk to its right neighbor and receives one chunk from its left, adding the received chunk to its local copy. After \(N-1\) steps, worker \(i\) holds the fully-reduced chunk \((i+1) \mod N\).
Phase 2: all-gather. Same ring, \(N-1\) more steps. Each worker passes its fully-reduced chunk to the right; after \(N-1\) steps every worker holds all \(N\) fully-reduced chunks.
Cost analysis¶
Per worker, per phase:
- \(N - 1\) steps, sending and receiving \(S/N\) bytes per step.
- Total bytes sent per worker per phase: \((N-1) \cdot S / N \to S\) for large \(N\).
Across both phases, per worker:
- Bytes sent: \(\approx 2S\) (the textbook "2× model size per all-reduce" line).
- Bytes received: same.
- Latency: \(2(N-1)\) steps. For small \(S\), latency-dominated; for large \(S\), bandwidth-dominated.
Why ring? Why not tree?¶
- Tree all-reduce: latency \(O(\log N)\), bandwidth per worker \(\approx S\). Faster for small payloads, less efficient for large.
- Ring all-reduce: latency \(O(N)\), bandwidth per worker \(\approx 2S\). The "\(2S\)" is independent of \(N\) asymptotically — that is the property that makes it scale.
NCCL picks between ring, tree, and double-binary-tree dynamically based on payload size, topology, and worker count.
NCCL vs gloo vs MPI¶
| Backend | Where | When |
|---|---|---|
| NCCL | NVIDIA GPUs, intra-node NVLink + inter-node InfiniBand/RoCE | Default for any GPU job |
| gloo | CPU, ethernet | When NCCL isn't available (e.g., Borja's local CPU lab) |
| MPI | Anything; legacy HPC | When you must integrate with existing MPI infrastructure |
NCCL has direct GPU-GPU comm via NVLink and GPUDirect RDMA — it bypasses CPU and host memory, which is critical for performance. Gloo is fine for CPU-only and is what experiments/35-ddp-cpu/ uses.
torch.distributed.init_process_group(backend="nccl") on a GPU job; backend="gloo" on CPU.
Bandwidth math worked end-to-end¶
Take a worked example: 7B-parameter model, DDP across \(N = 8\) GPUs, fp16 gradients.
- \(|\theta| = 7 \cdot 10^9\).
- Gradient buffer size: \(2 |\theta| = 14\) GB.
- Per worker bytes sent during all-reduce: \(2 \cdot 14 = 28\) GB.
Time depends on link:
- NVLink 4.0 (H100): 900 GB/s. Time: 28/900 = 31 ms.
- PCIe 4.0 x16: 64 GB/s. Time: 28/64 = 440 ms.
- InfiniBand HDR (200 Gb/s): 25 GB/s effective. Time: 28/25 = 1.12 s.
- 10 GbE: 1.25 GB/s. Time: 28/1.25 = 22 s. Don't do this. (But it's why people pay for InfiniBand.)
If forward + backward takes 200 ms on the GPU, then:
- NVLink: 31/200 = 15% comm overhead. Fine.
- PCIe: 440/200 = 220% — comm dominates compute. Useless.
- InfiniBand: 1120/200 = 560% — comm dominates compute. Also useless without optimizations.
These numbers explain why production large-model training uses NVLink intra-node, InfiniBand inter-node only with overlap and gradient bucketing, and why DDP scaling falls apart past O(100) GPUs without ZeRO or hierarchical all-reduce.
For the grammar tutor with \(|\theta| = 500\)k:
- Gradient buffer: \(2 \cdot 500\text{k} = 1\) MB.
- Per worker bytes per all-reduce: 2 MB.
- On gloo over loopback (the lab 01 setup): negligible. Comm is free at this scale.
The point of lab 01 is not to optimize comm. It's to make Borja read the torch.distributed trace and see the all-reduce happen.
Cost as $-per-step¶
The bandwidth math above converts to dollars by multiplying by the dollar-per-second rate of the cluster:
For a cluster of 8 H100s on a major cloud at ~$30/hr per GPU (spot pricing):
- $_per_second per GPU: $30/3600 = \(8.3 \times 10^{-3}\).
- 31 ms comm × \(8.3 \times 10^{-3}\) × 8 GPUs = \(2.1 \times 10^{-3}\) per step in comm.
- At 1 step per second, training day = 86400 steps. Comm cost ≈ $180/day.
For the grammar tutor on lab 02 (2× RTX 4090 at $0.70/hr total):
- $_per_second total: 0.70/3600 = \(1.9 \times 10^{-4}\).
- 1 ms (negligible) comm × \(1.9 \times 10^{-4}\) = \(1.9 \times 10^{-7}\) per step in comm.
- 3-hour lab = ~\(2 in compute, **\)\ll $0.01$ in comm**.
Comm cost is a footnote at the curriculum's budget. But the formula is the same; only the constants change. The lab walks Borja through computing it for the grammar-tutor case so the muscle memory exists when the cluster grows.
Latency vs bandwidth crossover¶
For very small payloads (a single gradient bucket of 1 KB), all-reduce is latency-bound — the \(2(N-1)\) network hops dominate. NCCL's adaptive algorithm uses tree all-reduce here.
For very large payloads (the full 14 GB gradient), all-reduce is bandwidth-bound — bytes-per-second dominates. Ring all-reduce wins.
Crossover is typically around 256 KB – 4 MB depending on the cluster. NCCL's NCCL_ALGO=tree|ring env var forces one; in practice the auto-tuner gets it right.
For the grammar tutor at 2 MB total gradient: right at the crossover. Either algorithm works.
What this phase does NOT cover¶
- NCCL internals: the implementation of ring/tree/double-binary-tree all-reduces, the GPU-aware MPI surface, GPUDirect RDMA setup. Production deployment territory; not pedagogically high-value at our scale.
- Gradient compression (1-bit Adam, PowerSGD, signSGD). Reduces comm volume at cost of convergence — a research area.
- Hierarchical all-reduce (intra-node ring + inter-node tree). Standard at large scale, vocabulary only.
- InfiniBand vs RoCE vs Slingshot benchmarking. Out of budget.
- Disaggregated compute/network (CXL, future fabrics). Vocabulary only.
Next: theory/04-distributed-inference.md.