English · Español

03 — Collective ops, NCCL, and cost math¶

🇪🇸 Las "primitivas" de la comunicación distribuida son cinco: broadcast, reduce, all-reduce, reduce-scatter, all-gather. Cada estrategia (DDP, ZeRO-N, TP, PP) se compone de estas. Saber su coste — bytes movidos por worker, latencia mínima, dependencia de N — es como saber sumar antes de multiplicar.

Every distributed strategy in theory/01-data-parallel-and-zero.md and theory/02-parallelism-flavors.md is implemented using a small set of collective operations. Knowing them — and their costs — is what lets you predict whether a distribution plan will work before you run it.

This page is the mechanics of the five collectives, the algorithms NCCL/gloo use to implement them, and the bandwidth math you need to plug into roofline-style estimates.

The five collectives¶

For $N$ workers, each holding a tensor $x_i$ of size $S$ (bytes):

Op	Input	Output
broadcast(src)	Worker `src` holds $x$; others hold nothing	All workers hold $x$
reduce(dst, op)	Each worker has $x_i$	Worker `dst` holds $\bigoplus_i x_i$ (e.g., sum)
all-reduce(op)	Each worker has $x_i$	All workers hold $\bigoplus_i x_i$
reduce-scatter(op)	Each worker has $x_i$ of size $S$	Each worker holds $1/N$ of $\bigoplus_i x_i$
all-gather	Each worker has $x_i$ of size $S/N$	All workers hold the full concatenation of size $S$

Two non-trivial identities:

$\textbf{all-reduce} = \textbf{reduce-scatter} + \textbf{all-gather}$.
$\textbf{broadcast} = \textbf{reduce}$ with a single non-identity input.

These identities matter because real implementations exploit them: a ring all-reduce is a ring reduce-scatter followed by a ring all-gather.

Ring all-reduce — the canonical algorithm¶

Workers form a logical ring: $0 \to 1 \to 2 \to \ldots \to N-1 \to 0$. The ring all-reduce of a tensor of size $S$ proceeds in two phases of $N-1$ steps each:

Phase 1: reduce-scatter. The tensor is split into $N$ chunks of size $S/N$. Each worker $i$ starts owning chunk $i$. In step $k$ (for $k = 0..N-2$), every worker sends one chunk to its right neighbor and receives one chunk from its left, adding the received chunk to its local copy. After $N-1$ steps, worker $i$ holds the fully-reduced chunk $(i+1) \mod N$.

Phase 2: all-gather. Same ring, $N-1$ more steps. Each worker passes its fully-reduced chunk to the right; after $N-1$ steps every worker holds all $N$ fully-reduced chunks.

Cost analysis¶

Per worker, per phase:

$N - 1$ steps, sending and receiving $S/N$ bytes per step.
Total bytes sent per worker per phase: $(N-1) \cdot S / N \to S$ for large $N$.

Across both phases, per worker:

Bytes sent: $\approx 2S$ (the textbook "2× model size per all-reduce" line).
Bytes received: same.
Latency: $2(N-1)$ steps. For small $S$, latency-dominated; for large $S$, bandwidth-dominated.

Why ring? Why not tree?¶

Tree all-reduce: latency $O(\log N)$, bandwidth per worker $\approx S$. Faster for small payloads, less efficient for large.
Ring all-reduce: latency $O(N)$, bandwidth per worker $\approx 2S$. The "$2S$" is independent of $N$ asymptotically — that is the property that makes it scale.

NCCL picks between ring, tree, and double-binary-tree dynamically based on payload size, topology, and worker count.

NCCL vs gloo vs MPI¶

Backend	Where	When
NCCL	NVIDIA GPUs, intra-node NVLink + inter-node InfiniBand/RoCE	Default for any GPU job
gloo	CPU, ethernet	When NCCL isn't available (e.g., Borja's local CPU lab)
MPI	Anything; legacy HPC	When you must integrate with existing MPI infrastructure

NCCL has direct GPU-GPU comm via NVLink and GPUDirect RDMA — it bypasses CPU and host memory, which is critical for performance. Gloo is fine for CPU-only and is what experiments/35-ddp-cpu/ uses.

torch.distributed.init_process_group(backend="nccl") on a GPU job; backend="gloo" on CPU.

Bandwidth math worked end-to-end¶

Take a worked example: 7B-parameter model, DDP across $N = 8$ GPUs, fp16 gradients.

$|\theta| = 7 \cdot 10^9$.
Gradient buffer size: $2 |\theta| = 14$ GB.
Per worker bytes sent during all-reduce: $2 \cdot 14 = 28$ GB.

Time depends on link:

NVLink 4.0 (H100): 900 GB/s. Time: 28/900 = 31 ms.
PCIe 4.0 x16: 64 GB/s. Time: 28/64 = 440 ms.
InfiniBand HDR (200 Gb/s): 25 GB/s effective. Time: 28/25 = 1.12 s.
10 GbE: 1.25 GB/s. Time: 28/1.25 = 22 s. Don't do this. (But it's why people pay for InfiniBand.)

If forward + backward takes 200 ms on the GPU, then:

NVLink: 31/200 = 15% comm overhead. Fine.
PCIe: 440/200 = 220% — comm dominates compute. Useless.
InfiniBand: 1120/200 = 560% — comm dominates compute. Also useless without optimizations.

These numbers explain why production large-model training uses NVLink intra-node, InfiniBand inter-node only with overlap and gradient bucketing, and why DDP scaling falls apart past O(100) GPUs without ZeRO or hierarchical all-reduce.

For the grammar tutor with $|\theta| = 500$k:

Gradient buffer: $2 \cdot 500\text{k} = 1$ MB.
Per worker bytes per all-reduce: 2 MB.
On gloo over loopback (the lab 01 setup): negligible. Comm is free at this scale.

The point of lab 01 is not to optimize comm. It's to make Borja read the torch.distributed trace and see the all-reduce happen.

Cost as $-per-step¶

The bandwidth math above converts to dollars by multiplying by the dollar-per-second rate of the cluster:

\[\text{comm cost per step} = \frac{\text{bytes sent per worker}}{\text{link bandwidth}} \cdot \$_{\text{rate per worker per second}}\]

For a cluster of 8 H100s on a major cloud at ~$30/hr per GPU (spot pricing):

$_per_second per GPU: $30/3600 = $8.3 \times 10^{-3}$.
31 ms comm × $8.3 \times 10^{-3}$ × 8 GPUs = $2.1 \times 10^{-3}$ per step in comm.
At 1 step per second, training day = 86400 steps. Comm cost ≈ $180/day.

For the grammar tutor on lab 02 (2× RTX 4090 at $0.70/hr total):

$_per_second total: 0.70/3600 = $1.9 \times 10^{-4}$.
1 ms (negligible) comm × $1.9 \times 10^{-4}$ = $1.9 \times 10^{-7}$ per step in comm.
3-hour lab = ~$2 in compute, **$\ll $0.01$ in comm**.

Comm cost is a footnote at the curriculum's budget. But the formula is the same; only the constants change. The lab walks Borja through computing it for the grammar-tutor case so the muscle memory exists when the cluster grows.

Latency vs bandwidth crossover¶

For very small payloads (a single gradient bucket of 1 KB), all-reduce is latency-bound — the $2(N-1)$ network hops dominate. NCCL's adaptive algorithm uses tree all-reduce here.

For very large payloads (the full 14 GB gradient), all-reduce is bandwidth-bound — bytes-per-second dominates. Ring all-reduce wins.

Crossover is typically around 256 KB – 4 MB depending on the cluster. NCCL's NCCL_ALGO=tree|ring env var forces one; in practice the auto-tuner gets it right.

For the grammar tutor at 2 MB total gradient: right at the crossover. Either algorithm works.

What this phase does NOT cover¶

NCCL internals: the implementation of ring/tree/double-binary-tree all-reduces, the GPU-aware MPI surface, GPUDirect RDMA setup. Production deployment territory; not pedagogically high-value at our scale.
Gradient compression (1-bit Adam, PowerSGD, signSGD). Reduces comm volume at cost of convergence — a research area.
Hierarchical all-reduce (intra-node ring + inter-node tree). Standard at large scale, vocabulary only.
InfiniBand vs RoCE vs Slingshot benchmarking. Out of budget.
Disaggregated compute/network (CXL, future fabrics). Vocabulary only.

Next: theory/04-distributed-inference.md.

Op	Input	Output
broadcast(src)	Worker `src` holds \(x\); others hold nothing	All workers hold \(x\)
reduce(dst, op)	Each worker has \(x_i\)	Worker `dst` holds \(\bigoplus_i x_i\) (e.g., sum)
all-reduce(op)	Each worker has \(x_i\)	All workers hold \(\bigoplus_i x_i\)
reduce-scatter(op)	Each worker has \(x_i\) of size \(S\)	Each worker holds \(1/N\) of \(\bigoplus_i x_i\)
all-gather	Each worker has \(x_i\) of size \(S/N\)	All workers hold the full concatenation of size \(S\)