Skip to content

English · Español

03 — Collective ops, NCCL, and cost math

🇪🇸 Las "primitivas" de la comunicación distribuida son cinco: broadcast, reduce, all-reduce, reduce-scatter, all-gather. Cada estrategia (DDP, ZeRO-N, TP, PP) se compone de estas. Saber su coste — bytes movidos por worker, latencia mínima, dependencia de N — es como saber sumar antes de multiplicar.

Every distributed strategy in theory/01-data-parallel-and-zero.md and theory/02-parallelism-flavors.md is implemented using a small set of collective operations. Knowing them — and their costs — is what lets you predict whether a distribution plan will work before you run it.

This page is the mechanics of the five collectives, the algorithms NCCL/gloo use to implement them, and the bandwidth math you need to plug into roofline-style estimates.


The five collectives

For \(N\) workers, each holding a tensor \(x_i\) of size \(S\) (bytes):

Op Input Output
broadcast(src) Worker src holds \(x\); others hold nothing All workers hold \(x\)
reduce(dst, op) Each worker has \(x_i\) Worker dst holds \(\bigoplus_i x_i\) (e.g., sum)
all-reduce(op) Each worker has \(x_i\) All workers hold \(\bigoplus_i x_i\)
reduce-scatter(op) Each worker has \(x_i\) of size \(S\) Each worker holds \(1/N\) of \(\bigoplus_i x_i\)
all-gather Each worker has \(x_i\) of size \(S/N\) All workers hold the full concatenation of size \(S\)

Two non-trivial identities:

  • \(\textbf{all-reduce} = \textbf{reduce-scatter} + \textbf{all-gather}\).
  • \(\textbf{broadcast} = \textbf{reduce}\) with a single non-identity input.

These identities matter because real implementations exploit them: a ring all-reduce is a ring reduce-scatter followed by a ring all-gather.


Ring all-reduce — the canonical algorithm

Workers form a logical ring: \(0 \to 1 \to 2 \to \ldots \to N-1 \to 0\). The ring all-reduce of a tensor of size \(S\) proceeds in two phases of \(N-1\) steps each:

Phase 1: reduce-scatter. The tensor is split into \(N\) chunks of size \(S/N\). Each worker \(i\) starts owning chunk \(i\). In step \(k\) (for \(k = 0..N-2\)), every worker sends one chunk to its right neighbor and receives one chunk from its left, adding the received chunk to its local copy. After \(N-1\) steps, worker \(i\) holds the fully-reduced chunk \((i+1) \mod N\).

Phase 2: all-gather. Same ring, \(N-1\) more steps. Each worker passes its fully-reduced chunk to the right; after \(N-1\) steps every worker holds all \(N\) fully-reduced chunks.

Cost analysis

Per worker, per phase:

  • \(N - 1\) steps, sending and receiving \(S/N\) bytes per step.
  • Total bytes sent per worker per phase: \((N-1) \cdot S / N \to S\) for large \(N\).

Across both phases, per worker:

  • Bytes sent: \(\approx 2S\) (the textbook "2× model size per all-reduce" line).
  • Bytes received: same.
  • Latency: \(2(N-1)\) steps. For small \(S\), latency-dominated; for large \(S\), bandwidth-dominated.

Why ring? Why not tree?

  • Tree all-reduce: latency \(O(\log N)\), bandwidth per worker \(\approx S\). Faster for small payloads, less efficient for large.
  • Ring all-reduce: latency \(O(N)\), bandwidth per worker \(\approx 2S\). The "\(2S\)" is independent of \(N\) asymptotically — that is the property that makes it scale.

NCCL picks between ring, tree, and double-binary-tree dynamically based on payload size, topology, and worker count.


NCCL vs gloo vs MPI

Backend Where When
NCCL NVIDIA GPUs, intra-node NVLink + inter-node InfiniBand/RoCE Default for any GPU job
gloo CPU, ethernet When NCCL isn't available (e.g., Borja's local CPU lab)
MPI Anything; legacy HPC When you must integrate with existing MPI infrastructure

NCCL has direct GPU-GPU comm via NVLink and GPUDirect RDMA — it bypasses CPU and host memory, which is critical for performance. Gloo is fine for CPU-only and is what experiments/35-ddp-cpu/ uses.

torch.distributed.init_process_group(backend="nccl") on a GPU job; backend="gloo" on CPU.


Bandwidth math worked end-to-end

Take a worked example: 7B-parameter model, DDP across \(N = 8\) GPUs, fp16 gradients.

  • \(|\theta| = 7 \cdot 10^9\).
  • Gradient buffer size: \(2 |\theta| = 14\) GB.
  • Per worker bytes sent during all-reduce: \(2 \cdot 14 = 28\) GB.

Time depends on link:

  • NVLink 4.0 (H100): 900 GB/s. Time: 28/900 = 31 ms.
  • PCIe 4.0 x16: 64 GB/s. Time: 28/64 = 440 ms.
  • InfiniBand HDR (200 Gb/s): 25 GB/s effective. Time: 28/25 = 1.12 s.
  • 10 GbE: 1.25 GB/s. Time: 28/1.25 = 22 s. Don't do this. (But it's why people pay for InfiniBand.)

If forward + backward takes 200 ms on the GPU, then:

  • NVLink: 31/200 = 15% comm overhead. Fine.
  • PCIe: 440/200 = 220% — comm dominates compute. Useless.
  • InfiniBand: 1120/200 = 560% — comm dominates compute. Also useless without optimizations.

These numbers explain why production large-model training uses NVLink intra-node, InfiniBand inter-node only with overlap and gradient bucketing, and why DDP scaling falls apart past O(100) GPUs without ZeRO or hierarchical all-reduce.

For the grammar tutor with \(|\theta| = 500\)k:

  • Gradient buffer: \(2 \cdot 500\text{k} = 1\) MB.
  • Per worker bytes per all-reduce: 2 MB.
  • On gloo over loopback (the lab 01 setup): negligible. Comm is free at this scale.

The point of lab 01 is not to optimize comm. It's to make Borja read the torch.distributed trace and see the all-reduce happen.


Cost as $-per-step

The bandwidth math above converts to dollars by multiplying by the dollar-per-second rate of the cluster:

\[\text{comm cost per step} = \frac{\text{bytes sent per worker}}{\text{link bandwidth}} \cdot \$_{\text{rate per worker per second}}\]

For a cluster of 8 H100s on a major cloud at ~$30/hr per GPU (spot pricing):

  • $_per_second per GPU: $30/3600 = \(8.3 \times 10^{-3}\).
  • 31 ms comm × \(8.3 \times 10^{-3}\) × 8 GPUs = \(2.1 \times 10^{-3}\) per step in comm.
  • At 1 step per second, training day = 86400 steps. Comm cost ≈ $180/day.

For the grammar tutor on lab 02 (2× RTX 4090 at $0.70/hr total):

  • $_per_second total: 0.70/3600 = \(1.9 \times 10^{-4}\).
  • 1 ms (negligible) comm × \(1.9 \times 10^{-4}\) = \(1.9 \times 10^{-7}\) per step in comm.
  • 3-hour lab = ~\(2 in compute, **\)\ll $0.01$ in comm**.

Comm cost is a footnote at the curriculum's budget. But the formula is the same; only the constants change. The lab walks Borja through computing it for the grammar-tutor case so the muscle memory exists when the cluster grows.


Latency vs bandwidth crossover

For very small payloads (a single gradient bucket of 1 KB), all-reduce is latency-bound — the \(2(N-1)\) network hops dominate. NCCL's adaptive algorithm uses tree all-reduce here.

For very large payloads (the full 14 GB gradient), all-reduce is bandwidth-bound — bytes-per-second dominates. Ring all-reduce wins.

Crossover is typically around 256 KB – 4 MB depending on the cluster. NCCL's NCCL_ALGO=tree|ring env var forces one; in practice the auto-tuner gets it right.

For the grammar tutor at 2 MB total gradient: right at the crossover. Either algorithm works.


What this phase does NOT cover

  • NCCL internals: the implementation of ring/tree/double-binary-tree all-reduces, the GPU-aware MPI surface, GPUDirect RDMA setup. Production deployment territory; not pedagogically high-value at our scale.
  • Gradient compression (1-bit Adam, PowerSGD, signSGD). Reduces comm volume at cost of convergence — a research area.
  • Hierarchical all-reduce (intra-node ring + inter-node tree). Standard at large scale, vocabulary only.
  • InfiniBand vs RoCE vs Slingshot benchmarking. Out of budget.
  • Disaggregated compute/network (CXL, future fabrics). Vocabulary only.

Next: theory/04-distributed-inference.md.