English · Español

03 — Interconnects and topology: NVLink, InfiniBand, fat-trees, and collectives¶

🇪🇸 Cuando entrenas en 1024 GPUs, el cuello de botella ya no es la GPU. Es la red que las conecta y los algoritmos de comunicación colectiva que usas.

The hierarchy of links¶

A training cluster has three distinct interconnect tiers:

Intra-server (scale-up): GPU ↔ GPU within a single chassis. NVLink + NVSwitch on NVIDIA; Infinity Fabric on AMD; ICI on TPU.
Inter-server, same rack/pod (scale-out, tight): server ↔ server via InfiniBand or RoCE (RDMA over Converged Ethernet). 200-800 Gb/s per port.
Inter-pod / datacenter-wide: regular Ethernet or InfiniBand backbone. Typically used only for data ingress and checkpoints, not gradient sync.

Each tier is 5-10× slower than the one above it. The skill is keeping as much traffic as possible in the upper tier.

NVLink vs PCIe — never confuse these¶

Bus	Per-link rate	Use	H100 aggregate
PCIe 5.0 ×16	64 GB/s bidirectional	Host CPU ↔ GPU; storage; non-GPU peripherals	1 link per GPU
NVLink 4	50 GB/s bidirectional per lane × 18 lanes	GPU ↔ GPU peer-to-peer	900 GB/s total

NVLink is 14× faster than PCIe 5.0. This is why you cannot fake a multi-GPU training setup with PCIe-only — the gradient AllReduce times balloon. Cloud rentals advertising "8× H100" without NVLink are not actually suitable for training.

NVSwitch — the on-board fabric¶

A DGX H100 has 4 NVSwitches that fully interconnect 8 H100s such that every GPU pair has 900 GB/s of bandwidth (not just to one peer — to every peer simultaneously). This is what makes intra-node AllReduce nearly free relative to inter-node.

The newer NVLink Switch System extends this to 32 or 256 GPUs (DGX SuperPOD), keeping the 900 GB/s-per-peer property across racks via dedicated NVLink switches in the network. With Blackwell GB200 NVL72, the domain grows to 72 GPUs/rack with 130 TB/s aggregate NVLink bandwidth.

[source: NVIDIA DGX H100 architecture documentation 2023; NVIDIA GB200 NVL72 specification 2024]

InfiniBand vs RoCE¶

Beyond a NVLink domain, you need a real network:

InfiniBand (Mellanox / NVIDIA Networking): a separate protocol stack from Ethernet, designed for HPC. RDMA native. Lower latency, higher consistency under congestion. Current top SKU: NDR InfiniBand, 400 Gb/s per port (800 Gb/s with dual-port adapters).
RoCE (RDMA over Converged Ethernet): RDMA semantics riding on Ethernet. Same speeds as InfiniBand on the latest hardware, often cheaper per port. Requires lossless Ethernet configuration (PFC, ECN). Used by Meta, AWS, Microsoft for scale-out fabrics.

For an ML interview, the rule is: InfiniBand is the reference fabric; RoCE is the price-competitive alternative; both deliver RDMA at 400 Gb/s+ at the leaf.

Topology: fat-tree vs torus vs dragonfly¶

The physical wiring of the network matters as much as the link speed.

Fat-tree (the GPU cluster default)¶

A fat-tree has switches arranged in tiers (leaf, spine, super-spine) such that bisection bandwidth grows with cluster size — you never have a bottleneck for any-to-any traffic.

Pros: any-to-any bandwidth; can be built incrementally.
Cons: needs many switches (cost); long cable runs at large scale.
What NVIDIA SuperPODs ship. The reference design.

3D-torus (Google TPU Pod)¶

TPU Pods use a 3D-torus topology: each chip is connected to 6 neighbors (one per axis direction).

Pros: very high bandwidth per dollar; great for AllReduce (the algorithm decomposes naturally onto a torus).
Cons: any-to-any bandwidth is not uniform — patterns that map well to the torus are fast; arbitrary all-to-all is slow.
What Google ships for TPUs. v4/v5 use optical circuit switches to reconfigure the torus per job. [source: Jouppi et al. 2023, TPU v4: An Optically Reconfigurable Supercomputer]

Dragonfly+ (HPC labs, Frontier)¶

Hierarchical: groups of nodes fully connected within a group, sparsely connected between groups. Used at exascale HPC systems.

For ML clusters, fat-tree dominates; torus is TPU-specific; dragonfly is rare in commercial AI.

Collective primitives¶

The three you must know cold:

AllReduce¶

Combines values from all \(N\) ranks via a reduction (usually sum) and gives the result to all ranks. Used to average gradients across data-parallel replicas. The single most important collective in ML.

AllGather¶

Each rank contributes a tensor; every rank ends up with the concatenation of all contributions. Used in FSDP (sharded weights → gather full layer before forward).

ReduceScatter¶

Each rank contributes a tensor; the sum is split across ranks (each rank ends up with one shard of the sum). Used in FSDP backward + ZeRO Stage 3.

Identity: AllReduce = ReduceScatter + AllGather. This decomposition is the reason ring-AllReduce is bandwidth-optimal.

AllReduce algorithms — the holy trinity¶

Ring AllReduce¶

Arrange the \(N\) ranks in a ring. The data \(D\) is split into \(N\) chunks. In \(2(N-1)\) steps, each rank sends one chunk and receives one chunk per step.

Per-rank bandwidth cost:

\[ B_{\text{ring}} = \frac{2(N-1)}{N} \cdot D \]

Time on a homogeneous network of bandwidth \(\beta\):

\[ T_{\text{ring}} = \frac{2(N-1)}{N} \cdot \frac{D}{\beta} \]

Bandwidth-optimal: the \(\frac{2(N-1)}{N}\) factor is the theoretical lower bound for AllReduce.
Latency-suboptimal: \(2(N-1)\) steps means \(O(N)\) latency. Bad for small messages.

Tree AllReduce (or Recursive-Doubling)¶

Reduce up a binary tree to a root, then broadcast down. \(2 \log_2 N\) steps. Each step sends the full data \(D\).

Per-rank bandwidth cost: \(2 \log_2 N \cdot D / N\) — worse than ring for large \(D\). But only \(2 \log_2 N\) network hops — much better latency for small \(D\).

Rabenseifner's algorithm (hybrid)¶

ReduceScatter via recursive halving + AllGather via recursive doubling. Bandwidth-optimal and \(O(\log N)\) latency. Used in MPI implementations for medium-large messages.

NCCL's choice¶

NVIDIA's NCCL (the de facto AllReduce implementation on GPU clusters) picks ring for large messages, tree for small ones, and uses topology-aware variants (double-binary-tree, ring-2D) for intra-node-vs-inter-node fabrics. You normally do not pick the algorithm; NCCL autotunes.

[source: NVIDIA NCCL documentation 2024; Rabenseifner R. 2004, Optimization of Collective Reduction Operations, ICCS]

Bandwidth math: 8-GPU node and 1024-GPU cluster¶

Intra-node AllReduce (8× H100, NVSwitch)¶

\(N = 8\), \(\beta = 900\) GB/s, \(D = 1\) GB.
\(T = \frac{2 \cdot 7}{8} \cdot \frac{1\,\text{GB}}{900\,\text{GB/s}} = 1.75 \cdot \frac{1}{900}\,\text{s} \approx 1.94 \text{ ms}.\)

Inter-node, 2 nodes × 8 GPUs (16 H100, NVLink + InfiniBand)¶

This becomes a hierarchical AllReduce: intra-node ring on NVLink, inter-node ring on InfiniBand. Effective bandwidth is the slowest link in the path. Assume 400 Gb/s = 50 GB/s per node (single-port NDR IB).

16 GPUs total, but only 2 IB links cross the inter-node boundary.
Inter-node phase: \(\frac{2 \cdot 1}{2} \cdot \frac{1\,\text{GB}}{50\,\text{GB/s}} = 20\) ms.
Intra-node phase: negligible (~2 ms).
Total: ~22 ms for 1 GB AllReduce — dominated by InfiniBand.

1024-GPU cluster (128 nodes × 8 GPUs each, 400 Gb/s IB)¶

Hierarchical: each node reduces internally (cheap), then 128 nodes do an inter-node AllReduce of the reduced gradient.
Inter-node bandwidth per node: 50 GB/s.
Ring on 128 nodes: \(\frac{2 \cdot 127}{128} \cdot \frac{D}{50\,\text{GB/s}} \approx 1.98 \cdot \frac{D}{50\,\text{GB/s}}\).
For \(D = 1\) GB: ~40 ms.

A 70B-parameter model in BF16 has 140 GB of gradients. AllReduce-ing 140 GB → ~5.5 s per step on this network — which is why gradient bucketing (overlap comm with compute) is mandatory at this scale.

Why bandwidth math is interview gold¶

Asked "how big can you go on 1024 H100s before AllReduce eats your step time?" — you should think:

Step compute time ≈ \(T_{\text{compute}} = \text{FLOPs/step} / (1024 \cdot 1\,\text{PF} \cdot \text{MFU})\).
AllReduce time ≈ \(T_{\text{AR}} = \frac{2(N-1)}{N} \cdot D_{\text{grad}} / \beta_{\text{net}}\), scaled by overlap factor.
If \(T_{\text{AR}} / T_{\text{compute}} > 0.2\) you have a problem. Solutions: larger microbatch (more compute per step), gradient sharding (FSDP), gradient compression, faster network.

Cross-links¶

02-h100-and-h200.md: NVLink 4 on the chip side.
Phase 35 — Distributed Training: where you actually use these collectives.
lab/01-collective-comm-microbenchmark.md: measure it on real hardware.

References¶

NVIDIA NCCL Developer Guide, 2024.
Rabenseifner R. 2004, Optimization of Collective Reduction Operations, ICCS.
Patarasuk P. and Yuan X. 2009, Bandwidth-Optimal AllReduce Algorithms for Clusters of Workstations, JPDC.
NVIDIA DGX H100 Architecture, 2023.
Jouppi et al. 2023, TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning, ISCA.
Sergeev A. and Del Balso M. 2018, Horovod: fast and easy distributed deep learning in TensorFlow — ring-AllReduce popularization.