English · Español
Lab 01 — Collective communication microbenchmark¶
🇪🇸 Mide AllReduce real en 2 nodos × 8 GPUs. Compáralo con la teoría del §03. La brecha es la historia.
Goal¶
Run nccl-tests on a 2-node, 8-GPU-per-node rental. Measure AllReduce bandwidth for message sizes 1 MB, 100 MB, and 1 GB. Compare to theoretical NVLink + InfiniBand ceilings. Explain the gap.
This is where §03's bandwidth math meets a wall clock.
Prerequisites¶
- Theory §03 read.
- Cloud account capable of provisioning multi-node, GPU-to-GPU RDMA-enabled instances. Single-node 8× GPU is not enough — we need to cross the InfiniBand boundary.
- A budget of ~$15-20 for a 1-hour run.
Hardware target¶
| Provider | SKU | Spec | Hourly cost | Total |
|---|---|---|---|---|
| RunPod | "2× H100 SXM 8-pack (Secure Cloud)" | 2 nodes × 8× H100 SXM5, NVLink intra + InfiniBand inter | ~\(3.50/h per H100 × 16 ≈ ~\)56/h | ~$15 for 15 min |
| Lambda Cloud | "8× H100 SXM5" reserved cluster + multi-node | similar | ~\(2.50/h per H100 × 16 ≈ ~\)40/h | ~$10 for 15 min |
| CoreWeave | "HGX H100 ×8" pair with NDR IB | 2 nodes × 8× H100, NDR 400 Gb/s IB | quote-based | ~$15-20 / hour |
Note: not every "multi-GPU" rental has true RDMA between nodes. Look for SKUs that explicitly advertise NDR InfiniBand or 400 Gb/s RoCE between nodes. If the inter-node link is "shared Ethernet", your AllReduce numbers will be 10× too slow and you'll have measured the wrong thing.
A100 alternative (cheaper, similar topology lessons):
- RunPod 2× node × 8× A100 SXM4 with HDR IB (200 Gb/s). ~\(1.80/h × 16 ≈ ~\)30/h. Total for 30 min run: ~$15.
Setup¶
# 1. On each node, install nccl-tests.
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda
# 2. Verify NCCL sees the right interconnect.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# 3. Verify InfiniBand is up.
ibstatus # Expect "LinkUp" on at least one mlx5_*.
ibv_devinfo # Expect NDR (400 Gb/s) or HDR (200 Gb/s) per port.
Run the benchmark¶
# Single-node, 8 GPU intra-node AllReduce (sanity check)
./build/all_reduce_perf -b 1M -e 1G -f 2 -g 8
# Two-node, 16 GPU AllReduce
mpirun -np 16 \
--hostfile hostfile \
-x NCCL_DEBUG=INFO \
-x NCCL_IB_DISABLE=0 \
-x NCCL_NET_GDR_LEVEL=2 \
./build/all_reduce_perf -b 1M -e 1G -f 2 -g 1
Where hostfile is:
The -f 2 flag steps message sizes by 2× from -b to -e.
What you record¶
For each message size, nccl-tests reports:
time(us): median wall-clock time per AllReduce.algbw(GB/s): algorithmic bandwidth = \(D / t\) (data size / time, per rank).busbw(GB/s): bus bandwidth = \(\text{algbw} \cdot \frac{2(N-1)}{N}\) — accounts for the \(2(N-1)/N\) factor in ring AllReduce, so it should approach the link bandwidth.
busbw is the number to compare against the theoretical NVLink / InfiniBand peak.
Expected results (16× H100, NVLink + 400 Gb/s NDR IB)¶
| Message size | Expected time | Expected algbw |
Expected busbw |
Notes |
|---|---|---|---|---|
| 1 MB | ~50-100 µs | ~10-20 GB/s | ~9-18 GB/s | Latency-bound; tree algorithm; switch hops dominate |
| 100 MB | ~3-5 ms | ~25-35 GB/s | ~23-33 GB/s | Crossover region |
| 1 GB | ~25-40 ms | ~25-40 GB/s | ~23-37 GB/s | Bandwidth-bound; inter-node IB is the ceiling |
[source: published NCCL benchmark numbers on DGX-class 2-node setups; NCCL release notes 2023-2024]
Sanity check vs theory¶
Inter-node NDR IB is 400 Gb/s = 50 GB/s unidirectional per port. With single-port IB per node and ring AllReduce on \(N = 16\):
So busbw should approach 50 GB/s as \(D\) grows. If you measure ~30-40 GB/s on the 1 GB AllReduce, that's 60-80% of theoretical — typical real-world. The gap is:
- Protocol overhead (NCCL headers, ACKs).
- Memory-copy overhead from GPU to NIC (mitigated by GPUDirect RDMA, gated by
NCCL_NET_GDR_LEVEL). - IB switch buffer contention.
Intra-node (single-node, 8× H100 only)¶
For the 8-GPU intra-node AllReduce on NVLink/NVSwitch, the link bandwidth is 900 GB/s. Theory for 1 GB AllReduce on N=8:
You should measure ~2.5-3 ms (75-85% of theoretical). busbw should approach 700-800 GB/s.
The huge gap between intra-node (700+ GB/s) and inter-node (~40 GB/s) is the entire point of this lab.
Gap explanation prompt¶
Write a 1-page report answering:
- Where is the cliff? Plot
busbwvs message size. The intra-node curve should plateau near 700 GB/s; the inter-node curve should plateau near 40 GB/s. Identify the crossover message size. - Why is inter-node ~20× slower than intra-node? Frame it in terms of: NVLink bandwidth per GPU (900 GB/s) vs IB bandwidth per node (50 GB/s) divided across 8 GPUs (6.25 GB/s/GPU). The factor is ~140× at the wire level; NCCL hierarchical scheduling reduces this to ~20× observed.
- What would double your inter-node bandwidth? Dual-port IB (800 Gb/s per node), or two NICs per node. Some H100 SKUs ship with up to 8× NIC/node — used at SuperPOD scale.
- What does this mean for distributed training? Gradient sync of a 70B model (~140 GB gradients) takes ~5 s on this network. Compute step at 200 ms means we're 25× over-budget for AllReduce unless we overlap.
Deliverables¶
experiments/x4-collectives/manifest.json— versions, seeds, SKU details, NCCL env vars.experiments/x4-collectives/nccl_log.txt— full NCCL debug output (proves the topology was detected).experiments/x4-collectives/results.csv— message_size, time, algbw, busbw, theoretical_busbw, fraction.experiments/x4-collectives/REPORT.md— the gap-explanation page.
Definition of Done¶
-
nccl-testsran with NCCL_DEBUG=INFO showing the correct topology (NVLink intra, IB inter). - All three message sizes measured for both intra-node-only and inter-node configurations.
-
busbwat 1 GB is within 60-90% of theoretical for both configurations. - Report explains intra-vs-inter gap with concrete numbers.
Cross-links¶
theory/03-interconnects-and-topology.md: the math you're verifying.- Phase 35 — Distributed Training: where this matters for end-to-end training time.
References¶
- NVIDIA NCCL Developer Guide, 2024.
- NVIDIA NCCL Tests repository (github.com/NVIDIA/nccl-tests).
- Patarasuk P. and Yuan X. 2009, Bandwidth-Optimal AllReduce Algorithms, JPDC.
- NVIDIA DGX H100 Architecture, 2023.
- NDR InfiniBand specifications (Mellanox / NVIDIA Networking).