Lab 01: Collective communication microbenchmark

Measure real AllReduce on 2 nodes × 8 GPUs. Compare it with the theory in §03. The gap is the story.

Goal

Run nccl-tests on a 2-node, 8-GPU-per-node rental. Measure AllReduce bandwidth for message sizes 1 MB, 100 MB, and 1 GB. Compare to theoretical NVLink + InfiniBand ceilings. Explain the gap.

This is where §03's bandwidth math meets a wall clock.

Prerequisites

Theory §03 read.
Cloud account capable of provisioning multi-node, GPU-to-GPU RDMA-enabled instances. Single-node 8× GPU is not enough — we need to cross the InfiniBand boundary.
A budget of ~$15-20, which covers roughly 15-30 minutes of cluster time at the rates below.

Hardware target

Provider	SKU	Spec	Hourly cost	Total
RunPod	"2× H100 SXM 8-pack (Secure Cloud)"	2 nodes × 8× H100 SXM5, NVLink intra + InfiniBand inter	~$3.50/h per H100 × 16 ≈ ~$56/h	~$15 for 15 min
Lambda Cloud	"8× H100 SXM5" reserved cluster + multi-node	similar	~$2.50/h per H100 × 16 ≈ ~$40/h	~$10 for 15 min
CoreWeave	"HGX H100 ×8" pair with NDR IB	2 nodes × 8× H100, NDR 400 Gb/s IB	quote-based	~$15-20 / hour

Note: not every "multi-GPU" rental has true RDMA between nodes. Look for SKUs that explicitly advertise NDR InfiniBand or 400 Gb/s RoCE between nodes. If the inter-node link is "shared Ethernet", your AllReduce numbers will be 10× too slow and you'll have measured the wrong thing.

A100 alternative (cheaper, similar topology lessons):

RunPod 2× node × 8× A100 SXM4 with HDR IB (200 Gb/s). ~$1.80/h × 16 ≈ ~$30/h. Total for 30 min run: ~$15.
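
The totals above are plain rate × GPU count × duration arithmetic. A minimal sketch for budgeting a run before you provision, assuming the approximate per-GPU rates quoted in this lab (the function name `run_cost` is hypothetical, and real prices vary by provider and date):

```python
def run_cost(rate_per_gpu_hour: float, gpus: int, minutes: float) -> float:
    """Total rental cost in dollars for a run of the given length."""
    return rate_per_gpu_hour * gpus * minutes / 60.0

# Approximate rates quoted in this lab; treat them as assumptions, not quotes.
h100_15min = run_cost(3.50, 16, 15)  # 16x H100 for 15 minutes
a100_30min = run_cost(1.80, 16, 30)  # 16x A100 for 30 minutes

print(f"H100, 15 min: ${h100_15min:.2f}")
print(f"A100, 30 min: ${a100_30min:.2f}")
```

Budget some extra time for setup and the build step, since the meter runs while you install nccl-tests.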

Setup

# 1. On each node, install nccl-tests.
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda

# 2. Enable NCCL debug logging so the benchmark output shows which interconnect NCCL selected.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# 3. Verify InfiniBand is up.
ibstatus    # Expect "phys state: 5: LinkUp" and a rate of 400 Gb/sec (NDR) or 200 Gb/sec (HDR) on at least one mlx5_*.
ibv_devinfo # Expect "state: PORT_ACTIVE" on the same ports.

Run the benchmark

# Single-node, 8 GPU intra-node AllReduce (sanity check)
./build/all_reduce_perf -b 1M -e 1G -f 2 -g 8

# Two-node, 16 GPU AllReduce
mpirun -np 16 \
       --hostfile hostfile \
       -x NCCL_DEBUG=INFO \
       -x NCCL_DEBUG_SUBSYS=INIT,NET \
       -x NCCL_IB_DISABLE=0 \
       -x NCCL_NET_GDR_LEVEL=2 \
       ./build/all_reduce_perf -b 1M -e 1G -f 2 -g 1

Where hostfile is:

node1 slots=8
node2 slots=8

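The -f 2 flag steps message sizes by 2× from -b to -e. The size suffixes are binary (1M = 1,048,576 bytes), so the sweep does not land exactly on 100 MB; the 64 MB and 128 MB rows bracket it.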
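
To see which rows the sweep will actually produce, the stepping can be reproduced in a few lines. This is a sketch only: `sweep_sizes` and `bracket` are hypothetical helper names, and it assumes the binary size suffixes that nccl-tests prints in its byte column.

```python
def sweep_sizes(begin: int, end: int, factor: int) -> list[int]:
    """Message sizes in bytes, multiplying by `factor` from `begin` up to `end`."""
    sizes = []
    size = begin
    while size <= end:
        sizes.append(size)
        size *= factor
    return sizes

def bracket(sizes: list[int], target: int) -> tuple[int, int]:
    """Largest swept size <= target and smallest swept size >= target."""
    lower = max(s for s in sizes if s <= target)
    upper = min(s for s in sizes if s >= target)
    return lower, upper

MIB = 1 << 20
sizes = sweep_sizes(1 * MIB, 1024 * MIB, 2)   # -b 1M -e 1G -f 2
lower, upper = bracket(sizes, 100_000_000)    # rows around 100 MB

print(len(sizes), "sizes in the sweep")
print("100 MB is bracketed by", lower, "and", upper, "bytes")
```

Either bracketing row works as the mid-range data point, as long as the report states which one was used.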

What you record

For each message size, nccl-tests reports:

time (us): average wall-clock time per AllReduce over the timed iterations.
algbw (GB/s): algorithmic bandwidth = $D / t$ (data size / time, per rank).
busbw (GB/s): bus bandwidth = $\text{algbw} \cdot \frac{2(N-1)}{N}$ — accounts for the $2(N-1)/N$ factor in ring AllReduce, so it should approach the link bandwidth.

busbw is the number to compare against the theoretical NVLink / InfiniBand peak.
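
The two bandwidth columns can be recomputed from the size and time columns, which is a useful check on your own spreadsheet math. A minimal sketch of the formulas above; the function name and the sample measurement are placeholders, not real results:

```python
def allreduce_bandwidths(size_bytes: float, time_us: float, n_ranks: int) -> tuple[float, float]:
    """Return (algbw, busbw) in GB/s for one AllReduce measurement."""
    algbw = size_bytes / 1e9 / (time_us / 1e6)    # D / t
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks   # ring AllReduce correction factor
    return algbw, busbw

# Placeholder measurement (not a real result): 1 GiB in 60 ms on 16 ranks.
algbw, busbw = allreduce_bandwidths(1 << 30, 60_000, 16)
print(f"algbw = {algbw:.2f} GB/s, busbw = {busbw:.2f} GB/s")
```

For any $N > 2$ the factor $2(N-1)/N$ is greater than 1, so busbw is always larger than algbw for AllReduce.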

Expected results (16× H100, NVLink + 400 Gb/s NDR IB)

Message size	Expected time	Expected `algbw`	Expected `busbw`	Notes
1 MB	~100-210 µs	~5-10 GB/s	~9-18 GB/s	Latency-bound; tree algorithm; switch hops dominate
100 MB	~6-8 ms	~12-18 GB/s	~23-33 GB/s	Crossover region
1 GB	~50-80 ms	~12-20 GB/s	~23-37 GB/s	Bandwidth-bound; inter-node IB is the ceiling

[source: published NCCL benchmark numbers on DGX-class 2-node setups; NCCL release notes 2023-2024]

Sanity check vs theory

Inter-node NDR IB is 400 Gb/s = 50 GB/s unidirectional per port. With single-port IB per node and ring AllReduce on $N = 16$:

\[ T_{\text{theoretical}} = \frac{2(N-1)}{N} \cdot \frac{D}{\beta_{\text{IB}}} = \frac{30}{16} \cdot \frac{1\,\text{GB}}{50\,\text{GB/s}} = 37.5 \text{ ms}. \]

So busbw should approach 50 GB/s as $D$ grows. If you measure ~30-40 GB/s on the 1 GB AllReduce, that is 60-80% of theoretical, which is typical in practice. The gap comes from:

Protocol overhead (NCCL headers, ACKs).
Memory-copy overhead from GPU to NIC (mitigated by GPUDirect RDMA, gated by NCCL_NET_GDR_LEVEL).
IB switch buffer contention.
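
The sanity check above can be scripted so the same comparison runs for every row of your results. A sketch under the single-port ring assumption used in this section; the function names are hypothetical, and the link bandwidth is a parameter so the same code covers other interconnects:

```python
def ring_allreduce_time_s(size_gb: float, n_ranks: int, link_gb_per_s: float) -> float:
    """Ideal ring AllReduce time: 2(N-1)/N * D / beta."""
    return 2 * (n_ranks - 1) / n_ranks * size_gb / link_gb_per_s

def fraction_of_peak(busbw_gb_per_s: float, link_gb_per_s: float) -> float:
    """Measured busbw as a fraction of the link ceiling."""
    return busbw_gb_per_s / link_gb_per_s

# 1 GB across 16 ranks over one 50 GB/s NDR port (the assumption in the text).
t_ideal = ring_allreduce_time_s(1.0, 16, 50.0)
low = fraction_of_peak(30.0, 50.0)
high = fraction_of_peak(40.0, 50.0)

print(f"ideal time: {t_ideal * 1e3:.1f} ms")
print(f"30-40 GB/s is {low:.0%}-{high:.0%} of the 50 GB/s ceiling")
```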

Intra-node (single-node, 8× H100 only)

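For the 8-GPU intra-node AllReduce on NVLink/NVSwitch, each H100 has 900 GB/s of NVLink bandwidth in total, which is 450 GB/s per direction. The ring model uses the per-direction figure, matching the unidirectional 50 GB/s used for IB above. Theory for 1 GB AllReduce on N=8:

\[ T = \frac{14}{8} \cdot \frac{1\,\text{GB}}{450\,\text{GB/s}} \approx 3.89 \text{ ms}. \]

You should measure ~4.6-5.2 ms (75-85% of theoretical), so busbw should approach ~340-380 GB/s. If NCCL selects its NVLink SHARP (NVLS) algorithm, which reduces inside the NVSwitch, busbw may come out above this ring estimate.

The large gap between intra-node (~350 GB/s) and inter-node (~40 GB/s) is the entire point of this lab.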

Gap explanation prompt

Write a 1-page report answering:

Where is the cliff? Plot busbw vs message size. The intra-node curve should plateau at a few hundred GB/s, below the 450 GB/s per-direction NVLink ceiling; the inter-node curve should plateau near 40 GB/s. Identify the crossover message size where each curve moves from latency-bound to bandwidth-bound.
Why is inter-node roughly 10× slower than intra-node? Frame it in terms of: NVLink bandwidth per GPU (450 GB/s per direction; the 900 GB/s headline figure is bidirectional) vs IB bandwidth per node (50 GB/s) divided across 8 GPUs (6.25 GB/s/GPU). The factor is ~72× at the wire level; NCCL hierarchical scheduling reduces this to the ~10× observed.
What would double your inter-node bandwidth? Dual-port IB (800 Gb/s per node), or two NICs per node. Some H100 systems ship with up to 8 NICs per node, as used at SuperPOD scale.
What does this mean for distributed training? Gradient sync of a 70B model (~140 GB of gradients at 2 bytes each) takes at least ~5 s on this network: $\frac{30}{16} \cdot 140\,\text{GB} / 50\,\text{GB/s} \approx 5.25$ s. With a 200 ms compute step, AllReduce is ~25× over budget unless we overlap it with compute.
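
The last question is the same ring formula applied to a gradient buffer. A rough sketch for the report, assuming 2-byte gradients (which is what makes 70B parameters come to ~140 GB), a single 50 GB/s IB port, and no overlap with compute; the function name is hypothetical:

```python
def gradient_sync_time_s(params: float, bytes_per_grad: int,
                         n_ranks: int, link_gb_per_s: float) -> float:
    """Ideal ring AllReduce time for a full gradient buffer, in seconds."""
    grad_gb = params * bytes_per_grad / 1e9
    return 2 * (n_ranks - 1) / n_ranks * grad_gb / link_gb_per_s

COMPUTE_STEP_S = 0.200  # compute step from the prompt above

sync_s = gradient_sync_time_s(70e9, 2, 16, 50.0)
over_budget = sync_s / COMPUTE_STEP_S

print(f"gradient sync: {sync_s:.2f} s")
print(f"that is {over_budget:.0f}x the compute step")
```

This uses the theoretical ceiling, so it is a lower bound; substituting your measured busbw for `link_gb_per_s` gives the figure that belongs in the report.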

Deliverables

experiments/x4-collectives/manifest.json — versions, seeds, SKU details, NCCL env vars.
experiments/x4-collectives/nccl_log.txt — full NCCL debug output (proves the topology was detected).
experiments/x4-collectives/results.csv — message_size, time, algbw, busbw, theoretical_busbw, fraction.
experiments/x4-collectives/REPORT.md — the gap-explanation page.
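
One way to produce results.csv with the columns listed above is to derive the bandwidth columns from the raw size and time pairs. This is a sketch, not a required format beyond the column names: the helper names are hypothetical, the measurement is a placeholder rather than a real result, and it writes to an in-memory buffer where a real run would open the file under experiments/x4-collectives/.

```python
import csv
import io

FIELDS = ["message_size", "time", "algbw", "busbw", "theoretical_busbw", "fraction"]

def build_row(size_bytes: int, time_us: float, n_ranks: int, link_gb_per_s: float) -> dict:
    """One results.csv row derived from a (size, time) measurement."""
    algbw = size_bytes / 1e9 / (time_us / 1e6)
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return {
        "message_size": size_bytes,
        "time": time_us,
        "algbw": round(algbw, 2),
        "busbw": round(busbw, 2),
        "theoretical_busbw": link_gb_per_s,
        "fraction": round(busbw / link_gb_per_s, 3),
    }

def write_results(rows: list[dict], fh) -> None:
    """Write rows with a header line to any text file handle."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Placeholder (size in bytes, time in microseconds); replace with your measurements.
placeholder = [(1 << 30, 60_000.0)]
rows = [build_row(s, t, n_ranks=16, link_gb_per_s=50.0) for s, t in placeholder]

buf = io.StringIO()
write_results(rows, buf)
csv_text = buf.getvalue()
print(csv_text)
```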

Definition of Done

nccl-tests ran with NCCL_DEBUG=INFO showing the correct topology (NVLink intra, IB inter).
All three message sizes measured for both intra-node-only and inter-node configurations.
busbw at 1 GB is within 60-90% of theoretical for both configurations.
Report explains intra-vs-inter gap with concrete numbers.

Cross-links

theory/03-interconnects-and-topology.md: the math you're verifying.
Phase 35 — Distributed Training: where this matters for end-to-end training time.

References

NVIDIA NCCL Developer Guide, 2024.
NVIDIA NCCL Tests repository (github.com/NVIDIA/nccl-tests).
Patarasuk P. and Yuan X., 2009. Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing 69(2).
NVIDIA DGX H100 Architecture, 2023.
NDR InfiniBand specifications (Mellanox / NVIDIA Networking).