Lab 01 — Collective communication microbenchmark

Measure real AllReduce on 2 nodes × 8 GPUs. Compare it with the theory from §03. The gap is the story.

Goal

Run nccl-tests on a 2-node, 8-GPU-per-node rental. Measure AllReduce bandwidth for message sizes 1 MB, 100 MB, and 1 GB. Compare to theoretical NVLink + InfiniBand ceilings. Explain the gap.

This is where §03's bandwidth math meets a wall clock.

Prerequisites

  • Theory §03 read.
  • Cloud account capable of provisioning multi-node, GPU-to-GPU RDMA-enabled instances. Single-node 8× GPU is not enough — we need to cross the InfiniBand boundary.
  • A budget of ~$15-20, enough for roughly 15-30 minutes on the clusters listed below.

Hardware target

| Provider | SKU | Spec | Hourly cost | Total |
|---|---|---|---|---|
| RunPod | "2× H100 SXM 8-pack (Secure Cloud)" | 2 nodes × 8× H100 SXM5, NVLink intra + InfiniBand inter | ~$3.50/h per H100 × 16 ≈ ~$56/h | ~$15 for 15 min |
| Lambda Cloud | "8× H100 SXM5" reserved cluster + multi-node | similar | ~$2.50/h per H100 × 16 ≈ ~$40/h | ~$10 for 15 min |
| CoreWeave | "HGX H100 ×8" pair with NDR IB | 2 nodes × 8× H100, NDR 400 Gb/s IB | quote-based | ~$15-20 / hour |

Note: not every "multi-GPU" rental has true RDMA between nodes. Look for SKUs that explicitly advertise NDR InfiniBand or 400 Gb/s RoCE between nodes. If the inter-node link is "shared Ethernet", your AllReduce numbers will be 10× too slow and you'll have measured the wrong thing.

A100 alternative (cheaper, similar topology lessons):

  • RunPod 2 nodes × 8× A100 SXM4 with HDR IB (200 Gb/s). ~$1.80/h per A100 × 16 ≈ ~$30/h. Total for a 30 min run: ~$15.

Setup

# 1. On each node, install nccl-tests.
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda

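# 2. Turn on NCCL debug logging so each run prints the interconnect it detected.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# 3. Verify InfiniBand is up.
ibstatus       # Expect "phys state: 5: LinkUp" and a rate of 400 Gb/sec (NDR) or 200 Gb/sec (HDR) on at least one mlx5_*.
ibv_devinfo    # Expect "state: PORT_ACTIVE" on the same ports (add -v for active_width / active_speed).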
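
Once a run has produced debug output, a quick grep can confirm which network transport NCCL selected. The sketch below assumes the log contains a `Using network IB` or `Using network Socket` line, as NCCL prints at INFO level in the versions I have seen; the wording may differ in other releases. The helper name `nccl_transport` and the sample log lines are hypothetical.

```shell
#!/usr/bin/env bash
# Classify the inter-node transport reported in an NCCL debug log.
# Assumption: the log was captured with NCCL_DEBUG=INFO and contains a
# "Using network <name>" line ("IB" for verbs, "Socket" for the TCP fallback).
nccl_transport() {
    local log="$1"
    if grep -q 'Using network IB' "$log"; then
        echo "ib"
    elif grep -q 'Using network Socket' "$log"; then
        echo "socket"
    else
        echo "unknown"
    fi
}

# Example with a synthetic log (made-up lines, not real NCCL output).
tmp_log="$(mktemp)"
printf '%s\n' \
    'node1:1234:1234 [0] NCCL INFO Bootstrap : Using eth0' \
    'node1:1234:1234 [0] NCCL INFO Using network IB' > "$tmp_log"
nccl_transport "$tmp_log"
rm -f "$tmp_log"
```

If this prints `socket` on the two-node run, the benchmark is going over TCP and the inter-node numbers will not reflect the RDMA fabric.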

Run the benchmark

# Single-node, 8 GPU intra-node AllReduce (sanity check)
./build/all_reduce_perf -b 1M -e 1G -f 2 -g 8

# Two-node, 16 GPU AllReduce
mpirun -np 16 \
       --hostfile hostfile \
       -x NCCL_DEBUG=INFO \
       -x NCCL_DEBUG_SUBSYS=INIT,NET \
       -x NCCL_IB_DISABLE=0 \
       -x NCCL_NET_GDR_LEVEL=2 \
       ./build/all_reduce_perf -b 1M -e 1G -f 2 -g 1

Where hostfile is:

node1 slots=8
node2 slots=8

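The -f 2 flag steps message sizes by 2× from -b to -e. Starting from 1 MB, the doubling sweep lands on 128 MB rather than exactly 100 MB; use that row as the ~100 MB data point, or add a separate pass with -b 100M -e 100M.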
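
To see which sizes a `-b 1M -e 1G -f 2` sweep actually touches, the loop below enumerates them. It is a small sketch that assumes nccl-tests reads the M and G suffixes as binary multiples (2^20 and 2^30 bytes); `sweep_sizes` is a hypothetical helper, not part of nccl-tests.

```shell
#!/usr/bin/env bash
# Enumerate the message sizes of a geometric sweep, mirroring -b / -e / -f.
# Assumption: size suffixes are binary (1M = 1048576 bytes).
sweep_sizes() {
    local size="$1" end="$2" factor="$3"
    while [ "$size" -le "$end" ]; do
        echo "$size"
        size=$((size * factor))
    done
}

# 1M .. 1G in steps of 2x, printed in MiB: 1, 2, 4, ..., 512, 1024.
sweep_sizes $((1 << 20)) $((1 << 30)) 2 | awk '{ printf "%d MiB\n", $1 / 1048576 }'
```

The list contains 64 MiB and 128 MiB but no 100 MB point, which matters when matching rows to the three target sizes.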

What you record

For each message size, nccl-tests reports:

  • time (us): average wall-clock time per AllReduce over the timed iterations.
  • algbw (GB/s): algorithmic bandwidth = \(D / t\) (data size / time, per rank).
  • busbw (GB/s): bus bandwidth = \(\text{algbw} \cdot \frac{2(N-1)}{N}\) — accounts for the \(2(N-1)/N\) factor in ring AllReduce, so it should approach the link bandwidth.

busbw is the number to compare against the theoretical NVLink / InfiniBand peak.
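
The two bandwidth definitions above can be recomputed from a size and a time, which is handy for checking a row by hand. This is a minimal sketch: `allreduce_bw` is a hypothetical helper, the input is illustrative rather than measured, and it assumes GB means 10^9 bytes, which is how nccl-tests reports bandwidth as far as I know.

```shell
#!/usr/bin/env bash
# Recompute algbw and busbw for one AllReduce measurement.
# Arguments: message size in bytes, time in microseconds, number of ranks N.
# Assumption: 1 GB = 1e9 bytes.
allreduce_bw() {
    awk -v bytes="$1" -v us="$2" -v n="$3" 'BEGIN {
        algbw = bytes / 1e9 / (us / 1e6)      # D / t, in GB/s
        busbw = algbw * 2 * (n - 1) / n       # ring AllReduce factor 2(N-1)/N
        printf "%.2f %.2f\n", algbw, busbw
    }'
}

# Illustrative input, not a measurement: 1 GiB in 60 ms across 16 ranks.
allreduce_bw 1073741824 60000 16   # prints: 17.90 33.55
```

Note that for any N > 2 the factor 2(N-1)/N exceeds 1, so busbw is always larger than algbw for the same row.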

| Message size | Expected time | Expected algbw | Expected busbw | Notes |
|---|---|---|---|---|
| 1 MB | ~100-200 µs | ~5-10 GB/s | ~9-18 GB/s | Latency-bound; tree algorithm; switch hops dominate |
| 100 MB | ~6-8 ms | ~12-18 GB/s | ~23-33 GB/s | Crossover region |
| 1 GB | ~47-63 ms | ~16-21 GB/s | ~30-40 GB/s | Bandwidth-bound; inter-node IB is the ceiling |

[source: published NCCL benchmark numbers on DGX-class 2-node setups; NCCL release notes 2023-2024]

Sanity check vs theory

Inter-node NDR IB is 400 Gb/s = 50 GB/s unidirectional per port. With single-port IB per node and ring AllReduce on \(N = 16\):

\[ T_{\text{theoretical}} = \frac{2(N-1)}{N} \cdot \frac{D}{\beta_{\text{IB}}} = \frac{30}{16} \cdot \frac{1\,\text{GB}}{50\,\text{GB/s}} = 37.5 \text{ ms}. \]

So busbw should approach 50 GB/s as \(D\) grows. If you measure ~30-40 GB/s on the 1 GB AllReduce, that is 60-80% of theoretical, which is typical in practice. The gap comes from:

  • Protocol overhead (NCCL headers, ACKs).
  • Memory-copy overhead from GPU to NIC (mitigated by GPUDirect RDMA, gated by NCCL_NET_GDR_LEVEL).
  • IB switch buffer contention.

Intra-node (single-node, 8× H100 only)

For the 8-GPU intra-node AllReduce on NVLink/NVSwitch, each H100 has 900 GB/s of aggregate NVLink bandwidth, i.e. 450 GB/s per direction. Using the unidirectional figure, as with the 50 GB/s IB number, theory for a 1 GB AllReduce on \(N = 8\) is:

\[ T = \frac{14}{8} \cdot \frac{1\,\text{GB}}{450\,\text{GB/s}} \approx 3.89 \text{ ms}. \]

You should measure ~4.6-5.2 ms (75-85% of theoretical), with busbw approaching 340-380 GB/s.

The huge gap between intra-node (~350 GB/s) and inter-node (~40 GB/s) is the entire point of this lab.

Gap explanation prompt

Write a 1-page report answering:

  1. Where is the cliff? Plot busbw vs message size. The intra-node curve should plateau near 350 GB/s; the inter-node curve should plateau near 40 GB/s. Identify the crossover message size where each curve turns from latency-bound to bandwidth-bound.
  2. Why is inter-node ~9× slower than intra-node? Frame it in terms of: NVLink bandwidth per GPU (450 GB/s per direction) vs IB bandwidth per node (50 GB/s) divided across 8 GPUs (6.25 GB/s/GPU). The factor is ~70× at the wire level; NCCL hierarchical scheduling reduces this to ~9× observed.
  3. What would double your inter-node bandwidth? Dual-port IB (800 Gb/s per node), or two NICs per node. Some H100 SKUs ship with up to 8× NIC/node — used at SuperPOD scale.
  4. What does this mean for distributed training? Gradient sync of a 70B model (~140 GB gradients) takes ~5 s on this network. Compute step at 200 ms means we're 25× over-budget for AllReduce unless we overlap.
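
The estimate in item 4 follows from the same ring model, and it is worth redoing for your own model size and step time. The sketch below assumes no overlap and no gradient compression; `sync_budget` is a hypothetical helper and the inputs are the figures quoted in the prompt.

```shell
#!/usr/bin/env bash
# Estimate unoverlapped gradient AllReduce time against a compute step.
# Arguments: gradient size in GB, ranks N, link bandwidth in GB/s,
#            compute step in ms.
# Assumption: ring model T = 2(N-1)/N * D / beta, nothing overlapped.
sync_budget() {
    awk -v d="$1" -v n="$2" -v beta="$3" -v step_ms="$4" 'BEGIN {
        t = 2 * (n - 1) / n * d / beta          # seconds
        printf "%.2f s, %.2fx the compute step\n", t, t * 1000 / step_ms
    }'
}

# 140 GB of gradients, 16 ranks, 50 GB/s IB, 200 ms compute step.
sync_budget 140 16 50 200   # prints: 5.25 s, 26.25x the compute step
```

This is the theoretical floor; at a measured 30-40 GB/s the sync time grows accordingly.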

Deliverables

  • experiments/x4-collectives/manifest.json — versions, seeds, SKU details, NCCL env vars.
  • experiments/x4-collectives/nccl_log.txt — full NCCL debug output (proves the topology was detected).
  • experiments/x4-collectives/results.csv — message_size, time, algbw, busbw, theoretical_busbw, fraction.
  • experiments/x4-collectives/REPORT.md — the gap-explanation page.
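
One way to produce results.csv is to filter the benchmark's stdout directly. The filter below is a sketch, not the official tooling: it assumes data rows start with a numeric byte count and carry the out-of-place time, algbw, and busbw in columns 6 to 8, which matches the all_reduce_perf table layout I have seen but should be checked against your own output. `to_results_csv` is a hypothetical helper, the sample row is synthetic, and the peak bandwidth is supplied by hand.

```shell
#!/usr/bin/env bash
# Convert nccl-tests stdout into the results.csv columns listed above.
# Assumption: data rows look like
#   size count type redop root time algbw busbw #wrong [in-place columns...]
# and header / NCCL debug lines do not start with a bare integer.
# Argument: theoretical peak busbw in GB/s for this configuration.
to_results_csv() {
    local peak="$1"
    awk -v peak="$peak" '
        BEGIN { print "message_size,time,algbw,busbw,theoretical_busbw,fraction" }
        $1 ~ /^[0-9]+$/ && NF >= 8 {
            printf "%s,%s,%s,%s,%s,%.2f\n", $1, $6, $7, $8, peak, $8 / peak
        }'
}

# Synthetic sample (made-up numbers), not real benchmark output.
printf '%s\n' \
    '#       size         count      type   redop    root     time   algbw   busbw #wrong' \
    '  1073741824     268435456     float     sum      -1  60000.0   17.90   33.55      0' \
    | to_results_csv 50
```

In practice you would pipe the real run through it, for example `mpirun ... | tee nccl_log.txt | to_results_csv 50 > results.csv`, using the inter-node peak for the two-node run and the NVLink figure for the single-node run.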

Definition of Done

  • nccl-tests ran with NCCL_DEBUG=INFO showing the correct topology (NVLink intra, IB inter).
  • All three message sizes measured for both intra-node-only and inter-node configurations.
  • busbw at 1 GB is within 60-90% of theoretical for both configurations.
  • Report explains intra-vs-inter gap with concrete numbers.

References

  • NVIDIA NCCL Developer Guide, 2024.
  • NVIDIA NCCL Tests repository (github.com/NVIDIA/nccl-tests).
  • Patarasuk P. and Yuan X. 2009, Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations, JPDC.
  • NVIDIA DGX H100 Architecture, 2023.
  • NDR InfiniBand specifications (Mellanox / NVIDIA Networking).