Skip to content

English · Español

02 — Cluster Economics

🇪🇸 Una corrida de pre-entrenamiento se decide en cuatro variables: TFLOPs por GPU, $/hora, MFU real, y costo de comunicación. Si no sabes calcular el costo en dólares antes de lanzar, no entiendes el experimento.

The frontier-LLM economy runs on three GPU SKUs and ~five clouds. The arithmetic for cost-per-run is simple, the gotchas are not.

The reference SKUs (mid-2024)

GPU HBM FP32 TFLOP/s BF16 TFLOP/s (tensor) FP8 TFLOP/s NVLink BW TDP List price
A100 80GB 80 GB HBM2e 19.5 312 600 GB/s 400 W $10–15k (used)
H100 SXM5 80 GB HBM3 67 989 1979 900 GB/s 700 W $25–40k
H200 SXM5 141 GB HBM3e 67 989 1979 900 GB/s 700 W $30–45k
B200 192 GB HBM3e ~80 ~2250 ~4500 1800 GB/s 1000 W $40–60k

(Sources: NVIDIA H100 datasheet, NVIDIA H200 announcement Nov 2023, B200 Blackwell datasheet Mar 2024.)

Key shifts: - A100 → H100: 3.2× bf16 throughput, 1.5× NVLink BW. - H100 → H200: same compute, 1.76× HBM (80 → 141 GB), critical for fitting 70B+ models per device. - H200 → B200: ~2.3× bf16 throughput, 2× NVLink, 35% larger HBM. Shipping in volume late 2024 / early 2025.

Cloud $/GPU-hour (spot vs on-demand, mid-2024)

GPU Provider On-demand $/hr Spot $/hr (lowest quartile) Notes
A100 80GB Lambda Labs $1.79 $1.10 Single-GPU reserved
A100 80GB RunPod community $1.89 $0.79 Spot availability spotty
A100 80GB vast.ai $1.30–2.00 $0.40–0.80 Wide variance, cheaper if pickier
H100 SXM5 Lambda Labs $3.29 $2.49 8× nodes
H100 SXM5 CoreWeave $4.25 On-demand only
H100 PCIe RunPod community $2.69 $1.99 PCIe ~70% of SXM5 throughput
H200 SXM5 Lambda Labs $3.99 New, scarce

(Sources: provider pricing pages as of 2024-06. Spot prices fluctuate ±30% intra-week.)

Spot vs on-demand: spot can be reclaimed by the cloud at <2 min notice. For X1's single 24-hour run, this is the principal risk. Mitigations: 1. Checkpoint every 30 min (cost: ~5% throughput). 2. --resume-from-latest flag in the trainer. 3. Cron a watchdog that auto-relaunches if the instance dies.

For an 8-GPU H100 multi-day run, spot is rarely usable — the probability any node gets reclaimed across the run approaches 1. Frontier labs use reserved capacity at 30-50% discount vs on-demand.

MFU: the metric that matters

MFU = Model FLOPs Utilization = (sustained model FLOPs/s) / (hardware peak FLOPs/s).

The Chinchilla paper reported MFU ≈ 0.45 on TPUv4. Frontier-lab training runs typically hit:

Config MFU Source
GPT-3 (V100s, fp32) ~0.15 Brown 2020
PaLM (TPUv4) 0.46 Chowdhery 2022
MT-NLG (A100 + Megatron) 0.30 Smith 2022
Llama-2 (A100 + Megatron) 0.40 Touvron 2023b
Llama-3 (H100 + custom) 0.41 Meta 2024 (model card)
Mosaic LLM Foundry, A100 0.50 mosaicml/llm-foundry README

Why is MFU not 1.0?

  • Memory bandwidth. Most ops in a transformer block are memory-bound on modern accelerators (BW utilization, not FLOP utilization, is the bottleneck for small batches).
  • Communication overhead. All-reduce and all-gather take time the GPU is not doing matmul.
  • Pipeline bubbles. PP schedules have synchronous "bubble" periods.
  • Kernel inefficiency. Pure PyTorch ops are 30-60% MFU; FlashAttention-2 + torch.compile brings it to 40-50%; hand-tuned Megatron + Transformer Engine reaches 50-60%.

X1 lab MFU target: 0.40 on 1× A100 80GB with FlashAttention-2 + bf16 + torch.compile. Lower is a bug; 0.45+ is a stretch.

The cost formula

For a single dense decoder-only model:

\[\text{Cost} \;=\; \frac{6 \cdot N \cdot D}{\text{MFU} \cdot G \cdot P} \cdot \text{price/hr}\]

where: - \(N\) = parameters (excluding embeddings) - \(D\) = training tokens - \(G\) = number of GPUs - \(P\) = peak per-GPU bf16 TFLOP/s (× 3600 for per-hour) - \(\text{price/hr}\) = $-per-GPU-hour

Worked example 1: X1 lab 00

\(N = 5 \times 10^7\), \(D = 4.3 \times 10^{10}\), \(G = 1\), \(P = 312 \times 10^{12}\) FLOP/s, MFU = 0.40, price = $1.10/hr (A100 80GB spot Lambda).

Time = \(6 \cdot 5 \times 10^7 \cdot 4.3 \times 10^{10} / (0.40 \cdot 1 \cdot 312 \times 10^{12})\) = \(1.29 \times 10^{19} / 1.25 \times 10^{14}\) = \(1.03 \times 10^5\) s = 28.7 hours.

Cost = \(28.7 \times \$1.10 = \$31.6\).

So lab 00 is "1× A100 80GB spot at \(1.10/hr for ~24-29 hours = ~\)26-32." Hard cap $35 with $3-9 buffer.

Worked example 2: a 7B Chinchilla-optimal run

\(N = 7 \times 10^9\), \(D = 1.4 \times 10^{12}\), \(G = 1024\) H100, \(P = 989 \times 10^{12}\), MFU = 0.45, price = $2.49/hr (H100 spot Lambda × 1024).

Time = \(6 \cdot 7 \times 10^9 \cdot 1.4 \times 10^{12} / (0.45 \cdot 1024 \cdot 989 \times 10^{12})\) = \(5.88 \times 10^{22} / 4.56 \times 10^{17}\) = \(1.29 \times 10^5\) s × per-GPU; divided by 1024-parallel... wait — the formula assumes ideal scaling. Let me redo:

Total FLOPs needed: \(C = 6ND = 5.88 \times 10^{22}\). Cluster FLOP/s: \(G \cdot \text{MFU} \cdot P = 1024 \cdot 0.45 \cdot 989 \times 10^{12} = 4.56 \times 10^{17}\) FLOP/s. Time: \(C / (\text{cluster FLOP/s}) = 5.88 \times 10^{22} / 4.56 \times 10^{17} = 1.29 \times 10^5\) s ≈ 35.8 hours.

Hm — that's a day and a half, not 25 days. Let me reconcile against the Llama-2 7B-on-2T-tokens reported number (Touvron 2023b): "184k A100-hours." \(2T / 1.4T = 1.43\), so 7B-on-1.4T-tokens ≈ 130k A100-hours.

130k A100-hours / 1024 GPUs ≈ 127 hours ≈ 5 days on 1024 A100s. My H100 number says 36 hours, which is roughly right because H100s are 3× faster (and the calc was on 1024 H100s, not A100s). Cross-check: 36h × 3 ≈ 108h ≈ 4.5 days on A100s ✓ (matches Llama-2 within tolerance).

Cost at \(2.49/hr × 1024 GPUs × 36 hours = **~\)92k. Add 10% buffer for spot reclaims and ablations = ~$100k**.

That's the Chinchilla-optimal 7B. Most modern 7B trains over-train to 2T+ tokens (Llama-⅔ path), at ~\(200k–\)1M of compute depending on the cluster discount.

Worked example 3: the 7B-for-$1M math

The user-requested "train 7B for 1.4T tokens at MFU 0.45 on 1024×H100 = ~25 days = ~$1M at $0.04/H100-hour spot" math:

Let me do this with the user-stipulated numbers (which assume $0.04/H100-hour — that's reserved-capacity hyperscaler pricing, not retail spot):

  • Cluster FLOP/s: \(1024 \cdot 0.45 \cdot 989 \times 10^{12} = 4.56 \times 10^{17}\)
  • C = \(6 \cdot 7 \times 10^9 \cdot 1.4 \times 10^{12} = 5.88 \times 10^{22}\)
  • Time = \(5.88 \times 10^{22} / 4.56 \times 10^{17} = 1.29 \times 10^5\) s = 35.8 h.

Hmm — getting 36 h not 25 days. The "25 days" figure assumes either (a) MFU 0.05, or (b) overtraining to ~15T tokens, or © much smaller cluster.

If we instead scale to 1.4T tokens on 256 H100s (more typical of a non-frontier lab): - Time = \(5.88 \times 10^{22} / (256 \cdot 0.45 \cdot 989 \times 10^{12}) = 5.88 \times 10^{22} / 1.14 \times 10^{17}\) = \(5.16 \times 10^5\) s ≈ 143 h ≈ 6 days. - Cost at \(2.49/hr spot × 256 × 143 = **\)91k. At \(0.04/hr reserved-rate × 256 × 143 = **\)1.5k (clearly wrong scale — $0.04 is per-second equivalent or a heavy enterprise discount).

The lesson: the $1M-per-7B figure widely quoted assumes overtraining to ~15T tokens and/or retail spot prices, not the Chinchilla-optimal 1.4T. The exact arithmetic is:

Scenario Tokens GPUs $/hr/GPU Time Cost
7B Chinchilla-optimal, 256× H100 retail spot 1.4T 256 $2.49 6 d $92k
7B Llama-2-shape, 1024× A100 retail spot 2T 1024 $1.10 8 d $216k
7B Llama-3-shape, 2048× H100 retail spot 15T 2048 $2.49 16 d $2.0M
7B Chinchilla, 1024× H100 reserved (hyperscaler) 1.4T 1024 $1.00 (est.) 36 h $37k

The "$1M for a 7B" is the order-of-magnitude figure for a modern over-trained 7B at retail spot. Reserved-capacity hyperscalers (an internal cluster at Anthropic / Meta / Google) are 3-10× cheaper.

Bandwidth and communication cost

Compute is half the story. The other half is moving tensors between GPUs.

Intra-node (NVLink/NVSwitch): - A100 NVLink 3: 600 GB/s per GPU (300 GB/s each direction). - H100 NVLink 4: 900 GB/s per GPU. - B200 NVLink 5: 1800 GB/s per GPU.

Inter-node (Infiniband): - NDR400 IB: 400 Gb/s = 50 GB/s per port. Single-port: ~50 GB/s. Multi-port: up to 400 GB/s with rail-optimized topology. - Ethernet (RoCE v2): up to 400 Gb/s comparable, with worse tail latency.

The communication ratio. For a model with \(N\) parameters and a TP group of size \(T\): - Each step's all-reduce (gradient sync, DDP): \(\sim 2 N\) bytes (fp16) moved per GPU. - Each step's all-gather (TP forward): \(\sim N/T\) bytes × num-layers per step.

Implication. Tensor parallel must stay inside one NVLink domain (1 node = 8 GPUs typically). Pipeline parallel can cross nodes because PP only sends micro-batch activations at PP-boundary layers. ZeRO/FSDP is more communication-heavy than DDP and benefits from NVLink.

For Llama-2 70B on 1024 A100s: TP=8 (intra-node), PP=8 (cross-node), DP=16 (outer). The "DP × TP × PP" 3D-parallelism config.

Energy cost (often forgotten)

A100 TDP: 400 W. H100 TDP: 700 W. A 24-h single-GPU A100 run = 9.6 kWh ≈ ~\(1 at residential rates or ~\)0.40 at data-center rates. For X1 lab 00, energy is negligible vs compute price.

For an 8-GPU H100 24h run: 700 W × 8 × 24 = 134 kWh ≈ $13. Still ~5% of the compute bill.

For a 1024-GPU H100 month-long run: 700 W × 1024 × 30 × 24 = 516 MWh ≈ ~$50k at $0.10/kWh. This is on the order of 5-10% of the compute bill for a frontier-scale run, and it's why hyperscalers care about PUE and where they build data centers.

(Frontier-lab "carbon footprint of training" estimates: GPT-3 = ~552 t CO₂, BLOOM-176B = ~50 t CO₂. Source: Luccioni et al. 2022. Bloom was lower because of French nuclear grid.)

Storage and egress

A frontier training run reads ~10-15TB of tokenized data and writes ~1-10TB of checkpoints. Storage and egress costs:

  • S3 / GCS hot storage: $0.02/GB-month. 10 TB-month = $200.
  • Egress to GPU node: typically free within same region/cloud, $0.08/GB cross-region. Always co-locate.
  • Checkpoint writes during a multi-day run: 10 checkpoints × 100 GB = 1 TB total written. $20 at standard rates.

For X1 lab 00: 200 GB tokenized FineWeb-Edu shards + 5 GB of checkpoints. Storage cost: ~$3 over 24 h. Below the line.

What changes at >10B params

At ~10B parameters, single-GPU memory runs out: - 10B params × 2 bytes (bf16) = 20 GB weights - ×2 (Adam m, v) = 60 GB optimizer state in fp32 ≈ 80 GB total (Adam optimizer in fp32 needs 4N + 4N + 4N = 12N bytes for params/m/v in fp32 master copy; with mixed precision and 8-bit optimizers this drops to ~6N, but the math is dominated) - Activations: ~5-20 GB depending on batch and seq - Sum: ~100+ GB. Does not fit on H100 80 GB.

At this point pipeline parallel becomes mandatory. The typical configuration: - TP=8 intra-node (NVLink can carry the all-gather). - PP=8 across nodes (Infiniband handles the rare boundary sends). - ZeRO-1 outermost (shard optimizer state across DP replicas).

Why not ZeRO-3 + DDP only? Because ZeRO-3 requires all-gather of parameters at every forward step. For 70B params, that's 140 GB of weight all-gather per step per DP rank, which saturates even NVLink and cripples MFU. TP + PP keeps the per-step communication local (small activation passes across PP boundaries, small all-reduces inside TP groups).

The sweet spot for ≥10B: TP + PP + ZeRO-1 (3D parallelism), MFU ~0.4 if tuned.

For X1 we only run 50M, so single-GPU is fine. But the cost math you do in your head for 7B+ is the math above.


Next: theory/03-data-pipelines-at-scale.md.