02 — Cluster Economics¶
A pre-training run is decided by four variables: TFLOPs per GPU, $/hour, real MFU, and communication cost. If you cannot compute the dollar cost before launching, you do not understand the experiment.
The frontier-LLM economy runs on a handful of GPU SKUs and ~five clouds. The arithmetic for cost-per-run is simple; the gotchas are not.
The reference SKUs (mid-2024)
| GPU | HBM | FP32 TFLOP/s | BF16 TFLOP/s (tensor) | FP8 TFLOP/s | NVLink BW | TDP | List price |
|---|---|---|---|---|---|---|---|
| A100 80GB | 80 GB HBM2e | 19.5 | 312 | — | 600 GB/s | 400 W | $10–15k (used) |
| H100 SXM5 | 80 GB HBM3 | 67 | 989 | 1979 | 900 GB/s | 700 W | $25–40k |
| H200 SXM5 | 141 GB HBM3e | 67 | 989 | 1979 | 900 GB/s | 700 W | $30–45k |
| B200 | 192 GB HBM3e | ~80 | ~2250 | ~4500 | 1800 GB/s | 1000 W | $40–60k |
(Sources: NVIDIA H100 datasheet, NVIDIA H200 announcement Nov 2023, B200 Blackwell datasheet Mar 2024.)
Key shifts:
- A100 → H100: 3.2× bf16 throughput, 1.5× NVLink BW.
- H100 → H200: same compute, 1.76× HBM (80 → 141 GB), critical for fitting 70B+ models per device.
- H200 → B200: ~2.3× bf16 throughput, 2× NVLink, ~35% larger HBM. Shipping in volume late 2024 / early 2025.
Cloud $/GPU-hour (spot vs on-demand, mid-2024)
| GPU | Provider | On-demand $/hr | Spot $/hr (lowest quartile) | Notes |
|---|---|---|---|---|
| A100 80GB | Lambda Labs | $1.79 | $1.10 | Single-GPU reserved |
| A100 80GB | RunPod community | $1.89 | $0.79 | Spot availability spotty |
| A100 80GB | vast.ai | $1.30–2.00 | $0.40–0.80 | Wide variance, cheaper if pickier |
| H100 SXM5 | Lambda Labs | $3.29 | $2.49 | 8× nodes |
| H100 SXM5 | CoreWeave | $4.25 | — | On-demand only |
| H100 PCIe | RunPod community | $2.69 | $1.99 | PCIe ~70% of SXM5 throughput |
| H200 SXM5 | Lambda Labs | $3.99 | — | New, scarce |
(Sources: provider pricing pages as of 2024-06. Spot prices fluctuate ±30% intra-week.)
Spot vs on-demand: spot can be reclaimed by the cloud at <2 min notice. For X1's single 24-hour run, this is the principal risk. Mitigations:
1. Checkpoint every 30 min (cost: ~5% throughput).
2. --resume-from-latest flag in the trainer.
3. Cron a watchdog that auto-relaunches if the instance dies.
For an 8-GPU H100 multi-day run, spot is rarely usable — the probability any node gets reclaimed across the run approaches 1. Frontier labs use reserved capacity at 30-50% discount vs on-demand.
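To see why the reclaim probability approaches 1 on longer or larger runs, here is a minimal sketch under an independence assumption. The function name `p_any_reclaim` and the 2% per-instance-hour reclaim rate are hypothetical placeholders, not provider statistics; real spot reclaims are correlated, so treat the output as a rough guide only.

```python
def p_any_reclaim(num_instances: int, hours: float, hourly_reclaim_prob: float) -> float:
    """Probability that at least one spot instance is reclaimed during the run.

    Assumes reclaims are independent across instances and across hours
    (a simplification; real spot markets are correlated).
    """
    survive_one_instance = (1.0 - hourly_reclaim_prob) ** hours
    return 1.0 - survive_one_instance ** num_instances


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 0.02  # assumed per-instance-hour reclaim rate, not measured
    for instances, hours in [(1, 24), (8, 72), (128, 72)]:
        p = p_any_reclaim(instances, hours, HYPOTHETICAL_RATE)
        print(f"{instances:>4} instances, {hours:>3} h: P(any reclaim) = {p:.3f}")
```

Under these assumptions a single 24-hour instance survives more often than not, while a multi-instance, multi-day run is almost certain to see at least one reclaim, which is the argument for frequent checkpoints or reserved capacity.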
MFU: the metric that matters
MFU = Model FLOPs Utilization = (sustained model FLOPs/s) / (hardware peak FLOPs/s).
The Chinchilla paper reported MFU ≈ 0.45 on TPUv4. Frontier-lab training runs typically hit:
| Config | MFU | Source |
|---|---|---|
| GPT-3 (V100s, fp32) | ~0.15 | Brown 2020 |
| PaLM (TPUv4) | 0.46 | Chowdhery 2022 |
| MT-NLG (A100 + Megatron) | 0.30 | Smith 2022 |
| Llama-2 (A100 + Megatron) | 0.40 | Touvron 2023b |
| Llama-3 (H100 + custom) | 0.41 | Meta 2024 (model card) |
| Mosaic LLM Foundry, A100 | 0.50 | mosaicml/llm-foundry README |
Why is MFU not 1.0?
- Memory bandwidth. Most ops in a transformer block are memory-bound on modern accelerators (BW utilization, not FLOP utilization, is the bottleneck for small batches).
- Communication overhead. All-reduce and all-gather take time the GPU is not doing matmul.
- Pipeline bubbles. PP schedules have synchronous "bubble" periods.
- Kernel inefficiency. Pure PyTorch ops are 30-60% MFU; FlashAttention-2 + `torch.compile` brings it to 40-50%; hand-tuned Megatron + Transformer Engine reaches 50-60%.
X1 lab MFU target: 0.40 on 1× A100 80GB with FlashAttention-2 + bf16 + torch.compile. Lower is a bug; 0.45+ is a stretch.
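One way to check a run against this target is to convert measured throughput into MFU. The sketch below assumes the 6-FLOPs-per-parameter-per-token approximation used in this section's worked examples (it ignores attention FLOPs); the function names are illustrative, not part of any trainer.

```python
def mfu_from_throughput(n_params: float, tokens_per_sec: float,
                        peak_flops_per_gpu: float, n_gpus: int = 1) -> float:
    """MFU = sustained model FLOP/s divided by hardware peak FLOP/s.

    Sustained model FLOP/s is approximated as 6 * N * tokens/s
    (forward + backward for a dense decoder, attention FLOPs ignored).
    """
    sustained = 6.0 * n_params * tokens_per_sec
    return sustained / (n_gpus * peak_flops_per_gpu)


def tokens_per_sec_for_target(n_params: float, target_mfu: float,
                              peak_flops_per_gpu: float, n_gpus: int = 1) -> float:
    """Throughput the trainer must sustain to reach a target MFU."""
    return target_mfu * n_gpus * peak_flops_per_gpu / (6.0 * n_params)


if __name__ == "__main__":
    A100_BF16_PEAK = 312e12  # FLOP/s, from the SKU table
    needed = tokens_per_sec_for_target(5e7, 0.40, A100_BF16_PEAK)
    print(f"tokens/s needed for MFU 0.40 on a 50M model: {needed:,.0f}")
    print(f"MFU at that throughput: {mfu_from_throughput(5e7, needed, A100_BF16_PEAK):.2f}")
```

Logging tokens/s from the training loop and passing it through `mfu_from_throughput` gives a quick pass/fail against the 0.40 target.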
The cost formula
For a single dense decoder-only model:
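\[
\text{time (s)} = \frac{6ND}{\text{MFU} \cdot G \cdot P}, \qquad \text{cost} = \frac{\text{time (s)}}{3600} \cdot G \cdot \text{price/hr}
\]

where:
- \(N\) = parameters (excluding embeddings)
- \(D\) = training tokens
- \(G\) = number of GPUs
- \(P\) = peak per-GPU bf16 throughput in FLOP/s (the table's TFLOP/s × \(10^{12}\); × 3600 for per-hour)
- \(\text{MFU}\) = model FLOPs utilization, as defined above
- \(\text{price/hr}\) = $-per-GPU-hour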
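The cost arithmetic (time = 6ND / (MFU · G · P), cost = hours × G × price) can be sketched as a small helper. This is a minimal sketch assuming ideal scaling across GPUs; `estimate_run` and `RunEstimate` are illustrative names, not an existing API.

```python
from dataclasses import dataclass


@dataclass
class RunEstimate:
    hours: float       # wall-clock hours
    gpu_hours: float   # hours * number of GPUs
    cost_usd: float    # gpu_hours * price per GPU-hour


def estimate_run(n_params: float, n_tokens: float, n_gpus: int,
                 peak_flops_per_gpu: float, mfu: float,
                 price_per_gpu_hour: float) -> RunEstimate:
    """Estimate wall-clock time and cost for a dense decoder-only run."""
    total_flops = 6.0 * n_params * n_tokens              # C = 6ND
    cluster_flops_per_sec = mfu * n_gpus * peak_flops_per_gpu
    hours = total_flops / cluster_flops_per_sec / 3600.0
    gpu_hours = hours * n_gpus
    return RunEstimate(hours, gpu_hours, gpu_hours * price_per_gpu_hour)


if __name__ == "__main__":
    # 50M params, 43B tokens, one A100 at MFU 0.40 and $1.10/hr.
    est = estimate_run(5e7, 4.3e10, 1, 312e12, 0.40, 1.10)
    print(f"{est.hours:.1f} h, ${est.cost_usd:.2f}")
```

Running the same helper before every launch makes the dollar figure an explicit input to the experiment rather than a surprise afterwards.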
Worked example 1: X1 lab 00
\(N = 5 \times 10^7\), \(D = 4.3 \times 10^{10}\), \(G = 1\), \(P = 312 \times 10^{12}\) FLOP/s, MFU = 0.40, price = $1.10/hr (A100 80GB spot Lambda).
Time = \(6 \cdot 5 \times 10^7 \cdot 4.3 \times 10^{10} / (0.40 \cdot 1 \cdot 312 \times 10^{12})\) = \(1.29 \times 10^{19} / 1.25 \times 10^{14}\) = \(1.03 \times 10^5\) s = 28.7 hours.
Cost = \(28.7 \times \$1.10 = \$31.6\).
So lab 00 is "1× A100 80GB spot at $1.10/hr for ~24-29 hours = ~$26-32." Hard cap $35 with $3-9 buffer.
Worked example 2: a 7B Chinchilla-optimal run
\(N = 7 \times 10^9\), \(D = 1.4 \times 10^{12}\), \(G = 1024\) H100, \(P = 989 \times 10^{12}\), MFU = 0.45, price = $2.49/hr (H100 spot Lambda × 1024).
Total FLOPs needed: \(C = 6ND = 6 \cdot 7 \times 10^9 \cdot 1.4 \times 10^{12} = 5.88 \times 10^{22}\). Cluster FLOP/s: \(G \cdot \text{MFU} \cdot P = 1024 \cdot 0.45 \cdot 989 \times 10^{12} = 4.56 \times 10^{17}\) FLOP/s. Time: \(C / (\text{cluster FLOP/s}) = 5.88 \times 10^{22} / 4.56 \times 10^{17} = 1.29 \times 10^5\) s ≈ 35.8 hours. This assumes ideal scaling across all 1024 GPUs.

That is a day and a half, not 25 days. As a sanity check, compare against the number reported for Llama-2 7B on 2T tokens (Touvron 2023b): "184k A100-hours." \(2T / 1.4T = 1.43\), so 7B on 1.4T tokens ≈ 130k A100-hours.

130k A100-hours / 1024 GPUs ≈ 127 hours ≈ 5 days on 1024 A100s. The 36-hour estimate above is for 1024 H100s, which have roughly 3× the bf16 peak of an A100, so the two figures are consistent. Cross-check: 36 h × 3 ≈ 108 h ≈ 4.5 days on A100s ✓ (matches Llama-2 within tolerance).
Cost at $2.49/hr × 1024 GPUs × 36 hours = **~$92k**. Add a 10% buffer for spot reclaims and ablations: **~$100k**.
That's the Chinchilla-optimal 7B. Most modern 7B runs over-train to 2T+ tokens (the Llama-2/3 path), at ~$200k–$1M of compute depending on the cluster discount.
Worked example 3: the 7B-for-$1M math
The claim to check: "train 7B for 1.4T tokens at MFU 0.45 on 1024×H100 = ~25 days = ~$1M at $0.04/H100-hour spot."
Start from the stipulated numbers. Note that $0.04/H100-hour is far below every spot and on-demand rate in the pricing table above, so it is not a realistic per-GPU-hour price:
- Cluster FLOP/s: \(1024 \cdot 0.45 \cdot 989 \times 10^{12} = 4.56 \times 10^{17}\)
- C = \(6 \cdot 7 \times 10^9 \cdot 1.4 \times 10^{12} = 5.88 \times 10^{22}\)
- Time = \(5.88 \times 10^{22} / 4.56 \times 10^{17} = 1.29 \times 10^5\) s = 35.8 h.
That gives 36 h, not 25 days. Reaching 25 days (~600 h, about 17× longer) with this arithmetic requires either (a) MFU ≈ 0.03, (b) overtraining to ~23T tokens, or (c) a much smaller cluster (~60 GPUs).
If we instead run 1.4T tokens on 256 H100s (more typical of a non-frontier lab):
- Time = \(5.88 \times 10^{22} / (256 \cdot 0.45 \cdot 989 \times 10^{12}) = 5.88 \times 10^{22} / 1.14 \times 10^{17} = 5.16 \times 10^5\) s ≈ 143 h ≈ 6 days.
- Cost at $2.49/hr spot × 256 × 143 h = **$91k**.
- Cost at $0.04/hr × 256 × 143 h = **$1.5k**, which is clearly the wrong scale: $0.04/hr is ~60× below retail spot, so it cannot be a real per-GPU-hour rate.
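The 256-GPU and 1024-GPU estimates land at nearly the same dollar cost because, under ideal scaling, total GPU-hours do not depend on the GPU count; only wall-clock time does. A minimal sketch of that observation (the helper name `gpu_hours` is illustrative, and real clusters lose some MFU as they grow):

```python
def gpu_hours(n_params: float, n_tokens: float,
              peak_flops_per_gpu: float, mfu: float) -> float:
    """Total GPU-hours for 6ND FLOPs; independent of how many GPUs share the work."""
    return 6.0 * n_params * n_tokens / (mfu * peak_flops_per_gpu) / 3600.0


if __name__ == "__main__":
    total = gpu_hours(7e9, 1.4e12, 989e12, 0.45)
    price = 2.49  # $/GPU-hour, the H100 spot rate quoted in the pricing table
    for n_gpus in (256, 1024):
        wall_clock = total / n_gpus
        print(f"{n_gpus:>4} GPUs: {wall_clock:6.1f} h wall-clock, "
              f"{total:,.0f} GPU-h, ${total * price:,.0f}")
```

The practical consequence is that cluster size is mostly a latency decision, while tokens, MFU, and $/GPU-hour set the bill.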
The lesson: the $1M-per-7B figure widely quoted assumes overtraining to ~15T tokens and/or retail spot prices, not the Chinchilla-optimal 1.4T. The exact arithmetic is:
| Scenario | Tokens | GPUs | $/hr/GPU | Time | Cost |
|---|---|---|---|---|---|
| 7B Chinchilla-optimal, 256× H100 retail spot | 1.4T | 256 | $2.49 | 6 d | $92k |
| 7B Llama-2-shape, 1024× A100 retail spot | 2T | 1024 | $1.10 | 8 d | $216k |
| 7B Llama-3-shape, 2048× H100 retail spot | 15T | 2048 | $2.49 | 8 d | $0.98M |
| 7B Chinchilla, 1024× H100 reserved (hyperscaler) | 1.4T | 1024 | $1.00 (est.) | 36 h | $37k |
The "$1M for a 7B" is the order-of-magnitude figure for a modern over-trained 7B at retail spot. Reserved-capacity hyperscalers (an internal cluster at Anthropic / Meta / Google) are 3-10× cheaper.
Bandwidth and communication cost
Compute is half the story. The other half is moving tensors between GPUs.
Intra-node (NVLink/NVSwitch):
- A100 NVLink 3: 600 GB/s per GPU (300 GB/s each direction).
- H100 NVLink 4: 900 GB/s per GPU.
- B200 NVLink 5: 1800 GB/s per GPU.

Inter-node (InfiniBand):
- NDR400 IB: 400 Gb/s = 50 GB/s per port. Single-port: ~50 GB/s. Multi-port: up to 400 GB/s with rail-optimized topology.
- Ethernet (RoCE v2): comparable line rate (up to 400 Gb/s), with worse tail latency.

The communication ratio. For a model with \(N\) parameters and a TP group of size \(T\):
- Each step's all-reduce (gradient sync, DDP): \(\sim 2N\) bytes (fp16) moved per GPU.
- Each step's all-gather (TP forward): \(\sim N/T\) bytes × num-layers per step.
Implication. Tensor parallel must stay inside one NVLink domain (1 node = 8 GPUs typically). Pipeline parallel can cross nodes because PP only sends micro-batch activations at PP-boundary layers. ZeRO/FSDP is more communication-heavy than DDP and benefits from NVLink.
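The gap between NVLink and InfiniBand can be made concrete with a bandwidth-only estimate of a ring all-reduce, which moves 2(n−1)/n times the payload per rank. This is a rough sketch: it ignores latency, overlap with compute, and the fact that achieved bandwidth is below the headline figures quoted above. `ring_allreduce_seconds` is an illustrative helper, not an NCCL API.

```python
def ring_allreduce_seconds(payload_bytes: float, n_ranks: int,
                           bandwidth_bytes_per_sec: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce.

    Each rank sends and receives 2 * (n - 1) / n * payload bytes in total
    (reduce-scatter phase followed by all-gather phase).
    """
    if n_ranks < 2:
        return 0.0
    traffic = 2.0 * (n_ranks - 1) / n_ranks * payload_bytes
    return traffic / bandwidth_bytes_per_sec


if __name__ == "__main__":
    grad_bytes = 7e9 * 2  # 7B parameters, 2-byte gradients
    links = {
        "H100 NVLink 4 (900 GB/s headline)": 900e9,
        "NDR400 IB, single port (50 GB/s)": 50e9,
    }
    for name, bw in links.items():
        t = ring_allreduce_seconds(grad_bytes, 8, bw)
        print(f"{name}: {t * 1000:.0f} ms per gradient all-reduce")
```

Even as a lower bound, the order-of-magnitude difference between the two links is what pushes bandwidth-hungry collectives inside the node.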
For Llama-2 70B on 1024 A100s: TP=8 (intra-node), PP=8 (cross-node), DP=16 (outer), so TP × PP × DP = 8 × 8 × 16 = 1024 GPUs. This is the "DP × TP × PP" 3D-parallelism config.
Energy cost (often forgotten)
A100 TDP: 400 W. H100 TDP: 700 W. A 24-h single-GPU A100 run = 9.6 kWh ≈ ~$1 at residential rates or ~$0.40 at data-center rates. For X1 lab 00, energy is negligible vs compute price.
For an 8-GPU H100 24h run: 700 W × 8 × 24 h = 134 kWh ≈ $13 at $0.10/kWh. Still only a few percent of the compute bill at the retail rates above.
For a 1024-GPU H100 month-long run: 700 W × 1024 × 30 × 24 = 516 MWh ≈ ~$50k at $0.10/kWh. This is on the order of 5-10% of the compute bill for a frontier-scale run, and it's why hyperscalers care about PUE and where they build data centers.
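The three energy estimates above follow the same arithmetic, sketched here. It assumes every GPU draws its full TDP for the whole run and ignores host CPUs, networking, and cooling overhead (PUE); `energy_cost` is an illustrative helper and the default $0.10/kWh is the rate used in the text.

```python
def energy_cost(tdp_watts: float, n_gpus: int, hours: float,
                usd_per_kwh: float = 0.10) -> tuple[float, float]:
    """Return (energy in kWh, cost in USD), assuming GPUs run at TDP throughout."""
    kwh = tdp_watts * n_gpus * hours / 1000.0
    return kwh, kwh * usd_per_kwh


if __name__ == "__main__":
    runs = [
        ("1x A100, 24 h", 400, 1, 24),
        ("8x H100, 24 h", 700, 8, 24),
        ("1024x H100, 30 days", 700, 1024, 30 * 24),
    ]
    for label, tdp, n_gpus, hours in runs:
        kwh, usd = energy_cost(tdp, n_gpus, hours)
        print(f"{label}: {kwh:,.1f} kWh, ${usd:,.2f}")
```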
(Frontier-lab "carbon footprint of training" estimates: GPT-3 = ~552 t CO₂, BLOOM-176B = ~50 t CO₂. Source: Luccioni et al. 2022. BLOOM was lower because of the French nuclear grid.)
Storage and egress
A frontier training run reads ~10-15TB of tokenized data and writes ~1-10TB of checkpoints. Storage and egress costs:
- S3 / GCS hot storage: $0.02/GB-month. 10 TB-month = $200.
- Egress to GPU node: typically free within same region/cloud, $0.08/GB cross-region. Always co-locate.
- Checkpoint writes during a multi-day run: 10 checkpoints × 100 GB = 1 TB total written. $20 at standard rates.
For X1 lab 00: 200 GB tokenized FineWeb-Edu shards + 5 GB of checkpoints. Storage cost: ~$3 over 24 h. Below the line.
What changes at >10B params¶
At ~10B parameters, single-GPU memory runs out:
- Weights: 10B params × 2 bytes (bf16) = 20 GB.
- Optimizer state: Adam with an fp32 master copy needs 4N + 4N + 4N = 12N bytes (params, m, v), i.e. 120 GB. Mixed precision with 8-bit optimizers cuts this to ~6N = 60 GB, but optimizer state still dominates.
- Weights + optimizer state: ≈ 80 GB even in the reduced ~6N case.
- Activations: ~5-20 GB depending on batch and sequence length.
- Sum: ~100+ GB. Does not fit on an H100 80 GB.
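The memory budget can be tabulated with a few lines of arithmetic. This sketch follows the accounting in the text (weights, optimizer state, activations) and, like the text, leaves out the gradient buffer, which adds roughly another 2N bytes in bf16. `training_memory_gb` is an illustrative helper, and the activation figure is a caller-supplied guess rather than something the code derives.

```python
def training_memory_gb(n_params: float, weight_bytes: int = 2,
                       optimizer_bytes_per_param: int = 12,
                       activations_gb: float = 0.0) -> dict[str, float]:
    """Rough per-replica memory budget in GB (1 GB = 1e9 bytes).

    optimizer_bytes_per_param: 12 for fp32 master weights + Adam m and v,
    ~6 for the reduced 8-bit-optimizer case mentioned in the text.
    """
    weights = n_params * weight_bytes / 1e9
    optimizer = n_params * optimizer_bytes_per_param / 1e9
    total = weights + optimizer + activations_gb
    return {"weights_gb": weights, "optimizer_gb": optimizer,
            "activations_gb": activations_gb, "total_gb": total}


if __name__ == "__main__":
    for label, opt_bytes in [("fp32 Adam state (12N)", 12), ("reduced (~6N)", 6)]:
        budget = training_memory_gb(10e9, optimizer_bytes_per_param=opt_bytes,
                                    activations_gb=20.0)
        fits = "fits" if budget["total_gb"] <= 80 else "does not fit"
        print(f"{label}: {budget['total_gb']:.0f} GB total, {fits} in 80 GB")
```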
At this point pipeline parallel becomes mandatory. The typical configuration:
- TP=8 intra-node (NVLink can carry the all-gather).
- PP=8 across nodes (InfiniBand handles the rare boundary sends).
- ZeRO-1 outermost (shard optimizer state across DP replicas).
Why not ZeRO-3 + DDP only? Because ZeRO-3 requires all-gather of parameters at every forward step. For 70B params, that's 140 GB of weight all-gather per step per DP rank, which saturates even NVLink and cripples MFU. TP + PP keeps the per-step communication local (small activation passes across PP boundaries, small all-reduces inside TP groups).
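A back-of-the-envelope check of the 140 GB figure: dividing the gathered bytes by link bandwidth gives the time spent on a single parameter all-gather pass. This is a bandwidth-only sketch that counts one pass and ignores overlap with compute; `allgather_seconds` is an illustrative name, and the bandwidths are the headline numbers quoted earlier in this section.

```python
def allgather_seconds(n_params: float, bytes_per_param: int,
                      bandwidth_bytes_per_sec: float) -> float:
    """Bandwidth-only time to receive a full copy of the parameters once."""
    return n_params * bytes_per_param / bandwidth_bytes_per_sec


if __name__ == "__main__":
    n_params = 70e9  # 70B parameters at 2 bytes each = 140 GB
    for name, bw in [("NVLink 4, 900 GB/s", 900e9), ("IB single port, 50 GB/s", 50e9)]:
        t = allgather_seconds(n_params, 2, bw)
        print(f"{name}: {t:.2f} s per full parameter all-gather")
```

Whether that time is tolerable depends on how long a step's compute takes and how much of the gather can be overlapped with it, which the sketch does not model.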
The sweet spot for ≥10B: TP + PP + ZeRO-1 (3D parallelism), MFU ~0.4 if tuned.
For X1 we only run 50M, so single-GPU is fine. But the cost math you do in your head for 7B+ is the math above.
Next: theory/03-data-pipelines-at-scale.md.