English · Español

02 — Cluster Economics¶

🇪🇸 Una corrida de pre-entrenamiento se decide en cuatro variables: TFLOPs por GPU, $/hora, MFU real, y costo de comunicación. Si no sabes calcular el costo en dólares antes de lanzar, no entiendes el experimento.

The frontier-LLM economy runs on three GPU SKUs and ~five clouds. The arithmetic for cost-per-run is simple, the gotchas are not.

The reference SKUs (mid-2024)¶

GPU	HBM	FP32 TFLOP/s	BF16 TFLOP/s (tensor)	FP8 TFLOP/s	NVLink BW	TDP	List price
A100 80GB	80 GB HBM2e	19.5	312	—	600 GB/s	400 W	$10–15k (used)
H100 SXM5	80 GB HBM3	67	989	1979	900 GB/s	700 W	$25–40k
H200 SXM5	141 GB HBM3e	67	989	1979	900 GB/s	700 W	$30–45k
B200	192 GB HBM3e	~80	~2250	~4500	1800 GB/s	1000 W	$40–60k

(Sources: NVIDIA H100 datasheet, NVIDIA H200 announcement Nov 2023, B200 Blackwell datasheet Mar 2024.)

Key shifts: - A100 → H100: 3.2× bf16 throughput, 1.5× NVLink BW. - H100 → H200: same compute, 1.76× HBM (80 → 141 GB), critical for fitting 70B+ models per device. - H200 → B200: ~2.3× bf16 throughput, 2× NVLink, 35% larger HBM. Shipping in volume late 2024 / early 2025.

Cloud $/GPU-hour (spot vs on-demand, mid-2024)¶

GPU	Provider	On-demand $/hr	Spot $/hr (lowest quartile)	Notes
A100 80GB	Lambda Labs	$1.79	$1.10	Single-GPU reserved
A100 80GB	RunPod community	$1.89	$0.79	Spot availability spotty
A100 80GB	vast.ai	$1.30–2.00	$0.40–0.80	Wide variance, cheaper if pickier
H100 SXM5	Lambda Labs	$3.29	$2.49	8× nodes
H100 SXM5	CoreWeave	$4.25	—	On-demand only
H100 PCIe	RunPod community	$2.69	$1.99	PCIe ~70% of SXM5 throughput
H200 SXM5	Lambda Labs	$3.99	—	New, scarce

(Sources: provider pricing pages as of 2024-06. Spot prices fluctuate ±30% intra-week.)

Spot vs on-demand: spot can be reclaimed by the cloud at <2 min notice. For X1's single 24-hour run, this is the principal risk. Mitigations: 1. Checkpoint every 30 min (cost: ~5% throughput). 2. --resume-from-latest flag in the trainer. 3. Cron a watchdog that auto-relaunches if the instance dies.

For an 8-GPU H100 multi-day run, spot is rarely usable — the probability any node gets reclaimed across the run approaches 1. Frontier labs use reserved capacity at 30-50% discount vs on-demand.

MFU: the metric that matters¶

MFU = Model FLOPs Utilization = (sustained model FLOPs/s) / (hardware peak FLOPs/s).

The Chinchilla paper reported MFU ≈ 0.45 on TPUv4. Frontier-lab training runs typically hit:

Config	MFU	Source
GPT-3 (V100s, fp32)	~0.15	Brown 2020
PaLM (TPUv4)	0.46	Chowdhery 2022
MT-NLG (A100 + Megatron)	0.30	Smith 2022
Llama-2 (A100 + Megatron)	0.40	Touvron 2023b
Llama-3 (H100 + custom)	0.41	Meta 2024 (model card)
Mosaic LLM Foundry, A100	0.50	mosaicml/llm-foundry README

Why is MFU not 1.0?

Memory bandwidth. Most ops in a transformer block are memory-bound on modern accelerators (BW utilization, not FLOP utilization, is the bottleneck for small batches).
Communication overhead. All-reduce and all-gather take time the GPU is not doing matmul.
Pipeline bubbles. PP schedules have synchronous "bubble" periods.
Kernel inefficiency. Pure PyTorch ops are 30-60% MFU; FlashAttention-2 + torch.compile brings it to 40-50%; hand-tuned Megatron + Transformer Engine reaches 50-60%.

X1 lab MFU target: 0.40 on 1× A100 80GB with FlashAttention-2 + bf16 + torch.compile. Lower is a bug; 0.45+ is a stretch.

The cost formula¶

For a single dense decoder-only model:

\[\text{Cost} \;=\; \frac{6 \cdot N \cdot D}{\text{MFU} \cdot G \cdot P} \cdot \text{price/hr}\]

where: - $N$ = parameters (excluding embeddings) - $D$ = training tokens - $G$ = number of GPUs - $P$ = peak per-GPU bf16 TFLOP/s (× 3600 for per-hour) - $\text{price/hr}$ = $-per-GPU-hour

Worked example 1: X1 lab 00¶

$N = 5 \times 10^7$, $D = 4.3 \times 10^{10}$, $G = 1$, $P = 312 \times 10^{12}$ FLOP/s, MFU = 0.40, price = $1.10/hr (A100 80GB spot Lambda).

Time = $6 \cdot 5 \times 10^7 \cdot 4.3 \times 10^{10} / (0.40 \cdot 1 \cdot 312 \times 10^{12})$ = $1.29 \times 10^{19} / 1.25 \times 10^{14}$ = $1.03 \times 10^5$ s = 28.7 hours.

Cost = $28.7 \times \$1.10 = \$31.6$.

So lab 00 is "1× A100 80GB spot at $1.10/hr for ~24-29 hours = ~$26-32." Hard cap $35 with $3-9 buffer.

Worked example 2: a 7B Chinchilla-optimal run¶

$N = 7 \times 10^9$, $D = 1.4 \times 10^{12}$, $G = 1024$ H100, $P = 989 \times 10^{12}$, MFU = 0.45, price = $2.49/hr (H100 spot Lambda × 1024).

Time = $6 \cdot 7 \times 10^9 \cdot 1.4 \times 10^{12} / (0.45 \cdot 1024 \cdot 989 \times 10^{12})$ = $5.88 \times 10^{22} / 4.56 \times 10^{17}$ = $1.29 \times 10^5$ s × per-GPU; divided by 1024-parallel... wait — the formula assumes ideal scaling. Let me redo:

Total FLOPs needed: $C = 6ND = 5.88 \times 10^{22}$. Cluster FLOP/s: $G \cdot \text{MFU} \cdot P = 1024 \cdot 0.45 \cdot 989 \times 10^{12} = 4.56 \times 10^{17}$ FLOP/s. Time: $C / (\text{cluster FLOP/s}) = 5.88 \times 10^{22} / 4.56 \times 10^{17} = 1.29 \times 10^5$ s ≈ 35.8 hours.

Hm — that's a day and a half, not 25 days. Let me reconcile against the Llama-2 7B-on-2T-tokens reported number (Touvron 2023b): "184k A100-hours." $2T / 1.4T = 1.43$, so 7B-on-1.4T-tokens ≈ 130k A100-hours.

130k A100-hours / 1024 GPUs ≈ 127 hours ≈ 5 days on 1024 A100s. My H100 number says 36 hours, which is roughly right because H100s are 3× faster (and the calc was on 1024 H100s, not A100s). Cross-check: 36h × 3 ≈ 108h ≈ 4.5 days on A100s ✓ (matches Llama-2 within tolerance).

Cost at $2.49/hr × 1024 GPUs × 36 hours = **~$92k. Add 10% buffer for spot reclaims and ablations = ~$100k**.

That's the Chinchilla-optimal 7B. Most modern 7B trains over-train to 2T+ tokens (Llama-⅔ path), at ~$200k–$1M of compute depending on the cluster discount.

Worked example 3: the 7B-for-$1M math¶

The user-requested "train 7B for 1.4T tokens at MFU 0.45 on 1024×H100 = ~25 days = ~$1M at $0.04/H100-hour spot" math:

Let me do this with the user-stipulated numbers (which assume $0.04/H100-hour — that's reserved-capacity hyperscaler pricing, not retail spot):

Cluster FLOP/s: $1024 \cdot 0.45 \cdot 989 \times 10^{12} = 4.56 \times 10^{17}$
C = $6 \cdot 7 \times 10^9 \cdot 1.4 \times 10^{12} = 5.88 \times 10^{22}$
Time = $5.88 \times 10^{22} / 4.56 \times 10^{17} = 1.29 \times 10^5$ s = 35.8 h.

Hmm — getting 36 h not 25 days. The "25 days" figure assumes either (a) MFU 0.05, or (b) overtraining to ~15T tokens, or © much smaller cluster.

If we instead scale to 1.4T tokens on 256 H100s (more typical of a non-frontier lab): - Time = $5.88 \times 10^{22} / (256 \cdot 0.45 \cdot 989 \times 10^{12}) = 5.88 \times 10^{22} / 1.14 \times 10^{17}$ = $5.16 \times 10^5$ s ≈ 143 h ≈ 6 days. - Cost at $2.49/hr spot × 256 × 143 = **$91k. At $0.04/hr reserved-rate × 256 × 143 = **$1.5k (clearly wrong scale — $0.04 is per-second equivalent or a heavy enterprise discount).

The lesson: the $1M-per-7B figure widely quoted assumes overtraining to ~15T tokens and/or retail spot prices, not the Chinchilla-optimal 1.4T. The exact arithmetic is:

Scenario	Tokens	GPUs	$/hr/GPU	Time	Cost
7B Chinchilla-optimal, 256× H100 retail spot	1.4T	256	$2.49	6 d	$92k
7B Llama-2-shape, 1024× A100 retail spot	2T	1024	$1.10	8 d	$216k
7B Llama-3-shape, 2048× H100 retail spot	15T	2048	$2.49	16 d	$2.0M
7B Chinchilla, 1024× H100 reserved (hyperscaler)	1.4T	1024	$1.00 (est.)	36 h	$37k

The "$1M for a 7B" is the order-of-magnitude figure for a modern over-trained 7B at retail spot. Reserved-capacity hyperscalers (an internal cluster at Anthropic / Meta / Google) are 3-10× cheaper.

Bandwidth and communication cost¶

Compute is half the story. The other half is moving tensors between GPUs.

Intra-node (NVLink/NVSwitch): - A100 NVLink 3: 600 GB/s per GPU (300 GB/s each direction). - H100 NVLink 4: 900 GB/s per GPU. - B200 NVLink 5: 1800 GB/s per GPU.

Inter-node (Infiniband): - NDR400 IB: 400 Gb/s = 50 GB/s per port. Single-port: ~50 GB/s. Multi-port: up to 400 GB/s with rail-optimized topology. - Ethernet (RoCE v2): up to 400 Gb/s comparable, with worse tail latency.

The communication ratio. For a model with $N$ parameters and a TP group of size $T$: - Each step's all-reduce (gradient sync, DDP): $\sim 2 N$ bytes (fp16) moved per GPU. - Each step's all-gather (TP forward): $\sim N/T$ bytes × num-layers per step.

Implication. Tensor parallel must stay inside one NVLink domain (1 node = 8 GPUs typically). Pipeline parallel can cross nodes because PP only sends micro-batch activations at PP-boundary layers. ZeRO/FSDP is more communication-heavy than DDP and benefits from NVLink.

For Llama-2 70B on 1024 A100s: TP=8 (intra-node), PP=8 (cross-node), DP=16 (outer). The "DP × TP × PP" 3D-parallelism config.

Energy cost (often forgotten)¶

A100 TDP: 400 W. H100 TDP: 700 W. A 24-h single-GPU A100 run = 9.6 kWh ≈ ~$1 at residential rates or ~$0.40 at data-center rates. For X1 lab 00, energy is negligible vs compute price.

For an 8-GPU H100 24h run: 700 W × 8 × 24 = 134 kWh ≈ $13. Still ~5% of the compute bill.

For a 1024-GPU H100 month-long run: 700 W × 1024 × 30 × 24 = 516 MWh ≈ ~$50k at $0.10/kWh. This is on the order of 5-10% of the compute bill for a frontier-scale run, and it's why hyperscalers care about PUE and where they build data centers.

(Frontier-lab "carbon footprint of training" estimates: GPT-3 = ~552 t CO₂, BLOOM-176B = ~50 t CO₂. Source: Luccioni et al. 2022. Bloom was lower because of French nuclear grid.)

Storage and egress¶

A frontier training run reads ~10-15TB of tokenized data and writes ~1-10TB of checkpoints. Storage and egress costs:

S3 / GCS hot storage: $0.02/GB-month. 10 TB-month = $200.
Egress to GPU node: typically free within same region/cloud, $0.08/GB cross-region. Always co-locate.
Checkpoint writes during a multi-day run: 10 checkpoints × 100 GB = 1 TB total written. $20 at standard rates.

For X1 lab 00: 200 GB tokenized FineWeb-Edu shards + 5 GB of checkpoints. Storage cost: ~$3 over 24 h. Below the line.

What changes at >10B params¶

At ~10B parameters, single-GPU memory runs out: - 10B params × 2 bytes (bf16) = 20 GB weights - ×2 (Adam m, v) = 60 GB optimizer state in fp32 ≈ 80 GB total (Adam optimizer in fp32 needs 4N + 4N + 4N = 12N bytes for params/m/v in fp32 master copy; with mixed precision and 8-bit optimizers this drops to ~6N, but the math is dominated) - Activations: ~5-20 GB depending on batch and seq - Sum: ~100+ GB. Does not fit on H100 80 GB.

At this point pipeline parallel becomes mandatory. The typical configuration: - TP=8 intra-node (NVLink can carry the all-gather). - PP=8 across nodes (Infiniband handles the rare boundary sends). - ZeRO-1 outermost (shard optimizer state across DP replicas).

Why not ZeRO-3 + DDP only? Because ZeRO-3 requires all-gather of parameters at every forward step. For 70B params, that's 140 GB of weight all-gather per step per DP rank, which saturates even NVLink and cripples MFU. TP + PP keeps the per-step communication local (small activation passes across PP boundaries, small all-reduces inside TP groups).

The sweet spot for ≥10B: TP + PP + ZeRO-1 (3D parallelism), MFU ~0.4 if tuned.

For X1 we only run 50M, so single-GPU is fine. But the cost math you do in your head for 7B+ is the math above.

Next: theory/03-data-pipelines-at-scale.md.