04 — Datacenter economics: power, PUE, $/MWh, CapEx vs OpEx
A frontier model doesn't cost what its code costs. It costs the megawatts of its training. Here are the numbers.
The thesis
When an Anthropic infra engineer says "we spent X on this training run," roughly 60% of X is electricity [source: Patterson et al. 2021; SemiAnalysis 2024 cost models]. Hardware amortization, networking, and engineering salaries are the rest. This page is the arithmetic.
Power per GPU
| Chip | TDP | Sustained training power (typical) |
|---|---|---|
| A100 SXM4 | 400 W | ~350 W |
| H100 SXM5 | 700 W | ~650 W |
| H200 SXM | 700 W | ~650 W |
| B200 | 1000 W | ~900 W |
| MI300X | 750 W | ~700 W |
| TPU v5p | ~450 W [not publicly confirmed] | — |
[source: NVIDIA H100/H200/B200 datasheets; AMD MI300X datasheet 2023]
Memorize: H100 = 700 W, B200 = 1000 W.
From chip TDP to facility power
You cannot wire up GPUs in isolation. The full stack:
- GPU power (the headline number).
- CPU + system board per node: ~300-500 W for a DGX host.
- Network switches and NICs: ~5-10% of IT load.
- Storage: small for training (cached checkpoints).
- Cooling: ~15-30% additional draw. This is what PUE captures.
For an 8× H100 DGX node:
- 8 × 700 W (GPU) + 500 W (host) + 100 W (networking) ≈ 6.2 kW IT load.
- With PUE 1.2: ~7.4 kW total facility draw per node.
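The node arithmetic above can be sketched as a small function. The function name and the default values are illustrative, taken from the bullets in this section (700 W per GPU, 500 W host, 100 W networking, PUE 1.2); they are not measured figures.

```python
def node_facility_kw(gpus=8, gpu_w=700.0, host_w=500.0, net_w=100.0, pue=1.2):
    """Return (IT load, facility draw) in kW for one node.

    All defaults are the rule-of-thumb numbers from the text, not measurements.
    """
    it_kw = (gpus * gpu_w + host_w + net_w) / 1000.0
    return it_kw, it_kw * pue


it_kw, facility_kw = node_facility_kw()
print(f"IT load: {it_kw:.2f} kW, facility draw: {facility_kw:.2f} kW")
# IT load: 6.20 kW, facility draw: 7.44 kW
```

Swapping in the sustained power column (for example `gpu_w=650.0`) gives a lower, arguably more realistic draw than the TDP-based figure.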
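PUE — Power Usage Effectiveness
PUE is total facility power divided by IT equipment power, so a PUE of 1.2 means 20% of overhead (mostly cooling and power conversion) on top of the IT load. Benchmarks: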
| Tier | PUE | Example |
|---|---|---|
| Hyperscale, modern | 1.10-1.15 | Google, Meta latest sites |
| Hyperscale, average | 1.20-1.30 | AWS, Azure typical |
| Enterprise colo | 1.50-1.80 | Older facilities |
| Worst-case (older, hot climate) | 2.0+ | Industry historical average |
[source: Uptime Institute Global Data Center Survey 2024; Google's published PUE ≈ 1.10 across fleet]
For interview math, use PUE = 1.2 unless told otherwise. It is a round, slightly conservative figure for a modern hyperscale site.
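To see what the tiers mean in watts, here is a minimal sketch that applies one representative PUE per tier to the 6.2 kW node from the previous section. The single value chosen for each tier is an assumption (roughly the low end of each range in the table), and `overhead_kw` is a hypothetical helper name.

```python
# One representative PUE per tier; these picks are assumptions, not survey data.
PUE_TIERS = {
    "hyperscale, modern": 1.10,
    "hyperscale, average": 1.20,
    "enterprise colo": 1.50,
    "worst case": 2.00,
}


def overhead_kw(it_kw, pue):
    """Non-IT draw (cooling, power conversion, lighting) implied by a PUE."""
    return it_kw * (pue - 1.0)


NODE_IT_KW = 6.2  # 8x H100 node IT load from the text

for tier, pue in PUE_TIERS.items():
    extra = overhead_kw(NODE_IT_KW, pue)
    print(f"{tier:22s} PUE {pue:.2f}: +{extra:.2f} kW overhead per node")
```

At PUE 2.0 the overhead equals the IT load itself, which is why the facility tier matters as much as the chip choice.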
The 1024-GPU cluster, in megawatts
- GPU power: 1024 × 700 W = 716.8 kW.
- Plus hosts/networking: ~75 kW (10%) → ~792 kW IT load.
- Plus cooling at PUE 1.2: 792 × 1.2 ≈ 950 kW facility ≈ 1 MW.
A "1 MW cluster" is a 1024-H100 cluster, to first order. Memorize this scaling.
A frontier training run on 25,000 H100s? ~25 MW — comparable to a small town.
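The scaling rule can be checked with a short sketch. It assumes 8 GPUs per node and 600 W of host plus networking per node (the per-node figures used earlier on this page), which lands close to the ~75 kW overhead quoted above; `cluster_facility_mw` is an illustrative name.

```python
def cluster_facility_mw(n_gpus, gpu_w=700.0, node_overhead_w=600.0,
                        gpus_per_node=8, pue=1.2):
    """Estimate facility draw in MW for a GPU cluster.

    node_overhead_w is host + networking per node (an assumption from the text).
    """
    nodes = n_gpus / gpus_per_node
    it_w = n_gpus * gpu_w + nodes * node_overhead_w
    return it_w * pue / 1e6


for n in (1024, 25_000):
    print(f"{n:>6} GPUs -> {cluster_facility_mw(n):.2f} MW")
# 1024 GPUs comes out at about 0.95 MW, 25,000 GPUs at about 23 MW
```

Under these assumptions the 25,000-GPU figure is a little below the rounded "~25 MW" in the text; the gap is within the uncertainty of the inputs.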
$/MWh — the electricity bill
| Region | Industrial $/MWh | $/kWh |
|---|---|---|
| US Pacific NW (cheap hydro) | $40-60 | $0.04-0.06 |
| US average | $80-100 | $0.08-0.10 |
| US Bay Area / Northeast | $150-200 | $0.15-0.20 |
| EU industrial (post-2022) | $100-180 | $0.10-0.18 |
| Iceland (cheap geothermal) | $50-70 | $0.05-0.07 |
[source: US EIA industrial electricity rates 2024; Eurostat 2024]
For interview math, use $0.10/kWh.
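A quick way to internalize the table is to convert a rate into the yearly bill for one megawatt of continuous draw. The sketch below does only unit conversion; the helper names and the `utilization` parameter are illustrative.

```python
HOURS_PER_YEAR = 8760


def usd_per_kwh(usd_per_mwh):
    """Convert a $/MWh rate to $/kWh."""
    return usd_per_mwh / 1000.0


def annual_cost_usd(facility_mw, usd_per_mwh, utilization=1.0):
    """Yearly electricity bill for a given continuous facility draw."""
    return facility_mw * HOURS_PER_YEAR * utilization * usd_per_mwh


# 1 MW running all year at the $100/MWh interview default:
print(f"${annual_cost_usd(1.0, 100.0):,.0f} per MW-year")
# $876,000 per MW-year
```

A handy reflex follows from this: at $0.10/kWh, each megawatt of facility power costs a bit under $0.9 M per year.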
Cost of a 10-day run on 1024 H100s
- Facility power: 950 kW.
- Duration: 10 days = 240 h.
- Energy: 950 kW × 240 h = 228 MWh.
- Energy cost at $0.10/kWh: 228,000 kWh × $0.10 = **$22,800**.
- Energy cost at $0.05/kWh (cheap hydro): **$11,400**.
This is just electricity. Cloud rental ($3/H100-hour) for the same run: 1024 × 240 × $3 = **$737,280**. The cloud premium is enormous because it includes hardware amortization, networking, ops, margin.
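The comparison between the electricity bill and the cloud bill for this run can be reproduced with the sketch below. Both function names are hypothetical; the inputs are the 950 kW, 240 h, $0.10/kWh, and $3/GPU-hour figures from this section.

```python
def energy_cost_usd(facility_kw, hours, usd_per_kwh):
    """Electricity cost for a constant facility draw over a run."""
    return facility_kw * hours * usd_per_kwh


def cloud_cost_usd(n_gpus, hours, usd_per_gpu_hour):
    """Rental cost for the same run at a flat per-GPU-hour price."""
    return n_gpus * hours * usd_per_gpu_hour


hours = 10 * 24
electricity = energy_cost_usd(950.0, hours, 0.10)
rental = cloud_cost_usd(1024, hours, 3.0)

print(f"electricity: ${electricity:,.0f}")
print(f"cloud rental: ${rental:,.0f}")
print(f"rental / electricity: {rental / electricity:.0f}x")
```

The ratio is the point: on these inputs, electricity is a few percent of what the same run costs on rented hardware.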
If you own the cluster, your cost picture is different.
CapEx vs OpEx — when does owning pay off?
CapEx of a 1024-H100 cluster
- 1024 × H100 SXM5 GPUs: ~$30,000 each (2024 retail) → **$30.7 M**.
- 128 × DGX H100 chassis (host + 8 GPU + 4 NVSwitch): retail ~$350k, of which ~$240k is GPU. Add ~$110k/chassis for the rest → 128 × $110k = **$14 M**.
- InfiniBand fabric: NDR switches + cables for 1024-GPU fat-tree → ~$4-6 M.
- Storage + auxiliary: ~$2 M.
- Total: ~$50-55 M CapEx.
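The CapEx tally above can be written as a small sketch so the line items are easy to vary. The unit prices are the rough 2024 figures quoted in the list, and `cluster_capex_musd` is an illustrative name; none of this is a vendor quote.

```python
def cluster_capex_musd(n_gpus=1024, gpu_usd=30_000, chassis_rest_usd=110_000,
                       fabric_musd=(4.0, 6.0), storage_musd=2.0,
                       gpus_per_node=8):
    """Return (low, high) CapEx in millions of USD.

    fabric_musd is the low/high InfiniBand estimate from the text.
    """
    nodes = n_gpus // gpus_per_node
    base = (n_gpus * gpu_usd + nodes * chassis_rest_usd) / 1e6 + storage_musd
    return base + fabric_musd[0], base + fabric_musd[1]


low, high = cluster_capex_musd()
print(f"CapEx: ${low:.1f} M to ${high:.1f} M")
# CapEx: $50.8 M to $52.8 M
```

GPUs account for roughly 60% of the total on these inputs, so the GPU unit price is the parameter worth stress-testing first.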
Amortized over 4 years (typical accelerator lifetime)
- CapEx per year: ~$12.5 M.
- Energy per year (continuous use): 8760 h × 950 kW × $0.10/kWh = **$832 k/year**.
- Datacenter rent + ops + cooling overhead: ~$2 M/year (rule of thumb).
- Total annual OpEx + amortized CapEx: ~$15.3 M/year for 1024 H100s running 24/7.
Per GPU-hour: $15.3M / (1024 × 8760) = **$1.70 / GPU-hour**.
Compare to cloud spot rate ~$2.50-3/H100-hour (2024-2025). The cloud premium is ~50-80%. This is why labs at scale (Anthropic, OpenAI, Meta, Microsoft) own or co-lease, not rent.
[source: SemiAnalysis cluster TCO model 2024; AWS / RunPod / Lambda pricing pages 2024]
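The amortized rate is sensitive to utilization, which the sketch below makes explicit. It uses the round inputs from this section ($50 M CapEx, 4 years, 950 kW, $0.10/kWh, $2 M/year rent and ops). The `utilization` parameter is an added assumption: it scales only the useful GPU-hours, while energy is charged as if the cluster draws power continuously.

```python
HOURS_PER_YEAR = 8760


def owned_usd_per_gpu_hour(capex_usd=50e6, years=4, facility_kw=950.0,
                           usd_per_kwh=0.10, rent_ops_usd_per_year=2e6,
                           n_gpus=1024, utilization=1.0):
    """Amortized cost per useful GPU-hour for an owned cluster."""
    annual = (capex_usd / years
              + facility_kw * HOURS_PER_YEAR * usd_per_kwh
              + rent_ops_usd_per_year)
    return annual / (n_gpus * HOURS_PER_YEAR * utilization)


full = owned_usd_per_gpu_hour()
half = owned_usd_per_gpu_hour(utilization=0.5)
print(f"100% utilization: ${full:.2f}/GPU-hour")
print(f" 50% utilization: ${half:.2f}/GPU-hour")
```

At full utilization this gives about $1.71, in line with the ~$1.70 above. At 50% utilization the rate roughly doubles to about $3.4, which is above the quoted cloud price, so ownership only pays off if the cluster stays busy.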
Why a frontier model's cost is 60% energy (over its lifetime)
Take a 25 MW cluster, 4-year amortization:
- Energy over 4 years (24/7): 25,000 kW × 8760 h × 4 × $0.06/kWh (hyperscaler rate) ≈ **$53 M**.
- Hardware amortization (4 years): cluster CapEx for 25,000 H100s is ~$1.2 B, so ~$300 M/year.
- On a fully amortized basis energy is the smaller share: ~$13 M/year against ~$300 M/year of hardware. For a single intense training run (where the cluster is at 100% utilization for months and the hardware is treated as already paid for), energy dominates the marginal cost.
The often-cited "60% energy" comes from amortizing hardware over many runs but counting energy as marginal to this run. Frontier labs run their clusters near 100% — so the marginal calculation is the real one, and energy is the bigger lever.
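The difference between the two accounting views can be made concrete with a short sketch. `hardware_charged` is a hypothetical knob for how much of the hardware cost is attributed to the run (1.0 means fully amortized, 0.0 means treated as sunk); it is an illustration, not something taken from the cited cost models. Only energy and hardware are counted here.

```python
def energy_share(energy_usd, hardware_usd, hardware_charged=1.0):
    """Energy as a fraction of (energy + charged hardware cost)."""
    return energy_usd / (energy_usd + hardware_charged * hardware_usd)


# 25 MW for 4 years at $0.06/kWh, and ~$1.2 B of hardware (figures from the text).
energy_4y = 25_000 * 8760 * 4 * 0.06
capex = 1.2e9

amortized = energy_share(energy_4y, capex)                     # hardware fully charged
marginal = energy_share(energy_4y, capex, hardware_charged=0.0)  # hardware sunk

print(f"energy, 4 years: ${energy_4y / 1e6:.1f} M")
print(f"amortized view: {amortized:.1%} energy")
print(f"marginal view:  {marginal:.0%} energy")
```

The answer to "what fraction is energy" therefore depends almost entirely on how much hardware cost is attributed to the run, which is worth stating explicitly whenever a percentage is quoted.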
Three numbers an ML engineer should always have ready
For interview reflex:
- 1 H100 = 700 W TDP. So 1024 H100s ≈ 0.7 MW of GPU power, ≈ 1 MW of facility power at PUE 1.2 once hosts and networking are included.
- 1 H100 ≈ 1 PFLOP/s dense BF16 (≈ 2 PFLOP/s dense FP8). So 1024 H100s ≈ 1 EFLOP/s of peak BF16 (real MFU ~40-50%).
- Cloud H100 ≈ $3/hour spot, owned H100 ≈ $1.70/hour amortized.
What this means strategically
- Why labs co-locate near cheap power: a 25 MW frontier cluster in the Bay Area ($0.18/kWh) vs. the Pacific NW ($0.05/kWh) is a ~$28 M/year difference (25,000 kW × 8760 h × $0.13/kWh). Microsoft, Meta, and Google all chase cheap hydroelectric.
- Why Anthropic talks about "compute" as a strategic resource: at frontier scale, owning the cluster is strictly cheaper than renting, if you have the capital and the utilization to justify it. Compute partnerships (e.g. Anthropic + AWS Trainium) are partly about price and partly about supply security.
- Why FP8 / FP4 matter so much: doubling effective FLOPS without doubling power is a free 2× on the dominant cost.
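The precision point in the last bullet can be illustrated with a rough energy-per-work calculation. Everything here is an assumption for illustration: a round 1 PFLOP/s per-GPU peak as the baseline, 40% MFU, about 930 W of facility power per GPU (the ~950 kW cluster divided by 1024), and $0.10/kWh. The sketch only shows that doubling throughput at the same power halves the energy bill per unit of compute.

```python
def energy_usd_per_zettaflop(peak_flops, mfu=0.4, facility_w_per_gpu=930.0,
                             usd_per_kwh=0.10):
    """Electricity cost to execute 1e21 FLOPs on one GPU's share of the facility."""
    seconds = 1e21 / (peak_flops * mfu)
    kwh = facility_w_per_gpu / 1000.0 * seconds / 3600.0
    return kwh * usd_per_kwh


baseline = energy_usd_per_zettaflop(1e15)  # round ~1 PFLOP/s peak
doubled = energy_usd_per_zettaflop(2e15)   # same power, twice the peak throughput

print(f"baseline: ${baseline:.2f} per 1e21 FLOPs")
print(f"doubled:  ${doubled:.2f} per 1e21 FLOPs")
```

The same halving applies to amortized hardware cost per FLOP, since the cluster finishes the same work in half the time.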
Cross-links
- 02-h100-and-h200.md: the chip whose power we just budgeted.
- 05-the-accelerator-landscape-2026.md: power efficiency comparison across chips.
- Phase 34 — Observability & Cost: per-token cost accounting at inference time.
References
- Patterson D. et al. 2021, Carbon Emissions and Large Neural Network Training, arXiv:2104.10350.
- Patterson D. et al. 2022, The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink, IEEE Computer.
- SemiAnalysis, AI Datacenter TCO Model, 2024.
- Uptime Institute, Global Data Center Survey, 2024.
- US EIA, Electric Power Monthly, 2024.
- Google, Environmental Report 2024 (PUE disclosures).