04 — Datacenter economics: power, PUE, $/MWh, CapEx vs OpEx

A frontier model does not cost what its code costs. It costs the megawatts of its training. Here are the numbers.

The thesis

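When an Anthropic infra engineer says "we spent X on this training run," a commonly cited figure is that roughly 60% of X is electricity [source: Patterson et al. 2021; SemiAnalysis 2024 cost models]. That share depends heavily on how hardware is amortized, as the lifetime-cost section near the end of this page works through. Hardware amortization, networking, and engineering salaries are the rest. This page is the arithmetic.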

Power per GPU

Chip	TDP	Sustained training power (typical)
A100 SXM4	400 W	~350 W
H100 SXM5	700 W	~650 W
H200 SXM	700 W	~650 W
B200	1000 W	~900 W
MI300X	750 W	~700 W
TPU v5p	~450 W [not publicly confirmed]	—

[source: NVIDIA H100/H200/B200 datasheets; AMD MI300X datasheet 2023]

Memorize: H100 = 700 W, B200 = 1000 W.

From chip TDP to facility power

You cannot wire up GPUs in isolation. The full stack:

GPU power (the headline number).
CPU + system board per node: ~300-500 W for a DGX host.
Network switches and NICs: ~5-10% of IT load.
Storage: small for training (cached checkpoints).
Cooling and other facility overhead (power distribution losses, lighting): ~15-30% additional draw on top of IT load. This is what PUE captures.

For an 8× H100 DGX node:

8 × 700 W (GPU) + 500 W (host) + 100 W (networking) ≈ 6.2 kW IT load.
With PUE 1.2: ~7.4 kW total facility draw per node.
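The node arithmetic above can be sketched as a small function. This is a minimal illustration, not a sizing tool: the function name and parameter names are hypothetical, and the defaults are simply the figures used on this page (8 GPUs at 700 W TDP, 500 W host, 100 W networking, PUE 1.2).

```python
def node_power_w(gpus=8, gpu_tdp_w=700.0, host_w=500.0, network_w=100.0, pue=1.2):
    """Return (IT load, facility draw) in watts for a single node.

    IT load is everything inside the rack; facility draw scales it by PUE
    to account for cooling and other facility overhead.
    """
    it_load_w = gpus * gpu_tdp_w + host_w + network_w
    facility_w = it_load_w * pue
    return it_load_w, facility_w


it_w, facility_w = node_power_w()
print(f"IT load: {it_w / 1000:.1f} kW, facility draw: {facility_w / 1000:.2f} kW")
```

Using TDP rather than sustained training power gives a slightly conservative (high) estimate, which is usually what you want for capacity planning.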

PUE — Power Usage Effectiveness

\[ \text{PUE} = \frac{\text{Total facility power}}{\text{IT equipment power}} \]

Benchmarks:

Tier	PUE	Example
Hyperscale, modern	1.10-1.15	Google, Meta latest sites
Hyperscale, average	1.20-1.30	AWS, Azure typical
Enterprise colo	1.50-1.80	Older facilities
Worst-case (older, hot climate)	2.0+	Industry historical average

[source: Uptime Institute Global Data Center Survey 2024; Google's published PUE ≈ 1.10 across fleet]

For interview math, use PUE = 1.2 unless told otherwise. It sits at the efficient end of the typical hyperscale range and is slightly conservative for the newest sites.

The 1024-GPU cluster, in megawatts

GPU power: 1024 × 700 W = 716.8 kW.
Plus hosts/networking: ~75 kW (10%) → ~792 kW IT load.
Plus cooling at PUE 1.2: 792 × 1.2 ≈ 950 kW facility ≈ 1 MW.

A "1 MW cluster" is a 1024-H100 cluster, to first order. Memorize this scaling.

A frontier training run on 25,000 H100s? ~25 MW — comparable to a small town.
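To check how the "1024 H100s ≈ 1 MW" rule scales, here is a rough sketch that builds the cluster from identical 8-GPU nodes. The 600 W per-node overhead (host plus networking) is carried over from the DGX node example and is an assumption; the function and parameter names are hypothetical.

```python
def cluster_power_kw(num_gpus, gpu_tdp_w=700.0, gpus_per_node=8,
                     node_overhead_w=600.0, pue=1.2):
    """Return (gpu_kw, it_kw, facility_kw) for a cluster of identical nodes."""
    nodes = num_gpus / gpus_per_node
    gpu_kw = num_gpus * gpu_tdp_w / 1000          # GPUs only
    it_kw = gpu_kw + nodes * node_overhead_w / 1000  # add host + networking
    facility_kw = it_kw * pue                     # add cooling and facility overhead
    return gpu_kw, it_kw, facility_kw


for n in (1024, 25_000):
    gpu_kw, it_kw, facility_kw = cluster_power_kw(n)
    print(f"{n:>6} GPUs: GPU {gpu_kw:,.0f} kW, IT {it_kw:,.0f} kW, "
          f"facility {facility_kw / 1000:.2f} MW")
```

Under these inputs the 1024-GPU case lands near 0.95 MW and the 25,000-GPU case near 23 MW, in line with the rounded figures above.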

$/MWh — the electricity bill

Region	Industrial $/MWh	$/kWh
US Pacific NW (cheap hydro)	$40-60	$0.04-0.06
US average	$80-100	$0.08-0.10
US Bay Area / Northeast	$150-200	$0.15-0.20
EU industrial (post-2022)	$100-180	$0.10-0.18
Iceland (cheap geothermal)	$50-70	$0.05-0.07

[source: US EIA industrial electricity rates 2024; Eurostat 2024]

For interview math, use $0.10/kWh.

Cost of a 10-day run on 1024 H100s

Facility power: 950 kW.
Duration: 10 days = 240 h.
Energy: 950 kW × 240 h = 228 MWh.
Energy cost at $0.10/kWh: 228,000 × $0.10 = **$22,800**.
Energy cost at $0.05/kWh (cheap hydro): **$11,400**.

This is just electricity. Cloud rental ($3/H100-hour) for the same run: 1024 × 240 × $3 = **$737,280**, about 32× the electricity bill. The gap is enormous because the rental price also covers hardware amortization, networking, ops, and margin.
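The comparison above is two multiplications, sketched here so the inputs can be varied. The helper names are hypothetical, and the 950 kW facility draw, the $0.10 and $0.05 per kWh rates, and the $3 per GPU-hour rental price are the page's own working figures rather than quotes from any provider.

```python
def run_energy_cost_usd(facility_kw, hours, usd_per_kwh):
    """Electricity cost of a run: kW x h gives kWh, times the tariff."""
    return facility_kw * hours * usd_per_kwh


def cloud_rental_cost_usd(num_gpus, hours, usd_per_gpu_hour):
    """Rental cost of the same run at a flat per-GPU-hour price."""
    return num_gpus * hours * usd_per_gpu_hour


hours = 10 * 24  # 10-day run
energy_default = run_energy_cost_usd(950, hours, 0.10)
energy_hydro = run_energy_cost_usd(950, hours, 0.05)
cloud = cloud_rental_cost_usd(1024, hours, 3.0)

print(f"Electricity at $0.10/kWh: ${energy_default:,.0f}")
print(f"Electricity at $0.05/kWh: ${energy_hydro:,.0f}")
print(f"Cloud rental:             ${cloud:,.0f}")
print(f"Rental / electricity:     {cloud / energy_default:.0f}x")
```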

If you own the cluster, your cost picture is different.

CapEx vs OpEx — when does owning pay off?

CapEx of a 1024-H100 cluster

1024 × H100 SXM5 GPUs: ~$30,000 each (2024 retail) → **$30.7 M**.
128 × DGX H100 chassis (host + 8 GPU + 4 NVSwitch): retail ~$350k, of which ~$240k is GPU. Add ~$110k/chassis for the rest → 128 × $110k = **$14 M**.
InfiniBand fabric: NDR switches + cables for 1024-GPU fat-tree → ~$4-6 M.
Storage + auxiliary: ~$2 M.
Total: ~$50-55 M CapEx.
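The line items above can be tallied as (low, high) pairs to see where the total range comes from. This is only a bookkeeping sketch: the dictionary keys are hypothetical labels, and every dollar figure is the rough estimate given in the list.

```python
# (low, high) estimates in USD for each CapEx line item on this page.
capex_usd = {
    "gpus": (1024 * 30_000, 1024 * 30_000),
    "chassis_non_gpu": (128 * 110_000, 128 * 110_000),
    "infiniband_fabric": (4_000_000, 6_000_000),
    "storage_and_aux": (2_000_000, 2_000_000),
}

low_total = sum(low for low, _ in capex_usd.values())
high_total = sum(high for _, high in capex_usd.values())

for item, (low, high) in capex_usd.items():
    print(f"{item:<20} ${low / 1e6:5.1f} M - ${high / 1e6:5.1f} M")
print(f"{'total':<20} ${low_total / 1e6:5.1f} M - ${high_total / 1e6:5.1f} M")
```

The itemized sum falls in the low $50 M range; the quoted ~$50-55 M total leaves some headroom for items not listed.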

Amortized over 4 years (typical accelerator lifetime)

CapEx per year: ~$12.5 M (taking the $50 M low end of the range).
Energy per year (continuous use): 8760 h × 950 kW × $0.10/kWh = **$832 k/year**.
Datacenter rent + ops + cooling overhead: ~$2 M/year (rule of thumb).
Total annual OpEx + amortized CapEx: ~$15.3 M/year for 1024 H100s running 24/7.

Per GPU-hour: $15.3M / (1024 × 8760) = **$1.70 / GPU-hour**.

Compare to cloud spot rate ~$2.50-3/H100-hour (2024-2025). The cloud premium is ~50-80%. This is why labs at scale (Anthropic, OpenAI, Meta, Microsoft) own or co-lease, not rent.

[source: SemiAnalysis cluster TCO model 2024; AWS / RunPod / Lambda pricing pages 2024]
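The owned-cluster cost per GPU-hour can be sketched as follows. The function name and the `utilization` parameter are illustrative additions, not part of the source model; the sketch assumes energy is billed as if the cluster always draws full power, which overstates energy cost at low utilization.

```python
HOURS_PER_YEAR = 8760


def owned_cost_per_gpu_hour(capex_usd, years, facility_kw, usd_per_kwh,
                            other_opex_per_year, num_gpus, utilization=1.0):
    """Amortized cost per *used* GPU-hour for an owned cluster."""
    capex_per_year = capex_usd / years
    energy_per_year = facility_kw * HOURS_PER_YEAR * usd_per_kwh
    total_per_year = capex_per_year + energy_per_year + other_opex_per_year
    used_gpu_hours = num_gpus * HOURS_PER_YEAR * utilization
    return total_per_year / used_gpu_hours


owned = owned_cost_per_gpu_hour(
    capex_usd=50e6, years=4, facility_kw=950, usd_per_kwh=0.10,
    other_opex_per_year=2e6, num_gpus=1024,
)
print(f"Owned: ${owned:.2f} per GPU-hour")
for cloud_rate in (2.50, 3.00):
    print(f"Cloud at ${cloud_rate:.2f}: premium {cloud_rate / owned - 1:.0%}")
```

Passing `utilization=0.5` roughly doubles the owned cost per used GPU-hour, which is why ownership only pays off when the cluster stays busy.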

Why a frontier model's cost is 60% energy (over its lifetime)

Take a 25 MW cluster, 4-year amortization:

Energy over 4 years (24/7): 25,000 kW × 8760 h × 4 × $0.06/kWh (hyperscaler rate) ≈ **$53 M**.
Hardware amortization (4 years): CapEx for 25,000 H100s is ~$1.2 B, or ~$300 M/year.
On this fully amortized basis, energy (~$53 M) is only about 4% of the four-year total, even at 24/7 use. For a single intense training run on hardware that is already paid for (the cluster at 100% utilization for months), hardware is a sunk cost and energy dominates the marginal cost.

The often-cited "60% energy" comes from amortizing hardware over many runs but counting energy as marginal to this run. Frontier labs run their clusters near 100% — so the marginal calculation is the real one, and energy is the bigger lever.
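How much the energy share moves with the accounting choice can be made concrete. The sketch below uses only the figures from this section (25 MW, $0.06/kWh, 4 years, ~$1.2 B CapEx) and ignores every cost other than energy and hardware; the function name is hypothetical and the result is an illustration of the sensitivity, not a reconstruction of any published cost model.

```python
def energy_share(energy_usd, hardware_usd_charged):
    """Fraction of (energy + hardware charged) that is energy."""
    return energy_usd / (energy_usd + hardware_usd_charged)


energy_4y = 25_000 * 8760 * 4 * 0.06   # kW x h/year x years x $/kWh
hardware_capex = 1.2e9

# Fully amortized: all hardware CapEx is charged alongside 4 years of energy.
full_share = energy_share(energy_4y, hardware_capex)

# How much hardware cost could be charged while energy still makes up 60%?
target = 0.60
hardware_for_target = energy_4y * (1 - target) / target

print(f"Energy over 4 years: ${energy_4y / 1e6:.1f} M")
print(f"Energy share, fully amortized: {full_share:.1%}")
print(f"Hardware charge giving a 60% share: ${hardware_for_target / 1e6:.1f} M "
      f"({hardware_for_target / hardware_capex:.1%} of CapEx)")
```

With these inputs, a 60% energy share only appears when a small slice of the hardware cost is attributed to the run, which is the marginal-cost framing described above.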

Three numbers an ML engineer should always have ready

For interview reflex:

1 H100 = 700 W TDP. So 1024 H100s ≈ 0.72 MW of GPU power, ≈ 1 MW of facility power (hosts and networking included) at PUE 1.2.
1 H100 ≈ 1 PFLOP/s dense BF16 (≈ 2 PFLOP/s dense FP8). So 1024 H100s ≈ 1 EFLOP/s of peak BF16, ≈ 2 EFLOP/s of peak FP8 (real MFU ~40-50%).
Cloud H100 ≈ $3/hour spot, owned H100 ≈ $1.70/hour amortized.

What this means strategically

Why labs co-locate near cheap power: a 25 MW frontier cluster in Bay Area ($0.18/kWh) vs. Pacific NW ($0.05/kWh) is a ~$28M/year difference (25,000 kW × 8760 h × $0.13/kWh). Microsoft, Meta, and Google all chase cheap hydroelectric.
Why Anthropic talks about "compute" as a strategic resource: at frontier scale, owning the cluster is strictly cheaper than renting, if you have the capital and the utilization to justify it. Compute partnerships (e.g. Anthropic + AWS Trainium) are partly about price and partly about supply security.
Why FP8 / FP4 matter so much: doubling effective FLOPS without doubling power is a free 2× on the dominant cost.
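The siting argument in the first point reduces to one multiplication per region. A minimal sketch, assuming a constant 25 MW facility draw all year and the two example tariffs; the function name is hypothetical.

```python
HOURS_PER_YEAR = 8760


def annual_energy_cost_usd(facility_kw, usd_per_kwh):
    """Yearly electricity bill for a constant facility draw."""
    return facility_kw * HOURS_PER_YEAR * usd_per_kwh


facility_kw = 25_000  # 25 MW cluster
bay_area = annual_energy_cost_usd(facility_kw, 0.18)
pacific_nw = annual_energy_cost_usd(facility_kw, 0.05)

print(f"Bay Area:   ${bay_area / 1e6:.1f} M/year")
print(f"Pacific NW: ${pacific_nw / 1e6:.1f} M/year")
print(f"Difference: ${(bay_area - pacific_nw) / 1e6:.1f} M/year")
```

Under these inputs the gap works out to roughly $28 M per year; it scales linearly with both cluster size and the tariff spread.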

Cross-links

02-h100-and-h200.md: the chip whose power we just budgeted.
05-the-accelerator-landscape-2026.md: power efficiency comparison across chips.
Phase 34 — Observability & Cost: per-token cost accounting at inference time.

References

Patterson D. et al. 2021, Carbon Emissions and Large Neural Network Training, arXiv:2104.10350.
Patterson D. et al. 2022, The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink, IEEE Computer.
SemiAnalysis, AI Datacenter TCO Model, 2024.
Uptime Institute, Global Data Center Survey, 2024.
US EIA, Electric Power Monthly, 2024.
Google, Environmental Report 2024 (PUE disclosures).