04 — Datacenter economics: power, PUE, $/MWh, CapEx vs OpEx

A frontier model does not cost what its code costs. It costs the megawatts of its training. Here are the numbers.

The thesis

When an Anthropic infra engineer says "we spent X on this training run," roughly 60% of X is electricity [source: Patterson et al. 2021; SemiAnalysis 2024 cost models]. Hardware amortization, networking, and engineering salaries are the rest. This page is the arithmetic.

Power per GPU

| Chip | TDP | Sustained training power (typical) |
|---|---|---|
| A100 SXM4 | 400 W | ~350 W |
| H100 SXM5 | 700 W | ~650 W |
| H200 SXM | 700 W | ~650 W |
| B200 | 1000 W | ~900 W |
| MI300X | 750 W | ~700 W |
| TPU v5p | ~450 W | [not publicly confirmed] |

[source: NVIDIA H100/H200/B200 datasheets; AMD MI300X datasheet 2023]

Memorize: H100 = 700 W, B200 = 1000 W.

From chip TDP to facility power

You cannot wire up GPUs in isolation. The full stack:

  1. GPU power (the headline number).
  2. CPU + system board per node: ~300-500 W for a DGX host.
  3. Network switches and NICs: ~5-10% of IT load.
  4. Storage: small for training (cached checkpoints).
  5. Cooling: ~15-30% additional draw. This is what PUE captures.

For an 8× H100 DGX node:

  • 8 × 700 W (GPU) + 500 W (host) + 100 W (networking) ≈ 6.2 kW IT load.
  • With PUE 1.2: ~7.4 kW total facility draw per node.
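The node arithmetic above can be sketched in a few lines of Python. The function name `node_power_kw` and its parameter names are illustrative, not from any library; the defaults are the figures used in the worked example.

```python
def node_power_kw(gpus=8, gpu_w=700, host_w=500, net_w=100, pue=1.2):
    """Return (IT load, facility draw) in kW for one node.

    Defaults mirror the 8x H100 DGX example; all names are illustrative.
    """
    it_kw = (gpus * gpu_w + host_w + net_w) / 1000  # watts -> kilowatts
    facility_kw = it_kw * pue  # PUE scales IT load up to total facility draw
    return it_kw, facility_kw


it_kw, facility_kw = node_power_kw()
print(f"IT load: {it_kw:.1f} kW, facility draw: {facility_kw:.1f} kW")
```

Swapping in `gpu_w=1000` gives a rough B200 node under the same host and networking assumptions.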

PUE — Power Usage Effectiveness

\[ \text{PUE} = \frac{\text{Total facility power}}{\text{IT equipment power}} \]

Benchmarks:

| Tier | PUE | Example |
|---|---|---|
| Hyperscale, modern | 1.10-1.15 | Google, Meta latest sites |
| Hyperscale, average | 1.20-1.30 | AWS, Azure typical |
| Enterprise colo | 1.50-1.80 | Older facilities |
| Worst-case (older, hot climate) | 2.0+ | Industry historical average |

[source: Uptime Institute Global Data Center Survey 2024; Google's published PUE ≈ 1.10 across fleet]

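For interview math, use PUE = 1.2 unless told otherwise. It is a reasonable default for a modern hyperscale facility: slightly above the best sites (1.10-1.15) and at the low end of the hyperscale average (1.20-1.30).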
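To get a feel for what the PUE tiers mean in kilowatts, here is a minimal sketch that applies the PUE definition to a fixed IT load. The 1,000 kW IT load is a hypothetical round number chosen for illustration, and the tier labels map to single representative PUE values picked from the ranges in the table.

```python
def facility_kw_from_pue(it_kw, pue):
    """Total facility power implied by PUE = facility / IT."""
    return it_kw * pue


IT_KW = 1000.0  # hypothetical IT load, chosen only to make the overhead easy to read

# One representative PUE per tier (assumed midpoints / edges of the table ranges).
tiers = {
    "hyperscale, modern": 1.10,
    "interview default": 1.20,
    "enterprise colo": 1.50,
    "worst case": 2.00,
}

overhead_kw = {}
for label, pue in tiers.items():
    total = facility_kw_from_pue(IT_KW, pue)
    overhead_kw[label] = total - IT_KW  # cooling, power conversion, lighting, etc.
    print(f"{label:<20} PUE {pue:.2f}: {total:,.0f} kW total, {overhead_kw[label]:,.0f} kW overhead")
```

Reading the overhead column directly: at PUE 2.0, every kilowatt of compute costs a second kilowatt of non-IT draw.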

The 1024-GPU cluster, in megawatts

  • GPU power: 1024 × 700 W = 716.8 kW.
  • Plus hosts/networking: ~75 kW (10%) → ~792 kW IT load.
  • Plus cooling at PUE 1.2: 792 × 1.2 ≈ 950 kW facility ≈ 1 MW.
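The same three steps generalize to any cluster size. The sketch below is illustrative (the function and parameter names are made up here) and uses an exact 10% host/networking overhead, so it lands at about 946 kW for 1024 GPUs rather than the rounded 950 kW above.

```python
def cluster_power_kw(n_gpus, gpu_w=700, overhead_frac=0.10, pue=1.2):
    """Return (GPU power, IT load, facility power) in kW.

    overhead_frac is the host + networking share on top of GPU power
    (10% here, matching the rule of thumb in the text).
    """
    gpu_kw = n_gpus * gpu_w / 1000
    it_kw = gpu_kw * (1 + overhead_frac)
    facility_kw = it_kw * pue
    return gpu_kw, it_kw, facility_kw


for n in (1024, 25_000):
    gpu_kw, it_kw, facility_kw = cluster_power_kw(n)
    print(f"{n:>6} GPUs: GPU {gpu_kw:,.0f} kW, IT {it_kw:,.0f} kW, "
          f"facility {facility_kw / 1000:.2f} MW")
```

For 25,000 GPUs this gives roughly 23 MW, in the same ballpark as the ~25 MW rule of thumb.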

A "1 MW cluster" is a 1024-H100 cluster, to first order. Memorize this scaling.

A frontier training run on 25,000 H100s? ~25 MW — comparable to a small town.

$/MWh — the electricity bill

| Region | Industrial $/MWh | $/kWh |
|---|---|---|
| US Pacific NW (cheap hydro) | $40-60 | $0.04-0.06 |
| US average | $80-100 | $0.08-0.10 |
| US Bay Area / Northeast | $150-200 | $0.15-0.20 |
| EU industrial (post-2022) | $100-180 | $0.10-0.18 |
| Iceland (cheap geothermal) | $50-70 | $0.05-0.07 |

[source: US EIA industrial electricity rates 2024; Eurostat 2024]

For interview math, use $0.10/kWh.

Cost of a 10-day run on 1024 H100s

  • Facility power: 950 kW.
  • Duration: 10 days = 240 h.
  • Energy: 950 kW × 240 h = 228 MWh.
  • Energy cost at $0.10/kWh: 228,000 kWh × $0.10 = **$22,800**.
  • Energy cost at $0.05/kWh (cheap hydro): **$11,400**.

This is just electricity. Cloud rental ($3/H100-hour) for the same run: 1024 × 240 × $3 = **$737,280**. The cloud premium is enormous because it includes hardware amortization, networking, ops, and margin.
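A small sketch of the run-cost comparison, assuming the 950 kW facility draw and the $3/H100-hour rental rate quoted above. `run_costs` and its argument names are illustrative only.

```python
def run_costs(facility_kw=950, days=10, usd_per_kwh=0.10,
              n_gpus=1024, cloud_usd_per_gpu_hour=3.0):
    """Electricity bill vs. cloud rental for one training run."""
    hours = days * 24
    energy_kwh = facility_kw * hours
    return {
        "energy_mwh": energy_kwh / 1000,
        "electricity_usd": energy_kwh * usd_per_kwh,
        "cloud_usd": n_gpus * hours * cloud_usd_per_gpu_hour,
    }


base = run_costs()
hydro = run_costs(usd_per_kwh=0.05)
print(f"Energy: {base['energy_mwh']:.0f} MWh")
print(f"Electricity at $0.10/kWh: ${base['electricity_usd']:,.0f}")
print(f"Electricity at $0.05/kWh: ${hydro['electricity_usd']:,.0f}")
print(f"Cloud rental: ${base['cloud_usd']:,.0f} "
      f"({base['cloud_usd'] / base['electricity_usd']:.0f}x the electricity bill)")
```

The ratio in the last line makes the point numerically: rental is roughly 30 times the raw electricity cost for this run.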

If you own the cluster, your cost picture is different.

CapEx vs OpEx — when does owning pay off?

CapEx of a 1024-H100 cluster

  • 1024 × H100 SXM5 GPUs: ~$30,000 each (2024 retail) → **$30.7 M**.
  • 128 × DGX H100 chassis (host + 8 GPU + 4 NVSwitch): retail ~$350k, of which ~$240k is GPU. Add ~$110k/chassis for the rest → 128 × $110k = **$14 M**.
  • InfiniBand fabric: NDR switches + cables for 1024-GPU fat-tree → ~$4-6 M.
  • Storage + auxiliary: ~$2 M.
  • Total: ~$50-55 M CapEx.
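Summing the line items above as a quick check. The variable names are illustrative, and the low/high split simply carries the $4-6 M fabric range through to the total.

```python
N_GPUS = 1024
GPUS_PER_CHASSIS = 8

gpu_usd = N_GPUS * 30_000                               # ~$30k per H100 (2024 retail, per the text)
chassis_usd = (N_GPUS // GPUS_PER_CHASSIS) * 110_000    # non-GPU part of each DGX chassis
fabric_usd = (4_000_000, 6_000_000)                     # InfiniBand fabric, low and high estimate
storage_usd = 2_000_000                                 # storage + auxiliary

capex_low = gpu_usd + chassis_usd + fabric_usd[0] + storage_usd
capex_high = gpu_usd + chassis_usd + fabric_usd[1] + storage_usd

print(f"GPUs:    ${gpu_usd / 1e6:.1f} M")
print(f"Chassis: ${chassis_usd / 1e6:.1f} M")
print(f"Total:   ${capex_low / 1e6:.1f} M to ${capex_high / 1e6:.1f} M")
```

The itemized sum sits at the lower end of the ~$50-55 M range, which leaves some headroom for items not listed here.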

Amortized over 4 years (typical accelerator lifetime)

  • CapEx per year: ~$12.5 M.
  • Energy per year (continuous use): 8760 h × 950 kW × $0.10/kWh = **$832 k/year**.
  • Datacenter rent + ops + cooling overhead: ~$2 M/year (rule of thumb).
  • Total annual OpEx + amortized CapEx: ~$15.3 M/year for 1024 H100s running 24/7.

Per GPU-hour: $15.3M / (1024 × 8760) = **$1.70 / GPU-hour**.
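The amortized rate can be sketched as a single function. Everything here uses the figures above; the `utilization` parameter is an addition for illustration (it is not in the original calculation, which assumes 24/7 use), and for simplicity it leaves the energy bill at full draw.

```python
HOURS_PER_YEAR = 8760


def owned_usd_per_gpu_hour(capex_usd=50e6, amortization_years=4,
                           facility_kw=950, usd_per_kwh=0.10,
                           other_opex_usd=2e6, n_gpus=1024,
                           utilization=1.0):
    """Amortized cost per useful GPU-hour for an owned cluster.

    utilization is a hypothetical knob: the fraction of GPU-hours
    that do useful work. Costs are assumed fixed regardless.
    """
    annual_capex = capex_usd / amortization_years
    annual_energy = facility_kw * HOURS_PER_YEAR * usd_per_kwh
    annual_total = annual_capex + annual_energy + other_opex_usd
    useful_gpu_hours = n_gpus * HOURS_PER_YEAR * utilization
    return annual_total / useful_gpu_hours


full = owned_usd_per_gpu_hour()
half = owned_usd_per_gpu_hour(utilization=0.5)
print(f"24/7 use:        ${full:.2f} / GPU-hour")
print(f"50% utilization: ${half:.2f} / GPU-hour")
```

Without rounding the annual total to $15.3 M, the 24/7 figure comes out near $1.71. At 50% utilization the rate doubles and exceeds the cloud price, which is why ownership only pays off when the cluster stays busy.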

Compare to cloud spot rate ~$2.50-3/H100-hour (2024-2025). The cloud premium is ~50-80%. This is why labs at scale (Anthropic, OpenAI, Meta, Microsoft) own or co-lease, not rent.

[source: SemiAnalysis cluster TCO model 2024; AWS / RunPod / Lambda pricing pages 2024]

Why a frontier model's cost is 60% energy (over its lifetime)

Take a 25 MW cluster, 4-year amortization:

  • Energy over 4 years (24/7): 25,000 kW × 8760 h × 4 × $0.06/kWh (hyperscaler rate) ≈ **$53 M**.
  • Hardware amortization: the cluster CapEx for 25,000 H100s is ~$1.2 B. Over 4 years: ~$300 M/year.
  • On this fully amortized basis, energy is the smaller share (about $53 M against ~$1.2 B of hardware). For a single intense training run (where the cluster is at 100% utilization for months) with the hardware treated as already paid for, energy dominates the marginal cost.

The often-cited "60% energy" comes from amortizing hardware over many runs but counting energy as marginal to this run. Frontier labs run their clusters near 100% — so the marginal calculation is the real one, and energy is the bigger lever.
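The fully amortized comparison above can be checked directly. This sketch only reproduces that side of the argument, using the figures from the bullets; it does not attempt to model the marginal-cost framing, which depends on what you treat as sunk.

```python
HOURS_PER_YEAR = 8760

facility_kw = 25_000      # 25 MW cluster
years = 4                 # amortization window
usd_per_kwh = 0.06        # hyperscaler rate from the text
capex_usd = 1.2e9         # ~$1.2 B for 25,000 H100s, per the text

energy_usd = facility_kw * HOURS_PER_YEAR * years * usd_per_kwh
energy_share = energy_usd / (energy_usd + capex_usd)

print(f"Energy over {years} years: ${energy_usd / 1e6:.1f} M")
print(f"Hardware per year:    ${capex_usd / years / 1e6:.0f} M")
print(f"Energy share of (energy + hardware): {energy_share:.1%}")
```

On these inputs the energy share is a few percent, which is why the accounting choice (amortized vs. marginal) changes the headline so much.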

Three numbers an ML engineer should always have ready

For interview reflex:

  1. 1 H100 = 700 W TDP. So 1024 H100s ≈ 0.7 MW of GPU power, ~1 MW of facility power at PUE 1.2.
  2. 1 H100 ≈ 1 PFLOP/s dense BF16 (≈ 2 PFLOP/s dense FP8). So 1024 H100s ≈ 1 EFLOP/s of peak BF16, or ≈ 2 EFLOP/s of peak FP8 (real MFU ~40-50%).
  3. Cloud H100 ≈ $3/hour spot, owned H100 ≈ $1.70/hour amortized.

What this means strategically

  • Why labs co-locate near cheap power: a 25 MW frontier cluster in Bay Area ($0.18/kWh) vs. Pacific NW ($0.05/kWh) is a ~$28 M/year difference (25,000 kW × 8760 h × $0.13/kWh). Microsoft, Meta, and Google all chase cheap hydroelectric.
  • Why Anthropic talks about "compute" as a strategic resource: at frontier scale, owning the cluster is strictly cheaper than renting, if you have the capital and the utilization to justify it. Compute partnerships (e.g. Anthropic + AWS Trainium) are partly about price and partly about supply security.
  • Why FP8 / FP4 matter so much: doubling effective FLOPS without doubling power is a free 2× on the dominant cost.
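The siting argument in the first bullet is a one-line calculation. A minimal sketch, assuming a constant 25 MW facility draw all year and the two regional rates quoted there (the function name is illustrative):

```python
def annual_energy_usd(facility_kw, usd_per_kwh, hours=8760):
    """Yearly electricity bill for a constant facility draw."""
    return facility_kw * hours * usd_per_kwh


FACILITY_KW = 25_000  # 25 MW frontier cluster

rates = {"Bay Area": 0.18, "Pacific NW": 0.05}
bills = {region: annual_energy_usd(FACILITY_KW, rate) for region, rate in rates.items()}
gap_usd = bills["Bay Area"] - bills["Pacific NW"]

for region, bill in bills.items():
    print(f"{region:<11} ${bill / 1e6:.2f} M/year")
print(f"Difference: ${gap_usd / 1e6:.2f} M/year")
```

The gap scales linearly with cluster size, so a 100 MW site at the same rates would see four times the difference.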

References

  • Patterson D. et al. 2021, Carbon Emissions and Large Neural Network Training, arXiv:2104.10350.
  • Patterson D. et al. 2022, The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink, IEEE Computer.
  • SemiAnalysis, AI Datacenter TCO Model, 2024.
  • Uptime Institute, Global Data Center Survey, 2024.
  • US EIA, Electric Power Monthly, 2024.
  • Google, Environmental Report 2024 (PUE disclosures).