
05 — The accelerator landscape, 2026 edition¶

An honest guide to the market: who makes what, what each vendor does well, and where the software-stack trap lies.

Why you need this map¶

In 2026, "buy GPUs" is not a single decision. It is a multi-axis choice across compute, memory, network, software maturity, supply availability, and price. A senior ML engineer should be able to argue the choice, not just default to "H100 because that's what we know."

This page is the honest, vendor-neutral landscape.

The big seven (training-capable accelerators)¶

| Vendor | Chip | Year | FP8 / FP16 peak | HBM | Interconnect | Stack | Niche |
|---|---|---|---|---|---|---|---|
| NVIDIA | H100 | 2022 | 1979 / 989 TF | 80 GB | NVLink 4 (900 GB/s) | CUDA, PyTorch native | Default; mature |
| NVIDIA | H200 | 2024 | 1979 / 989 TF | 141 GB | NVLink 4 (900 GB/s) | CUDA, PyTorch native | Memory-bound inference |
| NVIDIA | B200 | 2024-2025 | 4500 / 2250 TF; 9000 TF FP4 | 192 GB | NVLink 5 (1.8 TB/s) | CUDA, Transformer Engine v2 | Frontier training |
| AMD | MI300X | 2023 | 2614 / 1307 TF | 192 GB | Infinity Fabric (896 GB/s) | ROCm, PyTorch (improving) | Large-memory serving |
| Intel | Gaudi 3 | 2024 | 1835 / 1835 TF | 128 GB | 24× 200 GbE on-chip | SynapseAI, PyTorch via Habana | Price-competitive training |
| Google | TPU v5p | 2023 | -- / 459 TF BF16 | 95 GB | ICI 3D-torus | JAX/XLA, PyTorch via torch_xla | TPU-Pod scale training |
| AWS | Trainium 2 | 2024 | -- / ~650 TF | 96 GB | NeuronLink v3 | Neuron SDK, XLA, PyTorch | AWS-internal training |

[source: respective vendor whitepapers cited in §01 and §02]
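Peak FLOPS only matter once a kernel does enough arithmetic per byte fetched from HBM. As a rough sketch, the snippet below divides the peak FP16/BF16 figures in the table by the HBM bandwidths quoted in the per-chip sections of this page, giving each chip's roofline ridge point in FLOP per byte. The H100 bandwidth of 3.35 TB/s is an assumption, back-derived from the "58% more than H100" comparison in the MI300X section. H200 and B200 are left out because this page quotes no bandwidth for them.

```python
# Roofline ridge point: the arithmetic intensity (FLOP per byte of HBM traffic)
# above which a kernel is limited by compute rather than by memory bandwidth.

# (peak FP16/BF16 TFLOP/s, HBM bandwidth in TB/s), as quoted on this page.
CHIPS = {
    "H100": (989, 3.35),       # bandwidth is an assumption, see the lead-in
    "MI300X": (1307, 5.3),
    "Gaudi 3": (1835, 3.7),
    "TPU v5p": (459, 2.76),    # BF16
    "Trainium 2": (650, 2.9),  # approximate peak
}


def ridge_point(peak_tflops, hbm_tb_per_s):
    """FLOP per byte at which compute time equals memory time."""
    return peak_tflops / hbm_tb_per_s  # the tera prefixes cancel


ridge = {name: ridge_point(*spec) for name, spec in CHIPS.items()}

for name, value in sorted(ridge.items(), key=lambda kv: kv[1]):
    print(f"{name:<11} {value:6.0f} FLOP/byte")
```

A lower ridge point means more kernels can reach peak compute; a higher one means the chip waits on memory more often. These are paper numbers, not measured throughput.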

NVIDIA Blackwell — the 2025 frontier¶

Already covered in §02. The headline:

FP4 with the second-generation Transformer Engine doubles peak throughput again: 9000 TF, versus 4500 TF for FP8.
GB200 NVL72 makes the rack (72 GPUs, 130 TB/s NVLink domain) the unit of compute.
Liquid cooling becomes mandatory at 1000 W per chip.
Supply-constrained. Long delivery times.
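The rack-level numbers above follow from the per-GPU figures. A minimal sketch of the arithmetic, using only values quoted on this page (72 GPUs, 1.8 TB/s NVLink 5 per GPU, 1000 W per chip); the power figure covers GPUs only and ignores CPUs, switches, and cooling overhead:

```python
GPUS_PER_RACK = 72
NVLINK5_TB_PER_S = 1.8  # per-GPU NVLink 5 bandwidth from the table
WATTS_PER_GPU = 1000


def rack_nvlink_tb_per_s(gpus=GPUS_PER_RACK, per_gpu=NVLINK5_TB_PER_S):
    """Aggregate NVLink bandwidth of one rack-scale NVLink domain."""
    return gpus * per_gpu


def rack_gpu_power_kw(gpus=GPUS_PER_RACK, watts=WATTS_PER_GPU):
    """GPU-only power draw of the rack, in kilowatts."""
    return gpus * watts / 1000


print(f"NVLink domain: {rack_nvlink_tb_per_s():.1f} TB/s")
print(f"GPU power:     {rack_gpu_power_kw():.0f} kW")
```

The aggregate comes out just under the 130 TB/s headline, and the GPU-only power figure shows why air cooling stops being an option at rack scale.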

Software stack: CUDA + cuDNN + PyTorch + Transformer Engine. The most mature stack by a wide margin.

AMD MI300X — the credible challenger¶

192 GB HBM3: the highest memory capacity of any single chip in this table, matched only by B200. A 70B-parameter FP16 model (about 140 GB of weights) fits on one GPU.
5.3 TB/s HBM bandwidth — 58% more than H100.
2614 TF FP8 dense (vs H100's 1979).
8 chiplets (XCDs) per package.

Strength: memory-bound workloads (long-context inference, KV-cache-heavy MoE) where you trade FLOPS for HBM. Microsoft, Meta, and Oracle are publicly deploying MI300X.
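To see why capacity and bandwidth dominate here, a back-of-the-envelope sketch helps. It checks whether a 70B FP16 model fits on one chip and estimates a ceiling on single-stream decode speed, assuming every generated token reads all weights once, batch size 1, and no KV-cache traffic. The H100 figures (80 GB, 3.35 TB/s) are assumptions: capacity from the table, bandwidth back-derived from the "58% more" comparison above.

```python
def weights_gb(params_billions, bytes_per_param):
    """Weight footprint in GB (1e9 parameters times bytes, over 1e9 bytes per GB)."""
    return params_billions * bytes_per_param


def decode_ceiling_tokens_per_s(hbm_tb_per_s, model_gb):
    """Upper bound on batch-1 decode speed when reading weights is the bottleneck."""
    return hbm_tb_per_s * 1000 / model_gb


MODEL_GB = weights_gb(70, 2)  # 70B parameters at 2 bytes each (FP16)

# name -> (HBM capacity in GB, HBM bandwidth in TB/s)
CHIPS = {"MI300X": (192, 5.3), "H100": (80, 3.35)}

for name, (capacity_gb, bandwidth) in CHIPS.items():
    if MODEL_GB <= capacity_gb:
        ceiling = decode_ceiling_tokens_per_s(bandwidth, MODEL_GB)
        headroom = capacity_gb - MODEL_GB
        print(f"{name}: fits, {headroom} GB left for KV cache, "
              f"at most ~{ceiling:.0f} tokens/s per stream")
    else:
        print(f"{name}: does not fit on one chip ({MODEL_GB} GB > {capacity_gb} GB)")
```

Real throughput will be lower than this ceiling, and batching changes the picture, but the sketch shows that the comparison is decided by bytes rather than FLOPS.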

Weakness: ROCm is improving fast but is still 12-18 months behind CUDA on:

Compiler optimization for novel ops.
Triton support.
FlashAttention-style hand-tuned kernels.
Multi-vendor tooling.

Realistic assessment: a frontier lab can ship on MI300X if it staffs an internal kernel team. A startup probably cannot.

[source: AMD CDNA 3 whitepaper 2023; Microsoft + AMD MI300X deployment announcements 2024]

Intel Gaudi 3 — the integrated-RDMA play¶

1835 TF dense BF16/FP8 — competitive with H100 on paper.
128 GB HBM2e @ 3.7 TB/s.
The unique feature: 24× 200 GbE RoCE RDMA ports integrated on the chip. No separate NIC. Every Gaudi is its own network endpoint.
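The port count translates into scale-out bandwidth as follows. This is a minimal sketch of the unit conversion only (24 ports at 200 Gb/s each, as stated above); it ignores Ethernet and RoCE protocol overhead, so usable bandwidth will be lower.

```python
PORTS = 24
GBIT_PER_PORT = 200  # line rate of each on-chip Ethernet port, in Gb/s


def ethernet_gb_per_s(ports=PORTS, gbit_per_port=GBIT_PER_PORT):
    """Aggregate one-way line rate in GB/s (8 bits per byte)."""
    return ports * gbit_per_port / 8


one_way = ethernet_gb_per_s()
two_way = 2 * one_way  # full-duplex links carry this much in both directions combined

print(f"one-way: {one_way:.0f} GB/s, bidirectional: {two_way:.0f} GB/s")
```

When setting this against the NVLink and Infinity Fabric figures in the table, check whether each vendor quotes one-way or two-way bandwidth before drawing conclusions.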

Strength: the integrated networking dramatically reduces fabric CapEx (~30%, vendor claim). PyTorch via habana_frameworks.torch is solid for transformer training. Intel is pricing aggressively as the new entrant.

Weakness: SynapseAI lags ROCm, which in turn lags CUDA. Smaller community → less third-party tooling. Less proven at very large scale.

[source: Intel Gaudi 3 Whitepaper 2024]

Google TPU v5p — the JAX-native option¶

459 TF BF16 / 918 TF INT8 per chip.
95 GB HBM @ 2.76 TB/s.
TPU v5p Pod: up to 8960 chips in a 3D-torus with optical reconfiguration. The largest single-instance compute system in any cloud.
Programming: JAX + XLA is first-class; PyTorch via torch_xla works but lags.

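Strength: the compiler shards automatically. You annotate array shardings under jax.jit and XLA inserts the collectives (jax.shard_map is the manual, per-device alternative). A 3D torus is bandwidth-optimal for AllReduce on a Pod. Google trains Gemini here.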
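One reason a 3D torus scales well is that worst-case paths stay short as the Pod grows. The sketch below counts worst-case hops in a torus with wraparound links and compares it with a single ring over the same number of chips. The 16 × 20 × 28 shape is an assumption chosen for illustration because it multiplies out to the 8960 chips quoted above.

```python
import math


def torus_diameter(dims):
    """Worst-case hop count between two chips in a torus with wraparound links.

    Along each dimension of size d, the farthest chip is d // 2 hops away.
    """
    return sum(d // 2 for d in dims)


POD_DIMS = (16, 20, 28)  # assumed illustrative shape, not a vendor-confirmed layout
chips = math.prod(POD_DIMS)

torus_hops = torus_diameter(POD_DIMS)
ring_hops = torus_diameter((chips,))  # a ring is a one-dimensional torus

print(f"{chips} chips: 3D torus worst case {torus_hops} hops, "
      f"single ring worst case {ring_hops} hops")
```

Hop count is only the latency half of the story; the bandwidth argument also relies on each chip having links in all three dimensions that collectives can use concurrently.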

Weakness: Google Cloud only (no on-prem). Custom kernels are hard: Pallas exists but is new, so you are largely at the compiler's mercy. JAX skill is required.

[source: Google TPU v5p announcement December 2023; Jouppi et al. 2023]

AWS Trainium 2 — the cloud-native price play¶

~650 TF FP16, 96 GB HBM @ 2.9 TB/s.
Sold in trn2.48xlarge instances (16 chips/instance) and UltraClusters of 100k+ chips.
AWS prices Trainium ~30-40% below H100 on the same training workload [source: AWS re:Invent 2024 keynote, vendor claim].

Strength: if the vendor's 30-40% figure holds, the lowest $/training-FLOP on AWS. Anthropic publicly trains on Trainium.

Weakness: Neuron SDK is improving but still requires workload-specific tuning. PyTorch support comes through torch-neuronx rather than natively. Limited to AWS; no other cloud offers Trainium.
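The price advantage has to be weighed against the tuning effort. A minimal break-even sketch, assuming the 30-40% discount applies to the full cost of the run; the dollar amounts (a $1,000,000 and a $500,000 H100-equivalent run, $250,000 of porting and tuning effort) are hypothetical and only illustrate the shape of the calculation:

```python
def savings_usd(h100_run_cost_usd, discount):
    """Money saved by running on Trainium at the given fractional discount."""
    return h100_run_cost_usd * discount


def worth_porting(h100_run_cost_usd, discount, porting_cost_usd):
    """True when the savings exceed the one-off cost of porting and tuning."""
    return savings_usd(h100_run_cost_usd, discount) > porting_cost_usd


def break_even_run_cost_usd(discount, porting_cost_usd):
    """Smallest H100-equivalent run cost at which porting pays for itself."""
    return porting_cost_usd / discount


PORTING_COST = 250_000  # hypothetical engineering cost

for run_cost in (1_000_000, 500_000):  # hypothetical run sizes
    for discount in (0.30, 0.40):
        verdict = worth_porting(run_cost, discount, PORTING_COST)
        print(f"run ${run_cost:,} at {discount:.0%} off: port? {verdict}")
```

Under these assumed numbers the larger run justifies the port at either discount and the smaller one does not, which matches the "at scale" qualifier on this option.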

The exotic three (not training-first, but worth knowing)¶

Cerebras WSE-3 — wafer-scale integration¶

The entire 300mm wafer is one chip: 900,000 cores, 44 GB on-chip SRAM, 21 PB/s memory bandwidth.
No HBM: all memory is on-chip SRAM. At 21 PB/s, bandwidth is roughly 6,000× the 3.35 TB/s of HBM on an H100.
Designed for training; CS-3 systems sold to government labs and some hyperscalers.
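One way to read these figures is to ask how long it takes to sweep the whole memory once, and how many parameters fit without going off-chip. The sketch below uses the WSE-3 numbers above; the H100 comparison values (80 GB, 3.35 TB/s) are assumptions taken from the table and from the MI300X comparison earlier on this page.

```python
def sweep_time_s(capacity_gb, bandwidth_tb_per_s):
    """Seconds needed to read the entire memory once at peak bandwidth."""
    return capacity_gb / (bandwidth_tb_per_s * 1000)


wse3_sweep = sweep_time_s(44, 21_000)  # 21 PB/s expressed as 21,000 TB/s
h100_sweep = sweep_time_s(80, 3.35)    # assumed H100 figures, see the lead-in

# Parameters (in billions) that fit in 44 GB of SRAM at 2 bytes each (FP16),
# ignoring activations and optimizer state.
fp16_params_on_chip_b = 44 / 2

print(f"WSE-3 full-memory sweep: {wse3_sweep * 1e6:.1f} microseconds")
print(f"H100 full-memory sweep:  {h100_sweep * 1e3:.1f} milliseconds")
print(f"FP16 parameters on-chip: ~{fp16_params_on_chip_b:.0f}B")
```

The gap of roughly four orders of magnitude in sweep time is the practical meaning of the SRAM-only design; the capacity line shows the price paid for it.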

Strength: unmatched single-system bandwidth. No NVLink/IB overhead: the chip is the cluster. Excellent for models that fit: on-chip SRAM alone holds roughly 22B FP16 parameters, and Cerebras quotes models of up to 24 trillion parameters when weights are held in off-chip MemoryX.

Weakness: programmability requires Cerebras-specific tooling (no PyTorch one-liner). Expensive per system. Software stack maturity is low.

[source: Cerebras WSE-3 Whitepaper 2024]

Groq LPU — inference-only, SRAM-resident¶

No HBM. All model weights in 230 MB of on-chip SRAM per chip.
Sells inference latency: ~500-800 tokens/sec on Llama 70B (multi-chip).
Deterministic, low-jitter execution — no caches, no out-of-order.
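The 230 MB figure determines how many chips a model needs. A minimal sketch of that count for a 70B-parameter model, assuming weights only (no activations or KV cache) and two precisions chosen for illustration; the precision Groq actually deploys is not stated on this page:

```python
import math

SRAM_MB_PER_CHIP = 230


def chips_for_weights(params_billions, bytes_per_param, sram_mb=SRAM_MB_PER_CHIP):
    """Minimum number of chips whose combined SRAM can hold the weights."""
    weight_mb = params_billions * 1e9 * bytes_per_param / 1e6
    return math.ceil(weight_mb / sram_mb)


int8_chips = chips_for_weights(70, 1)  # 1 byte per parameter
fp16_chips = chips_for_weights(70, 2)  # 2 bytes per parameter

print(f"70B model: {int8_chips} chips at INT8, {fp16_chips} chips at FP16")
```

Either way the answer is in the hundreds, so the latency figures above describe a multi-rack system rather than a single device.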

Strength: inference latency for short generations. The fastest single-stream tokens/sec on the market.

Weakness: scaling requires many chips (model spread across hundreds of LPUs). Not for training. Limited to specific models the Groq compiler has tuned.

[source: Groq LPU Architecture Brief 2024]

Apple Silicon (M-series Neural Engine)¶

Not a training accelerator. ANE is ~38 TOPS INT8 for on-device inference on iPhones/Macs. Relevant only if you target consumer hardware.

Software stack maturity — the honest ranking¶

This is the axis that actually decides choices, more than FLOPS.

| Stack | Maturity | PyTorch native? | Custom kernels | Ecosystem |
|---|---|---|---|---|
| CUDA + PyTorch | A+ | Yes | Triton, CUDA-C, CUTLASS | The entire field |
| ROCm + PyTorch | B | Yes (mostly) | hipBLAS, AITemplate | Growing |
| JAX + XLA (TPU) | A | Via `torch_xla` (B-) | Pallas (new) | JAX/research community |
| Neuron SDK (Trainium) | B- | Via `torch-neuronx` | Limited | AWS-only |
| SynapseAI (Gaudi) | B- | Via `habana_frameworks` | Limited | Intel + select customers |
| Cerebras SDK | C | Partial | Vendor-only | Small |
| Groq SDK | C | Compiled flow only | Vendor-only | Small |

The pattern: NVIDIA's moat is software, not hardware. ROCm and the cloud-vendor stacks are closing the gap, but CUDA's lead is still 12-18 months over ROCm and longer over the rest.
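A practical consequence of the table is that each non-CUDA stack reaches PyTorch through its own bridge package. The sketch below checks which of those bridges are importable in the current environment, using only the standard library. The module names are the import names of the packages named in the table (`torch_neuronx` is assumed to be the import name of `torch-neuronx`). It reports installation only, not whether matching hardware is present.

```python
from importlib.util import find_spec

# Stack label -> top-level Python module that provides its PyTorch/JAX entry point.
BRIDGES = {
    "CUDA or ROCm (PyTorch)": "torch",
    "TPU via torch_xla": "torch_xla",
    "Trainium via torch-neuronx": "torch_neuronx",
    "Gaudi via habana_frameworks": "habana_frameworks",
    "JAX": "jax",
}


def is_installed(module_name):
    """True if the module can be located without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def installed_bridges():
    """Map each stack label to whether its bridge package is installed."""
    return {label: is_installed(module) for label, module in BRIDGES.items()}


for label, present in installed_bridges().items():
    print(f"{label:<30} {'installed' if present else 'missing'}")
```

ROCm builds of PyTorch ship under the same `torch` package as CUDA builds, which is why the two share a row here.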

Strategic picture for 2026¶

Default for new training projects: NVIDIA H100/H200/B200. Software risk is lowest. Supply is the constraint.
If memory > compute matters: AMD MI300X or H200, depending on stack tolerance.
If you live in JAX: TPU v5p. The story is compiler-native sharding.
If you live in AWS at scale and want low cost: Trainium 2.
If you need single-stream inference latency: Groq LPU.
If you want to bet on a challenger: Gaudi 3 has interesting economics, and its networking story is genuinely differentiated.
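The rules above can be written down as a small decision function, which makes the implied priorities explicit and easy to argue with. This is a sketch: the field names and the order in which rules are checked are assumptions, not something this page prescribes, and the challenger bet on Gaudi 3 is left out because it is a judgment call rather than a workload property.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a project; every field defaults to False."""
    needs_single_stream_latency: bool = False
    lives_in_jax: bool = False
    aws_at_scale_cost_sensitive: bool = False
    memory_bound: bool = False
    tolerates_less_mature_stack: bool = False


def recommend(w):
    """Return the accelerator this page's heuristics point to."""
    if w.needs_single_stream_latency:
        return "Groq LPU"
    if w.lives_in_jax:
        return "TPU v5p"
    if w.aws_at_scale_cost_sensitive:
        return "Trainium 2"
    if w.memory_bound:
        # Same memory-first goal, split by how much stack risk is acceptable.
        return "AMD MI300X" if w.tolerates_less_mature_stack else "NVIDIA H200"
    return "NVIDIA H100/H200/B200"  # default: lowest software risk


print(recommend(Workload()))
print(recommend(Workload(memory_bound=True, tolerates_less_mature_stack=True)))
```

Reordering the checks changes the answer for workloads that match more than one rule, and deciding that order is exactly the thesis a candidate should be able to defend.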

For an interview, having a thesis on the landscape — not just a list — is what distinguishes a senior candidate. Be ready to defend "I would train this on X because Y, even though Z."

Cross-links¶

01-cpu-vs-gpu-vs-tpu-vs-trn1.md: the architectural primitives.
02-h100-and-h200.md: NVIDIA flagship in depth.
04-datacenter-economics.md: the cost lens on these choices.

References¶

NVIDIA Blackwell Architecture Technical Brief, 2024.
AMD Instinct MI300X Datasheet and CDNA 3 whitepaper, 2023.
Intel Gaudi 3 Whitepaper, 2024.
Google TPU v5p announcement and TPU v4: An Optically Reconfigurable Supercomputer (Jouppi et al. 2023).
AWS Trainium 2 documentation; AWS re:Invent 2024 keynote.
Cerebras WSE-3 Whitepaper, 2024.
Groq LPU Architecture Brief, 2024.
MLPerf Training v4.0 and Inference v4.1 results, 2024.
SemiAnalysis, Accelerator Landscape, ongoing 2024-2026 coverage.