05 — The accelerator landscape, 2026 edition¶
An honest guide to the market: who makes what, what each does well, and where the software-stack trap lies.
Why you need this map¶
In 2026, "buy GPUs" is not a single decision. It is a multi-axis choice across compute, memory, network, software maturity, supply availability, and price. A senior ML engineer should be able to argue the choice, not just default to "H100 because that's what we know."
This page is the honest, vendor-neutral landscape.
The big seven (training-capable accelerators)¶
| Vendor | Chip | Year | FP8/FP16 peak | HBM | Interconnect | Stack | Niche |
|---|---|---|---|---|---|---|---|
| NVIDIA | H100 | 2022 | 1979 / 989 TF | 80 GB | NVLink 4 (900 GB/s) | CUDA, PyTorch native | Default; mature |
| NVIDIA | H200 | 2024 | 1979 / 989 TF | 141 GB | NVLink 4 (900 GB/s) | CUDA, PyTorch native | Memory-bound inference |
| NVIDIA | B200 | 2024-2025 | 4500 / 2250 TF; 9000 TF FP4 | 192 GB | NVLink 5 (1.8 TB/s) | CUDA, Transformer Engine v2 | Frontier training |
| AMD | MI300X | 2023 | 2614 / 1307 TF | 192 GB | Infinity Fabric (896 GB/s) | ROCm, PyTorch (improving) | Large-memory serving |
| Intel | Gaudi 3 | 2024 | 1835 / 1835 TF | 128 GB | 24× 200 GbE on-chip | SynapseAI, PyTorch via Habana | Price-competitive training |
| Google | TPU v5p | 2023 | -- / 459 TF BF16 | 95 GB | ICI 3D-torus | JAX/XLA, PyTorch via torch_xla | TPU-Pod scale training |
| AWS | Trainium 2 | 2024 | -- / ~650 TF | 96 GB | NeuronLink v3 | Neuron SDK, XLA, PyTorch | AWS-internal training |
[source: respective vendor whitepapers cited in §01 and §02]
NVIDIA Blackwell — the 2025 frontier¶
Already covered in §02. The headline:
- FP4 with second-gen Transformer Engine doubles effective FLOPS again.
- GB200 NVL72 makes the rack (72 GPUs, 130 TB/s NVLink domain) the unit of compute.
- Liquid cooling becomes mandatory at 1000 W per chip.
- Supply-constrained. Long delivery times.
Software stack: CUDA + cuDNN + PyTorch + Transformer Engine. The most mature stack by a wide margin.
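The rack-level figures above follow from the per-GPU numbers by simple multiplication. The sketch below redoes that arithmetic with only the values quoted on this page; the function and constant names are illustrative, and the power figure covers GPUs only (CPUs, switches, and cooling are not included).

```python
# Per-GPU figures quoted on this page for B200 / GB200 NVL72.
GPUS_PER_RACK = 72
NVLINK5_TB_S_PER_GPU = 1.8  # NVLink 5 bandwidth per GPU, TB/s
WATTS_PER_GPU = 1000        # per-chip power quoted above


def rack_nvlink_tb_s(gpus: int = GPUS_PER_RACK,
                     per_gpu_tb_s: float = NVLINK5_TB_S_PER_GPU) -> float:
    """Aggregate NVLink bandwidth of one rack-scale domain, in TB/s."""
    return gpus * per_gpu_tb_s


def rack_gpu_power_kw(gpus: int = GPUS_PER_RACK,
                      watts_per_gpu: int = WATTS_PER_GPU) -> float:
    """GPU-only power draw in kW (excludes CPUs, switches, pumps, fans)."""
    return gpus * watts_per_gpu / 1000


if __name__ == "__main__":
    # 72 x 1.8 = 129.6 TB/s, which the text rounds to 130 TB/s.
    print(f"NVLink domain: {rack_nvlink_tb_s():.1f} TB/s")
    print(f"GPU power:     {rack_gpu_power_kw():.0f} kW")
```

A GPU-only draw of roughly 72 kW in a single rack is one way to see why air cooling stops being practical at this density.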
AMD MI300X — the credible challenger¶
- 192 GB HBM3 — the highest memory capacity of any single chip until B200. Lets you fit a 70B FP16 model in one GPU.
- 5.3 TB/s HBM bandwidth — 58% more than H100.
- 2614 TF FP8 dense (vs H100's 1979).
- 8 chiplets (XCDs) per package.
Strength: memory-bound workloads (long-context inference, KV-cache-heavy MoE) where you trade FLOPS for HBM. Microsoft, Meta, and Oracle are publicly deploying MI300X.
Weakness: ROCm is improving fast but still 12-18 months behind CUDA on:
- Compiler optimization for novel ops.
- Triton support.
- FlashAttention-style hand-tuned kernels.
- Multi-vendor tooling.
Realistic assessment: a frontier lab can ship on MI300X if it staffs an internal kernel team. A startup probably cannot.
[source: AMD CDNA 3 whitepaper 2023; Microsoft + AMD MI300X deployment announcements 2024]
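The "70B FP16 model in one GPU" claim is a weights-only calculation, and it is worth being able to redo it on a whiteboard. A minimal sketch, assuming 2 bytes per FP16 parameter and 1 GB = 1e9 bytes; the helper names are illustrative, and KV cache, activations, and framework overhead are deliberately ignored.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB, taking 1 GB = 1e9 bytes."""
    return params_billion * bytes_per_param


def headroom_gb(params_billion: float, bytes_per_param: float, hbm_gb: float) -> float:
    """HBM left after loading weights; negative means the weights do not fit."""
    return hbm_gb - weights_gb(params_billion, bytes_per_param)


# HBM capacities from the table on this page.
HBM_GB = {"H100": 80, "H200": 141, "MI300X": 192}

if __name__ == "__main__":
    for chip, hbm in HBM_GB.items():
        left = headroom_gb(70, 2, hbm)  # 70B parameters at 2 bytes each (FP16)
        verdict = "fits" if left >= 0 else "does not fit"
        print(f"{chip:7s} {verdict:12s} headroom {left:+.0f} GB")
```

On these numbers the weights need 140 GB, so an H200 holds them with about 1 GB to spare, while MI300X leaves about 52 GB for KV cache and activations. That headroom, not the bare fit, is the practical argument for the larger part.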
Intel Gaudi 3 — the integrated-RDMA play¶
- 1835 TF dense BF16/FP8 — competitive with H100 on paper.
- 128 GB HBM2e @ 3.7 TB/s.
- The unique feature: 24× 200 GbE RoCE RDMA ports integrated on the chip. No separate NIC. Every Gaudi is its own network endpoint.
Strength: the integrated networking dramatically reduces fabric CapEx (~30%, vendor claim). PyTorch via habana_frameworks.torch is solid for transformer training. Intel is pricing aggressively as the new entrant.
Weakness: SynapseAI lags ROCm, which in turn lags CUDA. Smaller community → less third-party tooling. Less proven at very large scale.
[source: Intel Gaudi 3 Whitepaper 2024]
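To put the integrated networking in the same units as the interconnect column of the table, the port count can be converted to bytes per second. This is a rough sketch using raw line rate only (no protocol overhead), with illustrative names.

```python
def aggregate_gbytes_per_s(ports: int, gbit_per_port: float) -> float:
    """One-direction line rate across all ports, in GB/s (8 bits per byte)."""
    return ports * gbit_per_port / 8


GAUDI3_PORTS = 24
GAUDI3_GBIT_PER_PORT = 200

if __name__ == "__main__":
    one_way = aggregate_gbytes_per_s(GAUDI3_PORTS, GAUDI3_GBIT_PER_PORT)
    print(f"Gaudi 3 Ethernet: {one_way:.0f} GB/s per direction, "
          f"{2 * one_way:.0f} GB/s bidirectional")
```

Before comparing this against another vendor's interconnect figure, check whether that figure is quoted per direction or as a bidirectional total; the two conventions differ by a factor of two.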
Google TPU v5p — the JAX-native option¶
- 459 TF BF16 / 918 TF INT8 per chip.
- 95 GB HBM @ 2.76 TB/s.
- TPU v5p Pod: up to 8960 chips in a 3D-torus with optical reconfiguration. The largest single-instance compute system in any cloud.
- Programming: JAX + XLA is first-class; PyTorch via torch_xla works but lags.
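Strength: the XLA compiler partitions the computation automatically from sharding annotations under jax.jit (jax.shard_map is the manual, per-device alternative when you want explicit control). 3D-torus is bandwidth-optimal for AllReduce on a Pod. Google trains Gemini here.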
Weakness: Google Cloud only (no on-prem). Custom kernels are hard to write: Pallas is still new, so in practice you are largely at the compiler's mercy. JAX skill is required.
[source: Google TPU v5p announcement December 2023; Jouppi et al. 2023]
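Two back-of-envelope calculations make the Pod-scale claims concrete: the peak totals of a full Pod, and the per-device traffic of a bandwidth-optimal AllReduce. The sketch below uses the per-chip figures from this section and the standard 2(N-1)/N traffic bound for AllReduce; function names are illustrative, and link bandwidth is left out because this page does not quote one.

```python
POD_CHIPS = 8960
BF16_TF_PER_CHIP = 459
HBM_GB_PER_CHIP = 95


def pod_peak_exaflops(chips: int = POD_CHIPS,
                      tf_per_chip: float = BF16_TF_PER_CHIP) -> float:
    """Peak BF16 throughput of a full Pod in EFLOPS (1 EF = 1e6 TF)."""
    return chips * tf_per_chip / 1e6


def pod_hbm_tb(chips: int = POD_CHIPS,
               gb_per_chip: float = HBM_GB_PER_CHIP) -> float:
    """Total HBM across the Pod in TB (1 TB = 1000 GB)."""
    return chips * gb_per_chip / 1000


def allreduce_traffic_per_device(n_devices: int, payload: float) -> float:
    """Data each device sends in a bandwidth-optimal AllReduce.

    Returns 2 * (N - 1) / N * payload, in the same unit as `payload`.
    """
    return 2 * (n_devices - 1) / n_devices * payload


if __name__ == "__main__":
    print(f"Pod peak BF16: {pod_peak_exaflops():.2f} EFLOPS")
    print(f"Pod HBM:       {pod_hbm_tb():.1f} TB")
    # Per-device traffic approaches 2x the payload as the Pod grows.
    print(f"AllReduce of 1 GB on {POD_CHIPS} chips: "
          f"{allreduce_traffic_per_device(POD_CHIPS, 1.0):.4f} GB sent per chip")
```

The point of the last function is that per-device traffic stays below twice the payload no matter how many chips participate, which is why a topology that keeps every link busy matters more than raw chip count.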
AWS Trainium 2 — the cloud-native price play¶
- ~650 TF FP16, 96 GB HBM @ 2.9 TB/s.
- Sold in trn2.48xlarge instances (16 chips/instance) and UltraClusters of 100k+ chips.
- AWS prices Trainium ~30-40% below H100 on the same training workload [source: AWS re:Invent 2024 keynote, vendor claim].
Strength: lowest $/training-FLOP on AWS by a wide margin. Anthropic publicly trains on Trainium.
Weakness: Neuron SDK is improving but still requires workload-specific tuning. PyTorch via torch-neuronx. Limited to AWS; no other cloud has Trainium.
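The instance-level numbers and the discount claim are easy to turn into a quick estimate. The sketch below is illustrative only: it uses the approximate per-chip figures quoted above, treats the 30-40% discount as the vendor claim it is, and normalizes the H100 cost to 1.0 rather than assuming any real price.

```python
CHIPS_PER_INSTANCE = 16
FP16_TF_PER_CHIP = 650   # "~650 TF" as quoted above, so results are approximate
HBM_GB_PER_CHIP = 96


def instance_totals(chips: int = CHIPS_PER_INSTANCE,
                    tf_per_chip: int = FP16_TF_PER_CHIP,
                    gb_per_chip: int = HBM_GB_PER_CHIP) -> tuple[int, int]:
    """Return (peak FP16 TF, HBM GB) for one instance."""
    return chips * tf_per_chip, chips * gb_per_chip


def discounted_cost(h100_cost: float, discount_pct: int) -> float:
    """Cost of the same workload after a percentage discount versus H100."""
    return h100_cost * (100 - discount_pct) / 100


if __name__ == "__main__":
    tf, gb = instance_totals()
    print(f"trn2.48xlarge: ~{tf} TF FP16 peak, {gb} GB HBM")
    # Normalized H100 cost of 1.0; the range reflects the 30-40% vendor claim.
    low, high = discounted_cost(1.0, 40), discounted_cost(1.0, 30)
    print(f"Relative cost vs H100: {low:.2f} to {high:.2f}")
```

Any engineering time spent on workload-specific Neuron tuning has to be paid for out of that 0.60 to 0.70 range before the saving is real.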
The exotic three (not training-first, but worth knowing)¶
Cerebras WSE-3 — wafer-scale integration¶
- The entire 300mm wafer is one chip: 900,000 cores, 44 GB on-chip SRAM, 21 PB/s memory bandwidth.
- No HBM: all memory is on-chip SRAM. Bandwidth is essentially infinite by GPU standards.
- Designed for training; CS-3 systems sold to government labs and some hyperscalers.
Strength: unmatched single-system bandwidth. No NVLink/IB overhead: the chip is the cluster. Excellent for models that fit (up to ~24T parameters per CS-3 with MemoryX off-chip weight storage, vendor claim).
Weakness: programmability requires Cerebras-specific tooling (no PyTorch one-liner). Expensive per system. Software stack maturity is low.
[source: Cerebras WSE-3 Whitepaper 2024]
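"Essentially infinite by GPU standards" can be quantified with the figures already on this page. The sketch below compares the WSE-3 SRAM bandwidth with the 5.3 TB/s HBM figure quoted for MI300X, and estimates how many FP16 parameters fit purely in on-chip SRAM; names are illustrative, and the on-chip estimate counts weights only.

```python
WSE3_SRAM_GB = 44
WSE3_BANDWIDTH_PB_S = 21
WSE3_CORES = 900_000
MI300X_HBM_TB_S = 5.3  # highest HBM bandwidth figure quoted on this page


def params_billion_on_chip(sram_gb: float, bytes_per_param: float) -> float:
    """Billions of parameters whose weights fit in on-chip SRAM."""
    return sram_gb / bytes_per_param


def bandwidth_ratio(wse_pb_s: float, hbm_tb_s: float) -> float:
    """How many times faster the wafer's SRAM is than one HBM stack set."""
    return wse_pb_s * 1000 / hbm_tb_s


def per_core_gb_s(wse_pb_s: float, cores: int) -> float:
    """Average SRAM bandwidth available to each core, in GB/s."""
    return wse_pb_s * 1e6 / cores


if __name__ == "__main__":
    print(f"FP16 params on-chip: ~{params_billion_on_chip(WSE3_SRAM_GB, 2):.0f}B")
    print(f"Bandwidth vs MI300X HBM: ~{bandwidth_ratio(WSE3_BANDWIDTH_PB_S, MI300X_HBM_TB_S):.0f}x")
    print(f"Per-core bandwidth: ~{per_core_gb_s(WSE3_BANDWIDTH_PB_S, WSE3_CORES):.1f} GB/s")
```

The ratio is in the thousands, but the on-chip capacity is small, which is why anything larger has to stream weights from off-chip memory.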
Groq LPU — inference-only, SRAM-resident¶
- No HBM. All model weights in 230 MB of on-chip SRAM per chip.
- Sells inference latency: ~500-800 tokens/sec on Llama 70B (multi-chip).
- Deterministic, low-jitter execution — no caches, no out-of-order.
Strength: inference latency for short generations. The fastest single-stream tokens/sec on the market.
Weakness: scaling requires many chips (model spread across hundreds of LPUs). Not for training. Limited to specific models the Groq compiler has tuned.
[source: Groq LPU Architecture Brief 2024]
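The "hundreds of LPUs" figure follows directly from the SRAM size. A minimal sketch, assuming weights are the only thing stored and trying two hypothetical precisions (1 and 2 bytes per parameter); it does not claim to reflect the precision Groq actually serves at, and the result is a lower bound because activations and KV cache also need space.

```python
def min_chips_for_weights(params: int, bytes_per_param: int,
                          sram_bytes_per_chip: int) -> int:
    """Smallest chip count whose combined SRAM holds all weights."""
    total_bytes = params * bytes_per_param
    return -(-total_bytes // sram_bytes_per_chip)  # integer ceiling division


LLAMA_70B_PARAMS = 70_000_000_000
LPU_SRAM_BYTES = 230_000_000  # 230 MB per chip, as quoted above

if __name__ == "__main__":
    for bytes_per_param in (1, 2):
        chips = min_chips_for_weights(LLAMA_70B_PARAMS, bytes_per_param, LPU_SRAM_BYTES)
        print(f"{bytes_per_param} byte(s)/param: at least {chips} chips")
```

Under these assumptions the lower bound is 305 chips at 1 byte per parameter and 609 at 2 bytes, which is the cost side of the latency advantage.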
Apple Silicon (M-series Neural Engine)¶
Not a training accelerator. ANE is ~38 TOPS INT8 for on-device inference on iPhones/Macs. Relevant only if you target consumer hardware.
Software stack maturity — the honest ranking¶
This is the axis that actually decides choices, more than FLOPS.
| Stack | Maturity | PyTorch native? | Custom kernels | Ecosystem |
|---|---|---|---|---|
| CUDA + PyTorch | A+ | Yes | Triton, CUDA-C, CUTLASS | The entire field |
| ROCm + PyTorch | B | Yes (mostly) | hipBLAS, AITemplate | Growing |
| JAX + XLA (TPU) | A | Via torch_xla (B-) | Pallas (new) | JAX/research community |
| Neuron SDK (Trainium) | B- | Via torch-neuronx | Limited | AWS-only |
| SynapseAI (Gaudi) | B- | Via habana_frameworks | Limited | Intel + select customers |
| Cerebras SDK | C | Partial | Vendor-only | Small |
| Groq SDK | C | Compiled flow only | Vendor-only | Small |
The pattern: NVIDIA's moat is software, not hardware. ROCm and the cloud-vendor stacks are closing the gap, but the lead is still measured in years.
Strategic picture for 2026¶
- Default for new training projects: NVIDIA H100/H200/B200. Software risk is lowest. Supply is the constraint.
- If memory > compute matters: AMD MI300X or H200, depending on stack tolerance.
- If you live in JAX: TPU v5p. The story is compiler-native sharding.
- If you live in AWS at scale and want low cost: Trainium 2.
- If you need single-stream inference latency: Groq LPU.
- If you want to bet on a challenger: Gaudi 3 has interesting economics, and its integrated-networking story is genuinely differentiated.
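The bullets above can be read as a small decision table. The sketch below encodes them as one; the `Workload` fields and the `shortlist` function are hypothetical names invented for illustration, and returning every matching option (rather than ranking them) is a simplifying assumption, since the text does not define a priority order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical flags, one per bullet in the strategic picture."""
    memory_bound: bool = False
    jax_native: bool = False
    aws_at_scale_cost_sensitive: bool = False
    needs_single_stream_latency: bool = False
    willing_to_bet_on_challenger: bool = False


def shortlist(w: Workload) -> list[str]:
    """Return every option the bullets suggest, or the default if none apply."""
    picks = []
    if w.memory_bound:
        picks.append("AMD MI300X or H200")
    if w.jax_native:
        picks.append("TPU v5p")
    if w.aws_at_scale_cost_sensitive:
        picks.append("Trainium 2")
    if w.needs_single_stream_latency:
        picks.append("Groq LPU")
    if w.willing_to_bet_on_challenger:
        picks.append("Gaudi 3")
    return picks or ["NVIDIA H100/H200/B200"]


if __name__ == "__main__":
    print(shortlist(Workload()))
    print(shortlist(Workload(memory_bound=True, jax_native=True)))
```

A shortlist with more than one entry is where the real argument starts: the remaining choice comes down to stack tolerance, supply, and price.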
For an interview, having a thesis on the landscape — not just a list — is what distinguishes a senior candidate. Be ready to defend "I would train this on X because Y, even though Z."
Cross-links¶
- 01-cpu-vs-gpu-vs-tpu-vs-trn1.md: the architectural primitives.
- 02-h100-and-h200.md: NVIDIA flagship in depth.
- 04-datacenter-economics.md: the cost lens on these choices.
References¶
- NVIDIA Blackwell Architecture Technical Brief, 2024.
- AMD Instinct MI300X Datasheet and CDNA 3 whitepaper, 2023.
- Intel Gaudi 3 Whitepaper, 2024.
- Google TPU v5p announcement and TPU v4: An Optically Reconfigurable Supercomputer (Jouppi et al. 2023).
- AWS Trainium 2 documentation; AWS re:Invent 2024 keynote.
- Cerebras WSE-3 Whitepaper, 2024.
- Groq LPU Architecture Brief, 2024.
- MLPerf Training v4.0 and Inference v4.1 results, 2024.
- SemiAnalysis, Accelerator Landscape, ongoing 2024-2026 coverage.