05 — The accelerator landscape, 2026 edition

An honest guide to the market: who makes what, what each vendor does well, and where the software-stack trap lies.

Why you need this map

In 2026, "buy GPUs" is not a single decision. It is a multi-axis choice across compute, memory, network, software maturity, supply availability, and price. A senior ML engineer should be able to argue the choice, not just default to "H100 because that's what we know."

This page is the honest, vendor-neutral landscape.

The big seven (training-capable accelerators)

| Vendor | Chip | Year | FP8 / FP16 peak | HBM | Interconnect | Stack | Niche |
|---|---|---|---|---|---|---|---|
| NVIDIA | H100 | 2022 | 1979 / 989 TF | 80 GB | NVLink 4 (900 GB/s) | CUDA, PyTorch native | Default; mature |
| NVIDIA | H200 | 2024 | 1979 / 989 TF | 141 GB | NVLink 4 (900 GB/s) | CUDA, PyTorch native | Memory-bound inference |
| NVIDIA | B200 | 2024-2025 | 4500 / 2250 TF; 9000 TF FP4 | 192 GB | NVLink 5 (1.8 TB/s) | CUDA, Transformer Engine v2 | Frontier training |
| AMD | MI300X | 2023 | 2614 / 1307 TF | 192 GB | Infinity Fabric (896 GB/s) | ROCm, PyTorch (improving) | Large-memory serving |
| Intel | Gaudi 3 | 2024 | 1835 / 1835 TF | 128 GB | 24× 200 GbE on-chip | SynapseAI, PyTorch via Habana | Price-competitive training |
| Google | TPU v5p | 2023 | -- / 459 TF BF16 | 95 GB | ICI 3D-torus | JAX/XLA, PyTorch via torch_xla | TPU-Pod scale training |
| AWS | Trainium 2 | 2024 | -- / ~650 TF | 96 GB | NeuronLink v3 | Neuron SDK, XLA, PyTorch | AWS-internal training |

[source: respective vendor whitepapers cited in §01 and §02]
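One way to read the table is to convert peak FLOPS into a roofline "ridge point": the number of FLOPs a kernel must perform per byte of HBM traffic before the chip stops being memory-bound. The sketch below uses the FP16/BF16 column from the table and the HBM bandwidth figures quoted in the per-chip sections of this page. The H100 bandwidth of 3.35 TB/s is an assumption taken from NVIDIA's public spec sheet; it is not stated on this page. Treat the output as a rough comparison, not a benchmark.

```python
# (peak FP16/BF16 TFLOP/s, HBM bandwidth in TB/s)
# H100 bandwidth is an assumption (NVIDIA spec sheet); the rest come from this page.
CHIPS = {
    "H100": (989, 3.35),
    "MI300X": (1307, 5.3),
    "Gaudi 3": (1835, 3.7),
    "TPU v5p": (459, 2.76),
    "Trainium 2": (650, 2.9),
}


def ridge_point(peak_tflops, hbm_tb_per_s):
    """FLOPs per byte of HBM traffic needed to become compute-bound."""
    # TFLOP/s divided by TB/s leaves FLOP/byte.
    return peak_tflops / hbm_tb_per_s


ridge = {name: ridge_point(*spec) for name, spec in CHIPS.items()}

# Print from the lowest ridge point (easiest to saturate) to the highest.
for name, value in sorted(ridge.items(), key=lambda kv: kv[1]):
    print(f"{name:<11} {value:6.0f} FLOP/byte")
```

A lower ridge point means the chip reaches its peak on less arithmetic-dense kernels, which is one reason memory-heavy parts look better on inference than their FLOPS alone suggest.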

NVIDIA Blackwell — the 2025 frontier

Already covered in §02. The headline:

  • FP4 with second-gen Transformer Engine doubles effective FLOPS again.
  • GB200 NVL72 makes the rack (72 GPUs, 130 TB/s NVLink domain) the unit of compute.
  • Liquid cooling becomes mandatory at 1000 W per chip.
  • Supply-constrained. Long delivery times.
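The rack-level figures follow from the per-GPU numbers. A minimal sketch of the arithmetic, using only values from this page (72 GPUs, 1.8 TB/s NVLink 5 per GPU, 1000 W per chip); the power figure covers GPUs only and ignores CPUs, switches, and cooling overhead:

```python
GPUS_PER_RACK = 72
NVLINK5_TB_PER_S_PER_GPU = 1.8   # from the table in this page
WATTS_PER_GPU = 1000             # from the bullet above

# Aggregate NVLink bandwidth across the rack-scale domain.
nvlink_domain_tb_per_s = GPUS_PER_RACK * NVLINK5_TB_PER_S_PER_GPU

# GPU power alone, in kW; the full rack draws more than this.
gpu_power_kw = GPUS_PER_RACK * WATTS_PER_GPU / 1000

print(f"NVLink domain: {nvlink_domain_tb_per_s:.1f} TB/s")
print(f"GPU power:     {gpu_power_kw:.0f} kW")
```

The 129.6 TB/s result is where the rounded "130 TB/s" headline comes from, and roughly 72 kW of GPU power in one rack is why air cooling is no longer an option.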

Software stack: CUDA + cuDNN + PyTorch + Transformer Engine. The most mature stack by a wide margin.

AMD MI300X — the credible challenger

  • 192 GB HBM3 — the highest memory capacity of any single chip until B200. Lets you fit a 70B FP16 model in one GPU.
  • 5.3 TB/s HBM bandwidth — 58% more than H100.
  • 2614 TF FP8 dense (vs H100's 1979).
  • 8 chiplets (XCDs) per package.
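The "70B FP16 model in one GPU" claim is simple arithmetic: parameters times bytes per parameter. A minimal sketch, using the HBM capacities from the table above and counting weights only (activations and KV cache need additional room):

```python
def weight_gb(params_billion, bytes_per_param):
    """Memory for the weights alone, in GB (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_billion * bytes_per_param


HBM_GB = {"H100": 80, "H200": 141, "MI300X": 192}

need = weight_gb(70, 2)  # 70B parameters at FP16 (2 bytes each)
fits = {chip: capacity >= need for chip, capacity in HBM_GB.items()}
headroom = {chip: capacity - need for chip, capacity in HBM_GB.items()}

for chip in HBM_GB:
    print(f"{chip:<7} fits={fits[chip]!s:<5} headroom={headroom[chip]:+d} GB")
```

H200 technically holds the weights but with about 1 GB to spare, so in practice only the 192 GB part leaves useful space for the KV cache.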

Strength: memory-bound workloads (long-context inference, KV-cache-heavy MoE) where you trade FLOPS for HBM. Microsoft, Meta, and Oracle are publicly deploying MI300X.
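To see why bandwidth matters more than FLOPS here, consider single-stream decoding: each generated token has to read every weight from HBM once, so bandwidth divided by model size caps tokens per second. The sketch below is a rough upper bound under stated assumptions (a dense 70B model, batch size 1, KV-cache reads and kernel overhead ignored); real throughput will be lower.

```python
def max_decode_tokens_per_s(hbm_tb_per_s, weight_gb):
    """Upper bound on batch-1 decode speed when each token re-reads all weights."""
    # Convert TB/s to GB/s, then divide by the GB that must be read per token.
    return hbm_tb_per_s * 1000 / weight_gb


MI300X_HBM_TB_PER_S = 5.3  # from the bullet list above

fp16_bound = max_decode_tokens_per_s(MI300X_HBM_TB_PER_S, 70 * 2)  # 2 bytes/param
fp8_bound = max_decode_tokens_per_s(MI300X_HBM_TB_PER_S, 70 * 1)   # 1 byte/param

print(f"70B FP16: at most {fp16_bound:.1f} tokens/s")
print(f"70B FP8:  at most {fp8_bound:.1f} tokens/s")
```

Halving the bytes per weight doubles the bound, and no amount of extra FLOPS moves it, which is the trade this section describes.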

Weakness: ROCm is improving fast but still 12-18 months behind CUDA on:

  • Compiler optimization for novel ops.
  • Triton support.
  • FlashAttention-style hand-tuned kernels.
  • Multi-vendor tooling.

Realistic assessment: a frontier lab can ship on MI300X if it staffs an internal kernel team. A startup probably cannot.

[source: AMD CDNA 3 whitepaper 2023; Microsoft + AMD MI300X deployment announcements 2024]

Intel Gaudi 3 — the integrated-RDMA play

  • 1835 TF dense BF16/FP8 — competitive with H100 on paper.
  • 128 GB HBM2e @ 3.7 TB/s.
  • The unique feature: 24× 200 GbE RoCE RDMA ports integrated on the chip. No separate NIC. Every Gaudi is its own network endpoint.
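The size of that integrated fabric is easy to work out from the port count. A minimal sketch of the line-rate arithmetic, per direction and before any protocol overhead:

```python
PORTS = 24
GBIT_PER_S_PER_PORT = 200  # 200 GbE

# Aggregate line rate in one direction, first in gigabits, then in gigabytes.
aggregate_gbit_per_s = PORTS * GBIT_PER_S_PER_PORT
aggregate_gbyte_per_s = aggregate_gbit_per_s / 8

print(f"{aggregate_gbit_per_s} Gb/s = {aggregate_gbyte_per_s:.0f} GB/s per direction")
```

That is 600 GB/s of Ethernet per chip in each direction, the same order of magnitude as the proprietary links in the table, without a separate NIC.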

Strength: the integrated networking dramatically reduces fabric CapEx (~30%, vendor claim). PyTorch via habana_frameworks.torch is solid for transformer training. Intel is pricing aggressively as the new entrant.

Weakness: SynapseAI lags ROCm, which in turn lags CUDA. The smaller community means less third-party tooling, and the platform is less proven at very large scale.

[source: Intel Gaudi 3 Whitepaper 2024]

Google TPU v5p — the JAX-native option

  • 459 TF BF16 / 918 TF INT8 per chip.
  • 95 GB HBM @ 2.76 TB/s.
  • TPU v5p Pod: up to 8960 chips in a 3D-torus with optical reconfiguration. The largest single-instance compute system in any cloud.
  • Programming: JAX + XLA is first-class; PyTorch via torch_xla works but lags.

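Strength: the XLA compiler partitions the computation automatically from sharding annotations (jax.jit with a NamedSharding); jax.shard_map is there when you want explicit per-device control. The 3D torus gives every chip six wraparound links, which suits bandwidth-optimal AllReduce across a Pod. Google trains Gemini here.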
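Two properties sit behind the torus claim: every chip has the same six neighbours thanks to wraparound links, and a bandwidth-optimal AllReduce makes each participant send about twice the payload regardless of how many chips join. The sketch below illustrates both in plain Python. The 16 × 20 × 28 shape is an assumption chosen because it multiplies out to 8960 chips; the page itself does not state the Pod dimensions.

```python
def torus_neighbors(coord, dims):
    """Return the six wraparound neighbours of `coord` in a 3D torus of shape `dims`."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]  # modulo gives the wraparound link
            neighbors.append(tuple(n))
    return neighbors


def allreduce_bytes_per_chip(payload_bytes, n_chips):
    """Bytes each chip sends in a bandwidth-optimal AllReduce: 2 * (N - 1) / N * S."""
    return 2 * (n_chips - 1) / n_chips * payload_bytes


DIMS = (16, 20, 28)  # hypothetical Pod shape; 16 * 20 * 28 = 8960
corner = torus_neighbors((0, 0, 0), DIMS)
sent = allreduce_bytes_per_chip(1e9, 8960)  # 1 GB of gradients per chip

print(corner)
print(f"{sent / 1e9:.4f} GB sent per chip")
```

Even a corner chip has six distinct neighbours, and the per-chip traffic stays just under 2 GB for a 1 GB payload, so AllReduce time is set by link bandwidth rather than by Pod size.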

Weakness: Google Cloud only (no on-prem). Custom kernels are hard: Pallas is new, and for most ops you are at the compiler's mercy. JAX skill is required.

[source: Google TPU v5p announcement December 2023; Jouppi et al. 2023]

AWS Trainium 2 — the cloud-native price play

  • ~650 TF FP16, 96 GB HBM @ 2.9 TB/s.
  • Sold in trn2.48xlarge instances (16 chips/instance) and UltraClusters of 100k+ chips.
  • AWS prices Trainium ~30-40% below H100 on the same training workload [source: AWS re:Invent 2024 keynote, vendor claim].
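To turn the quoted discount into a budget range, apply both ends of the 30-40% claim to a known H100 bill. A minimal sketch; the $1,000,000 H100 job cost is a hypothetical input chosen for illustration, and the discount is the vendor claim cited above, not a measured result.

```python
def trainium_cost_range(h100_cost, discount_pct=(30, 40)):
    """Return (best_case, worst_case) cost after applying the claimed discount range."""
    low_pct, high_pct = discount_pct
    best_case = h100_cost * (100 - high_pct) / 100   # largest discount
    worst_case = h100_cost * (100 - low_pct) / 100   # smallest discount
    return best_case, worst_case


# Hypothetical training job that would cost $1,000,000 on H100 instances.
best, worst = trainium_cost_range(1_000_000)
print(f"Trainium 2 estimate: ${best:,.0f} to ${worst:,.0f}")
```

The estimate excludes the engineering time spent porting and tuning for the Neuron SDK, which can offset part of the saving on a first project.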

Strength: lowest $/training-FLOP on AWS by a wide margin. Anthropic publicly trains on Trainium.

Weakness: the Neuron SDK is improving but still requires workload-specific tuning. PyTorch support comes through torch-neuronx. Limited to AWS; no other cloud has Trainium.

The exotic three (not training-first, but worth knowing)

Cerebras WSE-3 — wafer-scale integration

  • The entire 300mm wafer is one chip: 900,000 cores, 44 GB on-chip SRAM, 21 PB/s memory bandwidth.
  • No HBM: all memory is on-chip SRAM. At 21 PB/s, bandwidth is thousands of times higher than a single GPU's HBM, which delivers a few TB/s.
  • Designed for training; CS-3 systems sold to government labs and some hyperscalers.

Strength: unmatched single-system bandwidth. No NVLink/IB overhead: the chip is the cluster. Excellent for models that fit (up to ~24B parameters per WSE-3 with their MemoryX off-chip).

Weakness: programmability requires Cerebras-specific tooling (no PyTorch one-liner). Expensive per system. Software stack maturity is low.

[source: Cerebras WSE-3 Whitepaper 2024]

Groq LPU — inference-only, SRAM-resident

  • No HBM. All model weights in 230 MB of on-chip SRAM per chip.
  • Sells inference latency: ~500-800 tokens/sec on Llama 70B (multi-chip).
  • Deterministic, low-jitter execution — no caches, no out-of-order.

Strength: inference latency for short generations. Among the fastest single-stream tokens/sec on the market.

Weakness: scaling requires many chips (model spread across hundreds of LPUs). Not for training. Limited to specific models the Groq compiler has tuned.
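The "hundreds of LPUs" figure follows from the 230 MB of SRAM per chip. A minimal sketch of the lower bound, counting weights only; the bytes-per-parameter values are assumptions for illustration, and a real deployment also needs room for activations and KV cache.

```python
SRAM_BYTES_PER_CHIP = 230_000_000  # 230 MB, from the bullet list above


def min_chips(params, bytes_per_param):
    """Smallest chip count whose combined SRAM holds the weights (ceiling division)."""
    total_bytes = params * bytes_per_param
    return -(-total_bytes // SRAM_BYTES_PER_CHIP)


PARAMS_70B = 70_000_000_000
chips_8bit = min_chips(PARAMS_70B, 1)    # assumed 1 byte per parameter
chips_16bit = min_chips(PARAMS_70B, 2)   # assumed 2 bytes per parameter

print(f"70B at 1 byte/param:  at least {chips_8bit} chips")
print(f"70B at 2 bytes/param: at least {chips_16bit} chips")
```

Either way a single 70B model occupies several racks, which is the cost side of the latency advantage.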

[source: Groq LPU Architecture Brief 2024]

Apple Silicon (M-series Neural Engine)

Not a training accelerator. ANE is ~38 TOPS INT8 for on-device inference on iPhones/Macs. Relevant only if you target consumer hardware.

Software stack maturity — the honest ranking

This is the axis that actually decides choices, more than FLOPS.

| Stack | Maturity | PyTorch native? | Custom kernels | Ecosystem |
|---|---|---|---|---|
| CUDA + PyTorch | A+ | Yes | Triton, CUDA-C, CUTLASS | The entire field |
| ROCm + PyTorch | B | Yes (mostly) | hipBLAS, AITemplate | Growing |
| JAX + XLA (TPU) | A | Via torch_xla (B-) | Pallas (new) | JAX/research community |
| Neuron SDK (Trainium) | B- | Via torch-neuronx | Limited | AWS-only |
| SynapseAI (Gaudi) | B- | Via habana_frameworks | Limited | Intel + select customers |
| Cerebras SDK | C | Partial | Vendor-only | Small |
| Groq SDK | C | Compiled flow only | Vendor-only | Small |

The pattern: NVIDIA's moat is software, not hardware. ROCm and the cloud-vendor stacks are closing the gap, but the lead is still measured in years.

Strategic picture for 2026

  • Default for new training projects: NVIDIA H100/H200/B200. Software risk is lowest. Supply is the constraint.
  • If memory > compute matters: AMD MI300X or H200, depending on stack tolerance.
  • If you live in JAX: TPU v5p. The story is compiler-native sharding.
  • If you live in AWS at scale and want low cost: Trainium 2.
  • If you need single-stream inference latency: Groq LPU.
  • If you want to bet on a challenger: Gaudi 3 has interesting economics, and its networking story is genuinely differentiated.
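The bullets above can be written down as a small rule table, which is a useful way to check that your own thesis is consistent. The sketch below is one possible encoding; the function name, the flag names, and especially the order in which the rules are checked are assumptions made for illustration, not a ranking taken from this page.

```python
def recommend(*, latency_critical_inference=False, jax_first=False,
              aws_at_scale=False, memory_bound=False,
              tolerates_rocm=False, challenger_bet=False):
    """Map workload constraints to a starting-point accelerator (hypothetical rule order)."""
    if latency_critical_inference:
        return "Groq LPU"
    if jax_first:
        return "TPU v5p"
    if aws_at_scale:
        return "Trainium 2"
    if memory_bound:
        # Stack tolerance decides between the two large-memory parts.
        return "AMD MI300X" if tolerates_rocm else "NVIDIA H200"
    if challenger_bet:
        return "Intel Gaudi 3"
    # Lowest software risk; supply is the constraint.
    return "NVIDIA H100/H200/B200"


print(recommend())
print(recommend(memory_bound=True, tolerates_rocm=True))
print(recommend(jax_first=True))
```

Reordering the checks changes the answer when several flags are set, and deciding that order for your own workload is exactly the argument the next paragraph asks you to be able to make.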

For an interview, having a thesis on the landscape — not just a list — is what distinguishes a senior candidate. Be ready to defend "I would train this on X because Y, even though Z."

References

  • NVIDIA Blackwell Architecture Technical Brief, 2024.
  • AMD Instinct MI300X Datasheet and CDNA 3 whitepaper, 2023.
  • Intel Gaudi 3 Whitepaper, 2024.
  • Google TPU v5p announcement and TPU v4: An Optically Reconfigurable Supercomputer (Jouppi et al. 2023).
  • AWS Trainium 2 documentation; AWS re:Invent 2024 keynote.
  • Cerebras WSE-3 Whitepaper, 2024.
  • Groq LPU Architecture Brief, 2024.
  • MLPerf Training v4.0 and Inference v4.1 results, 2024.
  • SemiAnalysis, Accelerator Landscape, ongoing 2024-2026 coverage.