01 — CPU vs GPU vs TPU vs Trainium vs Gaudi
Four hardware philosophies for the same matrix multiplication: control-flow (CPU), SIMT (GPU), systolic array (TPU/Trainium), and hybrid VLIW (Gaudi). Each one optimizes a different metric.
The fundamental trade
Every accelerator answers one question: given fixed silicon area, how much do I spend on control, how much on arithmetic, how much on memory?
- CPU: most area is control (branch prediction, out-of-order, caches). Few, fast scalar ALUs. Optimizes latency on arbitrary control flow.
- GPU: most area is arithmetic (thousands of small ALUs). Light per-thread control. Optimizes throughput on regular data-parallel work.
- TPU / Trainium: most area is one giant systolic array (a 128×128 or 256×256 grid of multiply-accumulate cells). Almost no per-cell control. Optimizes dense matmul throughput at the cost of programmability.
- Gaudi: hybrid — VLIW vector processors (TPCs) + a Matrix Math Engine. Optimizes price-per-FLOP for transformer workloads.
The whole module is variations on this single trade.
Side-by-side numbers (2024-2026 flagships)
| Chip | Peak compute (dense) | HBM capacity | HBM bandwidth | Inter-chip / scale-up | Process | TDP |
|---|---|---|---|---|---|---|
| Intel i5-8250U (CPU, 2018) | ~0.2 TF FP32 | 62 GiB DDR4 (system RAM) | ~19 GB/s | — | 14 nm | 15 W |
| NVIDIA A100 80 GB (GPU, 2020) | 312 TF FP16/BF16; 156 TF TF32 | 80 GB HBM2e | 2.0 TB/s | NVLink 3: 600 GB/s | 7 nm | 400 W |
| NVIDIA H100 SXM5 (GPU, 2022) | 989 TF FP16/BF16; 1979 TF FP8 | 80 GB HBM3 | 3.35 TB/s | NVLink 4: 900 GB/s | 4 nm | 700 W |
| NVIDIA H200 SXM (GPU, 2024) | 989 TF FP16/BF16; 1979 TF FP8 | 141 GB HBM3e | 4.8 TB/s | NVLink 4: 900 GB/s | 4 nm | 700 W |
| NVIDIA B200 (GPU, 2024) | ~2.25 PF FP16; ~4.5 PF FP8; ~9 PF FP4 | 192 GB HBM3e | 8 TB/s | NVLink 5: 1.8 TB/s | 4NP | 1000 W |
| Google TPU v5p (2024) | 459 TF BF16; 918 TF INT8 | 95 GB HBM | 2.76 TB/s | ICI 3D-torus: 4.8 Tb/s/chip | — | ~450 W [not publicly confirmed] |
| AWS Trainium 2 (2024) | ~650 TF FP16 (per chip) | 96 GB HBM | 2.9 TB/s | NeuronLink-v3 | — | ~500 W [not publicly confirmed] |
| Intel Gaudi 3 (2024) | 1835 TF FP8; 1835 TF BF16 | 128 GB HBM2e | 3.7 TB/s | 24× 200 GbE on-chip RDMA | 5 nm | 900 W |
| AMD MI300X (2023) | 1307 TF FP16; 2614 TF FP8 | 192 GB HBM3 | 5.3 TB/s | Infinity Fabric: 896 GB/s | 5/6 nm | 750 W |
[source: NVIDIA H100 Tensor Core GPU Architecture Whitepaper 2022; NVIDIA H200 Datasheet 2024; NVIDIA Blackwell Architecture Technical Brief 2024; Google TPU v5p announcement 2023; AWS Trainium 2 documentation 2024; Intel Gaudi 3 Whitepaper 2024; AMD MI300X Datasheet 2023]
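One way to read the table is to divide peak compute by memory bandwidth, which gives the arithmetic intensity a kernel needs before the ALUs, rather than HBM, become the limit. The sketch below does that for the accelerator rows, using the dense 16-bit peak and HBM bandwidth as printed in the table. The dictionary and function names are illustrative, and the figures are vendor peaks, not measurements.

```python
# Dense 16-bit peak (TFLOP/s) and HBM bandwidth (TB/s), copied from the table.
CHIPS = {
    "A100 80 GB": (312, 2.0),
    "H100 SXM5": (989, 3.35),
    "H200 SXM": (989, 4.8),
    "B200": (2250, 8.0),
    "TPU v5p": (459, 2.76),
    "Trainium 2": (650, 2.9),
    "Gaudi 3": (1835, 3.7),
    "MI300X": (1307, 5.3),
}


def ridge_point(peak_tflops: float, bandwidth_tbs: float) -> float:
    """FLOPs needed per byte of HBM traffic to keep the ALUs busy."""
    return peak_tflops / bandwidth_tbs


ridge = {name: ridge_point(*spec) for name, spec in CHIPS.items()}

# Print from the most bandwidth-rich chip (low ridge) to the most compute-heavy.
for name, value in sorted(ridge.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {value:6.0f} FLOP/byte")
```

With these inputs the H200 lands at a lower ridge point than the H100: the compute is identical, so the extra bandwidth makes it easier for a kernel to reach peak.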
CPU — Intel i5-8250U as the baseline
This is Borja's laptop. We use it as the ground-truth single-thread baseline (see Phase 01). Key properties:
- 4 cores / 8 threads, 1.6-3.4 GHz.
- AVX2 (256-bit SIMD): 8× FP32 lanes per core → 4 cores × 3.4 GHz × 8 lanes × 2 FLOPs (FMA) ≈ 218 GFLOP/s theoretical peak FP32.
- L1d 32 KB / core, L2 256 KB / core, L3 6 MB shared.
- DDR4-2400 dual-channel: ~38.4 GB/s theoretical, ~19 GB/s achievable. Arithmetic intensity break-even ≈ 218 / 19 ≈ 11 FLOP/byte.
What this teaches: most "real" code is below that break-even. CPUs spend most of their time waiting on memory.
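To make "below that break-even" concrete, the sketch below counts FLOPs and DRAM bytes for a few streaming FP32 kernels and converts the ratio into a bandwidth-limited throughput, using the ~19 GB/s achievable figure quoted above. The kernel list and byte counts are illustrative assumptions (working set larger than cache, every operand touched once), not measurements from this laptop.

```python
# Assumption: the ~19 GB/s achievable DRAM bandwidth quoted in the list above.
DRAM_BYTES_PER_S = 19e9

# (FLOPs per element, DRAM bytes moved per element) for FP32 streaming kernels.
KERNELS = {
    "dot product   s += x[i] * y[i]": (2, 8),   # read x, read y
    "saxpy         y[i] += a * x[i]": (2, 12),  # read x, read y, write y
    "vector scale  x[i] *= a": (1, 8),          # read x, write x
}


def intensity(flops: float, bytes_moved: float) -> float:
    """Arithmetic intensity in FLOP per byte of DRAM traffic."""
    return flops / bytes_moved


def bandwidth_bound_gflops(flops: float, bytes_moved: float,
                           bw: float = DRAM_BYTES_PER_S) -> float:
    """Best-case GFLOP/s when DRAM bandwidth is the only limit."""
    return intensity(flops, bytes_moved) * bw / 1e9


for name, (flops, bytes_moved) in KERNELS.items():
    print(f"{name:32s} {intensity(flops, bytes_moved):5.3f} FLOP/byte "
          f"-> {bandwidth_bound_gflops(flops, bytes_moved):4.1f} GFLOP/s")
```

All three kernels sit well under 1 FLOP/byte, far from a break-even of roughly 10 FLOP/byte, so under these assumptions the memory system caps them at a few GFLOP/s no matter how wide the SIMD units are.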
GPU — NVIDIA H100 as the canonical training chip
The H100 is what most frontier labs trained on in 2023-2024.
- 132 SMs × 128 FP32 cores = 16896 FP32 lanes, but the Tensor Cores are what dominate the FLOPS budget.
- 4th-generation Tensor Cores: support TF32, BF16, FP16, FP8 (two variants: E4M3 and E5M2), INT8.
- Peak FP16/BF16: 989 TF/s dense, 1979 TF/s with 2:4 structured sparsity.
- Peak FP8: 1979 TF/s dense, 3958 TF/s with sparsity. FP8 is the recipe for H100-era training. [source: NVIDIA H100 datasheet 2024]
- HBM3 80 GB at 3.35 TB/s. Arithmetic intensity break-even (FP16) = 989e12 / 3.35e12 ≈ 295 FLOP/byte. This number is the single most important thing on the chip — see §03 of this module.
- NVLink 4: 18 links × 50 GB/s = 900 GB/s of GPU-to-GPU bandwidth per chip; with NVSwitch, each of the 8 GPUs in a DGX H100 can reach any peer at up to 900 GB/s.
What this teaches: at H100 scale, a kernel that does < ~300 FLOPs per byte loaded from HBM is bandwidth-bound. Most attention kernels (without FlashAttention) are.
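The bandwidth-bound claim follows from the roofline model: attainable throughput is the smaller of peak compute and intensity times bandwidth. The sketch below applies it to two idealized FP16 matrix shapes using the H100 figures above. `gemm_intensity` assumes each operand crosses HBM exactly once, which is an optimistic simplification, and both function names are made up for this example.

```python
H100_PEAK_FLOPS = 989e12        # dense FP16/BF16 peak, from the list above
H100_HBM_BYTES_PER_S = 3.35e12  # HBM3 bandwidth, from the list above


def attainable_flops(intensity: float,
                     peak: float = H100_PEAK_FLOPS,
                     bw: float = H100_HBM_BYTES_PER_S) -> float:
    """Roofline: the tighter of the compute limit and the bandwidth limit."""
    return min(peak, intensity * bw)


def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Ideal FLOP/byte of C[m,n] = A[m,k] @ B[k,n], each operand moved once."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


shapes = {
    "square GEMM 4096^3": (4096, 4096, 4096),
    "GEMV (batch-1 decode)": (1, 4096, 4096),
}
for name, (m, n, k) in shapes.items():
    ai = gemm_intensity(m, n, k)
    frac = attainable_flops(ai) / H100_PEAK_FLOPS
    print(f"{name:22s} {ai:8.1f} FLOP/byte -> {100 * frac:6.2f}% of peak")
```

Under these assumptions the large square GEMM clears the ~295 FLOP/byte ridge comfortably, while the matrix-vector case sits near 1 FLOP/byte and can reach well under 1% of peak.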
TPU — the systolic-array philosophy
Google TPUs are radically different. The core is a systolic array: a 2D grid of MAC units that pumps data through edges in lockstep. For matmul, this is near-optimal: each datum loaded from HBM gets reused ~\(N\) times in an \(N \times N\) array.
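The reuse argument can be checked with a toy simulation. The sketch below models an output-stationary systolic array in plain Python: values of A enter at the left edge and move right, values of B enter at the top and move down, and each cell accumulates one output element. This is a teaching model under those assumptions, not a description of how any real TPU generation schedules its array.

```python
def systolic_matmul(A, B):
    """Simulate an N x N output-stationary systolic array computing A @ B.

    Returns (C, edge_loads, macs): the product, the number of values
    injected at the array edges, and the number of useful MACs performed.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]    # one accumulator per cell
    a_reg = [[0] * n for _ in range(n)]  # operand arriving from the left
    b_reg = [[0] * n for _ in range(n)]  # operand arriving from above
    edge_loads = 0
    macs = 0
    for t in range(3 * n - 2):           # cycles until the last cell finishes
        # Move every operand one hop: A to the right, B downward.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed inputs: row i of A and column i of B lag by i cycles.
        for i in range(n):
            k = t - i
            valid = 0 <= k < n
            a_reg[i][0] = A[i][k] if valid else 0
            b_reg[0][i] = B[k][i] if valid else 0
            edge_loads += 2 if valid else 0
        # Every cell multiplies what it currently holds and accumulates.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
                if 0 <= t - i - j < n:
                    macs += 1
    return acc, edge_loads, macs


C, loads, macs = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, loads, macs)
```

Each MAC consumes two operands, so `2 * macs / loads` is the number of times an injected value is used. For an \(N \times N\) problem the simulation performs \(N^3\) MACs on \(2N^2\) injected values, giving exactly \(N\) uses per value.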
- TPU v5p: 459 TF BF16 / 918 TF INT8 per chip; 95 GB HBM @ 2.76 TB/s.
- Inter-chip Interconnect (ICI): a 3D torus topology with ~4.8 Tb/s of ICI bandwidth per chip, designed for AllReduce on a Pod.
- Programming model: XLA via JAX or TF. No CUDA. The compiler does all the scheduling — there is no "warp" abstraction to expose.
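The 3D torus in the list above is a grid whose edges wrap around, so every chip has six direct neighbours and each dimension forms a ring. A minimal sketch of that addressing, using a hypothetical 4×4×4 slice (the shape and function names are illustrative and do not correspond to any specific Pod configuration):

```python
def torus_neighbors(coord, shape):
    """Six direct neighbours of a chip in a 3D torus (indices wrap around)."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % shape[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, shape):
    """Shortest hop count between two chips, going either way round each ring."""
    return sum(min((x - y) % s, (y - x) % s) for x, y, s in zip(a, b, shape))


SHAPE = (4, 4, 4)  # hypothetical slice shape, chosen only for illustration
print(torus_neighbors((0, 0, 0), SHAPE))
print(torus_hops((0, 0, 0), (3, 3, 3), SHAPE))  # opposite corner of the grid
print(torus_hops((0, 0, 0), (2, 2, 2), SHAPE))  # farthest chip in the torus
```

The wraparound links are what keep the opposite corner only 3 hops away instead of 9 in a plain mesh, and they let a ring-style collective run along each dimension without a special case at the boundary.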
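Strength: dense matmul throughput per dollar is excellent. The compiler partitions work across chips from sharding annotations (`jax.jit` with sharded arrays), with `jax.pmap` / `shard_map` available for explicit per-device control. The 3D torus is well suited to AllReduce.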
Weakness: anything off the matmul golden path (sparse ops, dynamic shapes, custom kernels) is awkward. Ecosystem is JAX-first; PyTorch on TPU exists (torch_xla) but lags. No Triton.
Trainium — AWS's systolic-ish accelerator
AWS Trainium (Trn1, 2022) and Trainium 2 (Trn2, 2024) are systolic-array accelerators competing with GPUs and TPUs.
- Trainium 2: ~650 TF FP16 per chip, 96 GB HBM @ 2.9 TB/s, NeuronLink for chip-to-chip scale-up.
- A `trn2.48xlarge` instance packs 16 chips; a Trn2 UltraServer joins 4 such instances (64 chips) over NeuronLink; AWS sells UltraClusters of thousands of Trainium chips.
- Programming via the Neuron SDK: XLA backend, PyTorch/JAX frontends.
Strength: AWS-priced; reportedly ~40% cheaper per training FLOP than H100 on the same workload [source: AWS re:Invent 2024 keynote, vendor claim, not independently MLPerf-confirmed at time of writing].
Weakness: software stack maturity is well behind CUDA. Custom kernels are painful. Limited to AWS.
Gaudi — Intel's transformer-targeted chip
Intel Gaudi 3 (2024) is the result of the Habana Labs acquisition.
- 8 Matrix Math Engines + 64 Tensor Processor Cores (VLIW vector units).
- 1835 TF FP8 / BF16, 128 GB HBM2e @ 3.7 TB/s.
- On-chip 24× 200 GbE RDMA: scale-out without a separate NIC. Unique among the listed accelerators.
Strength: the integrated RoCE networking is genuinely novel — every chip is its own NIC. Good for cost-sensitive transformer training where InfiniBand fabric is a major CapEx.
Weakness: SynapseAI stack lags CUDA. PyTorch integration via habana_frameworks. Smaller community.
Sparsity support
- NVIDIA Tensor Cores: 2:4 structured sparsity (in every group of 4 weights, 2 are zero, and the pattern is declared in the weight layout). Up to 2× Tensor Core throughput if the model is pruned accordingly. Rarely used in practice because dense FP8 is fast enough and pruning hurts quality.
- TPU: no hardware sparsity support as of v5p.
- Trainium 2: 4:8 sparsity supported [source: AWS Trainium 2 docs 2024].
- Gaudi 3: structured sparsity not advertised as a key feature.
- MI300X: 2:4 sparsity supported [source: AMD CDNA 3 whitepaper 2023].
The honest assessment: sparsity is a marketing number more than a workhorse feature in 2026.
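To show what the 2:4 constraint means for a weight row, the sketch below zeroes the two smallest-magnitude values in every group of four and then checks the pattern. Magnitude pruning is used here only as a simple stand-in, and the function names are made up for this example. It is not any vendor's pruning recipe and says nothing about the accuracy cost.

```python
def prune_2_of_4(weights):
    """Zero the 2 smallest-magnitude values in each group of 4 weights."""
    if len(weights) % 4:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest magnitudes within this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def is_2_of_4(weights):
    """True if no group of 4 consecutive weights has more than 2 non-zeros."""
    return all(
        sum(1 for v in weights[s:s + 4] if v != 0) <= 2
        for s in range(0, len(weights), 4)
    )


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)
print(is_2_of_4(row), is_2_of_4(sparse_row))
```

Half the values in `sparse_row` are zero in a fixed, hardware-friendly pattern, which is where the headline 2× comes from. Whether the model tolerates losing those weights is a separate question.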
Which accelerator for which job?
| Workload | Best fit | Why |
|---|---|---|
| Pretraining a 70B+ dense model | H100 / H200 / B200 cluster | Software maturity + FP8 + NVLink+NVSwitch |
| Pretraining if you have a TPU contract | TPU v5p Pod | 3D-torus AllReduce + compiler-driven sharding |
| Cost-sensitive training | Trainium 2 (AWS) or Gaudi 3 | ~30-40% lower $/FLOP, accept stack pain |
| Serving large MoE / long-context | MI300X | 192 GB HBM fits more in one chip; less sharding |
| Ultra-low-latency inference | Groq LPU or Cerebras WSE | SRAM-resident weights, no HBM bottleneck (see §05) |
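The "fits more in one chip" argument in the table is mostly arithmetic. The sketch below estimates the fewest chips whose HBM can hold a model's weights, using capacities from the comparison table earlier on this page. The 70B FP16 model and the `usable_fraction` headroom for KV cache and activations are illustrative assumptions, not measured values.

```python
import math

# HBM capacity per chip in GB, copied from the comparison table.
HBM_GB = {
    "H100 SXM5": 80,
    "H200 SXM": 141,
    "Gaudi 3": 128,
    "B200": 192,
    "MI300X": 192,
}


def min_chips_for_weights(params_billion: float, bytes_per_param: float,
                          hbm_gb: float, usable_fraction: float = 0.8) -> int:
    """Fewest chips whose HBM holds the weights alone.

    usable_fraction is an assumed share of HBM left for weights after
    reserving room for KV cache, activations, and runtime overhead.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    return math.ceil(weight_gb / (hbm_gb * usable_fraction))


# Hypothetical example: a 70B-parameter dense model stored in FP16 (2 bytes).
chips_needed = {
    name: min_chips_for_weights(70, 2, capacity)
    for name, capacity in HBM_GB.items()
}
for name, count in chips_needed.items():
    print(f"{name:10s} needs at least {count} chip(s)")
```

Under these assumptions the 140 GB of weights fit on a single 192 GB chip but need three 80 GB chips, which is the sharding difference the table refers to.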
Cross-links
- 02-h100-and-h200.md: the H100 in much more depth.
- 05-the-accelerator-landscape-2026.md: software stack maturity per accelerator.
- Phase 01 — Hardware Substrate: CPU side, where this contrast begins.
References
- NVIDIA H100 Tensor Core GPU Architecture Whitepaper, 2022 (rev. 2024).
- NVIDIA H200 Tensor Core GPU Datasheet, 2024.
- NVIDIA Blackwell Architecture Technical Brief, 2024.
- Google Cloud TPU v5p announcement, December 2023.
- AWS Trainium 2 architecture documentation, 2024.
- Intel Gaudi 3 AI Accelerator Whitepaper, 2024.
- AMD Instinct MI300X Datasheet, 2023.
- Jouppi et al. 2017, In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA.
- Jouppi et al. 2023, TPU v4: An Optically Reconfigurable Supercomputer, ISCA.