
01 — CPU vs GPU vs TPU vs Trainium vs Gaudi

Four hardware philosophies for the same matrix multiplication: control flow (CPU), SIMT (GPU), systolic array (TPU/Trainium), and hybrid VLIW (Gaudi). Each one optimizes a different metric.

The fundamental trade

Every accelerator answers one question: given fixed silicon area, how much do I spend on control, how much on arithmetic, how much on memory?

CPU: most area is control (branch prediction, out-of-order, caches). Few, fast scalar ALUs. Optimizes latency on arbitrary control flow.
GPU: most area is arithmetic (thousands of small ALUs). Light per-thread control. Optimizes throughput on regular data-parallel work.
TPU / Trainium: most area is one giant systolic array (a 128×128 or 256×256 grid of multiply-accumulate cells). Almost no per-cell control. Optimizes dense matmul throughput at the cost of programmability.
Gaudi: hybrid — VLIW vector processors (TPCs) + a Matrix Math Engine. Optimizes price-per-FLOP for transformer workloads.

The whole module is variations on this single trade.

Side-by-side numbers (2024-2026 flagships)

Chip	Peak compute (dense)	Memory capacity	Memory bandwidth	Inter-chip / scale-up	Process	TDP
Intel i5-8250U (CPU, 2018)	~0.2 TF FP32	62 GiB DDR4 (system RAM)	~19 GB/s	—	14 nm	15 W
NVIDIA A100 80 GB (GPU, 2020)	312 TF FP16/BF16; 156 TF TF32	80 GB HBM2e	2.0 TB/s	NVLink 3: 600 GB/s	7 nm	400 W
NVIDIA H100 SXM5 (GPU, 2022)	989 TF FP16/BF16; 1979 TF FP8	80 GB HBM3	3.35 TB/s	NVLink 4: 900 GB/s	4 nm	700 W
NVIDIA H200 SXM (GPU, 2024)	989 TF FP16/BF16; 1979 TF FP8	141 GB HBM3e	4.8 TB/s	NVLink 4: 900 GB/s	4 nm	700 W
NVIDIA B200 (GPU, 2024)	~2.25 PF FP16; ~4.5 PF FP8; ~9 PF FP4	192 GB HBM3e	8 TB/s	NVLink 5: 1.8 TB/s	4NP	1000 W
Google TPU v5p (2023)	459 TF BF16; 918 TF INT8	95 GB HBM	2.76 TB/s	ICI 3D-torus: 4.8 Tb/s/chip	—	~450 W [not publicly confirmed]
AWS Trainium 2 (2024)	~650 TF FP16 (per chip)	96 GB HBM	2.9 TB/s	NeuronLink-v3	—	~500 W [not publicly confirmed]
Intel Gaudi 3 (2024)	1835 TF FP8; 1835 TF BF16	128 GB HBM2e	3.7 TB/s	24× 200 GbE on-chip RDMA	5 nm	900 W
AMD MI300X (2023)	1307 TF FP16; 2614 TF FP8	192 GB HBM3	5.3 TB/s	Infinity Fabric: 896 GB/s	5/6 nm	750 W

[source: NVIDIA H100 Tensor Core GPU Architecture Whitepaper 2022; NVIDIA H200 Datasheet 2024; NVIDIA Blackwell Architecture Technical Brief 2024; Google TPU v5p announcement 2023; AWS Trainium 2 documentation 2024; Intel Gaudi 3 Whitepaper 2024; AMD MI300X Datasheet 2023]

CPU — Intel i5-8250U as the baseline

This is Borja's laptop. We use it as the ground-truth single-thread baseline (see Phase 01). Key properties:

4 cores / 8 threads, 1.6-3.4 GHz.
AVX2 (256-bit SIMD): 8× FP32 lanes per core → 4 × 3.4 GHz × 8 × 2 (FMA) ≈ 218 GFLOP/s theoretical peak FP32 (at the 3.4 GHz turbo clock, counting one FMA per lane per cycle).
L1d 32 KB / core, L2 256 KB / core, L3 6 MB shared.
DDR4-2400 dual-channel: ~38.4 GB/s theoretical, ~19 GB/s achievable. Arithmetic intensity break-even ≈ 218 / 19 ≈ 11 FLOP/byte.

What this teaches: most "real" code is below that break-even. CPUs spend most of their time waiting on memory.
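The peak and break-even arithmetic above can be reproduced in a few lines. This is a minimal sketch that plugs in the figures from the list (4 cores, 3.4 GHz, 8 FP32 lanes, 2 FLOPs per FMA, ~19 GB/s achievable bandwidth); the function names are illustrative, and the single-FMA-per-lane-per-cycle assumption is taken from the formula in the text rather than measured.

```python
def peak_gflops(cores: int, ghz: float, simd_lanes: int,
                flops_per_lane_per_cycle: int = 2) -> float:
    """Theoretical peak in GFLOP/s: cores x clock x SIMD lanes x FLOPs per lane per cycle.

    flops_per_lane_per_cycle=2 models one fused multiply-add (1 mul + 1 add).
    """
    return cores * ghz * simd_lanes * flops_per_lane_per_cycle


def break_even_intensity(peak_gflops_value: float, bandwidth_gb_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute time equals memory time."""
    return peak_gflops_value / bandwidth_gb_s


# Figures from the list above; 19 GB/s is the "achievable" DDR4 bandwidth.
cpu_peak = peak_gflops(cores=4, ghz=3.4, simd_lanes=8)
cpu_ridge = break_even_intensity(cpu_peak, bandwidth_gb_s=19.0)

print(f"peak = {cpu_peak:.1f} GFLOP/s, break-even = {cpu_ridge:.1f} FLOP/byte")
```

A kernel whose FLOP/byte ratio falls below `cpu_ridge` cannot reach the peak, no matter how well it is vectorized.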

GPU — NVIDIA H100 as the canonical training chip

The H100 is what most frontier labs trained on in 2023-2024.

132 SMs × 128 FP32 cores = 16896 FP32 lanes, but the Tensor Cores are what dominate the FLOPS budget.
4th-generation Tensor Cores: support TF32, BF16, FP16, FP8 (two variants: E4M3 and E5M2), INT8.
Peak FP16/BF16: 989 TF/s dense, 1979 TF/s with 2:4 structured sparsity.
Peak FP8: 1979 TF/s dense, 3958 TF/s with sparsity. FP8 is the recipe for H100-era training. [source: NVIDIA H100 datasheet 2024]
HBM3 80 GB at 3.35 TB/s. Arithmetic intensity break-even (FP16) = 989e12 / 3.35e12 ≈ 295 FLOP/byte. This number is the single most important thing on the chip — see §03 of this module.
NVLink 4: 18 links × 50 GB/s = 900 GB/s per GPU for GPU-to-GPU traffic; with NVSwitch, each of the 8 GPUs in a DGX H100 can reach any peer at up to 900 GB/s.

What this teaches: at H100 scale, a kernel that does < ~300 FLOPs per byte loaded from HBM is bandwidth-bound. Most attention kernels (without FlashAttention) are.
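The bandwidth-bound threshold follows from a simple roofline model: attainable throughput is the smaller of the compute peak and intensity times bandwidth. The sketch below uses the H100 figures from the list above. The `matmul_intensity` helper is an illustrative assumption (square FP16 matmul where each matrix crosses HBM exactly once), not a measurement of any real kernel.

```python
H100_PEAK_FP16_TFLOPS = 989.0  # dense FP16/BF16, from the list above
H100_HBM_TB_S = 3.35           # HBM3 bandwidth, from the list above


def attainable_tflops(intensity: float,
                      peak_tflops: float = H100_PEAK_FP16_TFLOPS,
                      bandwidth_tb_s: float = H100_HBM_TB_S) -> float:
    """Roofline: (FLOP/byte) x (TB/s) gives TFLOP/s, capped by the compute peak."""
    return min(peak_tflops, intensity * bandwidth_tb_s)


def matmul_intensity(n: int, bytes_per_element: int = 2) -> float:
    """FLOP/byte of an n x n x n matmul.

    ASSUMPTION: A, B and C each cross HBM exactly once (ideal on-chip reuse).
    """
    flops = 2 * n ** 3                         # one multiply + one add per MAC
    bytes_moved = 3 * n * n * bytes_per_element
    return flops / bytes_moved


ridge = H100_PEAK_FP16_TFLOPS / H100_HBM_TB_S  # break-even intensity, ~295 FLOP/byte

for n in (256, 1024, 8192):
    ai = matmul_intensity(n)
    bound = "bandwidth-bound" if ai < ridge else "compute-bound"
    print(f"n={n:5d}  intensity={ai:7.1f} FLOP/byte  "
          f"attainable={attainable_tflops(ai):6.1f} TF/s  ({bound})")
```

Under these assumptions the small matmul sits left of the ridge and the large ones sit right of it, which is why tiling and fusion matter most for small or memory-heavy kernels.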

TPU — the systolic-array philosophy

Google TPUs are radically different. The core is a systolic array: a 2D grid of MAC units that pumps data through edges in lockstep. For matmul, this is near-optimal: each datum loaded from HBM gets reused ~$N$ times in an $N \times N$ array.
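To make the reuse claim concrete, here is a small pure-Python simulation of an output-stationary systolic array. It is a teaching sketch under stated assumptions (one accumulator per cell, skewed operand injection, zero padding at the edges) and does not describe the actual TPU microarchitecture; `systolic_matmul` and its counters are illustrative names.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A @ B.

    A is n x k, B is k x m. Cell (i, j) owns the accumulator for C[i][j].
    A values enter at the left edge and move right; B values enter at the
    top edge and move down. Row i and column j are delayed by i and j
    cycles so that matching operands meet in the right cell.
    Returns (C, edge_loads), where edge_loads counts values fed from memory.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    acc = [[0] * m for _ in range(n)]
    a_reg = [[0] * m for _ in range(n)]   # A operand currently held by each cell
    b_reg = [[0] * m for _ in range(n)]   # B operand currently held by each cell
    edge_loads = 0

    for t in range(k + n + m - 2):
        # Shift A one cell right and B one cell down (far edge first).
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject new operands at the edges, skewed by row / column index.
        for i in range(n):
            kk = t - i
            a_reg[i][0] = A[i][kk] if 0 <= kk < k else 0
            edge_loads += 0 <= kk < k
        for j in range(m):
            kk = t - j
            b_reg[0][j] = B[kk][j] if 0 <= kk < k else 0
            edge_loads += 0 <= kk < k
        # Every cell performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, edge_loads


N = 4
A = [[i + j for j in range(N)] for i in range(N)]
B = [[i * j + 1 for j in range(N)] for i in range(N)]
C, edge_loads = systolic_matmul(A, B)

reference = [[sum(A[i][x] * B[x][j] for x in range(N)) for j in range(N)]
             for i in range(N)]
operand_uses = 2 * N * N * N          # each MAC consumes one A and one B value
print(C == reference, edge_loads, operand_uses / edge_loads)
```

For this square case each value fed in at an edge is consumed by N cells, so the ratio of operand uses to edge loads equals N.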

TPU v5p: 459 TF BF16 / 918 TF INT8 per chip; 95 GB HBM @ 2.76 TB/s.
Inter-chip Interconnect (ICI): a 3D torus topology with ~4.8 Tb/s of ICI bandwidth per chip, designed for AllReduce on a Pod.
Programming model: XLA via JAX or TF. No CUDA. The compiler does all the scheduling — there is no "warp" abstraction to expose.
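The 3D torus mentioned above can be pictured as a grid where every chip links to two neighbors per axis, with wraparound at the edges. The following sketch uses a hypothetical 4×4×4 shape for illustration only (it is not a statement about real Pod dimensions), and the helper names are made up for this example.

```python
def torus_neighbors(coord, dims):
    """Return the 2 * len(dims) neighbors of `coord` in a torus of shape `dims`.

    Each axis contributes a -1 and a +1 neighbor; the modulo gives wraparound.
    """
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (coord[axis] + step) % size
            neighbors.append(tuple(nxt))
    return neighbors


def torus_hops(src, dst, dims):
    """Minimum hop count between two chips, taking the shorter way around each ring."""
    return sum(min(abs(a - b), size - abs(a - b))
               for a, b, size in zip(src, dst, dims))


DIMS = (4, 4, 4)  # hypothetical shape, for illustration only
print(torus_neighbors((0, 0, 3), DIMS))
print(torus_hops((0, 0, 0), (2, 2, 2), DIMS))  # farthest chip in a 4x4x4 torus
```

The wraparound links keep the worst-case hop count at half the ring length per axis, which is part of why ring-style AllReduce maps well onto this topology.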

Strength: dense matmul throughput per dollar is excellent. The compiler handles partitioning (jax.jit with sharding annotations, or pmap / shard_map for explicit control). The 3D torus is well suited to AllReduce.

Weakness: anything off the matmul golden path (sparse ops, dynamic shapes, custom kernels) is awkward. Ecosystem is JAX-first; PyTorch on TPU exists (torch_xla) but lags. No Triton.

Trainium — AWS's systolic-ish accelerator

AWS Trainium (Trn1, 2022) and Trainium 2 (Trn2, 2024) are systolic-array accelerators competing with GPUs and TPUs.

Trainium 2: ~650 TF FP16 per chip, 96 GB HBM @ 2.9 TB/s, NeuronLink for chip-to-chip scale-up.
A trn2.48xlarge instance packs 16 chips, and a Trn2 UltraServer links 4 such instances (64 chips) over NeuronLink; AWS sells UltraClusters of thousands of Trainium chips.
Programming via the Neuron SDK: XLA backend, PyTorch/JAX frontends.

Strength: AWS-priced; reportedly ~40% cheaper per training FLOP than H100 on the same workload [source: AWS re:Invent 2024 keynote, vendor claim, not independently MLPerf-confirmed at time of writing].

Weakness: software stack maturity is well behind CUDA. Custom kernels are painful. Limited to AWS.

Gaudi — Intel's transformer-targeted chip

Intel Gaudi 3 (2024) is the result of the Habana Labs acquisition.

8 Matrix Math Engines + 64 Tensor Processor Cores (VLIW vector units).
1835 TF FP8 / BF16, 128 GB HBM2e @ 3.7 TB/s.
On-chip 24× 200 GbE RDMA: scale-out without a separate NIC. Unique among the listed accelerators.

Strength: the integrated RoCE networking is genuinely novel — every chip is its own NIC. Good for cost-sensitive transformer training where InfiniBand fabric is a major CapEx.

Weakness: SynapseAI stack lags CUDA. PyTorch integration via habana_frameworks. Smaller community.

Sparsity support

NVIDIA Tensor Cores: 2:4 structured sparsity (2 of every 4 weights are zero, declared in the weight layout). Up to 2× speedup on Tensor Core ops if the model is pruned accordingly. Rarely used in practice because dense FP8 is fast enough and pruning hurts quality.
TPU: no hardware sparsity support as of v5p.
Trainium 2: 4:8 sparsity supported [source: AWS Trainium 2 docs 2024].
Gaudi 3: structured sparsity not advertised as a key feature.
MI300X: 2:4 sparsity supported [source: AMD CDNA 3 whitepaper 2023].
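The 2:4 pattern is easy to state in code: in every group of four consecutive weights, at most two are non-zero. The sketch below enforces it with magnitude pruning, which is an assumption chosen for illustration (the text above does not say how weights are selected); function names are hypothetical and no vendor API is involved.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four weights.

    ASSUMPTION: magnitude-based selection; other criteria are possible.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def is_2_of_4_sparse(weights):
    """Check that every aligned group of four has at least two zeros."""
    return len(weights) % 4 == 0 and all(
        sum(1 for w in weights[s:s + 4] if w == 0) >= 2
        for s in range(0, len(weights), 4)
    )


row = [0.1, -0.9, 0.4, 0.05, 2.0, 0.0, -0.3, 1.5]
sparse_row = prune_2_of_4(row)
print(sparse_row)
print(is_2_of_4_sparse(row), is_2_of_4_sparse(sparse_row))
```

Half of the weights are discarded regardless of how important they are within their group, which is the source of the quality loss mentioned above.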

The honest assessment: sparsity is a marketing number more than a workhorse feature in 2026.

Which accelerator for which job?

Workload	Best fit	Why
Pretraining a 70B+ dense model	H100 / H200 / B200 cluster	Software maturity + FP8 + NVLink+NVSwitch
Pretraining if you have a TPU contract	TPU v5p Pod	3D-torus AllReduce + compiler-driven sharding
Cost-sensitive training	Trainium 2 (AWS) or Gaudi 3	~30-40% lower $/FLOP, accept stack pain
Serving large MoE / long-context	MI300X	192 GB HBM fits more in one chip; less sharding
Ultra-low-latency inference	Groq LPU or Cerebras WSE	SRAM-resident weights, no HBM bottleneck (see §05)
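The "fits more in one chip" argument is just a division. As a rough sketch, the snippet below estimates the minimum number of chips needed to hold the weights alone, using the memory capacities from the comparison table. The 70B-parameter, 2-bytes-per-parameter model is a hypothetical example, and the estimate deliberately ignores KV cache, activations, and framework overhead.

```python
import math

# Memory capacity per chip in GB, taken from the side-by-side table.
HBM_GB = {
    "H100": 80,
    "H200": 141,
    "B200": 192,
    "TPU v5p": 95,
    "Trainium 2": 96,
    "Gaudi 3": 128,
    "MI300X": 192,
}


def min_chips_for_weights(params_billion: float, bytes_per_param: float,
                          hbm_gb: float) -> int:
    """Lower bound on chip count: weights only, no KV cache or activations."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params x bytes = GB
    return math.ceil(weight_gb / hbm_gb)


# Hypothetical 70B-parameter model stored in BF16 (2 bytes per parameter).
chips_needed = {
    chip: min_chips_for_weights(70, 2, capacity)
    for chip, capacity in HBM_GB.items()
}
for chip, count in chips_needed.items():
    print(f"{chip:12s} {count} chip(s) for 140 GB of weights")
```

Real deployments need headroom beyond this lower bound, so the practical gap between 80 GB and 192 GB parts tends to be larger than the raw chip counts suggest.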

Cross-links

02-h100-and-h200.md: the H100 in much more depth.
05-the-accelerator-landscape-2026.md: software stack maturity per accelerator.
Phase 01 — Hardware Substrate: CPU side, where this contrast begins.

References

NVIDIA H100 Tensor Core GPU Architecture Whitepaper, 2022 (rev. 2024).
NVIDIA H200 Tensor Core GPU Datasheet, 2024.
NVIDIA Blackwell Architecture Technical Brief, 2024.
Google Cloud TPU v5p announcement, December 2023.
AWS Trainium 2 architecture documentation, 2024.
Intel Gaudi 3 AI Accelerator Whitepaper, 2024.
AMD Instinct MI300X Datasheet, 2023.
Jouppi et al. 2017, In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA.
Jouppi et al. 2023, TPU v4: An Optically Reconfigurable Supercomputer, ISCA.