01 — CPU vs GPU vs TPU vs Trainium vs Gaudi

Four hardware philosophies for the same matrix multiplication: control flow (CPU), SIMT (GPU), systolic array (TPU/Trainium), and VLIW hybrid (Gaudi). Each one optimizes a different metric.

The fundamental trade

Every accelerator answers one question: given fixed silicon area, how much do I spend on control, how much on arithmetic, how much on memory?

  • CPU: most area is control (branch prediction, out-of-order, caches). Few, fast scalar ALUs. Optimizes latency on arbitrary control flow.
  • GPU: most area is arithmetic (thousands of small ALUs). Light per-thread control. Optimizes throughput on regular data-parallel work.
  • TPU / Trainium: most area is one giant systolic array (a 128×128 or 256×256 grid of multiply-accumulate cells). Almost no per-cell control. Optimizes dense matmul throughput at the cost of programmability.
  • Gaudi: hybrid — VLIW vector processors (TPCs) + a Matrix Math Engine. Optimizes price-per-FLOP for transformer workloads.

The whole module is variations on this single trade.

Side-by-side numbers (2024-2026 flagships)

| Chip | Peak compute (dense) | HBM capacity | HBM bandwidth | Inter-chip / scale-up | Process | TDP |
|---|---|---|---|---|---|---|
| Intel i5-8250U (CPU, 2018) | ~0.2 TF FP32 | 62 GiB DDR4 (system RAM) | ~19 GB/s | n/a | 14 nm | 15 W |
| NVIDIA A100 80 GB (GPU, 2020) | 312 TF FP16/BF16; 156 TF TF32 | 80 GB HBM2e | 2.0 TB/s | NVLink 3: 600 GB/s | 7 nm | 400 W |
| NVIDIA H100 SXM5 (GPU, 2022) | 989 TF FP16/BF16; 1979 TF FP8 | 80 GB HBM3 | 3.35 TB/s | NVLink 4: 900 GB/s | 4 nm | 700 W |
| NVIDIA H200 SXM (GPU, 2024) | 989 TF FP16/BF16; 1979 TF FP8 | 141 GB HBM3e | 4.8 TB/s | NVLink 4: 900 GB/s | 4 nm | 700 W |
| NVIDIA B200 (GPU, 2024) | ~2.25 PF FP16; ~4.5 PF FP8; ~9 PF FP4 | 192 GB HBM3e | 8 TB/s | NVLink 5: 1.8 TB/s | 4NP | 1000 W |
| Google TPU v5p (2024) | 459 TF BF16; 918 TF INT8 | 95 GB HBM | 2.76 TB/s | ICI 3D-torus: 4.8 Tb/s/chip | not stated | ~450 W [not publicly confirmed] |
| AWS Trainium 2 (2024) | ~650 TF FP16 (per chip) | 96 GB HBM | 2.9 TB/s | NeuronLink-v3 | not stated | ~500 W [not publicly confirmed] |
| Intel Gaudi 3 (2024) | 1835 TF FP8; 1835 TF BF16 | 128 GB HBM2e | 3.7 TB/s | 24× 200 GbE on-chip RDMA | 5 nm | 900 W |
| AMD MI300X (2023) | 1307 TF FP16; 2614 TF FP8 | 192 GB HBM3 | 5.3 TB/s | Infinity Fabric: 896 GB/s | 5/6 nm | 750 W |

[source: NVIDIA H100 Tensor Core GPU Architecture Whitepaper 2022; NVIDIA H200 Datasheet 2024; NVIDIA Blackwell Architecture Whitepaper 2024; Google TPU v5p announcement 2023; AWS Trainium 2 documentation 2024; Intel Gaudi 3 Whitepaper 2024; AMD MI300X Datasheet 2023]
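As a rough way to compare the rows above, the sketch below divides each accelerator's 16-bit dense peak by its HBM bandwidth to get a break-even arithmetic intensity in FLOP/byte. The dictionary simply restates the table's figures; the names `CHIPS` and `ridge_point` are illustrative, and the 16-bit formats differ slightly between vendors (FP16 vs BF16), so treat the result as an order-of-magnitude comparison rather than a benchmark.

```python
# (16-bit dense peak in TFLOP/s, HBM bandwidth in TB/s), copied from the table.
CHIPS = {
    "A100 80 GB": (312.0, 2.0),
    "H100 SXM5": (989.0, 3.35),
    "H200 SXM": (989.0, 4.8),
    "B200": (2250.0, 8.0),
    "TPU v5p": (459.0, 2.76),
    "Trainium 2": (650.0, 2.9),
    "Gaudi 3": (1835.0, 3.7),
    "MI300X": (1307.0, 5.3),
}


def ridge_point(peak_tflops: float, bandwidth_tbs: float) -> float:
    """FLOPs a kernel must perform per HBM byte to become compute-bound."""
    return peak_tflops / bandwidth_tbs


ridge = {name: ridge_point(peak, bw) for name, (peak, bw) in CHIPS.items()}

for name, value in sorted(ridge.items(), key=lambda item: item[1]):
    print(f"{name:<12} {value:6.0f} FLOP/byte")
```

A higher value means the chip needs more data reuse per byte before its arithmetic units, rather than its memory, become the limit.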

CPU — Intel i5-8250U as the baseline

This is Borja's laptop. We use it as the ground-truth single-thread baseline (see Phase 01). Key properties:

  • 4 cores / 8 threads, 1.6-3.4 GHz.
  • AVX2 (256-bit SIMD): 8× FP32 lanes per core → 4 cores × 3.4 GHz × 8 lanes × 2 (FMA) ≈ 218 GFLOP/s theoretical peak FP32.
  • L1d 32 KB / core, L2 256 KB / core, L3 6 MB shared.
  • DDR4-2400 dual-channel: ~38.4 GB/s theoretical, ~19 GB/s achievable. Arithmetic intensity break-even ≈ 218 / 19 ≈ 11 FLOP/byte.

What this teaches: most "real" code is below that break-even. CPUs spend most of their time waiting on memory.
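To make the memory-wait point concrete, here is a small sketch that derives the theoretical DDR4 bandwidth and estimates how long a streaming kernel takes when memory is the only limit. SAXPY (`y = a*x + y` on FP32) is used as an assumed example of "real" code; the function names and the 100-million-element size are illustrative and not part of the Phase 01 baseline.

```python
def ddr_peak_gbs(mega_transfers_per_s: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical DRAM bandwidth in GB/s: transfers/s x channels x bytes per transfer."""
    return mega_transfers_per_s * channels * bus_bytes / 1000.0


def stream_time_ms(n_elements: int, bytes_per_element: int, bandwidth_gbs: float) -> float:
    """Lower bound on runtime when the kernel is limited purely by memory traffic."""
    return n_elements * bytes_per_element / (bandwidth_gbs * 1e9) * 1e3


theoretical_gbs = ddr_peak_gbs(2400, channels=2)  # DDR4-2400, dual channel

# SAXPY per element: read x (4 B), read y (4 B), write y (4 B); 1 multiply + 1 add.
saxpy_bytes, saxpy_flops = 12, 2
saxpy_intensity = saxpy_flops / saxpy_bytes  # FLOP per byte

saxpy_ms = stream_time_ms(100_000_000, saxpy_bytes, bandwidth_gbs=19.0)

print(f"theoretical DDR4 bandwidth: {theoretical_gbs:.1f} GB/s")
print(f"SAXPY intensity: {saxpy_intensity:.3f} FLOP/byte")
print(f"SAXPY on 1e8 elements at 19 GB/s: at least {saxpy_ms:.0f} ms")
```

At well under 1 FLOP/byte, SAXPY sits far below the break-even quoted above, so the ALUs would be idle for most of that runtime.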

GPU — NVIDIA H100 as the canonical training chip

The H100 is what most frontier labs trained on in 2023-2024.

  • 132 SMs × 128 FP32 cores = 16896 FP32 lanes, but the Tensor Cores are what dominate the FLOPS budget.
  • 4th-generation Tensor Cores: support TF32, BF16, FP16, FP8 (two variants: E4M3 and E5M2), INT8.
  • Peak FP16/BF16: 989 TF/s dense, 1979 TF/s with 2:4 structured sparsity.
  • Peak FP8: 1979 TF/s dense, 3958 TF/s with sparsity. FP8 is the recipe for H100-era training. [source: NVIDIA H100 datasheet 2024]
  • HBM3 80 GB at 3.35 TB/s. Arithmetic intensity break-even (FP16) = 989e12 / 3.35e12 ≈ 295 FLOP/byte. This number is the single most important thing on the chip — see §03 of this module.
  • NVLink 4: 18 links × 50 GB/s = 900 GB/s per GPU (chip-to-chip, not on-chip); with NVSwitch, each of the 8 GPUs in a DGX H100 can reach any peer at up to 900 GB/s.

What this teaches: at H100 scale, a kernel that does < ~300 FLOPs per byte loaded from HBM is bandwidth-bound. Most attention kernels (without FlashAttention) are.
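The bandwidth-bound claim can be checked with a simple roofline estimate: attainable throughput is the smaller of the compute peak and arithmetic intensity times memory bandwidth. The sketch below uses the H100 figures quoted above and an idealized matmul traffic model (each operand read once, the output written once, 2 bytes per element); both helper names are hypothetical, and real kernels will move more data than this model assumes.

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOP per byte for C[m,n] = A[m,k] @ B[k,n] under ideal, single-pass traffic."""
    flops = 2 * m * n * k                              # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_tflops(intensity: float, peak_tflops: float = 989.0,
                      bandwidth_tbs: float = 3.35) -> float:
    """Roofline model: min(compute roof, memory roof)."""
    return min(peak_tflops, intensity * bandwidth_tbs)


big = matmul_intensity(4096, 4096, 4096)   # large square GEMM
gemv = matmul_intensity(1, 4096, 4096)     # one row against a weight matrix

print(f"square GEMM: {big:7.1f} FLOP/byte -> {attainable_tflops(big):6.1f} TFLOP/s")
print(f"GEMV:        {gemv:7.2f} FLOP/byte -> {attainable_tflops(gemv):6.2f} TFLOP/s")
```

Under these assumptions the large GEMM clears the ~295 FLOP/byte break-even and hits the compute roof, while the matrix-vector case sits near 1 FLOP/byte and is capped at a few TFLOP/s by HBM bandwidth.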

TPU — the systolic-array philosophy

Google TPUs are radically different. The core is a systolic array: a 2D grid of MAC units that pumps data through edges in lockstep. For matmul, this is near-optimal: each datum loaded from HBM gets reused ~\(N\) times in an \(N \times N\) array.

  • TPU v5p: 459 TF BF16 / 918 TF INT8 per chip; 95 GB HBM @ 2.76 TB/s.
  • Inter-Chip Interconnect (ICI): a 3D torus topology with ~4.8 Tb/s of aggregate bandwidth per chip, designed for AllReduce on a Pod.
  • Programming model: XLA via JAX or TF. No CUDA. The compiler does all the scheduling — there is no "warp" abstraction to expose.
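The reuse argument for systolic arrays can be illustrated with a functional (not cycle-accurate) NumPy model. It assumes a weight-stationary layout, where cell (i, j) holds one weight and each activation value travels across a full row of cells; the source does not say which dataflow a given TPU generation uses, so this is only a sketch of the general idea, and `systolic_matmul` is a hypothetical name.

```python
import numpy as np


def systolic_matmul(a: np.ndarray, w: np.ndarray):
    """Multiply a (m x n) by w (n x n) while counting operand fetches and MACs."""
    m, n = a.shape
    assert w.shape == (n, n)
    out = np.zeros((m, n))
    weight_loads = w.size        # weights are fetched once and stay in their cells
    activation_loads = 0
    macs = 0
    for r in range(m):           # one activation row enters the array per step
        activation_loads += n
        for i in range(n):       # a[r, i] is fetched once ...
            for j in range(n):   # ... and reused by all n cells in array row i
                out[r, j] += a[r, i] * w[i, j]
                macs += 1
    return out, weight_loads, activation_loads, macs


rng = np.random.default_rng(0)
a = rng.standard_normal((64, 8))
w = rng.standard_normal((8, 8))
out, weight_loads, activation_loads, macs = systolic_matmul(a, w)

print("matches a @ w:", np.allclose(out, a @ w))
print("MACs per activation fetched:", macs // activation_loads)
```

Each fetched activation feeds exactly N multiply-accumulates, which is the ~N-fold reuse described above; a larger array raises that ratio without adding any per-cell control.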

Strength: dense matmul throughput per dollar is excellent. The compiler partitions work automatically from sharding annotations under jax.jit, with jax.pmap / shard_map available when you want explicit control. The 3D torus is well suited to AllReduce.

Weakness: anything off the matmul golden path (sparse ops, dynamic shapes, custom kernels) is awkward. Ecosystem is JAX-first; PyTorch on TPU exists (torch_xla) but lags. No Triton.

Trainium — AWS's systolic-ish accelerator

AWS Trainium (Trn1, 2022) and Trainium 2 (Trn2, 2024) are systolic-array accelerators that compete with both GPUs and TPUs.

  • Trainium 2: ~650 TF FP16 per chip, 96 GB HBM @ 2.9 TB/s, NeuronLink for chip-to-chip scale-up.
  • A trn2.48xlarge instance packs 16 chips; a Trn2 UltraServer links 4 such instances (64 chips) over NeuronLink, and AWS sells UltraClusters of thousands of Trainium chips.
  • Programming via the Neuron SDK: XLA backend, PyTorch/JAX frontends.

Strength: AWS-priced; reportedly ~40% cheaper per training FLOP than H100 on the same workload [source: AWS re:Invent 2024 keynote, vendor claim, not independently MLPerf-confirmed at time of writing].

Weakness: software stack maturity is well behind CUDA. Custom kernels are painful. Limited to AWS.

Gaudi — Intel's transformer-targeted chip

Intel Gaudi 3 (2024) is the result of the Habana Labs acquisition.

  • 8 Matrix Math Engines + 64 Tensor Processor Cores (VLIW vector units).
  • 1835 TF FP8 / BF16, 128 GB HBM2e @ 3.7 TB/s.
  • On-chip 24× 200 GbE RDMA: scale-out without a separate NIC. Unique among the listed accelerators.

Strength: the integrated RoCE networking is genuinely novel — every chip is its own NIC. Good for cost-sensitive transformer training where InfiniBand fabric is a major CapEx.

Weakness: SynapseAI stack lags CUDA. PyTorch integration via habana_frameworks. Smaller community.

Sparsity support

  • NVIDIA Tensor Cores: 2:4 structured sparsity (2 of every 4 weights are zero, declared in the weight layout). 2× speedup on Tensor Core ops if the model is pruned accordingly. Rarely used in practice because dense FP8 is fast enough and pruning hurts quality.
  • TPU: no hardware sparsity support as of v5p.
  • Trainium 2: 4:8 sparsity supported [source: AWS Trainium 2 docs 2024].
  • Gaudi 3: structured sparsity not advertised as a key feature.
  • MI300X: 2:4 sparsity supported [source: AMD CDNA 3 whitepaper 2023].
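For readers unfamiliar with the 2:4 pattern, the sketch below prunes a weight matrix so that every group of four consecutive values keeps exactly two. Choosing the two largest magnitudes is an assumption made here for illustration (a common heuristic, not something the vendor documents cited above prescribe), and `prune_2_of_4` is a hypothetical helper rather than any library's API.

```python
import numpy as np


def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in every group of 4 along the last axis."""
    w = np.asarray(weights, dtype=np.float32)
    assert w.shape[-1] % 4 == 0, "last dimension must be a multiple of 4"
    groups = w.reshape(-1, 4).copy()
    # Column indices of the two smallest |values| in each group of four.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)


dense = np.array([[0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.2]], dtype=np.float32)
sparse = prune_2_of_4(dense)

print(sparse)
print("non-zeros per group:", np.count_nonzero(sparse.reshape(-1, 4), axis=1))
```

The hardware speedup only applies once weights are stored in this layout, which is why the quality cost of pruning matters more than the peak number.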

The honest assessment: sparsity is a marketing number more than a workhorse feature in 2026.

Which accelerator for which job?

| Workload | Best fit | Why |
|---|---|---|
| Pretraining a 70B+ dense model | H100 / H200 / B200 cluster | Software maturity + FP8 + NVLink+NVSwitch |
| Pretraining if you have a TPU contract | TPU v5p Pod | 3D-torus AllReduce + compiler-driven sharding |
| Cost-sensitive training | Trainium 2 (AWS) or Gaudi 3 | ~30-40% lower $/FLOP, accept stack pain |
| Serving large MoE / long-context | MI300X | 192 GB HBM fits more in one chip; less sharding |
| Ultra-low-latency inference | Groq LPU or Cerebras WSE | SRAM-resident weights, no HBM bottleneck (see §05) |
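The "fits more in one chip" argument comes down to simple arithmetic, sketched below for weights only. The 70B parameter count echoes the table's first row and the HBM capacities come from the earlier comparison table; the helper name is hypothetical, and the estimate deliberately ignores KV cache, activations, and optimizer state, all of which push the real chip count higher.

```python
import math

# HBM capacity in GB, taken from the side-by-side table.
HBM_GB = {"H100 SXM5": 80, "H200 SXM": 141, "B200": 192, "MI300X": 192}


def chips_for_weights(params_billion: float, bytes_per_param: int, hbm_gb: float) -> int:
    """Minimum number of chips whose combined HBM holds the weights alone."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params x bytes = GB
    return math.ceil(weight_gb / hbm_gb)


# A 70B-parameter model stored in BF16 (2 bytes per parameter) is 140 GB of weights.
needed = {chip: chips_for_weights(70, 2, gb) for chip, gb in HBM_GB.items()}

for chip, count in needed.items():
    print(f"{chip:<10} needs at least {count} chip(s) for weights only")
```

Fewer chips per model replica means fewer tensor-parallel shards and less interconnect traffic per token, which is the case for large-HBM parts in serving.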

References

  • NVIDIA H100 Tensor Core GPU Architecture Whitepaper, 2022 (rev. 2024).
  • NVIDIA H200 Tensor Core GPU Datasheet, 2024.
  • NVIDIA Blackwell Architecture Technical Brief, 2024.
  • Google Cloud TPU v5p announcement, December 2023.
  • AWS Trainium 2 architecture documentation, 2024.
  • Intel Gaudi 3 AI Accelerator Whitepaper, 2024.
  • AMD Instinct MI300X Datasheet, 2023.
  • Jouppi et al. 2017, In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA.
  • Jouppi et al. 2023, TPU v4: An Optically Reconfigurable Supercomputer, ISCA.