Extension Module X4 — Hardware Deep-Dive: Accelerators, Interconnects, Datacenters

Requires: 23 — GPU Architecture Fundamentals · 35 — Distributed Training & Inference

Teaches: h100 · nvlink · allreduce · interconnects · datacenter-economics

Jump to any chapter from the phase reference index.

Chapter map

Extension module on modern AI hardware: accelerator architectures (GPU/TPU/Trainium/Gaudi), interconnects (NVLink/InfiniBand), and datacenter economics. It closes the "hardware bringup / accelerator landscape" gap in HIRING_PATH.md.

Status

Track: Extension (parallel to core 40-phase curriculum)
Authorization: Addendum A15 (extension tracks authorized)
Prerequisites: Phase 01 (hardware substrate / roofline), Phase 23 (GPU fundamentals)
Scope guard: Theory-heavy. Labs require cloud rentals and are documented as dollar-budgeted exercises.
Hardware bar: Lab 00 runs on i5-8250U + 1× A100 (1 h) + 1× H100 (1 h). Lab 01 requires 2× 8-GPU nodes (1 h).

Why this module exists

An ML engineer interviewing at Anthropic, NVIDIA, Google, or AWS will be asked: "How would you scale your model from 8 GPUs to 1024?" — and the real answer is not in the modeling code. It is in the interconnect topology, the collective primitives, the memory hierarchy of the accelerator, and the power budget of the cluster. The core curriculum touches GPU programming in Phases 23-24 and distributed training in Phase 35, but it does not give a fluent map of the accelerator landscape. This module fills that gap.

This module is what lets you sit in a meeting with an infra team and not need a translator.

Module map

| File | Topic |
| --- | --- |
| `theory/00-motivation.md` | Why hardware fluency is interview-load-bearing even for ML engineers |
| `theory/01-cpu-vs-gpu-vs-tpu-vs-trn1.md` | Architecture comparison: control-flow CPU, SIMT GPU, systolic TPU, Trainium, Gaudi |
| `theory/02-h100-and-h200.md` | H100 / H200 / Blackwell deep-dive: Tensor Cores, FP8, NVLink, NVSwitch, MIG |
| `theory/03-interconnects-and-topology.md` | NVLink, PCIe, InfiniBand, RoCE; fat-tree vs torus; collective primitives |
| `theory/04-datacenter-economics.md` | Power, PUE, $/MWh, CapEx vs OpEx; why frontier training cost is 60% energy |
| `theory/05-the-accelerator-landscape-2026.md` | NVIDIA Blackwell, AMD MI300X, Intel Gaudi 3, TPU v5p, Trainium 2, Cerebras WSE-3, Groq LPU |
| `lab/00-roofline-on-three-accelerators.md` | Same matmul: i5-8250U vs A100 vs H100; gap-explanation per accelerator |
| `lab/01-collective-comm-microbenchmark.md` | `nccl-tests` on 2 nodes × 8 GPUs; AllReduce 1 MB / 100 MB / 1 GB; theoretical vs measured |
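
As a preview of the Lab 00 row, here is a minimal roofline sketch for a square matmul on the three devices. It is not the lab's actual code. The peak FLOP/s and bandwidth figures are rounded placeholders chosen for illustration and should be replaced with the values from the exact SKUs the lab documents. The names `Accelerator`, `matmul_intensity`, `ridge_point`, and `attainable_flops` are hypothetical helpers, and the intensity formula assumes each matrix crosses the memory bus exactly once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    name: str
    peak_flops: float      # peak compute in FLOP/s (placeholder value)
    mem_bw: float          # peak memory bandwidth in bytes/s (placeholder value)
    bytes_per_elem: int    # operand width assumed for the matmul on this device


def matmul_intensity(n: int, bytes_per_elem: int) -> float:
    """FLOP per byte for C = A @ B with square n x n matrices.

    Idealized: A and B are each read once and C is written once.
    """
    flops = 2 * n**3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved


def ridge_point(acc: Accelerator) -> float:
    """Arithmetic intensity at which the device stops being memory-bound."""
    return acc.peak_flops / acc.mem_bw


def attainable_flops(acc: Accelerator, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth slope."""
    return min(acc.peak_flops, acc.mem_bw * intensity)


# ASSUMPTION: rounded illustrative figures, not measured or cited values.
DEVICES = [
    Accelerator("i5-8250U (FP32)", 0.2e12, 34e9, 4),
    Accelerator("A100 (FP16 Tensor Core)", 312e12, 2.0e12, 2),
    Accelerator("H100 (FP16 Tensor Core)", 989e12, 3.35e12, 2),
]

if __name__ == "__main__":
    for acc in DEVICES:
        for n in (128, 1024, 8192):
            ai = matmul_intensity(n, acc.bytes_per_elem)
            bound = "memory" if ai < ridge_point(acc) else "compute"
            tflops = attainable_flops(acc, ai) / 1e12
            print(f"{acc.name:26s} n={n:5d} AI={ai:8.1f} FLOP/B "
                  f"-> {tflops:8.2f} TFLOP/s ({bound}-bound)")
```

The gap between this bound and what the matmul actually achieves is what the lab asks you to explain for each accelerator.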

Cross-links to core curriculum

Phase 01 — Hardware Substrate: the CPU side of the same story. This extension is the natural sequel.
Phase 23 — GPU Fundamentals: SIMT execution model, warps, occupancy. X4 extends to datacenter-scale GPU.
Phase 24 — CUDA & Triton: single-GPU kernel programming. X4 gives the why at fleet scale.
Phase 35 — Distributed Training: collectives are used there; X4 is where you understand them.

Key references

NVIDIA H100 Tensor Core GPU Architecture Whitepaper (2022, rev. 2024).
NVIDIA H200 Datasheet (2024).
NVIDIA Blackwell Architecture Whitepaper (2024).
Jouppi et al. 2017, In-Datacenter Performance Analysis of a Tensor Processing Unit — the original TPU paper.
Jouppi et al. 2023, TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning.
AWS Trainium Architecture Guide (2023, Trn1) and Trainium 2 (2024).
Intel Gaudi 3 Whitepaper (2024).
Cerebras WSE-3 Whitepaper (2024).
Patterson et al. 2021, Carbon Emissions and Large Neural Network Training.
MLPerf Training v4.0 and Inference v4.1 results (2024).

Definition of Done

All six theory files reviewed by math-reviewer and phase-gatekeeper.
Every numerical claim has a cited source.
Lab 00 has a reproducible roofline triple (i5-8250U, A100, H100) with documented runpod.io SKUs and cost.
Lab 01 has an `nccl-tests` setup and an expected-vs-measured AllReduce table for 2× 8-GPU nodes.
`mkdocs build --strict` passes with X4 in the nav.
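
For the expected column of the Lab 01 AllReduce table, one possible starting point is the textbook ring AllReduce cost model sketched below. It assumes a single flat ring over all ranks with one uniform per-link bandwidth. That is a simplification: NCCL may pick other algorithms, and intra-node NVLink is much faster than the inter-node fabric. The link bandwidth and latency values are hypothetical placeholders, and the function names are illustrative. The bus-bandwidth conversion uses the `2(n-1)/n` factor that `nccl-tests` documents for AllReduce.

```python
def ring_allreduce_seconds(size_bytes: float, n_ranks: int,
                           link_bw: float, latency_s: float = 0.0) -> float:
    """Expected time for a ring AllReduce.

    The ring runs 2*(n-1) steps (reduce-scatter, then all-gather), and each
    step sends one chunk of size_bytes / n over a link of link_bw bytes/s.
    """
    if n_ranks < 2:
        return 0.0
    steps = 2 * (n_ranks - 1)
    chunk = size_bytes / n_ranks
    return steps * (latency_s + chunk / link_bw)


def bus_bandwidth(size_bytes: float, n_ranks: int, seconds: float) -> float:
    """Convert a timing into bus bandwidth (bytes/s) for AllReduce."""
    algbw = size_bytes / seconds              # algorithm bandwidth
    return algbw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    N_RANKS = 16          # 2 nodes x 8 GPUs, as in Lab 01
    LINK_BW = 50e9        # ASSUMPTION: 50 GB/s per link, placeholder only
    LATENCY = 5e-6        # ASSUMPTION: 5 microseconds per step, placeholder only
    for label, size in (("1 MB", 1e6), ("100 MB", 100e6), ("1 GB", 1e9)):
        t = ring_allreduce_seconds(size, N_RANKS, LINK_BW, LATENCY)
        bw = bus_bandwidth(size, N_RANKS, t) / 1e9
        print(f"{label:>6s}: expected {t * 1e3:8.3f} ms, busbw {bw:6.2f} GB/s")
```

With zero latency the modeled bus bandwidth equals the link bandwidth, which is why bus bandwidth is a convenient number to compare against hardware peak. Small messages fall well below it because the per-step latency term dominates.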

Cost budget for labs

| Lab | Cloud need | Approx. cost |
| --- | --- | --- |
| Lab 00 | 1 h A100 (40 GB or 80 GB) | ~$1.50 |
| Lab 00 | 1 h H100 (80 GB) | ~$3.00 |
| Lab 01 | 1 h on 2 nodes × 8× H100 (or A100) | ~$15.00 |
| Total | — | ~$20 |

All prices are 2025-2026 spot/on-demand market rates from runpod.io / lambda.ai / vast.ai. Lab specs document exact SKUs.
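
The budget arithmetic can be checked with a short sketch like the one below. It uses only the figures from the table above. The `LabLine` type and helper names are hypothetical, and the GPU count for Lab 01 assumes the full 2 × 8 configuration.

```python
from typing import NamedTuple


class LabLine(NamedTuple):
    lab: str
    hardware: str
    gpus: int
    hours: float
    cost_usd: float


# Figures copied from the cost table; Lab 01 assumes 2 nodes x 8 GPUs = 16 GPUs.
BUDGET = [
    LabLine("Lab 00", "A100", 1, 1.0, 1.50),
    LabLine("Lab 00", "H100", 1, 1.0, 3.00),
    LabLine("Lab 01", "H100 or A100", 16, 1.0, 15.00),
]


def total_cost(lines: list[LabLine]) -> float:
    """Sum of all budget lines in USD."""
    return sum(line.cost_usd for line in lines)


def implied_rate(line: LabLine) -> float:
    """USD per GPU-hour implied by a budget line."""
    return line.cost_usd / (line.gpus * line.hours)


if __name__ == "__main__":
    for line in BUDGET:
        print(f"{line.lab} {line.hardware:13s} "
              f"${implied_rate(line):.2f} per GPU-hour")
    print(f"Total: ${total_cost(BUDGET):.2f}")
```

The three lines sum to $19.50, consistent with the ~$20 total. The Lab 01 line implies roughly $0.94 per GPU-hour, which is below the single-GPU rates in the same table, so it is worth confirming against current multi-node spot pricing before booking.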