

Extension Module X4 — Hardware Deep-Dive: Accelerators, Interconnects, Datacenters

Requires: 23 (GPU Architecture Fundamentals) · 35 (Distributed Training & Inference)

Teaches: h100 · nvlink · allreduce · interconnects · datacenter-economics

Jump to any chapter from the phase reference index.

Chapter map

Extension module on modern AI hardware: accelerator architectures (GPU/TPU/Trainium/Gaudi), interconnects (NVLink/InfiniBand), and datacenter economics. It closes the "hardware bringup / accelerator landscape" gap from HIRING_PATH.md.

Status

  • Track: Extension (parallel to core 40-phase curriculum)
  • Authorization: Addendum A15 (extension tracks authorized)
  • Prerequisites: Phase 01 (hardware substrate / roofline), Phase 23 (GPU fundamentals)
  • Scope guard: Theory-heavy. Labs require cloud rentals; documented as $-budgeted exercises.
  • Hardware bar: Lab 00 runs on i5-8250U + 1× A100 (1 h) + 1× H100 (1 h). Lab 01 requires 2× 8-GPU nodes (1 h).

Why this module exists

An ML engineer interviewing at Anthropic, NVIDIA, Google, or AWS will be asked: "How would you scale your model from 8 GPUs to 1024?" — and the real answer is not in the modeling code. It is in the interconnect topology, the collective primitives, the memory hierarchy of the accelerator, and the power budget of the cluster. The core curriculum touches GPU programming in Phases 23-24 and distributed training in Phase 35, but it does not give a fluent map of the accelerator landscape. This module fills that gap.

This module is what lets you sit in a meeting with an infra team and not need a translator.

Module map

| File | Topic |
|---|---|
| theory/00-motivation.md | Why hardware fluency is interview-load-bearing even for ML engineers |
| theory/01-cpu-vs-gpu-vs-tpu-vs-trn1.md | Architecture comparison: control-flow CPU, SIMT GPU, systolic TPU, Trainium, Gaudi |
| theory/02-h100-and-h200.md | H100 / H200 / Blackwell deep-dive: Tensor Cores, FP8, NVLink, NVSwitch, MIG |
| theory/03-interconnects-and-topology.md | NVLink, PCIe, InfiniBand, RoCE; fat-tree vs torus; collective primitives |
| theory/04-datacenter-economics.md | Power, PUE, $/MWh, CapEx vs OpEx; why frontier training cost is 60% energy |
| theory/05-the-accelerator-landscape-2026.md | NVIDIA Blackwell, AMD MI300X, Intel Gaudi 3, TPU v5p, Trainium 2, Cerebras WSE-3, Groq LPU |
| lab/00-roofline-on-three-accelerators.md | Same matmul: i5-8250U vs A100 vs H100; gap-explanation per accelerator |
| lab/01-collective-comm-microbenchmark.md | nccl-tests on 2 nodes × 8 GPUs; AllReduce 1 MB / 100 MB / 1 GB; theoretical vs measured |
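As a preview of what Lab 00 measures, the roofline model bounds attainable throughput by min(peak compute, memory bandwidth × arithmetic intensity). The sketch below is a minimal illustration, not the lab's actual harness. The `Device` class, the helper names, and every peak figure in `DEVICES` are assumptions: the numbers are rough, rounded values that should be replaced with the ones from the datasheets cited in the key references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    peak_flops: float  # peak compute, FLOP/s
    mem_bw: float      # peak memory bandwidth, bytes/s
    elem_bytes: int    # bytes per matrix element at the precision used


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int) -> float:
    """FLOP per byte for C = A @ B, assuming A, B and C each cross DRAM once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def ridge_point(dev: Device) -> float:
    """Arithmetic intensity at which the device stops being memory-bound."""
    return dev.peak_flops / dev.mem_bw


def attainable_flops(dev: Device, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(dev.peak_flops, dev.mem_bw * intensity)


# ASSUMED, approximate peaks for illustration only; check the datasheets.
DEVICES = [
    Device("i5-8250U (FP32, AVX2)", 0.4e12, 37.5e9, 4),
    Device("A100 80 GB (BF16 dense)", 312e12, 2.0e12, 2),
    Device("H100 SXM (BF16 dense)", 989e12, 3.35e12, 2),
]

if __name__ == "__main__":
    for size in (256, 4096):
        for dev in DEVICES:
            ai = matmul_intensity(size, size, size, dev.elem_bytes)
            bound = attainable_flops(dev, ai)
            regime = "compute" if ai >= ridge_point(dev) else "memory"
            print(f"{dev.name:26s} n={size:5d} AI={ai:8.1f} FLOP/B "
                  f"bound={bound / 1e12:8.2f} TFLOP/s ({regime}-bound)")
```

The gap between this bound and the measured matmul throughput is the "gap-explanation" the lab asks for: the model ignores caches, kernel launch overhead, and clock throttling, so measured numbers should sit below it.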

Key references

  • NVIDIA H100 Tensor Core GPU Architecture Whitepaper (2022, rev. 2024).
  • NVIDIA H200 Datasheet (2024).
  • NVIDIA Blackwell Architecture Whitepaper (2024).
  • Jouppi et al. 2017, In-Datacenter Performance Analysis of a Tensor Processing Unit — the original TPU paper.
  • Jouppi et al. 2023, TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning.
  • AWS Trainium Architecture Guide (2023, Trn1) and Trainium 2 (2024).
  • Intel Gaudi 3 Whitepaper (2024).
  • Cerebras WSE-3 Whitepaper (2024).
  • Patterson et al. 2021, Carbon Emissions and Large Neural Network Training.
  • MLPerf Training v4.0 and Inference v4.1 results (2024).

Definition of Done

  • All six theory files reviewed by math-reviewer and phase-gatekeeper.
  • Every numerical claim has a cited source.
  • Lab 00 has a reproducible roofline triple (i5-8250U, A100, H100) with documented runpod.io SKUs and cost.
  • Lab 01 has nccl-tests setup + expected vs measured AllReduce table for 2× 8-GPU nodes.
  • mkdocs build --strict passes with X4 in the nav.
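For the "expected" column of the Lab 01 AllReduce table, one common starting point is the flat ring cost model: 2(n-1) steps, each moving a chunk of size S/n over a link. The sketch below assumes that model; the function names, the 50 GB/s link bandwidth, and the 5 µs per-step latency are hypothetical placeholders, not values from the lab spec. A real 2-node run is hierarchical (NVLink inside a node, network between nodes), so this is a simplification to compare measurements against, not a prediction.

```python
def ring_allreduce_seconds(size_bytes: float, n_ranks: int,
                           link_bw: float, latency: float = 0.0) -> float:
    """Flat ring AllReduce: reduce-scatter plus all-gather.

    Each of the 2*(n-1) steps sends one chunk of size_bytes / n_ranks.
    link_bw is in bytes/s, latency is the per-step cost in seconds.
    """
    steps = 2 * (n_ranks - 1)
    chunk = size_bytes / n_ranks
    return steps * (latency + chunk / link_bw)


def bus_bandwidth(size_bytes: float, n_ranks: int, seconds: float) -> float:
    """Bus bandwidth as nccl-tests defines it for AllReduce.

    algbw = size / time, busbw = algbw * 2*(n-1)/n, both in bytes/s.
    """
    algbw = size_bytes / seconds
    return algbw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    N_RANKS = 16      # 2 nodes x 8 GPUs, as in Lab 01
    LINK_BW = 50e9    # ASSUMED: 50 GB/s per link (hypothetical)
    LATENCY = 5e-6    # ASSUMED: 5 microseconds per step (hypothetical)

    # Decimal sizes: 1 MB, 100 MB, 1 GB.
    for size in (1e6, 1e8, 1e9):
        t = ring_allreduce_seconds(size, N_RANKS, LINK_BW, LATENCY)
        bw = bus_bandwidth(size, N_RANKS, t)
        print(f"{size / 1e6:7.0f} MB  time={t * 1e3:8.3f} ms  "
              f"busbw={bw / 1e9:6.2f} GB/s")
```

With zero latency the modelled bus bandwidth equals the link bandwidth regardless of rank count, which is why bus bandwidth is a convenient number to compare against hardware peaks. Small messages fall short of it because the latency term dominates.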

Cost budget for labs

| Lab | Cloud need | Approx. cost |
|---|---|---|
| Lab 00 | 1 h A100 (40 GB or 80 GB) | ~$1.50 |
| Lab 00 | 1 h H100 (80 GB) | ~$3.00 |
| Lab 01 | 1 h 2× node × 8× H100 (or A100) | ~$15.00 |
| Total | | ~$20 |

All prices are 2025-2026 spot/on-demand market rates from runpod.io / lambda.ai / vast.ai. Lab specs document exact SKUs.
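The budget arithmetic can be re-checked whenever rates change. The sketch below copies the three rows from the table above and derives the total and the implied price per GPU-hour; the tuple layout and the `implied_rate` helper are illustrative assumptions, not part of the lab specs.

```python
# (lab, cloud need, GPU count, hours, approx. cost in USD), from the table above.
LAB_BUDGET = [
    ("Lab 00", "A100 (40 GB or 80 GB)", 1, 1.0, 1.50),
    ("Lab 00", "H100 (80 GB)", 1, 1.0, 3.00),
    ("Lab 01", "2 nodes x 8 H100 (or A100)", 16, 1.0, 15.00),
]


def implied_rate(gpus: int, hours: float, cost: float) -> float:
    """Implied price in USD per GPU-hour for one budget row."""
    return cost / (gpus * hours)


total = sum(cost for *_, cost in LAB_BUDGET)

if __name__ == "__main__":
    for lab, need, gpus, hours, cost in LAB_BUDGET:
        rate = implied_rate(gpus, hours, cost)
        print(f"{lab}  {need:28s} ${cost:6.2f}  (${rate:.2f} per GPU-hour)")
    print(f"Total: ${total:.2f}")
```

The rows sum to $19.50, which the table rounds to ~$20. The Lab 01 row implies a lower per-GPU-hour rate than the single-GPU rows, so it is worth confirming against the exact SKUs in the lab spec before renting.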

Further reading

Optional — enrichment, not required to pass the phase.