00 — Motivation: why hardware fluency is interview-load-bearing

Even if you work in modeling, interviewers at Anthropic, NVIDIA, and Google expect you to understand the hardware. It is not an infra detail: it is what decides which models are possible.

The unspoken interview rubric

When a senior infra engineer at a frontier lab interviews an ML candidate, they are checking a specific thing: does this person know what they are running on? The candidate who can say "the bottleneck on an H100 for batch-size-1 inference is HBM bandwidth, not FLOPS, because we re-read the weights and the KV cache every token" passes the bar. The candidate who can only talk about model architectures does not.

This is not gatekeeping. It is calibration. At frontier scale:

A modeling decision (e.g. "double the context length") is a hardware decision ("we now need 2× the HBM per sequence for the KV cache, plus higher AllReduce bandwidth for the longer sequence").
A research idea (e.g. "let's add a new MoE expert routing scheme") is a systems idea ("the all-to-all traffic on expert dispatch will dominate the step time on our 1024-GPU cluster").
An eval result (e.g. "the model regressed on long-context") is often a hardware artifact ("FP8 numerics lost precision in attention after position 32k").

If you cannot reason about the hardware, you cannot reason about the consequences of your work at scale.
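The context-length example above can be made concrete with a few lines of arithmetic. The sketch below assumes a hypothetical 70B-class configuration (80 layers, 8 KV heads with grouped-query attention, head dimension 128, BF16 cache); none of these values come from this module, so substitute the numbers for the model you care about.

```python
# Rough KV-cache sizing for one sequence.
# All default values are illustrative assumptions, not figures from this module.

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache held for a single sequence of length seq_len.

    The leading factor of 2 accounts for storing both K and V.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


GIB = 1024 ** 3

for context_len in (32_768, 65_536):
    size_gib = kv_cache_bytes(context_len) / GIB
    print(f"context {context_len:>6}: {size_gib:.1f} GiB per sequence")
```

Under these assumed values the cache costs 320 KiB per token, so 10 GiB per sequence at 32k context and 20 GiB at 64k. Doubling the context doubles the HBM each sequence occupies, which directly halves how many sequences fit on one accelerator.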

What "fluency" means here

Fluency, for the purposes of this module, is the ability to:

Name the accelerators that ship in 2024-2026: H100, H200, B100/B200, MI300X, Gaudi 3, TPU v5p, Trainium 2. Know the rough peak FLOPS, HBM capacity, and HBM bandwidth of each.
Read a roofline for any of them and predict whether a given kernel will be compute-bound or memory-bound.
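Describe the interconnect hierarchy: GPU-to-GPU links within a node (NVLink), the switch fabric that joins those links across a node or rack-scale domain (NVSwitch), and the network between nodes and racks (InfiniBand / RoCE).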
Compute the cost of a 1024-GPU training run in dollars and megawatts, end-to-end.
Pick the right accelerator for a workload (training a 70B model vs serving a 7B model vs running a long-context RAG retriever).

By the end of this module, all five should be reflex.
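The roofline skill in the list above reduces to comparing a kernel's arithmetic intensity (FLOPs per byte moved) against the accelerator's ridge point (peak FLOP/s divided by memory bandwidth). The sketch below is a minimal version for a matrix multiply, assuming approximate H100 SXM spec-sheet values (989 TFLOP/s dense BF16, 3.35 TB/s HBM) and an idealized traffic model in which each operand is read or written exactly once. The function names are hypothetical.

```python
# Minimal roofline check for a GEMM of shape (m x k) @ (k x n).
# Hardware numbers are approximate spec-sheet values and are assumptions here.

PEAK_FLOPS = 989e12   # assumed: H100 SXM dense BF16 peak, FLOP/s
HBM_BW = 3.35e12      # assumed: H100 SXM HBM bandwidth, bytes/s


def ridge_point(peak_flops=PEAK_FLOPS, hbm_bw=HBM_BW):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / hbm_bw


def gemm_cost(m, n, k, bytes_per_elem=2):
    """FLOPs and idealized HBM traffic for one GEMM (each matrix touched once)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops, bytes_moved


def classify(flops, bytes_moved, peak_flops=PEAK_FLOPS, hbm_bw=HBM_BW):
    """Return (arithmetic intensity, 'compute-bound' or 'memory-bound')."""
    intensity = flops / bytes_moved
    bound = "compute-bound" if intensity >= ridge_point(peak_flops, hbm_bw) else "memory-bound"
    return intensity, bound


print(f"ridge point: {ridge_point():.0f} FLOP/byte")
for label, m in (("batch-1 decode", 1), ("large-batch prefill", 4096)):
    intensity, bound = classify(*gemm_cost(m, 8192, 8192))
    print(f"{label:>20}: {intensity:8.1f} FLOP/byte -> {bound}")
```

With these assumptions the ridge point is about 295 FLOP/byte. The batch-1 case lands near 1 FLOP/byte, far below the ridge, while the 4096-row case lands near 2048 FLOP/byte, well above it. Real kernels re-read tiles and rarely hit the idealized traffic figure, so treat the result as an upper bound on intensity.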

Why this is a separate module, not a chapter

The core curriculum has Phase 01 (CPU substrate, roofline) and Phase 23 (single-GPU SIMT model). Those phases are about building intuition from one machine. X4 is about mapping a fleet. It is a different cognitive skill: less measurement, more comparison and arithmetic at large numbers.

The other reason this is a separate module is that the field moves. The accelerator landscape that was current in 2022 (V100, A100, TPU v3) is not the landscape of 2026 (Blackwell, MI300X, TPU v5p, Trainium 2). This module is explicitly versioned to 2024-2026 and tagged so it can be refreshed independently.

Concrete deliverable: the back-of-envelope quiz

After finishing X4, you should be able to answer, in under 60 seconds and without a calculator:

"I want to train a 70B-parameter dense transformer on 1 trillion tokens. I have 1024 H100s. Roughly how long does this take, how much does it cost, and what is the dominant bottleneck?"

The answer involves: FLOPs per token (≈ 6N for forward + backward, so 420 GFLOP/token, so 4.2e23 FLOPs total) ÷ effective cluster throughput (1024 × 989 TFLOP/s peak dense BF16 × ~50% MFU ≈ 5e17 FLOP/s) ≈ 8.3e5 s, ÷ 86,400 s/day → about 10 days of wall-clock. Power: 1024 × 700 W ≈ 720 kW of IT load (GPUs only), ~860 kW with PUE 1.2, × 240 h × $0.10/kWh ≈ **$21k** for electricity alone. Cloud rental at $3/H100-hour: 1024 × 240 × 3 ≈ **$740k**. Dominant bottleneck: at this size, MFU (model FLOPs utilization) is gated by AllReduce bandwidth on the gradient sync.

If that paragraph makes sense, the module worked.
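For checking the mental arithmetic afterwards, the same estimate can be written as a short script. This is a sketch under the quiz's own assumptions (6N FLOPs per token, ~50% MFU, 700 W per GPU, PUE 1.2, $0.10/kWh, $3 per GPU-hour). The function name and parameter names are hypothetical, and the 989 TFLOP/s peak is taken as the approximate dense BF16 figure for an H100 SXM.

```python
# Back-of-envelope training cost estimate.
# Every default below is an assumption from the quiz, not a measured value.

def training_estimate(
    params=70e9,            # model parameters (N)
    tokens=1e12,            # training tokens
    n_gpus=1024,
    peak_flops=989e12,      # assumed per-GPU dense BF16 peak, FLOP/s
    mfu=0.5,                # assumed model FLOPs utilization
    gpu_watts=700,          # assumed per-GPU power draw (GPU only)
    pue=1.2,                # assumed power usage effectiveness
    usd_per_kwh=0.10,
    usd_per_gpu_hour=3.0,
):
    total_flops = 6 * params * tokens                 # forward + backward
    cluster_flops = n_gpus * peak_flops * mfu         # effective FLOP/s
    hours = total_flops / cluster_flops / 3600
    facility_kw = n_gpus * gpu_watts / 1000 * pue     # GPU load scaled by PUE
    return {
        "total_flops": total_flops,
        "days": hours / 24,
        "facility_kw": facility_kw,
        "electricity_usd": facility_kw * hours * usd_per_kwh,
        "rental_usd": n_gpus * hours * usd_per_gpu_hour,
    }


estimate = training_estimate()
print(f"total compute : {estimate['total_flops']:.2e} FLOPs")
print(f"wall-clock    : {estimate['days']:.1f} days")
print(f"facility power: {estimate['facility_kw']:.0f} kW")
print(f"electricity   : ${estimate['electricity_usd'] / 1e3:.0f}k")
print(f"cloud rental  : ${estimate['rental_usd'] / 1e3:.0f}k")
```

The script's dollar figures come out slightly below the rounded ones in the quiz answer, because it carries the unrounded run time (about 9.6 days) instead of rounding up to 10 days. Changing `mfu` is the quickest way to see how sensitive the whole estimate is to communication overhead.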

Cross-links

Phase 01 — Hardware Substrate: the CPU-side mental model this builds on.
Phase 23 — GPU Fundamentals: the single-GPU mental model this builds on.
HIRING_PATH.md (in the repo root): the gap this module closes.

References

Karpathy A. 2024, Let's reproduce GPT-2 (124M) — back-of-envelope FLOPS and MFU intuition.
Hoffmann et al. 2022, Training Compute-Optimal Large Language Models (Chinchilla) — the FLOPS / tokens trade-off.
Hennessy J. and Patterson D. 2017, Computer Architecture: A Quantitative Approach, 6th ed. — the canonical reference for everything in this module.