
Blackwell deep-dive¶

The H100 is the chip of the 2023-2024 era; the H200 expands its memory; Blackwell (B100/B200) is the 2025 generation. This page breaks down what changes and why it matters.

Why the H100 deserves its own page¶

The H100 is the chip on which much of the 2023-2024 generation of GPT-4-class models, including Llama 3, was trained; Gemini is the notable exception, since Google reports training it on TPUs. If you cannot reason about an H100, you cannot reason about the cost or duration of frontier training. This page is the mechanical model.

Die-shot mental model¶

An H100 SXM5 contains:

132 Streaming Multiprocessors (SMs), organized into 8 GPCs (Graphics Processing Clusters).
Each SM has:
128 FP32 CUDA cores (4 sub-partitions × 32).
64 INT32 cores, 64 FP64 cores.
4 fourth-generation Tensor Cores (one per sub-partition). These are where the FLOPS live.
256 KB register file.
256 KB of unified L1/shared memory (configurable split; up to 228 KB usable as shared memory).
50 MB of shared L2 cache.
5 HBM3 stacks × 16 GB = 80 GB total, on a 5120-bit memory bus → 3.35 TB/s.
NVLink 4: 18 links × 50 GB/s = 900 GB/s off-chip to peer GPUs.

[source: NVIDIA H100 Tensor Core GPU Architecture Whitepaper 2022]
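
As a sanity check on the list above, the chip-level totals follow directly from the per-unit figures. The sketch below is plain arithmetic on numbers quoted on this page; the variable names are illustrative and do not come from any NVIDIA tool.

```python
# Per-unit figures quoted above for the H100 SXM5.
SMS = 132
FP32_CORES_PER_SM = 128
TENSOR_CORES_PER_SM = 4
HBM_STACKS = 5
GB_PER_STACK = 16
HBM_BUS_BITS = 5120
HBM_BANDWIDTH_BYTES_PER_S = 3.35e12
NVLINK_LINKS = 18
NVLINK_GB_PER_S_PER_LINK = 50

total_fp32_cores = SMS * FP32_CORES_PER_SM
total_tensor_cores = SMS * TENSOR_CORES_PER_SM
hbm_capacity_gb = HBM_STACKS * GB_PER_STACK
nvlink_gb_per_s = NVLINK_LINKS * NVLINK_GB_PER_S_PER_LINK

# Implied per-pin data rate: convert bytes/s to bits/s, then divide by bus width.
hbm_gbit_per_s_per_pin = HBM_BANDWIDTH_BYTES_PER_S * 8 / HBM_BUS_BITS / 1e9

print(f"FP32 CUDA cores : {total_fp32_cores}")
print(f"Tensor Cores    : {total_tensor_cores}")
print(f"HBM3 capacity   : {hbm_capacity_gb} GB")
print(f"NVLink 4        : {nvlink_gb_per_s} GB/s")
print(f"HBM3 per pin    : {hbm_gbit_per_s_per_pin:.2f} Gbit/s")
```

The per-pin rate is a derived quantity, not a figure quoted in the text; it is only there to show that the bus width and the bandwidth are mutually consistent.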

The interview-ready summary: 132 SMs × 4 Tensor Cores × FP8 throughput at boost clock ≈ 1979 TF/s peak FP8 dense; 80 GB HBM3 at 3.35 TB/s; 900 GB/s NVLink to peers in a DGX H100.

Tensor Cores: the 4th generation¶

A Tensor Core is, mechanically, a small matrix-multiply unit. On H100, each Tensor Core can do, per cycle:

FP16 / BF16: 512 fused multiply-adds (FMAs) → 1024 FLOPs, counting each FMA as 2 FLOPs.
FP8: 1024 FMAs → 2048 FLOPs (2× the rate because FP8 operands are half the width).

There are 4 Tensor Cores per SM × 132 SMs = 528 total. At ~1.83 GHz boost: 528 × 2048 × 1.83e9 ≈ 1.98e15 FP8 FLOP/s, which is the 1979 TF/s dense FP8 peak that NVIDIA advertises (the 3958 TF/s datasheet figure assumes 2:4 structured sparsity). The derivation is worth doing once; after that, remember the headline number.
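
One way to check a per-Tensor-Core figure is to work backward from the advertised peak. The sketch below assumes the ~1.83 GHz boost clock and the 1979 TF/s dense FP8 peak used on this page; the per-cycle number it prints is implied by those two inputs, not quoted from a datasheet.

```python
SMS = 132
TENSOR_CORES_PER_SM = 4
BOOST_CLOCK_HZ = 1.83e9          # approximate boost clock used on this page
PEAK_FP8_DENSE_FLOPS = 1979e12   # advertised dense FP8 peak

tensor_cores = SMS * TENSOR_CORES_PER_SM

# FLOPs each Tensor Core must retire per cycle to reach the advertised peak.
implied_fp8_flops_per_tc_cycle = PEAK_FP8_DENSE_FLOPS / (tensor_cores * BOOST_CLOCK_HZ)

# Round to the nearest power-of-two-friendly value and recompute the peak forward.
fp8_flops_per_tc_cycle = round(implied_fp8_flops_per_tc_cycle)
recomputed_peak_tflops = tensor_cores * fp8_flops_per_tc_cycle * BOOST_CLOCK_HZ / 1e12

print(f"Tensor Cores               : {tensor_cores}")
print(f"Implied FP8 FLOPs/TC/cycle : {implied_fp8_flops_per_tc_cycle:.1f}")
print(f"Recomputed FP8 dense peak  : {recomputed_peak_tflops:.0f} TF/s")
```

Halving the per-cycle figure gives the FP16/BF16 rate, since those formats run at half the FP8 throughput.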

FP8, the transformer engine, and numerics¶

H100 introduced two FP8 formats:

Format	Sign	Exponent	Mantissa	Range	Typical use
E4M3	1	4	3	±448	Forward activations, weights
E5M2	1	5	2	±57344	Backward gradients (needs more range)

[source: Micikevicius et al. 2022, FP8 Formats for Deep Learning, arXiv:2209.05433]
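
The Range column can be reproduced from the bit layout. The sketch below follows the encoding rules described in the FP8 paper cited above: E5M2 reserves its top exponent code for infinities and NaNs (IEEE-style), while E4M3 keeps that exponent code for normal numbers and gives up only the all-ones mantissa pattern for NaN. The function name and its `ieee_like` flag are illustrative.

```python
def fp8_max_finite(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite magnitude of an FP8 format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN, so stop one code below it.
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # Top exponent code is usable; only mantissa all-ones there means NaN.
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    significand = 1 + max_mantissa / 2 ** man_bits
    return significand * 2.0 ** (max_exp_field - bias)


e4m3_max = fp8_max_finite(exp_bits=4, man_bits=3, ieee_like=False)
e5m2_max = fp8_max_finite(exp_bits=5, man_bits=2, ieee_like=True)

print(f"E4M3 max finite: {e4m3_max}")
print(f"E5M2 max finite: {e5m2_max}")
```

The extra exponent bit is what buys E5M2 its range, at the cost of one mantissa bit of precision.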

The Transformer Engine is an NVIDIA software library (transformer_engine) that:

Keeps per-tensor scaling factors in FP32.
Dynamically casts activations and weights between higher precision and FP8 (E4M3 for the forward pass, E5M2 for the backward pass) at the right places.
Re-quantizes between layers so the FP8 range is used efficiently.

The point is that FP8 alone is too narrow to train with; you need a scaling policy. The Transformer Engine ships that policy, validated for transformer training. This is what lets you approach the 2× matmul throughput of FP8 over BF16 without a meaningful loss in model quality.
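
To make the idea of a scaling policy concrete, here is a minimal pure-Python sketch of per-tensor scaling into E4M3. It is not the Transformer Engine's implementation: the rounding function is a simplified model of the E4M3 grid, the scale is computed from the current tensor's maximum (a real policy may use a history of maxima), and all function names are hypothetical.

```python
import math

E4M3_MAX = 448.0       # largest finite E4M3 magnitude
E4M3_MIN_EXP = -6      # exponent of the smallest normal E4M3 value
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest point on a simplified E4M3 grid, saturating at the max."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**ex with 0.5 <= m < 1, so the binade exponent is ex - 1.
    _, ex = math.frexp(a)
    e = max(ex - 1, E4M3_MIN_EXP)        # below the normal range, use subnormal spacing
    step = 2.0 ** (e - MANTISSA_BITS)    # spacing between neighbouring values
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_per_tensor(values):
    """Scale so the largest magnitude maps to E4M3_MAX, then round every element."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) for v in values], scale


def dequantize(quantized, scale):
    return [q / scale for q in quantized]


values = [1e-4, -3e-4, 2.5e-4]                    # small magnitudes, e.g. activations
unscaled = [round_to_e4m3(v) for v in values]     # cast without any scaling
quantized, scale = quantize_per_tensor(values)
restored = dequantize(quantized, scale)

print("unscaled cast :", unscaled)
print("scaled + back :", restored)
```

Without scaling, every value here falls below the smallest E4M3 step and rounds to zero. With a per-tensor scale kept in higher precision, the same values survive the round trip with a few percent of relative error.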

Other formats you should know:

TF32 (TensorFloat-32): an Ampere-introduced format that takes FP32 inputs and computes with FP16's precision (10-bit mantissa) and FP32's range. Drop-in for FP32 matmuls in cuBLAS and cuDNN. Practically obsolete on H100 in favor of BF16/FP16/FP8.
BF16 (bfloat16): same range as FP32, mantissa truncated to 7 bits. The default training format for the Phase 17-22 era. Slow on H100 vs FP8 (half the throughput) but trivially safe: no scaling policy needed.

Decision rule:

Inference: FP8 (or INT8 with care) on H100; FP16/BF16 on A100.
Training: BF16 with FP32 master weights on A100; FP8 with the Transformer Engine on H100.

Memory hierarchy at H100 scale¶

Level	Size	Bandwidth	Latency (rough)
Register file	256 KB / SM	~20 TB/s/SM	1 cycle
L1/SMEM	256 KB / SM (up to 228 KB as SMEM)	~33 TB/s/SM	~30 cycles
L2 cache	50 MB (chip-wide)	~5.5 TB/s	~200 cycles
HBM3	80 GB	3.35 TB/s	~500 ns
NVLink peer	80 GB on peer	900 GB/s aggregate	~few µs

[source: NVIDIA H100 Whitepaper; Luo et al. 2024 microbenchmark study]

Key insight: memory bandwidth is the bottleneck for low-batch inference; compute is the bottleneck for large-batch training.

Inference, batch=1, decode: you re-read all weights from HBM for each token. A 70B model in FP8 is 70 GB; 70 GB / 3.35 TB/s ≈ 21 ms/token ⇒ ~48 tok/s ceiling on a single H100. You are bandwidth-bound; FLOPS are mostly idle.
Training, large batch: each weight is reused over thousands of tokens per microbatch, so arithmetic intensity exceeds the ridge point (peak FLOP/s ÷ HBM bandwidth: ≈ 295 FLOP/byte for BF16, ≈ 590 FLOP/byte for FP8); you become compute-bound and the FP8 FLOPS matter.

This is why batching matters so much for serving — it converts a bandwidth-bound workload into a compute-bound one.
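
The bandwidth-bound versus compute-bound argument above can be put into a small roofline-style calculation. This is a simplified sketch: it counts only weight traffic (no KV cache or activation reads), assumes 2 FLOPs per parameter per token, and takes BF16 peak as half the FP8 peak. The function names are illustrative.

```python
HBM_BYTES_PER_S = 3.35e12
PEAK_FP8_FLOPS = 1979e12            # dense
PEAK_BF16_FLOPS = PEAK_FP8_FLOPS / 2


def decode_ceiling_tok_per_s(params: float, bytes_per_param: float) -> float:
    """Batch-1 decode: every token requires streaming all weights from HBM once."""
    return HBM_BYTES_PER_S / (params * bytes_per_param)


def ridge_point(peak_flops: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute and bandwidth balance."""
    return peak_flops / HBM_BYTES_PER_S


def batch_to_become_compute_bound(peak_flops: float, bytes_per_param: float) -> float:
    """With batch B, each weight byte read serves B tokens:
    intensity = 2 * B / bytes_per_param, so solve intensity = ridge for B."""
    return ridge_point(peak_flops) * bytes_per_param / 2


tok_per_s = decode_ceiling_tok_per_s(params=70e9, bytes_per_param=1)
ms_per_token = 1e3 / tok_per_s
ridge_fp8 = ridge_point(PEAK_FP8_FLOPS)
ridge_bf16 = ridge_point(PEAK_BF16_FLOPS)
batch_fp8 = batch_to_become_compute_bound(PEAK_FP8_FLOPS, bytes_per_param=1)

print(f"70B FP8 decode ceiling : {tok_per_s:.1f} tok/s ({ms_per_token:.1f} ms/token)")
print(f"Ridge point, FP8       : {ridge_fp8:.0f} FLOP/byte")
print(f"Ridge point, BF16      : {ridge_bf16:.0f} FLOP/byte")
print(f"Batch to saturate FP8  : ~{batch_fp8:.0f} concurrent tokens")
```

Under these assumptions a decode batch in the low hundreds is what moves the workload from the bandwidth roof to the compute roof; real deployments hit KV-cache limits earlier, so treat the batch figure as an upper-bound intuition rather than a tuning target.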

NVLink 4 and NVSwitch¶

A single H100 has 18 NVLink 4 links × 50 GB/s = 900 GB/s of off-chip bandwidth, distinct from PCIe.

NVSwitch is the on-board fabric that connects the 8 H100s in a DGX H100 server so that every GPU sees the full 900 GB/s to every other GPU (vs. the A100 generation's NVSwitch, which gave 600 GB/s per GPU). Taking the full 900 GB/s as the effective bandwidth, an 8-GPU AllReduce of 1 GB takes (see §03 for math):

\[ T_{\text{AllReduce}} = \frac{2(N-1)}{N} \cdot \frac{D}{\text{BW}} = \frac{2 \cdot 7}{8} \cdot \frac{1\,\text{GB}}{900\,\text{GB/s}} \approx 1.94 \text{ ms}. \]
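
The same formula as a small function, for plugging in other GPU counts and message sizes. This is a bandwidth-only sketch that ignores latency and protocol overhead. The second call is an added assumption: NVIDIA's 900 GB/s figure is a bidirectional aggregate, so it shows the more conservative case where only half of that is usable in each direction.

```python
def allreduce_seconds(n_gpus: int, data_bytes: float, bw_bytes_per_s: float) -> float:
    """Bandwidth term of AllReduce: T = 2 * (N - 1) / N * D / BW."""
    return 2 * (n_gpus - 1) / n_gpus * data_bytes / bw_bytes_per_s


GB = 1e9

# The case worked in the equation above: 8 GPUs, 1 GB, 900 GB/s.
t_full = allreduce_seconds(n_gpus=8, data_bytes=1 * GB, bw_bytes_per_s=900 * GB)

# Conservative variant (assumption): 450 GB/s usable per direction.
t_half = allreduce_seconds(n_gpus=8, data_bytes=1 * GB, bw_bytes_per_s=450 * GB)

print(f"1 GB AllReduce at 900 GB/s: {t_full * 1e3:.2f} ms")
print(f"1 GB AllReduce at 450 GB/s: {t_half * 1e3:.2f} ms")
```

The factor 2(N-1)/N approaches 2 as N grows, so beyond a handful of GPUs the time depends almost entirely on D / BW rather than on the GPU count.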

A DGX H100 has 8 GPUs, 4 NVSwitches, and external NVLink-Network ports that connect to other DGX boxes via NVLink Switch System — extending the 900 GB/s domain to 32 or 256 GPUs.

MIG — Multi-Instance GPU¶

MIG lets you partition a single H100 into up to 7 isolated instances, each with its own slice of SMs, L2, and HBM. Used for:

Multi-tenancy in serving (each customer gets a guaranteed slice).
Fault isolation (a runaway kernel cannot starve neighbors).
Inference workloads that do not fill a whole H100.

In the 7-way split, each instance gets ~10 GB of HBM and roughly one-seventh of the usable SMs. MIG is not used in training (training wants the full chip).

H200 — same compute, much more memory¶

The H200 (shipping mid-2024) keeps the H100's compute unchanged: same SMs, same Tensor Cores, same peak FLOPS. What changes:

141 GB HBM3e (vs. 80 GB HBM3 on H100). 76% more.
4.8 TB/s bandwidth (vs. 3.35 TB/s). 43% higher.
Same NVLink 4, same TDP (700 W), same form factor (drop-in replacement for H100 SXM5).

[source: NVIDIA H200 Datasheet 2024]

Why this matters:

A 70B FP8 model fits in one H200 with room for activations and KV cache. On an H100, the 70 GB of weights leave almost no room for anything else, so in practice you need model parallelism. This collapses serving complexity dramatically.
KV-cache-bound inference workloads (long context, large batch) scale roughly with HBM bandwidth → ~43% higher tokens/sec.

The H200 is a memory product (capacity and bandwidth), not a compute product. Operators who own H100 fleets often upgrade to H200 specifically for serving.
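
The percentages and the serving argument above reduce to a few lines of arithmetic. The sketch below uses only figures quoted on this page; the decode ceiling is the weights-only, batch-1 bound and ignores KV-cache traffic.

```python
H100 = {"hbm_gb": 80, "bw_tb_per_s": 3.35}
H200 = {"hbm_gb": 141, "bw_tb_per_s": 4.8}
MODEL_GB = 70   # 70B parameters at 1 byte each (FP8)

capacity_gain = H200["hbm_gb"] / H100["hbm_gb"] - 1
bandwidth_gain = H200["bw_tb_per_s"] / H100["bw_tb_per_s"] - 1


def headroom_gb(gpu: dict) -> float:
    """HBM left for KV cache and activations after loading the weights."""
    return gpu["hbm_gb"] - MODEL_GB


def decode_ceiling_tok_per_s(gpu: dict) -> float:
    """Batch-1 decode bound: bandwidth divided by bytes of weights read per token."""
    return gpu["bw_tb_per_s"] * 1e12 / (MODEL_GB * 1e9)


print(f"Capacity gain  : {capacity_gain:.0%}")
print(f"Bandwidth gain : {bandwidth_gain:.0%}")
for name, gpu in (("H100", H100), ("H200", H200)):
    print(f"{name}: {headroom_gb(gpu)} GB headroom, "
          f"{decode_ceiling_tok_per_s(gpu):.1f} tok/s decode ceiling")
```

The headroom line is the practical difference: about 10 GB left over on an H100 versus about 71 GB on an H200 for the same 70 GB of weights.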

B100 / B200 / GB200 (Blackwell) preview¶

Announced GTC 2024, shipping late 2024 / 2025.

B200 is a dual-die chip (two reticle-limit dies fused with a 10 TB/s die-to-die link, NV-HBI). Counted as one GPU.
~2.25 PF dense FP16 / ~4.5 PF FP8 / ~9 PF FP4 (Blackwell introduces FP4 with a second-gen Transformer Engine).
192 GB HBM3e at 8 TB/s.
NVLink 5: 18 links × 100 GB/s = 1.8 TB/s per chip.
TDP: 1000 W per B200. Liquid cooling effectively mandatory at rack scale.
GB200: a Grace CPU + 2× B200 GPU module. NVL72: 36 GB200 modules (72 GPUs) in one rack, connected by NVLink Switch as a single 130 TB/s NVLink domain. The "rack is the unit" architecture.

[source: NVIDIA Blackwell Architecture Technical Brief 2024; GTC 2024 keynote]
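
The rack-level figures in the list above follow from the per-GPU ones. The sketch below is arithmetic on the numbers quoted on this page; the power total is a GPU-only assumption and leaves out Grace CPUs, switches, and cooling overhead.

```python
NVLINK5_LINKS = 18
NVLINK5_GB_PER_S_PER_LINK = 100
GPUS_PER_NVL72 = 72
B200_TDP_W = 1000
B200_DENSE_PFLOPS = {"fp16": 2.25, "fp8": 4.5, "fp4": 9.0}
H100_FP8_DENSE_PFLOPS = 1.979

nvlink_tb_per_s_per_gpu = NVLINK5_LINKS * NVLINK5_GB_PER_S_PER_LINK / 1000
rack_nvlink_tb_per_s = GPUS_PER_NVL72 * nvlink_tb_per_s_per_gpu
rack_gpu_power_kw = GPUS_PER_NVL72 * B200_TDP_W / 1000
rack_fp4_pflops = GPUS_PER_NVL72 * B200_DENSE_PFLOPS["fp4"]

# Each halving of operand width doubles the quoted dense throughput.
fp4_over_fp8 = B200_DENSE_PFLOPS["fp4"] / B200_DENSE_PFLOPS["fp8"]
b200_over_h100_fp8 = B200_DENSE_PFLOPS["fp8"] / H100_FP8_DENSE_PFLOPS

print(f"NVLink 5 per GPU        : {nvlink_tb_per_s_per_gpu} TB/s")
print(f"NVL72 NVLink domain     : {rack_nvlink_tb_per_s:.1f} TB/s")
print(f"NVL72 GPU power (only)  : {rack_gpu_power_kw:.0f} kW")
print(f"NVL72 dense FP4         : {rack_fp4_pflops:.0f} PF")
print(f"FP4 vs FP8 on B200      : {fp4_over_fp8:.1f}x")
print(f"B200 vs H100, dense FP8 : {b200_over_h100_fp8:.2f}x")
```

The 130 TB/s quoted for the NVL72 domain is the rounded 72 × 1.8 TB/s, and 72 kW of GPU power in one rack is why liquid cooling comes up.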

Headline for the interview: Blackwell makes the rack — not the server — the unit of compute, and introduces FP4 to double effective FLOPS again.

Cross-links¶

03-interconnects-and-topology.md: NVLink 4 / 5 and NVSwitch in network context.
04-datacenter-economics.md: what 1000 W per chip means at fleet scale.
Phase 23 — GPU Fundamentals: single-GPU SIMT model that this builds on.
Phase 24 — CUDA & Triton: how to actually program the SM.

References¶

NVIDIA H100 Tensor Core GPU Architecture Whitepaper, 2022 (rev. 2024).
NVIDIA H200 Tensor Core GPU Datasheet, 2024.
NVIDIA Blackwell Architecture Technical Brief, 2024.
Micikevicius et al. 2022, FP8 Formats for Deep Learning, arXiv:2209.05433.
Luo et al. 2024, Dissecting the NVIDIA Hopper Architecture through Microbenchmarks, arXiv:2402.13499.
Dao T. et al. 2022, FlashAttention, NeurIPS — the canonical "make this kernel compute-bound, not bandwidth-bound" paper.