
03 – Mixed-precision preview (fp16 / bf16, accumulator rules)

We will not train in mixed precision here: Borja's CPU gains nothing from it. But the mathematics of rounding decides half the bugs in Phases 26 and 27. Today we measure the drift; in Phase 26 we exploit it.

Why a preview, not a deep dive

Phase 18 runs on CPU. The machine's integrated GPU, an Intel UHD 620, has no Tensor Cores, no bf16 ops, no FP8 anything. Training in mixed precision on this hardware saves memory (half the bytes per fp16 weight) but not time (CPU fp16 is either software-emulated or, on AVX-512, slower than fp32 anyway). It's not worth the engineering, so Phase 18 does fp32 end-to-end.

But two later phases hinge on mixed-precision math:

Phase 26 (Quantization): INT8 and 4-bit quantization are the extreme of low-precision arithmetic. Mixed precision (fp16/bf16) is the intermediate stop.
Phase 27 (FlashAttention): the online softmax used by FlashAttention requires fp32 accumulators even when the matmul is fp16. The rule "accumulator stays in higher precision" is general.

So Phase 18 covers the mathematics of mixed precision and measures the per-layer drift on an fp32-trained MiniGPT. The lab forces Borja to see fp16's dynamic-range cliff with his own eyes. Phase 26 then exploits it.

The four formats

Format	Sign	Exponent	Mantissa	Total bits	Dynamic range	Precision (mantissa LSB)
fp32	1	8	23	32	\(\sim 10^{-38}\) to \(10^{38}\)	\(\sim 10^{-7}\)
fp16	1	5	10	16	\(\sim 10^{-5}\) to \(10^{5}\)	\(\sim 10^{-3}\)
bf16	1	8	7	16	\(\sim 10^{-38}\) to \(10^{38}\)	\(\sim 10^{-2}\)
fp8 (E4M3)	1	4	3	8	\(\sim 10^{-2}\) to \(10^{2}\)	\(\sim 10^{-1}\)

Two things to notice:

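fp16 and bf16 trade exponent bits against mantissa bits. fp16 has more precision (three extra mantissa bits, so steps about 8× finer) but far less dynamic range (about 10 orders of magnitude against about 76). bf16 keeps fp32's range but only 7 of its 23 mantissa bits.
fp32 → fp16 is a real rounding. A weight value \(1.234567\) in fp32 becomes \(1.234375\) in fp16, the nearest multiple of \(2^{-10}\), so only three to four significant digits survive. This is what "drift" measures.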
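The range and precision columns of the table follow directly from the bit widths. The sketch below is a minimal illustration, assuming an IEEE-754-style encoding (bias \(2^{e-1} - 1\), top exponent reserved for inf/NaN). That assumption holds for fp32, fp16 and bf16 but not for fp8 E4M3, which is therefore left out. `format_limits` is a hypothetical helper, not a library function.

```python
import numpy as np

def format_limits(exp_bits: int, man_bits: int) -> dict:
    """Limits of an IEEE-754-style binary format (hypothetical helper).

    Assumes the all-ones exponent is reserved for inf/NaN, which is true
    for fp32, fp16 and bf16 but NOT for fp8 E4M3.
    """
    bias = 2 ** (exp_bits - 1) - 1
    return {
        "max": (2.0 - 2.0 ** -man_bits) * 2.0 ** bias,  # largest finite value
        "min_normal": 2.0 ** (1 - bias),                # smallest normal value
        "eps": 2.0 ** -man_bits,                        # mantissa LSB at 1.0
    }

fp32 = format_limits(8, 23)
fp16 = format_limits(5, 10)
bf16 = format_limits(8, 7)

for name, lim in [("fp32", fp32), ("fp16", fp16), ("bf16", bf16)]:
    print(f"{name}: max={lim['max']:.3e}  "
          f"min_normal={lim['min_normal']:.3e}  eps={lim['eps']:.3e}")

# Cross-check against NumPy for the two formats it supports natively.
assert fp16["max"] == float(np.finfo(np.float16).max)
assert fp32["eps"] == float(np.finfo(np.float32).eps)

# The rounding from the second point: the value lands on a multiple of 2**-10.
rounded = float(np.float32(1.234567).astype(np.float16))
print(rounded)  # 1.234375
```

NumPy has no native bf16 dtype, so the bf16 row can only be derived from the bit widths here, not cross-checked.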

The dynamic-range difference is why fp16 needs loss scaling but bf16 doesn't. With fp16, gradients smaller than \(\sim 6 \times 10^{-5}\) fall into the subnormal range and lose precision, and those below \(\sim 3 \times 10^{-8}\) round to zero (underflow). With bf16, underflow starts only around \(10^{-38}\), so no scaling is needed. fp16 was the standard from 2017–2020 (NVIDIA V100); bf16 (Ampere+) and fp8 (Hopper+) are the modern path.
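A small numeric illustration of the underflow and of why scaling rescues it. This is only a sketch of the idea, not the Phase 26 recipe: the gradient value and the scale factor \(2^{16}\) are arbitrary choices made for the example.

```python
import numpy as np

grad = np.array([1e-8], dtype=np.float32)   # a small fp32 gradient

# Direct cast: the value is below fp16's smallest subnormal (~6e-8)
# and rounds to zero.
direct = grad.astype(np.float16)

# Loss scaling: multiply before the cast, divide after casting back to fp32.
scale = np.float32(2.0 ** 16)               # illustrative scale factor (assumption)
scaled = (grad * scale).astype(np.float16)  # ~6.6e-4, a normal fp16 value
recovered = scaled.astype(np.float32) / scale

print("direct cast:   ", direct[0])
print("scaled + back: ", recovered[0])
```

The recovered value still carries fp16's relative rounding error, but it is no longer zero, which is the whole point of scaling.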

The drift bound

Casting an fp32 value \(x\) to fp16 introduces a rounding error bounded by:

\[ | x - \text{fp16}(x) | \le 2^{-10} \, |x| \]

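(The mantissa has 10 bits, so the relative error is \(\le 2^{-10} \approx 10^{-3}\); with round-to-nearest the bound tightens to \(2^{-11}\).) The bound assumes \(|x|\) lies in fp16's normal range. It is a multiplicative bound: large values lose more absolute precision.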
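The bound is easy to check empirically. The sketch below assumes values drawn log-uniformly from \([10^{-3}, 10^{3}]\), a range chosen for the example because it sits well inside fp16's normal range; it casts them to fp16 and measures the worst relative error.

```python
import numpy as np

rng = np.random.default_rng(0)
# Log-uniform magnitudes in [1e-3, 1e3], well inside fp16's normal range.
x = (10.0 ** rng.uniform(-3, 3, size=100_000)).astype(np.float32)

x16 = x.astype(np.float16)

# Measure in float64 so the measurement itself adds no rounding.
x64 = x.astype(np.float64)
rel_err = np.abs(x64 - x16.astype(np.float64)) / np.abs(x64)

max_rel_err = rel_err.max()
print(f"max relative error: {max_rel_err:.3e}")
print(f"bound 2**-10:       {2.0 ** -10:.3e}")
```

Values outside the normal range (subnormals, or magnitudes above 65504) are deliberately excluded, since the relative bound does not apply to them.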

For matmul: if \(A\) and \(B\) are fp16 and the matmul \(C = AB\) accumulates in fp32 (the standard Tensor-Core operation), a conservative worst-case bound on the error in \(C\) is \(O(N \cdot 2^{-10}) \, \|A\| \|B\|\), where \(N\) is the inner dimension. The factor of \(N\) comes from summing \(N\) products whose operands were each rounded. For a transformer with \(d_\text{model} = 64\), this worst case is \(\sim 64 \cdot 10^{-3} \approx 6\%\) relative. It is a bound, not an estimate: rounding errors have mixed signs and largely cancel, so the error observed in practice is much closer to \(10^{-3}\).

That's one matmul. A transformer has on the order of \(10\) matmuls per token. The errors compound. Phase 26 derives the compounding bound formally; for now, the typical magnitudes (well below the worst-case bound) are: per-matmul fp16 error roughly \(10^{-3}\) relative, per-block fp16 error roughly \(10^{-2}\) relative, per-layer \(10^{-1}\) if you let it compound.
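The gap between the worst-case bound and the typical error can be seen in a few lines. The sketch below emulates "fp16 operands, fp32 accumulation" on CPU by rounding the operands through fp16 and multiplying in fp32. The random Gaussian matrices and their shapes are assumptions of the example, not MiniGPT weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64  # inner dimension, matching d_model = 64
A = rng.standard_normal((32, N), dtype=np.float32)
B = rng.standard_normal((N, 32), dtype=np.float32)

# Reference product in float64.
C_ref = A.astype(np.float64) @ B.astype(np.float64)

# Round the operands to fp16, then multiply and accumulate in fp32.
A16 = A.astype(np.float16).astype(np.float32)
B16 = B.astype(np.float16).astype(np.float32)
C_mixed = A16 @ B16

rel_err = np.linalg.norm(C_mixed - C_ref) / np.linalg.norm(C_ref)
worst_case = N * 2.0 ** -10
print(f"observed relative error: {rel_err:.2e}")
print(f"worst-case bound:        {worst_case:.2e}")
```

The error is measured in the Frobenius norm over the whole product; individual entries that happen to be near zero can show a larger relative error.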

The accumulator rule

The most important rule of mixed precision:

Accumulators stay in fp32, even when operands are fp16.

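Why: an fp16 dot product of two length-\(N\) vectors performs \(N\) multiply-add operations, and each addition to the running sum can lose precision. If the running sum is held in fp16, after \(\sim 100\) additions the accumulated rounding error can reach a few percent in the worst case, and once the sum reaches \(2^{11}\) times the size of an addend, further additions can be rounded away entirely. With fp32 accumulation, the running sum has 23 mantissa bits; rounding errors typically stay sub-percent for \(N\) up to \(10^6\).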

This is why Tensor Cores work the way they do: fp16 inputs, fp32 accumulator, fp16 output. The cost of the fp32 accumulator is small (the multiplier circuit is the bulk of a Tensor Core; the accumulator is bookkeeping). The benefit is enormous: matmul precision is preserved.
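The failure mode of an fp16 accumulator is easiest to see in an extreme case. The sketch below takes the dot product of two all-ones fp16 vectors of length 4096, an input picked for the example so that every product is exactly 1. The hypothetical helper `dot` keeps the running sum in whichever dtype it is given.

```python
import numpy as np

def dot(a, b, acc_dtype):
    """Dot product of fp16 vectors with the running sum held in acc_dtype."""
    acc = acc_dtype(0)
    for x, y in zip(a, b):
        prod = x * y                            # fp16 * fp16 -> fp16 (exact here: 1 * 1)
        acc = acc_dtype(acc + acc_dtype(prod))  # round the sum to acc_dtype each step
    return acc

n = 4096
a = np.ones(n, dtype=np.float16)
b = np.ones(n, dtype=np.float16)

sum_fp16 = dot(a, b, np.float16)
sum_fp32 = dot(a, b, np.float32)
print(sum_fp16, sum_fp32)  # 2048.0 4096.0
```

At 2048 the spacing between adjacent fp16 values is 2, so adding 1 rounds back to 2048 and the fp16 sum stalls; the fp32 accumulator has no such problem at this size.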

For Phase 18's preview lab, we don't run the matmul in fp16. We cast the weights to fp16, cast back to fp32, and measure the per-layer activation drift introduced by the round-trip. This is the "rounding shadow" of fp16 weights on the forward pass.
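A minimal sketch of that round-trip on a single linear map. The helper name `round_trip_fp16`, the weight scale 0.02 and the shapes are illustrative assumptions; the actual lab code may differ.

```python
import numpy as np

def round_trip_fp16(w: np.ndarray) -> np.ndarray:
    """Cast fp32 weights to fp16 and back: fp32 dtype, fp16 rounding."""
    return w.astype(np.float16).astype(np.float32)

rng = np.random.default_rng(0)
# Stand-in for one weight matrix; the 0.02 scale is an illustrative assumption.
W = 0.02 * rng.standard_normal((64, 64), dtype=np.float32)
x = rng.standard_normal((7, 64), dtype=np.float32)

W_rt = round_trip_fp16(W)
y_ref = x @ W
y_rt = x @ W_rt            # the forward pass still runs entirely in fp32

drift = np.linalg.norm(y_rt - y_ref) / np.linalg.norm(y_ref)
print(W_rt.dtype, f"relative activation drift: {drift:.2e}")
```

Because only the weights are rounded, the drift isolates the effect of fp16 storage from the effect of fp16 arithmetic.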

Where mixed precision actually helps

Three concrete wins, in order of impact:

Memory. Weights in fp16 are half the size. A 70B-parameter model is 280 GB in fp32, 140 GB in fp16. KV cache (Phase 22) shrinks similarly. This is the reason production inference is mixed-precision.
Bandwidth. Weights are loaded from HBM to SRAM at 2× the rate in fp16. Memory-bound kernels speed up directly (recall Phase 1's roofline).
Compute. Tensor Cores execute fp16 matmul at 2× the FLOPS of fp32 (A100) or 8× (H100 with sparsity). Pure compute kernels speed up by 2-8×.
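The memory figure is plain arithmetic: parameter count times bytes per parameter, at 4 bytes for fp32 and 2 for fp16. A short check, counting weights only (no optimizer state, activations or KV cache) and using decimal gigabytes; `weight_gigabytes` is a hypothetical helper:

```python
import numpy as np

def weight_gigabytes(n_params: float, dtype) -> float:
    """Storage for the weights alone, in GB (10**9 bytes)."""
    return n_params * np.dtype(dtype).itemsize / 1e9

n_params = 70e9
gb_fp32 = weight_gigabytes(n_params, np.float32)
gb_fp16 = weight_gigabytes(n_params, np.float16)
print(gb_fp32, gb_fp16)  # 280.0 140.0
```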

Borja's CPU gets benefit #1 (memory) and partial #2 (bandwidth), but loses #3 (CPU fp16 is no faster). On CPU, the cost-benefit is roughly break-even; on GPU, it's 4×+. Phase 26 is where it matters.

The verb-grammar example

Take the sentence "yo trabajo / I work". After tokenization (Phase 11), this is some 7-token sequence. MiniGPT's forward pass is:

Embedding lookup → activation tensor (7, d_model=64) of fp32 values.
Layer 1: LayerNorm → Attention → +residual → LayerNorm → FFN → +residual.
Layers 2–4: same.
Final LayerNorm → output projection → logits (7, V).

If we cast the weights to fp16 only, leaving activations and the forward computation in fp32, the per-layer drift is what we measure. Expected pattern (lab 03 produces the plot):

Layer 1 activations: drift ~\(10^{-3}\) relative (one matmul of error).
Layer 4 activations: drift ~\(10^{-2}\) relative (compounded).
Output logits: drift ~\(10^{-2}\) relative; argmax often unchanged.
A few of the 7 positions: argmax changes. The model "predicts" a different token under fp16 weights.

The number of argmax-changing positions is the headline number. For our tiny model, it's typically 0–2 out of 7 — i.e., the predictions are mostly stable, with a handful of marginal tokens flipping. This matters because Phase 21 will sample from these logits; argmax flips for one token are visible in generated text.
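The measurement itself can be sketched on a toy stand-in. The model below is not MiniGPT: attention is omitted, each block is a simplified residual MLP, the weights are random rather than trained, and the vocabulary size of 100 is invented for the example. Its numbers will therefore differ from the lab's, but the procedure is the same: run the forward pass twice in fp32, once with the original weights and once with fp16-round-tripped weights, then compare.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V, L = 7, 64, 100, 4   # tokens, d_model, vocab size (assumed), layers

def layer_norm(h):
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + 1e-5)

def round_trip_fp16(w):
    return w.astype(np.float16).astype(np.float32)

def forward(h, blocks, w_out):
    """Return per-layer activations and final logits; all math stays in fp32."""
    acts = []
    for w1, w2 in blocks:
        h = h + np.tanh(layer_norm(h) @ w1) @ w2   # simplified residual block
        acts.append(h)
    return acts, layer_norm(h) @ w_out

# Random fp32 weights with 1/sqrt(fan_in) scaling (illustrative, untrained).
blocks = [(rng.standard_normal((D, 4 * D), dtype=np.float32) / D ** 0.5,
           rng.standard_normal((4 * D, D), dtype=np.float32) / (4 * D) ** 0.5)
          for _ in range(L)]
w_out = rng.standard_normal((D, V), dtype=np.float32) / D ** 0.5
h0 = rng.standard_normal((T, D), dtype=np.float32)

acts_ref, logits_ref = forward(h0, blocks, w_out)
blocks_rt = [(round_trip_fp16(w1), round_trip_fp16(w2)) for w1, w2 in blocks]
acts_rt, logits_rt = forward(h0, blocks_rt, round_trip_fp16(w_out))

drift = [float(np.linalg.norm(a - b) / np.linalg.norm(b))
         for a, b in zip(acts_rt, acts_ref)]
flips = int((logits_rt.argmax(-1) != logits_ref.argmax(-1)).sum())
print("per-layer drift:", [f"{d:.1e}" for d in drift])
print(f"argmax flips: {flips} of {T}")
```

Swapping the toy blocks for the real MiniGPT layers changes only the `forward` function; the comparison logic stays the same.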

What we are not doing

Training in fp16. Both forward and backward, with loss scaling, is the full mixed-precision recipe. Phase 26.
bf16 measurement. bf16 is qualitatively easier (no scaling needed, range matches fp32). Once you understand fp16, bf16 is straightforward. Phase 26.
fp8. A Hopper-and-later GPU feature, irrelevant to CPU. Phase 26 surveys it.

Drill problems

fp16 has 10 mantissa bits. What is the largest integer \(n\) such that every integer from \(0\) to \(n\) has an exact fp16 representation? (Hint: \(2^{11}\).)
fp32 value \(x = 0.1\) cast to fp16. What's the rounding error? Compare to \(x = 1000.0\).
A matmul \(C = AB\) with \(A, B\) fp16 and \(C\) accumulated in fp16. Inner dim \(N = 1024\). Estimate the relative error in \(C\). Why is this unusable in practice?
The verb sentence "he is going to be working" is 8 tokens. After the fp16 weight cast, predicting "working" as the final token requires the logit difference between "working" and "work" to exceed the noise floor of fp16 drift. If the original (fp32) logit margin is \(0.5\), will fp16 preserve the argmax?

One-paragraph recap

fp16 gives up three exponent bits (a range of roughly \(10^{\pm 5}\) instead of \(10^{\pm 38}\)) for three extra mantissa bits vs bf16. The per-cast rounding error is \(\le 2^{-10}\) relative; per matmul of inner dim \(N\), the worst-case error is \(\sim N \cdot 2^{-10}\) relative. The accumulator rule (keep the running sum in fp32 even with fp16 operands) is the single most important convention; it's what makes Tensor Cores work. Mixed precision's main benefit on GPU is 2-8× speedup; on CPU, it's memory-only and not worth the engineering. Phase 18's preview measures per-layer fp16 drift; Phase 26 exploits the savings on GPU.

What this section does NOT cover

Loss scaling for fp16 backward. Phase 26.
Stochastic rounding. Outside scope.
fp16 in optimizer state (master weights). Phase 26.
Mixed-precision attention. Phase 27.

Next: theory/04-checkpoints-and-mlflow.md.