01 — Number Formats: FP32, FP16, BF16, TF32, FP8¶
Before quantizing to integers, you need to understand what each floating-point format does: how many bits go to the exponent (dynamic range) and how many to the mantissa (precision). BF16 sacrifices precision for range; FP16 does the opposite; FP8 pushes both to the limit.
The IEEE 754 anatomy¶
Any finite, normal floating-point value is encoded as three fields: a sign bit, E exponent bits, and M mantissa bits.
Decoded value: (-1)^sign × 1.mantissa × 2^(exponent - bias), where bias = 2^(E-1) - 1.
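The decoding formula can be checked directly against a real codec. The sketch below is illustrative only: `decode_normal` is a hypothetical helper name, it handles normal numbers only (no zero, subnormals, inf, or NaN), and it uses Python's built-in half-precision `struct` format as the reference.

```python
import struct

def decode_normal(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a normal value from its raw bit pattern using the formula above."""
    bias = (1 << (exp_bits - 1)) - 1
    sign = (bits >> (exp_bits + man_bits)) & 1
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    if exponent == 0 or exponent == (1 << exp_bits) - 1:
        # All-zeros and all-ones exponent fields are reserved (zero/subnormal, inf/NaN).
        raise ValueError("not a normal number: the formula does not apply")
    significand = 1 + mantissa / (1 << man_bits)  # implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - bias)

# FP16 has E=5, M=10. Cross-check a few patterns against the struct codec.
for pattern in (0x3C00, 0xC000, 0x7BFF, 0x3555):
    reference = struct.unpack("<e", struct.pack("<H", pattern))[0]
    print(hex(pattern), decode_normal(pattern, 5, 10), reference)
```

The same function decodes BF16 patterns by passing `exp_bits=8, man_bits=7`, since only the field widths change.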
Three knobs:
- Total bits = 1 + E + M. Determines memory footprint per value.
- Exponent bits E: determines dynamic range. Max representable ≈ 2^(2^(E-1)). Min normal ≈ 2^(-(2^(E-1)-2)).
- Mantissa bits M: determines precision. Machine epsilon ≈ 2^(-M).
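As a quick sanity check on the three knobs, the limits can be computed from E and M alone. This is a minimal sketch assuming IEEE-style conventions (the all-ones exponent field reserved for inf/NaN); `format_limits` and `FORMATS` are names made up for the illustration.

```python
def format_limits(exp_bits: int, man_bits: int) -> dict:
    """Max finite value, min normal value, and machine epsilon for an IEEE-style format."""
    bias = (1 << (exp_bits - 1)) - 1
    max_exp = bias  # largest unbiased exponent of a normal number
    return {
        "max": (2 - 2.0 ** -man_bits) * 2.0 ** max_exp,
        "min_normal": 2.0 ** (1 - bias),
        "eps": 2.0 ** -man_bits,
    }

FORMATS = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}
for name, (e, m) in FORMATS.items():
    lim = format_limits(e, m)
    print(f"{name}: max={lim['max']:.3e} "
          f"min_normal={lim['min_normal']:.3e} eps={lim['eps']:.3e}")
```

The exact maximum, (2 - 2^(-M)) × 2^(2^(E-1)-1), sits just under the 2^(2^(E-1)) approximation. FP8 E4M3 is left out on purpose: it reclaims most of the top exponent code for finite values, so its maximum of 448 does not follow the IEEE-style formula.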
The four formats relevant to this curriculum:
| Format | Total | E | M | Range (approx) | Eps (approx) | When to use |
|---|---|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | ±3.4e38 | 1.2e-7 | Reference precision. Training default through Phase 23. |
| FP16 | 16 | 5 | 10 | ±6.5e4 | 9.8e-4 | Inference + GPU mixed-precision training. Tight range — overflows during attention pre-softmax. |
| BF16 | 16 | 8 | 7 | ±3.4e38 | 7.8e-3 | Training (TPUs, A100+). Same range as FP32; trades mantissa for safety. |
| FP8 (E4M3) | 8 | 4 | 3 | ±448 | 0.125 | Hopper+ training; not usable on Borja's CPU. Survey-only. |
Why BF16 won training¶
The historical move from FP16 to BF16 in large-model training is itself a roofline-adjacent story, but the dominant reason is gradient dynamic range.
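FP16's narrow exponent range cuts both ways: a gradient that overflows becomes +inf and poisons the optimizer, and a gradient that is too small underflows to zero. The standard FP16 fix is loss scaling (multiply the loss by 2^k so that small gradients stay representable, do the backward in FP16, unscale before the optimizer step). It works, but the scale becomes one more thing to tune or adjust dynamically: too small and gradients underflow, too large and they overflow.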
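Loss scaling can be illustrated on single scalars. The sketch below emulates FP16 rounding with the standard library's half-precision `struct` format; `to_fp16`, the two gradient values, and the choice k = 16 are all assumptions made for the example, not values from a real training run.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; return +/-inf if it is out of range."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")

small_grad, large_grad = 1e-8, 10.0
scale = 2.0 ** 16  # the 2^k loss scale; scaling the loss scales every gradient by the same factor

unscaled = to_fp16(small_grad)                    # too small for FP16: flushed to zero
recovered = to_fp16(small_grad * scale) / scale   # scaled value survives; unscale in higher precision
overflowed = to_fp16(large_grad * scale)          # the same scale pushes a large gradient out of range

print(unscaled, recovered, overflowed)
```

The third value shows why the scale cannot simply be set as high as possible, which is the motivation for adjusting it dynamically.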
BF16's exponent range matches FP32's. Gradients don't overflow. The precision loss (7 mantissa bits) is acceptable because gradient updates are noisy anyway — Adam's running averages absorb the round-off. No loss scaler needed.
For inference, the trade-off flips. FP16 is acceptable because gradients aren't computed; activations are well-behaved (after training). BF16 is also acceptable but rarely meaningfully better for inference latency.
Where FP16 hurts: pre-softmax in attention¶
The attention computation softmax(Q Kᵀ / √d) V has one pathological point: the entries of Q Kᵀ grow with d (the head dimension) before the 1/√d correction. For d=64 and query/key vectors with unit-variance components (norm ≈ √d = 8), a raw dot product is bounded by roughly d = 64 in magnitude, so the scaled logit lies in about [-8, 8]. For very long sequences with high-magnitude embeddings, intermediate values can hit 60+. exp(60) ≈ 1e26, far above FP16's max of 6.5e4; in fact exp(x) already overflows FP16 for x > ln(65504) ≈ 11.1. Result: inf, then nan, then training instability.
This is why FlashAttention's online softmax (Phase 27) keeps a running max and subtracts it: exp(x_i − max) ∈ (0, 1] is FP16-safe.
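The effect of subtracting the max can be seen on three logits. This is a sketch of the plain max-subtraction trick only, not of FlashAttention's tiled online algorithm; `to_fp16` is a hypothetical helper that emulates FP16 rounding through the standard library, and the logits are made up.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; return +/-inf if it is out of range."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")

def softmax_fp16(logits, subtract_max: bool):
    """Softmax with every intermediate rounded to FP16."""
    shift = max(logits) if subtract_max else 0.0
    exps = [to_fp16(math.exp(x - shift)) for x in logits]
    total = to_fp16(sum(exps))
    return [to_fp16(e / total) for e in exps]

logits = [60.0, 59.0, 10.0]
naive = softmax_fp16(logits, subtract_max=False)   # exp(60) overflows, inf / inf gives nan
stable = softmax_fp16(logits, subtract_max=True)   # every exp argument is <= 0

print("naive :", naive)
print("stable:", stable)
```

Both versions compute the same mathematical function, because the shift cancels between numerator and denominator; only the intermediate magnitudes differ.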
For Phase 26 PTQ, we don't touch FP16 attention dynamics — the model has already been trained — but we do observe that quantizing attention to INT8 needs the same precaution: outliers in Q Kᵀ blow up the per-tensor INT8 scale. This motivates per-channel quantization (theory file 02).
INT8 / INT4 in this hierarchy¶
INT8 and INT4 are not IEEE 754 formats. They're plain two's-complement integers with an external scale s (and optionally a zero-point z) shared across some unit of values (per-tensor, per-channel, per-group):
representable values: {-128, -127, ..., -1, 0, 1, ..., 127} (INT8, symmetric)
decoded value: x̂ = s × q (symmetric, scale stored separately)
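The symmetric map can be sketched in a few lines. The abs-max choice of scale below is an assumption (one common option, not necessarily the one this curriculum settles on), and `quantize_symmetric`, `dequantize`, and the sample weights are illustrative names and values.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers with one shared scale (per-tensor, abs-max)."""
    qmax = 2 ** (num_bits - 1) - 1                    # 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        scale = 1.0                                   # all-zero input: any scale works
    # Round to the nearest integer code, then clamp to the two's-complement range.
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Decoded value: x_hat = s * q."""
    return [scale * qi for qi in q]

weights = [0.02, -0.75, 0.31, 1.27, -0.004]
q, s = quantize_symmetric(weights)
w_hat = dequantize(q, s)

for w, qi, wh in zip(weights, q, w_hat):
    print(f"{w:+.4f} -> {qi:+4d} -> {wh:+.4f}")
```

With an abs-max scale nothing is clipped, so the round-trip error of every value is at most s/2. Note how the smallest weight collapses to code 0: the largest value sets the grid spacing for everyone.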
Compared to FP16's 65,536 bit patterns, INT8's 256 codes form a 256× coarser grid. The bargain is that most weight distributions are far from uniform: they cluster near zero, with a long tail of outliers. The grid only needs to resolve the cluster; the tails get clipped to the extreme bins (and we pay an error penalty there).
INT4's 16 distinct values is 16× coarser still. At per-group granularity (one scale per 64 weights), the effective resolution rises sharply because each group's local distribution can pick its own scale.
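The gain from per-group scales shows up as soon as one outlier is present. The sketch below compares one scale for the whole tensor against one scale per 64 weights at INT4; `quant_error` is a hypothetical helper, the abs-max scale is an assumed choice, and the weights are synthetic values built deterministically for the example.

```python
def quant_error(values, num_bits, group_size=None):
    """Mean absolute round-trip error of symmetric abs-max quantization."""
    qmax = 2 ** (num_bits - 1) - 1
    group_size = group_size or len(values)   # None means a single per-tensor scale
    total = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = (max(abs(v) for v in group) / qmax) or 1.0
        for v in group:
            q = max(-qmax - 1, min(qmax, round(v / scale)))
            total += abs(v - scale * q)
    return total / len(values)

# 128 small weights in [-0.05, 0.05] plus a single outlier in the second group.
weights = [0.05 * ((i * 37) % 21 - 10) / 10 for i in range(128)]
weights[100] = 2.0

per_tensor = quant_error(weights, num_bits=4)
per_group = quant_error(weights, num_bits=4, group_size=64)
print(f"per-tensor: {per_tensor:.5f}  per-group(64): {per_group:.5f}")
```

In this toy case the outlier still ruins the group it lives in, but the other group keeps a scale that matches its own distribution, so the average error drops.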
TF32 — Ampere's compromise¶
TF32 (TensorFloat-32) is NVIDIA Ampere's input format for tensor cores: 1 sign + 8 exponent + 10 mantissa = 19 useful bits (stored in 32-bit words for memory pipeline compatibility). Same exponent as FP32, same mantissa as FP16. Tensor cores accumulate to FP32.
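Because TF32 and BF16 share FP32's exponent field, their precision can be emulated on any CPU by clearing low mantissa bits of an FP32 value. The sketch below truncates rather than rounds, which is a simplification of what hardware does; `keep_mantissa_bits` is a name invented for the illustration.

```python
import struct

def keep_mantissa_bits(x: float, man_bits: int) -> float:
    """Round x to FP32, then zero all but the top `man_bits` of its 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - man_bits
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

coarse = 1.0 + 2.0 ** -10   # one TF32 step above 1.0
fine = 1.0 + 2.0 ** -12     # representable in FP32, below TF32's resolution

print("TF32 keeps 1+2^-10:", keep_mantissa_bits(coarse, 10) == coarse)
print("TF32 keeps 1+2^-12:", keep_mantissa_bits(fine, 10) == fine)
print("BF16 keeps 1+2^-10:", keep_mantissa_bits(coarse, 7) == coarse)
print("TF32 at 1e30      :", keep_mantissa_bits(1e30, 10))
```

The last line shows the other half of the design: a value far outside FP16's range stays finite, since the exponent bits are untouched.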
On Borja's i5-8250U, TF32 doesn't exist. Survey-only: mention it because every "compute capability 8.0+" paper talks about it.
How this connects to Phase 26¶
The phase implements one number-format path: FP32 → INT8 → INT4. We don't implement FP16↔BF16 conversion (PyTorch has it). We don't implement FP8 (no hardware). The conceptual takeaway from this file:
- Bits per value is a free parameter. Hardware constrains which values are fast, not which are possible.
- Range vs precision is the trade-off knob within a fixed bit count. BF16 keeps range, FP16 keeps precision. INT8/INT4 do something different — they shift the "precision budget" to a scalar (the scale).
- Activation distributions matter more than weight distributions. Outliers are why naive PTQ degrades; they're why every modern scheme (SmoothQuant, AWQ, LLM.int8()) treats activations specially.
A small empirical exercise (not graded)¶
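Run in PyTorch: draw x = torch.randn(1000), cast it to torch.float16 and to torch.bfloat16, and compare each result against the FP32 original.
Then repeat with x = torch.randn(1000) * 1e30. Observe what BF16 does (still OK, because its exponent range covers it) vs FP16 (overflows to inf). This is the gradient-overflow argument in one line.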
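The large-magnitude half of the exercise can also be approximated without PyTorch. The sketch below uses only the standard library: FP16 through the half-precision `struct` format, and BF16 by truncating FP32 to its top 16 bits (real conversions round, so this is a simplification). `to_fp16` and `to_bf16` are hypothetical helper names.

```python
import random
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; return +/-inf if it is out of range."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Keep the top 16 bits of the FP32 encoding: sign, 8 exponent, 7 mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

random.seed(0)
x = [random.gauss(0.0, 1.0) * 1e30 for _ in range(1000)]

fp16_inf = sum(1 for v in x if abs(to_fp16(v)) == float("inf"))
bf16_worst = max(abs(to_bf16(v) - v) / abs(v) for v in x)

print(f"FP16 values that became inf: {fp16_inf} / {len(x)}")
print(f"BF16 worst relative error : {bf16_worst:.4f}")
```

BF16's relative error stays below its machine epsilon of 2^(-7) regardless of magnitude, while FP16 has no finite encoding for anything above 65504.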
One-paragraph recap¶
Floating-point formats split bits between exponent (range) and mantissa (precision). FP32 has plenty of both. FP16 squeezes range; BF16 squeezes precision. Integer quantization (INT8, INT4) abandons the floating-point grid entirely and stores a separate scalar to recover the range — at the cost of a much coarser per-value resolution, partly mitigated by choosing the scale at fine granularity (per-channel, per-group). The next theory file derives the symmetric/asymmetric quantization map and bounds its error.
Next: theory/02-scales-and-zeros.md.