English · Español

01 — Number Formats: FP32, FP16, BF16, TF32, FP8¶

🇪🇸 Antes de cuantizar a entero, hay que entender qué hace cada formato flotante: cuántos bits para exponente (rango dinámico) y cuántos para mantisa (precisión). BF16 sacrifica precisión por rango; FP16 al revés; FP8 los empuja a ambos al límite.

The IEEE 754 anatomy¶

Any finite floating-point value is encoded as three fields:

±  exponent (E bits)   mantissa (M bits)
sign      ^                   ^
         bias=2^(E-1)-1     implicit leading 1

Decoded value: (-1)^sign × 1.mantissa × 2^(exponent - bias).

Three knobs:

Total bits = 1 + E + M. Determines memory footprint per value.
Exponent bits E — determines dynamic range. Max representable ≈ 2^(2^(E-1)). Min normal ≈ 2^(-(2^(E-1)-2)).
Mantissa bits M — determines precision. Machine epsilon ≈ 2^(-M).

The four formats relevant to this curriculum:

Format	Total	E	M	Range (approx)	Eps (approx)	When to use
FP32	32	8	23	±3.4e38	1.2e-7	Reference precision. Training default through Phase 23.
FP16	16	5	10	±6.5e4	9.8e-4	Inference + GPU mixed-precision training. Tight range — overflows during attention pre-softmax.
BF16	16	8	7	±3.4e38	7.8e-3	Training (TPUs, A100+). Same range as FP32; trades mantissa for safety.
FP8 (E4M3)	8	4	3	±448	0.125	Hopper+ training; not usable on Borja's CPU. Survey-only.

Why BF16 won training¶

The historical move from FP16 to BF16 in large-model training is itself a roofline-adjacent story, but the dominant reason is gradient dynamic range.

A FP16 gradient that overflows becomes +inf and poisons the optimizer; the standard FP16 fix is loss scaling (multiply the loss by 2^k, do the backward in FP16, unscale before the optimizer step). It works but requires the dynamic-loss-scaler hyperparameter.

BF16's exponent range matches FP32's. Gradients don't overflow. The precision loss (7 mantissa bits) is acceptable because gradient updates are noisy anyway — Adam's running averages absorb the round-off. No loss scaler needed.

For inference, the trade-off flips. FP16 is acceptable because gradients aren't computed; activations are well-behaved (after training). BF16 is also acceptable but rarely meaningfully better for inference latency.

Where FP16 hurts: pre-softmax in attention¶

The single attention computation softmax(Q Kᵀ / √d) V has one pathological point: Q Kᵀ produces values whose magnitudes scale with d (the head dimension) before the 1/√d correction. For d=64, a single dot product of two unit vectors lies in [-8, 8]. For very long sequences with high-magnitude embeddings, intermediate values can hit 60+. exp(60) ≈ 1e26 — well above FP16's max of 6.5e4. Result: inf, then nan, then training instability.

This is why FlashAttention's online softmax (Phase 27) keeps a running max and subtracts it: exp(x_i − max) ∈ (0, 1] is FP16-safe.

For Phase 26 PTQ, we don't touch FP16 attention dynamics — the model has already been trained — but we do observe that quantizing attention to INT8 needs the same precaution: outliers in Q Kᵀ blow up the per-tensor INT8 scale. This motivates per-channel quantization (theory file 02).

INT8 / INT4 in this hierarchy¶

INT8 and INT4 are not IEEE 754 formats. They're plain two's-complement integers with an external scale s (and optionally a zero-point z) shared across some unit of values (per-tensor, per-channel, per-group):

representable values: {-128, -127, ..., -1, 0, 1, ..., 127}   (INT8, symmetric)
decoded value:        x̂ = s × q     (symmetric, scale stored separately)

Compared to FP16's 65,536 distinct values across the same range, INT8's 256 distinct values is a 256× coarser quantization grid. The bargain is: most weight distributions are far from uniform. They cluster near zero, with a long tail of outliers. The grid only needs to resolve the cluster; the tails get clipped to the extreme bins (and we pay an error penalty there).

INT4's 16 distinct values is 16× coarser still. At per-group granularity (one scale per 64 weights), the effective resolution rises sharply because each group's local distribution can pick its own scale.

TF32 — Ampere's compromise¶

TF32 (TensorFloat-32) is NVIDIA Ampere's input format for tensor cores: 1 sign + 8 exponent + 10 mantissa = 19 useful bits (stored in 32-bit words for memory pipeline compatibility). Same exponent as FP32, same mantissa as FP16. Tensor cores accumulate to FP32.

On Borja's i5-8250U, TF32 doesn't exist. Survey-only: mention it because every "compute capability 8.0+" paper talks about it.

How this connects to Phase 26¶

The phase implements one number-format change: FP32 → INT8 → INT4. We don't implement FP16↔BF16 conversion (PyTorch has it). We don't implement FP8 (no hardware). The conceptual takeaway from this file:

Bits per value is a free parameter. Hardware constrains which values are fast, not which are possible.
Range vs precision is the trade-off knob within a fixed bit count. BF16 keeps range, FP16 keeps precision. INT8/INT4 do something different — they shift the "precision budget" to a scalar (the scale).
Activation distributions matter more than weight distributions. Outliers are why naive PTQ degrades; they're why every modern scheme (SmoothQuant, AWQ, LLM.int8()) treats activations specially.

A small empirical exercise (not graded)¶

Run in PyTorch:

import torch
x = torch.randn(1000)
print(x.float().std(), x.half().std(), x.bfloat16().std())

Then with x = torch.randn(1000) * 1e30. Observe what BF16 does (still OK — exponent range covers it) vs FP16 (overflows). This is the gradient-overflow argument in one line.

One-paragraph recap¶

Floating-point formats split bits between exponent (range) and mantissa (precision). FP32 has plenty of both. FP16 squeezes range; BF16 squeezes precision. Integer quantization (INT8, INT4) abandons the floating-point grid entirely and stores a separate scalar to recover the range — at the cost of a much coarser per-value resolution, partly mitigated by choosing the scale at fine granularity (per-channel, per-group). The next theory file derives the symmetric/asymmetric quantization map and bounds its error.

Next: theory/02-scales-and-zeros.md.