04 — The precision zoo: BF16, TF32, FP8, INT8/INT4
Beyond fp32/fp64 there is a zoo of formats designed for AI: bf16 (training), fp16 (legacy inference), fp8 (frontier training), int8/int4 (quantization). This page presents the landscape; Phase 26 goes deeper.
This page is a survey. Phase 26 (quantization) will revisit each format in detail, with real calibration and error-bound experiments. For now you need the bit layouts, the design intent, and a handful of one-line mnemonics so that papers and library docs are readable.
The trade space
Three knobs are in play for any numeric format used in AI:
- Width in bits — directly controls memory footprint, memory bandwidth, and (if the hardware supports it) arithmetic throughput.
- Exponent bits — control the range of representable values (smallest non-zero positive to largest finite).
- Mantissa bits — control the precision (relative resolution) within a given exponent.
Wider exponent → can represent very large and very small values without overflow/underflow. Wider mantissa → can distinguish nearby values more finely. These compete for bits; reducing total width forces a choice.
| Format | Bits | Exp | Mant | Range (max finite) | Decimal digits | Designed for |
|---|---|---|---|---|---|---|
| fp64 | 64 | 11 | 52 | ~1.8e308 | ~15.95 | Oracle / scientific |
| fp32 | 32 | 8 | 23 | ~3.4e38 | ~7.22 | General-purpose ML |
| TF32 | 19 | 8 | 10 | ~3.4e38 | ~3.3 | NVIDIA tensor cores (matmul inputs, fp32 accumulate) |
| bf16 | 16 | 8 | 7 | ~3.4e38 | ~2.4 | Modern training default |
| fp16 | 16 | 5 | 10 | ~6.5e4 | ~3.3 | Legacy inference, gradient scaling needed |
| FP8 E4M3 | 8 | 4 | 3 | ~448 | ~1.2 | Forward activations (Hopper, Blackwell) |
| FP8 E5M2 | 8 | 5 | 2 | ~5.7e4 | ~0.9 | Gradients (wider range) |
| INT8 | 8 | — | — | [-128, 127] | — | Quantized inference |
| INT4 | 4 | — | — | [-8, 7] | — | Weights-only quantization |
(Integer formats don't have exponent / mantissa; they're plain two's-complement integers with no implicit scaling, which is why the ranges are asymmetric. The "scale" lives outside, in a separate fp value per tensor or per channel; this is covered in Phase 26.)
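The Range and Decimal digits columns follow from the Exp and Mant columns. The sketch below is a minimal illustration, assuming IEEE-754-style conventions (top exponent code reserved for Inf/NaN, one implicit leading mantissa bit); the helper name `ieee_like_stats` is hypothetical. FP8 E4M3 is left out on purpose because it does not follow those conventions.

```python
import math

def ieee_like_stats(exp_bits: int, mant_bits: int) -> dict:
    """Max finite value, machine epsilon, and decimal digits for an
    IEEE-754-style binary format (top exponent code reserved for Inf/NaN)."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias           # largest non-reserved exponent
    max_finite = (2 - 2.0 ** -mant_bits) * 2.0 ** max_exp
    epsilon = 2.0 ** -mant_bits                    # spacing just above 1.0
    digits = (mant_bits + 1) * math.log10(2)       # +1 for the implicit leading bit
    return {"max": max_finite, "eps": epsilon, "digits": digits}

# (exponent bits, mantissa bits) taken from the table above
formats = {
    "fp32": (8, 23),
    "TF32": (8, 10),
    "bf16": (8, 7),
    "fp16": (5, 10),
    "FP8 E5M2": (5, 2),
}
for name, (e, m) in formats.items():
    s = ieee_like_stats(e, m)
    print(f"{name:9s} max={s['max']:.3g}  eps={s['eps']:.3g}  digits={s['digits']:.2f}")
```

Note that the digits figure counts the implicit leading bit, so a format with `m` stored mantissa bits carries `(m + 1) · log10(2)` decimal digits.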
fp16 vs bf16 — the trade that drove modern training
Both are 16-bit. fp16 has 5-bit exponent + 10-bit mantissa; bf16 has 8-bit exponent + 7-bit mantissa.
fp16's narrow exponent overflows on softmax. Recall from theory 02: exp(x) overflows fp32 around x ≈ 89. For fp16, the largest representable value is ~6.5 × 10⁴, so exp(x) overflows at x ≈ 11. A modestly-sized logit (which is normal in un-normalized model outputs) immediately overflows. The narrow range also bites at the small end: mixed-precision training in fp16 requires gradient scaling (multiply the loss by 2^k, run backward, then unscale the gradients before clipping and the optimizer step) so that small gradients do not underflow to zero.
bf16 inherits fp32's exponent. exp(x) overflow happens at the same x ≈ 89 as fp32. No gradient scaling needed in the common case. The tradeoff is mantissa precision: bf16 has only 7 mantissa bits (ε ≈ 7.8 × 10⁻³ ≈ 0.8%), about 8× worse than fp16. But gradients tolerate noise; weights and activations rarely need more than 1% relative precision. The win is huge.
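The overflow thresholds quoted above are just the natural log of each format's largest finite value. A minimal sketch, assuming the max-finite values implied by the bit layouts (the dictionary and function names are hypothetical, and rounding at the very edge of the range is ignored):

```python
import math

# Largest finite values from the bit layouts: (2 - 2^-m) * 2^emax
MAX_FINITE = {
    "fp32": (2 - 2.0 ** -23) * 2.0 ** 127,
    "bf16": (2 - 2.0 ** -7) * 2.0 ** 127,
    "fp16": (2 - 2.0 ** -10) * 2.0 ** 15,   # 65504
}

def exp_overflow_threshold(fmt: str) -> float:
    """Approximate largest x for which exp(x) is still finite in the format."""
    return math.log(MAX_FINITE[fmt])

for fmt in MAX_FINITE:
    print(f"{fmt}: exp(x) overflows for x above ~{exp_overflow_threshold(fmt):.2f}")
```

fp32 and bf16 land at essentially the same threshold because they share the 8-bit exponent; the mantissa width only nudges the value in the third decimal place.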
Why this matters for Borja. Phase 26 trains MiniGPT in bf16 (when GPU available). Phase 27's FlashAttention uses bf16 / fp16 mixed precision. Reading the literature requires knowing this trade by heart. Mnemonic: fp16 = precision-first; bf16 = range-first; modern training picked range.
TF32 — NVIDIA's compromise
A 19-bit format with fp32 exponent (8 bits) but only 10 mantissa bits. Lives only on NVIDIA Ampere and later. Used internally by tensor cores when both operands are fp32 — the multiplier truncates to TF32 precision for speed, then accumulates in fp32. The user sees fp32 in and fp32 out, but the multiply is TF32-accurate.
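You will not write TF32 directly. You will see it in PyTorch's `torch.backends.cuda.matmul.allow_tf32 = True/False` toggle. The default has changed across releases: it was True on Ampere+ in early versions, and since PyTorch 1.12 it is False for matmul (the separate `torch.backends.cudnn.allow_tf32` flag for convolutions defaults to True). Check the value on your install; Phase 25 will measure the effect.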
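To get a feel for what "TF32-accurate" means without an Ampere GPU, the multiply can be imitated on the CPU by dropping each fp32 operand to 10 mantissa bits before multiplying. This is a rough sketch only: the function name `round_to_tf32` is hypothetical, the round-to-nearest-even mode is an assumption (real hardware may round differently), and NaN/Inf inputs are not handled.

```python
import struct

def round_to_tf32(x: float) -> float:
    """Round x to fp32, then keep only the top 10 mantissa bits
    (round-to-nearest-even on the 13 dropped bits). Assumed rounding mode."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 13                                    # 23 fp32 mantissa bits - 10 kept
    bias = (1 << (drop - 1)) - 1 + ((bits >> drop) & 1)
    bits = ((bits + bias) >> drop) << drop       # round, then clear dropped bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

a, b = 1.2345678, 7.6543210
exact = a * b                                    # Python float (fp64) reference
tf32_product = round_to_tf32(a) * round_to_tf32(b)
rel_err = abs(tf32_product - exact) / exact
print(f"exact={exact:.7f}  tf32-like={tf32_product:.7f}  rel_err={rel_err:.2e}")
```

The relative error stays around 10⁻³ or below, which matches the ~3.3 decimal digits that 10 stored mantissa bits provide.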
FP8 — the frontier
Hopper (H100) and Blackwell GPUs ship two FP8 formats:
- E4M3 (1 sign + 4 exponent + 3 mantissa): bias 7, range up to 448. Used for activations and weights. Reserved bit patterns: `S.1111.111` = NaN, `S.1111.110` = 448 (max finite), no `Inf`. This deviates from IEEE-754; the design choice favors using all bit patterns for finite values.
- E5M2 (1 sign + 5 exponent + 2 mantissa): bias 15, range up to ~57,344. Used for gradients (wider range needed because of the loss-scale chain). Has `Inf` and `NaN` like normal IEEE-754.
Phase 26 covers the calibration and mixed-precision recipes for fp8 training. Phase 2 only needs you to know the bit layouts and the design intent.
Why two formats? Forward activations have bounded dynamic range (after layer norm); E4M3 is enough. Gradients have huge dynamic range (chain-of-derivatives can multiply small numbers into very small ones); E5M2 saves them from underflow.
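The two layouts can be checked by decoding bytes by hand. The decoder below is a minimal sketch built only from the biases and reserved patterns described above; the function names are hypothetical and it is not taken from any library.

```python
import math

def decode_fp8(byte: int, exp_bits: int, mant_bits: int, ieee_specials: bool) -> float:
    """Decode one FP8 byte. ieee_specials=True gives E5M2-style Inf/NaN;
    False gives the E4M3 convention (only S.1111.111 is NaN, no Inf)."""
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> mant_bits) & ((1 << exp_bits) - 1)
    mant = byte & ((1 << mant_bits) - 1)
    if exp == (1 << exp_bits) - 1:               # all-ones exponent field
        if ieee_specials:
            return sign * math.inf if mant == 0 else math.nan
        if mant == (1 << mant_bits) - 1:
            return math.nan
    if exp == 0:                                 # subnormal: no implicit leading 1
        return sign * (mant / 2 ** mant_bits) * 2.0 ** (1 - bias)
    return sign * (1 + mant / 2 ** mant_bits) * 2.0 ** (exp - bias)

def decode_e4m3(byte: int) -> float:
    return decode_fp8(byte, 4, 3, ieee_specials=False)

def decode_e5m2(byte: int) -> float:
    return decode_fp8(byte, 5, 2, ieee_specials=True)

print(decode_e4m3(0b0_1111_110))   # largest finite E4M3 value
print(decode_e4m3(0b0_1111_111))   # the single NaN pattern (per sign)
print(decode_e5m2(0b0_11110_11))   # largest finite E5M2 value
print(decode_e5m2(0b0_11111_00))   # +Inf
```

Looping `decode_e4m3` over all 256 bytes is a quick way to see how few distinct values an 8-bit float actually has.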
Integer formats — INT8 and INT4
Integer formats are not IEEE-754. They store a signed integer in a fixed range:
- INT8: `[-128, 127]`. Each tensor element is stored as one byte.
- INT4: `[-8, 7]`. Each tensor element is half a byte. Two values per byte.
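"Two values per byte" means INT4 tensors have to be packed and unpacked explicitly. A small sketch of one possible packing follows; the nibble order (first value in the low nibble) and the function names are assumptions for illustration, and real libraries pick their own layouts.

```python
def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed 4-bit values (each in [-8, 7]) into one byte.
    Assumed layout: `lo` in bits 0-3, `hi` in bits 4-7."""
    if not (-8 <= lo <= 7 and -8 <= hi <= 7):
        raise ValueError("INT4 values must lie in [-8, 7]")
    return ((hi & 0xF) << 4) | (lo & 0xF)

def unpack_int4_pair(byte: int) -> tuple[int, int]:
    """Inverse of pack_int4_pair: recover the two signed 4-bit values."""
    def sign_extend(nibble: int) -> int:
        return nibble - 16 if nibble >= 8 else nibble   # two's complement
    return sign_extend(byte & 0xF), sign_extend(byte >> 4)

packed = pack_int4_pair(-8, 7)
print(f"packed byte: 0x{packed:02X}, unpacked: {unpack_int4_pair(packed)}")
```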
To represent real-valued tensors as integers, you need a scale (and optionally a zero point):

$$x \approx s \cdot (q - z)$$

Where q is the stored integer, s is a per-tensor or per-channel scale (fp32), z is an integer zero point (often 0 for symmetric quantization).
The error of this representation depends on s and the actual data distribution. For a normally-distributed tensor with σ = 1, INT8 symmetric quantization with s = 2σ/127 gives a quantization step of ~0.016, so the worst-case rounding error for unclipped values is s/2 ≈ 0.008: under 1% of a typical value of magnitude σ, and comparable to bf16's mantissa precision. Values beyond ±2σ are clipped and incur a larger error.
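The round trip can be sketched in a few lines. This assumes the common affine convention, quantize as `q = clip(round(x / s) + z, -128, 127)` and dequantize as `x̂ = s · (q − z)`; the function names and sample values are made up for illustration, and there is no calibration here.

```python
def quantize_int8(values, scale, zero_point=0):
    """q = clip(round(x / s) + z, -128, 127). Python's round() rounds ties to even."""
    return [max(-128, min(127, round(x / scale) + zero_point)) for x in values]

def dequantize_int8(q_values, scale, zero_point=0):
    """x_hat = s * (q - z)"""
    return [scale * (q - zero_point) for q in q_values]

sigma = 1.0
scale = 2 * sigma / 127                      # the s = 2σ/127 choice from the text
xs = [-1.7, -0.3, 0.004, 0.9, 1.95, 3.2]     # 3.2 lies outside ±2σ and gets clipped

qs = quantize_int8(xs, scale)
x_hats = dequantize_int8(qs, scale)
for x, q, x_hat in zip(xs, qs, x_hats):
    print(f"x={x:+.3f}  q={q:+4d}  x_hat={x_hat:+.4f}  abs_err={abs(x - x_hat):.4f}")
```

In-range values come back within `s/2` of the original; the clipped value does not, which is the trade-off a tight scale makes.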
The whole point of integer quantization is memory. INT8 weights are 4× smaller than fp32 weights, INT4 are 8× smaller. Bandwidth-bound inference (which is most LLM inference at batch size 1, per Phase 1 roofline analysis) gets a near-proportional speedup from the smaller weights, even though INT8 multiplies aren't faster than fp32 on most CPUs. You shop in bytes, not in FLOPs.
What's lost
- NaN, Inf, sign-zero. Integer formats can't represent these.
- Logarithmic spacing of values. fp's value spacing is `~ε × |x|`; integer spacing is uniform `s`. This means integer formats are worse at representing small values (around zero, integer spacing is `s`, but fp spacing there is `s × ε << s`). Phase 26 covers strategies (per-channel quantization, GPTQ, AWQ) that mitigate this.
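The uniform-versus-logarithmic spacing point is easy to see numerically. The sketch below compares the relative step size of INT8 (with an illustrative scale of `1/127`, an assumption) against bf16 at a few magnitudes; `bf16_spacing` is a hypothetical helper valid for normal (non-subnormal) values only.

```python
import math

def bf16_spacing(x: float) -> float:
    """Gap between adjacent bf16 values near x (normal range):
    2^(floor(log2|x|) - 7), since bf16 stores 7 mantissa bits."""
    return 2.0 ** (math.floor(math.log2(abs(x))) - 7)

scale = 1 / 127                               # illustrative INT8 scale covering [-1, 1]
for x in (1.0, 0.1, 0.01, 0.001):
    int8_rel = scale / x                      # uniform step, relative to x
    bf16_rel = bf16_spacing(x) / x            # step shrinks along with x
    print(f"|x|={x:<6}  INT8 step/|x|={int8_rel:8.3f}  bf16 step/|x|={bf16_rel:.4f}")
```

The bf16 column stays between roughly 0.4% and 0.8% at every magnitude, while the INT8 column grows without bound as the value shrinks.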
Quantization preview lab
Lab 03-quantization-preview.md has you round an fp32 logit vector (the tense classification logits for a §A13 verb-form prediction) to INT8 and back, and plot the error distribution. This is not real quantization — there's no calibration, no per-channel scale, no SmoothQuant. It's a preview that motivates Phase 26.
When each format is used in practice
| Use case | Typical format | Why |
|---|---|---|
| Reference / gradient check | fp64 | Maximum precision oracle |
| Default training | fp32 (CPU) / bf16 (GPU) | Standard precision |
| Training accumulators | fp32 / fp64 | Avoid Kahan overhead |
| Mixed-precision training (legacy) | fp16 + fp32 master | Needs gradient scaling |
| Frontier training (H100+) | fp8 + bf16 master | Memory + bandwidth wins |
| LLM inference (consumer GPU) | fp16, bf16, or 4-bit | Memory-bound; smaller wins |
| LLM inference (CPU, llama.cpp) | Q4_K_M, Q5_K_M, Q8_0 | GGUF custom formats; Phase 26 |
| MiniGPT in this curriculum | fp32 (CPU, NumPy reference) | Simplicity; performance is not the point |
You'll touch every row above by Phase 26. Phase 2 just gives you the vocabulary.
How to emulate formats you don't have hardware for
Borja's i5-8250U has fp32/fp64 native (via AVX2) and no fp16/bf16/fp8 hardware. To explore those formats:
- bf16: bit-cast fp32 to uint32, zero the low 16 bits, bit-cast back. Done. (This truncates toward zero rather than rounding to nearest, which is good enough for exploration.)
- fp16: `numpy.float16` (NumPy emulates). Slow but correct.
- fp8 E4M3 / E5M2: roll your own bit manipulation, or install the `ml_dtypes` library (out of scope for Phase 2; emulate by hand in the lab).
- INT8/INT4: scale, round, and clip; see lab `03`.
Lab 00-bit-anatomy.md walks through bf16 and fp16 emulation. Lab 03 does INT8.
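As a preview of the lab, the bf16 and fp16 recipes above fit in a few lines of standard-library Python. This is a sketch, not the lab's reference code: the function names are hypothetical, and the `struct` `"e"` (IEEE half precision) format is used here as a stdlib stand-in for `numpy.float16`.

```python
import struct

def to_bf16_trunc(x: float) -> float:
    """Emulate bf16 by truncation: fp32 bits -> zero the low 16 bits -> fp32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFF0000                       # keep sign, 8 exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision using struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

for x in (3.14159, 0.1, 1000.5):
    b, h = to_bf16_trunc(x), to_fp16(x)
    print(f"x={x:<9}  bf16={b!r:<10}  rel_err={abs(b - x) / x:.1e}   "
          f"fp16={h!r:<18}  rel_err={abs(h - x) / x:.1e}")
```

For values inside fp16's range, the fp16 column is consistently closer to the input than the bf16 column, which is the precision-versus-range trade in miniature.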
What's worth memorizing
Three things, before moving to Phase 3:
- Bit widths and exponent widths of fp32, bf16, fp16. So you can predict overflow thresholds (`exp(x)` overflows at `x ≈ 89` for fp32 and bf16; `x ≈ 11` for fp16).
- Mantissa widths of fp32 (23) and bf16 (7). So you can predict relative precision (`ε_fp32 ≈ 1.2e-7`, `ε_bf16 ≈ 7.8e-3`).
- INT8 with scale `s` has uniform steps. So you can quantify rounding error.
Everything else can be looked up. These three you should know without looking.
Drill problems
Solutions in solutions/04-precision-zoo-ref.md (phase-open).
- The §A13 model produces a length-600 fp32 logit vector. The maximum logit is 12.3. Can the naive `exp(logits) / sum(exp(logits))` proceed in (a) fp32 (b) bf16 (c) fp16 without overflow? Why?
- Convert `0.5` to fp16. To bf16. To fp8 E4M3. What bit patterns? What is the round-trip error in each?
- INT8-quantize a tensor whose values are in `[-2.5, 2.5]`. Choose `s` symmetrically. What is the quantization step? What is the worst-case absolute error?
- The `1/600` probability of theory 01. Represent it in bf16. What is the relative error? Is bf16 fine for this distribution?
- Show that summing 600 bf16 numbers each of magnitude `1/600` accumulates `O(N · ε_bf16) ≈ 4.7` of error, i.e., the sum is meaningless. What does this say about using bf16 for accumulators?
One-paragraph recap
ML uses a zoo of numeric formats — fp64 (oracle), fp32 (default), bf16 (modern training, fp32 exponent + 7-bit mantissa), fp16 (legacy, narrow exponent), fp8 E4M3/E5M2 (frontier training), INT8/INT4 (quantized inference). Each trades range, precision, and width. The recurring pattern: 16-bit training formats (bf16) preserve exponent range at the cost of precision; inference quantization (INT8/INT4) preserves the data's effective range via per-tensor scales at the cost of uniform-step error near zero. Phase 26 returns to each. Phase 2 just teaches you to read the columns.
What this page does NOT cover
- Calibration algorithms (GPTQ, AWQ, SmoothQuant). Phase 26.
- GGUF file format for llama.cpp. Phase 26.
- Loss scaling automation (`torch.cuda.amp.GradScaler`). Phase 25 + 26.
- Mixed-precision policies (autocast). Phase 25.
Phase 2 theory complete. Next stop: lab/00-bit-anatomy.md.