
04 — The precision zoo: BF16, TF32, FP8, INT8/INT4¶

Beyond fp32/fp64 there is a zoo of formats designed for AI: bf16 (training), fp16 (legacy inference), fp8 (frontier training), int8/int4 (quantization). This page presents the landscape; Phase 26 goes deeper.

This page is a survey. Phase 26 (quantization) will revisit each format in detail, with real calibration and error-bound experiments. For now you need the bit layouts, the design intent, and a handful of one-line mnemonics so that papers and library docs are readable.

The trade space¶

Three knobs are in play for any numeric format used in AI:

Width in bits — directly controls memory footprint, memory bandwidth, and (if the hardware supports it) arithmetic throughput.
Exponent bits — control the range of representable values (smallest non-zero positive to largest finite).
Mantissa bits — control the precision (relative resolution) within a given exponent.

Wider exponent → can represent very large and very small values without overflow/underflow. Wider mantissa → can distinguish nearby values more finely. These compete for bits; reducing total width forces a choice.

Format	Bits	Exp	Mant	Range	Decimal digits	Designed for
fp64	64	11	52	`~1.8e308`	~15.95	Oracle / scientific
fp32	32	8	23	`~3.4e38`	~7.22	General-purpose ML
TF32	19	8	10	`~3.4e38`	~3.3	NVIDIA tensor cores (matmul inputs; accumulation stays in fp32)
bf16	16	8	7	`~3.4e38`	~2.4	Modern training default
fp16	16	5	10	`~6.5e4`	~3.3	Legacy inference; training needs gradient scaling
FP8 E4M3	8	4	3	`~448`	~1.2	Forward activations (Hopper, Blackwell)
FP8 E5M2	8	5	2	`~5.7e4`	~0.9	Gradients (wider range)
INT8	8	—	—	`[-128, 127]`	—	Quantized inference
INT4	4	—	—	`[-8, 7]`	—	Weights-only quantization

(Integer formats don't have exponent / mantissa; they're plain two's-complement signed integers with no implicit scaling, which is why the ranges are asymmetric. The "scale" lives outside, in a separate fp value per tensor or per channel; this is covered in Phase 26.)
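The Range and Decimal digits columns follow directly from the exponent and mantissa widths. A minimal sketch, assuming an IEEE-style encoding (top exponent code reserved for Inf/NaN, one implicit leading bit); `format_stats` is an illustrative name, and FP8 E4M3 is left out because it does not reserve the whole top exponent code:

```python
import math

def format_stats(exp_bits, mant_bits):
    """Return (max_finite, decimal_digits) for an IEEE-style binary format."""
    bias = 2 ** (exp_bits - 1) - 1
    # Largest finite value: all-ones mantissa at the highest non-reserved exponent,
    # which equals the bias.
    max_finite = math.ldexp(2.0 - 2.0 ** -mant_bits, bias)
    # Significand precision counts the implicit leading 1.
    decimal_digits = (mant_bits + 1) * math.log10(2)
    return max_finite, decimal_digits

for name, e, m in [("fp32", 8, 23), ("bf16", 8, 7), ("fp16", 5, 10)]:
    max_finite, digits = format_stats(e, m)
    print(f"{name}: max ~ {max_finite:.3g}, ~{digits:.2f} decimal digits")
```

Only the two bit counts go in; everything else in a table row is derived, which is why those two columns are the ones worth remembering.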

fp16 vs bf16 — the trade that drove modern training¶

Both are 16-bit. fp16 has 5-bit exponent + 10-bit mantissa; bf16 has 8-bit exponent + 7-bit mantissa.

fp16's narrow exponent overflows on softmax. Recall from theory 02: exp(x) overflows fp32 around x ≈ 89. For fp16, the largest representable value is ~6.5 × 10⁴, so exp(x) overflows at x ≈ 11. A modestly sized logit (normal in un-normalized model outputs) overflows immediately. The same narrow exponent hurts at the small end: gradients below ~6 × 10⁻⁸ (fp16's smallest subnormal) flush to zero. Mixed-precision training in fp16 therefore requires gradient scaling, also called loss scaling (multiply the loss by 2^k, run backward, unscale the gradients, then clip), to keep gradients in the representable range.
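Both ends of fp16's range can be probed without fp16 hardware. A minimal sketch using the standard library's half-precision `struct` format; the gradient value `1e-8` and the scale `2^16` are hypothetical numbers chosen for illustration, and `struct` raises an exception where real hardware would return `inf`:

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest fp16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def overflows_fp16(x):
    """True if x is too large for fp16 (struct raises; hardware would give inf)."""
    try:
        to_fp16(x)
        return False
    except (OverflowError, struct.error):
        return True

grad = 1e-8          # hypothetical tiny gradient
scale = 2.0 ** 16    # hypothetical scale factor 2^k with k = 16

unscaled = to_fp16(grad)                    # too small for fp16: flushes to zero
recovered = to_fp16(grad * scale) / scale   # scaled value fits; unscale in higher precision

print(overflows_fp16(math.exp(12.0)))  # True
print(unscaled)                        # 0.0
print(recovered)                       # close to 1e-8
```

The scale factor is a power of two so that scaling and unscaling change only the exponent and add no rounding error of their own.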

bf16 inherits fp32's exponent. exp(x) overflow happens at the same x ≈ 89 as fp32. No gradient scaling needed in the common case. The tradeoff is mantissa precision: bf16 has only 7 mantissa bits (ε ≈ 7.8 × 10⁻³ ≈ 0.8%), about 8× worse than fp16. But gradients tolerate noise; weights and activations rarely need more than 1% relative precision. The win is huge.
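The two numbers that drive this trade, the exp(x) overflow point and ε, can be computed from the bit widths alone. A rough sketch, assuming IEEE-style encodings; the helper names are illustrative:

```python
import math

def exp_overflow_threshold(exp_bits, mant_bits):
    """Largest x for which exp(x) is still finite in an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_finite = math.ldexp(2.0 - 2.0 ** -mant_bits, bias)
    return math.log(max_finite)

def epsilon(mant_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mant_bits

for name, e, m in [("fp32", 8, 23), ("bf16", 8, 7), ("fp16", 5, 10)]:
    threshold = exp_overflow_threshold(e, m)
    print(f"{name}: exp(x) overflows above x ~ {threshold:.2f}, eps = {epsilon(m):.2e}")

print(epsilon(7) / epsilon(10))  # 8.0: bf16 steps are 8x coarser than fp16 steps
```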

Why this matters for Borja. Phase 26 trains MiniGPT in bf16 (when GPU available). Phase 27's FlashAttention uses bf16 / fp16 mixed precision. Reading the literature requires knowing this trade by heart. Mnemonic: fp16 = precision-first; bf16 = range-first; modern training picked range.

TF32 — NVIDIA's compromise¶

A 19-bit format with fp32 exponent (8 bits) but only 10 mantissa bits. Lives only on NVIDIA Ampere and later. Used internally by tensor cores when both operands are fp32 — the multiplier truncates to TF32 precision for speed, then accumulates in fp32. The user sees fp32 in and fp32 out, but the multiply is TF32-accurate.

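You will not write TF32 directly. You will see it in PyTorch's torch.backends.cuda.matmul.allow_tf32 = True/False toggle. For matmul the default was True in PyTorch 1.7 through 1.11 and has been False since 1.12; the separate torch.backends.cudnn.allow_tf32 toggle for convolutions defaults to True. Phase 25 will measure the effect on Ampere+.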
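The effect of a TF32 multiply can be approximated on any CPU. A minimal sketch, assuming round-to-nearest-even when the 13 low mantissa bits are dropped (the actual hardware rounding mode is not specified on this page) and using a Python float as the accumulator rather than fp32; `round_to_tf32` and `tf32_dot` are illustrative names, not part of any library, and NaN inputs are not handled:

```python
import struct

def round_to_tf32(x):
    """Round a value to fp32, then keep only 10 mantissa bits (nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add just under half of the dropped range, plus the tie-breaking bit,
    # then clear the 13 low mantissa bits.
    bits = (bits + 0x0FFF + ((bits >> 13) & 1)) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def tf32_dot(a, b):
    """Dot product with TF32-rounded inputs and a wide accumulator."""
    return sum(round_to_tf32(x) * round_to_tf32(y) for x, y in zip(a, b))

print(round_to_tf32(1.0 + 2.0 ** -12))  # 1.0: below half a TF32 step, rounds away
print(round_to_tf32(1.0 + 2.0 ** -10))  # 1.0009765625: exactly one TF32 step above 1
print(tf32_dot([0.1, 0.2], [0.3, 0.4]), 0.1 * 0.3 + 0.2 * 0.4)
```

Only the inputs lose precision; the products are accumulated at full width, which mirrors the "fp32 in, fp32 out" behaviour described above.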

FP8 — the frontier¶

Hopper (H100) and Blackwell GPUs ship two FP8 formats:

E4M3 (1 sign + 4 exponent + 3 mantissa): bias 7, range up to 448. Used for activations and weights. Reserved bit patterns: S.1111.111 = NaN, S.1111.110 = 448 (max finite), no Inf. This deviates from IEEE-754; the design choice favors using all bit patterns for finite values.
E5M2 (1 sign + 5 exponent + 2 mantissa): bias 15, max finite value 57,344. Used for gradients (wider range needed because of the loss-scale chain). Has Inf and NaN like normal IEEE-754.
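The two layouts above can be checked with a small decoder. A minimal sketch under the rules stated in the list (E4M3: only S.1111.111 is NaN and there is no Inf; E5M2: IEEE-style Inf and NaN); `decode_fp8` and its `ieee_specials` flag are illustrative names:

```python
import math

def decode_fp8(byte, exp_bits, mant_bits, ieee_specials):
    """Decode one FP8 byte (0..255) into a Python float."""
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> mant_bits) & ((1 << exp_bits) - 1)
    mant = byte & ((1 << mant_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    all_ones = (1 << exp_bits) - 1
    if ieee_specials and exp == all_ones:
        # E5M2: top exponent code is Inf (mantissa 0) or NaN (mantissa != 0).
        return sign * math.inf if mant == 0 else math.nan
    if not ieee_specials and exp == all_ones and mant == (1 << mant_bits) - 1:
        # E4M3: only the all-ones pattern is NaN; the rest of the code is finite.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * math.ldexp(mant, 1 - bias - mant_bits)
    # Normal: implicit leading 1 folded into the integer significand.
    return sign * math.ldexp((1 << mant_bits) + mant, exp - bias - mant_bits)

print(decode_fp8(0x7E, 4, 3, False))  # 448.0   (E4M3 max finite, S.1111.110)
print(decode_fp8(0x7F, 4, 3, False))  # nan     (E4M3 S.1111.111)
print(decode_fp8(0x7B, 5, 2, True))   # 57344.0 (E5M2 max finite)
print(decode_fp8(0x7C, 5, 2, True))   # inf
```

Looping the decoder over all 256 byte values gives the complete set of representable numbers for each format, small enough to print and inspect by eye.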

Phase 26 covers the calibration and mixed-precision recipes for fp8 training. Phase 2 only needs you to know the bit layouts and the design intent.

Why two formats? Forward activations have bounded dynamic range (after layer norm); E4M3 is enough. Gradients have huge dynamic range (chain-of-derivatives can multiply small numbers into very small ones); E5M2 saves them from underflow.

Integer formats — INT8 and INT4¶

Integer formats are not IEEE-754. They store a signed integer in a fixed range:

INT8: [-128, 127]. Each tensor element is stored as one byte.
INT4: [-8, 7]. Each tensor element is half a byte. Two values per byte.
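"Two values per byte" means packing two 4-bit two's-complement nibbles. A minimal sketch; putting the first value in the low nibble is an assumption made here for illustration, and real file formats choose their own order:

```python
def pack_int4(lo, hi):
    """Pack two signed 4-bit values (each in [-8, 7]) into one byte."""
    for v in (lo, hi):
        if not -8 <= v <= 7:
            raise ValueError(f"{v} does not fit in INT4")
    # Masking with 0xF keeps the two's-complement nibble of a negative value.
    return ((hi & 0xF) << 4) | (lo & 0xF)

def unpack_int4(byte):
    """Inverse of pack_int4: return (lo, hi) as signed integers."""
    def sign_extend(nibble):
        return nibble - 16 if nibble >= 8 else nibble
    return sign_extend(byte & 0xF), sign_extend(byte >> 4)

packed = pack_int4(-8, 7)
print(hex(packed))          # 0x78
print(unpack_int4(packed))  # (-8, 7)
```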

To represent real-valued tensors as integers, you need a scale (and optionally a zero point):

\[ x_{\mathrm{real}} \approx s \cdot (q - z) \]

Where q is the stored integer, s is a per-tensor or per-channel scale (fp32), z is an integer zero point (often 0 for symmetric quantization).

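The error of this representation depends on s and the actual data distribution. For a normally distributed tensor with σ = 1, INT8 symmetric quantization with s = 2σ/127 (so values beyond ±2σ clip) gives a quantization step of ~0.016. The rounding error is at most s/2 ≈ 0.008 and about s/√12 ≈ 0.0045 RMS, so for values of magnitude around σ the relative error is roughly 0.5%. That is comparable to bf16's mantissa precision, but only for values near full scale.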
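The formula and the error estimate can be tried on single values. A minimal sketch of symmetric INT8 quantization with the s = 2σ/127 choice from the text; the function names and sample inputs are illustrative, and Python's `round` breaks ties to even:

```python
def quantize(x, s, z=0, qmin=-128, qmax=127):
    """Map a real value to a stored integer: scale, round, shift, clip."""
    q = round(x / s) + z
    return max(qmin, min(qmax, q))

def dequantize(q, s, z=0):
    """Reconstruct the real value: x_real ~ s * (q - z)."""
    return s * (q - z)

sigma = 1.0
s = 2 * sigma / 127   # step of about 0.016; values beyond +/- 2 sigma clip

for x in (0.3, 1.0, -1.7, 3.0):
    q = quantize(x, s)
    x_hat = dequantize(q, s)
    print(f"x = {x:+.3f}  q = {q:+4d}  x_hat = {x_hat:+.4f}  |error| = {abs(x_hat - x):.4f}")
```

The last input, 3.0, lies outside the ±2σ range and clips to q = 127, so its error is far larger than s/2. Choosing s is a trade between this clipping error and the rounding error everywhere else.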

The whole point of integer quantization is memory. INT8 weights are 4× smaller than fp32 weights, INT4 are 8× smaller. Bandwidth-bound inference (which is most LLM inference at batch size 1, per Phase 1 roofline analysis) gets a near-proportional speedup from the smaller weights, even though INT8 multiplies aren't faster than fp32 on most CPUs. You shop in bytes, not in FLOPs.
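The size ratios are plain arithmetic on bits per weight. A rough sketch; the 7-billion-parameter count is a hypothetical model size, and the storage for per-tensor or per-channel scales is ignored:

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store the weights alone (scales and metadata ignored)."""
    return n_params * bits_per_weight / 8

n_params = 7_000_000_000  # hypothetical model size

for name, bits in [("fp32", 32), ("fp16/bf16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = weight_bytes(n_params, bits) / 1e9
    print(f"{name:>9}: {gb:5.1f} GB")
```

At batch size 1 every weight is read once per generated token, so in the bandwidth-bound regime these byte counts translate almost directly into time per token.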

What's lost¶

NaN, Inf, signed zero. Integer formats can't represent these.
Logarithmic spacing of values. Floating-point spacing is ~ε × |x|, so it shrinks with the value; integer spacing is a uniform s everywhere. Integer formats are therefore worse at representing small values: for |x| ≈ s the integer grid still has step s, giving relative errors of tens of percent or more, while floating point has step ≈ ε × s << s there. Phase 26 covers strategies (per-channel quantization, GPTQ, AWQ) that mitigate this.
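The difference between uniform and logarithmic spacing shows up as soon as the values shrink. A minimal sketch comparing INT8 (with a hypothetical scale s = 2/127) against an emulated bf16 rounding; both helpers are illustrative and the bf16 one assumes round-to-nearest-even:

```python
import struct

def to_bf16(x):
    """Round to bf16 (nearest-even) by manipulating the fp32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def int8_round_trip(x, s):
    """Quantize to INT8 with scale s and reconstruct."""
    q = max(-128, min(127, round(x / s)))
    return s * q

s = 2.0 / 127  # hypothetical per-tensor scale

for x in (1.0, 0.1, 0.01, 0.001):
    rel_int8 = abs(int8_round_trip(x, s) - x) / x
    rel_bf16 = abs(to_bf16(x) - x) / x
    print(f"x = {x:<6} INT8 relative error = {rel_int8:.3f}   bf16 relative error = {rel_bf16:.5f}")
```

The bf16 column stays below 2⁻⁸ ≈ 0.4% at every magnitude, while the INT8 column grows as x approaches s, and x = 0.001 rounds to zero outright.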

Quantization preview lab¶

Lab 03-quantization-preview.md has you round an fp32 logit vector (the tense classification logits for a §A13 verb-form prediction) to INT8 and back, and plot the error distribution. This is not real quantization — there's no calibration, no per-channel scale, no SmoothQuant. It's a preview that motivates Phase 26.

When each format is used in practice¶

Use case	Typical format	Why
Reference / gradient check	fp64	Maximum precision oracle
Default training	fp32 (CPU) / bf16 (GPU)	Standard precision
Training accumulators	fp32 / fp64	Wide accumulator makes Kahan summation unnecessary
Mixed-precision training (legacy)	fp16 + fp32 master	Needs gradient scaling
Frontier training (H100+)	fp8 + bf16 master	Memory + bandwidth wins
LLM inference (consumer GPU)	fp16, bf16, or 4-bit	Memory-bound; smaller wins
LLM inference (CPU, llama.cpp)	Q4_K_M, Q5_K_M, Q8_0	GGUF custom formats; Phase 26
MiniGPT in this curriculum	fp32 (CPU, NumPy reference)	Simplicity; performance is not the point

You'll touch every row above by Phase 26. Phase 2 just gives you the vocabulary.

How to emulate formats you don't have hardware for¶

Borja's i5-8250U has fp32/fp64 native (via AVX2) and no fp16/bf16/fp8 hardware. To explore those formats:

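bf16: bit-cast fp32 to uint32, zero the low 16 bits, bit-cast back. This truncates (rounds toward zero); hardware conversions typically round to nearest even, which takes one extra add before the mask.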
fp16: numpy.float16 (NumPy emulates). Slow but correct.
fp8 E4M3 / E5M2: roll your own bit manipulation, or install the ml_dtypes library (out of scope for Phase 2 — emulate by hand in the lab).
INT8/INT4: scale, round, then clip; see lab 03.

Lab 00-bit-anatomy.md walks through bf16 and fp16 emulation. Lab 03 does INT8.
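The bf16 and fp16 recipes fit in a few lines. A minimal sketch using only the standard library: `struct`'s `"e"` format stands in for a numpy.float16 round trip, and the nearest-even variant of the bf16 conversion is an addition for comparison with the truncating recipe; all function names are illustrative and NaN inputs are not handled:

```python
import struct

def fp32_bits(x):
    """Bit pattern of x after rounding to fp32, as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_fp32(bits):
    """Inverse of fp32_bits."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_bf16_truncate(x):
    """The recipe from the list above: zero the low 16 bits of the fp32 pattern."""
    return bits_to_fp32(fp32_bits(x) & 0xFFFF0000)

def to_bf16_nearest(x):
    """Same idea, but round to nearest even before dropping the low 16 bits."""
    bits = fp32_bits(x)
    return bits_to_fp32((bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000)

def to_fp16(x):
    """Round trip through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

x = 1.0 + 2.0 ** -8 + 2.0 ** -9   # 1.005859375, needs 9 mantissa bits
print(to_bf16_truncate(x))  # 1.0
print(to_bf16_nearest(x))   # 1.0078125
print(to_fp16(x))           # 1.005859375 (fits in fp16's 10 mantissa bits)
```

The sample value sits three quarters of the way between two bf16 neighbours, so truncation and rounding land on different ones, while fp16 represents it exactly.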

What's worth memorizing¶

Three things, before moving to Phase 3:

Bit widths and exponent widths of fp32, bf16, fp16. So you can predict overflow thresholds (exp(x) overflows at x ≈ 89 for fp32 and bf16; x ≈ 11 for fp16).
Mantissa widths of fp32 (23) and bf16 (7). So you can predict relative precision (ε_fp32 ≈ 1.2e-7, ε_bf16 ≈ 7.8e-3).
INT8 with scale s has uniform step s. So you can quantify rounding error.

Everything else can be looked up. These three you should know without looking.

Drill problems¶

Solutions in solutions/04-precision-zoo-ref.md (phase-open).

The §A13 model produces a length-600 fp32 logit vector. The maximum logit is 12.3. Can the naive exp(logits) / sum(exp(logits)) proceed in (a) fp32 (b) bf16 (c) fp16 without overflow? Why?
Convert 0.5 to fp16. To bf16. To fp8 E4M3. What bit patterns? What is the round-trip error in each?
INT8-quantize a tensor whose values are in [-2.5, 2.5]. Choose s symmetrically. What is the quantization step? What is the worst-case absolute error?
Take the 1/600 probability from theory 01 and represent it in bf16. What is the relative error? Is bf16 fine for this distribution?
Show that naively summing 600 bf16 numbers, each of magnitude 1/600, in a bf16 accumulator has a worst-case relative error bound of O(N · ε_bf16) ≈ 4.7, i.e. the bound is larger than the sum itself and guarantees nothing. What does this say about using bf16 for accumulators?

One-paragraph recap¶

ML uses a zoo of numeric formats — fp64 (oracle), fp32 (default), bf16 (modern training, fp32 exponent + 7-bit mantissa), fp16 (legacy, narrow exponent), fp8 E4M3/E5M2 (frontier training), INT8/INT4 (quantized inference). Each trades range, precision, and width. The recurring pattern: 16-bit training formats (bf16) preserve exponent range at the cost of precision; inference quantization (INT8/INT4) preserves the data's effective range via per-tensor scales at the cost of uniform-step error near zero. Phase 26 returns to each. Phase 2 just teaches you to read the columns.

What this page does NOT cover¶

Calibration algorithms (GPTQ, AWQ, SmoothQuant). Phase 26.
GGUF file format for llama.cpp. Phase 26.
Loss scaling automation (torch.cuda.amp.GradScaler). Phase 25 + 26.
Mixed-precision policies (autocast). Phase 25.

Phase 2 theory complete. Next stop: lab/00-bit-anatomy.md.