Skip to content

English · Español

04 — The precision zoo: BF16, TF32, FP8, INT8/INT4

🇪🇸 Más allá de fp32/fp64 hay un zoo de formatos diseñados para AI: bf16 (entreno), fp16 (inferencia legacy), fp8 (entreno frontier), int8/int4 (cuantización). Esta página presenta el panorama; Fase 26 lo profundiza.

This page is a survey. Phase 26 (quantization) will revisit each format in detail, with real calibration and error-bound experiments. For now you need the bit layouts, the design intent, and a handful of one-line mnemonics so that papers and library docs are readable.


The trade space

Three knobs are in play for any numeric format used in AI:

  1. Width in bits — directly controls memory footprint, memory bandwidth, and (if the hardware supports it) arithmetic throughput.
  2. Exponent bits — control the range of representable values (smallest non-zero positive to largest finite).
  3. Mantissa bits — control the precision (relative resolution) within a given exponent.

Wider exponent → can represent very large and very small values without overflow/underflow. Wider mantissa → can distinguish nearby values more finely. These compete for bits; reducing total width forces a choice.

Format Bits Exp Mant Range Decimal digits Designed for
fp64 64 11 52 ~1.7e308 ~15.95 Oracle / scientific
fp32 32 8 23 ~3.4e38 ~7.22 General-purpose ML
TF32 19 8 10 ~3.4e38 ~3.3 NVIDIA tensor cores (matmul accumulate)
bf16 16 8 7 ~3.4e38 ~2.4 Modern training default
fp16 16 5 10 ~6.5e4 ~3.3 Legacy inference, gradient scaling needed
FP8 E4M3 8 4 3 ~448 ~1.0 Forward activations (Hopper, Blackwell)
FP8 E5M2 8 5 2 ~5.7e4 ~0.7 Gradients (wider range)
INT8 8 [-128, 127] Quantized inference
INT4 4 [-8, 7] Weights-only quantization

(Integer formats don't have exponent / mantissa; they're sign-bit + magnitude with no implicit scaling. The "scale" lives outside, in a separate fp value per tensor or per channel — covered in Phase 26.)

fp16 vs bf16 — the trade that drove modern training

Both are 16-bit. fp16 has 5-bit exponent + 10-bit mantissa; bf16 has 8-bit exponent + 7-bit mantissa.

fp16's narrow exponent overflows on softmax. Recall from theory 02: exp(x) overflows fp32 around x ≈ 89. For fp16, the largest representable is ~6.5 × 10⁴, so exp(x) overflows at x ≈ 11. A modestly-sized logit (which is normal in un-normalized model outputs) immediately overflows. Mixed-precision training in fp16 requires gradient scaling (multiply loss by 2^k, clip after backward, descale) to keep gradients in the representable range.

bf16 inherits fp32's exponent. exp(x) overflow happens at the same x ≈ 89 as fp32. No gradient scaling needed in the common case. The tradeoff is mantissa precision: bf16 has only 7 mantissa bits (ε ≈ 7.8 × 10⁻³ ≈ 0.8%), about 8× worse than fp16. But gradients tolerate noise; weights and activations rarely need more than 1% relative precision. The win is huge.

Why this matters for Borja. Phase 26 trains MiniGPT in bf16 (when GPU available). Phase 27's FlashAttention uses bf16 / fp16 mixed precision. Reading the literature requires knowing this trade by heart. Mnemonic: fp16 = precision-first; bf16 = range-first; modern training picked range.

TF32 — NVIDIA's compromise

A 19-bit format with fp32 exponent (8 bits) but only 10 mantissa bits. Lives only on NVIDIA Ampere and later. Used internally by tensor cores when both operands are fp32 — the multiplier truncates to TF32 precision for speed, then accumulates in fp32. The user sees fp32 in and fp32 out, but the multiply is TF32-accurate.

You will not write TF32 directly. You will see it in PyTorch's torch.backends.cuda.matmul.allow_tf32 = True/False toggle. Default is True for matmul on Ampere+, which Phase 25 will measure.

FP8 — the frontier

Hopper (H100) and Blackwell GPUs ship two FP8 formats:

  • E4M3 (1 sign + 4 exponent + 3 mantissa): bias 7, range up to 448. Used for activations and weights. Reserved bit patterns: S.1111.111 = NaN, S.1111.110 = 448 (max finite), no Inf. This deviates from IEEE-754; the design choice favors using all bit patterns for finite values.
  • E5M2 (1 sign + 5 exponent + 2 mantissa): bias 15, range up to ~57,344. Used for gradients (wider range needed because of the loss-scale chain). Has Inf and NaN like normal IEEE-754.

Phase 26 covers the calibration and mixed-precision recipes for fp8 training. Phase 2 only needs you to know the bit layouts and the design intent.

Why two formats? Forward activations have bounded dynamic range (after layer norm); E4M3 is enough. Gradients have huge dynamic range (chain-of-derivatives can multiply small numbers into very small ones); E5M2 saves them from underflow.

Integer formats — INT8 and INT4

Integer formats are not IEEE-754. They store a signed integer in a fixed range:

  • INT8: [-128, 127]. Each tensor element is stored as one byte.
  • INT4: [-8, 7]. Each tensor element is half a byte. Two values per byte.

To represent real-valued tensors as integers, you need a scale (and optionally a zero point):

\[ x_{\mathrm{real}} \approx s \cdot (q - z) \]

Where q is the stored integer, s is a per-tensor or per-channel scale (fp32), z is an integer zero point (often 0 for symmetric quantization).

The error of this representation depends on s and the actual data distribution. For a normally-distributed tensor with σ = 1, INT8 symmetric quantization with s = 2σ/127 gives a quantization step of ~0.016, so relative error is roughly 0.5%. Comparable to bf16's mantissa precision.

The whole point of integer quantization is memory. INT8 weights are 4× smaller than fp32 weights, INT4 are 8× smaller. Bandwidth-bound inference (which is most LLM inference at batch size 1, per Phase 1 roofline analysis) gets a near-proportional speedup from the smaller weights, even though INT8 multiplies aren't faster than fp32 on most CPUs. You shop in bytes, not in FLOPs.

What's lost

  • NaN, Inf, sign-zero. Integer formats can't represent these.
  • Logarithmic spacing of values. fp's value spacing is ~ε × |x|; integer spacing is uniform s. This means integer formats are worse at representing small values (around zero, integer spacing is s, but fp spacing there is s × ε << s). Phase 26 covers strategies (per-channel quantization, GPTQ, AWQ) that mitigate this.

Quantization preview lab

Lab 03-quantization-preview.md has you round an fp32 logit vector (the tense classification logits for a §A13 verb-form prediction) to INT8 and back, and plot the error distribution. This is not real quantization — there's no calibration, no per-channel scale, no SmoothQuant. It's a preview that motivates Phase 26.

When each format is used in practice

Use case Typical format Why
Reference / gradient check fp64 Maximum precision oracle
Default training fp32 (CPU) / bf16 (GPU) Standard precision
Training accumulators fp32 / fp64 Avoid Kahan overhead
Mixed-precision training (legacy) fp16 + fp32 master Needs gradient scaling
Frontier training (H100+) fp8 + bf16 master Memory + bandwidth wins
LLM inference (consumer GPU) fp16, bf16, or 4-bit Memory-bound; smaller wins
LLM inference (CPU, llama.cpp) Q4_K_M, Q5_K_M, Q8_0 GGUF custom formats; Phase 26
MiniGPT in this curriculum fp32 (CPU, NumPy reference) Simplicity; performance is not the point

You'll touch every row above by Phase 26. Phase 2 just gives you the vocabulary.

How to emulate formats you don't have hardware for

Borja's i5-8250U has fp32/fp64 native (via AVX2) and no fp16/bf16/fp8 hardware. To explore those formats:

  • bf16: bit-cast fp32 to uint32, zero the low 16 bits, bit-cast back. Done.
  • fp16: numpy.float16 (NumPy emulates). Slow but correct.
  • fp8 E4M3 / E5M2: roll your own bit manipulation, or install the ml_dtypes library (out of scope for Phase 2 — emulate by hand in the lab).
  • INT8/INT4: round-then-scale, see lab 03.

Lab 00-bit-anatomy.md walks through bf16 and fp16 emulation. Lab 03 does INT8.

What's worth memorizing

Three things, before moving to Phase 3:

  1. Bit widths and exponent widths of fp32, bf16, fp16. So you can predict overflow thresholds (exp(x) overflows at x ≈ 89 for fp32 and bf16; x ≈ 11 for fp16).
  2. Mantissa widths of fp32 (23) and bf16 (7). So you can predict relative precision (ε_fp32 ≈ 1.2e-7, ε_bf16 ≈ 7.8e-3).
  3. INT8 with scale s has uniform step s. So you can quantify rounding error.

Everything else can be looked up. These three you should know without looking.

Drill problems

Solutions in solutions/04-precision-zoo-ref.md (phase-open).

  1. The §A13 model produces a length-600 fp32 logit vector. The maximum logit is 12.3. Can the naive exp(logits) / sum(exp(logits)) proceed in (a) fp32 (b) bf16 © fp16 without overflow? Why?
  2. Convert 0.5 to fp16. To bf16. To fp8 E4M3. What bit patterns? What is the round-trip error in each?
  3. INT8-quantize a tensor whose values are in [-2.5, 2.5]. Choose s symmetrically. What is the quantization step? What is the worst-case absolute error?
  4. The 1/600 probability of theory 01. Represent it in bf16. What is the relative error? Is bf16 fine for this distribution?
  5. Show that summing 600 bf16 numbers each of magnitude 1/600 accumulates O(N · ε_bf16) ≈ 4.7 of error — i.e., the sum is meaningless. What does this say about using bf16 for accumulators?

One-paragraph recap

ML uses a zoo of numeric formats — fp64 (oracle), fp32 (default), bf16 (modern training, fp32 exponent + 7-bit mantissa), fp16 (legacy, narrow exponent), fp8 E4M3/E5M2 (frontier training), INT8/INT4 (quantized inference). Each trades range, precision, and width. The recurring pattern: 16-bit training formats (bf16) preserve exponent range at the cost of precision; inference quantization (INT8/INT4) preserves the data's effective range via per-tensor scales at the cost of uniform-step error near zero. Phase 26 returns to each. Phase 2 just teaches you to read the columns.

What this page does NOT cover

  • Calibration algorithms (GPTQ, AWQ, SmoothQuant). Phase 26.
  • GGUF file format for llama.cpp. Phase 26.
  • Loss scaling automation (torch.cuda.amp.GradScaler). Phase 25 + 26.
  • Mixed-precision policies (autocast). Phase 25.

Phase 2 theory complete. Next stop: lab/00-bit-anatomy.md.