01 — IEEE-754, bit by bit

An fp32 is 32 bits split into sign (1), exponent (8), and mantissa (23). Learning the formula (-1)^s · 1.m · 2^(e-127) and memorizing the special cases (denormal, ±inf, NaN) is the job of this page. The rest of the phase builds on these facts.


The general layout

An IEEE-754 binary floating-point number has three fields:

| sign | exponent | mantissa (significand) |
|  1   |    E     |           M            |

For the four formats Borja will see in this curriculum:

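| Format | Total bits | Sign bits | Exponent bits | Mantissa bits | Bias | Decimal digits |
|--------|------------|-----------|---------------|---------------|------|----------------|
| fp64   | 64         | 1         | 11            | 52            | 1023 | ~15.95         |
| fp32   | 32         | 1         | 8             | 23            | 127  | ~7.22          |
| fp16   | 16         | 1         | 5             | 10            | 15   | ~3.31          |
| bf16   | 16         | 1         | 8             | 7             | 127  | ~2.40          |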
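
The "Decimal digits" column can be reproduced from the mantissa widths. The sketch below assumes the usual rule of thumb that a significand with p bits (stored bits plus the hidden bit) carries about p · log10(2) decimal digits; the helper name `decimal_digits` is made up for illustration.

```python
import math

# Stored mantissa bits per format, copied from the table above.
MANTISSA_BITS = {"fp64": 52, "fp32": 23, "fp16": 10, "bf16": 7}

def decimal_digits(stored_bits: int) -> float:
    """Approximate decimal digits of a significand with one hidden bit."""
    return (stored_bits + 1) * math.log10(2)

for name, bits in MANTISSA_BITS.items():
    print(f"{name}: about {decimal_digits(bits):.3f} decimal digits")
```

The table's figures are these values cut to two decimals.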

Note that bf16 and fp32 share the same exponent width (8 bits) — bf16 is fp32 with a truncated mantissa. That makes bf16 ↔ fp32 conversion trivial (drop or zero-fill the low 16 bits) and is why ML hardware adopted it.
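
A minimal sketch of that conversion, using bit-casting through Python's `struct` module. The helper names are hypothetical, and the fp32 → bf16 direction here truncates (drops the low 16 bits) exactly as described above; many real converters round to nearest instead, which this sketch does not attempt.

```python
import struct

def f32_to_bits(x: float) -> int:
    """Round a Python float to fp32 and return its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an fp32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def f32_to_bf16_bits(x: float) -> int:
    """Keep sign, 8 exponent bits, and the top 7 mantissa bits (truncation)."""
    return f32_to_bits(x) >> 16

def bf16_bits_to_f32(b: int) -> float:
    """Zero-fill the low 16 bits to get back a valid fp32 value."""
    return bits_to_f32((b & 0xFFFF) << 16)

x = 1 / 600
b = f32_to_bf16_bits(x)
print(hex(f32_to_bits(x)), "->", hex(b), "->", bf16_bits_to_f32(b))
```

The round trip loses the low 16 mantissa bits, so the value that comes back is close to, but not equal to, the fp32 input.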

The decoding formula

For a normal number:

\[ x = (-1)^s \cdot 1.m \cdot 2^{e - \text{bias}} \]

Where:

  • s is the sign bit (0 = positive, 1 = negative).
  • m is the mantissa interpreted as a binary fraction 0.b_{22} b_{21} \dots b_0 (for fp32), giving an effective 1.m in [1, 2).
  • e is the unsigned interpretation of the exponent field, with bias = 127 for fp32 (1023 for fp64).

The implicit leading 1. is the hidden bit — it's not stored, but it counts. This gives fp32 an effective 24-bit significand from 23 stored bits.

Worked example — decode 0x3F800000

Bits: 0 01111111 00000000000000000000000 (regrouped: sign / 8-bit exponent / 23-bit mantissa).

  • s = 0 → positive.
  • e = 0b01111111 = 127 → exponent value 127 - 127 = 0.
  • m = 0 → significand 1.0.
  • Value: +1.0 × 2^0 = 1.0.

Worked example — decode 0x3ADA740E (= 1/600 ≈ 0.001\overline{6})

This is the bit pattern of the §A13 vocabulary's uniform probability. Compute it:

  • Bits: 0 01110101 10110100111010000001110. Regrouped: sign 0, exponent 0b01110101 = 117, mantissa 0b10110100111010000001110 = 5927950.
  • Stored exponent 117, unbiased exponent 117 - 127 = -10.
  • Significand: 1 + 5927950 / 2^23 = 1 + 0.706666... = 1.706666...
  • Value: +1.706666... × 2^-10 = 1.706666... / 1024 ≈ 0.001\overline{6}.
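
Both worked examples can be checked mechanically. This is a sketch for normal numbers only; `decode_fp32` and `normal_value` are illustrative names, not part of any lab code.

```python
def decode_fp32(bits: int) -> tuple[int, int, int]:
    """Split a 32-bit pattern into (sign, stored exponent, mantissa)."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

def normal_value(bits: int) -> float:
    """Apply (-1)^s * 1.m * 2^(e - 127); valid only for normal numbers."""
    s, e, m = decode_fp32(bits)
    if e == 0 or e == 0xFF:
        raise ValueError("not a normal number (zero, denormal, Inf, or NaN)")
    return (-1.0) ** s * (1 + m / 2**23) * 2.0 ** (e - 127)

print(decode_fp32(0x3F800000), normal_value(0x3F800000))  # (0, 127, 0) 1.0
print(decode_fp32(0x3ADA740E), normal_value(0x3ADA740E))
```

The second line prints the fields (0, 117, 5927950) and a value that agrees with 1/600 to about seven significant digits.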

Lab 00-bit-anatomy.md makes you do this by hand for 0.1, 1/600, and a number from a tense-logit vector.

Range and precision of fp32

Normal-number range:

  • Largest finite: exponent 0xFE = 254, mantissa all ones. Value ≈ 3.4028235 × 10^38.
  • Smallest normal: exponent 0x01 = 1, mantissa zero. Value 2^{-126} ≈ 1.175 × 10^{-38}.
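
These two limits can be read straight off their bit patterns. A small sketch, assuming the bit-cast helper `bits_to_f32` defined here for illustration:

```python
import struct

def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an fp32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

largest_finite = bits_to_f32(0x7F7FFFFF)   # exponent 0xFE, mantissa all ones
smallest_normal = bits_to_f32(0x00800000)  # exponent 0x01, mantissa zero

print(largest_finite)
print(smallest_normal)
```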

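Machine epsilon: ε_fp32 = 2^{-23} ≈ 1.192 × 10^{-7}. This is the gap between 1.0 and the next larger representable value. (Describing it as "the smallest x such that 1 + x ≠ 1" is only approximate: under the default rounding mode, any x slightly above 2^{-24} already rounds 1 + x up to the next value.) The exponent doesn't matter: the precision is relative, set by the mantissa width.

Relative precision means every power-of-two interval holds the same number of values: between 1.0 and 2.0 there are 2^23 ≈ 8.4 × 10^6 representable fp32 values. Between 1024 and 2048 there are also 2^23 values. Between 1e30 and 2e30 there are still 2^23 values. The spacing is roughly ε × x everywhere (exactly ε × 2^⌊log₂ x⌋), so absolute resolution near zero is much finer than near 1e30.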
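
The counts and gaps can be checked by treating fp32 bit patterns as integers: for positive finite values, consecutive floats have consecutive bit patterns. The helpers below (`values_between`, `gap_above`) are hypothetical names for this sketch.

```python
import struct

def f32_bits(x: float) -> int:
    """Round to fp32 and return the 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def values_between(lo: float, hi: float) -> int:
    """Count fp32 values in [lo, hi) for positive finite lo <= hi."""
    return f32_bits(hi) - f32_bits(lo)

def gap_above(x: float) -> float:
    """Distance from fp32(x) to the next larger fp32 value (positive finite x)."""
    return bits_f32(f32_bits(x) + 1) - bits_f32(f32_bits(x))

for lo, hi in [(1.0, 2.0), (1024.0, 2048.0), (1e30, 2e30)]:
    print(lo, hi, values_between(lo, hi), gap_above(lo))
```

All three intervals contain 2^23 = 8388608 values, while the gap grows from 2^-23 at 1.0 to 2^76 at 1e30.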

Special encodings — the carve-outs

IEEE-754 reserves two exponent values for non-normal cases:

Exponent all zeros (e = 0) — zero and denormals

  • Mantissa zero → signed zero (+0.0 and -0.0 both exist; they compare equal but their bits differ).
  • Mantissa nonzero → denormal (also called "subnormal"). The implicit leading bit is 0 (not 1), and the exponent is treated as 1 - bias (i.e., -126 for fp32) rather than 0 - bias. This gives values from ~1.4 × 10^{-45} (the smallest positive fp32) up to just below the smallest normal ~1.175 × 10^{-38}.

Denormals gradually lose precision as they approach zero. They give graceful underflow — without them, anything smaller than the smallest normal would jump straight to zero. With them, precision degrades smoothly down to 1 bit.
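
A sketch of the denormal decoding rule (leading bit 0, exponent fixed at -126), checked against a direct bit-cast. `decode_denormal` is an illustrative name.

```python
import struct

def bits_to_f32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def decode_denormal(bits: int) -> float:
    """Apply (-1)^s * 0.m * 2^(-126); valid only when the exponent field is 0."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent != 0:
        raise ValueError("exponent field is not all zeros")
    return (-1.0) ** sign * (mantissa / 2**23) * 2.0**-126

smallest = decode_denormal(0x00000001)  # one mantissa bit set: 2^-149
largest = decode_denormal(0x007FFFFF)   # all mantissa bits set
print(smallest, largest)
```

The smallest denormal has a single significant bit left, which is the "down to 1 bit" end of the precision loss described above.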

Performance note. Many CPUs handle denormals via microcode or a slow path, which can be roughly 10-100× slower than normals. Numerical kernels often set "flush-to-zero" mode to avoid the penalty. ML training can produce denormals, for example when gradients become very small late in training; projects that are aware of the cost enable flush-to-zero deliberately.

Exponent all ones (e = 0xFF for fp32, 0x7FF for fp64) — infinities and NaN

  • Mantissa zero → ±Inf. Result of overflow, or 1 / 0, etc.
  • Mantissa nonzero → NaN (Not a Number). Result of 0 / 0, Inf - Inf, sqrt(-1), log(-1), etc.

The mantissa pattern of a NaN can encode a "payload" (which the standard allows for diagnostic use). Most software just propagates an arbitrary NaN bit pattern. Two NaNs are never equal: NaN == NaN is False, even when the bit patterns are identical. This gives the classic IEEE-754 test for NaN: x != x.

NaN propagation is the killer in ML: almost every arithmetic operation with a NaN operand produces a NaN. One NaN in your length-600 verb-probability vector poisons the whole forward pass, the whole gradient, the whole step.
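
A short sketch of these encodings and of NaN propagation, in plain Python. The helper `f32_fields` and the 600-element list are illustrative; note that Python raises `ZeroDivisionError` for `0.0 / 0.0`, so the NaN here comes from `inf - inf` instead.

```python
import math
import struct

def f32_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, stored exponent, mantissa) of x rounded to fp32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

inf = float("inf")
nan = inf - inf                # Inf - Inf is NaN

print(f32_fields(inf))         # (0, 255, 0)
print(f32_fields(-inf))        # (1, 255, 0)
print(nan == nan, nan != nan)  # False True
print(math.isnan(nan))         # True

# One NaN poisons an otherwise well-behaved reduction.
probs = [1 / 600] * 600
probs[17] = nan
total = sum(probs)
print(total)                   # nan
```

In real code `math.isnan` (or the array library's equivalent) reads better than `x != x`, but both rest on the same rule.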

Rounding modes

When the exact result of an operation isn't representable, the result must be rounded. IEEE-754 specifies five rounding modes; the default is round half to even ("banker's rounding"):

  • If the exact result is closer to one representable value, round there.
  • If it's exactly halfway, round to the one whose mantissa LSB is 0 (i.e., to the "even" representable).

Banker's rounding eliminates the small statistical bias that "round half up" introduces over many operations.

Other modes (rarely used in ML): round toward zero (truncation), round toward +∞, round toward -∞, round half away from zero.
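
The tie-breaking rule can be observed by rounding fp64 values that sit exactly halfway between two fp32 neighbours. This sketch assumes CPython's `struct.pack('<f', ...)` converts with the platform's default round-half-to-even mode, which is the usual case; `round_to_f32_bits` is a made-up helper name.

```python
import struct

def round_to_f32_bits(x: float) -> int:
    """Round an fp64 value to fp32 and return the resulting bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

# Exactly halfway between 0x3F800000 (LSB 0) and 0x3F800001 (LSB 1).
tie_low = 1.0 + 2.0**-24
# Exactly halfway between 0x3F800001 (LSB 1) and 0x3F800002 (LSB 0).
tie_high = 1.0 + 2.0**-23 + 2.0**-24
# Slightly above the first tie: no longer halfway, so it rounds to the nearer value.
above_tie = 1.0 + 2.0**-24 + 2.0**-30

print(hex(round_to_f32_bits(tie_low)))    # 0x3f800000 (ties go to even LSB)
print(hex(round_to_f32_bits(tie_high)))   # 0x3f800002 (ties go to even LSB)
print(hex(round_to_f32_bits(above_tie)))  # 0x3f800001 (nearest wins)
```

The first tie rounds down and the second rounds up, which is how the rule avoids a consistent drift in one direction.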

Why 0.1 + 0.2 ≠ 0.3

0.1 in binary is the infinite repeating fraction 0.0001100110011.... It can't be stored exactly in fp32. The stored value is the nearest representable, off by about 1.49 × 10^{-9} (call it 0.1 + ε₁).
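
The size of that representation error can be computed exactly with rational arithmetic. A minimal sketch; `to_f32` is an illustrative helper that rounds through `struct`.

```python
import struct
from fractions import Fraction

def to_f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

stored = to_f32(0.1)
error = Fraction(stored) - Fraction(1, 10)  # exact difference, no rounding

print(struct.pack(">f", 0.1).hex())  # 3dcccccd
print(float(error))                  # about 1.49e-09
```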

Similarly, 0.2 is stored as 0.2 + ε₂.

Their sum is 0.3 + ε₁ + ε₂ + ε_{round} where ε_{round} is the rounding of the sum, not of either operand. This sum is not the same fp32 value as the nearest representation of 0.3, which has its own ε₃.

>>> import struct
>>> hex(struct.unpack('<I', struct.pack('<f', 0.1 + 0.2))[0])  # fp64 sum, rounded to fp32 by pack
'0x3e99999a'
>>> hex(struct.unpack('<I', struct.pack('<f', 0.3))[0])  # same bits: in fp32 they happen to match
'0x3e99999a'
>>> 0.1 + 0.2  # Python uses fp64 by default, and in fp64 they don't match
0.30000000000000004

The lesson: never == floats. Always abs(a - b) < tol, with a tolerance that reflects the expected accumulated error.
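
A minimal sketch of tolerance-based comparison using the standard library. The tolerance values are placeholders chosen for this example, not recommendations; pick them from the error you expect to accumulate.

```python
import math

a = 0.1 + 0.2
b = 0.3

exact_equal = (a == b)                 # False in fp64
abs_close = abs(a - b) < 1e-12         # absolute tolerance
rel_close = math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)

print(exact_equal, abs_close, rel_close)  # False True True
```

An absolute tolerance alone stops working for very large or very small magnitudes, which is why `math.isclose` combines a relative tolerance with an optional absolute one.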

fp32 vs fp64 vs fp16 vs bf16 — when each makes sense

| Format | Use case | Why |
|--------|----------|-----|
| fp64 | Reference / oracle computations; gradient checks | Maximum precision; slow on GPUs; rarely used in deployed models |
| fp32 | Default for training; safety baseline | Wide hardware support; sufficient precision for most kernels |
| fp16 | Inference, sometimes training with care | 2× smaller, 2× faster on tensor cores; narrow exponent range (overflows easily) |
| bf16 | Modern training default | fp32 exponent range with fp16 size; fewer mantissa bits than fp16 (7 vs 10), but rarely fatal |

The bf16 vs fp16 trade-off: fp16 has more mantissa precision (10 bits vs 7), so it represents numbers near 1.0 more accurately. But fp16's narrow 5-bit exponent (normal range ~6 × 10^{-5} to ~6.5 × 10^4) overflows during the exp(x) of a softmax once x exceeds about 11, much earlier than fp32's threshold of about 88.7. bf16's 8-bit exponent matches fp32, so it overflows at essentially the same point as fp32. Modern training adopted bf16 because exponent range matters more than mantissa precision for gradients and activations.
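
The overflow thresholds follow from each format's largest finite value: exp(x) overflows once x exceeds ln(max). A sketch, with the maxima built from the mantissa and exponent widths in the layout table:

```python
import math

# Largest finite value: (2 - 2^-mantissa_bits) * 2^(max unbiased exponent).
MAX_FINITE = {
    "fp16": (2 - 2.0**-10) * 2.0**15,
    "bf16": (2 - 2.0**-7) * 2.0**127,
    "fp32": (2 - 2.0**-23) * 2.0**127,
}

# exp(x) stays finite only while x <= ln(max finite).
exp_overflow_at = {name: math.log(v) for name, v in MAX_FINITE.items()}

for name, x in exp_overflow_at.items():
    print(f"{name}: max {MAX_FINITE[name]:.6g}, exp(x) overflows above x ≈ {x:.2f}")
```

fp16 tops out at 65504, which is what puts its threshold near 11, while bf16 and fp32 both land near 88.7.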

You will not write bf16 code in Phase 2 (no hardware), but you'll emulate it in lab/00-bit-anatomy.md via bit-casting and truncation.

How an fp32 + fp32 addition happens, mechanically

The hardware path (you can implement this in Python in lab 00):

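  1. Align exponents. The smaller operand's mantissa is right-shifted until the exponents match. Bits shifted out of the 24-bit window are not kept in full; they survive only as the guard, round, and sticky bits used for rounding in step 4.
  2. Add (or subtract) mantissas. The result carries one extra guard bit, one round bit, and one sticky bit (the conventional "GRS bits") to support correct rounding.
  3. Normalize. Shift the result so the leading bit is 1, adjusting the exponent to match (if the exponent would fall below the minimum, the result stays denormal with a leading 0).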
  4. Round per the current rounding mode.

This is why (a + b) + c ≠ a + (b + c) in general: each addition rounds its own result, and which low-order bits are lost depends on the relative magnitudes of the two operands being added. A different grouping loses different bits.
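
A small demonstration of that non-associativity, emulating fp32 addition by rounding an fp64 sum back to fp32. This is a sketch under the assumption that rounding through fp64 gives the same result as a native fp32 add (the sums here are exact in fp64, so it does); `f32` and `add_f32` are illustrative names, not the bit-level adder from lab 00.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def add_f32(a: float, b: float) -> float:
    """Emulated fp32 addition: add in fp64, then round once to fp32."""
    return f32(f32(a) + f32(b))

a, b, c = 16777216.0, 1.0, 1.0   # a = 2^24, where the fp32 spacing is 2

left = add_f32(add_f32(a, b), c)   # each +1 lands on a tie and rounds back to 2^24
right = add_f32(a, add_f32(b, c))  # 1 + 1 = 2 is exact, and 2^24 + 2 is representable

print(left, right)  # 16777216.0 16777218.0
```

In the left grouping each small addend is absorbed by rounding; in the right grouping the two small addends combine first and survive.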

What every Phase-2 lab does with this

  • Lab 00-bit-anatomy.md makes you decode and re-encode numbers including 1/600, 0.1, and a logit from a tense-classification vector.
  • Lab 01-softmax-stability.md shows you what happens when a single fp32 logit gets large enough to push exp(x) past 3.4 × 10^{38}.
  • Lab 02-summation-experiments.md measures how a 10⁶-long sum of ~1/600-magnitude probabilities accumulates ε per addition.
  • Lab 03-quantization-preview.md shows what's lost when you map an fp32 logit array down to int8 and back.

One-paragraph recap

IEEE-754 floats are (-1)^s · 1.m · 2^{e-bias} with carve-outs for zero (e=0, m=0), denormals (e=0, m≠0), infinities (e=max, m=0), and NaN (e=max, m≠0). fp32 gives ~7 decimal digits of precision and range up to ~3.4 × 10^{38}. bf16 keeps fp32's exponent range and gives up mantissa bits to fit in 16 bits. Floating-point addition is non-associative because every operation rounds, so reduction order matters. Memorize the layout: every later phase consumes it.

What this page does NOT cover

  • Hardware-specific subtleties (Intel x87 legacy, AVX-512 caveats). Phase 24.
  • Decimal floating-point (decimal64). Out of scope.
  • IEEE-754-2019 changes (augmented operations). Theoretical interest only.
  • FP8 — covered separately in theory/04-precision-zoo.md.

Next: theory/02-softmax-stability.md.