01 — IEEE-754, bit by bit
An fp32 is 32 bits split into sign (1), exponent (8), and mantissa (23). Learning the formula
(-1)^s · 1.m · 2^(e-127) and memorizing the special cases (denormal, ±inf, NaN) is the job of this page. The rest of the phase builds on these facts.
The general layout
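An IEEE-754 binary floating-point number has three fields, from the most significant bit down: a sign bit, an exponent field, and a mantissa (fraction) field.
For the four formats Borja will see in this curriculum: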
| Format | Total | Sign | Exponent | Mantissa | Bias | Decimal digits |
|---|---|---|---|---|---|---|
| fp64 | 64 | 1 | 11 | 52 | 1023 | ~15.95 |
| fp32 | 32 | 1 | 8 | 23 | 127 | ~7.22 |
| fp16 | 16 | 1 | 5 | 10 | 15 | ~3.31 |
| bf16 | 16 | 1 | 8 | 7 | 127 | ~2.41 |
Note that bf16 and fp32 share the same exponent width (8 bits) — bf16 is fp32 with a truncated mantissa. That makes bf16 ↔ fp32 conversion trivial (drop or zero-fill the low 16 bits) and is why ML hardware adopted it.
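The "drop or zero-fill the low 16 bits" conversion can be sketched in a few lines of Python. This is a minimal illustration, not a production converter: it truncates rather than rounds, and the helper names (`f32_bits`, `bits_to_f32`, `f32_to_bf16_bits`, `bf16_bits_to_f32`) are made up for this example.

```python
import struct

def f32_bits(x: float) -> int:
    """Round x to fp32 and return its 32-bit pattern as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an fp32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def f32_to_bf16_bits(x: float) -> int:
    """fp32 -> bf16 by dropping the low 16 bits (truncation, no rounding)."""
    return f32_bits(x) >> 16

def bf16_bits_to_f32(b: int) -> float:
    """bf16 -> fp32 by zero-filling the low 16 bits."""
    return bits_to_f32((b & 0xFFFF) << 16)

b = f32_to_bf16_bits(1.0 / 3.0)
print(hex(b), bf16_bits_to_f32(b))  # only 7 mantissa bits survive
```

Because sign and exponent sit in the top 16 bits of both formats, the round trip keeps the magnitude and loses only low-order mantissa bits.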
The decoding formula
For a normal number:
value = (-1)^s · 1.m · 2^{e - bias}
Where:
- s is the sign bit (0 = positive, 1 = negative).
- m is the mantissa interpreted as a binary fraction 0.b_{22} b_{21} \dots b_0 (for fp32), giving an effective 1.m in [1, 2).
- e is the unsigned interpretation of the exponent field, with bias = 127 for fp32 (1023 for fp64).
The implicit leading 1. is the hidden bit — it's not stored, but it counts. This gives fp32 an effective 24-bit significand from 23 stored bits.
Worked example — decode 0x3F800000
Bits: 0 01111111 00000000000000000000000 (regrouped: sign / 8-bit exponent / 23-bit mantissa).
- s = 0 → positive.
- e = 0b01111111 = 127 → exponent value 127 - 127 = 0.
- m = 0 → significand 1.0.
- Value: +1.0 × 2^0 = 1.0.
Worked example — decode 0x3ADA740E (= 1/600 ≈ 0.001\overline{6})
This is the bit pattern of the §A13 vocabulary's uniform probability. Compute it:
- Bits: 0 01110101 10110100111010000001110. Regrouped: sign 0, exponent 0b01110101 = 117, mantissa 0b10110100111010000001110 = 5927950.
- Stored exponent 117, unbiased exponent 117 - 127 = -10.
- Significand: 1 + 5927950 / 2^23 = 1 + 0.706666... = 1.706666...
- Value: +1.706666... × 2^-10 = 1.706666... / 1024 ≈ 0.001\overline{6}.
Lab 00-bit-anatomy.md makes you do this by hand for 0.1, 1/600, and a number from a tense-logit vector.
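The two hand decodings above can be cross-checked with a short Python sketch of the normal-number formula. The function name `decode_fp32_normal` is illustrative only, and the sketch deliberately rejects the special encodings (exponent field all zeros or all ones).

```python
import struct

def decode_fp32_normal(bits: int) -> float:
    """Decode a normal fp32 bit pattern as (-1)^s * 1.m * 2^(e - 127)."""
    s = bits >> 31            # 1 sign bit
    e = (bits >> 23) & 0xFF   # 8 exponent bits, stored with bias 127
    m = bits & 0x7FFFFF       # 23 stored mantissa bits
    if e == 0 or e == 0xFF:
        raise ValueError("special encoding, not a normal number")
    significand = 1 + m / 2**23   # the hidden leading 1 plus the fraction
    return (-1.0) ** s * significand * 2.0 ** (e - 127)

for pattern in (0x3F800000, 0x3ADA740E):
    by_formula = decode_fp32_normal(pattern)
    # Let the platform reinterpret the same bits as a check.
    by_struct = struct.unpack("<f", struct.pack("<I", pattern))[0]
    print(hex(pattern), by_formula, by_formula == by_struct)
```

Every intermediate value here fits exactly in Python's fp64 floats, so the formula and the `struct` reinterpretation should agree bit for bit.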
Range and precision of fp32
Normal-number range:
- Largest finite: exponent 0xFE = 254, mantissa all ones. Value ≈ 3.4028235 × 10^38.
- Smallest normal: exponent 0x01 = 1, mantissa zero. Value 2^{-126} ≈ 1.175 × 10^{-38}.
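Machine epsilon: ε_fp32 = 2^{-23} ≈ 1.192 × 10^{-7}. This is the gap between 1.0 and the next larger representable fp32 value. (Under round-to-nearest, any x slightly above 2^{-24} already makes 1 + x ≠ 1.) The exponent doesn't matter: the precision is relative, set by the mantissa width.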
A fixed machine epsilon means that between 1.0 and 2.0 there are 2^23 ≈ 8.4 × 10^6 representable fp32 values. Between 1024 and 2048 there are also 2^23 values. Between 1e30 and 2e30 there are still 2^23 values. The spacing is roughly ε × x (exactly ε × 2^⌊log₂ x⌋ for normal x), so resolution near zero is much finer than near 1e30.
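The spacing claims can be checked by stepping the bit pattern by one. This is a sketch using hypothetical helpers (`f32_bits`, `bits_to_f32`, `spacing_f32`), and it assumes positive, finite, normal inputs.

```python
import struct

def f32_bits(x: float) -> int:
    """Round x to fp32 and return its 32-bit pattern as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an fp32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def spacing_f32(x: float) -> float:
    """Gap between fp32(x) and the next larger fp32 value (positive finite x)."""
    bits = f32_bits(x)
    return bits_to_f32(bits + 1) - bits_to_f32(bits)

for x in (1.0, 1024.0, 1e30):
    gap = spacing_f32(x)
    print(x, gap, gap / x)   # absolute gap grows with x, relative gap stays near 2^-23

# Consecutive bit patterns are consecutive fp32 values, so subtracting
# patterns counts the representable values in an interval.
print(f32_bits(2.0) - f32_bits(1.0) == 2**23)    # True
print(f32_bits(2e30) - f32_bits(1e30) == 2**23)  # True
```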
Special encodings — the carve-outs
IEEE-754 reserves two exponent values for non-normal cases:
Exponent all zeros (e = 0) — zero and denormals
- Mantissa zero → signed zero (+0.0 and -0.0 both exist; they compare equal but their bits differ).
- Mantissa nonzero → denormal (also called "subnormal"). The implicit leading bit is 0 (not 1), and the exponent is treated as 1 - bias (i.e., -126 for fp32) rather than 0 - bias. This gives values from ~1.4 × 10^{-45} (the smallest positive fp32) up to just below the smallest normal ~1.175 × 10^{-38}.
Denormals gradually lose precision as they approach zero. They give graceful underflow — without them, anything smaller than the smallest normal would jump straight to zero. With them, precision degrades smoothly down to 1 bit.
Performance note. Many CPUs handle denormals via microcode or another slow path (roughly 10-100× slower than normals). Numerical kernels often set "flush-to-zero" mode to avoid the penalty. AI training can produce denormals near the end of training, as weights converge and the quantities being updated become tiny; projects that are aware of the cost enable flush-to-zero deliberately.
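The e = 0 rules can be sketched as a separate decoder: the hidden bit becomes 0 and the exponent is pinned at -126. The function name `decode_fp32_exp_zero` is illustrative, not from any library.

```python
import math
import struct

def decode_fp32_exp_zero(bits: int) -> float:
    """Decode an fp32 pattern whose exponent field is all zeros:
    (-1)^s * 0.m * 2^(1 - 127)."""
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    if e != 0:
        raise ValueError("exponent field is not zero")
    return (-1.0) ** s * (m / 2**23) * 2.0 ** -126

smallest = decode_fp32_exp_zero(0x00000001)   # one mantissa bit set
largest = decode_fp32_exp_zero(0x007FFFFF)    # mantissa all ones
neg_zero = decode_fp32_exp_zero(0x80000000)   # sign bit only

print(smallest == 2.0 ** -149)        # True, about 1.4e-45
print(largest < 2.0 ** -126)          # True, just below the smallest normal
print(neg_zero == 0.0)                # True, -0.0 compares equal to +0.0
print(math.copysign(1.0, neg_zero))   # -1.0, the sign bit is still there
```

The spacing between adjacent denormals is a constant 2^-149, which is why relative precision shrinks as values approach zero.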
Exponent all ones (e = 0xFF for fp32, 0x7FF for fp64) — infinities and NaN
- Mantissa zero → ±Inf. Result of overflow, or 1 / 0, etc.
- Mantissa nonzero → NaN (Not a Number). Result of 0 / 0, Inf - Inf, sqrt(-1), log(-1), etc.
The mantissa pattern of a NaN can encode a "payload" (which the standard allows for diagnostic use). Most software just propagates an arbitrary NaN bit pattern. Two NaNs are never equal: NaN == NaN is False, even when the bit patterns are identical. That property gives the IEEE-754-correct test for NaN: x != x.
NaN propagation is the killer in ML: almost every arithmetic operation involving a NaN produces a NaN. One NaN in your length-600 verb-probability vector poisons the whole forward pass, the whole gradient, the whole step.
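A small sketch of the NaN behaviour described above, in plain Python (which uses fp64, but the rules are the same). The helper `is_nan` and the `probs` list are invented for illustration. Note that Python raises ZeroDivisionError for `0.0 / 0.0` instead of returning NaN, so the example builds NaN from `inf - inf`.

```python
import math

def is_nan(x: float) -> bool:
    """IEEE-754 self-inequality test: only NaN is unequal to itself."""
    return x != x

inf = float("inf")
nan = inf - inf              # an invalid operation yields NaN

print(nan == nan)            # False
print(is_nan(nan))           # True
print(is_nan(inf))           # False
print(math.isnan(nan))       # True, the library equivalent

# One NaN among 600 otherwise valid probabilities poisons the reduction.
probs = [1 / 600] * 599 + [nan]
print(is_nan(sum(probs)))    # True
```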
Rounding modes
When the exact result of an operation isn't representable, the result must be rounded. IEEE-754 specifies five rounding modes; the default is round half to even ("banker's rounding"):
- If the exact result is closer to one representable value, round there.
- If it's exactly halfway, round to the one whose mantissa LSB is 0 (i.e., to the "even" representable).
Banker's rounding eliminates the small statistical bias that "round half up" introduces over many operations.
Other modes (rarely used in ML): round toward zero (truncation), round toward +∞, round toward -∞, round half away from zero.
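The tie rule can be observed by rounding fp64 values that sit exactly halfway between two fp32 neighbours. This sketch assumes that `struct.pack("<f", ...)` converts with the platform's default round-to-nearest-even mode, which is the usual case for CPython; `to_f32` is a helper made up for the example.

```python
import struct

def to_f32(x: float) -> float:
    """Round an fp64 value to the nearest fp32 and return it as a float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

ulp = 2.0 ** -23   # fp32 spacing in [1, 2)

# Halfway between 1.0 (mantissa LSB 0) and 1.0 + ulp (mantissa LSB 1).
tie_a = 1.0 + ulp / 2
# Halfway between 1.0 + ulp (LSB 1) and 1.0 + 2*ulp (LSB 0).
tie_b = 1.0 + 3 * ulp / 2

print(to_f32(tie_a) == 1.0)             # True: the tie goes down to the even neighbour
print(to_f32(tie_b) == 1.0 + 2 * ulp)   # True: the tie goes up to the even neighbour
```

Under "round half up", both ties would have moved upward; under half-to-even they split, which is where the bias cancellation comes from.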
Why 0.1 + 0.2 ≠ 0.3
0.1 in binary is the infinite repeating fraction 0.0001100110011.... It can't be stored exactly in fp32. The stored value is the nearest representable, off by about 1.49 × 10^{-9} (call it 0.1 + ε₁).
0.2 is similarly stored as 0.2 + ε₂.
Their sum is 0.3 + ε₁ + ε₂ + ε_{round}, where ε_{round} is the rounding of the sum, not of either operand. Nothing forces this to equal the nearest representation of 0.3, which has its own ε₃. In fp32 the two happen to coincide; in fp64 they do not.
>>> import struct
>>> # Round each fp64 result to fp32 and show its bit pattern.
>>> hex(struct.unpack('<I', struct.pack('<f', 0.1 + 0.2))[0])
'0x3e99999a'
>>> hex(struct.unpack('<I', struct.pack('<f', 0.3))[0])
'0x3e99999a'
>>> # In fp32 the two happen to match, but in fp64 they don't.
>>> 0.1 + 0.2  # Python uses fp64 by default
0.30000000000000004
The lesson: never == floats. Always abs(a - b) < tol, with a tolerance that reflects the expected accumulated error.
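A minimal sketch of the tolerance comparison, using both the explicit form and the standard library's `math.isclose`. The tolerance 1e-9 is an arbitrary choice for this example, not a recommendation from the curriculum.

```python
import math

a = 0.1 + 0.2
b = 0.3

print(a == b)                             # False in fp64
print(abs(a - b) < 1e-9)                  # True: absolute tolerance
print(math.isclose(a, b, rel_tol=1e-9))   # True: relative tolerance
```

An absolute tolerance suits values near 1; for values spread over many orders of magnitude, a relative tolerance tracks the ε × x spacing more faithfully.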
fp32 vs fp64 vs fp16 vs bf16 — when each makes sense
| Format | Use case | Why |
|---|---|---|
| fp64 | Reference / oracle computations; gradient checks | Maximum precision; slow on GPUs; rarely used in deployed models |
| fp32 | Default for training; safety baseline | Wide hardware support; sufficient precision for most kernels |
| fp16 | Inference, sometimes training with care | 2× smaller, 2× faster on tensor cores; narrow exponent range (overflows easily) |
| bf16 | Modern training default | fp32 exponent range with fp16 size; only 7 mantissa bits (vs 10 for fp16), but the loss is rarely fatal |
The bf16 vs fp16 trade-off: fp16 has more mantissa precision (10 bits vs 7), so it represents numbers near 1.0 more accurately. But fp16's narrow 5-bit exponent (normal range ~6 × 10^{-5} to ~6.5 × 10^4) overflows during the exp(x) of a softmax once x exceeds about 11.09 (ln 65504), much earlier than fp32's threshold of about 88.7. bf16's 8-bit exponent matches fp32, so it overflows at essentially the same threshold as fp32. Modern training adopted bf16 because exponent range matters more than mantissa precision for gradients and activations.
You will not write bf16 code in Phase 2 (no hardware), but you'll emulate it in lab/00-bit-anatomy.md via bit-casting and truncation.
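The overflow thresholds for exp(x) follow from the largest finite value of each format. The sketch below computes them and probes fp16 using the standard library's half-precision `struct` format `"e"` (available in Python 3.6 and later); `fits_fp16` is a helper invented for this example.

```python
import math
import struct

FP16_MAX = 65504.0                          # (2 - 2^-10) * 2^15
FP32_MAX = (2 - 2.0 ** -23) * 2.0 ** 127    # about 3.4028235e38

print(math.log(FP16_MAX))   # about 11.09: exp(x) overflows fp16 above this
print(math.log(FP32_MAX))   # about 88.72: exp(x) overflows fp32 above this

def fits_fp16(x: float) -> bool:
    """True if x can be packed as a finite fp16 value."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False

print(fits_fp16(math.exp(11.0)))   # True: about 59874, below 65504
print(fits_fp16(math.exp(12.0)))   # False: about 162755, above 65504
```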
How a fp32 + fp32 happens, mechanically
The hardware path (you can implement this in Python in lab 00):
- Align exponents. The smaller operand's mantissa is right-shifted until the exponents match. Bits shifted out are lost (rounded).
- Add (or subtract) mantissas. With one extra guard bit, one round bit, and one sticky bit (the IEEE-754 "GRS bits") to support correct rounding.
- Normalize. Shift the result so the leading bit is 1 (or zero-out for denormals).
- Round per the current rounding mode.
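This is why (a + b) + c ≠ a + (b + c) in general: alignment in step 1 pushes the smaller operand's low bits out of range, and the rounding in step 4 discards them. Which bits are discarded depends on operand ordering, so the bits lost in one ordering aren't the same as in another.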
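A concrete instance of the ordering effect, emulated in Python. The sketch adds in fp64 and rounds once to fp32 through `struct`; for a single addition of two fp32 values this should match a true fp32 add, but treat it as an emulation rather than the hardware path. `to_f32` and `add_f32` are illustrative names.

```python
import struct

def to_f32(x: float) -> float:
    """Round an fp64 value to the nearest fp32 and return it as a float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def add_f32(a: float, b: float) -> float:
    """Emulated fp32 addition: exact-enough fp64 sum, then one rounding to fp32."""
    return to_f32(to_f32(a) + to_f32(b))

a, b, c = 1e8, -1e8, 1.0

left = add_f32(add_f32(a, b), c)    # (a + b) + c: the big terms cancel first
right = add_f32(a, add_f32(b, c))   # a + (b + c): c is absorbed into -1e8

print(left)    # 1.0
print(right)   # 0.0
```

Near 1e8 the fp32 spacing is 8, so adding 1.0 to -1e8 rounds straight back to -1e8 and the contribution of c is lost.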
What every Phase-2 lab does with this
- Lab 00-bit-anatomy.md makes you decode and re-encode numbers including 1/600, 0.1, and a logit from a tense-classification vector.
- Lab 01-softmax-stability.md shows you what happens when a single fp32 exponent gets large enough to push exp(x) past 3.4 × 10^{38}.
- Lab 02-summation-experiments.md measures how a 10⁶-long sum of ~1/600-magnitude probabilities accumulates ε per addition.
- Lab 03-quantization-preview.md shows what's lost when you map an fp32 logit array down to int8 and back.
One-paragraph recap
IEEE-754 floats are (-1)^s · 1.m · 2^{e-bias} with carve-outs for zero (e=0, m=0), denormals (e=0, m≠0), infinities (e=max, m=0), and NaN (e=max, m≠0). fp32 gives ~7 decimal digits of precision and range up to ~3.4 × 10^{38}. bf16 trades mantissa for exponent range. Floating-point addition is not associative because every operation rounds, so reduction order matters. Memorize the layout: every later phase consumes it.
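The recap's carve-outs fit in one small classifier over the exponent and mantissa fields. This is a sketch for fp32 only, and `classify_fp32` is a hypothetical name.

```python
def classify_fp32(bits: int) -> str:
    """Name the encoding class of a 32-bit fp32 pattern."""
    e = (bits >> 23) & 0xFF   # exponent field
    m = bits & 0x7FFFFF       # mantissa field
    if e == 0:
        return "zero" if m == 0 else "denormal"
    if e == 0xFF:
        return "inf" if m == 0 else "nan"
    return "normal"

for pattern in (0x00000000, 0x80000000, 0x00000001,
                0x3F800000, 0x7F800000, 0x7FC00000):
    print(hex(pattern), classify_fp32(pattern))
```

The sign bit never affects the class, which is why +0.0 and -0.0, or +Inf and -Inf, land in the same branch.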
What this page does NOT cover
- Hardware-specific subtleties (Intel MMX legacy, AVX-512 caveats). Phase 24.
- Decimal floating-point (decimal64). Out of scope.
- IEEE-754-2019 changes (augmented operations). Theoretical interest only.
- FP8: covered separately in theory/04-precision-zoo.md.
Next: theory/02-softmax-stability.md.