00: Why a numerical-representation phase before linear algebra

The core intuition: floating-point arithmetic is not the math you learned in school. It is an approximation with a handful of odd rules (denormals, NaN, overflow, broken associativity) that give no warning when they break: they produce loss = inf, dead gradients, or absurd predictions. Phase 2 teaches you those rules before you are forced to learn them the hard way.

The lie textbooks tell

A linear algebra textbook treats real numbers as exactly real. Add two of them, multiply, divide, take a logarithm — the result is the result. No surprises.

This is fine on paper. It is false on a CPU.

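A 32-bit float (fp32) holds roughly 7 decimal digits of precision. Its 32 bits allow at most 2^32 bit patterns, and slightly fewer distinct values once the NaN encodings and the two zeros are accounted for; every other real number gets rounded to the nearest representable one. 0.1 is not representable exactly. 0.2 is not either. Even in Python, whose float is the 64-bit fp64 with roughly 16 decimal digits, their sum is not 0.3. This isn't a bug; it's the design.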

>>> 0.1 + 0.2 == 0.3
False

When you write code in Phase 7 to compute gradients, the chain rule will run through dozens of operations. Each one rounds. Each one may amplify a previous error. If you've never thought about the substrate, your gradients will silently drift, your loss will diverge, and you'll spend an afternoon debugging "the optimizer" when the real problem is that you computed log(1 + 1e-8) in fp32 and got exactly zero, because 1 + 1e-8 rounds to 1.
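The log(1 + 1e-8) failure is easy to reproduce. The following is a minimal sketch, not part of the course code. Python's standard library has no fp32 type, so a hypothetical helper `f32` emulates one by round-tripping values through `struct`, and `math.log1p` stands in as one common stable alternative.

```python
import math
import struct


def f32(x):
    """Round an fp64 Python float to the nearest fp32 value (hypothetical helper)."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:  # too large for fp32: treat as overflow to +/-inf
        return math.copysign(math.inf, x)


x = f32(1e-8)

# Naive: 1 + x is rounded to fp32 first. The spacing between fp32 values
# near 1.0 is about 1.2e-7, so the 1e-8 is rounded away entirely.
one_plus_x = f32(1.0 + x)
naive = f32(math.log(one_plus_x))

# Stable: log1p never forms 1 + x, so the small term survives.
stable = f32(math.log1p(x))

print(one_plus_x == 1.0)  # True
print(naive)              # 0.0
print(stable)             # close to 1e-8
```

The emulation rounds after every operation, which matches fp32 for a single addition; library functions such as `log` may differ in the last bit on real hardware.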

The thesis of Phase 2

Phase 2 trains one habit:

Before you write exp(x) or log(p) or 1 - q or sum(arr), ask: what magnitudes do x, p, q, and the elements of arr actually have? Will the floating-point operation preserve the precision I need, or quietly destroy it?

By the end of Phase 2, you should be able to read a numeric expression and tag each subexpression with its failure mode — "this is fine", "this overflows for x > 89", "this loses 6 digits when b ≈ a", "this NaNs if any element is -inf" — before you run it.
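As one illustration of the "loses digits when b ≈ a" tag, the sketch below subtracts two nearly equal numbers in emulated fp32. The helper `f32` and the chosen values are assumptions made for the example, not taken from the labs.

```python
import math
import struct


def f32(x):
    """Round an fp64 Python float to the nearest fp32 value (hypothetical helper)."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:  # too large for fp32: treat as overflow to +/-inf
        return math.copysign(math.inf, x)


a = f32(1.0000001)  # the nearest fp32 value is 1 + 2**-23, about 1.00000012
b = f32(1.0)

# The subtraction itself is exact in fp32. The damage was done when a was
# stored: its ~7 good digits are almost all cancelled by b.
diff = f32(a - b)
intended = 1e-7
rel_err = abs(diff - intended) / intended

print(diff)     # 1.1920928955078125e-07
print(rel_err)  # about 0.19
```

Both inputs were accurate to about 7 digits, yet the difference is off by roughly 19%. That is the pattern to flag whenever an expression subtracts quantities of similar magnitude.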

What "the substrate" actually is

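The substrate is IEEE-754: a standard first published in 1985 (and revised in 2008 and 2019) that pins down, bit by bit, how computers store and operate on approximate real numbers. Nearly every CPU, GPU, and accelerator built since implements (some subset of) IEEE-754, which is why basic operations such as addition and multiplication round the same way across vendors. That does not make whole-model outputs bit-identical everywhere: reduction order, fused multiply-add, and math-library functions such as exp can still differ between platforms.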

An fp32 number is 32 bits split into:

1 sign bit (0 = positive, 1 = negative).
8 exponent bits (offset-by-127 to allow negative exponents).
23 mantissa bits (the "significand" — the actual digits, in binary).

The value is (-1)^sign × 1.mantissa × 2^(exponent − 127). The "1." is implicit for normal numbers; this gives one free bit of precision.
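The layout and formula above can be checked with a few lines of standard-library Python. This is a sketch with hypothetical function names (`decode_fp32`, `value_of_normal`); it covers normal numbers only and leaves the special cases to theory 01.

```python
import struct


def decode_fp32(x):
    """Split x (rounded to fp32) into its sign, exponent, and mantissa fields."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, still biased by 127
    mantissa = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, mantissa


def value_of_normal(sign, exponent, mantissa):
    """Apply (-1)^sign * 1.mantissa * 2^(exponent - 127); normal numbers only."""
    if not 1 <= exponent <= 254:
        raise ValueError("zero, denormals, Inf and NaN follow different rules")
    return (-1.0) ** sign * (1.0 + mantissa / 2**23) * 2.0 ** (exponent - 127)


for x in (1.0, -2.5, 0.1):
    fields = decode_fp32(x)
    print(x, fields, value_of_normal(*fields))
# 1.0 (0, 127, 0) 1.0
# -2.5 (1, 128, 2097152) -2.5
# 0.1 (0, 123, 5033165) 0.10000000149011612
```

The last line shows the rounding directly: the fp32 value stored for 0.1 is slightly larger than 0.1.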

Theory 01-ieee754-anatomy.md does the full decode, plus the carve-outs for zero, denormals, ±Inf, NaN, and rounding modes. You will memorize the layout by the end.

Why this matters for AI specifically, through the §A13 lens

The microscopic scope of this project (§A13) is English verb grammar: 20 verbs × 5 tenses × 3 persons + Spanish pairs ≈ 600 forms. Borja's tiny model will, at every forward pass, compute a probability distribution over these ~600 forms — a softmax over a length-600 logit vector. Five concrete numerical pitfalls land in that single computation:

The softmax's exp overflows fp32 when any logit is ≳ 89. A model with un-normalized logits routinely produces such values during early training. Naive exp(logits) / sum(exp(logits)) returns inf / inf = NaN. The -max trick (theory 02) fixes it.
The probabilities themselves are tiny. Uniform over 600 forms is 1/600 ≈ 0.00167. Trained probabilities for low-likelihood forms can be 1e-20 or smaller. log(p) for p = 1e-20 is about -46; subtracting two such log-values that are nearly equal cancels their leading digits and leaves mostly rounding error. This is catastrophic cancellation (theory 03).
Summing 600 tiny probabilities accumulates error. A naive sum of 600 values of magnitude 1e-10 has a worst-case relative error of O(600 × ε), where ε is the machine epsilon. Kahan summation (theory 03) keeps the error at O(ε), essentially independent of the number of terms.
Floating-point addition is not associative, so reduction order matters. Summing [a, b, c] left-to-right vs right-to-left can give different fp32 results when magnitudes differ. CI tests for "the same model produces the same output" depend on freezing the reduction order. This bites hard in distributed training (Phase 35).
Cross-entropy −log p_correct blows up when p_correct ≈ 0. A confident-wrong prediction produces a near-zero probability for the true label (say the model assigns probability 1e-30 to worked when worked is the right answer); the negative log is huge; the gradient is huge; weights move enormously in one step. "Loss explosion at step 1" is almost always this.
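The first pitfall in the list, and the -max fix that theory 02 derives, can be sketched as follows. This is an illustrative sketch under stated assumptions: fp32 is emulated with the hypothetical helpers `f32` and `exp32`, and the logit vector is invented for the example.

```python
import math
import struct


def f32(x):
    """Round an fp64 Python float to the nearest fp32 value (hypothetical helper)."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:  # too large for fp32: treat as overflow to +/-inf
        return math.copysign(math.inf, x)


def exp32(x):
    """exp(x) with the result rounded to fp32 (hypothetical helper)."""
    try:
        return f32(math.exp(x))
    except OverflowError:  # beyond even the fp64 range
        return math.inf


def naive_softmax(logits):
    exps = [exp32(z) for z in logits]
    total = f32(sum(exps))
    return [e / total for e in exps]


def stable_softmax(logits):
    m = max(logits)
    exps = [exp32(z - m) for z in logits]  # largest term is exp(0) = 1
    total = f32(sum(exps))
    return [f32(e / total) for e in exps]


# Invented example: one large logit among 600 forms.
logits = [90.0] + [0.0] * 599

naive = naive_softmax(logits)
stable = stable_softmax(logits)

print(naive[0])     # nan, because exp(90) overflows fp32 and inf / inf is NaN
print(stable[0])    # 1.0
print(sum(stable))  # close to 1.0
```

Subtracting the maximum does not change the mathematical result, since the common factor exp(-m) cancels between numerator and denominator; it only moves every exponent into a range where exp cannot overflow.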

Every one of these is a Phase-2 phenomenon. None of them require any AI knowledge to understand. All of them will break a model in later phases if Phase 2 didn't take.
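The reduction-order pitfall, for instance, needs nothing more than three numbers. The sketch below emulates fp32 with a hypothetical `f32` helper; the values are chosen for illustration only.

```python
import math
import struct


def f32(x):
    """Round an fp64 Python float to the nearest fp32 value (hypothetical helper)."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:  # too large for fp32: treat as overflow to +/-inf
        return math.copysign(math.inf, x)


def sum_f32(values):
    """Sequential sum that rounds to fp32 after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total


values = [f32(1.0), f32(1e8), f32(-1e8)]

left_to_right = sum_f32(values)
right_to_left = sum_f32(reversed(values))

print(left_to_right)  # 0.0: the 1.0 is absorbed by 1e8 before the cancellation
print(right_to_left)  # 1.0: the large terms cancel first, so the 1.0 survives
```

Near 1e8 adjacent fp32 values are 8 apart, so adding 1.0 changes nothing. The same three inputs give two different sums depending only on the order of addition.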

The path through Phase 2

Theory 01 does the anatomy: bit layouts, denormals, NaN, ±Inf, rounding. The flat exposition you reference forever.
Theory 02 does the one most important numerical algorithm in deep learning: the -max trick for stable softmax. Derived from first principles, applied to a length-5 tense-classification vector (infinitive / present / past / past-participle / future).
Theory 03 does summation: catastrophic cancellation, Kahan, Neumaier. Applied to summing the per-form probability mass of 600 verb conjugations.
Theory 04 surveys the precision zoo (BF16, TF32, FP8, INT8/INT4) — bit layouts and trade-offs. Read it as foreshadowing of Phase 26.
Labs 00–03 make you do the measurements yourself. Decode bit patterns by hand. Break naive softmax on tense logits. Measure Kahan's improvement on a 10⁶-element sum of ~1/600-magnitude probabilities. Round to int8 and back.
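As a preview of the kind of measurement the summation lab asks for, here is a minimal sketch of Kahan summation against a naive sum, both in emulated fp32. The helper names and the problem size (60,000 terms instead of the lab's 10⁶, to keep pure Python fast) are assumptions for the example, and `math.fsum` serves as the reference.

```python
import math
import struct


def f32(x):
    """Round an fp64 Python float to the nearest fp32 value (hypothetical helper)."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:  # too large for fp32: treat as overflow to +/-inf
        return math.copysign(math.inf, x)


def naive_sum_f32(values):
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total


def kahan_sum_f32(values):
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = f32(v - comp)                 # re-inject what was lost last time
        t = f32(total + y)                # low bits of y are lost here...
        comp = f32(f32(t - total) - y)    # ...and recovered here
        total = t
    return total


values = [f32(1.0 / 600.0)] * 60_000
reference = math.fsum(values)  # correctly rounded fp64 sum of the same inputs

naive_err = abs(naive_sum_f32(values) - reference)
kahan_err = abs(kahan_sum_f32(values) - reference)

print(reference)
print(naive_err)
print(kahan_err)
```

Running it should show the naive error several orders of magnitude above the compensated one. The exact figures depend on the inputs, which is why the lab has you measure them rather than take them on faith.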

What this phase is NOT

Phase 2 is not a numerical analysis course. We do not cover interval arithmetic, posits, or fixed-point DSP formats. We do not prove error bounds for matrix factorizations (Phase 3 touches them in passing). We do not optimize for specific hardware (Phase 24 covers SIMD-aware kernels).

Phase 2 is a literacy phase: by the end, you can read floating-point code and see the failure modes that other people only discover at debug time.

Stop here if

You are tempted to skip Phase 2 because "I've programmed in Python for years; I know about float precision." Don't. The test is not "can you say the word denormal." The test is: given a length-600 logit vector with one element at 92 and the others at -3, can you predict — without running it — which of the three softmax implementations on your screen produces a valid distribution? If you can't yet, Phase 2 is for you.

One-paragraph recap

Floating-point arithmetic is an approximate, non-associative number system with well-defined failure modes (overflow, underflow, denormals, NaN, catastrophic cancellation). Every later phase of this curriculum operates on fp32/fp64 tensors and inherits those failure modes. Phase 2 teaches you to predict the failures by reading code, anchored to one running example: softmax + cross-entropy over the ~600-form English-verb-conjugation vocabulary defined in §A13.

What this phase does NOT cover

Gradients through numerical primitives (Phase 4 / 7).
A reusable src/minigrad/numerics.py module (Phase 7, when the autograd consumer exists).
Real production quantization (Phase 26).
GPU floating-point hardware (Phase 23+).
Multi-precision arithmetic libraries (out of scope).

Next: theory/01-ieee754-anatomy.md.