English · Español
03 — Summation order, cancellation, and Kahan¶
🇪🇸 La segunda clase de fallos numéricos: cuando sumas muchos números o restas dos casi iguales, los bits "bajos" desaparecen. La cancelación catastrófica y la suma de Kahan son dos caras del mismo problema: la mantisa es finita. Ejemplo concreto: sumar las 600 probabilidades por forma verbal del vocabulario (§A13) y verificar que la suma da
1.0con suficiente precisión.
The setup¶
Naive floating-point addition has error proportional to the sum length. If you add N numbers each of magnitude μ, the result has worst-case absolute error O(N · ε · μ) where ε is machine epsilon. For N = 10⁶ fp32 additions, that's 10⁶ × 1.2 × 10⁻⁷ × μ ≈ 0.12 · μ — a 12% relative error, on a sum that should be exact to 7 digits.
In §A13's universe, the most common summation pattern is "normalize a vector of probabilities" or "compute the partition function". A length-600 vector of probabilities, each of magnitude ~1/600, summed naively, accumulates 600 × ε × (1/600) ≈ ε. Tolerable but not negligible. A length-10⁶ accumulator over many training steps' losses is not tolerable.
This page does two things:
- Quantifies the failure modes (catastrophic cancellation, error growth in summation).
- Gives you the two standard fixes (Kahan, Neumaier).
Failure mode 1 — catastrophic cancellation¶
Subtract two nearly equal floating-point numbers. The result is small, but the leading digits cancel, and what's left is determined by the low-order mantissa bits — which are precisely the bits with the largest relative rounding error.
Concrete example. Say two probabilities for adjacent verb forms in our distribution are very close:
Both are stored to ~7 decimal digits of precision; the actual representable values differ from the stated decimals by ~p × ε.
Compute p_a - p_b symbolically: 7e-8. Compute it in fp32: the leading 6 digits cancel, and what's left is the difference of the last ~1 bit of each operand's mantissa. The result is correct to within ~1 ULP (unit in the last place), but that ULP is now the entire magnitude of the answer. The relative error of the result is 100%.
This is catastrophic cancellation. It doesn't NaN, doesn't overflow, doesn't print a warning. It just gives you a number that looks plausible and is total noise.
Patterns that hide catastrophic cancellation¶
The C-style formulation is (big_number) - (almost_the_same_big_number). Where this hides in ML:
log(1 + x)for tinyx.1 + x ≈ 1; the addition discardsx's significant bits; thenlog(1)is exactly0. Uselog1p(x)which is computed via a Taylor series and preserves the bits.exp(x) - 1for tinyx. Same shape. Useexpm1(x).sqrt(1 + x²) - 1(Pythagorean distance). Numerically unstable for smallx. Use the algebraic rearrangementx² / (sqrt(1+x²) + 1).- Subtractive softmax debugging.
softmax(x)[i] - softmax(y)[i]for two nearly identical logit vectors will look noisy on small probabilities. Compute log-probabilities and subtract those instead. - Variance via
E[X²] - E[X]²(the "naïve" formula). IfE[X]is large,E[X²]andE[X]²are both large and nearly equal; subtracting cancels. Use Welford's online algorithm.
A useful heuristic: if your formula has a subtraction of two terms that you expect to be close, ask whether you can algebraically transform it into a sum.
Failure mode 2 — drift in long summation¶
Naive summation accumulates one rounding error per addition. After N additions:
Where s_i is the running partial sum. In the worst case (all addends positive), |s_i| grows linearly with i, giving total error O(N² · ε). In the average case (random signs), the error is O(√N · ε) (random walk). For all-positive sums of similar-magnitude terms — the typical ML use case — the relevant bound is O(N · ε).
For N = 10⁶ and ε = 1.19 × 10⁻⁷ (fp32), the bound is ~0.12. Twelve percent. The fp64 case (ε ≈ 2.2 × 10⁻¹⁶) bounds at ~2.2 × 10⁻¹⁰ — invisible — which is why "use fp64 for accumulators" is a common pragma.
The fix — Kahan summation¶
The idea: track the rounding error from each addition and reinject it on the next step.
sum = 0 # running total
c = 0 # running compensation (the "lost" bits from prior rounds)
for x in seq:
y = x - c # correct the input by the accumulated error
t = sum + y # this addition discards low-order bits of y
c = (t - sum) - y # what got discarded
sum = t
The magic is the line c = (t - sum) - y. In exact arithmetic, this is zero. In fp arithmetic, t = sum + y rounded, so t - sum (computed in fp) is not exactly y — it's y minus whatever was discarded. So (t - sum) - y is the negative of what was discarded. Next iteration, that gets added back via x - c.
The total error becomes O(ε) independent of N. Specifically:
For N = 10⁶ fp32, the bound is ~2.4 × 10⁻⁷ of the sum — five orders of magnitude better than naive.
Neumaier's variant¶
Kahan's compensation step assumes |sum| ≥ |y|. If |y| > |sum| (the new term is huge), Kahan loses precision. Neumaier (1974) added a guard:
sum = 0
c = 0
for x in seq:
t = sum + x
if abs(sum) >= abs(x):
c += (sum - t) + x # lost low bits of x
else:
c += (x - t) + sum # lost low bits of sum
sum = t
sum += c # add accumulated compensation at the end
For ML summations of similar-magnitude terms, Kahan and Neumaier perform identically. For mixed-magnitude (sum of 1e10 and 1e-10 terms), Neumaier wins.
A worked example — summing 600 verb-form probabilities¶
§A13 vocabulary: 600 forms. Suppose your model's predicted probability for each form, in fp32, sums (mathematically) to exactly 1.0. Each p_i ≈ 1/600 ≈ 0.001667.
Naive summation:
Worst-case error: 600 × ε × 1/600 = ε ≈ 1.2 × 10⁻⁷. So total lands in [1 - 1.2e-7, 1 + 1.2e-7]. Fine for almost every use.
But scale this up. Suppose Borja's training loop accumulates per-batch losses over 100,000 batches × 64 examples each = 6.4 × 10⁶ summands:
Now N = 100,000 (or 6.4 × 10⁶ if you sum per-example). Error grows to N × ε × μ, which for μ ≈ 2.0 (typical CE) and N = 10⁵ gives ~0.02. Two percent error on the reported running average. The "average loss" Borja sees in his training log is wrong by 2% just from naïve summation.
The fix:
Or, equivalently in NumPy: use np.float64 for the accumulator and np.float32 for the data. Casting up gives you Kahan-equivalent precision for free, at the cost of 8 bytes per accumulator (negligible). Lab 02 measures both approaches.
When to actually use Kahan vs fp64 accumulators¶
| Situation | Best fix |
|---|---|
| Inner loop, performance-critical, no Kahan-aware kernel | fp64 accumulator |
| You only have fp32 hardware and the inner loop is large | Kahan |
| You're computing variance / mean | Welford's online algorithm |
| Summing gradients across devices | Tree-reduction (built into NCCL, FSDP) |
Computing log-likelihood over N tokens |
logsumexp is a stable sum; no extra trick needed |
For Phase 2's labs, you'll implement Kahan in NumPy and compare to fp64-accumulator naive. The takeaway: for ML, "use fp64 for accumulators" is the cheapest correct fix in almost every case. Kahan is for when fp64 isn't available (some embedded inference paths).
Determinism¶
A subtle consequence of non-associativity: the order of summation changes the result, even with the same set of inputs. For reproducibility:
- Fix the order.
np.sum(arr)uses pairwise summation (already much better than naive); its order depends on shape, axis, and stride. - Avoid parallel reductions in places where bit-exact reproducibility matters. NumPy is single-threaded by default; PyTorch is not (Phase 25).
- CI tests that assert
loss == expected_lossto 1e-6 will be flaky if the reduction order varies between runs (e.g., depending onOMP_NUM_THREADS). Use relative tolerance or fp64 accumulators.
scripts/seed_everything (Phase 0) sets thread counts to 1 for this reason during deterministic tests.
Drill problems¶
Solutions in solutions/03-summation-ref.md (phase-open).
- Sum the fp32 vector
[1.0] + [1e-9] * 10⁶naively. What do you expect? What does Kahan give? What does fp64 accumulator give? Why? - Compute
var([1e10 + 1, 1e10 + 2, 1e10 + 3])two ways: with the naiveE[X²] - E[X]²formula and with Welford's online algorithm. Compare to the exact answer (var = 2/3). - Compute
log(1 + 1e-9)andlog1p(1e-9). Predict the relative error of each; then verify. - Show that Kahan summation of a length-
Nlist of numbers of equal magnitude has total errorO(ε), notO(N·ε). (Hint: bound the compensation term.) - Given the §A13 vocabulary's 600 probabilities, each
~1/600 + δ_iwhereΣ δ_i = 0(the distribution is normalized), what is the expected naïve-summation error? When does it matter?
One-paragraph recap¶
Naive fp summation accumulates one ε of error per addition; over N terms it grows to O(N·ε), which is ~12% for N = 10⁶ in fp32 — unacceptable for ML accumulators. Catastrophic cancellation in subtractions of nearly-equal values destroys the result silently. The two fixes are (1) Kahan summation (track and reinject the rounding error each step, giving O(ε) total) and (2) log1p / expm1 / Welford's algorithm and similar reformulations for cancellation-prone expressions. For ML, the cheapest correct fix is "use fp64 for accumulators"; reach for Kahan when fp64 isn't available.
What this page does NOT cover¶
- Parallel-tree reductions (deferred to Phase 35 distributed).
- Pairwise summation (NumPy's default; a middle ground between naive and Kahan).
- Compensated dot products. Out of scope.
- Exact arithmetic (
fractions,Decimal). Used as oracles in tests only.
Next: theory/04-precision-zoo.md.