English · Español

03 — Summation order, cancellation, and Kahan¶

🇪🇸 La segunda clase de fallos numéricos: cuando sumas muchos números o restas dos casi iguales, los bits "bajos" desaparecen. La cancelación catastrófica y la suma de Kahan son dos caras del mismo problema: la mantisa es finita. Ejemplo concreto: sumar las 600 probabilidades por forma verbal del vocabulario (§A13) y verificar que la suma da 1.0 con suficiente precisión.

The setup¶

Naive floating-point addition has error proportional to the sum length. If you add N numbers each of magnitude μ, the result has worst-case absolute error O(N · ε · μ) where ε is machine epsilon. For N = 10⁶ fp32 additions, that's 10⁶ × 1.2 × 10⁻⁷ × μ ≈ 0.12 · μ — a 12% relative error, on a sum that should be exact to 7 digits.

In §A13's universe, the most common summation pattern is "normalize a vector of probabilities" or "compute the partition function". A length-600 vector of probabilities, each of magnitude ~1/600, summed naively, accumulates 600 × ε × (1/600) ≈ ε. Tolerable but not negligible. A length-10⁶ accumulator over many training steps' losses is not tolerable.

This page does two things:

Quantifies the failure modes (catastrophic cancellation, error growth in summation).
Gives you the two standard fixes (Kahan, Neumaier).

Failure mode 1 — catastrophic cancellation¶

Subtract two nearly equal floating-point numbers. The result is small, but the leading digits cancel, and what's left is determined by the low-order mantissa bits — which are precisely the bits with the largest relative rounding error.

Concrete example. Say two probabilities for adjacent verb forms in our distribution are very close:

p_a = 0.16739871        # P(token = "worked", in fp32)
p_b = 0.16739864        # P(token = "walked", in fp32)

Both are stored to ~7 decimal digits of precision; the actual representable values differ from the stated decimals by ~p × ε.

Compute p_a - p_b symbolically: 7e-8. Compute it in fp32: the leading 6 digits cancel, and what's left is the difference of the last ~1 bit of each operand's mantissa. The result is correct to within ~1 ULP (unit in the last place), but that ULP is now the entire magnitude of the answer. The relative error of the result is 100%.

This is catastrophic cancellation. It doesn't NaN, doesn't overflow, doesn't print a warning. It just gives you a number that looks plausible and is total noise.

Patterns that hide catastrophic cancellation¶

The C-style formulation is (big_number) - (almost_the_same_big_number). Where this hides in ML:

log(1 + x) for tiny x. 1 + x ≈ 1; the addition discards x's significant bits; then log(1) is exactly 0. Use log1p(x) which is computed via a Taylor series and preserves the bits.
exp(x) - 1 for tiny x. Same shape. Use expm1(x).
sqrt(1 + x²) - 1 (Pythagorean distance). Numerically unstable for small x. Use the algebraic rearrangement x² / (sqrt(1+x²) + 1).
Subtractive softmax debugging. softmax(x)[i] - softmax(y)[i] for two nearly identical logit vectors will look noisy on small probabilities. Compute log-probabilities and subtract those instead.
Variance via E[X²] - E[X]² (the "naïve" formula). If E[X] is large, E[X²] and E[X]² are both large and nearly equal; subtracting cancels. Use Welford's online algorithm.

A useful heuristic: if your formula has a subtraction of two terms that you expect to be close, ask whether you can algebraically transform it into a sum.

Failure mode 2 — drift in long summation¶

Naive summation accumulates one rounding error per addition. After N additions:

\[ \mathrm{error}_\mathrm{naive} \leq N \cdot \varepsilon \cdot \max_i |s_i| \]

Where s_i is the running partial sum. In the worst case (all addends positive), |s_i| grows linearly with i, giving total error O(N² · ε). In the average case (random signs), the error is O(√N · ε) (random walk). For all-positive sums of similar-magnitude terms — the typical ML use case — the relevant bound is O(N · ε).

For N = 10⁶ and ε = 1.19 × 10⁻⁷ (fp32), the bound is ~0.12. Twelve percent. The fp64 case (ε ≈ 2.2 × 10⁻¹⁶) bounds at ~2.2 × 10⁻¹⁰ — invisible — which is why "use fp64 for accumulators" is a common pragma.

The fix — Kahan summation¶

The idea: track the rounding error from each addition and reinject it on the next step.

sum   = 0          # running total
c     = 0          # running compensation (the "lost" bits from prior rounds)
for x in seq:
    y    = x - c       # correct the input by the accumulated error
    t    = sum + y     # this addition discards low-order bits of y
    c    = (t - sum) - y   # what got discarded
    sum  = t

The magic is the line c = (t - sum) - y. In exact arithmetic, this is zero. In fp arithmetic, t = sum + y rounded, so t - sum (computed in fp) is not exactly y — it's y minus whatever was discarded. So (t - sum) - y is the negative of what was discarded. Next iteration, that gets added back via x - c.

The total error becomes O(ε) independent of N. Specifically:

\[ \mathrm{error}_\mathrm{Kahan} \leq (2 \varepsilon + O(N \varepsilon^2)) \cdot \sum |s_i| \]

For N = 10⁶ fp32, the bound is ~2.4 × 10⁻⁷ of the sum — five orders of magnitude better than naive.

Neumaier's variant¶

Kahan's compensation step assumes |sum| ≥ |y|. If |y| > |sum| (the new term is huge), Kahan loses precision. Neumaier (1974) added a guard:

sum   = 0
c     = 0
for x in seq:
    t = sum + x
    if abs(sum) >= abs(x):
        c += (sum - t) + x       # lost low bits of x
    else:
        c += (x - t) + sum       # lost low bits of sum
    sum = t
sum += c                         # add accumulated compensation at the end

For ML summations of similar-magnitude terms, Kahan and Neumaier perform identically. For mixed-magnitude (sum of 1e10 and 1e-10 terms), Neumaier wins.

A worked example — summing 600 verb-form probabilities¶

§A13 vocabulary: 600 forms. Suppose your model's predicted probability for each form, in fp32, sums (mathematically) to exactly 1.0. Each p_i ≈ 1/600 ≈ 0.001667.

Naive summation:

total = 0.0
for p in probabilities:    # length 600
    total += p

Worst-case error: 600 × ε × 1/600 = ε ≈ 1.2 × 10⁻⁷. So total lands in [1 - 1.2e-7, 1 + 1.2e-7]. Fine for almost every use.

But scale this up. Suppose Borja's training loop accumulates per-batch losses over 100,000 batches × 64 examples each = 6.4 × 10⁶ summands:

running_loss = 0.0
for batch in train_loader:
    running_loss += loss_of_batch

Now N = 100,000 (or 6.4 × 10⁶ if you sum per-example). Error grows to N × ε × μ, which for μ ≈ 2.0 (typical CE) and N = 10⁵ gives ~0.02. Two percent error on the reported running average. The "average loss" Borja sees in his training log is wrong by 2% just from naïve summation.

The fix:

total = 0.0
c     = 0.0
for p in probabilities:
    y     = p - c
    t     = total + y
    c     = (t - total) - y
    total = t

Or, equivalently in NumPy: use np.float64 for the accumulator and np.float32 for the data. Casting up gives you Kahan-equivalent precision for free, at the cost of 8 bytes per accumulator (negligible). Lab 02 measures both approaches.

When to actually use Kahan vs fp64 accumulators¶

Situation	Best fix
Inner loop, performance-critical, no Kahan-aware kernel	fp64 accumulator
You only have fp32 hardware and the inner loop is large	Kahan
You're computing variance / mean	Welford's online algorithm
Summing gradients across devices	Tree-reduction (built into NCCL, FSDP)
Computing log-likelihood over `N` tokens	`logsumexp` is a stable sum; no extra trick needed

For Phase 2's labs, you'll implement Kahan in NumPy and compare to fp64-accumulator naive. The takeaway: for ML, "use fp64 for accumulators" is the cheapest correct fix in almost every case. Kahan is for when fp64 isn't available (some embedded inference paths).

Determinism¶

A subtle consequence of non-associativity: the order of summation changes the result, even with the same set of inputs. For reproducibility:

Fix the order. np.sum(arr) uses pairwise summation (already much better than naive); its order depends on shape, axis, and stride.
Avoid parallel reductions in places where bit-exact reproducibility matters. NumPy is single-threaded by default; PyTorch is not (Phase 25).
CI tests that assert loss == expected_loss to 1e-6 will be flaky if the reduction order varies between runs (e.g., depending on OMP_NUM_THREADS). Use relative tolerance or fp64 accumulators.

scripts/seed_everything (Phase 0) sets thread counts to 1 for this reason during deterministic tests.

Drill problems¶

Solutions in solutions/03-summation-ref.md (phase-open).

Sum the fp32 vector [1.0] + [1e-9] * 10⁶ naively. What do you expect? What does Kahan give? What does fp64 accumulator give? Why?
Compute var([1e10 + 1, 1e10 + 2, 1e10 + 3]) two ways: with the naive E[X²] - E[X]² formula and with Welford's online algorithm. Compare to the exact answer (var = 2/3).
Compute log(1 + 1e-9) and log1p(1e-9). Predict the relative error of each; then verify.
Show that Kahan summation of a length-N list of numbers of equal magnitude has total error O(ε), not O(N·ε). (Hint: bound the compensation term.)
Given the §A13 vocabulary's 600 probabilities, each ~1/600 + δ_i where Σ δ_i = 0 (the distribution is normalized), what is the expected naïve-summation error? When does it matter?

One-paragraph recap¶

Naive fp summation accumulates one ε of error per addition; over N terms it grows to O(N·ε), which is ~12% for N = 10⁶ in fp32 — unacceptable for ML accumulators. Catastrophic cancellation in subtractions of nearly-equal values destroys the result silently. The two fixes are (1) Kahan summation (track and reinject the rounding error each step, giving O(ε) total) and (2) log1p / expm1 / Welford's algorithm and similar reformulations for cancellation-prone expressions. For ML, the cheapest correct fix is "use fp64 for accumulators"; reach for Kahan when fp64 isn't available.

What this page does NOT cover¶

Parallel-tree reductions (deferred to Phase 35 distributed).
Pairwise summation (NumPy's default; a middle ground between naive and Kahan).
Compensated dot products. Out of scope.
Exact arithmetic (fractions, Decimal). Used as oracles in tests only.

Next: theory/04-precision-zoo.md.