Skip to content

English · Español

Lab 02 — Measure summation error and fix it with Kahan

Goal: see a naive fp32 sum stop growing, then watch Kahan summation save it. Anchor: sum the 600 per-form probabilities of the §A13 verb vocabulary and verify the total equals 1.0 to ε precision.

Estimated time: 60–90 minutes.

Prereq: theory 03-summation-and-cancellation.md read.


What you produce

A directory experiments/02-kahan-vs-naive/ containing:

  • summation.py — implementations of naive_sum, pairwise_sum (NumPy default emulation), kahan_sum, neumaier_sum.
  • experiment.py — driver that runs all four on the three test workloads (below) and records error against an fp64 reference.
  • results.json — per-workload error numbers.
  • error_growth.png — log-log plot of cumulative error vs N.
  • manifest.json.
  • README.md — interpretation.

The three workloads

Workload 1 — uniform §A13 vocabulary probabilities

A length-600 fp32 array, all entries equal to 1.0 / 600.0. Sum should be exactly 1.0.

probs = np.full(600, 1.0/600.0, dtype=np.float32)

Naive sum, pairwise sum, Kahan sum. Record each result and the error relative to 1.0.

Workload 2 — 10⁶ samples of a small-magnitude distribution

Generate N = 10⁶ fp32 values, each ~1/600 + δ_i where δ_i ~ Normal(0, 1e-5). Use np.random.default_rng(42).

N = 1_000_000
rng = np.random.default_rng(42)
x = (1.0/600.0 + rng.standard_normal(N).astype(np.float32) * 1e-5).astype(np.float32)
expected = x.astype(np.float64).sum()  # fp64 reference

Run all four summation strategies. Plot error vs N in increments (sweep N from 10³ to 10⁶ by factors of 10) — this is error_growth.png.

Workload 3 — one large + many tiny

Construct a worst-case mixed-magnitude sum: one 1.0e6 followed by 10⁶ ones.

x = np.array([1.0e6] + [1.0] * 1_000_000, dtype=np.float32)
expected = 1_001_000.0  # exact in fp32 (representable)

Predict: naive sum loses every 1.0 after the running total exceeds 2²³ ≈ 8.4e6 (the point where 1.0 is smaller than the running total's ULP). Kahan keeps all of them. Verify.

TODOs

Block A — implementations

summation.py:

def naive_sum(arr):
    total = 0.0
    for x in arr:
        total += x
    return total

def pairwise_sum(arr):
    # Recursive halving — like np.sum's default
    if len(arr) <= 8:
        return float(arr.sum())   # base case OK to use np.sum here
    mid = len(arr) // 2
    return pairwise_sum(arr[:mid]) + pairwise_sum(arr[mid:])

def kahan_sum(arr):
    # TODO: implement per theory/03 §"the fix"
    ...

def neumaier_sum(arr):
    # TODO: implement per theory/03 §"Neumaier's variant"
    ...

All must accept a 1D np.float32 array and return a float (Python's native fp64 is fine for the return type — the discipline is internal).

Variant: also write fp64_accumulator_naive(arr) that does total = np.float64(0.0) and accumulates in fp64. This is the "cheapest correct fix" comparison.

Block B — predict before measuring

In README.md, before running, predict for each workload:

  • Naive fp32 error magnitude.
  • Pairwise fp32 error magnitude.
  • Kahan fp32 error magnitude.
  • fp64-accumulator-on-fp32-input error magnitude.

Use theory 03 § "The fix" to predict. Then run and compare.

Block C — workload 1: §A13 vocabulary

For the length-600 uniform probability vector, all four methods should give errors comfortably below ε. State which method is best and why.

🇪🇸 La intuición: 600 sumas de magnitudes idénticas tienen error naïve O(600 · ε) ≈ 7e-5. Kahan: O(ε) ≈ 1e-7. fp64 accumulator: O(ε_{fp64}) ≈ 2e-16. Si crees que total debe ser bit-exacto 1.0, ninguno de los métodos te lo garantiza — los δ_i de redondeo en 1/600 mismo ya rompen la igualdad exacta.

Block D — workload 2: error vs N

Generate the 10⁶ array. For N ∈ [10³, 10⁴, 10⁵, 10⁶], compute each summation method's error and record. Plot error_growth.png:

  • x-axis: N (log).
  • y-axis: absolute error vs fp64 reference (log).
  • One curve per method.
  • Annotations: theoretical bounds O(N · ε) for naive, O(log N · ε) for pairwise, O(ε) for Kahan.

The expected shape is naive growing linearly, pairwise as log N, Kahan flat. Comment on whether your measurements match in README.md.

Block E — workload 3: catastrophic case

Run all four methods on the [1e6] + [1.0] * 10⁶ array. Predict and verify:

  • naive_sum returns... what? At what index does adding 1.0 to the running total become a no-op?
  • pairwise_sum is somewhat better but still not perfect.
  • kahan_sum and neumaier_sum should both recover the exact answer (or within 1 ULP).

Compute the index at which naive stops growing: when total > 2²³ ≈ 8.4e6. For our workload, total reaches 1e6 + i × 1.0, so naive stops growing at i ≈ 7.4e6 - 1e6 = ... — but the array is only 1e6 long, so for this specific case the issue is more subtle. Work it out, write down the answer in README.md.

Block F — the practical recommendation

In README.md, conclude with one paragraph: which of these methods would Borja use in actual training code, when. Anchor on this scenario:

Training MiniGPT on the §A13 vocabulary, you accumulate per-batch losses over 100,000 batches × 64 examples each = 6.4 million summands per epoch. Each summand is a positive fp32 loss in [0, 10]. Naive accumulation gives ~5% relative error. Which fix?

Expected answer: total = np.float64(0.0) and accumulate in fp64. Trivial code change, no Kahan complexity, sufficient precision. Reach for Kahan only if you're stuck in fp32 (some embedded inference paths).

Constraints

  • No np.sum in naive_sum, kahan_sum, or neumaier_sum. Roll the loop. The point is to see the accumulation.
  • Pure Python loops are slow. For N = 10⁶ they may take minutes. For the workload-2 plot you can either tolerate the slowness or implement the loops with numba.njit (optional; pin in experiments group only if you go this route — don't bloat the env).
  • Use the fp64 reference always. Never the "expected mathematical answer" — the fp32 sum has a genuine roundoff floor that fp64 reveals.

Stop conditions

Done when:

  1. Four summation implementations work and self-test (sum of [1.0]*10 returns 10.0).
  2. results.json contains errors for all three workloads × four methods.
  3. error_growth.png shows the predicted scaling.
  4. README.md makes the practical recommendation explicitly.
  5. You can state, in one sentence, when Kahan is worth its complexity over an fp64 accumulator.

Pitfalls

  • Mixing fp32 and fp64. 1.0/600.0 in Python is fp64. np.float32(1.0/600.0) truncates to fp32. The arrays must be np.float32 for the experiment to demonstrate fp32 error.
  • Calling arr.sum() accidentally. np.float32 arrays use NumPy's pairwise summation by default — much better than naive. The whole point is the naive loop. Don't shortcut.
  • Timing variance. If you implement naive_sum in pure Python, it'll be 100-1000× slower than np.sum. That's expected; performance is not the measurement here.
  • fp64 reference floor. Even the fp64 reference has ~10⁻¹⁶ relative error. For workload 2 your fp32 Kahan error of ~10⁻⁷ is fine, but don't expect it to match the fp64 to bit identity.

When to consult solutions/

After committing all five files. Solution at solutions/02-summation-ref.md (written at phase open).


Next lab: lab/03-quantization-preview.md.