English · Español

Lab 02 — Measure summation error and fix it with Kahan¶

Goal: see a naive fp32 sum stop growing, then watch Kahan summation save it. Anchor: sum the 600 per-form probabilities of the §A13 verb vocabulary and verify the total equals 1.0 to ε precision.

Estimated time: 60–90 minutes.

Prereq: theory 03-summation-and-cancellation.md read.

What you produce¶

A directory experiments/02-kahan-vs-naive/ containing:

summation.py — implementations of naive_sum, pairwise_sum (NumPy default emulation), kahan_sum, neumaier_sum.
experiment.py — driver that runs all four on the three test workloads (below) and records error against an fp64 reference.
results.json — per-workload error numbers.
error_growth.png — log-log plot of cumulative error vs N.
manifest.json.
README.md — interpretation.

The three workloads¶

Workload 1 — uniform §A13 vocabulary probabilities¶

A length-600 fp32 array, all entries equal to 1.0 / 600.0. Sum should be exactly 1.0.

probs = np.full(600, 1.0/600.0, dtype=np.float32)

Naive sum, pairwise sum, Kahan sum. Record each result and the error relative to 1.0.

Workload 2 — 10⁶ samples of a small-magnitude distribution¶

Generate N = 10⁶ fp32 values, each ~1/600 + δ_i where δ_i ~ Normal(0, 1e-5). Use np.random.default_rng(42).

N = 1_000_000
rng = np.random.default_rng(42)
x = (1.0/600.0 + rng.standard_normal(N).astype(np.float32) * 1e-5).astype(np.float32)
expected = x.astype(np.float64).sum()  # fp64 reference

Run all four summation strategies. Plot error vs N in increments (sweep N from 10³ to 10⁶ by factors of 10) — this is error_growth.png.

Workload 3 — one large + many tiny¶

Construct a worst-case mixed-magnitude sum: one 1.0e6 followed by 10⁶ ones.

x = np.array([1.0e6] + [1.0] * 1_000_000, dtype=np.float32)
expected = 1_001_000.0  # exact in fp32 (representable)

Predict: naive sum loses every 1.0 after the running total exceeds 2²³ ≈ 8.4e6 (the point where 1.0 is smaller than the running total's ULP). Kahan keeps all of them. Verify.

TODOs¶

Block A — implementations¶

summation.py:

def naive_sum(arr):
    total = 0.0
    for x in arr:
        total += x
    return total

def pairwise_sum(arr):
    # Recursive halving — like np.sum's default
    if len(arr) <= 8:
        return float(arr.sum())   # base case OK to use np.sum here
    mid = len(arr) // 2
    return pairwise_sum(arr[:mid]) + pairwise_sum(arr[mid:])

def kahan_sum(arr):
    # TODO: implement per theory/03 §"the fix"
    ...

def neumaier_sum(arr):
    # TODO: implement per theory/03 §"Neumaier's variant"
    ...

All must accept a 1D np.float32 array and return a float (Python's native fp64 is fine for the return type — the discipline is internal).

Variant: also write fp64_accumulator_naive(arr) that does total = np.float64(0.0) and accumulates in fp64. This is the "cheapest correct fix" comparison.

Block B — predict before measuring¶

In README.md, before running, predict for each workload:

Naive fp32 error magnitude.
Pairwise fp32 error magnitude.
Kahan fp32 error magnitude.
fp64-accumulator-on-fp32-input error magnitude.

Use theory 03 § "The fix" to predict. Then run and compare.

Block C — workload 1: §A13 vocabulary¶

For the length-600 uniform probability vector, all four methods should give errors comfortably below ε. State which method is best and why.

🇪🇸 La intuición: 600 sumas de magnitudes idénticas tienen error naïve O(600 · ε) ≈ 7e-5. Kahan: O(ε) ≈ 1e-7. fp64 accumulator: O(ε_{fp64}) ≈ 2e-16. Si crees que total debe ser bit-exacto 1.0, ninguno de los métodos te lo garantiza — los δ_i de redondeo en 1/600 mismo ya rompen la igualdad exacta.

Block D — workload 2: error vs N¶

Generate the 10⁶ array. For N ∈ [10³, 10⁴, 10⁵, 10⁶], compute each summation method's error and record. Plot error_growth.png:

x-axis: N (log).
y-axis: absolute error vs fp64 reference (log).
One curve per method.
Annotations: theoretical bounds O(N · ε) for naive, O(log N · ε) for pairwise, O(ε) for Kahan.

The expected shape is naive growing linearly, pairwise as log N, Kahan flat. Comment on whether your measurements match in README.md.

Block E — workload 3: catastrophic case¶

Run all four methods on the [1e6] + [1.0] * 10⁶ array. Predict and verify:

naive_sum returns... what? At what index does adding 1.0 to the running total become a no-op?
pairwise_sum is somewhat better but still not perfect.
kahan_sum and neumaier_sum should both recover the exact answer (or within 1 ULP).

Compute the index at which naive stops growing: when total > 2²³ ≈ 8.4e6. For our workload, total reaches 1e6 + i × 1.0, so naive stops growing at i ≈ 7.4e6 - 1e6 = ... — but the array is only 1e6 long, so for this specific case the issue is more subtle. Work it out, write down the answer in README.md.

Block F — the practical recommendation¶

In README.md, conclude with one paragraph: which of these methods would Borja use in actual training code, when. Anchor on this scenario:

Training MiniGPT on the §A13 vocabulary, you accumulate per-batch losses over 100,000 batches × 64 examples each = 6.4 million summands per epoch. Each summand is a positive fp32 loss in [0, 10]. Naive accumulation gives ~5% relative error. Which fix?

Expected answer: total = np.float64(0.0) and accumulate in fp64. Trivial code change, no Kahan complexity, sufficient precision. Reach for Kahan only if you're stuck in fp32 (some embedded inference paths).

Constraints¶

No np.sum in naive_sum, kahan_sum, or neumaier_sum. Roll the loop. The point is to see the accumulation.
Pure Python loops are slow. For N = 10⁶ they may take minutes. For the workload-2 plot you can either tolerate the slowness or implement the loops with numba.njit (optional; pin in experiments group only if you go this route — don't bloat the env).
Use the fp64 reference always. Never the "expected mathematical answer" — the fp32 sum has a genuine roundoff floor that fp64 reveals.

Stop conditions¶

Done when:

Four summation implementations work and self-test (sum of [1.0]*10 returns 10.0).
results.json contains errors for all three workloads × four methods.
error_growth.png shows the predicted scaling.
README.md makes the practical recommendation explicitly.
You can state, in one sentence, when Kahan is worth its complexity over an fp64 accumulator.

Pitfalls¶

Mixing fp32 and fp64. 1.0/600.0 in Python is fp64. np.float32(1.0/600.0) truncates to fp32. The arrays must be np.float32 for the experiment to demonstrate fp32 error.
Calling arr.sum() accidentally. np.float32 arrays use NumPy's pairwise summation by default — much better than naive. The whole point is the naive loop. Don't shortcut.
Timing variance. If you implement naive_sum in pure Python, it'll be 100-1000× slower than np.sum. That's expected; performance is not the measurement here.
fp64 reference floor. Even the fp64 reference has ~10⁻¹⁶ relative error. For workload 2 your fp32 Kahan error of ~10⁻⁷ is fine, but don't expect it to match the fp64 to bit identity.

When to consult `solutions/`¶

After committing all five files. Solution at solutions/02-summation-ref.md (written at phase open).

Next lab: lab/03-quantization-preview.md.