
Phase 2 — Quizzes

Readable mirror of data/quizzes/phase-02-numerical-representation.yaml. Designed so that the learner gets at least one question wrong: the traps come from ideas that "seem reasonable".

Source of truth: data/quizzes/phase-02-numerical-representation.yaml.

q-02-01 — Why does the `-max` softmax trick work?

1. Because `exp(x - m) ≤ exp(0) = 1` for every `x ≤ m`, so the exponentials never overflow; the common factor `exp(-m)` in numerator and denominator cancels exactly.
2. Because exp is concave, so subtracting the max linearizes it.
3. Because softmax is translation-invariant only in expectation.
4. Because IEEE 754 rounding rounds toward the max, making the difference negligible.

Answer

**Choice 1.** Both halves matter: the algebraic identity (cancellation of `exp(-m)`) and the numerical bound (`exp(x - m) ≤ 1`).
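
To make both halves of the answer concrete, here is a minimal sketch in plain Python. It assumes Python floats (fp64) stand in for the tensor dtype, and the function names `softmax_naive` and `softmax_stable` are illustrative, not taken from the quiz source.

```python
import math

def softmax_naive(xs):
    # math.exp raises OverflowError once its argument exceeds roughly 709.78 in fp64.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    m = max(xs)
    # Every shifted argument is <= 0, so each term lies in [0, 1] and cannot overflow.
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)  # at least 1.0, because the max element contributes exp(0)
    return [e / total for e in exps]

logits = [1000.0, 1001.0, 1002.0]

try:
    softmax_naive(logits)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs = softmax_stable(logits)
# Shifting every logit by the same constant leaves the result unchanged,
# which is the cancellation of the common factor exp(-m).
probs_shifted = softmax_stable([x - 1000.0 for x in logits])
```

With these inputs the naive version fails on `exp(1000)`, while the stable version only ever evaluates `exp(-2)`, `exp(-1)`, and `exp(0)`.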

q-02-02 — fp16 vs bf16 (multi-choice)

1. fp16 has 10 mantissa bits and 5 exponent bits; bf16 has 7 mantissa bits and 8 exponent bits.
2. bf16 has the same exponent range as fp32, so overflow/underflow behavior matches fp32 closely.
3. fp16 has better precision per number than bf16 (smaller relative error for in-range values).
4. bf16 was designed primarily for inference on consumer GPUs, fp16 for training.
5. Casting an fp32 weight to bf16 always loses information; casting to fp16 sometimes preserves it exactly.

Answer

**Choices 1, 2, 3.** bf16 came from Google Brain for *training* stability (wide exponent matches fp32). Both formats lose information vs fp32 except where the value already fits.
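
The trade-off can be checked without any ML library. The sketch below is an illustration under stated assumptions: `to_fp16` round-trips through the `struct` module's binary16 format, and `to_bf16` is a hypothetical helper that emulates bf16 by rounding the fp32 bit pattern to its top 16 bits (NaN handling is omitted).

```python
import struct

def to_fp16(x):
    # Round-trip through IEEE 754 binary16 (struct format "e").
    # struct raises OverflowError when x is too large for fp16.
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    # Hypothetical helper: bf16 keeps the top 16 bits of the fp32 encoding.
    # Round to nearest even on the 16 discarded bits; NaN handling is omitted.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Precision: near 1.0 the fp16 grid spacing is 2**-10, the bf16 spacing is 2**-7.
x = 1.001
x_fp16 = to_fp16(x)
x_bf16 = to_bf16(x)

# Range: 70000 exceeds the fp16 maximum (65504) but is well inside bf16's range.
big = 70000.0
big_bf16 = to_bf16(big)
try:
    to_fp16(big)
    fp16_overflowed = False
except OverflowError:
    fp16_overflowed = True

# A value that already fits (0.5) survives both casts unchanged.
half_ok = to_fp16(0.5) == 0.5 and to_bf16(0.5) == 0.5
```

Here fp16 keeps `1.001` much closer to its original value than bf16 does, while only bf16 can represent `70000` at all, although coarsely.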

q-02-03 — Catastrophic cancellation (free)

Compute 1.0000001 - 1.0 in fp32. How many significant decimal digits does the result have? Name the phenomenon.

Answer

About **one** good digit. Each input carries ~7 decimal digits of fp32 precision, but the leading digits agree and cancel, promoting the rounding error already present in the stored `1.0000001` to the leading position: fp32 returns `2^-23 ≈ 1.19e-7` instead of `1e-7`. This is **catastrophic cancellation**. Rewrite the algorithm to avoid subtracting nearly equal numbers.
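
The effect can be reproduced in plain Python. This is a sketch that assumes fp32 is emulated by round-tripping through the `struct` module's `"f"` format; `to_fp32` is an illustrative helper name.

```python
import struct

def to_fp32(x):
    # Round a Python float (fp64) to the nearest fp32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]

a = to_fp32(1.0000001)  # stored as 1 + 2**-23, the nearest fp32 neighbor
b = to_fp32(1.0)

# The subtraction itself is exact; the damage was done when `a` was rounded.
diff = to_fp32(a - b)

true_diff = 1e-7
rel_err = abs(diff - true_diff) / true_diff  # roughly 0.19
```

A relative error of about 19% means only the leading digit of the result can be trusted.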

q-02-04 — When is the `-max` trick not enough? (free)

Answer

The trick stabilizes the softmax forward pass, but small probabilities can still underflow to 0, so a downstream `log(p)` returns `-inf`. Fix: never materialize `p`; compute `log_softmax(x) = x - logsumexp(x)` directly. This is the fused `cross-entropy-from-logits` pattern.
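
A minimal sketch of the difference, again assuming Python floats (fp64) and illustrative function names. Note that Python's `math.log(0.0)` raises `ValueError`, whereas array libraries typically return `-inf`.

```python
import math

def log_of_softmax(xs):
    # Materializes p first: exp(x - m) can underflow to 0.0, and log(0.0) then fails.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def log_softmax(xs):
    # log_softmax(x) = x - logsumexp(x), with logsumexp computed via the max shift.
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

logits = [0.0, 1000.0]

try:
    log_of_softmax(logits)
    naive_failed = False
except ValueError:
    naive_failed = True  # p[0] underflowed to 0.0 before the log

stable = log_softmax(logits)
```

The fused form returns a finite log-probability of `-1000.0` for the first logit, which the materialized probability cannot represent.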

q-02-05 — Kahan summation

1. Sort the inputs in descending order before summing.
2. Kahan compensated summation (track running error in a second accumulator).
3. Cast to fp64, sum, cast back.
4. Use SIMD for the reduction.

Answer

**Choice 2.** Kahan uses a single extra compensation variable to recover the low-order bits discarded by each addition, giving an error bound of `O(ε)` to first order rather than the `O(Nε)` of naive summation.
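
The compensation step can be sketched in a few lines of Python. This assumes fp64 accumulation and uses `math.fsum` as the correctly rounded reference; the naive sum is written as an explicit loop because the built-in `sum` may itself apply compensation in recent Python versions.

```python
import math

def naive_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - comp            # re-inject the previously lost bits
        t = total + y           # low-order bits of y are discarded here
        comp = (t - total) - y  # recover what was just discarded
        total = t
    return total

values = [0.1] * 1_000_000
reference = math.fsum(values)

naive_err = abs(naive_sum(values) - reference)
kahan_err = abs(kahan_sum(values) - reference)
```

On this input the naive error grows with the number of terms, while the compensated error stays near the rounding level of the final result.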
