Phase 2 — Quizzes

Readable mirror of data/quizzes/phase-02-numerical-representation.yaml. Designed so that the learner gets at least one question wrong: the traps come from ideas that "seem reasonable".

Source of truth: data/quizzes/phase-02-numerical-representation.yaml.


q-02-01 — Why does the -max softmax trick work?

  1. Because exp(x - m) ≤ exp(0) = 1 for every x ≤ m, so the exponentials never overflow; the common factor exp(-m) in numerator and denominator cancels exactly.
  2. Because exp is concave, so subtracting the max linearizes it.
  3. Because softmax is translation-invariant only in expectation.
  4. Because IEEE 754 rounding rounds toward the max, making the difference negligible.
Answer **Choice 1.** Both halves matter: the algebraic identity (cancellation of `exp(-m)`) and the numerical bound (`exp(x - m) ≤ 1`).
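
To make both halves of that answer concrete, here is a minimal sketch in plain Python (float64, standard library only). The function names `softmax_naive` and `softmax_stable` are illustrative and do not come from the quiz source.

```python
import math

def softmax_naive(xs):
    # Direct definition: exp(x_i) / sum_j exp(x_j). Overflows for large logits.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    # Subtract the max first: every exponent is <= 0, so exp(...) lies in [0, 1].
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)  # >= 1, because the max element contributes exp(0) = 1
    return [e / total for e in exps]

logits = [1000.0, 1001.0, 1002.0]
probs = softmax_stable(logits)

# The factor exp(-m) cancels, so shifting every logit by a constant changes nothing.
shifted = softmax_stable([0.0, 1.0, 2.0])

try:
    softmax_naive(logits)
    naive_overflowed = False
except OverflowError:
    # math.exp raises instead of returning inf when the result is out of range.
    naive_overflowed = True
```

Here `probs` and `shifted` agree because both calls exponentiate the same differences `-2, -1, 0`, while the naive version fails on `exp(1000)`.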

q-02-02 — fp16 vs bf16 (multi-choice)

  1. fp16 has 10 mantissa bits and 5 exponent bits; bf16 has 7 mantissa bits and 8 exponent bits.
  2. bf16 has the same exponent range as fp32, so overflow/underflow behavior matches fp32 closely.
  3. fp16 has better precision per number than bf16 (smaller relative error for in-range values).
  4. bf16 was designed primarily for inference on consumer GPUs, fp16 for training.
  5. Casting a fp32 weight to bf16 always loses information; casting to fp16 sometimes preserves it exactly.
Answer **Choices 1, 2, 3.** bf16 came from Google Brain for *training* stability (wide exponent matches fp32). Both formats lose information vs fp32 except where the value already fits.
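
The precision and range trade-off can be checked with a small sketch. It assumes Python's `struct` module for fp16 (format `e`) and emulates bf16 by rounding the fp32 bit pattern to its top 16 bits; `to_fp16` and `to_bf16` are hypothetical helper names, and the emulation is only meant for finite, in-range inputs.

```python
import struct

def to_fp16(x):
    # Round-trip through IEEE 754 binary16 (struct format 'e').
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    # Emulated bf16 (assumption: no native bf16 type is used here): round the
    # fp32 encoding to its top 16 bits with round-to-nearest-even.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Precision: 1 + 2**-10 needs 10 mantissa bits.
x = 1.0 + 2.0 ** -10
fp16_keeps_x = to_fp16(x) == x   # fp16 has exactly 10 mantissa bits
bf16_keeps_x = to_bf16(x) == x   # bf16 has 7, so the low bits are rounded away

# Range: 70000 exceeds the largest finite fp16 value (65504).
big = 70000.0
try:
    to_fp16(big)
    fp16_overflows = False
except (OverflowError, struct.error):
    fp16_overflows = True
bf16_big = to_bf16(big)          # finite, at reduced precision

# Choice 5 is false: some values survive a cast to bf16 exactly.
bf16_keeps_half = to_bf16(0.5) == 0.5
```

Python's `struct` raises on fp16 overflow, whereas a hardware cast would typically produce `inf`. Either way the value is out of fp16 range but still representable in bf16.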

q-02-03 — Catastrophic cancellation (free)

Compute 1.0000001 - 1.0 in fp32. How many significant decimal digits does the result have? Name the phenomenon.

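Answer About **one** good digit. In fp32, `1.0000001` rounds to the nearest representable value `1 + 2^-23 ≈ 1.00000012`, so the subtraction returns `2^-23 ≈ 1.19e-7` instead of `1e-7`, a relative error of about 19%. Both inputs carry ~7 digits of fp32 precision, but the leading 7 digits cancel, promoting the input rounding error to the leading position. The subtraction itself is exact; it only exposes the error already present in the inputs. This is **catastrophic cancellation**. Rewrite the algorithm to avoid subtracting nearly equal numbers.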
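
The arithmetic can be reproduced without any numerical library. This sketch assumes fp32 rounding is emulated by a round-trip through `struct` format `f`; `to_fp32` is a hypothetical helper name.

```python
import struct

def to_fp32(x):
    # Round a Python float (float64) to the nearest fp32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]

a = to_fp32(1.0000001)  # nearest fp32 value is 1 + 2**-23
b = to_fp32(1.0)        # exactly representable
diff = to_fp32(a - b)   # the subtraction is exact: the result is 2**-23

exact = 1e-7
rel_error = abs(diff - exact) / exact  # roughly 0.19
```

The inputs are each accurate to about 6e-8 relative error, yet the difference is off by roughly 19%, because the subtraction removed the digits that were correct.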

q-02-04 — When is the -max trick not enough? (free)

Answer The trick stabilizes the softmax forward pass, but a sufficiently small probability still underflows to `0`, so a downstream `log(p)` returns `-inf`. Fix: never materialize `p`; compute `log_softmax(x) = x - logsumexp(x)` directly. This is the fused `cross-entropy-from-logits` pattern.
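
A minimal sketch of that fix, in plain Python with float64. The names `softmax_stable`, `log_softmax`, and `cross_entropy_from_logits` are illustrative, not taken from any particular library.

```python
import math

def softmax_stable(xs):
    # The -max trick: safe against overflow, but small outputs can still hit 0.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(xs):
    # log_softmax(x) = x - logsumexp(x), with logsumexp computed via the max shift.
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def cross_entropy_from_logits(xs, target):
    # Negative log-probability of the target class, never materializing p.
    return -log_softmax(xs)[target]

logits = [0.0, 1000.0]
p = softmax_stable(logits)      # p[0] = exp(-1000) underflows to 0.0
log_p = log_softmax(logits)     # log_p[0] stays finite
loss = cross_entropy_from_logits(logits, 0)
```

With these logits, `p[0]` is exactly `0.0`, so taking its logarithm afterwards cannot recover the value, while the direct computation keeps it finite.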

q-02-05 — Kahan summation

  1. Sort the inputs in descending order before summing.
  2. Kahan compensated summation (track running error in a second accumulator).
  3. Cast to fp64, sum, cast back.
  4. Use SIMD for the reduction.
Answer **Choice 2.** Kahan uses a single extra compensation variable to recover the low-order bits discarded at each addition, giving a worst-case error of `O(ε)` (independent of `N` to first order) rather than the `O(Nε)` of naive summation.
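
As a sketch of the compensation step, the following plain-Python version (float64) compares naive and Kahan summation on an input chosen so that every small term is individually lost. The function names and the test data are assumptions made for illustration.

```python
import math

def naive_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - comp            # re-inject the previously lost bits
        t = total + y           # low-order bits of y may be dropped here
        comp = (t - total) - y  # recover what was just dropped
        total = t
    return total

# 1e-16 is below half an ulp of 1.0, so adding it to 1.0 alone changes nothing.
values = [1.0] + [1e-16] * 100_000

naive = naive_sum(values)
kahan = kahan_sum(values)
reference = math.fsum(values)  # correctly rounded sum from the standard library
```

The naive loop never moves away from `1.0`, while the compensated loop tracks the accumulated small terms and lands within rounding distance of `math.fsum`.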