Phase 2 — Quizzes¶
Readable mirror of data/quizzes/phase-02-numerical-representation.yaml. Designed so that the learner gets at least one question wrong: the traps come from ideas that "seem reasonable".
Source of truth: data/quizzes/phase-02-numerical-representation.yaml.
q-02-01 — Why does the -max softmax trick work?¶
- Because `exp(x - m) ≤ exp(0) = 1` for every `x ≤ m`, so the exponentials never overflow; the common factor `exp(-m)` in numerator and denominator cancels exactly.
- Because exp is concave, so subtracting the max linearizes it.
- Because softmax is translation-invariant only in expectation.
- Because IEEE 754 rounding rounds toward the max, making the difference negligible.
Answer
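The explanation that follows can be checked numerically. This is a minimal sketch in plain Python, so the floats are fp64 and overflow only appears near `exp(710)`; the function names `softmax_naive` and `softmax_stable` are illustrative, not taken from the quiz source.

```python
import math

def softmax_naive(xs):
    # Direct definition: exp(x_i) / sum_j exp(x_j). Overflows for large logits.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    # Subtract m = max(xs) first: every exponent is <= 0, so exp(...) <= 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)  # >= 1, because the max element contributes exp(0) = 1
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # the same logits shifted by 999

# Translation invariance: the shift cancels, so both inputs give the same result.
stable_small = softmax_stable(small)
stable_large = softmax_stable(large)

# The naive version cannot even evaluate exp(1000) in fp64.
try:
    softmax_naive(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
```

Here `stable_small` and `stable_large` agree because `x - m` is identical for both inputs, while `naive_overflowed` ends up `True`.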
**Choice 1.** Both halves matter: the algebraic identity (cancellation of `exp(-m)`) and the numerical bound (`exp(x - m) ≤ 1`).
q-02-02 — fp16 vs bf16 (multi-choice)¶
- fp16 has 10 mantissa bits and 5 exponent bits; bf16 has 7 mantissa bits and 8 exponent bits.
- bf16 has the same exponent range as fp32, so overflow/underflow behavior matches fp32 closely.
- fp16 has better precision per number than bf16 (smaller relative error for in-range values).
- bf16 was designed primarily for inference on consumer GPUs, fp16 for training.
- Casting a fp32 weight to bf16 always loses information; casting to fp16 sometimes preserves it exactly.
Answer
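The range-versus-precision trade-off in the answer that follows can be seen on a few values. This is a rough sketch using only the standard library: `to_fp16` relies on the `struct` half-precision format `e`, and `to_bf16` is a hypothetical helper that emulates bf16 by rounding the fp32 bit pattern to its top 16 bits (it assumes finite inputs within fp32 range).

```python
import math
import struct

def to_fp16(x):
    # Round a Python float to IEEE 754 half precision (5 exponent, 10 mantissa bits).
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; report the overflow as inf.
        return math.copysign(math.inf, x)

def to_bf16(x):
    # Emulate bf16 (8 exponent, 7 mantissa bits): keep the top 16 bits of the
    # fp32 pattern, rounding to nearest even. Assumes a finite input.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def rel_err(approx, exact):
    return abs(approx - exact) / abs(exact)

big, fine, tiny, exact_fit = 70000.0, 1.001, 1e-10, 0.5

range_check = (to_fp16(big), to_bf16(big))        # fp16 overflows, bf16 does not
precision_check = (rel_err(to_fp16(fine), fine),  # fp16 is closer for in-range values
                   rel_err(to_bf16(fine), fine))
underflow_check = (to_fp16(tiny), to_bf16(tiny))  # fp16 flushes to 0.0, bf16 keeps it
lossless_check = (to_fp16(exact_fit), to_bf16(exact_fit))  # both exact for 0.5
```

The last line illustrates why the fifth choice fails: a value such as 0.5 fits in both formats, so neither cast loses information in that case.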
**Choices 1, 2, 3.** bf16 came from Google Brain for *training* stability (wide exponent matches fp32). Both formats lose information vs fp32 except where the value already fits.
q-02-03 — Catastrophic cancellation (free)¶
Compute 1.0000001 - 1.0 in fp32. How many significant decimal digits does the result have? Name the phenomenon.
Answer
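The subtraction can be reproduced without an fp32 library. This sketch assumes fp32 is emulated by rounding Python floats through `struct`; `to_fp32` is an illustrative helper name.

```python
import struct

def to_fp32(x):
    # Round a Python float (fp64) to the nearest fp32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]

a = to_fp32(1.0000001)  # nearest fp32 value is 1 + 2**-23
b = to_fp32(1.0)
diff = to_fp32(a - b)   # the subtraction itself is exact

exact = 1e-7
rel_error = abs(diff - exact) / exact  # about 0.19, so only the first digit is right
```

The subtraction introduces no new error. It only exposes the rounding error already present in `a`, which is now as large as the leading digits of the result.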
About **one** good digit. Both inputs have ~7 digits of fp32 precision, but the leading digits cancel, promoting the trailing rounding error to the leading position: fp32 stores 1.0000001 as `1 + 2^-23`, so the result is `2^-23 ≈ 1.19e-7` rather than `1e-7`. This is **catastrophic cancellation**. Rewrite the algorithm to avoid subtracting nearly-equal numbers.
q-02-04 — When is the -max trick not enough? (free)¶
Answer
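The failure mode and the fix described next can be sketched in a few lines. This is plain Python in fp64 with illustrative function names; a real framework's fused loss may differ in detail.

```python
import math

def softmax_stable(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def logsumexp(xs):
    # log(sum(exp(x))) computed with the same -max shift.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_softmax(xs):
    # Never materializes the probabilities: log p_i = x_i - logsumexp(x).
    lse = logsumexp(xs)
    return [x - lse for x in xs]

logits = [0.0, 1000.0]

p = softmax_stable(logits)  # p[0] = exp(-1000) underflows to 0.0
try:
    math.log(p[0])
    log_of_p_failed = False
except ValueError:
    log_of_p_failed = True   # log(0.0) is undefined

log_p = log_softmax(logits)  # finite: [-1000.0, 0.0]
```

The forward softmax is stable here, yet the information about the small probability is already gone by the time `log` is applied. Working in log space keeps it.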
The trick stabilizes softmax forward, but downstream `log(p)` of small probabilities still underflows. Fix: never materialize `p`; compute `log_softmax(x) = x - logsumexp(x)` directly. This is the fused `cross-entropy-from-logits` pattern.
q-02-05 — Kahan summation¶
- Sort the inputs in descending order before summing.
- Kahan compensated summation (track running error in a second accumulator).
- Cast to fp64, sum, cast back.
- Use SIMD for the reduction.
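The second choice above names Kahan compensated summation. A minimal sketch of that technique follows, assuming fp32 arithmetic is emulated by rounding each intermediate result through `struct`; the helper names and the test data are illustrative, not taken from the quiz source.

```python
import math
import struct

def to_fp32(x):
    # Round a Python float (fp64) to the nearest fp32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]

def naive_sum_fp32(values):
    total = 0.0
    for v in values:
        total = to_fp32(total + v)
    return total

def kahan_sum_fp32(values):
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = to_fp32(v - comp)                   # re-inject the previously lost part
        t = to_fp32(total + y)                  # big + small: low bits of y are dropped
        comp = to_fp32(to_fp32(t - total) - y)  # recover what was dropped (negated)
        total = t
    return total

# One large term followed by many terms below half an ulp of 1.0 in fp32.
values = [1.0] + [to_fp32(1e-8)] * 10000

reference = math.fsum(values)    # accurate fp64 sum, roughly 1.0001
naive = naive_sum_fp32(values)   # every small term is rounded away
kahan = kahan_sum_fp32(values)   # stays within about an fp32 ulp of the reference
```

In this setup the naive accumulator never moves off 1.0, because each addend is smaller than half the spacing between fp32 values near 1. The compensation term accumulates those lost amounts until they are large enough to register.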