Phase 5 — Quizzes

Readable mirror of data/quizzes/phase-05-probability-information.yaml. Includes the identity H(p,q) = H(p) + KL(p||q) and the explanation of calibration.

Source of truth: data/quizzes/phase-05-probability-information.yaml.

q-05-01 — Entropy of uniform 5-tense distribution (free)

p = [0.2, 0.2, 0.2, 0.2, 0.2]. Compute H(p) in nats and bits.

Answer

`H = log(5) ≈ 1.609 nats ≈ 2.322 bits`. This is the **maximum** possible entropy for 5 classes — total uncertainty.
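The figures above can be checked numerically. The following is a minimal sketch in plain Python; the helper name `entropy` is hypothetical and not part of the quiz source.

```python
import math

def entropy(p, base=math.e):
    """Shannon entropy of a discrete distribution; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

p = [0.2] * 5                  # uniform over 5 classes
h_nats = entropy(p)            # natural log -> nats
h_bits = entropy(p, base=2)    # log base 2 -> bits

print(f"{h_nats:.3f} nats, {h_bits:.3f} bits")  # 1.609 nats, 2.322 bits
```

Any non-uniform distribution over the same 5 classes gives a strictly smaller value, which is what "maximum entropy" means here.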

q-05-02 — KL asymmetry (multi-choice)

1. KL(p || q) = expected extra nats when encoding p with a code optimized for q.
2. KL(p || q) is mode-covering: it penalizes q for being small where p is large.
3. KL(q || p) is mode-seeking: it allows q to concentrate on a single mode.
4. KL is a metric.
5. Cross-entropy training minimizes KL(p_data || p_model), which is mode-covering.

Answer

**Choices 1, 2, 3, 5.** KL is not a metric (asymmetric, no triangle inequality).
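The asymmetry is easy to see on a two-class toy example. This is an illustrative sketch: the helper `kl` and the distributions `p` and `q` are made up for the demonstration, and the helper assumes q_i > 0 wherever p_i > 0.

```python
import math

def kl(p, q):
    """KL(p || q) in nats. Assumes q_i > 0 wherever p_i > 0 (otherwise KL is infinite)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # toy "true" distribution
q = [0.9, 0.1]   # toy approximation that under-weights the second class

kl_pq = kl(p, q)  # penalizes q for being small (0.1) where p is large (0.5)
kl_qp = kl(q, p)

print(f"KL(p||q) = {kl_pq:.4f}, KL(q||p) = {kl_qp:.4f}")
print("symmetric?", math.isclose(kl_pq, kl_qp))  # False: the two directions differ
```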

q-05-03 — Cross-entropy identity (free)

Show H(p, q) = H(p) + KL(p || q) and use it to explain why CE training ≡ KL minimization.

Answer

`H(p, q) = -Σ p_i log q_i = -Σ p_i log p_i + Σ p_i log(p_i / q_i) = H(p) + KL(p || q)`. Since `H(p_data)` is constant in θ, minimizing `H(p_data, p_model)` w.r.t. θ is identical to minimizing `KL(p_data || p_model)`.
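The identity can also be confirmed numerically. The sketch below uses hypothetical helper names and arbitrary example distributions; it assumes all entries of `p` and `q` are strictly positive.

```python
import math

def entropy(p):
    """H(p) = -sum p_i log p_i, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i, in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) = sum p_i log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # stands in for p_data
q = [0.5, 0.3, 0.2]   # stands in for p_model at some θ

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl(p, q)
print(math.isclose(lhs, rhs))  # True
```

Changing `q` moves `cross_entropy(p, q)` and `kl(p, q)` by exactly the same amount, because `entropy(p)` does not depend on `q`.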

q-05-04 — Why is `log_softmax` more stable than `log(softmax)`?

1. Because logsumexp uses fp64 internally.
2. Because softmax can underflow to 0; log(0) = -inf. log_softmax never materializes small probabilities.
3. Because logsumexp is differentiable and softmax is not.
4. Because PyTorch fuses them on CUDA.

Answer

**Choice 2.** Even after subtracting the max, softmax can underflow: in fp32, `exp(x)` rounds to 0 once `x` falls below roughly -104, and then `log(0) = -inf`. `log_softmax(x) = x - logsumexp(x)` only ever takes a difference of finite numbers.
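The failure mode can be reproduced without any framework. The sketch below uses plain Python floats (fp64), so the gap between logits has to be larger (800 here) to trigger underflow; both function names are hypothetical and neither is PyTorch's implementation.

```python
import math

def naive_log_softmax(x):
    """log(softmax(x)) computed in two steps, with the usual max subtraction."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]   # small terms can underflow to exactly 0.0
    z = sum(exps)
    # math.log(0.0) raises in Python; NumPy and PyTorch return -inf instead,
    # so -inf is used here to mirror their behavior.
    return [math.log(e / z) if e > 0 else float("-inf") for e in exps]

def stable_log_softmax(x):
    """x - logsumexp(x): no probability is ever materialized."""
    m = max(x)
    lse = m + math.log(sum(math.exp(xi - m) for xi in x))
    return [xi - lse for xi in x]

logits = [0.0, -800.0]
naive = naive_log_softmax(logits)
stable = stable_log_softmax(logits)

print(naive)   # [0.0, -inf]
print(stable)  # [0.0, -800.0]
```

The naive version loses the second entry entirely, while the logsumexp form keeps a finite log-probability that a loss function can still differentiate.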

q-05-05 — Calibration (free)

Answer

A model is **well-calibrated** if, among predictions made with confidence `p`, the empirical accuracy is `~p` (e.g., predictions with 80% confidence are correct ~80% of the time). A model can be highly accurate (argmax usually right) yet over-confident (gives 99% even when right only 80% of the time). Modern neural nets are typically over-confident; **temperature scaling** is the standard post-hoc fix.
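One common way to quantify the gap between confidence and accuracy is the expected calibration error (ECE), which bins predictions by confidence and averages the per-bin gap. The following is a minimal sketch with made-up toy data; the function name, the equal-width binning, and the choice of 10 bins are assumptions for illustration, not taken from the quiz source.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(conf for conf, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Toy over-confident model: always claims 99%, but is right 8 times out of 10.
correct = [1] * 8 + [0] * 2
ece_overconfident = expected_calibration_error([0.99] * 10, correct)

# Toy calibrated model: claims 80% and is right 8 times out of 10.
ece_calibrated = expected_calibration_error([0.80] * 10, correct)

print(f"{ece_overconfident:.2f} vs {ece_calibrated:.2f}")  # 0.19 vs 0.00
```

Both toy models have the same accuracy (80%); only the reported confidence differs, which is exactly the distinction between accuracy and calibration.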
