Phase 5 — Quizzes

Readable mirror of data/quizzes/phase-05-probability-information.yaml. It includes the identity H(p,q) = H(p) + KL(p||q) and the explanation of calibration.

Source of truth: data/quizzes/phase-05-probability-information.yaml.


q-05-01 — Entropy of uniform 5-tense distribution (free)

p = [0.2, 0.2, 0.2, 0.2, 0.2]. Compute H(p) in nats and bits.

Answer `H = log(5) ≈ 1.609 nats ≈ 2.322 bits`. This is the **maximum** possible entropy for 5 classes — total uncertainty.
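The figures above can be checked numerically. The following is a minimal sketch using only the standard library; the helper name `entropy_nats` is illustrative and not taken from the quiz source.

```python
import math

def entropy_nats(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.2, 0.2, 0.2, 0.2, 0.2]
h_nats = entropy_nats(p)
h_bits = h_nats / math.log(2)  # convert nats to bits by dividing by ln(2)

print(f"H = {h_nats:.3f} nats = {h_bits:.3f} bits")
```

Dividing by `math.log(2)` is the only change needed to move from nats to bits, since the two units differ by the base of the logarithm.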

q-05-02 — KL asymmetry (multi-choice)

Which of the following statements are true? Select all that apply.

  1. KL(p || q) = expected extra nats encoding p with a q-code.
  2. KL(p || q) = mode-covering: penalizes q when small where p is large.
  3. KL(q || p) = mode-seeking: allows q to concentrate on a single mode.
  4. KL is a metric.
  5. Cross-entropy training minimizes KL(p_data || p_model) = mode-covering.
Answer **Choices 1, 2, 3, 5.** Choice 4 is false: KL is not a metric, because it is asymmetric and does not satisfy the triangle inequality.
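The asymmetry is easy to see on a small example. This sketch assumes two hand-picked two-class distributions (`p` and `q` are illustrative values, not from the quiz source) and a hypothetical helper `kl`.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # illustrative "data" distribution
q = [0.9, 0.1]   # illustrative "model" distribution

kl_pq = kl(p, q)  # penalizes q for being small where p is large
kl_qp = kl(q, p)  # penalizes p for being small where q is large

print(f"KL(p||q) = {kl_pq:.4f}, KL(q||p) = {kl_qp:.4f}")
```

The two directions give different values, which is enough to rule out KL as a metric, while `kl(p, p)` is still zero.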

q-05-03 — Cross-entropy identity (free)

Show H(p, q) = H(p) + KL(p || q) and use it to explain why CE training ≡ KL minimization.

Answer `H(p, q) = -Σ p_i log q_i = -Σ p_i log p_i + Σ p_i log(p_i / q_i) = H(p) + KL(p || q)`. Since `H(p_data)` is constant in θ, minimizing `H(p_data, p_model)` w.r.t. θ is identical to minimizing `KL(p_data || p_model)`.
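As a sanity check, the identity can be verified numerically. The sketch below uses arbitrary example distributions; the helper names and the values of `p` and `q` are assumptions made for illustration.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # stands in for p_data
q = [0.5, 0.3, 0.2]  # stands in for p_model at some theta

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl(p, q)

print(f"H(p,q) = {lhs:.6f}, H(p) + KL(p||q) = {rhs:.6f}")
```

Because `entropy(p)` does not depend on `q`, any change to `q` that lowers the cross-entropy lowers the KL term by the same amount.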

q-05-04 — Why is log_softmax more stable than log(softmax)?

  1. Because logsumexp uses fp64 internally.
  2. Because softmax can underflow to 0; log(0) = -inf. log_softmax never materializes small probabilities.
  3. Because logsumexp is differentiable and softmax is not.
  4. Because PyTorch fuses them on CUDA.
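Answer **Choice 2.** Even after subtracting the maximum logit, softmax can produce terms such as `exp(-100) ≈ 3.7e-44`, which underflow to 0 in reduced precision, and `log(0) = -inf`. `log_softmax(x) = x - logsumexp(x)` never forms those tiny probabilities; it only subtracts one finite number from another.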
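The failure mode can be reproduced in plain Python floats (fp64) by making the logit gap large enough. This is a sketch under stated assumptions: the function names are illustrative, and the naive version maps an underflowed probability to `-inf` explicitly because `math.log(0.0)` raises an error rather than returning `-inf`.

```python
import math

def softmax(x):
    m = max(x)  # max subtraction prevents overflow, not underflow
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def naive_log_softmax(x):
    # log(softmax(x)): probabilities that underflowed to 0.0 become -inf
    return [math.log(p) if p > 0.0 else float("-inf") for p in softmax(x)]

def log_softmax(x):
    # x - logsumexp(x), with logsumexp computed via the max trick
    m = max(x)
    lse = m + math.log(sum(math.exp(v - m) for v in x))
    return [v - lse for v in x]

logits = [0.0, -1000.0]
naive = naive_log_softmax(logits)
stable = log_softmax(logits)

print("naive :", naive)
print("stable:", stable)
```

The naive path loses the second log-probability entirely, while the stable path recovers it because the small probability is never materialized.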

q-05-05 — Calibration (free)

Answer A model is **well-calibrated** if, among predictions made with confidence `p`, the empirical accuracy is `~p` (e.g., predictions made with 80% confidence are correct ~80% of the time). A model can be highly accurate (argmax usually right) yet over-confident (it assigns 99% confidence while being right only 80% of the time). Modern neural nets are typically over-confident; **temperature scaling** is the standard post-hoc fix.
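The over-confidence example can be made concrete with a small sketch. It assumes expected calibration error (ECE) with equal-width bins as the measure, which the quiz source does not name, and it uses made-up predictions and logits; all function names are illustrative.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece

def max_softmax_prob(logits, temperature=1.0):
    """Top-class probability after dividing logits by a temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    return max(exps) / sum(exps)

# Made-up model: always 99% confident, right 8 times out of 10.
confidences = [0.99] * 10
correct = [1] * 8 + [0] * 2
ece = expected_calibration_error(confidences, correct)

# Temperature > 1 softens the distribution without changing the argmax.
conf_t1 = max_softmax_prob([4.0, 0.0], temperature=1.0)
conf_t2 = max_softmax_prob([4.0, 0.0], temperature=2.0)

print(f"ECE = {ece:.2f}, confidence at T=1: {conf_t1:.3f}, at T=2: {conf_t2:.3f}")
```

In practice the temperature would be fitted on a held-out set rather than chosen by hand; the value 2.0 here only illustrates the direction of the effect.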