Lab 01 — KL divergence and cross-entropy¶
Read theory/02-entropy-and-kl.md and theory/03-cross-entropy-and-mle.md. Do not consult solutions/.
Objective¶
Implement KL divergence and cross-entropy on categorical distributions. Empirically verify the decomposition identity \(H(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q)\). Reproduce Gibbs' inequality proof.
Setup¶
Continue with the 5-tense alphabet.
Tasks¶
Task 1 — implement kl(p, q) and cross_entropy(p, q)¶
In src/phase05/probability.py, add:
def kl(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """D_KL(p || q) in nats. +inf if p has support outside q's."""

def cross_entropy(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """H(p, q) = -sum_i p_i log q_i, in nats."""
Constraints:
- Pure NumPy.
- Handle \(p_i = 0\) via the convention \(0 \log(0/q_i) = 0\).
- Return +np.inf (do not raise) when \(p_i > 0\) and \(q_i = 0\).
- Validate inputs (same as entropy).
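One way to satisfy these constraints is to restrict the sum to the support of \(p\) before taking logs. The sketch below is a minimal illustration, not the reference solution: the `_validate` helper and its checks are assumptions, since the validation performed by entropy is not shown in this lab.

```python
import numpy as np
from numpy.typing import NDArray


def _validate(p: NDArray[np.float64]) -> None:
    """Hypothetical input check; the validation in `entropy` may differ."""
    if p.ndim != 1 or np.any(p < 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("expected a 1-D probability vector summing to 1")


def kl(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """D_KL(p || q) in nats. +inf if p has support outside q's."""
    _validate(p)
    _validate(q)
    if p.shape != q.shape:
        raise ValueError("p and q must have the same shape")
    support = p > 0                       # terms with p_i = 0 contribute 0
    if np.any(q[support] == 0):           # p_i > 0 while q_i = 0
        return float(np.inf)
    return float(np.sum(p[support] * np.log(p[support] / q[support])))


def cross_entropy(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """H(p, q) = -sum_i p_i log q_i, in nats."""
    _validate(p)
    _validate(q)
    if p.shape != q.shape:
        raise ValueError("p and q must have the same shape")
    support = p > 0                       # same convention: 0 * log(q_i) = 0
    if np.any(q[support] == 0):
        return float(np.inf)
    return float(-np.sum(p[support] * np.log(q[support])))
```

Masking with a boolean index avoids evaluating `log(0)` at all, so no NumPy warnings need to be suppressed.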
Task 2 — verify the decomposition identity¶
For 100 random pairs \((p, q)\) from Dirichlet(1, ..., 1):
- Compute \(H(p)\), \(D_{\text{KL}}(p \,\|\, q)\), \(H(p, q)\) independently.
- Assert \(\big| H(p) + D_{\text{KL}}(p \,\|\, q) - H(p, q) \big| < 10^{-10}\).
Add as a pytest property test: tests/test_phase05_decomposition.py.
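A sketch of how such a property test might look. In the real test file you would import your own entropy, kl and cross_entropy; the inline stand-ins here exist only so the snippet runs on its own, and they assume strictly positive entries. The fixed seed and the use of `np.random.default_rng` are arbitrary choices, not requirements of the lab.

```python
import numpy as np


# Stand-ins so the sketch is self-contained; replace with imports of your
# own functions. They assume strictly positive entries, which samples from
# Dirichlet(1, ..., 1) have almost surely.
def entropy(p):
    return float(-np.sum(p * np.log(p)))


def kl(p, q):
    return float(np.sum(p * np.log(p / q)))


def cross_entropy(p, q):
    return float(-np.sum(p * np.log(q)))


def test_decomposition_identity():
    rng = np.random.default_rng(0)        # fixed seed (arbitrary choice)
    alpha = np.ones(5)                    # 5 categories, as in the Setup
    for _ in range(100):
        p = rng.dirichlet(alpha)
        q = rng.dirichlet(alpha)
        # The three quantities are computed independently, then compared.
        gap = entropy(p) + kl(p, q) - cross_entropy(p, q)
        assert abs(gap) < 1e-10
```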
Task 3 — reproduce Gibbs' inequality numerically + on paper¶
- Hand-write the Jensen proof of \(D_{\text{KL}}(p \,\|\, q) \ge 0\) (mirror theory/02-entropy-and-kl.md §"Proof of non-negativity").
- Numerically: for 1000 random pairs \((p, q)\), verify \(D_{\text{KL}}(p \,\|\, q) \ge 0\). Also check the equality case of "\(D_{\text{KL}}(p \,\|\, q) = 0\) iff \(p = q\)" by setting \(q = p\) and confirming that the result is 0.
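For orientation while writing it out, the Jensen argument usually runs as follows. This is the standard textbook outline, and the version in the theory file may differ in notation. Let \(S = \{i : p_i > 0\}\) and assume \(q_i > 0\) on \(S\) (otherwise \(D_{\text{KL}}(p \,\|\, q) = +\infty\) and the bound is trivial). Since \(\log\) is concave,

\[
-D_{\text{KL}}(p \,\|\, q) = \sum_{i \in S} p_i \log \frac{q_i}{p_i} \le \log \sum_{i \in S} p_i \, \frac{q_i}{p_i} = \log \sum_{i \in S} q_i \le \log 1 = 0 .
\]

Because \(\log\) is strictly concave, the first inequality is tight only when \(q_i / p_i\) is constant on \(S\), and the second only when \(\sum_{i \in S} q_i = 1\); together these force \(q = p\).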
Task 4 — asymmetry¶
Compute \(D_{\text{KL}}(p \,\|\, q)\) vs \(D_{\text{KL}}(q \,\|\, p)\) for:
| Case | \(p\) | \(q\) |
|---|---|---|
| A | \((0.5, 0.5, 0, 0, 0)\) | \((0.2, 0.2, 0.2, 0.2, 0.2)\) |
| B | \((0.2, 0.2, 0.2, 0.2, 0.2)\) | \((0.5, 0.5, 0, 0, 0)\) |
What happens in case B? Why? Document.
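As a sanity check on your implementation, one direction of case A can be worked by hand. Under the convention \(0 \log(0/q_i) = 0\), only the first two terms are nonzero:

\[
D_{\text{KL}}(p \,\|\, q) = 2 \cdot 0.5 \, \log \frac{0.5}{0.2} = \log 2.5 \approx 0.916 \text{ nats}.
\]

The remaining three quantities are left for you to compute and explain.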
Task 5 — cross-entropy on the verb model (forward-looking)¶
Pretend the model outputs \(q = (0.6, 0.1, 0.1, 0.1, 0.1)\) and the ground truth is \(p = (1, 0, 0, 0, 0)\) (past tense). Compute:
- \(H(p, q)\)
- \(-\log q_{y^*}\) where \(y^* = 0\) (past).
They should be equal — and equal to the negative log-likelihood of the true label under the model. This is the form we'll wire into Phase 07's autograd.
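A quick numerical check of this claim, written directly in NumPy rather than through your cross_entropy so that it can serve as an independent reference. The variable names are illustrative only.

```python
import numpy as np

q = np.array([0.6, 0.1, 0.1, 0.1, 0.1])    # model output from the task
p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])    # one-hot ground truth (past)
y_star = 0

# Every q_i is positive here, so the plain sum needs no masking.
h_pq = float(-np.sum(p * np.log(q)))
nll = float(-np.log(q[y_star]))

print(h_pq, nll)                            # both are -log(0.6), about 0.5108
```

With a one-hot \(p\), every term except \(i = y^*\) is multiplied by zero, which is why the sum collapses to a single log-probability.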
Measurements to capture¶
- Per-pair wall-clock of kl(p, q) at \(V = 600\).
- Decomposition-identity test: 100 cases, all pass within tolerance.
- Asymmetry table (Task 4) saved as experiments/<date>-phase-05-kl/asymmetry.csv.
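One possible way to take the per-pair timing, sketched with `time.perf_counter`. The repeat count, the seed, and the inline `kl` stand-in are assumptions made so the snippet runs on its own; time your own implementation instead.

```python
import time

import numpy as np


def kl(p, q):
    """Stand-in for your implementation (assumes strictly positive entries)."""
    return float(np.sum(p * np.log(p / q)))


rng = np.random.default_rng(0)
V = 600
n_pairs = 1000                                   # arbitrary repeat count
ps = rng.dirichlet(np.ones(V), size=n_pairs)     # shape (n_pairs, V)
qs = rng.dirichlet(np.ones(V), size=n_pairs)

start = time.perf_counter()
for p, q in zip(ps, qs):
    kl(p, q)
elapsed = time.perf_counter() - start

per_pair_us = 1e6 * elapsed / n_pairs
print(f"{per_pair_us:.2f} microseconds per pair at V={V}")
```

Averaging over many pairs smooths out timer resolution and one-off startup costs; the sampling is done before the clock starts so that only `kl` is measured.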
Acceptance¶
- kl and cross_entropy implemented; pass property tests.
- Decomposition identity verified within 1e-10 for 100 random pairs.
- Gibbs proof reproduced in your notes.
- Asymmetry case study documented.
- Phase-07-foreshadowing check (Task 5) numerically confirmed.
Pitfalls to expect¶
- KL with support mismatch should return +inf, not raise. The pytest test should assert np.isposinf(kl(p, q)) for the mismatch case.
- Asymmetry confuses people the first time; write your notes explicitly so that future-you doesn't misremember which side is which. Convention here: \(p\) is always the true (or empirical) distribution, \(q\) is the model.
- Don't compare \(D_{\text{KL}}(p \,\|\, q)\) between different \(V\): KL is measured in nats, and its magnitudes are not directly comparable across vocabulary sizes.
Next: 02-log-sum-exp.md