Lab 01 — KL divergence and cross-entropy

Read theory/02-entropy-and-kl.md and theory/03-cross-entropy-and-mle.md. Do not consult solutions/.

Objective

Implement KL divergence and cross-entropy on categorical distributions. Empirically verify the decomposition identity \(H(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q)\). Reproduce the proof of Gibbs' inequality.

Setup

Continue with the 5-tense alphabet.

Tasks

Task 1 — implement `kl(p, q)` and `cross_entropy(p, q)`

In src/phase05/probability.py, add:

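# Imports required by the signatures (skip if the module already has them).
import numpy as np
from numpy.typing import NDArray

def kl(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """D_KL(p || q) in nats. Returns +inf if p_i > 0 where q_i = 0."""

def cross_entropy(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """H(p, q) = -sum_i p_i log q_i, in nats."""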

Constraints:

Pure NumPy.
Handle \(p_i = 0\) via the convention \(0 \log(0/q_i) = 0\).
Return +np.inf (do not raise) when \(p_i > 0\) and \(q_i = 0\).
Validate inputs (same as entropy).
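
One possible way to honor the two conventions above is to restrict every computation to the support of \(p\) before taking a logarithm. The sketch below is an illustration under assumptions, not the reference solution: the `_sketch` names are hypothetical, input validation is left out, and returning +inf from the cross-entropy on a support mismatch is an assumption made here for consistency with the decomposition identity.

```python
import numpy as np
from numpy.typing import NDArray


def kl_sketch(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """Illustrative D_KL(p || q) in nats (hypothetical name, no validation)."""
    support = p > 0                    # terms with p_i = 0 contribute exactly 0
    if np.any(q[support] == 0):        # p_i > 0 but q_i = 0: divergence is infinite
        return float(np.inf)
    return float(np.sum(p[support] * np.log(p[support] / q[support])))


def cross_entropy_sketch(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """Illustrative H(p, q) in nats (hypothetical name, no validation)."""
    support = p > 0
    if np.any(q[support] == 0):        # assumption: mirror kl and return +inf
        return float(np.inf)
    return float(-np.sum(p[support] * np.log(q[support])))
```

Masking first means no `log(0)` is ever evaluated, so no NumPy warnings need to be silenced.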

Task 2 — verify the decomposition identity

For 100 random pairs \((p, q)\) from Dirichlet(1, ..., 1):

Compute \(H(p)\), \(D_{\text{KL}}(p \,\|\, q)\), \(H(p, q)\) independently.
Assert \(\big| H(p) + D_{\text{KL}}(p \,\|\, q) - H(p, q) \big| < 10^{-10}\).

Add this check as a pytest property test in tests/test_phase05_decomposition.py.
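
The shape of such a test might look like the sketch below. It is self-contained only for illustration: the helper name `check_decomposition` and the fixed seed are assumptions, and the three quantities are computed with the plain formulas, which is safe here because Dirichlet(1, ..., 1) samples are strictly positive with probability 1. In the real test, call your own `entropy`, `kl`, and `cross_entropy` instead.

```python
import numpy as np


def check_decomposition(n_pairs: int = 100, v: int = 5, seed: int = 0) -> float:
    """Return the largest |H(p) + D_KL(p || q) - H(p, q)| over random pairs."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(v)                      # Dirichlet(1, ..., 1)
    worst = 0.0
    for _ in range(n_pairs):
        p = rng.dirichlet(alpha)
        q = rng.dirichlet(alpha)
        # Three independent computations, all in nats.
        h_p = -np.sum(p * np.log(p))
        kl_pq = np.sum(p * np.log(p / q))
        h_pq = -np.sum(p * np.log(q))
        worst = max(worst, abs(h_p + kl_pq - h_pq))
    return float(worst)


def test_decomposition_identity() -> None:
    assert check_decomposition() < 1e-10
```

Seeding the generator keeps a failing case reproducible.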

Task 3 — reproduce Gibbs' inequality numerically and on paper

Hand-write the Jensen proof of \(D_{\text{KL}}(p \,\|\, q) \ge 0\) (mirror theory/02-entropy-and-kl.md §"Proof of non-negativity").
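Numerically: for 1000 random pairs \((p, q)\), verify \(D_{\text{KL}}(p \,\|\, q) \ge 0\). Also check the equality case by setting \(q = p\) and confirming that \(D_{\text{KL}}(p \,\|\, p) = 0\) up to floating-point tolerance. This exercises only the "if" direction of \(D_{\text{KL}}(p \,\|\, q) = 0\) iff \(p = q\); the "only if" direction comes from the strictness of Jensen's inequality in the hand-written proof.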
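
As a checkpoint while writing the proof by hand, the Jensen step can be summarized as follows. This is the standard argument and its notation may differ from the theory file; all sums run over the support of \(p\).

\[
-D_{\text{KL}}(p \,\|\, q)
= \sum_{i:\, p_i > 0} p_i \log \frac{q_i}{p_i}
\le \log \sum_{i:\, p_i > 0} p_i \cdot \frac{q_i}{p_i}
= \log \sum_{i:\, p_i > 0} q_i
\le \log 1 = 0.
\]

The first inequality is Jensen's inequality for the concave function \(\log\). Because \(\log\) is strictly concave, equality there requires \(q_i / p_i\) to be constant on the support of \(p\); the second inequality is tight only when \(q\) places all of its mass on that support. Together these force \(q = p\).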

Task 4 — asymmetry

Compute \(D_{\text{KL}}(p \,\|\, q)\) vs \(D_{\text{KL}}(q \,\|\, p)\) for:

Case	\(p\)	\(q\)
A	\((0.5, 0.5, 0, 0, 0)\)	\((0.2, 0.2, 0.2, 0.2, 0.2)\)
B	\((0.2, 0.2, 0.2, 0.2, 0.2)\)	\((0.5, 0.5, 0, 0, 0)\)

What happens in case B, and why? Document your answer.
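
A minimal sketch for tabulating both directions and writing them out as CSV rows follows. The names `kl_where` and `write_asymmetry_rows` and the column headers are hypothetical. The KL helper uses `np.where` under `np.errstate` as a compact stand-in for your own `kl`. The rows go to an in-memory buffer here, so no file path is assumed.

```python
import csv
import io

import numpy as np


def kl_where(p, q) -> float:
    """D_KL(p || q) in nats via np.where; +inf on support mismatch."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    # Suppress warnings from log(0) and 0/0; np.where discards those entries.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log(p / q), 0.0)
    return float(np.sum(terms))


def write_asymmetry_rows(handle, cases) -> None:
    """Write one CSV row per case: name, D_KL(p || q), D_KL(q || p)."""
    writer = csv.writer(handle)
    writer.writerow(["case", "kl_p_q", "kl_q_p"])
    for name, (p, q) in cases.items():
        writer.writerow([name, kl_where(p, q), kl_where(q, p)])


cases = {
    "A": ([0.5, 0.5, 0.0, 0.0, 0.0], [0.2, 0.2, 0.2, 0.2, 0.2]),
    "B": ([0.2, 0.2, 0.2, 0.2, 0.2], [0.5, 0.5, 0.0, 0.0, 0.0]),
}
buffer = io.StringIO()  # swap for open(path, "w", newline="") to save a file
write_asymmetry_rows(buffer, cases)
print(buffer.getvalue())
```

The code only produces the numbers; explaining why one direction behaves differently is still the written part of the task.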

Task 5 — cross-entropy on the verb model (forward-looking)

Pretend the model outputs \(q = (0.6, 0.1, 0.1, 0.1, 0.1)\) and the ground truth is \(p = (1, 0, 0, 0, 0)\) (past tense). Compute:

\(H(p, q)\)
\(-\log q_{y^*}\) where \(y^* = 0\) (past).

They should be equal — and equal to the negative log-likelihood of the true label under the model. This is the form we'll wire into Phase 07's autograd.
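
The check can be sketched directly in NumPy. The variable names are illustrative, and the cross-entropy is written inline rather than through your `cross_entropy` so that the snippet stands alone.

```python
import numpy as np

q = np.array([0.6, 0.1, 0.1, 0.1, 0.1])  # model output from the task
p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # one-hot ground truth (past)
y_star = 0                                # index of the true label

support = p > 0                           # skip the p_i = 0 terms
h_pq = float(-np.sum(p[support] * np.log(q[support])))  # H(p, q) in nats
nll = float(-np.log(q[y_star]))                          # -log q_{y*}

print(h_pq, nll)  # both equal -log(0.6), roughly 0.5108 nats
assert np.isclose(h_pq, nll)
```

With a one-hot \(p\), the sum collapses to the single term at index \(y^*\), which is why the two quantities coincide.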

Measurements to capture

Per-pair wall-clock time of kl(p, q) at \(V = 600\).
Decomposition-identity test: 100 cases, all pass within tolerance.
Asymmetry table (Task 4) saved as experiments/<date>-phase-05-kl/asymmetry.csv.
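
The per-pair timing could be collected with something like the sketch below. The helper `time_per_pair`, the number of pairs, and the seed are assumptions. `naive_kl` is only a stand-in so the snippet runs on its own; pass your own `kl` when taking the real measurement.

```python
import time

import numpy as np


def time_per_pair(fn, v: int = 600, n_pairs: int = 1000, seed: int = 0) -> float:
    """Average wall-clock seconds per call of fn(p, q) on random pairs."""
    rng = np.random.default_rng(seed)
    # Sample all inputs up front so sampling cost is not included in the timing.
    ps = rng.dirichlet(np.ones(v), size=n_pairs)
    qs = rng.dirichlet(np.ones(v), size=n_pairs)
    start = time.perf_counter()
    for p, q in zip(ps, qs):
        fn(p, q)
    return (time.perf_counter() - start) / n_pairs


def naive_kl(p, q) -> float:
    """Stand-in for the lab's kl; valid only for strictly positive inputs."""
    return float(np.sum(p * np.log(p / q)))


print(f"{time_per_pair(naive_kl):.3e} s per pair")
```

Averaging over many pairs smooths out timer resolution and one-off overheads.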

Acceptance

kl and cross_entropy implemented; pass property tests.
Decomposition identity verified within 1e-10 for 100 random pairs.
Gibbs proof reproduced in your notes.
Asymmetry case study documented.
Phase-07-foreshadowing check (Task 5) numerically confirmed.

Pitfalls to expect

KL with support mismatch should return +inf, not raise. The pytest test should assert np.isposinf(kl(p, q)) for the mismatch case.
Asymmetry confuses people the first time, so write your notes explicitly enough that future-you doesn't misremember which side is which. Convention here: \(p\) is always the true (or empirical) distribution and \(q\) is the model.
Don't compare \(D_{\text{KL}}(p \,\|\, q)\) between different \(V\). KL is measured in nats, but the magnitudes are not directly comparable across vocabulary sizes.

Next: 02-log-sum-exp.md