Lab 01 — KL divergence and cross-entropy

Read theory/02-entropy-and-kl.md and theory/03-cross-entropy-and-mle.md. Do not consult solutions/.

Objective

Implement KL divergence and cross-entropy on categorical distributions. Empirically verify the decomposition identity \(H(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q)\). Reproduce the proof of Gibbs' inequality.

Setup

Continue with the 5-tense alphabet.

Tasks

Task 1 — implement kl(p, q) and cross_entropy(p, q)

In src/phase05/probability.py, add:

def kl(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """D_KL(p || q) in nats. +inf if p has support outside q's."""

def cross_entropy(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """H(p, q) = -sum_i p_i log q_i, in nats."""

Constraints:

  • Pure NumPy.
  • Handle \(p_i = 0\) via the convention \(0 \log(0/q_i) = 0\) for KL and, likewise, \(0 \log q_i = 0\) for cross-entropy.
  • Return +np.inf (not raise) when \(p_i > 0\) and \(q_i = 0\).
  • Validate inputs (same as entropy).
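One way to satisfy the zero-handling constraints is to restrict every sum to the support of \(p\). The following is a minimal sketch under that assumption, not the reference solution: the names `kl_sketch` and `cross_entropy_sketch` are hypothetical, and the input validation the task requires is deliberately left out.

```python
import numpy as np
from numpy.typing import NDArray


def kl_sketch(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """D_KL(p || q) in nats. Sketch only: no input validation."""
    support = p > 0  # terms with p_i = 0 contribute 0 by convention
    if np.any(q[support] == 0):
        return float(np.inf)  # p puts mass where q has none
    return float(np.sum(p[support] * np.log(p[support] / q[support])))


def cross_entropy_sketch(p: NDArray[np.float64], q: NDArray[np.float64]) -> float:
    """H(p, q) = -sum_i p_i log q_i in nats, same conventions as above."""
    support = p > 0
    if np.any(q[support] == 0):
        return float(np.inf)
    return float(-np.sum(p[support] * np.log(q[support])))
```

Masking before taking the logarithm avoids evaluating `log(0)` at all, so no NumPy warnings need to be suppressed.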

Task 2 — verify the decomposition identity

For 100 random pairs \((p, q)\) from Dirichlet(1, ..., 1):

  1. Compute \(H(p)\), \(D_{\text{KL}}(p \,\|\, q)\), \(H(p, q)\) independently.
  2. Assert \(\big| H(p) + D_{\text{KL}}(p \,\|\, q) - H(p, q) \big| < 10^{-10}\).

Add as a pytest property test: tests/test_phase05_decomposition.py.
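A possible shape for that test is sketched below. The three small functions are stand-ins so the example runs on its own; in the real test file you would import your own `entropy`, `kl`, and `cross_entropy` instead. The helper name `max_decomposition_gap` and the fixed seed are assumptions made for illustration.

```python
import numpy as np


def entropy(p):
    """Stand-in for H(p) in nats."""
    s = p > 0
    return float(-np.sum(p[s] * np.log(p[s])))


def kl(p, q):
    """Stand-in for D_KL(p || q) in nats (assumes q > 0 on p's support)."""
    s = p > 0
    return float(np.sum(p[s] * np.log(p[s] / q[s])))


def cross_entropy(p, q):
    """Stand-in for H(p, q) in nats (assumes q > 0 on p's support)."""
    s = p > 0
    return float(-np.sum(p[s] * np.log(q[s])))


def max_decomposition_gap(n_pairs=100, vocab_size=5, seed=0):
    """Largest |H(p) + D_KL(p || q) - H(p, q)| over random Dirichlet pairs."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(vocab_size)  # Dirichlet(1, ..., 1)
    worst = 0.0
    for _ in range(n_pairs):
        p = rng.dirichlet(alpha)
        q = rng.dirichlet(alpha)
        gap = abs(entropy(p) + kl(p, q) - cross_entropy(p, q))
        worst = max(worst, gap)
    return worst


def test_decomposition_identity():
    assert max_decomposition_gap() < 1e-10
```

Seeding the generator keeps the test reproducible, which makes a failure easier to debug than one drawn from fresh randomness on every run.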

Task 3 — reproduce Gibbs' inequality numerically + on paper

  1. Hand-write the Jensen proof of \(D_{\text{KL}}(p \,\|\, q) \ge 0\) (mirror theory/02-entropy-and-kl.md §"Proof of non-negativity").
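  2. Numerically: for 1000 random pairs \((p, q)\), verify \(D_{\text{KL}}(p \,\|\, q) \ge 0\); a tiny negative tolerance may be needed to absorb floating-point rounding. Also check the equality case: setting \(q = p\) should give \(D_{\text{KL}}(p \,\|\, q) = 0\). Note that this confirms only the "if" direction of \(D_{\text{KL}}(p \,\|\, q) = 0 \iff p = q\); the "only if" direction rests on the paper proof.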
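The numerical half of this task could look roughly like the sketch below. The `kl` here is a stand-in for your own implementation, and the helper name `gibbs_check`, the seed, and the tolerance `1e-12` are assumptions chosen for illustration.

```python
import numpy as np


def kl(p, q):
    """Stand-in for D_KL(p || q) in nats (assumes q > 0 on p's support)."""
    s = p > 0
    return float(np.sum(p[s] * np.log(p[s] / q[s])))


def gibbs_check(n_pairs=1000, vocab_size=5, seed=0, tol=1e-12):
    """Check D_KL >= 0 on random pairs and D_KL(p || p) = 0; return the smallest KL seen."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(vocab_size)
    smallest = np.inf
    for _ in range(n_pairs):
        p = rng.dirichlet(alpha)
        q = rng.dirichlet(alpha)
        value = kl(p, q)
        assert value >= -tol       # non-negativity, up to rounding
        assert abs(kl(p, p)) <= tol  # equality case q = p
        smallest = min(smallest, value)
    return smallest
```

Returning the smallest observed value gives you a number to record in your notes alongside the paper proof.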

Task 4 — asymmetry

Compute \(D_{\text{KL}}(p \,\|\, q)\) vs \(D_{\text{KL}}(q \,\|\, p)\) for:

| Case | \(p\) | \(q\) |
|------|-------|-------|
| A | \((0.5, 0.5, 0, 0, 0)\) | \((0.2, 0.2, 0.2, 0.2, 0.2)\) |
| B | \((0.2, 0.2, 0.2, 0.2, 0.2)\) | \((0.5, 0.5, 0, 0, 0)\) |

What happens in case B? Why? Document.
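A small script along these lines can produce the numbers for both cases. It is a sketch: `kl` stands in for your own implementation, and the variable names (`half`, `uniform`, `rows`) are hypothetical. Running it prints both directions for each case; interpreting the output is still part of the task.

```python
import numpy as np


def kl(p, q):
    """Stand-in for D_KL(p || q) in nats; +inf on support mismatch."""
    s = p > 0
    if np.any(q[s] == 0):
        return float(np.inf)
    return float(np.sum(p[s] * np.log(p[s] / q[s])))


half = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
uniform = np.full(5, 0.2)

# Each case maps to its (p, q) pair from the table above.
cases = {"A": (half, uniform), "B": (uniform, half)}

# One row per case: (name, D_KL(p || q), D_KL(q || p)).
rows = [(name, kl(p, q), kl(q, p)) for name, (p, q) in cases.items()]

for name, forward, reverse in rows:
    print(f"{name}: D_KL(p||q) = {forward:.6f}, D_KL(q||p) = {reverse:.6f}")
```

The `rows` list is already in a shape that the standard-library `csv` module can write out for the asymmetry table.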

Task 5 — cross-entropy on the verb model (forward-looking)

Pretend the model outputs \(q = (0.6, 0.1, 0.1, 0.1, 0.1)\) and the ground truth is \(p = (1, 0, 0, 0, 0)\) (past tense). Compute:

  • \(H(p, q)\)
  • \(-\log q_{y^*}\) where \(y^* = 0\) (past).

They should be equal — and equal to the negative log-likelihood of the true label under the model. This is the form we'll wire into Phase 07's autograd.
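The check can be sketched in a few lines. This version computes the cross-entropy inline over the support of \(p\) rather than calling your `cross_entropy`, so that it runs on its own; the variable names are illustrative only.

```python
import numpy as np

q = np.array([0.6, 0.1, 0.1, 0.1, 0.1])  # model output
p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # one-hot ground truth (past)
y_star = 0                               # index of the true label

# H(p, q): only the entries with p_i > 0 contribute.
support = p > 0
h_pq = float(-np.sum(p[support] * np.log(q[support])))

# Negative log-likelihood of the true label under the model.
nll = float(-np.log(q[y_star]))

print(h_pq, nll)
```

With a one-hot \(p\), the sum collapses to the single term at \(y^*\), which is why the two quantities coincide.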

Measurements to capture

  • Per-pair wall-clock of kl(p, q) at \(V = 600\).
  • Decomposition-identity test: 100 cases, all pass within tolerance.
  • Asymmetry table (Task 4) saved as experiments/<date>-phase-05-kl/asymmetry.csv.
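For the wall-clock measurement, one option is to pre-generate the pairs and time only the calls to `kl`, as in the sketch below. The stand-in `kl`, the helper name `time_kl`, and the choice of 1000 pairs are assumptions; substitute your own implementation and sample size.

```python
import time

import numpy as np


def kl(p, q):
    """Stand-in for D_KL(p || q) in nats (assumes q > 0 on p's support)."""
    s = p > 0
    return float(np.sum(p[s] * np.log(p[s] / q[s])))


def time_kl(vocab_size=600, n_pairs=1000, seed=0):
    """Mean wall-clock seconds per kl(p, q) call at the given vocabulary size."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(vocab_size)
    # Sample outside the timed region so only kl itself is measured.
    pairs = [(rng.dirichlet(alpha), rng.dirichlet(alpha)) for _ in range(n_pairs)]
    start = time.perf_counter()
    for p, q in pairs:
        kl(p, q)
    return (time.perf_counter() - start) / n_pairs


print(f"{time_kl() * 1e6:.2f} microseconds per pair")
```

Averaging over many pairs smooths out timer resolution and one-off scheduling noise, which matter at this call duration.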

Acceptance

  • kl and cross_entropy implemented; pass property tests.
  • Decomposition identity verified within 1e-10 for 100 random pairs.
  • Gibbs proof reproduced in your notes.
  • Asymmetry case study documented.
  • Phase-07-foreshadowing check (Task 5) numerically confirmed.

Pitfalls to expect

  • KL with support mismatch should return +inf, not raise. The pytest test should assert np.isposinf(kl(p, q)) for the mismatch case.
  • Asymmetry confuses people the first time — write your notes explicitly so future-you doesn't mis-remember which side is which. Convention here: \(p\) is always the true distribution (or empirical), \(q\) is the model.
  • Don't compare \(D_{\text{KL}}(p \,\|\, q)\) between different \(V\). KL is measured in nats at every vocabulary size, but its typical magnitude changes with \(V\), so values are not directly comparable across vocabulary sizes.

Next: 02-log-sum-exp.md