Skip to content

English · Español

02 — Entropy and KL divergence

🇪🇸 Dos cantidades centrales: cuánta incertidumbre tiene una distribución (entropía), y cuán lejos están dos distribuciones (KL). Cross-entropy se va a derivar de estas dos.

Entropy: \(H(p)\)

The entropy of a categorical distribution \(p\) over a finite support of size \(V\) is:

\[H(p) = -\sum_{i=1}^{V} p_i \log p_i,\]

with the convention \(0 \log 0 = 0\) (a removable singularity — the limit is 0 from above).

Interpretation

Three equivalent intuitions:

  1. Average surprise. If we sample \(i \sim p\), the "surprise" of seeing \(i\) is \(-\log p_i\) (rare events are more surprising). Entropy is the expected surprise.
  2. Optimal coding cost. The minimum average bits/nats to encode samples from \(p\) (Shannon's source-coding theorem).
  3. Concentration. A distribution close to uniform has high entropy; a near-point-mass has low entropy.

Bounds

For a distribution on \(V\) outcomes:

\[0 \le H(p) \le \log V.\]
  • Lower bound (\(H = 0\)): achieved iff \(p\) is a point mass.
  • Upper bound (\(H = \log V\)): achieved iff \(p\) is uniform.

Proof of upper bound (Jensen). \(-\log\) is convex. By Jensen, for any probabilities \(p_i\) summing to 1:

\[H(p) = \sum_i p_i \cdot (-\log p_i) \le -\log\left(\sum_i p_i \cdot p_i\right) \cdot \text{NO} — \text{this is wrong}.\]

Correct argument: \(H(p) = \sum_i p_i \log(1/p_i)\). By Jensen on the concave function \(\log\):

\[H(p) = \mathbb{E}_p[\log(1/p_i)] \le \log \mathbb{E}_p[1/p_i] = \log V.\]

The last equality uses \(\mathbb{E}_p[1/p_i] = \sum_i p_i \cdot (1/p_i) = V\). Equality holds in Jensen iff the random variable is constant — i.e., \(1/p_i = V\) for all \(i\), i.e., \(p\) uniform. ✓

🇪🇸 Tienes que reproducir esta demostración a mano en el lab 00. La parte sutil es el uso correcto de Jensen.

Worked example: entropy of the 5-tense uniform

For a uniform distribution over 5 tenses, \(p_i = 1/5\) for all \(i\). Then \(H(p) = 5 \cdot (1/5) \log 5 = \log 5 \approx 1.609\) nats \(\approx 2.322\) bits.

If the model is certain about the tense (one \(p_i = 1\), all others 0), \(H(p) = 0\).

A "medium-confidence" model with \(p = (0.6, 0.1, 0.1, 0.1, 0.1)\) has \(H \approx 1.226\) nats.

KL divergence: \(D_{\text{KL}}(p \,\|\, q)\)

The Kullback-Leibler divergence from \(q\) to \(p\) is:

\[D_{\text{KL}}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i},\]

with the convention \(0 \log(0/0) = 0\) and \(0 \log(0/q_i) = 0\) for \(q_i > 0\). If there's any \(i\) with \(p_i > 0\) and \(q_i = 0\), the KL is \(+\infty\) (this is the "support mismatch" failure mode).

Properties

  1. Non-negativity. \(D_{\text{KL}}(p \,\|\, q) \ge 0\), with equality iff \(p = q\) (Gibbs' inequality).
  2. Asymmetric. \(D_{\text{KL}}(p \,\|\, q) \ne D_{\text{KL}}(q \,\|\, p)\) in general — KL is not a metric.
  3. No triangle inequality. Confirms KL is not a metric.
  4. Information-geometric meaning. Up to second order around \(p = q\), \(D_{\text{KL}}\) behaves like a Riemannian distance squared (Fisher information metric). Out of scope here, but worth knowing for later.

Proof of non-negativity (Gibbs)

By Jensen on the convex function \(-\log\):

\[D_{\text{KL}}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i} = -\sum_i p_i \log \frac{q_i}{p_i} = \mathbb{E}_p\left[-\log \frac{q_i}{p_i}\right].\]

By Jensen (\(-\log\) convex, \(\mathbb{E}[-\log Z] \ge -\log \mathbb{E}[Z]\)):

\[D_{\text{KL}}(p \,\|\, q) \ge -\log \mathbb{E}_p\left[\frac{q_i}{p_i}\right] = -\log \sum_i p_i \cdot \frac{q_i}{p_i} = -\log \sum_i q_i = -\log 1 = 0.\]

Equality in Jensen iff \(q_i / p_i\) is constant, which (combined with both being normalised) forces \(p = q\). ✓

Borja reproduces this proof in lab 01.

Asymmetry — which way is it?

Convention varies. Our convention: \(D_{\text{KL}}(p \,\|\, q)\) is "the cost in extra nats of coding samples from \(p\) using a code optimised for \(q\)". In machine learning, \(p\) is typically the true distribution and \(q\) is the model.

Practical consequence: the gradient of \(D_{\text{KL}}(p^* \,\|\, q_\theta)\) with respect to \(\theta\) is the gradient of \(H(p^*, q_\theta)\) (since \(H(p^*)\) doesn't depend on \(\theta\)). This is exactly the cross-entropy gradient — the centrepiece of 03-cross-entropy-and-mle.md.

Worked example: KL between two verb-tense distributions

Model output \(q = (0.6, 0.1, 0.1, 0.1, 0.1)\) and ground truth \(p = (1, 0, 0, 0, 0)\) (point mass on past tense). Then:

\[D_{\text{KL}}(p \,\|\, q) = 1 \cdot \log(1/0.6) + 0 + 0 + 0 + 0 = \log(5/3) \approx 0.511 \text{ nats}.\]

The reverse: \(D_{\text{KL}}(q \,\|\, p) = +\infty\) (support mismatch — \(p\) assigns 0 to forms where \(q\) has support). This is why we always write the true distribution first: \(D_{\text{KL}}(\text{true} \,\|\, \text{model})\), never the other way around.

Mutual information: \(I(X; Y)\)

Defined as:

\[I(X; Y) = D_{\text{KL}}\big(p(X, Y) \,\|\, p(X) \, p(Y)\big) = H(X) + H(Y) - H(X, Y).\]

Equivalently \(I(X; Y) = H(Y) - H(Y \mid X)\) — the reduction in uncertainty about \(Y\) from observing \(X\). For language modelling, \(I(\text{prefix}; \text{next-token})\) is what the model is trying to capture.

We won't compute MI in a lab here (deferred to Phase 20 probing analysis), but the definition is foundational.

What this file does NOT cover

  • Conditional entropy as the basis of decoding theory (Shannon's noisy-channel theorem).
  • Rényi entropy, Tsallis entropy, and the family of generalisations.
  • Estimation of MI from samples — a non-trivial statistical problem deferred to Phase 20.

Next: 03-cross-entropy-and-mle.md