English · Español
02 — Entropy and KL divergence¶
🇪🇸 Dos cantidades centrales: cuánta incertidumbre tiene una distribución (entropía), y cuán lejos están dos distribuciones (KL). Cross-entropy se va a derivar de estas dos.
Entropy: \(H(p)\)¶
The entropy of a categorical distribution \(p\) over a finite support of size \(V\) is:
with the convention \(0 \log 0 = 0\) (a removable singularity — the limit is 0 from above).
Interpretation¶
Three equivalent intuitions:
- Average surprise. If we sample \(i \sim p\), the "surprise" of seeing \(i\) is \(-\log p_i\) (rare events are more surprising). Entropy is the expected surprise.
- Optimal coding cost. The minimum average bits/nats to encode samples from \(p\) (Shannon's source-coding theorem).
- Concentration. A distribution close to uniform has high entropy; a near-point-mass has low entropy.
Bounds¶
For a distribution on \(V\) outcomes:
- Lower bound (\(H = 0\)): achieved iff \(p\) is a point mass.
- Upper bound (\(H = \log V\)): achieved iff \(p\) is uniform.
Proof of upper bound (Jensen). \(-\log\) is convex. By Jensen, for any probabilities \(p_i\) summing to 1:
Correct argument: \(H(p) = \sum_i p_i \log(1/p_i)\). By Jensen on the concave function \(\log\):
The last equality uses \(\mathbb{E}_p[1/p_i] = \sum_i p_i \cdot (1/p_i) = V\). Equality holds in Jensen iff the random variable is constant — i.e., \(1/p_i = V\) for all \(i\), i.e., \(p\) uniform. ✓
🇪🇸 Tienes que reproducir esta demostración a mano en el lab 00. La parte sutil es el uso correcto de Jensen.
Worked example: entropy of the 5-tense uniform¶
For a uniform distribution over 5 tenses, \(p_i = 1/5\) for all \(i\). Then \(H(p) = 5 \cdot (1/5) \log 5 = \log 5 \approx 1.609\) nats \(\approx 2.322\) bits.
If the model is certain about the tense (one \(p_i = 1\), all others 0), \(H(p) = 0\).
A "medium-confidence" model with \(p = (0.6, 0.1, 0.1, 0.1, 0.1)\) has \(H \approx 1.226\) nats.
KL divergence: \(D_{\text{KL}}(p \,\|\, q)\)¶
The Kullback-Leibler divergence from \(q\) to \(p\) is:
with the convention \(0 \log(0/0) = 0\) and \(0 \log(0/q_i) = 0\) for \(q_i > 0\). If there's any \(i\) with \(p_i > 0\) and \(q_i = 0\), the KL is \(+\infty\) (this is the "support mismatch" failure mode).
Properties¶
- Non-negativity. \(D_{\text{KL}}(p \,\|\, q) \ge 0\), with equality iff \(p = q\) (Gibbs' inequality).
- Asymmetric. \(D_{\text{KL}}(p \,\|\, q) \ne D_{\text{KL}}(q \,\|\, p)\) in general — KL is not a metric.
- No triangle inequality. Confirms KL is not a metric.
- Information-geometric meaning. Up to second order around \(p = q\), \(D_{\text{KL}}\) behaves like a Riemannian distance squared (Fisher information metric). Out of scope here, but worth knowing for later.
Proof of non-negativity (Gibbs)¶
By Jensen on the convex function \(-\log\):
By Jensen (\(-\log\) convex, \(\mathbb{E}[-\log Z] \ge -\log \mathbb{E}[Z]\)):
Equality in Jensen iff \(q_i / p_i\) is constant, which (combined with both being normalised) forces \(p = q\). ✓
Borja reproduces this proof in lab 01.
Asymmetry — which way is it?¶
Convention varies. Our convention: \(D_{\text{KL}}(p \,\|\, q)\) is "the cost in extra nats of coding samples from \(p\) using a code optimised for \(q\)". In machine learning, \(p\) is typically the true distribution and \(q\) is the model.
Practical consequence: the gradient of \(D_{\text{KL}}(p^* \,\|\, q_\theta)\) with respect to \(\theta\) is the gradient of \(H(p^*, q_\theta)\) (since \(H(p^*)\) doesn't depend on \(\theta\)). This is exactly the cross-entropy gradient — the centrepiece of 03-cross-entropy-and-mle.md.
Worked example: KL between two verb-tense distributions¶
Model output \(q = (0.6, 0.1, 0.1, 0.1, 0.1)\) and ground truth \(p = (1, 0, 0, 0, 0)\) (point mass on past tense). Then:
The reverse: \(D_{\text{KL}}(q \,\|\, p) = +\infty\) (support mismatch — \(p\) assigns 0 to forms where \(q\) has support). This is why we always write the true distribution first: \(D_{\text{KL}}(\text{true} \,\|\, \text{model})\), never the other way around.
Mutual information: \(I(X; Y)\)¶
Defined as:
Equivalently \(I(X; Y) = H(Y) - H(Y \mid X)\) — the reduction in uncertainty about \(Y\) from observing \(X\). For language modelling, \(I(\text{prefix}; \text{next-token})\) is what the model is trying to capture.
We won't compute MI in a lab here (deferred to Phase 20 probing analysis), but the definition is foundational.
What this file does NOT cover¶
- Conditional entropy as the basis of decoding theory (Shannon's noisy-channel theorem).
- Rényi entropy, Tsallis entropy, and the family of generalisations.
- Estimation of MI from samples — a non-trivial statistical problem deferred to Phase 20.