03 - Cross-entropy and MLE: the central derivation

This is the centrepiece of the phase. It shows that the "cross-entropy loss" is not an arbitrary choice: it is what necessarily falls out when you apply maximum likelihood to a categorical distribution.

Cross-entropy: definition

The cross-entropy between two distributions \(p, q\) on the same support is:

\[H(p, q) := -\sum_i p_i \log q_i.\]

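(Note the asymmetry: \(H(p, q) \ne H(q, p)\) in general. Logarithms are natural, so values are in nats, and terms with \(p_i = 0\) contribute nothing, by the convention \(0 \log 0 = 0\).)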
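The definition translates almost directly into code. The following is a minimal sketch using only the standard library; the function name `cross_entropy` and the two distributions are illustrative assumptions, not part of the course code.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i in nats; terms with p_i = 0 are skipped."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical distributions on the same three-element support.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

h_pq = cross_entropy(p, q)  # expectation under p of -log q
h_qp = cross_entropy(q, p)  # expectation under q of -log p
print(h_pq, h_qp)           # the two values differ: H is not symmetric
```

Swapping the arguments changes which distribution supplies the weights and which supplies the log-probabilities, which is why the two numbers disagree.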

The decomposition identity

By direct algebra:

\[H(p, q) = -\sum_i p_i \log q_i = -\sum_i p_i \log p_i + \sum_i p_i \log \frac{p_i}{q_i} = H(p) + D_{\text{KL}}(p \,\|\, q).\]

This is the most important identity of the phase. Read it three ways:

Cross-entropy is entropy + divergence-from-the-model.
When \(p = q\), the divergence term vanishes and \(H(p, q) = H(p)\): CE reaches its floor, the entropy, exactly when the model matches the truth.
The only way to lower CE below \(H(p)\) is to make \(D_{\text{KL}}(p \,\|\, q) < 0\), which Gibbs' inequality rules out. So \(H(p, q) \ge H(p)\) always, with equality iff \(p = q\).
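The identity and the lower bound can be checked numerically. This is a sketch under assumed inputs: the helper names and the distributions `p` and `q` are hypothetical, chosen so that no term is degenerate.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log p_i, with 0 log 0 treated as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # hypothetical "truth"
q = [0.5, 0.3, 0.2]  # hypothetical model

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
gap = lhs - entropy(p)  # equals D_KL(p || q), so it cannot be negative

print(lhs, rhs, gap)
```

Up to floating-point rounding, `lhs` and `rhs` agree, and `gap` is strictly positive because `p` and `q` differ.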

Worked example

Reusing the prior example: \(p = (1, 0, 0, 0, 0)\), \(q = (0.6, 0.1, 0.1, 0.1, 0.1)\).

\(H(p) = 0\) (point mass).
\(D_{\text{KL}}(p \,\|\, q) = \log(5/3) \approx 0.511\).
\(H(p, q) = -\log 0.6 \approx 0.511\).

Check: \(H(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q) = 0 + 0.511\). ✓

Note that with \(p\) a point mass, \(H(p, q)\) reduces to \(-\log q_{y^*}\) where \(y^*\) is the true label — the negative log-likelihood of the true label. This is the cross-entropy loss we'll wire into Phase 07's autograd and Phase 18's training loop.
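The same numbers can be reproduced in a few lines. This sketch assumes the true label sits at index 0, matching the point mass in the example; the variable names are illustrative.

```python
import math

p = [1.0, 0.0, 0.0, 0.0, 0.0]  # point mass on the true label
q = [0.6, 0.1, 0.1, 0.1, 0.1]  # model distribution from the example
true_label = 0                 # index carrying all of p's mass

# Full cross-entropy sum; only the term with p_i = 1 survives.
ce = -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Negative log-likelihood of the true label alone.
nll = -math.log(q[true_label])

print(round(ce, 3), round(nll, 3))  # 0.511 0.511
```

With a one-hot target the sum never needs to be formed: indexing the model's probability at the true label is enough.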

Maximum-likelihood estimation (MLE)

Given \(N\) observations \(\{y^{(n)}\}_{n=1}^N\) drawn i.i.d. from a parametric family \(\{q_\theta\}\), the MLE is:

\[\theta^{\text{MLE}} = \arg\max_\theta \prod_n q_\theta(y^{(n)}) = \arg\max_\theta \sum_n \log q_\theta(y^{(n)}).\]

Equivalently — and this is the form we want — using the empirical distribution \(\widehat{p}(y) = \frac{1}{N} \sum_n \mathbb{1}[y^{(n)} = y]\):

\[\sum_n \log q_\theta(y^{(n)}) = \sum_y N \cdot \widehat{p}(y) \log q_\theta(y) = -N \cdot H(\widehat{p}, q_\theta).\]

So MLE is exactly empirical-cross-entropy minimisation:

\[\theta^{\text{MLE}} = \arg\min_\theta H(\widehat{p}, q_\theta).\]
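A small numerical check of the rewrite from log-likelihood to empirical cross-entropy. The observations below are made up, and the sketch assumes the simplest parametric family: an unconstrained categorical distribution, for which the decomposition identity implies the minimiser is the empirical distribution itself.

```python
import math
from collections import Counter

# Hypothetical i.i.d. observations over three outcomes.
observations = ["a", "b", "a", "c", "a", "b", "a", "a"]
N = len(observations)

# Empirical distribution: relative frequency of each outcome.
p_hat = {y: count / N for y, count in Counter(observations).items()}

def log_likelihood(q):
    """Sum over observations of log q(y)."""
    return sum(math.log(q[y]) for y in observations)

def cross_entropy(p, q):
    """H(p, q) over a dictionary-valued support."""
    return -sum(p[y] * math.log(q[y]) for y in p if p[y] > 0)

q_other = {"a": 0.5, "b": 0.3, "c": 0.2}  # an arbitrary competing model

# log-likelihood + N * H(p_hat, q) should vanish for any q.
identity_gap = log_likelihood(q_other) + N * cross_entropy(p_hat, q_other)

ll_at_p_hat = log_likelihood(p_hat)
ll_at_other = log_likelihood(q_other)
print(identity_gap, ll_at_p_hat, ll_at_other)
```

The gap is zero up to rounding, and the likelihood at `p_hat` exceeds the likelihood at `q_other`, as Gibbs' inequality requires for any model other than `p_hat`.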

Conditional version (the actual training objective)

For supervised learning with inputs \(x^{(n)}\) and labels \(y^{(n)}\) from the joint \(p^*(x, y)\):

\[\theta^{\text{MLE}} = \arg\max_\theta \sum_n \log q_\theta(y^{(n)} \mid x^{(n)}).\]

By the same algebra, this equals minimising the conditional empirical cross-entropy averaged over inputs:

\[\theta^{\text{MLE}} = \arg\min_\theta \frac{1}{N} \sum_n -\log q_\theta(y^{(n)} \mid x^{(n)}).\]

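The objective inside the \(\arg\min\), the mean negative log-likelihood over the batch, is exactly what F.cross_entropy(logits, labels) computes in PyTorch with its default mean reduction (and what we'll implement in NumPy in Phase 07).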
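For intuition, here is a pure-Python sketch of that averaged objective, taking raw logits as input. It does not call PyTorch; the names `log_softmax` and `mean_cross_entropy` and the example logits are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Log-probabilities from logits; the max is subtracted for numerical safety."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def mean_cross_entropy(batch_logits, labels):
    """Mean over the batch of -log q(y | x), with q = softmax(logits)."""
    losses = [-log_softmax(z)[y] for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Hypothetical batch: two examples, three classes.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [0, 2]

loss = mean_cross_entropy(batch_logits, labels)
print(loss)
```

Each example contributes the negative log-probability of its own label, and the batch loss is the plain average of those contributions.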

The asymptotic argument

What does this minimum converge to as \(N \to \infty\)?

By the law of large numbers, for each fixed \(\theta\) the empirical average converges in probability to the true expectation:

\[\frac{1}{N} \sum_n -\log q_\theta(y^{(n)} \mid x^{(n)}) \xrightarrow{p} \mathbb{E}_{(x,y) \sim p^*}[-\log q_\theta(y \mid x)] = \mathbb{E}_x[H(p^*(\cdot \mid x), q_\theta(\cdot \mid x))].\]

By the decomposition identity:

\[\mathbb{E}_x[H(p^*(\cdot \mid x), q_\theta(\cdot \mid x))] = \mathbb{E}_x[H(p^*(\cdot \mid x))] + \mathbb{E}_x[D_{\text{KL}}(p^*(\cdot \mid x) \,\|\, q_\theta(\cdot \mid x))].\]

The first term doesn't depend on \(\theta\) — it's the irreducible entropy of the true labelling process. The second is the expected KL between truth and model.

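Conclusion: in the infinite-data limit, and under the usual regularity conditions that let the minimiser converge along with the objective, MLE minimises the expected forward KL divergence \(D_{\text{KL}}(p^* \,\|\, q_\theta)\) from truth to model. This is the cleanest possible justification for cross-entropy loss.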
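The convergence can be observed in a small simulation. This sketch simplifies to the unconditional case (no inputs \(x\)) and holds the model fixed; `p_star`, `q_model`, the sample sizes, and the seed are all hypothetical choices.

```python
import math
import random

p_star = [0.7, 0.2, 0.1]   # hypothetical true distribution
q_model = [0.5, 0.3, 0.2]  # hypothetical fixed model

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def empirical_nll(n, rng):
    """Average of -log q_model(y) over n labels sampled i.i.d. from p_star."""
    samples = rng.choices(range(len(p_star)), weights=p_star, k=n)
    return sum(-math.log(q_model[y]) for y in samples) / n

rng = random.Random(0)  # fixed seed so runs are repeatable
limit = cross_entropy(p_star, q_model)  # H(p*) + D_KL(p* || q)

for n in (100, 10_000, 200_000):
    print(n, empirical_nll(n, rng), limit)

estimate = empirical_nll(200_000, rng)
```

As `n` grows, the empirical average settles near `limit`, the cross-entropy between `p_star` and `q_model`, which is the entropy of `p_star` plus the KL term that training can reduce.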

This is what you need to be able to explain after Phase 05: the cross-entropy loss is the only choice consistent with maximum likelihood for a categorical distribution. It is not a heuristic; it is mathematics.

Per-token language-model cross-entropy

For a sequence model on a sequence \(y_{1:T}\):

\[\mathcal{L} = -\sum_{t=1}^T \log q_\theta(y_t \mid y_{<t}).\]

Per-token, the average cross-entropy is \(\mathcal{L} / T\). Perplexity is \(\exp(\mathcal{L}/T)\) — the geometric-mean inverse-probability assigned to the next token. The mini-GPT in Phase 17 outputs this object; the training loop in Phase 18 minimises it.

For our §A13 corpus, an untrained model with near-uniform outputs has per-token CE \(\approx \log V \approx \log 600 \approx 6.4\) nats. A perfectly trained model in the limit reaches the per-token entropy of the true conditional, which for a deterministic conjugation task is 0 if the prefix uniquely determines the form. For ambiguous prefixes (e.g., "Yesterday" followed by an unknown subject), the entropy floor is non-zero.
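A short sketch of the per-token quantities, assuming the model's probability for each realised token is already available as a list. The function names and the 10-token sequence length are illustrative; the vocabulary size of 600 is the figure quoted in the text.

```python
import math

def per_token_ce(token_probs):
    """Average of -log q(y_t | y_<t) over the sequence, in nats."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """exp of the per-token CE: geometric mean of the inverse probabilities."""
    return math.exp(per_token_ce(token_probs))

V = 600  # vocabulary size quoted in the text

# A uniform model assigns 1/V to every token, whatever the sequence is.
uniform_probs = [1.0 / V] * 10

print(per_token_ce(uniform_probs))  # log(600), about 6.4 nats
print(perplexity(uniform_probs))    # about 600, the vocabulary size
```

Perplexity is often read as an effective branching factor: a uniform model over 600 tokens has perplexity 600, and a model that assigns probability 1 to every realised token has perplexity 1.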

Connections forward

Phase 07: implement cross_entropy(logits, label) in the scalar autograd. Derive the gradient \(\partial \mathcal{L} / \partial z_i = q_i - \mathbb{1}[i = y^*]\) from first principles.
Phase 18: the training loop's loss is exactly this. The CE-vs-step curve is what we monitor.
Phase 20: evaluation metrics are derived from this — perplexity, top-k accuracy, calibration (next file).
Phase 26: quantisation introduces small distributional shifts; we measure them in KL.
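The gradient formula quoted for Phase 07 can be sanity-checked with finite differences before deriving it. This is a standalone sketch, not the Phase 07 autograd; the function names, logits, label, and step size are assumptions.

```python
import math

def softmax(logits):
    """Softmax with the max subtracted for numerical safety."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, label):
    """Cross-entropy of softmax(logits) against a one-hot label."""
    return -math.log(softmax(logits)[label])

def analytic_grad(logits, label):
    """dL/dz_i = q_i - 1[i == label]."""
    q = softmax(logits)
    return [qi - (1.0 if i == label else 0.0) for i, qi in enumerate(q)]

def numeric_grad(logits, label, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, label) - loss(down, label)) / (2 * eps))
    return grads

logits = [1.5, -0.3, 0.8]  # hypothetical logits
label = 2                  # hypothetical true class

print(analytic_grad(logits, label))
print(numeric_grad(logits, label))
```

The two gradients agree to within finite-difference error, and the analytic components sum to zero because the softmax probabilities sum to one.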

What this file does NOT cover

Bayesian alternatives (MAP, variational inference).
Regularisation framed as a prior on \(\theta\) (worth knowing; deferred).
The relation to the EM algorithm (out of scope for the supervised setting).

Next: 04-log-sum-exp-and-stability.md