00 - Motivation: why probability for a verb-grammar model
The picture
A trained mini-GPT reads "Yesterday I" and emits a probability distribution over its full output vocabulary. For the §A13 verb-grammar corpus that vocabulary has ≈600 forms (20 verbs × 5 tenses × 3 persons = 300 English forms, their 300 paired Spanish translations, and a small set of connectors). The model is not certain that the next token is "worked". It assigns probability 0.78 to "worked", 0.09 to "work", 0.05 to "work-ing" (deferred but informative), and spreads the remaining 0.08 as a thin tail across everything else.
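To make those numbers concrete, here is a minimal sketch in plain Python. It assumes exactly \(V = 600\) and a uniform tail, neither of which the text specifies; both are illustrative choices. It checks that the distribution sums to 1 and computes the per-token loss \(-\log q(\text{token})\) for two possible true next tokens.

```python
import math

# Assumed vocabulary size (the text says "≈600"); the exact value is illustrative.
V = 600

# Probabilities quoted in the text for the prefix "Yesterday I".
named = {"worked": 0.78, "work": 0.09, "work-ing": 0.05}

# Assumption: the remaining mass is spread uniformly over all other tokens.
tail_mass = 1.0 - sum(named.values())
tail_prob = tail_mass / (V - len(named))

# Sanity check: the full distribution sums to 1.
total = sum(named.values()) + tail_prob * (V - len(named))

# Per-token loss (surprisal, in nats) if the true next token is "worked".
loss_worked = -math.log(named["worked"])

# Per-token loss if the true next token were one of the tail tokens instead.
loss_tail = -math.log(tail_prob)

print(f"total probability    = {total:.6f}")
print(f"loss on 'worked'     = {loss_worked:.3f} nats")
print(f"loss on a tail token = {loss_tail:.3f} nats")
```

A confident, correct prediction costs a fraction of a nat, while a token buried in the tail costs far more. The whole vector of 600 probabilities, not just its largest entry, is what the rest of this phase studies.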
This is the central object of supervised language modelling: a conditional distribution over the vocabulary given the prefix. Three foundational questions follow:
- How do we measure how "good" that distribution is? Spoiler: cross-entropy — but we have to earn the right to that statement.
- What does it mean for two distributions to be close or far? Answer: KL divergence — but again, derived from a definition, not asserted.
- Why is `nn.CrossEntropyLoss` numerically delicate? Answer: small probabilities × large vocabularies × `log` = underflow unless we work in log-space throughout.
Phase 05 builds the toolkit to answer all three rigorously.
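The third question can be previewed with a small stdlib-only sketch. The function names `naive_log_softmax` and `stable_log_softmax` are hypothetical, the logits are made up, and nothing here claims to mirror how `nn.CrossEntropyLoss` is implemented internally. It only illustrates why taking `log` after exponentiating fails, and how staying in log-space avoids the problem.

```python
import math

def naive_log_softmax(logits):
    # Exponentiate, normalise, then take the log: fragile for extreme logits.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(logits):
    # Log-sum-exp trick: subtract the max before exponentiating,
    # so the largest term is exp(0) = 1 and the sum cannot underflow to 0.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

# Illustrative logits: one token is extremely unlikely.
logits = [0.0, -800.0, -5.0]

try:
    naive = naive_log_softmax(logits)
except ValueError:
    # math.exp(-800.0) underflows to 0.0, and math.log(0.0) raises ValueError.
    naive = None

stable = stable_log_softmax(logits)
print("naive :", naive)
print("stable:", stable)
```

The stable version never forms the tiny probability itself; it only ever handles its logarithm, which is an ordinary finite number.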
The §A13 setting in numbers
Concretely, when we say "the model outputs a probability over the verb-form vocabulary", here is the exact thing:
| Quantity | Symbol | Value (Phase 12+) |
|---|---|---|
| Vocab size | \(V\) | ≈ 600 |
| Verbs | \(N_{\text{verb}}\) | 20 |
| Tenses | \(N_{\text{tense}}\) | 5 |
| Persons | \(N_{\text{person}}\) | 3 |
| Forms = \(N_{\text{verb}} \times N_{\text{tense}} \times N_{\text{person}}\) | | 300 |
| + Spanish translations (paired 1-to-1) | | 300 |
| + a small set of connectors (I, you, he, ...) | | ≈ 20–30 |
The full vocabulary is small: about 600 forms. This means the distributions we will manipulate are small and manageable by hand. It is a good size for understanding the math without getting lost in scale.
The smallness of \(V\) is a feature: we can plot full distributions, compute entropies by hand, and verify identities exactly. By Phase 11 (BPE) we may split going → go + ing and grow \(V\) modestly; we'll revisit calibration analysis at that point.
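As one example of a by-hand calculation, assume exactly \(V = 600\) for illustration. The uniform distribution \(u\) over the vocabulary then has entropy

\[
H(u) = -\sum_{i=1}^{V} \frac{1}{V} \log \frac{1}{V} = \log V = \log 600 \approx 6.40 \text{ nats} \approx 9.23 \text{ bits}.
\]

Since the uniform distribution maximises entropy over a fixed finite vocabulary, every distribution the model emits has entropy between 0 and roughly 6.40 nats.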
Why this phase is between Phase 04 and Phase 06
Phase 04 (calculus + optimisation) gave us the engine — gradient descent. Phase 05 gives us the objective — what the engine optimises. Without Phase 05, the choice of cross_entropy_loss would feel arbitrary. With Phase 05, the choice is forced: it is the unique loss compatible with MLE on a categorical distribution.
Phase 06 (Python for AI engineering) is implementation hygiene. Without Phase 05, the implementations would likely be buggy in subtle ways (forgetting `log_softmax`, allowing \(\log 0\), computing \(p \log p\) at \(p = 0\)).
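The last two pitfalls can be sketched in a few lines of plain Python. The helper names `entropy` and `cross_entropy` are hypothetical, and the guards below are one reasonable convention (treating \(0 \log 0\) as 0 and returning infinity when the model assigns zero probability to an outcome that occurs), not a prescribed implementation.

```python
import math

def entropy(p):
    """Shannon entropy in nats, using the convention 0 * log(0) = 0."""
    # Skipping p_i == 0 avoids evaluating math.log(0.0), which would raise.
    return sum(-pi * math.log(pi) for pi in p if pi > 0.0)

def cross_entropy(p, q):
    """H(p, q) in nats; infinite if q gives zero mass where p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # 0 * log(q_i) contributes nothing
        if qi == 0.0:
            return math.inf   # the model ruled out something that happens
        total -= pi * math.log(qi)
    return total

one_hot = [0.0, 1.0, 0.0]       # a training target: all mass on one token
model = [0.1, 0.8, 0.1]         # illustrative model output

print(entropy(one_hot))               # a one-hot target has zero entropy
print(cross_entropy(one_hot, model))  # equals -log(0.8)
```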
The five-line argument we will reconstruct
By the end of this phase, Borja will reproduce this argument from scratch:
- MLE. Given i.i.d. observations \((x^{(n)}, y^{(n)})\) from the true conditional \(p^*(y \mid x)\), the maximum-likelihood estimator is \(\theta^{\text{MLE}} = \arg\max_\theta \sum_n \log q_\theta(y^{(n)} \mid x^{(n)})\).
- Empirical CE. That sum equals \(-N \cdot \widehat{H}(p_{\text{emp}}, q_\theta)\) where \(p_{\text{emp}}\) is the empirical distribution.
- Population CE. As \(N \to \infty\), \(\widehat{H}(p_{\text{emp}}, q_\theta) \to \mathbb{E}_x[H(p^*(\cdot \mid x), q_\theta(\cdot \mid x))]\).
- CE = entropy + KL. \(H(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q)\) by direct algebra.
- Conclusion. Since \(H(p^*)\) doesn't depend on \(\theta\), minimising CE over \(\theta\) ≡ minimising KL ≡ MLE.
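The steps above can be checked numerically on a toy case. The sketch below makes several simplifying assumptions: it drops the conditioning on \(x\) (a single fixed prefix), uses a made-up sample of ten tokens, and picks an arbitrary candidate distribution `q`. It verifies that the log-likelihood equals \(-N\) times the empirical cross-entropy, that cross-entropy splits into entropy plus KL, and that setting `q` equal to the empirical distribution gives the highest log-likelihood of the two candidates.

```python
import math
from collections import Counter

# Hypothetical toy sample of next tokens for one fixed prefix (not corpus data).
observations = ["worked"] * 7 + ["work"] * 2 + ["works"] * 1
N = len(observations)
counts = Counter(observations)
p_emp = {tok: counts[tok] / N for tok in counts}

# An arbitrary candidate model distribution over the same three tokens.
q = {"worked": 0.6, "work": 0.25, "works": 0.15}

def entropy(p):
    return sum(-p[t] * math.log(p[t]) for t in p if p[t] > 0.0)

def cross_entropy(p, q):
    return sum(-p[t] * math.log(q[t]) for t in p if p[t] > 0.0)

def kl(p, q):
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0.0)

def log_likelihood(q):
    return sum(math.log(q[y]) for y in observations)

# Step 2: the log-likelihood is -N times the empirical cross-entropy.
print(log_likelihood(q), -N * cross_entropy(p_emp, q))

# Step 4: cross-entropy = entropy + KL.
print(cross_entropy(p_emp, q), entropy(p_emp) + kl(p_emp, q))

# Step 5: KL >= 0 with equality at q = p_emp, so p_emp beats q on likelihood.
print(log_likelihood(p_emp), log_likelihood(q))
```

In this unconditional toy setting the maximum-likelihood distribution is simply the empirical frequencies; with a parametric \(q_\theta\) the same argument applies, but the minimum of the KL term is restricted to what the model family can represent.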
This is the entire justification for cross_entropy_loss. It is five lines and it deserves to be understood deeply.
What follows
01-discrete-distributions.md sets up the formalism. 02-entropy-and-kl.md introduces uncertainty and divergence. 03-cross-entropy-and-mle.md is the centrepiece — the chain of equalities above. 04-log-sum-exp-and-stability.md covers the numerics.
What this file does NOT cover
- Bayesian inference and posteriors (deferred indefinitely; not in §A13 scope).
- Continuous distributions, change-of-variables, normalising flows (out of scope).
- The connection to coding theory and Shannon source coding (interesting, but optional — referenced as a footnote in §02).