00 — Motivation: why probability for a verb-grammar model

The picture

A trained mini-GPT reads "Yesterday I" and emits a probability distribution over its full output vocabulary. For the §A13 verb-grammar corpus that vocabulary has ≈600 forms: 20 verbs × 5 tenses × 3 persons = 300 English forms, their 300 paired Spanish translations, and a small set of connectors. The model is not certain that the next token is "worked". It assigns probability 0.78 to "worked", 0.09 to "work", 0.05 to "work-ing" (deferred but informative), and a thin tail across everything else.
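To make the picture concrete, here is a minimal sketch that builds such a distribution in plain Python. The three head probabilities come from the text; the exact vocabulary size of 600, the uniform spread of the remaining mass over the other forms, and the names `head`, `tail_mass` and `tail_prob` are assumptions made only for illustration.

```python
import math

V = 600  # assumed exact vocabulary size (the text says "about 600")

# Head probabilities quoted in the text for the prefix "Yesterday I".
head = {"worked": 0.78, "work": 0.09, "work-ing": 0.05}

# Assumption: the remaining mass is spread uniformly over all other forms.
tail_mass = 1.0 - sum(head.values())
tail_prob = tail_mass / (V - len(head))

# A valid distribution must sum to 1 over the whole vocabulary.
total = sum(head.values()) + tail_prob * (V - len(head))

# Surprisal (negative log-probability) of the observed token, in nats.
nll_worked = -math.log(head["worked"])

print(f"tail mass = {tail_mass:.2f}, per-form tail prob = {tail_prob:.6f}")
print(f"total = {total:.6f}, -log p(worked) = {nll_worked:.4f} nats")
```

The last quantity is the per-token loss the model would incur if "worked" is indeed the next token; the rest of the phase explains why that is the right number to minimise.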

This is the central object of supervised language modelling: a conditional distribution over the vocabulary given the prefix. Three foundational questions follow:

  1. How do we measure how "good" that distribution is? Spoiler: cross-entropy — but we have to earn the right to that statement.
  2. What does it mean for two distributions to be close or far? Answer: KL divergence — but again, derived from a definition, not asserted.
  3. Why is nn.CrossEntropyLoss numerically delicate? Answer: with a large vocabulary, most probabilities are tiny; exponentiating logits can underflow them to exactly zero, and the log of zero is \(-\infty\), unless we work in log-space throughout.
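The third question can be previewed with a small experiment. The sketch below contrasts a naive "exponentiate, normalise, then log" computation with the log-sum-exp formulation, using only the standard library. The function names and the extreme logits are hypothetical, chosen to trigger underflow; this is not a description of how PyTorch implements nn.CrossEntropyLoss.

```python
import math

def naive_log_softmax(logits):
    # Exponentiate first: very negative logits underflow to exactly 0.0.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    out = []
    for e in exps:
        p = e / total if total > 0.0 else 0.0  # all mass was lost to underflow
        out.append(math.log(p) if p > 0.0 else float("-inf"))
    return out

def stable_log_softmax(logits):
    # Log-sum-exp trick: subtract the max so the largest exponent is exp(0) = 1.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

logits = [-1000.0, -1001.0, -1002.0]  # hypothetical, deliberately extreme
naive = naive_log_softmax(logits)
stable = stable_log_softmax(logits)

print("naive :", naive)
print("stable:", [round(v, 4) for v in stable])
```

The two functions agree mathematically; only the stable one survives floating-point arithmetic on these inputs, because it never forms a probability that is too small to represent.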

Phase 05 builds the toolkit to answer all three rigorously.

The §A13 setting in numbers

Concretely, when we say "the model outputs a probability over the verb-form vocabulary", here is the exact thing:

| Quantity | Symbol | Value (Phase 12+) |
| --- | --- | --- |
| Vocab size | \(V\) | ≈ 600 |
| Verbs | \(N_{\text{verb}}\) | 20 |
| Tenses | \(N_{\text{tense}}\) | 5 |
| Persons | \(N_{\text{person}}\) | 3 |
| Forms | \(N_{\text{verb}} \times N_{\text{tense}} \times N_{\text{person}}\) | 300 |
| + Spanish translations (paired 1-to-1) | | 300 |
| + a small set of connectors (I, you, he, ...) | | ≈ 20–30 |

The full vocabulary is small: about 600 forms. This means the distributions we will manipulate are small and manageable by hand. It is a good size for understanding the mathematics without getting lost in scale.

The smallness of \(V\) is a feature: we can plot full distributions, compute entropies by hand, and verify identities exactly. By Phase 11 (BPE) we may segment going → go + ing and grow \(V\) modestly; we'll revisit calibration analysis at that point.
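As an illustration of "entropies by hand", the following sketch compares the entropy of a uniform distribution over the vocabulary with that of a peaked distribution like the one in the opening example. The exact size \(V = 600\), the uniform tail, and the helper name `entropy_bits` are assumptions for illustration only.

```python
import math

def entropy_bits(probs):
    # Shannon entropy in bits; terms with p == 0 contribute 0 by convention.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

V = 600  # assumed exact vocabulary size

# Maximum-uncertainty case: every form equally likely.
uniform = [1.0 / V] * V

# Peaked case: head probabilities from the text, remaining 0.08 spread uniformly.
peaked = [0.78, 0.09, 0.05] + [0.08 / (V - 3)] * (V - 3)

h_uniform = entropy_bits(uniform)  # equals log2(V), the upper bound
h_peaked = entropy_bits(peaked)

print(f"uniform: {h_uniform:.3f} bits (log2 V = {math.log2(V):.3f})")
print(f"peaked : {h_peaked:.3f} bits")
```

With only a few hundred terms, every summand can be inspected directly, which is what makes exact verification of identities feasible at this scale.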

Why this phase is between Phase 04 and Phase 06

Phase 04 (calculus + optimisation) gave us the engine — gradient descent. Phase 05 gives us the objective — what the engine optimises. Without Phase 05, the choice of cross_entropy_loss would feel arbitrary. With Phase 05, the choice is forced: it is the unique loss compatible with MLE on a categorical distribution.

Phase 06 (Python for AI engineering) is implementation hygiene. Without Phase 05, the implementations would be plausibly buggy in subtle ways (forgetting log_softmax, allowing \(\log 0\), computing \(p \log p\) at \(p = 0\)).
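Two of the pitfalls named above, \(\log 0\) and \(p \log p\) at \(p = 0\), can be reproduced in a few lines. This is a minimal sketch with hypothetical function names; it uses the standard library, where `math.log(0.0)` raises `ValueError`. Array libraries typically return `-inf` or `nan` instead of raising, which makes the same bug quieter.

```python
import math

def cross_entropy_naive(p, q):
    # Evaluates log(q_i) even when p_i == 0, so a zero in q breaks it.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def cross_entropy_safe(p, q):
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue         # 0 * log(q) is taken to be 0 by convention
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total -= pi * math.log(qi)
    return total

p = [1.0, 0.0, 0.0]  # one-hot target
q = [0.5, 0.5, 0.0]  # model assigns exactly zero to the third form

try:
    cross_entropy_naive(p, q)
    naive_failed = False
except ValueError:
    naive_failed = True  # math.log(0.0) is a domain error

print("naive failed:", naive_failed)
print("safe CE     :", cross_entropy_safe(p, q))
```

The safe version encodes the two conventions explicitly: terms with zero target mass are skipped, and a zero model probability under non-zero target mass yields an infinite loss rather than an exception.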

The five-line argument we will reconstruct

By the end of this phase, Borja will reproduce this argument from scratch:

  1. MLE. Given i.i.d. observations \((x^{(n)}, y^{(n)})\) from the true conditional \(p^*(y \mid x)\), the maximum-likelihood estimator is \(\theta^{\text{MLE}} = \arg\max_\theta \sum_n \log q_\theta(y^{(n)} \mid x^{(n)})\).
  2. Empirical CE. That sum equals \(-N \cdot \widehat{H}(p_{\text{emp}}, q_\theta)\) where \(p_{\text{emp}}\) is the empirical distribution.
  3. Population CE. As \(N \to \infty\), \(\widehat{H}(p_{\text{emp}}, q_\theta) \to \mathbb{E}_x[H(p^*(\cdot \mid x), q_\theta(\cdot \mid x))]\).
  4. CE = entropy + KL. \(H(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q)\) by direct algebra.
  5. Conclusion. Since \(H(p^*)\) doesn't depend on \(\theta\), minimising CE over \(\theta\) ≡ minimising KL ≡ MLE.
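Steps 2 and 4 are where the algebra actually happens, so a worked version may help. The following is a sketch under the assumption that \(p_{\text{emp}}\) is the empirical joint distribution placing mass \(1/N\) on each observed pair, and that \(\widehat{H}\) averages \(-\log q_\theta(y \mid x)\) under it; the indicator notation \(\mathbb{1}[\cdot]\) is introduced here for illustration.

\[
p_{\text{emp}}(x, y) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\big[(x, y) = (x^{(n)}, y^{(n)})\big]
\quad\Longrightarrow\quad
\sum_{n=1}^{N} \log q_\theta(y^{(n)} \mid x^{(n)})
= N \sum_{x, y} p_{\text{emp}}(x, y) \log q_\theta(y \mid x)
= -N \cdot \widehat{H}(p_{\text{emp}}, q_\theta)
\]

\[
H(p, q)
= -\sum_{y} p(y) \log q(y)
= -\sum_{y} p(y) \log p(y) + \sum_{y} p(y) \log \frac{p(y)}{q(y)}
= H(p) + D_{\text{KL}}(p \,\|\, q)
\]

The second derivation only adds and subtracts \(\sum_y p(y) \log p(y)\); terms with \(p(y) = 0\) are taken to contribute zero.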

This is the entire justification for cross_entropy_loss. It is five lines and it deserves to be understood deeply.
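The identities in steps 2, 4 and 5 can be checked numerically on a toy example. The sketch below uses a single fixed prefix, a three-form vocabulary, ten hypothetical observations, and an assumed model distribution `q`; none of these numbers come from the corpus.

```python
import math
from collections import Counter

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical next-token observations after one fixed prefix.
observations = ["worked"] * 7 + ["work"] * 2 + ["working"] * 1
vocab = ["worked", "work", "working"]
q = {"worked": 0.78, "work": 0.15, "working": 0.07}  # assumed model

N = len(observations)
counts = Counter(observations)
p_emp = [counts[t] / N for t in vocab]  # empirical distribution
q_vec = [q[t] for t in vocab]

log_lik = sum(math.log(q[y]) for y in observations)
ce = cross_entropy(p_emp, q_vec)

# Step 2: the log-likelihood equals -N times the empirical cross-entropy.
print(f"log-lik = {log_lik:.6f}, -N * CE = {-N * ce:.6f}")
# Step 4: cross-entropy splits into entropy plus KL divergence.
print(f"CE = {ce:.6f}, H + KL = {entropy(p_emp) + kl(p_emp, q_vec):.6f}")
# Step 5: CE is smallest when q matches p_emp, where KL is zero.
print(f"CE at q = p_emp: {cross_entropy(p_emp, p_emp):.6f}")
```

Because the entropy term is fixed by the data, any change to `q` moves the cross-entropy and the KL divergence by exactly the same amount, which is the content of step 5.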

What follows

01-discrete-distributions.md sets up the formalism. 02-entropy-and-kl.md introduces uncertainty and divergence. 03-cross-entropy-and-mle.md is the centrepiece — the chain of equalities above. 04-log-sum-exp-and-stability.md covers the numerics.

What this file does NOT cover

  • Bayesian inference and posteriors (deferred indefinitely; not in §A13 scope).
  • Continuous distributions, change-of-variables, normalising flows (out of scope).
  • The connection to coding theory and Shannon source coding (interesting, but optional — referenced as a footnote in §02).

Next: 01-discrete-distributions.md