01 - Discrete distributions over conjugations
We start with the basics: what is a probability distribution over the set of conjugations, and how are such distributions summed, conditioned, and marginalized?
Sample spaces, events, and the categorical distribution
A sample space \(\Omega\) is the set of possible outcomes. In our setting, the most useful sample space is the set of verb forms:
\[\Omega = \{w_1, w_2, \ldots, w_V\}\]
For the §A13 corpus, \(V \approx 600\), with \(w_i\) being a specific verb form (e.g., \(w_{37}\) might be worked, and \(w_{38}\) might be its Spanish counterpart trabajó).
A probability distribution \(p\) on \(\Omega\) is a function \(p : \Omega \to [0, 1]\) with \(\sum_i p(w_i) = 1\). The categorical distribution is exactly this: a discrete distribution on a finite set.
Throughout this phase, "distribution" means categorical unless we say otherwise. Vectors in \(\Delta^{V-1}\) (the \((V-1)\)-simplex) are the natural objects.
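As a minimal sketch of the definition above, a categorical distribution can be held as a plain mapping from outcomes to probabilities and checked against the two simplex conditions. The helper name `is_distribution` and the three-form example are illustrative assumptions, not part of any §A13 tooling.

```python
import math

def is_distribution(p, tol=1e-9):
    """Return True if p (a dict outcome -> probability) lies on the simplex."""
    values = list(p.values())
    in_range = all(0.0 <= v <= 1.0 for v in values)
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return in_range and sums_to_one

# Hypothetical 3-form distribution (V = 3), chosen only for illustration.
p = {"work": 0.25, "works": 0.25, "worked": 0.5}
# Sums to 1 but has a negative entry, so it is not a distribution.
not_p = {"work": 0.7, "works": 0.7, "worked": -0.4}

print(is_distribution(p))      # True
print(is_distribution(not_p))  # False
```

Both conditions matter: the second example passes the sum check and fails only on the range check.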
Joint, marginal, conditional
Let \(X\) be the random variable for the prefix (e.g., the model's input context) and \(Y\) for the next-token output (one of the \(V\) forms). Then:
- Joint: \(p(X = x, Y = y)\) — the probability of a particular (prefix, next-token) pair.
- Marginal: \(p(Y = y) = \sum_x p(X = x, Y = y)\) — the probability of \(Y = y\) regardless of \(x\).
- Conditional: \(p(Y = y \mid X = x) = \dfrac{p(X = x, Y = y)}{p(X = x)}\), defined when \(p(X = x) > 0\).
The model's output is exactly the conditional: given the prefix, what is the distribution over next tokens? Training adjusts the model parameters \(\theta\) so that the model's conditional \(q_\theta(y \mid x)\) approaches the true conditional \(p^*(y \mid x)\).
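The three definitions above can be sketched directly on a small joint table. The prefixes, tokens, and probabilities below are invented for illustration, and the function names `marginal` and `conditional_y_given_x` are hypothetical helpers rather than anything from the corpus code.

```python
from collections import defaultdict

# Hypothetical joint p(X = x, Y = y); the numbers are invented and sum to 1.
joint = {
    ("he", "works"): 0.30,
    ("he", "worked"): 0.20,
    ("they", "work"): 0.15,
    ("they", "worked"): 0.35,
}

def marginal(joint, axis):
    """Sum the joint over the other variable (axis 0 = X, axis 1 = Y)."""
    out = defaultdict(float)
    for pair, prob in joint.items():
        out[pair[axis]] += prob
    return dict(out)

def conditional_y_given_x(joint, x):
    """p(Y = y | X = x) = p(x, y) / p(x), defined only when p(x) > 0."""
    p_x = marginal(joint, 0).get(x, 0.0)
    if p_x <= 0.0:
        raise ValueError(f"p(X = {x!r}) is zero, so the conditional is undefined")
    return {y: prob / p_x for (xi, y), prob in joint.items() if xi == x}

p_y = marginal(joint, 1)
p_y_given_he = conditional_y_given_x(joint, "he")
print(p_y)
print(p_y_given_he)
```

Dividing by \(p(X = x)\) renormalises one row of the joint so that it sums to 1, which is why the conditional is itself a valid distribution over \(Y\).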
Worked example: a 2-tense, 2-person toy
To make this concrete, consider a toy corpus restricted to {work} as the only verb, with \(\text{tense} \in \{\text{present}, \text{past}\}\) and \(\text{person} \in \{\text{I}, \text{he}\}\):
| Form | Tense | Person |
|---|---|---|
| work | present | I |
| works | present | he |
| worked | past | I |
| worked | past | he |
Note the surface ambiguity: worked appears in two cells. If the corpus is sampled uniformly across (tense, person), the marginal \(p(\text{form} = \text{worked}) = 1/2\), while \(p(\text{form} = \text{works}) = 1/4\).
Conditional on person: \(p(\text{form} = \text{works} \mid \text{person} = \text{he}) = 1/2\). Tiny example, but it illustrates that the conditional reshapes the marginal in a meaningful way.
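The numbers in this toy can be reproduced by enumerating the four (tense, person) cells. This is a sketch that assumes the uniform sampling stated above; the variable names are illustrative.

```python
from fractions import Fraction

# The four (tense, person) cells of the toy corpus and their surface forms.
cells = {
    ("present", "I"): "work",
    ("present", "he"): "works",
    ("past", "I"): "worked",
    ("past", "he"): "worked",
}

# Uniform sampling over cells: each cell has probability 1/4.
p_cell = {cell: Fraction(1, len(cells)) for cell in cells}

# Marginal over forms: add up the cells that share a surface form.
p_form = {}
for cell, form in cells.items():
    p_form[form] = p_form.get(form, Fraction(0)) + p_cell[cell]

# Conditional on person = he: keep the matching cells and renormalise.
p_he = sum(p for (tense, person), p in p_cell.items() if person == "he")
p_form_given_he = {}
for (tense, person), form in cells.items():
    if person == "he":
        share = p_cell[(tense, person)] / p_he
        p_form_given_he[form] = p_form_given_he.get(form, Fraction(0)) + share

print(p_form["worked"], p_form["works"])  # 1/2 1/4
print(p_form_given_he["works"])           # 1/2
```

Using `Fraction` keeps the arithmetic exact, so the printed values match the hand calculation rather than a floating-point approximation.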
Independence
\(Y\) is independent of \(X\) (written \(Y \perp X\)) iff \(p(Y = y \mid X = x) = p(Y = y)\) for all \(y\) and all \(x\) with \(p(X = x) > 0\). In language modelling, independence is the enemy: if the next token were independent of the prefix, the context would carry no information and no model could do better than the marginal \(p(Y)\). The whole point of the model is to exploit dependencies.
We measure these dependencies later via mutual information (02-entropy-and-kl.md).
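Before mutual information is available, independence can be tested directly through the equivalent product form \(p(a, b) = p(a)\,p(b)\). The sketch below applies it to the 2-tense, 2-person toy under uniform sampling; `is_independent` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """Check p(a, b) == p(a) * p(b) for every pair, using exact fractions."""
    p_a, p_b = {}, {}
    for (a, b), prob in joint.items():
        p_a[a] = p_a.get(a, Fraction(0)) + prob
        p_b[b] = p_b.get(b, Fraction(0)) + prob
    return all(
        joint.get((a, b), Fraction(0)) == p_a[a] * p_b[b]
        for a, b in product(p_a, p_b)
    )

quarter = Fraction(1, 4)

# Joint over (tense, person) in the toy corpus: uniform over the four cells.
tense_person = {
    ("present", "I"): quarter, ("present", "he"): quarter,
    ("past", "I"): quarter, ("past", "he"): quarter,
}

# Joint over (person, form) in the same corpus.
person_form = {
    ("I", "work"): quarter, ("he", "works"): quarter,
    ("I", "worked"): quarter, ("he", "worked"): quarter,
}

print(is_independent(tense_person))  # True
print(is_independent(person_form))   # False
```

Tense and person are independent by construction, while form depends on person: knowing the person is "he" rules out work entirely.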
Expectation and variance
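For a real-valued function \(f : \Omega \to \mathbb{R}\), the expectation and variance under \(p\) are:
\[\mathbb{E}_p[f] = \sum_i p(w_i)\, f(w_i), \qquad \operatorname{Var}_p[f] = \mathbb{E}_p\big[(f - \mathbb{E}_p[f])^2\big] = \mathbb{E}_p[f^2] - (\mathbb{E}_p[f])^2\]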
Loss functions are expectations: the cross-entropy loss is \(\mathbb{E}_{(x,y) \sim p^*}[-\log q_\theta(y \mid x)]\). The mini-batch average is its sample estimator.
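To see the loss as an expectation, the sketch below evaluates \(\mathbb{E}_{y \sim p^*}[-\log q_\theta(y)]\) for a single fixed prefix. Both distributions are invented for illustration; the full loss would also average over prefixes \(x\).

```python
import math

# Hypothetical true conditional p*(y | x) and model q_theta(y | x) for one prefix.
p_star = {"work": 0.25, "works": 0.25, "worked": 0.5}
q_model = {"work": 0.2, "works": 0.3, "worked": 0.5}

def cross_entropy(p, q):
    """E_{y ~ p}[-log q(y)] in nats; outcomes with p(y) = 0 contribute nothing."""
    return sum(-p_y * math.log(q[y]) for y, p_y in p.items() if p_y > 0)

ce_model = cross_entropy(p_star, q_model)
ce_perfect = cross_entropy(p_star, p_star)  # value when q matches p* exactly

print(ce_model)
print(ce_perfect)
```

The second value is the lowest the cross-entropy can reach for this \(p^*\); any mismatch between \(q_\theta\) and \(p^*\) pushes the first value above it.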
Worked example: expected length of a verb form
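Define \(f(w_i) = \text{len}(w_i)\), the character length of the \(i\)-th form. On our 4-form toy, sampled uniformly across the four (tense, person) cells:
\[\mathbb{E}[f] = \tfrac{1}{4}(4 + 5 + 6 + 6) = \tfrac{21}{4} = 5.25\]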
Trivial, but the muscle memory of expectation as a sum-of-(prob × value) is what we need.
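The same sum-of-(prob × value) can be written out in a few lines. This sketch assumes the toy corpus is sampled uniformly over its four (tense, person) cells, as in the earlier worked example; the variance line is added only to show that it follows the same pattern.

```python
from fractions import Fraction

# One surface form per (tense, person) cell of the toy corpus.
forms_by_cell = ["work", "works", "worked", "worked"]
p_cell = Fraction(1, len(forms_by_cell))  # uniform over the four cells

# Expectation: sum of probability times value.
expected_len = sum(p_cell * len(form) for form in forms_by_cell)

# Variance: expectation of the squared deviation from the mean.
variance_len = sum(p_cell * (len(form) - expected_len) ** 2 for form in forms_by_cell)

print(expected_len, float(expected_len))  # 21/4 5.25
print(variance_len)                       # 11/16
```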
Mini-batches and unbiased estimators
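In training, the true expectation \(L(\theta) = \mathbb{E}_{(x,y) \sim p^*}[L(\theta; x, y)]\) is what we want to minimise. We don't have \(p^*\); we have \(N\) samples, from which each step draws a mini-batch of size \(B\). The sample mean
\[\widehat{L}(\theta) = \frac{1}{B} \sum_{b=1}^{B} L(\theta; x^{(b)}, y^{(b)})\]
is an unbiased estimator of the true expectation: \(\mathbb{E}[\widehat{L}(\theta)] = L(\theta)\). For independently drawn examples, its variance scales as \(1/B\), which is the foundation under "bigger batches give cleaner gradients".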
We'll use this directly in Phase 18 (training loop) when we average the per-example cross-entropy across the batch.
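The unbiasedness and the \(1/B\) variance scaling can be checked with a small simulation. The sketch below is not the Phase 18 training loop: the per-example losses are synthetic draws from an exponential distribution, and the pool size, batch sizes, and trial count are arbitrary choices.

```python
import random
import statistics

random.seed(0)

# Hypothetical per-example losses: a large pool standing in for draws from p*.
population = [random.expovariate(1.0) for _ in range(10_000)]
true_mean = statistics.fmean(population)

def batch_mean(batch_size):
    """Mean loss of one mini-batch drawn independently (with replacement)."""
    return statistics.fmean(random.choices(population, k=batch_size))

def estimator_stats(batch_size, trials=2_000):
    """Mean and variance of the batch-mean estimator over many batches."""
    means = [batch_mean(batch_size) for _ in range(trials)]
    return statistics.fmean(means), statistics.variance(means)

mean_4, var_4 = estimator_stats(4)
mean_64, var_64 = estimator_stats(64)

print(f"true mean     : {true_mean:.4f}")
print(f"B=4   estimate: {mean_4:.4f}  variance: {var_4:.5f}")
print(f"B=64  estimate: {mean_64:.4f}  variance: {var_64:.5f}")
print(f"variance ratio: {var_4 / var_64:.1f}  (1/B scaling predicts about 16)")
```

Both batch sizes centre on the same mean, which is the unbiasedness claim; only the spread around it shrinks as \(B\) grows.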
Empirical distribution
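Given \(N\) observations \(\{y^{(1)}, \ldots, y^{(N)}\}\) drawn from a categorical, the empirical distribution assigns each outcome \(w\) the fraction of observations equal to it:
\[\widehat{p}_N(w) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\big[y^{(n)} = w\big]\]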
Properties: it's a valid distribution; it's the MLE for \(p^*\) under the categorical model (Phase 03 of this set of theory files demonstrates this); it converges to \(p^*\) as \(N \to \infty\) (by the law of large numbers).
The empirical distribution is the bridge between "we have data" and "we have a distribution we can compute KL against".
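Computing an empirical distribution amounts to counting and dividing by \(N\). A minimal sketch follows, with an invented list of eight observations and a hypothetical helper name `empirical_distribution`.

```python
from collections import Counter
from fractions import Fraction

def empirical_distribution(observations):
    """Map each observed outcome to count / N, using exact fractions."""
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    return {outcome: Fraction(count, n) for outcome, count in counts.items()}

# Hypothetical sample of N = 8 verb forms.
observations = ["worked", "work", "worked", "works",
                "worked", "work", "worked", "works"]
p_hat = empirical_distribution(observations)

print(p_hat["worked"], p_hat["work"], p_hat["works"])  # 1/2 1/4 1/4
print(sum(p_hat.values()))                             # 1
```

Forms that never appear in the sample get no entry, which corresponds to probability zero; that is worth keeping in mind before taking logarithms of an empirical distribution.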
Numerical practice: log-probabilities
In practice, we almost never store \(p(y)\) directly when \(V\) is large or \(p\) is sharp. Instead we store \(\log p(y)\):
- Sums become log-sum-exp (a later theory file).
- Products become sums (which don't underflow as easily).
- The cross-entropy and KL formulas read more naturally on log-probabilities.
This is foreshadowing — full treatment in 04-log-sum-exp-and-stability.md.
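A quick way to see why products are replaced by sums of logs is to multiply many small probabilities in double precision. The sequence length and per-token probability below are arbitrary illustrative values.

```python
import math

# Hypothetical sequence: 1000 tokens, each assigned probability 1e-5.
probs = [1e-5] * 1000

# Direct product: the true value is 1e-5000, far below the smallest double.
naive_product = math.prod(probs)

# Sum of log-probabilities: the same quantity on a log scale, easily representable.
log_prob = sum(math.log(p) for p in probs)

print(naive_product)  # 0.0
print(log_prob)
```

Once the product has underflowed to zero, every sequence of this length looks equally impossible; the log-space sum keeps them comparable.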
What this file does NOT cover
- Continuous distributions (next phase if ever).
- Exchangeability, de Finetti, and Bayesian foundations (out of scope).
- Sufficiency, ancillary statistics, exponential families (worth knowing, but not for the §A13 task).
Next: 02-entropy-and-kl.md