01 — Discrete distributions over conjugations

We start with the basics: what is a probability distribution over the set of conjugations? How such distributions are summed, conditioned, and marginalized.

Sample spaces, events, and the categorical distribution

A sample space \(\Omega\) is the set of possible outcomes. In our setting, the most useful sample space is:

\[\Omega = \{\text{the } V \text{ verb forms in the corpus}\} = \{w_1, w_2, \ldots, w_V\}.\]

For the §A13 corpus, \(V \approx 600\), with each \(w_i\) being a specific verb form (e.g., \(w_{37}\) might be worked and \(w_{38}\) its Spanish counterpart trabajó).

A probability distribution \(p\) on \(\Omega\) is a function \(p : \Omega \to [0, 1]\) with \(\sum_i p(w_i) = 1\). The categorical distribution is exactly this: a discrete distribution on a finite set.

Throughout this phase, "distribution" means categorical unless we say otherwise. Vectors in \(\Delta^{V-1}\) (the \((V-1)\)-simplex) are the natural objects.
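As a minimal sketch of what "a vector in the simplex" means in code, the check below assumes a distribution is stored as a plain Python list of floats. The helper name `is_categorical` and the tolerance are illustrative choices, not part of any corpus tooling.

```python
import math

def is_categorical(p, tol=1e-9):
    """Return True if p is a point on the probability simplex."""
    # Every entry must be a probability in [0, 1] ...
    if any(pi < 0.0 or pi > 1.0 for pi in p):
        return False
    # ... and the entries must sum to 1, up to floating-point error.
    return math.isclose(math.fsum(p), 1.0, abs_tol=tol)

V = 600  # approximate vocabulary size quoted above
uniform = [1.0 / V] * V

print(is_categorical(uniform))            # uniform distribution over V forms
print(is_categorical([0.5, 0.6, -0.1]))   # sums to 1 but has a negative entry
```

The tolerance matters because sums of floats rarely hit 1 exactly; `math.fsum` keeps the accumulated rounding error small.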

Joint, marginal, conditional

Let \(X\) be the random variable for the prefix (e.g., the model's input context) and \(Y\) for the next-token output (one of the \(V\) forms). Then:

Joint: \(p(X = x, Y = y)\) — the probability of a particular (prefix, next-token) pair.
Marginal: \(p(Y = y) = \sum_x p(X = x, Y = y)\) — the probability of \(Y = y\) regardless of \(x\).
Conditional: \(p(Y = y \mid X = x) = \dfrac{p(X = x, Y = y)}{p(X = x)}\), defined when \(p(X = x) > 0\).

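The model's output is exactly this conditional: given the prefix, what is the distribution over next tokens? Writing \(q_\theta(y \mid x)\) for the model's conditional with parameters \(\theta\) and \(p^*(y \mid x)\) for the true data conditional, training adjusts \(\theta\) so that \(q_\theta(y \mid x) \to p^*(y \mid x)\).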
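The three definitions above can be sketched on a small joint table. The prefixes, tokens, and probabilities below are hypothetical values chosen for illustration, and the function names are not taken from any library. Exact fractions are used so the results can be compared without rounding.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint p(X = x, Y = y) over two prefixes and three next tokens.
joint = {
    ("she", "works"): Fraction(3, 8),
    ("she", "worked"): Fraction(1, 8),
    ("they", "work"): Fraction(2, 8),
    ("they", "worked"): Fraction(2, 8),
}

def marginal_y(joint):
    """p(Y = y) = sum over x of p(x, y)."""
    out = defaultdict(Fraction)  # Fraction() is 0
    for (_, y), pxy in joint.items():
        out[y] += pxy
    return dict(out)

def conditional_y_given_x(joint, x):
    """p(Y = y | X = x) = p(x, y) / p(x), defined only when p(x) > 0."""
    px = sum(pxy for (xi, _), pxy in joint.items() if xi == x)
    if px == 0:
        raise ValueError(f"p(X = {x!r}) is zero; the conditional is undefined")
    return {y: pxy / px for (xi, y), pxy in joint.items() if xi == x}

print(marginal_y(joint))
print(conditional_y_given_x(joint, "she"))
```

Note that the conditional renormalises one "row" of the joint, so it always sums to 1 whenever it is defined.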

Worked example: a 2-tense, 2-person toy

To make this concrete, consider a toy corpus restricted to {work} as the only verb, with \(\text{tense} \in \{\text{present}, \text{past}\}\) and \(\text{person} \in \{\text{I}, \text{he}\}\):

Form	Tense	Person
`work`	present	I
`works`	present	he
`worked`	past	I
`worked`	past	he

Note the surface ambiguity: worked appears in two cells. If the corpus is sampled uniformly across (tense, person), the marginal \(p(\text{form} = \text{worked}) = 1/2\), while \(p(\text{form} = \text{works}) = 1/4\).

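Conditional on person: \(p(\text{form} = \text{works} \mid \text{person} = \text{he}) = 1/2\), because the two cells with person = he contain works and worked once each. The example is tiny, but it shows the conditional reshaping the marginal: knowing the person doubles the probability of works from \(1/4\) to \(1/2\).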
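The numbers in this toy can be reproduced by counting cells. This is a sketch that assumes the uniform sampling over (tense, person) stated above; `form_distribution` is an illustrative helper name.

```python
from collections import Counter
from fractions import Fraction

# The four (tense, person) cells of the toy table, each with probability 1/4.
cells = [
    ("work", "present", "I"),
    ("works", "present", "he"),
    ("worked", "past", "I"),
    ("worked", "past", "he"),
]

def form_distribution(cells, person=None):
    """Distribution over surface forms, optionally conditioned on person."""
    kept = [form for form, _, who in cells if person is None or who == person]
    counts = Counter(kept)
    return {form: Fraction(n, len(kept)) for form, n in counts.items()}

marginal = form_distribution(cells)
given_he = form_distribution(cells, person="he")
print(marginal["worked"], marginal["works"], given_he["works"])  # 1/2 1/4 1/2
```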

Independence

\(Y\) is independent of \(X\) (written \(Y \perp X\)) iff \(p(Y = y \mid X = x) = p(Y = y)\) for all \(y\) and all \(x\) with \(p(X = x) > 0\). In language modelling, independence is the enemy: if the next token were independent of the prefix, no model could do better than predicting from the marginal \(p(Y)\) alone. The whole point of the model is to exploit dependencies.
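Independence can be tested on a finite joint table through the equivalent factorisation \(p(x, y) = p(x)\,p(y)\). The sketch below applies that test to the toy table (with person as \(X\) and surface form as \(Y\)) and to a hypothetical product distribution for contrast; `is_independent` is an illustrative name.

```python
from fractions import Fraction

def is_independent(joint):
    """True iff p(x, y) == p(x) * p(y) for every pair (x, y)."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    px = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    py = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in xs for y in ys)

q = Fraction(1, 4)
# Toy table: X = person, Y = surface form, uniform over the four cells.
toy = {("I", "work"): q, ("he", "works"): q, ("I", "worked"): q, ("he", "worked"): q}
# Hypothetical joint built as a product of two uniform marginals.
product_joint = {(x, y): Fraction(1, 4) for x in ("I", "he") for y in ("a", "b")}

print(is_independent(toy))            # person changes the form distribution
print(is_independent(product_joint))  # factorises by construction
```

Exact equality is safe here only because the probabilities are `Fraction` values; with floats a tolerance would be needed.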

We measure these dependencies later via mutual information (02-entropy-and-kl.md).

Expectation and variance

For a real-valued function \(f : \Omega \to \mathbb{R}\):

\[\mathbb{E}_p[f] = \sum_i p(w_i) f(w_i), \qquad \mathrm{Var}_p[f] = \mathbb{E}_p[f^2] - (\mathbb{E}_p[f])^2.\]

Loss functions are expectations: the cross-entropy loss is \(\mathbb{E}_{(x,y) \sim p^*}[-\log q_\theta(y \mid x)]\). The mini-batch average is its sample estimator.
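To make "the loss is an expectation" concrete, here is a sketch that evaluates the cross-entropy as a probability-weighted sum. The joint `p_star` and the model table `q_model` are hypothetical numbers, and in real training \(p^*\) is not available in this form.

```python
import math

# Hypothetical true joint p*(x, y) and model conditional q_theta(y | x).
p_star = {("I", "work"): 0.25, ("I", "worked"): 0.25,
          ("he", "works"): 0.25, ("he", "worked"): 0.25}
q_model = {"I": {"work": 0.7, "worked": 0.3},
           "he": {"works": 0.5, "worked": 0.5}}
# The true conditional implied by p_star, for comparison.
q_true = {"I": {"work": 0.5, "worked": 0.5},
          "he": {"works": 0.5, "worked": 0.5}}

def cross_entropy(p_joint, q_cond):
    """E_{(x, y) ~ p*}[-log q(y | x)], in nats."""
    return sum(pxy * -math.log(q_cond[x][y]) for (x, y), pxy in p_joint.items())

print(cross_entropy(p_star, q_model))
print(cross_entropy(p_star, q_true))
```

The second value is the smaller of the two: the expected loss is minimised when the model's conditional matches the true one.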

Worked example: expected length of a verb form

Define \(f(w_i) = \text{len}(w_i)\), the character length of the \(i\)-th form. Over the four equally likely cells of the toy table (three distinct forms, with worked counted twice):

\[\mathbb{E}[f] = \tfrac{1}{4}(4 + 5 + 6 + 6) = 5.25.\]

Trivial, but the muscle memory of expectation as a sum-of-(prob × value) is what we need.
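The same sum can be written as a short sketch, with the variance formula from above added alongside it. The distribution over forms is the toy marginal (worked has probability 1/2 because it fills two cells); the function names are illustrative.

```python
from fractions import Fraction

# Marginal over surface forms in the toy corpus.
p = {"work": Fraction(1, 4), "works": Fraction(1, 4), "worked": Fraction(1, 2)}

def expectation(p, f):
    """E_p[f] = sum of p(w) * f(w) over all outcomes w."""
    return sum(p_w * f(w) for w, p_w in p.items())

def variance(p, f):
    """Var_p[f] = E_p[f^2] - (E_p[f])^2."""
    mean = expectation(p, f)
    return expectation(p, lambda w: f(w) ** 2) - mean ** 2

print(float(expectation(p, len)))  # 5.25
print(variance(p, len))            # 11/16
```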

Mini-batches and unbiased estimators

In training, the true expectation \(\mathbb{E}_{(x,y) \sim p^*}[L(\theta; x, y)]\) is what we want to minimise. We don't have \(p^*\); we have \(N\) samples, from which each step draws a mini-batch of \(B\) examples. The sample mean over that batch

\[\widehat{L}(\theta) = \frac{1}{B} \sum_{i \in \text{batch}} L(\theta; x_i, y_i)\]

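is an unbiased estimator of the true expectation when the batch examples are drawn i.i.d. from \(p^*\): \(\mathbb{E}[\widehat{L}(\theta)] = \mathbb{E}_{(x,y) \sim p^*}[L(\theta; x, y)]\). Under the same assumption its variance is the per-example loss variance divided by \(B\), so it scales as \(1/B\). This is the foundation under "bigger batches give cleaner gradients".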
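Both claims can be checked exactly on a tiny example by enumerating every possible batch instead of simulating. In this sketch, four hypothetical per-example losses form the whole "population", and uniform sampling with replacement stands in for i.i.d. draws from \(p^*\).

```python
from fractions import Fraction
from itertools import product

# Hypothetical per-example losses; each is equally likely to be drawn.
losses = [Fraction(1), Fraction(2), Fraction(4), Fraction(9)]

def mean_and_var(values):
    """Mean and (population) variance of equally likely values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def batch_mean_stats(losses, B):
    """Exact mean and variance of the batch average over all batches of size B."""
    # product(...) lists every ordered batch; all are equally likely.
    batch_means = [sum(batch) / B for batch in product(losses, repeat=B)]
    return mean_and_var(batch_means)

true_mean, true_var = mean_and_var(losses)
for B in (1, 2, 4):
    est_mean, est_var = batch_mean_stats(losses, B)
    print(B, est_mean == true_mean, est_var == true_var / B)
```

Each line prints `True True`: the batch average has the same mean as a single example, and its variance is the single-example variance divided by \(B\).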

We'll use this directly in Phase 18 (training loop) when we average the per-example cross-entropy across the batch.

Empirical distribution

Given \(N\) observations \(\{y^{(1)}, \ldots, y^{(N)}\}\) drawn from a categorical, the empirical distribution is:

\[\widehat{p}(y) = \frac{1}{N} \sum_n \mathbb{1}[y^{(n)} = y].\]

Properties: it's a valid distribution; it's the MLE for \(p^*\) under the categorical model (demonstrated in Phase 03 of these theory files); and, for i.i.d. observations, it converges to \(p^*\) as \(N \to \infty\) (by the law of large numbers).

The empirical distribution is the bridge between "we have data" and "we have a distribution we can compute KL against".
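The indicator sum above is just normalised counting. A minimal sketch, using a made-up sample of six observed forms and an illustrative function name:

```python
from collections import Counter
from fractions import Fraction

def empirical_distribution(observations):
    """p_hat(y) = (number of times y was observed) / N."""
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    return {y: Fraction(c, n) for y, c in Counter(observations).items()}

# Hypothetical sample of N = 6 observed verb forms.
sample = ["worked", "works", "worked", "work", "worked", "works"]
p_hat = empirical_distribution(sample)
print(p_hat)
```

Forms that never appear in the sample are simply absent from the dictionary, which corresponds to \(\widehat{p}(y) = 0\).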

Numerical practice: log-probabilities

In practice, we almost never store \(p(y)\) directly when \(V\) is large or \(p\) is sharp. Instead we store \(\log p(y)\):

Sums become log-sum-exp (see 04-log-sum-exp-and-stability.md).
Products become sums (which don't underflow as easily).
The cross-entropy and KL formulas read more naturally on log-probabilities.

This is foreshadowing — full treatment in 04-log-sum-exp-and-stability.md.
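A quick sketch of why products are the problem: multiplying many small probabilities underflows in double precision, while summing their logs does not. The sequence of 1000 probabilities of 0.01 is a hypothetical example.

```python
import math

probs = [0.01] * 1000  # hypothetical per-token probabilities of a long sequence

# Direct product: 0.01 ** 1000 = 1e-2000 is far below the smallest positive
# double (about 5e-324), so the running product underflows to exactly 0.0.
direct = 1.0
for p in probs:
    direct *= p

# Sum of logs: the same quantity, represented without underflow.
log_prob = math.fsum(math.log(p) for p in probs)

print(direct)    # 0.0
print(log_prob)  # roughly -4605.17
```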

What this file does NOT cover

Continuous distributions (deferred to a later phase, if covered at all).
Exchangeability, de Finetti, and Bayesian foundations (out of scope).
Sufficiency, ancillary statistics, exponential families (worth knowing, but not for the §A13 task).

Next: 02-entropy-and-kl.md