01 — Discrete distributions over conjugations

We start with the basics: what is a probability distribution over the set of conjugations, and how are such distributions summed, conditioned, and marginalized?

Sample spaces, events, and the categorical distribution

A sample space \(\Omega\) is the set of possible outcomes. In our setting, the most useful sample space is:

\[\Omega = \{\text{the } V \text{ verb forms in the corpus}\} = \{w_1, w_2, \ldots, w_V\}.\]

For the §A13 corpus, \(V \approx 600\), with each \(w_i\) being a specific verb form (e.g., \(w_{37}\) might be worked, and \(w_{38}\) might be its Spanish counterpart trabajó).

A probability distribution \(p\) on \(\Omega\) is a function \(p : \Omega \to [0, 1]\) with \(\sum_i p(w_i) = 1\). The categorical distribution is exactly this: a discrete distribution on a finite set.

Throughout this phase, "distribution" means categorical unless we say otherwise. Vectors in \(\Delta^{V-1}\) (the \((V-1)\)-simplex) are the natural objects.
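As a minimal sketch of the definition above, a categorical distribution can be stored as a mapping from forms to masses and checked for membership in the simplex. The three-form vocabulary and the helper name `is_categorical` are illustrative assumptions, not part of the §A13 code.

```python
import math

# Hypothetical toy distribution; the real corpus has roughly 600 forms.
p = {"work": 0.25, "works": 0.25, "worked": 0.5}

def is_categorical(dist, tol=1e-9):
    """Return True if every mass lies in [0, 1] and the masses sum to 1."""
    in_range = all(0.0 <= mass <= 1.0 for mass in dist.values())
    return in_range and math.isclose(sum(dist.values()), 1.0, abs_tol=tol)

print(is_categorical(p))                             # True
print(is_categorical({"work": 0.7, "works": 0.7}))   # False: masses sum to 1.4
```

The tolerance is there because floating-point masses rarely sum to exactly 1.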

Joint, marginal, conditional

Let \(X\) be the random variable for the prefix (e.g., the model's input context) and \(Y\) for the next-token output (one of the \(V\) forms). Then:

  • Joint: \(p(X = x, Y = y)\) — the probability of a particular (prefix, next-token) pair.
  • Marginal: \(p(Y = y) = \sum_x p(X = x, Y = y)\) — the probability of \(Y = y\) regardless of \(x\).
  • Conditional: \(p(Y = y \mid X = x) = \dfrac{p(X = x, Y = y)}{p(X = x)}\), defined when \(p(X = x) > 0\).

The model's output is exactly the conditional: given the prefix, what's the distribution over next tokens? Training adjusts \(\theta\) so \(q_\theta(y \mid x) \to p^*(y \mid x)\).

Worked example: a 2-tense, 2-person toy

To make this concrete, consider a toy corpus restricted to {work} as the only verb, with \(\text{tense} \in \{\text{present}, \text{past}\}\) and \(\text{person} \in \{\text{I}, \text{he}\}\):

| Form   | Tense   | Person |
|--------|---------|--------|
| work   | present | I      |
| works  | present | he     |
| worked | past    | I      |
| worked | past    | he     |

Note the surface ambiguity: worked appears in two cells. If the corpus is sampled uniformly across (tense, person), the marginal \(p(\text{form} = \text{worked}) = 1/2\), while \(p(\text{form} = \text{works}) = 1/4\).

Conditional on person: \(p(\text{form} = \text{works} \mid \text{person} = \text{he}) = 1/2\). Tiny example, but it shows how conditioning reshapes the marginal: knowing that the person is he doubles the probability of works, from \(1/4\) to \(1/2\).
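The toy numbers can be reproduced in a few lines. This is a sketch that assumes uniform sampling over the four (tense, person) cells, as stated in the text; the variable and function names are illustrative. Exact fractions are used so the results match the hand calculation.

```python
from collections import defaultdict
from fractions import Fraction

# The four (tense, person) cells of the toy table and the form each produces.
cells = {
    ("present", "I"): "work",
    ("present", "he"): "works",
    ("past", "I"): "worked",
    ("past", "he"): "worked",
}

# Assumption from the text: cells are sampled uniformly.
cell_prob = Fraction(1, len(cells))

# Joint p(person, form): add up the cells that share a (person, form) pair.
joint = defaultdict(Fraction)
for (tense, person), form in cells.items():
    joint[(person, form)] += cell_prob

# Marginals: sum the joint over the other variable.
p_form = defaultdict(Fraction)
p_person = defaultdict(Fraction)
for (person, form), mass in joint.items():
    p_form[form] += mass
    p_person[person] += mass

def conditional(form, person):
    """p(form | person) = p(person, form) / p(person)."""
    return joint.get((person, form), Fraction(0)) / p_person[person]

print(p_form["worked"])            # 1/2
print(p_form["works"])             # 1/4
print(conditional("works", "he"))  # 1/2
```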

Independence

\(Y\) is independent of \(X\) (written \(Y \perp X\)) iff \(p(Y = y \mid X = x) = p(Y = y)\) for all \(y\) and all \(x\) with \(p(X = x) > 0\). In language modelling, independence is the enemy: if the next token were independent of the prefix, no model could do better than predicting from the marginal \(p(Y)\). The whole point of the model is to exploit dependencies.

We measure these dependencies later via mutual information (02-entropy-and-kl.md).
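Independence can be checked directly on a joint table using the equivalent product form \(p(x, y) = p(x)\,p(y)\). The sketch below assumes the same uniform toy as in the worked example; `is_independent` is a hypothetical helper name, and exact fractions are used so that equality tests are safe (with floats a tolerance would be needed).

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """Check p(x, y) == p(x) * p(y) for every pair, given joint[(x, y)]."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y]
               for x, y in product(xs, ys))

q = Fraction(1, 4)
# Joint over (person, form) from the toy table, uniform over cells.
person_form = {("I", "work"): q, ("he", "works"): q,
               ("I", "worked"): q, ("he", "worked"): q}
# Joint over (person, tense) in the same toy: every combination has mass 1/4.
person_tense = {("I", "present"): q, ("he", "present"): q,
                ("I", "past"): q, ("he", "past"): q}

print(is_independent(person_form))   # False
print(is_independent(person_tense))  # True
```

In the toy, person and tense are independent by construction, while form depends on person because work and works each occur with only one person.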

Expectation and variance

For a real-valued function \(f : \Omega \to \mathbb{R}\):

\[\mathbb{E}_p[f] = \sum_i p(w_i) f(w_i), \qquad \mathrm{Var}_p[f] = \mathbb{E}_p[f^2] - (\mathbb{E}_p[f])^2.\]

Loss functions are expectations: the cross-entropy loss is \(\mathbb{E}_{(x,y) \sim p^*}[-\log q_\theta(y \mid x)]\). The mini-batch average is its sample estimator.

Worked example: expected length of a verb form

Define \(f(w_i) = \text{len}(w_i)\), the character length of the \(i\)-th form. On our toy, with the four (tense, person) cells equally likely:

\[\mathbb{E}[f] = \tfrac{1}{4}(4 + 5 + 6 + 6) = 5.25.\]

Trivial, but the muscle memory of expectation as a sum-of-(prob × value) is what we need.
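The same sum-of-(prob × value) pattern, written as a sketch in code. It uses the form distribution induced by uniform cells (worked has mass 1/2 because it fills two cells); the helper names `expectation` and `variance` are illustrative.

```python
from fractions import Fraction

# Form distribution induced by uniform sampling over the four toy cells.
p = {"work": Fraction(1, 4), "works": Fraction(1, 4), "worked": Fraction(1, 2)}

def expectation(dist, f):
    """E_p[f] = sum over outcomes of p(w) * f(w)."""
    return sum(mass * f(w) for w, mass in dist.items())

def variance(dist, f):
    """Var_p[f] = E_p[f^2] - (E_p[f])^2."""
    return expectation(dist, lambda w: f(w) ** 2) - expectation(dist, f) ** 2

mean_len = expectation(p, len)
var_len = variance(p, len)
print(float(mean_len))  # 5.25
print(float(var_len))   # 0.6875
```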

Mini-batches and unbiased estimators

In training, the true expectation \(\mathbb{E}_{(x,y) \sim p^*}[L(\theta; x, y)]\) is what we want to minimise. We don't have \(p^*\); we have \(N\) samples, from which each step draws a mini-batch of \(B\) examples. The batch mean

\[\widehat{L}(\theta) = \frac{1}{B} \sum_{i \in \text{batch}} L(\theta; x_i, y_i)\]

is an unbiased estimator of the true expectation when the examples are i.i.d. draws from \(p^*\): \(\mathbb{E}[\widehat{L}(\theta)] = \mathbb{E}_{(x,y) \sim p^*}[L(\theta; x, y)]\). Its variance scales as \(1/B\), which is the foundation under "bigger batches give cleaner gradients".

We'll use this directly in Phase 18 (training loop) when we average the per-example cross-entropy across the batch.
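Both claims can be verified exactly on a tiny case by enumerating every possible batch rather than simulating. The sketch below assumes a hypothetical pool of four equally likely per-example losses and i.i.d. sampling with replacement; it is not the Phase 18 training loop.

```python
from fractions import Fraction
from itertools import product

# Hypothetical per-example losses; each is drawn with equal probability.
losses = [Fraction(1, 10), Fraction(1, 2), Fraction(2), Fraction(3)]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def batch_mean_stats(batch_size):
    """Exact mean and variance of the batch average over all i.i.d. batches."""
    # Every ordered batch drawn with replacement is equally likely.
    averages = [mean(batch) for batch in product(losses, repeat=batch_size)]
    mu = mean(averages)
    var = mean((a - mu) ** 2 for a in averages)
    return mu, var

true_mean = mean(losses)
mu_1, var_1 = batch_mean_stats(1)
mu_4, var_4 = batch_mean_stats(4)

print(mu_1 == true_mean, mu_4 == true_mean)  # True True (unbiased)
print(var_1 / var_4)                         # 4 (variance scales as 1/B)
```

Enumeration is only feasible for toy sizes; it is used here because it makes the \(1/B\) scaling exact instead of approximate.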

Empirical distribution

Given \(N\) observations \(\{y^{(1)}, \ldots, y^{(N)}\}\) drawn from a categorical, the empirical distribution is:

\[\widehat{p}(y) = \frac{1}{N} \sum_n \mathbb{1}[y^{(n)} = y].\]

Properties: it's a valid distribution; it's the MLE for \(p^*\) under the categorical model (Phase 03 of this set of theory files demonstrates this); it converges to \(p^*\) as \(N \to \infty\) (by the law of large numbers).

The empirical distribution is the bridge between "we have data" and "we have a distribution we can compute KL against".
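Computing the empirical distribution amounts to counting and normalising. A minimal sketch, with a hypothetical sample of eight observed forms:

```python
from collections import Counter

def empirical_distribution(observations):
    """Return p_hat(y) = count(y) / N for each observed outcome y."""
    counts = Counter(observations)
    n = len(observations)
    return {y: c / n for y, c in counts.items()}

# Hypothetical sample of N = 8 observed forms.
sample = ["worked", "work", "worked", "works",
          "worked", "work", "worked", "works"]
p_hat = empirical_distribution(sample)
print(p_hat)  # {'worked': 0.5, 'work': 0.25, 'works': 0.25}
```

Forms that never appear in the sample get no entry, which means an implicit mass of zero.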

Numerical practice: log-probabilities

In practice, we almost never store \(p(y)\) directly when \(V\) is large or \(p\) is sharp. Instead we store \(\log p(y)\):

  • Sums become log-sum-exp (04-log-sum-exp-and-stability.md).
  • Products become sums (which don't underflow as easily).
  • The cross-entropy and KL formulas read more naturally on log-probabilities.

This is foreshadowing — full treatment in 04-log-sum-exp-and-stability.md.
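A quick illustration of why products of probabilities are risky in floating point, assuming a hypothetical sequence of 200 tokens that each receive probability 0.01:

```python
import math

# Hypothetical sequence: 200 tokens, each assigned probability 0.01.
probs = [0.01] * 200

direct = math.prod(probs)                    # true value 1e-400 underflows float64
log_total = sum(math.log(p) for p in probs)  # stays well inside float range

print(direct)     # 0.0
print(log_total)  # about -921.03
```

The direct product is indistinguishable from an impossible event, while the summed log-probability keeps the full information.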

What this file does NOT cover

  • Continuous distributions (next phase if ever).
  • Exchangeability, de Finetti, and Bayesian foundations (out of scope).
  • Sufficiency, ancillary statistics, exponential families (worth knowing, but not for the §A13 task).

Next: 02-entropy-and-kl.md