English · Español

01 — N-gram language models¶

🇪🇸 Un n-grama es un modelo de lenguaje primitivo: la probabilidad de la siguiente palabra depende solo de las n-1 anteriores. Es estúpido. Para corpora pequeños y locales —como el nuestro de conjugaciones verbales inglesas— es sorprendentemente bueno, y por eso es la baseline correcta para el transformer.

This file derives the n-gram language model, the smoothing strategy, the perplexity metric, and the empirical knobs that affect the baseline number Borja commits.

The setup¶

A language model is a probability distribution over token sequences:

\[ P(w_1, w_2, \ldots, w_T) \]

For our verb-grammar corpus, the sequences are short rows like I worked / yo trabajé . (with a separator and an end-of-row token). \(T\) is typically 6–12.

By the chain rule of probability, any joint distribution factorizes as:

\[ P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}) \]

An ideal language model would compute each \(P(w_t \mid w_1, \ldots, w_{t-1})\) using the entire prefix. That's infeasible: the number of distinct prefixes is exponential in \(t\), and we have a finite corpus.

The n-gram assumption truncates the conditional dependence to the last \(n - 1\) tokens:

\[ P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1}) \]

Now there are only \(|V|^{n-1}\) distinct conditioning contexts (with vocabulary \(V\)). For \(|V| = 64\) (a reasonable estimate for our BPE-tokenized verb-grammar corpus after Phase 11) and \(n = 5\), that's \(64^4 = 16.7\) million contexts. Most of them never appear in the corpus — see smoothing below.

Maximum-likelihood estimation¶

Given a corpus of token sequences, the maximum likelihood estimate (MLE) for an n-gram is the empirical count ratio:

\[ P_\text{MLE}(w_t \mid w_{t-n+1}, \ldots, w_{t-1}) = \frac{c(w_{t-n+1}, \ldots, w_t)}{c(w_{t-n+1}, \ldots, w_{t-1})} \]

where \(c(\cdot)\) counts occurrences of an n-gram (numerator) or (n-1)-gram (denominator) in the training corpus.

Worked example on the verb-grammar corpus. Suppose for \(n = 3\) we want \(P(\text{works} \mid \text{he, } \emptyset)\) where \(\emptyset\) is the previous separator token (so the window is (<sep>, he, ?)).

count((<sep>, he, works)) = 12      ← seen 12 times in training
count((<sep>, he, work))  =  2      ← seen 2 times (likely in subjunctive constructions)
count((<sep>, he, worked))=  8
count((<sep>, he, ...))   = total = 50   ← all observed continuations

Then \(P_\text{MLE}(\text{works} \mid \langle\text{sep}\rangle, \text{he}) = 12 / 50 = 0.24\).

The MLE is a discrete categorical distribution over \(V\), parametrized by counts. Implementing it = building a dict[(context_tokens), Counter[next_token]]. Implementing it efficiently = a trie or a hash map keyed on a tuple. We do not need efficiency in Phase 14.

The zero-count problem¶

The MLE assigns probability zero to any n-gram that didn't appear in training. That's a disaster:

The perplexity of a sequence containing any unseen n-gram is \(\infty\). The baseline number is meaningless.
The model can't generalize at all. "I will play" might appear; "I will play tomorrow" might not (because tomorrow is rare in our corpus) — the model would assign zero to the whole row.

This is why smoothing exists.

Add-α smoothing (Laplace, \(\alpha = 1\); we use \(\alpha = 0.01\))¶

Add a pseudo-count \(\alpha\) to every n-gram count, including unseen ones:

\[ P_\alpha(w_t \mid \text{context}) = \frac{c(\text{context}, w_t) + \alpha}{c(\text{context}) + \alpha |V|} \]

This redistributes probability mass from seen events to unseen ones. The hyperparameter \(\alpha\) controls how much:

\(\alpha = 0\) → MLE, zero probabilities, infinite perplexity on test.
\(\alpha = 1\) → Laplace smoothing. Smooths a lot. Often over-smooths for small data.
\(\alpha = 0.01\) → modest smoothing. Our default. Works well for our corpus.

Note that with \(\alpha = 0.01\) and \(|V| = 64\), the denominator shift is \(0.01 \times 64 = 0.64\). For a context with count \(c(\text{context}) = 50\), the denominator becomes \(50.64\) instead of \(50\) — a barely perceptible shift for frequent contexts, and a large shift for rare ones (which is what you want).

Better smoothing: a survey¶

For completeness — we use add-\(\alpha\), but you should know these exist:

Kneser-Ney (KN) smoothing. Considers how diverse a token's contexts are. A token that appears in many contexts (the in English) is more likely to appear in a new context than a token that always appears in the same context (Francisco after San). KN is the state-of-the-art among non-neural language models. We do not implement it. One paragraph mention.
Stupid backoff (Brants et al. 2007). If the n-gram isn't seen, fall back to the (n-1)-gram with a fixed penalty. Used in large-scale practical systems before neural LMs. Two-line implementation. We do not implement.
Good-Turing. Estimates probability of unseen events from the rate of events seen exactly once. Theoretically appealing, practically replaced by KN.

None of these will close the gap to a neural language model on tasks with semantic structure. They're sophisticated only relative to MLE; they're toys relative to a transformer.

Perplexity: the evaluation metric¶

A language model assigns a probability to a sequence. To compare models, we need a single number. Perplexity (PPL) is that number.

Definition: \(\text{PPL}\) is the exponential of the average per-token negative log-likelihood:

\[ \text{PPL} = \exp\left(-\frac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid w_{<t})\right) \]

where \(N\) is the total number of tokens in the evaluation set. (See Phase 5 theory file for the derivation; the gloss is: PPL is the effective vocabulary size the model is uncertain over per token.)

Interpretations:

\(\text{PPL} = 1\) means the model is perfectly certain at every step.
\(\text{PPL} = |V|\) means the model is uniformly uncertain — equivalent to assigning equal probability to every vocabulary token.
\(\text{PPL} = e^{H(w)}\) where \(H(w)\) is the per-token cross-entropy in nats.

For our verb-grammar corpus with \(|V| \approx 64\): - A random model has \(\text{PPL} \approx 64\). - A 1-gram (unigram) on our corpus has \(\text{PPL} \approx 30\) (some tokens are much more frequent: I, you, he, the, separators). - A 3-gram should land around 4–8 — the model is essentially right when context is enough. - A 5-gram should approach 2–4. - A perfectly-trained Mini-GPT (Phase 17/18) should hit < 2 on the held-out set, because the task is mostly deterministic.

These are predictions for Phase 14's lab; they are not guarantees. The point is to know roughly what numbers to expect, so a result of \(\text{PPL} = 30\) for the 3-gram is a sign something is wrong (bug, smoothing issue, eval mismatch), not progress.

Reading the perplexity ↔ NLL relationship¶

Some teams report negative log-likelihood (NLL) in nats per token, others report perplexity. They're trivially related:

\(\text{NLL} = -\frac{1}{N} \sum \log P\) (in nats if using \(\ln\), in bits if using \(\log_2\)).
\(\text{PPL} = \exp(\text{NLL})\) (with same base; \(\exp\) if NLL in nats, \(2^{(\cdot)}\) if NLL in bits).

The two carry the same information. Perplexity is more interpretable as "effective vocabulary size"; NLL is more useful when you want to add up token-level losses (it's additive).

Sequence boundary handling¶

Concretely: our corpus has rows like I work / yo trabajo separated by a newline / end-of-row token. The n-gram needs to know where rows begin and end.

Two conventions:

Pad with start tokens. Prepend \(n - 1\) copies of a <bos> token to every row. A 3-gram sees (<bos>, <bos>, I), (<bos>, I, work), (I, work, /), .... The first prediction is conditioned on a context that exists.
End-of-sequence token. Append <eos> to every row. The model learns when to stop. Counts as part of the per-token loss.

For Phase 14, use both: <bos> prepended and <eos> appended. Lab 01 specifies.

Compute and storage¶

For our corpus (~600 rows, ~6000 tokens total):

A 3-gram model has at most \(|V|^3 = 64^3 \approx 260\text{k}\) entries. In practice, only the seen subset is stored — a few thousand entries.
A 5-gram model has at most \(|V|^5 \approx 10^9\) entries. We store only the seen subset — still a few thousand entries (most contexts never appear).
Training is one pass over the corpus, \(O(N \cdot n)\) time, \(O(\text{unique n-grams})\) memory.
Inference (compute perplexity on a held-out set) is \(O(N)\) time, one lookup per token.

This fits in milliseconds and kilobytes. There is no "training time" or "GPU" concern in Phase 14.

What an n-gram cannot learn¶

Three concrete failures, demonstrated in lab 01:

Cross-paradigm generalization. Suppose the training set has many examples of he works, he played, he listened but no examples of he walked. A 3-gram learns nothing about he walked from the other patterns — the trigram (<sep>, he, walked) has count 0 and gets only the \(\alpha\) smoothing mass. By contrast, an attention-based model with shared embeddings could route he to walked correctly because walked is embedded near worked, played, listened.
Tense-auxiliary chains. The future construction he is going to work is a 4-token dependency: is → going → to → work. A 3-gram with window 3 sees (is, going, to) but not (he, is, going). It will predict to correctly after going but cannot constrain the verb form three tokens later. A 5-gram fixes this for the specific construction but does not generalize across going to / will / 'll.
Bilingual alignment. Given I work / yo, the n-gram should predict trabajo. It will — if it saw exactly that bigram before. It does not learn the function "translate the English verb to Spanish 1^st-singular present". It just memorizes co-occurrences. A bigger corpus would close the gap; ours won't.

These failures are the seed of attention.

What an n-gram is great at¶

For our corpus, an n-gram is near-optimal at:

Pronoun → present-simple form. I work, you work, he ___ → works because (<sep>, he, works) is the most frequent trigram after (<sep>, he, *).
Pronoun → past form. I worked, you worked, he ___ → worked likewise.
Spanish article placement. yo ___ → trabajo (or whatever pronoun-aligned verb form follows yo).

If your task is exactly pronoun → verb-form completion, a 5-gram with \(\alpha = 0.01\) on our corpus will be within 10% of the optimal model. The transformer's win has to come from generalization across paradigms, not from beating the n-gram on the central task.

This realization is the bridge to Phase 15: we need a model whose internals make generalization the natural mode of operation, not memorization.

A drill before lab¶

Compute by hand: given training counts

(<bos>, <bos>, I)     = 20
(<bos>, I, work)      = 14
(<bos>, I, walk)      =  3
(<bos>, I, worked)    =  3

For \(n = 3\), \(\alpha = 0.01\), \(|V| = 64\), compute \(P_\alpha(\text{work} \mid \langle\text{bos}\rangle, I)\).

\[ P_\alpha = \frac{14 + 0.01}{20 + 0.01 \cdot 64} = \frac{14.01}{20.64} \approx 0.679 \]

If you can compute this in your head, you understand the model. The rest of Phase 14 is data plumbing.

What this phase does NOT cover¶

Implementing Kneser-Ney smoothing. One paragraph above is the entire treatment.
Hierarchical Bayesian language models, neural language models prior to GPT, RNN-LMs as a separate category. These are interesting but not on the critical path.
The 1990s n-gram empire (translation, IR, retrieval). We touch RAG in Phase 29; non-LM uses of n-grams are out of scope.
Variable-order n-grams ("interpolated KN"). We use fixed \(n\) per model.
Smoothing hyperparameter tuning. Lock \(\alpha = 0.01\).

Next: theory/02-rnn-recurrence.md.