English · Español
01 — N-gram language models¶
🇪🇸 Un n-grama es un modelo de lenguaje primitivo: la probabilidad de la siguiente palabra depende solo de las
n-1anteriores. Es estúpido. Para corpora pequeños y locales —como el nuestro de conjugaciones verbales inglesas— es sorprendentemente bueno, y por eso es la baseline correcta para el transformer.
This file derives the n-gram language model, the smoothing strategy, the perplexity metric, and the empirical knobs that affect the baseline number Borja commits.
The setup¶
A language model is a probability distribution over token sequences:
For our verb-grammar corpus, the sequences are short rows like I worked / yo trabajé . (with a separator and an end-of-row token). \(T\) is typically 6–12.
By the chain rule of probability, any joint distribution factorizes as:
An ideal language model would compute each \(P(w_t \mid w_1, \ldots, w_{t-1})\) using the entire prefix. That's infeasible: the number of distinct prefixes is exponential in \(t\), and we have a finite corpus.
The n-gram assumption truncates the conditional dependence to the last \(n - 1\) tokens:
Now there are only \(|V|^{n-1}\) distinct conditioning contexts (with vocabulary \(V\)). For \(|V| = 64\) (a reasonable estimate for our BPE-tokenized verb-grammar corpus after Phase 11) and \(n = 5\), that's \(64^4 = 16.7\) million contexts. Most of them never appear in the corpus — see smoothing below.
Maximum-likelihood estimation¶
Given a corpus of token sequences, the maximum likelihood estimate (MLE) for an n-gram is the empirical count ratio:
where \(c(\cdot)\) counts occurrences of an n-gram (numerator) or (n-1)-gram (denominator) in the training corpus.
Worked example on the verb-grammar corpus. Suppose for \(n = 3\) we want \(P(\text{works} \mid \text{he, } \emptyset)\) where \(\emptyset\) is the previous separator token (so the window is (<sep>, he, ?)).
count((<sep>, he, works)) = 12 ← seen 12 times in training
count((<sep>, he, work)) = 2 ← seen 2 times (likely in subjunctive constructions)
count((<sep>, he, worked))= 8
count((<sep>, he, ...)) = total = 50 ← all observed continuations
Then \(P_\text{MLE}(\text{works} \mid \langle\text{sep}\rangle, \text{he}) = 12 / 50 = 0.24\).
The MLE is a discrete categorical distribution over \(V\), parametrized by counts. Implementing it = building a dict[(context_tokens), Counter[next_token]]. Implementing it efficiently = a trie or a hash map keyed on a tuple. We do not need efficiency in Phase 14.
The zero-count problem¶
The MLE assigns probability zero to any n-gram that didn't appear in training. That's a disaster:
- The perplexity of a sequence containing any unseen n-gram is \(\infty\). The baseline number is meaningless.
- The model can't generalize at all. "I will play" might appear; "I will play tomorrow" might not (because tomorrow is rare in our corpus) — the model would assign zero to the whole row.
This is why smoothing exists.
Add-α smoothing (Laplace, \(\alpha = 1\); we use \(\alpha = 0.01\))¶
Add a pseudo-count \(\alpha\) to every n-gram count, including unseen ones:
This redistributes probability mass from seen events to unseen ones. The hyperparameter \(\alpha\) controls how much:
- \(\alpha = 0\) → MLE, zero probabilities, infinite perplexity on test.
- \(\alpha = 1\) → Laplace smoothing. Smooths a lot. Often over-smooths for small data.
- \(\alpha = 0.01\) → modest smoothing. Our default. Works well for our corpus.
Note that with \(\alpha = 0.01\) and \(|V| = 64\), the denominator shift is \(0.01 \times 64 = 0.64\). For a context with count \(c(\text{context}) = 50\), the denominator becomes \(50.64\) instead of \(50\) — a barely perceptible shift for frequent contexts, and a large shift for rare ones (which is what you want).
Better smoothing: a survey¶
For completeness — we use add-\(\alpha\), but you should know these exist:
- Kneser-Ney (KN) smoothing. Considers how diverse a token's contexts are. A token that appears in many contexts (
thein English) is more likely to appear in a new context than a token that always appears in the same context (FranciscoafterSan). KN is the state-of-the-art among non-neural language models. We do not implement it. One paragraph mention. - Stupid backoff (Brants et al. 2007). If the n-gram isn't seen, fall back to the (n-1)-gram with a fixed penalty. Used in large-scale practical systems before neural LMs. Two-line implementation. We do not implement.
- Good-Turing. Estimates probability of unseen events from the rate of events seen exactly once. Theoretically appealing, practically replaced by KN.
None of these will close the gap to a neural language model on tasks with semantic structure. They're sophisticated only relative to MLE; they're toys relative to a transformer.
Perplexity: the evaluation metric¶
A language model assigns a probability to a sequence. To compare models, we need a single number. Perplexity (PPL) is that number.
Definition: \(\text{PPL}\) is the exponential of the average per-token negative log-likelihood:
where \(N\) is the total number of tokens in the evaluation set. (See Phase 5 theory file for the derivation; the gloss is: PPL is the effective vocabulary size the model is uncertain over per token.)
Interpretations:
- \(\text{PPL} = 1\) means the model is perfectly certain at every step.
- \(\text{PPL} = |V|\) means the model is uniformly uncertain — equivalent to assigning equal probability to every vocabulary token.
- \(\text{PPL} = e^{H(w)}\) where \(H(w)\) is the per-token cross-entropy in nats.
For our verb-grammar corpus with \(|V| \approx 64\):
- A random model has \(\text{PPL} \approx 64\).
- A 1-gram (unigram) on our corpus has \(\text{PPL} \approx 30\) (some tokens are much more frequent: I, you, he, the, separators).
- A 3-gram should land around 4–8 — the model is essentially right when context is enough.
- A 5-gram should approach 2–4.
- A perfectly-trained Mini-GPT (Phase 17/18) should hit < 2 on the held-out set, because the task is mostly deterministic.
These are predictions for Phase 14's lab; they are not guarantees. The point is to know roughly what numbers to expect, so a result of \(\text{PPL} = 30\) for the 3-gram is a sign something is wrong (bug, smoothing issue, eval mismatch), not progress.
Reading the perplexity ↔ NLL relationship¶
Some teams report negative log-likelihood (NLL) in nats per token, others report perplexity. They're trivially related:
- \(\text{NLL} = -\frac{1}{N} \sum \log P\) (in nats if using \(\ln\), in bits if using \(\log_2\)).
- \(\text{PPL} = \exp(\text{NLL})\) (with same base; \(\exp\) if NLL in nats, \(2^{(\cdot)}\) if NLL in bits).
The two carry the same information. Perplexity is more interpretable as "effective vocabulary size"; NLL is more useful when you want to add up token-level losses (it's additive).
Sequence boundary handling¶
Concretely: our corpus has rows like I work / yo trabajo separated by a newline / end-of-row token. The n-gram needs to know where rows begin and end.
Two conventions:
- Pad with start tokens. Prepend \(n - 1\) copies of a
<bos>token to every row. A 3-gram sees(<bos>, <bos>, I), (<bos>, I, work), (I, work, /), .... The first prediction is conditioned on a context that exists. - End-of-sequence token. Append
<eos>to every row. The model learns when to stop. Counts as part of the per-token loss.
For Phase 14, use both: <bos> prepended and <eos> appended. Lab 01 specifies.
Compute and storage¶
For our corpus (~600 rows, ~6000 tokens total):
- A 3-gram model has at most \(|V|^3 = 64^3 \approx 260\text{k}\) entries. In practice, only the seen subset is stored — a few thousand entries.
- A 5-gram model has at most \(|V|^5 \approx 10^9\) entries. We store only the seen subset — still a few thousand entries (most contexts never appear).
- Training is one pass over the corpus, \(O(N \cdot n)\) time, \(O(\text{unique n-grams})\) memory.
- Inference (compute perplexity on a held-out set) is \(O(N)\) time, one lookup per token.
This fits in milliseconds and kilobytes. There is no "training time" or "GPU" concern in Phase 14.
What an n-gram cannot learn¶
Three concrete failures, demonstrated in lab 01:
- Cross-paradigm generalization. Suppose the training set has many examples of
he works, he played, he listenedbut no examples ofhe walked. A 3-gram learns nothing abouthe walkedfrom the other patterns — the trigram(<sep>, he, walked)has count 0 and gets only the \(\alpha\) smoothing mass. By contrast, an attention-based model with shared embeddings could routehetowalkedcorrectly becausewalkedis embedded nearworked, played, listened. - Tense-auxiliary chains. The future construction
he is going to workis a 4-token dependency:is → going → to → work. A 3-gram with window 3 sees(is, going, to)but not(he, is, going). It will predicttocorrectly aftergoingbut cannot constrain the verb form three tokens later. A 5-gram fixes this for the specific construction but does not generalize acrossgoing to / will / 'll. - Bilingual alignment. Given
I work / yo, the n-gram should predicttrabajo. It will — if it saw exactly that bigram before. It does not learn the function "translate the English verb to Spanish 1st-singular present". It just memorizes co-occurrences. A bigger corpus would close the gap; ours won't.
These failures are the seed of attention.
What an n-gram is great at¶
For our corpus, an n-gram is near-optimal at:
- Pronoun → present-simple form.
I work, you work, he ___→worksbecause(<sep>, he, works)is the most frequent trigram after(<sep>, he, *). - Pronoun → past form.
I worked, you worked, he ___→workedlikewise. - Spanish article placement.
yo ___→trabajo(or whatever pronoun-aligned verb form followsyo).
If your task is exactly pronoun → verb-form completion, a 5-gram with \(\alpha = 0.01\) on our corpus will be within 10% of the optimal model. The transformer's win has to come from generalization across paradigms, not from beating the n-gram on the central task.
This realization is the bridge to Phase 15: we need a model whose internals make generalization the natural mode of operation, not memorization.
A drill before lab¶
Compute by hand: given training counts
For \(n = 3\), \(\alpha = 0.01\), \(|V| = 64\), compute \(P_\alpha(\text{work} \mid \langle\text{bos}\rangle, I)\).
If you can compute this in your head, you understand the model. The rest of Phase 14 is data plumbing.
What this phase does NOT cover¶
- Implementing Kneser-Ney smoothing. One paragraph above is the entire treatment.
- Hierarchical Bayesian language models, neural language models prior to GPT, RNN-LMs as a separate category. These are interesting but not on the critical path.
- The 1990s n-gram empire (translation, IR, retrieval). We touch RAG in Phase 29; non-LM uses of n-grams are out of scope.
- Variable-order n-grams ("interpolated KN"). We use fixed \(n\) per model.
- Smoothing hyperparameter tuning. Lock \(\alpha = 0.01\).
Next: theory/02-rnn-recurrence.md.