Lab 01: N-gram baseline
Goal: commit the perplexity number that the Mini-GPT in Phases 17–18 has to beat.
Estimated time: 90–120 minutes.
Prereq: lab 00 (tokenized splits) committed.
What you produce
A directory experiments/14-ngram-baseline/ containing:
- ngram.py: your n-gram LM implementation (per the src/minimodel/sequence_baselines/ngram.py blueprint).
- train.py: fits 3-gram and 5-gram on the train split, evaluates perplexity on dev.
- results.json: perplexity numbers for \(n \in \{1, 2, 3, 4, 5, 6, 7\}\) on dev.
- perplexity_vs_n.png: plot of dev perplexity vs \(n\).
- manifest.json: per LYNX_CORTEX.md §5.
- README.md (2–3 paragraphs): what the numbers mean.
The model
Implement NGramLM per the blueprint in src/minimodel/BLUEPRINT.md §1. Public API:
class NGramLM:
    # vocab_size is keyword-only: a required parameter cannot follow the
    # defaulted alpha positionally (that would be a SyntaxError).
    def __init__(self, n: int, alpha: float = 0.01, *, vocab_size: int): ...
    def fit(self, sequences: list[list[int]]) -> None: ...
    def logp_token(self, context: tuple[int, ...], token: int) -> float: ...
    def logp_sequence(self, sequence: list[int]) -> float: ...
    def perplexity(self, sequences: list[list[int]]) -> float: ...
Backing store: a dict[tuple[int, ...], collections.Counter] mapping context (last \(n-1\) tokens) to a counter over the next token.
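A minimal sketch of that backing store, assuming integer token ids and a hypothetical <bos> id of 0 (use whatever id lab 00 assigned). The helper name count_ngrams is illustrative and not part of the blueprint API; it only shows the padding and the context-to-Counter mapping.

```python
from collections import Counter, defaultdict

BOS_ID = 0  # assumption: placeholder id for <bos>, not taken from lab 00


def count_ngrams(sequences, n, bos_id=BOS_ID):
    """Map each (n-1)-token context to a Counter over the next token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pre-pad so the first real token also has a full-length context.
        padded = [bos_id] * (n - 1) + list(seq)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - (n - 1):i])  # empty tuple when n == 1
            counts[context][padded[i]] += 1
    return counts


toy = [[5, 6, 7], [5, 6, 8]]
counts = count_ngrams(toy, n=3)
print(counts[(5, 6)])  # next-token counts after the context (5, 6)
```

For n = 1 every token shares the empty context (), so the same structure covers the unigram case without special handling.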
TODOs
Block A: implement the model
- Pre-pad each sequence with \((n-1)\) copies of the <bos> token before counting.
- Build a dict[context_tuple -> Counter[next_token]] from the train split.
- Also store context_totals: dict[context_tuple -> int] for \(c(\text{context})\), as a convenience.
- Implement logp_token(context, token) using add-\(\alpha\) smoothing.
- Implement perplexity(sequences) per the formula in theory/01-ngram-models.md.
- Sanity test: train on a tiny toy corpus (5 sequences), confirm logp_token is reasonable, and confirm perplexity is computed correctly. Unit test before evaluating.
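The smoothing and perplexity steps above can be sketched as two small functions. This assumes the usual add-\(\alpha\) estimate and the usual definition of perplexity as the exponential of the negative mean log-probability; theory/01-ngram-models.md remains the authority on the exact formula (for example, which tokens are scored). The function names are illustrative, not the blueprint's.

```python
import math
from collections import Counter


def logp_add_alpha(counter, token, alpha, vocab_size):
    """log P(token | context) with add-alpha smoothing.

    counter holds next-token counts for one context; pass an empty
    Counter for a context that never appeared in training.
    """
    total = sum(counter.values())
    return math.log((counter[token] + alpha) / (total + alpha * vocab_size))


def perplexity_from_logps(logps):
    """exp of the negative mean log-probability over the scored tokens."""
    return math.exp(-sum(logps) / len(logps))


V = 4  # toy vocabulary size
seen = Counter({1: 3, 2: 1})
probs = [math.exp(logp_add_alpha(seen, t, 0.01, V)) for t in range(V)]
total_mass = sum(probs)  # a valid distribution sums to 1

# A model that gives every scored token probability 1/2 has perplexity 2.
ppl = perplexity_from_logps([math.log(0.5)] * 6)
print(total_mass, ppl)
```

Two checks like these (probabilities sum to 1 over the vocabulary, and a known constant probability gives the expected perplexity) make a reasonable starting point for the unit test.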
Block B: fit and evaluate
For each \(n \in \{1, 2, 3, 4, 5, 6, 7\}\):
- Fit on train_ids.
- Compute perplexity on dev_ids (and on train_ids, which the record and the plot also need).
- Record (n, perplexity_train, perplexity_dev) in results.json.
Plot perplexity_vs_n.png with \(n\) on x-axis, perplexity on y-axis (log scale). Plot both train and dev curves.
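The fit-and-evaluate loop can be sketched as below. UniformLM is a stand-in that only exists so the snippet runs on its own; swap in your NGramLM. The dictionary keys mirror the (n, perplexity_train, perplexity_dev) tuple named above, but the exact layout of results.json is an assumption.

```python
import json


class UniformLM:
    """Stand-in with the same fit/perplexity surface as NGramLM (not a real n-gram model)."""

    def __init__(self, n, vocab_size):
        self.n = n
        self.vocab_size = vocab_size

    def fit(self, sequences):
        pass  # a real model would count n-grams here

    def perplexity(self, sequences):
        return float(self.vocab_size)  # uniform predictions give perplexity |V|


def sweep(make_model, train_ids, dev_ids, n_values):
    """Fit one model per n and collect train and dev perplexity."""
    rows = []
    for n in n_values:
        model = make_model(n)
        model.fit(train_ids)
        rows.append({
            "n": n,
            "perplexity_train": model.perplexity(train_ids),
            "perplexity_dev": model.perplexity(dev_ids),
        })
    return rows


rows = sweep(lambda n: UniformLM(n, vocab_size=10), [[1, 2, 3]], [[1, 2]], range(1, 8))
results_text = json.dumps(rows, indent=2)  # write this string to results.json
print(results_text)
```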
Block C: the conjugation-completion probe
Beyond perplexity, the headline qualitative result is the conjugation-completion task:
- For the 3-gram and 5-gram, compute the top-5 predicted tokens given the prompt <bos> <bos> I work , you work , he.
- Print and commit the ranked list with probabilities.
- Expected result: top-1 should be works for the 5-gram. The 3-gram might pick work or another verb's form: its context is only the last two tokens, which (with word-level tokens) cover the pronoun but do not reach back to the verb. Check empirically.
Repeat for two other prompts:
- <bos> <bos> I worked , you worked , he → expected worked.
- <bos> <bos> I will work , you will work , he → expected will.
Commit results as conjugation_completion.json.
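One way to produce the ranked list is to score every vocabulary id under add-\(\alpha\) smoothing and keep the k largest. The sketch below assumes token ids run from 0 to vocab_size - 1; the vocabulary and counts are made up for illustration, and top_k is a hypothetical helper rather than part of the blueprint API.

```python
import heapq
from collections import Counter


def top_k(counter, alpha, vocab_size, k=5):
    """Return the k most probable (token_id, probability) pairs for one context."""
    denom = sum(counter.values()) + alpha * vocab_size
    scored = ((tok, (counter[tok] + alpha) / denom) for tok in range(vocab_size))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up toy vocabulary and counts, for illustration only.
vocab = ["<bos>", "I", "you", "he", ",", "work", "works"]
next_counts = Counter({6: 4, 5: 1})  # hypothetical next-token counts for one context
ranked = [(vocab[tok], p) for tok, p in top_k(next_counts, 0.01, len(vocab), k=3)]
for token, p in ranked:
    print(f"{token}\t{p:.4f}")
```

Scoring the whole vocabulary, not just the tokens present in the Counter, keeps unseen tokens in the ranking with their smoothed probability.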
Block D: interpret
In README.md:
- State the baseline. "Dev perplexity for \(n=5\) is X. This is the number Phase 17/18 must beat."
- Comment on the U-curve. Dev perplexity typically decreases with \(n\) until data sparsity takes over, then increases: most long contexts are unseen in training, so the \(\alpha |V|\) term dominates the denominator and the add-\(\alpha\) estimate falls back toward the uniform \(1/|V|\). Where is the minimum for our corpus? What does that say about the locality of the subject-pronoun → verb-form dependency?
- Conjugation-completion observation. Did the 3-gram correctly predict works? If yes: state that the task is local (window 3 is enough). If no: state that the task requires longer context, which bridges to Phase 15.
- Bilingual question (optional). Did the n-gram learn alignment between I work / yo → trabajo? Compute the perplexity of just the Spanish-side tokens conditioned on the English-side prefix. Compare to unconditional Spanish perplexity. The delta is the model's "translation skill".
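For the U-curve comment, the rising right-hand side follows directly from the standard add-\(\alpha\) estimate (assumed here; the lab's exact form is in theory/01-ngram-models.md). With \(c(h, w)\) the count of token \(w\) after context \(h\) and \(c(h)\) the context total:

\[
P(w \mid h) = \frac{c(h, w) + \alpha}{c(h) + \alpha |V|},
\qquad
P(w \mid h)\,\Big|_{c(h) = 0} = \frac{\alpha}{\alpha |V|} = \frac{1}{|V|}.
\]

Each dev token whose context never appeared in training therefore contributes \(\log |V|\) to the negative log-likelihood, so as \(n\) grows and more dev contexts go unseen, dev perplexity drifts toward \(|V|\).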
Block E: manifest
{
  "experiment": "14-ngram-baseline",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "config": {
    "n_values": [1, 2, 3, 4, 5, 6, 7],
    "alpha": 0.01,
    "tokenization": "from lab 00 manifest"
  },
  "results_summary": {
    "best_n": null,
    "best_dev_perplexity": null,
    "conjugation_top1_correct": null
  },
  "versions": {"python": "3.11.x", "numpy": "X.Y.Z"}
}
Fill nulls.
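Filling the nulls by script avoids copy errors. The sketch below is one possible approach, not a required one: fill_manifest is a hypothetical helper, the rows use the key names assumed for results.json, conjugation_top1_correct is treated as a boolean (the manifest does not specify its type), and the numbers are placeholders, not real results.

```python
import platform


def fill_manifest(manifest, rows, conjugation_top1_correct):
    """Return a copy of manifest with results_summary and the Python version filled in."""
    best = min(rows, key=lambda row: row["perplexity_dev"])
    filled = dict(manifest)
    filled["results_summary"] = {
        "best_n": best["n"],
        "best_dev_perplexity": best["perplexity_dev"],
        "conjugation_top1_correct": conjugation_top1_correct,
    }
    # Record the interpreter actually used; add numpy.__version__ the same way.
    filled["versions"] = dict(manifest.get("versions", {}), python=platform.python_version())
    return filled


# Placeholder numbers for illustration only, not real results.
rows = [
    {"n": 1, "perplexity_dev": 30.0},
    {"n": 2, "perplexity_dev": 12.5},
    {"n": 3, "perplexity_dev": 14.0},
]
manifest = {"experiment": "14-ngram-baseline", "results_summary": {}, "versions": {"python": "3.11.x"}}
filled = fill_manifest(manifest, rows, conjugation_top1_correct=True)
print(filled["results_summary"])
```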
Constraints
- No PyTorch. NumPy + standard library only.
- No external n-gram libraries (no nltk, kenlm). Build the dict yourself; this is the point of the lab.
- Deterministic. Seed every shuffle. Document in manifest.
- Smoothing locked to \(\alpha = 0.01\) for the headline number. You may sweep \(\alpha\) as an extension, but the committed baseline uses \(\alpha = 0.01\).
Stop conditions
Done when:
- The directory has all six files.
- results.json shows perplexity numbers for all 7 values of \(n\).
- perplexity_vs_n.png is visible and labeled.
- conjugation_completion.json shows the top-5 for at least 3 probes.
- README.md answers the four interpretation questions.
Pitfalls (read before debugging)
- Infinite perplexity. Either you forgot smoothing, or your tokenizer is producing <unk> for evaluation tokens that didn't appear in training. Check both.
- Perplexity LOWER on dev than train. Either a bug or a tokenization mismatch. Train should be easier (the model memorized it).
- Top-1 prediction for he ___ is <eos>. This means (<sep>, he) is most often followed by end-of-sequence in your training set, possibly because of how you formatted rows. Check the prompt format.
- 3-gram does WORSE than 1-gram. Possible if your corpus is so small that most trigrams are unseen and the smoothing dominates. Check len(set(trigram_contexts_seen)); it should be in the hundreds for a healthy corpus.
- Spanish perplexity is much higher than English. Likely a tokenization artifact: Spanish characters (á, é, ñ) may produce more subword pieces, inflating the per-token perplexity. Document but don't fix.
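When smoothing seems to dominate, it helps to measure how many dev contexts were ever seen in training. The sketch below is a hypothetical diagnostic, not part of the blueprint; it assumes the same <bos> pre-padding as the model and a placeholder <bos> id of 0.

```python
def context_coverage(train_sequences, dev_sequences, n, bos_id=0):
    """Fraction of dev positions whose (n-1)-token context occurs in train."""

    def contexts(sequences):
        for seq in sequences:
            padded = [bos_id] * (n - 1) + list(seq)
            for i in range(n - 1, len(padded)):
                yield tuple(padded[i - (n - 1):i])

    seen = set(contexts(train_sequences))
    dev_contexts = list(contexts(dev_sequences))
    hits = sum(ctx in seen for ctx in dev_contexts)
    return hits / len(dev_contexts)


train = [[5, 6, 7], [5, 6, 8]]
dev = [[5, 6, 9], [9, 9, 9]]
coverage_3 = context_coverage(train, dev, n=3)
coverage_1 = context_coverage(train, dev, n=1)
print(coverage_3, coverage_1)
```

Coverage that falls sharply between consecutive values of \(n\) points to sparsity rather than a bug as the reason a higher-order model scores worse.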
When to consult solutions/
After you have committed all six files and answered the four interpretation questions. The solution at solutions/01-ngram-baseline-ref.md compares your numbers to a reference and discusses common edge cases.
Next lab: lab/02-rnn-by-hand.md.