
Lab 01: N-gram baseline

Goal: commit the perplexity number that the Mini-GPT in Phases 17–18 has to beat.

Estimated time: 90–120 minutes.

Prereq: lab 00 (tokenized splits) committed.

What you produce

A directory experiments/14-ngram-baseline/ containing:

ngram.py: your n-gram LM implementation (per the src/minimodel/sequence_baselines/ngram.py blueprint).
train.py: fits an n-gram model for each \(n \in \{1, 2, 3, 4, 5, 6, 7\}\) on the train split and evaluates perplexity on train and dev.
results.json: train and dev perplexity numbers for \(n \in \{1, 2, 3, 4, 5, 6, 7\}\).
perplexity_vs_n.png: plot of train and dev perplexity vs \(n\).
manifest.json: per LYNX_CORTEX.md §5.
README.md (2–3 paragraphs): what the numbers mean.

The model

Implement NGramLM per the blueprint in src/minimodel/BLUEPRINT.md §1. Public API:

class NGramLM:
    def __init__(self, n: int, alpha: float = 0.01, *, vocab_size: int): ...  # vocab_size is keyword-only: a required parameter cannot follow a default positionally
    def fit(self, sequences: list[list[int]]) -> None: ...
    def logp_token(self, context: tuple[int, ...], token: int) -> float: ...
    def logp_sequence(self, sequence: list[int]) -> float: ...
    def perplexity(self, sequences: list[list[int]]) -> float: ...

Backing store: a dict[tuple[int, ...], collections.Counter] mapping context (last \(n-1\) tokens) to a counter over the next token.
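
As a minimal sketch of that backing store (not the blueprint's code), the counting pass could look like the following. The function name count_ngrams and the id BOS = 0 are assumptions for illustration; use the <bos> id from your lab 00 tokenizer.

```python
from collections import Counter, defaultdict

BOS = 0  # hypothetical id for <bos>; replace with the id from your tokenizer


def count_ngrams(sequences, n, bos_id=BOS):
    """Map each (n-1)-token context to a Counter over the next token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = [bos_id] * (n - 1) + list(seq)
        # Position i in `padded` is predicted from the n-1 tokens before it.
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - (n - 1):i])
            counts[context][padded[i]] += 1
    return counts


counts = count_ngrams([[5, 6, 7], [5, 6, 8]], n=3)
print(dict(counts[(5, 6)]))  # next-token counts after the context (5, 6)
```

For \(n = 1\) the context is the empty tuple, so the same code yields plain unigram counts under the key ().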

TODOs

Block A: implement the model

Pre-pad each sequence with \((n-1)\) copies of the <bos> token before counting.
Build a dict[context_tuple -> Counter[next_token]] from the train split.
Also store context_totals: dict[context_tuple -> int] for \(c(\text{context})\) — convenience.
Implement logp_token(context, token) using add-\(\alpha\) smoothing.
Implement perplexity(sequences) per the formula in theory/01-ngram-models.md.
Sanity test: train on a tiny toy corpus (5 sequences), confirm that exp(logp_token) sums to 1 over the vocabulary for both a seen and an unseen context, and confirm that perplexity matches a hand computation. Run this unit test before evaluating.
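
The smoothing and perplexity steps above can be sketched as two small functions. This assumes the standard add-\(\alpha\) estimate \(P(w \mid \text{context}) = \frac{c(\text{context}, w) + \alpha}{c(\text{context}) + \alpha |V|}\) and the usual perplexity definition \(\exp\!\big(-\frac{1}{N}\sum \log p\big)\) over the \(N\) predicted tokens; if theory/01-ngram-models.md defines it differently, follow that file. The function names are illustrative, and the total is recomputed with sum() here instead of being read from context_totals.

```python
import math
from collections import Counter


def logp_add_alpha(next_counts, token, alpha, vocab_size):
    """Natural-log probability of `token` under add-alpha smoothing."""
    total = sum(next_counts.values())
    return math.log((next_counts[token] + alpha) / (total + alpha * vocab_size))


def perplexity_from_logps(logps):
    """exp of the negative mean log-probability over predicted tokens."""
    return math.exp(-sum(logps) / len(logps))


# Toy check: one context seen 3 times, vocabulary of 4 token ids.
seen = Counter({2: 2, 3: 1})
probs = [math.exp(logp_add_alpha(seen, t, alpha=0.01, vocab_size=4)) for t in range(4)]
assert abs(sum(probs) - 1.0) < 1e-9  # smoothed distribution must normalize

# An unseen context has an empty Counter, so every token gets 1 / vocab_size.
uniform = [logp_add_alpha(Counter(), t, 0.01, 4) for t in range(4)]
print(perplexity_from_logps(uniform))  # close to 4.0, the vocabulary size
```

The second check is a useful reference point: a model that always falls back to the uniform guess has perplexity equal to \(|V|\).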

Block B: fit and evaluate

For each \(n \in \{1, 2, 3, 4, 5, 6, 7\}\):

Fit on train_ids.
Compute perplexity on both train_ids and dev_ids.
Record (n, perplexity_train, perplexity_dev) in results.json.

Plot perplexity_vs_n.png with \(n\) on x-axis, perplexity on y-axis (log scale). Plot both train and dev curves.
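
A possible shape for the sweep loop is sketched below. The helper run_sweep, the callable evaluate(n) returning (perplexity_train, perplexity_dev), and the list-of-records layout of results.json are all assumptions; adapt them to whatever schema your manifest convention expects.

```python
import json


def run_sweep(evaluate, n_values, out_path):
    """Call evaluate(n) for each n and write one record per n to out_path."""
    records = []
    for n in n_values:
        ppl_train, ppl_dev = evaluate(n)
        records.append(
            {"n": n, "perplexity_train": ppl_train, "perplexity_dev": ppl_dev}
        )
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    return records
```

In train.py, evaluate would construct NGramLM for the given \(n\), call fit on train_ids, and return the two perplexity values.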

Block C: the conjugation-completion probe

Beyond perplexity, the headline qualitative result is the conjugation-completion task:

For the 3-gram and 5-gram, compute the top-5 predicted tokens given the prompt <bos> <bos> I work , you work , he.
Print and commit the ranked list with probabilities.
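Expected result: top-1 should be works for the 5-gram, whose context (you work , he) includes the verb. If each word in the prompt is one token, the 3-gram conditions only on the last two tokens (, he): it sees the pronoun but not which verb or tense came before, so it may rank a different verb form first. Check empirically.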

Repeat for two other prompts:

<bos> <bos> I worked , you worked , he → expected worked.
<bos> <bos> I will work , you will work , he → expected will.

Commit results as conjugation_completion.json.
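
Ranking the next token for a probe amounts to taking the Counter for the last \(n-1\) prompt tokens and sorting the smoothed probabilities. The following is a sketch under that assumption; top_k is a hypothetical helper, and the token ids and counts are made up for illustration.

```python
from collections import Counter


def top_k(next_counts, alpha, vocab_size, k=5):
    """Return the k most probable (token_id, probability) pairs under add-alpha."""
    denom = sum(next_counts.values()) + alpha * vocab_size
    scored = [((next_counts[t] + alpha) / denom, t) for t in range(vocab_size)]
    # Sort by probability descending, then by token id ascending for determinism.
    scored.sort(key=lambda pt: (-pt[0], pt[1]))
    return [(t, p) for p, t in scored[:k]]


# Hypothetical counts for the context that ends the probe prompt.
after_he = Counter({11: 7, 12: 2, 13: 1})
for token_id, prob in top_k(after_he, alpha=0.01, vocab_size=20, k=3):
    print(token_id, round(prob, 4))
```

Breaking ties by token id keeps the committed ranking stable across runs, which matters because all unseen tokens share the same smoothed probability.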

Block D: interpret

In README.md:

State the baseline. "Dev perplexity for \(n=5\) is X. This is the number Phase 17/18 must beat."
Comment on the U-curve. Perplexity-vs-n typically decreases with \(n\) up to the point where smoothing dominates, then increases: at large \(n\) most dev contexts are unseen, so \(c(\text{context}) = 0\) and add-\(\alpha\) assigns every token probability \(\alpha / (\alpha |V|) = 1/|V|\), a uniform guess. Where is the minimum for our corpus? What does that say about the locality of subject-pronoun → verb-form dependency?
Conjugation-completion observation. Did the 3-gram correctly predict works? If yes: state that the task is local (window 3 is enough). If no: state that the task requires longer context — bridges to Phase 15.
Bilingual question (optional). Did the n-gram learn alignment between I work / yo → trabajo? Compute the perplexity of just the Spanish-side tokens conditioned on the English-side prefix. Compare to unconditional Spanish perplexity. The delta is the model's "translation skill".
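
For the optional bilingual question, one way to restrict perplexity to the Spanish side is to score the whole row and then average only over the positions from the first Spanish token onward. The sketch assumes a hypothetical list token_logps of natural-log probabilities (one logp_token call per position) and a hypothetical index start marking the first Spanish token; how rows are laid out depends on your lab 00 format.

```python
import math


def span_perplexity(token_logps, start):
    """Perplexity computed over token_logps[start:] only."""
    span = token_logps[start:]
    return math.exp(-sum(span) / len(span))


# Made-up log-probabilities: two English-side tokens, then two Spanish-side tokens.
row = [math.log(0.5), math.log(0.5), math.log(0.25), math.log(0.25)]
print(span_perplexity(row, start=2))
```

Keep in mind that an n-gram conditions only on the previous \(n-1\) tokens, so the English prefix can influence at most the first \(n-1\) Spanish-side predictions.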

Block E: manifest

{
  "experiment": "14-ngram-baseline",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "config": {
    "n_values": [1, 2, 3, 4, 5, 6, 7],
    "alpha": 0.01,
    "tokenization": "from lab 00 manifest"
  },
  "results_summary": {
    "best_n": null,
    "best_dev_perplexity": null,
    "conjugation_top1_correct": null
  },
  "versions": {"python": "3.11.x", "numpy": "X.Y.Z"}
}

Fill nulls.
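
Filling the nulls can be scripted so the manifest always agrees with results.json. This is a sketch, not a required tool: fill_manifest is a hypothetical helper, it assumes results is a list of records with "n" and "perplexity_dev" keys, and it passes conjugation_top1_correct through unchanged because the expected type of that field is not specified here.

```python
import json
import platform
from importlib import metadata


def fill_manifest(manifest, results, top1_correct):
    """Return a copy of manifest with the summary and version fields filled in."""
    best = min(results, key=lambda r: r["perplexity_dev"])
    filled = json.loads(json.dumps(manifest))  # cheap deep copy of JSON data
    filled["results_summary"] = {
        "best_n": best["n"],
        "best_dev_perplexity": best["perplexity_dev"],
        "conjugation_top1_correct": top1_correct,
    }
    try:
        numpy_version = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        numpy_version = None  # NumPy is not installed in this environment
    filled["versions"] = {
        "python": platform.python_version(),
        "numpy": numpy_version,
    }
    return filled
```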

Constraints

No PyTorch. NumPy + standard library only.
No external n-gram libraries (no nltk, kenlm). Build the dict yourself; this is the point of the lab.
Deterministic. Seed every shuffle. Document in manifest.
Smoothing locked to \(\alpha = 0.01\) for the headline number. You may sweep \(\alpha\) as an extension, but the committed baseline uses \(\alpha = 0.01\).

Stop conditions

Done when:

The directory has all six files listed under "What you produce".
results.json shows perplexity numbers for all 7 values of \(n\).
perplexity_vs_n.png is visible and labeled.
conjugation_completion.json shows the top-5 for at least 3 probes.
README.md answers the four interpretation questions.

Pitfalls (read before debugging)

Infinite perplexity. Either you forgot smoothing, or your tokenizer is producing <unk> for evaluation tokens that didn't appear in training. Check both.
Perplexity LOWER on dev than train. Either a bug or a tokenization mismatch. Train perplexity should be the lower of the two, because the model's counts come from that split.
Top-1 prediction for he ___ is <eos>. This means (<sep>, he) is most often followed by end-of-sequence in your training set — possibly because of how you formatted rows. Check the prompt format.
3-gram does WORSE than 1-gram. Possible if your corpus is so small that most trigrams are unseen and the smoothing dominates. Check len(set(trigram_contexts_seen)) — should be in the hundreds for a healthy corpus.
Spanish perplexity is much higher than English. Likely a tokenization artifact — Spanish characters (á, é, ñ) may produce more subword pieces, inflating the per-token perplexity. Document but don't fix.
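
When a higher-order model underperforms a lower-order one, a quick diagnostic is the fraction of evaluation positions whose context never appeared in training. The helper below is a hypothetical sketch (the name unseen_context_rate and the default bos_id=0 are assumptions); it uses the same padding convention as Block A.

```python
def unseen_context_rate(train_seqs, eval_seqs, n, bos_id=0):
    """Fraction of eval positions whose (n-1)-token context never occurs in train."""

    def contexts(seqs):
        for seq in seqs:
            padded = [bos_id] * (n - 1) + list(seq)
            for i in range(n - 1, len(padded)):
                yield tuple(padded[i - (n - 1):i])

    seen = set(contexts(train_seqs))
    eval_contexts = list(contexts(eval_seqs))
    misses = sum(1 for c in eval_contexts if c not in seen)
    return misses / len(eval_contexts)
```

A rate that climbs steeply with \(n\) indicates that the smoothed uniform fallback, not the counts, is driving the perplexity at that order.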

When to consult `solutions/`

After you have committed all six files and answered the four interpretation questions. The solution at solutions/01-ngram-baseline-ref.md compares your numbers to a reference and discusses common edge cases.

Next lab: lab/02-rnn-by-hand.md.