Lab 01 — N-gram baseline

Goal: commit the perplexity number that the Mini-GPT in Phases 17–18 has to beat.

Estimated time: 90–120 minutes.

Prereq: lab 00 (tokenized splits) committed.


What you produce

A directory experiments/14-ngram-baseline/ containing:

  • ngram.py — your n-gram LM implementation (per src/minimodel/sequence_baselines/ngram.py blueprint).
  • train.py — fits an n-gram model for each \(n \in \{1, 2, 3, 4, 5, 6, 7\}\) on the train split and evaluates perplexity on train and dev.
  • results.json — perplexity numbers for \(n \in \{1, 2, 3, 4, 5, 6, 7\}\) on dev.
  • perplexity_vs_n.png — plot of dev perplexity vs \(n\).
  • manifest.json — per LYNX_CORTEX.md §5.
  • README.md (2–3 paragraphs) — what the numbers mean.

The model

Implement NGramLM per the blueprint in src/minimodel/BLUEPRINT.md §1. Public API:

class NGramLM:
    # vocab_size is keyword-only so that a required parameter can follow alpha's default.
    def __init__(self, n: int, alpha: float = 0.01, *, vocab_size: int): ...
    def fit(self, sequences: list[list[int]]) -> None: ...
    def logp_token(self, context: tuple[int, ...], token: int) -> float: ...
    def logp_sequence(self, sequence: list[int]) -> float: ...
    def perplexity(self, sequences: list[list[int]]) -> float: ...

Backing store: a dict[tuple[int, ...], collections.Counter] mapping context (last \(n-1\) tokens) to a counter over the next token.
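A minimal sketch of how this backing store might be filled, assuming sequences are lists of integer token ids. The helper name `count_ngrams` and the value `BOS_ID = 0` are illustrative assumptions, not part of the blueprint; take the real `<bos>` id from your lab 00 vocabulary.

```python
from collections import Counter, defaultdict

BOS_ID = 0  # assumption: id of <bos>; use the id from the lab 00 vocabulary


def count_ngrams(sequences: list[list[int]], n: int) -> dict[tuple[int, ...], Counter]:
    """Map each context (the previous n-1 tokens) to a Counter over next tokens."""
    counts: dict[tuple[int, ...], Counter] = defaultdict(Counter)
    for seq in sequences:
        # Pre-pad so that the first real token also has a full-length context.
        padded = [BOS_ID] * (n - 1) + list(seq)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - (n - 1):i])  # empty tuple when n == 1
            counts[context][padded[i]] += 1
    return counts


toy = [[5, 6, 7], [5, 6, 8]]
bigram_counts = count_ngrams(toy, n=2)
print(bigram_counts[(6,)])  # Counter({7: 1, 8: 1})
```

For a unigram model the only context is the empty tuple, so the same code path covers \(n = 1\) without a special case.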

TODOs

Block A — implement the model

  • Pre-pad each sequence with \((n-1)\) copies of the <bos> token before counting.
  • Build a dict[context_tuple -> Counter[next_token]] from the train split.
  • Also store context_totals: dict[context_tuple -> int] for \(c(\text{context})\) — convenience.
  • Implement logp_token(context, token) using add-\(\alpha\) smoothing.
  • Implement perplexity(sequences) per the formula in theory/01-ngram-models.md.
  • Sanity test: train on a tiny toy corpus (5 sequences), confirm that the smoothed probabilities for a fixed context sum to 1 over the vocabulary, and check perplexity against a value computed by hand. Run this unit test before evaluating.
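The smoothing and perplexity steps above can be sketched as free functions over a counts dict. This is a sketch under stated assumptions, not the blueprint's implementation: it takes perplexity to be the exponential of the negative mean per-token log-probability (natural log), and it scores every token of each sequence. If theory/01-ngram-models.md defines the formula differently, follow that file. `BOS_ID = 0` is a hypothetical id.

```python
import math
from collections import Counter

BOS_ID = 0  # assumption: id of <bos>


def logp_token(counts, context, token, alpha, vocab_size):
    """Add-alpha estimate: (c(context, token) + alpha) / (c(context) + alpha * |V|)."""
    next_counts = counts.get(context, Counter())
    numerator = next_counts[token] + alpha
    denominator = sum(next_counts.values()) + alpha * vocab_size
    return math.log(numerator / denominator)


def perplexity(counts, sequences, n, alpha, vocab_size):
    """exp of the negative mean log-probability over all scored tokens."""
    total_logp, total_tokens = 0.0, 0
    for seq in sequences:
        padded = [BOS_ID] * (n - 1) + list(seq)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - (n - 1):i])
            total_logp += logp_token(counts, context, padded[i], alpha, vocab_size)
            total_tokens += 1
    return math.exp(-total_logp / total_tokens)


# Toy bigram counts over a 4-token vocabulary with ids 0..3.
counts = {(0,): Counter({1: 2}), (1,): Counter({2: 1, 3: 1})}
probs = [math.exp(logp_token(counts, (1,), t, 0.01, 4)) for t in range(4)]
assert abs(sum(probs) - 1.0) < 1e-9  # a smoothed distribution must sum to 1
# An unseen context falls back to the uniform distribution 1 / |V|.
assert abs(math.exp(logp_token(counts, (3,), 2, 0.01, 4)) - 0.25) < 1e-9
```

A useful hand-checkable case: with no counts at all, every prediction is uniform, so the perplexity of any non-empty corpus equals the vocabulary size.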

Block B — fit and evaluate

For each \(n \in \{1, 2, 3, 4, 5, 6, 7\}\):

  • Fit on train_ids.
  • Compute perplexity on dev_ids.
  • Record (n, perplexity_train, perplexity_dev) in results.json.

Plot perplexity_vs_n.png with \(n\) on x-axis, perplexity on y-axis (log scale). Plot both train and dev curves.
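The sweep in Block B might be organized as below. This is only a sketch: the row schema for results.json (a list of objects with keys `n`, `perplexity_train`, `perplexity_dev`) is an assumption, and `fake_score` is a placeholder that returns made-up numbers so the example runs. In train.py it would be replaced by a function that fits an `NGramLM` on train_ids and returns the two perplexities.

```python
import json
from typing import Callable

N_VALUES = [1, 2, 3, 4, 5, 6, 7]


def sweep(score: Callable[[int], tuple[float, float]], n_values=N_VALUES) -> list[dict]:
    """Call score(n) -> (train_ppl, dev_ppl) for each n and collect result rows."""
    rows = []
    for n in n_values:
        ppl_train, ppl_dev = score(n)
        rows.append({"n": n, "perplexity_train": ppl_train, "perplexity_dev": ppl_dev})
    return rows


def fake_score(n: int) -> tuple[float, float]:
    # Placeholder numbers only; replace with a real fit + evaluate step.
    return (100.0 / n, 100.0 / n + n)


rows = sweep(fake_score)
payload = json.dumps(rows, indent=2)  # train.py would write this string to results.json
```

Keeping the scoring step behind a single callable makes it easy to reuse the same rows for both results.json and the plot.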

Block C — the conjugation-completion probe

Beyond perplexity, the headline qualitative result is the conjugation-completion task:

  • For the 3-gram and 5-gram, compute the top-5 predicted tokens given the prompt <bos> <bos> I work , you work , he.
  • Print and commit the ranked list with probabilities.
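  • Expected result: top-1 should be works for the 5-gram, whose four-token context reaches back to the verb in "you work". The 3-gram conditions only on the two tokens before the blank (the separator and he), so it sees the pronoun but not the verb stem and may rank a different verb form first; check empirically.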

Repeat for two other prompts:

  • <bos> <bos> I worked , you worked , he → expected worked.
  • <bos> <bos> I will work , you will work , he → expected will.

Commit results as conjugation_completion.json.
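A sketch of the top-5 lookup used by the probe, assuming the counts dict from Block A and add-\(\alpha\) smoothing. The helper names `prompt_context` and `top_k`, and the token ids in the toy example, are hypothetical. `prompt_context` assumes the prompt already holds at least \(n-1\) tokens, which the `<bos>` padding is meant to ensure.

```python
from collections import Counter


def prompt_context(prompt_ids: list[int], n: int) -> tuple[int, ...]:
    """Last n-1 tokens of the prompt (empty tuple for a unigram model)."""
    return tuple(prompt_ids[len(prompt_ids) - (n - 1):])


def top_k(counts, context, alpha, vocab_size, k=5):
    """Return the k most probable next tokens as (token_id, probability) pairs."""
    next_counts = counts.get(context, Counter())
    denominator = sum(next_counts.values()) + alpha * vocab_size
    probs = [((next_counts[t] + alpha) / denominator, t) for t in range(vocab_size)]
    # Sort by probability descending, breaking ties by token id for determinism.
    probs.sort(key=lambda pt: (-pt[0], pt[1]))
    return [(t, p) for p, t in probs[:k]]


# Toy trigram counts for one context, using made-up ids 3 and 4.
counts = {(3, 4): Counter({7: 5, 6: 2})}
context = prompt_context([0, 0, 9, 3, 4], n=3)
ranked = top_k(counts, context, alpha=0.01, vocab_size=10, k=3)
print(ranked[0][0])  # 7
```

Map the token ids back to strings with the lab 00 vocabulary before writing conjugation_completion.json, so the committed file is readable without the tokenizer.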

Block D — interpret

In README.md:

  1. State the baseline. "Dev perplexity for \(n=5\) is X. This is the number Phase 17/18 must beat."
  2. Comment on the U-curve. Perplexity-vs-n typically decreases with \(n\) up to the point where smoothing dominates, then increases: once most dev contexts are unseen or rare in train, the \(\alpha |V|\) term dominates the denominator and the prediction approaches the uniform \(1/|V|\). Where is the minimum for our corpus? What does that say about the locality of subject-pronoun → verb-form dependency?
  3. Conjugation-completion observation. Did the 3-gram correctly predict works? If yes: state that the task is local (window 3 is enough). If no: state that the task requires longer context — bridges to Phase 15.
  4. Bilingual question (optional). Did the n-gram learn alignment between I work / yo trabajo? Compute the perplexity of just the Spanish-side tokens conditioned on the English-side prefix. Compare to unconditional Spanish perplexity. The delta is the model's "translation skill".
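One way to see why the curve in item 2 turns upward, sketched under the add-\(\alpha\) estimate used in this lab and assuming perplexity is the exponential of the mean negative log-probability. The symbols \(u(n)\) and \(H_{\text{seen}}(n)\) are notation introduced for this note, not taken from the theory file.

\[
P_\alpha(w \mid h) = \frac{c(h, w) + \alpha}{c(h) + \alpha |V|},
\qquad
c(h) = 0 \;\Rightarrow\; P_\alpha(w \mid h) = \frac{\alpha}{\alpha |V|} = \frac{1}{|V|}.
\]

If a fraction \(u(n)\) of dev tokens has a context that never occurs in train, and \(H_{\text{seen}}(n)\) is the mean negative log-probability of the remaining tokens, then

\[
\log \mathrm{PPL}(n) = u(n) \log |V| + \bigl(1 - u(n)\bigr) H_{\text{seen}}(n).
\]

As \(n\) grows, \(u(n)\) tends to grow as well, which pulls the perplexity toward \(|V|\), the perplexity of a uniform guess.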

Block E — manifest

{
  "experiment": "14-ngram-baseline",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "config": {
    "n_values": [1, 2, 3, 4, 5, 6, 7],
    "alpha": 0.01,
    "tokenization": "from lab 00 manifest"
  },
  "results_summary": {
    "best_n": null,
    "best_dev_perplexity": null,
    "conjugation_top1_correct": null
  },
  "versions": {"python": "3.11.x", "numpy": "X.Y.Z"}
}

Fill in the nulls, the date, and the version placeholders once the results are in.
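The placeholder fields could be filled programmatically, as in the sketch below. It assumes that `best_n` means the \(n\) with the lowest dev perplexity, that `conjugation_top1_correct` is a single boolean, and that result rows use the keys `n` and `perplexity_dev`. None of these is fixed by this handout, so adjust them to whatever LYNX_CORTEX.md §5 requires. `fill_manifest` and `package_version` are hypothetical helper names.

```python
import platform
from datetime import date
from importlib import metadata


def package_version(name: str) -> str:
    """Installed version of a package, or 'not installed' if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def fill_manifest(manifest: dict, rows: list[dict], top1_correct: bool) -> dict:
    """Return a copy of the manifest with date, summary, and versions filled in."""
    best = min(rows, key=lambda r: r["perplexity_dev"])
    filled = dict(manifest)
    filled["date"] = date.today().isoformat()  # YYYY-MM-DD
    filled["results_summary"] = {
        "best_n": best["n"],
        "best_dev_perplexity": best["perplexity_dev"],
        "conjugation_top1_correct": top1_correct,
    }
    filled["versions"] = {
        "python": platform.python_version(),
        "numpy": package_version("numpy"),
    }
    return filled


# Placeholder rows so the sketch runs; real rows come from results.json.
example_rows = [{"n": 1, "perplexity_dev": 12.0}, {"n": 3, "perplexity_dev": 8.0}]
manifest = fill_manifest({"experiment": "14-ngram-baseline", "seed": 42}, example_rows, True)
```

Reading versions from the running environment avoids committing a manifest that disagrees with the interpreter that actually produced the numbers.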

Constraints

  • No PyTorch. NumPy + standard library only.
  • No external n-gram libraries (no nltk, kenlm). Build the dict yourself; this is the point of the lab.
  • Deterministic. Seed every shuffle. Document in manifest.
  • Smoothing locked to \(\alpha = 0.01\) for the headline number. You may sweep \(\alpha\) as an extension, but the committed baseline uses \(\alpha = 0.01\).

Stop conditions

Done when:

  1. The directory has all six files listed under "What you produce".
  2. results.json shows perplexity numbers for all 7 values of \(n\).
  3. perplexity_vs_n.png is visible and labeled.
  4. conjugation_completion.json shows the top-5 for at least 3 probes.
  5. README.md answers the four interpretation questions.

Pitfalls (read before debugging)

  • Infinite perplexity. Either you forgot smoothing, or your tokenizer is producing <unk> for evaluation tokens that didn't appear in training. Check both.
  • Perplexity LOWER on dev than train. Either a bug or a tokenization mismatch. Train should be easier (the model memorized it).
  • Top-1 prediction for he ___ is <eos>. This means (<sep>, he) is most often followed by end-of-sequence in your training set — possibly because of how you formatted rows. Check the prompt format.
  • 3-gram does WORSE than 1-gram. Possible if your corpus is so small that most trigrams are unseen and the smoothing dominates. Check len(set(trigram_contexts_seen)) — should be in the hundreds for a healthy corpus.
  • Spanish perplexity is much higher than English. Likely a tokenization artifact — Spanish characters (á, é, ñ) may produce more subword pieces, inflating the per-token perplexity. Document but don't fix.
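When debugging the sparsity pitfalls above, it can help to measure how often a dev context was never seen in train. The function below is a diagnostic sketch, not part of the required deliverables. The name `unseen_context_rate` and the default `bos_id=0` are assumptions.

```python
def unseen_context_rate(train_seqs, dev_seqs, n, bos_id=0):
    """Fraction of dev positions whose (n-1)-token context never occurs in train."""

    def contexts(seqs):
        for seq in seqs:
            padded = [bos_id] * (n - 1) + list(seq)
            for i in range(n - 1, len(padded)):
                yield tuple(padded[i - (n - 1):i])

    seen = set(contexts(train_seqs))
    dev_contexts = list(contexts(dev_seqs))
    unseen = sum(1 for c in dev_contexts if c not in seen)
    return unseen / len(dev_contexts)


train = [[1, 2, 3]]
dev = [[1, 2, 4], [2, 3, 1]]
rate = unseen_context_rate(train, dev, n=3)  # 2 of the 6 dev contexts are unseen
```

A rate that climbs quickly with \(n\) suggests that smoothing, rather than the counts, is driving the predictions at that order.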

When to consult solutions/

After you have committed all six files and answered the four interpretation questions. The solution at solutions/01-ngram-baseline-ref.md compares your numbers to a reference and discusses common edge cases.


Next lab: lab/02-rnn-by-hand.md.