Skip to content

English · Español

Lab 00 — Wire up Phase 11 tokenizer + Phase 12 corpus as inputs

Goal: produce a clean (train_ids, dev_ids, test_ids) triple that every other Phase 14 lab consumes. One-shot.

Estimated time: 20–40 minutes.

Prereq: Phase 11 (BPE tokenizer) and Phase 12 (verb-grammar corpus) both committed.


What you produce

A single committed file: experiments/14-tokenize/dataset.py (or .npz — either is fine) that exposes:

train_ids: list[list[int]]   # one inner list per row of the train split
dev_ids:   list[list[int]]   # ditto for dev
test_ids:  list[list[int]]   # ditto for test
vocab:     dict[str, int]    # token -> id mapping
inv_vocab: dict[int, str]    # id -> token (for printing)

Plus a manifest.json documenting the tokenization choices.

TODOs

Block A — load the corpus

  • Read data/processed/train.jsonl, dev.jsonl, test.jsonl from Phase 12.
  • Confirm each row is a (english_form, spanish_form, metadata) triple. The exact field names are set in Phase 12's corpus_spec.md; consult it.
  • Concatenate each row into a single string using the format Phase 12 settled on. Default: f"{english} / {spanish}" followed by an <eos> token.

Block B — tokenize

  • Load the BPE tokenizer from Phase 11 (src/minitokenizer/bpe.py + the saved vocab/merges files).
  • Encode every row to a list of token IDs.
  • Prepend <bos> and append <eos> to every row. (If Phase 11 didn't reserve these in the vocab, add them now and document the change in manifest.json.)
  • Verify: decoding a tokenized row reproduces the original string exactly. Spot-check 5 random rows.

Block C — alternative: word-level fallback

If the BPE tokenizer from Phase 11 produces too many subword tokens for the verb-grammar corpus (more than ~80 unique tokens), consider a word-level tokenization as a Phase-14-only fallback. Verb-grammar has so few unique forms (~600) that one-token-per-word is reasonable. This is a phase-open decision — open question (b) of PHASE_14_PLAN.md.

  • Run a token-count diagnostic: len(set(t for row in all_rows for t in tokenize(row))). If > 80, flag and discuss before proceeding.
  • If using word-level, save the word-vocab mapping separately as vocab_wordlevel.json. Do not overwrite Phase 11's BPE vocab.

Block D — split sanity checks

  • Print per-split token counts and unique-token counts. Expected (rough): train ~5000 tokens, dev ~600 tokens, test ~600 tokens.
  • Confirm no row is duplicated across splits (a Phase 12 concern, but re-verify here — set intersection between train and dev should be 0).
  • Confirm both English and Spanish forms are present in every split. (For some splits, Phase 12 may have chosen English-only or Spanish-only — if so, document.)

Block E — manifest

manifest.json schema:

{
  "experiment": "14-tokenize",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "upstream": {
    "phase_11_tokenizer": "path/to/bpe/vocab",
    "phase_12_corpus": "path/to/data/processed/"
  },
  "tokenization": {
    "scheme": "bpe" | "wordlevel",
    "vocab_size": null,
    "added_special_tokens": ["<bos>", "<eos>", "<sep>"]
  },
  "splits": {
    "train_rows": null,
    "dev_rows": null,
    "test_rows": null,
    "train_tokens_total": null
  },
  "versions": {"python": "3.11.x", "numpy": "X.Y.Z"}
}

Fill nulls after running.

Constraints

  • No PyTorch. Per anti-goal §10. NumPy + standard library only.
  • Reproducible. Set np.random.seed(42) before any shuffle. Document in manifest.
  • No leakage. Verify train/dev/test are disjoint by row.

Stop conditions

You're done when:

  1. dataset.py (or .npz) loads in <1 second.
  2. The print statement print(len(train_ids), len(dev_ids), len(test_ids)) works.
  3. manifest.json exists with all null values replaced by numbers.
  4. You can decode a random train_ids[i] and read back the original (english / spanish) row.

Pitfalls

  • <bos> / <eos> doubling. If you accidentally prepend <bos> twice (once from Phase 11, once here), every n-gram statistic is off by one. Check by decoding a single row and counting <bos> occurrences.
  • Vocab drift. If you re-encode a Phase 12 row that contains a character or substring the Phase 11 BPE didn't see, you'll hit <unk>. Count <unk> occurrences; should be 0 if Phase 12 corpus was used to train the BPE.
  • Split shuffling without seed. If Phase 12 didn't fix the random seed for its splits, your splits will differ on every run. Either fix the seed here, or check Phase 12's manifest for the seed it used.

When to consult solutions/

After committing dataset.py and manifest.json. The solution at solutions/00-tokenize-corpus-ref.md (written at phase open) compares your tokenization choices and split sanity numbers.


Next lab: lab/01-ngram-baseline.md.