Lab 00 — Build the batcher and the mask

Goal: assemble the data pipeline that turns the Phase-12 verb-conjugation JSONL shards into batched (input, target, loss_mask, attn_pad_mask) quadruples ready to feed MiniGPT.

Estimated time: 90–120 minutes.

Prereq: Phase 11 BPE encoder + Phase 12 corpus JSONL committed.


What you produce

A new module file + tests:

  • src/minitrain/data.py — corpus loader, deterministic shuffle, batcher, mask builder, held-out split helper.
  • tests/minitrain/test_data.py — schema, shape, mask, and split tests.

A small experiment:

  • experiments/18-batch-sanity/manifest.json + a printed sample batch with shapes and one decoded sentence per row.

Background you must have read

  • theory/01-batching-loss-mask.md — the two masks, the causal shift, token-mean reduction, held-out split policy.
  • Phase 12 corpus_spec.md — JSONL schema for verb-conjugation entries.

TODOs

Block A — corpus loader

Implement:

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class VerbExample:
    id: str                # e.g. "work.present.1sg"
    en: str                # "I work"
    es: str                # "yo trabajo"
    verb: str              # "work"
    tense: str             # "present_simple"
    person: str            # "1sg"
    is_regular: bool

def load_corpus(path: Path) -> list[VerbExample]: ...
  • Parses data/processed/{train,val,test}.jsonl (Phase 12 output).
  • Validates: every row has all required fields; verb ∈ {20 verbs}, tense ∈ {5 tenses}, person ∈ {1sg, 2sg, 3sg}.
  • Rejects unknown values with a clear error citing the offending row id.
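The validation rules above can be sketched as follows. This is a minimal sketch, not the reference solution: the lab does not list the 20 verbs or 5 tenses here, so the sketch takes them as extra `verbs` and `tenses` arguments (an assumption; the real `load_corpus` takes only `path` and would read those sets from the Phase 12 spec). Skipping blank lines is also an assumption.

```python
import json
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass(frozen=True)
class VerbExample:
    id: str
    en: str
    es: str
    verb: str
    tense: str
    person: str
    is_regular: bool


PERSONS = frozenset({"1sg", "2sg", "3sg"})  # from the validation rule above


def load_corpus(
    path: Path,
    verbs: frozenset[str],   # hypothetical parameter: the 20 allowed verbs
    tenses: frozenset[str],  # hypothetical parameter: the 5 allowed tenses
) -> list[VerbExample]:
    """Parse one JSONL shard; raise ValueError naming the offending row id."""
    required = [f.name for f in fields(VerbExample)]
    allowed = {"verb": verbs, "tense": tenses, "person": PERSONS}
    examples = []
    with path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # assumption: blank lines are tolerated
            row = json.loads(line)
            row_id = row.get("id", f"<line {line_no}>")
            missing = [k for k in required if k not in row]
            if missing:
                raise ValueError(f"row {row_id}: missing fields {missing}")
            for key, ok_values in allowed.items():
                if row[key] not in ok_values:
                    raise ValueError(f"row {row_id}: unknown {key} {row[key]!r}")
            examples.append(VerbExample(**{k: row[k] for k in required}))
    return examples
```

Falling back to the line number when `id` itself is missing keeps the error message useful even for the worst-formed rows.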

Block B — tokenized example builder

The model trains on tokenized sequences of the form:

<bos> es:<spanish_form> / en:<english_form> <eos>

(Both directions are present in the corpus — the model learns the pairing.)

  • Implement tokenize_example(ex: VerbExample, tokenizer) -> list[int] returning the token ids. Use Phase 11's BPE.
  • Add the <bos>, <eos>, <pad> special tokens; their ids must be reserved in Phase 11's vocabulary. If not, extend the vocabulary deterministically (Phase 11's lab pinned this).
  • Property test: decode(tokenize_example(ex)) == "<bos> es:... / en:... <eos>".
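To make the sequence format and the property test concrete, here is a small sketch. Phase 11's tokenizer API is not shown in this lab, so `ToyByteTokenizer`, its `specials` mapping, and the `Pair` stand-in for `VerbExample` are all hypothetical; only the `<bos> es:... / en:... <eos>` layout comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical stand-in for VerbExample with only the fields used here."""
    es: str
    en: str


class ToyByteTokenizer:
    """Hypothetical stand-in for the Phase 11 BPE: one id per UTF-8 byte."""

    specials = {"<bos>": 256, "<eos>": 257, "<pad>": 258}  # ids above the byte range

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        names = {v: k for k, v in self.specials.items()}
        parts, buf = [], bytearray()
        for i in ids:
            if i in names:
                parts.append(buf.decode("utf-8"))  # flush pending bytes
                buf = bytearray()
                parts.append(names[i])
            else:
                buf.append(i)
        parts.append(buf.decode("utf-8"))
        return "".join(parts)


def tokenize_example(ex, tokenizer) -> list[int]:
    """Wrap the encoded ' es:... / en:... ' body in <bos> and <eos> ids."""
    body = f" es:{ex.es} / en:{ex.en} "
    bos = tokenizer.specials["<bos>"]
    eos = tokenizer.specials["<eos>"]
    return [bos, *tokenizer.encode(body), eos]
```

With the real BPE, whether the spaces around the special tokens round-trip exactly depends on how Phase 11 handles whitespace, so the property test is the place to find out.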

Block C — held-out split

Implement the hold-out-4-verbs policy from theory/01:

HELDOUT_VERBS = ("look", "like", "see", "eat")  # 2 regular + 2 irregular

def held_out_split(examples: list[VerbExample]) -> tuple[list, list]:
    """Returns (train, val). Val is all examples whose verb is in HELDOUT_VERBS."""
    ...
  • Train should contain 16 verbs × 5 tenses × 3 persons = 240 forms.
  • Val should contain 4 verbs × 5 tenses × 3 persons = 60 forms.
  • No leakage: train ∩ val == ∅ (by example id).
  • Test: assert sizes; assert the held-out verbs are exactly the ones in HELDOUT_VERBS; assert all 5 tenses and all 3 persons appear in both train and val.
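The split policy above is short enough to sketch directly. This is one possible version, assuming each example exposes `.verb` and `.id` attributes as `VerbExample` does; raising on overlapping ids is an extra guard, not something the lab requires.

```python
HELDOUT_VERBS = ("look", "like", "see", "eat")  # 2 regular + 2 irregular


def held_out_split(examples):
    """Return (train, val); val holds every example whose verb is held out."""
    held = set(HELDOUT_VERBS)
    train = [ex for ex in examples if ex.verb not in held]
    val = [ex for ex in examples if ex.verb in held]
    # Cheap leakage guard: duplicate ids across the two sides mean a corrupt corpus.
    overlap = {ex.id for ex in train} & {ex.id for ex in val}
    if overlap:
        raise ValueError(f"ids present in both train and val: {sorted(overlap)}")
    return train, val
```

Because the split is by verb rather than by random row, every tense and person still appears on both sides, which is what the coverage test checks.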

Block D — batcher and masks

Implement:

from numpy import ndarray

def make_batch(
    examples: list[list[int]],
    pad_id: int,
) -> tuple[ndarray, ndarray, ndarray, ndarray]:
    """Returns (input, target, loss_mask, attn_pad_mask)."""
    ...

After the causal shift, all four arrays have shape (B, L-1), where L is the length of the longest sequence in the batch. See theory/01 for the canonical implementation sketch; you may transcribe it, but you must add the tests in Block F.

  • input.shape == target.shape == loss_mask.shape == attn_pad_mask.shape == (B, L-1).
  • loss_mask is 1 where target != pad_id, else 0.
  • attn_pad_mask is 1 where input != pad_id, else 0.
  • input and target are int32; loss_mask and attn_pad_mask are float32.
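The four requirements above fit in a few lines of NumPy. This is a minimal sketch rather than the canonical version from theory/01; it assumes every row is non-empty, already ends in `<eos>`, and that `pad_id` never appears among the real tokens.

```python
import numpy as np


def make_batch(examples: list[list[int]], pad_id: int):
    """Right-pad to the longest row, then apply the causal shift."""
    batch_size = len(examples)
    max_len = max(len(row) for row in examples)

    # Start from an all-pad (B, L) buffer and copy each row into its prefix.
    padded = np.full((batch_size, max_len), pad_id, dtype=np.int32)
    for i, row in enumerate(examples):
        padded[i, : len(row)] = row

    # Causal shift: position t of the input predicts position t+1 of the row.
    inputs = padded[:, :-1].copy()   # (B, L-1)
    targets = padded[:, 1:].copy()   # (B, L-1)

    # Loss mask follows the target; attention pad mask follows the input.
    loss_mask = (targets != pad_id).astype(np.float32)
    attn_pad_mask = (inputs != pad_id).astype(np.float32)
    return inputs, targets, loss_mask, attn_pad_mask
```

The two masks differ by exactly one position in every padded row: where the input is `<eos>`, the attention pad mask is 1 but the target is already `<pad>`, so the loss mask is 0.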

Block E — deterministic shuffle

Implement a BatchIterator:

from typing import Iterator
from numpy import ndarray

class BatchIterator:
    def __init__(self, examples: list[list[int]], batch_size: int, seed: int): ...
    def __iter__(self) -> Iterator[tuple[ndarray, ndarray, ndarray, ndarray]]: ...
    def state_dict(self) -> dict: ...
    def load_state_dict(self, state: dict) -> None: ...
  • Each epoch shuffles the examples using a seeded RNG, distinct from the training-control RNG.
  • state_dict() returns the current epoch, position, and RNG state; round-trip via load_state_dict resumes the same sequence.
  • Last batch of each epoch may be smaller than batch_size; do not drop it.
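One way to get a resumable shuffle is to snapshot the RNG state at the start of each epoch and redraw the permutation on load. The sketch below takes that approach; it is an illustration, not the reference design. To stay focused on shuffle and resume it yields raw token-id rows, whereas the real iterator would pass each group through `make_batch`. The `rng_state` key name and the private `_start_epoch` helper are assumptions.

```python
import copy
from typing import Iterator

import numpy as np


class BatchIterator:
    def __init__(self, examples: list[list[int]], batch_size: int, seed: int):
        if not examples:
            raise ValueError("examples must be non-empty")
        self.examples = examples
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)  # dedicated Generator, not the global RNG
        self.epoch = 0
        self.position = 0
        self._start_epoch()

    def _start_epoch(self) -> None:
        # Snapshot the RNG before drawing, so a resume can redraw this exact permutation.
        self._epoch_rng_state = copy.deepcopy(self.rng.bit_generator.state)
        self._order = self.rng.permutation(len(self.examples))

    def __iter__(self) -> Iterator[list[list[int]]]:
        """Yield the remaining batches of the current epoch, then stop."""
        n = len(self.examples)
        while True:
            idx = self._order[self.position : self.position + self.batch_size]
            # Advance the bookkeeping before yielding, so state_dict() taken
            # right after a batch always describes the next batch.
            self.position += len(idx)
            epoch_done = self.position >= n
            if epoch_done:
                self.epoch += 1
                self.position = 0
                self._start_epoch()
            yield [self.examples[i] for i in idx]  # last batch may be short
            if epoch_done:
                return

    def state_dict(self) -> dict:
        return {
            "epoch": self.epoch,
            "position": self.position,
            "rng_state": copy.deepcopy(self._epoch_rng_state),
        }

    def load_state_dict(self, state: dict) -> None:
        self.epoch = state["epoch"]
        self.position = state["position"]
        self.rng.bit_generator.state = copy.deepcopy(state["rng_state"])
        self._start_epoch()  # redraws the same permutation as the saved epoch
```

Storing the epoch-start RNG state rather than the permutation itself keeps the snapshot small, at the cost of one extra permutation draw on resume.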

Block F — tests

In tests/minitrain/test_data.py:

  1. test_corpus_schema — load_corpus(train.jsonl) validates without error; expected count = 240.
  2. test_heldout_split_sizes — 240/60 split, disjoint by id.
  3. test_heldout_coverage — both train and val contain all 5 tenses and all 3 persons; train has 16 verbs, val has 4.
  4. test_batch_shapes — given examples of lengths [6, 11, 9, 14], batched shapes are (4, 13) (after pad to 14 then causal shift to 13).
  5. test_batch_masks — assert loss_mask has exactly the right count of zeros (the pad positions in the targets).
  6. test_iterator_resume — iterate 5 batches, snapshot state_dict, iterate 5 more; reload from snapshot, iterate 5 more, verify same batches as the original second-5.
  7. test_no_overlap_train_val — every example id in val is not in train.
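The expected numbers for tests 4 and 5 can be derived from the row lengths alone, which gives the assertions something independent to compare against. The helper below is a hypothetical name, and it assumes `pad_id` never occurs among the real tokens: a row of length n contributes n - 1 real targets and L - n pad targets after the shift.

```python
def expected_mask_counts(lengths: list[int]) -> tuple[tuple[int, int], int, int]:
    """Return (shape, loss_mask ones, loss_mask zeros) for the given row lengths."""
    max_len = max(lengths)
    shape = (len(lengths), max_len - 1)           # pad to L, then causal shift
    ones = sum(n - 1 for n in lengths)            # real tokens left in each target row
    zeros = sum(max_len - n for n in lengths)     # pad positions in each target row
    return shape, ones, zeros


shape, ones, zeros = expected_mask_counts([6, 11, 9, 14])
# shape == (4, 13), ones == 36, zeros == 16, and 36 + 16 == 4 * 13
```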

Block G — sanity experiment

experiments/18-batch-sanity/:

  • manifest.json with seed, versions, config (batch_size, tokenizer hash).
  • print_sample.py — load corpus, build one batch of size 4, decode each row back to text, print shapes and decoded text.
  • Output: sample_batch.txt with the 4 decoded sentences and the 4 mask vectors as [1 1 1 1 1 0 0 0 ...] strings.
  • Visual check: the mask zeros line up with the pad tokens in the decoded sentences.
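Two small helpers cover most of what the experiment writes out. This is a sketch under assumptions: the manifest key names and the choice of SHA-256 over the tokenizer's serialized bytes are illustrative, since the lab only says the manifest records seed, versions, and config.

```python
import hashlib
import platform

import numpy as np


def format_mask(mask_row) -> str:
    """Render one mask row as a '[1 1 1 0 0]' style string."""
    return "[" + " ".join(str(int(v)) for v in mask_row) + "]"


def build_manifest(seed: int, batch_size: int, tokenizer_bytes: bytes) -> dict:
    """Collect seed, versions, and config into a JSON-serializable dict."""
    return {
        "seed": seed,
        "versions": {
            "python": platform.python_version(),
            "numpy": np.__version__,
        },
        "config": {
            "batch_size": batch_size,
            # Assumption: the tokenizer hash is SHA-256 of its serialized vocabulary.
            "tokenizer_sha256": hashlib.sha256(tokenizer_bytes).hexdigest(),
        },
    }
```

Writing the result with `json.dump(manifest, fh, indent=2, sort_keys=True)` keeps the committed file stable across runs.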

Constraints

  • Pure NumPy + standard library.
  • No Pandas (overkill for 600 examples).
  • The shuffle RNG must be a fresh np.random.Generator, not the global RNG, so it doesn't entangle with the training-control RNG.
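The RNG constraint is easy to check empirically. The sketch below (function names are illustrative) draws a permutation from a dedicated `np.random.Generator` and shows that the result does not change however much other code has used the legacy global RNG in between.

```python
import numpy as np


def epoch_order(seed: int, n: int) -> np.ndarray:
    """Permutation drawn from a fresh, dedicated Generator."""
    return np.random.default_rng(seed).permutation(n)


def shuffle_is_isolated(seed: int = 42, n: int = 8) -> bool:
    """True if global-RNG activity leaves the dedicated shuffle unchanged."""
    np.random.seed(1)
    np.random.rand(3)            # stand-in for unrelated code using the global RNG
    first = epoch_order(seed, n)

    np.random.seed(2)
    np.random.rand(100)          # different global state, different number of draws
    second = epoch_order(seed, n)
    return bool((first == second).all())
```

Had `epoch_order` called `np.random.permutation(n)` instead, the two orders would depend on the global state at call time.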

Stop conditions

Done when:

  1. pytest tests/minitrain/test_data.py -v passes all seven tests.
  2. experiments/18-batch-sanity/sample_batch.txt is committed with decoded sentences and mask vectors.
  3. The held-out split numbers match theory/01 (240 train / 60 val).
  4. You can re-state the loss-mask vs attention-mask distinction without consulting the file.

Pitfalls

  • Putting <pad> before <eos> in a row. Don't pre-pad; right-pad after <eos>. If your tokenizer mishandles this, extend it.
  • Random shuffle using the global RNG. This breaks reproducibility across reload-and-resume. Use a dedicated Generator.
  • Allocating a new buffer every batch. For 600 examples it's fine; for big corpora you'd reuse a preallocated buffer. Note this in a comment, then move on.
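  • Swapping the two slices of the causal shift. With padded as the right-padded (B, L) array, input is padded[:, :-1] and target is padded[:, 1:], not the other way round. The loss mask is derived from the target, not the input; the attention pad mask is derived from the input.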

When to consult solutions/

After all seven tests pass. Solution at solutions/00-batch-and-mask-ref.md (written at phase open).


Next lab: lab/01-train-mini.md.