English · Español

Lab 00 — Build the batcher and the mask¶

Goal: assemble the data pipeline that turns the Phase-12 verb-conjugation JSONL shards into batched (input, target, loss_mask, attn_pad_mask) quadruples ready to feed MiniGPT.

Estimated time: 90–120 minutes.

Prereq: Phase 11 BPE encoder + Phase 12 corpus JSONL committed.

What you produce¶

A new module file + tests:

src/minitrain/data.py — corpus loader, deterministic shuffle, batcher, mask builder, held-out split helper.
tests/minitrain/test_data.py — schema, shape, mask, and split tests.

A small experiment:

experiments/18-batch-sanity/ — manifest.json + a printed sample batch with shapes and one decoded sentence per row.

Background you must have read¶

theory/01-batching-loss-mask.md — the two masks, the causal shift, token-mean reduction, held-out split policy.
Phase 12 corpus_spec.md — JSONL schema for verb-conjugation entries.

TODOs¶

Block A — corpus loader¶

Implement:

@dataclass(frozen=True)
class VerbExample:
    id: str                # e.g. "work.present.1sg"
    en: str                # "I work"
    es: str                # "yo trabajo"
    verb: str              # "work"
    tense: str             # "present_simple"
    person: str            # "1sg"
    is_regular: bool

def load_corpus(path: Path) -> list[VerbExample]: ...

Parses data/processed/{train,val,test}.jsonl (Phase 12 output).
Validates: every row has all required fields; verb ∈ {20 verbs}, tense ∈ {5 tenses}, person ∈ {1sg, 2sg, 3sg}.
Rejects unknown values with a clear error citing the offending row id.

Block B — tokenized example builder¶

The model trains on tokenized sequences of the form:

<bos> es:<spanish_form> / en:<english_form> <eos>

(Both directions are present in the corpus — the model learns the pairing.)

Implement tokenize_example(ex: VerbExample, tokenizer) -> list[int] returning the token ids. Use Phase 11's BPE.
Add the <bos>, <eos>, <pad> special tokens; their ids must be reserved in Phase 11's vocabulary. If not, extend the vocabulary deterministically (Phase 11's lab pinned this).
Property test: decode(tokenize_example(ex)) == "<bos> es:... / en:... <eos>".

Block C — held-out split¶

Implement the hold-out-4-verbs policy from theory/01:

HELDOUT_VERBS = ("look", "like", "see", "eat")  # 2 regular + 2 irregular

def held_out_split(examples: list[VerbExample]) -> tuple[list, list]:
    """Returns (train, val). Val is all examples whose verb is in HELDOUT_VERBS."""
    ...

Train should contain 16 verbs × 5 tenses × 3 persons = 240 forms.
Val should contain 4 verbs × 5 tenses × 3 persons = 60 forms.
No leakage: train ∩ val == ∅ (by example id).
Test: assert sizes; assert the held-out verbs are exactly the ones in HELDOUT_VERBS; assert all 5 tenses and all 3 persons appear in both train and val.

Block D — batcher and masks¶

Implement:

def make_batch(
    examples: list[list[int]],
    pad_id: int,
) -> tuple[ndarray, ndarray, ndarray, ndarray]:
    """Returns (input, target, loss_mask, attn_pad_mask)."""

The shapes after the causal shift are all (B, L-1). See theory/01 for the canonical implementation sketch — you may transcribe it, but you must add the tests in Block E.

input.shape == target.shape == loss_mask.shape == attn_pad_mask.shape == (B, L-1).
loss_mask is 1 where target != pad_id, else 0.
attn_pad_mask is 1 where input != pad_id, else 0.
input and target are int32; loss_mask and attn_pad_mask are float32.

Block E — deterministic shuffle¶

Implement a BatchIterator:

class BatchIterator:
    def __init__(self, examples: list[list[int]], batch_size: int, seed: int): ...
    def __iter__(self) -> Iterator[tuple[ndarray, ndarray, ndarray, ndarray]]: ...
    def state_dict(self) -> dict: ...
    def load_state_dict(self, state: dict) -> None: ...

Each epoch shuffles the examples using a seeded RNG, distinct from the training-control RNG.
state_dict() returns the current epoch, position, and RNG state; round-trip via load_state_dict resumes the same sequence.
Last batch of each epoch may be smaller than batch_size; do not drop it.

Block F — tests¶

In tests/minitrain/test_data.py:

test_corpus_schema — load_corpus(train.jsonl) validates without error; expected count = 240.
test_heldout_split_sizes — 240/60 split, disjoint by id.
test_heldout_coverage — both train and val contain all 5 tenses and all 3 persons; train has 16 verbs, val has 4.
test_batch_shapes — given examples of lengths [6, 11, 9, 14], batched shapes are (4, 13) (after pad to 14 then causal shift to 13).
test_batch_masks — assert loss_mask has exactly the right count of zeros (the pad positions in the targets).
test_iterator_resume — iterate 5 batches, snapshot state_dict, iterate 5 more; reload from snapshot, iterate 5 more, verify same batches as the original second-5.
test_no_overlap_train_val — every example id in val is not in train.

Block G — sanity experiment¶

experiments/18-batch-sanity/:

manifest.json with seed, versions, config (batch_size, tokenizer hash).
print_sample.py — load corpus, build one batch of size 4, decode each row back to text, print shapes and decoded text.
Output: sample_batch.txt with the 4 decoded sentences and the 4 mask vectors as [1 1 1 1 1 0 0 0 ...] strings.
Visual check: the mask zeros line up with the pad tokens in the decoded sentences.

Constraints¶

Pure NumPy + standard library.
No Pandas (overkill for 600 examples).
The shuffle RNG must be a fresh np.random.Generator, not the global RNG, so it doesn't entangle with the training-control RNG.

Stop conditions¶

Done when:

pytest tests/minitrain/test_data.py -v passes all seven tests.
experiments/18-batch-sanity/sample_batch.txt is committed with decoded sentences and mask vectors.
The held-out split numbers match theory/01 (240 train / 60 val).
You can re-state the loss-mask vs attention-mask distinction without consulting the file.

Pitfalls¶

Putting <pad> before <eos> in a row. Don't pre-pad; right-pad after <eos>. If your tokenizer mishandles this, extend it.
Random shuffle using the global RNG. This breaks reproducibility across reload-and-resume. Use a dedicated Generator.
Allocating a new buffer every batch. For 600 examples it's fine; for big corpora you'd reuse. Note this in a comment, then move on.
Forgetting that the target is input[:, 1:], not input[:, :-1]. The mask is derived from the target, not the input.

When to consult `solutions/`¶

After all seven tests pass. Solution at solutions/00-batch-and-mask-ref.md (written at phase open).

Next lab: lab/01-train-mini.md.