Lab 00: Build the batcher and the mask
Goal: assemble the data pipeline that turns the Phase-12 verb-conjugation JSONL shards into batched `(input, target, loss_mask, attn_pad_mask)` quadruples ready to feed MiniGPT.
Estimated time: 90–120 minutes.
Prereq: Phase 11 BPE encoder + Phase 12 corpus JSONL committed.
What you produce
A new module file + tests:
- `src/minitrain/data.py`: corpus loader, deterministic shuffle, batcher, mask builder, held-out split helper.
- `tests/minitrain/test_data.py`: schema, shape, mask, and split tests.
A small experiment:
- `experiments/18-batch-sanity/`: `manifest.json` + a printed sample batch with shapes and one decoded sentence per row.
Background you must have read
- `theory/01-batching-loss-mask.md`: the two masks, the causal shift, token-mean reduction, held-out split policy.
- Phase 12 `corpus_spec.md`: JSONL schema for verb-conjugation entries.
TODOs
Block A: corpus loader
Implement:

    from dataclasses import dataclass
    from pathlib import Path

    @dataclass(frozen=True)
    class VerbExample:
        id: str           # e.g. "work.present.1sg"
        en: str           # "I work"
        es: str           # "yo trabajo"
        verb: str         # "work"
        tense: str        # "present_simple"
        person: str       # "1sg"
        is_regular: bool

    def load_corpus(path: Path) -> list[VerbExample]: ...

- Parses `data/processed/{train,val,test}.jsonl` (Phase 12 output).
- Validates: every row has all required fields; `verb ∈ {20 verbs}`, `tense ∈ {5 tenses}`, `person ∈ {1sg, 2sg, 3sg}`.
- Rejects unknown values with a clear error citing the offending row id.
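The validation rules above can be sketched roughly as follows. This is a minimal illustration, not the reference solution: the optional `verbs` and `tenses` parameters are hypothetical additions (the lab text does not list the 20 verbs or 5 tenses here, so the sketch skips those checks unless the caller supplies the allowed sets), and `ValueError` is an assumed choice of exception type.

```python
import json
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Optional

PERSONS = frozenset({"1sg", "2sg", "3sg"})  # the three persons named in the lab


@dataclass(frozen=True)
class VerbExample:
    id: str
    en: str
    es: str
    verb: str
    tense: str
    person: str
    is_regular: bool


def load_corpus(
    path: Path,
    verbs: Optional[frozenset] = None,   # hypothetical: the 20 allowed verbs
    tenses: Optional[frozenset] = None,  # hypothetical: the 5 allowed tenses
) -> list[VerbExample]:
    """Parse one JSONL shard, failing loudly with the offending row id."""
    required = [f.name for f in fields(VerbExample)]
    allowed = {"person": PERSONS, "verb": verbs, "tense": tenses}
    examples = []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            # Fall back to the line number when the id itself is missing.
            row_id = row.get("id", f"<line {lineno}>")
            missing = [name for name in required if name not in row]
            if missing:
                raise ValueError(f"row {row_id}: missing fields {missing}")
            for key, values in allowed.items():
                if values is not None and row[key] not in values:
                    raise ValueError(f"row {row_id}: unknown {key} {row[key]!r}")
            examples.append(VerbExample(**{name: row[name] for name in required}))
    return examples
```

Building the dataclass from the `required` names only means extra keys in a row are ignored rather than rejected; tighten that if `corpus_spec.md` forbids unknown fields.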
Block B: tokenized example builder
The model trains on tokenized sequences of the form:
(Both directions are present in the corpus — the model learns the pairing.)
- Implement `tokenize_example(ex: VerbExample, tokenizer) -> list[int]` returning the token ids. Use Phase 11's BPE.
- Add the `<bos>`, `<eos>`, `<pad>` special tokens; their ids must be reserved in Phase 11's vocabulary. If not, extend the vocabulary deterministically (Phase 11's lab pinned this).
- Property test: `decode(tokenize_example(ex)) == "<bos> es:... / en:... <eos>"`.
Block C: held-out split
Implement the hold-out-4-verbs policy from theory/01:

    HELDOUT_VERBS = ("look", "like", "see", "eat")  # 2 regular + 2 irregular

    def held_out_split(examples: list[VerbExample]) -> tuple[list, list]:
        """Returns (train, val). Val is all examples whose verb is in HELDOUT_VERBS."""
        ...

- Train should contain 16 verbs × 5 tenses × 3 persons = 240 forms.
- Val should contain 4 verbs × 5 tenses × 3 persons = 60 forms.
- No leakage: `train ∩ val == ∅` (by example id).
- Test: assert sizes; assert the held-out verbs are exactly the ones in `HELDOUT_VERBS`; assert all 5 tenses and all 3 persons appear in both train and val.
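One way the split and its size checks could look is sketched below. The 16 training verbs (`verb00` to `verb15`) and the tense names (`tense0` to `tense4`) are placeholders invented so the example runs on its own; the real values come from the Phase 12 corpus.

```python
from dataclasses import dataclass
from itertools import product

HELDOUT_VERBS = ("look", "like", "see", "eat")  # 2 regular + 2 irregular


@dataclass(frozen=True)
class VerbExample:
    id: str
    en: str
    es: str
    verb: str
    tense: str
    person: str
    is_regular: bool


def held_out_split(examples):
    """Return (train, val); val holds every example whose verb is held out."""
    held = frozenset(HELDOUT_VERBS)
    train = [ex for ex in examples if ex.verb not in held]
    val = [ex for ex in examples if ex.verb in held]
    return train, val


# Synthetic 20 x 5 x 3 grid with placeholder verbs and tenses (not the real corpus).
verbs = list(HELDOUT_VERBS) + [f"verb{i:02d}" for i in range(16)]
tenses = [f"tense{i}" for i in range(5)]
persons = ["1sg", "2sg", "3sg"]
grid = [
    VerbExample(id=f"{v}.{t}.{p}", en="", es="", verb=v, tense=t, person=p, is_regular=True)
    for v, t, p in product(verbs, tenses, persons)
]

train, val = held_out_split(grid)
print(len(train), len(val))  # 240 60
```

Because every example lands in exactly one of the two lists, the id sets are disjoint by construction as long as ids are unique in the corpus; the leakage test then mainly guards against duplicated ids upstream.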
Block D: batcher and masks
Implement:

    from numpy import ndarray

    def make_batch(
        examples: list[list[int]],
        pad_id: int,
    ) -> tuple[ndarray, ndarray, ndarray, ndarray]:
        """Returns (input, target, loss_mask, attn_pad_mask)."""

The shapes after the causal shift are all (B, L-1), where L is the padded sequence length. See theory/01 for the canonical implementation sketch; you may transcribe it, but you must add the tests in Block F.
- `input.shape == target.shape == loss_mask.shape == attn_pad_mask.shape == (B, L-1)`.
- `loss_mask` is `1` where `target != pad_id`, else `0`.
- `attn_pad_mask` is `1` where `input != pad_id`, else `0`.
- `input` and `target` are int32; `loss_mask` and `attn_pad_mask` are float32.
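The four requirements above fit in a few lines of NumPy. The sketch below is one possible reading of them and is not necessarily the version in theory/01; it assumes `pad_id` never occurs as a real token inside an example, and the toy token ids in the usage lines are made up.

```python
import numpy as np


def make_batch(examples, pad_id):
    """Right-pad to the longest row, then apply the causal shift."""
    max_len = max(len(ids) for ids in examples)
    # A fresh buffer per batch is fine for a small corpus; reuse it for big ones.
    padded = np.full((len(examples), max_len), pad_id, dtype=np.int32)
    for row, ids in enumerate(examples):
        padded[row, : len(ids)] = ids
    inputs = padded[:, :-1]   # positions 0 .. L-2
    targets = padded[:, 1:]   # positions 1 .. L-1 (the next token at each step)
    loss_mask = (targets != pad_id).astype(np.float32)
    attn_pad_mask = (inputs != pad_id).astype(np.float32)
    return inputs, targets, loss_mask, attn_pad_mask


# Toy rows with the lengths used in test_batch_shapes; ids are arbitrary non-pad values.
examples = [list(range(1, n + 1)) for n in (6, 11, 9, 14)]
inputs, targets, loss_mask, attn_pad_mask = make_batch(examples, pad_id=0)
print(inputs.shape, targets.shape)  # (4, 13) (4, 13)
```

Note that the two masks differ by one position per padded row: a row of length n contributes n - 1 ones to `loss_mask` but n ones to `attn_pad_mask` (capped at L - 1), which is why they must be computed from `targets` and `inputs` separately.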
Block E: deterministic shuffle
Implement a `BatchIterator`:

    from typing import Iterator
    from numpy import ndarray

    class BatchIterator:
        def __init__(self, examples: list[list[int]], batch_size: int, seed: int): ...
        def __iter__(self) -> Iterator[tuple[ndarray, ndarray, ndarray, ndarray]]: ...
        def state_dict(self) -> dict: ...
        def load_state_dict(self, state: dict) -> None: ...

- Each epoch shuffles the examples using a seeded RNG, distinct from the training-control RNG.
- `state_dict()` returns the current epoch, position, and RNG state; a round-trip via `load_state_dict` resumes the same sequence.
- The last batch of each epoch may be smaller than `batch_size`; do not drop it.
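A simplified sketch of the resume logic follows. To stay short it yields raw, unpadded batches (lists of token-id lists) instead of the four arrays; the lab version would pass each batch through `make_batch`. Storing the RNG state from the start of the epoch and re-drawing the permutation on load is one design choice among several (saving the permutation itself would also work), and the attribute names are illustrative.

```python
import copy
from itertools import islice

import numpy as np


class BatchIterator:
    def __init__(self, examples, batch_size, seed):
        self.examples = examples
        self.batch_size = batch_size
        self._rng = np.random.default_rng(seed)  # dedicated Generator, not the global RNG
        self.epoch = 0
        self.position = 0
        self._start_epoch()

    def _start_epoch(self):
        # Snapshot the RNG *before* drawing, so the same order can be rebuilt on resume.
        self._epoch_rng_state = copy.deepcopy(self._rng.bit_generator.state)
        self._order = self._rng.permutation(len(self.examples))

    def __iter__(self):
        # Yields the remainder of the current epoch, then prepares the next one.
        while self.position < len(self._order):
            idx = self._order[self.position : self.position + self.batch_size]
            self.position += len(idx)  # advance before yielding so snapshots are exact
            yield [self.examples[i] for i in idx]
        self.epoch += 1
        self.position = 0
        self._start_epoch()

    def state_dict(self):
        return {
            "epoch": self.epoch,
            "position": self.position,
            "rng_state": copy.deepcopy(self._epoch_rng_state),
        }

    def load_state_dict(self, state):
        self.epoch = state["epoch"]
        self.position = state["position"]
        self._rng.bit_generator.state = copy.deepcopy(state["rng_state"])
        self._start_epoch()  # re-draws the same permutation for this epoch


examples = [[i] for i in range(10)]
it = BatchIterator(examples, batch_size=4, seed=123)
first = list(islice(iter(it), 2))   # consume two batches
snapshot = it.state_dict()
rest = list(iter(it))               # remainder of epoch 0 (one short batch)
next_epoch = list(iter(it))         # all of epoch 1

resumed = BatchIterator(examples, batch_size=4, seed=999)  # seed is overwritten on load
resumed.load_state_dict(snapshot)
assert list(iter(resumed)) == rest
assert list(iter(resumed)) == next_epoch
```

Each `for batch in it` loop covers one epoch (or what is left of it), so a training loop that counts steps rather than epochs would wrap this in an outer loop.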
Block F: tests
In `tests/minitrain/test_data.py`:
- `test_corpus_schema`: `load_corpus(train.jsonl)` validates without error; expected count = 240.
- `test_heldout_split_sizes`: 240/60 split, disjoint by id.
- `test_heldout_coverage`: both train and val contain all 5 tenses and all 3 persons; train has 16 verbs, val has 4.
- `test_batch_shapes`: given examples of lengths `[6, 11, 9, 14]`, batched shapes are `(4, 13)` (after pad to 14 then causal shift to 13).
- `test_batch_masks`: assert `loss_mask` has exactly the right count of zeros (the pad positions in the targets).
- `test_iterator_resume`: iterate 5 batches, snapshot `state_dict`, iterate 5 more; reload from the snapshot, iterate 5 more, and verify they match the original second 5.
- `test_no_overlap_train_val`: no example id in val appears in train.
Block G: sanity experiment
`experiments/18-batch-sanity/`:
- `manifest.json` with seed, versions, config (batch_size, tokenizer hash).
- `print_sample.py`: load corpus, build one batch of size 4, decode each row back to text, print shapes and decoded text.
- Output: `sample_batch.txt` with the 4 decoded sentences and the 4 mask vectors as `[1 1 1 1 1 0 0 0 ...]` strings.
- Visual check: the mask zeros line up with the pad tokens in the decoded sentences.
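For the mask strings in `sample_batch.txt`, a small helper along these lines may be enough. The name `format_mask` is hypothetical, and the sketch assumes each mask row holds only 0.0 and 1.0 values.

```python
import numpy as np


def format_mask(mask_row):
    """Render one float32 mask row as a bracketed string of 0s and 1s."""
    return "[" + " ".join(str(int(v)) for v in mask_row) + "]"


row = np.array([1, 1, 1, 0, 0], dtype=np.float32)
print(format_mask(row))  # [1 1 1 0 0]
```

Printing the mask in this form next to the decoded row makes the visual check easy: the number of trailing zeros should equal the number of `<pad>` tokens in the target.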
Constraints
- Pure NumPy + standard library.
- No Pandas (overkill for 600 examples).
- The shuffle RNG must be a fresh `np.random.Generator`, not the global RNG, so it doesn't entangle with the training-control RNG.
Stop conditions
Done when:
- `pytest tests/minitrain/test_data.py -v` passes all seven tests.
- `experiments/18-batch-sanity/sample_batch.txt` is committed with decoded sentences and mask vectors.
- The held-out split numbers match `theory/01` (240 train / 60 val).
- You can re-state the loss-mask vs attention-mask distinction without consulting the file.
Pitfalls
- Putting `<pad>` before `<eos>` in a row. Don't pre-pad; right-pad after `<eos>`. If your tokenizer mishandles this, extend it.
- Random shuffle using the global RNG. This breaks reproducibility across reload-and-resume. Use a dedicated `Generator`.
- Allocating a new buffer every batch. For 600 examples it's fine; for big corpora you'd reuse. Note this in a comment, then move on.
- Mixing up the two slices of the causal shift. With the padded batch of shape (B, L), the target is the `[:, 1:]` slice and the input is the `[:, :-1]` slice. The loss mask is derived from the target, not the input.
When to consult solutions/
After all seven tests pass. Solution at solutions/00-batch-and-mask-ref.md (written at phase open).
Next lab: lab/01-train-mini.md.