Lab 02: Write validate_corpus.py and split_corpus.py
Goal: validate the generator's output against the 12 rules from theory 02. Then split into train/val/test by (verb, tense). Emit the final manifest.
Estimated time: 3–4 hours.
Prereq: lab 01 committed; `data/raw/all_rows.jsonl` exists.
What you produce
- `scripts/validate_corpus.py`: runs every check from theory 02. Emits `data/raw/validation_report.json`.
- `scripts/split_corpus.py`: produces `data/processed/{train,val,test}.jsonl`.
- `scripts/build_manifest.py` (or the same `validate` script extended): emits `data/MANIFEST.json` per theory 03.
- `tests/test_validation.py`: unit tests on a small fixture.
- `tests/test_split.py`: unit tests on a small fixture.
TODOs
Block A: scripts/validate_corpus.py
- CLI: `argparse` with `--input` (default `data/raw/all_rows.jsonl`), `--spec` (default `data/corpus_spec.md`), `--report` (default `data/raw/validation_report.json`).
- Load all rows.
- Run the 12 checks from theory 02 (numbered below). Each check is its own function; aggregate results.
The 12 checks:
1. Schema validity. Every row parses against the JSON Schema in `data/corpus_spec.md`. Use `jsonschema` (already pinned per `pyproject.toml`).
2. No exact duplicates. Every `fingerprint` is unique across all rows.
3. Cell coverage. All 360 (verb, tense, person) cells have ≥ 1 `correct` row, except where check 6 relaxes the infinitive. (Enumerate the cross product; check membership.)
4. All 20 verbs present. `set(row.verb_lemma for row in rows) == VERB_TABLE_SET` (the 20 lemmas).
5. All 6 tense-surfaces present per verb. For each verb, the union of `tense` values across its rows is the full 6.
6. All 3 persons present per (verb, tense). Note: for the `infinitive` tense, this constraint relaxes, because `to work` is the same for all 3 persons. Decision: emit one row per (verb, infinitive, person) anyway (per spec), or one row per (verb, infinitive) shared across persons? v1: one row per (verb, infinitive, 1sg) only, with no per-person variation. Document in spec; validator allows the relaxation for `infinitive`.
7. Mis-conjugation types from canonical taxonomy. Every non-null `mis_conjugation_type` ∈ the 6-type taxonomy.
8. Mis-conjugation rows have `correct_form` populated. `label == "mis_conjugated"` ⇒ `correct_form` non-null and non-empty.
9. Correct rows have `mis_conjugation_type = null`. `label == "correct"` ⇒ `mis_conjugation_type is None and correct_form is None`.
10. Every row has a `spanish` field. Non-empty.
11. NFC normalization. For every row, `unicodedata.is_normalized('NFC', text)` and `unicodedata.is_normalized('NFC', spanish)`.
12. Text length in range. `2 <= len(text.encode('utf-8')) <= 30` for English; `2 <= len(spanish.encode('utf-8')) <= 40` for Spanish.

- Each check populates a result dict with `passed: bool`, `failures: list[row_id]`, `message: str`.
- Emit `validation_report.json` summarizing all 12.
- Exit code `0` if all pass, `1` otherwise.
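
A minimal sketch of how per-check functions, the result dict, and the exit code could fit together. It covers checks 2, 11, and 12 only. The `row_id` field name, the `CheckResult` type, and the helper names are assumptions for illustration; the authoritative field names live in `data/corpus_spec.md`.

```python
import unicodedata
from typing import Any, Callable, TypedDict

Row = dict[str, Any]


class CheckResult(TypedDict):
    passed: bool
    failures: list[str]  # row IDs; "row_id" is an assumed field name
    message: str


def make_result(name: str, failures: list[str]) -> CheckResult:
    return {
        "passed": not failures,
        "failures": failures,
        "message": f"{name}: {len(failures)} failing row(s)",
    }


def check_no_duplicates(rows: list[Row]) -> CheckResult:
    """Check 2: every fingerprint is unique across all rows."""
    seen: set[str] = set()
    failures: list[str] = []
    for row in rows:
        if row["fingerprint"] in seen:
            failures.append(row["row_id"])
        seen.add(row["fingerprint"])
    return make_result("no_duplicates", failures)


def check_nfc(rows: list[Row]) -> CheckResult:
    """Check 11: text, spanish, and any non-null correct_form are NFC."""
    failures: list[str] = []
    for row in rows:
        values = [row["text"], row["spanish"]]
        if row.get("correct_form") is not None:
            values.append(row["correct_form"])
        if not all(unicodedata.is_normalized("NFC", v) for v in values):
            failures.append(row["row_id"])
    return make_result("nfc", failures)


def check_length(rows: list[Row]) -> CheckResult:
    """Check 12: UTF-8 byte length is 2..30 (English) and 2..40 (Spanish)."""
    failures = [
        row["row_id"]
        for row in rows
        if not (
            2 <= len(row["text"].encode("utf-8")) <= 30
            and 2 <= len(row["spanish"].encode("utf-8")) <= 40
        )
    ]
    return make_result("length", failures)


# Registry of checks; the remaining nine would be added the same way.
CHECKS: dict[str, Callable[[list[Row]], CheckResult]] = {
    "no_duplicates": check_no_duplicates,
    "nfc": check_nfc,
    "length": check_length,
}


def run_checks(rows: list[Row]) -> tuple[dict[str, CheckResult], int]:
    """Run every registered check; return the report and the exit code."""
    report = {name: check(rows) for name, check in CHECKS.items()}
    exit_code = 0 if all(r["passed"] for r in report.values()) else 1
    return report, exit_code
```

The script's `main` could then write `report` to the `--report` path with `json.dump` and pass `exit_code` to `sys.exit`.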
Block B: scripts/split_corpus.py
- CLI: `argparse` with `--input` (default `data/raw/all_rows.jsonl`), `--output-dir` (default `data/processed/`), `--seed` (default 42), `--ratios` (default `0.8,0.1,0.1`).
- Call `seed_everything(args.seed)`.
- Bucket rows by `(verb_lemma, tense)` pair.
- Sort bucket keys deterministically (alphabetical) before shuffling.
- Shuffle the bucket keys (seeded).
- Slice: first 80% → train_keys, next 10% → val_keys, the remainder (about 10%) → test_keys.
- Assign every row to its split based on its key.
- Write `data/processed/{train,val,test}.jsonl` (one JSON object per line).
- Verify the post-split invariant: no fingerprint appears in two splits (sanity guard).
- Verify the (verb, tense) invariant: every (verb, tense) appears in exactly one split.
- Emit `data/processed/split_log.json` with per-split row count, per-(verb, tense) assignment, and the seed.
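
The bucketing, seeded shuffle, slicing, and invariant checks above can be sketched as follows. This is one possible shape, not the reference solution: `split_rows` and `assert_split_invariants` are hypothetical names, and a local `random.Random(seed)` stands in for the lab's `seed_everything`, whose implementation is not shown here.

```python
import random
from collections import defaultdict
from typing import Any

Row = dict[str, Any]
Key = tuple[str, str]  # (verb_lemma, tense)


def split_rows(
    rows: list[Row],
    seed: int = 42,
    ratios: tuple[float, float, float] = (0.8, 0.1, 0.1),
) -> dict[str, list[Row]]:
    # Bucket rows by (verb_lemma, tense).
    buckets: dict[Key, list[Row]] = defaultdict(list)
    for row in rows:
        buckets[(row["verb_lemma"], row["tense"])].append(row)

    # Sort first so the shuffle starts from a fixed order, then shuffle
    # with a local seeded RNG (assumed stand-in for seed_everything).
    keys = sorted(buckets)
    random.Random(seed).shuffle(keys)

    # Floor the train and val sizes; test takes the remainder.
    n_train = int(len(keys) * ratios[0])
    n_val = int(len(keys) * ratios[1])
    key_groups = {
        "train": keys[:n_train],
        "val": keys[n_train:n_train + n_val],
        "test": keys[n_train + n_val:],
    }
    return {
        name: [row for key in group for row in buckets[key]]
        for name, group in key_groups.items()
    }


def assert_split_invariants(splits: dict[str, list[Row]]) -> None:
    """No fingerprint and no (verb, tense) key may appear in two splits."""
    fp_owner: dict[str, str] = {}
    key_owner: dict[Key, str] = {}
    for name, rows in splits.items():
        for row in rows:
            key = (row["verb_lemma"], row["tense"])
            if fp_owner.setdefault(row["fingerprint"], name) != name:
                raise AssertionError(f"fingerprint in two splits: {row['fingerprint']}")
            if key_owner.setdefault(key, name) != name:
                raise AssertionError(f"(verb, tense) in two splits: {key}")
```

Because whole buckets move together, mis-conjugated rows land in the same split as the correct rows for their (verb, tense).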
Block C: scripts/build_manifest.py
- Compute SHA256 of each `data/processed/*.jsonl`.
- Count rows per cell (verb × tense × person × label).
- Count mis-conjugations by type.
- Read versions (Python, NumPy, corpus_spec version from `data/corpus_spec.md` preamble).
- Emit `data/MANIFEST.json` per the schema in theory 03.
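
A rough sketch of the hashing and counting steps. The manifest layout below (`files`, `cell_counts`, `mis_conjugation_counts`) is a placeholder, since the real schema is defined in theory 03; the `person` field name and the function names are also assumptions. Version collection is left out.

```python
import hashlib
import json
from collections import Counter
from pathlib import Path
from typing import Any


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large JSONL files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_jsonl(path: Path) -> list[dict[str, Any]]:
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def build_manifest(processed_dir: Path, repo_root: Path) -> dict[str, Any]:
    files: dict[str, Any] = {}
    cells: Counter[str] = Counter()
    mis_types: Counter[str] = Counter()
    # sorted() keeps the file order stable across platforms.
    for path in sorted(processed_dir.glob("*.jsonl")):
        rows = read_jsonl(path)
        # Store paths relative to the repo root so the manifest is portable.
        rel = path.relative_to(repo_root).as_posix()
        files[rel] = {"sha256": sha256_of(path), "rows": len(rows)}
        for row in rows:
            cell = "|".join(
                (row["verb_lemma"], row["tense"], row["person"], row["label"])
            )
            cells[cell] += 1
            if row["mis_conjugation_type"] is not None:
                mis_types[row["mis_conjugation_type"]] += 1
    return {
        "files": files,
        "cell_counts": dict(sorted(cells.items())),
        "mis_conjugation_counts": dict(sorted(mis_types.items())),
    }
```

Writing the result with `json.dumps(manifest, sort_keys=True, indent=2)` keeps the manifest itself byte-stable between runs.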
Block D: tests/test_validation.py
- Fixture: a small in-memory list of ~10 rows including 2 correct + 1 mis-conjugated for `(work, present_simple)`, similar for `(go, past_simple)`.
- Test each of the 12 checks: build the fixture to pass, then introduce a deliberate violation per check and assert that check fails.
- Test that a clean fixture produces all 12 `passed=True`.
- Test that an unclean fixture (e.g., NFD-encoded Spanish) is caught by check 11.
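
The pass-then-violate pattern might look like this for check 9. The check is defined inline only to keep the sketch self-contained; in the lab it would be imported from `scripts/validate_corpus.py`. The `row_id` field, the function names, and the `"placeholder_type"` value are illustrative assumptions (the real taxonomy names come from theory 02).

```python
from typing import Any

Row = dict[str, Any]


def check_correct_rows_clean(rows: list[Row]) -> dict[str, Any]:
    """Check 9: correct rows carry no mis-conjugation metadata."""
    failures = [
        row["row_id"]
        for row in rows
        if row["label"] == "correct"
        and not (row["mis_conjugation_type"] is None and row["correct_form"] is None)
    ]
    return {"passed": not failures, "failures": failures, "message": "check 9"}


def make_fixture() -> list[Row]:
    """Two correct rows and one mis-conjugated row for (work, present_simple)."""
    base = {"verb_lemma": "work", "tense": "present_simple"}
    return [
        {**base, "row_id": "r1", "label": "correct", "text": "I work",
         "mis_conjugation_type": None, "correct_form": None},
        {**base, "row_id": "r2", "label": "correct", "text": "she works",
         "mis_conjugation_type": None, "correct_form": None},
        {**base, "row_id": "r3", "label": "mis_conjugated", "text": "she work",
         "mis_conjugation_type": "placeholder_type", "correct_form": "she works"},
    ]


def test_check9_passes_on_clean_fixture() -> None:
    assert check_correct_rows_clean(make_fixture())["passed"]


def test_check9_catches_violation() -> None:
    rows = make_fixture()  # fresh copy, so the mutation stays local to this test
    rows[0]["correct_form"] = "I work"  # deliberate violation on a correct row
    result = check_correct_rows_clean(rows)
    assert not result["passed"]
    assert result["failures"] == ["r1"]
```

Building the fixture in a function rather than a module-level list means one test's deliberate violation cannot leak into another test.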
Block E: tests/test_split.py
- Fixture: rows with 4 (verb, tense) pairs.
- Test that splitting yields the (verb, tense) invariant.
- Test reproducibility: same seed → same split (verify row IDs).
- Test ratio approximation: with 100 (verb, tense) pairs and 80/10/10, the splits get 80 / 10 / 10 keys.
- Test that mis-conjugations follow their (verb, tense) into the same split.
Block F: end-to-end sanity
- Run the pipeline: `python scripts/gen_corpus.py --seed 42 && python scripts/validate_corpus.py && python scripts/split_corpus.py --seed 42 && python scripts/build_manifest.py`.
- Inspect `data/MANIFEST.json`. Verify the per-cell counts, the totals, the splits.
- Run twice; assert the SHA256s match. Otherwise the pipeline is non-deterministic.
Constraints
- Pure Python. Standard library + `jsonschema`.
- `mypy --strict` clean.
- `ruff` clean.
- `bandit` clean: no `subprocess` shells, no `eval`.
- Determinism. Both `gen_corpus.py` and `split_corpus.py` are independently seedable and reproducible.
Stop conditions
Done when:
- All three scripts + tests committed.
- `pytest -q tests/test_validation.py tests/test_split.py` green.
- `python scripts/validate_corpus.py` exits 0 on the lab-01 output.
- `data/processed/{train,val,test}.jsonl` exist; per-split row counts add up to the input row count.
- `data/MANIFEST.json` exists with all 12 checks recorded as passed.
- Re-running the pipeline yields identical SHA256s.
Pitfalls
- NFC check on the wrong field. Remember to check both `text` and `spanish` (and `correct_form` if non-null).
- Sort key for shuffle. Shuffling `list(some_set)` is not reproducible across runs, even with a fixed seed: the iteration order of a set of strings can change between processes because of hash randomization. Note also that `random.shuffle(list(some_set))` shuffles a temporary list in place and returns `None`, so the result is lost. Always build `keys = sorted(set(...))` first, then shuffle `keys`.
- Split ratios that don't round to integers. With 120 (verb, tense) pairs and 80/10/10, you get 96/12/12. With 99 pairs, flooring train and val and giving test the remainder yields 79/9/11; the off-by-one matters. Use `int(n * ratio)` consistently and document the rounding.
- Off-by-one in length check. Should `len('I work'.encode('utf-8')) == 6` pass? Yes (≥ 2). Should `len('to work'.encode('utf-8')) == 7` pass? Yes. Empty Spanish: fails. Test boundary cases, since both bounds are inclusive.
- Manifest paths. Use `pathlib.Path` and write paths relative to repo root, not absolute. Otherwise the manifest isn't portable.
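
The rounding arithmetic from the split-ratio pitfall, worked through in code. `split_sizes` is a hypothetical helper; it floors train and val with `int(n * ratio)` and assigns the remainder to test, which is one rounding policy among several.

```python
def split_sizes(n_keys: int, train: float = 0.8, val: float = 0.1) -> tuple[int, int, int]:
    """Return (train, val, test) key counts; test absorbs the rounding remainder."""
    n_train = int(n_keys * train)
    n_val = int(n_keys * val)
    return n_train, n_val, n_keys - n_train - n_val


print(split_sizes(120))  # (96, 12, 12)
print(split_sizes(99))   # (79, 9, 11)
```

Whatever policy you pick, the three counts must sum to the number of keys, which is worth asserting in `tests/test_split.py`.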
Hint of last resort
If you are 3 hours in and the manifest hashes don't reproduce, the most likely cause is dict key order. Dicts iterate in insertion order, so two runs that insert the same keys in a different order serialize to different bytes. JSON-serialize with `sort_keys=True` everywhere. Check that `json.dumps(d, sort_keys=True)` gives identical bytes across runs.
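
A small demonstration of why `sort_keys=True` matters for hashing. `canonical_line` is a hypothetical helper; `ensure_ascii=False` and the compact separators are extra choices made for this sketch, not requirements stated in the lab.

```python
import hashlib
import json
from typing import Any


def canonical_line(obj: dict[str, Any]) -> bytes:
    # sort_keys fixes the key order; the other two options are assumptions
    # that keep Spanish text readable and whitespace uniform.
    text = json.dumps(obj, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return (text + "\n").encode("utf-8")


a = {"verb_lemma": "go", "tense": "past_simple"}
b = {"tense": "past_simple", "verb_lemma": "go"}  # same data, other insertion order

# Without sort_keys the two serializations differ, so their hashes would too.
assert json.dumps(a) != json.dumps(b)

# With sort_keys the bytes, and therefore the SHA256 digests, are identical.
assert canonical_line(a) == canonical_line(b)
assert (
    hashlib.sha256(canonical_line(a)).hexdigest()
    == hashlib.sha256(canonical_line(b)).hexdigest()
)
```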
When to consult solutions/
After all tests pass and end-to-end runs reproduce. Solution: solutions/02-validate-and-split-ref.md (phase open).
Next lab: lab/03-version-with-dvc.md.