
Lab 02 — Write `validate_corpus.py` and `split_corpus.py`¶

Goal: validate the generator's output against the 12 rules from theory 02, then split it into train/val/test by (verb, tense) so that each (verb, tense) pair lands in exactly one split, and emit the final manifest.

Estimated time: 3–4 hours.

Prereq: lab 01 committed; data/raw/all_rows.jsonl exists.

What you produce¶

scripts/validate_corpus.py — runs every check from theory 02. Emits data/raw/validation_report.json.
scripts/split_corpus.py — produces data/processed/{train,val,test}.jsonl.
scripts/build_manifest.py (or the same validate script extended) — emits data/MANIFEST.json per theory 03.
tests/test_validation.py — unit tests on a small fixture.
tests/test_split.py — unit tests on a small fixture.

TODOs¶

Block A — `scripts/validate_corpus.py`¶

CLI: argparse with --input (default data/raw/all_rows.jsonl), --spec (default data/corpus_spec.md), --report (default data/raw/validation_report.json).
Load all rows.
Run the 12 checks from theory 02 (numbered below). Each check is its own function; aggregate results.

The 12 checks:

1. Schema validity. Every row validates against the JSON Schema in data/corpus_spec.md. Use jsonschema (already pinned per pyproject.toml).
2. No exact duplicates. Every fingerprint is unique across all rows.
3. Cell coverage. All 360 (verb, tense, person) cells have ≥ 1 correct row. (Enumerate the cross product; check membership.)
4. All 20 verbs present. set(row.verb_lemma for row in rows) == VERB_TABLE_SET (the 20 lemmas).
5. All 6 tense-surfaces present per verb. For each verb, the union of tense values across its rows is the full 6.
6. All 3 persons present per (verb, tense). Exception: the infinitive does not vary by person (to work is the same for all 3 persons). The open question is whether to emit one row per (verb, infinitive, person) anyway, as the spec says, or one row per (verb, infinitive) shared across persons. Decision for v1: one row per (verb, infinitive, 1sg) only, with no per-person variation. Document this in the spec; the validator allows the relaxation for the infinitive. Under this decision only 320 of the 360 cells in check 3 can be populated (20 × 5 × 3 + 20 × 1), so check 3 needs the same exemption.
7. Mis-conjugation types from canonical taxonomy. Every non-null mis_conjugation_type is one of the 6 types in the taxonomy.
8. Mis-conjugation rows have correct_form populated. label == "mis_conjugated" ⇒ correct_form non-null and non-empty.
9. Correct rows have mis_conjugation_type = null. label == "correct" ⇒ mis_conjugation_type is None and correct_form is None.
10. Every row has a spanish field. Non-empty.
11. NFC normalization. For every row, unicodedata.is_normalized('NFC', text) and unicodedata.is_normalized('NFC', spanish), and the same for correct_form when it is non-null.
12. Text length in range, measured in UTF-8 bytes. 2 <= len(text.encode('utf-8')) <= 30 for English; 2 <= len(spanish.encode('utf-8')) <= 40 for Spanish.
Each check populates a result dict with passed: bool, failures: list[row_id], message: str.
Emit validation_report.json summarizing all 12.
Exit code 0 if all pass, 1 otherwise.
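
The Block A steps can be sketched roughly as follows. This is a minimal outline, not the reference solution: it implements only checks 2 and 11, and the field names row_id and fingerprint, the CheckResult type, and the helper names are assumptions made for illustration. The remaining checks would follow the same one-function-per-check shape.

```python
import argparse
import json
import unicodedata
from collections import Counter
from pathlib import Path
from typing import Any, Callable, Optional, Sequence, TypedDict

Row = dict[str, Any]


class CheckResult(TypedDict):
    passed: bool
    failures: list[str]  # row IDs that violated the check
    message: str


def _result(name: str, failures: list[str]) -> CheckResult:
    # A check passes exactly when it collected no failing row IDs.
    return {"passed": not failures, "failures": failures,
            "message": f"{name}: {len(failures)} failing row(s)"}


def check_no_duplicates(rows: list[Row]) -> CheckResult:
    # Check 2: every row sharing a fingerprint with another row is reported.
    counts = Counter(r["fingerprint"] for r in rows)
    return _result("no_duplicates",
                   [r["row_id"] for r in rows if counts[r["fingerprint"]] > 1])


def check_nfc(rows: list[Row]) -> CheckResult:
    # Check 11: text, spanish, and correct_form (when present) must be NFC.
    bad = []
    for r in rows:
        values = [r["text"], r["spanish"], r.get("correct_form")]
        if any(v is not None and not unicodedata.is_normalized("NFC", v) for v in values):
            bad.append(r["row_id"])
    return _result("nfc", bad)


# Only two of the 12 checks are shown; the rest would be appended here.
CHECKS: list[Callable[[list[Row]], CheckResult]] = [check_no_duplicates, check_nfc]


def run_checks(rows: list[Row]) -> tuple[dict[str, CheckResult], int]:
    report = {check.__name__: check(rows) for check in CHECKS}
    exit_code = 0 if all(r["passed"] for r in report.values()) else 1
    return report, exit_code


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Validate the generated corpus.")
    parser.add_argument("--input", default="data/raw/all_rows.jsonl")
    parser.add_argument("--spec", default="data/corpus_spec.md")
    parser.add_argument("--report", default="data/raw/validation_report.json")
    return parser.parse_args(argv)


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = parse_args(argv)
    lines = Path(args.input).read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    report, exit_code = run_checks(rows)
    Path(args.report).write_text(
        json.dumps(report, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return exit_code
```

In a script, the last line would typically be `raise SystemExit(main())` under an `if __name__ == "__main__":` guard, so the process exit code mirrors the report.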

Block B — `scripts/split_corpus.py`¶

Block C — `scripts/build_manifest.py`¶

Compute SHA256 of each data/processed/*.jsonl.
Count rows per cell (verb × tense × person × label).
Count mis-conjugations by type.
Read versions (Python, NumPy, corpus_spec version from data/corpus_spec.md preamble).
Emit data/MANIFEST.json per the schema in theory 03.
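
A possible starting point for the hashing and counting steps is sketched below. The manifest schema lives in theory 03 and is not reproduced here, so the output keys (files, rows_per_cell, mis_conjugations_by_type) and the "|"-joined cell key are placeholders, and the person field name is an assumption. Version fields are omitted.

```python
import hashlib
import json
from collections import Counter
from pathlib import Path
from typing import Any


def sha256_file(path: Path) -> str:
    # Hash in chunks so a large split is never read into memory at once.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(repo_root: Path) -> dict[str, Any]:
    files: dict[str, Any] = {}
    cells: Counter[str] = Counter()
    mis_types: Counter[str] = Counter()
    # sorted() keeps the file order stable across platforms and runs.
    for path in sorted((repo_root / "data" / "processed").glob("*.jsonl")):
        lines = path.read_text(encoding="utf-8").splitlines()
        rows = [json.loads(line) for line in lines if line.strip()]
        # Store the path relative to the repo root so the manifest is portable.
        rel = path.relative_to(repo_root).as_posix()
        files[rel] = {"sha256": sha256_file(path), "rows": len(rows)}
        for r in rows:
            # Hypothetical cell key: verb|tense|person|label.
            cells["|".join([r["verb_lemma"], r["tense"], r["person"], r["label"]])] += 1
            if r.get("mis_conjugation_type") is not None:
                mis_types[r["mis_conjugation_type"]] += 1
    return {
        "files": files,
        "rows_per_cell": dict(cells),
        "mis_conjugations_by_type": dict(mis_types),
    }
```

Writing the result with `json.dumps(manifest, indent=2, sort_keys=True)` keeps the manifest bytes stable across runs.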

Block D — `tests/test_validation.py`¶

Fixture: a small in-memory list of ~10 rows including 2 correct + 1 mis-conjugated for (work, present_simple), similar for (go, past_simple).
Test each of the 12 checks: build the fixture to pass, then introduce a deliberate violation per check and assert that check fails.
Test that a clean fixture produces all 12 passed=True.
Test that an unclean fixture (e.g., NFD-encoded Spanish) is caught by check 11.
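
The "pass first, then break one thing" pattern for these tests might look like the sketch below, shown for check 11 only. The check_nfc function is a stand-in defined inline so the example runs on its own; in the lab it would be imported from scripts/validate_corpus.py. The row_id field and the fixture contents are illustrative assumptions.

```python
import unicodedata
from typing import Any

Row = dict[str, Any]


def check_nfc(rows: list[Row]) -> dict[str, Any]:
    # Stand-in for the real check 11 (assumed shape: passed / failures / message).
    bad = [r["row_id"] for r in rows
           if not unicodedata.is_normalized("NFC", r["text"])
           or not unicodedata.is_normalized("NFC", r["spanish"])]
    return {"passed": not bad, "failures": bad, "message": f"nfc: {len(bad)} failing row(s)"}


def make_fixture() -> list[Row]:
    # Build fresh rows on every call so one test cannot leak mutations into another.
    return [
        {"row_id": "r1", "verb_lemma": "work", "tense": "present_simple",
         "text": "I work", "spanish": "yo trabajo"},
        {"row_id": "r2", "verb_lemma": "go", "tense": "past_simple",
         "text": "he went", "spanish": unicodedata.normalize("NFC", "él fue")},
    ]


def test_clean_fixture_passes() -> None:
    assert check_nfc(make_fixture())["passed"]


def test_nfd_spanish_is_caught() -> None:
    rows = make_fixture()
    # Deliberate violation: decompose the accented character into base + combining mark.
    rows[1]["spanish"] = unicodedata.normalize("NFD", rows[1]["spanish"])
    result = check_nfc(rows)
    assert not result["passed"]
    assert result["failures"] == ["r2"]
```

Asserting on the exact failures list, not only on passed, catches a check that fails for the wrong row.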

Block E — `tests/test_split.py`¶

Fixture: rows with 4 (verb, tense) pairs.
Test the (verb, tense) invariant: after splitting, no (verb, tense) pair appears in more than one split.
Test reproducibility: same seed → same split (verify row IDs).
Test ratio approximation: with 100 (verb, tense) pairs and 80/10/10, the splits get 80 / 10 / 10 keys.
Test that mis-conjugations follow their (verb, tense) into the same split.
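
The properties these tests exercise suggest a key-level split along the following lines. This is one way to do it, assuming rows are dicts with verb_lemma and tense fields, that test receives whatever train and val leave over, and that a function named split_rows is acceptable; none of these names come from the lab text.

```python
import random
from typing import Any

Row = dict[str, Any]
Key = tuple[str, str]


def split_rows(rows: list[Row], seed: int,
               train_ratio: float = 0.8, val_ratio: float = 0.1) -> dict[str, list[Row]]:
    # Split at the (verb, tense) level: sort the unique keys first so the
    # starting order is the same on every run, then shuffle with a seeded RNG.
    keys: list[Key] = sorted({(r["verb_lemma"], r["tense"]) for r in rows})
    random.Random(seed).shuffle(keys)

    n_train = int(len(keys) * train_ratio)
    n_val = int(len(keys) * val_ratio)

    assignment: dict[Key, str] = {}
    for i, key in enumerate(keys):
        if i < n_train:
            assignment[key] = "train"
        elif i < n_train + n_val:
            assignment[key] = "val"
        else:
            assignment[key] = "test"  # test takes the remainder

    # Every row, correct or mis-conjugated, follows its key into one split.
    splits: dict[str, list[Row]] = {"train": [], "val": [], "test": []}
    for r in rows:
        splits[assignment[(r["verb_lemma"], r["tense"])]].append(r)
    return splits
```

Because the assignment depends only on the sorted key set and the seed, reordering the input rows changes the order of rows inside each split but not which split a row lands in.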

Block F — end-to-end sanity¶

Run the pipeline: python scripts/gen_corpus.py --seed 42 && python scripts/validate_corpus.py && python scripts/split_corpus.py --seed 42 && python scripts/build_manifest.py.
Inspect data/MANIFEST.json. Verify the per-cell counts, the totals, the splits.
Run twice; assert the SHA256s match. Otherwise the pipeline is non-deterministic.

Constraints¶

Pure Python. Standard library + jsonschema.
mypy --strict clean.
ruff clean.
bandit clean — no subprocess shells, no eval.
Determinism. Both gen_corpus.py and split_corpus.py are independently seedable and reproducible.

Stop conditions¶

Done when:

All three scripts + tests committed.
pytest -q tests/test_validation.py tests/test_split.py green.
python scripts/validate_corpus.py exits 0 on the lab-01 output.
data/processed/{train,val,test}.jsonl exist; per-split row counts add up to the input row count.
data/MANIFEST.json exists with all 12 checks recorded as passed.
Re-running the pipeline yields identical SHA256s.

Pitfalls¶

NFC check on the wrong field. Remember to check both text and spanish (and correct_form if non-null).
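Sort key for shuffle. Shuffling list(some_set) is not reproducible across runs, even with a fixed seed: the iteration order of a set of strings depends on Python's per-process hash randomization (PYTHONHASHSEED), so the list starts in a different order on each run. Note also that random.shuffle works in place and returns None, so it needs a named list. Always build sorted(set(...)) first, then shuffle.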
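Split ratios that don't round to integers. With 120 (verb, tense) pairs and 80/10/10, you get 96/12/12. With 99 pairs, int(n * ratio) gives 79 for train and 9 for val, and the remaining 11 go to test: 79/9/11. The off-by-one matters. Use int(n * ratio) consistently for train and val, assign the remainder to test, and document the rounding.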
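Off-by-one in length check. The bounds are inclusive and counted in UTF-8 bytes, not characters, so an accented character such as é counts as 2. Should len('I work'.encode('utf-8')) == 6 pass? Yes (≥ 2). Should len('to work'.encode('utf-8')) == 7 pass? Yes. Empty Spanish: fails. Test boundary cases.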
Manifest paths. Use pathlib.Path and write paths relative to repo root, not absolute. Otherwise the manifest isn't portable.
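
Two of the pitfalls above are cheap to pin down with boundary assertions. The helpers below are illustrative only (split_sizes and length_ok are made-up names); they follow the int(n * ratio) rule with test taking the remainder, and the inclusive byte bounds from check 12.

```python
def split_sizes(n_keys: int, train: float = 0.8, val: float = 0.1) -> tuple[int, int, int]:
    # Truncate train and val; test absorbs whatever is left over.
    n_train = int(n_keys * train)
    n_val = int(n_keys * val)
    return n_train, n_val, n_keys - n_train - n_val


def length_ok(value: str, max_bytes: int, min_bytes: int = 2) -> bool:
    # Length is measured in UTF-8 bytes, with both bounds inclusive.
    return min_bytes <= len(value.encode("utf-8")) <= max_bytes


print(split_sizes(120))         # (96, 12, 12)
print(split_sizes(99))          # (79, 9, 11)
print(length_ok("I work", 30))  # True: 6 bytes
print(length_ok("", 40))        # False: empty Spanish fails
print(len("él".encode("utf-8")))  # 3: the accented character takes 2 bytes
```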

Hint of last resort¶

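If 3 hours in and the manifest hashes don't reproduce: the most likely cause is key order in the serialized JSON. Python dicts iterate in insertion order, so the output bytes change whenever keys are inserted in a different order (for example, when a dict is built by iterating over a set). JSON-serialize with sort_keys=True everywhere. Check that json.dumps(d, sort_keys=True) gives identical bytes across runs.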
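
As a small illustration of the hint, the sketch below serializes two dicts that hold the same data but were built in a different key order. The canonical_line helper is hypothetical, and the ensure_ascii and separators settings are extra choices made here, not something the lab prescribes.

```python
import hashlib
import json
from typing import Any


def canonical_line(row: dict[str, Any]) -> str:
    # sort_keys makes the output independent of dict insertion order.
    return json.dumps(row, sort_keys=True, ensure_ascii=False,
                      separators=(",", ":")) + "\n"


a = {"text": "I work", "spanish": "yo trabajo"}
b = {"spanish": "yo trabajo", "text": "I work"}  # same data, different insertion order

same_without_sort = json.dumps(a) == json.dumps(b)
same_with_sort = canonical_line(a) == canonical_line(b)
digest_a = hashlib.sha256(canonical_line(a).encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(canonical_line(b).encode("utf-8")).hexdigest()
print(same_without_sort, same_with_sort, digest_a == digest_b)  # False True True
```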

When to consult `solutions/`¶

After all tests pass and end-to-end runs reproduce. Solution: solutions/02-validate-and-split-ref.md (phase open).

Next lab: lab/03-version-with-dvc.md.
