
Lab 02 — Write validate_corpus.py and split_corpus.py

Goal: validate the generator's output against the 12 rules from theory 02. Then split into train/val/test by (verb, tense). Emit the final manifest.

Estimated time: 3–4 hours.

Prereq: lab 01 committed; data/raw/all_rows.jsonl exists.


What you produce

  • scripts/validate_corpus.py — runs every check from theory 02. Emits data/raw/validation_report.json.
  • scripts/split_corpus.py — produces data/processed/{train,val,test}.jsonl.
  • scripts/build_manifest.py (or the same validate script extended) — emits data/MANIFEST.json per theory 03.
  • tests/test_validation.py — unit tests on a small fixture.
  • tests/test_split.py — unit tests on a small fixture.

TODOs

Block A — scripts/validate_corpus.py

  • CLI: argparse with --input (default data/raw/all_rows.jsonl), --spec (default data/corpus_spec.md), --report (default data/raw/validation_report.json).
  • Load all rows.
  • Run the 12 checks from theory 02 (numbered below). Each check is its own function; aggregate results.

The 12 checks:

  1. Schema validity. Every row validates against the JSON Schema in data/corpus_spec.md. Use jsonschema (already pinned per pyproject.toml).
  2. No exact duplicates. Every fingerprint is unique across all rows.
  3. Cell coverage. All 360 (verb, tense, person) cells have ≥ 1 correct row. (Enumerate the cross product; check membership.)
  4. All 20 verbs present. set(row.verb_lemma for row in rows) == VERB_TABLE_SET (the 20 lemmas).
  5. All 6 tense-surfaces present per verb. For each verb, the union of tense values across its rows is the full 6.
  6. All 3 persons present per (verb, tense). Note: this constraint relaxes for the infinitive tense, because "to work" is the same for all 3 persons. The open question is whether to emit one row per (verb, infinitive, person) anyway (per spec) or one row per (verb, infinitive) shared across persons. v1 decision: emit one row per (verb, infinitive, 1sg) only, with no per-person variation. Document this in the spec; the validator allows the relaxation for infinitive. Apply the same relaxation in check 3, which otherwise expects rows for all three persons of the infinitive.
  7. Mis-conjugation types from canonical taxonomy. Every non-null mis_conjugation_type ∈ the 6-type taxonomy.
  8. Mis-conjugation rows have correct_form populated. label == "mis_conjugated" ⇒ correct_form non-null and non-empty.
  9. Correct rows have mis_conjugation_type = null. label == "correct" ⇒ mis_conjugation_type is None and correct_form is None.
  10. Every row has a spanish field. Non-empty.
  11. NFC normalization. For every row, unicodedata.is_normalized('NFC', text) and unicodedata.is_normalized('NFC', spanish).
  12. Text length in range. 2 <= len(text.encode('utf-8')) <= 30 for English; 2 <= len(spanish.encode('utf-8')) <= 40 for Spanish.
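
Checks 11 and 12 from the list above can be sketched as follows. This is a minimal sketch, not the reference solution: it assumes each row is a dict with text and spanish keys, and the id key used to report failures is a hypothetical name (use whatever identifier the spec defines). It also covers correct_form when that field is non-null.

```python
import unicodedata
from typing import Any

Row = dict[str, Any]  # assumption: each JSONL line decodes to a flat dict


def check_nfc(rows: list[Row]) -> list[str]:
    """Return ids of rows with any text field that is not NFC-normalized."""
    failures: list[str] = []
    for row in rows:
        # correct_form is null on correct rows, so missing values are skipped.
        fields = [row["text"], row["spanish"], row.get("correct_form")]
        if any(f is not None and not unicodedata.is_normalized("NFC", f) for f in fields):
            failures.append(row["id"])  # "id" is a hypothetical key name
    return failures


def check_length(rows: list[Row]) -> list[str]:
    """Return ids of rows whose UTF-8 byte length is outside the allowed range."""
    failures: list[str] = []
    for row in rows:
        en_bytes = len(row["text"].encode("utf-8"))
        es_bytes = len(row["spanish"].encode("utf-8"))
        if not (2 <= en_bytes <= 30 and 2 <= es_bytes <= 40):
            failures.append(row["id"])
    return failures


# Toy rows: r2 carries the same Spanish text as r1, but decomposed (NFD).
nfc_text = "\u00e9l trabaja"
rows = [
    {"id": "r1", "text": "he works", "spanish": nfc_text},
    {"id": "r2", "text": "he works", "spanish": unicodedata.normalize("NFD", nfc_text)},
]
print(check_nfc(rows))     # ['r2']
print(check_length(rows))  # []
```

Because the limits are in bytes, an accented Spanish letter counts as two, so the 40-byte limit is tighter than 40 characters.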

  • Each check populates a result dict with passed: bool, failures: list[row_id], message: str.
  • Emit validation_report.json summarizing all 12.
  • Exit code 0 if all pass, 1 otherwise.
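
One way to shape the per-check result dict and the aggregation step is sketched below. The CheckResult type, the CHECKS registry, and run_checks are hypothetical names introduced for illustration, and only check 10 is implemented; the id key is likewise an assumption.

```python
import json
from typing import Any, Callable, TypedDict

Row = dict[str, Any]


class CheckResult(TypedDict):
    passed: bool
    failures: list[str]
    message: str


def check_spanish_present(rows: list[Row]) -> CheckResult:
    """Check 10: every row has a non-empty spanish field."""
    failures = [row["id"] for row in rows if not row.get("spanish")]
    return {
        "passed": not failures,
        "failures": failures,
        "message": f"{len(failures)} row(s) with missing or empty spanish",
    }


# Hypothetical registry; the real script would list all 12 checks here.
CHECKS: dict[str, Callable[[list[Row]], CheckResult]] = {
    "10_spanish_present": check_spanish_present,
}


def run_checks(rows: list[Row]) -> tuple[dict[str, CheckResult], int]:
    """Run every registered check; return the report and a process exit code."""
    report = {name: check(rows) for name, check in CHECKS.items()}
    exit_code = 0 if all(result["passed"] for result in report.values()) else 1
    return report, exit_code


report, code = run_checks([
    {"id": "r1", "spanish": "yo trabajo"},
    {"id": "r2", "spanish": ""},
])
print(json.dumps(report, indent=2, sort_keys=True))
print("exit code:", code)  # exit code: 1
```

In the script itself, the report would be written to the --report path and the exit code passed to sys.exit.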

Block B — scripts/split_corpus.py

  • CLI: argparse with --input (default data/raw/all_rows.jsonl), --output-dir (default data/processed/), --seed (default 42), --ratios (default 0.8,0.1,0.1).
  • Call seed_everything(args.seed).
  • Bucket rows by (verb_lemma, tense) pair.
  • Sort bucket keys deterministically (alphabetical) before shuffling.
  • Shuffle the bucket keys (seeded).
  • Slice: 80% → train_keys, next 10% → val_keys, last 10% → test_keys.
  • Assign every row to its split based on its key.
  • Write data/processed/{train,val,test}.jsonl (one JSON object per line).
  • Verify the post-split invariant: no fingerprint appears in two splits (sanity guard).
  • Verify the (verb, tense) invariant: every (verb, tense) appears in exactly one split.
  • Emit data/processed/split_log.json with per-split row count, per-(verb, tense) assignment, and the seed.
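
The bucketing, shuffling, and slicing steps above can be sketched as follows. This is a sketch under assumptions: it uses a local random.Random(seed) in place of the lab's seed_everything helper (whose definition is not shown here), it gives test whatever keys remain after train and val, and split_rows is a hypothetical name. File writing and split_log.json are left out.

```python
import random
from collections import defaultdict
from typing import Any

Row = dict[str, Any]
Key = tuple[str, str]  # (verb_lemma, tense)


def split_rows(
    rows: list[Row],
    seed: int = 42,
    ratios: tuple[float, float, float] = (0.8, 0.1, 0.1),
) -> dict[str, list[Row]]:
    """Assign whole (verb_lemma, tense) buckets to train/val/test."""
    buckets: dict[Key, list[Row]] = defaultdict(list)
    for row in rows:
        buckets[(row["verb_lemma"], row["tense"])].append(row)

    keys = sorted(buckets)             # deterministic order before shuffling
    random.Random(seed).shuffle(keys)  # local RNG, independent of global state

    n_train = int(len(keys) * ratios[0])
    n_val = int(len(keys) * ratios[1])
    key_split = {
        "train": keys[:n_train],
        "val": keys[n_train:n_train + n_val],
        "test": keys[n_train + n_val:],  # test absorbs the rounding remainder
    }
    splits = {
        name: [row for key in ks for row in buckets[key]]
        for name, ks in key_split.items()
    }

    # Sanity guard: no fingerprint may land in two different splits.
    owner: dict[str, str] = {}
    for name, split in splits.items():
        for row in split:
            if owner.setdefault(row["fingerprint"], name) != name:
                raise ValueError(f"fingerprint {row['fingerprint']} in two splits")
    return splits


# Toy input: 5 verbs x 2 tenses = 10 keys, 2 rows per key.
rows = [
    {"verb_lemma": v, "tense": t, "fingerprint": f"{v}-{t}-{i}"}
    for v in ("be", "go", "have", "see", "work")
    for t in ("past_simple", "present_simple")
    for i in range(2)
]
first = split_rows(rows, seed=42)
second = split_rows(rows, seed=42)
print({name: len(split) for name, split in first.items()})
# {'train': 16, 'val': 2, 'test': 2}
```

Because the key list is sliced into three disjoint ranges, the (verb, tense) invariant holds by construction here; the lab still asks for an explicit check so that a later refactor cannot silently break it.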

Block C — scripts/build_manifest.py

  • Compute SHA256 of each data/processed/*.jsonl.
  • Count rows per cell (verb × tense × person × label).
  • Count mis-conjugations by type.
  • Read versions (Python, NumPy, corpus_spec version from data/corpus_spec.md preamble).
  • Emit data/MANIFEST.json per the schema in theory 03.
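
The hashing and counting steps might look like the sketch below. The helper names are hypothetical, the person field name is an assumption (the source names the dimension but not the key), and the "|"-joined cell key is just one way to get string keys that JSON can serialize. The manifest schema itself comes from theory 03 and is not reproduced here.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path
from typing import Any


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large splits need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_cells(rows: list[dict[str, Any]]) -> dict[str, int]:
    """Count rows per verb x tense x person x label cell, keys sorted."""
    counts = Counter(
        "|".join((r["verb_lemma"], r["tense"], r["person"], r["label"]))
        for r in rows
    )
    return dict(sorted(counts.items()))


# Demo on a throwaway file; hashing the same bytes twice must agree.
payload = b'{"id": "r1"}\n'
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "train.jsonl"
    sample.write_bytes(payload)
    digest_a = sha256_file(sample)
    digest_b = sha256_file(sample)
print(digest_a == digest_b)  # True

cells = count_cells([
    {"verb_lemma": "work", "tense": "present_simple", "person": "1sg", "label": "correct"},
    {"verb_lemma": "work", "tense": "present_simple", "person": "1sg", "label": "correct"},
])
print(cells)  # {'work|present_simple|1sg|correct': 2}
```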

Block D — tests/test_validation.py

  • Fixture: a small in-memory list of ~10 rows including 2 correct + 1 mis-conjugated for (work, present_simple), similar for (go, past_simple).
  • Test each of the 12 checks: build the fixture to pass, then introduce a deliberate violation per check and assert that check fails.
  • Test that a clean fixture produces all 12 passed=True.
  • Test that an unclean fixture (e.g., NFD-encoded Spanish) is caught by check 11.
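
The "pass first, then break one thing" pattern for these tests can be sketched as below. The check_nfc function here is a stand-in so the example runs on its own; in the lab it would be imported from scripts/validate_corpus.py under whatever name you gave it. The fixture rows and the id key are illustrative assumptions.

```python
import unicodedata
from typing import Any

Row = dict[str, Any]


def check_nfc(rows: list[Row]) -> list[str]:
    """Stand-in for check 11: ids of rows with non-NFC text or spanish."""
    return [
        row["id"]
        for row in rows
        if not all(unicodedata.is_normalized("NFC", row[k]) for k in ("text", "spanish"))
    ]


def clean_fixture() -> list[Row]:
    """Build a fresh fixture per test so mutations do not leak between tests."""
    return [
        {"id": "r1", "text": "I work", "spanish": "yo trabajo"},
        {"id": "r2", "text": "he went", "spanish": "\u00e9l fue"},
    ]


def test_clean_fixture_passes() -> None:
    assert check_nfc(clean_fixture()) == []


def test_nfd_spanish_is_caught() -> None:
    rows = clean_fixture()
    # Deliberate violation: decompose the accented character in one row.
    rows[1]["spanish"] = unicodedata.normalize("NFD", rows[1]["spanish"])
    assert check_nfc(rows) == ["r2"]
```

pytest collects the test_ functions without any import of pytest itself; the same two-step shape repeats for each of the 12 checks.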

Block E — tests/test_split.py

  • Fixture: rows with 4 (verb, tense) pairs.
  • Test that splitting yields the (verb, tense) invariant.
  • Test reproducibility: same seed → same split (verify row IDs).
  • Test ratio approximation: with 100 (verb, tense) pairs and 80/10/10, the splits get 80 / 10 / 10 keys.
  • Test that mis-conjugations follow their (verb, tense) into the same split.

Block F — end-to-end sanity

  • Run the pipeline: python scripts/gen_corpus.py --seed 42 && python scripts/validate_corpus.py && python scripts/split_corpus.py --seed 42 && python scripts/build_manifest.py.
  • Inspect data/MANIFEST.json. Verify the per-cell counts, the totals, the splits.
  • Run twice; assert the SHA256s match. Otherwise the pipeline is non-deterministic.

Constraints

  • Pure Python. Standard library + jsonschema.
  • mypy --strict clean.
  • ruff clean.
  • bandit clean — no subprocess shells, no eval.
  • Determinism. Both gen_corpus.py and split_corpus.py are independently seedable and reproducible.

Stop conditions

Done when:

  1. All three scripts + tests committed.
  2. pytest -q tests/test_validation.py tests/test_split.py green.
  3. python scripts/validate_corpus.py exits 0 on the lab-01 output.
  4. data/processed/{train,val,test}.jsonl exist; per-split row counts add up to the input row count.
  5. data/MANIFEST.json exists with all 12 checks recorded as passed.
  6. Re-running the pipeline yields identical SHA256s.

Pitfalls

  • NFC check on the wrong field. Remember to check both text and spanish (and correct_form if non-null).
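  • Sort key for shuffle. random.shuffle(list(some_set)) is not reproducible across runs: iteration order of a set of strings (or tuples of strings) depends on per-process hash randomization (PYTHONHASHSEED), so the seeded shuffle starts from a different order each run. The call also shuffles a temporary list that is immediately discarded. Build keys = sorted(set(...)) first, then random.shuffle(keys).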
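  • Split ratios that don't round to integers. With 120 (verb, tense) pairs and 80/10/10, you get 96/12/12. With 99 pairs, int(n * ratio) gives 79 for train and 9 for val, and test takes the remaining 11; the off-by-one matters. Use int(n * ratio) for train and val, assign the remainder to test so no key is dropped, and document the rounding.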
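  • Off-by-one in length check. 'I work' is 6 bytes in UTF-8, so it passes (2 ≤ 6 ≤ 30). 'to work' is 7 bytes and also passes. Empty Spanish fails. Test the boundaries themselves: 1, 2, 30, and 31 bytes for English; 1, 2, 40, and 41 bytes for Spanish. Remember that the limits are in bytes, so accented characters count as two.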
  • Manifest paths. Use pathlib.Path and write paths relative to repo root, not absolute. Otherwise the manifest isn't portable.
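
The rounding arithmetic from the split-ratio pitfall can be checked directly. A minimal sketch, assuming train and val use int(n * ratio) and test takes the remainder; split_counts is a hypothetical helper name.

```python
def split_counts(
    n_keys: int, ratios: tuple[float, float, float] = (0.8, 0.1, 0.1)
) -> tuple[int, int, int]:
    """Return (train, val, test) key counts; test absorbs the remainder."""
    n_train = int(n_keys * ratios[0])
    n_val = int(n_keys * ratios[1])
    return n_train, n_val, n_keys - n_train - n_val


print(split_counts(120))  # (96, 12, 12)
print(split_counts(99))   # (79, 9, 11)
```

Giving test the remainder guarantees the three counts always sum to the number of keys, whatever the rounding does.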

Hint of last resort

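If 3 hours in and the manifest hashes don't reproduce: the most likely cause is key order in the serialized JSON. Python dicts preserve insertion order, so a dict that is filled in a different order (for example while iterating a set) serializes to different bytes even though its content is the same. JSON-serialize with sort_keys=True everywhere. Check that json.dumps(d, sort_keys=True) gives identical bytes across runs.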
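
A quick way to see the effect: two dicts with equal content but different insertion order serialize differently unless keys are sorted. The dict contents below are illustrative only.

```python
import json

a = {"train": 96, "val": 12, "test": 12}
b = {"test": 12, "val": 12, "train": 96}  # same content, different insertion order

print(a == b)                                                      # True
print(json.dumps(a) == json.dumps(b))                              # False
print(json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True))  # True
```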

When to consult solutions/

After all tests pass and end-to-end runs reproduce. Solution: solutions/02-validate-and-split-ref.md (phase open).


Next lab: lab/03-version-with-dvc.md.