Lab 01 — Write scripts/gen_corpus.py

Goal: implement the corpus generator. Enumerate the 20 verbs × 6 tense surface-forms × 3 persons = 360 cells of correct forms with their Spanish pairs, and emit mis-conjugations from the closed taxonomy. Output is data/raw/all_rows.jsonl plus a partial manifest.

Estimated time: 4–6 hours.

Prereq: lab 00 committed (data/corpus_spec.md).


What you produce

  • scripts/gen_corpus.py — entry point. Reads the verb table from src/minicorpus/verb_table.py. Emits data/raw/all_rows.jsonl.
  • src/minicorpus/__init__.py — re-exports.
  • src/minicorpus/verb_table.py — the 20-verb table as Python data structures (no logic).
  • src/minicorpus/conjugate.py — pure functions: conjugate_english(verb, tense, person, table) -> str and the Spanish equivalent.
  • src/minicorpus/mis_conjugate.py — given a correct form, produce a mis-conjugated form for a given type.
  • tests/test_conjugate.py — unit tests on a handful of (verb, tense, person) tuples.

The generator is not a single monolithic script — it's a small package. The script orchestrates; the package conjugates.

TODOs

Block A — src/minicorpus/verb_table.py

  • Define VERB_TABLE: dict[str, VerbEntry] with one entry for each of the 20 verbs.
  • VerbEntry is a frozen dataclass with: english_lemma, spanish_lemma, english_regularity, spanish_regularity, english_past, english_participle, spanish_irregular_forms: dict[(tense, person), str] (sparse — only fill in irregular cells; regular cells are computed from the lemma).
  • For each of the 8 English irregulars, populate english_past and english_participle.
  • For each Spanish irregular form, populate spanish_irregular_forms.
  • Add SPANISH_PRONOUNS: dict[person, str] mapping 1sg → "yo", 2sg → "tú", 3sg → "él".
  • Add ENGLISH_PRONOUNS: dict[person, str] mapping 1sg → "I", 2sg → "you", 3sg → "he".
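
A minimal sketch of what Block A could look like, assuming Python 3.9+ and string-valued tense and person keys. The two entries, the go → ir pairing, and the "regular" / "irregular" values for the regularity fields are illustrative assumptions; the real table and its vocabulary come from data/corpus_spec.md.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Mapping, Optional, Tuple

# Key for sparse irregular cells: (tense, person), e.g. ("present_simple", "1sg").
Cell = Tuple[str, str]


@dataclass(frozen=True)
class VerbEntry:
    english_lemma: str
    spanish_lemma: str
    english_regularity: str  # assumed values: "regular" | "irregular"
    spanish_regularity: str  # assumed values: "regular" | "irregular"
    english_past: Optional[str] = None        # set only for English irregulars
    english_participle: Optional[str] = None  # set only for English irregulars
    # Sparse: only irregular Spanish cells are listed here.
    spanish_irregular_forms: Mapping[Cell, str] = field(default_factory=dict)


# Two illustrative entries only; the lab's table has 20.
VERB_TABLE: dict[str, VerbEntry] = {
    "work": VerbEntry("work", "trabajar", "regular", "regular"),
    "go": VerbEntry(
        "go",
        "ir",
        "irregular",
        "irregular",
        english_past="went",
        english_participle="gone",
        spanish_irregular_forms={
            ("present_simple", "1sg"): "voy",
            ("present_simple", "2sg"): "vas",
            ("present_simple", "3sg"): "va",
        },
    ),
}

SPANISH_PRONOUNS: dict[str, str] = {"1sg": "yo", "2sg": "tú", "3sg": "él"}
ENGLISH_PRONOUNS: dict[str, str] = {"1sg": "I", "2sg": "you", "3sg": "he"}
```

Note that frozen=True blocks attribute reassignment but does not make the nested dict immutable, so treat spanish_irregular_forms as read-only by convention.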

Block B — src/minicorpus/conjugate.py

  • conjugate_english(verb_lemma: str, tense: str, person: str, table=VERB_TABLE) -> str. Pure function. Looks up the entry, computes the surface form per the rules in data/corpus_spec.md.
  • Handle the 6 tense surface-forms:
      • infinitive: to <lemma> (or just <lemma>; pick the convention from the spec and document it).
      • present_simple: <pronoun> <lemma> for 1sg/2sg; <pronoun> <lemma+s> for 3sg. Use -es for verbs ending in -sh, -ch, -s, -x, -z, -o (watch → watches, go → goes) and -y → -ies for verbs ending in consonant + y (study → studies). be is looked up, not derived (am/are/is).
      • past_simple: <pronoun> <lemma+ed> for regulars (with -y → -ied for study); irregulars use entry.english_past. The be past has per-person variants (was/were/was).
      • past_participle: regulars same as past simple; irregulars use entry.english_participle. Bare (no pronoun).
      • future_will: <pronoun> will <lemma>.
      • future_going_to: <pronoun> <to-be-conjugated-for-person> going to <lemma>. E.g., I am going to work, you are going to work, he is going to work.
  • conjugate_spanish(verb_lemma: str, tense: str, person: str, table=VERB_TABLE) -> str. Same idea for Spanish. Handle regular -ar/-er/-ir patterns; look up irregulars from entry.spanish_irregular_forms.
  • Special-case like → gustar with inverted syntax: instead of <yo/tú/él> <gustar-conjugation>, emit <me/te/le> <gusta> (3sg always) — document this exception in the function.
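
The English present and future rules above can be sketched with plain string operations, in line with the no-regex constraint. This is a partial illustration covering three of the six tense surface-forms; third_person_s and conjugate_english_sketch are hypothetical names, and the real function takes the verb table as a parameter.

```python
ENGLISH_PRONOUNS = {"1sg": "I", "2sg": "you", "3sg": "he"}
BE_PRESENT = {"1sg": "am", "2sg": "are", "3sg": "is"}
VOWELS = "aeiou"


def third_person_s(lemma: str) -> str:
    """Return the 3sg present-simple form of a regular English lemma."""
    if len(lemma) > 1 and lemma.endswith("y") and lemma[-2] not in VOWELS:
        return lemma[:-1] + "ies"  # study -> studies
    if lemma.endswith(("sh", "ch", "s", "x", "z", "o")):
        return lemma + "es"  # watch -> watches, go -> goes
    return lemma + "s"  # work -> works, play -> plays


def conjugate_english_sketch(lemma: str, tense: str, person: str) -> str:
    """Cover present_simple, future_will and future_going_to only."""
    pronoun = ENGLISH_PRONOUNS[person]
    if tense == "present_simple":
        if lemma == "be":
            # be is looked up, never derived from a rule.
            return f"{pronoun} {BE_PRESENT[person]}"
        verb = third_person_s(lemma) if person == "3sg" else lemma
        return f"{pronoun} {verb}"
    if tense == "future_will":
        return f"{pronoun} will {lemma}"
    if tense == "future_going_to":
        return f"{pronoun} {BE_PRESENT[person]} going to {lemma}"
    raise ValueError(f"tense not covered by this sketch: {tense}")
```

Any other verb with an irregular 3sg present would need a table lookup in the same way as be; the suffix helper only handles regular spelling rules.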

Block C — src/minicorpus/mis_conjugate.py

  • For each mis-conjugation type, define a function apply_<type>(correct_form, verb_entry, tense, person) -> Optional[str].
  • Each function:
      • Returns None if the type doesn't apply to this cell (e.g., missing_third_person_s returns None for every cell that is not 3sg present_simple).
      • Otherwise returns the deviant form.
  • The 6 mis-conjugation functions:
      • apply_missing_third_person_s(form, entry, tense, person): for 3sg present_simple, revert the verb to its bare lemma. he works → he work.
      • apply_overregularization_past(form, entry, tense, person): for irregular verbs in past_simple or past_participle, generate the "what if it were regular" form. he went → he goed. eaten → eated.
      • apply_wrong_aux_will_with_to(form, entry, tense, person): for future_will, insert to after will. he will work → he will to work.
      • apply_wrong_aux_going_to_missing_ing(form, entry, tense, person): for future_going_to, replace going with go. I am going to work → I am go to work.
      • apply_subject_verb_disagreement(form, entry, tense, person): wrong auxiliary form for the subject. For you have worked, swap to you has worked. For (you, be, past_simple), you were → you was.
      • apply_bare_participle_missing_aux(form, entry, tense, person): for past_participle, prepend a subject pronoun but omit the auxiliary. gone → he gone (the "correct" expected form would be he has gone).

  • Helper eligible_mis_conjugations(entry, tense, person) -> list[str] returns the names of types that could apply to this cell.
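
As a rough illustration of the contract (return None when the type does not apply, otherwise the deviant form), here is a sketch of three of the six handlers plus a matching eligibility helper. Entry is a minimal stand-in for VerbEntry, regular_past is a hypothetical helper, and the type names returned by the eligibility helper are assumed to match the function suffixes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """Minimal stand-in for VerbEntry, for illustration only."""
    english_lemma: str
    english_past: Optional[str] = None
    english_participle: Optional[str] = None


def regular_past(lemma: str) -> str:
    """Naive regular past: enough for 'what if it were regular' forms."""
    if lemma.endswith("e"):
        return lemma + "d"
    if len(lemma) > 1 and lemma.endswith("y") and lemma[-2] not in "aeiou":
        return lemma[:-1] + "ied"
    return lemma + "ed"


def apply_missing_third_person_s(
    form: str, entry: Entry, tense: str, person: str
) -> Optional[str]:
    if tense != "present_simple" or person != "3sg":
        return None  # eligibility check: nothing to strip in other cells
    pronoun, _, _verb = form.partition(" ")
    deviant = f"{pronoun} {entry.english_lemma}"
    return None if deviant == form else deviant


def apply_overregularization_past(
    form: str, entry: Entry, tense: str, person: str
) -> Optional[str]:
    if tense not in ("past_simple", "past_participle"):
        return None
    if entry.english_past is None:
        return None  # regular verb: the "regularized" form is already correct
    # Replace the last token; works for "he went" and for bare "eaten".
    head, sep, _irregular = form.rpartition(" ")
    return f"{head}{sep}{regular_past(entry.english_lemma)}"


def apply_wrong_aux_will_with_to(
    form: str, entry: Entry, tense: str, person: str
) -> Optional[str]:
    if tense != "future_will":
        return None
    return form.replace(" will ", " will to ", 1)


def eligible_mis_conjugations(entry: Entry, tense: str, person: str) -> List[str]:
    """Eligibility for the three types sketched here (the lab has six)."""
    names: List[str] = []
    if tense == "present_simple" and person == "3sg":
        names.append("missing_third_person_s")
    if tense in ("past_simple", "past_participle") and entry.english_past is not None:
        names.append("overregularization_past")
    if tense == "future_will":
        names.append("wrong_aux_will_with_to")
    return names
```

How be should behave under missing_third_person_s (the sketch would produce he be) is a spec decision, not something this sketch settles.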

Block D — scripts/gen_corpus.py

  • CLI: argparse with --seed (default 42), --output (default data/raw/).
  • Call seed_everything(args.seed).
  • Enumerate all 20 verbs × 6 tense surface-forms × 3 persons = 360 cells.
  • For each cell, emit one correct row with text, spanish, all schema fields, label="correct", mis_conjugation_type=null, correct_form=null.
  • Within each cell, query eligible_mis_conjugations; if non-empty, randomly select 0–2 (gated by an RNG-based coin flip) and emit each as a mis_conjugated row with correct_form populated.
  • Assign sequential id (zero-padded to 4 digits).
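  • Compute fingerprint = sha256(NFC(text)) per row. sha256 operates on bytes, so encode the normalized string first (e.g., as UTF-8) and store the hex digest.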
  • Verify the row passes the JSONSchema before writing (lightweight in-process check; the full validator is lab 02).
  • Write to data/raw/all_rows.jsonl.
  • Emit a partial manifest (data/raw/MANIFEST_partial.json) with seed, version stamps, and total row count.
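
The row-level steps in Block D could be sketched as below. The field names beyond those listed above, the plain zero-padded id with no prefix, the UTF-8 encoding before hashing, and the 50/50 coin flip are all assumptions; the authoritative schema lives in data/corpus_spec.md.

```python
import hashlib
import json
import random
import unicodedata
from typing import Any, Dict, List, Optional


def fingerprint(text: str) -> str:
    """sha256 hex digest of the NFC-normalized text, encoded as UTF-8."""
    normalized = unicodedata.normalize("NFC", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def make_row(
    index: int,
    text: str,
    spanish: str,
    label: str = "correct",
    mis_conjugation_type: Optional[str] = None,
    correct_form: Optional[str] = None,
) -> Dict[str, Any]:
    text = unicodedata.normalize("NFC", text)
    spanish = unicodedata.normalize("NFC", spanish)
    return {
        "id": f"{index:04d}",  # sequential, zero-padded to 4 digits
        "text": text,
        "spanish": spanish,
        "label": label,
        "mis_conjugation_type": mis_conjugation_type,
        "correct_form": correct_form,
        "fingerprint": fingerprint(text),
    }


def pick_mis_conjugations(eligible: List[str], rng: random.Random) -> List[str]:
    """Pick 0-2 eligible types; a coin flip gates whether any are emitted."""
    if not eligible or rng.random() < 0.5:
        return []
    k = rng.randint(1, min(2, len(eligible)))
    return rng.sample(eligible, k)


def to_jsonl_line(row: Dict[str, Any]) -> str:
    # ensure_ascii=False keeps accented characters readable in the file.
    return json.dumps(row, ensure_ascii=False, sort_keys=True)
```

Passing an explicit random.Random(seed) instance rather than using the module-level RNG keeps the selection reproducible regardless of what other code does with the global state.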

Block E — tests/test_conjugate.py

  • One test for each of the 6 tense surface-forms × 1 regular verb × 3 persons.
  • One test for each of the 6 tense surface-forms × 1 irregular verb (go) × 3 persons.
  • Special-case test: (like, present_simple, 1sg) returns I like (English) and me gusta (Spanish).
  • Special-case test: (be, past_simple, 1sg) returns I was; (be, past_simple, 2sg) returns you were.
  • Special-case test: (study, present_simple, 3sg) returns he studies.
  • Special-case test: (watch, present_simple, 3sg) returns he watches.
  • Mis-conjugation tests: at least one for each of the 6 types, asserting the deviant form is what the spec says.
  • mypy --strict src/minicorpus/ clean.
  • ruff clean.

Block F — sanity print

  • At the bottom of the script (or in a separate notebook in experiments/12-corpus-generation/), print:
      • Total rows generated.
      • Per-cell coverage (expect 1 correct + 0–2 mis-conjugated per cell).
      • Sample 10 random rows for human review.
      • Total mis-conjugated rows by type.
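
One possible shape for the sanity summary, using only the standard library. It assumes each row carries verb, tense, and person fields alongside label and mis_conjugation_type; those three field names are hypothetical until checked against the schema.

```python
import random
from collections import Counter
from typing import Any, Dict, List


def summarize(
    rows: List[Dict[str, Any]], seed: int = 42, sample_size: int = 10
) -> Dict[str, Any]:
    correct_per_cell: Counter = Counter()
    mis_per_cell: Counter = Counter()
    by_type: Counter = Counter()
    for r in rows:
        cell = (r["verb"], r["tense"], r["person"])
        if r["label"] == "correct":
            correct_per_cell[cell] += 1
        else:
            mis_per_cell[cell] += 1
            by_type[r["mis_conjugation_type"]] += 1
    # A cell is suspicious unless it has exactly 1 correct row and 0-2 deviants.
    cells = set(correct_per_cell) | set(mis_per_cell)
    bad_cells = sorted(
        c for c in cells if correct_per_cell[c] != 1 or mis_per_cell[c] > 2
    )
    rng = random.Random(seed)  # seeded so the review sample is reproducible
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return {
        "total": len(rows),
        "by_type": dict(by_type),
        "bad_cells": bad_cells,
        "sample": sample,
    }


def print_summary(summary: Dict[str, Any]) -> None:
    print(f"total rows: {summary['total']}")
    print(f"mis-conjugated by type: {summary['by_type']}")
    print(f"cells outside 1 correct + 0-2 deviant: {summary['bad_cells']}")
    for row in summary["sample"]:
        print(row)
```

Returning a dict and printing separately makes the same numbers reusable from a notebook or a test.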

Constraints

  • Pure Python. No numpy needed (the corpus is structured text, not numerical).
  • mypy --strict clean.
  • ruff clean.
  • bandit clean.
  • Determinism enforced via tests/conftest.py seed fixture.
  • No regex-driven conjugation logic. Use explicit string operations + table lookups. Regexes for morphology are hard to debug.
  • NFC normalize every emitted text and spanish field. Use unicodedata.normalize('NFC', s).
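
To see why the NFC constraint matters for Spanish text, the short example below compares two encodings of tú that render identically but differ as Python strings. The nfc helper name is a hypothetical convenience wrapper.

```python
import unicodedata


def nfc(s: str) -> str:
    """Normalize a Python str (not bytes) to NFC."""
    return unicodedata.normalize("NFC", s)


decomposed = "tu\u0301"  # 't', 'u', combining acute accent: 3 code points
composed = "t\u00fa"     # 't', precomposed 'ú': 2 code points

# The raw strings differ, so hashes and equality checks would differ too.
raw_equal = decomposed == composed
# After normalization both collapse to the precomposed form.
normalized_equal = nfc(decomposed) == nfc(composed)
```

Without normalization, two visually identical rows could get different fingerprints and slip past deduplication.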

Stop conditions

Done when:

  1. All five Python files committed under src/minicorpus/ and scripts/.
  2. All tests pass; pytest -q tests/test_conjugate.py is green.
  3. mypy --strict src/minicorpus/ clean.
  4. python scripts/gen_corpus.py --seed 42 produces data/raw/all_rows.jsonl with at least 460 rows (360 correct + ~100–200 mis-conjugated).
  5. The sample-print sanity check looks correct to Borja's eyes — no weird Spanish, no English typos.

Pitfalls

  • study → studies vs study → studys. The -y → -ies rule for verbs ending in consonant + y needs explicit handling.
  • watch → watches, finish → finishes, go → goes. Verbs ending in -ch, -sh, -s, -x, -z, -o add -es, not -s.
  • be is uniquely irregular in present simple. I am, you are, he is. Don't try to derive these from a rule.
  • Spanish stem changes. querer → quiero (e → ie), empezar → empiezo. Enumerate, don't rule-derive.
  • like → gustar syntax inversion. The whole subject becomes the indirect object. Best to special-case like in conjugate_spanish.
  • Mis-conjugation that's actually correct. apply_missing_third_person_s("I work", entry, "present_simple", "1sg") should return None — there's no -s to remove for I work. Easy to forget the eligibility check.
  • Bytes vs codepoints in NFC. Always call unicodedata.normalize('NFC', s) on Python strings (not bytes). Don't mix levels.
  • I capitalization. The English pronoun I is uppercase; everything else in the corpus is lowercase. Don't lowercase the whole string at the end.
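
The two Spanish pitfalls above (enumerated stem changes and the gustar inversion) can be illustrated for the present tense as follows. This is a sketch under assumptions: spanish_present and IRREGULAR_PRESENT are hypothetical names, only the present tense is covered, and the real function reads irregular cells from entry.spanish_irregular_forms.

```python
SPANISH_PRONOUNS = {"1sg": "yo", "2sg": "tú", "3sg": "él"}
INDIRECT_OBJECT = {"1sg": "me", "2sg": "te", "3sg": "le"}

# Regular present-tense endings by infinitive class.
PRESENT_ENDINGS = {
    "ar": {"1sg": "o", "2sg": "as", "3sg": "a"},
    "er": {"1sg": "o", "2sg": "es", "3sg": "e"},
    "ir": {"1sg": "o", "2sg": "es", "3sg": "e"},
}

# Stem-changing forms are enumerated, not derived from a rule.
IRREGULAR_PRESENT = {
    ("querer", "1sg"): "quiero",
    ("querer", "2sg"): "quieres",
    ("querer", "3sg"): "quiere",
}


def spanish_present(lemma: str, person: str) -> str:
    if lemma == "gustar":
        # Inverted syntax: the English subject becomes the indirect object,
        # and the verb stays in 3sg regardless of person.
        return f"{INDIRECT_OBJECT[person]} gusta"
    pronoun = SPANISH_PRONOUNS[person]
    irregular = IRREGULAR_PRESENT.get((lemma, person))
    if irregular is not None:
        return f"{pronoun} {irregular}"
    stem, ending = lemma[:-2], lemma[-2:]
    return f"{pronoun} {stem}{PRESENT_ENDINGS[ending][person]}"
```

Checking the irregular table before applying the ending rule means a new stem-changing verb only needs new table rows, not new logic.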

Hint of last resort

If you are 5 hours in and the conjugation logic is a mess: simplify by enumerating more and computing less. The verb table can have all 360 forms hardcoded if writing the rules is too painful. The rules-vs-lookup tradeoff favors lookup at our scale.

When to consult solutions/

After all tests pass. Solution: solutions/01-implement-generator-ref.md (phase open). The reference contains the full verb table, the conjugation functions, and the mis-conjugation handlers.


Next lab: lab/02-validate-and-split.md.