Lab 01 — Write scripts/gen_corpus.py¶
Goal: implement the corpus generator. Enumerate the 20 × 6 × 3 = 360 cells of correct forms with Spanish pairs, and emit mis-conjugations from the closed taxonomy. Output is `data/raw/all_rows.jsonl` plus a partial manifest.
Estimated time: 4–6 hours.
Prereq: lab 00 committed (`data/corpus_spec.md`).
What you produce¶
- `scripts/gen_corpus.py`: entry point. Reads the verb table from `src/minicorpus/verb_table.py`. Emits `data/raw/all_rows.jsonl`.
- `src/minicorpus/__init__.py`: re-exports.
- `src/minicorpus/verb_table.py`: the 20-verb table as Python data structures (no logic).
- `src/minicorpus/conjugate.py`: pure functions: `conjugate_english(verb, tense, person, table) -> str` and the Spanish equivalent.
- `src/minicorpus/mis_conjugate.py`: given a correct form, produce a mis-conjugated form for a given type.
- `tests/test_conjugate.py`: unit tests on a handful of (verb, tense, person) tuples.
The generator is not a single monolithic script — it's a small package. The script orchestrates; the package conjugates.
TODOs¶
Block A — src/minicorpus/verb_table.py¶
- Define `VERB_TABLE: dict[str, VerbEntry]` with one entry for each of the 20 verbs.
- `VerbEntry` is a frozen dataclass with: `english_lemma`, `spanish_lemma`, `english_regularity`, `spanish_regularity`, `english_past`, `english_participle`, `spanish_irregular_forms: dict[(tense, person), str]` (sparse: only fill in irregular cells; regular cells are computed from the lemma).
- For each of the 8 English irregulars, populate `english_past` and `english_participle`.
- For each Spanish irregular form, populate `spanish_irregular_forms`.
- Add `SPANISH_PRONOUNS: dict[person, str]` mapping `1sg → "yo"`, `2sg → "tú"`, `3sg → "él"`.
- Add `ENGLISH_PRONOUNS: dict[person, str]` mapping `1sg → "I"`, `2sg → "you"`, `3sg → "he"`.
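As a rough illustration of the table's shape, here is a minimal sketch. It assumes string-valued regularity flags and `Optional` past forms that stay `None` for regular verbs; neither encoding is fixed by the text above. Only two illustrative entries are shown, and the Spanish cells listed for `ir` are a partial sample rather than the full irregular set.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class VerbEntry:
    english_lemma: str
    spanish_lemma: str
    english_regularity: str  # assumed encoding: "regular" | "irregular"
    spanish_regularity: str  # assumed encoding: "regular" | "irregular"
    english_past: Optional[str] = None        # only set for English irregulars
    english_participle: Optional[str] = None  # only set for English irregulars
    # Sparse mapping keyed by (tense, person); regular cells are left out.
    spanish_irregular_forms: Dict[Tuple[str, str], str] = field(default_factory=dict)


# Illustrative subset only; the real table has 20 entries.
VERB_TABLE: Dict[str, VerbEntry] = {
    "work": VerbEntry("work", "trabajar", "regular", "regular"),
    "go": VerbEntry(
        "go", "ir", "irregular", "irregular",
        english_past="went",
        english_participle="gone",
        spanish_irregular_forms={
            ("present_simple", "1sg"): "voy",
            ("present_simple", "2sg"): "vas",
            ("present_simple", "3sg"): "va",
        },
    ),
}

SPANISH_PRONOUNS: Dict[str, str] = {"1sg": "yo", "2sg": "tú", "3sg": "él"}
ENGLISH_PRONOUNS: Dict[str, str] = {"1sg": "I", "2sg": "you", "3sg": "he"}
```

Because the dataclass is frozen, an accidental assignment such as `VERB_TABLE["go"].english_past = "goed"` raises instead of silently corrupting the table.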
Block B — src/minicorpus/conjugate.py¶
- `conjugate_english(verb_lemma: str, tense: str, person: str, table=VERB_TABLE) -> str`. Pure function. Looks up the entry, computes the surface form per the rules in `data/corpus_spec.md`.
- Handle the 6 tense surface-forms:
    - `infinitive`: `to <lemma>` (or just `<lemma>`; pick the convention from the spec and document it).
    - `present_simple`: `<pronoun> <lemma>` for `1sg`/`2sg`; `<pronoun> <lemma+s>` for `3sg`. Use `-es` for verbs ending in `-sh`, `-ch`, `-s`, `-x`, `-z`, `-o` (`watch → watches`), and `-y → -ies` for verbs ending in consonant + `y` (`study → studies`).
    - `past_simple`: regulars get `<lemma+ed>` (with `-y → -ied` for `study`); irregulars use `entry.english_past`. The `be` past has per-person variants (`was`/`were`/`was`).
    - `past_participle`: regulars use the same verb form as past simple; irregulars use `entry.english_participle`. Bare (no pronoun).
    - `future_will`: `<pronoun> will <lemma>`.
    - `future_going_to`: `<pronoun> <to-be-conjugated-for-person> going to <lemma>`. E.g., `I am going to work`, `you are going to work`, `he is going to work`.
- `conjugate_spanish(verb_lemma: str, tense: str, person: str, table=VERB_TABLE) -> str`. Same idea for Spanish. Handle regular `-ar`/`-er`/`-ir` patterns; look up irregulars from `entry.spanish_irregular_forms`.
- Special-case `like → gustar` with inverted syntax: instead of `<yo/tú/él> <gustar-conjugation>`, emit `<me/te/le> <gusta>` (3sg always). Document this exception in the function.
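The English rules above can be sketched with plain string operations and lookups. This is a simplified illustration, not the reference solution: it drops the `table` parameter in favor of small module-level dicts, assumes the `to <lemma>` infinitive convention, special-cases `have → has` on the assumption that the verb might be in the table, and ignores consonant doubling (`stop → stopped`).

```python
PRONOUNS = {"1sg": "I", "2sg": "you", "3sg": "he"}
BE_PRESENT = {"1sg": "am", "2sg": "are", "3sg": "is"}
BE_PAST = {"1sg": "was", "2sg": "were", "3sg": "was"}
# Illustrative subset of irregulars; the real data lives in the verb table.
IRREGULAR_PAST = {"go": "went", "eat": "ate"}
IRREGULAR_PARTICIPLE = {"go": "gone", "eat": "eaten", "be": "been"}
VOWELS = "aeiou"


def third_person_s(lemma: str) -> str:
    """Return the 3sg present-simple verb form for a non-`be` lemma."""
    if lemma == "have":  # assumption: handled as a one-off
        return "has"
    if lemma.endswith("y") and len(lemma) > 1 and lemma[-2] not in VOWELS:
        return lemma[:-1] + "ies"  # study -> studies
    if lemma.endswith(("sh", "ch", "s", "x", "z", "o")):
        return lemma + "es"  # watch -> watches, go -> goes
    return lemma + "s"


def regular_past(lemma: str) -> str:
    """Return the regular past form (no consonant doubling)."""
    if lemma.endswith("e"):
        return lemma + "d"  # like -> liked
    if lemma.endswith("y") and len(lemma) > 1 and lemma[-2] not in VOWELS:
        return lemma[:-1] + "ied"  # study -> studied
    return lemma + "ed"


def conjugate_english(lemma: str, tense: str, person: str) -> str:
    pronoun = PRONOUNS[person]
    if tense == "infinitive":
        return f"to {lemma}"
    if tense == "present_simple":
        if lemma == "be":
            return f"{pronoun} {BE_PRESENT[person]}"
        verb = third_person_s(lemma) if person == "3sg" else lemma
        return f"{pronoun} {verb}"
    if tense == "past_simple":
        if lemma == "be":
            return f"{pronoun} {BE_PAST[person]}"
        return f"{pronoun} {IRREGULAR_PAST.get(lemma) or regular_past(lemma)}"
    if tense == "past_participle":
        # Bare form: no pronoun.
        return IRREGULAR_PARTICIPLE.get(lemma) or regular_past(lemma)
    if tense == "future_will":
        return f"{pronoun} will {lemma}"
    if tense == "future_going_to":
        return f"{pronoun} {BE_PRESENT[person]} going to {lemma}"
    raise ValueError(f"unknown tense: {tense}")
```

Checking the suffix rules in order (consonant + `y` first, then the `-es` endings, then plain `-s`) keeps each branch a single string test.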
Block C — src/minicorpus/mis_conjugate.py¶
- For each mis-conjugation type, define a function `apply_<type>(correct_form, verb_entry, tense, person) -> Optional[str]`.
- Each function:
    - Returns `None` if the type doesn't apply to this cell (e.g., `missing_third_person_s` returns `None` for any cell that is not 3sg present-simple).
    - Otherwise returns the deviant form.
- The 6 mis-conjugation functions:
    - `apply_missing_third_person_s(form, entry, tense, person)`: strip the `-s` from 3sg present-simple. `he works → he work`.
    - `apply_overregularization_past(form, entry, tense, person)`: for irregular verbs in `past_simple` or `past_participle`, generate the "what if it were regular" form. `he went → he goed`. `eaten → eated`.
    - `apply_wrong_aux_will_with_to(form, entry, tense, person)`: for `future_will`, insert `to` after `will`. `he will work → he will to work`.
    - `apply_wrong_aux_going_to_missing_ing(form, entry, tense, person)`: for `future_going_to`, change `going` to `go`. `I am going to work → I am go to work`.
    - `apply_subject_verb_disagreement(form, entry, tense, person)`: wrong auxiliary form. For `you have worked`, swap to `you has worked`. For `(you, be, past_simple)`, `you were → you was`.
    - `apply_bare_participle_missing_aux(form, entry, tense, person)`: for `past_participle`, prepend a pronoun but no auxiliary. `gone → he gone` (the "correct" expected form would be `he has gone`).
- Helper `eligible_mis_conjugations(entry, tense, person) -> list[str]` returns the names of types that could apply to this cell.
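Four of the six handlers can be sketched as below. `Entry` is a hypothetical stand-in for `VerbEntry` that carries only the fields used here. Two choices are assumptions of this sketch rather than rules from the spec: the missing `-s` form is rebuilt from the lemma instead of stripping characters (so `he studies` becomes `he study`, not `he studie`), and `be` is skipped because its forms do not follow the suffix rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for VerbEntry with only the fields used here."""
    english_lemma: str
    english_past: Optional[str] = None  # None means the verb is regular


def apply_missing_third_person_s(
    form: str, entry: Entry, tense: str, person: str
) -> Optional[str]:
    # Eligibility check first: only 3sg present-simple carries an -s.
    if tense != "present_simple" or person != "3sg":
        return None
    if entry.english_lemma == "be":  # assumption: handled elsewhere
        return None
    pronoun = form.split(" ", 1)[0]
    return f"{pronoun} {entry.english_lemma}"


def apply_overregularization_past(
    form: str, entry: Entry, tense: str, person: str
) -> Optional[str]:
    if tense not in ("past_simple", "past_participle"):
        return None
    if entry.english_past is None or entry.english_lemma == "be":
        return None  # regular verbs have nothing to over-regularize
    lemma = entry.english_lemma
    regularized = lemma + ("d" if lemma.endswith("e") else "ed")
    if tense == "past_participle":
        return regularized  # participles are bare
    pronoun = form.split(" ", 1)[0]
    return f"{pronoun} {regularized}"


def apply_wrong_aux_will_with_to(
    form: str, entry: Entry, tense: str, person: str
) -> Optional[str]:
    if tense != "future_will":
        return None
    return form.replace(" will ", " will to ", 1)


def apply_wrong_aux_going_to_missing_ing(
    form: str, entry: Entry, tense: str, person: str
) -> Optional[str]:
    if tense != "future_going_to":
        return None
    return form.replace(" going to ", " go to ", 1)
```

Every handler starts with its eligibility guard, so a caller can try all of them on a cell and keep only the non-`None` results.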
Block D — scripts/gen_corpus.py¶
- CLI: `argparse` with `--seed` (default 42), `--output` (default `data/raw/`).
- Call `seed_everything(args.seed)`.
- Enumerate all 20 verbs × 6 tense surface-forms × 3 persons = 360 cells.
- For each cell, emit one correct row with `text`, `spanish`, all schema fields, `label="correct"`, `mis_conjugation_type=null`, `correct_form=null`.
- Within each cell, query `eligible_mis_conjugations`; if non-empty, randomly select 0–2 (gated by an RNG-based coin flip) and emit each as a `mis_conjugated` row with `correct_form` populated.
- Assign sequential `id` (zero-padded to 4 digits).
- Compute `fingerprint = sha256(NFC(text))` per row.
- Verify the row passes the JSONSchema before writing (lightweight in-process check; the full validator is lab 02).
- Write to `data/raw/all_rows.jsonl`.
- Emit a partial manifest (`data/raw/MANIFEST_partial.json`) with seed, version stamps, and total row count.
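The building blocks of the script might look like the sketch below. Several details are assumptions: the helper names are hypothetical, the fingerprint hashes the UTF-8 encoding of the NFC text, the coin flip is 50/50, and the row only carries the fields named above (the real schema in `data/corpus_spec.md` has more). A local `random.Random` stands in for whatever `seed_everything` sets up.

```python
import argparse
import hashlib
import itertools
import json
import random
import unicodedata
from typing import Any, Dict, List, Optional, Sequence, Tuple

TENSES = ("infinitive", "present_simple", "past_simple",
          "past_participle", "future_will", "future_going_to")
PERSONS = ("1sg", "2sg", "3sg")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Generate the corpus rows.")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output", default="data/raw/")
    return parser


def enumerate_cells(verbs: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Cartesian product verb x tense x person, in a stable order."""
    return list(itertools.product(verbs, TENSES, PERSONS))


def fingerprint(text: str) -> str:
    normalized = unicodedata.normalize("NFC", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pick_mis_conjugations(eligible: Sequence[str], rng: random.Random) -> List[str]:
    """Coin flip first; on heads, pick 1 or 2 distinct eligible types."""
    if not eligible or rng.random() < 0.5:
        return []
    count = rng.randint(1, min(2, len(eligible)))
    return rng.sample(list(eligible), count)


def make_row(row_id: int, text: str, spanish: str, label: str,
             mis_type: Optional[str] = None,
             correct_form: Optional[str] = None) -> Dict[str, Any]:
    text = unicodedata.normalize("NFC", text)
    return {
        "id": f"{row_id:04d}",  # zero-padded to 4 digits
        "text": text,
        "spanish": unicodedata.normalize("NFC", spanish),
        "label": label,
        "mis_conjugation_type": mis_type,
        "correct_form": correct_form,
        "fingerprint": fingerprint(text),
    }


def to_jsonl_line(row: Dict[str, Any]) -> str:
    # ensure_ascii=False keeps accented Spanish readable in the file.
    return json.dumps(row, ensure_ascii=False)
```

Passing the RNG explicitly, rather than calling the module-level `random` functions, makes it easier to show that the same seed yields the same selection.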
Block E — tests/test_conjugate.py¶
- One test for each of the 6 tense surface-forms × 1 regular verb × 3 persons.
- One test for each of the 6 tense surface-forms × 1 irregular verb (`go`) × 3 persons.
- Special-case test: `(like, present_simple, 1sg)` returns `I like` (English) and `me gusta` (Spanish).
- Special-case test: `(be, past_simple, 1sg)` returns `I was`; `(be, past_simple, 2sg)` returns `you were`.
- Special-case test: `(study, present_simple, 3sg)` returns `he studies`.
- Special-case test: `(watch, present_simple, 3sg)` returns `he watches`.
- Mis-conjugation tests: at least one for each of the 6 types, asserting the deviant form is what the spec says.
- `mypy --strict src/minicorpus/` clean.
- `ruff` clean.
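One way to keep these tests compact is a table of `(cell, expected)` pairs. The sketch below is framework-free so it runs on its own; `failing_cases` is a hypothetical helper, and in the real suite the same table would more naturally feed `pytest.mark.parametrize`. The expected strings are the ones listed above.

```python
from typing import Callable, List, Tuple

Cell = Tuple[str, str, str]  # (verb, tense, person)

ENGLISH_CASES: List[Tuple[Cell, str]] = [
    (("like", "present_simple", "1sg"), "I like"),
    (("be", "past_simple", "1sg"), "I was"),
    (("be", "past_simple", "2sg"), "you were"),
    (("study", "present_simple", "3sg"), "he studies"),
    (("watch", "present_simple", "3sg"), "he watches"),
]


def failing_cases(
    conjugate: Callable[[str, str, str], str],
    cases: List[Tuple[Cell, str]],
) -> List[str]:
    """Run every case and return a readable message per mismatch."""
    failures = []
    for cell, expected in cases:
        actual = conjugate(*cell)
        if actual != expected:
            failures.append(f"{cell}: expected {expected!r}, got {actual!r}")
    return failures
```

Calling `failing_cases(conjugate_english, ENGLISH_CASES)` and asserting the result is empty reports every broken cell at once instead of stopping at the first.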
Block F — sanity print¶
- At the bottom of the script (or in a separate notebook in `experiments/12-corpus-generation/`), print:
    - Total rows generated.
    - Per-cell coverage (expect 1 correct + 0–2 mis-conjugated per cell).
    - Sample 10 random rows for human review.
    - Total mis-conjugated rows by type.
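A possible shape for the sanity summary, assuming each row is a dict with `verb`, `tense`, `person`, `label`, and `mis_conjugation_type` keys (the first three field names are guesses; use whatever the schema defines). The `summarize` helper is hypothetical and returns the numbers rather than printing them, so the same function can back both the script and a notebook.

```python
import random
from collections import Counter
from typing import Any, Dict, List


def summarize(rows: List[Dict[str, Any]], seed: int = 42,
              sample_size: int = 10) -> Dict[str, Any]:
    # Rows per (verb, tense, person) cell: 1 correct + 0-2 mis-conjugated.
    per_cell = Counter((r["verb"], r["tense"], r["person"]) for r in rows)
    out_of_range = [cell for cell, n in per_cell.items() if not 1 <= n <= 3]
    by_type = Counter(
        r["mis_conjugation_type"] for r in rows if r["label"] == "mis_conjugated"
    )
    # Seeded sample so the rows shown for review are reproducible.
    sample = random.Random(seed).sample(rows, min(sample_size, len(rows)))
    return {
        "total": len(rows),
        "cells": len(per_cell),
        "out_of_range_cells": out_of_range,
        "by_type": dict(by_type),
        "sample": sample,
    }
```

An empty `out_of_range_cells` list confirms the expected coverage of one to three rows per cell.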
Constraints¶
- Pure Python. No `numpy` needed (the corpus is structured text, not numerical).
- `mypy --strict` clean.
- `ruff` clean.
- `bandit` clean.
- Determinism enforced via `tests/conftest.py` seed fixture.
- No regex-driven conjugation logic. Use explicit string operations + table lookups. Regexes for morphology are hard to debug.
- NFC normalize every emitted `text` and `spanish` field. Use `unicodedata.normalize('NFC', s)`.
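To see why the NFC constraint matters, the sketch below compares a decomposed and a composed spelling of `tú`. The `nfc` wrapper is a hypothetical convenience; its explicit type check only gives a clearer error than the one `unicodedata.normalize` raises for non-`str` input.

```python
import unicodedata


def nfc(value: str) -> str:
    """Normalize a Python str to NFC; refuse bytes outright."""
    if not isinstance(value, str):
        raise TypeError("normalize str values, not bytes; decode first")
    return unicodedata.normalize("NFC", value)


decomposed = "tu\u0301"  # "u" followed by a combining acute accent
composed = "t\u00fa"     # precomposed "ú"

# The two render identically but differ code point by code point
# until both are normalized.
same_before = decomposed == composed
same_after = nfc(decomposed) == nfc(composed)
```

Without normalization, two visually identical rows would get different fingerprints.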
Stop conditions¶
Done when:
- All five Python files committed under `src/minicorpus/` and `scripts/`.
- All tests pass; `pytest -q tests/test_conjugate.py` is green.
- `mypy --strict src/minicorpus/` clean.
- `python scripts/gen_corpus.py --seed 42` produces `data/raw/all_rows.jsonl` with at least 460 rows (360 correct + ~100–200 mis-conjugated).
- The sample-print sanity check looks correct to Borja's eyes: no weird Spanish, no English typos.
Pitfalls¶
- `study → studies` vs `study → studys`. The `-y → -ies` rule for verbs ending in consonant + `y` needs explicit handling.
- `watch → watches`, `finish → finishes`. Verbs ending in `-ch`, `-sh`, `-s`, `-x`, `-z`, `-o` add `-es`, not `-s`.
- `be` is uniquely irregular in present simple. `I am`, `you are`, `he is`. Don't try to derive these from a rule.
- Spanish stem changes. `querer → quiero` (e → ie), `empezar → empiezo`. Enumerate, don't rule-derive.
- `like → gustar` syntax inversion. The whole subject becomes the indirect object. Best to special-case `like` in `conjugate_spanish`.
- Mis-conjugation that's actually correct. `apply_missing_third_person_s("I work", entry, "present_simple", "1sg")` should return `None`: there's no `-s` to remove for `I work`. Easy to forget the eligibility check.
- Bytes vs codepoints in NFC. Always call `unicodedata.normalize('NFC', s)` on Python strings (not bytes). Don't mix levels.
- `I` capitalization. The English pronoun `I` is uppercase; everything else in the corpus is lowercase. Don't lowercase the whole string at the end.
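The Spanish pitfalls can be illustrated with a present-tense-only sketch: enumerate the stem-changing forms, compute only the regular endings, and special-case `gustar`. The function name `spanish_present` and its lemma-based signature are assumptions for illustration; the lab's `conjugate_spanish` takes the English lemma and a tense as well.

```python
import unicodedata

SPANISH_PRONOUNS = {"1sg": "yo", "2sg": "tú", "3sg": "él"}
INDIRECT_OBJECT = {"1sg": "me", "2sg": "te", "3sg": "le"}
PRESENT_ENDINGS = {
    "ar": {"1sg": "o", "2sg": "as", "3sg": "a"},
    "er": {"1sg": "o", "2sg": "es", "3sg": "e"},
    "ir": {"1sg": "o", "2sg": "es", "3sg": "e"},
}
# Stem-changing verbs are enumerated, not derived from a rule.
IRREGULAR_PRESENT = {
    ("querer", "1sg"): "quiero",
    ("querer", "2sg"): "quieres",
    ("querer", "3sg"): "quiere",
    ("empezar", "1sg"): "empiezo",
    ("empezar", "2sg"): "empiezas",
    ("empezar", "3sg"): "empieza",
}


def spanish_present(lemma: str, person: str) -> str:
    if lemma == "gustar":
        # Inverted syntax: the English subject becomes the indirect
        # object, and the verb stays in 3sg for every person.
        surface = f"{INDIRECT_OBJECT[person]} gusta"
    else:
        verb = IRREGULAR_PRESENT.get((lemma, person))
        if verb is None:
            stem, ending = lemma[:-2], lemma[-2:]
            verb = stem + PRESENT_ENDINGS[ending][person]
        surface = f"{SPANISH_PRONOUNS[person]} {verb}"
    return unicodedata.normalize("NFC", surface)
```

Checking the irregular lookup before the regular rule means a missing table entry silently falls back to the regular form, so a test per irregular verb is worth the few extra lines.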
Hint of last resort¶
If you are 5 hours in and the conjugation logic is a mess, simplify by enumerating more and computing less. The verb table can hardcode all 360 forms if writing the rules is too painful. At this scale, the rules-vs-lookup tradeoff favors lookup.
When to consult solutions/¶
After all tests pass. Solution: solutions/01-implement-generator-ref.md (phase open). The reference contains the full verb table, the conjugation functions, and the mis-conjugation handlers.
Next lab: lab/02-validate-and-split.md.