
Lab 01: Write `scripts/gen_corpus.py`

Goal: implement the corpus generator. Enumerate the 20 verbs × 6 tenses × 3 persons = 360 cells of correct forms, each paired with its Spanish equivalent, and emit mis-conjugations from the closed taxonomy. The output is data/raw/all_rows.jsonl plus a partial manifest.

Estimated time: 4–6 hours.

Prereq: lab 00 committed (data/corpus_spec.md).

What you produce

scripts/gen_corpus.py — entry point. Reads the verb table from src/minicorpus/verb_table.py. Emits data/raw/all_rows.jsonl.
src/minicorpus/__init__.py — re-exports.
src/minicorpus/verb_table.py — the 20-verb table as Python data structures (no logic).
src/minicorpus/conjugate.py — pure functions: conjugate_english(verb, tense, person, table) -> str and the Spanish equivalent.
src/minicorpus/mis_conjugate.py — given a correct form, produce a mis-conjugated form for a given type.
tests/test_conjugate.py — unit tests on a handful of (verb, tense, person) tuples.

The generator is not a single monolithic script — it's a small package. The script orchestrates; the package conjugates.

TODOs

Block A: `src/minicorpus/verb_table.py`

Define VERB_TABLE: dict[str, VerbEntry] with one entry for each of the 20 verbs.
VerbEntry is a frozen dataclass with: english_lemma, spanish_lemma, english_regularity, spanish_regularity, english_past, english_participle, spanish_irregular_forms: dict[(tense, person), str] (sparse — only fill in irregular cells; regular cells are computed from the lemma).
For each of the 8 English irregulars, populate english_past and english_participle.
For each Spanish irregular form, populate spanish_irregular_forms.
Add SPANISH_PRONOUNS: dict[person, str] mapping 1sg → "yo", 2sg → "tú", 3sg → "él".
Add ENGLISH_PRONOUNS: dict[person, str] mapping 1sg → "I", 2sg → "you", 3sg → "he".
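As a rough illustration of Block A, the table could be shaped as follows. This is only a sketch under assumptions: the "regular"/"irregular" string encoding, the choice to store irregular Spanish cells as bare verb forms without the pronoun, and the two example entries are all illustrative; data/corpus_spec.md remains the authority for the real 20 verbs.

```python
from dataclasses import dataclass, field
from typing import Dict, Mapping, Optional, Tuple

# Key into the sparse irregular-form mapping: (tense, person).
CellKey = Tuple[str, str]


@dataclass(frozen=True)
class VerbEntry:
    """One row of the verb table. Data only, no conjugation logic."""

    english_lemma: str
    spanish_lemma: str
    english_regularity: str  # assumed encoding: "regular" or "irregular"
    spanish_regularity: str  # assumed encoding: "regular" or "irregular"
    english_past: Optional[str] = None  # set only for English irregulars
    english_participle: Optional[str] = None  # set only for English irregulars
    # Sparse: only irregular cells appear; regular cells come from the lemma.
    spanish_irregular_forms: Mapping[CellKey, str] = field(default_factory=dict)


# Illustrative subset of the table (assumed entries, not the full 20 verbs).
VERB_TABLE: Dict[str, VerbEntry] = {
    "work": VerbEntry("work", "trabajar", "regular", "regular"),
    "go": VerbEntry(
        english_lemma="go",
        spanish_lemma="ir",
        english_regularity="irregular",
        spanish_regularity="irregular",
        english_past="went",
        english_participle="gone",
        # Only three cells shown; the real entry lists every irregular cell.
        spanish_irregular_forms={
            ("present_simple", "1sg"): "voy",
            ("present_simple", "2sg"): "vas",
            ("present_simple", "3sg"): "va",
        },
    ),
}

SPANISH_PRONOUNS: Dict[str, str] = {"1sg": "yo", "2sg": "tú", "3sg": "él"}
ENGLISH_PRONOUNS: Dict[str, str] = {"1sg": "I", "2sg": "you", "3sg": "he"}
```

Note that frozen=True blocks attribute reassignment but does not make the inner mapping immutable, so treat spanish_irregular_forms as read-only by convention.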

Block B: `src/minicorpus/conjugate.py`

conjugate_english(verb_lemma: str, tense: str, person: str, table=VERB_TABLE) -> str. Pure function. Looks up the entry, computes the surface form per the rules in data/corpus_spec.md.
Handle the 6 tense surface-forms:
infinitive: to <lemma> (or just <lemma> — pick the convention from spec; document).
present_simple: <pronoun> <lemma> for 1sg/2sg; <pronoun> <lemma+s> for 3sg. Add -es instead of -s for verbs ending in -sh, -ch, -s, -x, -z, or -o (watch → watches, go → goes), and apply -y → -ies for verbs ending in consonant + y (study → studies).
past_simple: regulars get <lemma+ed> (with -y → -ied for study); irregulars use entry.english_past. The be past has per-person variants (was/were/was).
past_participle: regulars same as past simple; irregulars use entry.english_participle. Bare (no pronoun).
future_will: <pronoun> will <lemma>.
future_going_to: <pronoun> <to-be-conjugated-for-person> going to <lemma>. E.g., I am going to work, you are going to work, he is going to work.
conjugate_spanish(verb_lemma: str, tense: str, person: str, table=VERB_TABLE) -> str. Same idea for Spanish. Handle regular -ar/-er/-ir patterns; look up irregulars from entry.spanish_irregular_forms.
Special-case like → gustar with inverted syntax: instead of <yo/tú/él> <gustar-conjugation>, emit <me/te/le> <gusta> (3sg always) — document this exception in the function.
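The English present_simple and future_going_to rules above can be sketched with plain string operations and small lookups, with no regexes. The helper names third_person_form, present_simple, and future_going_to are hypothetical, and the have → has override assumes that verb is in the table; the real conjugate_english would dispatch on tense and read from VERB_TABLE.

```python
from typing import Dict, Tuple

ENGLISH_PRONOUNS: Dict[str, str] = {"1sg": "I", "2sg": "you", "3sg": "he"}
BE_PRESENT: Dict[str, str] = {"1sg": "am", "2sg": "are", "3sg": "is"}

# Endings that take -es rather than -s in the third person singular.
ES_ENDINGS: Tuple[str, ...] = ("sh", "ch", "s", "x", "z", "o")
VOWELS = "aeiou"


def third_person_form(lemma: str) -> str:
    """Return the 3sg present-simple form of a verb other than `be`."""
    if lemma == "have":  # assumed to be in the table; irregular 3sg
        return "has"
    # consonant + y -> ies (study -> studies), but vowel + y keeps the y.
    if len(lemma) >= 2 and lemma.endswith("y") and lemma[-2] not in VOWELS:
        return lemma[:-1] + "ies"
    if lemma.endswith(ES_ENDINGS):
        return lemma + "es"
    return lemma + "s"


def present_simple(lemma: str, person: str) -> str:
    pronoun = ENGLISH_PRONOUNS[person]
    if lemma == "be":
        # `be` is looked up, never derived from a rule.
        return f"{pronoun} {BE_PRESENT[person]}"
    verb = third_person_form(lemma) if person == "3sg" else lemma
    return f"{pronoun} {verb}"


def future_going_to(lemma: str, person: str) -> str:
    return f"{ENGLISH_PRONOUNS[person]} {BE_PRESENT[person]} going to {lemma}"
```

Checking the consonant + y case before the -es endings keeps the two rules from interfering, and str.endswith with a tuple avoids a chain of comparisons.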

Block C: `src/minicorpus/mis_conjugate.py`

For each mis-conjugation type, define a function apply_<type>(correct_form, verb_entry, tense, person) -> Optional[str].
Each function:
Returns None if the type doesn't apply to this cell (e.g., missing_third_person_s returns None for every cell other than 3sg present_simple).
Otherwise returns the deviant form.
The 6 mis-conjugation functions:
apply_missing_third_person_s(form, entry, tense, person) — strip the -s from 3sg present-simple. he works → he work.
apply_overregularization_past(form, entry, tense, person) — for irregular verbs in past_simple or past_participle, generate the "what if it were regular" form. he went → he goed. eaten → eated.
apply_wrong_aux_will_with_to(form, entry, tense, person) — for future_will, insert to after will. he will work → he will to work.
apply_wrong_aux_going_to_missing_ing(form, entry, tense, person): for future_going_to, replace "going" with "go". I am going to work → I am go to work.
apply_subject_verb_disagreement(form, entry, tense, person) — wrong auxiliary form. For you have worked, swap to you has worked. For (you, be, past_simple) you were → you was.
apply_bare_participle_missing_aux(form, entry, tense, person): for past_participle, prepend a pronoun but omit the auxiliary. gone → he gone (the correct form would be he has gone).
Helper eligible_mis_conjugations(entry, tense, person) -> list[str] returns the names of types that could apply to this cell.
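Two of the six handlers might look like the sketch below, which puts the eligibility check first so that ineligible cells return None. The Entry class is a hypothetical stand-in for VerbEntry, rebuilding the deviant form from the lemma (rather than slicing off a suffix) is one possible reading of "strip the -s", and excluding be is an assumption the spec may override.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for VerbEntry with only the field used here."""

    english_lemma: str


def apply_missing_third_person_s(
    form: str, entry: Entry, tense: str, person: str
) -> Optional[str]:
    # Eligibility first: only 3sg present-simple cells carry an -s to drop.
    if tense != "present_simple" or person != "3sg":
        return None
    if entry.english_lemma == "be":  # assumption: `he be` is out of scope
        return None
    # Rebuild from the lemma so `he studies` becomes `he study`, not `he studie`.
    pronoun = form.split(" ", 1)[0]
    return f"{pronoun} {entry.english_lemma}"


def apply_wrong_aux_will_with_to(
    form: str, entry: Entry, tense: str, person: str
) -> Optional[str]:
    if tense != "future_will" or " will " not in form:
        return None
    # Insert `to` after the first `will` only.
    return form.replace(" will ", " will to ", 1)
```

For example, apply_missing_third_person_s("he studies", Entry("study"), "present_simple", "3sg") gives "he study", while the same call with "I work" and "1sg" gives None.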

Block D: `scripts/gen_corpus.py`

Block E: `tests/test_conjugate.py`

One test for each of the 6 tense surface-forms × 1 regular verb × 3 persons.
One test for each of the 6 tense surface-forms × 1 irregular verb (go) × 3 persons.
Special-case test: (like, present_simple, 1sg) returns I like (English) and me gusta (Spanish).
Special-case test: (be, past_simple, 1sg) returns I was; (be, past_simple, 2sg) returns you were.
Special-case test: (study, present_simple, 3sg) returns he studies.
Special-case test: (watch, present_simple, 3sg) returns he watches.
Mis-conjugation tests: at least one for each of the 6 types, asserting the deviant form is what the spec says.
mypy --strict src/minicorpus/ clean.
ruff clean.

Block F: sanity print

At the bottom of the script (or in a separate notebook in experiments/12-corpus-generation/), print:
Total rows generated.
Per-cell coverage (expect 1 correct + 0–2 mis-conjugated per cell).
Sample 10 random rows for human review.
Total mis-conjugated rows by type.
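One way to gather these four numbers is sketched below. Apart from text and spanish, the row field names (verb, tense, person, mis_type) are assumptions about the row schema, with mis_type set to None for correct rows; adjust them to whatever data/corpus_spec.md defines.

```python
import random
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def summarize(rows: List[Row], seed: int = 42, sample_size: int = 10) -> Dict[str, Any]:
    """Collect the sanity-check figures without printing anything."""
    per_cell = Counter((r["verb"], r["tense"], r["person"]) for r in rows)
    per_type = Counter(r["mis_type"] for r in rows if r["mis_type"] is not None)
    # A dedicated Random instance keeps the sample reproducible for a given seed.
    rng = random.Random(seed)
    sample = rng.sample(rows, k=min(sample_size, len(rows)))
    return {
        "total": len(rows),
        "per_cell": per_cell,
        "per_type": per_type,
        "sample": sample,
    }


def print_summary(summary: Dict[str, Any]) -> None:
    print(f"total rows: {summary['total']}")
    print(f"cells covered: {len(summary['per_cell'])}")
    for mis_type, count in sorted(summary["per_type"].items()):
        print(f"  {mis_type}: {count}")
    for row in summary["sample"]:
        print(f"  {row['text']!r} / {row['spanish']!r}")
```

Separating summarize from print_summary keeps the counting testable; a cell whose per_cell count falls outside 1 to 3 would contradict the expected 1 correct + 0–2 mis-conjugated rows.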

Constraints

Pure Python. No numpy needed (the corpus is structured text, not numerical).
mypy --strict clean.
ruff clean.
bandit clean.
Determinism enforced via tests/conftest.py seed fixture.
No regex-driven conjugation logic. Use explicit string operations + table lookups. Regexes for morphology are hard to debug.
NFC normalize every emitted text and spanish field. Use unicodedata.normalize('NFC', s).
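A minimal sketch of the NFC constraint applied at serialization time follows. The helper name to_jsonl_line is hypothetical, and sorting keys is an assumption made here to keep output byte-stable across runs.

```python
import json
import unicodedata
from typing import Any, Dict


def to_jsonl_line(row: Dict[str, Any]) -> str:
    """Serialize one row, NFC-normalizing the text and spanish fields."""
    clean = dict(row)  # shallow copy so the caller's row is left untouched
    for key in ("text", "spanish"):
        clean[key] = unicodedata.normalize("NFC", clean[key])
    # ensure_ascii=False keeps accented characters readable in the file;
    # sort_keys=True makes the line independent of dict insertion order.
    return json.dumps(clean, ensure_ascii=False, sort_keys=True)
```

When writing the file, open it with encoding="utf-8" explicitly so the result does not depend on the platform default.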

Stop conditions

Done when:

All five Python files committed under src/minicorpus/ and scripts/.
All tests pass; pytest -q tests/test_conjugate.py is green.
mypy --strict src/minicorpus/ clean.
python scripts/gen_corpus.py --seed 42 produces data/raw/all_rows.jsonl with at least 460 rows (360 correct + ~100–200 mis-conjugated).
The sample-print sanity check looks correct to Borja's eyes — no weird Spanish, no English typos.

Pitfalls

study → studies vs study → studys. The -y → -ies rule for verbs ending in consonant + y needs explicit handling.
watch → watches, finish → finishes. Verbs ending in -ch, -sh, -s, -x, -z, or -o (go → goes) add -es, not -s.
be is uniquely irregular in present simple. I am, you are, he is. Don't try to derive these from a rule.
Spanish stem changes. querer → quiero (e → ie), empezar → empiezo. Enumerate, don't rule-derive.
like → gustar syntax inversion. The whole subject becomes the indirect object. Best to special-case like in conjugate_spanish.
Mis-conjugation that's actually correct. apply_missing_third_person_s("I work", entry, "present_simple", "1sg") should return None — there's no -s to remove for I work. Easy to forget the eligibility check.
Bytes vs codepoints in NFC. unicodedata.normalize('NFC', s) accepts only Python strings (str), not bytes. Decode first, normalize, and only then encode to UTF-8 when writing.
I capitalization. The English pronoun I is uppercase; everything else in the corpus is lowercase. Don't lowercase the whole string at the end.
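The like → gustar inversion from the pitfalls above can be isolated in a single branch, as in this sketch. It covers only present_simple and regular -ar verbs; the function name spanish_present and the GUSTAR_CLITICS table are hypothetical, and how the other tenses map to Spanish is left to data/corpus_spec.md.

```python
from typing import Dict

SPANISH_PRONOUNS: Dict[str, str] = {"1sg": "yo", "2sg": "tú", "3sg": "él"}
# Indirect-object clitics that replace the subject pronoun for gustar.
GUSTAR_CLITICS: Dict[str, str] = {"1sg": "me", "2sg": "te", "3sg": "le"}
AR_PRESENT_ENDINGS: Dict[str, str] = {"1sg": "o", "2sg": "as", "3sg": "a"}


def spanish_present(spanish_lemma: str, person: str) -> str:
    """Present-tense surface form for gustar and regular -ar verbs only."""
    if spanish_lemma == "gustar":
        # Exception: the English subject becomes the indirect object,
        # and the verb stays in 3sg regardless of `person`.
        return f"{GUSTAR_CLITICS[person]} gusta"
    if not spanish_lemma.endswith("ar"):
        raise NotImplementedError("sketch covers -ar verbs only")
    stem = spanish_lemma[:-2]
    return f"{SPANISH_PRONOUNS[person]} {stem}{AR_PRESENT_ENDINGS[person]}"
```

Keeping the exception as the first branch makes it easy to find and to document in the function docstring.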

Hint of last resort

If 5 hours in and the conjugation logic is a mess: simplify by enumerating more and computing less. The verb table can have all 360 forms hardcoded if writing the rule is too painful. The rules-vs-lookup tradeoff favors lookup for our scale.

When to consult `solutions/`

After all tests pass. Solution: solutions/01-implement-generator-ref.md (phase open). The reference contains the full verb table, the conjugation functions, and the mis-conjugation handlers.

Next lab: lab/02-validate-and-split.md.
