00 - Why a tiny enumerated bilingual corpus beats a giant scraped one

Low-quality data defeats any architecture. We hand-build a small, opinionated corpus: the full enumeration of 20 verbs × 5 tenses × 3 persons, each with its Spanish translation. Labels are correct by construction, coverage is guaranteed, and the corpus is reproducible from a seed. Whatever the model learns about verb grammar comes from here; if this is wrong, everything is wrong.

"Garbage in, mystery out"

A model is a compressor: it learns patterns in its training data and reproduces them at inference. If the patterns in the data are wrong, the model is wrong. No clever architecture, no fancy regularizer, no amount of compute fixes a corrupt corpus.

This is well-known. It's also routinely ignored. Public AI projects in 2026 still build models on scraped data with unverified labels — and then spend months debugging "why does the model do X?" when the answer is "your training set says to do X."

Phase 12 is the highest-leverage-per-hour phase in the curriculum. Done well, the rest of the project is straightforward. Done badly, Phases 13–32 produce a model that "mysteries out" on every interesting input, including the grammar tutor capstone.

The two extremes (and why we pick the small one)

Extreme A: scrape the open web for English text

100 GiB of English text from Common Crawl. Sounds great. Problems:

No morphological labels. A scraped sentence He works hard. carries no annotation saying "this is 3rd-person singular present simple of work." To get those labels, you need a parser (which is itself a model, so the labeling becomes circular). To get Spanish pairs, you need a translator (another model). The whole pipeline becomes "trust other models' outputs as labels," which is exactly what we said we wouldn't do.
Distributional skew. Real English text is wildly skewed: the verb be is ~10× more common than walk; 3rd-person he/she/it is ~5× rarer than 1st-person I in dialogue but ~3× more common in narrative. Without explicit balancing, the model overfits to whichever skew the scrape happened to have.
Tense coverage is non-uniform. Past perfect is rare; simple present is everywhere. We want our 20 × 5 × 3 matrix uniformly covered.
License risk. Scraped text has licensing complications. Republishing is its own legal headache.
Reproducibility. The web today ≠ the web yesterday. You can't pin a corpus this way.

Extreme B: hand-write 600 sentences

Type out every English form + Spanish translation by hand. Quality? Excellent. Cost? ~3–5 hours of pure data entry. Borja has better things to do; also, typos will creep in and propagate silently.

Our pick: enumerated synthetic

For each of 20 verbs × 6 tense surface-forms × 3 persons = 360 cells (the simple future splits into will and going to surfaces; see "The 20 × 6 × 3 coverage matrix" below), enumerate the English form via a verb-table-driven generator. For each English form, look up the Spanish form in a parallel Spanish-conjugation table. For a curated subset of cells, emit a deliberately wrong form with a mis_conjugation_type label.

| Metric | Web-scrape | Hand-type | Enumerated synthetic |
| --- | --- | --- | --- |
| Cost (hours) | 100s (cleaning) | 5 (typing) | ~2 (verb table) + 4 (generator + pairs) |
| Label quality | Low (inferred) | High (manual) | Perfect (by construction) |
| Coverage | Skewed | Manual | Guaranteed (360 cells) |
| Reproducibility | None | Manual | Seeded |
| License | Tangled | Owned | Owned |
| Spanish pairs | Translate model | Manual | From a static dictionary |

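The enumerated approach wins on every axis except raw authoring time, where its ~6 hours is on par with hand-typing. The real cost is opinionated authoring: the §A13 micro-grammar is the spec. That's the whole point of §A13's microscopic scope: it is small enough that we can make the corpus complete.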

The 20 × 6 × 3 coverage matrix (5 tenses, 6 surface-forms)

Per §A13, the verbs in scope are:

Regular (12): work, play, walk, talk, listen, watch, study, finish, start, look, want, like

Irregular (8): be, have, do, go, come, see, eat, write

The 5 tenses in scope:

Infinitive: the bare form work (the to-infinitive is to work).
Present simple: 3rd person singular adds -s (or -es): I work, he works.
Past simple: regulars add -ed; irregulars use stored stems (go → went).
Past participle: for regulars, the same as past simple (worked); for irregulars, often distinct (go → gone, eat → eaten), though not always (have → had for both).
Simple future: both the will-form (I will work) and the going to-form (I am going to work).
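The suffix rules above can be sketched as a few small functions. This is a minimal illustration, not the project's generator: the function names and the tiny irregular tables are assumptions, and the person-specific present forms of be (am, are, is) are deliberately left out.

```python
# Hypothetical helpers illustrating the suffix rules; all names are assumptions.
VOWELS = "aeiou"

# Only the irregular forms quoted in this section, plus have -> has/had.
IRREGULAR_3SG = {"have": "has"}
IRREGULAR_PAST = {"go": "went", "eat": "ate", "have": "had"}
IRREGULAR_PARTICIPLE = {"go": "gone", "eat": "eaten", "have": "had"}


def third_person_s(lemma: str) -> str:
    """Present simple, 3rd person singular: add -s, -es, or -ies."""
    if lemma in IRREGULAR_3SG:
        return IRREGULAR_3SG[lemma]
    if lemma.endswith(("s", "sh", "ch", "x", "z", "o")):
        return lemma + "es"            # watch -> watches, go -> goes
    if lemma.endswith("y") and lemma[-2] not in VOWELS:
        return lemma[:-1] + "ies"      # study -> studies
    return lemma + "s"                 # work -> works, play -> plays


def past_simple(lemma: str) -> str:
    """Past simple: stored stem for irregulars, -ed family for regulars."""
    if lemma in IRREGULAR_PAST:
        return IRREGULAR_PAST[lemma]
    if lemma.endswith("e"):
        return lemma + "d"             # like -> liked
    if lemma.endswith("y") and lemma[-2] not in VOWELS:
        return lemma[:-1] + "ied"      # study -> studied
    return lemma + "ed"                # work -> worked


def past_participle(lemma: str) -> str:
    """Past participle: same as past simple unless a distinct form is stored."""
    return IRREGULAR_PARTICIPLE.get(lemma, past_simple(lemma))


print(third_person_s("watch"), past_simple("study"), past_participle("go"))
```

None of the 12 regular verbs in scope needs consonant doubling, so the sketch omits that rule; a wider verb list would need it.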

The 3 persons (per §A13):

1st sg I.
2nd sg you.
3rd sg he / she / it: we use he as the canonical 3rd-sg pronoun in the corpus to keep the matrix small; the model learns the morphology, not the lexical choice of pronoun.

So the matrix is:

                    I       you     he
infinitive          •       •       •
pres simple         •       •       •
past simple         •       •       •
past participle     •       •       •
will-future         •       •       •
going-to-future     •       •       •

That is 6 rows × 3 columns = 18 cells per verb, and 18 × 20 verbs = 360 cells. The §A13 wording bundles will and going to as one "simple future" tense; we split them into two rows per verb so each surface form gets its own cell. With the rows named in full:

                                I       you     he
infinitive (bare)               •       •       •
present simple                  •       •       •
past simple                     •       •       •
past participle                 •       •       •
simple future (will)            •       •       •
simple future (going to)        •       •       •

6 tense-variants × 3 persons × 20 verbs = 360 cells. The §A13 "5 tenses" is the conceptual count; the corpus splits one of them into two surface forms, so cell count is 360, not 300. We'll use 360 going forward as the canonical cell count. The validator checks >= 360.
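The cell count can be checked mechanically. The sketch below enumerates the matrix with `itertools.product` and reports any cell a set of rows fails to cover. The tense and person identifiers (`future_will`, `3sg`, and so on) and the row keys are assumed encodings for illustration; the real ones belong to the schema document.

```python
from itertools import product

REGULAR = ["work", "play", "walk", "talk", "listen", "watch",
           "study", "finish", "start", "look", "want", "like"]
IRREGULAR = ["be", "have", "do", "go", "come", "see", "eat", "write"]
VERBS = REGULAR + IRREGULAR

# Assumed identifiers: 5 tenses, with simple future split into two surfaces.
TENSE_SURFACES = ["infinitive", "present_simple", "past_simple",
                  "past_participle", "future_will", "future_going_to"]
PERSONS = ["1sg", "2sg", "3sg"]


def enumerate_cells():
    """Every (verb, tense surface, person) cell of the coverage matrix."""
    return list(product(VERBS, TENSE_SURFACES, PERSONS))


def missing_cells(rows):
    """Cells of the matrix that no row in `rows` covers."""
    seen = {(r["verb"], r["tense"], r["person"]) for r in rows}
    return sorted(set(enumerate_cells()) - seen)


cells = enumerate_cells()
print(len(cells))  # 20 * 6 * 3 = 360
```

A validator in this style would fail the build whenever `missing_cells` returns a non-empty list.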

(Note: past participle as a standalone form is unusual outside auxiliary constructions like I have worked. We include it for the model to learn the participle morphology; downstream phases can synthesize auxiliary constructions if needed.)

Why exact coverage matters: morphological generalization

If only 280 of the 360 cells appear in training, the embedding+model learns the morphology partially. The 80 unseen cells become a generalization test — but a biased one: we don't know whether the missing cells are easy (regular verbs with standard suffixes) or hard (irregular past participles).

Exact coverage means: the model sees every (verb, tense, person) at least once. Generalization tests happen on unseen sentences containing seen morphology — e.g., He works hard. (a sentence-context wrap, optional v1.5) — not on missing forms.

This is the right scoping for the §A13 grammar tutor: the tutor's job is to recognize a mis-conjugation like he work, and to output the correct form (he works). It doesn't need to extrapolate to unseen verb stems. It needs to memorize the 360-cell table robustly enough to apply it to mis-conjugated inputs.

Mis-conjugations: the supervised target for Phase 32

Beyond the 360 correct forms, the corpus contains deliberate mis-conjugations — wrong forms paired with their corrections. Examples:

| Wrong | Type | Correction |
| --- | --- | --- |
| `he work` | `missing_third_person_s` | `he works` |
| `I goed` | `overregularization_past` | `I went` |
| `she eated` | `overregularization_past` | `she ate` |
| `he will to work` | `wrong_aux_will_with_to` | `he will work` |
| `I am go to work` | `wrong_aux_going_to_missing_ing` | `I am going to work` |
| `you has worked` | `subject_verb_disagreement` | `you have worked` |
| `he gone` | `bare_participle_missing_aux` | `he has gone` |

Each mis-conjugation row carries:

text: the wrong form (e.g., he work).
label: mis_conjugated.
mis_conjugation_type: one of the canonical codes.
correct_form: the right form (he works).
The same (verb, tense, person, regularity) fields as the correct rows.

This is exactly the supervision Phase 32's grammar tutor needs: input = mis-conjugated, output = correct + type explanation.
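As a concrete picture of one such row, here is a sketch of a mis-conjugation record serialized as a JSONL line, with a small field check. The field names come from the list above; the value encodings for `tense` and `person`, the single `regularity` key, and the `validate_mis_row` helper are assumptions for illustration.

```python
import json

# Field names follow the list above; value encodings are assumed.
REQUIRED_FIELDS = ("text", "label", "mis_conjugation_type", "correct_form",
                   "verb", "tense", "person", "regularity")

row = {
    "text": "he work",
    "label": "mis_conjugated",
    "mis_conjugation_type": "missing_third_person_s",
    "correct_form": "he works",
    "verb": "work",
    "tense": "present_simple",   # assumed identifier
    "person": "3sg",             # assumed identifier
    "regularity": "regular",
}


def validate_mis_row(candidate: dict) -> list:
    """Return a list of problems; an empty list means the row looks well-formed."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in candidate]
    if candidate.get("label") != "mis_conjugated":
        problems.append("label must be 'mis_conjugated'")
    if candidate.get("text") == candidate.get("correct_form"):
        problems.append("wrong form and correction are identical")
    return problems


line = json.dumps(row, ensure_ascii=False)  # one JSONL line
print(line)
print(validate_mis_row(row))
```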

Anti-pattern: generating mis-conjugations at random by perturbing the correct form's bytes. Real grammatical errors follow patterns (overregularization, missing agreement morpheme); random bytes don't.

Each mis-conjugation type is enumerated explicitly by the generator, drawing from a closed taxonomy of ~6 types (see 01-schema-and-labels.md).
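To contrast with random byte perturbation, here is a sketch of pattern-driven generation for three of the types in the table. Each function applies one named error pattern and returns the wrong form, its type code, and the correction. The function signatures and the two-entry irregular table are assumptions, not the generator's actual interface.

```python
# Assumed helper table: only the irregular pasts quoted in this section.
IRREGULAR_PAST = {"go": "went", "eat": "ate"}


def missing_third_person_s(lemma: str, third_sg_form: str):
    """Drop the agreement morpheme: he works -> he work."""
    return (f"he {lemma}", "missing_third_person_s", f"he {third_sg_form}")


def overregularization_past(subject: str, lemma: str):
    """Apply the regular -ed rule to an irregular verb: I went -> I goed."""
    suffix = "d" if lemma.endswith("e") else "ed"
    return (f"{subject} {lemma}{suffix}", "overregularization_past",
            f"{subject} {IRREGULAR_PAST[lemma]}")


def wrong_aux_will_with_to(subject: str, lemma: str):
    """Insert a spurious 'to' after will: he will work -> he will to work."""
    return (f"{subject} will to {lemma}", "wrong_aux_will_with_to",
            f"{subject} will {lemma}")


examples = [
    missing_third_person_s("work", "works"),
    overregularization_past("I", "go"),
    wrong_aux_will_with_to("he", "work"),
]
for wrong, kind, correct in examples:
    print(f"{wrong!r} [{kind}] -> {correct!r}")
```

Because every wrong form is produced by a named rule, the type label is known at generation time rather than inferred afterwards.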

Spanish pairs: built-in cross-lingual signal

Per §A2, every English row has a Spanish translation. For our enumerated grammar, this means: the generator also enumerates the 20 Spanish verb infinitives (trabajar, jugar, caminar, hablar, escuchar, ver/mirar, estudiar, terminar, empezar, mirar, querer, gustar, ser/estar, tener, hacer, ir, venir, ver, comer, escribir) and their conjugations across the corresponding Spanish tenses and persons.

Subtleties:

Spanish has more person distinctions than English (yo, tú, él/ella/usted, …). We map: 1st sg I → yo, 2nd sg you → tú, 3rd sg he → él. (usted is the formal 2nd sg; out of scope.)
Some English verbs map ambiguously (be → ser or estar; look → mirar or parecer; like → gustar with inverted syntax). For v1, the generator picks one canonical Spanish lemma per English verb, recorded in the verb table. Document the choice; Phase 13 may revisit.
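Spanish irregulars don't always align with English irregulars. Some pairs agree: work (regular EN) maps to trabajar (regular ES), and go (irregular EN) maps to ir (highly irregular ES). Others don't: eat (irregular EN) maps to comer (regular ES), and want (regular EN) maps to querer (stem-changing ES). The regularity field is therefore per-language: the row carries english_regularity and spanish_regularity separately.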
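A verb-table entry with one canonical Spanish lemma and per-language regularity might look like the sketch below, together with a present-tense lookup for the three persons in scope. The table layout, the `es_lemma` key, and the `spanish_present` helper are assumptions; only four verbs are filled in, and the Spanish forms shown are standard present-indicative conjugations.

```python
# Assumed table layout: one canonical Spanish lemma per English verb.
VERB_TABLE = {
    "work": {"es_lemma": "trabajar", "english_regularity": "regular",   "spanish_regularity": "regular"},
    "eat":  {"es_lemma": "comer",    "english_regularity": "irregular", "spanish_regularity": "regular"},
    "want": {"es_lemma": "querer",   "english_regularity": "regular",   "spanish_regularity": "irregular"},
    "go":   {"es_lemma": "ir",       "english_regularity": "irregular", "spanish_regularity": "irregular"},
}

PRONOUN_MAP = {"I": "yo", "you": "tú", "he": "él"}
PERSON_INDEX = {"I": 0, "you": 1, "he": 2}

# Regular present-indicative endings for yo / tú / él.
PRESENT_ENDINGS = {"ar": ("o", "as", "a"), "er": ("o", "es", "e"), "ir": ("o", "es", "e")}
# Stored forms for the irregular lemmas used in this sketch.
IRREGULAR_PRESENT = {"ir": ("voy", "vas", "va"),
                     "querer": ("quiero", "quieres", "quiere")}


def spanish_present(english_verb: str, pronoun_en: str) -> str:
    """Spanish present form paired with an English (verb, pronoun) cell."""
    lemma = VERB_TABLE[english_verb]["es_lemma"]
    idx = PERSON_INDEX[pronoun_en]
    if lemma in IRREGULAR_PRESENT:
        form = IRREGULAR_PRESENT[lemma][idx]
    else:
        form = lemma[:-2] + PRESENT_ENDINGS[lemma[-2:]][idx]
    return f"{PRONOUN_MAP[pronoun_en]} {form}"


print(spanish_present("work", "he"))
print(spanish_present("go", "you"))
```

Keeping the regularity flags per language means a row can be regular in English and irregular in Spanish (or the reverse) without any special casing.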

Why bilingual?

Phase 13's headline check (embedding visualization) wants to see whether work and trabajar end up near each other in the embedding space. That's only possible if the corpus contains both consistently.
Tokenizer (Phase 11) coverage. Spanish non-ASCII letters (ñ, á, é) take two bytes each in UTF-8, which forces the byte-level BPE to learn multi-byte merges. That is exactly the cross-script robustness that theory 03 of Phase 11 argues for.
Pedagogically, Borja gets cross-lingual receipts that the geometry is meaningful, not just coincidental clustering of English suffixes.
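The tokenizer point is easy to verify: each of the Spanish letters mentioned above occupies two bytes in UTF-8, so a byte-level tokenizer sees them as byte pairs until it learns a merge. The check below assumes the text is in precomposed (NFC) form, which the `normalize` call enforces.

```python
import unicodedata

for ch in "ñáé":
    # Normalize to NFC so each letter is a single precomposed code point.
    nfc = unicodedata.normalize("NFC", ch)
    print(nfc, list(nfc.encode("utf-8")))

byte_lengths = {ch: len(unicodedata.normalize("NFC", ch).encode("utf-8"))
                for ch in "ñáé"}
```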

What "opinionated" buys us

An enumerated corpus is opinionated by definition — we choose what's in it. That's the source of its value:

We can guarantee the model sees every morpheme. No corpus blind spots.
We can balance every (verb, tense, person) cell exactly. No skew advantage from over-represented be.
We can stress-test specific mis-conjugation types. If the agent does poorly on wrong_aux_will_with_to ("he will to work"), we can generate more such examples.

The trade-off: the corpus doesn't reflect real English's distribution. That's fine — Phase 32's tutor is evaluated on similarly enumerated test inputs. If we ever want to generalize to "real English in the wild," that's a v2 (Phase 38+ refinement).

Reproducibility is non-negotiable

Per CLAUDE.md §0.5, every numeric/data-producing script seeds RNGs and writes a manifest. For Phase 12:

gen_corpus.py takes a --seed (default 42). Same seed → same SHA256 of every output file. Note: most of the corpus is deterministic enumeration — the seed only affects (a) the order of rows in the output JSONL and (b) which subset of mis-conjugation types is applied to which (verb, tense, person) cells.
MANIFEST.json records the seed, the Python version, the corpus spec version, and SHA256s of all output files.
CI runs gen_corpus.py --seed 42 and asserts the SHA256 matches the manifest.
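The seed-to-hash chain can be sketched in a few lines. This is not `gen_corpus.py`: the toy row set, the function names, the file name `corpus.jsonl`, and the `spec_version` value are all placeholders. The point is only that a seeded shuffle plus canonical serialization gives byte-identical output, and therefore an identical SHA256, for the same seed.

```python
import hashlib
import json
import platform
import random


def generate_rows(seed: int) -> list:
    """Toy stand-in for the generator: fixed rows, seeded row order."""
    rows = [{"verb": v, "person": p}
            for v in ("work", "go") for p in ("1sg", "2sg", "3sg")]
    random.Random(seed).shuffle(rows)  # local RNG; no global state touched
    return rows


def serialize(rows: list) -> bytes:
    """Canonical JSONL bytes: sorted keys, UTF-8, one row per line."""
    return "".join(json.dumps(r, sort_keys=True, ensure_ascii=False) + "\n"
                   for r in rows).encode("utf-8")


def build_manifest(seed: int, files: dict) -> dict:
    """Record the seed, Python version, spec version, and per-file SHA256."""
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "spec_version": "placeholder",  # hypothetical value
        "sha256": {name: hashlib.sha256(data).hexdigest()
                   for name, data in files.items()},
    }


first = build_manifest(42, {"corpus.jsonl": serialize(generate_rows(42))})
second = build_manifest(42, {"corpus.jsonl": serialize(generate_rows(42))})
print(first == second)  # same seed, same manifest
```

A CI check in this style regenerates the files and compares the resulting digests with the committed manifest. Recording the Python version helps explain a mismatch if the runtime changes.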

Why so paranoid? Because Phase 13 trains embeddings on this corpus, Phase 16 trains the model, Phase 32 evaluates the agent. If the corpus changes silently, every downstream result is invalid. The SHA256 chain ties them together.

Leakage: the silent killer for morphology

The classic data-science failure mode: training data leaks into the test split, accuracy looks great, generalization is zero.

For a morphology task, the leakage mode is sneaky:

Split row-by-row: train sees he works, test sees she works (assuming we included she too). The model "generalizes" by memorizing the pattern of any single (verb, tense) cell.
Fix: split by (verb, tense). All 3 persons of a (verb, tense) go to the same split. The model is forced to learn the morphology from other cells.

That said, even splitting by (verb, tense) still lets information cross the boundary: the model can learn -s from one verb's present-simple cells and apply it to another verb's. That's desired generalization, not leakage. Theory 02 makes the distinction precise.
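The split-by-(verb, tense) fix amounts to assigning whole groups, not individual rows, to a split. A minimal sketch follows; the function name, the row keys, and the 20% test fraction are assumptions for illustration.

```python
import random
from collections import defaultdict


def split_by_verb_tense(rows, test_fraction=0.2, seed=42):
    """Assign every (verb, tense) group, with all its persons, to one split."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["verb"], row["tense"])].append(row)

    keys = sorted(groups)                 # stable order before shuffling
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))

    test = [r for k in keys[:n_test] for r in groups[k]]
    train = [r for k in keys[n_test:] for r in groups[k]]
    return train, test


demo_rows = [{"verb": v, "tense": t, "person": p}
             for v in ("work", "go", "eat")
             for t in ("present_simple", "past_simple")
             for p in ("1sg", "2sg", "3sg")]
train_rows, test_rows = split_by_verb_tense(demo_rows)

train_keys = {(r["verb"], r["tense"]) for r in train_rows}
test_keys = {(r["verb"], r["tense"]) for r in test_rows}
print(train_keys & test_keys)  # set(): no group straddles the split
```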

How Phase 12 connects to everything later

Phase 11 (BPE) retrains on the Phase 12 corpus once it's done. The bootstrap corpus in Phase 11 (~30 hand-typed sentences) is replaced.
Phase 13 (embeddings) trains on Phase 12's snippets. The headline visualization tests whether work ↔ trabajar cluster.
Phase 14 (n-gram baseline) computes perplexity on Phase 12's test split — the baseline-to-beat.
Phase 16 (training loop) trains the model on Phase 12's train split.
Phase 32 (grammar tutor) evaluates on Phase 12's test split + new generated probes (mis-conjugations the model hasn't seen).

Every one of those phases depends on this one. Do it well.

What this phase does NOT cover

Plurals. Per §A13, plurals are deferred.
Negative forms. I don't work and friends — out of scope for v1.
Questions. Do you work? — out of scope.
Modal verbs other than will. No can, should, would — out of scope.
Aspect. Continuous (I am working), perfect (I have worked) — out of scope.
Free-form sentence wrappers. Optional v1.5 enhancement.

One-paragraph recap

A small, enumerated, bilingual corpus is the right tool for this curriculum. We author a verb table per the 20 × 6 × 3 = 360-cell matrix (5 tenses, 6 surface forms), the generator emits every cell deterministically, every English row is paired with its Spanish translation, and a controlled set of deliberate mis-conjugations gives Phase 32's tutor agent supervised correction targets. The 360-cell coverage is non-negotiable (§A13). Leakage prevention requires splitting by (verb, tense), so that all persons of a cell land in the same split. The corpus is the project's spec; if it's wrong, nothing later can be right.

Next: theory/01-schema-and-labels.md.