Phase 12 — The Corpus: Designing the Microscopic Dataset

Requires: 11 — Tokenization Theory + BPE Implementation
Teaches: corpus-design · enumeration · stratified-split · data-manifest · reproducibility
Jump to any chapter from the phase reference index.

Chapter map

Pre-written per §A12. Theory and lab statements are stable drafts; solutions are written just-in-time at phase open. Updated for §A13 (English verb grammar, which supersedes §A1).

The corpus is the lever: small, opinionated, reproducible. We enumerate the matrix of 20 verbs × 5 tenses × 3 persons, each with its Spanish pair. Full coverage is non-negotiable, and the controlled mis-conjugations are the supervision that the Phase 32 tutor needs.

Goal

Borja designs, implements, and validates the canonical project corpus: the dataset every later phase trains on. Per §A13, the corpus enumerates all 20 in-scope verbs (12 regular + 8 irregular) × 5 tenses (infinitive, present simple, past simple, past participle, simple future) × 3 persons (1st sg I, 2nd sg you, 3rd sg he/she/it), with a Spanish translation pair for every English form (per §A2). Because simple future is split into two surface forms in the corpus (will and going to), the 5 tenses yield 6 tense surfaces. On top of that, a curated set of deliberate mis-conjugations, labeled with mis_conjugation_type, gives Phase 32's tutor agent supervised correction targets. The phase's headline artefact is data/MANIFEST.json, whose hashes pin the corpus and whose per-cell counts confirm 360-cell coverage (20 × 6 × 3). See theory/00-motivation.md for the rationale behind the 5-vs-6 surface-form split.
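The 20 × 6 × 3 arithmetic can be made concrete with a minimal enumeration sketch. It is not the project's scripts/gen_corpus.py: the verb IDs are hypothetical placeholders (the real verb table belongs in data/corpus_spec.md), and the tense-surface and person identifiers are assumed names.

```python
from itertools import product

# Hypothetical placeholders: the real 20 verbs come from the project's verb table.
VERBS = [f"verb_{i:02d}" for i in range(20)]

# Assumed identifiers for the 6 tense surfaces (5 tenses, simple future split in two).
TENSE_SURFACES = [
    "infinitive",
    "present_simple",
    "past_simple",
    "past_participle",
    "future_will",
    "future_going_to",
]

# Assumed identifiers for the 3 in-scope persons: I, you, he/she/it.
PERSONS = ["1sg", "2sg", "3sg"]


def enumerate_cells():
    """Return every (verb, tense_surface, person) cell in a fixed, repeatable order."""
    return list(product(VERBS, TENSE_SURFACES, PERSONS))


cells = enumerate_cells()
assert len(cells) == 20 * 6 * 3  # 360 cells
assert len(set(cells)) == len(cells)  # no duplicate cells
```

Enumerating from fixed lists, rather than sampling, keeps the cell order deterministic, which is one simple way to make coverage checks and hashes repeatable.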

Read order

theory/00-motivation.md — why a tiny enumerated bilingual corpus beats a noisy 100× scrape for this project.
theory/01-schema-and-labels.md — the row schema, the (verb, tense, person, regularity, label) tuple, and the mis-conjugation taxonomy.
theory/02-leakage-and-splits.md — what data leakage looks like for a morphology-learning task, and the (verb, tense)-stratified split that prevents it.
theory/03-reproducibility-and-versioning.md — seeded generation, NFC/UTF-8 normalization, SHA256 manifests, dvc minimal usage.
lab/00-corpus-spec.md — write data/corpus_spec.md (the schema + verb table + mis-conjugation taxonomy). Hand exercise.
lab/01-implement-generator.md — write scripts/gen_corpus.py. Enumerate all 360 cells (20 × 6 tense surfaces × 3 persons); emit Spanish pairs; emit mis-conjugations.
lab/02-validate-and-split.md — write scripts/validate_corpus.py + scripts/split_corpus.py. Coverage check, dedup, (verb, tense)-stratified split, leakage check.
lab/03-version-with-dvc.md — dvc add data/processed/, write the MANIFEST, commit.

solutions/ is empty during pre-write — populated at phase open.
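As a rough illustration of the (verb, tense)-stratified split and leakage check named in lab/02, here is a minimal sketch. It is not the project's scripts/split_corpus.py. The row fields (verb, tense, english), the split fractions, and the definition of leakage used here (the same English string appearing in more than one split) are all assumptions; theory/02-leakage-and-splits.md holds the project's actual definition.

```python
import random
from collections import defaultdict


def stratified_split(rows, seed=0, val_frac=0.1, test_frac=0.1):
    """Split rows into train/val/test separately inside each (verb, tense) stratum."""
    strata = defaultdict(list)
    for row in rows:
        strata[(row["verb"], row["tense"])].append(row)

    rng = random.Random(seed)  # local RNG so the split depends only on the seed
    splits = {"train": [], "val": [], "test": []}
    for key in sorted(strata):  # sorted keys keep the iteration order stable
        group = list(strata[key])
        rng.shuffle(group)
        n = len(group)
        # Assumption: a stratum with >= 3 rows gives at least one row to val and test.
        n_val = max(1, round(n * val_frac)) if n >= 3 else 0
        n_test = max(1, round(n * test_frac)) if n >= 3 else 0
        splits["val"].extend(group[:n_val])
        splits["test"].extend(group[n_val:n_val + n_test])
        splits["train"].extend(group[n_val + n_test:])
    return splits


def find_leakage(splits):
    """Return English strings that occur in more than one split (assumed leakage rule)."""
    first_seen = {}
    leaked = set()
    for name, rows in splits.items():
        for row in rows:
            text = row["english"]
            if first_seen.setdefault(text, name) != name:
                leaked.add(text)
    return leaked


# Synthetic demo rows; the verbs and strings are placeholders, not corpus data.
rows = [
    {"verb": v, "tense": t, "english": f"{v}-{t}-{i}"}
    for v in ("verb_a", "verb_b")
    for t in ("past_simple", "future_will")
    for i in range(5)
]
splits = stratified_split(rows, seed=12)
assert not find_leakage(splits)
```

Stratifying per (verb, tense) means every stratum large enough is represented in all three splits, so no verb/tense combination is evaluated without ever having been trained on.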

Definition of Done

See PHASE_12_PLAN.md §6. Briefly:

data/MANIFEST.json lists exactly 360 (verb, tense-surface, person) cells (20 × 6 × 3), each with ≥ 1 correct form.
Every English row has a non-empty spanish field.
Mis-conjugation count: ≥ 100 rows across ≥ 4 distinct types.
Re-running gen_corpus.py from the same seed reproduces identical SHA256s (CI test).
Train/val/test split is (verb, tense)-stratified; no leakage.
Coverage heatmap, length-distribution plot, and mis-conjugation Pareto plot are committed.
dvc add performed; .dvc files committed.
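Several of the criteria above are mechanical enough to check in code. The sketch below is not the project's scripts/validate_corpus.py, and it assumes a row layout that the source does not spell out: dict rows with verb, tense, person, spanish, and mis_conjugation_type fields, where correct forms carry mis_conjugation_type = None. The hashing helper shows one possible way to get seed-stable SHA256s (NFC normalization, sorted keys, UTF-8 bytes); the real MANIFEST may hash files instead.

```python
import hashlib
import json
import unicodedata
from collections import Counter


def corpus_sha256(rows):
    """Hash rows in a canonical form: sorted keys, NFC-normalized text, UTF-8 bytes."""
    digest = hashlib.sha256()
    for row in rows:
        line = json.dumps(row, sort_keys=True, ensure_ascii=False)
        digest.update(unicodedata.normalize("NFC", line).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def check_definition_of_done(rows, expected_cells=360, min_mis=100, min_types=4):
    """Return a list of failure messages; an empty list means the checks passed."""
    failures = []
    # Assumption: correct forms are the rows whose mis_conjugation_type is None.
    correct = [r for r in rows if r.get("mis_conjugation_type") is None]
    wrong = [r for r in rows if r.get("mis_conjugation_type") is not None]

    # Coverage: every (verb, tense-surface, person) cell has at least one correct form.
    cells = {(r["verb"], r["tense"], r["person"]) for r in correct}
    if len(cells) != expected_cells:
        failures.append(f"expected {expected_cells} covered cells, found {len(cells)}")

    # Every row must carry a non-empty Spanish pair.
    if any(not (r.get("spanish") or "").strip() for r in rows):
        failures.append("at least one row has an empty spanish field")

    # Mis-conjugations: enough rows, spread over enough distinct types.
    types = Counter(r["mis_conjugation_type"] for r in wrong)
    if len(wrong) < min_mis or len(types) < min_types:
        failures.append(
            f"mis-conjugations: {len(wrong)} rows across {len(types)} types "
            f"(need >= {min_mis} rows, >= {min_types} types)"
        )
    return failures


# NFC normalization makes composed and decomposed accents hash identically.
assert corpus_sha256([{"spanish": "caf\u00e9"}]) == corpus_sha256([{"spanish": "cafe\u0301"}])
```

Returning a list of failures rather than raising on the first one lets a CI job report every unmet criterion in a single run.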

What this phase intentionally does NOT cover

Free-form English text scraping. We deliberately stay enumerative. The §A13 grammar is finite; scraping adds noise we don't need.
Plurals. Per §A13, plurals are deferred. Persons are 1st sg I, 2nd sg you, 3rd sg he/she/it; we/they are out of scope.
Other tenses. Present continuous (I am working), present perfect (I have worked), conditionals, and the subjunctive are all deferred. v1 covers exactly the 5 tenses in §A13.
Multi-clause sentences. Single subject + verb (+ optional auxiliary) per row. No coordination, no embedding.
Tokenization of the corpus. Phase 11 trains the tokenizer; Phase 12 produces the source text. The corpus stores raw strings, not token IDs.
Embedding training. Phase 13 trains embeddings on the corpus produced here.
Other languages besides English + Spanish pairs. Per §A2: Spanish is the only paired language.

Phase 12's scope is the canonical enumerated bilingual corpus of English verb forms with Spanish translations and controlled mis-conjugations. Nothing more.