Phase 12 — The Corpus: Designing the Microscopic Dataset
Requires: 11 — Tokenization Theory + BPE Implementation
Teaches: corpus-design · enumeration · stratified-split · data-manifest · reproducibility
Jump to any chapter from the phase reference index.
Chapter map
Pre-written per §A12. Theory and lab statements are stable drafts; solutions are written just-in-time at phase open. Updated for §A13 (English verb grammar, supersedes §A1).
The corpus is the lever: small, opinionated, reproducible. We enumerate the matrix of 20 verbs × 5 tenses × 3 persons, each with its Spanish pair. Full coverage is non-negotiable, and the controlled mis-conjugations are the supervision that the Phase 32 tutor needs.
Goal
Borja designs, implements, and validates the canonical project corpus: the dataset every later phase trains on. Per §A13, the corpus enumerates all 20 in-scope verbs (12 regular + 8 irregular) × 5 tenses (infinitive, present simple, past simple, past participle, simple future) × 3 persons (1st sg I, 2nd sg you, 3rd sg he/she/it), with Spanish translation pairs for every English form (per §A2). Because simple future is split into two surface forms in the corpus, will and going to, the 5 tenses yield 6 tense surfaces. On top of that, a curated set of deliberate mis-conjugations with mis_conjugation_type labels gives Phase 32's tutor agent supervised correction targets. The phase headline artefact is data/MANIFEST.json, whose hashes pin the corpus and whose per-cell counts confirm 360-cell coverage (20 × 6 × 3). See theory/00-motivation.md for the rationale behind the 5-vs-6 surface-form split.
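The 360-cell count can be checked with a minimal enumeration sketch. The verb IDs below are hypothetical placeholders (the real 20-verb table lives in §A13), and the tense-surface and person identifiers are assumed names, not the project's confirmed schema.

```python
from itertools import product

# Hypothetical placeholder IDs: the real table of 20 verbs
# (12 regular + 8 irregular) is defined in §A13, not here.
VERBS = [f"verb_{i:02d}" for i in range(1, 21)]

# Five tenses, with simple future split into two surface forms,
# giving six tense surfaces. Identifier spellings are assumptions.
TENSE_SURFACES = [
    "infinitive",
    "present_simple",
    "past_simple",
    "past_participle",
    "future_will",
    "future_going_to",
]

# 1st sg "I", 2nd sg "you", 3rd sg "he/she/it" (assumed short codes).
PERSONS = ["1sg", "2sg", "3sg"]


def enumerate_cells():
    """Return every (verb, tense_surface, person) cell exactly once."""
    return list(product(VERBS, TENSE_SURFACES, PERSONS))


cells = enumerate_cells()
assert len(cells) == 20 * 6 * 3 == 360
assert len(set(cells)) == len(cells)  # no duplicate cells
```

Enumerating with a Cartesian product rather than sampling means coverage holds by construction; a validator then only needs to confirm that each cell received at least one correct form.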
Read order
- theory/00-motivation.md: why a tiny enumerated bilingual corpus beats a noisy 100× scrape for this project.
- theory/01-schema-and-labels.md: the row schema, the (verb, tense, person, regularity, label) tuple, and the mis-conjugation taxonomy.
- theory/02-leakage-and-splits.md: what data leakage looks like for a morphology-learning task, and the (verb, tense)-stratified split that prevents it.
- theory/03-reproducibility-and-versioning.md: seeded generation, NFC/UTF-8 normalization, SHA256 manifests, minimal dvc usage.
- lab/00-corpus-spec.md: write data/corpus_spec.md (the schema + verb table + mis-conjugation taxonomy). Hand exercise.
- lab/01-implement-generator.md: write scripts/gen_corpus.py. Enumerate all 360 cells (20 × 6 tense surfaces × 3 persons); emit Spanish pairs; emit mis-conjugations.
- lab/02-validate-and-split.md: write scripts/validate_corpus.py + scripts/split_corpus.py. Coverage check, dedup, (verb, tense)-stratified split, leakage check.
- lab/03-version-with-dvc.md: dvc add data/processed/, write the MANIFEST, commit.
solutions/ is empty during pre-write — populated at phase open.
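To make the NFC/UTF-8 normalization and SHA256 manifest idea from theory/03 concrete, here is a small sketch of why normalization must happen before hashing. The row fields, the helper names canonical_bytes and sha256_hex, and the JSON-lines serialization are assumptions for illustration, not the project's confirmed format.

```python
import hashlib
import json
import unicodedata


def canonical_bytes(rows):
    """Serialize rows deterministically: NFC text, sorted keys, UTF-8."""
    lines = []
    for row in rows:
        normalized = {
            key: unicodedata.normalize("NFC", value)
            for key, value in row.items()
        }
        lines.append(json.dumps(normalized, ensure_ascii=False, sort_keys=True))
    return ("\n".join(lines) + "\n").encode("utf-8")


def sha256_hex(data: bytes) -> str:
    """Hex digest of the kind a manifest entry could store."""
    return hashlib.sha256(data).hexdigest()


# The same Spanish form in two Unicode encodings:
# precomposed "é" versus "e" followed by a combining acute accent.
composed = [{"english": "I worked", "spanish": "trabaj\u00e9"}]
decomposed = [{"english": "I worked", "spanish": "trabaje\u0301"}]

digest_a = sha256_hex(canonical_bytes(composed))
digest_b = sha256_hex(canonical_bytes(decomposed))
assert digest_a == digest_b  # NFC makes the two encodings hash identically
```

Without the normalization step, two visually identical corpora could produce different hashes, which would make a reproducibility check fail for reasons unrelated to the generator's seed.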
Definition of Done
See PHASE_12_PLAN.md §6. Briefly:
- data/MANIFEST.json lists exactly 360 (verb, tense-surface, person) cells (20 × 6 × 3), each with ≥ 1 correct form.
- Every English row has a non-empty spanish field.
- Mis-conjugation count: ≥ 100 rows across ≥ 4 distinct types.
- Re-running gen_corpus.py from the same seed reproduces identical SHA256s (CI test).
- Train/val/test split is (verb, tense)-stratified; no leakage.
- Coverage heatmap, length-distribution, mis-conjugation Pareto plots committed.
- dvc add performed; .dvc files committed.
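The first three criteria above are mechanical and could be checked along these lines. This is only a sketch under stated assumptions: the row keys (verb, tense_surface, person, label), the label values "correct" and "mis_conjugation", the type name double_suffix, and the example verb are all hypothetical; only the spanish field and mis_conjugation_type are named in this phase's text.

```python
from collections import Counter


def check_corpus(rows, expected_cells, min_mis=100, min_types=4):
    """Return a list of failure messages; an empty list means all checks pass."""
    failures = []

    # Coverage: every expected cell needs at least one correct form.
    covered = {
        (r["verb"], r["tense_surface"], r["person"])
        for r in rows
        if r["label"] == "correct"
    }
    missing = set(expected_cells) - covered
    if missing:
        failures.append(f"{len(missing)} cell(s) have no correct form")

    # Every row must carry a non-empty Spanish pair.
    empty = sum(1 for r in rows if not (r.get("spanish") or "").strip())
    if empty:
        failures.append(f"{empty} row(s) have an empty spanish field")

    # Mis-conjugation volume and variety.
    mis_types = Counter(
        r["mis_conjugation_type"] for r in rows if r["label"] == "mis_conjugation"
    )
    total_mis = sum(mis_types.values())
    if total_mis < min_mis:
        failures.append(f"only {total_mis} mis-conjugation row(s), need {min_mis}")
    if len(mis_types) < min_types:
        failures.append(f"only {len(mis_types)} mis-conjugation type(s), need {min_types}")
    return failures


# Toy data: one cell, one correct row, one illustrative mis-conjugation.
cell = ("work", "past_simple", "1sg")
toy_rows = [
    {"verb": "work", "tense_surface": "past_simple", "person": "1sg",
     "english": "I worked", "spanish": "trabajé",
     "label": "correct", "mis_conjugation_type": None},
    {"verb": "work", "tense_surface": "past_simple", "person": "1sg",
     "english": "I workeded", "spanish": "trabajé",
     "label": "mis_conjugation", "mis_conjugation_type": "double_suffix"},
]

toy_failures = check_corpus(toy_rows, [cell], min_mis=1, min_types=1)
full_failures = check_corpus(toy_rows, [cell])  # default thresholds: 100 rows, 4 types
```

With the lowered thresholds the toy data passes; with the default thresholds it reports both mis-conjugation shortfalls. Returning messages rather than raising on the first failure lets a single validation run list every unmet criterion.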
What this phase intentionally does NOT cover
- Free-form English text scraping. We deliberately stay enumerative. The §A13 grammar is finite; scraping adds noise we don't need.
- Plurals. Per §A13: plurals deferred. Persons are 1st sg I, 2nd sg you, 3rd sg he/she/it. we/they are out of scope.
- Other tenses. Present continuous (I am working), present perfect (I have worked), conditionals, and subjunctive are all deferred. v1 covers exactly the 5 tenses in §A13.
- Multi-clause sentences. Single subject + verb (+ optional auxiliary) per row. No coordination, no embedding.
- Tokenization of the corpus. Phase 11 trains the tokenizer; Phase 12 produces the source text. The corpus stores raw strings, not token IDs.
- Embedding training. Phase 13 trains embeddings on the corpus output here.
- Other languages besides English + Spanish pairs. Per §A2: Spanish is the only paired language.
Phase 12's scope is the canonical enumerated bilingual corpus of English verb forms with Spanish translations and controlled mis-conjugations. Nothing more.
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 Datasheets for Datasets (Gebru et al., 2018): how to document a corpus so others trust it.