Phase 12 — The Corpus: Designing the Microscopic Dataset

Requires: 11 — Tokenization Theory + BPE Implementation
Teaches: corpus-design · enumeration · stratified-split · data-manifest · reproducibility
Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab statements are stable drafts; solutions are written just-in-time at phase open. Updated for §A13 (English verb grammar, supersedes §A1).

The corpus is the lever: small, opinionated, reproducible. We enumerate the 20 verbs × 5 tenses × 3 persons matrix, each form with its Spanish pair. Full coverage is non-negotiable, and the controlled mis-conjugations are the supervision that the Phase 32 tutor needs.


Goal

Borja designs, implements, and validates the canonical project corpus: the dataset every later phase trains on. Per §A13, the corpus enumerates all 20 in-scope verbs (12 regular + 8 irregular) × 5 tenses (infinitive, present simple, past simple, past participle, simple future) × 3 persons (1st sg I, 2nd sg you, 3rd sg he/she/it), with a Spanish translation pair for every English form (per §A2). Because simple future is split into two surface forms in the corpus (will and going to), the 5 tenses yield 6 tense surfaces. On top of that, a curated set of deliberate mis-conjugations, each carrying a mis_conjugation_type label, gives Phase 32's tutor agent supervised correction targets. The phase's headline artefact is data/MANIFEST.json, whose hashes pin the corpus and whose per-cell counts confirm 360-cell coverage (20 × 6 × 3). See theory/00-motivation.md for the rationale behind the 5-vs-6 surface-form split.
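The cell arithmetic above can be sketched as a plain Cartesian product. This is a minimal illustration, not the project's gen_corpus.py: the verb IDs are placeholders (the real 20-verb table belongs in data/corpus_spec.md), and the surface and person identifiers are hypothetical names chosen for this example.

```python
from itertools import product

# Placeholder verb IDs (assumption): the real 20 verbs come from the §A13 verb table.
VERBS = [f"verb_{i:02d}" for i in range(20)]

# Six tense surfaces: five tenses, with simple future split into "will" and "going to".
# The identifier spellings are hypothetical.
TENSE_SURFACES = [
    "infinitive",
    "present_simple",
    "past_simple",
    "past_participle",
    "future_will",
    "future_going_to",
]

# 1st sg (I), 2nd sg (you), 3rd sg (he/she/it); plurals are out of scope.
PERSONS = ["1sg", "2sg", "3sg"]


def enumerate_cells(verbs=VERBS, surfaces=TENSE_SURFACES, persons=PERSONS):
    """Return every (verb, tense_surface, person) cell in a fixed order."""
    return list(product(verbs, surfaces, persons))


cells = enumerate_cells()
print(len(cells))  # 20 * 6 * 3 = 360
```

Because itertools.product iterates in a fixed order, the cell list is identical on every run, which is what the manifest's per-cell counts rely on.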

Read order

  1. theory/00-motivation.md — why a tiny enumerated bilingual corpus beats a noisy 100× scrape for this project.
  2. theory/01-schema-and-labels.md — the row schema, the (verb, tense, person, regularity, label) tuple, and the mis-conjugation taxonomy.
  3. theory/02-leakage-and-splits.md — what data leakage looks like for a morphology-learning task, and the (verb, tense)-stratified split that prevents it.
  4. theory/03-reproducibility-and-versioning.md — seeded generation, NFC/UTF-8 normalization, SHA256 manifests, dvc minimal usage.
  5. lab/00-corpus-spec.md — write data/corpus_spec.md (the schema + verb table + mis-conjugation taxonomy). Hand exercise.
  6. lab/01-implement-generator.md — write scripts/gen_corpus.py. Enumerate all 360 cells (20 × 6 tense surfaces × 3 persons); emit Spanish pairs; emit mis-conjugations.
  7. lab/02-validate-and-split.md — write scripts/validate_corpus.py + scripts/split_corpus.py. Coverage check, dedup, (verb, tense)-stratified split, leakage check.
  8. lab/03-version-with-dvc.md — dvc add data/processed/, write the MANIFEST, commit.
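The (verb, tense)-stratified split and leakage check named in the list above might look roughly like the sketch below. It rests on assumptions: rows are dicts with hypothetical keys verb, tense, and english; the split fractions are illustrative; and "leakage" is reduced to the simplest case, the same English string landing in two splits. The fuller definition for a morphology task is the subject of theory/02-leakage-and-splits.md.

```python
import random
from collections import defaultdict


def stratified_split(rows, seed=0, val_frac=0.1, test_frac=0.1):
    """Split rows into train/val/test, stratifying on the (verb, tense) pair."""
    strata = defaultdict(list)
    for row in rows:
        strata[(row["verb"], row["tense"])].append(row)

    rng = random.Random(seed)  # seeded so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for key in sorted(strata):
        # Sort first so the result does not depend on input row order.
        group = sorted(strata[key], key=lambda r: r["english"])
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        n_test = round(len(group) * test_frac)
        splits["val"].extend(group[:n_val])
        splits["test"].extend(group[n_val:n_val + n_test])
        splits["train"].extend(group[n_val + n_test:])
    return splits


def find_leaks(splits):
    """Return English strings that appear in more than one split."""
    owner = {}
    leaks = set()
    for name, rows in splits.items():
        for row in rows:
            first = owner.setdefault(row["english"], name)
            if first != name:
                leaks.add(row["english"])
    return leaks


# Toy data: 2 verbs x 2 tenses x 10 rows each (not the real corpus).
toy = [
    {"verb": v, "tense": t, "english": f"{v}-{t}-{i}"}
    for v in ("work", "go")
    for t in ("past_simple", "future_will")
    for i in range(10)
]
parts = stratified_split(toy, seed=7, val_frac=0.2, test_frac=0.2)
print({name: len(rows) for name, rows in parts.items()})
print(find_leaks(parts))
```

With very small strata the rounded per-stratum counts can come out as zero, so the fractions need choosing against the real number of rows per (verb, tense) group.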

solutions/ is empty during pre-write — populated at phase open.

Definition of Done

See PHASE_12_PLAN.md §6. Briefly:

  • data/MANIFEST.json lists exactly 360 (verb, tense-surface, person) cells (20 × 6 × 3), each with ≥ 1 correct form.
  • Every English row has a non-empty spanish field.
  • Mis-conjugation count: ≥ 100 rows across ≥ 4 distinct types.
  • Re-running gen_corpus.py from the same seed reproduces identical SHA256s (CI test).
  • Train/val/test split is (verb, tense)-stratified; no leakage.
  • Coverage heatmap, length-distribution plot, and mis-conjugation Pareto plot committed.
  • dvc add performed; .dvc files committed.
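Several of the checks above are mechanical enough to sketch in code. The example below is an illustration under stated assumptions, not the project's validate_corpus.py: rows are dicts with hypothetical keys verb, tense, person, spanish, and mis_conjugation_type, and a correct row is assumed to carry mis_conjugation_type set to None. The hashing helper shows why NFC normalization matters for reproducible SHA256s; the real manifest layout is not specified here.

```python
import hashlib
import unicodedata
from collections import Counter


def corpus_sha256(lines):
    """Hash lines after NFC normalization and UTF-8 encoding."""
    digest = hashlib.sha256()
    for line in lines:
        digest.update(unicodedata.normalize("NFC", line).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def check_rows(rows, expected_cells=360, min_mis=100, min_types=4):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    # Assumption: correct rows have mis_conjugation_type set to None.
    correct = [r for r in rows if r["mis_conjugation_type"] is None]
    mis = [r for r in rows if r["mis_conjugation_type"] is not None]

    # Coverage: every cell needs at least one correct form.
    covered = {(r["verb"], r["tense"], r["person"]) for r in correct}
    if len(covered) != expected_cells:
        failures.append(f"coverage: {len(covered)} cells, expected {expected_cells}")

    # Every row needs a non-empty Spanish field.
    empty = sum(1 for r in rows if not r["spanish"].strip())
    if empty:
        failures.append(f"spanish: {empty} rows with an empty field")

    # Mis-conjugations: enough rows, spread over enough distinct types.
    types = Counter(r["mis_conjugation_type"] for r in mis)
    if len(mis) < min_mis or len(types) < min_types:
        failures.append(f"mis-conjugations: {len(mis)} rows, {len(types)} types")

    return failures


# The same word in precomposed and decomposed form hashes identically after NFC.
print(corpus_sha256(["caf\u00e9"]) == corpus_sha256(["cafe\u0301"]))  # True
```

Returning a list of messages rather than raising on the first failure lets a CI run report every unmet criterion at once.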

What this phase intentionally does NOT cover

  • Free-form English text scraping. We deliberately stay enumerative. The §A13 grammar is finite; scraping adds noise we don't need.
  • Plurals. Per §A13: plurals deferred. Persons are 1st sg I, 2nd sg you, 3rd sg he/she/it. we/they are out of scope.
  • Other tenses. Present continuous (I am working), present perfect (I have worked), conditionals, subjunctive — all deferred. v1 covers exactly the 5 tenses in §A13.
  • Multi-clause sentences. Single subject + verb (+ optional auxiliary) per row. No coordination, no embedding.
  • Tokenization of the corpus. Phase 11 trains the tokenizer; Phase 12 produces the source text. The corpus stores raw strings, not token IDs.
  • Embedding training. Phase 13 trains embeddings on the corpus produced here.
  • Other languages besides English + Spanish pairs. Per §A2: Spanish is the only paired language.

Phase 12's scope is the canonical enumerated bilingual corpus of English verb forms with Spanish translations and controlled mis-conjugations. Nothing more.

Further reading

Optional — enrichment, not required to pass the phase.