Lab 00 — Write data/corpus_spec.md (the canonical spec)

Goal: before any code runs, write the human-readable spec of the corpus: the verb table, the tense schemata, the mis-conjugation taxonomy, and the row JSONSchema. The validator and generator are derived from this document.

Estimated time: 90–120 minutes (mostly typing the verb + Spanish-pairing table).

Prereq: read theory 00..03.md first.


What you produce

A single committed file data/corpus_spec.md containing:

  1. Preamble — corpus version (1.0.0), scope reference (§A13), license.
  2. The 20-verb table (English → Spanish with regularity per language and irregular-form columns).
  3. The 6 tense surface-form schemata (one section per tense, with the English and Spanish patterns for each person).
  4. The mis-conjugation taxonomy (6 types in v1).
  5. The row JSONSchema (as a code block — machine-readable).
  6. Validation rules (in prose) that the validator script must implement.

data/corpus_spec.md is the contract. The generator reads from it (conceptually); the validator checks against it.

TODOs

Block A — preamble

  • Title, version, date.
  • Cite §A13 (English verb grammar) and §A2 (Spanish pairing) by file+section.
  • One-paragraph summary in English.
  • One-paragraph summary in Spanish (per CLAUDE.md §0.6, the bilingual policy).

Block B — the 20-verb table

  • List all 20 verbs (12 regular + 8 irregular) per §A13.
  • For each, fill: English lemma, Spanish lemma, English-regularity, Spanish-regularity.
  • For the 8 English irregulars, add columns: past simple, past participle.
  • For Spanish irregulars (whichever they are), add a short note (e.g., "stem change e→ie") rather than enumerating every form (the generator handles that).
  • Notes section: collisions (watch/look both → mirar), inverted syntax (like → gustar).

Suggested format:

| English | Spanish | EN_reg  | ES_reg  | EN past | EN participle | Notes              |
|---------|---------|---------|---------|---------|---------------|--------------------|
| work    | trabajar| reg     | reg     | worked  | worked        |                    |
| go      | ir      | irreg   | irreg   | went    | gone          | ES highly irreg    |
| like    | gustar  | reg     | reg     | liked   | liked         | inverted syntax    |
| ...     | ...     | ...     | ...     | ...     | ...           | ...                |
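
A quick way to catch a half-filled table is to parse it and count rows. The sketch below is only a self-check for your own use, not part of the lab's tooling and not something to paste into the spec. The helper name `parse_verb_table` is hypothetical, and it assumes the table follows the suggested pipe format with `reg`/`irreg` in the EN_reg column.

```python
from collections import Counter

def parse_verb_table(markdown: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited Markdown table into a list of row dicts."""
    def split_row(line: str) -> list[str]:
        inner = line.strip().removeprefix("|").removesuffix("|")
        return [cell.strip() for cell in inner.split("|")]

    lines = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # skip the header row and the |---| separator
        cells = split_row(line)
        if len(cells) != len(header):
            raise ValueError(f"wrong column count in row: {line!r}")
        rows.append(dict(zip(header, cells)))
    return rows

# The three sample rows from the suggested format above.
SAMPLE = """
| English | Spanish | EN_reg  | ES_reg  | EN past | EN participle | Notes              |
|---------|---------|---------|---------|---------|---------------|--------------------|
| work    | trabajar| reg     | reg     | worked  | worked        |                    |
| go      | ir      | irreg   | irreg   | went    | gone          | ES highly irreg    |
| like    | gustar  | reg     | reg     | liked   | liked         | inverted syntax    |
"""

rows = parse_verb_table(SAMPLE)
counts = Counter(row["EN_reg"] for row in rows)
print(len(rows), dict(counts))  # 3 {'reg': 2, 'irreg': 1}
```

Run against the finished table, the same counts should come out as 20 rows, with 12 `reg` and 8 `irreg` in the EN_reg column.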

Block C — tense surface-form schemata

  • Six subsections, one per surface form: infinitive, present_simple, past_simple, past_participle, future_will, future_going_to.
  • Each subsection shows English + Spanish patterns for all 3 persons (1sg, 2sg, 3sg).
  • Mark the morphological "twist" for each: e.g., -s for 3sg present; auxiliary to be agreement in future_going_to.
  • For each surface form, provide one fully worked example for both a regular verb (work) and an irregular verb (go).
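
To make the two "twists" named above concrete, here is a minimal sketch of the English side of two surface forms. It is illustrative only: the function names, the choice of `he` as the 3sg subject, and the spelling rules in `third_person_s` are assumptions, and `be` and `have` would need their own entries.

```python
# Illustrative only: two of the six surface forms, English side.
SUBJECT = {"1sg": "I", "2sg": "you", "3sg": "he"}      # assumed pronoun choice
BE_PRESENT = {"1sg": "am", "2sg": "are", "3sg": "is"}  # agreement of "to be"

def third_person_s(lemma: str) -> str:
    """Spell the 3sg present form of an ordinary verb (not be/have)."""
    if lemma.endswith(("s", "sh", "ch", "x", "z", "o")):
        return lemma + "es"            # go -> goes, watch -> watches
    if len(lemma) > 1 and lemma.endswith("y") and lemma[-2] not in "aeiou":
        return lemma[:-1] + "ies"      # study -> studies
    return lemma + "s"                 # work -> works

def present_simple(lemma: str, person: str) -> str:
    """Twist: only the 3sg cell takes the -s ending."""
    verb = third_person_s(lemma) if person == "3sg" else lemma
    return f"{SUBJECT[person]} {verb}"

def future_going_to(lemma: str, person: str) -> str:
    """Twist: the auxiliary 'to be' agrees with the person; the lemma stays bare."""
    return f"{SUBJECT[person]} {BE_PRESENT[person]} going to {lemma}"

for lemma in ("work", "go"):
    for person in ("1sg", "2sg", "3sg"):
        print(present_simple(lemma, person), "|", future_going_to(lemma, person))
```

In the spec itself these patterns belong in prose and tables; the code is only a way to check that your written schemata produce the forms you expect.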

Block D — mis-conjugation taxonomy

  • List the 6 v1 types from theory 01 (missing_third_person_s, overregularization_past, wrong_aux_will_with_to, wrong_aux_going_to_missing_ing, subject_verb_disagreement, bare_participle_missing_aux).
  • For each: trigger conditions (which cells eligible), one example, the correct form, one-sentence pedagogical explanation.
  • Note that the taxonomy is closed for v1 — any new type requires a corpus-spec version bump.
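
Because the taxonomy is closed, a draft can be checked mechanically for unknown, undocumented, or incomplete entries. The sketch below assumes the four per-type items are stored under the hypothetical keys `trigger`, `example`, `correct_form`, and `explanation`. The sample entry is my reading of the type name, not text from theory 01.

```python
# The six v1 type names, as listed in Block D.
V1_TYPES = frozenset({
    "missing_third_person_s",
    "overregularization_past",
    "wrong_aux_will_with_to",
    "wrong_aux_going_to_missing_ing",
    "subject_verb_disagreement",
    "bare_participle_missing_aux",
})
REQUIRED_FIELDS = ("trigger", "example", "correct_form", "explanation")  # assumed keys

def taxonomy_problems(entries: dict[str, dict[str, str]]) -> list[str]:
    """Return human-readable problems found in a draft taxonomy."""
    problems = []
    for name in sorted(set(entries) - V1_TYPES):
        problems.append(f"unknown type: {name}")
    for name in sorted(V1_TYPES - set(entries)):
        problems.append(f"undocumented type: {name}")
    for name, entry in sorted(entries.items()):
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"{name}: missing {field}")
    return problems

draft = {
    "missing_third_person_s": {
        "trigger": "present_simple, 3sg",   # assumed eligibility
        "example": "he work",
        "correct_form": "he works",
        "explanation": "",                  # left blank to show the check firing
    },
    "double_past": {},                      # hypothetical name, not in v1
}

for problem in taxonomy_problems(draft):
    print(problem)
```

An empty result means all six types are present, none are extra, and each has all four items filled in.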

Block E — row JSONSchema

  • Embed a JSONSchema for the row format (per theory 01). Use draft-07 or 2020-12.
  • Fields: id, text, spanish, verb_lemma, spanish_lemma, tense, person, english_regularity, spanish_regularity, label, mis_conjugation_type, correct_form, seed, fingerprint.
  • Required vs optional clearly marked. mis_conjugation_type and correct_form are nullable but the validator enforces the cross-field constraint (both populated iff label = "mis_conjugated").
  • Enums for tense, person, label, mis_conjugation_type, english_regularity, spanish_regularity.
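
The cross-field constraint is the easiest part of Block E to get subtly wrong, so a small sketch may help. It shows a schema fragment for the two nullable fields (written as a Python dict so it can be dumped to JSON) and a hand-written check of the "both populated iff mis_conjugated" rule. The label value `"correct"` and the helper name `cross_field_errors` are assumptions; only `"mis_conjugated"` appears in the lab text.

```python
import json

# Fragment only, draft-07 style: the label enum and the two nullable fields.
SCHEMA_FRAGMENT = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "label": {"enum": ["correct", "mis_conjugated"]},  # "correct" is assumed
        "mis_conjugation_type": {"type": ["string", "null"]},
        "correct_form": {"type": ["string", "null"]},
    },
}

def cross_field_errors(row: dict) -> list[str]:
    """Check that both fields are populated iff label == 'mis_conjugated'."""
    errors = []
    is_mis = row.get("label") == "mis_conjugated"
    for field in ("mis_conjugation_type", "correct_form"):
        populated = row.get(field) is not None
        if is_mis and not populated:
            errors.append(f"{field} must be populated when label is mis_conjugated")
        if not is_mis and populated:
            errors.append(f"{field} must be null unless label is mis_conjugated")
    return errors

ok_row = {"label": "correct", "mis_conjugation_type": None, "correct_form": None}
bad_row = {
    "label": "mis_conjugated",
    "mis_conjugation_type": "missing_third_person_s",
    "correct_form": None,  # violates the constraint
}

print(json.dumps(SCHEMA_FRAGMENT, indent=2))
print(cross_field_errors(ok_row))
print(cross_field_errors(bad_row))
```

The fragment deliberately leaves the constraint out of the schema, matching the note above that the validator enforces it. Draft-07 `if`/`then`/`else` could express the same rule inside the schema if you prefer.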

Block F — validation rules in prose

List the checks from theory 02 (12 that run on the full corpus, plus one that applies only after the split):

  • Schema validity.
  • No exact duplicates by fingerprint.
  • Cell coverage: 360 cells all have ≥ 1 correct row.
  • All 20 verbs present.
  • All 6 tense-surfaces per verb.
  • All 3 persons per (verb, tense).
  • Mis-conjugation types from canonical taxonomy.
  • Mis-conjugation rows have correct_form populated.
  • Correct rows have mis_conjugation_type = null.
  • Every row has a Spanish field.
  • NFC normalization.
  • Text length in range.
  • (Post-split) no fingerprint overlap and no (verb, tense) cross-split overlap.
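
Two of these rules, duplicate fingerprints and cell coverage, are easy to state in prose but worth seeing once as code. The following is a sketch under stated assumptions, not the validator from lab 01: the function names are hypothetical, rows are plain dicts using the Block E field names, and `"correct"` is an assumed label value.

```python
from collections import Counter
from itertools import product

TENSES = ("infinitive", "present_simple", "past_simple",
          "past_participle", "future_will", "future_going_to")
PERSONS = ("1sg", "2sg", "3sg")

def duplicate_fingerprints(rows: list[dict]) -> list[str]:
    """Fingerprints that occur on more than one row."""
    counts = Counter(row["fingerprint"] for row in rows)
    return sorted(fp for fp, n in counts.items() if n > 1)

def uncovered_cells(rows: list[dict], verbs: list[str]) -> list[tuple]:
    """(verb, tense, person) cells that have no row labelled correct."""
    covered = {
        (row["verb_lemma"], row["tense"], row["person"])
        for row in rows
        if row["label"] == "correct"  # assumed label value
    }
    return [cell for cell in product(verbs, TENSES, PERSONS) if cell not in covered]

# Tiny made-up corpus: two verbs, one duplicate fingerprint, one covered cell.
rows = [
    {"fingerprint": "a1", "verb_lemma": "work", "tense": "present_simple",
     "person": "3sg", "label": "correct"},
    {"fingerprint": "a1", "verb_lemma": "work", "tense": "present_simple",
     "person": "3sg", "label": "correct"},
    {"fingerprint": "b2", "verb_lemma": "go", "tense": "past_simple",
     "person": "1sg", "label": "mis_conjugated"},
]

print(duplicate_fingerprints(rows))              # ['a1']
print(len(uncovered_cells(rows, ["work", "go"])))  # 35 of 36 cells uncovered
```

With the full 20-verb list, `product(verbs, TENSES, PERSONS)` yields 20 × 6 × 3 = 360 cells, which is where the 360 in the coverage rule comes from.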

Block G — sanity check

  • Add a final section: "Worked example of one cell." Pick (work, present_simple, 3sg). Write out the row JSON in full. Show that it satisfies every rule listed in Block F.

Constraints

  • Markdown only, no code beyond inlined JSON / JSONSchema blocks. This is the contract, not the implementation.
  • No external links to Wikipedia/grammar resources in v1; the table is self-contained. If we need to cite something, it goes in docs/phase-12-corpus-design/theory/.
  • Spanish must be NFC. Type ñ as a single precomposed character, not n followed by a combining tilde (U+0303). Verify with cat -v or python -c "import unicodedata; print(unicodedata.is_normalized('NFC', open('data/corpus_spec.md', encoding='utf-8').read()))".
  • Lock the version at 1.0.0. Any future change increments per the semver table in theory 03.
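
The NFC requirement is easier to reason about with the two encodings side by side. This sketch uses only the standard-library `unicodedata` module; the helper `first_non_nfc_line` is a hypothetical convenience for locating the offending line, not part of the lab's tooling.

```python
import unicodedata

composed = "ma\u00f1ana"     # ñ as one code point, U+00F1 (NFC)
decomposed = "man\u0303ana"  # n followed by U+0303 COMBINING TILDE (NFD)

def first_non_nfc_line(text: str):
    """Return the 1-based number of the first line that is not NFC, or None."""
    for number, line in enumerate(text.splitlines(), start=1):
        if not unicodedata.is_normalized("NFC", line):
            return number
    return None

# The two strings render identically but differ in length and equality.
print(len(composed), len(decomposed))                        # 6 7
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(first_non_nfc_line("ser\n" + decomposed + "\n"))       # 2
```

If the check fails on your spec, `unicodedata.normalize("NFC", text)` repairs the text in one pass. Fixing the editor's save settings prevents the problem from coming back.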

Stop conditions

You're done when:

  1. data/corpus_spec.md exists and is committed.
  2. The 20-verb table has all 20 entries with Spanish pairings and irregular-form columns filled.
  3. All 6 tense surface-form schemata are documented with worked examples for regular and irregular verbs.
  4. The 6 mis-conjugation types are documented with examples and correction patterns.
  5. The JSONSchema is embedded and valid (you can paste it into a JSONSchema validator and have it accept a sample row).
  6. The 12 validation rules are listed in prose.

Pitfalls

  • Spanish accent encoding. If your editor saves NFD, your spec is incorrectly encoded. Verify NFC.
  • Translation ambiguity. be can be ser or estar. The spec must pick one (default ser) and document the choice — not leave it ambiguous for the generator.
  • Irregular Spanish stems. For Spanish irregulars (e.g., tener → tengo for 1st-sg present), the spec can either enumerate the irregular forms or describe them by rule. Recommendation: enumerate. Rules are bug magnets.
  • like → gustar syntax inversion. Easy to forget. The spec must call this out explicitly so the generator's schemata for like don't follow the standard pattern.
  • Forgetting was/were for be past. I was, you were, he was. The spec must encode this distinction.

Hint of last resort

If 2 hours in and the verb table is still half-empty: don't perfect the Spanish — pick one canonical translation per English verb, document any uncertainty in the Notes column, and move on. The lab is about getting the spec down, not about being a Spanish linguist. Borja's Spanish review will catch errors at validation time.

When to consult solutions/

After the spec is committed. Solution: solutions/00-corpus-spec-ref.md (phase open) contains a reference 20-verb table and the canonical JSONSchema.


Next lab: lab/01-implement-generator.md.