
Lab 00 — Write `data/corpus_spec.md` (the canonical spec)

Goal: before any code runs, write the human-readable spec of the corpus: the verb table, the tense schemata, the mis-conjugation taxonomy, and the row JSONSchema. The validator and the generator are both derived from this document.

Estimated time: 90–120 minutes (mostly typing the verb + Spanish-pairing table).

Prerequisite: you have read theory 00..03.md.

What you produce

A single committed file data/corpus_spec.md containing:

Preamble — corpus version (1.0.0), scope reference (§A13), license.
The 20-verb table (English → Spanish with regularity per language and irregular-form columns).
The 6 tense surface-form schemata (one section per tense, with the English and Spanish patterns for each person).
The mis-conjugation taxonomy (6 types in v1).
The row JSONSchema (as a code block — machine-readable).
Validation rules (in prose) that the validator script must implement.

data/corpus_spec.md is the contract. The generator reads from it (conceptually); the validator checks against it.

TODOs

Block A — preamble

Title, version, date.
Cite §A13 (English verb grammar) and §A2 (Spanish pairing) by file+section.
One-paragraph summary in English.
One-paragraph summary in Spanish (per CLAUDE.md §0.6, the bilingual policy).

Block B — the 20-verb table

List all 20 verbs (12 regular + 8 irregular) per §A13.
For each, fill: English lemma, Spanish lemma, English-regularity, Spanish-regularity.
For the 8 English irregulars, add columns: past simple, past participle.
For Spanish irregulars (whichever they are), add a short note (e.g., "stem change e→ie") rather than enumerating every form (the generator handles that).
Notes section: collisions (watch/look both → mirar), inverted syntax (like → gustar).

Suggested format:

| English | Spanish | EN_reg  | ES_reg  | EN past | EN participle | Notes              |
|---------|---------|---------|---------|---------|---------------|--------------------|
| work    | trabajar| reg     | reg     | worked  | worked        |                    |
| go      | ir      | irreg   | irreg   | went    | gone          | ES highly irreg    |
| like    | gustar  | reg     | reg     | liked   | liked         | inverted syntax    |
| ...     | ...     | ...     | ...     | ...     | ...           | ...                |
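One way to catch the collisions mentioned in the Notes bullet is to parse the table and group rows by Spanish lemma. The following is a minimal sketch, not part of the lab's tooling: the helper names `parse_rows` and `spanish_collisions` are hypothetical, and the three sample rows are illustrative only (watch and look come from the collision note above). Such a helper would live outside `data/corpus_spec.md`, which stays Markdown only.

```python
from collections import defaultdict

# Illustrative sample only; the real table has 20 rows.
TABLE = """\
| English | Spanish | EN_reg | ES_reg | EN past | EN participle | Notes |
|---------|---------|--------|--------|---------|---------------|-------|
| watch   | mirar   | reg    | reg    | watched | watched       |       |
| look    | mirar   | reg    | reg    | looked  | looked        |       |
| go      | ir      | irreg  | irreg  | went    | gone          |       |
"""


def parse_rows(markdown_table):
    """Turn a pipe-delimited Markdown table into a list of dicts keyed by header."""
    lines = [ln.strip() for ln in markdown_table.strip().splitlines()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def spanish_collisions(rows):
    """Return {spanish_lemma: [english_lemmas]} for Spanish lemmas used more than once."""
    by_spanish = defaultdict(list)
    for row in rows:
        by_spanish[row["Spanish"]].append(row["English"])
    return {es: ens for es, ens in by_spanish.items() if len(ens) > 1}


rows = parse_rows(TABLE)
collisions = spanish_collisions(rows)
print(collisions)
```

Every key this reports should have a matching entry in the Notes column.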

Block C — tense surface-form schemata

Six subsections, one per surface form: infinitive, present_simple, past_simple, past_participle, future_will, future_going_to.
Each subsection shows English + Spanish patterns for all 3 persons (1sg, 2sg, 3sg).
Mark the morphological "twist" for each: e.g., -s for 3sg present; auxiliary to be agreement in future_going_to.
For each surface form, provide one fully worked example for both a regular verb (work) and an irregular verb (go).
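To make the English side of the schemata concrete, here is a rough sketch of how the six surface forms could be rendered for work and go. It is an illustration under assumptions, not the generator: the function name `surface`, the choice of "he" as the 3sg subject, and rendering `past_participle` with the auxiliary have/has are all guesses that the spec itself must settle. The Spanish side is omitted.

```python
# Hypothetical lookup tables; the real values come from the Block B verb table.
IRREGULAR = {"go": {"past": "went", "participle": "gone"}}
SUBJECT = {"1sg": "I", "2sg": "you", "3sg": "he"}  # assumption: "he" stands in for 3sg
BE_PRESENT = {"1sg": "am", "2sg": "are", "3sg": "is"}


def third_person_s(lemma):
    """Add the 3sg present ending (-s, or -es after sibilants and -o)."""
    if lemma.endswith(("s", "sh", "ch", "x", "z", "o")):
        return lemma + "es"
    return lemma + "s"


def regular_past(lemma):
    """Deliberately naive -ed rule; enough for 'work' and 'like', not for every verb."""
    return lemma + "d" if lemma.endswith("e") else lemma + "ed"


def surface(lemma, tense, person):
    """Render one (verb, tense, person) cell as an English string."""
    subj = SUBJECT[person]
    irregular = IRREGULAR.get(lemma, {})
    if tense == "infinitive":
        return f"to {lemma}"
    if tense == "present_simple":
        verb = third_person_s(lemma) if person == "3sg" else lemma
        return f"{subj} {verb}"
    if tense == "past_simple":
        return f"{subj} {irregular.get('past', regular_past(lemma))}"
    if tense == "past_participle":
        # Assumption: the participle is shown with its auxiliary (have/has).
        aux = "has" if person == "3sg" else "have"
        return f"{subj} {aux} {irregular.get('participle', regular_past(lemma))}"
    if tense == "future_will":
        return f"{subj} will {lemma}"
    if tense == "future_going_to":
        # The "twist": the auxiliary 'to be' agrees with the person.
        return f"{subj} {BE_PRESENT[person]} going to {lemma}"
    raise ValueError(f"unknown tense: {tense}")


for tense in ("present_simple", "past_simple", "future_going_to"):
    print(surface("work", tense, "3sg"), "|", surface("go", tense, "3sg"))
```

Each branch corresponds to one subsection of Block C, so the worked examples in the spec can be checked against a sketch like this by hand.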

Block D — mis-conjugation taxonomy

List the 6 v1 types from theory 01 (missing_third_person_s, overregularization_past, wrong_aux_will_with_to, wrong_aux_going_to_missing_ing, subject_verb_disagreement, bare_participle_missing_aux).
For each: trigger conditions (which cells eligible), one example, the correct form, one-sentence pedagogical explanation.
Note that the taxonomy is closed for v1 — any new type requires a corpus-spec version bump.
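The "trigger conditions" for each type can be read as a predicate over a cell. The sketch below covers only two of the six types, and its trigger conditions are inferred from the type names rather than taken from theory 01. The names `TAXONOMY_SKETCH` and `eligible_types`, and the regularity values `reg`/`irreg` (borrowed from the suggested table), are assumptions.

```python
# Two of the six v1 types, with triggers guessed from the type names.
# The authoritative trigger conditions are the ones documented in theory 01.
TAXONOMY_SKETCH = {
    "missing_third_person_s": {
        "eligible": lambda cell: (
            cell["tense"] == "present_simple" and cell["person"] == "3sg"
        ),
        "example": "he work",
        "correct_form": "he works",
    },
    "overregularization_past": {
        "eligible": lambda cell: (
            cell["tense"] == "past_simple" and cell["english_regularity"] == "irreg"
        ),
        "example": "I goed",
        "correct_form": "I went",
    },
}


def eligible_types(cell):
    """List the taxonomy entries whose trigger condition holds for this cell."""
    return [name for name, entry in TAXONOMY_SKETCH.items() if entry["eligible"](cell)]


cell = {"tense": "past_simple", "person": "1sg", "english_regularity": "irreg"}
print(eligible_types(cell))
```

Writing each trigger as a predicate makes it easy to see which cells can never carry a given error type, for example a regular verb and `overregularization_past`.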

Block E — row JSONSchema

Embed a JSONSchema for the row format (per theory 01). Use draft-07 or 2020-12.
Fields: id, text, spanish, verb_lemma, spanish_lemma, tense, person, english_regularity, spanish_regularity, label, mis_conjugation_type, correct_form, seed, fingerprint.
Required vs optional clearly marked. mis_conjugation_type and correct_form are nullable but the validator enforces the cross-field constraint (both populated iff label = "mis_conjugated").
Enums for tense, person, label, mis_conjugation_type, english_regularity, spanish_regularity.
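Marking the two nullable fields is easy to do in the schema itself, but the cross-field constraint is left to the validator. A minimal sketch of that check follows, assuming rows are plain dicts. The function names are hypothetical, and the label value `"correct"` in the usage lines is a placeholder, since only `"mis_conjugated"` is named above.

```python
# Field names as listed in Block E.
FIELDS = [
    "id", "text", "spanish", "verb_lemma", "spanish_lemma", "tense", "person",
    "english_regularity", "spanish_regularity", "label",
    "mis_conjugation_type", "correct_form", "seed", "fingerprint",
]


def missing_fields(row):
    """Return the listed fields that are absent from the row (null values still count as present)."""
    return [name for name in FIELDS if name not in row]


def cross_field_ok(row):
    """Both nullable fields are populated if and only if label == 'mis_conjugated'."""
    is_error = row["label"] == "mis_conjugated"
    has_type = row.get("mis_conjugation_type") is not None
    has_fix = row.get("correct_form") is not None
    return has_type == is_error and has_fix == is_error


good_error = {"label": "mis_conjugated",
              "mis_conjugation_type": "missing_third_person_s",
              "correct_form": "he works"}
# "correct" is a placeholder label value, not confirmed by the spec.
good_clean = {"label": "correct", "mis_conjugation_type": None, "correct_form": None}
half_filled = {"label": "mis_conjugated",
               "mis_conjugation_type": "missing_third_person_s",
               "correct_form": None}

print(cross_field_ok(good_error), cross_field_ok(good_clean), cross_field_ok(half_filled))
```

The half-filled row is the case a plain nullable schema would accept, which is why the rule is also stated in prose for the validator.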

Block F — validation rules in prose

List the 12 checks from theory 02.

Block G — sanity check

Add a final section: "Worked example of one cell." Pick (work, present_simple, 3sg). Write out the row JSON in full. Show that it satisfies every rule listed in Block F.

Constraints

Markdown only, no code beyond inlined JSON / JSONSchema blocks. This is the contract, not the implementation.
No external links to Wikipedia/grammar resources in v1; the table is self-contained. If we need to cite something, it goes in docs/phase-12-corpus-design/theory/.
Spanish must be NFC. Type ñ (U+00F1), not n followed by a combining tilde (U+0303). Verify with cat -v or, on Python 3.8+, python -c "import unicodedata; print(unicodedata.is_normalized('NFC', open('data/corpus_spec.md', encoding='utf-8').read()))".
Lock the version at 1.0.0. Any future change increments per the semver table in theory 03.
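If the NFC check fails, it helps to know which lines are affected. The sketch below is one possible way to find them, using only the standard library; the function name `non_nfc_lines` is hypothetical and the sample strings are made up for the demonstration.

```python
import unicodedata


def non_nfc_lines(text):
    """Return 1-based line numbers whose content changes under NFC normalization."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if unicodedata.normalize("NFC", line) != line
    ]


composed = "espa\u00f1ol"     # ñ as a single code point (NFC)
decomposed = "espan\u0303ol"  # n followed by a combining tilde (NFD)

sample = "first line\n" + composed + "\n" + decomposed
print(non_nfc_lines(sample))

# To check the real file, one could read it explicitly as UTF-8:
# with open("data/corpus_spec.md", encoding="utf-8") as handle:
#     print(non_nfc_lines(handle.read()))
```

Comparing against the normalized string, rather than calling `is_normalized`, keeps the sketch usable on Python versions older than 3.8.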

Stop conditions

You're done when:

data/corpus_spec.md exists and is committed.
The 20-verb table has all 20 entries with Spanish pairings and irregular-form columns filled.
All 6 tense surface-form schemata are documented with worked examples for regular and irregular verbs.
The 6 mis-conjugation types are documented with examples and correction patterns.
The JSONSchema is embedded and valid (you can paste it into a JSONSchema validator and have it accept a sample row).
The 12 validation rules are listed in prose.

Pitfalls

Spanish accent encoding. If your editor saves NFD, your spec is incorrectly encoded. Verify NFC.
Translation ambiguity. be can be ser or estar. The spec must pick one (default ser) and document the choice — not leave it ambiguous for the generator.
Irregular Spanish stems. For Spanish irregulars (e.g., tener → tengo for 1sg present), the spec can either enumerate the irregular forms or describe them by rule. Recommendation: enumerate. Rules are bug magnets.
like → gustar syntax inversion. Easy to forget. The spec must call this out explicitly so the generator's schemata for like don't follow the standard pattern.
Forgetting was/were for be past. I was, you were, he was. The spec must encode this distinction.
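The was/were pitfall arises because a single "EN past" column cannot hold a form that varies by person. One possible way to encode the distinction is a per-person override that is consulted before the single-column value. This is only a sketch: the table and function names are hypothetical, and the spec may choose a different representation.

```python
# Per-person overrides, checked before the single "EN past" column.
PERSON_OVERRIDES = {
    ("be", "past_simple"): {"1sg": "was", "2sg": "were", "3sg": "was"},
}

# Single-form irregular pasts, as they would appear in the "EN past" column.
IRREGULAR_PAST = {"go": "went"}


def past_simple_form(lemma, person):
    """Return the past simple form, honoring per-person overrides first."""
    override = PERSON_OVERRIDES.get((lemma, "past_simple"))
    if override is not None:
        return override[person]
    # Naive regular fallback; sufficient for 'work', not for every verb.
    return IRREGULAR_PAST.get(lemma, lemma + "ed")


for person in ("1sg", "2sg", "3sg"):
    print(person, past_simple_form("be", person), past_simple_form("go", person))
```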

Hint of last resort

If 2 hours in and the verb table is still half-empty: don't perfect the Spanish — pick one canonical translation per English verb, document any uncertainty in the Notes column, and move on. The lab is about getting the spec down, not about being a Spanish linguist. Borja's Spanish review will catch errors at validation time.

When to consult `solutions/`

After the spec is committed. Solution: solutions/00-corpus-spec-ref.md (phase open) contains a reference 20-verb table and the canonical JSONSchema.

Next lab: lab/01-implement-generator.md.
