Lab 00: Write data/corpus_spec.md (the canonical spec)
Goal: before any code runs, write the human-readable spec of the corpus: the verb table, the tense schemata, the mis-conjugation taxonomy, the row JSONSchema. The validator and generator are derived from this document.
Estimated time: 90–120 minutes (mostly typing the verb + Spanish-pairing table).
Prereq: theory 00..03.md read.
What you produce
A single committed file data/corpus_spec.md containing:
- Preamble: corpus version (1.0.0), scope reference (§A13), license.
- The 20-verb table (English → Spanish with regularity per language and irregular-form columns).
- The 6 tense surface-form schemata (one section per tense, with the English and Spanish patterns for each person).
- The mis-conjugation taxonomy (6 types in v1).
- The row JSONSchema (as a code block — machine-readable).
- Validation rules (in prose) that the validator script must implement.
data/corpus_spec.md is the contract. The generator reads from it (conceptually); the validator checks against it.
TODOs
Block A: preamble
- Title, version, date.
- Cite §A13 (English verb grammar) and §A2 (Spanish pairing) by file+section.
- One-paragraph summary in English.
- One-paragraph summary in Spanish (per CLAUDE.md §0.6, the bilingual policy).
Block B: the 20-verb table
- List all 20 verbs (12 regular + 8 irregular) per §A13.
- For each, fill: English lemma, Spanish lemma, English-regularity, Spanish-regularity.
- For the 8 English irregulars, add columns: past simple, past participle.
- For Spanish irregulars (whichever they are), add a short note (e.g., "stem change e→ie") rather than enumerating every form (the generator handles that).
- Notes section: collisions (watch/look both → mirar), inverted syntax (like → gustar).
Suggested format:
| English | Spanish | EN_reg | ES_reg | EN past | EN participle | Notes |
|---------|---------|---------|---------|---------|---------------|--------------------|
| work | trabajar| reg | reg | worked | worked | |
| go | ir | irreg | irreg | went | gone | ES highly irreg |
| like | gustar | reg | reg | liked | liked | inverted syntax |
| ... | ... | ... | ... | ... | ... | ... |
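
Before committing, it can help to sanity-check the table mechanically. The following is a minimal sketch, not part of the spec itself: parse_table and spanish_collisions are hypothetical helper names, the watch and look rows are illustrative fillers for the collision mentioned in the Notes bullet, and the parser assumes every row keeps a space inside an empty trailing cell (| |).

```python
from collections import defaultdict


def parse_table(markdown: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited Markdown table into one dict per data row."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip().startswith("|")]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def spanish_collisions(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Return Spanish lemmas that are paired with more than one English lemma."""
    by_spanish = defaultdict(list)
    for row in rows:
        by_spanish[row["Spanish"]].append(row["English"])
    return {es: en for es, en in by_spanish.items() if len(en) > 1}


# Illustrative excerpt only; the real table has 20 rows.
SAMPLE = """
| English | Spanish | EN_reg | ES_reg | EN past | EN participle | Notes |
|---------|---------|--------|--------|---------|---------------|-------|
| work | trabajar | reg | reg | worked | worked | |
| go | ir | irreg | irreg | went | gone | ES highly irreg |
| watch | mirar | reg | reg | watched | watched | collides with look |
| look | mirar | reg | reg | looked | looked | collides with watch |
"""

rows = parse_table(SAMPLE)
print(len(rows))                 # 4
print(spanish_collisions(rows))  # {'mirar': ['watch', 'look']}
```

On the full table, the same helpers could confirm that there are 20 rows and that every collision found is also recorded in the Notes section.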
Block C: tense surface-form schemata
- Six subsections, one per surface form: infinitive, present_simple, past_simple, past_participle, future_will, future_going_to.
- Each subsection shows English + Spanish patterns for all 3 persons (1sg, 2sg, 3sg).
- Mark the morphological "twist" for each: e.g., -s for 3sg present; auxiliary to be agreement in future_going_to.
- For each surface form, provide one fully worked example for both a regular verb (work) and an irregular verb (go).
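To make the two "twists" named above concrete, here is a minimal English-only sketch. The function names and the subject pronouns (I, you, he) are assumptions for illustration; the spelling rule covers only ordinary verbs, so paradigms such as be and have would need explicit overrides in the real spec.

```python
SUBJECT = {"1sg": "I", "2sg": "you", "3sg": "he"}      # assumed pronoun choice
BE_PRESENT = {"1sg": "am", "2sg": "are", "3sg": "is"}  # agreement of auxiliary "to be"


def third_person_s(lemma: str) -> str:
    """Apply the regular 3sg present spelling rule (work -> works, go -> goes)."""
    if lemma.endswith(("s", "sh", "ch", "x", "z", "o")):
        return lemma + "es"
    if len(lemma) > 1 and lemma.endswith("y") and lemma[-2] not in "aeiou":
        return lemma[:-1] + "ies"
    return lemma + "s"


def present_simple(lemma: str, person: str) -> str:
    """Twist: only the 3sg cell takes the -s ending."""
    verb = third_person_s(lemma) if person == "3sg" else lemma
    return f"{SUBJECT[person]} {verb}"


def future_going_to(lemma: str, person: str) -> str:
    """Twist: the auxiliary agrees with the person; the main verb stays bare."""
    return f"{SUBJECT[person]} {BE_PRESENT[person]} going to {lemma}"


for lemma in ("work", "go"):
    for person in ("1sg", "2sg", "3sg"):
        print(present_simple(lemma, person), "|", future_going_to(lemma, person))
```

The spec itself should state these patterns in prose and tables; the sketch is only a way to check that the worked examples you write down are consistent across persons.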
Block D: mis-conjugation taxonomy
- List the 6 v1 types from theory 01 (missing_third_person_s, overregularization_past, wrong_aux_will_with_to, wrong_aux_going_to_missing_ing, subject_verb_disagreement, bare_participle_missing_aux).
- For each: trigger conditions (which cells are eligible), one example, the correct form, and a one-sentence pedagogical explanation.
- Note that the taxonomy is closed for v1: any new type requires a corpus-spec version bump.
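A closed taxonomy can be pictured as a fixed lookup table that rejects anything outside it. The sketch below assumes one plausible reading of each type name; the (incorrect, correct) pairs are illustrative guesses, not the canonical examples from theory 01, and require_known_type is a hypothetical helper.

```python
# Keys are the 6 v1 type names; values are illustrative (incorrect, correct) pairs.
TAXONOMY_V1 = {
    "missing_third_person_s": ("he work", "he works"),
    "overregularization_past": ("he goed", "he went"),
    "wrong_aux_will_with_to": ("he will to work", "he will work"),
    "wrong_aux_going_to_missing_ing": ("he is go to work", "he is going to work"),
    "subject_verb_disagreement": ("I works", "I work"),
    "bare_participle_missing_aux": ("he gone", "he has gone"),
}


def require_known_type(name: str) -> str:
    """Reject any type outside the closed v1 set."""
    if name not in TAXONOMY_V1:
        raise ValueError(f"{name!r} is not in the v1 taxonomy; adding it needs a spec version bump")
    return name


print(require_known_type("overregularization_past"))
```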
Block E: row JSONSchema
- Embed a JSONSchema for the row format (per theory 01). Use draft-07 or 2020-12.
- Fields: id, text, spanish, verb_lemma, spanish_lemma, tense, person, english_regularity, spanish_regularity, label, mis_conjugation_type, correct_form, seed, fingerprint.
- Required vs optional clearly marked.
- mis_conjugation_type and correct_form are nullable, but the validator enforces the cross-field constraint (both populated iff label = "mis_conjugated").
- Enums for tense, person, label, mis_conjugation_type, english_regularity, spanish_regularity.
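One possible shape for the schema is sketched below as a Python dict that serializes to draft-07 JSON, with the cross-field rule kept as a separate check because the validator enforces it. Several details are assumptions and should be replaced by what theory 01 says: the label value "correct", the regularity values "regular"/"irregular", the field types, the choice to require all 14 fields, and every value in the sample row.

```python
import json

TENSES = ["infinitive", "present_simple", "past_simple",
          "past_participle", "future_will", "future_going_to"]
PERSONS = ["1sg", "2sg", "3sg"]
MIS_TYPES = ["missing_third_person_s", "overregularization_past",
             "wrong_aux_will_with_to", "wrong_aux_going_to_missing_ing",
             "subject_verb_disagreement", "bare_participle_missing_aux"]
REGULARITY = ["regular", "irregular"]  # assumed enum values

PROPERTIES = {
    "id": {"type": "string"},
    "text": {"type": "string"},
    "spanish": {"type": "string"},
    "verb_lemma": {"type": "string"},
    "spanish_lemma": {"type": "string"},
    "tense": {"enum": TENSES},
    "person": {"enum": PERSONS},
    "english_regularity": {"enum": REGULARITY},
    "spanish_regularity": {"enum": REGULARITY},
    "label": {"enum": ["correct", "mis_conjugated"]},  # "correct" is assumed
    "mis_conjugation_type": {"enum": MIS_TYPES + [None]},  # None serializes to null
    "correct_form": {"type": ["string", "null"]},
    "seed": {"type": "integer"},
    "fingerprint": {"type": "string"},
}

ROW_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "additionalProperties": False,
    "required": list(PROPERTIES),  # assumption: nullable fields are present as null
    "properties": PROPERTIES,
}


def cross_field_ok(row: dict) -> bool:
    """Both nullable fields are populated iff label == "mis_conjugated"."""
    is_mis = row["label"] == "mis_conjugated"
    return ((row["mis_conjugation_type"] is not None) == is_mis
            and (row["correct_form"] is not None) == is_mis)


# Hypothetical sample row; id, seed, and fingerprint are placeholders.
sample = {"id": "row-0001", "text": "he works", "spanish": "él trabaja",
          "verb_lemma": "work", "spanish_lemma": "trabajar",
          "tense": "present_simple", "person": "3sg",
          "english_regularity": "regular", "spanish_regularity": "regular",
          "label": "correct", "mis_conjugation_type": None,
          "correct_form": None, "seed": 0, "fingerprint": "placeholder"}

print(json.dumps(ROW_SCHEMA, indent=2, ensure_ascii=False))
print(cross_field_ok(sample))
```

The printed JSON is what would go into the fenced block in data/corpus_spec.md; the Python wrapper is only a convenient way to generate and check it.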
Block F: validation rules in prose
List the 12 checks from theory 02:
- Schema validity.
- No exact duplicates by fingerprint.
- Cell coverage: 360 cells all have ≥ 1 correct row.
- All 20 verbs present.
- All 6 tense-surfaces per verb.
- All 3 persons per (verb, tense).
- Mis-conjugation types from canonical taxonomy.
- Mis-conjugation rows have correct_form populated.
- Correct rows have mis_conjugation_type = null.
- Every row has a Spanish field.
- NFC normalization.
- Text length in range.
- (Post-split) no fingerprint overlap and no (verb, tense) cross-split overlap.
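Two of the checks above, fingerprint duplicates and cell coverage, can be sketched as follows to show what the prose rules need to pin down. This is an illustration under assumptions, not the validator from theory 02: the function names are hypothetical, the label value "correct" is assumed, and the verb list is a placeholder.

```python
from collections import Counter
from itertools import product

TENSES = ["infinitive", "present_simple", "past_simple",
          "past_participle", "future_will", "future_going_to"]
PERSONS = ["1sg", "2sg", "3sg"]


def duplicate_fingerprints(rows: list[dict]) -> list[str]:
    """Return fingerprints that appear on more than one row."""
    counts = Counter(row["fingerprint"] for row in rows)
    return sorted(fp for fp, n in counts.items() if n > 1)


def uncovered_cells(rows: list[dict], verbs: list[str]) -> list[tuple[str, str, str]]:
    """Return (verb, tense, person) cells that have no row labeled correct."""
    covered = {(r["verb_lemma"], r["tense"], r["person"])
               for r in rows if r["label"] == "correct"}
    return [cell for cell in product(verbs, TENSES, PERSONS) if cell not in covered]


# Placeholder verb list: 20 verbs x 6 tense surfaces x 3 persons = 360 cells.
verbs = [f"verb{i:02d}" for i in range(20)]
rows = [{"verb_lemma": "verb00", "tense": "present_simple", "person": "3sg",
         "label": "correct", "fingerprint": "fp-a"},
        {"verb_lemma": "verb00", "tense": "present_simple", "person": "3sg",
         "label": "mis_conjugated", "fingerprint": "fp-a"}]

print(len(uncovered_cells(rows, verbs)))  # 359
print(duplicate_fingerprints(rows))       # ['fp-a']
```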
Block G: sanity check
- Add a final section: "Worked example of one cell." Pick (work, present_simple, 3sg). Write out the row JSON in full. Show that it satisfies every rule listed in Block F.
Constraints
- Markdown only, no code beyond inlined JSON / JSONSchema blocks. This is the contract, not the implementation.
- No external links to Wikipedia/grammar resources in v1; the table is self-contained. If we need to cite something, it goes in docs/phase-12-corpus-design/theory/.
- Spanish must be NFC. Type ñ as a single precomposed character, not n + combining tilde (U+0303). Verify with cat -v or python -c "import unicodedata; print(unicodedata.is_normalized('NFC', open('data/corpus_spec.md', encoding='utf-8').read()))".
- Lock the version at 1.0.0. Any future change increments per the semver table in theory 03.
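When the NFC check fails, it helps to know which lines are affected. The sketch below shows the difference between the precomposed and decomposed forms of ñ and reports offending line numbers; find_non_nfc_lines is a hypothetical helper, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata

precomposed = "\u00f1"  # ñ as a single code point (NFC form)
decomposed = "n\u0303"  # n followed by U+0303 COMBINING TILDE (NFD form)

print(len(precomposed), len(decomposed))                      # 1 2
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True


def find_non_nfc_lines(text: str) -> list[int]:
    """Return 1-based numbers of the lines that are not NFC-normalized."""
    return [number
            for number, line in enumerate(text.splitlines(), start=1)
            if not unicodedata.is_normalized("NFC", line)]


sample = "a\u00f1o\nan\u0303o\n"   # line 1 is NFC, line 2 is decomposed
print(find_non_nfc_lines(sample))  # [2]
```

To apply it to the spec, read data/corpus_spec.md with encoding="utf-8" and pass the contents to find_non_nfc_lines.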
Stop conditions
You're done when:
- data/corpus_spec.md exists and is committed.
- The 20-verb table has all 20 entries with Spanish pairings and irregular-form columns filled.
- All 6 tense surface-form schemata are documented with worked examples for regular and irregular verbs.
- The 6 mis-conjugation types are documented with examples and correction patterns.
- The JSONSchema is embedded and valid (you can paste it into a JSONSchema validator and have it accept a sample row).
- The 12 validation rules are listed in prose.
Pitfalls
- Spanish accent encoding. If your editor saves NFD, your spec is incorrectly encoded. Verify NFC.
- Translation ambiguity. The English verb be maps to either ser or estar. The spec must pick one (default ser) and document the choice, not leave it ambiguous for the generator.
- Irregular Spanish stems. For Spanish irregulars (e.g., tener → tengo for 1st-sg present), the spec can either enumerate the irregular forms or describe them by rule. Recommendation: enumerate. Rules are bug magnets.
- like → gustar syntax inversion. Easy to forget. The spec must call this out explicitly so the generator's schemata for like don't follow the standard pattern.
- Forgetting was/were for the past of be: I was, you were, he was. The spec must encode this distinction.
Hint of last resort
If you are 2 hours in and the verb table is still half-empty, don't perfect the Spanish: pick one canonical translation per English verb, document any uncertainty in the Notes column, and move on. The lab is about getting the spec down, not about being a Spanish linguist. Borja's Spanish review will catch errors at validation time.
When to consult solutions/
After the spec is committed. Solution: solutions/00-corpus-spec-ref.md (phase open) contains a reference 20-verb table and the canonical JSONSchema.
Next lab: lab/01-implement-generator.md.