01 - Schema, labels, and the mis-conjugation taxonomy
Every corpus row has a fixed schema: the English form, its Spanish translation, the verb lemma, tense, person, regularity, and a label (correct or mis_conjugated) with an error type where applicable. The schema is a contract: a row that does not validate does not enter the corpus.
The row schema
Every row in data/processed/*.jsonl is a JSON object with this shape:
{
"id": "0042",
"text": "he works",
"spanish": "él trabaja",
"verb_lemma": "work",
"spanish_lemma": "trabajar",
"tense": "present_simple",
"person": "3sg",
"english_regularity": "regular",
"spanish_regularity": "regular",
"label": "correct",
"mis_conjugation_type": null,
"correct_form": null,
"seed": 42042,
"fingerprint": "a1b2c3..."
}
A mis-conjugated row has the same shape with the deviant text, label = "mis_conjugated", and the canonical correct_form populated:
{
"id": "0837",
"text": "he work",
"spanish": "él trabaja",
"verb_lemma": "work",
"spanish_lemma": "trabajar",
"tense": "present_simple",
"person": "3sg",
"english_regularity": "regular",
"spanish_regularity": "regular",
"label": "mis_conjugated",
"mis_conjugation_type": "missing_third_person_s",
"correct_form": "he works",
"seed": 42837,
"fingerprint": "f9e8d7..."
}
Field semantics:
- `id`: stable across regenerations.
- `text`: the English form (correct or mis-conjugated).
- `spanish`: the Spanish translation of the intended form. For mis-conjugated English, `spanish` still carries the correct Spanish form; we don't generate "wrong Spanish" in v1.
- `verb_lemma`: the English infinitive (one of the 20).
- `spanish_lemma`: the Spanish infinitive (the canonical pairing).
- `tense`: one of `infinitive`, `present_simple`, `past_simple`, `past_participle`, `future_will`, `future_going_to`.
- `person`: one of `1sg`, `2sg`, `3sg`.
- `english_regularity`: `regular` (12 verbs) or `irregular` (8 verbs).
- `spanish_regularity`: `regular` or `irregular`. Mappings differ from English, e.g., `do` (irregular EN) ↔ `hacer` (irregular ES); `work` (regular EN) ↔ `trabajar` (regular ES); `see` (irregular EN past) ↔ `ver` (regular-ish ES). Documented in the verb table.
- `label`: `correct` or `mis_conjugated`.
- `mis_conjugation_type`: one of the canonical codes (see the taxonomy below). Null for correct rows.
- `correct_form`: the canonical correct English form, for mis-conjugated rows. Null for correct rows (where `text` itself is the correct form).
- `seed`: the per-row RNG seed used (mostly cosmetic; see theory 03).
- `fingerprint`: `sha256(normalize(text))`. Dedup key.
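The `fingerprint` field can be sketched as follows. The spec does not say what `normalize` does, so the version here (trim, then collapse whitespace runs, leave case alone) is an assumption, and `dedupe` is a hypothetical helper showing how the key might be used.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Assumed normalizer: trim and collapse runs of whitespace.

    Case is left untouched, since the capital "I" is part of the
    locked surface form.
    """
    return re.sub(r"\s+", " ", text.strip())


def fingerprint(text: str) -> str:
    """Hex SHA-256 digest of the normalized text (the dedup key)."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def dedupe(rows: list[dict]) -> list[dict]:
    """Hypothetical helper: keep the first row seen per fingerprint."""
    seen: set[str] = set()
    kept = []
    for row in rows:
        key = fingerprint(row["text"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```

Because the hash covers the normalized text only, two rows with the same English surface form share a fingerprint regardless of their other fields.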
The schema is enforced by scripts/validate_corpus.py via a JSON Schema in data/corpus_spec.md. Any row that fails validation is removed from the corpus rather than silently included.
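Neither scripts/validate_corpus.py nor the JSON Schema itself is reproduced here, so the following is only a minimal pure-Python sketch of the same idea: check required fields and enumerated values, then split rows into kept and rejected. The names `row_errors` and `split_rows` are hypothetical.

```python
import json

# Enumerated values taken from the field semantics above.
ENUMS = {
    "tense": {"infinitive", "present_simple", "past_simple",
              "past_participle", "future_will", "future_going_to"},
    "person": {"1sg", "2sg", "3sg"},
    "english_regularity": {"regular", "irregular"},
    "spanish_regularity": {"regular", "irregular"},
    "label": {"correct", "mis_conjugated"},
}
REQUIRED_FIELDS = (
    "id", "text", "spanish", "verb_lemma", "spanish_lemma", "tense",
    "person", "english_regularity", "spanish_regularity", "label",
    "mis_conjugation_type", "correct_form", "seed", "fingerprint",
)


def row_errors(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row passes."""
    errors = [f"missing field: {name}"
              for name in REQUIRED_FIELDS if name not in row]
    for name, allowed in ENUMS.items():
        if name not in row:
            continue
        value = row[name]
        if not isinstance(value, str) or value not in allowed:
            errors.append(f"{name}: unexpected value {value!r}")
    return errors


def split_rows(lines):
    """Parse JSONL lines into (valid rows, rejected (line, errors) pairs)."""
    valid, rejected = [], []
    for line in lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            rejected.append((line, [f"invalid JSON: {exc.msg}"]))
            continue
        if isinstance(row, dict):
            errors = row_errors(row)
        else:
            errors = ["row is not a JSON object"]
        if errors:
            rejected.append((line, errors))
        else:
            valid.append(row)
    return valid, rejected
```

Returning the rejected lines with their reasons keeps removal explicit, so a dropped row can be traced instead of vanishing.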
The two labels
correct
The English text is a grammatically valid form for the given (verb, tense, person). The Spanish field is the canonical translation. Examples:
- `(I, work, present_simple)` → `text="I work"`, `spanish="yo trabajo"`.
- `(he, eat, past_simple)` → `text="he ate"`, `spanish="él comió"`.
- `(you, go, future_going_to)` → `text="you are going to go"`, `spanish="tú vas a ir"`.
A correct row has label = "correct" and mis_conjugation_type = null.
mis_conjugated
The English text is wrong in a known way. The correct_form field carries what the form should be. The Spanish field still carries the correct Spanish (we're not modeling Spanish errors in v1).
A mis_conjugated row has label = "mis_conjugated", a non-null mis_conjugation_type from the closed taxonomy, and a non-null correct_form.
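The two label definitions amount to a cross-field invariant, which could be checked along these lines. This is a sketch, not the project's validator: `label_errors` is a hypothetical name, and the check that `text` differs from `correct_form` is an added assumption rather than something the spec states.

```python
# The six codes of the closed v1 taxonomy.
TAXONOMY = {
    "missing_third_person_s",
    "overregularization_past",
    "wrong_aux_will_with_to",
    "wrong_aux_going_to_missing_ing",
    "subject_verb_disagreement",
    "bare_participle_missing_aux",
}


def label_errors(row: dict) -> list[str]:
    """Check that label, mis_conjugation_type and correct_form agree."""
    label = row.get("label")
    mis_type = row.get("mis_conjugation_type")
    correct_form = row.get("correct_form")
    errors = []
    if label == "correct":
        if mis_type is not None:
            errors.append("correct row must have a null mis_conjugation_type")
        if correct_form is not None:
            errors.append("correct row must have a null correct_form")
    elif label == "mis_conjugated":
        if mis_type not in TAXONOMY:
            errors.append(f"unknown mis_conjugation_type: {mis_type!r}")
        if not correct_form:
            errors.append("mis_conjugated row needs a correct_form")
        elif correct_form == row.get("text"):
            # Assumption: a deviant text identical to the fix is a bug.
            errors.append("text is identical to correct_form")
    else:
        errors.append(f"unknown label: {label!r}")
    return errors
```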
The mis-conjugation taxonomy (v1 draft, 6 types)
The closed list of error types in v1:
| Code | Trigger | Example | Correct |
|---|---|---|---|
| `missing_third_person_s` | 3rd-sg present without -s | he work | he works |
| `overregularization_past` | Irregular past treated as regular | I goed | I went |
| `wrong_aux_will_with_to` | will followed by to | he will to work | he will work |
| `wrong_aux_going_to_missing_ing` | going to rendered as go to | I am go to work | I am going to work |
| `subject_verb_disagreement` | Wrong auxiliary for the person | you has worked | you have worked |
| `bare_participle_missing_aux` | Past participle without auxiliary in a context that requires it | he gone | he has gone |
These six cover the most common error modes a beginner ESL speaker makes. The generator may emit ~1–3 mis-conjugations per cell, gated by:
- Cell relevance. `missing_third_person_s` only applies to 3rd-sg present-simple cells.
- Regularity. `overregularization_past` only applies to irregular verbs in past tense.
- Tense. `wrong_aux_will_with_to` only applies to `future_will` cells.
- Random subset. When a cell is eligible for multiple types, the generator samples (seeded) which to emit.
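The gates above can be sketched as an eligibility function plus a seeded sampler. Only the three gates stated in the list are encoded; the gates for the other three codes are not given in this section, so they are left out rather than guessed. The function names and the `max_types` cap (taken from the "~1–3 per cell" figure) are assumptions.

```python
import random

# The eight irregular English verbs from the verb table.
IRREGULAR_EN = {"be", "have", "do", "go", "come", "see", "eat", "write"}


def eligible_types(verb_lemma: str, tense: str, person: str) -> list[str]:
    """Return the taxonomy codes whose stated gate this cell satisfies."""
    types = []
    if tense == "present_simple" and person == "3sg":
        types.append("missing_third_person_s")
    if tense == "past_simple" and verb_lemma in IRREGULAR_EN:
        types.append("overregularization_past")
    if tense == "future_will":
        types.append("wrong_aux_will_with_to")
    return types


def sample_types(verb_lemma: str, tense: str, person: str,
                 seed: int, max_types: int = 3) -> list[str]:
    """Pick a seeded, reproducible subset of the eligible codes."""
    candidates = eligible_types(verb_lemma, tense, person)
    if not candidates:
        return []
    rng = random.Random(seed)  # local RNG, so global state is untouched
    count = rng.randint(1, min(max_types, len(candidates)))
    return rng.sample(candidates, count)
```

With only these three gates no cell has more than one candidate, so the sampling step matters only once the remaining gates are added. Using a per-call `random.Random(seed)` keeps the choice reproducible across regenerations.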
Anti-pattern: generating a "wrong" form that doesn't fit a clean taxonomy entry. If you can't name the error type, don't generate the row.
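One way to enforce this rule in code is to route every deviant form through a dispatch table keyed by taxonomy code, so an unnamed error type cannot produce a row. The sketch below covers four of the six codes; the two that need a participle and a perfect auxiliary are omitted. All function names are hypothetical, and the spelling rule for overregularized pasts (add `d` after a final `e`, except for `be`) is an assumption.

```python
BE_PRESENT = {"I": "am", "you": "are", "he": "is"}


def _missing_s(subject: str, lemma: str) -> str:
    return f"{subject} {lemma}"                    # he work


def _overregularized_past(subject: str, lemma: str) -> str:
    # Assumed spelling rule: come -> comed, go -> goed, be -> beed.
    suffix = "d" if lemma.endswith("e") and lemma != "be" else "ed"
    return f"{subject} {lemma}{suffix}"            # I goed


def _will_with_to(subject: str, lemma: str) -> str:
    return f"{subject} will to {lemma}"            # he will to work


def _going_to_missing_ing(subject: str, lemma: str) -> str:
    return f"{subject} {BE_PRESENT[subject]} go to {lemma}"  # I am go to work


TRANSFORMS = {
    "missing_third_person_s": _missing_s,
    "overregularization_past": _overregularized_past,
    "wrong_aux_will_with_to": _will_with_to,
    "wrong_aux_going_to_missing_ing": _going_to_missing_ing,
}


def make_deviant(code: str, subject: str, lemma: str) -> str:
    """Build the deviant text for a named error type, or refuse."""
    try:
        transform = TRANSFORMS[code]
    except KeyError:
        raise ValueError(f"no taxonomy entry for {code!r}") from None
    return transform(subject, lemma)
```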
Snippet-level constraints
Each row:
- Is a single subject + verb (+ optional auxiliary) form. No multi-clause sentences in v1.
- Has target text length 2–25 bytes (roughly 1–6 BPE tokens after Phase 11).
- No punctuation in v1. Bare forms only: `I work`, not `I work.` (Optional v1.5: add sentence-final periods.)
- Lowercase only, except the pronoun `I`, which is conventionally capitalized in English. Spanish follows Spanish capitalization conventions (subject pronoun `yo` lowercase). This convention is locked in the spec for hash stability.
- Uses no verbs outside the 20 in scope.
- Uses no pronouns outside `I`, `you`, `he` for English. (`she`, `it` are equivalent to `he` morphologically; we pick `he` as canonical.)
- Uses no Spanish pronouns outside `yo`, `tú`, `él`.
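Several of these constraints are mechanical and could be checked on the English `text` field roughly as below. This is a sketch: `snippet_errors` is a hypothetical name, the set of rejected pronouns is an assumption, and the verb-scope and Spanish-side rules are not covered.

```python
import string

# Assumption: English subject pronouns that v1 excludes.
OUT_OF_SCOPE_PRONOUNS = {"she", "it", "we", "they"}


def snippet_errors(text: str) -> list[str]:
    """Check length, punctuation, casing and pronoun scope of English text."""
    errors = []
    n_bytes = len(text.encode("utf-8"))
    if not 2 <= n_bytes <= 25:
        errors.append(f"length {n_bytes} bytes is outside 2-25")
    if any(ch in string.punctuation for ch in text):
        errors.append("contains punctuation")
    for token in text.split():
        # Only the pronoun "I" may carry an uppercase letter.
        if token != "I" and token != token.lower():
            errors.append(f"unexpected uppercase in {token!r}")
        if token.lower() in OUT_OF_SCOPE_PRONOUNS:
            errors.append(f"pronoun {token!r} is out of scope")
    return errors
```

Length is measured in UTF-8 bytes rather than characters, matching the byte-based limit in the list.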
The 20 verbs and their canonical Spanish pairings (verb table)
This is the lookup table gen_corpus.py uses:
English Spanish EN_reg ES_reg
------------- ----------- ------- -------
work trabajar reg reg
play jugar reg irreg (stem change u→ue)
walk caminar reg reg
talk hablar reg reg
listen escuchar reg reg
watch mirar reg reg
study estudiar reg reg
finish terminar reg reg
start empezar reg irreg (stem change e→ie)
look mirar reg reg (note: same Spanish lemma as watch)
want querer reg irreg
like gustar reg reg (note: gustar has inverted syntax in Spanish — see below)
be ser irreg irreg
have tener irreg irreg
do hacer irreg irreg
go ir irreg irreg
come venir irreg irreg
see ver irreg reg-ish (irregular 1sg present "veo" and past participle "visto")
eat comer irreg reg
write escribir irreg reg (note: past participle "escrito" is irregular)
Notes captured in the spec:
- `watch ↔ mirar` and `look ↔ mirar` collide on the Spanish side. Both English verbs map to the same Spanish lemma; downstream phases see this and may learn that one Spanish form covers two English meanings. Acceptable for v1.
- `like ↔ gustar` has inverted syntax in Spanish: `I like X` → `me gusta X` (literally "X is pleasing to me"). v1 corpus uses the literal `me gusta` form, not `yo gusto` (which would be wrong). Document explicitly.
- `be ↔ ser` uses `ser`; the `estar` alternative is out of scope for v1.
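How gen_corpus.py stores this table is not shown, so the structure below is an assumed representation holding only a few of the 20 rows. It also shows how the `watch`/`look` collision on `mirar` could be detected mechanically; `VerbEntry` and `spanish_collisions` are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class VerbEntry(NamedTuple):
    spanish_lemma: str
    english_regularity: str
    spanish_regularity: str


# A subset of the verb table, copied from the rows above.
VERB_TABLE = {
    "work": VerbEntry("trabajar", "regular", "regular"),
    "watch": VerbEntry("mirar", "regular", "regular"),
    "look": VerbEntry("mirar", "regular", "regular"),
    "like": VerbEntry("gustar", "regular", "regular"),
    "be": VerbEntry("ser", "irregular", "irregular"),
    "go": VerbEntry("ir", "irregular", "irregular"),
}


def spanish_collisions(table: dict[str, VerbEntry]) -> dict[str, list[str]]:
    """Map each Spanish lemma shared by several English verbs to those verbs."""
    by_spanish = defaultdict(list)
    for english, entry in table.items():
        by_spanish[entry.spanish_lemma].append(english)
    return {es: ens for es, ens in by_spanish.items() if len(ens) > 1}
```

On this subset, `spanish_collisions(VERB_TABLE)` reports `mirar` as shared by `watch` and `look`.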
The 6 tense surface-forms and their schemata
For each (verb, person) pair, the generator emits these 6 surface forms (matching the cell breakdown of theory 00):
| Tense | English schema | Spanish schema (regular trabajar) |
|---|---|---|
| `infinitive` | to work | trabajar |
| `present_simple` | I work / you work / he works | yo trabajo / tú trabajas / él trabaja |
| `past_simple` | I worked / you worked / he worked | yo trabajé / tú trabajaste / él trabajó |
| `past_participle` | worked (bare) | trabajado (bare) |
| `future_will` | I will work / you will work / he will work | yo trabajaré / tú trabajarás / él trabajará |
| `future_going_to` | I am going to work / you are going to work / he is going to work | yo voy a trabajar / tú vas a trabajar / él va a trabajar |
The past_participle cell is bare: it has no auxiliary. Including it gives the model the morphological surface form of the participle. (It is used downstream by the bare_participle_missing_aux mis-conjugation, where the wrong form omits the auxiliary that should be present.)
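The English column of the table can be generated for the 12 regular verbs with a few spelling rules. This is a sketch under the assumption that the standard -s/-es/-ies and -d/-ed/-ied rules suffice for the verbs in scope; none of the regular verbs here needs consonant doubling. `english_form` and its helpers are hypothetical names, and the function is not valid for irregular verbs.

```python
SUBJECTS = {"1sg": "I", "2sg": "you", "3sg": "he"}
BE_PRESENT = {"1sg": "am", "2sg": "are", "3sg": "is"}
VOWELS = "aeiou"


def third_person_s(lemma: str) -> str:
    """work -> works, watch -> watches, study -> studies, play -> plays."""
    if lemma.endswith(("s", "sh", "ch", "x", "z")):
        return lemma + "es"
    if lemma.endswith("y") and lemma[-2] not in VOWELS:
        return lemma[:-1] + "ies"
    return lemma + "s"


def past_ed(lemma: str) -> str:
    """work -> worked, like -> liked, study -> studied, play -> played."""
    if lemma.endswith("e"):
        return lemma + "d"
    if lemma.endswith("y") and lemma[-2] not in VOWELS:
        return lemma[:-1] + "ied"
    return lemma + "ed"


def english_form(lemma: str, tense: str, person: str) -> str:
    """Surface form of a REGULAR English verb for one (tense, person) cell."""
    subject = SUBJECTS[person]
    if tense == "infinitive":
        return f"to {lemma}"
    if tense == "past_participle":
        return past_ed(lemma)  # bare: no subject, no auxiliary
    if tense == "present_simple":
        verb = third_person_s(lemma) if person == "3sg" else lemma
        return f"{subject} {verb}"
    if tense == "past_simple":
        return f"{subject} {past_ed(lemma)}"
    if tense == "future_will":
        return f"{subject} will {lemma}"
    if tense == "future_going_to":
        return f"{subject} {BE_PRESENT[person]} going to {lemma}"
    raise ValueError(f"unknown tense: {tense!r}")
```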
For irregular verbs, the schema is the same but the stem changes by verb. The irregular past-simple and past-participle forms are:
Verb Past simple Past participle
go went gone
be was/were been
have had had
do did done
come came come
see saw seen
eat ate eaten
write wrote written
For be past simple: 1st-sg was, 2nd-sg were, 3rd-sg was. (be is the only English verb that distinguishes past-tense forms by person, so it is captured explicitly in the table.)
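A lookup for these forms might look like the following, with `be` stored per person and every other verb as a single string. The container layout and the name `irregular_past` are assumptions; the forms themselves are copied from the table above.

```python
# Past simple: a plain string, or a per-person mapping for "be".
IRREGULAR_PAST = {
    "go": "went",
    "be": {"1sg": "was", "2sg": "were", "3sg": "was"},
    "have": "had",
    "do": "did",
    "come": "came",
    "see": "saw",
    "eat": "ate",
    "write": "wrote",
}

IRREGULAR_PARTICIPLE = {
    "go": "gone", "be": "been", "have": "had", "do": "done",
    "come": "come", "see": "seen", "eat": "eaten", "write": "written",
}


def irregular_past(lemma: str, person: str) -> str:
    """Past-simple form of an irregular verb for the given person."""
    entry = IRREGULAR_PAST[lemma]
    if isinstance(entry, dict):
        return entry[person]  # only "be" varies by person
    return entry
```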
Special tokens reservation
Reserved IDs in the Phase 11 tokenizer's vocab (for downstream phases):
| ID | Token | Purpose |
|---|---|---|
| 0 | `<\|pad\|>` | Padding for batched training. |
| 1 | `<\|endoftext\|>` | Row boundary. |
| 2 | `<\|unk\|>` | Reserved (byte-level BPE never emits this, but slot reserved). |
| 3 | `<\|sep\|>` | Phase 32 tutor-agent separator (e.g., between mis-conjugated input and proposed correction). |
Rows do not include these in text; the tokenizer adds them at training/inference time.
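The reservation can be mirrored in code as a fixed mapping, together with a guard that corpus text stays free of these tokens and a padding helper for batches. The Phase 11 tokenizer is not shown here, so `contains_special` and `pad_batch` are hypothetical helpers operating on already-encoded ID lists.

```python
# Reserved IDs, copied from the table above.
SPECIAL_TOKENS = {
    "<|pad|>": 0,
    "<|endoftext|>": 1,
    "<|unk|>": 2,
    "<|sep|>": 3,
}


def contains_special(text: str) -> bool:
    """True if corpus text illegally embeds a reserved token string."""
    return any(token in text for token in SPECIAL_TOKENS)


def pad_batch(sequences: list[list[int]]) -> list[list[int]]:
    """Right-pad ID sequences to a common length using the pad ID."""
    pad_id = SPECIAL_TOKENS["<|pad|>"]
    width = max((len(seq) for seq in sequences), default=0)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]
```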
Drill problems
Solutions in solutions/01-schema-and-labels-ref.md (phase open).
- For the row `text="he goed"`, identify the `mis_conjugation_type`, the correct form, and explain why this is a systematic (not random) error. What pattern in real-world language acquisition does this mirror?
- The English verb `like` maps to Spanish `gustar` with inverted syntax. Write the schema for `(I, like, present_simple)` in both English and Spanish. Now write the schema for `(he, like, present_simple)`. What's the Spanish surface form, and why does it look like 3rd-person `gusta` even though the English is 3rd-person `likes`?
- The `past_participle` cell is bare (no auxiliary). Argue why we still include it: what would the model fail to learn if we omitted it?
One-paragraph recap
Every corpus row has a fixed schema: English text, Spanish translation, verb lemma + Spanish lemma, tense, person, regularity (per language), label (correct or mis_conjugated), and — for mis-conjugated rows — a type code and the canonical correct form. The 20 verbs × 6 surface-forms × 3 persons = 360 cells fully cover §A13's grammar matrix. Mis-conjugations come from a closed taxonomy of 6 types, gated by cell eligibility. Spanish pairings are looked up from a static verb table that records the canonical lemma per English verb plus regularity per language.
Next: theory/02-leakage-and-splits.md.