01 — Schema, labels, and the mis-conjugation taxonomy

Every corpus row has a fixed schema: the English form, its Spanish translation, the verb lemma, tense, person, regularity, and a label (correct or mis_conjugated) with an error type where applicable. The schema is a contract: a row that fails validation does not enter the corpus.


The row schema

Every row in data/processed/*.jsonl is a JSON object with this shape:

{
  "id": "0042",
  "text": "he works",
  "spanish": "él trabaja",
  "verb_lemma": "work",
  "spanish_lemma": "trabajar",
  "tense": "present_simple",
  "person": "3sg",
  "english_regularity": "regular",
  "spanish_regularity": "regular",
  "label": "correct",
  "mis_conjugation_type": null,
  "correct_form": null,
  "seed": 42042,
  "fingerprint": "a1b2c3..."
}

A mis-conjugated row has the same shape with the deviant text, label = "mis_conjugated", and the canonical correct_form populated:

{
  "id": "0837",
  "text": "he work",
  "spanish": "él trabaja",
  "verb_lemma": "work",
  "spanish_lemma": "trabajar",
  "tense": "present_simple",
  "person": "3sg",
  "english_regularity": "regular",
  "spanish_regularity": "regular",
  "label": "mis_conjugated",
  "mis_conjugation_type": "missing_third_person_s",
  "correct_form": "he works",
  "seed": 42837,
  "fingerprint": "f9e8d7..."
}

Field semantics:

  • id: stable across regenerations.
  • text: the English form (correct or mis-conjugated).
  • spanish: the Spanish translation of the intended form. For mis-conjugated English, spanish still carries the correct Spanish form — we don't generate "wrong Spanish" in v1.
  • verb_lemma: the English infinitive (one of the 20).
  • spanish_lemma: the Spanish infinitive (the canonical pairing).
  • tense: one of infinitive, present_simple, past_simple, past_participle, future_will, future_going_to.
  • person: one of 1sg, 2sg, 3sg.
  • english_regularity: regular (12 verbs) or irregular (8 verbs).
  • spanish_regularity: regular or irregular. Mappings differ from English — e.g., do (irregular EN) ↔ hacer (irregular ES); work (regular EN) ↔ trabajar (regular ES); see (irregular EN past) ↔ ver (regular-ish ES). Documented in the verb table.
  • label: correct or mis_conjugated.
  • mis_conjugation_type: one of the canonical codes (see § below). Null for correct rows.
  • correct_form: the canonical correct English form, for mis-conjugated rows. Null for correct rows (where text itself is the correct form).
  • seed: the per-row RNG seed used (mostly cosmetic — see theory 03).
  • fingerprint: sha256(normalize(text)). Dedup key.

The schema is enforced by scripts/validate_corpus.py via a JSONSchema in data/corpus_spec.md. Any row that fails validation is removed, not silently included.
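The fingerprint field above is described as sha256(normalize(text)). The following is a minimal sketch of how such a dedup key could be computed; the body of normalize is an assumption (trim and collapse whitespace only), and the helper names fingerprint and dedup are hypothetical, not taken from gen_corpus.py or validate_corpus.py.

```python
import hashlib


def normalize(text: str) -> str:
    # ASSUMPTION: normalization only trims and collapses whitespace.
    # The real rule is defined by the spec and may differ.
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    # sha256 over the UTF-8 bytes of the normalized text, as a hex digest.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def dedup(rows):
    # Keep the first row seen for each fingerprint, drop later duplicates.
    seen = set()
    kept = []
    for row in rows:
        fp = fingerprint(row["text"])
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(row)
    return kept
```

Case is deliberately left untouched in this sketch, so "I work" and "i work" hash differently; folding case would conflict with the capitalization convention the spec locks for hash stability.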

The two labels

correct

The English text is a grammatically valid form for the given (verb, tense, person). The Spanish field is the canonical translation. Examples:

  • (I, work, present_simple) → text="I work", spanish="yo trabajo".
  • (he, eat, past_simple) → text="he ate", spanish="él comió".
  • (you, go, future_going_to) → text="you are going to go", spanish="tú vas a ir".

A correct row has label = "correct" and mis_conjugation_type = null.

mis_conjugated

The English text is wrong in a known way. The correct_form field carries what the form should be. The Spanish field still carries the correct Spanish (we're not modeling Spanish errors in v1).

A mis_conjugated row has label = "mis_conjugated", a non-null mis_conjugation_type from the closed taxonomy, and a non-null correct_form.
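The label invariants above, together with the closed value sets from the field semantics, can be checked with a few lines of plain Python. This is a sketch under stated assumptions: validate_row and the constant names are hypothetical, and the real enforcement is the JSONSchema used by scripts/validate_corpus.py, which may check more than this.

```python
REQUIRED = (
    "id", "text", "spanish", "verb_lemma", "spanish_lemma", "tense", "person",
    "english_regularity", "spanish_regularity", "label",
    "mis_conjugation_type", "correct_form", "seed", "fingerprint",
)
TENSES = {"infinitive", "present_simple", "past_simple",
          "past_participle", "future_will", "future_going_to"}
PERSONS = {"1sg", "2sg", "3sg"}
REGULARITY = {"regular", "irregular"}
TYPES = {
    "missing_third_person_s", "overregularization_past",
    "wrong_aux_will_with_to", "wrong_aux_going_to_missing_ing",
    "subject_verb_disagreement", "bare_participle_missing_aux",
}


def validate_row(row: dict) -> list:
    """Return a list of problems; an empty list means the row passes."""
    missing = [f for f in REQUIRED if f not in row]
    if missing:
        return [f"missing fields: {missing}"]
    errors = []
    if row["tense"] not in TENSES:
        errors.append(f"bad tense: {row['tense']}")
    if row["person"] not in PERSONS:
        errors.append(f"bad person: {row['person']}")
    for field in ("english_regularity", "spanish_regularity"):
        if row[field] not in REGULARITY:
            errors.append(f"bad {field}: {row[field]}")
    if row["label"] == "correct":
        # Correct rows carry neither an error type nor a correct_form.
        if row["mis_conjugation_type"] is not None:
            errors.append("correct row has a mis_conjugation_type")
        if row["correct_form"] is not None:
            errors.append("correct row has a correct_form")
    elif row["label"] == "mis_conjugated":
        # Mis-conjugated rows need a type from the closed taxonomy
        # and a non-null correct_form.
        if row["mis_conjugation_type"] not in TYPES:
            errors.append("mis_conjugation_type not in taxonomy")
        if not row["correct_form"]:
            errors.append("mis_conjugated row lacks correct_form")
    else:
        errors.append(f"bad label: {row['label']}")
    return errors
```

Returning a list of problems instead of raising on the first one makes it easier to log why a row was removed.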

The mis-conjugation taxonomy (v1 draft, 6 types)

The closed list of error types in v1:

Code                            Trigger                                                          Example          Correct
------------------------------  ---------------------------------------------------------------  ---------------  ------------------
missing_third_person_s          3rd-sg present without -s                                        he work          he works
overregularization_past         Irregular past treated as regular                                I goed           I went
wrong_aux_will_with_to          will followed by to                                              he will to work  he will work
wrong_aux_going_to_missing_ing  going to rendered as go to                                       I am go to work  I am going to work
subject_verb_disagreement       Wrong auxiliary for the person                                   you has worked   you have worked
bare_participle_missing_aux     Past participle without auxiliary in a context that requires it  he gone          he has gone

These six are meant to cover common error modes of beginner ESL speakers. The generator may emit ~1–3 mis-conjugations per cell, gated by:

  • Cell relevance. missing_third_person_s only applies to 3rd-sg present-simple cells.
  • Regularity. overregularization_past only applies to irregular verbs in past tense.
  • Tense. wrong_aux_will_with_to only applies to future_will cells.
  • Random subset. When a cell is eligible for multiple types, the generator samples (seeded) which to emit.

Anti-pattern: generating a "wrong" form that doesn't fit a clean taxonomy entry. If you can't name the error type, don't generate the row.
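The gating rules above can be sketched as an eligibility function followed by a seeded draw. Only the three gates stated in the list are encoded; the function names, the reading of "past tense" as the past_simple cell, and the max_types cap are assumptions for illustration, and the gates for the remaining three codes are left out because this section does not state them.

```python
import random


def eligible_types(tense: str, person: str, english_regularity: str) -> list:
    """Taxonomy codes whose stated gate matches this cell."""
    types = []
    if tense == "present_simple" and person == "3sg":
        types.append("missing_third_person_s")
    # ASSUMPTION: "past tense" is read as the past_simple cell only.
    if tense == "past_simple" and english_regularity == "irregular":
        types.append("overregularization_past")
    if tense == "future_will":
        types.append("wrong_aux_will_with_to")
    return types


def sample_types(eligible: list, seed: int, max_types: int = 3) -> list:
    """Seeded choice of which eligible types to emit for one cell."""
    if not eligible:
        return []
    rng = random.Random(seed)  # private RNG: same seed, same draw
    k = rng.randint(1, min(max_types, len(eligible)))
    return sorted(rng.sample(eligible, k))
```

Using a private random.Random(seed) instead of the module-level RNG keeps each cell's draw reproducible regardless of what else the generator has sampled. A cell with no eligible type yields an empty list, which matches the anti-pattern rule: no nameable error type, no row.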

Snippet-level constraints

Each row:

  1. Is a single subject + verb (+ optional auxiliary) form. No multi-clause sentences in v1.
  2. Has target text length 2–25 bytes (roughly 1–6 BPE tokens after Phase 11).
  3. No punctuation in v1. Bare forms only: "I work", not "I work." (Optional v1.5: add sentence-final periods.)
  4. Lowercase only, except the pronoun I which is conventionally capitalized in English. Spanish follows Spanish capitalization conventions (subject pronoun yo lowercase). This convention is locked in the spec for hash stability.
  5. Uses no verbs outside the 20 in scope.
  6. Uses no pronouns outside I, you, he for English. (she, it are equivalent to he morphologically; we pick he as canonical.)
  7. Uses no Spanish pronouns outside yo, tú, él.
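Constraints 2 to 4 are mechanical enough to check per row. A minimal sketch for the English text field, assuming a hypothetical helper named check_snippet; the verb and pronoun constraints (5 to 7) need the verb table and are not covered here.

```python
import string


def check_snippet(text: str) -> list:
    """Return the constraint violations found in an English text field."""
    problems = []
    n_bytes = len(text.encode("utf-8"))
    if not 2 <= n_bytes <= 25:
        problems.append(f"length {n_bytes} bytes is outside 2-25")
    if any(ch in string.punctuation for ch in text):
        problems.append("contains punctuation")
    for token in text.split():
        # Only the pronoun "I" may be capitalized.
        if token != "I" and token != token.lower():
            problems.append(f"unexpected uppercase in {token!r}")
    return problems
```

For example, check_snippet("I work") returns an empty list, while "I work." is flagged for punctuation and "He works" for casing.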

The 20 verbs and their canonical Spanish pairings (verb table)

This is the lookup table gen_corpus.py uses:

English        Spanish      EN_reg   ES_reg
-------------  -----------  -------  -------
work           trabajar     reg      reg
play           jugar        reg      irreg (stem change u→ue)
walk           caminar      reg      reg
talk           hablar       reg      reg
listen         escuchar     reg      reg
watch          mirar        reg      reg
study          estudiar     reg      reg
finish         terminar     reg      reg
start          empezar      reg      irreg (stem change e→ie)
look           mirar        reg      reg     (note: same Spanish lemma as watch)
want           querer       reg      irreg
like           gustar       reg      reg     (note: gustar has inverted syntax in Spanish — see below)
be             ser          irreg    irreg
have           tener        irreg    irreg
do             hacer        irreg    irreg
go             ir           irreg    irreg
come           venir        irreg    irreg
see            ver          irreg    reg-ish (some irregularity in past)
eat            comer        irreg    reg
write          escribir     irreg    reg     (note: past participle "escrito" is irregular)

Notes captured in the spec:

  • watch ↔ mirar and look ↔ mirar collide on the Spanish side. Both English verbs map to the same Spanish lemma; downstream phases see this and may learn that one Spanish form covers two English meanings. Acceptable for v1.
  • like ↔ gustar has inverted syntax in Spanish: I like X → me gusta X (literally "X is pleasing to me"). v1 corpus uses the literal me gusta form, not yo gusto (which would be wrong). Document explicitly.
  • be ↔ ser uses ser; the estar alternative is out of scope for v1.
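The watch/look collision noted above can be detected mechanically from the pairings. The sketch below copies the English-to-Spanish lemma column of the verb table into a dict; the names PAIRINGS and spanish_collisions are hypothetical, and the regularity columns are omitted for brevity.

```python
from collections import defaultdict

# English lemma -> canonical Spanish lemma, copied from the verb table.
PAIRINGS = {
    "work": "trabajar", "play": "jugar", "walk": "caminar", "talk": "hablar",
    "listen": "escuchar", "watch": "mirar", "study": "estudiar",
    "finish": "terminar", "start": "empezar", "look": "mirar",
    "want": "querer", "like": "gustar", "be": "ser", "have": "tener",
    "do": "hacer", "go": "ir", "come": "venir", "see": "ver",
    "eat": "comer", "write": "escribir",
}


def spanish_collisions(pairings: dict) -> dict:
    """Map each Spanish lemma shared by several English verbs to those verbs."""
    by_spanish = defaultdict(list)
    for english, spanish in pairings.items():
        by_spanish[spanish].append(english)
    return {es: sorted(ens) for es, ens in by_spanish.items() if len(ens) > 1}
```

Running spanish_collisions(PAIRINGS) on this table reports mirar as the only shared lemma, with look and watch as its English sources.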

The 6 tense surface-forms and their schemata

For each (verb, person) pair, the generator emits these 6 surface forms (matching the cell breakdown of theory 00):

Tense            English schema                                                    Spanish schema (regular trabajar)
---------------  ----------------------------------------------------------------  -------------------------------------------------------
infinitive       to work                                                           trabajar
present_simple   I work / you work / he works                                      yo trabajo / tú trabajas / él trabaja
past_simple      I worked / you worked / he worked                                 yo trabajé / tú trabajaste / él trabajó
past_participle  worked (bare)                                                     trabajado (bare)
future_will      I will work / you will work / he will work                        yo trabajaré / tú trabajarás / él trabajará
future_going_to  I am going to work / you are going to work / he is going to work  yo voy a trabajar / tú vas a trabajar / él va a trabajar

The past_participle cell is bare — no auxiliary. Including it gives the model the morphological surface form for participle. (Used downstream by bare_participle_missing_aux mis-conjugation, where the wrong form omits the auxiliary that should be present.)

For irregular verbs, the schema is the same but the stem changes by verb. The table below lists the past simple and past participle forms of the 8 irregular verbs:

Verb       Past simple   Past participle
go         went          gone
be         was/were      been
have       had           had
do         did           done
come       came          come
see        saw           seen
eat        ate           eaten
write      wrote         written

For be past simple: 1st-sg was, 2nd-sg were, 3rd-sg was. (be is the only English verb that distinguishes past-tense forms by person; the table records this explicitly.)
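The tense schemata and the irregular-forms table combine into a small rules-plus-lookup function for the English side. This is a sketch, not gen_corpus.py's implementation: conjugate and the helper names are hypothetical, and the spelling rules are only meant to cover the 20 in-scope verbs.

```python
PRONOUN = {"1sg": "I", "2sg": "you", "3sg": "he"}
BE_PRESENT = {"1sg": "am", "2sg": "are", "3sg": "is"}
BE_PAST = {"1sg": "was", "2sg": "were", "3sg": "was"}
# lemma -> (past simple, past participle), from the irregular-forms table.
IRREGULAR = {
    "go": ("went", "gone"), "have": ("had", "had"), "do": ("did", "done"),
    "come": ("came", "come"), "see": ("saw", "seen"),
    "eat": ("ate", "eaten"), "write": ("wrote", "written"),
}
IRREGULAR_3SG = {"have": "has", "do": "does", "go": "goes"}


def third_person(lemma: str) -> str:
    if lemma in IRREGULAR_3SG:
        return IRREGULAR_3SG[lemma]
    if lemma.endswith(("s", "sh", "ch", "x", "z")):
        return lemma + "es"              # watch -> watches
    if lemma.endswith("y") and lemma[-2] not in "aeiou":
        return lemma[:-1] + "ies"        # study -> studies
    return lemma + "s"


def regular_past(lemma: str) -> str:
    if lemma.endswith("e"):
        return lemma + "d"               # like -> liked
    if lemma.endswith("y") and lemma[-2] not in "aeiou":
        return lemma[:-1] + "ied"        # study -> studied
    return lemma + "ed"


def conjugate(lemma: str, tense: str, person: str) -> str:
    pron = PRONOUN[person]
    if tense == "infinitive":
        return f"to {lemma}"
    if tense == "past_participle":       # bare: no pronoun, no auxiliary
        if lemma == "be":
            return "been"
        return IRREGULAR[lemma][1] if lemma in IRREGULAR else regular_past(lemma)
    if tense == "present_simple":
        if lemma == "be":
            return f"{pron} {BE_PRESENT[person]}"
        verb = third_person(lemma) if person == "3sg" else lemma
        return f"{pron} {verb}"
    if tense == "past_simple":
        if lemma == "be":
            return f"{pron} {BE_PAST[person]}"
        verb = IRREGULAR[lemma][0] if lemma in IRREGULAR else regular_past(lemma)
        return f"{pron} {verb}"
    if tense == "future_will":
        return f"{pron} will {lemma}"
    if tense == "future_going_to":
        return f"{pron} {BE_PRESENT[person]} going to {lemma}"
    raise ValueError(f"unknown tense: {tense}")
```

Keeping be as a special case, separate from the IRREGULAR lookup, mirrors the note above: it is the one verb whose past simple depends on person.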

Special tokens reservation

Reserved IDs in the Phase 11 tokenizer's vocab (for downstream phases):

ID  Token          Purpose
--  -------------  ------------------------------------------------------------------------------------------
0   <|pad|>        Padding for batched training.
1   <|endoftext|>  Row boundary.
2   <|unk|>        Reserved (byte-level BPE never emits this, but slot reserved).
3   <|sep|>        Phase 32 tutor-agent separator (e.g., between mis-conjugated input and proposed correction).

Rows do not include these in text; the tokenizer adds them at training/inference time.
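A corpus-side guard for this rule can be very small. The sketch below restates the reserved IDs from the table and checks that a row's text contains none of the tokens; the names SPECIAL_TOKENS and contains_special_token are hypothetical.

```python
# Reserved token -> ID, as listed in the table above.
SPECIAL_TOKENS = {
    "<|pad|>": 0,
    "<|endoftext|>": 1,
    "<|unk|>": 2,
    "<|sep|>": 3,
}


def contains_special_token(text: str) -> bool:
    """True if any reserved token string appears inside the row text."""
    return any(token in text for token in SPECIAL_TOKENS)
```

Since v1 rows carry no punctuation at all, a row containing a special token would already fail the snippet constraints; an explicit check simply gives a clearer error message.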

Drill problems

Solutions in solutions/01-schema-and-labels-ref.md (phase open).

  1. For the row text="he goed", identify the mis_conjugation_type, the correct form, and explain why this is a systematic (not random) error. What pattern in real-world language acquisition does this mirror?
  2. The English verb like maps to Spanish gustar with inverted syntax. Write the schema for (I, like, present_simple) in both English and Spanish. Now write the schema for (he, like, present_simple). What's the Spanish surface form, and why is the verb 3rd-person gusta in both cases, whatever the person of the English subject?
  3. The past_participle cell is bare (no auxiliary). Argue why we still include it — what would the model fail to learn if we omitted it?

One-paragraph recap

Every corpus row has a fixed schema: English text, Spanish translation, verb lemma + Spanish lemma, tense, person, regularity (per language), label (correct or mis_conjugated), and — for mis-conjugated rows — a type code and the canonical correct form. The 20 verbs × 6 surface-forms × 3 persons = 360 cells fully cover §A13's grammar matrix. Mis-conjugations come from a closed taxonomy of 6 types, gated by cell eligibility. Spanish pairings are looked up from a static verb table that records the canonical lemma per English verb plus regularity per language.


Next: theory/02-leakage-and-splits.md.