
02 — Leakage, dedup, and stratified splits¶

If the train/val/test split is not clean, the generalization metric lies. Dedup by normalized fingerprint and a split stratified by (verb, tense) are the two mechanisms. Without them, the Phase 32 tutor can memorize the matrix and look brilliant while remaining useless.

What leakage looks like for a morphology task¶

Leakage occurs when information that should stay unseen until test time also appears in the training set. The model can then memorize the leaked information instead of learning the underlying pattern: test-set accuracy looks great, but real-world performance is poor.

For an enumerated verb-conjugation corpus, the leakage modes are different from a free-text corpus. Specifically:

Exact duplicates. The generator emits the same row twice. Unlikely given enumeration, but possible if mis-conjugation generators collide.
Person-level leakage within a (verb, tense). Train sees I work and you work; test sees he works. A model that learns "<pronoun> work + agreement morpheme" can predict the test row trivially without learning anything about the verb work that it didn't already know from the train rows. This is the biggest leakage risk for our corpus.
Mis-conjugation ↔ correct-form pair leakage. A mis-conjugated row he work is paired with he works (via the correct_form field). If he work is in test and he works is in train, the model has seen the answer.

Each has a fix.

Fix 1: dedup by fingerprint¶

Define a row's fingerprint as sha256(normalize(text)), where normalize performs:

NFC Unicode normalization (per theory 03).
Strip leading/trailing whitespace.
Lowercase, except preserve the canonical capitalization of I (so I work and i work differ — they shouldn't both exist, but if they do, this catches it).

After normalization, two rows with the same text have the same fingerprint, and dedup removes the duplicate.
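The three normalization steps and the fingerprint can be sketched as follows. This is a minimal sketch, not the project's actual code: the function names `normalize`, `fingerprint`, and `dedup` are assumptions, and "preserve the capitalization of I" is interpreted here as leaving a standalone token `I` untouched while lowercasing every other token.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize, strip outer whitespace, lowercase all tokens except a standalone 'I'."""
    text = unicodedata.normalize("NFC", text).strip()
    # Lowercase token by token so the pronoun "I" keeps its capital letter.
    # Assumption: the pronoun only ever appears as the bare token "I".
    return re.sub(
        r"\S+",
        lambda m: m.group(0) if m.group(0) == "I" else m.group(0).lower(),
        text,
    )


def fingerprint(text: str) -> str:
    """sha256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def dedup(texts):
    """Keep the first occurrence of each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for t in texts:
        fp = fingerprint(t)
        if fp not in seen:
            seen.add(fp)
            unique.append(t)
    return unique
```

With this sketch, `he works` and ` He Works` collapse to one row, while `I work` and `i work` keep distinct fingerprints.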

Note: we do not rename pronouns or verb stems during normalization. Unlike Phase 12 in the old A1 framing (where C identifier normalization was needed), here the text is short and the morphology is the signal. We want I work and he works to have different fingerprints because they are different rows.

After dedup:

Expected: 360 correct rows + 100–300 mis-conjugations = ~460–660 unique rows.
If dedup removes more than 5% of rows, a generator template is collapsing — investigate.
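The 5% rule above is simple arithmetic on the row counts before and after dedup. A hedged sketch of how that guard might look (the helper names and the `threshold` parameter are illustrative, not taken from the project):

```python
def dedup_removal_rate(n_before: int, n_after: int) -> float:
    """Fraction of rows that dedup removed."""
    if n_before <= 0 or not 0 <= n_after <= n_before:
        raise ValueError("expected 0 <= n_after <= n_before and n_before > 0")
    return (n_before - n_after) / n_before


def dedup_looks_healthy(n_before: int, n_after: int, threshold: float = 0.05) -> bool:
    """True when dedup removed at most `threshold` of the rows."""
    return dedup_removal_rate(n_before, n_after) <= threshold
```

For example, going from 500 generated rows to 490 unique rows is a 2% removal rate and passes; going from 500 to 460 is 8% and signals a collapsing template.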

Fix 2: (verb, tense)-stratified split¶

Naive split: shuffle all rows and take 80/10/10. Wrong.

Why: with 360 correct cells and 3 persons per (verb, tense), if I work lands in train and he works in test, the model can solve he works by reasoning "I learned work from training; the test is asking me to apply the 3rd-person -s to a stem I've already seen." That's not the generalization we want to measure here.

But wait: that is the generalization we want. The whole point is to learn -s from one cell and apply to another. So is it leakage or not?

The answer depends on what claim the test split is supposed to validate:

Claim A: "The model has learned the morphology rule 3rd-sg present-simple takes -s." → testing on a new (verb, tense, person) triple, given that we've seen other (verb, tense, person) triples, validates this. (Person-level split is OK.)
Claim B: "The model can produce a (verb, tense, person) triple it has never seen any form of." → requires holding out an entire (verb, tense), so the test verbs in that tense are completely unseen.

Our claim is B, because for the §A13 tutor, the more meaningful test is: given a mis-conjugation involving (verb, tense) cells the model didn't memorize, can it still produce the correction by applying morphological generalization? Therefore: split by (verb, tense).

For each of 120 (verb, tense) pairs (20 verbs × 6 tense surface-forms):
  - 80% → train
  - 10% → val
  - 10% → test
Each (verb, tense) sends all 3 persons to the same split. Mis-conjugations
of that (verb, tense) also go to the same split.

This gives:

~96 (verb, tense) pairs in train × 3 persons = ~288 correct rows in train, plus mis-conjugations.
~12 in val × 3 persons = ~36 correct rows in val, plus mis-conjugations.
~12 in test × 3 persons = ~36 correct rows in test, plus mis-conjugations.
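The counts above follow directly from the corpus dimensions. A short arithmetic check, using the same truncating `int()` rule as the split function and letting test take the remainder (the variable names are illustrative):

```python
N_VERBS, N_TENSES, N_PERSONS = 20, 6, 3
RATIOS = (0.8, 0.1, 0.1)

n_pairs = N_VERBS * N_TENSES            # (verb, tense) pairs
n_train = int(n_pairs * RATIOS[0])      # pairs assigned to train
n_val = int(n_pairs * RATIOS[1])        # pairs assigned to val
n_test = n_pairs - n_train - n_val      # test takes whatever is left

# Correct rows per split: every pair contributes one row per person.
correct_rows = {
    "train": n_train * N_PERSONS,
    "val": n_val * N_PERSONS,
    "test": n_test * N_PERSONS,
}
print(n_pairs, (n_train, n_val, n_test), correct_rows)
```

Mis-conjugated rows come on top of these figures and follow their (verb, tense) pair into the same split.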

12 (verb, tense) pairs in test is small — only 36 correct rows. The val/test estimates have high variance. We compensate by:

Reporting accuracy per (verb, tense) not just aggregate.
Adding generated probes at Phase 32 evaluation time (new mis-conjugations of test verbs+tenses).
Considering bumping the corpus to multiple rows per cell in a v1.5 (with sentence-context wrapping).
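Per-(verb, tense) reporting, the first compensation listed above, might be computed along these lines. This is a sketch under assumptions: the input is taken to be an iterable of `(verb_lemma, tense, is_correct)` tuples, which is a hypothetical result format, not one defined by the project.

```python
from collections import defaultdict


def accuracy_by_cell(results):
    """Map each (verb_lemma, tense) pair to the fraction of its rows answered correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for verb_lemma, tense, is_correct in results:
        key = (verb_lemma, tense)
        totals[key] += 1
        hits[key] += 1 if is_correct else 0
    # Sorted keys give a stable report order across runs.
    return {key: hits[key] / totals[key] for key in sorted(totals)}
```

Reporting the per-cell numbers next to the aggregate makes it visible when a high average hides one or two cells that fail completely.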

Implementation:

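import random
from collections import defaultdict


def stratified_split_by_verb_tense(rows, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split rows so that every (verb, tense) cell lands in exactly one split."""
    rng = random.Random(seed)       # local RNG: the split depends only on `seed`
    # Bucket rows by (verb, tense).
    buckets = defaultdict(list)
    for r in rows:
        buckets[(r.verb_lemma, r.tense)].append(r)
    keys = sorted(buckets.keys())   # deterministic order before shuffling
    rng.shuffle(keys)
    n = len(keys)
    n_train = int(n * ratios[0])
    n_val   = int(n * ratios[1])
    # Test takes the remainder, so int() truncation never drops a cell.
    train_keys = set(keys[:n_train])
    val_keys   = set(keys[n_train:n_train + n_val])
    test_keys  = set(keys[n_train + n_val:])
    train, val, test = [], [], []
    # Route each row, correct or mis-conjugated, by its cell's assignment.
    for r in rows:
        k = (r.verb_lemma, r.tense)
        if k in train_keys: train.append(r)
        elif k in val_keys: val.append(r)
        else:               test.append(r)
    return train, val, test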

The same seed produces the same split every run — required by the reproducibility invariant.

Fix 3: mis-conjugation ↔ correct-form pair containment¶

A mis-conjugated row text="he work", correct_form="he works" is potentially leaky if the correct form (he works) of that exact cell is in train but the mis-conjugated row is in test. The model has effectively seen the answer.

Mitigation: the (verb, tense)-stratified split guarantees that all rows of a (verb, tense) cell — correct and mis-conjugated — go to the same split. So he work (mis) and he works (correct), both belonging to (work, present_simple), are always in the same split. No pair-leakage.

This is one of the reasons (verb, tense) is the right grain. A finer grain (per-row) would split mis from correct.
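Both guarantees can be checked mechanically after splitting. The sketch below assumes rows expose `text`, `verb_lemma`, `tense`, and `correct_form`, and that correct rows carry either `None` or their own text in `correct_form`; the `Row` tuple and both helper names are illustrative.

```python
from collections import namedtuple

# Hypothetical row shape, used only for this illustration.
Row = namedtuple("Row", "text verb_lemma tense correct_form")


def cells(rows):
    """Set of (verb_lemma, tense) keys present in a split."""
    return {(r.verb_lemma, r.tense) for r in rows}


def shared_cells(train, val, test):
    """Cells that occur in more than one split; empty when the split is clean."""
    a, b, c = cells(train), cells(val), cells(test)
    return (a & b) | (a & c) | (b & c)


def pair_leaks(train, held_out):
    """Held-out mis-conjugated rows whose correct_form appears verbatim as a train text."""
    train_texts = {r.text for r in train}
    return [
        r for r in held_out
        if r.correct_form and r.correct_form != r.text and r.correct_form in train_texts
    ]
```

An empty `shared_cells` result implies an empty `pair_leaks` result for rows of the same cell, which is the containment argument above expressed as code.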

What about "robustness probes"?¶

A robustness probe is a held-out test set the model has never seen during training, designed to test specific generalization properties. Examples for our task:

Unseen verb entirely — train without write, test only on write (in any tense, any person).
Unseen tense for a known verb — train sees all 6 tenses of work except past_participle, test only the past participle.
Cross-lingual probe — present the Spanish form and ask for the English (or vice-versa). Tests bilingual alignment.
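As an illustration of the first probe type, an unseen-verb probe could be carved out before splitting. This is a hypothetical sketch only: the `Row` shape and the `hold_out_verb` helper are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical row shape, used only for this illustration.
Row = namedtuple("Row", "text verb_lemma tense")


def hold_out_verb(rows, lemma):
    """Return (kept, probe): probe holds every row of `lemma`, kept holds the rest."""
    kept, probe = [], []
    for r in rows:
        (probe if r.verb_lemma == lemma else kept).append(r)
    if not probe:
        raise ValueError(f"no rows found for verb {lemma!r}")
    return kept, probe
```

The `kept` rows would then go through the usual (verb, tense)-stratified split, while `probe` is evaluated separately and never trained on.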

For Phase 12 v1, we don't hold out a probe set. The 12 (verb, tense) cells in the test split are the only test set. If Phase 32 needs more, add them as a v2 corpus.

(One open question we discuss in PHASE_12_PLAN.md §7d: whether to hold out one entire verb to test cross-verb generalization. Default: no for v1.)

The validator's job¶

scripts/validate_corpus.py runs after generation and asserts:

These are 12 separate checks. The validator writes a summary report and exits non-zero on any failure.
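The individual checks are not reproduced here, so the following only sketches the general shape such a validator might take: each check returns a list of failure messages, and the exit code is non-zero if any list is non-empty. The single check shown is the coverage check named in the recap; the `Row` shape with `person` and `is_correct` fields and all function names are assumptions, not the contents of scripts/validate_corpus.py.

```python
from collections import namedtuple

# Hypothetical row shape, used only for this illustration.
Row = namedtuple("Row", "verb_lemma tense person is_correct")


def check_coverage(rows, n_verbs=20, n_tenses=6, n_persons=3):
    """Fail unless the correct rows cover every verb x tense x person cell."""
    seen = {(r.verb_lemma, r.tense, r.person) for r in rows if r.is_correct}
    expected = n_verbs * n_tenses * n_persons
    if len(seen) != expected:
        return [f"coverage: expected {expected} correct cells, found {len(seen)}"]
    return []


def run_checks(rows, checks):
    """Run every check and collect all failure messages."""
    failures = []
    for check in checks:
        failures.extend(check(rows))
    return failures


def main(rows):
    """Print a short summary and return a process exit code (0 = all checks passed)."""
    failures = run_checks(rows, [check_coverage])
    for msg in failures:
        print("FAIL:", msg)
    print(f"{len(failures)} failure(s)")
    return 1 if failures else 0
```

A script entry point would pass the result to `sys.exit(main(rows))` so that callers and CI see the non-zero status.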

Why this matters for Phase 32¶

Phase 32's grammar tutor is evaluated by:

Taking the test split of the corpus.
For each mis-conjugated test row, asking the agent: "Is this correct? If not, what's the correct form and what was the error?"
Scoring the agent's responses against correct_form and mis_conjugation_type.
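The scoring step might look like the sketch below. It assumes the agent's answer has already been parsed into three fields (a hypothetical `Response` shape) and that every row passed in is a mis-conjugated test row; only `correct_form` and `mis_conjugation_type` come from the corpus schema described here, and the error-type label in the usage example is invented.

```python
from collections import namedtuple

# Hypothetical shapes, used only for this illustration.
TestRow = namedtuple("TestRow", "text correct_form mis_conjugation_type")
Response = namedtuple("Response", "is_correct correction error_type")


def score(rows, responses):
    """Score parsed agent responses against mis-conjugated test rows."""
    if not rows or len(rows) != len(responses):
        raise ValueError("need one response per row and at least one row")
    detected = corrected = typed = 0
    for row, resp in zip(rows, responses):
        detected += resp.is_correct is False        # agent flagged the error
        corrected += resp.correction == row.correct_form
        typed += resp.error_type == row.mis_conjugation_type
    n = len(rows)
    return {
        "detection": detected / n,
        "correction": corrected / n,
        "error_type": typed / n,
    }
```

Usage, with a made-up label `missing_3sg_s`: an agent that fixes `he work` but accepts `she go` as correct scores 0.5 on all three measures.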

If the test split is leaked (especially via cross-person leakage within a (verb, tense)), the agent's score is meaningless — it could be pattern-matching rather than learning the rule. Phase 12's leakage prevention is the foundation of Phase 32's evaluation validity.

Drill problems¶

Solutions in solutions/02-leakage-and-splits-ref.md (phase open).

The (verb, tense)-stratified split puts all 3 persons of a (verb, tense) into the same split. Argue why this is not "throwing away signal" — wouldn't we get a better-trained model if we trained on more persons and tested on fewer?
With 120 (verb, tense) pairs and a 12/12 val/test split, the val and test sets contain only 12 distinct (verb, tense) combinations. Is this enough? Estimate the variance of the val accuracy estimate.
Suppose Borja accidentally re-uses the same seed for gen_corpus.py and split_corpus.py. Does this cause leakage? (Hint: think about what each script's RNG controls.)

One-paragraph recap¶

Three leakage channels: exact duplicates (fix: dedup by sha256(NFC-normalize(text))), within-(verb,tense) person leakage (fix: (verb, tense)-stratified split — all 3 persons + all mis-conjugations of a cell go to the same split), and mis-conjugation/correct-form pair leakage (fix: solved automatically by the (verb, tense) split). The validator runs 12 checks before declaring the corpus done; the most critical is the all-20-verbs × 6-tense-surfaces × 3-persons coverage check from §A13.

Next: theory/03-reproducibility-and-versioning.md.