Break — Evaluate on the training set instead of held-out; show how the score lies

We change a single path in the harness: the evaluation split goes from val to train. The report then shows inflated accuracy (ConjAcc rises from 76% to 88%) for a model that has mostly memorized. This is the oldest and most common trap in evaluation; seeing it in a calm setting keeps you from falling into it under pressure.


Symptom Borja will see

Two evaluation reports:

  • Report A (control): harness reads data/eval/val.jsonl (60 held-out sentences). Reports CCR-regular = 82%, CCR-irregular = 64%, ConjAcc = 76%.
  • Report B (break): harness reads data/eval/train.jsonl (240 training sentences). Reports CCR-regular = 95%, CCR-irregular = 87%, ConjAcc = 88%.

Same model checkpoint. Same metrics code. Different conclusion.

If Borja sees Report B and pastes "ConjAcc 88%, ready to ship" into the journal, he just committed the most common ML reporting error.

The break, mechanically

In src/minieval/harness.py:

# Run A (control)
def run_eval(model, split: str = "val") -> Report:
    ...

# Run B (break) — one line in the config
eval_split: train   # was: val

Or directly in code: hardcode split = "train" and bury the change in a refactor commit. (This is exactly how the bug appears in real-world repos. A refactor renames a config key, the default falls through to "train", and nobody re-reads the diff.)

The whole break is one identifier swap.
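
The refactor scenario can be sketched in a few lines. This is a hypothetical illustration, not the real harness code: the function names, the renamed key "split", and the "train" default are all assumptions chosen to show how a silent fall-through happens and how a strict lookup would surface it.

```python
# Hypothetical sketch: key names and the "train" default are assumptions,
# not taken from src/minieval/harness.py.

def resolve_split_fragile(config: dict) -> str:
    # A refactor renamed "eval_split" to "split" in the config file, but
    # this lookup still uses the old key, so the default wins silently.
    return config.get("eval_split", "train")


def resolve_split_strict(config: dict) -> str:
    # No default: a missing or renamed key fails loudly instead of guessing.
    if "eval_split" not in config:
        raise KeyError("eval_split missing from config; refusing to guess a split")
    return config["eval_split"]


refactored_config = {"split": "val"}  # key renamed by the refactor
print(resolve_split_fragile(refactored_config))  # prints: train
```

With the strict variant, the same renamed config raises a KeyError at startup, so the mistake shows up as a crash rather than as a flattering report.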

Why this teaches the concept

At §A13 scale, with 103k parameters and 240 training sentences, the model has roughly 430 parameters per training example (103,000 / 240 ≈ 429). That is far from parameter-starved: the model has enough capacity to memorize entire sentences.

The model has seen every form in the training set, often multiple times. For the trained model, predicting "wrote" after "He" is a near-deterministic lookup. The model is not generalizing; it is recalling.

When you evaluate on train:

  • PPL on train is biased low by the memorization the model did during training. It does not reflect the model's behavior on novel inputs.
  • CCR on train is biased high by the same memorization. The model's high CCR-irregular score (87%) is "I memorized these 8 verbs' irregular forms" not "I learned the structure of irregular conjugation".
  • Bilingual alignment on train is biased high for the same reason.

Every metric is inflated. None of them tells you anything about generalization.

This is the entire purpose of the train/val split. The §A12 / §A14 portal infrastructure (Phase 41) will enforce the split at the data-loader level, but at Phase 20 the harness is still on the honor system.

Diagnostic ladder Borja should walk

  1. First check: read the manifest. What is eval_split?
  2. Second check: compare the per-slice tables. If the eval scores are nearly identical to the training scores, either the model is generalizing unusually well (rare at §A13 scale) or you're evaluating on the wrong split.
  3. Third check: the canonical sanity test — pick 5 sentences from the eval set and check whether they appear in the train set. If yes, the split is broken.
  4. Diagnosis: the eval_split is train, not val. Or, the train and val files were concatenated by a bad preprocessing step.
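
The third check in the ladder above is easy to script. The following is a minimal sketch, assuming each JSONL record stores its sentence under a "text" key (an assumption; adjust the field name to the real schema). The helper names are hypothetical.

```python
import json
import random
from pathlib import Path


def load_sentences(path: Path, field: str = "text") -> list[str]:
    """Read one sentence per JSONL record; `field` is an assumed key name."""
    sentences = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                sentences.append(json.loads(line)[field])
    return sentences


def sample_overlap(eval_sents: list[str], train_sents: list[str],
                   k: int = 5, seed: int = 0) -> list[str]:
    """Sample k eval sentences and return those found verbatim in train."""
    train_set = set(train_sents)
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    sample = rng.sample(eval_sents, min(k, len(eval_sents)))
    return [s for s in sample if s in train_set]
```

Any non-empty result from sample_overlap means the eval data overlaps with train and the split should be treated as broken. On a corpus this small, checking every eval sentence instead of a sample of five costs nothing and is the safer default.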

Reproducer

# Control
just phase-20-eval split=val

# Break (looks like "wow we trained well")
just phase-20-eval split=train

# Compare
just phase-20-eval-compare experiments/20-eval-val experiments/20-eval-train

The compare script prints both metric tables side by side and computes the bias (train − val) per slice. Each bias measures what memorization bought you on that slice.
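
The per-slice bias can be reproduced by hand from the two reports. This sketch uses the scores from Report A and Report B; the function name and the output format are assumptions and need not match what the real compare script prints.

```python
# Scores copied from Report A (val) and Report B (train), in percent.
val = {"CCR-regular": 82, "CCR-irregular": 64, "ConjAcc": 76}
train = {"CCR-regular": 95, "CCR-irregular": 87, "ConjAcc": 88}


def bias_per_slice(train_scores: dict, val_scores: dict) -> dict:
    """Return train minus val for every slice present in both reports."""
    return {k: train_scores[k] - val_scores[k]
            for k in val_scores if k in train_scores}


bias = bias_per_slice(train, val)
for name, delta in bias.items():
    print(f"{name:<14} val={val[name]}%  train={train[name]}%  bias=+{delta} pts")
```

For these reports the biases are +13 points on CCR-regular, +23 on CCR-irregular, and +12 on ConjAcc. The irregular slice shows the largest gap, which fits the memorization story: irregular forms are the ones that cannot be derived from a rule.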

Hint cascade

  1. (Mild) "The two reports use the same model. What else could differ?"
  2. (Medium) "What does manifest.json say about the eval data?"
  3. (Direct) "The eval is reading the training set. Where does the harness decide which file to open?"

Fix

Change eval_split: train back to eval_split: val. Or, structurally, refactor the harness so the eval API takes a path and train is not a valid value in production. Phase 41's portal will go further: held-out data is in a separate filesystem mount that the training loop physically cannot reach.
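
One way to make the structural fix concrete is a path resolver that rejects anything that looks like training data. This is a sketch under assumptions: the function names, the name-based heuristic, and the allow_train escape hatch (needed to run this break on purpose) are all hypothetical, not the harness's actual API.

```python
from pathlib import Path


def looks_like_train(path: Path) -> bool:
    """Heuristic: any directory or file stem starting with 'train'."""
    names = [part.lower() for part in path.parts[:-1]] + [path.stem.lower()]
    return any(name.startswith("train") for name in names)


def resolve_eval_path(path: str, allow_train: bool = False) -> Path:
    """Return the eval path, refusing training data unless explicitly allowed."""
    p = Path(path)
    if looks_like_train(p) and not allow_train:
        raise ValueError(f"refusing to evaluate on training data: {p}")
    return p
```

Calling resolve_eval_path("data/eval/val.jsonl") returns the path unchanged, while resolve_eval_path("data/eval/train.jsonl") raises a ValueError. A name-based check is only a guard rail; it does not replace the content-level overlap test or the mount-level isolation planned for Phase 41.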

A subtler version of the same bug

A more insidious variant: train and val are correctly split, but the probe construction uses prompts that appeared in train. For example, the probe "He went to the ___" is asked in eval, but the exact sentence "He went to the school" appears in train. The model's continuation is then recall, not generalization.

Defense: ensure prompts in data/eval/probes.jsonl are novel — their prefixes do not appear in data/train/*.jsonl. The §A13 corpus is small enough that this is straightforward to enforce. Phase 20's probe-schema.md (lab/00) requires this check.
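
The novelty requirement can be checked mechanically. The sketch below assumes probes mark the blank with "___" and treats a probe as leaked when its prefix starts a training sentence after lowercasing and whitespace normalization. Both the function names and the matching rule are assumptions; the actual check required by probe-schema.md may differ.

```python
def probe_prefix(probe: str, blank: str = "___") -> str:
    """Return the text before the blank, lowercased and whitespace-normalized."""
    return " ".join(probe.split(blank, 1)[0].lower().split())


def leaked_probes(probes: list[str], train_sentences: list[str]) -> list[str]:
    """Return probes whose prefix begins at least one training sentence."""
    normalized_train = [" ".join(s.lower().split()) for s in train_sentences]
    leaked = []
    for probe in probes:
        prefix = probe_prefix(probe)
        if prefix and any(t.startswith(prefix) for t in normalized_train):
            leaked.append(probe)
    return leaked


probes = ["He went to the ___", "They swam across the ___"]
train_sentences = ["He went to the school", "She wrote a letter"]
print(leaked_probes(probes, train_sentences))  # prints: ['He went to the ___']
```

A stricter variant would also flag prefixes that occur mid-sentence in train, at the cost of more false positives on short prefixes.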

What this break is NOT

  • Not a model bug.
  • Not a metric-math bug.
  • Not a leak through inadvertent data preprocessing.

It is the simplest possible evaluation error: reading the wrong file. The lesson is that the most common evaluation mistakes are not exotic — they're "the path is wrong" or "the slice is wrong". A loud, simple harness with explicit, separated paths is the defense.

Cross-refs

  • theory/04-perplexity-pitfalls-tiny-corpus.md — even on a clean val split, PPL alone is misleading.
  • theory/03-probe-construction.md — how probes are built to avoid leaks.
  • Phase 41 §A14 — portal-level data isolation.