03 — Probe construction: schema, coverage, and the verb-grammar grid¶
A well-made probe set is worth more than a test set ten times its size. Here are the rules: schema, balance by verb / tense / person / language, leak prevention, and the adversarial traps that separate real models from toy models.
What "probe" means here¶
A probe is a single labeled example with rich metadata. Schema (JSONL, one example per line):
    {
      "id": "probe-work-pres-3sg-001",
      "language": "en",
      "verb": "work",
      "regularity": "regular",
      "tense": "present_simple",
      "person": "3sg",
      "prompt": "Every day she ___ at the office.",
      "candidates": ["work", "works", "worked", "working"],
      "expected": "works",
      "label": "correct",
      "category": "core",
      "reason_code": "PRES-3SG-S",
      "explanation": "3rd-person singular present requires the -s ending on regular verbs.",
      "source": "hand"
    }
Required fields:
- `id`: unique, stable. The harness uses this to match across eval runs.
- `language`: `en` | `es`.
- `verb`: one of the 20 target verbs (12 regular + 8 irregular).
- `regularity`: `regular` | `irregular`. Derived from `verb` but stored explicitly for slicing.
- `tense`: one of `{infinitive, present_simple, past_simple, past_participle, future}`.
- `person`: one of `{1sg, 2sg, 3sg}`.
- `prompt`: the sentence shown to the model. Contains `___` where the verb form goes (for multiple-choice) OR an inline form for classification.
- `candidates`: list of candidate forms (multiple-choice setup). Includes the correct form and 2-3 distractors.
- `expected`: the correct form. Must be in `candidates`.
- `label`: `correct` | `incorrect` | `ambiguous`. (`ambiguous` is for sentences that could be either, e.g., `You work` reads identical in 2sg and plural; `expected` is still set, but the model is not penalized for picking either.)
- `category`: `core` | `adversarial`.
Optional fields (recommended):
- `reason_code`: short symbolic tag (`PRES-3SG-S`, `PAST-IRREG`, `EN-ES-MISMATCH`, etc.). Used in error analysis.
- `explanation`: one-sentence rationale, EN or ES. Not used in metrics; useful for the eval REPORT.md.
- `source`: `hand` | `generated`. Hand-curated is gold; generated needs hygiene checks.
Coverage matrix¶
The verb-grammar grid is 20 verbs × 5 tenses × 3 persons = 300 cells per language × 2 languages = 600 cells. We do not probe exhaustively — we sample. The sampling rules:
| Slice | Minimum core probes |
|---|---|
| Per language (EN, ES) | ≥ 30 each |
| Per regularity (regular, irregular) | ≥ 30 each |
| Per tense (5 tenses) | ≥ 10 each |
| Per person (3 persons) | ≥ 15 each |
| Adversarial total | ≥ 20 |
Target: ~60 core + ~20 adversarial = ~80 probes. Phase 20's definition of done (DoD) sets the minimum at 60 core + 20 adversarial.
Hand-construction tip: build a small table first (verb × tense × person) and check off cells as you write probes. Make sure every regular verb has at least 1 entry and every irregular verb has at least 2 (irregulars are more diagnostic).
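As a rough aid while filling in that table, the slice minimums can be checked mechanically. The sketch below is only an illustration: the function name `coverage_shortfalls` and the constant names are made up for this example, and the thresholds are copied from the table above.

```python
from collections import Counter

# Minimum core-probe counts per slice value, copied from the coverage table.
MINIMUMS = {"language": 30, "regularity": 30, "tense": 10, "person": 15}

# Allowed values per slice field, taken from the schema.
SLICE_VALUES = {
    "language": ["en", "es"],
    "regularity": ["regular", "irregular"],
    "tense": ["infinitive", "present_simple", "past_simple",
              "past_participle", "future"],
    "person": ["1sg", "2sg", "3sg"],
}
ADVERSARIAL_MINIMUM = 20


def coverage_shortfalls(probes):
    """Return (field, value, have, need) for every slice below its minimum."""
    core = [p for p in probes if p["category"] == "core"]
    shortfalls = []
    for field, values in SLICE_VALUES.items():
        counts = Counter(p[field] for p in core)
        for value in values:
            if counts[value] < MINIMUMS[field]:
                shortfalls.append((field, value, counts[value], MINIMUMS[field]))
    n_adv = sum(1 for p in probes if p["category"] == "adversarial")
    if n_adv < ADVERSARIAL_MINIMUM:
        shortfalls.append(("category", "adversarial", n_adv, ADVERSARIAL_MINIMUM))
    return shortfalls
```

An empty return value means every minimum in the table is met. The per-verb rule from the tip (1 entry per regular verb, 2 per irregular) is not covered here, because it needs the verb list.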
Eval-hygiene rules¶
1. Zero overlap with train + val. Hash every probe's `prompt + expected` pair (after normalization: strip whitespace, lowercase). Compare to every train+val sentence hash. Any collision → reject the probe. Hash function: SHA256.
2. No probe is a copy-modification of a train example. Stricter than (1). If two sentences differ only in incidental noun choice (`at the office` → `at the school`), they're "the same probe" for leakage purposes. Use a normalized form (verb-form-and-tense-marker preserved; other content nouns replaced with `<NOUN>`) and hash that.
3. Probe authors don't see train data while authoring. Operationally: Claude writes initial probes from the §A13 spec and the verb list, not from `data/processed/train.jsonl`. (This is strict; in practice, Claude has seen the corpus generator and so probes reflect its conventions. Acknowledge the bias in the report.)
4. The probe set is versioned. Probes are committed to git as JSONL. The SHA256 of the probe file (sorted by id) is recorded in every REPORT.md.
5. No probe is modified after first use. If a probe is found to be ambiguous, mark it `deprecated: true` and add a replacement with a new id. Don't silently change a probe, because historical eval reports would no longer be comparable.
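The two hashing rules above (exact overlap and copy-modification) could be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are invented, train/val records are assumed to be keyed with the same functions, and the list of content nouns to mask is assumed to be supplied by the caller (the text does not say how nouns are identified).

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace (one reading of 'strip whitespace, lowercase')."""
    return " ".join(text.lower().split())


def exact_key(prompt, expected):
    """SHA256 of the normalized prompt + expected pair (exact-overlap rule)."""
    payload = normalize(prompt) + "\t" + normalize(expected)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def masked_key(prompt, expected, content_nouns):
    """SHA256 after replacing the given content nouns with <NOUN>.

    content_nouns is an assumption of this sketch: a caller-provided
    list of incidental nouns such as "office" or "school".
    """
    text = normalize(prompt)
    for noun in content_nouns:
        text = re.sub(r"\b" + re.escape(noun.lower()) + r"\b", "<NOUN>", text)
    payload = text + "\t" + normalize(expected)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def leaked(probe, train_keys, content_nouns):
    """True if either key of the probe collides with a train/val key."""
    keys = {
        exact_key(probe["prompt"], probe["expected"]),
        masked_key(probe["prompt"], probe["expected"], content_nouns),
    }
    return bool(keys & train_keys)
```

Here `train_keys` would be the set of exact and masked keys computed over every train and val example. A probe for which `leaked(...)` is true gets rejected.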
The adversarial category¶
Adversarial probes test the boundary cases where naive rules give wrong answers. Categories (target 3-5 probes per category):
1. Over-regularization (irregular verbs). `goed` instead of `went`. `eated` instead of `ate`. `writed` instead of `wrote`. `comed` instead of `came`. Trap for a model that learned `verb + ed = past` but didn't memorize irregular forms.
2. Wrong-person agreement (3sg `-s` missing). `She work hard.` `He go to school.` `It have a name.` Trap for a model that didn't internalize the 3sg `-s` rule.
3. Wrong-tense for time marker. `Yesterday I work.` (should be past-simple `worked`.) `Tomorrow she went.` (should be future `will go` / `is going to go`.) `Now he ate.` (should be present.) Trap for a model that ignores temporal context.
4. Auxiliary mismatch (perfect aspect). `She have eat.` (should be `She has eaten`: both `have` → `has` for 3sg AND `eat` → `eaten` for past-participle.) Tests two rules at once.
5. EN↔ES form mismatch. English prompt (`Yesterday I ___ to the store.`), candidate set contains both `went` (correct) and `fui` (the correct ES form but wrong language). Trap for a bilingual model that conflates languages.
6. Plural / out-of-scope. `We work`: per §A13, plurals are out of scope. A probe asking the model to classify this should yield `ambiguous`. A model that confidently labels it `incorrect` (because "we" wasn't in train) reveals brittleness.
Each adversarial probe has a `reason_code` pointing to its category. The eval REPORT breaks down adversarial accuracy by category, so you can see "model handles category 1 (over-regularization) well but fails category 4 (auxiliary mismatch)."
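The per-category breakdown could be computed along these lines. This is a sketch, not the harness's actual code: the function name, the `predictions` mapping (probe id to the form the model chose), and the choice to leave `ambiguous` probes out of the tally so they are never penalized are all assumptions of this example.

```python
from collections import defaultdict


def accuracy_by_reason_code(probes, predictions):
    """Map reason_code -> (n_correct, n_total) over adversarial probes.

    predictions: dict of probe id -> chosen form (assumed format).
    """
    tally = defaultdict(lambda: [0, 0])
    for probe in probes:
        if probe.get("category") != "adversarial":
            continue
        if probe.get("label") == "ambiguous":
            # Assumption: ambiguous probes are excluded rather than scored.
            continue
        # reason_code is optional in the schema, so fall back to a placeholder.
        code = probe.get("reason_code", "UNTAGGED")
        hit = predictions.get(probe["id"]) == probe["expected"]
        tally[code][0] += int(hit)
        tally[code][1] += 1
    return {code: (right, total) for code, (right, total) in tally.items()}
```

A missing prediction counts as a miss, which keeps a crashed or truncated run from looking better than it is.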
Seeding the probe set¶
scripts/build_probes.py (Claude provides at phase open) generates the probe JSONL deterministically:
    python scripts/build_probes.py --seed 42 --out data/eval/probes.jsonl
    python scripts/build_probes.py --seed 42 --adversarial --out data/eval/adversarial.jsonl
Re-running with the same seed produces byte-identical files. The seed is committed to the script; changing it is a versioned event.
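How `build_probes.py` achieves this is not shown here. One common way to get reproducible output is a private seeded RNG plus a canonical serialization, sketched below. The function names and the `templates` input are hypothetical, and byte-identity is only expected when the same Python version is used.

```python
import json
import random


def build_probe_lines(templates, seed, k):
    """Sample k probe records with a private RNG and serialize them canonically."""
    rng = random.Random(seed)  # private RNG, independent of global random state
    # Sort first so the sample does not depend on the input order.
    pool = sorted(templates, key=lambda t: t["id"])
    chosen = rng.sample(pool, k)
    chosen.sort(key=lambda t: t["id"])
    # sort_keys fixes the field order inside each JSON line.
    return [json.dumps(t, sort_keys=True, ensure_ascii=False) for t in chosen]


def render(lines):
    """Join lines with LF and encode as UTF-8 so the bytes are platform-independent."""
    return ("\n".join(lines) + "\n").encode("utf-8")
```

Writing the result with `open(path, "wb")` avoids newline translation on platforms that would otherwise change the bytes.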
Common probe-writing mistakes¶
1. Hand-written examples that all look the same. Variety matters. Mix subject nouns (proper names, pronouns, common nouns where allowed), temporal markers (`yesterday`, `every day`, `tomorrow`, `last week`, `right now`), sentence shapes.
2. Examples where the whole context matters. A probe should be evaluable from the prompt alone. If grammaticality depends on something outside the sentence, the probe is ambiguous (label it so).
3. Ambiguous labels confused with correct. Some sentences read fine in both 2sg and plural (`You work hard.`). Mark these `ambiguous`, don't pretend they're unambiguously correct.
4. Over-representing one verb. Aggregate accuracy gets dominated by the over-represented verb. Stick to the coverage table.
5. Adversarial probes that the model couldn't possibly know. A probe testing knowledge of a verb not in the 20-verb set is unfair. Stay within scope.
6. Forgetting Spanish coverage. Half the curriculum is bilingual per §A13. An EN-only probe set hides Spanish-side failures.
Probe-set version control¶
Probes live at:
- `data/eval/probes.jsonl`: core probes.
- `data/eval/adversarial.jsonl`: adversarial probes.
- `data/eval/probe_prompt.txt`: the fixed prompt template used to query the model.
All three are in git. Changes are versioned. The eval REPORT.md records the SHA256 of all three files.
When a probe is added or removed, bump the probe-set version (semver in the file name, e.g. `probes_v_1_2_0.jsonl`, if you want, or just rely on the git SHA). Phase 20 starts with v0.1.0; growth is expected.
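The digests recorded in REPORT.md could be produced roughly like this. The sketch assumes that "sorted by id" means the JSONL records are re-ordered by `id` and re-serialized before hashing; the function names are illustrative and the real harness may canonicalize differently.

```python
import hashlib
import json


def probe_file_digest(jsonl_text):
    """SHA256 over probe records sorted by id, so line order does not matter."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    records.sort(key=lambda r: r["id"])
    canonical = "".join(
        json.dumps(r, sort_keys=True, ensure_ascii=False) + "\n" for r in records
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def text_digest(text):
    """Plain SHA256 for non-JSONL files such as the prompt template."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Sorting before hashing means that reordering lines in the file does not look like a new probe-set version, while any change to a probe's content does.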
Probe-set checks (the validate_probes.py script)¶
Before any eval run, the harness validates the probe set:
- Every line is valid JSON with required fields.
- Every `verb` is in the 20-verb set.
- Every `tense` is in the 5-tense set.
- Every `person` is in the 3-person set.
- Every `language` is `en` or `es`.
- Every `label` is in `{correct, incorrect, ambiguous}`.
- `expected` ∈ `candidates`.
- `regularity` matches the verb's class (regular ↔ in the 12-regular list; irregular ↔ in the 8-irregular list).
- No two probes have the same `id`.
- No probe's normalized `prompt + expected` hash appears in train or val.
- Coverage matrix is satisfied (minimum counts per slice).
If any check fails, the harness refuses to run and explains which probe caused the failure. This is the security-mindset rule: don't produce numbers from a broken eval set.
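A minimal sketch of the per-probe checks, assuming probes are already parsed into dicts. This is not the actual `validate_probes.py`: the function signature is invented, the two verb sets are passed in because the 20-verb list is not reproduced in this section, and the leak and coverage checks are left out.

```python
TENSES = {"infinitive", "present_simple", "past_simple", "past_participle", "future"}
PERSONS = {"1sg", "2sg", "3sg"}
LANGUAGES = {"en", "es"}
LABELS = {"correct", "incorrect", "ambiguous"}
REQUIRED = ("id", "language", "verb", "regularity", "tense", "person",
            "prompt", "candidates", "expected", "label", "category")


def validate_probes(probes, regular_verbs, irregular_verbs):
    """Return 'probe id: problem' strings; an empty list means the set passed."""
    errors = []
    seen_ids = set()
    for index, probe in enumerate(probes):
        pid = probe.get("id", f"<line {index + 1}>")
        missing = [name for name in REQUIRED if name not in probe]
        if missing:
            errors.append(f"{pid}: missing fields {missing}")
            continue  # remaining checks need the full record
        if pid in seen_ids:
            errors.append(f"{pid}: duplicate id")
        seen_ids.add(pid)
        for field, allowed in (("tense", TENSES), ("person", PERSONS),
                               ("language", LANGUAGES), ("label", LABELS)):
            if probe[field] not in allowed:
                errors.append(f"{pid}: bad {field} {probe[field]!r}")
        if probe["expected"] not in probe["candidates"]:
            errors.append(f"{pid}: expected not in candidates")
        classes = {"regular": regular_verbs, "irregular": irregular_verbs}
        verb_class = classes.get(probe["regularity"])
        if verb_class is None or probe["verb"] not in verb_class:
            errors.append(f"{pid}: verb/regularity mismatch")
    return errors
```

A harness following the rule above would abort when the returned list is non-empty and print it, so the failing probe id is visible before any metric is computed.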
One-paragraph recap¶
A probe is a hand-curated, schema-validated, leak-free labeled example tagged with verb, tense, person, language, and regularity for slicing. The grid is 20 verbs × 5 tenses × 3 persons × 2 languages; we sample ~60 core + 20 adversarial probes covering the slices. Adversarial probes test boundary cases (over-regularization, wrong-person, wrong-tense, auxiliary mismatch, EN↔ES mismatch, out-of-scope). Probes are versioned in git, validated before every eval run, and never silently modified. The probe set is the measurement instrument; its quality bounds the eval's value.
Next: lab/00-probe-schema.md.