03 — Probe construction: schema, coverage, and the verb-grammar grid

A well-built probe set is worth more than a test set ten times its size. Here are the rules: schema; balance by verb, tense, person, and language; leak prevention; and the adversarial traps that separate real models from toy models.


What "probe" means here

A probe is a single labeled example with rich metadata. Schema (JSONL, one example per line):

{
  "id": "probe-work-pres-3sg-001",
  "language": "en",
  "verb": "work",
  "regularity": "regular",
  "tense": "present_simple",
  "person": "3sg",
  "prompt": "Every day she ___ at the office.",
  "candidates": ["work", "works", "worked", "working"],
  "expected": "works",
  "label": "correct",
  "category": "core",
  "reason_code": "PRES-3SG-S",
  "explanation": "3rd-person singular present requires the -s ending on regular verbs.",
  "source": "hand"
}

Required fields:

  • id — unique, stable. The harness uses this to match across eval runs.
  • language — en | es.
  • verb — one of the 20 target verbs (12 regular + 8 irregular).
  • regularity — regular | irregular. Derived from verb but stored explicitly for slicing.
  • tense — one of {infinitive, present_simple, past_simple, past_participle, future}.
  • person — one of {1sg, 2sg, 3sg}.
  • prompt — the sentence shown to the model. Contains ___ where the verb form goes (for multiple-choice) OR an inline form for classification.
  • candidates — list of candidate forms (multiple-choice setup). Includes the correct form and 2-3 distractors.
  • expected — the correct form. Must be in candidates.
  • label — correct | incorrect | ambiguous. (ambiguous is for sentences that admit more than one reading, e.g., You work reads identically in 2sg and plural; expected is still set, but the model is not penalized for picking either.)
  • category — core | adversarial.

Optional fields (recommended):

  • reason_code — short symbolic tag (PRES-3SG-S, PAST-IRREG, EN-ES-MISMATCH, etc.). Used in error analysis.
  • explanation — one-sentence rationale, EN or ES. Not used in metrics; useful for the eval REPORT.md.
  • source — hand | generated. Hand-curated is gold; generated needs hygiene checks.
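As a rough illustration of how a harness might read this schema, the sketch below parses JSONL lines and checks that the required fields listed above are present. The function names `parse_probe_line` and `load_probes` are hypothetical and not part of the course scripts.

```python
import json
from typing import Any

# Required keys, copied from the "Required fields" list above.
REQUIRED_FIELDS = (
    "id", "language", "verb", "regularity", "tense", "person",
    "prompt", "candidates", "expected", "label", "category",
)


def parse_probe_line(line: str) -> dict[str, Any]:
    """Parse one JSONL line and check that every required field is present."""
    probe = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in probe]
    if missing:
        raise ValueError(f"probe {probe.get('id', '<no id>')} is missing: {missing}")
    return probe


def load_probes(lines) -> list[dict[str, Any]]:
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [parse_probe_line(line) for line in lines if line.strip()]


# The example probe from the schema above, serialized as a single JSONL line.
EXAMPLE_LINE = json.dumps({
    "id": "probe-work-pres-3sg-001",
    "language": "en",
    "verb": "work",
    "regularity": "regular",
    "tense": "present_simple",
    "person": "3sg",
    "prompt": "Every day she ___ at the office.",
    "candidates": ["work", "works", "worked", "working"],
    "expected": "works",
    "label": "correct",
    "category": "core",
})
```

This only checks presence; value-level checks (allowed tenses, expected in candidates, and so on) belong to the validation step described later.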

Coverage matrix

The verb-grammar grid is 20 verbs × 5 tenses × 3 persons = 300 cells per language, or 600 cells across the 2 languages. We do not probe exhaustively; we sample. The sampling rules:

| Slice | Minimum core probes |
| --- | --- |
| Per language (EN, ES) | ≥ 30 each |
| Per regularity (regular, irregular) | ≥ 30 each |
| Per tense (5 tenses) | ≥ 10 each |
| Per person (3 persons) | ≥ 15 each |
| Adversarial total | ≥ 20 |

Target: ~60 core + ~20 adversarial = ~80 probes. Phase 20's definition of done (DoD) treats these as minimums: at least 60 core and 20 adversarial.

Hand-construction tip: build a small table first (verb × tense × person), check off cells as you write probes, and make sure every regular verb has at least 1 entry and every irregular verb has at least 2 (irregulars are more diagnostic).
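The minimums in the coverage table can be checked mechanically. The following is a minimal sketch, assuming probes are already loaded as dicts with the schema fields; `coverage_gaps` and the constant names are hypothetical.

```python
from collections import Counter

# Minimum core-probe count for each value of a slice, from the coverage table.
MINIMUMS = {"language": 30, "regularity": 30, "tense": 10, "person": 15}

# Allowed values per slice, from the schema section.
EXPECTED_VALUES = {
    "language": ["en", "es"],
    "regularity": ["regular", "irregular"],
    "tense": ["infinitive", "present_simple", "past_simple",
              "past_participle", "future"],
    "person": ["1sg", "2sg", "3sg"],
}

MIN_ADVERSARIAL = 20


def coverage_gaps(probes):
    """Return human-readable shortfalls; an empty list means coverage is met."""
    core = [p for p in probes if p["category"] == "core"]
    gaps = []
    for field, minimum in MINIMUMS.items():
        counts = Counter(p[field] for p in core)
        for value in EXPECTED_VALUES[field]:
            if counts[value] < minimum:
                gaps.append(f"{field}={value}: {counts[value]} < {minimum}")
    n_adversarial = sum(1 for p in probes if p["category"] == "adversarial")
    if n_adversarial < MIN_ADVERSARIAL:
        gaps.append(f"adversarial total: {n_adversarial} < {MIN_ADVERSARIAL}")
    return gaps
```

Returning a list of shortfalls rather than a boolean makes it easier to see which cells of the hand-built table still need probes.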

Eval-hygiene rules

  1. Zero overlap with train + val. Hash every probe's prompt + expected pair (after normalization: strip whitespace, lowercase). Compare to every train+val sentence hash. Any collision → reject the probe. Hash function: SHA256.

  2. No probe is a copy-modification of a train example. Stricter than (1). If two sentences differ only in incidental noun choice (at the office → at the school), they're "the same probe" for leakage purposes. Use a normalized form (verb-form-and-tense-marker preserved; other content nouns replaced with <NOUN>) and hash that.

  3. Probe authors don't see train data while authoring. Operationally: Claude writes initial probes from the §A13 spec and the verb list, not from data/processed/train.jsonl. (This is strict; in practice, Claude has seen the corpus generator and so probes reflect its conventions. Acknowledge the bias in the report.)

  4. The probe set is versioned. Probes are committed to git as JSONL. The SHA256 of the probe file (sorted by id) is recorded in every REPORT.md.

  5. No probe is modified after first use. If a probe is found to be ambiguous, mark it deprecated: true and add a replacement with a new id. Don't silently change a probe — historical eval reports would no longer be comparable.
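Rules 1 and 2 can be sketched as two hash functions. This is an illustration under stated assumptions, not the course's implementation: "strip whitespace" is read here as trimming and collapsing runs of whitespace, the train sentence is assumed to be the prompt with ___ filled by the expected form, and the content-noun list is a hypothetical stand-in for whatever noun detection the real pipeline uses.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace runs (assumed normalization)."""
    return " ".join(text.lower().split())


def sentence_hash(sentence: str) -> str:
    """Rule 1: SHA256 of the normalized sentence."""
    return hashlib.sha256(normalize(sentence).encode("utf-8")).hexdigest()


def fill_blank(prompt: str, expected: str) -> str:
    """Assumption: a train sentence equals the prompt with ___ filled in."""
    return prompt.replace("___", expected)


def masked_hash(sentence: str, content_nouns) -> str:
    """Rule 2: replace listed content nouns with <NOUN> before hashing."""
    text = normalize(sentence)
    for noun in content_nouns:
        text = re.sub(rf"\b{re.escape(noun.lower())}\b", "<NOUN>", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def leaked(probe: dict, train_hashes: set) -> bool:
    """True if the filled-in probe collides with a train or val sentence hash."""
    filled = fill_blank(probe["prompt"], probe["expected"])
    return sentence_hash(filled) in train_hashes
```

With these definitions, "She works at the office." and "She works at the school." have different rule-1 hashes but the same rule-2 hash when both nouns are in the content-noun list, which is the stricter behavior rule 2 asks for.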

The adversarial category

Adversarial probes test the boundary cases where naive rules give wrong answers. Categories (target 3-5 probes per category):

  1. Over-regularization (irregular verbs). goed instead of went. eated instead of ate. writed instead of wrote. comed instead of came. Trap for a model that learned verb + ed = past but didn't memorize irregular forms.

  2. Wrong-person agreement (3sg -s missing). She work hard. He go to school. It have a name. Trap for a model that didn't internalize the 3sg -s rule.

  3. Wrong-tense for time marker. Yesterday I work. (should be past-simple worked.) Tomorrow she went. (should be future will go / is going to go.) Now he ate. (should be present.) Trap for a model that ignores temporal context.

  4. Auxiliary mismatch (perfect aspect). She have eat. (should be She has eaten — both have→has for 3sg AND eat→eaten for past-participle.) Tests two rules at once.

  5. EN↔ES form mismatch. English prompt (Yesterday I ___ to the store.), candidate set contains both went (correct) and fui (the correct ES form but wrong language). Trap for a bilingual model that conflates languages.

  6. Plural / out-of-scope. We work — per §A13, plurals are out of scope. A probe asking the model to classify this should yield ambiguous. A model that confidently labels it incorrect (because "we" wasn't in train) reveals brittleness.

Each adversarial probe has reason_code pointing to its category. The eval REPORT breaks down adversarial accuracy by category, so you can see "model handles category 1 (over-regularization) well but fails category 4 (auxiliary mismatch)."
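A per-category breakdown like the one described could be computed roughly as follows. The shape of `predictions` (probe id mapped to the form the model chose) is an assumption, and `accuracy_by_reason_code` is a hypothetical helper name.

```python
from collections import defaultdict


def accuracy_by_reason_code(probes, predictions):
    """Group adversarial probes by reason_code and compute accuracy per group.

    predictions: dict mapping probe id -> form chosen by the model
    (an assumed shape; the real harness may store results differently).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for probe in probes:
        if probe.get("category") != "adversarial":
            continue  # core probes are reported separately
        code = probe.get("reason_code", "UNTAGGED")
        totals[code] += 1
        if predictions.get(probe["id"]) == probe["expected"]:
            hits[code] += 1
    return {code: hits[code] / totals[code] for code in totals}
```

Probes without a reason_code fall into an UNTAGGED bucket here, which makes missing tags visible in the report instead of silently dropping them.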

Seeding the probe set

scripts/build_probes.py (provided by Claude at phase open) generates the probe JSONL deterministically:

python scripts/build_probes.py --seed 42 --out data/eval/probes.jsonl
python scripts/build_probes.py --seed 42 --adversarial --out data/eval/adversarial.jsonl

Re-running with the same seed produces byte-identical files. The seed is committed to the script; changing it is a versioned event.
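The internals of build_probes.py are not shown here, so the following is only a sketch of one way to get byte-identical output from a seed: use a private random.Random instance, sort the result by id, and serialize with a fixed key order. The function `render_probes` and its parameters are hypothetical.

```python
import json
import random


def render_probes(pool, k, seed):
    """Sample k probes from pool and serialize them as JSONL text.

    A private RNG keeps the result independent of the global random state;
    sorting by id and using sort_keys fixes the byte layout of the output.
    """
    rng = random.Random(seed)
    chosen = rng.sample(pool, k)
    chosen.sort(key=lambda probe: probe["id"])
    lines = [json.dumps(p, sort_keys=True, ensure_ascii=False) for p in chosen]
    return "\n".join(lines) + "\n"
```

One caveat: Python documents that most of the random module's algorithms may change between versions, so pinning the interpreter version alongside the seed is a prudent addition to "same seed, same bytes".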

Common probe-writing mistakes

  1. Hand-written examples that all look the same. Variety matters. Mix subject nouns (proper names, pronouns, common nouns where allowed), temporal markers (yesterday, every day, tomorrow, last week, right now), sentence shapes.

  2. Examples that depend on outside context. A probe should be evaluable from the prompt alone. If grammaticality depends on something outside the sentence, the probe is ambiguous (label it so).

  3. Ambiguous cases labeled as correct. Some sentences read fine in both 2sg and plural (You work hard.). Mark these ambiguous; don't pretend they're unambiguously correct.

  4. Over-representing one verb. Aggregate accuracy gets dominated by the over-represented verb. Stick to the coverage table.

  5. Adversarial probes that the model couldn't possibly know. A probe testing knowledge of a verb not in the 20-verb set is unfair. Stay within scope.

  6. Forgetting Spanish coverage. Half the curriculum is bilingual per §A13. An EN-only probe set hides Spanish-side failures.

Probe-set version control

Probes live at:

  • data/eval/probes.jsonl — core probes.
  • data/eval/adversarial.jsonl — adversarial probes.
  • data/eval/probe_prompt.txt — the fixed prompt template used to query the model.

All three are in git. Changes are versioned. The eval REPORT.md records the SHA256 of all three files.

When a probe is added/removed, bump the probe-set version (semver: probes_v_1_2_0.jsonl if you want, or just rely on git SHA). Phase 20 starts with v0.1.0; growth is expected.
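The digests recorded in REPORT.md could be computed along these lines. This is a sketch: `probe_set_digest` hashes the probe lines after sorting them by id (as the hygiene rules specify for the probe file), while `file_digest` hashes raw bytes, which is one reasonable choice for the prompt template. Both names are hypothetical.

```python
import hashlib
import json


def probe_set_digest(jsonl_text: str) -> str:
    """SHA256 over the probe lines after sorting them by id."""
    lines = [line for line in jsonl_text.splitlines() if line.strip()]
    lines.sort(key=lambda line: json.loads(line)["id"])
    payload = "\n".join(lines) + "\n"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def file_digest(path) -> str:
    """SHA256 of a file's raw bytes, e.g. for data/eval/probe_prompt.txt."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()
```

Sorting before hashing means that reordering lines in the JSONL file does not change the recorded digest, while any change to a probe's content does.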

Probe-set checks (the validate_probes.py script)

Before any eval run, the harness validates the probe set:

  • Every line is valid JSON with required fields.
  • Every verb is in the 20-verb set.
  • Every tense is in the 5-tense set.
  • Every person is in the 3-person set.
  • Every language is en or es.
  • Every label is in {correct, incorrect, ambiguous}.
  • expected ∈ candidates.
  • regularity matches the verb's class (regular ↔ in the 12-regular list; irregular ↔ in the 8-irregular list).
  • No two probes have the same id.
  • No probe's normalized prompt + expected hash appears in train or val.
  • Coverage matrix is satisfied (minimum counts per slice).

If any check fails, the harness refuses to run and explains which probe caused the failure. This is the security-mindset rule: don't produce numbers from a broken eval set.
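A few of the value-level checks above, together with the refuse-to-run behavior, might look like the sketch below. It is not the actual validate_probes.py: the verb table is a two-entry hypothetical excerpt (the full 20-verb list is not reproduced here), and the leak and coverage checks are omitted for brevity.

```python
TENSES = {"infinitive", "present_simple", "past_simple", "past_participle", "future"}
PERSONS = {"1sg", "2sg", "3sg"}
LANGUAGES = {"en", "es"}
LABELS = {"correct", "incorrect", "ambiguous"}

# Hypothetical excerpt; the real table would cover all 20 target verbs.
VERB_CLASSES_EXCERPT = {"work": "regular", "go": "irregular"}


def validate_probes(probes, verb_classes):
    """Return a list of error strings, each naming the offending probe id."""
    errors = []
    seen_ids = set()
    allowed = {"tense": TENSES, "person": PERSONS,
               "language": LANGUAGES, "label": LABELS}
    for probe in probes:
        pid = probe.get("id", "<no id>")
        if pid in seen_ids:
            errors.append(f"{pid}: duplicate id")
        seen_ids.add(pid)
        verb = probe.get("verb")
        if verb not in verb_classes:
            errors.append(f"{pid}: verb {verb!r} not in the target verb set")
        elif probe.get("regularity") != verb_classes[verb]:
            errors.append(f"{pid}: regularity does not match verb class")
        for field, values in allowed.items():
            if probe.get(field) not in values:
                errors.append(f"{pid}: invalid {field} {probe.get(field)!r}")
        if probe.get("expected") not in probe.get("candidates", []):
            errors.append(f"{pid}: expected is not among candidates")
    return errors


def require_valid(probes, verb_classes):
    """Stop with an explanation instead of producing numbers from a broken set."""
    errors = validate_probes(probes, verb_classes)
    if errors:
        raise SystemExit("refusing to run eval:\n" + "\n".join(errors))
```

Collecting every error before exiting, rather than stopping at the first, lets a probe author fix a batch of problems in one pass.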

One-paragraph recap

A probe is a hand-curated, schema-validated, leak-free labeled example tagged with verb, tense, person, language, and regularity for slicing. The grid is 20 verbs × 5 tenses × 3 persons × 2 languages; we sample ~60 core + 20 adversarial probes covering the slices. Adversarial probes test boundary cases (over-regularization, wrong-person, wrong-tense, auxiliary mismatch, EN↔ES mismatch, out-of-scope). Probes are versioned in git, validated before every eval run, and never silently modified. The probe set is the measurement instrument; its quality bounds the eval's value.

Next: lab/00-probe-schema.md.