Lab 00: Define the probe-set schema; load 60 labeled examples
Goal: lock the probe schema, build the loader, and validate a starter set of 60 hand-curated probes spanning the §A13 verb-grammar grid.
Estimated time: 90-120 minutes.
Prereq: Phase 12 corpus is committed; data/corpus_spec.md describes the labeling conventions.
What you produce
New modules:
- src/eval/probes.py: schema validator + loader.
- tests/eval/test_probes.py: schema + leak-detection tests.
New or updated data files:
- data/eval/probes.jsonl: 60 core probes (Claude provides them at phase open as a seed; Borja extends as desired).
- data/eval/probe_prompt.txt: the fixed prompt template.
- data/corpus_spec.md: extended with a "Probe schema" section.
TODOs
Block A: extend data/corpus_spec.md
Add a "Probe schema" section that documents:
- Required fields and types.
- Allowed values for language, verb, regularity, tense, person, label, category.
- Validation rules (uniqueness of id, leak prevention, expected ∈ candidates, regularity↔verb consistency).
- The coverage matrix from theory/03.
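To make the field list concrete, here is a sketch of what a single JSONL line might look like. The field names follow the Probe dataclass in Block B; the id format, the ___ blank marker, and the sentence itself are invented for illustration and are not taken from the seed file.

```python
import json

# Hypothetical probe record: values are illustrative, not from the seed set.
example_probe = {
    "id": "probe-en-go-past-3sg-001",  # assumed id format
    "language": "en",
    "verb": "go",
    "regularity": "irregular",
    "tense": "past_simple",
    "person": "3sg",
    "prompt": "Yesterday she ___ to the market.",  # assumed blank marker
    "candidates": ["go", "goes", "went", "gone"],
    "expected": "went",
    "label": "correct",
    "category": "core",
    "reason_code": None,
    "explanation": None,
    "source": "hand",
}

# One probe per line: compact JSON with no embedded newlines.
line = json.dumps(example_probe, ensure_ascii=False)
print(line)
```

Writing one such example into the "Probe schema" section gives reviewers a reference point for types (strings everywhere, a list for candidates, null for unset optional fields).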
Block B: implement src/eval/probes.py
from dataclasses import dataclass
from pathlib import Path
import hashlib
import json

# The 20 target verbs, split by conjugation class.
REGULAR_VERBS = {"work", "play", "walk", "talk", "listen",
                 "watch", "study", "finish", "start", "look",
                 "want", "like"}
IRREGULAR_VERBS = {"be", "have", "do", "go", "come",
                   "see", "eat", "write"}
TARGET_VERBS = REGULAR_VERBS | IRREGULAR_VERBS  # 20

# Allowed values for the enumerated fields.
TENSES = {"infinitive", "present_simple", "past_simple",
          "past_participle", "future"}
PERSONS = {"1sg", "2sg", "3sg"}
LANGUAGES = {"en", "es"}
LABELS = {"correct", "incorrect", "ambiguous"}
CATEGORIES = {"core", "adversarial"}


@dataclass(frozen=True)
class Probe:
    id: str
    language: str
    verb: str
    regularity: str  # "regular" | "irregular"
    tense: str
    person: str
    prompt: str
    candidates: tuple[str, ...]
    expected: str
    label: str
    category: str
    reason_code: str | None = None
    explanation: str | None = None
    source: str = "hand"


def load_probes(path: Path) -> list[Probe]: ...


def validate_probes(probes: list[Probe],
                    train_hashes: set[str],
                    val_hashes: set[str]) -> list[str]:
    """Returns list of error strings; [] = clean."""
    ...


def normalize_prompt(prompt: str, expected: str) -> str:
    """Whitespace+lowercase normalization; replaces non-verb content
    nouns with <NOUN> for stricter leak detection."""
    ...


def hash_prompt(prompt: str, expected: str) -> str:
    """SHA256 hex of the normalized form."""
    ...
- load_probes parses JSONL into Probe dataclasses (tuple for candidates).
- validate_probes runs every check from theory/03 §"Probe-set checks":
  - schema correctness (all required fields present, correct types)
  - id uniqueness
  - verb ∈ TARGET_VERBS, tense ∈ TENSES, person ∈ PERSONS, language ∈ LANGUAGES
  - label ∈ LABELS, category ∈ CATEGORIES
  - expected ∈ candidates
  - regularity matches verb's class
  - leakage against train_hashes and val_hashes
  - coverage-matrix minima
- normalize_prompt strips comments, normalizes whitespace, lowercases, replaces incidental content nouns with <NOUN> (rough heuristic; list the limitations in the docstring).
- hash_prompt returns SHA256 hex of the normalized (prompt, expected) pair.
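As a starting point for the loader, the sketch below shows one way load_probes could turn JSONL lines into frozen dataclasses. It is not the reference solution: skipping blank lines and wrapping parse failures in ValueError are assumptions, and the sample record at the end is invented.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Probe:
    id: str
    language: str
    verb: str
    regularity: str
    tense: str
    person: str
    prompt: str
    candidates: tuple[str, ...]
    expected: str
    label: str
    category: str
    reason_code: str | None = None
    explanation: str | None = None
    source: str = "hand"


def load_probes(path: Path) -> list[Probe]:
    """Parse one JSON object per line into Probe instances."""
    probes: list[Probe] = []
    with path.open(encoding="utf-8") as fh:
        for lineno, raw in enumerate(fh, start=1):
            if not raw.strip():
                continue  # assumption: blank lines are tolerated, not errors
            try:
                record = json.loads(raw)
                # JSON has no tuple type, so convert the list explicitly.
                record["candidates"] = tuple(record["candidates"])
                probes.append(Probe(**record))
            except (json.JSONDecodeError, KeyError, TypeError) as exc:
                # Missing or unexpected fields surface with file and line number.
                raise ValueError(f"{path}:{lineno}: {exc!r}") from exc
    return probes


# Usage with a throwaway file and a hypothetical record.
sample = {
    "id": "probe-demo-001", "language": "en", "verb": "go",
    "regularity": "irregular", "tense": "past_simple", "person": "3sg",
    "prompt": "Yesterday she ___ home.", "candidates": ["go", "goes", "went"],
    "expected": "went", "label": "correct", "category": "core",
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "probes.jsonl"
    demo_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    loaded = load_probes(demo_path)
print(loaded[0].candidates)
```

Letting Probe(**record) raise on missing or unknown keys covers the "all required fields present" check at load time; type checks still belong in validate_probes, since dataclasses do not enforce annotations.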
Block C: write the prompt template
data/eval/probe_prompt.txt (multiple-choice format):
// Task: pick the grammatically correct verb form for the blank.
// Allowed forms are listed; pick exactly one.
//
// Sentence: {prompt}
// Candidates: {candidates_joined}
// Answer:
For classification-format probes (sentence already has a form in it; classify as CORRECT/INCORRECT/AMBIGUOUS), a parallel template lives at data/eval/classify_prompt.txt:
// Task: classify the grammaticality of the verb form in the sentence below.
// Categories: CORRECT | INCORRECT | AMBIGUOUS
// CORRECT: the verb agrees with the subject's person and the sentence's tense.
// INCORRECT: the verb form is wrong for the subject or the tense.
// AMBIGUOUS: the sentence is grammatical in more than one reading (e.g., "You work").
//
// Sentence: {sentence}
// Classification:
Both files are frozen per eval run. Changes invalidate cross-run comparisons.
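A sketch of how the multiple-choice template could be filled in and fingerprinted follows. The placeholder names {prompt} and {candidates_joined} come from the template above; the ", " joiner and the helper names render_probe and template_fingerprint are assumptions made for this example.

```python
import hashlib

# Template text copied from the multiple-choice format above.
PROBE_TEMPLATE = (
    "// Task: pick the grammatically correct verb form for the blank.\n"
    "// Allowed forms are listed; pick exactly one.\n"
    "//\n"
    "// Sentence: {prompt}\n"
    "// Candidates: {candidates_joined}\n"
    "// Answer:"
)


def render_probe(template: str, prompt: str, candidates: tuple) -> str:
    """Fill the template; joining candidates with ', ' is an assumption."""
    return template.format(prompt=prompt,
                           candidates_joined=", ".join(candidates))


def template_fingerprint(template: str) -> str:
    """SHA-256 hex of the template text, to record alongside each eval run."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


rendered = render_probe(PROBE_TEMPLATE, "Yesterday she ___ home.",
                        ("go", "goes", "went"))
print(rendered)
print(template_fingerprint(PROBE_TEMPLATE)[:12])
```

Storing the fingerprint with each run's results makes it cheap to notice when two runs used different template text and should not be compared.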
Block D: seed data/eval/probes.jsonl (60 examples)
Claude provides the 60 starter probes at phase open. Borja's job in this lab is to:
- Run validate_probes(load_probes(Path('data/eval/probes.jsonl')), train_hashes, val_hashes).
- Fix any validation errors. If a probe leaks against train, replace it.
- Add at least 5 probes of your own: pick verb/tense/person/language combinations that you suspect the model will struggle with based on Phase 18-19 reports.
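The leak check behind the second step can be sketched as below. This is a simplified version under stated assumptions: normalization only collapses whitespace and lowercases (the <NOUN> replacement is left out), the separator byte between prompt and expected is a choice made here, find_leaks is a hypothetical helper name, and how train_hashes is built from the Phase 12 corpus is not shown.

```python
import hashlib
import re


def normalize_prompt(prompt: str, expected: str) -> str:
    """Simplified: collapse whitespace and lowercase.

    The <NOUN> replacement described in Block B is deliberately left out;
    `expected` is accepted only to mirror the lab's signature.
    """
    return re.sub(r"\s+", " ", prompt).strip().lower()


def hash_prompt(prompt: str, expected: str) -> str:
    """SHA-256 hex of the normalized (prompt, expected) pair."""
    # The unit-separator character between the two parts is an assumption; it
    # keeps ("ab", "c") and ("a", "bc") from hashing to the same value.
    payload = normalize_prompt(prompt, expected) + "\x1f" + expected.strip().lower()
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_leaks(probes, train_hashes, val_hashes):
    """Return one error string per probe whose hash appears in train or val."""
    errors = []
    for probe_id, prompt, expected in probes:
        digest = hash_prompt(prompt, expected)
        if digest in train_hashes:
            errors.append(f"{probe_id}: leaks against train")
        if digest in val_hashes:
            errors.append(f"{probe_id}: leaks against val")
    return errors


# Hypothetical data: one training pair and two probes, one of which repeats it
# up to case and spacing.
train_hashes = {hash_prompt("She goes  to school.", "goes")}
probes = [
    ("probe-001", "she GOES to school.", "goes"),
    ("probe-002", "He went home.", "went"),
]
leaks = find_leaks(probes, train_hashes, set())
print(leaks)
```

Because the error strings carry the probe id, the fix loop is mechanical: replace each listed probe, rerun, and stop when the list is empty.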
Block E: tests in tests/eval/test_probes.py
- test_load_known_probes: parses the seed file without error; count = 60.
- test_validate_clean_set: given the seed file and empty train_hashes, returns [].
- test_detects_id_dup: given a probe set with two probes sharing an id, returns an error about duplication.
- test_detects_leak: given a probe whose prompt-expected hash matches a train hash, returns an error about leakage.
- test_detects_regularity_mismatch: a probe with verb="go" and regularity="regular" is rejected.
- test_detects_expected_not_in_candidates: a probe with expected="ate" but candidates=["eat","eats","eaten"] is rejected.
- test_coverage_matrix_minima: given a probe set missing the es language entirely, returns a coverage error.
- test_normalize_idempotent: normalize_prompt(normalize_prompt(p, e), e) == normalize_prompt(p, e).
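Two of these tests can be sketched in pytest style as follows. To keep the example self-contained it uses a stand-in per-probe checker (check_probe) and a dict-based fixture helper (make_probe), both hypothetical; the real tests would call validate_probes on Probe instances instead.

```python
REGULAR_VERBS = {"work", "play", "walk", "talk", "listen", "watch",
                 "study", "finish", "start", "look", "want", "like"}
IRREGULAR_VERBS = {"be", "have", "do", "go", "come", "see", "eat", "write"}


def check_probe(probe: dict) -> list:
    """Stand-in for two of the validate_probes checks (hypothetical helper)."""
    errors = []
    verb = probe["verb"]
    if verb in REGULAR_VERBS:
        verb_class = "regular"
    elif verb in IRREGULAR_VERBS:
        verb_class = "irregular"
    else:
        verb_class = None
        errors.append(f"{probe['id']}: verb {verb!r} not in TARGET_VERBS")
    if verb_class is not None and probe["regularity"] != verb_class:
        errors.append(f"{probe['id']}: regularity should be {verb_class!r}")
    if probe["expected"] not in probe["candidates"]:
        errors.append(f"{probe['id']}: expected not in candidates")
    return errors


def make_probe(**overrides) -> dict:
    """Build a valid baseline probe, then apply per-test overrides."""
    base = {"id": "probe-test-001", "verb": "eat", "regularity": "irregular",
            "expected": "ate", "candidates": ("eat", "ate", "eaten")}
    return {**base, **overrides}


def test_detects_regularity_mismatch():
    errors = check_probe(make_probe(verb="go", regularity="regular"))
    assert any("regularity" in e for e in errors)


def test_detects_expected_not_in_candidates():
    errors = check_probe(
        make_probe(expected="ate", candidates=("eat", "eats", "eaten")))
    assert any("expected not in candidates" in e for e in errors)
```

Starting every test from a valid baseline and overriding one field keeps each failure attributable to a single rule.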
Constraints
- Pure Python + standard library. No external schema validators (Pydantic, etc.); a dataclass with a few checks is enough.
- Leak detection is mandatory. No exceptions.
- Probes are JSONL, not YAML / JSON-array. One line per probe, append-only conceptually.
- EN and ES coexist in the same file. Distinguish them by the language field, not by file path. (Some teams split by language; we don't, because slicing logic stays simpler.)
Stop conditions
Done when:
- pytest tests/eval/test_probes.py -v passes all tests.
- data/eval/probes.jsonl has ≥ 60 probes that pass validate_probes against Phase-12 train+val.
- Coverage minima are met (≥ 30 EN, ≥ 30 ES, ≥ 30 regular, ≥ 30 irregular, ≥ 10 per tense, ≥ 15 per person).
- data/corpus_spec.md is extended with the Probe schema.
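The coverage minima listed above can be checked with a few counters. The sketch below encodes only those per-field minima; theory/03 may define a finer-grained matrix, which is not reproduced here. MINIMA, coverage_errors, and the synthetic probe set are all illustrative.

```python
from collections import Counter

TENSES = ("infinitive", "present_simple", "past_simple",
          "past_participle", "future")
PERSONS = ("1sg", "2sg", "3sg")

# Per-field minima taken from the stop conditions.
MINIMA = {
    "language": {"en": 30, "es": 30},
    "regularity": {"regular": 30, "irregular": 30},
    "tense": {tense: 10 for tense in TENSES},
    "person": {person: 15 for person in PERSONS},
}


def coverage_errors(probes: list) -> list:
    """Return one error string per (field, value) pair below its minimum."""
    errors = []
    for field, minima in MINIMA.items():
        counts = Counter(p[field] for p in probes)
        for value, minimum in minima.items():
            if counts[value] < minimum:
                errors.append(
                    f"coverage: {field}={value} has {counts[value]}, "
                    f"need >= {minimum}")
    return errors


# Synthetic 60-probe set that meets every minimum (illustrative only).
balanced = [
    {"language": "en" if i < 30 else "es",
     "regularity": "regular" if i % 2 == 0 else "irregular",
     "tense": TENSES[i % 5],
     "person": PERSONS[i % 3]}
    for i in range(60)
]
english_only = [p for p in balanced if p["language"] == "en"]

print(coverage_errors(balanced))
print(coverage_errors(english_only)[0])
```

Dropping the Spanish half trips the language minimum and, because the set shrinks to 30 probes, several of the other minima as well, which is the situation test_coverage_matrix_minima is meant to catch.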
Pitfalls
- Normalizing without preserving verb form. If normalize_prompt strips the verb form along with surrounding nouns, it can't detect leaks correctly. Preserve the verb form; replace only incidental content nouns.
- Forgetting that ambiguous is a real class. Some 2sg sentences are genuinely ambiguous between 2sg and plural in English; mark them ambiguous, not correct.
- Spanish-side probes that are actually Spanglish. Stay consistent within a probe: language: es means the whole prompt is Spanish, including the time markers.
- Probe id collisions across files. probes.jsonl and adversarial.jsonl share an id namespace. Use prefix conventions like probe-... and adv-....
- Irregular verb list confusion. The list is be / have / do / go / come / see / eat / write. Write it down once and refer to that list. Avoid relying on memory mid-probe-write.
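For the id-collision pitfall, a small cross-file check is enough. The sketch below is one possible shape; id_collisions is a hypothetical helper and the ids are made up.

```python
def id_collisions(ids_by_file: dict) -> dict:
    """Map each id that occurs more than once to the files it was seen in."""
    seen = {}
    for filename, ids in ids_by_file.items():
        for probe_id in ids:
            seen.setdefault(probe_id, []).append(filename)
    return {pid: files for pid, files in seen.items() if len(files) > 1}


# Hypothetical ids: the adversarial file reuses an id from the core file.
collisions = id_collisions({
    "probes.jsonl": ["probe-001", "probe-002"],
    "adversarial.jsonl": ["adv-001", "probe-002"],
})
print(collisions)
```

With the probe-... and adv-... prefixes applied consistently, this check should always come back empty.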
When to consult solutions/
After all eight tests pass. The solution at solutions/00-probe-schema-ref.md (written at phase open) walks through the normalization function and the coverage-matrix check.
Next lab: lab/01-harness-perplexity-accuracy.md.