Lab 00 — Define the probe-set schema; load 60 labeled examples

Goal: lock the probe schema, build the loader, and validate a starter set of 60 hand-curated probes spanning the §A13 verb-grammar grid.

Estimated time: 90-120 minutes.

Prereq: Phase 12 corpus is committed; data/corpus_spec.md describes the labeling conventions.


What you produce

A new module and its tests:

  • src/eval/probes.py — schema validator + loader.
  • tests/eval/test_probes.py — schema + leak-detection tests.

New and updated data files:

  • data/eval/probes.jsonl — 60 core probes (Claude provides at phase open as a seed; Borja extends as desired).
  • data/eval/probe_prompt.txt — the fixed prompt template.
  • data/corpus_spec.md — extended with a "Probe schema" section.

TODOs

Block A — extend data/corpus_spec.md

Add a "Probe schema" section that documents:

  • Required fields and types.
  • Allowed values for language, verb, regularity, tense, person, label, category.
  • Validation rules (uniqueness of id, leak prevention, expected ∈ candidates, regularity↔verb consistency).
  • The coverage matrix from theory/03.

Block B — implement src/eval/probes.py

from dataclasses import dataclass
from pathlib import Path
import hashlib
import json

REGULAR_VERBS = {"work", "play", "walk", "talk", "listen",
                 "watch", "study", "finish", "start", "look",
                 "want", "like"}
IRREGULAR_VERBS = {"be", "have", "do", "go", "come",
                   "see", "eat", "write"}
TARGET_VERBS = REGULAR_VERBS | IRREGULAR_VERBS  # 20

TENSES = {"infinitive", "present_simple", "past_simple",
          "past_participle", "future"}
PERSONS = {"1sg", "2sg", "3sg"}
LANGUAGES = {"en", "es"}
LABELS = {"correct", "incorrect", "ambiguous"}
CATEGORIES = {"core", "adversarial"}

@dataclass(frozen=True)
class Probe:
    id: str
    language: str
    verb: str
    regularity: str       # "regular" | "irregular"
    tense: str
    person: str
    prompt: str
    candidates: tuple[str, ...]
    expected: str
    label: str
    category: str
    reason_code: str | None = None
    explanation: str | None = None
    source: str = "hand"

def load_probes(path: Path) -> list[Probe]: ...

def validate_probes(probes: list[Probe],
                    train_hashes: set[str],
                    val_hashes: set[str]) -> list[str]:
    """Returns list of error strings; [] = clean."""
    ...

def normalize_prompt(prompt: str, expected: str) -> str:
    """Whitespace+lowercase normalization; replaces non-verb content
    nouns with <NOUN> for stricter leak detection."""
    ...

def hash_prompt(prompt: str, expected: str) -> str:
    """SHA256 hex of the normalized form."""
    ...

  • load_probes parses JSONL into Probe dataclasses (tuple for candidates).
  • validate_probes runs every check from theory/03 §"Probe-set checks":
      • schema correctness (all required fields present, correct types)
      • id uniqueness
      • verb ∈ TARGET_VERBS, tense ∈ TENSES, person ∈ PERSONS, language ∈ LANGUAGES
      • label ∈ LABELS, category ∈ CATEGORIES
      • expected ∈ candidates
      • regularity matches verb's class
      • leakage against train_hashes and val_hashes
      • coverage-matrix minima
  • normalize_prompt strips comments, normalizes whitespace, lowercases, replaces incidental content nouns with <NOUN> (a rough heuristic: list the limitations in the docstring).
  • hash_prompt returns SHA256 hex of the normalized (prompt, expected) pair.

Block C — write the prompt template

data/eval/probe_prompt.txt (multiple-choice format):

// Task: pick the grammatically correct verb form for the blank.
// Allowed forms are listed; pick exactly one.
//
// Sentence: {prompt}
// Candidates: {candidates_joined}
// Answer:

For classification-format probes (sentence already has a form in it; classify as CORRECT/INCORRECT/AMBIGUOUS), a parallel template lives at data/eval/classify_prompt.txt:

// Task: classify the grammaticality of the verb form in the sentence below.
// Categories: CORRECT | INCORRECT | AMBIGUOUS
//   CORRECT: the verb agrees with the subject's person and the sentence's tense.
//   INCORRECT: the verb form is wrong for the subject or the tense.
//   AMBIGUOUS: the sentence is grammatical in more than one reading (e.g., "You work").
//
// Sentence: {sentence}
// Classification:

Both files are frozen per eval run. Changes invalidate cross-run comparisons.
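
A minimal sketch of how the multiple-choice template might be filled in, assuming plain str.format substitution and a comma-separated candidate list (the template only names the placeholder candidates_joined; the separator is a guess). The helper names render_probe_prompt and template_fingerprint are hypothetical, and the fingerprint is just one way to notice that a frozen template changed between runs.

```python
import hashlib

# Inlined copy of the multiple-choice template for illustration; in the lab it
# would be read from data/eval/probe_prompt.txt.
PROBE_TEMPLATE = (
    "// Task: pick the grammatically correct verb form for the blank.\n"
    "// Allowed forms are listed; pick exactly one.\n"
    "//\n"
    "// Sentence: {prompt}\n"
    "// Candidates: {candidates_joined}\n"
    "// Answer:"
)


def render_probe_prompt(template: str, prompt: str,
                        candidates: tuple[str, ...], sep: str = ", ") -> str:
    """Fill the two placeholders. Assumption: candidates are joined with ', '."""
    return template.format(prompt=prompt, candidates_joined=sep.join(candidates))


def template_fingerprint(template: str) -> str:
    """Hypothetical helper: SHA256 of the template text, to log with each run."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


# Illustrative sentence, not taken from the seed file.
rendered = render_probe_prompt(PROBE_TEMPLATE, "She ___ to school.",
                               ("go", "goes", "went"))
```

Storing the fingerprint alongside each run's results would make an accidental template edit visible when two runs are compared.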

Block D — seed data/eval/probes.jsonl (60 examples)

Claude provides the 60 starter probes at phase open. Borja's job in this lab is to:

  1. Run validate_probes(load_probes('data/eval/probes.jsonl'), train_hashes, val_hashes).
  2. Fix any validation errors. If a probe leaks against train, replace it.
  3. Add at least 5 probes of your own — pick verb/tense/person/language combinations that you suspect the model will struggle with based on Phase 18-19 reports.
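
Steps 1 and 2 come down to set membership once every probe has a hash. The sketch below assumes the probe hashes were computed with the same hash_prompt used for train_hashes and val_hashes; how those two sets are built from the Phase 12 corpus is not specified here. The function name leak_errors and the error wording are hypothetical.

```python
def leak_errors(probe_hashes: dict[str, str],
                train_hashes: set[str],
                val_hashes: set[str]) -> list[str]:
    """Report every probe whose hash also appears in train or val.

    probe_hashes maps probe id -> hash of its (prompt, expected) pair.
    """
    errors = []
    for probe_id, digest in sorted(probe_hashes.items()):
        if digest in train_hashes:
            errors.append(f"{probe_id}: leaks against train")
        if digest in val_hashes:
            errors.append(f"{probe_id}: leaks against val")
    return errors


# Toy hashes for illustration only; real values are SHA256 hex strings.
found = leak_errors({"probe-0001": "aaa", "probe-0002": "bbb"},
                    train_hashes={"bbb"}, val_hashes={"ccc"})
```

Each reported id is a probe to replace rather than patch, since a small edit to a leaked prompt can still leave it close to a training example.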

Block E — tests in tests/eval/test_probes.py

  1. test_load_known_probes — parses the seed file without error; count ≥ 60 (the 60 seed probes plus any added in Block D).
  2. test_validate_clean_set — given the seed file and empty train_hashes and val_hashes, returns [].
  3. test_detects_id_dup — given a probe set with two probes sharing an id, returns an error about duplication.
  4. test_detects_leak — given a probe whose prompt-expected hash matches a train hash, returns an error about leakage.
  5. test_detects_regularity_mismatch — a probe with verb="go" and regularity="regular" is rejected.
  6. test_detects_expected_not_in_candidates — a probe with expected="ate" but candidates=["eat","eats","eaten"] is rejected.
  7. test_coverage_matrix_minima — given a probe set missing the es language entirely, returns a coverage error.
  8. test_normalize_idempotent — normalize_prompt(normalize_prompt(p, e), e) == normalize_prompt(p, e).

Constraints

  • Pure Python + standard library. No external schema validators (Pydantic, etc.); a dataclass with a few checks is enough.
  • Leak detection is mandatory. No exceptions.
  • Probes are JSONL, not YAML / JSON-array. One line per probe, append-only conceptually.
  • EN and ES coexist in the same file — distinguish by the language field, not by file path. (Some teams split by language; we don't, because slicing logic stays simpler.)

Stop conditions

Done when:

  1. pytest tests/eval/test_probes.py -v passes all tests.
  2. data/eval/probes.jsonl has ≥ 60 probes that pass validate_probes against Phase-12 train+val.
  3. Coverage minima are met (≥ 30 EN, ≥ 30 ES, ≥ 30 regular, ≥ 30 irregular, ≥ 10 per tense, ≥ 15 per person).
  4. data/corpus_spec.md is extended with the Probe schema.
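
The coverage minima in stop condition 3 can be checked with a few counters. The sketch below is one way to do it, assuming probes are available as dicts (with Probe dataclasses, getattr would replace the key lookup); the MINIMA table layout, the function name coverage_errors, and the error wording are assumptions.

```python
from collections import Counter

TENSES = ("infinitive", "present_simple", "past_simple",
          "past_participle", "future")
PERSONS = ("1sg", "2sg", "3sg")

# Minimum probe count per field value, copied from stop condition 3.
MINIMA = {
    "language": {"en": 30, "es": 30},
    "regularity": {"regular": 30, "irregular": 30},
    "tense": {tense: 10 for tense in TENSES},
    "person": {person: 15 for person in PERSONS},
}


def coverage_errors(probes: list[dict]) -> list[str]:
    """Return one error string per field value that falls below its minimum."""
    errors = []
    for field, minima in MINIMA.items():
        counts = Counter(p[field] for p in probes)
        for value, minimum in minima.items():
            if counts[value] < minimum:  # Counter returns 0 for unseen values
                errors.append(
                    f"coverage: {field}={value} has {counts[value]}, "
                    f"need >= {minimum}"
                )
    return errors
```

Because the counts are per field rather than per combination, a set can pass these minima and still miss whole cells of the coverage matrix; the full matrix from theory/03 may need a stricter check.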

Pitfalls

  • Normalizing without preserving verb form. If normalize_prompt strips the verb form along with surrounding nouns, it can't detect leaks correctly. Preserve the verb form; replace only incidental content nouns.
  • Forgetting that ambiguous is a real class. Some 2sg sentences in English are genuinely ambiguous between singular and plural "you" (e.g., "You work"); mark them ambiguous, not correct.
  • Spanish-side probes that are actually Spanglish. Stay consistent within a probe: language: es means the whole prompt is Spanish, including the time markers.
  • Probe id collisions across files. probes.jsonl and adversarial.jsonl share id namespace. Use prefix conventions like probe-... and adv-....
  • Irregular verb list confusion. be / have / do / go / come / see / eat / write — write it down once, refer to that list. Avoid relying on memory mid-probe-write.

When to consult solutions/

After all eight tests pass. The solution at solutions/00-probe-schema-ref.md (written at phase open) walks through the normalization function and the coverage-matrix check.


Next lab: lab/01-harness-perplexity-accuracy.md.