Lab 04: Grammar-Tutor Evaluator Harness
Build an evaluator harness (~30 §A13 prompts) that runs the grammar-tutor agent against known ground truth, computes accuracy, faithfulness, and the mean cost in turns and tool calls, and emits a machine-readable report. It uses the same format as the A4 capstone tracker so that both converge at the close of Phase 32.
Read `theory/05-agent-loop-architecture.md` and `lab/01-tutor-end-to-end.md` before starting. Do not consult `solutions/`.
Objective
Wire up a deterministic, reproducible evaluation pipeline for the grammar-tutor agent. The harness reads a YAML file of ~30 §A13 prompts with gold answers, runs each prompt through the agent, and produces a structured report. The report decides whether Phase 32 meets its Definition of Done (DoD).
Scope (§A13 only)
The eval set covers:
- 20 verbs (12 regular + 8 irregular).
- 5 tenses (infinitive, present simple, past simple, past participle, simple future).
- 3 persons (1st sg, 2nd sg, 3rd sg).
- Prompts in both English and Spanish (bilingual policy, §A2).
Out of scope:
- Plural persons (1st/2nd/3rd plural) — deferred per §A13.
- Verbs outside the 20 listed in §A13.
- Tenses outside the 5 listed.
The harness enforces these constraints via the JSON-Schema mask on the agent's tool calls (Phase 30 → 32 cross-reference).
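As a rough illustration of the scope constraints above, the mask can be pictured as a schema with enums plus a membership check. This is only a sketch: the `conjugate` argument names, the `in_scope` helper, and the two-verb list are placeholders (the real 20 verbs are defined in §A13), and the actual harness presumably relies on a proper JSON-Schema validator rather than this hand-rolled check.

```python
# Hypothetical scope mask, shaped like a JSON Schema with enums.
# The verb list is a two-item placeholder; the real 20 verbs come from §A13.
CONJUGATE_ARGS_SCHEMA = {
    "type": "object",
    "required": ["verb", "tense", "person"],
    "properties": {
        "verb": {"enum": ["eat", "walk"]},  # placeholder subset
        "tense": {"enum": ["infinitive", "present simple", "past simple",
                           "past participle", "simple future"]},
        "person": {"enum": ["1st singular", "2nd singular", "3rd singular"]},
    },
}


def in_scope(args: dict, schema: dict = CONJUGATE_ARGS_SCHEMA) -> bool:
    """Return True if every required field is present and inside its enum."""
    for field in schema["required"]:
        if field not in args:
            return False
        if args[field] not in schema["properties"][field]["enum"]:
            return False
    return True


print(in_scope({"verb": "eat", "tense": "past simple", "person": "3rd singular"}))
print(in_scope({"verb": "swim", "tense": "past simple", "person": "3rd singular"}))
```

With this placeholder list, the `eat` call is accepted and the `swim` call is rejected, which is the behaviour the out-of-scope prompts are meant to exercise.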
Tasks
Task 1: author the eval set
Create data/eval/phase-32-grammar-tutor.yaml with ~30 entries. Schema (per entry):
- id: tut-01-eat-pst-3s
  category: irregular_verb_correction
  prompt_en: "Correct this sentence: 'He eated the apple.'"
  prompt_es: "Corrige esta oración: 'Él comió la manzana.'"
  gold_answer: |
    {"original": "He eated the apple.",
     "verb": "eat", "tense": "past simple", "person": "3rd singular",
     "correct_form": "ate", "spanish": "comió"}
  expected_tool_calls: ["conjugate"]
  max_turns: 3
Aim for coverage, not bulk:
- 8 entries: irregular-verb corrections (one per irregular verb).
- 6 entries: regular-verb conjugations (sample across the 12).
- 5 entries: tense identification ("what tense is `wrote`?").
- 5 entries: English ↔ Spanish translation pairs.
- 4 entries: adversarial / out-of-scope ("conjugate `swim` in past simple" → expected: agent declines, since `swim` ∉ §A13).
- 2 entries: ambiguous / underspecified inputs (the agent should request clarification or give up gracefully).
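The coverage targets above can be checked mechanically once the YAML is loaded into a list of dicts. The sketch below assumes that shape; the category keys other than `irregular_verb_correction` and the `coverage_gaps` helper are hypothetical names chosen for illustration.

```python
from collections import Counter

# Target entries per category, taken from the coverage list above.
# Only "irregular_verb_correction" appears in the sample entry;
# the other category keys are hypothetical names.
TARGET_COUNTS = {
    "irregular_verb_correction": 8,
    "regular_verb_conjugation": 6,
    "tense_identification": 5,
    "translation": 5,
    "out_of_scope": 4,
    "ambiguous": 2,
}


def coverage_gaps(entries: list[dict]) -> dict[str, int]:
    """Return {category: missing_count} for categories below target."""
    seen = Counter(entry["category"] for entry in entries)
    return {
        cat: target - seen[cat]
        for cat, target in TARGET_COUNTS.items()
        if seen[cat] < target
    }


# The targets add up to the ~30 entries the task asks for.
assert sum(TARGET_COUNTS.values()) == 30

sample = [{"id": "tut-01-eat-pst-3s", "category": "irregular_verb_correction"}]
print(coverage_gaps(sample))
```

An empty result means every category has reached its target, which is a cheap pre-flight check before running the agent at all.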
Task 2: harness script
Create scripts/eval_grammar_tutor.py:
from pathlib import Path

def run_eval(eval_path: Path, model_ckpt: Path) -> EvalReport:
    """Load eval set, instantiate agent, run each prompt, accumulate metrics."""
The harness must:
- Use the autouse seed fixture so two runs produce identical outputs (deterministic).
- Time each prompt; record wall-clock and turn count.
- Validate the agent's output against `gold_answer` using a structural matcher (not just `==`): for the "correct_form" field, exact match; for the "spanish" field, allow synonyms from a small whitelist (e.g., `comió` and `se comió` both accepted for "ate").
- Emit an `EvalReport`:
from dataclasses import dataclass

@dataclass
class EvalReport:
    n_prompts: int
    correct: int            # exact-match on correct_form
    faithful: int           # gold supported by retrieved chunks / tool outputs
    declined: int           # agent gave up; gold also "declined"
    failed: int             # agent gave up; gold expected an answer
    mean_turns: float
    mean_tool_calls: float
    p95_latency_ms: float
    by_category: dict[str, dict[str, float]]
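The structural matcher described in the list above might look roughly like the following. It is a minimal sketch: `matches_gold` and `SPANISH_SYNONYMS` are hypothetical names, and it assumes `gold_answer` arrives as the JSON string stored in the YAML entry.

```python
import json

# Hypothetical whitelist: gold Spanish form -> extra accepted variants.
SPANISH_SYNONYMS = {"comió": {"se comió"}}


def matches_gold(agent_output: dict, gold_answer: str) -> dict[str, bool]:
    """Compare agent output to the gold JSON string field by field."""
    gold = json.loads(gold_answer)
    # correct_form: exact match only.
    correct = agent_output.get("correct_form") == gold["correct_form"]
    # spanish: the gold form or any whitelisted variant is accepted.
    accepted = {gold["spanish"]} | SPANISH_SYNONYMS.get(gold["spanish"], set())
    spanish_ok = agent_output.get("spanish") in accepted
    return {"correct_form": correct, "spanish": spanish_ok}


gold = '{"correct_form": "ate", "spanish": "comió"}'
print(matches_gold({"correct_form": "ate", "spanish": "se comió"}, gold))
```

Keeping the two fields as separate booleans lets the harness count `correct` strictly on `correct_form` while still reporting Spanish mismatches per category.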
Task 3: DoD thresholds
The Phase 32 close criteria, encoded in the harness's exit code (non-zero if any threshold fails):
| Metric | Threshold |
|---|---|
| `correct / n_prompts` | ≥ 0.85 (after LoRA fine-tune from Phase 28) |
| `faithful / correct` | ≥ 0.95 (no hallucinated citations) |
| `mean_turns` | ≤ 3.5 |
| `mean_tool_calls` | ≤ 2.5 |
| `p95_latency_ms` | ≤ 1500 on Borja's i5-8250U |
| `declined` on out-of-scope inputs | = 4 (matches the 4 out-of-scope prompts) |
| `failed` | ≤ 2 (allow 2 inputs to fail without blocking close) |
If `correct / n_prompts` < 0.85, the report should print the 5 worst prompts, each with the agent's actual answer next to the gold answer for a visual diff. This is the diagnostic input to the Phase 32 reflection.
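One way to turn the seven thresholds into an exit code is sketched below. The helper names `failed_thresholds` and `exit_code` are hypothetical, and the numbers in the usage lines are made-up inputs for illustration, not measured results.

```python
def failed_thresholds(n_prompts: int, correct: int, faithful: int,
                      declined: int, failed: int, mean_turns: float,
                      mean_tool_calls: float, p95_latency_ms: float) -> list[str]:
    """Return the names of the DoD thresholds that are not met."""
    checks = {
        "correct / n_prompts >= 0.85": n_prompts > 0 and correct / n_prompts >= 0.85,
        "faithful / correct >= 0.95": correct > 0 and faithful / correct >= 0.95,
        "mean_turns <= 3.5": mean_turns <= 3.5,
        "mean_tool_calls <= 2.5": mean_tool_calls <= 2.5,
        "p95_latency_ms <= 1500": p95_latency_ms <= 1500,
        "declined == 4": declined == 4,
        "failed <= 2": failed <= 2,
    }
    return [name for name, ok in checks.items() if not ok]


def exit_code(failures: list[str]) -> int:
    """0 when every threshold passes, 1 otherwise."""
    return 1 if failures else 0


# Made-up inputs, for illustration only.
failures = failed_thresholds(n_prompts=30, correct=26, faithful=25, declined=4,
                             failed=0, mean_turns=2.1, mean_tool_calls=1.2,
                             p95_latency_ms=900.0)
print(failures, exit_code(failures))
```

In the real script the value from `exit_code` would be passed to `sys.exit`, and the list of failed threshold names doubles as the failure summary.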
Task 4: JSON-line trace log
For each prompt, emit one JSON line to experiments/<date>-32-tutor-eval/traces.jsonl containing:
{
  "id": "tut-01-eat-pst-3s",
  "prompt_en": "Correct this sentence: 'He eated the apple.'",
  "scratchpad": [{"role": "user", "text": "..."}, {"role": "model", "action": "tool_call", ...}, ...],
  "agent_output": {"correct_form": "ate", "spanish": "comió"},
  "gold": {...},
  "metrics": {"correct": true, "faithful": true, "turns": 2, "tool_calls": 1, "latency_ms": 437}
}
This trace is the audit log a teacher uses to diagnose individual failures. The journal-summarizer subagent can later digest it.
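Writing the trace file could be as simple as the sketch below. `append_trace` is a hypothetical helper, and the temporary directory stands in for the real `experiments/` path. Sorting keys gives a stable field order, which helps with reproducibility; measured `latency_ms` will still differ between runs unless it is stubbed or left out of any byte-level comparison.

```python
import json
import tempfile
from pathlib import Path


def append_trace(path: Path, trace: dict) -> None:
    """Append one trace as a single JSON line with a stable key order."""
    line = json.dumps(trace, sort_keys=True, ensure_ascii=False)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")


# Illustrative trace using a subset of the fields from the example above.
trace = {
    "id": "tut-01-eat-pst-3s",
    "agent_output": {"correct_form": "ate", "spanish": "comió"},
    "metrics": {"correct": True, "faithful": True, "turns": 2, "tool_calls": 1},
}

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "traces.jsonl"
    append_trace(out, trace)
    append_trace(out, trace)
    lines = out.read_text(encoding="utf-8").splitlines()
    print(len(lines), json.loads(lines[0])["id"])
```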
Task 5: convergence with capstone tracker (A4)
The Phase 41 portal's exam mode reads the same `data/eval/phase-32-grammar-tutor.yaml`. The harness here and the portal's exam runner must agree on the metrics: both consume the same YAML and both produce comparable `EvalReport` shapes. This convergence is the §A4 hook. The harness is the offline eval and the portal is the online eval, and both should reach the same conclusion about whether the tutor is good enough.
The portal's contract is documented in src/miniportal/BLUEPRINT.md §"Exam scoring". Your harness's output must satisfy:
- Same metric names (`correct`, `faithful`, etc.).
- Same threshold values (≥ 0.85 correctness, etc.).
- Same trace-log JSON schema.
Measurements to capture
- All EvalReport fields above.
- Per-category breakdown: irregular-verb category accuracy, regular-verb accuracy, OOS-decline accuracy.
- A histogram of `turns` and `tool_calls` across the 30 prompts.
- A per-prompt latency scatter.
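The histogram and the latency percentile can both be computed with the standard library. The sketch below uses hypothetical helper names and assumes the nearest-rank definition of the 95th percentile, since the lab does not say which percentile method the portal expects.

```python
from collections import Counter


def histogram(values: list[int]) -> dict[int, int]:
    """Count how many prompts took each number of turns (or tool calls)."""
    return dict(sorted(Counter(values).items()))


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer arithmetic for ceil(95 * n / 100) avoids float rounding issues.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Made-up samples, for illustration only.
print(histogram([2, 2, 3, 1, 2]))
print(p95([float(ms) for ms in range(1, 21)]))
```

With 30 prompts, nearest-rank p95 is the 29th smallest latency, so a single slow outlier does not decide the threshold on its own.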
Acceptance
- `data/eval/phase-32-grammar-tutor.yaml` has ~30 entries spanning all 6 categories above.
- `scripts/eval_grammar_tutor.py` runs to completion on Borja's CPU in ≤ 90 s.
- Re-running the harness with the same seed produces byte-identical `traces.jsonl`.
- All seven DoD thresholds pass (or the harness exits non-zero with a clear failure summary).
- The trace JSON-line schema matches the portal's expected format.
- The eval set is published under `data/eval/` and tracked by DVC (`dvc add data/eval/phase-32-grammar-tutor.yaml`) so the version is reproducible.
Pitfalls to expect
- Forgetting the seed: two runs disagree on a couple of prompts because Mini-GPT's sampler is non-deterministic. Set `temperature=0.0` for eval runs; if a prompt fails at T=0, that's the real signal.
- Counting OOS-declines as failures: an out-of-scope prompt (`"conjugate swim"`) that the agent correctly declines is a SUCCESS, not a failure. The `declined` metric is separate from `failed`. Evaluators often conflate the two, so be careful here.
- Whitelist creep on Spanish synonyms: it is tempting to whitelist `corrió`, `corría`, and `correría` all for `corrió`. Resist. Add synonyms one at a time as you encounter false negatives; document each addition in the YAML with a comment.
- Trace log size: ~30 traces × ~5 KiB each ≈ 150 KiB per run. Don't commit traces; they're experiment artifacts (add `experiments/` to `.gitignore`).
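The declined-versus-failed distinction from the pitfalls above can be pinned down in a small classification function. This is a sketch under assumptions: `classify` and its boolean inputs are hypothetical, and the `"incorrect"` bucket is not an `EvalReport` field; it simply marks prompts that count toward none of `correct`, `declined`, or `failed`.

```python
def classify(agent_declined: bool, gold_declined: bool, form_matches: bool) -> str:
    """Map one prompt's outcome to a counter name."""
    if agent_declined:
        # Declining is only a success when the gold also expects a decline.
        return "declined" if gold_declined else "failed"
    if gold_declined:
        # The agent answered a prompt it should have declined.
        return "incorrect"
    return "correct" if form_matches else "incorrect"


# Out-of-scope prompt, correctly declined: counts as "declined", not "failed".
print(classify(agent_declined=True, gold_declined=True, form_matches=False))
# In-scope prompt where the agent gave up: counts as "failed".
print(classify(agent_declined=True, gold_declined=False, form_matches=False))
```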
Cross-references
- `docs/phase-20-evaluation-harness/`: the eval harness pattern reuses Phase 20's structure for grammar tasks.
- `src/miniportal/BLUEPRINT.md` §"Exam scoring": the contract this harness must converge with.
- `docs/extension-track/X3-rlhf-dpo/lab/01-dpo-on-grammar-tutor.md`: DPO uses this harness as the reward signal; the metrics here become the DPO objective.
Next: with the harness green, write PHASE_32_REPORT.md summarizing the agent's score on the 30 prompts and reflect on which categories needed which Phase 26-31 components most.