Lab 04 — Grammar-Tutor Evaluator Harness

Build an evaluator harness (~30 §A13 prompts) that runs the grammar-tutor agent against known ground truth, computes accuracy, faithfulness, and the mean cost in turns and tool calls, and emits a machine-readable report. It uses the same format as the A4 capstone tracker so that both converge at the close of Phase 32.

Read theory/05-agent-loop-architecture.md and lab/01-tutor-end-to-end.md before starting. Do not consult solutions/.

Objective

Wire up a deterministic, reproducible evaluation pipeline for the grammar-tutor agent. The harness reads a YAML of ~30 §A13 prompts with gold answers, runs each prompt through the agent, and produces a structured report. The report drives Phase 32's DoD.

Scope (§A13 only)

The eval set covers:

20 verbs (12 regular + 8 irregular).
5 tenses (infinitive, present simple, past simple, past participle, simple future).
3 persons (1st sg, 2nd sg, 3rd sg).
Prompts in both English and Spanish (bilingual policy, §A2).

Out of scope:

Plural persons (1st/2nd/3rd plural): deferred per §A13.
Verbs outside the 20 listed in §A13.
Tenses outside the 5 listed.

The harness enforces these constraints via the JSON-Schema mask on the agent's tool calls (Phase 30 → 32 cross-reference).
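
The scope constraints above map naturally onto enum fields in a JSON Schema. The following is a minimal sketch, not the project's actual mask: it assumes a hypothetical `conjugate` tool whose arguments are named `verb`, `tense`, and `person`, takes the verb list as a parameter because the 20 §A13 verbs are not reproduced in this lab, and uses a tiny `in_scope` helper as a stand-in for whatever schema validator or constrained-decoding mask Phase 30 provides.

```python
# Tenses and persons come from the scope list; verbs must be supplied from §A13.
TENSES = ["infinitive", "present simple", "past simple",
          "past participle", "simple future"]
PERSONS = ["1st singular", "2nd singular", "3rd singular"]


def build_conjugate_schema(verbs: list[str]) -> dict:
    """Return a JSON-Schema-style dict for a hypothetical `conjugate` tool."""
    return {
        "type": "object",
        "properties": {
            "verb": {"enum": sorted(verbs)},
            "tense": {"enum": TENSES},
            "person": {"enum": PERSONS},
        },
        "required": ["verb", "tense", "person"],
        "additionalProperties": False,
    }


def in_scope(args: dict, schema: dict) -> bool:
    """Check field names and enum membership only (assumed stand-in validator)."""
    props = schema["properties"]
    if set(args) != set(schema["required"]):
        return False
    return all(args[name] in props[name]["enum"] for name in props)
```

With a schema built from the real verb list, a call such as `{"verb": "swim", "tense": "past simple", "person": "3rd singular"}` would be rejected because swim is outside §A13.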

Tasks

Task 1 — author the eval set

Create data/eval/phase-32-grammar-tutor.yaml with ~30 entries. Schema (per entry):

- id: tut-01-eat-pst-3s
  category: irregular_verb_correction
  prompt_en: "Correct this sentence: 'He eated the apple.'"
  prompt_es: "Corrige esta oración: 'Él comió la manzana.'"
  gold_answer: |
    {"original": "He eated the apple.",
     "verb": "eat", "tense": "past simple", "person": "3rd singular",
     "correct_form": "ate", "spanish": "comió"}
  expected_tool_calls: ["conjugate"]
  max_turns: 3
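
Because `gold_answer` is a JSON string embedded in a YAML block scalar, each entry needs a second decoding step after the YAML is loaded. A minimal sketch of per-entry validation follows, assuming the YAML file has already been parsed into a list of plain dicts by a loader of your choice; `EvalEntry`, `REQUIRED_KEYS`, and `parse_entry` are hypothetical names, not part of the lab's required interface.

```python
import json
from dataclasses import dataclass

REQUIRED_KEYS = {"id", "category", "prompt_en", "prompt_es",
                 "gold_answer", "expected_tool_calls", "max_turns"}


@dataclass(frozen=True)
class EvalEntry:
    id: str
    category: str
    prompt_en: str
    prompt_es: str
    gold: dict                  # gold_answer decoded from its JSON string
    expected_tool_calls: tuple
    max_turns: int


def parse_entry(raw: dict) -> EvalEntry:
    """Validate one already-loaded YAML mapping and decode its gold answer."""
    missing = REQUIRED_KEYS - set(raw)
    if missing:
        raise ValueError(f"{raw.get('id', '<no id>')}: missing {sorted(missing)}")
    gold = json.loads(raw["gold_answer"])  # malformed JSON raises ValueError
    if not isinstance(gold, dict):
        raise ValueError(f"{raw['id']}: gold_answer must be a JSON object")
    return EvalEntry(
        id=raw["id"], category=raw["category"],
        prompt_en=raw["prompt_en"], prompt_es=raw["prompt_es"],
        gold=gold,
        expected_tool_calls=tuple(raw["expected_tool_calls"]),
        max_turns=int(raw["max_turns"]),
    )
```

Failing fast on a missing key or malformed gold answer keeps authoring mistakes from showing up later as spurious agent failures.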

Aim for coverage, not bulk:

8 entries: irregular-verb corrections (one per irregular verb).
6 entries: regular-verb conjugations (sample across the 12).
5 entries: tense identification ("what tense is wrote?").
5 entries: English ↔ Spanish translation pairs.
4 entries: adversarial / out-of-scope ("conjugate swim in past simple" → expected: agent declines, since swim ∉ §A13).
2 entries: ambiguous / underspecified inputs (the agent should request clarification or give up gracefully).
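
The category counts above sum to 30, which makes them easy to check mechanically while authoring. A small sketch of such a check: only `irregular_verb_correction` appears in the sample entry, so the other category names in `PLAN` are hypothetical placeholders to be replaced with whatever labels the eval set actually uses.

```python
from collections import Counter

# Target counts from the coverage plan. Category names other than
# `irregular_verb_correction` are assumed for illustration.
PLAN = {
    "irregular_verb_correction": 8,
    "regular_verb_conjugation": 6,
    "tense_identification": 5,
    "translation": 5,
    "out_of_scope": 4,
    "ambiguous": 2,
}


def coverage_gaps(categories: list[str]) -> dict[str, int]:
    """Return {category: actual - planned} for every category that is off-plan."""
    actual = Counter(categories)
    names = set(PLAN) | set(actual)
    return {name: actual[name] - PLAN.get(name, 0)
            for name in sorted(names)
            if actual[name] != PLAN.get(name, 0)}
```

An empty result means the set matches the plan; a negative value flags an under-covered category, and an unknown key usually indicates a typo in a `category` field.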

Task 2 — harness script

Create scripts/eval_grammar_tutor.py:

from pathlib import Path

def run_eval(eval_path: Path, model_ckpt: Path) -> EvalReport:
    """Load eval set, instantiate agent, run each prompt, accumulate metrics."""

The harness must:

Use the autouse seed fixture so two runs produce identical outputs (deterministic).
Time each prompt; record wall-clock and turn count.
Validate the agent's output against gold_answer using a structural matcher (not just ==): for the "correct_form" field, exact match; for the "spanish" field, allow synonyms from a small whitelist (e.g., comió and se comió both accepted for "ate").
Emit an EvalReport:

from dataclasses import dataclass

@dataclass
class EvalReport:
    n_prompts: int
    correct: int                 # exact-match on correct_form
    faithful: int                # gold supported by retrieved chunks / tool outputs
    declined: int                # agent gave up; gold also "declined"
    failed: int                  # agent gave up; gold expected an answer
    mean_turns: float
    mean_tool_calls: float
    p95_latency_ms: float
    by_category: dict[str, dict[str, float]]
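
The structural matcher from the requirements above (exact match on `correct_form`, whitelist-tolerant match on `spanish`) could be sketched as follows. The whitelist layout, the `matches_gold` name, and the case and whitespace normalization are assumptions; the comió / se comió pair is the example given in the task text.

```python
# Hypothetical whitelist keyed by the gold Spanish form.
SPANISH_SYNONYMS = {"comió": {"se comió"}}


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.strip().lower().split())


def matches_gold(output: dict, gold: dict) -> bool:
    """Exact match on correct_form; whitelist-tolerant match on spanish."""
    if output.get("correct_form") != gold.get("correct_form"):
        return False
    if "spanish" not in gold:
        return True
    got = _norm(output.get("spanish") or "")
    want = _norm(gold["spanish"])
    return got == want or got in SPANISH_SYNONYMS.get(want, set())
```

Keeping the whitelist keyed by the gold form means each accepted synonym is tied to one specific answer rather than being accepted globally.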

Task 3 — DoD thresholds

The Phase 32 close criteria, encoded in the harness's exit code (non-zero if any threshold fails):

Metric	Threshold
`correct / n_prompts`	≥ 0.85 (after LoRA fine-tune from Phase 28)
`faithful / correct`	≥ 0.95 (no hallucinated citations)
`mean_turns`	≤ 3.5
`mean_tool_calls`	≤ 2.5
`p95_latency_ms`	≤ 1500 on Borja's i5-8250U
`declined` on out-of-scope inputs	= 4 (matches the 4 out-of-scope prompts)
`failed`	≤ 2 (allow 2 inputs to fail without blocking close)

If correct / n_prompts < 0.85, the report should print the five worst prompts, showing the agent's actual answer next to the gold answer for a visual diff. This is the diagnostic input to the Phase 32 reflection.
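
One way to encode the seven thresholds from the table as an exit code is a list of (name, value, comparison, limit) checks. This is a sketch under stated assumptions: `Summary` is a hypothetical subset of `EvalReport`, `declined` is assumed to count only out-of-scope prompts, and a zero denominator is treated as a failing ratio.

```python
import operator
from dataclasses import dataclass


@dataclass
class Summary:  # hypothetical subset of EvalReport used by the gate
    n_prompts: int
    correct: int
    faithful: int
    declined: int   # assumed to count out-of-scope declines only
    failed: int
    mean_turns: float
    mean_tool_calls: float
    p95_latency_ms: float


def _ratio(num: int, den: int) -> float:
    """Return num / den, or 0.0 when den is zero (assumed to fail the gate)."""
    return num / den if den else 0.0


def failed_thresholds(r: Summary) -> list[str]:
    """Return the names of the DoD thresholds the report misses."""
    checks = [
        ("correct / n_prompts", _ratio(r.correct, r.n_prompts), operator.ge, 0.85),
        ("faithful / correct", _ratio(r.faithful, r.correct), operator.ge, 0.95),
        ("mean_turns", r.mean_turns, operator.le, 3.5),
        ("mean_tool_calls", r.mean_tool_calls, operator.le, 2.5),
        ("p95_latency_ms", r.p95_latency_ms, operator.le, 1500),
        ("declined", r.declined, operator.eq, 4),
        ("failed", r.failed, operator.le, 2),
    ]
    return [name for name, value, op, limit in checks if not op(value, limit)]


def exit_code(r: Summary) -> int:
    """Non-zero if any threshold fails, matching the close criteria."""
    return 1 if failed_thresholds(r) else 0
```

Returning the list of failing names, rather than a bare boolean, gives the harness what it needs to print a clear failure summary before exiting.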

Task 4 — JSON-line trace log

For each prompt, emit one JSON line to experiments/<date>-32-tutor-eval/traces.jsonl containing:

{
  "id": "tut-01-eat-pst-3s",
  "prompt_en": "Correct this sentence: 'He eated the apple.'",
  "scratchpad": [{"role": "user", "text": "..."}, {"role": "model", "action": "tool_call", ...}, ...],
  "agent_output": {"correct_form": "ate", "spanish": "comió"},
  "gold": {...},
  "metrics": {"correct": true, "faithful": true, "turns": 2, "tool_calls": 1, "latency_ms": 437}
}

This trace is the audit log a teacher uses to diagnose individual failures. The journal-summarizer subagent can later digest it.
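
A sketch of a deterministic JSON-lines writer for these traces follows. Sorted keys and fixed separators keep serialization stable across runs. Note that a wall-clock field such as `latency_ms` will naturally differ between runs, so this sketch assumes run-to-run comparisons are made with that field stripped; that is an assumption of the example, not something the lab specifies. All function names are hypothetical.

```python
import json
from pathlib import Path


def trace_line(trace: dict) -> str:
    """Serialize one trace deterministically: sorted keys, fixed separators."""
    return json.dumps(trace, sort_keys=True, ensure_ascii=False,
                      separators=(",", ":"))


def write_traces(traces: list[dict], path: Path) -> None:
    """Write one JSON object per line, UTF-8, with LF line endings."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8", newline="\n") as fh:
        for trace in traces:
            fh.write(trace_line(trace) + "\n")


def without_latency(trace: dict) -> dict:
    """Copy of a trace with the wall-clock field removed, for run-to-run diffs."""
    metrics = {key: value for key, value in trace.get("metrics", {}).items()
               if key != "latency_ms"}
    return {**trace, "metrics": metrics}
```

`ensure_ascii=False` keeps forms such as comió readable in the log, which matters when a teacher inspects individual failures by eye.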

Task 5 — convergence with capstone tracker (A4)

The Phase 41 portal's exam mode reads the same data/eval/phase-32-grammar-tutor.yaml. The harness here and the portal's exam runner must agree on the metrics — both consume the same YAML, both produce comparable EvalReport shapes. This convergence is the §A4 hook: the harness is the offline eval; the portal is the online eval; both reach the same conclusion about whether the tutor is good enough.

The portal's contract is documented in src/miniportal/BLUEPRINT.md §"Exam scoring". Your harness's output must satisfy:

Same metric names (correct, faithful, etc.).
Same threshold values (≥ 0.85 correctness, etc.).
Same trace-log JSON schema.

Measurements to capture

All EvalReport fields above.
Per-category breakdown: irregular-verb category accuracy, regular-verb accuracy, OOS-decline accuracy.
A histogram of turns and tool_calls across the 30 prompts.
A per-prompt latency scatter.
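
The measurements above can be derived from per-prompt result rows with the standard library alone. A minimal sketch, assuming each row is a dict with hypothetical `category` and `correct` keys, and using the nearest-rank definition of the 95th percentile (one of several common definitions; the lab does not say which one to use):

```python
import math
from collections import Counter, defaultdict


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("p95 of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def by_category(rows: list[dict]) -> dict[str, dict[str, float]]:
    """Per-category prompt count and accuracy from per-prompt result rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["category"]].append(row["correct"])
    return {cat: {"n": float(len(flags)), "accuracy": sum(flags) / len(flags)}
            for cat, flags in groups.items()}


def text_histogram(counts: list[int]) -> str:
    """One line per distinct value, e.g. for turns or tool_calls."""
    hist = Counter(counts)
    return "\n".join(f"{value:>2} | {'#' * hist[value]}"
                     for value in sorted(hist))
```

With 30 prompts, nearest-rank p95 is the 29th smallest latency, so a single slow outlier does not move it but two do.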

Acceptance

data/eval/phase-32-grammar-tutor.yaml has ~30 entries spanning all 6 categories above.
scripts/eval_grammar_tutor.py runs to completion on Borja's CPU in ≤ 90 s.
Re-running the harness with the same seed produces byte-identical traces.jsonl.
All seven DoD thresholds pass (or the harness exits non-zero with a clear failure summary).
The trace JSON-line schema matches the portal's expected format.
The eval set is published under data/eval/ and tracked by DVC (dvc add data/eval/phase-32-grammar-tutor.yaml) so the version is reproducible.

Pitfalls to expect

Forgetting the seed: two runs disagree on a couple of prompts because Mini-GPT's sampler is non-deterministic. Set temperature=0.0 for eval runs; if a prompt fails at T=0, that's the real signal.
Counting OOS-declines as failures: an out-of-scope prompt ("conjugate swim") the agent correctly declines is a SUCCESS, not a failure. The declined metric is separate from failed. Most evaluators conflate them — be careful here.
Whitelist creep on Spanish synonyms: it is tempting to whitelist corrió, corría, and correría all for "ran". Resist. Add synonyms one at a time as you encounter false negatives; document each addition in the YAML with a comment.
Trace log size: ~30 traces × ~5 KiB each ≈ 150 KiB per run. Don't commit traces; they're experiment artifacts (add experiments/ to .gitignore).
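
The distinction between a correct decline and a failure, described in the pitfalls above, is easiest to keep straight as a single classification function. A sketch: `declined` and `failed` mirror the `EvalReport` fields, while the `incorrect` bucket is a hypothetical addition for cases the report fields do not name (an answer given where gold expects a decline, or an answer that does not match gold).

```python
def classify(agent_declined: bool, gold_declined: bool, matched: bool) -> str:
    """Map one prompt's result onto the report buckets."""
    if agent_declined:
        # Declining is a success only when gold also expects a decline.
        return "declined" if gold_declined else "failed"
    if gold_declined:
        return "incorrect"  # answered a prompt that should have been declined
    return "correct" if matched else "incorrect"
```

For the "conjugate swim" prompt, `classify(True, True, False)` lands in `declined`, so it never counts toward `failed`.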

Cross-references

docs/phase-20-evaluation-harness/ — the eval harness pattern reuses Phase 20's structure for grammar tasks.
src/miniportal/BLUEPRINT.md §"Exam scoring" — the contract this harness must converge with.
docs/extension-track/X3-rlhf-dpo/lab/01-dpo-on-grammar-tutor.md — DPO uses this harness as the reward signal; the metrics here become the DPO objective.

Next: with the harness green, write PHASE_32_REPORT.md summarizing the agent's score on the 30 prompts and reflect on which categories needed which Phase 26-31 components most.