Lab 04 — Grammar-Tutor Evaluator Harness

Build an evaluator harness (~30 §A13 prompts) that runs the grammar-tutor agent against known ground truth, computes accuracy, faithfulness, and mean cost in turns and tool calls, and emits a machine-readable report. It uses the same format as the A4 capstone tracker so that both converge at the close of Phase 32.

Read theory/05-agent-loop-architecture.md and lab/01-tutor-end-to-end.md before starting. Do not consult solutions/.

Objective

Wire up a deterministic, reproducible evaluation pipeline for the grammar-tutor agent. The harness reads a YAML file of ~30 §A13 prompts with gold answers, runs each prompt through the agent, and produces a structured report. The report drives Phase 32's Definition of Done (DoD).

Scope (§A13 only)

The eval set covers:

  • 20 verbs (12 regular + 8 irregular).
  • 5 tenses (infinitive, present simple, past simple, past participle, simple future).
  • 3 persons (1st sg, 2nd sg, 3rd sg).
  • Prompts in both English and Spanish (bilingual policy, §A2).

Out of scope:

  • Plural persons (1st/2nd/3rd plural) — deferred per §A13.
  • Verbs outside the 20 listed in §A13.
  • Tenses outside the 5 listed.

The harness enforces these constraints via the JSON-Schema mask on the agent's tool calls (Phase 30 → 32 cross-reference).
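To make the scope constraint concrete, here is a minimal sketch of what an enum-based mask could look like. Everything named here is an assumption for illustration: the argument shape of the conjugate tool, the field names, and the three-verb placeholder list (the real 20 verbs are defined in §A13). The checker is a hand-rolled stand-in that understands only the "required" and "enum" keywords, not a full JSON-Schema validator.

```python
# Hypothetical argument schema for a `conjugate` tool call.
# IN_SCOPE_VERBS is a placeholder; the real list of 20 verbs comes from §A13.
IN_SCOPE_VERBS = ["eat", "write", "walk"]

CONJUGATE_ARGS_SCHEMA = {
    "type": "object",
    "properties": {
        "verb": {"enum": IN_SCOPE_VERBS},
        "tense": {"enum": ["infinitive", "present simple", "past simple",
                           "past participle", "simple future"]},
        "person": {"enum": ["1st singular", "2nd singular", "3rd singular"]},
    },
    "required": ["verb", "tense", "person"],
}


def scope_violations(args: dict, schema: dict = CONJUGATE_ARGS_SCHEMA) -> list[str]:
    """Return human-readable violations; an empty list means in scope.

    Only the `required` and `enum` keywords used above are handled.
    """
    problems = []
    for name in schema["required"]:
        if name not in args:
            problems.append(f"missing field: {name}")
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unknown field: {name}")
        elif value not in spec["enum"]:
            problems.append(f"out of scope {name}: {value!r}")
    return problems
```

A call such as scope_violations({"verb": "swim", "tense": "past simple", "person": "3rd singular"}) reports the verb as out of scope, which is the signal the agent would need in order to decline.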

Tasks

Task 1 — author the eval set

Create data/eval/phase-32-grammar-tutor.yaml with ~30 entries. Schema (per entry):

- id: tut-01-eat-pst-3s
  category: irregular_verb_correction
  prompt_en: "Correct this sentence: 'He eated the apple.'"
  prompt_es: "Corrige esta oración: 'He eated the apple.'"
  gold_answer: |
    {"original": "He eated the apple.",
     "verb": "eat", "tense": "past simple", "person": "3rd singular",
     "correct_form": "ate", "spanish": "comió"}
  expected_tool_calls: ["conjugate"]
  max_turns: 3

Aim for coverage, not bulk:

  • 8 entries: irregular-verb corrections (one per irregular verb).
  • 6 entries: regular-verb conjugations (sample across the 12).
  • 5 entries: tense identification ("what tense is wrote?").
  • 5 entries: English ↔ Spanish translation pairs.
  • 4 entries: adversarial / out-of-scope ("conjugate swim in past simple" → expected: agent declines, since swim ∉ §A13).
  • 2 entries: ambiguous / underspecified inputs (the agent should request clarification or give up gracefully).
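The category mix above can be checked mechanically once the YAML has been parsed into a list of dicts. The sketch below assumes that parsing has already happened. Apart from irregular_verb_correction, which appears in the sample entry, the category labels are hypothetical names chosen for this example.

```python
from collections import Counter

# Target counts copied from the coverage list. Category names other than
# "irregular_verb_correction" are hypothetical labels.
TARGET_COUNTS = {
    "irregular_verb_correction": 8,
    "regular_verb_conjugation": 6,
    "tense_identification": 5,
    "translation": 5,
    "out_of_scope": 4,
    "ambiguous": 2,
}


def coverage_gaps(entries: list[dict]) -> dict[str, int]:
    """Map each category to (target - actual); all zeros means the mix matches."""
    actual = Counter(entry["category"] for entry in entries)
    gaps = {cat: target - actual.get(cat, 0) for cat, target in TARGET_COUNTS.items()}
    # Categories that are not in the plan at all show up as negative gaps.
    for cat in actual.keys() - TARGET_COUNTS.keys():
        gaps[cat] = -actual[cat]
    return gaps
```

A positive gap means a category is still short of its target; a negative gap means it is over target or misspelled.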

Task 2 — harness script

Create scripts/eval_grammar_tutor.py:

def run_eval(eval_path: Path, model_ckpt: Path) -> EvalReport:
    """Load eval set, instantiate agent, run each prompt, accumulate metrics."""

The harness must:

  1. Use the autouse seed fixture so two runs produce identical outputs (deterministic).
  2. Time each prompt; record wall-clock and turn count.
  3. Validate the agent's output against gold_answer using a structural matcher (not just ==): for the "correct_form" field, exact match; for the "spanish" field, allow synonyms from a small whitelist (e.g., comió and se comió both accepted for "ate").
  4. Emit an EvalReport:
from dataclasses import dataclass

@dataclass
class EvalReport:
    n_prompts: int
    correct: int                 # exact-match on correct_form
    faithful: int                # gold supported by retrieved chunks / tool outputs
    declined: int                # agent gave up; gold also "declined"
    failed: int                  # agent gave up; gold expected an answer
    mean_turns: float
    mean_tool_calls: float
    p95_latency_ms: float
    by_category: dict[str, dict[str, float]]
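The structural matcher from step 3 could be sketched as follows. It assumes gold_answer arrives as the JSON string shown in the eval-set entry and that the agent's output is already a dict. The SPANISH_SYNONYMS table and the function name are hypothetical; only the comió / se comió pair comes from the text above.

```python
import json

# Hypothetical whitelist: gold Spanish form -> extra accepted variants.
SPANISH_SYNONYMS = {"comió": {"se comió"}}


def matches_gold(agent_output: dict, gold_answer: str) -> bool:
    """Structural match: exact on correct_form, whitelist-tolerant on spanish."""
    gold = json.loads(gold_answer)
    if agent_output.get("correct_form") != gold["correct_form"]:
        return False
    accepted = {gold["spanish"]} | SPANISH_SYNONYMS.get(gold["spanish"], set())
    return agent_output.get("spanish") in accepted
```

Keeping the whitelist keyed by the gold form means a synonym is only ever accepted for the specific answer it was added for.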

Task 3 — DoD thresholds

The Phase 32 close criteria, encoded in the harness's exit code (non-zero if any threshold fails):

Metric                               Threshold
correct / n_prompts                  ≥ 0.85 (after LoRA fine-tune from Phase 28)
faithful / correct                   ≥ 0.95 (no hallucinated citations)
mean_turns                           ≤ 3.5
mean_tool_calls                      ≤ 2.5
p95_latency_ms                       ≤ 1500 on Borja's i5-8250U
declined on out-of-scope inputs      = 4 (matches the 4 out-of-scope prompts)
failed                               ≤ 2 (allow 2 inputs to fail without blocking close)

If correct / n_prompts falls below 0.85, the report should print the worst 5 prompts, each with the agent's actual answer next to the gold answer for a visual diff. This is the diagnostic input to the Phase 32 reflection.
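One possible way to encode the seven thresholds is a table of named boolean checks. In this sketch the report is passed as a plain dict with the EvalReport field names so the example stays self-contained. The function name and the n_out_of_scope parameter are assumptions, not part of the lab's required interface.

```python
def failed_thresholds(r: dict, n_out_of_scope: int = 4) -> list[str]:
    """Return the names of DoD checks that fail; an empty list means exit code 0.

    `r` holds the EvalReport fields as a plain dict. Assumes n_prompts > 0.
    """
    checks = {
        "correctness": r["correct"] / r["n_prompts"] >= 0.85,
        # Guard the ratio: with zero correct answers the check simply fails.
        "faithfulness": r["correct"] > 0 and r["faithful"] / r["correct"] >= 0.95,
        "mean_turns": r["mean_turns"] <= 3.5,
        "mean_tool_calls": r["mean_tool_calls"] <= 2.5,
        "p95_latency_ms": r["p95_latency_ms"] <= 1500,
        "declined": r["declined"] == n_out_of_scope,
        "failed": r["failed"] <= 2,
    }
    return [name for name, ok in checks.items() if not ok]
```

The harness could then finish with sys.exit(1 if failed_thresholds(report_dict) else 0) and print the returned names as the failure summary.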

Task 4 — JSON-line trace log

For each prompt, emit one JSON line to experiments/<date>-32-tutor-eval/traces.jsonl containing:

{
  "id": "tut-01-eat-pst-3s",
  "prompt_en": "Correct this sentence: 'He eated the apple.'",
  "scratchpad": [{"role": "user", "text": "..."}, {"role": "model", "action": "tool_call", ...}, ...],
  "agent_output": {"correct_form": "ate", "spanish": "comió"},
  "gold": {...},
  "metrics": {"correct": true, "faithful": true, "turns": 2, "tool_calls": 1, "latency_ms": 437}
}

This trace is the audit log a teacher uses to diagnose individual failures. The journal-summarizer subagent can later digest it.
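A minimal sketch of the trace writer, assuming each trace is already a JSON-serializable dict. The helper names are hypothetical. Sorting keys and disabling ASCII escaping are design choices made here for stable, readable output; the lab text does not require them.

```python
import json
from pathlib import Path


def append_trace(trace: dict, path: Path) -> None:
    """Append one trace as a single JSON line.

    sort_keys gives a stable key order and ensure_ascii=False keeps "comió"
    readable instead of escaping it.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    line = json.dumps(trace, sort_keys=True, ensure_ascii=False)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def read_traces(path: Path) -> list[dict]:
    """Load every trace back, one dict per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because the file is opened in append mode, a fresh run should write to a new experiments/<date>-32-tutor-eval/ directory or delete the old file first.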

Task 5 — convergence with capstone tracker (A4)

The Phase 41 portal's exam mode reads the same data/eval/phase-32-grammar-tutor.yaml. The harness here and the portal's exam runner must agree on the metrics — both consume the same YAML, both produce comparable EvalReport shapes. This convergence is the §A4 hook: the harness is the offline eval; the portal is the online eval; both reach the same conclusion about whether the tutor is good enough.

The portal's contract is documented in src/miniportal/BLUEPRINT.md §"Exam scoring". Your harness's output must satisfy:

  • Same metric names (correct, faithful, etc.).
  • Same threshold values (≥ 0.85 correctness, etc.).
  • Same trace-log JSON schema.

Measurements to capture

  • All EvalReport fields above.
  • Per-category breakdown: irregular-verb category accuracy, regular-verb accuracy, OOS-decline accuracy.
  • A histogram of turns and tool_calls across the 30 prompts.
  • A per-prompt latency scatter.
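The histogram and the p95 figure need only the standard library. The sketch below uses the nearest-rank definition of a percentile, which is an assumption: the lab does not say which percentile definition the portal uses, and interpolating definitions give slightly different values on 30 samples.

```python
from collections import Counter


def histogram(values: list[int]) -> dict[int, int]:
    """Count how many prompts took each number of turns (or tool calls)."""
    return dict(sorted(Counter(values).items()))


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile.

    Returns the smallest sample such that at least 95% of samples are
    at or below it.
    """
    if not latencies_ms:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100  # 1-based rank
    return ordered[rank - 1]
```

With 30 prompts the nearest-rank p95 is the 29th smallest latency, so a single slow outlier does not move it but two do.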

Acceptance

  • data/eval/phase-32-grammar-tutor.yaml has ~30 entries spanning all 6 categories above.
  • scripts/eval_grammar_tutor.py runs to completion on Borja's CPU in ≤ 90 s.
  • Re-running the harness with the same seed produces byte-identical traces.jsonl.
  • All seven DoD thresholds pass (or the harness exits non-zero with a clear failure summary).
  • The trace JSON-line schema matches the portal's expected format.
  • The eval set is published under data/eval/ and tracked by DVC (dvc add data/eval/phase-32-grammar-tutor.yaml) so the version is reproducible.
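The byte-identical criterion can be verified by hashing traces.jsonl from two runs. One caveat is worth stating as an assumption: the example trace includes latency_ms, which is wall-clock and will usually vary between runs, so a strict byte comparison only holds if that value is stable or left out. The sketch below offers both a strict digest and a hypothetical normalized digest that drops named metric keys before hashing.

```python
import hashlib
import json


def file_digest(raw: bytes) -> str:
    """Strict check: SHA-256 of the file bytes exactly as written."""
    return hashlib.sha256(raw).hexdigest()


def trace_digest(jsonl_text: str, ignore_metrics: frozenset = frozenset()) -> str:
    """Normalized check: hash each trace after dropping volatile metric keys."""
    h = hashlib.sha256()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        trace = json.loads(line)
        metrics = trace.get("metrics", {})
        for key in ignore_metrics:
            metrics.pop(key, None)
        # Re-serialize with sorted keys so key order cannot affect the hash.
        canonical = json.dumps(trace, sort_keys=True, ensure_ascii=False)
        h.update(canonical.encode("utf-8") + b"\n")
    return h.hexdigest()
```

Comparing trace_digest(run_a, frozenset({"latency_ms"})) with the same call on run_b shows whether the agent's behaviour was reproducible even when timings differ.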

Pitfalls to expect

  • Forgetting the seed: two runs disagree on a couple of prompts because Mini-GPT's sampler is non-deterministic. Set temperature=0.0 for eval runs; if a prompt fails at T=0, that's the real signal.
  • Counting OOS declines as failures: when the agent correctly declines an out-of-scope prompt ("conjugate swim"), that is a SUCCESS, not a failure. The declined metric is separate from failed, and conflating the two is an easy mistake, so keep them in separate counters.
  • Whitelist creep on spanish synonyms: it is tempting to accept corrió, corría, and correría all as matches for gold corrió. Resist: corría (imperfect) and correría (conditional) are different tenses, not synonyms. Add synonyms one at a time as you encounter false negatives; document each addition in the YAML with a comment.
  • Trace log size: ~30 traces × ~5 KiB each = 150 KiB per run. Don't commit traces; they're experiment artifacts (.gitignore experiments/).
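The declined-versus-failed distinction is easiest to keep straight as a small decision function. This is a sketch under assumptions: the function name and its three boolean inputs are hypothetical, and the "incorrect" bucket is added here because EvalReport has no named counter for an answer that was given but wrong.

```python
def classify(agent_declined: bool, gold_declined: bool, answer_matches: bool) -> str:
    """Bucket one prompt into the EvalReport counters.

    The "incorrect" bucket (answered, but wrongly or when a decline was
    expected) is an assumption: the report has no named counter for it.
    """
    if agent_declined:
        # Declining is a success only when the gold answer is also a decline.
        return "declined" if gold_declined else "failed"
    if gold_declined:
        return "incorrect"  # answered a prompt it should have declined
    return "correct" if answer_matches else "incorrect"
```

Under this scheme a correct refusal of "conjugate swim" lands in declined, while giving up on an in-scope prompt lands in failed.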

Cross-references

  • docs/phase-20-evaluation-harness/ — the eval harness pattern reuses Phase 20's structure for grammar tasks.
  • src/miniportal/BLUEPRINT.md §"Exam scoring" — the contract this harness must converge with.
  • docs/extension-track/X3-rlhf-dpo/lab/01-dpo-on-grammar-tutor.md — DPO uses this harness as the reward signal; the metrics here become the DPO objective.

Next: with the harness green, write PHASE_32_REPORT.md summarizing the agent's score on the 30 prompts and reflect on which categories needed which Phase 26-31 components most.