Lab 01 — Tutor agent on 30 canonical sentences

Read theory/00-motivation.md and theory/01-react-and-planning.md. Do not consult solutions/.

Objective

Wire Planner, ScratchpadMemory, LongTermMemory, and the Phase 31 MCP tools into a GrammarTutorAgent. Run it on the 30-sentence canonical test set. Achieve ≥ 90% correctness. Capture the full trace for one happy path, one ambiguity, and one out-of-scope refusal.

Setup

Files to create / extend:

src/miniagent/agent.py — GrammarTutorAgent
src/miniagent/types.py — CorrectionResult, Step, PlannerState
tests/test_agent_loop.py — the 30-sentence regression test
experiments/<date>-phase-32-tutor-demo/ — transcripts + summary

Tasks

Task 1 — assemble the canonical test set

Create experiments/<date>-phase-32-tutor-demo/test_sentences.json with 30 entries:

10 happy-path corrections (clear input, single error). Examples: "He goed to school" → "He went to school", "I has a book" → "I have a book", "She don't like it" → "She doesn't like it".
5 already-correct sentences (no correction needed). Example: "I work every day". The agent should return corrected = None, since the original is already correct.
5 ambiguous sentences (multiple valid corrections). Example: "I will going to the store" could become "I will go" or "I am going". Pick one canonical correction per sentence and record it as the expected answer.
5 multi-error sentences (≥ 2 errors). Example: "He goed and she have went home". Agent should fix one error and report the rest as additional_issues.
5 out-of-scope sentences (sentences that use plural pronouns, non-§A13 verbs, or non-§A13 tenses). Example: "They are running." The agent should return corrected = None, in_scope = False, rationale = ["plural pronoun out of scope"].

Each entry:

{
  "id": 1,
  "original": "He goed to school.",
  "expected": {
    "corrected": "He went to school.",
    "in_scope": true,
    "rationale_keywords": ["irregular", "past", "went"],
    "must_call_tools": ["lookup_irregular_verb", "conjugate"]
  }
}

The must_call_tools field is the gold path through tool space — useful for diagnosing where the agent's planner deviates.
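Before running the agent, it can help to check that the file is well-formed. The following is a minimal sketch, assuming every entry (including out-of-scope ones) carries the same four keys under "expected" as the example above; `validate_entries` and `REQUIRED_EXPECTED_KEYS` are hypothetical names, not part of the lab code.

```python
# Hypothetical helper: the key set mirrors the example entry shown above.
REQUIRED_EXPECTED_KEYS = {"corrected", "in_scope", "rationale_keywords", "must_call_tools"}


def validate_entries(entries, expected_total=30):
    """Return a list of human-readable problems; an empty list means the set looks well-formed."""
    problems = []
    if len(entries) != expected_total:
        problems.append(f"expected {expected_total} entries, found {len(entries)}")
    seen_ids = set()
    for entry in entries:
        entry_id = entry.get("id")
        if entry_id in seen_ids:
            problems.append(f"duplicate id: {entry_id}")
        seen_ids.add(entry_id)
        original = entry.get("original")
        if not isinstance(original, str) or not original:
            problems.append(f"entry {entry_id}: 'original' must be a non-empty string")
        # Assumption: every entry uses the same schema under "expected".
        missing = REQUIRED_EXPECTED_KEYS - set(entry.get("expected", {}))
        if missing:
            problems.append(f"entry {entry_id}: missing expected keys {sorted(missing)}")
    return problems
```

Load the file with `json.load` and pass the resulting list in; an empty return value means the structure is consistent, though it says nothing about whether the expected corrections themselves are right.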

Task 2 — implement `GrammarTutorAgent`

class GrammarTutorAgent:
    def __init__(
        self,
        planner: Planner | MockPlanner,
        tool_dispatcher: MCPClient,
        long_term: LongTermMemory,
        max_steps: int = 8,
    ):
        ...

    def correct(self, sentence: str, learner_id: str = "borja") -> CorrectionResult:
        scratchpad = ScratchpadMemory()  # fresh per call
        for step_index in range(self.max_steps):
            state = PlannerState(
                original=sentence,
                scratchpad=scratchpad,
                long_term_view=self.long_term.view_for(learner_id, sentence),
                step_index=step_index,
            )
            step = self.planner.next_step(state)

            if isinstance(step, FinalAnswer):
                self._update_long_term(learner_id, sentence, step.answer)
                return CorrectionResult(
                    original=sentence,
                    corrected=step.answer.corrected,
                    rationale=step.answer.rationale,
                    spanish_gloss=step.answer.spanish_gloss,
                    in_scope=step.answer.in_scope,
                    tool_trace=scratchpad.tool_calls(),
                )
            elif isinstance(step, ToolCall):
                if self._is_duplicate_action(step, scratchpad):
                    return self._budget_exhausted_result(sentence, scratchpad, reason="loop")
                result = run_under_sandbox(
                    self.tool_dispatcher.get_tool(step.tool),
                    step.args,
                    policy=SandboxPolicy.PERMISSIVE,
                )
                scratchpad.append(thought="", action=step, observation=result)
            else:
                raise TypeError(f"unexpected step type: {type(step)}")

        return self._budget_exhausted_result(sentence, scratchpad, reason="budget")

Constraints:

scratchpad is constructed inside correct(), not stored on self.
long_term is shared across corrections (it's persistent state).
All tools are dispatched via the Phase 31 MCP client (no direct imports).
Duplicate-action detection: if the same (tool, args) appears twice in a row, halt.
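The duplicate-action constraint can be sketched as follows. This is one possible shape for the check, not the reference solution: the `ToolCall` dataclass here is a hypothetical stand-in for the lab's type, and the function takes the list of previous calls directly rather than a ScratchpadMemory. Serialising args with sorted keys is an assumption that args are JSON-serialisable.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    # Hypothetical stand-in for the lab's ToolCall type.
    tool: str
    args: dict = field(default_factory=dict)


def action_key(call: ToolCall) -> tuple[str, str]:
    """Canonical key: tool name plus args serialised with sorted keys."""
    return call.tool, json.dumps(call.args, sort_keys=True)


def is_duplicate_action(step: ToolCall, previous_calls: list[ToolCall]) -> bool:
    """True when `step` repeats the immediately preceding tool call."""
    if not previous_calls:
        return False
    return action_key(step) == action_key(previous_calls[-1])
```

Comparing only against the last call matches the "twice in a row" wording; a planner that alternates between two calls would not be caught by this check and would instead run into the max_steps budget.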

Task 3 — wire it up against MockPlanner first

Before plugging in the real (untrained Mini-GPT) planner, run against MockPlanner from Lab 00. The mock returns scripted steps for the 30 sentences. This verifies the loop works before introducing model uncertainty.

Run on all 30 sentences:

agent = GrammarTutorAgent(planner=MockPlanner(scripts), ...)
results = [agent.correct(s["original"]) for s in test_sentences]

Expected: ≥ 95% correctness against MockPlanner. With correctly written scripts this should be close to 100%, comfortably above the ≥ 90% acceptance threshold. Any failures here are agent-loop bugs, not planner bugs.
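Lab 00's MockPlanner is not reproduced here. As a rough illustration of what "scripted steps" means, the sketch below replays a per-sentence list of steps and fails loudly when a script is missing. `ScriptedPlanner`, `State`, and `MissingScriptError` are hypothetical names, and keying scripts by the original sentence is an assumption.

```python
from dataclasses import dataclass


class MissingScriptError(LookupError):
    """Raised when the mock has no script (or no further step) for a sentence."""


@dataclass
class State:
    # Hypothetical, trimmed-down stand-in for PlannerState.
    original: str
    step_index: int


class ScriptedPlanner:
    """Hypothetical stand-in for MockPlanner: replays scripted steps per sentence."""

    def __init__(self, scripts: dict[str, list]):
        self._scripts = scripts

    def next_step(self, state: State):
        try:
            script = self._scripts[state.original]
        except KeyError:
            # Replace the bare KeyError with a message that names the sentence.
            raise MissingScriptError(f"no script for sentence: {state.original!r}") from None
        if state.step_index >= len(script):
            raise MissingScriptError(
                f"script for {state.original!r} has {len(script)} steps, "
                f"step {state.step_index} requested"
            )
        return script[state.step_index]
```

Because the mock is deterministic, any mismatch between its scripted final answer and the agent's returned result points at the loop itself.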

Task 4 — capture three transcripts

For inclusion in PHASE_32_REPORT.md:

Happy path — "He goed to school" — full trace.
Ambiguity — "I will going home" — full trace, showing how the agent picks one canonical correction.
Out-of-scope — "They went home" — full trace, showing the agent recognising and reporting out-of-scope.

For each, save:

{
  "sentence": "...",
  "trace": [
    {"step": 1, "action": {"tool": "...", "args": {...}}, "observation": "..."},
    {"step": 2, "action": {"tool": "...", "args": {...}}, "observation": "..."},
    {"step": 3, "action": "FINAL_ANSWER", "answer": {...}}
  ]
}
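One way to produce that structure is sketched below. It assumes each tool step is available as a (tool, args, observation) tuple, which may not match what scratchpad.tool_calls() actually returns in your implementation; `build_transcript` and `save_transcript` are hypothetical helper names.

```python
import json


def build_transcript(sentence, tool_steps, final_answer):
    """Assemble the transcript dict: numbered tool steps followed by the final answer."""
    trace = [
        {"step": i, "action": {"tool": tool, "args": args}, "observation": observation}
        for i, (tool, args, observation) in enumerate(tool_steps, start=1)
    ]
    # The final answer always takes the next step number after the last tool call.
    trace.append({"step": len(trace) + 1, "action": "FINAL_ANSWER", "answer": final_answer})
    return {"sentence": sentence, "trace": trace}


def save_transcript(transcript, path):
    """Write the transcript as indented UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(transcript, handle, indent=2, ensure_ascii=False)
```

Setting `ensure_ascii=False` keeps any non-ASCII characters in a Spanish gloss readable in the saved file.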

Task 5 — measure correctness

For the 30-sentence test set, report:

Accuracy (correct ÷ total) by category (happy / already-correct / ambiguous / multi-error / out-of-scope).
Mean steps per correction.
Mean tools called per correction.
Distribution of step counts (histogram, save as experiments/<date>-phase-32-tutor-demo/steps_histogram.png).

Goal: ≥ 90% overall accuracy with MockPlanner. With the real (untrained) planner, accuracy will likely be close to random; this is documented as the gap between Phase 17's "trained for forward correctness" and Phase 28's "fine-tuned for instruction-following."
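The metrics listed above can be computed with the standard library alone. The sketch below assumes you have already reduced each run to a record with a category label, a pass/fail flag, a step count, and a tool-call count; those field names and the `summarise` function are hypothetical, and the test-set entries as specified do not include a category field, so you would need to add or derive one.

```python
from collections import Counter, defaultdict
from statistics import mean


def summarise(records):
    """records: dicts with 'category' (str), 'passed' (bool), 'steps' (int), 'tool_calls' (int)."""
    if not records:
        raise ValueError("no records to summarise")
    per_category = defaultdict(list)
    for record in records:
        per_category[record["category"]].append(record["passed"])
    return {
        # True counts as 1, so sum(...) / len(...) is the pass rate.
        "accuracy_by_category": {c: sum(p) / len(p) for c, p in per_category.items()},
        "overall_accuracy": sum(r["passed"] for r in records) / len(records),
        "mean_steps": mean(r["steps"] for r in records),
        "mean_tool_calls": mean(r["tool_calls"] for r in records),
        # Maps step count -> number of sentences, sorted by step count.
        "step_histogram": dict(sorted(Counter(r["steps"] for r in records).items())),
    }
```

The `step_histogram` mapping holds the bar heights for steps_histogram.png; plotting it is left to whichever plotting library the project already uses.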

Task 6 — long-term memory update verification

After running the 30 sentences, inspect longterm.json. The mistake counters for verbs like go, have, and do should each record at least one mistake, and the per-tense accuracy stats should be populated.

A second run on the same 30 sentences should:

Produce the same corrections (deterministic with a fixed seed).
Show doubled mistake counts in longterm.json (since each sentence was corrected twice).
Possibly show different rationales if the agent injects long-term context into the prompt.
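The doubling check can be automated once the counters are extracted from the two snapshots. The schema of longterm.json is not specified in this lab, so the sketch below assumes the mistake counters have been pulled out into a flat verb-to-count mapping and that the file was empty before run 1; `counters_doubled` is a hypothetical helper.

```python
def counters_doubled(run1: dict[str, int], run2: dict[str, int]) -> list[str]:
    """Return verbs whose count after run 2 is not exactly twice the count after run 1."""
    mismatches = []
    # Walk the union so verbs that appear in only one snapshot are reported too.
    for verb in sorted(set(run1) | set(run2)):
        if run2.get(verb, 0) != 2 * run1.get(verb, 0):
            mismatches.append(verb)
    return mismatches
```

An empty list means every counter doubled. Any verb in the result is worth inspecting by hand, since it may indicate a missed update or a non-deterministic correction.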

Measurements to capture

30-sentence test results (per-sentence pass/fail).
Three transcripts (happy / ambiguity / out-of-scope).
Step distribution histogram.
longterm.json after one run.
Diff between longterm.json after run 1 and run 2.

Acceptance

GrammarTutorAgent.correct() works on all 30 sentences without raising.
MockPlanner-based accuracy ≥ 90%.
Three transcripts saved in the experiment dir.
Step histogram saved.
longterm.json updates correctly across runs.
Test tests/test_agent_loop.py is green.

Pitfalls to expect

Scratchpad leak. If scratchpad is stored on self, the second correct() sees the first's trace. Test for this explicitly: call correct("A") then correct("B"); the second call's scratchpad should have only B's steps.
Long-term update on out-of-scope. Should we increment mistake counters for out-of-scope sentences? Default: no (out-of-scope isn't a verb error, it's a corpus mismatch). Document the choice.
MockPlanner script gaps. If you scripted only 25 of 30 sentences, the 5 missing ones will hit a bare KeyError. Either script all 30 or have MockPlanner raise a clear error that names the sentence with no script.
The agent calls final_answer immediately. A buggy planner might emit final_answer as step 1 without calling any tools. The result is unreliable but structurally valid. Catch this in the test: assert that the tool trace contains at least one tool call before accepting an answer, except for "already correct" sentences.
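The scratchpad-leak test described above can be illustrated with two toy agents. Neither is the real GrammarTutorAgent: `LeakyAgent` and `FreshAgent` are hypothetical stubs that record the sentence itself in place of real steps, purely to show what the A-then-B check detects.

```python
class LeakyAgent:
    """Buggy pattern: the scratchpad lives on self and survives between calls."""

    def __init__(self):
        self.scratchpad = []

    def correct(self, sentence):
        self.scratchpad.append(sentence)
        return list(self.scratchpad)


class FreshAgent:
    """Intended pattern: the scratchpad is created inside correct()."""

    def correct(self, sentence):
        scratchpad = []
        scratchpad.append(sentence)
        return list(scratchpad)


def leaks_between_calls(agent) -> bool:
    """Call correct("A") then correct("B"); report whether B's trace contains anything but B."""
    agent.correct("A")
    second_trace = agent.correct("B")
    return second_trace != ["B"]
```

In the real test, the same two-call pattern applies, with the comparison made against the tool_trace of the second CorrectionResult.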

Next: 02-sandbox-an-evil-tool.md