Lab 01 — Tutor agent on 30 canonical sentences
Read theory/00-motivation.md and theory/01-react-and-planning.md. Do not consult solutions/.
Objective
Wire Planner, ScratchpadMemory, LongTermMemory, and the Phase 31 MCP tools into a GrammarTutorAgent. Run it on the 30-sentence canonical test set. Achieve ≥ 90% correctness. Capture the full trace for one happy path, one ambiguity, and one out-of-scope refusal.
Setup
Files to create / extend:
- src/miniagent/agent.py — GrammarTutorAgent
- src/miniagent/types.py — CorrectionResult, Step, PlannerState
- tests/test_agent_loop.py — the 30-sentence regression test
- experiments/<date>-phase-32-tutor-demo/ — transcripts + summary
Tasks
Task 1 — assemble the canonical test set
Create experiments/<date>-phase-32-tutor-demo/test_sentences.json with 30 entries:
- 10 happy-path corrections (clear input, single error). Examples: "He goed to school" → "He went to school", "I has a book" → "I have a book", "She don't like it" → "She doesn't like it".
- 5 already-correct sentences (no correction needed). Example: "I work every day". The agent should return corrected = None (since the original is correct).
- 5 ambiguous sentences (multiple valid corrections). Example: "I will going to the store", which could be "I will go" or "I am going". Pick one canonical correction.
- 5 multi-error sentences (≥ 2 errors). Example: "He goed and she have went home". The agent should fix one error and report the rest as additional_issues.
- 5 out-of-scope sentences (use plural pronouns, non-§A13 verbs, or non-§A13 tenses). Example: "They are running." The agent should return corrected = None, in_scope = False, rationale = ["plural pronoun out of scope"].
Each entry:
{
  "id": 1,
  "original": "He goed to school.",
  "expected": {
    "corrected": "He went to school.",
    "in_scope": true,
    "rationale_keywords": ["irregular", "past", "went"],
    "must_call_tools": ["lookup_irregular_verb", "conjugate"]
  }
}
The must_call_tools field is the gold path through tool space — useful for diagnosing where the agent's planner deviates.
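Before running the agent it can help to check that the file is well-formed. The following is a minimal sketch, not part of the lab's required code: `validate_entries` and `REQUIRED_EXPECTED_KEYS` are hypothetical names, and the checks cover only the keys shown in the example entry above.

```python
import json

# Keys copied from the example entry above; extend if your entries carry more.
REQUIRED_EXPECTED_KEYS = {"corrected", "in_scope", "rationale_keywords", "must_call_tools"}


def validate_entries(entries: list[dict], expected_count: int = 30) -> list[str]:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    if len(entries) != expected_count:
        problems.append(f"expected {expected_count} entries, found {len(entries)}")
    seen_ids = set()
    for entry in entries:
        entry_id = entry.get("id")
        if entry_id in seen_ids:
            problems.append(f"duplicate id: {entry_id}")
        seen_ids.add(entry_id)
        if not isinstance(entry.get("original"), str):
            problems.append(f"entry {entry_id}: 'original' must be a string")
        missing = REQUIRED_EXPECTED_KEYS - set(entry.get("expected", {}))
        if missing:
            problems.append(f"entry {entry_id}: missing expected keys {sorted(missing)}")
    return problems


# The example entry from above, used here as a one-element test set.
sample = json.loads("""
[{"id": 1, "original": "He goed to school.",
  "expected": {"corrected": "He went to school.", "in_scope": true,
               "rationale_keywords": ["irregular", "past", "went"],
               "must_call_tools": ["lookup_irregular_verb", "conjugate"]}}]
""")
print(validate_entries(sample, expected_count=1))
```

With the real file, load it with `json.load` and call `validate_entries` with the default count of 30.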
Task 2 — implement GrammarTutorAgent
class GrammarTutorAgent:
    def __init__(
        self,
        planner: Planner | MockPlanner,
        tool_dispatcher: MCPClient,
        long_term: LongTermMemory,
        max_steps: int = 8,
    ):
        ...

    def correct(self, sentence: str, learner_id: str = "borja") -> CorrectionResult:
        scratchpad = ScratchpadMemory()  # fresh per call
        for step_index in range(self.max_steps):
            state = PlannerState(
                original=sentence,
                scratchpad=scratchpad,
                long_term_view=self.long_term.view_for(learner_id, sentence),
                step_index=step_index,
            )
            step = self.planner.next_step(state)
            if isinstance(step, FinalAnswer):
                self._update_long_term(learner_id, sentence, step.answer)
                return CorrectionResult(
                    original=sentence,
                    corrected=step.answer.corrected,
                    rationale=step.answer.rationale,
                    spanish_gloss=step.answer.spanish_gloss,
                    in_scope=step.answer.in_scope,
                    tool_trace=scratchpad.tool_calls(),
                )
            elif isinstance(step, ToolCall):
                if self._is_duplicate_action(step, scratchpad):
                    return self._budget_exhausted_result(sentence, scratchpad, reason="loop")
                result = run_under_sandbox(
                    self.tool_dispatcher.get_tool(step.tool),
                    step.args,
                    policy=SandboxPolicy.PERMISSIVE,
                )
                scratchpad.append(thought="", action=step, observation=result)
            else:
                raise TypeError(f"unexpected step type: {type(step)}")
        return self._budget_exhausted_result(sentence, scratchpad, reason="budget")
Constraints:
- scratchpad is constructed inside correct(), not stored on self.
- long_term is shared across corrections (it's persistent state).
- All tools are dispatched via the Phase 31 MCP client (no direct imports).
- Duplicate-action detection: if the same (tool, args) appears twice in a row, halt.
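The duplicate-action rule can be sketched as a comparison against the most recent tool call only. This is one possible reading, under assumptions: the `ToolCall` class below is a stand-in with just the `tool` and `args` fields used in `correct()`, and `is_duplicate_action` takes a plain list of earlier calls rather than the lab's `ScratchpadMemory`, whose API is not specified here.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    # Stand-in for the lab's ToolCall; only the two fields used in correct() are modelled.
    tool: str
    args: dict[str, Any] = field(default_factory=dict)


def is_duplicate_action(step: ToolCall, previous_calls: list[ToolCall]) -> bool:
    """True when `step` repeats the immediately preceding (tool, args) pair."""
    if not previous_calls:
        return False
    last = previous_calls[-1]
    return (last.tool, last.args) == (step.tool, step.args)


history = [ToolCall("lookup_irregular_verb", {"verb": "go"})]
repeat = ToolCall("lookup_irregular_verb", {"verb": "go"})
different = ToolCall("conjugate", {"verb": "go", "tense": "past"})
print(is_duplicate_action(repeat, history))     # True: same tool, same args, back to back
print(is_duplicate_action(different, history))  # False: different tool
```

Because only the last call is inspected, a sequence A, B, A is not flagged. That matches "twice in a row"; catching longer cycles would need a different check.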
Task 3 — wire it up against MockPlanner first
Before plugging in the real (untrained Mini-GPT) planner, run against MockPlanner from Lab 00. The mock returns scripted steps for the 30 sentences. This verifies the loop works before introducing model uncertainty.
Run on all 30 sentences:
agent = GrammarTutorAgent(planner=MockPlanner(scripts), ...)
results = [agent.correct(s["original"]) for s in test_sentences]
Expected: ≥ 95% correctness against MockPlanner (with scripts written correctly, this is near-100%). Any failures here are agent loop bugs, not planner bugs.
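To keep script gaps from being mistaken for loop bugs, script coverage can be checked before the run. The sketch below assumes `scripts` is a dict keyed by the original sentence; the actual shape of the Lab 00 scripts may differ, and both function names are hypothetical.

```python
def find_unscripted(scripts: dict, test_sentences: list[dict]) -> list[str]:
    """Return the original sentences that have no scripted steps, in test-set order."""
    return [s["original"] for s in test_sentences if s["original"] not in scripts]


def require_full_coverage(scripts: dict, test_sentences: list[dict]) -> None:
    """Raise a descriptive error instead of letting a bare KeyError surface mid-run."""
    missing = find_unscripted(scripts, test_sentences)
    if missing:
        raise ValueError(f"MockPlanner has no script for {len(missing)} sentence(s): {missing}")


sentences = [{"original": "He goed to school"}, {"original": "I has a book"}]
partial_scripts = {"He goed to school": []}  # script contents omitted in this sketch
print(find_unscripted(partial_scripts, sentences))
```

Calling `require_full_coverage(scripts, test_sentences)` just before building the agent makes a missing script fail fast with the offending sentences listed.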
Task 4 — capture three transcripts
For inclusion in PHASE_32_REPORT.md:
- Happy path — "He goed to school" — full trace.
- Ambiguity — "I will going home" — full trace, showing how the agent picks one canonical correction.
- Out-of-scope — "They went home" — full trace, showing the agent recognising and reporting out-of-scope.
For each, save:
{
  "sentence": "...",
  "trace": [
    {"step": 1, "action": {"tool": "...", "args": {...}}, "observation": "..."},
    {"step": 2, "action": {"tool": "...", "args": {...}}, "observation": "..."},
    {"step": 3, "action": "FINAL_ANSWER", "answer": {...}}
  ]
}
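A transcript in this shape could be assembled with a small helper like the one below. It is only a sketch: `build_transcript` is a hypothetical name, its input (a list of dicts with `tool`, `args`, and `observation` keys) is an assumed intermediate format rather than the lab's scratchpad API, and the observation string in the usage example is a made-up placeholder.

```python
import json


def build_transcript(sentence: str, tool_steps: list[dict], final_answer: dict) -> dict:
    """Assemble a transcript dict: numbered tool steps followed by one FINAL_ANSWER step."""
    trace = [
        {
            "step": index,
            "action": {"tool": item["tool"], "args": item["args"]},
            "observation": item["observation"],
        }
        for index, item in enumerate(tool_steps, start=1)
    ]
    trace.append({"step": len(tool_steps) + 1, "action": "FINAL_ANSWER", "answer": final_answer})
    return {"sentence": sentence, "trace": trace}


transcript = build_transcript(
    "He goed to school",
    [{"tool": "lookup_irregular_verb", "args": {"verb": "go"}, "observation": "placeholder"}],
    {"corrected": "He went to school", "in_scope": True},
)
print(json.dumps(transcript, indent=2))
```

Writing the result with `json.dump(transcript, fh, indent=2)` keeps the saved files easy to diff between runs.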
Task 5 — measure correctness
For the 30-sentence test set, report:
- Accuracy (correct ÷ total) by category (happy / already-correct / ambiguous / multi-error / out-of-scope).
- Mean steps per correction.
- Mean tools called per correction.
- Distribution of step counts (histogram, save as experiments/<date>-phase-32-tutor-demo/steps_histogram.png).
Goal: ≥ 90% overall accuracy with MockPlanner. With the real (untrained) planner, accuracy will likely be ~random; this is documented as the gap between Phase 17's "trained for forward correctness" and Phase 28's "fine-tuned for instruction-following."
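The metrics listed above can be computed from per-sentence records along these lines. This is a sketch under assumptions: each record is a dict with hypothetical `category`, `passed`, `steps`, and `tool_calls` fields (the test-set schema shown earlier has no category field, so you would need to add one or derive it), and the sample records are invented solely to exercise the function. Plotting the histogram to a PNG is left out.

```python
from collections import Counter, defaultdict


def summarize(results: list[dict]) -> dict:
    """Compute overall and per-category accuracy, mean steps/tool calls, and step counts."""
    per_category = defaultdict(list)
    for record in results:
        per_category[record["category"]].append(record["passed"])
    total = len(results)
    return {
        "accuracy_overall": sum(r["passed"] for r in results) / total,
        "accuracy_by_category": {c: sum(p) / len(p) for c, p in per_category.items()},
        "mean_steps": sum(r["steps"] for r in results) / total,
        "mean_tool_calls": sum(r["tool_calls"] for r in results) / total,
        # step count -> number of sentences that took that many steps
        "step_histogram": dict(sorted(Counter(r["steps"] for r in results).items())),
    }


# Invented records for illustration only; not measured results.
demo_results = [
    {"category": "happy", "passed": True, "steps": 3, "tool_calls": 2},
    {"category": "happy", "passed": False, "steps": 3, "tool_calls": 2},
    {"category": "out-of-scope", "passed": True, "steps": 1, "tool_calls": 0},
    {"category": "ambiguous", "passed": True, "steps": 5, "tool_calls": 4},
]
summary = summarize(demo_results)
print(summary["accuracy_overall"])   # 0.75
print(summary["step_histogram"])     # {1: 1, 3: 2, 5: 1}
```

The `step_histogram` mapping can be passed to any plotting library to produce steps_histogram.png.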
Task 6 — long-term memory update verification
After running the 30 sentences, inspect longterm.json. The mistake counters for verbs like go, have, do should have ≥ 1 entry. The per-tense accuracy stats should be populated.
A second run on the same 30 sentences should:
- Produce the same corrections (deterministic with a fixed seed).
- Show doubled mistake counts in longterm.json (since each sentence was corrected twice).
- Possibly show different rationales if the agent injects long-term context into the prompt.
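The doubling expectation can be checked mechanically once the mistake counters are extracted from the two snapshots. The lab does not specify the layout of longterm.json, so the sketch below works on plain verb-to-count mappings; how you obtain them from the file is up to your schema, and the function names and sample counts are hypothetical.

```python
def diff_counts(run1: dict[str, int], run2: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {verb: (count_run1, count_run2)} for every verb whose count changed."""
    verbs = sorted(set(run1) | set(run2))
    return {
        verb: (run1.get(verb, 0), run2.get(verb, 0))
        for verb in verbs
        if run1.get(verb, 0) != run2.get(verb, 0)
    }


def counts_doubled(run1: dict[str, int], run2: dict[str, int]) -> bool:
    """True when both runs track the same verbs and every run-2 count is twice the run-1 count."""
    return set(run1) == set(run2) and all(run2[verb] == 2 * run1[verb] for verb in run1)


# Invented counts for illustration only.
after_run1 = {"go": 3, "have": 2, "do": 1}
after_run2 = {"go": 6, "have": 4, "do": 2}
print(diff_counts(after_run1, after_run2))
print(counts_doubled(after_run1, after_run2))  # True
```

If `counts_doubled` returns False, the output of `diff_counts` shows which verbs to look at first.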
Measurements to capture
- 30-sentence test results (per-sentence pass/fail).
- Three transcripts (happy / ambiguity / out-of-scope).
- Step distribution histogram.
- longterm.json after one run.
- Diff between longterm.json after run 1 and run 2.
Acceptance
- GrammarTutorAgent.correct() works on all 30 sentences without raising.
- MockPlanner-based accuracy ≥ 90%.
- Three transcripts saved in the experiment dir.
- Step histogram saved.
- longterm.json updates correctly across runs.
- Test tests/test_agent_loop.py is green.
Pitfalls to expect
- Scratchpad leak. If scratchpad is stored on self, the second correct() sees the first's trace. Test for this explicitly: call correct("A") then correct("B"); the second call's scratchpad should have only B's steps.
- Long-term update on out-of-scope. Should we increment mistake counters for out-of-scope sentences? Default: no (out-of-scope isn't a verb error, it's a corpus mismatch). Document the choice.
- MockPlanner script gaps. If you scripted only 25 of 30 sentences, the 5 missing will hit a KeyError. Either script all 30 or have MockPlanner raise a clear "scripted-but-not-found" error.
- The agent calls final_answer immediately. A buggy planner might emit final_answer as step 1 without calling any tools. The result is unreliable but structurally valid. Catch this in the test: assert step count ≥ 1 (at least one tool call) before accepting an answer, except for "already correct" sentences.
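The scratchpad-leak test from the first pitfall can be illustrated with a toy agent. `ToyAgent` below is a deliberately simplified stand-in, not the lab's GrammarTutorAgent: its `correct()` returns the trace directly, and the `leaky` flag switches between a per-call list and one stored on `self`. The same two-call pattern would apply to the real agent's `tool_trace`, assuming it exposes the steps of the current call.

```python
class ToyAgent:
    """Toy stand-in that records one step per call, with an optional leak bug."""

    def __init__(self, leaky: bool):
        self.leaky = leaky
        self._shared_trace: list[str] = []  # only used by the buggy variant

    def correct(self, sentence: str) -> list[str]:
        # Correct behaviour: a fresh list per call. Buggy behaviour: reuse the list on self.
        trace = self._shared_trace if self.leaky else []
        trace.append(f"step for {sentence!r}")
        return list(trace)


def has_scratchpad_leak(agent) -> bool:
    """Call correct("A") then correct("B"); a leak means B's trace still mentions A."""
    agent.correct("A")
    second_trace = agent.correct("B")
    return any("'A'" in step for step in second_trace)


print(has_scratchpad_leak(ToyAgent(leaky=True)))   # True
print(has_scratchpad_leak(ToyAgent(leaky=False)))  # False
```

In tests/test_agent_loop.py the equivalent assertion would be that the second call's trace contains only B's steps.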