05 — Agent Loop Architecture: observation → reasoning → tool call → … → answer¶
An agent is a loop: it observes, reasons, calls a tool, observes again, and repeats until it is done. This section formalizes the cycle of the §A13 grammar tutor: which states exist, which transitions are legal, when the loop must stop, and how it fails. We cross-reference extension-track X3 (RLHF/DPO) for the alignment dimension; we do not reimplement it here, we only point out where the agent touches that boundary.
Anchors: theory/01-react-and-planning.md, theory/02-memory.md, theory/03-sandboxing.md, Phase 31 theory/05-mcp-wire-and-100-line-server.md, extension-track docs/extension-track/X3-rlhf-dpo/README.md.
The minimal agent state machine¶
The §A13 grammar tutor's agent is a 5-state automaton:
┌──────────────┐
│   observe    │ ◀── input: user sentence, e.g., "He goed to school"
└──────────────┘
       │
       ▼
┌──────────────┐
│    reason    │ ◀── LLM planning: "this looks like an irregular verb error"
└──────────────┘
       │
       ▼
┌──────────────┐
│  tool_call   │ ──▶ MCP: conjugate(verb="go", tense="past simple", person="3rd singular")
└──────────────┘
       │
       ▼
┌──────────────┐
│   observe    │ ◀── tool response: "went"
└──────────────┘
       │
       ▼
┌──────────────┐                   ┌──────────────┐
│    reason    │ ──── confident ──▶│    answer    │ ──▶ output: "He went to school."
└──────────────┘                   └──────────────┘
       │
       └── unsure ──▶ tool_call (loop)
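Five states, four transitions drawn: observe → reason, reason → tool_call, tool_call → observe, and reason → answer. The fifth state, terminate, is the failure terminal described under State semantics and is not shown in the diagram. The loop runs between reason and tool_call via intermediate observe steps. Successful termination is the reason → answer edge.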
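The automaton above can be written down as an explicit transition table. This is a minimal sketch, not the course's implementation: the `State` enum, the `LEGAL` table and `is_legal_path` are hypothetical names, and the reason → terminate edge is an assumption that models the budget-exhausted exit.

```python
from enum import Enum


class State(Enum):
    OBSERVE = "observe"
    REASON = "reason"
    TOOL_CALL = "tool_call"
    ANSWER = "answer"
    TERMINATE = "terminate"


# Legal edges. ANSWER and TERMINATE are terminals, so they have no successors.
# Assumption: REASON -> TERMINATE stands for the "gave up" exit.
LEGAL = {
    State.OBSERVE: {State.REASON},
    State.REASON: {State.TOOL_CALL, State.ANSWER, State.TERMINATE},
    State.TOOL_CALL: {State.OBSERVE},
    State.ANSWER: set(),
    State.TERMINATE: set(),
}


def is_legal_path(path: list[State]) -> bool:
    """Return True if every consecutive pair in `path` is a legal transition."""
    return all(b in LEGAL[a] for a, b in zip(path, path[1:]))


happy_path = [State.OBSERVE, State.REASON, State.TOOL_CALL,
              State.OBSERVE, State.REASON, State.ANSWER]
print(is_legal_path(happy_path))                         # True
print(is_legal_path([State.OBSERVE, State.TOOL_CALL]))   # False: must reason first
```

A table like this is handy in tests: any recorded run of the agent can be checked against it without involving the model.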
State semantics¶
| State | Inputs | Output | Cost |
|---|---|---|---|
| observe | environment delta (user input or tool result) | structured observation appended to scratchpad | ≈ 0 ms |
| reason | scratchpad | next action: tool_call (with which tool, which args) OR answer | one LLM forward pass |
| tool_call | tool name + args | tool result, appended to next observe | one MCP round-trip |
| answer | scratchpad | final user-visible answer | one LLM forward pass (generation) |
| terminate | scratchpad | structured failure record (gave up) | logs only |
The terminate state is the failure terminal. The answer state is the success terminal.
Transition rules¶
The transitions are not free-form — they're constrained by a JSON-Schema-masked decoder (Phase 30):
NEXT_ACTION_SCHEMA = {
    "type": "object",
    "required": ["action"],
    "oneOf": [
        # Branch 1: call one of the two registered tools.
        {"properties": {"action": {"const": "tool_call"},
                        "tool": {"enum": ["conjugate", "lookup_rule"]},
                        "args": {"type": "object"}},
         "required": ["tool", "args"]},
        # Branch 2: emit the final answer.
        {"properties": {"action": {"const": "answer"},
                        "text": {"type": "string", "maxLength": 200}},
         "required": ["text"]},
    ],
}
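The model can only emit one of two structurally valid next actions. Free-form prose outside this JSON envelope is mechanically impossible; the only free text is the answer's text field, capped at 200 characters. This is the Phase 30 → Phase 32 dependency made concrete.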
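The masked decoder enforces the two shapes at generation time. As a belt-and-braces check on the consuming side, the same constraints can be re-verified after decoding. The sketch below is a hand-written validator, not the Phase 30 decoder; `validate_action`, `ALLOWED_TOOLS` and `MAX_ANSWER_CHARS` are hypothetical names, and it assumes that tool, args and text are mandatory in their respective branches.

```python
import json

# A tuple rather than a set, so membership tests never fail on unhashable JSON values.
ALLOWED_TOOLS = ("conjugate", "lookup_rule")
MAX_ANSWER_CHARS = 200


def validate_action(raw: str) -> dict:
    """Parse `raw` as JSON and check it against the two action shapes.

    Raises ValueError on anything that does not match.
    """
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not JSON: {exc}") from exc
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")

    kind = action.get("action")
    if kind == "tool_call":
        if action.get("tool") not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {action.get('tool')!r}")
        if not isinstance(action.get("args"), dict):
            raise ValueError("args must be an object")
    elif kind == "answer":
        text = action.get("text")
        if not isinstance(text, str) or len(text) > MAX_ANSWER_CHARS:
            raise ValueError("text must be a string of at most 200 characters")
    else:
        raise ValueError(f"unknown action: {kind!r}")
    return action
```

With a correctly masked decoder this function should never raise; if it does, the mask and the schema have drifted apart.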
The full Python pseudocode¶
# Observation, MCPClient, MCPError and llm_reason come from the Phase 30/31 components.
def grammar_tutor(user_input: str,
                  max_turns: int = 6,
                  max_tool_calls: int = 4,
                  *,  # mcp_client has no default, so it must be keyword-only
                  mcp_client: MCPClient) -> str:
    scratchpad: list[Observation] = [Observation(role="user", text=user_input)]
    turns = 0
    tool_calls = 0
    while turns < max_turns:
        turns += 1
        # reason: ask the LLM to pick the next action.
        action = llm_reason(scratchpad, schema=NEXT_ACTION_SCHEMA)
        if action.kind == "answer":
            return action.text
        # action is tool_call
        if tool_calls >= max_tool_calls:
            return "I tried but could not resolve the grammar question in time."
        tool_calls += 1
        try:
            result = mcp_client.call(action.tool, action.args, timeout=2.0)
            scratchpad.append(Observation(role="tool", text=result.text))
        except MCPError as e:
            scratchpad.append(Observation(role="tool", text=f"[error] {e}"))
            # the next reason step will see the error and either retry or give up
    # fell through the turn budget
    return "I tried but could not resolve the grammar question in time."
Roughly two dozen lines. The loop is intentionally short: the complexity lives in the masked decoder (llm_reason) and the MCP client, which are Phase 30 and Phase 31 components respectively.
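To watch the loop run without a model or a server, the sketch below swaps in stand-ins. Everything except the loop body is an assumption made for illustration: `FakeMCPClient`, the scripted `llm_reason`, the `Action` and `ToolResult` shapes, the `GIVE_UP` constant, and the extra `reason` parameter that lets a test inject a different policy.

```python
from dataclasses import dataclass, field

GIVE_UP = "I tried but could not resolve the grammar question in time."


@dataclass
class Observation:
    role: str  # "user" or "tool"
    text: str


@dataclass
class Action:
    kind: str  # "tool_call" or "answer"
    tool: str = ""
    args: dict = field(default_factory=dict)
    text: str = ""


@dataclass
class ToolResult:
    text: str


class MCPError(Exception):
    """Stand-in for the Phase 31 client's error type."""


class FakeMCPClient:
    """Hypothetical in-process stub that knows a single irregular verb."""

    def call(self, tool, args, timeout=2.0):
        if tool == "conjugate" and args.get("verb") == "go":
            return ToolResult(text="went")
        raise MCPError(f"no result for {tool}({args})")


def llm_reason(scratchpad, schema=None):
    """Scripted policy: call conjugate once, then answer from the tool result."""
    tool_obs = [o for o in scratchpad if o.role == "tool"]
    if not tool_obs:
        return Action(kind="tool_call", tool="conjugate",
                      args={"verb": "go", "tense": "past simple",
                            "person": "3rd singular"})
    return Action(kind="answer", text=f"He {tool_obs[-1].text} to school.")


def grammar_tutor(user_input, max_turns=6, max_tool_calls=4, *,
                  mcp_client, reason=llm_reason):
    scratchpad = [Observation(role="user", text=user_input)]
    tool_calls = 0
    for _ in range(max_turns):
        action = reason(scratchpad)
        if action.kind == "answer":
            return action.text
        if tool_calls >= max_tool_calls:
            return GIVE_UP
        tool_calls += 1
        try:
            result = mcp_client.call(action.tool, action.args, timeout=2.0)
            scratchpad.append(Observation(role="tool", text=result.text))
        except MCPError as e:
            # Errors are recorded so the next reason step can see them.
            scratchpad.append(Observation(role="tool", text=f"[error] {e}"))
    return GIVE_UP


print(grammar_tutor("He goed to school", mcp_client=FakeMCPClient()))
# He went to school.
```

Passing a policy that always returns a tool_call (for example `reason=lambda pad: Action(kind="tool_call", tool="lookup_rule")`) exercises the budget path: the stub raises on every call, the errors accumulate in the scratchpad, and the function returns the give-up message once the tool-call cap is hit.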
Termination conditions — three layers¶
A correct agent never loops forever. There are three independent termination gates:
1. Turn cap (max_turns)¶
The outer while turns < max_turns is the safety net. If the LLM keeps emitting tool_call forever (because it's confused, or because the prompt is adversarial), the turn cap stops the loop. Default: 6 turns for §A13 (enough for 2-3 retries with one or two tool calls each).
2. Tool-call cap (max_tool_calls)¶
Independent of max_turns. The reason: a turn counts reasoning steps, not external calls, and in variants where an intermediate answer resets the turn counter, turns alone would not bound tool use. A separate tool_call budget bounds resource use specifically.
3. No-progress detector (advanced, deferred)¶
If the scratchpad is growing but identical tool calls are repeated with identical results, the loop is making no progress. Production agents detect this and break. For Phase 32 we omit this — the max_turns cap is sufficient at our scale.
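Although Phase 32 defers it, the idea is small enough to sketch. The version below is one possible shape, under stated assumptions: `NoProgressDetector`, `call_signature` and the `max_occurrences` knob are hypothetical names, and "identical" is taken to mean an exact match on tool name, arguments and result text.

```python
import json


def call_signature(tool: str, args: dict, result_text: str) -> str:
    """Canonical string for one (tool, args, result) triple."""
    # sort_keys makes the signature independent of argument order.
    return json.dumps([tool, args, result_text], sort_keys=True)


class NoProgressDetector:
    """Flags the loop as stuck when the same call yields the same result again."""

    def __init__(self, max_occurrences: int = 1):
        # Hypothetical knob: how many times an identical triple may be seen.
        self.max_occurrences = max_occurrences
        self.seen: dict[str, int] = {}

    def record(self, tool: str, args: dict, result_text: str) -> bool:
        """Record one completed call; return True if the loop should break."""
        sig = call_signature(tool, args, result_text)
        self.seen[sig] = self.seen.get(sig, 0) + 1
        return self.seen[sig] > self.max_occurrences


detector = NoProgressDetector()
print(detector.record("lookup_rule", {"verb": "go"}, "I don't know"))  # False
print(detector.record("lookup_rule", {"verb": "go"}, "I don't know"))  # True
```

In the loop, `record` would be called right after each tool result is appended to the scratchpad, and a True return would take the same exit as an exhausted budget.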
The agent without these caps is the Phase 32 /break exercise. It runs forever on adversarial input (or on a misconfigured tool that returns the same "I don't know" repeatedly).
Failure modes (the four canonical bugs)¶
| Failure | Symptom | Root cause | Mitigation |
|---|---|---|---|
| Infinite loop | Agent never returns; CPU at 100% | No turn cap, or LLM stuck in tool_call → observe → tool_call cycle | Hard max_turns / max_tool_calls |
| Hallucinated tool | Agent emits tool_call with tool: "translate" when only conjugate and lookup_rule exist | Decoder not masked against the tool enum | JSON-Schema mask with tool: enum: [...] |
| Tool-error blindness | Agent gets an error from MCP, retries with the same args, errors again | LLM isn't seeing the tool error in scratchpad, or is ignoring it | Prepend [error] markers; tighten reasoning prompt |
| Wrong answer with citations | Agent answers from its own weights even though tool returned a different value | Reader prompt doesn't constrain to tool output | Final answer prompt must reference the tool result by [#tool_call_id] |
The Phase 32 lab 03-failure-mode-tour.md reproduces all four on the grammar tutor. The /break (next file in this series) targets the first — the infinite loop.
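The fourth mitigation in the table can be backed by a mechanical check on the final answer. The sketch below is illustrative only: it assumes tool-call ids are integers, that a citation is written as `[#1]`, and that "consistent" can be approximated by a coarse case-insensitive substring match. `is_grounded` and `CITATION` are hypothetical names.

```python
import re

# Assumed citation format: [#<integer tool_call_id>]
CITATION = re.compile(r"\[#(\d+)\]")


def is_grounded(answer: str, tool_results: dict[int, str]) -> bool:
    """True if the answer cites at least one tool call and every cited
    result appears (case-insensitively) in the answer text."""
    cited = [int(m) for m in CITATION.findall(answer)]
    if not cited:
        return False
    # Strip the citation markers before looking for the tool output.
    body = CITATION.sub("", answer).lower()
    return all(
        call_id in tool_results and tool_results[call_id].lower() in body
        for call_id in cited
    )


results = {1: "went"}
print(is_grounded("He went to school. [#1]", results))  # True
print(is_grounded("He goed to school. [#1]", results))  # False: cites but contradicts
print(is_grounded("He went to school.", results))       # False: no citation
```

A substring match is deliberately crude; it catches the "answered from its own weights" case for single-word conjugations but would need tightening for longer tool outputs.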
Where this loop touches alignment (cross-reference to X3)¶
The extension-track module docs/extension-track/X3-rlhf-dpo/ covers the alignment dimension that this agent loop doesn't solve on its own. Specifically:
- The reason step's quality is a function of the LLM behind it. If Mini-GPT was trained on the §A13 corpus alone, its reason outputs are mediocre on edge cases. RLHF/DPO (X3 modules) improve the agent's reasoning by training the LLM on preference pairs over its own action proposals.
- The "give up" decision is itself a calibration question. An RLHF-trained model is better at saying "I don't know" than a base model (which often confidently bullshits). The DPO loss on the grammar-tutor task (X3 lab 01-dpo-on-grammar-tutor.md) trains exactly this distinction.
- Constitutional revision (X3 lab 02) wraps the agent loop with a self-critique step between reason and tool_call. The critique asks "is this action consistent with the rules?"; the agent retries if not. This is one layer above what Phase 32 implements.
Phase 32's agent is the substrate on which X3's alignment techniques operate. You can run the §A13 grammar tutor without any X3 module — it'll be ~85% correct on the eval set with a vanilla LoRA-finetuned Mini-GPT. With X3's DPO step on top, the same agent reaches ~93%. The agent loop architecture doesn't change; the LLM's reason quality does.
Borja should read docs/extension-track/X3-rlhf-dpo/theory/04-dpo-and-direct-methods.md after this phase closes to see how alignment plugs in.
A note on tool selection¶
In our minimal §A13 setup, the agent has two tools: conjugate (Phase 31) and lookup_rule (a RAG-style retriever from Phase 29 wrapping the irregular-verb table). The choice between them is itself a reason decision: structural questions ("what's the past simple of X?") prefer conjugate; explanatory questions ("why is X irregular?") prefer lookup_rule. Production agents often have 10-100 tools; the tool-selection problem becomes its own optimization target.
We keep two tools for Phase 32. Adding more is a Phase 33 (serving) concern, not a Phase 32 (loop semantics) concern.
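One way to keep the schema's tool enum and the actual dispatch in sync is to derive both from a single registry. This is a sketch under assumptions, not the Phase 31 server: the `tool` decorator, `TOOLS`, `tool_enum` and `dispatch` are hypothetical, and the two functions are toy in-process stand-ins for the real MCP tools.

```python
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Decorator that registers a function under `name`."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


# Toy stand-ins; in the real setup these sit behind the MCP server.
@tool("conjugate")
def conjugate(verb: str, tense: str, person: str) -> str:
    table = {("go", "past simple"): "went"}
    return table.get((verb, tense), f"[error] unknown verb/tense: {verb}/{tense}")


@tool("lookup_rule")
def lookup_rule(verb: str) -> str:
    rules = {"go": "'go' is suppletive: its past form 'went' comes from 'wend'."}
    return rules.get(verb, "[error] no rule found")


def tool_enum() -> list[str]:
    """Names to place in the schema's `tool` enum."""
    return sorted(TOOLS)


def dispatch(name: str, args: dict) -> str:
    """Route a validated tool_call to its implementation."""
    return TOOLS[name](**args)


print(tool_enum())  # ['conjugate', 'lookup_rule']
print(dispatch("conjugate", {"verb": "go", "tense": "past simple",
                             "person": "3rd singular"}))  # went
```

With this shape, adding a third tool is one decorated function; the enum the decoder is masked against grows with it, so a registered tool can never be missing from the mask.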
Why an agent at all¶
A reasonable challenge: "why not just call the LLM once with the user input and let it conjugate? Why the whole loop?"
Three reasons:
- Guardrails. The §A13 enum mask (Phase 30) confines the model to the 20-verb scope. Without the loop, the model emits a sentence; with the loop, the conjugation tool returns a typed result the agent can structurally verify.
- Composability. Tomorrow, a learner asks "is this Italian sentence correct?" A new tool (italian_conjugate) plugs into the loop; the agent code doesn't change. Monolithic generation can't do that.
- Reasoning traces. The scratchpad is the audit trail. When the agent gets the wrong answer, you can replay the trace and see whether the model picked the wrong tool, parsed the result wrong, or fabricated the answer despite a correct tool response. This is the difference between "the model lied" (untestable) and "the model lied at step 3 after the conjugate tool returned went" (debuggable).
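The audit-trail point can be made concrete with a small replay helper. The sketch assumes a trace is a list of (role, text) pairs that also logs the reason and answer steps, which goes beyond the user and tool observations the scratchpad holds in this phase; `format_trace` and `fabricated_after` are hypothetical names.

```python
def format_trace(trace: list[tuple[str, str]]) -> str:
    """Render (role, text) pairs as numbered steps for replay."""
    return "\n".join(
        f"step {i}: [{role}] {text}"
        for i, (role, text) in enumerate(trace, start=1)
    )


def fabricated_after(trace: list[tuple[str, str]]):
    """Return the step number of the last successful tool result if the
    final answer does not contain it, else None."""
    last_tool = None
    for step, (role, text) in enumerate(trace, start=1):
        if role == "tool" and not text.startswith("[error]"):
            last_tool = (step, text)
        elif role == "answer" and last_tool is not None:
            if last_tool[1].lower() not in text.lower():
                return last_tool[0]
    return None


trace = [
    ("user", "He goed to school"),
    ("reason", "tool_call conjugate(verb=go, tense=past simple)"),
    ("tool", "went"),
    ("answer", "He goed to school."),
]
print(format_trace(trace))
print(fabricated_after(trace))  # 3: the answer ignores the tool result from step 3
```

The returned step number is what turns "the model lied" into a pointer you can open in the trace.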
Phase 32's agent is the smallest loop that buys you all three properties. Production agents (Claude Code, OpenAI's Assistants API) are this loop with more tools, more memory, and more guardrails.
Citations¶
- Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629.
- Schick et al. Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. arXiv:2302.04761.
- Anthropic. Building effective agents (engineering blog, 2024-12-20). Pragmatic guidance on agent-loop design; argues against premature complexity.
One-paragraph recap¶
The §A13 grammar-tutor agent is a 5-state loop: observe → reason → tool_call → observe → … → answer, with terminate as the failure terminal. The transitions are JSON-Schema-masked so the model can only emit one of two structurally valid next actions. Three independent termination gates prevent infinite loops: the turn cap and the tool-call cap, which Phase 32 implements, and a no-progress detector, which it defers; the /break exercise (next file) removes the caps on purpose. Four canonical failure modes (infinite loop, hallucinated tool, tool-error blindness, wrong-answer-with-citations) are each the result of one missing guardrail. Phase 32's loop is the substrate; the X3 extension track sits on top to align the reason step's quality. The whole architecture is ~30 lines of Python plus the masked decoder and MCP client built in Phases 30 and 31.
Next: lab/01-tutor-end-to-end.md to wire all phases together and run the agent on 30 canonical §A13 sentences.