01 - ReAct and the planner-executor split
There are two dominant agent architectures: ReAct (decide → act → observe → decide) and plan-and-execute (plan everything up front, then execute). For our tutor, ReAct wins, because each decision depends on the result of the previous tool call and we do not know a priori which tools will be needed.
ReAct: think + act, interleaved
ReAct comes from Yao et al. 2022 ("ReAct: Synergizing Reasoning and Acting in Language Models"). The shape:
loop:
    thought ← planner(state)
    if thought.terminates:
        return thought.answer
    action ← thought.tool_call
    observation ← execute(action)
    state ← append(state, thought, action, observation)
The interleaving is the key. The model thinks after seeing each observation, so each new action is conditioned on the cumulative result of prior actions. This handles the case where the planner doesn't know which tool to call next until it's seen the previous tool's output — which is the common case for grammar correction.
Example trace for "He goed to school":
state_0: "He goed to school"
thought_1: subject seems to be "he"; verb seems to be "goed". Let me check if "go" is regular.
action_1: lookup_irregular_verb(verb="go")
observation_1: {"is_irregular": true, "past_simple": "went", ...}
state_1: {... above ...}
thought_2: irregular → "goed" might be wrong. Let me look up past simple of "go" for 3rd person sg.
action_2: conjugate(verb="go", tense="past_simple", person="3sg")
observation_2: "went"
state_2: {... above ...}
thought_3: confirmed: "goed" should be "went". Compose final answer.
action_3: FINAL_ANSWER
answer: CorrectionResult(corrected="He went to school", rationale=["go is irregular; past simple 3sg is 'went' not 'goed'"], spanish_gloss="Él fue a la escuela")
Three steps. Each action was conditioned on the previous observation. A plan-and-execute approach would have to guess the right tools up front; ReAct discovers them as it goes.
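The loop and trace above can be sketched in a few lines of Python. This is a minimal illustration, not the tutor's implementation: the `Thought` class, the `TOOLS` stubs, and `scripted_planner` (which stands in for the model and is hard-wired to this one sentence) are all hypothetical names introduced here. The loop deliberately has no step budget; termination is covered later in this file.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Thought:
    text: str
    tool: Optional[str] = None            # None means "terminate"
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None          # set only on the terminating thought


# Hypothetical tool stubs; the real tools would consult a verb lexicon.
TOOLS: dict[str, Callable[..., Any]] = {
    "lookup_irregular_verb": lambda verb: {"is_irregular": True, "past_simple": "went"},
    "conjugate": lambda verb, tense, person: "went",
}


def scripted_planner(state: list) -> Thought:
    """Stand-in for the model, scripted for this one sentence."""
    observations = [observation for _, _, observation in state[1:]]
    if not observations:
        return Thought("is 'go' regular?", "lookup_irregular_verb", {"verb": "go"})
    if len(observations) == 1 and observations[0]["is_irregular"]:
        # This branch is chosen only because of what the first tool returned.
        return Thought("irregular; get the past simple", "conjugate",
                       {"verb": "go", "tense": "past_simple", "person": "3sg"})
    return Thought("confirmed", answer=f"He {observations[-1]} to school")


def react(sentence: str, planner: Callable[[list], Thought]) -> tuple[Optional[str], list]:
    state: list = [sentence]                      # state_0 is the raw input
    while True:
        thought = planner(state)                  # think, given everything so far
        if thought.tool is None:
            return thought.answer, state
        observation = TOOLS[thought.tool](**thought.args)   # act
        state.append((thought, (thought.tool, thought.args), observation))


answer, trace = react("He goed to school", scripted_planner)
print(answer)  # He went to school
```

The point to notice is that `scripted_planner` is called once per iteration with the full accumulated state, so the second action can depend on the first observation.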
Plan-and-execute: compile, then run
The alternative (Wang et al. 2023 "Plan-and-Solve Prompting"; Sun et al. 2023 "AdaPlanner"): the model first produces a full plan — a sequence (or DAG) of intended tool calls — and then executes the plan.
plan ← planner(state_0)  # ["lookup_irregular_verb", "conjugate", ...]
for step in plan:
    observation ← execute(step)
    state ← append(state, step, observation)
answer ← responder(state)
Pros:
- One model call to plan and one to respond, however long the plan is; the \(n\) execution steps are tool calls, not model calls. ReAct needs a model call for every step.
- The plan is introspectable: a human can sanity-check the plan before executing destructive actions.
Cons:
- The plan can't adapt to observations. If `lookup_irregular_verb("go")` returns something unexpected (say, `is_irregular = true` when the plan assumed a regular verb), the remaining steps are already committed.
- Hard to plan when the action space depends on prior results. ("If the verb is regular, do X; otherwise do Y" requires a conditional plan, which is harder to generate.)
For Phase 32, ReAct wins because:
- Decision tree branches on tool results (regular/irregular, agreement-OK/not, in-scope/out-of-scope).
- The §A13 corpus is small enough that ReAct's per-step LM calls are cheap.
- The number of steps is typically 2–4 — small even by ReAct standards.
Plan-and-execute is the right choice when steps are expensive (e.g., remote API calls with rate limits) and the plan can be checked statically.
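For contrast with ReAct, here is a minimal sketch of the plan-and-execute shape described in this section. The `CountingModel` class and the `TOOLS` stubs are hypothetical stand-ins, introduced only to make the model-call count visible; a real planner would generate the plan rather than return a fixed list.

```python
from typing import Any, Callable

Step = tuple[str, dict]

# Hypothetical tool stubs, as assumptions for illustration only.
TOOLS: dict[str, Callable[..., Any]] = {
    "lookup_irregular_verb": lambda verb: {"is_irregular": True, "past_simple": "went"},
    "conjugate": lambda verb, tense, person: "went",
}


class CountingModel:
    """Stand-in for the language model that counts its own invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def plan(self, sentence: str) -> list[Step]:
        self.calls += 1
        # The whole plan is fixed before any tool has run.
        return [
            ("lookup_irregular_verb", {"verb": "go"}),
            ("conjugate", {"verb": "go", "tense": "past_simple", "person": "3sg"}),
        ]

    def respond(self, state: list) -> str:
        self.calls += 1
        last_observation = state[-1][1]
        return f"He {last_observation} to school"


def plan_and_execute(sentence: str, model: CountingModel) -> str:
    plan = model.plan(sentence)
    state: list = []
    for tool, args in plan:
        observation = TOOLS[tool](**args)     # a tool call, not a model call
        state.append(((tool, args), observation))
    return model.respond(state)


model = CountingModel()
result = plan_and_execute("He goed to school", model)
print(result, model.calls)  # He went to school 2
```

The model is invoked twice regardless of plan length, but nothing inside the `for` loop can change which step runs next.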
The planner as constrained decoding
The planner's job: given the current state, produce the next step. We constrain the planner's output to one of two schemas:
// Schema for ToolCall
{
  "next": "tool_call",
  "tool": "<enum: conjugate | lookup_irregular_verb | check_subject_verb_agreement | lookup_spanish>",
  "args": { ... }  // schema depends on tool
}

// Schema for FinalAnswer
{
  "next": "final_answer",
  "answer": {
    "corrected": "<string | null>",
    "rationale": ["<string>", "..."],
    "spanish_gloss": "<string | null>",
    "in_scope": "<boolean>"
  }
}
The planner is literally a JSONSchemaMask over the union of these two schemas. At each generation step, only tokens that keep the partial output valid are allowed. This is the technique from Phase 30 applied to the agent's control flow.
A subtle but important consequence: the planner cannot hallucinate a tool name that doesn't exist. The tool field is an enum over the registered tools' names, and tokens outside that enum are masked. This rules out a common failure mode of free-text agents, calling a tool that was never registered, although it does not guarantee that the chosen tool or its arguments are the right ones.
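To make the enum masking concrete, here is a toy sketch that works on characters rather than tokenizer tokens. It is not the Phase 30 `JSONSchemaMask` (whose interface is not shown in this file); `allowed_next_chars`, `constrained_decode`, and `prefers` are hypothetical helpers, and the scoring function stands in for the model's logits.

```python
TOOL_NAMES = (
    "conjugate",
    "lookup_irregular_verb",
    "check_subject_verb_agreement",
    "lookup_spanish",
)


def allowed_next_chars(prefix: str, names=TOOL_NAMES) -> set[str]:
    """Characters that keep `prefix` a prefix of at least one registered name."""
    allowed = set()
    for name in names:
        if name.startswith(prefix):
            # A complete name may only be followed by the closing quote.
            allowed.add(name[len(prefix)] if len(prefix) < len(name) else '"')
    return allowed


def constrained_decode(score, names=TOOL_NAMES) -> str:
    """Greedy decode of the tool field: best-scoring character among allowed ones."""
    out = ""
    while True:
        allowed = allowed_next_chars(out, names)
        ch = max(sorted(allowed), key=lambda c: score(out, c))
        if ch == '"':
            return out
        out += ch


def prefers(target: str):
    """Toy 'model' that wants to spell out `target`, registered or not."""
    def score(prefix: str, ch: str) -> float:
        return 1.0 if len(prefix) < len(target) and target[len(prefix)] == ch else 0.0
    return score


# The toy model wants the unregistered name "lookup_synonym"; the mask
# forces it onto a registered one instead.
picked = constrained_decode(prefers("lookup_synonym"))
print(picked)  # lookup_spanish
```

The output is always a registered name, but as the example shows, it is not necessarily the tool the task needed.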
State and observations: what the planner sees
The planner sees a prompt assembled from:
- The system instruction (what the agent does, what tools exist, what schema to follow).
- The user input (the sentence to correct).
- The trace so far: each `(thought, action, observation)` triple from earlier in the loop.
- The prompt asking the planner to emit the next step.
The trace is the agent's "context window": it is the only state the planner can read. Long-term memory (across corrections) is fetched into the prompt by hand when relevant. We'll cover memory in the next file.
A canonical prompt:
You are a grammar tutor for English verb conjugations (5 tenses, 3 singular persons).
You have these tools: <enumerate with arg schemas>.
Output JSON matching the planner schema.
SENTENCE: He goed to school.
TRACE:
step 1:
action: lookup_irregular_verb(verb="go")
observation: {"is_irregular": true, "past_simple": "went", "past_participle": "gone"}
What's the next step?
The model completes with a JSON-masked output. We parse it. We dispatch. We repeat.
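Assembling that prompt is plain string building. The sketch below is one possible way to do it, assuming the trace is kept as `(tool, args, observation)` tuples and rendered in the same layout as the canonical prompt (which shows actions and observations only); `render_trace` and `build_prompt` are hypothetical names.

```python
import json


def render_trace(trace: list) -> str:
    """Render (tool, args, observation) tuples in the canonical prompt layout."""
    lines = []
    for i, (tool, args, observation) in enumerate(trace, start=1):
        arg_text = ", ".join(f'{key}="{value}"' for key, value in args.items())
        lines.append(f"step {i}:")
        lines.append(f"action: {tool}({arg_text})")
        lines.append(f"observation: {json.dumps(observation)}")
    return "\n".join(lines)


def build_prompt(system: str, sentence: str, trace: list) -> str:
    parts = [system, f"SENTENCE: {sentence}", "TRACE:"]
    if trace:                      # first iteration: nothing to render yet
        parts.append(render_trace(trace))
    parts.append("What's the next step?")
    return "\n".join(parts)


SYSTEM = (
    "You are a grammar tutor for English verb conjugations "
    "(5 tenses, 3 singular persons).\n"
    "Output JSON matching the planner schema."
)
trace = [(
    "lookup_irregular_verb",
    {"verb": "go"},
    {"is_irregular": True, "past_simple": "went", "past_participle": "gone"},
)]
prompt = build_prompt(SYSTEM, "He goed to school.", trace)
print(prompt)
```

Because the prompt is rebuilt from the trace on every iteration, anything not appended to the trace is invisible to the planner.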
Termination, budgets, and dedup
Three termination signals:
- `FinalAnswer` emitted. The happy path.
- Step budget exhausted. Hard cap at `K = 8`. The agent returns a structured "could not converge" result with the partial trace.
- Duplicate action loop. If the same `(tool, args)` appears twice in a row, the agent halts and reports the loop. This is a planner bug; we want to see it, not loop forever.
These three give the agent a bounded execution. Any agent that doesn't have all three is unsafe to deploy.
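The three signals can be checked in one place before each dispatch. The following is a sketch under assumptions: `LoopGuard` is a hypothetical helper, the step is assumed to be the parsed planner JSON (a dict with a `next` field), and the halt reasons are plain strings rather than the structured result the agent would actually return.

```python
import json
from dataclasses import dataclass
from typing import Optional

MAX_STEPS = 8  # the hard cap K from the text


@dataclass
class LoopGuard:
    max_steps: int = MAX_STEPS
    steps: int = 0
    last_action: Optional[str] = None

    def check(self, step: dict) -> Optional[str]:
        """Return a halt reason, or None if the loop may dispatch this step."""
        if step["next"] == "final_answer":
            return "final_answer"
        if self.steps >= self.max_steps:
            return "budget_exhausted"
        # Canonical string form so that equal (tool, args) pairs compare equal.
        key = json.dumps([step["tool"], step["args"]], sort_keys=True)
        if key == self.last_action:
            return "duplicate_action"
        self.last_action = key
        self.steps += 1
        return None


call = {"next": "tool_call", "tool": "conjugate", "args": {"verb": "go"}}
guard = LoopGuard()
print(guard.check(call))  # None
print(guard.check(call))  # duplicate_action
```

Only consecutive repeats are flagged here, matching the "twice in a row" rule; a planner that alternates between two actions would be stopped by the step budget instead.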
The Bitter Lesson, agent edition
Rich Sutton's "Bitter Lesson" (2019) says: in the long run, general methods that scale with compute beat clever methods that bake in human knowledge. For agents, this means:
- A larger model with a simpler loop generally beats a smaller model with a more elaborate planning system.
- Tool-use trained via RLHF / DPO beats prompted tool-use, given enough compute.
Phase 32 is on the clever-method side of this divide — small model, lots of structure (the JSON mask). That's the right call for a small project; in production you would also have the training data to do the bitter-lesson version.
What this file does NOT cover
- Plan-DAG executors. Mentioned for completeness; not implemented.
- The math of mask-constrained logit distributions. Phase 30.
- MCTS / tree-of-thought / reflection. Higher-power agent loops that are expensive and don't fit the Mini-GPT budget. Cited only for vocabulary.
Next: 02-memory.md