Lab 03: Mask-Driven Tool-Call Generation
Goal: close the Phase 30 → Phase 31 loop. Use ministruct.JSONSchemaMask to constrain a model's output so that it emits valid tool-call JSON; parse the JSON; dispatch it through the MCP client; observe the result. End-to-end.
Estimated time: 2–3 hours.
Prereq: Lab 02 done (MCPClient working) AND Phase 30 lab 01 done (JSONSchemaMask working on the conjugation schema).
What you produce
In experiments/31-mask-driven-toolcall/:
- demo.py: runs the full pipeline.
- transcript.json: {prompt, model_output_str, parsed_tool_call, tool_result} for ~20 test prompts.
- results.json: {total_prompts, parse_failures, dispatch_failures, tool_errors}.
- manifest.json: versions + seed.
Plus a small adapter in src/miniagent/:
- src/miniagent/tool_call.py: bridges a constrained-decoding output blob to a tool-call dispatch.
The pipeline
prompt
│
▼
MiniGPT.generate(prompt, mask=JSONSchemaMask(tool_call_schema), temperature=0.7)
│ → "{\"name\":\"conjugate\",\"arguments\":{\"verb\":\"eat\",\"tense\":\"past_simple\",\"person\":\"3sg\"}}"
▼
json.loads(output)
│ → {"name": "conjugate", "arguments": {...}}
▼
MCPClient.call_tool("conjugate", verb="eat", tense="past_simple", person="3sg")
│ → "ate"
▼
result printed; logged to transcript
Each arrow is testable. Lab 03 wires them together.
TODOs
Block A: the tool-call envelope schema
In src/miniagent/tool_call.py, define the JSON Schema for a tool-call message (not the tool's own argument schema — one layer up):
    def build_tool_call_schema(tool_schemas: dict[str, dict]) -> dict:
        return {
            "type": "object",
            "properties": {
                "name": {"type": "string", "enum": list(tool_schemas.keys())},
                "arguments": {
                    # see note below on per-tool argument schemas
                    "oneOf": [
                        {"type": "object", "properties": {"name": {"const": name}, ...}}
                        for name, schema in tool_schemas.items()
                    ]
                }
            },
            "required": ["name", "arguments"],
            "additionalProperties": False
        }

- Reality check. Building a full oneOf discriminator is fiddly. For Phase 31's lab, simplify to a two-stage mask: first generate name under an enum-only schema; then re-instantiate the mask with the specific tool's input_schema and generate arguments. This avoids oneOf entirely. Document the simplification in tool_call.py's module docstring.
The two-stage variant in pseudocode:
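    import json

    from ministruct import JSONSchemaMask

    def generate_tool_call(model, prompt, tool_schemas, *, temperature=0.7) -> dict:
        # Stage 1: choose the tool name under an enum-only schema.
        name_schema = {"type": "string", "enum": list(tool_schemas)}
        name = json.loads(model.generate(prompt + "\nname=", mask=JSONSchemaMask(name_schema), temperature=temperature))
        # Stage 2: generate the arguments under that tool's own input schema.
        arg_schema = tool_schemas[name]
        args_text = model.generate(prompt + f'\nname={name}\narguments=', mask=JSONSchemaMask(arg_schema), temperature=temperature)
        return {"name": name, "arguments": json.loads(args_text)}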
This is the pedagogical path. A production system would use oneOf in one pass — that's a Phase 33 optimization.
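The two-stage flow can be exercised before a real model is available. The sketch below is one way to do that under stated assumptions: `FakeMask` and `StubModel` are hypothetical stand-ins (not the real `ministruct.JSONSchemaMask` or `MiniGPT`), and the stub simply replays canned JSON depending on whether the prompt ends with `name=` or `arguments=`.

```python
import json


class FakeMask:
    """Hypothetical stand-in for JSONSchemaMask: it only stores the schema."""

    def __init__(self, schema: dict):
        self.schema = schema


class StubModel:
    """Hypothetical stand-in for MiniGPT that replays canned, schema-valid JSON."""

    def __init__(self, name: str, arguments: dict):
        self._name = name
        self._arguments = arguments

    def generate(self, prompt: str, mask: FakeMask, temperature: float = 0.7) -> str:
        # Stage 1 prompts end with "name="; stage 2 prompts end with "arguments=".
        if prompt.endswith("name="):
            return json.dumps(self._name)
        return json.dumps(self._arguments)


def generate_tool_call(model, prompt, tool_schemas, *, temperature=0.7, mask_cls=FakeMask):
    # Stage 1: pick the tool name under an enum-only schema.
    name_schema = {"type": "string", "enum": list(tool_schemas)}
    name_text = model.generate(prompt + "\nname=", mask=mask_cls(name_schema), temperature=temperature)
    name = json.loads(name_text)
    # Stage 2: generate the arguments under that tool's own input schema.
    args_text = model.generate(
        prompt + f"\nname={name}\narguments=",
        mask=mask_cls(tool_schemas[name]),
        temperature=temperature,
    )
    return {"name": name, "arguments": json.loads(args_text)}


# Illustrative schemas only; the real ones come from the MCP server's tool list.
tool_schemas = {
    "conjugate": {"type": "object", "required": ["verb", "tense", "person"]},
    "lookup_irregular_verb": {"type": "object", "required": ["verb"]},
}
model = StubModel("conjugate", {"verb": "eat", "tense": "past_simple", "person": "3sg"})
call = generate_tool_call(model, "I need the past simple of 'eat' for he/she/it.", tool_schemas)
print(call)
```

Passing the mask class as a parameter keeps the adapter testable without the Phase 30 code; in the lab itself it would presumably default to the real `JSONSchemaMask`.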
Block B: prompt set
In experiments/31-mask-driven-toolcall/prompts.json:
- 20 short prompts, each one designed to elicit a specific tool call (five for each tool, with varied wording). Examples:

    [
      {"prompt": "I need the past simple of 'eat' for he/she/it.",
       "expected_tool": "conjugate",
       "expected_args": {"verb": "eat", "tense": "past_simple", "person": "3sg"}},
      {"prompt": "Is 'go' an irregular verb?",
       "expected_tool": "lookup_irregular_verb",
       "expected_args": {"verb": "go"}},
      {"prompt": "Translate 'ate' to Spanish.",
       "expected_tool": "lookup_spanish",
       "expected_args": {"english_form": "ate"}},
      {"prompt": "Does 'he go' agree?",
       "expected_tool": "check_subject_verb_agreement",
       "expected_args": {"subject": "he", "verb_form": "go"}}
    ]

- These are gold labels for evaluation. We don't expect the small MiniGPT to nail tool selection; that's Phase 32's job. We just need the parse to succeed.
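A quick structural check on the prompt set catches typos in the gold labels before they show up as confusing results. The sketch below is a minimal example, assuming the four tool names from the examples above are the full tool list; `validate_prompt_set` is a hypothetical helper, and the sample is inlined rather than read from prompts.json so the snippet runs on its own.

```python
import json
from collections import Counter

# Assumed tool list, taken from the example prompts in this lab.
KNOWN_TOOLS = {
    "conjugate",
    "lookup_irregular_verb",
    "lookup_spanish",
    "check_subject_verb_agreement",
}
REQUIRED_KEYS = {"prompt", "expected_tool", "expected_args"}


def validate_prompt_set(entries: list[dict]) -> Counter:
    """Check each gold-label entry and return the number of prompts per tool."""
    per_tool = Counter()
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
        if entry["expected_tool"] not in KNOWN_TOOLS:
            raise ValueError(f"entry {i} names unknown tool {entry['expected_tool']!r}")
        if not isinstance(entry["expected_args"], dict):
            raise ValueError(f"entry {i}: expected_args must be an object")
        per_tool[entry["expected_tool"]] += 1
    return per_tool


sample = json.loads("""
[
  {"prompt": "Is 'go' an irregular verb?",
   "expected_tool": "lookup_irregular_verb",
   "expected_args": {"verb": "go"}},
  {"prompt": "Translate 'ate' to Spanish.",
   "expected_tool": "lookup_spanish",
   "expected_args": {"english_form": "ate"}}
]
""")
counts = validate_prompt_set(sample)
print(dict(counts))
```

On the full file, the returned counter makes it easy to confirm the five-per-tool split.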
Block C: the demo loop
In experiments/31-mask-driven-toolcall/demo.py:
- Initialize:
  - Load MiniGPT (from wherever Phase 26-29 left it). If the model isn't ready, mock it with a fixed-output stub that emits the gold labels; this is acceptable for Phase 31 since the constrained-decoding-plus-dispatch path is what we're testing, not the model's quality.
  - JSONSchemaMask per tool (cache).
  - with MCPClient(["python", "-m", "miniagent.mcp_server"]) as client: client.initialize(); client.list_tools().
- For each prompt:
  - Generate name under the name-only mask.
  - Generate args under the per-tool mask.
  - parsed = {"name": name, "arguments": json.loads(args_text)}.
  - Validate that parsed is a well-formed tool call (against the schema returned by build_tool_call_schema).
  - try: result = client.call_tool(parsed["name"], **parsed["arguments"]); except ToolError as e: result = {"error": str(e)}. (The call shape matches the pipeline diagram: tool name first, arguments as keywords.)
  - Append {prompt, model_output_str, parsed_tool_call, tool_result} to transcript.json.
- Aggregate counters → results.json.
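The per-prompt step and its counters can be sketched as follows. This is an illustration under assumptions, not the reference implementation: `StubClient` stands in for the Lab 02 `MCPClient`, `ToolError` is redefined locally, `call_tool` is assumed to take the tool name followed by keyword arguments (as in the pipeline diagram), and `dispatch_failures` is read here as "anything other than a `ToolError` raised during dispatch".

```python
import json


class ToolError(Exception):
    """Local stand-in for the Lab 02 ToolError."""


class StubClient:
    """Hypothetical stand-in for MCPClient; it knows a single verb form."""

    def call_tool(self, name: str, **arguments):
        if name == "conjugate" and arguments.get("verb") == "eat":
            return "ate"
        raise ToolError(f"{name}: no answer for {arguments}")


def run_prompt(client, prompt: str, name: str, args_text: str, results: dict) -> dict:
    """Parse one masked generation, dispatch it, and update the counters."""
    results["total_prompts"] += 1
    entry = {
        "prompt": prompt,
        "model_output_str": args_text,
        "parsed_tool_call": None,
        "tool_result": None,
    }
    try:
        parsed = {"name": name, "arguments": json.loads(args_text)}
    except json.JSONDecodeError:
        results["parse_failures"] += 1  # should never happen under the mask
        return entry
    entry["parsed_tool_call"] = parsed
    try:
        entry["tool_result"] = client.call_tool(parsed["name"], **parsed["arguments"])
    except ToolError as exc:
        results["tool_errors"] += 1  # logical failure reported by the tool
        entry["tool_result"] = {"error": str(exc)}
    except Exception as exc:
        results["dispatch_failures"] += 1  # assumed meaning: transport or protocol failure
        entry["tool_result"] = {"error": f"dispatch failed: {exc}"}
    return entry


results = {"total_prompts": 0, "parse_failures": 0, "dispatch_failures": 0, "tool_errors": 0}
client = StubClient()
transcript = [
    run_prompt(client, "past simple of 'eat'", "conjugate",
               '{"verb": "eat", "tense": "past_simple", "person": "3sg"}', results),
    run_prompt(client, "past simple of 'go'", "conjugate",
               '{"verb": "go", "tense": "past_simple", "person": "3sg"}', results),
]
print(json.dumps(results))
```

Keeping the three failure kinds in separate counters means a nonzero `tool_errors` never hides a mask bug, which is the only failure this lab treats as blocking.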
Block D: assertions
The lab's operational claim (from theory/01-function-calling-formats.md §"The argument-format question"): under the mask, parse failures are impossible.
- Assert results["parse_failures"] == 0. If any prompt produced unparseable JSON, the mask is broken: debug Phase 30's JSONSchemaMask, not the dispatch.
- Assert every tool_result is either a valid tool output (string or dict) or a ToolError text (a logical failure of the tool is fine; that's the tool's prerogative).
- Do not assert tool-selection accuracy. The model is small; tool selection is Phase 32 with planning. We're testing the plumbing, not the brain.
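These checks fit in one small function. The sketch below assumes the results and transcript shapes described earlier in this lab, with a `ToolError` recorded as a `{"error": ...}` dict; `check_results` is a hypothetical name.

```python
def check_results(results: dict, transcript: list[dict]) -> None:
    """Block D checks: zero parse failures, every tool_result is a str or dict."""
    assert results["parse_failures"] == 0, (
        "mask produced unparseable JSON; debug JSONSchemaMask, not the dispatch"
    )
    for i, entry in enumerate(transcript):
        result = entry["tool_result"]
        assert isinstance(result, (str, dict)), (
            f"entry {i}: unexpected tool_result type {type(result).__name__}"
        )
    # Deliberately no comparison against expected_tool: selection accuracy is out of scope.


good_results = {"parse_failures": 0}
good_transcript = [
    {"tool_result": "ate"},
    {"tool_result": {"error": "conjugate: unknown verb"}},
]
check_results(good_results, good_transcript)
print("Block D checks passed")
```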
Block E: measurement
Per-prompt timings written to timings.json:
- mask_construction_ms: how long it takes to build the JSONSchemaMask per tool.
- generation_ms: model-side time spent generating the JSON.
- parse_ms: json.loads (always microseconds).
- dispatch_ms: round trip through the MCP client to the server and back.
- total_ms.
Report aggregate: mean, P50, P95. This is the Phase 31 latency baseline; Phase 33 will compare.
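The aggregate can be computed with the standard library alone. The sketch below uses the nearest-rank definition of a percentile, which is one convention among several and an assumption here (the lab does not prescribe one); `percentile` and `summarize` are hypothetical helpers and the timing values are made up for illustration.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(timings: list[dict], field: str) -> dict:
    """Mean, P50 and P95 of one timing field across all prompts."""
    values = [t[field] for t in timings]
    return {
        "mean": statistics.fmean(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
    }


# Illustrative numbers only, not measurements.
timings = [
    {"total_ms": 100.0},
    {"total_ms": 150.0},
    {"total_ms": 200.0},
    {"total_ms": 250.0},
]
summary = summarize(timings, "total_ms")
print(summary)  # {'mean': 175.0, 'p50': 150.0, 'p95': 250.0}
```

With only 20 prompts, P95 under nearest-rank is the 19th-smallest value, so a single slow dispatch can move it noticeably; whichever convention is chosen, record it in manifest.json so the Phase 33 comparison uses the same one.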
Block F: log a representative successful example
In transcript.json, the first entry should be a clean demonstrative case for the phase report:
    {
      "prompt": "I need the past simple of 'eat' for he/she/it.",
      "model_output_name": "conjugate",
      "model_output_args": "{\"verb\":\"eat\",\"tense\":\"past_simple\",\"person\":\"3sg\"}",
      "parsed_tool_call": {"name": "conjugate", "arguments": {"verb": "eat", "tense": "past_simple", "person": "3sg"}},
      "tool_result": "ate",
      "timings_ms": {"mask_construction": 1.4, "generation": 142.7, "parse": 0.1, "dispatch": 8.3, "total": 152.5}
    }
This is the artifact that goes into PHASE_31_REPORT.md.
Constraints
- No model retraining. Use whatever MiniGPT exists at phase open. If it can't tool-call (or doesn't exist), use the stub (Block C note) and document.
- Tool error is OK. A masked generation that produces a valid-JSON tool call but a logically wrong one (e.g., conjugate("be", "past_simple", "3sg") returning "was" when the prompt was about eat) is a correctness failure of the model, not a failure of the lab.
- No retries on parse failure. If parse_failures > 0, fix the mask; don't paper over it with retries.
Stop conditions
Done when:
- results["parse_failures"] == 0 over 20 prompts.
- transcript.json exists with 20 entries.
- Aggregate timings written.
- One representative success case selected and copy-pasted into the phase report draft.
Pitfalls
- The mask only constrains what's between the braces. Surrounding prose ("Sure, here you go: {...}") still corrupts the JSON. Either prefix the generation with { (so the model doesn't have to choose to start the JSON) or use an outer schema that includes the leading-brace token in its accept set. Phase 30 lab 01 should have handled this; if not, fix it there.
- Two-stage generation re-anchors context. When you generate arguments= after generating name=, the model's prior context differs from a one-shot generation. This may hurt accuracy. Acceptable for Phase 31; Phase 33 may unify into one mask.
- oneOf and JSON Schema. Many JSON-Schema validators handle oneOf differently. We don't use oneOf at decode time (we use the two-stage shortcut). We do use it in build_tool_call_schema for validation; ensure your jsonschema lib version supports it (it does; oneOf has been part of JSON Schema since Draft 4).
- Caching masks. Building a JSONSchemaMask from scratch on every call is expensive. Cache by (schema_id, tokenizer_id). This is the only optimization permitted in Lab 03; anything else (e.g., parallel calls, request batching) waits for Phase 33.
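One way to get the `(schema_id, tokenizer_id)` key is to hash a canonical serialization of the schema. The sketch below is a minimal cache under that assumption; `MaskCache` is a hypothetical helper, `build_mask` stands for whatever callable constructs the real `JSONSchemaMask`, and `tokenizer_id` is assumed to be a string you assign to each tokenizer build.

```python
import hashlib
import json


class MaskCache:
    """Memoize mask objects by (schema_id, tokenizer_id)."""

    def __init__(self, build_mask):
        self._build_mask = build_mask  # callable: schema dict -> mask object
        self._masks = {}
        self.builds = 0  # number of cache misses, handy for the timing report

    @staticmethod
    def schema_id(schema: dict) -> str:
        # Canonical JSON, so key order does not change the id.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, schema: dict, tokenizer_id: str):
        key = (self.schema_id(schema), tokenizer_id)
        if key not in self._masks:
            self._masks[key] = self._build_mask(schema)
            self.builds += 1
        return self._masks[key]


# A plain object() stands in for the real mask in this illustration.
cache = MaskCache(build_mask=lambda schema: object())
first = cache.get({"type": "string", "enum": ["conjugate", "lookup_spanish"]}, "tok-v1")
second = cache.get({"enum": ["conjugate", "lookup_spanish"], "type": "string"}, "tok-v1")
print(first is second, cache.builds)  # True 1
```

Including the tokenizer in the key matters because a mask is expressed over token ids, so the same schema under a different vocabulary needs a separate entry.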
When to consult solutions/
After results["parse_failures"] == 0 and you have 20 clean transcript entries. The solution at solutions/03-mask-driven-toolcall-ref.md walks through how the reference implementation handled the two-stage simplification and what oneOf would have looked like if we'd unified.
End of Phase 31 labs. Next: write PHASE_31_REPORT.md, then open Phase 32 (docs/phase-32-agents/).