Lab 03: Mask-Driven Tool-Call Generation

Goal: close the Phase 30 → Phase 31 loop. Use ministruct.JSONSchemaMask to constrain a model's output so it emits a valid tool-call JSON; parse the JSON; dispatch through the MCP client; observe the result. End-to-end.

Estimated time: 2–3 hours.

Prereq: Lab 02 done (MCPClient working) AND Phase 30 lab 01 done (JSONSchemaMask working on the conjugation schema).

What you produce

In experiments/31-mask-driven-toolcall/:

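demo.py: runs the full pipeline.
prompts.json: the 20 test prompts with gold labels (Block B).
transcript.json: {prompt, model_output_name, model_output_args, parsed_tool_call, tool_result} for each of the 20 test prompts.
results.json: {total_prompts, parse_failures, dispatch_failures, tool_errors}.
timings.json: per-prompt timings and their aggregates (Block E).
manifest.json: versions + seed.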

Plus a small adapter in src/miniagent/:

src/miniagent/tool_call.py — bridges a constrained-decoding output blob to a tool-call dispatch.

The pipeline

prompt
  │
  ▼
MiniGPT.generate(prompt, mask=JSONSchemaMask(tool_call_schema), temperature=0.7)
  │  → "{\"name\":\"conjugate\",\"arguments\":{\"verb\":\"eat\",\"tense\":\"past_simple\",\"person\":\"3sg\"}}"
  ▼
json.loads(output)
  │  → {"name": "conjugate", "arguments": {...}}
  ▼
MCPClient.call_tool("conjugate", verb="eat", tense="past_simple", person="3sg")
  │  → "ate"
  ▼
result printed; logged to transcript

Each arrow is testable. Lab 03 wires them together.
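As a minimal sketch of testing one arrow in isolation, the parse-to-dispatch step can be exercised without a model or a server. `fake_call_tool` is a hypothetical stand-in for `MCPClient.call_tool`; its shape (tool name first, tool arguments as keywords) is assumed from the diagram above, not taken from the Lab 02 client.

```python
import json


def fake_call_tool(name, **arguments):
    """Hypothetical stand-in for MCPClient.call_tool; echoes what it received."""
    return {"tool": name, "received": arguments}


# The string a masked generation is expected to emit (copied from the diagram).
output = '{"name":"conjugate","arguments":{"verb":"eat","tense":"past_simple","person":"3sg"}}'

# Arrow 2: text -> dict.
call = json.loads(output)

# Arrow 3: dict -> dispatch. The envelope is unpacked by hand: the name goes
# first, and only the inner "arguments" object is splatted as keywords.
result = fake_call_tool(call["name"], **call["arguments"])
print(result)
```

Unpacking the envelope by hand keeps the envelope keys (`name`, `arguments`) from being mistaken for tool arguments.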

TODOs

Block A: the tool-call envelope schema

In src/miniagent/tool_call.py, define the JSON Schema for a tool-call message (not the tool's own argument schema — one layer up):

def build_tool_call_schema(tool_schemas: dict[str, dict]) -> dict:
    """Envelope schema: one oneOf branch per tool, discriminated by a const name."""
    return {
        "oneOf": [
            {
                "type": "object",
                "properties": {
                    "name": {"const": name},
                    "arguments": schema,  # the tool's own input_schema
                },
                "required": ["name", "arguments"],
                "additionalProperties": False,
            }
            for name, schema in tool_schemas.items()
        ]
    }

Reality check. Building a full oneOf discriminator is fiddly. For Phase 31's lab, simplify to a two-stage mask: first generate name under an enum-only schema; then re-instantiate the mask with the specific tool's input_schema and generate arguments. This avoids oneOf entirely. Document the simplification in tool_call.py's module docstring.

The two-stage variant in pseudocode:

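import json
from ministruct import JSONSchemaMask

def generate_tool_call(model, prompt, tool_schemas, *, temperature=0.7) -> dict:
    # Stage 1: choose the tool name under an enum-only mask (output is a JSON string).
    name_schema = {"type": "string", "enum": list(tool_schemas)}
    name = json.loads(model.generate(prompt + "\nname=", mask=JSONSchemaMask(name_schema), temperature=temperature))
    # Stage 2: generate the arguments under the chosen tool's input_schema.
    arg_schema = tool_schemas[name]
    args_text = model.generate(prompt + f"\nname={name}\narguments=", mask=JSONSchemaMask(arg_schema), temperature=temperature)
    return {"name": name, "arguments": json.loads(args_text)}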

This is the pedagogical path. A production system would use oneOf in one pass — that's a Phase 33 optimization.
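To see the two-stage data flow without a real model or mask, the sketch below runs the same sequence against a stub. Everything here is hypothetical scaffolding: `FakeMaskedModel` returns canned strings, `make_mask` stands in for `JSONSchemaMask` and simply passes the schema through, and the `lookup_irregular_verb` input schema is an assumed shape, not the real one from the server.

```python
import json


class FakeMaskedModel:
    """Hypothetical stub: returns canned outputs and records each prompt it saw."""

    def __init__(self, canned):
        self.canned = list(canned)
        self.prompts = []

    def generate(self, prompt, mask=None, temperature=0.7):
        self.prompts.append(prompt)
        return self.canned.pop(0)


def two_stage(model, prompt, tool_schemas, make_mask=lambda schema: schema):
    # Stage 1: the name is generated as a JSON string, so it arrives quoted.
    name_schema = {"type": "string", "enum": list(tool_schemas)}
    name = json.loads(model.generate(prompt + "\nname=", mask=make_mask(name_schema)))
    # Stage 2: the chosen name selects which argument schema constrains the output.
    args_text = model.generate(
        prompt + f"\nname={name}\narguments=", mask=make_mask(tool_schemas[name])
    )
    return {"name": name, "arguments": json.loads(args_text)}


# Assumed input schema for one tool, for illustration only.
tool_schemas = {
    "lookup_irregular_verb": {
        "type": "object",
        "properties": {"verb": {"type": "string"}},
        "required": ["verb"],
    }
}

model = FakeMaskedModel(['"lookup_irregular_verb"', '{"verb": "go"}'])
call = two_stage(model, "Is 'go' an irregular verb?", tool_schemas)
print(call)
print(repr(model.prompts[1]))
```

The recorded second prompt shows the re-anchoring: stage 2 sees the chosen name as part of its context, which a one-pass generation would not.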

Block B: prompt set

In experiments/31-mask-driven-toolcall/prompts.json:

20 short prompts, each one designed to elicit a specific tool call. Examples:

[
  {"prompt": "I need the past simple of 'eat' for he/she/it.", "expected_tool": "conjugate", "expected_args": {"verb": "eat", "tense": "past_simple", "person": "3sg"}},
  {"prompt": "Is 'go' an irregular verb?", "expected_tool": "lookup_irregular_verb", "expected_args": {"verb": "go"}},
  {"prompt": "Translate 'ate' to Spanish.", "expected_tool": "lookup_spanish", "expected_args": {"english_form": "ate"}},
  {"prompt": "Does 'he go' agree?", "expected_tool": "check_subject_verb_agreement", "expected_args": {"subject": "he", "verb_form": "go"}}
]

(Five of each tool, varied wording.)

These are gold labels for evaluation. We don't expect the small MiniGPT to nail tool selection — that's Phase 32's job. We just need the parse to succeed.
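A quick sanity check on the prompt set can catch a lopsided or malformed file before the demo runs. This is a sketch only: `check_prompt_set` is a hypothetical helper, and the "20 total, five per tool" targets are taken from the description above.

```python
from collections import Counter


def check_prompt_set(entries, expected_total=20, per_tool=5):
    """Return a list of human-readable problems; an empty list means the set looks fine."""
    problems = []
    required = {"prompt", "expected_tool", "expected_args"}
    for i, entry in enumerate(entries):
        missing = required - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
    if len(entries) != expected_total:
        problems.append(f"expected {expected_total} prompts, found {len(entries)}")
    # Count only entries that actually carry a gold tool label.
    counts = Counter(e["expected_tool"] for e in entries if "expected_tool" in e)
    for tool, n in sorted(counts.items()):
        if n != per_tool:
            problems.append(f"{tool}: {n} prompts, expected {per_tool}")
    return problems


# Two of the example entries from above, used as a tiny stand-in for prompts.json.
sample = [
    {"prompt": "Is 'go' an irregular verb?",
     "expected_tool": "lookup_irregular_verb",
     "expected_args": {"verb": "go"}},
    {"prompt": "Translate 'ate' to Spanish.",
     "expected_tool": "lookup_spanish",
     "expected_args": {"english_form": "ate"}},
]

print(check_prompt_set(sample, expected_total=2, per_tool=1))
print(check_prompt_set(sample))
```

For the real file, load it with `json.load` and call `check_prompt_set(entries)` with the defaults.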

Block C: the demo loop

In experiments/31-mask-driven-toolcall/demo.py:

Initialize:
  Load MiniGPT (from wherever Phase 26-29 left it). If the model isn't ready, mock it with a fixed-output stub that emits the gold labels; this is acceptable for Phase 31 because the constrained-decoding-plus-dispatch path is what we're testing, not the model's quality.
  Build one JSONSchemaMask per tool (cache).
  Open the client: with MCPClient(["python", "-m", "miniagent.mcp_server"]) as client: client.initialize(); client.list_tools().
For each prompt:
  Generate name under the name-only mask.
  Generate args under the per-tool mask.
  parsed = {"name": name, "arguments": json.loads(args_text)}.
  Validate parsed against the schema returned by build_tool_call_schema(tool_schemas).
  try: result = client.call_tool(parsed["name"], **parsed["arguments"]) ; except ToolError as e: result = {"error": str(e)}. (Splatting parsed itself would pass name and arguments as keywords, which does not match the call shape shown in the pipeline.)
  Append {prompt, model_output_name, model_output_args, parsed_tool_call, tool_result} to transcript.json.
Aggregate counters → results.json.
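The loop above can be dry-run entirely in process before the real model and server are wired in. The sketch below is one way to do that under stated assumptions: `FakeClient` and the local `ToolError` class are hypothetical stand-ins for the Lab 02 client, `gold_stub` plays the fixed-output stub that emits the gold labels, and schema validation is left out to keep the block dependency-free.

```python
import json


class ToolError(Exception):
    """Local stand-in for the client's ToolError (name taken from the step list)."""


class FakeClient:
    """Hypothetical in-process replacement for MCPClient; it knows one tool."""

    def call_tool(self, name, **arguments):
        if name != "conjugate":
            raise ToolError(f"unknown tool: {name}")
        return "ate" if arguments.get("verb") == "eat" else "unknown"


def gold_stub(entry):
    """Fixed-output stub: emits the gold labels as the two strings the masks would yield."""
    return json.dumps(entry["expected_tool"]), json.dumps(entry["expected_args"])


def run(entries, client, generate=gold_stub):
    transcript = []
    # dispatch_failures stays at zero here because the fake client has no transport.
    results = {"total_prompts": 0, "parse_failures": 0,
               "dispatch_failures": 0, "tool_errors": 0}
    for entry in entries:
        results["total_prompts"] += 1
        name_text, args_text = generate(entry)
        try:
            parsed = {"name": json.loads(name_text), "arguments": json.loads(args_text)}
        except json.JSONDecodeError:
            results["parse_failures"] += 1  # counted, never retried
            continue
        try:
            tool_result = client.call_tool(parsed["name"], **parsed["arguments"])
        except ToolError as e:
            results["tool_errors"] += 1
            tool_result = {"error": str(e)}
        transcript.append({
            "prompt": entry["prompt"],
            "model_output_name": name_text,
            "model_output_args": args_text,
            "parsed_tool_call": parsed,
            "tool_result": tool_result,
        })
    return transcript, results


entries = [
    {"prompt": "I need the past simple of 'eat' for he/she/it.",
     "expected_tool": "conjugate",
     "expected_args": {"verb": "eat", "tense": "past_simple", "person": "3sg"}},
    {"prompt": "Is 'go' an irregular verb?",
     "expected_tool": "lookup_irregular_verb",
     "expected_args": {"verb": "go"}},
]
transcript, results = run(entries, FakeClient())
print(json.dumps(results))
```

Passing `generate` as a parameter means the stub can later be swapped for the real two-stage generation without touching the counting logic.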

Block D: assertions

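The lab's operational claim (from theory/01-function-calling-formats.md §"The argument-format question"): under the mask, parse failures are impossible, provided each generation runs until the mask accepts a complete JSON value. An output that is cut short (for example by a token limit) can still fail to parse.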

Assert results["parse_failures"] == 0. If any prompt produced unparseable JSON, the mask is broken — debug Phase 30's JSONSchemaMask, not the dispatch.
Assert every tool_result is either a valid tool output (string or dict) or the {"error": ...} record built from a ToolError. A ToolError is a logical failure inside the tool, which is the tool's prerogative and acceptable here.
Do not assert tool-selection accuracy. The model is small; tool selection is Phase 32 with planning. We're testing the plumbing, not the brain.
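The three rules above could be collected into one check that runs at the end of demo.py. This is a sketch: `check_lab_invariants` is a hypothetical helper name, and the sample data is invented to exercise it.

```python
def check_lab_invariants(results, transcript):
    """Raise AssertionError if the plumbing claims of the lab do not hold."""
    # A parse failure points at the mask, so the message says where to look.
    assert results["parse_failures"] == 0, (
        "parse failure under the mask: debug JSONSchemaMask, not the dispatch"
    )
    for i, entry in enumerate(transcript):
        out = entry["tool_result"]
        # Tool outputs and recorded ToolErrors are both str or dict.
        assert isinstance(out, (str, dict)), (
            f"entry {i}: unexpected tool_result type {type(out).__name__}"
        )
    # Deliberately no check on tool-selection accuracy.
    return True


good_results = {"total_prompts": 2, "parse_failures": 0,
                "dispatch_failures": 0, "tool_errors": 1}
good_transcript = [
    {"tool_result": "ate"},
    {"tool_result": {"error": "unknown verb"}},
]
print(check_lab_invariants(good_results, good_transcript))
```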

Block E: measurement

Per-prompt timings written to timings.json:

mask_construction_ms — how long to build JSONSchemaMask per tool.
generation_ms — model-side time spent generating the JSON.
parse_ms — json.loads (always microseconds).
dispatch_ms — round-trip through the MCP client to server and back.
total_ms.

Report aggregate: mean, P50, P95. This is the Phase 31 latency baseline; Phase 33 will compare.
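One possible way to collect and aggregate these numbers with the standard library is sketched below. `timed` and `summarize` are hypothetical helper names, and the nearest-rank percentile is a choice made here for simplicity; the lab does not prescribe a percentile definition.

```python
import json
import statistics
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (value, elapsed milliseconds) using a monotonic clock."""
    start = time.perf_counter()
    value = fn(*args, **kwargs)
    return value, (time.perf_counter() - start) * 1000.0


def summarize(samples_ms):
    """Mean, P50 and P95 of a non-empty list, using nearest-rank percentiles."""
    ordered = sorted(samples_ms)
    n = len(ordered)

    def nearest_rank(p):
        # Integer ceiling of p * n / 100, clamped to at least rank 1.
        k = max(1, (p * n + 99) // 100)
        return ordered[k - 1]

    return {
        "mean": statistics.fmean(ordered),
        "p50": nearest_rank(50),
        "p95": nearest_rank(95),
    }


# Time a single parse step, then aggregate 20 made-up per-prompt samples.
parsed, parse_ms = timed(json.loads, '{"verb": "eat"}')
summary = summarize([float(x) for x in range(1, 21)])
print(summary)
```

With only 20 prompts, P95 is effectively the second-largest sample, so a single slow dispatch will move it noticeably.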

Block F: log a representative successful example

In transcript.json, the first entry should be a clean demonstrative case for the phase report:

{
  "prompt": "I need the past simple of 'eat' for he/she/it.",
  "model_output_name": "conjugate",
  "model_output_args": "{\"verb\":\"eat\",\"tense\":\"past_simple\",\"person\":\"3sg\"}",
  "parsed_tool_call": {"name": "conjugate", "arguments": {"verb": "eat", "tense": "past_simple", "person": "3sg"}},
  "tool_result": "ate",
  "timings_ms": {"mask_construction": 1.4, "generation": 142.7, "parse": 0.1, "dispatch": 8.3, "total": 152.5}
}

This is the artifact that goes into PHASE_31_REPORT.md.

Constraints

No model retraining. Use whatever MiniGPT exists at phase open. If it can't tool-call (or doesn't exist), use the stub (Block C note) and document.
Wrong calls are OK. A masked generation that produces a valid-JSON tool call but a logically wrong one (e.g., conjugate("be", "past_simple", "3sg") returning "was" when the prompt was about eat) is a correctness failure of the model, not a failure of the lab. A ToolError raised by the tool is likewise recorded in the transcript, not treated as a lab failure.
No retries on parse failure. If parse_failures > 0, fix the mask, don't paper over with retries.

Stop conditions

Done when:

results["parse_failures"] == 0 over 20 prompts.
transcript.json exists with 20 entries.
Aggregate timings written.
One representative success case selected and copy-pasted into the phase report draft.

Pitfalls

The mask only constrains what's between the braces. Surrounding prose ("Sure, here you go: {...}") is outside its reach and makes json.loads fail on the full output. Either prefix the generation with { (so the model doesn't have to choose to start the JSON) or use an outer schema that includes the leading-brace token in its accept set. Phase 30 lab 01 should have handled this; if not, fix it there.
Two-stage generation re-anchors context. When you generate arguments= after generating name=, the model's prior context differs from a one-shot generation. This may hurt accuracy. Acceptable for Phase 31; Phase 33 may unify into one mask.
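oneOf and JSON Schema. oneOf requires an instance to match exactly one branch, so overlapping branches make validation fail, and validators differ in how they report such failures. We don't use oneOf at decode time (we use the two-stage shortcut). We do use it in build_tool_call_schema for validation; oneOf has been part of JSON Schema since Draft 4, so any current jsonschema release supports it.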
Caching masks. Building a JSONSchemaMask from scratch on every call is expensive. Cache by (schema_id, tokenizer_id). This is the only optimization permitted in Lab 03 — anything else (e.g., parallel calls, request batching) waits for Phase 33.
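A minimal sketch of the mask cache described in the last pitfall, keyed by (schema_id, tokenizer_id). The names are hypothetical: `MaskCache` is not part of ministruct, `build_mask` is whatever callable constructs a `JSONSchemaMask` in your setup, and deriving `schema_id` from a hash of the canonical JSON is one assumed way to identify a schema.

```python
import hashlib
import json


class MaskCache:
    """Builds each mask once per (schema_id, tokenizer_id) pair."""

    def __init__(self, build_mask):
        self._build_mask = build_mask  # assumed: a callable taking the schema dict
        self._masks = {}
        self.builds = 0  # exposed so a test can confirm the cache is hit

    @staticmethod
    def schema_id(schema):
        # Canonical JSON makes key order irrelevant to the identity of a schema.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, schema, tokenizer_id):
        key = (self.schema_id(schema), tokenizer_id)
        if key not in self._masks:
            self._masks[key] = self._build_mask(schema)
            self.builds += 1
        return self._masks[key]


# A placeholder builder stands in for the expensive mask construction.
cache = MaskCache(build_mask=lambda schema: ("mask-for", sorted(schema)))
a = cache.get({"type": "string", "enum": ["conjugate"]}, tokenizer_id="tok-v1")
b = cache.get({"enum": ["conjugate"], "type": "string"}, tokenizer_id="tok-v1")
c = cache.get({"type": "string", "enum": ["conjugate"]}, tokenizer_id="tok-v2")
print(cache.builds)
```

Including the tokenizer in the key matters because a mask built against one vocabulary is not valid for another, even when the schema is identical.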

When to consult `solutions/`

After results["parse_failures"] == 0 and you have 20 clean transcript entries. The solution at solutions/03-mask-driven-toolcall-ref.md walks through how the reference implementation handled the two-stage simplification and what oneOf would have looked like if we'd unified.

End of Phase 31 labs. Next: write PHASE_31_REPORT.md, then open Phase 32 (docs/phase-32-agents/).