03 - Threat modeling, with numbers
The threat matrix doesn't need LaTeX, it needs a spreadsheet. For each attack class: \(R = P \cdot S \cdot (1 - D)\). \(P\) = probability of an attempt, \(S\) = severity (1-5), \(D\) = probability of detection. The estimates are honest, not measured, but filling in the row is what forces the conversation.
Why a number, not a vibe
Two common failure modes in security work:
- Vibes-only ranking. "Pickle is bad, RAG poisoning is bad, we should fix the worst one first." Which is worst? The discussion is unbounded; everyone has an opinion.
- Theatre-grade rigor. Multi-page risk-rating frameworks (CVSS, DREAD) calibrated against industry datasets the team has never seen. Effort goes into the framework, not the threats.
The middle path: assign three numbers per threat, multiply them, sort. The numbers are estimates. The point is forcing a comparison, not pretending to measure absolute risk.
The formula
For each threat \(T\):
\[R_T = P_T \cdot S_T \cdot (1 - D_T)\]
Where:
- \(P_T \in [0, 1]\) — probability of an attempt during the system's lifetime.
- \(S_T \in \{1, 2, 3, 4, 5\}\) — severity if the attempt succeeds. (1 = annoyance; 5 = full host compromise.)
- \(D_T \in [0, 1]\) — probability that current defenses detect or block the attempt.
\(R_T\) is a residual risk score. Higher = more important to fix. The scale is arbitrary; only relative ranks matter.
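The formula is small enough to sketch as a helper. The function name `residual_risk` and the range checks are illustrative assumptions, not part of the course code; the example call uses the T1 estimates from the grammar-tutor matrix later on this page.

```python
def residual_risk(p: float, s: int, d: float) -> float:
    """Residual risk R = P * S * (1 - D) for a single threat.

    p: probability of an attempt, in [0, 1]
    s: severity, an integer from 1 to 5
    d: probability that current defenses detect or block, in [0, 1]
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if s not in (1, 2, 3, 4, 5):
        raise ValueError("s must be an integer from 1 to 5")
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must be in [0, 1]")
    return p * s * (1.0 - d)


# T1 (direct prompt injection): P = 0.8, S = 2, D = 0.4
print(round(residual_risk(0.8, 2, 0.4), 2))  # 0.96
```

Rejecting out-of-range inputs is a design choice for the sketch: a typo such as `d = 7` would otherwise produce a negative score that still sorts without complaint.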
Why not severity alone
"Sort by severity" is the wrong move when half the threats are severity 5 (host RCE, data exfiltration, model corruption) and the other half are 5/5 too. Probability and detection are what discriminate.
Why include detection
A high-severity attack with high detection probability is less urgent to mitigate than a medium-severity attack with low detection. The detection term captures the value of existing defenses (Phase 30 schema, Phase 32 sandbox, Phase 34 log redactor).
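As a worked comparison, take two rows from the grammar-tutor matrix on this page. Command injection (T5) has maximum severity but is well detected; indirect RAG injection (T2) has medium severity and is almost undetected:

\[
\begin{aligned}
R_{T5} &= 0.4 \cdot 5 \cdot (1 - 0.7) = 0.60 \\
R_{T2} &= 0.3 \cdot 3 \cdot (1 - 0.1) = 0.81
\end{aligned}
\]

Despite its lower \(P\) and \(S\), T2 ranks above T5 because almost none of its risk is absorbed by existing defenses.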
The grammar-tutor matrix
Concrete numbers for the Phase 32 grammar tutor, as of Phase 37 entry (pre-mitigation):
| # | Threat | \(P\) | \(S\) | \(D\) | \(R = P \cdot S \cdot (1 - D)\) |
|---|---|---|---|---|---|
| T1 | Direct prompt injection ("respond as pirate") | 0.8 | 2 | 0.4 | 0.96 |
| T2 | Indirect injection via poisoned RAG chunk | 0.3 | 3 | 0.1 | 0.81 |
| T3 | Jailbreak (DAN-style, encoding tricks) | 0.4 | 2 | 0.3 | 0.56 |
| T4 | Tool abuse — path traversal | 0.5 | 4 | 0.7 | 0.60 |
| T5 | Tool abuse — command injection | 0.4 | 5 | 0.7 | 0.60 |
| T6 | Supply chain — pickle on weight load | 0.1 | 5 | 0.0 | 0.50 |
| T7 | Supply chain — KB chunk tampering | 0.2 | 3 | 0.3 | 0.42 |
| T8 | Trace storage — injection payload logged | 0.6 | 2 | 0.8 | 0.24 |
| T9 | Memory poisoning (no long-term mem) | 0.0 | 3 | 1.0 | 0.00 |
Justifications for the headline rows:
- T1, \(P = 0.8\): anyone who hears about the tutor will try the pirate prompt at least once. It's the first thing people try with any LLM.
- T1, \(S = 2\): the worst case is the tutor speaks pirate. There's no harmful content channel; severity is "integrity violation," not "user harm."
- T2, \(P = 0.3\): requires write access to the KB. Lower than T1, but not negligible: anyone with repo access can edit chunks.jsonl.
- T2, \(D = 0.1\): no current defense detects poisoned chunks. This is the highest-residual entry that's not already mitigated.
- T6, \(P = 0.1\): only happens if Borja downloads a hostile checkpoint. Hand-trained, so low.
- T6, \(D = 0.0\): nothing currently checks weights until Lab 04 ships verify_artifacts.sh.
- T9: the grammar tutor has no long-term memory (Phase 32 explicitly chose stateless agents for §A13's microscopic scope), so this row exists only to document the absence of the threat.
Sorting the matrix
Sorted by \(R\) (descending), pre-mitigation:
- T1 (0.96) — direct prompt injection
- T2 (0.81) — indirect via RAG
- T4 (0.60) — path traversal
- T5 (0.60) — command injection
- T3 (0.56) — jailbreaks
- T6 (0.50) — pickle deserialization
- T7 (0.42) — KB tampering
- T8 (0.24) — trace storage
- T9 (0.00) — memory poisoning (n/a)
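The ranking above can be reproduced with a few lines of Python. This is a minimal sketch: the tuple layout and the names `MATRIX` and `score` are assumptions for illustration, and the numbers are copied from the pre-mitigation table.

```python
# (id, threat, P, S, D) copied from the pre-mitigation matrix.
MATRIX = [
    ("T1", "Direct prompt injection", 0.8, 2, 0.4),
    ("T2", "Indirect injection via poisoned RAG chunk", 0.3, 3, 0.1),
    ("T3", "Jailbreak", 0.4, 2, 0.3),
    ("T4", "Tool abuse: path traversal", 0.5, 4, 0.7),
    ("T5", "Tool abuse: command injection", 0.4, 5, 0.7),
    ("T6", "Supply chain: pickle on weight load", 0.1, 5, 0.0),
    ("T7", "Supply chain: KB chunk tampering", 0.2, 3, 0.3),
    ("T8", "Trace storage: injection payload logged", 0.6, 2, 0.8),
    ("T9", "Memory poisoning", 0.0, 3, 1.0),
]


def score(row):
    """R = P * S * (1 - D), rounded so float noise cannot break ties."""
    _, _, p, s, d = row
    return round(p * s * (1 - d), 6)


# sorted() is stable, so tied rows (T4 and T5) keep their table order.
ranked = sorted(MATRIX, key=score, reverse=True)
for row in ranked:
    print(f"{row[0]}  {score(row):.2f}  {row[1]}")
```

Rounding inside the key is a small safeguard: products like `0.3 * 3` are not exact in floating point, and without it two rows that are equal on paper could sort in an arbitrary order.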
This sort dictates lab order:
- Lab 00 addresses T1 (highest-\(R\) unmitigated).
- Lab 01 addresses T2 (second-highest).
- Lab 02 addresses T3 (lower \(R\), but the techniques transfer).
- Lab 03 addresses T4, T5 (tooling-related; bundled).
- Lab 04 addresses T6, T7 (supply chain bundled).
Re-scoring after mitigation
The same matrix, post-Phase-37 mitigations (output schema enforced, sandbox tested, manifest checked):
| # | Threat | \(P\) | \(S\) | \(D\) (post) | \(R\) (post) | Δ |
|---|---|---|---|---|---|---|
| T1 | Direct prompt injection | 0.8 | 2 | 0.9 | 0.16 | −0.80 |
| T2 | Indirect injection via RAG | 0.3 | 3 | 0.5 | 0.45 | −0.36 |
| T3 | Jailbreak | 0.4 | 2 | 0.6 | 0.32 | −0.24 |
| T4 | Path traversal | 0.5 | 4 | 0.95 | 0.10 | −0.50 |
| T5 | Command injection | 0.4 | 5 | 0.95 | 0.10 | −0.50 |
| T6 | Pickle deserialization | 0.1 | 5 | 0.99 | 0.005 | −0.495 |
| T7 | KB chunk tampering | 0.2 | 3 | 0.9 | 0.06 | −0.36 |
T2 (indirect injection) is the largest residual. Output schema and KB hygiene help but don't close the gap — the attacker still controls retrieved content. This becomes the headline finding in experiments/37-redteam-report/findings.md.
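The Δ column and the "largest residual" claim can be checked mechanically. The sketch below assumes a simple tuple layout (`ROWS`, `post`, and `largest` are illustrative names); only \(D\) changes between the two matrices, so \(P\) and \(S\) are shared.

```python
# (id, P, S, D_pre, D_post) copied from the pre- and post-mitigation matrices.
ROWS = [
    ("T1", 0.8, 2, 0.4, 0.9),
    ("T2", 0.3, 3, 0.1, 0.5),
    ("T3", 0.4, 2, 0.3, 0.6),
    ("T4", 0.5, 4, 0.7, 0.95),
    ("T5", 0.4, 5, 0.7, 0.95),
    ("T6", 0.1, 5, 0.0, 0.99),
    ("T7", 0.2, 3, 0.3, 0.9),
]


def residual_risk(p, s, d):
    """R = P * S * (1 - D)."""
    return p * s * (1 - d)


post = {}
for tid, p, s, d_pre, d_post in ROWS:
    r_pre = residual_risk(p, s, d_pre)
    r_post = residual_risk(p, s, d_post)
    post[tid] = r_post
    print(f"{tid}  pre={r_pre:.3f}  post={r_post:.3f}  delta={r_post - r_pre:+.3f}")

# The threat with the highest post-mitigation score.
largest = max(post, key=post.get)
print("largest residual:", largest)
```

Keeping the pre and post detection estimates in one row makes it harder to re-score a threat while silently changing its \(P\) or \(S\).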
What the numbers don't tell you
- Correlated failures. Two threats scored independently at \(R = 0.1\) each may compose into a chain at \(R \gg 0.2\). The matrix is per-threat; chained-exploit risk is a separate exercise (out of scope for Phase 37).
- Adversary capability assumed constant. The matrix assumes a generic motivated attacker. A nation-state attacker collapses every \(D\) to near-zero.
- Severity is one-dimensional. A real severity scale separates confidentiality / integrity / availability. The grammar tutor's microscopic scope makes that overkill; the 1–5 scalar is fine.
- No time dynamics. Risk changes as the codebase evolves. The matrix is a snapshot; re-score at major refactors.
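The adversary-capability caveat can be explored as a what-if. The sketch below re-ranks the pre-mitigation rows with every \(D\) forced to 0, a stand-in for an attacker that no current defense detects. The override value is an assumption for illustration, not an estimate from the report, and `rank` is a hypothetical helper.

```python
# (id, P, S, D) from the pre-mitigation matrix; T9 is omitted because P = 0.
MATRIX = [
    ("T1", 0.8, 2, 0.4),
    ("T2", 0.3, 3, 0.1),
    ("T3", 0.4, 2, 0.3),
    ("T4", 0.5, 4, 0.7),
    ("T5", 0.4, 5, 0.7),
    ("T6", 0.1, 5, 0.0),
    ("T7", 0.2, 3, 0.3),
    ("T8", 0.6, 2, 0.8),
]


def rank(rows, d_override=None):
    """Return threat ids sorted by R, optionally replacing every D."""
    scored = []
    for tid, p, s, d in rows:
        d_eff = d if d_override is None else d_override
        scored.append((tid, round(p * s * (1 - d_eff), 3)))
    scored.sort(key=lambda item: item[1], reverse=True)  # stable for ties
    return [tid for tid, _ in scored]


baseline = rank(MATRIX)
no_detection = rank(MATRIX, d_override=0.0)
print("baseline:    ", baseline)
print("no detection:", no_detection)
```

Under this assumption the tool-abuse rows (T4, T5) move to the top, because they are the rows where existing detection currently absorbs the most risk.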
How to use the matrix
The matrix lives in experiments/37-redteam-report/findings.md and is the primary deliverable of Phase 37, alongside the regression tests. Its job is:
- Communicate priorities. Anyone reading the report can see "T2 is the largest residual" without inferring it from prose.
- Justify time allocation. Lab 00 takes a day, and its T1 mitigation removes 0.80 of residual risk. Lab 04 takes a few hours, and its T6 mitigation removes 0.495. Lab 02 (jailbreaks) takes more time than its 0.24 reduction in \(R\) warrants, which is flagged in the report's "lessons learned" section.
- Carry forward. Phase 40 (hardening / postmortem) revisits the matrix; numbers shift as the system evolves.
One-paragraph recap
Threat modeling needs numbers, not vibes. Three estimates per threat (\(P\), \(S\), \(D\)) multiply to a residual risk score; sorting by \(R\) dictates mitigation order. The point is comparative ranking, not absolute measurement. For the grammar tutor, the pre-mitigation top-three are direct injection, indirect RAG injection, and tool-abuse paths; post-mitigation, indirect RAG injection becomes the largest residual and the report's headline finding.
Next: theory/04-fuzzing-and-sandbox.md — Hypothesis fuzzing of tool args and Phase 32 sandbox review.