Reproducibility — seeds, lockfiles, manifests

Summary. Three mechanisms: (1) seed every source of randomness, (2) freeze exact versions with a lockfile, (3) persist a manifest per experiment. Without all three, results are not reproducible; they are anecdotes.

§0 The three pillars

A result is reproducible when someone else, six months from now, on different hardware, can re-run your script and get bit-identical (or, on GPU, "within documented tolerance") numerical results. That requires three things:

  1. All randomness is seeded — every RNG that touches the computation graph.
  2. All code dependencies are pinned — exact versions + hashes, in a lockfile that's committed.
  3. Per-experiment provenance is recorded — the seed, the lockfile sha, the git sha, the hardware, the config. Lose any one and the run is unreproducible by definition.

§1 Sources of randomness

In a NumPy + (eventually) PyTorch + CUDA stack, the RNG sources are:

| Source | Where it bites you |
|---|---|
| random (stdlib) | random.shuffle, random.choice (secrets does not belong here: it is intentionally non-seedable) |
| numpy.random (legacy + Generator) | Most pre-ML phases |
| torch CPU RNG | torch.randn, torch.randperm, dropout, init layers |
| torch.cuda per-device RNG | GPU dropouts, GPU init |
| cuDNN nondeterministic algorithms | torch.backends.cudnn.deterministic, cudnn.benchmark |
| PYTHONHASHSEED | Set iteration order, hash(str); affects dataloaders that index by hash |
| OS scheduling / multi-threading | OMP_NUM_THREADS, BLAS threading; sum-reduction order can change once more than one thread is involved |
| Hardware nondeterminism (TF32, FP16 atomic add) | TF32 on Ampere+; atomicAdd in FP16 |

The function in src/utils/seeding.py covers the first five. It also writes PYTHONHASHSEED, but that only reaches child processes. The remaining three are handled at run boundaries: PYTHONHASHSEED in the launcher, OMP_NUM_THREADS=1 for fully deterministic CPU runs, TF32 disabled where determinism matters.
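The run-boundary settings can be collected in a small launcher that builds the child environment and starts the experiment as a subprocess. This is a minimal sketch under assumptions, not the project's actual wrapper: `deterministic_env` and `launch` are hypothetical names, and MKL_NUM_THREADS and OPENBLAS_NUM_THREADS are included on the assumption that the installed BLAS reads them. TF32 is a framework-side switch and is not handled here.

```python
import os
import subprocess
import sys


def deterministic_env(seed: int) -> dict[str, str]:
    """Return a copy of the current environment with run-boundary settings applied."""
    env = dict(os.environ)
    # PYTHONHASHSEED accepts an integer in [0, 4294967295] (or "random").
    env["PYTHONHASHSEED"] = str(seed)
    # Single-threaded math libraries keep the reduction order fixed.
    env["OMP_NUM_THREADS"] = "1"
    env["MKL_NUM_THREADS"] = "1"       # assumption: only relevant with MKL
    env["OPENBLAS_NUM_THREADS"] = "1"  # assumption: only relevant with OpenBLAS
    return env


def launch(script_args: list[str], seed: int) -> int:
    """Run `python <script_args>` in a child process and return its exit code."""
    result = subprocess.run([sys.executable, *script_args], env=deterministic_env(seed))
    return result.returncode
```

A call such as `launch(["train.py", "--seed", "42"], seed=42)` (with a hypothetical `train.py`) would start the experiment with the variables already in place before the interpreter boots.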

§1.1 Why each seeding call?

import os
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)   # hash(str), set order (child processes only)
    random.seed(seed)                           # stdlib RNG
    np.random.seed(seed)                        # NumPy legacy global RNG
    torch.manual_seed(seed)                     # CPU RNG; recent PyTorch also seeds CUDA here
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)        # every CUDA device, explicitly
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

Subtleties:

  • PYTHONHASHSEED must be set before the interpreter starts to make hash(str) deterministic. Setting it via os.environ inside a running process does not change that process's hashing; it only reaches child processes started afterwards. In practice we set it in the launcher (just recipe or shell wrapper). Code-level setting is hygienic but not sufficient.
  • np.random.seed only seeds the legacy global generator. Code that uses rng = np.random.default_rng(seed) is independent of it (and is the better choice), but the legacy seed is still set for libraries that haven't migrated.
  • cudnn.benchmark = True lets cuDNN pick the fastest algorithm at runtime; the choice depends on input shapes and can flip between runs. Disabling it costs throughput but is the price of determinism.
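The PYTHONHASHSEED behaviour is easy to check with the standard library alone. The sketch below is illustrative: `hash_in_child` is a hypothetical helper and the key is a made-up example of something a dataloader might hash.

```python
import os
import subprocess
import sys


def hash_in_child(text: str, hashseed: str) -> int:
    """Compute hash(text) in a fresh interpreter started with PYTHONHASHSEED=hashseed."""
    env = dict(os.environ, PYTHONHASHSEED=hashseed)
    completed = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout)


key = "shard-0007"  # hypothetical key a dataloader might hash

# Changing the variable inside a running interpreter leaves its hashing untouched.
before = hash(key)
os.environ["PYTHONHASHSEED"] = "123"
after = hash(key)
in_process_unchanged = before == after

# Two fresh interpreters started with the same value agree with each other.
children_agree = hash_in_child(key, "123") == hash_in_child(key, "123")
```

Both flags come out true: the running process keeps whatever hash seed it started with, while the two child interpreters agree because they were started with the same value.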

§1.2 What seed_everything does not do

  • It doesn't make O(n) parallel reductions deterministic on multi-threaded BLAS. For that, set OMP_NUM_THREADS=1 or use deterministic reduction implementations.
  • It doesn't make non-deterministic algorithms deterministic (some scatter_add variants, for example). torch.use_deterministic_algorithms(True) is the way to enforce that; it raises if you use an op that has no deterministic implementation.
  • It doesn't seed any threads spawned before the call. Call it first.
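The reduction-order point does not need a GPU or BLAS to observe: floating-point addition is not associative, so splitting a sum across workers can change the last bit. The sketch below imitates a one-thread and a two-thread reduction in plain Python; the function names are hypothetical, and `functools.reduce` is used instead of `sum()` so that every addition happens in the stated order.

```python
from functools import reduce
from operator import add


def left_fold_sum(values: list[float]) -> float:
    """Add values strictly left to right, as a single thread would."""
    return reduce(add, values)


def two_way_sum(values: list[float], split: int) -> float:
    """Add two partial sums, as two threads dividing the work at `split` would."""
    return left_fold_sum(values[:split]) + left_fold_sum(values[split:])


values = [0.1, 0.2, 0.3]
single = left_fold_sum(values)      # (0.1 + 0.2) + 0.3
split = two_way_sum(values, 1)      # 0.1 + (0.2 + 0.3)

print(single == split)              # False
print(abs(single - split) < 1e-15)  # True: they differ only in the last bit
```

Seeding has no bearing on this difference, which is why the thread count is fixed at the run boundary rather than inside seed_everything.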

§2 Lockfiles — the difference between "I installed numpy 2.x" and "we both installed numpy 2.0.1 from the same wheel hash"

A requirements.txt with version specifiers like numpy>=2,<3 is not a lockfile — it's a constraint. The same constraint resolves to different exact versions on different days, depending on what's been released since.

A lockfile records:

  • Every direct dependency's exact version.
  • Every transitive dependency's exact version.
  • The wheel/sdist hash for each.
  • The resolver decisions (which conflicts were broken which way).

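uv.lock is uv's lockfile format. It's checked in. uv sync installs from it. If the lockfile is out of date relative to pyproject.toml, uv sync updates it unless you pass --locked (fail instead of updating) or --frozen (use the lockfile as it is). The hash check means a compromised PyPI index can't quietly swap a package: installation would fail.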
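The idea behind the hash check can be shown in a few lines. This is a conceptual sketch, not uv's implementation: `sha256_of` and `matches_pin` are hypothetical helpers that compare a downloaded file against a pinned digest. The same helper could also produce the lockfile sha that goes into a manifest.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path: Path, pinned_sha256: str) -> bool:
    """True if the file's digest equals the pinned one (case-insensitive hex)."""
    return sha256_of(path) == pinned_sha256.lower()
```

An installer that refuses to proceed when `matches_pin` is false cannot be handed a different artifact under the same version number without noticing.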

Summary. A requirements.txt with >= and < is a constraint, not a lockfile. It can resolve to different versions on different days. The lockfile freezes the exact versions and hashes; if someone tampers with the index, installation fails.

§2.1 When the lockfile changes

  • A direct dep is added/removed/upgraded in pyproject.toml → re-run uv lock.
  • A transitive dep's constraint moves (because a direct dep changed its constraints) → uv lock regenerates.
  • The lockfile is committed. PRs that change it must justify the change.

§3 The experiment manifest

For every run that produces a numeric artifact (a loss, a metric, a checkpoint, a plot), persist:

{
  "id": "2026-05-22-softmax-stability",
  "git_sha": "a1b2c3d4...",
  "git_dirty": false,
  "seed": 42,
  "config": { "/* the actual hyperparameters */": null },
  "versions": {
    "python": "3.11.9",
    "numpy": "2.0.1",
    "torch": "2.3.1",
    "uv": "0.4.18"
  },
  "hardware": {
    "cpu": "Intel i5-8250U",
    "ram_gb": 62,
    "gpu": "Intel UHD 620 (no CUDA)",
    "os": "Fedora 43"
  },
  "started_at": "2026-05-22T19:14:02Z",
  "finished_at": "2026-05-22T19:14:08Z",
  "wall_seconds": 6.31,
  "artifacts": ["plot.svg", "loss.npy"]
}

src/utils/seeding.py has log_versions() for the versions block; the rest is composed in the experiment script.
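The timing fields are easy to forget when the manifest is assembled by hand. One possible way to fill them is a context manager around the experiment body; `timed_run` is a hypothetical helper, not something the source says exists in src/utils/seeding.py.

```python
import time
from contextlib import contextmanager
from datetime import datetime, timezone


def _utc_now() -> str:
    """Current UTC time in the manifest's timestamp format."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@contextmanager
def timed_run(manifest: dict):
    """Fill started_at, finished_at and wall_seconds around the wrapped block."""
    manifest["started_at"] = _utc_now()
    start = time.perf_counter()
    try:
        yield manifest
    finally:
        # Recorded even if the experiment raises, so failed runs keep their timing.
        manifest["finished_at"] = _utc_now()
        manifest["wall_seconds"] = round(time.perf_counter() - start, 2)


manifest = {"id": "demo", "seed": 42}  # placeholder values for illustration
with timed_run(manifest):
    total = sum(range(1000))           # stand-in for the real experiment
```

Using `perf_counter` for the duration and wall-clock UTC for the timestamps keeps the elapsed time unaffected by system clock adjustments.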

§3.1 Why hardware is in the manifest

CPU vs. GPU paths are bit-different. Within GPU, sm_70 and sm_80 differ on TF32: sm_80 (Ampere) supports it and a framework may enable it by default for some ops, while sm_70 has no TF32 at all. Within CPU, the BLAS reduction tree can vary by core count. If you don't record hardware, you can't tell whether a numerical discrepancy is your bug or the machine's.
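Part of the hardware block can be gathered from the standard library. The sketch below is a starting point under assumptions: `hardware_block` is a hypothetical helper, its keys only partly match the schema in §3, and RAM and GPU details are left out because they need platform-specific or third-party calls.

```python
import os
import platform


def hardware_block() -> dict[str, object]:
    """Collect the hardware facts the standard library can report portably."""
    return {
        # platform.processor() is empty on some systems; fall back to the arch.
        "cpu": platform.processor() or platform.machine(),
        "cpu_count": os.cpu_count(),
        "machine": platform.machine(),
        "os": platform.platform(),
    }
```

Recording `cpu_count` alongside the CPU name is what lets you later connect a last-bit discrepancy to a different reduction tree.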

§3.2 Why "git_dirty" matters

If the working tree is dirty (uncommitted changes), the git sha is a lie. Manifests with git_dirty: true are quarantine artifacts — useful for fast iteration, useless for the report. The phase-gatekeeper flags any DoD-relevant manifest with git_dirty: true.
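A minimal sketch of how the two git fields might be captured, assuming git is on PATH and the script runs inside the repository. `git_state` is a hypothetical helper; it returns None for both fields rather than raising when git is unavailable.

```python
import subprocess
from typing import Optional


def _git(*args: str) -> Optional[str]:
    """Run a git command and return its stdout, or None if it cannot run."""
    try:
        completed = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return completed.stdout


def git_state() -> dict[str, object]:
    """Return git_sha and git_dirty for the current working directory."""
    sha = _git("rev-parse", "HEAD")
    # --porcelain prints one line per modified or untracked file; empty means clean.
    status = _git("status", "--porcelain")
    return {
        "git_sha": sha.strip() if sha is not None else None,
        "git_dirty": bool(status.strip()) if status is not None else None,
    }
```

Note that this treats untracked files as dirty, which is the conservative reading: an untracked module can change results just as an edited one can.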

§4 What the lab will test

  • §lab/03: re-implement seed_everything from scratch without peeking. Confirm with pytest that 10 invocations of random.random() after seed_everything(0) produce the same sequence as 10 invocations after random.seed(0).
  • §lab/00 checklist: confirm uv.lock is present, pip-audit is clean, bandit is clean.
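The §lab/03 check can be phrased as a helper that accepts any seeding function, so it does not depend on how seed_everything is implemented. This is a sketch of the shape such a test might take, not the lab's actual test file; `first_draws` and `check_matches_stdlib` are hypothetical names.

```python
import random
from typing import Callable


def first_draws(seed_fn: Callable[[int], None], seed: int, n: int = 10) -> list[float]:
    """Seed with seed_fn, then return the next n values of random.random()."""
    seed_fn(seed)
    return [random.random() for _ in range(n)]


def check_matches_stdlib(seed_fn: Callable[[int], None], seed: int = 0) -> None:
    """Assert that seed_fn leaves the stdlib RNG in the same state as random.seed."""
    assert first_draws(seed_fn, seed) == first_draws(random.seed, seed)
```

In a pytest module, a test function would call `check_matches_stdlib(seed_everything)` with your own implementation.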

§5 Pitfalls

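  • Forgetting to seed inside a worker process. Depending on the start method and the library, a worker either starts from fresh, unseeded state or inherits a copy of the parent's state; neither gives a reproducible per-worker stream unless you seed it inside.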
  • Setting PYTHONHASHSEED in code instead of in the launcher. Affects hash(str) only for new processes.
  • Relying on bool(torch.cuda.is_available()) at module import time. It can return True on systems where CUDA is broken at runtime. Wrap CUDA-only paths in try/except.
  • Trusting numpy.random.seed to seed np.random.default_rng(...). It doesn't — default_rng has its own state.
  • Persisting the manifest without wall_seconds and finished_at. You'll want the timing data when you're debugging "why did this run take 8× longer this time?".
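For the worker pitfall, one common pattern is to derive a distinct seed per worker from the base seed and give each worker its own generator. The sketch below covers only the stdlib RNG and uses hypothetical names; NumPy and torch generators would need the same treatment inside the worker.

```python
import random


def worker_rng(base_seed: int, worker_id: int) -> random.Random:
    """Return an independent stdlib generator for one worker."""
    # Assumption: non-negative seeds and fewer than 1_000_003 workers,
    # so two (base_seed, worker_id) pairs never derive the same seed.
    return random.Random(base_seed * 1_000_003 + worker_id)


def worker_draws(base_seed: int, worker_id: int, n: int = 5) -> list[float]:
    """First n draws a given worker would see."""
    rng = worker_rng(base_seed, worker_id)
    return [rng.random() for _ in range(n)]
```

Calling `worker_draws(42, 0)` twice gives the same list, while workers 0 and 1 get different streams, which is the property a dataloader needs: reproducible across runs, distinct across workers.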

§6 Exercises (solutions in solutions/)

  1. Without looking at src/utils/seeding.py, write a seed_everything(seed: int) -> None that covers random, numpy, torch (if importable). Test that calling it twice with the same seed gives the same first ten random.random() outputs.
  2. Add a log_versions() -> dict[str, str] that returns Python + NumPy + Torch + uv versions. Handle the case where any of them isn't importable.
  3. Write a record_manifest(experiment_id: str, config: dict, seed: int, artifacts: list[str]) -> Path that captures the schema in §3, writes it to experiments/<date>-<id>/manifest.json, and returns the path. Include git_sha, git_dirty, wall time.

§7 References

  • Reproducibility in ML — Pineau et al., 2020 (NeurIPS reproducibility checklist).
  • PyTorch determinism docs — torch.use_deterministic_algorithms semantics.
  • uv docs — lockfile format, uv sync vs. uv pip install.

02-engineering-hygiene.md — pre-commit, ruff, mypy, bandit, pip-audit as policy.