Reproducibility: seeds, lockfiles, manifests

Summary. Three mechanisms: (1) seed every source of randomness, (2) freeze exact versions with a lockfile, (3) persist a manifest per experiment. Without all three, results are not reproducible; they are anecdotes.

§0 The three pillars

A result is reproducible when someone else, six months from now, on different hardware, can re-run your script and get bit-identical (or, on GPU, "within documented tolerance") numerical results. That requires three things:

All randomness is seeded — every RNG that touches the computation graph.
All code dependencies are pinned — exact versions + hashes, in a lockfile that's committed.
Per-experiment provenance is recorded: the seed, the lockfile sha, the git sha, the hardware, the config.

Lose any one of the three and the run is unreproducible by definition.

§1 Sources of randomness

In a NumPy + (eventually) PyTorch + CUDA stack, the RNG sources are:

| Source | Where it bites you |
|---|---|
| `random` (stdlib) | `random.shuffle`, `random.choice`. (`secrets` does not belong here: it is intentionally non-seedable.) |
| `numpy.random` (legacy + `Generator`) | Most pre-ML phases |
| `torch` CPU RNG | `torch.randn`, `torch.randperm`, dropout, init layers |
| `torch.cuda` per-device RNG | GPU dropouts, GPU init |
| cuDNN nondeterministic algorithms | Controlled by `torch.backends.cudnn.deterministic` and `cudnn.benchmark` |
| `PYTHONHASHSEED` | Set iteration order, `hash(str)`; affects dataloaders that index by hash. (Dict iteration order is insertion order and does not depend on it.) |
| OS scheduling / multi-threading | `OMP_NUM_THREADS`, BLAS threading; the order of a sum reduction is non-deterministic once more than one thread takes part |
| Hardware nondeterminism (TF32, FP16 atomic add) | TF32 on Ampere+; `atomicAdd` in FP16 |

The function in src/utils/seeding.py covers the first five. The remaining three are handled at run boundaries: PYTHONHASHSEED exported by the launcher (see §1.1), OMP_NUM_THREADS=1 for fully deterministic CPU runs, and TF32 disabled where determinism matters.
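Thread count appears in the table because floating-point addition is not associative: the grouping a parallel reduction happens to use changes the rounded result. The following is a minimal pure-Python sketch of that effect. The helper names `sequential_sum` and `chunked_sum` are illustrative only (they are not part of src/utils/seeding.py), and real BLAS kernels partition work differently.

```python
import math


def sequential_sum(values):
    """Plain left-to-right accumulation, as a single thread would do it."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Mimic an n-way reduction: one partial sum per chunk, then combine the partials."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [
        sequential_sum(values[i:i + size]) for i in range(0, len(values), size)
    ]
    return sequential_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]

one_thread = sequential_sum(values)    # 1.0: the first 1.0 is lost to rounding
two_threads = chunked_sum(values, 2)   # 0.0: both 1.0s are lost, one per chunk
exact = math.fsum(values)              # 2.0: correctly rounded reference

print(one_thread, two_threads, exact)
```

Same inputs, three different answers, and the only thing that changed is the grouping. Fixing the thread count fixes the grouping, which is why OMP_NUM_THREADS=1 is the blunt but reliable setting.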

§1.1 Why each seeding call?

import os
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    # Seen by child processes only; this interpreter fixed its hash seed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)                           # stdlib RNG
    np.random.seed(seed)                        # NumPy legacy global RNG
    torch.manual_seed(seed)                     # CPU RNG (also seeds the CUDA generators)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)        # every CUDA device, made explicit
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

Subtleties:

- PYTHONHASHSEED must be set before Python starts to make hash(str) deterministic across processes. Assigning it via os.environ inside a running interpreter does not change that interpreter's hashing; only child processes started afterwards inherit it. In practice we set it in the launcher (just recipe or shell wrapper). Code-level setting is hygienic but not sufficient.
- np.random.seed only seeds the legacy global generator. Code that uses rng = np.random.default_rng(seed) is independent of it, and better, but the legacy seed is still set for libraries that haven't migrated.
- cudnn.benchmark = True lets cuDNN pick the fastest algorithm at runtime; the choice depends on input shapes and can flip between runs. Disabling it costs throughput but is the price of determinism.
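The launcher point can be checked with a small sketch: start a fresh interpreter with PYTHONHASHSEED already in its environment and read back hash(str). The function name `child_hash` is hypothetical, and the real launcher is the just recipe or shell wrapper mentioned above, not this script.

```python
import os
import subprocess
import sys


def child_hash(text: str, hash_seed: str) -> int:
    """Return hash(text) as computed by a new interpreter started with PYTHONHASHSEED set."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Two separate processes, same hash seed: the string hash matches.
first = child_hash("train/sample-0001", "123")
second = child_hash("train/sample-0001", "123")
print(first == second)
```

The environment variable is in place before the child interpreter initialises, which is the condition the in-code assignment cannot meet for its own process.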

§1.2 What `seed_everything` does not do

It doesn't make parallel reductions deterministic on multi-threaded BLAS. For that, set OMP_NUM_THREADS=1 or use deterministic reduction implementations.
It doesn't make non-deterministic algorithms deterministic (some scatter_add variants, for example). torch.use_deterministic_algorithms(True) is the way to enforce that; it raises a RuntimeError if you call an op that has no deterministic implementation.
It doesn't reach backwards: random draws made, and worker processes started, before the call are unaffected. Call it first.
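One way to cover the gaps listed above is a separate opt-in switch next to seed_everything. The sketch below is an assumption about how that could look, not the contents of src/utils/seeding.py; the name `enable_strict_determinism` is hypothetical. It degrades to a no-op when PyTorch is not installed.

```python
import os


def enable_strict_determinism() -> bool:
    """Turn on PyTorch's strict determinism checks. Returns False if torch is missing."""
    try:
        import torch
    except ImportError:
        return False

    # cuBLAS needs a fixed workspace size for deterministic GEMMs (CUDA 10.2+).
    # It must be set before the first CUDA call; setdefault keeps any existing value.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

    # Raise RuntimeError on ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True)

    # Keep matmuls and convolutions in full FP32 on Ampere+ GPUs.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    return True


if __name__ == "__main__":
    print("strict determinism enabled:", enable_strict_determinism())
```

Keeping this separate from seeding is a design choice: strict mode can make previously working code raise, so it is something a run opts into rather than a side effect of choosing a seed.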

§2 Lockfiles: the difference between "I installed numpy 2.x" and "we both installed numpy 2.0.1 from the same wheel hash"

A requirements.txt with version specifiers like numpy>=2,<3 is not a lockfile — it's a constraint. The same constraint resolves to different exact versions on different days, depending on what's been released since.

A lockfile records:

- Every direct dependency's exact version.
- Every transitive dependency's exact version.
- The wheel/sdist hash for each.
- The resolver decisions (how each version conflict was resolved).

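uv.lock is uv's lockfile format. It's checked in. uv sync installs from it and does not re-resolve while the lockfile still satisfies pyproject.toml; if the two have drifted apart, uv sync updates the lockfile unless you pass --locked (fail instead) or --frozen (use the lockfile as-is). The hash check means a compromised PyPI index can't quietly swap a package: installation would fail.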

Summary. A requirements.txt with >= and < is a constraint, not a lockfile. It can resolve to different versions on different days. The lockfile freezes the exact versions and hashes; if someone tampers with the index, installation fails.

§2.1 When the lockfile changes

A direct dep is added/removed/upgraded in pyproject.toml → re-run uv lock.
A transitive dep's constraint moves (because a direct dep changed its constraints) → uv lock regenerates.
The lockfile is committed. PRs that change it must justify the change.
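Because the lockfile is committed and only changes deliberately, its digest is a compact fingerprint of the whole environment, which is what §0 means by "the lockfile sha". A minimal sketch of computing it follows. The function name `lockfile_sha256` and the idea of storing the result under its own manifest key are assumptions; the schema in §3 does not show such a field.

```python
import hashlib
from pathlib import Path


def lockfile_sha256(path: Path = Path("uv.lock")) -> str:
    """Return the hex SHA-256 digest of the lockfile's bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large lockfiles don't have to fit in memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    lock = Path("uv.lock")
    print(lockfile_sha256(lock) if lock.exists() else "uv.lock not found")
```

Two runs with the same digest resolved to the same packages and hashes; two runs with different digests need an explanation before their numbers are compared.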

§3 The experiment manifest

For every run that produces a numeric artifact (a loss, a metric, a checkpoint, a plot), persist:

{
  "id": "2026-05-22-softmax-stability",
  "git_sha": "a1b2c3d4...",
  "git_dirty": false,
  "seed": 42,
  "config": { "/* the actual hyperparameters */": null },
  "versions": {
    "python": "3.11.9",
    "numpy": "2.0.1",
    "torch": "2.3.1",
    "uv": "0.4.18"
  },
  "hardware": {
    "cpu": "Intel i5-8250U",
    "ram_gb": 62,
    "gpu": "Intel UHD 620 (no CUDA)",
    "os": "Fedora 43"
  },
  "started_at": "2026-05-22T19:14:02Z",
  "finished_at": "2026-05-22T19:14:08Z",
  "wall_seconds": 6.31,
  "artifacts": ["plot.svg", "loss.npy"]
}

src/utils/seeding.py has log_versions() for the versions block; the rest is composed in the experiment script.
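As one example of what "composed in the experiment script" can mean, the three timing fields can be filled in by a small context manager. This is a sketch under assumptions: the name `timed` is hypothetical, and the manifest is taken to be a plain dict that is serialised to JSON later.

```python
import time
from contextlib import contextmanager
from datetime import datetime, timezone


def _utc_now() -> str:
    """Current time as an ISO-8601 UTC timestamp, matching the manifest example."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@contextmanager
def timed(manifest: dict):
    """Record started_at, finished_at and wall_seconds around the wrapped block."""
    manifest["started_at"] = _utc_now()
    t0 = time.perf_counter()  # monotonic clock, unaffected by wall-clock adjustments
    try:
        yield manifest
    finally:
        # Written even if the experiment raises, so failed runs keep their timing.
        manifest["wall_seconds"] = round(time.perf_counter() - t0, 2)
        manifest["finished_at"] = _utc_now()


manifest = {"id": "demo", "seed": 42}
with timed(manifest):
    sum(range(100_000))  # stand-in for the actual experiment

print(manifest)
```

The timestamps come from the wall clock while the duration comes from a monotonic counter, so a clock adjustment mid-run cannot produce a negative wall_seconds.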

§3.1 Why hardware is in the manifest

CPU and GPU code paths do not produce bit-identical results. Within GPU, sm_70 and sm_80 differ because TF32 only exists from sm_80 (Ampere) onward, so the same default settings can run different math. Within CPU, the BLAS reduction tree can vary by core count. If you don't record hardware, you can't tell whether a numerical discrepancy is your bug or the machine's.
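Most of the hardware block can be collected from the standard library. The sketch below is an assumption about one way to do it. The name `hardware_block` and the extra `cpu_count` key are not part of the schema in §3, and GPU detection is left to the caller because it depends on the framework in use.

```python
import os
import platform


def hardware_block() -> dict:
    """Best-effort hardware description using only the standard library."""
    try:
        # POSIX only; os.sysconf is missing on Windows and names vary by platform.
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        ram_gb = round(ram_bytes / 1024**3)
    except (AttributeError, ValueError, OSError):
        ram_gb = None

    return {
        # platform.processor() can be an empty string; fall back to the architecture.
        "cpu": platform.processor() or platform.machine(),
        "cpu_count": os.cpu_count(),  # relevant to the BLAS reduction tree
        "ram_gb": ram_gb,
        "gpu": None,  # filled in by the caller, e.g. from the ML framework
        "os": f"{platform.system()} {platform.release()}",
    }


if __name__ == "__main__":
    print(hardware_block())
```

The values are coarser than the hand-written example in §3 (a marketing CPU name usually needs /proc/cpuinfo or similar), but they are enough to tell two machines apart when a discrepancy shows up.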

§3.2 Why "git_dirty" matters

If the working tree is dirty (uncommitted changes), the git sha is a lie. Manifests with git_dirty: true are quarantine artifacts — useful for fast iteration, useless for the report. The phase-gatekeeper flags any DoD-relevant manifest with git_dirty: true.
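Both git fields can be read with two git commands. A minimal sketch, assuming git is on PATH; the names `git_state` and `_git` are hypothetical. It returns None values rather than raising when git or the repository is missing, so the caller decides how strict to be.

```python
import subprocess
from pathlib import Path
from typing import Optional


def _git(repo: Path, *args: str) -> Optional[str]:
    """Run a git command in `repo`; return stripped stdout, or None on any failure."""
    try:
        result = subprocess.run(
            ["git", "-C", str(repo), *args],
            capture_output=True,
            text=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def git_state(repo: Path = Path(".")) -> dict:
    """Return the manifest's git_sha and git_dirty fields for the given repository."""
    sha = _git(repo, "rev-parse", "HEAD")
    # --porcelain prints one line per modified or untracked file; empty means clean.
    status = _git(repo, "status", "--porcelain")
    return {
        "git_sha": sha,
        "git_dirty": None if status is None else bool(status),
    }


if __name__ == "__main__":
    print(git_state())
```

Note that this treats untracked files as dirty too, which is the conservative reading: an untracked module that the script imports changes the result just as much as an uncommitted edit does.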

§4 What the lab will test

§lab/03: re-implement seed_everything from scratch without peeking. Confirm with pytest that 10 invocations of random.random() after seed_everything(0) produce the same sequence as 10 invocations after random.seed(0).
§lab/00 checklist: confirm uv.lock is present, pip-audit is clean, bandit is clean.

§5 Pitfalls

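Forgetting to seed inside worker processes. Depending on the start method, a worker either inherits a copy of the parent's RNG state (fork; sibling workers can then draw identical NumPy streams) or starts unseeded (spawn). Seed explicitly inside each worker.
Setting PYTHONHASHSEED in code instead of in the launcher. That affects hash(str) only in processes started afterwards, not in the running interpreter.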
Relying on bool(torch.cuda.is_available()) at module import time. It can return True on systems where CUDA is broken at runtime. Wrap CUDA-only paths in try/except.
Trusting numpy.random.seed to seed np.random.default_rng(...). It doesn't — default_rng has its own state.
Persisting the manifest without wall_seconds and finished_at. You'll want the timing data when you're debugging "why did this run take 8× longer this time?".
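The default_rng pitfall above is easy to verify directly. A small sketch, assuming NumPy is installed; `generator_draws` is a hypothetical helper written for this check.

```python
import numpy as np


def generator_draws(generator_seed: int, legacy_seed: int, n: int = 3) -> np.ndarray:
    """Seed the legacy global RNG, then draw from a separately seeded Generator."""
    np.random.seed(legacy_seed)                  # legacy global state only
    rng = np.random.default_rng(generator_seed)  # independent state
    return rng.random(n)


# Same Generator seed, different legacy seeds: the draws are identical,
# so the legacy seed has no influence on the Generator.
a = generator_draws(123, legacy_seed=0)
b = generator_draws(123, legacy_seed=999)
print(np.array_equal(a, b))

# The reverse also holds: drawing from a Generator does not advance the legacy stream.
np.random.seed(0)
legacy_first = np.random.random(3)
np.random.seed(0)
np.random.default_rng(7).random(1000)
legacy_second = np.random.random(3)
print(np.array_equal(legacy_first, legacy_second))
```

The practical consequence is that a Generator needs its own seed passed in explicitly; calling seed_everything beforehand does nothing for it.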

§6 Exercises (solutions in `solutions/`)

Without looking at src/utils/seeding.py, write a seed_everything(seed: int) -> None that covers random, numpy, torch (if importable). Test that calling it twice with the same seed gives the same first ten random.random() outputs.
Add a log_versions() -> dict[str, str] that returns Python + NumPy + Torch + uv versions. Handle the case where any of them isn't importable.
Write a record_manifest(experiment_id: str, config: dict, seed: int, artifacts: list[str]) -> Path that captures the schema in §3, writes it to experiments/<date>-<id>/manifest.json, and returns the path. Include git_sha, git_dirty, wall time.

§7 References

Reproducibility in ML — Pineau et al., 2020 (NeurIPS reproducibility checklist).
PyTorch determinism docs — torch.use_deterministic_algorithms semantics.
uv docs — lockfile format, uv sync vs. uv pip install.

§8 Read next

→ 02-engineering-hygiene.md — pre-commit, ruff, mypy, bandit, pip-audit as policy.