04 — Manifest anatomy: a worked example of "what to log when"

The experiment manifest is not bureaucracy; it is the only way to tell "my code changed" apart from "the machine changed" when the numbers drift. Here we take it apart field by field with a concrete example.

§3 of 01-reproducibility.md showed the shape of a manifest. This page closes the loop with a worked drift-diagnosis story: two manifests that disagree on a single number, and a field-by-field walk through how you triage the cause.


§1 The two manifests

A learner runs experiments/softmax-bench/run.py on Monday and again on Tuesday. The Monday manifest reports mean_loss: 0.4321; Tuesday reports mean_loss: 0.4327. Where did the drift come from?

// monday.json (excerpt)
{
  "git_sha": "a1b2c3d",
  "git_dirty": false,
  "seed": 42,
  "versions": { "python": "3.11.9", "numpy": "2.0.1", "torch": "not installed" },
  "hardware": { "cpu": "Intel i5-8250U", "ram_gb": 62, "os": "Fedora 43" },
  "env": { "OMP_NUM_THREADS": "8", "PYTHONHASHSEED": "42" },
  "mean_loss": 0.4321
}
// tuesday.json (excerpt)
{
  "git_sha": "a1b2c3d",
  "git_dirty": false,
  "seed": 42,
  "versions": { "python": "3.11.9", "numpy": "2.0.2", "torch": "not installed" },
  "hardware": { "cpu": "Intel i5-8250U", "ram_gb": 62, "os": "Fedora 43" },
  "env": { "OMP_NUM_THREADS": "8", "PYTHONHASHSEED": "42" },
  "mean_loss": 0.4327
}

The only input field that differs: numpy 2.0.1 → numpy 2.0.2. The git sha, seed, hardware, and env (including the OMP thread count) are all identical. Diagnosis cost: 30 seconds. This is what the manifest is for.

Without the manifest, you would be reduced to bisecting in time, re-running with hypothesized changes, and arguing with yourself about whether you imagined the drift. With it, you check the diff between the two and the answer is one line.
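As a rough illustration of "check the diff between the two", here is a minimal sketch that flattens each manifest into dotted keys and reports only the fields that differ. The helper names `flatten` and `diff_manifests` and the `<missing>` marker are assumptions made for this example, not part of the course code, and the two dicts are trimmed copies of the excerpts above.

```python
MISSING = "<missing>"  # hypothetical marker for a field present in only one manifest


def flatten(manifest, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {"versions": {"numpy": ...}} -> "versions.numpy"."""
    flat = {}
    for key, value in manifest.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_manifests(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field whose value differs."""
    fa, fb = flatten(a), flatten(b)
    return {
        key: (fa.get(key, MISSING), fb.get(key, MISSING))
        for key in sorted(fa.keys() | fb.keys())
        if fa.get(key, MISSING) != fb.get(key, MISSING)
    }


# Trimmed copies of the Monday and Tuesday excerpts.
monday = {
    "git_sha": "a1b2c3d",
    "seed": 42,
    "versions": {"python": "3.11.9", "numpy": "2.0.1"},
    "env": {"OMP_NUM_THREADS": "8", "PYTHONHASHSEED": "42"},
    "mean_loss": 0.4321,
}
tuesday = {
    "git_sha": "a1b2c3d",
    "seed": 42,
    "versions": {"python": "3.11.9", "numpy": "2.0.2"},
    "env": {"OMP_NUM_THREADS": "8", "PYTHONHASHSEED": "42"},
    "mean_loss": 0.4327,
}

for field, (old, new) in diff_manifests(monday, tuesday).items():
    print(f"{field}: {old} -> {new}")
```

The output has two lines: the result (`mean_loss`) and the one input that moved (`versions.numpy`). Everything that matched stays out of the way.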

§2 Field-by-field — what each one buys you

| Field | What you can rule out / rule in when the field disagrees |
|---|---|
| git_sha + git_dirty | Code drift. Different sha → bisect commits. Dirty → re-run after committing. |
| seed | RNG-path drift. Different seed → expected drift; the question is whether the distribution of outputs is stable, not the point estimate. |
| versions[python] | Interpreter-level changes (e.g., dict ordering, float formatting). Rare but real. |
| versions[numpy] | BLAS-binding-level changes; LAPACK upgrades inside the wheel. Common cause of small drift. |
| versions[torch] | cuDNN version, ATen kernel changes. Big driver of drift across PyTorch minor versions. |
| hardware[cpu] | AVX-512 vs AVX2 path differences; SIMD width changes summation order. |
| hardware[gpu] | sm_70 vs sm_80 → TF32 default flips on Ampere+. |
| env[OMP_NUM_THREADS] | Multi-threaded BLAS sums in different orders for different thread counts. Set this for determinism. |
| env[MKL_NUM_THREADS] | Same story for Intel MKL. |
| env[PYTHONHASHSEED] | Cross-process hash(str) ordering (see break/00-break-pythonhashseed.md). |
| wall_seconds | Performance regressions or CPU thermal throttling (8 s → 30 s is a hardware story, not a code story). |
| started_at / finished_at | Time-of-day effects (other processes competing for cache, background indexing on a fresh laptop). |

Reading the table in the other direction: when you write a manifest, every field is there because some specific drift cause needs it to be diagnosable later. There are no decorative fields.
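The thread-count and SIMD-width rows rest on one fact: floating-point addition is not associative, so splitting a sum into a different number of partial sums can change the result. The sketch below is a toy stand-in for a threaded reduction, not how any BLAS library actually schedules work. `chunked_sum` and `naive_sum` are hypothetical helpers, and the input values are chosen to exaggerate the effect.

```python
def naive_sum(values):
    """Left-to-right float accumulation (the built-in sum() may compensate, so avoid it here)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Sum values as n_chunks partial sums, then add the partial sums together."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]  # the exact answer is 2.0

print(chunked_sum(values, 1))  # 1.0: one "thread", plain left-to-right sum
print(chunked_sum(values, 2))  # 0.0: two "threads", each 1.0 is absorbed by its 1e16
```

Real drift is much smaller than this contrived case, typically in the last few digits, which is the scale of the 0.4321 vs 0.4327 gap in §1 once it propagates through a computation.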

§3 What to do when fields you wish you had logged are missing

A manifest from three months ago lacks OMP_NUM_THREADS. Today's run drifts. You cannot retroactively know what OMP_NUM_THREADS was set to on the old run. Options, in increasing order of pain:

  1. Re-run today with the missing fields varied. Set OMP_NUM_THREADS to 1, 2, 4, 8 and see which value reproduces the old number. If a single setting reproduces, that was the missing variable.
  2. Examine the old shell history (history, ~/.zsh_history). If you can find the launcher command, the env may be reconstructable.
  3. Mark the run as "unreproducible — schema-too-thin" in the journal and add the field to the manifest schema going forward.
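Option 1 can be scripted. The following is a minimal sketch, assuming the experiment prints its result on stdout; `sweep_env` and `matching_settings` are hypothetical helper names. The command used here is a stand-in that only echoes the variable, so in practice you would replace it with your own launcher.

```python
import os
import subprocess
import sys


def sweep_env(cmd, var, values):
    """Run cmd once per value of the environment variable var and collect stdout."""
    results = {}
    for value in values:
        env = dict(os.environ, **{var: str(value)})  # copy of the current env with one override
        proc = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
        results[value] = proc.stdout.strip()
    return results


def matching_settings(results, old_output):
    """Return the settings whose output matches the number recorded in the old run."""
    return [value for value, out in results.items() if out == old_output]


# Stand-in command: prints the variable instead of running a real experiment.
cmd = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
results = sweep_env(cmd, "OMP_NUM_THREADS", [1, 2, 4, 8])
print(matching_settings(results, "4"))  # [4]
```

If exactly one setting comes back, that was plausibly the missing variable. If none or several do, fall through to options 2 and 3.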

The lesson is unidirectional: schemas only get richer over time. Adding a field is cheap; removing one is forbidden by the reproducibility contract.
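One way to enforce the "only richer" rule mechanically is to compare a new manifest's fields against an older one and flag anything that disappeared. This is a sketch under the assumption that manifests are nested dicts like the excerpts in §1; `removed_fields` is a hypothetical helper, not part of the reproducibility contract itself.

```python
def removed_fields(old_manifest, new_manifest, prefix=""):
    """List dotted paths of fields present in old_manifest but absent from new_manifest."""
    missing = []
    for key, old_value in old_manifest.items():
        path = f"{prefix}.{key}" if prefix else key
        if key not in new_manifest:
            missing.append(path)
        elif isinstance(old_value, dict) and isinstance(new_manifest[key], dict):
            # Recurse into nested sections such as "env" or "versions".
            missing.extend(removed_fields(old_value, new_manifest[key], path))
    return missing


old = {"seed": 42, "env": {"PYTHONHASHSEED": "42"}}
new = {"seed": 42, "env": {"PYTHONHASHSEED": "42", "OMP_NUM_THREADS": "8"}, "wall_seconds": 8.0}

print(removed_fields(old, new))  # []: the schema only grew
print(removed_fields(new, old))  # ['env.OMP_NUM_THREADS', 'wall_seconds']
```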

§4 The "minimal-but-not-too-minimal" rule

A manifest with only git_sha is too minimal. A manifest with screenshots and the contents of every .bashrc is too rich. The right line:

If the field is plausibly load-bearing for a numeric outcome, log it. Otherwise don't.

Concretely, log:

  • Everything that affects the RNG path (seed, hash seed, library versions).
  • Everything that affects the BLAS reduction order (thread counts, CPU SIMD width).
  • Everything that affects the float path (precision flags, TF32, cuDNN determinism).
  • The git sha + dirty bit.
  • Wall time + start/finish timestamps (for performance drift triage).

Don't log:

  • Editor settings, shell theme, kernel version unless you have a specific kernel-related regression to track.
  • File paths that are repo-relative and reconstructable from the git sha.
  • The entire os.environ (privacy risk: it captures secrets accidentally).
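The log / don't-log split can be sketched as a collector that reads only an allowlist of environment variables instead of dumping os.environ. The names `ENV_ALLOWLIST`, `git_output`, `package_version`, and `collect_manifest` are assumptions for this example, and the field layout mirrors the excerpts in §1 rather than any official schema.

```python
import os
import platform
import subprocess
from importlib import metadata

# Only variables that plausibly affect numeric outcomes; never the whole environment.
ENV_ALLOWLIST = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "PYTHONHASHSEED")


def git_output(*args):
    """Return stripped stdout of a git command, or None if git or the repo is unavailable."""
    try:
        proc = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return proc.stdout.strip()


def package_version(name):
    """Return the installed version of a package, or "not installed"."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_manifest(seed, environ=None):
    """Build a manifest dict from the current process; environ can be injected for testing."""
    environ = os.environ if environ is None else environ
    status = git_output("status", "--porcelain")
    return {
        "git_sha": git_output("rev-parse", "--short", "HEAD"),
        "git_dirty": None if status is None else bool(status),
        "seed": seed,
        "versions": {
            "python": platform.python_version(),
            "numpy": package_version("numpy"),
            "torch": package_version("torch"),
        },
        # platform.processor() can be empty on some systems; treat it as best effort.
        "hardware": {"cpu": platform.processor(), "os": platform.platform()},
        "env": {name: environ.get(name, "unset") for name in ENV_ALLOWLIST},
    }


manifest = collect_manifest(seed=42, environ={"OMP_NUM_THREADS": "8", "API_TOKEN": "secret"})
print(manifest["env"])
```

Recording "unset" explicitly, rather than omitting the key, keeps an unset variable distinguishable from one the schema did not yet cover. The injected API_TOKEN never reaches the manifest.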

§5 The §A13 connection

The grammar corpus generator (Phase 12) will produce 600 forms. Every regeneration writes a manifest containing the seed, the corpus version, the grammar grid checksum, and the bilingual-pair coverage. If a downstream eval drifts in Phase 20, the manifest tells you whether the corpus changed or the eval did. Without it, "the model got worse" is unsolvable.

§6 References

  • Pineau et al., Improving Reproducibility in Machine Learning Research (ML reproducibility checklist), J. Mach. Learn. Res. 22 (2021).
  • Joel Grus, Reproducibility in ML: Why It Matters and How to Achieve It — talk notes, 2019.
  • The PyTorch docs page on torch.use_deterministic_algorithms lists the operations known to be non-deterministic in the framework, which is useful when your manifest can't explain the drift.

→ Lab 04 (new): write a manifest-diff helper that takes two manifest paths and prints only the fields that changed, sorted by "drift-causing likelihood."