03 — Reproducibility, manifests, and dvc

If the corpus cannot be reproduced from a seed, the results of every later phase are anecdote. The chain of SHA256s in MANIFEST.json ties together corpus, embeddings, checkpoint, and report. dvc tracks data/processed/ without touching git.


The reproducibility contract

Per CLAUDE.md §0.5: every numeric script seeds RNGs and writes a manifest. For Phase 12, the contract is strict:

Running python scripts/gen_corpus.py --seed 42 on any machine with the pinned versions produces a data/processed/{train,val,test}.jsonl triple whose SHA256s match the manifest exactly.

This is a CI test. If a refactor breaks reproducibility silently, the CI fails on the next push.

Subtlety: most of the corpus is deterministic enumeration — every (verb, tense, person) cell yields one fixed correct row. Determinism here doesn't depend on the seed. The seed only controls:

  1. The order rows are written to JSONL (which affects id assignment and therefore the file SHA256).
  2. Which subset of eligible mis-conjugation types is applied to which cells (the generator samples ~1–3 per eligible cell from the 6-type taxonomy).
  3. Mis-conjugation row IDs.

So changing the seed changes the SHA256, but the set of cells covered stays the same.
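
The effect can be sketched on a toy table. This is an illustration only: the three-verb table and the helper names (enumerate_cells, serialize, digest, covered_cells) are assumptions, not the project's generator.

```python
import hashlib
import itertools
import json
import random

# Toy stand-ins for the real verb table (hypothetical, much smaller).
VERBS = ["work", "go", "be"]
TENSES = ["present_simple", "past_simple"]
PERSONS = ["1sg", "2sg", "3sg"]


def enumerate_cells() -> list:
    """Deterministic enumeration: independent of any seed."""
    return list(itertools.product(VERBS, TENSES, PERSONS))


def serialize(seed: int) -> bytes:
    """Shuffle with an explicit RNG, assign ids by position, emit JSONL bytes."""
    rng = random.Random(seed)
    cells = enumerate_cells()
    rng.shuffle(cells)
    lines = [
        json.dumps({"id": i, "verb": v, "tense": t, "person": p}, sort_keys=True)
        for i, (v, t, p) in enumerate(cells)
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


def digest(seed: int) -> str:
    """SHA256 of the serialized file contents for a given seed."""
    return hashlib.sha256(serialize(seed)).hexdigest()


def covered_cells(seed: int) -> set:
    """The set of (verb, tense, person) cells present, ignoring order and ids."""
    rows = [json.loads(line) for line in serialize(seed).decode("utf-8").splitlines()]
    return {(r["verb"], r["tense"], r["person"]) for r in rows}
```

In this sketch, digest(42) is stable across runs, a different seed gives a different digest, and covered_cells returns the same set for every seed.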

What "seeded" means in practice

Three sources of randomness in our pipeline:

  1. random.Random(seed) — used to pick the subset of mis-conjugation types per cell and to shuffle row order.
  2. numpy.random.default_rng(seed) — not used in v1 (no NumPy randomness needed for enumeration); reserved for later additions.
  3. Python dict iteration order (implicit randomness). Dicts have been insertion-ordered since Python 3.7, so this is deterministic. We rely on this.

The seed_everything(seed) helper (src/utils/seeding.py, from Phase 0) seeds both random and numpy.random, plus sets PYTHONHASHSEED (which only takes effect for subprocess invocations, not the current process — but we record it for completeness).
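
The PYTHONHASHSEED behaviour can be checked with a small sketch. The child_hash helper below is hypothetical and is not part of src/utils/seeding.py. It starts a fresh interpreter with the variable set and reports the string hash computed there.

```python
import os
import subprocess
import sys


def child_hash(text: str, hash_seed: str) -> str:
    """Return hash(text) as computed by a fresh interpreter with PYTHONHASHSEED set."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Two calls with the same hash_seed return the same value because the variable is read at interpreter start-up. Assigning os.environ["PYTHONHASHSEED"] inside a running process leaves that process's hash() results unchanged.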

Anti-pattern: calling module-level functions such as random.random() instead of passing an RNG explicitly. The module-level functions share one global state; if any imported library also calls them, your "seeded" run is contaminated.

Convention: every helper receives an explicit rng: random.Random argument and uses only rng.choice, rng.randint, etc.
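
A minimal sketch of that convention. The helper name pick_error_types and its max_types parameter are assumptions for illustration; the type names are the six listed in the manifest config.

```python
import random

# The six-type taxonomy, as listed in MANIFEST.json.
MIS_CONJUGATION_TYPES = [
    "missing_third_person_s",
    "overregularization_past",
    "wrong_aux_will_with_to",
    "wrong_aux_going_to_missing_ing",
    "subject_verb_disagreement",
    "bare_participle_missing_aux",
]


def pick_error_types(
    rng: random.Random, eligible: list[str], max_types: int = 3
) -> list[str]:
    """Sample between 1 and max_types distinct types using only the RNG passed in.

    Assumes `eligible` is non-empty.
    """
    k = rng.randint(1, min(max_types, len(eligible)))
    return rng.sample(eligible, k)


first = pick_error_types(random.Random(42), MIS_CONJUGATION_TYPES)
random.random()  # noise on the global RNG, e.g. from an imported library
second = pick_error_types(random.Random(42), MIS_CONJUGATION_TYPES)
assert first == second  # the explicit RNG is unaffected by global state
```

Because the helper never touches the module-level functions, calls elsewhere to random.random() cannot change what it returns for a given seed.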

NFC normalization (and why)

Per Phase 11 theory 03, the project NFC-normalizes all text before any hashing or tokenization. This is non-negotiable for Phase 12 because of Spanish:

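  • ñ can be encoded in NFC form (the single codepoint U+00F1) or in NFD form (n followed by a combining tilde, U+006E + U+0303). The two render identically but have different byte sequences and therefore different SHA256s.
  • Platforms disagree on which form they produce. macOS has historically normalized filenames to a variant of NFD (HFS+), Linux filesystems store whatever bytes they are given, and Windows is mixed. Text typed, pasted, or derived from filenames can arrive in either form, so a corpus generated on macOS and validated on Linux could diverge.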

Mitigation: gen_corpus.py writes NFC, validate_corpus.py re-NFCs every field and asserts equality before computing the manifest hash. The Spanish dictionary entries embedded in the verb table are stored NFC.
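
The divergence and the fix can be shown with the standard library alone. This is a sketch; the helper names nfc, sha256_text, and assert_nfc are illustrative and may not match what validate_corpus.py actually defines.

```python
import hashlib
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC-normalized form of text."""
    return unicodedata.normalize("NFC", text)


def sha256_text(text: str) -> str:
    """SHA256 of the UTF-8 encoding of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def assert_nfc(text: str) -> None:
    """Raise if text is not already in NFC form."""
    if text != nfc(text):
        raise ValueError(f"field is not NFC-normalized: {text!r}")


composed = "ma\u00f1ana"     # NFC: single codepoint U+00F1
decomposed = "man\u0303ana"  # NFD: n followed by combining tilde U+0303

# Same rendering, different bytes, different hashes.
assert sha256_text(composed) != sha256_text(decomposed)
# After normalization the hashes agree.
assert sha256_text(nfc(decomposed)) == sha256_text(composed)
```

Calling assert_nfc(decomposed) raises ValueError, which is the kind of check a validator can run on every field before hashing.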

The manifest

data/MANIFEST.json:

{
  "corpus_name": "lynx-cortex-verbgrammar-v1",
  "version": "1.0.0",
  "date_generated": "2026-MM-DD",
  "seed": 42,
  "versions": {
    "python": "3.11.x",
    "numpy": "X.Y.Z",
    "corpus_spec": "1.0.0"
  },
  "config": {
    "verbs": ["work", "play", "walk", "talk", "listen", "watch", "study",
              "finish", "start", "look", "want", "like",
              "be", "have", "do", "go", "come", "see", "eat", "write"],
    "tense_surfaces": ["infinitive", "present_simple", "past_simple",
                       "past_participle", "future_will", "future_going_to"],
    "persons": ["1sg", "2sg", "3sg"],
    "labels": ["correct", "mis_conjugated"],
    "mis_conjugation_types": ["missing_third_person_s",
                              "overregularization_past",
                              "wrong_aux_will_with_to",
                              "wrong_aux_going_to_missing_ing",
                              "subject_verb_disagreement",
                              "bare_participle_missing_aux"],
    "split_ratios": [0.8, 0.1, 0.1],
    "split_grain": "(verb, tense)"
  },
  "files": {
    "data/processed/train.jsonl": {
      "sha256": "abc123...",
      "lines": 364,
      "bytes": 18420
    },
    "data/processed/val.jsonl": { "...": "..." },
    "data/processed/test.jsonl": { "...": "..." }
  },
  "per_cell_counts": {
    "work:present_simple:1sg": 1,
    "work:present_simple:2sg": 1,
    "work:present_simple:3sg": 1,
    "go:past_simple:1sg": 1,
    "...": "..."
  },
  "mis_conjugation_counts_by_type": {
    "missing_third_person_s": 20,
    "overregularization_past": 16,
    "...": "..."
  },
  "total_correct": 360,
  "total_mis_conjugated": 128,
  "total_rows": 488
}

The versions.corpus_spec field is the version of data/corpus_spec.md. When the spec changes (a mis-conjugation type is added, a Spanish lemma is corrected), the corpus version bumps and the SHA256s change. Downstream phases that pinned to v1.0.0 keep working with the old corpus; new training runs use the new one.
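
A minimal sketch of how one entry under "files" could be computed. The function name file_entry is an assumption, and the line count assumes every JSONL row ends with a newline.

```python
import hashlib
from pathlib import Path


def file_entry(path: Path) -> dict:
    """Build a manifest entry (sha256, lines, bytes) from a file's raw bytes."""
    data = path.read_bytes()
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "lines": data.count(b"\n"),  # assumes each row is newline-terminated
        "bytes": len(data),
    }
```

Hashing the raw bytes rather than decoded text means that any difference in encoding, normalization, or line endings shows up in the digest.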

dvc for the data layer

Per A8, dvc is installed at Phase 12. We use only its file-tracking feature:

dvc add data/processed/train.jsonl
dvc add data/processed/val.jsonl
dvc add data/processed/test.jsonl
git add data/processed/*.dvc data/processed/.gitignore
git commit -m "chore: track corpus with dvc"

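This creates train.jsonl.dvc, a small text file containing a content hash of the data (dvc records its own MD5, independent of the SHA256s in MANIFEST.json), which is checked into git. dvc adds the actual .jsonl files to data/processed/.gitignore, and a copy of the data lives in .dvc/cache/.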

That's it. We don't define DVC pipelines (dvc.yaml), we don't push to a remote, we don't use experiments. The whole dvc integration is ~5 commands.

Why bother for a corpus this small (~50 KiB)? Because:

  1. The project will grow other artefacts (embeddings .npy, model checkpoints) that don't belong in git. The same dvc workflow handles them later.
  2. Practice now → habit later.

Pay the dvc cost now; profit later.

Remote storage (deferred)

dvc can push the cache to S3-compatible storage. We don't do this in Phase 12. Reasons:

  • Borja's machine has 62 GiB RAM and plenty of disk; local cache is fine.
  • Pushing to a remote adds an account, costs, and a permission step.
  • Phase 23 (cloud GPU) will need remote storage anyway; defer to that phase's setup.

If Borja runs dvc push without configuring a remote, dvc exits with an error. This is noted in the lab.

Version bumps

When does the corpus version bump?

| Change | Bump |
|---|---|
| Fix a typo in a Spanish translation | Patch (1.0.0 → 1.0.1) |
| Add a new mis-conjugation type | Minor (1.0.0 → 1.1.0) |
| Change normalize semantics (affects fingerprints) | Major (1.0.0 → 2.0.0) |
| Add a new verb (would violate §A13) | Forbidden without A-amendment |
| Change (verb, tense) to (verb, tense, person) split grain | Major |
| Bug fix in a generator template | Patch |

A major version change requires downstream phases to retrain. A minor change might require it; a patch is usually backwards-compatible. The phase report records which.
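
The arithmetic behind the bump levels can be sketched as a small helper. The function name bump is hypothetical; it implements plain semver incrementing and nothing project-specific.

```python
def bump(version: str, level: str) -> str:
    """Return the next semver string for a 'major', 'minor', or 'patch' bump."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump level: {level!r}")
```

For example, bump("1.0.0", "patch") gives "1.0.1", and a minor or major bump resets the lower fields to zero.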

CI's reproducibility check

.github/workflows/ci.yml (or wherever Phase 0 set up CI) has a job:

- name: Reproduce corpus
  run: |
    python scripts/gen_corpus.py --seed 42 --output /tmp/test-corpus
    diff <(jq -S '.files | to_entries | map({key, sha256: .value.sha256})' data/MANIFEST.json) \
         <(jq -S '.files | to_entries | map({key, sha256: .value.sha256})' /tmp/test-corpus/MANIFEST.json)

If the SHA256 changes silently, this fails. The test runs on every push that touches src/minicorpus/, scripts/gen_corpus.py, or data/corpus_spec.md.
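
The same comparison can be sketched in Python, for instance to run the check locally without jq. The function names are assumptions. Like the jq version, it expects both manifests to use the same keys under "files".

```python
def file_hashes(manifest: dict) -> dict:
    """Map each file name under 'files' to its recorded sha256."""
    return {name: entry["sha256"] for name, entry in manifest["files"].items()}


def hash_mismatches(expected: dict, actual: dict) -> list[str]:
    """Return the sorted file names whose sha256 differs or is missing on one side."""
    exp, act = file_hashes(expected), file_hashes(actual)
    return sorted(
        name for name in exp.keys() | act.keys() if exp.get(name) != act.get(name)
    )
```

An empty list means the regenerated corpus matches the committed manifest; any other result names the files to investigate.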

Distribution beyond Borja

If another learner clones the repo, what gets them the corpus?

  1. git clone → has the .dvc pointers but not the actual .jsonl files.
  2. dvc pull → without a configured remote, this fails. Their fallback: python scripts/gen_corpus.py --seed 42 regenerates the same files. The SHA256 check confirms.

Important: the corpus is fully regenerable from the seed + the codebase. No "data" needs to be shared. The whole project is a single Git repo plus a regenerable corpus.

(This is one of the reasons we chose enumerated over scraped. A scraped corpus can't be regenerated; you'd have to store the actual bytes and distribute them.)

What we don't do

  • Multi-machine dvc workflows. Out of scope.
  • dvc pipelines (dvc.yaml). Useful when data has expensive transforms; ours doesn't.
  • Continuous corpus rolling. The corpus is static within a version. We don't update it daily.
  • Streaming corpus access. All rows fit in RAM at our scale.

These are reasonable v2 considerations. v1 keeps dvc to its 5-command minimum.

Drill problems

Solutions in solutions/03-reproducibility-and-versioning-ref.md (phase open).

  1. Borja regenerates the corpus on a different machine and gets different SHA256s. Three likely root causes — what are they, and how do you diagnose each? (Hint: one of them involves NFC vs NFD.)
  2. The Python version changes from 3.11.5 to 3.11.6. Should this trigger a corpus version bump? (Hint: think about which randomness sources are sensitive to Python version.)
  3. Suggest a CI test that catches "a Spanish accent was saved as NFD instead of NFC" before merge.

One-paragraph recap

Reproducibility is enforced by seeded RNGs (an explicit rng per helper), a MANIFEST.json carrying SHA256s, versions, and per-cell counts, and a CI test that regenerates the corpus and diffs the hashes. NFC normalization of all text, Spanish fields included, is non-negotiable to keep hashes stable cross-platform. dvc is used in its minimal mode (just file tracking, no pipelines, no remote in v1). Corpus version follows semver: patch for typo fixes and generator-template bug fixes, minor for new mis-conjugation types, major for normalization or split-grain changes. The whole corpus is regenerable from the seed + the codebase, so no external data needs to be distributed.


Next: lab/00-corpus-spec.md.