01 — Registry and Lineage (wrapped over MLflow + DVC)

A registry is a content-addressable store: the key is the SHA of the canonical content, not a name and not the ID that MLflow assigns. Lineage is the chain of manifests that runs from a grammar-tutor checkpoint back to the raw data (verbs in the DVC corpus), the code, and the seed. Together they answer the question "what exactly served this correction?".

What a registry is

A model registry is a key-value store with two properties:

Content-addressable. The key is the cryptographic hash of the value. If you change a single byte of the model, the key changes. Same property git uses for blobs.
Immutable. Once registered, an entry is never modified — only superseded by a new entry. Deletion is a tombstone, not a mutation.
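The two properties can be illustrated with a toy in-memory store. This is only a sketch for intuition: the `ContentStore` class and its method names are hypothetical and are not the wrapper's API.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (illustrative only)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
        self._tombstones: set[str] = set()

    def put(self, value: bytes) -> str:
        # The key is derived from the value, so callers never choose it.
        key = hashlib.sha256(value).hexdigest()
        # Re-putting identical bytes is harmless: same key, same value.
        self._blobs.setdefault(key, value)
        return key

    def get(self, key: str) -> bytes:
        if key in self._tombstones:
            raise KeyError(f"{key} was tombstoned")
        return self._blobs[key]

    def delete(self, key: str) -> None:
        # Deletion records a tombstone; the stored entry is not rewritten.
        if key not in self._blobs:
            raise KeyError(key)
        self._tombstones.add(key)


store = ContentStore()
key_a = store.put(b"weights-v1")
key_b = store.put(b"weights-v2")  # one byte differs, so the key differs
```

There is deliberately no `update` method: the only way to "change" an entry is to put a new value and get a new key.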

These two properties together give you the operational guarantee Phase 38 needs: "if a SHA was logged when serving response R, I can fetch exactly that model, byte-for-byte, and replay R."

A semver overlay (v0.3.1, v0.3.2) is navigation — a human-readable pointer at one of the SHAs. It is not the identity. Two semvers can point at the same SHA (e.g., latest and v0.3.1); the same semver cannot point at two SHAs.

Why we wrap MLflow rather than replace it

MLflow ships a model registry. We use it — but only as the storage backend, not as the source of identity. Three reasons:

MLflow's "model version" is mutable. You can attach new tags to a version, change its description, transition its stage (Staging → Production). For a registry whose job is "exactly this model, byte-for-byte", mutation is the enemy.
MLflow's identifiers track registrations, not canonical content. Save the same checkpoint twice and you get two different model-version IDs, and the logged artifacts carry per-run metadata (timestamps, run IDs), so neither the ID nor a hash of the on-disk layout is stable across saves. Reproducibility breaks.
MLflow's lineage is run-centric, not content-centric. It tracks "this run produced these artifacts" but doesn't enforce that re-running the same code produces the same artifact SHA. We need the content guarantee.

Solution: a 150-LOC wrapper in scripts/mlops/registry.py that:

Computes a canonical SHA over a fixed bundle (described below) before handing the bundle to MLflow.
Stores that canonical SHA as the primary key in a lightweight index.jsonl audit log we own.
Uses MLflow as the artifact store (mlflow.artifacts.download_artifacts to fetch by run_id).
Lets us walk lineage via MLflow run tags (parent_run_id, corpus_dvc_hash, code_git_sha).

The wrapper is small precisely because MLflow's storage layer is fine. Identity is what we own.

What gets registered (the canonical bundle)

A registry entry for a grammar-tutor model is not just a .safetensors file. It is the bundle that lets a third party rerun the model deterministically:

| File or field | Purpose |
|---|---|
| `model.safetensors` | The weights (Mini-GPT base or LoRA adapter, depending on entry). |
| `tokenizer.json` | The Phase 11 BPE merge table + vocab (trained on the English verb corpus). |
| `config.json` | Architecture (n_layers, d_model, n_heads, ...), from Phase 17. |
| `eval_report.json` | Latest Phase 20 eval scores: conjugation accuracy by (verb, tense, person), plus the per-bucket pass rates. |
| `train_manifest.json` | Pointer to the training run: corpus DVC hash, code git SHA, seed, hparams, MLflow run_id. |
| `parent_sha` | The canonical SHA of the model this one was fine-tuned from (or `null` for a from-scratch train). A string field, not a file. |

The registered SHA is the hash of a canonical representation of this bundle. Canonical means: sorted JSON keys, sorted file order, no mtime in archive headers. A naive tar is not canonical — it embeds timestamps. We hash the content, not the layout:

SHA = sha256(
    sha256(model.safetensors)
    ‖ sha256(canonical_json(tokenizer.json))
    ‖ sha256(canonical_json(config.json))
    ‖ sha256(canonical_json(eval_report.json))
    ‖ sha256(canonical_json(train_manifest.json))
    ‖ utf8(parent_sha or "null")
)

where canonical_json means: sorted keys, no whitespace separators, UTF-8 encoding, no trailing newline.

Properties this gives us:

Re-saving the model with a different tarball compressor doesn't change the SHA.
Re-saving the eval report with reordered keys doesn't change the SHA.
Bumping the patch version in config.json does change the SHA, because the file content changed.
A genuine model change (one weight differs) changes sha256(model.safetensors), which changes the outer SHA.
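The formula and the properties above can be sketched in a few lines of stdlib Python. This is a minimal sketch, not the wrapper's code: the function names are hypothetical, the JSON files are passed in already parsed, and it assumes that ‖ means concatenation of the raw 32-byte digests (the text does not say whether raw or hex digests are joined).

```python
import hashlib
import json
from typing import Optional


def canonical_json(obj) -> bytes:
    # Sorted keys, no whitespace separators, UTF-8, no trailing newline.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def sha256_bytes(data: bytes) -> bytes:
    # Raw 32-byte digest (assumption: raw digests, not hex, are concatenated).
    return hashlib.sha256(data).digest()


def canonical_sha(weights: bytes, tokenizer: dict, config: dict,
                  eval_report: dict, train_manifest: dict,
                  parent_sha: Optional[str]) -> str:
    outer = hashlib.sha256()
    outer.update(sha256_bytes(weights))
    # Fixed order, matching the formula: tokenizer, config, eval, manifest.
    for doc in (tokenizer, config, eval_report, train_manifest):
        outer.update(sha256_bytes(canonical_json(doc)))
    outer.update((parent_sha or "null").encode("utf-8"))
    return outer.hexdigest()
```

Because each JSON document is re-serialized before hashing, key order and whitespace in the file on disk cannot leak into the result; only the weights are hashed as raw bytes.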

Borja will verify this property in lab 00: register the same checkpoint twice from different shells, confirm both registrations produce the same canonical SHA even though MLflow assigns different model-version IDs underneath.

The hash function: choosing SHA-256

We use hashlib.sha256 from stdlib. Why:

Collision resistance. \(2^{128}\) work to find a collision. Two unrelated checkpoints will never collide in any horizon we care about.
Stdlib. No new dep.
Industry default. Git is moving from SHA-1 to SHA-256. Git LFS identifies large files by their SHA-256. PyPI publishes SHA-256 digests for wheel integrity.

We do not use SHA-1 (broken since 2017), MD5 (broken since 2004), or non-cryptographic hashes (xxhash, MurmurHash — fast but collision-prone for adversarial inputs).
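For a large file such as model.safetensors, the hash is usually computed in chunks so the whole file never has to sit in memory. A minimal sketch, with `sha256_file` and the chunk size as illustrative choices rather than anything the wrapper prescribes:

```python
import hashlib
import os
import tempfile


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file standing in for model.safetensors.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_digest = sha256_file(tmp.name)
os.remove(tmp.name)
```

The chunk size affects only memory use and speed; the digest is the same for any chunking of the same bytes.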

Lineage: the chain

Lineage answers: "given a grammar-tutor checkpoint SHA, walk back to the verb-corpus."

The chain is implemented by train_manifest.json containing a parent_sha pointer plus pointers to upstream artifacts (the corpus DVC hash, the tokenizer SHA, the code git SHA). A walk looks like:

checkpoint:  <model_sha>
  trained from corpus DVC hash:    <corpus_dvc_hash>   (e.g., the Phase 12 corpus at commit X)
    which was tokenized using:     <tokenizer_sha>     (Phase 11 BPE merges)
  with code at git SHA:            <code_git_sha>      (the repo at training time)
  starting from parent model:      <parent_sha>        ← recurse (Phase 28 LoRA → Phase 18 base)
  with MLflow run ID:              <mlflow_run_id>     (lookup hyperparameters)
  with seed:                       42

Three properties matter:

Acyclic. A model's parent_sha cannot be itself or any descendant. The lineage graph is a DAG. The registry enforces this by rejecting any registration whose parent_sha is the candidate itself or one of its descendants (a cheap reachability check over index.jsonl).
Walkable in code. lineage(<sha>) -> LineageTree returns the full ancestor tree in O(depth). No live database call — the manifest chain is files + MLflow run lookups.
Auditable from a single SHA. Given any served response, the trace logs the model SHA. From the SHA the entire chain is reproducible. This is the operability claim from theory/00.
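The walk and the acyclicity check can be sketched over an in-memory map from canonical SHA to parent_sha. This is an assumption-laden sketch: the `Index` type, `lineage`, and `check_parent` are hypothetical names, the map stands in for whatever the wrapper loads from index.jsonl, and the result is a flat parent chain rather than the LineageTree mentioned above.

```python
from typing import Optional

# Hypothetical in-memory view of index.jsonl: canonical_sha -> parent_sha.
Index = dict[str, Optional[str]]


def lineage(index: Index, sha: str) -> list[str]:
    """Return [sha, parent, grandparent, ...] using one dict lookup per hop."""
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = sha
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        if current not in index:
            raise KeyError(f"unknown SHA {current}")
        seen.add(current)
        chain.append(current)
        current = index[current]
    return chain


def check_parent(index: Index, candidate: str, parent_sha: Optional[str]) -> None:
    """Reject a registration whose parent is the candidate or its descendant."""
    if parent_sha is None:
        return  # from-scratch train, nothing to check
    if parent_sha == candidate:
        raise ValueError("a model cannot be its own parent")
    # parent_sha descends from candidate iff candidate is among its ancestors.
    if candidate in lineage(index, parent_sha):
        raise ValueError("parent_sha is a descendant of the candidate")
```

With `index = {"base": None, "lora": "base"}`, `lineage(index, "lora")` returns `["lora", "base"]`, and `check_parent(index, "base", "lora")` raises because "lora" already descends from "base".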

DVC's role in lineage

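DVC versions the verb corpus. Each time data/processed/train.jsonl changes and is re-tracked, DVC records a new content hash in its metafile (the .dvc file or dvc.lock), which is committed to git. dvc pull takes targets rather than hashes, so reconstructing the exact corpus bytes means checking out the git commit that carries that hash and then running dvc pull. The training script records the DVC hash in train_manifest.json at the moment of training.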

This matters operationally: if the lineage walk says "trained on corpus_dvc_hash=abc123", and abc123 is a corpus where "go" was missing from the irregular-verb set, then a tutor that accepts "I goed" stops being a mystery — it's a corpus bug, fixable by adding "go" to the irregular-verb generator script and rerunning DVC versioning.

DVC also matters for eval_baseline.json (the CI gate). The eval set itself is DVC-versioned; CI verifies the eval-set DVC hash matches before comparing pass rates. Otherwise the baseline could drift silently and CpQU rows would become incomparable.
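The CI check described here amounts to "compare hashes first, scores second". A minimal sketch, assuming hypothetical field names `eval_set_dvc_hash` and `pass_rate` (the real layout of eval_baseline.json is not specified in this section):

```python
def compare_to_baseline(baseline: dict, candidate: dict) -> bool:
    """Return True if the candidate's pass rate is at least the baseline's.

    The field names used here are illustrative assumptions.
    """
    if baseline["eval_set_dvc_hash"] != candidate["eval_set_dvc_hash"]:
        # Scores measured on different eval sets are not comparable.
        raise ValueError("eval-set DVC hash mismatch; refusing to compare")
    return candidate["pass_rate"] >= baseline["pass_rate"]
```

Raising on a mismatch, rather than returning False, keeps "the candidate regressed" distinguishable from "the comparison was meaningless".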

Semver as a thin overlay

Once you have SHAs, semver is bookkeeping. We use the standard scheme:

MAJOR — architecture change (added a layer, changed d_model, switched LoRA rank).
MINOR — training-corpus change (added new verbs, expanded the tense table from 5 to 6, added plural persons).
PATCH — same architecture, same corpus, different seed or hparams.
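The scheme can be written down as a small helper. This is a sketch only: the `bump` function and the change labels are hypothetical, and resetting the lower components to zero follows the usual semver convention rather than anything stated above.

```python
def bump(version: str, change: str) -> str:
    """Bump a 'vMAJOR.MINOR.PATCH' string according to what changed."""
    major, minor, patch = (int(p) for p in version.lstrip("v").split("."))
    if change == "architecture":
        return f"v{major + 1}.0.0"
    if change == "corpus":
        return f"v{major}.{minor + 1}.0"
    if change in ("seed", "hparams"):
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change kind: {change}")
```

For example, `bump("v0.3.1", "seed")` gives `"v0.3.2"` and `bump("v0.3.1", "corpus")` gives `"v0.4.0"`.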

A semver maps to exactly one SHA at any moment. The mapping is stored in MLflow as a tag on the model version (e.g., tag:semver=v0.3.1) and also in our tags.json file in the registry root — redundant, but the local file is the fast lookup path and the MLflow tag is the authoritative copy.

We do not allow semver-to-semver pointers (v0.3 → v0.3.2). Always semver → sha. Indirection breaks reproducibility under registry compaction or migration.
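Both rules (a tag always targets a SHA, and a semver never moves to a second SHA) are easy to enforce at write time. A minimal sketch over an in-memory dict standing in for tags.json; `set_tag` is a hypothetical helper, and letting non-semver aliases such as latest move is an assumption, not something this section specifies.

```python
import re

SHA_RE = re.compile(r"[0-9a-f]{64}")
SEMVER_RE = re.compile(r"v\d+\.\d+\.\d+")


def set_tag(tags: dict[str, str], name: str, target: str) -> None:
    """Point a semver (or an alias such as 'latest') at a canonical SHA."""
    if not SHA_RE.fullmatch(target):
        # Rejects semver -> semver pointers such as v0.3 -> v0.3.2.
        raise ValueError("tag target must be a 64-hex canonical SHA")
    existing = tags.get(name)
    if SEMVER_RE.fullmatch(name) and existing not in (None, target):
        # A released semver is pinned; only aliases may be re-pointed.
        raise ValueError(f"{name} already points at {existing}")
    tags[name] = target
```

Since every value in the map is a SHA, resolving a tag is a single lookup with no chain of pointers to follow.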

What lineage is not

It is not a training-pipeline orchestration tool. The pipeline is the Justfile + .github/workflows/. The lineage just records what the pipeline did.
It is not a blame trail. If a model produces a bad response, lineage tells you what produced it; it does not tell you why. The "why" is in Phase 34 traces + Phase 40 postmortem methodology.
It is not a legal compliance artifact. Some jurisdictions require "model cards" with bias evaluations etc. Lineage is a precondition for those but not a substitute.

Storage layout

The MLflow artifact store handles the actual bytes. Our wrapper adds two files at the registry root:

mlruns/                                ← MLflow's artifact root (configurable)
  <experiment_id>/<run_id>/artifacts/
    model.safetensors
    tokenizer.json
    config.json
    eval_report.json
    train_manifest.json
data/dvc-remote/                       ← DVC's local remote (gitignored)
  files/md5/<corpus_hash>...
.lynx-registry/                        ← our wrapper's metadata
  tags.json                            ← semver → canonical_sha mapping
  index.jsonl                          ← append-only audit log of registrations

index.jsonl is the audit trail: every registration appends a line with {canonical_sha, semver, mlflow_run_id, mlflow_model_version, timestamp, registrar, parent_sha}. Append-only because audit logs must be tamper-evident.
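Appending to a JSONL log needs nothing beyond the stdlib. The sketch below uses hypothetical helper names and shows only the mechanics; opening the file in append mode is a convention the writer follows, not a protection against someone editing the file by other means.

```python
import json
import os


def append_entry(index_path: str, entry: dict) -> None:
    """Append one registration record as a single JSON line."""
    line = json.dumps(entry, sort_keys=True) + "\n"
    # Mode "a" always writes at the end of the file, never over old lines.
    with open(index_path, "a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())  # push the line to disk before returning


def read_entries(index_path: str) -> list[dict]:
    """Read the log back, one dict per non-empty line, oldest first."""
    with open(index_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

One record per line means a partially written final line (for example after a crash) can be detected and skipped without corrupting the earlier entries.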

Reading vs writing

Reading (registry.get(<sha>) or registry.get("v0.3.1")) returns a ModelHandle — a lightweight pointer that lazy-loads the bundle from MLflow. O(1) for the lookup; O(size) on first weight access.
Writing (registry.register(run_id, semver=...)) canonicalizes the artifact bundle, hashes it, refuses if the requested semver already maps to a different SHA, appends to index.jsonl, and atomically updates tags.json. O(size-of-artifact) for the hash; O(1) for the index update.

The register() call is idempotent on the canonical SHA: registering the same bundle twice produces the same SHA and no duplicate entry. The second registration is a no-op with a returned "already registered, here's the existing entry" handle. This matters for CI — the workflow may re-run; it must not pollute the registry.
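Idempotent registration plus an atomic tags.json update can be sketched as follows. The function names, the in-memory `index` dict, and the simplified signature (a precomputed SHA instead of a run_id) are assumptions made for illustration; the write-to-temp-then-rename pattern is one common way to get an atomic file update.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, sort_keys=True, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)  # readers see the old or new file, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def register(index: dict, tags_path: str, canonical_sha: str, semver: str) -> bool:
    """Return True if a new entry was created, False if it already existed."""
    if canonical_sha in index:
        return False  # idempotent: same bundle, no duplicate entry
    index[canonical_sha] = {"semver": semver}
    tags = {}
    if os.path.exists(tags_path):
        with open(tags_path, encoding="utf-8") as fh:
            tags = json.load(fh)
    tags[semver] = canonical_sha
    atomic_write_json(tags_path, tags)
    return True
```

A CI job that re-runs simply gets `False` back on the second call, and tags.json is left untouched.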

What the CI gate sees

The deploy workflow (theory/05, lab 04) calls registry.register_pending(run_id) first, which computes the canonical SHA but does not add a semver tag. The candidate sits in a pending/ namespace until the eval gate passes; if it passes, registry.promote(canonical_sha, semver) adds the tag and emits the audit-log entry. If it fails, the pending entry is garbage-collected after 24 hours (MLflow artifact still exists; only our .lynx-registry/pending/ pointer is removed).

This split — register_pending then promote — is the only place where the registry has a non-trivial state machine. Everywhere else, registration is one-shot and immutable.
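The pending-then-promote flow is small enough to model as a toy class. This is not the real wrapper: `PendingRegistry` is hypothetical, it takes an already computed canonical SHA instead of a run_id, and the clock is passed in explicitly so the 24-hour expiry is easy to exercise.

```python
class PendingRegistry:
    """Toy model of the pending -> promoted flow (illustrative only)."""

    TTL_SECONDS = 24 * 60 * 60

    def __init__(self) -> None:
        self.pending: dict[str, float] = {}   # canonical_sha -> time registered
        self.tags: dict[str, str] = {}        # semver -> canonical_sha
        self.audit_log: list[dict] = []

    def register_pending(self, canonical_sha: str, now: float) -> None:
        # No semver tag and no audit entry yet; re-registering is a no-op.
        self.pending.setdefault(canonical_sha, now)

    def promote(self, canonical_sha: str, semver: str, now: float) -> None:
        if canonical_sha not in self.pending:
            raise KeyError("promote() requires a pending entry")
        del self.pending[canonical_sha]
        self.tags[semver] = canonical_sha
        self.audit_log.append(
            {"canonical_sha": canonical_sha, "semver": semver, "timestamp": now})

    def collect_garbage(self, now: float) -> list[str]:
        # Drop only our pending pointers; stored artifacts are untouched.
        expired = [s for s, t in self.pending.items()
                   if now - t >= self.TTL_SECONDS]
        for sha in expired:
            del self.pending[sha]
        return expired
```

The only transitions are pending to promoted and pending to expired; nothing ever leaves the promoted state, which matches the one-shot, immutable behaviour everywhere else.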

Drill problems (work these before lab 00)

Solutions in solutions/01-registry-and-lineage-ref.md — written at phase open.

You register a checkpoint as v0.3.0. A week later you find that eval_report.json had a bug — the per-verb conjugation accuracy for "go" was wrong. You fix the eval and want the corrected report attached to the same model. Two options: (a) re-register the bundle with the corrected report (gets a new canonical SHA), (b) attach the new report as a side-car under the existing SHA. Which is correct, and why?
The training pipeline crashed after writing model.safetensors but before eval_report.json. The MLflow run is half-populated. Should registry.register_pending() succeed? What does that imply for the atomicity contract?
Lineage walk: given a grammar-tutor SHA, design the registry layout so the walk-to-corpus-DVC-hash is O(depth) lookups (not O(N) over all entries). Sketch the file structure and the MLflow-tag schema.

If you can answer all three, you understand registry mechanics. Lab 00 makes the abstractions concrete.

One-paragraph recap

A registry is content-addressable + immutable. The key is the canonical SHA of the bundle (weights + tokenizer + config + eval + train-manifest + parent_sha); MLflow is the storage backend, not the identity authority. Lineage is the parent_sha + train_manifest chain, walkable in O(depth) from any served response, with the corpus DVC hash as the root. Semver is a thin convenience overlay. The whole wrapper is ~150 LOC of Python — no database, no daemon, no separate service.

Next: theory/02-traffic-strategies.md.