
Lab 00 — Registry roundtrip (MLflow + DVC + canonical SHA)

Goal: prove the registry's canonical-SHA stability and the lineage walk on real grammar-tutor artifacts.

Estimated time: 3–5 hours.

Prereq: the Phase 18, Phase 26, and Phase 28 artifacts exist (Mini-GPT base, INT8 variant, LoRA grammar tutor); the MLflow tracking UI is running (mlflow ui --backend-store-uri ./mlruns); and DVC is initialized with a local remote (.dvc/config points at data/dvc-remote/).

What you produce

experiments/38-registry-roundtrip/ containing:

register.py — your driver script.
results.json — registered canonical SHAs, semvers, lineage trees, MLflow run IDs.
lineage.mmd — mermaid render of the lineage DAG.
manifest.json — {seed, versions, config, hardware} per CLAUDE.md §1.
README.md — what you registered, what the SHA stability test showed.

The scenario

Three artifacts to register:

The Phase 18 FP32 trained Mini-GPT base model (no LoRA yet).
The Phase 26 INT8 quantized variant of the Phase 18 checkpoint.
The Phase 28 LoRA grammar-tutor adapter on top of the Phase 18 base.

Phase 38's job is to register all three under the canonical-SHA scheme, with lineage walking back to the Phase 12 verb-corpus DVC hash.

TODOs

Block A — register the three artifacts

Write a register.py that:
Pulls the Phase 12 verb-corpus via DVC: dvc pull data/processed/train.jsonl.dvc. Capture the corpus's DVC hash (the corpus_dvc_hash that the lineage walk terminates at).
Loads the Phase 18 MLflow run (mlflow_run_id_fp32). Calls registry.register_pending(run_id=...) from scripts/mlops/registry.py. Print the canonical SHA.
Calls registry.promote(canonical_sha=<fp32_sha>, semver="v0.1.0") (this is a dev-mode run; in production this step is gated by CI per theory/05).
Loads the Phase 26 INT8 MLflow run. The run's train_manifest.json must include parent_sha = <fp32_sha> (set this in the Phase 26 training script — if it's not there, this lab fails until Phase 26 is revised).
Calls register_pending + promote for the INT8 bundle with semver v0.2.0.
Does the same for the Phase 28 LoRA bundle: parent_sha = <fp32_sha>, semver v0.3.0.
Confirm by inspecting .lynx-registry/: tags.json maps three semvers; index.jsonl has three lines.
Confirm MLflow's UI shows three registered models (in addition to your canonical-SHA layer).
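The hash captured in the first step can be read straight from the pointer file that dvc pull consumes. This is a minimal sketch, assuming the .dvc file follows DVC's usual YAML layout with a single md5: entry under outs:. read_dvc_hash is a hypothetical helper for illustration, not part of scripts/mlops/registry.py.

```python
import re
from pathlib import Path

# Matches the `md5: <hex>` entry DVC records under `outs:` in a .dvc pointer file.
# Assumption: one tracked output per pointer file, hashed with md5.
_MD5_LINE = re.compile(r"^\s*-?\s*md5:\s*([0-9a-f]{32}(?:\.dir)?)\s*$", re.MULTILINE)


def read_dvc_hash(dvc_file: str) -> str:
    """Return the content hash recorded in a .dvc pointer file (hypothetical helper)."""
    text = Path(dvc_file).read_text(encoding="utf-8")
    match = _MD5_LINE.search(text)
    if match is None:
        raise ValueError(f"no md5 entry found in {dvc_file}")
    return match.group(1)
```

A regex keeps the helper stdlib-only. If PyYAML is already available in your environment, parsing the file as YAML is the more robust choice.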

Block B — canonical-SHA stability test

Two runs, same SHA. From a fresh shell, register the Phase 18 bundle again. The registry's register_pending() should detect the canonical SHA already exists and return the existing handle. Confirm index.jsonl did not grow.
Re-export the MLflow bundle, register again. This is the strict test: re-save the same logical model through MLflow's API (which embeds new timestamps in metadata files). Our canonical-SHA computation should ignore those and produce the same SHA. If it doesn't, the canonicalization is wrong.
Document the non-determinism you found. The first attempt will almost certainly yield a different SHA on re-registration because something in the bundle was not canonical (typically mtime in tarball headers, MLflow run-metadata files included in the hash by mistake, or unordered JSON keys). Identify the source, fix the canonicalization in scripts/mlops/registry.py, and rerun.
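One way to locate the source of non-determinism is to hash every file in two exports of the same bundle and list the paths whose bytes differ. The sketch below assumes each export is a plain directory on disk. file_digests and diff_bundles are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_digests(bundle_dir: str) -> dict[str, str]:
    """Map each file's bundle-relative POSIX path to its SHA-256 hex digest."""
    root = Path(bundle_dir)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_bundles(first_dir: str, second_dir: str) -> list[str]:
    """Return sorted relative paths that differ in bytes or exist in only one bundle."""
    first, second = file_digests(first_dir), file_digests(second_dir)
    names = first.keys() | second.keys()
    return sorted(name for name in names if first.get(name) != second.get(name))
```

Any path this reports that is also an input to the canonical SHA is a candidate for the bug. Paths that are not inputs (MLflow side-cars, for example) are expected to differ and can be ignored.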

Block C — lineage walk

Call lineage(<int8_sha>). Verify it returns: INT8 → FP32 → corpus_dvc_hash. Print the chain.
Call lineage(<lora_sha>). Verify it returns: LoRA → FP32 → corpus_dvc_hash.
Confirm that dvc pull on the returned corpus_dvc_hash reconstructs the exact training corpus.
Render the lineage DAG as a mermaid diagram (lineage.mmd). Three model nodes (FP32, INT8, LoRA), one corpus node, edges: FP32 → corpus, INT8 → FP32, LoRA → FP32.
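The walk and the mermaid render described above can be sketched as follows. This is an illustration under stated assumptions, not the registry's actual implementation: records is a hypothetical in-memory dict mapping each canonical SHA to its name, parent_sha, and (for the root model) corpus_dvc_hash. The real lineage() in scripts/mlops/registry.py may take different arguments and return a different type.

```python
def lineage(sha: str, records: dict[str, dict]) -> list[str]:
    """Follow parent_sha links from `sha` to the root model, then append its corpus hash."""
    chain, seen = [], set()
    current = sha
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in lineage at {current}")
        seen.add(current)
        chain.append(current)
        record = records[current]
        parent = record.get("parent_sha")
        if parent is None:
            # Root model: the chain terminates at the corpus it was trained on.
            chain.append(record["corpus_dvc_hash"])
        current = parent
    return chain


def to_mermaid(records: dict[str, dict]) -> str:
    """Render child --> parent edges, plus root --> corpus, as a mermaid flowchart."""
    lines = ["graph TD"]
    for sha, record in sorted(records.items()):
        label = record.get("name", sha[:12])
        lines.append(f'    m_{sha[:12]}["{label}"]')
    for sha, record in sorted(records.items()):
        parent = record.get("parent_sha")
        if parent is not None:
            lines.append(f"    m_{sha[:12]} --> m_{parent[:12]}")
        else:
            corpus = record["corpus_dvc_hash"]
            lines.append(f'    c_{corpus[:12]}[("corpus {corpus[:12]}")]')
            lines.append(f"    m_{sha[:12]} --> c_{corpus[:12]}")
    return "\n".join(lines) + "\n"
```

Writing the string returned by to_mermaid(records) to lineage.mmd with encoding="utf-8" gives the three model nodes, the corpus node, and the three edges listed above. Node IDs use the first 12 hex characters of each hash, which assumes no two registered SHAs share that prefix.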

Block D — `tags.json` invariants

Try to register a new canonical SHA under v0.1.0. Expected: RegistryError("semver already mapped to a different SHA").
Update the semver mapping explicitly: registry.retag(canonical_sha=<new_fp32_sha>, semver="v0.1.0"). Confirm tags.json was updated atomically (write-to-tmp + rename).
Confirm index.jsonl has an entry recording the retag operation with the previous SHA.
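The three invariants in this block can be sketched as below. The function names, argument order, and the shape of the index.jsonl record are assumptions made for illustration; the real registry.retag in scripts/mlops/registry.py may differ. The sketch uses os.replace rather than os.rename for the final swap, because os.replace also overwrites an existing file on Windows.

```python
import json
import os
import tempfile


class RegistryError(Exception):
    """Raised when a registry invariant would be violated."""


def _load_tags(tags_path: str) -> dict:
    if not os.path.exists(tags_path):
        return {}
    with open(tags_path, encoding="utf-8") as fh:
        return json.load(fh)


def _write_tags_atomically(tags_path: str, tags: dict) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(tags_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(tags, fh, sort_keys=True, separators=(",", ":"))
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, tags_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def tag(tags_path: str, semver: str, canonical_sha: str) -> None:
    """Map semver to canonical_sha; refuse to silently remap an existing semver."""
    tags = _load_tags(tags_path)
    existing = tags.get(semver)
    if existing is not None and existing != canonical_sha:
        raise RegistryError("semver already mapped to a different SHA")
    tags[semver] = canonical_sha
    _write_tags_atomically(tags_path, tags)


def retag(tags_path: str, index_path: str, semver: str, canonical_sha: str) -> None:
    """Explicitly remap semver and append an audit record with the previous SHA."""
    tags = _load_tags(tags_path)
    previous = tags.get(semver)
    tags[semver] = canonical_sha
    _write_tags_atomically(tags_path, tags)
    entry = {"op": "retag", "semver": semver, "sha": canonical_sha, "previous_sha": previous}
    with open(index_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
```

The temp file is created in the same directory as tags.json so that the rename stays on one filesystem. Note that the tags.json swap and the index.jsonl append are two separate operations here, so a crash between them would leave the retag unrecorded.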

Block E — manifest + README

Write manifest.json with seed, library versions (Python, MLflow, DVC, NumPy), config (the three bundles' MLflow run IDs and the corpus DVC hash), and hardware (per LYNX_CORTEX.md §5).
Write README.md (300–500 words) covering:
What you registered: the three canonical SHAs, the three MLflow run IDs, the corpus DVC hash.
The non-determinism source you found in MLflow's bundle (be specific: was it mtime in tar headers, JSON key order, an MLflow metadata file you accidentally included, fp precision, or something else?).
The lineage diagram and what it shows about the FP32 → INT8 / LoRA fan-out.
One sentence on why this matters for the Phase 39 capstone and Phase 38 lab 04 (the CI gate).
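A minimal sketch of assembling manifest.json with the four top-level keys named above. The hardware fields shown are placeholders, since the required fields are defined in LYNX_CORTEX.md §5 and not reproduced here. Placing the canonical SHAs under config is also an assumption, and build_manifest is a hypothetical helper.

```python
import platform
from importlib import metadata


def library_versions(packages=("mlflow", "dvc", "numpy")) -> dict:
    """Collect the Python version plus installed distribution versions."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def build_manifest(seed: int, run_ids: dict, canonical_shas: dict, corpus_dvc_hash: str) -> dict:
    """Assemble the {seed, versions, config, hardware} manifest as a plain dict."""
    return {
        "seed": seed,
        "versions": library_versions(),
        "config": {
            "mlflow_run_ids": run_ids,
            "canonical_shas": canonical_shas,  # assumption: stored under config
            "corpus_dvc_hash": corpus_dvc_hash,
        },
        "hardware": {  # placeholder fields; see LYNX_CORTEX.md §5 for the real list
            "system": platform.system(),
            "machine": platform.machine(),
            "processor": platform.processor(),
        },
    }
```

Serializing the result with json.dumps(manifest, sort_keys=True, indent=2) and writing it with encoding="utf-8" keeps the file stable across runs on the same machine.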

Constraints

Stdlib only for hashing (hashlib.sha256). No xxhash, no blake3, no third-party hashers.
MLflow only for storage. Don't use MLflow's model-registry transition_model_version_stage for identity. Identity lives in scripts/mlops/registry.py via the canonical SHA. MLflow's run/version IDs are metadata pointers, not the primary key.
DVC for the corpus. corpus_dvc_hash is in the train_manifest.json; the lineage walk uses it. Do not embed the corpus bytes in the registry — that's what DVC is for.
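Atomic writes for tags.json mutations: write to *.tmp, then os.rename (on Windows, os.rename raises if the destination exists; os.replace overwrites on both POSIX and Windows).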
No new src/<module>/. All code goes in scripts/mlops/.
CPU-only. This lab does not touch a GPU.

Stop conditions

Done when:

Three registered canonical SHAs are stable across re-registration from a fresh shell.
lineage(<sha>) returns a correct chain for all three, terminating at the Phase 12 corpus DVC hash.
tags.json cannot be silently corrupted by a duplicate-semver registration.
manifest.json exists and lists the three SHAs, three MLflow run IDs, and the corpus DVC hash.
The mermaid DAG renders.

Pitfalls (read before debugging)

MLflow embeds timestamps in metadata. mlflow.pyfunc.save_model() writes an MLmodel YAML file that includes a creation timestamp. Including this file in the canonical-SHA computation will break stability. The canonical SHA must be computed over model.safetensors + tokenizer.json + config.json + eval_report.json + train_manifest.json + parent_sha only. Any side-car file MLflow generates must be ignored.
tar is not canonical. tar cf bundle.tar models/mini-gpt-fp32/ records each file's mtime in the archive headers, so re-exporting the same model and running tar again produces different bytes. The fix is to hash the files in the bundle individually, not the tarball.
json.dumps is not canonical by default. Use json.dumps(d, sort_keys=True, separators=(',', ':')). Trailing whitespace, key order, and floating-point representation all matter. Test by dumping the same dict twice in two different Python sessions and comparing the bytes.
safetensors is canonical. A safetensors file is byte-identical given the same tensor names, tensor data, and header metadata. Use it as the weights format; don't roll your own.
Locale-dependent JSON. json.dumps returns a str; the bytes that reach disk depend on the encoding used to write it. In a non-UTF-8 locale, open() defaults to the locale's encoding, so the output bytes can vary. Force UTF-8 by passing encoding='utf-8' to open(), or by calling .encode('utf-8') before hashing.
DVC pulls into a working tree. dvc pull checks the file out to data/processed/train.jsonl. If you re-run the registration after editing that file (even by accident, e.g., via a stray notebook execution), the corpus_dvc_hash recorded in the manifest may not match what's on disk. Always run dvc status before registering.
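Taken together, the pitfalls above suggest a canonical SHA along the following lines. This is a sketch under stated assumptions, not the reference canonicalization: the file list comes from the first pitfall, while re-serializing the JSON files and length-prefixing each payload are illustrative design choices. The signature of canonical_sha is hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional

# Only these files feed the hash; MLflow side-cars such as MLmodel are never read.
CANONICAL_FILES = (
    "config.json",
    "eval_report.json",
    "model.safetensors",
    "tokenizer.json",
    "train_manifest.json",
)


def canonical_json_bytes(obj) -> bytes:
    """Serialize with sorted keys and no optional whitespace, then encode as UTF-8."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def canonical_sha(bundle_dir: str, parent_sha: Optional[str]) -> str:
    """SHA-256 over the fixed file list plus parent_sha, independent of mtimes and key order."""
    root = Path(bundle_dir)
    hasher = hashlib.sha256()
    for name in sorted(CANONICAL_FILES):
        path = root / name
        if name.endswith(".json"):
            # Re-serialize so key order and whitespace in the file do not matter.
            payload = canonical_json_bytes(json.loads(path.read_text(encoding="utf-8")))
        else:
            payload = path.read_bytes()
        # Frame each entry with its name and length so file boundaries are unambiguous.
        hasher.update(name.encode("utf-8") + b"\0")
        hasher.update(len(payload).to_bytes(8, "big"))
        hasher.update(payload)
    hasher.update(b"parent_sha\0" + (parent_sha or "").encode("utf-8"))
    return hasher.hexdigest()
```

Because the function reads only the named files and never a tarball, tar header mtimes and MLflow's timestamped metadata cannot influence the result.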

When to consult `solutions/`

After all five blocks are done. solutions/00-registry-roundtrip-ref.md (written at phase open) compares your canonicalization choice, the structure of lineage()'s return type, the MLflow integration, and the register() atomicity strategy.

Next lab: lab/01-shadow-ab.md.