Lab 00: Registry roundtrip (MLflow + DVC + canonical SHA)
Goal: prove the registry's canonical-SHA stability and the lineage walk on real grammar-tutor artifacts.
Estimated time: 3–5 hours.
Prereq: Phase 18, Phase 26, and Phase 28 artifacts exist (Mini-GPT base, INT8 variant, LoRA grammar tutor); the MLflow tracking server is up (`mlflow ui --backend-store-uri ./mlruns`); DVC is initialized with a local remote (`.dvc/config` points at `data/dvc-remote/`).
What you produce
`experiments/38-registry-roundtrip/` containing:
- `register.py`: your driver script.
- `results.json`: registered canonical SHAs, semvers, lineage trees, MLflow run IDs.
- `lineage.mmd`: mermaid render of the lineage DAG.
- `manifest.json`: `{seed, versions, config, hardware}` per CLAUDE.md §1.
- `README.md`: what you registered, what the SHA stability test showed.
The scenario
Three artifacts to register:
- The Phase 18 FP32 trained Mini-GPT base model (no LoRA yet).
- The Phase 26 INT8 quantized variant of the Phase 18 checkpoint.
- The Phase 28 LoRA grammar-tutor adapter on top of the Phase 18 base.
Phase 38's job is to register all three under the canonical-SHA scheme, with lineage walking back to the Phase 12 verb-corpus DVC hash.
TODOs
Block A: register the three artifacts
- Write a `register.py` that:
  - Pulls the Phase 12 verb-corpus via DVC: `dvc pull data/processed/train.jsonl.dvc`. Capture the `dvc_hash`.
  - Loads the Phase 18 MLflow run (`mlflow_run_id_fp32`). Calls `registry.register_pending(run_id=...)` from `scripts/mlops/registry.py`. Print the canonical SHA.
  - Calls `registry.promote(canonical_sha=<fp32_sha>, semver="v0.1.0")` (this is a dev-mode run; in production this step is gated by CI per `theory/05`).
  - Loads the Phase 26 INT8 MLflow run. The run's `train_manifest.json` must include `parent_sha = <fp32_sha>` (set this in the Phase 26 training script; if it's not there, this lab fails until Phase 26 is revised).
  - Calls `register_pending` + `promote` for the INT8 bundle with semver `v0.2.0`.
  - Same for the Phase 28 LoRA bundle: `parent_sha = <fp32_sha>`, semver `v0.3.0`.
- Confirm by inspecting `.lynx-registry/`: `tags.json` maps three semvers; `index.jsonl` has three lines.
- Confirm MLflow's UI shows three registered models (in addition to your canonical-SHA layer).
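Capturing the `dvc_hash` can be done without any DVC Python API by reading the pointer file directly. The following is a minimal sketch, assuming the `.dvc` pointer file records its output with an `md5:` field (DVC's usual YAML layout); `read_dvc_hash` is a hypothetical helper name, not part of `scripts/mlops/registry.py`.

```python
import re
from pathlib import Path

# Assumption: the pointer file contains a line such as "- md5: <32 hex chars>".
# Directory outputs carry a ".dir" suffix, which the pattern also accepts.
_MD5_LINE = re.compile(
    r"^\s*(?:-\s*)?md5:\s*([0-9a-f]{32}(?:\.dir)?)\s*$", re.MULTILINE
)


def read_dvc_hash(pointer_path: str) -> str:
    """Return the md5 recorded in a .dvc pointer file (hypothetical helper)."""
    text = Path(pointer_path).read_text(encoding="utf-8")
    match = _MD5_LINE.search(text)
    if match is None:
        raise ValueError(f"no md5 entry found in {pointer_path}")
    return match.group(1)
```

If your pointer file uses a different layout, adjust the pattern rather than trusting this one.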
Block B: canonical-SHA stability test
- Two runs, same SHA. From a fresh shell, register the Phase 18 bundle again. The registry's `register_pending()` should detect that the canonical SHA already exists and return the existing handle. Confirm `index.jsonl` did not grow.
- Re-export the MLflow bundle, register again. This is the strict test: re-save the same logical model through MLflow's API (which embeds new timestamps in metadata files). Our canonical-SHA computation should ignore those and produce the same SHA. If it doesn't, the canonicalization is wrong.
- Document the non-determinism you found. Almost certainly the first attempt produces a different SHA on the second run because something in the bundle was not canonical (typically `mtime` in tarball headers, MLflow run-metadata files included in the hash by mistake, or unordered JSON keys). Identify the source, fix `scripts/mlops/registry.py`'s canonicalization, rerun.
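The "two runs, same SHA" check comes down to treating the canonical SHA as the primary key of `index.jsonl`. Here is a minimal sketch of that idempotency, assuming one JSON object per line with a `canonical_sha` field; the function name `register_pending_sketch` and the entry fields are illustrative, not the actual signature in `scripts/mlops/registry.py`.

```python
import json
from pathlib import Path


def register_pending_sketch(registry_dir: str, canonical_sha: str, run_id: str) -> dict:
    """Append one index.jsonl entry per canonical SHA; repeat calls are no-ops."""
    index_path = Path(registry_dir) / "index.jsonl"
    index_path.parent.mkdir(parents=True, exist_ok=True)

    # Scan existing entries first: the canonical SHA is the primary key.
    if index_path.exists():
        with index_path.open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                entry = json.loads(line)
                if entry.get("canonical_sha") == canonical_sha:
                    return entry  # existing handle; the index does not grow

    entry = {
        "op": "register",
        "canonical_sha": canonical_sha,
        "run_id": run_id,
        "status": "pending",
    }
    with index_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True, separators=(",", ":")) + "\n")
    return entry
```

Note that a second call with a different MLflow run ID still returns the first entry: the run ID is a metadata pointer, so it does not create a new identity.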
Block C: lineage walk
- Call `lineage(<int8_sha>)`. Verify it returns: INT8 → FP32 → corpus_dvc_hash. Print the chain.
- Call `lineage(<lora_sha>)`. Verify it returns: LoRA → FP32 → corpus_dvc_hash.
- Confirm that `dvc pull` on the returned `corpus_dvc_hash` reconstructs the exact training corpus.
- Render the lineage DAG as a mermaid diagram (`lineage.mmd`). Three model nodes (FP32, INT8, LoRA), one corpus node, edges: FP32 → corpus, INT8 → FP32, LoRA → FP32.
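The walk and the diagram can both be driven from the same parent links. The sketch below assumes each registry record is a dict with `name`, `parent_sha`, and (on the root model) `corpus_dvc_hash`; that record shape, the function names, and the placeholder SHAs in the usage note are assumptions for illustration, not the real return type of `lineage()`.

```python
def lineage_sketch(records: dict, sha: str) -> list:
    """Follow parent_sha links, then finish with the root's corpus_dvc_hash."""
    chain, seen = [], set()
    current = sha
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in lineage at {current}")
        if current not in records:
            raise KeyError(f"unknown canonical SHA: {current}")
        seen.add(current)
        chain.append(current)
        record = records[current]
        parent = record.get("parent_sha")
        if parent is None:
            # Root model: the chain terminates at the corpus hash.
            chain.append(record["corpus_dvc_hash"])
        current = parent
    return chain


def to_mermaid(records: dict) -> str:
    """Render child --> parent edges; root models point at the corpus node."""
    lines = ["graph TD"]
    for sha in sorted(records):
        record = records[sha]
        parent = record.get("parent_sha")
        target = records[parent]["name"] if parent else "corpus"
        lines.append(f"    {record['name']} --> {target}")
    return "\n".join(lines) + "\n"
```

With placeholder records for FP32 (root), INT8, and LoRA, `lineage_sketch(records, "<int8_sha>")` yields the INT8 SHA, the FP32 SHA, then the corpus hash, and `to_mermaid(records)` emits the three edges listed above.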
Block D: `tags.json` invariants
- Try to register a new canonical SHA under `v0.1.0`. Expected: `RegistryError("semver already mapped to a different SHA")`.
- Update the semver mapping explicitly: `registry.retag(canonical_sha=<new_fp32_sha>, semver="v0.1.0")`. Confirm `tags.json` updated atomically (write-to-tmp + rename).
- Confirm `index.jsonl` has an entry recording the retag operation with the previous SHA.
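The three invariants in this block (reject a conflicting semver, retag atomically, log the previous SHA) fit in a few functions. This is a sketch under assumptions: `promote_sketch` and `retag_sketch` are hypothetical stand-ins for the real registry functions, `tags.json` is assumed to be a flat semver-to-SHA map, and the log entry fields are invented for illustration. It uses `os.replace` for the rename step, which behaves like `os.rename` on POSIX and also overwrites an existing target on Windows.

```python
import json
import os
from pathlib import Path


class RegistryError(Exception):
    """Raised when a tags.json invariant would be violated."""


def _load_tags(tags_path: Path) -> dict:
    if tags_path.exists():
        return json.loads(tags_path.read_text(encoding="utf-8"))
    return {}


def _write_tags_atomic(tags_path: Path, tags: dict) -> None:
    # Write the full new content to a sibling tmp file, then swap it in.
    tmp_path = tags_path.with_name(tags_path.name + ".tmp")
    tmp_path.write_text(json.dumps(tags, sort_keys=True, indent=2), encoding="utf-8")
    os.replace(tmp_path, tags_path)


def promote_sketch(registry_dir: str, canonical_sha: str, semver: str) -> None:
    """Map semver -> SHA, refusing to overwrite a different SHA."""
    root = Path(registry_dir)
    root.mkdir(parents=True, exist_ok=True)
    tags = _load_tags(root / "tags.json")
    existing = tags.get(semver)
    if existing is not None and existing != canonical_sha:
        raise RegistryError("semver already mapped to a different SHA")
    tags[semver] = canonical_sha
    _write_tags_atomic(root / "tags.json", tags)


def retag_sketch(registry_dir: str, canonical_sha: str, semver: str) -> None:
    """Explicitly remap a semver and record the previous SHA in index.jsonl."""
    root = Path(registry_dir)
    root.mkdir(parents=True, exist_ok=True)
    tags = _load_tags(root / "tags.json")
    previous = tags.get(semver)
    tags[semver] = canonical_sha
    _write_tags_atomic(root / "tags.json", tags)
    entry = {
        "op": "retag",
        "semver": semver,
        "canonical_sha": canonical_sha,
        "previous_sha": previous,
    }
    with (root / "index.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True, separators=(",", ":")) + "\n")
```

Keeping the overwrite path in a separate, explicitly named function means a duplicate-semver registration can only fail loudly; it cannot silently change what `v0.1.0` points at.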
Block E: manifest + README
- Write `manifest.json` with seed, library versions (Python, MLflow, DVC, NumPy), config (the three bundles' MLflow run IDs and the corpus DVC hash), and hardware (per `LYNX_CORTEX.md` §5).
- Write `README.md` (300–500 words) covering:
  - What you registered: the three canonical SHAs, the three MLflow run IDs, the corpus DVC hash.
  - The non-determinism source you found in MLflow's bundle (be specific: was it `mtime` in tar headers, JSON key order, an MLflow metadata file you accidentally included, floating-point precision, or something else?).
  - The lineage diagram and what it shows about the FP32 → INT8 / LoRA fan-out.
  - One sentence on why this matters for the Phase 39 capstone and Phase 38 lab 04 (the CI gate).
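Collecting the manifest fields programmatically avoids hand-copied version strings. The sketch below assumes the `{seed, versions, config, hardware}` shape named earlier; the `build_manifest` name and the specific hardware fields are placeholders, since the actual hardware schema is defined in `LYNX_CORTEX.md` §5 and is not reproduced here.

```python
import platform
from importlib import metadata


def _version_or_none(dist_name: str):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(seed: int, config: dict) -> dict:
    """Assemble the manifest dict; hardware fields here are placeholders."""
    return {
        "seed": seed,
        "versions": {
            "python": platform.python_version(),
            "mlflow": _version_or_none("mlflow"),
            "dvc": _version_or_none("dvc"),
            "numpy": _version_or_none("numpy"),
        },
        "config": config,
        "hardware": {
            "machine": platform.machine(),
            "processor": platform.processor(),
            "system": platform.system(),
        },
    }
```

Write the result with `json.dump(..., sort_keys=True)` and `encoding="utf-8"` so the manifest itself diffs cleanly between runs.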
Constraints
- Stdlib only for hashing (`hashlib.sha256`). No `xxhash`, no `blake3`, no third-party hashers.
- MLflow only for storage. Don't use MLflow's model-registry `transition_model_version_stage` for identity. Identity lives in `scripts/mlops/registry.py` via the canonical SHA. MLflow's run/version IDs are metadata pointers, not the primary key.
- DVC for the corpus. `corpus_dvc_hash` is in the `train_manifest.json`; the lineage walk uses it. Do not embed the corpus bytes in the registry; that's what DVC is for.
- Atomic writes for `tags.json` mutations: write to `*.tmp`, then `os.rename`.
- No new `src/<module>/`. All code goes in `scripts/mlops/`.
- CPU-only. This lab does not touch a GPU.
Stop conditions
Done when:
- Three registered canonical SHAs are stable across re-registration from a fresh shell.
- `lineage(<sha>)` returns a correct chain for all three, terminating at the Phase 12 corpus DVC hash.
- `tags.json` cannot be silently corrupted by a duplicate-semver registration.
- `manifest.json` exists and lists the three SHAs, three MLflow run IDs, and the corpus DVC hash.
- The mermaid DAG renders.
Pitfalls (read before debugging)
- MLflow embeds timestamps in metadata. `mlflow.pyfunc.save_model()` writes an `MLmodel` YAML with creation timestamps. Including this file in the canonical-SHA computation will break stability. The canonical SHA must be over `model.safetensors` + `tokenizer.json` + `config.json` + `eval_report.json` + `train_manifest.json` + `parent_sha` only. Anything MLflow generates as side-cars is ignored.
- `tar` is not canonical. `tar cf bundle.tar models/mini-gpt-fp32/` produces a tarball with `mtime` headers, so two `tar` invocations over the same content produce different bytes whenever the file `mtime`s differ (for example, after a re-export). The fix is to hash the files in the bundle individually, not the tarball.
- `json.dumps` is not canonical by default. Use `json.dumps(d, sort_keys=True, separators=(',', ':'))`. Trailing whitespace, key order, and floating-point representation all matter. Test by dumping the same dict twice in two different Python sessions and comparing the bytes.
- `safetensors` is canonical. A safetensors file is byte-identical given the same tensor data. Use it as the weights format; don't roll your own.
- Locale-dependent JSON. If your Python runs in a non-UTF-8 locale, `open()` defaults to the locale's encoding, so the bytes written for the same JSON text can vary. Force UTF-8 via `encoding='utf-8'`.
- DVC pulls into a working tree. `dvc pull` extracts to `data/processed/train.jsonl`. If you re-run the registration after editing that file (even by accident, e.g., a stray notebook execution), the `corpus_dvc_hash` recorded in the manifest may not match what's on disk. Always run `dvc status` before registering.
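Several of these pitfalls disappear if the hash is computed over an explicit allowlist of files rather than over the bundle directory or a tarball. The following is one possible sketch, not the reference canonicalization: it assumes the five file names listed above sit at the top level of the bundle, re-serializes JSON files with sorted keys before hashing, and length-prefixes each field. The length-prefix framing and the `canonical_sha` function name are choices made for this example.

```python
import hashlib
import json
from pathlib import Path

# Allowlist: anything else in the bundle (e.g. MLmodel) never reaches the hash.
CANONICAL_FILES = (
    "config.json",
    "eval_report.json",
    "model.safetensors",
    "tokenizer.json",
    "train_manifest.json",
)


def _canonical_bytes(path: Path) -> bytes:
    """JSON files are re-dumped canonically; other files are hashed as raw bytes."""
    if path.suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        return json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return path.read_bytes()


def canonical_sha(bundle_dir: str, parent_sha: str = "") -> str:
    """SHA-256 over the allowlisted files plus parent_sha, in a fixed order."""
    digest = hashlib.sha256()
    parts = [(name, _canonical_bytes(Path(bundle_dir) / name)) for name in CANONICAL_FILES]
    parts.append(("parent_sha", parent_sha.encode("utf-8")))
    for name, payload in parts:
        # Length-prefix every field so adjacent fields cannot blur together.
        for chunk in (name.encode("utf-8"), payload):
            digest.update(len(chunk).to_bytes(8, "big"))
            digest.update(chunk)
    return digest.hexdigest()
```

Because only file contents are read, `mtime` changes and MLflow side-car files have no effect on the result. For large weight files you would likely stream `model.safetensors` in chunks instead of calling `read_bytes()`.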
When to consult solutions/
After all five blocks are done. solutions/00-registry-roundtrip-ref.md (written at phase open) compares your canonicalization choice, the structure of lineage()'s return type, the MLflow integration, and the register() atomicity strategy.
Next lab: lab/01-shadow-ab.md.