Lab 00 — Registry roundtrip (MLflow + DVC + canonical SHA)

Goal: prove the registry's canonical-SHA stability and the lineage walk on real grammar-tutor artifacts.

Estimated time: 3–5 hours.

Prereq: Phase 18 + Phase 26 + Phase 28 artifacts exist (Mini-GPT base, INT8 variant, LoRA grammar tutor); MLflow tracking server is up (mlflow ui --backend-store-uri ./mlruns); DVC initialized with a local remote (.dvc/config points at data/dvc-remote/).


What you produce

experiments/38-registry-roundtrip/ containing:

  • register.py — your driver script.
  • results.json — registered canonical SHAs, semvers, lineage trees, MLflow run IDs.
  • lineage.mmd — mermaid render of the lineage DAG.
  • manifest.json with {seed, versions, config, hardware} per CLAUDE.md §1.
  • README.md — what you registered, what the SHA stability test showed.

The scenario

Three artifacts to register:

  1. The Phase 18 FP32 trained Mini-GPT base model (no LoRA yet).
  2. The Phase 26 INT8 quantized variant of the Phase 18 checkpoint.
  3. The Phase 28 LoRA grammar-tutor adapter on top of the Phase 18 base.

Phase 38's job is to register all three under the canonical-SHA scheme, with lineage walking back to the Phase 12 verb-corpus DVC hash.

TODOs

Block A — register the three artifacts

  • Write a register.py that:
      • Pulls the Phase 12 verb-corpus via DVC: dvc pull data/processed/train.jsonl.dvc. Captures the dvc_hash.
      • Loads the Phase 18 MLflow run (mlflow_run_id_fp32), calls registry.register_pending(run_id=...) from scripts/mlops/registry.py, and prints the canonical SHA.
      • Calls registry.promote(canonical_sha=<fp32_sha>, semver="v0.1.0"). This is a dev-mode run; in production this step is gated by CI per theory/05.
      • Loads the Phase 26 INT8 MLflow run. The run's train_manifest.json must include parent_sha = <fp32_sha>. Set this in the Phase 26 training script; if it is missing, this lab fails until Phase 26 is revised.
      • Calls register_pending + promote for the INT8 bundle with semver v0.2.0.
      • Does the same for the Phase 28 LoRA bundle: parent_sha = <fp32_sha>, semver v0.3.0.
  • Confirm by inspecting .lynx-registry/: tags.json maps three semvers; index.jsonl has three lines.
  • Confirm MLflow's UI shows three registered models (in addition to your canonical-SHA layer).
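The "capture the dvc_hash" step can be sketched with the standard library alone. This is a minimal sketch, assuming the pointer file (train.jsonl.dvc) records the output hash as an `md5:` entry under `outs:`, and that this value is what the lab calls dvc_hash. The helper names `parse_dvc_hash` and `read_dvc_hash` are hypothetical and are not part of DVC or of scripts/mlops/registry.py.

```python
import re
from pathlib import Path

# Assumption: the hash appears on a line such as "- md5: <32 hex chars>".
# Directory outputs carry a ".dir" suffix, which the pattern also accepts.
_MD5_LINE = re.compile(r"^\s*-?\s*md5:\s*([0-9a-f]{32}(?:\.dir)?)\s*$", re.MULTILINE)


def parse_dvc_hash(pointer_text):
    """Return the first md5 value found in the text of a .dvc pointer file."""
    match = _MD5_LINE.search(pointer_text)
    if match is None:
        raise ValueError("no md5 entry found in pointer file")
    return match.group(1)


def read_dvc_hash(pointer_path):
    """Read a .dvc pointer file from disk and return its recorded md5 (hypothetical helper)."""
    return parse_dvc_hash(Path(pointer_path).read_text(encoding="utf-8"))


# Illustrative pointer content; the hash below is a made-up example value.
EXAMPLE_POINTER = (
    "outs:\n"
    "- md5: a304afb96060aad90176268345e10355\n"
    "  size: 1234\n"
    "  hash: md5\n"
    "  path: train.jsonl\n"
)

if __name__ == "__main__":
    print(parse_dvc_hash(EXAMPLE_POINTER))
```

Reading the hash from the pointer file rather than re-hashing the working copy keeps the recorded value tied to what DVC tracks, which matters for the dvc status pitfall later in this lab.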

Block B — canonical-SHA stability test

  • Two runs, same SHA. From a fresh shell, register the Phase 18 bundle again. The registry's register_pending() should detect the canonical SHA already exists and return the existing handle. Confirm index.jsonl did not grow.
  • Re-export the MLflow bundle, register again. This is the strict test: re-save the same logical model through MLflow's API (which embeds new timestamps in metadata files). Our canonical-SHA computation should ignore those and produce the same SHA. If it doesn't, the canonicalization is wrong.
  • Document the non-determinism you found. Your first attempt at the re-export test will almost certainly produce a different SHA, because something in the bundle was not canonical (typically mtime in tarball headers, MLflow run-metadata files included in the hash by mistake, or unordered JSON keys). Identify the source, fix the canonicalization in scripts/mlops/registry.py, and rerun.
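One way to set up the re-export test is sketched below. It is not the reference canonicalization: the function `canonical_sha`, the helper `_write_bundle`, and the way per-file digests are combined are all assumptions made for illustration. The file allow-list comes from the Pitfalls section. The fake weights are arbitrary bytes standing in for a real safetensors file.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Allow-list from the Pitfalls section; anything else in the bundle is ignored.
CANONICAL_FILES = (
    "model.safetensors",
    "tokenizer.json",
    "config.json",
    "eval_report.json",
    "train_manifest.json",
)


def canonical_sha(bundle_dir, parent_sha=None):
    """Hash file contents only: no mtimes, no tar headers, no MLflow side-cars."""
    outer = hashlib.sha256()
    for name in CANONICAL_FILES:  # fixed order, independent of directory listing
        data = (Path(bundle_dir) / name).read_bytes()
        if name.endswith(".json"):
            # Re-serialize so key order and whitespace cannot change the digest.
            data = json.dumps(
                json.loads(data), sort_keys=True, separators=(",", ":")
            ).encode("utf-8")
        digest = hashlib.sha256(data).hexdigest()
        outer.update(f"{name}:{digest}\n".encode("utf-8"))
    outer.update(f"parent_sha:{parent_sha or ''}\n".encode("utf-8"))
    return outer.hexdigest()


def _write_bundle(root, weights, utc_time_created, reverse_keys=False):
    """Write a toy bundle; reverse_keys simulates a re-export with different JSON layout."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    (root / "model.safetensors").write_bytes(weights)  # stand-in bytes, not real weights
    payload = {"a": 1, "b": 2.5}
    for name in CANONICAL_FILES[1:]:
        items = dict(sorted(payload.items(), reverse=reverse_keys))
        text = json.dumps(items, indent=2 if reverse_keys else None)
        (root / name).write_text(text, encoding="utf-8")
    # MLflow-style side-car with a timestamp, deliberately outside the allow-list.
    (root / "MLmodel").write_text(f"utc_time_created: '{utc_time_created}'\n", encoding="utf-8")


def stability_demo():
    """Return SHAs for: an export, a re-export of the same model, a different model."""
    with tempfile.TemporaryDirectory() as tmp:
        _write_bundle(f"{tmp}/first", b"\x00\x01", "2024-01-01 00:00:00")
        _write_bundle(f"{tmp}/second", b"\x00\x01", "2024-06-01 12:00:00", reverse_keys=True)
        _write_bundle(f"{tmp}/other", b"\x00\x02", "2024-01-01 00:00:00")
        return tuple(canonical_sha(f"{tmp}/{d}") for d in ("first", "second", "other"))


if __name__ == "__main__":
    first, second, other = stability_demo()
    print(first == second, first != other)
```

The first two SHAs should match even though the side-car timestamp, file mtimes, and JSON layout differ, while the third differs because the weights changed. If your own registry fails the equivalent check, diff which inputs reach the hasher.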

Block C — lineage walk

  • Call lineage(<int8_sha>). Verify it returns: INT8 → FP32 → corpus_dvc_hash. Print the chain.
  • Call lineage(<lora_sha>). Verify it returns: LoRA → FP32 → corpus_dvc_hash.
  • Confirm that dvc pull on the returned corpus_dvc_hash reconstructs the exact training corpus.
  • Render the lineage DAG as a mermaid diagram (lineage.mmd). Three model nodes (FP32, INT8, LoRA), one corpus node, edges: FP32 → corpus, INT8 → FP32, LoRA → FP32.
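The walk and the diagram can be sketched over an in-memory index. This is a sketch under stated assumptions: the record layout (parent_sha on every record, corpus_dvc_hash on the root model), the placeholder SHA strings, and the function names `lineage` and `to_mermaid` are illustrative, and the real return type of lineage() in scripts/mlops/registry.py may differ.

```python
# Hypothetical index records keyed by canonical SHA (placeholder strings, not real digests).
INDEX = {
    "sha-fp32": {"parent_sha": None, "corpus_dvc_hash": "dvc-corpus-hash"},
    "sha-int8": {"parent_sha": "sha-fp32"},
    "sha-lora": {"parent_sha": "sha-fp32"},
}
LABELS = {"sha-fp32": "FP32", "sha-int8": "INT8", "sha-lora": "LoRA"}


def lineage(sha, index):
    """Follow parent_sha links to the root model, then append the root's corpus hash."""
    chain = []
    seen = set()
    record = None
    current = sha
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in lineage at {current}")
        seen.add(current)
        record = index[current]  # KeyError here means a dangling parent_sha
        chain.append(current)
        current = record.get("parent_sha")
    chain.append(record["corpus_dvc_hash"])
    return chain


def to_mermaid(index, labels):
    """Render child -> parent edges, with root models pointing at the corpus node."""
    lines = ["graph TD"]
    for sha, record in index.items():
        parent = record.get("parent_sha")
        target = labels[parent] if parent is not None else "corpus"
        lines.append(f"    {labels[sha]} --> {target}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(" -> ".join(lineage("sha-int8", INDEX)))
    print(to_mermaid(INDEX, LABELS), end="")
```

The cycle check is cheap insurance: a mis-set parent_sha that points back at a descendant would otherwise loop forever. Writing the string returned by `to_mermaid` to lineage.mmd gives the four-node, three-edge graph described above.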

Block D — tags.json invariants

  • Try to register a new canonical SHA under v0.1.0. Expected: RegistryError("semver already mapped to a different SHA").
  • Update the semver mapping explicitly: registry.retag(canonical_sha=<new_fp32_sha>, semver="v0.1.0"). Confirm tags.json updated atomically (write-to-tmp + rename).
  • Confirm index.jsonl has an entry recording the retag operation with the previous SHA.
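The three checks in this block can be exercised against a small stand-in. The sketch below is not the code in scripts/mlops/registry.py: the `registry_dir` parameter, the helper names, and the shape of the retag log entry are assumptions. It uses os.replace for the rename step because it overwrites an existing target on every platform.

```python
import json
import os
import tempfile
from pathlib import Path


class RegistryError(Exception):
    """Raised when an operation would violate a tags.json invariant."""


def _load_tags(registry_dir):
    path = Path(registry_dir) / "tags.json"
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}


def _write_tags_atomic(registry_dir, tags):
    path = Path(registry_dir) / "tags.json"
    tmp = path.with_name(path.name + ".tmp")  # same directory, so the rename stays atomic
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(tags, fh, sort_keys=True, indent=2)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # readers see either the old file or the new one


def _append_index(registry_dir, entry):
    with open(Path(registry_dir) / "index.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


def promote(registry_dir, canonical_sha, semver):
    """Map semver to a SHA; refuse to remap silently."""
    tags = _load_tags(registry_dir)
    existing = tags.get(semver)
    if existing is not None and existing != canonical_sha:
        raise RegistryError("semver already mapped to a different SHA")
    tags[semver] = canonical_sha
    _write_tags_atomic(registry_dir, tags)


def retag(registry_dir, canonical_sha, semver):
    """Explicitly remap semver, logging the previous SHA before the swap."""
    tags = _load_tags(registry_dir)
    entry = {"op": "retag", "semver": semver,
             "previous_sha": tags.get(semver), "sha": canonical_sha}
    _append_index(registry_dir, entry)
    tags[semver] = canonical_sha
    _write_tags_atomic(registry_dir, tags)


def retag_demo():
    """Run the Block D scenario in a throwaway directory."""
    with tempfile.TemporaryDirectory() as reg:
        promote(reg, "sha-old", "v0.1.0")
        try:
            promote(reg, "sha-new", "v0.1.0")
            error = None
        except RegistryError as exc:
            error = str(exc)
        retag(reg, "sha-new", "v0.1.0")
        log = (Path(reg) / "index.jsonl").read_text(encoding="utf-8").splitlines()
        return error, _load_tags(reg), json.loads(log[-1])


if __name__ == "__main__":
    print(retag_demo())
```

Logging the retag before swapping tags.json means a crash between the two steps leaves a record of the intent rather than an unexplained mapping change. Whether the reference registry orders the two steps this way is not stated in this lab.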

Block E — manifest + README

  • Write manifest.json with seed, library versions (Python, MLflow, DVC, NumPy), config (the three bundles' MLflow run IDs and the corpus DVC hash), and hardware (per LYNX_CORTEX.md §5).
  • Write README.md (300–500 words) covering:
  • What you registered: the three canonical SHAs, the three MLflow run IDs, the corpus DVC hash.
  • The non-determinism source you found in MLflow's bundle (be specific: was it mtime in tar headers, JSON key order, an MLflow metadata file you accidentally included, fp precision, or something else?).
  • The lineage diagram and what it shows about the FP32 → INT8 / LoRA fan-out.
  • One sentence on why this matters for the Phase 39 capstone and Phase 38 lab 04 (the CI gate).
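A possible starting point for manifest.json is sketched below. The top-level keys follow the {seed, versions, config, hardware} shape listed under "What you produce". The fields inside config and hardware, and the names `build_manifest` and `write_manifest`, are assumptions, since the required hardware fields are defined in a document not reproduced here.

```python
import json
import os
import platform
from importlib import metadata


def _version(dist_name):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(seed, run_ids, canonical_shas, corpus_dvc_hash):
    """Assemble the manifest dict; run_ids and canonical_shas map bundle name to ID."""
    return {
        "seed": seed,
        "versions": {
            "python": platform.python_version(),
            "mlflow": _version("mlflow"),
            "dvc": _version("dvc"),
            "numpy": _version("numpy"),
        },
        "config": {
            "mlflow_run_ids": run_ids,
            "canonical_shas": canonical_shas,
            "corpus_dvc_hash": corpus_dvc_hash,
        },
        # Assumed hardware fields; adjust to whatever the project spec requires.
        "hardware": {
            "machine": platform.machine(),
            "system": platform.system(),
            "cpu_count": os.cpu_count(),
        },
    }


def write_manifest(path, manifest):
    """Write with sorted keys and explicit UTF-8 so reruns produce identical bytes."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, sort_keys=True, indent=2)
        fh.write("\n")


if __name__ == "__main__":
    demo = build_manifest(
        seed=0,
        run_ids={"fp32": "<run-id>", "int8": "<run-id>", "lora": "<run-id>"},
        canonical_shas={"fp32": "<sha>", "int8": "<sha>", "lora": "<sha>"},
        corpus_dvc_hash="<dvc-hash>",
    )
    print(json.dumps(demo, sort_keys=True, indent=2))
```

Including the three canonical SHAs under config is a choice made here so that stop condition 4 can be checked from the manifest alone.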

Constraints

  • Stdlib only for hashing (hashlib.sha256). No xxhash, no blake3, no third-party hashers.
  • MLflow only for storage. Don't use MLflow's model-registry transition_model_version_stage for identity. Identity lives in scripts/mlops/registry.py via the canonical SHA. MLflow's run/version IDs are metadata pointers, not the primary key.
  • DVC for the corpus. corpus_dvc_hash is in the train_manifest.json; the lineage walk uses it. Do not embed the corpus bytes in the registry — that's what DVC is for.
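  • Atomic writes for tags.json mutations: write to *.tmp, then os.rename. On Windows, os.rename raises FileExistsError when the target already exists; os.replace overwrites it on every platform.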
  • No new src/<module>/. All code goes in scripts/mlops/.
  • CPU-only. This lab does not touch a GPU.

Stop conditions

Done when:

  1. Three registered canonical SHAs are stable across re-registration from a fresh shell.
  2. lineage(<sha>) returns a correct chain for all three, terminating at the Phase 12 corpus DVC hash.
  3. tags.json cannot be silently corrupted by a duplicate-semver registration.
  4. manifest.json exists and lists the three SHAs, three MLflow run IDs, and the corpus DVC hash.
  5. The mermaid DAG renders.

Pitfalls (read before debugging)

  • MLflow embeds timestamps in metadata. mlflow.pyfunc.save_model() writes an MLmodel YAML with creation timestamps. Including this file in the canonical-SHA computation will break stability. The canonical SHA must be computed over model.safetensors + tokenizer.json + config.json + eval_report.json + train_manifest.json + parent_sha only. Anything MLflow generates as side-cars must be ignored.
  • tar is not canonical. tar cf bundle.tar models/mini-gpt-fp32/ produces a tarball whose headers record each file's mtime, owner, and mode. Re-exporting the same model rewrites the files with new mtimes, so two tar invocations produce different bytes for identical content. The fix is to hash the files in the bundle individually, not the tarball.
  • json.dumps is not canonical by default. Use json.dumps(d, sort_keys=True, separators=(',', ':')). Trailing whitespace, key order, and floating-point representation all matter. Test by dumping the same dict twice in two different Python sessions and comparing the bytes.
  • safetensors is canonical. A safetensors file is byte-identical given the same tensor data. Use it as the weights format; don't roll your own.
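  • Locale-dependent JSON. json.dumps returns a str, and the bytes you hash or write depend on how that str is encoded. In a non-UTF-8 locale, open() without an explicit encoding follows the locale, so the bytes on disk can vary between machines. Force UTF-8 by passing encoding='utf-8' to open(), or by calling .encode('utf-8') before hashing.
  • DVC pulls into a working tree. dvc pull extracts to data/processed/train.jsonl. If you re-run the registration after editing that file (even by accident, e.g. a stray notebook execution), the corpus_dvc_hash recorded in the manifest may not match what's on disk. Always run dvc status before registering.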
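The JSON pitfalls above can be folded into one helper. This is a minimal sketch: the name `canonical_json_bytes` is hypothetical, and the choices of ensure_ascii=False and allow_nan=False go beyond the json.dumps call recommended above. The first makes the explicit UTF-8 encoding step meaningful; the second rejects NaN and Infinity, which are not valid JSON.

```python
import hashlib
import json


def canonical_json_bytes(obj):
    """Serialize to bytes that do not depend on key order, whitespace, or locale."""
    text = json.dumps(
        obj,
        sort_keys=True,          # key order no longer depends on insertion order
        separators=(",", ":"),   # no optional whitespace
        ensure_ascii=False,      # keep non-ASCII characters as-is ...
        allow_nan=False,         # raise ValueError on NaN/Infinity instead of emitting them
    )
    return text.encode("utf-8")  # ... and pin the byte encoding explicitly


def json_sha256(obj):
    """SHA-256 hex digest of the canonical byte form."""
    return hashlib.sha256(canonical_json_bytes(obj)).hexdigest()


if __name__ == "__main__":
    print(canonical_json_bytes({"b": 1, "a": 2}))  # → b'{"a":2,"b":1}'
    print(json_sha256({"a": 1, "b": 2}) == json_sha256({"b": 2, "a": 1}))  # → True
```

Note that this does not normalize numeric types: 1 and 1.0 serialize differently, so a manifest that stores a value as an int in one run and a float in another will still hash differently.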

When to consult solutions/

After all five blocks are done. solutions/00-registry-roundtrip-ref.md (written at phase open) compares your canonicalization choice, the structure of lineage()'s return type, the MLflow integration, and the register() atomicity strategy.


Next lab: lab/01-shadow-ab.md.