Lab 04: Wire mlflow into the existing manifest discipline
Goal: make Phase 18's runs browsable in `mlflow ui` without ever replacing `manifest.json` as the source of truth.
Estimated time: 60–90 minutes.
Prereq: labs 01-03 done.
What you produce
- `src/minitrain/mlflow_wrap.py`: thin context-managed wrapper over `mlflow.start_run()`.
- `scripts/train_mini.py` modifications: calls into the wrapper.
- `experiments/18-train-mini/` re-run with mlflow logging.
- `mlruns.db` (SQLite, gitignored).
- `experiments/18-train-mini/mlflow_screenshot.png`: a screenshot of the run in the mlflow UI showing curves + params + artifacts.
- `experiments/18-train-mini/README.md` updated with the mlflow run URI.
Background you must have read
- `theory/04-checkpoints-and-mlflow.md`, § "mlflow — what it gives you and what it doesn't".
TODOs
Block A: mlflow setup
- Install mlflow per `pyproject.toml`'s `experiments` opt group.
- Set the tracking URI to `sqlite:///mlruns.db` (for example via the `MLFLOW_TRACKING_URI` environment variable).
- Add `mlruns.db` to `.gitignore`.
- Start the UI in another terminal: `mlflow ui --backend-store-uri sqlite:///mlruns.db --port 5000`. Open `http://localhost:5000` to confirm it loads.
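If the training script should not depend on the shell environment, the tracking URI can also be pinned from Python before the first mlflow call. The sketch below is one way to do that, assuming the SQLite file lives in the working directory; `ensure_sqlite_tracking_uri` is a hypothetical helper name, not part of mlflow or of this lab's reference solution.

```python
import os

# Same URI the lab passes to `mlflow ui --backend-store-uri`.
DEFAULT_URI = "sqlite:///mlruns.db"


def ensure_sqlite_tracking_uri(default: str = DEFAULT_URI) -> str:
    """Set MLFLOW_TRACKING_URI if it is unset and reject non-SQLite values.

    Hypothetical helper: it only touches the environment variable, so it
    must run before mlflow resolves its tracking URI.
    """
    uri = os.environ.setdefault("MLFLOW_TRACKING_URI", default)
    if not uri.startswith("sqlite:///"):
        # A bare path would silently select the file store instead of SQLite.
        raise ValueError(f"expected a sqlite:/// tracking URI, got {uri!r}")
    return uri
```

Calling it at the top of `scripts/train_mini.py` makes a misconfigured environment fail loudly instead of writing to a different store.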
Block B: write the wrapper

# src/minitrain/mlflow_wrap.py
import mlflow
from contextlib import contextmanager
from pathlib import Path
from types import SimpleNamespace


@contextmanager
def tracking_run(experiment_name: str, run_name: str, manifest_path: Path, config: dict):
    """Open an mlflow run, log params from config, log manifest.json as artifact,
    yield a logger that the training loop can call to log metrics."""
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run(run_name=run_name) as run:
        # Always log the manifest first: it's the source of truth.
        mlflow.log_artifact(str(manifest_path))
        mlflow.log_params(_flatten_config(config))

        def log_metrics(metrics: dict[str, float], step: int):
            for k, v in metrics.items():
                mlflow.log_metric(k, v, step=step)

        def log_artifact(path: Path):
            mlflow.log_artifact(str(path))

        # Update the manifest with the mlflow URI so the manifest can find this run.
        _patch_manifest(manifest_path, mlflow_run_id=run.info.run_id)
        yield SimpleNamespace(log_metrics=log_metrics, log_artifact=log_artifact, run=run)

- Implement `_flatten_config`: flatten nested config to `dotted.keys` for mlflow's flat params.
- Implement `_patch_manifest`: re-write `manifest.json` with the additional `mlflow_run_uri` field. Atomic write.
- Test that the wrapper exits cleanly on `KeyboardInterrupt`: the `with` block closes the run, the manifest stays consistent.
Block C: integrate with train_mini.py
Modify the training script to wrap the loop:

with tracking_run(
    experiment_name="phase18_minigpt",
    run_name=f"phase18_{git_sha[:7]}",
    manifest_path=out_dir / "manifest.json",
    config=config,
) as logger:
    for step in range(total_steps):
        ...
        if step % log_every == 0:
            logger.log_metrics(
                {"train_loss": loss, "lr": lr, "g_norm": g_norm}, step=step
            )
        if step % val_every == 0:
            logger.log_metrics({"val_loss": val_loss, "val_ppl": val_ppl}, step=step)
    # After the loop, still inside the run: attach the final outputs.
    logger.log_artifact(checkpoint_path)
    logger.log_artifact(out_dir / "loss_curve.png")
    logger.log_artifact(out_dir / "results.json")

- Run the training again (you can shorten to 500 steps if a full re-run is expensive — note this in the README).
- After the run, verify in the UI:
  - The run appears in the `phase18_minigpt` experiment.
  - Parameters show the flattened config.
  - Metrics have step-indexed curves (train_loss, lr, g_norm, val_loss, val_ppl).
  - Artifacts include `manifest.json`, `loss_curve.png`, `results.json`, `weights.safetensors`.
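The same checklist can be run programmatically as a complement to eyeballing the UI. This is a sketch under assumptions: `check_run` is a hypothetical helper, `client` is expected to behave like `mlflow.tracking.MlflowClient` (its `get_run`, `list_artifacts`, and `get_metric_history` methods), and the artifacts are assumed to sit at the root of the run's artifact directory under the file names listed above.

```python
EXPECTED_METRICS = {"train_loss", "lr", "g_norm", "val_loss", "val_ppl"}
EXPECTED_ARTIFACTS = {
    "manifest.json",
    "loss_curve.png",
    "results.json",
    "weights.safetensors",
}


def check_run(client, run_id: str) -> list[str]:
    """Return human-readable problems; an empty list means the checklist passes."""
    problems = []
    run = client.get_run(run_id)

    logged_metrics = set(run.data.metrics)
    for name in sorted(EXPECTED_METRICS - logged_metrics):
        problems.append(f"missing metric: {name}")

    logged_artifacts = {info.path for info in client.list_artifacts(run_id)}
    for name in sorted(EXPECTED_ARTIFACTS - logged_artifacts):
        problems.append(f"missing artifact: {name}")

    # A step-indexed curve should not contain two points at the same step.
    for name in sorted(EXPECTED_METRICS & logged_metrics):
        steps = [m.step for m in client.get_metric_history(run_id, name)]
        if len(set(steps)) != len(steps):
            problems.append(f"duplicate steps in metric: {name}")
    return problems
```

With mlflow installed, a call might look like `check_run(MlflowClient(), run_id)` after `from mlflow.tracking import MlflowClient`, with `run_id` taken from the manifest.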
Block D: manifest ↔ mlflow round-trip
The manifest is the source of truth. Test that:

import json
from pathlib import Path

import mlflow


def test_manifest_can_locate_mlflow_run():
    manifest = json.loads(Path("experiments/18-train-mini/manifest.json").read_text())
    # The run ID is the last path segment of the URI stored in the manifest.
    run_id = manifest["mlflow_run_uri"].split("/")[-1]
    run = mlflow.get_run(run_id)
    # mlflow stores params as strings, hence the string comparison.
    assert run.data.params["optimizer.lr_max"] == "0.0003"
    assert run.data.metrics["val_ppl"] < manifest["metrics"]["ngram_baseline_val_ppl"]

- This test confirms the cycle: manifest → mlflow URI → mlflow API → metrics that match what the manifest claims.
Block E: take the screenshot
`experiments/18-train-mini/mlflow_screenshot.png`:
- Open the run in `mlflow ui`.
- Screenshot the run view showing curves panel + parameters + artifacts list.
- Commit the PNG (one ~200 KB file).
Block F: README update
In experiments/18-train-mini/README.md, add:
## mlflow
- Tracking URI: `sqlite:///mlruns.db` (local, gitignored).
- Experiment: `phase18_minigpt`.
- Run ID: `<run_id>` (from manifest.json).
- Screenshot: `mlflow_screenshot.png`.
- To re-open the UI: `mlflow ui --backend-store-uri sqlite:///mlruns.db --port 5000`.
Constraints
- No autologging. Do not enable `mlflow.autolog()`. Explicit logging only. The wrapper is ~30 lines; do not import the kitchen sink.
- Manifest is the truth. If mlflow's view disagrees with manifest, the bug is in mlflow wiring, not in manifest. Test for this.
- One experiment per phase. Phase 18 uses `phase18_minigpt`. Phase 19 will create `phase19_dynamics`. Don't mingle.
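The "test for this" in the second constraint could be sketched as a comparison between the metrics recorded in the manifest and the latest values mlflow holds. This assumes the manifest stores final metrics under a `metrics` key using the same names as the mlflow metrics, which the lab does not spell out; `metric_disagreements` is a hypothetical helper.

```python
import math


def metric_disagreements(
    manifest_metrics: dict, mlflow_metrics: dict, rel_tol: float = 1e-6
) -> dict:
    """Map metric name -> (manifest value, mlflow value) for every mismatch.

    Only names present on both sides are compared; manifest-only entries
    (for example a baseline) are ignored rather than reported.
    """
    mismatches = {}
    for name in sorted(manifest_metrics.keys() & mlflow_metrics.keys()):
        expected, logged = manifest_metrics[name], mlflow_metrics[name]
        if not math.isclose(expected, logged, rel_tol=rel_tol):
            mismatches[name] = (expected, logged)
    return mismatches
```

In a test, `assert metric_disagreements(manifest["metrics"], run.data.metrics) == {}` would fail with the offending pairs in the message, which points at the wiring rather than at the manifest.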
Stop conditions
Done when:
- `mlflow ui` shows the run with all five metric curves (train_loss, lr, g_norm, val_loss, val_ppl), all config parameters, and ≥ 4 artifacts.
- `experiments/18-train-mini/mlflow_screenshot.png` is committed.
- The manifest-to-mlflow round-trip test passes.
- You can re-find a run via either `manifest.json` or `mlflow ui`; both lead to the same artifacts.
Pitfalls
- `mlflow.start_run()` without a context manager. Leaks open runs on Ctrl+C. Always `with`.
- Logging metrics without `step=`. mlflow then records every point at step 0, so the curve is ordered only by timestamp and a metric logged from multiple places cannot be aligned. Always pass `step=`.
- Logging the `safetensors` file as a parameter instead of an artifact. Parameters are short key-value strings (capped at 500 characters in some mlflow versions, higher in newer ones). Use `log_artifact`.
- Setting `MLFLOW_TRACKING_URI` to a relative path without leading `sqlite:///`. mlflow then writes a file store, not SQLite, and you get the slow-file-store pain.
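The first pitfall can be illustrated without mlflow. In the sketch below, `fake_run` is a stand-in for `mlflow.start_run()` (an assumption made for illustration, not the real API) and `wrapper` has the same shape as `tracking_run`: a generator-based context manager whose inner `with` owns the cleanup.

```python
from contextlib import contextmanager

events = []


@contextmanager
def fake_run():
    """Stand-in for mlflow.start_run(): records when the run opens and closes."""
    events.append("open")
    try:
        yield
    finally:
        events.append("closed")


@contextmanager
def wrapper():
    # The exception raised in the caller's body is re-raised at this yield,
    # so the inner `with` still runs its exit logic.
    with fake_run():
        yield


try:
    with wrapper():
        raise KeyboardInterrupt  # simulate Ctrl+C mid-training
except KeyboardInterrupt:
    pass

print(events)  # ['open', 'closed']
```

The run is closed even though the body never finished, which is the behaviour the `KeyboardInterrupt` test in Block B is meant to confirm for the real wrapper.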
When to consult solutions/
After the screenshot is committed. Solution at solutions/04-mlflow-wiring-ref.md (written at phase open).
Phase 18 lab work is complete. Continue with /quiz 18, then PHASE_18_REPORT.md, then learners/borja/phase-18/reflections.md, then proceed.