Lab 04 — Wire mlflow into the existing manifest discipline

Goal: make Phase 18's runs browsable in mlflow ui without ever replacing manifest.json as the source of truth.

Estimated time: 60–90 minutes.

Prereq: labs 01-03 done.


What you produce

  • src/minitrain/mlflow_wrap.py — thin context-managed wrapper over mlflow.start_run().
  • scripts/train_mini.py modifications — calls into the wrapper.
  • experiments/18-train-mini/ re-run with mlflow logging.
  • mlruns.db (SQLite, gitignored).
  • experiments/18-train-mini/mlflow_screenshot.png — a screenshot of the run in the mlflow UI showing curves + params + artifacts.
  • experiments/18-train-mini/README.md updated with the mlflow run URI.

Background you must have read

  • theory/04-checkpoints-and-mlflow.md §"mlflow — what it gives you and what it doesn't".

TODOs

Block A — mlflow setup

  • Install mlflow from the optional experiments dependency group in pyproject.toml:
    uv pip install --group experiments
    
  • Set the tracking URI:
    export MLFLOW_TRACKING_URI=sqlite:///mlruns.db
    
  • Add mlruns.db to .gitignore.
  • Start the UI in another terminal: mlflow ui --backend-store-uri sqlite:///mlruns.db --port 5000. Open http://localhost:5000 to confirm it loads.

Block B — write the wrapper

# src/minitrain/mlflow_wrap.py
from contextlib import contextmanager
from pathlib import Path
from types import SimpleNamespace  # needed for the object yielded below

import mlflow

@contextmanager
def tracking_run(experiment_name: str, run_name: str, manifest_path: Path, config: dict):
    """Open an mlflow run, log params from config, log manifest.json as artifact,
    yield a logger that the training loop can call to log metrics."""
    mlflow.set_experiment(experiment_name)
    # The inner `with` closes the run even if the training loop raises.
    with mlflow.start_run(run_name=run_name) as run:
        # Always log the manifest first: it's the source of truth.
        mlflow.log_artifact(str(manifest_path))
        mlflow.log_params(_flatten_config(config))

        def log_metrics(metrics: dict[str, float], step: int):
            for k, v in metrics.items():
                mlflow.log_metric(k, v, step=step)

        def log_artifact(path: Path):
            mlflow.log_artifact(str(path))

        # Update the manifest with the mlflow URI so the manifest can find this run.
        _patch_manifest(manifest_path, mlflow_run_id=run.info.run_id)

        yield SimpleNamespace(log_metrics=log_metrics, log_artifact=log_artifact, run=run)
  • Implement _flatten_config — flatten nested config to dotted.keys for mlflow's flat params.
  • Implement _patch_manifest — re-write manifest.json with the additional mlflow_run_uri field. Atomic write.
  • Test that the wrapper exits cleanly on KeyboardInterrupt — the with block closes the run, the manifest stays consistent.

Block C — integrate with train_mini.py

Modify the training script to wrap the loop:

with tracking_run(
    experiment_name="phase18_minigpt",
    run_name=f"phase18_{git_sha[:7]}",
    manifest_path=out_dir / "manifest.json",
    config=config,
) as logger:
    for step in range(total_steps):
        ...
        if step % log_every == 0:
            logger.log_metrics(
                {"train_loss": loss, "lr": lr, "g_norm": g_norm}, step=step
            )
        if step % val_every == 0:
            logger.log_metrics({"val_loss": val_loss, "val_ppl": val_ppl}, step=step)

    logger.log_artifact(checkpoint_path)
    logger.log_artifact(out_dir / "loss_curve.png")
    logger.log_artifact(out_dir / "results.json")
  • Run the training again (you can shorten to 500 steps if a full re-run is expensive — note this in the README).
  • After the run, verify in the UI:
      • The run appears in the phase18_minigpt experiment.
      • Parameters show the flattened config.
      • Metrics have step-indexed curves (train_loss, lr, g_norm, val_loss, val_ppl).
      • Artifacts include manifest.json, loss_curve.png, results.json, weights.safetensors.

Block D — manifest ↔ mlflow round-trip

The manifest is the source of truth. Test that:

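import json
from pathlib import Path

import mlflow

def test_manifest_can_locate_mlflow_run():
    manifest = json.loads(Path("experiments/18-train-mini/manifest.json").read_text())
    # The run ID is the last path segment of the URI stored in the manifest.
    run_id = manifest["mlflow_run_uri"].split("/")[-1]
    run = mlflow.get_run(run_id)
    # mlflow stores params as strings, so compare against the string form.
    assert run.data.params["optimizer.lr_max"] == "0.0003"
    # run.data.metrics holds the latest logged value of each metric.
    assert run.data.metrics["val_ppl"] < manifest["metrics"]["ngram_baseline_val_ppl"]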
  • This test confirms the cycle: manifest → mlflow URI → mlflow API → metrics that match what the manifest claims.

Block E — take the screenshot

experiments/18-train-mini/mlflow_screenshot.png:

  • Open the run in mlflow ui.
  • Screenshot the run view showing curves panel + parameters + artifacts list.
  • Commit the PNG (one ~200 KB file).

Block F — README update

In experiments/18-train-mini/README.md, add:

## mlflow

- Tracking URI: `sqlite:///mlruns.db` (local, gitignored).
- Experiment: `phase18_minigpt`.
- Run ID: `<run_id>` (from manifest.json).
- Screenshot: `mlflow_screenshot.png`.
- To re-open the UI: `mlflow ui --backend-store-uri sqlite:///mlruns.db --port 5000`.

Constraints

  • No autologging. Do not enable mlflow.autolog(). Explicit logging only. The wrapper is ~30 lines; do not import the kitchen sink.
  • Manifest is the truth. If mlflow's view disagrees with manifest, the bug is in mlflow wiring, not in manifest. Test for this.
  • One experiment per phase. Phase 18 uses phase18_minigpt. Phase 19 will create phase19_dynamics. Don't mingle.
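The "manifest is the truth" constraint can be tested with a small comparison helper. The sketch below is hypothetical: `find_disagreements` is not part of the lab code, and it assumes the manifest stores final metric values under the same names that were logged to mlflow. It works on plain dicts, so it can be fed `manifest["metrics"]` and `run.data.metrics`.

```python
import math


def find_disagreements(
    manifest_metrics: dict[str, float],
    mlflow_metrics: dict[str, float],
    names: list[str],
    rel_tol: float = 1e-6,
) -> list[str]:
    """Return one message per checked metric where mlflow does not match the manifest."""
    problems = []
    for name in names:
        expected = manifest_metrics[name]  # the manifest value is the reference
        if name not in mlflow_metrics:
            problems.append(f"{name}: missing from mlflow")
        elif not math.isclose(mlflow_metrics[name], expected, rel_tol=rel_tol):
            problems.append(f"{name}: manifest={expected} mlflow={mlflow_metrics[name]}")
    return problems
```

An empty list means the two views agree on every checked metric. Any message points at the mlflow wiring, since the manifest value is treated as the reference.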

Stop conditions

Done when:

  1. mlflow ui shows the run with all five metric curves (train_loss, lr, g_norm, val_loss, val_ppl), all config parameters, and ≥ 4 artifacts.
  2. experiments/18-train-mini/mlflow_screenshot.png is committed.
  3. The manifest-to-mlflow round-trip test passes.
  4. You can re-find a run via either manifest.json or mlflow ui — both lead to the same artifacts.

Pitfalls

  • mlflow.start_run() without a context manager. Leaks open runs on Ctrl+C. Always with.
  • Logging metrics without step=. mlflow then records every value at step 0, so the curve has no usable x-axis and values logged from multiple places cannot be aligned. Always pass step=.
  • Logging the safetensors file as a parameter instead of an artifact. Parameters are short key-value strings with a length cap (500 characters in some mlflow releases; the exact limit varies by version). Use log_artifact.
  • Setting MLFLOW_TRACKING_URI to a relative path without leading sqlite:///. mlflow then writes a file store, not SQLite, and you get the slow-file-store pain.
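The last pitfall can be caught before any run starts. The guard below is a hypothetical helper, not part of the lab code; it only inspects the environment variable and does not call mlflow, so it could run at the top of scripts/train_mini.py.

```python
import os


def require_sqlite_tracking_uri(env=None) -> str:
    """Return MLFLOW_TRACKING_URI, or raise if it is not a sqlite:/// URI."""
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI", "")
    if not uri.startswith("sqlite:///"):
        raise RuntimeError(
            f"MLFLOW_TRACKING_URI={uri!r} is not a sqlite:/// URI; "
            "expected something like sqlite:///mlruns.db"
        )
    return uri
```

Passing a dict as `env` makes the check easy to unit-test without touching the real environment.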

When to consult solutions/

After the screenshot is committed. Solution at solutions/04-mlflow-wiring-ref.md (written at phase open).


Phase 18 lab work is complete. Continue with /quiz 18, then PHASE_18_REPORT.md, then learners/borja/phase-18/reflections.md, then proceed.