
Lab 04: Wire mlflow into the existing manifest discipline

Goal: make Phase 18's runs browsable in mlflow ui without ever replacing manifest.json as the source of truth.

Estimated time: 60–90 minutes.

Prereq: labs 01-03 done.

What you produce

src/minitrain/mlflow_wrap.py — thin context-managed wrapper over mlflow.start_run().
scripts/train_mini.py modifications — calls into the wrapper.
experiments/18-train-mini/ re-run with mlflow logging.
mlruns.db (SQLite, gitignored).
experiments/18-train-mini/mlflow_screenshot.png — a screenshot of the run in the mlflow UI showing curves + params + artifacts.
experiments/18-train-mini/README.md updated with the mlflow run URI.

Background you must have read

theory/04-checkpoints-and-mlflow.md §"mlflow — what it gives you and what it doesn't".

TODOs

Block A: mlflow setup

Install mlflow from the experiments dependency group in pyproject.toml:
```
uv pip install --group experiments
```

Set the tracking URI:

export MLFLOW_TRACKING_URI=sqlite:///mlruns.db

Add mlruns.db to .gitignore.
Start the UI in another terminal: mlflow ui --backend-store-uri sqlite:///mlruns.db --port 5000. Open http://localhost:5000 to confirm it loads.
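
A training script can fail fast when the tracking URI is not the SQLite one set above. The following is a minimal sketch, not part of the lab's reference code; the helper names `is_sqlite_tracking_uri` and `require_sqlite_tracking_uri` are hypothetical, and it only inspects the URI scheme using the standard library.

```python
import os
from urllib.parse import urlparse


def is_sqlite_tracking_uri(uri: str) -> bool:
    """Return True if the URI uses the sqlite scheme, e.g. sqlite:///mlruns.db."""
    return urlparse(uri).scheme == "sqlite"


def require_sqlite_tracking_uri() -> str:
    """Read MLFLOW_TRACKING_URI and raise if it is unset or not a sqlite URI."""
    uri = os.environ.get("MLFLOW_TRACKING_URI", "")
    if not is_sqlite_tracking_uri(uri):
        raise RuntimeError(
            f"MLFLOW_TRACKING_URI={uri!r} is not a sqlite:/// URI; "
            "run: export MLFLOW_TRACKING_URI=sqlite:///mlruns.db"
        )
    return uri
```

Calling `require_sqlite_tracking_uri()` at the top of the training script would catch a bare path such as `mlruns.db` before any run is created.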

Block B: write the wrapper

# src/minitrain/mlflow_wrap.py
import mlflow
from contextlib import contextmanager
from pathlib import Path
from types import SimpleNamespace  # used for the yielded logger object

@contextmanager
def tracking_run(experiment_name: str, run_name: str, manifest_path: Path, config: dict):
    """Open an mlflow run, log params from config, log manifest.json as artifact,
    yield a logger that the training loop can call to log metrics."""
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run(run_name=run_name) as run:
        # Always log the manifest first — it's the source of truth.
        mlflow.log_artifact(str(manifest_path))
        mlflow.log_params(_flatten_config(config))

        def log_metrics(metrics: dict[str, float], step: int):
            for k, v in metrics.items():
                mlflow.log_metric(k, v, step=step)

        def log_artifact(path: Path):
            mlflow.log_artifact(str(path))

        # Update the manifest with the mlflow URI so the manifest can find this run.
        _patch_manifest(manifest_path, mlflow_run_id=run.info.run_id)

        yield SimpleNamespace(log_metrics=log_metrics, log_artifact=log_artifact, run=run)

Implement _flatten_config — flatten nested config to dotted.keys for mlflow's flat params.
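Implement _patch_manifest: rewrite manifest.json with an additional mlflow_run_uri field built from the mlflow_run_id argument (Block D's test recovers the run ID as the last "/"-separated segment of that field). Make the write atomic: write a temporary file in the same directory, then rename it over the original.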
Test that the wrapper exits cleanly on KeyboardInterrupt — the with block closes the run, the manifest stays consistent.

Block C: integrate with `train_mini.py`

Modify the training script to wrap the loop:

with tracking_run(
    experiment_name="phase18_minigpt",
    run_name=f"phase18_{git_sha[:7]}",
    manifest_path=out_dir / "manifest.json",
    config=config,
) as logger:
    for step in range(total_steps):
        ...
        if step % log_every == 0:
            logger.log_metrics(
                {"train_loss": loss, "lr": lr, "g_norm": g_norm}, step=step
            )
        if step % val_every == 0:
            logger.log_metrics({"val_loss": val_loss, "val_ppl": val_ppl}, step=step)

    logger.log_artifact(checkpoint_path)
    logger.log_artifact(out_dir / "loss_curve.png")
    logger.log_artifact(out_dir / "results.json")

Run the training again (you can shorten to 500 steps if a full re-run is expensive — note this in the README).
After the run, verify in the UI:
The run appears in the phase18_minigpt experiment.
Parameters show the flattened config.
Metrics have step-indexed curves (train_loss, lr, g_norm, val_loss, val_ppl).
Artifacts include manifest.json, loss_curve.png, results.json, weights.safetensors.
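
The same checklist can be checked from a script as well as by eye. The sketch below is an optional aid, assuming the metric and artifact names listed above and that artifacts were logged at the root of the run; `missing` and `check_run` are hypothetical names. It uses `mlflow.get_run` and `MlflowClient.list_artifacts`, imported lazily so the pure helper can be exercised without mlflow installed.

```python
from collections.abc import Iterable

EXPECTED_METRICS = {"train_loss", "lr", "g_norm", "val_loss", "val_ppl"}
EXPECTED_ARTIFACTS = {
    "manifest.json",
    "loss_curve.png",
    "results.json",
    "weights.safetensors",
}


def missing(expected: set[str], found: Iterable[str]) -> set[str]:
    """Return the expected names that are absent from `found`."""
    return expected - set(found)


def check_run(run_id: str) -> dict[str, set[str]]:
    """Query the tracking store and report missing metrics and artifacts."""
    # Lazy imports: only needed when actually talking to the tracking store.
    import mlflow
    from mlflow.tracking import MlflowClient

    run = mlflow.get_run(run_id)
    # list_artifacts returns entries whose .path is relative to the run's artifact root.
    artifacts = [info.path for info in MlflowClient().list_artifacts(run_id)]
    return {
        "metrics": missing(EXPECTED_METRICS, run.data.metrics),
        "artifacts": missing(EXPECTED_ARTIFACTS, artifacts),
    }
```

With `MLFLOW_TRACKING_URI` set as in Block A, `check_run(run_id)` should return two empty sets for a complete run.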

Block D: manifest ↔ mlflow round-trip

The manifest is the source of truth. Test that:

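import json
from pathlib import Path
import mlflow

def test_manifest_can_locate_mlflow_run():
    manifest = json.loads(Path("experiments/18-train-mini/manifest.json").read_text())
    run_id = manifest["mlflow_run_uri"].split("/")[-1]
    run = mlflow.get_run(run_id)
    # mlflow stores params as strings, so compare against the string form.
    assert run.data.params["optimizer.lr_max"] == "0.0003"
    # run.data.metrics holds the most recently logged value for each key.
    assert run.data.metrics["val_ppl"] < manifest["metrics"]["ngram_baseline_val_ppl"]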

This test confirms the cycle: manifest → mlflow URI → mlflow API → metrics that match what the manifest claims.

Block E: take the screenshot

experiments/18-train-mini/mlflow_screenshot.png:

Open the run in mlflow ui.
Screenshot the run view showing curves panel + parameters + artifacts list.
Commit the PNG (one ~200 KB file).

Block F: README update

In experiments/18-train-mini/README.md, add:

## mlflow

- Tracking URI: `sqlite:///mlruns.db` (local, gitignored).
- Experiment: `phase18_minigpt`.
- Run ID: `<run_id>` (from manifest.json).
- Screenshot: `mlflow_screenshot.png`.
- To re-open the UI: `mlflow ui --backend-store-uri sqlite:///mlruns.db --port 5000`.

Constraints

No autologging. Do not enable mlflow.autolog(). Explicit logging only. The wrapper is ~30 lines; do not import the kitchen sink.
Manifest is the truth. If mlflow's view disagrees with manifest, the bug is in mlflow wiring, not in manifest. Test for this.
One experiment per phase. Phase 18 uses phase18_minigpt. Phase 19 will create phase19_dynamics. Don't mingle.
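
The "Manifest is the truth" constraint asks for a test. One way to phrase the comparison is sketched below, assuming the manifest's "metrics" section uses the same key names as the metrics logged to mlflow (an assumption; adapt the keys to your manifest). `find_disagreements` is a hypothetical helper, and keys present on only one side are ignored.

```python
import math


def find_disagreements(
    manifest_metrics: dict[str, float],
    mlflow_metrics: dict[str, float],
    rel_tol: float = 1e-6,
) -> dict[str, tuple[float, float]]:
    """Return {key: (manifest_value, mlflow_value)} for shared keys that differ."""
    bad: dict[str, tuple[float, float]] = {}
    # Only compare keys that both sides report.
    for key in manifest_metrics.keys() & mlflow_metrics.keys():
        m_val, f_val = manifest_metrics[key], mlflow_metrics[key]
        if not math.isclose(m_val, f_val, rel_tol=rel_tol):
            bad[key] = (m_val, f_val)
    return bad
```

In a test, pass `manifest["metrics"]` and `mlflow.get_run(run_id).data.metrics`, then assert the result is empty. A non-empty result points at the mlflow wiring, since the manifest is the reference.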

Stop conditions

Done when:

mlflow ui shows the run with all five metric curves (train_loss, lr, g_norm, val_loss, val_ppl), all config parameters, and ≥ 4 artifacts.
experiments/18-train-mini/mlflow_screenshot.png is committed.
The manifest-to-mlflow round-trip test passes.
You can re-find a run via either manifest.json or mlflow ui — both lead to the same artifacts.

Pitfalls

mlflow.start_run() without a context manager. Leaks open runs on Ctrl+C. Always with.
Logging metrics without step=. mlflow then records every point at step 0, so points can only be ordered by timestamp and a metric logged from multiple places no longer lines up with the training step. Always pass step=.
Logging the safetensors file as a parameter instead of an artifact. Parameters are short key-value strings with a version-dependent length cap (500 characters in older mlflow releases). Use log_artifact.
Setting MLFLOW_TRACKING_URI to a relative path without leading sqlite:///. mlflow then writes a file store, not SQLite, and you get the slow-file-store pain.

When to consult `solutions/`

After the screenshot is committed. Solution at solutions/04-mlflow-wiring-ref.md (written at phase open).

Phase 18 lab work is complete. Continue with /quiz 18, then PHASE_18_REPORT.md, then learners/borja/phase-18/reflections.md, then proceed.