Lab 00 — Cold-start bring-up

The first cold start of the whole stack. The golden rule: whatever you find broken here, you fix in its originating phase, not here. Phase 39 composes; it does not patch. Document every error, the phase responsible for it, and the commit that fixes it. If docker compose up comes up green in under 30 seconds at the end of the lab, you are done.

Goal

From a fresh checkout of lynx-cortex on Borja's i5-8250U, bring the full demo stack up cold and reach green health on every container within 30 seconds. Resolve every missing-config error along the way. Commit the experimental log under experiments/39-capstone-bringup/.

Why this lab exists

Every previous phase wrote one container, one config file, one service. The capstone composes them. Three things break, predictably:

  1. Missing config — Phase 33's MINISERVE_HOST env var isn't propagated to docker-compose.
  2. Port collisions — Phase 34's Prometheus default :9090 clashes with a Phase 38 service.
  3. Volume mount issues — a relative path that worked from cd src/miniserve/ doesn't work from cd infra/compose/.

Each fix lands in the originating phase's directory, with a one-line note in this lab's experiment log pointing to the fix commit. Do not fix in Phase 39. Phase 39 only composes.
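
Of the three failure modes, port collisions are the easiest to detect before the stack starts. The following is a minimal preflight sketch, not part of the lab's deliverables. The helper names `port_is_free` and `busy_ports` are hypothetical, and which host ports your compose files actually publish is an assumption you should check against them.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 only when a listener accepted the connection.
        return sock.connect_ex((host, port)) != 0


def busy_ports(ports: list[int]) -> list[int]:
    """Return the subset of ports that already have a listener."""
    return [p for p in ports if not port_is_free(p)]
```

Calling `busy_ports([8080, 9090, 3000, 3200])` before the first cold start lists the ports that would collide. Those four numbers are the ones mentioned in this lab, assuming each is published on the host under the same number.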

Deliverables

  • infra/compose/full-stack.yml — the composed docker-compose file for just demo-cold. Borja writes it; this lab provides the starter template (Step 2 below).
  • infra/grafana/datasources/prometheus.yaml — the datasource provisioning file so the dashboard's panels resolve against the right Prometheus.
  • infra/grafana/dashboards/capstone.json — a skeleton dashboard with the 10 panels from Theory 03 §dashboard, even if some show "No Data" at this point.
  • Justfile recipes demo-cold and demo, wired to the compose file and scripts/demo/run.py.
  • experiments/39-capstone-bringup/log.md — chronological log of every error, every fix, every commit hash.
  • docs/DONE_ENOUGH.md — the ≤ 20 binary checks (draft).
  • tests/integration/test_stack_healthy.py — a pytest that runs just demo-cold, polls all /healthz endpoints, asserts green within 30 s.

Step 1 — Audit the starting state

$ cd lynx-cortex
$ git log --oneline -1
$ uv sync --frozen
$ rg --files infra/compose/      # list existing compose files

Expected: each previous phase contributed a single-service compose snippet under infra/compose/. Phase 39's job is to merge them into full-stack.yml.

The compose snippets to merge:

| Source | Service | Phase |
|---|---|---|
| infra/compose/miniserve.yml | miniserve | 33 |
| infra/compose/prometheus.yml | prometheus | 34 |
| infra/compose/grafana.yml | grafana | 34 |
| infra/compose/tempo.yml | tempo | 34 |
| infra/compose/mlflow.yml | mlflow-tracking | 38 |
| infra/compose/langfuse.yml (optional) | langfuse | 38 |

If any of these are missing, stop: they should have been written in their originating phase. Open the originating phase's PHASE_NN_REPORT.md and note the gap as a carry-over to fix outside Phase 39.
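
The audit can be made mechanical. The sketch below checks that each required snippet listed above exists and reports which phase owes the missing ones. The names `REQUIRED_SNIPPETS` and `missing_snippets` are hypothetical; the file names and phase numbers come from the list above, with langfuse.yml left out because it is optional.

```python
from pathlib import Path

# Required snippet file name -> originating phase (langfuse.yml is optional).
REQUIRED_SNIPPETS = {
    "miniserve.yml": 33,
    "prometheus.yml": 34,
    "grafana.yml": 34,
    "tempo.yml": 34,
    "mlflow.yml": 38,
}


def missing_snippets(compose_dir: Path) -> dict[str, int]:
    """Map each missing snippet file name to the phase that should have written it."""
    return {
        name: phase
        for name, phase in REQUIRED_SNIPPETS.items()
        if not (compose_dir / name).is_file()
    }
```

Running `missing_snippets(Path("infra/compose"))` from the repository root returns an empty dict when the audit passes. Any entry it does return is a carry-over for the phase named in the value.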

Step 2 — Merge into full-stack.yml

Starter template:

# infra/compose/full-stack.yml
name: lynx-cortex-demo
services:
  miniserve:
    extends:
      file: ./miniserve.yml
      service: miniserve
    depends_on:
      prometheus:
        condition: service_healthy
      tempo:
        condition: service_healthy

  prometheus:
    extends:
      file: ./prometheus.yml
      service: prometheus

  grafana:
    extends:
      file: ./grafana.yml
      service: grafana
    depends_on:
      prometheus:
        condition: service_healthy
    volumes:
      # Paths are relative to this file's directory (infra/compose/), hence the ../
      - ../grafana/datasources:/etc/grafana/provisioning/datasources:ro
      - ../grafana/dashboards:/etc/grafana/provisioning/dashboards:ro

  tempo:
    extends:
      file: ./tempo.yml
      service: tempo

  mlflow-tracking:
    extends:
      file: ./mlflow.yml
      service: mlflow-tracking

The extends pattern preserves the per-phase compose files (each phase still owns its service definition) while composing them into one. Do not duplicate service definitions — duplication invites drift.

Step 3 — First cold start

$ just demo-cold-up      # docker compose -f infra/compose/full-stack.yml up -d
$ docker compose -f infra/compose/full-stack.yml ps

Expected outcomes (ordered by likelihood):

  1. Port collision: prometheus and one of Borja's other services both want :9090. Fix by editing the originating Phase 34 compose file so it reads a PROMETHEUS_PORT env var that defaults to :9090 but can be overridden. The fix lands in Phase 34, not 39. Example entry in experiments/39-capstone-bringup/log.md: 2026-06-XX 10:32 — prometheus :9090 collision with system grafana — fix: Phase 34 compose adds PROMETHEUS_PORT=9091 — commit abc1234.
  2. Missing env var: miniserve doesn't see OTEL_EXPORTER_OTLP_ENDPOINT. Fix in the originating phase's compose snippet.
  3. Volume mount path: the Grafana provisioning mounts use relative paths. Compose resolves relative bind-mount paths against the project directory, which defaults to the directory containing the compose file (infra/compose/), not the directory you run the command from. Check the resolved absolute paths with docker compose config and confirm they point at infra/grafana/.

Each error is logged with: timestamp, symptom, root cause, fixing phase, commit hash.
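
To keep entries uniform, the five fields can be captured in a small record and rendered the same way every time. This is only a sketch: the `BringupError` class and the " | " separated bullet layout are assumptions for illustration, not a format the lab prescribes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BringupError:
    """One log.md entry: the five fields every error must record."""

    timestamp: datetime
    symptom: str
    root_cause: str
    fixing_phase: int
    commit: str

    def to_markdown(self) -> str:
        """Render one bullet line; the ' | ' separator is an assumed layout."""
        stamp = self.timestamp.strftime("%Y-%m-%d %H:%M")
        return (
            f"- {stamp} | {self.symptom} | cause: {self.root_cause} "
            f"| fix in Phase {self.fixing_phase} | commit {self.commit}"
        )
```

Appending `entry.to_markdown()` to log.md after each fix keeps the log chronological and makes a missing field visible at a glance.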

Step 4 — Health checks

Add healthcheck: to every service in its originating compose file (if not already present):

# miniserve
healthcheck:
  test: ["CMD", "curl", "-fsS", "http://localhost:8080/healthz"]
  interval: 5s
  timeout: 3s
  retries: 6
  start_period: 5s

Each service must reach healthy within 30 s of docker compose up. The test_stack_healthy.py integration test polls docker compose ps --format json and asserts all services show health: healthy within the limit.

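# tests/integration/test_stack_healthy.py
import json
import subprocess
import time

COMPOSE = ["docker", "compose", "-f", "infra/compose/full-stack.yml"]


def _container_states():
    """Parse `docker compose ps --format json` into a list of dicts."""
    out = subprocess.run(
        [*COMPOSE, "ps", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    # Depending on the Compose release, the output is either a single JSON
    # array or one JSON object per line; accept both.
    if out.startswith("["):
        return json.loads(out)
    return [json.loads(line) for line in out.splitlines() if line.strip()]


def test_full_stack_reaches_healthy_within_30s():
    # Start the clock before `up`: with depends_on health conditions,
    # `up -d` itself waits for the dependencies to become healthy.
    deadline = time.monotonic() + 30
    subprocess.run(["just", "demo-cold-up"], check=True)
    while time.monotonic() < deadline:
        states = _container_states()
        if states and all(s.get("Health") == "healthy" for s in states):
            return
        time.sleep(2)
    subprocess.run(["just", "demo-cold-down"])
    raise AssertionError("stack did not reach healthy within 30s")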
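
The test above relies on Docker's own health status. A complementary check is to poll the HTTP endpoints directly from the host, which also catches a port that is healthy inside the container but not published. The sketch below uses only the standard library. `is_healthy` and `wait_until_healthy` are hypothetical names, and only miniserve's http://localhost:8080/healthz is given in this lab, so any other URL you pass in is an assumption about that service.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready.
        return False


def wait_until_healthy(urls: list[str], limit_s: float = 30.0, poll_s: float = 1.0) -> list[str]:
    """Poll until every URL is healthy or limit_s elapses; return URLs still failing."""
    deadline = time.monotonic() + limit_s
    pending = list(urls)
    while pending:
        pending = [u for u in pending if not is_healthy(u)]
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(poll_s)
    return pending
```

An empty return value means every endpoint answered 200 within the limit. A non-empty one names the services to investigate in their originating phase.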

Step 5 — Provisioning Grafana

infra/grafana/datasources/prometheus.yaml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    editable: false

infra/grafana/dashboards/capstone.json is built by:

  1. Bringing the stack up.
  2. Opening Grafana on :3000, default creds.
  3. Manually building the 10 panels from Theory 03 §dashboard (some will be empty; that's fine for the skeleton).
  4. Exporting the dashboard JSON via the Grafana UI (Share → Export → Save to file).
  5. Committing the file under infra/grafana/dashboards/.

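Grafana's provisioning then auto-imports it on the next stack start, provided the mounted provisioning/dashboards directory also contains a dashboard provider YAML whose path option points at the folder holding the JSON.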

Skeleton-first principle: a dashboard with 10 panels reading "No Data" is informative (shows the contract). A dashboard with 5 panels that all populate but lacks the other 5 hides the contract. Commit the full skeleton even if half is empty at this stage.
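
The skeleton contract can be checked before committing. The sketch below counts the panels in an exported dashboard, assuming the classic Grafana dashboard JSON model (a top-level "panels" list in which collapsed rows nest their children under their own "panels" key). `panel_titles` and `check_skeleton` are hypothetical helpers.

```python
import json
from pathlib import Path


def panel_titles(dashboard: dict) -> list[str]:
    """Collect titles of all non-row panels, including those inside collapsed rows."""
    titles: list[str] = []
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            # Collapsed rows keep their children in a nested "panels" list.
            titles.extend(p.get("title", "") for p in panel.get("panels", []))
        else:
            titles.append(panel.get("title", ""))
    return titles


def check_skeleton(path: Path, expected: int = 10) -> list[str]:
    """Return the panel titles, raising ValueError if the count differs from expected."""
    titles = panel_titles(json.loads(path.read_text(encoding="utf-8")))
    if len(titles) != expected:
        raise ValueError(f"{path}: expected {expected} panels, found {len(titles)}")
    return titles
```

`check_skeleton(Path("infra/grafana/dashboards/capstone.json"))` returns the titles so they can be compared by eye against Theory 03 §dashboard.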

Step 6 — DONE_ENOUGH.md first draft

Write the ≤ 20 binary checks from Theory 05. Use this template:

# Phase 39 — Capstone DoD checklist

Every check is binary, automated, and runs as part of `just demo`. If any
check fails, the demo exits non-zero and the phase report is blocked.

| ID | Statement | How to verify | Owner phase |
|---|---|---|---|
| DE-001 | Stack starts within 30 s | `tests/integration/test_stack_healthy.py` | 39 |
| DE-002 | `miniserve` responds on :8080 within 5 s | `curl :8080/healthz` in demo script | 33 |
| ... | ... | ... | ... |

15 rows from Theory 05; add 5 more covering: (a) RAG retrieval returns ≥ 1 chunk; (b) trace context propagates across MCP boundary; (c) cost-decomposition identity holds; (d) Grafana dashboard provisioning has zero errors at boot; (e) the demo's transcript.jsonl is well-formed JSON-lines.
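
The checklist itself can be linted. The following sketch assumes rows keep the `| DE-NNN | ... |` shape from the template above; `checklist_ids` and `validate_checklist` are hypothetical helpers that enforce the 20-row ceiling and unique IDs.

```python
import re

# Matches a table row whose first cell is an ID such as DE-001.
ROW = re.compile(r"^\|\s*(DE-\d{3})\s*\|")


def checklist_ids(markdown: str) -> list[str]:
    """Return the DE-NNN identifiers found in the first column of the table."""
    return [m.group(1) for line in markdown.splitlines() if (m := ROW.match(line))]


def validate_checklist(markdown: str, max_rows: int = 20) -> list[str]:
    """Return a list of problems; an empty list means the draft passes."""
    ids = checklist_ids(markdown)
    problems = []
    if not ids:
        problems.append("no DE-NNN rows found")
    if len(ids) > max_rows:
        problems.append(f"{len(ids)} rows exceeds the limit of {max_rows}")
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        problems.append(f"duplicate ids: {', '.join(duplicates)}")
    return problems
```

The header, separator, and placeholder rows are ignored because their first cell is not a DE-NNN identifier.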

Step 7 — Wire the Justfile recipes

# Justfile excerpt
demo-cold-up:
    docker compose -f infra/compose/full-stack.yml up -d

demo-cold-down:
    docker compose -f infra/compose/full-stack.yml down -v --remove-orphans

demo-cold: demo-cold-up
    uv run python scripts/demo/run.py
    just demo-cold-down

demo: demo-cold-up
    uv run python scripts/demo/run.py

just demo leaves the stack up (interactive use); just demo-cold tears it down (CI).

Step 8 — Five consecutive runs

The DoD requires 5-of-5 success. Run:

$ for i in $(seq 1 5); do
    echo "=== run $i ==="
    just demo-cold || break
done

If any run fails, capture the failure in experiments/39-capstone-bringup/log.md and fix in the originating phase before continuing. Do not fix in Phase 39.
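
If per-run timings are useful for the experiment log, the same loop can be sketched in Python. `run_consecutive` is a hypothetical helper; it stops at the first failure, as the shell loop does.

```python
import subprocess
import time


def run_consecutive(cmd: list[str], runs: int = 5) -> list[float]:
    """Run cmd up to `runs` times, stopping at the first failure.

    Returns the wall-clock duration of each successful run, so a result of
    length `runs` means every run passed.
    """
    durations: list[float] = []
    for i in range(1, runs + 1):
        start = time.monotonic()
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"run {i} failed with exit code {result.returncode}")
            break
        durations.append(time.monotonic() - start)
        print(f"run {i} ok in {durations[-1]:.1f}s")
    return durations
```

`run_consecutive(["just", "demo-cold"])` satisfies the 5-of-5 requirement only when the returned list has five entries.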

What "done" looks like

  • full-stack.yml exists and merges all per-phase compose snippets via extends.
  • just demo-cold-up brings every service to health: healthy within 30 s.
  • tests/integration/test_stack_healthy.py passes.
  • infra/grafana/datasources/prometheus.yaml exists and defines both the Prometheus and Tempo datasources.
  • infra/grafana/dashboards/capstone.json skeleton committed (10 panels, even if half show "No Data").
  • docs/DONE_ENOUGH.md drafted with ≤ 20 rows.
  • just demo-cold-down cleanly removes all containers and volumes.
  • Five consecutive just demo-cold runs all succeed.
  • experiments/39-capstone-bringup/log.md lists every error encountered, with the fix commit and the originating phase.

Common pitfalls

  1. Fixing things in Phase 39. Tempting; wrong. The fix belongs in the originating phase. Phase 39 only composes.
  2. Hardcoding ports. Use env vars with defaults; the demo runs on Borja's machine and on CI hardware that may have port conflicts.
  3. Committing personal mlruns/ data. That's the supply-chain pitfall from Plan §5 #1. Add mlruns/ to .gitignore before the first commit if it is not already there.
  4. Forgetting --remove-orphans on down. A leftover container from a previous run blocks the next start; idempotency breaks.
  5. Trusting "it started" without health checks. A container that starts is not a container that's ready. The healthcheck is the contract.

Next: lab/01-end-to-end-grammar-tutor-request.md — single request through every layer; trace tree captured; cost identity verified.