Lab 00: Cold-start bring-up
The first cold start of the whole stack. The golden rule: whatever you find broken here, you fix in its originating phase, not here. Phase 39 composes; it does not patch. Document every error, its responsible phase, and the fix commit. If, by the end of the lab, `docker compose up` comes up green in under 30 seconds, you are done.
Goal
From a fresh checkout of lynx-cortex on Borja's i5-8250U, bring the full demo stack up cold and reach green health on every container within 30 seconds. Resolve every missing-config error along the way. Commit the experimental log under experiments/39-capstone-bringup/.
Why this lab exists
Every previous phase wrote one container, one config file, one service. The capstone composes them. Three things break, predictably:
- Missing config: Phase 33's `MINISERVE_HOST` env var isn't propagated to docker-compose.
- Port collisions: Phase 34's Prometheus default `:9090` clashes with a Phase 38 service.
- Volume mount issues: a relative path that worked from `cd src/miniserve/` doesn't work from `cd infra/compose/`.
Each fix lands in the originating phase's directory, with a one-line note in this lab's experiment log pointing to the fix commit. Do not fix in Phase 39. Phase 39 only composes.
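The volume-mount failure in the list above is the least obvious of the three, so here is a minimal sketch of why it happens. It assumes a hypothetical relative path `./models` and a hypothetical checkout location; `resolve_from` is an illustrative helper, not something in the repo. The same relative string lands on two different absolute paths depending on the directory it is resolved against.

```python
import posixpath


def resolve_from(base_dir: str, relative: str) -> str:
    """Resolve `relative` against `base_dir` and collapse '.' and '..' segments."""
    return posixpath.normpath(posixpath.join(base_dir, relative))


REPO = "/home/borja/lynx-cortex"  # hypothetical checkout location

# The path as written by a phase that was developed from src/miniserve/ ...
from_service = resolve_from(f"{REPO}/src/miniserve", "./models")
# ... and the same string once it is resolved from infra/compose/.
from_compose = resolve_from(f"{REPO}/infra/compose", "./models")

print(from_service)
print(from_compose)
```

The two printed paths differ, which is exactly the symptom: the mount silently points at a directory that does not hold the expected files.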
Deliverables
- `infra/compose/full-stack.yml`: the composed docker-compose file for `just demo-cold`. Borja writes it; this lab provides the starter template (Step 2 below).
- `infra/grafana/datasources/prometheus.yaml`: the datasource provisioning file, so the dashboard's panels resolve against the right Prometheus.
- `infra/grafana/dashboards/capstone.json`: a skeleton dashboard with the 10 panels from Theory 03 §dashboard, even if some show "No Data" at this point.
- `Justfile` recipes `demo-cold` and `demo`, wired to the compose file and `scripts/demo/run.py`.
- `experiments/39-capstone-bringup/log.md`: chronological log of every error, every fix, every commit hash.
- `docs/DONE_ENOUGH.md`: the ≤ 20 binary checks (draft).
- `tests/integration/test_stack_healthy.py`: a pytest that runs `just demo-cold-up`, polls every service's health status, and asserts green within 30 s.
Step 1: Audit the starting state

```console
$ cd lynx-cortex
$ git log --oneline -1
$ uv sync --frozen
$ rg -l 'docker' infra/    # list existing compose files
```
Expected: each previous phase contributed a single-service compose snippet under infra/compose/. Phase 39's job is to merge them into full-stack.yml.
The compose snippets to merge:
| Source | Service | Phase |
|---|---|---|
| `infra/compose/miniserve.yml` | `miniserve` | 33 |
| `infra/compose/prometheus.yml` | `prometheus` | 34 |
| `infra/compose/grafana.yml` | `grafana` | 34 |
| `infra/compose/tempo.yml` | `tempo` | 34 |
| `infra/compose/mlflow.yml` | `mlflow-tracking` | 38 |
| `infra/compose/langfuse.yml` (optional) | `langfuse` | 38 |
If any of these are missing, stop: they should have been written in their originating phase. Open the originating phase's PHASE_NN_REPORT.md and note the gap as a carry-over to fix outside Phase 39.
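The audit can be made mechanical. The following is a minimal sketch, assuming the file names in the table above and treating `langfuse.yml` as optional; `missing_snippets` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

# File names taken from the table above; langfuse.yml is optional and not checked.
REQUIRED_SNIPPETS: tuple[str, ...] = (
    "miniserve.yml",
    "prometheus.yml",
    "grafana.yml",
    "tempo.yml",
    "mlflow.yml",
)


def missing_snippets(
    compose_dir: Path, names: tuple[str, ...] = REQUIRED_SNIPPETS
) -> list[str]:
    """Return the expected snippet names that are not regular files in compose_dir."""
    return [name for name in names if not (compose_dir / name).is_file()]
```

Calling `missing_snippets(Path("infra/compose"))` from the repo root returns an empty list when every required snippet is present. Anything it does return is a carry-over for the originating phase, not something to create in Phase 39.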
Step 2: Merge into full-stack.yml
Starter template:

```yaml
# infra/compose/full-stack.yml
name: lynx-cortex-demo

services:
  miniserve:
    extends:
      file: ./miniserve.yml
      service: miniserve
    depends_on:
      prometheus:
        condition: service_healthy
      tempo:
        condition: service_healthy

  prometheus:
    extends:
      file: ./prometheus.yml
      service: prometheus

  grafana:
    extends:
      file: ./grafana.yml
      service: grafana
    depends_on:
      prometheus:
        condition: service_healthy
    volumes:
      - ./grafana/datasources:/etc/grafana/provisioning/datasources:ro
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards:ro

  tempo:
    extends:
      file: ./tempo.yml
      service: tempo

  mlflow-tracking:
    extends:
      file: ./mlflow.yml
      service: mlflow-tracking
```
The extends pattern preserves the per-phase compose files (each phase still owns its service definition) while composing them into one. Do not duplicate service definitions — duplication invites drift.
Step 3: First cold start

```console
$ just demo-cold-up    # docker compose -f infra/compose/full-stack.yml up -d
$ docker compose -f infra/compose/full-stack.yml ps
```
Expected outcomes (ordered by likelihood):
- Port collision: `prometheus` and one of Borja's other services both want `:9090`. Fix by editing the originating Phase 34 compose to accept a `PROMETHEUS_PORT` env var; default `:9090` but overridable. Fix lands in Phase 34, not 39. Log in `experiments/39-capstone-bringup/log.md`: `2026-06-XX 10:32 - prometheus :9090 collision with system grafana - fix: Phase 34 compose adds PROMETHEUS_PORT=9091 - commit abc1234`.
- Missing env var: `miniserve` doesn't see `OTEL_EXPORTER_OTLP_ENDPOINT`. Fix in the originating phase's compose snippet.
- Volume mount path: the Grafana datasource provisioning path is the relative `./grafana/datasources/`. Compose resolves relative bind-mount paths against the project directory, which by default is the directory containing `full-stack.yml` (`infra/compose/`), not the shell's current directory. Check that the resolved path is the directory that actually holds `prometheus.yaml`. Verify with `docker compose config`.
Each error is logged with: timestamp, symptom, root cause, fixing phase, commit hash.
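A small record type keeps those five fields from being skipped under time pressure. This is a sketch only: the `BringupError` class, the pipe-separated line layout, and the `root_cause` text are assumptions for illustration, while the timestamp, symptom, and commit placeholder are copied from the example entry above.

```python
import re
from dataclasses import dataclass

_COMMIT_RE = re.compile(r"[0-9a-f]{7,40}")  # abbreviated or full git hash


@dataclass(frozen=True)
class BringupError:
    timestamp: str
    symptom: str
    root_cause: str
    fixing_phase: int
    commit: str

    def __post_init__(self) -> None:
        # Refuse entries that leave a text field blank.
        for name in ("timestamp", "symptom", "root_cause", "commit"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
        if not _COMMIT_RE.fullmatch(self.commit):
            raise ValueError(f"not a git hash: {self.commit!r}")

    def to_markdown(self) -> str:
        """Render the entry as one Markdown list item (layout is hypothetical)."""
        return (
            f"- {self.timestamp} | {self.symptom} | cause: {self.root_cause} "
            f"| fix: Phase {self.fixing_phase} | commit {self.commit}"
        )


entry = BringupError(
    timestamp="2026-06-XX 10:32",
    symptom="prometheus :9090 collision",
    root_cause="host port already bound",  # illustrative value
    fixing_phase=34,
    commit="abc1234",
)
print(entry.to_markdown())
```

Validating the commit hash at construction time means an entry cannot be written before the fix has actually been committed in its originating phase.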
Step 4: Health checks
Add `healthcheck:` to every service in its originating compose file (if not already present):

```yaml
# miniserve
healthcheck:
  test: ["CMD", "curl", "-fsS", "http://localhost:8080/healthz"]
  interval: 5s
  timeout: 3s
  retries: 6
  start_period: 5s
```
Each service must reach healthy within 30 s of docker compose up. The test_stack_healthy.py integration test polls docker compose ps --format json and asserts all services show health: healthy within the limit.
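```python
# tests/integration/test_stack_healthy.py
import json
import subprocess
import time

COMPOSE_FILE = "infra/compose/full-stack.yml"
LIMIT_S = 30


def test_full_stack_reaches_healthy_within_30s():
    # Start the clock before `up`: with `depends_on: service_healthy`,
    # `up -d` itself waits for prometheus and tempo to become healthy.
    deadline = time.monotonic() + LIMIT_S
    try:
        subprocess.run(["just", "demo-cold-up"], check=True)
        while time.monotonic() < deadline:
            ps = subprocess.run(
                ["docker", "compose", "-f", COMPOSE_FILE, "ps", "--format", "json"],
                check=True, capture_output=True, text=True,
            )
            # Expects one JSON object per line, one line per container.
            statuses = [json.loads(line) for line in ps.stdout.splitlines() if line.strip()]
            if statuses and all(s.get("Health") == "healthy" for s in statuses):
                return
            time.sleep(2)
        raise AssertionError(f"stack did not reach healthy within {LIMIT_S}s")
    finally:
        # Always tear down so the next run starts cold.
        subprocess.run(["just", "demo-cold-down"], check=False)
```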
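The parsing step in that test is worth isolating, because the layout of `docker compose ps --format json` depends on the Compose version: some releases print a single JSON array, others print one object per line. The sketch below accepts both. The helper names are hypothetical, and the `Service` and `Health` keys are assumed from the output the test above already relies on.

```python
import json


def parse_ps_output(stdout: str) -> list[dict]:
    """Parse `docker compose ps --format json` output in either layout."""
    text = stdout.strip()
    if not text:
        return []
    if text.startswith("["):
        # Single JSON array covering all containers.
        return json.loads(text)
    # One JSON object per line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def unhealthy_services(containers: list[dict]) -> list[str]:
    """Service names whose Health field is anything other than 'healthy'."""
    return [c.get("Service", "?") for c in containers if c.get("Health") != "healthy"]
```

A service without a `healthcheck:` reports an empty `Health` value, so it shows up in `unhealthy_services`. That is deliberate here: it matches the rule that every service must declare a healthcheck.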
Step 5: Provisioning Grafana
`infra/grafana/datasources/prometheus.yaml`:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    editable: false
```
infra/grafana/dashboards/capstone.json is built by:
- Bringing the stack up.
- Opening Grafana on `:3000`, default creds.
- Manually building the 10 panels from Theory 03 §dashboard (some will be empty; that's fine for the skeleton).
- Exporting the dashboard JSON via the Grafana UI (`Share → Export → Save to file`).
- Committing the file under `infra/grafana/dashboards/`.
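Grafana's provisioning then imports it automatically on the next stack start, provided a dashboard provider YAML in the mounted provisioning directory points at the folder that holds the JSON.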
Skeleton-first principle: a dashboard with 10 panels reading "No Data" is informative (shows the contract). A dashboard with 5 panels that all populate but lacks the other 5 hides the contract. Commit the full skeleton even if half is empty at this stage.
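The "10 panels" contract can be checked against the exported file rather than by eye. This is a minimal sketch, assuming the usual exported dashboard layout: a top-level `panels` list in which collapsed rows (`"type": "row"`) nest their own `panels`. The function names are hypothetical.

```python
import json


def count_panels(dashboard: dict) -> int:
    """Count visualisation panels, descending into collapsed rows."""
    total = 0
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            # A row is a container, not a panel; count what it nests.
            total += count_panels(panel)
        else:
            total += 1
    return total


def panel_count_from_file(path: str) -> int:
    """Load an exported dashboard JSON file and count its panels."""
    with open(path, encoding="utf-8") as fh:
        return count_panels(json.load(fh))
```

From the repo root, `panel_count_from_file("infra/grafana/dashboards/capstone.json") == 10` is then a binary check on the skeleton, whether or not any panel has data yet.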
Step 6: DONE_ENOUGH.md first draft
Write the ≤ 20 binary checks from Theory 05. Use this template:

```markdown
# Phase 39: Capstone DoD checklist

Every check is binary, automated, and runs as part of `just demo`. If any
check fails, the demo exits non-zero and the phase report is blocked.

| ID | Statement | How to verify | Owner phase |
|---|---|---|---|
| DE-001 | Stack starts within 30 s | `tests/integration/test_stack_healthy.py` | 39 |
| DE-002 | `miniserve` responds on :8080 within 5 s | `curl :8080/healthz` in demo script | 33 |
| ... | ... | ... | ... |
```
15 rows come from Theory 05; add 5 more covering: (a) RAG retrieval returns ≥ 1 chunk; (b) trace context propagates across the MCP boundary; (c) the cost-decomposition identity holds; (d) Grafana dashboard provisioning has zero errors at boot; (e) the demo's `transcript.jsonl` is well-formed JSON Lines.
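Check (e) is the easiest of the five to make binary. A minimal sketch, assuming "well-formed" means every line parses as JSON on its own and blank lines count as errors (a strict reading; relax it if the demo writer emits a trailing blank line). `jsonl_errors` is a hypothetical helper.

```python
import json


def jsonl_errors(text: str) -> list[str]:
    """Return one message per malformed line; an empty list means well-formed."""
    errors: list[str] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            errors.append(f"line {lineno}: blank line")
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: {exc.msg}")
    return errors
```

The check passes when `jsonl_errors(...)` returns an empty list for the contents of `transcript.jsonl`; any message in the list names the offending line.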
Step 7: Wire the Justfile recipes

```just
# Justfile excerpt
demo-cold-up:
    docker compose -f infra/compose/full-stack.yml up -d

demo-cold-down:
    docker compose -f infra/compose/full-stack.yml down -v --remove-orphans

demo-cold: demo-cold-up
    uv run python scripts/demo/run.py
    just demo-cold-down

demo: demo-cold-up
    uv run python scripts/demo/run.py
```
just demo leaves the stack up (interactive use); just demo-cold tears it down (CI).
Step 8: Five consecutive runs
The DoD requires 5-of-5 success: `just demo-cold` must exit zero five times in a row.
If any run fails, capture the failure in experiments/39-capstone-bringup/log.md and fix in the originating phase before continuing. Do not fix in Phase 39.
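One way to drive the five runs is a small loop that stops at the first failure, so the failing state is still fresh when the log entry is written. This is a sketch, not the lab's prescribed command; `consecutive_successes` is a hypothetical helper.

```python
import subprocess


def consecutive_successes(cmd: list[str], runs: int = 5) -> int:
    """Run cmd up to `runs` times, stopping at the first non-zero exit.

    Returns the number of runs that succeeded before stopping.
    """
    for done in range(runs):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return done
    return runs
```

With the recipes from Step 7, `consecutive_successes(["just", "demo-cold"]) == 5` is the pass condition; any smaller number says which run broke.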
What "done" looks like
- `full-stack.yml` exists and merges all per-phase compose snippets via `extends`.
- `just demo-cold-up` brings every service to `health: healthy` within 30 s.
- `tests/integration/test_stack_healthy.py` passes.
- `infra/grafana/datasources/prometheus.yaml` exists and provisions both the Prometheus and Tempo datasources.
- `infra/grafana/dashboards/capstone.json` skeleton committed (10 panels, even if half show "No Data").
- `docs/DONE_ENOUGH.md` drafted with ≤ 20 rows.
- `just demo-cold-down` cleanly removes all containers and volumes.
- Five consecutive `just demo-cold` runs all succeed.
- `experiments/39-capstone-bringup/log.md` lists every error encountered, with the fix commit and the originating phase.
Common pitfalls
- Fixing things in Phase 39. Tempting; wrong. The fix belongs in the originating phase. Phase 39 only composes.
- Hardcoding ports. Use env vars with defaults; the demo runs on Borja's machine and on CI hardware that may have port conflicts.
- Committing personal `mlruns/` data. That's the supply-chain pitfall from Plan §5 #1. Add it to `.gitignore` before the first commit if it is not already there.
- Forgetting `--remove-orphans` on `down`. A leftover container from a previous run blocks the next start; idempotency breaks.
- Trusting "it started" without health checks. A container that starts is not a container that's ready. The healthcheck is the contract.
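For the hardcoded-ports pitfall, Compose files can use `${PROMETHEUS_PORT:-9090}` interpolation. Any Python that needs the same port (this assumes, hypothetically, that a script such as `scripts/demo/run.py` queries Prometheus) should read the same variable with the same default. A minimal sketch, with `port_from_env` as an illustrative helper:

```python
import os
from collections.abc import Mapping


def port_from_env(
    name: str, default: int, env: Mapping[str, str] = os.environ
) -> int:
    """Read a TCP port from an env var, falling back to `default` when unset or empty."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    port = int(raw)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"{name}={port} is outside 1..65535")
    return port
```

`port_from_env("PROMETHEUS_PORT", 9090)` then follows whatever override the Phase 34 compose snippet honours, and fails loudly on a malformed value instead of silently connecting to the wrong port.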
Next: lab/01-end-to-end-grammar-tutor-request.md — single request through every layer; trace tree captured; cost identity verified.