Theory 05: The demo script and binary acceptance
The demo is not a presentation; it is a contract.
just demo runs cold, on a fresh machine, with no tricks, and the visitor sees 90 seconds of the curriculum working. The mandatory properties: idempotent, deterministic, narrated, and failing loudly when something goes wrong, rather than hiding it with try/except: pass.
Why the demo script is load-bearing
Every curriculum produces some demo. Most are not load-bearing in the engineering sense — they're slides + a video. The capstone of lynx-cortex insists on a stronger property: the demo is the canonical verification that the repo's claims are true. Three corollaries:
- If just demo exits non-zero, a curriculum claim is broken. Not "we'll look into it"; the report is blocked until the demo passes.
- The script is the entrypoint for new contributors. A stranger reading the repo can run just demo first, then read the code with the demo's output as their map. The reverse order (code first, then demo) doesn't work: there is too much code.
- The CI runs it on every PR. Regressions are caught at PR time. The demo is the integration test.
This chapter derives the four properties a load-bearing demo must have, then walks the anatomy of scripts/demo/run.py.
Property 1: Idempotent
A demo that works the first time and fails the second is not idempotent. Causes:
- Database/state survives between runs and the second run uses stale data.
- Network ports left open by a crashed previous run.
- mlruns/ directory committed from a personal session.
- A docker volume that grew between runs.
The capstone's just demo-cold recipe:
- Runs docker compose down -v --remove-orphans before anything else.
- Removes any local experiments/39-*-tmp/ directories.
- Brings the stack up fresh.
- Runs the demo.
- Tears the stack down (optional flag to keep up for inspection).
Each step is logged with a timestamp. The CI smoke test runs the recipe five times in a row and asserts all five pass — that's the operational definition of idempotent.
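The five-run check can be sketched roughly as follows. This is a minimal illustration, not the repo's CI job: the helper names run_repeatedly and assert_idempotent are hypothetical, and in practice the command passed in would be the just demo-cold recipe named above.

```python
import subprocess
from typing import Sequence


def run_repeatedly(command: Sequence[str], runs: int = 5) -> list[int]:
    """Run `command` back to back `runs` times and return every exit code."""
    exit_codes = []
    for attempt in range(1, runs + 1):
        # check=False: record a failing exit code instead of raising on it.
        completed = subprocess.run(list(command), check=False)
        print(f"run {attempt}/{runs}: exit code {completed.returncode}")
        exit_codes.append(completed.returncode)
    return exit_codes


def assert_idempotent(command: Sequence[str], runs: int = 5) -> None:
    """Fail loudly unless every consecutive run exits with status 0."""
    exit_codes = run_repeatedly(command, runs)
    failed_runs = [n for n, code in enumerate(exit_codes, start=1) if code != 0]
    assert not failed_runs, f"runs {failed_runs} of {runs} exited non-zero: {exit_codes}"
```

A CI job built on this sketch would call assert_idempotent(["just", "demo-cold"]). All runs are executed before the assertion fires, so the failure message shows which run numbers broke rather than only the first.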
Property 2: Deterministic
The same seed produces the same output. For the grammar tutor:
- The model is loaded after seeding with torch.manual_seed(42).
- Ties in the RAG bi-encoder's ranking are broken by chunk_id, so the order is deterministic given the seed.
- The model's decoder uses greedy decoding (no top-k randomness) for the demo path; the random sampler is exercised in a separate demo step that explicitly says "sampling is non-deterministic by design."
- The demo's request payloads are committed in scripts/demo/payloads/; no random selection.
- The wall-clock timestamps differ run-to-run, but the correctness checks are stable.
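The tie-break rule can be illustrated with a small sketch. The ScoredChunk type and the rank_chunks helper are hypothetical, not the curriculum's retriever; the only point is that sorting on (descending score, chunk_id) yields one ordering regardless of the order in which candidates arrive.

```python
from typing import Iterable, NamedTuple


class ScoredChunk(NamedTuple):
    chunk_id: str
    score: float


def rank_chunks(chunks: Iterable[ScoredChunk], top_k: int = 5) -> list[ScoredChunk]:
    """Return the top_k chunks by score, breaking ties on chunk_id."""
    # Negating the score sorts high-to-low; equal scores then fall back to
    # ascending chunk_id, so the result does not depend on input order.
    ordered = sorted(chunks, key=lambda chunk: (-chunk.score, chunk.chunk_id))
    return ordered[:top_k]
```

Without the second key component, two chunks with equal scores would keep whatever relative order the index happened to return them in, which is exactly the kind of hidden run-to-run variation the demo path is meant to exclude.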
The CI smoke test diffs the content of the response (response.correction, response.explanation, response.spanish_translation) across five consecutive runs — bit-identical required.
What "deterministic" deliberately does NOT mean
- It does not mean latencies are bit-identical. CPU contention varies.
- It does not mean traces have the same span IDs (span IDs are generated from time or random sources).
- It does not mean Prometheus counters are identical (they accumulate from prior runs during interactive just demo use, though they reset in demo-cold).
The distinction: content is deterministic; metadata is not. The demo's assertion script knows the difference.
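One way to encode the content/metadata split is to compare only the three content fields and ignore everything else. The sketch below assumes each response has been parsed into a dict; the field names come from the smoke-test description, while content_of, assert_content_stable, and any metadata keys are hypothetical.

```python
CONTENT_FIELDS = ("correction", "explanation", "spanish_translation")


def content_of(response: dict) -> dict:
    """Keep only the fields that must be identical across runs."""
    return {field: response[field] for field in CONTENT_FIELDS}


def assert_content_stable(responses: list[dict]) -> None:
    """Compare every run's content against the first run's content."""
    if not responses:
        raise ValueError("need at least one response to compare")
    baseline = content_of(responses[0])
    for run_number, response in enumerate(responses[1:], start=2):
        current = content_of(response)
        changed = sorted(f for f in CONTENT_FIELDS if current[f] != baseline[f])
        assert not changed, f"run {run_number} differs from run 1 in: {changed}"
```

Latencies, span IDs, and counters never enter the comparison, so they are free to vary. A missing content field raises KeyError rather than being skipped, which keeps the check from passing silently on a malformed response.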
Property 3: Narrated
Every step prints a line. Two kinds of lines:
Curriculum lines (so the viewer sees the spine):
[t=0.5s] [Phase 12] Loading verb corpus from DVC...
[t=1.1s] [Phase 11] BPE tokenizer ready (vocab=2048).
[t=2.3s] [Phase 28] Loading Mini-GPT base + LoRA grammar adapter (rev=sha:a1b2c3).
[t=2.4s] [Phase 22] KV-cache pre-allocated for 128 tokens.
[t=2.5s] [Phase 33] miniserve started on :8080.
[t=2.6s] [Phase 34] Cost emitter and OTel exporter armed.
Action lines (so the viewer sees what's happening):
[t=3.1s] >>> POST /v1/grammar/correct {"sentence": "Yesterday I goed to the store"}
[t=3.4s] [Phase 11 tokenize] 12 ms → 10 tokens
[t=3.5s] [Phase 29 retrieve] 28 ms → 5 chunks (top: irregular-go-past)
[t=4.6s] [Phase 17 prefill] 1100 ms
[t=6.8s] [Phase 17 decode] 2200 ms → "Yesterday I **went** to the store."
[t=6.81s] [Phase 30 format] 4 ms
[t=6.82s] [Phase 34 cost] €0.00041
[t=6.83s] <<< 200 OK
Narration is what turns a working system into a teaching artifact. The Plan §1 #4 (iv) says the script must be narrated — this is what that means.
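A narration helper that produces lines in this shape could look like the sketch below. The Narrator class is hypothetical; the injectable clock is an assumption made so the elapsed-time prefix can be tested without sleeping.

```python
import time
from typing import Callable


class Narrator:
    """Prints '[t=<elapsed>s] [Phase NN] message' lines for the viewer."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()  # elapsed time is measured from construction

    def format_line(self, phase: int, message: str) -> str:
        elapsed = self._clock() - self._start
        return f"[t={elapsed:.1f}s] [Phase {phase}] {message}"

    def say(self, phase: int, message: str) -> None:
        # flush=True so the line appears immediately, even when piped to CI logs.
        print(self.format_line(phase, message), flush=True)
```

Using a monotonic clock rather than wall-clock time keeps the elapsed prefix from jumping if the system clock is adjusted mid-run.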
Property 4: Loud failure
The anti-pattern (Plan §5 #3 — "Demo theater"):
try:
    response = client.post(...)
    assert response.status_code == 200
except Exception:
    print("Skipping this step")  # WRONG
A demo that hides errors with try/except: pass is not a demo. The script must:
- On failure, print which phase / which contract broke.
- Print the offending values (truncated to keep terminal output readable).
- Exit non-zero with a clear message:
"FAIL: [Phase 33] miniserve did not respond on :8080 within 10s; check 'docker compose logs miniserve'".
The Plan's verification: deliberately break one component (point MLFLOW_TRACKING_URI at a wrong port), run the demo, confirm the failure message names the component and the remediation step. The lab walks this verification.
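By contrast, a loud failure path might be structured like this. It is a sketch under assumptions: DemoFailure, truncate, and run_step_or_exit are hypothetical names, and the message layout simply mirrors the example FAIL line above.

```python
import sys
from typing import Callable, Optional


def truncate(value: object, limit: int = 120) -> str:
    """Render a value for the terminal, cutting it off after `limit` characters."""
    text = repr(value)
    return text if len(text) <= limit else text[:limit] + "...(truncated)"


class DemoFailure(Exception):
    """A failure that names the phase, what broke, and how to investigate."""

    def __init__(self, phase: int, what_broke: str, remediation: str,
                 offending: Optional[object] = None) -> None:
        self.phase = phase
        self.what_broke = what_broke
        self.remediation = remediation
        self.offending = offending
        super().__init__(self.render())

    def render(self) -> str:
        message = f"FAIL: [Phase {self.phase}] {self.what_broke}; {self.remediation}"
        if self.offending is not None:
            message += f"\n  offending value: {truncate(self.offending)}"
        return message


def run_step_or_exit(step: Callable[[], None]) -> None:
    """Run one demo step; on DemoFailure print the message and exit non-zero."""
    try:
        step()
    except DemoFailure as failure:
        print(failure, file=sys.stderr)
        sys.exit(1)
    # Any other exception is deliberately not caught: it propagates with a
    # traceback, which is also a loud, non-zero exit.
```

The narrow except clause is the design choice that matters here: only failures the script knows how to explain are caught, and even those end the run with a non-zero status.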
Anatomy of scripts/demo/run.py
The script has seven blocks, in order:
def main():
    # Block 1: Preflight (Phase 39 self-checks)
    assert_environment_ready()  # uv, docker, ports free, lockfile clean

    # Block 2: Stack bring-up (delegate to `just demo-cold`)
    bring_up_stack(timeout_s=30)
    wait_for_health()  # /healthz on miniserve + Prometheus + Grafana

    # Block 3: Curriculum narration
    narrate_loaded_components()  # the [Phase NN] startup lines

    # Block 4: Happy-path request battery (3 sentences)
    for sentence in HAPPY_PATH_PAYLOADS:
        send_and_verify(sentence)

    # Block 5: Security run-through (3 replays from Theory 04)
    replay_injection()
    replay_oversized_body()
    replay_mcp_sandbox()

    # Block 6: Acceptance (binary)
    run_acceptance_checks()  # the DONE_ENOUGH.md ≤ 20 checks

    # Block 7: Wrap-up
    print_summary()
    emit_eval_report()  # writes experiments/39-end-to-end/eval-YYYY-MM-DD.json
Each block is one function call, or a small group of related calls, with a clear contract. Block 6 is the load-bearing acceptance gate: if any of the ≤ 20 binary checks fails, the script exits 1 with an enumerated failure list.
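The gate behaviour of Block 6 (run every check, enumerate the failures, exit 1 if there are any) can be sketched as below. Check, evaluate, and gate are hypothetical names chosen for illustration; the real run_acceptance_checks may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Check:
    check_id: str                # e.g. "DE-007"
    statement: str               # one-sentence claim being verified
    probe: Callable[[], bool]    # returns True on pass, False on fail


def evaluate(checks: Iterable[Check]) -> list[tuple[Check, bool]]:
    """Run every check; never stop at the first failure."""
    results = []
    for check in checks:
        try:
            passed = bool(check.probe())
        except Exception as exc:
            # A crashing probe counts as a failure and is reported, not hidden.
            print(f"{check.check_id} raised {exc!r}")
            passed = False
        results.append((check, passed))
    return results


def gate(results: list[tuple[Check, bool]]) -> int:
    """Print an enumerated failure list and return the process exit code."""
    failures = [check for check, passed in results if not passed]
    for number, check in enumerate(failures, start=1):
        print(f"{number}. {check.check_id} FAILED: {check.statement}")
    return 1 if failures else 0
```

A script using this sketch would finish the block with sys.exit(gate(evaluate(checks))) when the return value is non-zero. Running all checks before exiting means a single demo run reports every broken claim at once instead of one per attempt.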
Binary acceptance: docs/DONE_ENOUGH.md
The capstone DoD is operationalized as ≤ 20 binary checks. Each check has:
- A unique id (e.g., DE-007).
- A one-sentence statement.
- An automated check.
- A pass/fail visible in the demo's terminal output.
Sample rows:
| ID | Check | Automation |
|---|---|---|
| DE-001 | Stack starts within 30 s | Time docker compose up to healthy |
| DE-002 | miniserve responds on :8080 within 5 s of healthy | curl :8080/healthz |
| DE-003 | First request completes within 10 s end-to-end | timed assert in script |
| DE-004 | p95 latency over the 3-sentence battery is < 5 s | Prometheus query |
| DE-005 | All 9 stages emit a span in the trace | Tempo query |
| DE-006 | Cost panel populated within 60 s of first request | Grafana panel data query |
| DE-007 | Cost identity holds within 0.1 % on every request | per-request assertion |
| DE-008 | Injection payload returns 400 with injection_blocked | response assertion |
| DE-009 | 10 MB body returns 413 before prefill | response + log assertion |
| DE-010 | MCP sandbox subprocess has bounded CPU+RAM, exits ≤ 2 s | span attributes assertion |
| DE-011 | Trace orphan count = 0 over the demo run | Tempo query |
| DE-012 | All committed MANIFEST.json SHA256s match disk | Phase 37 lab 04 |
| DE-013 | Grafana dashboard imports clean | curl -X POST /api/dashboards/db returns 200 |
| DE-014 | eval-YYYY-MM-DD.json written with the date in filename | filesystem assertion |
| DE-015 | Demo exits with status 0 | echo $? |
(15 of ≤ 20 shown; the gap leaves some room for phase-execution adjustments.) Lab 00 fills in the remaining checks, up to 5.
The script ends with a table of all DE checks, pass/fail. The viewer's last visual is that table. That table is the curriculum's report card.
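A closing table of that kind could be rendered with something as small as the following sketch. The render_report_card helper and its (id, statement, passed) tuple input are assumptions; the column layout mirrors the sample rows rather than the script's actual output.

```python
def render_report_card(results: list[tuple[str, str, bool]]) -> str:
    """Render (id, statement, passed) rows as a pipe-delimited text table."""
    lines = ["| ID | Check | Result |", "|---|---|---|"]
    for check_id, statement, passed in results:
        verdict = "PASS" if passed else "FAIL"
        lines.append(f"| {check_id} | {statement} | {verdict} |")
    # A one-line tally under the table gives the viewer the overall verdict.
    passed_count = sum(1 for _, _, passed in results if passed)
    lines.append(f"{passed_count}/{len(results)} checks passed")
    return "\n".join(lines)
```

Printing the result of render_report_card as the last thing the script does keeps the report card on screen when the run ends.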
What this theory does NOT cover
- The full docs/DONE_ENOUGH.md content. Drafted in Lab 00.
- The full scripts/demo/run.py. Drafted in Lab 04.
- The asciinema recording mechanics. Lab 04.
- How to extend the demo with new payloads. Phase 40 reading-list.
- Multi-user demos / load-test demos. Lab 02 covers the load test, but it's a separate command, not the main demo.
End of Phase 39 theory chain. Next: the 5 lab statements (lab/00-cold-start-bringup.md through lab/04-demo-script.md).