
Theory 05 — The demo script and binary acceptance

The demo is not a presentation; it is a contract. just demo runs cold, on a fresh machine, with no tricks, and the visitor sees 90 seconds of the curriculum working. The mandatory properties: idempotent, deterministic, narrated, and failing loudly when something goes wrong instead of hiding it behind try/except: pass.

Why the demo script is load-bearing

Every curriculum produces some demo. Most are not load-bearing in the engineering sense — they're slides + a video. The capstone of lynx-cortex insists on a stronger property: the demo is the canonical verification that the repo's claims are true. Three corollaries:

If just demo exits non-zero, a curriculum claim is broken. Not "we'll look into it"; the report is blocked until the demo passes.
The script is the entrypoint for new contributors. A stranger reading the repo can run just demo first, then read the code with the demo's output as their map. The reverse order (code first, then demo) doesn't work: there is too much code to read without a map.
The CI runs it on every PR. Regressions are caught at PR-time. The demo is the integration test.

This chapter derives the four properties a load-bearing demo must have, then walks the anatomy of scripts/demo/run.py.

Property 1 — Idempotent

A demo that works the first time and fails the second is not idempotent. Causes:

Database/state survives between runs and the second run uses stale data.
Network ports left open by a crashed previous run.
mlruns/ directory committed from a personal session.
A docker volume that grew between runs.

The capstone's just demo-cold recipe:

docker compose down -v --remove-orphans before anything else.
Removes any local experiments/39-*-tmp/ directories.
Brings the stack up fresh.
Runs the demo.
Tears the stack down (optional flag to keep up for inspection).

Each step is logged with a timestamp. The CI smoke test runs the recipe five times in a row and asserts all five pass — that's the operational definition of idempotent.
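The five-in-a-row check can be sketched as a small harness. This is a minimal sketch, not the capstone's CI code: `run_repeatedly` and `assert_idempotent` are hypothetical names, and the command is assumed to be whatever invokes the recipe (for example `["just", "demo-cold"]`).

```python
import subprocess
import sys
from typing import Sequence


def run_repeatedly(cmd: Sequence[str], runs: int = 5) -> list[int]:
    """Run `cmd` `runs` times back to back and collect the exit codes."""
    codes = []
    for i in range(1, runs + 1):
        # check=False: we want to record every exit code, not stop at the first failure.
        result = subprocess.run(list(cmd), check=False)
        print(f"run {i}/{runs}: exit {result.returncode}")
        codes.append(result.returncode)
    return codes


def assert_idempotent(cmd: Sequence[str], runs: int = 5) -> None:
    """Exit non-zero, naming the failing runs, if any consecutive run fails."""
    codes = run_repeatedly(cmd, runs)
    failed = [i for i, code in enumerate(codes, start=1) if code != 0]
    if failed:
        sys.exit(f"FAIL: not idempotent; runs {failed} exited non-zero")
```

Calling `assert_idempotent(["just", "demo-cold"])` would then express the operational definition directly: five consecutive runs, all exiting 0.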

Property 2 — Deterministic

The same seed produces the same output. For the grammar tutor:

The model is loaded with torch.manual_seed(42).
The RAG bi-encoder breaks score ties by sorting on chunk_id, so the retrieved ranking is deterministic given the seed.
The model's decoder uses greedy decoding (argmax at every step, no top-k sampling) for the demo path; the random sampler is exercised in a separate demo step that explicitly says "sampling is non-deterministic by design."
The demo's request payloads are committed in scripts/demo/payloads/; no random selection.
The wall-clock timestamps differ run-to-run, but the correctness checks are stable.
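The chunk_id tie-break is the easiest of these to show in isolation. The following is a minimal sketch under stated assumptions: `Hit` and `rank` are hypothetical names, and the real retriever's data structures are not shown in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    score: float


def rank(hits: list[Hit], k: int = 5) -> list[Hit]:
    """Highest score first; equal scores fall back to ascending chunk_id."""
    # Without the second key, tied hits would keep their input order,
    # which can differ between runs if the index is built in a different order.
    return sorted(hits, key=lambda h: (-h.score, h.chunk_id))[:k]
```

With the composite key, two hits with the same score always come out in the same order, regardless of the order in which the index returned them.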

The CI smoke test diffs the content of the response (response.correction, response.explanation, response.spanish_translation) across five consecutive runs — bit-identical required.

What "deterministic" deliberately does NOT mean

It does not mean latencies are bit-identical. CPU contention varies.
It does not mean traces have the same span IDs (span IDs are generated from time or random sources).
It does not mean Prometheus counters are identical (in interactive just demo use they accumulate across runs; demo-cold resets them).

The distinction: content is deterministic; metadata is not. The demo's assertion script knows the difference.
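One way to encode the content/metadata split is to fingerprint only the content fields and compare fingerprints across runs. This is a sketch, not the actual assertion script: it assumes each response is available as a dict keyed by the three field names mentioned above, and `content_fingerprint` and `assert_same_content` are hypothetical helpers.

```python
import hashlib
import json

# The three content fields the smoke test compares; everything else is metadata.
CONTENT_FIELDS = ("correction", "explanation", "spanish_translation")


def content_fingerprint(response: dict) -> str:
    """Hash only the content fields, ignoring latency, span IDs, and the like."""
    content = {field: response[field] for field in CONTENT_FIELDS}
    canonical = json.dumps(content, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_same_content(responses: list[dict]) -> None:
    """Raise if the content fields differ between any two runs."""
    fingerprints = {content_fingerprint(r) for r in responses}
    if len(fingerprints) != 1:
        raise AssertionError(
            f"content differs across {len(responses)} runs: "
            f"{len(fingerprints)} distinct fingerprints"
        )
```

A missing content field raises `KeyError` rather than being skipped, which keeps the check consistent with the loud-failure property.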

Property 3 — Narrated

Every step prints a line. Two kinds of lines:

Curriculum lines (so the viewer sees the spine):

[t=0.5s] [Phase 12] Loading verb corpus from DVC...
[t=1.1s] [Phase 11] BPE tokenizer ready (vocab=2048).
[t=2.3s] [Phase 28] Loading Mini-GPT base + LoRA grammar adapter (rev=sha:a1b2c3).
[t=2.4s] [Phase 22] KV-cache pre-allocated for 128 tokens.
[t=2.5s] [Phase 33] miniserve started on :8080.
[t=2.6s] [Phase 34] Cost emitter and OTel exporter armed.

Action lines (so the viewer sees what's happening):

[t=3.1s] >>> POST /v1/grammar/correct  {"sentence": "Yesterday I goed to the store"}
[t=3.4s]     [Phase 11 tokenize]   12 ms  → 10 tokens
[t=3.5s]     [Phase 29 retrieve]   28 ms  → 5 chunks (top: irregular-go-past)
[t=4.6s]     [Phase 17 prefill]    1100 ms
[t=6.8s]     [Phase 17 decode]     2200 ms  → "Yesterday I **went** to the store."
[t=6.81s]   [Phase 30 format]      4 ms
[t=6.82s]   [Phase 34 cost]        €0.00041
[t=6.83s] <<< 200 OK

Narration is what turns a working system into a teaching artifact. The Plan §1 #4 (iv) says the script must be narrated — this is what that means.
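A narrator that produces lines in the format shown above could look roughly like this. It is a minimal sketch: the `Narrator` class and its method names are hypothetical, and the clock is injectable only so the output can be tested.

```python
import time
from typing import Callable


class Narrator:
    """Prints timestamped demo lines relative to the moment it was created."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()

    def _stamp(self) -> str:
        return f"[t={self._clock() - self._start:.1f}s]"

    def phase(self, phase: int, message: str) -> str:
        """Curriculum line: which phase's component just came up."""
        line = f"{self._stamp()} [Phase {phase}] {message}"
        print(line, flush=True)
        return line

    def request(self, method: str, path: str, body: str) -> str:
        """Action line: the request the demo is about to send."""
        line = f"{self._stamp()} >>> {method} {path}  {body}"
        print(line, flush=True)
        return line
```

Flushing on every line matters for a recorded demo: the viewer should see each step as it happens, not in a burst when the buffer fills.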

Property 4 — Loud failure

The anti-pattern (Plan §5 #3 — "Demo theater"):

try:
    response = client.post(...)
    assert response.status_code == 200
except Exception:
    print("Skipping this step")   # WRONG

A demo that hides errors with try/except: pass is not a demo. The script must:

On failure, print which phase / which contract broke.
Print the offending values (truncated to keep terminal output readable).
Exit non-zero with a clear message: "FAIL: [Phase 33] miniserve did not respond on :8080 within 10s; check 'docker compose logs miniserve'".

The Plan's verification: deliberately break one component (point MLFLOW_TRACKING_URI at a wrong port), run the demo, and confirm that the failure message names both the component and the remediation step. The lab walks through this verification.
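The three requirements (name the phase, show truncated values, exit non-zero) fit in one small exception type. This is a sketch under stated assumptions: `DemoFailure` and `run_or_die` are hypothetical names, and the 200-character truncation limit is an arbitrary choice.

```python
import sys
from typing import Callable


class DemoFailure(Exception):
    """A failure that names the phase, the broken contract, and the remediation."""

    def __init__(self, phase: int, what: str, remediation: str, values: object = None) -> None:
        self.phase = phase
        self.what = what
        self.remediation = remediation
        self.values = values
        super().__init__(self.render())

    def render(self, max_len: int = 200) -> str:
        line = f"FAIL: [Phase {self.phase}] {self.what}; {self.remediation}"
        if self.values is not None:
            shown = repr(self.values)
            if len(shown) > max_len:
                # Truncate so one bad payload cannot flood the terminal.
                shown = shown[:max_len] + "...(truncated)"
            line += f"\n      offending values: {shown}"
        return line


def run_or_die(step: Callable[[], None]) -> None:
    """Run one demo step; on DemoFailure print the message and exit 1."""
    try:
        step()
    except DemoFailure as failure:
        print(failure.render(), file=sys.stderr)
        sys.exit(1)
```

Only `DemoFailure` is caught here. Any other exception propagates with its traceback and still ends the process with a non-zero status, so nothing is silently skipped.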

Anatomy of `scripts/demo/run.py`

The script has seven blocks, in order:

def main():
    # Block 1 — Preflight (Phase 39 self-checks)
    assert_environment_ready()       # uv, docker, ports free, lockfile clean

    # Block 2 — Stack bring-up (delegate to `just demo-cold`)
    bring_up_stack(timeout_s=30)
    wait_for_health()                # /healthz on miniserve + Prometheus + Grafana

    # Block 3 — Curriculum narration
    narrate_loaded_components()      # the [Phase NN] startup lines

    # Block 4 — Happy-path request battery (3 sentences)
    for sentence in HAPPY_PATH_PAYLOADS:
        send_and_verify(sentence)

    # Block 5 — Security run-through (3 replays from Theory 04)
    replay_injection()
    replay_oversized_body()
    replay_mcp_sandbox()

    # Block 6 — Acceptance (binary)
    run_acceptance_checks()          # the DONE_ENOUGH.md ≤ 20 checks

    # Block 7 — Wrap-up
    print_summary()
    emit_eval_report()               # writes experiments/39-end-to-end/eval-YYYY-MM-DD.json

Each block is a single function with a clear contract. Block 6 is the load-bearing acceptance gate: if any of the ≤ 20 binary checks fails, the script exits 1 with an enumerated failure list.
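As an illustration of what "a clear contract" means for one block, Block 2's health wait can be written as a polling loop with a hard deadline. This is a sketch only: the signature below is hypothetical (the outline calls `wait_for_health()` with no arguments), and `probe` stands in for whatever performs the actual /healthz requests.

```python
import time
from typing import Callable


def wait_for_health(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `probe` until it returns True.

    Returns the elapsed seconds on success; raises TimeoutError once
    `timeout_s` has passed without a healthy probe.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"not healthy within {timeout_s:g}s")
        sleep(interval_s)
```

The contract is binary: the function either returns (healthy, with the measured time) or raises. There is no "probably up" outcome for later blocks to guess about.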

Binary acceptance: `docs/DONE_ENOUGH.md`

The capstone's definition of done (DoD) is operationalized as ≤ 20 binary checks. Each check has:

A unique id (e.g., DE-007).
A one-sentence statement.
An automated check.
A pass/fail visible in the demo's terminal output.

Sample rows:

| ID | Check | Automation |
| --- | --- | --- |
| DE-001 | Stack starts within 30 s | Time `docker compose up` to healthy |
| DE-002 | `miniserve` responds on :8080 within 5 s of healthy | `curl :8080/healthz` |
| DE-003 | First request completes within 10 s end-to-end | timed assert in script |
| DE-004 | p95 latency over the 3-sentence battery is < 5 s | Prometheus query |
| DE-005 | All 9 stages emit a span in the trace | Tempo query |
| DE-006 | Cost panel populated within 60 s of first request | Grafana panel data query |
| DE-007 | Cost identity holds within 0.1 % on every request | per-request assertion |
| DE-008 | Injection payload returns 400 with `injection_blocked` | response assertion |
| DE-009 | 10 MB body returns 413 before prefill | response + log assertion |
| DE-010 | MCP sandbox subprocess has bounded CPU+RAM, exits ≤ 2 s | span attributes assertion |
| DE-011 | Trace orphan count = 0 over the demo run | Tempo query |
| DE-012 | All committed `MANIFEST.json` SHA256s match disk | Phase 37 lab 04 |
| DE-013 | Grafana dashboard imports clean | `curl -X POST /api/dashboards/db` returns 200 |
| DE-014 | `eval-YYYY-MM-DD.json` written with the date in filename | filesystem assertion |
| DE-015 | Demo exits with status 0 | `echo $?` |

(15 of ≤ 20 shown; the gap leaves some room for adjustments during phase execution.) Lab 00 fills in the remaining checks, up to 5.

The script ends with a table of all DE checks, pass/fail. The viewer's last visual is that table. That table is the curriculum's report card.
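The acceptance gate itself reduces to a list of named checks, a loop, and an exit status. The following is a minimal sketch, not the Lab 04 implementation: `Check` and `run_checks` are hypothetical names, and each check is assumed to be a zero-argument callable returning True or False.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    id: str                   # e.g. "DE-007"
    statement: str            # the one-sentence statement
    run: Callable[[], bool]   # the automated check


def run_checks(checks: list[Check]) -> int:
    """Run every check, print the pass/fail table, return the exit status."""
    failed = []
    for check in checks:
        try:
            passed = bool(check.run())
        except Exception as exc:
            # A crashing check is reported and counted as FAIL, never skipped.
            print(f"{check.id}  error: {exc!r}")
            passed = False
        print(f"{check.id}  {'PASS' if passed else 'FAIL'}  {check.statement}")
        if not passed:
            failed.append(check.id)
    if failed:
        print(f"FAIL: {len(failed)} of {len(checks)} checks failed: {', '.join(failed)}")
        return 1
    print(f"PASS: all {len(checks)} checks passed")
    return 0
```

Every check runs even after an earlier one fails, so the final table is always complete; the caller would pass the return value to `sys.exit`.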

What this theory does NOT cover

The full docs/DONE_ENOUGH.md content. Drafted in Lab 00.
The full scripts/demo/run.py. Drafted in Lab 04.
The asciinema recording mechanics. Lab 04.
How to extend the demo with new payloads. Phase 40 reading-list.
Multi-user demos / load-test demos. Lab 02 covers the load test, but it's a separate command, not the main demo.

End of Phase 39 theory chain. Next: the 5 lab statements (lab/00-cold-start-bringup.md through lab/04-demo-script.md).