Theory 00 — Integration as a discipline; the closed DoD

Integrating is not writing glue code. It is verifying contracts. Every module in the repo declared its inputs and outputs in a BLUEPRINT.md; Phase 39 audits that every producer and every consumer agree. And "done enough" is not a feeling: it is a closed list of binary checks that a script reproduces from a cold start.

Why this theory matters

Of all the things software engineering tries to teach, two are routinely mis-taught: integration and definition of done. The first looks easy ("just call function B from function A") but produces 80% of production incidents. The second sounds rigorous ("our acceptance criteria") but is usually a list of aspirations dressed as checks. Phase 39 fixes both by being concrete.

We will define:

  • What an integration contract is, and why every cross-module call in lynx-cortex has one (whether the author knew it or not).
  • How to detect a broken contract before it bites in production (type checks, schema tests, fuzz tests, replay).
  • What a closed DoD checklist is — every row binary, every row automatable, no aspirations.
  • The "done enough" vocabulary: enough for what? For the demo. For 90 seconds. For one stranger to see the curriculum's spine.

Part 1 — Integration is contract enforcement

A module's contract has four parts

Pick any module in lynx-cortex — say src/minigpt/ (Phase 17). Its BLUEPRINT.md declares:

  1. Inputs. The shape of the tensor it consumes: (batch, seq_len) of int64 token ids in [0, |V|). Plus a config object.
  2. Outputs. The shape it produces: (batch, seq_len, |V|) of float32 logits.
  3. Invariants. The promises it makes about its behavior: "given the same inputs and the same seed, produces the same outputs (bit-identical)"; "after softmax along the last axis, the output sums to one" (softmax applied after the head, not before).
  4. Side effects. What it changes outside its return value: maybe nothing (pure function); maybe a Prometheus counter increment; maybe a log line. Undocumented side effects are bugs.

The downstream consumer (src/miniserve/handlers.py) has the mirror contract: it commits to providing (batch, seq_len) of int64, expects (batch, seq_len, |V|) of float32 back, depends on the bit-identical invariant for reproducibility tests, expects exactly the documented side effects.

Integration = those two contracts match. Not "the function call compiles". Not "the demo doesn't crash". The contracts match, every time, under every input we've thought to test.
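One way to make "the contracts match" checkable is to write both sides down as data and diff them. The sketch below is illustrative only: `TensorSpec`, `Contract`, and `mismatches` are hypothetical names, not the repo's BLUEPRINT.md format, and shapes are kept symbolic.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TensorSpec:
    """Symbolic shape plus dtype, e.g. ("batch", "seq_len", "vocab") of float32."""
    shape: tuple[str, ...]
    dtype: str


@dataclass(frozen=True)
class Contract:
    """The four parts of a module contract (hypothetical representation)."""
    inputs: TensorSpec
    outputs: TensorSpec
    invariants: frozenset = frozenset()
    side_effects: frozenset = frozenset()


def mismatches(producer: Contract, consumer_view: Contract) -> list[str]:
    """List every point where the consumer's view differs from the producer's declaration."""
    problems = []
    if producer.inputs != consumer_view.inputs:
        problems.append(f"inputs: {producer.inputs} != {consumer_view.inputs}")
    if producer.outputs != consumer_view.outputs:
        problems.append(f"outputs: {producer.outputs} != {consumer_view.outputs}")
    # Invariants the consumer relies on but the producer never promised.
    for name in sorted(consumer_view.invariants - producer.invariants):
        problems.append(f"invariant not promised: {name}")
    # Side effects the producer has but the consumer does not expect.
    for name in sorted(producer.side_effects - consumer_view.side_effects):
        problems.append(f"unexpected side effect: {name}")
    return problems


# Producer declaration and consumer view, using the shapes from the text above.
minigpt = Contract(
    inputs=TensorSpec(("batch", "seq_len"), "int64"),
    outputs=TensorSpec(("batch", "seq_len", "vocab"), "float32"),
    invariants=frozenset({"bit-identical"}),
)
handler_view = minigpt  # the mirror contract: same four parts

# Simulated drift: the producer switches its output to float16.
drifted = replace(minigpt, outputs=TensorSpec(("batch", "seq_len", "vocab"), "float16"))

print(mismatches(minigpt, handler_view))
print(mismatches(drifted, handler_view))
```

The function call would still "compile" after the float16 change; only the explicit comparison reports the drift.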

Where contracts break

Six failure modes, all observed in real systems (and in this repo's earlier phases when checks were missing):

  1. Type drift. Producer changes its output from float32 to float16 for a memory win; the consumer's downstream math underflows in float16 and produces NaNs on long-tail values. The unit tests still pass because the test fixtures contain no long-tail values.
  2. Schema drift. Producer adds a new key to a JSON output; consumer ignores it. Six weeks later, a different consumer reads the key and gets stale data because the original producer never populated it. (This is the corpus-manifest bug pattern from Phase 12.)
  3. Encoding drift. Producer writes UTF-8 with no BOM; consumer reads with default platform encoding (CP1252 on Windows). Special characters mangled. Tests pass on Linux only.
  4. Time drift. Producer emits timestamps in the producer's local timezone; consumer assumes UTC. Off by 1, 2, or 8 hours, depending on the user.
  5. Concurrency drift. Producer was single-threaded; consumer assumed it could safely call it from multiple coroutines. Producer got "optimized" to use shared mutable state; now it corrupts under concurrency.
  6. Version drift. Producer pinned numpy==1.26.0 at the time of the contract; consumer's environment ended up with numpy==2.1.0 due to lockfile churn; subtle API changes broke the contract silently.

The Phase 39 audit walks all six failure modes against every cross-module call in the demo path.
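Three of these drifts can be reproduced in a few lines of standard-library Python. This is a minimal sketch with made-up values (the probability, the string, and the timestamp are illustrative, not taken from the repo); `struct`'s half-precision format stands in for a real float16 tensor cast.

```python
import struct
from datetime import datetime, timedelta, timezone


def to_float16(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_float32(x: float) -> float:
    """Round-trip a value through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Type drift: a long-tail probability survives float32 but underflows to
# zero in float16. Any later log or division on it misbehaves.
tail_prob = 1e-8
as_float32 = to_float32(tail_prob)
as_float16 = to_float16(tail_prob)

# Encoding drift: UTF-8 bytes decoded with the Windows default code page.
written = "café".encode("utf-8")
read_back = written.decode("cp1252")

# Time drift: the producer is at UTC-8 and drops the offset on the wire;
# the consumer parses the naive timestamp and assumes UTC.
producer_tz = timezone(timedelta(hours=-8))
event = datetime(2024, 1, 1, 12, 0, tzinfo=producer_tz)
wire = event.replace(tzinfo=None).isoformat()
assumed = datetime.fromisoformat(wire).replace(tzinfo=timezone.utc)
skew = event - assumed

print(as_float32 > 0.0, as_float16)
print(read_back)
print(skew)
```

None of these raises an exception, which is why each one needs an explicit check rather than a smoke test.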

Detecting broken contracts: four techniques

  1. Type checks. mypy --strict on src/ from Phase 0. Catches signature drift but not value-level invariants.
  2. Schema tests. Each JSON-emitting module has a tests/schema/ directory with a JSON Schema + at least 5 example payloads. Consumer-side reads include a jsonschema.validate(payload, schema). Catches schema drift.
  3. Property tests. hypothesis (introduced in Phase 8 for tensor autograd). Generates random inputs that satisfy preconditions, asserts the invariant. Catches concurrency + numerical drift.
  4. Replay tests. A fixed corpus of past inputs (one per known failure mode) re-run on every PR. Catches regressions on already-discovered bugs.
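As an illustration of the fourth technique, a replay test can be as small as a recorded list of inputs with output hashes, re-run against the current code. The sketch below assumes a hypothetical module under test (`normalize`) and an in-memory golden list; in a real repo the golden data would presumably be a committed file.

```python
import hashlib
import json


def normalize(text: str) -> str:
    """Hypothetical module under test: collapse whitespace, lowercase."""
    return " ".join(text.split()).lower()


def digest(value) -> str:
    """Stable SHA-256 of a JSON-serializable value."""
    blob = json.dumps(value, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def record(fn, inputs):
    """Build the golden list once, when the behavior is known to be correct."""
    return [{"input": x, "sha256": digest(fn(x))} for x in inputs]


def replay(fn, golden):
    """Return the inputs whose output hash no longer matches the golden one."""
    return [row["input"] for row in golden if digest(fn(row["input"])) != row["sha256"]]


# One input per previously discovered bug (illustrative values).
corpus = ["Hello  World", "ok", "Tabs\tand\nnewlines"]
golden = record(normalize, corpus)


def regressed(text: str) -> str:
    """A later 'simplification' that silently drops whitespace collapsing."""
    return text.lower()


print(replay(normalize, golden))
print(replay(regressed, golden))
```

Hashing the output rather than storing it keeps the golden file small; the trade-off is that a failure reports only which input regressed, not how.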

For the capstone, the integration test in lab/01 runs all four against the demo path. That is the meaning of "the capstone is composition, not new code" — the new artifact is the integration test, not a new module.

Part 2 — "Done enough", operationalized

The two failure modes of "done"

  1. Aspirational DoD. "The system should be robust and easy to maintain." Untestable. Will be claimed-as-done by people with deadlines.
  2. Tautological DoD. "The tests pass." True but useless — the tests can be written to pass trivially.

A real DoD must be (a) binary (passes or fails, no judgment call), (b) automated (a script answers the question, not a human), (c) reproducible (the answer is the same on a fresh machine), (d) closed (a finite list).

The capstone's 8 binary checks

The DoD list is in PHASE_39_PLAN.md §7 and committed at phase open as docs/DONE_ENOUGH.md. Rows 1 to 7 each have an automated check; row 8 is the one documented exception. Re-stated here for emphasis:

| # | Check | How automated |
|---|-------|---------------|
| 1 | Fresh-host just demo succeeds in ≤ 3 minutes | .github/workflows/demo-smoke.yml runs on every PR |
| 2 | just demo-cold brings the full stack up, runs, tears down, exits 0 | Same CI workflow |
| 3 | CI is green on 5 consecutive PRs | Visible in GitHub Actions UI |
| 4 | Every Grafana panel populates within 60s of first request | lab/01 has a panel-coverage script |
| 5 | ≥ 3 security/THREATS.md rows annotated Phase 39 demo: verified | Grep security/THREATS.md for the annotation |
| 6 | docs/ARCHITECTURE.md renders in mkdocs without manual fixups | just docs exits 0 |
| 7 | Every row of docs/DONE_ENOUGH.md passes on a fresh run | The list checks itself |
| 8 | PHASE_39_REPORT.md committed with per-phase mapping table filled | Visual inspection (this one is not automated; document why) |

Eight rows, not twenty. Smaller is stronger. A 50-row DoD has 50 places to lie to yourself. An 8-row DoD in which every single row passes is a real signal.
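A closed, binary checklist maps naturally onto a small runner: a finite list of callables that each return True or False, and an exit code that is 0 only when all of them pass. The sketch below is an assumption about how such a script could look, not the repo's actual tooling; `Check`, `annotated_rows`, `evaluate`, and the sample THREATS text are all hypothetical.

```python
from typing import Callable, NamedTuple


class Check(NamedTuple):
    row: int
    name: str
    run: Callable[[], bool]  # binary: there is no "mostly passes"


def annotated_rows(threats_md: str, marker: str = "Phase 39 demo: verified") -> int:
    """Count lines carrying the annotation, as a grep would."""
    return sum(marker in line for line in threats_md.splitlines())


def evaluate(checks: list[Check]) -> tuple[int, list[str]]:
    """Run every check; return (exit_code, labels of failing rows)."""
    failed = []
    for check in checks:
        try:
            ok = check.run() is True
        except Exception:
            ok = False  # a crashing check counts as a failing check
        if not ok:
            failed.append(f"{check.row}: {check.name}")
    return (0 if not failed else 1), failed


# Made-up stand-in for security/THREATS.md, kept in memory for the example.
SAMPLE_THREATS = "\n".join([
    "| T1 | example threat | Phase 39 demo: verified |",
    "| T2 | example threat | Phase 39 demo: verified |",
    "| T3 | example threat | not exercised |",
    "| T4 | example threat | Phase 39 demo: verified |",
])

checks = [
    Check(5, ">= 3 THREATS.md rows annotated", lambda: annotated_rows(SAMPLE_THREATS) >= 3),
]

exit_code, failed = evaluate(checks)
print(exit_code, failed)
```

Treating an exception as a failure keeps the result binary: a check that cannot run on a fresh machine has not passed.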

What "done enough" means in scope

"Done enough" is context-dependent. For Phase 39 it means:

  • Done enough for a stranger to see the system work end-to-end. Not done enough to take 1,000 RPS. Not done enough to be a paid product. Not done enough to survive a year of neglect.
  • Done enough for the curriculum to be teachable. A visitor reading docs/README.md + watching the demo can answer "what does Phase 23 contribute here?" The mapping table forces this.
  • Done enough to defend against the three most-pedagogical attacks. Not all 12+ rows of the threat model — those are Phase 40's audit. Three rows, demonstrated live.

Part 3 — The two pitfalls

Pitfall 1: gold-plating

The capstone is the place every engineer wants to refactor "while they're here". Resist. The DoD is binary; refactoring is unbounded. Every refactor not strictly required by a DoD row is a Phase 40 candidate. Log it in PHASE_39_REPORT.md § carry-overs and move on. Borja's user-supplied rule (CLAUDE.md §0) — "microscopic scope" — applies most strictly here.

Pitfall 2: skipping the mapping table

The per-phase mapping table in PHASE_39_REPORT.md is the only artifact that makes the curriculum visible at the capstone. Without it, a reader sees a working demo and a long src/ tree and has no map between them. With it, every Phase from 0 to 38 has a line saying "the demo touches this in file X at step Y". That table is the curriculum's product. Skipping it converts 40 phases of structured learning into "I built a chatbot, I think".

Worked example: the "embedding lookup" contract

Take one specific call in the demo path: Embedding.forward(token_ids) from Phase 13, invoked by Mini-GPT.forward from Phase 17.

  • Producer contract (src/minigpt/embedding.py's BLUEPRINT): input (batch, seq_len) of int64 in [0, |V|); output (batch, seq_len, d_model) of float32; pure function; no side effects; bit-identical given the same parameters and inputs.
  • Consumer contract (Mini-GPT.forward's BLUEPRINT): provides int64 token ids from the BPE tokenizer; expects float32 of shape (batch, seq_len, d_model) back (d_model=128 for the grammar tutor); depends on bit-identical output for the eval harness.
  • Audit: mypy verifies the types. A property test (tests/integration/test_embedding_contract.py) draws random (batch ∈ [1, 8], seq_len ∈ [1, 64]) of valid token ids, calls Embedding.forward, checks output shape and dtype, calls twice and checks bit-identity. A replay test feeds the 200 Phase 20 eval sentences and verifies the embedding output hashes match the committed hash.
  • What changed in Phase 39: the schema test was added (not present in Phase 17). That's the capstone's job: lift implicit contracts to explicit ones.

The same pattern applies to every cross-module call in the demo. Phase 39 lab 01 walks all 12 contracts in the demo path.
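The audit step of the worked example can be approximated without any third-party library. The sketch below is a hand-rolled stand-in: it uses `random` instead of hypothesis, and a toy lookup table built from `array("f")` instead of the repo's Embedding class. The names `make_table`, `embed`, `output_hash`, and `check_contract`, and the toy sizes, are assumptions for illustration.

```python
import hashlib
import random
from array import array

VOCAB, D_MODEL = 50, 8  # toy sizes; the grammar tutor in the text uses d_model=128


def make_table(seed: int) -> list:
    """Deterministic toy embedding table: VOCAB rows of D_MODEL single-precision floats."""
    rng = random.Random(seed)
    return [array("f", (rng.uniform(-1, 1) for _ in range(D_MODEL))) for _ in range(VOCAB)]


def embed(table, token_ids):
    """Pure lookup: (batch, seq_len) ints -> (batch, seq_len, d_model) rows."""
    return [[table[t] for t in row] for row in token_ids]


def output_hash(out) -> str:
    """Hash the raw bytes of the output so that equality means bit-identity."""
    h = hashlib.sha256()
    for row in out:
        for vec in row:
            h.update(vec.tobytes())
    return h.hexdigest()


def check_contract(table, trials: int = 100, seed: int = 0) -> int:
    """Draw random valid inputs and assert shape, dtype, and bit-identity."""
    rng = random.Random(seed)
    for _ in range(trials):
        batch, seq_len = rng.randint(1, 8), rng.randint(1, 64)
        ids = [[rng.randrange(VOCAB) for _ in range(seq_len)] for _ in range(batch)]
        out = embed(table, ids)
        assert len(out) == batch and all(len(r) == seq_len for r in out)
        assert all(v.typecode == "f" and len(v) == D_MODEL for r in out for v in r)
        assert output_hash(out) == output_hash(embed(table, ids))
    return trials


table = make_table(seed=0)
print(check_contract(table))
```

A replay variant would store `output_hash` for a fixed set of sentences and compare on every run; hypothesis would add input shrinking on failure, which this loop does not attempt.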

What this theory does NOT cover

  • How to write each module's BLUEPRINT.md. Done in the originating phase (e.g., Phase 17 wrote Mini-GPT's blueprint).
  • The math of any single module. Already derived in the originating phase's theory.
  • Distributed-systems contracts. The demo is single-process; multi-process contracts (sandbox subprocess, MLflow over HTTP) are mentioned but not derived here. See theory/04-security-and-threat-model-closeout.md for the sandbox boundary specifically.
  • CI configuration in detail. Covered in lab/00-cold-start-bringup.md and lab/04-demo-script.md.

Next: theory/01-architecture-of-the-tutor.md — the C4 model walk-through, identifying every container in the grammar-tutor system and the contracts at each boundary.