Theory 01 - Architecture of the grammar tutor: a C4 walk-through
The grammar tutor's architecture is deliberately small: one process (miniserve), an observability stack behind it (Prometheus + Grafana + Tempo), a registry (MLflow + DVC) that lives apart and is consulted only at startup, and a sandbox (Phase 31 MCP) for audited tools. C4 gives us a four-level vocabulary (system, container, component, code) for looking at it without getting lost.
Why C4
There are many architecture-diagram notations. Most are either too informal ("draw boxes and arrows") or too formal (UML 2.x with a 200-page spec). C4 (by Simon Brown) sits in the right spot: four levels of zoom (System → Container → Component → Code), three notations (boxes, arrows, labels), one rule (every arrow is labeled with the technology and the data flowing across it). For a single-process demo like ours, two levels suffice: System Context and Container. We add a sequence diagram for one full request to make the dynamic story visible. Three diagrams. Total.
Level 1: System Context
At the System Context level, the grammar tutor is one box with arrows to/from external actors:
                           ┌──────────────────────────────┐
                           │                              │
┌────────────────┐  HTTP   │         lynx-cortex          │
│    Learner     │────────►│        grammar tutor         │
│    (Borja)     │◄────────│   (single Python process)    │
└────────────────┘  JSON   │                              │
                           │                              │
                           └──────┬───────────┬───────────┘
                                  │           │
                           MLflow │           │ DVC remote
                         tracking │           │ (local FS in demo;
                    server (HTTP) │           │  S3/GCS in prod)
                                  ▼           ▼
                      ┌────────────┐         ┌────────────┐
                      │ artifacts  │         │ corpus +   │
                      │ + run logs │         │ eval sets  │
                      └────────────┘         └────────────┘
The system has one human user (Borja, or the demo's stranger), two external systems (MLflow tracking, DVC remote — both technically internal in the local demo but logically external), and one input/output protocol (HTTP+JSON). Telemetry sinks (Prometheus, Tempo, optionally Langfuse) are at the Container level — they're inside the system boundary because the demo brings them up via docker-compose.
Render this in docs/phase-39-capstone/diagrams/c4-context.mmd using mermaid flowchart syntax.
Level 2: Container
At the Container level, "container" means a separately-running process. For the demo we have:
- miniserve (the grammar tutor itself): Python 3.11 process, FastAPI, single asyncio event loop. Hosts:
  - The HTTP handlers (/v1/grammar/correct, /v1/grammar/explain, /health).
  - The router from Phase 38 (src/miniserve/traffic.py), which decides which model arm serves the request.
  - The model bundle (Phase 28 LoRA grammar tutor, loaded once at startup).
  - The inference loop (Phase 33's Server class, with Phase 22's KV-cache).
  - The cost emitter (Phase 34, src/miniobserve/cost_emitter.py).
  - The observability spans (Phase 34, OpenTelemetry).
- Prometheus: scrapes miniserve's /metrics every 15s. Stores 7 days of metrics locally.
- Grafana: single dashboard at http://localhost:3000/d/capstone. Reads from Prometheus + Tempo.
- Tempo: receives the OTLP traces that miniserve's OpenTelemetry SDK emits, forwarded by the OTel-Collector. Stores traces by trace_id.
- MLflow tracking server: http://localhost:5000. Stores model artifacts under ./mlruns/. Consulted by miniserve at startup to fetch the promoted LoRA bundle.
- OTel-Collector: fans OTLP traces from miniserve out to Tempo. Optional: same path to Langfuse if enabled.
- (Optional) Langfuse: LLM-trace UX. Off by default in the demo for portability.
- MCP sandbox: not a long-running container. Spawned as a subprocess by miniserve only when an A13 tool call (conjugate, lookup_irregular_verb, lookup_spanish, check_subject_verb_agreement) is dispatched (Phase 31). Lives for the call's duration, then exits.
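The spawn-per-call lifecycle of the MCP sandbox can be sketched with the standard library alone. This is a minimal illustration, not the code in src/miniserve/mcp_client.py: the child script, the `call_tool` helper and the toy verb table are hypothetical, and the sketch sends a single JSON-RPC 2.0 message over stdio without the initialization handshake that a real MCP session performs.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the sandboxed tool server: it reads one JSON-RPC
# request from stdin, writes one response to stdout, then exits.
CHILD_SOURCE = r'''
import json, sys
IRREGULAR = {"go": "went", "eat": "ate"}  # toy data, not the real tool
request = json.loads(sys.stdin.readline())
verb = request["params"]["verb"]
response = {"jsonrpc": "2.0", "id": request["id"],
            "result": {"past": IRREGULAR.get(verb)}}
sys.stdout.write(json.dumps(response) + "\n")
'''


def call_tool(method: str, params: dict, timeout_s: float = 5.0) -> dict:
    """Spawn a short-lived child, send one JSON-RPC request, return its result."""
    request = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=json.dumps(request) + "\n",
        capture_output=True,
        text=True,
        timeout=timeout_s,  # a stuck child is killed and TimeoutExpired is raised
        check=True,         # a non-zero exit code raises CalledProcessError
    )
    response = json.loads(completed.stdout)
    if response.get("id") != request["id"]:
        raise RuntimeError("response id does not match request id")
    return response["result"]


if __name__ == "__main__":
    print(call_tool("lookup_irregular_verb", {"verb": "go"}))
```

Because `subprocess.run` waits for the child to exit, the process lives exactly as long as the call, which is the property the container list describes.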
The contracts at each container boundary:
| From | To | Protocol | Data | Schema location |
|---|---|---|---|---|
| Learner | miniserve | HTTP POST | {sentence: str, request_id?: str} | src/miniserve/schemas/correct.py |
| miniserve | Learner | HTTP 200 | {correction: str, per_token: [...], request_id: str, model_sha: str} | same |
| miniserve | Prometheus | scrape | Prometheus text format | src/miniobserve/metrics.py |
| miniserve | OTel-Collector | OTLP/gRPC | OTel spans | src/miniobserve/tracing.py |
| OTel-Collector | Tempo | OTLP | spans | (collector config) |
| miniserve | MLflow | HTTP (startup only) | artifact download | (mlflow client) |
| miniserve | MCP sandbox | subprocess + JSON-RPC stdio | tool call | src/miniserve/mcp_client.py |
| MCP sandbox | miniserve | subprocess stdio | tool result | same |
Render in docs/phase-39-capstone/diagrams/c4-container.mmd. Mermaid flowchart LR, one box per container, every arrow labeled [protocol] data-shape.
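One way to keep the rendered diagram and the contract table in step is to generate the Mermaid text from the table rows. The sketch below is an illustration under assumptions: the `Contract` dataclass, `node_id` and `to_mermaid` are hypothetical helpers (not part of the repository), and only three of the eight rows are reproduced, with the data column paraphrased. It refuses to emit an arrow whose protocol or data shape is empty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    source: str
    target: str
    protocol: str
    data: str


# A subset of the contract table, with the data column paraphrased.
CONTRACTS = [
    Contract("Learner", "miniserve", "HTTP POST", "sentence + request_id"),
    Contract("miniserve", "OTel-Collector", "OTLP/gRPC", "OTel spans"),
    Contract("OTel-Collector", "Tempo", "OTLP", "spans"),
]


def node_id(name: str) -> str:
    """Keep node ids alphanumeric; the display name goes in the quoted label."""
    return "".join(ch if ch.isalnum() else "_" for ch in name)


def to_mermaid(contracts: list[Contract]) -> str:
    """Build a `flowchart LR` with one labeled arrow per contract."""
    lines = ["flowchart LR"]
    for c in contracts:
        if not c.protocol.strip() or not c.data.strip():
            raise ValueError(f"unlabeled arrow: {c.source} -> {c.target}")
        label = f"[{c.protocol}] {c.data}"
        lines.append(
            f'    {node_id(c.source)}["{c.source}"] '
            f'-->|"{label}"| {node_id(c.target)}["{c.target}"]'
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid(CONTRACTS))
```

Generating the file this way makes the "every arrow labeled" rule something that fails loudly instead of something a reviewer has to spot by eye.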
Level 3: Components inside miniserve
miniserve is the only container with non-trivial internal structure. Its components (each a Python module or sub-package):
miniserve/
├── handlers.py # FastAPI routes
├── traffic.py # Phase 38: router (shadow/A/B/canary)
├── pipeline.py # tokenize → embed → forward → sample → detokenize
├── model_loader.py # MLflow client + canonical-SHA verification
├── mcp_client.py # Phase 31 tool dispatch
└── schemas/ # Pydantic request/response models
Each component reads its inputs from the previous one and emits to the next. The full pipeline component is the one that walks a single English sentence from arrival to response. We document it in detail in theory/02-end-to-end-data-flow.md.
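The "each stage feeds the next" shape of pipeline.py, together with the per-stage wall times that the cost emitter records, can be sketched as a plain fold over named stages. Everything here is a toy under stated assumptions: `run_pipeline`, the seven-word vocabulary and the lookup-based "model" are hypothetical stand-ins, not the BPE tokenizer or Mini-GPT from the earlier Phases.

```python
import time
from typing import Any, Callable

Stage = tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: list[Stage], payload: Any) -> tuple[Any, dict[str, float]]:
    """Feed payload through each stage in order, recording wall time per stage."""
    timings: dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        payload = fn(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


# Toy stand-ins for the real stages (hypothetical, for illustration only).
VOCAB = {"Yesterday": 0, "I": 1, "goed": 2, "went": 3, "to": 4, "the": 5, "store": 6}
INVERSE = {idx: word for word, idx in VOCAB.items()}


def tokenize(sentence: str) -> list[int]:
    return [VOCAB[word] for word in sentence.split()]


def forward_and_sample(ids: list[int]) -> list[int]:
    # A real model would run prefill and decoding here; this just swaps one id.
    return [VOCAB["went"] if i == VOCAB["goed"] else i for i in ids]


def detokenize(ids: list[int]) -> str:
    return " ".join(INVERSE[i] for i in ids)


STAGES: list[Stage] = [
    ("tokenize", tokenize),
    ("forward+sample", forward_and_sample),
    ("detokenize", detokenize),
]

if __name__ == "__main__":
    correction, timings = run_pipeline(STAGES, "Yesterday I goed to the store")
    print(correction)
    print(sorted(timings))
```

Keeping the timing in the driver rather than inside each stage means a stage cannot forget to report its cost.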
The sequence: one grammar-correction request
The third diagram (docs/phase-39-capstone/diagrams/sequence-request.mmd) is a mermaid sequenceDiagram showing exactly what happens for one request:
Learner → miniserve.HTTP : POST /v1/grammar/correct {"sentence": "Yesterday I goed to the store"}
miniserve.HTTP → security : rate-limit check (Phase 33), body-size check, injection-filter (Phase 37)
security → traffic : router.assign(request_id) → "production" arm
traffic → pipeline : pipeline.run(sentence, arm="production")
pipeline → BPE.encode : tokenize → [42, 17, 9001, ...]
pipeline → Embedding.forward : (batch=1, seq_len=8) → (1, 8, 128)
pipeline → Mini-GPT.forward : prefill, populate KV-cache (Phase 22)
pipeline → sampler.decode : structured generation (Phase 30) for the correction template
pipeline → BPE.decode : token ids → "Yesterday I went to the store"
pipeline → cost_emitter : record per-stage wall times → cost histogram
pipeline → tracing : emit OTel spans with model_sha, request_id, latency, cost
miniserve.HTTP → Learner : 200 OK {"correction": "Yesterday I went to the store", "per_token": [...], "model_sha": "abc123...", "request_id": "..."}
Every arrow corresponds to a Phase: Phase 33 (HTTP + rate-limit), Phase 37 (injection filter), Phase 38 (router), Phase 11 (BPE), Phase 13 (embedding), Phase 17 (Mini-GPT), Phase 22 (KV-cache), Phase 30 (structured gen), Phase 34 (cost + tracing). Together with the Phase 28 LoRA adapter, which is part of the model bundle the forward pass runs, that makes ten Phases visible in one sequence diagram.
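The `router.assign(request_id)` step in the sequence maps a request to a model arm. The source does not say how Phase 38 implements that mapping; a common approach, shown here purely as an assumption, is to hash the request id into a stable bucket so that retries of the same request land on the same arm. The `canary_percent` parameter and the free-standing `assign` function are hypothetical.

```python
import hashlib


def assign(request_id: str, canary_percent: int = 0) -> str:
    """Map a request id to an arm, deterministically.

    Assumption: two arms named "production" and "canary"; the real router
    also supports shadow and A/B modes, which this sketch leaves out.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    return "canary" if bucket < canary_percent else "production"


if __name__ == "__main__":
    # With no canary traffic configured, every request goes to production.
    print(assign("req-0001"))
```

Hashing rather than drawing a random number keeps the assignment reproducible, so a trace can be replayed offline and attributed to the same arm.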
The ten contributing Phases (and the four read-only ones)
The demo path actively executes code from ten Phases (33, 37, 38, 11, 13, 17, 22, 30, 34, 28-LoRA-adapter). Four more Phases contribute configuration or data without their code running on the request path:
- Phase 12 (corpus design): the verb corpus is pulled by dvc pull at demo start; its hash is verified against manifest.json. The corpus itself is not on the request path (training-only artifact).
- Phase 18 (training loop): the FP32 base checkpoint that the LoRA adapter sits on top of was produced here. Loaded into memory at startup; weights are then frozen.
- Phase 20 (eval harness): runs after the demo's request flow, generating experiments/39-end-to-end/eval-YYYY-MM-DD.json. Not on the request path.
- Phase 26 (INT8 quantization): optional dequantized weights loader path. Off by default; the demo runs FP32 + LoRA.
The remaining 26 Phases contributed foundations (numerical representation, linear algebra, calculus, BPE training, etc.) whose results are baked into the artifacts; their code is not directly invoked in the demo. The mapping table in PHASE_39_REPORT.md makes this fully explicit.
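The hash check mentioned for the Phase 12 corpus could look roughly like the sketch below. The manifest layout is an assumption (a top-level `files` object mapping relative paths to SHA-256 hex digests); the source only says a hash is verified against manifest.json, not which algorithm or schema is used. The `sha256_of` and `verify_manifest` names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large corpus is never fully held in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path) -> list[str]:
    """Return the relative paths that are missing or whose hash differs.

    Assumed layout: {"files": {"<relative path>": "<sha256 hex>"}}.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    root = manifest_path.parent
    mismatches = []
    for rel_path, expected in manifest["files"].items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(rel_path)
    return mismatches
```

An empty return value means every listed file matches; anything else is a reason to stop the demo before the first request is served.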
What contracts the architecture enforces
- No new src module. The container list above does not introduce one. Every component is in an existing src/<module>/.
- No fan-out. A single miniserve process serves the demo. No worker pool, no Redis, no Kafka. The architecture diagram has fewer boxes than typical production. That is the point: complexity is earned, not inherited.
- Telemetry is one-way. miniserve emits to Prometheus/OTel; it does not consult them at request time. (Comparing live latency to a baseline is offline-only, in Phase 38's drift analysis. The demo does not change behavior based on telemetry.)
- MLflow is read-only after startup. The bundle is loaded once. The demo does not hot-swap models. (Hot-swap is a Phase 40 reading-list item.)
- MCP is opt-in per request. A grammar-correction request can trigger the audit tool (e.g., if the request includes an explicit audit=true flag). The demo demonstrates this exactly once.
Pitfalls when drawing the diagrams
- Over-drawing. Resist showing every Python function. Containers are processes; components are major Python modules; the rest is code level (Level 4), which we skip.
- Arrows without labels. Every arrow has a protocol and a data shape. Unlabeled arrows are a hint that the contract isn't real.
- Bidirectional arrows. Use them sparingly. They obscure the direction of dependency. Prefer two unidirectional arrows.
- Static vs dynamic confusion. The Container diagram shows what's running. The Sequence diagram shows what happens for one request. Don't merge them; they answer different questions.
- Telemetry as a hub. It's tempting to draw Grafana as a central hub. It isn't: Grafana is a read-only viewer for the demo. The hub is miniserve (the only emitter on the request path).
- Missing the MCP sandbox. It only spawns sometimes, but it's a security boundary worth drawing explicitly. Show it dashed (lifecycle: spawns and dies per call).
A worked discrepancy: the contract between pipeline.py and tracing.py
The pipeline emits OTel spans. The tracing module promises to attach trace_id + span_id to the structured log line. The contract: every log line emitted on the request path includes the correct trace_id. If not, the Grafana "Logs for this trace" panel won't populate; the demo will visibly fail.
The audit: lab 01 runs the demo with trace_id logging enabled and verifies via grep that every log line from a single request shares the same trace_id. If even one log line is missing the field, the contract is broken and the diagram lied. This is what we mean by "architecture diagrams are tested": the audit catches diagrams that overstate reality.
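The grep-based audit can also be expressed as a small script, which is easier to extend than a shell one-liner. This is a sketch under assumptions: it supposes the structured log is one JSON object per line with `request_id` and `trace_id` fields, which the source implies but does not spell out, and the `audit_trace_ids` helper is hypothetical rather than part of lab 01.

```python
import json
from collections import defaultdict


def audit_trace_ids(log_lines: list[str]) -> dict[str, str]:
    """Return {request_id: problem} for every request that breaks the contract.

    The contract: all log lines of one request carry the same, non-empty
    trace_id. Lines without a request_id are treated as off the request path.
    """
    seen: dict[str, set[str]] = defaultdict(set)
    problems: dict[str, str] = {}
    for line in log_lines:
        record = json.loads(line)
        request_id = record.get("request_id")
        if request_id is None:
            continue
        trace_id = record.get("trace_id")
        if not trace_id:
            problems[request_id] = "log line without trace_id"
        else:
            seen[request_id].add(trace_id)
    for request_id, trace_ids in seen.items():
        if len(trace_ids) > 1 and request_id not in problems:
            problems[request_id] = "more than one trace_id"
    return problems


if __name__ == "__main__":
    sample = [
        '{"request_id": "r1", "trace_id": "t1", "msg": "tokenize"}',
        '{"request_id": "r1", "trace_id": "t1", "msg": "forward"}',
        '{"request_id": "r2", "msg": "forward"}',
    ]
    print(audit_trace_ids(sample))
```

An empty dictionary means the diagram's claim held for every request in the log.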
What this theory does NOT cover
- Why FastAPI specifically. Done in Phase 33.
- Why OpenTelemetry specifically. Done in Phase 34 theory.
- Why MCP specifically. Done in Phase 31 theory.
- The training of the LoRA adapter. Done in Phase 28.
- Multi-region or HA architecture. Out of scope. Phase 40 reading-list.
- Microservices vs monolith debate. The demo is a monolith; defending that choice is a Phase 40 reflection item.
Next: theory/02-end-to-end-data-flow.md — one request, every layer, with byte counts and the latency budget.