Theory 01 — Architecture of the grammar tutor: a C4 walk-through

The grammar tutor's architecture is deliberately small: one process (miniserve), an observability stack behind it (Prometheus + Grafana + Tempo), a registry (MLflow + DVC) that lives apart and is consulted only at startup, and a sandbox (Phase 31 MCP) for audited tools. C4 gives us a four-level vocabulary (system, container, component, code) for looking at it without getting lost.

Why C4

There are many architecture-diagram notations. Most are either too informal ("draw boxes and arrows") or too formal (UML 2.x with a 200-page spec). C4 (by Simon Brown) sits in the right spot: four levels of zoom (System → Container → Component → Code), three notations (boxes, arrows, labels), one rule (every arrow is labeled with the technology and the data flowing across it). For a single-process demo like ours, two levels of diagram suffice: System Context and Container. Level 3 appears below only as a module listing. We add a sequence diagram for one full request to make the dynamic story visible. Three diagrams in total.

Level 1 — System Context

At the System Context level, the grammar tutor is one box with arrows to/from external actors:

                              ┌──────────────────────────────┐
                              │                              │
   ┌────────────────┐  HTTP   │       lynx-cortex            │
   │  Learner       │────────►│   grammar tutor              │
   │  (Borja)       │◄────────│   (single Python process)    │
   └────────────────┘  JSON   │                              │
                              │                              │
                              └──────┬───────────┬───────────┘
                                     │           │
                              MLflow │           │ DVC remote
                              tracking           │ (local FS in demo;
                              server (HTTP)      │  S3/GCS in prod)
                                     ▼           ▼
                              ┌────────────┐  ┌────────────┐
                              │ artifacts  │  │ corpus +   │
                              │ + run logs │  │ eval sets  │
                              └────────────┘  └────────────┘

The system has one human user (Borja, or the demo's stranger), two external systems (the MLflow tracking server and the DVC remote, both physically local in the demo but logically external), and one input/output protocol (HTTP+JSON). Telemetry sinks (Prometheus, Tempo, optionally Langfuse) do not appear at this level. They show up at the Container level because they sit inside the system boundary: the demo brings them up via docker-compose.

Render this in docs/phase-39-capstone/diagrams/c4-context.mmd using mermaid flowchart syntax.

Level 2 — Container

At the Container level, "container" means a separately-running process. For the demo we have:

  1. miniserve (the grammar tutor itself): Python 3.11 process, FastAPI, single asyncio event loop. Hosts:
     • The HTTP handlers (/v1/grammar/correct, /v1/grammar/explain, /health).
     • The router from Phase 38 (src/miniserve/traffic.py), which decides which model arm serves the request.
     • The model bundle (Phase 28 LoRA grammar tutor, loaded once at startup).
     • The inference loop (Phase 33's Server class, with Phase 22's KV-cache).
     • The cost emitter (Phase 34, src/miniobserve/cost_emitter.py).
     • The observability spans (Phase 34, OpenTelemetry).
  2. Prometheus: scrapes miniserve's /metrics every 15s. Stores 7 days of metrics locally.
  3. Grafana: single dashboard at http://localhost:3000/d/capstone. Reads from Prometheus + Tempo.
  4. Tempo: receives the OTLP traces emitted by miniserve's OpenTelemetry SDK, by way of the OTel-Collector. Stores traces by trace_id.
  5. MLflow tracking server: http://localhost:5000. Stores model artifacts under ./mlruns/. Consulted by miniserve at startup to fetch the promoted LoRA bundle.
  6. OTel-Collector: fans OTLP traces from miniserve out to Tempo. Optional: same path to Langfuse if enabled.
  7. (Optional) Langfuse: LLM-trace UX. Off by default in the demo for portability.
  8. MCP sandbox: not a long-running container. Spawned as a subprocess by miniserve only when an A13 tool call (conjugate, lookup_irregular_verb, lookup_spanish, check_subject_verb_agreement) is dispatched (Phase 31). Lives for the call's duration, then exits.
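The spawn-per-call lifecycle of the MCP sandbox can be sketched as follows. This is a minimal illustration, not the code in src/miniserve/mcp_client.py: the toy server, the call_tool helper, and the result shape are hypothetical, and the exchange is a single newline-delimited JSON-RPC message rather than the full MCP handshake.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the sandboxed tool server: it reads one JSON-RPC
# request from stdin, writes one response to stdout, and exits.
TOY_TOOL_SERVER = r"""
import json, sys
IRREGULAR = {"go": "went", "eat": "ate"}
req = json.loads(sys.stdin.readline())
verb = req["params"]["verb"]
resp = {"jsonrpc": "2.0", "id": req["id"], "result": {"past": IRREGULAR.get(verb)}}
sys.stdout.write(json.dumps(resp) + "\n")
"""


def call_tool(method: str, params: dict, timeout_s: float = 5.0) -> dict:
    """Spawn a fresh subprocess for one tool call and return its result."""
    request = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    # subprocess.run returns only after the child has exited, so the sandbox
    # lives exactly as long as the call.
    proc = subprocess.run(
        [sys.executable, "-c", TOY_TOOL_SERVER],
        input=json.dumps(request) + "\n",
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    response = json.loads(proc.stdout)
    if response.get("id") != request["id"]:
        raise RuntimeError("response id does not match request id")
    return response["result"]


if __name__ == "__main__":
    print(call_tool("lookup_irregular_verb", {"verb": "go"}))
```

Because nothing survives between calls, a misbehaving tool cannot accumulate state inside the serving process. The price is one process spawn per tool call.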

The contracts at each container boundary:

| From           | To             | Protocol                    | Data                                                                   | Schema location                  |
|----------------|----------------|-----------------------------|------------------------------------------------------------------------|----------------------------------|
| Learner        | miniserve      | HTTP POST                   | {sentence: str, request_id?: str}                                      | src/miniserve/schemas/correct.py |
| miniserve      | Learner        | HTTP 200                    | {correction: str, per_token: [...], request_id: str, model_sha: str}   | same                             |
| miniserve      | Prometheus     | scrape                      | Prometheus text format                                                 | src/miniobserve/metrics.py       |
| miniserve      | OTel-Collector | OTLP/gRPC                   | OTel spans                                                             | src/miniobserve/tracing.py       |
| OTel-Collector | Tempo          | OTLP                        | spans                                                                  | (collector config)               |
| miniserve      | MLflow         | HTTP (startup only)         | artifact download                                                      | (mlflow client)                  |
| miniserve      | MCP sandbox    | subprocess + JSON-RPC stdio | tool call                                                              | src/miniserve/mcp_client.py      |
| MCP sandbox    | miniserve      | subprocess stdio            | tool result                                                            | same                             |
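The first two rows of the table can be read as a pair of typed records. The sketch below uses stdlib dataclasses to stay dependency-free, whereas the real schemas in src/miniserve/schemas/ are Pydantic models. The class names, the helpers, and the choice to generate a request_id when the caller omits one are assumptions. The element shape of per_token is not specified here, so it is kept opaque.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class CorrectRequest:
    sentence: str
    request_id: Optional[str] = None  # optional in the contract

    def __post_init__(self) -> None:
        if not isinstance(self.sentence, str) or not self.sentence.strip():
            raise ValueError("sentence must be a non-empty string")


@dataclass(frozen=True)
class CorrectResponse:
    correction: str
    request_id: str
    model_sha: str
    per_token: list[Any] = field(default_factory=list)  # element shape left opaque


def parse_request(payload: dict) -> CorrectRequest:
    """Validate an incoming JSON body against the request contract."""
    if "sentence" not in payload:
        raise ValueError("missing required field: sentence")
    return CorrectRequest(payload["sentence"], payload.get("request_id"))


def build_response(req: CorrectRequest, correction: str, model_sha: str) -> CorrectResponse:
    """Echo the caller's request_id, or mint one (assumption) if it was omitted."""
    return CorrectResponse(
        correction=correction,
        request_id=req.request_id or uuid.uuid4().hex,
        model_sha=model_sha,
    )


if __name__ == "__main__":
    req = parse_request({"sentence": "Yesterday I goed to the store"})
    print(asdict(build_response(req, "Yesterday I went to the store", "abc123")))
```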

Render in docs/phase-39-capstone/diagrams/c4-container.mmd. Mermaid flowchart LR, one box per container, every arrow labeled [protocol] data-shape.
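One way to keep the "every arrow labeled" rule honest is to generate the mermaid text from the contract table instead of drawing it by hand. The sketch below is one possible approach, not the project's tooling: Edge, EDGES, node_id, and to_mermaid are hypothetical names, and the data column is abbreviated.

```python
import re
from typing import NamedTuple


class Edge(NamedTuple):
    src: str
    dst: str
    protocol: str
    data: str


# Rows copied from the contract table (data shapes abbreviated).
EDGES = [
    Edge("Learner", "miniserve", "HTTP POST", "sentence, request_id?"),
    Edge("miniserve", "Learner", "HTTP 200", "correction, per_token, request_id, model_sha"),
    Edge("miniserve", "Prometheus", "scrape", "Prometheus text format"),
    Edge("miniserve", "OTel-Collector", "OTLP/gRPC", "OTel spans"),
    Edge("OTel-Collector", "Tempo", "OTLP", "spans"),
    Edge("miniserve", "MLflow", "HTTP (startup only)", "artifact download"),
    Edge("miniserve", "MCP sandbox", "subprocess + JSON-RPC stdio", "tool call"),
    Edge("MCP sandbox", "miniserve", "subprocess stdio", "tool result"),
]


def node_id(name: str) -> str:
    """Mermaid node ids must be plain identifiers; the label keeps the real name."""
    return re.sub(r"[^A-Za-z0-9]", "_", name)


def to_mermaid(edges: list[Edge]) -> str:
    """Emit a left-to-right flowchart; refuse any edge without a full label."""
    names: list[str] = []
    for edge in edges:
        for name in (edge.src, edge.dst):
            if name not in names:
                names.append(name)
    lines = ["flowchart LR"]
    lines += [f'    {node_id(n)}["{n}"]' for n in names]
    for edge in edges:
        if not edge.protocol.strip() or not edge.data.strip():
            raise ValueError(f"unlabeled arrow: {edge.src} -> {edge.dst}")
        label = f"[{edge.protocol}] {edge.data}"
        lines.append(f'    {node_id(edge.src)} -->|"{label}"| {node_id(edge.dst)}')
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid(EDGES))
```

The output could be written to the .mmd file directly. An arrow added to the table without a protocol or data shape then fails at generation time instead of slipping into the diagram unlabeled.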

Level 3 — Components inside miniserve

miniserve is the only container with non-trivial internal structure. Its components (each a Python module or sub-package):

miniserve/
├── handlers.py          # FastAPI routes
├── traffic.py           # Phase 38: router (shadow/A/B/canary)
├── pipeline.py          # tokenize → embed → forward → sample → detokenize
├── model_loader.py      # MLflow client + canonical-SHA verification
├── mcp_client.py        # Phase 31 tool dispatch
└── schemas/             # Pydantic request/response models

Each component reads its input from the previous one and emits to the next. The pipeline component (pipeline.py) is the one that walks a single English sentence from arrival to response. We document it in detail in theory/02-end-to-end-data-flow.md.

The sequence: one grammar-correction request

The third diagram (docs/phase-39-capstone/diagrams/sequence-request.mmd) is a mermaid sequenceDiagram showing exactly what happens for one request:

Learner → miniserve.HTTP        : POST /v1/grammar/correct {"sentence": "Yesterday I goed to the store"}
miniserve.HTTP → security       : rate-limit check (Phase 33), body-size check, injection-filter (Phase 37)
security → traffic              : router.assign(request_id) → "production" arm
traffic → pipeline              : pipeline.run(sentence, arm="production")
pipeline → BPE.encode           : tokenize → [42, 17, 9001, ...]
pipeline → Embedding.forward    : (batch=1, seq_len=8) → (1, 8, 128)
pipeline → Mini-GPT.forward     : prefill, populate KV-cache (Phase 22)
pipeline → sampler.decode       : structured generation (Phase 30) for the correction template
pipeline → BPE.decode           : token ids → "Yesterday I went to the store"
pipeline → cost_emitter         : record per-stage wall times → cost histogram
pipeline → tracing              : emit OTel spans with model_sha, request_id, latency, cost
miniserve.HTTP → Learner        : 200 OK {"correction": "Yesterday I went to the store", "per_token": [...], "model_sha": "abc123...", "request_id": "..."}

Every arrow corresponds to a Phase: Phase 33 (HTTP + rate-limit), Phase 37 (injection filter), Phase 38 (router), Phase 11 (BPE), Phase 13 (embedding), Phase 17 (Mini-GPT), Phase 22 (KV-cache), Phase 30 (structured gen), Phase 34 (cost + tracing). That is nine Phases named on the arrows. Phase 28's LoRA adapter, part of the model bundle that the forward pass runs, is the tenth. One sequence diagram, ten Phases visible.
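The pipeline-to-cost_emitter step amounts to timing each stage as the request passes through it. A minimal sketch of that pattern follows. The stages are toy stand-ins (a whitespace split instead of BPE, a lookup table instead of the model), and run_pipeline and TOY_STAGES are hypothetical names, not the interface of pipeline.py.

```python
import time
from typing import Any, Callable

Stage = tuple[str, Callable[[Any], Any]]

TOY_FIXES = {"goed": "went"}  # stand-in for the model's correction

TOY_STAGES: list[Stage] = [
    ("tokenize", str.split),
    ("correct", lambda tokens: [TOY_FIXES.get(t, t) for t in tokens]),
    ("detokenize", " ".join),
]


def run_pipeline(value: Any, stages: list[Stage]) -> tuple[Any, dict[str, float]]:
    """Feed each stage's output to the next, recording wall time per stage."""
    timings: dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)
        timings[name] = time.perf_counter() - start
    return value, timings


if __name__ == "__main__":
    result, timings = run_pipeline("Yesterday I goed to the store", TOY_STAGES)
    print(result)
    for name, seconds in timings.items():
        print(f"{name}: {seconds * 1e6:.1f} µs")
```

The per-stage dictionary is the kind of record a cost emitter could turn into histogram observations. How the real emitter aggregates these is not shown here.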

The ten contributing Phases (and the four read-only ones)

The demo path actively executes code from ten Phases (33, 37, 38, 11, 13, 17, 22, 30, 34, 28-LoRA-adapter). Four more Phases contribute configuration or data without their code running on the request path:

  • Phase 12 (corpus design) — the verb corpus is pulled by dvc pull at demo start; its hash is verified against manifest.json. The corpus itself is not on the request path (training-only artifact).
  • Phase 18 (training loop) — the FP32 base checkpoint that the LoRA adapter sits on top of was produced here. Loaded into memory at startup; weights are then frozen.
  • Phase 20 (eval harness) — runs after the demo's request flow, generating experiments/39-end-to-end/eval-YYYY-MM-DD.json. Not on the request path.
  • Phase 26 (INT8 quantization) — optional dequantized weights loader path. Off by default; the demo runs FP32 + LoRA.
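The corpus check against manifest.json mentioned in the list above can be sketched as a hash comparison. The manifest layout used here ({"files": {relative_path: sha256}}) and the choice of SHA-256 are assumptions for illustration; the actual manifest format and hash algorithm are not specified here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large corpora need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_manifest(manifest_path: Path) -> list[str]:
    """Return the relative paths that are missing or whose hash differs."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    root = manifest_path.parent
    mismatched = []
    for rel_path, expected in manifest["files"].items():  # hypothetical layout
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            mismatched.append(rel_path)
    return mismatched


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "verbs.txt").write_bytes(b"go went gone\n")
        manifest = {"files": {"verbs.txt": sha256_of(root / "verbs.txt")}}
        (root / "manifest.json").write_text(json.dumps(manifest), encoding="utf-8")
        print(verify_against_manifest(root / "manifest.json"))
```

A startup script could refuse to launch the demo when the returned list is non-empty, which keeps a stale or tampered corpus from going unnoticed.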

The remaining 26 Phases contributed foundations (numerical representation, linear algebra, calculus, BPE training, etc.) whose results are baked into the artifacts; their code is not directly invoked in the demo. The mapping table in PHASE_39_REPORT.md makes this fully explicit.

What contracts the architecture enforces

  1. No new src module. The container list above does not introduce one. Every component is in an existing src/<module>/.
  2. No fan-out. A single miniserve process serves the demo. No worker pool, no Redis, no Kafka. The architecture diagram has fewer boxes than typical production. That is the point: complexity is earned, not inherited.
  3. Telemetry is one-way. miniserve emits to Prometheus/OTel; it does not consult them at request time. (Comparing live latency to a baseline is offline-only, in Phase 38's drift analysis. The demo does not change behavior based on telemetry.)
  4. MLflow is read-only after startup. The bundle is loaded once. The demo does not hot-swap models. (Hot-swap is a Phase 40 reading-list item.)
  5. MCP is opt-in per request. A grammar-correction request triggers an audited tool call only when it asks for one (e.g., by including an explicit audit=true flag). The demo demonstrates this exactly once.

Pitfalls when drawing the diagrams

  1. Over-drawing. Resist showing every Python function. Containers are processes; components are major Python modules; the rest is code level (Level 4), which we skip.
  2. Arrows without labels. Every arrow has a protocol and a data shape. Unlabeled arrows are a hint that the contract isn't real.
  3. Bidirectional arrows. Use them sparingly. They obscure the direction of dependency. Prefer two unidirectional arrows.
  4. Static vs dynamic confusion. The Container diagram shows what's running. The Sequence diagram shows what happens for one request. Don't merge them — they answer different questions.
  5. Telemetry as a hub. It's tempting to draw Grafana as a central hub. It isn't — Grafana is a read-only viewer for the demo. The hub is miniserve (the only emitter on the request path).
  6. Missing the MCP sandbox. It only spawns sometimes, but it's a security boundary worth drawing explicitly. Show it dashed (lifecycle: spawns and dies per call).

A worked discrepancy: the contract between pipeline.py and tracing.py

The pipeline emits OTel spans. The tracing module promises to attach trace_id + span_id to every structured log line. The contract: every log line emitted on the request path includes the correct trace_id. If even one does not, the Grafana "Logs for this trace" panel won't populate and the demo will visibly fail.

The audit: lab 01 runs the demo with trace_id logging enabled and verifies via grep that every log line from a single request shares the same trace_id. If even one log line is missing the field, the contract is broken and the diagram lied. This is what we mean by "architecture diagrams are tested": the audit catches diagrams that overstate reality.
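The same check can be expressed in Python instead of grep. This sketch assumes each log line is a JSON object carrying request_id and trace_id fields. That log format, and the audit_trace_ids helper itself, are hypothetical and not taken from lab 01.

```python
import json
from collections import defaultdict
from typing import Iterable


def audit_trace_ids(log_lines: Iterable[str]) -> dict[str, str]:
    """Return {request_id: problem} for every request that breaks the contract."""
    seen: dict[str, set] = defaultdict(set)
    for raw in log_lines:
        raw = raw.strip()
        if not raw:
            continue
        record = json.loads(raw)
        # An absent or empty trace_id is recorded as None.
        seen[record["request_id"]].add(record.get("trace_id") or None)

    problems: dict[str, str] = {}
    for request_id, trace_ids in seen.items():
        if None in trace_ids:
            problems[request_id] = "missing trace_id"
        elif len(trace_ids) > 1:
            problems[request_id] = "multiple trace_ids"
    return problems


if __name__ == "__main__":
    sample = [
        '{"request_id": "r1", "trace_id": "t-aaa", "msg": "tokenize"}',
        '{"request_id": "r1", "trace_id": "t-aaa", "msg": "forward"}',
        '{"request_id": "r2", "trace_id": "t-bbb", "msg": "tokenize"}',
        '{"request_id": "r2", "msg": "forward"}',
    ]
    print(audit_trace_ids(sample))
```

An empty result means the contract held for every request in the log. Any entry names the request where the diagram overstated reality.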

What this theory does NOT cover

  • Why FastAPI specifically. Done in Phase 33.
  • Why OpenTelemetry specifically. Done in Phase 34 theory.
  • Why MCP specifically. Done in Phase 31 theory.
  • The training of the LoRA adapter. Done in Phase 28.
  • Multi-region or HA architecture. Out of scope. Phase 40 reading-list.
  • Microservices vs monolith debate. The demo is a monolith; defending that choice is a Phase 40 reflection item.

Next: theory/02-end-to-end-data-flow.md — one request, every layer, with byte counts and the latency budget.