English · Español
Theory 07 — Observability for the learner portal¶
🇪🇸 La observabilidad del portal reutiliza los tres pilares de la Fase 34 (logs, métricas, trazas) pero los reduce a la escala de un solo proceso FastAPI:
structlogen JSON,prometheus_clienten/metrics, y OTLP opcional hacia un Tempo local. La privacidad del estudiante es restricción dura: nunca se registran contraseñas, cookies, ni el cuerpo de los notes.
Why this matters here¶
The portal is small — one process, one SQLite file, ≤ 50 students — but the operational questions are identical to the ones Phase 34 answered for the model server: "is it up?", "is it slow?", "did this one request fail and why?". Skipping observability because the system is small is the classic mistake: by the time you wish you had a p95 latency chart for /login, the regression has already shipped. The portal reuses Phase 34's three-pillars stack (logs/metrics/traces) with two adaptations: (a) PII discipline is tighter — student journals are private; (b) the audit log is a database table, not a log line, because it is authoritative for security review (Phase 37 §05).
The three pillars, ported to the portal¶
Logs — structlog with JSON sink¶
Same renderer chain as Phase 34 theory 03, with one extra context binding (request_id, student_id).
# src/miniportal/observability.py (excerpt)
import structlog
def init_logging(level: str, json_output: bool) -> None:
processors = [
structlog.contextvars.merge_contextvars,
structlog.processors.add_log_level,
structlog.processors.TimeStamper(fmt="iso", utc=True),
]
processors.append(
structlog.processors.JSONRenderer() if json_output
else structlog.dev.ConsoleRenderer()
)
structlog.configure(
processors=processors,
wrapper_class=structlog.make_filtering_bound_logger(
getattr(__import__("logging"), level.upper())
),
)
Canonical field set for every portal log line:
| Field | Type | Notes |
|---|---|---|
request_id |
str | UUID4; minted at the edge if X-Request-Id header absent. |
student_id |
int? | Bound after the auth middleware resolves the session. |
route |
str | The matched FastAPI route template (/phase/{nn}/notes). |
method |
str | HTTP verb. |
status |
int | HTTP status code. |
latency_ms |
float | Wall time from request entry to response emission. |
event |
str | Stable identifier — request.completed, login.failed, … |
Example line emitted at the end of a successful note save:
{"timestamp":"2026-05-22T18:14:03.211Z","level":"info","event":"request.completed","request_id":"3f1a…","student_id":7,"route":"/phase/13/notes","method":"POST","status":200,"latency_ms":42.7}
A FastAPI dependency mints request_id and binds it before any handler runs:
from uuid import uuid4
import structlog
from fastapi import Header
async def bind_request_id(x_request_id: str | None = Header(default=None)) -> str:
rid = x_request_id or str(uuid4())
structlog.contextvars.bind_contextvars(request_id=rid)
return rid
student_id is bound by the auth middleware once the session cookie has been verified — never before, so anonymous traffic is logged as student_id: null.
Metrics — prometheus_client on /metrics¶
One registry, three families, exposed via the standard ASGI endpoint.
# src/miniportal/observability.py (excerpt)
from prometheus_client import (
CollectorRegistry, Counter, Histogram, make_asgi_app,
)
def init_metrics() -> CollectorRegistry:
reg = CollectorRegistry()
Histogram(
"portal_request_duration_seconds",
"Wall-clock duration of HTTP requests.",
labelnames=("route", "method", "status"),
buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
registry=reg,
)
Counter(
"portal_login_attempts_total",
"Login attempts, labeled by outcome.",
labelnames=("ok",), # "true" / "false"
registry=reg,
)
Counter(
"portal_invite_redemptions_total",
"Invite-token redemptions, labeled by outcome.",
labelnames=("ok",),
registry=reg,
)
Counter(
"portal_csrf_blocks_total",
"Requests rejected by the CSRF middleware.",
registry=reg,
)
Counter(
"portal_ratelimit_blocks_total",
"Requests rejected by the rate-limit middleware.",
labelnames=("scope",), # "ip" / "session"
registry=reg,
)
Histogram(
"portal_vault_decrypt_seconds",
"AES-256-GCM decrypt latency for minivault columns.",
buckets=(0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05),
registry=reg,
)
return reg
Label cardinality stays bounded: route uses the route template, not the rendered path; status is a small integer set; ok is binary. This is the lesson from Phase 34 theory 02 (label cardinality is the trap), applied verbatim.
The CSRF middleware (Phase 37) and rate-limit middleware (Phase 33) each call their counter on a block, before returning the error response. That's the only metric write each does — they are otherwise pure pass-through.
Traces — OpenTelemetry via OTLP¶
Auto-instrumentation handles the request span; manual spans wrap the three expensive operations: Argon2 verify, vault decrypt, SM-2 scheduling.
# src/miniportal/observability.py (excerpt)
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
def init_tracing(otlp_endpoint: str | None) -> None:
if otlp_endpoint is None:
return # tracing is opt-in via env
provider = TracerProvider(resource=Resource.create({"service.name": "miniportal"}))
provider.add_span_processor(
BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True))
)
trace.set_tracer_provider(provider)
def instrument_app(app) -> None:
FastAPIInstrumentor.instrument_app(app, excluded_urls="/health,/metrics")
Manual span names (stable identifiers — these become Tempo search terms):
auth.argon2_verify— wraps theargon2-cffiverify call.vault.decrypt— wraps eachAES-GCMdecrypt of aVaultBytescolumn.review.sm2_update— wraps the SM-2 next-review computation insrc/minireview/.
The Tempo backend runs in docker/portal-compose.yml; the OTLP endpoint defaults to http://tempo:4317 inside compose, unset outside it.
What NOT to log (PII discipline)¶
The portal handles secrets and private content. The non-negotiable list:
- No raw passwords. Not in logs, not in spans, not in metrics labels. Ever.
- No password hashes. Argon2id output is sensitive — peppered or not.
- No master pass.
PORTAL_MASTER_PASSis read once at boot; the env var name is logged at leveldebug, the value never. - No session cookie value. The cookie's signature is the bearer credential; logging it == handing out the session. Log the cookie's
session_idHMAC fingerprint (8 hex chars) if you need correlation. - No
body_markdowncontent of student notes. The journal is the student's private space (theory 04 §privacy). Lognote.size_bytes,note.length_chars,note.id— never the body. - No exam answers in plaintext. Same rule: log answer length and the question id, not the text.
The audit log is separate. Read/write of admin-sensitive routes (GET /admin/students/{id}, POST /admin/students) writes a row into event_log (per architecture §database schema). That row IS the authoritative record for Phase 37 security review. It is not a log line, because log lines are best-effort and rotated; the audit table is durable and queryable. The two streams complement each other; neither replaces the other.
Local dev vs production¶
| Concern | Dev | Production |
|---|---|---|
| Log format | ConsoleRenderer (colored, human) |
JSONRenderer (stdout) |
| Metrics | /metrics on the same port |
/metrics scraped by Prometheus sidecar |
| Tracing | Disabled unless MINIPORTAL_OTLP_ENDPOINT set |
Always on; OTLP → Tempo |
| Log level | DEBUG |
INFO (auth events at WARNING on failure) |
| Sampling | 100% | 100% — volume is small enough |
The env flags:
MINIPORTAL_LOG_LEVEL(defaultINFO)MINIPORTAL_LOG_JSON(default0in dev,1in prod)MINIPORTAL_OTLP_ENDPOINT(unset → tracing disabled)
Implementation contract for src/miniportal/observability.py¶
The module exposes four functions, wired in app.py in this order:
from miniportal.observability import (
init_logging, init_metrics, init_tracing, instrument_app,
)
init_logging(level=os.getenv("MINIPORTAL_LOG_LEVEL", "INFO"),
json_output=os.getenv("MINIPORTAL_LOG_JSON") == "1")
registry = init_metrics()
init_tracing(otlp_endpoint=os.getenv("MINIPORTAL_OTLP_ENDPOINT"))
app = FastAPI(dependencies=[Depends(bind_request_id)])
instrument_app(app)
app.mount("/metrics", make_asgi_app(registry=registry))
The CSRF middleware (src/miniserve/middlewares/csrf.py) and the rate-limit middleware (src/miniserve/middlewares/ratelimit.py) each take the metrics registry by dependency injection and increment portal_csrf_blocks_total / portal_ratelimit_blocks_total{scope=…} on block.
Dashboards — textual sketch¶
infra/grafana/dashboards/portal.json ships with three stacked panels:
- Request latency (top):
histogram_quantile(0.50|0.95|0.99, sum by (le) (rate(portal_request_duration_seconds_bucket[1m]))). Three lines (p50/p95/p99) over a 6-hour window. - Auth health (middle): two stats —
sum(rate(portal_login_attempts_total{ok="true"}[5m])) / sum(rate(portal_login_attempts_total[5m]))for login success rate, andsum(increase(portal_invite_redemptions_total{ok="true"}[24h]))for invite redemptions/day. - Vault latency (bottom):
histogram_quantile(0.95, rate(portal_vault_decrypt_seconds_bucket[1m])). Single line; the slowest crypto path in the request lifecycle.
SLOs Borja can pin¶
Three concrete numbers, written into infra/grafana/dashboards/portal.json as threshold lines:
| SLO | Target | Why this number |
|---|---|---|
p95 /login latency |
< 500 ms | Argon2id verify dominates (~50 ms) + DB read + signing; 500 ms gives 10× headroom. |
p99 /me (dashboard) latency |
< 200 ms | No Argon2 on this path; pure DB joins. 200 ms is the threshold beyond which UX feels sluggish. |
| Invite-redemption success rate | > 99.9% / 30d | Failures here mean a student cannot onboard; this is the highest-stakes path. |
If any SLO is missed for ≥ 30 minutes, the Phase 41 lab 06 deploy procedure says: write a one-page postmortem under experiments/41-portal-slo-burn/<date>/.
Cross-references¶
- Phase 34 — the original three-pillars derivation (
docs/phase-34-observability-cost/theory/03-tracing-and-logging.md). The portal reuses the chain almost verbatim; if the structlog processor list changes there, it changes here. - Phase 37 §05 — portal threat model. The
event_logtable (audit) is the artifact those threats are reviewed against; logs and metrics are operational, not forensic. - Phase 33 —
/metricsand/healthendpoint conventions originate here.
Common pitfalls¶
- Binding
student_idtoo early. If you bind it inbind_request_id, anonymous logs get a stale id from a previous session. Bind it in the auth middleware, after session verification. - High-cardinality
routelabels. Loggingroute="/phase/13/notes"vsroute="/phase/14/notes"per-phase explodes the time-series count. Use the route template (/phase/{nn}/notes); FastAPI exposes it viarequest.scope["route"].path. - Forgetting to exclude
/metricsfrom auto-instrumentation. OTel will trace its own scrape, creating a feedback loop in the trace store. Theexcluded_urlsparam above is mandatory. - Logging the cookie value "just for debugging". Don't. If you need correlation, hash the cookie's signature with HMAC-SHA256 and log the first 8 hex chars.
- Treating
event_logas a substitute for structured logs. The two are not interchangeable:event_logis durable, schema-checked, audit-relevant; structured logs are best-effort, schema-loose, operational.
Forward reference¶
lab/06-deploy-and-backup.md wires this module into the docker-compose production layout: a Prometheus container scrapes /metrics, a Tempo container receives OTLP, and a Grafana container loads portal.json. The lab's DoD includes a screenshot of the three-panel dashboard during a synthetic 10-student load test.
Next: the nav addition in mkdocs.yml (owned by a sibling agent), then lab/06-deploy-and-backup.md for the production wiring.