Skip to content

English · Español

Theory 07 — Observability for the learner portal

🇪🇸 La observabilidad del portal reutiliza los tres pilares de la Fase 34 (logs, métricas, trazas) pero los reduce a la escala de un solo proceso FastAPI: structlog en JSON, prometheus_client en /metrics, y OTLP opcional hacia un Tempo local. La privacidad del estudiante es restricción dura: nunca se registran contraseñas, cookies, ni el cuerpo de los notes.

Why this matters here

The portal is small — one process, one SQLite file, ≤ 50 students — but the operational questions are identical to the ones Phase 34 answered for the model server: "is it up?", "is it slow?", "did this one request fail and why?". Skipping observability because the system is small is the classic mistake: by the time you wish you had a p95 latency chart for /login, the regression has already shipped. The portal reuses Phase 34's three-pillars stack (logs/metrics/traces) with two adaptations: (a) PII discipline is tighter — student journals are private; (b) the audit log is a database table, not a log line, because it is authoritative for security review (Phase 37 §05).

The three pillars, ported to the portal

Logs — structlog with JSON sink

Same renderer chain as Phase 34 theory 03, with one extra context binding (request_id, student_id).

# src/miniportal/observability.py (excerpt)
import structlog

def init_logging(level: str, json_output: bool) -> None:
    processors = [
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
    ]
    processors.append(
        structlog.processors.JSONRenderer() if json_output
        else structlog.dev.ConsoleRenderer()
    )
    structlog.configure(
        processors=processors,
        wrapper_class=structlog.make_filtering_bound_logger(
            getattr(__import__("logging"), level.upper())
        ),
    )

Canonical field set for every portal log line:

Field Type Notes
request_id str UUID4; minted at the edge if X-Request-Id header absent.
student_id int? Bound after the auth middleware resolves the session.
route str The matched FastAPI route template (/phase/{nn}/notes).
method str HTTP verb.
status int HTTP status code.
latency_ms float Wall time from request entry to response emission.
event str Stable identifier — request.completed, login.failed, …

Example line emitted at the end of a successful note save:

{"timestamp":"2026-05-22T18:14:03.211Z","level":"info","event":"request.completed","request_id":"3f1a…","student_id":7,"route":"/phase/13/notes","method":"POST","status":200,"latency_ms":42.7}

A FastAPI dependency mints request_id and binds it before any handler runs:

from uuid import uuid4
import structlog
from fastapi import Header

async def bind_request_id(x_request_id: str | None = Header(default=None)) -> str:
    rid = x_request_id or str(uuid4())
    structlog.contextvars.bind_contextvars(request_id=rid)
    return rid

student_id is bound by the auth middleware once the session cookie has been verified — never before, so anonymous traffic is logged as student_id: null.

Metrics — prometheus_client on /metrics

One registry, three families, exposed via the standard ASGI endpoint.

# src/miniportal/observability.py (excerpt)
from prometheus_client import (
    CollectorRegistry, Counter, Histogram, make_asgi_app,
)

def init_metrics() -> CollectorRegistry:
    reg = CollectorRegistry()
    Histogram(
        "portal_request_duration_seconds",
        "Wall-clock duration of HTTP requests.",
        labelnames=("route", "method", "status"),
        buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
        registry=reg,
    )
    Counter(
        "portal_login_attempts_total",
        "Login attempts, labeled by outcome.",
        labelnames=("ok",),  # "true" / "false"
        registry=reg,
    )
    Counter(
        "portal_invite_redemptions_total",
        "Invite-token redemptions, labeled by outcome.",
        labelnames=("ok",),
        registry=reg,
    )
    Counter(
        "portal_csrf_blocks_total",
        "Requests rejected by the CSRF middleware.",
        registry=reg,
    )
    Counter(
        "portal_ratelimit_blocks_total",
        "Requests rejected by the rate-limit middleware.",
        labelnames=("scope",),  # "ip" / "session"
        registry=reg,
    )
    Histogram(
        "portal_vault_decrypt_seconds",
        "AES-256-GCM decrypt latency for minivault columns.",
        buckets=(0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05),
        registry=reg,
    )
    return reg

Label cardinality stays bounded: route uses the route template, not the rendered path; status is a small integer set; ok is binary. This is the lesson from Phase 34 theory 02 (label cardinality is the trap), applied verbatim.

The CSRF middleware (Phase 37) and rate-limit middleware (Phase 33) each call their counter on a block, before returning the error response. That's the only metric write each does — they are otherwise pure pass-through.

Traces — OpenTelemetry via OTLP

Auto-instrumentation handles the request span; manual spans wrap the three expensive operations: Argon2 verify, vault decrypt, SM-2 scheduling.

# src/miniportal/observability.py (excerpt)
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def init_tracing(otlp_endpoint: str | None) -> None:
    if otlp_endpoint is None:
        return  # tracing is opt-in via env
    provider = TracerProvider(resource=Resource.create({"service.name": "miniportal"}))
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True))
    )
    trace.set_tracer_provider(provider)

def instrument_app(app) -> None:
    FastAPIInstrumentor.instrument_app(app, excluded_urls="/health,/metrics")

Manual span names (stable identifiers — these become Tempo search terms):

  • auth.argon2_verify — wraps the argon2-cffi verify call.
  • vault.decrypt — wraps each AES-GCM decrypt of a VaultBytes column.
  • review.sm2_update — wraps the SM-2 next-review computation in src/minireview/.

The Tempo backend runs in docker/portal-compose.yml; the OTLP endpoint defaults to http://tempo:4317 inside compose, unset outside it.

What NOT to log (PII discipline)

The portal handles secrets and private content. The non-negotiable list:

  • No raw passwords. Not in logs, not in spans, not in metrics labels. Ever.
  • No password hashes. Argon2id output is sensitive — peppered or not.
  • No master pass. PORTAL_MASTER_PASS is read once at boot; the env var name is logged at level debug, the value never.
  • No session cookie value. The cookie's signature is the bearer credential; logging it == handing out the session. Log the cookie's session_id HMAC fingerprint (8 hex chars) if you need correlation.
  • No body_markdown content of student notes. The journal is the student's private space (theory 04 §privacy). Log note.size_bytes, note.length_chars, note.id — never the body.
  • No exam answers in plaintext. Same rule: log answer length and the question id, not the text.

The audit log is separate. Read/write of admin-sensitive routes (GET /admin/students/{id}, POST /admin/students) writes a row into event_log (per architecture §database schema). That row IS the authoritative record for Phase 37 security review. It is not a log line, because log lines are best-effort and rotated; the audit table is durable and queryable. The two streams complement each other; neither replaces the other.

Local dev vs production

Concern Dev Production
Log format ConsoleRenderer (colored, human) JSONRenderer (stdout)
Metrics /metrics on the same port /metrics scraped by Prometheus sidecar
Tracing Disabled unless MINIPORTAL_OTLP_ENDPOINT set Always on; OTLP → Tempo
Log level DEBUG INFO (auth events at WARNING on failure)
Sampling 100% 100% — volume is small enough

The env flags:

  • MINIPORTAL_LOG_LEVEL (default INFO)
  • MINIPORTAL_LOG_JSON (default 0 in dev, 1 in prod)
  • MINIPORTAL_OTLP_ENDPOINT (unset → tracing disabled)

Implementation contract for src/miniportal/observability.py

The module exposes four functions, wired in app.py in this order:

from miniportal.observability import (
    init_logging, init_metrics, init_tracing, instrument_app,
)

init_logging(level=os.getenv("MINIPORTAL_LOG_LEVEL", "INFO"),
             json_output=os.getenv("MINIPORTAL_LOG_JSON") == "1")
registry = init_metrics()
init_tracing(otlp_endpoint=os.getenv("MINIPORTAL_OTLP_ENDPOINT"))

app = FastAPI(dependencies=[Depends(bind_request_id)])
instrument_app(app)
app.mount("/metrics", make_asgi_app(registry=registry))

The CSRF middleware (src/miniserve/middlewares/csrf.py) and the rate-limit middleware (src/miniserve/middlewares/ratelimit.py) each take the metrics registry by dependency injection and increment portal_csrf_blocks_total / portal_ratelimit_blocks_total{scope=…} on block.

Dashboards — textual sketch

infra/grafana/dashboards/portal.json ships with three stacked panels:

  1. Request latency (top): histogram_quantile(0.50|0.95|0.99, sum by (le) (rate(portal_request_duration_seconds_bucket[1m]))). Three lines (p50/p95/p99) over a 6-hour window.
  2. Auth health (middle): two stats — sum(rate(portal_login_attempts_total{ok="true"}[5m])) / sum(rate(portal_login_attempts_total[5m])) for login success rate, and sum(increase(portal_invite_redemptions_total{ok="true"}[24h])) for invite redemptions/day.
  3. Vault latency (bottom): histogram_quantile(0.95, rate(portal_vault_decrypt_seconds_bucket[1m])). Single line; the slowest crypto path in the request lifecycle.

SLOs Borja can pin

Three concrete numbers, written into infra/grafana/dashboards/portal.json as threshold lines:

SLO Target Why this number
p95 /login latency < 500 ms Argon2id verify dominates (~50 ms) + DB read + signing; 500 ms gives 10× headroom.
p99 /me (dashboard) latency < 200 ms No Argon2 on this path; pure DB joins. 200 ms is the threshold beyond which UX feels sluggish.
Invite-redemption success rate > 99.9% / 30d Failures here mean a student cannot onboard; this is the highest-stakes path.

If any SLO is missed for ≥ 30 minutes, the Phase 41 lab 06 deploy procedure says: write a one-page postmortem under experiments/41-portal-slo-burn/<date>/.

Cross-references

  • Phase 34 — the original three-pillars derivation (docs/phase-34-observability-cost/theory/03-tracing-and-logging.md). The portal reuses the chain almost verbatim; if the structlog processor list changes there, it changes here.
  • Phase 37 §05 — portal threat model. The event_log table (audit) is the artifact those threats are reviewed against; logs and metrics are operational, not forensic.
  • Phase 33/metrics and /health endpoint conventions originate here.

Common pitfalls

  • Binding student_id too early. If you bind it in bind_request_id, anonymous logs get a stale id from a previous session. Bind it in the auth middleware, after session verification.
  • High-cardinality route labels. Logging route="/phase/13/notes" vs route="/phase/14/notes" per-phase explodes the time-series count. Use the route template (/phase/{nn}/notes); FastAPI exposes it via request.scope["route"].path.
  • Forgetting to exclude /metrics from auto-instrumentation. OTel will trace its own scrape, creating a feedback loop in the trace store. The excluded_urls param above is mandatory.
  • Logging the cookie value "just for debugging". Don't. If you need correlation, hash the cookie's signature with HMAC-SHA256 and log the first 8 hex chars.
  • Treating event_log as a substitute for structured logs. The two are not interchangeable: event_log is durable, schema-checked, audit-relevant; structured logs are best-effort, schema-loose, operational.

Forward reference

lab/06-deploy-and-backup.md wires this module into the docker-compose production layout: a Prometheus container scrapes /metrics, a Tempo container receives OTLP, and a Grafana container loads portal.json. The lab's DoD includes a screenshot of the three-panel dashboard during a synthetic 10-student load test.


Next: the nav addition in mkdocs.yml (owned by a sibling agent), then lab/06-deploy-and-backup.md for the production wiring.