English · Español

04 — $/token math: self-hosted vs API-served; the §A13 RPS budget¶

🇪🇸 ¿Cuánto cuesta una corrección del tutor §A13? Depende: si lo auto-alojamos en el i5-8250U el coste es electricidad + amortización, no por-token. Si lo serviésemos por una API comercial (no lo haremos — anti-objetivo §10), el coste por token sería real y medible. Esta página hace ambas cuentas y deriva el RPS estable que el portal §A14 espera.

Two cost models, both useful¶

Model A — self-hosted (what we actually do)¶

The §A13 grammar tutor runs on Borja's laptop. The marginal cost per request is electricity + machine wear, not "tokens out × $price". Numbers (i5-8250U, full-load):

TDP under tutor load: ~25 W average (the model is small; we're not pegging the CPU).
Wall power including PSU loss, screen, peripherals: ~45 W.
Electricity rate (residential ES, 2026): ~0.16 €/kWh.
Cost per hour at full load: $0.045 \cdot 0.16 = 0.0072$ €/hour ≈ 0.7 cents/hour.

If the tutor runs idle most of the day and bursts to 100% during 1 hour of tutoring (one learner), the daily marginal cost is < 1 cent. The amortized capital cost (laptop / 5 yr) is the dominant line — and it's a sunk cost Borja already paid. Self-hosting is effectively free at this scale.

Model B — hypothetical API serving (we don't do this)¶

If we were renting GPT-4-class inference instead of self-hosting, the math is more familiar. As of mid-2026, an API in the GPT-4-mini class quotes ~$0.15 per 1M input tokens, ~$0.60 per 1M output tokens. For a typical tutor request:

Input: system prompt (≈ 200 tokens) + user sentence (≈ 24 tokens) = 224 tokens.
Output: correction (≈ 16 tokens) + Spanish gloss (≈ 8 tokens) = 24 tokens.
Cost per request: $224 \cdot 0.15 \cdot 10^{-6} + 24 \cdot 0.60 \cdot 10^{-6} = 4.8 \cdot 10^{-5}$ USD ≈ 0.005 cents.

Self-hosting wins on $/request by an order of magnitude only because the §A13 model is microscopic and runs on hardware already paid for. For a 7B-class model on GPU, the calculus inverts — paying for someone else's amortized H100 is cheaper than buying your own until you saturate it. The break-even point is the standard MLOps decision matrix (Phase 38 covers this).

The reason we self-host the §A13 tutor: it's part of the curriculum's pedagogical contract (CLAUDE.md §0.4, "build before abstracting"). We could be cheaper by renting; we deliberately aren't.

$/token doesn't tell the whole story¶

Three costs that the per-token price hides:

Egress / bandwidth. Streaming 24 output tokens at ~6 bytes each = 144 bytes. Negligible on the wire; ~0.0001 cent at typical cloud rates. But multiply by 10k learners × 100 requests/day and you're at 144 GB/day, which is no longer free.
Idle / reservation. Self-hosted = pay 24/7 even if idle. API = pay per-call. The crossover for a §A13-tutor-scale workload is roughly 1 request/minute: above, self-hosting wins; below, API wins on $.
Auditability. API responses are someone else's logs. For Phase 37 / Phase 41 audit logging requirements, self-hosted gives us provenance for free; API serving requires us to log every request/response which is its own cost line.

The RPS budget for the §A13 grammar tutor¶

Per LYNX_CORTEX_ADDENDUM.md §A14, the Phase 41 portal is the working consumer of the tutor. Expected load:

Multi-learner cohort: assume 30 concurrent learners as the design target (1 active class).
Active session time per learner: ~20 minutes/day average.
Submissions per session: ~12 (quiz items + free-text retries).
Submissions per day per learner: ~12.
Submissions per day, cohort: $30 \cdot 12 = 360$.
Within the 20-min window, peak burst rate: $12 / (20 \cdot 60) = 0.01$ submissions/learner-second.
Peak cohort RPS (everyone clicks at once — worst case): $30 \cdot 1$ = 30 RPS for ~1 second, then back to ~0.3 RPS.

Realistic steady-state during a class: 0.3 - 1 RPS, peaks of 5-10 RPS during quiz transitions. With Phase 33's batching (lab 03), the server handles ~30 RPS comfortably at the latency budget computed in theory 05. Capacity headroom: ~3-10×.

Reading this in the other direction: the budget gives the cost. If we wanted to size for a 100-learner cohort (3.3× the design target), we'd hit the batching ceiling at peak and need either (a) larger MAX_INFLIGHT (more memory per replica) or (b) a second replica (Phase 38's blue/green / horizontal-scale practice).

The $/quiz-question metric (portal-level)¶

The portal exposes a per-question "explain" button which calls the tutor. Worst-case cost per quiz session:

12 explanations × ~150 ms wall-clock = 1.8 seconds of compute.
At Model A self-hosted: ~$3.6 \cdot 10^{-6}$ in electricity. Essentially zero.
At Model B hypothetical API: ~$5.8 \cdot 10^{-4}$ ≈ 0.06 cents per session.

Even at 10k sessions/month, Model B is < $6/month. The cost of running the tutor is dwarfed by the cost of building it. That ratio inverts only at very large scale, which Phase 41 explicitly does not target.

Observability data the cost numbers depend on¶

To make the above non-fiction, the Phase 34 instrumentation must export:

tutor_requests_total{outcome="success|error"} — counter.
tutor_input_tokens_total / tutor_output_tokens_total — counters (for the API-serving math we don't use, but want to keep honest).
tutor_latency_seconds{quantile} — summary or histogram.
tutor_compute_seconds_total — counter for the cost math (integrates to wall-clock CPU time).
tutor_concurrent_inflight — gauge for the Little's-law-derived headroom check.

Lab 03 (docs/phase-34-observability-cost/lab/03-cost-and-loadtest.md) wires (1)-(4); (5) is the bridge to Phase 33's /readyz backpressure signal.

Engineering rules of thumb¶

Decision	Self-host wins when	API wins when
Model size	< ~1B params (fits on CPU/laptop)	≥ 7B (you'd need an H100 anyway)
Request rate	> ~1 RPS sustained	< ~1 RPS / spiky
Audit / data sovereignty required	Yes	(only providers with audit guarantees)
Latency SLO < 100 ms intra-region	Self-host close to user	Often loses to cold-start tax
You want to learn how it works	Always	Never

The §A13 grammar tutor satisfies all five rows in favor of self-hosting. That is not a coincidence — the topic was chosen (§A13) partly because it would be a clean self-hosted demonstration.

What this chapter does NOT cover¶

Cluster cost models — Phase 35 / X4 / X1 for the cluster math.
GPU $/hr regional arbitrage — Phase 38.
Spot vs on-demand pricing strategy — Phase 38.
Per-tenant cost attribution in a multi-tenant portal — Phase 41's responsibility; cross-ref docs/phase-41-learner-portal/theory/02-data-model.md.

Reference¶

Ouyang et al., "Training language models to follow instructions with human feedback" (NeurIPS 2022) — InstructGPT paper. Includes the per-token cost discussion that became the API pricing template.
Patterson et al., "Carbon Emissions and Large Neural Network Training" (2021). The energy → $ → CO₂ chain.

Next: ../lab/03-cost-and-loadtest.md.