02 — Cost accounting¶
Per-request cost has to be a first-class metric, not a monthly Excel sheet. The formula is trivial (time × rate), but the hard decisions (which rate, how to amortize, how to avoid double-counting the prefetch) decide whether the dashboard lies.
A conventional web service can treat cost as a finance problem: the cloud bill arrives monthly and ops people allocate it across teams. LLM serving cannot treat cost that way, because per-request cost varies by 100× or more. If you don't measure per-request cost, you cannot:
- detect a customer abusing the service,
- compare two model versions on a cost-quality plane,
- decide whether to autoscale,
- price a product.
This page derives the formula, picks the rate, and lists the traps.
The first-principles formula¶
A serving box has a rate \(r\) in dollars per second of wall-clock time. For a cloud instance, \(r = (\text{hourly\_price}) / 3600\). For Borja's local i5-8250U with the notional rate of $0.17/hr: \(r \approx 4.7 \times 10^{-5}\) USD/sec.
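A single request occupies the box for some wall time. Decompose it into the stages this page tracks (retrieval, prefill, decode):
\[ T_{\text{request}} = T_{\text{retrieval}} + T_{\text{prefill}} + T_{\text{decode}} \]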
(In Phase 33's continuous-batching server, decode wall time is shared across active requests; we'll handle that correction below.)
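Naïve per-request cost (no batching), with the request's total wall time \(T_{\text{request}}\) taken as the sum of its stage times:
\[ \text{cost} = r \cdot T_{\text{request}} = r \cdot \left(T_{\text{retrieval}} + T_{\text{prefill}} + T_{\text{decode}}\right) \]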
That's the formula. The hard part is the corrections.
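As a minimal sketch of the naïve formula, the snippet below multiplies summed stage times by the per-second rate. The names `StageTimes`, `rate_per_second`, and `naive_cost_usd` are hypothetical and not part of the Phase 33 server; the stage times in the usage lines are made-up illustration values.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600.0


@dataclass(frozen=True)
class StageTimes:
    """Wall-clock seconds one request spent in each stage (hypothetical names)."""
    retrieval_s: float = 0.0
    prefill_s: float = 0.0
    decode_s: float = 0.0

    @property
    def total_s(self) -> float:
        return self.retrieval_s + self.prefill_s + self.decode_s


def rate_per_second(hourly_price_usd: float) -> float:
    """Convert an hourly box price into r, dollars per wall-clock second."""
    return hourly_price_usd / SECONDS_PER_HOUR


def naive_cost_usd(times: StageTimes, hourly_price_usd: float) -> float:
    """Naive per-request cost: total wall time multiplied by the box rate."""
    return times.total_s * rate_per_second(hourly_price_usd)


# Illustration only: made-up stage times at the notional $0.17/hr rate.
r = rate_per_second(0.17)
example = StageTimes(retrieval_s=0.02, prefill_s=0.05, decode_s=1.0)
cost = naive_cost_usd(example, 0.17)
print(f"r = {r:.2e} USD/s, cost = {cost:.2e} USD")
```

This version ignores batching on purpose; it charges the full decode wall time to the request.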
Correction 1: batching overlap¶
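When the batcher is running N requests in parallel, each decode step processes all N. The wall time per decode step is roughly the same as for one request, but the work done is N×. The cost of each step should therefore be divided across the requests that share it. Writing \(t_{\text{step}}\) for the wall time of one decode step and summing over the steps the request takes part in:
\[ \text{cost}_{\text{decode}} = r \sum_{\text{step}} \frac{t_{\text{step}}}{N_{\text{active}}(\text{step})} \]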
where \(N_{\text{active}}(\text{step})\) is the number of requests sharing the batch at that step (which varies as requests join and leave).
Implementation trick: the cost tracker doesn't compute this exactly per step; that's expensive. Instead, the batcher reports effective_decode_seconds_per_request = total_decode_wall_seconds / sum_over_steps(N_active), which is the average decode wall time charged to one request for one generated token. Each request adds effective × n_tokens_generated × r to its tab. Aggregate error is ≤ 1% for typical batch profiles.
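The difference between exact per-step sharing and the averaged shortcut can be sketched as follows. This is an illustration under stated assumptions, not the batcher's actual code: the function name and its arguments are hypothetical, every step is assumed to have at least one active request, and each active request is assumed to emit exactly one token per step.

```python
def amortized_decode_costs(step_durations_s, step_active_ids, rate_usd_per_s):
    """Return (exact, approx) dicts mapping request id -> decode cost in USD.

    step_durations_s[i] is the wall time of decode step i.
    step_active_ids[i] lists the request ids in the batch at step i
    (assumed non-empty, one generated token per request per step).
    """
    exact = {}
    tokens = {}
    for dt, active in zip(step_durations_s, step_active_ids):
        share = dt / len(active)  # split this step evenly across the batch
        for rid in active:
            exact[rid] = exact.get(rid, 0.0) + share * rate_usd_per_s
            tokens[rid] = tokens.get(rid, 0) + 1

    # Shortcut: one averaged figure for the whole window.
    total_wall = sum(step_durations_s)
    request_steps = sum(len(active) for active in step_active_ids)
    effective = total_wall / request_steps
    approx = {rid: effective * n * rate_usd_per_s for rid, n in tokens.items()}
    return exact, approx


RATE = 0.17 / 3600
durations = [0.010, 0.010, 0.012, 0.010]
active = [["a"], ["a", "b"], ["a", "b", "c"], ["b"]]
exact, approx = amortized_decode_costs(durations, active, RATE)
for rid in sorted(exact):
    print(rid, f"exact={exact[rid]:.3e}", f"approx={approx[rid]:.3e}")
```

In this sketch both methods bill the same total (the window's decode wall time times the rate); only the split between individual requests differs, and it differs most for requests that ran in unusually full or unusually empty batches.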
Correction 2: prefill quadratic vs decode linear¶
Prefill compute scales as \(O(P^2)\) in prompt length \(P\) (attention). Decode compute per step scales as \(O(P + D)\) where \(D\) is generated tokens so far. Two implications:
- Long-prompt requests are disproportionately expensive in prefill. Once attention dominates, a 4k-token prompt costs up to ~16× a 1k-token prompt to prefill, not 4×.
- Long-generation requests are roughly linear in cost. Doubling output tokens roughly doubles decode cost.
The cost formula respects this automatically via the wall-time measurement — you don't need to model \(P^2\) explicitly, you measure \(T_{\text{prefill}}\) and the quadratic shows up in the number. But it does mean the histogram of cost is bimodal-ish: short prompts cluster near a floor, long prompts have a heavy right tail.
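To see how the quadratic term bends the prefill curve, here is a toy FLOP count using the common rule of thumb of roughly \(24 P d^2\) for the linear layers and \(4 P^2 d\) for attention per transformer layer (with a 4× MLP). The `d_model` and `n_layers` defaults are hypothetical, not MiniGPT's real configuration, and FLOPs are only a proxy for the wall time the cost tracker actually measures.

```python
def prefill_flops(prompt_tokens: int, d_model: int = 128, n_layers: int = 4) -> int:
    """Rough forward-pass FLOPs to prefill a prompt (toy model, assumed sizes)."""
    linear = 24 * prompt_tokens * d_model**2       # projections + MLP, O(P)
    quadratic = 4 * prompt_tokens**2 * d_model     # attention scores + mix, O(P^2)
    return n_layers * (linear + quadratic)


ratio = prefill_flops(4096) / prefill_flops(1024)
print(f"4096-token prefill costs {ratio:.1f}x a 1024-token prefill")
```

With these assumed sizes the ratio falls between the linear 4× and the purely quadratic 16×; it moves toward 16× as the prompt grows relative to `d_model`.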
Correction 3: retrieval is real cost¶
If the request triggers RAG (Phase 29), retrieval wall time is part of the cost. Two sub-pieces:
- Embedding the query. Tens of milliseconds on CPU for a small encoder; longer if you're using a beefy one.
- Vector search. For FAISS-flat on the Phase 29 KB (~thousands of chunks), single-digit ms. For HNSW, even less.
Retrieval is small (≤5% of total cost) for the Phase 29 setup, but must be tracked so it doesn't get swept under the rug — and so that if Borja later swaps the embedding model for a bigger one, the cost shift is visible.
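One way to keep retrieval visible is to time each stage separately and convert the seconds to dollars per stage. The sketch below uses only the standard library; `StageClock` and the stage names are hypothetical, and the `time.sleep` calls stand in for the real embedding and vector-search calls.

```python
import time
from contextlib import contextmanager


class StageClock:
    """Accumulates wall-clock seconds per named stage for one request."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised.
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def cost_usd(self, rate_usd_per_s):
        """Per-stage cost: stage seconds times the box rate."""
        return {name: s * rate_usd_per_s for name, s in self.seconds.items()}


clock = StageClock()
with clock.stage("retrieval_embed"):
    time.sleep(0.01)    # stand-in for embedding the query
with clock.stage("retrieval_search"):
    time.sleep(0.002)   # stand-in for the vector search
costs = clock.cost_usd(0.17 / 3600)
print(costs)
```

Keeping the stages as separate keys means a later swap to a larger embedding model shows up as a shift in one number instead of disappearing into the total.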
Picking the rate¶
Three honest choices for the notional $_per_hour rate:
| Choice | Value | When it's right |
|---|---|---|
| Cloud-equivalent | ~$0.17/hr (c5.xlarge) or $1.50/hr (g4dn.xlarge GPU) | Comparing to a hypothetical cloud deployment. Default for the curriculum. |
| Electricity-only | ~€0.01/hr (i5-8250U at full tilt, at €0.20/kWh) | Comparing the marginal cost of running on owned hardware. Borja might prefer this for the local box. |
| Fully-loaded local | Electricity + amortized hardware + rent + admin time | The honest version for a real product. Out of scope for Phase 34. |
The lab uses cloud-equivalent by default; Borja can override at phase open.
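The first two rows of the table reduce to simple unit conversions, sketched below. The function names are hypothetical, and the 50 W draw is an assumption back-solved from the table's ~€0.01/hr figure at €0.20/kWh, not a measured number. Note that the two rates are in different currencies and are not directly comparable without a conversion.

```python
def cloud_equivalent_rate(hourly_price: float) -> float:
    """Per-second rate from an hourly instance price."""
    return hourly_price / 3600.0


def electricity_rate(power_watts: float, price_per_kwh: float) -> float:
    """Per-second rate from power draw: kW times price per kWh, per second."""
    hourly = (power_watts / 1000.0) * price_per_kwh
    return hourly / 3600.0


usd_per_s = cloud_equivalent_rate(0.17)   # c5.xlarge-equivalent, in USD
eur_per_s = electricity_rate(50.0, 0.20)  # assumed 50 W draw, in EUR
print(f"cloud: {usd_per_s:.2e} USD/s, electricity: {eur_per_s:.2e} EUR/s")
```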
The Prometheus encoding¶
Two histograms, log-spaced buckets:
COST_BUCKETS_USD = (1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))
COST_PER_1K_TOKENS_BUCKETS_USD = (1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))
12 and 10 buckets respectively (each count includes the +Inf bucket), covering 5 and 4 decades. Log-spaced because cost is right-skewed.
Two labels on each: model_name and kind (prompt vs completion for per-token cost). No further labels — cardinality budget.
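To make the bucket layout concrete, the sketch below mimics how a Prometheus histogram counts an observation: it lands in the first bucket whose upper bound `le` is at least the value, and in every larger bucket as well, because buckets are cumulative. This is a pure-Python illustration, not the client library; the helper names are hypothetical.

```python
from bisect import bisect_left

COST_BUCKETS_USD = (1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3,
                    1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))


def first_bucket_index(value, bounds=COST_BUCKETS_USD):
    """Index of the smallest upper bound `le` such that value <= le."""
    return bisect_left(bounds, value)


def cumulative_counts(values, bounds=COST_BUCKETS_USD):
    """Cumulative per-bucket counts, the shape of a `_bucket` series."""
    counts = [0] * len(bounds)
    for v in values:
        for i in range(first_bucket_index(v, bounds), len(bounds)):
            counts[i] += 1
    return counts


idx = first_bucket_index(1.23e-4)
counts = cumulative_counts([1.23e-4, 5.0])
print(f"le={COST_BUCKETS_USD[idx]}", counts)
```

A request costing more than $1.00 only shows up in the +Inf bucket, which is a useful alarm in itself at these rates.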
PromQL queries Borja will actually write¶
The four queries that ought to be on the cost dashboard's panels:
# Mean cost per request, last 5 minutes:
sum(rate(cost_per_request_usd_sum[5m])) / sum(rate(cost_per_request_usd_count[5m]))
# p95 cost per 1k output tokens, last 5 minutes:
histogram_quantile(0.95, sum by (le) (rate(cost_per_1k_completion_tokens_usd_bucket[5m])))
# Total spend, last 24 hours:
sum(increase(cost_per_request_usd_sum[24h]))
# Cost-per-request distribution as a heatmap (Grafana panel type):
sum by (le) (rate(cost_per_request_usd_bucket[1m]))
The third — total spend — is the FinOps line that gets put in front of management. The second — p95 cost per 1k tokens — is the engineering KPI.
The trap: cost without quality is meaningless¶
A model that always returns "I don't know" is infinitely cost-efficient. Cost has to be reported alongside some quality signal, even a coarse one like "% of responses that parsed as valid JSON" (Phase 30) or "% of responses that passed the safety filter" (Phase 37). Phase 34 lays down the cost number; Phase 38 wires it to the quality number on the same dashboard.
For the lab in Phase 34: commit a "cost per successful completion" panel where successful is defined as status==200 and completion_tokens > 0. Crude, but it prevents the degenerate "always 503" gaming.
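A minimal sketch of that panel's arithmetic follows. The record layout and function name are hypothetical. One design choice is an assumption on my part: the numerator counts spend on all requests, including failed ones, so that errors raise the figure instead of hiding in it.

```python
def cost_per_successful_completion(requests):
    """Total spend divided by the number of successful completions.

    `requests` is an iterable of dicts with keys `status`,
    `completion_tokens`, and `cost_usd` (hypothetical record layout).
    Success means status == 200 and completion_tokens > 0.
    """
    total_cost = 0.0
    successes = 0
    for req in requests:
        total_cost += req["cost_usd"]  # failed requests still cost money
        if req["status"] == 200 and req["completion_tokens"] > 0:
            successes += 1
    # No successes at all: report infinity so the panel cannot look cheap.
    return total_cost / successes if successes else float("inf")


sample = [
    {"status": 200, "completion_tokens": 256, "cost_usd": 1.2e-4},
    {"status": 503, "completion_tokens": 0, "cost_usd": 1.0e-5},
    {"status": 200, "completion_tokens": 0, "cost_usd": 2.0e-5},
]
result = cost_per_successful_completion(sample)
print(f"{result:.2e} USD per successful completion")
```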
Worked example: what should cost look like for Borja's MiniGPT?¶
Order-of-magnitude estimates for an i5-8250U serving the ~500k-param MiniGPT at FP32:
- TTFT for a 128-token prompt: ~50 ms.
- Per-token decode: ~10 ms.
- A 256-token completion takes ~50 + 256×10 = 2610 ms.
- At the $0.17/hr rate: 2.61 s × $0.17/3600 s = **$1.23e-4 per request, or about $0.12 per 1000 requests**.
- Per 1k completion tokens: ($1.23e-4 / 256) × 1000 = **$4.8e-4 per 1k tokens**.
Sanity check: at launch list prices, OpenAI's gpt-4o-mini is ~$0.15/1M input tokens and ~$0.60/1M output tokens, or **$1.5e-4 to $6e-4 per 1k tokens**. So a hand-rolled CPU MiniGPT, on a 2018 laptop, with a notional cloud rate, lands **within about 3×** of a cheap production API per 1k tokens. That's a reasonable answer for an educational system: not so cheap as to be suspect, not so expensive as to be embarrassing. Borja should land in roughly this range.
If the measured number ends up at $0.01 per 1k tokens (20× off), something is wrong: most likely the batcher's batching isn't being credited, or the rate is set too high, or the model is running uselessly slow.
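The arithmetic in this worked example can be reproduced in a few lines, which is handy for re-running it with measured latencies. The constants are the order-of-magnitude estimates from the list above, not measurements, and the variable names are hypothetical.

```python
HOURLY_RATE_USD = 0.17        # notional cloud-equivalent rate
TTFT_S = 0.050                # estimated time to first token, 128-token prompt
DECODE_S_PER_TOKEN = 0.010    # estimated per-token decode time
COMPLETION_TOKENS = 256

wall_s = TTFT_S + COMPLETION_TOKENS * DECODE_S_PER_TOKEN
cost_per_request = wall_s * HOURLY_RATE_USD / 3600
cost_per_1k_requests = cost_per_request * 1000
cost_per_1k_tokens = cost_per_request / COMPLETION_TOKENS * 1000

print(f"wall time:        {wall_s:.2f} s")
print(f"per request:      ${cost_per_request:.2e}")
print(f"per 1k requests:  ${cost_per_1k_requests:.2f}")
print(f"per 1k tokens:    ${cost_per_1k_tokens:.1e}")
```

Swapping in the measured TTFT and decode time gives the number to compare against the range discussed here.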
One-paragraph recap¶
Per-request cost = wall-time × machine-rate, with three corrections: batch overlap divides decode cost across active requests, prefill quadratic shows up automatically in wall time but matters for the long-prompt tail, retrieval is small but tracked separately. Encode as two log-spaced Prometheus histograms (cost_per_request_usd, cost_per_1k_completion_tokens_usd) labeled by model name. p95 cost is the engineering KPI; total spend is the FinOps line. Always pair cost with a quality signal so degenerate "refuse everything" services don't read as wins.
Next: theory/03-tracing-and-logging.md.