02 — Cost accounting

Per-request cost has to be a first-class metric, not a monthly spreadsheet. The formula is trivial (time × rate), but the hard decisions (which rate, how to amortize, how not to double-count prefetch) decide whether the dashboard lies.

A web service treats cost as a finance problem: the cloud bill arrives monthly, ops people allocate it across teams. LLM serving cannot treat cost that way, because per-request cost varies by 100× or more. If you don't measure per-request cost, you cannot:

  • detect a customer abusing the service,
  • compare two model versions on a cost-quality plane,
  • decide whether to autoscale,
  • price a product.

This page derives the formula, picks the rate, and lists the traps.

The first-principles formula

A serving box has a rate \(r\) in dollars per second of wall-clock time. For a cloud instance, \(r = (\text{hourly\_price}) / 3600\). For Borja's local i5-8250U with the notional rate of $0.17/hr: \(r \approx 4.7 \times 10^{-5}\) USD/sec.

A single request occupies the box for some wall time. Decompose into stages:

\[T_{\text{req}} = T_{\text{retrieve}} + T_{\text{prefill}} + T_{\text{decode}}\]

(In Phase 33's continuous-batching server, decode wall time is shared across active requests; we'll handle that correction below.)

Naïve per-request cost (no batching):

\[\text{cost}_{\text{req}} = r \cdot T_{\text{req}}\]

That's the formula. The hard part is the corrections.
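A minimal sketch of the naive formula, before any corrections. The `StageTimes` record and the function names are hypothetical (they are not taken from the lab code), and the example timings are made up:

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600.0


@dataclass(frozen=True)
class StageTimes:
    """Wall-clock seconds one request spent in each stage (hypothetical record)."""
    retrieve: float
    prefill: float
    decode: float

    @property
    def total(self) -> float:
        # T_req = T_retrieve + T_prefill + T_decode
        return self.retrieve + self.prefill + self.decode


def rate_per_second(hourly_price_usd: float) -> float:
    """r = hourly_price / 3600, in USD per second of wall-clock time."""
    return hourly_price_usd / SECONDS_PER_HOUR


def naive_cost_usd(times: StageTimes, hourly_price_usd: float) -> float:
    """cost_req = r * T_req, with no batching correction."""
    return rate_per_second(hourly_price_usd) * times.total


r = rate_per_second(0.17)  # the notional $0.17/hr rate from the text
example = StageTimes(retrieve=0.010, prefill=0.050, decode=1.000)  # made-up timings
print(f"r = {r:.2e} USD/s, naive cost = {naive_cost_usd(example, 0.17):.2e} USD")
```

Keeping the three stage times separate, rather than storing only the total, is what lets the corrections below adjust decode and retrieval independently.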

Correction 1: batching overlap

When the batcher is running N requests in parallel, each decode step processes all N. The wall time per decode step is the same as for one request, but the work done is N×. The cost should be divided across requests:

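\[\text{cost}_{\text{decode}, i} = r \sum_{s \in \text{steps}_i} \frac{T_{\text{decode-step}}(s)}{N_{\text{active}}(s)}\]

where \(\text{steps}_i\) is the set of decode steps in which request \(i\) is active (so \(|\text{steps}_i| = n_{\text{steps}_i}\)), \(T_{\text{decode-step}}(s)\) is the wall time of step \(s\), and \(N_{\text{active}}(s)\) is the number of requests sharing the batch at that step (which varies as requests join and leave).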

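Implementation trick: the cost tracker doesn't compute this exactly per step; that's expensive. Instead, the batcher reports effective_decode_seconds_per_request = total_decode_wall_seconds / sum_over_steps(N_active). Assuming each active request emits one token per step, the denominator is the total number of tokens generated, so despite its name this quantity is wall seconds per generated token. Each request adds effective × n_tokens_generated × r to its tab. Aggregate error is ≤ 1% for typical batch profiles.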
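The averaging trick can be sketched as follows. This is an illustration under stated assumptions, not the batcher's actual code: the function names and the four-step trace are made up, and each active request is assumed to emit exactly one token per decode step.

```python
def effective_seconds_per_token(step_wall_seconds, step_active_counts):
    """total_decode_wall_seconds / sum_over_steps(N_active).

    Assumes one token per active request per step, so sum(N_active)
    equals the total number of tokens generated in the window.
    """
    total_wall = sum(step_wall_seconds)
    total_tokens = sum(step_active_counts)
    if total_tokens == 0:
        return 0.0
    return total_wall / total_tokens


def decode_cost_usd(effective, n_tokens_generated, rate_usd_per_second):
    """Approximate decode cost attributed to one request."""
    return effective * n_tokens_generated * rate_usd_per_second


# Made-up trace: 4 decode steps of 10 ms each.
# Request A is active in all 4 steps; request B joins for the last 2.
step_wall = [0.010, 0.010, 0.010, 0.010]
n_active = [1, 1, 2, 2]
rate = 0.17 / 3600

eff = effective_seconds_per_token(step_wall, n_active)
cost_a = decode_cost_usd(eff, 4, rate)
cost_b = decode_cost_usd(eff, 2, rate)

# Exact per-step split for comparison: each step's wall time divided by N_active.
exact_a = rate * (0.010 + 0.010 + 0.005 + 0.005)
exact_b = rate * (0.005 + 0.005)

print(f"approx: A={cost_a:.3e} B={cost_b:.3e} total={cost_a + cost_b:.3e}")
print(f"exact:  A={exact_a:.3e} B={exact_b:.3e} total={exact_a + exact_b:.3e}")
```

In this toy trace both attributions add up to r × total decode wall time, so total spend is unaffected. What the averaging changes is the split between requests: the request that ran alone for a while is charged somewhat less than its exact share, and the one that only ever ran batched is charged somewhat more.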

Correction 2: prefill quadratic vs decode linear

Prefill compute scales as \(O(P^2)\) in prompt length \(P\) (attention). Decode compute per step scales as \(O(P + D)\) where \(D\) is generated tokens so far. Two implications:

  • Long-prompt requests are disproportionately expensive in prefill. Where attention dominates, a prompt 4× longer (say 4 KB vs 1 KB) costs up to ~16× as much to prefill, not 4×.
  • Long-generation requests are roughly linear in cost. Doubling output tokens roughly doubles decode cost.

The cost formula respects this automatically via the wall-time measurement — you don't need to model \(P^2\) explicitly, you measure \(T_{\text{prefill}}\) and the quadratic shows up in the number. But it does mean the histogram of cost is bimodal-ish: short prompts cluster near a floor, long prompts have a heavy right tail.

Correction 3: retrieval is real cost

If the request triggers RAG (Phase 29), retrieval wall time is part of the cost. Two sub-pieces:

  • Embedding the query. Tens of milliseconds on CPU for a small encoder; longer if you're using a beefy one.
  • Vector search. For FAISS-flat on the Phase 29 KB (~thousands of chunks), single-digit ms. For HNSW, even less.

Retrieval is small (≤5% of total cost) for the Phase 29 setup, but must be tracked so it doesn't get swept under the rug — and so that if Borja later swaps the embedding model for a bigger one, the cost shift is visible.
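One way to keep retrieval visible is to time each stage under its own name and convert to cost at the end. The sketch below assumes a hypothetical `StageClock` helper and stage names (`retrieve.embed`, `retrieve.search`); the `time.sleep` calls are stand-ins for the real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageClock:
    """Accumulates wall-clock seconds per named stage for one request (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so tests can pass a fake clock
        self.seconds = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the stage raises, so failed work is still counted.
            self.seconds[name] += self._clock() - start

    def cost_usd(self, rate_usd_per_second):
        """Per-stage cost, so retrieval shows up as its own line item."""
        return {name: s * rate_usd_per_second for name, s in self.seconds.items()}


clock = StageClock()
with clock.stage("retrieve.embed"):
    time.sleep(0.002)  # stand-in for embedding the query
with clock.stage("retrieve.search"):
    time.sleep(0.001)  # stand-in for the vector search
print(clock.cost_usd(0.17 / 3600))
```

Splitting embedding from search is a design choice for this sketch: it means that if the embedding model is swapped for a larger one, the shift lands on one named stage instead of disappearing into a single retrieval total.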

Picking the rate

Three honest choices for the notional $_per_hour rate:

| Choice | Value | When it's right |
| --- | --- | --- |
| Cloud-equivalent | ~$0.17/hr (c5.xlarge) or ~$0.53/hr (g4dn.xlarge GPU), on-demand | Comparing to a hypothetical cloud deployment. Default for the curriculum. |
| Electricity-only | ~€0.01/hr for an i5-8250U at full tilt, at €0.20/kWh | Comparing the marginal cost of running on owned hardware. Borja might prefer this for the local box. |
| Fully-loaded local | Electricity + amortized hardware + rent + admin time | The honest version for a real product. Out of scope for Phase 34. |

The lab uses cloud-equivalent by default; Borja can override at phase open.
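For reference, the first two rates reduce to one-line conversions. The sketch below uses hypothetical function names, and the 50 W whole-system draw is an assumption backed out of the ~€0.01/hr figure at €0.20/kWh, not a measurement. It also does not convert between EUR and USD.

```python
def cloud_equivalent_rate_per_second(hourly_price_usd):
    """USD per wall-clock second for a notional cloud instance."""
    return hourly_price_usd / 3600.0


def electricity_only_hourly(power_watts, price_per_kwh):
    """Hourly electricity cost, in the currency of price_per_kwh."""
    return (power_watts / 1000.0) * price_per_kwh


ASSUMED_SYSTEM_WATTS = 50.0  # assumption: laptop draw under load, not measured

print(f"cloud-equivalent: {cloud_equivalent_rate_per_second(0.17):.2e} USD/s")
print(f"electricity-only: {electricity_only_hourly(ASSUMED_SYSTEM_WATTS, 0.20):.3f} EUR/hr")
```

Whichever rate is chosen, it enters the cost formula only through r, so switching rates rescales every cost number by the same constant and leaves relative comparisons between requests unchanged.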

The Prometheus encoding

Two histograms, log-spaced buckets:

COST_BUCKETS_USD = (1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))
COST_PER_1K_TOKENS_BUCKETS_USD = (1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))

12 and 10 buckets respectively (including +Inf), covering 5 and 4 decades. Log-spaced because cost is right-skewed.

Two labels on each: model_name and kind (prompt vs completion for per-token cost). No further labels — cardinality budget.
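Both bucket tuples follow a 1-3-10 series, so they can be generated rather than typed out. The sketch below is stdlib-only and uses hypothetical helper names; `bucket_upper_bound` only illustrates which `le` bound an observation would first be counted under, and is not how a Prometheus client library is invoked.

```python
import bisect
import math


def log_buckets(first_exp, last_exp):
    """1-3-10 series from 1e{first_exp} up to 1e{last_exp}, plus +Inf."""
    bounds = []
    for exp in range(first_exp, last_exp):
        # Parse decimal literals instead of computing 10**exp to avoid float drift.
        bounds.append(float(f"1e{exp}"))
        bounds.append(float(f"3e{exp}"))
    bounds.append(float(f"1e{last_exp}"))
    bounds.append(math.inf)
    return tuple(bounds)


def bucket_upper_bound(value, buckets):
    """Smallest bound with value <= bound (histogram buckets are inclusive upper bounds)."""
    return buckets[bisect.bisect_left(buckets, value)]


COST_BUCKETS_USD = log_buckets(-5, 0)
COST_PER_1K_TOKENS_BUCKETS_USD = log_buckets(-4, 0)

print(len(COST_BUCKETS_USD), len(COST_PER_1K_TOKENS_BUCKETS_USD))
print(bucket_upper_bound(1.23e-4, COST_BUCKETS_USD))
```

Generating the bounds keeps the two histograms aligned on the same grid, which makes it harder to introduce a typo in one tuple that silently distorts quantile estimates.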

PromQL queries Borja will actually write

The four queries that belong on the cost dashboard:

# Mean cost per request, last 5 minutes:
sum(rate(cost_per_request_usd_sum[5m])) / sum(rate(cost_per_request_usd_count[5m]))

# p95 cost per 1k output tokens, last 5 minutes:
histogram_quantile(0.95, sum by (le) (rate(cost_per_1k_completion_tokens_usd_bucket[5m])))

# Total spend, last 24 hours:
sum(increase(cost_per_request_usd_sum[24h]))

# Cost-per-request distribution as a heatmap (Grafana panel type):
sum by (le) (rate(cost_per_request_usd_bucket[1m]))

The third — total spend — is the FinOps line that gets put in front of management. The second — p95 cost per 1k tokens — is the engineering KPI.

The trap: cost without quality is meaningless

A model that always returns "I don't know" is infinitely cost-efficient. Cost has to be reported alongside some quality signal — even a coarse one like "% of responses that parsed as valid JSON" (Phase 30) or "% of responses that passed the safety filter" (Phase 37). Phase 34 lays the cost number; Phase 38 wires it to the quality number on the same dashboard.

For the lab in Phase 34: commit a "cost per successful completion" panel where successful is defined as status==200 and completion_tokens > 0. Crude, but it prevents the degenerate "always 503" gaming.
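A minimal sketch of that panel's underlying number, assuming a hypothetical `RequestRecord` shape. One interpretation is baked in and should be treated as an assumption: the numerator is total spend including failed requests, so a service that burns time on 503s looks worse, not better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """One finished request (hypothetical shape for this sketch)."""
    status: int
    completion_tokens: int
    cost_usd: float


def is_successful(rec):
    """The crude success definition from the text."""
    return rec.status == 200 and rec.completion_tokens > 0


def cost_per_successful_completion(records):
    """Total spend (failures included) divided by the number of successes."""
    total = sum(r.cost_usd for r in records)
    successes = sum(1 for r in records if is_successful(r))
    # Assumption: with zero successes the ratio is reported as infinite, not zero.
    return total / successes if successes else float("inf")


records = [
    RequestRecord(status=200, completion_tokens=10, cost_usd=1.0),
    RequestRecord(status=503, completion_tokens=0, cost_usd=0.5),
    RequestRecord(status=200, completion_tokens=0, cost_usd=0.5),
]
print(cost_per_successful_completion(records))
```

Here only the first record counts as a success, so all of the spend is charged against it. An "always 503" service has no successes at all and reads as infinitely expensive under this definition.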

Worked example: what should cost look like for Borja's MiniGPT?

Order-of-magnitude estimates for an i5-8250U serving the ~500k-param MiniGPT at FP32:

  • TTFT for a 128-token prompt: ~50 ms.
  • Per-token decode: ~10 ms.
  • A 256-token completion takes ~50 + 256×10 = 2610 ms.
  • At $0.17/hr rate: 2.61 × $0.17/3600 = $1.23e-4 per request, or $0.12 per 1000 requests.
  • Per 1k completion tokens: ($1.23e-4 / 256) × 1000 = $4.8e-4 per 1k tokens.
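The arithmetic above can be reproduced in a few lines. The constants are the order-of-magnitude estimates from the list, not measurements, and the variable names are illustrative:

```python
RATE_USD_PER_HOUR = 0.17     # notional cloud-equivalent rate
TTFT_S = 0.050               # estimated time to first token, 128-token prompt
DECODE_S_PER_TOKEN = 0.010   # estimated per-token decode time
COMPLETION_TOKENS = 256

wall_s = TTFT_S + COMPLETION_TOKENS * DECODE_S_PER_TOKEN
cost_per_request = wall_s * RATE_USD_PER_HOUR / 3600
cost_per_1k_requests = cost_per_request * 1000
cost_per_1k_tokens = cost_per_request / COMPLETION_TOKENS * 1000

print(f"wall time:           {wall_s:.2f} s")
print(f"cost per request:    {cost_per_request:.2e} USD")
print(f"cost per 1k requests: {cost_per_1k_requests:.3f} USD")
print(f"cost per 1k tokens:  {cost_per_1k_tokens:.2e} USD")
```

Swapping in measured values for the three estimates gives the number to compare against the sanity check that follows.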

Sanity check: OpenAI's gpt-4o-mini lists at about $0.15 per 1M input tokens and $0.60 per 1M output tokens, i.e. $1.5e-4 to $6e-4 per 1k tokens. So a hand-rolled CPU MiniGPT, on a 2018 laptop, with a notional cloud rate, lands within about 3× of the cheapest production API per 1k tokens. That's a reasonable answer for an educational system: not so cheap as to be suspect, not so expensive as to be embarrassing. Borja should land in roughly this range.

If the measured number ends up at $0.01 per 1k tokens (20× off), something is wrong: most likely the batcher's batching isn't being credited, or the rate is set too high, or the model is running uselessly slow.

One-paragraph recap

Per-request cost = wall-time × machine-rate, with three corrections: batch overlap divides decode cost across active requests, prefill quadratic shows up automatically in wall time but matters for the long-prompt tail, retrieval is small but tracked separately. Encode as two log-spaced Prometheus histograms (cost_per_request_usd, cost_per_1k_completion_tokens_usd) labeled by model name. p95 cost is the engineering KPI; total spend is the FinOps line. Always pair cost with a quality signal so degenerate "refuse everything" services don't read as wins.


Next: theory/03-tracing-and-logging.md.