English · Español
Lab 03 — Cost tracker + load test + dashboard polish¶
Goal: finish the dashboard. Load-test the server. Commit numbers.
Estimated time: 2-3 hours.
Prereq: Labs 00-02 complete; metrics + traces flowing; dashboard skeleton from Lab 01.
What you produce¶
src/observability/cost.py— theCostTrackerclass + two new Prometheus histograms.- Phase 33 server wired so every request records cost via
CostTracker. infra/grafana/dashboards/llm.json— polished dashboard with all RED + USE + cost panels.experiments/34-load-test/— k6 script + results + dashboard screenshot.experiments/34-cost-calibration/— sweep of$_per_hourvalues; cost linearity verified.
The CostTracker contract¶
class CostTracker:
def __init__(self, rate_usd_per_hour: float):
self.rate_per_second = rate_usd_per_hour / 3600.0
self.per_request = {} # request_id -> dict(stage -> wall_seconds)
def start_stage(self, request_id: str, stage: str): ...
def end_stage(self, request_id: str, stage: str): ...
def finalize(self, request_id: str, output_tokens: int, model_name: str) -> float:
"""Sum stage wall times × rate. Emit two histograms. Return total cost USD."""
...
Stages: retrieve, prefill, decode, other. The decode wall time should be the effective time per request (accounting for batch overlap — see theory file 02, Correction 1). The batcher must expose request.effective_decode_seconds for this.
Two new histograms:
COST_BUCKETS_USD = (1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))
COST_PER_1K_TOKENS_BUCKETS_USD = (1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))
cost_per_request_usd = Histogram(
"cost_per_request_usd",
"Total cost per request in USD",
["model_name"],
buckets=COST_BUCKETS_USD,
)
cost_per_1k_completion_tokens_usd = Histogram(
"cost_per_1k_completion_tokens_usd",
"Cost per 1000 completion tokens in USD",
["model_name"],
buckets=COST_PER_1K_TOKENS_BUCKETS_USD,
)
TODOs¶
Block A — src/observability/cost.py¶
- Implement
CostTrackerper the contract. - Store per-stage start times in a dict keyed by
(request_id, stage). Usetime.perf_counter(). - On
finalize: - Sum stage wall times to get total wall seconds.
- Multiply by
rate_per_secondfor cost in USD. - Observe in both histograms (the second only if
output_tokens > 0). - Set the span attribute
llm.cost.usd = totalon the current OTel span. - Log a single
request.coststructured-log event. - Pop the request_id from the per-request dict.
- Make rate configurable via env var
LLM_RATE_USD_PER_HOUR, default0.17.
Block B — wire into the server¶
- Instantiate one
CostTrackerper process at startup. - At every stage boundary in
app.py/batcher.py, callcost.start_stage(req_id, ...)andcost.end_stage(req_id, ...). - At end of request,
cost.finalize(req_id, output_tokens, model).
Block C — Grafana dashboard polish¶
Required panels (organize into 4 rows):
Row 1 — RED:
- Requests per second (
sum(rate(request_total[1m]))) - Error rate % (
sum(rate(request_total{status=~"5.."}[5m])) / sum(rate(request_total[5m])) * 100) - p50/p95/p99 request duration (three series on one panel,
histogram_quantile(...))
Row 2 — USE:
- CPU utilization (
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)) - RAM available (
node_memory_MemAvailable_bytes / 1024^3GiB) - Queue depth (
queue_depth) - KV cache slots used (
kv_cache_slots_used)
Row 3 — LLM:
- Tokens/sec by kind (
sum by (kind) (rate(tokens_total[1m]))) - TTFT p95 (
histogram_quantile(0.95, sum by(le) (rate(time_to_first_token_seconds_bucket[5m]))))
Row 4 — Cost:
- p95 cost per 1k completion tokens (
histogram_quantile(0.95, sum by(le) (rate(cost_per_1k_completion_tokens_usd_bucket[5m])))) - Total spend last 24h (
sum(increase(cost_per_request_usd_sum[24h]))) - Cost-per-request distribution (heatmap panel of
cost_per_request_usd_bucket)
Save dashboard, export JSON, commit to infra/grafana/dashboards/llm.json.
Block D — load test¶
- Install k6 (
sudo dnf install k6or use the docker image). - Write
experiments/34-load-test/loadtest.js: - Stages: ramp 0 → 100 VUs over 1 minute; hold 100 VUs for 3 minutes; ramp down over 1 minute.
- Each VU hits
/v1/completionswith one of three fixed prompt templates of varying length (short / medium / long). - Validate response is 200 + non-empty completion.
- Run:
k6 run experiments/34-load-test/loadtest.js. - During the run, take a screenshot of the dashboard at peak load. Save as
experiments/34-load-test/dashboard-screenshot.png. - Save k6's text output to
experiments/34-load-test/results.txt. -
manifest.jsonrecords:
{
"experiment": "34-load-test",
"date": "YYYY-MM-DD",
"seed": 42,
"versions": {"python": "...", "k6": "...", "miniserve_git_sha": "..."},
"hardware": {...from learners/borja/profile.md...},
"config": {"peak_vus": 100, "duration_s": 300, "prompts": [...]},
"results_summary": {
"rps_peak": null,
"p50_ms": null,
"p95_ms": null,
"p99_ms": null,
"error_rate_pct": null,
"mean_cost_per_1k_tokens_usd": null,
"p95_cost_per_1k_tokens_usd": null
}
}
Fill results_summary after the run.
Block E — cost calibration¶
- In
experiments/34-cost-calibration/calibrate.py, run a short fixed workload (10 requests) at three rate settings:LLM_RATE_USD_PER_HOUR=0.085,0.17,0.34. - For each, record the mean
cost_per_request_usd. - Verify: the three means are in 1:2:4 ratio (within ~5% noise). That confirms linearity.
- Commit a plot
experiments/34-cost-calibration/linearity.png: x = rate, y = mean cost.
Constraints¶
- No prompt/completion bodies stored anywhere. Verify by
grep -i prompt experiments/34-load-test/(should match only the loadtest.js fixtures, not any output). - Single-host load test. Don't try to distribute the load generator yet; Phase 35.
- Don't pin VU count to be CPU-pegging. 100 VUs on Borja's 4-core box should leave the CPU at ~70-80%. If it pegs at 100% with constant queueing, lower to 50 VUs. Document.
Stop conditions¶
Done when:
src/observability/cost.pyexists, conforms to contract.- Every request emits a histogram observation in
cost_per_request_usd. infra/grafana/dashboards/llm.jsoncommitted, opens with all 12+ panels populated.experiments/34-load-test/has manifest + script + results + screenshot.experiments/34-cost-calibration/proves linearity (3 points, 1:2:4 ratio).- PHASE_34_REPORT.md drafted with: peak RPS, p50/p95/p99 latency, error rate, mean cost per 1k completion tokens, p95 cost per 1k completion tokens.
Pitfalls (read before debugging)¶
- Cost numbers wildly off. If your cost/1k-tokens is 100× the order-of-magnitude estimate from theory file 02 ($4.8e-4), you're double-counting somewhere or summing all batched requests' decode wall times rather than per-request effective time. Re-read Correction 1.
- Heatmap panel empty. Heatmap requires the
_buckettime series, not the rate. Querysum by (le) (rate(cost_per_request_usd_bucket[1m]))— note thelegrouping. - Total spend panel shows 0.
increase()over a histogram's_sumis fine; over_countit gives you "requests served", not spend. - k6 errors out with "max VUs reached". Add
options.stagesand--vus-maxflag.
When to consult solutions/¶
After all six Stop conditions pass and a PHASE_34_REPORT.md draft exists. Solution at solutions/03-cost-and-loadtest-ref.md walks through expected numbers for Borja's specific hardware.
Phase done. Write PHASE_34_REPORT.md, run /phase-report 34, then stop and await Borja's proceed.