English · Español

Lab 03 — Cost tracker + load test + dashboard polish¶

Goal: finish the dashboard. Load-test the server. Commit numbers.

Estimated time: 2-3 hours.

Prereq: Labs 00-02 complete; metrics + traces flowing; dashboard skeleton from Lab 01.

What you produce¶

src/observability/cost.py — the CostTracker class + two new Prometheus histograms.
Phase 33 server wired so every request records cost via CostTracker.
infra/grafana/dashboards/llm.json — polished dashboard with all RED + USE + cost panels.
experiments/34-load-test/ — k6 script + results + dashboard screenshot.
experiments/34-cost-calibration/ — sweep of $_per_hour values; cost linearity verified.

The CostTracker contract¶

class CostTracker:
    def __init__(self, rate_usd_per_hour: float):
        self.rate_per_second = rate_usd_per_hour / 3600.0
        self.per_request = {}  # request_id -> dict(stage -> wall_seconds)

    def start_stage(self, request_id: str, stage: str): ...
    def end_stage(self, request_id: str, stage: str): ...

    def finalize(self, request_id: str, output_tokens: int, model_name: str) -> float:
        """Sum stage wall times × rate. Emit two histograms. Return total cost USD."""
        ...

Stages: retrieve, prefill, decode, other. The decode wall time should be the effective time per request (accounting for batch overlap — see theory file 02, Correction 1). The batcher must expose request.effective_decode_seconds for this.

Two new histograms:

COST_BUCKETS_USD = (1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))
COST_PER_1K_TOKENS_BUCKETS_USD = (1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))

cost_per_request_usd = Histogram(
    "cost_per_request_usd",
    "Total cost per request in USD",
    ["model_name"],
    buckets=COST_BUCKETS_USD,
)
cost_per_1k_completion_tokens_usd = Histogram(
    "cost_per_1k_completion_tokens_usd",
    "Cost per 1000 completion tokens in USD",
    ["model_name"],
    buckets=COST_PER_1K_TOKENS_BUCKETS_USD,
)

TODOs¶

Block A — `src/observability/cost.py`¶

Implement CostTracker per the contract.
Store per-stage start times in a dict keyed by (request_id, stage). Use time.perf_counter().
On finalize:
Sum stage wall times to get total wall seconds.
Multiply by rate_per_second for cost in USD.
Observe in both histograms (the second only if output_tokens > 0).
Set the span attribute llm.cost.usd = total on the current OTel span.
Log a single request.cost structured-log event.
Pop the request_id from the per-request dict.
Make rate configurable via env var LLM_RATE_USD_PER_HOUR, default 0.17.

Block B — wire into the server¶

Instantiate one CostTracker per process at startup.
At every stage boundary in app.py / batcher.py, call cost.start_stage(req_id, ...) and cost.end_stage(req_id, ...).
At end of request, cost.finalize(req_id, output_tokens, model).

Block C — Grafana dashboard polish¶

Required panels (organize into 4 rows):

Row 1 — RED:

Requests per second (sum(rate(request_total[1m])))
Error rate % (sum(rate(request_total{status=~"5.."}[5m])) / sum(rate(request_total[5m])) * 100)
p50/p95/p99 request duration (three series on one panel, histogram_quantile(...))

Row 2 — USE:

CPU utilization (100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100))
RAM available (node_memory_MemAvailable_bytes / 1024^3 GiB)
Queue depth (queue_depth)
KV cache slots used (kv_cache_slots_used)

Row 3 — LLM:

Tokens/sec by kind (sum by (kind) (rate(tokens_total[1m])))
TTFT p95 (histogram_quantile(0.95, sum by(le) (rate(time_to_first_token_seconds_bucket[5m]))))

Row 4 — Cost:

p95 cost per 1k completion tokens (histogram_quantile(0.95, sum by(le) (rate(cost_per_1k_completion_tokens_usd_bucket[5m]))))
Total spend last 24h (sum(increase(cost_per_request_usd_sum[24h])))
Cost-per-request distribution (heatmap panel of cost_per_request_usd_bucket)

Save dashboard, export JSON, commit to infra/grafana/dashboards/llm.json.

Block D — load test¶

Install k6 (sudo dnf install k6 or use the docker image).
Write experiments/34-load-test/loadtest.js:
Stages: ramp 0 → 100 VUs over 1 minute; hold 100 VUs for 3 minutes; ramp down over 1 minute.
Each VU hits /v1/completions with one of three fixed prompt templates of varying length (short / medium / long).
Validate response is 200 + non-empty completion.
Run: k6 run experiments/34-load-test/loadtest.js.
During the run, take a screenshot of the dashboard at peak load. Save as experiments/34-load-test/dashboard-screenshot.png.
Save k6's text output to experiments/34-load-test/results.txt.
manifest.json records:

{
  "experiment": "34-load-test",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "versions": {"python": "...", "k6": "...", "miniserve_git_sha": "..."},
  "hardware": {...from learners/borja/profile.md...},
  "config": {"peak_vus": 100, "duration_s": 300, "prompts": [...]},
  "results_summary": {
    "rps_peak": null,
    "p50_ms": null,
    "p95_ms": null,
    "p99_ms": null,
    "error_rate_pct": null,
    "mean_cost_per_1k_tokens_usd": null,
    "p95_cost_per_1k_tokens_usd": null
  }
}

Fill results_summary after the run.

Block E — cost calibration¶

In experiments/34-cost-calibration/calibrate.py, run a short fixed workload (10 requests) at three rate settings: LLM_RATE_USD_PER_HOUR=0.085, 0.17, 0.34.
For each, record the mean cost_per_request_usd.
Verify: the three means are in 1:2:4 ratio (within ~5% noise). That confirms linearity.
Commit a plot experiments/34-cost-calibration/linearity.png: x = rate, y = mean cost.

Constraints¶

No prompt/completion bodies stored anywhere. Verify by grep -i prompt experiments/34-load-test/ (should match only the loadtest.js fixtures, not any output).
Single-host load test. Don't try to distribute the load generator yet; Phase 35.
Don't pin VU count to be CPU-pegging. 100 VUs on Borja's 4-core box should leave the CPU at ~70-80%. If it pegs at 100% with constant queueing, lower to 50 VUs. Document.

Stop conditions¶

Done when:

src/observability/cost.py exists, conforms to contract.
Every request emits a histogram observation in cost_per_request_usd.
infra/grafana/dashboards/llm.json committed, opens with all 12+ panels populated.
experiments/34-load-test/ has manifest + script + results + screenshot.
experiments/34-cost-calibration/ proves linearity (3 points, 1:2:4 ratio).
PHASE_34_REPORT.md drafted with: peak RPS, p50/p95/p99 latency, error rate, mean cost per 1k completion tokens, p95 cost per 1k completion tokens.

Pitfalls (read before debugging)¶

Cost numbers wildly off. If your cost/1k-tokens is 100× the order-of-magnitude estimate from theory file 02 ($4.8e-4), you're double-counting somewhere or summing all batched requests' decode wall times rather than per-request effective time. Re-read Correction 1.
Heatmap panel empty. Heatmap requires the _bucket time series, not the rate. Query sum by (le) (rate(cost_per_request_usd_bucket[1m])) — note the le grouping.
Total spend panel shows 0. increase() over a histogram's _sum is fine; over _count it gives you "requests served", not spend.
k6 errors out with "max VUs reached". Add options.stages and --vus-max flag.

When to consult `solutions/`¶

After all six Stop conditions pass and a PHASE_34_REPORT.md draft exists. Solution at solutions/03-cost-and-loadtest-ref.md walks through expected numbers for Borja's specific hardware.

Phase done. Write PHASE_34_REPORT.md, run /phase-report 34, then stop and await Borja's proceed.