Lab 03 — Cost tracker + load test + dashboard polish

Goal: add per-request cost tracking, finish the dashboard, load-test the server, and commit the numbers.

Estimated time: 2-3 hours.

Prereq: Labs 00-02 complete; metrics + traces flowing; dashboard skeleton from Lab 01.


What you produce

  • src/observability/cost.py — the CostTracker class + two new Prometheus histograms.
  • Phase 33 server wired so every request records cost via CostTracker.
  • infra/grafana/dashboards/llm.json — polished dashboard with all RED + USE + cost panels.
  • experiments/34-load-test/ — k6 script + results + dashboard screenshot.
  • experiments/34-cost-calibration/ — sweep of LLM_RATE_USD_PER_HOUR values; cost linearity verified.

The CostTracker contract

class CostTracker:
    def __init__(self, rate_usd_per_hour: float):
        self.rate_per_second = rate_usd_per_hour / 3600.0
        self.per_request = {}  # request_id -> dict(stage -> wall_seconds)

    def start_stage(self, request_id: str, stage: str): ...
    def end_stage(self, request_id: str, stage: str): ...

    def finalize(self, request_id: str, output_tokens: int, model_name: str) -> float:
        """Sum stage wall times × rate. Emit two histograms. Return total cost USD."""
        ...

Stages: retrieve, prefill, decode, other. The decode wall time should be the effective time per request (accounting for batch overlap — see theory file 02, Correction 1). The batcher must expose request.effective_decode_seconds for this.

Two new histograms:

from prometheus_client import Histogram

COST_BUCKETS_USD = (1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))
COST_PER_1K_TOKENS_BUCKETS_USD = (1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))

cost_per_request_usd = Histogram(
    "cost_per_request_usd",
    "Total cost per request in USD",
    ["model_name"],
    buckets=COST_BUCKETS_USD,
)
cost_per_1k_completion_tokens_usd = Histogram(
    "cost_per_1k_completion_tokens_usd",
    "Cost per 1000 completion tokens in USD",
    ["model_name"],
    buckets=COST_PER_1K_TOKENS_BUCKETS_USD,
)
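
As a rough worked example of how one request maps onto the two histograms: the sketch below converts wall time to USD, normalizes to 1k completion tokens, and looks up the bucket each value would fall into. The 2.0 s wall time and 200 output tokens are made-up inputs, and the helper names are illustrative, not part of the contract.

```python
import bisect
from typing import Sequence

COST_BUCKETS_USD = (1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))
COST_PER_1K_TOKENS_BUCKETS_USD = (1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, float("inf"))


def cost_usd(wall_seconds: float, rate_usd_per_hour: float) -> float:
    """Cost of occupying the machine for wall_seconds at an hourly rate."""
    return wall_seconds * rate_usd_per_hour / 3600.0


def cost_per_1k_tokens(total_cost_usd: float, output_tokens: int) -> float:
    """Normalize a request's cost to 1000 completion tokens."""
    if output_tokens <= 0:
        # Mirrors the lab's rule: skip the second histogram when no tokens were produced.
        raise ValueError("output_tokens must be > 0")
    return total_cost_usd / output_tokens * 1000.0


def bucket_upper_bound(value: float, buckets: Sequence[float]) -> float:
    """Smallest bucket bound le with value <= le (Prometheus buckets are inclusive)."""
    return buckets[bisect.bisect_left(buckets, value)]


if __name__ == "__main__":
    total = cost_usd(wall_seconds=2.0, rate_usd_per_hour=0.17)  # hypothetical request
    per_1k = cost_per_1k_tokens(total, output_tokens=200)
    print(f"cost_per_request_usd ~ {total:.3e}, le={bucket_upper_bound(total, COST_BUCKETS_USD)}")
    print(f"cost_per_1k_tokens   ~ {per_1k:.3e}, le={bucket_upper_bound(per_1k, COST_PER_1K_TOKENS_BUCKETS_USD)}")
```

With these inputs the request costs roughly 9.4e-5 USD and about 4.7e-4 USD per 1k completion tokens, so both values sit comfortably inside the bucket ranges above.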

TODOs

Block A — src/observability/cost.py

  • Implement CostTracker per the contract.
  • Store per-stage start times in a dict keyed by (request_id, stage). Use time.perf_counter().
  • On finalize:
      1. Sum stage wall times to get total wall seconds.
      2. Multiply by rate_per_second for cost in USD.
      3. Observe in both histograms (the second only if output_tokens > 0).
      4. Set the span attribute llm.cost.usd = total on the current OTel span.
      5. Log a single request.cost structured-log event.
      6. Pop the request_id from the per-request dict.
  • Make rate configurable via env var LLM_RATE_USD_PER_HOUR, default 0.17.
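
One possible shape for Block A, as a minimal sketch rather than the reference solution. It covers only the timing and cost arithmetic; the histogram observations, span attribute, and log event are left as a comment so the block runs without Prometheus or OTel installed. The injectable clock parameter and the tracker_from_env helper are assumptions added for testability, not part of the contract.

```python
import os
import time
from typing import Callable, Dict, Tuple


class CostTracker:
    def __init__(self, rate_usd_per_hour: float,
                 clock: Callable[[], float] = time.perf_counter):
        self.rate_per_second = rate_usd_per_hour / 3600.0
        self.per_request: Dict[str, Dict[str, float]] = {}  # request_id -> {stage: wall_seconds}
        self._starts: Dict[Tuple[str, str], float] = {}     # (request_id, stage) -> start time
        self._clock = clock  # assumption: injectable so tests can use a fake clock

    def start_stage(self, request_id: str, stage: str) -> None:
        self._starts[(request_id, stage)] = self._clock()

    def end_stage(self, request_id: str, stage: str) -> None:
        started = self._starts.pop((request_id, stage))
        stages = self.per_request.setdefault(request_id, {})
        # Accumulate, so a stage entered more than once is summed rather than overwritten.
        stages[stage] = stages.get(stage, 0.0) + (self._clock() - started)

    def finalize(self, request_id: str, output_tokens: int, model_name: str) -> float:
        stages = self.per_request.pop(request_id, {})
        # Drop any stage that was started but never ended, so the dict cannot leak.
        for key in [k for k in self._starts if k[0] == request_id]:
            del self._starts[key]
        total = sum(stages.values()) * self.rate_per_second
        # The real implementation would observe both histograms here (the second only
        # if output_tokens > 0), set llm.cost.usd on the span, and log request.cost.
        # output_tokens and model_name are unused in this reduced sketch.
        return total


def tracker_from_env() -> CostTracker:
    """Hypothetical helper: read the hourly rate from LLM_RATE_USD_PER_HOUR, default 0.17."""
    return CostTracker(float(os.environ.get("LLM_RATE_USD_PER_HOUR", "0.17")))
```

Note that this sketch sums plain wall time for every stage. For decode, the lab asks for the batcher's effective time instead, so a full implementation would record request.effective_decode_seconds rather than the start/end difference.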

Block B — wire into the server

  • Instantiate one CostTracker per process at startup.
  • At every stage boundary in app.py / batcher.py, call cost.start_stage(req_id, ...) and cost.end_stage(req_id, ...).
  • At the end of each request, call cost.finalize(req_id, output_tokens, model_name).
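
To keep the start/end calls paired at each stage boundary, one option is a small context manager. This is a sketch under assumptions: the stage helper and RecordingTracker stub are hypothetical names, and the stub stands in for the real CostTracker so the block runs on its own.

```python
from contextlib import contextmanager
from typing import Iterator, List, Tuple


class RecordingTracker:
    """Stand-in for CostTracker that only records which calls were made."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str, str]] = []

    def start_stage(self, request_id: str, stage: str) -> None:
        self.calls.append(("start", request_id, stage))

    def end_stage(self, request_id: str, stage: str) -> None:
        self.calls.append(("end", request_id, stage))


@contextmanager
def stage(cost, request_id: str, name: str) -> Iterator[None]:
    """Wrap a stage so end_stage runs even if the stage body raises."""
    cost.start_stage(request_id, name)
    try:
        yield
    finally:
        cost.end_stage(request_id, name)


if __name__ == "__main__":
    cost = RecordingTracker()
    with stage(cost, "req-1", "retrieve"):
        pass  # retrieval work would happen here
    with stage(cost, "req-1", "prefill"):
        pass  # prefill work would happen here
    print(cost.calls)
```

The try/finally matters for the error path: a request that fails mid-stage still closes its stage, so a later finalize does not find a dangling start time.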

Block C — Grafana dashboard polish

Required panels (organize into 4 rows):

Row 1 — RED:

  • Requests per second (sum(rate(request_total[1m])))
  • Error rate % (sum(rate(request_total{status=~"5.."}[5m])) / sum(rate(request_total[5m])) * 100)
  • p50/p95/p99 request duration (three series on one panel, histogram_quantile(...))

Row 2 — USE:

  • CPU utilization (100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100))
  • RAM available in GiB (node_memory_MemAvailable_bytes / 1024^3)
  • Queue depth (queue_depth)
  • KV cache slots used (kv_cache_slots_used)

Row 3 — LLM:

  • Tokens/sec by kind (sum by (kind) (rate(tokens_total[1m])))
  • TTFT p95 (histogram_quantile(0.95, sum by(le) (rate(time_to_first_token_seconds_bucket[5m]))))

Row 4 — Cost:

  • p95 cost per 1k completion tokens (histogram_quantile(0.95, sum by(le) (rate(cost_per_1k_completion_tokens_usd_bucket[5m]))))
  • Total spend last 24h (sum(increase(cost_per_request_usd_sum[24h])))
  • Cost-per-request distribution (heatmap panel of cost_per_request_usd_bucket)

Save dashboard, export JSON, commit to infra/grafana/dashboards/llm.json.
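
Before committing, a quick sanity check on the exported JSON can catch missing panels. The sketch below assumes the usual Grafana export layout: a top-level "panels" list, row panels with type "row" that nest their children under "panels" when collapsed, and Prometheus targets carrying an "expr" field. Treat those field names as assumptions and adjust them if your export differs.

```python
import json
from typing import Any, Dict, List

Panel = Dict[str, Any]


def visible_panels(dashboard: Dict[str, Any]) -> List[Panel]:
    """Return all non-row panels, including those nested inside collapsed rows."""
    found: List[Panel] = []
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            found.extend(panel.get("panels", []))  # empty for expanded rows
        else:
            found.append(panel)
    return found


def panels_without_query(dashboard: Dict[str, Any]) -> List[str]:
    """Titles of panels that have no target with a non-empty expr."""
    return [
        p.get("title", "<untitled>")
        for p in visible_panels(dashboard)
        if not any(t.get("expr") for t in p.get("targets", []))
    ]


def load_dashboard(path: str) -> Dict[str, Any]:
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A typical use would be to load infra/grafana/dashboards/llm.json, check that len(visible_panels(...)) is at least 12, and check that panels_without_query(...) is empty.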

Block D — load test

  • Install k6 (sudo dnf install k6 or use the docker image).
  • Write experiments/34-load-test/loadtest.js:
      • Stages: ramp 0 → 100 VUs over 1 minute; hold 100 VUs for 3 minutes; ramp down over 1 minute.
      • Each VU hits /v1/completions with one of three fixed prompt templates of varying length (short / medium / long).
      • Validate that the response is 200 with a non-empty completion.
  • Run: k6 run experiments/34-load-test/loadtest.js.
  • During the run, take a screenshot of the dashboard at peak load. Save as experiments/34-load-test/dashboard-screenshot.png.
  • Save k6's text output to experiments/34-load-test/results.txt.
  • Record the run in experiments/34-load-test/manifest.json:
{
  "experiment": "34-load-test",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "versions": {"python": "...", "k6": "...", "miniserve_git_sha": "..."},
  "hardware": {...from learners/borja/profile.md...},
  "config": {"peak_vus": 100, "duration_s": 300, "prompts": [...]},
  "results_summary": {
    "rps_peak": null,
    "p50_ms": null,
    "p95_ms": null,
    "p99_ms": null,
    "error_rate_pct": null,
    "mean_cost_per_1k_tokens_usd": null,
    "p95_cost_per_1k_tokens_usd": null
  }
}

Fill results_summary after the run.
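
A small check like the following can confirm that the manifest is complete before committing. It is a sketch: the function names are hypothetical, and the stage durations are passed in by hand (60 + 180 + 60 seconds for the ramp, hold, and ramp-down described above) rather than parsed from loadtest.js.

```python
from typing import Any, Dict, List


def unfilled_results(manifest: Dict[str, Any]) -> List[str]:
    """Names of results_summary fields that are still null."""
    summary = manifest.get("results_summary", {})
    return sorted(key for key, value in summary.items() if value is None)


def stages_match_duration(stage_seconds: List[int], manifest: Dict[str, Any]) -> bool:
    """True if the k6 stage durations add up to config.duration_s."""
    return sum(stage_seconds) == manifest["config"]["duration_s"]


if __name__ == "__main__":
    # Trimmed-down manifest in the template's initial state (all results still null).
    manifest = {
        "config": {"peak_vus": 100, "duration_s": 300},
        "results_summary": {"rps_peak": None, "p50_ms": None, "error_rate_pct": None},
    }
    print("still null:", unfilled_results(manifest))
    print("duration consistent:", stages_match_duration([60, 180, 60], manifest))
```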

Block E — cost calibration

  • In experiments/34-cost-calibration/calibrate.py, run a short fixed workload (10 requests) at three rate settings: LLM_RATE_USD_PER_HOUR=0.085, 0.17, 0.34.
  • For each, record the mean cost_per_request_usd.
  • Verify: the three means are in 1:2:4 ratio (within ~5% noise). That confirms linearity.
  • Commit a plot experiments/34-cost-calibration/linearity.png: x = rate, y = mean cost.
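
The ratio check in Block E can be sketched as below: each mean cost is compared to the first one, and the observed ratio must stay within 5% of the ratio of the rates. The function name is hypothetical, and the mean costs in the demo are invented placeholders, not measured results.

```python
from typing import Sequence


def is_linear(rates: Sequence[float], mean_costs: Sequence[float],
              tolerance: float = 0.05) -> bool:
    """Check that mean cost scales proportionally with the hourly rate."""
    if len(rates) != len(mean_costs) or len(rates) < 2:
        raise ValueError("need at least two (rate, mean cost) pairs of equal length")
    base_rate, base_cost = rates[0], mean_costs[0]
    for rate, cost in zip(rates[1:], mean_costs[1:]):
        expected_ratio = rate / base_rate   # 2 and 4 for the lab's three rates
        observed_ratio = cost / base_cost
        if abs(observed_ratio - expected_ratio) / expected_ratio > tolerance:
            return False
    return True


if __name__ == "__main__":
    rates = (0.085, 0.17, 0.34)
    placeholder_means = (1.00e-4, 2.04e-4, 3.90e-4)  # invented values for illustration
    print("linear within 5%:", is_linear(rates, placeholder_means))
```

Because the workload is fixed and only the rate changes, any deviation beyond run-to-run timing noise points at a bug, such as the rate being read once at import time instead of from the environment for each run.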

Constraints

  • No prompt/completion bodies stored anywhere. Verify with grep -ri prompt experiments/34-load-test/ (should match only the loadtest.js fixtures and the manifest's "prompts" config key, not any output).
  • Single-host load test. Don't try to distribute the load generator yet; that is Phase 35.
  • Don't size the VU count so that it pegs the CPU. 100 VUs on Borja's 4-core box should leave the CPU at ~70-80%. If it pegs at 100% with constant queueing, lower to 50 VUs, update peak_vus in the manifest, and document the change.

Stop conditions

Done when:

  1. src/observability/cost.py exists, conforms to contract.
  2. Every request emits a histogram observation in cost_per_request_usd.
  3. infra/grafana/dashboards/llm.json committed, opens with all 12+ panels populated.
  4. experiments/34-load-test/ has manifest + script + results + screenshot.
  5. experiments/34-cost-calibration/ proves linearity (3 points, 1:2:4 ratio).
  6. PHASE_34_REPORT.md drafted with: peak RPS, p50/p95/p99 latency, error rate, mean cost per 1k completion tokens, p95 cost per 1k completion tokens.

Pitfalls (read before debugging)

  • Cost numbers wildly off. If your cost/1k-tokens is 100× the order-of-magnitude estimate from theory file 02 ($4.8e-4), you're double-counting somewhere or summing all batched requests' decode wall times rather than per-request effective time. Re-read Correction 1.
  • Heatmap panel empty. The heatmap needs the per-bucket _bucket series grouped by le, not a histogram_quantile() result or the _sum/_count series. Query sum by (le) (rate(cost_per_request_usd_bucket[1m])); note the le grouping.
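  • Total spend panel shows 0. Check that the query targets cost_per_request_usd_sum: increase() over a histogram's _sum gives spend, while over _count it gives "requests served". Also confirm that requests were actually observed inside the panel's time range.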
  • k6 errors out with "max VUs reached". Add options.stages and --vus-max flag.
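
For the first pitfall above, one rough way to spot double-counting is to compare the cost attributed to requests in a time window with what the machine cost over that same window. The sketch below is an illustration under assumptions: the function name is hypothetical, and the "even split" case assumes a batch's decode time is divided equally among its requests, which may not match the exact definition in theory file 02, Correction 1.

```python
from typing import Sequence


def attributed_fraction(request_costs_usd: Sequence[float],
                        rate_usd_per_hour: float,
                        window_seconds: float) -> float:
    """Attributed request cost divided by the machine's cost for the window."""
    machine_cost = rate_usd_per_hour / 3600.0 * window_seconds
    return sum(request_costs_usd) / machine_cost


if __name__ == "__main__":
    rate = 0.17
    per_second = rate / 3600.0
    batch_size, batch_wall_seconds = 4, 2.0  # made-up batch for illustration

    # Naive: every request in the batch is charged the full batch wall time.
    naive = [batch_wall_seconds * per_second] * batch_size
    # Assumed even split: each request is charged an equal share of the batch wall time.
    split = [batch_wall_seconds / batch_size * per_second] * batch_size

    print("naive fraction:", attributed_fraction(naive, rate, batch_wall_seconds))
    print("split fraction:", attributed_fraction(split, rate, batch_wall_seconds))
```

A fraction well above 1 over a busy window suggests overlapping wall time is being charged more than once. Values somewhat above 1 can still occur legitimately if non-decode stages of concurrent requests overlap, so read it as a hint rather than a proof.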

When to consult solutions/

Consult solutions/ only after all six Stop conditions pass (the sixth includes the PHASE_34_REPORT.md draft). The solution at solutions/03-cost-and-loadtest-ref.md walks through the expected numbers for Borja's specific hardware.


Phase done. Write PHASE_34_REPORT.md, run /phase-report 34, then stop and wait for Borja's go-ahead.