Lab 03: FinOps table (cost per quality unit for the grammar tutor)
Goal: compute `docs/COSTS.md`, with one row per registered grammar-tutor model showing conjugation accuracy, cost per 1k tokens, and CpQU (cost per quality unit).
Estimated time: 2–3 hours.
Prereq: lab 00 done (registry has ≥ 3 entries); lab 01 done (latency data exists); Phase 20 eval harness can score each entry against the canonical conjugation table.
What you produce
experiments/38-finops/ containing:
- `compute.py`: driver that reads the registry, runs eval, computes costs.
- `cost_inputs.json`: manually recorded hardware-rate and timing inputs.
- `cpqu.json`: the per-entry numbers.
- `manifest.json`.
Plus, outside the experiment directory:
- `docs/COSTS.md`: the FinOps table. Committed to the docs site.
TODOs
Block A: gather cost inputs
For each registered model canonical SHA (lab 00 produced ≥ 3):
- Hardware-hour rate. On Borja's local i5-8250U (CPU-only), use $0.05/hr as the amortized electricity + hardware rate (matches Phase 34's notional rate). Record in `cost_inputs.json` with a comment (JSON has no comment syntax, so a sibling string field is one way to carry it).
- Tokens-per-second per model. Take it from the Phase 33 serving benchmarks, or run a 60-second throughput test on each model variant against a fixed 200-sentence Phase 20 subset. Record in `cost_inputs.json`.
- Conjugation accuracy. Run the Phase 20 eval set against each model; capture the aggregated `conjugation_accuracy` plus the per-bucket breakdowns (per tense, per person, per verb). Save the full eval report in `cost_inputs.json` under each SHA.
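One possible layout for `cost_inputs.json` is sketched below. The lab does not prescribe a schema, so every field name here (`_comment`, `hardware_rate_per_hour_usd`, `models`, `tps_source`, `eval_report`) is an assumption, and the SHA key and all numbers are placeholders rather than measurements.

```python
import json

# All field names are hypothetical; the lab does not prescribe a schema.
cost_inputs = {
    "_comment": "Notional amortized electricity + hardware rate for the local i5-8250U.",
    "hardware_rate_per_hour_usd": 0.05,
    "models": {
        # Keyed by canonical SHA; this key and every number are placeholders.
        "<canonical-sha>": {
            "tokens_per_second": 2.0,
            "tps_source": "60-second throughput test, median of repeats",
            "eval_report": {
                "conjugation_accuracy": 0.75,
                "per_tense": {"present": 0.80, "preterite": 0.70},
                "per_person": {},
                "per_verb": {},
            },
        }
    },
}

# sort_keys keeps the file stable across runs, which makes diffs readable.
serialized = json.dumps(cost_inputs, indent=2, sort_keys=True)
print(serialized)
```

Keeping the whole eval report under each SHA means the per-bucket breakdowns travel with the cost inputs and do not need to be recomputed when writing the commentary.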
Block B: compute cost-per-1k-tokens
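For each canonical SHA \(s\):
\[
\text{cost\_per\_1k}(s) = \frac{\text{hardware\_rate\_per\_hour}}{\text{tps}(s) \times 3.6}
\]
(The factor 3.6 converts tokens per second into thousands of tokens per hour: tokens/sec × 3600 sec/hr ÷ 1000 tokens-per-k = 3.6.)
- Compute and store in `cpqu.json`, keyed by canonical SHA.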
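A minimal Python sketch of this step, dividing the hourly hardware rate by the number of thousand-token units produced per hour. The function and parameter names are illustrative assumptions, not names from the lab's codebase, and the throughput value is made up for the example.

```python
def cost_per_1k_tokens(hardware_rate_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per 1,000 tokens.

    tokens/sec * 3600 sec/hr / 1000 tokens-per-k = tokens_per_second * 3.6
    thousand-token units per hour.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return hardware_rate_per_hour / (tokens_per_second * 3.6)


# Illustrative input only: 2.0 tokens/sec is not a measured value.
example_cost = cost_per_1k_tokens(hardware_rate_per_hour=0.05, tokens_per_second=2.0)
print(f"{example_cost:.4f}")  # 0.0069
```

Rejecting a non-positive throughput up front avoids a silent division by zero if a throughput test failed and recorded 0.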
Block C: compute CpQU
For each SHA \(s\):
\[
\text{CpQU}(s) = \frac{\text{cost\_per\_1k}(s)}{\text{conjugation\_accuracy}(s)}
\]
- Add to `cpqu.json`. Lower is better.
- Guard: if `conjugation_accuracy < 0.30`, mark CpQU as `"not-deployment-ready"` and skip the division; the result would be a misleadingly large number driven by a near-zero denominator.
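The guard can be sketched as follows, assuming CpQU is the cost per 1k tokens divided by conjugation accuracy (which is consistent with the illustrative rows in Block D). The names `cpqu`, `ACCURACY_FLOOR`, and `NOT_READY` are hypothetical.

```python
from typing import Union

ACCURACY_FLOOR = 0.30  # guard threshold from Block C
NOT_READY = "not-deployment-ready"


def cpqu(cost_per_1k: float, conjugation_accuracy: float) -> Union[float, str]:
    """Cost per quality unit: cost per 1k tokens divided by accuracy.

    Returns the sentinel string instead of a number when accuracy is below
    the floor, so a near-zero denominator never produces a huge value.
    """
    if conjugation_accuracy < ACCURACY_FLOOR:
        return NOT_READY
    return cost_per_1k / conjugation_accuracy


# Inputs taken from the illustrative table in Block D, plus one low-accuracy case.
print(round(cpqu(0.0098, 0.752), 4))  # 0.013
print(cpqu(0.0098, 0.12))             # not-deployment-ready
```

Returning a string sentinel means downstream code has to branch on the type before formatting or sorting; that is deliberate, since it stops a guarded entry from being ranked alongside real numbers.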
Block D: emit docs/COSTS.md
- Create or update `docs/COSTS.md` with one row per registered canonical SHA:
| SHA (8) | Semver | Conjugation accuracy | $/1k tokens | CpQU | Notes |
|---|---|---|---|---|---|
| `a1b2c3d4` | v0.1.0 | 0.752 | 0.0098 | 0.0130 | FP32 baseline (Phase 18), no grammar-tutor specialization |
| `e5f6a7b8` | v0.2.0 | 0.748 | 0.0042 | 0.0056 | INT8 (Phase 26), 0.4pp drop |
| `c9d0e1f2` | v0.3.0 | 0.781 | 0.0114 | 0.0146 | LoRA grammar tutor (Phase 28) |
(Numbers are illustrative — Borja's actual numbers come from his measurements.)
- Below the table, write 2–3 paragraphs on:
  - Which entry has the best CpQU. Usually INT8: similar accuracy, cheaper compute. State whether your data confirms this.
  - Which entry has the best raw conjugation accuracy. Usually the LoRA grammar tutor (Phase 28's whole point), but at higher CpQU.
  - Per-bucket caveat. Aggregate accuracy hides per-tense and per-verb differences. Note any bucket where the "best" model is not the LoRA; these are areas the LoRA actually made worse.
  - Operational recommendation. Given Borja's deployment constraints (Phase 39 capstone runs on a small CPU instance, perhaps a single cloud GPU at most), state which model would be the default serving model and which would be the premium-tier model.
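For the per-bucket caveat, a small helper can list the buckets where a chosen model is beaten by another entry. This is a sketch under assumptions: the input shape (bucket name mapped to per-model accuracy), the function name, and the model labels are hypothetical, and the numbers are made up.

```python
def buckets_where_not_best(
    per_bucket: dict[str, dict[str, float]], candidate: str
) -> dict[str, tuple[str, float]]:
    """Return {bucket: (best_model, accuracy_gap)} where `candidate` is beaten.

    `per_bucket` maps a bucket name (a tense, person, or verb) to a
    {model_id: accuracy} mapping.
    """
    beaten = {}
    for bucket, scores in per_bucket.items():
        best_model = max(scores, key=scores.get)
        if scores[best_model] > scores[candidate]:
            gap = round(scores[best_model] - scores[candidate], 4)
            beaten[bucket] = (best_model, gap)
    return beaten


# Made-up accuracies for illustration only.
per_bucket = {
    "preterite": {"lora": 0.80, "fp32": 0.70},
    "past_participle": {"lora": 0.60, "fp32": 0.71},
}
print(buckets_where_not_best(per_bucket, candidate="lora"))
# {'past_participle': ('fp32', 0.11)}
```

The gap is reported as a fraction, so 0.11 corresponds to 11 percentage points.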
Block E: close the loop with lab 01
- Revisit `experiments/38-shadow-ab/report.md` and fill in the operational recommendation that was left as a placeholder, using the CpQU numbers from this lab plus the per-bucket conjugation breakdowns.
Block F: manifest + Justfile
- `manifest.json` records: registry canonical SHA list used; hardware rate; Phase 20 eval set DVC hash; tokens-per-second methodology.
- Add `just cpqu` to the `Justfile`. It invokes `scripts/mlops/cpqu.py`, which reads the registry, runs eval, and regenerates `docs/COSTS.md`. Idempotent.
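A manifest that supports idempotent regeneration could be built along these lines. The key names and the `build_manifest` helper are assumptions for illustration, and the DVC hash is left as a placeholder.

```python
import json


def build_manifest(registry_shas, hardware_rate_per_hour, eval_set_dvc_hash, tps_methodology):
    """Serialize the manifest deterministically: sorted SHAs, sorted keys, no timestamps."""
    manifest = {
        "registry_canonical_shas": sorted(registry_shas),
        "hardware_rate_per_hour_usd": hardware_rate_per_hour,
        "eval_set_dvc_hash": eval_set_dvc_hash,
        "tps_methodology": tps_methodology,
    }
    return json.dumps(manifest, indent=2, sort_keys=True) + "\n"


# Placeholder inputs; substitute the real registry SHAs and DVC hash.
text = build_manifest(
    registry_shas=["e5f6a7b8", "a1b2c3d4", "c9d0e1f2"],
    hardware_rate_per_hour=0.05,
    eval_set_dvc_hash="<dvc-hash-of-phase-20-eval-set>",
    tps_methodology="60-second throughput test, median of 3-5 repeats",
)
print(text)
```

Leaving out wall-clock timestamps is a design choice: with them, two runs over identical inputs would produce different files and the output would no longer be byte-for-byte repeatable.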
Constraints
- One eval set. All CpQU rows share the same Phase 20 eval set (verified via DVC hash). If you change the eval set, the denominator changes and rows are incomparable. Document the eval-set DVC hash in the manifest.
- No cloud cost API. Costs are hand-recorded. Integration with AWS/GCP billing is out of scope (see `PHASE_38_PLAN.md`, open question f).
- No latency in CpQU. CpQU is cost vs quality. Latency lives in the lab 01 report. Don't conflate them.
- No new `src/<module>/`. `cpqu.py` lives in `scripts/mlops/`.
Stop conditions
Done when:
- `docs/COSTS.md` exists and has ≥ 3 rows.
- The lab 01 report has its operational recommendation filled in.
- `cpqu.json` is committed and matches the table.
- `just cpqu` regenerates `docs/COSTS.md` deterministically.
Pitfalls
- Misleading CpQU when accuracy is near zero. Add the guard from Block C: if accuracy < 0.30, report `"not-deployment-ready"`, not a CpQU number.
- Comparing CpQU across eval sets. Don't. Lock the eval set per `docs/COSTS.md`; if it changes, recompute all rows.
- Hardware rate inflation. If you migrate from the laptop notional rate to a real cloud rate in Phase 39, every cost-per-1k-tokens value, and therefore every CpQU, scales by the same factor. The ranking is preserved as long as the tokens-per-second figures are unchanged, but be explicit about which rate was used.
- Per-bucket vs aggregate. A model that wins aggregate CpQU might be 10pp worse on past-participle conjugations — a problem if the actual learner cohort focuses on past participles. Always read the per-bucket eval before the table.
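- Tokens-per-second on a noisy machine. Background processes on Borja's laptop will perturb `tps`. Run the throughput test with `nice -n 19` and 3–5 repeats; record the median. Note that `nice -n 19` gives the test the lowest scheduling priority, so it yields to other work rather than being shielded from it; close background applications before measuring.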
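Taking the median over repeats can be sketched as below. The helper name and the `(tokens, seconds)` input shape are assumptions, and the run data is made up.

```python
import statistics


def median_tps(runs: list[tuple[int, float]]) -> float:
    """Median tokens-per-second over repeated runs.

    Each run is (tokens_generated, elapsed_seconds).
    """
    rates = [tokens / seconds for tokens, seconds in runs]
    return statistics.median(rates)


# Three made-up 60-second runs; the slow middle run mimics background noise.
runs = [(120, 60.0), (90, 60.0), (126, 60.0)]
print(median_tps(runs))  # 2.0
```

The median is used rather than the mean so that a single run disturbed by a background process does not drag the recorded figure down.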
When to consult solutions/
After all six blocks. solutions/03-finops-ref.md (phase open) reviews the CpQU formulation, the per-bucket commentary, and the docs/COSTS.md layout for clarity.
Next lab: lab/04-ci-deploy-gate.md.