
Lab 03 — FinOps table: cost per quality unit for the grammar tutor

Goal: produce docs/COSTS.md, with one row per registered grammar-tutor model showing conjugation accuracy, cost per 1k tokens, and CpQU (cost per quality unit).

Estimated time: 2–3 hours.

Prerequisites: lab 00 done (registry has ≥ 3 entries); lab 01 done (latency data exists); the Phase 20 eval harness can score each entry against the canonical conjugation table.

What you produce

experiments/38-finops/ containing:

compute.py — driver that reads the registry, runs eval, computes costs.
cost_inputs.json — manually-recorded hardware-rate and timing inputs.
cpqu.json — the per-entry numbers.
manifest.json.

Plus, outside the experiment directory:

docs/COSTS.md — the FinOps table. Committed to the docs site.

TODOs

Block A — gather cost inputs

For each registered model's canonical SHA (lab 00 produced ≥ 3), record:

Hardware-hour rate. On Borja's local i5-8250U (CPU-only), use $0.05/hr as the amortized electricity + hardware rate (matches Phase 34's notional rate). Record it in cost_inputs.json together with a note explaining where the rate comes from. JSON has no comment syntax, so store the note as a string field next to the value.
Tokens-per-second per model. Take it from the Phase 33 serving benchmarks, or measure it by running a 60-second throughput test on each model variant against a fixed 200-sentence Phase 20 subset. Record it in cost_inputs.json.
Conjugation accuracy. Run the Phase 20 eval set against each model; capture the aggregated conjugation_accuracy plus the per-bucket breakdowns (per tense, per person, per verb). Save the full eval report in cost_inputs.json under each SHA.
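
One possible layout for cost_inputs.json is sketched below. The lab does not fix a schema, so every field name here (`hardware`, `rate_usd_per_hr`, `note`, `models`, `tps`, `eval_report`) is an assumption, and the SHA and numbers are placeholders taken from the illustrative table in Block D.

```python
import json


def build_cost_inputs(rate_usd_per_hr, note, models):
    """Assemble the cost-input record (hypothetical layout, not a fixed schema).

    models maps canonical SHA -> {"tps": float, "eval_report": dict}.
    """
    return {
        # The note is a plain string field because JSON cannot hold comments.
        "hardware": {"rate_usd_per_hr": rate_usd_per_hr, "note": note},
        "models": {
            sha: {"tps": m["tps"], "eval_report": m["eval_report"]}
            for sha, m in models.items()
        },
    }


inputs = build_cost_inputs(
    0.05,
    "notional amortized electricity + hardware rate, i5-8250U, CPU-only",
    {
        "a1b2c3d4": {  # placeholder SHA with illustrative numbers
            "tps": 1.42,
            "eval_report": {
                "conjugation_accuracy": 0.752,
                "per_tense": {},
                "per_person": {},
                "per_verb": {},
            },
        }
    },
)

# sort_keys keeps the serialized file stable from run to run.
text = json.dumps(inputs, indent=2, sort_keys=True)
```

Keeping the full eval report under each SHA means Blocks C and D can read accuracy and per-bucket numbers from one file.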

Block B — compute cost-per-1k-tokens

For each canonical SHA $s$:

\[\text{cost\_per\_1k}_s = \frac{\text{rate}_{\$/\text{hr}}}{\text{tps}_s \cdot 3.6}\]

(The factor 3.6 converts tokens/sec into thousands of tokens per hour: 3600 sec/hr ÷ 1000 tokens per 1k = 3.6.)

Compute and store in cpqu.json keyed by canonical SHA.
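
The formula above can be sketched as a small function. The function and parameter names are illustrative rather than part of the lab's required interface, and the worked number uses a made-up throughput of 2 tokens/sec.

```python
def cost_per_1k(rate_usd_per_hr: float, tps: float) -> float:
    """Dollars per 1000 generated tokens at a given hourly rate and throughput."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    # tps * 3.6 is the number of 1k-token units produced per hour.
    return rate_usd_per_hr / (tps * 3.6)


# 2 tokens/sec is 7200 tokens/hr, i.e. 7.2 thousand tokens for $0.05.
print(round(cost_per_1k(0.05, 2.0), 5))  # 0.00694
```

A quick sanity check on the units: at a rate of $3.60/hr and 1 token/sec, the function returns exactly $1.00 per 1k tokens.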

Block C — compute CpQU

For each SHA:

\[\text{CpQU}_s = \frac{\text{cost\_per\_1k}_s}{\text{conjugation\_accuracy}_s}\]

Add to cpqu.json. Lower is better.
Guard: if conjugation_accuracy < 0.30, mark CpQU as "not-deployment-ready" and skip the division. A small denominator would inflate the result into a misleadingly large number.
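
A minimal sketch of the division plus the guard, assuming the sentinel is stored as the literal string from the text. The constant and function names are hypothetical.

```python
MIN_ACCURACY = 0.30  # guard threshold from Block C
NOT_READY = "not-deployment-ready"


def cpqu(cost_per_1k, conjugation_accuracy):
    """Return cost per quality unit, or the sentinel string below the threshold."""
    if conjugation_accuracy < MIN_ACCURACY:
        return NOT_READY  # skip the division entirely
    return cost_per_1k / conjugation_accuracy


print(round(cpqu(0.0042, 0.748), 4))  # 0.0056
print(cpqu(0.0042, 0.12))             # not-deployment-ready
```

Returning a string for guarded rows means downstream code has to check the type before formatting or sorting; a separate status field would be an equally reasonable design.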

Block D — emit `docs/COSTS.md`

Create or update docs/COSTS.md with one row per registered canonical SHA:

| SHA (8) | Semver | Conjugation accuracy | $/1k tokens | CpQU | Notes |
|---|---|---|---|---|---|
| `a1b2c3d4` | v0.1.0 | 0.752 | 0.0098 | 0.0130 | FP32 baseline (Phase 18), no grammar-tutor specialization |
| `e5f6a7b8` | v0.2.0 | 0.748 | 0.0042 | 0.0056 | INT8 (Phase 26), 0.4pp drop |
| `c9d0e1f2` | v0.3.0 | 0.781 | 0.0114 | 0.0146 | LoRA grammar tutor (Phase 28) |

(Numbers are illustrative — Borja's actual numbers come from his measurements.)
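
The table can be generated rather than typed by hand. The sketch below assumes each row is a dict with the hypothetical keys `sha`, `semver`, `accuracy`, `cost_per_1k`, `cpqu`, and `notes`; it emits rows in the order given, so pass them in a fixed order (for example, registry order) to keep the output deterministic.

```python
HEADER = "| SHA (8) | Semver | Conjugation accuracy | $/1k tokens | CpQU | Notes |"
SEPARATOR = "|---|---|---|---|---|---|"


def render_costs_table(rows):
    """Render the FinOps table as markdown text from a list of row dicts."""
    lines = [HEADER, SEPARATOR]
    for r in rows:
        value = r["cpqu"]
        # Guarded rows carry a sentinel string instead of a number.
        cpqu_cell = f"{value:.4f}" if isinstance(value, float) else str(value)
        lines.append(
            f"| `{r['sha'][:8]}` | {r['semver']} | {r['accuracy']:.3f} "
            f"| {r['cost_per_1k']:.4f} | {cpqu_cell} | {r['notes']} |"
        )
    return "\n".join(lines) + "\n"


example = render_costs_table([
    {"sha": "e5f6a7b8", "semver": "v0.2.0", "accuracy": 0.748,
     "cost_per_1k": 0.0042, "cpqu": 0.0042 / 0.748, "notes": "INT8 (Phase 26)"},
])
```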

Below the table, write 2–3 paragraphs covering these four points:
Which entry has the best CpQU. Usually INT8: similar accuracy, cheaper compute. State whether your data confirms this.
Which entry has the best raw conjugation accuracy. Usually the LoRA grammar tutor (Phase 28's whole point), but at higher CpQU.
Per-bucket caveat. Aggregate accuracy hides per-tense and per-verb differences. Note any bucket where the "best" model is not the LoRA. These are areas where the LoRA did not help or actively regressed.
Operational recommendation. Given Borja's deployment constraints (the Phase 39 capstone runs on a small CPU instance, perhaps a single cloud GPU at most), state which entry should be the default serving model and which should be the premium-tier model.
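
For the per-bucket caveat, a small helper can list the buckets where some other entry beats the LoRA. This is a sketch: the input shape (bucket name mapped to per-model accuracy) and the model labels are assumptions, and the numbers are made up for illustration.

```python
def bucket_winners(per_bucket):
    """per_bucket maps bucket name -> {model label -> accuracy}."""
    # On ties, max() returns the first label in the dict's insertion order.
    return {b: max(scores, key=scores.get) for b, scores in per_bucket.items()}


def buckets_not_won_by(per_bucket, label):
    """Sorted list of buckets where `label` is not the top-scoring model."""
    winners = bucket_winners(per_bucket)
    return sorted(b for b, w in winners.items() if w != label)


# Hypothetical per-tense accuracies, not measured data.
per_tense = {
    "preterite": {"fp32": 0.70, "int8": 0.69, "lora": 0.78},
    "past_participle": {"fp32": 0.81, "int8": 0.80, "lora": 0.74},
}
print(buckets_not_won_by(per_tense, "lora"))  # ['past_participle']
```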

Block E — close the loop with lab 01

Revisit experiments/38-shadow-ab/report.md and fill in the operational recommendation that was left as a placeholder, using the CpQU numbers from this lab plus the per-bucket conjugation breakdowns.

Block F — manifest + `Justfile`

manifest.json records: registry canonical SHA list used; hardware rate; Phase 20 eval set DVC hash; tokens-per-second methodology.
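Add a just cpqu recipe to the Justfile. It invokes scripts/mlops/cpqu.py, which reads the registry, runs eval, and regenerates docs/COSTS.md. The recipe must be idempotent: re-running it on unchanged inputs leaves docs/COSTS.md unchanged.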
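
A sketch of a manifest builder that stays byte-stable across runs. The key names are assumptions, and the hash and methodology strings are placeholders. The idea is to avoid anything run-dependent (timestamps, unordered collections) so that repeated runs produce identical files.

```python
import json


def build_manifest(shas, rate_usd_per_hr, eval_set_dvc_hash, tps_methodology):
    """Collect the four items the manifest must record (hypothetical key names)."""
    return {
        "registry_shas": sorted(shas),  # sorted so input order does not matter
        "hardware_rate_usd_per_hr": rate_usd_per_hr,
        "eval_set_dvc_hash": eval_set_dvc_hash,
        "tps_methodology": tps_methodology,
    }


def dump_stable(obj):
    """Serialize with sorted keys and a trailing newline for clean diffs."""
    return json.dumps(obj, indent=2, sort_keys=True) + "\n"


manifest_text = dump_stable(
    build_manifest(
        ["e5f6a7b8", "a1b2c3d4"],         # placeholder SHAs
        0.05,
        "<eval-set-dvc-hash>",            # placeholder, fill from DVC
        "median of 5 runs, 60 s each",    # placeholder wording
    )
)
```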

Constraints

One eval set. All CpQU rows share the same Phase 20 eval set (verified via DVC hash). If you change the eval set, the denominator changes and rows are incomparable. Document the eval-set DVC hash in the manifest.
No cloud cost API. Costs are hand-recorded. Integration with AWS/GCP billing is out of scope (see PHASE_38_PLAN.md open question f).
No latency in CpQU. CpQU is cost vs quality. Latency lives in the lab 01 report. Don't conflate them.
No new src/<module>/. cpqu.py lives in scripts/mlops/.
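
The one-eval-set constraint can be enforced in code before any table is written. The sketch assumes each row records the DVC hash of the eval set it was scored on under a hypothetical `eval_set_dvc_hash` key.

```python
def assert_single_eval_set(rows):
    """Return the shared eval-set hash, or raise if rows disagree or are empty."""
    hashes = {r["eval_set_dvc_hash"] for r in rows}
    if len(hashes) != 1:
        raise ValueError(
            f"expected exactly one eval-set hash across rows, got {sorted(hashes)}"
        )
    return next(iter(hashes))
```

Failing loudly here is cheaper than discovering later that two rows of docs/COSTS.md were scored on different eval sets.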

Stop conditions

Done when:

docs/COSTS.md exists and has ≥ 3 rows.
The lab 01 report has its operational recommendation filled in.
cpqu.json is committed and matches the table.
just cpqu regenerates docs/COSTS.md deterministically.

Pitfalls

Misleading CpQU when accuracy is near zero. Apply the guard from Block C: if accuracy < 0.30, report "not-deployment-ready" instead of a CpQU number.
Comparing CpQU across eval sets. Don't. Lock the eval set per docs/COSTS.md; if it changes, recompute all rows.
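Hardware rate inflation. If you move from the laptop's notional rate to a real cloud rate in Phase 39 while keeping the same tps measurements, every CpQU value scales by the same factor, so the ranking is preserved. If the hardware itself changes, tps changes per model and the ranking can change too, so re-measure tps. Either way, be explicit about which rate was used.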
Per-bucket vs aggregate. A model that wins aggregate CpQU might be 10pp worse on past-participle conjugations — a problem if the actual learner cohort focuses on past participles. Always read the per-bucket eval before the table.
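Tokens-per-second on a noisy machine. Background processes on Borja's laptop will perturb tps. Run the throughput test on an otherwise idle machine with 3–5 repeats and record the median. Note that nice -n 19 gives the test the lowest scheduling priority, so it makes the test yield to background work rather than shielding it; if you use it, apply it identically to every model so the runs stay comparable.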
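
The repeat-and-take-the-median advice can be sketched as follows. `generate` is a hypothetical callable standing in for whatever runs one sentence through a model and returns the number of tokens it produced; the real harness will differ.

```python
import statistics
import time


def measure_tps(generate, sentences, repeats=5):
    """Run the workload `repeats` times; return (median tps, all per-run tps values).

    generate(sentence) must return the number of tokens produced (assumed interface).
    """
    runs = []
    for _ in range(repeats):
        start = time.perf_counter()
        tokens = sum(generate(s) for s in sentences)
        elapsed = time.perf_counter() - start
        runs.append(tokens / elapsed)
    # The median is less sensitive than the mean to one run hit by background load.
    return statistics.median(runs), runs
```

Storing the individual runs alongside the median makes it easy to see afterwards whether one repeat was an outlier.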

When to consult `solutions/`

After all six blocks. solutions/03-finops-ref.md (phase open) reviews the CpQU formulation, the per-bucket commentary, and the docs/COSTS.md layout for clarity.

Next lab: lab/04-ci-deploy-gate.md.