English · Español
Lab 03 — Cost Curves: Memory and Latency vs Context Length¶
Goal: measure memory growth and per-token decode latency vs sequence length, with and without cache. Compare to the theory curves derived in
theory/02andtheory/03. Get within 10% of theory.Estimated time: 2–4 hours.
Prereq: lab 01 + lab 02 complete. Phase 17 MiniGPT model on disk.
What you produce¶
A directory experiments/22-kv-cost/ containing:
bench.py— measurement runner.memory.json— measuredcache_bytesvsSdata points.latency.json— measuredper_token_msvsSdata points (one curve for with-cache, one without).memory.png— measured + theory overlay.latency.png— measured + theory overlay.manifest.json.README.md— interpretation; explicitly state the measured vs theory gap.
The measurements¶
Two curves, both as a function of \(S\) (current cache length):
Curve 1: memory growth¶
For \(S \in \{8, 16, 32, 64, 128, 256, 512, 1024\}\):
- Run prefill on a prompt of length \(S\).
- Measure: cache.bytes_allocated() and RSS_growth_since_baseline (use psutil.Process().memory_info().rss).
- Theory prediction: 2 · L · H · d_h · S_max · B · s (with \(S_\text{max} = 1024\) for all points — the allocation is constant; only the used portion grows). Plot both cache.bytes_allocated() (constant — confirms pre-allocation) and 2 · L · H · d_h · S · B · s (theoretical used bytes; this grows linearly).
Curve 2: per-token decode latency¶
For \(S \in \{8, 16, 32, 64, 128, 256, 512, 1024\}\):
- Prefill a prompt of length \(S\), then generate 1 additional token. Time only the decode step (not prefill).
- Repeat 100×, take median.
- Record: per_token_ms_with_cache and per_token_ms_without_cache.
Theory:
- With cache: \(T \approx \alpha + \beta \cdot S\) for constants \(\alpha, \beta\) derived from theory/01 (the \(12 L d^2\) weight term contributes \(\alpha\); the \(2 L S d\) cache-attention term contributes \(\beta S\)).
- Without cache: \(T \approx \gamma \cdot S^2\) for some \(\gamma\) (each step does a full prefill on the prompt-so-far).
Fit the constants to the lower-\(S\) measurements; check the prediction extrapolates to the high-\(S\) measurements within 10%.
TODOs¶
Block A — write bench.py¶
- Load Phase-17 MiniGPT (verb-grammar trained per §A13). For this lab you need to temporarily bump
S_maxto 1024 to see the growth curve — real prompts in our corpus are all under 32 tokens, but the curve shape is what reveals the scaling. - Synthetic prompts of length \(S\): rather than draw from the grammar corpus (longest natural sentence ~10 tokens), generate length-\(S\) prompts by repeating short grammar fragments. The model output may be nonsense at large \(S\); that's fine — we're measuring cost, not quality.
- Implement the memory measurement loop. Use
seed_everything(42). Usenp.emptyfor pre-allocations, notnp.zeros(zeroing wastes time). - Implement the latency measurement loop. Use
time.perf_counter_ns(). 100 reps with 3 warm-ups. - Save both to JSON.
Block B — fit theory constants¶
- In a separate script or notebook cell: load
latency.json. Fit \(\alpha, \beta\) from the with-cache curve (linear regression on \(S\)). Fit \(\gamma\) from the without-cache curve (quadratic regression). - Report fit quality (\(R^2\)). Document in
README.md.
Block C — plot¶
-
memory.png: x = \(S\), y = bytes (log y). Two lines:cache.bytes_allocated()(flat line at \(S_\text{max}\) bytes), theoretical used bytes (2LHd_h S B sline). Plus the measured RSS growth as dots. Annotate: "pre-allocation hides the growth — the cache is constant-size by design". -
latency.png: x = \(S\), y =per_token_ms. Two lines (with-cache, without-cache) measured + dotted theory curves. Annotate the regime crossover.
Block D — interpret¶
In README.md, three paragraphs:
- Memory. Does measured RSS growth match the theoretical used bytes within 10%? Why might it not (other allocations during the forward pass)?
- Latency. How well does the linear-in-\(S\) fit work for the with-cache curve? Does the without-cache curve actually scale as \(S^2\), or is the constant-factor weight read still dominant at small \(S\)?
- Crossover. At what \(S\) does the cache start to "pay off" relative to the no-cache path? Below that \(S\), the cache is slower due to overhead. Document this.
Block E — manifest¶
{
"experiment": "22-kv-cost",
"date": "YYYY-MM-DD",
"seed": 42,
"versions": {"python": "3.11.x", "numpy": "X.Y.Z", "matplotlib": "X.Y.Z", "psutil": "X.Y.Z"},
"hardware": {
"cpu_model": "Intel Core i5-8250U",
"cores_threads": "4/8",
"ram_gib": 62,
"cpu_governor_at_run": "performance"
},
"model": {
"name": "miniGPT-phase17",
"L": null, "H": null, "d_h": null, "S_max": 1024, "dtype": "float32"
},
"config": {
"S_points": [8, 16, 32, 64, 128, 256, 512, 1024],
"reps_per_point": 100,
"warmup_reps": 3
},
"results_summary": {
"memory_max_relative_error_to_theory": null,
"latency_with_cache_R2": null,
"latency_without_cache_R2": null,
"crossover_S": null
}
}
Constraints¶
- CPU governor:
performance. Otherwise latency numbers are useless. - AC power, close other apps. Same as Phase-1 labs.
- No threading. Single-thread decode for clean measurements. Multi-thread is Phase 35.
- Don't time the prefill. Only the decode step. Prefill timing is a different experiment (and Phase 22's DoD doesn't require it).
without-cachereuses the same model code path, just disables the cache. Don't write a separate model; just passcache=Noneand re-callgenerateper step with a growing prompt.
Stop conditions¶
Done when:
- Memory measurement is within 10% of theory at every \(S\) point.
- With-cache latency fit has \(R^2 \geq 0.95\).
- Without-cache latency curve clearly shows \(O(S^2)\) growth (or you've documented why it doesn't at the scales you measured).
- Both PNGs committed and rendered correctly.
README.mdanswers all three Block D questions.manifest.jsoncommitted withresults_summaryfilled.
Pitfalls¶
bytes_allocated()is constant; that's correct. The pre-allocation is the whole point. Don't "fix" it.- First-iteration timing is huge. Page faults, JIT, cold caches. Drop the first 3 measurements (warm-ups), report median of 100, not mean.
- Memory measurement noise. RSS measurement is noisy — Linux can lazily allocate pages, and
np.emptyallocates virtual but not physical pages until first touch. The "measured" line should track theoretical used (filled) bytes, which is what gets touched. Document this. without-cachefaster than with-cache at small \(S\). Expected. The cache's append + read overhead dominates at \(S < 50\)ish on MiniGPT. The crossover is real and worth understanding.
When to consult solutions/¶
After all stop conditions are met. The reference at solutions/03-cost-curves-ref.md (written at phase open) shows the expected numbers and the fitting code.
Next: PHASE_22_REPORT.md. The phase is done after report + reflection.