Lab 03 — Measure Mask-Construction Overhead

Goal: quantify how much wall-clock time the mask adds per decode step; place this on the roofline mental model from Phase 1.

Estimated time: 60–90 minutes.

Prereq: lab 02 finished.


What you produce

A directory experiments/30-mask-overhead/ containing:

  • bench.py — measurement script.
  • results.json: {vocab_size, per_step_ms_no_mask, per_step_ms_with_mask, ratio}.
  • overhead.png — plot: bar chart of per-step wall time without mask vs with mask.
  • manifest.json.
  • README.md — interpretation.
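A minimal sketch of how bench.py might write results.json with the four keys listed above. The helper names build_results and write_results are hypothetical, and the ratio is taken as per_step_ms_with_mask / per_step_ms_no_mask, matching the stop conditions.

```python
import json
from pathlib import Path


def build_results(vocab_size, per_step_ms_no_mask, per_step_ms_with_mask):
    """Assemble the results.json payload (hypothetical helper)."""
    return {
        "vocab_size": vocab_size,
        "per_step_ms_no_mask": per_step_ms_no_mask,
        "per_step_ms_with_mask": per_step_ms_with_mask,
        # Slowdown factor: values above 1.0 mean the mask adds time.
        "ratio": per_step_ms_with_mask / per_step_ms_no_mask,
    }


def write_results(results, out_dir):
    """Write results.json into out_dir, creating the directory if needed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "results.json"
    path.write_text(json.dumps(results, indent=2) + "\n")
    return path
```

Calling write_results(build_results(512, no_mask_ms, with_mask_ms), "experiments/30-mask-overhead") with your measured values would produce the file in the expected location.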

TODOs

Block A — measurement

  • Use the conjugation-schema mask from lab 01.
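  • Run generate(prompt, mask=None, max_new_tokens=64) 10 times and record the total wall time of the iterations you keep. Divide by (kept iterations × 64) to get per-step ms. This assumes every run emits all 64 tokens; if generation can stop early, divide by the actual token count instead.
  • Run generate(prompt, mask=JSONSchemaMask(...), max_new_tokens=64) 10 times. Record per-step ms the same way.
  • Discard warm-up: drop the first iteration of each configuration (page faults, cold caches, lazy imports). Either run 11 iterations and keep 10, or keep 9 of the 10 and divide by (9 * 64).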
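The measurement loop above can be sketched as follows. This is one possible shape, not the reference solution: per_step_ms and run_once are hypothetical names, and run_once is assumed to wrap a single generate(...) call and return the number of new tokens it produced.

```python
import time


def per_step_ms(run_once, trials=10, warmup=1):
    """Average wall-clock milliseconds per decode step.

    run_once: zero-argument callable that performs one generation and
    returns how many new tokens it emitted (assumption: you wrap your own
    generate() call to provide this).
    """
    # Warm-up iterations run but are never timed.
    for _ in range(warmup):
        run_once()

    total_s = 0.0
    total_steps = 0
    for _ in range(trials):
        start = time.perf_counter()
        steps = run_once()
        total_s += time.perf_counter() - start
        total_steps += steps

    return 1000.0 * total_s / total_steps
```

Dividing by the token count that run_once reports, rather than a hard-coded 64, keeps the number honest if a run ever stops before max_new_tokens.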

Block B — decompose

In bench.py, instrument the mask:

  • How much time is spent in the per-step state-machine simulation across all vocab tokens?
  • How much time is spent adding the mask array to the logits?
  • How much time goes to the rest of the decoder (matmul, softmax, sampling)?

Plot a stacked-bar breakdown.
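One way to collect the three components is an accumulating timer wrapped around the mask code paths. The sketch below uses hypothetical names (SectionTimer, with_rest, and the section labels mask_sim and mask_add) and derives the "rest of the decoder" by subtraction from the total per-step time.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulates wall time per named section across many decode steps."""

    def __init__(self):
        self.totals_s = defaultdict(float)

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals_s[name] += time.perf_counter() - start

    def per_step_ms(self, steps):
        """Convert accumulated seconds into milliseconds per decode step."""
        return {name: 1000.0 * t / steps for name, t in self.totals_s.items()}


def with_rest(section_ms, total_per_step_ms):
    """Add a 'rest' entry: total per-step time minus the timed sections."""
    breakdown = dict(section_ms)
    breakdown["rest"] = total_per_step_ms - sum(section_ms.values())
    return breakdown
```

Inside the decode loop this would look like `with timer.section("mask_sim"): ...` around the state-machine simulation and `with timer.section("mask_add"): ...` around the logits update. The timer calls themselves cost a little, so take the total per-step time from an uninstrumented run if you want the "rest" figure to be clean.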

Block C — interpret

In README.md, answer:

  1. What fraction of per-step time is mask construction? With Phase 30's vocab of ~512 and a state-machine simulation in pure Python, expect a noticeable but not dominant share, perhaps 20–40%. Report your measured value.
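  2. How would this scale to vocab=50k? Mask construction visits every vocab token, so its cost grows roughly linearly with vocab size: 50k / 512 ≈ 100×. At that size mask construction would dominate the step. This is why production uses precomputed masks (theory/03-grammar-as-dfa.md).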
  3. Roofline placement. The mask construction loop is mostly Python branching and dict lookups; it's latency-bound by the interpreter. It is not bandwidth-bound or compute-bound in the usual sense. Document this.
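  4. What's the cost in tokens per second? If the unmasked run gives X tok/s and the masked run gives Y tok/s, then Y / X = per_step_ms_no_mask / per_step_ms_with_mask, the reciprocal of the ratio in results.json. Is Y / X < 0.5? Document the actual ratio.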
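The arithmetic behind questions 2 and 4 can be sketched as below. The function names are hypothetical, and the projection makes a simplifying assumption that is labeled in the code: only the mask cost scales with vocab size, while the rest of the step is held constant. In a real decoder the output projection also grows with vocab, so treat the projected fraction as an upper-bound style estimate.

```python
def project_mask_cost(mask_ms, rest_ms, vocab_from, vocab_to):
    """Project mask cost to a larger vocab under linear scaling.

    Assumption: mask construction is linear in vocab size and rest_ms
    does not change. Returns (projected_mask_ms, mask_fraction_of_step).
    """
    projected_mask_ms = mask_ms * (vocab_to / vocab_from)
    fraction = projected_mask_ms / (projected_mask_ms + rest_ms)
    return projected_mask_ms, fraction


def tokens_per_second(per_step_ms):
    """One token per decode step, so throughput is 1000 / per-step ms."""
    return 1000.0 / per_step_ms


def throughput_ratio(per_step_ms_no_mask, per_step_ms_with_mask):
    """Y / X from question 4: masked throughput over unmasked throughput."""
    return tokens_per_second(per_step_ms_with_mask) / tokens_per_second(
        per_step_ms_no_mask
    )
```

For example, if the mask were 30% of a step at vocab 512, project_mask_cost(0.3, 0.7, 512, 50_000) would put it well above 90% of the step at vocab 50k under these assumptions.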

Constraints

  • CPU governor = performance. Same as Phase 1's labs (learners/borja/profile.md mentions setting this).
  • Single-threaded. Don't introduce threading for this lab.
  • Fixed seed. Both runs use the same RNG seed so the sampling path is identical (only the mask differs).
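If you want bench.py to record whether the governor constraint actually held, a sketch like the following reads the standard Linux cpufreq sysfs files. It assumes a Linux host that exposes cpufreq; on other systems the function returns an empty dict. The names read_governors and all_performance are hypothetical.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq.

    Returns an empty dict when the files are absent (non-Linux hosts,
    some VMs and containers).
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = path.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = path.read_text().strip()
    return governors


def all_performance(governors):
    """True only if at least one CPU was found and all use 'performance'."""
    return bool(governors) and all(g == "performance" for g in governors.values())
```

Storing the result of read_governors() in manifest.json is one way to make the run conditions auditable later.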

Stop conditions

Done when:

  1. overhead.png shows clear per-component breakdown.
  2. README.md includes the four interpretation paragraphs.
  3. The ratio per_step_ms_with_mask / per_step_ms_no_mask is documented and explained.

Pitfalls

  • Garbage collection. Python's GC can fire during the run and skew numbers. Disable it during the measurement (gc.disable() / gc.enable()).
  • Logger overhead. If the decoder logs every step (Phase 21's tracing), that dominates. Disable tracing for this benchmark.
  • Different sampling paths. If your RNG state differs between runs, you might decode different tokens, hitting different mask code paths. Pin the seed and verify the same tokens come out (modulo mask filtering).
  • Small max_new_tokens noise. 64 tokens × 10 trials is a noisy estimate. If the ratio is borderline (say 1.5×), bump to 32 trials.
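Two of the pitfalls above lend themselves to small helpers. The sketch below, with hypothetical names gc_paused and relative_spread, pauses the garbage collector around a timed region and reports how noisy a set of per-trial timings is, which can help decide whether to bump the trial count.

```python
import gc
import statistics
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable the cyclic GC inside the block, restoring the prior state."""
    was_enabled = gc.isenabled()
    gc.collect()  # start the timed region from a clean slate
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def relative_spread(per_trial_ms):
    """Sample standard deviation divided by the mean (needs >= 2 trials)."""
    return statistics.stdev(per_trial_ms) / statistics.mean(per_trial_ms)
```

Wrapping each timed configuration in `with gc_paused():` keeps collection pauses out of the measurement. A relative spread that is large compared with the gap between the masked and unmasked numbers suggests the ratio is not yet trustworthy.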

When to consult solutions/

After your numbers are clean. The solution will probably observe the same shape — most of the per-step overhead is in iterating the state machine through 512 tokens of vocabulary, and the path forward is precomputation.

Reflection prompts

These are not gates; they're for your learners/borja/journal/.

  • Did the mask cost surprise you? Why or why not?
  • Where do you think this would break (e.g., 100× vocab, 10× schema complexity)?
  • Where does this fit on the Phase 1 roofline plot (intensity? bandwidth?)?

Phase 30 labs complete. Write PHASE_30_REPORT.md.