Lab 02: The Quantization Pareto Curve
Goal: sweep over {FP32, FP16, INT8 per-tensor, INT8 per-channel, INT4 per-group=64, INT4 per-group=128, INT4 per-group=64 + GPTQ} and plot the Pareto curve of perplexity (PPL) vs. bytes for MiniGPT.
Estimated time: 3–4 hours (most spent waiting on calibration; coding is light).
Prereq: labs 00 and 01 committed; MiniGPT loadable.
What you produce
A directory `experiments/26-quant-curve/` containing:
- `sweep.py`: driver script.
- `results.json`: measurements per scheme.
- `pareto.png`: semi-log plot, bytes on a log-scale x-axis, PPL on a linear y-axis, one dot per scheme.
- `verb_tense_accuracy.png`: bar chart, verb-tense classification accuracy per scheme (the task metric per §A13).
- `roofline_overlay.png`: re-plot of the Phase 1 roofline with MiniGPT inference dots at FP32, INT8, INT4 overlaid.
- `manifest.json`.
- `README.md`: interpretation.
The sweep
| Scheme | Implementation | Where from |
|---|---|---|
| FP32 | No quantization; baseline. | MiniGPT |
| FP16 | `model.half()`. | PyTorch |
| INT8 per-tensor | `src/miniquant/quantize.py`. | Lab 00 |
| INT8 per-channel | `src/miniquant/quantize.py`. | Lab 00 |
| INT4 per-group=64 | `src/miniquant/quantize.py` (Borja extends it from INT8: write the INT4 wrapper). | Lab 00 extended |
| INT4 per-group=128 | Same as above with `group_size=128`. | Same |
| INT4 per-group=64 + GPTQ | `src/miniquant/gptq.py` applied per layer. | Lab 01 extended |
Note that the last row runs GPTQ over every Linear in MiniGPT: this is lab 01's per-layer GPTQ integrated into the full model. It is not a separate algorithm; it is lab 01 in a `for layer in model.linears: gptq_quantize(layer, calib_data)` loop.
TODOs
Block A: extend the quantizer
- Add `quantize_symmetric_per_group(W, bits=4, group_size=64, dim=1)` to `src/miniquant/quantize.py`. Reshape `W: (out, in) → (out, in/group_size, group_size)`; pick one scale per group along the last axis; quantize; reshape back.
- Storage: store as `int8` containing values in `[-7, 7]` (symmetric INT4 uses 15 of the 16 available codes; one code is left unused to keep the grid symmetric). Don't bit-pack to a real 4-bit format yet; that's lab 03 (GGUF export).
- Add tests for the new function (Claude scaffolds).
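The reshape, scale, and round recipe in the first bullet can be sketched as follows. This is a minimal NumPy illustration, not the lab's reference implementation: it assumes `in` is divisible by `group_size`, groups along the input dimension only (the `dim` argument is omitted), uses round-to-nearest, and keeps scales in FP32 for simplicity. `dequantize_per_group` is a hypothetical helper added only to check the round trip.

```python
import numpy as np


def quantize_symmetric_per_group(W, bits=4, group_size=64):
    """Symmetric round-to-nearest quantization, one scale per group of input columns."""
    out_features, in_features = W.shape
    if in_features % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, so codes span [-7, 7]
    groups = W.reshape(out_features, in_features // group_size, group_size)
    # Map the largest magnitude in each group to qmax.
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)  # all-zero groups get scale 1
    q = np.clip(np.rint(groups / scale), -qmax, qmax).astype(np.int8)
    return q.reshape(out_features, in_features), scale.squeeze(-1)


def dequantize_per_group(q, scale, group_size=64):
    """Hypothetical helper: rebuild a float matrix from int8 codes and group scales."""
    out_features, in_features = q.shape
    groups = q.reshape(out_features, in_features // group_size, group_size)
    W_hat = groups.astype(np.float32) * scale[..., None]
    return W_hat.reshape(out_features, in_features)


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 128)).astype(np.float32)
q, scale = quantize_symmetric_per_group(W, bits=4, group_size=64)
print(q.dtype, q.shape, scale.shape)
```

Because each group's largest magnitude maps exactly to ±7, nothing is clipped, and the per-element round-trip error stays within half of that group's scale.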
Block B: extend QuantizedLinear
- Accept a `scheme` argument: `"per_tensor"`, `"per_channel"`, `"per_group_64"`, `"per_group_128"`, `"gptq_per_group_64"`.
- For the GPTQ schemes, the constructor accepts a calibration `H` (computed externally via lab 01's machinery).
Block C: calibration pipeline
For the GPTQ row:
- Run MiniGPT in FP32 on 128 held-out sequences; record per-`Linear` input activations.
- For each `Linear`, compute `H = X X.T / n`, where `X` is `(in, n)` with one column per calibration sample, so `H` is `(in, in)`; apply lab 01's GPTQ to get the quantized weight.
- Wrap the original `Linear` with a `QuantizedLinear` carrying the GPTQ result.
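The `H = X X.T / n` step can be accumulated batch by batch instead of storing every activation. The sketch below assumes activations arrive as arrays whose last axis is the `Linear`'s input dimension and that `n` counts tokens. `HessianAccumulator` and its optional `damp` term are hypothetical names for illustration, not part of lab 01's API.

```python
import numpy as np


class HessianAccumulator:
    """Running estimate of H = X X^T / n for one Linear layer."""

    def __init__(self, in_features):
        self.H = np.zeros((in_features, in_features), dtype=np.float64)
        self.n = 0

    def update(self, activations):
        # Flatten batch and sequence axes: rows are tokens, columns are input features.
        x = np.asarray(activations, dtype=np.float64)
        x = x.reshape(-1, x.shape[-1])
        # x.T @ x equals X X^T when X stores one token per column.
        self.H += x.T @ x
        self.n += x.shape[0]

    def finalize(self, damp=0.0):
        H = self.H / max(self.n, 1)
        # Optional diagonal damping (an assumption here) keeps H well conditioned.
        return H + damp * np.mean(np.diag(H)) * np.eye(H.shape[0])


rng = np.random.default_rng(0)
acc = HessianAccumulator(in_features=16)
for _ in range(4):
    acc.update(rng.standard_normal((2, 32, 16)))  # (batch, seq, in_features)
H = acc.finalize()
print(H.shape, acc.n)
```

In the PyTorch pipeline, `update` would typically be called from a forward hook registered on each `Linear`, so only the running `(in, in)` matrix is kept in memory.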
Block D: measure PPL, verb-tense accuracy, and bytes per scheme
- PPL via the same eval as lab 00 (held-out split of the verb corpus).
- Verb-tense classification accuracy: feed a batch of sentences with a held-out verb conjugation and ask the model to score the correct vs. incorrect forms (e.g., `He __ to the store` with candidates `walk`/`walks`/`walked`). Pick the argmax-probability form; count matches against the ground truth. This is the task metric per §A13.
- Bytes: sum of all weight storage including scales (FP16 scales for INT8/INT4; subtract embedding bytes if quoting "Linear-only" sizes).
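The bytes bullet is where accounting mistakes tend to creep in, so here is a small sketch of the arithmetic for one `Linear`. It assumes nominal bit widths (INT4 counted as 4 bits per weight even though the lab stores codes in `int8` until lab 03), 2-byte FP16 scales, and no zero-points because every scheme is symmetric. The function name and the `"none"` scheme label are hypothetical.

```python
import math


def linear_weight_bytes(out_features, in_features, bits, scheme="none", scale_bytes=2):
    """Nominal storage in bytes for one Linear weight matrix plus its scales."""
    weight_bytes = math.ceil(out_features * in_features * bits / 8)
    if scheme == "none":  # FP32 / FP16: no scales
        n_scales = 0
    elif scheme == "per_tensor":
        n_scales = 1
    elif scheme == "per_channel":
        n_scales = out_features
    elif scheme.startswith("per_group_"):
        group_size = int(scheme.rsplit("_", 1)[1])
        n_scales = out_features * math.ceil(in_features / group_size)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return weight_bytes + n_scales * scale_bytes


for label, bits, scheme in [
    ("FP32", 32, "none"),
    ("INT8 per-channel", 8, "per_channel"),
    ("INT4 per-group=64", 4, "per_group_64"),
    ("INT4 per-group=128", 4, "per_group_128"),
]:
    total = linear_weight_bytes(256, 256, bits, scheme)
    print(f"{label}: {total} bytes, {8 * total / (256 * 256):.3f} bits/weight")
```

With FP16 scales, group size 64 costs 16/64 = 0.25 extra bits per weight and group size 128 costs 0.125, which is why the two INT4 rows land at slightly different x positions on the plot.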
Block E: Pareto plot
- x-axis: bytes, log scale. y-axis: PPL, linear.
- One dot per scheme, labelled.
- Draw the Pareto frontier (the lower-left envelope).
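The lower-left envelope can be computed with one sorted pass before plotting. The sketch below assumes each scheme is summarized as a `(bytes, ppl)` pair and that lower is better on both axes; the function name and the numbers in the usage example are made up for illustration and are not measurements.

```python
def pareto_frontier(points):
    """Return the names of non-dominated schemes, ordered by increasing bytes.

    points maps a scheme name to a (bytes, ppl) tuple.
    """
    # Sort by bytes, breaking ties by PPL so the better point comes first.
    ordered = sorted(points.items(), key=lambda kv: (kv[1][0], kv[1][1]))
    frontier = []
    best_ppl = float("inf")
    for name, (_nbytes, ppl) in ordered:
        # A point survives only if it beats every smaller-or-equal-size point on PPL.
        if ppl < best_ppl:
            frontier.append(name)
            best_ppl = ppl
    return frontier


# Illustrative placeholder values only.
example = {
    "big": (100, 10.0),
    "medium": (50, 10.5),
    "medium_worse": (60, 11.0),
    "small": (25, 12.0),
}
print(pareto_frontier(example))
```

Connecting the returned points in order gives the frontier line; every scheme not in the list is dominated by a smaller scheme with equal or better PPL.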
Block F: roofline overlay
- Reuse `experiments/01-roofline/`'s ceiling lines.
- Compute the arithmetic intensity of a single MiniGPT inference step at FP32, INT8, INT4. Plot the dots.
- Annotate: which dot is on the memory ceiling, which is on the compute ceiling?
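As a rough guide to where the dots should land, the intensity of one `Linear` during single-token inference can be estimated as FLOPs divided by total memory traffic. The sketch below is a simplified model under stated assumptions: 2 FLOPs per multiply-accumulate, weights and scales read once, FP32 activations read and written once, and everything else in the model ignored. The function name and parameters are hypothetical.

```python
import math


def linear_intensity(out_features, in_features, weight_bits, tokens=1,
                     group_size=None, scale_bytes=2, act_bytes=4):
    """Estimated FLOPs per byte of memory traffic for one Linear layer."""
    flops = 2 * tokens * out_features * in_features
    weight_traffic = out_features * in_features * weight_bits / 8
    scale_traffic = 0
    if group_size is not None:
        scale_traffic = out_features * math.ceil(in_features / group_size) * scale_bytes
    # Input activations are read and output activations are written.
    act_traffic = tokens * (in_features + out_features) * act_bytes
    return flops / (weight_traffic + scale_traffic + act_traffic)


fp32 = linear_intensity(512, 512, weight_bits=32)
int8 = linear_intensity(512, 512, weight_bits=8)
int4 = linear_intensity(512, 512, weight_bits=4, group_size=64)
print(f"FP32 {fp32:.2f}  INT8 {int8:.2f}  INT4 {int4:.2f}  INT4/FP32 {int4 / fp32:.2f}x")
```

The INT4-to-FP32 ratio comes out below the ideal 8x because scales and activations do not shrink with the weights, which is the same effect the roofline pitfall further down warns about.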
Block G: interpret in `README.md`
Four questions:
- Which schemes are on the Pareto frontier? Expect: FP32 (one corner), some INT8 scheme (middle), INT4 per-group=64 + GPTQ (other corner). The intermediate FP16 and per-tensor INT8 may be Pareto-dominated.
- What's the PPL gap of INT4 per-group=64 + GPTQ vs FP32? Should be < 15% (DoD threshold) and hopefully < 5%. If much worse, your GPTQ pipeline has a bug.
- By the roofline overlay, how much theoretical speedup does INT4 buy? Use the intensity ratio. Compare to what a real VNNI CPU would deliver (the "expected" number — Borja's i5-8250U won't show this since it lacks INT8 kernels).
- Where would you stop in production? Pick the scheme you'd ship if Borja were deploying this to a Raspberry Pi. Justify in 3 sentences.
Constraints
- All measurements use the same eval split (deterministic).
- `seed_everything(42)` at every script start.
- Don't quantize the embedding table; keep it FP16. (Surveyed in theory; consistent with production practice for small models.)
- Don't quantize layer-norm parameters (they're per-channel scalars; quantization buys nothing).
- Bytes accounting must include scales (and zero-points, if a scheme has them), not just weights.
Stop conditions
Done when:
- Seven schemes measured and the table is in `results.json`.
- INT4 per-group=64 + GPTQ PPL gap < 15% (DoD).
- Roofline overlay shows the intensity shift.
- `README.md` answers all four questions.
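The first two stop conditions are easy to check mechanically. The sketch below assumes the PPL gap is the relative increase over FP32 and that `results.json` has been loaded into a dict keyed by scheme name with a `"ppl"` field per entry; that layout, the scheme keys, and the function names are all hypothetical, since the lab does not fix a schema.

```python
def ppl_gap(ppl_quant, ppl_baseline):
    """Relative perplexity increase over the baseline (0.10 means 10% worse)."""
    return (ppl_quant - ppl_baseline) / ppl_baseline


def check_stop_conditions(results, baseline="fp32", target="int4_g64_gptq",
                          expected_schemes=7, max_gap=0.15):
    """Return a list of unmet conditions; an empty list means both checks pass."""
    problems = []
    if len(results) != expected_schemes:
        problems.append(f"expected {expected_schemes} schemes, found {len(results)}")
    if baseline not in results or target not in results:
        problems.append("baseline or target scheme missing")
    else:
        gap = ppl_gap(results[target]["ppl"], results[baseline]["ppl"])
        if gap >= max_gap:
            problems.append(f"PPL gap {gap:.1%} is not below {max_gap:.0%}")
    return problems


# Placeholder values for illustration only, not measurements.
names = ["fp32", "fp16", "int8_tensor", "int8_channel",
         "int4_g64", "int4_g128", "int4_g64_gptq"]
results = {name: {"ppl": 20.0} for name in names}
results["int4_g64_gptq"]["ppl"] = 21.0
print(check_stop_conditions(results))
```

Running a check like this at the end of `sweep.py` would flag a broken GPTQ pipeline before any time is spent on the plots.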
Pitfalls
- GPTQ scheme is worse than RTN (round-to-nearest) per-group. Lab 01 didn't fully test on a real MiniGPT layer's `H`. Re-derive `H` from real activations and check that off-diagonals are non-trivially populated.
- INT4 PPL is much worse than INT8. Did you set the right grid? Symmetric INT4 should round to `[-7, 7]`, not `[-8, 7]` or `[-127, 127]`.
- The roofline dots don't line up with predictions. Real-world byte counts include scales, activations, and biases. The intensity isn't just "FLOPs / weight bytes"; it's "FLOPs / total memory traffic". Recompute including activations.
When to consult solutions/
After all the files listed under "What you produce" are committed and the DoD is met. The reference at `solutions/02-quant-curve-ref.md` (phase open) compares the Pareto frontier shape.
Next lab: lab/03-gguf-export.md.