
Lab 02 — The Quantization Pareto Curve

Goal: sweep over {FP32, FP16, INT8 per-tensor, INT8 per-channel, INT4 per-group=64, INT4 per-group=128, INT4 per-group=64 + GPTQ} and plot the Pareto curve of perplexity (PPL) vs bytes for MiniGPT.

Estimated time: 3–4 hours (most spent waiting on calibration; coding is light).

Prereq: labs 00 and 01 committed; MiniGPT loadable.

What you produce

A directory experiments/26-quant-curve/ containing:

sweep.py — driver script.
results.json — measurements per scheme.
pareto.png — semi-log plot (bytes on a log-scale x-axis, PPL on a linear y-axis), one dot per scheme.
verb_tense_accuracy.png — bar chart, verb-tense classification accuracy per scheme (the task metric per §A13).
roofline_overlay.png — re-plot of the Phase 1 roofline with MiniGPT inference dots at FP32, INT8, INT4 overlaid.
manifest.json.
README.md — interpretation.

The sweep

| Scheme | Implementation | Where from |
| --- | --- | --- |
| FP32 | No quantization; baseline. | MiniGPT |
| FP16 | `model.half()`. | PyTorch |
| INT8 per-tensor | `src/miniquant/quantize.py`. | Lab 00 |
| INT8 per-channel | `src/miniquant/quantize.py`. | Lab 00 |
| INT4 per-group=64 | `src/miniquant/quantize.py` (Borja extends from INT8; write the INT4 wrapper). | Lab 00 extended |
| INT4 per-group=128 | Same as above with `group_size=128`. | Same |
| INT4 per-group=64 + GPTQ | `src/miniquant/gptq.py` applied per layer. | Lab 01 extended |

Notice that the last row introduces a batch GPTQ over all Linears in MiniGPT: this is the integration of lab 01's per-layer GPTQ into the full model. It's not a separate algorithm; it's lab 01 wrapped in a `for layer in model.linears: gptq_quantize(layer, calib_data)` loop.

TODOs

Block A — extend the quantizer

Add `quantize_symmetric_per_group(W, bits=4, group_size=64, dim=1)` to `src/miniquant/quantize.py`. Reshape W: (out, in) → (out, in/group_size, group_size); pick one scale per group along the last axis; quantize; reshape back. This requires `in` to be divisible by `group_size`.
Storage: store each weight in an int8 container holding values in [-7, 7] (symmetric INT4 uses 15 of the 16 available codes; -8 is dropped to keep the grid symmetric). Don't bit-pack to a real 4-bit format yet; that's lab 03 (GGUF export).
Add tests for the new function (Claude scaffolds).
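The reshape, scale, quantize, reshape-back steps above can be sketched as follows. This is a minimal round-to-nearest illustration, not the lab's reference implementation. It uses NumPy instead of PyTorch to stay short, supports only `dim=1`, and keeps scales in float32 (the lab stores them as FP16). `dequantize_per_group` is a hypothetical helper added so the round trip can be checked.

```python
import numpy as np


def quantize_symmetric_per_group(W, bits=4, group_size=64, dim=1):
    """Round-to-nearest symmetric quantization with one scale per group.

    W has shape (out_features, in_features); groups run along the input axis.
    Returns (q, scales): q is int8 with codes in [-qmax, qmax],
    scales has shape (out_features, in_features // group_size).
    """
    if dim != 1:
        raise NotImplementedError("this sketch only groups along dim=1")
    out_features, in_features = W.shape
    if in_features % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")

    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, so the grid is [-7, 7]
    groups = W.reshape(out_features, in_features // group_size, group_size)

    # One scale per group, chosen so the largest magnitude maps to qmax.
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / qmax, 1.0)  # all-zero group: any scale works

    q = np.clip(np.rint(groups / scales), -qmax, qmax).astype(np.int8)
    return q.reshape(out_features, in_features), scales.squeeze(-1)


def dequantize_per_group(q, scales, group_size=64):
    """Hypothetical inverse: expand each group's scale and multiply."""
    out_features, in_features = q.shape
    groups = q.reshape(out_features, in_features // group_size, group_size)
    W_hat = groups.astype(np.float32) * scales[..., None]
    return W_hat.reshape(out_features, in_features)
```

A useful property to assert in the tests: for every group, the reconstruction error of each weight is at most half that group's scale.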

Block B — extend `QuantizedLinear`

Accept a scheme argument: "per_tensor", "per_channel", "per_group_64", "per_group_128", "gptq_per_group_64".
For the GPTQ scheme, the constructor also accepts a calibration matrix H (computed externally via lab 01's machinery).
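One way to keep the scheme strings in a single place is a small lookup table that `QuantizedLinear` could consult in its constructor. The sketch below is an assumption about how that might look, not the lab's reference design: `SchemeSpec`, `SCHEMES`, and `resolve_scheme` are hypothetical names, and the bit-widths are inferred from the sweep table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SchemeSpec:  # hypothetical helper, not part of miniquant
    bits: int
    granularity: str              # "per_tensor", "per_channel", or "per_group"
    group_size: Optional[int] = None
    needs_hessian: bool = False   # True only for the GPTQ scheme


SCHEMES = {
    "per_tensor": SchemeSpec(bits=8, granularity="per_tensor"),
    "per_channel": SchemeSpec(bits=8, granularity="per_channel"),
    "per_group_64": SchemeSpec(bits=4, granularity="per_group", group_size=64),
    "per_group_128": SchemeSpec(bits=4, granularity="per_group", group_size=128),
    "gptq_per_group_64": SchemeSpec(
        bits=4, granularity="per_group", group_size=64, needs_hessian=True
    ),
}


def resolve_scheme(scheme, H=None):
    """Validate a scheme string and check that GPTQ received its H."""
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme {scheme!r}; expected one of {sorted(SCHEMES)}")
    spec = SCHEMES[scheme]
    if spec.needs_hessian and H is None:
        raise ValueError(f"scheme {scheme!r} needs a calibration H")
    return spec
```

Failing early on an unknown string or a missing H keeps a typo in `sweep.py` from silently falling back to a different scheme.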

Block C — calibration pipeline

For the GPTQ row:

Run MiniGPT in FP32 on 128 held-out sequences; record per-Linear input activations.
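For each Linear, compute H = X X.T / n, where X is the (in_features, n) matrix of the n recorded input vectors, so H is (in_features, in_features); apply lab 01's GPTQ to get the quantized weight.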
Wrap the original Linear with a QuantizedLinear carrying the GPTQ result.
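Storing every activation for 128 sequences can be memory-hungry, so H can instead be accumulated batch by batch. The sketch below shows one way to do that in NumPy. `HessianAccumulator` is a hypothetical name, and treating n as the number of token positions (rather than sequences) is an assumption the text leaves open. In PyTorch, the per-Linear inputs could be captured with `register_forward_hook` and passed to `update`.

```python
import numpy as np


class HessianAccumulator:
    """Running estimate of H = X X^T / n for one Linear layer.

    X is (in_features, n): one column per input vector seen during calibration.
    """

    def __init__(self, in_features):
        self.H = np.zeros((in_features, in_features), dtype=np.float64)
        self.n = 0

    def update(self, activations):
        # activations: (..., in_features); flatten batch and sequence axes.
        x = np.asarray(activations, dtype=np.float64).reshape(-1, self.H.shape[0])
        # x is X^T, so X X^T equals x^T x.
        self.H += x.T @ x
        self.n += x.shape[0]

    def finalize(self):
        if self.n == 0:
            raise ValueError("no activations were recorded")
        return self.H / self.n
```

Accumulating in float64 is a conservative choice for the sketch; the sum runs over many token positions and H is later inverted or factorized by GPTQ.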

Block D — measure PPL, verb-tense accuracy, and bytes per scheme

PPL via the same eval as lab 00 (held-out split of the verb corpus).
Verb-tense classification accuracy: feed a batch of sentences with a held-out verb conjugation, ask the model to score the correct vs incorrect form (e.g., He __ to the store with candidates walk / walks / walked). Pick the argmax-probability form; count matches against the ground truth. This is the task metric per §A13.
Bytes: sum of all weight storage including scales (FP16 scales for INT8/INT4; subtract embedding bytes if quoting "Linear-only" sizes).
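The bytes accounting for a single Linear can be sketched as below. This is an illustration under stated assumptions: `scale_count` and `linear_bytes` are hypothetical names, biases are ignored, scales cost 2 bytes each (FP16), and `bits` is whatever width you decide to account for. Passing `bits=4` counts INT4 at its packed size even though this lab stores it in int8 containers; pass `bits=8` if you want the in-memory size instead.

```python
def scale_count(out_features, in_features, granularity, group_size=None):
    """Number of scale values a scheme stores for one (out, in) weight."""
    if granularity == "none":        # FP32 / FP16: no scales
        return 0
    if granularity == "per_tensor":
        return 1
    if granularity == "per_channel":
        return out_features
    if granularity == "per_group":
        return out_features * (in_features // group_size)
    raise ValueError(f"unknown granularity {granularity!r}")


def linear_bytes(out_features, in_features, bits, granularity="none",
                 group_size=None, scale_bytes=2):
    """Weight bytes plus scale bytes for one Linear (bias not counted)."""
    weight_bytes = (out_features * in_features * bits + 7) // 8  # round up to whole bytes
    n_scales = scale_count(out_features, in_features, granularity, group_size)
    return weight_bytes + n_scales * scale_bytes


# Illustrative 256x256 layer (the shape is a placeholder, not MiniGPT's).
for name, kwargs in {
    "FP32": dict(bits=32),
    "INT8 per-channel": dict(bits=8, granularity="per_channel"),
    "INT4 per-group=64": dict(bits=4, granularity="per_group", group_size=64),
    "INT4 per-group=128": dict(bits=4, granularity="per_group", group_size=128),
}.items():
    print(f"{name:>20}: {linear_bytes(256, 256, **kwargs)} bytes")
```

Smaller groups mean more scales, which is why per-group=64 costs slightly more bytes than per-group=128 at the same bit-width.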

Block E — Pareto plot

x-axis: bytes, log scale. y-axis: PPL, linear.
One dot per scheme, labelled.
Draw the Pareto frontier (the lower-left envelope).
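The lower-left envelope is the set of schemes that no other scheme beats on both bytes and PPL. A minimal way to extract it, assuming each measurement is a `(name, bytes, ppl)` tuple (the tuple layout and the function name `pareto_frontier` are illustrative choices, not the lab's required format):

```python
def pareto_frontier(points):
    """Return the non-dominated (name, bytes, ppl) points, sorted by bytes.

    A point is dominated if another point has bytes <= and ppl <= its own,
    with at least one of the two strictly smaller.
    """
    # Sort by bytes, breaking ties by PPL, then sweep left to right:
    # a point survives only if it improves on the best PPL seen so far.
    ordered = sorted(points, key=lambda p: (p[1], p[2]))
    frontier = []
    best_ppl = float("inf")
    for name, nbytes, ppl in ordered:
        if ppl < best_ppl:
            frontier.append((name, nbytes, ppl))
            best_ppl = ppl
    return frontier
```

With matplotlib, `ax.set_xscale("log")` gives the log-scaled bytes axis, and the frontier can be drawn as a line through the returned points on top of the labelled dots.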

Block F — roofline overlay

Reuse experiments/01-roofline/'s ceiling lines.
Compute the arithmetic intensity of a single MiniGPT inference step at FP32, INT8, INT4. Plot the dots.
Annotate: which dot is on the memory ceiling, which is on the compute ceiling?
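For the overlay, the intensity of one Linear during a single-token inference step can be estimated as FLOPs divided by total memory traffic. The sketch below is a simplified model under stated assumptions: it covers one Linear only, ignores scales and biases, counts activations at 4 bytes each, and `linear_intensity` and `attainable_gflops` are hypothetical names. The peak and bandwidth values must come from your own Phase 1 roofline.

```python
def linear_intensity(out_features, in_features, weight_bits,
                     act_bytes=4, batch_tokens=1):
    """Arithmetic intensity (FLOPs per byte) of y = W x for one Linear."""
    # One multiply and one add per weight, per token.
    flops = 2 * out_features * in_features * batch_tokens
    weight_traffic = out_features * in_features * weight_bits / 8
    # Read the input vector and write the output vector for each token.
    act_traffic = (in_features + out_features) * batch_tokens * act_bytes
    return flops / (weight_traffic + act_traffic)


def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline ceiling: the lower of the compute and memory limits."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Illustrative 256x256 layer; the shape is a placeholder, not MiniGPT's.
for label, bits in (("FP32", 32), ("INT8", 8), ("INT4", 4)):
    print(f"{label}: {linear_intensity(256, 256, bits):.3f} FLOP/byte")
```

With batch size 1 the weights dominate the traffic, so halving the weight width roughly doubles the intensity; a dot sits on the memory ceiling whenever `intensity * bandwidth_gbs` is below `peak_gflops`.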

Block G — interpret in `README.md`

Four questions:

Which schemes are on the Pareto frontier? Expect: FP32 (one corner), some INT8 scheme (middle), INT4 per-group=64 + GPTQ (other corner). The intermediate FP16 and per-tensor INT8 may be Pareto-dominated.
What's the PPL gap of INT4 per-group=64 + GPTQ vs FP32? Should be < 15% (DoD threshold) and hopefully < 5%. If much worse, your GPTQ pipeline has a bug.
By the roofline overlay, how much theoretical speedup does INT4 buy? Use the intensity ratio. Compare to what a real VNNI CPU would deliver (the "expected" number — Borja's i5-8250U won't show this since it lacks INT8 kernels).
Where would you stop in production? Pick the scheme you'd ship if Borja were deploying this to a Raspberry Pi. Justify in 3 sentences.

Constraints

All measurements use the same eval split (deterministic).
seed_everything(42) at every script start.
Don't quantize the embedding table; keep it FP16. (Surveyed in theory; consistent with production practice for small models.)
Don't quantize layer-norm parameters (they're per-channel scalars; quantization buys nothing).
Bytes accounting must include scales and, for any scheme that stores them, zero-points, not just weights.
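If the repository does not already provide `seed_everything`, a minimal version might look like the sketch below. It is an assumption about what the helper does, not the course's actual implementation; it seeds Python's `random` unconditionally and NumPy and PyTorch only when they are installed.

```python
import random


def seed_everything(seed=42):
    """Seed the random number generators this lab is likely to touch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)      # legacy global NumPy RNG
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)   # seeds PyTorch's default generators
    except ImportError:
        pass
```

Seeding alone does not make the eval split deterministic; the split itself also has to be built from a fixed ordering of the corpus.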

Stop conditions

Done when:

Seven schemes measured and the table is in results.json.
INT4 per-group=64 + GPTQ PPL gap < 15% (DoD).
Roofline overlay shows the intensity shift.
README.md answers all four questions.
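The PPL-gap condition can be checked mechanically once results.json exists. The sketch below assumes a layout of scheme name mapped to a dict with a `"ppl"` key; that layout, the key names `"fp32"` and `"int4_g64_gptq"`, and the function names are all hypothetical, and the numbers are placeholders rather than measurements.

```python
def ppl_gap(ppl_quant, ppl_baseline):
    """Relative PPL increase over the baseline (0.05 means 5% worse)."""
    return (ppl_quant - ppl_baseline) / ppl_baseline


def check_dod(results, scheme="int4_g64_gptq", baseline="fp32", threshold=0.15):
    # `results` maps scheme name -> {"ppl": float, ...}; this layout is an assumption.
    gap = ppl_gap(results[scheme]["ppl"], results[baseline]["ppl"])
    return gap, gap < threshold


# Placeholder numbers for illustration only, not measurements.
example = {"fp32": {"ppl": 20.0}, "int4_g64_gptq": {"ppl": 21.0}}
gap, ok = check_dod(example)
print(f"gap = {gap:.1%}, DoD met: {ok}")
```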

Pitfalls

GPTQ scheme is worse than RTN (round-to-nearest) per-group. Lab 01 didn't fully test on a real MiniGPT layer's H. Re-derive H from real activations and check that the off-diagonals are non-trivially populated.
INT4 PPL is much worse than INT8. Did you set the right grid? Symmetric INT4 should round to [-7, 7], not [-8, 7] or [-127, 127].
The roofline dots don't line up with predictions. Real-world byte counts include scales, activations, and biases. The intensity isn't just "FLOPs / weight bytes"; it's "FLOPs / total memory traffic". Recompute including activations.

When to consult `solutions/`

After all the deliverables listed above are committed and the DoD is met. The reference at solutions/02-quant-curve-ref.md (phase open) compares the Pareto frontier shape.

Next lab: lab/03-gguf-export.md.