Lab 02 — The Quantization Pareto Curve

Goal: sweep over {FP32, FP16, INT8 per-tensor, INT8 per-channel, INT4 per-group=64, INT4 per-group=128, INT4 per-group=64 + GPTQ} and plot the Pareto curve of perplexity vs. bytes for MiniGPT.

Estimated time: 3–4 hours (most spent waiting on calibration; coding is light).

Prereq: labs 00 and 01 committed; MiniGPT loadable.


What you produce

A directory experiments/26-quant-curve/ containing:

  • sweep.py — driver script.
  • results.json — measurements per scheme.
  • pareto.png — log-log plot, bytes on x-axis, PPL on y-axis, one dot per scheme.
  • verb_tense_accuracy.png — bar chart, verb-tense classification accuracy per scheme (the task metric per §A13).
  • roofline_overlay.png — re-plot of the Phase 1 roofline with MiniGPT inference dots at FP32, INT8, INT4 overlaid.
  • manifest.json.
  • README.md — interpretation.

The sweep

Scheme | Implementation | Where from
FP32 | No quantization; baseline. | MiniGPT
FP16 | model.half(). | PyTorch
INT8 per-tensor | src/miniquant/quantize.py. | Lab 00
INT8 per-channel | src/miniquant/quantize.py. | Lab 00
INT4 per-group=64 | src/miniquant/quantize.py (Borja extends from INT8: write the INT4 wrapper). | Lab 00 extended
INT4 per-group=128 | Same as above with group_size=128. | Same
INT4 per-group=64 + GPTQ | src/miniquant/gptq.py applied per layer. | Lab 01 extended

Note that the last row runs GPTQ over every Linear in MiniGPT. This is not a separate algorithm: it is lab 01's per-layer GPTQ integrated into the full model, applied in a loop of the form for layer in model.linears: gptq_quantize(layer, calib_data).

TODOs

Block A — extend the quantizer

  • Add quantize_symmetric_per_group(W, bits=4, group_size=64, dim=1) to src/miniquant/quantize.py. Reshape W: (out, in) → (out, in/group_size, group_size); pick scale per group along the last axis; quantize; reshape back.
  • Storage: store as int8 containing values in [-7, 7] (for symmetric INT4 with 15 used codes; one code wasted for symmetry). Don't bit-pack to a real 4-bit format yet — that's lab 03 (GGUF export).
  • Add tests for the new function (Claude scaffolds).
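A minimal sketch of the per-group quantizer described in this block. It is written with NumPy so it runs standalone (an assumption; the version in src/miniquant/quantize.py would presumably operate on torch tensors). Only dim=1 is handled. Scales are rounded to FP16 before use, so the stored scale is the one that produced the codes. dequantize_per_group is a hypothetical helper, added only for the round-trip check.

```python
import numpy as np

def quantize_symmetric_per_group(W, bits=4, group_size=64, dim=1):
    """Round-to-nearest symmetric quantization, one scale per group of input columns.

    Returns (q, scale): q is int8 with values in [-qmax, qmax] and W's shape;
    scale is float16 with shape (out, in // group_size).
    """
    if dim != 1:
        raise NotImplementedError("this sketch only groups along dim=1")
    out_features, in_features = W.shape
    if in_features % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4: 15 codes used, one left unused

    # (out, in) -> (out, in / group_size, group_size)
    groups = W.reshape(out_features, in_features // group_size, group_size)
    absmax = np.abs(groups).max(axis=-1, keepdims=True)

    # Store scales as FP16; guard all-zero groups and FP16 underflow.
    scale16 = (absmax / qmax).astype(np.float16)
    scale = scale16.astype(np.float32)
    scale = np.where(scale > 0, scale, 1.0)

    q = np.clip(np.rint(groups / scale), -qmax, qmax).astype(np.int8)
    return q.reshape(out_features, in_features), scale16.squeeze(-1)

def dequantize_per_group(q, scale, group_size=64):
    """Hypothetical inverse used for testing: q * scale, group by group."""
    out_features, in_features = q.shape
    groups = q.reshape(out_features, in_features // group_size, group_size)
    W_hat = groups.astype(np.float32) * scale.astype(np.float32)[..., None]
    return W_hat.reshape(out_features, in_features)
```

A round-trip test can check that the codes stay inside [-7, 7] and that the reconstruction error per element stays below one quantization step.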

Block B — extend QuantizedLinear

  • Accept a scheme argument: "per_tensor", "per_channel", "per_group_64", "per_group_128", "gptq_per_group_64".
  • For the GPTQ scheme, the constructor also accepts a calibration H (computed externally via lab 01's machinery).
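One way to keep the scheme strings above from spreading through QuantizedLinear is a small lookup table. The sketch below is illustrative only: SchemeConfig, SCHEMES, and resolve_scheme are hypothetical names, and the bit widths are read off the sweep table (INT8 for per-tensor and per-channel, INT4 for the per-group rows).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SchemeConfig:
    bits: int
    granularity: str                  # "tensor", "channel", or "group"
    group_size: Optional[int] = None  # only set for "group"
    use_gptq: bool = False

# Hypothetical mapping from the scheme argument to its settings.
SCHEMES = {
    "per_tensor": SchemeConfig(bits=8, granularity="tensor"),
    "per_channel": SchemeConfig(bits=8, granularity="channel"),
    "per_group_64": SchemeConfig(bits=4, granularity="group", group_size=64),
    "per_group_128": SchemeConfig(bits=4, granularity="group", group_size=128),
    "gptq_per_group_64": SchemeConfig(bits=4, granularity="group",
                                      group_size=64, use_gptq=True),
}

def resolve_scheme(name, H=None):
    """Validate the scheme name and check that GPTQ schemes come with a calibration H."""
    if name not in SCHEMES:
        raise ValueError(f"unknown scheme {name!r}; expected one of {sorted(SCHEMES)}")
    cfg = SCHEMES[name]
    if cfg.use_gptq and H is None:
        raise ValueError(f"scheme {name!r} needs a calibration H")
    return cfg
```

The constructor could call resolve_scheme(scheme, H) first and branch on the returned fields, which fails early on a typo in the scheme name.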

Block C — calibration pipeline

For the GPTQ row:

  • Run MiniGPT in FP32 on 128 held-out sequences; record per-Linear input activations.
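  • For each Linear, compute H = X X.T / n, where X has shape (in, n) with one column per recorded input vector (equivalently X.T X / n if the activations are stored row-wise as (n, in)); apply lab 01's GPTQ to get the quantized weight.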
  • Wrap the original Linear with a QuantizedLinear carrying the GPTQ result.
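The H computation in the second step can be sketched as follows, assuming the recorded activations for one Linear have been flattened to shape (n, in_features) with one row per input vector, in which case the product is X.T @ X / n. Both function names are hypothetical, and offdiag_energy_fraction is just one possible way to check whether H has meaningful off-diagonal structure.

```python
import numpy as np

def hessian_proxy(X):
    """H = X^T X / n for activations X of shape (n, in_features)."""
    X64 = np.asarray(X, dtype=np.float64)  # accumulate in float64 to limit round-off
    return X64.T @ X64 / X64.shape[0]

def offdiag_energy_fraction(H):
    """Share of H's squared Frobenius norm that lies off the diagonal (0 for diagonal H)."""
    total = float(np.sum(H ** 2))
    diag = float(np.sum(np.diag(H) ** 2))
    return (total - diag) / total if total > 0 else 0.0
```

Activations captured as (batch, seq, in_features) can be flattened with X.reshape(-1, in_features) before the call.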

Block D — measure PPL, verb-tense accuracy, and bytes per scheme

  • PPL via the same eval as lab 00 (held-out split of the verb corpus).
  • Verb-tense classification accuracy: feed a batch of sentences with a held-out verb conjugation, ask the model to score the correct vs incorrect form (e.g., He __ to the store with candidates walk / walks / walked). Pick the argmax-probability form; count matches against the ground truth. This is the task metric per §A13.
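  • Bytes: sum of all weight storage including scales (FP16 scales for INT8/INT4; subtract embedding bytes if quoting "Linear-only" sizes). For INT4, count the nominal packed size of 0.5 bytes per weight rather than the unpacked int8 container from Block A; otherwise INT4 and INT8 land on the same x-coordinate.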
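The verb-tense metric above reduces to an argmax over candidate forms. A minimal sketch, assuming a hypothetical callable log_prob(sentence) that returns the model's total log-probability for a sentence, and examples given as (template, candidates, answer) tuples with "__" marking the blank:

```python
def verb_tense_accuracy(examples, log_prob):
    """Fraction of examples where the highest-scoring candidate is the labelled answer.

    examples: iterable of (template, candidates, answer); template contains "__".
    log_prob: hypothetical scorer, sentence -> float (higher means more likely).
    """
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to score")
    correct = 0
    for template, candidates, answer in examples:
        scores = {c: log_prob(template.replace("__", c)) for c in candidates}
        predicted = max(scores, key=scores.get)  # argmax-probability form
        correct += int(predicted == answer)
    return correct / len(examples)
```

Passing the same examples and the same scoring code to every scheme keeps the accuracy bars comparable; only the model behind log_prob changes.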

Block E — Pareto plot

  • x-axis: bytes, log scale. y-axis: PPL, log scale (pareto.png is specified as a log-log plot).
  • One dot per scheme, labelled.
  • Draw the Pareto frontier (the lower-left envelope).
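The lower-left envelope can be computed with one sorted pass before plotting. A minimal sketch, assuming results are held as a dict mapping scheme name to (bytes, PPL), with lower being better on both axes; pareto_frontier is a hypothetical helper name:

```python
def pareto_frontier(points):
    """Return scheme names on the lower-left envelope, ordered by increasing bytes.

    points: dict of name -> (bytes, ppl). A scheme is on the frontier if no other
    scheme is at least as small and has strictly lower PPL.
    """
    ordered = sorted(points.items(), key=lambda kv: (kv[1][0], kv[1][1]))
    frontier = []
    best_ppl = float("inf")
    for name, (_size, ppl) in ordered:
        if ppl < best_ppl:  # strictly better than everything smaller or equal in bytes
            frontier.append(name)
            best_ppl = ppl
    return frontier
```

Connecting the returned points in order, for example with a step line, draws the frontier; every scheme not in the list is Pareto-dominated.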

Block F — roofline overlay

  • Reuse experiments/01-roofline/'s ceiling lines.
  • Compute the arithmetic intensity of a single MiniGPT inference step at FP32, INT8, INT4. Plot the dots.
  • Annotate: which dot is on the memory ceiling, which is on the compute ceiling?
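As a rough guide to the intensity calculation, the sketch below treats one Linear layer during a single-token step: 2 · out · in FLOPs against weight bytes plus input and output activation traffic. This is a simplification under stated assumptions: FP32 activations, no scales or biases, and a per-layer rather than whole-model count. The function names are hypothetical.

```python
def arithmetic_intensity(out_features, in_features, weight_bits,
                         batch_tokens=1, act_bytes=4):
    """FLOPs per byte of memory traffic for y = W x on one Linear layer."""
    flops = 2 * out_features * in_features * batch_tokens  # one multiply + one add per weight
    weight_bytes = out_features * in_features * weight_bits / 8
    act_traffic = (in_features + out_features) * batch_tokens * act_bytes
    return flops / (weight_bytes + act_traffic)

def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline model: the lower of the compute ceiling and the memory ceiling."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)
```

For a 512 × 512 layer this gives roughly 0.50, 1.97, and 3.88 FLOPs/byte at FP32, INT8, and INT4. The INT4/FP32 ratio comes out slightly under 8 because activation traffic does not shrink with the weights. Plug in the ceilings from experiments/01-roofline/ to see which side of the ridge each dot falls on.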

Block G — interpret in README.md

Four questions:

  1. Which schemes are on the Pareto frontier? Expect: FP32 (one corner), some INT8 scheme (middle), INT4 per-group=64 + GPTQ (other corner). The intermediate FP16 and per-tensor INT8 may be Pareto-dominated.
  2. What's the PPL gap of INT4 per-group=64 + GPTQ vs FP32? Should be < 15% (DoD threshold) and hopefully < 5%. If much worse, your GPTQ pipeline has a bug.
  3. By the roofline overlay, how much theoretical speedup does INT4 buy? Use the intensity ratio. Compare to what a real VNNI CPU would deliver. Treat that as the "expected" number: Borja's i5-8250U won't show it, since it has no VNNI instructions and therefore no fast INT8 kernels.
  4. Where would you stop in production? Pick the scheme you'd ship if Borja were deploying this to a Raspberry Pi. Justify in 3 sentences.

Constraints

  • All measurements use the same eval split (deterministic).
  • seed_everything(42) at every script start.
  • Don't quantize the embedding table; keep it FP16. (Surveyed in theory; consistent with production practice for small models.)
  • Don't quantize layer-norm parameters (they are small per-channel vectors, so quantizing them saves almost nothing).
  • Bytes accounting must include scales, plus zero-points wherever a scheme stores them (symmetric schemes have none), not just weights.
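The bytes-accounting constraint can be sketched per Linear layer as below. Assumptions, none of them confirmed by the lab text: INT4 weights are counted at their nominal packed size of 0.5 bytes each, scales are FP16 (2 bytes), the schemes are symmetric so there are no zero-points, and biases are ignored. Both function names are hypothetical.

```python
def scale_count(out_features, in_features, granularity, group_size=None):
    """Number of scales stored for one (out, in) weight matrix."""
    if granularity == "tensor":
        return 1
    if granularity == "channel":
        return out_features
    if granularity == "group":
        if not group_size or in_features % group_size != 0:
            raise ValueError("in_features must be divisible by group_size")
        return out_features * (in_features // group_size)
    raise ValueError(f"unknown granularity {granularity!r}")

def linear_bytes(out_features, in_features, bits,
                 granularity=None, group_size=None, scale_bytes=2):
    """Weight bytes plus scale bytes; granularity=None means FP32/FP16 (no scales)."""
    weight_bytes = out_features * in_features * bits // 8
    if granularity is None:
        return weight_bytes
    n_scales = scale_count(out_features, in_features, granularity, group_size)
    return weight_bytes + scale_bytes * n_scales
```

For a 512 × 512 layer, INT4 with group_size=64 comes to 131072 + 8192 = 139264 bytes, or 4.25 effective bits per weight; group_size=128 halves the scale overhead to 4.125 bits per weight.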

Stop conditions

Done when:

  1. Seven schemes measured and the table is in results.json.
  2. INT4 per-group=64 + GPTQ PPL gap < 15% (DoD).
  3. Roofline overlay shows the intensity shift.
  4. README.md answers all four questions.

Pitfalls

  • GPTQ scheme is worse than RTN per-group. Lab 01 didn't fully test on a real MiniGPT layer's H. Re-derive H from real activations and check that off-diagonals are non-trivially populated.
  • INT4 PPL is much worse than INT8. Did you set the right grid? Symmetric INT4 should round to [-7, 7], not [-8, 7] or [-127, 127].
  • The roofline dots don't line up with predictions. Real-world byte counts include scales, activations, and biases. The intensity isn't just "FLOPs / weight bytes"; it's "FLOPs / total memory traffic". Recompute including activations.

When to consult solutions/

After all the deliverables listed under "What you produce" are committed and the DoD is met. The reference at solutions/02-quant-curve-ref.md (phase open) compares the Pareto frontier shape.


Next lab: lab/03-gguf-export.md.