Lab 00: INT8 Post-Training Quantization on MiniGPT
Goal: implement weights-only INT8 PTQ (per-tensor and per-channel) and measure perplexity vs FP32 on MiniGPT.
Estimated time: 4–6 hours.
Prereq: MiniGPT from Phase 17 with a working `model.eval()` forward pass; PyTorch from Phase 24; `src/miniquant/BLUEPRINT.md` read.
What you produce
A directory `experiments/26-int8-ptq/` containing:
- `quantize_minigpt.py`: your script (you write).
- `results.json`: measurements (PPL FP32, PPL INT8-per-tensor, PPL INT8-per-channel, bytes-on-disk for each, calibration size).
- `ppl_table.png`: bar chart or table image.
- `manifest.json`: `{seed, versions, config, hardware}` per `LYNX_CORTEX.md` §5.
- `README.md`: short interpretation (2–4 paragraphs).
You also commit to `src/miniquant/`:
- `quantize.py`: the per-tensor and per-channel symmetric quantizers and a `QuantizedLinear` module. Tests pass.
The kernel
The "kernel" of this lab is to wrap every `nn.Linear` in MiniGPT with a `QuantizedLinear` whose forward path is:
$$y = x \, (s_W \odot W_{\text{int8}})^\top + b$$
where `W_int8 = quantize_symmetric(W, scheme)` is computed once at calibration and stored as INT8 with a per-tensor or per-channel scale `s_W` (FP32), and `b` is the original bias, kept in FP32.
This is fake-quant: we store the INT8 values but the matmul still happens in FP32. The point is to measure the numerical effect of quantization, not the speed. (Speed requires INT8 kernels, which we don't have on AVX2-without-VNNI.)
TODOs
Block A: implement the quantizer in `src/miniquant/quantize.py`
The BLUEPRINT lists the API. Recap:
- `quantize_symmetric_per_tensor(W: Tensor, bits: int = 8) -> (Tensor[int8], float)`. Returns `(W_int, scale)` with `W_int ∈ [-127, 127]` and `scale = max(|W|) / 127`.
- `quantize_symmetric_per_channel(W: Tensor, bits: int = 8, dim: int = 0) -> (Tensor[int8], Tensor[float])`. Per-row scales.
- `dequantize(W_int: Tensor[int8], scale: Tensor) -> Tensor`. Broadcasts scale correctly.
- `QuantizedLinear(nn.Module)`. Constructor takes an existing `nn.Linear` + scheme; forward does fake-quant matmul; preserves bias in FP32.
- Tests in `tests/test_quantization.py` (Claude scaffolds the failing tests; Borja makes them pass).
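The recap above can be sketched as follows. This is one possible reading of the signatures, not the reference solution: it assumes a 2-D weight of shape `(out, in)`, round-to-nearest, and a `1e-9` floor on the scale (the guard suggested in the pitfalls). Returning the per-channel scale with `keepdim=True` is a choice made here so that `dequantize` broadcasts without extra reshaping.

```python
import torch
from torch import Tensor


def quantize_symmetric_per_tensor(W: Tensor, bits: int = 8) -> tuple[Tensor, float]:
    """One scale for the whole tensor: scale = max(|W|) / qmax."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    # Floor the scale so an all-zero tensor does not divide by zero.
    scale = max(W.abs().max().item() / qmax, 1e-9)
    W_int = torch.clamp(torch.round(W / scale), -qmax, qmax).to(torch.int8)
    return W_int, scale


def quantize_symmetric_per_channel(
    W: Tensor, bits: int = 8, dim: int = 0
) -> tuple[Tensor, Tensor]:
    """One scale per slice along `dim` (dim=0: one scale per output row)."""
    qmax = 2 ** (bits - 1) - 1
    # Reduce over every axis except `dim`; keepdim gives shape (out, 1) for dim=0.
    reduce_dims = tuple(d for d in range(W.dim()) if d != dim)
    scale = (W.abs().amax(dim=reduce_dims, keepdim=True) / qmax).clamp(min=1e-9)
    W_int = torch.clamp(torch.round(W / scale), -qmax, qmax).to(torch.int8)
    return W_int, scale


def dequantize(W_int: Tensor, scale) -> Tensor:
    """Map INT8 codes back to FP32; `scale` is a float or a broadcastable tensor."""
    return W_int.to(torch.float32) * scale
```

With round-to-nearest, the reconstruction error of each weight is at most half of its scale, which makes a convenient property to assert in `tests/test_quantization.py`.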
Block B: wrap MiniGPT
- Load Phase 17's MiniGPT in eval mode.
- Walk the module tree, replace every `nn.Linear` with `QuantizedLinear(orig_linear, scheme)`. Note: do not quantize the embedding (it's an `nn.Embedding`, not `nn.Linear`; and quantizing embeddings hurts more than weights, see theory 02).
- Optional: skip the final `lm_head` linear too (this matches the LLM.int8() convention; the readout layer is sensitive). Measure with and without skipping; report both.
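A minimal sketch of the tree walk, under some assumptions: `replace_linears` and the `skip` set are hypothetical names, the `PassThrough` wrapper is only a stand-in for your `QuantizedLinear`, and the toy model's attribute names (`embed`, `block`, `lm_head`) are illustrative and may not match MiniGPT's.

```python
from typing import Callable

import torch.nn as nn


def replace_linears(
    module: nn.Module,
    make_quantized: Callable[[nn.Linear], nn.Module],
    skip: frozenset = frozenset(),
    prefix: str = "",
) -> int:
    """Recursively swap nn.Linear children in place; returns how many were replaced."""
    replaced = 0
    # Snapshot the children so reassigning attributes does not disturb iteration.
    for name, child in list(module.named_children()):
        full_name = f"{prefix}{name}"
        if isinstance(child, nn.Linear):
            if full_name not in skip:
                setattr(module, name, make_quantized(child))
                replaced += 1
        else:
            # nn.Embedding, LayerNorm, etc. are left alone; containers are descended into.
            replaced += replace_linears(child, make_quantized, skip, f"{full_name}.")
    return replaced


class PassThrough(nn.Module):
    """Stand-in for QuantizedLinear, used only to show the swap."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.inner = linear

    def forward(self, x):
        return self.inner(x)


toy = nn.ModuleDict({
    "embed": nn.Embedding(10, 4),
    "block": nn.Sequential(nn.Linear(4, 4), nn.GELU(), nn.Linear(4, 4)),
    "lm_head": nn.Linear(4, 10),
})
n_replaced = replace_linears(toy, PassThrough, skip=frozenset({"lm_head"}))
```

Running the walk twice, once with `lm_head` in `skip` and once without, gives the two variants the optional step asks you to report.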
Block C: calibration
For per-channel weights-only quantization, no calibration is needed (weights are static). For activation quantization, you'd need calibration; we skip that in this lab and only quantize weights.
- Confirm: your `QuantizedLinear` forward pass returns a tensor of the same shape and dtype as the original `Linear`. Add an assertion in the test.
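One way the module and the shape/dtype check could look. This is a sketch, not the reference `QuantizedLinear`: the scheme strings `"per_tensor"` and `"per_channel"`, the buffer names, and the inlined quantization math are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuantizedLinear(nn.Module):
    """Fake-quant linear: INT8 weights in storage, FP32 matmul at run time."""

    def __init__(self, linear: nn.Linear, scheme: str = "per_channel"):
        super().__init__()
        W = linear.weight.detach()
        if scheme == "per_channel":
            max_abs = W.abs().amax(dim=1, keepdim=True)  # shape (out, 1)
        elif scheme == "per_tensor":
            max_abs = W.abs().max().reshape(1, 1)        # shape (1, 1)
        else:
            raise ValueError(f"unknown scheme: {scheme!r}")
        scale = (max_abs / 127.0).clamp(min=1e-9)
        W_int8 = torch.clamp(torch.round(W / scale), -127, 127).to(torch.int8)
        # Buffers travel with state_dict() but are not trainable parameters.
        self.register_buffer("W_int8", W_int8)
        self.register_buffer("scale", scale)
        # Bias stays in FP32, untouched by quantization.
        self.bias = (
            None if linear.bias is None
            else nn.Parameter(linear.bias.detach().clone(), requires_grad=False)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize on the fly; the FP32 weight is never cached on the module.
        W = self.W_int8.to(self.scale.dtype) * self.scale
        return F.linear(x, W, self.bias)


torch.manual_seed(0)
lin = nn.Linear(16, 8)
qlin = QuantizedLinear(lin, scheme="per_channel")
x = torch.randn(3, 16)
y_ref, y_q = lin(x), qlin(x)
assert y_q.shape == y_ref.shape and y_q.dtype == y_ref.dtype
```

The final assertion is the check this block asks for; the same line can be lifted into the test file.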
Block D: evaluate perplexity
- Use the same held-out perplexity eval from Phase 17 (`scripts/eval_minigpt_ppl.py`). Run on:
  - FP32 (baseline).
  - INT8 per-tensor.
  - INT8 per-channel.
- Record bytes-on-disk after each quantization (sum of `numel × dtype_size` over all weights, including scales).
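The byte count can be sketched as below. It assumes the INT8 weights and their scales are registered as buffers on the wrapped modules, so both `parameters()` and `buffers()` are summed; `tensor_bytes` and `model_bytes` are hypothetical helper names. Note that this is the `numel × dtype_size` figure defined above, not the size of a serialized checkpoint file.

```python
import torch
import torch.nn as nn


def tensor_bytes(t: torch.Tensor) -> int:
    """numel x dtype_size for a single tensor."""
    return t.numel() * t.element_size()


def model_bytes(model: nn.Module) -> int:
    """Sum over parameters and buffers, so INT8 weights and FP32 scales
    held as buffers are counted alongside biases and embeddings."""
    params = sum(tensor_bytes(p) for p in model.parameters())
    buffers = sum(tensor_bytes(b) for b in model.buffers())
    return params + buffers


# An FP32 Linear(8, 4): (8*4 weights + 4 biases) * 4 bytes each.
fp32_layer_bytes = model_bytes(nn.Linear(8, 4))
```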
Block E: `results.json`
{
"experiment": "26-int8-ptq",
"date": "YYYY-MM-DD",
"model": "minigpt-phase17",
"model_params": null,
"schemes": {
"fp32": { "ppl": null, "bytes": null },
"int8_per_tensor": { "ppl": null, "bytes": null, "ppl_gap_pct": null },
"int8_per_channel": { "ppl": null, "bytes": null, "ppl_gap_pct": null }
},
"notes": "..."
}
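A sketch of filling the template. It assumes `ppl_gap_pct` means the relative PPL increase over FP32 in percent, which the template does not spell out; `build_results` is a hypothetical helper and the inputs you pass to it should be your own measurements.

```python
import json


def ppl_gap_pct(ppl_quant: float, ppl_fp32: float) -> float:
    """Relative perplexity increase over the FP32 baseline, in percent (assumed definition)."""
    return 100.0 * (ppl_quant - ppl_fp32) / ppl_fp32


def build_results(ppl: dict, nbytes: dict, model_params: int, date: str, notes: str = "") -> dict:
    """Assemble the results.json structure from measured PPL and byte counts."""
    schemes = {"fp32": {"ppl": ppl["fp32"], "bytes": nbytes["fp32"]}}
    for name in ("int8_per_tensor", "int8_per_channel"):
        schemes[name] = {
            "ppl": ppl[name],
            "bytes": nbytes[name],
            "ppl_gap_pct": ppl_gap_pct(ppl[name], ppl["fp32"]),
        }
    return {
        "experiment": "26-int8-ptq",
        "date": date,
        "model": "minigpt-phase17",
        "model_params": model_params,
        "schemes": schemes,
        "notes": notes,
    }


def write_results(results: dict, path: str) -> None:
    """Write the dict as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
```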
Block F: interpret in `README.md`
Three questions:
- What is the PPL gap per-tensor vs per-channel? Per-channel should be ≤ half the per-tensor gap. If it isn't, your model is unusually outlier-free or your weights are already low-precision somehow.
- Where is most of the byte savings? Compute the % of total bytes attributable to (a) weights of `Linear`s, (b) the embedding table, (c) layer-norm parameters. Embedding tables often dominate small-model byte counts.
- Which layer's quantization hurts most? Hint: re-run with only one layer quantized at a time, measure PPL each time, plot a bar chart. Usually the output projection of attention or the final `lm_head` is the worst.
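The one-layer-at-a-time hint in the third question can be organized as a small sweep. This is a sketch under assumptions: `layer_sensitivity` is a hypothetical helper, `quantize_layer(model, name)` stands for your own routine that wraps exactly one named `nn.Linear` in place, and `eval_ppl(model)` stands for the Phase 17 perplexity eval.

```python
import copy
from typing import Callable, Dict, Iterable

import torch.nn as nn


def layer_sensitivity(
    model: nn.Module,
    layer_names: Iterable[str],
    quantize_layer: Callable[[nn.Module, str], None],
    eval_ppl: Callable[[nn.Module], float],
) -> Dict[str, float]:
    """Quantize one layer at a time and record the resulting perplexity."""
    results: Dict[str, float] = {}
    for name in layer_names:
        # Work on a copy so every trial starts from the untouched FP32 model.
        trial = copy.deepcopy(model)
        quantize_layer(trial, name)
        results[name] = eval_ppl(trial)
    return results
```

The returned dict maps layer name to PPL, which is the shape a bar chart needs. Deep-copying per trial costs memory but keeps trials independent; for MiniGPT-sized models on CPU that trade-off is probably acceptable.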
Constraints
- No high-level wrappers from `torch.quantization`. You may use low-level utilities (`torch.int8`, `tensor.to(torch.int8)`), but the quantization math itself is yours.
- No `bitsandbytes`. Same reason: black-box.
- Reproducibility: `seed_everything(42)` at the top of every script.
- CPU only. No CUDA gate needed; assume the calibration dataset is small enough that FP32 forward passes complete in minutes.
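If your repository does not already provide `seed_everything`, a minimal CPU-only version might look like the sketch below. Which generators it should cover is an assumption here: Python's `random`, PyTorch, and NumPy when it happens to be installed.

```python
import random

import torch


def seed_everything(seed: int = 42) -> None:
    """Seed the RNGs a CPU-only PyTorch script is likely to touch."""
    random.seed(seed)
    torch.manual_seed(seed)
    try:
        import numpy as np  # optional dependency; seeded only if present
        np.random.seed(seed)
    except ImportError:
        pass


seed_everything(42)
first = torch.rand(3)
seed_everything(42)
second = torch.rand(3)  # same seed, so the same draw as `first`
```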
Stop conditions
Done when:
- Tests in `tests/test_quantization.py` all pass.
- `experiments/26-int8-ptq/` has all five files.
- INT8 per-channel PPL gap is < 5% (the DoD threshold); if not, debug per the `pitfalls/` notes below before consulting solutions.
- `README.md` answers all three Block F questions.
Pitfalls
- PPL gap > 20%. You probably forgot to dequantize before the matmul, or scale broadcasting is wrong (per-channel scale needs shape `(out, 1)` not `(out,)` when multiplying a `(out, in)` matrix).
- PPL gap suspiciously small (< 0.1%). You may have accidentally kept the original FP32 weights cached on the module. Print `model.layers[0].mlp.fc1.weight.dtype` after wrapping; it should be FP32 (the dequantized result), and `model.layers[0].mlp.fc1.W_int8.dtype` should be `int8`.
- Memory blows up. You're keeping both INT8 and FP32 copies. The dequantized weight should be computed on the fly per forward, not cached.
- NaN in output. The scale is zero: `max(|W|) = 0` for some pathological row under per-channel scaling, or for an all-zero tensor under per-tensor scaling. Add a `max(s, 1e-9)` guard.
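The broadcasting pitfall is easy to miss because it only raises an error when `out != in`; on a square weight it silently scales the wrong axis. A tiny illustration with made-up values:

```python
import torch

W_int = torch.ones(2, 2, dtype=torch.int8)  # (out=2, in=2), all codes equal to 1
scale = torch.tensor([1.0, 2.0])            # intended: one scale per output row

# Shape (2,) aligns with the LAST axis, so this scales columns, not rows.
wrong = W_int.float() * scale
# Shape (2, 1) broadcasts across the input axis, scaling each row as intended.
right = W_int.float() * scale.unsqueeze(1)

print(wrong)  # rows are [1, 2] and [1, 2]
print(right)  # rows are [1, 1] and [2, 2]
```

Storing the per-channel scale with shape `(out, 1)` from the start avoids having to remember the `unsqueeze` at every use.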
When to consult `solutions/`
After all five files are committed and the DoD threshold is met. The reference at solutions/00-int8-ptq-ref.md (written at phase open) compares your numbers and the structure of QuantizedLinear.
Next lab: lab/01-gptq-toy.md.