Lab 00 — INT8 Post-Training Quantization on MiniGPT

Goal: implement weights-only INT8 PTQ (per-tensor and per-channel) and measure perplexity vs FP32 on MiniGPT.

Estimated time: 4–6 hours.

Prereq: MiniGPT from Phase 17 with a working model.eval() forward pass; PyTorch from Phase 24; src/miniquant/BLUEPRINT.md read.


What you produce

A directory experiments/26-int8-ptq/ containing:

  • quantize_minigpt.py — your script (you write).
  • results.json — measurements (PPL FP32, PPL INT8-per-tensor, PPL INT8-per-channel, bytes-on-disk for each, calibration size).
  • ppl_table.png — bar chart or table image.
  • manifest.json with {seed, versions, config, hardware} per LYNX_CORTEX.md §5.
  • README.md — short interpretation (2–4 paragraphs).

You also commit to src/miniquant/:

  • quantize.py — the per-tensor and per-channel symmetric quantizers and a QuantizedLinear module. Tests pass.

The kernel

The "kernel" of this lab is to wrap every nn.Linear in MiniGPT with a QuantizedLinear whose forward path is:

Linear(x) = x @ (s_W * W_int8.float()).T + b

where W_int8 = quantize_symmetric(W, scheme) is computed once, when the layer is wrapped, and stored as INT8 with a per-tensor or per-channel scale s_W (FP32). Weights-only quantization needs no calibration data, so this step depends only on W.

This is fake-quant: we store the INT8 values but the matmul still happens in FP32. The point is to measure the numerical effect of quantization, not the speed. (Speed requires INT8 kernels, which we don't have on AVX2-without-VNNI.)
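To make the "numerical effect" concrete, here is a small worked example of the symmetric INT8 round trip on a made-up four-element weight vector. It uses plain Python rather than tensors, and the values are illustrative only.

```python
# Worked example: symmetric INT8 quantize -> dequantize on a tiny weight vector.
weights = [0.50, -1.27, 0.013, 0.0]          # made-up values, not from MiniGPT

qmax = 127                                   # symmetric INT8 range is [-127, 127]
scale = max(abs(w) for w in weights) / qmax  # 1.27 / 127, roughly 0.01

# Quantize: divide by the scale, round to nearest, clamp to the range.
q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]

# Dequantize: these are the FP32 values the fake-quant matmul actually sees.
w_hat = [scale * qi for qi in q]

# The rounding error per element is at most half a quantization step.
max_err = max(abs(w - wh) for w, wh in zip(weights, w_hat))
print(q, max_err, scale / 2)
```

The small weight 0.013 is stored as the integer 1 and comes back as roughly 0.01. That per-element rounding error, bounded by scale / 2, is what the perplexity measurement aggregates over the whole model.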

TODOs

Block A — implement the quantizer in src/miniquant/quantize.py

The BLUEPRINT lists the API. Recap:

  • quantize_symmetric_per_tensor(W: Tensor, bits: int = 8) -> (Tensor[int8], float). Returns (W_int, scale) with W_int ∈ [-127, 127] and scale = max(|W|) / 127.
  • quantize_symmetric_per_channel(W: Tensor, bits: int = 8, dim: int = 0) -> (Tensor[int8], Tensor[float]). Per-row scales.
  • dequantize(W_int: Tensor[int8], scale: Tensor) -> Tensor. Broadcasts scale correctly.
  • QuantizedLinear(nn.Module). Constructor takes an existing nn.Linear + scheme; forward does fake-quant matmul; preserves bias in FP32.
  • Tests in tests/test_quantization.py (Claude scaffolds the failing tests; Borja makes them pass).
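The API recap above could be sketched roughly as follows. This is one possible reading of the signatures, not the reference solution. It assumes bits ≤ 8 so values fit in torch.int8, and it returns the per-channel scale with keepdim=True (shape (out, 1) for a 2-D weight) so that dequantize broadcasts without reshaping. The BLUEPRINT may specify a different scale shape.

```python
from typing import Tuple

import torch
from torch import Tensor


def quantize_symmetric_per_tensor(W: Tensor, bits: int = 8) -> Tuple[Tensor, float]:
    qmax = 2 ** (bits - 1) - 1                      # 127 for bits=8
    scale = max(W.abs().max().item() / qmax, 1e-9)  # guard against an all-zero tensor
    W_int = torch.round(W / scale).clamp(-qmax, qmax).to(torch.int8)
    return W_int, scale


def quantize_symmetric_per_channel(W: Tensor, bits: int = 8, dim: int = 0) -> Tuple[Tensor, Tensor]:
    qmax = 2 ** (bits - 1) - 1
    # Reduce over every axis except `dim`; keepdim=True keeps the scale broadcastable.
    reduce_dims = tuple(d for d in range(W.dim()) if d != dim)
    scale = (W.abs().amax(dim=reduce_dims, keepdim=True) / qmax).clamp(min=1e-9)
    W_int = torch.round(W / scale).clamp(-qmax, qmax).to(torch.int8)
    return W_int, scale


def dequantize(W_int: Tensor, scale) -> Tensor:
    # Works for a Python float (per-tensor) or a broadcastable tensor (per-channel).
    return W_int.float() * scale


torch.manual_seed(0)
W = torch.randn(4, 8)
W_q, s = quantize_symmetric_per_channel(W)
W_t, s_t = quantize_symmetric_per_tensor(W)
round_trip_err = (dequantize(W_q, s) - W).abs()
```

For each row, round_trip_err should stay within half of that row's scale, which is a convenient property to assert in tests/test_quantization.py.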

Block B — wrap MiniGPT

  • Load Phase 17's MiniGPT in eval mode.
  • Walk the module tree and replace every nn.Linear with QuantizedLinear(orig_linear, scheme). Note: do not quantize the embedding. It is an nn.Embedding, not an nn.Linear, and quantizing the embedding table hurts more than quantizing Linear weights (see theory 02).
  • Optional: skip the final lm_head linear too (matches LLM.int8() convention — the readout layer is sensitive). Measure with and without skipping; report both.
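The tree walk in Block B might look like the sketch below. FakeQuantLinear is a hypothetical, per-channel-only stand-in for your QuantizedLinear. wrap_linears and its skip argument, which matches on the child's attribute name, are likewise illustrative names rather than part of the BLUEPRINT. The toy nn.Sequential stands in for MiniGPT.

```python
import torch
from torch import nn


class FakeQuantLinear(nn.Module):
    """Hypothetical stand-in for QuantizedLinear (per-channel, weights only)."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        W = linear.weight.detach()
        scale = (W.abs().amax(dim=1, keepdim=True) / 127).clamp(min=1e-9)
        self.register_buffer("W_int8", torch.round(W / scale).clamp(-127, 127).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias  # bias stays in FP32 (may be None)

    def forward(self, x):
        # Dequantize on the fly, then run the usual FP32 linear.
        return nn.functional.linear(x, self.W_int8.float() * self.scale, self.bias)


def wrap_linears(module: nn.Module, wrap, skip=()) -> nn.Module:
    """Recursively replace nn.Linear children, except attribute names in `skip`."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            if name not in skip:
                setattr(module, name, wrap(child))
        else:
            wrap_linears(child, wrap, skip)
    return module


# Toy model: child "3" plays the role of lm_head and is left unquantized.
model = nn.Sequential(nn.Embedding(10, 8), nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 10))
model.eval()
wrap_linears(model, FakeQuantLinear, skip=("3",))
out = model(torch.tensor([[1, 2, 3]]))
```

Because the check is isinstance(child, nn.Linear), the nn.Embedding is left alone without any special casing. Running once with skip=("lm_head",) and once with skip=() would give the two measurements the optional step asks for, assuming the readout layer is named lm_head in your MiniGPT.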

Block C — calibration

For weights-only quantization, whether per-tensor or per-channel, no calibration data is needed because the weights are static. Activation quantization would need calibration; we skip that in this lab and quantize weights only.

  • Confirm: your QuantizedLinear forward returns a tensor of the same shape and dtype as the original Linear. Add an assertion in the test.

Block D — evaluate perplexity

  • Use the same held-out perplexity eval from Phase 17 (scripts/eval_minigpt_ppl.py). Run on:
      • FP32 (baseline).
      • INT8 per-tensor.
      • INT8 per-channel.
  • Record bytes-on-disk after each quantization (sum of numel × dtype_size over all weights, including scales).
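The byte accounting for a single Linear layer can be checked by hand before summing over the model. The helper below is a hypothetical sketch of that arithmetic, assuming FP32 scales and an FP32 bias. The 512 × 512 layer size is an arbitrary example, not MiniGPT's actual dimension.

```python
def linear_bytes(out_features: int, in_features: int, scheme: str, bias: bool = True) -> int:
    """Bytes for one Linear layer: weights + scales + FP32 bias."""
    n = out_features * in_features
    bias_bytes = 4 * out_features if bias else 0
    if scheme == "fp32":
        return 4 * n + bias_bytes                 # 4 bytes per weight
    if scheme == "int8_per_tensor":
        return n + 4 + bias_bytes                 # 1 byte per weight + one FP32 scale
    if scheme == "int8_per_channel":
        return n + 4 * out_features + bias_bytes  # 1 byte per weight + one FP32 scale per row
    raise ValueError(f"unknown scheme: {scheme}")


sizes = {s: linear_bytes(512, 512, s) for s in ("fp32", "int8_per_tensor", "int8_per_channel")}
print(sizes)
```

The per-channel scales add only 4 bytes per output row, so the two INT8 schemes end up almost the same size. The difference between them should show up in perplexity, not in bytes.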

Block E — results.json

{
  "experiment": "26-int8-ptq",
  "date": "YYYY-MM-DD",
  "model": "minigpt-phase17",
  "model_params": null,
  "schemes": {
    "fp32":              { "ppl": null, "bytes": null },
    "int8_per_tensor":   { "ppl": null, "bytes": null, "ppl_gap_pct": null },
    "int8_per_channel":  { "ppl": null, "bytes": null, "ppl_gap_pct": null }
  },
  "notes": "..."
}
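The template does not define ppl_gap_pct. One reasonable reading, assumed here, is the relative perplexity increase over the FP32 baseline in percent. The helper names and the perplexity values below are placeholders for illustration, not measurements.

```python
import json


def ppl_gap_pct(ppl_quant: float, ppl_fp32: float) -> float:
    """Relative perplexity increase over the FP32 baseline, in percent."""
    return 100.0 * (ppl_quant - ppl_fp32) / ppl_fp32


def fill_gaps(results: dict) -> dict:
    """Fill ppl_gap_pct for every non-baseline scheme that has a PPL recorded."""
    base = results["schemes"]["fp32"]["ppl"]
    for name, entry in results["schemes"].items():
        if name != "fp32" and base is not None and entry["ppl"] is not None:
            entry["ppl_gap_pct"] = round(ppl_gap_pct(entry["ppl"], base), 3)
    return results


# Placeholder numbers only; replace with your own measurements.
results = {
    "experiment": "26-int8-ptq",
    "schemes": {
        "fp32": {"ppl": 20.0, "bytes": None},
        "int8_per_tensor": {"ppl": 21.0, "bytes": None, "ppl_gap_pct": None},
        "int8_per_channel": {"ppl": 20.5, "bytes": None, "ppl_gap_pct": None},
    },
}
text = json.dumps(fill_gaps(results), indent=2)
```

Under this definition the < 5% DoD threshold compares ppl_gap_pct for int8_per_channel directly against 5.0.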

Block F — interpret in README.md

Three questions:

  1. What is the PPL gap per-tensor vs per-channel? Per-channel should be ≤ half the per-tensor gap. If it isn't, your model is unusually outlier-free or your weights are already low-precision somehow.
  2. Where is most of the byte savings? Compute the % of total bytes attributable to (a) weights of Linears, (b) the embedding table, (c) layer-norm parameters. Embedding tables often dominate small-model byte counts.
  3. Which layer's quantization hurts most? Hint: re-run with only one layer quantized at a time, measure PPL each time, plot a bar chart. Usually the output projection of attention or the final lm_head is the worst.
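For question 3, the one-layer-at-a-time sweep could be organized as in the sketch below. It is a shortcut under stated assumptions: fake_quant_ overwrites a layer's weights with their per-channel INT8 round trip instead of swapping in your QuantizedLinear, and eval_fn is a hypothetical callable. In the lab eval_fn would wrap the Phase 17 perplexity eval. The toy version here uses output MSE against the unquantized model.

```python
import copy

import torch
from torch import nn


def fake_quant_(linear: nn.Linear) -> None:
    """In place: replace weights by their per-channel INT8 round trip."""
    with torch.no_grad():
        W = linear.weight
        scale = (W.abs().amax(dim=1, keepdim=True) / 127).clamp(min=1e-9)
        W.copy_(torch.round(W / scale).clamp(-127, 127) * scale)


def one_layer_at_a_time(model: nn.Module, eval_fn) -> dict:
    """Return {layer_name: metric} with exactly one nn.Linear quantized per run."""
    scores = {}
    for name, mod in model.named_modules():
        if isinstance(mod, nn.Linear):
            trial = copy.deepcopy(model)            # leave the original untouched
            fake_quant_(trial.get_submodule(name))  # quantize only this layer
            scores[name] = eval_fn(trial)
    return scores


# Toy stand-in for MiniGPT; the metric is output MSE, not perplexity.
torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4)).eval()
x = torch.randn(32, 8)
ref = toy(x).detach()
weights_before = toy[0].weight.clone()
scores = one_layer_at_a_time(toy, lambda m: (m(x) - ref).pow(2).mean().item())
```

The resulting dictionary maps directly onto the bar chart: one bar per layer name, with height equal to the metric for that run.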

Constraints

  • No high-level wrappers from torch.quantization. You may use low-level utilities (torch.int8, tensor.to(torch.int8)), but the quantization math itself is yours.
  • No bitsandbytes. Same reason: black-box.
  • Reproducibility: seed_everything(42) at the top of every script.
  • CPU only. No CUDA gate needed; assume the evaluation dataset is small enough that FP32 forward passes complete in minutes.
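The lab names seed_everything(42) without defining it. If your codebase does not already provide one, a minimal CPU-only sketch might look like this. The body is an assumption about what the helper should cover (Python's random, PyTorch, and NumPy when installed).

```python
import random

import torch


def seed_everything(seed: int = 42) -> None:
    """Seed the RNGs a CPU-only PyTorch script typically touches."""
    random.seed(seed)
    torch.manual_seed(seed)
    try:
        import numpy as np  # optional: only seeded if NumPy is installed
        np.random.seed(seed)
    except ImportError:
        pass


seed_everything(42)
first = (random.random(), torch.rand(3))
seed_everything(42)
second = (random.random(), torch.rand(3))
```

Calling it twice with the same seed should reproduce the same draws, which is a quick sanity check before recording the seed in manifest.json.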

Stop conditions

Done when:

  1. Tests in tests/test_quantization.py all pass.
  2. experiments/26-int8-ptq/ has all five files.
  3. INT8 per-channel PPL gap is < 5% (the DoD threshold); if not, debug using the Pitfalls notes below before consulting solutions.
  4. README.md answers all three Block F questions.

Pitfalls

  • PPL gap > 20%. You probably forgot to dequantize before the matmul, or scale broadcasting is wrong (per-channel scale needs shape (out, 1) not (out,) when multiplying a (out, in) matrix).
  • PPL gap suspiciously small (< 0.1%). You may have accidentally kept the original FP32 weights on the module and used them in forward. After wrapping, check that model.layers[0].mlp.fc1.W_int8.dtype is torch.int8. If the module still exposes model.layers[0].mlp.fc1.weight, it should be FP32 and equal to the dequantized result recomputed from W_int8, not the original parameter.
  • Memory blows up. You're keeping both INT8 and FP32 copies. The dequantized weight should be computed on the fly per forward, not cached.
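  • NaN in output. A scale of exactly 0: per-channel when max(|W|) = 0 for some pathological all-zero row, or per-tensor when the whole tensor is zero. Dividing by it gives 0/0 = NaN. Add a max(s, 1e-9) guard.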
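Two of these pitfalls are easy to reproduce on a toy matrix. The sketch below uses a square 4 × 4 weight on purpose, because that is the case where a (out,) scale broadcasts without raising an error. The matrix and its all-zero row are constructed for illustration.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 4)   # square on purpose: the shape bug is silent here
W[2] = 0.0              # pathological all-zero row

raw_scale = W.abs().amax(dim=1) / 127            # shape (4,); entry 2 is exactly 0
nan_count = torch.isnan(W / raw_scale[:, None]).sum().item()  # 0/0 -> NaN in row 2

scale = raw_scale.clamp(min=1e-9)                # the max(s, 1e-9) guard
W_int8 = torch.round(W / scale[:, None]).clamp(-127, 127).to(torch.int8)

good = W_int8.float() * scale[:, None]           # (4, 1): one scale per row
bad = W_int8.float() * scale                     # (4,): scales applied per column

err_good = (good - W).abs().max().item()
err_bad = (bad - W).abs().max().item()
print(nan_count, err_good, err_bad)
```

With a non-square weight the (out,) version raises a shape error instead, which is the easier failure to notice. The silent square case is the one that tends to surface as a large PPL gap.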

When to consult solutions/

After all five files are committed and the DoD threshold is met. The reference at solutions/00-int8-ptq-ref.md (written at phase open) lets you compare your numbers and the structure of your QuantizedLinear against it.


Next lab: lab/01-gptq-toy.md.