Phase 26 — Quantization Deep Dive
Requires: 02 — Numerical Representation · 25 — PyTorch Internals
Teaches: quantization · int8 · nf4 · gptq · gguf · calibration
Jump to any chapter from the phase reference index.
Chapter map
Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.
Quantizing means moving fewer bytes for the same FLOP. Here we derive why INT8/INT4 work, what error they introduce, and how to measure it on the MiniGPT from Phase 17.
Goal
Take the MiniGPT model trained in Phase 17 (on the English-verb corpus of Phase 12) and quantize it post-training. Produce a Pareto curve of perplexity vs bytes across {FP32, FP16, INT8 per-tensor, INT8 per-channel, INT4 per-group}, and a verb-tense classification accuracy drop curve over the same schemes (the task metric per §A13). Re-plot the Phase 1 roofline with the quantized inference dot to make explicit what quantization buys: higher arithmetic intensity, same compute, less memory traffic.
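The roofline claim above can be checked with back-of-the-envelope arithmetic. The sketch below counts FLOPs per byte of weight traffic for a single matrix-vector product. It assumes batch size 1, ignores activation traffic, and assumes a group size of 64 with one FP16 scale per group for INT4; none of these numbers come from the phase plan, and the 512 x 512 layer shape is illustrative only.

```python
# Bits stored per weight for each scheme in the sweep.
GROUP_SIZE = 64  # hypothetical group size for INT4 per-group

BITS_PER_WEIGHT = {
    "FP32": 32.0,
    "FP16": 16.0,
    "INT8": 8.0,  # per-tensor / per-channel scale overhead treated as negligible
    "INT4 per-group": 4.0 + 16.0 / GROUP_SIZE,  # 4-bit code + one FP16 scale per group
}

def matvec_intensity(rows: int, cols: int, bits_per_weight: float) -> float:
    """FLOPs per byte of weight traffic for y = W @ x at batch size 1."""
    flops = 2 * rows * cols  # one multiply and one add per weight
    weight_bytes = rows * cols * bits_per_weight / 8
    return flops / weight_bytes

intensities = {
    name: matvec_intensity(512, 512, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}
for name, ai in intensities.items():
    print(f"{name:>15}: {ai:.2f} FLOP/byte")
```

The FLOP count is the same in every row; only the byte count shrinks, which is what moves the dot to the right on the roofline.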
This phase is the first one where Borja sees PyTorch's Linear weight tensors get rewritten in place. PyTorch was introduced in Phase 24 and is now a working substrate; quantization is the first non-trivial use case that needs framework-grade infrastructure (fake-quant ops, calibration hooks).
Read order
- theory/00-motivation.md: why quantization is a roofline argument, not a compression argument.
- theory/01-number-formats.md: FP32 / FP16 / BF16 / FP8 anatomy. Mantissa vs exponent.
- theory/02-scales-and-zeros.md: the symmetric/asymmetric quantization map and its error bound. Per-tensor vs per-channel vs per-group.
- theory/03-gptq-and-nf4.md: Hessian-based weight updates (GPTQ) and quantile-based codebooks (NF4).
- theory/04-awq-survey.md: survey only; AWQ and SmoothQuant are reading exercises, not implementations.
- lab/00-int8-ptq.md: implement and evaluate INT8 PTQ on MiniGPT.
- lab/01-gptq-toy.md: implement GPTQ for a single Linear layer.
- lab/02-quant-curve.md: sweep schemes and plot the Pareto curve.
- lab/03-gguf-export.md: hand-write a GGUF-like export and round-trip.
solutions/ is empty during pre-write — populated at phase open after Borja's MiniGPT API is visible.
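As a preview of the symmetric map in theory/02-scales-and-zeros.md, here is a minimal sketch of absmax INT8 fake-quantization, comparing one scale for the whole tensor against one scale per output row. It is written in dependency-free Python so it runs anywhere; the function names and the toy weight are illustrative and are not the miniquant API. For values that are not clipped, round-to-nearest bounds the reconstruction error by half a scale step.

```python
def symmetric_scale(values, n_bits=8):
    """Absmax scale so the largest magnitude maps to qmax (127 for INT8)."""
    qmax = 2 ** (n_bits - 1) - 1
    absmax = max(abs(v) for v in values)
    return absmax / qmax if absmax > 0 else 1.0

def fake_quant(values, scale, n_bits=8):
    """Quantize then dequantize: x_hat = s * clamp(round(x / s), -qmax, qmax)."""
    qmax = 2 ** (n_bits - 1) - 1
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in values]

def row_errors(weight, per_channel):
    """Max abs reconstruction error for each row of a 2-D weight (list of rows)."""
    flat = [v for row in weight for v in row]
    errors = []
    for row in weight:
        # Per-channel: the scale comes from this row only. Per-tensor: from all rows.
        scale = symmetric_scale(row if per_channel else flat)
        recon = fake_quant(row, scale)
        errors.append(max(abs(a - b) for a, b in zip(row, recon)))
    return errors

# Toy weight: the second row has a 100x larger range than the first.
W = [[0.01, -0.02, 0.03, -0.015],
     [1.0, -2.0, 3.0, -1.5]]

err_tensor = row_errors(W, per_channel=False)
err_channel = row_errors(W, per_channel=True)
print("per-tensor  row errors:", err_tensor)
print("per-channel row errors:", err_channel)
```

With a shared scale, the small-range row is quantized on a grid sized for the large-range row, so its smallest entry collapses to zero. Per-channel scales give each row its own grid.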
Definition of Done
See PHASE_26_PLAN.md §6. Briefly:
- INT8 PTQ perplexity within 5% of FP32 on MiniGPT.
- INT4 per-group perplexity within 15% of FP32.
- Verb-tense classification accuracy drop INT8 vs FP32 < 2 percentage points (task metric per §A13).
- GGUF-like export round-trips to PyTorch fake-quant within 1e-3.
- Re-plotted Phase 1 roofline with FP32 / INT8 / INT4 MiniGPT dots committed.
- src/miniquant/{quantize.py, gptq.py, gguf_io.py} implemented (Borja).
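The round-trip criterion can be sketched for a single tensor as follows. The byte layout here (a four-byte magic, a count, one float32 scale, then the INT8 payload) is invented for illustration and is not the real GGUF format; the names `MAGIC`, `export_tensor`, and `import_tensor` are hypothetical, and plain Python lists stand in for PyTorch tensors.

```python
import struct

MAGIC = b"MQNT"  # hypothetical magic bytes, not the real GGUF header
QMAX = 127       # symmetric INT8 range is [-127, 127]

def absmax_scale(values):
    absmax = max(abs(v) for v in values)
    return absmax / QMAX if absmax > 0 else 1.0

def to_int8(values, scale):
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]

def export_tensor(values):
    """Layout: magic | uint32 count | float32 scale | int8 payload."""
    # Round the scale to float32 first so writer and reader use the same value.
    scale = struct.unpack("<f", struct.pack("<f", absmax_scale(values)))[0]
    q = to_int8(values, scale)
    return MAGIC + struct.pack("<If", len(q), scale) + struct.pack(f"<{len(q)}b", *q)

def import_tensor(blob):
    if blob[:4] != MAGIC:
        raise ValueError("unexpected magic bytes")
    count, scale = struct.unpack_from("<If", blob, 4)
    q = struct.unpack_from(f"<{count}b", blob, 12)  # payload starts after 4 + 8 bytes
    return [qi * scale for qi in q]

weights = [0.5, -1.25, 0.75, 0.0]
scale = absmax_scale(weights)
reference = [qi * scale for qi in to_int8(weights, scale)]  # in-memory fake-quant
restored = import_tensor(export_tensor(weights))
max_diff = max(abs(a - b) for a, b in zip(reference, restored))
print(f"round-trip max |diff| = {max_diff:.2e}")
```

The comparison is against the in-memory fake-quant values, not the original weights, which mirrors how the criterion above is phrased. The only source of difference in this sketch is storing the scale as float32.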
What this phase intentionally does NOT cover
- QAT (quantization-aware training). Out of scope for a PTQ-focused phase. It appears only as a pointer ("if PTQ degrades too much, the fix is to fold quantization into training") and is left for a later self-directed exploration.
- AWQ / SmoothQuant implementations. Read as papers, not coded. Pre-written rationale: implementing both adds ~40 hours and the conceptual delta over GPTQ is modest.
- GPU INT8 kernels. Phase 27 (Flash Attention) is the right place for that.
- FP8 training. Hopper-only; defer to Phase 24's GPU follow-ups.
- Quantizing the embedding table. Embeddings rarely benefit from PTQ at our model size; we leave them FP16 throughout.
Phase 26's scope is weights-only post-training quantization for inference on CPU, with one detour into a 4-bit weight format (NF4) and one paper-faithful implementation of GPTQ. Nothing more.
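To make the NF4 detour concrete, here is a simplified sketch of a quantile-based 4-bit codebook with per-block absmax scaling. It places 16 levels at evenly spaced quantiles of a standard normal and rescales them to [-1, 1]. This is in the spirit of NF4 but is not the exact NF4 table, which is built asymmetrically so that zero is exactly representable. All function names are illustrative.

```python
from statistics import NormalDist

def quantile_codebook(n_levels=16):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    normal = NormalDist()
    levels = [normal.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(v) for v in levels)
    return [v / top for v in levels]

def quantize_block(block, codebook):
    """Absmax-scale one block, then store the index of the nearest level."""
    absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
    indices = [
        min(range(len(codebook)), key=lambda k: abs(v / absmax - codebook[k]))
        for v in block
    ]
    return indices, absmax

def dequantize_block(indices, absmax, codebook):
    return [codebook[i] * absmax for i in indices]

codebook = quantile_codebook()
block = [0.1, -0.4, 0.8, -0.05]
indices, absmax = quantize_block(block, codebook)
restored_block = dequantize_block(indices, absmax, codebook)
print("indices :", indices)
print("restored:", [round(v, 4) for v in restored_block])
```

Because the levels follow the normal quantiles, they are densest near zero, where most weights of a roughly Gaussian layer sit. Each stored index fits in 4 bits, plus one scale per block.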
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale — Dettmers et al. · 2022. outlier-aware INT8 that actually preserves quality.
- 📄 GPTQ: Accurate Post-Training Quantization for GPTs — Frantar et al. · 2022. the post-training quantization you implement.