Phase 26 — Quizzes
Readable mirror of data/quizzes/phase-26-quantization.yaml, which remains the source of truth. Answers sit behind `<details>` blocks for spoiler-free self-assessment.
q-26-01 — Why quantization is not the same as compression (free)
In one sentence, why is INT8 weight quantization a roofline optimization and not merely a disk-compression scheme?
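Before opening the answer, it can help to put rough numbers on the question. The sketch below estimates arithmetic intensity (FLOPs per byte moved) for a single matrix-vector product at different weight widths. The function name `gemv_intensity` and the 4096 × 4096 layer shape are illustrative assumptions, not values from the quiz source, and the model ignores the storage of scales and any dequantization cost.

```python
def gemv_intensity(out_features, in_features, weight_bytes, act_bytes=4):
    """Rough FLOPs-per-byte estimate for y = W @ x with a single input vector.

    Assumes every weight is read from memory exactly once and ignores
    quantization scales and dequantization work (simplifying assumptions).
    """
    flops = 2 * out_features * in_features            # one multiply and one add per weight
    bytes_moved = (
        out_features * in_features * weight_bytes     # weight matrix
        + in_features * act_bytes                     # input vector
        + out_features * act_bytes                    # output vector
    )
    return flops / bytes_moved


# Hypothetical layer shape, chosen only for illustration.
for name, width in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {gemv_intensity(4096, 4096, width):.3f} FLOP/byte")
```

On-disk size never appears in this estimate; only the bytes read on each forward pass do.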
Answer
Quantization reduces the **bytes** loaded per FLOP during every forward pass, raising arithmetic **intensity** on a memory-bound kernel. Compression reduces disk size only.
q-26-02 — Per-tensor vs per-channel scales
You quantize a Linear weight W (out × in) to INT8. Which scale granularities are legitimate defaults for production models, and which one is bounded against per-row outliers?
- Per-tensor (one scalar for the whole W)
- Per-channel (one scale per output row)
- Per-element (one scale per weight; identical to FP32)
- Per-group (one scale per K-element block within a row)
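The granularities listed above differ only in how many weights share one scale, so they are easy to compare empirically before opening the answer. The following is a minimal NumPy sketch under stated assumptions: symmetric INT8 with simulated ("fake") quantization, a synthetic weight matrix in which one row is scaled up to act as an outlier, and a hypothetical group size of 32. All function names are illustrative.

```python
import numpy as np


def fake_quant_int8(w, scale):
    """Symmetric INT8 round trip: quantize to codes in [-127, 127], then dequantize."""
    scale = np.maximum(scale, 1e-12)                  # guard against all-zero slices
    return np.clip(np.round(w / scale), -127, 127) * scale


def per_tensor(w):
    return fake_quant_int8(w, np.abs(w).max() / 127)


def per_channel(w):
    # One scale per output row of W (out x in).
    return fake_quant_int8(w, np.abs(w).max(axis=1, keepdims=True) / 127)


def per_group(w, group_size=32):
    # One scale per block of `group_size` weights within a row.
    # Assumes the input dimension is divisible by group_size.
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    scale = np.abs(groups).max(axis=2, keepdims=True) / 127
    return fake_quant_int8(groups, scale).reshape(out_f, in_f)


rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(8, 128))
W[0] *= 50.0                                          # synthetic outlier row

# Measure the damage on the rows that are NOT outliers.
errors = {
    name: float(np.abs(W[1:] - fn(W)[1:]).mean())
    for name, fn in [("per-tensor", per_tensor),
                     ("per-channel", per_channel),
                     ("per-group", per_group)]
}
for name, err in errors.items():
    print(f"{name:12s} mean abs error on normal rows: {err:.6f}")
```

Changing the outlier multiplier or the group size shows how strongly each scheme couples one row's range to the error in the others.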
Answer
**Choices 2 and 4.** Per-channel and per-group bound per-row error and are the production defaults. Per-tensor is the failure mode LLM.int8() addresses; per-element collapses back to FP32.
q-26-03 — The Pareto frontier knee on Mini-GPT
For Mini-GPT's int8-w vs int8-wa variants (theory 05-pareto-frontier-worked.md), what is the rough size of the additional decode-latency win when adding activation quantization on top of weight-only INT8?
- ≈ 50% additional speedup, justifying the PPL hit
- ≈ 10% additional speedup, smaller than the PPL hit
- ≈ 4× additional speedup, like weights to INT8 over FP32
- No additional speedup; activations are already small
Answer
**Choice 2.** Activations account for a small share of the bytes moved at Mini-GPT's hidden dims. The 10% latency gain is real but smaller than the 4-5% PPL hit, which makes `int8-w` the default and `int8-wa` the Pareto knee.
q-26-04 — GPTQ vs round-to-nearest
GPTQ's quantization step differs from naive round-to-nearest in one essential way. Which of the following describes it most precisely?
- GPTQ uses INT4 instead of INT8.
- GPTQ uses the Hessian of the activation distribution to redistribute rounding error to columns not yet quantized.
- GPTQ trains the model further to compensate for quantization.
- GPTQ stores weights in NF4 instead of INT4.
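For readers who want to check their choice empirically before opening the answer, the sketch below contrasts plain round-to-nearest with a simplified GPTQ-style loop on synthetic data. It is an illustration under assumptions, not the reference implementation: the calibration inputs `X` are random and deliberately correlated, the Hessian proxy is `X.T @ X` with a small damping term, the grid is symmetric 4-bit with one scale per row, and the blocked and Cholesky-based updates used in practical implementations are omitted. All names are hypothetical.

```python
import numpy as np

QMAX = 7  # symmetric 4-bit grid: integer codes in [-7, 7]


def row_scales(W):
    """One scale per output row, so the largest weight in a row maps to QMAX."""
    return np.maximum(np.abs(W).max(axis=1), 1e-12) / QMAX


def round_to_grid(w, scale):
    return np.clip(np.round(w / scale), -QMAX, QMAX) * scale


def quantize_rtn(W):
    """Baseline: every weight is rounded independently."""
    return round_to_grid(W, row_scales(W)[:, None])


def quantize_gptq_like(W, X, damp=0.01):
    """Column-by-column quantization with error compensation (simplified)."""
    W = W.astype(np.float64).copy()
    scale = row_scales(W)                       # fixed before any compensation
    H = X.T @ X                                 # (in, in) Hessian proxy from calibration inputs
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = round_to_grid(W[:, j], scale)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # Push the rounding error onto the columns that are not yet quantized.
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
        # Remove column j from the inverse Hessian before moving on.
        Hinv -= np.outer(Hinv[:, j], Hinv[j, :]) / Hinv[j, j]
    return Q


def output_error(W, Q, X):
    """Squared error of the layer output on the calibration inputs."""
    return float(np.linalg.norm(X @ (W - Q).T) ** 2)


rng = np.random.default_rng(0)
latent = rng.normal(size=(256, 8))
X = latent @ rng.normal(size=(8, 32)) + 0.5 * rng.normal(size=(256, 32))
W = rng.normal(size=(16, 32))

rtn_err = output_error(W, quantize_rtn(W), X)
gptq_err = output_error(W, quantize_gptq_like(W, X), X)
print(f"round-to-nearest output error: {rtn_err:.2f}")
print(f"GPTQ-style output error:       {gptq_err:.2f}")
```

Both variants store the same number of bits per weight; they differ only in how each column's rounding decision accounts for the columns that follow it.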
Answer
**Choice 2.** GPTQ greedily quantizes column-by-column and updates the remaining, not-yet-quantized weights to compensate for the error introduced so far, weighted by the inverse of the Hessian computed from calibration activations.
q-26-05 — When activation quantization helps most (free)
For which kind of layer does activation quantization (INT8 W+A vs INT8 W-only) yield the largest additional speedup, and why?
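One way to reason about this question is to estimate how much of a layer's memory traffic comes from activations rather than weights. The helper below is a rough sketch: `activation_byte_share`, the 512 × 512 layer shape, and the token counts are illustrative assumptions, and the model counts only one read of the weights plus one read of the inputs and one write of the outputs.

```python
def activation_byte_share(tokens, in_features, out_features,
                          weight_bytes=1, act_bytes=2):
    """Fraction of bytes moved by a Linear layer that belongs to activations.

    Assumes INT8 weights (1 byte) and 16-bit activations (2 bytes) by default,
    and ignores scales, caches, and any intermediate buffers.
    """
    weight_traffic = in_features * out_features * weight_bytes
    act_traffic = tokens * (in_features + out_features) * act_bytes
    return act_traffic / (act_traffic + weight_traffic)


# Hypothetical shapes: the same layer at different numbers of tokens per pass.
for tokens in (1, 64, 1024):
    share = activation_byte_share(tokens, 512, 512)
    print(f"tokens={tokens:5d}  activation share of bytes: {share:.3f}")
```

Varying `tokens` and the layer dimensions shows when activation bytes stop being negligible compared with weight bytes.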