Phase 26 — Quizzes

Readable mirror of data/quizzes/phase-26-quantization.yaml. Answers sit behind <details> blocks for spoiler-free self-assessment.

Source of truth: data/quizzes/phase-26-quantization.yaml.

q-26-01 — Why quantization is not the same as compression (free)

In one sentence, why is INT8 weight quantization a roofline optimization and not merely a disk-compression scheme?

Answer

Quantization reduces the **bytes** loaded per FLOP during every forward pass, raising arithmetic **intensity** on a memory-bound kernel. Compression reduces disk size only.
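
The intensity argument can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration: it assumes a single decode-time matrix-vector product, FP32 activations, and a hypothetical 512 × 512 layer (the function name and the dimensions are not taken from the course material), and it ignores the bytes spent on scales.

```python
def matvec_intensity(out_features, in_features, weight_bytes, act_bytes=4):
    """FLOPs per byte moved for y = W @ x, with W of shape (out, in)."""
    # One multiply and one add per weight.
    flops = 2 * out_features * in_features
    # Every weight is read once; the input and output vectors are also moved.
    weight_traffic = out_features * in_features * weight_bytes
    act_traffic = (in_features + out_features) * act_bytes
    return flops / (weight_traffic + act_traffic)


# Hypothetical layer size, chosen only for illustration.
OUT, IN = 512, 512

fp32_intensity = matvec_intensity(OUT, IN, weight_bytes=4)
int8_intensity = matvec_intensity(OUT, IN, weight_bytes=1)

print(f"FP32 weights: {fp32_intensity:.3f} FLOP/byte")
print(f"INT8 weights: {int8_intensity:.3f} FLOP/byte")
print(f"ratio:        {int8_intensity / fp32_intensity:.2f}x")
```

Under these assumptions the FLOP count is unchanged while the weight traffic shrinks fourfold, so the ratio lands just under 4×. A file compressed on disk but expanded to FP32 before the kernel runs would leave both numbers identical.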

q-26-02 — Per-tensor vs per-channel scales

You quantize a Linear weight W (out × in) to INT8. Which scale granularities are legitimate defaults for production models, and which of them keep the error from a per-row outlier bounded?

1. Per-tensor (one scalar for the whole W)
2. Per-channel (one scale per output row)
3. Per-element (one scale per weight; identical to FP32)
4. Per-group (one scale per K-element block within a row)

Answer

**Choices 2 and 4.** Per-channel and per-group bound per-row error and are the production defaults. Per-tensor is the failure mode LLM.int8() addresses; per-element collapses back to FP32.
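
A small round-trip experiment makes the difference concrete. This is a minimal sketch assuming symmetric INT8 quantization with a max-abs scale. The 2 × 4 matrix, the group size of 2, and all function names are made up for illustration.

```python
def quantize_block(values, qmax=127):
    """Symmetric INT8 round trip for one block that shares a single scale."""
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor(W):
    """One scale for the whole matrix."""
    cols = len(W[0])
    flat = quantize_block([v for row in W for v in row])
    return [flat[i * cols:(i + 1) * cols] for i in range(len(W))]


def per_channel(W):
    """One scale per output row."""
    return [quantize_block(row) for row in W]


def per_group(W, group=2):
    """One scale per `group`-element block within each row."""
    return [
        [v for s in range(0, len(row), group) for v in quantize_block(row[s:s + group])]
        for row in W
    ]


def max_row_error(W, W_hat):
    """Largest absolute reconstruction error in each row."""
    return [max(abs(a - b) for a, b in zip(r, rh)) for r, rh in zip(W, W_hat)]


# Row 0 holds small weights; row 1 holds outliers that dominate a shared scale.
W = [
    [0.10, -0.20, 0.05, 0.15],
    [8.00, -6.00, 4.00, -2.00],
]

err_tensor = max_row_error(W, per_tensor(W))
err_channel = max_row_error(W, per_channel(W))
err_group = max_row_error(W, per_group(W))

print("per-tensor :", err_tensor)
print("per-channel:", err_channel)
print("per-group  :", err_group)
```

With a per-tensor scale, the outlier row sets the step size for every weight, so the small row is rounded on a grid far too coarse for it. Per-channel and per-group scales keep each row's error within half of that row's (or group's) own step.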

q-26-03 — The Pareto frontier knee on Mini-GPT

For Mini-GPT's int8-w vs int8-wa variants (theory 05-pareto-frontier-worked.md), what is the rough size of the additional decode-latency win when adding activation quantization on top of weight-only INT8?

1. ≈ 50% additional speedup, justifying the PPL hit
2. ≈ 10% additional speedup, smaller than the PPL hit
3. ≈ 4× additional speedup, like weights to INT8 over FP32
4. No additional speedup; activations are already small

Answer

**Choice 2.** Activations account for only a small share of the bytes moved at Mini-GPT's hidden dimensions. The roughly 10% latency gain is real, but it is outweighed by the 4-5% PPL hit, which makes `int8-w` the default and `int8-wa` the Pareto knee.

q-26-04 — GPTQ vs round-to-nearest

GPTQ's quantization step differs from naive round-to-nearest in one essential way. Which of the following describes it most precisely?

1. GPTQ uses INT4 instead of INT8.
2. GPTQ uses the Hessian of the activation distribution to redistribute rounding error to columns not yet quantized.
3. GPTQ trains the model further to compensate for quantization.
4. GPTQ stores weights in NF4 instead of INT4.

Answer

**Choice 2.** GPTQ greedily quantizes column-by-column and updates the remaining weights to compensate for the error introduced so far, weighted by the activation Hessian.
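
The compensation step can be illustrated on a toy problem. The sketch below is not the production algorithm: it applies the plain error-feedback update to a single two-weight row with an explicit inverse Hessian, whereas real GPTQ implementations add a Cholesky reformulation, blocking, and damping. The calibration inputs, the weights, the integer grid (`scale = 1.0`), and every function name are assumptions chosen so the arithmetic is easy to follow by hand.

```python
def invert(m):
    """Gauss-Jordan inverse of a small square matrix given as lists of floats."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def quantize(v, scale):
    """Round one value to the nearest point on a uniform grid."""
    return round(v / scale) * scale


def rtn_row(weights, scale):
    """Baseline: round every weight independently."""
    return [quantize(w, scale) for w in weights]


def compensated_row(weights, hessian, scale):
    """Quantize column by column, pushing each rounding error onto later columns."""
    w = list(weights)
    h_inv = invert(hessian)
    n = len(w)
    q = [0.0] * n
    for k in range(n):
        q[k] = quantize(w[k], scale)
        err = (w[k] - q[k]) / h_inv[k][k]
        # Adjust the not-yet-quantized weights to absorb the error.
        for j in range(k + 1, n):
            w[j] -= err * h_inv[k][j]
        # Eliminate column k from the inverse Hessian.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                h_inv[i][j] -= h_inv[i][k] * h_inv[k][j] / h_inv[k][k]
    return q


def output_loss(weights, quantized, calib):
    """Squared error of the layer output over the calibration inputs."""
    return sum(
        sum((w - q) * x for w, q, x in zip(weights, quantized, sample)) ** 2
        for sample in calib
    )


# Hypothetical calibration inputs with correlated features.
calib = [[1.0, 1.0], [1.0, 0.0]]
# Hessian of the squared output error, up to a constant factor: sum of x x^T.
hessian = [[sum(s[i] * s[j] for s in calib) for j in range(2)] for i in range(2)]

weights = [0.4, 0.3]
scale = 1.0

q_rtn = rtn_row(weights, scale)
q_comp = compensated_row(weights, hessian, scale)
loss_rtn = output_loss(weights, q_rtn, calib)
loss_comp = output_loss(weights, q_comp, calib)

print("round-to-nearest:", q_rtn, "loss", loss_rtn)
print("compensated     :", q_comp, "loss", loss_comp)
```

On this example round-to-nearest sends both weights to zero. The compensated pass shifts the second weight before rounding it, so that it rounds up and cancels part of the first column's error on the correlated input. No retraining is involved; the only inputs are the weights and the calibration statistics.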

q-26-05 — When activation quantization helps most (free)

For which kind of layer does activation quantization (INT8 W+A vs INT8 W-only) yield the largest additional speedup, and why?

Answer

The **FFN expansion layer** (`in=d_model, out=4·d_model`) carries the largest activation in the model, so quantizing that activation removes a meaningful fraction of the bytes loaded. In Mini-GPT the share is small (so the W+A win is small); in production transformers with `d_model ≥ 4096` it becomes decisive.