00: Why Quantization (and why it isn't compression)

Quantization is not about compressing a file; it changes where your kernel sits on the roofline. Same compute, fewer bytes, higher arithmetic intensity. That is why it is fast, not because it "has less information".

The naive framing (and why it misleads)

A first-time reader hears "INT8 quantization shrinks the model 4×" and concludes: quantization is a compression scheme. ZIP makes files smaller; INT8 makes weights smaller; same idea, right?

Wrong, and the wrongness matters.

ZIP saves bytes on disk. When you decompress and load into RAM, you pay the same memory cost as before. ZIP is a storage optimization; it doesn't speed up your program once running.

Quantization saves bytes in flight — during every forward pass, on every weight load from DRAM into cache, on every register fill. The model is smaller on disk and in cache and in transit, every time it runs. That's a fundamentally different kind of win.
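A small sketch of the distinction, assuming NumPy is available. The 256×256 random matrix and the symmetric absmax scale are illustrative choices, not MiniGPT's actual weights or the scheme used later in this phase. zlib shrinks the bytes at rest, but the buffer the program reads after loading is full size again; the INT8 array is itself what the kernel streams.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w_fp32 = rng.standard_normal((256, 256)).astype(np.float32)

# ZIP-style compression: smaller at rest, full size again once loaded.
stored = zlib.compress(w_fp32.tobytes())
loaded = zlib.decompress(stored)
bytes_read_zip = len(loaded)            # what every forward pass has to stream

# INT8 quantization with a symmetric absmax scale (illustrative assumption).
scale = float(np.abs(w_fp32).max()) / 127.0
w_int8 = np.clip(np.rint(w_fp32 / scale), -127, 127).astype(np.int8)
bytes_read_int8 = w_int8.nbytes         # what every forward pass has to stream

print(f"FP32 in memory     : {w_fp32.nbytes} bytes")
print(f"zlib at rest       : {len(stored)} bytes")
print(f"zlib after loading : {bytes_read_zip} bytes")
print(f"INT8 in memory     : {bytes_read_int8} bytes")
```

Only the last line describes a saving that is paid out again on every forward pass.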

To see why this matters, recall the roofline from Phase 1.

The roofline reframing

A single transformer block during inference spends most of its time in two operations:

Linear(x) — a matrix-vector multiply y = W x + b, where W ∈ ℝ^(out×in).
Attention(q, k, v) — see Phase 27 for the kernel-level story; for this phase, treat it as a sequence of matrix multiplies plus a softmax.

For batch size 1 (the realistic case for local inference on Borja's i5-8250U), Linear(x) is a matrix-vector multiply, not a matrix-matrix multiply. The arithmetic intensity is brutal:

FLOPs: 2 × out × in (one multiply + one accumulate per weight).
Bytes loaded: 4 × out × in (every weight, FP32, once) + 4 × in (the activation vector, FP32). The weights dominate.
Intensity: I = 2 out·in / (4 out·in) ≈ 0.5 FLOPs/byte.

Half a FLOP per byte. From the Phase 1 roofline derivation, the i5-8250U's machine-balance point is I_crit ≈ 10 FLOPs/byte, so at batch size 1 MiniGPT's linear layers sit about 20× below it: inference is firmly bandwidth-bound.
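The same arithmetic as a small Python sketch. The helper name `matvec_intensity` and the 512×512 layer size are illustrative assumptions; `I_CRIT = 10` is the machine-balance figure quoted from Phase 1.

```python
def matvec_intensity(out_dim: int, in_dim: int,
                     bytes_per_weight: float = 4.0,
                     bytes_per_activation: float = 4.0) -> float:
    """FLOPs per byte for y = W x at batch size 1 (bias and output write ignored)."""
    flops = 2 * out_dim * in_dim                           # multiply + accumulate
    bytes_loaded = (bytes_per_weight * out_dim * in_dim    # every weight, once
                    + bytes_per_activation * in_dim)       # the input vector
    return flops / bytes_loaded

I_CRIT = 10.0  # FLOPs/byte, machine-balance point from the Phase 1 roofline

fp32 = matvec_intensity(512, 512)
print(f"FP32 intensity             : {fp32:.3f} FLOPs/byte")
print(f"Distance from balance point: {I_CRIT / fp32:.1f}x")
```

The activation term barely moves the result, which is why the text drops it and rounds to 0.5.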

Now quantize the weights to INT8. The math is unchanged — the FLOPs stay at 2 × out × in. But bytes loaded drop:

Bytes: 1 × out·in (INT8 weights) + 4 × in (FP32 activation). Weights still dominate.
Intensity: I = 2 out·in / (1 × out·in) = 2 FLOPs/byte. 4× higher than FP32.

INT4 per-group: weights pack 2-per-byte (with a small overhead for scales). Intensity: I ≈ 4 FLOPs/byte. 8× higher.
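To see how small the "small overhead for scales" is, here is a sketch that folds the scale cost into an effective bytes-per-weight figure. The group size of 128 and the 2-byte (FP16) scale are assumptions for illustration; the function names are hypothetical.

```python
from typing import Optional

def effective_bytes_per_weight(bits: int, group_size: Optional[int] = None,
                               scale_bytes: float = 2.0) -> float:
    """Payload bytes per weight plus the amortized cost of one scale per group."""
    payload = bits / 8.0
    overhead = scale_bytes / group_size if group_size else 0.0
    return payload + overhead

def weight_bound_intensity(bpw: float) -> float:
    """2 FLOPs per weight divided by the bytes streamed for that weight."""
    return 2.0 / bpw

formats = {
    "FP32": effective_bytes_per_weight(32),
    "INT8": effective_bytes_per_weight(8),
    "INT4, group=128": effective_bytes_per_weight(4, group_size=128),
}
base = weight_bound_intensity(formats["FP32"])
for name, bpw in formats.items():
    i = weight_bound_intensity(bpw)
    print(f"{name:16s} {bpw:.4f} B/weight  I = {i:.2f} FLOPs/byte  ({i / base:.2f}x FP32)")
```

Under these assumptions the INT4 gain lands slightly under 8×, which is why the text writes I ≈ 4 rather than I = 4.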

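On a bandwidth-bound kernel, intensity is speed. INT8 moves the dot 4× to the right along the sloped memory ceiling, and therefore 4× up; INT4 moves it 8×. The roofline doesn't care that the numerical precision is lower: it counts bytes moved and operations executed, and the arithmetic units perform the same number of multiply-accumulates whether the operands arrived as 32-bit floats or were widened from narrower integers in registers.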
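A minimal roofline sketch of that claim. `PEAK_GFLOPS` and `BANDWIDTH_GBS` are placeholder numbers, not measurements of the i5-8250U; they are chosen only so that their ratio matches the balance point of 10 FLOPs/byte used in the text.

```python
PEAK_GFLOPS = 200.0     # hypothetical compute ceiling
BANDWIDTH_GBS = 20.0    # hypothetical sustained DRAM bandwidth
I_CRIT = PEAK_GFLOPS / BANDWIDTH_GBS   # = 10 FLOPs/byte

def attainable_gflops(intensity: float) -> float:
    """Classic roofline: the lower of the compute and memory ceilings."""
    return min(PEAK_GFLOPS, intensity * BANDWIDTH_GBS)

baseline = attainable_gflops(0.5)
for name, intensity in [("FP32", 0.5), ("INT8", 2.0), ("INT4", 4.0)]:
    perf = attainable_gflops(intensity)
    regime = "memory-bound" if intensity < I_CRIT else "compute-bound"
    print(f"{name}: I = {intensity:.1f}  ceiling = {perf:.0f} GFLOP/s  "
          f"({perf / baseline:.0f}x FP32, {regime})")
```

All three points stay left of the balance point, so each reduction in bytes translates directly into a higher ceiling.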

The error budget

That's the gain side. The cost side is quantization error.

A weight w stored in FP32 has roughly 7 decimal digits of precision. Stored in INT8 with scale s, it has 256 representable values total. The round-off error is at most s/2 per weight. Across out × in weights, errors accumulate, then propagate through layers, then through the softmax, then to the loss.
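The s/2 bound can be checked numerically. The sketch below assumes symmetric round-to-nearest with a per-tensor absmax scale on random stand-in weights; it is one common choice, not necessarily the exact scheme implemented later in this phase.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))        # stand-in weights (float64 for a clean check)

qmax = 127                                # symmetric range [-127, 127]: 255 of the 256 codes
s = np.abs(w).max() / qmax                # per-tensor absmax scale
q = np.clip(np.rint(w / s), -qmax, qmax).astype(np.int8)
w_hat = q.astype(np.float64) * s          # dequantized weights

err = np.abs(w - w_hat)
print(f"scale s         : {s:.6f}")
print(f"max |w - w_hat| : {err.max():.6f}  (bound s/2 = {s / 2:.6f})")
print(f"distinct codes  : {np.unique(q).size}")
```

Because the scale is set by the largest magnitude, nothing is clipped here and the bound holds for every weight; a scale chosen smaller than absmax would trade clipping error on the tails for a finer grid elsewhere.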

The standard question — and the one this phase answers empirically — is: how big does that error get, and where does the model break?

The answers, sketched in advance:

INT8 per-tensor: error small enough that perplexity barely moves (< 1%) for well-behaved layers, unless there's an outlier in the activation distribution. Outliers blow up the scale, swallowing precision for the typical case. This is the failure mode LLM.int8() solves.
INT8 per-channel: errors are bounded per row, so a single noisy row doesn't poison the whole tensor. PPL gap typically < 2%.
INT4 per-group: 16 representable levels per weight instead of 256, with a separate scale for each group of 64 or 128 weights within a channel. PPL gap 5–15% depending on calibration.
INT4 with GPTQ: instead of plain round-to-nearest, GPTQ uses the Hessian of the layer's output error, estimated from calibration activations, to push each column's rounding error onto the columns that haven't been quantized yet. PPL gap drops to 1–3%, at the cost of a few minutes of one-time calibration.
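The effect of the scale granularity can be sketched on synthetic weights. Everything below is illustrative: a random 16×256 matrix with one artificially inflated row, round-to-nearest only (no GPTQ), and a hypothetical `fake_quant` helper. It reports the reconstruction error on the rows that are not outliers.

```python
import numpy as np

def fake_quant(w: np.ndarray, scale, qmax: int) -> np.ndarray:
    """Round-to-nearest onto the grid {-qmax..qmax} * scale, then dequantize."""
    return np.clip(np.rint(w / scale), -qmax, qmax) * scale

def rms(a: np.ndarray) -> float:
    return float(np.sqrt(np.mean(a ** 2)))

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 256)) * 0.02
w[3] *= 50.0                                    # one outlier row (illustrative)

# INT8 per-tensor: a single scale, set by the largest weight anywhere.
s_tensor = np.abs(w).max() / 127
# INT8 per-channel: one scale per output row.
s_channel = np.abs(w).max(axis=1, keepdims=True) / 127
# INT4 per-group: levels -7..7, one scale per 64 consecutive weights in a row.
groups = w.reshape(16, -1, 64)
s_group = np.abs(groups).max(axis=2, keepdims=True) / 7

reconstructions = {
    "INT8 per-tensor": fake_quant(w, s_tensor, 127),
    "INT8 per-channel": fake_quant(w, s_channel, 127),
    "INT4 per-group": fake_quant(groups, s_group, 7).reshape(w.shape),
}

typical = np.arange(16) != 3                    # every row except the outlier
errors = {name: rms((w - w_hat)[typical]) for name, w_hat in reconstructions.items()}
for name, e in errors.items():
    print(f"{name:17s} RMS error on typical rows: {e:.6f}")
```

With the outlier present, the per-tensor scale is so coarse that the typical rows lose most of their resolution, while per-channel scales confine the damage to the outlier row.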

In Phase 26 we implement INT8 (per-tensor and per-channel) and one GPTQ variant on a single linear layer. The other schemes are studied from the literature but not coded.

Why this comes after Phase 24 (PyTorch introduction)

Quantization needs framework-grade infrastructure: hooks to record activations during calibration, fake-quant ops that simulate INT8 in FP32 (so we don't need to write a real INT8 kernel), modules whose forward swaps between quantized and FP32 paths. We could do this in NumPy — but every model layer would need re-wiring, and the boilerplate would dwarf the math.
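For a single layer, the three ingredients named above (activation recording, fake-quant, a switchable forward path) fit in a few lines. The sketch below is framework-free NumPy and purely illustrative: `FakeQuantLinear` and its attributes are hypothetical names, not a PyTorch API, and the in-line recording only plays the role that a forward hook would play in a real framework.

```python
import numpy as np

class FakeQuantLinear:
    """Hypothetical single-layer stand-in for a framework module."""

    def __init__(self, weight: np.ndarray, bias: np.ndarray):
        self.weight = weight.astype(np.float32)
        self.bias = bias.astype(np.float32)
        self.quantized = False      # which forward path is active
        self.calibrating = False    # record activation ranges when True
        self.act_absmax = 0.0       # largest |x| seen during calibration
        # Per-channel INT8 fake-quant: integer grid values stored as FP32.
        scale = np.abs(self.weight).max(axis=1, keepdims=True) / 127.0
        grid = np.clip(np.rint(self.weight / scale), -127, 127)
        self.weight_fq = (grid * scale).astype(np.float32)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        if self.calibrating:        # stands in for a forward hook
            self.act_absmax = max(self.act_absmax, float(np.abs(x).max()))
        w = self.weight_fq if self.quantized else self.weight
        return w @ x + self.bias

rng = np.random.default_rng(0)
layer = FakeQuantLinear(rng.standard_normal((8, 32)), np.zeros(8))
x = rng.standard_normal(32).astype(np.float32)

layer.calibrating = True
y_fp32 = layer(x)                   # FP32 path, recording the activation range
layer.calibrating = False
layer.quantized = True
y_int8 = layer(x)                   # simulated INT8 path, still FP32 arithmetic

print("recorded activation absmax:", layer.act_absmax)
print("max output difference     :", float(np.abs(y_fp32 - y_int8).max()))
```

Repeating this wiring by hand for every layer of a model is the boilerplate the paragraph above refers to.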

PyTorch was introduced in Phase 24 specifically so that Phase 26 can lean on torch.quantization primitives (or write our own equivalents — see the BLUEPRINT). We use PyTorch as a substrate for the math, not as a black-box quantizer: every quantization step is hand-coded; PyTorch only handles tensor storage and the autograd graph (which we don't use here since this is post-training).

Why this comes before Phase 27 (Modern Attention)

Phase 27 introduces FlashAttention and PagedAttention. Both are dominated by the same insight: memory traffic is the bottleneck. FlashAttention keeps the working set in SRAM; PagedAttention paginates the KV cache. Both are roofline manipulations.

Phase 26 trains the reader's eye to see "this kernel moves B bytes; cut B and you raise intensity". Phase 27 applies the same eye to attention. Phase 28 (LoRA) applies it to fine-tuning. Phase 29 (RAG) applies it to inference at scale.

Quantization is the cleanest demonstration of the principle because the FLOPs are literally identical before and after — only the bytes change. By the end of this phase, when Borja hears "FlashAttention is 3× faster", the next sentence in his head should already be "...because it changed the byte count, not the FLOP count".

What we measure on Borja's machine

The DoD includes a re-plot of the Phase 1 roofline with three MiniGPT inference dots overlaid: FP32, INT8 per-channel, INT4 per-group. The expected picture:

FP32 dot: low intensity, low performance, sitting far below the memory ceiling because Python and framework overhead leave bandwidth on the table.
INT8 dot: 4× higher intensity, 3–4× higher tokens/sec.
INT4 dot: 8× higher intensity, 5–6× higher tokens/sec (less than 8× because the per-group scales add bytes to every fetch and unpacking two weights per byte costs extra instructions on AVX2 without VNNI).
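One way to turn a measured tokens/sec figure into a dot on the plot is sketched below. It assumes every generated token touches each weight exactly once and ignores activations and the KV cache. `N_PARAMS` and `BANDWIDTH_GBS` are placeholders, not MiniGPT's parameter count or the laptop's measured bandwidth, and the function names are hypothetical.

```python
def roofline_point(n_params: int, bytes_per_weight: float, tokens_per_sec: float):
    """Return (intensity in FLOPs/byte, achieved GFLOP/s) for batch-1 decoding."""
    flops_per_token = 2 * n_params                 # one multiply-accumulate per weight
    bytes_per_token = bytes_per_weight * n_params  # every weight streamed once
    intensity = flops_per_token / bytes_per_token
    gflops = flops_per_token * tokens_per_sec / 1e9
    return intensity, gflops

def bandwidth_bound_tokens_per_sec(n_params: int, bytes_per_weight: float,
                                   bandwidth_gbs: float) -> float:
    """Upper bound on tokens/sec if DRAM bandwidth were the only limit."""
    return bandwidth_gbs * 1e9 / (bytes_per_weight * n_params)

N_PARAMS = 10_000_000     # placeholder model size
BANDWIDTH_GBS = 20.0      # placeholder sustained bandwidth

for name, bpw in [("FP32", 4.0), ("INT8", 1.0), ("INT4", 0.5)]:
    ceiling = bandwidth_bound_tokens_per_sec(N_PARAMS, bpw, BANDWIDTH_GBS)
    intensity, gflops = roofline_point(N_PARAMS, bpw, ceiling)
    print(f"{name}: at most {ceiling:.0f} tok/s -> "
          f"dot at I = {intensity:.1f} FLOPs/byte, {gflops:.0f} GFLOP/s")
```

Feeding the measured tokens/sec into `roofline_point` instead of the ceiling gives the actual dot; its vertical distance to the ceiling is the bandwidth left on the table.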

If the measured ratios diverge from these predictions, the lab includes follow-up questions about why — bandwidth not actually saturated, AVX2 INT8 path missing, dequantize-in-FP32 path active, etc.

One-paragraph recap

Quantization is a roofline optimization, not a compression scheme. It raises the arithmetic intensity of bandwidth-bound kernels (matrix-vector multiply, attention, etc.) by reducing bytes per weight without changing FLOPs. The cost is bounded round-off error, which we control by choosing the unit of scale (per-tensor → per-channel → per-group) and the quantization algorithm (round-to-nearest → GPTQ). The win is real because most local inference is bandwidth-bound by a factor of 10–20×; on the roofline's log scale, INT8 (4×) closes roughly half of that gap and INT4 (8×) most of it. The rest of Phase 26 derives the formulas, bounds the errors, and measures the trade-offs on Borja's MiniGPT.

Next: theory/01-number-formats.md.