02 — Scales and Zeros: the Quantization Map¶

Quantization is an affine map between reals and integers: q = round(x/s) + z. Here we derive the symmetric (z = 0) and asymmetric maps, bound the maximum error, and show why the granularity of s (per-tensor, per-channel, per-group) matters more than the number of bits.

The map¶

The quantization function maps a real value x ∈ ℝ to an integer q ∈ ℤ using two parameters:

Scale s > 0 — the size of one quantization step in real units.
Zero-point z ∈ ℤ — the integer that represents real zero.

Forward (quantize):

\[ q = \text{round}(x / s) + z \]

Backward (dequantize):

\[ \hat{x} = s \cdot (q - z) \]

The set of representable values is {s(q - z) : q ∈ [q_min, q_max]}. For INT8, [q_min, q_max] = [-128, 127] (signed) or [0, 255] (unsigned).
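The forward and backward maps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the explicit clamp to [q_min, q_max] is an assumption about how out-of-range values are handled.

```python
def quantize(x, s, z, q_min=-128, q_max=127):
    """Affine quantize: q = round(x / s) + z, clamped to the integer range."""
    q = round(x / s) + z              # Python's round() sends exact ties to even
    return max(q_min, min(q_max, q))  # assumption: saturate instead of wrapping


def dequantize(q, s, z):
    """Reconstruct the real value: x_hat = s * (q - z)."""
    return s * (q - z)


s, z = 0.1, 0
for x in (0.0, 0.26, -1.0, 50.0):
    q = quantize(x, s, z)
    print(x, q, dequantize(q, s, z))
```

The last input (50.0) lies outside the representable range for this scale, so it saturates at q_max rather than round-tripping.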

Symmetric vs asymmetric¶

Symmetric quantization fixes z = 0. The set of representable values is centred on real zero, spaced uniformly by s. Choosing s such that s × 127 = M where M = max(|x|):

\[ s = M / 127, \quad q = \text{round}(x / s), \quad q \in [-127, 127] \]

We waste the bin at q = -128 for symmetry. The off-by-one is conventional and avoids the asymmetric edge case where q_min doesn't have a symmetric partner.

Asymmetric quantization picks z so that the range [x_min, x_max] maps to [0, 255] (unsigned INT8):

\[ s = (x_{\max} - x_{\min}) / 255, \quad z = \text{round}(-x_{\min} / s) \]

Asymmetric is better when the distribution is one-sided (e.g. post-ReLU activations: all ≥ 0). Symmetric is better when the distribution is two-sided and zero-centred (e.g. weights of a Linear after standard initialization).
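To make the comparison concrete, the sketch below computes both parameter sets from the formulas above on a small, made-up post-ReLU sample. The helper names and the sample values are hypothetical, and the asymmetric helper assumes x_max > x_min.

```python
def symmetric_params(xs, q_max=127):
    """s = M / 127 with M = max|x|; the zero-point is fixed at 0."""
    M = max(abs(x) for x in xs)
    return M / q_max, 0


def asymmetric_params(xs, q_max=255):
    """s = (x_max - x_min) / 255, z = round(-x_min / s); assumes x_max > x_min."""
    x_min, x_max = min(xs), max(xs)
    s = (x_max - x_min) / q_max
    z = round(-x_min / s)
    return s, z


relu_like = [0.0, 0.5, 1.2, 3.0, 6.0]       # one-sided, all >= 0
s_sym, z_sym = symmetric_params(relu_like)
s_asym, z_asym = asymmetric_params(relu_like)
print(s_sym, s_asym, s_sym / s_asym)         # the asymmetric step is about half as large

two_sided = [-1.0, 3.0]
print(asymmetric_params(two_sided))          # zero-point lands inside [0, 255]
```

On one-sided data the symmetric grid spends half of its levels on negative values that never occur, which is why its step comes out roughly twice as coarse.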

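For Phase 26 we use symmetric for weights, asymmetric for activations. This is a common pairing in production INT8 schemes. (Note that GPTQ quantizes weights only, and LLM.int8() uses symmetric absmax scaling for both weights and activations.)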

Error bounds¶

The quantization error per element is e = x - \hat{x}. We want \sup_x |e|.

For symmetric INT8 with scale s = M/127:

\[ |e| = |x - s \cdot \text{round}(x / s)| \]

Rounding to the nearest grid point introduces at most s/2 of error per element (no clipping occurs, because |x| ≤ M):

\[ |e| \leq s/2 = M/254 \]

This is the per-element bound. Across N elements with bounded second moment, the L2 error scales as \sqrt{N} \cdot s/\sqrt{12} (assuming round-off errors are independent and uniformly distributed on [-s/2, s/2], which is approximately true for non-pathological distributions). The factor 1/\sqrt{12} comes from the variance of a uniform distribution on [-s/2, s/2]:

\[ \text{Var}(e) = s^2 / 12, \quad \|e\|_2 \approx s \sqrt{N/12} \]
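Both statements are easy to check numerically. The sketch below draws uniform samples on [-M, M] (the sample size and seed are arbitrary choices), quantizes them symmetrically, and compares the observed errors with the s/2 bound and the s·√(N/12) estimate.

```python
import math
import random

random.seed(0)
M, N = 1.0, 100_000
s = M / 127

xs = [random.uniform(-M, M) for _ in range(N)]
errs = [x - s * round(x / s) for x in xs]        # e = x - x_hat, no clipping needed

max_err = max(abs(e) for e in errs)
l2_err = math.sqrt(sum(e * e for e in errs))
predicted = s * math.sqrt(N / 12)

print("max |e| :", max_err, " bound s/2:", s / 2)
print("||e||_2 :", l2_err, " estimate :", predicted)
```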

Where this bound is tight¶

When the distribution is uniform on [-M, M], round-off errors really are uniform on [-s/2, s/2], the second moment is exactly s^2/12, and the bound is sharp.

Where this bound is loose (and what to do)¶

When the distribution has outliers — a tiny fraction of elements with |x| ≫ \sigma_x — those outliers force M (and hence s) to be huge. Most elements then live near zero, well below the quantization grid resolution: the effective bits used per element collapse from 8 to ~2-3.

This is the most important practical fact about quantization. Outlier-driven scale inflation explains why:

Per-tensor INT8 fails on attention output projections (one row of the matrix has 100× the magnitude of the others).
LLM.int8() exists at all (factor out the outlier rows into FP16; INT8 the rest).
SmoothQuant works (migrate the outlier magnitude from activations to weights, which can absorb it).
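The collapse in effective bits can be illustrated with a rough experiment. The sketch below uses the empirical entropy of the quantized values as a stand-in for "effective bits"; that proxy, the Gaussian bulk, and the outlier magnitude of 100 are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def entropy_bits(values, M):
    """Empirical entropy (in bits) of round(x / s) with s = M / 127."""
    s = M / 127
    counts = Counter(round(v / s) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


random.seed(0)
bulk = [random.gauss(0.0, 1.0) for _ in range(10_000)]   # sigma_x = 1

clean_bits = entropy_bits(bulk, max(abs(v) for v in bulk))  # M set by the bulk itself
inflated_bits = entropy_bits(bulk, 100.0)                   # M set by one outlier at 100 sigma

print("bits used, no outlier  :", clean_bits)
print("bits used, with outlier:", inflated_bits)
```

The bulk values are identical in both runs; only the scale changes, and with it the number of grid levels they actually occupy.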

The unit of `s`: per-tensor vs per-channel vs per-group¶

The "unit" is the slab of weights that share one scale.

| Granularity | One `s` per | Overhead | Quality | When |
|---|---|---|---|---|
| Per-tensor | Whole weight matrix `W` | 1 scalar per layer | Worst | Baseline only |
| Per-channel | Each output row of `W` | `out` scalars per layer | Medium | INT8 default |
| Per-group | Each contiguous block of `g` weights within a row (e.g. `g=64`) | `out × in/g` scalars per layer | Best | INT4 default |

For a Linear(in=768, out=768):

Per-tensor: 1 scalar. Stored in FP16 = 2 bytes overhead.
Per-channel: 768 scalars. 1.5 KiB overhead.
Per-group (g=64): 768 × (768/64) = 768 × 12 = 9216 scalars. 18 KiB overhead.

In INT4 terms, the weight matrix itself is 768 × 768 / 2 bytes = 288 KiB. The per-group overhead (18 KiB) adds 6.25%, raising the effective bits per weight from 4 to 4.25 (16 scale bits shared by 64 weights), i.e. roughly 4.3. Worth it: per-group INT4 can reach perplexity that per-tensor INT8 often cannot.
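The overhead arithmetic for this layer can be reproduced directly. The variable names below are illustrative; the only inputs are the layer shape, the group size, and 2-byte FP16 scales, as in the text.

```python
out_f, in_f, g = 768, 768, 64
scale_bytes = 2                               # one FP16 scale

n_scales = {
    "per-tensor": 1,
    "per-channel": out_f,
    "per-group": out_f * (in_f // g),
}
overhead = {name: n * scale_bytes for name, n in n_scales.items()}

int4_weight_bytes = out_f * in_f // 2         # two 4-bit weights per byte
total_bytes = int4_weight_bytes + overhead["per-group"]
effective_bits = 8 * total_bytes / (out_f * in_f)

for name, nbytes in overhead.items():
    print(f"{name:12s} {n_scales[name]:5d} scales  {nbytes / 1024:.3f} KiB")
print("INT4 weights:", int4_weight_bytes / 1024, "KiB")
print("effective bits per weight (per-group INT4):", effective_bits)
```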

Why finer granularity helps so much¶

Consider a row of W with two natural clusters: 99% of weights in [-1, 1], 1% in [-100, 100]. Per-row scale s = 100/127 ≈ 0.8, so the 99% cluster is quantized at resolution 0.8 — every weight in [-0.4, 0.4] collapses to 0. We have effectively destroyed most of the information.

Per-group scale (group size 64) lets each group pick its own scale. A group drawn entirely from the 99% cluster sees M ≈ 1 and scale s ≈ 0.008, a 100× finer resolution. Any group that contains an outlier still gets the coarse scale, so the gain depends on outliers being concentrated in a few groups rather than spread across all of them.

This is why per-group INT4 often beats per-tensor INT8 on perplexity despite having half the bits per weight.
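The two-cluster argument can be checked on a small hypothetical row: 64 small weights followed by a second group that holds a single outlier of magnitude 100. The row contents and helper names are made up for illustration, and the sketch uses the INT8 grid (s = M/127) for both granularities so that only the unit of scaling differs.

```python
def quantize_dequantize(ws, M):
    """Symmetric round trip with s = M / 127; an all-zero block is returned as is."""
    if M == 0:
        return list(ws)
    s = M / 127
    return [s * round(w / s) for w in ws]


def per_group(ws, g):
    """Give each contiguous block of g weights its own scale."""
    out = []
    for i in range(0, len(ws), g):
        block = ws[i:i + g]
        out += quantize_dequantize(block, max(abs(w) for w in block))
    return out


# 64 small weights spread over [-0.3, 0.3], then 63 zeros and one outlier.
row = [0.3 * (i - 31.5) / 31.5 for i in range(64)] + [0.0] * 63 + [100.0]

per_row = quantize_dequantize(row, max(abs(w) for w in row))
grouped = per_group(row, 64)

err_row = max(abs(a - b) for a, b in zip(row[:64], per_row[:64]))
err_grp = max(abs(a - b) for a, b in zip(row[:64], grouped[:64]))

print("small weights after per-row quantization:", set(per_row[:64]))
print("max error, per-row  :", err_row)
print("max error, per-group:", err_grp)
```

With one scale for the whole row, every small weight dequantizes to zero; with per-group scales the first group keeps its own fine grid and the outlier only affects the group it sits in.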

Choosing `M`: max, percentile, MSE¶

The naive M = max(|x|) is sensitive to outliers. Three alternatives:

1. Percentile clipping. M = quantile(|x|, 0.999). Anything above gets clipped to M. Trades a small clipping error for a large scale-inflation error. Lab 00 sweeps the percentile and looks at the perplexity curve.
2. MSE minimization. Choose M to minimize E[(x - \hat{x})^2] over the calibration distribution. Approximate closed forms exist for some parametric families (e.g. Gaussian or Laplace); in general it is a one-dimensional numerical search.
3. KL divergence. Used by TensorRT. Choose M to minimize KL between the histogram of x and the histogram of \hat{x}.

For Phase 26, we use (1) at the 99.9th percentile as default and (2) as a sanity check. (3) is reading-only.
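Options (1) and (2) can be sketched as follows. Everything here is an illustrative assumption rather than the lab code: the helper names, the synthetic data (a Gaussian bulk plus two outliers), the simple grid search over M, and the use of a coarse 4-bit symmetric grid (q_max = 7), chosen so that the trade-off between clipping error and rounding error is visible on a small sample.

```python
import random


def quant_mse(xs, M, q_max):
    """Mean squared error of symmetric quantization that clips at +/-M."""
    s = M / q_max
    total = 0.0
    for x in xs:
        q = max(-q_max, min(q_max, round(x / s)))
        total += (x - s * q) ** 2
    return total / len(xs)


def percentile_M(xs, p):
    """M = the p-quantile of |x| (nearest-rank, no interpolation)."""
    mags = sorted(abs(x) for x in xs)
    return mags[min(len(mags) - 1, int(p * len(mags)))]


def mse_M(xs, q_max, n_grid=100):
    """Grid search for the M that minimizes quant_mse; includes max|x| itself."""
    M_max = max(abs(x) for x in xs)
    candidates = [M_max * k / n_grid for k in range(1, n_grid)] + [M_max]
    return min(candidates, key=lambda M: quant_mse(xs, M, q_max))


random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(5000)] + [20.0, -30.0]
Q_MAX = 7

M_naive = max(abs(x) for x in xs)
M_pct = percentile_M(xs, 0.999)
M_mse = mse_M(xs, Q_MAX)

for name, M in (("max", M_naive), ("99.9th pct", M_pct), ("MSE search", M_mse)):
    print(f"{name:10s} M = {M:7.3f}  MSE = {quant_mse(xs, M, Q_MAX):.4f}")
```

MSE on the tensor is only a proxy; the text's criterion for picking the percentile is the perplexity curve, which this sketch does not measure.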

Quantizing activations¶

Weights are static — we can compute M once at calibration time. Activations are dynamic — they depend on the input.

Two strategies:

Static activation quantization. Run a calibration set (typically 128 samples) through the model in FP32; record activation statistics per layer; choose s, z once; use these at inference. Fast at inference; sensitive to distribution shift between calibration and deployment.

Dynamic activation quantization. Compute M = max(|x|) on the fly per input. More accurate per-sample, but the scale is only known after an extra reduction pass over the activations, which adds latency to every call.

For Phase 26 PTQ we use static activation quantization for INT8, which matches most CPU runtimes. (LLM.int8() itself computes its activation scales dynamically at inference time.)
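The difference between the two strategies shows up when an input falls outside the calibrated range. The sketch below uses tiny made-up calibration batches and a hypothetical helper set; widening the range to include real zero is a common practice assumed here, not something stated above.

```python
def asym_params(x_min, x_max, q_max=255):
    """Asymmetric scale and zero-point; the range is widened to contain 0."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    s = (x_max - x_min) / q_max
    return s, round(-x_min / s)


def quantize(xs, s, z, q_max=255):
    return [max(0, min(q_max, round(x / s) + z)) for x in xs]


def dequantize(qs, s, z):
    return [s * (q - z) for q in qs]


# Static: min/max recorded once over a (hypothetical) calibration set, then frozen.
calibration = [[0.0, 0.4, 1.9], [0.1, 2.5, 0.7], [0.0, 1.1, 3.0]]
lo = min(min(batch) for batch in calibration)
hi = max(max(batch) for batch in calibration)
s_static, z_static = asym_params(lo, hi)

# An inference input whose largest value exceeds anything seen during calibration.
x = [0.2, 2.0, 4.5]
x_static = dequantize(quantize(x, s_static, z_static), s_static, z_static)

# Dynamic: the range is recomputed from the input itself.
s_dyn, z_dyn = asym_params(min(x), max(x))
x_dyn = dequantize(quantize(x, s_dyn, z_dyn), s_dyn, z_dyn)

print("static :", x_static)   # the 4.5 saturates at the calibrated maximum
print("dynamic:", x_dyn)
```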

The `Linear` forward in INT8¶

The quantized forward path for y = W x + b:

W_int8, s_W    = quantize_symmetric_per_channel(W)             # at calibration
s_x, z_x       = ...                                            # at calibration (static)
x_int8         = quantize_asymmetric(x, s_x, z_x)               # at inference
y_int32        = W_int8 @ x_int8        # INT32 accumulator
y_float        = s_W * s_x * (y_int32 - z_x * sum_over_in(W_int8))
y_float       += b                       # bias in FP16/FP32
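The pseudocode above can be exercised end to end in plain Python to confirm that the zero-point correction term is right. This is a minimal sketch on a made-up 2×3 layer: Python integers stand in for the INT32 accumulator, lists stand in for tensors, the activation range is taken from the single input for brevity (it would come from calibration in the static scheme), and the function bodies are hypothetical fill-ins for the names used in the pseudocode.

```python
def quantize_symmetric_per_channel(W):
    """One scale per output row: s = max|w| / 127, q = round(w / s)."""
    W_q, s_W = [], []
    for row in W:
        s = max(abs(w) for w in row) / 127
        s_W.append(s)
        W_q.append([round(w / s) for w in row])
    return W_q, s_W


def quantize_asymmetric(x, s_x, z_x):
    """Unsigned 8-bit activations: q = round(x / s) + z, clamped to [0, 255]."""
    return [max(0, min(255, round(v / s_x) + z_x)) for v in x]


def linear_int8(W_q, s_W, x_q, s_x, z_x, b):
    y = []
    for row, s_w, b_i in zip(W_q, s_W, b):
        acc = sum(w * xq for w, xq in zip(row, x_q))   # integer accumulator
        acc -= z_x * sum(row)                          # remove the zero-point offset
        y.append(s_w * s_x * acc + b_i)                # rescale, then add float bias
    return y


W = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.7]]
b = [0.1, -0.2]
x = [-1.0, 0.5, 3.0]

W_q, s_W = quantize_symmetric_per_channel(W)
s_x = (3.0 - (-1.0)) / 255          # activation range [-1, 3]
z_x = round(1.0 / s_x)              # z = round(-x_min / s)
x_q = quantize_asymmetric(x, s_x, z_x)

y_q = linear_int8(W_q, s_W, x_q, s_x, z_x, b)
y_ref = [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]
print("quantized:", y_q)
print("float    :", y_ref)
```

The term `z_x * sum(row)` depends only on the weights and the calibrated zero-point, so a real kernel can precompute it once per output channel.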

The INT32 accumulator is critical: a single INT8 × INT8 product can already exceed the INT8 range (127 × 127 = 16129), and a dot product sums one such term per input feature, so the accumulator must be wider than the inputs. On CPUs with VNNI (AVX512-VNNI from Cascade Lake and Ice Lake, AVX-VNNI from Alder Lake), the fused vpdpbusd instruction multiplies unsigned 8-bit by signed 8-bit values and accumulates into INT32. Borja's Kaby Lake R lacks VNNI; PyTorch's INT8 path on this CPU falls back to a dequant-then-fp32-matmul sequence, which is slower than plain FP32 matmul.

This matters for Lab 02. Don't expect INT8 inference to be faster on Borja's machine until we cross-compile or use llama.cpp's hand-tuned AVX2 INT8 kernels. The lab measures bytes-on-disk and PPL; speed measurements are framed as "what we would see on a VNNI-capable CPU".
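The accumulator-width requirement from the forward-path discussion is plain arithmetic. The sketch below assumes signed weights in [-127, 127], unsigned activations in [0, 255], and the 768-wide layer used throughout this file.

```python
in_features = 768

max_product = 127 * 255                  # largest |weight| times largest activation
worst_sum = in_features * max_product    # every term at its extreme, same sign

fits_int8 = max_product <= 2**7 - 1
fits_int16 = worst_sum <= 2**15 - 1
fits_int32 = worst_sum <= 2**31 - 1

print("largest single product:", max_product)
print("worst-case dot product:", worst_sum)
print("fits INT8 / INT16 / INT32:", fits_int8, fits_int16, fits_int32)
```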

Drill problems¶

Solutions in solutions/02-scales-and-zeros-ref.md (phase open). Try without running.

A symmetric INT8 quantizer on a tensor with M = 10. What's s? What's the max per-element error? Express the SQNR (signal-to-quantization-noise ratio in dB) assuming uniform distribution on [-10, 10].
A Linear(in=768, out=768). Compute the storage in bytes for: (a) FP32, (b) INT8 per-tensor, (c) INT8 per-channel (FP16 scales), (d) INT4 per-group=64 (FP16 scales). Verify the "INT4 per-group is ~4.3 effective bits" claim.
A row of W has [w_1, ..., w_{63}, w_{64}] = [0.01, ..., 0.01, 50]. Per-channel INT8 quantize this row and show what happens to w_1, ..., w_{63} after dequantization. Now per-group with group size 32: same question. Explain the difference quantitatively.

One-paragraph recap¶

Quantization is an affine map q = round(x/s) + z with per-element error bounded by s/2. The error is small when s is small, and s is small when the unit of scaling (per-tensor, per-channel, per-group) is matched to the local distribution. Outliers blow up s and destroy resolution; the remedy is fine-grained scales, not more bits. For Phase 26 we use symmetric per-channel INT8 on weights and asymmetric static INT8 on activations as the default, with per-group INT4 as the 4-bit setting. The next theory file shows how GPTQ improves per-group INT4 by exploiting activation statistics.

Next: theory/03-gptq-and-nf4.md.