04 — AWQ, SmoothQuant, LLM.int8(): a survey

Three tricks the industry uses but that Borja does not implement in this phase: AWQ weights by activation importance, SmoothQuant migrates activation magnitude into the weights, and LLM.int8() routes the outliers to a separate FP16 path. Here we learn the why; the code stays as reading material.

Why a survey

Phase 26 implements two things from scratch: weights-only INT8 (per-tensor + per-channel) and GPTQ on a single Linear. The papers below are essential context but their full implementations would triple the phase's wall-clock without proportional pedagogical gain. Read them; reproduce one figure mentally; move on.

LLM.int8() (Dettmers et al., 2022)

Observation. Activations of large transformer layers contain systematic outliers concentrated in a small number of feature dimensions (~0.1% of dimensions, but they reach magnitudes 20× the typical activation). These outlier features dominate the attention output for a few specific tokens.

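Problem. A naive per-tensor INT8 quantization of activations chooses the scale s from the maximum absolute value, which is set by the outliers. The 99.9% of typical activations are then quantized at a resolution that leaves them only a handful of the 256 levels: if the largest outlier is 20× the typical magnitude, the typical range maps to roughly ±6 integer steps.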
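The resolution loss can be checked on a toy tensor. The snippet below is a minimal sketch in plain Python (no NumPy, to stay dependency-free); the activation values and the helper name `absmax_quantize` are made up for illustration and are not part of Phase 26's code.

```python
def absmax_quantize(values, bits=8):
    """Symmetric per-tensor quantization: one scale s shared by every value."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for INT8
    s = max(abs(v) for v in values) / qmax      # scale is set by the largest magnitude
    return [round(v / s) for v in values], s

# Hypothetical activations: 15 typical values in [-1, 1] plus one outlier at 20.
typical = [-1.0, -0.8, -0.6, -0.45, -0.3, -0.2, -0.1, 0.0,
           0.1, 0.2, 0.3, 0.45, 0.6, 0.8, 1.0]
with_outlier = typical + [20.0]

codes_out, s_out = absmax_quantize(with_outlier)
codes_ref, s_ref = absmax_quantize(typical)

# Count how many distinct integer codes the typical values end up using.
levels_out = len(set(codes_out[:len(typical)]))
levels_ref = len(set(codes_ref))

print(f"with outlier:    s = {s_out:.4f}, typical values use {levels_out} codes")
print(f"without outlier: s = {s_ref:.4f}, typical values use {levels_ref} codes")
```

On this toy data the 15 typical values collapse onto 13 codes in the range ±6 once the outlier sets the scale, while without it each value gets its own code across the full ±127 range. The exact count depends on how far below the maximum the bulk of the activations sits.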

Solution. Detect the outlier feature dimensions by magnitude (the paper thresholds at runtime on |x| ≥ 6.0, so no calibration set is needed). For those columns, do the matmul in FP16; for the rest, do the matmul in INT8; then sum the two partial results. The split adds latency overhead, but the typical-case INT8 path now uses its full dynamic range.
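The dual-path idea can be sketched for a single matrix-vector product. This is a simplified illustration under stated assumptions, not the paper's kernel: the function names, the toy `W` and `x`, and the per-row weight scaling are choices made here, outlier dimensions are picked by thresholding the input itself at 6.0 (the value reported in the LLM.int8() paper), and ordinary Python floats stand in for the FP16 path.

```python
THRESHOLD = 6.0  # magnitude threshold reported in the LLM.int8() paper

def quantize(values, qmax=127):
    """Absmax INT8 quantization; returns integer codes and the scale."""
    scale = max(abs(v) for v in values) / qmax or 1.0   # avoid a zero scale
    return [round(v / scale) for v in values], scale

def matvec_int8(W, x):
    """y = W x with x quantized per-tensor and each row of W quantized separately."""
    if not x:                                   # no regular columns at all
        return [0.0] * len(W)
    xq, sx = quantize(x)
    y = []
    for row in W:
        wq, sw = quantize(row)
        acc = sum(a * b for a, b in zip(wq, xq))    # integer accumulation
        y.append(acc * sw * sx)                     # dequantize the result
    return y

def matvec_mixed(W, x, threshold=THRESHOLD):
    """Outlier dimensions go through a float path, the rest through INT8."""
    outlier = {j for j, v in enumerate(x) if abs(v) >= threshold}
    regular = [j for j in range(len(x)) if j not in outlier]
    y_int8 = matvec_int8([[row[j] for j in regular] for row in W],
                         [x[j] for j in regular])
    y_float = [sum(row[j] * x[j] for j in outlier) for row in W]  # stand-in for FP16
    return [a + b for a, b in zip(y_int8, y_float)]

# Hypothetical layer: 2 outputs, 6 input features, feature 3 is an outlier.
W = [[0.5, -0.25, 0.75, 0.1, -0.6, 0.4],
     [-0.3, 0.8, 0.2, -0.05, 0.45, -0.7]]
x = [0.3, -0.7, 0.5, 20.0, -0.2, 0.9]

y_exact = [sum(w * v for w, v in zip(row, x)) for row in W]
y_naive = matvec_int8(W, x)
y_mixed = matvec_mixed(W, x)

err_naive = max(abs(a - b) for a, b in zip(y_exact, y_naive))
err_mixed = max(abs(a - b) for a, b in zip(y_exact, y_mixed))
print(f"max error, all-INT8:   {err_naive:.4f}")
print(f"max error, mixed path: {err_mixed:.4f}")
```

With the outlier removed from the INT8 path, the remaining activations are scaled by their own maximum (0.9 here instead of 20), which is where the accuracy gain comes from.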

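Trade-off. Perplexity essentially on par with the 16-bit baseline at INT8 (the paper reports no measurable degradation at scale); ~50% memory savings relative to FP16; speed is the weak point, since the gain from INT8 tensor-core matmuls on Ampere GPUs is partly or wholly eaten by the dual-path overhead (and there is no equivalent kernel on Borja's CPU).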

Why we don't implement. The outlier-detection + dual-path forward is ~500 lines of code with edge cases for every Linear, and the speed win only materializes on hardware Borja doesn't have. Read it; understand the principle; move on.

SmoothQuant (Xiao et al., 2022)

Observation. The outliers in LLM.int8() are in activations, not weights; the weights are well-behaved. The product y = W x is invariant under a diagonal rescaling shared between W and x, so we can move magnitude from one to the other without changing the math:

\[ W x = (W \text{diag}(s)) \text{diag}(s^{-1}) x = W' x' \]

If we pick s_j to shrink the outlier-channel of x and grow the corresponding column of W by the same factor, x' becomes well-behaved (no outliers) and W' becomes slightly less well-behaved (but weights had headroom).

Solution. Compute per-channel statistics of x from calibration data; choose s_j = \max(|x_j|)^{\alpha} / \max(|W_j|)^{1-\alpha} for some \alpha ∈ [0, 1] (the paper uses 0.5), where W_j is the j-th column of W; apply the diagonal rescaling offline; quantize the result with vanilla per-tensor INT8.
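A small numerical check of the rescaling identity, as a sketch in plain Python. It uses the smoothing factor from the SmoothQuant paper, s_j = max|x_j|^α / max|W_j|^(1-α); the matrices, the two "calibration" vectors, and the helper names `smooth_scales` and `matvec` are invented for illustration.

```python
def smooth_scales(act_absmax, W, alpha=0.5):
    """s_j = max|x_j|^alpha / max|W[:, j]|^(1 - alpha), one factor per input channel."""
    n_in = len(act_absmax)
    w_absmax = [max(abs(row[j]) for row in W) for j in range(n_in)]
    return [(a ** alpha) / (w ** (1 - alpha)) for a, w in zip(act_absmax, w_absmax)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

# Hypothetical layer and calibration activations; channel 3 carries the outlier.
W = [[0.5, -0.25, 0.75, 0.1],
     [-0.3, 0.8, 0.2, -0.05]]
calib = [[0.3, -0.7, 0.5, 20.0],
         [-0.4, 0.6, -0.9, -16.0]]

# Per-channel activation statistics gathered from the calibration set.
act_absmax = [max(abs(x[j]) for x in calib) for j in range(len(calib[0]))]
s = smooth_scales(act_absmax, W, alpha=0.5)

# Offline: scale column j of W by s_j. At inference: divide x_j by s_j.
W_s = [[w * sj for w, sj in zip(row, s)] for row in W]
x = calib[0]
x_s = [v / sj for v, sj in zip(x, s)]

y_ref = matvec(W, x)
y_smooth = matvec(W_s, x_s)

print("max |x|  before / after:", max(map(abs, x)), "/", round(max(map(abs, x_s)), 3))
print("max |W|  before / after:", max(abs(w) for r in W for w in r), "/",
      round(max(abs(w) for r in W_s for w in r), 3))
print("outputs match:", all(abs(a - b) < 1e-9 for a, b in zip(y_ref, y_smooth)))
```

With α = 0.5 the outlier channel's activation maximum and weight maximum both land on the geometric mean of the originals, so the activation range shrinks from 20 to about 1.4 while the largest weight grows from 0.8 to about 1.4.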

Trade-off. Comparable to LLM.int8() on quality; simpler at inference (single path); requires offline calibration to compute s. The rescaling can be folded into the previous layer's weight so there's no runtime overhead.
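The folding step can be sketched as follows, assuming the op that produces x is affine (x = P h + b) with nothing nonlinear between it and the quantized layer; in a transformer that op would typically be a LayerNorm gain and bias or a preceding Linear. The names `affine` and `fold_into_previous` and all numbers are hypothetical.

```python
def affine(P, b, h):
    """x = P h + b for a small dense layer stored as a list of rows."""
    return [sum(p * v for p, v in zip(row, h)) + bj for row, bj in zip(P, b)]

def fold_into_previous(P, b, s):
    """Return (P', b') such that P' h + b' equals diag(1/s) (P h + b)."""
    P_f = [[p / sj for p in row] for row, sj in zip(P, s)]   # scale output row j by 1/s_j
    b_f = [bj / sj for bj, sj in zip(b, s)]
    return P_f, b_f

# Hypothetical previous layer: 2 inputs -> 3 channels, with smoothing factors s.
P = [[0.2, -0.5], [1.5, 0.3], [-0.7, 0.9]]
b = [0.1, -0.2, 0.05]
s = [0.9, 4.0, 1.1]
h = [1.0, -2.0]

# Runtime rescaling: compute x, then divide each channel by s_j.
x_smoothed = [v / sj for v, sj in zip(affine(P, b, h), s)]

# Folded version: the division is baked into the previous layer's parameters.
P_f, b_f = fold_into_previous(P, b, s)
x_folded = affine(P_f, b_f, h)

print(all(abs(a - c) < 1e-9 for a, c in zip(x_smoothed, x_folded)))
```

Because the division by s_j lives in the stored parameters, the inference graph has exactly the same ops as before smoothing.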

Why we don't implement. The fold-into-previous-layer requires whole-graph awareness — every Linear in MiniGPT must be visited, scales propagated, residual paths handled. Not worth the engineering load for the pedagogical content.

AWQ (Lin et al., 2023)

Observation. Not all weights matter equally. The "important" weights are the ones that interact with high-magnitude activations. Preserving these at full precision and quantizing the rest aggressively gives a better quality/byte trade-off than uniform quantization.

Solution. Identify the top ~1% most important weight channels (by activation magnitude at calibration), and apply a per-channel scaling that increases their effective resolution under INT4 quantization. The scaling is mathematically equivalent to SmoothQuant's diagonal rescaling, but applied for the opposite reason: SmoothQuant moves magnitude away from sensitive places; AWQ moves quantization resolution toward important places.

The clever bit: AWQ does not rank weights by their own values; it infers which weights matter from the activation statistics. This makes AWQ calibration-data-driven, like GPTQ, but vastly cheaper (no Hessian Cholesky).
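A minimal sketch of the activation-driven search, assuming the form used in the AWQ paper: per-channel scales s_j = (mean |x_j|)^α, with α picked by a grid search that minimizes the layer's output error after round-to-nearest quantization. The per-row INT4 quantizer, the toy data, and every function name here are illustrative assumptions, not AWQ's reference implementation (which also uses weight groups and scale normalization).

```python
def quantize_row(row, bits=4):
    """Round-to-nearest symmetric quantization, one scale per output row.
    Returns the dequantized weights so errors can be measured directly."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for INT4
    scale = max(abs(w) for w in row) / qmax or 1.0    # avoid a zero scale
    return [round(w / scale) * scale for w in row]

def output_error(W, X, s, bits=4):
    """Mean squared error of Q(W diag(s)) (diag(s)^-1 x) against W x over X."""
    Wq = [quantize_row([w * sj for w, sj in zip(row, s)], bits) for row in W]
    err, n = 0.0, 0
    for x in X:
        x_s = [v / sj for v, sj in zip(x, s)]
        for row, row_q in zip(W, Wq):
            exact = sum(w * v for w, v in zip(row, x))
            approx = sum(w * v for w, v in zip(row_q, x_s))
            err += (exact - approx) ** 2
            n += 1
    return err / n

def awq_search(W, X, grid=20, bits=4):
    """Grid-search alpha in [0, 1]; alpha = 0 means no scaling at all."""
    n_in = len(X[0])
    # Importance comes from activations only; clamp to keep the scales positive.
    act_mean = [max(sum(abs(x[j]) for x in X) / len(X), 1e-8) for j in range(n_in)]
    best = None
    for k in range(grid + 1):
        alpha = k / grid
        s = [m ** alpha for m in act_mean]
        err = output_error(W, X, s, bits)
        if best is None or err < best[0]:
            best = (err, alpha, s)
    return best

# Hypothetical layer and calibration inputs; channel 1 has large activations.
W = [[0.9, 0.19, -0.33, 0.21],
     [-0.5, 0.12, 0.8, -0.27],
     [0.4, -0.23, 0.15, 0.7]]
X = [[0.5, 8.0, -0.4, 0.3],
     [-0.3, 6.0, 0.6, -0.5]]

plain_err = output_error(W, X, [1.0] * 4)
best_err, best_alpha, best_s = awq_search(W, X)
print(f"plain RTN error: {plain_err:.5f}")
print(f"best alpha = {best_alpha:.2f}, error = {best_err:.5f}")
```

Since α = 0 is part of the grid, the searched result can never be worse than plain round-to-nearest on the calibration set; how much better it gets depends on how dominant the high-activation channels are.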

Trade-off. Comparable to GPTQ at INT4; faster to compute (no Hessian); requires picking \alpha (the scaling exponent) per layer.

Why we don't implement. Confusable with SmoothQuant — they share machinery, differ in motivation. Pedagogically clearer to implement GPTQ (a clean math story) and survey AWQ.

What the survey teaches

Three patterns recur:

Look at activations, not just weights. Naive PTQ treats weights as the problem; modern PTQ treats activation distributions as the problem.
Use calibration data. Most modern schemes spend on the order of 100–500 forward passes on representative inputs to extract statistics. The cost is fixed and small; the quality lift is large.
Fold per-channel rescaling into adjacent ops at deploy time. SmoothQuant and AWQ add no runtime overhead this way, because the cleverness is offline; LLM.int8()'s dual path is the exception.

If Borja takes one thing from this file: the difference between a "research" PTQ and a "production" PTQ is usually 1–2 days of pre-processing code, not a fundamentally different algorithm. Phase 26's hand-written GPTQ + per-channel INT8 covers the algorithmic core; the production extras are engineering on top.

Reading checklist

For the reflections file at phase close:

Skim LLM.int8() (sections 1–3) — outlier detection criterion.
Skim SmoothQuant (sections 1–4) — the diagonal rescaling identity.
Skim AWQ (sections 1–4) — the calibration-based importance scoring.
Skim GPTQ (sections 1–4 — already implemented, but re-read the proof of optimality).
Skim QLoRA (section 3 — NF4 + double quantization).

Optional: skim "A Survey of Quantization Methods for Efficient Neural Network Inference" (Gholami et al., 2021) for the pre-LLM history.

Drill questions (no code)

SmoothQuant's diagonal rescaling is W' = W \text{diag}(s), x' = \text{diag}(s^{-1}) x. Show that this is exactly equivalent to multiplying the j-th column of W by s_j and dividing the j-th element of x by s_j. Then explain why this changes the quantization difficulty even though the matmul output is unchanged.
AWQ chooses s_j based on \max |x_j|^{\alpha} rather than \max |x_j|^1. Why fractional? (Hint: think about what \alpha = 0 and \alpha = 1 would do to the weight magnitudes.)
LLM.int8() splits ~0.1% of feature columns to an FP16 path. The reported overhead is ~10% latency on Ampere. Estimate the latency overhead on Borja's i5-8250U, assuming his AVX2 INT8 dequant-to-FP32 path is roughly 0.5× FP32 throughput and FP16 doesn't have hardware support (so FP16 is emulated as FP32). Result: would LLM.int8() be a win on Borja's hardware? (Hint: no.)

One-paragraph recap

Three production-grade PTQ schemes — LLM.int8(), SmoothQuant, AWQ — share a common observation: activation outliers, not weights, are the problem. Each handles the outliers differently: separate FP16 path (LLM.int8()), redistribute magnitude into weights (SmoothQuant), redistribute quantization resolution toward important weights (AWQ). For Phase 26 we read these but don't implement them — the algorithmic core they share with GPTQ is what matters pedagogically. Lab 02 (the quant-curve sweep) will surface activation-outlier behaviour in MiniGPT empirically, even though we don't fix it.

Next: lab/00-int8-ptq.md.