
03 - GPTQ and NF4

GPTQ does not quantize weights blindly: it uses the Hessian of the activations to redistribute rounding error onto the columns it has not yet quantized. NF4 abandons the uniform grid and uses the quantiles of an N(0,1) as an optimal code. Two small ideas, a large gain in perplexity.

Where round-to-nearest leaves money on the table

Per-channel INT8 round-to-nearest (RTN) is the baseline from theory 02. Its weakness: every weight is quantized in isolation. The error from quantizing w_1 doesn't influence how we quantize w_2, even though both will be summed in the same dot product against the same activation x.

GPTQ (Frantar et al., 2022) observes: the network only cares about the output of the dot product, not about individual weights. If we already incurred error on w_1, we can bias the rounding of w_2 to partially compensate.

The setup

Consider one Linear layer with weight matrix W ∈ ℝ^(out×in) and a calibration distribution of inputs X ∈ ℝ^(in×n_calib). The output is Y = W X. We want to choose Ŵ (quantized) to minimize

\[ \mathcal{L}(\hat{W}) = \mathbb{E}_x \| W x - \hat{W} x \|^2 = \text{tr}\big( (W - \hat{W})\, H \, (W - \hat{W})^\top \big) \]

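where H = X X^\top / n_calib is the empirical second-moment matrix of the inputs. Up to a factor of 2 it is the Hessian of \mathcal{L} with respect to each row of \hat{W}, which is why GPTQ calls it "the Hessian".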

The key fact: H couples columns of W through x. Specifically, the loss decomposes per output row, so we treat each row independently — but within a row, the in columns are coupled through H.

Per row, writing one row of W as a column vector w ∈ ℝ^in:

\[ \mathcal{L}(w, \hat{w}) = (w - \hat{w})^\top H (w - \hat{w}) \]

We want to round each w_i to a value on the quantization grid, with the rest of the row picking up the slack.
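The trace identity and the per-row decomposition can be checked numerically. The following is a minimal sketch assuming NumPy; the sizes and the random perturbation standing in for quantization error are illustrative choices, not part of the course code. The error matrix is placed on the left of H so that the shapes (out×in)(in×in)(in×out) line up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in, n_calib = 4, 6, 1000   # illustrative sizes (assumed)

W = rng.normal(size=(n_out, n_in))
W_hat = W + 0.05 * rng.normal(size=W.shape)   # stand-in for a quantized W
X = rng.normal(size=(n_in, n_calib))          # calibration inputs, one per column

H = X @ X.T / n_calib                         # empirical second-moment matrix
D = W - W_hat                                 # (out, in) error matrix

# Left side: mean squared output error over the calibration set.
lhs = np.mean(np.sum((D @ X) ** 2, axis=0))
# Right side: trace form, shapes (out,in) @ (in,in) @ (in,out).
rhs = np.trace(D @ H @ D.T)
# Row-wise decomposition: each output row contributes d^T H d on its own.
per_row = sum(d @ H @ d for d in D)

print(lhs, rhs, per_row)
```

All three numbers agree up to floating-point rounding, which is the sense in which rows can be quantized independently while columns within a row cannot.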

The GPTQ update

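Pick a column ordering. The baseline choice is "in original order" (no permutation): the same order is shared by every row, so all rows reuse one inverse Hessian, and the GPTQ paper reports that a fixed order performs about as well as a greedy per-row order on large models. For each column index i = 1, ..., in: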

1. Round w_i to the nearest grid value: \hat{w}_i = \text{round-to-grid}(w_i).
2. The error \delta_i = w_i - \hat{w}_i will propagate to the loss.
3. Distribute \delta_i to the remaining un-quantized columns i+1, ..., in so that, after re-quantization, the total contribution to the loss is minimized.

The optimal redistribution comes from the closed-form solution of a quadratic over the remaining columns. Skipping the linear algebra (see solutions/03-gptq-derivation.md at phase open), the update for column j > i is:

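\[ w_j \leftarrow w_j - \delta_i \cdot \frac{[H^{-1}]_{ij}}{[H^{-1}]_{ii}} \]

Here H^{-1} denotes the inverse of H restricted to the columns that are not yet quantized (i and everything after it), so in principle it changes at every step.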

The actual GPTQ algorithm avoids recomputing that inverse after every column. It factors the inverse Hessian once, H^{-1} = L L^\top with L = \text{chol}(H^{-1}) lower-triangular, and updates columns via

\[ w_j \leftarrow w_j - \frac{\delta_i}{L_{ii}} \cdot L_{ji} \]
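Why one Cholesky factor can replace repeated re-inversion is easy to check numerically. The sketch below assumes NumPy and a lower-triangular factor with H^{-1} = L L^T (the convention of `np.linalg.cholesky`); the 5-column size is arbitrary. It verifies two things: the first Cholesky column, divided by its diagonal entry, equals the closed-form ratio of inverse-Hessian entries, and the trailing block of L is a factor of the inverse Hessian of the remaining columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 5
X = rng.normal(size=(n_in, 200))
H = X @ X.T / X.shape[1]            # positive-definite for this random X

Hinv = np.linalg.inv(H)
L = np.linalg.cholesky(Hinv)        # lower-triangular, Hinv = L @ L.T

# Column 0: the Cholesky column scaled by its diagonal matches the
# closed-form ratio [H^-1]_{j0} / [H^-1]_{00}.
ratio_closed_form = Hinv[:, 0] / Hinv[0, 0]
ratio_cholesky = L[:, 0] / L[0, 0]

# Once column 0 is quantized, the relevant matrix is the inverse of H
# restricted to the remaining columns; the trailing block of L factors it.
Hinv_rest = np.linalg.inv(H[1:, 1:])
Hinv_rest_from_L = L[1:, 1:] @ L[1:, 1:].T

print(np.allclose(ratio_closed_form, ratio_cholesky))
print(np.allclose(Hinv_rest, Hinv_rest_from_L))
```

Applying the same argument to the trailing block gives the update for column 1, then column 2, and so on, which is why the division by L_ii appears at every step.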

The Cholesky structure lets us compute all updates in O(in^2) per row, for a total of O(out · in^2) plus a one-off O(in^3) for the factorization. That is the same order as pushing a batch of `in` tokens through the matrix. Hence GPTQ on a 7B model is a "few-minutes" job, not a "few-hours" one.
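The column loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the function names (`rtn`, `gptq_quantize`, `layer_loss`), the per-row absmax scale, the signed 4-bit grid {-8, ..., 7}, the 1% diagonal damping, and the AR(1)-correlated synthetic inputs are all assumptions made for the example. It also omits group-wise scales and the blocked updates a production implementation would use.

```python
import numpy as np

def rtn(w, scale, qmax=7):
    """Symmetric round-to-nearest onto the grid scale * {-qmax-1, ..., qmax}."""
    return scale * np.clip(np.round(w / scale), -qmax - 1, qmax)

def gptq_quantize(W, H, qmax=7, damp=0.01):
    """Quantize W (out, in) column by column, pushing each column's
    rounding error onto the columns that are still unquantized."""
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    # One scale per output row, fixed before any updates (assumed scheme).
    scale = np.abs(W).max(axis=1) / qmax
    # Damping keeps H invertible; 1% of the mean diagonal is an assumption.
    H = H + damp * np.mean(np.diag(H)) * np.eye(n_in)
    L = np.linalg.cholesky(np.linalg.inv(H))      # H^-1 = L @ L.T
    Q = np.zeros_like(W)
    for i in range(n_in):
        Q[:, i] = rtn(W[:, i], scale, qmax)
        err = (W[:, i] - Q[:, i]) / L[i, i]       # delta_i / L_ii, one per row
        # w_j <- w_j - (delta_i / L_ii) * L_ji for every remaining j > i
        W[:, i + 1:] -= np.outer(err, L[i + 1:, i])
    return Q

rng = np.random.default_rng(0)
n_out, n_in, n_calib = 16, 32, 512

# Synthetic correlated inputs: neighbouring features correlate at 0.9.
idx = np.arange(n_in)
C = 0.9 ** np.abs(idx[:, None] - idx[None, :])
X = np.linalg.cholesky(C) @ rng.normal(size=(n_in, n_calib))
H = X @ X.T / n_calib

W = rng.normal(size=(n_out, n_in))
scale = np.abs(W).max(axis=1, keepdims=True) / 7
W_rtn = rtn(W, scale)
W_gptq = gptq_quantize(W, H)

def layer_loss(W_hat):
    D = W - W_hat
    return np.trace(D @ H @ D.T)

print("RTN loss: ", layer_loss(W_rtn))
print("GPTQ loss:", layer_loss(W_gptq))
```

Both outputs live on the same grid and cost the same number of bits; only the rounding decisions differ. The more strongly the input features are correlated, the more room the update has to cancel error, so on uncorrelated inputs (diagonal H) the two methods coincide.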

What GPTQ actually buys

On weights-only INT4:

Scheme	PPL gap vs FP32 (LLaMA-7B, WikiText)
RTN per-tensor	7.3 (broken)
RTN per-group=128	1.2
GPTQ per-group=128	0.4

The gap-closing is significant: GPTQ is ~3× closer to FP32 than RTN at the same byte count. For Phase 26 on MiniGPT, the gap will be smaller in absolute terms (the model is tiny) but the relative improvement should reproduce.

What GPTQ assumes (and where it breaks)

The calibration distribution is representative. If H is computed from data that doesn't cover the deployment distribution, the redistribution biases wrong. Picking calibration data is the most under-discussed part of GPTQ in practice.
The Hessian is positive-definite. For very rank-deficient activation distributions (e.g. activations after a low-rank bottleneck), H is singular and we need a small diagonal jitter H + \epsilon I. Standard fix.
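Per-row independence. GPTQ doesn't redistribute errors across output rows. Under the layer-wise loss this costs nothing, because that loss decomposes by row. It is suboptimal only with respect to the end-to-end model loss, where later computation mixes the rows (as after an attention output projection), but the suboptimality is small.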
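The positive-definiteness assumption and its fix can be illustrated with a deliberately rank-deficient example. This is a sketch assuming NumPy; the rank-3 bottleneck and the choice of ε as 1% of the mean diagonal of H are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, rank, n_calib = 8, 3, 256

# Activations confined to a rank-3 subspace of R^8 (a low-rank bottleneck).
X = rng.normal(size=(n_in, rank)) @ rng.normal(size=(rank, n_calib))
H = X @ X.T / n_calib

rank_before = np.linalg.matrix_rank(H)            # 3, so H is singular

# Diagonal jitter H + eps * I; scaling eps by the mean diagonal is assumed.
eps = 0.01 * np.mean(np.diag(H))
H_damped = H + eps * np.eye(n_in)

min_eig = np.linalg.eigvalsh(H_damped).min()      # about eps, strictly positive
L = np.linalg.cholesky(np.linalg.inv(H_damped))   # now well-defined

print(rank_before, min_eig > 0, L.shape)
```

Scaling ε by the size of H rather than using a fixed constant keeps the amount of damping comparable across layers whose activations differ in magnitude.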

NF4: a different idea

NF4 (NormalFloat 4-bit) from Dettmers et al. (2023, QLoRA paper) abandons the uniform grid entirely. Observation: pre-trained weights are approximately normally distributed (zero-mean, similar variances per layer). For such distributions, the information-theoretically optimal 4-bit codebook is the set of 16 quantiles of N(0, 1):

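\[ \{ q_k : P(Z \leq q_k) = (k + \tfrac{1}{2})/16, \ Z \sim N(0,1), \ k = 0, ..., 15 \} \]

The midpoint probabilities (k + 1/2)/16 keep every entry finite; with k/16 the k = 0 entry would be the 0-quantile, which is -∞.

These quantiles are not uniformly spaced. They're dense near zero (where the bulk of the distribution lives) and sparse in the tails. The dequantization table for NF4 is built from such quantiles, rescaled into [-1, 1] and made slightly asymmetric (seven negative levels, an exact zero, eight positive levels) so that zero is representable.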
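A codebook of this kind can be generated with the standard library alone. The sketch below is an idealized construction, not the exact NF4 table: it takes 16 equal-mass quantiles at midpoint probabilities (k + 1/2)/16, which avoids the infinite 0-quantile, and rescales them into [-1, 1]. The function name `normal_quantile_codebook` is hypothetical, and unlike real NF4 this symmetric version contains no exact zero.

```python
from statistics import NormalDist

def normal_quantile_codebook(bits=4):
    """Equal-probability-mass quantiles of N(0, 1), rescaled into [-1, 1]."""
    n = 2 ** bits
    z = NormalDist()  # standard normal, mean 0 and sigma 1
    # Midpoint probabilities (k + 0.5) / n keep every quantile finite.
    q = [z.inv_cdf((k + 0.5) / n) for k in range(n)]
    top = max(abs(v) for v in q)
    return [v / top for v in q]

codebook = normal_quantile_codebook()
gaps = [b - a for a, b in zip(codebook, codebook[1:])]

print([round(v, 3) for v in codebook])
print("smallest gap (centre):", round(min(gaps), 3))
print("largest gap (tails):  ", round(max(gaps), 3))
```

The gap between neighbouring levels is several times larger in the tails than at the centre, which is the non-uniform spacing described above.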

The quantization step:

1. Compute per-block scale s = max(|w|) for a block of 64 weights.
2. Normalize: w / s ∈ [-1, 1] approximately.
3. Find the nearest of the 16 NF4 codebook values.
4. Store the 4-bit codebook index.

The dequantization step:

1. Look up the codebook value for the index.
2. Multiply by the stored scale.
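The quantize and dequantize steps translate almost line for line into NumPy. This is a sketch under stated assumptions: the codebook is a stand-in built from rescaled normal quantiles rather than the real NF4 table, the helper names `quantize_blocks` and `dequantize_blocks` are hypothetical, the input length must be a multiple of the block size, and indices are kept one per byte rather than packed two per byte.

```python
import numpy as np
from statistics import NormalDist

# Stand-in codebook (assumed): rescaled normal quantiles. The real NF4
# table differs slightly, being asymmetric with an exact zero.
_q = np.array([NormalDist().inv_cdf((k + 0.5) / 16) for k in range(16)])
CODEBOOK = _q / np.abs(_q).max()

def quantize_blocks(w, block=64):
    """Return (codebook indices, per-block scales) for a 1-D array."""
    blocks = w.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # step 1: absmax scale
    normed = blocks / scales                             # step 2: into [-1, 1]
    # Step 3: nearest codebook entry for every weight.
    idx = np.abs(normed[..., None] - CODEBOOK).argmin(axis=-1)
    return idx.astype(np.uint8), scales                  # step 4: keep indices

def dequantize_blocks(idx, scales):
    """Look up each index and multiply by its block's scale."""
    return (CODEBOOK[idx] * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=256)
idx, scales = quantize_blocks(w)
w_hat = dequantize_blocks(idx, scales)

print("max abs error:", np.abs(w - w_hat).max())
```

Because every normalized weight lies inside the codebook's range, the reconstruction error of any weight is bounded by half the widest codebook gap times its block's scale.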

Why NF4 outperforms uniform INT4

For a normally-distributed tensor, uniform INT4 wastes resolution on the tails (where almost no weights live) and starves the centre (where most weights live). NF4 reverses that — dense grid near zero, sparse in the tails. The total round-off error (under the normality assumption) drops by ~30% at the same bit count.

For Phase 26 we survey NF4 (theory file 04) and optionally implement the codebook lookup as a stretch goal. The full GPTQ implementation on a single Linear is the graded deliverable.

The double-quantization trick (QLoRA)

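QLoRA's "double quantization": the scales themselves (one FP32 per block of 64 weights, i.e. 32/64 = 0.5 bits per weight) are quantized to 8-bit floats in groups of 256, with a second outer FP32 scale per group. The overhead falls to 8/64 + 32/(64 · 256) ≈ 0.127 bits per weight, saving ~0.37 bits per weight on average. Pure engineering; no new math. Surveyed, not implemented.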
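The bit accounting behind double quantization is a two-line calculation. The sketch below assumes the configuration described in the QLoRA paper (FP32 first-level scales per 64 weights, re-quantized to 8 bits in groups of 256 scales with one FP32 outer scale per group); the helper name `scale_overhead_bits` is hypothetical.

```python
def scale_overhead_bits(scale_bits, weights_per_scale):
    """Bits of scale storage charged to each weight."""
    return scale_bits / weights_per_scale

# Assumed configuration: 32-bit scale per 64 weights, then 8-bit scales
# plus one 32-bit outer scale per 256 first-level scales.
plain = scale_overhead_bits(32, 64)
double = scale_overhead_bits(8, 64) + scale_overhead_bits(32, 64 * 256)

print(f"plain scales:     {plain:.3f} bits/weight")           # 0.500
print(f"double-quantized: {double:.3f} bits/weight")          # 0.127
print(f"saved:            {plain - double:.3f} bits/weight")  # 0.373
```

On top of the 4 bits per weight for the codebook indices, this brings the effective cost from 4.5 to roughly 4.13 bits per weight.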

How GPTQ interacts with quantization granularity

GPTQ doesn't replace per-group quantization; it composes with it:

Per-tensor INT4 + GPTQ: better than per-tensor INT4 RTN, but still bad — too coarse a scale.
Per-group INT4 + GPTQ: standard configuration. Group size 64 or 128.
Per-channel INT8 + GPTQ: rarely done — INT8 RTN is already close enough to FP32 that GPTQ's overhead doesn't pay back.

For Phase 26 we test the middle row (per-group INT4 + GPTQ) against per-group INT4 RTN to confirm GPTQ's gain on MiniGPT.
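The granularity half of that composition can be seen in isolation with plain round-to-nearest. The following sketch assumes NumPy, a symmetric signed INT4 grid with absmax scales, and a synthetic weight matrix with one planted outlier; the function name `rtn_int4` is hypothetical.

```python
import numpy as np

def rtn_int4(w, group=None):
    """Symmetric INT4 round-to-nearest with one absmax scale per group.
    group=None uses a single scale for the whole array (per-tensor)."""
    flat = w.reshape(1, -1) if group is None else w.reshape(-1, group)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7
    q = np.clip(np.round(flat / scale), -8, 7)
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 256))
w[0, 0] = 1.0            # one outlier, enough to wreck a per-tensor scale

mse_tensor = np.mean((w - rtn_int4(w)) ** 2)
mse_group = np.mean((w - rtn_int4(w, group=128)) ** 2)

print("per-tensor MSE:   ", mse_tensor)
print("per-group=128 MSE:", mse_group)
```

With a single scale, the outlier stretches the grid until nearly every ordinary weight rounds to zero. With per-group scales the damage stays inside the outlier's own group of 128, and that is the baseline GPTQ then improves on.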

Implementation cost in our setting

GPTQ on the largest Linear in MiniGPT (let's assume Linear(768, 768)):

Calibration: 128 samples × forward through this layer = 128 vectors of length 768, accumulated into an H of shape (768, 768). Storage: 768² × 4 bytes ≈ 2.25 MiB in FP32 → negligible.
Cholesky of H^{-1}: O(768^3) ≈ 4.5e8 FLOPs ≈ 2.3 milliseconds at 200 GFLOPS, or 23 milliseconds at Borja's realistic 20 GFLOPS on a single AVX2 thread. Tractable on the CPU.
Per-row updates: 768 rows × 768 columns × 768 column updates = 4.5e8 FLOPs again. Same order.
Total per layer: well under a second of raw arithmetic. Budgeting ~1 minute per layer leaves ample room for the calibration forward passes and an unoptimized implementation. MiniGPT has ~12 such layers → ~12 minutes as a conservative upper bound.

Within Phase 26's wall-clock budget.
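The estimate is simple enough to recompute directly. This is a back-of-the-envelope sketch: it assumes the Linear(768, 768) layer from above, counts FLOPs to order of magnitude only (constant factors dropped), and ignores memory traffic and interpreter overhead.

```python
n = 768                          # assumed layer width, Linear(768, 768)
flops_cholesky = n ** 3          # order-of-magnitude count for the factorization
flops_updates = n * n * n        # rows x columns x columns touched per update
h_bytes_fp32 = n * n * 4         # one FP32 entry per element of H

for gflops in (200, 20):
    seconds = (flops_cholesky + flops_updates) / (gflops * 1e9)
    print(f"{gflops:>3} GFLOPS: {seconds * 1e3:.1f} ms of arithmetic per layer")

print(f"H storage: {h_bytes_fp32 / 2**20:.2f} MiB")
```

Raw arithmetic comes to milliseconds per layer, so the wall-clock cost in practice is set by the calibration forward passes and by how the column loop is implemented, not by the FLOP count.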

Drill problems

Solutions at phase open in solutions/03-gptq-ref.md. Reason, don't code.

Show that the loss \mathcal{L}(\hat W) = \mathbb{E} \|Wx - \hat W x\|^2 equals \text{tr}((W - \hat W) H (W - \hat W)^\top) where H = \mathbb{E}[x x^\top]. State the assumption.
For a 1×2 weight vector w = [3, 5] quantized to INT2 on the symmetric 4-level grid s · {-1.5, -0.5, 0.5, 1.5} with scale s = 5/1.5 = 10/3, the RTN result is \hat{w} = [5/3, 5] (3/s = 0.9 → level 0.5, 5/s = 1.5 → level 1.5). Compute the loss under input distribution with H = [[1, 0.5], [0.5, 1]]. Now apply GPTQ: after rounding w_1, what's the optimal new value for w_2 (in real space) before its own round-to-grid? Compare losses.
NF4 codebook values for the symmetric 8-quantile case (4-bit quantiles of |Z| with a sign bit). Derive the 7 non-zero values numerically (using scipy.stats.norm.ppf mentally — just give an order-of-magnitude). Why is the smallest non-zero value approximately 0.13 and the largest approximately 3.5?

One-paragraph recap

GPTQ improves on round-to-nearest by computing the Hessian of the layer-wise loss from calibration activations and using it to redistribute quantization error among unquantized columns, closing roughly two-thirds of the gap between RTN-per-group and FP32 at 4-bit precision. NF4 improves on the uniform 4-bit grid by using quantiles of a normal distribution as codebook entries, exploiting the empirical fact that pre-trained weights are approximately normal. The two ideas are orthogonal: one chooses the grid, the other chooses how to round onto it. QLoRA's default, however, is NF4 with double quantization and plain nearest-codebook rounding, not GPTQ. For Phase 26 we implement GPTQ on a single Linear and survey NF4. The next file surveys AWQ and SmoothQuant for cultural completeness.

Next: theory/04-awq-survey.md.