Lab 01 — GPTQ on a Single Linear Layer

Goal: implement GPTQ for one Linear and show it beats RTN at INT4 per-group.

Estimated time: 6–10 hours (this is the hardest math+code combination of Phase 26).

Prereq: lab 00 committed; theory/03 read; src/miniquant/BLUEPRINT.md GPTQ section read.


What you produce

A directory experiments/26-gptq-toy/ containing:

  • gptq_toy.py — script that quantizes a synthetic Linear(768, 768) with both RTN and GPTQ, prints the loss tensor and the worst-case-row PPL proxy.
  • results.json — measurements.
  • loss_curve.png — quantization MSE per row index for RTN vs GPTQ.
  • manifest.json.
  • README.md — interpretation.

You commit src/miniquant/gptq.py implementing the algorithm.

The kernel

Implement GPTQ for one weight matrix W ∈ ℝ^(out × in) with calibration inputs X ∈ ℝ^(in × n), one input vector per column.

Algorithm sketch (theory file 03 has the math):

H = X @ X.T / n              # (in, in) Hessian
H += eps * I                 # ridge for stability, eps = 1e-2 * mean(diag(H))
Hinv_chol = cholesky(inverse(H), upper=True)   # upper-triangular U, with H^-1 = U.T @ U

for row r in 0..out-1:
    w = W[r, :].clone()                # length-in vector
    err = zeros(in)
    for i in 0..in-1:
        q_i = round_to_grid(w[i], scale=s_per_group[r, i // group_size])
        delta_i = w[i] - dequant(q_i)
        err[i] = delta_i
        # update remaining columns i+1.. via the Cholesky row
        for j in i+1..in-1:
            w[j] -= delta_i * Hinv_chol[i, j] / Hinv_chol[i, i]
        store q_i into Q[r, i]

The inner loop's coefficient is exactly what theory 03 gave you. You may organize the implementation block-wise (process columns in chunks of 128) for speed; the result is mathematically identical, up to floating-point rounding.
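A direct, unoptimized translation of the sketch into PyTorch might look like the following. Treat it as a sketch under assumptions rather than the reference solution: the function name gptq_quantize_rows, the clamp to [-7, 7], the float64 dtype, and the tiny demo shapes are illustrative choices, and the scales are assumed to be precomputed from the original W as max(|w|)/7 per group.

```python
import torch

def gptq_quantize_rows(W, H, scales, group_size):
    """Row-by-row GPTQ for symmetric INT4. H must already include the ridge.

    W: (out, in) weights; scales: (out, in // group_size) per-group scales.
    Returns int8 codes in [-7, 7] and the dequantized weights.
    """
    out_f, in_f = W.shape
    # Upper-triangular U with H^-1 = U.T @ U (PyTorch's default is the lower factor).
    U = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)
    Q = torch.zeros(out_f, in_f, dtype=torch.int8)
    W_hat = torch.zeros_like(W)
    for r in range(out_f):
        w = W[r].clone()
        for i in range(in_f):
            s = scales[r, i // group_size]
            q = torch.clamp(torch.round(w[i] / s), -7, 7)
            delta = w[i] - q * s
            # Push the rounding error onto the not-yet-quantized columns.
            w[i + 1:] -= delta * U[i, i + 1:] / U[i, i]
            Q[r, i] = q.to(torch.int8)
            W_hat[r, i] = q * s
    return Q, W_hat

# Tiny demo (illustrative shapes, not the lab's 768 x 768 setup).
torch.manual_seed(0)
out_f, in_f, n, group = 4, 16, 64, 8
W = torch.randn(out_f, in_f, dtype=torch.float64) * 0.02
X = torch.randn(in_f, n, dtype=torch.float64)
H = X @ X.T / n
H += 1e-2 * H.diagonal().mean() * torch.eye(in_f, dtype=torch.float64)
scales = W.reshape(out_f, in_f // group, group).abs().amax(dim=-1) / 7.0
Q, W_hat = gptq_quantize_rows(W, H, scales, group)
```

Because every row shares the same H, an equivalent organization loops over columns and updates all rows at once; that is usually much faster than the explicit row loop shown here.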

TODOs

Block A — synthetic setup

  • Random W ~ N(0, 0.02²) shaped (768, 768) — matches a typical mid-layer weight.
  • Calibration X ~ N(0, 1) shaped (768, 128) — 128 random vectors.
  • Target grid: INT4 per-group, group size 64. (Scales picked per-group as max(|w|)/7 for symmetric INT4.)
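  • Compute H = X @ X.T / 128 and H += eps * I. With 128 samples and 768 input features, X @ X.T has rank at most 128, so the ridge is what makes H invertible.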
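The setup above could be sketched as follows. The constant names (OUT, IN, N, GROUP) and the variable name scales are illustrative, and eps uses the 1e-2 * mean(diag(H)) rule from the algorithm sketch.

```python
import torch

torch.manual_seed(42)

OUT, IN, N, GROUP = 768, 768, 128, 64

W = torch.randn(OUT, IN) * 0.02      # W ~ N(0, 0.02^2)
X = torch.randn(IN, N)               # calibration inputs, one column per sample

# Per-group symmetric INT4 scales: max(|w|) / 7 over each run of GROUP columns.
scales = W.reshape(OUT, IN // GROUP, GROUP).abs().amax(dim=-1) / 7.0   # (OUT, IN // GROUP)

H = X @ X.T / N
eps = 1e-2 * H.diagonal().mean()
H = H + eps * torch.eye(IN)          # ridge keeps H positive-definite
```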

Block B — implement RTN baseline

  • Independent per-element quantization using the per-group scales.
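  • Compute the loss tr((W - W_rtn) @ H @ (W - W_rtn).T). The transpose goes on the right because W is (out, in) and H is (in, in); on a square layer the other order still runs but measures the wrong thing. This is your scalar quality metric.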
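A minimal sketch of the baseline and the metric, assuming W is stored as (out, in) so that row r contributes d_r @ H @ d_r to the trace. The names rtn_quantize and hessian_loss are hypothetical, and the demo uses small illustrative shapes rather than the lab's.

```python
import torch

def rtn_quantize(W, scales, group_size):
    """Round-to-nearest symmetric INT4 with per-group scales of shape (out, in // group_size)."""
    s = scales.clamp_min(1e-12).repeat_interleave(group_size, dim=1)  # (out, in)
    Q = torch.clamp(torch.round(W / s), -7, 7)
    return Q.to(torch.int8), Q * s

def hessian_loss(W, W_hat, H):
    """Return (total, per_row) where total = tr(D @ H @ D.T) and D = W - W_hat."""
    D = W - W_hat                          # (out, in)
    per_row = ((D @ H) * D).sum(dim=1)     # entry r is d_r @ H @ d_r
    return per_row.sum(), per_row

# Small demo with illustrative shapes.
torch.manual_seed(0)
W = torch.randn(4, 16) * 0.02
scales = W.reshape(4, 2, 8).abs().amax(dim=-1) / 7.0
Q_rtn, W_rtn = rtn_quantize(W, scales, group_size=8)
X = torch.randn(16, 64)
H = X @ X.T / 64
loss_rtn, per_row_rtn = hessian_loss(W, W_rtn, H)
```

Computing the per-row terms directly avoids building the full (out, out) product just to read its diagonal, and the per-row vector is what the later comparison needs anyway.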

Block C — implement GPTQ

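  • Cholesky of H^-1, upper-triangular factor. PyTorch: torch.linalg.cholesky(torch.linalg.inv(H), upper=True). The default returns the lower factor, whose entries [i, j] for j > i are all zero, which would turn the update into a no-op.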
  • Loop over rows. Within a row, walk the columns in order and apply each update to all remaining columns as one vectorized operation.
  • Quantize. Clamp after rounding, because GPTQ's updates can push a weight past its group's original maximum. Store as dense INT4: you can pack two codes per byte at the end, but for this lab an int8 tensor holding symmetric values in [-7, 7] is fine. A two's-complement 4-bit register covers [-8, 7], but symmetric quantization with scale = max(|w|)/7 only emits codes in [-7, 7].
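If you do want the packed form, one possible layout puts the even-indexed code in the low nibble and the odd-indexed code in the high nibble. This layout and the names pack_int4 and unpack_int4 are assumptions for illustration; the lab does not prescribe a packing format.

```python
import torch

def pack_int4(q):
    """Pack int8 codes in [-8, 7] two per byte along the last dim (length must be even)."""
    u = (q.to(torch.int16) & 0xF).to(torch.uint8)   # two's-complement nibble, 0..15
    return u[..., 0::2] | (u[..., 1::2] << 4)

def unpack_int4(p):
    """Inverse of pack_int4: uint8 bytes back to int8 codes in [-8, 7]."""
    lo = (p & 0xF).to(torch.int8)
    hi = (p >> 4).to(torch.int8)
    lo = torch.where(lo > 7, lo - 16, lo)           # restore the sign
    hi = torch.where(hi > 7, hi - 16, hi)
    return torch.stack((lo, hi), dim=-1).flatten(-2)

torch.manual_seed(0)
q = torch.randint(-7, 8, (3, 8), dtype=torch.int8)
packed = pack_int4(q)
restored = unpack_int4(packed)
```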

Block D — compare

  • Compute the loss for both schemes.
  • Per-row loss: which rows benefited most from GPTQ?
  • Compute n_grid_changes: how many weights did GPTQ round to a different grid point than RTN would have? (Usually a single-digit percentage.)
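The comparison could be collected in one helper along these lines. The function name compare_schemes and the dictionary keys are hypothetical; the lab does not fix a schema for results.json. The demo tensors are hand-made so the numbers are easy to check by eye.

```python
import torch

def compare_schemes(W, W_rtn, W_gptq, Q_rtn, Q_gptq, H, top_k=5):
    """Summarize RTN vs GPTQ: total losses, changed grid points, most-improved rows."""
    def per_row_loss(W_hat):
        D = W - W_hat
        return ((D @ H) * D).sum(dim=1)

    row_rtn, row_gptq = per_row_loss(W_rtn), per_row_loss(W_gptq)
    changed = Q_rtn != Q_gptq
    gain = row_rtn - row_gptq                     # absolute improvement per row
    k = min(top_k, W.shape[0])
    return {
        "loss_rtn": row_rtn.sum().item(),
        "loss_gptq": row_gptq.sum().item(),
        "n_grid_changes": int(changed.sum()),
        "frac_grid_changes": changed.float().mean().item(),
        "top_rows_by_gain": torch.argsort(gain, descending=True)[:k].tolist(),
    }

# Hand-made example: row 0 is wrong under "RTN" and exact under "GPTQ".
H = torch.eye(2)
W = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
W_rtn = torch.tensor([[0.0, 0.0], [0.0, 1.0]])
W_gptq = W.clone()
Q_rtn = torch.tensor([[0, 0], [0, 1]], dtype=torch.int8)
Q_gptq = torch.tensor([[1, 0], [0, 1]], dtype=torch.int8)
report = compare_schemes(W, W_rtn, W_gptq, Q_rtn, Q_gptq, H)
```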

Block E — interpret in README.md

Three questions:

  1. What's the GPTQ vs RTN loss ratio? Expect loss_gptq / loss_rtn ∈ [0.3, 0.7] for this setup. If it's > 0.9, your Cholesky is wrong or the update direction is backwards.
  2. Which rows benefit most? Plot per-row loss for both schemes. Rows with the highest RTN loss should see the biggest absolute improvement. Why?
  3. What happens if you replace X ~ N(0, 1) with X ~ Cauchy(0, 1)? (Outlier-heavy distribution.) Re-run and report. GPTQ should benefit more because the Hessian's off-diagonal coupling is stronger.
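For question 3, heavy-tailed calibration data can be drawn with PyTorch's in-place Cauchy sampler. The sketch below only builds the two Hessians and compares how uneven their diagonals are; the helper names ridge_hessian and diag_spread are illustrative, and the effect on the loss ratio is left for you to measure.

```python
import torch

torch.manual_seed(42)
IN, N = 768, 128

X_gauss = torch.randn(IN, N)
X_cauchy = torch.empty(IN, N).cauchy_()     # Cauchy(0, 1) samples, heavy tails

def ridge_hessian(X, rel_eps=1e-2):
    """H = X @ X.T / n plus a ridge of rel_eps * mean(diag(H))."""
    H = X @ X.T / X.shape[1]
    return H + rel_eps * H.diagonal().mean() * torch.eye(X.shape[0], dtype=X.dtype)

def diag_spread(H):
    """Ratio of the largest to the median diagonal entry."""
    d = H.diagonal()
    return (d.max() / d.median()).item()

H_gauss = ridge_hessian(X_gauss)
H_cauchy = ridge_hessian(X_cauchy)
print("diag spread, Gaussian:", diag_spread(H_gauss))
print("diag spread, Cauchy:  ", diag_spread(H_cauchy))
```

A few extreme samples dominate the Cauchy Hessian, so the ridge (which scales with mean(diag(H))) also becomes much larger. If the Cholesky step turns unstable in float32, switching the computation to float64 is one thing to try.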

Stop conditions

Done when:

  1. src/miniquant/gptq.py implements the algorithm; tests pass.
  2. loss_gptq / loss_rtn < 0.7 on the standard setup.
  3. All five files committed.
  4. README.md answers the three questions.

Constraints

  • No reference GPTQ library. You are writing the algorithm. The auto-gptq package exists; don't import it.
  • CPU only. Cholesky on (768, 768) is sub-second; the row loop with explicit updates is the bottleneck. Time it; aim for < 30 seconds per layer.
  • Reproducibility: seed_everything(42). Then torch.linalg.cholesky is deterministic for a given input.
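This lab text does not define seed_everything. If your project does not already provide one, a minimal stand-in (an assumption, not the course's helper) could seed Python's and PyTorch's generators:

```python
import random
import torch

def seed_everything(seed: int) -> None:
    """Seed the RNGs this lab touches; add numpy.random.seed if you use NumPy."""
    random.seed(seed)
    torch.manual_seed(seed)

# Same seed, same draws.
seed_everything(42)
a = torch.randn(3)
seed_everything(42)
b = torch.randn(3)
```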

Pitfalls

  • Cholesky fails with "matrix not positive-definite". Your ridge eps is too small. Try eps = 1e-1 * mean(diag(H)).
  • loss_gptq > loss_rtn. Most common bugs: (i) you're subtracting delta * Hinv_chol[j, i] instead of Hinv_chol[i, j] (wrong triangle); (ii) your grid rounding is computing delta as dequant(q) - w instead of w - dequant(q); (iii) you're applying updates after quantizing the rest of the row, not before.
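  • GPTQ loss exactly equals RTN loss. The update is silently a no-op. First check that Hinv_chol is the upper-triangular factor: torch.linalg.cholesky returns the lower factor unless you pass upper=True, and then Hinv_chol[i, j] is zero for every j > i. Also sanity-check that delta magnitudes are non-trivial; if every delta is zero, there is nothing to propagate.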
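The wrong-triangle and no-op pitfalls can be caught early with a quick check on a small matrix. This sketch uses illustrative shapes and float64; it shows that the default factor has nothing above the diagonal, while upper=True yields U with H^-1 = U.T @ U.

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 32, dtype=torch.float64)
H = X @ X.T / 32 + 1e-2 * torch.eye(8, dtype=torch.float64)
Hinv = torch.linalg.inv(H)

L = torch.linalg.cholesky(Hinv)               # default: lower factor, Hinv = L @ L.T
U = torch.linalg.cholesky(Hinv, upper=True)   # upper factor, Hinv = U.T @ U

# With the lower factor, every coefficient Hinv_chol[i, j] for j > i is zero,
# so the GPTQ update would do nothing.
strict_upper_of_L = torch.triu(L, diagonal=1)
print("nonzeros above the diagonal of L:", int(torch.count_nonzero(strict_upper_of_L)))
print("max |U.T @ U - Hinv|:", (U.T @ U - Hinv).abs().max().item())
```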

When to consult solutions/

After all stop conditions met. Reference at solutions/01-gptq-toy-ref.md (phase open) walks through the derivation step-by-step and shows the expected numbers within tolerance.


Next lab: lab/02-quant-curve.md.