Lab 01: GPTQ on a Single Linear Layer
Goal: implement GPTQ for one Linear and show it beats RTN at INT4 per-group.
Estimated time: 6–10 hours (this is the hardest math+code combination of Phase 26).
Prereq: lab 00 committed; theory/03 read; src/miniquant/BLUEPRINT.md GPTQ section read.
What you produce
A directory experiments/26-gptq-toy/ containing:
- gptq_toy.py: script that quantizes a synthetic Linear(768, 768) with both RTN and GPTQ, and prints the loss tensor and the worst-case-row PPL proxy.
- results.json: measurements.
- loss_curve.png: quantization MSE per row index for RTN vs GPTQ.
- manifest.json.
- README.md: interpretation.
You commit src/miniquant/gptq.py implementing the algorithm.
The kernel
Implement GPTQ for one weight matrix W ∈ ℝ^(out × in) with a calibration batch X ∈ ℝ^(in × n) (n calibration vectors, one per column).
Algorithm sketch (theory file 03 has the math):

    H = X @ X.T / n                      # (in, in) Hessian
    H += eps * I                         # ridge for stability, eps = 1e-2 * mean(diag(H))
    Hinv_chol = cholesky(inverse(H))     # upper-triangular factor U, with H^-1 = U.T @ U
    for row r in 0..out-1:
        w = W[r, :].clone()              # length-in vector
        err = zeros(in)
        for i in 0..in-1:
            q_i = round_to_grid(w[i], scale=s_per_group[r, i // group_size])
            delta_i = w[i] - dequant(q_i)
            err[i] = delta_i
            # update remaining columns i+1.. via the Cholesky row
            for j in i+1..in-1:
                w[j] -= delta_i * Hinv_chol[i, j] / Hinv_chol[i, i]
            store q_i into Q[r, i]

The inner loop's coefficient is exactly what theory 03 gave you. You may organize the implementation block-wise (process columns in chunks of 128) for speed; the algorithm output is identical.
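
The sketch above can be made concrete for a single row. What follows is a minimal PyTorch sketch under stated assumptions, not the reference solution: the name `gptq_row` is hypothetical, the scales are fixed up front from the original weights, codes are clamped to [-7, 7], and float64 is used so that the closing check is numerically tight. The check relies on the identity that, for this update rule, the row loss equals the sum of `(delta_i / U[i, i])**2`.

```python
import torch

def gptq_row(w, U, scales, group_size, qmax=7):
    """Quantize one weight row with GPTQ-style error feedback.

    w:      (in,) float weights of one output row
    U:      (in, in) upper-triangular factor with H^-1 = U.T @ U
    scales: (in // group_size,) per-group scales for this row
    Returns int8 codes and the per-step quantization errors delta.
    """
    w = w.clone()
    n_in = w.numel()
    codes = torch.empty(n_in, dtype=torch.int8)
    delta = torch.empty_like(w)
    for i in range(n_in):
        s = scales[i // group_size]
        q = torch.clamp(torch.round(w[i] / s), -qmax, qmax)
        codes[i] = q.to(torch.int8)
        delta[i] = w[i] - q * s
        # Push the error onto the not-yet-quantized columns i+1.. (vectorized j-loop).
        w[i + 1:] -= delta[i] * U[i, i + 1:] / U[i, i]
    return codes, delta

# Small illustrative sizes (assumptions, not the lab's 768 x 768 setup).
torch.manual_seed(0)
n_in, n, group_size = 32, 64, 8
X = torch.randn(n_in, n, dtype=torch.float64)
H = X @ X.T / n
H += 1e-2 * H.diagonal().mean() * torch.eye(n_in, dtype=torch.float64)
U = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)

w = 0.02 * torch.randn(n_in, dtype=torch.float64)
scales = w.view(-1, group_size).abs().amax(dim=1) / 7
codes, delta = gptq_row(w, U, scales, group_size)

# Row loss measured directly, and via the closed form sum_i (delta_i / U_ii)^2.
w_hat = codes.to(torch.float64) * scales.repeat_interleave(group_size)
loss = (w - w_hat) @ H @ (w - w_hat)
closed_form = ((delta / U.diagonal()) ** 2).sum()
```

If `loss` and `closed_form` disagree beyond rounding error in your own implementation, the update coefficient or the triangle of the factor is the first place to look.
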
TODOs
Block A: synthetic setup
- Random W ~ N(0, 0.02²) shaped (768, 768): matches a typical mid-layer weight.
- Calibration X ~ N(0, 1) shaped (768, 128): 128 random vectors.
- Target grid: INT4 per-group, group size 64. (Scales picked per-group as max(|w|)/7 for symmetric INT4.)
- Compute H = X @ X.T / 128 and H += eps * I.
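
The Block A setup might look like the following sketch. `torch.manual_seed` stands in for the lab's `seed_everything`, and the variable names are illustrative. Note that 128 calibration vectors in a 768-dimensional input space give `X @ X.T` a rank of at most 128, so the ridge term is what makes H invertible here.

```python
import torch

torch.manual_seed(42)  # stand-in for the lab's seed_everything(42)

out_features, in_features, n_calib, group_size = 768, 768, 128, 64

W = 0.02 * torch.randn(out_features, in_features)   # W ~ N(0, 0.02^2)
X = torch.randn(in_features, n_calib)               # X ~ N(0, 1), one vector per column

# Per-group symmetric INT4 scales: one scale per (row, group of 64 columns).
W_groups = W.view(out_features, in_features // group_size, group_size)
scales = W_groups.abs().amax(dim=-1) / 7            # shape (768, 12)

H = X @ X.T / n_calib                               # (768, 768), rank <= 128
eps = 1e-2 * H.diagonal().mean()
H = H + eps * torch.eye(in_features)                # ridge makes H positive definite
```
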
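Block B: implement RTN baseline
- Independent per-element quantization using the per-group scales.
- Compute the loss tr((W - W_rtn) @ H @ (W - W_rtn).T). This is your scalar quality metric. (W is (out, in) and H is (in, in), so the transpose goes on the right; the diagonal of the same product holds the per-row losses.)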
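
A possible shape for the baseline and the metric, as a hedged sketch: `rtn_quantize` and `hessian_loss` are hypothetical names, and the code assumes no group is entirely zero (otherwise the scale would be zero). The per-row form avoids building the full (out, out) product, and its sum matches both the trace and the mean squared output error `||(W - W_rtn) @ X||² / n` when H carries no ridge.

```python
import torch

def rtn_quantize(W, group_size=64, qmax=7):
    """Round-to-nearest with per-group symmetric scales.

    Returns int8 codes and the dequantized weights, both shaped like W.
    """
    out_f, in_f = W.shape
    Wg = W.view(out_f, in_f // group_size, group_size)
    scales = Wg.abs().amax(dim=-1, keepdim=True) / qmax
    codes = torch.clamp(torch.round(Wg / scales), -qmax, qmax)
    W_hat = (codes * scales).view(out_f, in_f)
    return codes.view(out_f, in_f).to(torch.int8), W_hat

def hessian_loss(W, W_hat, H):
    """Per-row loss d_r @ H @ d_r; summing over rows gives tr(D @ H @ D.T)."""
    D = W - W_hat
    return ((D @ H) * D).sum(dim=1)

# Small illustrative sizes; H is left without a ridge so the output-error check is exact.
torch.manual_seed(0)
n_calib = 32
W = 0.02 * torch.randn(8, 128, dtype=torch.float64)
X = torch.randn(128, n_calib, dtype=torch.float64)
H = X @ X.T / n_calib

codes_rtn, W_rtn = rtn_quantize(W)
per_row = hessian_loss(W, W_rtn, H)
loss_rtn = per_row.sum()
```
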
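Block C: implement GPTQ
- Cholesky of H^-1. PyTorch: torch.linalg.cholesky(torch.linalg.inv(H), upper=True). The upper=True matters: the default returns the lower factor, whose entries [i, j] for j > i are all zero.
- Loop over rows. Within a row, walk the columns in order and vectorize the update of the remaining columns i+1.. .
- Quantize. Store as a dense INT4 (you can pack two-per-byte at the end; for this lab a tensor[int8] storing symmetric values in [-7, 7] is fine). The storage register is two's-complement [-8, 7], but symmetric quantization with scale = max(|w|)/7 only emits codes in [-7, 7]. Clamp after rounding, because GPTQ's updated weights can drift past the original group maximum.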
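
For the optional two-per-byte packing, one possible layout puts the even column in the low nibble and the odd column in the high nibble. This is an illustrative choice, not a format the lab prescribes, and `pack_int4` / `unpack_int4` are hypothetical names. The sketch assumes an even number of columns.

```python
import torch

def pack_int4(codes):
    """Pack int8 codes in [-8, 7] two per byte (even column in the low nibble)."""
    nib = codes.to(torch.int16) & 0xF            # two's-complement low nibble, 0..15
    return (nib[..., 0::2] | (nib[..., 1::2] << 4)).to(torch.uint8)

def unpack_int4(packed):
    """Inverse of pack_int4: recover signed int8 codes."""
    p = packed.to(torch.int16)
    # Interleave low and high nibbles back into column order.
    nib = torch.stack((p & 0xF, p >> 4), dim=-1).flatten(-2)
    return ((nib ^ 8) - 8).to(torch.int8)        # sign-extend the 4-bit values

torch.manual_seed(0)
codes = torch.randint(-7, 8, (4, 16), dtype=torch.int8)
packed = pack_int4(codes)
restored = unpack_int4(packed)
```
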
Block D: compare
- Compute the loss for both schemes.
- Per-row loss: which rows benefited most from GPTQ?
- Compute n_grid_changes: how many weights did GPTQ round to a different grid point than RTN would have? (Usually a single-digit percentage.)
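
The comparison itself is a few tensor operations. A small sketch, with hypothetical helper names and toy inputs chosen only to show the shapes involved:

```python
import torch

def compare_codes(codes_rtn, codes_gptq):
    """Count and fraction of weights where GPTQ chose a different grid point than RTN."""
    changed = codes_rtn != codes_gptq
    return int(changed.sum()), changed.float().mean().item()

def top_rows_by_gain(loss_rtn_rows, loss_gptq_rows, k=5):
    """Indices of the k rows with the largest absolute loss reduction (RTN minus GPTQ)."""
    gain = loss_rtn_rows - loss_gptq_rows
    return torch.topk(gain, k).indices

# Toy inputs (made up for illustration, not measurements).
codes_rtn = torch.tensor([[1, 2], [3, 4]], dtype=torch.int8)
codes_gptq = torch.tensor([[1, 3], [3, 4]], dtype=torch.int8)
n_grid_changes, frac_changed = compare_codes(codes_rtn, codes_gptq)

loss_rtn_rows = torch.tensor([5.0, 1.0, 3.0])
loss_gptq_rows = torch.tensor([1.0, 0.5, 2.5])
best_rows = top_rows_by_gain(loss_rtn_rows, loss_gptq_rows, k=1)
```

Both code tensors must come from the same scales for the count to be meaningful.
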
Block E: interpret in README.md
Three questions:
- What's the GPTQ vs RTN loss ratio? Expect loss_gptq / loss_rtn ∈ [0.3, 0.7] for this setup. If it's > 0.9, your Cholesky is wrong or the update direction is backwards.
- Which rows benefit most? Plot per-row loss for both schemes. Rows with the highest RTN loss should see the biggest absolute improvement. Why?
- What happens if you replace X ~ N(0, 1) with X ~ Cauchy(0, 1)? (Outlier-heavy distribution.) Re-run and report. GPTQ should benefit more because the Hessian's off-diagonal coupling is stronger.
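
For the third question, the Cauchy calibration batch can be drawn with `torch.distributions.Cauchy`. The sketch below also computes one possible summary of off-diagonal coupling, the mean absolute normalized off-diagonal entry of H. That metric is an assumption made here for illustration, not something the lab defines. float64 is used because heavy-tailed samples make H badly scaled.

```python
import torch

torch.manual_seed(42)
in_features, n_calib = 768, 128

# Heavy-tailed calibration inputs: Cauchy(0, 1) has no finite variance.
X = torch.distributions.Cauchy(0.0, 1.0).sample((in_features, n_calib)).double()
H = X @ X.T / n_calib
eps = 1e-2 * H.diagonal().mean()
H = H + eps * torch.eye(in_features, dtype=torch.float64)

# Normalize H to unit diagonal, then average |entry| over all i != j.
d = H.diagonal().sqrt()
corr = H / (d[:, None] * d[None, :])
off_diag_sum = corr.abs().sum() - corr.diagonal().abs().sum()
mean_abs_coupling = off_diag_sum / (in_features * (in_features - 1))
```

Computing the same number for the Gaussian batch gives a like-for-like comparison to report alongside the loss ratio.
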
Stop conditions
Done when:
- src/miniquant/gptq.py implements the algorithm; tests pass.
- loss_gptq / loss_rtn < 0.7 on the standard setup.
- All five files committed.
- README.md answers the three questions.
Constraints
- No reference GPTQ library. You are writing the algorithm. The auto-gptq package exists; don't import it.
- CPU only. Cholesky on (768, 768) is sub-second; the row loop with explicit updates is the bottleneck. Time it; aim for < 30 seconds per layer.
- Reproducibility: seed_everything(42). Then torch.linalg.cholesky is deterministic for a given input.
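
A sketch of the seeding and timing scaffolding. The `seed_everything` defined here is a hypothetical stand-in that seeds only Python's and PyTorch's RNGs; the lab's own helper may do more. The same `time.perf_counter` pattern can wrap the row loop.

```python
import random
import time

import torch

def seed_everything(seed):
    """Hypothetical helper: seed the Python and PyTorch random number generators."""
    random.seed(seed)
    torch.manual_seed(seed)

seed_everything(42)
X = torch.randn(768, 128, dtype=torch.float64)
H = X @ X.T / 128
H += 1e-2 * H.diagonal().mean() * torch.eye(768, dtype=torch.float64)

# Time the factorization step on CPU.
start = time.perf_counter()
U = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)
elapsed = time.perf_counter() - start
print(f"inverse + Cholesky: {elapsed:.3f} s")
```
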
Pitfalls
- Cholesky fails with "matrix not positive-definite". Your ridge eps is too small. Try eps = 1e-1 * mean(diag(H)).
- loss_gptq > loss_rtn. Most common bugs: (i) you're subtracting delta * Hinv_chol[j, i] instead of Hinv_chol[i, j] (wrong triangle); (ii) your grid rounding is computing delta as dequant(q) - w instead of w - dequant(q); (iii) you're applying updates after quantizing the rest of the row, not before.
- GPTQ loss exactly equals RTN loss. The update is silently a no-op. Either delta is coming out as zero (check how the scale is computed and applied, and sanity-check that delta magnitudes are non-trivial), or the coefficients you read are zero because Hinv_chol is the lower factor (the torch.linalg.cholesky default) while you index [i, j] with j > i.
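
The triangle-related failures are cheap to check before running the full loop. A small sketch on an illustrative 16-column problem, showing that PyTorch's default lower factor is zero exactly where the update reads its coefficients:

```python
import torch

torch.manual_seed(0)
n_in = 16
X = torch.randn(n_in, 8, dtype=torch.float64)
H = X @ X.T / 8
H += 1e-2 * H.diagonal().mean() * torch.eye(n_in, dtype=torch.float64)
Hinv = torch.linalg.inv(H)

L = torch.linalg.cholesky(Hinv)               # default: lower factor, Hinv = L @ L.T
U = torch.linalg.cholesky(Hinv, upper=True)   # upper factor, Hinv = U.T @ U

# The update reads entries [i, j] with j > i. On the lower factor these are all zero,
# which turns GPTQ into RTN without raising any error.
strict_upper = torch.triu(torch.ones(n_in, n_in, dtype=torch.bool), diagonal=1)
lower_is_zero_there = bool((L[strict_upper] == 0).all())
upper_is_nonzero_there = bool((U[strict_upper] != 0).any())
reconstructs = torch.allclose(U.T @ U, Hinv)
```
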
When to consult solutions/
After all stop conditions are met. The reference at solutions/01-gptq-toy-ref.md (phase open) walks through the derivation step-by-step and shows the expected numbers within tolerance.
Next lab: lab/02-quant-curve.md.