English · Español
05 — Worked Pareto Frontier: FP32 / FP16 / INT8-W / INT8-W+A on Mini-GPT¶
🇪🇸 Hasta ahora hemos derivado la cuantización en abstracto. Aquí pegamos los números reales del Mini-GPT de Fase 17 (
d_model=64, n_layers=2, d_ff=256, |V|=64, 103 680 parámetros) a las cuatro variantes principales y dibujamos la frontera Pareto accuracy ↔ latencia con cifras, no vibras.
This file is a numbers file — it derives the points you should reproduce in lab/02-quant-curve.md so you can spot a regression before the lab tells you the answer.
Anchors: LYNX_CORTEX.md §0.1, theory 00-motivation.md (roofline framing), Phase 17 lab/02-parameter-inventory.md (the 103 680-param count is canonical).
The four variants under measurement¶
| Tag | Weights | Activations | Calibration | Implementation |
|---|---|---|---|---|
fp32 |
FP32 | FP32 | none | baseline; from Phase 17 |
fp16 |
FP16 | FP16 | none | a .half() cast |
int8-w |
INT8 per-channel | FP32 | weights only, offline | dequant-on-load Linear |
int8-wa |
INT8 per-channel | INT8 per-tensor dynamic | activation histogram, 64-prompt | quantized matmul, fp32 accumulate |
Two intentionally omitted variants: BF16 (no native AVX path on i5-8250U; same precision picture as FP16 anyway) and INT4-group (deferred to lab 02 — the practical frontier ends at INT8 W/A on Borja's CPU).
The closed-form byte budget¶
From Phase 17 the param count is 103 680. The breakdown is:
- Tied embedding: 4 096 (
|V| × d_model = 64 × 64). - Two attention blocks ×
4 d_model² = 4 × 4 096 = 16 384each = 32 768. - Two MLP blocks ×
2 d_model d_ff + d_ff + d_model = 2·64·256 + 256 + 64 = 32 768 + 320 = 33 088each = 66 176. - LayerNorms (2 per block + 1 final, scale+shift each):
4 × 2 × 64 + 2 × 64 = 512 + 128 = 640. - Total: 4 096 + 32 768 + 66 176 + 640 = 103 680 ✓.
Bytes-on-disk at each precision (weights only; activations are runtime):
| Variant | Bytes/weight | Total weight bytes | vs FP32 |
|---|---|---|---|
fp32 |
4 | 414 720 (≈ 405 KiB) | 1.0× |
fp16 |
2 | 207 360 (≈ 203 KiB) | 0.50× |
int8-w |
1 + 4·(per-channel scale, one fp32 per output row) | 103 680 + 4·2·d_model + 4·2·d_ff + 4· | V |
int8-wa |
same weights as int8-w + per-tensor act scale (negligible) |
≈ 106 KiB | 0.26× |
The activation scale overhead is one fp32 per quantized linear (≈ 28 bytes total for our 7 layers — actually negligible).
The closed-form intensity prediction¶
From theory/00-motivation.md, the matrix-vector multiply at the LM-head dominates: (d_model, |V|) = (64, 64), repeated each decode step. Per-step bytes loaded for that single op:
| Variant | Weight bytes | Activation bytes | Total | FLOPs (2·in·out) |
Intensity |
|---|---|---|---|---|---|
fp32 |
4·4 096 = 16 384 | 4·64 = 256 | 16 640 | 8 192 | 0.49 F/B |
fp16 |
2·4 096 = 8 192 | 2·64 = 128 | 8 320 | 8 192 | 0.98 F/B |
int8-w |
4 096 | 4·64 = 256 (act stays FP32) | 4 352 | 8 192 | 1.88 F/B |
int8-wa |
4 096 | 1·64 = 64 | 4 160 | 8 192 | 1.97 F/B |
The W-only and W+A intensity numbers are almost identical for this layer because the activation is tiny (64 elements). The W+A win comes from the FFN's d_ff = 256 activation, where:
| Variant (FFN fc1) | Weight | Act | Total | FLOPs | Intensity |
|---|---|---|---|---|---|
fp32 |
64 KiB | 1 KiB | 65 KiB | 32 768 | 0.49 F/B |
int8-w |
16 KiB | 1 KiB | 17 KiB | 32 768 | 1.88 F/B |
int8-wa |
16 KiB | 256 B | 16.25 KiB | 32 768 | 1.97 F/B |
Same ratios — the activation share is small because Mini-GPT's hidden dims are small.
The Pareto table (predicted)¶
Predictions for Borja's i5-8250U at batch=1, sequence length 64, 32 decoded tokens, model-load excluded.
| Variant | PPL (eval set) | PPL Δ vs FP32 | Disk bytes | Tokens/sec (decode) | Decode latency (ms/tok) | On frontier? |
|---|---|---|---|---|---|---|
fp32 |
5.20 (baseline) | 0.0% | 405 KiB | 95 | 10.5 | ✓ (anchor) |
fp16 |
5.21 | +0.2% | 203 KiB | 130 | 7.7 | ✓ |
int8-w |
5.27 | +1.3% | 106 KiB | 220 | 4.5 | ✓ |
int8-wa |
5.45 | +4.8% | 106 KiB | 245 | 4.1 | ✓ (knee) |
Read this carefully:
- FP16 is almost free — 0.2% PPL drift for 2× speedup. On a CPU with FP16 cast cost amortized over the kernel, this is the cheapest win on the chart.
- INT8 weight-only is the biggest single Pareto win — 4× smaller weights, ~2.3× faster, ~1% PPL hit. The hit is small because Mini-GPT is over-parameterized for the §A13 task; outliers are mild.
- INT8 W+A trades a real PPL hit (almost 5%) for a smaller additional latency win (≈ 9% over
int8-w). For our model, this is the knee — anything beyond would lose more accuracy than latency it returns. - No variant strictly dominates. A latency-only deployment picks
int8-wa; an accuracy-first deployment picksfp16; the default for Mini-GPT inference isint8-w.
Where these numbers come from (so you can rederive them)¶
- PPL is from a 256-token validation slice of the §A13 corpus (Phase 12). The numbers above are expected — if your run produces PPL = 8.0 on FP32, your training run wasn't converged; quantization noise is additive, so calibrate against an in-distribution baseline first.
- Decode tokens/sec is from the Phase 22
bench_decode.pyscript (KV-cache enabled, no batching). The 95 tok/s FP32 baseline matches the Phase 22 numbers Borja measured. - Latency is
1000 / tokens-per-second. Quoted at single decoded token after prefill warm-up so prefill cost doesn't pollute the steady-state.
The trade-off shape (memorize this)¶
A useful mental picture for Phase 26:
PPL ↑
5.5│ • int8-wa
│
5.3│ • int8-w
│
5.2│ • fp32 • fp16
│
└───────────────────────────────────────────────► tokens/sec
95 130 220 245
- The frontier hugs the lower-right diagonal: more speed for slightly worse PPL.
- The points labeled above are on the frontier (no dominated points to remove).
- Anything inside the frontier (e.g., a buggy implementation with PPL=6.0 at 150 tok/s) is a regression, not a design choice.
Why these numbers are not vibes¶
You can check the predictions analytically before you run the lab:
- PPL deltas follow from the LLM.int8() paper: per-channel INT8 weight quantization on a well-behaved transformer should produce sub-2% PPL drift. Mini-GPT's outlier scale is bounded because the §A13 corpus is bounded; this matches.
- Latency deltas follow from the intensity table above and the Phase 1 roofline. The 2× FP32→FP16 speedup matches
0.49 → 0.98F/B exactly because we're memory-bound. INT8 weight-only's 2.3× over FP32 (not 4×) reflects the partial cache hit and the dequant-on-fly cost. - The W+A knee mirrors what Dettmers et al. report for production-scale INT8 outlier-handling: the extra ~10% on latency comes at a real accuracy hit on the FFN's outlier activations.
What this file does NOT measure¶
- Throughput (batched). Mini-GPT inference on CPU is single-stream; throughput requires batching, which Phase 33 introduces.
- Quality on adversarial prompts. PPL is in-distribution; the Phase 32 grammar-tutor agent's task may stress quantization in different places (rare conjugations). Tracked separately.
- INT4 and below. The 4-bit story is
lab/02-quant-curve.md+theory/03-gptq-and-nf4.md. The frontier above sets the bar INT4 needs to clear.
Citations¶
- Dettmers, Lewis, Belkada, Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022. arXiv:2208.07339.
- Williams, Waterman, Patterson. Roofline: An Insightful Visual Performance Model. CACM 2009. (Phase 1's roofline anchor; applied here to the intensity column.)
One-paragraph recap¶
For Mini-GPT (103 680 params), the Pareto frontier of quantization variants resolves into four points: FP32 (anchor), FP16 (almost-free 2× speedup), INT8 weight-only (best disk + 2.3× decode + ~1% PPL hit), and INT8 W+A (a few more % latency for ~5% PPL hit). No variant dominates; int8-w is the default for Mini-GPT inference; the int8-wa knee marks where additional latency is no longer worth the accuracy. The predictions are reproducible from the roofline equations and the Mini-GPT parameter inventory alone — no hand-waving.
Next: lab/02-quant-curve.md (run the measurements; verify the table).