
INT8-W+A on Mini-GPT¶

So far we have derived quantization in the abstract. Here we attach the real numbers of the Phase 17 Mini-GPT (d_model=64, n_layers=2, d_ff=256, |V|=64, 103 680 parameters) to the four main variants and draw the accuracy ↔ latency Pareto frontier with figures, not vibes.

This file is a numbers file — it derives the points you should reproduce in lab/02-quant-curve.md so you can spot a regression before the lab tells you the answer.

Anchors: LYNX_CORTEX.md §0.1, theory 00-motivation.md (roofline framing), Phase 17 lab/02-parameter-inventory.md (the 103 680-param count is canonical).

The four variants under measurement¶

Tag	Weights	Activations	Calibration	Implementation
`fp32`	FP32	FP32	none	baseline; from Phase 17
`fp16`	FP16	FP16	none	a `.half()` cast
`int8-w`	INT8 per-channel	FP32	weights only, offline	dequant-on-load `Linear`
`int8-wa`	INT8 per-channel	INT8 per-tensor dynamic	activation histogram, 64-prompt	quantized matmul, fp32 accumulate

Two intentionally omitted variants: BF16 (no native AVX path on i5-8250U; same precision picture as FP16 anyway) and INT4-group (deferred to lab 02 — the practical frontier ends at INT8 W/A on Borja's CPU).
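To make the `int8-w` and `int8-wa` rows concrete, here is a minimal sketch of symmetric INT8 quantization on plain Python lists. It assumes symmetric scaling onto [-127, 127] and round-to-nearest, neither of which the table specifies, and every function name is hypothetical rather than the lab's API. The activation scale is computed per call (the "dynamic" part); the 64-prompt histogram calibration is not modelled.

```python
# Sketch only: symmetric INT8 quantization on plain Python lists.
# Every function name here is hypothetical, not the lab's actual API.

def _scale(values):
    """Symmetric scale mapping the largest magnitude onto 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def _to_int8(value, scale):
    """Round to the nearest integer and clamp to [-127, 127]."""
    return max(-127, min(127, round(value / scale)))

def quantize_per_channel(weight):
    """One scale per output row of a 2-D weight (a list of rows)."""
    scales = [_scale(row) for row in weight]
    q_rows = [[_to_int8(w, s) for w in row] for row, s in zip(weight, scales)]
    return q_rows, scales

def quantize_per_tensor(x):
    """One scale for the whole activation vector, computed at run time."""
    s = _scale(x)
    return [_to_int8(v, s) for v in x], s

def dequant_matvec(q_rows, w_scales, x):
    """int8-w path: dequantize weights, multiply against FP32 activations."""
    return [sum(q * s * v for q, v in zip(row, x))
            for row, s in zip(q_rows, w_scales)]

def int8_matvec(q_rows, w_scales, q_x, x_scale):
    """int8-wa path: integer products, float accumulate, one rescale per row."""
    out = []
    for row, s in zip(q_rows, w_scales):
        acc = 0.0
        for qw, qx in zip(row, q_x):
            acc += qw * qx
        out.append(acc * s * x_scale)
    return out

weight = [[1.0, -0.25], [0.02, 0.005]]   # toy 2x2 weight, rows = output channels
x = [1.0, -0.3]                          # toy activation vector
q_rows, w_scales = quantize_per_channel(weight)
q_x, x_scale = quantize_per_tensor(x)
print(dequant_matvec(q_rows, w_scales, x))
print(int8_matvec(q_rows, w_scales, q_x, x_scale))
```

The second row of the toy weight is 50× smaller than the first; with a single per-tensor weight scale it would collapse to a handful of integer levels, which is why the weights use one scale per output row.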

The closed-form byte budget¶

From Phase 17 the param count is 103 680. The breakdown is:

Tied embedding: 4 096 (|V| × d_model = 64 × 64).
Two attention blocks × 4 d_model² = 4 × 4 096 = 16 384 each = 32 768.
Two MLP blocks × 2 d_model d_ff + d_ff + d_model = 2·64·256 + 256 + 64 = 32 768 + 320 = 33 088 each = 66 176.
LayerNorms (2 per block + 1 final, scale+shift each): 4 × 2 × 64 + 2 × 64 = 512 + 128 = 640.
Total: 4 096 + 32 768 + 66 176 + 640 = 103 680 ✓.
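The breakdown above can be re-derived in a few lines. This is a sketch that assumes the layout the list implies: a tied embedding, four bias-free d_model × d_model attention matrices per block (read here as Q, K, V and output projections, which is an assumption), biased MLP layers, and scale+shift LayerNorms.

```python
# Re-derive the Mini-GPT parameter count from its hyperparameters.
d_model, n_layers, d_ff, vocab = 64, 2, 256, 64

embedding = vocab * d_model                             # tied with the LM head
attention = n_layers * 4 * d_model * d_model            # 4 square matrices per block, no biases
mlp = n_layers * (2 * d_model * d_ff + d_ff + d_model)  # fc1 + fc2 weights, plus both biases
layernorm = (2 * n_layers + 1) * 2 * d_model            # 2 per block + 1 final, scale + shift

total = embedding + attention + mlp + layernorm
print(embedding, attention, mlp, layernorm, total)
```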

Bytes-on-disk at each precision (weights only; activations are runtime):

Variant	Bytes/weight	Total weight bytes	vs FP32
`fp32`	4	414 720 (≈ 405 KiB)	1.0×
`fp16`	2	207 360 (≈ 203 KiB)	0.50×
`int8-w`	1 + 4·(per-channel scale, one fp32 per output row)	103 680 + 4·2·d_model + 4·2·d_ff + 4·|V| = 106 496 (≈ 104 KiB)	0.26×
`int8-wa`	same weights as `int8-w` + per-tensor act scale (negligible)	≈ 104 KiB	0.26×

The activation scale overhead is one fp32 per quantized linear (7 × 4 = 28 bytes total for our 7 layers, which is negligible).
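The byte budget is simple enough to recompute before trusting any on-disk size. The sketch below takes the scale-overhead term (4·2·d_model + 4·2·d_ff + 4·|V|) from the `int8-w` row and, as a simplifying assumption, charges one byte to every parameter, including LayerNorms and biases.

```python
# Weight bytes per variant, from the parameter count and the scale overhead.
n_params = 103_680
d_model, d_ff, vocab = 64, 256, 64

scale_bytes = 4 * 2 * d_model + 4 * 2 * d_ff + 4 * vocab  # fp32 per-channel scales

weight_bytes = {
    "fp32": 4 * n_params,
    "fp16": 2 * n_params,
    "int8-w": n_params + scale_bytes,
}

for tag, nbytes in weight_bytes.items():
    ratio = nbytes / weight_bytes["fp32"]
    print(f"{tag:7s} {nbytes:7d} B  {nbytes / 1024:6.1f} KiB  {ratio:.2f}x")
```

Dividing by 1024 rather than 1000 matters at this size: the two conventions differ by about 2.4%, which is enough to shift the rounded figure.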

The closed-form intensity prediction¶

From theory/00-motivation.md, the matrix-vector multiply at the LM-head dominates: (d_model, |V|) = (64, 64), repeated each decode step. Per-step bytes loaded for that single op:

Variant	Weight bytes	Activation bytes	Total	FLOPs (`2·in·out`)	Intensity
`fp32`	4·4 096 = 16 384	4·64 = 256	16 640	8 192	0.49 F/B
`fp16`	2·4 096 = 8 192	2·64 = 128	8 320	8 192	0.98 F/B
`int8-w`	4 096	4·64 = 256 (act stays FP32)	4 352	8 192	1.88 F/B
`int8-wa`	4 096	1·64 = 64	4 160	8 192	1.97 F/B

The W-only and W+A intensity numbers are almost identical for this layer because the activation is tiny (64 elements). The natural place to look for a larger W+A win is the FFN's fc1, whose activation has d_ff = 256 elements:

Variant (FFN fc1)	Weight	Act	Total	FLOPs	Intensity
`fp32`	64 KiB	1 KiB	65 KiB	32 768	0.49 F/B
`int8-w`	16 KiB	1 KiB	17 KiB	32 768	1.88 F/B
`int8-wa`	16 KiB	256 B	16.25 KiB	32 768	1.97 F/B

The ratios are the same: even at d_ff = 256 the activation is only about 6% of the bytes moved (1 KiB of 17 KiB for int8-w), because Mini-GPT's hidden dims are small.
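Both intensity tables come from one formula, FLOPs divided by bytes moved. The helper below is a sketch (its name and signature are hypothetical) that follows the tables' accounting: one weight read plus one activation vector of the listed size per matrix-vector product.

```python
def intensity(n_in, n_out, act_elems, weight_bytes, act_bytes):
    """FLOPs per byte for one matrix-vector product."""
    flops = 2 * n_in * n_out
    bytes_moved = n_in * n_out * weight_bytes + act_elems * act_bytes
    return flops / bytes_moved

# (bytes per weight, bytes per activation element) for each variant
VARIANTS = {"fp32": (4, 4), "fp16": (2, 2), "int8-w": (1, 4), "int8-wa": (1, 1)}

lm_head = {tag: intensity(64, 64, 64, wb, ab) for tag, (wb, ab) in VARIANTS.items()}
ffn_fc1 = {tag: intensity(64, 256, 256, wb, ab) for tag, (wb, ab) in VARIANTS.items()}

for tag in VARIANTS:
    print(f"{tag:8s} LM head {lm_head[tag]:.2f} F/B   FFN fc1 {ffn_fc1[tag]:.2f} F/B")
```

The two columns agree because in both cases the counted activation has as many elements as the layer has outputs, so the activation-to-weight byte ratio is fixed at 1/64 before precision is taken into account.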

The Pareto table (predicted)¶

Predictions for Borja's i5-8250U at batch=1, sequence length 64, 32 decoded tokens, model-load excluded.

Variant	PPL (eval set)	PPL Δ vs FP32	Disk bytes	Tokens/sec (decode)	Decode latency (ms/tok)	On frontier?
`fp32`	5.20 (baseline)	0.0%	405 KiB	95	10.5	✓ (anchor)
`fp16`	5.21	+0.2%	203 KiB	130	7.7	✓
`int8-w`	5.27	+1.3%	104 KiB	220	4.5	✓
`int8-wa`	5.45	+4.8%	104 KiB	245	4.1	✓ (knee)

Read this carefully:

FP16 is almost free: 0.2% PPL drift for a 1.37× speedup (95 → 130 tok/s). On a CPU with the FP16 cast cost amortized over the kernel, this is the cheapest win on the chart.
INT8 weight-only is the biggest single Pareto win — 4× smaller weights, ~2.3× faster, ~1% PPL hit. The hit is small because Mini-GPT is over-parameterized for the §A13 task; outliers are mild.
INT8 W+A trades a real PPL hit (almost 5%) for a smaller additional latency win (≈ 9% over int8-w). For our model, this is the knee: anything more aggressive would cost more accuracy than the latency it buys back.
No variant strictly dominates. A latency-only deployment picks int8-wa; an accuracy-first deployment picks fp16; the default for Mini-GPT inference is int8-w.
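The derived columns and the ratios quoted in the bullets can be recomputed from just the PPL and tokens/sec predictions. A small sketch, with the table values copied in by hand:

```python
# (perplexity, decode tokens/sec) per variant, copied from the Pareto table.
PARETO = {
    "fp32": (5.20, 95),
    "fp16": (5.21, 130),
    "int8-w": (5.27, 220),
    "int8-wa": (5.45, 245),
}

base_ppl, base_tps = PARETO["fp32"]
derived = {}
for tag, (ppl, tps) in PARETO.items():
    derived[tag] = {
        "ppl_delta_pct": 100 * (ppl - base_ppl) / base_ppl,
        "speedup": tps / base_tps,
        "latency_ms": 1000 / tps,
    }
    d = derived[tag]
    print(f"{tag:8s} {d['ppl_delta_pct']:+.1f}% PPL  "
          f"{d['speedup']:.2f}x  {d['latency_ms']:.1f} ms/tok")

# Latency saved by going from int8-w to int8-wa, as a fraction of int8-w latency.
wa_gain = 1 - derived["int8-wa"]["latency_ms"] / derived["int8-w"]["latency_ms"]
print(f"int8-wa latency gain over int8-w: {wa_gain:.1%}")
```

Running the same loop over measured numbers from the lab gives a like-for-like comparison against the predictions.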

Where these numbers come from (so you can rederive them)¶

PPL is from a 256-token validation slice of the §A13 corpus (Phase 12). The numbers above are expected values: if your run produces PPL = 8.0 on FP32, your training run had not converged. Quantization noise is additive, so calibrate against an in-distribution baseline first.
Decode tokens/sec is from the Phase 22 bench_decode.py script (KV-cache enabled, no batching). The 95 tok/s FP32 baseline matches the Phase 22 numbers Borja measured.
Latency (ms/tok) is 1000 / tokens-per-second. It is quoted per decoded token after prefill warm-up, so prefill cost doesn't pollute the steady-state figure.

The trade-off shape (memorize this)¶

A useful mental picture for Phase 26:

PPL ↑
 5.5│                                       •  int8-wa
    │
 5.3│                            •  int8-w
    │
 5.2│   •  fp32     •  fp16
    │
    └───────────────────────────────────────────────►  tokens/sec
        95         130            220     245

The frontier is the lower-right envelope of the points: each step to the right buys more speed for slightly worse PPL.
All four labeled points are on the frontier (there are no dominated points to remove).
Anything dominated by the frontier (e.g., a buggy implementation with PPL=6.0 at 150 tok/s, which int8-w beats on both axes) is a regression, not a design choice.
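The "regression, not a design choice" rule is a dominance test, and it is easy to automate. A minimal sketch, assuming lower PPL and higher tokens/sec are both strictly better and ignoring measurement noise; the function names are hypothetical.

```python
# Predicted frontier as (perplexity, tokens/sec) pairs.
FRONTIER = {
    "fp32": (5.20, 95),
    "fp16": (5.21, 130),
    "int8-w": (5.27, 220),
    "int8-wa": (5.45, 245),
}

def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

def is_regression(point, frontier=FRONTIER):
    """A measured point is a regression if any frontier point dominates it."""
    return any(dominates(ref, point) for ref in frontier.values())

print(is_regression((6.0, 150)))   # the buggy example from the text
print(is_regression((5.27, 220)))  # a point that matches int8-w exactly
```

In a real run you would probably allow a tolerance on both axes before flagging a point, since tokens/sec in particular varies between runs.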

Why these numbers are not vibes¶

You can check the predictions analytically before you run the lab:

PPL deltas are consistent with the LLM.int8() paper: per-channel INT8 weight quantization on a well-behaved transformer should produce sub-2% PPL drift. Mini-GPT's outlier scale is bounded because the §A13 corpus is bounded, which matches the predicted +1.3%.
Latency deltas follow from the intensity table above and the Phase 1 roofline. Because decode is memory-bound, the 0.49 → 0.98 F/B doubling caps the FP32→FP16 speedup at 2×; the predicted 1.37× (95 → 130 tok/s) sits below that ceiling because the FP16 cast cost is not zero. INT8 weight-only's 2.3× over FP32 (not 4×) reflects the partial cache hit and the on-the-fly dequantization cost.
The W+A knee mirrors what Dettmers et al. report for production-scale INT8 outlier handling: the extra ≈ 9% latency reduction comes at a real accuracy cost on the FFN's outlier activations.
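One way to sanity-check the latency predictions is to compare each predicted speedup with its roofline ceiling. The sketch below assumes a purely memory-bound decode step, so that with FLOPs fixed the best possible speedup is the ratio of bytes moved, and it uses the LM-head byte totals from the intensity table as a proxy for the whole step. Both are simplifications.

```python
# Total bytes moved per LM-head matvec (intensity table) and predicted tokens/sec.
LM_HEAD_BYTES = {"fp32": 16_640, "fp16": 8_320, "int8-w": 4_352, "int8-wa": 4_160}
PREDICTED_TPS = {"fp32": 95, "fp16": 130, "int8-w": 220, "int8-wa": 245}

def speedup_ceiling(tag):
    """Memory-bound upper bound on speedup over FP32: ratio of bytes moved."""
    return LM_HEAD_BYTES["fp32"] / LM_HEAD_BYTES[tag]

def predicted_speedup(tag):
    """Speedup over FP32 implied by the predicted tokens/sec."""
    return PREDICTED_TPS[tag] / PREDICTED_TPS["fp32"]

for tag in PREDICTED_TPS:
    ceiling, predicted = speedup_ceiling(tag), predicted_speedup(tag)
    print(f"{tag:8s} ceiling {ceiling:.2f}x  predicted {predicted:.2f}x  "
          f"({predicted / ceiling:.0%} of ceiling)")
```

A measured speedup above its ceiling would point to a benchmarking error (for example, timing a warm cache against a cold one) rather than a real gain.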

What this file does NOT measure¶

Throughput (batched). Mini-GPT inference on CPU is single-stream; throughput requires batching, which Phase 33 introduces.
Quality on adversarial prompts. PPL is in-distribution; the Phase 32 grammar-tutor agent's task may stress quantization in different places (rare conjugations). Tracked separately.
INT4 and below. The 4-bit story is lab/02-quant-curve.md + theory/03-gptq-and-nf4.md. The frontier above sets the bar INT4 needs to clear.

Citations¶

Dettmers, Lewis, Belkada, Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022. arXiv:2208.07339.
Williams, Waterman, Patterson. Roofline: An Insightful Visual Performance Model. CACM 2009. (Phase 1's roofline anchor; applied here to the intensity column.)

One-paragraph recap¶

For Mini-GPT (103 680 params), the Pareto frontier of quantization variants resolves into four points: FP32 (anchor), FP16 (almost-free 1.37× speedup), INT8 weight-only (smallest disk footprint, 2.3× decode, ~1% PPL hit), and INT8 W+A (about 9% lower latency than int8-w for a ~5% PPL hit). No variant dominates; int8-w is the default for Mini-GPT inference; the int8-wa knee marks where additional latency savings are no longer worth the accuracy loss. The predictions are reproducible from the roofline equations and the Mini-GPT parameter inventory alone, with no hand-waving.

Next: lab/02-quant-curve.md (run the measurements; verify the table).