05 — Worked Pareto Frontier: FP32 / FP16 / INT8-W / INT8-W+A on Mini-GPT

So far we have derived quantization in the abstract. Here we attach the real numbers of the Phase 17 Mini-GPT (d_model=64, n_layers=2, d_ff=256, |V|=64, 103 680 parameters) to the four main variants and draw the accuracy ↔ latency Pareto frontier with figures, not vibes.

This file is a numbers file — it derives the points you should reproduce in lab/02-quant-curve.md so you can spot a regression before the lab tells you the answer.

Anchors: LYNX_CORTEX.md §0.1, theory 00-motivation.md (roofline framing), Phase 17 lab/02-parameter-inventory.md (the 103 680-param count is canonical).


The four variants under measurement

| Tag | Weights | Activations | Calibration | Implementation |
|---|---|---|---|---|
| fp32 | FP32 | FP32 | none | baseline; from Phase 17 |
| fp16 | FP16 | FP16 | none | a .half() cast |
| int8-w | INT8 per-channel | FP32 | weights only, offline | dequant-on-load Linear |
| int8-wa | INT8 per-channel | INT8 per-tensor dynamic | activation histogram, 64-prompt | quantized matmul, fp32 accumulate |

Two intentionally omitted variants: BF16 (no native AVX path on i5-8250U; same precision picture as FP16 anyway) and INT4-group (deferred to lab 02 — the practical frontier ends at INT8 W/A on Borja's CPU).
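To make the "INT8 per-channel" weight column concrete, here is a minimal sketch of per-output-row weight quantization followed by dequantization. It assumes a symmetric absmax scheme (scale = max|w| / 127 per row); the scheme, the function names, and the toy matrix are illustrative assumptions, not the Phase 17 implementation.

```python
def quantize_per_channel(weight_rows):
    """Symmetric INT8 quantization with one fp32 scale per output row (assumed scheme)."""
    q_rows, scales = [], []
    for row in weight_rows:
        max_abs = max(abs(x) for x in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid divide-by-zero on all-zero rows
        # Round to nearest integer, then clamp into the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Rebuild float weights, as a dequant-on-load layer would before the matmul."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Toy 2x3 weight matrix (hypothetical values): each row gets its own scale.
W = [[0.10, -0.50, 0.25], [2.00, 0.01, -1.00]]
Wq, s = quantize_per_channel(W)
W_hat = dequantize(Wq, s)
max_err = max(abs(a - b) for ra, rb in zip(W, W_hat) for a, b in zip(ra, rb))
print(Wq, s, max_err)
```

Because each row has its own scale, the small-magnitude first row is not crushed by the large values in the second row; the round-trip error per element is bounded by half that row's scale.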

The closed-form byte budget

From Phase 17 the param count is 103 680. The breakdown is:

  • Tied embedding: 4 096 (|V| × d_model = 64 × 64).
  • Two attention blocks × 4 d_model² = 4 × 4 096 = 16 384 each = 32 768.
  • Two MLP blocks × 2 d_model d_ff + d_ff + d_model = 2·64·256 + 256 + 64 = 32 768 + 320 = 33 088 each = 66 176.
  • LayerNorms (2 per block + 1 final, scale+shift each): 4 × 2 × 64 + 2 × 64 = 512 + 128 = 640.
  • Total: 4 096 + 32 768 + 66 176 + 640 = 103 680 ✓.
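The breakdown above can be rechecked mechanically. The sketch below mirrors it term by term; reading the 4·d_model² attention term as four bias-free square projections is an assumption, and the function and key names are illustrative.

```python
# Mini-GPT hyperparameters quoted in this file.
D_MODEL, N_LAYERS, D_FF, VOCAB = 64, 2, 256, 64


def param_inventory(d_model=D_MODEL, n_layers=N_LAYERS, d_ff=D_FF, vocab=VOCAB):
    """Return per-group parameter counts, following the bullet list term by term."""
    embedding = vocab * d_model                              # tied with the LM head
    attention = n_layers * 4 * d_model * d_model             # assumed: 4 bias-free projections
    mlp = n_layers * (2 * d_model * d_ff + d_ff + d_model)   # fc1 + fc2 weights + both biases
    layernorm = (2 * n_layers + 1) * 2 * d_model             # scale + shift per LayerNorm
    return {
        "embedding": embedding,
        "attention": attention,
        "mlp": mlp,
        "layernorm": layernorm,
    }


inventory = param_inventory()
total_params = sum(inventory.values())
print(inventory, total_params)
```

If a refactor changes any of the four hyperparameters, rerunning this is a quick way to see which group moved the total away from 103 680.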

Bytes-on-disk at each precision (weights only; activations are runtime):

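| Variant | Bytes/weight | Total weight bytes | vs FP32 |
|---|---|---|---|
| fp32 | 4 | 414 720 (≈ 405 KiB) | 1.0× |
| fp16 | 2 | 207 360 (≈ 203 KiB) | 0.50× |
| int8-w | 1, plus one fp32 per-channel scale per output row | 108 544 (≈ 106 KiB) | 0.26× |
| int8-wa | same weights as int8-w, plus per-tensor act scale (negligible) | ≈ 106 KiB | 0.26× |

For int8-w the total is 103 680 weight bytes plus 4 bytes per per-channel scale. Counting one scale per output row of every quantized linear gives 2·4·d_model (attention projections) + 2·d_ff (fc1) + 2·d_model (fc2) + |V| (LM head) = 1 216 scales, i.e. 4 864 bytes.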

The activation scale overhead is one fp32 per quantized linear (≈ 28 bytes total for our 7 layers — actually negligible).
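The byte budget can be recomputed in a few lines. Which linears carry per-channel scales is an assumption in this sketch: counting the four attention projections, fc1 and fc2 in each block plus the LM head is the choice that reproduces the ≈ 106 KiB figure. It also assumes, as the 1-byte-per-weight figure does, that every parameter is stored as one byte.

```python
TOTAL_PARAMS = 103_680
D_MODEL, N_LAYERS, D_FF, VOCAB = 64, 2, 256, 64


def scale_count(d_model=D_MODEL, n_layers=N_LAYERS, d_ff=D_FF, vocab=VOCAB):
    """Number of fp32 per-channel scales: one per output row (assumed layer set)."""
    per_block = 4 * d_model + d_ff + d_model   # attention projections + fc1 + fc2
    return n_layers * per_block + vocab        # plus the tied LM head


def weight_bytes(variant):
    """Weight bytes on disk for one variant tag from the table."""
    if variant == "fp32":
        return 4 * TOTAL_PARAMS
    if variant == "fp16":
        return 2 * TOTAL_PARAMS
    if variant in ("int8-w", "int8-wa"):
        # 1 byte per parameter plus 4 bytes per per-channel scale.
        return TOTAL_PARAMS + 4 * scale_count()
    raise ValueError(f"unknown variant: {variant}")


fp32_bytes = weight_bytes("fp32")
for tag in ("fp32", "fp16", "int8-w"):
    b = weight_bytes(tag)
    print(f"{tag:8s} {b:>8d} B  {b / 1024:6.1f} KiB  {b / fp32_bytes:.2f}x")
```

If your quantized layer set differs (for example, attention projections left in FP32), adjust `scale_count` and the 1-byte assumption together, since both change the total.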

The closed-form intensity prediction

From theory/00-motivation.md, the matrix-vector multiply at the LM-head dominates: (d_model, |V|) = (64, 64), repeated each decode step. Per-step bytes loaded for that single op:

| Variant | Weight bytes | Activation bytes | Total bytes | FLOPs (2·in·out) | Intensity |
|---|---|---|---|---|---|
| fp32 | 4·4 096 = 16 384 | 4·64 = 256 | 16 640 | 8 192 | 0.49 F/B |
| fp16 | 2·4 096 = 8 192 | 2·64 = 128 | 8 320 | 8 192 | 0.98 F/B |
| int8-w | 4 096 | 4·64 = 256 (act stays FP32) | 4 352 | 8 192 | 1.88 F/B |
| int8-wa | 4 096 | 1·64 = 64 | 4 160 | 8 192 | 1.97 F/B |

The W-only and W+A intensity numbers are almost identical for this layer because the activation is tiny (64 elements). The FFN's d_ff = 256 activation is where a W+A win would be expected to show up, but the picture there is the same:

| Variant (FFN fc1) | Weight | Act | Total | FLOPs | Intensity |
|---|---|---|---|---|---|
| fp32 | 64 KiB | 1 KiB | 65 KiB | 32 768 | 0.49 F/B |
| int8-w | 16 KiB | 1 KiB | 17 KiB | 32 768 | 1.88 F/B |
| int8-wa | 16 KiB | 256 B | 16.25 KiB | 32 768 | 1.97 F/B |

Same ratios: the activation share stays small because Mini-GPT's hidden dims are small.
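Both intensity tables come from the same formula, FLOPs divided by bytes moved, so they can be regenerated together. This is a sketch with illustrative names; the activation element counts (64 for the LM head, 256 for fc1) are taken from the tables as given.

```python
def intensity(n_in, n_out, weight_bytes_per_elem, act_elems, act_bytes_per_elem):
    """Arithmetic intensity (FLOPs per byte) of one matrix-vector multiply."""
    flops = 2 * n_in * n_out                      # one multiply + one add per weight
    bytes_moved = (n_in * n_out * weight_bytes_per_elem
                   + act_elems * act_bytes_per_elem)
    return flops / bytes_moved


# (weight bytes per element, activation bytes per element) for each variant.
PRECISIONS = {"fp32": (4, 4), "fp16": (2, 2), "int8-w": (1, 4), "int8-wa": (1, 1)}

# LM head: (d_model, |V|) = (64, 64), 64 activation elements.
lm_head = {v: intensity(64, 64, w, 64, a) for v, (w, a) in PRECISIONS.items()}
# FFN fc1: (d_model, d_ff) = (64, 256), 256 activation elements as in the table.
ffn_fc1 = {v: intensity(64, 256, w, 256, a) for v, (w, a) in PRECISIONS.items()}

for v in PRECISIONS:
    print(f"{v:8s} lm_head={lm_head[v]:.2f} F/B  ffn_fc1={ffn_fc1[v]:.2f} F/B")
```

The two layers produce identical intensities because in both cases the activation count equals the output width, so the weight-to-activation byte ratio is the same.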

The Pareto table (predicted)

Predictions for Borja's i5-8250U at batch=1, sequence length 64, 32 decoded tokens, model-load excluded.

| Variant | PPL (eval set) | PPL Δ vs FP32 | Disk bytes | Tokens/sec (decode) | Decode latency (ms/tok) | On frontier? |
|---|---|---|---|---|---|---|
| fp32 | 5.20 (baseline) | 0.0% | 405 KiB | 95 | 10.5 | ✓ (anchor) |
| fp16 | 5.21 | +0.2% | 203 KiB | 130 | 7.7 | ✓ |
| int8-w | 5.27 | +1.3% | 106 KiB | 220 | 4.5 | ✓ |
| int8-wa | 5.45 | +4.8% | 106 KiB | 245 | 4.1 | ✓ (knee) |

Read this carefully:

  1. FP16 is almost free: 0.2% PPL drift for a ≈ 1.4× decode speedup (130 vs 95 tok/s) and half the weight bytes. On a CPU with FP16 cast cost amortized over the kernel, this is the cheapest win on the chart.
  2. INT8 weight-only is the biggest single Pareto win: ≈ 4× smaller weights, ~2.3× faster, ~1.3% PPL hit. The hit is small because Mini-GPT is over-parameterized for the §A13 task; outliers are mild.
  3. INT8 W+A trades a real PPL hit (almost 5%) for a smaller additional latency win (≈ 9% lower ms/tok than int8-w). For our model, this is the knee: anything beyond it would give up more accuracy than it returns in latency.
  4. No variant strictly dominates another. A latency-only deployment picks int8-wa; an accuracy-first deployment picks fp16; the default for Mini-GPT inference is int8-w.

Where these numbers come from (so you can rederive them)

  • PPL is from a 256-token validation slice of the §A13 corpus (Phase 12). The numbers above are expected — if your run produces PPL = 8.0 on FP32, your training run wasn't converged; quantization noise is additive, so calibrate against an in-distribution baseline first.
  • Decode tokens/sec is from the Phase 22 bench_decode.py script (KV-cache enabled, no batching). The 95 tok/s FP32 baseline matches the Phase 22 numbers Borja measured.
  • Latency is 1000 / tokens-per-second, in ms per token. It is quoted per decoded token after prefill warm-up, so prefill cost doesn't pollute the steady-state figure.
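The two derived columns of the Pareto table (PPL Δ and decode latency) follow from the raw PPL and tokens/sec values alone. A small sketch, with the predicted values copied from the table and illustrative function and key names:

```python
# Predicted (PPL, decode tokens/sec) per variant, copied from the Pareto table.
PREDICTED = {
    "fp32": (5.20, 95),
    "fp16": (5.21, 130),
    "int8-w": (5.27, 220),
    "int8-wa": (5.45, 245),
}


def derived_columns(table, baseline="fp32"):
    """Compute PPL drift (%) against the baseline and decode latency (ms/token)."""
    base_ppl, _ = table[baseline]
    rows = {}
    for tag, (ppl, tokens_per_sec) in table.items():
        rows[tag] = {
            "ppl_delta_pct": 100.0 * (ppl - base_ppl) / base_ppl,
            "latency_ms": 1000.0 / tokens_per_sec,
        }
    return rows


derived = derived_columns(PREDICTED)
for tag, row in derived.items():
    print(f"{tag:8s} dPPL={row['ppl_delta_pct']:+.1f}%  latency={row['latency_ms']:.1f} ms/tok")
```

Swapping your measured values into `PREDICTED` gives the same columns for your own run, which makes a side-by-side comparison with the table straightforward.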

The trade-off shape (memorize this)

A useful mental picture for Phase 26:

PPL ↑
 5.5│                                       •  int8-wa
 5.3│                            •  int8-w
 5.2│   •  fp32             •  fp16
    └───────────────────────────────────────────────►  tokens/sec
        95         130            220     245

  • The frontier is the lower-right envelope of the points: lower PPL and higher tokens/sec are both better, so each step to the right buys speed at slightly worse PPL.
  • All four labeled points are on the frontier (no dominated points to remove).
  • Anything dominated by the frontier (e.g., a buggy implementation with PPL=6.0 at 150 tok/s, which int8-w beats on both axes) is a regression, not a design choice.
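The frontier membership claims above can be checked with a standard dominance test. This is a minimal sketch: the point values come from the Pareto table, the "buggy" point is the hypothetical regression from the last bullet, and the function names are illustrative.

```python
def dominates(a, b):
    """True if point a is no worse than b on both axes and strictly better on one.

    Points are (ppl, tokens_per_sec): lower PPL is better, higher tokens/sec is better.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_frontier(points):
    """Keep only the points that no other point dominates."""
    return {
        tag: p
        for tag, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != tag)
    }


POINTS = {
    "fp32": (5.20, 95),
    "fp16": (5.21, 130),
    "int8-w": (5.27, 220),
    "int8-wa": (5.45, 245),
}

frontier = pareto_frontier(POINTS)
with_bug = pareto_frontier({**POINTS, "buggy": (6.0, 150)})
print(sorted(frontier))
print("buggy" in with_bug)
```

Adding your measured points to `POINTS` and rerunning shows at a glance whether a new implementation extends the frontier or falls inside it.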

Why these numbers are not vibes

You can check the predictions analytically before you run the lab:

  • PPL deltas are consistent with the LLM.int8() paper: per-channel INT8 weight quantization on a well-behaved transformer should produce sub-2% PPL drift. Mini-GPT's outlier scale is bounded because the §A13 corpus is bounded; this matches.
  • Latency deltas follow from the intensity table above and the Phase 1 roofline. In a memory-bound regime the intensity ratio is the speedup ceiling: 0.49 → 0.98 F/B caps FP16 at 2×, and 0.49 → 1.88 F/B caps INT8 weight-only at ≈ 3.8×. The predicted speedups sit below those ceilings: FP16's 1.4× (130 vs 95 tok/s) reflects the cast cost, and INT8 weight-only's 2.3× reflects the partial cache hit and the dequant-on-the-fly cost.
  • The W+A knee mirrors what Dettmers et al. report for production-scale INT8 outlier-handling: the extra ~10% on latency comes at a real accuracy hit on the FFN's outlier activations.
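One way to line up the intensity table with the Pareto table is to compare each variant's memory-bound speedup ceiling (its intensity ratio against FP32) with the speedup implied by the predicted tokens/sec. The sketch below uses the rounded LM-head intensities and the predicted decode rates from this file; the names are illustrative.

```python
# Rounded LM-head intensities (F/B) and predicted decode rates (tok/s),
# both copied from the tables in this file.
INTENSITY = {"fp32": 0.49, "fp16": 0.98, "int8-w": 1.88, "int8-wa": 1.97}
TOKENS_PER_SEC = {"fp32": 95, "fp16": 130, "int8-w": 220, "int8-wa": 245}


def ceiling_and_predicted(tag, baseline="fp32"):
    """Return (memory-bound speedup ceiling, predicted speedup) against the baseline."""
    ceiling = INTENSITY[tag] / INTENSITY[baseline]
    predicted = TOKENS_PER_SEC[tag] / TOKENS_PER_SEC[baseline]
    return ceiling, predicted


for tag in ("fp16", "int8-w", "int8-wa"):
    ceiling, predicted = ceiling_and_predicted(tag)
    print(f"{tag:8s} ceiling {ceiling:.2f}x  predicted {predicted:.2f}x  "
          f"({100 * predicted / ceiling:.0f}% of ceiling)")
```

A measured speedup above the ceiling would suggest the run was not memory-bound (or the baseline was mis-measured); one far below it points at overheads such as casts or dequantization.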

What this file does NOT measure

  • Throughput (batched). Mini-GPT inference on CPU is single-stream; throughput requires batching, which Phase 33 introduces.
  • Quality on adversarial prompts. PPL is in-distribution; the Phase 32 grammar-tutor agent's task may stress quantization in different places (rare conjugations). Tracked separately.
  • INT4 and below. The 4-bit story is lab/02-quant-curve.md + theory/03-gptq-and-nf4.md. The frontier above sets the bar INT4 needs to clear.

Citations

  • Dettmers, Lewis, Belkada, Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022. arXiv:2208.07339.
  • Williams, Waterman, Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. CACM 2009. (Phase 1's roofline anchor; applied here to the intensity column.)

One-paragraph recap

For Mini-GPT (103 680 params), the Pareto frontier of quantization variants resolves into four points: FP32 (anchor), FP16 (almost-free ≈ 1.4× decode speedup at half the weight bytes), INT8 weight-only (≈ 4× smaller on disk, 2.3× decode, ~1.3% PPL hit), and INT8 W+A (≈ 9% lower latency than int8-w for a ~5% PPL hit). No variant dominates; int8-w is the default for Mini-GPT inference; the int8-wa knee marks where further latency gains stop being worth the accuracy they cost. The predictions can be checked against the roofline equations and the Mini-GPT parameter inventory, with no hand-waving.

Next: lab/02-quant-curve.md (run the measurements; verify the table).