Lab 03 — Round-trip fp32 ↔ int8 and measure the loss¶
Goal: quantify the error introduced by symmetric INT8 quantization on a typical activation distribution. Foreshadow Phase 26 without doing real calibration.
Estimated time: 45–75 minutes.
Prereq: theory 04-precision-zoo.md § "Integer formats" read.
What you produce¶
A directory experiments/02-quantization-preview/ containing:
- quant.py: quantize_fp32_to_int8(x, s) and dequantize_int8_to_fp32(q, s), plus a small helper to choose s.
- experiment.py: driver that runs the round-trip on three test distributions and records errors.
- results.json: per-distribution error statistics.
- error_histogram.png: distribution of per-element absolute errors.
- manifest.json.
- README.md: interpretation, including the question "is this enough for the §A13 tense classifier?".
The §A13 framing¶
The test distributions you'll quantize are examples of what a small model produces for the §A13 task:
- Tense classifier logits. A length-5 vector with values in [-3, 5] (typical pre-softmax logits for the 5-tense classification).
- Verb-form distribution mass. A length-600 vector representing the model's predicted probability over all 600 verb forms. Values in [0, 0.1], with most values < 0.01.
- Hidden activations. A length-1024 vector with values approximately Normal(0, 1), a typical hidden state of a small transformer.
For each, we ask: how much accuracy does INT8 round-trip cost?
TODOs¶
Block A — implementations¶
quant.py:
import numpy as np

def choose_scale_symmetric(x, n_bits=8):
    # INT8 symmetric: representable range is [-127, 127].
    # Scale so that abs(x).max() / s == 127, i.e. s = abs(x).max() / 127.
    qmax = 2 ** (n_bits - 1) - 1  # 127 for n_bits=8
    return np.abs(x).max() / qmax

def quantize_fp32_to_int8(x, s):
    # Round to nearest (np.round sends halves to even), then clip to the int8 range.
    q = np.clip(np.round(x / s), -128, 127).astype(np.int8)
    return q

def dequantize_int8_to_fp32(q, s):
    return q.astype(np.float32) * s
Variant: also implement an asymmetric version with a zero-point z:
def choose_scale_asymmetric(x, n_bits=8):
    # Map [x.min(), x.max()] onto the full signed range [qmin, qmax].
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1  # -128, 127
    s = (x.max() - x.min()) / (qmax - qmin)
    # Zero-point: the integer code that represents x = 0.
    z = qmin - int(np.round(x.min() / s))
    return s, z

def quantize_asym(x, s, z):
    return np.clip(np.round(x / s + z), -128, 127).astype(np.int8)

def dequantize_asym(q, s, z):
    return (q.astype(np.float32) - z) * s
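A quick sanity check of both variants can be sketched as follows. This is a minimal, self-contained version that repeats the formulas above; the test input and the variable names `sym_max_err` and `asym_max_err` are assumptions for illustration, not part of the lab spec.

```python
import numpy as np

def choose_scale_symmetric(x):
    # Per-tensor symmetric scale: the largest magnitude maps to code 127.
    return np.abs(x).max() / 127.0

def quantize_fp32_to_int8(x, s):
    return np.clip(np.round(x / s), -128, 127).astype(np.int8)

def dequantize_int8_to_fp32(q, s):
    return q.astype(np.float32) * s

def choose_scale_asymmetric(x):
    qmin, qmax = -128, 127
    s = (x.max() - x.min()) / (qmax - qmin)
    z = qmin - int(np.round(x.min() / s))  # integer code that represents x = 0
    return s, z

def quantize_asym(x, s, z):
    return np.clip(np.round(x / s + z), -128, 127).astype(np.int8)

def dequantize_asym(q, s, z):
    return (q.astype(np.float32) - z) * s

# Hypothetical check input: Normal(0, 1) samples, not one of the lab's saved tensors.
x = np.random.default_rng(0).standard_normal(1024).astype(np.float32)

s = choose_scale_symmetric(x)
x_sym = dequantize_int8_to_fp32(quantize_fp32_to_int8(x, s), s)
sym_max_err = np.abs(x - x_sym).max()

s_a, z = choose_scale_asymmetric(x)
x_asym = dequantize_asym(quantize_asym(x, s_a, z), s_a, z)
asym_max_err = np.abs(x - x_asym).max()

print(f"symmetric:  s={s:.5f}  max_err={sym_max_err:.5f}")
print(f"asymmetric: s={s_a:.5f}  z={z}  max_err={asym_max_err:.5f}")
```

In both modes the maximum error should sit at or just under half a step. The asymmetric step is always slightly smaller because it spreads 255 intervals over [min, max] instead of 254 over [-max|x|, max|x|].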
Block B — generate the three distributions¶
import numpy as np

rng = np.random.default_rng(42)
# Distribution 1: tense logits, length 5
logits = np.array([1.2, 4.7, 3.1, 0.5, 2.9], dtype=np.float32)
# Or, generate a batch: rng.uniform(-3, 5, size=(64, 5)).astype(np.float32)
# Distribution 2: verb-form probabilities, length 600
# Highly skewed: a few large, most tiny
probs_raw = np.exp(rng.normal(0, 1.5, 600)).astype(np.float32)
probs = probs_raw / probs_raw.sum()
# Distribution 3: hidden activations, length 1024
hidden = rng.standard_normal(1024).astype(np.float32)
Define each distribution in experiment.py as a named array.
Block C — round-trip and measure¶
For each distribution, with both symmetric and asymmetric quantization:
- Compute the scale (and zero-point for asymmetric).
- Quantize to int8.
- Dequantize back to fp32.
- Record:
  - Mean absolute error: np.abs(x - x_dequant).mean().
  - Max absolute error.
  - Mean relative error (where |x| > some_threshold to avoid division by tiny values): np.abs((x - x_dequant) / x).mean().
  - Cosine similarity between x and x_dequant.
  - Number of unique int8 values used out of 256 (utilization).
Save to results.json as a table:
{
  "tense_logits": {
    "symmetric": {"mae": ..., "max_err": ..., "rel_err": ..., "cos_sim": ..., "n_codes": ...},
    "asymmetric": { ... }
  },
  "verb_probs": { ... },
  "hidden": { ... }
}
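One way to compute the five statistics and produce a table in this shape is sketched below. The helper name `roundtrip_metrics` and the `rel_threshold` default are assumptions for illustration; only the metric definitions come from the list above.

```python
import json
import numpy as np

def roundtrip_metrics(x, x_dequant, q, rel_threshold=1e-8):
    """Error statistics for one distribution and one quantization mode."""
    x = x.astype(np.float64).ravel()
    x_dequant = x_dequant.astype(np.float64).ravel()
    abs_err = np.abs(x - x_dequant)
    # Skip near-zero entries so the relative error never divides by ~0.
    mask = np.abs(x) > rel_threshold
    rel_err = (abs_err[mask] / np.abs(x[mask])).mean()
    cos_sim = np.dot(x, x_dequant) / (np.linalg.norm(x) * np.linalg.norm(x_dequant))
    return {
        "mae": float(abs_err.mean()),
        "max_err": float(abs_err.max()),
        "rel_err": float(rel_err),
        "cos_sim": float(cos_sim),
        "n_codes": int(np.unique(q).size),
    }

# Example: the fixed tense logits from Block B, symmetric mode only.
x = np.array([1.2, 4.7, 3.1, 0.5, 2.9], dtype=np.float32)
s = np.abs(x).max() / 127.0
q = np.clip(np.round(x / s), -128, 127).astype(np.int8)
x_dequant = q.astype(np.float32) * s

results = {"tense_logits": {"symmetric": roundtrip_metrics(x, x_dequant, q)}}
results_json = json.dumps(results, indent=2)  # this string is what goes into results.json
print(results_json)
```

Casting every statistic to a built-in `float` or `int` matters here: `json.dumps` does not accept NumPy scalar types such as `np.float32`.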
Block D — predict before measuring¶
In README.md, before running, predict:
- Which distribution will INT8-quantize most accurately, and why?
- Which distribution will lose information most painfully?
- For the verb-form probabilities (where most values are < 0.01): what is the quantization step s, and what is the relative error on a value of 0.001?
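Expected reasoning: the scale is set by the maximum, so distributions with high dynamic range (verb probs: from 0 to ~0.5) get a coarse step s ≈ 0.5/127 ≈ 0.004. A value of 0.001 gives x/s ≈ 0.25, which rounds to q = 0, a relative error of 100%; values just above s/2 ≈ 0.002 round up to 1 × s = 0.004 instead, overshooting by close to 100%. INT8 is brutal for the verb-form probability distribution. Asymmetric helps only a little: symmetric leaves the negative codes unused on a non-negative distribution, so using all 256 codes roughly halves the step to ≈ 0.002, which is still about twice the size of a 0.001 entry. Per-channel scales or log-domain quantization (Phase 26) would help.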
For tense logits ([-3, 5]): step s ≈ 5/127 ≈ 0.039. Taking one full step as a conservative error bound (round-to-nearest caps the actual error at s/2), the relative error on a logit of 1.2 is at most 0.039/1.2 ≈ 3%. Tolerable. Softmax should still rank the same tense first.
For hidden activations (Normal(0, 1)): step s ≈ 3.5/127 ≈ 0.028 (assuming max ≈ 3.5σ). By the same bound, the relative error on a typical |x| ≈ 0.8 is at most ~3.5%. Tolerable.
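The arithmetic behind these predictions can be checked with a few lines of NumPy. This is a sketch under the same assumed maxima used in the reasoning above (0.5, 5.0, 3.5), not measured values; the helper name `step_and_errors` is hypothetical.

```python
import numpy as np

def step_and_errors(max_abs, value):
    """Symmetric INT8 step for a given max |x|, plus errors on one value.

    Returns (s, actual relative error, worst-case relative error (s/2) / |value|).
    """
    s = max_abs / 127.0
    q = np.clip(np.round(value / s), -128, 127)
    actual = abs(q * s - value) / abs(value)
    worst = (s / 2) / abs(value)
    return s, actual, worst

# (assumed max |x|, probe value) per distribution; assumptions, not measurements.
cases = {
    "verb_probs": (0.5, 0.001),
    "tense_logits": (5.0, 1.2),
    "hidden": (3.5, 0.8),
}
estimates = {name: step_and_errors(*args) for name, args in cases.items()}
for name, (s, actual, worst) in estimates.items():
    print(f"{name:13s} s={s:.4f}  actual={actual:.1%}  worst-case={worst:.1%}")
```

For the verb-form case the probe value lands on code 0, so its actual relative error is exactly 100%. The other two rows show how far a single probe value can sit below the worst-case bound.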
Block E — the killer question¶
For the verb-form probabilities, after INT8 round-trip, does the rank of the top-k forms change? Compute np.argsort(probs)[::-1][:10] for both the original and the quantized-then-dequantized version. Are the top-10 the same set? In the same order?
This is the task-relevant question. INT8 quantization that preserves probabilities to 3 sig figs is useless if it swaps the top two predictions. Argsort stability is a Phase-26 evaluation metric; you're previewing it here.
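The comparison can be sketched as below, assuming the symmetric round-trip and a freshly seeded generator (so the sampled values may differ from an experiment.py that draws other tensors from the same `rng` first). The helper `topk_stability` is a hypothetical name.

```python
import numpy as np

def topk_stability(x, x_dequant, k=10):
    """Compare the top-k indices before and after the round-trip."""
    top_orig = np.argsort(x, kind="stable")[::-1][:k]
    top_quant = np.argsort(x_dequant, kind="stable")[::-1][:k]
    same_set = set(top_orig.tolist()) == set(top_quant.tolist())
    same_order = bool(np.array_equal(top_orig, top_quant))
    return same_set, same_order

rng = np.random.default_rng(42)
probs_raw = np.exp(rng.normal(0, 1.5, 600)).astype(np.float32)
probs = probs_raw / probs_raw.sum()

# Symmetric per-tensor round-trip.
s = np.abs(probs).max() / 127.0
q = np.clip(np.round(probs / s), -128, 127).astype(np.int8)
probs_dequant = q.astype(np.float32) * s

same_set, same_order = topk_stability(probs, probs_dequant)
# With 600 entries and at most 128 non-negative codes, many entries must share a code.
n_ties = q.size - np.unique(q).size
print(f"top-10 same set: {same_set}, same order: {same_order}, tied entries: {n_ties}")
```

Ties deserve a note in the README: when two entries collapse onto the same int8 code, their relative order after dequantization is decided by the sort's tie-breaking rather than by the model.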
Block F — histogram¶
error_histogram.png: for the hidden activations (length 1024), plot a histogram of np.abs(x - x_dequant). Annotate the theoretical max error s/2 ≈ 0.014. Confirm that the histogram stays below that bound; with round-to-nearest the errors should spread roughly evenly over [0, s/2].
Constraints¶
- Symmetric s from max-absolute-value. Don't use exotic strategies (percentile clipping, calibration). That's Phase 26.
- NumPy only. No bitsandbytes, no torch.quantization. Pure-NumPy round-trip is enough for measurement.
- Use the seeded RNG for reproducibility.
Stop conditions¶
Done when:
- Two implementations (symmetric, asymmetric) in quant.py.
- Three distributions × both quant modes in results.json.
- The "top-10 argsort stability" question is answered for the verb-form probability distribution.
- error_histogram.png exists.
- README.md makes the recommendation: can INT8 alone be trusted for §A13 verb-form prediction? (Answer expected: no, because the distribution is too skewed; Phase 26 will solve this with per-channel scales or log-domain encoding.)
Pitfalls¶
- np.round of halves. Default behavior is banker's rounding (round half to even). np.round(0.5) = 0.0, not 1.0. Document this if it surfaces.
- Asymmetric clipping at boundary. The zero-point computation can go slightly out of [-128, 127] for extreme distributions; clip after computing.
- np.abs(x - x_dequant) / x divides by zero when an entry is exactly zero (rare for fp32 random data but possible). Mask with |x| > 1e-8.
- max of a 2D batch tensor. Use the right axis. For per-tensor quantization, x.max() over everything. For per-channel (Phase 26), x.max(axis=-1, keepdims=True).
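Three of these pitfalls are easy to reproduce in isolation. The snippet below is a small illustration with made-up values; the arrays `x` and `x_dequant` are hypothetical, and the batch follows the uniform(-3, 5) generator suggested in Block B.

```python
import numpy as np

# 1. np.round rounds halves to the nearest even integer.
halves = np.round(np.array([0.5, 1.5, 2.5, -0.5]))
print(halves)

# 2. Masked relative error: entries with |x| <= 1e-8 are excluded
#    instead of producing a division by zero.
x = np.array([0.0, 0.5, -2.0], dtype=np.float32)
x_dequant = np.array([0.0, 0.496, -2.008], dtype=np.float32)
mask = np.abs(x) > 1e-8
rel_err = np.abs((x[mask] - x_dequant[mask]) / x[mask]).mean()
print(f"masked relative error: {rel_err:.4f}")

# 3. Per-tensor vs per-row scale on a batch of shape (64, 5).
batch = np.random.default_rng(42).uniform(-3, 5, size=(64, 5)).astype(np.float32)
s_tensor = np.abs(batch).max() / 127.0                         # one scalar for everything
s_rows = np.abs(batch).max(axis=-1, keepdims=True) / 127.0     # shape (64, 1), one per row
print(s_rows.shape)
```

Keeping `keepdims=True` means `batch / s_rows` broadcasts row by row without any reshaping.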
When to consult solutions/¶
After committing all the files listed under "What you produce". Solution at solutions/03-quantization-preview-ref.md (written at phase open).
Hint of last resort¶
If your INT8 round-trip is producing wildly larger errors than expected, you probably forgot to dequantize back to fp32 before comparing — i.e., comparing fp32 x to int8 q directly. The dequant step q.astype(np.float32) * s is essential.
Phase 2 labs complete. Next: /quiz 02, then PHASE_02_REPORT.md, then reflection, then proceed to Phase 3.