Lab 03 — Round-trip fp32 ↔ int8 and measure the loss

Goal: quantify the error introduced by symmetric INT8 quantization on a typical activation distribution. Foreshadow Phase 26 without doing real calibration.

Estimated time: 45–75 minutes.

Prereq: you have read theory 04-precision-zoo.md § "Integer formats".


What you produce

A directory experiments/02-quantization-preview/ containing:

  • quant.py: quantize_fp32_to_int8(x, s) and dequantize_int8_to_fp32(q, s), plus a small helper to choose s.
  • experiment.py — driver that runs the round-trip on three test distributions and records errors.
  • results.json — per-distribution error statistics.
  • error_histogram.png — distribution of per-element absolute errors.
  • manifest.json.
  • README.md — interpretation, including the question "is this enough for the §A13 tense classifier?".

The §A13 framing

The test distributions you'll quantize are examples of what a small model produces for the §A13 task:

  1. Tense classifier logits. A length-5 vector with values in [-3, 5] (typical pre-softmax logits for the 5-tense classification).
  2. Verb-form distribution mass. A length-600 vector representing the model's predicted probability over all 600 verb forms. Values in [0, 0.1], with most values < 0.01.
  3. Hidden activations. A length-1024 vector with values approximately Normal(0, 1) — a typical hidden state of a small transformer.

For each, we ask: how much accuracy does INT8 round-trip cost?

TODOs

Block A — implementations

quant.py:

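import numpy as np

def choose_scale_symmetric(x, n_bits=8):
    # Symmetric signed range is [-qmax, qmax]; for INT8 that is [-127, 127].
    qmax = 2 ** (n_bits - 1) - 1
    # s = max|x| / qmax, so that every x / s lands inside [-qmax, qmax].
    # Note: an all-zero x gives s = 0; guard against that before dividing by s.
    return np.abs(x).max() / qmax

def quantize_fp32_to_int8(x, s):
    # Round to nearest, then clip to the symmetric range [-127, 127].
    # Code -128 stays unused so the grid is symmetric about zero.
    q = np.clip(np.round(x / s), -127, 127).astype(np.int8)
    return q

def dequantize_int8_to_fp32(q, s):
    # Map integer codes back to fp32 values on the grid {k * s}.
    return q.astype(np.float32) * s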

Variant: also implement an asymmetric version with a zero-point z:

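def choose_scale_asymmetric(x, n_bits=8):
    # Full signed range; for INT8 that is [-128, 127].
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    # One step covers (max - min) / (qmax - qmin); a constant x gives s = 0, so guard that case.
    s = (x.max() - x.min()) / (qmax - qmin)
    # Zero-point: chosen so that x.min() lands on the code qmin.
    z = qmin - int(np.round(x.min() / s))
    return s, z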

def quantize_asym(x, s, z):
    return np.clip(np.round(x / s + z), -128, 127).astype(np.int8)

def dequantize_asym(q, s, z):
    return (q.astype(np.float32) - z) * s
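
Before moving on, it can help to round-trip one small array through both variants and check that the error stays within half a step. The sketch below restates the functions compactly so that it runs on its own; the test array, the shortened names quantize_sym / dequantize_sym, and the 1e-6 tolerance are illustrative assumptions, not part of the lab spec.

```python
import numpy as np

def choose_scale_symmetric(x):
    return float(np.abs(x).max()) / 127.0

def quantize_sym(x, s):
    # Symmetric range [-127, 127]; with a max-abs scale the clip never triggers.
    return np.clip(np.round(x / s), -127, 127).astype(np.int8)

def dequantize_sym(q, s):
    return q.astype(np.float32) * s

def choose_scale_asymmetric(x):
    qmin, qmax = -128, 127
    s = float(x.max() - x.min()) / (qmax - qmin)
    z = qmin - int(np.round(x.min() / s))
    return s, z

def quantize_asym(x, s, z):
    return np.clip(np.round(x / s + z), -128, 127).astype(np.int8)

def dequantize_asym(q, s, z):
    return (q.astype(np.float32) - z) * s

# Hand-picked values spanning the [-3, 5] logit range (an assumption for the demo).
x = np.array([-3.0, -0.5, 0.0, 1.2, 5.0], dtype=np.float32)

s = choose_scale_symmetric(x)
err_sym = np.abs(x - dequantize_sym(quantize_sym(x, s), s)).max()

s_a, z_a = choose_scale_asymmetric(x)
err_asym = np.abs(x - dequantize_asym(quantize_asym(x, s_a, z_a), s_a, z_a)).max()

print(f"symmetric:  s={s:.5f}  max_err={err_sym:.5f}  bound s/2={s / 2:.5f}")
print(f"asymmetric: s={s_a:.5f} z={z_a}  max_err={err_asym:.5f}  bound s/2={s_a / 2:.5f}")
```

Because the asymmetric variant spreads 255 steps over [min, max] instead of 127 steps over [0, max|x|], its step is smaller whenever the data is not centered on zero.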

Block B — generate the three distributions

rng = np.random.default_rng(42)

# Distribution 1: tense logits, length 5
logits = np.array([1.2, 4.7, 3.1, 0.5, 2.9], dtype=np.float32)
# Or, generate a batch: rng.uniform(-3, 5, size=(64, 5)).astype(np.float32)

# Distribution 2: verb-form probabilities, length 600
# Highly skewed: a few large, most tiny
probs_raw = np.exp(rng.normal(0, 1.5, 600)).astype(np.float32)
probs = probs_raw / probs_raw.sum()

# Distribution 3: hidden activations, length 1024
hidden = rng.standard_normal(1024).astype(np.float32)

Define each distribution in experiment.py as a named array.
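
Since the per-tensor scale depends only on the extremes, it is worth printing them before quantizing anything. A minimal sketch follows; the summarize helper is a hypothetical name, and the 0.01 cut-off simply mirrors the "most values < 0.01" description above.

```python
import numpy as np

rng = np.random.default_rng(42)
probs_raw = np.exp(rng.normal(0, 1.5, 600)).astype(np.float32)
probs = probs_raw / probs_raw.sum()
hidden = rng.standard_normal(1024).astype(np.float32)

def summarize(name, x):
    """Print the quantities that determine the per-tensor scale."""
    print(f"{name}: min={x.min():.6f} max={x.max():.6f} "
          f"max|x|={np.abs(x).max():.6f} "
          f"frac(|x| < 0.01)={np.mean(np.abs(x) < 0.01):.3f}")

summarize("verb_probs", probs)
summarize("hidden", hidden)
```

Use the printed maximum of verb_probs, not a guessed one, when you work out the step size for Block D.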

Block C — round-trip and measure

For each distribution, with both symmetric and asymmetric quantization:

  1. Compute the scale (and zero-point for asymmetric).
  2. Quantize to int8.
  3. Dequantize back to fp32.
  4. Record:
       • Mean absolute error: np.abs(x - x_dequant).mean().
       • Max absolute error: np.abs(x - x_dequant).max().
       • Mean relative error, computed only over entries with |x| above a small threshold (e.g. 1e-8) to avoid dividing by tiny values: with mask = np.abs(x) > 1e-8, np.abs((x[mask] - x_dequant[mask]) / x[mask]).mean().
       • Cosine similarity between x and x_dequant.
       • Number of unique int8 codes used, out of 256 (utilization).
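
One way to package these measurements is a single function that returns a JSON-ready dict per distribution and mode. This is a sketch, not the reference solution: the name roundtrip_metrics, the float64 accumulation, and the NaN fallbacks are assumptions made here.

```python
import json
import numpy as np

def roundtrip_metrics(x, x_dequant, q, rel_threshold=1e-8):
    """Error statistics for one distribution under one quantization mode."""
    x64 = x.astype(np.float64).ravel()
    xd64 = x_dequant.astype(np.float64).ravel()
    abs_err = np.abs(x64 - xd64)

    # Relative error only where |x| is large enough to divide by.
    mask = np.abs(x64) > rel_threshold
    rel_err = float((abs_err[mask] / np.abs(x64[mask])).mean()) if mask.any() else float("nan")

    denom = np.linalg.norm(x64) * np.linalg.norm(xd64)
    cos_sim = float(np.dot(x64, xd64) / denom) if denom > 0 else float("nan")

    return {
        "mae": float(abs_err.mean()),
        "max_err": float(abs_err.max()),
        "rel_err": rel_err,
        "cos_sim": cos_sim,
        "n_codes": int(np.unique(q).size),
    }

# Usage on the fixed tense-logit vector with the symmetric scheme.
x = np.array([1.2, 4.7, 3.1, 0.5, 2.9], dtype=np.float32)
s = float(np.abs(x).max()) / 127.0
q = np.clip(np.round(x / s), -127, 127).astype(np.int8)
m = roundtrip_metrics(x, q.astype(np.float32) * s, q)
print(json.dumps(m, indent=2))
```

Casting every value to a plain Python float or int keeps the dict serializable with json.dumps, which NumPy scalars are not by default.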

Save to results.json as a table:

{
  "tense_logits": {
    "symmetric":  {"mae": ..., "max_err": ..., "rel_err": ..., "cos_sim": ..., "n_codes": ...},
    "asymmetric": { ... }
  },
  "verb_probs":   { ... },
  "hidden":       { ... }
}

Block D — predict before measuring

In README.md, before running, predict:

  • Which distribution will INT8-quantize most accurately, and why?
  • Which distribution will lose information most painfully?
  • For the verb-form probabilities (where most values are < 0.01): what's the quantization step s, and what's the relative error on a value of 0.001?

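Expected reasoning: the scale is set by the max, so a distribution with high dynamic range gets a step that is coarse relative to its small values. For the verb probs, taking the peak to be about 0.1 (the top of the stated range), s ≈ 0.1/127 ≈ 0.0008. A value of 0.001 then lands on q = 1 and comes back as ≈ 0.0008, a relative error of about 20%, and every value below s/2 ≈ 0.0004 collapses to 0, a relative error of 100%. INT8 is brutal for the verb-form probability distribution. Symmetric quantization also spends its negative codes on nothing here, because probabilities are non-negative; asymmetric uses all 256 codes for [min, max] and roughly halves the step, which helps a bit but does not fix the skew. Per-channel scales or log-domain quantization (Phase 26) would help.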

For tense logits ([-3, 5]): step s ≈ 5/127 ≈ 0.039. Round-to-nearest is off by at most s/2 ≈ 0.020, so the relative error on a logit of 1.2 is at most about 1.6%. Tolerable. Softmax should still rank the same tense first.

For hidden activations (Normal(0, 1)): step s ≈ 3.5/127 ≈ 0.028 (assuming max ≈ 3.5σ). The worst-case error s/2 ≈ 0.014 on a typical |x| ≈ 0.8 is about 1.7%. Tolerable.
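
The back-of-envelope numbers can be reproduced in a few lines of plain Python. The maxima below are assumed ranges, not measured values, and the verb-prob case is run with two assumed peaks to show how strongly the conclusion depends on the largest probability. The helper name step_and_rel_err is hypothetical.

```python
def step_and_rel_err(max_abs, value, qmax=127):
    """Return (step s, integer code q, relative round-trip error) for one value."""
    s = max_abs / qmax
    # Python's built-in round is also round-half-to-even, like np.round.
    q = max(-qmax, min(qmax, round(value / s)))
    recon = q * s
    return s, q, abs(value - recon) / abs(value)

# (assumed max|x|, probe value) per distribution
cases = {
    "verb_probs, peak 0.1": (0.1, 0.001),
    "verb_probs, peak 0.5": (0.5, 0.001),
    "tense_logits": (5.0, 1.2),
    "hidden": (3.5, 0.8),
}

results = {name: step_and_rel_err(m, v) for name, (m, v) in cases.items()}
for name, (s, q, rel) in results.items():
    print(f"{name:22s} s={s:.5f}  s/2={s / 2:.5f}  q={q:4d}  rel_err={rel:.1%}")
```

The error on one specific value can sit well below the s/2 worst case, which is why the lab asks for both mean and max statistics.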

Block E — the killer question

For the verb-form probabilities, after INT8 round-trip, does the rank of the top-k forms change? Compute np.argsort(probs)[::-1][:10] for both the original and the quantized-then-dequantized version. Are the top-10 the same set? In the same order?

This is the task-relevant question. INT8 quantization that preserves probabilities to 3 sig figs is useless if it swaps the top two predictions. Argsort stability is a Phase-26 evaluation metric; you're previewing it here.
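
A possible shape for the check is sketched below. It deviates slightly from the argsort expression above: it sorts the negated values with kind="stable" so that ties, which quantization creates in large numbers among small probabilities, are broken by index in both rankings. The function name topk_stability and its return layout are assumptions.

```python
import numpy as np

def topk_stability(x, x_dequant, k=10):
    """Compare the top-k indices before and after the round-trip."""
    top_orig = np.argsort(-x, kind="stable")[:k]
    top_deq = np.argsort(-x_dequant, kind="stable")[:k]
    same_set = set(top_orig.tolist()) == set(top_deq.tolist())
    same_order = bool(np.array_equal(top_orig, top_deq))
    return same_set, same_order, top_orig, top_deq

# Usage on the Block B verb-form probabilities with symmetric INT8.
rng = np.random.default_rng(42)
probs_raw = np.exp(rng.normal(0, 1.5, 600)).astype(np.float32)
probs = probs_raw / probs_raw.sum()

s = float(np.abs(probs).max()) / 127.0
q = np.clip(np.round(probs / s), -127, 127).astype(np.int8)
probs_dequant = q.astype(np.float32) * s

same_set, same_order, top_orig, top_deq = topk_stability(probs, probs_dequant, k=10)
print("same top-10 set:  ", same_set)
print("same top-10 order:", same_order)
```

Also report how many of the 600 entries share a code with a neighbor in the ranking; when ties are present, "same order" depends on the tie-break rule and should be stated in README.md.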

Block F — histogram

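error_histogram.png: for the hidden activations (length 1024), plot a histogram of np.abs(x - x_dequant). Annotate the theoretical max error s/2 (≈ 0.014 if the largest |x| is about 3.5). Confirm that no error exceeds that bound, up to float rounding; with round-to-nearest the errors should spread roughly evenly over [0, s/2] rather than peak anywhere.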
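
A minimal plotting sketch follows, assuming matplotlib is available for the figure (the NumPy-only constraint below concerns the quantization itself). The function names, bin count, and styling are assumptions; the hidden array here is a stand-in drawn from a fresh seeded generator, so in experiment.py pass the Block B array instead.

```python
import numpy as np

def abs_roundtrip_errors(x):
    """Symmetric INT8 round-trip; returns per-element |x - x_dequant| and the step s."""
    s = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / s), -127, 127).astype(np.int8)
    return np.abs(x - q.astype(np.float32) * s), s

def plot_error_histogram(abs_err, s, path="error_histogram.png"):
    # Imported lazily so the numeric part runs without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # headless backend, writes straight to file
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(abs_err, bins=40)
    ax.axvline(s / 2, color="red", linestyle="--", label=f"s/2 = {s / 2:.4f}")
    ax.set_xlabel("absolute error |x - x_dequant|")
    ax.set_ylabel("count")
    ax.set_title("INT8 symmetric round-trip error, hidden activations")
    ax.legend()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)

rng = np.random.default_rng(42)
hidden = rng.standard_normal(1024).astype(np.float32)  # stand-in for the Block B array
abs_err, s = abs_roundtrip_errors(hidden)
print(f"s={s:.5f}  s/2={s / 2:.5f}  max_err={abs_err.max():.5f}")
```

Call plot_error_histogram(abs_err, s) to write the PNG once the printed max_err looks sane.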

Constraints

  • Symmetric s from max-absolute-value. Don't use exotic strategies (percentile clipping, calibration). That's Phase 26.
  • NumPy only. No bitsandbytes, no torch.quantization. Pure-NumPy round-trip is enough for measurement.
  • Use the seeded RNG for reproducibility.

Stop conditions

Done when:

  1. Two implementations (symmetric, asymmetric) in quant.py.
  2. Three distributions × both quant modes in results.json.
  3. The "top-10 argsort stability" question is answered for the verb-form probability distribution.
  4. error_histogram.png exists.
  5. README.md makes the recommendation: can INT8 alone be trusted for §A13 verb-form prediction? (Answer expected: no, because the distribution is too skewed; Phase 26 will solve this with per-channel scales or log-domain encoding.)

Pitfalls

  • np.round of halves. Default behavior is banker's rounding (round half to even). np.round(0.5) = 0.0, not 1.0. Document this if it surfaces.
  • Asymmetric clipping at boundary. The zero-point computation can go slightly out of [-128, 127] for extreme distributions; clip after computing.
  • np.abs(x - x_dequant) / x divides by zero when an entry is exactly zero (rare for fp32 random data but possible). Mask with |x| > 1e-8.
  • max of a 2D batch tensor. Use the right axis. For per-tensor quantization, reduce over everything (np.abs(x).max() for the symmetric scale). For per-channel (Phase 26), np.abs(x).max(axis=-1, keepdims=True).
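
Three of these pitfalls are easy to see on tiny arrays. The values below are made up for illustration.

```python
import numpy as np

# 1. np.round rounds halves to the nearest even integer.
halves = np.round(np.array([0.5, 1.5, 2.5, -0.5]))  # -> [0., 2., 2., -0.]

# 2. Relative error with a mask, so the exact zero in x is skipped.
x = np.array([0.0, 0.5, -2.0], dtype=np.float32)
x_dequant = np.array([0.0, 0.51, -1.98], dtype=np.float32)
mask = np.abs(x) > 1e-8
rel_err = np.abs((x[mask] - x_dequant[mask]) / x[mask]).mean()

# 3. Per-tensor max versus per-row max on a 2D batch.
batch = np.array([[1.0, -4.0], [0.5, 0.25]], dtype=np.float32)
per_tensor = np.abs(batch).max()                       # one scalar for the whole batch
per_row = np.abs(batch).max(axis=-1, keepdims=True)    # shape (2, 1), one max per row

print(halves, rel_err, per_tensor, per_row.ravel())
```

Keeping keepdims=True lets the per-row maxima broadcast against the batch when dividing by the scale.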

When to consult solutions/

After committing all six files. Solution at solutions/03-quantization-preview-ref.md (written at phase open).

Hint of last resort

If your INT8 round-trip is producing wildly larger errors than expected, you probably forgot to dequantize back to fp32 before comparing — i.e., comparing fp32 x to int8 q directly. The dequant step q.astype(np.float32) * s is essential.
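
The symptom is easy to reproduce on a three-element array (values chosen arbitrarily for the demo): the first comparison mixes fp32 values with raw integer codes, the second compares like with like.

```python
import numpy as np

x = np.array([0.3, -1.0, 0.75], dtype=np.float32)
s = float(np.abs(x).max()) / 127.0
q = np.clip(np.round(x / s), -127, 127).astype(np.int8)

wrong = np.abs(x - q).mean()                          # fp32 values vs raw int8 codes
right = np.abs(x - q.astype(np.float32) * s).mean()   # fp32 values vs dequantized fp32

print(f"without dequantize: {wrong:.3f}")
print(f"with dequantize:    {right:.5f}  (bound s/2 = {s / 2:.5f})")
```

An error on the order of the integer codes themselves (tens to around a hundred) is the tell-tale sign of the missing dequantize step.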


Phase 2 labs complete. Next: /quiz 02, then PHASE_02_REPORT.md, then reflection, then proceed to Phase 3.