01: Temperature scaling
Temperature is a single scalar that controls how "bold" or "timid" the model is. Low τ: the model clings to its favourite token. High τ: the model draws almost uniformly at random. The mathematical side: τ flattens or sharpens the softmax.
The formula
Given logits \(z \in \mathbb{R}^V\) and temperature \(\tau > 0\):
\[ q_i(\tau) = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{V} \exp(z_j / \tau)} \]
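Each logit is divided by \(\tau\) inside the exponent. Note: raising the \(\tau = 1\) probabilities \(p = \text{softmax}(z)\) to the power \(1/\tau\) and renormalising, \(p_i^{1/\tau} / \sum_j p_j^{1/\tau}\), gives the same distribution, because the \(\tau = 1\) normaliser cancels in the ratio. Applying the power without renormalising does not: the result no longer sums to 1. Stay with the inside form, which works directly on the logits.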
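A minimal sketch of the inside form, assuming NumPy and a made-up logit vector. The function name `softmax_with_temperature` is hypothetical, not something the labs define; the max-shift is a standard stability trick and does not change the result, because softmax is invariant to adding a constant to every logit.

```python
import numpy as np

def softmax_with_temperature(logits, tau=1.0):
    """Return softmax(logits / tau) as a 1-D probability vector."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = np.asarray(logits, dtype=np.float64) / tau
    scaled -= scaled.max()      # shift so the largest exponent is 0 (avoids overflow)
    weights = np.exp(scaled)
    return weights / weights.sum()

z = np.array([2.0, 1.0, 0.0, -1.0])  # made-up logits, for illustration only
for tau in (0.5, 1.0, 2.0):
    print(tau, np.round(softmax_with_temperature(z, tau), 3))
```

Printing the three rows side by side makes the effect visible: the top token's share shrinks as \(\tau\) grows.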
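The three limits
\(\tau \to 0^+\): collapse to argmax
As \(\tau\) decreases, \(z_i / \tau\) grows in magnitude, and so do the gaps between logits. If \(z^* = \max_i z_i\) is the unique max:
\[ \lim_{\tau \to 0^+} q_i(\tau) = \begin{cases} 1 & \text{if } z_i = z^* \\ 0 & \text{otherwise} \end{cases} \]
So greedy decoding is the \(\tau \to 0^+\) limit of temperature sampling. (When there are ties, the limit is uniform over the tied tokens, a measure-zero edge case in practice.)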
\(\tau = 1\): identity
When \(\tau = 1\), \(q_i = \text{softmax}(z)_i\), the model's nominal distribution. Use this as the reference.
\(\tau \to \infty\): uniform
As \(\tau\) increases, \(z_i / \tau \to 0\) for all \(i\). The exponentials all approach 1, the normaliser approaches \(V\), and:
\[ q_i(\tau) \to \frac{1}{V} \]
That is a uniform distribution. Sampling in the \(\tau \to \infty\) limit is the same as picking a token uniformly at random.
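The three regimes can be checked numerically. The sketch below assumes NumPy and a made-up logit vector with a unique maximum; very small and very large \(\tau\) stand in for the two limits, and the helper name is hypothetical.

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    """Numerically stable softmax(logits / tau)."""
    scaled = np.asarray(logits, dtype=np.float64) / tau
    scaled -= scaled.max()
    weights = np.exp(scaled)
    return weights / weights.sum()

z = np.array([3.0, 1.0, 0.5, -2.0])  # made-up logits with a unique maximum

q_cold = softmax_with_temperature(z, 1e-3)  # stand-in for tau -> 0+
q_ref = softmax_with_temperature(z, 1.0)    # the model's nominal distribution
q_hot = softmax_with_temperature(z, 1e6)    # stand-in for tau -> infinity

print("tau=1e-3:", np.round(q_cold, 4))
print("tau=1   :", np.round(q_ref, 4))
print("tau=1e6 :", np.round(q_hot, 4))
```

The cold row should put essentially all mass on the argmax, and the hot row should sit near \(1/V\) for every token.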
Entropy as a function of \(\tau\)
The entropy of \(q(\tau)\) is non-decreasing in \(\tau\):
\[ \frac{\partial H(q(\tau))}{\partial \tau} \ge 0 \]
with equality only when the logits are constant (already uniform).
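Proof sketch
Compute \(\partial H / \partial \tau\). Using \(H(q) = -\sum_i q_i \log q_i\) and \(q_i = \exp(z_i/\tau) / Z\) where \(Z = \sum_j \exp(z_j/\tau)\):
\[ \frac{\partial q_i}{\partial \tau} = -\frac{q_i}{\tau^2} \left( z_i - \langle z \rangle \right) \]
where \(\langle z \rangle = \sum_j q_j z_j = \mathbb{E}_q[z]\). Then:
\[ \frac{\partial H}{\partial \tau} = -\sum_i \frac{\partial q_i}{\partial \tau} \left( \log q_i + 1 \right) = \frac{1}{\tau^2} \sum_i q_i \left( z_i - \langle z \rangle \right) \left( \log q_i + 1 \right) \]
Use \(\log q_i = z_i/\tau - \log Z\) and simplify; the constant terms (\(-\log Z\) and \(+1\)) drop because \(\sum_i (z_i - \langle z \rangle) q_i = 0\). The remaining term is \(\frac{1}{\tau^3} \sum_i q_i (z_i - \langle z \rangle) z_i = \frac{1}{\tau^3} \text{Var}_q[z] \ge 0\).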
So \(\partial H / \partial \tau \ge 0\) with equality iff all \(z_i\) are equal.
(Lab 01 verifies this empirically by sweeping \(\tau\) on a synthetic logit vector and plotting \(H(q(\tau))\).)
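A sketch of such a sweep, assuming NumPy and a hypothetical synthetic logit vector (this is not the lab's code). It checks that \(H(q(\tau))\) never decreases along the grid and compares a central finite difference of \(H\) with the closed form \(\text{Var}_q[z] / \tau^3\) at one temperature.

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    """Numerically stable softmax(logits / tau)."""
    scaled = np.asarray(logits, dtype=np.float64) / tau
    scaled -= scaled.max()
    weights = np.exp(scaled)
    return weights / weights.sum()

def entropy(q):
    """Shannon entropy in nats; terms with q_i = 0 contribute 0."""
    q = q[q > 0]
    return float(-(q * np.log(q)).sum())

z = np.array([2.0, 0.5, 0.0, -1.0, -3.0])  # hypothetical synthetic logits
taus = np.linspace(0.1, 10.0, 200)
H = np.array([entropy(softmax_with_temperature(z, t)) for t in taus])

# Entropy should not decrease along the grid, and is capped by log V.
is_monotone = bool(np.all(np.diff(H) >= -1e-12))
print("monotone:", is_monotone, "| max H:", H.max(), "| log V:", np.log(len(z)))

# Central finite difference of H versus the closed form Var_q[z] / tau^3.
tau, eps = 1.5, 1e-5
q = softmax_with_temperature(z, tau)
var_z = float((q * z**2).sum() - (q * z).sum() ** 2)
numeric = (entropy(softmax_with_temperature(z, tau + eps))
           - entropy(softmax_with_temperature(z, tau - eps))) / (2 * eps)
closed_form = var_z / tau**3
print("finite difference:", numeric, "| closed form:", closed_form)
```

Plotting `H` against `taus` (for example with matplotlib) gives the curve the lab asks for; the plot is left out here to keep the sketch dependency-free.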
A note on calibration
A perfectly calibrated model with \(\tau = 1\) has expressed confidence equal to its empirical accuracy. An over-confident model (typical for neural networks; see Phase 05 lab 03) puts too much mass on its top guess. Raising \(\tau\) slightly (e.g., \(\tau = 1.1\)) flattens the distribution, which sometimes recovers calibration empirically and also adds diversity; an under-confident model calls for the opposite, lowering \(\tau\) slightly (e.g., \(\tau = 0.9\)) to sharpen it.
This is not the same use of temperature as the temperature scaling used for calibration in Phase 05. There the temperature is fit on a held-out set to minimise validation cross-entropy (CE). Here it is a runtime knob for diversity. Same math, different motivation. Don't confuse them.
What about repetition?
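A long-standing issue with autoregressive sampling: the model can fall into repetitive loops (I work. I work. I work...). A common fix is the repetition penalty, which down-weights the logits of tokens already in the generated sequence:
\[ \tilde{z}_i = \begin{cases} z_i / \rho & \text{if token } i \text{ already appears in the generated sequence} \\ z_i & \text{otherwise} \end{cases} \]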
with \(\rho > 1\) (typically \(\rho = 1.1\) to \(1.3\)).
For our verb corpus, repetition is less of a problem because the sequences are short. We don't implement the repetition penalty in lab; we mention it in theory only.
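For reference only, here is a minimal sketch of a repetition penalty, assuming NumPy. Everything in it is illustrative: the function name `apply_repetition_penalty` and its arguments are hypothetical, and the sign handling is an assumption on our part. Dividing by \(\rho > 1\) only lowers a positive logit, so this sketch multiplies negative logits by \(\rho\) instead, a sign-aware variant that some implementations use.

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, rho=1.2):
    """Return a copy of logits with already-generated tokens made less likely."""
    if rho <= 0:
        raise ValueError("rho must be positive")
    out = np.array(logits, dtype=np.float64)  # copy, so the input is not modified
    seen = np.unique(np.asarray(generated_ids, dtype=np.int64))
    if seen.size == 0:
        return out
    vals = out[seen]
    # Assumption: sign-aware rule. Positive logits are divided by rho,
    # negative logits are multiplied by rho, so both move downwards.
    out[seen] = np.where(vals > 0, vals / rho, vals * rho)
    return out

z = np.array([2.0, 1.0, -0.5, 0.0])  # made-up logits
penalised = apply_repetition_penalty(z, generated_ids=[0, 2], rho=1.2)
print(penalised)
```

The penalised logits would then go through the usual temperature softmax; tokens that have not been generated yet keep their original logits.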
A pitfall: temperature applied to the wrong stage
Three places one might apply \(\tau\):
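1. `softmax(z / τ)`: the correct form. Temperature scales the logits before softmax.
2. `softmax(z) ** (1/τ)`, then renormalise: algebraically the same distribution as (1), because the \(\tau = 1\) normaliser cancels in the ratio. The trap is skipping the renormalisation, which leaves a vector that does not sum to 1.
3. Sampling from `softmax(z)` and rejecting with prob τ: nonsensical, but you'd be surprised.
Use (1): it works directly on the logits and needs no second normalisation. Lab 00 will have you compute the entropy at \(\tau = 2\) via both (1) and (2) on the same logits; they agree up to floating-point error when (2) is renormalised, and differ only when that step is skipped.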
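The comparison between forms (1) and (2) can be sketched as follows, assuming NumPy and made-up logits (the lab's own data and helper names may differ). It computes the entropy at \(\tau = 2\) for the inside form and for the power form after renormalising, and also reports the total mass of the power form before renormalising, so the three can be inspected side by side.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    x = np.asarray(x, dtype=np.float64)
    w = np.exp(x - x.max())
    return w / w.sum()

def entropy(q):
    """Shannon entropy in nats; terms with q_i = 0 contribute 0."""
    q = q[q > 0]
    return float(-(q * np.log(q)).sum())

z = np.array([2.0, 1.0, 0.0, -1.0])  # made-up logits
tau = 2.0

q_inside = softmax(z / tau)            # form (1): scale logits, then softmax
powered = softmax(z) ** (1.0 / tau)    # form (2), before renormalising
q_outside = powered / powered.sum()    # form (2), after renormalising

print("H, form (1):               ", entropy(q_inside))
print("H, form (2), renormalised: ", entropy(q_outside))
print("mass of form (2) before renormalising:", powered.sum())
```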
What this file does NOT cover
- Top-k / top-p. Next file.
- Calibration via temperature as a learned parameter. Phase 05 lab 03 covers that.
- Adaptive temperature schedules. Some samplers vary τ as a function of position; out of scope.
Next: 02-top-k-and-top-p.md