01 — Temperature scaling

Temperature is a single scalar that controls how "bold" or "timid" the model is. Low τ: the model clings to its favourite token. High τ: the model draws almost uniformly at random. The mathematical view: τ flattens or sharpens the softmax.

The formula

Given logits \(z \in \mathbb{R}^V\) and temperature \(\tau > 0\):

\[q_i(\tau) = \frac{\exp(z_i / \tau)}{\sum_{j=1}^V \exp(z_j / \tau)}\]

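The division by \(\tau\) happens inside the exponent, so temperature acts on the logits before the softmax. Note: raising the \(\tau = 1\) probabilities to the power \(1/\tau\) and renormalising, \(q_i^{1/\tau} / \sum_j q_j^{1/\tau}\), is algebraically the same distribution. With \(Z_1 = \sum_j \exp(z_j)\), we have \(q_i^{1/\tau} = \exp(z_i/\tau) / Z_1^{1/\tau}\), and the constant \(Z_1^{1/\tau}\) cancels on renormalisation. The two only differ if the renormalisation is skipped (the result no longer sums to 1) or if small probabilities have already underflowed to zero. Stay with the inside form: it works directly on the logits and is the numerically safer of the two.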
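A minimal sketch of the formula above, using only the Python standard library. The function name `softmax_with_temperature` and the max-subtraction trick are illustrative choices, not something the labs prescribe; subtracting the maximum does not change the result because it cancels in the ratio.

```python
import math

def softmax_with_temperature(logits, tau):
    """Return q(tau) for a list of logits; tau must be strictly positive."""
    if tau <= 0:
        raise ValueError("tau must be > 0")
    # Scale the logits first: this is the "inside" form softmax(z / tau).
    scaled = [z / tau for z in logits]
    # Subtract the max before exponentiating to avoid overflow.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for illustration only.
q = softmax_with_temperature([2.0, 1.0, 0.0], tau=0.7)
print(q, sum(q))
```

Lowering `tau` in the call moves mass towards the first entry; raising it spreads mass more evenly.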

The three limits

\(\tau \to 0^+\) — collapse to argmax

As \(\tau\) decreases, \(z_i / \tau\) grows in magnitude. If \(z^* = \max_i z_i\) is the unique max:

\[\lim_{\tau \to 0^+} q_i(\tau) = \begin{cases} 1 & \text{if } z_i = z^* \\ 0 & \text{otherwise} \end{cases}\]

So greedy is the τ → 0 limit of temperature sampling. (When there are ties, the limit is uniform over the tied tokens — a measure-zero edge case in practice.)

\(\tau = 1\) — identity

When \(\tau = 1\), \(q_i = \text{softmax}(z)_i\) — the model's nominal distribution. Use this as the reference.

\(\tau \to \infty\) — uniform

As \(\tau\) increases, \(z_i / \tau \to 0\) for all \(i\). The exponentials all approach 1, the normaliser approaches \(V\), and:

\[\lim_{\tau \to \infty} q_i(\tau) = 1/V \quad \text{for all } i\]

which is the uniform distribution. Sampling in the \(\tau \to \infty\) limit is the same as picking a token uniformly at random.
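The three regimes can be checked numerically. The sketch below uses a hypothetical three-token logit vector with a unique maximum, and stands in for the limits with a very small and a very large \(\tau\); the specific values 0.01 and 1e6 are arbitrary.

```python
import math

def softmax_with_temperature(logits, tau):
    # Inside form: scale logits by 1/tau, then a max-shifted softmax.
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits, unique max at index 0

cold = softmax_with_temperature(logits, 0.01)    # approximates tau -> 0+
nominal = softmax_with_temperature(logits, 1.0)  # plain softmax(z)
hot = softmax_with_temperature(logits, 1e6)      # approximates tau -> infinity

for name, q in [("cold", cold), ("nominal", nominal), ("hot", hot)]:
    print(name, [round(p, 6) for p in q])
```

The cold distribution should be close to one-hot on the argmax and the hot one close to \(1/V\) in every entry.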

Entropy as a function of \(\tau\)

The entropy of \(q(\tau)\) is non-decreasing in \(\tau\):

\[\frac{\partial H(q(\tau))}{\partial \tau} \ge 0\]

with equality only when the logits are constant (already uniform).

Proof sketch

Compute \(\partial H / \partial \tau\). Using \(H(q) = -\sum_i q_i \log q_i\) and \(q_i = \exp(z_i/\tau) / Z\) where \(Z = \sum_j \exp(z_j/\tau)\):

\[\frac{\partial q_i}{\partial \tau} = -\frac{z_i}{\tau^2} q_i + \frac{q_i}{\tau^2} \langle z \rangle = -\frac{1}{\tau^2}(z_i - \langle z \rangle) q_i\]

where \(\langle z \rangle = \sum_j q_j z_j = \mathbb{E}_q[z]\). Then:

\[\frac{\partial H}{\partial \tau} = -\sum_i \frac{\partial q_i}{\partial \tau} (1 + \log q_i) = \frac{1}{\tau^2}\sum_i (z_i - \langle z \rangle) q_i (1 + \log q_i)\]

Use \(\log q_i = z_i/\tau - \log Z\) and simplify; the constant terms drop because \(\sum_i (z_i - \langle z \rangle) q_i = 0\). The remaining term is \(\frac{1}{\tau^3} \text{Var}_q[z] \ge 0\).
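Written out, the simplification step looks like this (same symbols as above, with \(\langle z^2 \rangle = \sum_i q_i z_i^2\) as the only new shorthand):

\[\frac{\partial H}{\partial \tau} = \frac{1 - \log Z}{\tau^2}\sum_i (z_i - \langle z \rangle) q_i + \frac{1}{\tau^3}\sum_i (z_i - \langle z \rangle) q_i z_i = 0 + \frac{1}{\tau^3}\left(\langle z^2 \rangle - \langle z \rangle^2\right) = \frac{1}{\tau^3}\text{Var}_q[z]\]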

So \(\partial H / \partial \tau \ge 0\) with equality iff all \(z_i\) are equal.

(Lab 01 verifies this empirically by sweeping \(\tau\) on a synthetic logit vector and plotting \(H(q(\tau))\).)
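A rough sketch of that kind of sweep, not the lab's actual code. It uses a hypothetical four-token logit vector, checks that the entropy rises along a grid of \(\tau\) values, and compares a central finite difference of \(H\) at \(\tau = 1\) against the closed form \(\text{Var}_q[z] / \tau^3\). All function names here are illustrative.

```python
import math

def softmax_with_temperature(logits, tau):
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(q):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in q if p > 0.0)

def variance_under(q, values):
    mean = sum(p * v for p, v in zip(q, values))
    return sum(p * (v - mean) ** 2 for p, v in zip(q, values))

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical synthetic logits
taus = [0.25, 0.5, 1.0, 2.0, 4.0]
entropies = [entropy(softmax_with_temperature(logits, t)) for t in taus]

# Central finite difference of H at tau = 1 versus Var_q[z] / tau^3.
tau, h = 1.0, 1e-5
numeric = (entropy(softmax_with_temperature(logits, tau + h))
           - entropy(softmax_with_temperature(logits, tau - h))) / (2 * h)
analytic = variance_under(softmax_with_temperature(logits, tau), logits) / tau ** 3

for t, H in zip(taus, entropies):
    print(f"tau={t:<5} H={H:.4f}")
print("numeric dH/dtau:", numeric, "analytic:", analytic)
```

The entropies should increase down the table and stay below \(\log V\), and the two derivative estimates should agree to several decimal places.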

A note on calibration

A perfectly calibrated model at \(\tau = 1\) has expressed confidence equal to its empirical accuracy. An over-confident model (typical for neural networks; see Phase 05 lab 03) puts too much mass on its top guess. Raising \(\tau\) slightly (e.g., \(\tau = 1.1\)) flattens the distribution, which sometimes recovers calibration empirically and adds diversity as a side effect; lowering \(\tau\) slightly (e.g., \(\tau = 0.9\)) sharpens it, which can help an under-confident model.

The operation is the same as the temperature scaling used for calibration in Phase 05, but the purpose differs. There the temperature is fit on a held-out set to minimise validation cross-entropy (CE). Here it is a runtime knob for diversity. Same math, different motivation. Don't confuse them.

What about repetition?

A long-standing issue with autoregressive sampling: the model can fall into repetitive loops (I work. I work. I work...). A common fix is the repetition penalty: down-weight logits for tokens already in the generated sequence:

\[z'_i = \begin{cases} z_i / \rho & \text{if } i \in \text{generated\_so\_far and } z_i > 0 \\ z_i \cdot \rho & \text{if } i \in \text{generated\_so\_far and } z_i \le 0 \\ z_i & \text{otherwise} \end{cases}\]

with \(\rho > 1\) (typically \(\rho = 1.1\) to \(1.3\)).

For our verb corpus, repetition is less of a problem because the sequences are short. We don't implement the repetition penalty in lab; we mention it in theory only.
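For reference only, since the lab does not implement it, the piecewise rule above can be sketched in a few lines. The function name `apply_repetition_penalty` and the default `rho=1.2` are assumptions made for this example; token ids are taken to be list indices into the logit vector.

```python
def apply_repetition_penalty(logits, generated_ids, rho=1.2):
    """Return a copy of logits with already-generated tokens down-weighted."""
    if rho <= 1.0:
        raise ValueError("rho is expected to be > 1")
    penalised = list(logits)
    # Each token is penalised once, however many times it was generated.
    for i in set(generated_ids):
        z = penalised[i]
        # Positive logits shrink towards zero; non-positive ones move further down.
        penalised[i] = z / rho if z > 0 else z * rho
    return penalised

# Hypothetical example: tokens 0 and 1 have already been generated.
print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1, 0], rho=2.0))
```

Both branches lower the logit, which is why the rule splits on the sign of \(z_i\): dividing a negative logit by \(\rho\) would raise it instead.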

A pitfall: temperature applied to the wrong stage

Three places one might apply \(\tau\):

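  1. softmax(z / τ): the recommended form. Temperature scales the logits before the softmax.
  2. softmax(z) ** (1/τ), then renormalise: algebraically identical to (1), because the power turns exp(z_i) into exp(z_i / τ) and the renormalisation absorbs the leftover constant. It goes wrong when the renormalisation is skipped (the result does not sum to 1) or when small probabilities have already underflowed to zero.
  3. Sampling from softmax(z) and rejecting with prob τ: nonsensical, but you'd be surprised.

Prefer (1). Lab 00 will have you compute the entropy at \(\tau = 2\) via both (1) and (2) on the same logits and compare them. With the renormalisation in place the two agree up to floating-point error; without it, (2) is not a valid distribution.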
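A quick numerical comparison of forms (1) and (2), assuming "normalise" in (2) means dividing by the sum. Since \(\text{softmax}(z)_i^{1/\tau}\) equals \(\exp(z_i/\tau)\) up to a constant factor that cancels on renormalising, the two entropies should match up to floating-point error; the unnormalised power is the variant that fails to be a distribution. The logits and helper names are hypothetical, and this is not the lab's code.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(q):
    return -sum(p * math.log(p) for p in q if p > 0.0)

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits
tau = 2.0

# Form (1): scale the logits, then softmax.
q_inside = softmax([z / tau for z in logits])

# Form (2): softmax first, raise to 1/tau, then renormalise.
powered = [p ** (1.0 / tau) for p in softmax(logits)]
power_sum = sum(powered)  # not 1 in general: the unnormalised form is not a distribution
q_outside = [p / power_sum for p in powered]

print("H via (1):", entropy(q_inside))
print("H via (2):", entropy(q_outside))
print("sum before renormalising:", power_sum)
```

If the two entropies disagree in your own implementation, the usual culprit is a missing renormalisation or probabilities that were truncated or rounded before the power was applied.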

What this file does NOT cover

  • Top-k / top-p. Next file.
  • Calibration via temperature as a learned parameter. Phase 05 lab 03 covers that.
  • Adaptive temperature schedules. Some samplers vary τ as a function of position; out of scope.

Next: 02-top-k-and-top-p.md