Lab 01: Temperature sweep
Temperature flattens or sharpens the softmax. Here we measure it: for a logit distribution from the Mini-GPT, sweep τ and plot the entropy. The curve must be monotonically non-decreasing. If it is not, there is a bug.
Objective
Implement the Temperature(τ) sampling strategy, and produce an empirical verification of the entropy monotonicity property derived in theory/01-temperature.md.
Setup
- `Greedy` from lab 00.
- Trained Mini-GPT checkpoint.
- A reference prompt: `"Tomorrow she"` (3rd-person singular, future tense; interesting because the model has multiple plausible continuations).
Tasks
- Implement `Temperature(tau)` in `src/minimodel/sampling.py`:

      from dataclasses import dataclass

      import numpy as np

      # softmax: the numerically stable implementation from Phase 05.

      @dataclass(frozen=True)
      class Temperature:
          tau: float

          def __call__(self, logits, rng):
              assert self.tau > 0, "temperature must be positive"
              scaled = np.asarray(logits, dtype=np.float64) / self.tau
              probs = softmax(scaled)  # use log-sum-exp from Phase 05
              return int(rng.choice(len(probs), p=probs))

  Use the numerically stable softmax from Phase 05 (`e = np.exp(z - z.max())`, then `e / e.sum()`).
- Edge case: `tau → 0`. Decide how to handle it. Options:
  - Raise (forbid).
  - Treat as greedy (`tau == 0` → argmax, no rng).
  - Cap at a small epsilon (e.g., `tau = max(tau, 1e-6)`).

  Document your choice in the docstring. Every τ used in this lab is strictly positive (the smallest, in the sweep, is 0.1), so none of them hit zero.
- Sweep τ over a logit vector. Don't use the Mini-GPT yet; use a synthetic 8-vocab logit vector `z = [3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0]`. For each `τ ∈ {0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 5.0, 10.0}`:
  - Compute `q = softmax(z / τ)`.
  - Compute `H(q) = -Σ q_i log q_i` (use `log2` for entropy in bits).
  - Plot `(τ, H(q))`.

  Assert: the resulting curve is monotone non-decreasing in τ (use `np.all(np.diff(H) >= -1e-9)` with `H` ordered by increasing τ; the `-1e-9` accounts for floating-point noise).
- Now use the Mini-GPT. For the prompt `"Tomorrow she"`:
  - Get the first-step logits (one forward pass).
  - Repeat the sweep on those logits.
  - Plot.
- Compare to the synthetic case. The real-model curve will start lower (real logits often have a sharp peak) and rise smoothly.
- Empirical diversity check. For each `τ ∈ {0.5, 1.0, 2.0}`, draw 100 samples (with different seeds) of length 1 from the Mini-GPT first-step distribution. Count the number of unique tokens:
  - `τ = 0.5`: expect 1-3 unique tokens.
  - `τ = 1.0`: expect 5-15 unique tokens.
  - `τ = 2.0`: expect 20+ unique tokens.

  Plot `(τ, num_unique_tokens)` as a bar chart.
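The synthetic sweep from the tasks above can be sketched as follows. This is a minimal illustration, not the lab's reference solution: `stable_softmax`, `entropy_bits`, and `entropy_sweep` are hypothetical helper names (the lab expects you to reuse the Phase 05 softmax), and plotting is left out so the block runs with NumPy alone.

```python
import numpy as np


def stable_softmax(z):
    """Softmax with the max subtracted before exponentiating."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()


def entropy_bits(q):
    """Shannon entropy in bits; zero-probability entries contribute 0."""
    q = np.asarray(q, dtype=np.float64)
    nz = q > 0
    return float(-(q[nz] * np.log2(q[nz])).sum())


def entropy_sweep(z, taus):
    """Return H(softmax(z / tau)) for each tau, in the order given."""
    z = np.asarray(z, dtype=np.float64)
    return np.array([entropy_bits(stable_softmax(z / tau)) for tau in taus])


z = [3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0]
taus = [0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 5.0, 10.0]  # sorted ascending
H = entropy_sweep(z, taus)

# Monotone non-decreasing in tau, up to floating-point noise.
assert np.all(np.diff(H) >= -1e-9)
# An 8-way distribution can never exceed log2(8) = 3 bits.
assert H.max() <= 3.0 + 1e-9
```

The same `entropy_sweep` call should work unchanged on the Mini-GPT first-step logits once you have them as a 1-D array; only the input vector differs.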
Measurements
Save to experiments/<date>-phase-21-temp-sweep/:
- `entropy_curve_synthetic.png`: `(τ, H)` on the synthetic logits.
- `entropy_curve_model.png`: `(τ, H)` on the Mini-GPT first-step logits.
- `unique_tokens_by_tau.png`: bar chart from the empirical diversity check.
- `manifest.json`: seeds, versions, exact `τ` values.
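One possible shape for `manifest.json` is sketched below. The field names (`seeds`, `taus`, `versions`) and the `write_manifest` helper are assumptions for illustration, as is using today's ISO date for the `<date>` part of the directory name; adapt them to whatever your earlier phases recorded.

```python
import json
import platform
from datetime import date
from pathlib import Path

import numpy as np


def write_manifest(out_dir, seeds, taus):
    """Write manifest.json with seeds, versions, and the exact tau values."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "seeds": [int(s) for s in seeds],
        "taus": [float(t) for t in taus],
        "versions": {
            "python": platform.python_version(),
            "numpy": np.__version__,
        },
    }
    path = out_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path


# Assumed naming: today's ISO date stands in for <date>.
run_dir = Path("experiments") / f"{date.today().isoformat()}-phase-21-temp-sweep"
# Example call (not executed here):
# write_manifest(run_dir, seeds=range(100), taus=[0.5, 1.0, 2.0])
```

Storing the τ values as plain floats keeps the manifest readable and lets you reload the exact sweep later.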
Acceptance
- `Temperature(tau)` is deterministic under fixed seed (`rng.choice` consumes the rng's state predictably).
- The entropy curves are monotone non-decreasing within `1e-9` tolerance.
- Unique-token count strictly increases with `τ` on this prompt (or weakly, with at most one equality).
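The determinism and unique-token criteria can be checked along these lines. This is a sketch under stated assumptions: `TemperatureOrGreedy` is a hypothetical variant that picks the "treat `tau == 0` as greedy" option for the edge case, `draw` and `unique_tokens_by_tau` are illustrative helpers, and the synthetic logits stand in for the Mini-GPT first-step logits.

```python
from dataclasses import dataclass

import numpy as np


def stable_softmax(z):
    """Softmax with the max subtracted before exponentiating."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()


@dataclass(frozen=True)
class TemperatureOrGreedy:
    """Temperature sampling; tau == 0 means greedy argmax, tau < 0 is rejected."""

    tau: float

    def __post_init__(self):
        if self.tau < 0:
            raise ValueError("temperature must be non-negative")

    def __call__(self, logits, rng):
        logits = np.asarray(logits, dtype=np.float64)
        if self.tau == 0:
            return int(np.argmax(logits))  # greedy: rng is not consumed
        probs = stable_softmax(logits / self.tau)
        return int(rng.choice(len(probs), p=probs))


def draw(sampler, logits, seeds):
    """One length-1 sample per seed, each from a freshly seeded generator."""
    return [sampler(logits, np.random.default_rng(s)) for s in seeds]


def unique_tokens_by_tau(logits, taus, n_samples=100):
    """Map each tau to the number of distinct tokens seen in n_samples draws."""
    seeds = range(n_samples)
    return {
        tau: len(set(draw(TemperatureOrGreedy(tau), logits, seeds)))
        for tau in taus
    }


# Stand-in logits; in the lab these would come from the Mini-GPT.
logits = [3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0]

# Determinism: the same seeds must reproduce the same samples.
sampler = TemperatureOrGreedy(1.0)
assert draw(sampler, logits, range(20)) == draw(sampler, logits, range(20))

counts = unique_tokens_by_tau(logits, taus=[0.5, 1.0, 2.0])
```

With only 8 synthetic tokens the counts will not match the ranges quoted for the real vocabulary; the point is the shape of the check, which you can then run on the model logits.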
Pitfalls
- Computing `softmax(z) ** (1/τ)` instead of `softmax(z / τ)`. These are different vectors (for τ ≠ 1 the power form in general does not even sum to 1); `theory/01-temperature.md` has the proof. A lab that silently passes while computing the wrong thing is exactly the bug we want to catch, so add an explicit unit test asserting that your implementation matches `softmax(z / τ)` (not the post-softmax power form).
- Casting `rng.choice` output to `int` (or not). `rng.choice(n, p=p)` returns a `numpy.int64`; if your tokenizer code expects a Python `int`, cast it.
- Log of zero. If `q_i = 0` exactly, `q_i log q_i` is `0 · -inf = nan` in IEEE. Use `np.where(q > 0, q * np.log2(q), 0.0)` for entropy.
- Choosing the wrong reference prompt. If you pick a prompt where the Mini-GPT is extremely confident (e.g., the model has overfit and gives `q_{top} = 0.999`), all temperatures will look similar until `τ > 5`. Pick a prompt where the model has visible uncertainty.
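The first and third pitfalls lend themselves to small unit tests. The sketch below assumes a hypothetical `temperature_probs` helper that exposes the probabilities your sampler draws from; if your implementation does not expose them, adapt the test to whatever it does provide. Note that `np.where` still evaluates `np.log2(0)` for the masked entries, so the warnings are silenced explicitly.

```python
import numpy as np


def stable_softmax(z):
    """Softmax with the max subtracted before exponentiating."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()


def temperature_probs(logits, tau):
    """Hypothetical helper: the distribution a Temperature(tau) sampler uses."""
    return stable_softmax(np.asarray(logits, dtype=np.float64) / tau)


def entropy_bits(q):
    """Entropy in bits using the np.where guard against log of zero."""
    q = np.asarray(q, dtype=np.float64)
    # log2(0) is still computed for masked entries; suppress its warnings.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(q > 0, q * np.log2(q), 0.0)
    return float(-terms.sum())


def test_matches_scaled_logits_not_power_form():
    z = np.array([3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0])
    tau = 0.5
    # Independent reference: naive softmax of the scaled logits.
    reference = np.exp(z / tau) / np.exp(z / tau).sum()
    power_form = stable_softmax(z) ** (1.0 / tau)

    assert np.allclose(temperature_probs(z, tau), reference)
    # The power form is not normalized for tau != 1, so it cannot match.
    assert not np.isclose(power_form.sum(), 1.0)
    assert not np.allclose(temperature_probs(z, tau), power_form)


def test_entropy_handles_exact_zeros():
    assert entropy_bits([1.0, 0.0, 0.0]) == 0.0
    assert np.isclose(entropy_bits([0.5, 0.5, 0.0]), 1.0)


test_matches_scaled_logits_not_power_form()
test_entropy_handles_exact_zeros()
```

The functions follow the `test_*` naming convention, so a runner such as pytest would pick them up if you move them into your test suite.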
Next: 02-top-k-and-top-p.md