Lab 01 — Temperature sweep

Temperature flattens or sharpens the softmax. Here we measure it: for a distribution of Mini-GPT logits, sweep τ and plot the entropy. The curve must be monotonically non-decreasing. If it is not, there is a bug.

Objective

Implement the Temperature(τ) sampling strategy, and produce an empirical verification of the entropy monotonicity property derived in theory/01-temperature.md.

Setup

  • Greedy from lab 00.
  • Trained Mini-GPT checkpoint.
  • A reference prompt: "Tomorrow she" (3rd-person singular, future tense — interesting because the model has multiple plausible continuations).

Tasks

  1. Implement Temperature(tau) in src/minimodel/sampling.py:
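from dataclasses import dataclass

import numpy as np

# softmax: the numerically stable implementation from Phase 05 (import it here)

@dataclass(frozen=True)
class Temperature:
    tau: float

    def __call__(self, logits, rng):
        # rng is a NumPy Generator, e.g. np.random.default_rng(seed)
        assert self.tau > 0, "temperature must be positive"
        scaled = np.asarray(logits) / self.tau         # scale the logits before the softmax
        probs = softmax(scaled)                        # stable softmax from Phase 05
        return int(rng.choice(len(probs), p=probs))    # cast numpy.int64 to a Python int

Use the numerically stable softmax from Phase 05: subtract z.max() before exponentiating, then divide by the sum, i.e. e = np.exp(z - z.max()); e / e.sum().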
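
The Phase 05 softmax is not reproduced in this lab. As a minimal sketch of what the stable form might look like (the function name `softmax` and its 1-D signature are assumptions here, not the Phase 05 code):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D logit vector (illustrative sketch)."""
    z = np.asarray(z, dtype=np.float64)
    shifted = z - z.max()      # largest exponent becomes 0, so exp cannot overflow
    e = np.exp(shifted)
    return e / e.sum()

# Temperature scaling is applied to the logits, before the softmax.
logits = np.array([1000.0, 999.0, 998.0])   # a naive exp(1000) would overflow to inf
probs = softmax(logits / 0.5)
```

Subtracting the maximum leaves the result unchanged mathematically, because the common factor cancels between numerator and denominator.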

  2. Edge case: tau → 0. Decide how to handle it. Options:
       • Raise (forbid).
       • Treat as greedy (tau == 0 → argmax, no rng).
       • Cap at a small epsilon (e.g., tau = max(tau, 1e-6)).

Document your choice in the docstring. Every τ used in this lab is strictly positive (the smallest is 0.1, in the synthetic sweep), so none of them hits zero.
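
To make the "treat as greedy" option concrete, here is one possible sketch. The class name `TemperatureOrGreedy` is hypothetical, the softmax is inlined rather than imported from Phase 05, and moving the check into `__post_init__` is a design choice of this sketch, not a requirement of the lab:

```python
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class TemperatureOrGreedy:
    """Hypothetical variant: tau == 0 means greedy (argmax); tau < 0 is rejected."""
    tau: float

    def __post_init__(self):
        if self.tau < 0:
            raise ValueError("temperature must be non-negative")

    def __call__(self, logits, rng):
        logits = np.asarray(logits, dtype=np.float64)
        if self.tau == 0:
            return int(np.argmax(logits))        # deterministic, rng is not consumed
        scaled = logits / self.tau
        e = np.exp(scaled - scaled.max())        # stable softmax, inlined
        probs = e / e.sum()
        return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
z = np.array([3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0])
greedy_token = TemperatureOrGreedy(0.0)(z, rng)  # index of the largest logit
```

Validating at construction time surfaces a bad τ when the strategy is built instead of at the first sampling call. The other two options (raise on zero, or clamp to an epsilon) change only the `tau == 0` branch.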

  3. Sweep τ over a logit vector. Don't use the Mini-GPT yet; use a synthetic 8-vocab logit vector z = [3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0]. For each τ ∈ {0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 5.0, 10.0}:
       • Compute q = softmax(z / τ).
       • Compute H(q) = -Σ q_i log q_i (use log2 for entropy in bits).
       • Plot (τ, H(q)).

Assert: the resulting curve is monotone non-decreasing in τ (use np.all(np.diff(H) >= -1e-9); the -1e-9 accounts for floating-point noise).
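
The sweep and its assertion can be sketched as follows. The helper names `softmax` and `entropy_bits` are illustrative, and plotting is left out so the block runs with NumPy alone:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy_bits(q):
    # Treat 0 * log2(0) as 0; errstate silences the warnings np.where cannot avoid.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(q > 0, q * np.log2(q), 0.0)
    return float(-terms.sum())

z = np.array([3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0])
taus = np.array([0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 5.0, 10.0])  # must be ascending
H = np.array([entropy_bits(softmax(z / t)) for t in taus])

# np.diff compares neighbours, so the check is only meaningful for sorted taus.
assert np.all(np.diff(H) >= -1e-9), "entropy is not monotone in tau"
assert H[-1] <= np.log2(len(z)) + 1e-9   # entropy cannot exceed log2(8) = 3 bits
```

The upper bound is a cheap extra sanity check: as τ grows the distribution approaches uniform, whose entropy over 8 tokens is 3 bits.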

  4. Now use the Mini-GPT. For the prompt "Tomorrow she":
       • Get the first-step logits (one forward pass).
       • Repeat the sweep on those logits.
       • Plot.
       • Compare to the synthetic case. The real-model curve will start lower (real logits often have a sharp peak) and rise smoothly.

  5. Empirical diversity check. For each τ ∈ {0.5, 1.0, 2.0}, draw 100 samples (with different seeds) of length 1 from the Mini-GPT first-step distribution. Count the number of unique tokens:
       • τ = 0.5: expect 1-3 unique tokens.
       • τ = 1.0: expect 5-15 unique tokens.
       • τ = 2.0: expect 20+ unique tokens.

Plot (τ, num_unique_tokens) as a bar chart.
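
A possible shape for the counting step is sketched below. It uses the synthetic 8-token logits as a stand-in, so the expected ranges above (which assume the Mini-GPT vocabulary) do not apply to its output. `count_unique` and `base_seed` are hypothetical names:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def count_unique(logits, tau, n_samples=100, base_seed=0):
    """Draw n_samples single tokens, one fresh seed per draw, and count distinct ids."""
    probs = softmax(np.asarray(logits, dtype=np.float64) / tau)
    tokens = set()
    for i in range(n_samples):
        rng = np.random.default_rng(base_seed + i)   # a different seed for every sample
        tokens.add(int(rng.choice(len(probs), p=probs)))
    return len(tokens)

# Stand-in logits; swap in the Mini-GPT first-step logits for the real check.
z = np.array([3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0])
unique_by_tau = {tau: count_unique(z, tau) for tau in (0.5, 1.0, 2.0)}
```

The resulting dictionary maps each τ to its count and can be passed straight to a bar-chart call in the plotting library of your choice.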

Measurements

Save to experiments/<date>-phase-21-temp-sweep/:

  • entropy_curve_synthetic.png: (τ, H) on the synthetic logits.
  • entropy_curve_model.png: (τ, H) on the Mini-GPT first-step logits.
  • unique_tokens_by_tau.png: bar chart from task 5.
  • manifest.json: seeds, versions, exact τ values.
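
The lab does not fix a schema for manifest.json. One way to write it is sketched here; the key names (`taus`, `seeds`, `versions`) and the helper `write_manifest` are assumptions, and the example writes into a temporary directory instead of the real experiments/ path:

```python
import json
import platform
import tempfile
from pathlib import Path

import numpy as np

def write_manifest(out_dir, taus, seeds):
    """Write manifest.json with the seeds, library versions, and exact tau values."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "taus": [float(t) for t in taus],     # plain floats so the values round-trip
        "seeds": [int(s) for s in seeds],
        "versions": {
            "python": platform.python_version(),
            "numpy": np.__version__,
        },
    }
    path = out_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path

with tempfile.TemporaryDirectory() as tmp:
    path = write_manifest(Path(tmp) / "temp-sweep", [0.5, 1.0, 2.0], range(100))
    loaded = json.loads(path.read_text())
```

Converting to built-in `float` and `int` before serializing avoids the `TypeError` that `json.dumps` raises on NumPy scalar types such as `numpy.int64`.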

Acceptance

  • Temperature(tau) is deterministic under fixed seed (rng.choice consumes the rng's state predictably).
  • The entropy curves are monotone non-decreasing within 1e-9 tolerance.
  • Unique-token count strictly increases with τ on this prompt (or weakly, with at most one equality).
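
The determinism criterion can be checked with a test along these lines. This is a sketch: `sample_tokens` is a hypothetical helper that inlines the softmax, and the synthetic logits stand in for the model's:

```python
import numpy as np

def sample_tokens(logits, tau, seed, n=20):
    """Sample n tokens from softmax(logits / tau) using one seeded generator."""
    rng = np.random.default_rng(seed)
    scaled = np.asarray(logits, dtype=np.float64) / tau
    e = np.exp(scaled - scaled.max())
    probs = e / e.sum()
    return [int(rng.choice(len(probs), p=probs)) for _ in range(n)]

z = [3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0]
run_a = sample_tokens(z, tau=1.0, seed=123)
run_b = sample_tokens(z, tau=1.0, seed=123)

assert run_a == run_b   # same seed and same logits give the same token sequence
```

Comparing whole sequences, not just the first token, also confirms that each call advances the generator state in the same way.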

Pitfalls

  • Computing softmax(z) ** (1/τ) instead of softmax(z / τ). Without renormalization these are different vectors: for τ ≠ 1 the power form does not even sum to 1. The theory file (theory/01-temperature.md) has the proof. The lab silently passing while computing the wrong thing is exactly the bug we want to catch, so add an explicit unit test that asserts your implementation matches softmax(z / τ) (not the post-softmax power form).
  • Casting rng.choice output to int (or not). rng.choice(n, p=p) returns a numpy.int64; if your tokenizer code expects a Python int, cast it.
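  • Log of zero. If q_i = 0 exactly, q_i log q_i is 0 · -inf = nan in IEEE. Use H = -np.sum(np.where(q > 0, q * np.log2(q), 0.0)) for entropy. np.where still evaluates both branches, so NumPy emits a RuntimeWarning unless you wrap the call in np.errstate(divide="ignore", invalid="ignore").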
  • Choosing the wrong reference prompt. If you pick a prompt where the Mini-GPT is extremely confident (e.g., the model has overfit and gives q_{top} = 0.999), all temperatures will look similar until τ > 5. Pick a prompt where the model has visible uncertainty.
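
The first and third pitfalls lend themselves to small unit tests. A sketch, using the synthetic logits and an inlined softmax (both stand-ins for your own implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -2.0])
tau = 0.5

correct = softmax(z / tau)
power_form = softmax(z) ** (1.0 / tau)         # the pitfall: no renormalization

assert np.isclose(correct.sum(), 1.0)
assert not np.isclose(power_form.sum(), 1.0)   # not a probability vector for tau != 1
assert not np.allclose(power_form, correct)

# Log-of-zero guard: a one-hot distribution has entropy of exactly 0 bits.
q = np.array([1.0, 0.0, 0.0])
with np.errstate(divide="ignore", invalid="ignore"):
    H = -np.sum(np.where(q > 0, q * np.log2(q), 0.0))
assert H == 0.0
```

In your own test suite, replace `correct` with the probabilities your `Temperature` implementation computes internally and assert that they match `softmax(z / tau)`.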

Next: 02-top-k-and-top-p.md