03 — PPO for language: the InstructGPT recipe

Summary: the InstructGPT recipe is SFT → RM → PPO with a KL penalty toward the reference model. The KL constraint is what keeps the model from drifting.

The three-stage recipe

InstructGPT (Ouyang et al. 2022) standardized:

  1. SFT. Fine-tune the pretrained LM on \((x, y_{\text{demo}})\) pairs. Call the result \(\pi^{\text{SFT}}\).
  2. RM. Train \(r_\phi\) on pairwise preference data from \(\pi^{\text{SFT}}\) samples (chapter 02).
  3. PPO. Run PPO with \(\pi^{\text{SFT}}\) as both the initialization and the frozen reference policy \(\pi_{\text{ref}}\). Every token carries a per-token KL penalty, and the RM reward is added at the final token of the response.

The full objective

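For a prompt \(x\) and a sampled response \(y \sim \pi_\theta(\cdot \mid x)\), the per-trajectory reward is

\[ R(x, y) = r_\phi(x, y) - \beta\, \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \]

Taking the expectation over \(y \sim \pi_\theta(\cdot \mid x)\) turns the second term into \(\beta\, \mathrm{KL}\!\left(\pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\text{ref}}(\cdot \mid x)\right)\), so maximizing the expected reward trades RM score against divergence from the reference. In practice the log-ratio is split per token and folded into the per-token reward as a shaping signal: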

\[ \tilde r_t = -\beta\, \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})} \]

with \(r_\phi\) added only at the final token of the response. The full PPO surrogate is then chapter 01's \(\mathcal{L}^{\text{CLIP}}\), but with advantages computed from this shaped reward.
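The shaped reward above can be sketched in a few lines of NumPy. This is a minimal illustration, not the InstructGPT implementation: the function name `shaped_rewards` and the toy log-probabilities are assumptions made for the example.

```python
import numpy as np

def shaped_rewards(logp_policy, logp_ref, rm_score, beta):
    """Per-token rewards for one sampled response (hypothetical helper).

    logp_policy, logp_ref: log-probs of the sampled tokens y_t under
    pi_theta and pi_ref, each of shape (T,).
    rm_score: scalar reward-model score r_phi(x, y) for the whole response.
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    rewards = -beta * (logp_policy - logp_ref)  # KL shaping at every token
    rewards[-1] += rm_score                     # RM score only at the final token
    return rewards

# Toy three-token response with made-up log-probs.
rewards = shaped_rewards(
    logp_policy=[-1.0, -0.5, -2.0],
    logp_ref=[-1.2, -0.5, -1.5],
    rm_score=0.8,
    beta=0.1,
)
print(rewards)
```

Summing the returned vector recovers the RM score minus \(\beta\) times the sequence log-ratio, which is why the per-token and per-trajectory views agree.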

Why per-token KL

The exact KL \(\mathrm{KL}(\pi_\theta \| \pi_{\text{ref}})\) at the sequence level requires a sum over all possible sequences — intractable. But the per-token log-ratio

\[ \log \pi_\theta(y_t \mid \cdot) - \log \pi_{\text{ref}}(y_t \mid \cdot) \]

is an unbiased single-sample estimator of the per-step KL when \(y_t \sim \pi_\theta\). By the chain rule for KL divergence, summing these along the trajectory gives an unbiased estimator of the sequence-level KL. This is the "log-ratio trick" used in most RLHF codebases.
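A quick numerical check makes the unbiasedness claim concrete. The sketch below assumes two made-up next-token distributions over a four-token vocabulary; it compares the exact KL with the average log-ratio of tokens sampled from \(\pi_\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token distributions over a 4-token vocabulary.
pi_theta = np.array([0.5, 0.3, 0.15, 0.05])
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])

# Exact KL(pi_theta || pi_ref): tractable here because the vocabulary is tiny.
exact_kl = float(np.sum(pi_theta * np.log(pi_theta / pi_ref)))

# Monte Carlo estimate: sample tokens from pi_theta, average the log-ratio.
samples = rng.choice(len(pi_theta), size=200_000, p=pi_theta)
log_ratio = np.log(pi_theta[samples]) - np.log(pi_ref[samples])
mc_kl = float(log_ratio.mean())

print(f"exact KL = {exact_kl:.4f}, sampled estimate = {mc_kl:.4f}")
```

Individual log-ratios can be negative even though the KL itself is not; only their average under \(\pi_\theta\) is guaranteed to be non-negative.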

Why the KL constraint is load-bearing

Without the \(\beta\, \mathrm{KL}\) term, the policy is rewarded purely by \(r_\phi\). Combined with the over-optimization U-curve (chapter 02), this leads to:

  • Drift to gibberish that the RM loves. The RM is a function. Functions have argmaxes. The argmax of a small RM trained on grammar-correction preferences is not a grammar correction — it is a sequence of tokens that happens to score high. Without the KL anchor, PPO finds these adversarial sequences within a few hundred steps.
  • Loss of fluency. The pretrained LM's smooth distribution is destroyed; perplexity on held-out natural text explodes.

The KL term says: "stay close to a known-good distribution." The known-good distribution is \(\pi^{\text{SFT}}\), which already knows how to produce English sentences. PPO is then only allowed to re-weight within the manifold of plausible sentences, not invent new ones.

Choosing \(\beta\)

InstructGPT used \(\beta \approx 0.02\) for the 1.3B model, and adaptively tuned it (the "adaptive KL" of Ziegler et al. 2019): increase \(\beta\) if measured KL exceeds a target, decrease otherwise. Modern recipes often use fixed \(\beta \in [0.01, 0.1]\).

Effect of \(\beta\):

  • \(\beta\) too small → drift (chapter 02 over-optimization).
  • \(\beta\) too large → policy frozen at \(\pi^{\text{SFT}}\); no improvement.
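The adaptive rule can be sketched as a clipped proportional controller in the style of Ziegler et al. 2019. The gain of 0.1 and the error clip of 0.2 below are illustrative assumptions, as is the function name `update_beta`; real codebases differ in how they scale the update.

```python
def update_beta(beta, measured_kl, target_kl, gain=0.1, max_error=0.2):
    """One step of a proportional KL controller (illustrative sketch).

    Raises beta when the measured KL is above target and lowers it when
    below. The relative error is clipped so one noisy batch cannot move
    beta by more than gain * max_error in relative terms.
    """
    error = measured_kl / target_kl - 1.0
    error = max(-max_error, min(max_error, error))
    return beta * (1.0 + gain * error)

# Toy trace: KL starts well above a 6-nat target, then falls below it.
beta = 0.02
for measured_kl in [12.0, 9.0, 6.0, 3.0]:
    beta = update_beta(beta, measured_kl, target_kl=6.0)
    print(f"KL={measured_kl:>4}: beta={beta:.6f}")
```

Clipping the error is a design choice that limits how fast the coefficient can react, which dampens but does not eliminate oscillation when KL jumps suddenly.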

The InstructGPT hyperparameters

For reproducibility, the InstructGPT paper (Table 6, Ouyang et al. 2022) specifies for the 6B model:

| Hyperparameter | Value |
| --- | --- |
| Optimizer | Adam |
| Learning rate (policy) | \(9.65 \times 10^{-6}\) |
| Learning rate (value) | \(1.41 \times 10^{-5}\) |
| Batch size | 512 |
| Minibatch size | 64 |
| Epochs per rollout | 4 |
| Discount \(\gamma\) | 1.0 (no discount; finite horizon) |
| GAE \(\lambda\) | 0.95 |
| Clip \(\epsilon\) | 0.2 |
| Value clip | 0.2 |
| KL coefficient \(\beta\) | adaptive, target KL ≈ 6 nats |
| Entropy bonus | 0.0 |
| Rollouts per iteration | 32 prompts × 16 samples |

These are the numbers an interviewer might ask about. The shape of the recipe — Adam, small LR, K=4 PPO epochs, \(\epsilon=0.2\), GAE \(\lambda \approx 0.95\) — is stable across modern RLHF papers; only the KL handling varies.
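To show where \(\gamma\) and \(\lambda\) enter, here is a minimal sketch of generalized advantage estimation over one finished response. It assumes per-token rewards and critic values are already available as arrays, and that the value after the final token is zero; the function name `gae` is chosen for the example.

```python
import numpy as np

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation for one finished response (sketch).

    rewards: per-token rewards, shape (T,).
    values: critic estimates V(s_t) for t = 0..T-1, shape (T,).
    The value after the last token is taken to be 0 (episode ends there).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    advantages = np.zeros(T)
    next_value, running = 0.0, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        advantages[t] = running
        next_value = values[t]
    returns = advantages + values  # regression targets for the critic
    return advantages, returns

# Toy response: reward only at the final token, flat critic estimate.
adv, ret = gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5])
print(adv, ret)
```

With \(\gamma = 1\) the only decay comes from \(\lambda\), so earlier tokens receive a slightly smaller share of the end-of-sequence reward than later ones.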

What goes wrong in practice

Practitioner-known failure modes (Casper et al. 2023, Zheng et al. 2023):

  1. Reward hacking (chapter 02): length bias, sycophancy, mode collapse.
  2. KL spike. A handful of high-advantage tokens push KL far past target; adaptive \(\beta\) overshoots and oscillates.
  3. Value-function lag. The critic \(V_\psi\) (parameters \(\psi\), separate from the RM's \(\phi\)) is initialized from the RM; if it is poorly calibrated to the actor's distribution, advantage estimates are biased and learning stalls.
  4. Catastrophic forgetting of SFT skills. Even with a KL penalty, the policy can lose capabilities present in the SFT model that are not exercised by the prompts in the PPO batch. Fix: mix a language-modeling loss on pretraining-data minibatches into the PPO update (Ouyang et al. 2022, "PPO-ptx").
  5. Reward-model brittleness on OOD samples. The actor explores; explored samples are out of the RM's training distribution; rewards become arbitrary. Fix: iterate the loop (collect new preferences on new actor samples, retrain RM, repeat).
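The fix for item 4 amounts to adding a weighted log-likelihood term to the PPO loss. The sketch below is an assumption-laden illustration: `ppo_loss` stands for an already-computed scalar PPO loss, `aux_token_logps` for the policy's log-probs on a separate text batch, and `aux_coef` for a mixing weight that would need tuning.

```python
import numpy as np

def mixed_loss(ppo_loss, aux_token_logps, aux_coef):
    """Combine a PPO loss with a language-modeling loss (illustrative sketch).

    ppo_loss: scalar PPO loss computed on the rollout minibatch.
    aux_token_logps: log-probs the current policy assigns to the tokens of
        a separate text minibatch (hypothetical input).
    aux_coef: weight on the language-modeling term.
    """
    lm_loss = -float(np.mean(aux_token_logps))  # mean negative log-likelihood
    return ppo_loss + aux_coef * lm_loss

# Toy numbers: a PPO loss of 0.3 and three auxiliary tokens.
total = mixed_loss(ppo_loss=0.3, aux_token_logps=[-2.0, -1.0, -3.0], aux_coef=0.5)
print(total)
```

The extra term pulls gradients toward text the PPO prompts never exercise, which is what protects capabilities outside the rollout distribution.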

Why anyone uses DPO instead

PPO-for-language has many moving parts: rollout collection, advantage estimation, separate value head, adaptive KL controller, RM. Engineering complexity is high; reproducibility is low (Wang et al. 2024 reported PPO seed variance >5 win-rate points).

DPO (chapter 04) collapses RM-training and RL into a single supervised loss. It does not need rollouts; it does not need a value function. It is one of the most striking simplifications in modern ML — and the next chapter derives it.

References

  • Ziegler et al. 2019, Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593.
  • Ouyang et al. 2022, Training language models to follow instructions with human feedback. arXiv:2203.02155.
  • Casper et al. 2023, Open Problems and Fundamental Limitations of RLHF. arXiv:2307.15217.
  • Zheng et al. 2023, Secrets of RLHF in Large Language Models Part I: PPO. arXiv:2307.04964.
  • Wang et al. 2024, Secrets of RLHF in Large Language Models Part II: Reward Modeling. arXiv:2401.06080.