04 — DPO and direct methods¶
DPO: the optimal policy under a KL constraint has a closed form; substituting for the reward, the BT loss becomes a loss on the policy itself. No separate RM, no rollouts.
DPO (Rafailov et al. 2023) is a four-step algebra exercise that eliminates the RM-and-PPO machinery of chapter 03 and replaces it with a single supervised loss on preference pairs. This chapter walks the derivation.
Step 0: the KL-constrained RL objective¶
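The objective PPO is trying to solve (chapter 03) is:
\[ \max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x,y) \big] - \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \Big] \]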
This is a constrained optimization over the per-prompt distribution \(\pi_\theta(\cdot \mid x)\).
Step 1: the closed-form optimal policy¶
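For a fixed reward \(r\), this objective has a closed-form maximizer. Setting the Lagrangian's derivative w.r.t. \(\pi(y|x)\) to zero (with normalization constraint) yields:
\[ \pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\big( r(x,y)/\beta \big) \]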
where \(Z(x) = \sum_y \pi_{\text{ref}}(y \mid x)\, \exp(r(x,y)/\beta)\) is the partition function over responses.
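Derivation sketch. The objective in terms of \(\pi(\cdot|x)\) is
\[ J(\pi) = \sum_y \pi(y|x)\, r(x,y) - \beta \sum_y \pi(y|x) \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} \]
Adding a Lagrange multiplier \(\lambda(x)\) for \(\sum_y \pi(y|x) = 1\), entering as \(-\lambda(x)\big(\sum_y \pi(y|x) - 1\big)\), and taking \(\partial / \partial \pi(y|x)\):
\[ r(x,y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} - \beta - \lambda(x) = 0 \]
Solving for \(\pi(y|x)\):
\[ \pi(y|x) = \pi_{\text{ref}}(y|x)\, \exp\!\big( r(x,y)/\beta \big)\, \exp\!\big( -1 - \lambda(x)/\beta \big) \]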
The \(\exp(-1 - \lambda(x)/\beta)\) factors out as a function of \(x\) only — this is \(1/Z(x)\) after normalization. \(\square\)
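The closed form can be sanity-checked numerically on a tiny response set. The sketch below is an illustration under assumed toy numbers (three candidate responses, made-up rewards and reference probabilities), not part of the original derivation. It builds \(\pi^*\) from the formula, evaluates the KL-regularized objective for one prompt, and compares it against randomly sampled alternative distributions.

```python
import math
import random

def kl_regularized_objective(pi, pi_ref, r, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) for one fixed prompt x."""
    expected_reward = sum(p * ri for p, ri in zip(pi, r))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

def optimal_policy(pi_ref, r, beta):
    """Closed form: pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    unnorm = [q * math.exp(ri / beta) for q, ri in zip(pi_ref, r)]
    z = sum(unnorm)  # partition function Z(x)
    return [u / z for u in unnorm], z

# Toy numbers, assumed purely for illustration.
pi_ref = [0.5, 0.3, 0.2]
r = [1.0, 2.0, 0.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, r, beta)
j_star = kl_regularized_objective(pi_star, pi_ref, r, beta)

# No randomly drawn distribution should score higher than pi_star.
rng = random.Random(0)
best_random = -math.inf
for _ in range(1000):
    w = [rng.random() + 1e-9 for _ in pi_ref]
    total = sum(w)
    candidate = [wi / total for wi in w]
    score = kl_regularized_objective(candidate, pi_ref, r, beta)
    best_random = max(best_random, score)
```

A useful side fact that falls out of the algebra: the objective value at the optimum is \(\beta \log Z(x)\), so `j_star` should equal `beta * math.log(z)` up to rounding.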
This result is classical (Gibbs-Boltzmann distribution; max-entropy RL; Ziebart 2010). DPO's contribution is what comes next.
Step 2: invert to express the reward in terms of the policy¶
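Take the log of step 1:
\[ \log \pi^*(y \mid x) = \log \pi_{\text{ref}}(y \mid x) + \frac{1}{\beta}\, r(x,y) - \log Z(x) \]
Rearrange for \(r\):
\[ r(x,y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \]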
This is the DPO identity: given the optimal policy \(\pi^*\), the reward is recovered (up to a function of \(x\) only) as \(\beta\) times the log policy ratio.
Step 3: plug into Bradley-Terry¶
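Chapter 02 gave the BT preference model:
\[ P(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big) \]
The difference \(r(x, y_w) - r(x, y_l)\) is what matters; the \(\beta \log Z(x)\) term cancels because it depends only on \(x\). Substituting step 2:
\[ P(y_w \succ y_l \mid x) = \sigma\Big( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \Big) \]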
So the BT likelihood is now expressed directly in terms of the policy \(\pi^*\) — no separate reward function \(r_\phi\) is needed.
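Steps 2 and 3 can be checked on the same kind of toy problem. The sketch below uses assumed numbers (one prompt, three responses) and is only an illustration. It confirms that \(\beta \log(\pi^*/\pi_{\text{ref}})\) differs from the true reward by the same constant \(\beta \log Z(x)\) for every response, and that the BT probability is therefore identical whether it is computed from the reward or from the policy ratio.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy setup, assumed for illustration: one prompt, three responses.
pi_ref = [0.5, 0.3, 0.2]
r = [1.0, 2.0, 0.0]
beta = 0.5

# Step 1: closed-form optimal policy.
unnorm = [q * math.exp(ri / beta) for q, ri in zip(pi_ref, r)]
z = sum(unnorm)
pi_star = [u / z for u in unnorm]

# Step 2: implicit reward beta * log(pi* / pi_ref); it misses r by beta * log Z.
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, pi_ref)]
offsets = [ri - hi for ri, hi in zip(r, implicit)]

# Step 3: BT probability that response 1 beats response 0, computed both ways.
w, l = 1, 0
p_from_reward = sigmoid(r[w] - r[l])
p_from_policy = sigmoid(implicit[w] - implicit[l])
```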
Step 4: the DPO loss¶
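Parameterize \(\pi^* \equiv \pi_\theta\) and minimize the negative log-likelihood of observed preferences:
\[ \mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma\Big( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \Big) \Big] \]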
That is it. The full pipeline is:
- Compute \(\log \pi_\theta(y_w | x)\) — one forward pass through \(\pi_\theta\).
- Compute \(\log \pi_{\text{ref}}(y_w | x)\) — one forward pass through frozen \(\pi_{\text{ref}}\).
- Same for \(y_l\).
- Combine the four log-probs into the \(\beta\)-scaled log-ratio difference, apply \(-\log \sigma(\cdot)\), and backprop through \(\pi_\theta\) only.
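The pipeline above can be sketched in a few lines. This is a minimal illustration on plain floats, not the Lab 01 implementation: the function name `dpo_loss` and the example log-probs are hypothetical, each log-prob is assumed to be a sequence-level value (token log-probs summed over the response), and the backward pass is left to whatever autodiff framework is used in practice.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Mean DPO loss over a batch of sequence-level log-probs (lists of floats)."""
    losses = []
    for pw, pl, rw, rl in zip(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l):
        # beta * [log-ratio of the winner - log-ratio of the loser]
        margin = beta * ((pw - rw) - (pl - rl))
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)

# Hypothetical log-probs for two preference pairs.
ref_w, ref_l = [-12.0, -8.0], [-11.0, -9.5]

# At initialization the policy equals the reference, so every margin is zero.
loss_at_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# A policy that raised the winners and lowered the losers gets a lower loss.
loss_improved = dpo_loss([-10.0, -7.0], [-13.0, -11.0], ref_w, ref_l)
```

With the policy equal to the reference, the loss is exactly \(\log 2\), which is a handy check that a training run starts from the right place.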
No RM. No rollouts. No advantage estimation. No value head. The entire chapter 03 apparatus is gone.
What the DPO gradient is doing¶
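Take \(\nabla_\theta \mathcal{L}_{\text{DPO}}\). Using \(\frac{d}{dz} \log \sigma(z) = \sigma(-z) = 1 - \sigma(z)\):
\[ \nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \sigma(\hat r_l - \hat r_w) \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \Big] \]
where \(\hat r = \beta \log(\pi_\theta / \pi_{\text{ref}})\) is the implicit reward, and \(\hat r_w\), \(\hat r_l\) are its values at \((x, y_w)\) and \((x, y_l)\).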
Reading this:
- The "weight" \(\sigma(\hat r_l - \hat r_w)\) is large when the implicit reward is wrong (the loser is currently scored higher than the winner) and small when it is already right. This is a hard-example weighting built in for free — DPO down-weights easy examples without you doing anything.
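- The gradient direction is "increase \(\log \pi_\theta(y_w|x)\), decrease \(\log \pi_\theta(y_l|x)\)", the standard contrastive shape. The reference policy enters only through the weight: implicit rewards are measured relative to \(\pi_{\text{ref}}\), so the push fades once the log-ratio margin over the reference is large. The loss constrains only this margin, not the absolute log-probs.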
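Both ingredients of this reading can be checked with a few numbers. The sketch below is illustrative only, with made-up implicit rewards. It verifies the derivative identity for \(\log \sigma\) by central finite differences, then evaluates the per-pair weight \(\sigma(\hat r_l - \hat r_w)\) for a misranked, an undecided, and a correctly ranked pair.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_sigmoid(z):
    return math.log(sigmoid(z))

def dpo_weight(r_hat_w, r_hat_l):
    """Per-pair gradient weight sigma(r_hat_l - r_hat_w)."""
    return sigmoid(r_hat_l - r_hat_w)

# Finite-difference check of d/dz log sigma(z) = sigma(-z).
z, eps = 0.7, 1e-6
numeric = (log_sigmoid(z + eps) - log_sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(-z)

# Weights for three pairs; the implicit rewards are made-up values.
weights = {
    "misranked": dpo_weight(r_hat_w=-1.0, r_hat_l=2.0),  # loser scored higher
    "undecided": dpo_weight(r_hat_w=0.0, r_hat_l=0.0),   # policy equals reference
    "correct": dpo_weight(r_hat_w=2.0, r_hat_l=-1.0),    # winner scored higher
}
```

The misranked pair gets a weight near 1, the undecided pair exactly 0.5, and the correctly ranked pair a weight near 0, which is the built-in hard-example weighting described above.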
Why DPO is a Bradley-Terry MLE¶
By construction. Step 3 shows the BT likelihood is the same function whether you parameterize the reward directly or via the policy ratio. DPO is literally the MLE of the BT model with the policy-ratio parameterization. This is why DPO inherits BT's statistical guarantees (consistency, asymptotic normality of \(\hat\theta\)) under standard regularity.
It also means DPO inherits BT's limitations: if your preference data is not pairwise consistent (cycles in preferences, label noise), DPO has the same trouble PPO+RM does.
When DPO outperforms PPO¶
Empirical findings (Rafailov et al. 2023; Tunstall et al. 2023 Zephyr; many others):
- Smaller data regimes (\(<10\)k pairs): DPO wins, mostly because PPO needs to train a usable RM first.
- Same KL budget: DPO matches or beats PPO on downstream win-rate at the same KL-to-reference.
- Engineering: DPO requires no rollout infra; train on static preference pairs like SFT.
When PPO still wins:
- Online iteration. PPO naturally samples from the current policy; you can interleave RM updates with policy updates. DPO is offline against a fixed dataset.
- Reward shaping. Tool-use, code execution, math verifiers — settings where reward is computable from the response (not from preferences) — strongly favor PPO/GRPO over DPO.
The family of direct methods¶
DPO triggered a small explosion of variants. The ones to know:
IPO — Identity Preference Optimization (Azar et al. 2023)¶
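Replaces the sigmoid (BT) with a squared loss directly on the log-ratio difference:
\[ \mathcal{L}_{\text{IPO}}(\theta) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \Big( \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \frac{1}{2\beta} \Big)^2 \Big] \]
(Azar et al. write the regularization strength as \(\tau\); it is written as \(\beta\) here to match the rest of the chapter.)
Motivation: DPO can over-fit when preferences are deterministic. If \(y_w\) always wins, the BT likelihood is maximized only as the implicit reward gap goes to infinity, so the loss keeps pushing the policy away from \(\pi_{\text{ref}}\) and the KL regularization stops binding. IPO instead regresses the log-ratio gap onto the finite target \(1/(2\beta)\), preventing the "saturation overfit" failure.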
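The difference in shape is easy to see by evaluating both per-pair losses as a function of the log-ratio gap \(h\) between winner and loser. The sketch below is an illustration under assumptions: the IPO form is the squared loss with target \(1/(2\beta)\) from Azar et al., with their regularization strength written as `beta`, and the gap values are arbitrary.

```python
import math

def dpo_pair_loss(h, beta):
    """-log sigma(beta * h), where h is the winner-minus-loser log-ratio gap."""
    return math.log1p(math.exp(-beta * h))

def ipo_pair_loss(h, beta):
    """(h - 1 / (2 * beta))^2: squared loss on the same gap."""
    return (h - 1.0 / (2.0 * beta)) ** 2

beta = 0.1
target = 1.0 / (2.0 * beta)  # IPO's finite target gap (5.0 for beta = 0.1)
gaps = [0.0, 2.5, 5.0, 10.0, 50.0]

dpo_curve = [dpo_pair_loss(h, beta) for h in gaps]
ipo_curve = [ipo_pair_loss(h, beta) for h in gaps]
```

The DPO loss keeps decreasing as the gap grows, so a single pair never stops rewarding a larger gap. The IPO loss is zero at the target and grows again beyond it, which penalizes overshooting.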
KTO — Kahneman-Tversky Optimization (Ethayarajh et al. 2024)¶
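Works with unpaired binary labels ("response is good" or "response is bad"), not pairs. Loss is a prospect-theory-shaped function of the implicit reward:
\[ \mathcal{L}_{\text{KTO}}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \lambda_y - v(x, y) \big], \qquad v(x, y) = \begin{cases} \lambda_D\, \sigma\big( \beta\, (r_\theta(x, y) - z_0) \big) & \text{if } y \text{ is desirable} \\ \lambda_U\, \sigma\big( \beta\, (z_0 - r_\theta(x, y)) \big) & \text{if } y \text{ is undesirable} \end{cases} \]
where \(r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\) is the log policy ratio, \(\lambda_D\) and \(\lambda_U\) weight desirable and undesirable examples (\(\lambda_y\) is whichever applies to \(y\)), and \(z_0\) is a reference point: an estimate of \(\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})\) computed on the batch. Useful when you have thumbs-up/down data but no pairwise comparisons.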
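A per-example version of this loss can be sketched as follows. This is a hedged illustration, assuming the form given by Ethayarajh et al. (loss \(\lambda_y - v(x, y)\), with \(v\) a scaled sigmoid of the log policy ratio relative to the reference point \(z_0\)). The function name, the default weights, and the fixed `z0` value are hypothetical; in practice \(z_0\) is estimated from the batch rather than set by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_example_loss(log_ratio, desirable, z0, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for one labelled response: lambda_y - v(x, y).

    log_ratio is log pi_theta(y|x) - log pi_ref(y|x) for that response.
    """
    if desirable:
        return lambda_d - lambda_d * sigmoid(beta * (log_ratio - z0))
    return lambda_u - lambda_u * sigmoid(beta * (z0 - log_ratio))

z0 = 0.5  # assumed reference point, fixed here for illustration

good_raised = kto_example_loss(log_ratio=3.0, desirable=True, z0=z0)
good_lowered = kto_example_loss(log_ratio=-3.0, desirable=True, z0=z0)
bad_raised = kto_example_loss(log_ratio=3.0, desirable=False, z0=z0)
bad_lowered = kto_example_loss(log_ratio=-3.0, desirable=False, z0=z0)
at_reference = kto_example_loss(log_ratio=z0, desirable=True, z0=z0)
```

The loss falls when a desirable response gains probability relative to the reference and when an undesirable one loses it, and no pairing between the two is needed.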
ORPO — Odds Ratio Preference Optimization (Hong et al. 2024)¶
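Combines SFT and preference loss into a single stage by adding a log-odds-ratio term to the SFT loss:
\[ \mathcal{L}_{\text{ORPO}}(\theta) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \mathcal{L}_{\text{SFT}}(x, y_w) - \lambda \log \sigma\Big( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \Big) \Big] \]
where \(\mathrm{odds}_\theta(y|x) = \pi_\theta(y|x) / (1 - \pi_\theta(y|x))\), \(\mathcal{L}_{\text{SFT}}\) is the usual negative log-likelihood on \(y_w\), and \(\lambda\) weights the odds-ratio term. Single-stage, no reference model needed; competitive with two-stage SFT+DPO.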
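The odds-ratio term can be computed from log-likelihoods alone. The sketch below is a minimal illustration, not the reference ORPO implementation: the function names, the weight `lam`, and the example log-likelihoods are assumptions, and the inputs stand for whatever sequence-level log-likelihood the training setup uses.

```python
import math

def log_odds(logp):
    """log(p / (1 - p)) computed from log p, for 0 < p < 1."""
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(logp_w, logp_l, lam=0.1):
    """Return (total loss, odds-ratio term) for one preference pair."""
    sft = -logp_w  # negative log-likelihood of the preferred response
    log_odds_ratio = log_odds(logp_w) - log_odds(logp_l)
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))  # -log sigma(.)
    return sft + lam * odds_ratio_term, odds_ratio_term

# Hypothetical log-likelihoods for three situations.
total_good, or_term_good = orpo_loss(logp_w=-0.5, logp_l=-2.0)  # winner likelier
total_tied, or_term_tied = orpo_loss(logp_w=-1.0, logp_l=-1.0)  # equal likelihood
total_bad, or_term_bad = orpo_loss(logp_w=-2.0, logp_l=-0.5)    # loser likelier
```

No reference log-probs appear anywhere in the computation; the SFT term is what keeps the policy tied to the preferred responses.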
Comparison table¶
| Method | Pairs / Binary | Reference model | Single-stage | Notable property |
|---|---|---|---|---|
| PPO+RM | Pairs (via RM) | Yes | No (3 stages) | Online; needs rollouts |
| DPO | Pairs | Yes | No (SFT then DPO) | BT MLE in closed form |
| IPO | Pairs | Yes | No | Bounded implicit reward; less overfit |
| KTO | Binary | Yes | No | Works with thumbs-up/down |
| ORPO | Pairs | No | Yes | Folds SFT and preference |
Choosing \(\beta\) in DPO¶
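\(\beta\) plays the same role as in PPO: it is the trust-region knob that controls how far \(\pi_\theta\) may move from \(\pi_{\text{ref}}\). Typical values are \(\beta \in [0.1, 0.5]\); note this is larger than in PPO because the loss surface and parameterization are different. Lab 01 uses \(\beta = 0.1\).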
Effect:
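- Small \(\beta\) → weak anchoring: fitting the preferences takes large log-ratio changes, so the policy can drift far from \(\pi_{\text{ref}}\).
- Large \(\beta\) → strong anchoring: small log-ratio changes already saturate the sigmoid, so the policy stays close to \(\pi_{\text{ref}}\) and learns slowly.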
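One way to make the effect of \(\beta\) concrete is to ask how large a log-ratio gap the policy needs before the model assigns a given preference probability. Since the DPO preference probability is \(\sigma(\beta h)\) for a winner-minus-loser log-ratio gap \(h\), the required gap is \(\mathrm{logit}(p)/\beta\). The sketch below is illustrative; the target probability and the \(\beta\) values are arbitrary choices.

```python
import math

def required_log_ratio_gap(p_target, beta):
    """Log-ratio gap h such that sigmoid(beta * h) equals p_target."""
    return math.log(p_target / (1.0 - p_target)) / beta

betas = [0.01, 0.1, 0.5]
gaps = {beta: required_log_ratio_gap(0.9, beta) for beta in betas}
```

The required gap scales as \(1/\beta\): a tenfold smaller \(\beta\) means the policy must move ten times further from the reference, in log-ratio terms, to express the same preference strength.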
Cross-links¶
- Theory 02 — Reward Modeling: Bradley-Terry, which DPO inherits.
- Theory 03 — PPO for Language: the KL-constrained RL objective whose solution DPO inverts.
- Lab 01 — DPO on Grammar Tutor: the ~50-line implementation.
- Phase 28 — LoRA: DPO is overwhelmingly done on LoRA adapters in practice.
References¶
- Rafailov et al. 2023, Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290.
- Azar et al. 2023, A General Theoretical Paradigm to Understand Learning from Human Preferences (IPO). arXiv:2310.12036.
- Ethayarajh et al. 2024, KTO: Model Alignment as Prospect Theoretic Optimization. arXiv:2402.01306.
- Hong et al. 2024, ORPO: Monolithic Preference Optimization without Reference Model. arXiv:2403.07691.
- Tunstall et al. 2023, Zephyr: Direct Distillation of LM Alignment. arXiv:2310.16944.
- Ziebart 2010, Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy (PhD thesis, CMU).