04 — DPO and direct methods

Summary. DPO: the optimal policy under a KL constraint has a closed form; substituting it for the reward turns the BT loss into a loss on the policy itself. No separate RM, no rollouts.

DPO (Rafailov et al. 2023) is a four-step algebra exercise that eliminates the RM-and-PPO machinery of chapter 03 and replaces it with a single supervised loss on preference pairs. This chapter walks the derivation.

Step 0: the KL-constrained RL objective

The objective PPO is trying to solve (chapter 03) is:

\[ \max_{\pi_\theta}\, \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right] - \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\text{ref}}(\cdot \mid x) \right) \]

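This is a KL-regularized objective: the penalty weight \(\beta\) trades expected reward against divergence from \(\pi_{\text{ref}}\). Both terms decompose prompt by prompt, so the problem can be solved separately for each per-prompt distribution \(\pi_\theta(\cdot \mid x)\).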

Step 1: the closed-form optimal policy

For a fixed reward \(r\), this objective has a closed-form maximizer. Setting the Lagrangian's derivative w.r.t. \(\pi(y|x)\) to zero (with normalization constraint) yields:

\[ \pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\left( \frac{1}{\beta} r(x, y) \right) \]

where \(Z(x) = \sum_y \pi_{\text{ref}}(y \mid x)\, \exp(r(x,y)/\beta)\) is the partition function over responses.

Derivation sketch. The objective in terms of \(\pi(\cdot|x)\) is

\[ \mathbb{E}_{y \sim \pi}[r(x,y)] - \beta \sum_y \pi(y|x) \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}. \]

Adding a Lagrange multiplier \(\lambda(x)\) for \(\sum_y \pi(y|x) = 1\) and taking \(\partial / \partial \pi(y|x)\):

\[ r(x,y) - \beta\!\left(\log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} + 1\right) - \lambda(x) = 0 \]

Solving for \(\pi(y|x)\):

\[ \pi(y|x) = \pi_{\text{ref}}(y|x)\, \exp\!\left( \frac{r(x,y) - \beta - \lambda(x)}{\beta} \right) \]

The \(\exp(-1 - \lambda(x)/\beta)\) factors out as a function of \(x\) only — this is \(1/Z(x)\) after normalization. \(\square\)
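The closed form can be checked numerically on a toy problem. The sketch below assumes a single prompt with three candidate responses; the values of `pi_ref`, `rewards`, and `beta` are made up for illustration, and `optimal_policy` and `objective` are hypothetical helper names, not part of any library.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Closed form: pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(unnorm)  # partition function Z(x) for this one prompt
    return [u / z for u in unnorm], z

def objective(pi, pi_ref, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_ref) for one prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

# Hypothetical toy values: three responses to a single prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)

# Other distributions over the same responses, for comparison.
candidates = [pi_ref, [0.2, 0.7, 0.1], [0.05, 0.9, 0.05]]
print("J(pi*) =", best, "| beta * log Z =", beta * math.log(z))
for cand in candidates:
    print(cand, "->", objective(cand, pi_ref, rewards, beta))
```

Every candidate scores below \(\pi^*\), and the optimal value equals \(\beta \log Z(x)\), which follows from substituting \(\pi^*\) back into the objective.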

This result is classical (Gibbs-Boltzmann distribution; max-entropy RL; Ziebart 2010). DPO's contribution is what comes next.

Step 2: invert to express the reward in terms of the policy

Take the log of step 1:

\[ \log \pi^*(y|x) = \log \pi_{\text{ref}}(y|x) + \frac{1}{\beta} r(x,y) - \log Z(x) \]

Rearrange for \(r\):

\[ \boxed{\, r(x, y) = \beta\, \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta\, \log Z(x) \,} \]

This is the DPO identity: given the optimal policy \(\pi^*\), the reward is recovered (up to a function of \(x\) only) as \(\beta\) times the log policy ratio.
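A quick numerical check of the identity, under the same kind of toy assumptions (one prompt, three responses, made-up values): build \(\pi^*\) from a known reward, then recover that reward from the log policy ratio.

```python
import math

# Hypothetical toy values: three responses to a single prompt.
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]

# Forward direction (step 1): reward -> optimal policy.
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(unnorm)
pi_star = [u / z for u in unnorm]

# Inverse direction (step 2): policy -> reward.
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, pi_ref)]
recovered = [h + beta * math.log(z) for h in implicit]

print("implicit reward :", implicit)
print("recovered reward:", recovered)
# The offset between the two is the same for every response.
offsets = [r - h for r, h in zip(rewards, implicit)]
print("offsets         :", offsets)
```

The implicit reward `implicit` differs from the true reward by the same constant \(\beta \log Z(x)\) for every response, which is what "up to a function of \(x\) only" means here.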

Step 3: plug into Bradley-Terry

Chapter 02 gave the BT preference model:

\[ P(y_w \succ y_l \mid x) = \sigma(r(x, y_w) - r(x, y_l)) \]

The difference \(r(x, y_w) - r(x, y_l)\) is what matters; the \(\beta \log Z(x)\) term cancels because it depends only on \(x\). Substituting step 2:

\[ r(x, y_w) - r(x, y_l) = \beta\, \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\, \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \]

So the BT likelihood is now expressed directly in terms of the policy \(\pi^*\) — no separate reward function \(r_\phi\) is needed.
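The cancellation can be seen in a small sketch. It assumes a toy prompt with three responses and made-up values, and compares the BT probability computed from the reward with the one computed from policy log-ratios alone. The indices `w` and `l` are arbitrary choices for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy values: three responses to a single prompt.
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
w, l = 1, 0  # treat response 1 as the winner and response 0 as the loser

# Optimal policy for this reward (step 1).
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(unnorm)
pi_star = [u / z for u in unnorm]

# BT probability from the explicit reward.
p_from_reward = sigmoid(rewards[w] - rewards[l])

# BT probability from policy log-ratios; Z(x) never appears.
margin = beta * (math.log(pi_star[w] / pi_ref[w])
                 - math.log(pi_star[l] / pi_ref[l]))
p_from_policy = sigmoid(margin)

print(p_from_reward, p_from_policy)
```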

Step 4: the DPO loss

Parameterize \(\pi^* \equiv \pi_\theta\) and minimize the negative log-likelihood of observed preferences:

\[ \boxed{\, \mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}} \log \sigma\!\left( \beta\, \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\, \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \,} \]

That is it. The full pipeline is:

  1. Compute \(\log \pi_\theta(y_w | x)\) — one forward pass through \(\pi_\theta\).
  2. Compute \(\log \pi_{\text{ref}}(y_w | x)\) — one forward pass through frozen \(\pi_{\text{ref}}\).
  3. Same for \(y_l\).
  4. Combine the four sequence log-probs into the scaled log-ratio difference, apply the log-sigmoid, negate, and backprop through \(\pi_\theta\) only.

No RM. No rollouts. No advantage estimation. No value head. The entire chapter 03 apparatus is gone.
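As a minimal sketch of the loss computation, assuming the four sequence log-probs per pair have already been computed (each one the sum of token log-probs over the response): the function name `dpo_loss` and the tuple layout are choices made for this illustration. A real training loop would do the same arithmetic on tensors in an autodiff framework so that gradients flow to \(\pi_\theta\).

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(batch, beta):
    """Mean DPO loss over a batch.

    batch: list of (logp_w, logp_l, ref_logp_w, ref_logp_l) tuples, where
    each entry is a summed sequence log-prob (assumed precomputed).
    """
    total = 0.0
    for logp_w, logp_l, ref_logp_w, ref_logp_l in batch:
        # Scaled difference of log-ratios: the argument of the sigmoid.
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        total += -log_sigmoid(margin)
    return total / len(batch)

# Hypothetical log-probs. Policy equal to reference gives a zero margin.
at_init = dpo_loss([(-10.0, -10.0, -10.0, -10.0)], beta=0.1)
# Winner more likely and loser less likely than under the reference.
improved = dpo_loss([(-8.0, -12.0, -10.0, -10.0)], beta=0.1)
print(at_init, improved)
```

When \(\pi_\theta = \pi_{\text{ref}}\) the margin is zero and the loss is \(\log 2\), which is the usual starting value when the policy is initialized from the reference.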

What the DPO gradient is doing

Take \(\nabla_\theta \mathcal{L}_{\text{DPO}}\). Using \(\frac{d}{dz} \log \sigma(z) = \sigma(-z) = 1 - \sigma(z)\) and the chain rule:

\[ \nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E}\!\left[\, \sigma\!\left( \hat r_l - \hat r_w \right) \cdot \left( \nabla_\theta \log \pi_\theta(y_w|x) - \nabla_\theta \log \pi_\theta(y_l|x) \right) \,\right] \]

where \(\hat r_w\) and \(\hat r_l\) are the implicit reward \(\hat r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}\) evaluated at \(y_w\) and \(y_l\).

Reading this:

  • The "weight" \(\sigma(\hat r_l - \hat r_w)\) is large when the implicit reward is wrong (the loser is currently scored higher than the winner) and small when it is already right. This is a hard-example weighting built in for free — DPO down-weights easy examples without you doing anything.
  • The gradient direction is "increase \(\log \pi_\theta(y_w|x)\), decrease \(\log \pi_\theta(y_l|x)\)" — standard contrastive shape. The reference policy keeps the absolute log-probs anchored.
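The weighting can be illustrated on a single pair. The sketch below treats the two policy log-probs as the free variables (an assumption made to keep the example scalar) and checks the analytic derivative \(-\beta\, \sigma(\hat r_l - \hat r_w)\) against a finite difference. All numeric values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_loss(logp_w, logp_l, ref_w, ref_l, beta):
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(sigmoid(margin))

def grad_weight(logp_w, logp_l, ref_w, ref_l, beta):
    """sigma(r_hat_l - r_hat_w): the per-example weight in the gradient."""
    r_hat_w = beta * (logp_w - ref_w)
    r_hat_l = beta * (logp_l - ref_l)
    return sigmoid(r_hat_l - r_hat_w)

beta, ref_w, ref_l = 0.1, -12.0, -12.0

# Implicit reward currently ranks the loser higher: large weight.
wrong = grad_weight(-20.0, -8.0, ref_w, ref_l, beta)
# Implicit reward already ranks the winner higher: small weight.
right = grad_weight(-8.0, -20.0, ref_w, ref_l, beta)
print("weight when wrong:", wrong, "| weight when right:", right)

# Finite-difference check of dL/d(logp_w) = -beta * weight.
eps = 1e-6
numeric = (pair_loss(-20.0 + eps, -8.0, ref_w, ref_l, beta)
           - pair_loss(-20.0 - eps, -8.0, ref_w, ref_l, beta)) / (2 * eps)
analytic = -beta * wrong
print("numeric:", numeric, "| analytic:", analytic)
```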

Why DPO is a Bradley-Terry MLE

By construction. Step 3 shows the BT likelihood is the same function whether you parameterize the reward directly or via the policy ratio. DPO is literally the MLE of the BT model with the policy-ratio parameterization. This is why DPO inherits BT's statistical guarantees (consistency, asymptotic normality of \(\hat\theta\)) under standard regularity.

It also means DPO inherits BT's limitations: if your preference data is not pairwise consistent (cycles in preferences, label noise), DPO has the same trouble PPO+RM does.

When DPO outperforms PPO

Empirical findings (Rafailov et al. 2023; Tunstall et al. 2023 Zephyr; many others):

  • Smaller data regimes (\(<10\)k pairs): DPO wins, mostly because PPO needs to train a usable RM first.
  • Same KL budget: DPO matches or beats PPO on downstream win-rate at the same KL-to-reference.
  • Engineering: DPO requires no rollout infra; train on static preference pairs like SFT.

When PPO still wins:

  • Online iteration. PPO naturally samples from the current policy; you can interleave RM updates with policy updates. DPO is offline against a fixed dataset.
  • Programmatic rewards. Tool use, code execution, math verifiers: settings where the reward is computed directly from the response rather than from preferences strongly favor PPO/GRPO over DPO.

The family of direct methods

DPO triggered a small explosion of variants. The ones to know:

IPO — Identity Preference Optimization (Azar et al. 2023)

Replaces the sigmoid (BT) with a squared loss directly on the log-ratio difference:

\[ \mathcal{L}_{\text{IPO}} = \mathbb{E}\!\left[\left( \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \frac{1}{2\beta} \right)^2\right] \]

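Motivation: DPO can overfit when preferences are deterministic. If \(y_w\) is always preferred, the BT likelihood is maximized only as the implicit reward gap grows without bound, so the sigmoid saturates (\(\sigma \to 1\)) and the KL anchor to \(\pi_{\text{ref}}\) stops constraining the policy. IPO instead regresses the log-ratio gap onto the finite target \(1/(2\beta)\), which keeps the gap bounded and avoids this "saturation overfit" failure.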
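A minimal sketch of the IPO loss as written above, assuming precomputed sequence log-probs; `ipo_loss` and the tuple layout are names chosen for this illustration. The point to notice is that the loss has its minimum at a finite gap of \(1/(2\beta)\) and rises again beyond it.

```python
def ipo_loss(batch, beta):
    """Mean IPO loss over a batch.

    batch: list of (logp_w, logp_l, ref_logp_w, ref_logp_l) tuples of
    summed sequence log-probs (assumed precomputed).
    """
    target = 1.0 / (2.0 * beta)
    total = 0.0
    for logp_w, logp_l, ref_logp_w, ref_logp_l in batch:
        gap = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
        total += (gap - target) ** 2
    return total / len(batch)

beta = 0.1  # target gap is 1 / (2 * 0.1) = 5 nats

# Hypothetical pairs with log-ratio gaps of 0, 5, and 20 nats.
below = ipo_loss([(-10.0, -10.0, -10.0, -10.0)], beta)
at_target = ipo_loss([(-7.5, -12.5, -10.0, -10.0)], beta)
overshoot = ipo_loss([(0.0, -20.0, -10.0, -10.0)], beta)
print(below, at_target, overshoot)
```

Unlike the DPO loss, which keeps decreasing as the gap grows, the squared loss penalizes overshooting the target.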

KTO — Kahneman-Tversky Optimization (Ethayarajh et al. 2024)

Works with unpaired binary labels ("response is good" or "response is bad"), not pairs. Loss is a prospect-theory-shaped function of the implicit reward:

\[ \mathcal{L}_{\text{KTO}} = \mathbb{E}_{y \sim \text{desired}}\!\left[ 1 - \sigma(\beta \log \tfrac{\pi_\theta}{\pi_{\text{ref}}} - z_0) \right] + \mathbb{E}_{y \sim \text{undesired}}\!\left[ 1 - \sigma(z_0 - \beta \log \tfrac{\pi_\theta}{\pi_{\text{ref}}}) \right] \]

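where \(z_0\) is a reference point: in the paper, an estimate of the KL divergence between \(\pi_\theta\) and \(\pi_{\text{ref}}\), computed over the batch and treated as a constant when taking gradients. The paper also weights the two terms with separate coefficients, omitted here. Useful when you have thumbs-up/down data but no pairwise comparisons.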
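A minimal sketch of the loss in the form written above, assuming the per-example log-ratios \(\log(\pi_\theta/\pi_{\text{ref}})\) are precomputed and that \(z_0\) is passed in as a plain number. How \(z_0\) is estimated is outside this sketch, and `kto_loss` is a name chosen for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(desired_log_ratios, undesired_log_ratios, beta, z0):
    """KTO-style loss on unpaired examples.

    Each list holds log(pi_theta / pi_ref) for one labeled response.
    Both lists are assumed non-empty; z0 is treated as a constant.
    """
    good = [1.0 - sigmoid(beta * lr - z0) for lr in desired_log_ratios]
    bad = [1.0 - sigmoid(z0 - beta * lr) for lr in undesired_log_ratios]
    return sum(good) / len(good) + sum(bad) / len(bad)

# Hypothetical values. Policy equal to reference and z0 = 0.
at_init = kto_loss([0.0, 0.0], [0.0], beta=0.1, z0=0.0)
# Desired responses became more likely, undesired ones less likely.
improved = kto_loss([4.0, 6.0], [-5.0], beta=0.1, z0=0.0)
print(at_init, improved)
```

Note that the desired and undesired lists need not have the same length or share prompts, which is what lets this loss use unpaired labels.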

ORPO — Odds Ratio Preference Optimization (Hong et al. 2024)

Combines SFT and preference loss into a single stage by adding a log-odds-ratio term to the SFT loss:

\[ \mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}}(y_w) - \lambda\, \log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w|x)}{\mathrm{odds}_\theta(y_l|x)} \right) \]

where \(\mathrm{odds}_\theta(y|x) = \pi_\theta(y|x) / (1 - \pi_\theta(y|x))\). Single-stage, no reference model needed; competitive with two-stage SFT+DPO.
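A minimal sketch of the ORPO objective, assuming the log-probabilities and the SFT negative log-likelihood of \(y_w\) are precomputed scalars. The preference term enters as a negative log-sigmoid, as in Hong et al., so that minimizing the loss raises the odds ratio. The names `log_odds` and `orpo_loss` are chosen for this illustration.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_odds(logp):
    """log(p / (1 - p)) computed from log p; requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(logp_w, logp_l, sft_nll_w, lam):
    """SFT loss on the winner plus lam times the odds-ratio penalty."""
    log_odds_ratio = log_odds(logp_w) - log_odds(logp_l)
    return sft_nll_w - lam * log_sigmoid(log_odds_ratio)

# Hypothetical log-probs; note that no reference model appears anywhere.
tied = orpo_loss(logp_w=-1.0, logp_l=-1.0, sft_nll_w=1.0, lam=0.5)
separated = orpo_loss(logp_w=-1.0, logp_l=-3.0, sft_nll_w=1.0, lam=0.5)
print(tied, separated)
```

With equal log-probs the log odds ratio is zero and the penalty is \(\lambda \log 2\); it shrinks as the winner becomes more likely than the loser.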

Comparison table

| Method | Data (pairs or binary) | Reference model | Single-stage | Notable property |
|---|---|---|---|---|
| PPO+RM | Pairs (via RM) | Yes | No (3 stages) | Online; needs rollouts |
| DPO | Pairs | Yes | No (SFT then DPO) | BT MLE in closed form |
| IPO | Pairs | Yes | No | Bounded implicit reward; less overfit |
| KTO | Binary | Yes | No | Works with thumbs-up/down |
| ORPO | Pairs | No | Yes | Folds SFT and preference |

Choosing \(\beta\) in DPO

Same role as in PPO: trust-region knob. Typical values \(\beta \in [0.1, 0.5]\) — note this is larger than in PPO because the loss surface and parameterization are different. Lab 01 uses \(\beta = 0.1\).

Effect:

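  • Small \(\beta\) → weak anchor. The log-ratio gap must grow large before the sigmoid saturates, so the policy can drift far from \(\pi_{\text{ref}}\).
  • Large \(\beta\) → strong anchor. The sigmoid saturates after a small log-ratio gap, so gradients vanish early and the policy stays close to \(\pi_{\text{ref}}\).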
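One way to see the anchor strength is to ask how large the log-ratio gap between winner and loser must become before the gradient weight \(\sigma(-\beta \cdot \text{gap})\) falls below some threshold. The sketch below uses a threshold of 0.05, an arbitrary choice for illustration; `gap_to_saturate` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_weight(gap, beta):
    """sigma(-beta * gap), where gap is the winner-minus-loser log-ratio."""
    return sigmoid(-beta * gap)

def gap_to_saturate(beta, weight=0.05):
    """Log-ratio gap (in nats) at which the gradient weight equals `weight`."""
    return math.log((1.0 - weight) / weight) / beta

for beta in (0.01, 0.1, 0.5):
    gap = gap_to_saturate(beta)
    print(f"beta={beta}: weight drops to 0.05 at a gap of {gap:.1f} nats")
```

The required gap scales as \(1/\beta\): halving \(\beta\) doubles how far the log-ratios can move before the per-example gradient effectively switches off.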

References

  • Rafailov et al. 2023, Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290.
  • Azar et al. 2023, A General Theoretical Paradigm to Understand Learning from Human Preferences (IPO). arXiv:2310.12036.
  • Ethayarajh et al. 2024, KTO: Model Alignment as Prospect Theoretic Optimization. arXiv:2402.01306.
  • Hong et al. 2024, ORPO: Monolithic Preference Optimization without Reference Model. arXiv:2403.07691.
  • Tunstall et al. 2023, Zephyr: Direct Distillation of LM Alignment. arXiv:2310.16944.
  • Ziebart 2010, Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy (PhD thesis, CMU).