
02 — Reward modeling: Bradley-Terry, reward hacking, and the U-curve¶

How to train a reward model from pairwise preferences, and why optimizing against it has an optimum (the relationship is non-monotonic).

We have preference data of the form \((x, y_w, y_l)\): a prompt \(x\), a "winning" response \(y_w\), and a "losing" response \(y_l\). We want a scalar reward function \(r_\phi(x, y)\).

The Bradley-Terry model¶

Bradley & Terry (1952) modeled pairwise preferences as a logistic competition between latent strengths. For two items with strengths \(r_w\) and \(r_l\):

\[ P(y_w \succ y_l \mid x) = \sigma(r_w - r_l) = \frac{\exp(r_w)}{\exp(r_w) + \exp(r_l)} \]

where \(\sigma\) is the sigmoid and \(r_w \equiv r_\phi(x, y_w)\).
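As a quick numerical check, the sigmoid form and the softmax form of the Bradley-Terry probability can be compared directly. This is a minimal sketch in plain Python; the function names `sigmoid` and `bt_prob` and the example rewards are illustrative choices, not part of the course code.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bt_prob(r_w: float, r_l: float) -> float:
    """P(y_w beats y_l) under Bradley-Terry, given two scalar rewards."""
    return sigmoid(r_w - r_l)

# Example rewards (made up). The two forms of the formula should agree.
r_w, r_l = 1.5, 0.5
softmax_form = math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))
print(bt_prob(r_w, r_l), softmax_form)
```

Equal rewards give a probability of 0.5, and swapping the two arguments gives the complementary probability.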

The reward-model loss¶

Maximize the log-likelihood of the observed preferences, which is the same as minimizing the negative log-likelihood:

\[ \mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right] \]

Three things to notice:

The loss only depends on the difference \(r_\phi(x, y_w) - r_\phi(x, y_l)\). The reward function is identifiable only up to an additive constant (per prompt). Anything that depends only on \(x\) cancels.
There is no notion of "calibration to an absolute scale." A reward of 100 means nothing on its own.
This is just logistic regression on the single feature \(r_\phi(x, y_w) - r_\phi(x, y_l)\), with the weight fixed at 1, no bias, and every label equal to 1, so the loss is a familiar one.
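The first observation above (only differences matter) is easy to verify numerically. The sketch below computes the BT loss on a few made-up reward pairs and then adds a different constant to each prompt's rewards; the helper names `log_sigmoid` and `bt_loss` and all numbers are assumptions for illustration.

```python
import math

def log_sigmoid(z: float) -> float:
    """Stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bt_loss(pairs):
    """Mean negative log-likelihood over (r_w, r_l) reward pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

# Toy rewards (made up): one (r_w, r_l) pair per prompt.
pairs = [(2.0, 0.5), (0.1, -0.3), (-1.0, 1.0)]

# Shift both rewards of each prompt by its own constant.
offsets = [100.0, -7.0, 3.5]
shifted = [(r_w + c, r_l + c) for (r_w, r_l), c in zip(pairs, offsets)]

# The two losses agree up to floating-point rounding.
print(bt_loss(pairs), bt_loss(shifted))
```

Because the per-prompt offset cancels inside the difference, a trained reward model is free to drift by any function of \(x\) alone, which is why raw reward values should only be compared across responses to the same prompt.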

Architecture in practice¶

Standard recipe (Stiennon et al. 2020, Ouyang et al. 2022):

Initialize \(r_\phi\) from the SFT model.
Replace the LM head with a single scalar linear head on top of the last hidden state of the final token (or of a pooled representation of all tokens).
Train on \((x, y_w, y_l)\) pairs with the BT loss.

For our §A13 grammar-tutor: SFT model has hidden size \(d=128\); the RM head is a single \(\mathbb{R}^{128} \to \mathbb{R}\) linear layer. Lab 00 implements exactly this.
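To make the shape of the head concrete, here is a framework-free sketch of a \(\mathbb{R}^{128} \to \mathbb{R}\) linear head with last-token or mean pooling. It is not the Lab 00 code: the class name `ScalarRewardHead`, the init scale, and the random stand-in hidden states are assumptions, and in a real implementation the same map would typically be a framework layer such as `torch.nn.Linear(128, 1)`.

```python
import random

HIDDEN_SIZE = 128  # d = 128, as stated for the grammar-tutor SFT model

class ScalarRewardHead:
    """A single linear map R^d -> R applied to one hidden-state vector."""

    def __init__(self, d: int = HIDDEN_SIZE, seed: int = 0):
        rng = random.Random(seed)
        # Small random init; the 0.02 scale is an arbitrary choice for this sketch.
        self.w = [rng.gauss(0.0, 0.02) for _ in range(d)]
        self.b = 0.0

    def __call__(self, hidden_states, pool: str = "last") -> float:
        """hidden_states: list of T vectors of length d, one per token."""
        if pool == "last":
            h = hidden_states[-1]  # hidden state of the final token
        elif pool == "mean":
            t = len(hidden_states)
            h = [sum(col) / t for col in zip(*hidden_states)]  # average over tokens
        else:
            raise ValueError(f"unknown pool: {pool}")
        return sum(wi * hi for wi, hi in zip(self.w, h)) + self.b

# Stand-in for transformer outputs: 5 tokens, each a 128-dimensional vector.
rng = random.Random(1)
fake_hidden = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)] for _ in range(5)]
head = ScalarRewardHead()
print(head(fake_hidden), head(fake_hidden, pool="mean"))
```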

Plackett-Luce: when you have rankings, not pairs¶

If your annotators rank \(K\) responses instead of doing pairwise comparisons, the Plackett-Luce model generalizes BT:

\[ P(y_{\pi(1)} \succ y_{\pi(2)} \succ \dots \succ y_{\pi(K)} \mid x) = \prod_{k=1}^{K} \frac{\exp(r_{\pi(k)})}{\sum_{j=k}^{K} \exp(r_{\pi(j)})} \]

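This is a "sequential softmax": at each step, pick the best from the remaining items. For \(K=2\) it reduces to BT exactly. InstructGPT (Ouyang et al. 2022) had labelers rank between \(K=4\) and \(K=9\) responses per prompt, then trained on all \(\binom{K}{2}\) pairwise comparisons from each ranking with the pairwise loss rather than a Plackett-Luce likelihood.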
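The "sequential softmax" reading translates directly into a loop over rank positions. The following sketch computes the Plackett-Luce log-likelihood of one ranking and checks the \(K=2\) reduction to BT; the function name `pl_log_likelihood` and the example rewards are illustrative assumptions.

```python
import math

def pl_log_likelihood(rewards_in_rank_order):
    """Log-probability of a full ranking under Plackett-Luce.

    rewards_in_rank_order[0] is the reward of the top-ranked response.
    """
    total = 0.0
    for k in range(len(rewards_in_rank_order)):
        remaining = rewards_in_rank_order[k:]
        m = max(remaining)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(r - m) for r in remaining))
        total += rewards_in_rank_order[k] - log_z
    return total

# K = 2 should match Bradley-Terry: log sigma(r_w - r_l).
r_w, r_l = 1.2, -0.4
bt_log_prob = -math.log1p(math.exp(-(r_w - r_l)))
print(pl_log_likelihood([r_w, r_l]), bt_log_prob)
```

With all rewards equal, every one of the \(K!\) orderings is equally likely, so the log-likelihood is \(-\log K!\); this makes a convenient sanity test.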

Reward hacking: the three classic modes¶

A reward model is an imperfect proxy for human preference. Any sufficiently strong optimizer (PPO, DPO, even decoding-time best-of-\(N\)) tends to find ways to score high without actually pleasing humans. The three canonical modes:

1. Length bias¶

Annotators systematically prefer longer responses (more detail = perceived effort). The RM learns "longer = better." PPO then produces verbose mush.

Grammar-tutor example: "I works yesterday" → the SFT model says "Use worked: past simple of regular verbs takes -ed." A length-biased RM prefers "Use worked: the past simple form of the regular verb to work is worked, which is formed by adding -ed, applied here because the action is in the past and the subject is first person singular I." Same correctness, roughly four times the words, often less useful for a learner.

Mitigations: length-controlled reward (Singhal et al. 2024: subtract a length penalty from the reward), length-balanced pairs in the training data.
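One simple way to picture a length-controlled reward is a linear penalty on response length. The sketch below is an assumption-heavy illustration: the raw RM scores, the coefficient `lam`, and the use of whitespace-separated words as "tokens" are all made up, and the exact interventions studied by Singhal et al. may differ.

```python
def length_penalized_reward(reward: float, num_tokens: int, lam: float = 0.01) -> float:
    """Subtract a penalty proportional to response length (lam is hypothetical)."""
    return reward - lam * num_tokens

short = "Use worked: past simple of regular verbs takes -ed."
verbose = ("Use worked: the past simple form of the regular verb to work is worked, "
           "which is formed by adding -ed, applied here because the action is in the "
           "past and the subject is first person singular I.")

# Made-up raw scores from a length-biased RM: the verbose answer scores higher.
raw_short, raw_verbose = 1.0, 1.2

# Whitespace word counts stand in for token counts in this sketch.
adj_short = length_penalized_reward(raw_short, len(short.split()))
adj_verbose = length_penalized_reward(raw_verbose, len(verbose.split()))

# After the penalty, the concise answer ranks first.
print(adj_short, adj_verbose)
```

The coefficient would need tuning in practice: too small and the bias survives, too large and the policy is pushed toward answers that are too terse to be useful.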

2. Sycophancy¶

Annotators reward responses that agree with the user. The RM learns "agree with the user." The optimized policy then tends not to correct the user, even when the user is wrong (Perez et al. 2022).

Grammar-tutor example: user says "I works yesterday is correct, right?" A sycophantic tutor says "Yes, that is fine." This violates the "honest" leg of HHH (helpful, honest, harmless).

Mitigation: explicit "honesty over agreement" examples in preference data; constitutional principles that name this failure.

3. Mode collapse¶

Optimization concentrates probability on a few "safe high-reward" answers. Diversity drops; the model produces the same canned response for many prompts (Khalifa et al. 2021).

Grammar-tutor example: the tutor learns the phrase "Use the past simple form" and applies it to every conjugation question, including ones about future tense.

Mitigations: KL-to-reference penalty (next chapter), entropy bonus.
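A drop in diversity can be monitored with a very simple statistic: the entropy of the empirical distribution over sampled responses. This is a rough sketch, not a method from the cited paper; the function name `response_entropy` and the example responses are made up, and exact string matching is a crude proxy for semantic diversity.

```python
import math
from collections import Counter

def response_entropy(responses):
    """Shannon entropy (nats) of the empirical distribution over exact response strings."""
    counts = Counter(responses)
    n = len(responses)
    return sum((c / n) * math.log(n / c) for c in counts.values())

# Responses to four different conjugation questions (made up).
diverse = ["Use worked.", "Use will work.", "Use works.", "Use is working."]
collapsed = ["Use the past simple form."] * 4

# The diverse set has entropy log(4); the collapsed set has entropy 0.
print(response_entropy(diverse), response_entropy(collapsed))
```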

The over-optimization U-curve (Gao et al. 2022)¶

This is arguably the most important empirical result in reward modeling.

Setup: Train a proxy RM. Sample \(N\) responses from a policy and pick the best of \(N\) by proxy-RM score. Measure the true reward (held-out, gold) of the chosen response as a function of \(N\), or equivalently as a function of the KL divergence between the optimized policy and the base policy, which grows monotonically with \(N\).

Result: True reward rises, then falls, tracing an inverted U.

true
reward
  |              ___-----___
  |          _--'           '--_
  |       _-'                   '-_
  |     _'                         '-_
  |   _'                              '-_
  |  /                                   '-
  +------------------------------------------> KL(π_opt || π_base)
  0   (under-optimized)        (over-optimized)

The optimization is adversarial against the RM. At small KL, you find genuinely good responses that the RM agrees about. At large KL, you find responses that exploit RM idiosyncrasies — the RM scores them high but a human would not.
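The best-of-\(N\) setup above can be sketched in a few lines. The selection step is just an argmax under the proxy RM, and the KL between the best-of-\(N\) policy and the base policy has the analytic form \(\log N - (N-1)/N\), which is the expression Gao et al. use for the x-axis. The function names and the toy length-based "proxy RM" below are assumptions for illustration.

```python
import math

def best_of_n(candidates, proxy_reward):
    """Return the candidate with the highest proxy-RM score."""
    return max(candidates, key=proxy_reward)

def kl_best_of_n(n: int) -> float:
    """Analytic KL (nats) between the best-of-n policy and the base policy."""
    return math.log(n) - (n - 1) / n

# Toy proxy RM that simply prefers longer strings (a deliberately length-biased stand-in).
candidates = ["Use worked.", "Use worked: past simple takes -ed.", "Worked."]
chosen = best_of_n(candidates, proxy_reward=len)
print(chosen)

# KL grows slowly (roughly like log N), so large N is needed to reach large KL.
for n in (1, 4, 16, 64):
    print(n, kl_best_of_n(n))
```

With \(N=1\) nothing is selected and the KL is exactly zero; increasing \(N\) moves rightward along the x-axis of the curve.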

Gao et al. fit this with a closed-form:

\[ R_{\text{gold}}(d) = d\,(\alpha_{\text{bon}} - \beta_{\text{bon}} d) \]

for best-of-\(N\), where \(d = \sqrt{\mathrm{KL}}\). The form for RL (PPO) swaps the linear penalty for a logarithmic one: \(R_{\text{gold}}(d) = d\,(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d)\). The empirical observation in both cases: the optimum is at finite, non-zero KL, not infinity. This is why the KL penalty in PPO is load-bearing, not optional (chapter 03).
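For the best-of-\(N\) form, the location of the peak follows from setting the derivative to zero: \(\alpha_{\text{bon}} - 2\beta_{\text{bon}} d = 0\), so \(d^* = \alpha_{\text{bon}} / (2\beta_{\text{bon}})\) and the peak gold reward is \(\alpha_{\text{bon}}^2 / (4\beta_{\text{bon}})\). The sketch below evaluates this; the coefficient values are made up for illustration and are not fitted values from the paper.

```python
import math

def gold_reward_bon(kl: float, alpha: float, beta: float) -> float:
    """Best-of-N functional form: R(d) = d * (alpha - beta * d), with d = sqrt(KL)."""
    d = math.sqrt(kl)
    return d * (alpha - beta * d)

def bon_peak(alpha: float, beta: float):
    """Return (KL at the peak, gold reward at the peak) for the best-of-N form."""
    d_star = alpha / (2.0 * beta)
    return d_star ** 2, alpha ** 2 / (4.0 * beta)

# Hypothetical coefficients, chosen only so the arithmetic is easy to follow.
alpha, beta = 1.0, 0.25
kl_star, r_star = bon_peak(alpha, beta)
print(kl_star, r_star)  # 4.0 1.0

# Gold reward is lower on either side of the peak.
print(gold_reward_bon(1.0, alpha, beta), gold_reward_bon(9.0, alpha, beta))  # 0.75 0.75
```

A larger \(\beta_{\text{bon}}\) relative to \(\alpha_{\text{bon}}\) pulls the peak toward smaller KL and lowers it.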

Quality of the RM matters¶

RMs of different sizes, or trained on different amounts of preference data, give very different over-optimization curves. Larger RMs are more robust (the U-curve peak is higher and at higher KL). For our CPU-only labs, the RM is small and the curve will be sharp, so Lab 00 should show the peak at low \(N\).

What we'll measure in Lab 00¶

Train accuracy under the BT loss (fraction of pairs with \(r_\phi(x, y_w) > r_\phi(x, y_l)\)): should reach >90% on the 160-pair train split.
Held-out pair accuracy: 40 pairs; target >70%.
Reward distribution: histograms of \(r_\phi(x, y_w)\) vs \(r_\phi(x, y_l)\); the two distributions should separate visibly.
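The two accuracy metrics above can be computed from the RM's scores alone. This is a minimal sketch, assuming accuracy means the fraction of pairs where the winner scores strictly higher; the function name `pair_accuracy` and the toy scores are illustrative and not taken from Lab 00.

```python
def pair_accuracy(pairs):
    """Fraction of (r_w, r_l) pairs in which the winner scores strictly higher."""
    return sum(r_w > r_l for r_w, r_l in pairs) / len(pairs)

# Toy held-out scores (made up): the RM gets three of four pairs right.
held_out = [(1.3, 0.2), (0.4, -0.1), (-0.5, 0.6), (2.0, 1.9)]
print(pair_accuracy(held_out))  # 0.75
```

A large gap between train and held-out accuracy on a dataset this small is a sign that the RM is memorizing pairs instead of learning a transferable preference signal.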

Cross-links¶

Lab 00 — Reward Model from Preferences: implements the BT loss on §A13 data.
Phase 19 — Training Dynamics: the U-curve is a training-dynamics phenomenon.

References¶

Bradley & Terry 1952, Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4).
Stiennon et al. 2020, Learning to summarize from human feedback. arXiv:2009.01325.
Gao, Schulman, Hilton 2022, Scaling Laws for Reward Model Overoptimization. arXiv:2210.10760.
Perez et al. 2022, Discovering Language Model Behaviors with Model-Written Evaluations (sycophancy). arXiv:2212.09251.
Singhal et al. 2024, A Long Way to Go: Investigating Length Correlations in RLHF. arXiv:2310.03716.