
04 — Alignment Tuning Survey (DPO, ORPO, SimPO, RLHF, RLAIF)¶

Conceptual survey. After SFT (Phase 28 §1) comes an optional step: aligning the model's preferences. RLHF does this with a reward model + PPO; DPO does it in closed form with a cross-entropy-style loss over pairs of responses. This document explains the ideas; we implement none of them in Phase 28. But understanding RLHF means understanding the last step of the modern pipeline.

Why this section exists¶

LoRA + SFT (the actual content of Phase 28) is step one of a two-step pipeline for modern fine-tuning:

SFT — teach the model the right format and capability on labelled examples.
Alignment tuning — teach the model the right preferences when multiple correct answers exist.

For our irregular-verb specialization, step 1 is sufficient. The right answer to He __ to school yesterday is went — no preference ambiguity. So Phase 28 stops here.

But alignment tuning is one of the most active areas of modern LLM research, and not knowing it would leave a gap. This file is a read-only survey.

The setup that motivates alignment¶

Suppose a model has been SFT'd on instruction-following. Given a prompt, it can produce reasonable completions. But:

For "Write me a poem about autumn", every completion is grammatically correct and reasonably on-topic. Which one is better?
For "Explain photosynthesis to a 5-year-old", the model can produce technically correct but inappropriate (dense, jargony) outputs.
For "Is this email phishing?", a calibrated probability matters more than a bare yes/no answer.

SFT trains the model on a single target completion per prompt. It doesn't tell the model how to rank completions. Alignment tuning addresses this.

RLHF — the canonical recipe¶

Reinforcement Learning from Human Feedback (Ouyang et al., InstructGPT, 2022; Christiano et al., 2017).

Three stages:

Stage A: SFT.¶

As covered in theory 01.

Stage B: Reward model training.¶

Collect pairs (prompt, completion_A, completion_B, label) where label ∈ {A, B} indicates which completion the human annotator preferred. Train a reward model r_φ(prompt, completion) → ℝ such that:

\[ \Pr(\text{A preferred over B}) = \sigma(r_\phi(p, A) - r_\phi(p, B)) \]

The reward model is typically initialized from the SFT model (same weights), with the LM head replaced by a scalar regression head. Loss: binary cross-entropy on the preference label, which is the negative log-likelihood of the Bradley-Terry model above.
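The per-pair loss described above can be sketched in a few lines. This is a minimal illustration, assuming the two scalar rewards have already been computed by some reward model; the function names `log_sigmoid` and `preference_loss` are hypothetical and not part of Phase 28.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def preference_loss(r_a: float, r_b: float, label: str) -> float:
    """Negative log-likelihood of one human label under Bradley-Terry.

    r_a, r_b: scalar rewards r_phi(p, A) and r_phi(p, B).
    label:    "A" or "B", the completion the annotator preferred.
    """
    if label not in ("A", "B"):
        raise ValueError("label must be 'A' or 'B'")
    # Pr(A preferred over B) = sigmoid(r_a - r_b); flip the margin if B won.
    margin = r_a - r_b if label == "A" else r_b - r_a
    return -log_sigmoid(margin)


# The loss shrinks as the reward model scores the preferred completion higher.
print(preference_loss(0.0, 0.0, "A"))  # log 2, about 0.693: no information yet
print(preference_loss(2.0, 0.0, "A"))  # smaller: the model agrees with the label
print(preference_loss(2.0, 0.0, "B"))  # larger: the model disagrees with the label
```

Only the difference of the two rewards enters the loss, so the reward model is identified only up to an additive constant per prompt.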

Stage C: RL fine-tuning with PPO.¶

Treat the model as a policy π_θ(completion | prompt). Run RL with reward = r_φ(prompt, completion), using Proximal Policy Optimization (Schulman et al., 2017), whose clipped objective limits how far each update can move the policy and so keeps training stable. Critical addition: a KL penalty to the SFT model:

\[ \mathcal{L}_{\text{RLHF}}(\theta) = - \mathbb{E}_{p, c \sim \pi_\theta} [r_\phi(p, c)] + \beta \cdot \mathbb{E}_{p} [D_{\text{KL}}(\pi_\theta(\cdot | p) \| \pi_{\text{SFT}}(\cdot | p))] \]

The KL term prevents reward hacking — the model finding completions that score high under r_φ but are gibberish or unnatural under the original distribution.
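To make the role of β concrete, here is a rough sketch of a Monte Carlo estimate of the objective above. It is not PPO; it only evaluates the quantity PPO would optimize. It assumes each sampled completion comes with its reward and its sequence log-probability under both π_θ and π_SFT; the function names and the numbers are made up for illustration.

```python
def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_sft: float, beta: float) -> float:
    """Reward minus beta times a single-sample KL estimate.

    For a completion c sampled from pi_theta, log pi_theta(c|p) - log pi_SFT(c|p)
    is an unbiased single-sample estimate of KL(pi_theta || pi_SFT).
    """
    return reward - beta * (logp_policy - logp_sft)


def rlhf_loss_estimate(samples, beta: float) -> float:
    """Average negative penalized reward over completions sampled from pi_theta.

    samples: iterable of (reward, logp_policy, logp_sft) tuples.
    """
    samples = list(samples)
    total = sum(kl_penalized_reward(r, lp, ls, beta) for r, lp, ls in samples)
    return -total / len(samples)


# Illustrative numbers only: a "hacked" completion scores higher under r_phi
# but is extremely unlikely under the SFT model.
natural = (3.0, -30.0, -32.0)   # (reward, log pi_theta, log pi_SFT)
hacked = (5.0, -20.0, -200.0)

for beta in (0.0, 0.05):
    print(beta, kl_penalized_reward(*natural, beta), kl_penalized_reward(*hacked, beta))
```

With β = 0 the hacked completion looks better; with β = 0.05 its large log-ratio outweighs the extra reward, which is the mechanism the KL term relies on.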

Pros: Mature; the technique behind GPT-3.5, GPT-4, Claude, LLaMA-2-chat. Empirically effective.

Cons: Three-stage pipeline. Reward model is itself an LLM — heavy. PPO is finicky to tune (learning rate, batch composition, KL coefficient β). Reward hacking creeps in via subtle paths. Requires a separate sampling step at training time (you're rolling out completions to evaluate, not just doing backprop on labels).

DPO — direct preference optimization¶

Direct Preference Optimization (Rafailov et al., 2023).

The brilliant observation: stages B and C of RLHF can be collapsed. There's a closed-form expression for the optimal policy under the RLHF objective, and it lets you derive a single loss function that operates directly on preference pairs (prompt, win, lose).

The DPO loss:

\[ \mathcal{L}_{\text{DPO}}(\theta) = - \mathbb{E}_{(p, y_w, y_l) \sim D} \log \sigma\Bigl( \beta \log \frac{\pi_\theta(y_w | p)}{\pi_{\text{ref}}(y_w | p)} - \beta \log \frac{\pi_\theta(y_l | p)}{\pi_{\text{ref}}(y_l | p)} \Bigr) \]

Where π_ref is the SFT model (the reference) and β plays the same role as the KL coefficient in RLHF.

Mechanically: this is binary cross-entropy on a 2-class problem (win vs lose), where the logit is β times the difference between the two log-likelihood ratios. No reward model. No RL. Just gradient descent on preference pairs.
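The DPO loss for a single preference pair can be sketched as follows. This assumes the four sequence log-probabilities (policy and frozen reference, for the winning and losing completion) have already been computed; the function names and the default β = 0.1 are illustrative choices, not something Phase 28 defines.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, win, lose) triple.

    logp_*:     log pi_theta(y | p), summed over the tokens of y.
    ref_logp_*: log pi_ref(y | p) from the frozen SFT model.
    """
    log_ratio_w = logp_w - ref_logp_w   # log pi_theta(y_w|p) / pi_ref(y_w|p)
    log_ratio_l = logp_l - ref_logp_l   # log pi_theta(y_l|p) / pi_ref(y_l|p)
    return -log_sigmoid(beta * (log_ratio_w - log_ratio_l))


# At initialization the policy equals the reference, so the logit is 0.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))   # log 2, about 0.693
# Raising the winner's likelihood relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

Note that the reference model still has to be run forward on every pair to obtain `ref_logp_w` and `ref_logp_l`; it is only the reward model and the sampling loop that disappear.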

Pros: Single-stage. Stable. Easier to implement and hyperparameter-tune than PPO. The original paper reports quality comparable to or better than PPO-based RLHF on its sentiment, summarization, and dialogue benchmarks. Widely adopted since its release in 2023.

Cons: Still requires a paired-preference dataset. Sensitive to the choice of β. Doesn't naturally handle unpaired feedback (single-rating scales).

ORPO and SimPO — the further simplifications¶

ORPO (Odds Ratio Preference Optimization, Hong et al. 2024) drops the reference policy π_ref entirely. The loss only depends on π_θ:

\[ \mathcal{L}_{\text{ORPO}} \propto -\log \sigma\Bigl( \log \frac{\pi_\theta(y_w | p)}{1 - \pi_\theta(y_w | p)} - \log \frac{\pi_\theta(y_l | p)}{1 - \pi_\theta(y_l | p)} \Bigr) + \lambda \cdot \mathcal{L}_{\text{SFT}} \]

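Combines SFT and preference learning into a single training step. No reference model means no second copy of the network to hold in memory or run forward passes through. (Hong et al. attach the weight λ to the odds-ratio term rather than to the SFT term; the two forms differ only by an overall scale.)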
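A rough sketch of the ORPO loss for one pair, following the form written above (odds-ratio term plus λ times the SFT term). One assumption to flag: to the best of my understanding, the ORPO paper takes π_θ(y | p) here to be the length-normalized likelihood (the average per-token log-probability, exponentiated), which keeps the probability away from zero for long sequences; the code takes that average log-probability as input. Function names and the default λ are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) where p = exp(avg_logp); requires avg_logp < 0."""
    if avg_logp >= 0.0:
        raise ValueError("average log-probability must be negative")
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_w: float, avg_logp_l: float, lam: float = 1.0) -> float:
    """Odds-ratio preference term plus lam times the SFT (NLL) term.

    avg_logp_w, avg_logp_l: mean per-token log pi_theta for win and lose
    (assumed length-normalized; see the note above the code).
    """
    odds_ratio_term = -log_sigmoid(log_odds(avg_logp_w) - log_odds(avg_logp_l))
    sft_term = -avg_logp_w              # NLL of the winning completion
    return odds_ratio_term + lam * sft_term


print(orpo_loss(-0.5, -0.5, lam=0.0))   # log 2, about 0.693: no preference yet
print(orpo_loss(-0.2, -1.5, lam=0.0))   # smaller: winner is already favoured
print(orpo_loss(-0.2, -1.5, lam=1.0))   # same pair with the SFT term added
```

No reference log-probabilities appear anywhere in the function, which is the point of the method.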

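SimPO (Simple Preference Optimization, Meng et al., 2024) is also reference-free. It uses the length-normalized log-probability of a response as the implicit reward and adds a target margin γ between the winning and losing rewards, leaving two hyperparameters (β and γ). Its authors report stronger results than DPO on several chat benchmarks.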
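The two ingredients named above, length-normalized log-probabilities and a margin, can be sketched as follows. This is a minimal illustration of the per-pair SimPO loss as I understand it from the paper; the function names and the default values of β and γ are placeholders, not recommended settings.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def simpo_loss(logp_w: float, len_w: int, logp_l: float, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO-style loss for one (prompt, win, lose) triple.

    logp_*: summed token log-probabilities under pi_theta.
    len_*:  number of tokens in each completion.
    gamma:  target margin the winner's reward must exceed the loser's by.
    """
    reward_w = beta * logp_w / len_w    # length-normalized implicit reward
    reward_l = beta * logp_l / len_l
    return -log_sigmoid(reward_w - reward_l - gamma)


# Same average per-token log-probability, different lengths: rewards tie.
print(simpo_loss(-10.0, 10, -5.0, 5, gamma=0.0))   # log 2, about 0.693
# With a positive margin, a tie is no longer good enough.
print(simpo_loss(-10.0, 10, -5.0, 5, gamma=1.0))
```

Dividing by length removes the built-in advantage that shorter completions have when raw summed log-probabilities are compared.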

The point is: the alignment-tuning literature is rapidly converging on simple, single-stage, paired-preference losses with one or two hyperparameters. The RLHF stack is now seen as the heavyweight baseline that simpler methods aim to match.

RLAIF — when humans are the bottleneck¶

RL from AI Feedback (Bai et al., Anthropic's Constitutional AI, 2022). Replace human preference annotators with a strong LLM. The LLM ranks completions, optionally following a "constitution" (a list of principles like "be helpful but not harmful").
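As a sketch of the data flow only: the snippet below turns candidate pairs into (prompt, win, lose) triples using a judge callable. In a real RLAIF setup the judge would wrap a call to a strong LLM, possibly with a constitution in its prompt; here `toy_judge` is a deliberately trivial stand-in, and all names are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# A judge takes (prompt, completion_a, completion_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def label_pairs(candidates: Iterable[Tuple[str, str, str]],
                judge: Judge) -> List[Tuple[str, str, str]]:
    """Convert (prompt, a, b) candidates into (prompt, win, lose) triples."""
    triples = []
    for prompt, a, b in candidates:
        verdict = judge(prompt, a, b)
        if verdict == "A":
            triples.append((prompt, a, b))
        elif verdict == "B":
            triples.append((prompt, b, a))
        # Any other verdict (e.g. an unparseable answer) is dropped.
    return triples


def toy_judge(prompt: str, a: str, b: str) -> str:
    """Stand-in for an LLM ranker: prefers the shorter completion."""
    return "A" if len(a) <= len(b) else "B"


data = [("Explain photosynthesis to a 5-year-old",
         "Plants eat sunlight to make their food.",
         "Photosynthesis is the photochemical reduction of carbon dioxide.")]
print(label_pairs(data, toy_judge))
```

The output has the same shape as a human-labelled preference dataset, so it can feed reward-model training or a paired-preference loss unchanged.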

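Pros: Scalable; doesn't bottleneck on annotator throughput or budget. Annotation cost drops by roughly two orders of magnitude, from on the order of dollars per pair (human) to on the order of cents per pair (LLM); exact figures depend on the task and the ranker.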

Cons: Inherits the ranking model's biases. Risk of preference distillation (the trained model becomes a copy of the ranker, not better than it).

Used by: Anthropic for Claude, increasingly by other frontier labs as a synthetic-data augmentation.

How LoRA composes with alignment training¶

A common recipe: SFT with LoRA (as in Phase 28), then DPO with a second LoRA trained on top of the SFT adapter. Two LoRA stages, both cheap. The final inference recipe: load the base model once, load both LoRA adapters, then either merge them into the base weights or apply them on the fly.

This is mechanically straightforward but adds complexity (managing multi-stage adapters). A common simplification is to merge the LoRA weights after each stage, so that the staging structure isn't visible at inference.
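The "merge or apply on the fly" choice can be illustrated with a toy weight matrix. The sketch below uses plain Python lists and the usual LoRA update ΔW = (α / r) · B · A; it assumes the DPO adapter was trained on top of the SFT adapter, so both updates add onto the same base weights. The helper names and the tiny matrices are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[s * x for x in row] for row in a]


def merge(weight, adapters):
    """Fold every adapter (B, A, alpha, r) into the base weight matrix."""
    for b_mat, a_mat, alpha, r in adapters:
        weight = add(weight, scale(matmul(b_mat, a_mat), alpha / r))
    return weight


def forward_unmerged(weight, adapters, x):
    """Apply the adapters on the fly: W x + sum of (alpha / r) * B (A x)."""
    y = matmul(weight, x)
    for b_mat, a_mat, alpha, r in adapters:
        y = add(y, scale(matmul(b_mat, matmul(a_mat, x)), alpha / r))
    return y


base = [[1, 0], [0, 1]]
sft_adapter = ([[1], [0]], [[0, 2]], 1, 1)   # rank-1: (B, A, alpha, r)
dpo_adapter = ([[0], [1]], [[3, 0]], 2, 1)
adapters = [sft_adapter, dpo_adapter]
x = [[1], [1]]                               # column vector

print(matmul(merge(base, adapters), x))      # merged weights, then one matmul
print(forward_unmerged(base, adapters, x))   # same result, adapters kept separate
```

Both paths give the same output; merging trades the ability to swap adapters at runtime for a single dense matmul per layer.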

Where Phase 28 stops¶

Phase 28 implements SFT with LoRA on the irregular-verb conjugation task. We don't implement:

A reward model.
PPO or DPO loss functions.
Constitutional AI prompts.
Multi-stage LoRA chains.

A motivated learner could extend Phase 28 with a small DPO experiment: build a preference dataset that ranks went > goed > wented (decreasing acceptability), then run DPO on the resulting pairs. No separate preference model is needed, since DPO trains directly on the pairs. This is a side-quest, not in the DoD.
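The ranking went > goed > wented can be expanded mechanically into the (prompt, win, lose) triples that DPO consumes. A minimal sketch, with a hypothetical helper name and the example prompt from earlier in this file:

```python
from itertools import combinations


def pairs_from_ranking(prompt, ranked):
    """Expand a best-first ranking into (prompt, win, lose) triples.

    combinations() preserves input order, so in every emitted pair the
    first element is ranked above the second.
    """
    return [(prompt, win, lose) for win, lose in combinations(ranked, 2)]


triples = pairs_from_ranking("He __ to school yesterday",
                             ["went", "goed", "wented"])
for triple in triples:
    print(triple)
```

A ranking of n items yields n(n-1)/2 pairs. Whether to keep pairs in which both sides are wrong (goed vs wented) is a design choice left to the experimenter.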

A note on the politics¶

Alignment tuning is the proximate site of most "safety" and "values" engineering in modern LLMs. The choice of preference dataset, the constitution, the ranker model — all encode the values of the team running them. This is a feature, not a bug: the alternative ("ship the base model unedited") shifts the responsibility to the deployer without removing the embedded values.

The mathematics is value-neutral. The data is not. Phase 28 sidesteps this entirely (the corpus is English verb grammar; there's a unique correct answer per prompt). At larger scales the data composition is the decisive variable.

Drill problems¶

These are conceptual; no compute is needed. Solutions are available at phase open in solutions/04-alignment-survey-ref.md.

Why does RLHF use a separate reward model rather than just using the human preference labels directly as a supervised signal on the language model?
Derive the DPO loss informally: starting from the RLHF objective (reward minus KL penalty), apply the Bradley-Terry preference model and the closed-form optimal policy. Sketch the algebra.
RLHF has a "reward hacking" failure mode where the model finds completions scoring high under the reward model but degenerate under the original distribution. How does the KL penalty prevent this? What happens if β is too small? Too large?
Why is ORPO's dropping of the reference policy π_ref plausible but risky? What does π_ref provide that ORPO replaces with the λ · L_SFT term?
Suppose we wanted to add DPO to Phase 28 as a side-quest. Sketch the preference dataset: what pairs (prompt, win, lose) would teach the model to prefer correct irregular conjugations?

One-paragraph recap¶

After SFT teaches a model what to say, alignment tuning teaches it which way to say it. RLHF is the canonical three-stage recipe (SFT → reward model → PPO with KL penalty), but it is heavy. DPO collapses stages B and C into a single closed-form loss on preference pairs, which greatly simplifies the pipeline. ORPO and SimPO push further by removing the reference policy. RLAIF replaces human annotators with strong LLMs for scale. LoRA composes with all of these, typically as separate adapter stages for SFT and DPO. Phase 28 implements only SFT-with-LoRA; the alignment-tuning surface is surveyed here for completeness but not exercised. The mathematics is general; the choice of which preferences to train on encodes the values of the team running the training.

Next: lab/00-lora-by-hand.md.