00 — Motivation: why preference alignment after pretraining + SFT
After pretraining and SFT, the model imitates its corpus, but not necessarily what we want. Preference alignment closes that gap.
The pipeline so far
A modern assistant is built in three stages:
- Pretraining (Phase 17–18): minimize next-token cross-entropy on a large corpus. Result: a model that continues text.
- Supervised fine-tuning (SFT): train on (prompt, ideal response) pairs. Result: a model that follows instructions, but only as well as the demonstrators.
- Preference alignment (this module): RLHF / DPO / RLAIF. Result: a model that produces responses humans prefer over alternatives.
The third stage is necessary because the first two share a fundamental limitation: the imitation gap.
The imitation gap
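SFT is imitation learning. Its loss is the negative log-likelihood of the demonstration tokens:

\[
\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{SFT}}} \left[ \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}) \right]
\]

Here \(\pi_\theta\) is the model, \(x\) is the prompt, \(y\) is the demonstrator's response, and \(\mathcal{D}_{\text{SFT}}\) is the set of (prompt, ideal response) pairs. Minimizing this loss maximizes the likelihood of the demonstrators' tokens. Two problems: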
- Ceiling problem. Pure imitation gives the model no signal that pushes it past the demonstrators. If 80% of human demonstrators write competent but mediocre grammar explanations, the model will too.
- Distributional problem. SFT raises the probability of everything in the data, including stylistic noise, hedging, and padding. It does not learn the ranking between a great response and a merely acceptable one.
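The SFT objective described above can be sketched in a few lines. This is a minimal illustration, not a training loop: the function name `sft_loss` and the per-token probabilities are hypothetical, and a real implementation would compute these probabilities from model logits.

```python
import math

def sft_loss(token_probs):
    """Average negative log-likelihood of the demonstrator's tokens.

    token_probs[t] is the probability the model assigns to the
    demonstrator's t-th token, given the prompt and earlier tokens.
    (Hypothetical helper for illustration only.)
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities: a model that matches the demonstration well,
# and one that matches it poorly.
close_match = sft_loss([0.9, 0.8, 0.95])
poor_match = sft_loss([0.3, 0.2, 0.5])

# The loss only rewards agreement with the one demonstration it is given.
print(close_match < poor_match)
```

Note that the function takes a single response. Nothing in it can express "this response is better than that one", which is the distributional problem in code form.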
Preference alignment addresses both because the signal is comparative: "given \(x\), the winning response \(y_w\) is preferred over the losing response \(y_l\)". That is a ranking, not a single target.
Why ranks beat targets
A human can reliably judge "A is better than B" even when she cannot write an optimal A from scratch. This is the same insight as in chess: it is easier to evaluate a position than to find the best move. Preference data is therefore:
- Cheaper per useful bit than demonstration data (Bai et al. 2022).
- More aligned with deployment (users compare your model to alternatives; they don't write the gold).
- Better at suppressing bad behaviors (you can downweight \(y_l\), not only upweight \(y_w\)).
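The last point can be made concrete with a pairwise loss. The sketch below assumes the Bradley-Terry form that is commonly used for reward models; the source does not commit to a specific loss here, and the names `pairwise_loss` and `pairwise_grads` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_loss(score_w, score_l):
    """Negative log-probability that y_w beats y_l (Bradley-Terry assumption)."""
    return -math.log(sigmoid(score_w - score_l))

def pairwise_grads(score_w, score_l):
    """Gradient of pairwise_loss with respect to (score_w, score_l)."""
    p_wrong = 1.0 - sigmoid(score_w - score_l)  # probability mass on the wrong ordering
    return -p_wrong, p_wrong

# With equal scores, the two gradients are equal and opposite.
grad_w, grad_l = pairwise_grads(0.0, 0.0)
print(grad_w, grad_l)
```

A gradient-descent step moves each score against its gradient, so the score of \(y_w\) goes up and the score of \(y_l\) goes down in the same update. A demonstration-only loss has no counterpart to the second term.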
The HHH framing
Anthropic's framing (Askell et al. 2021) sets the target of alignment:
- Helpful — completes the user's task.
- Honest — does not assert false things; expresses calibrated uncertainty.
- Harmless — refuses to help with clearly harmful requests.
These three can pull in different directions. For example, helpfulness conflicts with harmlessness when a user asks "tell me how to do X harmful thing". Preference alignment is the mechanism by which a lab encodes its trade-off between them. There is no single "correct" assistant; there is the assistant your preference data describes.
Grammar-tutor framing for §A13
For the §A13 grammar-tutor, HHH specializes to:
- Helpful: the tutor proposes the correct conjugation correction.
- Honest: the tutor flags when the input sentence is already correct rather than inventing a fake error (a real failure mode of SFT-only tutors).
- Harmless: less salient here, but the tutor should not invent slurs, profanity, or misleading meta-claims like "this is how all English speakers say it."
The labs in this module instantiate these three concretely.
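As an illustration of how one of these criteria might appear in preference data, here is a sketch of a single pairwise record for the "Honest" case. The `PreferencePair` class, its field names, and the example text are all hypothetical; the actual lab data format may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    """One pairwise preference (hypothetical schema for illustration)."""
    prompt: str    # x
    chosen: str    # y_w, the preferred response
    rejected: str  # y_l, the dispreferred response

# Hypothetical record: the input is already correct, so the honest tutor
# says so, while the rejected response invents an error.
honest_pair = PreferencePair(
    prompt="Correct this sentence: 'She goes to school every day.'",
    chosen="This sentence is already correct. 'Goes' agrees with the subject 'she'.",
    rejected="Correction: 'She go to school every day.'",
)

print(honest_pair.chosen)
```

Both responses are fluent, so a demonstration-only dataset would not separate them. The pair does, by stating which one is preferred.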
What this module will do
| Lab | What you train | Signal |
|---|---|---|
| 00 | A reward model (RM) on 200 pairwise preferences | "\(y_w\) better than \(y_l\)" labels |
| 01 | DPO-fine-tune the Phase 28 LoRA tutor | Same pairs, but no separate RM — DPO collapses RM + RL into one loss |
| 02 | Constitutional revision loop | The model critiques and revises its own outputs against a written constitution; you distill the revisions back via SFT |
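To make the Lab 01 row more concrete, here is a minimal sketch of the commonly cited DPO objective for a single preference pair. It assumes you already have the summed log-probability of each full response under the policy being trained and under a frozen reference model (typically the SFT model). The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not the lab's actual code.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (y_w, y_l) pair, given sequence log-probabilities.

    logp_*     : log pi_theta(y | x) under the policy being trained
    ref_logp_* : log pi_ref(y | x) under the frozen reference model
    beta       : strength of the implicit penalty for drifting from the reference
    """
    # How much more the policy favors y_w over y_l than the reference does.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(beta * margin)), written with log1p for readability.
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: the loss starts at log(2).
start = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy has shifted probability toward y_w and away from y_l: the loss drops.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)

print(improved < start)
```

The log-ratio against the reference plays the role of a reward, which is why no separate reward model appears in the function.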
Cross-links
- Phase 18 — Training Loop: the SFT stage that precedes everything here.
- Phase 20 — Evaluation Harness: the eval that catches the imitation-gap failures.
References
- Askell et al. 2021, A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861.
- Ouyang et al. 2022, Training language models to follow instructions with human feedback. arXiv:2203.02155.
- Bai et al. 2022, Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862.