
00 — Motivation: why preference alignment after pretraining + SFT

After pretraining and SFT, the model imitates the corpus but not necessarily what we want. Preference alignment closes that gap.

The pipeline so far

A modern assistant is built in three stages:

1. Pretraining (Phase 17–18): minimize next-token cross-entropy on a large corpus. Result: a model that continues text.
2. Supervised fine-tuning (SFT): train on (prompt, ideal response) pairs. Result: a model that follows instructions, but only as well as the demonstrators.
3. Preference alignment (this module): RLHF / DPO / RLAIF. Result: a model that produces responses humans prefer over alternatives.

Stage 3 is necessary because stages 1 and 2 share a fundamental limitation: the imitation gap.

The imitation gap

SFT is imitation learning. Its loss is

\[ \mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})\right] \]

This maximizes the likelihood of the demonstrators' tokens. Two problems:

Ceiling problem. Imitation gives the model no signal for exceeding the demonstrators. If 80% of human demonstrators write competent but mediocre grammar explanations, the model learns to write mostly mediocre explanations too.
Distributional problem. SFT raises the likelihood of everything in the data, including stylistic noise, hedging, and padding. The loss has no term that ranks a great response above a merely acceptable one.

Preference alignment fixes both because the signal is comparative: "given \(x\), response \(y_w\) is preferred over \(y_l\)" — a ranking, not a single target.
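As a rough illustration of the SFT loss above, the following minimal Python sketch computes it from per-token probabilities. The function names and the probability values are hypothetical, chosen only for illustration; a real training loop would take these probabilities from the model's softmax output.

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of one demonstrated response.

    token_probs[t] stands for pi_theta(y_t | x, y_<t); the values passed
    in below are made up for illustration, not real model outputs.
    """
    return -sum(math.log(p) for p in token_probs)

def sft_loss(batch):
    """Empirical SFT loss: mean sequence NLL over the demonstrations."""
    return sum(sequence_nll(probs) for probs in batch) / len(batch)

# Two hypothetical demonstrations for the same prompt. Nothing in the
# inputs says which one a human would prefer, so the loss cannot say either.
great_response = [0.9, 0.8, 0.7]
acceptable_response = [0.9, 0.8, 0.7]

loss = sft_loss([great_response, acceptable_response])
```

Note that `sft_loss` has no argument through which a quality judgment could enter: every demonstration is treated as an equally good target, which is the distributional problem in miniature.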

Why ranks beat targets

A human can reliably judge "A is better than B" even when she cannot write an optimal A from scratch. This is the same insight as in chess: it is easier to evaluate a position than to find the best move. Preference data is therefore:

Cheaper per useful bit than demonstration data (Bai et al. 2022).
More aligned with deployment (users compare your model to alternatives; they don't write the gold-standard answer).
Better at suppressing bad behaviors (you can downweight \(y_l\), not only upweight \(y_w\)).
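The last point can be made concrete with a pairwise loss of the form \(-\log \sigma(s_w - s_l)\), the kind commonly used to train reward models on comparisons (for example in Ouyang et al. 2022). The sketch below is an assumption about how such a signal might look, not the code of the labs in this module; `score_w` and `score_l` are hypothetical scalar scores for the preferred and rejected responses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_loss(score_w, score_l):
    """-log sigmoid(score_w - score_l): small when y_w outscores y_l."""
    return -math.log(sigmoid(score_w - score_l))

def pairwise_grads(score_w, score_l):
    """Gradient of pairwise_loss with respect to (score_w, score_l)."""
    # Probability the scores currently assign to the wrong ordering.
    p_wrong = sigmoid(score_l - score_w)
    # Gradient descent moves against the gradient, so score_w goes up
    # and score_l goes down by the same amount.
    return -p_wrong, p_wrong

# Hypothetical case: both responses currently receive the same score.
grad_w, grad_l = pairwise_grads(0.0, 0.0)
```

The two gradient components have opposite signs, so a single comparison both raises the preferred response and lowers the rejected one. An SFT update on \(y_w\) alone has no counterpart to the second term.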

The HHH framing

Anthropic's framing (Askell et al. 2021) sets the target of alignment:

Helpful — completes the user's task.
Honest — does not assert false things; expresses calibrated uncertainty.
Harmless — refuses to help with clearly harmful requests.

These three can pull in different directions: helpfulness conflicts with harmlessness when a user asks "tell me how to do X harmful thing". Preference alignment is the mechanism by which a lab encodes its trade-off between them. There is no single "correct" assistant; there is the assistant your preference data describes.

Grammar-tutor framing for §A13

For the §A13 grammar-tutor, HHH specializes to:

Helpful: the tutor proposes the correct conjugation correction.
Honest: the tutor flags when the input sentence is already correct rather than inventing a fake error (a real failure mode of SFT-only tutors).
Harmless: less salient here, but the tutor should not produce slurs or profanity, or make misleading meta-claims like "this is how all English speakers say it."

The labs in this module instantiate these three concretely.

What this module will do

Lab	What you train	Signal
00	A reward model (RM) on 200 pairwise preferences	"\(y_w\) better than \(y_l\)" labels
01	DPO-fine-tune the Phase 28 LoRA tutor	Same pairs, but no separate RM — DPO collapses RM + RL into one loss
02	Constitutional revision loop	The model critiques and revises its own outputs against a written constitution; you distill the revisions back via SFT
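To give a feel for the "one loss" mentioned in the Lab 01 row, here is a minimal sketch of the DPO objective for a single preference pair, following the commonly published formulation. It is an illustration under stated assumptions, not the lab's implementation: the function name, the example log-probabilities, and the default `beta=0.1` are all hypothetical, and each log-probability is assumed to be the sum of token log-probabilities of a full response under either the trained policy or a frozen reference model.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (y_w, y_l) pair.

    logp_* are sequence log-probabilities under the policy being trained,
    ref_logp_* the same quantities under the frozen reference model.
    beta (an assumed value here) scales how strongly the policy is pushed
    away from the reference.
    """
    # How much more the policy favors y_w over y_l than the reference does.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log sigmoid(beta * margin), written as softplus(-beta * margin).
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: the margin is zero.
loss_at_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has shifted probability toward y_w and away from y_l.
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The implicit reward here is the log-ratio between policy and reference, which is why no separately trained reward model appears in the computation.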

Cross-links

Phase 18 — Training Loop: the SFT stage that precedes everything here.
Phase 20 — Evaluation Harness: the eval that catches the imitation-gap failures.

References

Askell et al. 2021, A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861.
Ouyang et al. 2022, Training language models to follow instructions with human feedback. arXiv:2203.02155.
Bai et al. 2022, Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862.