
Extension Module X3: RLHF, DPO & RLAIF

Requires: 05: Probability & Information Theory · 19: Training Dynamics & Debugging · 28: Fine-Tuning, LoRA, QLoRA
Teaches: rlhf · reward-modeling · ppo · dpo · constitutional-ai
Jump to any chapter from the phase reference index.

Chapter map

Extension module on preference alignment: RLHF (PPO), DPO (the direct method), and RLAIF / Constitutional AI. It closes the "RL/RLHF beyond conceptual coverage" gap identified in HIRING_PATH.md.

Status

Track: Extension (parallel to core 40-phase curriculum)
Authorization: Addendum A15 (extension tracks authorized)
Prerequisites: Phase 04 (calculus/optimization), Phase 17 (mini-GPT), Phase 19 (training dynamics), Phase 28 (LoRA/QLoRA)
Scope guard: All labs use §A13 grammar-tutor scope (20 verbs, 5 tenses, 3 persons). No scope creep.
Hardware bar: CPU-only (i5-8250U); all labs run end-to-end without GPU.

Why this module exists

The core curriculum covers pretraining (Phases 17–18), evaluation (Phase 20), and inference (Phase 21). It does not cover the post-training alignment loop that turns a base LM into an assistant. Production labs such as OpenAI, Anthropic, DeepMind, and Meta run this loop after SFT, and an Anthropic-style interview will probe it in depth.

This module fills that gap with:

Theory — the math of policy gradients, reward modeling, PPO, DPO, and Constitutional AI.
Labs — three hands-on labs on the §A13 grammar-tutor that exercise each technique end-to-end on CPU.

Module map

| File | Topic |
| --- | --- |
| `theory/00-motivation.md` | Why preference alignment after SFT; the imitation gap; HHH framing |
| `theory/01-rl-fundamentals.md` | REINFORCE → PG with baseline → PPO clipping |
| `theory/02-reward-modeling.md` | Bradley-Terry, reward hacking, over-optimization U-curve |
| `theory/03-ppo-for-language.md` | InstructGPT recipe, KL-to-reference penalty |
| `theory/04-dpo-and-direct-methods.md` | DPO derivation; KTO, IPO, ORPO comparison |
| `theory/05-constitutional-ai-and-rlaif.md` | Constitutional AI, RLAIF, supervised + RL CAI steps |
| `lab/00-reward-model-from-preferences.md` | Train a tiny RM on 200 pairwise grammar-correction preferences |
| `lab/01-dpo-on-grammar-tutor.md` | DPO-fine-tune the Phase 28 LoRA grammar-tutor |
| `lab/02-constitutional-revision-loop.md` | Constitutional self-critique → revision → SFT distillation |
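
As a rough orientation for the theory files listed above, the three objectives they build toward can be sketched in plain Python on scalar inputs. This is a minimal sketch under stated assumptions, not the module's lab code: the function names, the defaults `eps=0.2` and `beta=0.1`, and the use of scalar rewards and summed sequence log-probabilities are all illustrative choices.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model loss for one preference pair: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(r_chosen - r_rejected)


def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO surrogate for one sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); eps is a hypothetical default clip range.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum makes the objective pessimistic: moving the ratio
    # outside the clip range cannot increase it.
    return min(ratio * advantage, clipped_ratio * advantage)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair.

    Inputs are sequence log-probabilities under the trained policy and under
    the frozen reference model; beta is a hypothetical default.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)
```

When the policy equals the reference, the DPO margin is zero and the loss is log 2 ≈ 0.693, the same value the Bradley-Terry loss takes when the reward model assigns both responses the same reward.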

Cross-links to core curriculum

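Phase 04 (Calculus & Optimization): the policy gradient is ordinary backprop through a surrogate loss, the log-probability of each sampled action weighted by its advantage.
Phase 19 (Training Dynamics): the KL penalty to the reference model is a training-dynamics tool; it limits how far the policy drifts from the SFT model.
Phase 28 (LoRA / QLoRA): DPO is most often run on adapters, which keeps the extra cost low.
Phase 37 (Security & Safety): alignment is a safety story; Constitutional AI is what Anthropic ships.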
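
To make the Phase 19 link concrete, one common way the KL-to-reference penalty enters PPO for language models is as a per-token reward term, with the reward-model score added on the final token. The sketch below assumes that formulation; the function name `kl_shaped_rewards`, the default `beta=0.1`, and the list-of-floats inputs are hypothetical and not taken from the module's labs.

```python
from typing import List, Sequence


def kl_shaped_rewards(
    rm_score: float,
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    beta: float = 0.1,
) -> List[float]:
    """Per-token rewards for one sampled response.

    policy_logps[t] and ref_logps[t] are the log-probabilities of the sampled
    token t under the current policy and the frozen reference model.
    beta is a hypothetical default penalty strength.
    """
    if len(policy_logps) != len(ref_logps) or not policy_logps:
        raise ValueError("need two non-empty log-prob sequences of equal length")
    # Each token is penalized by beta times its log-ratio, a per-sample
    # estimate of the KL divergence from the reference model.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logps, ref_logps)]
    # The scalar reward-model score is credited once, on the last token.
    rewards[-1] += rm_score
    return rewards
```

A larger `beta` keeps the policy closer to the reference at the cost of slower reward improvement; a smaller one allows more drift and more room for reward hacking.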

Key references

Ouyang et al. 2022 — Training language models to follow instructions with human feedback (InstructGPT).
Rafailov et al. 2023 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO).
Bai et al. 2022 — Constitutional AI: Harmlessness from AI Feedback (Anthropic CAI / RLAIF).
Gao et al. 2022 — Scaling Laws for Reward Model Overoptimization.
Lambert 2024 — Reinforcement Learning from Human Feedback (preference tuning survey).

Definition of Done

All six theory files reviewed by math-reviewer.
Lab 00 RM achieves >70% accuracy on held-out 40-pair eval.
Lab 01 DPO model achieves >55% win rate vs SFT baseline on held-out 50-pair test set.
Lab 02 shows measurable improvement on held-out eval after one revision-distillation cycle.
`mkdocs build --strict` passes with this module included in the nav.
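
The two numeric criteria above (Lab 00 accuracy and Lab 01 win rate) could be checked with helpers along these lines. This is a sketch, not the labs' actual evaluation harness: the function names are hypothetical, and counting a tie as half a win is an assumption the criteria do not specify.

```python
from typing import Sequence

RM_ACCURACY_MIN = 0.70   # Lab 00 criterion: accuracy must exceed 70%
DPO_WIN_RATE_MIN = 0.55  # Lab 01 criterion: win rate must exceed 55%


def pairwise_accuracy(
    chosen_scores: Sequence[float], rejected_scores: Sequence[float]
) -> float:
    """Fraction of pairs where the RM scores the chosen response strictly higher."""
    if len(chosen_scores) != len(rejected_scores) or not chosen_scores:
        raise ValueError("need two non-empty score sequences of equal length")
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)


def win_rate(outcomes: Sequence[str]) -> float:
    """Win rate of the DPO model vs the SFT baseline.

    outcomes holds one of "win", "tie", "loss" per test pair.
    Assumption: a tie counts as half a win.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[o] for o in outcomes) / len(outcomes)


def passes(value: float, threshold: float) -> bool:
    """The criteria are strict inequalities: the value must exceed the threshold."""
    return value > threshold
```

Because both thresholds are strict, the 40-pair eval needs at least 29 correct pairs (28/40 is exactly 70% and does not pass), and the 50-pair test set needs at least 28 wins when there are no ties.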