Extension Module X3 — RLHF, DPO & RLAIF
Requires: 05 — Probability & Information Theory · 19 — Training Dynamics & Debugging · 28 — Fine-Tuning, LoRA, QLoRA
Teaches: rlhf · reward-modeling · ppo · dpo · constitutional-ai
Jump to any chapter from the phase reference index.
Extension module on preference alignment: RLHF (PPO), DPO (the direct method), and RLAIF / Constitutional AI. It closes the "RL/RLHF beyond conceptual coverage" gap listed in HIRING_PATH.md.
Status
- Track: Extension (parallel to core 40-phase curriculum)
- Authorization: Addendum A15 (extension tracks authorized)
- Prerequisites: Phase 04 (calculus/optimization), Phase 17 (mini-GPT), Phase 19 (training dynamics), Phase 28 (LoRA/QLoRA)
- Scope guard: All labs use §A13 grammar-tutor scope (20 verbs, 5 tenses, 3 persons). No scope creep.
- Hardware bar: CPU-only (i5-8250U); all labs run end-to-end without GPU.
Why this module exists
The core curriculum covers pretraining (Phase 17–18), evaluation (Phase 20), and inference (Phase 21). It does not cover the post-training alignment loop that turns a base LM into an assistant. Production labs such as OpenAI, Anthropic, DeepMind, and Meta run this loop after SFT, and an Anthropic-style interview will probe it in depth.
This module fills that gap with:
- Theory — the math of policy gradients, reward modeling, PPO, DPO, and Constitutional AI.
- Labs — three hands-on labs on the §A13 grammar-tutor that exercise each technique end-to-end on CPU.
Module map
| File | Topic |
|---|---|
| `theory/00-motivation.md` | Why preference alignment after SFT; the imitation gap; HHH framing |
| `theory/01-rl-fundamentals.md` | REINFORCE → PG with baseline → PPO clipping |
| `theory/02-reward-modeling.md` | Bradley-Terry, reward hacking, over-optimization U-curve |
| `theory/03-ppo-for-language.md` | InstructGPT recipe, KL-to-reference penalty |
| `theory/04-dpo-and-direct-methods.md` | DPO derivation; KTO, IPO, ORPO comparison |
| `theory/05-constitutional-ai-and-rlaif.md` | Constitutional AI, RLAIF, supervised + RL CAI steps |
| `lab/00-reward-model-from-preferences.md` | Train a tiny RM on 200 pairwise grammar-correction preferences |
| `lab/01-dpo-on-grammar-tutor.md` | DPO-fine-tune the Phase 28 LoRA grammar-tutor |
| `lab/02-constitutional-revision-loop.md` | Constitutional self-critique → revision → SFT distillation |
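
For orientation, the three objectives that the theory files build toward can be written compactly. This is a sketch in the standard notation of the papers listed under Key references; the symbols are assumptions here, and the theory files may name things differently. $x$ is a prompt, $y_w$ and $y_l$ are the preferred and rejected responses, $r_\phi$ is the reward model, $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen SFT reference, $\sigma$ is the logistic function, and $\beta$ sets the strength of the KL-to-reference penalty.

```latex
% Bradley-Terry reward-model loss (theory/02)
\mathcal{L}_{\mathrm{RM}}(\phi)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \right]

% KL-penalized RLHF objective optimized with PPO (theory/03)
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D}}
  \Big[
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big]
    - \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
  \Big]

% DPO loss (theory/04)
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The DPO loss has the same form as the Bradley-Terry loss, with the scaled log-ratio $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ standing in for the reward $r_\phi(x, y)$. That substitution is why DPO needs no separately trained reward model.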
Cross-links to core curriculum
- Phase 04 — Calculus & Optimization: policy gradient is backprop with a different loss.
- Phase 19 — Training Dynamics: the KL constraint is a training-dynamics tool.
- Phase 28 — LoRA / QLoRA: DPO is most often done on adapters (low extra cost).
- Phase 37 — Security & Safety: alignment is a safety story; Constitutional AI is what Anthropic ships.
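
As a rough illustration of how these links fit together, the DPO loss for a single preference pair needs only four sequence log-probabilities and a scalar `beta`. It therefore trains with ordinary backprop (Phase 04), while `beta` acts as the strength of the KL-to-reference constraint (Phase 19). The function names, argument names, and the default `beta=0.1` below are assumptions made for illustration, not the labs' actual code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign of z so that exp() never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default; larger beta penalizes drift from the reference more
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed token log-probability of a full response
    under either the trainable policy or the frozen reference model.
    """
    # Implicit rewards (up to the factor beta): log-ratio of policy to reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Bradley-Terry negative log-likelihood on the implicit reward margin.
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# When the policy equals the reference, both log-ratios are zero and the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Raising the chosen response's log-probability relative to the reference lowers the loss.
loss_after_update = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

In a real training loop the four log-probabilities would come from forward passes of the adapter-equipped policy and the frozen base model, and the loss would be averaged over a batch.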
Key references
- Ouyang et al. 2022 — Training language models to follow instructions with human feedback (InstructGPT).
- Rafailov et al. 2023 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO).
- Bai et al. 2022 — Constitutional AI: Harmlessness from AI Feedback (Anthropic CAI / RLAIF).
- Gao et al. 2022 — Scaling Laws for Reward Model Overoptimization.
- Lambert 2024 — Reinforcement Learning from Human Feedback (preference tuning survey).
Definition of Done
- All six theory files reviewed by `math-reviewer`.
- Lab 00 RM achieves >70% accuracy on held-out 40-pair eval.
- Lab 01 DPO model achieves >55% win rate vs SFT baseline on held-out 50-pair test set.
- Lab 02 shows measurable improvement on held-out eval after one revision-distillation cycle.
- `mkdocs build --strict` passes with this module included in the nav.
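
The accuracy and win-rate thresholds above can be checked with two small helpers. This is a minimal sketch: the function names, the `"win"` / `"loss"` / `"tie"` labels, and the choice to count a tie as half a win are assumptions, not the labs' actual evaluation code. The numbers in the example are toy values, not lab results.

```python
from typing import Sequence


def pairwise_accuracy(
    chosen_scores: Sequence[float], rejected_scores: Sequence[float]
) -> float:
    """Fraction of pairs where the reward model scores the chosen response higher.

    An exact tie counts as incorrect (assumption).
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("need at least one pair")
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)


def win_rate(outcomes: Sequence[str]) -> float:
    """Win rate of the tuned model against the baseline.

    Each outcome is "win", "loss", or "tie"; a tie counts as half a win (assumption).
    """
    if not outcomes:
        raise ValueError("need at least one comparison")
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[o] for o in outcomes) / len(outcomes)


# Toy example: 3 of 4 pairs are ranked correctly.
toy_accuracy = pairwise_accuracy([1.0, 0.2, 0.5, 0.9], [0.0, 0.4, 0.1, 0.3])  # 0.75

# Toy example: 2 wins, 1 loss, 1 tie.
toy_win_rate = win_rate(["win", "win", "loss", "tie"])  # 0.625

# Threshold checks mirroring the Definition of Done.
rm_passes = toy_accuracy > 0.70
dpo_passes = toy_win_rate > 0.55
```

With only 40 or 50 held-out pairs, a single pair moves these metrics by 2 to 2.5 percentage points, so results that sit close to a threshold are worth re-running with a different seed.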
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 Training Language Models to Follow Instructions with Human Feedback (InstructGPT) — Ouyang et al. · 2022. The RLHF pipeline, start to finish.
- 📄 Direct Preference Optimization (DPO) — Rafailov et al. · 2023. Alignment without a separate reward model or PPO.
- 📄 Constitutional AI: Harmlessness from AI Feedback — Bai et al. · 2022. RLAIF: feedback from a model, not a crowd.