Extension Module X3 — RLHF, DPO & RLAIF

Requires: 05 — Probability & Information Theory · 19 — Training Dynamics & Debugging · 28 — Fine-Tuning, LoRA, QLoRA

Teaches: rlhf · reward-modeling · ppo · dpo · constitutional-ai

Jump to any chapter from the phase reference index.

Extension module on preference alignment: RLHF (PPO), DPO (the direct method), and RLAIF / Constitutional AI. It closes the "RL/RLHF beyond conceptual coverage" gap listed in HIRING_PATH.md.

Status

  • Track: Extension (parallel to core 40-phase curriculum)
  • Authorization: Addendum A15 (extension tracks authorized)
  • Prerequisites: Phase 04 (calculus/optimization), Phase 17 (mini-GPT), Phase 19 (training dynamics), Phase 28 (LoRA/QLoRA)
  • Scope guard: All labs use §A13 grammar-tutor scope (20 verbs, 5 tenses, 3 persons). No scope creep.
  • Hardware bar: CPU-only (i5-8250U); all labs run end-to-end without GPU.

Why this module exists

The core curriculum covers pretraining (Phase 17–18), evaluation (Phase 20), and inference (Phase 21). It does not cover the post-training alignment loop that turns a base LM into an assistant. Production labs such as OpenAI, Anthropic, DeepMind, and Meta run some form of this loop after SFT, and an Anthropic-style interview will probe it in depth.

This module fills that gap with:

  1. Theory — the math of policy gradients, reward modeling, PPO, DPO, and Constitutional AI.
  2. Labs — three hands-on labs on the §A13 grammar-tutor that exercise each technique end-to-end on CPU.

Module map

File                                      Topic
theory/00-motivation.md                   Why preference alignment after SFT; the imitation gap; HHH framing
theory/01-rl-fundamentals.md              REINFORCE → PG with baseline → PPO clipping
theory/02-reward-modeling.md              Bradley-Terry, reward hacking, over-optimization U-curve
theory/03-ppo-for-language.md             InstructGPT recipe, KL-to-reference penalty
theory/04-dpo-and-direct-methods.md       DPO derivation; KTO, IPO, ORPO comparison
theory/05-constitutional-ai-and-rlaif.md  Constitutional AI, RLAIF, supervised + RL CAI steps
lab/00-reward-model-from-preferences.md   Train a tiny RM on 200 pairwise grammar-correction preferences
lab/01-dpo-on-grammar-tutor.md            DPO-fine-tune the Phase 28 LoRA grammar-tutor
lab/02-constitutional-revision-loop.md    Constitutional self-critique → revision → SFT distillation
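
The module map names three objectives (Bradley-Terry reward modeling, PPO clipping, DPO) without stating them. As rough orientation, the sketch below writes each one in its standard per-example form using only the Python standard library. It is not the module's lab code: the function names and the defaults `beta=0.1` and `eps=0.2` are illustrative assumptions, and real training would operate on batched tensors rather than scalars.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(r_chosen - r_rejected)


def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate for one sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); eps is an assumed clip range.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls deviation from the reference
) -> float:
    """DPO loss for one preference pair (Rafailov et al. 2023).

    Inputs are summed token log-probabilities of each full response under
    the trainable policy and the frozen reference model.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


if __name__ == "__main__":
    # A reward model that cannot separate the pair pays log(2).
    print(bradley_terry_loss(0.0, 0.0))
    # A policy identical to the reference also starts at log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # A ratio above 1 + eps gains nothing extra when the advantage is positive.
    print(ppo_clipped_objective(1.5, 1.0))
```

The DPO loss has the same shape as the Bradley-Terry loss, with the scaled log-ratio difference standing in for the reward difference. This is why DPO needs no separately trained reward model.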

Key references

  • Ouyang et al. 2022 — Training language models to follow instructions with human feedback (InstructGPT).
  • Rafailov et al. 2023 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO).
  • Bai et al. 2022 — Constitutional AI: Harmlessness from AI Feedback (Anthropic CAI / RLAIF).
  • Gao et al. 2022 — Scaling Laws for Reward Model Overoptimization.
  • Lambert 2024 — Reinforcement Learning from Human Feedback (preference tuning survey).

Definition of Done

  • All six theory files reviewed by math-reviewer.
  • Lab 00 RM achieves >70% accuracy on held-out 40-pair eval.
  • Lab 01 DPO model achieves >55% win rate vs SFT baseline on held-out 50-pair test set.
  • Lab 02 shows measurable improvement on held-out eval after one revision-distillation cycle.
  • mkdocs build --strict passes with this module included in the nav.
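
With eval sets this small, the percentage thresholds above translate into specific counts. The sketch below works out those counts and shows one way to score a reward model on preference pairs. It assumes the ">" in each criterion is strict and that ties count as failures; `min_successes` and `pairwise_accuracy` are hypothetical helper names, not part of the labs.

```python
def min_successes(n_items: int, threshold_percent: int) -> int:
    """Smallest count k such that k / n_items is strictly above the threshold.

    Integer arithmetic avoids floating-point edge cases at exact boundaries.
    """
    return (n_items * threshold_percent) // 100 + 1


def pairwise_accuracy(scored_pairs: list[tuple[float, float]]) -> float:
    """Fraction of (chosen_reward, rejected_reward) pairs ranked correctly.

    A tie counts as incorrect (an assumption of this sketch).
    """
    correct = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return correct / len(scored_pairs)


if __name__ == "__main__":
    print(min_successes(40, 70))  # Lab 00: 29 of 40 pairs (72.5%)
    print(min_successes(50, 55))  # Lab 01: 28 of 50 pairs (56%)
```

At these sizes a single pair moves the score by 2 to 2.5 percentage points, so a result that only just clears the bar is weak evidence.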

Further reading

Optional — enrichment, not required to pass the phase.