05 — Constitutional AI and RLAIF

Summary: Constitutional AI (Anthropic, 2022): a set of written principles guides the model to critique and revise its own responses; the revisions are distilled via SFT, and then an RM trained on AI preferences feeds RLAIF.

The motivating problem

The PPO and DPO recipes (chapters 03 and 04) both depend on a corpus of human pairwise preferences. Three issues:

  1. Cost. Anthropic-scale RLHF needs \(10^5\) to \(10^6\) preference pairs. At minutes per pair, that is many person-years.
  2. Coverage. Humans cannot enumerate every kind of harmful or low-quality response. The data has gaps.
  3. Honesty about values. The model's behavior is determined by what annotators happened to prefer. There is no explicit, auditable statement of what the model is supposed to do.
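
The cost estimate in item 1 can be checked with a short back-of-envelope calculation. The figures of 3 minutes per pair and 2,000 working hours per year are illustrative assumptions, not numbers from the source; `person_years` is a hypothetical helper.

```python
def person_years(num_pairs: int, minutes_per_pair: float,
                 hours_per_year: float = 2000.0) -> float:
    """Annotation effort, in person-years, for `num_pairs` preference labels."""
    hours = num_pairs * minutes_per_pair / 60.0
    return hours / hours_per_year


# Assumed rate: 3 minutes per pair (illustrative only).
for num_pairs in (10**5, 10**6):
    print(num_pairs, "pairs ->", person_years(num_pairs, 3.0), "person-years")
```

Under these assumptions the range is a few to a few tens of person-years of pure labeling time, before any quality control or re-annotation.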

Constitutional AI (Bai et al. 2022) addresses all three: replace most preference annotators with the model itself, guided by a written constitution.

The two-stage CAI pipeline

Anthropic's CAI has two stages, both new on top of the InstructGPT pipeline of chapter 03:

Stage 1: Supervised CAI (SL-CAI)

For each red-team prompt \(x\):

  1. Sample an initial response \(y_0 \sim \pi^{\text{SFT}}(\cdot \mid x)\).
  2. Self-critique. Ask the same model: "Identify ways in which the last response was [harmful / dishonest / unhelpful], according to principle \(P_i\)." It produces a critique \(c_i\).
  3. Self-revision. Ask the model: "Rewrite the response to address the critique." It produces a revised response \(y_1\).
  4. Optionally iterate (critique \(y_1\), revise to \(y_2\), etc.).
  5. Fine-tune \(\pi^{\text{SFT}}\) on \((x, y_{\text{final}})\) pairs via SFT. The result is \(\pi^{\text{SL-CAI}}\).

This is distillation of the constitutional-revision loop into the weights. After SL-CAI, the model produces responses similar to the post-revision outputs without needing to explicitly run the critique step at inference.
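
The five steps above can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: the `Generate` callable, the prompt templates, and `stub_generate` (a stand-in so the sketch runs offline) are all assumptions introduced here.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical model interface: prompt text in, completion text out.
Generate = Callable[[str], str]


def critique_and_revise(x: str, principles: List[str], generate: Generate,
                        rounds: int, rng: random.Random) -> Tuple[str, str]:
    """Run `rounds` critique/revision passes and return (x, y_final)."""
    y = generate(x)  # step 1: initial response y_0
    for _ in range(rounds):
        principle = rng.choice(principles)
        # Step 2: self-critique against one sampled principle.
        critique = generate(
            f"Prompt: {x}\nResponse: {y}\nCritique request: {principle}"
        )
        # Step 3: self-revision conditioned on the critique.
        y = generate(
            f"Prompt: {x}\nResponse: {y}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return x, y


def build_sft_dataset(prompts: List[str], principles: List[str],
                      generate: Generate, rounds: int = 1,
                      seed: int = 0) -> List[Tuple[str, str]]:
    """Step 5 input: (x, y_final) pairs to fine-tune on with ordinary SFT."""
    rng = random.Random(seed)
    return [critique_and_revise(x, principles, generate, rounds, rng)
            for x in prompts]


def stub_generate(prompt: str) -> str:
    """Stand-in for a real model; returns a tag naming the step it served."""
    if "Rewrite the response" in prompt:
        return "REVISED"
    if "Critique request" in prompt:
        return "CRITIQUE"
    return "DRAFT"


dataset = build_sft_dataset(
    ["How do I pick a lock?"],
    ["Identify ways the response is harmful."],
    stub_generate,
)
print(dataset)
```

In a real run, `generate` would call \(\pi^{\text{SFT}}\) and the resulting pairs would feed a standard SFT trainer; only the revised response is kept, so the critique text never appears in the training targets.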

Stage 2: RLAIF (RL-CAI)

Now train a reward model on AI-generated preferences instead of human ones:

  1. Sample two responses \(y_a, y_b \sim \pi^{\text{SL-CAI}}(\cdot \mid x)\).
  2. Ask the model: "Which response better satisfies principle \(P_i\)?" The model returns a label.
  3. Aggregate labels across principles (often via log-probs of "A" vs "B" tokens as soft labels, then BT loss).
  4. Train a reward model \(r_\phi\) on these AI-labeled preferences.
  5. Run PPO (chapter 03) with this RM. The result is \(\pi^{\text{RL-CAI}}\).

The key change from RLHF: step 2 used to be a human. Now it is the model evaluating against a written constitution.
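
Steps 2 to 4 can be made concrete with a small sketch of soft labels and the Bradley-Terry (BT) loss. It assumes the judge exposes log-probabilities for the "A" and "B" tokens, and it uses a plain average to aggregate across principles; that averaging rule and the function names are assumptions for illustration, not details taken from the source.

```python
import math
from typing import Sequence


def soft_label(logp_a: float, logp_b: float) -> float:
    """P(A preferred): two-way softmax over the judge's 'A'/'B' log-probs."""
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    return ea / (ea + eb)


def aggregate(labels: Sequence[float]) -> float:
    """Assumed aggregation rule: mean soft label across principles."""
    return sum(labels) / len(labels)


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bt_soft_loss(r_a: float, r_b: float, p_a: float) -> float:
    """Cross-entropy between target p_a and the BT probability sigmoid(r_a - r_b)."""
    d = r_a - r_b
    # -log sigmoid(d) = softplus(-d); -log sigmoid(-d) = softplus(d)
    return p_a * softplus(-d) + (1.0 - p_a) * softplus(d)


# One pair judged under two principles (made-up log-probs).
labels = [soft_label(math.log(0.8), math.log(0.2)),
          soft_label(math.log(0.6), math.log(0.4))]
target = aggregate(labels)
print(target, bt_soft_loss(r_a=1.0, r_b=0.0, p_a=target))
```

With a hard label (`p_a` equal to 0 or 1) this reduces to the usual pairwise reward-model loss used for human preferences; the soft target simply keeps the judge's uncertainty in the training signal.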

What a constitution looks like

The published Anthropic constitution (Bai et al. 2022, Appendix C) has \(\sim\)16 principles. A few representative examples (paraphrased):

  • "Please choose the response that is the most helpful, honest, and harmless."
  • "Identify specific ways in which the assistant's last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal."
  • "Choose the response that is more thoughtful and engaging."
  • "Choose the response that sounds least like it could have been written by a child."

The form is intentional: short, English-language, easy to audit, easy to amend. Compare this to a learned reward model — a fixed set of weights with no interpretable structure.

A grammar-tutor constitution (used in Lab 02)

For §A13, we instantiate a three-principle constitution:

  1. Correct. "Choose the response that correctly identifies the conjugation error in the input sentence and proposes the correct form."
  2. Concise. "Choose the response that fixes the error in the fewest words while remaining clear to an A2-level learner."
  3. Honest. "If the input sentence has no grammatical error, choose the response that says so plainly rather than inventing a correction."

The third principle is critical: it directly addresses the sycophancy failure mode (chapter 02).
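
One way to hold this constitution in code is a plain mapping from principle name to principle text, plus a function that builds the pairwise judging prompt. The principle strings are quoted from the list above; the prompt template, the dictionary name, and `judge_prompt` are hypothetical and are not taken from Lab 02.

```python
GRAMMAR_TUTOR_CONSTITUTION = {
    "correct": ("Choose the response that correctly identifies the conjugation "
                "error in the input sentence and proposes the correct form."),
    "concise": ("Choose the response that fixes the error in the fewest words "
                "while remaining clear to an A2-level learner."),
    "honest": ("If the input sentence has no grammatical error, choose the "
               "response that says so plainly rather than inventing a correction."),
}


def judge_prompt(sentence: str, response_a: str, response_b: str,
                 principle: str) -> str:
    """Build a pairwise comparison prompt for one named principle."""
    return (
        f"Input sentence: {sentence}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        f"{GRAMMAR_TUTOR_CONSTITUTION[principle]}\n"
        "Answer with A or B:"
    )


print(judge_prompt(
    "Ella come una manzana.",
    "The sentence is correct as written.",
    "Change 'come' to 'comes'.",
    "honest",
))
```

Keeping the principles as data rather than burying them in prompt strings makes the constitution easy to audit and amend, which is the property the previous section emphasizes.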

Why RLAIF works at all

A priori, it sounds circular: "the model judges itself." Three reasons it works:

  1. Evaluation is easier than generation. Given two grammar corrections, deciding which is better is much easier than producing the better one from scratch. The model has more headroom on the evaluation task than on the generation task.
  2. Constitutional anchoring. The model is not free-judging; it is judging against an externally specified principle. The constitution provides the "ground truth" that a learned RM would otherwise need.
  3. Chain-of-thought scaffolding. The critique step explicitly reasons about the constitution before judging. This unrolls capabilities the bare model has but does not exercise zero-shot.

Empirically (Bai et al. 2022; Lee et al. 2023 RLAIF replication): RLAIF matches or exceeds RLHF on harmlessness benchmarks at a fraction of the human-annotation cost.

The supervised-revision loop in detail

The SL-CAI loop is the move most likely to come up in an interview. Each iteration:

prompt x
sample y_0 ~ π_SFT(·|x)
critique-prompt(x, y_0, principle_i) → c_i = "your response was X because..."
revise-prompt(x, y_0, c_i)           → y_1 = revised response
(optionally repeat with y_1 in place of y_0)
collect (x, y_final) → SFT dataset for next-stage model

Two design choices matter:

  • Which principles get applied to which prompts. Anthropic randomly samples a small subset of principles per prompt to avoid the model fixating on one principle.
  • How many revision rounds. Diminishing returns after 1-2 rounds; cost is linear in the number of forward passes per prompt.
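
Both design choices are easy to express in code. The sketch below samples a subset of principles per prompt and counts generation calls per prompt as a function of revision rounds. The function names and the placeholder principle labels are assumptions, and a "call" here means one full generation, which is itself many token-level forward passes.

```python
import random
from typing import List


def sample_principles(principles: List[str], k: int,
                      rng: random.Random) -> List[str]:
    """Draw k distinct principles for one prompt, without replacement."""
    return rng.sample(principles, k)


def generation_calls(rounds: int) -> int:
    """One initial sample plus one critique and one revision per round."""
    return 1 + 2 * rounds


rng = random.Random(0)
principles = [f"principle_{i}" for i in range(16)]  # placeholder labels
print(sample_principles(principles, k=2, rng=rng))
for rounds in range(4):
    print(rounds, "rounds ->", generation_calls(rounds), "generation calls")
```

The call count grows by two per extra round, which is the linear cost that has to be weighed against the diminishing quality gains.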

Why this is cheaper than collecting human prefs

Cost comparison (back-of-envelope, modern API rates):

| Source | Cost per preference |
|---|---|
| Human contractor (skilled) | $1–5 |
| Model self-critique (Claude / GPT-class) | $0.001–0.01 |

That is a gap of two to nearly four orders of magnitude (about 100x to 5,000x across the ranges above). The catch is that AI preferences are only as good as (a) the model used to judge and (b) the constitution. For frontier-quality alignment, iterated CAI, in which last-generation RL-CAI models judge next-generation training data, is the standard play.
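
As a check on the size of that gap, the ratio can be computed directly from the endpoints of the two cost ranges quoted above. The geometric-midpoint comparison is an added convenience, not a figure from the source.

```python
import math

human_cost = (1.0, 5.0)    # dollars per preference, from the comparison above
model_cost = (0.001, 0.01)  # dollars per preference, from the comparison above

# Smallest gap: cheapest human vs. most expensive model call.
min_ratio = human_cost[0] / model_cost[1]
# Largest gap: most expensive human vs. cheapest model call.
max_ratio = human_cost[1] / model_cost[0]
# Geometric midpoints of each range, compared against each other.
mid_ratio = math.sqrt(human_cost[0] * human_cost[1]) / math.sqrt(
    model_cost[0] * model_cost[1])

for name, ratio in (("min", min_ratio), ("mid", mid_ratio), ("max", max_ratio)):
    print(f"{name}: {ratio:,.0f}x = {math.log10(ratio):.2f} orders of magnitude")
```

The endpoints span roughly 2 to 3.7 orders of magnitude, with the midpoint a little under 3.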

Failure modes specific to CAI / RLAIF

  1. Constitution gaming. The model can learn to satisfy the letter of the constitution while violating its spirit (e.g., refusing harmful requests by hallucinating a refusal reason that is itself dishonest). Fix: explicit honesty principles; cross-principle audits.
  2. Critique-revision shallow loop. The model produces formulaic critiques ("Your response could be improved by being more helpful") and rewrites cosmetically. Fix: principle-conditional critique prompts that demand concrete grounding.
  3. Mode collapse onto refusals. Easy to satisfy "harmless" by refusing everything; over-refusal is a known CAI failure (Anthropic 2024 has explicit "be helpful when safe" principles to counterbalance). Fix: helpfulness principles weighted at least equally.
  4. Constitution drift across iterations. Each new model interprets the constitution slightly differently; principles may need explicit examples (few-shot in the critique prompt) to remain anchored.

How this lands in production

Anthropic's Claude family uses a descendant of CAI for harmlessness training; specifics evolve, but the core loop (constitution + self-critique + self-revision + RLAIF) is the institutional answer to "how do you align a frontier model without burning 10,000 person-years of annotation."

For the grammar-tutor scope, Lab 02 implements a minimal CAI loop: one round of self-critique + self-revision against the 3-principle constitution above, followed by SFT-distillation on the revised responses. No RLAIF (that would require a fresh RM after CAI; the lab stops at SL-CAI).

References

  • Bai et al. 2022, Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
  • Lee et al. 2023, RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. arXiv:2309.00267.
  • Anthropic 2023, Claude's Constitution. https://www.anthropic.com/news/claudes-constitution
  • Askell et al. 2021, A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861.