01 — RL fundamentals: REINFORCE → PG with baseline → PPO¶
Policy gradient from scratch: REINFORCE, variance and the baseline, and PPO's "clip". Backprop with a different loss function.
You already know backprop (Phase 04). Policy gradient is the same algorithm with a different loss. The trick is that the loss involves an expectation over the policy's own samples, so you cannot differentiate it naively.
The setup¶
- Policy \(\pi_\theta(a \mid s)\): a probability distribution over actions \(a\) given state \(s\). For a language model, \(s\) is the prompt + tokens so far, \(a\) is the next token.
- Trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)\).
- Return \(R(\tau) = \sum_{t=0}^{T} r_t\) (for our LM case, often a single terminal reward \(R\) at end-of-sequence).
- Objective \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\).
We want \(\nabla_\theta J(\theta)\).
REINFORCE: the log-derivative trick¶
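The gradient of an expectation over a parameterized distribution is non-trivial because \(\theta\) appears inside the sampling distribution. The log-derivative trick (Williams 1992) gives:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
\]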
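Derivation (one line):
\[
\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right]
\]
using \(\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta\). Here \(p_\theta(\tau)\) is the probability of trajectory \(\tau\) under \(\pi_\theta\). The environment dynamics do not depend on \(\theta\), so \(\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\).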
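Practically: roll out a trajectory, compute its return \(R\), then do an ordinary backprop step on the loss
\[
\mathcal{L}(\theta) = -R \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)
\]
That is it. The minus sign is there because optimizers minimize the loss, while we want to ascend \(J\).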
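As a minimal sketch of this update, the snippet below runs REINFORCE on a one-step problem (a three-armed bandit) with a softmax policy over logits. It uses only the standard library and writes the gradient of \(\log \pi_\theta\) by hand instead of calling an autodiff framework. The reward table, learning rate, and function names are assumptions made for illustration, not part of the chapter.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, lr, rng):
    """One REINFORCE update for a single one-step episode."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    # For a softmax policy, d log pi(action) / d logit_i = one_hot(action)_i - probs_i.
    # Descending the loss -reward * log pi(action) adds lr * reward * that gradient.
    return [
        theta + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (theta, p) in enumerate(zip(logits, probs))
    ]


rng = random.Random(0)
rewards = [0.0, 1.0, 0.2]  # hypothetical per-action rewards
logits = [0.0, 0.0, 0.0]   # start from a uniform policy
for _ in range(2000):
    logits = reinforce_step(logits, lambda a: rewards[a], lr=0.1, rng=rng)
final_probs = softmax(logits)
print(final_probs)  # most of the probability mass should sit on action 1
```

For a language model the same weighting applies per token: every token log-prob in the sampled sequence is multiplied by the sequence's return.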
Intuition¶
Each token-level log-prob is weighted by the whole-trajectory return. If the trajectory was good (\(R>0\)), upweight all the tokens we chose; if bad, downweight them. This is credit assignment by association.
Problem: variance¶
REINFORCE has enormous variance. If \(R\) varies from \(-10\) to \(+10\) across trajectories, gradient estimates swing wildly. Two fixes:
1. Baseline subtraction¶
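Subtracting any function \(b(s_t)\) that does not depend on \(a_t\) leaves the gradient unbiased:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \left( R(\tau) - b(s_t) \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
\]
Proof sketch: \(\mathbb{E}_{a \sim \pi_\theta}[b(s) \nabla_\theta \log \pi_\theta(a \mid s)] = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) = b(s) \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s) \nabla_\theta 1 = 0\).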
A good baseline is the value function \(V^\pi(s) = \mathbb{E}_\pi[R \mid s]\). Then \(R - V^\pi(s)\) is a single-sample estimate of the advantage \(A(s,a)\): how much better this action was than the policy's average from \(s\).
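The unbiasedness claim and the variance reduction can both be checked exactly on a tiny example. The sketch below assumes a single state with three actions and a softmax policy, so the expectation over actions is a finite sum and no sampling is needed. The logits and rewards are hypothetical, and `estimator_stats` is a helper name introduced here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, action):
    """Gradient of log pi(action) w.r.t. the logits of a softmax policy."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def estimator_stats(probs, rewards, baseline):
    """Exact mean and total variance of (R - b) * grad log pi(a), with a ~ pi."""
    n = len(probs)
    mean = [0.0] * n
    second_moment = 0.0
    for a, p_a in enumerate(probs):
        g = [(rewards[a] - baseline) * x for x in grad_log_prob(probs, a)]
        for i in range(n):
            mean[i] += p_a * g[i]
        second_moment += p_a * sum(x * x for x in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


probs = softmax([0.5, -0.3, 0.1])   # hypothetical policy
rewards = [9.0, 10.0, 11.0]         # hypothetical rewards with a large common offset
value = sum(p * r for p, r in zip(probs, rewards))  # V = E[R], the baseline

mean_plain, var_plain = estimator_stats(probs, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(probs, rewards, baseline=value)
print(mean_plain, var_plain)
print(mean_base, var_base)
```

The two means agree up to floating-point error, while the variance with the baseline is far smaller, because the baseline removes the large offset shared by all rewards.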
2. Actor-Critic¶
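Train a critic \(V_\phi(s)\) alongside the policy ("actor") to predict \(V^\pi\). Critic loss: \((V_\phi(s_t) - R_t)^2\), where \(R_t\) is the return from step \(t\) onward (equal to \(R\) when there is a single terminal reward). This is the standard setup in modern policy-gradient methods, including PPO.
The PG estimator becomes:
\[
\nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} A_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right], \qquad A_t = R_t - V_\phi(s_t)
\]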
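A minimal sketch of the two quantities an actor-critic step needs, assuming undiscounted returns and critic predictions given as a plain list. The function names and the toy numbers are illustrative assumptions.

```python
def returns_to_go(rewards):
    """Undiscounted return from each step onward: R_t = r_t + r_{t+1} + ... + r_T."""
    total = 0.0
    out = []
    for r in reversed(rewards):
        total += r
        out.append(total)
    return out[::-1]


def advantages_and_critic_loss(rewards, values):
    """Advantage estimates R_t - V(s_t) and the mean squared critic error."""
    returns = returns_to_go(rewards)
    advantages = [ret - v for ret, v in zip(returns, values)]
    critic_loss = sum((v - ret) ** 2 for ret, v in zip(returns, values)) / len(values)
    return returns, advantages, critic_loss


# A three-step episode with a single terminal reward, as in the LM case.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]  # hypothetical critic predictions V(s_t)
returns, advantages, critic_loss = advantages_and_critic_loss(rewards, values)
print(returns, advantages, critic_loss)
```

With a single terminal reward every step shares the same return, so the critic is what distinguishes states that were already likely to succeed from states that were not.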
PPO: the clipping trick¶
REINFORCE / vanilla actor-critic still have a problem: one bad batch of trajectories can move \(\theta\) so far that the policy collapses (the next batch is all garbage). We want a trust region — limit how much \(\pi_\theta\) can change per step.
TRPO (Schulman et al. 2015) does this with an explicit KL constraint and second-order optimization. PPO (Schulman et al. 2017) is the first-order approximation that everyone actually uses.
The PPO objective¶
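Let
\[
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\]
be the importance ratio between the new and old policy at step \(t\). The PPO-Clip objective is:
\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta) A_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t \right) \right]
\]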
Typically \(\epsilon = 0.2\).
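As a small sketch, the objective can be evaluated directly from per-step log-probabilities under the new and old policies plus advantage estimates. The function name and the sample numbers below are assumptions for illustration; the ratio is computed as `exp(new_logp - old_logp)`, which is how it is usually obtained from log-probs.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


old_logps = [math.log(0.5)] * 3
new_logps = [math.log(0.5), math.log(0.75), math.log(0.25)]  # ratios 1.0, 1.5, 0.5
advantages = [1.0, 1.0, -1.0]

objective = ppo_clip_objective(new_logps, old_logps, advantages)
# Per-sample terms: 1.0 (unchanged), 1.2 (clipped from 1.5), -0.8 (clipped from -0.5).
print(objective)
```

This is a quantity to maximize; a training loop would minimize its negative.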
Why clipping prevents catastrophic updates¶
Case analysis on the sign of \(A_t\):
- \(A_t > 0\) (action was good). We want to increase \(r_t\). The unclipped term \(r_t A_t\) grows without bound as \(r_t \to \infty\). The clipped term saturates at \((1+\epsilon) A_t\). The `min` picks the smaller (clipped) value once \(r_t > 1+\epsilon\). So the gradient w.r.t. \(\theta\) becomes zero once we've moved the policy "enough": there is no incentive to push \(r_t\) further this batch.
- \(A_t < 0\) (action was bad). We want to decrease \(r_t\). Symmetrically, the `min` picks the clipped term once \(r_t < 1 - \epsilon\), killing the gradient.
- \(|A_t|\) small or \(r_t\) inside \([1-\epsilon, 1+\epsilon]\): behaves like vanilla PG.
Result: per-batch updates are bounded; the policy cannot make a giant leap based on one batch's stochastic estimate of the gradient.
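The case analysis can be checked numerically. The sketch below estimates the derivative of the per-sample clipped term with respect to the ratio by central finite differences; the function names and test points are assumptions chosen for illustration.

```python
def clip_term(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped_ratio * adv)


def d_term_d_ratio(ratio, adv, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(clip_term)/d(ratio)."""
    return (clip_term(ratio + h, adv, eps) - clip_term(ratio - h, adv, eps)) / (2 * h)


grads = {
    "A>0, r inside the clip range": d_term_d_ratio(1.1, 1.0),
    "A>0, r above 1+eps": d_term_d_ratio(1.5, 1.0),
    "A<0, r below 1-eps": d_term_d_ratio(0.5, -1.0),
    "A<0, r above 1+eps": d_term_d_ratio(1.5, -1.0),
}
for name, g in grads.items():
    print(f"{name}: {g:+.3f}")
```

The first case keeps the vanilla gradient \(A_t\), and the second and third have zero gradient, matching the list above. The fourth case shows that the clip is one-sided: if a bad action has become more likely, the `min` selects the unclipped term, so the gradient still pushes \(r_t\) back down.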
What PPO does not do¶
PPO-Clip does not enforce an explicit KL constraint. The clip is a proxy for a trust region; the actual KL between \(\pi_\theta\) and \(\pi_{\theta_{\text{old}}}\) is typically monitored but not directly constrained. (PPO-Penalty is a variant that adds a KL penalty term to the objective; PPO-Clip is more common.)
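Monitoring the KL can be as simple as computing it for the action distributions before and after an update. The sketch below assumes categorical distributions given as probability lists and uses the direction \(\mathrm{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta)\), which is one common choice; the distributions and the alert threshold are hypothetical values, not settings from this chapter.

```python
import math


def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for two categorical distributions over the same actions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)


p_old = [0.5, 0.3, 0.2]     # pi_theta_old(. | s) for one state
p_new = [0.6, 0.25, 0.15]   # pi_theta(. | s) after some updates

kl = categorical_kl(p_old, p_new)
kl_alert = 0.01             # hypothetical logging threshold
print(f"KL = {kl:.4f}, above alert threshold: {kl > kl_alert}")
```

In practice this would be averaged over the states in a batch and logged each epoch, so that an unusually large value is visible even though nothing in PPO-Clip bounds it directly.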
The general algorithm¶
for iteration in 1..N:
    rollouts = collect_trajectories(π_θ_old)
    advantages = estimate_advantages(rollouts, V_φ)
    for epoch in 1..K:                  # K ≈ 4 typically
        for minibatch in rollouts:
            update θ on L^CLIP
            update φ on (V_φ - R)^2
    θ_old ← θ
The "K epochs per rollout batch" is what makes PPO more sample-efficient than vanilla PG: you reuse the same rollouts multiple times, which clipping makes safe.
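To make the loop concrete, here is a minimal sketch of it on the simplest possible problem: a single-state bandit with a softmax policy and a scalar critic. It is a toy under stated assumptions, not a reference implementation: gradients are written by hand, each epoch uses the whole batch as one minibatch, and the reward table and all hyperparameter values are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ppo_bandit(rewards, iterations=200, batch_size=32, epochs=4,
               eps=0.2, lr=0.1, seed=0):
    rng = random.Random(seed)
    n = len(rewards)
    logits = [0.0] * n   # actor parameters (theta)
    value = 0.0          # critic (phi): one scalar, since there is one state
    for _ in range(iterations):
        # Collect rollouts with the frozen old policy.
        old_probs = softmax(logits)
        actions = rng.choices(range(n), weights=old_probs, k=batch_size)
        returns = [rewards[a] for a in actions]
        # Advantages are estimated once per batch, before the K epochs.
        advantages = [ret - value for ret in returns]
        for _ in range(epochs):
            probs = softmax(logits)
            grad = [0.0] * n
            for a, adv in zip(actions, advantages):
                ratio = probs[a] / old_probs[a]
                # The clipped branch of the min is active here: zero gradient.
                if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
                    continue
                # d(ratio * adv)/d logit_i = adv * ratio * (one_hot(a)_i - probs_i)
                for i in range(n):
                    grad[i] += adv * ratio * ((1.0 if i == a else 0.0) - probs[i])
            # Gradient ascent on L^CLIP.
            logits = [t + lr * g / batch_size for t, g in zip(logits, grad)]
            # Gradient descent on the critic loss mean((value - R)^2).
            value -= lr * 2.0 * sum(value - ret for ret in returns) / batch_size
    return softmax(logits), value


final_probs, final_value = ppo_bandit(rewards=[0.0, 1.0, 0.2])
print(final_probs, final_value)
```

On the first epoch of each iteration the ratio is exactly 1, so that step is an ordinary advantage-weighted policy-gradient update; the clip only starts to matter on the later epochs that reuse the same batch.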
What is missing for language¶
This chapter set up PPO in the standard MDP framing. For language models we additionally need:
- A reward signal — there is no environment giving us \(r_t\). Solution: train a reward model (next chapter).
- A way to keep the LM from drifting too far from a sensible distribution. Solution: a KL-to-reference penalty (chapter 03).
Cross-links¶
- Phase 04 — Calculus & Optimization: policy gradient is just backprop on \(-R \log \pi_\theta\).
- Phase 19 — Training Dynamics: PPO's clip is a training-dynamics intervention.
References¶
- Williams 1992, Simple statistical gradient-following algorithms for connectionist reinforcement learning.
- Schulman et al. 2015, Trust Region Policy Optimization (TRPO).
- Schulman et al. 2017, Proximal Policy Optimization Algorithms. arXiv:1707.06347.