Proximal Policy Optimization (PPO)

PPO (Schulman et al., 2017) is one of the most widely used policy-gradient algorithms in modern RL. It improves on vanilla policy gradient by constraining how much the policy can change in each update, which makes training more stable.

REINFORCE updates can be unstable: a single bad gradient step can catastrophically change the policy, and every later trajectory is then collected with that damaged policy. The idea behind PPO is to keep the new policy “close” to the old one.

We define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$$

When $r_t = 1$, the new and old policies assign the same probability to the sampled action $a_t$. When $r_t$ deviates far from 1, the policy has changed significantly.
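In practice the ratio is usually computed from log-probabilities, since subtracting logs is numerically safer than dividing small probabilities. The sketch below illustrates this; the function name `probability_ratio` and the example probabilities are assumptions made for illustration, not part of the original text.

```python
import math


def probability_ratio(new_log_prob, old_log_prob):
    """Return r_t = pi_theta(a|s) / pi_theta_old(a|s) from log-probabilities."""
    # exp(log p_new - log p_old) equals p_new / p_old.
    return math.exp(new_log_prob - old_log_prob)


# Hypothetical example: the old policy gave the action probability 0.5,
# the new policy gives it 0.6, so the ratio is about 1.2.
ratio = probability_ratio(math.log(0.6), math.log(0.5))
```

A ratio above 1 means the new policy makes the sampled action more likely than the old policy did; a ratio below 1 means less likely.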

The vanilla surrogate objective is:

$$L^{CPI}(\theta) = \mathbb{E}_t \left[ r_t(\theta) \hat{A}_t \right]$$

where $\hat{A}_t$ is the estimated advantage. But maximizing this without constraint can lead to excessively large policy updates.
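A short worked step, using the standard log-derivative identity and assuming the advantage estimate is treated as a constant with respect to the policy parameters, shows why this objective is a surrogate for the policy gradient. At the old parameters the ratio's gradient is

$$\nabla_\theta r_t(\theta) \Big|_{\theta_\text{old}} = \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \Big|_{\theta_\text{old}} = \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big|_{\theta_\text{old}}$$

so that

$$\nabla_\theta L^{CPI}(\theta) \Big|_{\theta_\text{old}} = \mathbb{E}_t \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big|_{\theta_\text{old}} \, \hat{A}_t \right]$$

which is the usual policy-gradient estimator. The two only agree at the old parameters, which is why an update that moves far from them is no longer trustworthy.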

PPO clips the ratio to prevent large updates:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

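The $\min$ keeps the more pessimistic of the two terms, which ensures:

  • When $\hat{A}_t > 0$ (good action): the objective stops increasing once $r_t$ exceeds $1 + \epsilon$, so there is no incentive to over-reinforce the action
  • When $\hat{A}_t < 0$ (bad action): the objective stops increasing once $r_t$ falls below $1 - \epsilon$, so there is no incentive to over-punish the action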

The typical value is $\epsilon = 0.2$.
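The per-sample clipped term can be sketched in a few lines of plain Python. This is a minimal illustration for a single (ratio, advantage) pair; the function name `clipped_surrogate` and the sample values are assumptions, and a real implementation would average this quantity over a batch.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    # Restrict the ratio to the interval [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Keep the smaller (more pessimistic) of the two candidate terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical samples covering the four combinations of ratio and advantage sign.
samples = [(1.5, 1.0), (0.5, 1.0), (0.5, -1.0), (1.5, -1.0)]
values = [clipped_surrogate(r, a) for r, a in samples]
```

With these inputs, the first sample (good action, ratio too high) is clipped to about 1.2 and the third (bad action, ratio too low) to about -0.8. The second and fourth are not clipped, giving 0.5 and -1.5: when the policy has moved in the wrong direction, the objective still penalizes it in full.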

Initialize policy π_θ and value function V_φ; set θ_old ← θ
for each iteration:
    Collect trajectories using π_θ
    Compute advantages Â_t using GAE
    for each epoch over mini-batches:
        Compute r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
        L_clip = mean over the mini-batch of min(r_t·Â_t, clip(r_t, 1-ε, 1+ε)·Â_t)
        Update θ to maximize L_clip
        Update φ to minimize the mean of (V_φ(s_t) - G_t)², where G_t is the return target
    θ_old ← θ
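The loop above can be sketched end to end on a toy problem. The example below is a simplified illustration under several assumptions that are not in the original text: the environment is a hypothetical two-armed bandit with fixed rewards, the policy is a softmax over two logits, the advantage is the reward minus the batch mean (a stand-in for GAE), and there is no value function. The gradient of the clipped objective is written out by hand rather than computed by an autodiff library.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ppo_bandit(iterations=30, batch_size=64, epochs=4, eps=0.2, lr=0.5, seed=0):
    """Run clipped-surrogate updates on a toy two-armed bandit."""
    rng = random.Random(seed)
    rewards = [0.0, 1.0]  # hypothetical environment: arm 1 always pays more
    theta = [0.0, 0.0]    # policy logits; pi_theta = softmax(theta)

    for _ in range(iterations):
        # Collect a batch with the current policy and freeze it as pi_old.
        old_probs = softmax(theta)
        actions = [0 if rng.random() < old_probs[0] else 1
                   for _ in range(batch_size)]
        # Assumption: batch-mean baseline instead of GAE with a value function.
        baseline = sum(rewards[a] for a in actions) / batch_size
        advantages = [rewards[a] - baseline for a in actions]

        for _ in range(epochs):
            probs = softmax(theta)
            grad = [0.0, 0.0]
            for a, adv in zip(actions, advantages):
                ratio = probs[a] / old_probs[a]
                # When the min selects the clipped term, that term is constant
                # in theta, so the sample contributes zero gradient.
                if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
                    continue
                # d(ratio * adv) / d(theta_k) = adv * ratio * (1[k == a] - pi_k)
                for k in range(2):
                    indicator = 1.0 if k == a else 0.0
                    grad[k] += adv * ratio * (indicator - probs[k]) / batch_size
            # Gradient ascent on the clipped surrogate.
            theta = [t + lr * g for t, g in zip(theta, grad)]

    return softmax(theta)


final_probs = ppo_bandit()
```

After training, `final_probs` should put most of its mass on arm 1. Within each iteration, samples stop contributing gradient once their ratio leaves the clip range in the favorable direction, which is what limits how far a single batch can move the policy.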