Proximal Policy Optimization (PPO)
PPO (Schulman et al., 2017) is one of the most widely used policy-gradient algorithms in modern RL. It improves on vanilla policy gradient by limiting how far the policy can move in each update, which makes training more stable.
Motivation: The Trust Region
REINFORCE updates can be unstable: a single bad gradient step can catastrophically change the policy. The idea behind PPO is to keep the new policy “close” to the old one.
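We define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

When $r_t(\theta) = 1$, the new and old policies assign the same probability to action $a_t$. When $r_t(\theta)$ deviates far from 1, the policy has changed significantly.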
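As a small illustration of the ratio, the sketch below computes it from log-probabilities, which is how policy outputs are usually stored. The function name `probability_ratio` and the example probabilities are assumptions made for this illustration, not part of the original text.

```python
import math

def probability_ratio(logp_new, logp_old):
    """Return pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    # Subtracting in log space and exponentiating avoids dividing
    # two very small probabilities directly.
    return math.exp(logp_new - logp_old)

# Hypothetical probabilities of the same action under each policy.
p_old, p_new = 0.25, 0.30
ratio = probability_ratio(math.log(p_new), math.log(p_old))
print(f"{ratio:.2f}")  # 1.20: the new policy favors this action 20% more
```

A ratio above 1 means the new policy makes the action more likely than the old one did; a ratio below 1 means less likely.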
The Surrogate Objective
The vanilla surrogate objective is:

$$L(\theta) = \mathbb{E}_t\left[ r_t(\theta)\, \hat{A}_t \right]$$

where $\hat{A}_t$ is the estimated advantage. But maximizing this without constraint can lead to excessively large policy updates.
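To see why the unconstrained objective invites large updates, the sketch below estimates the surrogate as a sample mean over a batch. The helper `surrogate` and the numbers are hypothetical, chosen only to show that the objective keeps growing as the ratios move away from 1.

```python
def surrogate(ratios, advantages):
    """Sample-mean estimate of E_t[ r_t * A_t ]."""
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

# Hypothetical batch: one good action (A > 0) and one bad action (A < 0).
advantages = [1.0, -0.5]

unchanged = surrogate([1.0, 1.0], advantages)   # policy identical to the old one
aggressive = surrogate([3.0, 0.2], advantages)  # good action tripled, bad action cut to 20%

print(f"{unchanged:.2f} {aggressive:.2f}")  # 0.25 1.45
```

Because the objective is linear in each ratio, nothing in it discourages pushing the ratios arbitrarily far from 1, even though the advantage estimates were only valid near the old policy.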
PPO’s Clipped Objective
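PPO clips the ratio to prevent large updates:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]$$

The $\min$ ensures:
- When $\hat{A}_t > 0$ (good action): the ratio is capped at $1+\epsilon$, preventing over-reinforcement
- When $\hat{A}_t < 0$ (bad action): the ratio is capped at $1-\epsilon$, preventing over-punishment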
The typical value is $\epsilon = 0.2$.
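The two cases above can be checked numerically. The following is a minimal sketch of the per-sample clipped term; the names `clip` and `clipped_term` are assumptions for this example, and `eps=0.2` is used as the clipping range.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

def clipped_term(ratio, advantage, eps=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

# (ratio, advantage) pairs covering both signs of the advantage.
cases = [(1.5, 1.0), (0.5, 1.0), (0.5, -1.0), (1.5, -1.0)]
for r, a in cases:
    print(f"r={r:.1f} A={a:+.1f} -> {clipped_term(r, a):+.2f}")
# r=1.5 A=+1.0 -> +1.20   (capped at 1 + eps)
# r=0.5 A=+1.0 -> +0.50   (not capped: the min keeps the smaller value)
# r=0.5 A=-1.0 -> -0.80   (capped at 1 - eps)
# r=1.5 A=-1.0 -> -1.50   (not capped: the penalty still applies)
```

Note the asymmetry: the cap only removes the incentive to move further in the direction that improves the objective. If the ratio has moved the wrong way, the unclipped term is the smaller one and is kept.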
PPO Algorithm
    Initialize policy π_θ and value function V_φ
    for each iteration:
        Collect trajectories using π_θ
        Compute advantages Â_t using GAE
        for each mini-batch epoch:
            Compute r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
            L_clip = min(r_t·Â_t, clip(r_t, 1-ε, 1+ε)·Â_t)
            Update θ to maximize L_clip
            Update φ to minimize (V_φ(s_t) - G_t)²
        θ_old ← θ
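The loop structure can be sketched on a toy problem. The example below is a simplified illustration, not a full PPO implementation: it uses a hypothetical two-armed bandit, a softmax policy over two logits with hand-written gradients, and a batch-mean reward baseline in place of GAE and the learned value function V_φ. All names and hyperparameters (`ppo_bandit`, `lr`, `batch`, the reward means) are assumptions for this sketch.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ppo_bandit(iterations=50, batch=64, epochs=4, eps=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]            # policy logits, one per arm
    mean_reward = [0.0, 1.0]      # hypothetical bandit: arm 1 pays more
    for _ in range(iterations):
        # Collect data with the current (soon to be "old") policy.
        old_probs = softmax(theta)
        actions = rng.choices([0, 1], weights=old_probs, k=batch)
        rewards = [mean_reward[a] + rng.gauss(0.0, 0.1) for a in actions]
        # Simplification: batch-mean baseline instead of GAE.
        baseline = sum(rewards) / batch
        advs = [r - baseline for r in rewards]
        # Several gradient-ascent epochs on the same batch.
        for _ in range(epochs):
            probs = softmax(theta)
            grad = [0.0, 0.0]
            for a, adv in zip(actions, advs):
                ratio = probs[a] / old_probs[a]
                # Where the min selects the clipped term, the gradient is zero.
                if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
                    continue
                # d(ratio * adv)/d(theta_k) = adv * ratio * (1[k == a] - pi_k)
                for k in range(2):
                    indicator = 1.0 if k == a else 0.0
                    grad[k] += adv * ratio * (indicator - probs[k])
            theta = [t + lr * g / batch for t, g in zip(theta, grad)]
        # theta_old <- theta happens implicitly: old_probs is recomputed
        # from theta at the start of the next iteration.
    return softmax(theta)

final_probs = ppo_bandit()
print(final_probs)  # probability mass should concentrate on arm 1
```

Within one iteration, once the ratio for a sample leaves the interval [1 - ε, 1 + ε] in the improving direction, that sample stops contributing gradient, so the remaining epochs on the same batch cannot push the policy much further.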