The PPO objective takes the minimum of the unclipped and clipped surrogate terms:
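$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]
$$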
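To make the objective concrete, here is a minimal pure-Python sketch. The function names (`clip`, `clipped_surrogate`, `ppo_clip_objective`) and the sample ratios and advantages are illustrative assumptions rather than part of any particular library, and the batch mean stands in for the expectation over timesteps.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(unclipped, clipped)


def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Batch mean of the per-sample terms (an empirical stand-in for E_t)."""
    terms = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# Hypothetical batch: probability ratios r_t and advantage estimates A_t.
ratios = [0.7, 1.0, 1.5]
advantages = [-1.0, 2.0, 1.0]

objective = ppo_clip_objective(ratios, advantages)
print(objective)  # approximately 0.8
```

With ϵ = 0.2, the first sample contributes about -0.8 (clipped at 1 − ϵ), the second contributes 2.0 (unclipped), and the third contributes about 1.2 (clipped at 1 + ϵ). In a real training loop this quantity would be computed on tensors and maximized, typically by minimizing its negative.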
Let’s break this down for the two possible signs of the advantage $\hat{A}_t$:
When $\hat{A}_t > 0$, we want to increase $\pi_\theta(a_t \mid s_t)$, which increases $r_t$. But the clip caps the benefit at $r_t = 1+\epsilon$:
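$$
L_t = \min\left(r_t \hat{A}_t,\ (1+\epsilon)\hat{A}_t\right) =
\begin{cases}
r_t \hat{A}_t & \text{if } r_t \le 1+\epsilon \\
(1+\epsilon)\hat{A}_t & \text{if } r_t > 1+\epsilon
\end{cases}
$$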
When $\hat{A}_t < 0$, we want to decrease $\pi_\theta(a_t \mid s_t)$, which decreases $r_t$. The clip prevents over-correction below $r_t = 1-\epsilon$:
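$$
L_t = \min\left(r_t \hat{A}_t,\ (1-\epsilon)\hat{A}_t\right) =
\begin{cases}
r_t \hat{A}_t & \text{if } r_t \ge 1-\epsilon \\
(1-\epsilon)\hat{A}_t & \text{if } r_t < 1-\epsilon
\end{cases}
$$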
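As a sanity check on the two cases, the sketch below compares the general min/clip form against the piecewise forms on a grid of ratios. The function names and the grid are assumptions made for illustration only.

```python
def surrogate_min_clip(ratio, advantage, epsilon):
    """General form: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def surrogate_piecewise(ratio, advantage, epsilon):
    """Case analysis: only one side of the clip is active for each advantage sign."""
    if advantage > 0:
        # Positive advantage: the benefit is capped once r exceeds 1 + eps.
        effective = ratio if ratio <= 1.0 + epsilon else 1.0 + epsilon
    else:
        # Negative (or zero) advantage: the term is flat once r drops below 1 - eps.
        effective = ratio if ratio >= 1.0 - epsilon else 1.0 - epsilon
    return effective * advantage


epsilon = 0.2
grid = [0.5 + 0.05 * i for i in range(21)]  # ratios from 0.5 to 1.5

# Largest disagreement between the two forms over the grid and both signs.
max_gap = max(
    abs(surrogate_min_clip(r, a, epsilon) - surrogate_piecewise(r, a, epsilon))
    for r in grid
    for a in (1.0, -1.0)
)
print(max_gap)
```

Note the asymmetry: with a positive advantage, a ratio below 1 − ϵ is not clipped, and with a negative advantage, a ratio above 1 + ϵ is not clipped. In both situations the update moved in the wrong direction, and taking the minimum keeps the full penalty.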
The value of ϵ sets the width of the clipping region $[1-\epsilon,\ 1+\epsilon]$ and, with it, how far a single update can move the policy:
| ϵ | Effect |
|---|---|
| 0.1 | Very conservative: slow but stable |
| 0.2 | Standard (OpenAI default) |
| 0.3 | More aggressive: faster but riskier |
In practice, ϵ=0.2 works well across many tasks, including RLHF for language models.
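To get a feel for how ϵ changes the clipping region, the following sketch computes the interval for each value in the table and the share of a hypothetical batch of ratios that falls outside it. The ratios are made up for illustration, and the check ignores the advantage sign for simplicity, so it overstates how many samples actually have their objective flattened.

```python
def clip_range(epsilon):
    """Return the interval [1 - eps, 1 + eps] that the ratio is clipped to."""
    return (1.0 - epsilon, 1.0 + epsilon)


def fraction_outside(ratios, epsilon):
    """Share of ratios lying strictly outside the clip interval."""
    lo, hi = clip_range(epsilon)
    outside = [r for r in ratios if r < lo or r > hi]
    return len(outside) / len(ratios)


# Hypothetical probability ratios from one batch.
ratios = [0.75, 0.85, 0.95, 1.0, 1.05, 1.15, 1.25, 1.35]

results = {eps: fraction_outside(ratios, eps) for eps in (0.1, 0.2, 0.3)}
for eps, frac in results.items():
    lo, hi = clip_range(eps)
    print(f"eps={eps}: range=[{lo:.1f}, {hi:.1f}], outside={frac:.3f}")
```

For this batch, 5 of 8 ratios fall outside the interval at ϵ = 0.1, 3 of 8 at ϵ = 0.2, and 1 of 8 at ϵ = 0.3, which matches the table's ordering from conservative to aggressive.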