
The Clipped Surrogate

The PPO objective takes the minimum of the unclipped and clipped surrogate:

$$
L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]
$$
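A minimal sketch of this objective in plain Python, assuming the per-timestep ratios $r_t(\theta)$ and advantage estimates $\hat{A}_t$ are already computed and that the expectation is approximated by a batch mean. The names `clip` and `clipped_surrogate` are illustrative and not taken from any library.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def clipped_surrogate(ratios, advantages, eps=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, adv in zip(ratios, advantages):
        unclipped = r * adv
        clipped = clip(r, 1.0 - eps, 1.0 + eps) * adv
        # The min keeps whichever term is more pessimistic.
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)
```

For instance, `clipped_surrogate([1.5], [2.0])` evaluates to roughly 2.4 rather than 3.0, because the ratio is clipped at $1 + \epsilon = 1.2$ before being multiplied by the advantage.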

Let’s break this down for both cases:

When $\hat{A}_t > 0$ (the action was good)


We want to increase $\pi_\theta(a_t \mid s_t)$, which increases $r_t$. But the clip caps the benefit at $r_t = 1 + \epsilon$:

$$
L_t = \min(r_t \hat{A}_t, (1+\epsilon) \hat{A}_t) = \begin{cases} r_t \hat{A}_t & \text{if } r_t \leq 1+\epsilon \\ (1+\epsilon)\hat{A}_t & \text{if } r_t > 1+\epsilon \end{cases}
$$
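As a quick worked example, take $\epsilon = 0.2$ and $\hat{A}_t = 2$ (illustrative values chosen here, not from any experiment). A ratio inside the clipping range passes through unchanged, while a ratio beyond $1 + \epsilon$ earns no extra credit:

$$
\begin{aligned}
r_t = 1.1: \quad L_t &= \min(1.1 \cdot 2, \; 1.2 \cdot 2) = \min(2.2, \; 2.4) = 2.2 \\
r_t = 1.5: \quad L_t &= \min(1.5 \cdot 2, \; 1.2 \cdot 2) = \min(3.0, \; 2.4) = 2.4
\end{aligned}
$$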

When $\hat{A}_t < 0$ (the action was bad)


We want to decrease $\pi_\theta(a_t \mid s_t)$, which decreases $r_t$. The clip prevents over-correction below $r_t = 1 - \epsilon$:

$$
L_t = \min(r_t \hat{A}_t, (1-\epsilon) \hat{A}_t) = \begin{cases} r_t \hat{A}_t & \text{if } r_t \geq 1-\epsilon \\ (1-\epsilon)\hat{A}_t & \text{if } r_t < 1-\epsilon \end{cases}
$$
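The two case-by-case forms can be checked numerically against the general min/clip expression. The sketch below is plain Python with hypothetical helper names (`surrogate_term`, `piecewise_term`, `slope`). It also estimates the slope with respect to $r_t$ by finite differences, to show where the objective goes flat.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def surrogate_term(r, adv, eps=0.2):
    """General per-sample term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)


def piecewise_term(r, adv, eps=0.2):
    """Case-by-case form for positive and negative advantages."""
    if adv > 0:
        return r * adv if r <= 1.0 + eps else (1.0 + eps) * adv
    if adv < 0:
        return r * adv if r >= 1.0 - eps else (1.0 - eps) * adv
    return 0.0  # A = 0: both terms vanish


def slope(r, adv, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(term)/d(r)."""
    hi = surrogate_term(r + h, adv, eps)
    lo = surrogate_term(r - h, adv, eps)
    return (hi - lo) / (2 * h)


# Compare both forms on a grid of ratios from 0.5 to 1.5.
ratios = [0.5 + 0.05 * i for i in range(21)]
for adv in (2.0, -2.0):
    for r in ratios:
        assert abs(surrogate_term(r, adv) - piecewise_term(r, adv)) < 1e-12
```

With these definitions, `slope(1.5, 2.0)` and `slope(0.5, -2.0)` come out at essentially zero, so there is no incentive to push the ratio further in the clipped direction. `slope(1.5, -2.0)` stays near -2: the min only flattens the side where the policy has already moved the way the advantage wanted.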

The clipping range $\epsilon$ controls how far $r_t$ can move from 1 before the objective stops rewarding further change:

| $\epsilon$ | Effect |
|---|---|
| 0.1 | Very conservative: slow but stable |
| 0.2 | Standard (OpenAI default) |
| 0.3 | More aggressive: faster but riskier |

In practice, $\epsilon = 0.2$ works well across many tasks, including RLHF for language models.
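One way to get a feel for what "conservative" versus "aggressive" means is to count how many samples in a batch land on the flat, clipped branch for each $\epsilon$. The sketch below uses a small made-up batch of ratios and advantages. The numbers are hypothetical and only meant to show the trend, and `clipped_fraction` is an illustrative helper, not a library function.

```python
def clipped_fraction(ratios, advantages, eps):
    """Fraction of samples whose surrogate term sits on its clipped branch."""
    clipped = 0
    for r, adv in zip(ratios, advantages):
        # Clipping is active when the ratio has already moved past the
        # bound in the direction the advantage favours.
        if (adv > 0 and r > 1.0 + eps) or (adv < 0 and r < 1.0 - eps):
            clipped += 1
    return clipped / len(ratios)


# Hypothetical batch, for illustration only.
ratios = [0.65, 0.85, 0.95, 1.05, 1.15, 1.25, 1.35, 0.75]
advantages = [-1.0, -0.5, 1.0, 0.5, 1.0, 2.0, 0.5, 1.0]

for eps in (0.1, 0.2, 0.3):
    print(eps, clipped_fraction(ratios, advantages, eps))
# prints:
# 0.1 0.625
# 0.2 0.375
# 0.3 0.25
```

On this toy batch, a smaller $\epsilon$ leaves more samples on the clipped branch, where they contribute no gradient, which is one way to read the "slow but stable" entry in the table.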