Baselines & Variance Reduction

REINFORCE’s gradient estimate has high variance because the return $G_t$ can vary wildly between episodes. This means we need many samples for a reliable gradient, making training slow.

We can subtract a baseline $b(s_t)$, any function of the state that does not depend on the action, from the return without introducing bias:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b(s_t)) \right]$$

Why is this still unbiased? Because:

$$\mathbb{E}_{a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s) \right] = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s) \cdot \nabla_\theta 1 = 0$$
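The identity above can be checked exactly for a small discrete policy. The sketch below assumes a softmax policy over three actions in a single state; the logits and the baseline value are hypothetical numbers chosen only for illustration. It sums over every action rather than sampling, so the result should be zero up to floating-point error.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def expected_baseline_term(logits, baseline):
    """Exact E_a[grad log pi(a) * b], computed by summing over all actions."""
    probs = softmax(logits)
    total = [0.0] * len(logits)
    for a, p in enumerate(probs):
        g = grad_log_softmax(logits, a)
        for i in range(len(logits)):
            total[i] += p * g[i] * baseline
    return total

logits = [0.5, -1.2, 2.0]  # hypothetical policy parameters for one state
term = expected_baseline_term(logits, baseline=7.3)
print(term)  # each component is zero up to rounding error
```

The same cancellation holds for any value of `baseline`, as long as it does not depend on the action being summed over.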

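The baseline leaves the expected gradient unchanged and affects only its variance. A baseline that tracks the typical return reduces that variance; a poorly chosen one can increase it.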
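A small Monte Carlo experiment makes the effect on the variance concrete. This is a minimal sketch under stated assumptions: a hypothetical two-armed bandit with mean rewards 10 and 11, unit Gaussian noise, and a uniform softmax policy. It estimates the gradient with respect to the first logit, once with no baseline and once with a constant baseline near the average reward.

```python
import random
import statistics

def sample_gradients(baseline, n_samples=20000, seed=0):
    """Single-sample policy-gradient estimates for logit 0 of a 2-armed bandit."""
    rng = random.Random(seed)
    mean_reward = [10.0, 11.0]  # hypothetical arm means
    grads = []
    for _ in range(n_samples):
        action = rng.randrange(2)  # uniform policy: pi(a) = 0.5
        reward = mean_reward[action] + rng.gauss(0.0, 1.0)
        # d log pi(action) / d logit_0 for a softmax policy
        score = (1.0 if action == 0 else 0.0) - 0.5
        grads.append(score * (reward - baseline))
    return grads

no_baseline = sample_gradients(baseline=0.0)
with_baseline = sample_gradients(baseline=10.5)

mean_plain = statistics.fmean(no_baseline)
mean_base = statistics.fmean(with_baseline)
var_plain = statistics.pvariance(no_baseline)
var_base = statistics.pvariance(with_baseline)

print("mean without / with baseline:", mean_plain, mean_base)
print("variance without / with baseline:", var_plain, var_base)
```

Both runs use the same seed, so they see the same actions and noise. The two means should agree closely (the true gradient here is -0.25), while the variance with the baseline should be far smaller.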

The most common choice for the baseline is the state-value function $V^\pi(s)$:

$$b(s_t) = V^\pi(s_t)$$

This turns the weight from $G_t$ into the advantage:

$$A_t = G_t - V^\pi(s_t)$$

The advantage is positive when the action was better than average, and negative when it was worse. This is much more informative than the raw return.
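As a worked example, the advantage can be computed from one episode's rewards and a set of value estimates. The sketch below is illustrative only: the `gamma` discount parameter is an assumption (the text does not say whether $G_t$ is discounted, so it defaults to 1.0), and the reward and value numbers are made up.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return G_t for every step t, accumulating rewards backwards in time."""
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    out.reverse()
    return out

def advantages(rewards, values, gamma=1.0):
    """A_t = G_t - V(s_t) for every step of one episode."""
    return [g - v for g, v in zip(returns_to_go(rewards, gamma), values)]

rewards = [1.0, 0.0, 2.0]  # hypothetical episode rewards
values = [2.5, 2.5, 1.0]   # hypothetical value estimates V(s_t)

print(returns_to_go(rewards))       # [3.0, 2.0, 2.0]
print(advantages(rewards, values))  # [0.5, -0.5, 1.0]
```

At the first step the return beat the value estimate, so the advantage is positive. At the second step it fell short, so the advantage is negative and that action's probability would be pushed down.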

When we learn both:

  • A policy $\pi_\theta$ (the actor)
  • A value function $V_\phi$ (the critic)

we get an actor-critic method. The critic provides the baseline, and the actor updates using the advantage:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \hat{A}_t$$

$$\phi \leftarrow \phi - \beta \nabla_\phi (V_\phi(s_t) - G_t)^2$$
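The two update rules can be sketched for the simplest possible case. The code below assumes a tabular setup that is not part of the text above: the actor is a table of softmax logits per state, the critic is a table of values per state, and the advantage estimate is the Monte Carlo quantity $G_t - V_\phi(s_t)$. The function name, the state label `"s0"`, and the step sizes are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_update(theta, v, state, action, g, alpha=0.1, beta=0.1):
    """Apply one actor step and one critic step in place; return the advantage."""
    advantage = g - v[state]  # A_hat = G_t - V(s_t), using the pre-update critic
    probs = softmax(theta[state])
    for a in range(len(probs)):
        # Gradient of log pi(action | state) w.r.t. logit a
        grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
        theta[state][a] += alpha * grad_log_pi * advantage
    # Gradient of (V - G)^2 w.r.t. V is 2 * (V - G)
    v[state] -= beta * 2.0 * (v[state] - g)
    return advantage

theta = {"s0": [0.0, 0.0]}  # actor: logits per state
v = {"s0": 0.0}             # critic: value estimate per state
adv = actor_critic_update(theta, v, "s0", action=1, g=1.0)
print(adv, theta, v)
```

Because the return exceeded the critic's estimate, the logit of the chosen action goes up, the other goes down, and the critic moves toward the observed return. With a neural network in place of the tables, the same two gradients would come from automatic differentiation.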

This is the foundation for PPO, which we’ll cover next.

| Method | Weight | Bias | Variance |
| --- | --- | --- | --- |
| REINFORCE | $G_t$ | None | High |
| With baseline | $G_t - b(s)$ | None | Lower |
| Actor-Critic | $\hat{A}_t$ | Some (bootstrapping) | Lowest |

The bias-variance tradeoff is a recurring theme — we’ll see it again with GAE in the PPO section.