Baselines & Variance Reduction

The Variance Problem

REINFORCE’s gradient estimate has high variance because the return $G_t$ can vary wildly between episodes. This means we need many samples for a reliable gradient, making training slow.

Adding a Baseline

We can subtract a baseline $b(s_t)$ from the return without introducing bias, provided the baseline depends only on the state and not on the action taken:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b(s_t)) \right]

Why is this still unbiased? Because:

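\mathbb{E}_{a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s) \right] = b(s) \sum_a \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s) = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s) \cdot \nabla_\theta 1 = 0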

The baseline leaves the expected gradient unchanged; it affects only the variance, and a good baseline reduces it.
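
To make this concrete, here is a minimal sketch that computes the mean and variance of the gradient estimate exactly, by summing over actions instead of sampling. It assumes a single-state problem with a softmax policy over three actions and a fixed return per action. The logits, the returns, and the helper names `softmax` and `gradient_stats` are all illustrative assumptions, not part of any library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_stats(logits, returns, baseline):
    """Exact mean and total variance of g(a) = grad log pi(a) * (G(a) - baseline).

    Expectations are sums over actions, so there is no sampling noise.
    Total variance is the sum of the per-component variances.
    """
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = 0.0
    for a, p in enumerate(probs):
        weight = returns[a] - baseline
        # For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi(i).
        g = [((1.0 if i == a else 0.0) - probs[i]) * weight for i in range(n)]
        for i in range(n):
            mean[i] += p * g[i]
        second_moment += p * sum(x * x for x in g)
    variance = second_moment - sum(x * x for x in mean)
    return mean, variance

logits = [0.2, -0.1, 0.4]      # hypothetical policy logits
returns = [10.0, 11.0, 12.0]   # hypothetical return for each action
value = sum(p * g for p, g in zip(softmax(logits), returns))  # V = E[G]

mean_raw, var_raw = gradient_stats(logits, returns, baseline=0.0)
mean_base, var_base = gradient_stats(logits, returns, baseline=value)
print("mean without baseline:", mean_raw)
print("mean with baseline:   ", mean_base)
print("variance without / with baseline:", var_raw, var_base)
```

In this toy setup the two means agree up to floating-point error, while the variance with the baseline is far smaller, because the returns share a large common offset that the baseline removes.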

The Value Function Baseline

The most common choice for the baseline is the state-value function $V^\pi(s)$:

b(s_t) = V^\pi(s_t)

This turns the weight from $G_t$ into an estimate of the advantage:

\hat{A}_t = G_t - V^\pi(s_t)

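The advantage is positive when the action led to a higher return than the policy achieves on average from that state, and negative when it led to a lower one. This is a more informative learning signal than the raw return, which can be large and positive even after a below-average action.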
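
As a small worked example, the sketch below computes the returns and the weights $G_t - V^\pi(s_t)$ for one short episode. The rewards and value estimates are made-up numbers, and the discount factor `gamma` is an assumption (this section does not specify one, so it defaults to 1, meaning undiscounted returns).

```python
def discounted_returns(rewards, gamma=1.0):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(rewards, values, gamma=1.0):
    """Advantage estimates G_t - V(s_t), one per time step."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]

rewards = [1.0, 0.0, 2.0]   # hypothetical rewards r_0, r_1, r_2
values = [2.5, 2.5, 1.0]    # hypothetical baseline values V(s_0), V(s_1), V(s_2)

print(discounted_returns(rewards))   # [3.0, 2.0, 2.0]
print(advantages(rewards, values))   # [0.5, -0.5, 1.0]
```

All three returns are positive, so plain REINFORCE would push up the probability of every action taken. The advantages instead single out the second action as worse than expected.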

Actor-Critic

When we learn both:

A policy $\pi_\theta$ (the actor)
A value function $V_\phi$ (the critic)

we get an actor-critic method. The critic provides the baseline, and the actor updates using an advantage estimate $\hat{A}_t$, such as $G_t - V_\phi(s_t)$:

\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \hat{A}_t

\phi \leftarrow \phi - \beta \nabla_\phi (V_\phi(s_t) - G_t)^2
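
The two updates can be sketched in plain Python for the simplest possible case. This assumes a tabular softmax actor (one list of logits per state), a tabular critic (one value per state), Monte Carlo returns, and the estimate $\hat{A}_t = G_t - V_\phi(s_t)$. The function name `actor_critic_update`, the episode format, and the step sizes are illustrative assumptions rather than a reference implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_update(theta, phi, episode, alpha=0.1, beta=0.05, gamma=1.0):
    """Apply the actor and critic updates once per step of an episode.

    theta[s] is a list of action logits for state s (the actor).
    phi[s] is the value estimate V_phi(s) (the critic).
    episode is a list of (state, action, reward) tuples.
    Both theta and phi are modified in place and also returned.
    """
    # Monte Carlo returns G_t, accumulated backwards through the episode.
    running = 0.0
    returns = []
    for _, _, reward in reversed(episode):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()

    for (s, a, _), g_t in zip(episode, returns):
        adv = g_t - phi[s]  # advantage estimate, critic used as baseline
        probs = softmax(theta[s])
        for i in range(len(theta[s])):
            # d log pi(a|s) / d theta[s][i] = 1[i == a] - pi(i|s)
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            theta[s][i] += alpha * grad_log * adv
        # Gradient of (V_phi(s) - G_t)^2 with respect to phi[s] is 2 * (V_phi(s) - G_t).
        phi[s] -= beta * 2.0 * (phi[s] - g_t)
    return theta, phi

theta = [[0.0, 0.0]]        # one state, two actions, uniform initial policy
phi = [0.0]                 # critic starts at zero
episode = [(0, 1, 1.0)]     # in state 0, took action 1, received reward 1
actor_critic_update(theta, phi, episode)
print(theta, phi)
```

After this single step the logit of the chosen action rises, the other falls, and the critic moves toward the observed return. In practice both the actor and the critic are usually neural networks trained with automatic differentiation, but the update rules have the same shape.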

This is the foundation for PPO, which we’ll cover next.

Summary

Method	Weight	Bias	Variance
REINFORCE	$G_t$	None	High
With baseline	$G_t - b(s)$	None	Lower
Actor-Critic	$\hat{A}_t$	Some (bootstrapping)	Lowest

The bias-variance tradeoff is a recurring theme — we’ll see it again with GAE in the PPO section.