Baselines & Variance Reduction
The Variance Problem
REINFORCE’s gradient estimate has high variance because the return can vary wildly between episodes. This means we need many samples for a reliable gradient, making training slow.
Adding a Baseline
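We can subtract a baseline $b(s_t)$ from the return $G_t$ without introducing bias:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - b(s_t)\bigr)\right]
$$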
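Why is this still unbiased? Because the baseline depends only on the state, not on the action, so its contribution averages to zero (written here for discrete actions):

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\bigl[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\bigr] = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0
$$

The baseline doesn’t change the expected gradient; it only changes the variance, and a well-chosen baseline reduces it.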
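The claim that a baseline leaves the expected gradient unchanged while shrinking its variance can be checked numerically. The sketch below is a toy setup, not part of the original material: a hypothetical 3-armed bandit with a fixed softmax policy, where `theta`, `mean_reward`, and the constant baseline `b` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit (illustrative values, not from the text).
theta = np.array([0.2, -0.1, 0.4])          # policy logits
mean_reward = np.array([10.0, 11.0, 12.0])  # large offset makes raw returns noisy weights


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def sample_gradients(n, baseline):
    """Return n single-sample REINFORCE gradient estimates w.r.t. theta."""
    probs = softmax(theta)
    actions = rng.choice(len(theta), size=n, p=probs)
    returns = mean_reward[actions] + rng.normal(0.0, 1.0, size=n)
    # For a softmax policy, grad log pi(a) = one_hot(a) - probs.
    grad_log_pi = np.eye(len(theta))[actions] - probs
    return grad_log_pi * (returns - baseline)[:, None]


probs = softmax(theta)
b = probs @ mean_reward  # expected return under the policy, used as a constant baseline

g_raw = sample_gradients(200_000, baseline=0.0)
g_base = sample_gradients(200_000, baseline=b)

print("mean (no baseline):  ", g_raw.mean(axis=0))
print("mean (with baseline):", g_base.mean(axis=0))
print("total variance (no baseline):  ", g_raw.var(axis=0).sum())
print("total variance (with baseline):", g_base.var(axis=0).sum())
```

Both sample means should agree up to sampling noise, while the summed variance with the baseline should be far smaller. With a single state, the expected return plays the role that a state-dependent baseline plays in the episodic case.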
The Value Function Baseline
A common and effective choice for the baseline is the state-value function $V(s_t)$:

$$
b(s_t) = V(s_t)
$$

This turns the weight from $G_t$ into the advantage:

$$
A_t = G_t - V(s_t)
$$
The advantage is positive when the action was better than average, and negative when it was worse. This is much more informative than the raw return.
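As a small worked example, the advantage can be computed as the return from each step minus the value estimate for that state. This is a minimal sketch: the helper `discounted_returns`, the discount factor `gamma`, and the `values` array (standing in for a critic’s outputs) are assumptions made for illustration.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = np.array([1.0, 0.0, 2.0])
values = np.array([2.5, 1.5, 1.0])  # hypothetical value estimates V(s_t)

returns = discounted_returns(rewards, gamma=0.5)
advantages = returns - values

print(returns)     # [1.5 1.  2. ]
print(advantages)  # [-1.  -0.5  1. ]
```

The first two actions did worse than the value estimate predicted (negative advantage), and the last did better (positive advantage), even though every raw return is positive.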
Actor-Critic
When we learn both:
- A policy (the actor)
- A value function (the critic)
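we get an actor-critic method. The critic provides the baseline, and the actor updates using the advantage:

$$
\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t, \qquad \hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

Here $\hat{A}_t$ is a one-step bootstrapped advantage estimate built from the critic’s $V$; other estimators are possible.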
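One way to picture a single actor-critic update is the tabular sketch below. It assumes a softmax policy over per-state logits and uses the one-step TD error as the advantage estimate. The function name `actor_critic_step`, its parameters, and the learning rates are hypothetical choices for illustration, not a reference implementation.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    """One tabular actor-critic update from a single transition (in place).

    theta: (n_states, n_actions) policy logits (the actor)
    v:     (n_states,) state-value estimates (the critic)
    """
    # Bootstrapped target: relies on the critic's own estimate of the next state.
    target = r + (0.0 if done else gamma * v[s_next])
    advantage = target - v[s]  # one-step TD error used as the advantage estimate

    # Actor: ascend grad log pi(a|s) * advantage. For softmax logits,
    # grad log pi(a|s) w.r.t. theta[s] is one_hot(a) - pi(.|s).
    probs = softmax(theta[s])
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += lr_actor * advantage * grad_log_pi

    # Critic: move V(s) toward the bootstrapped target.
    v[s] += lr_critic * advantage
    return advantage


theta = np.zeros((2, 2))
v = np.zeros(2)
adv = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)

print(adv)       # 1.0
print(theta[0])  # [-0.05  0.05]
print(v[0])      # 0.1
```

Because the target uses `v[s_next]` rather than the full observed return, the update can happen after every transition, at the cost of inheriting any error in the critic.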
This is the foundation for PPO, which we’ll cover next.
Summary
| Method | Weight | Bias | Variance |
|---|---|---|---|
| REINFORCE | $G_t$ | None | High |
| With baseline | $G_t - b(s_t)$ | None | Lower |
| Actor-Critic | e.g. $r_t + \gamma V(s_{t+1}) - V(s_t)$ | Some (bootstrapping) | Lowest |
The bias-variance tradeoff is a recurring theme — we’ll see it again with GAE in the PPO section.