Generalized Advantage Estimation (GAE)
The Bias-Variance Tradeoff
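How do we estimate the advantage $A_t$? There are many options:
Monte Carlo (high variance, no bias):
$$\hat{A}_t^{\text{MC}} = \sum_{l=0}^{T-t-1} \gamma^l r_{t+l} - V(s_t)$$
TD(0) (low variance, high bias):
$$\hat{A}_t^{\text{TD}} = r_t + \gamma V(s_{t+1}) - V(s_t) = \delta_t$$
where $\delta_t$ is the TD residual and $T$ is the episode length.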
Monte Carlo uses the actual return (accurate but noisy). TD bootstraps from the value estimate (smooth, but biased by any error in that estimate).
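To make the two extremes concrete, here is a minimal sketch in plain Python. The function names and the convention that `values` carries one extra entry for the state after the last step (0.0 if the episode terminated) are assumptions made for this illustration, not part of any particular library.

```python
def mc_advantages(rewards, values, gamma=0.99):
    """Monte Carlo advantage: discounted return-to-go minus V(s_t)."""
    advantages = [0.0] * len(rewards)
    ret = 0.0  # nothing is collected after the final step
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret
        advantages[t] = ret - values[t]
    return advantages


def td0_advantages(rewards, values, gamma=0.99):
    """TD(0) advantage: the one-step TD residual delta_t."""
    # values[t + 1] is the bootstrap estimate for the next state
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]  # assumed value estimates; last entry is terminal

mc = mc_advantages(rewards, values, gamma=1.0)   # roughly [0.5, 0.4, 0.2]
td = td0_advantages(rewards, values, gamma=1.0)  # roughly [0.1, 0.2, 0.2]
```

The Monte Carlo numbers depend only on the observed rewards and the baseline at step t, while every TD(0) number also depends on the next state's value estimate, which is where the bias enters.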
GAE: The Best of Both Worlds
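Generalized Advantage Estimation (Schulman et al., 2016) interpolates between these extremes with a parameter $\lambda \in [0, 1]$, taking an exponentially weighted sum of TD residuals:
$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$
This can be written recursively:
$$\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$$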
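The recursive form suggests a single backward pass over the rollout. The following is a sketch under a few assumptions: the name `gae_advantages` is made up for this example, `values` holds one extra entry for the state after the last step, and the advantage beyond the end of the rollout is taken to be zero.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages with the backward recursion.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T); use 0.0 for V(s_T) if the episode terminated.
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_adv = 0.0  # assumed boundary condition: no residuals past the rollout
    for t in reversed(range(T)):
        # TD residual for step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
    return advantages


adv = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.6, 0.8, 0.0], gamma=1.0, lam=0.5)
# roughly [0.25, 0.3, 0.2]
```

Walking backward means each step reuses the advantage already computed for the step after it, so the whole rollout costs one pass instead of a nested sum.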
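The Role of $\lambda$
| $\lambda$ | Equivalent to | Bias | Variance |
|---|---|---|---|
| $\lambda = 0$ | TD(0): $\hat{A}_t = \delta_t$ | High | Low |
| $\lambda = 1$ | Monte Carlo: $\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}$ | None | High |
| $0 < \lambda < 1$ | Weighted mix of n-step returns | Medium | Medium |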
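The two limiting rows of the table can be checked numerically. This sketch evaluates GAE straight from its weighted-sum definition (truncated at the end of a finite rollout) rather than by recursion, so the effect of the weight on each residual is visible. The function name `gae_direct` and the toy rewards and values are assumptions for illustration only.

```python
def gae_direct(rewards, values, gamma, lam):
    """GAE from its definition: sum over l of (gamma * lam)^l * delta_{t+l}."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    return [
        sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
        for t in range(T)
    ]


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]  # last entry is V after the final step (terminal)
gamma = 0.9

td_like = gae_direct(rewards, values, gamma, lam=0.0)   # only delta_t survives
mc_like = gae_direct(rewards, values, gamma, lam=1.0)   # discounted return minus V(s_t)
mixed = gae_direct(rewards, values, gamma, lam=0.95)    # somewhere in between
```

With `lam=0.0` every weight except the first is zero, so the result is the list of TD residuals. With `lam=1.0` the value terms telescope and only the discounted rewards minus `values[t]` remain.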
GAE in the PPO Pipeline
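The full PPO training loop:
- Collect rollout data with the current policy $\pi_\theta$
- Compute TD residuals: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
- Compute GAE advantages: $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$
- Compute targets for the value function: $\hat{R}_t = \hat{A}_t + V(s_t)$
- Run multiple epochs of mini-batch updates on the clipped surrogate objective $L^{\text{CLIP}}$ and the value loss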
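The residual, advantage, and value-target steps above can be sketched as one backward pass. This is an illustration under stated assumptions, not a reference implementation: the function name, the `dones` flags (marking steps where an episode ended inside the rollout), and the `last_value` bootstrap argument are all choices made for this example.

```python
def advantages_and_targets(rewards, values, dones, last_value,
                           gamma=0.99, lam=0.95):
    """Compute GAE advantages and value-function targets for one rollout.

    rewards[t], values[t], dones[t] describe step t; dones[t] is True when
    the episode ended at step t. last_value is V(s_T), used to bootstrap
    the tail of a rollout that was cut off mid-episode.
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):
        not_done = 0.0 if dones[t] else 1.0
        # TD residual, without bootstrapping across an episode boundary
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # GAE recursion, reset at episode boundaries
        next_adv = delta + gamma * lam * not_done * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    # Value targets: advantage plus the value estimate it was measured against
    targets = [a + v for a, v in zip(advantages, values)]
    return advantages, targets


# Two 2-step episodes, each ending with reward 1; gamma = lam = 1 for clarity
adv, tgt = advantages_and_targets(
    rewards=[0.0, 1.0, 0.0, 1.0],
    values=[0.5, 0.5, 0.5, 0.5],
    dones=[False, True, False, True],
    last_value=0.0,
    gamma=1.0,
    lam=1.0,
)
# adv == [0.5, 0.5, 0.5, 0.5], tgt == [1.0, 1.0, 1.0, 1.0]
```

The advantages would then feed the policy loss and the targets the value loss during the mini-batch epochs. Many implementations also normalize advantages per batch, which is omitted here.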
Why Not Just Monte Carlo?
In LLM training (RLHF), episodes can be long (hundreds of tokens). Monte Carlo returns have very high variance because each token's reward signal is buried under the noise of all future tokens.
GAE with $\lambda < 1$ exponentially downweights distant TD residuals, giving a much cleaner signal for credit assignment: which token actually contributed to the reward?
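A quick way to see the downweighting is to compute the weight $(\gamma\lambda)^l$ that GAE places on a residual $l$ steps ahead. The helper names and the values $\gamma = 0.99$, $\lambda = 0.95$ below are assumptions picked for illustration, and the "effective horizon" is only a rough heuristic for how far credit reaches.

```python
def residual_weight(distance, gamma=0.99, lam=0.95):
    """Weight GAE gives to the TD residual `distance` steps ahead."""
    return (gamma * lam) ** distance


def effective_horizon(gamma=0.99, lam=0.95):
    """Sum of all weights, 1 / (1 - gamma * lam); requires gamma * lam < 1."""
    return 1.0 / (1.0 - gamma * lam)


near = residual_weight(1)                 # close to 1: nearby tokens count fully
far = residual_weight(100)                # well below 0.01 with lam = 0.95
far_mc = residual_weight(100, lam=1.0)    # still above 0.3 in the Monte Carlo limit
horizon = effective_horizon()             # between 16 and 17 steps
```

In a response that is hundreds of tokens long, a residual 100 tokens away contributes almost nothing at $\lambda = 0.95$, whereas in the Monte Carlo limit it still carries roughly a third of its full weight.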