Group Relative Policy Optimization (GRPO)
GRPO (Shao et al., 2024 — DeepSeek) simplifies PPO for language model training by removing the critic network entirely. Instead of learning a value function to estimate advantages, GRPO computes advantages by comparing outputs within a group sampled from the same prompt.
Motivation
PPO requires a critic to compute advantages. For LLMs, this critic:
- Must be a model comparable in size to the policy (expensive!)
- Is hard to train well for diverse prompts
- Adds significant memory and compute overhead
GRPO asks: can we estimate advantages without a critic?
The GRPO Idea
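For each prompt $q$, sample a group of $G$ outputs from the current policy $\pi_{\theta_{\text{old}}}$:

$$
\{o_1, o_2, \dots, o_G\} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)
$$

Score each output with a reward function $r$, giving $r_i = r(q, o_i)$, then normalize within the group:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
$$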
This group-relative advantage tells us how good each output is compared to its peers, without needing any learned baseline.
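The normalization described above (subtract the group mean, divide by the group standard deviation) can be sketched in a few lines of plain Python. The function name, the small `eps` added to the denominator, and the choice of the population standard deviation are assumptions made for this illustration, not details taken from the text.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group sampled from the same prompt.

    `eps` is an assumed safeguard against division by zero when every
    output in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled outputs for one prompt, scored by a binary reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # approximately [1, -1, -1, 1]
```

Outputs that beat the group average get a positive advantage and the rest get a negative one. If all rewards in a group are identical, every advantage is zero and the group contributes no learning signal.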
The GRPO Objective
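GRPO uses a clipped surrogate similar to PPO:

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\, \{o_i\}_{i=1}^{G}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\big( \rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i \big) \right] - \beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})
$$

where:
- $\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$ is the probability ratio for the full output
- $\hat{A}_i$ is the group-relative advantage of output $o_i$
- $\epsilon$ is the clipping range, as in PPO
- $\beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ is a KL penalty to prevent the policy from drifting too far from a reference model $\pi_{\text{ref}}$

The output-level ratio is a simplified presentation: Shao et al. apply the ratio and the clipping per token and average over the tokens of each output, with every token of $o_i$ sharing the same advantage.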
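To make the clipped surrogate and the KL penalty concrete, here is a minimal sketch that evaluates the objective for one group, using one log-probability per full output. Everything named here is an assumption for illustration: the function `grpo_objective`, the defaults `clip_eps=0.2` and `beta=0.04`, and the per-sample KL estimate $\frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1$, which is one common choice rather than the only one. A real trainer would compute these quantities on tensors and differentiate through them.

```python
import math


def grpo_objective(logp_new, logp_old, logp_ref, advantages,
                   clip_eps=0.2, beta=0.04):
    """Evaluate a GRPO-style objective for one group of outputs.

    Each list holds one entry per output: the log-probability of the
    full output under the current policy, the sampling (old) policy,
    and the frozen reference model, plus its group-relative advantage.
    """
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old,
                                            logp_ref, advantages):
        # Probability ratio pi_theta / pi_theta_old for the full output.
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * adv, clipped * adv)
        # Non-negative per-sample KL estimate (assumed estimator).
        log_ref_ratio = lp_ref - lp_new
        kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
        total += surrogate - beta * kl
    return total / len(advantages)


# One output whose probability rose 1.5x; the gain is capped at 1 + clip_eps.
value = grpo_objective([-1.0], [-1.0 - math.log(1.5)], [-1.0], [1.0])
print(value)
```

With a positive advantage, the clip caps how much credit the policy gets for raising an output's probability. With a negative advantage and a ratio above `1 + clip_eps`, the `min` keeps the unclipped term, so the penalty is not capped.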
Key Differences from PPO
| Aspect | PPO | GRPO |
|---|---|---|
| Advantage estimation | Learned critic | Group normalization |
| Extra model | Yes (critic) | No |
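| Advantage granularity | Per-token advantages | Per-output advantages |
| KL constraint | Clipping (RLHF setups typically add a KL penalty to the reward) | Clipping + explicit KL penalty in the objective |
| Memory overhead | High (policy + critic) | Lower (no critic; the frozen reference model is still kept for the KL term) |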