Group-Relative Advantages
The Group Sampling Procedure
Section titled “The Group Sampling Procedure”For each training prompt :
- Sample complete outputs:
- Score each output:
- Normalize rewards within the group:
where and .
Interactive: Group vs. Value Baseline
Section titled “Interactive: Group vs. Value Baseline”Compare how GRPO’s group-relative advantages differ from PPO’s value-function baseline. The bars show raw rewards; the lines show the resulting advantages:
Why Group Normalization Works
Section titled “Why Group Normalization Works”The group mean serves the same role as the critic baseline in PPO:
But the group mean is:
- Free to compute — no extra neural network needed
- Adaptive — automatically adjusts per prompt (hard prompts have lower mean, easy prompts have higher mean)
- Unbiased — no function approximation error
The standard deviation normalization additionally stabilizes training by keeping advantage magnitudes consistent across prompts.
The KL Penalty
Section titled “The KL Penalty”GRPO adds an explicit KL divergence penalty to the objective:
This prevents the policy from deviating too far from the reference (usually the SFT model). In practice, DeepSeek uses a per-token KL approximation:
GRPO Algorithm Summary
Section titled “GRPO Algorithm Summary”Initialize policy π_θ from SFT model, set π_ref = π_θfor each iteration: Sample batch of prompts {x₁, ..., x_B} for each prompt x: Sample G outputs: y₁, ..., y_G ~ π_θ_old(·|x) Score: rᵢ = R(x, yᵢ) Normalize: Âᵢ = (rᵢ - mean) / std for each mini-batch epoch: Compute clipped surrogate with  Add KL penalty: L = L_clip - β·D_KL Update θ to maximize L θ_old ← θConnection to the Learning Path
Section titled “Connection to the Learning Path”Looking back at our progression:
- Action chains gave us the MDP framework and the concept of return
- Policy gradients showed how to optimize a policy directly
- PPO stabilized updates with clipping and used GAE for advantage estimation
- GRPO simplified the pipeline by replacing the learned critic with group-relative normalization
Each step builds on the last — GRPO is best understood as “PPO without the critic, made possible by sampling multiple outputs per prompt.”