Group Relative Policy Optimization (GRPO)

GRPO (Shao et al., 2024, DeepSeek) simplifies PPO for language model training by removing the critic network entirely. Instead of learning a value function to estimate advantages, GRPO computes advantages by comparing outputs within a group sampled from the same prompt.

PPO requires a critic $V_\phi(s)$ to compute advantages. For LLMs, this critic:

  • Must be a model comparable in size to the policy (expensive!)
  • Is hard to train well for diverse prompts
  • Adds significant memory and compute overhead

GRPO asks: can we estimate advantages without a critic?

For each prompt $x$, sample a group of $G$ outputs from the current policy:

$$\{y_1, y_2, \ldots, y_G\} \sim \pi_\theta(\cdot \mid x)$$

Score each output with a reward function $R(x, y_i)$, then normalize within the group:

$$\hat{A}_i = \frac{R(x, y_i) - \text{mean}(\{R(x, y_j)\}_{j=1}^G)}{\text{std}(\{R(x, y_j)\}_{j=1}^G)}$$

This group-relative advantage tells us how good each output is compared to its peers, without needing any learned baseline.
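
The normalization above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `group_relative_advantages`, the small `eps` term, and the use of the population standard deviation are assumptions made here, since the formula does not specify how to handle a group whose rewards are all identical.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of G outputs sampled for the same prompt.

    eps is an assumption of this sketch: it avoids division by zero
    when every output in the group receives the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std (an assumption; the text just says "std")
    return [(r - mean) / (std + eps) for r in rewards]

# Four outputs for one prompt, scored by a hypothetical 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # approximately [1.0, -1.0, -1.0, 1.0]
```

With the `eps` guard, a group whose outputs all receive the same reward gets zero advantage everywhere, so in this sketch such a prompt contributes nothing to the surrogate term.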

GRPO uses a clipped surrogate similar to PPO:

$$L^{GRPO}(\theta) = \mathbb{E}_{x} \left[ \frac{1}{G} \sum_{i=1}^{G} \min \left( r_i(\theta) \hat{A}_i, \; \text{clip}(r_i(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_i \right) - \beta \, D_{KL}(\pi_\theta \| \pi_\text{ref}) \right]$$

where:

  • $r_i(\theta) = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_\text{old}}(y_i \mid x)}$ is the probability ratio for the full output
  • $\beta \, D_{KL}$ is a KL penalty to prevent the policy from drifting too far from a reference model
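
As a rough sketch of how the objective is evaluated for a single prompt, the function below takes per-output log-probabilities under the current and old policies, the group-relative advantages, and an already-computed KL estimate. The name `grpo_objective`, the argument layout, and the default values `clip_eps=0.2` and `beta=0.04` are illustrative assumptions, not values given in the text, and how the KL term is estimated is left outside the sketch.

```python
import math

def grpo_objective(logp_new, logp_old, advantages, kl_to_ref,
                   clip_eps=0.2, beta=0.04):
    """Clipped surrogate minus KL penalty for one group of G outputs.

    logp_new / logp_old: log pi_theta(y_i | x) and log pi_theta_old(y_i | x)
    for each full output. kl_to_ref is a precomputed estimate of
    D_KL(pi_theta || pi_ref). The defaults are illustrative assumptions.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_i(theta)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic choice between the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    surrogate = sum(terms) / len(terms)  # (1/G) * sum over the group
    return surrogate - beta * kl_to_ref  # value to maximize

# Hypothetical numbers: two outputs, one favored and one disfavored.
value = grpo_objective(
    logp_new=[-10.0, -12.5],
    logp_old=[-10.2, -12.0],
    advantages=[1.0, -1.0],
    kl_to_ref=0.01,
)
```

The objective is maximized, so a gradient-descent training loop would typically minimize its negation. Because of the `min`, clipping caps the gain from raising the ratio on a positive-advantage output, while a negative-advantage output whose ratio has grown keeps its full, unclipped penalty.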

| Aspect | PPO | GRPO |
|---|---|---|
| Advantage estimation | Learned critic $V_\phi$ | Group normalization |
| Extra model | Yes (critic) | No |
| Per-token or per-output | Per-token advantages | Per-output advantages |
| KL constraint | Clipping only | Clipping + explicit KL penalty |
| Memory overhead | High (policy + critic) | Lower (policy only) |