Policy Gradient Methods

Instead of learning a value function and deriving a policy from it (as in Q-learning), policy gradient methods directly optimize the policy parameters $\theta$ to maximize expected return.

We want to find $\theta$ that maximizes:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right] = \sum_{\tau} P(\tau \mid \theta) \, R(\tau)$$

where $R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ is the total (discounted) return of trajectory $\tau$.

The idea: compute the gradient $\nabla_\theta J(\theta)$ and update $\theta$ in the direction that increases $J$.
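As a small illustration of the return defined above, the following sketch computes the discounted return of a single trajectory from its reward list. The function name `discounted_return` and the example rewards are made up for illustration; they are not part of any library.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one trajectory."""
    total = 0.0
    # Accumulate backwards using R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards r_0, r_1, r_2 (illustrative values only).
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Averaging this quantity over many trajectories sampled from the policy gives a Monte Carlo estimate of the objective.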

Starting from the definition:

$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau} P(\tau \mid \theta) \, R(\tau) = \sum_{\tau} \nabla_\theta P(\tau \mid \theta) \, R(\tau)$$

Here’s the problem: $\nabla_\theta P(\tau \mid \theta)$ is hard to compute directly because $P(\tau \mid \theta)$ involves the environment dynamics. We need a trick.

The key identity: for any function $f(x) > 0$,

$$\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}$$

Rearranging:

$$\nabla_x f(x) = f(x) \, \nabla_x \log f(x)$$

Apply this to $P(\tau \mid \theta)$:

$$\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta) \, \nabla_\theta \log P(\tau \mid \theta)$$

Substituting back:

$$\nabla_\theta J(\theta) = \sum_{\tau} P(\tau \mid \theta) \, \nabla_\theta \log P(\tau \mid \theta) \, R(\tau) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log P(\tau \mid \theta) \, R(\tau) \right]$$

This is powerful: we’ve turned a sum weighted by $\nabla_\theta P$ into an expectation under $P$ itself, something we can estimate by sampling trajectories.
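The log-derivative identity used in this step can be sanity-checked numerically. The sketch below does so for one arbitrary positive function, a sigmoid of a scalar parameter, using central finite differences. The choice of function, the helper names, and the step size are assumptions made purely for illustration.

```python
import math


def sigmoid(theta):
    """A positive function of theta, standing in for a probability."""
    return 1.0 / (1.0 + math.exp(-theta))


def central_diff(f, x, eps=1e-6):
    """Numerical derivative of a scalar function via central differences."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


theta = 0.3
p = sigmoid(theta)

# Left-hand side: derivative of p itself.
grad_p = central_diff(sigmoid, theta)
# Right-hand side: p times the derivative of log p.
grad_log_p = central_diff(lambda t: math.log(sigmoid(t)), theta)

print(grad_p, p * grad_log_p)  # the two values should agree closely
```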

Recall the chain rule factorization from the action chain section:

$$P(\tau \mid \theta) = P(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) \cdot P(s_{t+1} \mid s_t, a_t)$$
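To make the factorization concrete, here is a minimal sketch that evaluates the log-probability of one trajectory in a tiny tabular MDP. The MDP, the policy table, and every number in it are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math


def trajectory_log_prob(init_probs, policy, dynamics, states, actions):
    """log P(tau) = log P(s_0) + sum_t [log pi(a_t|s_t) + log P(s_{t+1}|s_t, a_t)].

    states holds s_0..s_T and actions holds a_0..a_{T-1},
    so states has exactly one more entry than actions.
    """
    log_p = math.log(init_probs[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        log_p += math.log(policy[s][a])            # policy term (depends on theta)
        log_p += math.log(dynamics[s][a][s_next])  # environment term (no theta)
    return log_p


# Hypothetical 2-state, 2-action MDP; all probabilities are illustrative.
init_probs = [1.0, 0.0]
policy = [[0.5, 0.5], [0.9, 0.1]]  # policy[s][a]
dynamics = [                       # dynamics[s][a][s_next]
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
states, actions = [0, 1, 1], [1, 0]

log_p = trajectory_log_prob(init_probs, policy, dynamics, states, actions)
print(math.exp(log_p))  # 1.0 * 0.5 * 0.9 * 0.9 * 0.5 = 0.2025
```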

Taking the log:

$$\log P(\tau \mid \theta) = \underbrace{\log P(s_0)}_{\text{no } \theta} + \sum_{t=0}^{T-1} \Big[ \log \pi_\theta(a_t \mid s_t) + \underbrace{\log P(s_{t+1} \mid s_t, a_t)}_{\text{no } \theta} \Big]$$

Taking the gradient $\nabla_\theta$, all environment terms vanish:

$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
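The sum of per-step score terms can be written out explicitly for a simple policy class. The sketch below assumes a tabular softmax policy, where `theta[s][a]` is a logit and the gradient of $\log \pi_\theta(a \mid s)$ with respect to the logits of state $s$ is the one-hot vector of $a$ minus the action probabilities. The policy class, function names, and example values are assumptions for illustration; note that no environment dynamics appear anywhere in the computation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def trajectory_score(theta, states, actions):
    """Sum over t of grad_theta log pi_theta(a_t | s_t), tabular softmax policy."""
    grad = [[0.0] * len(row) for row in theta]
    for s, a in zip(states, actions):
        probs = softmax(theta[s])
        for b, p in enumerate(probs):
            # d/d theta[s][b] of log pi(a | s) = 1[b == a] - pi(b | s)
            grad[s][b] += (1.0 if b == a else 0.0) - p
    return grad


# Hypothetical logits for 2 states x 2 actions, and a 3-step trajectory.
theta = [[0.0, 0.0], [1.0, -1.0]]
states, actions = [0, 1, 0], [1, 0, 0]
score = trajectory_score(theta, states, actions)
print(score)
```

In state 0 the policy is uniform and each action was taken once, so the two contributions cancel and the score for that row is zero.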

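Combining everything gives $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]$. An action taken at timestep $t$ cannot affect rewards received before $t$, so those earlier rewards contribute nothing in expectation and $R(\tau)$ can be replaced by the return from $t$ onward:

$$\boxed{ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right] }$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from timestep $t$. Strictly, with $\gamma < 1$ the exact gradient of $J$ weights each term by an additional factor $\gamma^t$; this factor is commonly dropped in practice.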
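A single-trajectory version of this estimator can be sketched as follows. The helper names are hypothetical, the per-step score vectors are passed in as plain lists, and the sketch assumes one reward per action so that the two lists have equal length. In practice the estimate would be averaged over many sampled trajectories.

```python
def rewards_to_go(rewards, gamma):
    """G_t = sum_{k >= t} gamma**(k - t) * r_k for every t, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def gradient_estimate(step_scores, rewards, gamma):
    """Single-trajectory estimate: sum_t score_t * G_t.

    step_scores[t] is grad_theta log pi_theta(a_t | s_t) as a flat list.
    """
    returns = rewards_to_go(rewards, gamma)
    grad = [0.0] * len(step_scores[0])
    for score, g in zip(step_scores, returns):
        for i, s_i in enumerate(score):
            grad[i] += s_i * g
    return grad


# Illustrative values only.
rewards = [1.0, 0.0, 2.0]
step_scores = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(rewards_to_go(rewards, gamma=0.5))                   # [1.5, 1.0, 2.0]
print(gradient_estimate(step_scores, rewards, gamma=0.5))  # [-0.5, 1.0]
```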

The log-derivative trick is not specific to RL. It’s a general technique called the score function estimator (also known as the REINFORCE estimator or the likelihood ratio trick), used throughout machine learning:

| Application | What plays the role of $\pi_\theta$ | What plays the role of $R(\tau)$ |
| --- | --- | --- |
| RL policy gradient | Policy $\pi_\theta(a \mid s)$ | Return $G_t$ |
| Variational inference | Variational distribution $q_\theta(z)$ | ELBO signal |
| Black-box optimization | Sampling distribution | Objective value |

The pattern is always the same: to differentiate an expectation over a distribution you control, use $\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x) \cdot f(x)]$.
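The general pattern can be tried outside RL on a case with a known answer. The sketch below assumes $p_\theta = \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, so that $\mathbb{E}[f(x)] = \theta^2 + 1$ and the true gradient is $2\theta$. For this Gaussian the score is $\nabla_\theta \log p_\theta(x) = x - \theta$. The distribution, the function, the sample count, and the seed are all choices made for illustration.

```python
import random


def score_function_gradient(theta, f, num_samples, seed=0):
    """Monte Carlo estimate of d/d theta E_{x ~ N(theta, 1)}[f(x)]."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(theta, 1.0)
        score = x - theta      # grad_theta log N(x; theta, 1)
        total += score * f(x)
    return total / num_samples


theta = 1.5
estimate = score_function_gradient(theta, lambda x: x * x, num_samples=200_000)
print(estimate, 2.0 * theta)  # estimate should be close to the true gradient 2*theta
```

The estimator only needs samples and values of $f$, never the derivative of $f$. The many samples needed here for a stable answer also hint at how noisy the estimate can be.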

| Advantage | Explanation |
| --- | --- |
| Handles continuous actions | No need for argmax over actions |
| Can learn stochastic policies | Useful for exploration and partially observable environments |
| Directly optimizes what we care about | No approximation errors from value functions |
| Model-free | Environment dynamics cancel out in the gradient |

The main downside is high variance in the gradient estimate, which we’ll address with baselines in the next pages.