Instead of learning a value function and deriving a policy from it (as in Q-learning), policy gradient methods directly optimize the policy parameters $\theta$ to maximize expected return.
We want to find $\theta$ that maximizes:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right] = \sum_{\tau} P(\tau \mid \theta) \, R(\tau)$$
where $R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ is the total (discounted) return of trajectory $\tau$.
The idea: compute the gradient $\nabla_\theta J(\theta)$ and update $\theta$ in the direction that increases $J$.
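As a sketch, the simplest form of this update is plain gradient ascent. The step size $\alpha > 0$ and the iteration index $k$ are assumptions introduced here for illustration; they are not defined elsewhere on this page:

$$\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\theta) \Big|_{\theta = \theta_k}$$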
Starting from the definition:
$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau} P(\tau \mid \theta) \, R(\tau) = \sum_{\tau} \nabla_\theta P(\tau \mid \theta) \, R(\tau)$$
Here’s the problem: $\nabla_\theta P(\tau \mid \theta)$ is hard to compute directly because $P(\tau \mid \theta)$ involves the environment dynamics. We need a trick.
The key identity: for any function $f(x) > 0$,
$$\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}$$
Rearranging:
$$\nabla_x f(x) = f(x) \, \nabla_x \log f(x)$$
Apply this to $P(\tau \mid \theta)$:
$$\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta) \, \nabla_\theta \log P(\tau \mid \theta)$$
Substituting back:
$$\nabla_\theta J(\theta) = \sum_{\tau} P(\tau \mid \theta) \, \nabla_\theta \log P(\tau \mid \theta) \, R(\tau) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log P(\tau \mid \theta) \, R(\tau) \right]$$
This is powerful: we’ve turned a sum weighted by $\nabla_\theta P$ into an expectation under $P$ itself, something we can estimate by sampling trajectories.
Recall the chain rule factorization from the action chain section:
$$P(\tau \mid \theta) = P(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) \cdot P(s_{t+1} \mid s_t, a_t)$$
Taking the log:
$$\log P(\tau \mid \theta) = \underbrace{\log P(s_0)}_{\text{no } \theta} + \sum_{t=0}^{T-1} \Big[ \log \pi_\theta(a_t \mid s_t) + \underbrace{\log P(s_{t+1} \mid s_t, a_t)}_{\text{no } \theta} \Big]$$
Taking the gradient $\nabla_\theta$, all environment terms vanish:
$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
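The cancellation of the environment terms can be checked numerically. The sketch below assumes a hypothetical 2-state, 2-action MDP with made-up numbers (`P0`, `DYN`) and a softmax policy with one logit per state-action pair; none of these come from the text. It compares a finite-difference gradient of the full $\log P(\tau \mid \theta)$, dynamics included, against the closed-form sum of $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ terms.

```python
import math

# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
P0 = [0.6, 0.4]                       # P(s_0)
DYN = {                               # P(s' | s, a)
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.3, 0.7],
}

def policy(theta, s):
    """Softmax policy pi_theta(. | s) with one logit per (state, action)."""
    logits = theta[s]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def log_traj_prob(theta, states, actions):
    """log P(tau | theta): initial-state, policy, and dynamics terms."""
    lp = math.log(P0[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        lp += math.log(policy(theta, s)[a])
        lp += math.log(DYN[(s, a)][s_next])
    return lp

def sum_grad_log_pi(theta, states, actions):
    """Sum over t of grad_theta log pi_theta(a_t | s_t), in closed form."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for t, a in enumerate(actions):
        s = states[t]
        probs = policy(theta, s)
        for b in range(2):
            # d/d(logit b) of log softmax at action a is 1[b == a] - p_b.
            grad[s][b] += (1.0 if b == a else 0.0) - probs[b]
    return grad

def finite_diff_grad(theta, states, actions, eps=1e-6):
    """Central finite differences of the full log P(tau | theta)."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for b in range(2):
            up = [row[:] for row in theta]
            down = [row[:] for row in theta]
            up[s][b] += eps
            down[s][b] -= eps
            grad[s][b] = (log_traj_prob(up, states, actions)
                          - log_traj_prob(down, states, actions)) / (2 * eps)
    return grad

theta = [[0.3, -0.2], [0.1, 0.5]]
states = [0, 1, 1, 0]      # s_0 .. s_T with T = 3
actions = [1, 0, 1]        # a_0 .. a_{T-1}

analytic = sum_grad_log_pi(theta, states, actions)
numeric = finite_diff_grad(theta, states, actions)
max_gap = max(abs(analytic[s][b] - numeric[s][b])
              for s in range(2) for b in range(2))
print(max_gap)  # tiny: the dynamics terms contribute nothing to the gradient
```

Changing the entries of `DYN` or `P0` changes `log_traj_prob` but leaves both gradients unchanged, which is the practical meaning of "model-free" here.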
Combining everything:
$$\boxed{ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right] }$$
where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from timestep $t$. The sum runs over $t = 0, \dots, T-1$, matching the actions that appear in the trajectory factorization.
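A minimal sketch of computing every return $G_t$ in one backward pass, using the recursion $G_t = r_t + \gamma G_{t+1}$ that follows from the definition. The function name `rewards_to_go` is a hypothetical helper chosen for this example.

```python
def rewards_to_go(rewards, gamma):
    """Return [G_0, G_1, ...] where G_t = sum_{k >= t} gamma**(k - t) * r_k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

The backward sweep costs one pass over the episode, whereas evaluating each sum from scratch would be quadratic in the episode length.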
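Why $G_t$ instead of $R(\tau)$?
Using the full trajectory return $R(\tau)$ for every timestep works but has higher variance. Since actions at time $t$ cannot affect rewards received before time $t$, those earlier rewards contribute nothing to the gradient in expectation, so we can replace $R(\tau)$ with the future return $G_t$. This is called the “reward-to-go” trick; dropping the past rewards reduces variance without introducing bias. Strictly, when $\gamma < 1$ the remaining terms of $R(\tau)$ sum to $\gamma^t G_t$; the $\gamma^t$ factor is commonly omitted in practice, as in the boxed form above.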
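The unbiasedness claim can be checked exactly on a problem small enough to enumerate. The sketch below assumes a hypothetical two-step task with two actions per step, made-up reward tables `R0` and `R1`, a softmax policy with separate logits per step, and $\gamma = 1$ (so the discount factor plays no role). It computes the exact expectation of both estimators and compares each with a finite-difference gradient of $J(\theta)$.

```python
import itertools
import math

R0 = [1.0, 0.0]                    # r_0 as a function of a_0 (made-up numbers)
R1 = [[0.0, 2.0], [3.0, -1.0]]     # r_1 as a function of (a_0, a_1)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, t, a):
    """grad_theta log pi_theta(a | step t) as a flat length-4 vector."""
    probs = softmax(theta[2 * t: 2 * t + 2])
    g = [0.0] * 4
    for b in range(2):
        g[2 * t + b] = (1.0 if b == a else 0.0) - probs[b]
    return g

def expected_estimators(theta):
    """Exact expectations of the full-return and reward-to-go estimators."""
    full = [0.0] * 4
    togo = [0.0] * 4
    p0, p1 = softmax(theta[0:2]), softmax(theta[2:4])
    for a0, a1 in itertools.product(range(2), repeat=2):
        prob = p0[a0] * p1[a1]
        r0, r1 = R0[a0], R1[a0][a1]
        g0, g1 = grad_log_pi(theta, 0, a0), grad_log_pi(theta, 1, a1)
        for i in range(4):
            # Full return R(tau) = r0 + r1 multiplies every step's score.
            full[i] += prob * (g0[i] + g1[i]) * (r0 + r1)
            # Reward-to-go: step 0 sees r0 + r1, step 1 sees only r1.
            togo[i] += prob * (g0[i] * (r0 + r1) + g1[i] * r1)
    return full, togo

def objective(theta):
    """J(theta) by exact enumeration of all four trajectories."""
    p0, p1 = softmax(theta[0:2]), softmax(theta[2:4])
    return sum(p0[a0] * p1[a1] * (R0[a0] + R1[a0][a1])
               for a0 in range(2) for a1 in range(2))

def finite_diff(theta, eps=1e-6):
    grad = []
    for i in range(4):
        up, down = theta[:], theta[:]
        up[i] += eps
        down[i] -= eps
        grad.append((objective(up) - objective(down)) / (2 * eps))
    return grad

theta = [0.2, -0.4, 0.7, 0.1]
full, togo = expected_estimators(theta)
reference = finite_diff(theta)
print(full)
print(togo)
print(reference)
```

All three printed vectors should agree up to finite-difference error. The two estimators differ only in per-sample spread, not in their mean.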
The log-derivative trick is not specific to RL: it’s a general technique called the score function estimator (also known as the REINFORCE estimator or the likelihood ratio trick), used throughout machine learning:

| Application | What plays the role of $\pi_\theta$ | What plays the role of $R(\tau)$ |
| --- | --- | --- |
| RL policy gradient | Policy $\pi_\theta(a \mid s)$ | Return $G_t$ |
| Variational inference | Variational distribution $q_\theta(z)$ | ELBO signal |
| Black-box optimization | Sampling distribution | Objective value |

The pattern is always the same: to differentiate an expectation over a distribution you control, use $\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x) \cdot f(x)]$.
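To see the general pattern outside RL, here is a small Monte Carlo sketch under assumed choices that are not from the text: $p_\theta = \mathcal{N}(\theta, 1)$ and $f(x) = x^2$. In that case $\mathbb{E}[f(x)] = \theta^2 + 1$, so the exact gradient is $2\theta$, and the score is $\nabla_\theta \log p_\theta(x) = x - \theta$.

```python
import random

def score_function_gradient(theta, f, num_samples, rng):
    """Monte Carlo estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)]."""
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(theta, 1.0)
        score = x - theta          # d/dtheta log N(x; theta, 1)
        total += score * f(x)      # f is only evaluated, never differentiated
    return total / num_samples

rng = random.Random(0)
estimate = score_function_gradient(1.0, lambda x: x * x, 200_000, rng)
print(estimate)  # should land near the exact value 2 * theta = 2.0
```

Note that `f` is treated as a black box. The estimate is also noisy even with many samples, which illustrates the variance issue that comes with this estimator.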
| Advantage | Explanation |
| --- | --- |
| Handles continuous actions | No need for argmax over actions |
| Can learn stochastic policies | Useful for exploration and partially observable environments |
| Directly optimizes what we care about | No approximation errors from value functions |
| Model-free | Environment dynamics cancel out in the gradient |
The main downside is high variance in the gradient estimate, which we’ll address with baselines in the next pages.