REINFORCE

REINFORCE (Williams, 1992) is the most basic policy gradient method. It estimates the gradient from complete episodes:

  1. Sample a trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T)$ by running $\pi_\theta$
  2. For each timestep $t$, compute the return $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$
  3. Update: $\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t$
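Step 2 can be computed in a single backward pass, since $G_t = r_t + \gamma G_{t+1}$. The following is a minimal sketch in plain Python; the function name `discounted_returns` is chosen here for illustration and does not come from the text.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] where G_t = sum_{k>=t} gamma^(k-t) * r_k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

The backward recursion costs one pass over the episode, whereas evaluating each sum from scratch would be quadratic in the episode length.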

The gradient estimate is unbiased: in expectation, it equals the true gradient. But it has high variance, because $G_t$ sums all future rewards, each of which depends on randomly sampled actions and transitions.

Initialize policy parameters θ
for each episode:
    Generate trajectory τ ~ π_θ
    for t = 0 to T:
        G_t ← Σ_{k=t}^{T} γ^{k-t} · r_k
        θ ← θ + α · ∇_θ log π_θ(a_t | s_t) · G_t
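The pseudocode above can be sketched as runnable Python. Everything about the setup is an assumption made for illustration: a toy task with a single state where action 1 pays reward 1 and action 0 pays nothing, a two-logit softmax policy, and the hyperparameter values. The sketch applies the summed update from step 3 once per episode rather than updating inside the timestep loop.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_episode(theta, rng, horizon=5):
    """Hypothetical toy task: one state, action 1 pays 1.0, action 0 pays 0.0."""
    probs = softmax(theta)
    actions, rewards = [], []
    for _ in range(horizon):
        a = 0 if rng.random() < probs[0] else 1
        actions.append(a)
        rewards.append(1.0 if a == 1 else 0.0)
    return actions, rewards


def reinforce(num_episodes=500, alpha=0.05, gamma=0.99, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # one logit per action
    for _ in range(num_episodes):
        actions, rewards = run_episode(theta, rng)
        # Returns via the backward recursion G_t = r_t + gamma * G_{t+1}.
        returns, g = [0.0] * len(rewards), 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g
        # Accumulate sum_t grad log pi(a_t) * G_t under the sampling policy.
        probs = softmax(theta)
        grad = [0.0, 0.0]
        for a, g_t in zip(actions, returns):
            for i in range(2):
                # d/d theta_i of log softmax(theta)_a is 1[i == a] - probs[i].
                grad[i] += ((1.0 if i == a else 0.0) - probs[i]) * g_t
        # Gradient ascent step.
        theta = [th + alpha * gr for th, gr in zip(theta, grad)]
    return theta


theta = reinforce()
print(softmax(theta))  # probability mass should shift toward action 1
```

Because the policy here ignores the state, this is closer to a repeated bandit than a full control problem; a state-dependent policy would replace `theta` with a function of $s_t$ and leave the update unchanged.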

Interactive: Gradient Ascent on a Non-Convex Landscape

The objective $J(\theta)$ is rarely convex in practice: it has local optima and saddle regions, and its gradient is estimated from noisy samples. Both the shape of the landscape and the noise in the gradient affect where gradient ascent ends up.
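As a small offline stand-in for this kind of exploration, the sketch below runs gradient ascent with an optionally noisy gradient on a one-dimensional objective. The objective `J`, its two local maxima, the Gaussian noise model, and all parameter values are hypothetical choices made for illustration, not something taken from the text.

```python
import random


def J(theta):
    # Hypothetical non-convex objective with two local maxima.
    return -theta**4 + 2 * theta**2 + 0.5 * theta


def grad_J(theta):
    return -4 * theta**3 + 4 * theta + 0.5


def ascend(theta0, alpha=0.05, steps=500, noise_std=0.5, seed=0):
    """Gradient ascent where each gradient is corrupted by Gaussian noise."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        noisy_grad = grad_J(theta) + rng.gauss(0.0, noise_std)
        theta += alpha * noisy_grad
    return theta


# With the noise switched off, the starting point alone decides which
# local maximum is reached.
for start in (-1.5, 1.5):
    end = ascend(start, noise_std=0.0)
    print(f"start={start:+.1f}  end={end:+.3f}  J(end)={J(end):.3f}")
```

Rerunning `ascend` with a nonzero `noise_std` and different seeds shows the final iterate jittering around a maximum rather than settling on it.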

The term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is called the score function. It has a useful property:

$$\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}$$

This identity is what lets us turn the gradient of an expectation over trajectories into a practical Monte Carlo estimate: we sample trajectories and multiply the score by the return.
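For reference, the step from the identity to the sampled estimator can be written out as follows. This is the standard log-derivative argument. The notation is assumed here rather than taken from the text: $p_\theta(\tau)$ is the probability of a trajectory under $\pi_\theta$, $R(\tau)$ is its total return, and $J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$.

$$
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
   = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
   = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right] \\
\nabla_\theta \log p_\theta(\tau)
  &= \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
\end{aligned}
$$

The last line holds because the initial-state distribution and the environment transitions do not depend on $\theta$, so only the policy terms survive differentiation. Replacing $R(\tau)$ with $G_t$ at each timestep relies on one more fact: the action at time $t$ cannot affect rewards received before $t$, so those terms vanish in expectation.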

For a categorical policy (like a language model), if $\pi_\theta(a \mid s) = \text{softmax}(f_\theta(s))_a$, then the score function is just the gradient of the log-softmax output.
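With respect to the logits $z = f_\theta(s)$, the gradient of $\log \text{softmax}(z)_a$ has the closed form $\mathbb{1}[i = a] - \text{softmax}(z)_i$. The sketch below checks that closed form against central finite differences. The function names are chosen here for illustration, and the gradient with respect to $\theta$ itself would still require the chain rule through $f_\theta$, which is not shown.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def score_wrt_logits(logits, action):
    """Closed form: d/dz_i log softmax(z)_action = 1[i == action] - softmax(z)_i."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def finite_difference_score(logits, action, eps=1e-6):
    """Central-difference approximation of the same gradient."""
    grads = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += eps
        down[i] -= eps
        diff = log_softmax(up)[action] - log_softmax(down)[action]
        grads.append(diff / (2 * eps))
    return grads


logits = [2.0, -1.0, 0.5]
analytic = score_wrt_logits(logits, action=2)
numeric = finite_difference_score(logits, action=2)
print(max(abs(a - n) for a, n in zip(analytic, numeric)))  # should be tiny
```

The entries of the score sum to zero, which reflects that raising one action's log-probability must lower the others'.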