REINFORCE
The Algorithm
Section titled “The Algorithm”REINFORCE (Williams, 1992) is the most basic policy gradient method. It estimates the gradient from complete episodes:
- Sample a trajectory by running
- For each timestep , compute the return
- Update:
The gradient estimate is unbiased — in expectation, it equals the true gradient. But it has high variance because includes all future rewards, which are noisy.
Pseudocode
Section titled “Pseudocode”Initialize policy parameters θfor each episode: Generate trajectory τ ~ π_θ for t = 0 to T: G_t ← Σ_{k=t}^{T} γ^{k-t} · r_k θ ← θ + α · ∇_θ log π_θ(a_t | s_t) · G_tInteractive: Gradient Ascent on a Non-Convex Landscape
Section titled “Interactive: Gradient Ascent on a Non-Convex Landscape”The objective is rarely convex in practice — it has local optima, saddle regions, and the gradient is estimated from noisy samples. Explore how these affect optimization:
The Score Function Estimator
Section titled “The Score Function Estimator”The term is called the score function. It has a useful property:
This is what allows us to turn an expectation over trajectories into a practical Monte Carlo estimate — we sample trajectories and multiply the score by the return.
For a categorical policy (like a language model), if , then the score function is just the gradient of the log-softmax output.