
Action Chains & Rewards

Reinforcement learning is built on the idea of an agent interacting with an environment over time. At each step, the agent observes a state, takes an action, and receives a reward. This loop — the action chain — is the core abstraction.

This loop is formalized as a Markov decision process (MDP), defined by the tuple $(S, A, P, R, \gamma)$:

| Symbol | Meaning |
| --- | --- |
| $S$ | Set of states |
| $A$ | Set of actions |
| $P(s' \mid s, a)$ | Transition probability |
| $R(s, a)$ | Reward function |
| $\gamma \in [0, 1)$ | Discount factor |
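To make the tuple concrete, here is a minimal sketch of a tabular MDP stored as NumPy arrays. The two-state, two-action setup and every number in it are made up for illustration; the array names `P`, `R`, and `gamma` simply mirror the symbols in the table.

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions. All values are illustrative.
n_states, n_actions = 2, 2

# P[s, a, s_next] = P(s' | s, a); each P[s, a] is a distribution over next states.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # from state 0, under action 0 and action 1
    [[0.0, 1.0], [0.5, 0.5]],   # from state 1, under action 0 and action 1
])

# R[s, a] = reward for taking action a in state s.
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])

gamma = 0.9  # discount factor, must lie in [0, 1)

# Sanity checks on the definition.
assert P.shape == (n_states, n_actions, n_states)
assert R.shape == (n_states, n_actions)
assert np.allclose(P.sum(axis=-1), 1.0)
assert 0.0 <= gamma < 1.0
```

Storing $P$ as a dense array only works for small, discrete state and action spaces; larger problems usually expose the dynamics through a simulator instead.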

The Markov property means the next state depends only on the current state and action, not on the earlier history:

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$$

A trajectory (or episode) is a sequence of states, actions, and rewards:

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$$

The agent’s behavior is governed by a policy $\pi(a \mid s)$, a distribution over actions given a state. The goal of RL is to find the policy that maximizes the expected cumulative reward.
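The sketch below shows one way a trajectory could be sampled from a tabular policy and a tabular MDP. The function name `sample_trajectory`, the toy arrays, and the fixed horizon `T` are assumptions made for this example, not part of any library.

```python
import numpy as np

def sample_trajectory(P, R, pi, s0, T, rng):
    """Roll out T steps. Returns (states, actions, rewards), len(states) == T + 1."""
    states, actions, rewards = [s0], [], []
    s = s0
    for _ in range(T):
        a = int(rng.choice(pi.shape[1], p=pi[s]))    # a_t ~ pi(. | s_t)
        r = float(R[s, a])                           # r_t = R(s_t, a_t)
        s = int(rng.choice(P.shape[2], p=P[s, a]))   # s_{t+1} ~ P(. | s_t, a_t)
        actions.append(a)
        rewards.append(r)
        states.append(s)
    return states, actions, rewards

# Hypothetical toy MDP and policy (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.array([[0.7, 0.3],    # pi(a | s=0)
               [0.4, 0.6]])   # pi(a | s=1)

rng = np.random.default_rng(0)
states, actions, rewards = sample_trajectory(P, R, pi, s0=0, T=5, rng=rng)
print(list(zip(states, actions, rewards)), "final state:", states[-1])
```

Note that the sampler only ever looks at the current state `s` when drawing the action and the next state, which is the Markov property in code form.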

The Chain Rule of Probability for Trajectories


The probability of a trajectory $\tau$ under policy $\pi$ decomposes by the chain rule:

$$P(\tau \mid \pi) = P(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t) \cdot P(s_{t+1} \mid s_t, a_t)$$

Breaking this apart:

  • $P(s_0)$: the initial state distribution (given by the environment)
  • $\pi(a_t \mid s_t)$: the policy’s action probability (what we control)
  • $P(s_{t+1} \mid s_t, a_t)$: the environment’s transition dynamics (not in our control)
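The three factors above can be multiplied out directly for a small tabular example. This is a sketch under assumed inputs: `rho0` stands in for the initial state distribution $P(s_0)$, and the arrays and the chosen trajectory are hypothetical.

```python
import numpy as np

def trajectory_probability(rho0, P, pi, states, actions):
    """P(tau | pi) = P(s_0) * prod_t pi(a_t | s_t) * P(s_{t+1} | s_t, a_t)."""
    prob = rho0[states[0]]                      # initial state term
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= pi[s, a] * P[s, a, s_next]      # policy term times dynamics term
    return prob

# Hypothetical toy quantities (illustrative numbers only).
rho0 = np.array([0.5, 0.5])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])

# Trajectory s_0=0, a_0=1, s_1=1, a_1=0, s_2=1 (rewards do not affect the probability).
states, actions = [0, 1, 1], [1, 0]
p_tau = trajectory_probability(rho0, P, pi, states, actions)
# By hand: 0.5 * (0.3 * 0.8) * (0.4 * 1.0) = 0.048
```

For long trajectories the product underflows quickly, which is one practical reason to work with log-probabilities instead.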

This factorization is fundamental. When we later derive the policy gradient, we’ll differentiate $\log P(\tau \mid \pi_\theta)$ with respect to the policy parameters $\theta$. Taking the logarithm turns the product into a sum:

$$\log P(\tau \mid \pi_\theta) = \log P(s_0) + \sum_{t=0}^{T-1} \left[ \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t) \right]$$

When we take the gradient $\nabla_\theta$, the environment terms $\log P(s_0)$ and $\log P(s_{t+1} \mid s_t, a_t)$ vanish (they don’t depend on $\theta$), leaving only:

$$\nabla_\theta \log P(\tau \mid \pi_\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

This is why policy gradient methods work without knowing the environment dynamics — we only need the gradient of our own policy.
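This claim can be checked numerically. The sketch below assumes a tabular softmax policy $\pi_\theta(a \mid s) = \mathrm{softmax}(\theta_s)_a$ (a choice made for this example, not something fixed by the text) and compares the dynamics-free sum of $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ with a finite-difference gradient of the full $\log P(\tau \mid \pi_\theta)$ under two different, made-up transition models.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def log_prob_trajectory(theta, rho0, P, states, actions):
    """Full log P(tau | pi_theta), including the environment terms."""
    pi = softmax(theta)
    logp = np.log(rho0[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        logp += np.log(pi[s, a]) + np.log(P[s, a, s_next])
    return logp

def score(theta, states, actions):
    """Sum over t of grad_theta log pi_theta(a_t | s_t). Uses no dynamics."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for t, a in enumerate(actions):
        s = states[t]
        grad[s] -= pi[s]      # from the log-normalizer of row s
        grad[s, a] += 1.0     # from the theta[s, a] term
    return grad

def finite_difference(theta, rho0, P, states, actions, eps=1e-6):
    """Central-difference gradient of the full trajectory log-probability."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        hi = log_prob_trajectory(theta + step, rho0, P, states, actions)
        lo = log_prob_trajectory(theta - step, rho0, P, states, actions)
        grad[idx] = (hi - lo) / (2 * eps)
    return grad

# Hypothetical inputs: two different environments, same policy and trajectory.
rho0 = np.array([0.5, 0.5])
P_a = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.3, 0.7], [0.5, 0.5]]])
P_b = np.array([[[0.5, 0.5], [0.6, 0.4]], [[0.1, 0.9], [0.7, 0.3]]])
theta = np.array([[0.2, -0.1], [0.0, 0.5]])
states, actions = [0, 1, 1], [1, 0]

analytic = score(theta, states, actions)
numeric_a = finite_difference(theta, rho0, P_a, states, actions)
numeric_b = finite_difference(theta, rho0, P_b, states, actions)
print(np.allclose(analytic, numeric_a, atol=1e-6),
      np.allclose(analytic, numeric_b, atol=1e-6))
```

Changing the transition model changes $\log P(\tau \mid \pi_\theta)$ itself, but only by a constant in $\theta$, so both finite-difference gradients should agree with `score`, which never touches `P`.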

Interactive demo (4×4 grid world): the agent 🤖 starts at (3,0), the goal at (0,3) gives +10, and the trap at (1,2) gives −5. Navigating the agent to the goal builds the trajectory $\tau = (s, a, r, s', \ldots)$ and the discounted return $G_0$ step by step.
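As a rough illustration of how $G_0$ builds up, the sketch below uses the standard definition $G_0 = \sum_{t} \gamma^t r_t$ on a hypothetical six-step path to the goal. The zero reward on ordinary steps and the value $\gamma = 0.9$ are assumptions for this example; the demo's actual settings are not stated here.

```python
def discounted_return(rewards, gamma):
    """G_0 = sum_t gamma^t * r_t, accumulated backwards from the last reward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical shortest path from (3,0) to the goal at (0,3): six moves,
# with an assumed reward of 0 per ordinary step and +10 on reaching the goal.
rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 10.0]
gamma = 0.9  # assumed discount factor

g0 = discounted_return(rewards, gamma)
# By hand: 10 * 0.9**5 = 5.9049
```

The backward recursion `g = r + gamma * g` avoids computing powers of $\gamma$ explicitly and yields the return from every intermediate step along the way.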

In the following pages, we’ll look at:

  • States and Actions — how to think about the state and action spaces
  • Rewards and Return — how rewards accumulate over time, and why discounting matters