Action Chains & Rewards
Reinforcement learning is built on the idea of an agent interacting with an environment over time. At each step, the agent observes a state, takes an action, and receives a reward. This loop — the action chain — is the core abstraction.
The Markov Decision Process (MDP)
An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
| Symbol | Meaning |
|---|---|
| $\mathcal{S}$ | Set of states |
| $\mathcal{A}$ | Set of actions |
| $P(s' \mid s, a)$ | Transition probability |
| $R(s, a)$ | Reward function |
| $\gamma$ | Discount factor |
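The Markov property means the future depends only on the current state (and the action taken in it), not on the history:

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)$$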
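As a rough illustration, a small tabular MDP can be written down as a plain data structure. The sketch below is not taken from this page: the class name `TabularMDP`, the two-state toy problem, and the choice of a reward keyed by `(state, action)` are all assumptions made for the example. Note that `step` reads only the current state and action, which is the Markov property in code form.

```python
import random
from dataclasses import dataclass


@dataclass
class TabularMDP:
    """Hypothetical container for the five MDP components."""
    states: tuple
    actions: tuple
    transitions: dict  # (s, a) -> {s_next: probability}
    rewards: dict      # (s, a) -> reward; missing entries count as 0.0
    gamma: float

    def step(self, state, action, rng):
        """Sample s' ~ P(. | s, a) using only the current state and action."""
        dist = self.transitions[(state, action)]
        next_states = list(dist.keys())
        weights = list(dist.values())
        next_state = rng.choices(next_states, weights=weights, k=1)[0]
        return next_state, self.rewards.get((state, action), 0.0)


# Toy two-state problem, invented for illustration only.
toy_mdp = TabularMDP(
    states=("A", "B"),
    actions=("stay", "move"),
    transitions={
        ("A", "stay"): {"A": 1.0},
        ("A", "move"): {"A": 0.2, "B": 0.8},
        ("B", "stay"): {"B": 1.0},
        ("B", "move"): {"A": 0.8, "B": 0.2},
    },
    rewards={("A", "move"): 1.0},
    gamma=0.9,
)

# Each row P(. | s, a) should be a probability distribution.
for dist in toy_mdp.transitions.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

rng = random.Random(0)
next_state, reward = toy_mdp.step("A", "move", rng)
```

No history is passed to `step`, so two calls with the same `(state, action)` draw from the same distribution regardless of how the agent got there.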
The Action Chain
A trajectory (or episode) is a sequence of states, actions, and rewards:

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T)$$
The agent’s behavior is governed by a policy $\pi(a \mid s)$, a distribution over actions given a state. The goal of RL is to find the policy that maximizes the expected cumulative reward.
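The observe, act, receive-reward loop can be sketched as a short rollout function. Everything in this example is an assumption chosen for illustration: the five-cell corridor, the `GOAL` constant, the fixed "move right with probability 0.7" policy, and the helper names `policy`, `env_step`, and `rollout`.

```python
import random

GOAL = 4  # hypothetical corridor with cells 0..4; the episode ends at cell 4


def policy(state, rng):
    """Sample a ~ pi(. | s): move right with probability 0.7, else left."""
    return +1 if rng.random() < 0.7 else -1


def env_step(state, action):
    """Deterministic toy dynamics: move, clip to the corridor, reward 1 at the goal."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def rollout(start_state, rng, max_steps=50):
    """Run the interaction loop and record (s_t, a_t, r_t) at every step."""
    trajectory = []
    state = start_state
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = env_step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory, state


trajectory, final_state = rollout(start_state=0, rng=random.Random(0))
for t, (s, a, r) in enumerate(trajectory):
    print(f"t={t}: s={s}, a={a:+d}, r={r}")
```

The policy here is fixed and state-independent; a learning algorithm would replace `policy` with something parameterized and adjust it to increase the expected cumulative reward.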
The Chain Rule of Probability for Trajectories
The probability of a trajectory $\tau$ under policy $\pi$ decomposes by the chain rule:

$$P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t) \, P(s_{t+1} \mid s_t, a_t)$$
Breaking this apart:
- $\rho_0(s_0)$: the initial state distribution (given by the environment)
- $\pi(a_t \mid s_t)$: the policy’s action probability (what we control)
- $P(s_{t+1} \mid s_t, a_t)$: the environment’s transition dynamics (not in our control)
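To make the three factors concrete, the following sketch evaluates the probability of one short trajectory by multiplying them together (in log space, to avoid underflow on long trajectories). The tables `rho0`, `pi`, and `P` are toy values invented for this example, and `trajectory_log_prob` is a hypothetical helper name.

```python
import math

# Hypothetical toy quantities, chosen only for illustration.
rho0 = {"A": 1.0, "B": 0.0}                # initial state distribution
pi = {                                     # policy pi(a | s)
    "A": {"stay": 0.3, "move": 0.7},
    "B": {"stay": 0.9, "move": 0.1},
}
P = {                                      # dynamics P(s' | s, a)
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 0.8, "B": 0.2},
}


def trajectory_log_prob(states, actions):
    """Sum log rho0(s_0) and, per step, log pi(a_t|s_t) + log P(s_{t+1}|s_t,a_t)."""
    assert len(states) == len(actions) + 1
    log_p = math.log(rho0[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        log_p += math.log(pi[s][a])          # policy factor
        log_p += math.log(P[(s, a)][s_next])  # environment factor
    return log_p


states = ["A", "A", "B", "B"]
actions = ["move", "move", "stay"]
prob = math.exp(trajectory_log_prob(states, actions))
# By hand: 1.0 * (0.7 * 0.2) * (0.7 * 0.8) * (0.9 * 1.0) = 0.07056
```

This sketch assumes every step of the trajectory has nonzero probability; a zero-probability start state or action would make `math.log` raise, and an impossible transition would raise `KeyError`.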
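This factorization is fundamental. When we later derive the policy gradient, we’ll differentiate the log-probability of a trajectory with respect to the policy parameters $\theta$. Taking the logarithm of the chain-rule factorization turns the product into a sum:

$$\log P(\tau \mid \pi_\theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \left[ \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t) \right]$$

When we take the gradient $\nabla_\theta$, the environment terms vanish (they don’t depend on $\theta$), leaving only:

$$\nabla_\theta \log P(\tau \mid \pi_\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$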
This is why policy gradient methods work without knowing the environment dynamics — we only need the gradient of our own policy.
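A quick numerical check can make this claim tangible. The sketch below assumes a tabular softmax policy with one logit per state-action pair (an assumption for this example, not something the page prescribes) and compares two quantities: a finite-difference gradient of the full trajectory log-probability, which includes the transition terms, and the sum of policy score terms, which never looks at the dynamics. All names and numbers are hypothetical.

```python
import math

STATES, ACTIONS = ("A", "B"), ("stay", "move")
P = {  # toy dynamics P(s' | s, a); used only by the full log-probability
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 0.8, "B": 0.2},
}


def softmax_policy(theta, s):
    """pi_theta(. | s) as a softmax over the logits theta[(s, a)]."""
    logits = [theta[(s, a)] for a in ACTIONS]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return {a: e / z for a, e in zip(ACTIONS, exps)}


def log_prob_trajectory(theta, states, actions):
    """Full log P(tau | theta), including the environment's transition terms."""
    total = 0.0  # the start state is fixed here, so log rho0(s_0) = 0
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        total += math.log(softmax_policy(theta, s)[a])
        total += math.log(P[(s, a)][s_next])
    return total


def score_sum(theta, states, actions):
    """Sum over t of grad_theta log pi_theta(a_t | s_t); never reads P."""
    grad = {key: 0.0 for key in theta}
    for t, a in enumerate(actions):
        s = states[t]
        probs = softmax_policy(theta, s)
        for b in ACTIONS:
            # Softmax identity: d/d theta[(s, b)] log pi(a | s) = 1[a == b] - pi(b | s)
            grad[(s, b)] += (1.0 if b == a else 0.0) - probs[b]
    return grad


def finite_difference(theta, states, actions, eps=1e-6):
    """Central-difference gradient of the full log-probability."""
    grad = {}
    for key in theta:
        up, down = dict(theta), dict(theta)
        up[key] += eps
        down[key] -= eps
        grad[key] = (log_prob_trajectory(up, states, actions)
                     - log_prob_trajectory(down, states, actions)) / (2 * eps)
    return grad


theta = {("A", "stay"): 0.1, ("A", "move"): -0.4,
         ("B", "stay"): 0.7, ("B", "move"): 0.0}
states = ["A", "A", "B", "A"]
actions = ["move", "move", "move"]

analytic = score_sum(theta, states, actions)
numeric = finite_difference(theta, states, actions)
max_gap = max(abs(analytic[k] - numeric[k]) for k in theta)
print(f"largest difference between the two gradients: {max_gap:.2e}")
```

The two gradients agree up to finite-difference error even though `score_sum` has no access to `P`, which is the practical content of the derivation above.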
Interactive Demo: Build a Trajectory
Navigate the agent to the goal. Watch how the trajectory and discounted return are constructed step by step:
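For readers following along without the interactive widget, the step-by-step construction of a discounted return can be sketched in a few lines. This assumes the standard definition, in which the reward at step $t$ is weighted by $\gamma^t$; the reward sequence and the helper name `discounted_return_steps` are made up for the example.

```python
def discounted_return_steps(rewards, gamma):
    """Return one (t, r_t, gamma**t * r_t, running_return) tuple per step."""
    steps = []
    running = 0.0
    discount = 1.0  # equals gamma**t at step t
    for t, r in enumerate(rewards):
        contribution = discount * r
        running += contribution
        steps.append((t, r, contribution, running))
        discount *= gamma
    return steps


# Hypothetical episode: no reward for two steps, then reward 1 at the goal.
rewards = [0.0, 0.0, 1.0]
gamma = 0.9
steps = discounted_return_steps(rewards, gamma)
for t, r, contribution, running in steps:
    print(f"t={t}: r={r}, discounted={contribution:.3f}, return so far={running:.3f}")
```

With these numbers the only nonzero contribution comes at `t=2`, where the reward of 1 is scaled by `0.9 ** 2`.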
What’s Next
In the following pages, we’ll look at:
- States and Actions — how to think about the state and action spaces
- Rewards and Return — how rewards accumulate over time, and why discounting matters