States and Actions

A state $s \in S$ captures everything the agent needs to make a decision. In practice, states can be:

  • Discrete — e.g., grid positions in a maze, board configurations in a game
  • Continuous — e.g., joint angles and velocities of a robot, pixel values in an image

The key requirement is the Markov property: the state must contain enough information to predict the future without knowing the past.
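Stated a little more formally, the Markov property says that conditioning on the full history adds nothing beyond the current state and action. The following is the standard textbook form, written with the transition distribution $P$ used in this section; the choice to start the time index at $0$ is an assumption made only for this illustration:

$$
\Pr(s_{t+1} \mid s_0, a_0, s_1, a_1, \dots, s_t, a_t) = P(s_{t+1} \mid s_t, a_t)
$$

For example, a single video frame of a moving ball is usually not Markov on its own, because velocity cannot be read from one frame; a state that includes both position and velocity is.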

The action space $A$ defines what the agent can do:

  • Discrete actions — e.g., move left/right/up/down, select a token from vocabulary
  • Continuous actions — e.g., torque applied to a joint, steering angle

For language models, the action space is the vocabulary — at each step, the model selects a token.

The policy $\pi_\theta(a \mid s)$ maps states to a distribution over actions. It’s parameterized by $\theta$ (e.g., the weights of a neural network).

  • Deterministic policy: $a = \mu_\theta(s)$
  • Stochastic policy: $a \sim \pi_\theta(\cdot \mid s)$
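The difference between the two kinds of policy can be seen in a minimal sketch. Everything here is hypothetical and chosen only for illustration: the action names, the state labels `"s0"` and `"s1"`, and the tabular parameterization in which `theta` stores one row of scores per state. The deterministic policy is taken to be the highest-scoring action, which is one common choice rather than the only one.

```python
import math
import random

ACTIONS = ["left", "right", "up", "down"]  # hypothetical discrete action set


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probs(theta, state):
    """pi_theta(. | s) for a tabular parameterization (assumed here)."""
    return softmax(theta[state])


def stochastic_policy(theta, state, rng):
    """Sample a ~ pi_theta(. | s)."""
    probs = action_probs(theta, state)
    return rng.choices(ACTIONS, weights=probs, k=1)[0]


def deterministic_policy(theta, state):
    """a = mu_theta(s): here, the action with the highest score."""
    logits = theta[state]
    best = max(range(len(ACTIONS)), key=lambda i: logits[i])
    return ACTIONS[best]


# theta plays the role of the policy parameters: one score per action, per state.
theta = {
    "s0": [0.0, 2.0, 0.0, 0.0],  # strongly prefers "right"
    "s1": [1.0, 1.0, 1.0, 1.0],  # indifferent between all actions
}
rng = random.Random(0)

print(deterministic_policy(theta, "s0"))
print(stochastic_policy(theta, "s1", rng))
```

Calling `deterministic_policy` repeatedly on the same state always returns the same action, while `stochastic_policy` can return different actions on each call, with frequencies governed by `action_probs`.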

In deep RL, the policy is typically a neural network. For language models, the policy is the model — it outputs a probability distribution over the next token given the context.
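To make the language-model case concrete, here is a toy sketch of next-token sampling. It assumes the state is the token sequence generated so far and that the next state is that sequence with the sampled token appended, which is a common way to frame generation but is an assumption here. The vocabulary and the `toy_logits` function are hypothetical stand-ins; a real model would compute the scores with a neural network.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "<eos>"]  # hypothetical toy vocabulary


def toy_logits(context):
    """Stand-in for a language model: one score per vocabulary token.

    The end-of-sequence score grows with context length, purely so that
    this illustration tends to stop after a few tokens.
    """
    return [1.0, 0.5, 0.5, 0.2 * len(context)]


def next_token_distribution(context):
    """pi_theta(. | s): softmax over the vocabulary given the context."""
    logits = toy_logits(context)
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def generate(prompt, rng, max_new_tokens=10):
    """Sample tokens one at a time until <eos> or the token budget runs out."""
    context = list(prompt)  # state: the tokens seen so far
    for _ in range(max_new_tokens):
        probs = next_token_distribution(context)
        token = rng.choices(VOCAB, weights=probs, k=1)[0]  # action: one token
        context.append(token)  # next state: context plus the new token
        if token == "<eos>":
            break
    return context


print(generate(["the"], random.Random(0)))
```

Each pass through the loop is one decision: the action set is the whole vocabulary, and the distribution over it changes as the context grows.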

At each timestep $t$:

  1. Agent observes state $s_t$
  2. Agent samples action $a_t \sim \pi_\theta(\cdot \mid s_t)$
  3. Environment transitions to $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
  4. Agent receives reward $r_t = R(s_t, a_t)$

This loop repeats until a terminal state is reached (an episodic task) or continues indefinitely (a continuing task).
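The four steps of the interaction loop can be sketched on a tiny episodic task. The environment below is hypothetical: a corridor with positions 0 to 4, a goal at position 4, moves that succeed with probability 0.9, and a reward of -1 per step. The fixed random policy and the `max_steps` cap are likewise assumptions added so that the example is self-contained and always terminates.

```python
import random

GOAL = 4  # hypothetical terminal state in a corridor with positions 0..4


def policy(state, rng):
    """Sample a_t ~ pi(. | s_t): move right (+1) with probability 0.8, else left (-1)."""
    return +1 if rng.random() < 0.8 else -1


def transition(state, action, rng):
    """Sample s_{t+1} ~ P(. | s_t, a_t): the move succeeds with probability 0.9,
    otherwise the agent slips in the opposite direction. Positions are clipped to 0..GOAL."""
    step = action if rng.random() < 0.9 else -action
    return min(max(state + step, 0), GOAL)


def reward(state, action):
    """r_t = R(s_t, a_t): a constant cost of 1 per step (assumed for illustration)."""
    return -1.0


def run_episode(rng, max_steps=100):
    """Roll out one episode and return a list of (s_t, a_t, r_t, s_{t+1}) tuples."""
    state = 0
    trajectory = []
    for _ in range(max_steps):
        if state == GOAL:  # terminal state reached: the episode ends
            break
        action = policy(state, rng)                   # steps 1-2: observe s_t, sample a_t
        next_state = transition(state, action, rng)   # step 3: environment transitions
        r = reward(state, action)                     # step 4: agent receives reward
        trajectory.append((state, action, r, next_state))
        state = next_state
    return trajectory


trajectory = run_episode(random.Random(0))
print(len(trajectory), "steps, return =", sum(r for _, _, r, _ in trajectory))
```

A continuing task would drop the terminal check and the step cap and simply keep looping; the body of the loop stays the same.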