Rewards and Return

The reward $r_t = R(s_t, a_t)$ is a scalar signal that tells the agent how good its action was. Rewards can be:

  • Sparse — only at the end (e.g., win/lose in a game)
  • Dense — at every step (e.g., distance to goal)
  • Shaped — hand-crafted to guide learning (careful: can introduce bias)

For language models, the reward typically comes from a reward model trained on human preferences, or from a rule-based function (e.g., format checking, code execution results).
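As a rough illustration of the rule-based case, here is a minimal sketch of a format-checking reward. The `<answer>...</answer>` tag convention and the function name `format_reward` are assumptions made up for this example, not something a particular library or training setup prescribes.

```python
import re


def format_reward(completion: str) -> float:
    """Hypothetical rule-based reward for a language model completion.

    Returns 1.0 if the text contains exactly one non-empty
    <answer>...</answer> block (an assumed convention), else 0.0.
    """
    # DOTALL lets the answer span multiple lines.
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if len(matches) == 1 and matches[0].strip():
        return 1.0
    return 0.0


print(format_reward("The result is <answer>42</answer>"))  # well-formed
print(format_reward("The result is 42"))                   # missing tags
```

Because it scores the whole completion once, a reward like this is sparse in the sense of the list above.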

The return $G_t$ is the total reward from timestep $t$ onward:

$$G_t = r_t + r_{t+1} + r_{t+2} + \cdots = \sum_{k=0}^{\infty} r_{t+k}$$

But over an infinite horizon this sum can diverge (a constant reward of 1 at every step sums to infinity). Enter the discount factor.

The discounted return weights future rewards exponentially less:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
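For a finite episode, the discounted return at every timestep can be computed in one backward pass using the recursion $G_t = r_t + \gamma G_{t+1}$, which follows directly from the definition. The sketch below assumes the episode ends after the last reward (all later rewards are treated as zero); the function name `discounted_returns` is chosen for illustration.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t for every timestep t of a finite episode."""
    returns = [0.0] * len(rewards)
    running = 0.0  # G_{t+1}; zero past the end of the episode
    # Walk backward so each step reuses the return already computed after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

The backward pass costs one multiply-add per step, whereas summing the series separately for each $t$ would be quadratic in the episode length.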

where $\gamma \in [0, 1)$. The discount factor controls how far-sighted the agent is:

| $\gamma$ | Behavior |
|---|---|
| $\gamma \to 0$ | Myopic: only cares about immediate reward |
| $\gamma \to 1$ | Far-sighted: values future rewards almost equally |

With constant reward $r = 1$, the return converges to $\frac{1}{1 - \gamma}$.
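This limit is the geometric series. A short worked derivation, using $\gamma = 0.99$ purely as an illustrative value:

$$G_t = \sum_{k=0}^{\infty} \gamma^k \cdot 1 = \lim_{n \to \infty} \sum_{k=0}^{n-1} \gamma^k = \lim_{n \to \infty} \frac{1 - \gamma^n}{1 - \gamma} = \frac{1}{1 - \gamma}, \qquad \gamma = 0.99 \;\Rightarrow\; G_t = \frac{1}{0.01} = 100$$

The step $\gamma^n \to 0$ relies on $\gamma < 1$, which is why the interval for $\gamma$ excludes 1.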

The state-value function $V^\pi(s)$ is the expected return from state $s$ under policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi \left[ G_t \mid s_t = s \right]$$

The action-value function $Q^\pi(s, a)$ is the expected return from taking action $a$ in state $s$, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid s_t = s, a_t = a \right]$$

The advantage function measures how much better an action is than the policy's average behavior in that state:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
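For a discrete action space, the two value functions are linked by $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$, so advantages for a single state can be computed from its Q-values and the policy's action probabilities. The sketch below uses made-up numbers and a hypothetical helper name `advantages`; in practice both quantities would be estimated rather than known exactly.

```python
from typing import List


def advantages(q_values: List[float], policy: List[float]) -> List[float]:
    """Advantage of each action in one state.

    q_values[i] is Q(s, a_i); policy[i] is pi(a_i | s).
    """
    # V(s) is the policy-weighted average of the Q-values.
    v = sum(p * q for p, q in zip(policy, q_values))
    return [q - v for q in q_values]


# Illustrative values only: three actions in a single state.
q = [1.0, 2.0, 4.0]
pi = [0.5, 0.25, 0.25]
adv = advantages(q, pi)
print(adv)  # [-1.0, 0.0, 2.0]
print(sum(p * a for p, a in zip(pi, adv)))  # 0.0
```

The second printed value shows that the advantage averages to zero under the policy: actions with positive advantage are better than what $\pi$ does on average in that state, and those with negative advantage are worse.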

This advantage function will be central to policy gradient methods and PPO.