Rewards and Return
Reward Signal
The reward is a scalar signal that tells the agent how good its action was. Rewards can be:
- Sparse — only at the end (e.g., win/lose in a game)
- Dense — at every step (e.g., distance to goal)
- Shaped — hand-crafted to guide learning (careful: can introduce bias)
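To make the sparse/dense distinction concrete, here is a minimal sketch on a toy 1-D corridor. The environment, the `GOAL` position, and both function names are hypothetical and chosen only for illustration.

```python
GOAL = 5  # hypothetical goal position on a 1-D line


def sparse_reward(position: int) -> float:
    """Sparse: signal only when the goal is reached (end of the episode)."""
    return 1.0 if position == GOAL else 0.0


def dense_reward(position: int) -> float:
    """Dense: negative distance to the goal, available at every step."""
    return float(-abs(GOAL - position))


trajectory = [0, 1, 2, 3, 4, 5]
print([sparse_reward(p) for p in trajectory])  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print([dense_reward(p) for p in trajectory])   # [-5.0, -4.0, -3.0, -2.0, -1.0, 0.0]
```

The sparse signal gives no feedback until the final step, while the dense signal ranks every intermediate position.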
For language models, the reward typically comes from a reward model trained on human preferences, or from a rule-based function (e.g., format checking, code execution results).
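A rule-based reward of the format-checking kind could look roughly like the sketch below. The `<answer>` tag convention and the function name `format_reward` are assumptions made for this example, not a standard.

```python
import re


def format_reward(response: str) -> float:
    """Hypothetical format check: 1.0 if the response contains exactly one
    <answer>...</answer> block holding an integer, otherwise 0.0."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if len(matches) != 1:
        return 0.0
    # Accept an optionally signed integer with surrounding whitespace.
    return 1.0 if re.fullmatch(r"\s*-?\d+\s*", matches[0]) else 0.0


print(format_reward("The result is <answer>42</answer>"))  # 1.0
print(format_reward("The result is 42"))                   # 0.0
```

A check like this scores only the format. Verifying that the answer is correct would need a separate rule, such as comparing against a reference or executing code.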
Return: Cumulative Reward
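The return $G_t$ is the total reward from timestep $t$ onward:

$$G_t = r_t + r_{t+1} + r_{t+2} + \cdots = \sum_{k=0}^{\infty} r_{t+k}$$

In a task that never terminates, this sum can diverge: a reward of 1 at every step, for example, gives an infinite return. The discount factor addresses this.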
Discount Factor $\gamma$
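The discounted return weights future rewards exponentially less:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

where $\gamma \in [0, 1)$. The discount factor controls how much the agent values future rewards: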
| $\gamma$ | Behavior |
|---|---|
| $\gamma = 0$ | Myopic: only cares about immediate reward |
| $\gamma \to 1$ | Far-sighted: values future rewards almost equally |
With constant reward $r$ and $\gamma < 1$, the return is a geometric series that converges to $\frac{r}{1 - \gamma}$.
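The discounted return of a finite episode can be computed in one backward pass using the recursion $G_t = r_t + \gamma G_{t+1}$. The sketch below assumes the indexing convention $G_t = r_t + \gamma r_{t+1} + \cdots$ and a return of zero after the final step; `discounted_returns` is a hypothetical helper name.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for a finite episode.

    Uses G_t = r_t + gamma * G_{t+1}, with the return after the last step
    taken to be 0 (an assumption of this sketch).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]

# Constant reward r = 1 with gamma = 0.9: a long episode approaches r / (1 - gamma) = 10.
g0 = discounted_returns([1.0] * 1000, gamma=0.9)[0]
print(abs(g0 - 10.0) < 1e-6)  # True
```

The backward pass costs one multiply and one add per step, whereas summing $\gamma^k r_{t+k}$ separately for each $t$ would be quadratic in the episode length.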
Value Functions
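The state-value function $V^\pi(s)$ is the expected return from state $s$ under policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid s_t = s\right]$$

The action-value function $Q^\pi(s, a)$ is the expected return from taking action $a$ in state $s$, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid s_t = s, a_t = a\right]$$

The advantage function $A^\pi(s, a)$ measures how much better an action is than the policy's average action in that state:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$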
This advantage function will be central to policy gradient methods and PPO.
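As a small numeric illustration for a single state with three actions, the sketch below uses the standard identity $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$ to obtain the state value, then subtracts it from each action value. The probabilities and action values are made-up numbers, and the function names are hypothetical.

```python
def state_value(policy_probs, q_values):
    """V(s) = sum over actions of pi(a|s) * Q(s, a), for one state."""
    return sum(p * q for p, q in zip(policy_probs, q_values))


def advantages(policy_probs, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action in one state."""
    v = state_value(policy_probs, q_values)
    return [q - v for q in q_values]


policy_probs = [0.5, 0.25, 0.25]  # pi(a|s), illustrative values
q_values = [1.0, 2.0, 4.0]        # Q(s, a), illustrative values

print(state_value(policy_probs, q_values))  # 2.0
adv = advantages(policy_probs, q_values)
print(adv)                                  # [-1.0, 0.0, 2.0]

# The advantage averages to zero under the policy itself.
print(sum(p * a for p, a in zip(policy_probs, adv)))  # 0.0
```

A positive advantage marks an action that does better than the policy's average in that state, and a negative one marks an action that does worse.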