This repo serves as a workspace for my notes while learning RL in the context of modern LLMs.
Resources:
| Term | Explanation |
|---|---|
| Agent | The learner / decision-maker (us). |
| Environment | The world the agent lives in and interacts with. |
| Action | The environment changes upon an agent's action, but it can also change on its own. The current action is taken based on the current state. |
| State | Complete view of the world. |
| Observation | What agent can see. Partial or full state of the world. |
| Policy | Agent's rule to decide which action to take. It can be probabilistic / stochastic. Usually it's parameterized so that we can optimize it. |
| Trajectory (aka episodes, rollouts) | Sequence of states and actions |
| State transition | What happens to the world in the next time step. Depends on the laws of the environment, and only on the current state and action (the Markov property). It can be stochastic. |
| Reward | Response from the environment to the agent. Only depends on current state, current action, and next state. It can be simplified to only depend on current state, or current state and current action. |
| Return | Cumulative reward over a trajectory. The agent's goal is to maximize this. For infinite-horizon, we can include a discount factor. |
| Expected return | Since policy and transition can be stochastic, trajectory is stochastic as well. Hence, expected return is computed over the distribution of trajectories. The optimal policy maximizes expected return. |
| Q-function | (aka Action-value function) Expected return at a particular state and taking a particular action, following a particular policy. Optimal action corresponds to optimal Q-function. |
| Value function | Expected return at a particular state, following a particular policy. It follows that the Value function is the expected value of the Q-function over the action distribution (the policy at the current state). |
| Advantage function | The difference between the Q-function and the Value function, i.e. Q - V -> how much better a particular action is than average, given a policy. |
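A tiny numeric sketch of how Return, Q, Value and Advantage relate (Python; the reward and value numbers below are made up purely for illustration):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Return = sum_t gamma^t * r_t over one trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Toy trajectory rewards (made-up numbers).
rewards = np.array([0.0, 0.0, 1.0])
R = discounted_return(rewards)   # the Return the agent wants to maximize (in expectation)

# Hypothetical estimates at the first state s:
Q_sa = 0.9   # expected return if we take a specific action a in s, then follow the policy
V_s  = 0.7   # expected return from s, averaging actions over the policy
advantage = Q_sa - V_s           # how much better action a is than the policy's average
print(R, advantage)
```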
Model-based: the agent learns to model the environment. This allows the agent to plan ahead, but the learned model might not generalise well. Model-free: the agent does not learn a model of the environment; the two main model-free families are below.
| Learning | Description |
|---|---|
| Policy optimization | Models the policy. Optimize via stochastic gradient ascent on (approximate) Expected return. Typically on-policy - use data according to latest policy. Usually more stable. e.g. A2C/A3C, PPO. |
| Q-Learning | Models the Q-function. Optimize based on Bellman equation. Typically off-policy - use any data, not necessarily with the latest policy. The corresponding policy is choosing the action with the largest Q-function value. e.g. DQN. |
Policy gradient is the gradient of Expected return with respect to the policy's parameters. Using the log-derivative trick, it can be written as an expectation over trajectories: $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{T}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,R(\tau)\right]$
Note: A trajectory's log-prob is equal to log-prob of initial state, plus the log-prob sum of (action + state transition). Since initial state and state transition are properties of the environment, they do not depend on policy's parameters. Therefore, the derivative of trajectory's log-prob (wrt policy's parameters) is equal to the sum of log-policy's derivatives.
Note that the total number of timesteps $T$ may differ across sampled trajectories.
In the sampling form (estimator), given a set of trajectories $\mathcal{D}$ collected with the current policy: $\hat{g} = \frac{1}{|\mathcal{D}|}\sum_{\tau\in\mathcal{D}}\sum_{t=0}^{T}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,R(\tau)$
Notice that once we sample the trajectories (using the current policy), we can bring the gradient operator out of the sum. We can define the objective function: $L(\theta) = \frac{1}{|\mathcal{D}|}\sum_{\tau\in\mathcal{D}}\sum_{t=0}^{T}\log\pi_\theta(a_t\mid s_t)\,R(\tau)$ whose gradient, with the sampled data held fixed, equals the estimator above.
Please note that this loss function is only defined to simplify the computation of policy gradient using an autograd engine. THERE IS NO MEANING behind this loss function. Do not try to interpret its value.
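A minimal sketch of how this pseudo-loss is typically wired into an autograd engine (assuming PyTorch; all variable names here are illustrative):

```python
import torch

def pg_pseudo_loss(log_probs, trajectory_return):
    """
    log_probs: tensor [T] of log pi_theta(a_t | s_t) for one sampled trajectory,
               carrying gradients w.r.t. the policy parameters.
    trajectory_return: scalar tensor R(tau) for that trajectory, treated as a constant.
    Minimizing this value performs gradient ASCENT on (approximate) expected return.
    """
    return -(log_probs * trajectory_return.detach()).sum()

# Usage sketch: average the pseudo-loss over a batch of sampled trajectories,
# call .backward(), and the parameter gradients equal the policy-gradient estimator.
```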
Other forms of Policy gradient: It turns out that we can replace $R(\tau)$ in the expression above with other terms without changing the expected value of the gradient, for example:
- Using the on-policy Q-function $Q^{\pi_\theta}(s_t, a_t)$
- Using the Advantage function $A^{\pi_\theta}(s_t, a_t)$ - this is the most common, since there are many ways to estimate the advantage function and it results in the lowest variance.
Notice that in the above two forms, all the terms inside the summation depend only on the current state $s_t$ and action $a_t$.
One common way to estimate the Value function is with a neural network, trained by regressing $V_\phi(s_t)$ onto the observed return with a mean-squared-error (least squares) objective. We will train the Value function network together with the Policy network, using the same rollout data.
Thanks to this, we can estimate the Value function at the current state without integrating over all possible trajectories.
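A sketch of that regression (PyTorch; `value_net`, `states`, `returns` are illustrative names):

```python
import torch
import torch.nn.functional as F

def value_loss(value_net, states, returns):
    """
    Fit V_phi(s_t) to the observed returns from the same rollouts used for the
    policy update, with a mean-squared-error (least squares) objective.
    """
    predicted = value_net(states).squeeze(-1)   # [batch]
    return F.mse_loss(predicted, returns)
```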
Generalized Advantage Estimator (GAE) https://arxiv.org/abs/1506.02438
GAE estimates the Advantage as an exponentially weighted sum of TD-residuals: $\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty}(\gamma\lambda)^l \delta_{t+l}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
Note that the Value function above is itself an estimate, trained with the least-squares objective mentioned previously.
When $\lambda = 0$, the estimator reduces to $\hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
This expression is the TD-residual for the estimated Value function.
When $\lambda = 1$, the estimator becomes $\hat{A}_t = \sum_{l=0}^{\infty}\gamma^l r_{t+l} - V(s_t)$.
The summation term is the Return of a particular trajectory sample starting at the current state and action. This is an unbiased estimator of the Q-function, but it has high variance, which we are trying to avoid.
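A sketch of the standard backward-recursion implementation of GAE (Python/NumPy; the array shapes are assumptions):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """
    rewards: [T] rewards from one rollout.
    values:  [T+1] value estimates V(s_0), ..., V(s_T) (the last one is the bootstrap value).
    Computes A_t = sum_l (gamma * lam)^l * delta_{t+l} via a backward recursion.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```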
Trust Region Policy Optimization (TRPO) https://arxiv.org/abs/1502.05477
Impose an additional constraint so the new policy does not deviate too much from the old policy -> avoid big updates that can collapse training.
Surrogate advantage is a measure of how the new policy $\pi_\theta$ performs relative to the old policy $\pi_{\theta_{old}}$, using data collected with the old policy: $\mathcal{L}(\theta_{old}, \theta) = \mathbb{E}_{(s,a)\sim\pi_{\theta_{old}}}\left[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{old}}(a\mid s)}\,A^{\pi_{\theta_{old}}}(s,a)\right]$
Average KL-divergence over states visited by the old policy: $\bar{D}_{KL}(\theta\,\|\,\theta_{old}) = \mathbb{E}_{s\sim\pi_{\theta_{old}}}\left[D_{KL}\big(\pi_\theta(\cdot\mid s)\,\|\,\pi_{\theta_{old}}(\cdot\mid s)\big)\right]$
Theoretical TRPO update: $\theta_{new} = \arg\max_{\theta}\,\mathcal{L}(\theta_{old}, \theta)$ subject to $\bar{D}_{KL}(\theta\,\|\,\theta_{old}) \le \delta$
Note
- Gradient of the surrogate advantage function (wrt the policy's parameters), evaluated at $\theta = \theta_{old}$, is still equal to the policy gradient.
- Surrogate advantage takes the expectation over $(s,a)\sim\pi_{\theta_{old}}$, instead of summing over the time steps and then taking the expectation over trajectories. Not sure if the author is assuming $T$ is the same across different trajectories. TODO: double-check this
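A sketch of these two quantities for a discrete-action policy (PyTorch, illustrative names; this is only the objective and the constraint, not the full natural-gradient TRPO update):

```python
import torch

def surrogate_advantage(logp_new, logp_old, advantages):
    """Sample estimate of E_{(s,a)~pi_old}[ (pi_new / pi_old) * A^{pi_old}(s, a) ]."""
    ratio = torch.exp(logp_new - logp_old.detach())
    return (ratio * advantages.detach()).mean()

def mean_kl(probs_new, probs_old):
    """Average over visited states of KL( pi_new(.|s) || pi_old(.|s) )."""
    # probs_*: [batch, num_actions] action distributions at the visited states.
    return (probs_new * (probs_new.log() - probs_old.log())).sum(dim=-1).mean()
```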
Proximal Policy Optimization (PPO) https://arxiv.org/abs/1707.06347
PPO-Clip Loss function: $L(s, a, \theta_{old}, \theta) = \min\left(\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{old}}(a\mid s)}\,A^{\pi_{\theta_{old}}}(s,a),\ \text{clip}\!\left(\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{old}}(a\mid s)},\,1-\epsilon,\,1+\epsilon\right) A^{\pi_{\theta_{old}}}(s,a)\right)$
Note that since the Advantage can be positive or negative, the clip removes the incentive to push the ratio outside $[1-\epsilon, 1+\epsilon]$ in whichever direction would increase the objective: when $A > 0$ the objective is capped at $(1+\epsilon)A$, and when $A < 0$ it is capped at $(1-\epsilon)A$.
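A sketch of the PPO-Clip loss as it is usually implemented (PyTorch; illustrative names, batch of (state, action) samples collected with the old policy):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate objective, averaged over the batch (for minimization)."""
    ratio = torch.exp(logp_new - logp_old.detach())
    adv = advantages.detach()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean()
```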
PPO-Penalty (not used much) Loss function: $L(s, a, \theta_{old}, \theta) = \frac{\pi_\theta(a\mid s)}{\pi_{\theta_{old}}(a\mid s)}\,A^{\pi_{\theta_{old}}}(s,a) - \beta\,D_{KL}\big(\pi_{\theta_{old}}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\big)$
Where the penalty coefficient $\beta$ is adapted during training: increased when the measured KL exceeds a target, decreased when it falls below.
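A sketch of the penalty variant (PyTorch; `kl_per_state` would be an estimate of KL(pi_old || pi_new) at each sampled state, and the adaptation of `beta` between epochs is left out):

```python
import torch

def ppo_penalty_loss(logp_new, logp_old, advantages, kl_per_state, beta):
    """Negative of: ratio * A - beta * KL, averaged over the batch."""
    ratio = torch.exp(logp_new - logp_old.detach())
    objective = ratio * advantages.detach() - beta * kl_per_state
    return -objective.mean()
```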
Reinforcement Learning from Human Feedback (RLHF) + PPO (OpenAI)
- https://arxiv.org/abs/1909.08593 / https://github.com/openai/lm-human-preferences
- https://arxiv.org/abs/2009.01325 / https://github.com/openai/summarize-from-feedback
- https://arxiv.org/abs/2203.02155 (InstructGPT)
Given prompt $x$, the policy LLM $\pi_\theta$ generates a response $y$.
The Reward function is set as $r(x, y) = RM_\phi(x, y) - \beta\,\log\frac{\pi_\theta(y\mid x)}{\pi_{ref}(y\mid x)}$
Where $RM_\phi$ is a learned Reward model scoring the full response, $\pi_{ref}$ is the frozen reference (SFT) policy, and $\beta$ controls the KL penalty that keeps the fine-tuned policy close to the reference.
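A sketch of this engineered reward for one sampled response, at the sequence level (per-token variants spread the KL term over tokens; the names are illustrative):

```python
def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """
    rm_score:    scalar score from the frozen Reward model for (prompt, response).
    logp_policy: sum over response tokens of log pi_theta(token | prefix).
    logp_ref:    the same quantity under the frozen reference (SFT) policy.
    """
    kl_term = logp_policy - logp_ref   # log-ratio; its expectation over samples is the KL
    return rm_score - beta * kl_term
```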
The Reward model is frozen during RL fine-tuning. InstructGPT trains the Reward model with the following objective (very similar to the Discriminator loss in Relativistic GAN), where $y_w$ is the preferred response and $y_l$ the rejected one for the same prompt $x$: $\mathcal{L}_{RM}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log\sigma\big(RM_\phi(x, y_w) - RM_\phi(x, y_l)\big)\right]$
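A sketch of that pairwise Reward-model objective (PyTorch; `score_*` are Reward-model outputs for the preferred and rejected responses to the same prompt):

```python
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    """-log sigmoid( RM(x, y_w) - RM(x, y_l) ), averaged over the batch of comparisons."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```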
The OpenAI papers above then use PPO-Clip to RL-finetune the model with this engineered reward. GAE is used to estimate the Advantage function, which requires estimating the Value function with a neural network. This Value function is initialized from the Reward model.
TODO: check OpenAI code to see how they normalize the loss across tokens / samples.
Direct Preference Optimization (DPO) https://arxiv.org/abs/2305.18290
Use the Reward model's objective to optimize the LLM directly, by reparameterizing the reward in terms of the policy: $\mathcal{L}_{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{ref}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{ref}(y_l\mid x)}\right)\right]$
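A sketch of the DPO loss with sequence-level log-probs (PyTorch; `logp_*` are summed over response tokens, `ref_*` come from the frozen reference model):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Implicit reward is beta * log(pi_theta / pi_ref); apply the pairwise RM objective to it."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```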
Reinforcement Learning with Verifiable Rewards (RLVR) First coined in Tulu3 (https://arxiv.org/abs/2411.15124). The reward function comes from an easily verifiable procedure e.g. answer to a math problem.
Group Relative Policy Optimization (GRPO) https://arxiv.org/abs/2402.03300
In this setup, the state is the prefix (prompt $q$ plus the tokens generated so far) and the action is the next token.
For a given prompt $q$, sample a group of $G$ responses $\{o_1, \dots, o_G\}$ from the old policy and score them to get rewards $\{r_1, \dots, r_G\}$.
This is basically the PPO-Clip loss with an extra KL penalty term (the KL is in the opposite direction of PPO-Penalty's, and taken against a frozen reference policy). Advantage is estimated based on relative rewards within a group (hence the name). Also notice that the KL penalty is not part of the reward, as it is in RLHF, but part of the loss. TODO: check how the KL term is computed.
If we only have a reward at the end of each rollout (outcome supervision), every token of response $o_i$ gets the same Advantage, the group-normalized reward: $\hat{A}_{i,t} = \frac{r_i - \text{mean}(\{r_1,\dots,r_G\})}{\text{std}(\{r_1,\dots,r_G\})}$
If we have rewards at intermediate steps (process supervision), Advantage is estimated as the sum of current and future normalized rewards. TODO: check how reward normalization is done in this case. Each rollout can have different number of (intermediate) rewards.
The full GRPO loss is then averaged over tokens within each sampled response, and then averaged over the $G$ responses in the group.
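A sketch of the group-normalized advantage and the GRPO-style aggregation (PyTorch; illustrative names, and the per-token losses are assumed to already contain the clipped-ratio and KL terms):

```python
import torch

def grpo_outcome_advantages(rewards):
    """rewards: [G] scalar rewards for the G responses to one prompt (outcome supervision)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_aggregate(per_token_losses):
    """
    per_token_losses: list of G tensors, one per response, each of shape [len_i].
    GRPO: mean over tokens within each response, then mean over the group.
    """
    per_response = torch.stack([t.mean() for t in per_token_losses])
    return per_response.mean()
```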
TODO: check DeepSeek-R1 as well
GRPO Done Right (Dr. GRPO) https://arxiv.org/abs/2503.20783
- Length normalization is removed: the per-response loss is summed over tokens instead of averaged over the response length.
- Reward normalization only shifts the mean to zero; don't rescale it to unit variance (the std term is dropped from the Advantage).
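A sketch of the two Dr. GRPO modifications, mirroring the GRPO sketch above (illustrative names):

```python
import torch

def dr_grpo_outcome_advantages(rewards):
    """Only center the group rewards; no division by the standard deviation."""
    return rewards - rewards.mean()

def dr_grpo_aggregate(per_token_losses):
    """Sum over tokens within each response (no 1/|o_i| term), then mean over the group."""
    per_response = torch.stack([t.sum() for t in per_token_losses])
    return per_response.mean()
```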
Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) https://arxiv.org/abs/2503.14476
On a prompt-response pair $(q, o_i)$, DAPO optimizes a PPO-Clip style token-level objective, but with decoupled clipping ranges $\epsilon_{low}$ and $\epsilon_{high}$ ("Clip-Higher": the upper bound is raised to encourage exploration).
Similar to GRPO, Advantage is estimated as group-normalized rewards. The other important differences are:
- How sampling is done (dynamic sampling): remove all prompts whose group accuracy is 0 or 1, i.e. the sampled responses are all wrong or all correct. This is to avoid the normalized reward being zero, which contributes to training instability.
- How the expectation is computed: the loss is averaged over all tokens in a batch (see the sketch at the end of this section), instead of:
  - GRPO: loss for each response is averaged over its length (per-sample), then averaged over rollouts.
  - Dr. GRPO: loss for each response is summed over its length (per-sample), then averaged over rollouts.
- Special treatment for truncated responses: Long responses are truncated. DAPO will either ignore truncated samples (overlong filtering) or apply a penalty to the reward function (soft overlong punishment).
Unlike GRPO and OpenAI RLHF, DAPO doesn't include any KL penalty term at all, whether in its reward function or loss function.
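For contrast with the GRPO / Dr. GRPO aggregations sketched earlier, a minimal sketch of DAPO's token-level averaging (illustrative; the per-token losses are assumed to already contain the clipped objective):

```python
import torch

def dapo_token_level_aggregate(per_token_losses):
    """
    per_token_losses: list of tensors, one per sampled response in the batch, each [len_i].
    DAPO: average over ALL tokens in the batch, so longer responses carry more weight
    than they would under per-sample averaging.
    """
    return torch.cat(per_token_losses).mean()
```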