Summary
RLOOTrainer and GRPOTrainer implement the same KL-penalty concept using different formulas. GRPOTrainer uses the Schulman approximation exp(r) - r - 1 (always ≥ 0 per token, typically lower variance); RLOOTrainer uses the simple log ratio (can be negative per token, higher variance). This violates the AGENTS.md consistency requirement and can affect training stability.
Code
trl/trainer/rloo_trainer.py (line ~1281)
# First-order log ratio — can be negative per token
per_token_kl = old_per_token_logps - ref_per_token_logps
kl = (per_token_kl * completion_mask).sum(-1)
rewards = rewards - self.beta * kl
trl/trainer/grpo_trainer.py (line ~2516)
# Schulman approximation — always ≥ 0 per token
per_token_kl = (
    torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1
)
Why this matters
Mathematical difference
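The Schulman approximation exp(r) - r - 1, with r = log π_ref - log π_θ, adds the zero-mean term (π_ref/π_θ - 1) to the plain estimator -r = log(π_θ/π_ref). Because E_π_θ[π_ref/π_θ] = 1, its expectation under π_θ is exactly KL(π_θ || π_ref):
KL(π_θ || π_ref) = E_π_θ[(π_ref/π_θ) - 1 - log(π_ref/π_θ)]
                 = E_π_θ[exp(log π_ref - log π_θ) - (log π_ref - log π_θ) - 1]
Each per-token value is ≥ 0 because exp(r) ≥ 1 + r (matching the true KL, which is non-negative), and its variance is typically lower than that of the plain log ratio when the two policies are close.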
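The properties above can be checked on a toy case. The sketch below is plain Python and assumes two made-up distributions over a three-token vocabulary (`pi_theta` and `pi_ref` are illustrative values, not taken from either trainer). It computes the exact mean and variance of both per-token estimators when tokens are drawn from `pi_theta`.

```python
import math

# Hypothetical next-token distributions (illustrative values only).
pi_theta = [0.7, 0.2, 0.1]
pi_ref = [0.5, 0.3, 0.2]


def log_ratio(p, q):
    """Simple log ratio log(p/q) for one sampled token; can be negative."""
    return math.log(p) - math.log(q)


def schulman(p, q):
    """exp(r) - r - 1 with r = log q - log p; never negative."""
    r = math.log(q) - math.log(p)
    return math.exp(r) - r - 1.0


def mean_and_var(estimator):
    """Exact mean and variance of a per-token estimator under pi_theta."""
    values = [estimator(p, q) for p, q in zip(pi_theta, pi_ref)]
    mean = sum(p * v for p, v in zip(pi_theta, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(pi_theta, values))
    return mean, var, values


true_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
lr_mean, lr_var, lr_values = mean_and_var(log_ratio)
sch_mean, sch_var, sch_values = mean_and_var(schulman)

print(f"true KL   = {true_kl:.4f}")
print(f"log ratio: mean = {lr_mean:.4f}, var = {lr_var:.4f}")
print(f"Schulman:  mean = {sch_mean:.4f}, var = {sch_var:.4f}")
```

In this toy case both means equal the exact KL, the log ratio is negative for two of the three tokens, and the Schulman form has the smaller variance. The variance ordering is a property of these particular numbers and is not guaranteed for arbitrary distributions.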
The simple log ratio log(π_old/π_ref) at a single sampled token is an unbiased but high-variance estimator of KL(π_old || π_ref). Its value per token can be arbitrarily negative (when the model assigns lower probability than the reference), which means:
- For sequences where the summed log ratio is negative, reward -= beta * kl increases the reward, which is counterintuitive and can destabilize the RLOO advantage baseline
- For sequences where the summed log ratio is positive, the reward is reduced as intended, so the same term acts as a penalty for some sequences and a bonus for others
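The two cases can be illustrated with a toy batch. The sketch below is plain Python rather than the trainer code; `beta`, the log-prob values, and the helper names are all hypothetical and chosen only to mirror the shape of the RLOO snippet above.

```python
import math

beta = 0.05  # hypothetical KL coefficient, chosen only for illustration

# Made-up per-token log-probs for two completions (not real model outputs).
old_logps = [[-1.0, -2.0, -1.5], [-0.5, -0.4, -0.6]]
ref_logps = [[-0.6, -1.2, -1.0], [-1.0, -1.1, -0.9]]
completion_mask = [[1, 1, 1], [1, 1, 0]]  # last token of completion 2 is padding
rewards = [1.0, 1.0]


def log_ratio(old, ref):
    """Simple log ratio, as in the RLOO snippet; signed."""
    return old - ref


def schulman(old, ref):
    """exp(r) - r - 1 with r = ref - old, as in the GRPO snippet; never negative."""
    r = ref - old
    return math.exp(r) - r - 1.0


def penalized_rewards(per_token_fn):
    """Sum the masked per-token term per completion and subtract beta times it."""
    out = []
    for reward, old_seq, ref_seq, mask_seq in zip(
        rewards, old_logps, ref_logps, completion_mask
    ):
        kl = sum(per_token_fn(o, r) * m for o, r, m in zip(old_seq, ref_seq, mask_seq))
        out.append(reward - beta * kl)
    return out


log_ratio_rewards = penalized_rewards(log_ratio)
schulman_rewards = penalized_rewards(schulman)
print("log ratio:", log_ratio_rewards)
print("Schulman: ", schulman_rewards)
```

With the log ratio, the first completion ends up with a reward above its original 1.0 and the second below it. With the Schulman form, the term can only lower the reward.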
Consistency violation
From AGENTS.md:
"When the same logic appears in multiple trainers, the duplicated blocks must stay aligned. A correct-but-inconsistent codebase is harder to maintain than a consistently-wrong one that can be fixed in a single sweep. When modifying duplicated code, apply the same change to all other trainers."
The KL penalty block is duplicated across GRPO and RLOO and the formulas are different.
Proposed fix
Replace the simple log ratio in RLOOTrainer with the Schulman approximation:
# Before
per_token_kl = old_per_token_logps - ref_per_token_logps
# After — matches GRPOTrainer (Schulman approximation, always ≥ 0)
per_token_kl = (
    torch.exp(ref_per_token_logps - old_per_token_logps) - (ref_per_token_logps - old_per_token_logps) - 1
)
Note: the variable name per_token_kl already suggests a non-negative quantity, so using it for a signed log ratio is misleading.
Questions for maintainers
- Is the simple log ratio in RLOO intentional (e.g., matching the original RLOO paper)?
- If so, should we add a comment explaining the deliberate divergence from GRPOTrainer?
- If not intentional, should we apply the Schulman approximation for consistency?
Happy to send a PR for whichever direction the maintainers prefer.