bug(rloo): KL penalty uses first-order log ratio instead of Schulman approximation (inconsistent with GRPOTrainer) #5889

@Sumu004

Summary

RLOOTrainer and GRPOTrainer implement the same KL-penalty concept using different formulas. GRPOTrainer uses the Schulman approximation exp(r) - r - 1 (always ≥ 0 per token, typically lower variance); RLOOTrainer uses the plain log ratio (can be negative per token, higher variance). This conflicts with the consistency requirement in AGENTS.md and may affect training stability.

Code

trl/trainer/rloo_trainer.py (line ~1281)

# First-order log ratio — can be negative per token
per_token_kl = old_per_token_logps - ref_per_token_logps
kl = (per_token_kl * completion_mask).sum(-1)
rewards = rewards - self.beta * kl

trl/trainer/grpo_trainer.py (line ~2516)

# Schulman approximation — always ≥ 0 per token
per_token_kl = (
    torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1
)

Why this matters

Mathematical difference

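The Schulman approximation exp(r) - r - 1, with r = log π_ref - log π_θ, is not a truncated Taylor expansion. Its expectation under π_θ equals KL(π_θ || π_ref) exactly, because E_π_θ[π_ref/π_θ - 1] = 0 and this zero-mean term can be added to the plain log-ratio estimator without changing its mean: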

KL(π_θ || π_ref) = E_π_θ[(π_ref/π_θ) - 1 - log(π_ref/π_θ)]
                 = E_π_θ[exp(log π_ref - log π_θ) - (log π_ref - log π_θ) - 1]

It is ≥ 0 for every token, because exp(r) ≥ 1 + r for all real r (matching the true KL, which is non-negative), and it typically has lower variance than the plain log ratio when the two policies are close.
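As a rough sanity check of the claims above, the sketch below compares the two per-token estimators on a small categorical distribution by enumerating its support exactly. The distributions `policy` and `reference` and the helper names (`log_ratio_estimator`, `schulman_estimator`, `moments`) are made up for illustration and do not come from TRL.

```python
import math


def log_ratio_estimator(logp, logq):
    """Plain log ratio log(p/q); signed, can be negative for a single token."""
    return logp - logq


def schulman_estimator(logp, logq):
    """exp(r) - r - 1 with r = log q - log p; non-negative for every token."""
    r = logq - logp
    return math.exp(r) - r - 1.0


def moments(estimator, p, q):
    """Exact mean, variance and minimum of an estimator for tokens drawn from p."""
    values = [estimator(math.log(pi), math.log(qi)) for pi, qi in zip(p, q)]
    mean = sum(pi * v for pi, v in zip(p, values))
    var = sum(pi * (v - mean) ** 2 for pi, v in zip(p, values))
    return mean, var, min(values)


# Hypothetical 3-token vocabulary: policy p and reference q.
policy = [0.5, 0.3, 0.2]
reference = [0.4, 0.4, 0.2]

exact_kl = sum(pi * math.log(pi / qi) for pi, qi in zip(policy, reference))
lr_mean, lr_var, lr_min = moments(log_ratio_estimator, policy, reference)
sch_mean, sch_var, sch_min = moments(schulman_estimator, policy, reference)

print(f"exact KL           : {exact_kl:.6f}")
print(f"log ratio  mean/var/min: {lr_mean:.6f} {lr_var:.6f} {lr_min:.6f}")
print(f"exp(r)-r-1 mean/var/min: {sch_mean:.6f} {sch_var:.6f} {sch_min:.6f}")
```

On this toy example both estimators have the same mean (the exact KL), but only the log ratio takes negative values, and its variance is much larger. This is a single hand-picked case, not a general proof of the variance ordering.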

The simple log ratio log(π_old/π_ref) at a single sampled token is an unbiased but high-variance estimator of KL(π_old || π_ref). Its value per token can be arbitrarily negative (when the model assigns lower probability than the reference), which means:

  • For sequences where the summed log ratio is negative, rewards = rewards - self.beta * kl increases the reward. This is counterintuitive for a penalty and can destabilize the RLOO advantage baseline
  • For sequences where the summed log ratio is large and positive, the full KL penalty applies, so the same term acts as a bonus for some sequences and as a cost for others
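The effect on the reward can be illustrated with a small sketch that mirrors the shape of the RLOO reward adjustment using plain Python lists in place of tensors. All numbers (`beta`, `reward`, the log-probabilities, and the mask) are hypothetical, and `sequence_penalty` is an illustrative helper, not a TRL function.

```python
import math


def log_ratio(logp, logref):
    """Signed per-token log ratio, as currently used in the RLOO snippet."""
    return logp - logref


def schulman(logp, logref):
    """exp(r) - r - 1 with r = logref - logp, as used in the GRPO snippet."""
    r = logref - logp
    return math.exp(r) - r - 1.0


def sequence_penalty(policy_logps, ref_logps, mask, estimator):
    """Masked sum of per-token estimates over one completion."""
    return sum(
        m * estimator(lp, lr) for lp, lr, m in zip(policy_logps, ref_logps, mask)
    )


# Hypothetical completion where the policy is less confident than the reference.
beta = 0.05
reward = 1.0
policy_logps = [-2.0, -1.5, -3.0, -0.7]
ref_logps = [-1.0, -1.2, -2.5, -0.7]
mask = [1, 1, 1, 0]  # last position is padding

kl_log_ratio = sequence_penalty(policy_logps, ref_logps, mask, log_ratio)
kl_schulman = sequence_penalty(policy_logps, ref_logps, mask, schulman)

reward_log_ratio = reward - beta * kl_log_ratio
reward_schulman = reward - beta * kl_schulman

print(f"log ratio : kl={kl_log_ratio:+.4f} reward={reward_log_ratio:.4f}")
print(f"exp(r)-r-1: kl={kl_schulman:+.4f} reward={reward_schulman:.4f}")
```

With these made-up values the summed log ratio is negative, so the "penalized" reward ends up above the raw reward, while the exp(r) - r - 1 form can only lower it.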

Consistency violation

From AGENTS.md:

"When the same logic appears in multiple trainers, the duplicated blocks must stay aligned. A correct-but-inconsistent codebase is harder to maintain than a consistently-wrong one that can be fixed in a single sweep. When modifying duplicated code, apply the same change to all other trainers."

The KL penalty block is duplicated across GRPO and RLOO and the formulas are different.

Proposed fix

Replace the simple log ratio in RLOOTrainer with the Schulman approximation:

# Before
per_token_kl = old_per_token_logps - ref_per_token_logps

# After — matches GRPOTrainer (Schulman approximation, always ≥ 0)
per_token_kl = (
    torch.exp(ref_per_token_logps - old_per_token_logps) - (ref_per_token_logps - old_per_token_logps) - 1
)

Note: the variable name per_token_kl suggests a non-negative KL estimate, so using it for a signed log ratio is misleading as well.

Questions for maintainers

  1. Is the first-order log ratio in RLOO intentional (e.g., matching the original RLOO paper)?
  2. If so, should we add a comment explaining the deliberate divergence from GRPOTrainer?
  3. If not intentional, should we apply the Schulman approximation for consistency?

Happy to send a PR for whichever direction the maintainers prefer.
