bug(rloo): KL penalty uses first-order log ratio instead of Schulman approximation (inconsistent with GRPOTrainer)

## Summary

`RLOOTrainer` and `GRPOTrainer` implement the same KL-penalty concept using different per-token estimators. `GRPOTrainer` uses Schulman's k3 estimator (always ≥ 0 per token, lower variance); `RLOOTrainer` uses the plain log ratio, i.e. the k1 estimator (can be **negative per token**, higher variance). This appears to violate the AGENTS.md consistency requirement and may affect training stability.

## Code

**`trl/trainer/rloo_trainer.py` (line ~1281)**
```python
# First-order log ratio — can be negative per token
per_token_kl = old_per_token_logps - ref_per_token_logps
kl = (per_token_kl * completion_mask).sum(-1)
rewards = rewards - self.beta * kl
```

**`trl/trainer/grpo_trainer.py` (line ~2516)**
```python
# Schulman approximation — always ≥ 0 per token
per_token_kl = (
    torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1
)
```

## Why this matters

### Mathematical difference
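The Schulman approximation `exp(r) - r - 1`, with `r = log π_ref - log π_θ`, is not a truncated Taylor series; it is an unbiased estimator of `KL(π_θ || π_ref)` (the k3 estimator from Schulman's note on approximating KL divergence). It adds the zero-mean term `(π_ref/π_θ) - 1` to the plain log-ratio estimator `-log(π_ref/π_θ)`, so the expectation under `π_θ` is unchanged: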

```
KL(π_θ || π_ref) = E_π_θ[(π_ref/π_θ) - 1 - log(π_ref/π_θ)]
                 = E_π_θ[exp(log π_ref - log π_θ) - (log π_ref - log π_θ) - 1]
```

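This estimator is **always ≥ 0** per token, since `exp(r) ≥ 1 + r` for every real `r`, which matches the non-negativity of the true KL. It also typically has **lower variance** than the plain log ratio when the two policies are close. The `exp` term can still grow large when `π_ref` assigns a token much higher probability than `π_θ` does, so it is not unconditionally well behaved numerically.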
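
For readers who want the intermediate step, the equality in the block above can be checked as follows. This is a short derivation sketch, assuming a discrete token distribution and that `π_θ(x) > 0` wherever `π_ref(x) > 0` (which holds for softmax outputs):

```math
\mathbb{E}_{x \sim \pi_\theta}\!\left[\frac{\pi_{\mathrm{ref}}(x)}{\pi_\theta(x)}\right]
  = \sum_x \pi_\theta(x)\,\frac{\pi_{\mathrm{ref}}(x)}{\pi_\theta(x)}
  = \sum_x \pi_{\mathrm{ref}}(x) = 1
\quad\Longrightarrow\quad
\mathbb{E}_{x \sim \pi_\theta}\!\left[\frac{\pi_{\mathrm{ref}}(x)}{\pi_\theta(x)} - 1 - \log\frac{\pi_{\mathrm{ref}}(x)}{\pi_\theta(x)}\right]
  = \mathbb{E}_{x \sim \pi_\theta}\!\left[\log\frac{\pi_\theta(x)}{\pi_{\mathrm{ref}}(x)}\right]
  = \mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})
```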

The simple log ratio `log(π_old/π_ref)` at a single sampled token is also an **unbiased** estimator of `KL(π_old || π_ref)`, but a **higher-variance** one. Its per-token value is negative whenever the policy assigns the sampled token lower probability than the reference does, and it is unbounded below, which means:
- For sequences where the summed log ratio is negative, `rewards - self.beta * kl` **increases** the reward, which is counterintuitive and can destabilize the RLOO advantage baseline
- For sequences where the summed log ratio is large and positive, the full penalty applies, so the effective penalty is asymmetric across sequences
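
The claims above (both estimators agree with the true KL in expectation, only the log ratio goes negative, and the Schulman form has lower variance when the policies are close) can be checked exactly on a toy categorical distribution. The following is a minimal pure-Python sketch, not TRL code; the helper `kl_estimator_stats` and the distributions `p` and `q` are made up for illustration.

```python
import math


def kl_estimator_stats(p, q):
    """Exact mean, variance and minimum of two per-sample estimators of
    KL(p || q) when the sample x is drawn from p (toy categorical case)."""
    # Plain log ratio: log p(x) - log q(x). Signed, can be negative.
    k1 = [math.log(pi) - math.log(qi) for pi, qi in zip(p, q)]
    # Schulman form: exp(r) - r - 1 with r = log q(x) - log p(x) = -k1.
    k3 = [math.exp(-d) + d - 1.0 for d in k1]

    def mean(values):
        return sum(pi * v for pi, v in zip(p, values))

    def variance(values):
        m = mean(values)
        return sum(pi * (v - m) ** 2 for pi, v in zip(p, values))

    return {
        "true_kl": sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)),
        "k1_mean": mean(k1), "k1_var": variance(k1), "k1_min": min(k1),
        "k3_mean": mean(k3), "k3_var": variance(k3), "k3_min": min(k3),
    }


# Hypothetical policy (p) and reference (q) over a 3-token vocabulary.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
stats = kl_estimator_stats(p, q)
for name, value in stats.items():
    print(f"{name}: {value:.6f}")
```

On this example the two means coincide with the true KL, `k1_min` is negative while `k3_min` is not, and `k3_var` is smaller than `k1_var`. The variance ordering is a property of nearby distributions and is not guaranteed when `p` and `q` differ strongly.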

### Consistency violation
From `AGENTS.md`:
> "When the same logic appears in multiple trainers, the duplicated blocks must stay aligned. A correct-but-inconsistent codebase is harder to maintain than a consistently-wrong one that can be fixed in a single sweep. When modifying duplicated code, apply the same change to all other trainers."

The KL penalty block is duplicated across GRPO and RLOO and the formulas are different.

## Proposed fix

Replace the simple log ratio in `RLOOTrainer` with the Schulman approximation:

```python
# Before
per_token_kl = old_per_token_logps - ref_per_token_logps

# After — matches GRPOTrainer (Schulman approximation, always ≥ 0)
per_token_kl = (
    torch.exp(ref_per_token_logps - old_per_token_logps) - (ref_per_token_logps - old_per_token_logps) - 1
)
```

Note: the name `per_token_kl` suggests a non-negative quantity, so it is already misleading for the signed log ratio it currently holds.
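
To illustrate how the two formulas shape a sequence-level reward differently, here is a small pure-Python sketch of the masked sum followed by `reward - beta * kl`. It mirrors the structure of the snippets above but is not the trainer code: the helper `sequence_kl_penalty`, the `estimator` argument, and all numeric values are hypothetical.

```python
import math


def sequence_kl_penalty(logps, ref_logps, mask, estimator="k1"):
    """Sum a per-token KL estimate over the unmasked completion tokens."""
    total = 0.0
    for lp, ref_lp, keep in zip(logps, ref_logps, mask):
        if not keep:
            continue  # padding token, excluded like completion_mask == 0
        if estimator == "k1":
            total += lp - ref_lp                  # signed log ratio
        elif estimator == "k3":
            r = ref_lp - lp
            total += math.exp(r) - r - 1.0        # non-negative for any r
        else:
            raise ValueError(f"unknown estimator: {estimator}")
    return total


# Toy completion where the policy is less confident than the reference
# on every real token; the last position is padding.
beta = 0.05
reward = 1.0
logps = [-2.0, -1.5, -3.0, -0.5]
ref_logps = [-1.0, -1.2, -2.5, -0.5]
mask = [1, 1, 1, 0]

kl_k1 = sequence_kl_penalty(logps, ref_logps, mask, "k1")
kl_k3 = sequence_kl_penalty(logps, ref_logps, mask, "k3")
shaped_k1 = reward - beta * kl_k1  # summed log ratio is negative: reward goes up
shaped_k3 = reward - beta * kl_k3  # summed k3 is non-negative: reward goes down
print(kl_k1, shaped_k1)
print(kl_k3, shaped_k3)
```

With the log ratio, this completion ends up with a shaped reward above the raw reward; with the Schulman form it is penalized. Whether that difference is desirable for RLOO is exactly the question raised below.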

## Questions for maintainers

1. Is the first-order log ratio in RLOO intentional (e.g., matching the original RLOO paper)?
2. If so, should we add a comment explaining the deliberate divergence from GRPOTrainer?
3. If not intentional, should we apply the Schulman approximation for consistency?

Happy to send a PR for whichever direction the maintainers prefer.
