Add experimental A2PO trainer (Optimal Advantage Regression)#5940
Implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression" (https://huggingface.co/papers/2505.20686) as an experimental trainer.
- trl/experimental/a2po: A2POConfig + A2POTrainer (two-stage offline V* estimation + on-policy single-generation regression)
- Register A2POTrainer in _TELEMETRY_TRAINERS
- Add paper_index.md section
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 20e285c. Configure here.
for dataset in datasets:
    dataloader = self.accelerator.prepare(
        DataLoader(dataset, batch_size=self.args.per_device_train_batch_size, collate_fn=list)
    )
Stage 1 breaks without train data
Medium Severity
_estimate_optimal_values always includes self.train_dataset in the Stage 1 loop whenever eval_dataset is set, and calls self.train_dataset.filter when filter_all_incorrect is enabled, without checking that train_dataset is non-None. A trainer constructed with only an eval set (or train_dataset=None) will fail in Stage 1 before evaluation can run.
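One possible shape for the guard, as a minimal sketch rather than the trainer's actual code: `stage1_datasets` and `maybe_filter` are hypothetical helpers, and plain lists of dicts with a `"prompt"` key stand in for the real dataset objects and their `filter` method.

```python
def stage1_datasets(train_dataset, eval_dataset):
    """Return only the datasets that exist, train first (hypothetical helper)."""
    return [d for d in (train_dataset, eval_dataset) if d is not None]


def maybe_filter(train_dataset, all_incorrect, filter_all_incorrect):
    """Drop all-incorrect prompts, but only when there is a train set to filter."""
    if train_dataset is None or not filter_all_incorrect:
        return train_dataset
    # Stand-in for Dataset.filter: keep rows whose prompt is not in the set.
    return [row for row in train_dataset if row["prompt"] not in all_incorrect]


# Eval-only construction: Stage 1 iterates over the eval set and skips filtering.
datasets = stage1_datasets(None, [{"prompt": "2+2?"}])
filtered = maybe_filter(None, {"2+2?"}, filter_all_incorrect=True)
```

With this shape, an eval-only trainer reaches evaluation because neither the loop nor the filter touches a missing train set.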
for j, prompt_text in enumerate(prompts_text):
    optimal_values[prompt_text] = v_star[j].item()
    if rewards[j].sum() == 0:
        all_incorrect.add(prompt_text)
Duplicate prompts stick in filter set
Medium Severity
With filter_all_incorrect enabled, all_incorrect is keyed only by templated prompt text and never removes a prompt once added. If the same text appears more than once (same batch or later batches) and any one row’s reference samples are all zero, every training row with that text is dropped even when another row’s samples had positive reward.
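One way to make the filter robust to repeated prompt text is to track, per prompt, whether any row ever produced a positive reward, and derive the drop set only after all batches have been seen. The sketch below is illustrative only: `update_any_correct` and `all_incorrect_prompts` are hypothetical names, and nested lists stand in for the reward tensor.

```python
def update_any_correct(any_correct, prompts_text, rewards):
    """Record, per prompt text, whether any row so far had a positive reward."""
    for prompt_text, row in zip(prompts_text, rewards):
        solved = sum(row) > 0
        any_correct[prompt_text] = any_correct.get(prompt_text, False) or solved
    return any_correct


def all_incorrect_prompts(any_correct):
    """Prompts for which no row, in any batch, had a positive reward."""
    return {p for p, ok in any_correct.items() if not ok}


state = {}
# "p" appears twice in the same batch; only one row is all-zero.
update_any_correct(state, ["p", "p", "q"], [[0, 0], [1, 0], [0, 0]])
first_pass = all_incorrect_prompts(state)
# A later batch contains "q" again, this time with a correct sample.
update_any_correct(state, ["q"], [[0, 1]])
second_pass = all_incorrect_prompts(state)
```

Because the set is computed from the accumulated state, a prompt is dropped only when every row with that text failed, regardless of batch order.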
qgallouedec
left a comment
thanks! I'll push a few commits on your branch and we're good!
@qgallouedec thanks a lot for looking into the PR!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression" (https://huggingface.co/papers/2505.20686) as an experimental trainer.
What does this PR do?
The idea: A*-PO optimizes the standard KL-regularized RL objective with a binary verifiable reward r(x, y) ∈ {0, 1}.
The objective: max_π E_{y∼π(·|x)}[ r(x,y) − β·KL(π(·|x) ‖ π_ref(·|x)) ]
Stage 1 (offline, once): we estimate the optimal value V*(x) from N reference-policy samples per prompt. Prompts for which all N samples fail can optionally be filtered out.
Stage 2 (on-policy, single generation per prompt): a plain least-squares regression with no group, no critic, no clipping, and no reward normalization. The regression target is r(x,y) − V̂*(x), the estimated optimal advantage A*.
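The two stages can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the trainer's code: the description above does not spell out the Stage 1 estimator, so the sketch assumes the log-mean-exp form β₁·log((1/N)·Σ exp(rᵢ/β₁)) that follows from the closed-form solution of the KL-regularized objective, and the function names and the separate β₁/β₂ parameters are hypothetical.

```python
import math


def estimate_v_star(rewards, beta1):
    """Assumed Stage 1 estimator: beta1 * log(mean(exp(r_i / beta1)))."""
    scaled = [r / beta1 for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    mean_exp = sum(math.exp(s - m) for s in scaled) / len(scaled)
    return beta1 * (m + math.log(mean_exp))


def a2po_loss(logp, ref_logp, reward, v_star, beta2):
    """Stage 2: squared error between beta2 * log(pi/pi_ref) and r - V*."""
    return (beta2 * (logp - ref_logp) - (reward - v_star)) ** 2


# Four reference samples for one prompt, two of them correct.
v = estimate_v_star([1.0, 0.0, 0.0, 1.0], beta1=0.5)
# One on-policy completion with summed log-probs under the policy and reference.
loss = a2po_loss(logp=-12.0, ref_logp=-12.5, reward=1.0, v_star=v, beta2=1e-3)
```

With binary rewards the estimate stays between 0 and 1, and it is exactly 0 when every reference sample fails, which is the case the optional filter targets.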
Scope:
Designed for binary, verifiable rewards (math/code), not open-ended problems, so rewards are always thresholded to {0, 1}.
Stage 1 introduces an offline value-estimation pass, a pattern TRL does not currently have, so this is contained within experimental/.
Before submitting
AI writing disclosure
We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Note
Medium Risk
New RL training loop with reference-model generation and dataset mutation in Stage 1; isolated under experimental/ and covered by tests, but behavior differs from existing GRPO/PPO trainers.
Overview
Adds experimental A*-PO (Optimal Advantage Regression) for binary verifiable rewards: a new trl.experimental.a2po package with A2POConfig and A2POTrainer.
Stage 1 runs once before training: sample num_value_samples completions per prompt from a frozen reference policy, score them, and cache V*(x) per prompt (optional filter_all_incorrect drops prompts with all-zero reference rewards). Stage 2 is standard on-policy training with one generation per prompt and a least-squares loss on β₂·log(π/π_ref) vs r − V*.
Also wires docs (a2po_trainer.md, paper index, toctree), registers A2POTrainer in telemetry, and adds smoke/regression tests (train updates weights, extra dataset columns to rewards, eval without prior train()).
Reviewed by Cursor Bugbot for commit 37e52ad. Bugbot is set up for automated code reviews on this repo. Configure here.