Add experimental A2PO trainer (Optimal Advantage Regression) by raghulchandramouli · Pull Request #5940 · huggingface/trl

raghulchandramouli · 2026-06-04T14:18:00Z

Implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression" (https://huggingface.co/papers/2505.20686) as an experimental trainer.

trl/experimental/a2po: A2POConfig + A2POTrainer (two-stage offline V* estimation + on-policy single-generation regression)
Register A2POTrainer in _TELEMETRY_TRAINERS
Add paper_index.md section

What does this PR do?

The idea: A*-PO optimizes the standard KL-regularized RL objective with a binary verifiable reward r(x, y) ∈ {0, 1}.

The objective: max_π E_{y∼π}[ r(x,y) − β·KL(π(·|x) ‖ π_ref(·|x)) ]
Stage 1 (offline, once): estimate the optimal value V̂*(x) from reference-policy samples. Prompts for which all N samples fail can be filtered out.
Stage 2 (on-policy, single generation per prompt): a plain least-squares regression with no group, no critic, no clipping, and no reward normalization. The regression target is r(x,y) − V̂*(x), the estimated optimal advantage A*.
Scope:
Designed for binary, verifiable rewards (math/code), not open-ended problems, so rewards always fall in [0, 1].
Stage 1 introduces an offline value-estimation pass, a pattern TRL does not currently have, so it is contained within experimental/.
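
For readers who want to see the Stage 1 estimate concretely, here is a minimal sketch. It assumes the log-mean-exp estimator V̂*(x) = β₁ · log((1/N) Σᵢ exp(rᵢ/β₁)) over N reference samples, which is how the paper's optimal value is usually written. The names `estimate_v_star`, `optimal_advantage`, and `beta1` are illustrative and are not taken from the trainer's code.

```python
import math


def estimate_v_star(rewards, beta1):
    """Estimate V*(x) from the rewards of N reference-policy samples.

    Assumed form: beta1 * log(mean_i exp(r_i / beta1)), computed with the
    usual max-shift so the exponentials cannot overflow for small beta1.
    """
    scaled = [r / beta1 for r in rewards]
    shift = max(scaled)
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return beta1 * (shift + math.log(mean_exp))


def optimal_advantage(reward, v_star):
    """Stage 2 regression target: r(x, y) - V_hat*(x)."""
    return reward - v_star


# One correct sample out of four: the estimate lies between the mean reward
# and the max reward.
v = estimate_v_star([1.0, 0.0, 0.0, 0.0], beta1=0.5)
target_if_correct = optimal_advantage(1.0, v)
```

Under this form, a prompt whose samples all fail gets an estimate of exactly 0, which is the case the all-incorrect filter is meant to drop.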

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. (A*PO: Accelerating RL for LLM Reasoning with Optimal Advantage regression #5935)
Did you make sure to update the documentation with your changes?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

[ ] No AI usage: the PR was written entirely by a human.
[x] AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
[ ] AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Note

Medium Risk
New RL training loop with reference-model generation and dataset mutation in Stage 1; isolated under experimental/ and covered by tests, but behavior differs from existing GRPO/PPO trainers.

Overview
Adds experimental A*-PO (Optimal Advantage Regression) for binary verifiable rewards: a new trl.experimental.a2po package with A2POConfig and A2POTrainer.

Stage 1 runs once before training: sample num_value_samples completions per prompt from a frozen reference policy, score them, and cache V*(x) per prompt (optional filter_all_incorrect drops prompts with all-zero reference rewards). Stage 2 is standard on-policy training with one generation per prompt and a least-squares loss on β₂·log(π/π_ref) vs r − V*.
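
The Stage 2 loss described above can be sketched as follows. This is an illustration under stated assumptions, not the trainer's implementation: it takes sequence-level log-probabilities as plain floats, averages the squared error over the batch (the reduction is an assumption), and uses a hypothetical function name `a2po_loss`.

```python
def a2po_loss(policy_logps, ref_logps, rewards, v_stars, beta2):
    """Least-squares loss between beta2 * log(pi / pi_ref) and r - V*.

    policy_logps / ref_logps: summed log-probability of each sampled
    completion under the current policy and the frozen reference policy.
    rewards: binary rewards r(x, y); v_stars: cached Stage 1 values V*(x).
    """
    total = 0.0
    for logp, ref_logp, reward, v_star in zip(policy_logps, ref_logps, rewards, v_stars):
        implicit_advantage = beta2 * (logp - ref_logp)  # beta2 * log(pi / pi_ref)
        target = reward - v_star                        # estimated optimal advantage
        total += (implicit_advantage - target) ** 2
    return total / len(rewards)


# At initialization the policy equals the reference, so the loss is just
# the mean squared target.
loss_at_init = a2po_loss([-12.0, -9.0], [-12.0, -9.0], [1.0, 0.0], [0.5, 0.5], beta2=0.1)
```

Because each prompt contributes a single completion and a cached scalar V*(x), no group statistics or critic are needed to form the target.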

Also wires docs (a2po_trainer.md, paper index, toctree), registers A2POTrainer in telemetry, and adds smoke/regression tests (train updates weights, extra dataset columns to rewards, eval without prior train()).

Reviewed by Cursor Bugbot for commit 37e52ad. Bugbot is set up for automated code reviews on this repo.

Implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression" (https://huggingface.co/papers/2505.20686) as an experimental trainer.

- trl/experimental/a2po: A2POConfig + A2POTrainer (two-stage offline V* estimation + on-policy single-generation regression)
- Register A2POTrainer in _TELEMETRY_TRAINERS
- Add paper_index.md section

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.


Reviewed by Cursor Bugbot for commit 20e285c.

cursor · 2026-06-04T14:42:48Z

+        for dataset in datasets:
+            dataloader = self.accelerator.prepare(
+                DataLoader(dataset, batch_size=self.args.per_device_train_batch_size, collate_fn=list)
+            )


Stage 1 breaks without train data

Medium Severity

_estimate_optimal_values always includes self.train_dataset in the Stage 1 loop whenever eval_dataset is set, and calls self.train_dataset.filter when filter_all_incorrect is enabled, without checking that train_dataset is non-None. A trainer constructed with only an eval set (or train_dataset=None) will fail in Stage 1 before evaluation can run.

Additional Locations (1)

trl/experimental/a2po/a2po_trainer.py#L266-L271
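
One way to address this report is to build the Stage 1 dataset list defensively. The helper below is a hypothetical sketch, not the fix that was pushed to the branch. It assumes `eval_dataset` may be a single dataset or a dict of named datasets, as the Transformers `Trainer` allows.

```python
def stage1_datasets(train_dataset, eval_dataset):
    """Collect the datasets Stage 1 should iterate over, skipping missing ones."""
    datasets = []
    if train_dataset is not None:
        datasets.append(train_dataset)
    if isinstance(eval_dataset, dict):
        # Several named eval sets: estimate V* for the prompts of each.
        datasets.extend(eval_dataset.values())
    elif eval_dataset is not None:
        datasets.append(eval_dataset)
    return datasets


# Eval-only construction no longer touches a missing train set.
eval_only = stage1_datasets(None, ["prompt A", "prompt B"])
```

Any later call such as filtering the train set would need the same `is not None` check before it runs.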


cursor · 2026-06-04T14:42:49Z

+                for j, prompt_text in enumerate(prompts_text):
+                    optimal_values[prompt_text] = v_star[j].item()
+                    if rewards[j].sum() == 0:
+                        all_incorrect.add(prompt_text)


Duplicate prompts stick in filter set

Medium Severity

With filter_all_incorrect enabled, all_incorrect is keyed only by templated prompt text and never removes a prompt once added. If the same text appears more than once (same batch or later batches) and any one row’s reference samples are all zero, every training row with that text is dropped even when another row’s samples had positive reward.

Additional Locations (1)

trl/experimental/a2po/a2po_trainer.py#L266-L271
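
A possible way to avoid the sticky behavior is to record, per prompt text, whether any row ever produced a positive reward, and derive the filter set only after all batches are processed. The function below is an illustrative sketch with hypothetical names, not the change that landed in the PR.

```python
def collect_all_incorrect(scored_prompts):
    """Return the prompts whose reference samples were all zero in every row.

    scored_prompts: iterable of (prompt_text, rewards) pairs, where rewards
    is the list of reference-sample rewards for one dataset row.
    """
    any_correct = {}
    for prompt_text, rewards in scored_prompts:
        solved = sum(rewards) > 0
        # A prompt stays "unsolved" only while no duplicate row has solved it.
        any_correct[prompt_text] = any_correct.get(prompt_text, False) or solved
    return {prompt for prompt, solved in any_correct.items() if not solved}


rows = [("p1", [0.0, 0.0]), ("p1", [1.0, 0.0]), ("p2", [0.0, 0.0])]
to_drop = collect_all_incorrect(rows)
```

With this bookkeeping the result no longer depends on the order in which duplicate rows are seen.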


qgallouedec

thanks! I'll push a few commits on your branch and we're good!

raghulchandramouli · 2026-06-05T10:39:52Z

@qgallouedec thanks a lot!!! for looking into the PR

bot-ci-comment · 2026-06-05T10:56:58Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor Bot reviewed Jun 4, 2026


Comment thread trl/experimental/a2po/a2po_trainer.py

Comment thread trl/experimental/a2po/a2po_trainer.py

Comment thread trl/experimental/a2po/a2po_trainer.py

Comment thread trl/experimental/a2po/a2po_trainer.py

Merge branch 'main' into A2PO

d2b5016

cursor Bot reviewed Jun 4, 2026


Comment thread trl/experimental/a2po/a2po_trainer.py Outdated

Comment thread trl/experimental/a2po/a2po_trainer.py

Comment thread trl/experimental/a2po/a2po_trainer.py Outdated

A2PO: guard logits_to_keep, fix V* gather, cover eval prompts in Stage 1

20e285c

cursor Bot reviewed Jun 4, 2026


qgallouedec reviewed Jun 5, 2026


Comment thread trl/experimental/a2po/a2po_config.py Outdated

style, doc and minor fixes

7f095f2

qgallouedec and others added 2 commits June 5, 2026 10:49

fix

bce9455

Merge branch 'main' into A2PO

37e52ad

qgallouedec approved these changes Jun 5, 2026


qgallouedec merged commit da3ee53 into huggingface:main Jun 5, 2026
14 checks passed

Add experimental A2PO trainer (Optimal Advantage Regression) #5940
qgallouedec merged 6 commits into huggingface:main from raghulchandramouli:A2PO
