Add experimental A2PO trainer (Optimal Advantage Regression)#5940

Merged
qgallouedec merged 6 commits into huggingface:main from raghulchandramouli:A2PO
Jun 5, 2026

@raghulchandramouli raghulchandramouli commented Jun 4, 2026

Contributor

Implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression" (https://huggingface.co/papers/2505.20686) as an experimental trainer.

  • trl/experimental/a2po: A2POConfig + A2POTrainer (two-stage offline V* estimation + on-policy single-generation regression)
  • Register A2POTrainer in _TELEMETRY_TRAINERS
  • Add paper_index.md section

What does this PR do?

The idea: A*-PO optimizes the standard KL-regularized RL objective with a binary verifiable reward r(x, y) ∈ {0, 1}.

The objective: max_π E_{y∼π}[ r(x, y) − β·KL(π(·|x) ‖ π_ref(·|x)) ]
Stage 1 (offline, once): estimate the optimal value V*(x) from N reference-policy samples per prompt. Prompts for which all N samples fail can be filtered out.
Stage 2 (on-policy, single generation per prompt): a plain least-squares regression with no group, no critic, no clipping, and no reward normalization. The regression target is r(x, y) − V̂*(x), the estimated optimal advantage A*.
Scope:
Designed for binary, verifiable rewards (math/code), not open-ended problems, so rewards are expected to stay within [0, 1].
Stage 1 introduces an offline value-estimation pass, a pattern TRL does not currently have, so the trainer is contained within experimental/.
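
A minimal sketch of the Stage 1 estimate, assuming the usual closed form for the optimal value of a KL-regularized objective, V*(x) = β·log E_{y∼π_ref}[exp(r(x, y)/β)], approximated by a log-mean-exp over the N reference samples. The names `estimate_v_star`, `stage1`, and `beta1` are illustrative and are not taken from the PR's code.

```python
import math

def estimate_v_star(rewards, beta1):
    """Log-mean-exp estimate of V*(x) from the rewards of N reference samples."""
    scaled = [r / beta1 for r in rewards]
    m = max(scaled)  # subtract the max before exponentiating for stability
    mean_exp = sum(math.exp(s - m) for s in scaled) / len(scaled)
    return beta1 * (m + math.log(mean_exp))

def stage1(reference_rewards, beta1, filter_all_incorrect=True):
    """reference_rewards maps each prompt to its list of 0/1 reference rewards."""
    values = {}
    for prompt, rewards in reference_rewards.items():
        if filter_all_incorrect and sum(rewards) == 0:
            continue  # every reference sample failed: drop the prompt
        values[prompt] = estimate_v_star(rewards, beta1)
    return values

# Hypothetical reward tables for three prompts, N = 4 samples each.
samples = {"easy": [1, 1, 1, 1], "hard": [0, 0, 1, 0], "impossible": [0, 0, 0, 0]}
v_hat = stage1(samples, beta1=0.5)
```

With these made-up inputs, the all-correct prompt gets an estimate of 1, the all-incorrect prompt is dropped, and the mixed prompt lands strictly between 0 and 1.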

Before submitting

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • [ ] No AI usage: the PR was written entirely by a human.
  • [x] AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • [ ] AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.


Note

Medium Risk
New RL training loop with reference-model generation and dataset mutation in Stage 1; isolated under experimental/ and covered by tests, but behavior differs from existing GRPO/PPO trainers.

Overview
Adds experimental A*-PO (Optimal Advantage Regression) for binary verifiable rewards: a new trl.experimental.a2po package with A2POConfig and A2POTrainer.

Stage 1 runs once before training: sample num_value_samples completions per prompt from a frozen reference policy, score them, and cache V*(x) per prompt (optional filter_all_incorrect drops prompts with all-zero reference rewards). Stage 2 is standard on-policy training with one generation per prompt and a least-squares loss on β₂·log(π/π_ref) vs r − V*.
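
The Stage 2 objective described above can be sketched per sample as follows. This is an illustration under assumptions, not the trainer's code: it takes sequence-level log-probabilities (summed over completion tokens) as plain floats, and the function name `a2po_loss` and argument `beta2` are hypothetical.

```python
def a2po_loss(logp_policy, logp_ref, reward, v_star, beta2):
    """Squared error between beta2 * log(pi/pi_ref) and the advantage r - V*."""
    log_ratio = logp_policy - logp_ref  # log pi(y|x) - log pi_ref(y|x)
    advantage = reward - v_star         # estimated optimal advantage
    return (beta2 * log_ratio - advantage) ** 2

# Policy still equal to the reference, correct completion, cached V*(x) = 0.4.
loss_at_init = a2po_loss(-12.0, -12.0, reward=1.0, v_star=0.4, beta2=0.1)

# Policy whose scaled log-ratio already matches the advantage: loss near zero.
loss_at_fit = a2po_loss(-6.0, -12.0, reward=1.0, v_star=0.4, beta2=0.1)
```

Since the target depends only on the sampled completion's reward and the cached V*(x), one generation per prompt is enough to form the loss.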

Also wires docs (a2po_trainer.md, paper index, toctree), registers A2POTrainer in telemetry, and adds smoke/regression tests (train updates weights, extra dataset columns are forwarded to reward functions, eval works without a prior train()).

Reviewed by Cursor Bugbot for commit 37e52ad. Bugbot is set up for automated code reviews on this repo.

Implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal
Advantage Regression" (https://huggingface.co/papers/2505.20686) as an
experimental trainer.

- trl/experimental/a2po: A2POConfig + A2POTrainer (two-stage offline V*
  estimation + on-policy single-generation regression)
- Register A2POTrainer in _TELEMETRY_TRAINERS
- Add paper_index.md section

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread trl/experimental/a2po/a2po_trainer.py

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 20e285c.

for dataset in datasets:
    dataloader = self.accelerator.prepare(
        DataLoader(dataset, batch_size=self.args.per_device_train_batch_size, collate_fn=list)
    )

Stage 1 breaks without train data

Medium Severity

_estimate_optimal_values always includes self.train_dataset in the Stage 1 loop whenever eval_dataset is set, and calls self.train_dataset.filter when filter_all_incorrect is enabled, without checking that train_dataset is non-None. A trainer constructed with only an eval set (or train_dataset=None) will fail in Stage 1 before evaluation can run.
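
One way to guard against this is sketched below. It is a hypothetical helper, not the fix that was merged: `stage1_datasets` is an invented name, and the handling of a dict-valued eval_dataset is an assumption about how the trainer might receive multiple eval sets.

```python
def stage1_datasets(train_dataset, eval_dataset):
    """Collect the datasets that need V* estimates, skipping any that are None."""
    datasets = []
    if train_dataset is not None:
        datasets.append(train_dataset)
    if isinstance(eval_dataset, dict):
        # Several named eval sets: include each one that is present.
        datasets.extend(d for d in eval_dataset.values() if d is not None)
    elif eval_dataset is not None:
        datasets.append(eval_dataset)
    return datasets

# Eval-only construction no longer touches a missing train set.
eval_only = stage1_datasets(None, ["eval row"])
```

The same None check would also need to wrap the train_dataset.filter call made when filter_all_incorrect is enabled.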


for j, prompt_text in enumerate(prompts_text):
    optimal_values[prompt_text] = v_star[j].item()
    if rewards[j].sum() == 0:
        all_incorrect.add(prompt_text)

Duplicate prompts stick in filter set

Medium Severity

With filter_all_incorrect enabled, all_incorrect is keyed only by templated prompt text and never removes a prompt once added. If the same text appears more than once (same batch or later batches) and any one row’s reference samples are all zero, every training row with that text is dropped even when another row’s samples had positive reward.
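
A possible way to avoid the sticky-set behavior is to record, per prompt text, whether any occurrence produced a positive reward, and to decide what to drop only after all rows have been seen. The sketch below uses plain Python lists in place of reward tensors; `prompts_to_drop` is a hypothetical name and this is not necessarily how the issue was resolved in the PR.

```python
def prompts_to_drop(rows):
    """rows: iterable of (prompt_text, rewards), rewards being a list of 0/1 values."""
    any_correct = {}
    for prompt_text, rewards in rows:
        seen_correct = any_correct.get(prompt_text, False)
        # A prompt stays droppable only while every occurrence is all-zero.
        any_correct[prompt_text] = seen_correct or sum(rewards) > 0
    return {p for p, ok in any_correct.items() if not ok}

# The same prompt text appears twice; only one occurrence is all-zero.
rows = [("2+2?", [0, 0, 0]), ("2+2?", [0, 1, 0]), ("hard", [0, 0, 0])]
dropped = prompts_to_drop(rows)
```

Here the duplicated prompt is kept because one of its occurrences had a positive reward, while the prompt that failed everywhere is dropped.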


@qgallouedec qgallouedec left a comment

Member

thanks! I'll push a few commits on your branch and we're good!

Comment thread trl/experimental/a2po/a2po_config.py Outdated
@raghulchandramouli

Contributor Author

@qgallouedec thanks a lot!!! for looking into the PR

bot-ci-comment Bot commented Jun 5, 2026

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec qgallouedec merged commit da3ee53 into huggingface:main Jun 5, 2026
14 checks passed