This repository implements Balanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective reinforcement learning algorithm designed to stabilize off-policy optimization and preserve exploration for large language models (LLMs). Most of our experiments, including the partial rollout experiments presented in the paper, were conducted using our internal proprietary framework. Since we cannot open-source that framework, we provide an open-source implementation built on the verl framework.
You can install dependencies by running the following commands:
```bash
# Clone the repository
git clone https://github.com/WooooDyy/BAPO.git
cd BAPO

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-cuda.txt  # for CUDA support
```

BAPO dynamically adjusts the PPO clipping bounds ($c_{low}$ and $c_{high}$) during training to balance positive and negative contributions to the policy update.
```bash
# Single node training example
bash recipe/bapo/run_bapo_example.sh
```

Key hyperparameters for BAPO's adaptive clipping mechanism, as used in the paper, can be configured (e.g., in `recipe/bapo/config/bapo_trainer.yaml`); an illustrative snippet follows the list below:
- `positive_contribution_target` ($\rho_0$): Target contribution of positive signals to the policy gradient loss, set to $\mathbf{0.5}$.
- `ratio_lower_start/max` ($a^{-}/b^{-}$): Movable range for the lower clipping bound, set to $\mathbf{[0.6, 0.9]}$.
- `ratio_upper_start/max` ($a^{+}/b^{+}$): Movable range for the upper clipping bound, set to $\mathbf{[1.2, 3.0]}$.
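For reference, these paper defaults can be collected into a small Python config object. This is only an illustrative sketch: `BAPOClipConfig` is a hypothetical name, and the field names spell out the `start/max` abbreviations above, so they may not match the actual YAML keys.

```python
from dataclasses import dataclass


@dataclass
class BAPOClipConfig:
    """Illustrative container for BAPO's adaptive-clipping hyperparameters (paper defaults)."""
    positive_contribution_target: float = 0.5  # rho_0: target positive share of the loss
    ratio_lower_start: float = 0.6             # a^-: start of the lower bound's movable range
    ratio_lower_max: float = 0.9               # b^-: max of the lower bound's movable range
    ratio_upper_start: float = 1.2             # a^+: start of the upper bound's movable range
    ratio_upper_max: float = 3.0               # b^+: max of the upper bound's movable range
```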
BAPO addresses two key challenges in applying off-policy reinforcement learning to LLMs:
- Imbalanced Optimization: Policy updates are often dominated by negative-advantage samples, suppressing useful behaviors and risking gradient explosions.
- Entropy Collapse: The fixed clipping mechanism in PPO-like objectives systematically blocks updates that would increase policy entropy, driving the model toward over-exploitation.
- Adaptive Clipping: Dynamically adjusts the clipping bounds ($c_{low}$ and $c_{high}$) to re-balance positive and negative contributions for each update step.
- Entropy Preservation: By incorporating more low-probability positive tokens and filtering excessive low-probability negative tokens, BAPO effectively preserves policy entropy, ensuring stable exploration.
- Stable and Fast Training: Achieves fast, stable, and data-efficient training across diverse off-policy scenarios, preventing the instability and collapse seen in baselines.
BAPO iteratively adjusts the clipping bounds until the positive token contribution reaches the target $\rho_0$; a simplified code sketch follows the list below. This procedure:
- Increases the contribution of positive tokens to overcome the dominance of negative tokens.
- Maintains a smoother output distribution, thus preserving entropy and exploratory capacity.
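A minimal PyTorch sketch of this idea is shown here. It is a simplified reading of the mechanism, not the repository's implementation: the function and argument names, the fixed step size, and the proxy used for a token's "contribution" (the magnitude of its unclipped surrogate term) are assumptions made for illustration.

```python
# Illustrative BAPO-style adaptive clipping: raise the clip bounds within their
# movable ranges until positive-advantage tokens account for at least rho_0 of
# the gradient-carrying surrogate terms.
import torch


def adapt_clip_bounds(ratio: torch.Tensor, advantage: torch.Tensor,
                      rho0: float = 0.5,
                      lower_start: float = 0.6, lower_max: float = 0.9,
                      upper_start: float = 1.2, upper_max: float = 3.0,
                      step: float = 0.05) -> tuple[float, float]:
    """Return (c_low, c_high) for one update step from per-token ratios and advantages."""
    c_low, c_high = lower_start, upper_start
    while True:
        # A token's gradient is blocked when PPO clipping is active: positive advantage
        # with ratio above c_high, or negative advantage with ratio below c_low.
        grad_flows = torch.where(advantage > 0, ratio <= c_high, ratio >= c_low)
        contrib = torch.where(grad_flows, (ratio * advantage).abs(),
                              torch.zeros_like(ratio))
        pos = contrib[advantage > 0].sum()
        neg = contrib[advantage < 0].sum()
        share = pos / (pos + neg + 1e-8)
        at_limits = c_low >= lower_max and c_high >= upper_max
        if share >= rho0 or at_limits:
            return c_low, c_high
        # Relax both bounds within [a^-, b^-] and [a^+, b^+].
        c_low = min(lower_max, c_low + step)
        c_high = min(upper_max, c_high + step)


# Toy usage with stand-in ratios and advantages.
if __name__ == "__main__":
    torch.manual_seed(0)
    ratio = torch.exp(torch.randn(1024) * 0.5)  # stand-in importance ratios
    advantage = torch.randn(1024)               # stand-in advantages
    print(adapt_clip_bounds(ratio, advantage))
```

In this sketch, raising $c_{low}$ filters gradients from more low-probability negative tokens, while raising $c_{high}$ admits more positive tokens, pushing the positive share toward $\rho_0$.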
BAPO consistently yields significant performance improvements and enables more stable optimization.
- State-of-the-Art (SOTA) Performance: The $\mathbf{32B}$ BAPO model achieves SOTA results among models of the same scale on the AIME 2024 (87.1) and AIME 2025 (80.0) benchmarks. It even outperforms leading proprietary systems like o3-mini-medium.
- Strong 7B Performance: The $\mathbf{7B}$ BAPO model scores $\mathbf{70.8}$ on AIME 2024 and $\mathbf{62.5}$ on AIME 2025, surpassing open-source counterparts like SkyWork-OR1-7B.
- Training Stability: BAPO exhibits a more stable optimization process than baseline methods, characterized by rapidly increasing training rewards, a greater positive token contribution, steady gradient norms, and stable policy entropy.
If you find this work helpful, please cite us:
```bibtex
@misc{xi2025bapostabilizingoffpolicyreinforcement,
  title={BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping},
  author={Zhiheng Xi and Xin Guo and Yang Nan and Enyu Zhou and Junrui Shen and Wenxiang Chen and Jiaqi Liu and Jixuan Huang and Zhihao Zhang and Honglin Guo and Xun Deng and Zhikai Lei and Miao Zheng and Guoteng Wang and Shuo Zhang and Peng Sun and Rui Zheng and Hang Yan and Tao Gui and Qi Zhang and Xuanjing Huang},
  year={2025},
  eprint={2510.18927},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.18927},
}
```

We implement our reinforcement learning algorithm by extending verl and use vLLM for inference. Our models are trained primarily on the Qwen2.5 family and DeepSeek-R1-Distill-Qwen. Thanks for their great contributions!
For questions, discussion, or collaboration opportunities, feel free to contact:
- Zhiheng Xi: zhxi22@m.fudan.edu.cn