_{RLinf: Reinforcement Learning Infrastructure for Post-training}

_{RLinf: Reinforcement Learning Infrastructure for Post-training}

RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

What's NEW!

[2025/11] 🔥 RLinf supports reinforcement learning fine-tuning for CALVIN. Doc: RL on CALVIN.
[2025/11] 🔥 RLinf supports reinforcement learning fine-tuning for IsaacLab. Doc: RL on IsaacLab.
[2025/11] 🔥 RLinf supports reinforcement learning fine-tuning for GR00T-N1.5. Doc: RL on GR00T-N1.5.
[2025/11] 🔥 RLinf supports reinforcement learning fine-tuning for Metaworld. Doc: RL on Metaworld.
[2025/11] 🔥 RLinf supports reinforcement learning fine-tuning for Behavior 1k. Doc: RL on Behavior 1k.
[2025/11] Add lora support to π₀ and π₀.₅.
[2025/10] 🔥 RLinf supports reinforcement learning fine-tuning for π₀ and π₀.₅! Doc: RL on π₀ and π₀.₅ Models. For more technical details, refer to the RL fine-tuning for π₀ and π₀.₅ technical report. The report on πRL by Machine Heart and RoboTech are also released.
[2025/10] 🔥 RLinf now officially supports online reinforcement learning! Doc: coding_online_rl, Blog post: The first open-source agent online RL framework RLinf-Online.
[2025/10] 🔥 The RLinf Algorithm Technical Report RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training is released.
[2025/09] 🔥 Example Gallery is updated, users can find various off-the-shelf examples!
[2025/09] The paper RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation is released.
[2025/09] The report on RLinf by Machine Heart is released.
[2025/08] RLinf is open-sourced. The formal v0.1 will be released soon.

Key Features

Embodied Intelligence

Simulators	Real-world Robotics	Models	Algorithms
ManiSkill ✅ LIBERO ✅ RoboTwin RoboVerse BEHAVIOR ✅ MetaWorld ✅ IsaacLab ✅ CALVIN ✅ RoboCasa More...	Franka Arm More...	VLA π₀ ✅ π₀.₅ ✅ OpenVLA ✅ OpenVLA-OFT ✅ GR00T ✅ VLM Qwen2.5-VL Custom Models MLP-Policy ✅	RL Algos GRPO ✅ PPO ✅ DAPO ✅ Reinforce++ ✅ SAC SFT Full-parameter SFT LoRA SFT

RLinf supports mainstream VLA models, mainstream CPU & GPU-parallel simulators via standardized Worker interfaces, and enables the first RL fine-tuning of the $\pi_{0}$ and $\pi_{0.5}$ model family and Isaac-GR00T with a flow-matching action expert, as shown in the above table.

Agentic RL

Agentic RL includes both RL training for improving LLM reasoning ability, such as Math Reasoning, and RL training for Agents, for example, RL training of coding agent. We believe embodied intelligence will also integrate the ability of agents in the future to complete complex tasks.

High flexibility, efficiency, and scalability

Besides the rich functionalities introduced above, RLinf has high flexibility to support diverse RL training workflows (PPO, GRPO, SAC and so on), while hiding the complexity of distributed programming. Users can easily scale RL training to a large number of GPU nodes without modifying code, meeting the increasing demand of computation for RL training.

The high flexibility allows RLinf to explore more efficient scheduling and execution. The hybrid execution mode for embodied RL achieves a 100%+ throughput improvement compared to baseline solutions.

Multiple Backend Integrations

FSDP + HuggingFace/SGLang/vLLM: rapid adaptation to new models and algorithms, ideal for beginners and fast prototyping.
Megatron + SGLang/vLLM: optimized for large-scale training, delivering maximum efficiency for expert users with demanding workloads.

Quick Start

Installation: Users can refer to our installation guide to install RLinf. We recommend users to use our provided docker image (i.e., Installation Method 1), as the environment and dependencies of embodied RL are complex.

Run a simple example: After setting up the environment, users can run a simple example of embodied RL with ManiSkill3 simulator following this document.

For more tutorials of RLinf and application examples, checkout our documentation and example gallery.

Main Results

Embodied Intelligence

RLinf supports both PPO and GRPO algorithms, enabling state-of-the-art training for Vision-Language-Action models.
The framework provides seamless integration with mainstream embodied intelligence benchmarks, and achieves strong performance across diverse evaluation metrics.

OpenVLA and OpenVLA-OFT Results

OpenVLA

OpenVLA-OFT

Training curves on ManiSkill “PutOnPlateInScene25Mani-v3” with OpenVLA and OpenVLA-OFT models, using PPO and GRPO algorithms. PPO consistently outperforms GRPO and exhibits greater stability.

Evaluation results on ManiSkill. Values denote success rates
	In-Distribution	Out-Of-Distribution
	In-Distribution	Vision	Semantic	Execution	Avg.
OpenVLA (Base)	53.91%	38.75%	35.94%	42.11%	39.10%
RL4VLA (PPO)	93.75%	80.47%	75.00%	81.77%	79.15%
OpenVLA (RLinf-GRPO)	84.38%	74.69%	72.99%	77.86%	75.15%
OpenVLA (RLinf-PPO)	96.09%	82.03%	78.35%	85.42%	81.93%

OpenVLA-OFT (Base)	28.13%	27.73%	12.95%	11.72%	18.29%
OpenVLA-OFT (RLinf-GRPO)	94.14%	84.69%	45.54%	44.66%	60.64%
OpenVLA-OFT (RLinf-PPO)	97.66%	92.11%	64.84%	73.57%	77.05%

Evaluation results of the unified model on the five LIBERO task groups
Model	Spatial	Object	Goal	Long	90	Avg.
OpenVLA-OFT (Base)	72.18%	71.48%	64.06%	48.44%	70.97%	65.43%
OpenVLA-OFT (RLinf-GRPO)	99.40%	99.80%	98.79%	93.95%	98.59%	98.11%
Δ Improvement	+27.22	+28.32	+34.73	+45.51	+27.62	+32.68

π₀ and π_0.5 Results

Evaluation results on the four LIBERO task groups
Model		LIBERO
Model		Spatial	Object	Goal	Long	Avg.	Δ Avg.
Full Dataset SFT
Octo		78.9%	85.7%	84.6%	51.1%	75.1%	—
OpenVLA		84.7%	88.4%	79.2%	53.7%	76.5%	—
π_fast		96.4%	96.8%	88.6%	60.2%	85.5%	—
OpenVLA-OFT		91.6%	95.3%	90.6%	86.5%	91.0%	—
π₀		96.8%	98.8%	95.8%	85.2%	94.2%	—
π_0.5		98.8%	98.2%	98.0%	92.4%	96.9%	—
Few-shot Dataset SFT + RL
π₀	SFT	65.3%	64.4%	49.8%	51.2%	57.6%	—
	Flow-SDE	98.4%	99.4%	96.2%	90.2%	96.1%	+38.5
	Flow-Noise	99.0%	99.2%	98.2%	93.8%	97.6%	+40.0
Few-shot Dataset SFT + RL
π_0.5	SFT	84.6%	95.4%	84.6%	43.9%	77.1%	—
	Flow-SDE	99.6%	100%	98.8%	93.0%	97.9%	+20.8
	Flow-Noise	99.6%	100%	99.6%	94.0%	98.3%	+21.2

Math Reasoning

1.5B model results
Model	AIME 24	AIME 25	GPQA-diamond	Average
DeepSeek-R1-Distill-Qwen-1.5B (base model)	28.33	24.90	27.45	26.89
DeepMath-1.5B	37.80	30.42	32.11	33.44
DeepScaleR-1.5B-Preview	40.41	30.93	27.54	32.96
AReaL-1.5B-Preview-Stage-3	40.73	31.56	28.10	33.46
AReaL-1.5B-retrain*	44.42	34.27	33.81	37.50
FastCuRL-1.5B-V3	43.65	32.49	35.00	37.05
RLinf-math-1.5B	48.44	35.63	38.46	40.84

* We retrain the model using the default settings for 600 steps.

7B model results
Model	AIME 24	AIME 25	GPQA-diamond	Average
DeepSeek-R1-Distill-Qwen-7B (base model)	54.90	40.20	45.48	46.86
AReaL-boba-RL-7B	61.66	49.38	46.93	52.66
Skywork-OR1-7B	66.87	52.49	44.43	54.60
Polaris-7B-Preview	68.55	51.24	43.88	54.56
AceMath-RL-Nemotron-7B	67.30	55.00	45.57	55.96
RLinf-math-7B	68.33	52.19	48.18	56.23

RLinf achieves state-of-the-art performance on math reasoning tasks, consistently outperforming existing models across multiple benchmarks (AIME 24, AIME 25, GPQA-diamond) for both 1.5B and 7B model sizes.

Roadmap

1. System-Level Enhancements

Support for heterogeneous GPUs
Support for asynchronous pipeline execution
Support for Mixture of Experts (MoE)

2. Application-Level Extensions

Support for Vision-Language Models (VLMs) training
Support for deep searcher agent training
Support for multi-agent training
Support for integration with more embodied simulators (e.g., RoboCasa, GENESIS, RoboTwin)
Support for more Vision Language Action models (VLAs) (e.g., WALL-OSS)
Support for world model
Support for real-world RL embodied intelligence

CI Test Status

RLinf has comprehensive CI tests for both the core components (via unit tests) and end-to-end RL training workflows of embodied, agent, and reasoning scenarios. Below is the summary of the CI test status of the main branch:

Test Name	Status
unit-tests
agent-reason-e2e-tests
embodied-e2e-tests
scheduler-tests

Contribution Guidelines

We welcome contributions to RLinf. Please read contribution guide before taking action. Thank the following contributors and welcome more developers to join us on this open source project.

Citation and Acknowledgement

If you find RLinf helpful, please cite the paper:

@article{yu2025rlinf,
  title={RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation},
  author={Yu, Chao and Wang, Yuanqing and Guo, Zhen and Lin, Hao and Xu, Si and Zang, Hongzhi and Zhang, Quanlu and Wu, Yongji and Zhu, Chunyang and Hu, Junhao and others},
  journal={arXiv preprint arXiv:2509.15965},
  year={2025}
}

If you use RL+VLA in RLinf, you can also cite our technical report and empirical study paper:

@article{zang2025rlinf,
  title={RLinf-VLA: A Unified and Efficient Framework for VLA+ RL Training},
  author={Zang, Hongzhi and Wei, Mingjie and Xu, Si and Wu, Yongji and Guo, Zhen and Wang, Yuanqing and Lin, Hao and Shi, Liangzhi and Xie, Yuqing and Xu, Zhexuan and others},
  journal={arXiv preprint arXiv:2510.06710},
  year={2025}
}

@article{liu2025can,
  title={What can rl bring to vla generalization? an empirical study},
  author={Liu, Jijia and Gao, Feng and Wei, Bingwen and Chen, Xinlei and Liao, Qingmin and Wu, Yi and Yu, Chao and Wang, Yu},
  journal={arXiv preprint arXiv:2505.19789},
  year={2025}
}

@article{chen2025pi_,
  title={$$\backslash$pi\_$\backslash$texttt $\{$RL$\}$ $: Online RL Fine-tuning for Flow-based Vision-Language-Action Models},
  author={Chen, Kang and Liu, Zhihao and Zhang, Tonghe and Guo, Zhen and Xu, Si and Lin, Hao and Zang, Hongzhi and Zhang, Quanlu and Yu, Zhaofei and Fan, Guoliang and others},
  journal={arXiv preprint arXiv:2510.25889},
  year={2025}
}

Acknowledgements RLinf has been inspired by, and benefits from, the ideas and tooling of the broader open-source community. In particular, we would like to thank the teams and contributors behind VeRL, AReaL, Megatron-LM, SGLang, and PyTorch Fully Sharded Data Parallel (FSDP), and if we have inadvertently missed your project or contribution, please open an issue or a pull request so we can properly credit you.

Contact: We welcome applications from Postdocs, PhD/Master's students, and interns. Join us in shaping the future of RL infrastructure and embodied AI!

Chao Yu: zoeyuchao@gmail.com
Yu Wang: yu-wang@tsinghua.edu.cn

Name		Name	Last commit message	Last commit date
Latest commit History 200 Commits
.github		.github
docker		docker
docs		docs
examples		examples
ray_utils		ray_utils
requirements		requirements
rlinf		rlinf
tests		tests
toolkits		toolkits
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
README.zh-CN.md		README.zh-CN.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

_{RLinf: Reinforcement Learning Infrastructure for Post-training}

What's NEW!

Key Features

Embodied Intelligence

Agentic RL

High flexibility, efficiency, and scalability

Quick Start

Main Results

Embodied Intelligence

OpenVLA and OpenVLA-OFT Results

π₀ and π_0.5 Results

Math Reasoning

Roadmap

1. System-Level Enhancements

2. Application-Level Extensions

CI Test Status

Contribution Guidelines

Citation and Acknowledgement

About

Uh oh!

Releases

Packages

Languages

License

qurakchin/RLinf

Folders and files

Latest commit

History

Repository files navigation

RLinf: Reinforcement Learning Infrastructure for Post-training

What's NEW!

Key Features

Embodied Intelligence

Agentic RL

High flexibility, efficiency, and scalability

Quick Start

Main Results

Embodied Intelligence

OpenVLA and OpenVLA-OFT Results

π0 and π0.5 Results

Math Reasoning

Roadmap

1. System-Level Enhancements

2. Application-Level Extensions

CI Test Status

Contribution Guidelines

Citation and Acknowledgement

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

_{RLinf: Reinforcement Learning Infrastructure for Post-training}

π₀ and π_0.5 Results

Packages