A general framework for off-policy learning in large reasoning models.
- 🎉 News
- 📖 Introduction
- ✨ Getting Started
- 🔧 Usage
- 📃 Evaluation
- 🎈 Citation
- 🌻 Acknowledgement
## 🎉 News

- [2025/10/04] 🚀 Introducing ExGRPO, a new variant that boosts performance by learning from the model's own off-policy experience, without relying on external guidance.
- [2025/09/19] 🎉 LUFFY has been accepted to NeurIPS 2025!
- [2025/05/30] We have integrated the implementations and scripts of other off-policy learning methods, including SFT, SFT+RL, and RL w/ SFT Loss (multi-task learning).
- [2025/05/21] We have released an updated version of the paper, which re-evaluates all models with a more accurate verifier and adds comparisons with other off-policy learning methods, including RL w/ SFT Loss and SFT+RL.
- [2025/04/23] Our paper is now trending on alphaXiv! We welcome feedback and discussion.
- [2025/04/23] 🎉 Ranked #1 of the day on Huggingface Daily Papers.
- [2025/04/20] The LUFFY paper is available on arXiv.
## 📖 Introduction

LUFFY is a reinforcement learning framework that bridges the gap between zero-RL and imitation learning by incorporating off-policy reasoning traces into the training process. Built upon GRPO, LUFFY combines on-policy rollouts with off-policy demonstrations during advantage estimation and introduces policy shaping via regularized importance sampling to emphasize low-probability yet crucial actions.
- Off-Policy Guidance: Seamlessly integrates external reasoning traces to bootstrap learning from stronger models.
- Dynamic Balance: Learns when to imitate and when to explore, adapting over the course of training.
- Policy Shaping: Emphasizes important actions often ignored in standard policy gradients, enabling better generalization (see the sketch below).
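To make policy shaping concrete, here is a minimal PyTorch sketch of a reshaped importance weight on off-policy tokens, assuming a transform of the form f(x) = x / (x + γ). The function name and the value of `gamma` are illustrative; the actual training code lives in `luffy/verl/verl/mix_src`.

```python
import torch

def shaped_off_policy_surrogate(logprobs: torch.Tensor,
                                advantages: torch.Tensor,
                                gamma: float = 0.1) -> torch.Tensor:
    """Illustrative sketch of policy shaping (not the repo's training code).

    For off-policy demonstration tokens, the behavior policy's log-probs are
    unavailable, so the importance ratio is approximated by pi_theta itself
    and reshaped with f(x) = x / (x + gamma). Since f'(x) = gamma / (x + gamma)^2
    is largest when x is small, low-probability yet crucial tokens keep a
    sizeable gradient instead of being washed out.
    """
    probs = logprobs.exp()              # pi_theta(y_t | y_<t, x), per token
    shaped = probs / (probs + gamma)    # regularized importance weight f(pi_theta)
    return -(shaped * advantages).mean()
```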
## ✨ Getting Started

You can install LUFFY dependencies by running the following commands:
```bash
conda create -n luffy python=3.10
conda activate luffy
cd luffy
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .
```

Recently, we found that the deprecation of pyairports caused a number of environment issues, so we now provide a slightly more involved installation procedure:
```bash
conda create -n luffy python=3.10
conda activate luffy
pip install airports-py
git clone https://github.com/dottxt-ai/outlines.git
cd outlines
git checkout 0.0.46
pip install .
cd ../luffy
pip install -r requirements.v2.txt
pip install -e .
cd verl
pip install -e .
```

If you encounter issues installing flash-attn, we recommend installing a prebuilt wheel from the [flash-attention releases](https://github.com/Dao-AILab/flash-attention/releases). For example, we use this version:
```bash
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

## 🔧 Usage

This repository includes:
- `luffy`: Code for training LUFFY with off-policy reasoning traces. Our main code changes are in `luffy/verl/verl/mix_src`.
- `data`: Data and code for training and evaluating LUFFY.
- `exp_scripts`: Example scripts to train LUFFY.
- `eval_scripts`: Evaluation scripts for math and out-of-distribution benchmarks.
- `ExGRPO`: Implementation and notes for ExGRPO, which leverages off-policy experience replay to further boost performance without external guidance.
LUFFY is built on top of the GRPO framework and supports plug-and-play integration with off-policy traces from models such as DeepSeek-R1.
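For intuition, here is a minimal sketch of how on-policy rollouts and off-policy demonstrations could share one GRPO group during advantage estimation. Variable names and normalization details are illustrative assumptions; see `luffy/verl/verl/mix_src` for the actual logic.

```python
import torch

def mixed_group_advantages(on_rewards: torch.Tensor,
                           off_rewards: torch.Tensor,
                           eps: float = 1e-6):
    """Sketch: group-normalized (GRPO-style) advantages over a mixed group.

    Rollouts sampled from the current policy and external demonstrations for
    the same prompt are pooled into a single group, so a correct off-policy
    trace earns a positive advantage exactly when the policy's own rollouts
    fail, and its influence fades as the policy catches up.
    """
    rewards = torch.cat([on_rewards, off_rewards])
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    n_on = on_rewards.numel()
    return advantages[:n_on], advantages[n_on:]
```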
First, run the data preparation script to produce the training data in parquet format:
```bash
cd data
python prepare_train.py
```
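If you want to sanity-check the output, the parquet file can be inspected with pandas. The path and fields below are assumptions for illustration, not guaranteed by `prepare_train.py`:

```python
import pandas as pd

# Hypothetical output path; check the data/ directory for the actual file name.
df = pd.read_parquet("data/train.parquet")
print(df.columns.tolist())  # e.g. prompt/target fields, depending on the script
print(df.iloc[0])           # spot-check one example
```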
We provide an example script to train LUFFY on our subset of OpenR1-Math-220k; run the following command:

```bash
cd exp_scripts
bash train.sh
```

To train the SFT baseline, first clone the OpenRLHF repository and convert the data to SFT format. (We plan to integrate the SFT pipeline directly into LUFFY in the near future.)
```bash
git clone https://github.com/OpenRLHF/OpenRLHF
cd data
python prepare_sft.py
```

Then run the SFT training command:
```bash
RESULT_DIR="Your result directory"
DATA_DIR="Your data directory"
WANDB_KEY="Your Wandb Key"
MODEL_PATH=Elliott/Qwen2.5-Math-7B-16k-think

MASTER_ADDR=`scontrol show hostname $SLURM_JOB_NODELIST | head -n1`
MASTER_PORT=$((RANDOM % 101 + 20000))
DEVICES="0,1,2,3,4,5,6,7"

deepspeed --master_port=$MASTER_PORT --master_addr=$MASTER_ADDR --include localhost:$DEVICES --module openrlhf.cli.train_sft \
  --max_len 16384 \
  --dataset $DATA_DIR \
  --input_key prompt \
  --output_key target \
  --train_batch_size 64 \
  --apply_chat_template \
  --micro_train_batch_size 1 \
  --max_samples 500000 \
  --pretrain $MODEL_PATH \
  --save_path $RESULT_DIR \
  --logging_steps 1 \
  --eval_steps -1 \
  --zero_stage 2 \
  --max_epochs 3 \
  --adam_offload \
  --packing_samples \
  --bf16 \
  --flash_attn \
  --save_hf_ckpt \
  --learning_rate 5e-5 \
  --lr_warmup_ratio 0.1 \
  --wandb_project r1_sft_distill \
  --wandb_run_name qwen-7b-base-sft \
  --use_wandb $WANDB_KEY \
  --gradient_checkpointing
```
To train with RL w/ SFT Loss (multi-task learning), run:

```bash
cd exp_scripts
bash train_rl_sft_loss.sh
```

For SFT+RL, we use held-out data for the RL stage, following previous work such as PRIME:
```bash
cd data
python prepare_train_sft_rl.py
cd ../exp_scripts
bash train_sft_rl.sh
```

Here is an example of using LUFFY for inference:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "Elliott/LUFFY-Qwen-Math-7B-Zero"
question = "which number is larger? 9.11 or 9.9?"

tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": question}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_path)
params = SamplingParams(temperature=0.6, max_tokens=8192)
outputs = llm.generate([chat], params)
print(outputs[0].outputs[0].text)
```

We release the following models:

| Model | Huggingface | Base Model |
|---|---|---|
| LUFFY-Qwen-Math-7B-Zero | https://huggingface.co/Elliott/LUFFY-Qwen-Math-7B-Zero | Qwen2.5-Math-7B |
| LUFFY-Qwen-Math-7B-SFT | https://huggingface.co/Elliott/Qwen2.5-Math-7B-SFT | Qwen2.5-Math-7B |
| LUFFY-Qwen-Math-7B-SFT-RL | https://huggingface.co/Elliott/Qwen2.5-Math-7B-SFT-RL | Qwen2.5-Math-7B |
| LUFFY-Qwen-Math-1.5B-Zero | https://huggingface.co/Elliott/LUFFY-Qwen-Math-1.5B-Zero | Qwen2.5-Math-1.5B |
| LUFFY-Qwen-Instruct-7B | https://huggingface.co/Elliott/LUFFY-Qwen-Instruct-7B | Qwen2.5-7B-Instruct |
## 📃 Evaluation

We currently support automated evaluation on six widely used mathematical reasoning benchmarks (AIME 2024/2025, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution benchmarks (ARC-c, GPQA-diamond, and MMLU-pro). The evaluation suite provides specialized system prompts for a range of RL models, including LUFFY, SimpleRL, OpenReasoner, PRIME, and OAT.
You can reproduce our results by running the following commands:
```bash
ROOT=YOUR_ROOT_PATH
DATA=$ROOT/data/valid.all.parquet
OUTPUT_DIR=./results/
mkdir -p $OUTPUT_DIR

# If you want to evaluate other models, change the model path and name.
MODEL_PATH=Elliott/LUFFY-Qwen-Math-7B-Zero
MODEL_NAME=luffy

if [ $MODEL_NAME == "eurus-2-7b-prime-zero" ]; then
  TEMPLATE=prime
elif [ $MODEL_NAME == "simple-rl-zero" ]; then
  TEMPLATE=qwen
else
  TEMPLATE=own
fi

CUDA_VISIBLE_DEVICES=0,1,2,3 python eval_scripts/generate_vllm.py \
  --model_path $MODEL_PATH \
  --input_file $DATA \
  --remove_system True \
  --add_oat_evaluate True \
  --output_file $OUTPUT_DIR/$MODEL_NAME.jsonl \
  --template $TEMPLATE > $OUTPUT_DIR/$MODEL_NAME.log
```

LUFFY is evaluated on six competition-level benchmarks, achieving state-of-the-art results among all zero-RL methods. It surpasses both on-policy RL and imitation learning (SFT), especially in generalization:
| Model | AIME 2024 | AIME 2025 | AMC | MATH-500 | Minerva | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 11.5 | 4.9 | 31.3 | 43.6 | 7.4 | 15.6 | 19.0 |
| Qwen2.5-Math-7B-Instruct | 12.5 | 10.2 | 48.5 | 80.4 | 32.7 | 41.0 | 37.6 |
| SimpleRL-Zero | 27.0 | 6.8 | 54.9 | 76.0 | 25.0 | 34.7 | 37.4 |
| OpenReasoner-Zero | 16.5 | 15.0 | 52.1 | 82.4 | 33.1 | 47.1 | 41.0 |
| PRIME-Zero | 17.0 | 12.8 | 54.0 | 81.4 | 39.0 | 40.3 | 40.7 |
| Oat-Zero | 33.4 | 11.9 | 61.2 | 78.0 | 34.6 | 43.4 | 43.7 |
| LUFFY-Qwen-Math-7B-Zero | 29.4 | 23.1 | 65.6 | 87.6 | 37.5 | 57.2 | 50.1 |
LUFFY also generalizes well to out-of-distribution tasks, with an average gain of +6.2 points on ARC-c, GPQA-diamond, and MMLU-Pro over the best zero-RL baseline:
| Model | ARC-c | GPQA-diamond | MMLU-Pro | Avg. |
|---|---|---|---|---|
| Qwen2.5-Math-7B | 18.2 | 11.1 | 16.9 | 15.4 |
| Qwen2.5-Math-7B-Instruct | 70.3 | 24.7 | 34.1 | 43.0 |
| SimpleRL-Zero | 30.2 | 23.2 | 34.5 | 29.3 |
| OpenReasoner-Zero | 66.2 | 29.8 | 58.7 | 51.6 |
| PRIME-Zero | 73.3 | 18.2 | 32.7 | 41.4 |
| Oat-Zero | 70.1 | 23.7 | 41.7 | 45.2 |
| LUFFY-Qwen-Math-7B-Zero | 80.5 | 39.9 | 53.0 | 57.8 |
We further compare LUFFY with alternative off-policy learning methods, including SFT, RL w/ SFT Loss, and SFT+RL (see our paper for details):
| Model | GPU Hours | Data Usage (On/Off) | AIME 2024 | AIME 2025 | AMC | MATH-500 | Minerva | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| SFT | 24*8 | 0 / 64k | 22.2 | 22.3 | 52.8 | 82.6 | 40.8 | 43.7 | 44.1 |
| RL w/ SFT Loss | 133*8 | 64k*7 / 64k | 19.5 | 16.4 | 49.7 | 80.4 | 34.9 | 39.4 | 40.1 |
| SFT+RL | 130*8 | 64k*8 / 135k | 25.8 | 23.1 | 62.7 | 87.2 | 39.7 | 50.4 | 48.2 |
| LUFFY-Qwen-Math-7B-Zero | 77*8 | 64k*7 / 64k | 29.4 | 23.1 | 65.6 | 87.6 | 37.5 | 57.2 | 50.1 |
| LUFFY-Qwen-Math-7B-Zero-Extra | 130*8 | 110k*7 / 110k | 30.7 | 22.5 | 66.2 | 86.8 | 41.2 | 55.3 | 50.4 |
We also train LUFFY on the smaller Qwen2.5-Math-1.5B:

| Model | AIME 2024 | AIME 2025 | AMC | MATH-500 | Minerva | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B | 7.2 | 3.6 | 26.4 | 28.0 | 9.6 | 21.2 | 16.0 |
| Qwen2.5-Math-1.5B-Instruct | 12.1 | 8.9 | 48.1 | 77.4 | 28.7 | 39.1 | 35.7 |
| LUFFY-Qwen-Math-1.5B-Zero | 16.0 | 13.1 | 47.1 | 80.2 | 30.5 | 41.0 | 38.0 |
LUFFY likewise improves a stronger instruct model, Qwen2.5-7B-Instruct:

| Model | AIME 2024 | AIME 2025 | AMC | MATH-500 | Minerva | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 11.7 | 7.5 | 43.8 | 71.8 | 30.9 | 40.4 | 34.4 |
| LUFFY-Qwen-Instruct-7B | 17.7 | 14.8 | 50.9 | 82.0 | 31.3 | 47.4 | 40.7 |
## 🌻 Acknowledgement

LUFFY builds upon veRL and deepscaler, and utilizes vLLM for inference. We use Math-Verify for math reasoning evaluation. We thank the open-source community for datasets and backbones, including NuminaMath, OpenR1-Math-220k, Qwen2.5-Math, and the DeepSeek-R1 model.
For questions, feedback, or collaboration opportunities, feel free to reach out:
- Jianhao Yan: elliottyan37@gmail.com
- Yafu Li: yafuly@gmail.com
## 🎈 Citation

If you find our model, data, or evaluation code useful, please cite our paper.
LUFFY:
```bibtex
@misc{luffy,
  title={Learning to Reason under Off-Policy Guidance},
  author={Jianhao Yan and Yafu Li and Zican Hu and Zhi Wang and Ganqu Cui and Xiaoye Qu and Yu Cheng and Yue Zhang},
  year={2025},
  eprint={2504.14945},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.14945},
}
```

ExGRPO:
```bibtex
@article{zhan2025exgrpo,
  title={ExGRPO: Learning to Reason from Experience},
  author={Runzhe Zhan and Yafu Li and Zhi Wang and Xiaoye Qu and Dongrui Liu and Jing Shao and Derek F. Wong and Yu Cheng},
  year={2025},
  journal={ArXiv preprint},
  volume={2510.02245},
  url={https://arxiv.org/abs/2510.02245},
}
```