A general framework for off-policy learning in large reasoning models.
- 🎉 News
- 📖 Introduction
- ✨ Getting Started
- 🔧 Usage
- 📃 Evaluation
- 🎈 Citation
- 🌻 Acknowledgement
## 🎉 News

- [2025/10/04] 🚀 Introducing ExGRPO, a new variant that boosts performance by learning from the model's own off-policy experience, without relying on external guidance.
- [2025/09/19] 🎉 LUFFY has been accepted to NeurIPS 2025!
- [2025/05/30] We have integrated the implementations and scripts of other off-policy learning methods, including SFT, SFT+RL, and RL w/ SFT Loss (multi-task learning).
- [2025/05/21] We have released an updated version of the paper, which re-evaluates all models with a more accurate verifier and adds comparisons with other off-policy learning methods, including RL w/ SFT Loss and SFT+RL.
- [2025/04/23] Our paper is now trending on alphaXiv! We welcome feedback and discussion.
- [2025/04/23] 🎉 Ranked #1 of the day on Huggingface Daily Papers.
- [2025/04/20] The LUFFY paper is available on arXiv.
## 📖 Introduction

LUFFY is a reinforcement learning framework that bridges the gap between zero-RL and imitation learning by incorporating off-policy reasoning traces into the training process. Built upon GRPO, LUFFY combines on-policy rollouts with off-policy demonstrations during advantage estimation and introduces policy shaping via regularized importance sampling to emphasize low-probability yet crucial actions.
- Off-Policy Guidance: Seamlessly integrates external reasoning traces to bootstrap learning from stronger models.
- Dynamic Balance: Learns when to imitate and when to explore, adapting over the course of training.
- Policy Shaping: Emphasizes important actions often ignored in standard policy gradients, enabling better generalization (see the sketch below).
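To make policy shaping concrete, here is a minimal PyTorch sketch of a reshaped importance weight on off-policy tokens, assuming a transform of the form f(x) = x / (x + γ). The function name and the value of `gamma` are illustrative; the actual training code lives in `luffy/verl/verl/mix_src`.

```python
import torch

def shaped_off_policy_surrogate(logprobs: torch.Tensor,
                                advantages: torch.Tensor,
                                gamma: float = 0.1) -> torch.Tensor:
    """Illustrative sketch of policy shaping (not the repo's training code).

    For off-policy demonstration tokens, the behavior policy's log-probs are
    unavailable, so the importance ratio is approximated by pi_theta itself
    and reshaped with f(x) = x / (x + gamma). Since f'(x) = gamma / (x + gamma)^2
    is largest when x is small, low-probability yet crucial tokens keep a
    sizeable gradient instead of being washed out.
    """
    probs = logprobs.exp()              # pi_theta(y_t | y_<t, x), per token
    shaped = probs / (probs + gamma)    # regularized importance weight f(pi_theta)
    return -(shaped * advantages).mean()
```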
## ✨ Getting Started

You can install LUFFY dependencies by running the following commands:
```bash
conda create -n luffy python=3.10
conda activate luffy
cd luffy
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .
```

Recently, we found that the deprecation of pyairports caused a number of environment issues, so we now provide a slightly more involved installation procedure:
```bash
conda create -n luffy python=3.10
conda activate luffy
pip install airports-py
git clone https://github.com/dottxt-ai/outlines.git
cd outlines
git checkout 0.0.46
pip install .
cd ../luffy
pip install -r requirements.v2.txt
pip install -e .
cd verl
pip install -e .
```

If you encounter issues installing flash-attn, we recommend installing a prebuilt wheel from the [flash-attention releases](https://github.com/Dao-AILab/flash-attention/releases). For example, we use this version:
```bash
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

## 🔧 Usage

This repository includes:
- `luffy`: Code for training LUFFY with off-policy reasoning traces. Our main code changes are in `luffy/verl/verl/mix_src`.
- `data`: Data and code for training and evaluating LUFFY.
- `exp_scripts`: Example scripts to train LUFFY.
- `eval_scripts`: Evaluation scripts for math and out-of-distribution benchmarks.
- `ExGRPO`: Implementation and notes for ExGRPO, which leverages off-policy experience replay to further boost performance without external guidance.
LUFFY is built on top of the GRPO framework and supports plug-and-play integration with off-policy traces from models such as DeepSeek-R1.
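For intuition, here is a minimal sketch of how on-policy rollouts and off-policy demonstrations could share one GRPO group during advantage estimation. Variable names and normalization details are illustrative assumptions; see `luffy/verl/verl/mix_src` for the actual logic.

```python
import torch

def mixed_group_advantages(on_rewards: torch.Tensor,
                           off_rewards: torch.Tensor,
                           eps: float = 1e-6):
    """Sketch: group-normalized (GRPO-style) advantages over a mixed group.

    Rollouts sampled from the current policy and external demonstrations for
    the same prompt are pooled into a single group, so a correct off-policy
    trace earns a positive advantage exactly when the policy's own rollouts
    fail, and its influence fades as the policy catches up.
    """
    rewards = torch.cat([on_rewards, off_rewards])
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    n_on = on_rewards.numel()
    return advantages[:n_on], advantages[n_on:]
```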
First, run the data preparation script to produce the training data in parquet format:
```bash
cd data
python prepare_train.py
```
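If you want to sanity-check the output, the parquet file can be inspected with pandas. The path and fields below are assumptions for illustration, not guaranteed by `prepare_train.py`:

```python
import pandas as pd

# Hypothetical output path; check the data/ directory for the actual file name.
df = pd.read_parquet("data/train.parquet")
print(df.columns.tolist())  # e.g. prompt/target fields, depending on the script
print(df.iloc[0])           # spot-check one example
```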
We provide an example script to train LUFFY on our subset of OpenR1-Math-220k; run the following command:

```bash
cd exp_scripts
bash train.sh
```

To train the SFT baseline, first clone the OpenRLHF repository and convert the data to SFT format. (We plan to integrate the SFT pipeline directly into LUFFY in the near future.)
```bash
git clone https://github.com/OpenRLHF/OpenRLHF
cd data
python prepare_sft.py
```

Then run the SFT training command:
```bash
RESULT_DIR="Your result directory"
DATA_DIR="Your data directory"
WANDB_KEY="Your Wandb Key"
MODEL_PATH=Elliott/Qwen2.5-Math-7B-16k-think

MASTER_ADDR=`scontrol show hostname $SLURM_JOB_NODELIST | head -n1`
MASTER_PORT=$((RANDOM % 101 + 20000))
DEVICES="0,1,2,3,4,5,6,7"

deepspeed --master_port=$MASTER_PORT --master_addr=$MASTER_ADDR --include localhost:$DEVICES --module openrlhf.cli.train_sft \
  --max_len 16384 \
  --dataset $DATA_DIR \
  --input_key prompt \
  --output_key target \
  --train_batch_size 64 \
  --apply_chat_template \
  --micro_train_batch_size 1 \
  --max_samples 500000 \
  --pretrain $MODEL_PATH \
  --save_path $RESULT_DIR \
  --logging_steps 1 \
  --eval_steps -1 \
  --zero_stage 2 \
  --max_epochs 3 \
  --adam_offload \
  --packing_samples \
  --bf16 \
  --flash_attn \
  --save_hf_ckpt \
  --learning_rate 5e-5 \
  --lr_warmup_ratio 0.1 \
  --wandb_project r1_sft_distill \
  --wandb_run_name qwen-7b-base-sft \
  --use_wandb $WANDB_KEY \
  --gradient_checkpointing
```
To train with RL w/ SFT Loss (multi-task learning), run:

```bash
cd exp_scripts
bash train_rl_sft_loss.sh
```

For SFT+RL, we use held-out data for the RL stage, following previous work such as PRIME:
```bash
cd data
python prepare_train_sft_rl.py
cd ../exp_scripts
bash train_sft_rl.sh
```

Here is an example of using LUFFY for inference:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "Elliott/LUFFY-Qwen-Math-7B-Zero"
question = "which number is larger? 9.11 or 9.9?"

tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": question}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_path)
params = SamplingParams(temperature=0.6, max_tokens=8192)
outputs = llm.generate([chat], params)
print(outputs[0].outputs[0].text)
```

We release the following models:

| Model | Huggingface | Base Model |
|---|---|---|
| LUFFY-Qwen-Math-7B-Zero | https://huggingface.co/Elliott/LUFFY-Qwen-Math-7B-Zero | Qwen2.5-Math-7B |
| LUFFY-Qwen-Math-7B-SFT | https://huggingface.co/Elliott/Qwen2.5-Math-7B-SFT | Qwen2.5-Math-7B |
| LUFFY-Qwen-Math-7B-SFT-RL | https://huggingface.co/Elliott/Qwen2.5-Math-7B-SFT-RL | Qwen2.5-Math-7B |
| LUFFY-Qwen-Math-1.5B-Zero | https://huggingface.co/Elliott/LUFFY-Qwen-Math-1.5B-Zero | Qwen2.5-Math-1.5B |
| LUFFY-Qwen-Instruct-7B | https://huggingface.co/Elliott/LUFFY-Qwen-Instruct-7B | Qwen2.5-7B-Instruct |
## 📃 Evaluation

We currently support automated evaluation on six widely used mathematical reasoning benchmarks (AIME 2024/2025, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution benchmarks (ARC-c, GPQA-diamond, and MMLU-pro). The evaluation suite provides specialized system prompts for a range of RL models, including LUFFY, SimpleRL, OpenReasoner, PRIME, and OAT.
You can reproduce our results by running the following commands:
```bash
ROOT=YOUR_ROOT_PATH
DATA=$ROOT/data/valid.all.parquet
OUTPUT_DIR=./results/
mkdir -p $OUTPUT_DIR

# If you want to evaluate other models, change the model path and name.
MODEL_PATH=Elliott/LUFFY-Qwen-Math-7B-Zero
MODEL_NAME=luffy

if [ $MODEL_NAME == "eurus-2-7b-prime-zero" ]; then
  TEMPLATE=prime
elif [ $MODEL_NAME == "simple-rl-zero" ]; then
  TEMPLATE=qwen
else
  TEMPLATE=own
fi

CUDA_VISIBLE_DEVICES=0,1,2,3 python eval_scripts/generate_vllm.py \
  --model_path $MODEL_PATH \
  --input_file $DATA \
  --remove_system True \
  --add_oat_evaluate True \
  --output_file $OUTPUT_DIR/$MODEL_NAME.jsonl \
  --template $TEMPLATE > $OUTPUT_DIR/$MODEL_NAME.log
```

LUFFY is evaluated on six competition-level benchmarks, achieving state-of-the-art results among all zero-RL methods. It surpasses both on-policy RL and imitation learning (SFT), especially in generalization:
| Model | AIME 2024 | AIME 2025 | AMC | MATH-500 | Minerva | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 11.5 | 4.9 | 31.3 | 43.6 | 7.4 | 15.6 | 19.0 |
| Qwen2.5-Math-7B-Instruct | 12.5 | 10.2 | 48.5 | 80.4 | 32.7 | 41.0 | 37.6 |
| SimpleRL-Zero | 27.0 | 6.8 | 54.9 | 76.0 | 25.0 | 34.7 | 37.4 |
| OpenReasoner-Zero | 16.5 | 15.0 | 52.1 | 82.4 | 33.1 | 47.1 | 41.0 |
| PRIME-Zero | 17.0 | 12.8 | 54.0 | 81.4 | 39.0 | 40.3 | 40.7 |
| Oat-Zero | 33.4 | 11.9 | 61.2 | 78.0 | 34.6 | 43.4 | 43.7 |
| LUFFY-Qwen-Math-7B-Zero | 29.4 | 23.1 | 65.6 | 87.6 | 37.5 | 57.2 | 50.1 |
LUFFY also generalizes well to out-of-distribution tasks, with an average gain of +6.2 points on ARC-c, GPQA-diamond, and MMLU-Pro over the best zero-RL baseline:
| Model | ARC-c | GPQA-diamond | MMLU-Pro | Avg. |
|---|---|---|---|---|
| Qwen2.5-Math-7B | 18.2 | 11.1 | 16.9 | 15.4 |
| Qwen2.5-Math-7B-Instruct | 70.3 | 24.7 | 34.1 | 43.0 |
| SimpleRL-Zero | 30.2 | 23.2 | 34.5 | 29.3 |
| OpenReasoner-Zero | 66.2 | 29.8 | 58.7 | 51.6 |
| PRIME-Zero | 73.3 | 18.2 | 32.7 | 41.4 |
| Oat-Zero | 70.1 | 23.7 | 41.7 | 45.2 |
| LUFFY-Qwen-Math-7B-Zero | 80.5 | 39.9 | 53.0 | 57.8 |
We further compare LUFFY with alternative off-policy learning methods, including SFT, RL w/ SFT Loss, and SFT+RL (see our paper for details):
| Model | GPU Hours | Data Usage (On/Off) | AIME 2024 | AIME 2025 | AMC | MATH-500 | Minerva | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| SFT | 24*8 | 0 / 64k | 22.2 | 22.3 | 52.8 | 82.6 | 40.8 | 43.7 | 44.1 |
| RL w/ SFT Loss | 133*8 | 64k*7 / 64k | 19.5 | 16.4 | 49.7 | 80.4 | 34.9 | 39.4 | 40.1 |
| SFT+RL | 130*8 | 64k*8 / 135k | 25.8 | 23.1 | 62.7 | 87.2 | 39.7 | 50.4 | 48.2 |
| LUFFY-Qwen-Math-7B-Zero | 77*8 | 64k*7 / 64k | 29.4 | 23.1 | 65.6 | 87.6 | 37.5 | 57.2 | 50.1 |
| LUFFY-Qwen-Math-7B-Zero-Extra | 130*8 | 110k*7 / 110k | 30.7 | 22.5 | 66.2 | 86.8 | 41.2 | 55.3 | 50.4 |
We also train LUFFY on the smaller Qwen2.5-Math-1.5B:

| Model | AIME 2024 | AIME 2025 | AMC | MATH-500 | Minerva | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B | 7.2 | 3.6 | 26.4 | 28.0 | 9.6 | 21.2 | 16.0 |
| Qwen2.5-Math-1.5B-Instruct | 12.1 | 8.9 | 48.1 | 77.4 | 28.7 | 39.1 | 35.7 |
| LUFFY-Qwen-Math-1.5B-Zero | 16.0 | 13.1 | 47.1 | 80.2 | 30.5 | 41.0 | 38.0 |
LUFFY likewise improves a stronger instruct model, Qwen2.5-7B-Instruct:

| Model | AIME 2024 | AIME 2025 | AMC | MATH-500 | Minerva | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 11.7 | 7.5 | 43.8 | 71.8 | 30.9 | 40.4 | 34.4 |
| LUFFY-Qwen-Instruct-7B | 17.7 | 14.8 | 50.9 | 82.0 | 31.3 | 47.4 | 40.7 |
## 🌻 Acknowledgement

LUFFY builds upon veRL and deepscaler, and utilizes vLLM for inference. We use Math-Verify for math reasoning evaluation. We thank the open-source community for datasets and backbones, including NuminaMath, OpenR1-Math-220k, Qwen2.5-Math, and the DeepSeek-R1 model.
For questions, feedback, or collaboration opportunities, feel free to reach out:
- Jianhao Yan: elliottyan37@gmail.com
- Yafu Li: yafuly@gmail.com
## 🎈 Citation

If you find our model, data, or evaluation code useful, please cite our paper.
LUFFY:
```bibtex
@misc{luffy,
  title={Learning to Reason under Off-Policy Guidance},
  author={Jianhao Yan and Yafu Li and Zican Hu and Zhi Wang and Ganqu Cui and Xiaoye Qu and Yu Cheng and Yue Zhang},
  year={2025},
  eprint={2504.14945},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.14945},
}
```

ExGRPO:
```bibtex
@article{zhan2025exgrpo,
  title={ExGRPO: Learning to Reason from Experience},
  author={Runzhe Zhan and Yafu Li and Zhi Wang and Xiaoye Qu and Dongrui Liu and Jing Shao and Derek F. Wong and Yu Cheng},
  year={2025},
  journal={ArXiv preprint},
  volume={2510.02245},
  url={https://arxiv.org/abs/2510.02245},
}
```