# Source code for *Offline Reinforcement Learning for LLM Multi-Step Reasoning*
## Installation

This repo is based on OpenRLHF, and the installation follows a similar process. We recommend using Docker to set up the environment.
First, build the Docker image:

```bash
cd dockerfile
docker build -t [IMAGE_NAME] .
```

Start a Docker container:

```bash
docker run -itd --ipc host --gpus all [IMAGE_NAME] bash
```

Attach to the container:

```bash
docker exec -it [CONTAINER_ID] /bin/bash
```

Install the current repo:

```bash
cd [PATH_TO_THIS_REPO]
pip install -e .
```
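To quickly verify the installation, you can try importing the package. This is a minimal check that assumes the repo keeps upstream OpenRLHF's `openrlhf` package name; adjust the import if this fork installs under a different name:

```python
# Sanity check: assumes the package name follows upstream OpenRLHF.
import openrlhf

print(openrlhf.__file__)  # should point inside this repo after `pip install -e .`
```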
## Training

As the data collection process involves randomness, we will publish the training data used in our experiments in the near future.

You may need to adjust the following command-line options in the scripts below:

- `--train_file` specifies the path to the training data in OREO experiments.
- `--dataset` specifies the path to the training data in SFT experiments.
- `--save_path` specifies the path where the model will be saved (see the sketch after this list).
- `--pretrain` specifies the path of the pretrained model to load. In OREO experiments, this should be the path to the SFT model.
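After training, the checkpoint written to `--save_path` should be loadable like any Hugging Face model. Here is a minimal sketch, assuming the scripts save in the standard Hugging Face format (as OpenRLHF does); `[SAVE_PATH]` is a placeholder for the value you passed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

save_path = "[SAVE_PATH]"  # the value you passed as --save_path

# Load the saved checkpoint to confirm training produced a usable model.
tokenizer = AutoTokenizer.from_pretrained(save_path)
model = AutoModelForCausalLM.from_pretrained(save_path)
print(model.config.architectures)
```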
### Math Reasoning

#### Supervised fine-tuning

```bash
cd example/scripts
bash train_oreo_sft.sh
```

#### OREO training

```bash
cd example/scripts
bash train_oreo.sh
```

To train the DeepSeekMath-7B-Instruct model:

```bash
cd example/scripts
bash train_oreo_deepseek-math.sh
```

Note that DeepSeekMath-7B-Instruct is already supervised fine-tuned, so there is no SFT phase here.
### ALFWorld

#### Supervised fine-tuning

```bash
cd example/scripts
bash train_oreo_alfworld_sft.sh
```

#### OREO training

```bash
cd example/scripts
bash train_oreo_alfworld.sh
```

Make sure you have `antlr4-python3-runtime==4.11.0` installed (`pip install antlr4-python3-runtime==4.11.0`).
## Evaluation

### Math Reasoning

For Qwen-based models:

```bash
cd example/scripts
python ../scratch/run_qwen.py --model [PATH_TO_YOUR_MODEL] --save [SAVE_GENERATED_RESULTS_JSONL]
```

For DeepSeekMath-based models:

```bash
cd example/scripts
python ../scratch/run_qwen.py --model [PATH_TO_YOUR_MODEL] --no_bos --save [SAVE_GENERATED_RESULTS_JSONL]
```

Note the `--no_bos` option here.
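If you want to inspect the generated results, the saved file is JSONL (one JSON object per line). Below is a minimal sketch for scanning it; the record schema is an assumption for illustration and may differ from what `run_qwen.py` actually writes, so check one record of your own output first:

```python
import json

path = "[SAVE_GENERATED_RESULTS_JSONL]"  # the value you passed as --save

# Read one JSON object per line.
records = []
with open(path) as f:
    for line in f:
        records.append(json.loads(line))

print(f"{len(records)} records")
print(records[0].keys())  # inspect the actual schema before relying on field names

# Hypothetical: if each record had a boolean "correct" field, accuracy would be
#   acc = sum(r["correct"] for r in records) / len(records)
```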
Here is a script that uses the OREO model to solve a specific math problem:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "jwhj/Qwen2.5-Math-1.5B-OREO"

tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model_path)
params = SamplingParams(temperature=0, max_tokens=2048)

message = [
    {
        "role": "system",
        "content": "Please reason step by step, and put your final answer within \\boxed{}.",
    },
    {
        "role": "user",
        "content": "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
    },
]
prompt = tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)

result = llm.generate(prompt, params)
print(result[0].outputs[0].text)
```

The output should be something like the following:
```
First find the total number of eggs Janet has each day: $16$ eggs/day
Then subtract the number of eggs she eats for breakfast: $16-3=13$ eggs/day
Then subtract the number of eggs she bakes for her friends: $13-4=9$ eggs/day
Then multiply the number of eggs she sells by the price per egg to find her daily earnings: $9\cdot2=\boxed{18}$ dollars/day
```
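Since `temperature=0` corresponds to greedy decoding, the generation above is deterministic for a given model, though the exact wording may still vary across vLLM versions and hardware.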
### ALFWorld

This part requires ALFWorld to be installed.

First, start a vLLM server:

```bash
python -m vllm.entrypoints.openai.api_server --model [PATH_TO_YOUR_MODEL]
```

Then run the evaluation with:

```bash
cd example/scripts
python ../scratch/run_alfworld_async.py --model [PATH_TO_YOUR_MODEL] --save_dir [SAVE_GENERATED_TRAJS]
```

You can use `--split eval_in_distribution` to evaluate on seen environments.
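Before launching the full evaluation, it can help to confirm that the server is reachable. Here is a minimal sketch that queries vLLM's OpenAI-compatible endpoint, assuming the default port 8000 and the `openai` Python client; the prompt is an arbitrary placeholder:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default;
# the API key can be any placeholder string unless the server requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="[PATH_TO_YOUR_MODEL]",  # must match the --model passed to the server
    prompt="Hello, world.",
    max_tokens=16,
)
print(resp.choices[0].text)
```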
## Citation

```bibtex
@article{wang2024offline,
    title={Offline Reinforcement Learning for LLM Multi-Step Reasoning},
    author={Wang, Huaijie and Hao, Shibo and Dong, Hanze and Zhang, Shenao and Bao, Yilin and Yang, Ziran and Wu, Yi},
    journal={arXiv preprint arXiv:2412.16145},
    year={2024}
}
```