UFT: Unifying Supervised and Reinforcement Fine-Tuning

Mingyang Liu, Gabriele Farina, Asuman Ozdaglar

📊 Results • 🛠️ Installation

⚙️ Usage • 🌻 Acknowledgement • 📝 Citation

Results

Accuracy of different algorithms averaged over `Qwen2.5-0.5/1.5/3B`

Accuracy of different algorithms on `Qwen2.5-0.5B`

Accuracy of different algorithms on `Qwen2.5-3B`

Installation

conda create -n uft python=3.9
conda activate uft
bash install.sh

Usage

Training

python run.py
  --algo              Algorithm to use: {sft, rft, stage, r3, uft}
  --n_gpu             Number of GPUs
  --visible-devices   GPU indices to use, e.g., "0,1,2,3"
  --T                 Total training steps (default: 500)
  --T_hint            Maximum number of training steps with hints (default: 300)
  --data              Dataset: {countdown, math, kk_logic, others}
  --model             Model name (e.g., Qwen/Qwen2.5-1.5B)
  --tp_size
  --eval              Evaluate the model instead of training
  --idx IDX           Index of the current process (default: 0)
  --sft_loss_coef     Coefficient for the additional log-likelihood term on the hint
  --n_rollout         Number of trajectory rollouts (default: 4)
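To make the relationship between `--T` and `--T_hint` concrete, here is a minimal sketch of how hints could be phased out during training. The function names `hint_fraction` and `uses_hint` are hypothetical, and the linear decay is an assumption chosen purely for illustration; this README does not specify the schedule that `run.py` actually uses.

```python
def hint_fraction(step: int, t_hint: int = 300) -> float:
    """Fraction of the reference solution revealed as a hint at `step`.

    Assumption: linear decay from 1.0 at step 0 to 0.0 at step `t_hint`.
    The real schedule in the repository may differ.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if t_hint <= 0 or step >= t_hint:
        return 0.0  # past T_hint, training proceeds without hints
    return 1.0 - step / t_hint


def uses_hint(step: int, t_hint: int = 300) -> bool:
    """True while the current step still receives a non-empty hint."""
    return hint_fraction(step, t_hint) > 0.0


if __name__ == "__main__":
    # With the defaults (T=500, T_hint=300), the last 200 steps use no hint.
    for step in (0, 150, 299, 300, 499):
        print(step, uses_hint(step), round(hint_fraction(step), 3))
```

Under this sketch, setting `--T_hint` equal to `--T` would keep hints on for the whole run, while `--T_hint 0` would disable them entirely.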

Example

python run.py --model Qwen/Qwen2.5-1.5B --data countdown

Requirement

Qwen2.5-0.5/1.5B and Llama-3.2-1B: 2 H100 GPUs
Qwen2.5-3B and Llama-3.2-3B: 4 H100 GPUs

Qwen2.5-0.5/1.5B and Llama-3.2-1B can be trained on a single H100 by setting --n_rollout 2
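The hardware requirements above can be turned into a small helper that assembles a training command with a matching GPU count. This is only a sketch: `GPUS_BY_MODEL` and `training_command` are hypothetical names, the table simply restates the numbers listed in this section, and the command is built as a string rather than executed.

```python
import shlex

# H100 counts restated from the Requirement section, keyed by short model name.
GPUS_BY_MODEL = {
    "Qwen2.5-0.5B": 2,
    "Qwen2.5-1.5B": 2,
    "Llama-3.2-1B": 2,
    "Qwen2.5-3B": 4,
    "Llama-3.2-3B": 4,
}


def training_command(model: str, data: str, single_gpu: bool = False) -> str:
    """Build a `python run.py ...` command line for the given model and dataset."""
    short_name = model.split("/")[-1]  # drop an optional "org/" prefix
    n_gpu = GPUS_BY_MODEL[short_name]
    args = ["python", "run.py", "--model", model, "--data", data]
    if single_gpu:
        # Only the smaller models are listed as trainable on one H100.
        if n_gpu != 2:
            raise ValueError(f"{short_name} is not listed as trainable on 1 GPU")
        args += ["--n_gpu", "1", "--n_rollout", "2"]
    else:
        args += ["--n_gpu", str(n_gpu)]
    return shlex.join(args)


if __name__ == "__main__":
    print(training_command("Qwen/Qwen2.5-1.5B", "countdown"))
    print(training_command("Qwen/Qwen2.5-1.5B", "countdown", single_gpu=True))
```

Printing the command instead of running it makes it easy to inspect or paste into a job script.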

Major Modifications from VERL

Concatenating hint to the prompt: RLHFDataset
Modifications on the objective function: policy_loss
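The two modifications can be illustrated with a framework-free sketch. Everything here is an assumption made for illustration: the function names, treating the hint as a prefix of the reference solution, and averaging the hint log-probabilities are not taken from the repository's `RLHFDataset` or `policy_loss` code.

```python
from typing import Sequence


def build_prompt_with_hint(question: str, solution_steps: Sequence[str], hint_len: int) -> str:
    """Append the first `hint_len` solution steps to the question.

    Assumption: the hint is a prefix of the reference solution.
    """
    hint = " ".join(solution_steps[: max(hint_len, 0)])
    return f"{question} {hint}" if hint else question


def combined_loss(policy_loss: float,
                  hint_token_logprobs: Sequence[float],
                  sft_loss_coef: float) -> float:
    """Policy loss plus a weighted negative log-likelihood term on the hint.

    Assumption: the extra term is the mean log-probability of the hint tokens,
    subtracted so that minimizing the loss raises the hint likelihood.
    """
    if not hint_token_logprobs:
        return policy_loss  # no hint, so the objective is the plain policy loss
    mean_log_likelihood = sum(hint_token_logprobs) / len(hint_token_logprobs)
    return policy_loss - sft_loss_coef * mean_log_likelihood


if __name__ == "__main__":
    steps = ["3 + 4 = 7", "7 * 2 = 14"]
    print(build_prompt_with_hint("Reach 14 using 2, 3, 4.", steps, hint_len=1))
    print(combined_loss(1.0, [-0.5, -1.5], sft_loss_coef=0.1))
```

With `sft_loss_coef` set to 0, this sketch reduces to the plain policy loss computed on hint-augmented prompts.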

Evaluate

Set {model} and {dataset} to the model name (e.g., Qwen/Qwen2.5-1.5B) and the dataset name (e.g., countdown) to evaluate:

python run.py --model {model} --data {dataset} --eval
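When several model and dataset combinations need to be evaluated, the command above can be generated for each pair. The sketch below only builds the command strings; `eval_commands` is a hypothetical helper, not part of the repository.

```python
import itertools
import shlex
from typing import Iterable, List


def eval_commands(models: Iterable[str], datasets: Iterable[str]) -> List[str]:
    """Return one `python run.py ... --eval` command per (model, dataset) pair."""
    return [
        shlex.join(["python", "run.py", "--model", model, "--data", dataset, "--eval"])
        for model, dataset in itertools.product(models, datasets)
    ]


if __name__ == "__main__":
    models = ["Qwen/Qwen2.5-0.5B", "Qwen/Qwen2.5-1.5B"]
    datasets = ["countdown", "math", "kk_logic"]
    for command in eval_commands(models, datasets):
        print(command)
```

The printed lines can be redirected to a shell script or passed to `subprocess.run` one at a time.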

Acknowledgement

The experiments are based on VERL and TinyZero.
We use the Qwen2.5 and Llama-3.2 series of base models.
We use some of the evaluation code from Dr.GRPO.

Citation

@article{UFT,
author       = {Liu, Mingyang and Farina, Gabriele and Ozdaglar, Asuman},
title        = {UFT: Unifying Supervised and Reinforcement Fine-Tuning},
journal      = {arXiv preprint arXiv:2505.16984},
year         = {2025}
}
