conda create -n uft python=3.9
conda activate uft
bash install.sh
python run.py
--algo Algorithm to use: {sft, rft, stage, r3, uft}
--n_gpu Number of GPUs
--visible-devices GPU index to use, e.g., "0,1,2,3"
--T Total training steps (default: 500)
--T_hint Maximum training steps with hint (default: 300)
--data Dataset: {countdown, math, kk_logic, others}
--model Model name (e.g., Qwen2.5-1.5B)
--tp_size
--eval Evaluate the model instead of training
--idx Index of the current process (default: 0)
--sft_loss_coef Coefficient for the additional log-likelihood term on hint
--n_rollout Number of trajectory rollouts (default: 4)
python run.py --model Qwen/Qwen2.5-1.5B --data countdown
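When launching several runs, it can help to assemble the command from a dictionary of options. The following is a minimal sketch, not part of this repository: `build_command` is a hypothetical helper, and it assumes that `--eval` is a switch without a value and that every other option takes a single value, as the option list above suggests.

```python
import shlex


def build_command(options, evaluate=False):
    """Build the argument list for `python run.py` from a dict of options.

    Hypothetical helper: keys are flag names without the leading dashes,
    e.g. {"model": "Qwen/Qwen2.5-1.5B", "data": "countdown"}.
    """
    cmd = ["python", "run.py"]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    if evaluate:
        # Assumption: --eval is a switch that takes no value.
        cmd.append("--eval")
    return cmd


cmd = build_command({"model": "Qwen/Qwen2.5-1.5B", "data": "countdown", "algo": "uft"})
# Print a shell-ready string for inspection before launching anything.
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd)` once the printed command looks right.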
Qwen2.5-0.5B/1.5B and Llama-3.2-1B: 2 H100
Qwen2.5-3B and Llama-3.2-3B: 4 H100
Qwen2.5-0.5B/1.5B and Llama-3.2-1B can also be trained on 1 H100 by setting --n_rollout 2
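The hardware notes above can be turned into a small lookup that picks `--n_gpu` and `--n_rollout` for a given machine. This is only a sketch: `launch_options` and both tables are hypothetical and simply restate the GPU counts listed above; nothing like this ships with the repository.

```python
# GPU counts copied from the hardware notes above (H100 GPUs).
RECOMMENDED_GPUS = {
    "Qwen2.5-0.5B": 2,
    "Qwen2.5-1.5B": 2,
    "Llama-3.2-1B": 2,
    "Qwen2.5-3B": 4,
    "Llama-3.2-3B": 4,
}
# Models the notes say also fit on a single H100 with 2 rollouts.
SINGLE_GPU_OK = {"Qwen2.5-0.5B", "Qwen2.5-1.5B", "Llama-3.2-1B"}


def launch_options(model, available_gpus):
    """Return the n_gpu / n_rollout options suggested for `model`."""
    short = model.split("/")[-1]  # drop an organization prefix such as "Qwen/"
    if short not in RECOMMENDED_GPUS:
        raise ValueError(f"no hardware note for model {model!r}")
    needed = RECOMMENDED_GPUS[short]
    if available_gpus >= needed:
        return {"n_gpu": needed}
    if available_gpus >= 1 and short in SINGLE_GPU_OK:
        # Fall back to the single-GPU setting with fewer rollouts.
        return {"n_gpu": 1, "n_rollout": 2}
    raise ValueError(f"{model!r} needs {needed} GPUs, only {available_gpus} available")


print(launch_options("Qwen/Qwen2.5-1.5B", available_gpus=1))
```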
- Concatenating hint to the prompt: RLHFDataset
- Modifications on the objective function: policy_loss
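To make the two modifications more concrete, here is a rough sketch in plain Python. It is not the code in RLHFDataset or policy_loss: the function names, the token-level hint format, and the exact form of the loss are all assumptions for illustration. It assumes the hint is a prefix of a reference solution appended to the prompt, and that the extra term is the mean log-likelihood of the hint tokens, scaled by `sft_loss_coef` and subtracted from the policy loss.

```python
def add_hint(prompt, solution_tokens, hint_len):
    """Append the first `hint_len` solution tokens to the prompt (assumed format)."""
    hint = solution_tokens[:hint_len]
    if not hint:
        return prompt
    return prompt + " " + " ".join(hint)


def uft_style_loss(policy_loss, hint_log_probs, sft_loss_coef):
    """Policy loss minus a weighted mean log-likelihood of the hint tokens.

    Assumed form: raising the hint log-likelihood should lower the loss,
    so the term is subtracted.
    """
    if not hint_log_probs:
        return policy_loss  # no hint, so only the policy term remains
    mean_ll = sum(hint_log_probs) / len(hint_log_probs)
    return policy_loss - sft_loss_coef * mean_ll


prompt = add_hint("Reach 24 using 3, 5, 7.", ["3", "*", "7", "=", "21"], hint_len=3)
loss = uft_style_loss(policy_loss=1.0, hint_log_probs=[-0.5, -1.5], sft_loss_coef=0.1)
print(prompt)
print(loss)
```

With `hint_len=0` the prompt is unchanged and the loss reduces to the plain policy loss, which matches training without hints.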
To evaluate, replace {model} and {dataset} with the model name (e.g., Qwen/Qwen2.5-1.5B) and the dataset name (e.g., countdown):
python run.py --model {model} --data {dataset} --eval
- The experiments are based on VERL and TinyZero.
- We use Qwen2.5 and Llama-3.2 series base models.
- We use some of the evaluation code from Dr.GRPO.
@article{UFT,
author = {Liu, Mingyang and Farina, Gabriele and Ozdaglar, Asuman},
title = {UFT: Unifying Supervised and Reinforcement Fine-Tuning},
journal = {arXiv preprint arXiv:2505.16984},
year = {2025}
}