This is the repository for the paper Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model.
Pre-DPO is a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model.
This repository is built on the popular LLaMA-Factory framework, which makes it easy to fine-tune 100+ large language models.
First, create a new conda environment and activate it.

```bash
conda create -n predpo python=3.10 && conda activate predpo
```

Next, clone the repository and install PyTorch along with the remaining dependencies.
```bash
git clone https://github.com/DtYXs/Pre-DPO.git
cd Pre-DPO
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -e ".[torch,metrics]"
pip install deepspeed==0.15.4
```

For the Base models (Llama3.2-3B-Base and Qwen2.5-7B-Base), we utilize the UltraChat-200k dataset to obtain the SFT models. Subsequently, we perform preference optimization using the UltraFeedback-Binarized dataset.
For the Instruct models (Llama3.2-3B-Instruct and Qwen2.5-7B-Instruct), we follow the pipeline described in SimPO to generate on-policy preference data, using ArmoRM-Llama3-8B-v0.1 as the preference label annotator. The resulting preference datasets are llama3.2-3b-ultrafeedback-armorm-binarized and qwen2.5-7b-ultrafeedback-armorm-binarized, respectively.
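If you want to fetch the public datasets for the base-model recipe yourself, the Hugging Face CLI works; the dataset IDs below are the standard Hugging Face releases (an assumption on our part; verify they match the versions used in the paper), and the local paths are illustrative.

```bash
# Assumed dataset IDs (standard Hugging Face releases) and illustrative paths.
huggingface-cli download HuggingFaceH4/ultrachat_200k --repo-type dataset --local-dir data/ultrachat_200k
huggingface-cli download HuggingFaceH4/ultrafeedback_binarized --repo-type dataset --local-dir data/ultrafeedback_binarized
```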
You can refer to ./data/README.md and register your data in ./data/dataset_info.json.
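As a rough illustration, a custom pairwise preference dataset might be registered with an entry like the one below. The dataset key, file name, and column names here are hypothetical; the authoritative field list is in ./data/README.md, and this sketch follows the usual LLaMA-Factory conventions for ranking data.

```json
"my_preference_data": {
  "file_name": "my_preference_data.json",
  "ranking": true,
  "columns": {
    "prompt": "prompt",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
```

The key (`my_preference_data` here) is what you would pass to `--dataset` in the training scripts below.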
We provide our training scripts and examples in ./scripts. We train the 3B models on 4 × 80G GPUs and the 7B models on 8 × 80G GPUs.
SFT:

```bash
bash scripts/train_sft.sh \
    --model_name_or_path <MODEL_NAME_OR_PATH> \
    --dataset <DATASET_NAME> \
    --output_dir <OUTPUT_DIR> \
    --template <TEMPLATE>
```

DPO:

```bash
bash scripts/train_dpo.sh \
    --sft_model_path <SFT_MODEL_PATH> \
    --dataset <DATASET_NAME> \
    --output_dir <OUTPUT_DIR> \
    --template <TEMPLATE> \
    --pref_beta <BETA_IN_DPO> \
    --bsz <BATCH_SIZE> \
    --gradient_accumulation_steps <GRADIENT_ACCUMULATION_STEPS> \
    --lr <LEARNING_RATE>
```

SimPO:

```bash
bash scripts/train_simpo.sh \
    --sft_model_path <SFT_MODEL_PATH> \
    --dataset <DATASET_NAME> \
    --output_dir <OUTPUT_DIR> \
    --template <TEMPLATE> \
    --pref_beta <BETA_IN_SIMPO> \
    --simpo_gamma <GAMMA_IN_SIMPO> \
    --bsz <BATCH_SIZE> \
    --gradient_accumulation_steps <GRADIENT_ACCUMULATION_STEPS> \
    --lr <LEARNING_RATE>
```

Pre-DPO (the guiding reference model is passed via `--ref_model_path`; see the end-to-end sketch below):

```bash
bash scripts/train_predpo.sh \
    --sft_model_path <SFT_MODEL_PATH> \
    --ref_model_path <REF_MODEL_PATH> \
    --dataset <DATASET_NAME> \
    --output_dir <OUTPUT_DIR> \
    --template <TEMPLATE> \
    --pref_beta <BETA_IN_PREDPO> \
    --bsz <BATCH_SIZE> \
    --gradient_accumulation_steps <GRADIENT_ACCUMULATION_STEPS> \
    --lr <LEARNING_RATE>
```

We conduct evaluations on AlpacaEval 2.0 and Arena-Hard v0.1 following their official repositories.
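As a concrete end-to-end illustration of one natural instantiation of the paradigm, the sketch below first runs standard DPO to obtain a guiding reference model, then re-optimizes the same SFT model against it with Pre-DPO. The checkpoint paths and hyperparameter values are placeholders (not the paper's exact settings), and `qwen` is assumed to be the matching chat template name.

```bash
# Stage 1: standard DPO from the SFT model; its output serves as the
# guiding reference model. Paths and hyperparameters are illustrative.
bash scripts/train_dpo.sh \
    --sft_model_path ./ckpts/qwen2.5-7b-sft \
    --dataset qwen2.5-7b-ultrafeedback-armorm-binarized \
    --output_dir ./ckpts/qwen2.5-7b-dpo \
    --template qwen \
    --pref_beta 0.01 \
    --bsz 128 \
    --gradient_accumulation_steps 16 \
    --lr 5e-7

# Stage 2: re-train the same SFT model with Pre-DPO, reusing the same
# preference data and pointing --ref_model_path at the Stage-1 checkpoint.
bash scripts/train_predpo.sh \
    --sft_model_path ./ckpts/qwen2.5-7b-sft \
    --ref_model_path ./ckpts/qwen2.5-7b-dpo \
    --dataset qwen2.5-7b-ultrafeedback-armorm-binarized \
    --output_dir ./ckpts/qwen2.5-7b-predpo \
    --template qwen \
    --pref_beta 0.01 \
    --bsz 128 \
    --gradient_accumulation_steps 16 \
    --lr 5e-7
```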
We deeply appreciate the outstanding open-source code of LLaMA-Factory and SimPO, which has greatly supported research efforts within the community.
If Pre-DPO is helpful to your work, please cite our paper:
```bibtex
@misc{pan2025predpoimprovingdatautilization,
      title={Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model},
      author={Junshu Pan and Wei Shen and Shulin Huang and Qiji Zhou and Yue Zhang},
      year={2025},
      eprint={2504.15843},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.15843},
}
```

- Email: panjunshu@westlake.edu.cn