This is the repository for the paper Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model.
Pre-DPO is a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model.
This repository is built on the popular LLaMA-Factory framework, which makes it easy to fine-tune 100+ large language models.
First, create a new conda environment and activate it.

```bash
conda create -n predpo python=3.10 && conda activate predpo
```

Next, clone the repository and install PyTorch along with the remaining dependencies.
```bash
git clone https://github.com/DtYXs/Pre-DPO.git
cd Pre-DPO
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -e ".[torch,metrics]"
pip install deepspeed==0.15.4
```

For the Base models (Llama3.2-3B-Base and Qwen2.5-7B-Base), we utilize the UltraChat-200k dataset to obtain the SFT models. Subsequently, we perform preference optimization using the UltraFeedback-Binarized dataset.
For the Instruct models (Llama3.2-3B-Instruct and Qwen2.5-7B-Instruct), we follow the pipeline described in SimPO to generate on-policy preference data, using ArmoRM-Llama3-8B-v0.1 as the preference label annotator. The resulting preference datasets are llama3.2-3b-ultrafeedback-armorm-binarized and qwen2.5-7b-ultrafeedback-armorm-binarized, respectively.
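If you want to fetch the public datasets for the base-model recipe yourself, the Hugging Face CLI works; the dataset IDs below are the standard Hugging Face releases (an assumption on our part; verify they match the versions used in the paper), and the local paths are illustrative.

```bash
# Assumed dataset IDs (standard Hugging Face releases) and illustrative paths.
huggingface-cli download HuggingFaceH4/ultrachat_200k --repo-type dataset --local-dir data/ultrachat_200k
huggingface-cli download HuggingFaceH4/ultrafeedback_binarized --repo-type dataset --local-dir data/ultrafeedback_binarized
```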
You can refer to ./data/README.md and register your data in ./data/dataset_info.json.
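As a rough illustration, a custom pairwise preference dataset might be registered with an entry like the one below. The dataset key, file name, and column names here are hypothetical; the authoritative field list is in ./data/README.md, and this sketch follows the usual LLaMA-Factory conventions for ranking data.

```json
"my_preference_data": {
  "file_name": "my_preference_data.json",
  "ranking": true,
  "columns": {
    "prompt": "prompt",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
```

The key (`my_preference_data` here) is what you would pass to `--dataset` in the training scripts below.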
We provide our training scripts and examples in ./scripts. We train the 3B models on 4 × 80G GPUs and the 7B models on 8 × 80G GPUs.
SFT:

```bash
bash scripts/train_sft.sh \
    --model_name_or_path <MODEL_NAME_OR_PATH> \
    --dataset <DATASET_NAME> \
    --output_dir <OUTPUT_DIR> \
    --template <TEMPLATE>
```

DPO:

```bash
bash scripts/train_dpo.sh \
    --sft_model_path <SFT_MODEL_PATH> \
    --dataset <DATASET_NAME> \
    --output_dir <OUTPUT_DIR> \
    --template <TEMPLATE> \
    --pref_beta <BETA_IN_DPO> \
    --bsz <BATCH_SIZE> \
    --gradient_accumulation_steps <GRADIENT_ACCUMULATION_STEPS> \
    --lr <LEARNING_RATE>
```

SimPO:

```bash
bash scripts/train_simpo.sh \
    --sft_model_path <SFT_MODEL_PATH> \
    --dataset <DATASET_NAME> \
    --output_dir <OUTPUT_DIR> \
    --template <TEMPLATE> \
    --pref_beta <BETA_IN_SIMPO> \
    --simpo_gamma <GAMMA_IN_SIMPO> \
    --bsz <BATCH_SIZE> \
    --gradient_accumulation_steps <GRADIENT_ACCUMULATION_STEPS> \
    --lr <LEARNING_RATE>
```

Pre-DPO (the guiding reference model is passed via `--ref_model_path`; see the end-to-end sketch below):

```bash
bash scripts/train_predpo.sh \
    --sft_model_path <SFT_MODEL_PATH> \
    --ref_model_path <REF_MODEL_PATH> \
    --dataset <DATASET_NAME> \
    --output_dir <OUTPUT_DIR> \
    --template <TEMPLATE> \
    --pref_beta <BETA_IN_PREDPO> \
    --bsz <BATCH_SIZE> \
    --gradient_accumulation_steps <GRADIENT_ACCUMULATION_STEPS> \
    --lr <LEARNING_RATE>
```

We conduct evaluations on AlpacaEval 2.0 and Arena-Hard v0.1 following their official repositories.
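As a concrete end-to-end illustration of one natural instantiation of the paradigm, the sketch below first runs standard DPO to obtain a guiding reference model, then re-optimizes the same SFT model against it with Pre-DPO. The checkpoint paths and hyperparameter values are placeholders (not the paper's exact settings), and `qwen` is assumed to be the matching chat template name.

```bash
# Stage 1: standard DPO from the SFT model; its output serves as the
# guiding reference model. Paths and hyperparameters are illustrative.
bash scripts/train_dpo.sh \
    --sft_model_path ./ckpts/qwen2.5-7b-sft \
    --dataset qwen2.5-7b-ultrafeedback-armorm-binarized \
    --output_dir ./ckpts/qwen2.5-7b-dpo \
    --template qwen \
    --pref_beta 0.01 \
    --bsz 128 \
    --gradient_accumulation_steps 16 \
    --lr 5e-7

# Stage 2: re-train the same SFT model with Pre-DPO, reusing the same
# preference data and pointing --ref_model_path at the Stage-1 checkpoint.
bash scripts/train_predpo.sh \
    --sft_model_path ./ckpts/qwen2.5-7b-sft \
    --ref_model_path ./ckpts/qwen2.5-7b-dpo \
    --dataset qwen2.5-7b-ultrafeedback-armorm-binarized \
    --output_dir ./ckpts/qwen2.5-7b-predpo \
    --template qwen \
    --pref_beta 0.01 \
    --bsz 128 \
    --gradient_accumulation_steps 16 \
    --lr 5e-7
```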
We deeply appreciate the outstanding open-source code of LLaMA-Factory and SimPO, which has greatly supported research efforts within the community.
If Pre-DPO is helpful to your work, please cite our paper:
```bibtex
@misc{pan2025predpoimprovingdatautilization,
      title={Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model},
      author={Junshu Pan and Wei Shen and Shulin Huang and Qiji Zhou and Yue Zhang},
      year={2025},
      eprint={2504.15843},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.15843},
}
```

- Email: panjunshu@westlake.edu.cn