Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

🎊 News

[2026.05.13] 🚀 We open-source the code and training scripts for OPD.
[2026.05.05] 📖 We release our paper on arXiv.

🚀 Overview

Uni-OPD is a unified On-Policy Distillation (OPD) framework that consolidates the capabilities of specialized expert teachers into a single student model, generalizing across LLMs and MLLMs. We identify two fundamental bottlenecks that limit effective OPD:

- Insufficient exploration of informative student-generated states, and
- Unreliable teacher supervision for student rollouts.

To address them, Uni-OPD introduces a dual-perspective optimization recipe that jointly improves student exploration (via offline difficulty-aware and online correctness-aware data balancing) and teacher reliability (via an outcome-guided margin calibration mechanism). Extensive experiments on 5 domains and 16 benchmarks, covering single-/multi-teacher, strong-to-weak, and cross-modal distillation, verify the effectiveness and versatility of Uni-OPD.

🔑 Key Features

A unified OPD framework across LLMs and MLLMs. Uni-OPD consolidates knowledge from one or several expert teachers into a single student model and works seamlessly across single-teacher, multi-teacher, strong-to-weak, and cross-modal (text + multimodal) distillation settings.

Student-perspective 1: offline difficulty-aware data balancing. We selectively upsample medium-difficulty prompts to reshape the training corpus into a more balanced difficulty distribution while preserving data diversity. This enables the student to generate more informative trajectories and explore a broader solution space.
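
As a rough illustration of the offline balancing idea above, the sketch below repeats medium-difficulty prompts while keeping every other prompt once, so diversity is preserved. It is not the repository's implementation: the function name `upsample_medium`, the use of a per-prompt pass rate as the difficulty signal, the `[0.2, 0.8]` band, and the upsampling factor are all assumptions made for this example.

```python
from collections import Counter


def upsample_medium(prompts, pass_rates, low=0.2, high=0.8, factor=2):
    """Repeat medium-difficulty prompts `factor` times; keep all others once.

    `pass_rates[i]` is an assumed difficulty proxy for `prompts[i]`
    (for example, the fraction of student rollouts that were correct).
    The thresholds and factor are hypothetical, not values from the paper.
    """
    if len(prompts) != len(pass_rates):
        raise ValueError("prompts and pass_rates must have the same length")
    balanced = []
    for prompt, rate in zip(prompts, pass_rates):
        copies = factor if low <= rate <= high else 1
        balanced.extend([prompt] * copies)
    return balanced


prompts = ["p_easy", "p_mid_a", "p_mid_b", "p_hard"]
pass_rates = [0.95, 0.5, 0.3, 0.0]
balanced = upsample_medium(prompts, pass_rates)
print(Counter(balanced))
```

No prompt is dropped in this sketch: easy and hard prompts still appear once, and only the medium band gains weight.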

Student-perspective 2: online correctness-aware data balancing. During training, we dynamically filter and reshape rollout batches to maintain a balanced ratio between correct and incorrect trajectories, preventing the student from collapsing onto trivially correct samples or being overwhelmed by uniformly failed ones.
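
One simple way to picture the online step is to downsample whichever class (correct or incorrect) dominates a rollout batch. The sketch below does this on a flat batch of `(trajectory_id, is_correct)` pairs. The function name `balance_rollouts`, the `max_ratio` parameter, and the choice to return a uniform batch unchanged are assumptions for illustration; the actual filtering rule used in Uni-OPD may differ.

```python
import random


def balance_rollouts(rollouts, max_ratio=1.0, seed=0):
    """Downsample the majority class so that majority/minority <= max_ratio.

    `rollouts` is a list of (trajectory_id, is_correct) pairs.
    If the batch contains only one class there is nothing to balance
    against, so this sketch returns it unchanged (an assumption).
    """
    correct = [r for r in rollouts if r[1]]
    incorrect = [r for r in rollouts if not r[1]]
    if not correct or not incorrect:
        return list(rollouts)
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    minority, majority = sorted((correct, incorrect), key=len)
    keep = min(len(majority), int(len(minority) * max_ratio))
    return minority + rng.sample(majority, keep)


# Six correct and two incorrect trajectories for one batch.
batch = [(i, True) for i in range(6)] + [(i, False) for i in range(6, 8)]
balanced_batch = balance_rollouts(batch)
n_correct = sum(1 for _, ok in balanced_batch if ok)
n_incorrect = len(balanced_batch) - n_correct
print(n_correct, n_incorrect)
```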

Teacher-perspective: outcome-guided margin calibration. We show that reliable token-level teacher supervision largely depends on whether its trajectory-level aggregation remains order-consistent with the outcome reward. Uni-OPD uses the outcome reward as a global anchor to calibrate the teacher's per-token margins, restoring order consistency between correct and incorrect trajectories.
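
To make "order consistency" concrete, the sketch below treats each trajectory as a list of per-token teacher margins, aggregates them by their mean, and checks whether every correct trajectory scores above every incorrect one. The `calibrate` function then shifts margins so that this ordering holds. This is only one possible reading: the mean aggregation, the batch-mean anchor, the constant per-trajectory shift, and the `gap` parameter are assumptions for this example, not the calibration formula from the paper.

```python
def traj_score(margins):
    """Trajectory-level aggregation: mean of per-token margins (assumed)."""
    return sum(margins) / len(margins)


def is_order_consistent(trajs):
    """True if every correct trajectory outscores every incorrect one.

    `trajs` is a list of (per_token_margins, is_correct) pairs.
    """
    correct = [traj_score(m) for m, ok in trajs if ok]
    incorrect = [traj_score(m) for m, ok in trajs if not ok]
    if not correct or not incorrect:
        return True
    return min(correct) > max(incorrect)


def calibrate(trajs, gap=0.1):
    """Shift margins so correct trajectories sit above the anchor and
    incorrect ones below it (hypothetical rule, for illustration only)."""
    scores = [traj_score(m) for m, _ in trajs]
    anchor = sum(scores) / len(scores)  # batch mean used as the anchor
    calibrated = []
    for (margins, ok), score in zip(trajs, scores):
        if ok:
            shift = max(0.0, anchor + gap - score)  # raise only if too low
        else:
            shift = min(0.0, anchor - gap - score)  # lower only if too high
        calibrated.append(([m + shift for m in margins], ok))
    return calibrated


# The incorrect trajectory initially scores higher than the correct one.
trajs = [([0.2, 0.4], False), ([-0.1, 0.1], True)]
print(is_order_consistent(trajs))
print(is_order_consistent(calibrate(trajs)))
```

The outcome label only decides the direction of the shift here, while the relative shape of the per-token margins within each trajectory is left untouched.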

📚 Dataset

Uni-OPD is trained and evaluated on a combination of publicly available datasets:

Text training data (Math + Code). We use the same training data as G-OPD, available at 🤗 Keven16/G-OPD-Training-Data.
- The math part is sourced from the DeepMath dataset.
- The code part is sourced from the code subset of the Eurus-2-RL dataset.
Multimodal training data. We use a mixture of:

💻 Environment Setup

We provide step-by-step instructions for both the training and evaluation environments:

Training environment — see docs/build_env.md. It walks through preparing the conda env (Uni-OPD, Python 3.12), installing required packages, and applying the SGLang & Megatron patches shipped under miles/docker/patch.
Evaluation environment — see docs/build_eval_env.md. It covers two separate conda envs:
- Uni-OPD-LLM-Eval for text evaluation (built on top of G-OPD), and
- Uni-OPD-LMMS-Eval for multimodal evaluation (built on top of lmms-eval).

A typical post-setup layout looks like:

- Uni-OPD/                  # this repository
  - miles/                  # RL / OPD training framework
  - Megatron-LM/            # training backend
  - sglang/                 # inference / rollout backend
  - G-OPD/                  # text-side evaluation (cloned for eval env)
  - lmms-eval/              # multimodal evaluation (cloned for eval env)
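
A quick way to confirm that a checkout matches the layout above is to test for each expected sub-directory. The helper below is a hypothetical convenience script, not something shipped in the repository; the directory names come from the layout listed above, and `G-OPD/` and `lmms-eval/` are only needed for the evaluation environments. The demo runs against a temporary directory so that it is self-contained.

```python
from pathlib import Path
import tempfile

# Sub-directories taken from the post-setup layout described above.
EXPECTED_DIRS = ["miles", "Megatron-LM", "sglang", "G-OPD", "lmms-eval"]


def missing_components(root):
    """Return the expected sub-directories that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Demo: a fake checkout that only contains the training-side components.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("miles", "Megatron-LM", "sglang"):
        (Path(tmp) / name).mkdir()
    demo_missing = missing_components(tmp)

print(demo_missing)
```

In practice you would call `missing_components` with the path to your local Uni-OPD checkout instead of a temporary directory.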

⚙️ Training

All training code in Uni-OPD is built on top of the miles framework. For a summary of the modifications we made to miles, see docs/miles_modifications.md.

We release the full set of training scripts used in the paper under exps/scripts/OPD, grouped by distillation setting:

| Setting | Path | Description |
| --- | --- | --- |
| Single-teacher | `exps/scripts/OPD/single_teacher` | Math / Code distillation with Qwen3-1.7B & Qwen3-4B students. |
| Multi-teacher | `exps/scripts/OPD/multi_teacher` | Joint Math + Code distillation from multiple expert teachers. |
| Strong-to-weak | `exps/scripts/OPD/strong_to_weak` | Distilling a stronger teacher (Qwen3-A3B-Instruct) into a smaller student. |

A minimal launch command looks like:

# Activate the training conda env built via docs/build_env.md
conda activate Uni-OPD

# Example: single-teacher Math distillation, 4B student
bash exps/scripts/OPD/single_teacher/0413/Qwen3_Stu_4B_Math_Uni_OPD.sh \
    --rollout-batch-size 64 \
    --sample-n 16 \
    --lr 1e-6

Before running, please:

- Update the model / data paths at the top of the script (and inside the corresponding YAML under configs/) to point to your local checkpoints and dataset files.
- Launch the teacher server(s) using miles/Uni_OPD_utils/scripts/server/run_sglang_server.sh and put the relevant addresses in miles/Uni_OPD_utils/OPD_reward/teacher_server_list.json.

📈 Evaluation

Evaluation is performed in the dedicated evaluation environments described in docs/build_eval_env.md:

LLM benchmarks (math & code) follow the G-OPD evaluation pipeline.
MLLM benchmarks (ChartQA, InfographicVQA, MathVision, LogicVista, etc.) follow the lmms-eval pipeline.

Please refer to the upstream repositories for the per-benchmark commands.

📝 Citation

If you find our paper / code helpful, please consider citing our work 📝 and starring this repository ⭐️!

@article{hou2026uni,
  title   = {{Uni-OPD}: Unifying On-Policy Distillation with a Dual-Perspective Recipe},
  author  = {Hou, Wenjin and Peng, Shangpin and Wang, Weinong and Ruan, Zheng and Zhang, Yue and Zhou, Zhenglin and Gao, Mingqi and Chen, Yifei and Wang, Kaiqi and Yang, Hongming and Zhang, Chengquan and Tian, Zhuotao and Hu, Han and Yang, Yi and Wu, Fei and Fan, Hehe},
  journal = {arXiv preprint arXiv:2605.03677},
  year    = {2026}
}

🙏 Acknowledgement

G-OPD: an excellent open-source project on on-policy distillation; we reuse its text-side training data and evaluation pipeline.
miles: the powerful RL training framework on top of which we build Uni-OPD.
Megatron-LM and SGLang: the training and rollout backends used throughout this project.
lmms-eval: the multimodal evaluation framework we adopt for MLLM benchmarks.

📧 Contact us

If you have any questions, comments, or suggestions, please feel free to open an issue or PR. Contributions and discussions that help advance research in this area are very welcome!

License

Apache License 2.0
