- [2026.05.13] We open-source the code and training scripts for OPD.
- [2026.05.05] We release our paper on arXiv.
Uni-OPD is a unified On-Policy Distillation (OPD) framework that consolidates the capabilities of specialized expert teachers into a single student model, generalizing across LLMs and MLLMs. We identify two fundamental bottlenecks that limit effective OPD:
- Insufficient exploration of informative student-generated states, and
- Unreliable teacher supervision for student rollouts.
To address them, Uni-OPD introduces a dual-perspective optimization recipe that jointly improves student exploration (via offline difficulty-aware and online correctness-aware data balancing) and teacher reliability (via an outcome-guided margin calibration mechanism). Extensive experiments on 5 domains and 16 benchmarks, covering single-/multi-teacher, strong-to-weak, and cross-modal distillation, verify the effectiveness and versatility of Uni-OPD.
- A unified OPD framework across LLMs and MLLMs. Uni-OPD consolidates knowledge from one or several expert teachers into a single student model and works seamlessly across single-teacher, multi-teacher, strong-to-weak, and cross-modal (text + multimodal) distillation settings.
- Student-perspective 1: offline difficulty-aware data balancing. We selectively upsample medium-difficulty prompts to reshape the training corpus into a more balanced difficulty distribution while preserving data diversity. This enables the student to generate more informative trajectories and explore a broader solution space.
- Student-perspective 2: online correctness-aware data balancing. During training, we dynamically filter and reshape rollout batches to maintain a balanced ratio between correct and incorrect trajectories, preventing the student from collapsing onto trivially correct samples or being overwhelmed by uniformly failed ones.
- Teacher-perspective: outcome-guided margin calibration. We show that reliable token-level teacher supervision largely depends on whether its trajectory-level aggregation remains order-consistent with the outcome reward. Uni-OPD uses the outcome reward as a global anchor to calibrate the teacher's per-token margins, restoring order consistency between correct and incorrect trajectories.
The dataset we use for training and evaluation in Uni-OPD is a combination of publicly available resources:
- Text training data (Math + Code). We use the same training data as G-OPD, available at 🤗 Keven16/G-OPD-Training-Data.
  - The math part is sourced from the DeepMath dataset.
  - The code part is sourced from the code subset of the Eurus-2-RL dataset.
- Multimodal training data. We use a mixture of:
We provide step-by-step instructions for both the training and evaluation environments:
- Training environment: see docs/build_env.md. It walks through preparing the conda env (Uni-OPD, Python 3.12), installing required packages, and applying the SGLang & Megatron patches shipped under miles/docker/patch.
- Evaluation environment: see docs/build_eval_env.md. It covers two separate conda envs, one for LLM (text) evaluation and one for MLLM evaluation.
A typical post-setup layout looks like:
- Uni-OPD/ # this repository
- miles/ # RL / OPD training framework
- Megatron-LM/ # training backend
- sglang/ # inference / rollout backend
- G-OPD/ # text-side evaluation (cloned for eval env)
- lmms-eval/ # multimodal evaluation (cloned for eval env)
All training and implementation in Uni-OPD is built on top of the miles framework. For a summary of the modifications we made to miles, see docs/miles_modifications.md.
We release the full set of training scripts used in the paper under exps/scripts/OPD, grouped by distillation setting:
| Setting | Path | Description |
|---|---|---|
| Single-teacher | exps/scripts/OPD/single_teacher | Math / Code distillation with Qwen3-1.7B & Qwen3-4B students. |
| Multi-teacher | exps/scripts/OPD/multi_teacher | Joint Math + Code distillation from multiple expert teachers. |
| Strong-to-weak | exps/scripts/OPD/strong_to_weak | Distilling a stronger teacher (Qwen3-A3B-Instruct) into a smaller student. |
A minimal launch command looks like:

    # Activate the training conda env built via docs/build_env.md
    conda activate Uni-OPD
    # Example: single-teacher Math distillation, 4B student
    bash exps/scripts/OPD/single_teacher/0413/Qwen3_Stu_4B_Math_Uni_OPD.sh \
        --rollout-batch-size 64 \
        --sample-n 16 \
        --lr 1e-6

Before running, please:
- Update the model / data paths at the top of the script (and inside the corresponding YAML under configs/) to point to your local checkpoints and dataset files.
- Launch the teacher server(s) using miles/Uni_OPD_utils/scripts/server/run_sglang_server.sh and put the relevant addresses in miles/Uni_OPD_utils/OPD_reward/teacher_server_list.json.
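The pre-launch steps above are easy to get wrong on a new machine, so a small pre-flight check can help. The sketch below is a hypothetical helper, not part of the repository. It only verifies that the paths you pass in exist and that the teacher server list is valid, non-empty JSON; it makes no assumption about the schema of teacher_server_list.json, and the checkpoint and data paths in the usage example are placeholders.

```python
import json
from pathlib import Path


def preflight(paths, server_list_file):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = [f"missing path: {p}" for p in paths if not Path(p).exists()]

    server_file = Path(server_list_file)
    if not server_file.is_file():
        problems.append(f"missing teacher server list: {server_file}")
        return problems

    try:
        servers = json.loads(server_file.read_text())
    except json.JSONDecodeError as exc:
        problems.append(f"invalid JSON in {server_file}: {exc}")
        return problems

    # Schema-agnostic check: an empty list, dict, or null means no teachers.
    if not servers:
        problems.append(f"{server_file} contains no teacher entries")
    return problems


if __name__ == "__main__":
    issues = preflight(
        paths=["/path/to/student_checkpoint", "/path/to/train_data"],  # placeholders
        server_list_file="miles/Uni_OPD_utils/OPD_reward/teacher_server_list.json",
    )
    print("\n".join(issues) or "all checks passed")
```

Running it from the repository root before launching a training script reports every problem at once instead of failing on the first missing file.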
Evaluation is performed in the dedicated evaluation environments described in docs/build_eval_env.md:
- LLM benchmarks (math & code) follow the G-OPD evaluation pipeline.
- MLLM benchmarks (ChartQA, InfographicVQA, MathVision, LogicVista, etc.) follow the lmms-eval pipeline.
Please refer to the upstream repositories for the per-benchmark commands.
If you find our paper / code helpful, please consider citing our work and starring this repository!

    @article{hou2026uni,
      title = {{Uni-OPD}: Unifying On-Policy Distillation with a Dual-Perspective Recipe},
      author = {Hou, Wenjin and Peng, Shangpin and Wang, Weinong and Ruan, Zheng and Zhang, Yue and Zhou, Zhenglin and Gao, Mingqi and Chen, Yifei and Wang, Kaiqi and Yang, Hongming and Zhang, Chengquan and Tian, Zhuotao and Hu, Han and Yang, Yi and Wu, Fei and Fan, Hehe},
      journal = {arXiv preprint arXiv:2605.03677},
      year = {2026}
    }

- G-OPD: an excellent open-source project on on-policy distillation; we reuse its text-side training data and evaluation pipeline.
- miles: the powerful RL training framework on top of which we build Uni-OPD.
- Megatron-LM and SGLang: the training and rollout backends used throughout this project.
- lmms-eval: the multimodal evaluation framework we adopt for MLLM benchmarks.
If you have any questions, comments, or suggestions, please feel free to open an issue or PR. Contributions and discussions that help advance research in this area are very welcome!