WenjinHou/Uni-OPD

Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

中文 | English

🎊 News

  • [2026.05.13] 🚀 We open-source the code and training scripts for OPD.
  • [2026.05.05] 📖 We release our paper on arXiv.

🚀 Overview

Uni-OPD is a unified On-Policy Distillation (OPD) framework that consolidates the capabilities of specialized expert teachers into a single student model, generalizing across LLMs and MLLMs. We identify two fundamental bottlenecks that limit effective OPD:

  1. Insufficient exploration of informative student-generated states, and
  2. Unreliable teacher supervision for student rollouts.

To address them, Uni-OPD introduces a dual-perspective optimization recipe that jointly improves student exploration (via offline difficulty-aware and online correctness-aware data balancing) and teacher reliability (via an outcome-guided margin calibration mechanism). Extensive experiments on 5 domains and 16 benchmarks, covering single-/multi-teacher, strong-to-weak, and cross-modal distillation, verify the effectiveness and versatility of Uni-OPD.

🔑 Key Features

  • A unified OPD framework across LLMs and MLLMs. Uni-OPD consolidates knowledge from one or several expert teachers into a single student model and works seamlessly across single-teacher, multi-teacher, strong-to-weak, and cross-modal (text + multimodal) distillation settings.

  • Student-perspective 1: offline difficulty-aware data balancing. We selectively upsample medium-difficulty prompts to reshape the training corpus into a more balanced difficulty distribution while preserving data diversity. This enables the student to generate more informative trajectories and explore a broader solution space.

  • Student-perspective 2: online correctness-aware data balancing. During training, we dynamically filter and reshape rollout batches to maintain a balanced ratio between correct and incorrect trajectories, preventing the student from collapsing onto trivially correct samples or being overwhelmed by uniformly failed ones.

  • Teacher-perspective: outcome-guided margin calibration. We show that reliable token-level teacher supervision largely depends on whether its trajectory-level aggregation remains order-consistent with the outcome reward. Uni-OPD uses the outcome reward as a global anchor to calibrate the teacher's per-token margins, restoring order consistency between correct and incorrect trajectories.
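The three ideas above can be illustrated with a minimal Python sketch. It is not the repository's implementation: the function names, the pass-rate thresholds (`low`, `high`), the upsampling `factor`, the 1:1 correct/incorrect target, and the use of a mean over per-token margins as the trajectory-level aggregation are all assumptions made for illustration. The last function only diagnoses order consistency; it does not reproduce the calibration mechanism itself.

```python
import random


def upsample_medium(prompts, pass_rates, low=0.2, high=0.8, factor=2):
    """Offline balancing sketch: repeat medium-difficulty prompts.

    A prompt counts as "medium" when its pass rate lies in [low, high]
    (hypothetical thresholds). Every prompt is kept at least once, so
    the diversity of the corpus is preserved.
    """
    balanced = []
    for prompt, rate in zip(prompts, pass_rates):
        copies = factor if low <= rate <= high else 1
        balanced.extend([prompt] * copies)
    return balanced


def balance_rollouts(rollouts, rng=None):
    """Online balancing sketch: equalize correct and incorrect rollouts.

    `rollouts` is a list of (trajectory, is_correct) pairs. The majority
    class is downsampled to the size of the minority class. A batch that
    is uniformly correct or uniformly incorrect is dropped entirely
    (an assumption of this sketch).
    """
    rng = rng or random.Random(0)
    correct = [r for r in rollouts if r[1]]
    incorrect = [r for r in rollouts if not r[1]]
    k = min(len(correct), len(incorrect))
    if k == 0:
        return []
    return rng.sample(correct, k) + rng.sample(incorrect, k)


def order_consistent(token_margins, rewards):
    """Diagnostic sketch: do aggregated teacher margins agree with outcomes?

    `token_margins[i]` is a non-empty list of per-token margins for
    trajectory i, and `rewards[i]` is its outcome reward (> 0 means
    correct). Returns True when every correct trajectory has a higher
    mean margin than every incorrect one.
    """
    means = [sum(m) / len(m) for m in token_margins]
    good = [s for s, r in zip(means, rewards) if r > 0]
    bad = [s for s, r in zip(means, rewards) if r <= 0]
    if not good or not bad:
        return True  # nothing to compare
    return min(good) > max(bad)
```

For example, `upsample_medium(["easy", "medium", "hard"], [0.95, 0.5, 0.05])` returns `['easy', 'medium', 'medium', 'hard']`, and `order_consistent([[0.1], [0.4]], [1, 0])` returns `False` because the incorrect trajectory receives the larger aggregated margin.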

📚 Dataset

The training and evaluation data used in Uni-OPD are a combination of publicly available resources.

💻 Environment Setup

We provide step-by-step instructions for both the training and evaluation environments:

  • Training environment: see docs/build_env.md. It walks through preparing the conda env (Uni-OPD, Python 3.12), installing required packages, and applying the SGLang & Megatron patches shipped under miles/docker/patch.
  • Evaluation environment: see docs/build_eval_env.md. It covers two separate conda envs:
    • Uni-OPD-LLM-Eval for text evaluation (built on top of G-OPD), and
    • Uni-OPD-LMMS-Eval for multimodal evaluation (built on top of lmms-eval).

A typical post-setup layout looks like:

- Uni-OPD/                  # this repository
  - miles/                  # RL / OPD training framework
  - Megatron-LM/            # training backend
  - sglang/                 # inference / rollout backend
  - G-OPD/                  # text-side evaluation (cloned for eval env)
  - lmms-eval/              # multimodal evaluation (cloned for eval env)

⚙️ Training

All training code in Uni-OPD is built on top of the miles framework. For a summary of the modifications we made to miles, see docs/miles_modifications.md.

We release the full set of training scripts used in the paper under exps/scripts/OPD, grouped by distillation setting:

| Setting | Path | Description |
| --- | --- | --- |
| Single-teacher | exps/scripts/OPD/single_teacher | Math / Code distillation with Qwen3-1.7B & Qwen3-4B students. |
| Multi-teacher | exps/scripts/OPD/multi_teacher | Joint Math + Code distillation from multiple expert teachers. |
| Strong-to-weak | exps/scripts/OPD/strong_to_weak | Distilling a stronger teacher (Qwen3-A3B-Instruct) into a smaller student. |

A minimal launch command looks like:

# Activate the training conda env built via docs/build_env.md
conda activate Uni-OPD

# Example: single-teacher Math distillation, 4B student
bash exps/scripts/OPD/single_teacher/0413/Qwen3_Stu_4B_Math_Uni_OPD.sh \
    --rollout-batch-size 64 \
    --sample-n 16 \
    --lr 1e-6

Before running:

  1. Update the model / data paths at the top of the script (and inside the corresponding YAML under configs/) to point to your local checkpoints and dataset files.
  2. Launch the teacher server(s) using miles/Uni_OPD_utils/scripts/server/run_sglang_server.sh and put the relevant addresses in miles/Uni_OPD_utils/OPD_reward/teacher_server_list.json.
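A small pre-flight check can catch a missing or malformed server list before a long training job starts. The sketch below is a hypothetical helper, not part of the repository: `check_teacher_server_list` is an illustrative name, and because the expected schema of teacher_server_list.json is not described here, the function only verifies that the file exists, parses as JSON, and is non-empty.

```python
import json
from pathlib import Path


def check_teacher_server_list(path):
    """Return (ok, message) for a teacher server list file.

    Only checks existence, JSON validity, and non-emptiness. The schema
    of the file is not validated because it is not documented here.
    """
    file = Path(path)
    if not file.is_file():
        return False, f"not found: {file}"
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        return False, f"invalid JSON: {err}"
    if not data:
        return False, "file parses but contains no entries"
    return True, "ok"


if __name__ == "__main__":
    # Path taken from the step above; run from the repository root.
    print(check_teacher_server_list(
        "miles/Uni_OPD_utils/OPD_reward/teacher_server_list.json"))
```

Running this from the repository root before the launch command gives a quick signal that step 2 was completed; it does not confirm that the listed servers are reachable.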

📈 Evaluation

Evaluation is performed in the dedicated evaluation environments described in docs/build_eval_env.md:

  • LLM benchmarks (math & code) follow the G-OPD evaluation pipeline.
  • MLLM benchmarks (ChartQA, InfographicVQA, MathVision, LogicVista, etc.) follow the lmms-eval pipeline.

Please refer to the upstream repositories for the per-benchmark commands.

📝 Citation

If you find our paper / code helpful, please consider citing our work 📝 and starring this repository ⭐️!

@article{hou2026uni,
  title   = {{Uni-OPD}: Unifying On-Policy Distillation with a Dual-Perspective Recipe},
  author  = {Hou, Wenjin and Peng, Shangpin and Wang, Weinong and Ruan, Zheng and Zhang, Yue and Zhou, Zhenglin and Gao, Mingqi and Chen, Yifei and Wang, Kaiqi and Yang, Hongming and Zhang, Chengquan and Tian, Zhuotao and Hu, Han and Yang, Yi and Wu, Fei and Fan, Hehe},
  journal = {arXiv preprint arXiv:2605.03677},
  year    = {2026}
}

🙏 Acknowledgement

  • G-OPD: an excellent open-source project on on-policy distillation; we reuse its text-side training data and evaluation pipeline.
  • miles: the powerful RL training framework on top of which we build Uni-OPD.
  • Megatron-LM and SGLang: the training and rollout backends used throughout this project.
  • lmms-eval: the multimodal evaluation framework we adopt for MLLM benchmarks.

📧 Contact us

If you have any questions, comments, or suggestions, please feel free to open an issue or PR. Contributions and discussions that help advance research in this area are very welcome!

License

Apache License 2.0
