Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
Official PyTorch implementation of "Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models" (CVPR 2026).
- [2026.05] We released the training code, inference code, and pretrained LoRA weights.
- [2026.02] Our paper is accepted to CVPR 2026!
We are actively working on releasing all components of LocalDPO. Stay tuned for updates!
- Release Inference Code & Test Data: Open-source the inference scripts and test data.
- Release Pre-trained Checkpoints: Full release of LocalDPO fine-tuned weights for CogVideoX-2B, CogVideoX-5B, and Wan2.2-1.3B.
- Release Corrupted Video Generation Script: Code to synthesize locally corrupted videos for constructing preference pairs.
- Release Training Code: Complete training pipeline for LocalDPO.
Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate the corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to the corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence, and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.
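To make the idea of a region-aware preference loss more concrete, here is a minimal sketch in plain Python. It is not the loss implemented in this repository: it assumes a Diffusion-DPO-style objective in which per-element squared denoising errors are averaged only over the masked (corrupted) positions. The function names `masked_mean` and `region_dpo_loss`, the flat-list inputs, and the `beta` value are all illustrative assumptions.

```python
import math


def masked_mean(values, mask):
    """Average `values` over the positions where `mask` is 1."""
    count = sum(mask)
    if count == 0:
        return 0.0
    return sum(v * m for v, m in zip(values, mask)) / count


def region_dpo_loss(err_win, err_lose, ref_err_win, ref_err_lose, mask, beta=1.0):
    """Diffusion-DPO-style loss restricted to the masked region (assumed form).

    Each err_* argument is a flat list of per-element squared denoising errors,
    from the policy model or the frozen reference model, on the winner or the
    loser video. `mask` is a flat 0/1 list; 1 marks a corrupted element.
    """
    win_gap = masked_mean(err_win, mask) - masked_mean(ref_err_win, mask)
    lose_gap = masked_mean(err_lose, mask) - masked_mean(ref_err_lose, mask)
    logit = -beta * (win_gap - lose_gap)
    # -log(sigmoid(logit)) written as log(1 + exp(-logit))
    return math.log1p(math.exp(-logit))


# Toy example: only the two middle elements were corrupted.
mask = [0, 1, 1, 0]
ref_win = [0.5, 0.5, 0.5, 0.5]
ref_lose = [0.5, 0.5, 0.5, 0.5]

# The policy fits the winner better and the loser worse inside the mask.
loss = region_dpo_loss([0.5, 0.2, 0.2, 0.5], [0.5, 0.8, 0.8, 0.5],
                       ref_win, ref_lose, mask)
print(loss, loss < math.log(2))
```

When the policy matches the reference, the loss equals log 2. It drops below that value once the policy prefers the winner inside the mask, and errors outside the mask have no effect on it.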
- Python >= 3.10
- PyTorch == 2.2.0
- Clone the repository:

      git clone https://github.com/1170300714/Local-DPO.git
      cd Local-DPO

- Create a conda environment:

      conda create -n localdpo python=3.10
      conda activate localdpo

- Install dependencies:

      pip install -r requirements.txt
Follow these steps to run the inference code:
- Download the base weights and pretrained checkpoints of CogVideoX-2B, CogVideoX-5B, and Wan2.2-1.3B from here

- Then, run inference on the prepared test prompts with the pretrained checkpoints:

      # Positional arguments, in order: number of GPUs, output directory, number of frames per video,
      # height, width, number of videos per prompt, prompt file, base model path, tuned (LoRA) model path
      bash local_launch.sh test_base \
          1 \
          OUTPUT_DIR \
          49 \
          720 \
          1280 \
          1 \
          demo_data/prompt.json \
          BASE_MODEL_PATH \
          TUNED_MODEL_PATH

- You can also run inference on your own prompts by replacing demo_data/prompt.json with your own file. The prompt file must be a JSON list in the following format:

      [
        {"long": "PROMPT1"},
        {"long": "PROMPT2"},
        ...
      ]
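A custom prompt file in this layout can be produced and sanity-checked with a few lines of Python. The sketch below is not part of the repository: the helper names `write_prompt_file` and `check_prompt_file` and the example prompts are hypothetical, and the only thing taken from the format above is the JSON list of objects with a `"long"` key.

```python
import json
import tempfile
from pathlib import Path


def write_prompt_file(prompts, path):
    """Write non-empty prompts as a JSON list of {"long": ...} objects."""
    records = [{"long": p.strip()} for p in prompts if p.strip()]
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return records


def check_prompt_file(path):
    """Return the prompts if the file has the expected layout, else raise ValueError."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError("prompt file must be a JSON list")
    for i, item in enumerate(data):
        if not isinstance(item, dict) or not isinstance(item.get("long"), str):
            raise ValueError(f"entry {i} needs a string 'long' field")
    return [item["long"] for item in data]


# Demo in a temporary directory; the prompts are made-up examples.
with tempfile.TemporaryDirectory() as tmp:
    prompt_path = Path(tmp) / "my_prompts.json"
    write_prompt_file(
        ["A red fox walks through fresh snow.", "  ", "Waves break on a rocky shore."],
        prompt_path,
    )
    prompts = check_prompt_file(prompt_path)
    print(prompts)
```

The path of the resulting file would then be passed in place of demo_data/prompt.json.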
Follow these steps to generate locally corrupted videos and train Local DPO with your own data:
- Prepare your own real videos and their descriptions, which are used to generate the corrupted data. The metadata should be a JSONL file with one record per line, in the following format:

      {"video_path": "PATH_TO_VIDEO1", "description": "DESCRIPTION1 (CAPTION)", "vid": "VIDEO_ID1"}
      {"video_path": "PATH_TO_VIDEO2", "description": "DESCRIPTION2 (CAPTION)", "vid": "VIDEO_ID2"}
      ...

- Then, generate corrupted videos from the real videos with the base model:
The resized real videos, generated videos, and random 3D masks will be saved in OUTPUT_DIR. The filename prefix of each corrupted video and 3D mask is the same as the real video's name.
      # Positional arguments, in order: number of GPUs, output directory, number of frames per video,
      # height, width, real-video metadata file, base model path
      bash local_launch.sh generate_corrupted_video \
          1 \
          OUTPUT_DIR \
          49 \
          720 \
          1280 \
          REAL_VIDEO_META_DATA \
          BASE_MODEL_PATH
- Create metadata for the Local DPO training data. The metadata should be a JSONL file. Each record has the following fields (shown here across multiple lines for readability):

      {
        "height_win": "The height of the winner sample (int)",
        "width_win": "The width of the winner sample (int)",
        "height_lose": "The height of the loser sample (int)",
        "width_lose": "The width of the loser sample (int)",
        "fps_win": "The fps of the winner sample (int)",
        "fps_lose": "The fps of the loser sample (int)",
        "duration_win": "The total seconds of the winner sample (float)",
        "duration_lose": "The total seconds of the loser sample (float)",
        "pos_num_frames": "The number of frames in the winner sample (int)",
        "neg_num_frames": "The number of frames in the loser sample (int)",
        "pos_video_path": "The path of the winner sample (str)",
        "neg_video_path": "The path of the loser sample (str)",
        "mask": "The path of the generated 3D mask of the winner sample (str)",
        "yita": "The strength of the inversion noise (float)",
        "gen_caption": "The video description (str)"
      }
      {...}
      {...}
      ...

- Train the model:
      # Positional arguments, in order: number of GPUs, output directory, training metadata file, base model path
      bash local_launch.sh train_base \
          1 \
          OUTPUT_DIR \
          META_DATA_PATH \
          BASE_MODEL_PATH
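Both metadata files used in the steps above are JSONL, so they can be written with a small helper. The sketch below is only an illustration: the helper names, file names, caption, `fps`, and `yita` value are hypothetical, and it assumes the winner and loser share the same resolution, frame rate, and frame count, which may not hold for your data.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line, with no separators between records."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def real_video_record(video_path, description, vid):
    """Record for the real-video metadata file."""
    return {"video_path": str(video_path), "description": description, "vid": vid}


def pair_record(win_path, lose_path, mask_path, caption,
                height, width, fps, num_frames, yita):
    """Training record; assumes winner and loser share size, fps, and length."""
    duration = num_frames / fps
    return {
        "height_win": height, "width_win": width,
        "height_lose": height, "width_lose": width,
        "fps_win": fps, "fps_lose": fps,
        "duration_win": duration, "duration_lose": duration,
        "pos_num_frames": num_frames, "neg_num_frames": num_frames,
        "pos_video_path": str(win_path), "neg_video_path": str(lose_path),
        "mask": str(mask_path), "yita": yita, "gen_caption": caption,
    }


# Demo in a temporary directory; every path and value below is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    real_meta = Path(tmp) / "real_videos.jsonl"
    pair_meta = Path(tmp) / "train_pairs.jsonl"
    caption = "A dog runs along a beach at sunset."
    write_jsonl([real_video_record("videos/clip_0001.mp4", caption, "clip_0001")],
                real_meta)
    write_jsonl([pair_record("out/clip_0001_real.mp4", "out/clip_0001_gen.mp4",
                             "out/clip_0001_mask.npy", caption,
                             height=720, width=1280, fps=8, num_frames=49, yita=0.5)],
                pair_meta)
    real_rows = read_jsonl(real_meta)
    pair_rows = read_jsonl(pair_meta)
    print(real_rows[0]["vid"], pair_rows[0]["duration_win"])
```

The actual file names of the generated videos and masks should be read from your OUTPUT_DIR rather than taken from this example.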
If our work inspires your research or parts of the code are useful for your work, please cite our paper:
@article{huang2026mind,
title={Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models},
author={Huang, Zitong and Zhang, Kaidong and Ding, Yukang and Gao, Chao and Ding, Rui and Chen, Ying and Zuo, Wangmeng},
journal={arXiv preprint arXiv:2601.04068},
year={2026}
}

If you have any questions, please contact us via
This codebase builds upon several excellent open-source projects: