TL;DR — High-quality video generation in just five denoising steps.
See the results in the Combined Techniques section.
This project focuses on the distillation and post-training of the Wan2.1 model, aiming to improve its efficiency and performance. All data-processing, training, and inference code, along with the model weights, is open-sourced.
- References
- Data Preprocessing
- Distillation Techniques
- Reinforcement Learning
- Inference
- Combined Techniques
- Model Weights
- Environment Setup
This work is based on:
This work was done at Zulution AI.
- Assuming your video data has been uniformly resized, run `scripts/data_preprocess/preprocess.sh` to preprocess the features required for model input.
- An example directory structure can be found under `./data`.
The following distillation techniques are utilized in this project:
- Description: Eliminates the need for unconditional generation by fusing guidance information with temporal information using sinusoidal positional encoding and a Multi-Layer Perceptron (MLP); see the sketch after the comparative results below.
- Performance: Achieves 2× acceleration with minimal performance degradation.
- Training: Run the script `scripts/train/sh/distill_cfg_i2v.sh`.
- Comparative Results:
| CFG-distilled | Original |
| --- | --- |
| cfg_distill_26_SuperVi_step40_shift3_guide5.mp4 | org_26_SuperVi_step40_shift3_guide5.mp4 |
| cfg_distill_38_The.wom_step40_shift5_guide5.mp4 | org_The.wom_step40_shift5_guide5.mp4 |
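
As a rough illustration of the idea (a minimal sketch, not the repo's actual module; class names, layer sizes, and the fusion point are assumptions), the guidance scale can be embedded like a timestep and added to the temporal embedding, so a single conditional forward pass replaces the usual cond/uncond pair:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x: torch.Tensor, dim: int) -> torch.Tensor:
    # Standard sinusoidal encoding of one scalar per sample -> (B, dim).
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class GuidanceFusion(nn.Module):
    """Embed the CFG scale with sinusoidal encoding + an MLP and fuse it into
    the temporal (timestep) embedding. Dimensions here are illustrative."""

    def __init__(self, enc_dim: int = 256, emb_dim: int = 1024):
        super().__init__()
        self.enc_dim = enc_dim
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim)
        )

    def forward(self, t_emb: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        # t_emb: (B, emb_dim) timestep embedding; guidance: (B,) CFG scales.
        return t_emb + self.mlp(sinusoidal_embedding(guidance, self.enc_dim))
```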
- Training: Run the script `scripts/train/sh/distill_step.sh`.
- Mode: Defaults to text-to-video (t2v). For image-to-video (i2v) training, add the `--i2v` flag.
- Types:
  - Consistency Distillation: Run with the `--distill_mode consistency` flag.
  - Half Step Distillation:
    - Performance: Achieves 2× acceleration with minimal performance loss.
    - Description: Consolidates the original two prediction steps into a single step; see the sketch after the example results below.
    - Training: Run with the `--distill_mode half` flag.
- Example Results: halfed, 20 steps
halfed_38_The.wom_step20_shift10_guide5.mp4
halfed_26_SuperVi_step20_shift10_guide5.mp4
Original CFG-distilled (30 steps) vs. further step-distilled (15 steps):

| CFG-distilled, 30 steps | Step-distilled, 15 steps |
| --- | --- |
| cfg_._step30_shift3_guide5.mp4 | halfed_._step15_shift9_guide8.mp4 |
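
A minimal sketch of the half-step idea, assuming a flow-matching (velocity-prediction) model and Euler steps; function names and signatures are illustrative, not the repo's API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_two_steps(teacher, x, t, t_mid, t_next, cond):
    # Two Euler steps of the frozen teacher: t -> t_mid -> t_next.
    x_mid = x + (t_mid - t) * teacher(x, t, cond)
    return x_mid + (t_next - t_mid) * teacher(x_mid, t_mid, cond)

def half_step_loss(student, teacher, x, t, t_mid, t_next, cond):
    """Train the student so that ONE Euler step from t to t_next lands where
    the teacher lands after TWO steps, halving the number of network calls."""
    target = teacher_two_steps(teacher, x, t, t_mid, t_next, cond)
    x_next = x + (t_next - t) * student(x, t, cond)
    return F.mse_loss(x_next, target)
```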
- Description: Distribution Matching Distillation (DMD) is applied to further optimize the model; a sketch of the core generator loss follows.
- Implementation: For detailed implementation, see DMD2_wanx.
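
Conceptually, DMD trains the few-step generator using the disagreement between a frozen "real" denoiser (the teacher) and an online "fake" denoiser trained on generator samples. A simplified sketch (the interfaces and the noising rule are assumptions; see DMD2_wanx for the actual implementation):

```python
import torch
import torch.nn.functional as F

def dmd_generator_loss(x_gen, real_model, fake_model, t, noise):
    """x_gen: generator output. real_model: frozen teacher; fake_model: online
    denoiser trained (in a separate step) on generator samples. The gradient
    step moves generator samples toward the real data distribution."""
    x_t = (1 - t) * x_gen + t * noise  # illustrative flow-matching forward noising
    with torch.no_grad():
        pred_real = real_model(x_t, t)  # teacher's denoised estimate
        pred_fake = fake_model(x_t, t)  # fake-distribution denoised estimate
        grad = pred_fake - pred_real    # descent direction: -grad points to "real"
    # Surrogate loss whose gradient w.r.t. x_gen equals `grad` (standard DMD trick).
    return 0.5 * F.mse_loss(x_gen, (x_gen - grad).detach())
```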
- Method: Implemented following the concept of DRaFT; see the sketch after the example results below.
- Reward: HPSReward V2.1, with the implementation taken from EasyAnimate.
- Best Practices: In our experiments, it worked best to train with LoRA and to apply the reward to the first frame only.
- Training: Run the script `RL/sh/debug.sh`.
- Inference: Set `lora_alpha` to a smaller value than during training for more natural-looking videos.
- Example Results:
| Without RL | With RL |
| --- | --- |
| org_In.the._step7_shift13_guide8.mp4 | RL_In.the._step7_shift13_guide8.mp4 |
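
A minimal sketch of the DRaFT-style reward step under our best-practice setup (only LoRA parameters trainable, reward applied to the first frame). The sampler interface, decode function, and truncated-backprop depth are assumptions, not the repo's API:

```python
import torch

def draft_reward_step(model, reward_fn, decode_first_frame, x, timesteps, cond, k_grad=1):
    """DRaFT-style update: run the sampler, let gradients flow only through the
    last `k_grad` denoising steps, score the first decoded frame with the
    reward model (HPS v2.1 in our setup), and backprop into the LoRA params."""
    n = len(timesteps) - 1
    for i in range(n):
        grad_ctx = torch.enable_grad() if i >= n - k_grad else torch.no_grad()
        with grad_ctx:
            v = model(x, timesteps[i], cond)               # LoRA-augmented transformer
            x = x + (timesteps[i + 1] - timesteps[i]) * v  # Euler step
    frame0 = decode_first_frame(x)  # reward is applied to the FIRST frame only
    loss = -reward_fn(frame0)       # maximize the reward
    loss.backward()                 # gradients reach the LoRA parameters only
    return loss.detach()
```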
- **Text-to-Video (t2v)**
  - Run `scripts/inference/inference.sh` for text-to-video generation tasks.
  - Sample prompts are provided in `test_prompts.txt` and `moviibench_2.0_prompts.txt`.
- **Image-to-Video (i2v)**
  - Run `scripts/inference/i2v.sh` for image-to-video generation tasks.
  - Sample images and corresponding prompts are available in the `examples/i2v` directory.
- **Configuration**
  - Update the `transformer_dir` variable in the scripts to point to your model checkpoint directory.
  - Adjust LoRA-related settings in `generate.py` if using LoRA models; a sketch of the `lora_alpha` scaling follows below.
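
As a rough illustration of the `lora_alpha` advice from the RL section (variable names and values are placeholders; the real settings live in `generate.py`), merging LoRA weights with a reduced alpha looks like:

```python
import torch

def merge_lora(base_weight, lora_A, lora_B, lora_alpha, rank):
    # Merge the LoRA delta into a base weight: W' = W + (alpha / r) * B @ A.
    # Using a smaller alpha at inference than at training scales the learned
    # delta down, which in our experience yields more natural-looking videos.
    return base_weight + (lora_alpha / rank) * (lora_B @ lora_A)

# Example (hypothetical values): trained with lora_alpha=64 at rank 32;
# try a smaller value at inference:
# W_inference = merge_lora(W, A, B, lora_alpha=32, rank=32)
```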
By combining distillation (based on DMD2) and RL, we can achieve high-quality video generation in just 5 steps (a sketch of the few-step sampler follows the example videos):
A.coffe_step5_shift7_guide8.mp4
A.littl_step5_shift15_guide8.mp4
After.j_step5_shift15_guide8.mp4
The.aft_step5_shift7_guide8.mp4
In.the._step5_shift7_guide8.mp4
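
For reference, the `stepN_shiftS_guideG` suffixes in the result names encode the sampling step count, the timestep-shift factor, and the guidance scale. Below is a minimal sketch of such a few-step sampler, assuming the standard flow-matching shift schedule and a CFG-distilled model that takes the guidance scale as a direct input; the interfaces are illustrative, not the repo's API:

```python
import torch

def shifted_timesteps(num_steps: int, shift: float) -> torch.Tensor:
    # Uniform t from 1 (pure noise) to 0 (clean), warped by the shift factor
    # so more of the trajectory is spent at high noise levels.
    t = torch.linspace(1.0, 0.0, num_steps + 1)
    return shift * t / (1.0 + (shift - 1.0) * t)

def sample(model, x, cond, num_steps=5, shift=7.0, guidance=8.0):
    # Few-step Euler sampling; the CFG-distilled model consumes the guidance
    # scale directly, so each step costs a single forward pass.
    ts = shifted_timesteps(num_steps, shift)
    for i in range(num_steps):
        v = model(x, ts[i], cond, guidance=guidance)
        x = x + (ts[i + 1] - ts[i]) * v
    return x
```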
All pretrained model weights are available for download:
- Baidu Pan: https://pan.baidu.com/s/1wUCrRY9Fu8GdDMTZXdc7tw?pwd=m9kn
- Access Code: `m9kn`
- Dependencies: All required packages are listed in the `environment.yml` file.
- FastVideo: This project requires FastVideo. Please install our forked version:

```bash
git clone https://github.com/azuresky03/FastVideo.git
cd FastVideo
pip install -e .
```