VideoGrain is a zero-shot method for class-level, instance-level, and part-level video editing.
- Multi-grained Video Editing
- class-level: Editing objects within the same class (previous SOTA methods are limited to this level)
- instance-level: Editing each individual instance into a distinct object
- part-level: Adding new objects or modifying existing attributes at the part level
- Training-Free
- Does not require any training/fine-tuning
- One-Prompt Multi-region Control & Deep Investigation of Cross-/Self-Attention
- modulating cross-attention for multi-region control (visualizations available)
- modulating self-attention for feature decoupling (clustering visualizations available)
(Demo gallery: class-level, instance-level, and part-level edits; animal instances, human instances, and part-level modifications. Demo video: videograin.mp4)
- [2025/2/25] VideoGrain is posted and recommended by Gradio on LinkedIn and Twitter, and also recommended by AK.
- [2025/2/25] VideoGrain is submitted by AK to Hugging Face Daily Papers and ranked the #1 paper of that day.
- [2025/2/24] We release our paper on arXiv, and release the code and full data on Google Drive.
- [2025/1/23] Our paper is accepted to ICLR 2025! Welcome to watch 👀 this repository for the latest updates.
Our method is tested with CUDA 12.1, fp16 via accelerate, and xformers on a single L40 GPU.
# Step 1: Create and activate Conda environment
conda create -n videograin python==3.10
conda activate videograin
# Step 2: Install PyTorch, CUDA and Xformers
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install --pre -U xformers==0.0.27
# Step 3: Install additional dependencies with pip
pip install -r requirements.txt
xformers is recommended to save memory and running time.
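After installation, a quick sanity check (it only assumes the environment above installed cleanly) confirms that PyTorch sees the GPU and that xformers imports:

```bash
# Verify CUDA visibility and the xformers build inside the videograin env
python -c "import torch, xformers; print(torch.__version__, torch.cuda.is_available(), xformers.__version__)"
```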
You may download all the base model checkpoints using the following bash command
## download sd 1.5, controlnet depth/pose v10/v11
bash download_all.sh
Click for ControlNet annotator weights (if you cannot access Hugging Face)
You can download all the annotator checkpoints (such as DW-Pose, depth_zoe, depth_midas, and OpenPose; about 4 GB in total) from Baidu or Google, then extract them into `./annotator/ckpts`.
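For example, assuming the annotator weights were downloaded as a single archive (the filename `annotator_ckpts.zip` below is only a placeholder for whatever you downloaded):

```bash
# Unpack the annotator weights into the folder the code expects
mkdir -p ./annotator/ckpts
unzip annotator_ckpts.zip -d ./annotator/ckpts   # placeholder archive name; use tar -xzf for a tarball
```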
We have provided all the video data and layout masks of VideoGrain at the following link. Please download and unzip the data, then put it in the `./data` root directory.
gdown --fuzzy https://drive.google.com/file/d/1dzdvLnXWeMFR3CE2Ew0Bs06vyFSvnGXA/view?usp=drive_link
tar -zxvf videograin_data.tar.gz
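To confirm the archive unpacked where the configs expect it (a minimal check; the exact sub-folder names depend on the archive contents):

```bash
# List the per-video folders (frames and layout masks) under ./data
find ./data -maxdepth 2 -type d | head -n 20
```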
Prepare video frames: if the input video is an mp4 file, use the following command to split it into frames:
python image_util/sample_video2frames.py --video_path 'your video path' --output_dir './data/video_name/video_name'
Prepare layout masks:
We segment videos using our ReLER lab's SAM-Track. We suggest using app.py in SAM-Track in gradio mode to manually select which region of the video you want to edit. We also provide a script, image_util/process_webui_mask.py, to convert masks from the SAM-Track output path to the VideoGrain path.
You can reproduce the instance- and part-level results in our teaser by running:
bash test.sh
#or
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/part_level/adding_new_object/run_two_man/spider_polar_sunglass.yaml
For the other instance-, part-, and class-level results on the VideoGrain project page or in the teaser, we provide all the data (video frames and layout masks) and the corresponding configs to reproduce them; check the results in 🚀Multi-Grained Video Editing.
The result is saved at `./result`. (Click for directory structure)
result
├── run_two_man
│   ├── control                   # control condition
│   ├── infer_samples
│   │   ├── input                 # the input video frames
│   │   ├── masked_video.mp4      # check whether edit regions are accurately covered
│   ├── sample
│   │   ├── step_0                # result image folder
│   │   ├── step_0.mp4            # result video
│   │   ├── source_video.mp4      # the input video
│   ├── visualization_denoise     # cross attention weight
│   ├── sd_study                  # cluster inversion feature
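For a quick sanity check of a finished run (paths follow the tree above; swap `run_two_man` for your own config name):

```bash
# Edited frames and the stitched result video
ls result/run_two_man/sample/step_0
ls -lh result/run_two_man/sample/step_0.mp4
# Cross-attention weight maps (only written when vis_cross_attn: True)
ls result/run_two_man/visualization_denoise
```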
VideoGrain is a training-free framework. To run VideoGrain on your video, modify ./config/demo_config.yaml based on your needs:
- Replace the pretrained model path and ControlNet path in your config (see the sketch after this list). You can change `control_type` to `dwpose`, `depth_zoe`, or `depth` (midas).
- Prepare your video frames and layout masks (edit regions) with SAM-Track or SAM2 in the dataset config.
- Change the `prompt`, and extract each `local prompt` from the editing prompt. The local prompt order should match the layout mask order.
- You can change the flatten resolution with 1->64, 2->16, 4->8 (commonly, flattening at 64 works best).
- To ensure temporal consistency, you can set `use_pnp: True` and `inject_step: 5/10`. (Note: pnp for more than 10 steps degrades multi-region editing.)
- If you want to visualize the cross-attention weights, set `vis_cross_attn: True`.
- If you want to cluster the DDIM inversion spatial-temporal video features, set `cluster_inversion_feature: True`.
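As a starting point, you can copy the demo config and locate the knobs listed above before editing them by hand (a sketch only: the copy name `config/my_edit.yaml` is just an example, and the key names are the ones named in the list above):

```bash
# Make a working copy of the demo config and find the options discussed above
cp config/demo_config.yaml config/my_edit.yaml
grep -nE 'control_type|use_pnp|inject_step|vis_cross_attn|cluster_inversion_feature' config/my_edit.yaml
${EDITOR:-vi} config/my_edit.yaml   # edit prompts, layout mask paths, and the flags above
```

Then launch the edit with `accelerate launch test.py --config config/my_edit.yaml`, as shown below.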
bash test.sh
#or
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config /path/to/the/config
You can get the multi-grained definition results (the same video edited at class, instance, and part level) using the following command:
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/class_level/running_two_man/man2spider.yaml  # class-level
# config/instance_level/running_two_man/4cls_spider_polar.yaml  # instance-level
# config/part_level/adding_new_object/run_two_man/spider_polar_sunglass.yaml  # part-level
(Gallery: source video | class level | instance level | part level)
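To reproduce all three granularities back-to-back, a simple wrapper over the configs just listed works:

```bash
# Run the class-, instance-, and part-level edits of the two-man example sequentially
for cfg in config/class_level/running_two_man/man2spider.yaml \
           config/instance_level/running_two_man/4cls_spider_polar.yaml \
           config/part_level/adding_new_object/run_two_man/spider_polar_sunglass.yaml; do
  CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config "$cfg"
done
```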
You can get instance-level video editing results using the following command:
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/instance_level/running_two_man/running_3cls_iron_spider.yaml
You can get part-level video editing results using the following command:
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/part_level/modification/man_text_message/blue_shirt.yaml
(Gallery: source video | blue shirt | black suit; source video | ginger head | ginger body)
(Gallery: source video | superman | superman + cap; source video | superman | superman + sunglasses)
You can get class-level video editing results using the following command:
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/class_level/wolf/wolf.yaml
(Gallery: input | pig | husky | bear | tiger)
(Gallery: input | iron man | Batman + snow court + iced wall; input | Porsche)
You can get solely-edited (single-region) video editing results using the following command:
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/instance_level/soely_edit/only_left.yaml
#--config config/instance_level/soely_edit/only_right.yaml
#--config config/instance_level/soely_edit/joint_edit.yaml
(Gallery: source video | left→Iron Man | right→Spiderman | joint edit)
You can visualize the cross-attention weights during editing using the following command:
# set vis_cross_attn: True in your config
CUDA_VISIBLE_DEVICES=0 accelerate launch test.py --config config/instance_level/running_two_man/3cls_spider_polar_vis_weight.yaml
(Gallery: source video | left→spiderman, right→polar bear, trees→cherry blossoms | spiderman weight | bear weight | cherry weight)
If you think this project is helpful, please feel free to leave a star⭐️⭐️⭐️ and cite our paper:
@article{yang2025videograin,
title={VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing},
author={Yang, Xiangpeng and Zhu, Linchao and Fan, Hehe and Yang, Yi},
journal={arXiv preprint arXiv:2502.17258},
year={2025}
}
Xiangpeng Yang @knightyxp, email: knightyxp@gmail.com / Xiangpeng.Yang@student.uts.edu.au
- This code builds on diffusers and FateZero. Thanks for open-sourcing!
- We would like to thank AK (@_akhaliq) and the Gradio team for the recommendation!