

VMoBA: Mixture-of-Block Attention for Video Diffusion Models


Jianzong Wu · Liang Hou · Haotian Yang · Xin Tao · Ye Tian · Pengfei Wan · Di Zhang · Yunhai Tong

arXiv PDF


(Teaser figure)

🚀 TL;DR

We introduce VMoBA, Mixture-of-Block Attention for Video Diffusion Models!

  • 🌟 Sparse attention mechanism based on MoBA, designed for video diffusion model training.
  • 🖼️ Key innovations: Layer-wise Recurrent Block Partition, Global Block Selection, and Threshold-based Block Selection. These innovations enhance VMoBA's performance and speed in video generation.
  • ✨ 2.92x FLOPs acceleration and 1.48x latency acceleration on 576p video (93x576x1024, ~55K tokens; see the token-count sketch below). The speedup grows with sequence length!
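For reference, here is a minimal token-count sketch of how the ~55K figure can arise. The 4x temporal / 8x8 spatial VAE compression and 1x2x2 patchification ratios below are our assumption (a common video-DiT setup), not values stated in this README.

# Hypothetical token-count arithmetic for a 93x576x1024 video.
# Assumes a causal VAE with 4x temporal / 8x8 spatial compression,
# followed by 1x2x2 patchification; adjust for your own backbone.
frames, height, width = 93, 576, 1024

latent_t = (frames - 1) // 4 + 1   # 24 latent frames
latent_h = height // 8             # 72
latent_w = width // 8              # 128

tokens = (latent_t // 1) * (latent_h // 2) * (latent_w // 2)
print(tokens)  # 55296, i.e. ~55K tokens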

🎉 News

  • [2025-6-27] Paper and code are released!

🛠️ Quick Start

We provide a clean, single-file implementation of VMoBA built on FlashAttention, along with a speed-test unit. Feel free to replace Full Attention with VMoBA in any of your models!
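As a rough illustration of what "replacing Full Attention with VMoBA" can look like inside an attention module, here is a minimal sketch. The VMoBA entry point, its name, and its arguments live in src/vmoba.py; the placeholder branch below is ours, so check that file for the real signature.

from flash_attn import flash_attn_func

def attention(q, k, v, use_vmoba=False):
    # q, k, v: (batch, seq_len, num_heads, head_dim), fp16/bf16 tensors on GPU.
    if use_vmoba:
        # Sparse path: call the VMoBA attention exported by src/vmoba.py here
        # (its name and block-size / top-k / threshold arguments are repo-specific).
        raise NotImplementedError("wire up the VMoBA call from src/vmoba.py")
    # Dense baseline: standard FlashAttention.
    return flash_attn_func(q, k, v)

The point is that VMoBA replaces Full Attention at the attention-call level; the rest of the model is untouched.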

Environment Preparation

# Create a new environment with Conda
conda create -n diffusers python=3.11
conda activate diffusers

# Install PyTorch
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia

# Install FlashAttention locally
pip install packaging ninja
mkdir libs
cd libs
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.4cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.6.3+cu123torch2.4cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

# Install other dependencies
pip install -r requirements.txt

For issues installing FlashAttention, please refer to the official repo for help.
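A quick sanity check (our suggestion, not part of the repo) that the pinned PyTorch build and the FlashAttention wheel imported correctly:

import torch
import flash_attn

# Expect torch 2.4.1 with CUDA available and flash_attn 2.6.3 (matching the wheel above).
print("torch:", torch.__version__, "| cuda:", torch.cuda.is_available())
print("flash_attn:", flash_attn.__version__)
assert torch.cuda.is_available(), "FlashAttention kernels require a CUDA GPU"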

VMoBA Speed Test

VMoBA is implemented in a single file: src/vmoba.py.

Run this command to test the speed compared with Full Attention.

CUDA_VISIBLE_DEVICES=1 \
python -u src/vmoba.py

Feel free to try different sequence lengths and component settings (top-k selection, local selection, as in the vanilla MoBA).

Note: The current FlashAttention-based implementation shows a clear speedup over Full Attention once the sequence length exceeds roughly 33,000 tokens. This is also noted in one of MoBA's issues.

Note 2: The 1-2-3D block partition algorithm is implemented in the process_moba_input and process_moba_output functions in the same file. Please adapt them to your data format.
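To make the 1-2-3D partition idea concrete, here is a toy sketch (ours, not the repo's code) of assigning a (T, H, W) token grid to temporal (1D), spatial (2D), or spatio-temporal (3D) blocks depending on the layer index. The real partition lives in process_moba_input and process_moba_output, and the block sizes below are purely illustrative.

import torch

def toy_block_ids(T, H, W, layer_idx, t_block=4, s_block=8):
    # Toy layer-wise recurrent partition: layer_idx % 3 cycles through 1D, 2D, 3D blocks.
    # Returns one integer block id per token, flattened to shape (T*H*W,).
    t = torch.arange(T).view(T, 1, 1).expand(T, H, W)
    h = torch.arange(H).view(1, H, 1).expand(T, H, W)
    w = torch.arange(W).view(1, 1, W).expand(T, H, W)
    n_h = (H + s_block - 1) // s_block   # spatial tiles along H
    n_w = (W + s_block - 1) // s_block   # spatial tiles along W

    mode = layer_idx % 3
    if mode == 0:    # 1D: group whole frames into temporal chunks
        ids = t // t_block
    elif mode == 1:  # 2D: group tokens into H-W tiles (shared across frames)
        ids = (h // s_block) * n_w + (w // s_block)
    else:            # 3D: group tokens into spatio-temporal cubes
        ids = ((t // t_block) * n_h + (h // s_block)) * n_w + (w // s_block)
    return ids.flatten()

# Example: 24 latent frames with a 36x64 spatial grid, third transformer layer (3D mode).
print(toy_block_ids(24, 36, 64, layer_idx=2).unique().numel())  # 240 blocks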

Theoretical FLOPs computation

Most third-party packages for computing the FLOPs of attention-based networks miss certain operators, so we provide a hand-written script to calculate the theoretical FLOPs of VMoBA and Full Attention networks. The code is at src/cal_theo_flops.py.

python src/cal_theo_flops.py
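For intuition about where the savings come from, here is a back-of-the-envelope sketch (ours; src/cal_theo_flops.py is the reference). Full attention costs roughly 4·N²·d FLOPs per head (QKᵀ plus the attention-times-V product), so keeping only a fraction of key/value blocks cuts that quadratic term proportionally. The head count, head dim, and keep ratio below are illustrative, and the paper's 2.92x figure covers the whole network, not just the attention term.

def attention_flops(seq_len, head_dim, num_heads, keep_ratio=1.0):
    # QK^T and softmax(QK^T) @ V each cost ~2 * N * N_kv * d FLOPs per head;
    # keep_ratio scales the effective key/value length under block-sparse selection.
    # Softmax and the qkv/out projections are ignored for simplicity.
    n_kv = seq_len * keep_ratio
    return num_heads * 2 * (2 * seq_len * n_kv * head_dim)

# Illustrative numbers only (not the paper's measured speedup).
full = attention_flops(55_296, head_dim=128, num_heads=12)
sparse = attention_flops(55_296, head_dim=128, num_heads=12, keep_ratio=0.25)
print(f"attention-only FLOPs ratio: {full / sparse:.2f}x")  # 4.00x when 25% of blocks are kept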

Contact

Jianzong Wu (吴健宗): jzwu@stu.pku.edu.cn

Citation

@article{wu2025vmoba,
  title={VMoBA: Mixture-of-Block Attention for Video Diffusion Models},
  author={Jianzong Wu and Liang Hou and Haotian Yang and Xin Tao and Ye Tian and Pengfei Wan and Di Zhang and Yunhai Tong},
  journal={arXiv preprint arXiv:2506.23858},
  year={2025}
}

(Star History Chart)
