Jianzong Wu
·
Liang Hou
·
Haotian Yang
·
Xin Tao
·
Ye Tian
·
Pengfei Wan
·
Di Zhang
·
Yunhai Tong
We introduce VMoBA, Mixture-of-Block Attention for Video Diffusion Models!
- 🌟 Sparse attention mechanism based on MoBA, designed for video diffusion model training.
- 🖼️ Key innovations: Layer-wise Recurrent Block Partition, Global Block Selection, and Threshold-based Block Selection. These innovations improve VMoBA's generation quality and speed (see the sketch after this list for a rough illustration of the threshold-based selection idea).
- ✨ 2.92x FLOPs reduction and 1.48x latency speedup on 576p video (93x576x1024, ~55K tokens). The speedup grows with sequence length!
- [2025-6-27] Paper and code are released!
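As a rough, self-contained illustration of the threshold-based selection idea, the sketch below scores key blocks by mean-pooled query-key similarity (as in vanilla MoBA) and keeps, per query block, the smallest set of key blocks whose softmax mass reaches a threshold instead of a fixed top-k. This is an assumption-laden reading of the name, not the paper's exact formulation; see src/vmoba.py for the real selection logic.

```python
import torch

def threshold_block_select(q_blocks, k_blocks, tau=0.25):
    """Illustrative threshold-based block selection (not the official code).

    q_blocks: (num_q_blocks, block_size, dim) queries grouped into blocks.
    k_blocks: (num_k_blocks, block_size, dim) keys grouped into blocks.
    Returns a boolean (num_q_blocks, num_k_blocks) mask of selected pairs.
    """
    q_pool = q_blocks.mean(dim=1)                       # (num_q, dim)
    k_pool = k_blocks.mean(dim=1)                       # (num_k, dim)
    scores = torch.softmax(q_pool @ k_pool.T, dim=-1)   # block-level similarity
    sorted_scores, order = scores.sort(dim=-1, descending=True)
    mass_before = sorted_scores.cumsum(dim=-1) - sorted_scores
    keep_sorted = mass_before < tau                     # keep blocks until mass >= tau
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, order, keep_sorted)
    return mask

# Example: 12 query blocks and 12 key blocks of 64 tokens each, dim 128.
mask = threshold_block_select(torch.randn(12, 64, 128), torch.randn(12, 64, 128))
print(mask.shape, mask.sum(dim=-1))  # how many key blocks each query block keeps
```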
We provide a clean, single-file implementation of VMoBA built on FlashAttention, along with a speed-test unit. Feel free to replace Full Attention with VMoBA in any of your models; a minimal swap-in sketch is shown below.
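The snippet below sketches the swap, assuming q/k/v in the usual (batch, seq_len, num_heads, head_dim) FlashAttention layout. The commented import is only a guess at the entry-point name; check src/vmoba.py for the actual function and its extra arguments (block partition and selection settings).

```python
import torch
from flash_attn import flash_attn_func          # Full Attention baseline
# from src.vmoba import ...                     # the real VMoBA entry point lives here

def run_attention(q, k, v, attn_fn=flash_attn_func):
    """q, k, v: (batch, seq_len, num_heads, head_dim) fp16/bf16 CUDA tensors.

    Swapping Full Attention for VMoBA amounts to passing the VMoBA function
    (plus whatever block-partition metadata it expects) as attn_fn.
    """
    return attn_fn(q, k, v)

q = torch.randn(1, 4096, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = run_attention(q, k, v)                    # dense baseline
print(out.shape)                                # (1, 4096, 8, 64)
```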
# Create a new environment with Conda
conda create -n diffusers python=3.11
conda activate diffusers
# Install Pytorch
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia
## Install FlashAttention locally
pip install packaging ninja
mkdir libs
cd libs
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu123torch2.4cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.6.3+cu123torch2.4cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
## Install other dependencies
pip install -r requirements.txt

For issues installing FlashAttention, please refer to the official repo for help.
VMoBA is implemented in a single file, src/vmoba.py.
Run the following command to compare its speed with Full Attention.
CUDA_VISIBLE_DEVICES=1 \
python -u src/vmoba.py

Feel free to try different sequence lengths and component variables (top-k selection and local selection, as in the vanilla MoBA).
Note: The current implementation, based on FlashAttention, shows a clear speedup over Full Attention only when the sequence length exceeds approximately 33,000 tokens. This behavior is also noted in one of MoBA's issues.
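To probe that crossover point yourself, a small harness like the one below (an independent sketch, not the repo's test unit) times any attention callable with CUDA events at a few sequence lengths. flash_attn_func serves as the Full Attention baseline, and the VMoBA function from src/vmoba.py can be passed in its place once imported; the head count and head dimension are example values.

```python
import torch
from flash_attn import flash_attn_func

def time_attention(attn_fn, seq_len, num_heads=16, head_dim=64, iters=20):
    """Return average milliseconds per forward call of attn_fn at seq_len."""
    q = torch.randn(1, seq_len, num_heads, head_dim, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    for _ in range(3):                     # warm-up
        attn_fn(q, k, v)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        attn_fn(q, k, v)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

for n in (8_192, 16_384, 32_768, 65_536):
    print(f"seq_len={n:>6}: full attention {time_attention(flash_attn_func, n):.2f} ms")
    # Pass the VMoBA function from src/vmoba.py here to compare the sparse path.
```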
Note 2: The 1-2-3D block partition algorithm is implemented in the process_moba_input and process_moba_output functions in the same file. Please adapt them to your data format.
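As a rough picture of what a 1-2-3D partition means for flattened video tokens, here is a standalone sketch; it is not the repo's process_moba_input/process_moba_output code, and the block sizes are arbitrary example values.

```python
import torch

def partition_tokens(x, T, H, W, mode):
    """Illustrative 1-2-3D block partition for flattened video tokens.

    x: (B, T*H*W, C) tokens in frame-major order.
    mode: "1d" groups runs of whole frames (temporal blocks),
          "2d" groups spatial tiles within each frame,
          "3d" groups spatio-temporal cubes.
    Returns (B, num_blocks, tokens_per_block, C); even divisibility assumed.
    """
    B, _, C = x.shape
    x = x.view(B, T, H, W, C)
    if mode == "1d":
        bt = 3
        return x.view(B, T // bt, bt * H * W, C)
    if mode == "2d":
        bh, bw = 8, 8
        x = x.view(B, T, H // bh, bh, W // bw, bw, C)
        return x.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, T * (H // bh) * (W // bw), bh * bw, C)
    bt, bh, bw = 3, 8, 8                                  # "3d"
    x = x.view(B, T // bt, bt, H // bh, bh, W // bw, bw, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(
        B, (T // bt) * (H // bh) * (W // bw), bt * bh * bw, C)

x = torch.randn(1, 9 * 32 * 32, 64)                       # 9 frames of 32x32 latents
print(partition_tokens(x, 9, 32, 32, "3d").shape)         # (1, 48, 192, 64)
```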
Because most third-party packages for computing the FLOPs of attention-based networks miss certain operators, we provide a hand-written script that computes the theoretical FLOPs of VMoBA and Full Attention networks. The code is at src/cal_theo_flops.py.
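For a rough sense of what such a theoretical count involves, the standalone estimate below covers only the quadratic attention terms of a single layer (the part that block sparsity actually shrinks). It is not the repo's script, and the head configuration and keep ratio are example values.

```python
def attn_flops(seq_len, num_heads, head_dim, keep_ratio=1.0):
    """Theoretical FLOPs of the quadratic attention terms (one forward pass).

    keep_ratio is the fraction of query-key interactions kept by the sparse
    pattern; 1.0 reproduces Full Attention. A multiply-add counts as 2 FLOPs.
    """
    d = num_heads * head_dim
    qk = 2 * seq_len * seq_len * d       # scores = Q @ K^T
    av = 2 * seq_len * seq_len * d       # out    = softmax(scores) @ V
    return keep_ratio * (qk + av)

N = 55_000                               # ~55K tokens for a 93x576x1024 video
full = attn_flops(N, num_heads=16, head_dim=64)
sparse = attn_flops(N, num_heads=16, head_dim=64, keep_ratio=1 / 2.92)
print(f"full: {full:.3e}  sparse: {sparse:.3e}  ratio: {full / sparse:.2f}x")
```

The repository's script performs the full network-level count for both VMoBA and Full Attention; run it with: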
python scripts/flops/cal_theo_flops.py

Jianzong Wu (吴健宗): jzwu@stu.pku.edu.cn
@article{wu2025vmoba,
  title={VMoBA: Mixture-of-Block Attention for Video Diffusion Models},
  author={Jianzong Wu and Liang Hou and Haotian Yang and Xin Tao and Ye Tian and Pengfei Wan and Di Zhang and Yunhai Tong},
  journal={arXiv preprint arXiv:2506.23858},
  year={2025},
}