Beijia Lu1,
Ziyi Chen2,
Jing Xiao2,
Jun-Yan Zhu1
Carnegie Mellon University1, PAII Inc2.
SIGGRAPH Asia 2025
We introduce a conditional video distillation method for real-time co-speech video generation that leverages human pose conditioning for both input-aware sparse attention and the distillation loss.
Existing pose-driven methods often produce stiff or unnatural movements in critical regions such as the lips and hands. In contrast, our model maintains high fidelity and realism, with lifelike facial and hand animations, while achieving significantly faster inference.
Given only a single static reference image and an input audio clip, our model synthesizes highly realistic and expressive videos, producing natural facial expressions, fluid body movements, and accurate lip synchronization in real time.
Our attention mechanism selectively focuses on tokens within salient body regions and their corresponding areas in temporally relevant frames. (a) We first apply global masking, which restricts attention to the K most similar past frames based on pose similarity. (b) Then local masking limits inter-frame attention to matched regions (e.g., faces, hands) to enhance temporal coherence. (c) Our input-aware attention masking integrates both global and local masks to form an efficient and structured sparse attention pattern.
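To make this concrete, below is a minimal PyTorch sketch of the three-step mask construction. It is an illustration of the idea rather than the paper's implementation: the inputs `pose_sim` (pairwise frame pose similarity) and `region` (per-token body-region labels), and the function name itself, are hypothetical stand-ins.

```python
import torch

def build_input_aware_mask(pose_sim, region, K):
    """Illustrative sketch (not the paper's code) of the combined mask.

    pose_sim: (F, F) pairwise pose similarity between frames.
    region:   (F, T) body-region id per token (e.g., 0=background, 1=face, 2=hand).
    Returns an (F*T, F*T) boolean mask where True = attention allowed.
    """
    F, T = region.shape
    # (a) Global mask: each frame attends to itself and to its K most
    #     pose-similar past frames.
    global_mask = torch.eye(F, dtype=torch.bool)
    for i in range(1, F):
        topk = pose_sim[i, :i].topk(min(K, i)).indices
        global_mask[i, topk] = True
    # (b) Local mask: across frames, a token only attends to tokens with the
    #     same region label; attention within a frame stays unrestricted.
    flat_region = region.reshape(-1)                       # (F*T,)
    same_region = flat_region[:, None] == flat_region[None, :]
    frame_id = torch.arange(F).repeat_interleave(T)
    same_frame = frame_id[:, None] == frame_id[None, :]
    local_mask = same_region | same_frame
    # (c) Intersect both masks to obtain the structured sparse pattern.
    frame_mask = global_mask.repeat_interleave(T, dim=0).repeat_interleave(T, dim=1)
    return frame_mask & local_mask

# Toy example: 4 frames x 6 tokens, attending to the 2 most pose-similar past frames.
pose_sim = torch.rand(4, 4)
region = torch.randint(0, 3, (4, 6))
mask = build_input_aware_mask(pose_sim, region, K=2)
print(mask.shape, f"density={mask.float().mean().item():.2f}")
```

In the full model, a mask like this would be applied inside the DiT's attention layers (e.g., as an attention bias or a block-sparse kernel); the sketch only shows the structure of the pattern.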
- Clone the repository and create a conda environment:
git clone https://github.com/Beijia11/IASA.git
cd IASA
conda create -n IASA python=3.10
conda activate IASA
- Install PyTorch with the wheel matching your CUDA version:
# CUDA 11.8
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.1
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu121
# CUDA 12.4
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124
- Install the requirements and fetch the EMAGE evaluation tools (Git LFS is required):
pip install -r requirements.txt
git lfs install
git clone https://huggingface.co/H-Liu1997/emage_evaltools
Download the dataset from the provided link and organize the files into the following directory structure:
dataset/
├── audio/
│   ├── 00000_audio.wav
│   ├── 00001_audio.wav
│   └── ...
└── images/
    ├── 00000.png
    ├── 00001.png
    └── ...
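As a quick sanity check, the short script below (our own snippet, not part of the repository) verifies that every reference image has a matching audio clip under the `NNNNN.png` / `NNNNN_audio.wav` naming shown above.

```python
from pathlib import Path

# Collect the numeric ids on both sides of the dataset.
audio_ids = {p.name.removesuffix("_audio.wav") for p in Path("dataset/audio").glob("*_audio.wav")}
image_ids = {p.stem for p in Path("dataset/images").glob("*.png")}

print(f"{len(audio_ids & image_ids)} matched audio/image pairs")
if audio_ids ^ image_ids:
    print("audio clips without an image:", sorted(audio_ids - image_ids))
    print("images without an audio clip:", sorted(image_ids - audio_ids))
```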
Our framework consists of two main components: Audio-to-Motion and Motion-to-Video.
Please follow the instructions below to download and prepare the pretrained models.
Clone the pretrained audio-to-motion model from Hugging Face:
git clone https://huggingface.co/H-Liu1997/emage_audio
For the motion-to-video stage, we provide our implementation on top of a DiT model. Install the Hugging Face CLI and download the Wan2.1-I2V-14B-720P base model:
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./Wan2.1-I2V-14B-720P
Then run the audio-to-motion stage to generate pose sequences from the input audio (results are saved to dataset/pose):
python infer_motion.py --visualization --audio_folder dataset/audio --save_folder dataset/pose
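After this step, dataset/pose should contain one pose output per audio clip. The snippet below (again our own helper; the exact pose file format depends on the inference script) simply counts the outputs:

```python
from pathlib import Path

# Compare the number of input clips against the number of generated pose outputs.
n_audio = len(list(Path("dataset/audio").glob("*_audio.wav")))
n_pose = len([p for p in Path("dataset/pose").iterdir() if not p.name.startswith(".")])
print(f"{n_audio} audio clips -> {n_pose} pose outputs in dataset/pose")
```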
The code for input-aware attention applied to general conditional video generation will be released soon. Stay tuned!
To train the motion-to-video model with input-aware sparse attention and LoRA, run:
python examples/IASA/train_iasa.py --config 'config/train_sparseattn_lora.yaml'
If you find this repository useful for your research, please cite the following work:
@article{lu2025iasa,
title={Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation},
author={Lu, Beijia and Chen, Ziyi and Xiao, Jing and Zhu, Jun-Yan},
journal={ACM SIGGRAPH Asia},
year={2025}
}

We would like to thank Kangle Deng, Muyang Li, Sheng-Yu Wang, Maxwell Jones, and Gaurav Parmar for their insightful feedback and input that contributed to the finished work. The project is partly supported by Ping An Research.
Our codebase is built upon several amazing repos: EMAGE, DiffSynth-Studio, and UniAnimate-DiT.