📖 Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation (ICCV 2025)
[Paper] [Project Page] [HuggingFace]
Fa-Ting Hong<sup>1,2</sup>, Zunnan Xu<sup>2,3</sup>, Zixiang Zhou<sup>2</sup>, Jun Zhou<sup>2</sup>, Xiu Li<sup>3</sup>, Qin Lin<sup>2</sup>, Qinglin Lu<sup>2</sup>, Dan Xu<sup>1</sup>
<sup>1</sup>The Hong Kong University of Science and Technology
<sup>2</sup>Tencent
<sup>3</sup>Tsinghua University
🚩 **Updates**
🎉 Paper accepted at ICCV 2025!
☑ arXiv paper is available here
🔧 Project Status: We are continuously organizing the open-source release. Pre-trained checkpoints will be released gradually. Stay tuned!
- Python: 3.10
- CUDA: 11.8 (recommended) or 12.1+
- GPU Memory: 24GB+ VRAM (H100/A100 recommended)
- System RAM: 32GB+
- Storage: 20GB+ available space
- OS: Linux (Ubuntu 20.04+ recommended)
- FFmpeg: Required for video processing
| GPU Model | VRAM | Inference Speed | Batch Size | Notes |
|---|---|---|---|---|
| NVIDIA H100 | 80GB | ~6min/25steps | 1 | Recommended |
| NVIDIA A100 | 40GB | ~8min/25steps | 1 | Good performance |
| NVIDIA RTX 4090 | 24GB | ~12min/25steps | 1 | Minimum requirement |
| NVIDIA RTX 3090 | 24GB | ~15min/25steps | 1 | Usable but slow |
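Before installing anything, it can help to confirm that your GPU, driver, and CUDA toolkit are visible and meet the VRAM guideline above. These are generic checks, not ACTalker-specific scripts:

```bash
# Report GPU name, total VRAM, and driver version (generic sanity check)
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv

# Report the locally installed CUDA compiler version, if any
nvcc --version
```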
ACTalker uses Mamba SSM, which has strict environment requirements!
✅ Recommended Configuration (Tested Successfully):
- CUDA 11.8 + PyTorch 2.0.1 + mamba-ssm 1.2.0.post1
- This combination ensures full Mamba SSM compatibility
❌ Known Issues:
- CUDA 12.x versions may have compatibility issues with mamba-ssm
- PyTorch 2.1+ may cause dependency conflicts
- Incorrect version combinations will cause inference failures
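For reference, a minimal sketch of pinning this tested stack by hand might look like the following. The exact torchvision pairing and the causal-conv1d dependency are assumptions here; the installation script described below remains the supported path:

```bash
# Sketch only: pin the tested CUDA 11.8 + PyTorch 2.0.1 + mamba-ssm 1.2.0.post1 stack.
# Not a substitute for install_actalker.sh.
conda create -n actalker python=3.10 -y
conda activate actalker

# PyTorch 2.0.1 built against CUDA 11.8 (torchvision 0.15.2 is the matching release)
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118

# Mamba SSM and its CUDA kernel dependency (causal-conv1d requirement is an assumption)
pip install causal-conv1d
pip install mamba-ssm==1.2.0.post1
```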
Before installation, make sure you have FFmpeg and libx264 installed:
Ubuntu/Debian:
sudo apt update
sudo apt install ffmpeg libx264-dev

CentOS/RHEL:
sudo yum install epel-release
sudo yum install ffmpeg x264-devel

Conda (Alternative):
conda install -c conda-forge ffmpeg x264

git clone https://github.com/harlanhong/ACTalker.git
cd ACTalker
bash install_actalker.sh

This script will automatically install all compatible dependency versions, including Mamba SSM.
conda env create -f environment.yaml
conda activate actalker

After installation, run the complete environment tests:
# Verify FFmpeg installation
ffmpeg -version
# Verify basic environment
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
# Verify Mamba SSM
python -c "from src.models.base.mamba_layer import MAMBA_AVAILABLE; print(f'Mamba available: {MAMBA_AVAILABLE}')"
# Run environment test script
python test_environment.py

ACTalker requires pretrained models to function properly. The main model needed is Stable Video Diffusion.
- Model: stabilityai/stable-video-diffusion-img2vid-xt-1-1 (see the download sketch below)
- Purpose: Image-to-video generation backbone
- Size: ~9.5GB
- License: Requires agreement to the Stability AI Community License
- Model: ACTalker pretrained checkpoints (release in progress; see Project Status above)
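As a rough sketch, the SVD backbone can be fetched with the Hugging Face CLI after accepting the license on the model page and logging in; the local target directory below is only an example and should match whatever path your config expects:

```bash
# Log in with a token whose account has accepted the Stability AI Community License
huggingface-cli login

# Download the image-to-video backbone (local path is illustrative)
huggingface-cli download stabilityai/stable-video-diffusion-img2vid-xt-1-1 \
    --local-dir pretrained_models/stable-video-diffusion-img2vid-xt-1-1
```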
We propose ACTalker, an end-to-end video diffusion framework for talking head synthesis that supports both single and multi-signal control (e.g., audio, pose, expression). ACTalker uses a parallel mamba-based architecture with a gating mechanism to assign different control signals to specific facial regions, ensuring fine-grained and conflict-free generation. A mask-drop strategy further enhances regional independence and control stability. Experiments show that ACTalker produces natural, synchronized talking head videos under various control combinations.
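As a purely conceptual illustration of the idea above (parallel control branches, a gate that weights them, region masks, and dropping branches during training), here is a toy PyTorch sketch. It is not the ACTalker architecture: the layer shapes, gating form, and drop rate are all invented for illustration.

```python
# Toy sketch ONLY: gated, region-masked fusion of two control branches.
# This is NOT ACTalker's implementation; all sizes and choices are illustrative.
import torch
import torch.nn as nn

class ToyGatedRegionalFusion(nn.Module):
    def __init__(self, dim: int, drop_prob: float = 0.3):
        super().__init__()
        self.audio_branch = nn.Linear(dim, dim)   # stand-in for an audio-driven branch
        self.visual_branch = nn.Linear(dim, dim)  # stand-in for an expression/pose branch
        self.gate = nn.Linear(dim, 2)             # per-token weights over the two branches
        self.drop_prob = drop_prob                # chance of dropping a branch while training

    def forward(self, x, audio_mask, visual_mask):
        # x: (B, N, dim) latent tokens; masks: (B, N, 1) facial-region masks in {0, 1}
        a = self.audio_branch(x) * audio_mask     # restrict each signal to its region
        v = self.visual_branch(x) * visual_mask
        if self.training:                         # crude stand-in for a mask-drop style regularizer
            keep = (torch.rand(x.size(0), 1, 2, device=x.device) > self.drop_prob).float()
            a, v = a * keep[..., :1], v * keep[..., 1:]
        w = torch.softmax(self.gate(x), dim=-1)   # gate decides each branch's contribution
        return x + w[..., :1] * a + w[..., 1:] * v

tokens = torch.randn(2, 16, 64)
region = torch.randint(0, 2, (2, 16, 1)).float()
print(ToyGatedRegionalFusion(64)(tokens, region, 1 - region).shape)  # torch.Size([2, 16, 64])
```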
Demo videos:
- emoji_natural.mp4
- singing.mp4
- multimodal1.mp4
- multimodal4.mp4
Use the following command for inference testing:
CUDA_VISIBLE_DEVICES=0 python Inference.py --config config/inference.yaml --ref assets/ref.jpg --audio assets/audio.mp3 --video assets/video.mp4 --mode 2

Parameter Description:
- `--config`: Configuration file path
- `--ref`: Reference image path
- `--audio`: Audio file path
- `--video`: Video file path
- `--mode`: Inference mode
  - `0`: Audio-only driven
  - `1`: VASA control only
  - `2`: Audio + Expression joint control
- `--exp_name`: Experiment name
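For example, an audio-only run (mode 0) with a named experiment might look like this; the asset paths and experiment name are placeholders, and since it is unclear whether `--video` can be omitted in this mode, it is kept here:

```bash
# Audio-only driving (mode 0); paths and experiment name are illustrative placeholders
CUDA_VISIBLE_DEVICES=0 python Inference.py \
    --config config/inference.yaml \
    --ref assets/ref.jpg \
    --audio assets/audio.mp3 \
    --video assets/video.mp4 \
    --mode 0 \
    --exp_name audio_only_demo
```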
Optimization Tips:
- Use the `--overlap` parameter for segmented processing of long videos (see the sketch after this list)
- Adjust `--frames_per_batch` to fit the available VRAM
- Use mixed precision to accelerate inference
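A sketch of how these flags might be combined for a long clip on a 24GB card follows; the specific values are assumptions, so check the argument parser in Inference.py for the authoritative options and defaults:

```bash
# Illustrative values only: segment a long video and reduce frames per batch to fit 24GB VRAM
CUDA_VISIBLE_DEVICES=0 python Inference.py \
    --config config/inference.yaml \
    --ref assets/ref.jpg \
    --audio assets/audio.mp3 \
    --video assets/video.mp4 \
    --mode 2 \
    --overlap 4 \
    --frames_per_batch 8 \
    --exp_name long_video_demo
```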
If you have any questions or collaboration requests (for research or commercial purposes), please email fhongac@connect.ust.hk.
Please feel free to leave a star ⭐️⭐️⭐️ and cite our paper:
@inproceedings{hong2025audio,
title={Audio-visual controlled video diffusion with masked selective state spaces modeling for natural talking head generation},
author={Hong, Fa-Ting and Xu, Zunnan and Zhou, Zixiang and Zhou, Jun and Li, Xiu and Lin, Qin and Lu, Qinglin and Xu, Dan},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2025}
}