
📖 Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation (ICCV 2025)

[Paper]   [Project Page]   [HuggingFace]

Fa-Ting Hong1,2, Zunnan Xu2,3, Zixiang Zhou2, Jun Zhou2, Xiu Li3, Qin Lin2, Qinglin Lu2, Dan Xu1
1The Hong Kong University of Science and Technology
2Tencent
3Tsinghua University

🚩 Updates

🎉 Paper accepted at ICCV 2025!

☑ The arXiv paper is available via the [Paper] link above

🔧 Project Status: We are continuously organizing the open-source release. Pre-trained checkpoints will be released gradually. Stay tuned!

[Figure: ACTalker framework overview]

⚙️ Installation

System Requirements

  • Python: 3.10
  • CUDA: 11.8 (recommended) or 12.1+
  • GPU Memory: 24GB+ VRAM (H100/A100 recommended)
  • System RAM: 32GB+
  • Storage: 20GB+ available space
  • OS: Linux (Ubuntu 20.04+ recommended)
  • FFmpeg: Required for video processing
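
A few quick commands can confirm the host meets these requirements (note that nvcc is only present when the CUDA toolkit itself is installed):

# CUDA toolkit version
nvcc --version

# System RAM and free disk space
free -h
df -h .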

Hardware Performance

| GPU Model | VRAM | Inference Speed | Batch Size | Notes |
| --- | --- | --- | --- | --- |
| NVIDIA H100 | 80GB | ~6 min / 25 steps | 1 | Recommended |
| NVIDIA A100 | 40GB | ~8 min / 25 steps | 1 | Good performance |
| NVIDIA RTX 4090 | 24GB | ~12 min / 25 steps | 1 | Minimum requirement |
| NVIDIA RTX 3090 | 24GB | ~15 min / 25 steps | 1 | Usable but slow |
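
To confirm that your GPU and available VRAM match the table above, the following checks can help (they assume the NVIDIA driver is installed and, for the second command, that PyTorch already sees a CUDA device):

# Report GPU model and total VRAM via the NVIDIA driver
nvidia-smi --query-gpu=name,memory.total --format=csv

# Cross-check the device and VRAM that PyTorch sees
python -c "import torch; p = torch.cuda.get_device_properties(0); print(p.name, p.total_memory // 2**30, 'GB')"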

⚠️ Important Notes

Mamba SSM Compatibility

ACTalker uses Mamba SSM, which has strict environment requirements!

Recommended Configuration (Tested Successfully):

  • CUDA 11.8 + PyTorch 2.0.1 + mamba-ssm 1.2.0.post1
  • This combination ensures full Mamba SSM compatibility; a manual version-pinning sketch is given after the Known Issues list below

Known Issues:

  • CUDA 12.x versions may have compatibility issues with mamba-ssm
  • PyTorch 2.1+ may cause dependency conflicts
  • Incorrect version combinations will cause inference failures
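
If you prefer to pin the tested versions manually instead of using the provided install script, a minimal sketch looks like this (the cu118 wheel index is the standard PyTorch one; whether extra packages such as causal-conv1d are needed depends on your mamba-ssm build and is not covered here):

# PyTorch 2.0.1 built against CUDA 11.8
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu118

# Mamba SSM pinned to the tested release
pip install mamba-ssm==1.2.0.post1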

Prerequisites

Before installation, make sure you have FFmpeg and libx264 installed:

Ubuntu/Debian:

sudo apt update
sudo apt install ffmpeg libx264-dev

CentOS/RHEL:

sudo yum install epel-release
sudo yum install ffmpeg x264-devel

Conda (Alternative):

conda install -c conda-forge ffmpeg x264
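
Because ffmpeg -version does not list compiled-in encoders, you can additionally confirm that your FFmpeg build provides the libx264 encoder:

# Check that the H.264 software encoder is available
ffmpeg -encoders 2>/dev/null | grep libx264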

Option 1: One-Click Auto Installation 🚀

git clone https://github.com/harlanhong/ACTalker.git
cd ACTalker
bash install_actalker.sh

This script will automatically install all compatible dependency versions, including Mamba SSM.

Option 2: Using Pre-configured Environment File

conda env create -f environment.yaml
conda activate actalker

Complete Verification

After installation, run the following checks to verify the environment:

# Verify FFmpeg installation
ffmpeg -version

# Verify basic environment
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"

# Verify Mamba SSM
python -c "from src.models.base.mamba_layer import MAMBA_AVAILABLE; print(f'Mamba available: {MAMBA_AVAILABLE}')"

# Run environment test script
python test_environment.py

📦 Pretrained Models

ACTalker requires pretrained models to run. The main backbone is Stable Video Diffusion; a download sketch for SVD-XT-1.1 follows the list below.

Required Models

1. Stable Video Diffusion (SVD-XT-1.1)

2. ACTalker pretrained checkpoints (to be released gradually; see Project Status above)
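
As a sketch, the SVD-XT-1.1 weights can be pulled from Hugging Face as shown below. The model repository is gated, so accept its license and log in first; the local directory name is illustrative, and the files should go wherever your inference config expects them:

# Log in with a Hugging Face token that has access to the gated model
huggingface-cli login

# Download SVD-XT-1.1 (target directory is illustrative)
huggingface-cli download stabilityai/stable-video-diffusion-img2vid-xt-1-1 --local-dir checkpoints/stable-video-diffusion-img2vid-xt-1-1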

TL;DR:

We propose ACTalker, an end-to-end video diffusion framework for talking head synthesis that supports both single and multi-signal control (e.g., audio, pose, expression). ACTalker uses a parallel mamba-based architecture with a gating mechanism to assign different control signals to specific facial regions, ensuring fine-grained and conflict-free generation. A mask-drop strategy further enhances regional independence and control stability. Experiments show that ACTalker produces natural, synchronized talking head videos under various control combinations.

Expression Driven Samples

emoji_natural.mp4

Audio Driven Samples

singing.mp4

Audio-Visual Driven Samples

multimodal1.mp4
multimodal4.mp4

💻 Testing (Inference)

Use the following command for inference testing:

CUDA_VISIBLE_DEVICES=0 python Inference.py --config config/inference.yaml --ref assets/ref.jpg --audio assets/audio.mp3 --video assets/video.mp4 --mode 2

Parameter Description (an example audio-only invocation follows this list):

  • --config: Configuration file path
  • --ref: Reference image path
  • --audio: Audio file path
  • --video: Video file path
  • --mode: Inference mode
    • 0: Audio-only driven
    • 1: VASA control only
    • 2: Audio + Expression joint control
  • --exp_name: Experiment name
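
For example, an audio-only run (mode 0) could look like the sketch below; whether --video may be omitted in this mode and how --exp_name affects the output naming are assumptions based on the flag descriptions above:

# Audio-only driving (mode 0); only the flags documented above are used
CUDA_VISIBLE_DEVICES=0 python Inference.py --config config/inference.yaml --ref assets/ref.jpg --audio assets/audio.mp3 --mode 0 --exp_name audio_only_demo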

Optimization Tips (see the example after this list):

  • Use the --overlap parameter for segmented processing of long videos
  • Adjust --frames_per_batch to fit your available VRAM
  • Use mixed precision to accelerate inference
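
A long-video run combining these flags might look like the following; the specific --overlap and --frames_per_batch values are illustrative and should be tuned to your clip length and available VRAM:

# Segmented long-video inference (values are illustrative, not defaults)
CUDA_VISIBLE_DEVICES=0 python Inference.py --config config/inference.yaml --ref assets/ref.jpg --audio assets/audio.mp3 --video assets/video.mp4 --mode 2 --exp_name long_video_demo --overlap 4 --frames_per_batch 16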

📧 Contact

If you have any questions or would like to collaborate (for research or commercial purposes), please email fhongac@connect.ust.hk.

📍 Citation

Please feel free to leave a star ⭐ and cite our paper:

@inproceedings{hong2025audio,
  title={Audio-visual controlled video diffusion with masked selective state spaces modeling for natural talking head generation},
  author={Hong, Fa-Ting and Xu, Zunnan and Zhou, Zixiang and Zhou, Jun and Li, Xiu and Lin, Qin and Lu, Qinglin and Xu, Dan},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}
