Starred repositories
OpenClaw Desktop Assistant MVP - Electron-based AI voice assistant with Live2D character animations, real-time speech recognition, and text-to-speech
VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning
Added vLLM support to IndexTTS for faster inference.
Deprecated: the Web Neural Network Polyfill project has moved to https://github.com/webmachinelearning/webnn-polyfill
Production First and Production Ready End-to-End Keyword Spotting Toolkit
GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning
Use Microsoft Edge's online text-to-speech service from Python WITHOUT needing Microsoft Edge or Windows or an API key
[ACM MM 2025] Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis
Silero VAD: pre-trained enterprise-grade Voice Activity Detector
Real-time interactive streaming digital human
User-friendly AI Interface (Supports Ollama, OpenAI API, ...)
[CVPR 2025] MatAnyone: Stable Video Matting with Consistent Memory Propagation
[CVPR 2024] Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution
[ICCV 2025] STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
Singing Voice Conversion via diffusion model
SoftVC VITS Singing Voice Conversion
[ICLR 2025] DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance
A Trimap-Free Portrait Matting Solution in Real Time [AAAI 2022]
Real-Time High-Resolution Background Matting
[CVPR 2025] EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
[ICLR 2025] Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation
Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
High-Fidelity Lip-Syncing with Wav2Lip and Real-ESRGAN
[ICLR 2025 Oral] TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio-Motion Embedding and Diffusion Interpolation