Stars
Vibe Workflow Platform for Non-technical Creators.
Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
MultiModal Pairwise Constrained Speaker Diarization System
🎓 Update Talking-Face Research Papers Daily
A toolkit for speaker diarization.
A Fully Self-Hosted Solution for Full-Duplex Voice Interaction
Efficient audio understanding with general audio captions
Voice Activity Detector (VAD): low-latency, high-performance, and lightweight
Script to demonstrate how to use a Language Model for Semantic Turn Detection. Refer to blog post for full details.
The implementation of "X-TF-GridNet: A Time-Frequency Domain Target Speaker Extraction Network with Adaptive Speaker Embedding Fusion", accepted by Information Fusion.
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Deep Xi: A deep learning approach to a priori SNR estimation implemented in TensorFlow 2/Keras. For speech enhancement and robust ASR.
Exa MCP for web search and web crawling!
Source code for "Engineering Deep Learning Platforms"
A TTS model capable of generating ultra-realistic dialogue in one pass.
Open singing synthesis platform / Open source UTAU successor
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3, Qwen3-MoE, DeepSeek-R1, GLM4.5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, …
End-to-end realtime stack for connecting humans and AI
openvpi / DiffSinger
Forked from MoonInTheRiver/DiffSinger. An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
DALI: a large Dataset of synchronised Audio, LyrIcs and vocal notes.
A tool for real-time lyrics alignment and visualization, integrating audio processing, phoneme-level synchronization, and interactive variable font typography.
A Conversational Speech Generation Model
Zero-Shot Speech Editing and Text-to-Speech in the Wild
Vector (and Scalar) Quantization, in Pytorch
Di♪♪Rhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion