Stars
Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.
MooER: Moore-threads Open Omni model for speech-to-speech intERaction. MooER-omni includes a series of end-to-end speech interaction models along with training and inference code, covering but not …
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Hibiki is a model for streaming speech translation (also known as simultaneous translation). Unlike offline translation, where one waits for the end of the source utterance to start translating, H…
Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement
Silero VAD: pre-trained enterprise-grade Voice Activity Detector
StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation
Add n-gram and LLM language model support to HF Transformers Whisper models.
Add n-gram and large language model (LLM) support to Whisper models.
Noise suppression using deep filtering
Score-based Generative Models (Diffusion Models) for Speech Enhancement and Dereverberation
Robust One-step Speech Enhancement via Consistency Distillation (ROSE-CD) (IEEE WASPAA, oral)
The official implementation of GTCRN, an ultra-lightweight SE model.
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
Very low latency speech to text, intent recognition, and text to speech, for building voice agents and interfaces
Reading list for research topics in Sound AI
rsxdalv / VibeVoice
Forked from microsoft/VibeVoice
Frontier Open-Source Text-to-Speech
Simultaneous speech-to-text models
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models
Real-time voice assistant — WebRTC streaming, faster-whisper ASR, local LLM, Vui Nano (300M) TTS. OpenAI Realtime API compatible. Voice cloning, barge-in, ~9× realtime on a 4090. Apache 2.0.
The hub for audio AI research: papers, open models, benchmarks & datasets across audio LLMs, speech recognition, TTS, music & audio generation.
The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.
The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud.
SALMONN family: A suite of advanced multi-modal LLMs