Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.

Python 18,144 1,857 Updated Jun 16, 2026

MooER: Moore-threads Open Omni model for speech-to-speech intERaction. MooER-omni includes a series of end-to-end speech interaction models along with training and inference code, covering but not …

Python 221 17 Updated Jan 8, 2025
Python 273 28 Updated May 19, 2025

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

Python 3,141 223 Updated May 19, 2025

Hibiki is a model for streaming speech translation (also known as simultaneous translation). Unlike offline translation, where one waits for the end of the source utterance to start translating, H…

Rust 1,470 118 Updated Apr 15, 2025

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

Python 490 74 Updated May 19, 2025

Silero VAD: pre-trained enterprise-grade Voice Activity Detector

Python 9,342 785 Updated Mar 26, 2026

StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation

Python 254 33 Updated Sep 13, 2024

Add n-gram and LLM language model support to HF Transformers Whisper models.

Python 14 2 Updated May 6, 2025

KenLM: Faster and Smaller Language Model Queries

C++ 2,779 545 Updated Mar 30, 2025

Add n-gram and large language model (LLM) support to Whisper models.

Jupyter Notebook 43 4 Updated May 6, 2025

Noise suppression using deep filtering

Python 4,335 472 Updated Oct 17, 2024
Python 31 3 Updated Jan 9, 2024

Score-based Generative Models (Diffusion Models) for Speech Enhancement and Dereverberation

Python 757 106 Updated May 12, 2026

Robust One-step Speech Enhancement via Consistency Distillation (ROSE-CD) (IEEE WASPAA Oral)

Python 10 Updated Jun 7, 2026

The official implementation of GTCRN, an ultra-lightweight SE model.

Python 670 111 Updated Jan 18, 2026

Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.

Python 10,407 969 Updated May 16, 2026
Python 620 91 Updated Jun 8, 2026

Very low latency speech to text, intent recognition, and text to speech, for building voice agents and interfaces

C 8,482 459 Updated Jun 2, 2026

Reading list for research topics in Sound AI

198 9 Updated Aug 8, 2024

Frontier Open-Source Text-to-Speech

Python 132 129 Updated Sep 9, 2025

Simultaneous speech-to-text models

Python 10,452 1,082 Updated Jun 12, 2026

[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Python 2,209 259 Updated Feb 23, 2026

PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models

1,144 96 Updated Dec 15, 2025

Real-time voice assistant — WebRTC streaming, faster-whisper ASR, local LLM, Vui Nano (300M) TTS. OpenAI Realtime API compatible. Voice cloning, barge-in, ~9× realtime on a 4090. Apache 2.0.

Python 701 72 Updated Jun 12, 2026

The hub for audio AI research: papers, open models, benchmarks & datasets across audio LLMs, speech recognition, TTS, music & audio generation.

Python 933 48 Updated Jun 15, 2026

Updates ASR papers every day

Python 515 24 Updated May 16, 2026

The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.

Python 2,081 165 Updated Apr 21, 2025

The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud.

Python 1,903 145 Updated Jul 5, 2024

SALMONN family: A suite of advanced multi-modal LLMs

1,450 115 Updated May 26, 2026