Stars
A streaming audio reader, processor, and writer built on top of soundfile and PyAV (bindings for FFmpeg)
RealSI: Open Benchmark for Simultaneous Interpretation in Real-world Scenarios
[AAAI 2026] Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback
EtrajEval: Official framework for emotional support evaluation in language models, from the paper "Detecting Emotional Dynamic Trajectories: An Evaluation Framework for Emotional Support in Languag…
LongCat Audio Tokenizer and Detokenizer
MiMo-Audio: Audio Language Models are Few-Shot Learners
Text Normalization & Inverse Text Normalization
OSUM & OSUM-EChat, open speech understanding model and empathetic spoken chatbot based on it, open-sourced by ASLP@NPU.
Pseudo Streaming SenseVoice with Hotwords
Official code for "EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting"
Multilingual Voice Understanding Model
An evaluation benchmark for emotional companionship dialogue systems, built on the PQAEF framework (https://github.com/QuwanAI/PQAEF)
A toolkit for processing speech data and creating speech datasets
FlashCosyVoice: A lightweight vLLM implementation built from scratch for CosyVoice.
CosyVoice_DPO_NOTES: Supercharge Your CosyVoice model with Cutting-Edge DPO Fine-Tuning!
[AAAI 2026] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
Text-audio foundation model from Boson AI
MOSS-TTSD is a spoken dialogue generation model designed for expressive multi-speaker synthesis. It features long-context modeling, flexible speaker control, and multilingual support, while enablin…
An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Official PyTorch implementation of BigVGAN (ICLR 2023)
Variational Autoencoder (VAE) with Normalizing Flows
Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.
No fortress, purely open ground. OpenManus is Coming.
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
A native-PyTorch library for large scale M-LLM (text/audio) training with tp/cp/dp.
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audi…
🚀 Truly open-source AI avatar (digital human) toolkit for offline video generation and digital human cloning.