Stars
Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and performing real-time speech generation.
A SOTA Industrial-Grade All-in-One ASR system with ASR, VAD, LID, and Punc modules. FireRedASR2 supports Chinese (Mandarin, 20+ dialects/accents), English, code-switching, and both speech and singi…
SpeechJudge: Towards Human-Level Judgment for Speech Naturalness (https://arxiv.org/abs/2511.07931)
Qwen3-ASR is an open-source series of ASR models developed by the Qwen team at Alibaba Cloud, supporting stable multilingual speech/music/song recognition, language detection and timestamp prediction.
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice…
Official repository for the WenetSpeech-Chuan dataset.
SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.
Sequential Diffusion Language Model (SDLM) enhances pre-trained autoregressive language models by adaptively determining generation length and maintaining KV-cache compatibility, achieving high eff…
verl: Volcano Engine Reinforcement Learning for LLMs
VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
SpeechIO Leaderboard: a large, robust, comprehensive benchmarking platform for Automatic Speech Recognition.
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
NISQA - Non-Intrusive Speech Quality and TTS Naturalness Assessment
MOSS-TTSD is a spoken dialogue generation model designed for expressive multi-speaker synthesis. It features long-context modeling, flexible speaker control, and multilingual support, while enablin…
Code for DeSTA2.5-Audio, a general-purpose large audio-language model (LALM)
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
Text-audio foundation model from Boson AI
Foundational Models for State-of-the-Art Speech and Text Translation
Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.5, DeepSeek, gpt-oss locally.
Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' & 'Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning'
LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis