Stars
Pure C inference of the Mistral Voxtral Realtime 4B speech-to-text model
Unofficial implementation of the training pipeline in mimo-tokenizer from "MiMo-Audio: Audio Language Models are Few-Shot Learners"
DFlash: Block Diffusion for Flash Speculative Decoding
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Write scalable load tests in plain Python 🚗💨
A powerful 3B-parameter, LLM-based reinforcement learning audio editing model that excels at editing emotion, speaking style, and paralinguistics, and features robust zero-shot text-to-speech
Use PEFT or full-parameter training for CPT/SFT/DPO/GRPO on 600+ LLMs (Qwen3, Qwen3-MoE, DeepSeek-R1, GLM4.5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, …)
Training, inference, and testing of the SAC speech codec model.
VibeVoice: Expressive, longform conversational speech synthesis. (Community fork)
LongCat Audio Tokenizer and Detokenizer
MOSS-Speech is a true speech-to-speech large language model without text guidance.
MiMo-Audio: Audio Language Models are Few-Shot Learners
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Official Repository of Paper: "Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annotation for Human-Like Speech Modeling"
[ICLR2026] AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
Automatically update text-to-speech (TTS) papers daily using GitHub Actions (updates every 12 hours)
VoiceStar: Robust, Duration-controllable TTS that can Extrapolate
[ICCV2025] TokenBridge: Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation. https://yuqingwang1029.github.io/TokenBridge
A fundamental toolkit designed for music, song, and audio generation
[ICLR 2025] SOTA discrete acoustic codec models with 40/75 tokens per second for audio language modeling