Starred repositories
YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
ASLP-lab / DiffRhythm2
Forked from xiaomi-research/diffrhythm2
Di♪♪Rhythm 2: Efficient And High Fidelity Song Generation Via Block Flow Matching
WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
First foundation ASR built for the real world - 7 atomic acoustic conditions, 54 compound scenarios, 2.6M samples, and up to ~30% gains over SOTA where every other model falls apart. **You'll come …
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
Open-source voice AI platform. Self-hosted alternative to Vapi and Retell. On-prem, BYOK across speech-to-speech or LLM/STT/TTS, with a visual workflow builder, native MCP support, and telephony support.
[ICLR 2026 Oral] DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Elucidated Text-To-Audio (ETTA) is a SOTA text-to-audio model with a holistic understanding of the design space and trained with synthetic captions.
MultiModal Audio Generation in Raw Waveform Space.
LLaDA2.0 is the diffusion language model series developed by InclusionAI team, Ant Group.
🎙️ "Large model": a 0.1B omni-modal (Omni) model trained from scratch, capable of listening, speaking, and seeing!
MoshiRAG is a compact full-duplex speech language model augmented with asynchronous knowledge retrieval to improve factuality without sacrificing real-time interactivity.
MOSS-Music is an open-source music understanding model targeting music captioning, lyrics ASR, structural analysis, chord / key / tempo reasoning, and long-form musical question answering.
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Ming-omni-tts: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control
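https://adongwanai.github.io/AgentGuide | AI Agent Development Guide | LangGraph in Practice | Advanced RAG | Career Switch to LLMs | LLM Interviews | Algorithm Engineer | Interview Question Bank | Reinforcement Learning | Data Synthesis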
OmniCodec: Low Frame Rate Universal Audio Codec with Semantic–Acoustic Disentanglement
Bash is all you need - A nano Claude Code-like "agent harness", built from 0 to 1