Starred repositories
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Ming-omni-tts: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control
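https://adongwanai.github.io/AgentGuide | AI Agent Development Guide | LangGraph in Practice | Advanced RAG | Career Switch to LLMs | LLM Interviews | Algorithm Engineer | Interview Question Bank | Reinforcement Learning | Data Synthesis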
OmniCodec: Low Frame Rate Universal Audio Codec with Semantic–Acoustic Disentanglement
Bash is all you need - A nano claude code–like 「agent harness」, built from 0 to 1
Plug-and-play streaming semantic VAD for real-time full-duplex spoken dialogue systems.
Easy fine-tuning for Qwen3-TTS: Fast voice cloning and high-quality multilingual speech synthesis.
Pre-training, SFT, DPO and GRPO for Text-to-Audio Generation
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
ChatTTS 2K Speaker Stability Scores 🥇, Categorized by Gender and Age 👧, with Online Audio Preview 🔈
Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice…
This challenge focuses on evaluating speech recognition and semantic understanding capabilities of AI glasses in complex real-world environments.
Code for Latent Speech-Text Transformer (LST)
SOTA industrial-grade voice activity detection and audio event detection, supporting 100+ languages and outperforming Silero-VAD, TEN-VAD, FunASR-VAD and WebRTC-VAD
An Open-Source Multidimension Speech Understanding Foundation Model Built upon OpenPangu on Ascend NPUs
A curated list of full-duplex spoken dialogue models & benchmarks
FLM-Audio is an audio-language sub-version of RoboEgo/FLM-Ego -- an omnimodal model with native full duplexity.
WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models
Official implementation of a reverberant-speech-to-room-impulse-response estimator
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
Unofficial Implementation of MiniMax-Speech
Local-first Suno-style music studio powered by ACE-Step 1.5.
Official repository for the paper "Audio ControlNet for Fine-Grained Audio Generation and Editing".
Reinforcement Learning via Self-Distillation (SDPO)
Qwen3-ASR is an open-source series of ASR models developed by the Qwen team at Alibaba Cloud, supporting stable multilingual speech/music/song recognition, language detection and timestamp prediction.
UniAudio 2.0: An audio foundation model for text, speech, sound, and music