Starred repositories
SSR-Speech: Towards Stable, Safe and Robust Zero-shot Speech Editing and Synthesis
Fun-Audio-Chat is a Large Audio Language Model built for natural, low-latency voice interactions.
A Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows
SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.
ASLP-lab / DiffRhythm2
Forked from xiaomi-research/diffrhythm2
Di♪♪Rhythm 2: Efficient And High Fidelity Song Generation Via Block Flow Matching
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
Trinity-RFT is a general-purpose, flexible, and scalable framework designed for reinforcement fine-tuning (RFT) of large language models (LLMs).
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
Text-audio foundation model from Boson AI
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3, Qwen3-MoE, DeepSeek-R1, GLM4.5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, …
MOSS-TTSD is a spoken dialogue generation model that enables expressive dialogue speech synthesis in both Chinese and English, supporting zero-shot multi-speaker voice cloning, and long-form speech…
Train transformer language models with reinforcement learning.
An elegant PyTorch deep reinforcement learning library.
PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator (NeurIPS 2024)
A PyTorch native platform for training generative AI models
A native-PyTorch library for large-scale M-LLM (text/audio) training with tp/cp/dp.
A feature-rich command-line audio/video downloader
LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
Lets developers call inference models from v1 through v3 via an API, with support for streaming transmission and transfer of files of a specified type.
OSUM & OSUM-EChat: an open speech understanding model and an empathetic spoken chatbot built on it, open-sourced by ASLP@NPU.
Unified automatic quality assessment for speech, music, and sound.
A PyTorch library for implementing flow matching algorithms, featuring continuous and discrete flow matching implementations. It includes practical examples for both text and image modalities.
An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.