- Shanghai, China
- blog.csdn.net/zhulinniao
Stars
GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Official repository for the WenetSpeech-Chuan dataset.
A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
cchen1436 / NeMo
Forked from NVIDIA-NeMo/NeMo
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
We Speech Toolkit, LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
A Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows
A package for NeuCodec: a 50 Hz, 0.8 kbps, 24 kHz audio codec.
This is the official repo for the paper "LongCat-Flash-Omni Technical Report"
A unified tokenizer that is capable of both extracting semantic information and enabling high-fidelity audio reconstruction.
LongCat Audio Tokenizer and Detokenizer
MiMo-Audio: Audio Language Models are Few-Shot Learners
Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
FlashCosyVoice: A lightweight vLLM implementation built from scratch for CosyVoice.
SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.
Official code for "DiaMoE-TTS: A Unified IPA-based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation"
vits2 backbone with multilingual-bert
A barebones WebSocket client and server implementation written in 100% Java.
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM