Stars
SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.
Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
Scaled diffusion transformer for text-to-speech synthesis (DiT + T5Gemma2 conditioning, TorchTitan & Megatron backends, tested up to 1024 GPUs)
YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
OmniCodec: Low Frame Rate Universal Audio Codec with Semantic–Acoustic Disentanglement
M7-TTS: A Mini-Scale Multilingual and Multi-Dialect Text-to-Speech Language Model with Mimi codec and Multi Token Prediction
This challenge focuses on evaluating speech recognition and semantic understanding capabilities of AI glasses in complex real-world environments.
An Open-Source Multidimension Speech Understanding Foundation Model Built upon OpenPangu on Ascend NPUs
A Large-scale Wu Dialect Speech Corpus with Multi-dimensional Annotations
An instruct text-to-speech solution based on LLaSA and CosyVoice2 developed by the ASLP lab and collaborators.
ASLP-lab / DiffRhythm2
Forked from xiaomi-research/diffrhythm2
Di♪♪Rhythm 2: Efficient And High Fidelity Song Generation Via Block Flow Matching
A Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows
Open-Source Turn-Taking Detection Model and Dataset for Full-Duplex Spoken Dialogue Systems
Official repository for the WenetSpeech-Chuan dataset.
A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
Open repository of "MSU-Bench: Towards Understanding the Conversational Multi-Speaker Scenarios"
A Massive Contextual Speech Recognition Benchmark.
A song aesthetic evaluation toolkit trained on SongEval.