Stars
GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
5Hz Deep-Compression Speech VAE for AR-Diffusion and CALMs
PyTorch implementation of JiT (https://arxiv.org/abs/2511.13720)
Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
This is the official repo for the paper "LongCat-Flash-Omni Technical Report"
Proxmox VE Helper-Scripts (Community Edition)
Data Pipeline, Models, and Benchmark for Omni-Captioner.
Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"
Exploration into Discrete Distribution Network, by Lei Yang out of Beijing
Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
Official code for "Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis"
Official Repository of Paper: "Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annotation for Human-Like Speech Modeling"
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
A transformer-based LLM, written completely in Rust.
MiMo-Audio: Audio Language Models are Few-Shot Learners
Flash Attention Triton kernel with support for second-order derivatives
[ICCV 2025] Implementation of the paper "Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs"