Stars
Official implementation of "Continuous Autoregressive Language Models"
[EMNLP 2025] LightThinker: Thinking Step-by-Step Compression
Parallel Continuous Chain-of-Thought with Jacobi Iteration. Accepted to EMNLP 2025.
This is the official repo for the paper "LongCat-Flash-Omni Technical Report"
Official code for "Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis"
A unified tokenizer that is capable of both extracting semantic information and enabling high-fidelity audio reconstruction.
Finetune Sesame AI's conversational speech model on new languages and voices. Blog post: https://blog.speechmatics.com/sesame-finetune
Elucidated Text-To-Audio (ETTA) is a SOTA text-to-audio model with a holistic understanding of the design space and trained with synthetic captions.
Long-form streaming TTS system for multi-speaker dialogue generation
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
VoXtream is a Full-Stream Zero-shot TTS model with Extremely Low Latency
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
This repository contains a series of works on diffusion-based speech tokenizers, including the official implementation of the paper: "TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Lan…
Enjoy the magic of Diffusion models!
FlashCosyVoice: A lightweight vLLM implementation built from scratch for CosyVoice.
Python implementation of performance metrics in Loizou's Speech Enhancement book
The Python library for real-time communication
[NeurIPS 2025] An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
Text-audio foundation model from Boson AI
Official code for "F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization"
Jupiter Python SDK is a Python library that lets you use most of Jupiter's features.
A complete cross-modal RAG system for end-to-end speech-to-speech large models, including ASR-based Retrieval and E2E Retrieval.
Implementation of the dynamic chunking mechanism in H-Net by Hwang et al. of Carnegie Mellon