Stars
Native full-duplex speech dialogue inference for BayLing-Duplex.
An end-to-end framework for multi-speaker transcription that jointly models who spoke, when, and what.
Real-Time Streamable Generative Speech Restoration with Flow Matching
Whisfusion: Parallel ASR Decoding via a Diffusion Transformer
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
a collection of skills for vllm-omni
MOSS-Audio is an open-source foundation model for unified audio understanding, enabling speech, sound, music, captioning, QA, and reasoning in real-world scenarios.
Research about Voxtral, its codebooks and an attempt to reconstruct codes for a known audio
🌋LavaSR: Fast Speech restoration and enhancement
MOSS-TTS-Nano is an open-source multilingual tiny speech generation model from MOSI.AI and the OpenMOSS team. With only 0.1B parameters, it is designed for realtime speech generation, can run direc…
The official implementation of GTCRN, an ultra-lightweight SE model.
lifeiteng / mlx-vlm
Forked from Blaizzy/mlx-vlmMLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
VITA-QINYU: Expressive Spoken Language Model for Role-Playing and Singing
An agent-managed museum exhibit, built in Rust with Gajae-Code / LazyCodex — developed and maintained with no human intervention.
Voxtral Codec : Combining Semantic VQ and Acoustic FSQ for Ultra-Low Bitrate Speech Generation (Voxtral TTS Backbone)
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
Pure C inference of Mistral Voxtral Realtime 4B speech to text model
Unofficial implementation of training pipeline in mimo-tokenizer about "MiMo-Audio: Audio Language Models are Few-Shot Learners"
DFlash: Block Diffusion for Flash Speculative Decoding
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Write scalable load tests in plain Python 🚗💨
A powerful 3B-parameter, LLM-based Reinforcement Learning audio edit model excels at editing emotion, speaking style, and paralinguistics, and features robust zero-shot text-to-speech