Stars
Stop-To-Ask-Questions-The-Stupid-Ways
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal domains, for both inference and training.
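As a quick illustration of the library's high-level interface, here is a minimal sketch using its `pipeline` API; the task string and input text are illustrative assumptions, and the default checkpoint is whatever the library selects:

```python
# Minimal sketch: run inference through the transformers pipeline API.
# The task name and example input are illustrative assumptions.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default checkpoint
print(classifier("Streaming speech translation is finally practical."))
# -> [{'label': 'POSITIVE', 'score': ...}]
```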
Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and of performing real-time speech generation.
Hibiki is a model for streaming speech translation (also known as simultaneous translation). Unlike offline translation, where one waits for the end of the source utterance to start translating, Hibiki…
A trainer for SNAC (Multi-Scale Neural Audio Codec) with the decoder replaced by Vocos.
State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio.
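For context, the 90x figure is consistent with a back-of-the-envelope check, assuming 16-bit PCM mono at 44.1 kHz as the uncompressed baseline (my assumption, not stated above):

$$44.1\,\text{kHz} \times 16\,\text{bit} = 705.6\ \text{kbps}, \qquad 705.6\ \text{kbps} / 90 \approx 7.8\ \text{kbps}$$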
Self-supervised speech representations
PyTorch implementation of JiT https://arxiv.org/abs/2511.13720
[CVPR2025 Highlight] Video Generation Foundation Models: https://saiyan-world.github.io/goku/
The official implementation of [NeurIPS 2025 Oral] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
(NeurIPS 2025) Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time, end-to-end speech input and streaming audio output for conversation.
Multi-Scale Neural Audio Codec (SNAC) compresses audio into discrete codes at a low bitrate
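A minimal sketch of round-tripping audio through SNAC, assuming the `snac` Python package's documented `from_pretrained`/`encode`/`decode` interface; the checkpoint name and tensor shapes are assumptions:

```python
# Minimal sketch: encode audio to discrete codes and decode back with SNAC.
# Checkpoint name and shapes are assumptions based on the package docs.
import torch
from snac import SNAC

model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
audio = torch.randn(1, 1, 24000)  # (batch, channels, samples): 1 s at 24 kHz
with torch.inference_mode():
    codes = model.encode(audio)      # list of code tensors, one per temporal scale
    audio_hat = model.decode(codes)  # waveform reconstructed from the codes
```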
[Official Implementation] Acoustic Autoregressive Modeling 🔥
[ACL 2025] OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching
Multilingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.
[NeurIPS 2025 Oral] Infinity⭐️: Unified Spacetime AutoRegressive Modeling for Visual Generation
Sylber: Syllabic Embedding Representation of Speech from Raw Audio
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
DiFlow-TTS delivers low-latency zero-shot TTS via discrete flow matching and factorized speech tokens. A compact, open framework for fast voice synthesis. 🐙
High-performance Image Tokenizers for VAR and AR
[ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
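Based on my reading of the representation-alignment idea (pulling intermediate diffusion-transformer features toward those of a frozen pretrained visual encoder), here is a rough sketch of such an alignment term; the function name, projection module, and shapes are illustrative assumptions, not the authors' code:

```python
# Rough sketch of a representation-alignment loss in the spirit of the paper:
# align projected DiT hidden states with frozen pretrained encoder features.
# `proj` (a trainable MLP) and all shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def alignment_loss(hidden: torch.Tensor, target: torch.Tensor,
                   proj: torch.nn.Module) -> torch.Tensor:
    z = F.normalize(proj(hidden), dim=-1)  # project DiT features into the encoder's space
    t = F.normalize(target, dim=-1)        # frozen features from the pretrained encoder
    return -(z * t).sum(dim=-1).mean()     # negative cosine similarity over patch tokens

# This term would be added to the usual denoising objective with a weighting coefficient.
```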
Advanced GRAG implementation for ComfyUI with beginner-friendly and expert modes
https://little-misfit.github.io/GRAG-Image-Editing/