Starred repositories
Official implementation of "USAD: Universal Speech and Audio Representation via Distillation"
Towards Scalable Pre-training of Visual Tokenizers for Generation
Official code release for the paper "One-Step Generative Modeling via Wasserstein Gradient Flows"
Official Repo of "Flow-OPD: On-Policy Distillation for Flow Matching Models"
Multimodal Audio Generation in Raw Waveform Space.
[CVPR 2026 Findings] V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
[CVPR 2026] Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
[KDD 2026] Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe
A dual-rate LLM architecture bridging DSP and NLP. Decouples semantic planning from lexical synthesis to solve O(N²) bottlenecks.
PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from Meta AI
Scaled diffusion transformer for text-to-speech synthesis (DiT + T5Gemma2 conditioning, TorchTitan & Megatron backends, tested up to 1024 GPUs)
The agent that grows with you
[CVPR 2026 Oral] Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression
Single-stage End-to-End Training for Tokenization and Generation
DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick
A Large-scale Wu Dialect Speech Corpus with Multi-dimensional Annotations
Qwen3-ASR is an open-source series of ASR models developed by the Qwen team at Alibaba Cloud, supporting stable multilingual speech/music/song recognition, language detection and timestamp prediction.
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice…
Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-Step High-Fidelity Audio Generation
FlowMirror-HydraVox — A natively accelerated multi-head autoregressive TTS system derived from CosyVoice 3.0. It predicts multiple tokens per step for faster, high-quality speech synthesis, featuri…
An instruct text-to-speech solution based on LLaSA and CosyVoice2 developed by the ASLP lab and collaborators.
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how t…
The official implementation for [NeurIPS 2025 Oral] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free