Stars
[NeurIPS 2025] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
verl: Volcano Engine Reinforcement Learning for LLMs
Hackable and optimized Transformers building blocks, supporting composable construction.
Foundational Models for State-of-the-Art Speech and Text Translation
A high-throughput and memory-efficient inference and serving engine for LLMs
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
Lightweight coding agent that runs in your terminal
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
Official PyTorch implementation of "Large Language Models are Strong Audio-Visual Speech Recognition Learners" [ICASSP 2025] and "Mitigating Attention Sinks and Massive Activations in Audio-Visual …
Official implementation of the paper "BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec"
Generative models for conditional audio generation
Movie Gen Bench - two media generation evaluation benchmarks released with Meta Movie Gen
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
SALMONN family: A suite of advanced multi-modal LLMs
Official code for the CVPR 2024 paper: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
Python module for syllabifying English ARPABET transcriptions
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation.
Models and code for RepCodec: A Speech Representation Codec for Speech Tokenization
Code for the SpeechTokenizer presented in "SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models"; samples are presented on the accompanying demo page.
Official PyTorch implementation of GroupViT: Semantic Segmentation Emerges from Text Supervision, CVPR 2022.
Implementation of Generating Diverse High-Fidelity Images with VQ-VAE-2 in PyTorch
[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
Phoneme segmentation using pre-trained speech models
Code for the IEEE Signal Processing Letters 2022 paper "UAVM: Towards Unifying Audio and Visual Models".
Unsupervised phone and word segmentation using dynamic programming on self-supervised VQ features.