Starred repositories of @L0SG (NVIDIA, United States; https://L0SG.github.io)
OpenFLAM: Framewise Language Audio Model
The official implementation of the εar-VAE model, including inference and evaluation; more details coming soon.
Elucidated Text-To-Audio (ETTA) is a SOTA text-to-audio model, built from a holistic study of the design space and trained on synthetic captions.
[ACL 2025] Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis
ACE-Step: A Step Towards Music Generation Foundation Model
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
[NAACL 2025] WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching
The official implementation of TokenSynth (ICASSP 2025)
A low-bitrate single-codebook 16/24 kHz speech codec based on focal modulation
Unified automatic quality assessment for speech, music, and sound.
Training Large Language Models to Reason in a Continuous Latent Space
A family of state-of-the-art Transformer-based audio codecs for low-bitrate high-quality audio coding.
A suite of image and video neural tokenizers
New repo collection for NVIDIA Cosmos: https://github.com/nvidia-cosmos
[ICLR 2026] TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching
LibriSpeech-Long is a benchmark dataset for long-form speech generation and processing. Released as part of "Long-Form Speech Generation with Spoken Language Models" (arXiv 2024).
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Official PyTorch implementation of "Paralinguistics-Aware Speech-Empowered LLMs for Natural Conversation" (NeurIPS 2024)
PyTorch implementation of MAR + DiffLoss (https://arxiv.org/abs/2406.11838)
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
Official implementation of the paper "BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec"
Text-to-Music Generation with Rectified Flow Transformers
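Several of the repositories above (WaveFM, TangoFlux, and this one) train their generators with flow matching / rectified flow. As a minimal, generic sketch of the objective only (not any of these repos' actual code; the tiny two-layer "velocity network" and all shapes here are illustrative assumptions): sample noise x0, interpolate x_t = (1 - t) x0 + t x1 along a straight line to the data x1, and regress the network toward the constant velocity x1 - x0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy velocity-field network (real models use Transformers/U-Nets).
# Input is the 16-dim state concatenated with the scalar time t (17 dims total).
W1 = rng.normal(size=(17, 64)) * 0.1
W2 = rng.normal(size=(64, 16)) * 0.1

def velocity(xt: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Predict the velocity field v(x_t, t) with a tiny 2-layer MLP."""
    h = np.tanh(np.concatenate([xt, t], axis=-1) @ W1)
    return h @ W2

def rectified_flow_loss(x1: np.ndarray) -> float:
    """One step of the rectified-flow / flow-matching training objective."""
    x0 = rng.normal(size=x1.shape)           # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))   # random time in [0, 1] per sample
    xt = (1 - t) * x0 + t * x1               # straight-line interpolant
    v_target = x1 - x0                       # constant velocity along the line
    v_pred = velocity(xt, t)
    return float(np.mean((v_pred - v_target) ** 2))

loss = rectified_flow_loss(rng.normal(size=(8, 16)))
```

At sampling time, the learned velocity field is integrated from noise at t = 0 to data at t = 1 with an ODE solver; the straight-line paths are what make few-step (or "turbo") sampling variants attractive.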
[ICLR 2025] SOTA discrete acoustic codec models with 40/75 tokens per second for audio language modeling
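For context on what "low-bitrate" means for the discrete codecs listed here: a codec's bitrate is frame rate × number of codebooks × log2(codebook size) bits per second. A quick back-of-the-envelope helper (the 75 tokens/s figure is from the entry above; the 1024-entry codebook size is an illustrative assumption, not any model's actual config):

```python
import math

def codec_bitrate_bps(tokens_per_second: int,
                      codebook_size: int,
                      num_codebooks: int = 1) -> float:
    """Bitrate of a discrete codec: each token carries log2(codebook_size) bits."""
    return tokens_per_second * num_codebooks * math.log2(codebook_size)

# Illustrative: a single-codebook codec emitting 75 tokens/s from a
# 1024-entry codebook carries 75 * 10 = 750 bits/s, i.e. 0.75 kbps.
print(codec_bitrate_bps(75, 1024))  # 750.0
```

This is why single-codebook designs (like the focal-modulation codec above) and low token rates matter for audio language modeling: fewer tokens per second means shorter sequences for the LM, at the cost of reconstruction quality.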
The official implementation of PeriodWave and PeriodWave-Turbo
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery 🧑‍🔬