- Seoul National University
- Seoul, Republic of Korea
- gmltmd789.github.io
- https://scholar.google.com/citations?user=4ojbJpoAAAAJ&hl=ko
- in/gmltmd789
Stars
The official implementation of VITA, VITA15, LongVITA, VITA-Audio, VITA-VLA, and VITA-E.
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
MOSS-Speech is a true speech-to-speech large language model without text guidance.
An official implementation of "Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning".
✨✨[NeurIPS 2025] VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
A package for NeuCodec: a 50 Hz, 0.8 kbps, 24 kHz audio codec.
[ACL 2025] Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis
Official Implementation for the paper "d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning"
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
Colab notebook for fine-tuning Qwen2-Audio with trl's SFT and PPO trainers.
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Codebase for 'Scaling Rich Style-Prompted Text-to-Speech Datasets'
Official PyTorch implementation for "Large Language Diffusion Models"
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
GPT-4o-level, real-time spoken dialogue system.
Codec for paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
[ICCV 2025] VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE
Official repo for CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations