The official implementation of VITA, VITA15, LongVITA, VITA-Audio, VITA-VLA, and VITA-E.

Python 135 2 Updated Oct 28, 2025

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Python 19,297 2,059 Updated Oct 21, 2025
Python 250 26 Updated May 19, 2025

MOSS-Speech is a true speech-to-speech large language model without text guidance.

Python 112 5 Updated Dec 4, 2025

Open-Source Frontier Voice AI

Python 18,944 2,096 Updated Dec 17, 2025

An official implementation of "Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning".

Python 7 Updated Apr 30, 2024

✨✨[NeurIPS 2025] VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Python 669 60 Updated May 24, 2025

A package for NeuCodec: a 50 Hz, 0.8 kbps, 24 kHz audio codec.

Python 135 17 Updated Oct 7, 2025
Python 481 43 Updated May 6, 2025
Python 6 2 Updated Dec 8, 2025

[ACL 2025] Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis

Python 12 Updated Jun 3, 2025

Official Implementation for the paper "d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning"

Python 389 48 Updated Dec 20, 2025

Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation

Python 4,401 321 Updated Jun 21, 2025

Colab notebook for fine-tuning Qwen2-Audio with trl's SFT and PPO trainers.

Jupyter Notebook 22 1 Updated Nov 23, 2024

Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.

Jupyter Notebook 3,854 303 Updated Jun 12, 2025

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

Python 291 38 Updated May 16, 2025
Python 57 2 Updated Mar 22, 2025

Codebase for 'Scaling Rich Style-Prompted Text-to-Speech Datasets'

Python 148 9 Updated Mar 24, 2025

Official PyTorch implementation for "Large Language Diffusion Models"

Python 3,425 230 Updated Nov 12, 2025

Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

Python 215 12 Updated Feb 28, 2025

LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

Python 57 3 Updated Apr 14, 2025
HTML 3 Updated Jan 5, 2025

GPT-4o-level, real-time spoken dialogue system.

Python 363 29 Updated Jan 27, 2025

Re-implementation of Self-Refine

Python 1 Updated Aug 19, 2024

Codec for paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis

Python 336 45 Updated Jul 21, 2025

[ICCV 2025] VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE

Python 376 12 Updated Jan 19, 2025

Official repo for CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Python 62 3 Updated Jan 16, 2025