Starred repositories
Fun-Audio-Chat is a Large Audio Language Model built for natural, low-latency voice interactions.
GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning
A curated collection of fun and creative examples generated with Nano Banana & Nano Banana Pro🍌, Gemini-2.5-flash-image based model. We also release Nano-consistent-150K openly to support the commu…
Voice Activity Detector (VAD): low-latency, high-performance, and lightweight
Open-source framework for conversational voice AI agents
T5Voice is a lightweight PyTorch implementation of T5-based text-to-speech synthesis, supporting both streaming and non-streaming speech synthesis with zero-shot capabilities.
Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
A search engine that "just works" for Obsidian. Supports OCR and PDF indexing.
Precision Alignment, Infinite Possibilities
A powerful 3B-parameter, LLM-based reinforcement learning audio editing model that excels at editing emotion, speaking style, and paralinguistics, and features robust zero-shot text-to-speech.
Official implementation of "Continuous Autoregressive Language Models"
The official implementation of PeriodWave and PeriodWave-Turbo
Elucidated Text-To-Audio (ETTA) is a SOTA text-to-audio model with a holistic understanding of the design space and trained with synthetic captions.
VibeVoice: Expressive, longform conversational speech synthesis. (Community fork)
SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.
Training, inference, and testing of the SAC speech codec model.
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
kyutai-labs / nanoGPTaudio
Forked from karpathy/nanoGPT. Code for the blog "Neural audio codecs: how to get audio into LLMs"
[NAACL 2025] WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching
PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models
LongCat Audio Tokenizer and Detokenizer
Data Pipeline, Models, and Benchmark for Omni-Captioner.
PESQ (Perceptual Evaluation of Speech Quality) Wrapper for Python Users (narrow band and wide band)
FLM-Audio is an audio-language subversion of RoboEgo/FLM-Ego, an omnimodal model with native full duplexity.
Speech To Speech: an effort for an open-sourced and modular GPT4-o
A CLI text-to-speech tool using the Kokoro model, supporting multiple languages, voices (with blending), and various input formats including EPUB books and PDF documents.
Official implementation of DNSMOS Pro (accepted at INTERSPEECH 2024).