Starred repositories
Very low-latency speech-to-text, intent recognition, and text-to-speech for building voice agents and interfaces
A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems. Use when building, optimizing, or debugging agent systems that require e…
Whisper encoder (extracted from the pretrained model) with a linear layer on top, trained with the CTC criterion
A native-PyTorch library for large-scale M-LLM (text/audio) training with tp/cp/dp.
Text-audio foundation model from Boson AI
The hub for audio AI research: papers, open models, benchmarks & datasets across audio LLMs, speech recognition, TTS, music & audio generation.
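LLM knowledge sharing that everyone can understand: a must-read before spring/fall recruitment LLM interviews, so you can talk confidently with your interviewer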
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
XphoneBR is a Brazilian Portuguese transformer-based grapheme-to-phoneme and normalization modeling library that leverages recent deep learning technology and is optimized for usage in producti…
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.6, DeepSeek-V4, GLM-5.1, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Gemma4, Llava, …
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
✨✨Latest Advances on Multimodal Large Language Models
Silero VAD: pre-trained enterprise-grade Voice Activity Detector
A Next-Generation Training Engine Built for Ultra-Large MoE Models
A multi-voice TTS system trained with an emphasis on quality
One minute of voice data is enough to train a good TTS model (few-shot voice cloning)
Instant voice cloning by MIT and MyShell. Audio foundation model.
Implementation of Denoising Diffusion Probabilistic Model in PyTorch
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Generative Models by Stability AI
PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech). Reproduced demo: https://lifeiteng.github.io/valle/index.html
Official implementation for the paper "Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition"
This repository contains the SpeechBrain Benchmarks
Multimodal Transformer for Korean Sentiment Analysis with Audio and Text Features