Lists (26)
AcousticFrontend
AcousticModel
ASR
ASR-pretrain
ASV
AudioQuality
AwesomeList
Paper list, awesome list and so on.
BandwidthExtension
Classification
Codec
Data
Develop
Evaluation
FrontEnd
FrontEnd for Text-to-Speech
How-to
LLM
Music
Performance
Quant
SingingVoiceSynthesis
SpeechEditing
SpeechSeperation
Tools
Universal Method
Vocoder
VoiceConversion
Starred repositories
ZeenSong / claude-code
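Forked from ultraworkers/claw-code
Claude Code source code documentation and analysis
Sharing AI infra knowledge and code exercises: getting started with the PyTorch/vLLM/SGLang frameworks ⚡️, performance acceleration 🚀, large-model fundamentals 🧠, AI software and hardware 🔧, and more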
Plug-and-play streaming semantic VAD for real-time full-duplex spoken dialogue systems.
Pre-training, SFT, DPO and GRPO for Text-to-Audio Generation
A SOTA Industrial-Grade Voice Activity Detection & Audio Event Detection, supporting 100+ languages, outperforming Silero-VAD, TEN-VAD, FunASR-VAD and WebRTC-VAD
Compute WER and SER for speech recognition evaluation
Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice…
Fine-tune the NeMo Parakeet ASR model on a new language (supports 8-bit optimizer). Experimental birwkv-fastconformer TDT for long-form ASR (8.5 hours in a single pass).
A demo-level low-latency, high-throughput inference engine for whisper
🐍📦 Ultra-fast Python package for calculating and analyzing the Word Error Rate (WER). Built for the scalable evaluation of speech and transcription accuracy.
Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
Text-to-text alignment algorithm for speech recognition error analysis.
Source-code walkthrough of the lightweight large language model MiniMind, covering the full pipeline: tokenizer, RoPE, MoE, KV Cache, pretraining, SFT, LoRA, DPO, and more
A python module to repair invalid JSON from LLMs
Simultaneous speech-to-text models
MOSS-TTSD is a spoken dialogue generation model designed for expressive multi-speaker synthesis. It features long-context modeling, flexible speaker control, and multilingual support, while enablin…
Text-audio foundation model from Boson AI
Chinese voice corpus with clearer, more natural speech, comprising 8 open-source datasets, 3,200 speakers, 900 hours of speech, and 13 million characters.
A Collection of Papers on Diffusion Language Models
Voice Activity Detector (VAD): low-latency, high-performance and lightweight