Stars
Codebase for 'Scaling Rich Style-Prompted Text-to-Speech Datasets'
CoVoST: A Large-Scale Multilingual Speech-To-Text Translation Corpus (CC0 Licensed)
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
Codabench is a flexible, easy-to-use and reproducible benchmarking platform. Check our paper: https://hubs.li/Q01fwRWB0
Implementation of the VQ-VAE model described in "Privacy-oriented manipulation of speaker representations".
Recipe for training and testing RIR-Classifier
Lightweight, highly configurable Python launcher based on a microkernel architecture
Glow-TTS with Stochastic Duration Predictor and Stochastic Pitch Predictor
A Python package to analyze and compare voices with deep learning
A Python library for audio data augmentation. Useful for making audio ML models work well in the real world, not just in the lab.
FACodec: Speech Codec with Attribute Factorization used for NaturalSpeech 3
Implementation of the paper "SpeechTripleNet: End-to-End Disentangled Speech Representation Learning for Content, Timbre and Prosody"
Awesome speech/audio LLMs, representation learning, and codec models
Foundational Models for State-of-the-Art Speech and Text Translation
Model implementation and trained network for "A Two-Step Disentanglement Method" by Naama Hadad, Lior Wolf and Moni Shahar
Measuring Disentanglement: A Review of Metrics
MOS score prediction by a fine-tuned wav2vec 2.0 model
NISQA - Non-Intrusive Speech Quality and TTS Naturalness Assessment
Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
Context Dependent Semantic Parsing: Awesome Paper List