Fun-Audio-Chat is a Large Audio Language Model built for natural, low-latency voice interactions.
An AI-powered speech processing toolkit with open-source SOTA pretrained models, supporting speech enhancement, separation, target speaker extraction, and more.
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
AI Audio Datasets (AI-ADS) 🎵, including Speech, Music, and Sound Effects, which can provide training data for Generative AI, AIGC, AI model training, intelligent audio tool development, and audio a…
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
A PyTorch library for implementing flow matching algorithms, featuring continuous and discrete flow matching implementations. It includes practical examples for both text and image modalities.
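To make the continuous flow-matching objective concrete, here is a toy NumPy sketch of how one training pair is built (this is the standard linear-interpolation formulation, not this library's actual API; names like `flow_matching_pair` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(x1, rng):
    """Build one conditional flow-matching training pair.

    Sample noise x0 ~ N(0, I) and a time t ~ U(0, 1); the probe point is
    the linear interpolation x_t = (1 - t) * x0 + t * x1, and the
    regression target is that path's constant velocity, x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(size=(x1.shape[0], 1))
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, t, v_target

# A network v_theta(x_t, t) is then trained with the MSE loss
# mean((v_theta(xt, t) - v_target) ** 2); at sampling time, integrating
# dx/dt = v_theta(x, t) from t = 0 to t = 1 maps noise to data.
x1 = rng.standard_normal((4, 2))   # a toy "data" batch
xt, t, v = flow_matching_pair(x1, rng)
print(xt.shape, v.shape)           # (4, 2) (4, 2)
```

A useful sanity check on the construction: walking from `xt` along the target velocity for the remaining time `1 - t` should land exactly on the data point `x1`.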
PyTorch implementation of VALL-E (zero-shot text-to-speech); reproduced demo: https://lifeiteng.github.io/valle/index.html
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio.
Predicts the level of noise and reverberation in your audio files.
A toolkit for speaker diarization.
VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
The official PyTorch implementation of the Interspeech 2024 paper "Reshape Dimensions Network for Speaker Recognition".
An evolving, large-scale and multi-domain ASR corpus for low-resource languages with automated crawling, transcription and refinement
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Multilingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
✨✨Latest Advances on Multimodal Large Language Models
A Framework for Speech, Language, Audio, Music Processing with Large Language Model
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
A PyTorch implementation of "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation".
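For context on the baseline named in the title: the ideal ratio mask is an oracle soft mask computed from the ground-truth source and interference magnitudes and applied to the mixture spectrogram. A toy NumPy sketch of that textbook formulation (shapes and the additive-magnitude assumption are illustrative, not this repo's code):

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Oracle soft mask: the fraction of each time-frequency bin's
    magnitude that belongs to the target source."""
    return speech_mag / (speech_mag + noise_mag + eps)

# Toy magnitude spectrograms (time frames x frequency bins).
rng = np.random.default_rng(0)
speech = np.abs(rng.standard_normal((5, 8)))
noise = np.abs(rng.standard_normal((5, 8)))

# In practice the complex STFTs add and magnitudes only approximately do;
# assuming additive magnitudes here keeps the oracle reconstruction exact.
mixture = speech + noise
estimate = ideal_ratio_mask(speech, noise) * mixture
print(np.allclose(estimate, speech, atol=1e-6))   # True
```

Conv-TasNet's claim is that a learned mask on a learned time-domain basis can beat even this oracle time-frequency mask.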
A PyTorch implementation of dual-path RNN: efficient long-sequence modeling for time-domain single-channel speech separation.
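The dual-path trick is to fold a long feature sequence into overlapping chunks so that both the intra-chunk and inter-chunk RNN passes operate on short sequences. A minimal NumPy sketch of that segmentation step (chunk size and hop below are illustrative, not the paper's hyperparameters):

```python
import numpy as np

def segment(x, chunk, hop):
    """Fold a (T, F) sequence into (num_chunks, chunk, F) overlapping
    chunks, zero-padding the tail so every chunk is full length.

    The intra-chunk RNN then runs along axis 1 (within each chunk) and
    the inter-chunk RNN along axis 0 (across chunks), so neither pass
    ever sees a sequence longer than max(chunk, num_chunks).
    """
    T, F = x.shape
    n = max(0, -(-(T - chunk) // hop)) + 1      # ceil division, >= 1 chunk
    pad = (n - 1) * hop + chunk - T
    xp = np.pad(x, ((0, pad), (0, 0)))
    return np.stack([xp[i * hop : i * hop + chunk] for i in range(n)])

x = np.arange(20, dtype=float).reshape(10, 2)   # T=10 frames, F=2 features
chunks = segment(x, chunk=4, hop=2)
print(chunks.shape)                             # (4, 4, 2)
```

With a 50% hop the sequence length seen by each RNN grows roughly as the square root of `T` instead of linearly, which is what makes very long time-domain inputs tractable.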
This is the audio sample repository for speech separation model "MossFormer2".
Multi-scale time-domain speaker extraction.
An open-source, accurate, and easy-to-use video speech recognition and clipping tool with LLM-based AI clipping integrated.
The AVA dataset densely annotates 80 atomic visual actions in 351k movie clips with actions localized in space and time, resulting in 1.65M action labels with multiple labels per human occurring fr…