Stars
Robust Speech Recognition via Large-Scale Weak Supervision
🧑🏫 60+ Implementations/tutorials of deep learning papers with side-by-side notes 📝; including transformers (original, xl, switch, feedback, vit, ...), optimizers (adam, adabelief, sophia, ...), ga…
The world's simplest facial recognition api for Python and the command line
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Deep Learning papers reading roadmap for anyone who is eager to learn this amazing tech!
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
Deezer source separation library including pretrained models.
SoftVC VITS Singing Voice Conversion
Download your Spotify playlists and songs along with album art and metadata (from YouTube if a match is found).
GUI for a Vocal Remover that uses Deep Neural Networks.
A Lightweight Face Recognition and Facial Attribute Analysis (Age, Gender, Emotion and Race) Library for Python
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
Multilingual large voice generation model, providing full-stack support for inference, training, and deployment.
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal AI, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing, etc.
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
An open source implementation of CLIP.
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3, Qwen3-MoE, DeepSeek-R1, GLM4.5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, …
Implementation of Denoising Diffusion Probabilistic Model in PyTorch
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
Code for the paper Hybrid Spectrogram and Waveform Source Separation
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
Silero VAD: pre-trained enterprise-grade Voice Activity Detector
VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Multilingual Voice Understanding Model