Stars
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
This project aims to reproduce Sora (OpenAI's T2V model). We hope the open-source community will contribute to this project.
Minimal and clean examples of machine learning algorithm implementations
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, GLM4.5, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, Llava, GLM4v, Ph…
Enjoy the magic of Diffusion models!
Implementation of Denoising Diffusion Probabilistic Model in Pytorch
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audi…
Flexible and powerful tensor operations for readable and reliable code (for pytorch, jax, TF and others)
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine
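Extracts hard-coded subtitles from videos and generates srt files. No third-party API is required; text recognition runs locally. A deep-learning-based video subtitle extraction framework that includes subtitle region detection and subtitle content extraction. A GUI tool for extracting hard-coded subtitles (hardsubs) from videos and generating srt files.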
An open-source implementation of Microsoft's VALL-E X zero-shot TTS model. A demo is available at https://plachtaa.github.io/vallex/
VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Text-audio foundation model from Boson AI
Multilingual Voice Understanding Model
Official repo for consistency models.
Differentiable ODE solvers with full GPU support and O(1)-memory backpropagation.
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
The Unofficial TikTok API Wrapper In Python
A concise but complete full-attention transformer with a set of promising experimental features from various papers
docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
An open-source Python alternative to NotebookLM's podcast feature: transforming multimodal content into captivating multilingual audio conversations with GenAI
Inference and training library for high-quality TTS models.