Stars
A high-throughput and memory-efficient inference and serving engine for LLMs
Ongoing research training transformer models at scale
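Immersive Dual Web Page Translation Extension: an immersive bilingual web page translation extension that supports input box translation, mouse hover translation, and translation of PDF, Epub, subtitle, and TXT files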
Deep learning software to decode EEG, ECG or MEG signals
A Unified and Flexible Inference Engine with Hybrid Cache Acceleration and Parallelism for 🤗Diffusers.
An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Automatically crawl arXiv papers daily, summarize them using AI, and present them using GitHub Pages.
Fine-tune the Whisper speech recognition model to support training without timestamp data, training with timestamp data, and training without speech data. Accelerate inference and support Web deplo…
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.
A feature-rich command-line audio/video downloader
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
Speech-to-text, text-to-speech, speaker diarization, speech enhancement, source separation, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Andr…
This repository aims to collect Transformer-based sound event detection (SED) algorithms.
MOSS-TTSD is a spoken dialogue generation model that enables expressive dialogue speech synthesis in both Chinese and English, supporting zero-shot multi-speaker voice cloning, and long-form speech…
Translate the video from one language to another and add dubbing.
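"A Guide to Using Open-Source Large Models": a tutorial tailored for Chinese beginners on quickly fine-tuning (full-parameter/LoRA) and deploying Chinese and international open-source large language models (LLMs) and multimodal large models (MLLMs) in a Linux environment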
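An open-source multi-character, multi-emotion AI voice-over generation platform that supports automatic dubbing and export for novels, scripts, videos, and other content.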
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
OSUM & OSUM-EChat, open speech understanding model and empathetic spoken chatbot based on it, open-sourced by ASLP@NPU.
Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in PyTorch
VibeVoice: Expressive, longform conversational speech synthesis. (Community fork)
Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Official codebase for "Brain-JEPA: Brain Dynamics Foundation Model with Gradient Positioning and Spatiotemporal Masking" (NeurIPS 2024, Spotlight).
Long-form streaming TTS system for multi-speaker dialogue generation