Happy
- Shanghai
- WeChat official account: 音视频平凡之路
Starred repositories
🎬 Huobao Drama (火宝短剧) - An AI-Powered End-to-End Short Drama Generator "One Sentence to Complete Drama: Fully Automated from Script to Final Video"
A SOTA Industrial-Grade Voice Activity Detection & Audio Event Detection, supporting 100+ languages, outperforming Silero-VAD, TEN-VAD, FunASR-VAD and WebRTC-VAD
[ACL 2024] Official PyTorch code for extracting features and training downstream models with emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
[ICLR2026] SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training
Repo for SeedVR2 (ICLR2026) & SeedVR (CVPR2025 Highlight)
📚 "Building Agents from Scratch" (《从零开始构建智能体》): a from-scratch tutorial on agent principles and practice
MOSS-Audio is an open-source foundation model for unified audio understanding, enabling speech, sound, music, captioning, QA, and reasoning in real-world scenarios.
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
AI Agent Skills for Wan — Enable your AI Agent to easily leverage Wan's AIGC capabilities.
Pseudo Streaming SenseVoice with Hotwords
Fun-ASR is an end-to-end speech recognition large model launched by Tongyi Lab.
Multilingual Voice Understanding Model
Official Python toolkit for the Qwen3-ASR API. Parallel high‑throughput calls, robust long‑audio transcription, multi‑sample‑rate support.
All in one Qwen3-ASR Server, compatible with OpenAI API
Port of OpenAI's Whisper model in C/C++
Robust Speech Recognition via Large-Scale Weak Supervision
A SOTA Industrial-Grade All-in-One ASR system with ASR, VAD, LID, and Punc modules. FireRedASR2 supports Chinese (Mandarin, 20+ dialects/accents), English, code-switching, and both speech and singi…
Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.
FireRed-Image-Edit is a powerful image editing foundation model achieving open-source state-of-the-art performance with precise instruction following, high-fidelity generation, superior identity co…
Long-form streaming TTS system for multi-speaker dialogue generation