Starred repositories
We introduce temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of multimodal foundation models (MFMs). This plug-and-play module can be easily integrated into …
EVA Series: Visual Representation Fantasies from BAAI
A 5-way embedding model for text, audio, image, video, and 3D point clouds.
A dataset of 100M connections between 5 different modalities.
Uses machine learning to denoise audio containing speech
Code implementation for the paper "Large-scale Pre-training for Grounded Video Caption Generation" (ICCV 2025)
AnyTalker: Scaling Multi-person Talking Video Generation with Interactivity Refinement
[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning
Video Grounding and Captioning
[ISMIR 2025] A curated list of vision-to-music generation: methods, datasets, evaluation and challenges.
A curated list of video-to-audio generation resources
SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.
Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects and English, achieving a new SOTA on public Mandarin ASR benchmarks, while also offering outstanding singing lyrics rec…
Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
Robust Speech Recognition via Large-Scale Weak Supervision
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
[CVPR 2023] Official implementation of the paper: Fine-grained Audible Video Description
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
PAM is a no-reference audio quality metric for audio generation tasks
ACE-Step: A Step Towards Music Generation Foundation Model
Unified automatic quality assessment for speech, music, and sound.
🔥 1Panel provides an intuitive web interface and MCP Server to manage websites, files, containers, databases, and LLMs on a Linux server.
Kronos: A Foundation Model for the Language of Financial Markets
Let's make video diffusion practical!
Wan: Open and Advanced Large-Scale Video Generative Models