Starred repositories
QVerisAI / QVerisBot
Forked from openclaw/openclaw
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
[ICLR 2026] TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching
Comprehensive open-source library of AI research and engineering skills for any AI model. Package the skills and your claude code/codex/gemini agent will be an AI research agent with full horsepowe…
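LLM-driven intelligent analyzer for A-share, H-share, and US stocks: multi-source market data + real-time news + Gemini decision dashboard + multi-channel push notifications. Zero cost, entirely free to use, runs on a schedule.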
Code for "Diffusion Model Alignment Using Direct Preference Optimization"
An instruct text-to-speech solution based on LLaSA and CosyVoice2 developed by the ASLP lab and collaborators.
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
SoulX-FlashTalk is the first 14B model to achieve sub-second start-up latency (0.87s) while maintaining a real-time throughput of 32 FPS on an 8xH800 node.
Official Python inference and LoRA trainer package for the LTX-2 audio–video generative model.
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how t…
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
We introduce temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of Multimodal foundation models (MFMs). This plug-and-play module can be easily integrated into …
EVA Series: Visual Representation Fantasies from BAAI
A 5-way embedding model for text, audio, image, video, and 3D point clouds.
A dataset of 100M connections between 5 different modalities.
Uses machine learning to denoise audio containing speech
Code implementation for the paper "Large-scale Pre-training for Grounded Video Caption Generation" (ICCV 2025)
AnyTalker: Scaling Multi-person Talking Video Generation with Interactivity Refinement
[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning
Video Grounding and Captioning
[ISMIR 2025] A curated list of vision-to-music generation: methods, datasets, evaluation and challenges.
A curated list of Vision (video/image) to Audio Generation
SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.
Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects and English, achieving a new SOTA on public Mandarin ASR benchmarks, while also offering outstanding singing lyrics rec…