Stars
Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model
Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels with Hunyuan3D World Model
Code for ICCV 2025 paper: "ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation"
Autonomous coding agent right in your IDE, capable of creating/editing files, executing commands, using the browser, and more with your permission every step of the way.
Open-source, accurate, and easy-to-use video speech recognition & clipping tool, with LLM-based AI clipping integrated.
This repo contains the code for the 1D tokenizer and generator.
Open-Sora: Democratizing Efficient Video Production for All
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Solve Visual Understanding with Reinforced VLMs
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
[ICCV 2025 Highlight] The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining"
New repo collection for NVIDIA Cosmos: https://github.com/nvidia-cosmos
xLAM: A Family of Large Action Models to Empower AI Agent Systems
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
Codebase for Aria - an Open Multimodal Native MoE
Composable building blocks to build Llama Apps
Agentic components of the Llama Stack APIs
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery 🧑‍🔬
MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
[NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to ace any computer task through strong reasoning abilities, self-improvement, and skill curation.
Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
A curated list of recent diffusion models for video generation, editing, and various other applications.
[CVPR 2023] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation