Stars
LongLive: Real-time Interactive Long Video Generation
🚀 Efficient implementations of state-of-the-art linear attention models
A curated collection of images and prompts generated by gemini-2.5-flash-image (aka Nano Banana), a state-of-the-art image generation and editing model. Explore AI-generated visuals created with…
A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.
Long-RL: Scaling RL to Long Sequences (NeurIPS 2025)
A unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.
PyTorch implementation for the paper "SimpleAR: Pushing the Frontier of Autoregressive Visual Generation"
Latest open-source "Thinking with images" (O3/O4-mini) papers, covering training-free, SFT-based, and RL-enhanced methods for "fine-grained visual understanding".
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning
A simple screen parsing tool towards pure vision based GUI agent
Witness the aha moment of VLM with less than $3.
Solve Visual Understanding with Reinforced VLMs
Official Repo for Open-Reasoner-Zero
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
Paper collection for the continuing line of work that started from World Models.
Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
[ECCV 2024 & NeurIPS 2024] Official implementation of the paper TAPTR & TAPTRv2 & TAPTRv3
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
[ECCV 2024] Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
[ICLR 2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant
Emote Portrait Alive: Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions