Stars
Official implementation of Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents (NeurIPS 2025)
RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation (arXiv preprint)
🌟 A curated list of DUSt3R-related papers and resources, tracking recent advancements using this geometric foundation model.
FULL Augment Code, Claude Code, Cluely, CodeBuddy, Comet, Cursor, Devin AI, Junie, Kiro, Leap.new, Lovable, Manus, NotionAI, Orchids.app, Perplexity, Poke, Qoder, Replit, Same.dev, Trae, Traycer AI…
Qwen3-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
A generative world for general-purpose robotics & embodied AI learning.
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
PhD Dissertation Template for UNC Computer Science
Implementation of the paper: "Answering Questions by Meta-Reasoning over Multiple Chains of Thought"
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Jupyter notebook server extension to proxy web services (a configuration sketch follows this list).
PyTorch code for System-1.x: Learning to Balance Fast and Slow Planning with Language Models
Easily use and train state-of-the-art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease of use, backed by research (a usage sketch follows this list).
LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR team.
Official implementation of SEED-LLaMA (ICLR 2024).
Official implementation of Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model (ICLR 2025 Oral)
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". A…
Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts (CVPR 2024)
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
OCR, layout analysis, reading order, table recognition in 90+ languages
Official Code Repository for EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents (COLM 2024)
Code and Data for Paper: SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
Code for the paper "pix2gestalt: Amodal Segmentation by Synthesizing Wholes" (CVPR 2024)
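For the Jupyter proxy extension above, here is a minimal configuration sketch. It assumes the jupyter-server-proxy package and its `c.ServerProxy.servers` traitlet; the "hello" entry, its command, and the timeout value are illustrative placeholders, not part of the repo's description.

```python
# jupyter_server_config.py -- minimal sketch, assuming `pip install jupyter-server-proxy`
c = get_config()  # provided by Jupyter when the config file is loaded

c.ServerProxy.servers = {
    # The "hello" entry is a hypothetical example service.
    "hello": {
        # "{port}" is substituted with a free port chosen by the proxy at launch.
        "command": ["python", "-m", "http.server", "{port}"],
        "timeout": 20,  # seconds to wait for the service to start responding
    }
}
```

With a config like this, the proxied service would be reachable under the `hello/` path relative to the Jupyter base URL after the server restarts.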
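For the late-interaction retrieval entry (ColBERT in a RAG pipeline), here is a minimal usage sketch assuming the RAGatouille package's `RAGPretrainedModel` API; the checkpoint name, documents, index name, and query are placeholders.

```python
# Minimal late-interaction retrieval sketch, assuming `pip install ragatouille`.
from ragatouille import RAGPretrainedModel

# Load a pretrained ColBERT checkpoint (checkpoint name is an assumption).
retriever = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Build an index over a tiny placeholder document collection.
retriever.index(
    collection=[
        "ColBERT scores queries against documents with token-level (late) interaction.",
        "Dense bi-encoders collapse each text into a single vector before scoring.",
    ],
    index_name="demo_index",
)

# Retrieve the top-k passages for a query.
results = retriever.search(query="What is late interaction?", k=2)
for hit in results:
    print(hit["rank"], round(hit["score"], 2), hit["content"])
```

The late-interaction design keeps per-token embeddings at index time and defers query-document matching to search time, which is what distinguishes it from single-vector dense retrieval.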