Stars
[ICLR 2026 🔥] Official implementation of "UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing"
Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"
A curated collection of fun and creative examples generated with Nano Banana & Nano Banana Pro🍌, Gemini-2.5-flash-image based model. We also release Nano-consistent-150K openly to support the commu…
Official implementation of the paper "Transfer between Modalities with MetaQueries"
[CVPR 2025] Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
Code release for "PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop" (ICML 2025)
Code for Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? [COLM 2024]
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340
Open Overleaf/ShareLaTeX projects in VS Code, with full collaboration support.
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Fast and memory-efficient exact attention
[ICLR 2025 Oral] On Scaling Up 3D Gaussian Splatting Training
Unofficial implementation of "SODA: Bottleneck Diffusion Models for Representation Learning"
PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"
[ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Official Implementation of ICLR'24: Kosmos-G: Generating Images in Context with Multimodal Large Language Models
Infinite Photorealistic Worlds using Procedural Generation
Code for the paper "Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models"
Official implementation of AnimateDiff.
DeepSeek-VL: Towards Real-World Vision-Language Understanding
TripoSR: Fast 3D Object Reconstruction from a Single Image
Efficient, check-pointed data loading for deep learning with massive data sets.