Stars
Code for the paper "Fusing Satellite Imagery and Planimetric Maps for Cross-View Localization"
DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images
ARM: An AutoRegressive Large Multimodal Model with Discrete Representations
Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
🔥 Official code repository for "Unlocking Dense Metric Depth Estimation in VLMs"
A feed-forward 3D foundation model for reconstructing scenes from streaming data
[CVPR 2026 (Highlight)] Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
This is the repo for the paper "OccSim: Multi-kilometer Simulation with Long-horizon Occupancy World Models"
[CVPR 2026] Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
A simple video streaming baseline that outperforms state-of-the-art methods.
[NeurIPS 2025] 3DRS: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
[CVPR 2026 Findings] Speed3R: Sparse Feed-forward 3D Reconstruction Models
[CVPR 2026] ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
Qwen3-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
[CVPR 2026] SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
[ECCV 2024] This is the official implementation of HRMapNet, maintaining and utilizing a low-cost global rasterized map to enhance online vectorized map perception.
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
[CVPR 2026] Release repo for our work "Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers"
Reference PyTorch implementation and models for DINOv3
[ICLR 2026] WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
Qwen-Image is a powerful image generation foundation model capable of complex text rendering and precise image editing.
[ICLR 2026] π^3: Permutation-Equivariant Visual Geometry Learning
Tooling for the Common Objects In 3D dataset.
MMaDA - Open-Sourced Multimodal Large Diffusion Language Models (dLLMs with block diffusion, mixed-CoT, unified RL)