Stars
🔥 Official code repository for "Unlocking Dense Metric Depth Estimation in VLMs"
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders [Technical Report]
Official code for CVPR 2026 paper: VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
[CVPR 2025] Online Video Understanding: OVBench and VideoChat-Online
Official code for "Generalizable and Animatable 3D Full-Head Gaussian Avatar from a Single Image"
Human-taught Computer-use Agent Designed for Real Windows and MacOS Desktops.
SPAgent, a foundation agent for understanding, reasoning over, and operating within the physical and spatial world.
The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra
Youtu-Tip: Tap for Intelligence, Keep on Device.
RePlan: Reasoning-Guided Region Planning for Complex Instruction-Based Image Editing
DART-GUI: Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation
[ECCV 2024] GGRt: Towards Pose-free Generalizable 3D Gaussian Splatting in Real-time
[ICCV 2025] CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction
Official implementation of GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
[NeurIPS 2025] LabelAny3D: Label Any Object 3D in the Wild
[EMNLP 2025] Repository for paper "DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning"
Official implementation of "RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics"
🐫 CAMEL: The first and the best multi-agent framework. Finding the Scaling Law of Agents. https://www.camel-ai.org
[NeurIPS 2025] Official implementation of "RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics"
Accelerate VGGT with efficient descriptor-based global attention
[ICLR 2026] Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
A part-based 3D generation framework & the largest and most comprehensively annotated 3D part dataset.
[CVPR 2024] Official implementation of "Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception" https://arxiv.org/abs/2405.07201
🍳 [CVPR'25] PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting
[CVPR 2026] An accurate and densely annotated synthetic dataset for training SOTA detectors / segmentors / Grounding-VLMs.
Universal Monocular Metric Depth Estimation
[ICCV'25] 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection