Huazhong University of Science and Technology
Starred repositories
Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
A hardware-aware efficient implementation of "Mixture-of-Depths Attention".
[ECCV 2026] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
[ArXiv 2025] MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices
Native Multimodal Models are World Learners
[ICLR 2025] ControlAR: Controllable Image Generation with Autoregressive Models
[NeurIPS 2025] Pixel-Perfect Depth
[ICCV 2025] ZeroStereo: Zero-Shot Stereo Matching from Single Images
MonSter++: A Unified Geometric Foundation Model for Stereo and Multi-View Depth Estimation via the Unleashing of Monodepth Priors
[CVPR 2024] Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving
[TPAMI 2024 & CVPR 2022] Attention Concatenation Volume for Accurate and Efficient Stereo Matching
[CVPR 2025 Highlight] MonSter: Marry Monodepth to Stereo Unleashes Power
The repository of "Snap-Snap: Taking Two Images to Reconstruct 3D Human Gaussians in Milliseconds"
[AAAI 2026 Oral] LENS: Learning to Segment Anything with Unified Reinforced Reasoning
[ICLR 2026] ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
Official implementation of T-PAMI25 paper "M²Diffuser: Diffusion-based Trajectory Optimization for Mobile Manipulation in 3D Scenes"
PixelHacker: Image Inpainting with Structural and Semantic Consistency
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
[ICCV 2025] GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
The first decoder-only multimodal state space model
[CVPR'25 Highlight] You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
[CVPR 2025 Oral] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
[CVPR 2025] GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
[CVPR 2025 Highlight] Truncated Diffusion Model for Real-Time End-to-End Autonomous Driving