Starred repositories
The official implementation of "Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs"
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
[AAAI 2026] OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
[CVPR 2025] Official Repository for Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments
[ICCV 2025] Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation.
All-in-one training for vision models (YOLO, ViTs, RT-DETR, DINOv3): pretraining, fine-tuning, distillation.
Code for the CVPR DriveX 2025 paper V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving
A curated list of state-of-the-art research in embodied AI, focusing on vision-language-action (VLA) models, vision-language navigation (VLN), and related multimodal learning approaches.
Solve Visual Understanding with Reinforced VLMs
Code for: "Long-Context Autoregressive Video Modeling with Next-Frame Prediction"
This is the first paper to explore how to effectively use R1-like RL for MLLMs and introduce Vision-R1, a reasoning MLLM that leverages cold-start initialization and RL training to incentivize reas…
[NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling
An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.
[TCSVT] DAOcc: 3D Object Detection Assisted Multi-Sensor Fusion for 3D Occupancy Prediction
[CVPR 2024 Highlight] Visual Point Cloud Forecasting
[ICCV 2023 Oral]: Scaling Data Generation in Vision-and-Language Navigation
[ECCV 2024] Official implementation of NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
This is the official repo of CVPR 2024 paper "Multimodal Sense-Informed Prediction of 3D Human Motions"
[ICCV 2025] Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
[ICCV 2025] Official implementation of SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
[NeurIPS 2024] SCube: Instant Large-Scale Scene Reconstruction using VoxSplats
[CVPR 2025 Highlight] Material Anything: Generating Materials for Any 3D Object via Diffusion
Introduces the Multiscope Conception for Sequential Decision Learning
[CVPR'25] SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding
[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning
[ICML'25] Official Implementation of "PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting"