HIT (Shenzhen) - Shenzhen - https://xiaojieli0903.github.io
Stars
A comprehensive list of papers on the definition of World Models and on using World Models for General Video Generation, Embodied AI, and Autonomous Driving, including papers, code, and related websites.
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
[CVPR 2025 Highlight] Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone
InternVLA-A1: Unifying Understanding, Generation, and Action for Robotic Manipulation
Fully Open Framework for Democratized Multimodal Training
InternVLA-M1: A Spatially Grounded Foundation Model for Generalist Robot Policy
InternRobotics' open platform for building generalized navigation foundation models.
[ICRA'24 Best UAV Paper Award Finalist] An Efficient Global Planner for Aerial Coverage
Official implementation of the paper: "StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling"
Nav-R1: Reasoning and Navigation in Embodied Scenes
A paper list of some recent Mamba-based CV works.
The new spin-off of Vision-and-Language Navigation.
Reference PyTorch implementation and models for DINOv3
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
Latest Papers, Codes and Datasets on VTG-LLMs.
Awesome collection of resources and papers on Diffusion Models for Robotic Manipulation.
[TMLR 2025] Efficient Reasoning Models: A Survey
Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video]
📖 This is a repository for organizing papers, codes and other resources related to unified multimodal models.
Awesome things towards foundation agents. Papers / Repos / Blogs / ...
This repository provides a valuable reference for researchers in the field of multimodality; start your exploration of RL-based Reasoning MLLMs here!
A curated list of state-of-the-art research in embodied AI, focusing on vision-language-action (VLA) models, vision-language navigation (VLN), and related multimodal learning approaches.