WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

IROS 2025 Oral

arXiv | Home Page | YouTube

Dujun Nie1,*, Xianda Guo2,*, Yiqun Duan3, Ruijun Zhang1, Long Chen1,4,5,†

1 Institute of Automation, Chinese Academy of Sciences, 2 School of Computer Science, Wuhan University, 3 School of Computer Science, University of Technology Sydney, 4 IAIR, Xi'an Jiaotong University, 5 Waytous
📧: niedujun2024@ia.ac.cn, xianda_guo@163.com, duanyiquncc@gmail.com

[IROS'25] This repository is the official implementation of WMNav, a novel World Model-based Object Goal Navigation framework powered by Vision-Language Models.

Overview of our navigation framework

This project is based on VLMnav. Our method comprises three components:
  1. Novel architecture - Introducing a new direction for object goal navigation using a world model consisting of VLMs and novel modules.
  2. Memory strategy - Designing an innovative memory strategy of predicted environmental states that employs an online Curiosity Value Map to quantitatively store the likelihood of the target's presence in various scenarios predicted by the world model.
  3. Efficiency - Proposing a subtask decomposition with feedback and a two-stage action proposer strategy to enhance the reliability of VLM reasoning outcomes and improve exploration efficiency.

🔥 News

  • Jun. 16th, 2025: WMNav is accepted to IROS 2025 and selected as an oral presentation.
  • Jun. 11th, 2025: WMNav now supports the open-source model Qwen2.5-VL!
  • Mar. 14th, 2025: The code of WMNav is available! ☕️
  • Mar. 4th, 2025: We released our paper on arXiv.

📚 Table of Contents

  • 🚀 Get Started
    • ⚙ Installation and Setup
    • 🛢 Prepare Dataset
    • 🚩 API Key
  • 🎮 Demo
  • 📊 Evaluation
  • 🔨 Customize Experiments
  • 🙇 Acknowledgement
  • 📝 Citation
  • 🤗 Contact

🚀 Get Started

⚙ Installation and Setup

  1. Clone this repo.
    git clone https://github.com/B0B8K1ng/WMNavigation
    cd WMNavigation
    
  2. Create the conda environment and install all dependencies.
    conda create -n wmnav python=3.9 cmake=3.14.0
    conda activate wmnav
    conda install habitat-sim=0.3.1 withbullet headless -c conda-forge -c aihabitat
    
    pip install -e .
    
    pip install -r requirements.txt
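
Optionally, after completing the steps above, you can verify that habitat-sim imports correctly (a quick sanity check, not part of the official setup):

    python -c "import habitat_sim; print('habitat-sim OK')"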
    

🛢 Prepare Dataset

This project is based on the Habitat simulator; the HM3D and MP3D datasets are available here. Our code requires all of the above data to be placed in a data folder with the following layout. Move the downloaded HM3D v0.1, HM3D v0.2, and MP3D folders into the following structure:

├── <DATASET_ROOT>
│  ├── hm3d_v0.1/
│  │  ├── val/
│  │  │  ├── 00800-TEEsavR23oF/
│  │  │  │  ├── TEEsavR23oF.navmesh
│  │  │  │  ├── TEEsavR23oF.glb
│  │  ├── hm3d_annotated_basis.scene_dataset_config.json
│  ├── objectnav_hm3d_v0.1/
│  │  ├── val/
│  │  │  ├── content/
│  │  │  │  ├── 4ok3usBNeis.json.gz
│  │  │  ├── val.json.gz
│  ├── hm3d_v0.2/
│  │  ├── val/
│  │  │  ├── 00800-TEEsavR23oF/
│  │  │  │  ├── TEEsavR23oF.basis.navmesh
│  │  │  │  ├── TEEsavR23oF.basis.glb
│  │  ├── hm3d_annotated_basis.scene_dataset_config.json
│  ├── objectnav_hm3d_v0.2/
│  │  ├── val/
│  │  │  ├── content/
│  │  │  │  ├── 4ok3usBNeis.json.gz
│  │  │  ├── val.json.gz
│  ├── mp3d/
│  │  ├── 17DRP5sb8fy/
│  │  │  ├── 17DRP5sb8fy.glb
│  │  │  ├── 17DRP5sb8fy.house
│  │  │  ├── 17DRP5sb8fy.navmesh
│  │  │  ├── 17DRP5sb8fy_semantic.ply
│  │  ├── mp3d_annotated_basis.scene_dataset_config.json
│  ├── objectnav_mp3d/
│  │  ├── val/
│  │  │  ├── content/
│  │  │  │  ├── 2azQ1b91cZZ.json.gz
│  │  │  ├── val.json.gz

The variable DATASET_ROOT can be set in the .env file.
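
For example, a minimal .env entry might look like this (the path is a placeholder for your local dataset location):

DATASET_ROOT=/path/to/data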

Task datasets: objectnav_hm3d_v0.1, objectnav_hm3d_v0.2 and objectnav_mp3d

🚩 API Key

You can choose the VLM type in the YAML config file:

agent_cfg:
  ...
  vlm_cfg:
    model_cls: GeminiVLM # [GeminiVLM, QwenVLM]
    model_kwargs:
      model: gemini-2.0-flash # [gemini-1.5-flash, gemini-1.5-pro, gemini-2.0-flash, Qwen/Qwen2.5-VL-3B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct]
  1. Gemini: To use the Gemini VLMs, paste a base URL and API key into the .env file as the variables GEMINI_BASE_URL and GEMINI_API_KEY.

  2. Qwen: To use the Qwen VLMs, refer to Qwen2.5-VL. We use vLLM for fast Qwen2.5-VL deployment and inference; you need to install vllm>0.7.2 to enable Qwen2.5-VL support.

    pip install git+https://github.com/huggingface/transformers@f3f6c86582611976e72be054675e2bf0abb5f775
    pip install accelerate
    pip install qwen-vl-utils
    pip install 'vllm>0.7.2'
    

    Run the command below to start an OpenAI-compatible API service:

    vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8000 --host 0.0.0.0 --dtype bfloat16 --limit-mm-per-prompt image=5,video=5
    

    Then set the variables in the .env file as follows:

    GEMINI_BASE_URL="http://localhost:8000/v1"
    GEMINI_API_KEY="EMPTY"
    
  3. Others: You can also try other VLMs by modifying api.py (using the OpenAI library), as sketched below.
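
As a rough illustration of the OpenAI-compatible call pattern (the wrapper classes and prompts in api.py will differ; the model name, image path, and reuse of the GEMINI_* variables below are assumptions):

# Rough sketch (not the official api.py interface): querying an
# OpenAI-compatible VLM endpoint with a text prompt plus one image.
import base64
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("GEMINI_BASE_URL"),  # assumes the .env values are exported
    api_key=os.environ.get("GEMINI_API_KEY"),
)

# Encode an example observation image as base64 for the chat request.
with open("observation.png", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="your-vlm-model-name",  # placeholder model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which direction should the agent explore to find a chair?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)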

🎮 Demo

Run the following command to visualize the result of an episode:

python scripts/main.py

Saved GIFs of the episode should appear in the logs/ directory.


📊 Evaluation

To evaluate WMNav at scale (HM3D v0.1 contains 2000 episodes, HM3D v0.2 contains 1000, and MP3D contains 2195), we use a framework for parallel evaluation. The file parallel.sh contains a script that distributes K instances over N GPUs and has each instance run M episodes. Note that each episode consumes ~320MB of GPU memory. A local Flask server is initialized to handle data aggregation, and the aggregated results are logged to wandb. Make sure you are logged in with wandb login.

This implementation requires tmux to be installed. Please install it via your package manager:

  • Ubuntu/Debian: sudo apt install tmux
  • macOS (with Homebrew): brew install tmux
# parallel.sh
ROOT_DIR=PROJECT_DIR
CONDA_PATH="<user>/miniconda3/etc/profile.d/conda.sh"
NUM_GPU=5
INSTANCES=50
NUM_EPISODES_PER_INSTANCE=20  # 20 for HM3D v0.2, 40 for HM3D v0.1, 44 for MP3D 
MAX_STEPS_PER_EPISODE=40
TASK="ObjectNav"
DATASET="hm3d_v0.2"  # Dataset [hm3d_v0.1, hm3d v0.2, mp3d]
CFG="WMNav"  # Name of config file
NAME="Evaluation"
PROJECT_NAME="WMNav"
VENV_NAME="wmnav" # Name of the conda environment
GPU_LIST=(3 4 5 6 7) # List of GPU IDs to use
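
Set INSTANCES and NUM_EPISODES_PER_INSTANCE so that their product covers the dataset (e.g., 50 × 20 = 1000 episodes for HM3D v0.2, 50 × 40 = 2000 for HM3D v0.1, and 50 × 44 = 2200 ≥ 2195 for MP3D). After editing the variables above, launch the evaluation; a minimal invocation, assuming you run it from the project root, is:

bash parallel.sh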

Results are saved in the logs/ directory.

🔨 Customize Experiments

To run your own configuration, please refer to the YAML file detailing the configuration variables:

task: ObjectNav
agent_cls: WMNavAgent # agent class
env_cls: WMNavEnv # env class

agent_cfg:
  navigability_mode: 'depth_sensor' 
  context_history: 0
  explore_bias: 4 
  max_action_dist: 1.7
  min_action_dist: 0.5
  clip_frac: 0.66 # clip action distance to avoid getting too close to obstacles
  stopping_action_dist: 1.5 # length of actions after the agent calls stop
  default_action: 0.2 # how far forward to move if the VLM's chosen action is invalid

💡 If you want to design your own model (by implementing your own CustomAgent and CustomEnv) or try the ablation experiments detailed in the paper, please refer to custom_agent.py and custom_env.py; a minimal config sketch follows below.
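
As a rough sketch, and assuming custom classes are selected the same way as the built-in WMNavAgent and WMNavEnv, the config would point agent_cls and env_cls at your classes:

task: ObjectNav
agent_cls: CustomAgent # your agent subclass from custom_agent.py
env_cls: CustomEnv # your env subclass from custom_env.py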

🙇 Acknowledgement

This work builds on many amazing research works and open-source projects; thanks a lot to all the authors for sharing!

๐Ÿ“ Citation

If you find our work useful in your research, please consider giving a star ⭐ and citing the following paper 📝.

@article{nie2025wmnav,
  title={WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation},
  author={Nie, Dujun and Guo, Xianda and Duan, Yiqun and Zhang, Ruijun and Chen, Long},
  journal={arXiv preprint arXiv:2503.02247},
  year={2025}
}

🤗 Contact

For feedback, questions, or press inquiries, please contact niedujun2024@ia.ac.cn.
