WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

IROS 2025 Oral

arXiv | Home Page | YouTube

Dujun Nie1,*, Xianda Guo2,*, Yiqun Duan3, Ruijun Zhang1, Long Chen1,4,5,†

1 Institute of Automation, Chinese Academy of Sciences, 2 School of Computer Science, Wuhan University, 3 School of Computer Science, University of Technology Sydney, 4 IAIR, Xi'an Jiaotong University, 5 Waytous
📧: niedujun2024@ia.ac.cn, xianda_guo@163.com, duanyiquncc@gmail.com

[IROS'25] This repository is the official implementation of WMNav, a novel World Model-based Object Goal Navigation framework powered by Vision-Language Models.

Overview of our navigation framework

This project is based on VLMnav. Our method comprises three components:
  1. Novel architecture - Introducing a new direction for object goal navigation using a world model consisting of VLMs and novel modules.
  2. Memory strategy - Designing an innovative memory strategy of predicted environmental states that employs an online Curiosity Value Map to quantitatively store the likelihood of the target's presence in various scenarios predicted by the world model.
  3. Efficiency - Proposing a subtask decomposition with feedback and a two-stage action proposer strategy to enhance the reliability of VLM reasoning outcomes and improve exploration efficiency.

🔥 News

  • Jun. 16th, 2025: WMNav is accepted to IROS 2025 and selected as an oral presentation.
  • Jun. 11th, 2025: WMNav now supports the open-source model Qwen2.5-VL!
  • Mar. 14th, 2025: The code of WMNav is available! ☕️
  • Mar. 4th, 2025: We released our paper on arXiv.

📚 Table of Contents

  • 🚀 Get Started
    • ⚙ Installation and Setup
    • 🛢 Prepare Dataset
    • 🚩 API Key
  • 🎮 Demo
  • 📊 Evaluation
  • 🔨 Customize Experiments
  • 🙇 Acknowledgement
  • 📝 Citation
  • 🤗 Contact

🚀 Get Started

⚙ Installation and Setup

  1. Clone this repo.
    git clone https://github.com/B0B8K1ng/WMNavigation
    cd WMNavigation
    
  2. Create the conda environment and install all dependencies.
    conda create -n wmnav python=3.9 cmake=3.14.0
    conda activate wmnav
    conda install habitat-sim=0.3.1 withbullet headless -c conda-forge -c aihabitat
    
    pip install -e .
    
    pip install -r requirements.txt
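
Optionally, after completing the steps above, you can verify that habitat-sim imports correctly (a quick sanity check, not part of the official setup):

    python -c "import habitat_sim; print('habitat-sim OK')"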
    

🛢 Prepare Dataset

This project is based on the Habitat simulator; the HM3D and MP3D datasets are available here. Our code requires all of the above data to be placed in a data folder with the following layout. Move the downloaded HM3D v0.1, HM3D v0.2, and MP3D folders into the following structure:

├── <DATASET_ROOT>
│  ├── hm3d_v0.1/
│  │  ├── val/
│  │  │  ├── 00800-TEEsavR23oF/
│  │  │  │  ├── TEEsavR23oF.navmesh
│  │  │  │  ├── TEEsavR23oF.glb
│  │  ├── hm3d_annotated_basis.scene_dataset_config.json
│  ├── objectnav_hm3d_v0.1/
│  │  ├── val/
│  │  │  ├── content/
│  │  │  │  ├── 4ok3usBNeis.json.gz
│  │  │  ├── val.json.gz
│  ├── hm3d_v0.2/
│  │  ├── val/
│  │  │  ├── 00800-TEEsavR23oF/
│  │  │  │  ├── TEEsavR23oF.basis.navmesh
│  │  │  │  ├── TEEsavR23oF.basis.glb
│  │  ├── hm3d_annotated_basis.scene_dataset_config.json
│  ├── objectnav_hm3d_v0.2/
│  │  ├── val/
│  │  │  ├── content/
│  │  │  │  ├── 4ok3usBNeis.json.gz
│  │  │  ├── val.json.gz
│  ├── mp3d/
│  │  ├── 17DRP5sb8fy/
│  │  │  ├── 17DRP5sb8fy.glb
│  │  │  ├── 17DRP5sb8fy.house
│  │  │  ├── 17DRP5sb8fy.navmesh
│  │  │  ├── 17DRP5sb8fy_semantic.ply
│  │  ├── mp3d_annotated_basis.scene_dataset_config.json
│  ├── objectnav_mp3d/
│  │  ├── val/
│  │  │  ├── content/
│  │  │  │  ├── 2azQ1b91cZZ.json.gz
│  │  │  ├── val.json.gz

The variable DATASET_ROOT can be set in the .env file.
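
For example, a minimal .env entry might look like this (the path is a placeholder for your local dataset location):

DATASET_ROOT=/path/to/data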

Task datasets: objectnav_hm3d_v0.1, objectnav_hm3d_v0.2 and objectnav_mp3d

🚩 API Key

You can choose the VLM type in the YAML config file:

agent_cfg:
  ...
  vlm_cfg:
    model_cls: GeminiVLM # [GeminiVLM, QwenVLM]
    model_kwargs:
      model: gemini-2.0-flash # [gemini-1.5-flash, gemini-1.5-pro, gemini-2.0-flash, Qwen/Qwen2.5-VL-3B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct]
  1. Gemini: To use the Gemini VLMs, paste a base URL and API key into the .env file as the variables GEMINI_BASE_URL and GEMINI_API_KEY.

  2. Qwen: To use the Qwen VLMs, refer to Qwen2.5-VL. We use vLLM for fast Qwen2.5-VL deployment and inference; you need to install vllm>0.7.2 to enable Qwen2.5-VL support.

    pip install git+https://github.com/huggingface/transformers@f3f6c86582611976e72be054675e2bf0abb5f775
    pip install accelerate
    pip install qwen-vl-utils
    pip install 'vllm>0.7.2'
    

    Run the command below to start an OpenAI-compatible API service:

    vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8000 --host 0.0.0.0 --dtype bfloat16 --limit-mm-per-prompt image=5,video=5
    

    Then set the variables in the .env file as follows:

    GEMINI_BASE_URL="http://localhost:8000/v1"
    GEMINI_API_KEY="EMPTY"
    
  3. Others: You can also try other VLMs by modifying api.py (using the OpenAI library), as sketched below.
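
As a rough illustration of the OpenAI-compatible call pattern (the wrapper classes and prompts in api.py will differ; the model name, image path, and reuse of the GEMINI_* variables below are assumptions):

# Rough sketch (not the official api.py interface): querying an
# OpenAI-compatible VLM endpoint with a text prompt plus one image.
import base64
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("GEMINI_BASE_URL"),  # assumes the .env values are exported
    api_key=os.environ.get("GEMINI_API_KEY"),
)

# Encode an example observation image as base64 for the chat request.
with open("observation.png", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="your-vlm-model-name",  # placeholder model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which direction should the agent explore to find a chair?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)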

🎮 Demo

Run the following command to visualize the result of an episode:

python scripts/main.py

Saved GIFs of the episode should appear in the logs/ directory.


📊 Evaluation

To evaluate WMNav at scale (HM3D v0.1 contains 2000 episodes, HM3D v0.2 contains 1000, and MP3D contains 2195), we use a framework for parallel evaluation. The file parallel.sh contains a script that distributes K instances over N GPUs and has each instance run M episodes. Note that each episode consumes ~320MB of GPU memory. A local Flask server is initialized to handle data aggregation, and the aggregated results are logged to wandb. Make sure you are logged in with wandb login.

This implementation requires tmux to be installed. Please install it via your package manager:

  • Ubuntu/Debian: sudo apt install tmux
  • macOS (with Homebrew): brew install tmux
# parallel.sh
ROOT_DIR=PROJECT_DIR
CONDA_PATH="<user>/miniconda3/etc/profile.d/conda.sh"
NUM_GPU=5
INSTANCES=50
NUM_EPISODES_PER_INSTANCE=20  # 20 for HM3D v0.2, 40 for HM3D v0.1, 44 for MP3D 
MAX_STEPS_PER_EPISODE=40
TASK="ObjectNav"
DATASET="hm3d_v0.2"  # Dataset [hm3d_v0.1, hm3d v0.2, mp3d]
CFG="WMNav"  # Name of config file
NAME="Evaluation"
PROJECT_NAME="WMNav"
VENV_NAME="wmnav" # Name of the conda environment
GPU_LIST=(3 4 5 6 7) # List of GPU IDs to use
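
Set INSTANCES and NUM_EPISODES_PER_INSTANCE so that their product covers the dataset (e.g., 50 × 20 = 1000 episodes for HM3D v0.2, 50 × 40 = 2000 for HM3D v0.1, and 50 × 44 = 2200 ≥ 2195 for MP3D). After editing the variables above, launch the evaluation; a minimal invocation, assuming you run it from the project root, is:

bash parallel.sh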

Results are saved in the logs/ directory.

🔨 Customize Experiments

To run your own configuration, please refer to the YAML file detailing the configuration variables:

task: ObjectNav
agent_cls: WMNavAgent # agent class
env_cls: WMNavEnv # env class

agent_cfg:
  navigability_mode: 'depth_sensor' 
  context_history: 0
  explore_bias: 4 
  max_action_dist: 1.7
  min_action_dist: 0.5
  clip_frac: 0.66 # clip action distance to avoid getting too close to obstacles
  stopping_action_dist: 1.5 # length of actions after the agent calls stop
  default_action: 0.2 # how far forward to move if the VLM's chosen action is invalid

💡 If you want to design your own model (by implementing your own CustomAgent and CustomEnv) or try the ablation experiments detailed in the paper, please refer to custom_agent.py and custom_env.py; a minimal config sketch follows below.
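
As a rough sketch, and assuming custom classes are selected the same way as the built-in WMNavAgent and WMNavEnv, the config would point agent_cls and env_cls at your classes:

task: ObjectNav
agent_cls: CustomAgent # your agent subclass from custom_agent.py
env_cls: CustomEnv # your env subclass from custom_env.py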

🙇 Acknowledgement

This work builds on many amazing research works and open-source projects; thanks a lot to all the authors for sharing!

๐Ÿ“ Citation

If you find our work useful in your research, please consider giving a star ⭐ and citing the following paper 📝.

@article{nie2025wmnav,
  title={WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation},
  author={Nie, Dujun and Guo, Xianda and Duan, Yiqun and Zhang, Ruijun and Chen, Long},
  journal={arXiv preprint arXiv:2503.02247},
  year={2025}
}

🤗 Contact

For feedback, questions, or press inquiries, please contact niedujun2024@ia.ac.cn.
