JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
Shuang Zeng1,2, Dekang Qi1, Xinyuan Chang1, Feng Xiong1, Shichao Xie1, Xiaolong Wu1, Shiyi Liang1,2, Mu Xu1, Xing Wei2
1Amap, Alibaba Group, 2Xi'an Jiaotong University
(Demo video: janusvln.mp4)
JanusVLN is a novel VLN framework and the first to feature a dual implicit memory. Inspired by the implicit scene representation in human navigation, which integrates left-brain semantic understanding with right-brain spatial cognition, JanusVLN constructs two complementary, fixed-size, compact neural memories. JanusVLN steers VLN research away from the 2D-semantics-dominant paradigm and toward 3D spatial-semantic synergy, a critical direction for developing next-generation spatially aware embodied agents.
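As a toy illustration only (not the JanusVLN implementation, whose memories are implicit neural caches inside the model), the sketch below shows the general idea of keeping two complementary, fixed-capacity feature buffers whose cost stays constant as the trajectory grows; the dimensions, capacity, and fusion step are placeholders.

```python
# Toy sketch of the dual-memory idea, NOT the official implementation:
# two fixed-capacity buffers, one for semantic features and one for
# spatial-geometric features, so memory cost stays constant no matter how
# long the trajectory grows. Dimensions and capacity are placeholders.
from collections import deque
import torch

class DualImplicitMemoryToy:
    def __init__(self, capacity: int = 16, dim: int = 1024):
        self.semantic = deque(maxlen=capacity)  # "left-brain" stream
        self.spatial = deque(maxlen=capacity)   # "right-brain" stream
        self.dim = dim

    def update(self, semantic_feat: torch.Tensor, spatial_feat: torch.Tensor) -> None:
        # Oldest entries are evicted automatically once capacity is reached.
        self.semantic.append(semantic_feat)
        self.spatial.append(spatial_feat)

    def read(self) -> torch.Tensor:
        # Fuse the two streams (here: simple concatenation) for action prediction.
        sem = torch.stack(list(self.semantic))  # (T, dim)
        spa = torch.stack(list(self.spatial))   # (T, dim)
        return torch.cat([sem, spa], dim=-1)    # (T, 2 * dim)

memory = DualImplicitMemoryToy()
for _ in range(32):  # more steps than capacity; buffers stay fixed-size
    memory.update(torch.randn(1024), torch.randn(1024))
print(memory.read().shape)  # torch.Size([16, 2048])
```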
[2025-11-06] The weights previously uploaded for the JanusVLN_Extra model were incorrect. If you want to run inference directly, please re-download the correct weights from JanusVLN_Extra.
Create the required environment through the following steps:
git clone https://github.com/MIV-XJTU/JanusVLN.git && cd JanusVLN
conda create -n janusvln python=3.9 -y && conda activate janusvln
conda install habitat-sim==0.2.4 withbullet headless -c conda-forge -c aihabitat
git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab
pip install -e habitat-baselines
cd ..
# CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# Install JanusVLN
pip install -e .
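A minimal sanity check for the environment above (assuming the packages install under their usual import names, habitat_sim and habitat):

```python
# Minimal environment sanity check: confirms the core dependencies import
# and that PyTorch sees a CUDA device.
import torch
import habitat_sim  # simulator backend installed via conda
import habitat      # habitat-lab API installed via pip -e

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("habitat-sim loaded from:", habitat_sim.__file__)
print("habitat-lab loaded from:", habitat.__file__)
```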
1. Scene Datasets
- For R2R, RxR: Download the MP3D scenes from the official project page, and place them under data/scene_datasets/mp3d/.
- For ScaleVLN: Download the HM3D scenes from the official GitHub page, and place the train split under data/scene_datasets/hm3d/.
2. VLN-CE Episodes
Download the VLN-CE episodes and extract them into the data/datasets/ directory:
- r2r (Rename R2R_VLNCE_v1-3_preprocessed/ -> r2r/)
- rxr (Rename RxR_VLNCE_v0/ -> rxr/)
- scalevln (Follow StreamVLN to convert a subset of the ScaleVLN dataset into the VLN-CE format.)
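To confirm an episode split is in place and parses, a quick check like the one below can help (it assumes the standard VLN-CE layout in which each *.json.gz holds a dict with an "episodes" list):

```python
# Quick check that a VLN-CE episode file is in place and parses.
# Assumes the usual VLN-CE schema: a gzipped JSON dict with an "episodes" list.
import gzip
import json

path = "data/datasets/r2r/val_unseen/val_unseen.json.gz"
with gzip.open(path, "rt") as f:
    data = json.load(f)
print(f"{path}: {len(data['episodes'])} episodes")
```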
3. Collected Trajectory Data
We provide pre-collected observation-action trajectory data for training. R2R and RxR are collected following VLN-CE, and ScaleVLN is collected following StreamVLN. DAgger data is collected using JanusVLN_Base. Note: it is best to collect DAgger data with your own base model. Download the collected trajectory data from ModelScope and extract it into the data/trajectory_data/ and data/dagger_data/ directories.
Your final folder structure should look like this:
data/
├── datasets/
│   ├── r2r/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   └── val_seen.json.gz
│   │   └── val_unseen/
│   │       └── val_unseen.json.gz
│   ├── rxr/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   ├── val_seen_guide.json.gz
│   │   │   └── ...
│   │   └── val_unseen/
│   │       ├── val_unseen_guide.json.gz
│   │       └── ...
│   └── scalevln/
│       └── scalevln_subset_150k.json.gz
├── scene_datasets/
│   ├── hm3d/
│   │   ├── 00000-kfPV7w3FaU5/
│   │   ├── 00001-UVdNNRcVyV1/
│   │   └── ...
│   └── mp3d/
│       ├── 17DRP5sb8fy/
│       ├── 1LXtFkjw3qL/
│       └── ...
├── trajectory_data/
│   ├── R2R-CE-640x480/
│   │   └── images/
│   ├── RxR-CE-640x480/
│   │   └── images/
│   └── ScaleVLN/
│       ├── images/
│       └── annotations.json
└── dagger_data/
    ├── R2R/
    │   ├── images/
    │   └── annotations.json
    └── RxR/
        ├── images/
        └── annotations.json
4. Build Datasets
Construct a base dataset that only includes R2R-CE and RxR-CE:
python create_data/create_data.py
Finally, the dataset information needs to be configured in the file src/qwen_vl/data/__init__.py.
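The exact schema is defined by src/qwen_vl/data/__init__.py in this repo; as a hypothetical sketch only, the kind of entry to add typically maps a dataset name to its annotation file and image root. All names, keys, and annotation paths below are placeholders, not the repo's actual API:

```python
# Hypothetical sketch of dataset entries for src/qwen_vl/data/__init__.py.
# The real keys and registration mechanism come from this repo; everything
# below is a placeholder illustrating the general pattern of pointing a
# dataset name at its annotation file and image root.
R2R_CE = {
    "annotation_path": "path/to/r2r_ce_annotations.json",  # output of create_data.py (placeholder)
    "data_path": "data/trajectory_data/R2R-CE-640x480/images",
}

RXR_CE = {
    "annotation_path": "path/to/rxr_ce_annotations.json",  # output of create_data.py (placeholder)
    "data_path": "data/trajectory_data/RxR-CE-640x480/images",
}

data_dict = {
    "r2r_ce": R2R_CE,
    "rxr_ce": RXR_CE,
}
```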
We provide two sets of JanusVLN model weights, distinguished by whether additional data is used:
| Model | Data | Name |
|---|---|---|
| JanusVLN | R2R-CE, RxR-CE | JanusVLN_Base |
| JanusVLN | R2R-CE, RxR-CE, DAgger, ScaleVLN | JanusVLN_Extra |
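For reference, a hedged download sketch using huggingface_hub is shown below; the repo ID is a placeholder, so take the actual ID from the JanusVLN_Base / JanusVLN_Extra links above (or use the equivalent ModelScope tooling if the weights are hosted there):

```python
# Hypothetical download sketch; the repo ID below is a placeholder, so take
# the actual ID from the JanusVLN_Base / JanusVLN_Extra links above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ORG/JanusVLN_Extra",            # placeholder
    local_dir="checkpoints/JanusVLN_Extra",
)
print("weights downloaded to:", local_dir)
```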
- Base Training
  Use the base data to train the base model:
  bash scripts/train.sh
- DAgger Collection
  Collect DAgger data using the base model:
  bash scripts/dagger.sh
  Construct the extra dataset:
  python create_data/create_data.py --use_extra_data
  It is also necessary to configure the dataset information in the file src/qwen_vl/data/__init__.py.
- Extra Training
  Continue training on the extra data on top of the base model:
  bash scripts/train_extra.sh
Use multiple GPUs to run inference with the model for evaluation:
bash scripts/evaluation.sh
If you find JanusVLN useful in your research or applications, please consider giving us a star ⭐ and citing it with the following BibTeX entry:
@article{zeng2025janusvln,
title={JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation},
author={Zeng, Shuang and Qi, Dekang and Chang, Xinyuan and Xiong, Feng and Xie, Shichao and Wu, Xiaolong and Liang, Shiyi and Xu, Mu and Wei, Xing},
journal={arXiv preprint arXiv:2509.22548},
year={2025}
}
Our work is primarily based on the following codebases: Qwen2.5-VL, VGGT, StreamVLN, VG-LLM. We are sincerely grateful for their work.