JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
Shuang Zeng1,2, Dekang Qi1, Xinyuan Chang1, Feng Xiong1, Shichao Xie1, Xiaolong Wu1, Shiyi Liang1,2, Mu Xu1, Xing Wei2
1Amap, Alibaba Group, 2Xi'an Jiaotong University
(Demo video: janusvln.mp4)
JanusVLN is a novel VLN framework and the first to feature a dual implicit memory. Inspired by the implicit scene representation in human navigation, which integrates left-brain semantic understanding with right-brain spatial cognition, JanusVLN constructs two complementary, fixed-size, compact neural memories. JanusVLN steers VLN research away from the 2D-semantics-dominant paradigm and toward 3D spatial-semantic synergy, a critical direction for developing next-generation spatially aware embodied agents.
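As a toy illustration only (not the JanusVLN implementation, whose memories are implicit neural caches inside the model), the sketch below shows the general idea of keeping two complementary, fixed-capacity feature buffers whose cost stays constant as the trajectory grows; the dimensions, capacity, and fusion step are placeholders.

```python
# Toy sketch of the dual-memory idea, NOT the official implementation:
# two fixed-capacity buffers, one for semantic features and one for
# spatial-geometric features, so memory cost stays constant no matter how
# long the trajectory grows. Dimensions and capacity are placeholders.
from collections import deque
import torch

class DualImplicitMemoryToy:
    def __init__(self, capacity: int = 16, dim: int = 1024):
        self.semantic = deque(maxlen=capacity)  # "left-brain" stream
        self.spatial = deque(maxlen=capacity)   # "right-brain" stream
        self.dim = dim

    def update(self, semantic_feat: torch.Tensor, spatial_feat: torch.Tensor) -> None:
        # Oldest entries are evicted automatically once capacity is reached.
        self.semantic.append(semantic_feat)
        self.spatial.append(spatial_feat)

    def read(self) -> torch.Tensor:
        # Fuse the two streams (here: simple concatenation) for action prediction.
        sem = torch.stack(list(self.semantic))  # (T, dim)
        spa = torch.stack(list(self.spatial))   # (T, dim)
        return torch.cat([sem, spa], dim=-1)    # (T, 2 * dim)

memory = DualImplicitMemoryToy()
for _ in range(32):  # more steps than capacity; buffers stay fixed-size
    memory.update(torch.randn(1024), torch.randn(1024))
print(memory.read().shape)  # torch.Size([16, 2048])
```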
[2025-11-06] The weights previously uploaded for the JanusVLN_Extra model were incorrect. If you want to run inference directly, please re-download the correct weights from JanusVLN_Extra.
Create the required environment through the following steps:
git clone https://github.com/MIV-XJTU/JanusVLN.git && cd JanusVLN
conda create -n janusvln python=3.9 -y && conda activate janusvln
conda install habitat-sim==0.2.4 withbullet headless -c conda-forge -c aihabitat
git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab
pip install -e habitat-baselines
cd ..
# CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# Install JanusVLN
pip install -e .
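A minimal sanity check for the environment above (assuming the packages install under their usual import names, habitat_sim and habitat):

```python
# Minimal environment sanity check: confirms the core dependencies import
# and that PyTorch sees a CUDA device.
import torch
import habitat_sim  # simulator backend installed via conda
import habitat      # habitat-lab API installed via pip -e

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("habitat-sim loaded from:", habitat_sim.__file__)
print("habitat-lab loaded from:", habitat.__file__)
```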
1. Scene Datasets
- For R2R, RxR: Download the MP3D scenes from the official project page, and place them under data/scene_datasets/mp3d/.
- For ScaleVLN: Download the HM3D scenes from the official GitHub page, and place the train split under data/scene_datasets/hm3d/.
2. VLN-CE Episodes
Download the VLN-CE episodes and extract them into the data/datasets/ directory:
- r2r (Rename R2R_VLNCE_v1-3_preprocessed/ -> r2r/)
- rxr (Rename RxR_VLNCE_v0/ -> rxr/)
- scalevln (Follow StreamVLN to convert a subset of the ScaleVLN dataset into the VLN-CE format.)
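To confirm an episode split is in place and parses, a quick check like the one below can help (it assumes the standard VLN-CE layout in which each *.json.gz holds a dict with an "episodes" list):

```python
# Quick check that a VLN-CE episode file is in place and parses.
# Assumes the usual VLN-CE schema: a gzipped JSON dict with an "episodes" list.
import gzip
import json

path = "data/datasets/r2r/val_unseen/val_unseen.json.gz"
with gzip.open(path, "rt") as f:
    data = json.load(f)
print(f"{path}: {len(data['episodes'])} episodes")
```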
3. Collected Trajectory Data
We provide pre-collected observation-action trajectory data for training. R2R and RxR are collected following VLN-CE, and ScaleVLN is collected following StreamVLN. DAgger data is collected using JanusVLN_Base. Note: it is best to collect DAgger data with your own base model. Download the collected trajectory data from ModelScope and extract it into the data/trajectory_data/ and data/dagger_data/ directories.
Your final folder structure should look like this:
data/
├── datasets/
│   ├── r2r/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   └── val_seen.json.gz
│   │   └── val_unseen/
│   │       └── val_unseen.json.gz
│   ├── rxr/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   ├── val_seen_guide.json.gz
│   │   │   └── ...
│   │   └── val_unseen/
│   │       ├── val_unseen_guide.json.gz
│   │       └── ...
│   └── scalevln/
│       └── scalevln_subset_150k.json.gz
├── scene_datasets/
│   ├── hm3d/
│   │   ├── 00000-kfPV7w3FaU5/
│   │   ├── 00001-UVdNNRcVyV1/
│   │   └── ...
│   └── mp3d/
│       ├── 17DRP5sb8fy/
│       ├── 1LXtFkjw3qL/
│       └── ...
├── trajectory_data/
│   ├── R2R-CE-640x480/
│   │   └── images/
│   ├── RxR-CE-640x480/
│   │   └── images/
│   └── ScaleVLN/
│       ├── images/
│       └── annotations.json
└── dagger_data/
    ├── R2R/
    │   ├── images/
    │   └── annotations.json
    └── RxR/
        ├── images/
        └── annotations.json
4. Build Datasets
Construct a base dataset that only includes R2R-CE and RxR-CE:
python create_data/create_data.py
Finally, the dataset information needs to be configured in the file src/qwen_vl/data/__init__.py.
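The exact schema is defined by src/qwen_vl/data/__init__.py in this repo; as a hypothetical sketch only, the kind of entry to add typically maps a dataset name to its annotation file and image root. All names, keys, and annotation paths below are placeholders, not the repo's actual API:

```python
# Hypothetical sketch of dataset entries for src/qwen_vl/data/__init__.py.
# The real keys and registration mechanism come from this repo; everything
# below is a placeholder illustrating the general pattern of pointing a
# dataset name at its annotation file and image root.
R2R_CE = {
    "annotation_path": "path/to/r2r_ce_annotations.json",  # output of create_data.py (placeholder)
    "data_path": "data/trajectory_data/R2R-CE-640x480/images",
}

RXR_CE = {
    "annotation_path": "path/to/rxr_ce_annotations.json",  # output of create_data.py (placeholder)
    "data_path": "data/trajectory_data/RxR-CE-640x480/images",
}

data_dict = {
    "r2r_ce": R2R_CE,
    "rxr_ce": RXR_CE,
}
```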
We provide two sets of JanusVLN model weights, distinguished by whether additional data is used:
| Model | Data | Name |
|---|---|---|
| JanusVLN | R2R-CE, RxR-CE | JanusVLN_Base |
| JanusVLN | R2R-CE, RxR-CE, DAgger, ScaleVLN | JanusVLN_Extra |
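For reference, a hedged download sketch using huggingface_hub is shown below; the repo ID is a placeholder, so take the actual ID from the JanusVLN_Base / JanusVLN_Extra links above (or use the equivalent ModelScope tooling if the weights are hosted there):

```python
# Hypothetical download sketch; the repo ID below is a placeholder, so take
# the actual ID from the JanusVLN_Base / JanusVLN_Extra links above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ORG/JanusVLN_Extra",            # placeholder
    local_dir="checkpoints/JanusVLN_Extra",
)
print("weights downloaded to:", local_dir)
```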
- Base Training
  Use the base data to train the base model:
  bash scripts/train.sh
- DAgger Collection
  Collect DAgger data using the base model:
  bash scripts/dagger.sh
  Construct the extra dataset:
  python create_data/create_data.py --use_extra_data
  It is also necessary to configure the dataset information in the file src/qwen_vl/data/__init__.py.
- Extra Training
  Continue training on the extra data on top of the base model:
  bash scripts/train_extra.sh
Use multiple GPUs to run inference with the model for evaluation:
bash scripts/evaluation.sh
If you find JanusVLN useful in your research or applications, please consider giving us a star ⭐ and citing it with the following BibTeX entry:
@article{zeng2025janusvln,
title={JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation},
author={Zeng, Shuang and Qi, Dekang and Chang, Xinyuan and Xiong, Feng and Xie, Shichao and Wu, Xiaolong and Liang, Shiyi and Xu, Mu and Wei, Xing},
journal={arXiv preprint arXiv:2509.22548},
year={2025}
}
Our work is primarily based on the following codebases: Qwen2.5-VL, VGGT, StreamVLN, VG-LLM. We are sincerely grateful for their work.