The official implementation of Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

If you find our project helpful, please give us a star ⭐ to support us 🙏🙏

📄 Paper | 🤗 Checkpoints | 📜 License

  • Disentangled Visual Foresight automatically captures the latent actions that delineate the visual trajectory without overburdening the backbone.
  • Progressive Training introduces modalities in stages, preserving the language understanding and reasoning capabilities of the VLM backbone.
  • Adaptive Temporal Ensemble dynamically adjusts temporal ensembling strength, reducing inference cost while maintaining stable control.

📘 Contents

  • 🎥 Demos
  • 📖 Introduction
  • 🤗 Models & Datasets
  • 📈 Evaluation
  • 🔧 Training
  • ✨ Acknowledgements
  • 📝 Citation

🎥 Demos

More demos coming soon...

In-domain instructions (3x speed):

  • Put the cup on the female singer
  • Put the cup on the Marvel superhero
  • Put the watch in the basket

Out-of-domain instructions (3x speed):

  • Put the cup on Taylor Swift
  • Put the cup on Iron Man
  • Put a thing that can tell the time in the basket

📖 Introduction

Figure: Previous vision-augmented action learning paradigms.

Figure: Overall framework of Mantis.

🤗 Models & Datasets

Model          Note
Mantis-Base    Base Mantis model trained through the 3-stage pretraining pipeline
Mantis-SSV2    Mantis model pretrained on the SSV2 dataset after Stage 1
Mantis-LIBERO  Mantis model fine-tuned on the LIBERO dataset

Dataset                            Note
Something-Something-v2             The human action video dataset used in Stage 1 pretraining
DROID-Lerobot                      The robot dataset used in Stage 2 & 3 pretraining
LLaVA-OneVision-1.5-Instruct-Data  The multimodal dataset used in Stage 3 pretraining
LIBERO-Lerobot                     The LIBERO dataset used for fine-tuning

📈 Evaluation

First, clone the repository and create the conda environment:

git clone git@github.com:Yysrc/Mantis.git
cd Mantis
conda env create -f configs/environment_libero.yml
conda activate mantis_libero

Then clone and install the LIBERO repository:

git clone git@github.com:Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .

Install other required packages:

cd ..
pip install -r experiments/libero/libero_requirements.txt

Evaluate the LIBERO benchmark:

sh experiments/libero/run_libero_eval.sh

Modify the task_suite_name parameter in the script to evaluate different task suites, and adjust the eval_mode parameter to switch between TE (temporal ensemble) and ATE (adaptive temporal ensemble) modes.
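
The snippet below is a minimal sketch of what these settings might look like inside experiments/libero/run_libero_eval.sh; the parameter names come from the description above, while the example values are assumptions to be replaced with the suite and mode you want to run.

task_suite_name="libero_spatial"   # assumed example value; set to the task suite you want to evaluate
eval_mode="ATE"                    # "ATE" = adaptive temporal ensemble, "TE" = plain temporal ensemble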

🔧 Training

Please first download the LIBERO datasets and the base Mantis model.
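
As one possible way to fetch them, the sketch below uses the Hugging Face CLI; the repository IDs and local paths are placeholders for the entries linked in the Models & Datasets section, not the actual names.

huggingface-cli download <mantis-base-repo-id> --local-dir checkpoints/mantis-base
huggingface-cli download <libero-lerobot-repo-id> --repo-type dataset --local-dir data/libero_lerobot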

First, create the training conda environment:

conda env create -f configs/environment_lerobot.yml
conda activate mantis_lerobot

Then clone and install the Lerobot repository:

git clone -b paszea/lerobot git@github.com:Yysrc/lerobot.git
cd lerobot
conda install ffmpeg=7.1.1 -c conda-forge
pip install -e .

The configuration files are in the configs folder. Please update dataset_root_dir to point to the LIBERO dataset directory and set resume_from_checkpoint to the path of the base Mantis model.
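
As a rough sketch, assuming the config is a YAML-style file (the filename below is a placeholder), the two keys could be updated in place like this without touching the rest of the file:

CONFIG=configs/your_libero_config.yml   # placeholder filename; use the config you actually train with
sed -i 's|^\(dataset_root_dir:\).*|\1 /path/to/LIBERO-Lerobot|' "$CONFIG"
sed -i 's|^\(resume_from_checkpoint:\).*|\1 /path/to/Mantis-Base|' "$CONFIG"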

Train the Mantis model on the LIBERO dataset:

sh train.sh

✨ Acknowledgements

Heartfelt thanks to the creators of Metaquery and Lerobot for their open-source work!

📝 Citation

If you find our code or models useful in your work, please cite our paper:

@article{yang2025mantis,
  title={Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight},
  author={Yang, Yi and Li, Xueqi and Chen, Yiyang and Song, Jin and Wang, Yihan and Xiao, Zipeng and Su, Jiadi and Qiaoben, You and Liu, Pengfei and Deng, Zhijie},
  journal={arXiv preprint arXiv:2511.16175},
  year={2025}
}
