The official implementation of Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

If you find our project helpful, please give us a star ⭐ to support us 🙏🙏

📄 Paper | 🤗 Checkpoints | 📜 License

  • Disentangled Visual Foresight automatically captures the latent actions that delineate the visual trajectory without overburdening the backbone.
  • Progressive Training introduces modalities in stages, preserving the language understanding and reasoning capabilities of the VLM backbone.
  • Adaptive Temporal Ensemble dynamically adjusts temporal ensembling strength, reducing inference cost while maintaining stable control.

📘 Contents

  • 🎥 Demos
  • 📖 Introduction
  • 🤗 Models & Datasets
  • 📈 Evaluation
  • 🔧 Training
  • ✨ Acknowledgements
  • 📝 Citation

🎥 Demos

More demos coming soon...

In-domain instructions (3x speed):

  • Put the cup on the female singer
  • Put the cup on the Marvel superhero
  • Put the watch in the basket

Out-of-domain instructions (3x speed):

  • Put the cup on Taylor Swift
  • Put the cup on Iron Man
  • Put a thing that can tell the time in the basket

📖 Introduction

Figure: Previous vision-augmented action learning paradigms.

Figure: Overall framework of Mantis.

🤗 Models & Datasets

Model          Note
Mantis-Base    Base Mantis model trained through the 3-stage pretraining pipeline
Mantis-SSV2    Mantis model pretrained on the SSV2 dataset after Stage 1
Mantis-LIBERO  Mantis model fine-tuned on the LIBERO dataset

Dataset                            Note
Something-Something-v2             The human action video dataset used in Stage 1 pretraining
DROID-Lerobot                      The robot dataset used in Stage 2 & 3 pretraining
LLaVA-OneVision-1.5-Instruct-Data  The multimodal dataset used in Stage 3 pretraining
LIBERO-Lerobot                     The LIBERO dataset used for fine-tuning

📈 Evaluation

First, clone the repository and create the conda environment:

git clone git@github.com:Yysrc/Mantis.git
cd Mantis
conda env create -f configs/environment_libero.yml
conda activate mantis_libero

Then clone and install the LIBERO repository:

git clone git@github.com:Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .

Install other required packages:

cd ..
pip install -r experiments/libero/libero_requirements.txt

Evaluate the LIBERO benchmark:

sh experiments/libero/run_libero_eval.sh

Modify the task_suite_name parameter in the script to evaluate different task suites, and adjust the eval_mode parameter to switch between TE (temporal ensemble) and ATE (adaptive temporal ensemble) modes.
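
The snippet below is a minimal sketch of what these settings might look like inside experiments/libero/run_libero_eval.sh; the parameter names come from the description above, while the example values are assumptions to be replaced with the suite and mode you want to run.

task_suite_name="libero_spatial"   # assumed example value; set to the task suite you want to evaluate
eval_mode="ATE"                    # "ATE" = adaptive temporal ensemble, "TE" = plain temporal ensemble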

🔧 Training

Please first download the LIBERO datasets and the base Mantis model.
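
As one possible way to fetch them, the sketch below uses the Hugging Face CLI; the repository IDs and local paths are placeholders for the entries linked in the Models & Datasets section, not the actual names.

huggingface-cli download <mantis-base-repo-id> --local-dir checkpoints/mantis-base
huggingface-cli download <libero-lerobot-repo-id> --repo-type dataset --local-dir data/libero_lerobot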

First, create the training conda environment:

conda env create -f configs/environment_lerobot.yml
conda activate mantis_lerobot

Then clone and install the Lerobot repository:

git clone -b paszea/lerobot git@github.com:Yysrc/lerobot.git
cd lerobot
conda install ffmpeg=7.1.1 -c conda-forge
pip install -e .

The configuration files are in the configs folder. Please update dataset_root_dir to point to the LIBERO dataset directory and set resume_from_checkpoint to the path of the base Mantis model.
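
As a rough sketch, assuming the config is a YAML-style file (the filename below is a placeholder), the two keys could be updated in place like this without touching the rest of the file:

CONFIG=configs/your_libero_config.yml   # placeholder filename; use the config you actually train with
sed -i 's|^\(dataset_root_dir:\).*|\1 /path/to/LIBERO-Lerobot|' "$CONFIG"
sed -i 's|^\(resume_from_checkpoint:\).*|\1 /path/to/Mantis-Base|' "$CONFIG"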

Train the Mantis model on the LIBERO dataset:

sh train.sh

✨ Acknowledgements

Heartfelt thanks to the creators of Metaquery and Lerobot for their open-source work!

📝 Citation

If you find our code or models useful in your work, please cite our paper:

@article{yang2025mantis,
  title={Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight},
  author={Yang, Yi and Li, Xueqi and Chen, Yiyang and Song, Jin and Wang, Yihan and Xiao, Zipeng and Su, Jiadi and Qiaoben, You and Liu, Pengfei and Deng, Zhijie},
  journal={arXiv preprint arXiv:2511.16175},
  year={2025}
}
