This is the official implementation of the paper "VIP: Vision Instructed Pre-training for Robotic Manipulation".
To prepare the environment, first create a conda environment with the following commands. We do not recommend other Python versions because Isaac Gym may not support them.
conda create -n VIRT python=3.8
conda activate VIRT
Then, install Isaac Gym following the official Isaac Gym instructions. We used version 1.0.preview4. Isaac Gym must be installed on a machine with a display and at least one GPU, and CUDA must also be installed. After installing Isaac Gym, move to its installation path and run the following commands:
cd isaacgym/python
pip install -e .
Now, please go to the root directory of this project and install the following dependencies:
pip install torch==2.3.1 torchvision==0.18.1
pip install pyquaternion
pip install pyyaml
pip install rospkg
pip install pexpect
pip install opencv-python
pip install matplotlib
pip install einops
pip install packaging
pip install h5py
pip install ipython
pip install xformers==0.0.27
pip install tensorboard
pip install tqdm
cd VIRT/detr && pip install -e . && cd ../..
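As a quick sanity check after installation, the following minimal sketch reports which of the packages listed above cannot be found in the active environment. It is not part of this repository: the helper name `missing_modules` is hypothetical, and the list uses import names, which differ from the pip names for a few packages (`pyyaml` is imported as `yaml`, `opencv-python` as `cv2`, `ipython` as `IPython`). The package installed from `VIRT/detr` is left out because its import name is not stated here.

```python
from importlib.util import find_spec

# Import names for the packages installed above (an assumption-free mapping
# only for the well-known cases: pyyaml -> yaml, opencv-python -> cv2,
# ipython -> IPython). The detr package is intentionally not listed.
REQUIRED_MODULES = [
    "isaacgym", "torch", "torchvision", "pyquaternion", "yaml", "rospkg",
    "pexpect", "cv2", "matplotlib", "einops", "packaging", "h5py",
    "IPython", "xformers", "tensorboard", "tqdm",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located.

    find_spec only looks the modules up; it does not import them.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it inside the activated VIRT environment. A module being found only means it is installed, not that its version matches the ones pinned above.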
We provide our collected teleoperation data for the three simulated tasks: 'Move a Single Box', 'Transport the Specified Box', and 'Stack the Specified Boxes'. You can download the data from here. Please unzip the data and put it in the $ROOT/VIRT/datasets directory. The directory structure should look like this:
$ROOT/VIRT/datasets
├── isaac_singlebox
│ ├── h5py
│ ├── exterior_camera1
│ ├── exterior_camera2
│ ├── top_camera
│ ├── wrist_camera
├── isaac_singlecolorbox
├── isaac_multicolorbox
The folders isaac_singlebox, isaac_singlecolorbox, and isaac_multicolorbox correspond to the aforementioned three simulated tasks, respectively.
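To confirm that the unzipped data matches the layout above, a small check such as the following can help. This is a sketch rather than a script shipped with the project: the function name `missing_dataset_dirs` is hypothetical, and the camera and `h5py` subfolders are only checked for `isaac_singlebox`, since that is the only task whose contents are shown in the tree.

```python
import sys
from pathlib import Path

# The three task folders named in the README.
TASK_DIRS = ["isaac_singlebox", "isaac_singlecolorbox", "isaac_multicolorbox"]
# Subfolders shown in the tree for isaac_singlebox only.
SINGLEBOX_SUBDIRS = [
    "h5py", "exterior_camera1", "exterior_camera2", "top_camera", "wrist_camera",
]


def missing_dataset_dirs(datasets_root):
    """Return expected directories (relative to the root) that do not exist."""
    root = Path(datasets_root)
    expected = [root / task for task in TASK_DIRS]
    expected += [root / "isaac_singlebox" / sub for sub in SINGLEBOX_SUBDIRS]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_dir()]


if __name__ == "__main__":
    # Defaults to VIRT/datasets relative to the current directory.
    datasets_root = sys.argv[1] if len(sys.argv) > 1 else "VIRT/datasets"
    missing = missing_dataset_dirs(datasets_root)
    if missing:
        print("Missing directories under", datasets_root)
        for name in missing:
            print("  " + name)
    else:
        print("Dataset layout looks complete.")
```

Pass the path to the datasets directory as the first argument if you run the check from somewhere other than $ROOT.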
We provide the policy weights of the VIRT model pre-trained by the proposed RIP paradigm here. Please download the checkpoint and save it as $ROOT/VIRT/pretrained/VIRT_droid_pretrain.ckpt.
To train VIRT, run the following command template from $ROOT/VIRT:
torchrun --nnodes=$nnodes --nproc_per_node=$nproc_per_node --node_rank=$node_rank --master_addr=$master_addr --master_port $master_port \
main.py \
--config_name $config_name \
--save_dir ./outputs/$exp_id \
--data_dir $data_path \
--num_nodes $nnodes \
--load_pretrain pretrained/VIRT_droid_pretrain.ckpt
We provide config files for the three simulated tasks in $ROOT/VIRT/configs, and $config_name is the path of the config file corresponding to the experiment you want to run. The variable $data_path is the path to the corresponding dataset, e.g., $ROOT/VIRT/datasets/isaac_singlebox for the task Move a Single Box. The training logs and model checkpoints will be saved in ./outputs/$exp_id. To train on a single GPU, set $nnodes, $nproc_per_node, and $node_rank to 1, 1, and 0, respectively. $master_addr is the IP address of your computer and can be set to 127.0.0.1. $master_port can be 29515, and it must differ between experiments if you run several of them simultaneously. Our code supports multi-node multi-GPU training.
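For the single-GPU case, the template can be filled in as sketched below. The helpers `build_train_command` and `find_free_port` are hypothetical and not part of this repository; the flags are copied from the template, and the config path, experiment name, and data path in the usage section are placeholders to replace with your own values.

```python
import shlex
import socket


def find_free_port():
    """Ask the OS for a TCP port that is currently unused on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


def build_train_command(config_name, exp_id, data_dir, master_port=29515):
    """Return the torchrun argument list for single-node, single-GPU training."""
    nnodes, nproc_per_node, node_rank = 1, 1, 0  # single-GPU values from the text
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        "--master_addr=127.0.0.1",
        "--master_port", str(master_port),
        "main.py",
        "--config_name", config_name,
        "--save_dir", f"./outputs/{exp_id}",
        "--data_dir", data_dir,
        "--num_nodes", str(nnodes),
        "--load_pretrain", "pretrained/VIRT_droid_pretrain.ckpt",
    ]


if __name__ == "__main__":
    # Placeholder values: substitute your own config path and experiment name.
    command = build_train_command(
        config_name="configs/YOUR_CONFIG",
        exp_id="singlebox_run",
        data_dir="datasets/isaac_singlebox",
    )
    print(shlex.join(command))
```

When several experiments run at the same time on one machine, passing `master_port=find_free_port()` is one way to avoid port clashes. The port is only known to be free at the moment it is queried, so a clash remains possible if another process claims it before torchrun starts.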
To evaluate the trained policy, use the following command template:
torchrun --nnodes=$nnodes --nproc_per_node=$nproc_per_node --node_rank=$node_rank --master_addr=$master_addr --master_port $master_port \
main.py \
--config_name $config_name \
--save_dir outputs/$exp_id \
--load_dir outputs/$exp_id/policy_latest.ckpt \
--num_nodes $nnodes \
--eval
The variables are set in the same way as in the training script.
This project is released under CC-BY-NC 4.0.
Please use the following BibTeX entry if you find our work useful:
@article{li2025virt,
title={VIP: Vision Instructed Pre-training for Robotic Manipulation},
author={Li, Zhuoling and Ren, Liangliang and Yang, Jinrong and Zhao, Yong and others},
journal={ICML},
year={2025}
}