Project Page | Paper | arXiv
Junbang Liang1, Pavel Tokmakov2, Ruoshi Liu1, Sruthi Sudhakar1, Paarth Shah2, Rares Ambrus2, Carl Vondrick1
1Columbia University, 2Toyota Research Institute
Create environment:
git clone https://github.com/cvlab-columbia/videopolicy.git
cd videopolicy
conda create -n videopolicy python=3.10
conda activate videopolicy
Install simulation environment:
cd packages && \
git clone -b robocasa https://github.com/ARISE-Initiative/robomimic && pip install -e robomimic && \
git clone https://github.com/ARISE-Initiative/robosuite && pip install -e robosuite && \
git clone https://github.com/robocasa/robocasa && pip install -e robocasa && \
python robocasa/robocasa/scripts/download_kitchen_assets.py && \
python robocasa/robocasa/scripts/setup_macros.py
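As an optional sanity check (not part of the original setup), you can verify that all three simulation packages import cleanly:
python -c "import robomimic, robosuite, robocasa; print('simulation stack OK')"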
Install python packages:
cd ..
pip install -r requirements.txt
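Before moving on, it can be worth confirming (again optional) that PyTorch was installed with CUDA support, since both evaluation and training assume GPUs:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"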
Download the pretrained checkpoints and place the extracted checkpoints under the video_model folder:
wget https://videopolicy.cs.columbia.edu/assets/checkpoints.zip
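The archive layout is not documented here; assuming checkpoints.zip unpacks to a checkpoints directory, something like the following should place it where the scripts expect it:
unzip checkpoints.zip
mv checkpoints video_model/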
Download the simulation dataset and place the extracted datasets under the video_model folder:
wget https://videopolicy.cs.columbia.edu/assets/datasets.zip
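Likewise, assuming datasets.zip unpacks to a datasets directory:
unzip datasets.zip
mv datasets video_model/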
After downloading the pretrained checkpoints and the simulation dataset, you can run the RoboCasa evaluations from the video_model folder:
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python scripts/sampling/robocasa_experiment.py --config=scripts/sampling/configs/svd_xt.yaml
This will run evaluation on one of the 24 tasks defined in svd_xt.yaml. To run another task, launch this command again on a different GPU.
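For example, a second evaluation can run in parallel on another GPU (how the task is selected, e.g., by editing svd_xt.yaml, depends on the config):
CUDA_VISIBLE_DEVICES=1 PYTHONPATH=. python scripts/sampling/robocasa_experiment.py --config=scripts/sampling/configs/svd_xt.yaml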
After downloading the pretrained checkpoints and the simulation dataset, you can run the stage 1 video model training on the RoboCasa simulation dataset from the video_model folder:
PYTHONPATH=. CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --base=configs/stage_1_video_model_training.yaml --name=ft1 --seed=24 --num_nodes=1 --wandb=1 lightning.trainer.devices="0,1,2,3,4,5,6,7"
Alternatively, you can run the stage 2 action decoder training with the video model frozen from a pretrained checkpoint, or you can modify stage_2_action_decoder_training.yaml to train from your stage 1 checkpoints:
PYTHONPATH=. CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --base=configs/stage_2_action_decoder_training.yaml --name=ft1 --seed=24 --num_nodes=1 --wandb=1 lightning.trainer.devices="0,1,2,3,4,5,6,7"
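If you instead train stage 2 from your own stage 1 weights, one option is to pass the checkpoint path as a command-line override, using the same dot-list syntax as lightning.trainer.devices above. The key model.params.ckpt_path is a guess based on the Stable Video Diffusion codebase this repository builds on, so verify it against stage_2_action_decoder_training.yaml:
PYTHONPATH=. CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --base=configs/stage_2_action_decoder_training.yaml --name=ft2 --seed=24 --num_nodes=1 --wandb=1 lightning.trainer.devices="0,1,2,3,4,5,6,7" model.params.ckpt_path=<path_to_stage_1_checkpoint>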
We also provide an example of training the video model and action decoder jointly:
PYTHONPATH=. CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --base=configs/joint_training.yaml --name=ft1 --seed=24 --num_nodes=1 --wandb=1 lightning.trainer.devices="0,1,2,3,4,5,6,7"
Note that these training scripts are set up for an 8-GPU system with 80GB of VRAM per GPU. We found that an overall batch size of 32 produces good results, and larger batch sizes tend to improve model performance.
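On systems with fewer GPUs or less VRAM, the per-GPU batch size can likely be lowered through the same override mechanism; the key name data.params.batch_size is a hypothetical example, so check the training YAMLs for the actual field:
PYTHONPATH=. CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --base=configs/stage_1_video_model_training.yaml --name=ft1 --seed=24 --num_nodes=1 --wandb=1 lightning.trainer.devices="0,1,2,3" data.params.batch_size=2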
This repository is based on Stable Video Diffusion and Generative Camera Dolly. We would like to thank the authors of these works for publicly releasing their code.
This research is based on work partially supported by the Toyota Research Institute and the NSF NRI Award #2132519.
@article{liang2025video,
  title={Video Generators are Robot Policies},
  author={Liang, Junbang and Tokmakov, Pavel and Liu, Ruoshi and Sudhakar, Sruthi and Shah, Paarth and Ambrus, Rares and Vondrick, Carl},
  journal={arXiv preprint arXiv:2508.00795},
  year={2025}
}