This is the official codebase for:
RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback,
Yufei Wang*, Zhanyi Sun*, Jesse Zhang, Zhou Xian, Erdem Bıyık, David Held†, Zackory Erickson†,
ICML 2024.
Website | ArXiv
Install the conda env via
conda env create -f conda_env.yml
conda activate rlvlmf
PLEASE ONLY USE THE METAWORLD ENVIRONMENTS FOR NOW. Docker is not needed for them.
First download the cached data below, then go to run.sh and select which MetaWorld env to run.
The MetaWorld envs run directly on the host (no Docker needed). Make sure your conda rlvlmf env is activated.
You can run `. ./activate_conda.sh` after modifying it (link your miniforge properly!). This also sources `prepare.sh` so that everything is linked correctly. Then run `. ./run.sh`.
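For example, a typical session on the host might look like this (the scripts are the ones mentioned above; only the miniforge path inside `activate_conda.sh` is machine-specific):

```bash
# Edit activate_conda.sh first so it points at your own miniforge install.
. ./activate_conda.sh   # activates the rlvlmf env and sources prepare.sh
. ./run.sh              # launches the MetaWorld experiment selected in run.sh
```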
- Get a Gemini API key: follow the instructions at https://aistudio.google.com/app/apikey
- We use GPT-4V for the cloth fold task, so you will also need an OpenAI API key for that task.
- Make sure you're in the rlvlmf env and run `conda env config vars set GEMINI_API_KEY="<ENTER API KEY>"` and `conda env config vars set OPENAI_API_KEY="<ENTER API KEY>"`.
- Reactivate the env: `conda deactivate && conda activate rlvlmf`
- Run `source prepare.sh` to prepare some environment variables.
- Then please see `run.sh` for running experiments with different environments. A combined example session is sketched below.
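Putting the setup steps above together, a typical one-time configuration plus launch might look like this (the API key values are placeholders):

```bash
# Store the API keys inside the conda env (run once).
conda env config vars set GEMINI_API_KEY="<ENTER API KEY>"
conda env config vars set OPENAI_API_KEY="<ENTER API KEY>"

# Reactivate so the new variables are picked up.
conda deactivate && conda activate rlvlmf

# Prepare the remaining environment variables, then launch an experiment from run.sh.
source prepare.sh
. ./run.sh
```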
- Because Gemini-Pro 1.0 has greatly reduced its free quota to only 1,500 requests per day (https://ai.google.dev/pricing), we provide some of the VLM preference labels we cached when running the experiments. We only stored them at an interval during training, e.g., every 25th VLM query, so the total number of cached preference labels is smaller than for a complete run. The labels are also not on-policy, i.e., they were not generated from the agent's online experience.
- Still, we find that using the cached preference labels gives roughly similar performance for Fold Cloth, Open Drawer, Soccer, CartPole, Straighten Rope, and Pass Water. The performance of Sweep Into with the cached labels is worse than the original results in the paper.
- The cached preference labels can be downloaded through this Google Drive link.
- After downloading, put it under `data` so it looks like `data/cached_labels/env_name/different_seed` (see the layout sketch after this list).
- The commands in `run.sh` will load the cached preference labels by default; use `cached_label_path=None` to skip the cached labels and query the VLM online during training.
- If you wish to fully reproduce the results in the paper, please train without the provided cached labels and generate the VLM preference labels online using the learning agent's online experience.
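For reference, the expected layout after extracting the download is roughly the following (the placeholders stand for the actual task and seed folder names in the archive):

```
data/
└── cached_labels/
    └── <env_name>/
        └── <different_seed>/
```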
If you want to test RL-VLM-F on a new task, add the environment build function in `utils.py`; see `make_metaworld_env` for an example (a hypothetical sketch is given below). If you want to run on more MetaWorld tasks, adjust the camera angle so that it focuses on the target object to manipulate; see `metaworld/envs/assets_v2/objects/assets/xyz_base_transparant.xml` for the camera parameters we used for the tasks in the paper.
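As a rough illustration, a new environment-build helper in `utils.py` could follow the pattern below. This is a hypothetical sketch: the MetaWorld class lookup, the `cfg.env_name`/`cfg.seed` fields, and the function name are assumptions, and the existing `make_metaworld_env` is the authoritative reference.

```python
# Hypothetical sketch of a new env-build helper for utils.py, modeled on
# make_metaworld_env. The import below follows the standard MetaWorld API
# and may differ slightly in the fork bundled with this repo.
from metaworld.envs.mujoco.env_dict import ALL_V2_ENVIRONMENTS_GOAL_OBSERVABLE


def make_my_new_env(cfg):
    """Build a MetaWorld task from its name, e.g. 'drawer-open-v2-goal-observable'."""
    env_cls = ALL_V2_ENVIRONMENTS_GOAL_OBSERVABLE[cfg.env_name]
    env = env_cls(seed=cfg.seed)
    # For a new MetaWorld task, also adjust the camera in
    # metaworld/envs/assets_v2/objects/assets/xyz_base_transparant.xml
    # so that renders focus on the object being manipulated.
    return env
```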
- We thank the authors of PEBBLE for open-sourcing their code, which our code is built on: https://github.com/pokaxpoka/B_Pref
If you find this codebase / paper useful in your research, please consider citing:
@InProceedings{wang2024,
title = {RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback},
author = {Wang, Yufei and Sun, Zhanyi and Zhang, Jesse and Xian, Zhou and Biyik, Erdem and Held, David and Erickson, Zackory},
booktitle = {Proceedings of the 41st International Conference on Machine Learning},
year = {2024}
}