URLS provides a set of unsupervised reinforcement learning algorithms and experiments for researching the applicability of unsupervised reinforcement learning to a variety of paradigms.
The codebase is based upon URLB and ExORL. Further details are provided in the following papers:
- URLB: Unsupervised Reinforcement Learning Benchmark
- Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning
URLS is intended as a successor to URLB, supporting a larger set of experiments and RL paradigms.
Install MuJoCo if it is not already installed:
- Download the MuJoCo binaries here.
- Unzip the downloaded archive into `~/.mujoco/`.
- Append the MuJoCo `bin` subdirectory path to the `LD_LIBRARY_PATH` environment variable.
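Once the Python dependencies below are installed, a quick way to check that the MuJoCo binaries and `LD_LIBRARY_PATH` are picked up correctly is to load a dm_control task. This is a minimal sketch, assuming dm_control is provided by the conda environment:

```python
# Sanity check for the MuJoCo setup (assumes dm_control is installed via
# conda_env.yml). Loading a task will fail if the MuJoCo binaries or
# LD_LIBRARY_PATH are misconfigured.
from dm_control import suite

env = suite.load(domain_name="walker", task_name="walk")
timestep = env.reset()
print(sorted(timestep.observation.keys()))  # e.g. ['height', 'orientations', 'velocity']
```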
Install the following libraries:
```sh
sudo apt update
sudo apt install libosmesa6-dev libgl1-mesa-glx libglfw3 unzip
```
Install dependencies:
```sh
conda env create -f conda_env.yml
conda activate urls-env
```
We provide the following workflows:
Pre-training: learn from the agent's intrinsic reward on a specific domain
```sh
python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN
```
Fine-tuning: learn with the pre-trained agent on a specific task; the task-specific reward is now used by the agent
```sh
python finetune.py pretrained_agent=UNSUPERVISED_AGENT task=TASK snapshot_ts=TS obs_type=OBS_TYPE
```
Pre-training: learn from the agent's intrinsic reward on a specific domain
```sh
python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN
```
Sampling: sample demos from the agent's replay buffer on a specific task
```sh
python sampling.py agent=UNSUPERVISED_AGENT task=TASK samples=SAMPLES snapshot_ts=TS obs_type=OBS_TYPE
```
Offline learning: learn a policy using the offline data collected on the specific task
```sh
python train_offline.py agent=OFFLINE_AGENT expl_agent=UNSUPERVISED_AGENT task=TASK
```
Pre-training: learn from the agent's intrinsic reward on a specific domain
```sh
python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN
```
Sampling: sample demos from the agent's replay buffer with constraints and images
```sh
python sampling.py agent=UNSUPERVISED_AGENT task=TASK samples=SAMPLES snapshot_ts=TS obs_type=OBS_TYPE
```
Trajectories to images: create an image dataset from trajectories
```sh
python data_to_images.py --env=DOMAIN
```
Train VAE: train a Variational Auto-Encoder on the image dataset
```sh
python train_encoder.py --env=DOMAIN
```
Train MPC: train the LS3 safe model predictive controller on a specific domain
```sh
python train_mpc.py --env=DOMAIN
```
The following unsupervised reinforcement learning agents are available; replace UNSUPERVISED_AGENT with the agent's Command.
For example, to use DIAYN, set UNSUPERVISED_AGENT=diayn.
| Agent | Command | Type | Implementation Author(s) | Paper | Intrinsic Reward |
|---|---|---|---|---|---|
| ICM | icm | Knowledge | Denis | paper | $\| g(\mathbf{z}_{t+1} \mid \mathbf{z}_{t}, \mathbf{a}_{t}) - \mathbf{z}_{t+1} \|^{2}$ |
| Disagreement | disagreement | Knowledge | Catherine | paper | $Var\{ g_{i}(\mathbf{z}_{t+1} \mid \mathbf{z}_{t}, \mathbf{a}_{t}) \}$ |
| RND | rnd | Knowledge | Kevin | paper | $\| g(\mathbf{z}_{t}, \mathbf{a}_{t}) - \tilde{g}(\mathbf{z}_{t}, \mathbf{a}_{t}) \|^{2}_{2}$ |
| APT(ICM) | icm_apt | Data | Hao, Kimin | paper | $\sum_{j \in \text{random}} \log \| \mathbf{z}_{t} - \mathbf{z}_{j} \|$ |
| APT(Ind) | ind_apt | Data | Hao, Kimin | paper | $\sum_{j \in \text{random}} \log \| \mathbf{z}_{t} - \mathbf{z}_{j} \|$ |
| ProtoRL | proto | Data | Denis | paper | $\sum_{j \in \text{random}} \log \| \mathbf{z}_{t} - \mathbf{z}_{j} \|$ |
| DIAYN | diayn | Competence | Misha | paper | |
| APS | aps | Competence | Hao, Kimin | paper | |
| SMM | smm | Competence | Albert | paper | |
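To make the intrinsic-reward column concrete, below is a minimal, illustrative sketch (not the repository's implementation) of an RND-style reward: the squared error between a trained predictor $g$ and a fixed, randomly initialized target $\tilde{g}$. The network sizes and class names are assumptions for illustration only.

```python
import torch
import torch.nn as nn


def mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))


class RNDReward(nn.Module):
    """Illustrative RND-style intrinsic reward (hypothetical helper, not from the repo)."""

    def __init__(self, obs_dim: int, action_dim: int, embed_dim: int = 64):
        super().__init__()
        self.predictor = mlp(obs_dim + action_dim, embed_dim)  # g, trained to match the target
        self.target = mlp(obs_dim + action_dim, embed_dim)     # g~, random and frozen
        for p in self.target.parameters():
            p.requires_grad = False

    def forward(self, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.cat([obs, action], dim=-1)
        # || g(z_t, a_t) - g~(z_t, a_t) ||^2_2 per batch element; the predictor is
        # trained on the same quantity, so rarely visited states yield large rewards.
        return (self.predictor(x) - self.target(x)).pow(2).sum(dim=-1)
```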
The following 5 RL procedures are available to learn a policy offline from the unsupervised data. Replace OFFLINE_AGENT with the procedure's Command; for example, to use behavioral cloning, set OFFLINE_AGENT=bc.
| Offline RL Procedure | Command | Paper |
|---|---|---|
| Behavior Cloning | bc | paper |
| CQL | cql | paper |
| CRR | crr | paper |
| TD3+BC | td3_bc | paper |
| TD3 | td3 | paper |
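As a concrete reference for the simplest of these procedures, here is a minimal behavior-cloning update in the style of the bc command; the policy network, optimizer, and batch variables are placeholders, not names from the codebase.

```python
import torch
import torch.nn.functional as F


def bc_update(policy: torch.nn.Module,
              optimizer: torch.optim.Optimizer,
              obs: torch.Tensor,
              demo_action: torch.Tensor) -> float:
    """One behavior-cloning gradient step: regress pi(s) onto the logged action."""
    pred_action = policy(obs)
    loss = F.mse_loss(pred_action, demo_action)  # || pi(s) - a ||^2 averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```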
The following environments with specific domains and tasks are provided. We also provide a wrapper, based on DeepMind's acme wrapper, that converts Gym environments to DMC extended time-step types (a sketch of this conversion is shown after the table below).
| Environment Type | Domain | Task |
|---|---|---|
| Deep Mind Control | walker | stand, walk, run, flip |
| Deep Mind Control | quadruped | walk, run, stand, jump |
| Deep Mind Control | jaco | reach_top_left, reach_top_right, reach_bottom_left, reach_bottom_right |
| Deep Mind Control | cheetah | run |
| Gym Box2D | BipedalWalker-v3 | walk |
| Gym Box2D | CarRacing-v1 | race |
| Gym Classic Control | MountainCarContinuous-v0 | goal |
| Safe Control | SimplePointBot | goal |
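The Gym-to-DMC conversion mentioned above could look roughly like the sketch below, which exposes a `gym.Env` through the `dm_env.Environment` interface in the spirit of acme's `GymWrapper`; the class and method bodies here are illustrative assumptions, not the repository's actual wrapper.

```python
import dm_env
import gym
import numpy as np
from dm_env import specs


class GymToDMEnv(dm_env.Environment):
    """Exposes a gym.Env (old step API: obs, reward, done, info) as a dm_env.Environment."""

    def __init__(self, env: gym.Env):
        self._env = env

    def reset(self) -> dm_env.TimeStep:
        obs = self._env.reset()
        return dm_env.restart(np.asarray(obs, dtype=np.float32))

    def step(self, action) -> dm_env.TimeStep:
        obs, reward, done, _ = self._env.step(action)
        obs = np.asarray(obs, dtype=np.float32)
        if done:
            # Gym's `done` flag does not distinguish termination from truncation,
            # so this sketch conservatively emits a terminal step.
            return dm_env.termination(reward=reward, observation=obs)
        return dm_env.transition(reward=reward, observation=obs)

    def observation_spec(self) -> specs.BoundedArray:
        space = self._env.observation_space
        return specs.BoundedArray(space.shape, np.float32, space.low, space.high, name="observation")

    def action_spec(self) -> specs.BoundedArray:
        space = self._env.action_space
        return specs.BoundedArray(space.shape, np.float32, space.low, space.high, name="action")
```

For example, `GymToDMEnv(gym.make("BipedalWalker-v3"))` would yield dm_env-style time steps usable by DMC-based training loops.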
The majority of URLS, including the ExORL- and URLB-based code, is licensed under the MIT license; however, portions of the project are available under separate license terms: DeepMind code is licensed under the Apache 2.0 license.