A minimal and stable implementation of Proximal Policy Optimization (PPO), tested on IsaacGymEnvs.
- Python (tested on 3.7)
- PyTorch (tested on 1.8.1)
Follow the instructions here to install Isaac Gym and the IsaacGymEnvs repo.
Optional instructions for cleaner code and dependencies:
- Under the isaacgymenvs directory, the cfg and learning subdirectories and the train.py file can be removed.
- The dependency on rl-games on this line can be removed.
To train a policy on Cartpole, run
python train.py task=Cartpole
Cartpole should converge to optimal within a few seconds of starting.
In the configs directory, we provide the main config file and template configs for the Cartpole and AllegroHand tasks. We use Hydra for config management, following IsaacGymEnvs.
To train on additional tasks, follow the template configs to define [new_task].yaml under configs/task and [new_task]PPO.yaml under configs/train.
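As a quick sanity check before training on a new task, the naming convention above can be sketched in a few lines of Python. This is only an illustration: the helper `new_task_config_paths` and the task name `MyTask` are hypothetical and not part of this repo.

```python
from pathlib import Path


def new_task_config_paths(task_name, config_root="configs"):
    """Return the two config files expected for a new task (hypothetical helper)."""
    root = Path(config_root)
    return {
        # Task/environment config: configs/task/[new_task].yaml
        "task": root / "task" / f"{task_name}.yaml",
        # PPO training config: configs/train/[new_task]PPO.yaml
        "train": root / "train" / f"{task_name}PPO.yaml",
    }


# "MyTask" is a placeholder name used only for this example.
paths = new_task_config_paths("MyTask")
for kind, path in paths.items():
    status = "found" if path.is_file() else "missing"
    print(f"{kind}: {path.as_posix()} ({status})")
```

If either file is reported as missing, copying the corresponding Cartpole or AllegroHand template is a reasonable starting point.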
Logging to TensorBoard and WandB is supported by default.
Our PPO results match IsaacGymEnvs' default RL implementation, in terms of both training speed and performance.
- task=TASK - Selects which task to use. Options correspond to the config for each environment in configs/task.
- num_envs=NUM_ENVS - Selects the number of environments to use (overriding the default number of environments set in the task config).
- seed=SEED - Sets a seed value for randomizations, and overrides the default seed set up in the task config.
- device_id=DEVICE_ID - Device used for physics simulation and the RL algorithm.
- graphics_device_id=GRAPHICS_DEVICE_ID - Which Vulkan graphics device ID to use for rendering. Defaults to 0. Note - this may be different from CUDA device ID, and does not follow PyTorch-like device syntax.
- pipeline=PIPELINE - Which API pipeline to use. Defaults to gpu, can also set to cpu. When using the gpu pipeline, all data stays on the GPU and everything runs as fast as possible. When using the cpu pipeline, simulation can run on either CPU or GPU, depending on the sim_device setting, but a copy of the data is always made on the CPU at every step.
- test=TEST - If set to True, only runs inference on the policy and does not do any training.
- checkpoint=CHECKPOINT_PATH - Set to path to the checkpoint to load for training or testing.
- headless=HEADLESS - Whether to run in headless mode.
- output_name=OUTPUT_NAME - Sets the output folder name.
- wandb_mode=WANDB_MODE - Options for using WandB.
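These arguments are passed as Hydra-style key=value overrides after the script name. When launching several runs from a script, a small helper can assemble the command. The sketch below is an assumption-laden illustration: `build_train_command` is a hypothetical helper, and the values (512 environments, seed 42) are arbitrary examples, not recommended settings.

```python
import shlex


def build_train_command(script="train.py", **overrides):
    """Build an argument list with key=value overrides (hypothetical helper)."""
    args = ["python", script]
    # Keyword order is preserved, so overrides appear in the order given.
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


# Example values are placeholders chosen for illustration only.
cmd = build_train_command(task="Cartpole", num_envs=512, seed=42, headless=True)
command_line = " ".join(shlex.quote(arg) for arg in cmd)
print(command_line)
```

The returned list can be handed to `subprocess.run(cmd)` directly, which avoids shell quoting issues when a value such as a checkpoint path contains spaces.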
The main configs to experiment with are:
- train.network.mlp.units
- train.ppo.gamma
- train.ppo.tau
- train.ppo.learning_rate
- train.ppo.lr_schedule
- train.ppo.kl_threshold (only relevant when lr_schedule == 'kl')
- train.ppo.e_clip
- train.ppo.horizon_length
- train.ppo.minibatch_size
- train.ppo.max_agent_steps
We recommend the default value for other configs, but of course, RL is RL :)
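To make the role of a few of these knobs concrete, here is a minimal pure-Python sketch of where gamma, tau, e_clip, and kl_threshold typically enter a PPO update. It is not taken from this repo's code. It assumes that tau plays the role of the GAE lambda, and that the 'kl' schedule scales the learning rate by a factor of 1.5 when the measured KL leaves the band [0.5, 2] x kl_threshold, which is a common scheme. The function names, default values, and learning-rate bounds are all illustrative.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, tau=0.95):
    """Generalized Advantage Estimation over one rollout of one environment."""
    advantages = [0.0] * len(rewards)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_adv = delta + gamma * tau * not_done * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages


def clipped_surrogate_loss(ratio, advantage, e_clip=0.2):
    """Per-sample PPO policy loss; ratio is pi_new(a|s) / pi_old(a|s)."""
    clipped_ratio = max(1.0 - e_clip, min(ratio, 1.0 + e_clip))
    return -min(ratio * advantage, clipped_ratio * advantage)


def kl_adaptive_lr(lr, kl, kl_threshold, min_lr=1e-6, max_lr=1e-2):
    """Assumed KL-based schedule: shrink lr if KL is too large, grow if too small."""
    if kl > 2.0 * kl_threshold:
        return max(lr / 1.5, min_lr)
    if kl < 0.5 * kl_threshold:
        return min(lr * 1.5, max_lr)
    return lr


# Tiny usage example with made-up numbers.
adv = compute_gae([1.0, 1.0], [0.0, 0.0], [False, False], 0.0, gamma=0.5, tau=1.0)
print(adv)
print(clipped_surrogate_loss(ratio=1.5, advantage=1.0, e_clip=0.2))
```

Read this way, gamma and tau trade bias against variance in the advantage estimates, e_clip bounds how far a single update can move the policy, and kl_threshold sets the target step size for the adaptive schedule.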
Here are some helpful guides to tuning PPO hyperparameters:
- The 37 Implementation Details of Proximal Policy Optimization
I also documented a few general takeaways in this tweet.
Yes, rl_games has great performance, but it can be hard to use.
If all you're looking for is a simple, clean, performant PPO that is easy to modify and extend, try this repo :))) And feel free to give feedback to make this better!
Please use the following bibtex if you find this repo helpful and would like to cite:
@misc{minimal-stable-PPO,
author = {Lin, Toru},
title = {A minimal and stable PPO},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ToruOwO/minimal-stable-PPO}},
}
Shout-out to hora and rl_games, which this code implementation referenced!