Implementation of a sample-efficient and fast deep MBRL algorithm, Double Horizon Model-Based Policy Optimization (DHMBPO). DHMBPO conceptually combines MBPO and SAC-SVG(H). The implementation is designed to use a GPU.
This implementation works on Gymnasium (GYM) and DM Control (DMC) even with common hyper-parameters across all tasks, achieving high sample efficiency, short runtime, and small GPU memory usage.
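To make the combination concrete, below is a minimal, self-contained sketch of how a distribution rollout (DR, the MBPO-style "branched" rollout that only generates synthetic data) and a training rollout (TR, the SAC-SVG(H)-style short differentiable rollout used for the policy gradient) might interact in one iteration. All names, dimensions, and the toy model/policy are illustrative assumptions, not this repository's actual code.

```python
# Illustrative sketch only; names and shapes are assumptions, not the repo's API.
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
ensemble = nn.ModuleList([nn.Linear(obs_dim + act_dim, obs_dim) for _ in range(5)])  # toy dynamics ensemble
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
reward_fn = lambda obs, act: -(obs ** 2).sum(-1)  # stand-in for a known reward function

def step_model(obs, act):
    """Predict the next observation with a randomly chosen ensemble member."""
    member = ensemble[torch.randint(len(ensemble), (1,)).item()]
    return member(torch.cat([obs, act], dim=-1))

start_obs = torch.randn(256, obs_dim)

# Distribution rollout (DR, MBPO-style): a long, non-differentiable model rollout
# whose transitions would augment the replay buffer for off-policy updates (not shown).
model_buffer = []
with torch.no_grad():
    obs = start_obs
    for _ in range(20):                       # cf. agent.rollout_length=20
        act = policy(obs)
        next_obs = step_model(obs, act)
        model_buffer.append((obs, act, reward_fn(next_obs, act), next_obs))
        obs = next_obs

# Training rollout (TR, SAC-SVG(H)-style): a short differentiable rollout whose
# model-predicted return is backpropagated directly into the policy.
opt.zero_grad()
obs, ret = start_obs, 0.0
for _ in range(5):                            # cf. agent.training_rollout_length=5
    act = policy(obs)
    obs = step_model(obs, act)
    ret = ret + reward_fn(obs, act).mean()
(-ret).backward()
opt.step()
```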
- Visualization of each agent's optimized behavior at the final environment step (500K steps)

| GYM Humanoid-v4 | GYM Walker2d-v4 | DMC cartpole-swingup_sparse | DMC quadruped_run |
|---|---|---|---|
First, install a suitable version of PyTorch (<2) from here, then

```
pip install -e .
```

For the Gymnasium (formerly OpenAI's Gym) tasks, set `env=gym` and `env.task_id` to one of `"MBHumanoid-v0"`, `"MBAnt-v0"`, `"MBHopper-v0"`, `"MBHalfCheetah-v0"`, and `"MBWalker2d-v0"`.
These are identical to `"Humanoid-v4"`, `"Ant-v4"`, `"Hopper-v4"`, `"HalfCheetah-v4"`, and `"Walker2d-v4"`, except that they provide a known reward function usable in offline policy optimization (see the illustrative sketch below).
Other available values for `env` are `dmc` for DM Control (DMC), `gym-robo` for Gymnasium-Robotics, and `myosuite` for MyoSuite.
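The snippet below only illustrates, with made-up names and a made-up state layout, why a known reward function is convenient for model-based policy optimization: rewards for imagined transitions can be recomputed analytically instead of being learned. The actual `MB*` environments in this repository may expose their reward functions differently.

```python
# Purely illustrative; function name, state layout, and shapes are assumptions.
import torch

def known_reward(obs, act, next_obs):
    """Hypothetical analytic reward, e.g. forward progress minus control cost."""
    forward_velocity = next_obs[..., 0]          # assumed state layout
    control_cost = 0.1 * (act ** 2).sum(-1)
    return forward_velocity - control_cost

# With a learned dynamics model, rewards for imagined transitions can then be
# evaluated exactly on model predictions:
obs = torch.randn(32, 17)
act = torch.randn(32, 6)
next_obs = obs + 0.05 * torch.randn_like(obs)    # stand-in for a model prediction
rewards = known_reward(obs, act, next_obs)
```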
In the following, we use `quadruped-run` in the DMC suite.

Execute the DHMBPO algorithm with the option `agent=dhmbpo`.
For GYM tasks, you need to specify `env=gym` followed by, e.g., `env.task_id=MBHalfCheetah-v0`, as described in the subsection "Suites and tasks".
```
python train.py agent=dhmbpo env=dmc env.task_id=quadruped-run
```

Run DHMBPO without DR (the distribution rollout), conceptually corresponding to SAC-SVG(H), but with a deep ensemble.
By adding `agent.training_rollout_length=5`, it runs SAC-SVG(H) with a model rollout length of 5.
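"Deep ensemble" here refers to training several independently initialized dynamics networks on the same real transitions. The sketch below shows the general pattern with assumed names and shapes; it is not this repository's actual model code.

```python
# Generic deep-ensemble dynamics training pattern (assumed names and shapes).
import torch
import torch.nn as nn

obs_dim, act_dim, n_members = 8, 2, 5
ensemble = nn.ModuleList(
    [nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.SiLU(), nn.Linear(128, obs_dim))
     for _ in range(n_members)]
)
opt = torch.optim.Adam(ensemble.parameters(), lr=1e-3)

# Stand-in for a batch of real environment transitions.
obs = torch.randn(512, obs_dim)
act = torch.randn(512, act_dim)
next_obs = torch.randn(512, obs_dim)

opt.zero_grad()
loss = 0.0
for member in ensemble:
    # Bootstrapped mini-batches help decorrelate the members' errors.
    idx = torch.randint(len(obs), (256,))
    pred = member(torch.cat([obs[idx], act[idx]], dim=-1))
    loss = loss + nn.functional.mse_loss(pred, next_obs[idx])
loss.backward()
opt.step()
```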
```
python train.py agent=svg env=dmc env.task_id=quadruped-run agent.training_rollout_length=5
```

Run DHMBPO without TR (the training rollout), conceptually corresponding to MBPO. The options are:

- `agent.rollout_length=20`: the length of the distribution rollout ("branched rollout" in the MBPO paper) is 20.
- `agent.num_policy_opt_per_step=10`: the so-called UTD (update-to-data) ratio, illustrated in the sketch below.
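If the term is unfamiliar, the UTD ratio is the number of gradient updates performed per collected environment step. The toy loop below only demonstrates the counting; every name in it is made up, and the real loop in `train.py` is of course more involved.

```python
# Toy illustration of a UTD ratio of 10 (assumed names; not the repository's loop).
import random

class DummyAgent:
    def __init__(self):
        self.updates, self.env_steps = 0, 0
    def update(self, batch):
        self.updates += 1                      # stand-in for one gradient update

agent = DummyAgent()
replay_buffer = []
num_policy_opt_per_step = 10                   # cf. agent.num_policy_opt_per_step=10

for _ in range(100):                           # 100 environment steps
    replay_buffer.append(random.random())      # stand-in for a real transition
    agent.env_steps += 1
    for _ in range(num_policy_opt_per_step):   # 10 gradient updates per env step
        batch = random.sample(replay_buffer, min(32, len(replay_buffer)))
        agent.update(batch)

print(agent.updates / agent.env_steps)         # -> 10.0, the UTD ratio
```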
```
python train.py agent=mbpo env=dmc env.task_id=cheetah-run agent.rollout_length=20 agent.num_policy_opt_per_step=10
```

We provide CSV files used for the plots in the paper, including results for other baseline methods.
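Learning curves can be reproduced from those CSVs roughly as follows; the file path and column names in this sketch are guesses, so check the actual files first.

```python
# Rough plotting sketch; "results/quadruped-run.csv" and the column names
# ("method", "env_step", "return") are assumptions, not the repo's layout.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results/quadruped-run.csv")
for method, group in df.groupby("method"):
    plt.plot(group["env_step"], group["return"], label=method)
plt.xlabel("environment steps")
plt.ylabel("episode return")
plt.legend()
plt.show()
```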
```bibtex
@article{
  kubo2025double,
  title={Double Horizon Model-Based Policy Optimization},
  author={Akihiro Kubo and Paavo Parmas and Shin Ishii},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=HRvHCd03HM},
}
```