SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

We implement SHAPO versions of three safe reinforcement learning algorithms: CPO, PIDLag, and CRPO.

Installation

For detailed installation instructions, please see Installation.md.

Running SHAPO Algorithms

We provide scripts to run the SHAPO-enhanced algorithms (CPO, PIDLag, and CRPO) on supported environments.

Quick Start

1. SHAPO-CPO

Run SHAPO-CPO with the following settings:

python safepo/single_agent/cpo.py --task SafetyPointGoal1-v0 \
    --use-shapo-actor-reward True \
    --use-shapo-actor-cost True \
    --perturbation-target-kl 0.00001 \
    --cost-limit 10.0 \
    --seed 0

2. SHAPO-CRPO

Run SHAPO-CRPO with the following settings:

python safepo/single_agent/crpo.py --task SafetyPointGoal1-v0 \
    --use-shapo-actor True \
    --perturbation-target-kl 0.00001 \
    --crpo-distance 0.0 \
    --crpo-lambda-c 1.0 \
    --cost-limit 10.0 \
    --seed 0

3. SHAPO-PIDLag

Run SHAPO-PIDLag with the following settings:

python safepo/single_agent/trpo_pid.py --task SafetyPointGoal1-v0 \
    --use-shapo-actor True \
    --perturbation-target-kl 0.00001 \
    --cost-limit 10.0 \
    --seed 0

Environments

The code supports all environments used in the paper: SafetyPointGoal1-v0, SafetyPointButton1-v0, Ant-v4, and Walker2d-v4.
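A small launcher script can repeat the Quick Start runs across the environments listed above and several seeds. The sketch below is a hypothetical helper and is not part of the repository. `ALGO_FLAGS`, `build_command`, and `sweep` are illustrative names. The script assumes two things: that it is run from the repository root, and that each script accepts the environment names exactly as written above. By default it only prints the commands (dry run).

```python
import shlex
import subprocess
import sys

# Hypothetical mapping: script path and SHAPO flags per algorithm, copied from the Quick Start commands.
ALGO_FLAGS = {
    "cpo": ("safepo/single_agent/cpo.py", [
        "--use-shapo-actor-reward", "True",
        "--use-shapo-actor-cost", "True",
    ]),
    "crpo": ("safepo/single_agent/crpo.py", [
        "--use-shapo-actor", "True",
        "--crpo-distance", "0.0",
        "--crpo-lambda-c", "1.0",
    ]),
    "pidlag": ("safepo/single_agent/trpo_pid.py", [
        "--use-shapo-actor", "True",
    ]),
}

# Environment names as listed in this README (assumed to be valid --task values).
ENVIRONMENTS = [
    "SafetyPointGoal1-v0",
    "SafetyPointButton1-v0",
    "Ant-v4",
    "Walker2d-v4",
]


def build_command(algo, task, seed, target_kl=0.00001, cost_limit=10.0):
    """Return the argument list for one run, mirroring the Quick Start commands."""
    script, extra = ALGO_FLAGS[algo]
    return [
        sys.executable, script,
        "--task", task,
        *extra,
        "--perturbation-target-kl", str(target_kl),
        "--cost-limit", str(cost_limit),
        "--seed", str(seed),
    ]


def sweep(algos, envs, seeds, dry_run=True):
    """Print (and optionally run, one after another) every algo x env x seed combination."""
    for algo in algos:
        for task in envs:
            for seed in seeds:
                cmd = build_command(algo, task, seed)
                print(shlex.join(cmd))
                if not dry_run:
                    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    sweep(["cpo"], ENVIRONMENTS, seeds=[0, 1, 2])
```

Setting `dry_run=False` executes the runs sequentially. To run them in parallel, pass the printed commands to your own job scheduler instead.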

Important Command-Line Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| `--task` | str | `SafetyPointGoal1-v0` | Environment to run (e.g., `SafetyPointGoal1-v0`, `SafetyCarGoal1-v0`) |
| `--use-shapo-actor` | bool | `False` | Enable SHAPO for actor updates (CRPO, PIDLag) |
| `--use-shapo-actor-reward` | bool | `False` | Enable SHAPO for the reward objective in the actor (CPO only) |
| `--use-shapo-actor-cost` | bool | `False` | Enable SHAPO for the cost objective in the actor (CPO only) |
| `--sam-rho` | float | `0.05` | SAM perturbation radius for the critic |
| `--perturbation-target-kl` | float | `0.01` | Target KL divergence of the SHAPO perturbation for the actor |
| `--use-sam-cost-critic` | bool | `False` | Enable SAM for cost critic updates |
| `--use-sam-reward-critic` | bool | `False` | Enable SAM for reward critic updates |
| `--cost-limit` | float | `10.0` | Cost constraint threshold |
| `--num-envs` | int | `5` | Number of parallel environments |
| `--seed` | int | `0` | Random seed for reproducibility |
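The table describes `--sam-rho` as the SAM (sharpness-aware minimization) perturbation radius for the critic. A standard SAM step works in two stages. First, it moves the weights by `rho` along the normalized loss gradient. Second, it updates the original weights using the gradient taken at that perturbed point. The NumPy sketch below illustrates this on a toy linear value function. It is not the repository's critic code: the linear critic, the squared-error loss, the synthetic data, and the helper names are assumptions made for illustration.

```python
import numpy as np


def mse_loss_and_grad(w, X, y):
    """Mean-squared error of a toy linear value function V(s) = X @ w, and its gradient."""
    residual = X @ w - y
    loss = np.mean(residual ** 2)
    grad = 2.0 * X.T @ residual / len(y)
    return loss, grad


def sam_perturbation(grad, rho):
    """First-order worst-case weight offset inside an L2 ball of radius rho."""
    return rho * grad / (np.linalg.norm(grad) + 1e-12)


def sam_step(w, X, y, lr=0.1, rho=0.05):
    """One SAM update: perturb the weights, then descend using the perturbed gradient."""
    _, grad = mse_loss_and_grad(w, X, y)
    eps = sam_perturbation(grad, rho)
    _, sam_grad = mse_loss_and_grad(w + eps, X, y)
    # The update is applied to the original weights, not the perturbed ones.
    return w - lr * sam_grad


rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))             # toy state features
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w                            # toy value targets
w = np.zeros(4)

initial_loss, _ = mse_loss_and_grad(w, X, y)
for _ in range(200):
    w = sam_step(w, X, y, lr=0.1, rho=0.05)
final_loss, _ = mse_loss_and_grad(w, X, y)
print(f"loss: {initial_loss:.3f} -> {final_loss:.6f}")
```

Because the perturbation always has length `rho`, SAM settles near the minimum rather than exactly on it. A larger `--sam-rho` biases the critic more strongly toward flat regions of the loss.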

Note: CPO computes two separate SHAPO gradients (one for reward, one for cost) using `--use-shapo-actor-reward` and `--use-shapo-actor-cost`. CRPO and PIDLag use a single SHAPO gradient, enabled with `--use-shapo-actor`.
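The SHAPO gradients in the note above are controlled by `--perturbation-target-kl`. This flag sets the size of the actor perturbation as a KL divergence rather than as a parameter-space radius. This README does not say how SHAPO converts the KL target into a perturbation, so the sketch below shows only one possible approach, on a toy single-state softmax policy. It scales a given direction by bisection until KL(π || π_perturbed) matches the target. The function names, logits, and gradient directions are hypothetical.

Bisection is valid here because, for a softmax policy, this KL is convex in the step size and reaches its minimum of zero at a step of zero. It therefore never decreases as the step grows.

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()


def categorical_kl(p, q):
    """KL(p || q) for two categorical distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def kl_targeted_perturbation(logits, direction, target_kl, max_scale=10.0, iters=60):
    """Hypothetical helper: scale `direction` so KL(pi || pi_perturbed) ~= target_kl.

    Bisects on the step size; KL is non-decreasing in the step for a softmax policy.
    """
    p = softmax(logits)
    unit = direction / (np.linalg.norm(direction) + 1e-12)

    def kl_at(scale):
        return categorical_kl(p, softmax(logits + scale * unit))

    if kl_at(max_scale) < target_kl:
        return max_scale * unit  # target unreachable within max_scale: take the largest step
    lo, hi = 0.0, max_scale      # invariant: kl_at(lo) < target_kl <= kl_at(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_at(mid) < target_kl:
            lo = mid
        else:
            hi = mid
    return hi * unit


logits = np.array([0.5, -0.2, 0.1])       # toy policy logits for one state
reward_dir = np.array([1.0, -0.5, 0.2])   # hypothetical reward-gradient direction
cost_dir = np.array([-0.3, 0.8, 0.4])     # hypothetical cost-gradient direction
target = 1e-5                             # value used for --perturbation-target-kl in Quick Start

# CPO-style: two separate perturbations. A CRPO/PIDLag-style update would compute only one.
delta_reward = kl_targeted_perturbation(logits, reward_dir, target)
delta_cost = kl_targeted_perturbation(logits, cost_dir, target)
p = softmax(logits)
print(categorical_kl(p, softmax(logits + delta_reward)))
print(categorical_kl(p, softmax(logits + delta_cost)))
```

In a real policy network, the same idea would apply to the network parameters rather than directly to the logits. Measuring the step in KL keeps the perturbation size comparable across environments with different parameter scales.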

This repository is a modification of the original Safe-Policy-Optimization repository developed by the PKU-Alignment team.

Citation

If you use this code or the original Safe-Policy-Optimization repository in your research, please cite:

@article{ji2023safety,
  title={Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark},
  author={Ji, Jiaming and Zhang, Borong and Zhou, Jiayi and Pan, Xuehai and Huang, Weidong and Sun, Ruiyang and Geng, Yiran and Zhong, Yifan and Dai, Juntao and Yang, Yaodong},
  journal={arXiv preprint arXiv:2310.12567},
  year={2023}
}

Original Repository

For the original implementation and documentation, please visit:

GitHub: https://github.com/PKU-Alignment/Safe-Policy-Optimization
Paper: https://arxiv.org/abs/2310.12567

Acknowledgments

We thank the PKU-Alignment team for developing and maintaining the original Safe-Policy-Optimization benchmark, which serves as the foundation for this work.
