InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

This repository contains the official implementation of the following paper:

[NeurIPS 2024] InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao

This repository also provides the official implementation of our extended journal version, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI):

[ArXiv 2025] Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking

Yuchun Miao, Liang Ding, Sen Zhang, Rong Bao, Lefei Zhang, Dacheng Tao

Installation

conda create -n inform python=3.12 -y
conda activate inform

pip install vllm==0.7.2
git clone https://github.com/miaoyuchun/InfoRM.git
cd InfoRM
pip install -e .

Prepare Datasets

The data format used in this project is fully consistent with that of OpenRLHF.

To reproduce this work, convert the ShareGPT dataset (for SFT) and the Anthropic HH dataset (for RM and PPO training) to the data format specified by OpenRLHF.
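As an illustration of the preprocessing step, the sketch below splits one Anthropic HH record, which stores the full dialogue in its chosen and rejected fields, into a shared prompt and two responses. The output key names (prompt, chosen, rejected) and the helper name split_hh_sample are assumptions made for this example, not taken from this repository; use whatever keys your training scripts are configured to read.

```python
ASSISTANT_TAG = "\n\nAssistant:"


def split_hh_sample(sample):
    """Split one HH preference record into a shared prompt and two responses.

    Output keys are illustrative assumptions; match them to your own config.
    """
    chosen, rejected = sample["chosen"], sample["rejected"]
    # The last assistant turn is the response; everything before it is the prompt.
    cut = chosen.rfind(ASSISTANT_TAG)
    if cut == -1:
        raise ValueError("no assistant turn found in the chosen dialogue")
    cut += len(ASSISTANT_TAG)
    prompt = chosen[:cut]
    if not rejected.startswith(prompt):
        raise ValueError("chosen and rejected dialogues do not share a prompt")
    return {
        "prompt": prompt,
        "chosen": chosen[cut:].strip(),
        "rejected": rejected[cut:].strip(),
    }
```

Records that raise ValueError can be skipped or inspected by hand, depending on how strictly you want to filter the data.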

Supervised Fine-tuning

bash ./example_sft/slurm/scc_sft_llama3_8b_sharegpt_packing.sh

Before running the above command, replace WORKSPACE, pretrain, and dataset in the script with the project path, pretrained model path, and dataset path, respectively.

RM Training

Information-Theoretic Reward Model

bash ./example_rm/slurm/scc_rm_llama3_hh105_wprompt_packing_inform.sh

Standard Reward Model

bash ./example_rm/slurm/scc_rm_llama3_hh105_wprompt_packing_baseline.sh

Before running either command, replace WORKSPACE, pretrain, and dataset in the script with the project path, SFT model path, and dataset path, respectively.

PPO Training

PPO with Standard-RM

bash ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_baseline.sh

PPO with InfoRM

bash ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_inform.sh

PPO with InfoRM and IBL

bash ./example_ppo/script/prepare_ib_representation.sh

bash ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_inform_ibl.sh

Before running the above commands, replace WORKSPACE, pretrain, reward_pretrain, and dataset in each script with the project path, SFT model path, RM model path, and dataset path, respectively.
./example_ppo/script/generate_eval.sh generates responses with the SFT and RLHF models. This step is already included in the PPO training scripts above.
./example_ppo/script/ib_latent_eval.sh generates t-SNE visualizations of sample representations in the latent space of InfoRM. You can assess the extent of reward hacking by identifying outliers in these plots. This step is also included in the PPO training scripts above.
./example_ppo/script/prepare_ib_representation.sh pre-computes and stores the representations of SFT model responses in the IB latent space. These are later used to compute Mahalanobis distances during the RL process.
Before running ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_inform_ibl.sh, make sure each sample in the prompt dataset includes an additional key, class, whose value is either helpful or harmless, indicating which dataset the sample comes from.
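The class key required for IBL training can be added with a few lines of Python. The following is a minimal sketch, assuming the prompt dataset is held as a list of dictionaries and written out as JSONL; the helper names tag_prompts and write_jsonl, the prompt key in the usage note, and the JSONL layout are assumptions for illustration, not part of this repository.

```python
import json

VALID_CLASSES = {"helpful", "harmless"}


def tag_prompts(records, source):
    """Return copies of the records with a `class` key set to the source subset."""
    if source not in VALID_CLASSES:
        raise ValueError(f"class must be one of {sorted(VALID_CLASSES)}, got {source!r}")
    return [{**record, "class": source} for record in records]


def write_jsonl(records, path):
    """Write one JSON object per line (assumed storage layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, tag_prompts(helpful_records, "helpful") + tag_prompts(harmless_records, "harmless") yields a combined prompt set in which every sample carries its source label.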

Reward Hacking Indicator: Mahalanobis Outlier Probability (MOP)

python -m openrlhf.eval.compute_mop

Before running the above command, make sure to obtain sft_representation and rlhf_representation as defined at Line #78 and Line #80 of the openrlhf.eval.compute_mop module.
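To make the idea behind the indicator concrete, the sketch below estimates how often RLHF representations fall outside the SFT distribution, using Mahalanobis distance to the SFT mean. It is not the repository's compute_mop implementation: the function names are hypothetical, both inputs are assumed to be arrays of shape (num_samples, latent_dim), and the outlier threshold is taken as an empirical quantile of the SFT samples' own distances, a simplification chosen for this example.

```python
import numpy as np


def squared_mahalanobis(x, mean, cov_inv):
    """Squared Mahalanobis distance of each row of x to the given mean."""
    diff = x - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def outlier_probability(sft_representation, rlhf_representation, quantile=0.99, eps=1e-6):
    """Fraction of RLHF samples lying farther from the SFT distribution than
    the chosen quantile of SFT samples (assumed thresholding rule)."""
    sft = np.asarray(sft_representation, dtype=np.float64)
    rlhf = np.asarray(rlhf_representation, dtype=np.float64)
    mean = sft.mean(axis=0)
    # A small ridge term keeps the covariance matrix invertible.
    cov = np.atleast_2d(np.cov(sft, rowvar=False)) + eps * np.eye(sft.shape[1])
    cov_inv = np.linalg.inv(cov)
    threshold = np.quantile(squared_mahalanobis(sft, mean, cov_inv), quantile)
    return float(np.mean(squared_mahalanobis(rlhf, mean, cov_inv) > threshold))
```

A value close to 1 - quantile suggests the RLHF responses still resemble the SFT distribution in the latent space, while a value approaching 1 suggests most of them have drifted into outlier territory.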

Citation

If you find our work useful in your research, please consider citing both our conference and journal versions:

Conference Version

@inproceedings{miao2024inform,
title={Info{RM}: Mitigating Reward Hacking in {RLHF} via Information-Theoretic Reward Modeling},
author={Yuchun Miao and Sen Zhang and Liang Ding and Rong Bao and Lefei Zhang and Dacheng Tao},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=3XnBVK9sD6}
}

Journal Version

@misc{miao2025informationtheoreticrewardmodelingstable,
title={Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking}, 
author={Yuchun Miao and Liang Ding and Sen Zhang and Rong Bao and Lefei Zhang and Dacheng Tao},
year={2025},
eprint={2510.13694},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.13694}, 
}

Thanks

This project is based on OpenRLHF. Thanks for this wonderful work!
