InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

This repository contains the official implementation of the following paper:

[NeurIPS 2024] InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao

This repository also provides the official implementation of our extended journal version, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI):

[ArXiv 2025] Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking

Yuchun Miao, Liang Ding, Sen Zhang, Rong Bao, Lefei Zhang, Dacheng Tao

Installation

conda create -n inform python=3.12 -y
conda activate inform

pip install vllm==0.7.2
git clone https://github.com/miaoyuchun/InfoRM.git
cd InfoRM
pip install -e .

Prepare Datasets

The data format used in this project is fully consistent with that of OpenRLHF.

To reproduce this work, you can process the ShareGPT dataset for SFT and the Anthropic HH dataset for RM and PPO training following the data format specifications provided by OpenRLHF.

Supervised Fine-tuning

bash ./example_sft/slurm/scc_sft_llama3_8b_sharegpt_packing.sh

Before running the above command, replace WORKSPACE, pretrain, and dataset in the script with the project path, pretrained model path, and dataset path, respectively.

RM Training

Information-Theoretic Reward Model

bash ./example_rm/slurm/scc_rm_llama3_hh105_wprompt_packing_inform.sh

Standard Reward Model

bash ./example_rm/slurm/scc_rm_llama3_hh105_wprompt_packing_baseline.sh

Before running the above commands, replace WORKSPACE, pretrain, and dataset in each script with the project path, SFT model path, and dataset path, respectively.

PPO Training

PPO with Standard-RM

bash ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_baseline.sh

PPO with InfoRM

bash ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_inform.sh

PPO with InfoRM and IBL

bash ./example_ppo/script/prepare_ib_representation.sh

bash ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_inform_ibl.sh
  • Before running the above commands, replace WORKSPACE, pretrain, reward_pretrain, and dataset with the project path, SFT model path, RM model path, and dataset path, respectively.

  • ./example_ppo/script/generate_eval.sh is used to generate responses using the SFT and RLHF models. This step is already included in the PPO training script above.

  • ./example_ppo/script/ib_latent_eval.sh is used to generate t-SNE visualizations of sample representations in the latent space of InfoRM. You can assess the extent of reward hacking by identifying outliers in these plots. This step is also included in the PPO training script above.

  • ./example_ppo/script/prepare_ib_representation.sh is used to pre-compute and store the representations of SFT model responses in the IB latent space, which are later utilized to compute Mahalanobis distances during the RL process.

  • Before running ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_inform_ibl.sh, the prompt dataset should include an additional key, class, whose value must be one of the two options: helpful or harmless, indicating the dataset source of each sample.
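
As an illustration of the last note, the following is a minimal sketch of tagging prompt records with the class key and writing them out as JSON Lines. Only the class key and its two allowed values come from the note above; the prompt field name, the helper names, and the JSON Lines layout are assumptions made for this example, so adapt them to the keys your dataset actually uses.

```python
import json

ALLOWED_CLASSES = {"helpful", "harmless"}  # the two values named in the note above


def tag_records(records, source):
    """Return copies of the prompt records with a `class` key set to `source`."""
    if source not in ALLOWED_CLASSES:
        raise ValueError(
            f"class must be one of {sorted(ALLOWED_CLASSES)}, got {source!r}"
        )
    return [{**record, "class": source} for record in records]


def to_jsonl(records):
    """Serialize records as JSON Lines text (one JSON object per line)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical prompt records; the "prompt" field name is an assumption.
helpful = tag_records([{"prompt": "How do I bake bread?"}], "helpful")
harmless = tag_records([{"prompt": "Tell me about chemistry safety."}], "harmless")
jsonl_text = to_jsonl(helpful + harmless)
```

Rejecting any other value up front surfaces a mislabeled sample before a long PPO job is launched.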

Reward Hacking Indicator: Mahalanobis Outlier Probability (MOP)

python -m openrlhf.eval.compute_mop
  • Before running the above command, make sure to obtain sft_representation and rlhf_representation as defined in Line #78 and Line #80.
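
To make the idea behind a Mahalanobis-based outlier indicator concrete, here is a rough sketch that measures what fraction of RLHF representations fall far from the SFT representation cloud. It is not the repository's compute_mop implementation. The function names, the (n_samples, dim) array layout, the ridge term, and the empirical-quantile threshold are all assumptions chosen for illustration, and the data is synthetic.

```python
import numpy as np


def mahalanobis_sq(points, reference):
    """Squared Mahalanobis distance of each row in `points` to the `reference` set."""
    mean = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    # A small ridge keeps the covariance invertible; its size is an assumption.
    cov = cov + 1e-6 * np.eye(cov.shape[0])
    diff = points - mean
    # Solve cov @ x = diff^T rather than forming the explicit inverse.
    solved = np.linalg.solve(cov, diff.T).T
    return np.einsum("ij,ij->i", diff, solved)


def outlier_fraction(sft_repr, rlhf_repr, quantile=0.99):
    """Fraction of RLHF samples farther from the SFT cloud than `quantile` of SFT samples."""
    threshold = np.quantile(mahalanobis_sq(sft_repr, sft_repr), quantile)
    return float(np.mean(mahalanobis_sq(rlhf_repr, sft_repr) > threshold))


# Synthetic stand-ins for the stored representations (assumed shape: n_samples x dim).
rng = np.random.default_rng(0)
sft_representation = rng.normal(size=(500, 8))
rlhf_in_dist = rng.normal(size=(200, 8))            # drawn like the SFT samples
rlhf_shifted = rng.normal(loc=6.0, size=(200, 8))   # shifted far from the SFT cloud

frac_in_dist = outlier_fraction(sft_representation, rlhf_in_dist)
frac_shifted = outlier_fraction(sft_representation, rlhf_shifted)
```

In this toy setup, the shifted samples should yield a much larger outlier fraction than the in-distribution ones, which is the kind of signal an outlier-based indicator is meant to expose.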

Citation

If you find our work useful in your research, please consider citing both our conference and journal versions:

Conference Version

@inproceedings{miao2024inform,
title={Info{RM}: Mitigating Reward Hacking in {RLHF} via Information-Theoretic Reward Modeling},
author={Yuchun Miao and Sen Zhang and Liang Ding and Rong Bao and Lefei Zhang and Dacheng Tao},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=3XnBVK9sD6}
}

Journal Version

@misc{miao2025informationtheoreticrewardmodelingstable,
title={Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking}, 
author={Yuchun Miao and Liang Ding and Sen Zhang and Rong Bao and Lefei Zhang and Dacheng Tao},
year={2025},
eprint={2510.13694},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.13694}, 
}

Thanks

This project is based on OpenRLHF. Thanks for this wonderful work!
