This repository contains the official implementation of the following paper:
[NeurIPS 2024] InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
This repository also provides the official implementation of our extended journal version, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI):
Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
Yuchun Miao, Liang Ding, Sen Zhang, Rong Bao, Lefei Zhang, Dacheng Tao
conda create -n inform python=3.12 -y
conda activate inform
pip install vllm==0.7.2
git clone https://github.com/miaoyuchun/InfoRM.git
cd InfoRM
pip install -e .

The data format used in this project is fully consistent with that of OpenRLHF.
To reproduce this work, you can process the ShareGPT dataset for SFT and the Anthropic HH dataset for RM and PPO training following the data format specifications provided by OpenRLHF.
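As an illustration of the preprocessing step mentioned above, the following is a minimal sketch that turns an Anthropic HH-style record (two full dialogues stored under chosen and rejected) into a prompt/chosen/rejected record and writes JSONL. The output key names and the helper names split_hh_dialogue, to_preference_record, and write_jsonl are assumptions made for this example, not part of this repository; check the OpenRLHF data format specification for the keys the training scripts actually expect.

```python
import json

# Marker that opens an assistant turn in HH-style dialogues.
ASSISTANT_TAG = "\n\nAssistant:"


def split_hh_dialogue(text):
    """Split a dialogue into (prompt up to the last assistant tag, final response)."""
    idx = text.rfind(ASSISTANT_TAG)
    if idx == -1:
        raise ValueError("no assistant turn found in dialogue")
    cut = idx + len(ASSISTANT_TAG)
    return text[:cut], text[cut:].strip()


def to_preference_record(example):
    """Build a record with hypothetical keys: prompt, chosen, rejected."""
    prompt_chosen, chosen = split_hh_dialogue(example["chosen"])
    prompt_rejected, rejected = split_hh_dialogue(example["rejected"])
    # Assumption: both dialogues share the same context before the last reply.
    if prompt_chosen != prompt_rejected:
        raise ValueError("chosen and rejected do not share a prompt")
    return {"prompt": prompt_chosen, "chosen": chosen, "rejected": rejected}


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


example = {
    "chosen": "\n\nHuman: Hi\n\nAssistant: Hello!",
    "rejected": "\n\nHuman: Hi\n\nAssistant: Go away.",
}
record = to_preference_record(example)
```

Records whose two dialogues diverge before the final reply raise an error here; depending on your data, you may prefer to skip them instead.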
bash ./example_sft/slurm/scc_sft_llama3_8b_sharegpt_packing.sh

Before running the above command, replace WORKSPACE, pretrain, and dataset with the corresponding project path, pretrained model path, and dataset path, respectively.
bash ./example_rm/slurm/scc_rm_llama3_hh105_wprompt_packing_inform.sh
bash ./example_rm/slurm/scc_rm_llama3_hh105_wprompt_packing_baseline.sh

Before running the above commands, replace WORKSPACE, pretrain, and dataset with the corresponding project path, SFT model path, and dataset path, respectively.
bash ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_baseline.sh
bash ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_inform.sh
bash ./example_ppo/script/prepare_ib_representation.sh
bash ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_inform_ibl.sh
- Before running the above commands, replace WORKSPACE, pretrain, reward_pretrain, and dataset with the corresponding project path, SFT model path, RM model path, and dataset path, respectively.
- ./example_ppo/script/generate_eval.sh generates responses using the SFT and RLHF models. This step is already included in the PPO training scripts above.
- ./example_ppo/script/ib_latent_eval.sh generates t-SNE visualizations of sample representations in the latent space of InfoRM. You can assess the extent of reward hacking by identifying outliers in these plots. This step is also included in the PPO training scripts above.
- ./example_ppo/script/prepare_ib_representation.sh pre-computes and stores the representations of SFT model responses in the IB latent space, which are later used to compute Mahalanobis distances during the RL process.
- Before running ./example_ppo/slurm/scc_ppo_ray_llama3_8b_hh105rm_reproduce_hacking_offload_inform_ibl.sh, the prompt dataset must include an additional key, class, whose value is either helpful or harmless, indicating the dataset source of each sample.
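The class key described above could be added in a small preprocessing pass. The sketch below is one possible way to do this, assuming the prompt dataset is a list of JSON-serializable dictionaries; the prompt key and the helper names tag_records and validate_records are hypothetical and only serve the example.

```python
import json

# The two values the class key may take, as stated above.
ALLOWED_CLASSES = {"helpful", "harmless"}


def tag_records(records, source_class):
    """Return copies of the records with a class key marking their source."""
    if source_class not in ALLOWED_CLASSES:
        raise ValueError(f"class must be one of {sorted(ALLOWED_CLASSES)}")
    return [{**record, "class": source_class} for record in records]


def validate_records(records):
    """Raise if any record lacks a valid class value."""
    for index, record in enumerate(records):
        if record.get("class") not in ALLOWED_CLASSES:
            raise ValueError(f"record {index} has invalid class: {record.get('class')!r}")


# Hypothetical prompts standing in for the two dataset sources.
helpful = tag_records([{"prompt": "How do I bake bread?"}], "helpful")
harmless = tag_records([{"prompt": "Is this chemical safe to mix?"}], "harmless")

merged = helpful + harmless
validate_records(merged)

# One JSON object per line, ready to be written to a .jsonl file.
lines = [json.dumps(record) for record in merged]
```

Running validate_records on the final dataset before launching training gives an early failure if any sample is missing its source label.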
python -m openrlhf.eval.compute_mop

- Before running the above command, make sure to obtain sft_representation and rlhf_representation as defined in Line #78 and Line #80.
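To make the role of the two sets of representations more concrete, here is a minimal, standard-library-only sketch of Mahalanobis-distance outlier detection: it fits a mean and covariance on the SFT representations and reports the fraction of RLHF representations whose distance exceeds a threshold. This is not the repository's compute_mop implementation; the function names, the ridge term, and the threshold value are assumptions chosen for illustration, and the representations are assumed to be plain lists of floats.

```python
import math


def mean_and_covariance(rows, ridge=1e-6):
    """Sample mean and covariance; a small ridge keeps the matrix invertible."""
    n, d = len(rows), len(rows[0])
    mu = [sum(row[j] for row in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [row[j] - mu[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += diff[a] * diff[b] / (n - 1)
    for a in range(d):
        cov[a][a] += ridge
    return mu, cov


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    d = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, d + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, d))
        x[i] = (aug[i][d] - tail) / aug[i][i]
    return x


def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)), computed without forming the inverse."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    y = solve(cov, diff)
    return math.sqrt(max(0.0, sum(a * b for a, b in zip(diff, y))))


def outlier_fraction(sft_rows, rlhf_rows, threshold):
    """Fraction of RLHF representations farther than threshold from the SFT fit."""
    mu, cov = mean_and_covariance(sft_rows)
    flagged = sum(mahalanobis(x, mu, cov) > threshold for x in rlhf_rows)
    return flagged / len(rlhf_rows)


# Toy 2-D data standing in for real latent representations.
sft_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rlhf_rows = [[0.5, 0.5], [5.0, 5.0]]
fraction = outlier_fraction(sft_rows, rlhf_rows, threshold=3.0)
```

In this toy example, one of the two RLHF points lies far from the SFT distribution, so half of the samples are flagged. For real, high-dimensional representations a numerical library would be a more practical choice than the hand-written solver used here.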
If you find our work useful in your research, please consider citing both our conference and journal versions:
@inproceedings{miao2024inform,
title={Info{RM}: Mitigating Reward Hacking in {RLHF} via Information-Theoretic Reward Modeling},
author={Yuchun Miao and Sen Zhang and Liang Ding and Rong Bao and Lefei Zhang and Dacheng Tao},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=3XnBVK9sD6}
}

@misc{miao2025informationtheoreticrewardmodelingstable,
title={Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking},
author={Yuchun Miao and Liang Ding and Sen Zhang and Rong Bao and Lefei Zhang and Dacheng Tao},
year={2025},
eprint={2510.13694},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.13694},
}

This project is based on OpenRLHF. Thanks for this wonderful work!