Juntao Dai, Taiye Chen, Yaodong Yang, Qian Zheng, Gang Pan: Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization. CoRR abs/2503.18130 (2025)