bodasadallah/RevUtil


The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors


arXiv | RevUtil Human | RevUtil Synthetic | RevUtil Llama Score Only | RevUtil Llama Score + Rationale | License

🔥 News

📝 Abstract

Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors.

To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive their utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness.

To support evaluation and model development, we introduce the RevUtil dataset, consisting of 1,430 human-labeled review comments and 10,000 synthetically labeled comments (with rationales) for training.

Using this dataset, we benchmark fine-tuned models for scoring and explaining review comments. These models achieve agreement with human annotations on par with, or even exceeding, GPT-4o. Further analysis shows that GPT-4-generated reviews generally underperform human reviews on these utility dimensions.


🖼️ System Overview

[Figure: System Overview]


📚 Dataset

🧑‍🔬 RevUtil Human

Hugging Face dataset: boda/RevUtil_human

The dataset contains 1,430 review comments annotated by three human raters.

Key columns:

| Column | Description |
| --- | --- |
| `paper_id` | ID of the reviewed paper |
| `venue` | Conference or journal name |
| `focused_review` | Full review (weakness + suggestion sections) |
| `review_point` | Individual review comment being evaluated |
| `id` | Unique ID for the review point |
| `batch` | Identifier for the annotation batch/study |
| `ASPECT` | Dictionary with annotators and their individual labels |
| `ASPECT_label` | Majority label (empty if no agreement among annotators) |
| `ASPECT_label_type` | Label quality: "gold" (3/3), "silver" (2/3), or "None" (no agreement) |
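
The relationship between the per-annotator labels, the majority label, and the label type can be sketched as follows. This is a minimal illustration, not the authors' aggregation code: the function name `aggregate_labels`, the annotator IDs, the label values, and the assumption that the `ASPECT` dictionary maps each annotator directly to one label are all hypothetical.

```python
from collections import Counter


def aggregate_labels(annotations):
    """Return (majority_label, label_type) for one aspect of one review point.

    `annotations` is assumed to map an annotator ID to that annotator's label.
    Exactly three annotators are expected, matching the README description.
    """
    if len(annotations) != 3:
        raise ValueError("expected labels from exactly three annotators")

    # Count how often each label was chosen and take the most frequent one.
    label, votes = Counter(annotations.values()).most_common(1)[0]

    if votes == 3:
        return label, "gold"    # all three annotators agree
    if votes == 2:
        return label, "silver"  # two of three annotators agree
    return "", "None"           # no agreement: the majority label is left empty


# Hypothetical annotator IDs and label values, for illustration only.
full_agreement = aggregate_labels({"a1": "4", "a2": "4", "a3": "4"})
partial_agreement = aggregate_labels({"a1": "4", "a2": "2", "a3": "4"})
no_agreement = aggregate_labels({"a1": "1", "a2": "2", "a3": "3"})
```

When evaluating against human labels, a filter on the label type (for example, keeping only "gold" and "silver" rows) removes the review points that have no majority label.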

🤖 RevUtil Synthetic

Hugging Face dataset: boda/RevUtil_synthetic
Synthetic dataset generated using GPT-4o with 10k examples (9k train / 1k test).

Key columns:

| Column | Description |
| --- | --- |
| `paper_id` | ID of the reviewed paper |
| `venue` | Conference or journal name |
| `focused_review` | Full review (weakness + suggestion sections) |
| `review_point` | Individual review comment |
| `id` | Unique ID for the review point |
| `chatgpt_ASPECT_score` | Model-generated score for the aspect |
| `chatgpt_ASPECT_rationale` | Explanation of the score provided by GPT-4o |
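
In the column names above, `ASPECT` is a placeholder for an aspect name. The sketch below shows one way to resolve the placeholder and read the score and rationale from a row. It assumes the aspect appears in the column name as a lowercase identifier such as `actionability`; the helper names and the example row are hypothetical, so check the dataset card for the exact column spellings.

```python
def aspect_columns(aspect):
    """Build the column names for one aspect by filling in the ASPECT placeholder."""
    return {
        "score": f"chatgpt_{aspect}_score",
        "rationale": f"chatgpt_{aspect}_rationale",
    }


def read_aspect(row, aspect):
    """Return (score, rationale) for `aspect` from a dict-like dataset row."""
    cols = aspect_columns(aspect)
    return row[cols["score"]], row[cols["rationale"]]


# Made-up row for illustration; real rows also carry paper_id, venue, and so on.
example_row = {
    "review_point": "The ablation study omits the baseline without pretraining.",
    "chatgpt_actionability_score": "4",
    "chatgpt_actionability_rationale": "The comment names a concrete missing experiment.",
}

score, rationale = read_aspect(example_row, "actionability")
```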

🛠️ Finetuning

We use the Hugging Face Alignment Handbook for training scripts. We apply LoRA-based fine-tuning and leverage DeepSpeed ZeRO for distributed training.

🏃 To start finetuning

bash finetune.sh

🔧 Important configuration options (in finetune.sh)

| Variable | Description |
| --- | --- |
| `USE_PEFT=true` | Enables LoRA training instead of full fine-tuning |
| `GENERATION_TYPE` | "score_only" to generate scores alone, or "score_rationale" to also generate rationales |
| `ASPECTS=("all")` | Choose "all" or specific aspects like "actionability" |

✅ Evaluation

All evaluation scripts are in the inference/ directory and use vLLM.

🏃 To run evaluation

bash inf.bash

🔧 Important configuration options (in inf.bash)

| Variable | Description |
| --- | --- |
| `FINETUNING_TYPE` | "adapters" for LoRA models, "baseline" for base models |
| `STEP` | Checkpoint step to evaluate (use 0 for the latest checkpoint, or when evaluating base models) |
| `TRAINING_aspects` | Aspects the model was trained on (e.g., "all") |
| `GENERATION_TYPES` | "score_only" or "score_rationale" |

🗃 Dataset Configurations

DATASETS=("boda/RevUtil_human" "boda/RevUtil_synthetic")
DATASET_SPLITS=("full" "test")
DATASET_CONFIGS=("combined_main_aspects" "all")
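
The three arrays appear to be parallel, so that the entries at the same position describe one evaluation target. The Python sketch below makes that pairing explicit. The index-wise pairing is an assumption about how inf.bash consumes these variables, and the function name `evaluation_targets` is hypothetical.

```python
DATASETS = ["boda/RevUtil_human", "boda/RevUtil_synthetic"]
DATASET_SPLITS = ["full", "test"]
DATASET_CONFIGS = ["combined_main_aspects", "all"]


def evaluation_targets(datasets, splits, configs):
    """Pair the parallel lists index-wise into one record per evaluation target."""
    if not (len(datasets) == len(splits) == len(configs)):
        raise ValueError("DATASETS, DATASET_SPLITS and DATASET_CONFIGS must have equal length")
    return [
        {"dataset": d, "split": s, "config": c}
        for d, s, c in zip(datasets, splits, configs)
    ]


targets = evaluation_targets(DATASETS, DATASET_SPLITS, DATASET_CONFIGS)
for target in targets:
    print(target["dataset"], target["split"], target["config"])
```

Under this reading, the human dataset is evaluated on its "full" split with the "combined_main_aspects" configuration, and the synthetic dataset on its "test" split with the "all" configuration. Adding a dataset means appending one entry to each of the three arrays.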

📊 Analysis

All scripts and notebooks for performance breakdown and visualization are included in the analysis/ directory.


📎 Citation

@article{sadallah2025goodbadconstructiveautomatically,
      title={The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors}, 
      author={Abdelrahman Sadallah and Tim Baumgärtner and Iryna Gurevych and Ted Briscoe},
      year={2025},
      eprint={2509.04484},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.04484}, 
}

License

RevUtil is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Contact

For questions or contributions, contact:
