This paper introduces a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. Using our framework, we build two new benchmarks for Multi-Source Retrieval and Synthesis (MSRS): MSRS-Story and MSRS-Meet.
The datasets for MSRS-Story and MSRS-Meet are provided in the data directory.
The retrieval code and the settings created by each retrieval model, which serve as inputs for summarization, are located in the code/retrieval directory.
The summarization code is included in code/summarization.
The evaluation code is located in the code/evaluation directory, along with the generated summaries and their corresponding evaluation results (e.g., ROUGE-2, G-Eval).
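Putting these together, the repository layout is roughly as follows (only the directories mentioned above are shown; the file names inside each directory may differ):

data/                  # MSRS-Story and MSRS-Meet datasets
code/
  retrieval/           # retrieval scripts and the retrieval settings used as summarization inputs
  summarization/       # summarization scripts
  evaluation/          # evaluation scripts, generated summaries, and metric results
requirements.txt       # Python dependencies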
Install the required packages (Python >= 3.9 is required):
pip install -r requirements.txt
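For example, one way to set up an isolated environment before installing (the python3 executable name and the activation command are assumptions that may differ on your system):

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt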
Examples for running the retrieval, summarization, and evaluation scripts are provided in usage.sh files alongside the scripts.
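A rough end-to-end pass might look like the following, assuming each usage.sh can be run directly from its directory; if not, copy the relevant commands out of the corresponding usage.sh instead:

cd code/retrieval && bash usage.sh       # build the retrieval settings used as summarization inputs
cd ../summarization && bash usage.sh     # generate summaries from the retrieved inputs
cd ../evaluation && bash usage.sh        # score the summaries (e.g., ROUGE-2, G-Eval)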
Retrieval Performance for MSRS-Story
Retrieval Performance for MSRS-Meet
Summarization Performance for MSRS-Story
Summarization Performance for MSRS-Meet
Oracle Summarization Performance for Reasoning Models
If you find our work helpful, please consider citing it:
@inproceedings{phanse2025msrs,
  title={{MSRS}: Evaluating Multi-Source Retrieval-Augmented Generation},
  author={Rohan Phanse and Yijie Zhou and Kejian Shi and Wencai Zhang and Yixin Liu and Yilun Zhao and Arman Cohan},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=KtGsJm8bOC}
}