📌 Benchmark for Utility of Retrieved Documents
🏆 COLING 2025
🟨 Hyeonseok Lim
🟨 Dongjae Shin
🟨 Seohyun Song
🟨 Inho Won
🟨 Minjun Kim
🟪 Junghun Yuk
🟪 Haneol Jang
🟨 KyungTae Lim
Affiliations:
🟨 Seoul National University of Science and Technology
🟪 Hanbat National University
- 🗒️ VLR-Bench Blog
- 📄 arXiv
- 🎓 COLING 2025
- 📂 VLR-IF Dataset (HuggingFace)
- 📂 VLR-Bench Dataset (HuggingFace)
- VLR-Bench
  - We propose VLR-BENCH, a visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) using retrieval-augmented generation (RAG).
  - VLR-BENCH pairs each query with five input passages, testing whether a model can determine which passage is most relevant for answering the query, an aspect often overlooked in prior research.
- VLR-IF
  - We introduce VLR-IF, a dataset of 32,000 instruction-following examples to enhance VLMs' ability to generate accurate responses from retrieved information.
- Open-Source
  - Both VLR-BENCH and VLR-IF datasets are publicly available online.
📌 Dataset Summary (VLR-Bench)
- 150 images from BOK-VQA
- 150 images from Wikimedia Commons (reflecting cultural elements)
- Multilingual Parallel Corpus: English, Chinese, and Korean
- 📂 Dataset Link
📷 Example Images from VLR-Bench
📌 Dataset Summary (VLR-IF)
- 9,000 images from COCO
- 32,000 entries (valid/invalid passages)
- Languages: English, Chinese, Korean
- 📂 Dataset Link
📷 Example Images from VLR-IF
@article{lim2024vlr,
  title={VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation},
  author={Lim, Hyeonseok and Shin, Dongjae and Song, Seohyun and Won, Inho and Kim, Minjun and Yuk, Junghun and Jang, Haneol and Lim, KyungTae},
  journal={arXiv preprint arXiv:2412.10151},
  year={2024}
}
@inproceedings{lim-etal-2025-vlr,
  title = "{VLR}-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation",
  author = "Lim, Hyeonseok and Shin, Dongjae and Song, Seohyun and Won, Inho and Kim, Minjun and Yuk, Junghun and Jang, Haneol and Lim, KyungTae",
  booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
  month = jan,
  year = "2025",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.coling-main.411/"
}
This work was supported by:
- Institute of Information & Communications Technology Planning & Evaluation (IITP)
- Artificial Intelligence Industrial Convergence Cluster Development Project
- The data and code are intended and licensed for research use only.
- Their use must also comply with the license agreement of GPT-4.
- The dataset is released under CC BY-NC 4.0 (non-commercial use only).
- Models trained using this dataset should not be used for non-research purposes.
- Clone this repository:
git clone https://github.com/MLP-Lab/VLR-Bench.git
cd VLR-Bench
- Install dependencies:
pip install -r requirements.txt
- Generate the model's inference results:
- The dataset can be loaded directly from Hugging Face.
- You can choose a specific language ('en', 'ko', or 'zh') from the language column and use the filtered dataset for inference.
- Use the following code to load the dataset and filter by language:
from datasets import load_dataset

dataset = load_dataset("MLP-KTLim/VLR-Bench")

# Select a specific language ('en', 'ko', or 'zh')
selected_language = "en"  # Change to 'ko' or 'zh' if needed
filtered_data = dataset.filter(lambda x: x["language"] == selected_language)
- Prepare the inference result JSON file:
The JSON file containing the inference results of the model to be evaluated must include the following fields:
{
"result": "The model's inference output",
"label": "The output value from MLP-KTLim/VLR-Bench(filtered_data)",
"answer_keyword1": "The keyword1 value from MLP-KTLim/VLR-Bench(filtered_data)",
"answer_keyword2": "The keyword2 value from MLP-KTLim/VLR-Bench(filtered_data)"
}
- Run eval:
sh eval.sh
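
The result-file format described above can be assembled with a few lines of Python. The following is a minimal sketch, not the repository's own script. It makes several assumptions: the dataset columns are named `output`, `keyword1`, and `keyword2` (inferred from the field descriptions above), the file holds a list of such objects, and `generate` is a hypothetical stand-in for your model's inference call. The output path that `eval.sh` expects is not stated here, so check the script before choosing one.

```python
import json
from typing import Callable, Iterable


def build_result_records(rows: Iterable[dict], generate: Callable[[dict], str]) -> list:
    """Map benchmark rows to the four fields required in the result file."""
    records = []
    for row in rows:
        records.append({
            "result": generate(row),             # the model's inference output
            "label": row["output"],              # assumed column name
            "answer_keyword1": row["keyword1"],  # assumed column name
            "answer_keyword2": row["keyword2"],  # assumed column name
        })
    return records


def save_results(records: list, path: str) -> None:
    """Write the records as UTF-8 JSON (assumes a list of objects is expected)."""
    # ensure_ascii=False keeps Korean and Chinese text human-readable.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Placeholder rows; in practice, iterate over a split of filtered_data.
    demo_rows = [
        {"output": "reference answer", "keyword1": "kw-a", "keyword2": "kw-b"},
    ]

    # Hypothetical model call: replace with real VLM inference.
    def dummy_generate(row: dict) -> str:
        return "model answer"

    print(json.dumps(build_result_records(demo_rows, dummy_generate),
                     ensure_ascii=False, indent=2))
```

To write the file, call `save_results(records, path)` with whatever path the evaluation script reads.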