MultiChartQA-R: A Benchmark for Multi-Chart Question Answering in Real-World Reasoning Scenarios
MultiChartQA-R is a benchmark for multi-chart question answering, designed to evaluate multimodal large language models (MLLMs) in realistic reasoning settings. It extends prior multi-chart resources with broader task coverage, multilingual data, a scalable data construction pipeline, and evaluation protocols for both multi-select and generative settings.
✨ Paper Status: Under review at the ACMMM 2026 Dataset Track.
The appendix includes detailed benchmark statistics, metric definitions, prompt templates, multilingual breakdowns, extended benchmark details, and supplementary analyses.
The web viewer home page lets you choose between the full paper and the appendix. In the full paper view (paper.html), appendix references cited in the main text are clickable, so you can jump directly from the paper to the relevant supplementary section.
MultiChartQA-R studies reasoning over multiple related charts, rather than isolated single-chart understanding. The benchmark is designed to cover a progression of abilities from basic cross-chart perception to decision-oriented reasoning.
Each language version currently contains:
- 180 multi-chart sets
- 695 chart-code pairs
- 2,160 QA pairs
- 4 task types
The benchmark currently supports English, Chinese, and Spanish, and is designed to be extendable to additional languages.
In addition, we provide an extended benchmark for retrieval-oriented analysis, built from 101 multi-chart articles with 1,212 QA pairs, to study how model performance changes as the number of charts and the amount of relevant information increase.
The main benchmark JSON files are stored under benchmark/json/{cn,en,es}. Each file contains a multi-chart set and its qa_pairs.
- Task 1 / Task 2 entries include direct answers and, for Task 2, explanatory calculations in the explanation field.
- Task 3 / Task 4 entries include the released multi-select supervision fields:
  - label: correct option set
  - easy_error: distractors corresponding to clearly unsupported or obviously incorrect choices
  - hard_error: distractors corresponding to more plausible but ultimately incorrect choices
- Task 3 / Task 4 entries additionally include a cot field, which stores option-level explanations for why each option is correct or incorrect.
This format supports answer evaluation, instruction construction, explanation analysis, and option-level supervision.
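As a rough illustration of how these fields might be consumed, the sketch below walks a language directory and separates multi-select entries from the rest. It is not the released data_utils.py: the helper names iter_qa_pairs and is_multiselect are hypothetical, and it assumes each JSON file is a single object whose qa_pairs key holds a list of dictionaries.

```python
import json
from pathlib import Path


def iter_qa_pairs(json_dir):
    """Yield (file_name, qa_pair) for every QA pair found under json_dir."""
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            chart_set = json.load(f)
        # Assumption: each file is one dict with a "qa_pairs" list.
        for qa in chart_set.get("qa_pairs", []):
            yield path.name, qa


def is_multiselect(qa):
    """True if the entry carries all three multi-select supervision fields."""
    return all(key in qa for key in ("label", "easy_error", "hard_error"))
```

With the repository layout described in this README, a call such as iter_qa_pairs("benchmark/json/en") would be the expected entry point; entries for which is_multiselect returns True would then be the Task 3 / Task 4 items.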
MultiChartQA-R includes four progressively more complex task types:
- Task 1, Cross-chart Trend Inference: Determine whether trends or patterns across charts are aligned, divergent, or otherwise related.
- Task 2, Complementary Data Integration: Combine evidence from multiple charts to derive a missing value, comparison, or aggregated conclusion.
- Task 3, Anomaly and Pattern Analysis: Identify and explain non-trivial anomalies or patterns grounded in multi-chart evidence. This task is released in both multi-select and generative settings.
- Task 4, Strategy Recommendation: Produce decision-oriented recommendations supported by cross-chart analysis. This task is also released in both multi-select and generative settings.
MultiChartQA-R is built through a scalable pipeline that supports both realistic benchmark construction and multilingual extension:
- Chart-code pair construction: Reconstruct chart-rendering code from real-world multi-chart examples to preserve structured data.
- Task-specific QA synthesis: Build the four tasks with a mix of manual annotation and model-assisted generation plus human refinement.
- Multilingual expansion: Extend both chart content and QA pairs to multiple languages while maintaining semantic consistency.
MultiChartQA-R/
├── benchmark/ # Main benchmark data
│ ├── images/
│ ├── images_info/
│ ├── code/
│ └── json/
├── benchmark-extended/ # Retrieval-oriented extended benchmark
├── code/ # Public utilities: loaders, prompt templates, evaluators, inference template
├── readme.assets/ # README figures
└── appendix.pdf # Appendix document
cd code
python load_benchmark.py
python load_exbenchmark.py
These scripts load the released benchmark files and print example samples from the main benchmark and the retrieval-oriented extended benchmark.
Use the released API templates with environment variables:
cd code
export OPENAI_BASE_URL=https://your-endpoint/v1
export OPENAI_API_KEY=your_api_key
python inference_multiselect_api_template.py --model your-model-name --task 1 --language en --sample-index 0
python inference_multiselect_api_template.py --model your-model-name --task 2 --language en --sample-index 0
python inference_multiselect_api_template.py --model your-model-name --task 3 --language en --sample-index 0
python inference_multiselect_api_template.py --model your-model-name --task 4 --language en --sample-index 0
For generative inference on Task 3 / Task 4:
cd code
export OPENAI_BASE_URL=https://your-endpoint/v1
export OPENAI_API_KEY=your_api_key
python inference_generative_api_template.py --model your-model-name --task 3 --language en --sample-index 0
python inference_generative_api_template.py --model your-model-name --task 4 --language en --sample-index 0
The released public scripts use environment variables instead of hardcoded credentials.
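The environment-variable pattern can be sketched as follows. This is a minimal illustration, not the code of the released templates: the function name read_api_config and the returned dictionary shape are assumptions, and only the two variable names come from the commands above.

```python
import os


def read_api_config(env=None):
    """Read the endpoint and key from the environment, failing early if unset."""
    env = os.environ if env is None else env
    required = ("OPENAI_BASE_URL", "OPENAI_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Hypothetical return shape; the released scripts may pass these differently.
    return {"base_url": env["OPENAI_BASE_URL"], "api_key": env["OPENAI_API_KEY"]}
```

Failing before any request is sent keeps a missing key from surfacing later as an opaque authentication error, and passing env explicitly makes the helper easy to test without touching the real environment.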
cd code
python eval_task1_accuracy.py --result-file path/to/task1_predictions.jsonl
python eval_task2_accuracy.py --result-file path/to/task2_predictions.jsonl
python eval_task34_strict_risk_aware.py --result-file path/to/task3_multiselect_predictions.jsonl
python eval_task34_strict_risk_aware.py --result-file path/to/task4_multiselect_predictions.jsonl
For primary generative evaluation of Task 3 / Task 4, we provide an answer-extraction-based judge template:
export OPENAI_BASE_URL=https://your-endpoint/v1
export OPENAI_API_KEY=your_api_key
python eval_task34_generative_answer_extraction.py --result-file path/to/task3_generative_predictions.jsonl --judge-model your-judge-model
python eval_task34_generative_answer_extraction.py --result-file path/to/task4_generative_predictions.jsonl --judge-model your-judge-model
- data_utils.py: load the main benchmark and the extended benchmark
- prompts.py: prompt templates for multi-select inference and Task 2 rationale-to-code conversion
- parse_predictions.py: parse JSON-style model outputs
- inference_multiselect_api_template.py: API inference template for Task 1-4 multi-select
- inference_generative_api_template.py: API inference template for the Task 3 / Task 4 generative setting
- eval_task1_accuracy.py: Task 1 accuracy
- eval_task2_accuracy.py: Task 2 answer-level accuracy
- eval_task34_strict_risk_aware.py: Task 3 / Task 4 multi-select Strict Risk-Aware MF_beta
- eval_task34_generative_answer_extraction.py: Task 3 / Task 4 generative primary score via answer extraction + strict risk-aware scoring
- evaluation.py: minimal usage examples
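To give a sense of what parsing JSON-style model outputs can involve, here is a small sketch that pulls the first valid JSON object out of a free-form response. It is an assumption about the general approach, not the contents of parse_predictions.py; the function name extract_json_object and the "answer" key in the usage note are hypothetical.

```python
import json
import re


def extract_json_object(text):
    """Return the first parseable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    # Try every "{" as a possible start, so surrounding prose or code
    # fences in the model output do not break parsing.
    for match in re.finditer(r"\{", text):
        try:
            obj, _ = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None
```

For a response such as 'Answer: {"answer": ["A", "C"]}', the helper returns the dictionary with the option list, and it returns None when no JSON object is present, which lets a caller count unparseable outputs separately.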
For the main benchmark:
- Task 1-2 use accuracy-based evaluation.
- Task 3-4 (multi-select) use a Strict Risk-Aware $MF_{\beta}$ metric.
- Task 3-4 (generative) use a free-form generation protocol aligned with the benchmark’s option-level evaluation principle.
- The benchmark is intended for research on realistic multi-chart reasoning, including multilingual analysis, retrieval scalability, and decision-oriented evaluation.
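For orientation, the two metric families named above can be sketched in their textbook form. The block shows plain exact-match accuracy and the standard set-based F-beta score only; the strict and risk-aware components of the released metric are defined in the appendix and implemented in eval_task34_strict_risk_aware.py, and are not reproduced here. The function names and the case-insensitive normalization are assumptions.

```python
def accuracy(predictions, references):
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


def f_beta(predicted, gold, beta=1.0):
    """Standard F-beta between a predicted option set and the gold option set."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    b2 = beta * beta
    # beta < 1 weights precision more heavily; beta > 1 weights recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

In this plain form, a beta below 1 penalizes selecting wrong options more than missing correct ones, which is one common way to express a risk-averse preference in multi-select scoring.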
If you find MultiChartQA-R useful, please cite the project/paper once the final bibliographic information is available.