DocVXQA: Context-Aware Visual Explanations for Document Question Answering
PyTorch implementation of our ICML 2025 paper DocVXQA: Context-Aware Visual Explanations for Document Question Answering. This model not only produces accurate answers to questions grounded in document images but also generates visual explanations — heatmaps that highlight semantically and contextually important regions, enabling interpretability in document understanding tasks.
Clone the repository:
git clone https://github.com/dali92002/DocVXQA
cd DocVXQA
Create the conda environment and install the dependencies:
conda env create -f environment.yml
conda activate docvxqa
You can download the pretrained model weights from this link.
After downloading, place the weights in your preferred directory.
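If you want to load the checkpoint outside the provided scripts, a minimal sketch with plain PyTorch is shown below; the checkpoint filename and the model class are assumptions, so adapt them to the actual code in this repository (see demo.ipynb and train.py for how the model is constructed).

```python
import torch

# Assumed location of the downloaded weights; point this at wherever you placed them.
CKPT_PATH = "weights/docvxqa.ckpt"

# Load the checkpoint on CPU first; move the model to GPU after it is built.
state = torch.load(CKPT_PATH, map_location="cpu")

# Hypothetical model class and construction; replace with the real ones from this repo.
# model = DocVXQA(**model_args)
# model.load_state_dict(state.get("state_dict", state))
# model.eval()
```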
You can try out the model quickly using our provided Jupyter notebook demo.ipynb.
First, similarity maps must be extracted using ColPali. For each data point, two maps are generated: one between the question and the document image, and another between the answer and the document image. These maps are stored and later loaded by the dataloader for training with the token-interactions loss. For an example implementation of similarity map extraction, see this reference, though you are free to implement your own approach.
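The extraction pipeline itself is up to you; as a concrete illustration, here is a minimal sketch of turning ColPali-style multi-vector embeddings into a per-patch similarity map. The function name, tensor shapes, patch-grid size, and the saved-file layout are assumptions for illustration, not the exact code used in this repository.

```python
import torch


def similarity_map(text_emb: torch.Tensor,
                   patch_emb: torch.Tensor,
                   grid_hw: tuple[int, int]) -> torch.Tensor:
    """Build a late-interaction style similarity map.

    text_emb:  (n_text_tokens, dim)  embeddings of the question or answer tokens.
    patch_emb: (n_patches, dim)      embeddings of the document image patches.
    grid_hw:   (H, W) patch grid, with H * W == n_patches.
    Returns an (H, W) map of per-patch relevance scores.
    """
    # Cosine-style similarity between every text token and every image patch.
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    patch_emb = torch.nn.functional.normalize(patch_emb, dim=-1)
    sim = text_emb @ patch_emb.T              # (n_text_tokens, n_patches)
    # Keep, for each patch, its strongest interaction with any text token.
    per_patch = sim.max(dim=0).values         # (n_patches,)
    return per_patch.reshape(grid_hw)


# Usage sketch: store one map for the question and one for the answer per data point,
# e.g. torch.save({"q_map": q_map, "a_map": a_map}, "maps/sample_0001.pt"),
# then read them back in the dataloader for the token-interactions loss.
```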
After setting your desired args, you can train with:
python train.py
After setting your desired args, you can evaluate with:
python evaluate.py
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License 🛡.
If you find this useful for your research, please cite it as follows:
@inproceedings{
souibgui2025docvxqa,
title={Doc{VXQA}: Context-Aware Visual Explanations for Document Question Answering},
author={Mohamed Ali Souibgui and Changkyu Choi and Andrey Barsky and Kangsoo Jung and Ernest Valveny and Dimosthenis Karatzas},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=wex0vL4c2Y}
}