
DocVXQA

DocVXQA: Context-Aware Visual Explanations for Document Question Answering

Table of Contents

Description

Model Architecture

Installation

Usage

Training

Evaluation

License

Citation

Description

PyTorch implementation of our ICML 2025 paper DocVXQA: Context-Aware Visual Explanations for Document Question Answering. This model not only produces accurate answers to questions grounded in document images but also generates visual explanations: heatmaps that highlight semantically and contextually important regions, enabling interpretability in document understanding tasks.

Model Architecture

(Architecture overview figure: see the paper.)

Installation

Clone the repository:

git clone https://github.com/dali92002/DocVXQA
cd DocVXQA

Create a conda environment and install the dependencies:

conda env create -f environment.yml
conda activate docvxqa

Usage

Download Weights

You can download the pretrained model weights from this link.

After downloading, place the weights in your preferred directory.

Demo

You can try out the model quickly using our provided Jupyter notebook demo.ipynb.
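
If Jupyter is installed in your environment, one way to open the notebook (assuming you run this from the repository root) is:

jupyter notebook demo.ipynb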

Training

Data Preparation

First, similarity maps should be extracted using ColPali. For each data point, two maps are generated: one between the question and the document image, and another between the answer and the document image. These maps are stored and later loaded by the dataloader for training with the token interactions loss. For an example implementation of similarity map extraction, see this reference, though you are free to implement your own approach.
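
For illustration, below is a minimal sketch of extracting one such map with the colpali-engine package. The checkpoint name, the 32x32 patch grid, the file names, and the example question are assumptions made for this sketch, not values prescribed by this repository; adapt them to your own preprocessing.

import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # assumed ColPali checkpoint
model = ColPali.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

image = Image.open("document.png")      # hypothetical document image
question = "What is the total amount?"  # hypothetical question

batch_images = processor.process_images([image]).to(model.device)
batch_queries = processor.process_queries([question]).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)   # (1, n_image_tokens, dim)
    query_embeddings = model(**batch_queries)  # (1, n_query_tokens, dim)

# Late-interaction similarity: one score per (query token, image patch) pair.
sim = torch.einsum("bqd,bpd->bqp",
                   query_embeddings.float(), image_embeddings.float())

# Keep the visual patch tokens and fold them back into a 2D grid
# (a 32x32 grid of 1024 patches is assumed for this checkpoint).
n_side = 32
sim_map = sim[0, :, : n_side * n_side].reshape(-1, n_side, n_side)
torch.save(sim_map, "question_simmap.pt")  # loaded later by the dataloader

Running the same code with the answer string in place of the question yields the answer-image map.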

Model Training

After setting your desired arguments, you can train with:

python train.py

Evaluation

After setting your desired arguments, you can evaluate with:

python evaluate.py

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License 🛡.

Citation

If you find this useful for your research, please cite it as follows:

@inproceedings{souibgui2025docvxqa,
  title={Doc{VXQA}: Context-Aware Visual Explanations for Document Question Answering},
  author={Mohamed Ali Souibgui and Changkyu Choi and Andrey Barsky and Kangsoo Jung and Ernest Valveny and Dimosthenis Karatzas},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=wex0vL4c2Y}
}
