Hao Yang¹, Yifan Ji¹, Zhipeng Xu¹, Zhenghao Liu¹, Yukun Yan², Zulong Chen³, Shuo Wang², Yu Gu¹, Ge Yu¹
¹Northeastern University, ²Tsinghua University, ³Alibaba Group
We introduce Reasoning-Guided Alignment (ReAlign), a method that enhances visual document retrieval by leveraging the reasoning capability of VLMs to provide fine-grained visual document descriptions as supervision signals for training. The framework supports multiple multimodal backbone models including Phi3 Vision and Qwen2.5 VL.
Our work has been accepted to SIGIR 2026 🎉🎉🎉!
If you find this project useful, please give us a star🌟.
We have made the following resources available in the 🤗 ReAlign collection.
| Resource | Description | Link |
|---|---|---|
| ReAlign-Phi3v | The visual document retriever based on Phi-3-vision-128k-instruct | 🤗ReAlign-Phi3v |
| ReAlign-Qwen | The visual document retriever based on Qwen2.5-VL-7B-Instruct | 🤗ReAlign-Qwen |
| Training Data | The data used to train the ReAlign retriever | 🤗ReAlign-Trainset |
| ReAlign-Set | All-in-one package: model weights, training set, and evaluation set | 🤗ReAlign-Set |
(1) Clone this repository:
git clone git@github.com:NEUIR/ReAlign.git
cd ReAlign
(2) Create and activate a Conda environment (Python 3.10):
conda create -n realign python=3.10 -y
conda activate realign
(3) Install dependencies and the editable package:
pip install -r requirements.txt
pip install -e .
Use the following command to download all required data, including model checkpoints, training set, and evaluation set:
huggingface-cli download --repo-type dataset yanghaoir/ReAlign-Set --local-dir ./dataset
By default, the paths in config/dir_config.sh work out of the box and no changes are needed. If you need to customize model or dataset locations, edit that file, which looks like:
export REALIGN_TRAIN_DATASET_PATH="path/to/train_data"
This file does not need to be run manually; it is sourced automatically during training and evaluation.
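If you do customize the path, it can help to confirm that the exported variable points at something that exists before launching a long run. The following is a minimal sketch, not part of the repository: `check_dataset_path` is a hypothetical helper, and it assumes only that `REALIGN_TRAIN_DATASET_PATH` has been exported into the current shell (for example by sourcing config/dir_config.sh yourself).

```shell
# Hypothetical helper (not part of ReAlign): verify that the training data
# path exported by config/dir_config.sh is set and exists on disk.
check_dataset_path() {
  dataset_path="${REALIGN_TRAIN_DATASET_PATH:-}"
  if [ -z "$dataset_path" ]; then
    echo "REALIGN_TRAIN_DATASET_PATH is not set" >&2
    return 1
  fi
  if [ ! -e "$dataset_path" ]; then
    echo "REALIGN_TRAIN_DATASET_PATH does not exist: $dataset_path" >&2
    return 1
  fi
  echo "ok: $dataset_path"
}
```

After defining the function and sourcing the config file, calling `check_dataset_path` prints an `ok:` line on success and returns a non-zero status otherwise.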
If you want to reproduce the data construction pipeline from scratch, run the following steps after downloading the dataset:
# Step 1: Extract corpus images from parquet shards
python src/realign/data_construction/data_unzip.py
# Step 2: Call the grounding model to build synthetic annotations
export DASHSCOPE_API_KEY="your-key"
python src/realign/data_construction/build_from_parquet.py
The output CSV will be saved to synthetic_data/OpenDocVQA-Query-1.csv. The pre-built training data is already included in dataset/train_data/train.parquet, so this step can be skipped if you do not need to regenerate it.
mkdir -p log
Phi3 Vision:
bash sh/train_phi3v.sh > log/realign-phi3v.log 2>&1
Qwen2.5 VL:
bash sh/train_qwen.sh > log/realign-qwen.log 2>&1
The second argument of each evaluation script is a comma-separated list of GPU IDs. The examples below use four GPUs; adjust to match your hardware (e.g., use 0 for a single GPU).
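Because the GPU argument is a plain comma-separated list of indices, one way to build it for a given GPU count is sketched here. `gpu_list` is a hypothetical helper, not part of the repository, and it assumes the GPUs you want are numbered contiguously from 0.

```shell
# Hypothetical helper (not part of ReAlign): print "0,1,...,N-1" for N GPUs.
# Assumes the desired GPUs are indexed contiguously starting at 0.
gpu_list() {
  n="$1"
  ids=""
  i=0
  while [ "$i" -lt "$n" ]; do
    if [ -z "$ids" ]; then
      ids="$i"          # first ID: no leading comma
    else
      ids="$ids,$i"     # later IDs: append with a comma
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$ids"
}
```

With this defined, `bash sh/eval.sh realign-phi3v "$(gpu_list 2)"` would pass `0,1` as the GPU list.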
Phi3 Vision:
bash sh/eval.sh realign-phi3v 0,1,2,3
Qwen2.5 VL:
bash sh/eval_qwen.sh realign-qwen 0,1,2,3
Parts of our code and data are built upon the following works. We sincerely thank the authors for their contributions.
@article{yang2026realign,
title={ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment},
author={Yang, Hao and Ji, Yifan and Xu, Zhipeng and Liu, Zhenghao and Yan, Yukun and Chen, Zulong and Wang, Shuo and Gu, Yu and Yu, Ge},
year={2026},
url={https://arxiv.org/abs/2604.07419},
}
If you have questions, suggestions, or bug reports, please email:
yanghao123@mails.neu.edu.cn