ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

Hao Yang¹, Yifan Ji¹, Zhipeng Xu¹, Zhenghao Liu¹, Yukun Yan², Zulong Chen³, Shuo Wang², Yu Gu¹, Ge Yu¹

¹Northeastern University, ²Tsinghua University, ³Alibaba Group

• Overview • Collections • Setup • Training • Evaluation • Acknowledgement • Citation • Contact

Overview

We introduce Reasoning-Guided Alignment (ReAlign), a method that enhances visual document retrieval by leveraging the reasoning capability of VLMs to provide fine-grained visual document descriptions as supervision signals for training. The framework supports multiple multimodal backbone models including Phi3 Vision and Qwen2.5 VL.

Our work has been accepted to SIGIR 2026 🎉🎉🎉!

If you find this project useful, please give us a star🌟.

Collections

The following resources are available in the 🤗ReAlign collection.

Resource	Description	Link
ReAlign-Phi3v	The visual document retriever based on Phi-3-vision-128k-instruct	🤗ReAlign-Phi3v
ReAlign-Qwen	The visual document retriever based on Qwen2.5-VL-7B-Instruct	🤗ReAlign-Qwen
Training Data	The data used to train the ReAlign retriever	🤗ReAlign-Trainset
ReAlign-Set	All-in-one package: model weights, training set, and evaluation set	🤗ReAlign-Set

Setup

(1) Clone this repository:

git clone git@github.com:NEUIR/ReAlign.git
cd ReAlign

(2) Create and activate a Conda environment (Python 3.10):

conda create -n realign python=3.10 -y
conda activate realign

(3) Install dependencies and the editable package:

pip install -r requirements.txt
pip install -e .

Training

1. Prepare Data and Model Paths

Use the following command to download all required data, including model checkpoints, training set, and evaluation set:

huggingface-cli download --repo-type dataset yanghaoir/ReAlign-Set --local-dir ./dataset
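
For scripted setups, the same download can be expressed in Python. This is a minimal sketch, assuming the huggingface_hub package (which provides huggingface-cli) is installed; download_realign_set is a hypothetical helper name and is not part of this repository.

```python
def download_realign_set(local_dir: str = "./dataset") -> str:
    """Download the yanghaoir/ReAlign-Set dataset repo and return its local path."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    # Mirrors: huggingface-cli download --repo-type dataset yanghaoir/ReAlign-Set --local-dir ./dataset
    return snapshot_download(
        repo_id="yanghaoir/ReAlign-Set",
        repo_type="dataset",
        local_dir=local_dir,
    )
```

Calling download_realign_set() fetches everything into ./dataset, matching the command above.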

Model and dataset locations are defined in config/dir_config.sh. By default, the paths in that file work out of the box and no changes are needed. If you need to customize them, edit the file, which contains entries like:

export REALIGN_TRAIN_DATASET_PATH="path/to/train_data"

You do not need to run this file manually; it is sourced automatically during training and evaluation.
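
If you write your own Python tooling around these settings, a small check like the one below can catch a missing variable early. It is a sketch only: resolve_train_dataset_path is a hypothetical helper, and whether the repository's own scripts read the variable this way is an assumption.

```python
import os
from pathlib import Path


def resolve_train_dataset_path(env=None) -> Path:
    """Return the path stored in REALIGN_TRAIN_DATASET_PATH.

    `env` defaults to os.environ; pass a dict to test without touching the
    real environment. Raises RuntimeError if the variable is unset or empty.
    """
    env = os.environ if env is None else env
    raw = env.get("REALIGN_TRAIN_DATASET_PATH")
    if not raw:
        raise RuntimeError(
            "REALIGN_TRAIN_DATASET_PATH is not set; source config/dir_config.sh first"
        )
    return Path(raw).expanduser()
```

The variable is only visible to Python if config/dir_config.sh was sourced in the same shell session that launches the interpreter.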

2. Build Synthetic Training Data (Optional)

If you want to reproduce the data construction pipeline from scratch, run the following steps after downloading the dataset:

# Step 1: Extract corpus images from parquet shards
python src/realign/data_construction/data_unzip.py

# Step 2: Call the grounding model to build synthetic annotations
export DASHSCOPE_API_KEY="your-key"
python src/realign/data_construction/build_from_parquet.py

The output CSV will be saved to synthetic_data/OpenDocVQA-Query-1.csv. The pre-built training data is already included in dataset/train_data/train.parquet, so this step can be skipped if you do not need to regenerate it.
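
To sanity-check the generated file before training, you can look at its columns and row count. The sketch below uses only the standard library and assumes the CSV is UTF-8 encoded with a header row; the column names are not documented here, so the helper (summarize_csv, a hypothetical name) simply reports whatever it finds.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file that has a header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])           # first row is treated as the header
        row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return header, row_count
```

For example, summarize_csv("synthetic_data/OpenDocVQA-Query-1.csv") returns the header list and the number of annotation rows.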

3. Create Log Directory

mkdir -p log

4. Run Training

Phi3 Vision:

bash sh/train_phi3v.sh > log/realign-phi3v.log 2>&1

Qwen2.5 VL:

bash sh/train_qwen.sh > log/realign-qwen.log 2>&1

Evaluation

The second argument of each evaluation script is a comma-separated list of GPU IDs. The examples below use four GPUs; adjust to match your hardware (e.g., use 0 for a single GPU).

Phi3 Vision:

bash sh/eval.sh realign-phi3v 0,1,2,3

Qwen2.5 VL:

bash sh/eval_qwen.sh realign-qwen 0,1,2,3
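
If you launch the evaluation scripts from your own tooling, it can help to validate the comma-separated GPU list described above before passing it on. The following is a minimal sketch; parse_gpu_ids is a hypothetical helper, and how the scripts consume the list internally is not shown here.

```python
def parse_gpu_ids(spec: str) -> list[int]:
    """Parse a GPU list such as "0,1,2,3" into a list of integers.

    Raises ValueError for empty input, non-integer entries, negative IDs,
    or duplicate IDs.
    """
    ids = [int(part) for part in spec.split(",") if part.strip()]
    if not ids:
        raise ValueError("no GPU IDs given")
    if any(gpu_id < 0 for gpu_id in ids):
        raise ValueError("GPU IDs must be non-negative")
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate GPU IDs")
    return ids
```

For the commands above, parse_gpu_ids("0,1,2,3") yields four devices, while parse_gpu_ids("0") yields a single one.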

Acknowledgement

Parts of our code and data are built upon the following works. We sincerely thank the authors for their contributions.

Citation

@article{yang2026realign,
      title={ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment},
      author={Yang, Hao and Ji, Yifan and Xu, Zhipeng and Liu, Zhenghao and Yan, Yukun and Chen, Zulong and Wang, Shuo and Gu, Yu and Yu, Ge},
      year={2026},
      url={https://arxiv.org/abs/2604.07419},
}

Contact

If you have questions, suggestions, or bug reports, please email:

yanghao123@mails.neu.edu.cn
