Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention (CVPR 2026)
Official PyTorch implementation of Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention, accepted at CVPR 2026.
[Paper]
- Python environment setup
# We use CUDA 12.6, cuDNN 9.7.1, NCCL 2.24.3, OpenMPI 4.0.5, and pyenv
pyenv install 3.12.5
pyenv virtualenv 3.12.5 pga_sft-3.12.5
pyenv shell pga_sft-3.12.5
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
python -m pip install transformers==4.51.3
python -m pip install scipy shortuuid sentencepiece accelerate peft jupyter datasets ninja scikit-image wandb opencv-python timm einops einops-exts bitsandbytes markdown2 scikit-learn gradio gradio_client uvicorn fastapi wavedrom Pygments wheel tensorboard urllib3 pillow pycocotools matplotlib numpy imageio evaluate
python -m pip install sacrebleu editdistance pyaml_env python-dotenv spacy textacy seaborn pandas captum pycocoevalcap qwen-vl-utils decord nltk python-Levenshtein
python -m pip install mpi4py
- Dataset and model setup
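Before downloading the datasets and models, it may be worth confirming that the environment installed above is usable. The following is a minimal sketch, not part of the repository: the `REQUIRED` list and the helper names are illustrative. Including `deepspeed` is an assumption based on the training commands, which invoke the `deepspeed` launcher even though the package does not appear in the pip commands above.

```python
import importlib.util
import sys

# Import names (not pip names) for a subset of the packages installed above.
# "deepspeed" is an assumption: the training commands call the deepspeed
# launcher, but it is not listed in the pip commands of this README.
REQUIRED = ["torch", "torchvision", "transformers", "accelerate", "peft",
            "mpi4py", "deepspeed"]


def missing_packages(names):
    """Return the import names in `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def report():
    """Print a short environment summary and return the missing packages."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    print("Python", version)
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
    if "torch" not in missing:
        import torch  # imported lazily so the check works without torch
        print("torch", torch.__version__,
              "| CUDA available:", torch.cuda.is_available(),
              "| visible GPUs:", torch.cuda.device_count())
    return missing


if __name__ == "__main__":
    report()
```

If a package is reported as missing, install it with `python -m pip install <name>` inside the same pyenv virtualenv before continuing.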
HDD=<your_data_directory>
HF_TOKEN=<your_huggingface_token_here>
huggingface-cli login --token ${HF_TOKEN}
huggingface-cli download ahmed-masry/ChartQA --repo-type dataset
huggingface-cli download liuhaotian/llava-v1.5-7b
huggingface-cli download openai/clip-vit-large-patch14-336
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct
cd $HDD
mkdir datasets && cd datasets
git clone https://github.com/vis-nlp/chartqa
mv chartqa ChartQA
cd ChartQA
mv 'ChartQA Dataset' ChartQA_Dataset
- Regarding our phrase-region aligned dataset, download the dataset from here. The dataset is provided in zip format.
  - Unzip dataset_pga_sft.zip.
  - Place GPT_chartQA_01 under $HDD/datasets as $HDD/datasets/GPT_chartQA_01.
  - Place gpt_data under $HDD as $HDD/gpt_data.
cd $HDD
# download dataset_pga_sft.zip
unzip dataset_pga_sft.zip
mv dataset_pga_sft/GPT_chartQA_01/ datasets/
mv dataset_pga_sft/gpt_data/ .
- For training Qwen2.5-VL-7B on ChartQA
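Before launching a training run, it can help to confirm that the directory layout produced by the dataset steps above is in place. This is a hedged sketch rather than a script shipped with the repository: the file name `check_layout.py` and the helper `missing_dirs` are hypothetical, and the check only covers the three directories named in this README, not their contents.

```python
import sys
from pathlib import Path

# Directories this README places under $HDD (relative paths).
EXPECTED_DIRS = [
    "datasets/ChartQA/ChartQA_Dataset",
    "datasets/GPT_chartQA_01",
    "gpt_data",
]


def missing_dirs(root):
    """Return the expected relative directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__" and len(sys.argv) > 1:
    # The data directory is passed as an argument because a plain `HDD=...`
    # shell assignment is not exported to child processes.
    absent = missing_dirs(sys.argv[1])
    if absent:
        print("Missing under", sys.argv[1] + ":")
        for rel in absent:
            print("  ", rel)
    else:
        print("Dataset layout looks complete.")
```

It could be run as `python check_layout.py "$HDD"` from any directory.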
# 8 GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 deepspeed --module train --deepspeed ds_script/zero3_offload_mod.json --per_device_train_batch_size 12 --gradient_accumulation_steps 1 --dataloader_num_workers 4 --output-dir <output_dir> --dataset_type qa --model_type qwen --num_train_epochs 1 --total_aux_loss_param 1.0 --gamma 0.5 --celoss_weight pos:0.1 --normalize_text
# evaluation
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes=1 --nproc_per_node=8 evaluate_model.py --output-dir <output_dir> --dataset_type qa --model_type qwen --normalize_text
# evaluation for localization
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes=1 --nproc_per_node=8 evaluate_model.py --output-dir <output_dir> --dataset_type qa_ablation --model_type qwen --normalize_text
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes=1 --nproc_per_node=8 -m evaluate_localization --output-dir <output_dir> --dataset_type qa_ablation --model_type qwen --normalize_text
- For training LLaVA-7B on ChartQA
# 4 GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --module train --deepspeed ds_script/zero3_offload_mod.json --per_device_train_batch_size 24 --gradient_accumulation_steps 1 --dataloader_num_workers 4 --output-dir <output_dir> --dataset_type qa --model_type llava --num_train_epochs 1 --total_aux_loss_param 0.5 --gamma 2.0 --celoss_weight pos:0.1 --normalize_text
# evaluation
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 evaluate_model.py --output-dir <output_dir> --dataset_type qa --model_type llava --normalize_text
# evaluation for localization
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 evaluate_model.py --output-dir <output_dir> --dataset_type qa_ablation --model_type llava --normalize_text
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 -m evaluate_localization --output-dir <output_dir> --dataset_type qa_ablation --model_type llava --normalize_text
- Note: this is the initial public release, and the code has not been cleaned up yet.
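The two training commands above use different GPU counts and per-device batch sizes, but under data-parallel training they work out to the same global batch size (GPUs × per-device batch × gradient accumulation steps). The sketch below makes that arithmetic explicit and shows how gradient accumulation could compensate for fewer GPUs. The helper names are hypothetical, and whether the reported results hold under a different GPU count is an assumption this README does not confirm.

```python
def global_batch_size(num_gpus, per_device_batch, grad_accum_steps=1):
    """Samples consumed per optimizer step in data-parallel training."""
    return num_gpus * per_device_batch * grad_accum_steps


def grad_accum_for(target_global_batch, num_gpus, per_device_batch):
    """Gradient accumulation steps needed to reach a target global batch."""
    per_step = num_gpus * per_device_batch
    if target_global_batch % per_step != 0:
        raise ValueError(
            f"{target_global_batch} is not a multiple of {per_step}; "
            "adjust the per-device batch size"
        )
    return target_global_batch // per_step


# Values taken from the two training commands in this README.
qwen_batch = global_batch_size(num_gpus=8, per_device_batch=12)
llava_batch = global_batch_size(num_gpus=4, per_device_batch=24)
print(qwen_batch, llava_batch)

# Hypothetical example: matching the Qwen global batch on 2 GPUs.
print(grad_accum_for(qwen_batch, num_gpus=2, per_device_batch=12))
```

The resulting value would be passed through the existing `--gradient_accumulation_steps` flag, with `CUDA_VISIBLE_DEVICES` reduced to match the available GPUs.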
If you find our work useful for your research, please consider citing our paper.
@InProceedings{Ito_2026_CVPR,
author = {Ito, Koichiro},
title = {Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2026},
pages = {9501-9511}
}