jaxa/PGA-SFT

Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention (CVPR 2026)

Official PyTorch implementation of Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention, accepted at CVPR 2026.

[Paper]

Setup

  • Python environment setup
# We use CUDA 12.6, cuDNN 9.7.1, NCCL 2.24.3, OpenMPI 4.0.5, and pyenv
pyenv install 3.12.5
pyenv virtualenv 3.12.5 pga_sft-3.12.5
pyenv shell pga_sft-3.12.5
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
python -m pip install transformers==4.51.3
python -m pip install scipy shortuuid sentencepiece accelerate peft jupyter datasets ninja scikit-image wandb opencv-python timm einops einops-exts bitsandbytes markdown2 scikit-learn gradio gradio_client uvicorn fastapi wavedrom Pygments wheel tensorboard urllib3 pillow pycocotools matplotlib numpy imageio evaluate
python -m pip install sacrebleu editdistance pyaml_env python-dotenv spacy textacy seaborn pandas captum pycocoevalcap qwen-vl-utils decord nltk python-Levenshtein
python -m pip install mpi4py
  • Dataset and model setup
HDD=<your_data_directory>
HF_TOKEN=<your_huggingface_token_here>
huggingface-cli login --token ${HF_TOKEN}
huggingface-cli download ahmed-masry/ChartQA --repo-type dataset
huggingface-cli download liuhaotian/llava-v1.5-7b
huggingface-cli download openai/clip-vit-large-patch14-336
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct

cd "$HDD"
mkdir -p datasets && cd datasets
git clone https://github.com/vis-nlp/chartqa
mv chartqa ChartQA
cd ChartQA
mv 'ChartQA Dataset' ChartQA_Dataset
  • Our phrase-region aligned dataset is distributed as a zip archive. Download it from here, then:
    • Unzip dataset_pga_sft.zip.
    • Place GPT_chartQA_01 under $HDD/datasets as $HDD/datasets/GPT_chartQA_01.
    • Place gpt_data under $HDD as $HDD/gpt_data.
cd "$HDD"
# Download dataset_pga_sft.zip into $HDD before running the commands below
unzip dataset_pga_sft.zip
mv dataset_pga_sft/GPT_chartQA_01/ datasets/
mv dataset_pga_sft/gpt_data/ .
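
Before launching training, it can help to confirm that the directories named in the steps above ended up where expected. The following is a minimal sketch, not part of the repository: `check_layout` is a hypothetical helper, and it only checks that the three directories exist, not that their contents are complete.

```shell
# Hypothetical helper (not shipped with this repository).
# Checks that the directories created in the setup steps exist under a root.
check_layout() {
  local root="$1"
  local missing=0
  local dir
  for dir in \
    "datasets/ChartQA/ChartQA_Dataset" \
    "datasets/GPT_chartQA_01" \
    "gpt_data"; do
    if [ ! -d "$root/$dir" ]; then
      # Report every missing directory instead of stopping at the first one.
      echo "missing: $root/$dir"
      missing=1
    fi
  done
  # Exit status 0 when everything is in place, 1 otherwise.
  return "$missing"
}
```

For example, `check_layout "$HDD" && echo "layout OK"` prints `layout OK` only when all three directories are present.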

Training and Evaluation

  • Training and evaluating Qwen2.5-VL-7B on ChartQA
# 8 GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 deepspeed --module train --deepspeed ds_script/zero3_offload_mod.json --per_device_train_batch_size 12 --gradient_accumulation_steps 1 --dataloader_num_workers 4 --output-dir <output_dir> --dataset_type qa --model_type qwen --num_train_epochs 1 --total_aux_loss_param 1.0 --gamma 0.5 --celoss_weight pos:0.1 --normalize_text
# evaluation
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes=1 --nproc_per_node=8 evaluate_model.py --output-dir <output_dir> --dataset_type qa --model_type qwen --normalize_text

# evaluation for localization
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes=1 --nproc_per_node=8 evaluate_model.py --output-dir <output_dir> --dataset_type qa_ablation --model_type qwen --normalize_text
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes=1 --nproc_per_node=8 -m evaluate_localization --output-dir <output_dir> --dataset_type qa_ablation --model_type qwen --normalize_text
  • Training and evaluating LLaVA-v1.5-7B on ChartQA
# 4 GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --module train --deepspeed ds_script/zero3_offload_mod.json --per_device_train_batch_size 24 --gradient_accumulation_steps 1 --dataloader_num_workers 4 --output-dir <output_dir> --dataset_type qa --model_type llava --num_train_epochs 1 --total_aux_loss_param 0.5 --gamma 2.0 --celoss_weight pos:0.1 --normalize_text

# evaluation
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 evaluate_model.py --output-dir <output_dir> --dataset_type qa --model_type llava --normalize_text

# evaluation for localization
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 evaluate_model.py --output-dir <output_dir> --dataset_type qa_ablation --model_type llava --normalize_text
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 -m evaluate_localization --output-dir <output_dir> --dataset_type qa_ablation --model_type llava --normalize_text
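
When adapting these commands to a different number of GPUs, the global batch size is worth keeping in mind. Assuming plain data parallelism, it is the number of GPUs times `--per_device_train_batch_size` times `--gradient_accumulation_steps`, so both settings above come out to 96 (8 × 12 × 1 and 4 × 24 × 1). The sketch below does this arithmetic; `count_gpus` and `effective_batch_size` are hypothetical helpers, not part of the repository.

```shell
# Hypothetical helpers (not shipped with this repository).

# Count the comma-separated device IDs in a CUDA_VISIBLE_DEVICES-style string.
count_gpus() {
  printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# Global batch size = GPUs * per-device batch size * gradient accumulation steps.
effective_batch_size() {
  echo $(( $1 * $2 * $3 ))
}

# Qwen2.5-VL-7B setting: 8 GPUs, per-device batch 12, accumulation 1.
effective_batch_size "$(count_gpus "0,1,2,3,4,5,6,7")" 12 1   # prints 96

# LLaVA setting: 4 GPUs, per-device batch 24, accumulation 1.
effective_batch_size "$(count_gpus "0,1,2,3")" 24 1           # prints 96
```

The same count can be passed to `--nproc_per_node` so that it stays in sync with `CUDA_VISIBLE_DEVICES`. Whether the reported hyperparameters transfer to a different global batch size is not something this README states.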

Comment

  • This is the initial public release, so the code is still rough. We apologize for the mess.

Citation

If you find our work useful for your research, please consider citing our paper.

@InProceedings{Ito_2026_CVPR,
    author    = {Ito, Koichiro},
    title     = {Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {9501-9511}
}
