rabiulcste/vismin

VisMin: Visual Minimal-Change Understanding

Rabiul Awal ✨, Saba Ahmadi ✨, Le Zhang ✨, Aishwarya Agrawal
Mila - Quebec AI Institute, University of Montreal
✨ indicates equal contribution

🎉 Accepted to NeurIPS 2024! 🎉

Dataset

The training set has 64,392 samples and the VisMin benchmark has 2,084 samples. Both are stored in JSON format. Each entry contains an image path, a caption, and a list of negative examples; each negative example consists of an edited image path and an edited caption.

  • Training Data: 64,392 samples from VSR and COCO 2017 training split.
  • Benchmark Data: 2,084 samples from COCO 2017 validation split, human-verified.

🔥 Exciting News! 🔥 The VisMin benchmark dataset is now available 🎉 Check it out here 🤗

Example of a dataset entry in the training dataset:

{
  "image_path": "/coco/images/train2017/000000234136.jpg",
  "caption": "Two men holding a brown and white dog in a van.",
  "negatives": [
    {
      "edited_image_path": "/edited/coco/234136/0.png",
      "edited_caption": "Three men holding a brown and white dog in a van."
    }
  ]
}
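A minimal sketch of reading entries in this format, assuming the annotations are stored as a JSON list of such entries. The `iter_pairs` helper and the inline sample are illustrative only and are not part of the repository.

```python
import json

# Inline sample mirroring the entry format shown above. It is assumed here
# that the annotation file is a JSON list of such entries; the real file
# name and top-level layout may differ.
SAMPLE = """
[
  {
    "image_path": "/coco/images/train2017/000000234136.jpg",
    "caption": "Two men holding a brown and white dog in a van.",
    "negatives": [
      {
        "edited_image_path": "/edited/coco/234136/0.png",
        "edited_caption": "Three men holding a brown and white dog in a van."
      }
    ]
  }
]
"""


def iter_pairs(entries):
    """Yield (image, caption, edited_image, edited_caption) for every negative."""
    for entry in entries:
        for neg in entry.get("negatives", []):
            yield (
                entry["image_path"],
                entry["caption"],
                neg["edited_image_path"],
                neg["edited_caption"],
            )


entries = json.loads(SAMPLE)
pairs = list(iter_pairs(entries))
for image, caption, edited_image, edited_caption in pairs:
    print(f"{caption!r} -> {edited_caption!r}")
```

Flattening each entry into one tuple per negative keeps the original and its minimally edited counterpart together, which is convenient when building hard-negative batches.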

Training

To fine-tune models, such as pre-trained CLIP, using the hard-negative contrastive loss on the curated dataset, follow these steps:

  1. Clone the codebase of the CVPR 2024 paper: Enhance-FineGrained.
  2. Specify the training parameters in scripts/run_all.sh, such as --gres=gpu:a100:2 and batch_size. Refer to this script file for more details.
  3. To start the training, use the following commands:
cd scripts/
bash run_multiple_nodes.sh

The resulting checkpoints are saved in the Enhance-FineGrained/src/Outputs directory.
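For intuition about the hard-negative contrastive loss mentioned at the start of this section, here is a generic sketch for a single image. It is not the exact objective implemented in Enhance-FineGrained. The function name, the similarity values, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def hard_negative_loss(sim_pos, sim_hard_neg, sim_in_batch, temperature=0.07):
    """Cross-entropy of the matching caption against all candidate captions.

    sim_pos      -- similarity between the image and its own caption
    sim_hard_neg -- similarities between the image and its edited captions
    sim_in_batch -- similarities between the image and other batch captions
    """
    logits = [sim_pos] + list(sim_hard_neg) + list(sim_in_batch)
    scaled = [s / temperature for s in logits]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scaled)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
    # Negative log-probability assigned to the positive caption (index 0).
    return log_denominator - scaled[0]


# The same positive pair, once with an easy negative and once with a
# minimally edited caption that the model scores almost as high.
easy = hard_negative_loss(0.30, sim_hard_neg=[0.10], sim_in_batch=[0.05, 0.02])
hard = hard_negative_loss(0.30, sim_hard_neg=[0.29], sim_in_batch=[0.05, 0.02])
print(f"easy negative: {easy:.3f}, hard negative: {hard:.3f}")
```

A negative that the model scores close to the positive contributes a much larger loss, which is why minimally edited captions give a stronger training signal than random in-batch captions.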

Evaluation

Models such as CLIP or multimodal LLMs can be evaluated on our VisMin benchmark, which consists of image-text matching tasks. We also support evaluation on a pool of diagnostic datasets such as VALSE, Winoground, and ARO.

# To evaluate two-tower models such as CLIP
python -m evals.contrastive_inference --dataset <dataset_name> --model_name <path_to_model> --pretrained <pretrained_model_name>
# To evaluate generative models such as Idefics2 => https://huggingface.co/blog/idefics2
python -m evals.mllm_inference --dataset <dataset_name> --model_name <path_to_model>
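The README does not spell out how image-text matching is scored. A common convention for paired benchmarks, introduced with Winoground, is to compute text, image, and group scores from the 2x2 similarity matrix between an (image, edited image) pair and a (caption, edited caption) pair. The sketch below assumes that convention; the function names and similarity values are illustrative and are not taken from the `evals` package.

```python
def text_score(s):
    """Each image must prefer its own caption.

    s[i][j] is the similarity between image i and caption j, with index 0
    for the original and index 1 for the minimally edited version.
    """
    return s[0][0] > s[0][1] and s[1][1] > s[1][0]


def image_score(s):
    """Each caption must prefer its own image."""
    return s[0][0] > s[1][0] and s[1][1] > s[0][1]


def group_score(s):
    """Both directions must be correct for the same example."""
    return text_score(s) and image_score(s)


# Made-up similarities for one example.
sims = [
    [0.31, 0.27],  # original image vs. (original caption, edited caption)
    [0.30, 0.29],  # edited image   vs. (original caption, edited caption)
]
print(text_score(sims), image_score(sims), group_score(sims))
```

In this example the edited image is still matched to the original caption, so the text score fails even though the image score passes.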

Minimal-Change Image-Text Dataset Creation

LLM-guided Edit Instructions Generation

We use an LLM to generate edit instructions. There are two approaches: one works from captions and suggests object and attribute changes, following the style of in-context demonstrations; the other handles spatial and counting changes, where we prompt the LLM with in-context demonstrations to create the appropriate edit instructions with layouts.

Example of an LLM-generated edit instruction (object attribute category):

  {
      "InputCaption": "A glass of ice water sitting next to a wine glass.",
      "SelectedPhrase": "glass of ice water",
      "EditedPhrase": "glass of milk",
      "EditedRegionPhrase": "A glass of milk",
      "EditedCaption": "A glass of milk sitting next to a wine glass.",
      "Category": "object"
  }

Example of an LLM-generated edit instruction (spatial and counting category):

 "A paint brush is to the left of a palette.": [
      "[('a paint brush', [50, 200, 100, 312]), ('a palette', [362, 150, 150, 362])]\nBackground prompt: A realistic scene\nNegative prompt:\nCategory: relation(left of)"
  ]
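The layout instruction above is a single string: a Python-style list of (phrase, box) tuples on the first line, followed by `key: value` lines. A minimal sketch of parsing it is shown below. The `parse_layout_instruction` helper is hypothetical, and the meaning of the four box numbers is not stated in the README, so the code leaves them uninterpreted.

```python
import ast

# The instruction string from the example above, after JSON decoding
# (the "\n" escapes become real newlines).
raw = (
    "[('a paint brush', [50, 200, 100, 312]), ('a palette', [362, 150, 150, 362])]\n"
    "Background prompt: A realistic scene\n"
    "Negative prompt:\n"
    "Category: relation(left of)"
)


def parse_layout_instruction(text):
    """Split a layout instruction into its boxes and its key/value fields."""
    first, *rest = text.split("\n")
    # literal_eval safely parses the list of (phrase, [four numbers]) tuples.
    boxes = ast.literal_eval(first)
    fields = {}
    for line in rest:
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return boxes, fields


boxes, fields = parse_layout_instruction(raw)
for phrase, box in boxes:
    print(phrase, box)
print(fields["Category"])
```

Using `ast.literal_eval` rather than `eval` keeps the parser from executing arbitrary code if the LLM output is malformed.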

To run the script, execute the following from the directory containing ctrl_edit/:

# for object attribute category 
# requires dataset name to be specified for source of captions
python -m llm_agent.minchange_text_pairs_gen --dataset <name_of_dataset> --prompt_type edit_instructgen_from_caption --language_model_name <name_of_language_model>
# for spatial and counting category
python -m llm_agent.minchange_text_pairs_gen --prompt_type edit_instructgen_from_caption --language_model_name <name_of_language_model>

To generate a magic prompt (appended to, e.g., the object name) that gives the diffusion model better guidance than the input prompt alone:

# for object attribute category (e.g. coco dataset)
python -m llm_agent.magic_prompt --dataset coco --language_model_name <name_of_language_model>
# for spatial and counting category
python -m llm_agent.magic_prompt --dataset relation --language_model_name <name_of_language_model>

Diffusion-guided Image Synthesis

We have two approaches to generate minimal-change images:

  1. Masking and Inpainting: First, we mask the object to be edited in the source image using the Grounding-DINO model. Then, we use diffusion inpainting to generate minimal-change images.
  2. Layout Swapping: We use GLIGEN layout-diffusion to swap objects in the source image to generate edited images. For counting changes, we remove objects using their bounding boxes and create edited images.
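The layout manipulation behind the second approach can be sketched as follows. This only shows how a list of (phrase, box) pairs might be edited before it is passed to a layout-diffusion model; it does not call GLIGEN, and both helper functions are hypothetical.

```python
def swap_positions(layout, phrase_a, phrase_b):
    """Return a new layout in which two phrases exchange bounding boxes.

    Assumes every phrase in the layout is unique.
    """
    boxes = dict(layout)
    boxes[phrase_a], boxes[phrase_b] = boxes[phrase_b], boxes[phrase_a]
    return [(phrase, boxes[phrase]) for phrase, _ in layout]


def remove_object(layout, index):
    """Return a new layout without the object at the given position."""
    return [item for i, item in enumerate(layout) if i != index]


# Spatial edit: "left of" becomes "right of" once the boxes are swapped.
layout = [
    ("a paint brush", [50, 200, 100, 312]),
    ("a palette", [362, 150, 150, 362]),
]
swapped = swap_positions(layout, "a paint brush", "a palette")

# Counting edit: three dogs become two (boxes are made-up values).
dogs = [("a dog", [20, 200, 120, 150]), ("a dog", [190, 200, 120, 150]), ("a dog", [360, 200, 120, 150])]
two_dogs = remove_object(dogs, 2)
print(swapped)
print(len(two_dogs))
```

Both helpers return new lists, so the source layout stays available for pairing the original image with its edited counterpart.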

Run the following commands:

# for object attribute category (e.g. coco dataset)
# this script loads the segmentation model, the diffusion model, and the VQA model
python -m ctrl_edit.inpaint_with_mask --language_model_name <llm_used_to_generate_edit_instruction> --dataset <dataset_name> --output <path_to_edited_image>

# for spatial and counting category (generated from scratch using layout diffusion model)
# dataset name can be "relation" or "counting"
# this script loads the layout diffusion model and vqa model
python3 -m ctrl_edit.diffusion_llm_grounded_old --repeats 3 \
    --frozen_step_ratio 0.5 --no-scale-boxes-default \
    --sdxl --sdxl-step-ratio 0.4 \
    --dataset <dataset_name> \
    --split <split_name>

Edited Image Quality Verification using Local-Global VQA Approach

Edited images are verified with a VQA-based filter. First, generate the local-global VQA questions and answers with an LLM, following the edit instructions.

# To create the local-global VQA questions and answers using LLM-generated edit instructions from the previous steps:
python -m ctrl_edit.llm_agent.auto_filter_question_gen --language_model_name <name_of_language_model>

# Automatically filter out bad edited images using the local-global VQA approach:
python -m ctrl_edit.filters.tifa_filter --dataset <dataset_name>
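Conceptually, a question-answering filter of this kind keeps an edited image only if a VQA model answers every generated question as expected. The sketch below illustrates that idea under stated assumptions: the question schema (`question`, `expected`, `scope`), the `passes_filter` helper, and the stub VQA function are all hypothetical and do not reflect the actual `tifa_filter` implementation.

```python
def passes_filter(questions, answer_fn):
    """Return True if the VQA answers match the expected answers for all questions.

    questions -- list of dicts with 'question', 'expected', and 'scope' keys
                 (a hypothetical schema; 'scope' is 'local' or 'global')
    answer_fn -- callable mapping a question string to the VQA model's answer
    """
    return all(
        answer_fn(q["question"]).strip().lower() == q["expected"].strip().lower()
        for q in questions
    )


questions = [
    {"scope": "local", "question": "What is in the glass?", "expected": "milk"},
    {"scope": "global", "question": "Is there a wine glass?", "expected": "yes"},
]

# Stub standing in for a real VQA model, for illustration only.
good_answers = {"What is in the glass?": "Milk", "Is there a wine glass?": "yes"}
bad_answers = {"What is in the glass?": "water", "Is there a wine glass?": "yes"}

keep_good = passes_filter(questions, good_answers.get)
keep_bad = passes_filter(questions, bad_answers.get)
print(keep_good, keep_bad)
```

Requiring every question to pass is the strictest possible policy; a real filter might instead use a score threshold.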

Setup

git clone https://github.com/rabiulcste/vismin
cd vismin
pip install -r requirements.txt

Acknowledgements

The codebase is built on top of the following repositories: open_clip, Enhance-FineGrained, GLIGEN, and FastChat.

Citation

If you find this work useful for your research, please consider citing our paper:

@article{awal2024vismin,
  title={VisMin: Visual Minimal-Change Understanding},
  author={Awal, Rabiul and Ahmadi, Saba and Zhang, Le and Agrawal, Aishwarya},
  journal={arXiv preprint arXiv:2401.08832},
  year={2024}
}
