by An Vo1*, Khai-Nguyen Nguyen2*, Mohammad Reza Taesiri3, Vy Tuong Dang1, Anh Totti Nguyen4†, Daeyoung Kim1†
*Equal contribution
†Equal advising
1KAIST, 2College of William and Mary, 3University of Alberta, 4Auburn University
Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but may also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a fourth stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains from animals, logos, chess, board games, and optical illusions to patterned grids. Inserting text (e.g., "Adidas") describing the subject name into the counterfactual image further decreases VLM accuracy. The biases in VLMs are so strong that instructing them to double-check their results or rely exclusively on image details to answer improves counting accuracy by only +2 points, on average. Our work presents an interesting failure mode in VLMs and an automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io
Below are examples on which most of the tested models fail to answer correctly.
The VLMs are Biased benchmark is now officially supported by lmms-eval, one of the main open-source evaluation frameworks for VLMs! The community can now run the benchmark out-of-the-box across many VLMs.
To run our benchmark on lmms-eval, please follow these steps:
- Set up lmms-eval by following their installation guide
- Run the following command:
python -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
--tasks vlms_are_biased \
--batch_size 1 \
--device cuda:0
For more details, please visit their page: lmms_eval/tasks/vlms_are_biased
Note: lmms-eval currently only supports the main subset of VLMBias. To use other subsets, please refer to our Quick Start Guide below.
If you just want to use our dataset for evaluation or research:
📥 Download the complete dataset from Hugging Face with full images and prompts:
Please run the following script:
import datasets
dataset = datasets.load_dataset('anvo25/vlms-are-biased')
This will return a DatasetDict with this structure:
DatasetDict({
main: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 2784
})
identification: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 1392
})
withtitle: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 2784
})
original: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 458
})
remove_background_q1q2: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 2784
})
remove_background_q3: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 1392
})
})
- main: our counting dataset of counterfactual images used throughout the paper
- identification: our identification dataset of counterfactual images in Section 4.3
- withtitle: our counting dataset of counterfactual images with in-image title injection in Section A.9
- original: our identification dataset of original images in Section 4.1
- remove_background_q1q2: our counting dataset of counterfactual images with their background removed in Section 4.4
- remove_background_q3: our identification dataset of counterfactual images with their background removed in Section 4.4
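To quickly inspect a subset, you can index into the corresponding split. The snippet below is a minimal sketch: the field names come from the structure above, but query_vlm and the exact-match scoring are hypothetical placeholders for whatever model API and answer-matching protocol you use.

import datasets

dataset = datasets.load_dataset('anvo25/vlms-are-biased')

# Inspect one counterfactual counting example from the `main` subset
example = dataset['main'][0]
print(example['prompt'])        # question shown to the VLM
print(example['ground_truth'])  # expected answer
print(example['topic'], example['sub_topic'])

# Hypothetical evaluation loop: `query_vlm(image, prompt)` stands in for your
# model call; exact-match scoring is an assumption, not necessarily the
# protocol used in the paper.
def accuracy(split, query_vlm):
    correct = 0
    for ex in split:
        answer = query_vlm(ex['image'], ex['prompt'])
        correct += int(str(answer).strip().lower() == str(ex['ground_truth']).strip().lower())
    return correct / len(split)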
This is the fastest way to get started. Alternatively, you can directly download the parquet files:
- Go to our Hugging Face dataset
- Download ready-to-use images with corresponding prompts
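If you prefer working with the raw files on disk, one option (a sketch assuming the huggingface_hub package is installed) is to snapshot the dataset repository locally:

from huggingface_hub import snapshot_download

# Download all files of the dataset repo (parquet shards, images, prompts)
# into the local cache and return the local directory path.
local_dir = snapshot_download(repo_id='anvo25/vlms-are-biased', repo_type='dataset')
print(local_dir)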
If you want to reproduce our dataset generation process or create custom variations:
Please follow the installation and generation steps below to run the code locally.
git clone https://github.com/anvo25/vlms-are-biased.git
cd vlms-are-biased
pip install -r requirements.txt

To run the code to generate the counterfactual images for a specific task, go to the following embedded links:
- Chess Pieces: Chess pieces, Xiangqi pieces (modified starting positions)
- Game Boards: Chess board, Go board, Xiangqi board, Sudoku board (dimension variations)
- Optical Illusions: Ebbinghaus, Müller-Lyer, Ponzo, Vertical-Horizontal, Zöllner, Poggendorff
- Patterned Grids: Dice patterns, Tally mark patterns (anomalous cells)
- Animals: Mammals (4 legs → 5 legs) and birds (2 legs → 3 legs)
- Logos: two logo types, shoes and cars:
- Shoe Logos:
- Nike (1 swoosh → 2 swooshes)
- Adidas (3 stripes → 4 stripes)
- Car Logos:
- Maserati (3 prongs → 5 prongs)
- Mercedes-Benz (3-pointed star → 4-pointed star)
- Audi (4 overlapping circles → 5 overlapping circles)
- Flags: star flags (+1 and −1 star) and striped flags (+1 and −1 stripe)
All images are generated at 384px, 768px, and 1152px resolutions.
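In the Hugging Face release, the resolution is exposed through the pixel field, so you can keep a single size by filtering a split. This is a small sketch; the exact value format stored in pixel (e.g., 768 vs "768px") is an assumption on our part, so the comparison matches on the substring to be safe.

# Keep only the 768px images from the counting subset (value format of `pixel` is assumed)
main_768 = dataset['main'].filter(lambda ex: '768' in str(ex['pixel']))
print(len(main_768))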
Generate all available datasets:
python main.py --all
python add_titles.py --topic all

If you only want to generate data for a specific task, you can run the corresponding file instead. Here are some examples:
Generate specific optical illusions:
python main.py --optical_illusions --illusion_type Ebbinghaus

Generate the chess pieces dataset with modified starting positions:
# Step 1: Generate "notitle" images
python main.py --chess_pieces
# Step 2: Add titles to create "in_image_title" versions
python add_titles.py --topic chess_pieces

vlms-are-biased/
├── main.py # Generate "notitle" datasets
├── add_titles.py # Add "in_image_title" versions
├── generators/ # Individual dataset generators
│ ├── chess_pieces_generator.py
│ ├── optical_illusion_generator.py
│ └── ...
├── vlms-are-biased-notitle/ # Output: images without titles
└── vlms-are-biased-in_image_title/ # Output: images with titles
@misc{vlmsarebiased,
title={Vision Language Models are Biased},
author={An Vo and Khai-Nguyen Nguyen and Mohammad Reza Taesiri and Vy Tuong Dang and Anh Totti Nguyen and Daeyoung Kim},
year={2025},
eprint={2505.23941},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.23941},
}