by An Vo1*, Khai-Nguyen Nguyen2*, Mohammad Reza Taesiri3, Vy Tuong Dang1, Anh Totti Nguyen4†, Daeyoung Kim1†
*Equal contribution
†Equal advising
1KAIST, 2College of William and Mary, 3University of Alberta, 4Auburn University
Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but may also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a fourth stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains from animals, logos, chess, board games, and optical illusions to patterned grids. Inserting text (e.g., "Adidas") describing the subject name into the counterfactual image further decreases VLM accuracy. The biases in VLMs are so strong that instructing them to double-check their results or rely exclusively on image details to answer improves counting accuracy by only +2 points, on average. Our work presents an interesting failure mode in VLMs and an automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io
Below are examples on which most of the tested models fail to answer correctly.
The VLMs are Biased benchmark is now officially supported by lmms-eval, one of the main open-source evaluation frameworks for VLMs! The community can now run the benchmark out-of-the-box across many VLMs.
To run our benchmark on lmms-eval, please follow these steps:
- Set up lmms-eval by following their installation guide
- Run the following command:
python -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
--tasks vlms_are_biased \
--batch_size 1 \
--device cuda:0
For more details, please visit their page: lmms_eval/tasks/vlms_are_biased
Note: lmms-eval currently only supports the main subset of VLMBias. To use other subsets, please refer to our Quick Start Guide below.
If you just want to use our dataset for evaluation or research:
📥 Download the complete dataset from Hugging Face with full images and prompts:
Please run the following script:
import datasets
dataset = datasets.load_dataset('anvo25/vlms-are-biased')
This will return a DatasetDict with this structure:
DatasetDict({
main: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 2784
})
identification: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 1392
})
withtitle: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 2784
})
original: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 458
})
remove_background_q1q2: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 2784
})
remove_background_q3: Dataset({
features: ['image', 'ID', 'image_path', 'topic', 'sub_topic', 'prompt', 'ground_truth', 'expected_bias', 'with_title', 'type_of_question', 'pixel', 'metadata'],
num_rows: 1392
})
})
- main: our counting dataset of counterfactual images used throughout the paper
- identification: our identification dataset of counterfactual images in Section 4.3
- withtitle: our counting dataset of counterfactual images with in-image title injection in Section A.9
- original: our identification dataset of original images in Section 4.1
- remove_background_q1q2: our counting dataset of counterfactual images with their background removed in Section 4.4
- remove_background_q3: our identification dataset of counterfactual images with their background removed in Section 4.4
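To quickly inspect a subset, you can index into the corresponding split. The snippet below is a minimal sketch: the field names come from the structure above, but query_vlm and the exact-match scoring are hypothetical placeholders for whatever model API and answer-matching protocol you use.

import datasets

dataset = datasets.load_dataset('anvo25/vlms-are-biased')

# Inspect one counterfactual counting example from the `main` subset
example = dataset['main'][0]
print(example['prompt'])        # question shown to the VLM
print(example['ground_truth'])  # expected answer
print(example['topic'], example['sub_topic'])

# Hypothetical evaluation loop: `query_vlm(image, prompt)` stands in for your
# model call; exact-match scoring is an assumption, not necessarily the
# protocol used in the paper.
def accuracy(split, query_vlm):
    correct = 0
    for ex in split:
        answer = query_vlm(ex['image'], ex['prompt'])
        correct += int(str(answer).strip().lower() == str(ex['ground_truth']).strip().lower())
    return correct / len(split)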
This is the fastest way to get started. Alternatively, you can directly download the parquet files:
- Go to our Hugging Face dataset
- Download ready-to-use images with corresponding prompts
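If you prefer working with the raw files on disk, one option (a sketch assuming the huggingface_hub package is installed) is to snapshot the dataset repository locally:

from huggingface_hub import snapshot_download

# Download all files of the dataset repo (parquet shards, images, prompts)
# into the local cache and return the local directory path.
local_dir = snapshot_download(repo_id='anvo25/vlms-are-biased', repo_type='dataset')
print(local_dir)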
If you want to reproduce our dataset generation process or create custom variations:
Please follow the installation and generation steps below to run the code locally.
git clone https://github.com/anvo25/vlms-are-biased.git
cd vlms-are-biased
pip install -r requirements.txt

To run the code to generate the counterfactual images for a specific task, go to the following embedded links:
- Chess Pieces: Chess pieces, Xiangqi pieces (modified starting positions)
- Game Boards: Chess board, Go board, Xiangqi board, Sudoku board (dimension variations)
- Optical Illusions: Ebbinghaus, Müller-Lyer, Ponzo, Vertical-Horizontal, Zöllner, Poggendorff
- Patterned Grids: Dice patterns, Tally mark patterns (anomalous cells)
- Animals: Mammals (4 legs → 5 legs) and birds (2 legs → 3 legs)
- Logos: two logo types, shoes and cars:
- Shoe Logos:
- Nike (1 swoosh → 2 swooshes)
- Adidas (3 stripes → 4 stripes)
- Car Logos:
- Maserati (3 prongs → 5 prongs)
- Mercedes-Benz (3-pointed star → 4-pointed star)
- Audi (4 overlapping circles → 5 overlapping circles)
- Flags: star flags (+1 and −1 star) and striped flags (+1 and −1 stripe)
All images are generated at 384px, 768px, and 1152px resolutions.
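In the Hugging Face release, the resolution is exposed through the pixel field, so you can keep a single size by filtering a split. This is a small sketch; the exact value format stored in pixel (e.g., 768 vs "768px") is an assumption on our part, so the comparison matches on the substring to be safe.

# Keep only the 768px images from the counting subset (value format of `pixel` is assumed)
main_768 = dataset['main'].filter(lambda ex: '768' in str(ex['pixel']))
print(len(main_768))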
Generate all available datasets:
python main.py --all
python add_titles.py --topic all

If you only want to generate data for a specific task, you can run the corresponding file instead. Here are some examples:
Generate specific optical illusions:
python main.py --optical_illusions --illusion_type Ebbinghaus

Generate the chess pieces dataset with modified starting positions:
# Step 1: Generate "notitle" images
python main.py --chess_pieces
# Step 2: Add titles to create "in_image_title" versions
python add_titles.py --topic chess_pieces

vlms-are-biased/
├── main.py # Generate "notitle" datasets
├── add_titles.py # Add "in_image_title" versions
├── generators/ # Individual dataset generators
│ ├── chess_pieces_generator.py
│ ├── optical_illusion_generator.py
│ └── ...
├── vlms-are-biased-notitle/ # Output: images without titles
└── vlms-are-biased-in_image_title/ # Output: images with titles
@misc{vlmsarebiased,
title={Vision Language Models are Biased},
author={An Vo and Khai-Nguyen Nguyen and Mohammad Reza Taesiri and Vy Tuong Dang and Anh Totti Nguyen and Daeyoung Kim},
year={2025},
eprint={2505.23941},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.23941},
}