Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
FOCUS is a benchmark and framework for testing the robustness of evaluator Vision-Language Models (VLMs) across diverse tasks and evaluation strategies. The framework covers two task families:
- I2T (Image-to-Text): Evaluating VLM answers to visual questions.
- T2I (Text-to-Image): Evaluating images generated from text prompts.
The core idea: generate high-quality gold responses/images, then introduce carefully crafted adversarial perturbations across a taxonomy of error types. LLM-as-a-judge evaluators are then tested on whether they can detect these perturbations.
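To make the setup concrete, here is a minimal sketch of what a benchmark instance and the headline detection metric could look like. This is an illustrative schema, not the repository's actual types: the field names, `FocusInstance`, and `detection_rate` are assumptions for exposition.

```python
from dataclasses import dataclass

# Hypothetical instance schema: each instance pairs a gold response with an
# adversarially perturbed variant and records the targeted error type.
@dataclass
class FocusInstance:
    prompt: str       # visual question (I2T) or image-generation prompt (T2I)
    gold: str         # gold answer text, or path to the gold image
    perturbed: str    # perturbed answer text, or path to the perturbed image
    category: str     # e.g. "Reasoning"
    subcategory: str  # e.g. "Numerical errors"

def detection_rate(judgements: list[bool]) -> float:
    """Fraction of instances where the evaluator detected the perturbation
    (e.g. preferred gold over perturbed, or flagged the injected error)."""
    return sum(judgements) / len(judgements) if judgements else 0.0
```

A robust evaluator should score close to 1.0 on `detection_rate`; blind spots show up as low rates on specific categories or subcategories.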
```
                 Benchmark Instances
                          │
                          ▼
               ┌─────────────────────┐
               │ 1. IPTonate App     │ ← Human annotation & instance selection
               └─────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────┐
│ 2. Perturbation Generation                           │
│                                                      │
│   I2T: gold answers ──► perturbed text answers       │
│   T2I: gold images  ──► perturbed images             │
└──────────────────────────────────────────────────────┘
                          │
                          ▼
              ┌─────────────────────────┐
              │ 3. PerturbVal App       │ ← Human validation of perturbations
              └─────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────┐
│ 4. Evaluator Benchmarking                            │
│                                                      │
│  Single-answer   │ Comparison    │ Reference-based   │
│  Vanilla CoT     │ Vanilla CoT   │ Score vs. ref     │
│  Rubrics         │ Rules-based   │                   │
│  Multi-Axes      │ Multi-Axes    │                   │
└──────────────────────────────────────────────────────┘
```
```
focus/
├── app/                                          # Streamlit annotation tools
│   ├── iptonate_benchmark_selection_app.py       # Instance selection & annotation
│   └── perturbval_pertubation_validation_app.py  # Perturbation validation
│
├── i2t/                                          # Image-to-Text pipeline
│   ├── perturbations/                            # Generate adversarial text perturbations
│   └── evaluators/                               # LLM-as-a-judge evaluation harness
│
└── t2i/                                          # Text-to-Image pipeline
    ├── perturbations/                            # Generate adversarial image perturbations
    └── evaluators/                               # LLM-as-a-judge evaluation harness
```
```sh
pip install google-genai openai anthropic Pillow streamlit markdown aiohttp requests
```

Set API keys in your environment (or in a `.env` file):

```sh
export GEMINI_API_KEY="..."
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."

# For Vertex AI (optional)
export GOOGLE_CLOUD_PROJECT="..."
export GOOGLE_CLOUD_LOCATION="global"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service_account.json"
export GEMINI_BUCKET_NAME="..."
```

Annotation Apps (app/)
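If you keep the keys in a `.env` file, a small loader can push them into the process environment. This is a stdlib sketch standing in for the usual `python-dotenv` package; the function name and parsing rules here are illustrative, not part of the repo:

```python
import os

def load_dotenv(path: str = ".env") -> dict[str, str]:
    """Minimal .env parser: KEY=VALUE lines, '#' comments, and an optional
    'export ' prefix. Quotes are stripped naively; prefer python-dotenv
    in real use. Loaded pairs are merged into os.environ."""
    loaded: dict[str, str] = {}
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                if line.startswith("export "):
                    line = line[len("export "):]
                key, _, value = line.partition("=")
                loaded[key.strip()] = value.strip().strip('"').strip("'")
    except FileNotFoundError:
        pass  # absent .env is not an error; env vars may be set directly
    os.environ.update(loaded)
    return loaded
```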
Two Streamlit UIs for human-in-the-loop annotation:
| App | Script | Purpose |
|---|---|---|
| IPTonate | `iptonate_benchmark_selection_app.py` | Select and annotate benchmark instances; assign feasibility labels (Yes/No/Maybe) and difficulty |
| PerturbVal | `perturbval_pertubation_validation_app.py` | Validate generated perturbations; label as Valid / Score-Invariant / Incorrect / Not Sure / Not Relevant |
See app/README.md for setup and usage.
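As an illustration of how PerturbVal labels might gate benchmark inclusion, here is a simple majority-vote acceptance rule. The rule itself is an assumption for exposition; only the label set comes from the app:

```python
from collections import Counter

# Label vocabulary from the PerturbVal app.
PERTURBVAL_LABELS = {"Valid", "Score-Invariant", "Incorrect", "Not Sure", "Not Relevant"}

def keep_perturbation(labels: list[str]) -> bool:
    """Hypothetical acceptance rule: keep a perturbation only if a strict
    majority of annotators marked it Valid."""
    if any(label not in PERTURBVAL_LABELS for label in labels):
        raise ValueError("unknown PerturbVal label")
    return Counter(labels)["Valid"] > len(labels) / 2
```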
I2T Perturbation Generation (i2t/perturbations/)
Generates adversarial text perturbations of VLM answers across four evaluation categories and 21 subcategories:
| Category | Example Perturbation Types |
|---|---|
| General Perception | Entity mislabelling, attribute substitution, spatial relation perturbation |
| Semantic Understanding | Contextual nuance ignoring, cultural context substitution |
| Reasoning | Numerical errors, sequence misordering, misattributed relations |
| Creative Generation | Incoherent details, thematic drift, tone mismatch |
See i2t/perturbations/README.md for the full pipeline.
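One way to drive such a pipeline is to turn a (category, subcategory) pair into an instruction for a perturbation-generating LLM. The prompt template and function below are hypothetical; the repo's actual prompts live in `i2t/perturbations/`, and the taxonomy dictionary lists only the example subcategories from the table above:

```python
# Partial I2T taxonomy (example subcategories only; the full taxonomy has 21).
I2T_TAXONOMY = {
    "General Perception": ["Entity mislabelling", "Attribute substitution",
                           "Spatial relation perturbation"],
    "Semantic Understanding": ["Contextual nuance ignoring",
                               "Cultural context substitution"],
    "Reasoning": ["Numerical errors", "Sequence misordering",
                  "Misattributed relations"],
    "Creative Generation": ["Incoherent details", "Thematic drift",
                            "Tone mismatch"],
}

def perturbation_prompt(gold_answer: str, category: str, subcategory: str) -> str:
    """Build an instruction asking an LLM to inject exactly one targeted error."""
    if subcategory not in I2T_TAXONOMY.get(category, []):
        raise ValueError(f"unknown perturbation: {category}/{subcategory}")
    return (
        f"Rewrite the answer below, introducing exactly one "
        f"'{subcategory}' error ({category}). Keep everything else unchanged.\n\n"
        f"Answer: {gold_answer}"
    )
```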
T2I Perturbation Generation (t2i/perturbations/)
Generates adversarial image perturbations using a two-model architecture (Gemini image generation + edit instruction models) across four categories and 21 subcategories:
| Category | Example Perturbation Types |
|---|---|
| Basic Skill | Object substitution, element omission, attribute manipulation |
| Scene Context & Style | Style inconsistency, environmental conflict, overcrowding |
| Reasoning | Physics manipulation, logical contradiction, functional absurdity |
| Text Rendering | Typographical substitution, incomplete rendering, mislabelled symbols |
See t2i/perturbations/README.md for the full pipeline.
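The two-model flow can be sketched as an edit-instruction model feeding an image model. Both callables below are stand-in stubs (the real pipeline calls Gemini models); only the orchestration shape is the point:

```python
# Stub: in the real pipeline an LLM turns a prompt + subcategory into a
# concrete edit instruction (e.g. "replace the bicycle with a unicycle").
def edit_instruction_model(prompt: str, subcategory: str) -> str:
    return f"Apply '{subcategory}' to an image of: {prompt}"

# Stub: a real image-editing model returns image bytes for the instruction.
def image_model(instruction: str) -> bytes:
    return instruction.encode()

def perturb_image(prompt: str, subcategory: str) -> bytes:
    """Two-model flow: derive an edit instruction, then render/edit the image."""
    instruction = edit_instruction_model(prompt, subcategory)
    return image_model(instruction)
```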
I2T Evaluators (i2t/evaluators/)
LLM-as-a-judge harness for image-to-text evaluation. Supports batch API and parallel execution across OpenAI, Google Gemini, Vertex AI, and Anthropic Claude.
Evaluator types: Vanilla CoT · Rubrics · Multi-Axes · Comparison · Reference-Based
See i2t/evaluators/README.md for full documentation.
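The parallel-execution part of such a harness reduces to fanning judge calls out over a thread pool. A minimal sketch, where `judge_fn` stands in for one provider call (OpenAI, Gemini, Vertex AI, or Claude) and the function name is an assumption, not the harness's API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_evaluator(judge_fn, instances, max_workers: int = 8) -> list:
    """Run judge_fn over all instances concurrently, preserving input order.
    The real harness additionally supports provider batch APIs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_fn, instances))
```

Threads suit this workload because each judgement is I/O-bound (an API round trip), so the GIL is not a bottleneck.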
T2I Evaluators (t2i/evaluators/)
LLM-as-a-judge harness for text-to-image evaluation, using the same evaluator architecture as I2T, adapted for image quality assessment.
See t2i/evaluators/README.md for full documentation.
If you use this work, please cite:
```bibtex
@article{khan2026seeing,
  title   = {Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models},
  author  = {Mohammed Safi Ur Rahman Khan and Sanjay Suryanarayanan and Tushar Anand and Mitesh M. Khapra},
  year    = {2026},
  journal = {arXiv preprint arXiv:2604.21523}
}
```