Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
FOCUS is a benchmark and framework for testing the robustness of evaluator Vision-Language Models (VLMs) across diverse tasks and evaluation strategies. The framework covers two task families:
- I2T (Image-to-Text): Evaluating VLM answers to visual questions.
- T2I (Text-to-Image): Evaluating images generated from text prompts.
The core idea: generate high-quality gold responses/images, then introduce carefully crafted adversarial perturbations across a taxonomy of error types. LLM-as-a-judge evaluators are then tested on whether they can detect these perturbations.
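To make the setup concrete, here is a minimal sketch of what a benchmark instance and the headline detection metric could look like. This is an illustrative schema, not the repository's actual types: the field names, `FocusInstance`, and `detection_rate` are assumptions for exposition.

```python
from dataclasses import dataclass

# Hypothetical instance schema: each instance pairs a gold response with an
# adversarially perturbed variant and records the targeted error type.
@dataclass
class FocusInstance:
    prompt: str       # visual question (I2T) or image-generation prompt (T2I)
    gold: str         # gold answer text, or path to the gold image
    perturbed: str    # perturbed answer text, or path to the perturbed image
    category: str     # e.g. "Reasoning"
    subcategory: str  # e.g. "Numerical errors"

def detection_rate(judgements: list[bool]) -> float:
    """Fraction of instances where the evaluator detected the perturbation
    (e.g. preferred gold over perturbed, or flagged the injected error)."""
    return sum(judgements) / len(judgements) if judgements else 0.0
```

A robust evaluator should score close to 1.0 on `detection_rate`; blind spots show up as low rates on specific categories or subcategories.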
```
                 Benchmark Instances
                          │
                          ▼
               ┌─────────────────────┐
               │ 1. IPTonate App     │ ← Human annotation & instance selection
               └─────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────┐
│ 2. Perturbation Generation                           │
│                                                      │
│   I2T: gold answers ──► perturbed text answers       │
│   T2I: gold images  ──► perturbed images             │
└──────────────────────────────────────────────────────┘
                          │
                          ▼
              ┌─────────────────────────┐
              │ 3. PerturbVal App       │ ← Human validation of perturbations
              └─────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────┐
│ 4. Evaluator Benchmarking                            │
│                                                      │
│  Single-answer   │ Comparison    │ Reference-based   │
│  Vanilla CoT     │ Vanilla CoT   │ Score vs. ref     │
│  Rubrics         │ Rules-based   │                   │
│  Multi-Axes      │ Multi-Axes    │                   │
└──────────────────────────────────────────────────────┘
```
```
focus/
├── app/                                          # Streamlit annotation tools
│   ├── iptonate_benchmark_selection_app.py       # Instance selection & annotation
│   └── perturbval_pertubation_validation_app.py  # Perturbation validation
│
├── i2t/                                          # Image-to-Text pipeline
│   ├── perturbations/                            # Generate adversarial text perturbations
│   └── evaluators/                               # LLM-as-a-judge evaluation harness
│
└── t2i/                                          # Text-to-Image pipeline
    ├── perturbations/                            # Generate adversarial image perturbations
    └── evaluators/                               # LLM-as-a-judge evaluation harness
```
```sh
pip install google-genai openai anthropic Pillow streamlit markdown aiohttp requests
```

Set API keys in your environment (or in a `.env` file):

```sh
export GEMINI_API_KEY="..."
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."

# For Vertex AI (optional)
export GOOGLE_CLOUD_PROJECT="..."
export GOOGLE_CLOUD_LOCATION="global"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service_account.json"
export GEMINI_BUCKET_NAME="..."
```

Annotation Apps (app/)
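If you keep the keys in a `.env` file, a small loader can push them into the process environment. This is a stdlib sketch standing in for the usual `python-dotenv` package; the function name and parsing rules here are illustrative, not part of the repo:

```python
import os

def load_dotenv(path: str = ".env") -> dict[str, str]:
    """Minimal .env parser: KEY=VALUE lines, '#' comments, and an optional
    'export ' prefix. Quotes are stripped naively; prefer python-dotenv
    in real use. Loaded pairs are merged into os.environ."""
    loaded: dict[str, str] = {}
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                if line.startswith("export "):
                    line = line[len("export "):]
                key, _, value = line.partition("=")
                loaded[key.strip()] = value.strip().strip('"').strip("'")
    except FileNotFoundError:
        pass  # absent .env is not an error; env vars may be set directly
    os.environ.update(loaded)
    return loaded
```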
Two Streamlit UIs for human-in-the-loop annotation:
| App | Script | Purpose |
|---|---|---|
| IPTonate | `iptonate_benchmark_selection_app.py` | Select and annotate benchmark instances; assign feasibility labels (Yes/No/Maybe) and difficulty |
| PerturbVal | `perturbval_pertubation_validation_app.py` | Validate generated perturbations; label as Valid / Score-Invariant / Incorrect / Not Sure / Not Relevant |
See app/README.md for setup and usage.
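As an illustration of how PerturbVal labels might gate benchmark inclusion, here is a simple majority-vote acceptance rule. The rule itself is an assumption for exposition; only the label set comes from the app:

```python
from collections import Counter

# Label vocabulary from the PerturbVal app.
PERTURBVAL_LABELS = {"Valid", "Score-Invariant", "Incorrect", "Not Sure", "Not Relevant"}

def keep_perturbation(labels: list[str]) -> bool:
    """Hypothetical acceptance rule: keep a perturbation only if a strict
    majority of annotators marked it Valid."""
    if any(label not in PERTURBVAL_LABELS for label in labels):
        raise ValueError("unknown PerturbVal label")
    return Counter(labels)["Valid"] > len(labels) / 2
```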
I2T Perturbation Generation (i2t/perturbations/)
Generates adversarial text perturbations of VLM answers across four evaluation categories and 21 subcategories:
| Category | Example Perturbation Types |
|---|---|
| General Perception | Entity mislabelling, attribute substitution, spatial relation perturbation |
| Semantic Understanding | Contextual nuance ignoring, cultural context substitution |
| Reasoning | Numerical errors, sequence misordering, misattributed relations |
| Creative Generation | Incoherent details, thematic drift, tone mismatch |
See i2t/perturbations/README.md for the full pipeline.
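One way to drive such a pipeline is to turn a (category, subcategory) pair into an instruction for a perturbation-generating LLM. The prompt template and function below are hypothetical; the repo's actual prompts live in `i2t/perturbations/`, and the taxonomy dictionary lists only the example subcategories from the table above:

```python
# Partial I2T taxonomy (example subcategories only; the full taxonomy has 21).
I2T_TAXONOMY = {
    "General Perception": ["Entity mislabelling", "Attribute substitution",
                           "Spatial relation perturbation"],
    "Semantic Understanding": ["Contextual nuance ignoring",
                               "Cultural context substitution"],
    "Reasoning": ["Numerical errors", "Sequence misordering",
                  "Misattributed relations"],
    "Creative Generation": ["Incoherent details", "Thematic drift",
                            "Tone mismatch"],
}

def perturbation_prompt(gold_answer: str, category: str, subcategory: str) -> str:
    """Build an instruction asking an LLM to inject exactly one targeted error."""
    if subcategory not in I2T_TAXONOMY.get(category, []):
        raise ValueError(f"unknown perturbation: {category}/{subcategory}")
    return (
        f"Rewrite the answer below, introducing exactly one "
        f"'{subcategory}' error ({category}). Keep everything else unchanged.\n\n"
        f"Answer: {gold_answer}"
    )
```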
T2I Perturbation Generation (t2i/perturbations/)
Generates adversarial image perturbations using a two-model architecture (Gemini image generation + edit instruction models) across four categories and 21 subcategories:
| Category | Example Perturbation Types |
|---|---|
| Basic Skill | Object substitution, element omission, attribute manipulation |
| Scene Context & Style | Style inconsistency, environmental conflict, overcrowding |
| Reasoning | Physics manipulation, logical contradiction, functional absurdity |
| Text Rendering | Typographical substitution, incomplete rendering, mislabelled symbols |
See t2i/perturbations/README.md for the full pipeline.
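The two-model flow can be sketched as an edit-instruction model feeding an image model. Both callables below are stand-in stubs (the real pipeline calls Gemini models); only the orchestration shape is the point:

```python
# Stub: in the real pipeline an LLM turns a prompt + subcategory into a
# concrete edit instruction (e.g. "replace the bicycle with a unicycle").
def edit_instruction_model(prompt: str, subcategory: str) -> str:
    return f"Apply '{subcategory}' to an image of: {prompt}"

# Stub: a real image-editing model returns image bytes for the instruction.
def image_model(instruction: str) -> bytes:
    return instruction.encode()

def perturb_image(prompt: str, subcategory: str) -> bytes:
    """Two-model flow: derive an edit instruction, then render/edit the image."""
    instruction = edit_instruction_model(prompt, subcategory)
    return image_model(instruction)
```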
I2T Evaluators (i2t/evaluators/)
LLM-as-a-judge harness for image-to-text evaluation. Supports batch API and parallel execution across OpenAI, Google Gemini, Vertex AI, and Anthropic Claude.
Evaluator types: Vanilla CoT · Rubrics · Multi-Axes · Comparison · Reference-Based
See i2t/evaluators/README.md for full documentation.
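The parallel-execution part of such a harness reduces to fanning judge calls out over a thread pool. A minimal sketch, where `judge_fn` stands in for one provider call (OpenAI, Gemini, Vertex AI, or Claude) and the function name is an assumption, not the harness's API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_evaluator(judge_fn, instances, max_workers: int = 8) -> list:
    """Run judge_fn over all instances concurrently, preserving input order.
    The real harness additionally supports provider batch APIs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_fn, instances))
```

Threads suit this workload because each judgement is I/O-bound (an API round trip), so the GIL is not a bottleneck.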
T2I Evaluators (t2i/evaluators/)
LLM-as-a-judge harness for text-to-image evaluation, using the same evaluator architecture as I2T, adapted for image quality assessment.
See t2i/evaluators/README.md for full documentation.
If you use this work, please cite:
```bibtex
@article{khan2026seeing,
  title   = {Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models},
  author  = {Mohammed Safi Ur Rahman Khan and Sanjay Suryanarayanan and Tushar Anand and Mitesh M. Khapra},
  year    = {2026},
  journal = {arXiv preprint arXiv:2604.21523}
}
```