PSF-Med: Paraphrase Sensitivity Framework for Medical VLMs

Code and data for the paper: "PSF-Med: Measuring Paraphrase Sensitivity Failures in Medical Vision-Language Models"

Overview

PSF-Med is a benchmark for evaluating whether medical Vision-Language Models give consistent answers when clinical questions are rephrased. It includes:

26,850 clinical questions across 3 chest X-ray datasets (MIMIC-CXR, PadChest, VinDr-CXR)
92,234 semantically equivalent paraphrases validated by LLM judge
Evaluation results for 11 models, including MedGemma, LLaVA-RAD, CheXOne, CheXagent, RadFM, GPT-5-mini, and Claude Haiku
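
As a rough illustration of what "consistent answers" means here, the sketch below counts a question as flipped when any paraphrase yields a different normalized answer than the original wording. The function names and the any-paraphrase flip definition are assumptions made for illustration; the paper and the evaluation scripts define the actual metric.

```python
def normalize(answer: str) -> str:
    """Strip whitespace and trailing punctuation, then lowercase ('Yes.' -> 'yes')."""
    return answer.strip().strip(".!?").lower()


def is_flipped(original_answer: str, paraphrase_answers: list[str]) -> bool:
    """Assumed definition: a question flips if any paraphrase answer differs."""
    base = normalize(original_answer)
    return any(normalize(a) != base for a in paraphrase_answers)


def flip_rate(records: list[tuple[str, list[str]]]) -> float:
    """Fraction of questions whose answer changes under at least one paraphrase."""
    if not records:
        return 0.0
    flipped = sum(is_flipped(orig, paras) for orig, paras in records)
    return flipped / len(records)


# Each record: (answer to the original question, answers to its paraphrases).
records = [
    ("Yes", ["yes", "Yes."]),  # consistent across paraphrases
    ("No", ["no", "Yes"]),     # one paraphrase flips the answer
]
print(flip_rate(records))  # 0.5
```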

Figure 1: The same chest X-ray yields opposite answers from MedGemma 1.5 4B IT when the question is rephrased.

Dataset

The judge-filtered dataset is available on HuggingFace: saillab/psf-med

from datasets import load_dataset

mimic = load_dataset("saillab/psf-med", "mimic")["test"]
padchest = load_dataset("saillab/psf-med", "padchest")["test"]
vindr = load_dataset("saillab/psf-med", "vindr")["test"]

Images must be obtained separately from PhysioNet (MIMIC-CXR, VinDr-CXR) and BIMCV (PadChest).
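
Because the images are not bundled with the dataset, it can help to check that local copies exist before running an evaluation. The sketch below is one way to do that. The directory paths are hypothetical placeholders, and the dictionary keys simply reuse the three dataset configuration names shown above.

```python
from pathlib import Path

# Hypothetical local download locations; adjust to your own layout.
IMAGE_ROOTS = {
    "mimic": Path("/data/mimic-cxr"),
    "padchest": Path("/data/padchest"),
    "vindr": Path("/data/vindr-cxr"),
}


def missing_image_roots(roots: dict[str, Path]) -> list[str]:
    """Return the dataset names whose image directory does not exist."""
    return sorted(name for name, root in roots.items() if not root.is_dir())


missing = missing_image_roots(IMAGE_ROOTS)
if missing:
    print("Image directories not found for:", ", ".join(missing))
```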

Repository Structure

psf_med/
├── evaluation/              # Model evaluation scripts
│   ├── evaluate_psf.py      # Main evaluation (MedGemma, LLaVA-RAD, CheXagent)
│   ├── evaluate_padchest_v2_filtered.py  # All models on judge-filtered data
│   ├── evaluate_psf_chexone.py           # CheXOne evaluation
│   ├── evaluate_psf_gpt5mini_batch.py    # GPT-5-mini via OpenAI Batch API
│   └── evaluate_psf_claude_batch.py      # Claude Haiku via Anthropic Batch API
├── paraphrase/              # Paraphrase generation and quality filtering
│   ├── batch_paraphrase_generator.py     # Generate paraphrases via GPT-4
│   ├── filter_paraphrases.py             # Bio_ClinicalBERT similarity filter
│   ├── llm_judge_equivalence.py          # GPT-5-mini LLM judge (bidirectional entailment)
│   └── regenerate_padchest.py            # PadChest v2 paraphrase regeneration
├── models/                  # Model loading utilities
│   ├── model_loader.py      # MedGemma + LoRA loader
│   ├── llava_rad_adapter.py # LLaVA-RAD adapter (margin + generation)
│   ├── constants.py         # Paths, token IDs, model specs
│   ├── data_loaders.py      # Dataset loading helpers
│   └── uncertainty_methods.py  # forward_single (logit extraction)
├── release/                 # Dataset release and upload
│   ├── build_judge_filtered_dataset.py   # Merge judge verdicts into dataset
│   ├── upload_to_huggingface_v3.py       # Push to HuggingFace
│   └── upload_to_postgres.py             # Push to PostgreSQL
├── dataset/                 # Dataset definitions (questions + paraphrases)
│   ├── mimic/               # MIMIC-CXR questions and splits
│   ├── padchest/            # PadChest questions and flip bank
│   ├── vindr/               # VinDr-CXR questions and flip bank
│   └── README.md            # Image access instructions
├── data/                    # Judge-filtered dataset + benchmark results
│   ├── psf_med_judge_filtered/  # Pre-built filtered JSONs (mimic, padchest, vindr)
│   └── PSF_MED_BENCHMARK.json  # Benchmark summary
├── tests/                   # Test suite
└── paraphrase_generator.py  # Core paraphrase generation library

Quick Start

1. Environment Setup

conda create -n psf-med python=3.11
conda activate psf-med
pip install torch transformers peft pillow tqdm psycopg2-binary datasets

2. Evaluate a Model

# MedGemma-4B on MIMIC-CXR
CUDA_VISIBLE_DEVICES=0 python evaluation/evaluate_padchest_v2_filtered.py \
    --model base --dataset mimic

# MedGemma-4B on PadChest
CUDA_VISIBLE_DEVICES=0 python evaluation/evaluate_padchest_v2_filtered.py \
    --model base --dataset padchest

# All supported models: base, chil_lora, full_lora, medgemma27b, medgemma15_4b,
#                        llava_rad, chexone, chexagent, radfm
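
To sweep several models and datasets, the commands above can be generated programmatically. The following is a minimal dry-run sketch that only prints the command lines; it uses the script path and the `--model` and `--dataset` flags shown above, and the choice of models and datasets in the lists is an arbitrary example.

```python
import itertools
import shlex

SCRIPT = "evaluation/evaluate_padchest_v2_filtered.py"
MODELS = ["base", "llava_rad", "chexone"]  # subset of the model names listed above
DATASETS = ["mimic", "padchest"]           # the two datasets used in the examples above


def build_eval_command(model: str, dataset: str) -> list[str]:
    """Build the argument list for one evaluation run."""
    return ["python", SCRIPT, "--model", model, "--dataset", dataset]


commands = [build_eval_command(m, d) for m, d in itertools.product(MODELS, DATASETS)]
for cmd in commands:
    # Dry run: print each command instead of executing it.
    print(shlex.join(cmd))
```

Passing each list to `subprocess.run` would execute the runs one after another; setting `CUDA_VISIBLE_DEVICES` per run is left to the caller.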

3. Evaluate API Models

# GPT-5-mini (500 hardest questions, OpenAI Batch API)
python evaluation/evaluate_psf_gpt5mini_batch.py create --dataset mimic --n 500
python evaluation/evaluate_psf_gpt5mini_batch.py submit --dataset mimic
python evaluation/evaluate_psf_gpt5mini_batch.py parse --dataset mimic

# Claude Haiku 4.5 (Anthropic Batch API)
python evaluation/evaluate_psf_claude_batch.py create --dataset mimic --n 500
python evaluation/evaluate_psf_claude_batch.py submit --dataset mimic
python evaluation/evaluate_psf_claude_batch.py parse --dataset mimic
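
Both API evaluations follow the same create, submit, parse sequence, so the three steps can be wrapped in a small driver. The sketch below builds the commands exactly as written above and stops at the first failing step. The `SCRIPTS` keys, `batch_steps`, and `run_steps` are hypothetical helper names, and whether `parse` waits for the batch job to finish is not stated here, so a real run may need a pause between `submit` and `parse`.

```python
import subprocess
from typing import Callable

SCRIPTS = {
    "gpt5mini": "evaluation/evaluate_psf_gpt5mini_batch.py",
    "claude": "evaluation/evaluate_psf_claude_batch.py",
}


def batch_steps(provider: str, dataset: str, n: int) -> list[list[str]]:
    """Return the create/submit/parse commands for one provider and dataset."""
    script = SCRIPTS[provider]
    return [
        ["python", script, "create", "--dataset", dataset, "--n", str(n)],
        ["python", script, "submit", "--dataset", dataset],
        ["python", script, "parse", "--dataset", dataset],
    ]


def run_steps(steps: list[list[str]], runner: Callable[[list[str]], int]) -> int:
    """Run steps in order; stop at the first nonzero exit code. Returns steps completed."""
    done = 0
    for step in steps:
        if runner(step) != 0:
            break
        done += 1
    return done


def real_runner(cmd: list[str]) -> int:
    """Execute a command and return its exit code."""
    return subprocess.run(cmd).returncode


# Dry run: print the commands instead of executing them.
steps = batch_steps("claude", "mimic", 500)
completed = run_steps(steps, runner=lambda cmd: print(" ".join(cmd)) or 0)
```

Swapping the lambda for `real_runner` executes the steps for real.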

4. Build the Judge-Filtered Dataset from Scratch

# Generate paraphrases
python paraphrase/batch_paraphrase_generator.py --dataset mimic

# Filter with Bio_ClinicalBERT (similarity >= 0.95)
python paraphrase/filter_paraphrases.py --input data/paraphrases.json

# Validate with LLM judge (GPT-5-mini, bidirectional entailment)
python paraphrase/llm_judge_equivalence.py create --dataset all
python paraphrase/llm_judge_equivalence.py submit --input data/judge_batch_input.jsonl
python paraphrase/llm_judge_equivalence.py parse

# Build final filtered dataset
python release/build_judge_filtered_dataset.py
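
The two filtering stages above can be summarized as a simple keep/drop rule. The sketch below assumes the similarity is cosine similarity over precomputed sentence embeddings and that the judge returns one boolean verdict per entailment direction; the embedding step and the judge call are not shown, and all function names are hypothetical.

```python
import math

SIM_THRESHOLD = 0.95  # threshold stated in the filtering step above


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def passes_similarity(q_emb: list[float], p_emb: list[float],
                      threshold: float = SIM_THRESHOLD) -> bool:
    """Stage 1: keep paraphrases whose embedding is close to the question's."""
    return cosine_similarity(q_emb, p_emb) >= threshold


def is_equivalent(forward_entails: bool, backward_entails: bool) -> bool:
    """Stage 2: bidirectional entailment requires both directions to hold."""
    return forward_entails and backward_entails


def keep_paraphrase(q_emb: list[float], p_emb: list[float],
                    forward_entails: bool, backward_entails: bool) -> bool:
    """A paraphrase survives only if it passes both stages."""
    return passes_similarity(q_emb, p_emb) and is_equivalent(
        forward_entails, backward_entails
    )


# Toy 2-D embeddings, for illustration only.
print(keep_paraphrase([1.0, 0.0], [0.99, 0.05], True, True))   # True
print(keep_paraphrase([1.0, 0.0], [0.99, 0.05], True, False))  # False
```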

Figure 3: Existing benchmarks do not capture paraphrase sensitivity — models can pass every task and still flip 20-30% of answers under rephrasing.

Benchmark Results

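PSF-Med v2 (LLM-judge filtered, presence questions). Acc is accuracy and Flip is the flip rate under paraphrasing; -- marks a result that is not reported: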

Model	MIMIC Acc	MIMIC Flip	PadChest Acc	PadChest Flip	VinDr Acc	VinDr Flip
Targeted LoRA	89.9%	4.2%	82.1%	7.5%	74.7%	8.5%
MedGemma-4B	88.3%	6.7%	59.0%	13.1%	57.2%	12.7%
Full LoRA	84.3%	3.3%	40.0%	10.3%	53.5%	5.8%
LLaVA-RAD	82.9%	9.9%	--	--	83.9%	12.4%
MedGemma-27B	76.7%	10.6%	55.3%	17.8%	57.6%	12.4%
MedGemma-1.5-4B	--	--	73.2%	12.0%	--	--
CheXOne	65.4%	13.3%	61.5%	10.3%	47.3%	14.7%
RadFM	--	--	--	27.0%	--	--
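
For readers who want to work with these numbers, the sketch below parses the MIMIC columns of the table (copied verbatim, tab-separated) and ranks models by flip rate, skipping entries marked `--`. The helper names are hypothetical, and the schema of `data/PSF_MED_BENCHMARK.json` is not assumed here.

```python
TABLE = """\
Model\tMIMIC Acc\tMIMIC Flip
Targeted LoRA\t89.9%\t4.2%
MedGemma-4B\t88.3%\t6.7%
Full LoRA\t84.3%\t3.3%
LLaVA-RAD\t82.9%\t9.9%
MedGemma-27B\t76.7%\t10.6%
MedGemma-1.5-4B\t--\t--
CheXOne\t65.4%\t13.3%
RadFM\t--\t--
"""


def parse_percent(cell: str):
    """Convert '4.2%' to 4.2; return None for the '--' placeholder."""
    cell = cell.strip()
    return None if cell == "--" else float(cell.rstrip("%"))


def parse_table(text: str) -> list[dict]:
    """Parse a tab-separated table whose first column is the model name."""
    lines = text.strip().splitlines()
    header = lines[0].split("\t")
    rows = []
    for line in lines[1:]:
        cells = line.split("\t")
        row = {"Model": cells[0]}
        for name, cell in zip(header[1:], cells[1:]):
            row[name] = parse_percent(cell)
        rows.append(row)
    return rows


rows = parse_table(TABLE)
reported = [r for r in rows if r["MIMIC Flip"] is not None]
ranked = sorted(reported, key=lambda r: r["MIMIC Flip"])
for r in ranked:
    print(f'{r["Model"]}: flip {r["MIMIC Flip"]}%, acc {r["MIMIC Acc"]}%')
```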

Figure 2: Embedding distance distributions for flip vs. non-flip pairs — flips occur even at high cosine similarity.

Paraphrase Phenomena

Each question has 4-6 paraphrases generated using 5 linguistic phenomena:

Phenomenon	Example
Lexical substitution	"abnormalities" -> "pathologic findings"
Syntactic restructuring	"Is X present?" -> "Can X be seen?"
Negation pattern	"Is there X?" -> "Can you rule out X?"
Scope quantification	"Is there effusion?" -> "Is there any effusion?"
Specificity modulation	"What level?" -> "At what vertebral level?"

Figure 4: PSF dataset generation pipeline — from clinical VQA questions to 92,234 validated question-paraphrase pairs.

Citation

@article{sadanandan2026psf,
  title={PSF-Med: Measuring Paraphrase Sensitivity Failures in Medical Vision-Language Models},
  author={Sadanandan, Binesh and Behzadan, Vahid},
  journal={arXiv preprint arXiv:2602.21428},
  year={2026}
}

License

CC-BY-4.0. Access to underlying medical images requires separate agreements with PhysioNet (MIMIC-CXR, VinDr-CXR) and BIMCV (PadChest).
