Code and data for the paper: "PSF-Med: Measuring Paraphrase Sensitivity Failures in Medical Vision-Language Models"
PSF-Med is a benchmark for evaluating whether medical Vision-Language Models give consistent answers when clinical questions are rephrased. It includes:
- 26,850 clinical questions across 3 chest X-ray datasets (MIMIC-CXR, PadChest, VinDr-CXR)
- 92,234 semantically equivalent paraphrases validated by an LLM judge
- Evaluation results for 11 models, including MedGemma variants, LLaVA-RAD, CheXOne, CheXagent, RadFM, GPT-5-mini, and Claude Haiku
Figure 1: The same chest X-ray produces opposite answers when the question is rephrased (MedGemma 1.5 4B IT).
The judge-filtered dataset is available on HuggingFace: saillab/psf-med
from datasets import load_dataset
mimic = load_dataset("saillab/psf-med", "mimic")["test"]
padchest = load_dataset("saillab/psf-med", "padchest")["test"]
vindr = load_dataset("saillab/psf-med", "vindr")["test"]

Images must be obtained separately from PhysioNet (MIMIC-CXR, VinDr-CXR) and BIMCV (PadChest).
psf_med/
├── evaluation/ # Model evaluation scripts
│ ├── evaluate_psf.py # Main evaluation (MedGemma, LLaVA-RAD, CheXagent)
│ ├── evaluate_padchest_v2_filtered.py # All models on judge-filtered data
│ ├── evaluate_psf_chexone.py # CheXOne evaluation
│ ├── evaluate_psf_gpt5mini_batch.py # GPT-5-mini via OpenAI Batch API
│ └── evaluate_psf_claude_batch.py # Claude Haiku via Anthropic Batch API
├── paraphrase/ # Paraphrase generation and quality filtering
│ ├── batch_paraphrase_generator.py # Generate paraphrases via GPT-4
│ ├── filter_paraphrases.py # Bio_ClinicalBERT similarity filter
│ ├── llm_judge_equivalence.py # GPT-5-mini LLM judge (bidirectional entailment)
│ └── regenerate_padchest.py # PadChest v2 paraphrase regeneration
├── models/ # Model loading utilities
│ ├── model_loader.py # MedGemma + LoRA loader
│ ├── llava_rad_adapter.py # LLaVA-RAD adapter (margin + generation)
│ ├── constants.py # Paths, token IDs, model specs
│ ├── data_loaders.py # Dataset loading helpers
│ └── uncertainty_methods.py # forward_single (logit extraction)
├── release/ # Dataset release and upload
│ ├── build_judge_filtered_dataset.py # Merge judge verdicts into dataset
│ ├── upload_to_huggingface_v3.py # Push to HuggingFace
│ └── upload_to_postgres.py # Push to PostgreSQL
├── dataset/ # Dataset definitions (questions + paraphrases)
│ ├── mimic/ # MIMIC-CXR questions and splits
│ ├── padchest/ # PadChest questions and flip bank
│ ├── vindr/ # VinDr-CXR questions and flip bank
│ └── README.md # Image access instructions
├── data/ # Judge-filtered dataset + benchmark results
│ ├── psf_med_judge_filtered/ # Pre-built filtered JSONs (mimic, padchest, vindr)
│ └── PSF_MED_BENCHMARK.json # Benchmark summary
├── tests/ # Test suite
└── paraphrase_generator.py # Core paraphrase generation library
conda create -n psf-med python=3.11
conda activate psf-med
pip install torch transformers peft pillow tqdm psycopg2-binary datasets

# MedGemma-4B on MIMIC-CXR
CUDA_VISIBLE_DEVICES=0 python evaluation/evaluate_padchest_v2_filtered.py \
--model base --dataset mimic
# MedGemma-4B on PadChest
CUDA_VISIBLE_DEVICES=0 python evaluation/evaluate_padchest_v2_filtered.py \
--model base --dataset padchest
# All supported models: base, chil_lora, full_lora, medgemma27b, medgemma15_4b,
# llava_rad, chexone, chexagent, radfm

# GPT-5-mini (500 hardest questions, OpenAI Batch API)
python evaluation/evaluate_psf_gpt5mini_batch.py create --dataset mimic --n 500
python evaluation/evaluate_psf_gpt5mini_batch.py submit --dataset mimic
python evaluation/evaluate_psf_gpt5mini_batch.py parse --dataset mimic
# Claude Haiku 4.5 (Anthropic Batch API)
python evaluation/evaluate_psf_claude_batch.py create --dataset mimic --n 500
python evaluation/evaluate_psf_claude_batch.py submit --dataset mimic
python evaluation/evaluate_psf_claude_batch.py parse --dataset mimic

# Generate paraphrases
python paraphrase/batch_paraphrase_generator.py --dataset mimic
# Filter with Bio_ClinicalBERT (similarity >= 0.95)
python paraphrase/filter_paraphrases.py --input data/paraphrases.json
# Validate with LLM judge (GPT-5-mini, bidirectional entailment)
python paraphrase/llm_judge_equivalence.py create --dataset all
python paraphrase/llm_judge_equivalence.py submit --input data/judge_batch_input.jsonl
python paraphrase/llm_judge_equivalence.py parse
# Build final filtered dataset
python release/build_judge_filtered_dataset.py

Figure 3: Existing benchmarks do not capture paraphrase sensitivity: models can pass every task and still flip 20-30% of answers under rephrasing.
PSF-Med v2 (LLM-judge filtered, presence questions):
| Model | MIMIC Acc | MIMIC Flip | PadChest Acc | PadChest Flip | VinDr Acc | VinDr Flip |
|---|---|---|---|---|---|---|
| Targeted LoRA | 89.9% | 4.2% | 82.1% | 7.5% | 74.7% | 8.5% |
| MedGemma-4B | 88.3% | 6.7% | 59.0% | 13.1% | 57.2% | 12.7% |
| Full LoRA | 84.3% | 3.3% | 40.0% | 10.3% | 53.5% | 5.8% |
| LLaVA-RAD | 82.9% | 9.9% | -- | -- | 83.9% | 12.4% |
| MedGemma-27B | 76.7% | 10.6% | 55.3% | 17.8% | 57.6% | 12.4% |
| MedGemma-1.5-4B | -- | -- | 73.2% | 12.0% | -- | -- |
| CheXOne | 65.4% | 13.3% | 61.5% | 10.3% | 47.3% | 14.7% |
| RadFM | -- | -- | -- | 27.0% | -- | -- |
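The Flip columns above report how often a model's answer changes under rephrasing. The following is a minimal sketch of one way such a rate could be computed, assuming a flip is counted per question-paraphrase pair whenever the normalized answers differ. The record keys (`original_answer`, `paraphrase_answer`) and the helper names are hypothetical and are not taken from the repository; the paper's exact definition may differ (for example, it may aggregate per question).

```python
def normalize(answer: str) -> str:
    """Lowercase and trim an answer so 'Yes ' and 'yes' compare equal."""
    return answer.strip().lower()


def flip_rate(records) -> float:
    """Fraction of question-paraphrase pairs whose answers disagree.

    Each record is a dict with hypothetical keys 'original_answer' and
    'paraphrase_answer' (assumed field names, not the dataset schema).
    """
    records = list(records)
    if not records:
        return 0.0
    flips = sum(
        normalize(r["original_answer"]) != normalize(r["paraphrase_answer"])
        for r in records
    )
    return flips / len(records)


# Toy example: one of four pairs disagrees.
example = [
    {"original_answer": "yes", "paraphrase_answer": "Yes"},
    {"original_answer": "no", "paraphrase_answer": "no"},
    {"original_answer": "yes", "paraphrase_answer": "no"},
    {"original_answer": "no", "paraphrase_answer": " no "},
]
print(flip_rate(example))
```

Counting at the pair level keeps the metric independent of how many paraphrases each question has; a per-question variant would instead mark a question as flipped if any of its paraphrases disagrees.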
Figure 2: Embedding distance distributions for flip vs. non-flip pairs — flips occur even at high cosine similarity.
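The comparison in Figure 2 can be illustrated with a small sketch that computes cosine similarity between a question embedding and a paraphrase embedding, then averages it separately for flip and non-flip pairs. This uses plain Python lists as stand-ins for embeddings; the function names and the `(question_vec, paraphrase_vec, flipped)` tuple layout are assumptions for illustration, not the repository's analysis code.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_similarity_by_outcome(pairs):
    """Average similarity for flip and non-flip pairs.

    `pairs` is an iterable of (question_vec, paraphrase_vec, flipped)
    tuples (hypothetical layout). A group with no pairs maps to None.
    """
    totals = {True: 0.0, False: 0.0}
    counts = {True: 0, False: 0}
    for question_vec, paraphrase_vec, flipped in pairs:
        key = bool(flipped)
        totals[key] += cosine_similarity(question_vec, paraphrase_vec)
        counts[key] += 1
    return {
        "flip": totals[True] / counts[True] if counts[True] else None,
        "non_flip": totals[False] / counts[False] if counts[False] else None,
    }
```

If the two averages come out close to each other, embedding similarity alone is a poor predictor of whether a paraphrase will flip the answer, which is the point the figure makes.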
Each question has 4-6 paraphrases generated using 5 linguistic phenomena:
| Phenomenon | Example |
|---|---|
| Lexical substitution | "abnormalities" -> "pathologic findings" |
| Syntactic restructuring | "Is X present?" -> "Can X be seen?" |
| Negation pattern | "Is there X?" -> "Can you rule out X?" |
| Scope quantification | "Is there effusion?" -> "Is there any effusion?" |
| Specificity modulation | "What level?" -> "At what vertebral level?" |
Figure 4: PSF dataset generation pipeline — from clinical VQA questions to 92,234 validated question-paraphrase pairs.
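The two quality gates in this pipeline (an embedding similarity threshold of 0.95, followed by a bidirectional entailment check from the LLM judge) can be sketched as a single filter over candidate pairs. This is an illustrative sketch only: the `Candidate` class, its field names, and `filter_candidates` are hypothetical, and the similarity scores and judge verdicts are assumed to have been computed already by the scripts in `paraphrase/`.

```python
from dataclasses import dataclass

# Threshold quoted in this README for the Bio_ClinicalBERT filter.
SIMILARITY_THRESHOLD = 0.95


@dataclass
class Candidate:
    """One question-paraphrase pair with precomputed scores (hypothetical schema)."""
    question: str
    paraphrase: str
    similarity: float        # embedding cosine similarity
    forward_entails: bool    # judge: question entails paraphrase
    backward_entails: bool   # judge: paraphrase entails question


def is_valid(candidate: Candidate, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Keep a pair only if it passes both the similarity and entailment gates."""
    return (
        candidate.similarity >= threshold
        and candidate.forward_entails
        and candidate.backward_entails
    )


def filter_candidates(candidates, threshold: float = SIMILARITY_THRESHOLD):
    """Return the candidates that survive both gates, preserving order."""
    return [c for c in candidates if is_valid(c, threshold)]


candidates = [
    Candidate("Is there effusion?", "Is there any effusion?", 0.98, True, True),
    Candidate("Is there effusion?", "Is there a large effusion?", 0.97, False, True),
    Candidate("Is there effusion?", "Does the film show fluid?", 0.90, True, True),
]
kept = filter_candidates(candidates)
print([c.paraphrase for c in kept])
```

Requiring entailment in both directions rejects paraphrases that narrow or broaden the question (the second candidate adds "large"), which a similarity score alone would let through.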
@article{sadanandan2026psf,
title={PSF-Med: Measuring Paraphrase Sensitivity Failures in Medical Vision-Language Models},
author={Sadanandan, Binesh and Behzadan, Vahid},
journal={arXiv preprint arXiv:2602.21428},
year={2026}
}

CC-BY-4.0. Access to underlying medical images requires separate agreements with PhysioNet (MIMIC-CXR, VinDr-CXR) and BIMCV (PadChest).