Skip to content

UNHSAILLab/psf-med

Repository files navigation

PSF-Med: Paraphrase Sensitivity Framework for Medical VLMs

Code and data for the paper: "PSF-Med: Measuring Paraphrase Sensitivity Failures in Medical Vision-Language Models"

[Paper] [Demo]

Overview

PSF-Med is a benchmark for evaluating whether medical Vision-Language Models give consistent answers when clinical questions are rephrased. It includes:

  • 26,850 clinical questions across 3 chest X-ray datasets (MIMIC-CXR, PadChest, VinDr-CXR)
  • 92,234 semantically equivalent paraphrases validated by LLM judge
  • Evaluation results for 11 models (MedGemma, LLaVA-RAD, CheXOne, CheXagent, RadFM, GPT-5-mini, Claude Haiku)

Paraphrase Sensitivity Failure Example

Figure 1: A chest X-ray produces opposite answers under rephrasing with MedGemma 4B 1.5 IT.

Dataset

The judge-filtered dataset is available on HuggingFace: saillab/psf-med

from datasets import load_dataset

mimic = load_dataset("saillab/psf-med", "mimic")["test"]
padchest = load_dataset("saillab/psf-med", "padchest")["test"]
vindr = load_dataset("saillab/psf-med", "vindr")["test"]

Images must be obtained separately from PhysioNet (MIMIC-CXR, VinDr-CXR) and BIMCV (PadChest).

Repository Structure

psf_med/
├── evaluation/              # Model evaluation scripts
│   ├── evaluate_psf.py      # Main evaluation (MedGemma, LLaVA-RAD, CheXagent)
│   ├── evaluate_padchest_v2_filtered.py  # All models on judge-filtered data
│   ├── evaluate_psf_chexone.py           # CheXOne evaluation
│   ├── evaluate_psf_gpt5mini_batch.py    # GPT-5-mini via OpenAI Batch API
│   └── evaluate_psf_claude_batch.py      # Claude Haiku via Anthropic Batch API
├── paraphrase/              # Paraphrase generation and quality filtering
│   ├── batch_paraphrase_generator.py     # Generate paraphrases via GPT-4
│   ├── filter_paraphrases.py             # Bio_ClinicalBERT similarity filter
│   ├── llm_judge_equivalence.py          # GPT-5-mini LLM judge (bidirectional entailment)
│   └── regenerate_padchest.py            # PadChest v2 paraphrase regeneration
├── models/                  # Model loading utilities
│   ├── model_loader.py      # MedGemma + LoRA loader
│   ├── llava_rad_adapter.py # LLaVA-RAD adapter (margin + generation)
│   ├── constants.py         # Paths, token IDs, model specs
│   ├── data_loaders.py      # Dataset loading helpers
│   └── uncertainty_methods.py  # forward_single (logit extraction)
├── release/                 # Dataset release and upload
│   ├── build_judge_filtered_dataset.py   # Merge judge verdicts into dataset
│   ├── upload_to_huggingface_v3.py       # Push to HuggingFace
│   └── upload_to_postgres.py             # Push to PostgreSQL
├── dataset/                 # Dataset definitions (questions + paraphrases)
│   ├── mimic/               # MIMIC-CXR questions and splits
│   ├── padchest/            # PadChest questions and flip bank
│   ├── vindr/               # VinDr-CXR questions and flip bank
│   └── README.md            # Image access instructions
├── data/                    # Judge-filtered dataset + benchmark results
│   ├── psf_med_judge_filtered/  # Pre-built filtered JSONs (mimic, padchest, vindr)
│   └── PSF_MED_BENCHMARK.json  # Benchmark summary
├── tests/                   # Test suite
└── paraphrase_generator.py  # Core paraphrase generation library

Quick Start

1. Environment Setup

conda create -n psf-med python=3.11
conda activate psf-med
pip install torch transformers peft pillow tqdm psycopg2-binary datasets

2. Evaluate a Model

# MedGemma-4B on MIMIC-CXR
CUDA_VISIBLE_DEVICES=0 python evaluation/evaluate_padchest_v2_filtered.py \
    --model base --dataset mimic

# MedGemma-4B on PadChest
CUDA_VISIBLE_DEVICES=0 python evaluation/evaluate_padchest_v2_filtered.py \
    --model base --dataset padchest

# All supported models: base, chil_lora, full_lora, medgemma27b, medgemma15_4b,
#                        llava_rad, chexone, chexagent, radfm

3. Evaluate API Models

# GPT-5-mini (500 hardest questions, OpenAI Batch API)
python evaluation/evaluate_psf_gpt5mini_batch.py create --dataset mimic --n 500
python evaluation/evaluate_psf_gpt5mini_batch.py submit --dataset mimic
python evaluation/evaluate_psf_gpt5mini_batch.py parse --dataset mimic

# Claude Haiku 4.5 (Anthropic Batch API)
python evaluation/evaluate_psf_claude_batch.py create --dataset mimic --n 500
python evaluation/evaluate_psf_claude_batch.py submit --dataset mimic
python evaluation/evaluate_psf_claude_batch.py parse --dataset mimic

4. Build the Judge-Filtered Dataset from Scratch

# Generate paraphrases
python paraphrase/batch_paraphrase_generator.py --dataset mimic

# Filter with Bio_ClinicalBERT (similarity >= 0.95)
python paraphrase/filter_paraphrases.py --input data/paraphrases.json

# Validate with LLM judge (GPT-5-mini, bidirectional entailment)
python paraphrase/llm_judge_equivalence.py create --dataset all
python paraphrase/llm_judge_equivalence.py submit --input data/judge_batch_input.jsonl
python paraphrase/llm_judge_equivalence.py parse

# Build final filtered dataset
python release/build_judge_filtered_dataset.py

Existing Benchmark Gap

Figure 3: Existing benchmarks do not capture paraphrase sensitivity — models can pass every task and still flip 20-30% of answers under rephrasing.

Benchmark Results

PSF-Med v2 (LLM-judge filtered, presence questions):

Model MIMIC Acc MIMIC Flip PadChest Acc PadChest Flip VinDr Acc VinDr Flip
Targeted LoRA 89.9% 4.2% 82.1% 7.5% 74.7% 8.5%
MedGemma-4B 88.3% 6.7% 59.0% 13.1% 57.2% 12.7%
Full LoRA 84.3% 3.3% 40.0% 10.3% 53.5% 5.8%
LLaVA-RAD 82.9% 9.9% -- -- 83.9% 12.4%
MedGemma-27B 76.7% 10.6% 55.3% 17.8% 57.6% 12.4%
MedGemma-1.5-4B -- -- 73.2% 12.0% -- --
CheXOne 65.4% 13.3% 61.5% 10.3% 47.3% 14.7%
RadFM -- -- -- 27.0% -- --

Embedding Distance Distributions

Figure 2: Embedding distance distributions for flip vs. non-flip pairs — flips occur even at high cosine similarity.

Paraphrase Phenomena

Each question has 4-6 paraphrases generated using 5 linguistic phenomena:

Phenomenon Example
Lexical substitution "abnormalities" -> "pathologic findings"
Syntactic restructuring "Is X present?" -> "Can X be seen?"
Negation pattern "Is there X?" -> "Can you rule out X?"
Scope quantification "Is there effusion?" -> "Is there any effusion?"
Specificity modulation "What level?" -> "At what vertebral level?"

PSF Dataset Generation Pipeline

Figure 4: PSF dataset generation pipeline — from clinical VQA questions to 92,234 validated question-paraphrase pairs.

Citation

@article{sadanandan2026psf,
  title={PSF-Med: Measuring Paraphrase Sensitivity Failures in Medical Vision-Language Models},
  author={Sadanandan, Binesh and Behzadan, Vahid},
  journal={arXiv preprint arXiv:2602.21428},
  year={2026}
}

License

CC-BY-4.0. Access to underlying medical images requires separate agreements with PhysioNet (MIMIC-CXR, VinDr-CXR) and BIMCV (PadChest).

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors