PathFLIP is a novel vision-language framework for holistic Whole Slide Image (WSI) interpretation. By decomposing slide-level captions into region-level sub-captions and leveraging Large Language Models (LLMs), PathFLIP achieves precise visual-language grounding and instruction-aware WSI interpretation.
While Vision-Language Models (VLMs) have achieved notable progress in computational pathology, the gigapixel scale and spatial heterogeneity of WSIs continue to pose challenges. PathFLIP addresses these issues with the following capabilities:
- 🧩 Fine-grained Visual-Language Grounding: Decomposes slide-level captions into region-level sub-captions and generates text-conditioned region embeddings, capturing fine-grained correspondences across thousands of patches.
- 🤖 LLM-Powered Instruction Following: Seamlessly follows diverse clinical instructions and adapts to varied diagnostic contexts by harnessing the reasoning power of LLMs.
- 🎯 Versatile Task Adaptation: Efficiently handles multiple paradigms, including slide-level classification, WSI-text retrieval, fine-grained lesion localization, and instruction following.
- ⚡ High Efficiency: Outperforms existing large-scale pathological VLMs on four representative benchmarks while requiring significantly less training data.
PathFLIP uses a region-aware pretraining strategy to bridge the gap between gigapixel-scale visual context and textual diagnostic descriptions.
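To make the idea of text-conditioned region embeddings more concrete, here is a minimal NumPy sketch. It is not the PathFLIP implementation: the function name `text_conditioned_region_embeddings`, the single-head dot-product attention, the feature dimension, and the random stand-in features are all assumptions chosen for illustration. Each sub-caption embedding acts as a query over the patch features of one slide, and the attention-weighted sum of patches gives one region embedding per sub-caption.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def text_conditioned_region_embeddings(patch_feats, subcaption_embs):
    """Hypothetical helper (an assumption, not the authors' API).

    patch_feats:     (N, D) patch features from one slide
    subcaption_embs: (K, D) embeddings of K region-level sub-captions
    Returns (K, D) region embeddings and (K, N) attention weights.
    """
    d = patch_feats.shape[1]
    # Each sub-caption scores every patch (scaled dot-product attention).
    scores = subcaption_embs @ patch_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=1)
    # Weighted sum of patches: one region embedding per sub-caption.
    regions = weights @ patch_feats
    return regions, weights


# Random stand-ins for real encoder outputs (illustrative only).
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(2000, 64))    # thousands of patches per slide
subcaption_embs = rng.normal(size=(3, 64))   # three sub-captions

regions, weights = text_conditioned_region_embeddings(patch_feats, subcaption_embs)

# Region-to-sub-caption cosine similarities; a contrastive objective
# would typically push the diagonal entries above the off-diagonal ones.
sim = l2_normalize(regions) @ l2_normalize(subcaption_embs).T
```

In a trained model, the patch features and sub-caption embeddings would come from learned vision and text encoders, and the attention would likely be multi-head with learned projections; the sketch only shows the shape of the computation.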
Figure: The overall pipeline of PathFLIP, illustrating the decomposition of slide-level captions and the generation of text-conditioned region embeddings.
If you find PathFLIP useful in your research, please consider citing our paper:
@article{liu2025pathflip,
title={{PathFLIP}: Fine-grained language-image pretraining for versatile computational pathology},
author={Liu, Fengchun and Jiang, Songhan and Cai, Linghan and Wang, Ziyue and Zhang, Yongbing},
journal={arXiv preprint arXiv:2512.17621},
year={2025}
}

We would like to thank the open-source community for their invaluable contributions, specifically the repositories of CLAM, CONCH, and BLIP2.