Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images
- [2025/06/20] The code are publicly available! π
- [2026/06/13] Hepto-LLaVA has been accepted at MICCAI 2026 as a poster! π
- [2026/03/01] Hepto-LLaVA is now live on arXiv! π₯
Hepatocellular Carcinoma (HCC) relies on histopathological Whole Slide Images (WSIs) examination as the gold standard. However, manual analysis of these gigapixel, highly heterogeneous WSIs is labor-intensive and prone to inter-observer variability. This has catalyzed WSI-based Multi-modal Large Language Models (MLLMs) to enable VQA.
A key challenge in pathology MLLMs is gigapixel WSI representation. Existing methods either use thumbnail-based approaches that lose critical high-resolution diagnostic details, or employ slide-encoder approaches that generate excessively redundant tokens.
We propose Hepato-LLaVA, a specialized MLLM for fine-grained hepatocellular pathology analysis. It features a novel Hierarchical Sparse Visual Attention (HSVA) mechanism that models 2D tissue topology to aggregate diagnostic evidence while preserving context. To address multiscale data scarcity, we also present HepatoPathoVQA, comprising 33K hierarchically structured QA pairs validated by pathologists. Hepato-LLaVA achieves state-of-the-art diagnostic accuracy, outperforming existing pathology MLLMs by an absolute 20%.
git clone https://github.com/wssf3092/Hepato-LLaVA.git
cd Hepato-LLaVA
conda create -n hepato_llava python=3.10 -y
conda activate hepato_llava
pip install --upgrade pip
pip install -r requirements.txtFor the patch encoder, please follow the official installation instructions of CONCH to set up the model and obtain the pretrained weights.
Use the CONCH encoder to extract patch-level features from WSIs:
bash data/feature/1_run.shFor data augmentation (generating 9 variants per WSI):
bash data/feature/1_run_augment.shConvert VQA data to LLaVA fine-tuning format:
data/conversation/qa.pyβ convert QA JSONL to LLaVA fine-tuning formatdata/conversation/caption.pyβ convert captioning data to fine-tuning format
Hepato-LLaVA follows a three-stage training pipeline:
Stage 1: MAE Pre-training β Self-supervised pre-training of the HSAN slide encoder with curriculum masking (patch-level β pack-level):
bash scripts/run_mae.shStage 2: MoCo Pre-training β Contrastive learning for summary token representations:
bash scripts/run_moco_summary.shStage 3: LLaVA Fine-tuning β End-to-end fine-tuning with DeepSpeed and LoRA:
bash scripts/run_llava_finetune.shRun VQA evaluation:
bash scripts/run_eval_vqa.shFor GPT-4 based open-ended evaluation:
python scripts/eval_open.pyFor choice question statistics:
python scripts/stat_choice.py| Parameter | Value |
|---|---|
| LORA_R | 128 |
| LORA_ALPHA | 256 |
| Parameter | Value |
|---|---|
| NUM_EPOCHS | 3 |
| BATCH_SIZE | 8 |
| GRADIENT_ACCUMULATION | 4 |
| LEARNING_RATE | 2e-5 |
| MM_PROJECTOR_LR | 2e-5 |
| WARMUP_RATIO | 0.03 |
| MODEL_MAX_LENGTH | 8192 |
| Parameter | Value |
|---|---|
| TEMPERATURE | 0.0 |
| TOP_P | 0.9 |
| NUM_BEAMS | 1 |
| MAX_NEW_TOKENS | 2048 |
@article{hepatollava2026,
title={Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images},
author={Yang, Yuxuan and Yan, Zhonghao and Zhang, Yi and Yun, Bo and Diao, Muxi and Zhao, Guowei and Liang, Kongming and Li, Wenbin and Ma, Zhanyu},
year={2026}
}This code is built on CONCH and WSI-LLaVA. We thank the authors for sharing their codes.