CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection

Guankun Wang∗, Han Xiao∗, Renrui Zhang, Huxin Gao, Long Bai, Xiaoxiao Yang, Zhen Li, Hongsheng Li†, Hongliang Ren†

Supplementary Material

More details about CoPESD can be found in the supplementary material.

Overview

With advances in surgical robotics, robot-assisted endoscopic submucosal dissection (ESD) enables rapid resection of large lesions, minimizing recurrence rates and improving long-term overall survival. Despite these advantages, ESD is technically challenging and carries a high risk of complications, so it demands skilled surgeons and precise instruments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising decision-support and predictive-planning capabilities for robotic systems, allowing robots to complete complex tasks in more challenging scenarios. However, training MLLMs requires large-scale, well-annotated datasets, and existing datasets for multi-level, fine-grained ESD surgical motion reasoning are scarce and lack detailed annotations. In this paper, we design a hierarchical decomposition of ESD motion granularity and introduce a multi-level surgical motion dataset (CoPESD) for training MLLMs as the robotic Co-Pilot of Endoscopic Submucosal Dissection. CoPESD includes 17,679 images with 32,699 bounding boxes and 88,395 multi-level motions, drawn from over 35 hours of ESD videos covering both robot-assisted and conventional surgeries. Extensive experiments demonstrate the effectiveness of CoPESD in training MLLMs to comprehend surgical scenarios and reason about the surgical robotic motions that follow. As the first multimodal ESD motion dataset, CoPESD supports advanced research in ESD motion decision-making and surgical automation.

Features

CoPESD is built on a granular decomposition of surgical motions, providing precise motion definitions for ESD.
CoPESD is a fine-grained multi-level surgical motion dataset including 17,679 images with 32,699 bounding boxes and 88,395 multi-level motions.
We provide the link to download CoPESD.
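
As a quick sanity check on the counts listed above, the per-image averages can be computed directly. This is only illustrative arithmetic on the stated totals; the variable names are ours and are not part of the repository.

```shell
# Counts stated in this README.
images=17679
boxes=32699
motions=88395

# Integer average of motion annotations per image (88395 / 17679).
motions_per_image=$((motions / images))

# Average bounding boxes per image, rounded to two decimals with awk.
boxes_per_image=$(awk -v b="$boxes" -v i="$images" 'BEGIN { printf "%.2f", b / i }')

echo "motions per image: $motions_per_image"
echo "boxes per image:   $boxes_per_image"
```

With the totals above, this works out to 5 motion annotations and about 1.85 bounding boxes per image on average.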

Data Download

CoPESD data can be downloaded through this link.

Fine-tuning on CoPESD dataset

Sphinx-ESD

Environment Setup

Follow the instructions provided in the LLaMA2-Accessory repository to set up the environment.
Download the pretrained Sphinx-Tiny-1k models from Hugging Face and place them in the sphinx_esd/accessory/data/SPHINX-Tiny directory.
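
Before launching a fine-tuning run, it can help to confirm that the pretrained files actually landed in the expected directory. The following is a minimal sketch, assuming it is run from the repository root; the helper `has_checkpoint_files` is hypothetical and not part of the repository, and it only checks that the directory is non-empty, not that the files are valid.

```shell
# Directory named in the setup instructions, relative to the repository root.
SPHINX_CKPT_DIR="sphinx_esd/accessory/data/SPHINX-Tiny"

# Hypothetical helper: succeeds only if the directory exists and is non-empty.
has_checkpoint_files() {
    dir="$1"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

if has_checkpoint_files "$SPHINX_CKPT_DIR"; then
    echo "Found pretrained files in $SPHINX_CKPT_DIR"
else
    echo "No pretrained files in $SPHINX_CKPT_DIR; download them first" >&2
fi
```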

Fine-tuning with CoPESD Dataset

To fine-tune Sphinx-ESD-13B with different image sizes, use the following commands:

For Image Size 512:

cd sphinx_esd/accessory
bash exps/finetune_ens1_13b.sh

For Image Size 1024:

cd sphinx_esd/accessory
bash exps/finetune_ens5_13b.sh
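
The two commands above differ only in the script name, so the choice can be wrapped in a small selector. This is a sketch under the assumption that it is run from sphinx_esd/accessory; the function `sphinx_finetune_script` and the `IMAGE_SIZE` variable are hypothetical conveniences, while the script paths are the ones listed above.

```shell
# Hypothetical helper: map an image size to the fine-tuning script listed above.
sphinx_finetune_script() {
    case "$1" in
        512)  echo "exps/finetune_ens1_13b.sh" ;;
        1024) echo "exps/finetune_ens5_13b.sh" ;;
        *)    echo "unsupported image size: $1 (expected 512 or 1024)" >&2
              return 1 ;;
    esac
}

# Resolve the script for the chosen size (defaults to 512) and report it.
IMAGE_SIZE="${IMAGE_SIZE:-512}"
if script="$(sphinx_finetune_script "$IMAGE_SIZE")"; then
    echo "Selected script: $script"
    # The run itself would then be: bash "$script"
fi
```

Keeping the launch as a comment makes the snippet safe to try without starting a training job.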

Inference

To run inference and evaluate using the fine-tuned models, use the following commands:

cd sphinx_esd/accessory
bash exps/generate_action.sh

LLaVA-ESD

Environment Setup

Follow the instructions provided in the LLaVA repository to set up the environment and download the pretrained LLaVA-1.5 models.

Fine-tuning with CoPESD Dataset

To fine-tune LLaVA-ESD-7B and LLaVA-ESD-13B models, use the following commands:

For the 7B model:

cd llava_esd
bash scripts/v1_5/finetune_copesd_7b.sh

For the 13B model:

cd llava_esd
bash scripts/v1_5/finetune_copesd_13b.sh

Inference

To run inference and evaluate using the fine-tuned models, use the following commands:

For the 7B model:

cd llava_esd
bash scripts/v1_5/eval/eval_copesd_7b.sh

For the 13B model:

cd llava_esd
bash scripts/v1_5/eval/eval_copesd_13b.sh
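
The LLaVA-ESD script paths follow a regular pattern across stage and model size. The sketch below prints the commands for both sizes as a dry run, assuming the working directory is llava_esd; the helper `llava_esd_script` is hypothetical and simply reproduces the paths listed in this section.

```shell
# Hypothetical helper: print the script path for a stage ("finetune" or "eval")
# and a model size ("7b" or "13b"), following the paths listed in this section.
llava_esd_script() {
    stage="$1"
    size="$2"
    case "$size" in
        7b|13b) ;;
        *) echo "unknown model size: $size" >&2; return 1 ;;
    esac
    case "$stage" in
        finetune) echo "scripts/v1_5/finetune_copesd_${size}.sh" ;;
        eval)     echo "scripts/v1_5/eval/eval_copesd_${size}.sh" ;;
        *) echo "unknown stage: $stage" >&2; return 1 ;;
    esac
}

# Dry run: list the commands for both sizes without executing them.
for size in 7b 13b; do
    for stage in finetune eval; do
        echo "bash $(llava_esd_script "$stage" "$size")"
    done
done
```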

Checkpoints Release

We have released the fine-tuned model checkpoints on Hugging Face. You can download them and run the evaluation directly.

Questions

If you have any questions, feel free to reach out to gkwang@link.cuhk.edu.hk. Please describe the problem in detail so that we can help you more quickly and effectively.

Citation

If you find CoPESD useful for your research or development, please cite the following:

@article{wang2024copesd,
  title={CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection},
  author={Wang, Guankun and Xiao, Han and Zhang, Renrui and Gao, Huxin and Bai, Long and Yang, Xiaoxiao and Li, Zhen and Li, Hongsheng and Ren, Hongliang},
  journal={arXiv preprint arXiv:2410.07540},
  year={2024}
}

License

The new contributions of our dataset (e.g., the instructions, reference outputs, model ranking annotations, etc.) are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
