
TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

📌 Abstract

TCM-Ladder is the first comprehensive multimodal QA dataset specifically designed for evaluating large TCM language models.

  • Multiple core disciplines of TCM: fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics.

  • Multimodal: TCM-Ladder incorporates multiple modalities, including images and videos.

  • Multiple question formats: single-choice, multiple-choice, fill-in-the-blank, diagnostic dialogue, and visual comprehension tasks.

We trained a reasoning model on TCM-Ladder and conducted comparative experiments against nine state-of-the-art general-domain LLMs and five leading TCM-specific LLMs to evaluate their performance on the dataset. Moreover, we propose Ladder-Score, an evaluation method specifically designed for TCM question answering that effectively assesses answer quality in terms of terminology usage and semantic expression. To the best of our knowledge, this is the first work to systematically evaluate mainstream general-domain and TCM-specific LLMs on a unified multimodal benchmark. The datasets and leaderboard are publicly available at https://tcmladder.com and will be continuously updated.
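The paper defines Ladder-Score formally; as a rough illustration only, a metric of this kind typically blends a terminology-overlap component with a semantic-similarity component. The sketch below is a hypothetical approximation, not the actual Ladder-Score: the term extraction, the fixed weight `w_term`, and the externally supplied `semantic_sim` value are all assumptions.

```python
# Hypothetical sketch of a terminology-plus-semantics QA metric.
# NOT the actual Ladder-Score from the paper; the weighting scheme
# and inputs here are illustrative assumptions only.

def term_f1(answer_terms, reference_terms):
    """F1 overlap between TCM terms found in the answer and the reference."""
    if not answer_terms or not reference_terms:
        return 0.0
    hits = len(set(answer_terms) & set(reference_terms))
    if hits == 0:
        return 0.0
    precision = hits / len(set(answer_terms))
    recall = hits / len(set(reference_terms))
    return 2 * precision * recall / (precision + recall)

def combined_score(answer_terms, reference_terms, semantic_sim, w_term=0.5):
    """Weighted blend of terminology F1 and a semantic similarity in [0, 1],
    e.g. a sentence-embedding cosine similarity computed elsewhere."""
    return w_term * term_f1(answer_terms, reference_terms) + (1 - w_term) * semantic_sim
```

For example, an answer sharing one of two reference terms (terminology F1 of 0.5) with a semantic similarity of 0.8 would score 0.65 under equal weighting. The real Ladder-Score in the paper should be consulted for the actual formulation.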

📑 Paper 📖 Dataset

🚀 News

  • [2025-09] Our paper was accepted at NeurIPS 2025.
  • [2025-05] We released our preprint paper on arXiv.
  • [2025-05] Our dataset TCM-Ladder was released on Hugging Face.

📝 TODOs

  • English version.
  • Instructions to run evaluation.

1. Overview of the architectural composition of TCM-Ladder

Figure 1

2. Data distribution and length statistics in TCM-Ladder

Figure 2

3. Performance of general-domain and TCM-specific language models on single- and multiple-choice question answering tasks

Figure 3

4. Performance of large language models on questions about Chinese herbal medicine and tongue images

Figure 4

🌟 Citation

If you find our work useful in your research, please consider citing:

@article{xie2025tcm,
  title={TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine},
  author={Xie, Jiacheng and Yu, Yang and Zhang, Ziyang and Zeng, Shuai and He, Jiaxuan and Vasireddy, Ayush and Tang, Xiaoting and Guo, Congyu and Zhao, Lening and Jing, Congcong and others},
  journal={arXiv preprint arXiv:2505.24063},
  year={2025}
}
