TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

📌 Abstract

We introduce TCM-Ladder, the first comprehensive multimodal QA dataset specifically designed for evaluating large TCM language models.

Multiple core disciplines of TCM: fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics.
Multimodal: TCM-Ladder incorporates various modalities such as images and videos.
Multiple question formats: single-choice, multiple-choice, fill-in-the-blank, diagnostic dialogue, and visual comprehension tasks.
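As an illustration of how the single-choice and multiple-choice formats could be scored, here is a minimal sketch using exact-match accuracy. The function names and the answer format (option letters such as `B` or `A, C`) are assumptions made for this example, not the repository's actual evaluation script or data schema.

```python
from typing import Iterable


def normalize_choice(answer: str) -> frozenset[str]:
    """Reduce an answer such as 'a, C' or 'AC' to a set of option letters.

    Assumes the answer has already been reduced to option letters only;
    free-text model output would need an extraction step first.
    """
    return frozenset(ch for ch in answer.upper() if "A" <= ch <= "Z")


def is_correct(predicted: str, gold: str) -> bool:
    """Exact match: every gold option is selected and nothing extra."""
    return normalize_choice(predicted) == normalize_choice(gold)


def accuracy(pairs: Iterable[tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_correct(p, g) for p, g in pairs) / len(pairs)
```

Under this scheme a multiple-choice answer that selects only some of the correct options counts as wrong, which is the strictest of the common conventions; partial credit would require a different rule.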

We trained a reasoning model on TCM-Ladder and compared it against nine state-of-the-art general-domain LLMs and five leading TCM-specific LLMs on the dataset. We also propose Ladder-Score, an evaluation method designed for TCM question answering that assesses answer quality in terms of terminology usage and semantic expression. To the best of our knowledge, this is the first work to systematically evaluate mainstream general-domain and TCM-specific LLMs on a unified multimodal benchmark. The datasets and leaderboard are publicly available at https://tcmladder.com and will be continuously updated.
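To make the idea of scoring an answer on both terminology and semantics concrete, here is a toy sketch that combines a term-coverage measure with a crude text-overlap measure. This is not the Ladder-Score definition from the paper: the function names, the character-bigram proxy for semantic similarity, and the 0.5 weighting are all assumptions chosen only for illustration.

```python
def term_coverage(answer: str, reference_terms: list[str]) -> float:
    """Fraction of reference terms that appear verbatim in the answer."""
    if not reference_terms:
        return 0.0
    hits = sum(term in answer for term in reference_terms)
    return hits / len(reference_terms)


def char_bigram_overlap(answer: str, reference: str) -> float:
    """Jaccard overlap of character bigrams.

    A crude, hypothetical stand-in for semantic similarity; a real
    system would more likely use embeddings or a judge model.
    """
    def bigrams(text: str) -> set[str]:
        text = "".join(text.split())  # drop all whitespace
        return {text[i:i + 2] for i in range(len(text) - 1)}

    a, r = bigrams(answer), bigrams(reference)
    if not a or not r:
        return 0.0
    return len(a & r) / len(a | r)


def combined_score(answer: str, reference: str,
                   reference_terms: list[str],
                   term_weight: float = 0.5) -> float:
    """Weighted sum of the two components (the weight is an assumption)."""
    return (term_weight * term_coverage(answer, reference_terms)
            + (1.0 - term_weight) * char_bigram_overlap(answer, reference))
```

An answer identical to the reference that contains every reference term scores 1.0, and an empty answer scores 0.0; anything in between depends on the chosen weight and similarity proxy.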

📑 paper 📖 dataset

🚀 News

[2025-9] Our paper was accepted to NeurIPS 2025.
[2025-5] We released our preprint on arXiv.
[2025-5] Our dataset TCM-Ladder was released on Hugging Face.

📝 TODOs

English version.
Instructions to run evaluation.

1. Overview of the architectural composition of TCM-Ladder.

2. Data distribution and length statistics in TCM-Ladder.

3. Performance of general-domain and TCM-specific language models on single-choice and multiple-choice question answering tasks.

4. The performance of large language models on questions regarding Chinese herbal medicine and tongue images.

🌟 Citation

If you find our work useful in your research, please consider citing:

@article{xie2025tcm,
  title={TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine},
  author={Xie, Jiacheng and Yu, Yang and Zhang, Ziyang and Zeng, Shuai and He, Jiaxuan and Vasireddy, Ayush and Tang, Xiaoting and Guo, Congyu and Zhao, Lening and Jing, Congcong and others},
  journal={arXiv preprint arXiv:2505.24063},
  year={2025}
}
