Ivyspring International Publisher
International Journal of Medical Sciences
2025; 22(11): 2792-2801. doi: 10.7150/ijms.111780
Review
Corresponding author: Chuanjie Wu, Department of Neurology, Xuanwu Hospital, Capital Medical University; No.45, Changchun Street, Xicheng District,
Beijing, China, 100053. Tel: +86-18911366882, E-mail: wuchuanjie@ccmu.edu.cn; Xunming Ji, Department of Neurology, Xuanwu Hospital, Capital Medical
University; No.45, Changchun Street, Xicheng District, Beijing, China, 100053. E-mail: jixm@ccmu.edu.cn.
© The author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/).
See https://ivyspring.com/terms for full terms and conditions.
Abstract
In recent years, large language models (LLMs) represented by GPT-4 have developed rapidly and
performed well in various natural language processing tasks, showing great potential and transformative
impact. The medical field, with its vast amounts of data and complex diagnostic and treatment
processes, is undoubtedly one of the most promising areas for the application of LLMs. At present, LLMs
have been gradually implemented in clinical practice, medical research, and medical education. However, in
practical applications, medical LLMs still face numerous challenges, including hallucination, limited
interpretability, and ethical concerns. Therefore, in-depth exploration is still needed in the areas of
standardized evaluation frameworks, multimodal LLMs, and multidisciplinary collaboration, so as to
realize the widespread application of medical LLMs and promote the development and
transformation in the field of global healthcare. This review offers a comprehensive overview of
applications, challenges, and future directions of LLMs in medicine, providing new insights for the
sustained development of medical LLMs.
Keywords: Large language models; Medical applications; Natural language processing; Artificial Intelligence
Introduction
Large language models are deep learning models based on the Transformer architecture, which leverages the self-attention mechanism. They are not only capable of generating natural language text, but also of deeply understanding its meaning and performing various natural language tasks, such as text summarization and question answering [1]. In 2022, OpenAI released ChatGPT, which quickly attracted attention and heated discussion across all walks of life [2]. Since then, LLMs exemplified by ChatGPT have been widely used in various fields and have achieved significant breakthroughs, such as OpenAI o1 in mathematics and programming.

Currently, the field of medicine is undergoing rapid development, and there is an urgent need to introduce new tools or explore innovative approaches to solve existing problems. LLMs have attracted much attention from clinical experts in recent years due to their powerful natural language processing (NLP) capabilities. They have become a research hotspot in medicine, bringing unprecedented development opportunities to the field. In clinical practice, LLMs can assist doctors in optimizing clinical decisions by analyzing patient information [3]. In medical research, LLMs can assist in paper writing and in mining and analyzing data, thus improving research efficiency [4]. In medical education, LLMs can simulate real patients and act as virtual teaching assistants, providing personalized learning programs [5]. Despite the great potential of LLMs in medicine, they still face numerous challenges, such as hallucinations, their black-box nature, the lack of evaluation benchmarks and high-quality data, energy consumption, and ethical concerns, which severely limit their practical application [6, 7]. Therefore, it is crucial to summarize and analyze the current research status and development trends of LLMs in medicine.

In this review, we provide a systematic and comprehensive overview of the applications and challenges of LLMs in medicine, along with specific recommendations for their future development, aiming to offer valuable references to clinicians and researchers.

Development of large language models

Progress and innovations in LLMs

LLMs refer to language models with hundreds of billions of parameters or more, which are trained on vast amounts of text data [8]. In 2018, Google released BERT, a pre-trained language model that pioneered the learning paradigm of “pre-training and fine-tuning”, improving performance on NLP tasks to a large extent [9]. In the same year, OpenAI also released the generative pre-training model GPT [10]. Since then, pre-trained language models have come into the public eye. In 2020, the release of GPT-3, with a parameter scale of 175 billion, officially opened the era of LLMs [1]. In November 2022, OpenAI released ChatGPT, an important milestone in the development of LLMs [2]. Subsequently, LLMs entered a phase of rapid development: Meta, Google, Anthropic, and other companies released multiple LLMs, such as LLaMA [11], PaLM 2 [12], Gemini, and Claude, which performed excellently on NLP tasks (Figure 1).

In recent years, a growing number of medical LLMs have emerged, such as Med-PaLM, which is based on PaLM. Med-PaLM was the first LLM to achieve a passing score on the United States Medical Licensing Examination (USMLE). It was not only comparable with clinicians in medical knowledge retrieval, but also demonstrated significant advantages in answering patients' medical questions [13]. Additionally, Med-PaLM 2 was the first LLM to reach the level of human experts in answering USMLE-style questions, correctly answering multiple-choice and open-ended questions with an accuracy of up to 86.5% [14].

The principles of LLMs

Currently, LLMs typically undergo two stages: first, acquiring NLP capabilities through pre-training, and then further optimizing the model for specific domains through post-training. Pre-training is the initial stage of language model learning, usually adopting a framework based on the Transformer model. The models learn from large-scale unlabeled text data in an unsupervised manner, capturing the linguistic patterns, structures, and grammar of the text corpus. This process enables models to understand the contextual information and semantic relationships in text, while equipping them with rich vocabulary knowledge [9, 15]. Post-training refers to further adjusting and optimizing the model through methods like fine-tuning and alignment to improve its performance on specific tasks. Fine-tuning is the process of further training LLMs on task-specific datasets, which is an effective parameter calibration technique. The FLAN model released by Google first introduced the paradigm of instruction fine-tuning, enabling the model to better respond to human instructions and thereby generate accurate feedback [16].

In addition, prompt engineering is employed in practical applications to efficiently invoke the powerful capabilities of LLMs. It refers to the design, optimization, and implementation of prompts and instructions, which helps users apply LLMs to various scenarios and research fields. In essence, it is the practice of interacting effectively with artificial intelligence (AI) systems to optimize their performance [17]. In the future, prompt engineering is expected to become an important bridge between users and LLMs.

Comparative Overview of Leading LLMs

In recent years, several representative LLMs have emerged, each demonstrating unique advantages in architectural design and practical deployment. ChatGPT, developed by OpenAI, has shown outstanding performance in NLP, with strong capabilities in understanding complex language structures and semantics and in generating logically coherent, content-rich responses [2]. In the medical domain, ChatGPT has demonstrated potential for clinical decision support. Studies have shown that physicians assisted by GPT-4 perform significantly better in complex case management than those relying on traditional methods [18]. The release of GPT-4o in 2024 further enhanced the model's response speed and operational efficiency, making it suitable for a wide range of task scenarios [19]. The newly launched OpenAI o1 integrates reinforcement learning with chain-of-thought (CoT) prompting, achieving significant improvements in reasoning capabilities and enabling it to handle more complex logical inference tasks [20].
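The prompt-engineering and chain-of-thought ideas discussed above can be illustrated with a minimal sketch. The `build_cot_prompt` function, its template, and the example case below are hypothetical constructions for illustration, not taken from any cited study:

```python
# Minimal sketch of prompt engineering: assembling a structured,
# chain-of-thought-style clinical prompt. Template and example case
# are illustrative only.

def build_cot_prompt(role: str, case_summary: str, question: str) -> str:
    """Combine a role instruction, the case, an explicit step-by-step
    reasoning request, and an output-format constraint into one prompt."""
    return (
        f"You are {role}.\n\n"
        f"Case summary:\n{case_summary}\n\n"
        f"Question: {question}\n"
        "Think step by step: list the key findings, state a differential "
        "diagnosis, then give the single most likely diagnosis.\n"
        "Answer format: Findings / Differential / Most likely diagnosis."
    )

prompt = build_cot_prompt(
    role="an experienced emergency physician",
    case_summary="58-year-old with acute chest pain radiating to the left arm.",
    question="What is the most likely diagnosis?",
)
print(prompt)
```

In practice, the returned string would be sent to an LLM API; systematically varying the role, the reasoning instructions, and the output format is the essence of the prompt engineering described above.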
A cross-sectional study showed that LLMs could accurately assess the criticality of a patient's condition with performance comparable to that of a resident physician [33]. Thus, LLMs show promise for incorporation into emergency department workflows to improve the efficiency and accuracy of emergency triage.

The application of LLMs in the field of radiology similarly shows broad prospects. Studies have shown that LLM-assisted generation of radiology reports not only improves efficiency and quality but also helps surgeons make more accurate surgical decisions [34, 35]. Moreover, LLMs can simplify radiology reports, improving their readability to facilitate patient understanding [36, 37]. In clinical work, LLMs have the potential to automate administrative tasks and outperform medical experts in multiple tasks dealing with clinical text [38, 39]. Therefore, applying LLMs to the optimization of clinical workflows can effectively reduce the documentation burden on medical staff, enabling them to focus more on patients [40, 41].

Medical research

With the popularity of LLMs, an increasing number of medical researchers have begun to utilize them to write academic papers. While LLMs can generate seemingly logical and fluent “academic papers” in a short time, such papers are likely to contain factual errors, logical fallacies, and even fabricated references, among other problems [42]. This has undoubtedly aroused concerns within the academic community about the authenticity and originality of such papers. On the other hand, LLMs also show the potential to assist scientific research. For example, they can help physicians quickly review a large amount of literature and generate abstracts, and they can also help authors with language translation and polishing [43], thereby improving the efficiency of scientific research. Despite the great potential of LLMs in academic writing, the boundaries of their use remain undefined, and the related ethical issues urgently need to be discussed [44]. In addition to article writing, LLMs also demonstrate promising potential for applications in systematic reviews and meta-analyses. For example, as a tool for literature selection, LLMs exhibit high sensitivity and specificity, which can effectively improve work efficiency [45, 46].
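The sensitivity and specificity used to evaluate LLM-based citation screening can be computed as follows. This is a minimal sketch: the include/exclude labels are fabricated for illustration, and in a real study the model's decisions would come from an LLM rather than a hard-coded list:

```python
# Sketch: evaluating LLM include/exclude screening decisions against
# human reviewer reference labels (True = include). Labels below are
# fabricated for illustration only.

def screening_metrics(human, llm):
    """Return (sensitivity, specificity) of LLM decisions vs. human labels."""
    tp = sum(h and m for h, m in zip(human, llm))            # both include
    tn = sum((not h) and (not m) for h, m in zip(human, llm))  # both exclude
    fn = sum(h and (not m) for h, m in zip(human, llm))      # missed include
    fp = sum((not h) and m for h, m in zip(human, llm))      # wrong include
    return tp / (tp + fn), tn / (tn + fp)

human_labels = [True, True, False, False, True, False]  # reference standard
llm_labels   = [True, True, False, True,  True, False]  # model decisions
sens, spec = screening_metrics(human_labels, llm_labels)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# sensitivity=1.00, specificity=0.67
```

High sensitivity matters most in this setting: a missed relevant citation (false negative) is costlier to a systematic review than an extra record passed on to human screening.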
The synergy of multiple strategies is expected to further enhance the accuracy and reliability of model outputs, laying a solid foundation for their widespread application in medicine.

Interpretability

The interpretability of LLMs refers to their capacity to explain their decision-making process in a manner that is comprehensible to humans and to elucidate the relationship between inputs and outputs [63]. However, the majority of current LLMs are ‘black-box’ models with opaque internal workings that make it difficult to explain their predictions [64]. This poor interpretability leads to a number of problems. Firstly, healthcare professionals and patients may be unable to comprehend and trust the clinical decisions and medical recommendations generated by the models, which greatly restricts the application of LLMs in medicine. Secondly, researchers lack understanding of their internal mechanisms, making it difficult to identify potential flaws in LLMs, thereby limiting the improvement of their performance.

Nowadays, in order to overcome this challenge, multidisciplinary collaboration has become an inevitable trend in the development of medical LLMs. Medical experts should be deeply involved in the model development process, integrating professional medical knowledge into model training. They also need to evaluate and correct the model's outputs to ensure that they conform to medical logic and clinical practice. For example, a recent study proposed a multidisciplinary collaborative framework based on role-playing agents to enhance the medical knowledge comprehension and reasoning ability of LLMs by simulating multiple rounds of medical expert discussions [65]. In addition, it has been demonstrated that GPT-4 is capable of simulating the cognitive process of doctors and providing accurate diagnostic outcomes when guided by specific diagnostic-reasoning prompts. This discovery also brings hope for solving the ‘black box’ problem of LLMs, demonstrating their potential for interpretability in medicine [66].
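As a rough illustration of the role-playing, multi-round discussion idea described above, the following sketch replaces real LLM calls with a stubbed `ask_expert` function so the control flow runs on its own. The specialties, the canned opinions, and the majority-vote rule are hypothetical simplifications, not the framework from the cited study:

```python
# Sketch of a role-playing multi-expert discussion loop. A real system
# would query an LLM once per expert per round, conditioning each
# prompt on the transcript so far; ask_expert is a hard-coded stand-in.

def ask_expert(specialty: str, question: str, transcript: list) -> str:
    """Stand-in for an LLM call prompted with an expert role, the
    question, and the discussion so far. Returns a candidate answer."""
    opinions = {  # hypothetical canned opinions for illustration
        "cardiology": "acute coronary syndrome",
        "pulmonology": "pulmonary embolism",
        "emergency": "acute coronary syndrome",
    }
    return opinions[specialty]

def panel_discussion(question: str, specialties: list, rounds: int = 2) -> str:
    """Run a multi-round discussion, then return the majority answer."""
    transcript = []
    for _ in range(rounds):
        for s in specialties:
            transcript.append((s, ask_expert(s, question, transcript)))
    answers = [a for _, a in transcript]
    return max(set(answers), key=answers.count)  # simple majority vote

consensus = panel_discussion(
    "Most likely diagnosis for acute chest pain with dyspnea?",
    ["cardiology", "pulmonology", "emergency"],
)
print(consensus)  # "acute coronary syndrome" wins the vote 4-2
```

The point of the multi-round structure is that each simulated expert can revise its opinion after seeing the others' reasoning; the aggregation step then surfaces a consensus rather than a single model's first guess.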
architectures have shown promising potential for enabling energy-efficient deployment of LLMs [83]. Future research should aim to develop medical LLMs that combine high performance with energy efficiency, thereby facilitating their broad and sustainable application in healthcare.

Ethical concerns

The application of LLMs in medicine faces numerous ethical challenges. 1. Data privacy and security: LLMs require massive amounts of patient data during training. In the absence of comprehensive security measures, the models could potentially memorize and disclose this information, thus threatening patient privacy and data security [6]. 2. Fairness and bias: If the dataset is biased, for example, if there is insufficient data on certain races, genders, or socioeconomic statuses, the model's outputs may be biased, leading to unfair distribution of healthcare resources or irrational diagnosis and treatment protocols [84]. 3. Liability determination: When LLMs are applied to assist in clinical decision-making, there is currently no consensus on determining liability in cases where the model provides incorrect recommendations that lead to adverse outcomes. 4. Academic integrity: The powerful text generation capabilities of LLMs have been used by some scholars to write medical papers [85, 86] and even to generate false research data and images [87, 88], which raises concerns about academic integrity.

Therefore, it is crucial to give high priority to ethical issues in the development and application of LLMs in medicine. Under the premise of ensuring that everyone can benefit equally from medical LLMs, we need to actively explore and establish robust ethical guidelines and regulatory mechanisms to protect patient privacy and prevent data misuse. At the same time, rigorous clinical trial validation is required to ensure the safety and efficacy of medical LLMs. Currently, clinical studies on the application of LLMs in the field of medicine are still relatively limited [7]. In the future, more prospective clinical trials are needed to evaluate the performance of LLMs in real clinical settings to avoid potential risks.

Conclusions

The rapid development of LLMs in medicine is exciting; however, the challenges they face are equally significant and cannot be ignored. Improving model accuracy and interpretability, addressing the lack of evaluation benchmarks and data, and tackling energy consumption and related ethical issues will be the focus of future research. Notably, despite their improving performance on medical tasks, LLMs are currently not capable of replacing human physicians, particularly in complex clinical decision-making. Under appropriate ethical and safety safeguards, rigorously validated LLMs have the potential to become valuable tools for optimizing clinical workflows and improving communication between doctors and patients. Looking ahead, the development of medical LLMs requires the joint participation of medical professionals, AI specialists, ethicists, and experts from other fields. By establishing unified evaluation benchmarks, developing multimodal LLMs, and conducting more prospective clinical trials, LLMs are expected to break through existing bottlenecks, provide patients with more accurate and personalized healthcare services, and help smart healthcare move to a higher level.

Acknowledgements

Funding

This work was supported by the National Natural Science Foundation of China (82271507), Beijing Natural Science Foundation (JQ24041), Noncommunicable Chronic Diseases-National Science and Technology Major Project (2023ZD0505403), and Beijing Physician Scientist Training Project (BJPSTP-2024-04).

Author Contributions

Erlan Yu and Xuehong Chu: Writing—Original draft preparation and Editing. Wanwan Zhang: Conceptualization, Writing—Reviewing and Editing. Xiangbin Meng and Yaodong Yang: Writing—Reviewing and Editing. Chuanjie Wu and Xunming Ji: Conceptualization, Supervision, Writing—Reviewing and Editing.

Competing Interests

The authors have declared that no competing interest exists.

References

1. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language Models are Few-Shot Learners. ArXiv. 2020; abs/2005.14165.
2. OpenAI. Introducing ChatGPT. 2022.
3. Liu J, Wang C, Liu S. Utility of ChatGPT in Clinical Practice. J Med Internet Res. 2023; 25: e48568.
4. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Laleh NG, et al. The future landscape of large language models in medicine. Commun Med (Lond). 2023; 3: 141.
5. Xu X, Chen Y, Miao J. Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review. J Educ Eval Health Prof. 2024; 21: 6.
6. Ong JCL, Chang SY, William W, Butte AJ, Shah NH, Chew LST, et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digit Health. 2024; 6: e428-e32.
7. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023; 29: 1930-40.
8. Shanahan M. Talking about Large Language Models. Communications of the ACM. 2022; 67: 68-79.
9. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics; 2019.
10. Radford A, Narasimhan K. Improving Language Understanding by Generative Pre-Training. 2018.
11. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, Lacroix T, et al. LLaMA: Open and Efficient Foundation Language Models. ArXiv. 2023; abs/2302.13971.
12. Anil R, Dai AM, Firat O, Johnson M, Lepikhin D, Passos AT, et al. PaLM 2 Technical Report. ArXiv. 2023; abs/2305.10403.
13. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023; 620: 172-80.
14. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L, et al. Towards Expert-Level Medical Question Answering with Large Language Models. ArXiv. 2023; abs/2305.09617.
15. Vaswani A, Shazeer NM, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is All you Need. Neural Information Processing Systems; 2017.
16. Wei J, Bosma M, Zhao V, Guu K, Yu AW, Lester B, et al. Finetuned Language Models Are Zero-Shot Learners. ArXiv. 2021; abs/2109.01652.
17. Meskó B. Prompt Engineering as an Important Emerging Skill for Medical Professionals: Tutorial. J Med Internet Res. 2023; 25: e50638.
18. Goh E, Gallo RJ, Strong E, Weng Y, Kerman H, Freed JA, et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nat Med. 2025; 31: 1233-8.
19. OpenAI. Hello GPT-4o. 2024.
20. OpenAI. Learning to Reason with LLMs. OpenAI; 2024.
21. Bai Y, Kadavath S, Kundu S, Askell A, Kernion J, Jones A, et al. Constitutional AI: Harmlessness from AI Feedback. ArXiv. 2022; abs/2212.08073.
22. Chen D, Parsa R, Hope A, Hannon B, Mak E, Eng L, et al. Physician and Artificial Intelligence Chatbot Responses to Cancer Questions From Social Media. JAMA Oncol. 2024; 10: 956-60.
23. Team G, Anil R, Borgeaud S, Alayrac J-B, Yu J, Soricut R, et al. Gemini: a family of highly capable multimodal models. ArXiv. 2023; abs/2312.11805.
24. Rao A, Pang M, Kim J, Kamineni M, Lie W, Prasad AK, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res. 2023; 25: e48659.
25. Abdullahi T, Singh R, Eickhoff C. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models. JMIR Med Educ. 2024; 10: e51391.
26. El Haj M, Boutoleau-Bretonnière C, Gallouj K, Wagemann N, Antoine P, Kapogiannis D, et al. ChatGPT as a Diagnostic Aid in Alzheimer's Disease: An Exploratory Study. J Alzheimers Dis Rep. 2024; 8: 495-500.
27. Salihu A, Meier D, Noirclerc N, Skalidis I, Mauler-Wittwer S, Recordon F, et al. A study of ChatGPT in facilitating Heart Team decisions on severe aortic stenosis. EuroIntervention. 2024; 20: e496-e503.
28. Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. Assessment of a Large Language Model's Responses to Questions and Cases About Glaucoma and Retina Management. JAMA Ophthalmol. 2024; 142: 371-5.
29. He Z, Bhasuran B, Jin Q, Tian S, Hanna K, Shavor C, et al. Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study. J Med Internet Res. 2024; 26: e56655.
30. Yeo YH, Samaan JS, Ng WH, Ting PS, Trivedi H, Vipani A, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol. 2023; 29: 721-32.
31. Acharya A, Shrestha S, Chen A, Conte J, Avramovic S, Sikdar S, et al. Clinical risk prediction using language models: benefits and considerations. J Am Med Inform Assoc. 2024.
32. Beaulieu-Jones BK, Villamar MF, Scordis P, Bartmann AP, Ali W, Wissel BD, et al. Predicting seizure recurrence after an initial seizure-like episode from routine clinical notes using large language models: a retrospective cohort study. Lancet Digit Health. 2023; 5: e882-e94.
33. Williams CYK, Zack T, Miao BY, Sushil M, Wang M, Kornblith AE, et al. Use of a Large Language Model to Assess Clinical Acuity of Adults in the Emergency Department. JAMA Netw Open. 2024; 7: e248895.
34. Bhayana R, Nanda B, Dehkharghanian T, Deng Y, Bhambra N, Elias G, et al. Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer. Radiology. 2024; 311: e233117.
35. Bhayana R, Biswas S, Cook TS, Kim W, Kitamura FC, Gichoya J, et al. From Bench to Bedside With Large Language Models: AJR Expert Panel Narrative Review. AJR Am J Roentgenol. 2024.
36. Doshi R, Amin KS, Khosla P, Bajaj SS, Chheang S, Forman HP. Quantitative Evaluation of Large Language Models to Streamline Radiology Report Impressions: A Multimodal Retrospective Analysis. Radiology. 2024; 310: e231593.
37. Jeblick K, Schachtner B, Dexl J, Mittermeier A, Stüber AT, Topalis J, et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol. 2024; 34: 2817-25.
38. Van Veen D, Van Uden C, Blankemeier L, Delbrouck JB, Aali A, Bluethgen C, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. 2024; 30: 1134-42.
39. Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023; 5: e107-e8.
40. Tripathi S, Sukumaran R, Cook TS. Efficient healthcare with large language models: optimizing clinical workflow and enhancing patient care. J Am Med Inform Assoc. 2024; 31: 1436-40.
41. Roberts K. Large language models for reducing clinicians' documentation burden. Nat Med. 2024; 30: 942-3.
42. Májovský M, Černý M, Kasal M, Komarc M, Netuka D. Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora's Box Has Been Opened. J Med Internet Res. 2023; 25: e46924.
43. Hake J, Crowley M, Coy A, Shanks D, Eoff A, Kirmer-Voss K, et al. Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts. Ann Fam Med. 2024; 22: 113-20.
44. Gao CA, Howard FM, Markov NS, Dyer EC, Ramesh S, Luo Y, et al. Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. NPJ Digit Med. 2023; 6: 75.
45. Oami T, Okada Y, Nakada TA. Performance of a Large Language Model in Screening Citations. JAMA Netw Open. 2024; 7: e2420496.
46. Luo X, Chen F, Zhu D, Wang L, Wang Z, Liu H, et al. Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses. J Med Internet Res. 2024; 26: e56780.
47. Huang J, Yang DM, Rong R, Nezafati K, Treager C, Chi Z, et al. A critical assessment of using ChatGPT for extracting structured data from clinical notes. NPJ Digit Med. 2024; 7: 106.
48. Huang Y, Wu R, He J, Xiang Y. Evaluating ChatGPT-4.0's data analytic proficiency in epidemiological studies: A comparative analysis with SAS, SPSS, and R. J Glob Health. 2024; 14: 04070.
49. Abd-Alrazaq A, AlSaad R, Alhuwail D, Ahmed A, Healy PM, Latifi S, et al. Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions. JMIR Med Educ. 2023; 9: e48291.
50. Wu Y, Zheng Y, Feng B, Yang Y, Kang K, Zhao A. Embracing ChatGPT for Medical Education: Exploring Its Impact on Doctors and Medical Students. JMIR Med Educ. 2024; 10: e52483.
51. Holderried F, Stegemann-Philipps C, Herschbach L, Moldt JA, Nevins A, Griewatz J, et al. A Generative Pretrained Transformer (GPT)-Powered Chatbot as a Simulated Patient to Practice History Taking: Prospective, Mixed Methods Study. JMIR Med Educ. 2024; 10: e53961.
52. Cook DA. Creating virtual patients using large language models: scalable, global, and low cost. Med Teach. 2024: 1-3.
53. Lee H. The rise of ChatGPT: Exploring its potential in medical education. Anat Sci Educ. 2024; 17: 926-31.
54. Cheung BHH, Lau GKK, Wong GTC, Lee EYP, Kulkarni D, Seow CS, et al. ChatGPT versus human in generating medical graduate exam multiple choice questions-A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS One. 2023; 18: e0290691.
55. Laupichler MC, Rother JF, Grunwald Kadow IC, Ahmadi S, Raupach T. Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions. Acad Med. 2024; 99: 508-12.
56. Tangadulrat P, Sono S, Tangtrakulwanich B. Using ChatGPT for Clinical Practice and Medical Education: Cross-Sectional Survey of Medical Students' and Physicians' Perceptions. JMIR Med Educ. 2023; 9: e50658.
57. Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of Hallucination in Natural Language Generation. ACM Comput Surv. 2023; 55: Article 248.
58. Christophe C, Kanithi PK, Munjal P, Raha T, Hayat N, Rajan R, et al. Med42 - Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches. ArXiv. 2024; abs/2404.14779.
59. Ye K, Zhou H, Zhu J, Quinzan F, Shi C. Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning. 2025.
60. Wu J, Zhu J, Qi Y. Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation. ArXiv. 2024; abs/2408.04187.
61. Gilbert S, Kather JN, Hogan A. Augmented non-hallucinating large language models as medical information curators. NPJ Digit Med. 2024; 7: 100.
62. Li D, Yang S, Tan Z, Baik JY, Yun S, Lee J, et al. DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer's Disease Questions with Scientific Literature. ArXiv. 2024; abs/2405.04819.
63. Joyce DW, Kormilitzin A, Smith KA, Cipriani A. Explainable artificial intelligence for mental health through transparency and interpretability for understandability. NPJ Digit Med. 2023; 6: 6.
64. Rudin C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat Mach Intell. 2019; 1: 206-15.
65. Tang X, Zou A, Zhang Z, Zhao Y, Zhang X, Cohan A, et al. MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning. ArXiv. 2023; abs/2311.10537.
66. Savage T, Nayak A, Gallo R, Rangan E, Chen JH. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit Med. 2024; 7: 20.
67. Tang YD, Dong ED, Gao W. LLMs in medicine: The need for advanced evaluation systems for disruptive technologies. Innovation (Camb). 2024; 5: 100622.
68. Liu A, Zhou H, Hua Y, Rohanian O, Clifton LA, Clifton DA. Large Language Models in Healthcare: A Comprehensive Benchmark. ArXiv. 2024; abs/2405.00716.
69. Jin D, Pan E, Oufattole N, Weng W-H, Fang H, Szolovits P. What Disease does
this Patient Have? A Large-scale Open Domain Question Answering Dataset
from Medical Exams. ArXiv. 2020; abs/2009.13081.
70. Pal A, Umapathi LK, Sankarasubbu M. MedMCQA : A Large-scale
Multi-Subject Multi-Choice Dataset for Medical domain Question Answering.
ACM Conference on Health, Inference, and Learning; 2022.
71. Chen RJ, Lu MY, Chen TY, Williamson DFK, Mahmood F. Synthetic data in
machine learning for medicine and healthcare. Nat Biomed Eng. 2021; 5: 493-7.
72. Zhou H, Gu B, Zou X, Li Y, Chen SS, Zhou P, et al. A Survey of Large
Language Models in Medicine: Progress, Application, and Challenge. ArXiv.
2023; abs/2311.05112.
73. Goel A, Gueta A, Gilon O, Liu C, Erell S, Nguyen LH, et al. LLMs Accelerate
Annotation for Medical Information Extraction. ArXiv. 2023; abs/2312.02296.
74. Meskó B. The Impact of Multimodal Large Language Models on Health Care's
Future. J Med Internet Res. 2023; 25: e52865.
75. Meng X, Yan X, Zhang K, Liu D, Cui X, Yang Y, et al. The application of large
language models in medicine: A scoping review. iScience. 2024; 27: 109713.
76. Wu C, Zhang X, Zhang Y, Wang Y, Xie W. Towards Generalist Foundation
Model for Radiology. ArXiv. 2023; abs/2308.02463.
77. Huang H, Zheng O, Wang D, Yin J, Wang Z, Ding S, et al. ChatGPT for
shaping the future of dentistry: the potential of multi-modal large language
model. Int J Oral Sci. 2023; 15: 29.
78. Tu T, Azizi S, Driess D, Schaekermann M, Amin M, Chang P-C, et al. Towards
Generalist Biomedical AI. ArXiv. 2023; abs/2307.14334.
79. Samsi S, Zhao D, McDonald J, Li B, Michaleas A, Jones M, et al. From Words to
Watts: Benchmarking the Energy Costs of Large Language Model Inference.
2023 IEEE High Performance Extreme Computing Conference (HPEC). 2023:
1-9.
80. Agrawal V. Energy Efficient Large Language Models: Advancements and Challenges. International Journal of Scientific Research in Engineering and Management. 2025.
81. DeepSeek-AI, Liu A, Feng B, Xue B, Wang B-L, Wu B, et al. DeepSeek-V3
Technical Report. ArXiv. 2024; abs/2412.19437.
82. DeepSeek-AI, Guo D, Yang D, Zhang H, Song J-M, Zhang R, et al.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement
Learning. ArXiv. 2025; abs/2501.12948.
83. Wang Z, Luo T, Liu C, Liu W, Goh RSM, Wong WF. Enabling Energy-Efficient
Deployment of Large Language Models on Memristor Crossbar: A Synergy of
Large and Small. IEEE Trans Pattern Anal Mach Intell. 2025; 47: 916-33.
84. Chen RJ, Chen TY, Lipková J, Wang JJ, Williamson DFK, Lu MY, et al.
Algorithm Fairness in AI for Medicine and Healthcare. ArXiv. 2021;
abs/2110.00603.
85. Stokel-Walker C. ChatGPT listed as author on research papers: many scientists
disapprove. Nature. 2023; 613: 620-1.
86. Tools such as ChatGPT threaten transparent science; here are our ground rules
for their use. Nature. 2023; 613: 612.
87. Taloni A, Scorcia V, Giannaccare G. Large Language Model Advanced Data
Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA
Ophthalmol. 2023; 141: 1174-5.
88. Zhu L, Lai Y, Mou W, Zhang H, Lin A, Qi C, et al. ChatGPT's ability to
generate realistic experimental images poses a new challenge to academic
integrity. J Hematol Oncol. 2024; 17: 27.
https://www.medsci.org