
Ivyspring International Publisher
International Journal of Medical Sciences
2025; 22(11): 2792-2801. doi: 10.7150/ijms.111780
Review

Large Language Models in Medicine: Applications, Challenges, and Future Directions

Erlan Yu1#, Xuehong Chu1#, Wanwan Zhang1, Xiangbin Meng2, Yaodong Yang3, Xunming Ji1, Chuanjie Wu1
1. Department of Neurology, Xuanwu Hospital, Capital Medical University, Beijing, China.
2. Pengcheng Laboratory, Shenzhen 518055, P. R. China.
3. Institute for AI, Peking University.
#These authors contributed equally to this work, and should be regarded as co-first authors.

 Corresponding author: Chuanjie Wu, Department of Neurology, Xuanwu Hospital, Capital Medical University; No.45, Changchun Street, Xicheng District,
Beijing, China, 100053. Tel: +86-18911366882, E-mail: wuchuanjie@ccmu.edu.cn; Xunming Ji, Department of Neurology, Xuanwu Hospital, Capital Medical
University; No.45, Changchun Street, Xicheng District, Beijing, China, 100053. Email: jixm@ccmu.edu.cn.

© The author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/).
See https://ivyspring.com/terms for full terms and conditions.

Received: 2025.02.08; Accepted: 2025.05.12; Published: 2025.05.31

Abstract
In recent years, large language models (LLMs) represented by GPT-4 have developed rapidly and performed well in various natural language processing tasks, showing great potential and transformative impact. The medical field, with its vast amounts of data and complex diagnostic and treatment processes, is undoubtedly one of the most promising areas for the application of LLMs. At present, LLMs have been gradually implemented in clinical practice, medical research, and medical education. However, in practical applications, medical LLMs still face numerous challenges, including hallucination, limited interpretability, and ethical concerns. Therefore, in-depth exploration is still needed in the areas of standardized evaluation frameworks, multimodal LLMs, and multidisciplinary collaboration, so as to realize the widespread application of medical LLMs and promote the development and transformation of global healthcare. This review offers a comprehensive overview of the applications, challenges, and future directions of LLMs in medicine, providing new insights for the sustained development of medical LLMs.
Keywords: Large language models; Medical applications; Natural language processing; Artificial Intelligence

Introduction
Large language models are deep learning models based on the Transformer architecture, which leverages the self-attention mechanism. They are not only capable of generating natural language text, but also of deeply understanding the meaning of text and handling various natural language tasks, such as text summarization and question answering [1]. In 2022, OpenAI released ChatGPT, which quickly attracted attention and heated discussion across all walks of life [2]. Since then, LLMs exemplified by ChatGPT have been widely used in various fields and have achieved significant breakthroughs, such as OpenAI o1 in mathematics and programming.

Currently, the field of medicine is undergoing rapid development, and there is an urgent need to introduce new tools or explore innovative approaches to solve existing problems. LLMs have attracted considerable attention from clinical experts in recent years due to their powerful natural language processing (NLP) capabilities, and they have become a research hotspot in medicine, bringing unprecedented development opportunities to the field. In clinical practice, LLMs can assist doctors in optimizing clinical decisions by analyzing patient information [3]. In medical research, LLMs can assist in paper writing and in mining and analyzing data, thus improving research efficiency [4]. In medical education, LLMs can simulate real patients and act as virtual teaching assistants, providing personalized learning programs [5].
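The self-attention mechanism mentioned above can be illustrated with a minimal single-head sketch; the token count, embedding size, and random weights below are illustrative toy values, not those of any production model.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project tokens to queries/keys/values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights                 # each output mixes all token values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                     # 5 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)       # out: one contextualized vector per token
```

Because every output row is a weighted mixture of all value vectors, each token's representation incorporates context from the whole sequence, which is what lets the model capture long-range semantic relationships.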

https://www.medsci.org

Despite the great potential of LLMs in medicine, they still face numerous challenges, such as hallucinations, their black-box nature, the lack of evaluation benchmarks and high-quality data, energy consumption, and ethical concerns, which severely limit their practical application [6, 7]. Therefore, it is crucial to summarize and analyze the current research status and development trends of LLMs in medicine.

In this review, we provide a systematic and comprehensive overview of the applications and challenges of LLMs in medicine, along with specific recommendations for their future development, aiming to offer valuable references to clinicians and researchers.

Development of large language models

Progress and innovations in LLMs

LLMs refer to language models with hundreds of billions of parameters or more, trained on vast amounts of text data [8]. In 2018, Google released BERT, a pre-trained language model that pioneered the learning paradigm of "pre-training and fine-tuning" and substantially improved performance on NLP tasks [9]. In the same year, OpenAI released the generative pre-training model GPT [10]. Since then, pre-trained language models have come into the public eye. In 2020, the release of GPT-3, with a parameter scale of 175 billion, officially opened the era of LLMs [1]. In November 2022, OpenAI released ChatGPT, an important milestone in the development of LLMs [2]. Subsequently, LLMs entered a phase of rapid development: Meta, Google, Anthropic, and other companies released multiple LLMs, such as LLaMA [11], PaLM 2 [12], Gemini, and Claude, which performed excellently in NLP tasks (Figure 1).

In recent years, a growing number of medical LLMs have emerged, such as Med-PaLM, which is based on PaLM. Med-PaLM was the first LLM to achieve a passing score on the United States Medical Licensing Examination (USMLE). It was not only comparable with clinicians in medical knowledge retrieval but also demonstrated significant advantages in answering patients' medical questions [13]. Additionally, Med-PaLM 2 was the first LLM to reach the level of human experts in answering USMLE-style questions, correctly answering multiple-choice and open-ended questions with an accuracy of up to 86.5% [14].

The principles of LLMs

Currently, LLMs typically undergo two stages: first, acquiring NLP capabilities through pre-training, and then being further optimized for specific domains through post-training. Pre-training is the initial stage of language model learning, usually adopting a framework based on the Transformer model. The models learn from large-scale unlabeled text data in an unsupervised manner, capturing the linguistic patterns, structures, and grammar of the text corpus. This process enables models to understand contextual information and semantic relationships in text, while equipping them with rich vocabulary knowledge [9, 15]. Post-training refers to further adjusting and optimizing the model through methods such as fine-tuning and alignment to improve its performance on specific tasks. Fine-tuning is the process of further training LLMs on task-specific datasets and is an effective parameter calibration technique. The FLAN model released by Google first introduced the instruction fine-tuning paradigm, enabling models to respond better to human instructions and thereby generate accurate feedback [16].

In addition, prompt engineering is employed in practical applications to efficiently invoke the powerful capabilities of LLMs. It refers to the design, optimization, and implementation of prompts and instructions, helping users apply LLMs to various scenarios and research fields. In essence, it is the practice of interacting effectively with artificial intelligence (AI) systems to optimize their performance [17]. In the future, prompt engineering is expected to become an important bridge between users and LLMs.

Comparative Overview of Leading LLMs

In recent years, several representative LLMs have emerged, each demonstrating unique advantages in architectural design and practical deployment. ChatGPT, developed by OpenAI, has shown outstanding performance in NLP, with strong capabilities in understanding complex language structures and semantics and in generating logically coherent, content-rich responses [2]. In the medical domain, ChatGPT has demonstrated potential for clinical decision support: studies have shown that physicians assisted by GPT-4 perform significantly better in complex case management than those relying on traditional methods [18]. The release of GPT-4o in 2024 further enhanced the model's response speed and operational efficiency, making it suitable for a wide range of task scenarios [19]. The newly launched OpenAI o1 integrates reinforcement learning with chain-of-thought (CoT) prompting, achieving significant improvements in reasoning capabilities and enabling it to handle more complex logical inference tasks [20].
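At the prompt level, chain-of-thought prompting amounts to instructing the model to reason step by step before answering, as in the following minimal sketch; the template wording and the case vignette are illustrative assumptions, not taken from any cited study.

```python
def build_cot_prompt(case_summary: str, question: str) -> str:
    """Assemble a chain-of-thought prompt that asks the model to lay out its
    reasoning before committing to an answer (illustrative template only)."""
    return (
        "You are assisting with a clinical reasoning exercise.\n"
        f"Case summary: {case_summary}\n"
        f"Question: {question}\n"
        "Let's think step by step: list the key findings, state the differential "
        "diagnoses they suggest, and only then give a final answer."
    )

# Hypothetical vignette used purely to show the template in action.
prompt = build_cot_prompt(
    "58-year-old with acute chest pain radiating to the left arm",
    "What initial diagnoses should be considered?",
)
```

The explicit "step by step" instruction is the essence of the technique: it elicits intermediate reasoning that can be inspected, rather than a bare final answer.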


Figure 1. Development timeline of LLMs.

Meanwhile, Claude, developed by Anthropic, introduces the concept of Constitutional AI (CAI), emphasizing the helpfulness, harmlessness, and truthfulness of model outputs, which makes it particularly suitable for sensitive application areas with strict safety and ethical requirements [21]. For example, in a study comparing responses to cancer-related patient questions, Claude outperformed physicians in empathy, quality, and readability, highlighting its potential in ethically sensitive medical communication [22]. Gemini, developed by Google DeepMind, is characterized by a native multimodal architecture that enables the coordinated processing of text, images, audio, video, and code within a unified framework, significantly enhancing cross-modal understanding and reasoning capabilities [23]. Additionally, the Llama series released by Meta, as the first open-source LLM available for commercial use, offers a high degree of flexibility and customizability, allowing researchers and developers to tailor and optimize the models according to specific needs, thus promoting the widespread adoption and innovative development of AI technologies [11].

Medical applications of LLMs

Since 2023, LLMs represented by ChatGPT have gradually begun to be applied in the field of medicine, playing an important role in clinical practice, medical research, and medical education (Figure 2).

Clinical practice

Currently, LLMs are widely used to assist physicians in clinical decision-making, including initial diagnosis, differential diagnosis, and clinical management. Research showed that, based on information such as the history of present illness and physical examination, ChatGPT achieved an accuracy of 60.3% in determining the differential diagnosis; when additional information, such as the results of relevant medical tests, was added, its accuracy in narrowing down the final diagnosis increased to 76.9% [24]. Notably, LLMs have been shown to surpass the average population consensus in diagnosing rare and complex cases, and they are expected to help address the issues of delayed diagnosis and misdiagnosis in the future [25]. Furthermore, recent studies have shown that LLMs play a significant role in aiding decision-making within clinical subspecialties such as neurology and cardiology, for example in diagnosing Alzheimer's disease and managing valvular heart diseases [26, 27].

LLMs also have a wide range of applications in medical question answering. They can not only answer a variety of patient questions regarding the diagnosis, treatment, and management of diseases [28], but also help interpret laboratory test results [29] and even provide emotional support [30]. Furthermore, LLMs can improve the performance of clinical risk prediction based on structured electronic health records [31]. For example, a retrospective study found that LLMs with additional pre-training performed excellently in predicting the risk of recurrence after an initial seizure-like episode [32].
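One simple way structured electronic health records can be presented to an LLM for such risk-prediction queries is to serialize the record into a plain-text prompt, as in this sketch; the field names, values, and wording are hypothetical, and a real system would submit the prompt to a clinically validated model rather than merely build the string.

```python
def ehr_to_prompt(record: dict) -> str:
    """Serialize a structured EHR record into a plain-text risk-prediction
    prompt for an LLM (field names are illustrative placeholders)."""
    lines = [f"{field}: {value}" for field, value in record.items()]
    return (
        "Patient record:\n"
        + "\n".join(lines)
        + "\nEstimate the risk of seizure recurrence and explain the key factors."
    )

# Hypothetical record mirroring the recurrence-prediction use case above.
record = {
    "age": 42,
    "event": "first seizure-like episode",
    "EEG": "focal epileptiform discharges",
    "MRI": "no structural lesion",
}
prompt = ehr_to_prompt(record)
```

Keeping the serialization deterministic (one "field: value" line per record entry) makes the prompt auditable, which matters when model outputs feed into clinical decisions.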


cross-sectional study showed that LLMs could number of medical researchers have begun to utilize
accurately assess the criticality of a patient's condition them to write academic papers. While LLMs can
with performance comparable to that of a resident generate seemingly logical and fluent “academic
physician [33]. Thus, LLMs are promising to be papers” in a short time, such papers are likely to
incorporated into emergency department workflows contain factual errors, logical fallacies, and even
to improve the efficiency and accuracy of emergency fabricated references, among other problems [42].
triage. This has undoubtedly aroused concerns within the
The application of LLMs in the field of radiology academic community about the authenticity and
similarly shows broad prospects. Studies have shown originality of the papers. On the other hand, LLMs
that assisted generation of radiology reports using also show the potential to assist scientific research. For
LLMs not only improves efficiency and quality but example, they can help physicians quickly review a
also helps surgeons make more accurate surgical large amount of literature and generate abstracts, and
decisions [34, 35]. Moreover, LLMs can simplify they can also help authors with language translation
radiology reports, improving their readability to and polishing [43], thereby improving the efficiency
facilitate patient understanding [36, 37]. In clinical of scientific research. Despite the great potential of
work, LLMs have the potential to automate LLMs in academic writing, the boundaries of their use
administrative tasks and outperform medical experts remain undefined, and the related ethical issues
in multiple tasks dealing with clinical text [38, 39]. urgently need to be discussed [44]. In addition to
Therefore, applying LLMs to the optimization of article writing, LLMs also demonstrate promising
clinical workflows can effectively reduce the potential for applications in systematic reviews and
documentation burden on medical staff, enabling meta-analyses. For example, as a tool for literature
them to focus more on patients [40, 41]. selection, LLMs exhibit high sensitivity and
specificity, which can effectively improve work
Medical research efficiency [45, 46].
With the popularity of LLMs, an increasing
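The sensitivity and specificity used to assess LLM-based literature selection are computed from screening decisions against human reference labels, as in this sketch; the eight screening votes below are synthetic examples, not data from the cited studies.

```python
def screening_metrics(llm_votes, human_labels):
    """Sensitivity and specificity of an LLM literature screener versus human
    reference labels (True = include the study, False = exclude)."""
    pairs = list(zip(llm_votes, human_labels))
    tp = sum(v and h for v, h in pairs)            # correctly included
    tn = sum((not v) and (not h) for v, h in pairs)  # correctly excluded
    fp = sum(v and (not h) for v, h in pairs)      # wrongly included
    fn = sum((not v) and h for v, h in pairs)      # wrongly excluded (missed studies)
    return tp / (tp + fn), tn / (tn + fp)

# Synthetic screening decisions for eight candidate papers.
llm   = [True, True, False, False, True, False, False, True]
human = [True, True, False, False, False, False, True, True]
sens, spec = screening_metrics(llm, human)  # 0.75, 0.75
```

For systematic reviews, sensitivity is usually the critical metric, since a missed relevant study (a false negative) is more damaging than an extra paper passed on to human screening.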

Figure 2. Applications of LLMs in medicine.


LLMs also demonstrate powerful data processing and analysis capabilities in medical research. For example, they can transform unstructured data, such as medical records and test results, into extractable structured data, providing stronger data support for medical research [47]. In addition, research showed that, compared with traditional statistical software (SAS, SPSS, and R), GPT-4 demonstrated numerous advantages in data analysis, such as more efficient analysis and a friendlier, more intuitive user interface. In the future, LLMs are expected to become a powerful auxiliary tool for statistical analysis and further promote the development of medical research [48].

Medical education

LLMs have broad application prospects in medical education, covering various aspects such as personalized learning, educational material generation, and student assessment [49]. During the learning process, LLMs can serve as virtual teaching assistants, providing personalized guidance and feedback and adjusting teaching strategies in a timely manner according to students' individual differences and learning progress [50]. For example, they can automatically generate more targeted practice questions based on students' answering performance, helping them consolidate their knowledge and address gaps. LLMs can also simulate real patients, conducting interactive dialogues with medical students for the training and assessment of clinical skills such as history taking and diagnostic reasoning [51, 52]. Similar to standardized patients, they offer realistic and versatile training scenarios for medical students. Studies showed that this personalized learning approach based on LLMs could effectively improve students' learning interest, engagement, and learning outcomes [53]. Additionally, LLMs can be used in medical exams, for example to automatically generate high-quality multiple-choice questions and reduce the burden on teachers [54, 55]. An increasing amount of research indicates that medical students hold positive attitudes toward the application of LLMs in medical education [56]. It is foreseeable that LLMs will continue to drive transformation in medical education and provide new possibilities for training future medical professionals.

Challenges and Future Development Directions of LLMs in Medicine

This section explores in depth the challenges faced by LLMs in medicine and proposes corresponding strategies for their future development (Table 1).

Hallucination

Hallucination in LLMs refers to the generation of results that are meaningless or inconsistent with the provided source content [57]. In the medical field, LLMs may generate responses that include fictitious drug recommendations or cite non-existent clinical studies as supporting evidence. Such hallucinations may lead to misdiagnosis, inappropriate treatment, and incorrect medical management. It is therefore crucial to reduce hallucinations to ensure the accuracy and reliability of the outputs produced by LLMs.

To address this issue, researchers have proposed several effective strategies, including fine-tuning, reinforcement learning from human feedback (RLHF), and retrieval-augmented generation (RAG). Fine-tuning refers to retraining a pre-trained model on a domain-specific dataset, such as medical data, to improve its task adaptability. For example, Clément Christophe et al. applied a combination of instruction-tuning and parameter-efficient tuning to the LLaMA-2 model using a large-scale medical question-answering dataset, which significantly improved the model's accuracy on the USMLE benchmark and effectively reduced the occurrence of hallucinations [58]. RLHF leverages human feedback to optimize the model's output behavior, aiming to align it better with human values and expectations. This technique has been widely applied in mainstream LLMs such as ChatGPT and Claude, further reducing the occurrence of hallucinations in medical question answering and complex reasoning tasks [59]. RAG is a method that retrieves external knowledge, such as clinical guidelines, in advance and incorporates it into the generation process to ensure that the output is grounded in factual information. A representative example is the MedGraphRAG framework proposed by Junde Wu et al., which incorporates graph-based medical retrieval to significantly improve model performance on multiple medical benchmark tests [60]. RAG is now increasingly recognized as a key strategy for mitigating hallucinations in medical LLMs.

Furthermore, the combination of LLMs and knowledge graphs (KGs) is considered an effective approach to the hallucination problem [61]. For example, Dawei Li et al. proposed DALK, a dynamic collaborative enhancement framework for LLMs and KGs. Results on an Alzheimer's disease question-answering benchmark show that DALK outperforms other AI techniques in overall performance [62].

In the future, addressing hallucination will rely on the integration and optimization of techniques such as fine-tuning, RLHF, and RAG.
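A minimal retrieval-augmented generation loop can be sketched as follows; the word-overlap retriever is a stand-in for a real embedding-based one, and the guideline snippets are invented placeholders rather than actual clinical guidance.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank snippets by word overlap with the query (a toy stand-in for an
    embedding-based retriever) and keep the top k."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_rag_prompt(query: str, corpus: list[str]) -> str:
    """Prepend retrieved evidence so the model's answer is grounded in it."""
    evidence = retrieve(query, corpus)
    return (
        "Answer using ONLY the evidence below; say so if it is insufficient.\n"
        + "\n".join(f"- {doc}" for doc in evidence)
        + f"\nQuestion: {query}"
    )

# Invented guideline snippets for illustration only.
guidelines = [
    "Guideline A: first-line therapy for condition X is drug Y.",
    "Guideline B: drug Z is contraindicated in renal impairment.",
    "Guideline C: annual screening is recommended after age 50.",
]
prompt = build_rag_prompt("What is first-line therapy for condition X?", guidelines)
```

The "answer using only the evidence" instruction is what ties generation to the retrieved sources; hallucination is reduced because unsupported claims can be checked against the quoted snippets.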


The synergy of multiple strategies is expected to further enhance the accuracy and reliability of model outputs, laying a solid foundation for their widespread application in medicine.

Interpretability

The interpretability of LLMs refers to their capacity to explain their decision-making process in a manner comprehensible to humans and to elucidate the relationship between inputs and outputs [63]. However, the majority of current LLMs are 'black-box' models with opaque internal workings, which makes it difficult to explain their predictions [64]. This poor interpretability leads to a number of problems. Firstly, healthcare professionals and patients may be unable to comprehend and trust the clinical decisions and medical recommendations generated by the models, which greatly restricts the application of LLMs in medicine. Secondly, researchers lack understanding of their internal mechanisms, making it difficult to identify potential flaws in LLMs and thereby limiting the improvement of their performance.

To overcome this challenge, multidisciplinary collaboration has become an inevitable trend in the development of medical LLMs. Medical experts should be deeply involved in the model development process, integrating professional medical knowledge into model training. They also need to evaluate and correct the model's outputs to ensure that they conform to medical logic and clinical practice. For example, a recent study proposed a multidisciplinary collaborative framework based on role-playing agents to enhance the medical knowledge comprehension and reasoning ability of LLMs by simulating multiple rounds of medical expert discussion [65]. In addition, it has been demonstrated that GPT-4 is capable of simulating the cognitive process of doctors and providing accurate diagnostic outcomes when diagnostic reasoning is guided with specific prompts. This finding brings hope for solving the 'black box' problem of LLMs and demonstrates their potential for interpretability in medicine [66].

Table 1. Summary of challenges and future development directions of LLMs in medicine.

Challenge: Hallucination
Description: The outputs generated by LLMs may appear reasonable but fail to align with the user's input, contradict prior context, or are inconsistent with the facts.
Future development directions:
- Fine-tuning: retraining a pre-trained model on domain-specific data, such as medical texts, to enhance its performance on specialized tasks.
- RLHF: leveraging human feedback to optimize model outputs and better align them with human values and expectations.
- RAG: retrieving external knowledge, such as clinical guidelines, prior to generation to ensure that model outputs are grounded in factual information.

Challenge: Interpretability
Description: Interpretability refers to the ability of LLMs to reveal their internal reasoning chains and decision-making processes in a manner comprehensible to humans; the 'black-box' nature of LLMs greatly reduces user trust and the reliability of results.
Future development directions:
- Guiding diagnostic reasoning through specific prompts: with structured prompts, LLMs can correlate the patient's history, symptoms, and ancillary examinations to form a clear chain of reasoning.
- Multidisciplinary collaboration: medical experts collaborate with AI specialists in the development of LLMs to optimize their decision-making pathways.

Challenge: Evaluation benchmarks
Description: Current medical LLMs still lack extensive and comprehensive evaluation benchmarks that reflect real clinical workflows, making it difficult to systematically measure and compare the performance of different LLMs.
Future development directions:
- Use desensitized real electronic health records and medical literature to construct more representative and challenging evaluation datasets that better simulate actual clinical environments.
- Design more complex evaluation tasks, such as diagnostic reasoning, treatment recommendation, and doctor-patient dialogue generation, to assess a model's overall capabilities more comprehensively.

Challenge: Data limitations
Description: Due to ethical concerns and the highly specialized nature of the medical field, the acquisition, processing, and use of clinical data are severely restricted, significantly hindering the development of medical LLMs. In addition, text-only data is no longer sufficient to meet the needs of medical diagnosis and treatment.
Future development directions:
- Establish standardized data-sharing models.
- Develop new techniques for data annotation and pre-processing to improve the quality and efficiency of data processing.
- Multimodal LLMs: models that can simultaneously process and understand medical data in multiple modalities, such as text, images, and speech, to achieve more comprehensive and accurate medical information analysis and knowledge reasoning.

Challenge: Energy consumption
Description: The training and inference of LLMs demand substantial energy and rely on high-performance GPUs, yet many hospitals, especially in resource-limited regions, lack the infrastructure and funding to sustain such energy-intensive AI systems.
Future development directions:
- Algorithmic optimization: techniques such as quantization, knowledge distillation, sparsification, and pruning reduce computational demands.
- Hardware innovation: emerging low-power hardware, such as memristor crossbar architectures, enables more energy-efficient deployment of LLMs.

Challenge: Ethical concerns
Description: Data privacy and security; fairness and bias; liability determination; academic integrity.
Future development directions:
- Establish robust ethical guidelines and regulatory measures.
- Conduct more prospective clinical trials to provide solid scientific evidence for the practical application of LLMs in medicine.


Evaluation benchmarks

At present, due to the lack of unified evaluation benchmarks, medical professionals are unable to objectively and comprehensively compare the performance of different LLMs. It is therefore difficult to judge the accuracy and reliability of a model's outputs, which severely limits the application of LLMs in real clinical scenarios [67].

Recently, researchers have made significant efforts in this regard and proposed a series of evaluation benchmarks for medical LLMs. For example, Singhal, K. et al. proposed MultiMedQA, an evaluation benchmark that is closer to human standards; it covers multiple aspects of professional medicine, medical research, and patient consultation and is used to evaluate a model's ability in medical question answering [13]. BenchHealth, another evaluation benchmark, introduces multidimensional metrics such as relevance, fidelity, comprehensiveness, generalizability, and robustness to assess model performance more comprehensively [68].

However, existing evaluation benchmarks predominantly focus on closed-ended medical question-answering tasks [69, 70]. This assessment approach can hardly reflect the complexity of real clinical settings, as in actual clinical practice doctors often need to answer open-ended questions with no predefined options, based on the specific circumstances of patients [68]. Future research therefore needs to focus on developing evaluation benchmarks that are more closely aligned with real-world medical scenarios, thus better promoting the standardized application of LLMs in medicine.

Data limitations

The training and evaluation of LLMs rely on large-scale, diverse, and representative datasets [71]. However, in the medical field, accessing, processing, and using clinical data faces several challenges that severely limit the development of medical LLMs. Firstly, access to clinical data is constrained by strict ethical, legal, and privacy protections. Authorization for use can only be granted after a complex approval process, which results in relatively few datasets being available for training LLMs [72]. Secondly, medical data usually needs to be manually annotated by experienced medical experts to ensure its accuracy and professionalism. This process is not only time-consuming and labor-intensive but also poses a significant challenge to data processing efficiency [73]. Given that high-quality data is crucial for training and evaluating LLMs, acquiring and processing clinical data efficiently and securely is a key prerequisite for promoting the widespread use of LLMs in medicine.

Furthermore, medical diagnosis and treatment often require the integration of multimodal information, such as textual medical histories and imaging results, for comprehensive judgment [74]. Single-modal text data is no longer sufficient to meet the demand for multi-source heterogeneous data analysis in medicine. For this reason, researchers have begun to explore the application of multimodal LLMs in medicine [75]. For example, in radiology, multimodal LLMs can combine images with the corresponding text reports to assist physicians in making more accurate imaging diagnoses [76]. In dentistry, researchers are trying to use multimodal LLMs to integrate patients' oral images and spoken symptom descriptions, aiming for fully automated diagnosis of oral diseases [77]. Med-PaLM M, proposed by Tu T. et al., is one of the successful cases of multimodal LLM applications in medicine; it can flexibly encode and interpret multiple types of biomedical data and shows great potential for application in disease diagnosis, treatment recommendation, and drug development [78].

Energy consumption

The training and inference processes of LLMs consume considerable energy and are heavily dependent on high-performance graphics processing units (GPUs), such as NVIDIA's A100 and H100. Studies have shown that executing any meaningful inference with the 65B LLaMA model requires at least eight V100 GPUs with 32 GB of memory each, or four A100 GPUs with 80 GB of memory each [79]. However, most hospitals and healthcare institutions, particularly those in resource-limited regions, lack the infrastructure and financial capacity to support the continuous operation of such energy-intensive AI systems. This presents a significant challenge for the deployment and application of LLMs in medicine.

In recent years, continuous progress in energy-efficient model design has provided feasible directions for addressing this challenge. Techniques such as quantization, knowledge distillation, sparsification, pruning, and mixture-of-experts (MoE) architectures have enabled researchers to significantly reduce the computational demands of LLMs while preserving their performance [80]. DeepSeek-R1, for instance, employs a MoE architecture that selectively activates only task-relevant model parameters, thereby reducing computational cost during inference while sustaining strong performance in specialized domains [81, 82].

https://www.medsci.org
Int. J. Med. Sci. 2025, Vol. 22 2799

architectures, have shown promising potential for currently not capable of replacing human physicians,
enabling energy-efficient deployment of LLMs [83]. particularly in complex clinical decision-making.
Future research should aim to develop medical LLMs Under appropriate ethical and safety safeguards,
that combine high performance with energy rigorously validated LLMs have the potential to
efficiency, thereby facilitating their broad and become valuable tools for optimizing clinical
sustainable application in healthcare. workflows and improving communication between
doctors and patients. Looking ahead, the
Ethical concerns development of medical LLMs requires the joint
The application of LLMs in medicine faces participation of medical professionals, AI specialists,
numerous ethical challenges. 1. Data privacy and ethicists, and experts from other fields. By
security: LLMs require massive amounts of patient establishing unified evaluation benchmarks,
data during training. In the absence of comprehensive developing multimodal LLMs, and conducting more
security measures, the models could potentially prospective clinical trials, LLMs are expected to break
memorize and disclose this information during the through the existing bottlenecks, provide patients
training process, thus threatening patient privacy and with more accurate and personalized healthcare
data security [6]. 2. Fairness and bias: If the dataset is services, and help smart healthcare move to a higher
biased, for example, if there is insufficient data on level.
certain races, genders, or socioeconomic statuses, the
model’s output results may be biased, leading to Acknowledgements
unfair distribution of healthcare resources or
Funding
irrational diagnosis and treatment protocols [84]. 3.
Liability determination: When LLMs are applied to This work was supported by the National
assist in clinical decision-making, there are currently Natural Science Foundation of China (82271507),
no consensus for determining liability in cases where Beijing Natural Science Foundation (JQ24041),
the model provides incorrect recommendations that Noncommunicable Chronic Diseases-National
lead to adverse outcomes. 4. Academic integrity: The Science and Technology Major Project (2023ZD
powerful text generation capabilities of LLMs have 0505403), and Beijing Physician Scientist Training
been used by some scholars to write medical papers Project (BJPSTP-2024-04).
[85, 86] and even to generate false research data and
Author Contributions
images [87, 88], which raises concerns about academic
integrity. Erlan Yu and Xuehong Chu: Writing—Original
Therefore, it is crucial to give high priority to draft preparation and Editing. Wanwan Zhang:
ethical issues in the development and application of Conceptualization, Writing—Reviewing and Editing.
LLMs in medicine. Under the premise of ensuring that
everyone can benefit equally from the medical LLMs, Xiangbin Meng and Yaodong Yang: Writing —
we need to actively explore and establish robust Reviewing and Editing. Chuanjie Wu and Xunming Ji:
ethical guidelines and regulatory mechanisms to Conceptualization, Supervision, Writing—Reviewing
protect patient privacy and prevent data misuse. At and Editing.
the same time, rigorous clinical trial validation is
required to ensure the safety and efficacy of medical Competing Interests
LLMs. Currently, clinical studies on the application of
The authors have declared that no competing
LLMs in the field of medicine are still relatively
interest exists.
limited [7]. In the future, more prospective clinical
trials are needed to evaluate the performance of LLMs References
in real clinical settings to avoid potential risks.
1. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al.
Language Models are Few-Shot Learners. ArXiv. 2020; abs/2005.14165.
Conclusions 2. OpenAI. Introducing ChatGPT. 2022.
3. Liu J, Wang C, Liu S. Utility of ChatGPT in Clinical Practice. J Med Internet
The rapid development of LLMs in medicine is Res. 2023; 25: e48568.
4. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Laleh NG, et al.
exciting, however, the challenges they face are equally The future landscape of large language models in medicine. Commun Med
significant and cannot be ignored. Improving the (Lond). 2023; 3: 141.
5. Xu X, Chen Y, Miao J. Opportunities, challenges, and future directions of large
accuracy and interpretability of models, addressing language models, including ChatGPT in medical education: a systematic
the lack of evaluation benchmarks and data, energy 6.
scoping review. J Educ Eval Health Prof. 2024; 21: 6.
Ong JCL, Chang SY, William W, Butte AJ, Shah NH, Chew LST, et al. Ethical
consumption and related ethical issues will be the and regulatory challenges of large language models in medicine. Lancet Digit
Health. 2024; 6: e428-e32.
focus of future research. Notably, despite their 7. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW.
improving performance on medical tasks, LLMs are Large language models in medicine. Nat Med. 2023; 29: 1930-40.
8. Shanahan M. Talking about Large Language Models. Communications of the ACM. 2022; 67: 68-79.
9. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics; 2019.
10. Radford A, Narasimhan K. Improving Language Understanding by Generative Pre-Training. 2018.
11. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, Lacroix T, et al. LLaMA: Open and Efficient Foundation Language Models. ArXiv. 2023; abs/2302.13971.
12. Anil R, Dai AM, Firat O, Johnson M, Lepikhin D, Passos AT, et al. PaLM 2 Technical Report. ArXiv. 2023; abs/2305.10403.
13. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023; 620: 172-80.
14. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L, et al. Towards Expert-Level Medical Question Answering with Large Language Models. ArXiv. 2023; abs/2305.09617.
15. Vaswani A, Shazeer NM, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is All you Need. Neural Information Processing Systems; 2017.
16. Wei J, Bosma M, Zhao V, Guu K, Yu AW, Lester B, et al. Finetuned Language Models Are Zero-Shot Learners. ArXiv. 2021; abs/2109.01652.
17. Meskó B. Prompt Engineering as an Important Emerging Skill for Medical Professionals: Tutorial. J Med Internet Res. 2023; 25: e50638.
18. Goh E, Gallo RJ, Strong E, Weng Y, Kerman H, Freed JA, et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nat Med. 2025; 31: 1233-8.
19. OpenAI. Hello GPT-4o. 2024.
20. OpenAI. Learning to Reason with LLMs. OpenAI; 2024.
21. Bai Y, Kadavath S, Kundu S, Askell A, Kernion J, Jones A, et al. Constitutional AI: Harmlessness from AI Feedback. ArXiv. 2022; abs/2212.08073.
22. Chen D, Parsa R, Hope A, Hannon B, Mak E, Eng L, et al. Physician and Artificial Intelligence Chatbot Responses to Cancer Questions From Social Media. JAMA Oncol. 2024; 10: 956-60.
23. Team G, Anil R, Borgeaud S, Alayrac J-B, Yu J, Soricut R, et al. Gemini: a family of highly capable multimodal models. ArXiv. 2023; abs/2312.11805.
24. Rao A, Pang M, Kim J, Kamineni M, Lie W, Prasad AK, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res. 2023; 25: e48659.
25. Abdullahi T, Singh R, Eickhoff C. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models. JMIR Med Educ. 2024; 10: e51391.
26. El Haj M, Boutoleau-Bretonnière C, Gallouj K, Wagemann N, Antoine P, Kapogiannis D, et al. ChatGPT as a Diagnostic Aid in Alzheimer's Disease: An Exploratory Study. J Alzheimers Dis Rep. 2024; 8: 495-500.
27. Salihu A, Meier D, Noirclerc N, Skalidis I, Mauler-Wittwer S, Recordon F, et al. A study of ChatGPT in facilitating Heart Team decisions on severe aortic stenosis. EuroIntervention. 2024; 20: e496-e503.
28. Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. Assessment of a Large Language Model's Responses to Questions and Cases About Glaucoma and Retina Management. JAMA Ophthalmol. 2024; 142: 371-5.
29. He Z, Bhasuran B, Jin Q, Tian S, Hanna K, Shavor C, et al. Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study. J Med Internet Res. 2024; 26: e56655.
30. Yeo YH, Samaan JS, Ng WH, Ting PS, Trivedi H, Vipani A, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol. 2023; 29: 721-32.
31. Acharya A, Shrestha S, Chen A, Conte J, Avramovic S, Sikdar S, et al. Clinical risk prediction using language models: benefits and considerations. J Am Med Inform Assoc. 2024.
32. Beaulieu-Jones BK, Villamar MF, Scordis P, Bartmann AP, Ali W, Wissel BD, et al. Predicting seizure recurrence after an initial seizure-like episode from routine clinical notes using large language models: a retrospective cohort study. Lancet Digit Health. 2023; 5: e882-e94.
33. Williams CYK, Zack T, Miao BY, Sushil M, Wang M, Kornblith AE, et al. Use of a Large Language Model to Assess Clinical Acuity of Adults in the Emergency Department. JAMA Netw Open. 2024; 7: e248895.
34. Bhayana R, Nanda B, Dehkharghanian T, Deng Y, Bhambra N, Elias G, et al. Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer. Radiology. 2024; 311: e233117.
35. Bhayana R, Biswas S, Cook TS, Kim W, Kitamura FC, Gichoya J, et al. From Bench to Bedside With Large Language Models: AJR Expert Panel Narrative Review. AJR Am J Roentgenol. 2024.
36. Doshi R, Amin KS, Khosla P, Bajaj SS, Chheang S, Forman HP. Quantitative Evaluation of Large Language Models to Streamline Radiology Report Impressions: A Multimodal Retrospective Analysis. Radiology. 2024; 310: e231593.
37. Jeblick K, Schachtner B, Dexl J, Mittermeier A, Stüber AT, Topalis J, et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol. 2024; 34: 2817-25.
38. Van Veen D, Van Uden C, Blankemeier L, Delbrouck JB, Aali A, Bluethgen C, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. 2024; 30: 1134-42.
39. Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023; 5: e107-e8.
40. Tripathi S, Sukumaran R, Cook TS. Efficient healthcare with large language models: optimizing clinical workflow and enhancing patient care. J Am Med Inform Assoc. 2024; 31: 1436-40.
41. Roberts K. Large language models for reducing clinicians' documentation burden. Nat Med. 2024; 30: 942-3.
42. Májovský M, Černý M, Kasal M, Komarc M, Netuka D. Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora's Box Has Been Opened. J Med Internet Res. 2023; 25: e46924.
43. Hake J, Crowley M, Coy A, Shanks D, Eoff A, Kirmer-Voss K, et al. Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts. Ann Fam Med. 2024; 22: 113-20.
44. Gao CA, Howard FM, Markov NS, Dyer EC, Ramesh S, Luo Y, et al. Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. NPJ Digit Med. 2023; 6: 75.
45. Oami T, Okada Y, Nakada TA. Performance of a Large Language Model in Screening Citations. JAMA Netw Open. 2024; 7: e2420496.
46. Luo X, Chen F, Zhu D, Wang L, Wang Z, Liu H, et al. Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses. J Med Internet Res. 2024; 26: e56780.
47. Huang J, Yang DM, Rong R, Nezafati K, Treager C, Chi Z, et al. A critical assessment of using ChatGPT for extracting structured data from clinical notes. NPJ Digit Med. 2024; 7: 106.
48. Huang Y, Wu R, He J, Xiang Y. Evaluating ChatGPT-4.0's data analytic proficiency in epidemiological studies: A comparative analysis with SAS, SPSS, and R. J Glob Health. 2024; 14: 04070.
49. Abd-Alrazaq A, AlSaad R, Alhuwail D, Ahmed A, Healy PM, Latifi S, et al. Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions. JMIR Med Educ. 2023; 9: e48291.
50. Wu Y, Zheng Y, Feng B, Yang Y, Kang K, Zhao A. Embracing ChatGPT for Medical Education: Exploring Its Impact on Doctors and Medical Students. JMIR Med Educ. 2024; 10: e52483.
51. Holderried F, Stegemann-Philipps C, Herschbach L, Moldt JA, Nevins A, Griewatz J, et al. A Generative Pretrained Transformer (GPT)-Powered Chatbot as a Simulated Patient to Practice History Taking: Prospective, Mixed Methods Study. JMIR Med Educ. 2024; 10: e53961.
52. Cook DA. Creating virtual patients using large language models: scalable, global, and low cost. Med Teach. 2024: 1-3.
53. Lee H. The rise of ChatGPT: Exploring its potential in medical education. Anat Sci Educ. 2024; 17: 926-31.
54. Cheung BHH, Lau GKK, Wong GTC, Lee EYP, Kulkarni D, Seow CS, et al. ChatGPT versus human in generating medical graduate exam multiple choice questions-A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS One. 2023; 18: e0290691.
55. Laupichler MC, Rother JF, Grunwald Kadow IC, Ahmadi S, Raupach T. Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions. Acad Med. 2024; 99: 508-12.
56. Tangadulrat P, Sono S, Tangtrakulwanich B. Using ChatGPT for Clinical Practice and Medical Education: Cross-Sectional Survey of Medical Students' and Physicians' Perceptions. JMIR Med Educ. 2023; 9: e50658.
57. Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of Hallucination in Natural Language Generation. ACM Comput Surv. 2023; 55: Article 248.
58. Christophe C, Kanithi PK, Munjal P, Raha T, Hayat N, Rajan R, et al. Med42 - Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches. ArXiv. 2024; abs/2404.14779.
59. Ye K, Zhou H, Zhu J, Quinzan F, Shi C. Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning. 2025.
60. Wu J, Zhu J, Qi Y. Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation. ArXiv. 2024; abs/2408.04187.
61. Gilbert S, Kather JN, Hogan A. Augmented non-hallucinating large language models as medical information curators. NPJ Digit Med. 2024; 7: 100.
62. Li D, Yang S, Tan Z, Baik JY, Yun S, Lee J, et al. DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer's Disease Questions with Scientific Literature. ArXiv. 2024; abs/2405.04819.
63. Joyce DW, Kormilitzin A, Smith KA, Cipriani A. Explainable artificial intelligence for mental health through transparency and interpretability for understandability. NPJ Digit Med. 2023; 6: 6.
64. Rudin C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat Mach Intell. 2019; 1: 206-15.
65. Tang X, Zou A, Zhang Z, Zhao Y, Zhang X, Cohan A, et al. MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning. ArXiv. 2023; abs/2311.10537.
66. Savage T, Nayak A, Gallo R, Rangan E, Chen JH. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit Med. 2024; 7: 20.
67. Tang YD, Dong ED, Gao W. LLMs in medicine: The need for advanced evaluation systems for disruptive technologies. Innovation (Camb). 2024; 5: 100622.
68. Liu A, Zhou H, Hua Y, Rohanian O, Clifton LA, Clifton DA. Large Language Models in Healthcare: A Comprehensive Benchmark. ArXiv. 2024; abs/2405.00716.
69. Jin D, Pan E, Oufattole N, Weng W-H, Fang H, Szolovits P. What Disease does
this Patient Have? A Large-scale Open Domain Question Answering Dataset
from Medical Exams. ArXiv. 2020; abs/2009.13081.
70. Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: A Large-scale
Multi-Subject Multi-Choice Dataset for Medical domain Question Answering.
ACM Conference on Health, Inference, and Learning; 2022.
71. Chen RJ, Lu MY, Chen TY, Williamson DFK, Mahmood F. Synthetic data in
machine learning for medicine and healthcare. Nat Biomed Eng. 2021; 5: 493-7.
72. Zhou H, Gu B, Zou X, Li Y, Chen SS, Zhou P, et al. A Survey of Large
Language Models in Medicine: Progress, Application, and Challenge. ArXiv.
2023; abs/2311.05112.
73. Goel A, Gueta A, Gilon O, Liu C, Erell S, Nguyen LH, et al. LLMs Accelerate
Annotation for Medical Information Extraction. ArXiv. 2023; abs/2312.02296.
74. Meskó B. The Impact of Multimodal Large Language Models on Health Care's
Future. J Med Internet Res. 2023; 25: e52865.
75. Meng X, Yan X, Zhang K, Liu D, Cui X, Yang Y, et al. The application of large
language models in medicine: A scoping review. iScience. 2024; 27: 109713.
76. Wu C, Zhang X, Zhang Y, Wang Y, Xie W. Towards Generalist Foundation
Model for Radiology. ArXiv. 2023; abs/2308.02463.
77. Huang H, Zheng O, Wang D, Yin J, Wang Z, Ding S, et al. ChatGPT for
shaping the future of dentistry: the potential of multi-modal large language
model. Int J Oral Sci. 2023; 15: 29.
78. Tu T, Azizi S, Driess D, Schaekermann M, Amin M, Chang P-C, et al. Towards
Generalist Biomedical AI. ArXiv. 2023; abs/2307.14334.
79. Samsi S, Zhao D, McDonald J, Li B, Michaleas A, Jones M, et al. From Words to
Watts: Benchmarking the Energy Costs of Large Language Model Inference.
2023 IEEE High Performance Extreme Computing Conference (HPEC). 2023:
1-9.
80. Agrawal V. Energy Efficient Large Language Models: Advancements and
Challenges. International Journal of Scientific Research in Engineering and
Management. 2025.
81. DeepSeek-AI, Liu A, Feng B, Xue B, Wang B-L, Wu B, et al. DeepSeek-V3
Technical Report. ArXiv. 2024; abs/2412.19437.
82. DeepSeek-AI, Guo D, Yang D, Zhang H, Song J-M, Zhang R, et al.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement
Learning. ArXiv. 2025; abs/2501.12948.
83. Wang Z, Luo T, Liu C, Liu W, Goh RSM, Wong WF. Enabling Energy-Efficient
Deployment of Large Language Models on Memristor Crossbar: A Synergy of
Large and Small. IEEE Trans Pattern Anal Mach Intell. 2025; 47: 916-33.
84. Chen RJ, Chen TY, Lipková J, Wang JJ, Williamson DFK, Lu MY, et al.
Algorithm Fairness in AI for Medicine and Healthcare. ArXiv. 2021;
abs/2110.00603.
85. Stokel-Walker C. ChatGPT listed as author on research papers: many scientists
disapprove. Nature. 2023; 613: 620-1.
86. Tools such as ChatGPT threaten transparent science; here are our ground rules
for their use. Nature. 2023; 613: 612.
87. Taloni A, Scorcia V, Giannaccare G. Large Language Model Advanced Data
Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA
Ophthalmol. 2023; 141: 1174-5.
88. Zhu L, Lai Y, Mou W, Zhang H, Lin A, Qi C, et al. ChatGPT's ability to
generate realistic experimental images poses a new challenge to academic
integrity. J Hematol Oncol. 2024; 17: 27.