Abstract
Background
Health literacy empowers patients to participate in their own healthcare. Personal health literacy is one’s
ability to find, understand, and use information/resources to make well-informed health decisions. Artificial
intelligence (AI) has become a source for the acquisition of health-related information through large
language model (LLM)-driven chatbots. Assessment of the readability and quality of health information
produced by these chatbots has been the subject of numerous studies to date. This study seeks to assess the
quality of patient education materials on cardiac catheterization produced by AI chatbots.
Methodology
We asked a set of 10 questions about cardiac catheterization to four chatbots: ChatGPT (OpenAI, San
Francisco, CA), Microsoft Copilot (Microsoft Corporation, Redmond, WA), Google Gemini (Google DeepMind,
London, UK), and Meta AI (Meta, New York, NY). The questions and subsequent answers were utilized to
make patient education materials on cardiac catheterization. The quality of these materials was assessed
using two validated instruments for patient education materials: DISCERN and the Patient Education
Materials Assessment Tool (PEMAT).
Results
The overall DISCERN scores were 4.5 for ChatGPT, 4.4 for Microsoft Copilot and Google Gemini, and 3.8 for
Meta AI. ChatGPT, Microsoft Copilot, and Google Gemini tied for the highest reliability score at 4.6, while
Meta AI had the lowest with 4.2. ChatGPT had the highest quality score at 4.4, while Meta AI had the lowest
with 3.4. ChatGPT and Google Gemini had Understandability scores of 100%, while Meta AI had the lowest
with 82%. ChatGPT, Microsoft Copilot, and Google Gemini all had Actionability scores of 75%, while Meta AI
scored 50%.
Conclusions
ChatGPT produced the most reliable and highest quality materials, followed closely by Google Gemini. Meta
AI produced the lowest quality materials. Given the easy accessibility that chatbots provide patients and the
high-quality responses that we obtained, they could be a reliable source for patients to obtain information
about cardiac catheterization.
Introduction
Health literacy is integral in empowering patients to participate in their own healthcare. Personal health
literacy is one’s ability to find, understand, and use information/resources to make well-informed health
decisions, and is a central part of the Healthy People 2030 initiative [1]. A major element of personal health
literacy is the ability to understand and utilize health-related information from different formats [2]. In
recent years, increased access to and expansion of the internet have resulted in website-based patient education materials replacing paper handouts obtained directly from healthcare providers. Assessments of
the quality and readability of these online patient education materials have been the subject of hundreds of
studies to date.
Within the rapidly expanding digital realm, artificial intelligence (AI) has become a potential resource for the acquisition of health-related information through large language model (LLM)-driven chatbots [3].
In a previous study, we assessed the readability of patient education materials on cardiac catheterization
generated through four AI chatbots: ChatGPT (OpenAI, San Francisco, CA), Microsoft Copilot (Microsoft
Corporation, Redmond, WA), Google Gemini (Google DeepMind, London, UK), and Meta AI (Meta, New York,
NY) [4]. We found that these materials were written at the high school or college reading level, far
exceeding the sixth-grade level recommended by the American Medical Association and National Institutes
of Health [4]. While readability is an important aspect of health education materials, the quality of the information provided is equally important. This study seeks to build upon our previous findings
and assess the quality of health education materials on cardiac catheterization from AI chatbots.
Materials And Methods
We asked a set of 10 questions about cardiac catheterization to each of the four chatbots: ChatGPT, Microsoft Copilot, Google Gemini, and Meta AI. The questions and their subsequent answers were used to compile patient education materials on cardiac catheterization.
The quality of these patient education materials was then assessed using two validated
instruments: DISCERN and the Patient Education Materials Assessment Tool (PEMAT) [5,6]. The first two
authors (BJB and CASM) independently screened these materials, assigning ratings as outlined by the two
instruments. Disagreements were resolved via discussion, with the last author (JFB) available if consensus could not be reached.
The DISCERN instrument assesses the quality of written health information using 16 questions across three
sections: reliability of the publication (Questions 1-8), quality of information on treatment choices
(Questions 9-15), and overall rating (Question 16) [5]. Each question is scored one to five, with a score of one
corresponding to “No” and five corresponding to “Yes,” while three represents “Partially” [5]. Given that
sources are not generally provided by AI chatbots, we excluded DISCERN questions that asked about these.
This excluded three questions from the reliability of the publication section. We also calculated the overall
rating by taking an average of the ratings for the rest of the questions, as opposed to simply rating this from
one to five. The modified DISCERN that we used for this study can be seen in Table 2.
TABLE 2: The modified DISCERN instrument questions used in this study
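As an illustration of this modification, the overall rating is simply the arithmetic mean of the 12 retained item scores. Using Meta AI's item scores described in the Results (Section 1: 5, 5, 4, 5, 2; Section 2: 5, 5, 5, 1, 3, 1, 4):

\[
\text{Overall rating} = \frac{(5+5+4+5+2)+(5+5+5+1+3+1+4)}{12} = \frac{45}{12} \approx 3.8
\]

This reproduces the overall DISCERN score of 3.8 reported for Meta AI.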
PEMAT is a systematic method to determine whether health education information is understandable and
actionable for patients [6]. It consists of 24 questions for printable materials that are scored zero for
“Disagree” and one for “Agree,” with certain questions having the option of N/A for “Not Applicable” [6].
Scores are assigned separately for the Understandability and Actionability domains by taking the total
number of points divided by the total possible points (excluding questions scored N/A) and multiplying by
100% [6]. The PEMAT questions can be seen in Table 3.
Word choice and style
4. Medical terms are used only to familiarize audience with the terms. When used, medical terms are defined.

Layout and design
12. The material uses visual cues (e.g., arrows, boxes, bullets, bold, larger font, highlighting) to draw attention to key points.

Use of visual aids
13. The material uses visual aids whenever they could make content more easily understood (e.g., illustration of healthy portion size).
14. The material’s visual aids reinforce rather than distract from the content.
15. The material’s visual aids have clear titles or captions.
16. The material uses illustrations and photographs that are clear and uncluttered.
17. The material uses simple tables with short and clear row and column headings.

Actionability
18. The material clearly identifies at least one action the user can take.
19. The material addresses the user directly when describing actions.
20. The material breaks down any action into manageable, explicit steps.
21. The material provides a tangible tool (e.g., menu planners, checklists) whenever it could help the user take action.
22. The material provides simple instructions or examples of how to perform calculations.
23. The material explains how to use charts, graphs, tables, or diagrams to take actions.
24. The material uses visual aids whenever they could make it easier to act on the instructions.

TABLE 3: The Patient Education Materials Assessment Tool (PEMAT) instrument questions
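Expressed as a formula, the domain score calculation described above is:

\[
\text{Domain score (\%)} = \frac{\text{number of items scored “Agree”}}{\text{number of applicable items (items not scored N/A)}} \times 100
\]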
Results
ChatGPT had the highest DISCERN scores, overall and for both sections, of 4.5, 4.6, and 4.4, respectively.
Microsoft Copilot and Google Gemini were tied for second with an overall score of 4.4 and section scores of
4.6 and 4.3, respectively. Meta AI had the lowest DISCERN scores across all three, with an overall score of 3.8 and scores of 4.2 and 3.4 on Sections 1 and 2, respectively. Higher scores across Section 1 indicate better
reliability of the publication, while higher scores in Section 2 indicate better quality of information provided
on treatment choices. Our results indicate that ChatGPT, Microsoft Copilot, and Google Gemini all provide
equally reliable information, but ChatGPT provides slightly better quality information. The DISCERN scores
for each chatbot can be found in Table 4.
TABLE 4: Average DISCERN scores for each chatbot for Section 1 (reliability of the publication),
Section 2 (quality of information), and overall
We further investigated specific differences between the chatbots' scores on individual DISCERN questions. In Section
1, none of the chatbots received a perfect score of five on Question #5, “Does it refer to areas of
uncertainty?” Using the DISCERN instrument, ChatGPT, Microsoft Copilot, and Google Gemini received
scores of three or “Partially” due to their failure to mention gaps in knowledge or differences in expert
opinion. However, they did a sufficient job of discussing that individual patients will have different risk
factors, procedures, and outcomes, which is the other aspect of this question. Meta AI scored lower on this same question, receiving a two, due to its failure to mention knowledge gaps and its lesser emphasis on the variations to be expected between patients. The only other question in Section 1 that did
not receive a score of five was Question #3, “Is it relevant?” Meta AI received a score of four on this question
due to its failure to recognize and address questions that readers might ask.
In Section 2, all four chatbots received scores of five on Questions #6-8, which judge whether the materials
effectively describe each treatment, its benefits, and its risks. The lowest scores were seen with Question #9,
“Does it describe what would happen if no treatment is used?” ChatGPT received the highest score, a three, due to
its mention that cardiac catheterization can help reduce the risk of heart attacks, heart failure, and other
serious complications by accurately diagnosing and treating heart conditions. However, we could only score
this a three because of its failure to explicitly describe what would happen if no treatment was used. On the
other hand, Meta AI received a score of one due to its failure to mention or even imply the consequences of
no treatment. Microsoft Copilot and Google Gemini received scores of two because they only implied what would happen without treatment, never stating it explicitly, and placed less emphasis on it than ChatGPT did. The other question on which none of the chatbots received a score of five was Question #11, “Is it
clear that there may be more than one possible treatment choice?” ChatGPT, Microsoft Copilot, and Google
Gemini all received scores of three due to their mention of open heart surgery as an alternative. However,
they did not elaborate on heart surgery and failed to mention whether that was the only alternative. Meta AI
received a score of one on this question for its failure to mention heart surgery as an alternative. Meta AI
also received a score of three on Question #10, “Does it describe how the treatment choices affect the overall
quality of life?” due to its vague responses about what to expect after cardiac catheterization and failure to
explicitly mention the quality of life outcomes. It also received a score of four on Question #12, “Does it provide support for shared decision-making?” because it was more direct than the others and failed to explicitly state that patients should collaborate with their healthcare providers when considering cardiac catheterization. Differences in the scores for each chatbot across the 12 DISCERN questions can be seen in
Figure 1.
ChatGPT and Google Gemini had the same PEMAT Understandability and Actionability scores of 100% and
75%, respectively. Microsoft Copilot also had an Actionability score of 75% but a slightly lower
Understandability score of 92%. The only difference in Understandability among these three chatbots was on Item #11, where Microsoft Copilot failed to provide a summary. Additionally, all three chatbots received a score of
0 for Item #21 for failure to provide a tangible tool to help the user take action. Meta AI had the lowest
scores, with an Understandability score of 82% and an Actionability score of 50%. Meta AI missed points on Understandability on Items #4 and #11, as it failed to define medical terms consistently and lacked a summary, respectively. It missed points on Actionability on Items #20-21, as it failed to break actions into manageable, explicit steps and did not provide a tangible tool for action. The PEMAT
scores for each chatbot can be found in Table 5.
TABLE 5: Patient Education Materials Assessment Tool (PEMAT) scores for each chatbot broken
down by Understandability and Actionability domains
Overall percentage scores are provided with parentheses denoting the number of items agreed to as the numerator and the total number of items
answered as the denominator, i.e., items scored “Not Applicable” were not included in the denominator.
Items #13-17, which deal with the use of visual aids, were labeled N/A for all four chatbots due to the lack of visual aids in the chatbot responses. This resulted in a maximum of 12 applicable Understandability items. Google Gemini and Meta AI had only 11 applicable items because Item #6 was scored N/A due to their lack of use of numbers. Similarly, Items #22-24 received scores of N/A for all four chatbots because there were no calculations to perform and the remaining items concern visual aids. This resulted in four applicable Actionability items for each chatbot.
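With these denominators, the domain scores reported above follow directly; for example, for Meta AI:

\[
\text{Understandability} = \frac{9}{11} \times 100 \approx 82\%, \qquad \text{Actionability} = \frac{2}{4} \times 100 = 50\%
\]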
Discussion
Overall, ChatGPT appears to provide the highest quality patient education materials on cardiac
catheterization. Our results suggest that ChatGPT's materials are the most reliable, understandable, and actionable, as well as of the highest quality, among those produced by the four chatbots. Google Gemini provided
similarly reliable, understandable, and actionable materials but of slightly lower quality. Conversely, Meta
AI was found to provide the lowest quality patient education materials across all four measures. Our study
suggests that three of these four chatbots, excluding Meta AI, are capable of providing high-quality patient
education materials on cardiac catheterization based on our analysis using DISCERN and PEMAT, two
validated tools specifically designed to assess the quality of patient education materials [5,6].
Our findings that AI chatbots produce high-quality patient education materials are reassuring, given the
increasing rate at which patients are utilizing chatbots for health information. As mentioned, health literacy
allows patients to better participate in their own care and has been associated with better health outcomes
[1,7]. Improvement in health literacy is especially vital in patients with cardiovascular diseases, as poor
health literacy has been associated with increased mortality, increased hospital admissions, and decreased
quality of life [8,9]. However, for patient education materials to positively impact health literacy, they need
to be both readable and accurate. Prior studies, including our own on cardiac catheterization, have shown
that chatbot-generated materials are written significantly above the recommended sixth-grade level, a
significant limitation to the furtherance of health literacy [4]. However, other studies have shown the ability
of AI chatbots to simplify information when prompted to, albeit at the possible expense of accuracy [4,10].
While there seems to be consensus on the readability of these materials, studies are conflicted on the quality
and accuracy. A recent systematic review on the role of ChatGPT in cardiology concluded that while there
may be some benefit to its use in patient education, this is limited by inaccuracies in its outputs, incomplete
answers, and the inability to provide the most up-to-date information [11]. A study comparing ChatGPT’s
heart failure education to that of national cardiology institutes found that ChatGPT was less readable and
had the lowest PEMAT Actionability score of any material [12]. Furthermore, a study comparing chatbots to
traditional patient information leaflets (PILs) on local anesthesia in eye surgery found the traditional
leaflets to be superior in readability, accuracy, and completeness [13]. However, another study by the same
authors found no differences in accuracy, completeness, understandability, and actionability between PILs
and ChatGPT patient education materials on chronic pain medications [14]. Additionally, another study
found that ChatGPT-4 used multiple academic sources to answer queries about the Latarjet procedure
compared to Google Search Engine using single-surgeon and large medical practice websites, although the
clinical relevance and accuracy of the information were not significantly different [15]. This highlights the
strength of LLM chatbots in compiling and referencing high-quality sources when providing their responses.
As we saw in our study, each chatbot is different and provides varying levels of information quality with its
responses. Given that ChatGPT is the most common chatbot, many studies focus solely on it, but there are
studies similar to ours that compare multiple chatbots. A comparison of responses from five chatbots
(ChatGPT-4, Claude, Mistral, Google PaLM, and Grok) to the most frequently asked questions (FAQs) on
kidney stones found Grok to be the easiest to read and ChatGPT the hardest, while Claude had the best text
quality [16]. Another study assessing the responses of ChatGPT, Bard, Gemini, Copilot, and Perplexity in
responses to FAQs about palliative care found Perplexity and Gemini to be the highest quality [17]. However,
another study assessing the responses of ChatGPT-3.5 and ChatGPT-4 to FAQs by patients undergoing deep
brain stimulation found significantly better and more complete responses by ChatGPT-4, as determined by
experts in the field [18]. This highlights the variability in the abilities of the different chatbots to provide
quality patient education materials, even among different versions of the same chatbot. Given the ever-
changing nature of chatbots, this variability is likely also constantly changing and will be difficult to predict.
This study is not without its limitations, the most notable of which is that the DISCERN and PEMAT
instruments have not been validated for use on AI chatbot-generated patient education materials.
Furthermore, the questions on the DISCERN tool that we excluded dealt with the quality and reliability of
sources used to compile the information. While it is known that these chatbots are trained on large amounts
of data across the internet and have been shown to provide reliable and factually correct responses, the
exact sources of the information provided cannot be determined. This limits our ability to confidently
promote the quality of these materials, and our results should thus be taken with caution. Additionally, the
use of a modified DISCERN questionnaire, as well as the number of items that were deemed “Not Applicable”
on the PEMAT, makes comparison of the quality of these materials to prior studies challenging. We also only
asked each question once to the chatbots, so our results are only reflective of that single point in time and
cannot account for any updates to these models or differences in responses to the questions that asking
multiple times may have elicited. Finally, only two authors assessed the quality of the responses, and disagreements in the initial screening were resolved through discussion, which introduces the possibility of bias through groupthink and confirmation bias. Future studies should utilize more authors to independently rate
the materials and calculate inter-rater reliability to limit this bias and achieve a better understanding of the
true quality of these materials.
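As a minimal sketch of how future work could quantify such inter-rater agreement (no such statistic was calculated in the present study), Cohen's kappa can be computed directly from two raters' item-level scores; the rating arrays below are hypothetical and for illustration only.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # Cohen's kappa: chance-corrected agreement between two raters scoring the same items.
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in set(rater_a) | set(rater_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical DISCERN item ratings (1-5) from two independent raters for one chatbot
rater_1 = [5, 5, 4, 5, 3, 5, 5, 5, 3, 5, 3, 5]
rater_2 = [5, 5, 4, 5, 2, 5, 5, 5, 3, 5, 3, 5]
print(f"Cohen's kappa: {cohens_kappa(rater_1, rater_2):.2f}")

For ordinal scores such as DISCERN ratings, a weighted kappa or intraclass correlation coefficient would arguably be more informative, but the unweighted version above illustrates the principle.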
Additional Information
Author Contributions
All authors have reviewed the final version to be published and agreed to be accountable for all aspects of the
work.
Concept and design: Benjamin J. Behers, Karen M. Hamad, Joel F. Baker, Ian A. Vargas, Caroline N. Wojtas,
Manuel A. Rosario, Djhemson Anneaud
Critical review of the manuscript for important intellectual content: Benjamin J. Behers, Karen M.
Hamad, Joel F. Baker, Ian A. Vargas, Caroline N. Wojtas, Manuel A. Rosario, Djhemson Anneaud, Christoph
A. Stephenson-Moe, Profilia Nord, Rebecca M. Gibons
Disclosures
Human subjects: All authors have confirmed that this study did not involve human participants or tissue.
Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue.
Conflicts of interest: In compliance with the ICMJE uniform disclosure form, all authors declare the
following: Payment/services info: All authors have declared that no financial support was received from
any organization for the submitted work. Financial relationships: All authors have declared that they have
no financial relationships at present or within the previous three years with any organizations that might
have an interest in the submitted work. Other relationships: All authors have declared that there are no
other relationships or activities that could appear to have influenced the submitted work.
References
1. Health Literacy in Healthy People 2030. (2024). Accessed: August 28, 2024:
https://health.gov/healthypeople/priority-areas/health-literacy-healthy-people-2030.
2. Liu C, Wang D, Liu C, et al.: What is the meaning of health literacy? A systematic review and qualitative
synthesis. Fam Med Community Health. 2020, 8:51. 10.1136/fmch-2020-000351
3. Golan R, Reddy R, Ramasamy R: The rise of artificial intelligence-driven health communication. Transl
Androl Urol. 2024, 13:356-8. 10.21037/tau-23-556
4. Behers BJ, Vargas IA, Behers BM, Rosario MA, Wojtas CN, Deevers AC, Hamad KM: Assessing the readability
of patient education materials on cardiac catheterization from artificial intelligence chatbots: an
observational cross-sectional study. Cureus. 2024, 16:e63865. 10.7759/cureus.63865
5. The DISCERN Instrument. (2024). Accessed: August 31, 2024: http://www.discern.org.uk/index.php.
6. Shoemaker SJ, Wolf MS, Brach C: Development of the Patient Education Materials Assessment Tool
(PEMAT): a new measure of understandability and actionability for print and audiovisual patient
information. Patient Educ Couns. 2014, 96:395-403. 10.1016/j.pec.2014.05.027
7. Tepe M, Emekli E: Assessing the responses of large language models (ChatGPT-4, Gemini, and Microsoft
Copilot) to frequently asked questions in breast imaging: a study on readability and accuracy. Cureus. 2024,
16:e59960. 10.7759/cureus.59960
8. Berkman ND, Sheridan SL, Donahue KE, Halpern DJ, Crotty K: Low health literacy and health outcomes: an
updated systematic review. Ann Intern Med. 2011, 155:97-107. 10.7326/0003-4819-155-2-201107190-00005
9. Kanejima Y, Shimogai T, Kitamura M, Ishihara K, Izawa KP: Impact of health literacy in patients with
cardiovascular diseases: a systematic review and meta-analysis. Patient Educ Couns. 2022, 105:1793-800.
10.1016/j.pec.2021.11.021
10. Sudharshan R, Shen A, Gupta S, Zhang-Nunes S: Assessing the utility of ChatGPT in simplifying text
complexity of patient educational materials. Cureus. 2024, 16:e55304. 10.7759/cureus.55304
11. Sharma A, Medapalli T, Alexandrou M, Brilakis E, Prasad A: Exploring the role of ChatGPT in cardiology: a
systematic review of the current literature. Cureus. 2024, 16:e58936. 10.7759/cureus.58936
12. Anaya F, Prasad R, Bashour M, Yaghmour R, Alameh A, Balakumaran K: Evaluating ChatGPT platform in
delivering heart failure educational material: a comparison with the leading national cardiology institutes.