
Health and Technology (2021) 11:535–545

https://doi.org/10.1007/s12553-021-00542-w

ORIGINAL PAPER

Decision tree modeling in R software to aid clinical decision making


Elena G. Toth1 · David Gibbs1 · Jackie Moczygemba1 · Alexander McLeod1

Received: 27 November 2020 / Accepted: 12 March 2021 / Published online: 24 March 2021
© IUPESM and Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract
There is increasing excitement in the healthcare field about using behavioral data and healthcare analytics for disease risk
prediction, clinical decision support, and overall improvement of personalized medicine. However, this excitement has not
effectively translated to improved clinical outcomes due to knowledge gaps, a lack of behavioral risk models, and resistance
to evidence-based practice. Reportedly, only 10–20% of clinical decisions are known to be evidence-based and this problem
is further highlighted by the fact that the US spends more money on healthcare per person than any other nation, while still
wrestling with poor health outcomes. Critics say there are inadequate technological resources and analytical education for
clinicians to make behavioral data useful in the medical world. Healthcare technology innovators often neglect important
aspects of the reality of integrating clinical data into electronic healthcare solutions. In this study, we developed a decision
tree model using R statistical software to predict diabetes since it is among the top causes of death in the US, can be poorly
managed, and provides an opportunity for improvement using analytics. This study examined behavioral data and healthcare
analytics for use in clinical applications, demonstrating that health information professionals can develop behavioral risk
factor prediction models to bridge the gap. Results indicated that decision trees are effective in classifying diabetes in an
individual at up to 89.36% accuracy.

Keywords Analytics · Disease prediction · Clinical decision support · Diabetes · Decision trees

* Elena G. Toth, elenagritsenko@utexas.edu (corresponding author); David Gibbs, dgibbs@txstate.edu; Jackie Moczygemba, jackiem@txstate.edu; Alexander McLeod, am@txstate.edu
1 Department of Health Information Management, Texas State University College of Health Professions, 601 University Dr., Encino Hall 310B, San Marcos, TX 78666, USA

Preventable chronic diseases are the most common cause of premature death in the US population [1, 2]. Chronic diseases also account for 86% of healthcare costs in the US [3]. This is partly due to a lack of intervention based on behavioral risk factors, such as obesity, and to a large amount of funds being spent on high-cost medical care after the disease has already developed. The healthcare industry is expected to consume a quarter of the country's Gross Domestic Product in the near future [4, 5]. A great concern is an aging US population, with 10,000 more people turning 65 every day between January 2011 and January 2030 [6].

The goal of this research is to use healthcare analytics to create behavioral risk prediction models that support clinical decision making in evidence-based practice. Specifically, we focus on utilizing R Statistical Software for decision tree analysis, as applications of R remain scarce in healthcare analytics [7].

Reportedly, only 10–20% of clinical decisions are known to be evidence-based [8]. Many disease prediction models are too advanced to be used by healthcare professionals without specialized training in data analytics. Using data from the Centers for Disease Control and Prevention (CDC) 2017 Behavioral Risk Factor Survey [9], this work creates diabetes prediction models using behavioral risk factors. Additionally, the models can help healthcare professionals, including those without extensive statistical training, understand how data analytics can aid in clinical decisions. Given the need for analytical skills to support evidence-based medicine, the following research questions were developed and addressed:
RQ1: Can risk factors for diseases be identified in large behavioral data sets for clinical decision support?
RQ2: How can predictive analytics be used to create diabetes prediction models for clinical decision support?
RQ3: Can a diabetes prediction model be created with conditional inference trees and machine learning using R?

Through the use of data analytics, the US healthcare sector could save more than $300 billion per year, with two thirds of that coming from reduced healthcare spending [10]. Clinical treatment costs, at $165 billion, are one of the largest areas for savings [11].

There is now an opportunity to incorporate predictive analytics into clinical decision making. A study done by the Medical Group Management Association reported that only 31 percent of healthcare providers currently use the analytics tools and capabilities offered in their systems [12]. Historically, patient care decisions have been based mostly on medical experience and practitioner intuition. A suggested improvement to this model is the use of evidence-based practices [13]. To do so, data and healthcare analytics may consider behavioral risk prediction models to identify new modifiable factors in the population.

1 Literature review

1.1 Data and analytics

Understanding the limits of data and analytics from a computing perspective is important. According to Viceconti et al. [14], data can be defined by the "5Vs": volume (quantity of data), variety (different categories of data), velocity (quick generation of new data), veracity (quality of data), and value (within the data). However, storage, management, and processing of such data are considered to be fundamental overarching issues [15]. There is a vast quantity of healthcare data being created, and data volume is expected to grow at a compound annual growth rate of 36 percent by 2025 [16]. To create highly sensitive disease risk prediction models, we need large data sets that contain information about behavioral risk factors as well as disease outcomes.

1.2 Disease prediction

Historically, there have been successful implementations of large-scale risk assessment tools, such as disease onset prediction, hospital readmission models, and prototypes predicting healthcare cost and utilization [17–28]. For example, Columbia University Medical Center created a prediction system that analyzes correlations of physiological data related to patients with brain injuries. This system is able to diagnose complications 48 h sooner than other clinical methods in patients who suffer a bleeding stroke [11]. Additionally, Fox discussed using data analytics to create predictive models that can identify how an intervention program is likely to impact a patient's health behaviors [4].

Common approaches to disease risk prediction include linear regression, decision trees, machine learning, neural networks, and the Naïve Bayesian approach [29–31]. There are several statistical analytics tools similar to R that can be used in health decision sciences, including C++, MATLAB, and Python. Despite R's wide availability and computational capabilities, it continues to be an under-utilized tool in the healthcare field. However, Jalal conducted a promising study comparing the utilization of R to other statistical tools in health decision sciences, and determined that R appears to be increasing in popularity, with studies using R in the healthcare field increasing by about 50% over the last 5 years [7].

Steinberg et al. [32] used an analytics platform to create a risk prediction model for metabolic syndrome. The study conducted screenings of patients and calculated the risk of metabolic syndrome, the impact of incremental changes in risk factors, and the impact of adherence to the treatment plan. Two models using machine learning had good predictive ability (0.80 and 0.88 ROC/AUC). For example, the study could identify a man who had a 92% chance of developing metabolic syndrome in the next 12 months. As a result, a metabolic syndrome intervention program focused on reducing waist circumference, which was determined to be a strong risk factor [32].

Delen, Oztekin and Tomak [33] developed models that predicted survival for patients undergoing coronary artery bypass. Their study compared artificial neural networks, support vector machines, and two decision tree algorithms. Support vector machines had the best prediction accuracy, at 88%. Similarly, Dag, Oztekin, Yucel, Bulur and Megahed [34] identified 43 pre-operative variables for heart transplant patients that contribute to survival outcomes.

Zhu and Fang demonstrated the application of logistic regression-based trichotomous classification trees for medical diagnosis. Using both simulated and real data sets relating to breast cancer and diabetes, they demonstrated that the algorithm can be used to successfully classify disease states, including those patients who are difficult to diagnose [35].

Razavian et al. created a risk prediction model using insurance claims data [17]. Risk factors such as cardiovascular disease history and a diagnosis of obesity were modeled using logistic regression and machine learning. The model predicted the risk of developing diabetes in the future with 21.6% accuracy, compared to 11.4% using traditional prediction methods [17]. Turnea and Ilea created a predictive simulation for type II diabetes. Decision trees based on variables

such as body mass index and diastolic blood pressure were used. The authors propose that such a predictive model may be used to diagnose type II diabetes before complications appear [36].

1.3 Criticism of analytics and evidence-based practice

Some critics dispute the excitement surrounding data analytics in healthcare. Neff argued that while the technology sectors of healthcare see large quantities of data as valuable, healthcare providers do not have the resources, expertise, or time to utilize predictive analytics for patient care [37].

To implement data analytics into clinical practice, clinical staff will need some basic statistical training. However, even without comprehensive statistical training, clinical staff may still be able to implement simple models for improved patient care. Clinicians will need to begin utilizing machine learning, data mining, and other advanced analytic techniques requiring new training in data science [18]. Moskowitz et al. suggested promoting ongoing, cross-disciplinary collaboration between clinical staff and data scientists, even having data scientists participate in hospital rounds alongside clinicians to access data in real time [8].

Clinicians and their patients may also struggle to interpret these results, so aids must be created to help with interpretation. For example, Talboy and Schneider [38] created a brief 10-min tutorial to help people interpret Bayesian inference problems. Their tutorial, with representation-based and rule-based components, was shown to significantly improve accuracy when solving the inference problems [38].

When it comes to using behavioral health data, there are issues of confidentiality, privacy, consent, and oversight. Researchers may also lack the data needed to make an accurate prediction, because many predictive models are drawn from data collected from low-risk groups. There is also a lack of longitudinal data, which takes time to develop and may become more available as EHRs are more widely used [27].

Lastly, there is potential for statistical bias and for false negative or false positive results. For these reasons, the general public may trust the opinion of a practitioner over a model. Eastwood, Snook, and Luther [39] conducted a study measuring attitudes toward decision-making strategies. Their findings showed that the clinical/fully rational strategy, where the decision maker uses personal knowledge and experience as well as data, was rated the highest by participants for preference, accuracy, fairness, and ethicalness. The study indicates that statistical models should be used simply to aid in clinical decision making, and that a healthcare professional must carefully review and interpret the results [39].

2 Methods

This study used the CDC Behavioral Risk Factor Surveillance System data to create classification decision trees to model diabetes. Data analysis was done using IBM SPSS and R statistical software. R is an accessible statistical software environment and programming language. It is free and open source, and it is currently supported by an extensive community of data science professionals. It can integrate many features, including several decision tree tools, for decision analysis, and many resources provide tutorials to potential users [7]. R consists of a basic set of "R Core Packages" that provide an extensive set of functions for healthcare analytics. In this study, the R "party" package was utilized, and within it, the "ctree" algorithm.

Following the process used by Delen et al. [33, 40], the data set was prepared, the research model was developed, the model was trained, and finally the predictive model was analyzed and tested. The steps of this analysis are shown in Fig. 1.

2.1 Modeling behavioral factors

To model variables associated with diabetes, we looked to the literature to identify leading causes of death [41]. We selected diabetes for our model, since it is among the top causes of death in the US, can be poorly managed, and provides an opportunity for improvement using analytics [42].

2.2 Selecting diabetes risk factors

Diabetes is the 7th leading cause of death in the US, and our model considered the risk factors for type II diabetes as delineated by the National Institute of Diabetes and Digestive and Kidney Diseases [43]. Table 1 shows the BRFSS variables mapped to the recognized diabetes risk factors.

For the prediction model, "Ever told you have diabetes" was selected as the dependent variable. Independent variables were selected using risk factors previously identified and were tested for significance using binary logistic

Fig. 1 Four step research model: Data Preparation, Model Development, Model Training, Model Testing and Analysis
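The four-step process in Fig. 1 maps onto a short R workflow. The sketch below assumes a prepared BRFSS data frame named `brfss` with a binary `diabetes` factor and recoded predictors; all object names and the two predictors are illustrative assumptions, not the authors' code:

```r
# Sketch of the four-step research model (assumed data frame `brfss`).
library(party)

# 1. Data preparation: an 80/20 random split, as described in Sect. 2.4
set.seed(42)                                   # seed choice is illustrative
train_n    <- floor(0.8 * nrow(brfss))
train_idx  <- sample(nrow(brfss), train_n)
training   <- brfss[train_idx, ]
validation <- brfss[-train_idx, ]

# 2-3. Model development and training: a conditional inference tree
fit <- ctree(diabetes ~ bphigh + toldhi, data = training)

# 4. Model testing and analysis: predict on the hold-out set
pred <- predict(fit, newdata = validation)
table(observed = validation$diabetes, predicted = pred)
```

Because ctree() selects each split by permutation-test significance, the most significant variable appears at the root, which is why the paper can read off the "most influential" risk factor from the top of each tree.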
Table 1 Diabetes risk factors (National Institute of Diabetes and Digestive and Kidney Diseases, 2016)

Risk factor | Risky class | BRFSS category
Age | 45+ years of age | Reported age in five-year categories calculated variable (_AGEG5YR); Imputed age collapsed above 80 (_AGE80); Imputed age in six groups (_AGE_G)
Race | African-American, Alaska Native, American-Indian, Asian American, Hispanic, Native Hawaiian, Pacific Islander | Computed Race-Ethnicity Grouping (_RACE); Computed Five level race/ethnicity category (_RACEGR3)
BMI | Overweight or Obese | Computed BMI (_BMI5); Computed BMI Categories (_BMI5CAT); Overweight or Obese calculated variable (_RFBMI5)
Physical Activity | No Physical Activity | Exercise in Past 30 Days (EXERANY2); Leisure Time Physical Activity Calculated Variable (_TOTINDA)
Cholesterol | High cholesterol | Ever Told Blood Cholesterol High (TOLDHI2); Currently taking medicine for high cholesterol (CHOLMED1); High cholesterol calculated variable (_RFCHOL1)
Blood Pressure | High blood pressure | Ever told blood pressure high (BPHIGH4); Currently taking blood pressure medication (BPMEDS)
Depression | History of Depression | Ever told you had a depressive disorder (ADDEPEV2)
Stroke | History of Stroke | Ever diagnosed with a stroke (CVDSTRK3)
Heart Disease | History of Heart Disease | Ever diagnosed with angina or coronary heart disease (CVDCRHD4); Ever had CHD or MI (_MICHD)

regression [44]. This form of regression produces the odds ratio, which is then evaluated for significance using a t-test and subsequent p value. For the binary logistic regression model, the "Ever told you have diabetes" variable was transformed into a binary dependent variable to identify those who responded "Yes," they have been told they have diabetes, or "No," they did not. Independent variables which were statistically significant were selected for inclusion in the models. This included 9 variables from 450,698 individual records. The risk factors associated with the prediction model were "Reported age in five-year categories calculated variable," "computed five level race/ethnicity category," "computed body mass index categories," "exercising in past 30 days," "ever told blood cholesterol high," "ever told blood pressure high," "ever told you had a depressive disorder," "ever diagnosed with a stroke" and "ever diagnosed with angina or coronary heart disease." See Table 2 for binary logistic regression model results.

2.3 Data and collection

Data for this study was acquired from the 2017 Behavioral Risk Factor Surveillance System (BRFSS) survey conducted by the CDC [9]. This was a telephone survey data set which included respondents from 50 states, as well as US territories. The objective of the BRFSS survey was to collect uniform state-specific data on health risk behaviors, chronic diseases, access to healthcare, and the use of preventative health services related to the leading causes of death in the US. Data collection is managed by state health departments following protocols established by the CDC. States and US territories collect data for each of the 12 calendar months,

Table 2 Results of binary logistic regression for diabetes

BRFSS variable name | BRFSS variable code | B | S.E. | Wald | df | P value | Exp(B)
Computed Body Mass Index Categories | _BMI5CAT | 0.654 | 0.007 | 8604.067 | 1 | 0.000 | 1.923
Reported Age in Five-Year Categories | _AGEG5YR | 0.143 | 0.002 | 5100.824 | 1 | 0.000 | 1.154
Computed Five Level Race/Ethnicity Categories | _RACEGR3 | 0.104 | 0.003 | 1025.268 | 1 | 0.000 | 1.110
Exercising in the Past 30 Days | EXERANY2 | 0.350 | 0.011 | 1063.921 | 1 | 0.000 | 1.419
Told High Cholesterol | TOLDHI2 | -0.636 | 0.011 | 3463.320 | 1 | 0.000 | 0.529
Ever Told High Blood Pressure | BPHIGH4 | -0.444 | 0.006 | 5796.620 | 1 | 0.000 | 0.641
Ever Diagnosed with Angina or Coronary Heart Disease | CVDCRHD4 | -0.522 | 0.016 | 1022.983 | 1 | 0.000 | 0.593
Ever Told You Had a Depressive Disorder | ADDEPEV2 | -0.293 | 0.012 | 591.188 | 1 | 0.000 | 0.746
Ever Diagnosed with a Stroke | CVDSTRK3 | -0.383 | 0.019 | 390.831 | 1 | 0.000 | 0.682
Constant | | -1.716 | 0.063 | 733.426 | 1 | 0.000 | 0.180
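The significance screening behind Table 2 can be sketched with base R's `glm`. The data frame and exact column names here are assumptions (BRFSS names beginning with an underscore cannot be used verbatim in R, so they are shown with an `X_` prefix):

```r
# Binary logistic regression screen for candidate diabetes risk factors.
# `brfss` and its column names are assumed; BRFSS variables whose names
# begin with "_" (e.g. _BMI5CAT) are written with an X_ prefix because
# R identifiers cannot start with an underscore.
model <- glm(
  diabetes ~ X_BMI5CAT + X_AGEG5YR + X_RACEGR3 + EXERANY2 +
    TOLDHI2 + BPHIGH4 + CVDCRHD4 + ADDEPEV2 + CVDSTRK3,
  data = brfss, family = binomial
)

summary(model)    # coefficients (B), standard errors, and p values
exp(coef(model))  # odds ratios, i.e. the Exp(B) column of Table 2
```

Variables whose coefficients are significant in `summary(model)` would then be retained as tree inputs, mirroring the selection step described above.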
submitting the data to the CDC at the end of each month [45].

Studies have been done to assess the reliability and validity of the BRFSS data set. According to Pierannunzi et al., who conducted a systematic review of related studies, the BRFSS data are reliable and valid because prevalence rates match well with other national surveys which relied on self-reports [46]. Prevalence estimates from the data set also correspond well with findings from surveys based on face-to-face interviews, such as the National Health Interview Study and the National Health and Nutrition Examination Survey [47].

The 2017 BRFSS data set contains variables that were created from the questions asked, as well as calculated variables. There are two types of calculated variables included in the data set: intermediate variables, and variables used to categorize or classify respondents [48]. Intermediate variables are taken from a question response and are used to calculate another variable or risk factor. An example is the Body Mass Index (_BMI5) variable being calculated from the individual computed weight and height variables WTKG3 (Computed Weight in Kilograms) and HTM4 (Computed Height in Meters), with WTKG3 originally being calculated from the variable WEIGHT2 (Reported Weight in Pounds). The other type of calculated variable is used to classify or categorize respondents, simplifying analysis or identifying risk of a specific injury or illness. The CDC provides the Statistical Analysis System (SAS) code that was used to calculate each variable [48].

2.4 Preparing the data

Before beginning analysis, some of the data was transformed using SPSS. A missing value analysis was conducted, and the variables favored for inclusion in the analysis showed less than 20% missing data. As recommended by Garson, a conservative cutoff of 20% missing values was used [49]. All survey variables required careful recoding. The dependent variable was coded as binary. The variables were recoded so that they would be equivalent in code/response to each other, as many questions were coded differently. On categorical yes/no variables where "Yes" = 1 and "No" = 2 (or vice versa), "Don't know/not sure" (usually coded as 7) and "Refused/missing" (usually coded as a value of 9) responses were recoded into "No" categories (in some questions these two were combined into a common code of 9, or similar, which was recoded into "No"). For example, on the variable "Drink any alcoholic beverages in the past 30 days", 1 = "Yes," 2 = "No," 7 = "Don't know/not sure," and 9 = "Refused/Missing"; 7 and 9 were recoded into 2s, signifying people who did not answer "Yes".

For discrete variables, respondents' answers were kept the same, only recoding the "Don't know/Not sure/Refused/Missing" category, which can be signified by various codes depending on the question. For example, on the variable "Computed number of drinks of alcohol beverages per week", 0 = did not drink, 1–98,999 = number of drinks per week specified by the respondent, and 99,900 = Don't know/Not sure/Refused/Missing. The 99,900 category was recoded into 0, as these respondents did not specify that they consumed any alcoholic drinks. Following transformation and recoding, the data set was examined for outliers, misspecification and error.

For the first cycle of analysis, the entire data set was used to create decision trees. The R tree algorithm selects the most important variable for the first split in the tree, the second most important variable as the next split, and so on. For the second analysis and the creation of the predictive decision tree, the data set was split into training (80%) and validation (20%) data sets. This was done at random using the R command "sample." This split the total number of records (450,638) into a training set of 360,679 records and a validation set of 89,959 records.

2.5 Analysis

There are advantages to using and teaching rule-based classification, since the rules are easy to explain and can be understood by practitioners [50]. Rules are typically represented in logic form as IF–THEN statements. For example, in using rule-based classification for predicting breast cancer, Singh used IF statements such as Gender = Female, Age >= 60, and Gene Mutation = BRCA2 [51]. Therefore, if a woman was found to have a breast cancer risk factor, such as the BRCA2 mutation, then she would be classified into the "Risky Class" by the algorithm [51]. There are several tree-building algorithms available for classifying and segmenting data. For those without a computer science background, new tree-building software, such as TreeAge, has made the process easier [7, 52].

In this study, conditional inference classification decision tree models were created using the R "party" package and the "ctree" algorithm. Three conditional inference classification decision trees were created for diabetes using different independent variables. A fourth prediction decision tree was created with split training and testing data sets. Using the training data set, this tree provided a prediction using the classification algorithm and the "predict" function in R. The prediction was then tested using the validation data set.

3 Results

3.1 Decision tree models

The first tree, shown in Fig. 2, captured the relationship between Exercise, Stroke, Depression and Diabetes. In this
model, the algorithm selected exercise as the most influential variable, followed by stroke history and depression. Figure 2 shows the classification error rates ranging from 8.8% to 23.8%, with an average error rate of 17.46%.

The second tree examined the relationship between High Blood Cholesterol, BMI and Diabetes. In this model, the algorithm selected high blood cholesterol as the most influential variable, followed by body mass index. Figure 3 shows that this decision tree had classification error rates ranging from 3.4% to 22.2%, with an average of 10.64%, meaning it was more accurate than the first tree at predicting diabetes.

The third tree showed the relationship between High Blood Cholesterol, High Blood Pressure and Diabetes. In this model, the algorithm selected high blood pressure as the most important variable, followed by high cholesterol. Figure 4 shows that this decision tree had error rates ranging from 3.93% to 21.5%. The average error rate for this tree was 11.01%, meaning that its error rate was slightly higher than that of the second tree. The results from these three decision trees show that tree 2 had the lowest error rate.

3.2 Machine learning

A fourth tree was created using machine learning in R. This model has the ability to predict whether someone might get diabetes, instead of just classifying individuals. This tree showed the relationship between High Blood Cholesterol, High Blood Pressure and Diabetes. First, a base decision tree was created using the "ctree" function. In this model, the algorithm selected high blood pressure as the most important variable, followed by high cholesterol. Using this tree, we trained the prediction model using the training data set and then validated the model using a smaller data set. The "predict tree" R algorithm learned from the initial classification and was then used to predict an outcome with the validation data. The prediction was tested a second time. Figure 5 shows these results. With the initial prediction, there was an average error rate of 17.6%. When testing the prediction, there was an average error rate of 17.9%. When compared to the tree using the entire data set, this model had a slightly poorer predictive value. However, this tree has the advantage of being able to make a prediction of disease risk versus just classifying individuals. Error rates for individual nodes are noted in Table 3.

To test the accuracy, sensitivity and specificity of the classification models, we followed Delen et al. [33], Eqs. (1), (2) and (3):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Sensitivity = TP / (TP + FN)    (2)

Specificity = TN / (TN + FP)    (3)

In these formulations FN, FP, TN, and TP denote false negative, false positive, true negative, and true positive, respectively; Eqs. 1, 2, and 3 were used to calculate accuracy, sensitivity, and specificity [33]. Results indicate the best overall performance was obtained when high blood cholesterol and body mass index were considered, with a predictive accuracy of 93.57%, a sensitivity of 92.47% and a specificity of 87.02%. Exercise, stroke and depression formed the least predictive classification tree, with only 88.67% accuracy, 66.75% sensitivity and 96.12% specificity. This tree was very accurate at predicting true

Fig. 2 Conditional inference decision tree for diabetes with exercise, stroke, and depression (eight terminal nodes; error rates range from 8.8% to 23.8%)
Fig. 3 Conditional inference tree for diabetes with high cholesterol and BMI (seven terminal nodes; error rates range from 3.4% to 22.2%)

negatives with the high specificity level, but poor at predicting true positives. Using high blood pressure and high cholesterol, the next tree was better at predicting true positives, with an accuracy rate of 89.47%, a sensitivity of 96.61% and a specificity of 66.42%. After training this model and analyzing the hold-out data there was a slight improvement, with an accuracy rate of 89.48%, a sensitivity of 96.60% and a specificity of 66.48%.

Fig. 4 Conditional inference tree for diabetes with high blood pressure and high cholesterol (five terminal nodes; error rates range from 3.93% to 21.2%)
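Equations (1) to (3) translate directly into R. As a minimal illustration, scoring a classifier from hypothetical confusion-matrix counts (the counts below are not from the study):

```r
# Accuracy, sensitivity and specificity from a confusion matrix
# (Eqs. 1-3). The TP/FN/TN/FP counts are hypothetical, for illustration.
tp <- 900; fn <- 100  # true positives, false negatives
tn <- 800; fp <- 200  # true negatives, false positives

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # 0.85
sensitivity <- tp / (tp + fn)                   # 0.90
specificity <- tn / (tn + fp)                   # 0.80
```

In practice these counts would come from a cross-tabulation of observed versus predicted classes on the validation set, e.g. `table(validation$diabetes, predict(fit, newdata = validation))`.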
542 Health and Technology (2021) 11:535–545

Fig. 5  Conditional inference


tree for diabetes, with high High Blood
blood pressure, and high blood Pressure
cholesterol created with training
data set

High Cholesterol

High Cholesterol

High Blood
Pressure
N=81282 N=64031
Err=17238.7 Err=9248.7
21.2% 14.4% N=52299
Err=5056.9
9.7%

N=161045 N=2022
Err=6286.4 Err=114.6
3.9% 5.7%

The results of this study effectively addressed the research for clinicians and encourage the use of evidence-based medi-
questions that were posed. In response to RQ1, this study cine in a new more accurate and predictive ways. In response
showed that diabetes risk factors can be identified from a to RQ3, conditional inference decision trees provided simple
large behavioral data set using binary logistic regression. and effective classification models for diabetes supporting
In response to RQ2 and in support of evidence-based medi- the use of machine learning to create disease prediction
cine, models using different risk factors can effectively be models for practitioners.
compared in order to evaluate which risk factors have the
strongest influence on predictive capability. These prediction
models may then be used to provide clinical decision support 4 Discussion
Table 3 Results for conditional inference tree prediction using validation data

Tree 4: High Blood Pressure and High Cholesterol, Prediction with Training Data

Node     N         Error     Pct Error
1        56,468    24,814    43.94%
2        52,819    11,212    21.23%
3        46,627    5,672     12.16%
4        1,900     122       6.42%
5        154,492   6,553     4.24%
Average                      17.60%

Tree 4: High Blood Pressure and High Cholesterol, Prediction Tested with Validation Data

Node     N         Error     Pct Error
1        13,911    6,199     44.56%
2        13,176    2,738     20.78%
3        11,833    1,477     12.48%
4        454       33        7.27%
5        38,440    1,698     4.42%
Average                      17.90%

Four decision tree models were created in this study. The first tree looked at the relationship between exercise, stroke, depression, and diabetes. This tree had an average error rate of 17.46%, meaning it was correct in classifying diabetes or no diabetes in individuals 82.54% of the time. This tree had the highest error rate, which could indicate that these particular risk factors are weaker in predicting diabetes.

The second tree looked at the relationship between high blood cholesterol, BMI, and diabetes. This tree had an average error rate of 10.64%, meaning it was correct in classifying diabetes or no diabetes in an individual 89.36% of the time. This tree had the best classification capability and the lowest error rate, which could indicate that these risk factors are more strongly associated with diabetes.

The third tree looked at the relationship between high blood cholesterol, high blood pressure, and diabetes. This tree had an average error rate of 11.01%, meaning it was correct in classifying diabetes or no diabetes in individuals 88.99% of the time, making it slightly poorer in classification

compared to the second tree. This could indicate that these risk factors are good classifiers of diabetes.

The fourth tree was created with the purpose of predicting which individuals would have diabetes using machine learning. The tree looked at the relationship between high blood cholesterol, high blood pressure, and diabetes, similar to the third tree. This tree had an average error rate of 17.6% when evaluated for classification strength. When the prediction was tested using the validation data set, there was an error rate of 17.9%, meaning it was correct at predicting whether an individual had or did not have diabetes 82.1% of the time. Overall, this tree performed only slightly poorer than the third tree in predictive capability.

Healthcare professionals should consider using decision trees and machine learning for classification and predictive analysis to assist in reducing costs and improving clinical outcomes. Different combinations of risk factors could affect prediction and classification results. Risks are correlated and dependent on each other, so predictive models need to address simultaneous conditions [10].

Healthcare providers need resources and available time to utilize predictive analytics. However, many report that incomplete data and insufficient technology are the biggest obstacles to implementing predictive analytics [53]. Hospitals are more likely to lack the technology required to take advantage of predictive analytics, while medical groups and clinics are twice as likely to lack employees who are skilled in predictive analytics [53].

The healthcare industry has historically made decisions differently than other business sectors [53]. The clinical issue, rather than habit or protocol, should determine medical intervention. Medical authorities will have to adopt a new way of thinking about research, including switching from the primary use of deductive reasoning to inductive reasoning and pattern recognition [18]. The clinician must not be merely bound by rules and guidelines but be taught to apply those rules in the context of each patient. A recent campaign in the United Kingdom, "Too Much Medicine," led by academics, clinicians, and patients, hopes to reduce overscreening, overdiagnosis, and overtreatment [54].

4.1 Limitations

There were some limitations regarding the nature of self-reported survey results. First is the clumping of data around whole numbers; for example, when asked their weight, respondents would be more likely to report 150 lbs than 151 lbs. Data smoothing to account for this can be done but was out of scope for this paper. Other limitations of self-reported data included people underestimating their tobacco/alcohol usage or overestimating the frequency of seeking healthcare and following medical advice. There can also be non-response bias: participants such as smokers may be more likely to not answer the question about smoking. There may also be sampling and measurement errors, such as the wording on the questionnaire.

There was also a limitation in the structure of the BRFSS question concerning diabetes type. The question did not differentiate among the various types of diabetes. We assume a small number of respondents had type I or gestational diabetes, and this could have affected the results. Another limitation concerned disease risk factors: the patients in this survey had already been diagnosed with diabetes and may have altered their lifestyle due to the disease. The final limitation was that not all disease risk factors identified in the literature had corresponding BRFSS variables, so some could not be included in the analysis.

4.2 Future research

Looking forward, a next step is the standardization of disease prediction models for clinical use. The Society of Actuaries conducted a survey analyzing the state of predictive analytics in healthcare. The survey identified that within the US healthcare industry, fewer than half (43%) of healthcare organizations are currently using predictive analytics [53]. While most payers in healthcare are using predictive analytics (80%), only 39% of medical groups/clinics and only 36% of hospitals are using these tools. For those who are using predictive analytics, the most common use is predicting hospital readmissions and costs. Future research on this topic might implement the predictive and classification models created in this study in the clinical setting. Finally, the models must be evaluated for their usefulness in medical practice.

4.3 Conclusion

Although disease prevention awareness campaigns have become more prevalent, the US continues to be ravaged by chronic disease, in both mortality and cost. In its most recent report, the Partnership to Fight Chronic Disease [55] estimates the projected total cost of chronic disease in America to reach $42 trillion between 2016 and 2030. The number of people with three or more chronic diseases in the US is expected to reach 83.4 million by 2030, compared to 30.8 million in 2015. With behavioral changes, new interventions, and treatment advances, 16 million lives could be saved in the next 15 years [55].

Although many advanced diagnostic tools exist, the healthcare field still lacks accessible predictive models to plan interventions. Healthcare providers currently say that clinical outcomes and costs are the most valuable data to predict [53]. By focusing on prediction of disease risk, we can improve clinical outcomes and reduce costs.


Professionals involved with analyzing healthcare data should assist clinicians and healthcare organizations in utilizing predictive analytics, maintaining clean data, and bridging the gap between clinicians and data scientists.

Author's contributions Elena G Toth: Conceptualization, Methodology, Software, Writing - Original Draft, Visualization, Formal Analysis. Alexander McLeod: Conceptualization, Methodology, Software, Visualization, Formal Analysis, Writing - Review & Editing, Supervision, Resources. David Gibbs: Writing - Review & Editing, Resources. Jacqueline Moczygemba: Writing - Review & Editing, Resources.

Declarations

Data availability The BRFSS data set is available at https://www.cdc.gov/brfss/annual_data/annual_2017.html. Processed data used in the research paper is available upon request.

Availability of code Code used in R software is available upon request.

Informed consent This study did not have human participants and therefore did not require informed consent.

Conflicts of interest To the best of our knowledge, the named authors have no conflict of interest, financial or otherwise.

References

1. Barrett MA, Humblet O, Hiatt RA, et al. Big data and disease prevention: from quantified self to quantified communities. Big Data. 2013;1(3):168–75. https://doi.org/10.1089/big.2013.0027.
2. National Center for Health Statistics. Health, United States, 2011: With Special Feature on Socioeconomic Status and Health. Hyattsville, MD; 2012.
3. Lin Y-K, Chen H, Brown RA, et al. Healthcare predictive analytics for risk profiling in chronic care: a Bayesian multitask learning approach. MIS Quarterly. 2017;41(2):473–95. https://doi.org/10.25300/MISQ/2017/41.2.07.
4. Fox B. Using big data for big impact: Leveraging data and analytics provides the foundation for rethinking how to impact patient behavior. Health Manag Technol. 2011;32(11):16. PubMed PMID: 22141243.
5. Keehan SP, Stone DA, Cuckler GA, et al. National health expenditure projections, 2016–25: Price increases, aging push sector to 20 percent of economy. Health Aff. 2017;36(3):553–63. https://doi.org/10.1377/hlthaff.2016.1627.
6. Lash TA, Escobedo MR. Introduction. Clin Geriatr Med. 2018;34(3):XVII–XIX. https://doi.org/10.1016/j.cger.2018.06.002.
7. Jalal H, Pechlivanoglou P, Krijkamp E, et al. An overview of R in health decision sciences. Med Decis Making. 2017;37(7):735–46. https://doi.org/10.1177/0272989X16686559.
8. Moskowitz A, McSparron J, Stone DJ, et al. Preparing a new generation of clinicians for the era of big data. Harv Med Stud Rev. 2015;2(1):24–7.
9. Centers for Disease Control and Prevention. Behavioral Risk Factor Surveillance System Survey (BRFSS). In: National Center for Chronic Disease Prevention and Health Promotion, Division of Population Health (ed.). Atlanta, GA; 2016.
10. Belle A, Thiagarajan R, Soroushmehr S, et al. Big data analytics in healthcare. BioMed Res Int. 2015. https://doi.org/10.1155/2015/370194.
11. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2(1):3. https://doi.org/10.1186/2047-2501-2-3.
12. Monica K. Why are so few healthcare providers using EHR data analytics? EHR Intelligence, xtelligent Healthcare Media; 2017.
13. Palaniappan S, Awang R. Intelligent heart disease prediction system using data mining techniques. In: IEEE/ACS International Conference on Computer Systems and Applications, Doha, Qatar; 2008, pp. 108–115. IEEE.
14. Viceconti M, Hunter PJ, Hose RD. Big data, big knowledge: big data for personalized healthcare. IEEE J Biomed Health Inform. 2015;19(4):1209–15. https://doi.org/10.1109/JBHI.2015.2406883.
15. Kaisler S, Armour F, Espinosa JA, et al. Big data: Issues and challenges moving forward. In: Proceedings of the 46th Hawaii International Conference on System Sciences (HICSS), Maui, HI; 2013, pp. 995–1004. IEEE.
16. Kent J. Big data to see explosive growth, challenging healthcare organizations. Health IT Analytics; 2018.
17. Razavian N, Blecker S, Schmidt AM, et al. Population-level prediction of type 2 diabetes from claims data and analysis of risk factors. Big Data. 2015;3(4):277–87. https://doi.org/10.1089/big.2015.0020.
18. Krumholz HM. Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Aff. 2014;33(7):1163–70. https://doi.org/10.1377/hlthaff.2014.0053.
19. Wang L, Porter B, Maynard C, et al. Predicting risk of hospitalization or death among patients receiving primary care in the Veterans Health Administration. Med Care. 2013;51(4):368–73. https://doi.org/10.1097/MLR.0b013e31827da95a.
20. Neuvirth H, Ozery-Flato M, Hu J, et al. Toward personalized care management of patients at risk: The diabetes case study. In: Knowledge Discovery and Data Mining. 2011, pp. 395–403. https://doi.org/10.1145/2020408.2020472.
21. Lloyd-Jones DM, Leip EP, Larson MG, et al. Prediction of lifetime risk for cardiovascular disease by risk factor burden at 50 years of age. Circulation. 2006;113(6):791–8. https://doi.org/10.1161/CIRCULATIONAHA.105.548206.
22. Maguire J, Dhar V. Comparative effectiveness for oral anti-diabetic treatments among newly diagnosed type 2 diabetics: data-driven predictive analytics in healthcare. Health Syst. 2013;2(2):73–92. https://doi.org/10.1057/hs.2012.20.
23. Adler P, Rajesh R, Jamie SH, et al. Risk prediction for chronic kidney disease progression using heterogeneous electronic health record data and time series analysis. J Am Med Inform Assoc. 2015;22(4):872–80. https://doi.org/10.1093/jamia/ocv024.
24. Letham B, Rudin C, McCormick TH, et al. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann Appl Stat. 2015;9(3):1350–70. https://doi.org/10.1214/15-AOAS848.
25. Sun J, Hu J, Luo D, et al. Combining knowledge and data driven insights for identifying risk factors using electronic health records. AMIA Annu Symp Proc. 2012;2012:901–10.
26. Henry KE, Saria S, Hager DN, et al. A targeted real-time early warning score (TREWScore) for septic shock. Sci Transl Med. 2015;7(299):1–9. https://doi.org/10.1126/scitranslmed.aab3719.
27. Bates DW, Saria S, Ohno-Machado L, et al. Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Aff. 2014;33(7):1123–31. https://doi.org/10.1377/hlthaff.2014.0041.


28. Bardhan I, Oh J-h, Zheng Z, et al. Predictive analytics for readmission of patients with congestive heart failure. Inf Syst Res. 2015;26(1):19–39.
29. Xie Z, Li D, Nikolayeva O, et al. Building risk prediction models for type 2 diabetes using machine learning techniques. Prev Chronic Dis. 2019;16(E130):1–9. https://doi.org/10.5888/pcd16.190109.
30. Piri S, Delen D, Liu T, et al. A data analytics approach to building a clinical decision support system for diabetic retinopathy: Developing and deploying a model ensemble. Decis Support Syst. 2017;101:12–27. https://doi.org/10.1016/j.dss.2017.05.012.
31. Wang T, Qiu RG, Yu M, et al. Directed disease networks to facilitate multiple-disease risk assessment modeling. Decis Support Syst. 2020;129:113171. https://doi.org/10.1016/j.dss.2019.113171.
32. Steinberg GB, Church BW, McCall CJ, et al. Novel predictive models for metabolic syndrome risk: a "big data" analytic approach. Am J Manag Care. 2014;20(6):e221–8.
33. Delen D, Oztekin A, Tomak L. An analytic approach to better understanding and management of coronary surgeries. Decis Support Syst. 2012;52(3):698–705. https://doi.org/10.1016/j.dss.2011.11.004.
34. Dag A, Oztekin A, Yucel A, et al. Predicting heart transplantation outcomes through data analytics. Decis Support Syst. 2017;94:42–52.
35. Zhu Y, Fang J. Logistic regression-based trichotomous classification tree and its application in medical diagnosis. Med Decis Making. 2016;36(8):973–89. https://doi.org/10.1177/0272989X15618658.
36. Turnea M, Ilea M. Predictive simulation for type II diabetes using data mining strategies applied to big data. In: The International Scientific Conference eLearning and Software for Education, Bucharest, Romania; 2018, pp. 481–486. Carol I National Defence University.
37. Neff G. Why big data won't cure us. Big Data. 2013;1(3):117–23. https://doi.org/10.1089/big.2013.0029.
38. Talboy AN, Schneider SL. Improving accuracy on Bayesian inference problems using a brief tutorial. J Behav Decis Mak. 2017;30(2):373–88. https://doi.org/10.1002/bdm.1949.
39. Eastwood J, Snook B, Luther K. What people want from their professionals: Attitudes toward decision-making strategies. J Behav Decis Mak. 2012;25(5):458–68. https://doi.org/10.1002/bdm.741.
40. Oztekin A, Kong ZJ, Delen D. Development of a structural equation modeling-based decision tree methodology for the analysis of lung transplantations. Decis Support Syst. 2011;51(1):155–66. https://doi.org/10.1016/j.dss.2010.12.004.
41. Bertsimas D, O'Hair A, Relyea S, et al. An analytics approach to designing combination chemotherapy regimens for cancer. Manage Sci. 2016;62(5):1511–31.
42. Nichols H. The top 10 leading causes of death in the United States. https://www.medicalnewstoday.com/articles/282929.php (2018).
43. National Institute of Diabetes and Digestive and Kidney Diseases. Risk factors for type 2 diabetes. https://www.niddk.nih.gov/health-information/diabetes/overview/risk-factors-type-2-diabetes (2016, 2018).
44. Sperandei S. Understanding logistic regression analysis. Biochemia Medica. 2014;24(1):12–8. https://doi.org/10.11613/BM.2014.003.
45. Centers for Disease Control and Prevention. Overview: BRFSS 2017. 2018.
46. Pierannunzi C, Hu SS, Balluz L. A systematic review of publications assessing reliability and validity of the Behavioral Risk Factor Surveillance System (BRFSS), 2004–2011. BMC Med Res Methodol. 2013;13(1):49. https://doi.org/10.1186/1471-2288-13-49.
47. Li C, Balluz LS, Ford ES, et al. A comparison of prevalence estimates for selected health indicators and chronic diseases or conditions from the Behavioral Risk Factor Surveillance System, the National Health Interview Survey, and the National Health and Nutrition Examination Survey, 2007–2008. Prev Med. 2012;54(6):381–7. https://doi.org/10.1016/j.ypmed.2012.04.003.
48. Centers for Disease Control and Prevention. Calculated variables in the 2017 Behavioral Risk Factor Surveillance System data file. 2018.
49. Garson GD. Missing values analysis and data imputation. Asheboro, NC: Statistical Associates Publishing; 2015.
50. Li X, Liu B. Rule-based classification. In: Aggarwal CC (ed) Data Classification: Algorithms and Applications. Chapman & Hall/CRC; 2014, pp. 121–156.
51. Singh NK. Prediction of breast cancer using rule based classification. Appl Med Inf. 2015;37(4):11–22.
52. Bae J-M. The clinical decision analysis using decision tree. Epidemiol Health. 2014;36:e2014025. https://doi.org/10.4178/epih/e2014025.
53. Society of Actuaries. The state of predictive analytics in US healthcare. Modern Healthcare; 2016.
54. Greenhalgh T, Howick J, Maskrey N. Evidence based medicine: a movement in crisis? BMJ. 2014;348:g3725. https://doi.org/10.1136/bmj.g3725.
55. Partnership to Fight Chronic Disease. What is the impact of chronic disease in America? 2016. FightChronicDisease.org.
