Toth 2021
https://doi.org/10.1007/s12553-021-00542-w
ORIGINAL PAPER
Received: 27 November 2020 / Accepted: 12 March 2021 / Published online: 24 March 2021
© IUPESM and Springer-Verlag GmbH Germany, part of Springer Nature 2021
Abstract
There is increasing excitement in the healthcare field about using behavioral data and healthcare analytics for disease risk
prediction, clinical decision support, and overall improvement of personalized medicine. However, this excitement has not
effectively translated to improved clinical outcomes due to knowledge gaps, a lack of behavioral risk models, and resistance
to evidence-based practice. Reportedly, only 10–20% of clinical decisions are known to be evidence-based and this problem
is further highlighted by the fact that the US spends more money on healthcare per person than any other nation, while still
wrestling with poor health outcomes. Critics say there are inadequate technological resources and analytical education for
clinicians to make behavioral data useful in the medical world. Healthcare technology innovators often neglect important
aspects of the reality of integrating clinical data into electronic healthcare solutions. In this study, we developed a decision
tree model using R statistical software to predict diabetes since it is among the top causes of death in the US, can be poorly
managed, and provides an opportunity for improvement using analytics. This study examined behavioral data and healthcare
analytics for use in clinical applications, demonstrating that health information professionals can develop behavioral risk
factor prediction models to bridge the gap. Results indicated that decision trees are effective in classifying diabetes in an
individual at up to 89.36% accuracy.
Keywords Analytics · Disease prediction · Clinical decision support · Diabetes · Decision trees
* Elena G. Toth
  elenagritsenko@utexas.edu
  David Gibbs
  dgibbs@txstate.edu
  Jackie Moczygemba
  jackiem@txstate.edu
  Alexander McLeod
  am@txstate.edu
1 Department of Health Information Management, Texas State University College of Health Professions, 601 University Dr., Encino Hall 310B, San Marcos, TX 78666, USA

Preventable chronic diseases are the most common cause of premature death in the US population [1, 2]. Chronic diseases also account for 86% of healthcare costs in the US [3]. This is partly due to a lack of intervention based on behavioral risk factors, such as obesity, and a large amount of funds being spent on high-cost medical care after the disease has already developed. The healthcare industry is expected to consume a quarter of the country’s Gross Domestic Product in the near future [4, 5]. A great concern is an aging US population, with 10,000 more people turning 65 every day between January 2011 and January 2030 [6].

The goal of this research is to use healthcare analytics for the creation of behavioral risk prediction models to support clinical decision making in evidence-based practice. Specifically, we focus on utilizing R Statistical Software for decision tree analysis, as applications of R remain scarce in healthcare analytics [7].

Reportedly, only 10–20% of clinical decisions are known to be evidence-based [8]. Many disease prediction models are too advanced to be used by healthcare professionals without specialized training in data analytics. Using data from the Centers for Disease Control and Prevention (CDC) 2017 Behavioral Risk Factor Survey [9], this work creates diabetes prediction models using behavioral risk factors. Additionally, the models can help healthcare professionals, including those without extensive statistical training, understand how data analytics can aid in clinical decisions. Given the need for analytical skills to support evidence-based medicine, the following research questions were developed and addressed:
536                                                                                         Health and Technology (2021) 11:535–545
  RQ1: Can risk factors for diseases be identified in large behavioral data sets for clinical decision support?
  RQ2: How can predictive analytics be used to create diabetes prediction models for clinical decision support?
  RQ3: Can a diabetes prediction model be created with conditional inference trees and machine learning using R?

Through the use of data analytics, the US healthcare sector could save more than $300 billion per year, with two thirds coming from reduced healthcare spending [10]. Clinical treatment costs, at $165 billion, are one of the largest areas for savings [11].

There is now an opportunity to incorporate predictive analytics into clinical decision making. A study done by the Medical Group Management Association reported that only 31 percent of healthcare providers currently use analytics tools and capabilities offered in their systems [12]. Historically, patient care decisions have been mostly based on medical experience and practitioner intuition. A suggested improvement to this model is the use of evidence-based practices [13]. To do so, data and healthcare analytics may consider behavioral risk prediction models to identify new modifiable factors in the population.

1 Literature review

1.1 Data and analytics

Understanding the limits of data and analytics from a computing perspective is important. According to Viceconti et al. [14], data can be defined by the “5Vs”: volume (quantity of data), variety (different categories of data), velocity (quick generation of new data), veracity (quality of data), and value (within the data). However, storage, management, and processing of such data are considered to be fundamental overarching issues [15]. There is a vast quantity of healthcare data being created, and data volume is expected to grow at a compound annual growth rate of 36 percent through 2025 [16]. To create highly sensitive disease risk prediction models, we need large data sets that contain information about behavioral risk factors as well as disease outcomes.

1.2 Disease prediction

Historically, there have been successful implementations of large-scale risk assessment tools, such as disease onset prediction, hospital readmission models, and prototypes predicting healthcare cost and utilization [17–28]. For example, Columbia University Medical Center created a prediction system that analyzes correlations of physiological data related to patients with brain injuries. This system is able to diagnose complications 48 h sooner than other clinical methods in patients who suffer a bleeding stroke [11]. Additionally, Fox discussed using data analytics for creating predictive models that can identify how an intervention program is likely to impact the patient’s health behaviors [4].

Common approaches to disease risk prediction include linear regression, decision trees, machine learning, neural networks, and the Naïve Bayesian approach [29–31]. There are several statistical analytics tools similar to R that can be used in health decision sciences, including C++, MATLAB, and Python. Despite R’s wide availability and computational capabilities, it continues to be an under-utilized tool in the healthcare field. However, Jalal conducted a promising study comparing the use of R with other statistical tools in health decision sciences, and determined that R appears to be increasing in popularity, with studies using R in the healthcare field increasing by about 50% over the last 5 years [7].

Steinberg et al. [32] used an analytics platform to create a risk prediction model for metabolic syndrome. The study conducted screenings of patients and calculated risk of metabolic syndrome, impact of incremental changes on risk factors, and impact of adherence to treatment plan. Two models using machine learning had good predictive ability (0.80 and 0.88 ROC/AUC). For example, the study could identify a man who had a 92% chance of developing metabolic syndrome in the next 12 months. As a result, a metabolic syndrome intervention program focused on reducing waist circumference, which was determined to be a strong risk factor [32].

Delen, Oztekin and Tomak [33] developed models that predicted survival for patients undergoing coronary artery bypass. Their study compared artificial neural networks, support vector machines, and two decision tree algorithms. Support vector machines had the best prediction accuracy of 88%. Similarly, Dag, Oztekin, Yucel, Bulur and Megahed [34] identified 43 pre-operative variables for heart transplant patients that contribute to survival outcomes.

Zhu and Fang demonstrated the application of logistic regression-based trichotomous classification trees for medical diagnosis. Using both simulated and real data sets relating to breast cancer and diabetes, they demonstrated that the algorithm can be used to successfully classify disease states, including those patients who are difficult to diagnose [35].

Razavian et al. created a risk prediction model using insurance claims data [17]. Risk factors such as cardiovascular disease history and diagnosis of obesity were modeled using logistic regression and machine learning. The model predicted the risk of developing diabetes in the future with a 21.6% accuracy, compared to 11.4% using traditional prediction methods [17]. Turnea and Ilea created a predictive simulation for type II diabetes. Decision trees based on variables
such as body mass index and diastolic blood pressure were used. The authors propose that such a predictive model may be used to diagnose type II diabetes before complications appear [36].

1.3 Criticism of analytics and evidence-based practice

knowledge and experience, as well as data, was rated the highest by participants when it came to preference, accuracy, fairness, and ethicalness. The study suggests that statistical models should be used simply to aid in clinical decision making, and that a healthcare professional must be there to carefully review and interpret the results [39].
Table 1  Diabetes risk factors (National Institute of Diabetes and Digestive and Kidney Diseases, 2016)
Risk factor        Risky class                                        BRFSS category
Age                45+ Years of Age                                   Reported age in five-year categories calculated variable
                                                                      (_AGEG5YR), Imputed age collapsed above 80 (_AGE80),
                                                                      Imputed age in six groups (_AGE_G)
Race               African-American, Alaska Native, American-Indian,  Computed Race-Ethnicity Grouping (_RACE), Computed Five
                   Asian American, Hispanic, Native Hawaiian,         level race/ethnicity category (_RACEGR3)
                   Pacific Islander
BMI                Overweight or Obese                                Computed BMI (_BMI5), Computed BMI Categories (_BMI5CAT),
                                                                      Overweight or Obese calculated variable (_RFBMI5)
Physical Activity  No Physical Activity                               Exercise in Past 30 Days (EXERANY2), Leisure Time Physical
                                                                      Activity Calculated Variable (_TOTINDA)
Cholesterol        High cholesterol                                   Ever Told Blood Cholesterol High (TOLDHI2), Currently taking
                                                                      medicine for high cholesterol (CHOLMED1), High cholesterol
                                                                      calculated variable (_RFCHOL1)
Blood Pressure     High blood pressure                                Ever told blood pressure high (BPHIGH4), Currently taking
                                                                      blood pressure medication (BPMEDS)
Depression         History of Depression                              Ever told you had a depressive disorder (ADDEPEV2)
Stroke             History of Stroke                                  Ever diagnosed with a stroke (CVDSTRK3)
Heart Disease      History of Heart Disease                           Ever diagnosed with angina or coronary heart disease
                                                                      (CVDCRHD4), Ever had CHD or MI (_MICHD)
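
The "Risky class" column of Table 1 can be read as a set of simple IF-THEN checks. The sketch below is written in Python rather than the R used in the study, and it is only an illustration: the `Respondent` record, its field names, and the BMI cutoff of 25 for "overweight or obese" are assumptions introduced here, and the inputs are taken to be already decoded values rather than raw BRFSS response codes.

```python
from dataclasses import dataclass
from typing import List

# Race/ethnicity groups listed as the risky class in Table 1.
RISKY_RACE_GROUPS = {
    "African-American", "Alaska Native", "American-Indian", "Asian American",
    "Hispanic", "Native Hawaiian", "Pacific Islander",
}

# Assumption: BMI >= 25 stands in for "overweight or obese"; Table 1 gives no cutoff.
OVERWEIGHT_BMI = 25.0


@dataclass
class Respondent:
    """Hypothetical, already-decoded record (not raw BRFSS response codes)."""
    age: int
    race: str
    bmi: float
    exercised_past_30_days: bool
    high_cholesterol: bool = False
    high_blood_pressure: bool = False
    depression: bool = False
    stroke: bool = False
    heart_disease: bool = False


def risk_factors_present(r: Respondent) -> List[str]:
    """Return the Table 1 risk factors whose risky class applies to r."""
    rules = {
        "Age": r.age >= 45,                      # Table 1: 45+ years of age
        "Race": r.race in RISKY_RACE_GROUPS,
        "BMI": r.bmi >= OVERWEIGHT_BMI,
        "Physical Activity": not r.exercised_past_30_days,
        "Cholesterol": r.high_cholesterol,
        "Blood Pressure": r.high_blood_pressure,
        "Depression": r.depression,
        "Stroke": r.stroke,
        "Heart Disease": r.heart_disease,
    }
    return [name for name, applies in rules.items() if applies]


# Illustrative respondent: 52 years old, BMI 27.4, physically active.
example = Respondent(age=52, race="White", bmi=27.4, exercised_past_30_days=True)
print(risk_factors_present(example))
```

Checks of this kind only flag which risk factors are present; they do not weight them, which is what the regression and tree models in the study are for.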
regression [44]. This form of regression produces the odds ratio, which is then evaluated for significance using a t-test and subsequent p value. For the binary logistic regression model, the “Ever told you have diabetes” variable was transformed into a binary dependent variable to identify those who responded “Yes” they have been told they have diabetes, or “No” they did not. Independent variables which were statistically significant were selected for inclusion in the models. This included 9 variables from 450,698 individual records. The risk factors associated with the prediction model were “Reported age in five-year categories calculated variable,” “computed five level race/ethnicity category,” “computed body mass index categories,” “exercising in past 30 days,” “ever told blood cholesterol high,” “ever told blood pressure high,” “ever told you had a depressive disorder,” “ever diagnosed with a stroke” and “ever diagnosed with angina or coronary heart disease.” See Table 2 for Binary Logistic regression model results.

2.3 Data and collection

Data for this study was acquired from the 2017 Behavioral Risk Factor Surveillance System (BRFSS) survey conducted by the CDC [9]. This was a telephone survey data set which included respondents from 50 states, as well as US Territories. The objective of the BRFSS survey was to collect uniform state-specific data on health risk behaviors, chronic diseases, access to healthcare, and the use of preventative health services related to the leading causes of death in the US. Data collection is managed by state health departments following protocols established by the CDC. States and US territories collect data for each of the 12 calendar months,
Table 2  Binary logistic regression model results
BRFSS variable name                                     BRFSS variable code   B        S.E.    Wald       df   P value   Exp(B)
Computed Body Mass Index Categories                     _BMI5CAT               0.654   0.007   8604.067   1    0.000     1.923
Reported Age In Five-Year Categories                    _AGEG5YR               0.143   0.002   5100.824   1    0.000     1.154
Computed Five Level Race/Ethnicity Categories           _RACEGR3               0.104   0.003   1025.268   1    0.000     1.110
Exercising in the Past 30 days                          EXERANY2               0.350   0.011   1063.921   1    0.000     1.419
Told High Cholesterol                                   TOLDHI2               -0.636   0.011   3463.320   1    0.000     0.529
Ever Told High Blood Pressure                           BPHIGH4               -0.444   0.006   5796.620   1    0.000     0.641
Ever Diagnosed with Angina or Coronary Heart Disease    CVDCRHD4              -0.522   0.016   1022.983   1    0.000     0.593
Ever Told You Had a Depressive Disorder                 ADDEPEV2              -0.293   0.012    591.188   1    0.000     0.746
Ever Diagnosed with a Stroke                            CVDSTRK3              -0.383   0.019    390.831   1    0.000     0.682
Constant                                                                      -1.716   0.63     733.426   1    0.000     0.180
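
The Exp(B) column of Table 2 is the odds ratio, obtained by exponentiating the coefficient B. The short Python sketch below (Python is used here for illustration; the study itself worked in SPSS and R) recomputes exp(B) for each row and compares it with the reported value. The helper names `odds_ratio` and `percent_change_in_odds` are introduced here and are not from the paper.

```python
import math

# (BRFSS variable code, B, reported Exp(B)) as listed in Table 2.
# BPHIGH4 uses B = -0.444, the value consistent with its reported Exp(B).
TABLE_2 = [
    ("_BMI5CAT", 0.654, 1.923),
    ("_AGEG5YR", 0.143, 1.154),
    ("_RACEGR3", 0.104, 1.110),
    ("EXERANY2", 0.350, 1.419),
    ("TOLDHI2", -0.636, 0.529),
    ("BPHIGH4", -0.444, 0.641),
    ("CVDCRHD4", -0.522, 0.593),
    ("ADDEPEV2", -0.293, 0.746),
    ("CVDSTRK3", -0.383, 0.682),
    ("Constant", -1.716, 0.180),
]


def odds_ratio(b: float) -> float:
    """Odds ratio for a one-unit increase in the coded predictor."""
    return math.exp(b)


def percent_change_in_odds(b: float) -> float:
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (math.exp(b) - 1.0) * 100.0


for code, b, reported in TABLE_2:
    print(f"{code:10s} B={b:+.3f}  exp(B)={odds_ratio(b):.3f}  reported={reported:.3f}")
```

Because each predictor is a coded category, the direction of an effect depends on how the responses were coded (for example 1 = "Yes", 2 = "No"), so a value of Exp(B) below 1 should be read against the recoding scheme rather than as a protective effect.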
submitting the data to the CDC at the end of each month [45].

Studies have been done to assess the reliability and validity of the BRFSS data set. According to Pierannunzi et al., who conducted a systematic review of related studies, the BRFSS data are reliable and valid because prevalence rates match well with other national surveys which relied on self-reports [46]. Prevalence estimates from the data set also correspond well with findings from surveys based on face-to-face interviews such as the National Health Interview Survey and the National Health and Nutrition Examination Survey [47].

The 2017 BRFSS data set contains variables that were created from the questions asked, as well as calculated variables. There are two types of calculated variables included in the data set: intermediate variables, and variables used to categorize or classify respondents [48]. Intermediate variables are taken from a question response and are used to calculate another variable or risk factor. An example is the Body Mass Index (_BMI5) variable being calculated from individual computed weight and height variables WTKG3 (Computed Weight in Kilograms) and HTM4 (Computed Height in Meters), with WTKG3 originally being calculated from the variable WEIGHT2 (Reported Weight in Pounds). The other type of calculated variable is used to classify or categorize respondents for simplifying analysis or identifying risk of specific injury or illness. The CDC provides the Statistical Analysis System (SAS) code that was used to calculate each variable [48].

2.4 Preparing the data

Before beginning analysis, some of the data was transformed using SPSS. A missing value analysis was conducted and variables that were favored for inclusion in the analysis showed less than 20% missing data. As recommended by Garson, a conservative cutoff of 20% missing values was used [49]. All survey variables required careful recoding. The dependent variable was coded as binary. The variables were recoded so that they would be equivalent in code/response to each other, as many questions were coded differently. On categorical yes/no variables where “Yes” = 1, “No” = 2 (or vice versa), “Don’t know/not sure” (usually coded as 7) and “Refused/missing” (usually coded as a value of 9) responses were recoded into “No” categories (in some questions these two were combined into a common code of 9, or similar, which was recoded into “No”). For example, on variable “Drink any alcoholic beverages in the past 30 days”, 1 = “Yes,” 2 = “No,” 7 = “Don’t know/not sure,” 9 = “Refused/Missing.” 7 and 9 were recoded into 2s, signifying people who did not answer “Yes”.

For discrete variables, respondents’ answers were kept the same, only recoding the “Don’t know/Not sure/Refused/Missing” category, which can be signified by various codes depending on the question. For example, on variable “Computed number of drinks of alcohol beverages per week”, 0 = Did not drink, 1–98,999 = Number of drinks per week specified by respondent, and 99,900 = Don’t know/Not sure/Refused/Missing. The 99,900 category was recoded into 0, as these respondents did not specify they consumed any alcoholic drinks. Following transformation and recoding, the data set was examined for outliers, misspecification and error.

For the first cycle of analysis, the entire data set was used to create decision trees. The R tree algorithm selects the most important variable for the first split in the tree, second most important variable as the next split, and so on. For the second analysis and creation of the predictive decision tree, the data set was split into training (80%) and validation (20%) data sets. This was done at random using the R command “sample.” This split the total number of records (450,638) into a training set of 360,679 records and a validation set of 89,959 records.

2.5 Analysis

There are advantages to using and teaching rule-based classification, since the rules are easy to explain, and can be understood by practitioners [50]. Rules are typically represented in logic form as IF–THEN statements. For example, in using rule-based classification for predicting breast cancer, Singh used IF statements such as Gender = Female, Age >= 60, and Gene Mutation = BRCA2 [51]. Therefore, if a woman was found to have a breast cancer risk factor, such as the BRCA2 mutation, then she would be classified into the “Risky Class” by the algorithm [51]. There are several tree-building algorithms available for classifying and segmenting data. For those without a computer science background, new tree-building software, such as TreeAge, has made the process easier [7, 52].

In this study, conditional inference classification decision tree models were created using the R “party” package and the “ctree” algorithm. Three conditional inference classification decision trees were created for diabetes using different independent variables. A fourth prediction decision tree was created with split training and testing data sets. Using the training data set, this tree provided a prediction using the classification algorithm and “predict” function in R. The prediction was then tested using the validation data set.

3 Results

3.1 Decision tree models

The first tree, shown in Fig. 2, captured the relationship between Exercise, Stroke, Depression and Diabetes. In this
model, the algorithm selected exercise as the most influential variable, followed by stroke history and depression. Figure 2 shows the classification error rates ranging from 8.8% to 23.8%, with an average error rate of 17.46%.

The second tree examined the relationship between High Blood Cholesterol, BMI and Diabetes. In this model, the algorithm selected high blood cholesterol as the most influential variable, followed by body mass index. Figure 3 shows that this decision tree had classification error rates ranging from 3.4% to 22.2%, with an average of 10.64%, meaning it was more accurate than the first tree at predicting diabetes.

The third tree showed the relationship between High Blood Cholesterol, High Blood Pressure and Diabetes. In this model, the algorithm selected high blood pressure as the most important variable, followed by high cholesterol. Figure 4 shows that this decision tree had error rates ranging from 3.93% to 21.5%. The average error rate for this tree was 11.01%, slightly higher than that of the second tree. The results from these three decision trees show that tree 2 has the lowest error rate.

Fig. 2 [tree diagram not recoverable from the extracted text; visible split label: “Had Stroke?”]
Terminal nodes (N = records in node, Err = misclassified records, error rate):
N=3367      Err=724.1      21.5%
N=7733      Err=1488       19.2%
N=54976     Err=6628       12.1%
N=259360    Err=22929.4     8.8%
N=2911      Err=694.2      23.8%
N=4974      Err=1121.2     22.5%
N=28066     Err=4989.2     17.8%
N=89251     Err=12467.7    14.0%

Fig. 3 Conditional inference tree for diabetes with high cholesterol and BMI [tree diagram not recoverable; split labels: High Blood Cholesterol, BMI]

3.2 Machine learning

A fourth tree was created using machine learning in R. This model has the ability to predict whether someone might get diabetes, instead of just classifying individuals. This tree showed the relationship between High Blood Cholesterol, High Blood Pressure and Diabetes. First, a base decision tree was created using the “ctree” function. In this model, the algorithm selected high blood pressure as the most important variable, followed by high cholesterol. Using this tree, we trained the prediction model using the training data set and then validated the model using a smaller data set. The “predict tree” R algorithm learned from the initial classification, and then was used to predict an outcome with the validation data. The prediction was tested a second time. Figure 5 shows these results. With the initial prediction, there was an average error rate of 17.6%. When testing the prediction, there was an average error rate of 17.9%. When compared to the tree using the entire data set, this model had a slightly poorer predictive value. However, this tree has the advantage of being able to make a prediction of disease risk versus just classifying individuals. Error rates for individual nodes are noted in Table 3.

To test the Accuracy, Sensitivity and Specificity of the classification models, we followed Delen et al. [33] Eqs. (1, 2, 3).

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{2} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{3} \]

In these formulations FN, FP, TN, and TP denote false negative, false positive, true negative, and true positive, respectively. Equations 1, 2, and 3, respectively, were used to calculate accuracy, sensitivity, and specificity [33]. Results indicate the best overall performance was obtained when high blood cholesterol and body mass index were considered, with a predictive accuracy of 93.57%, a sensitivity of 92.47% and specificity of 87.02%. Exercise, stroke and depression formed the least predictive classification tree with only 88.67% accuracy, 66.75% sensitivity and 96.12% specificity. This tree was very accurate at predicting true
negatives with the high specificity level but poor at predicting true positives. Using high blood pressure and high cholesterol, the next tree was better at predicting true positives, with an accuracy rate of 89.47%, a sensitivity of 96.61%, and a specificity of 66.42%.

After training this model and analyzing the holdout data, there was a slight improvement, with an accuracy rate of 89.48%, a sensitivity of 96.60%, and a specificity of 66.48%.
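To make the three reported metrics concrete, the following minimal Python sketch shows how accuracy, sensitivity, and specificity follow from a 2×2 confusion matrix. The study itself used R, and this is not the authors' code. The counts are hypothetical, chosen only for easy arithmetic, and the class name `ConfusionMatrix` is illustrative. The sketch assumes "has diabetes" is the positive class; swapping the positive class swaps sensitivity and specificity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts obtained by comparing predicted labels with true labels."""
    tp: int  # true positives: predicted positive, actually positive
    fp: int  # false positives: predicted positive, actually negative
    tn: int  # true negatives: predicted negative, actually negative
    fn: int  # false negatives: predicted negative, actually positive

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.tn + self.fn

    def accuracy(self) -> float:
        # Share of all cases classified correctly.
        return (self.tp + self.tn) / self.total

    def sensitivity(self) -> float:
        # Share of actual positives that were predicted positive.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of actual negatives that were predicted negative.
        return self.tn / (self.tn + self.fp)


# Hypothetical counts for illustration only (NOT taken from the study).
cm = ConfusionMatrix(tp=90, fp=30, tn=70, fn=10)

print(f"accuracy    = {cm.accuracy():.2%}")
print(f"sensitivity = {cm.sensitivity():.2%}")
print(f"specificity = {cm.specificity():.2%}")
```

A model can score well on accuracy while one of the other two metrics stays low, which is why the text reports all three for each tree.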
[Figure residue: conditional inference tree with High Blood Pressure and High Cholesterol split nodes. Terminal nodes, as N / Err / error rate: 81282 / 17238.7 / 21.2%; 64031 / 9248.7 / 14.4%; 52299 / 5056.9 / 9.7%; 161045 / 6286.4 / 3.9%; 2022 / 114.6 / 5.7%.]
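The terminal-node statistics in the tree figure above (N, Err, and a percentage per leaf) can be summarized per tree. The Python sketch below is illustrative and is not the authors' R code. It assumes each percentage is Err divided by N, and it compares two possible summaries: the unweighted mean of the leaf error rates (the Average rows of Table 3 equal the unweighted mean of their node percentages) and a pooled rate, total Err over total N, which weights each leaf by its size. The names `Leaf`, `unweighted_mean_error`, and `pooled_error` are hypothetical.

```python
from typing import NamedTuple, Sequence


class Leaf(NamedTuple):
    n: int      # observations reaching the terminal node
    err: float  # error total reported for the node (fractional in the figure)

    @property
    def rate(self) -> float:
        # Assumption: the percentage shown in the figure is Err / N.
        return self.err / self.n


def unweighted_mean_error(leaves: Sequence[Leaf]) -> float:
    """Mean of the per-leaf error rates, every leaf counted equally."""
    return sum(leaf.rate for leaf in leaves) / len(leaves)


def pooled_error(leaves: Sequence[Leaf]) -> float:
    """Total errors over total observations, so large leaves count more."""
    return sum(leaf.err for leaf in leaves) / sum(leaf.n for leaf in leaves)


# Values transcribed from the high blood pressure / high cholesterol tree figure.
leaves = [
    Leaf(81282, 17238.7),
    Leaf(64031, 9248.7),
    Leaf(52299, 5056.9),
    Leaf(161045, 6286.4),
    Leaf(2022, 114.6),
]

for leaf in leaves:
    print(f"N={leaf.n:>7}  Err={leaf.err:>8.1f}  rate={leaf.rate:.1%}")

print(f"unweighted mean of leaf rates: {unweighted_mean_error(leaves):.2%}")
print(f"pooled error rate:             {pooled_error(leaves):.2%}")
```

Because the largest leaf here also has the lowest error rate, the pooled rate comes out lower than the unweighted mean for these leaves. Which summary is appropriate depends on whether each leaf or each individual should count equally.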
The results of this study effectively addressed the research questions that were posed. In response to RQ1, this study showed that diabetes risk factors can be identified from a large behavioral data set using binary logistic regression. In response to RQ2 and in support of evidence-based medicine, models using different risk factors can effectively be compared in order to evaluate which risk factors have the strongest influence on predictive capability. These prediction models may then be used to provide clinical decision support for clinicians and encourage the use of evidence-based medicine in new, more accurate, and more predictive ways. In response to RQ3, conditional inference decision trees provided simple and effective classification models for diabetes, supporting the use of machine learning to create disease prediction models for practitioners.

Table 3  Results for conditional inference tree prediction using validation data

Tree 4 High Blood Pressure and High Cholesterol, Prediction with Training Data
Node      N          Error      Pct error
1         56,468     24,814     43.94%
2         52,819     11,212     21.23%
3         46,627     5672       12.16%
4         1900       122        6.42%
5         154,492    6553       4.24%
                     Average    17.60%

Tree 4 High Blood Pressure and High Cholesterol, Prediction Tested with Validation Data
Node      N          Error      Pct error
1         13,911     6199       44.56%
2         13,176     2738       20.78%
3         11,833     1477       12.48%
4         454        33         7.27%
5         38,440     1698       4.42%
                     Average    17.90%

4 Discussion

Four decision tree models were created in this study. The first tree looked at the relationship between exercise, stroke, depression, and diabetes. This tree had an average error rate of 17.46%, meaning it was correct in classifying diabetes or no diabetes in individuals 82.54% of the time. This tree had the highest error rate, which could indicate that these particular risk factors are weaker in predicting diabetes.

The second tree looked at the relationship between high blood cholesterol, BMI, and diabetes. This tree had an average error rate of 10.64%, meaning it was correct in classifying diabetes or no diabetes in an individual 89.36% of the time. This tree had the best classification capability and the lowest error rate, which could indicate that these risk factors are more strongly associated with diabetes.

The third tree looked at the relationship between high blood cholesterol, high blood pressure, and diabetes. This tree had an average error rate of 11.01%, meaning it was correct in classifying diabetes or no diabetes in individuals 88.99% of the time, making it slightly poorer in classification
compared to the second tree. This could indicate that these risk factors are good classifiers of diabetes.

The fourth tree was created with the purpose of predicting which individuals would have diabetes using machine learning. The tree looked at the relationship between high blood cholesterol, high blood pressure, and diabetes, similar to the third tree. This tree had an average error rate of 17.6% when evaluated for classification strength. When the prediction was tested using the validation data set, there was an error rate of 17.9%, meaning it was correct at predicting whether an individual had or did not have diabetes 82.1% of the time. Overall, this tree performed just slightly poorer compared to the third tree when it came to predictive capability.

Healthcare professionals should consider using decision trees and machine learning for classification and predictive analysis to assist in reducing costs and improving clinical outcomes. Different combinations of risk factors could affect prediction and classification results. Risks are correlated and dependent on each other, and so predictive models need to address simultaneous conditions [10].

Healthcare providers need resources and available time to utilize predictive analytics. However, many report that incomplete data and insufficient technology are the biggest obstacles in implementing predictive analytics [53]. Hospitals are more likely to lack the technology required to take advantage of predictive analytics. Medical groups and clinics are twice as likely to lack employees who are skilled in predictive analytics [53].

The healthcare industry has historically made decisions differently than other business sectors [53]. The clinical issue, rather than habit or protocol, should determine medical intervention. Medical authorities will have to adopt a new way of thinking about research, including switching from the primary use of deductive reasoning to inductive reasoning and pattern recognition [18]. The clinician must not be merely bound by rules and guidelines but be taught to apply those rules in the context of each patient. A recent campaign in the United Kingdom, “Too Much Medicine,” led by academics, clinicians, and patients, is hoping to reduce overscreening, overdiagnosis, and overtreatment [54].

4.1 Limitations

There were some limitations regarding the nature of self-reported survey results. The first is the clumping of data around whole numbers. For example, when asked their weight, respondents would be more likely to report 150 than 151 lbs. Data smoothing to account for this can be done but was out of scope for this paper. Other limitations of self-reported data included people underestimating their tobacco/alcohol usage, or overestimating frequency of seeking healthcare and following medical advice. There can also be non-response bias: participants, such as those who are smokers, may be more likely to not answer the question about smoking. There may also be sampling and measurement errors, such as the wording on the questionnaire.

There was also a limitation in the structure of the BRFSS question concerning diabetes type. The question did not differentiate among the various types of diabetes. We assume a small number of respondents had type I or gestational diabetes, and this could have affected the results. Another limitation concerned disease risk factors. The patients in this survey had already been diagnosed with diabetes and may have altered their lifestyle due to the disease. The final limitation was that not all disease risk factors that were identified in literature had corresponding BRFSS variables, so these could not be included in the analysis.

4.2 Future research

Looking forward, a next step is standardization of disease prediction models for clinical use. The Society of Actuaries conducted a survey analyzing the state of predictive analytics in healthcare. The survey identified that within the US healthcare industry, fewer than half (43%) of healthcare organizations are currently using predictive analytics [53]. While most payers in healthcare are using predictive analytics (80%), only 39% of medical groups/clinics and only 36% of hospitals are using these tools. For those who are using predictive analytics, the most common use is predicting hospital readmissions and costs. Future research on this topic might be implementing the predictive and classification models created in this study into the clinical setting. Finally, the models must be evaluated for their usefulness in medical practice.

4.3 Conclusion

Although disease prevention awareness campaigns have become more prevalent, the US continues to be ravaged by chronic disease, in both mortality and cost. In their most recent report, the Partnership to Fight Chronic Disease [55] estimates the projected total cost of chronic disease in America to reach $42 trillion between 2016 and 2030. The number of people with three or more chronic diseases in the US is expected to reach 83.4 million by 2030, compared to 30.8 million in 2015. With behavioral changes, new interventions, and treatment advances, 16 million lives could be saved in the next 15 years [55].

Although many advanced diagnostic tools exist, the healthcare field is still lacking accessible predictive models to plan interventions. Healthcare providers currently say that clinical outcomes and costs are the most valuable data to predict [53]. By focusing on prediction of disease risk we can improve clinical outcomes and reduce costs.
Professionals involved with analyzing healthcare data should assist clinicians and healthcare organizations in utilizing predictive analytics, maintaining clean data, and bridging the gap between clinicians and data scientists.

Authors’ contributions Elena G Toth: Conceptualization, Methodology, Software, Writing - Original Draft, Visualization, Formal Analysis. Alexander McLeod: Conceptualization, Methodology, Software, Visualization, Formal Analysis, Writing - Review & Editing, Supervision, Resources. David Gibbs: Writing - Review & Editing, Resources. Jacqueline Moczygemba: Writing - Review & Editing, Resources.

Declarations

Data availability BRFSS Data Set is available at https://www.cdc.gov/brfss/annual_data/annual_2017.html. Processed data used in the research paper is available upon request.

Availability of code Code used in R software is available upon request.

Informed consent This study did not have human participants and therefore did not require informed consent.

Conflicts of interest To the best of our knowledge, the named authors have no conflict of interest, financial or otherwise.

References

 1. Barrett MA, Humblet O, Hiatt RA, et al. Big data and disease prevention: from quantified self to quantified communities. Big Data. 2013;1(3):168–75. https://doi.org/10.1089/big.2013.0027.
 2. National Center for Health Statistics. Health, United States, 2011: With Special Feature on Socioeconomic Status and Health. Hyattsville, MD; 2012.
 3. Lin Y-K, Chen H, Brown RA, et al. Healthcare predictive analytics for risk profiling in chronic care: a Bayesian multitask learning approach. MIS Quarterly. 2017;41(2):473–95. https://doi.org/10.25300/MISQ/2017/41.2.07.
 4. Fox B. Using big data for big impact: Leveraging data and analytics provides the foundation for rethinking how to impact patient behavior. Health Manag Technol. 2011;32(11):16–16. PubMed PMID: 22141243.
 5. Keehan SP, Stone DA, Cuckler GA, et al. National health expenditure projections, 2016–25: Price increases, aging push sector to 20 percent of economy. Health Aff. 2017;36(3):553–63. https://doi.org/10.1377/hlthaff.2016.1627 PubMed PMID: edselc.2-52.0-85014636612.
 6. Lash TA, Escobedo MR. Introduction. Clin Geriatr Med. 2018;34(3):XVII–XIX. https://doi.org/10.1016/j.cger.2018.06.002.
 7. Jalal H, Pechlivanoglou P, Krijkamp E, et al. An overview of R in health decision sciences. Med Decis Making. 2017;37(7):735–46. https://doi.org/10.1177/0272989X16686559 PubMed PMID: edselc.2-52.0-85027071420.
 8. Moskowitz A, McSparron J, Stone DJ, et al. Preparing a new generation of clinicians for the era of big data. Harv Med Stud Rev. 2015;2(1):24–7.
 9. Centers for Disease Control and Prevention. Behavioral Risk Factor Surveillance System Survey (BRFSS). In: National Center for Chronic Disease Prevention and Health Promotion: Division of Population Health, (ed.). Atlanta, GA; 2016.
10. Belle A, Thiagarajan R, Soroushmehr S, et al. Big data analytics in healthcare. BioMed Res Int. 2015. https://doi.org/10.1155/2015/370194.
11. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2(1):3. https://doi.org/10.1186/2047-2501-2-3.
12. Monica K. Why are so few healthcare providers using EHR data analytics? EHR Intelligence xtelligent Healthcare Media, 2017.
13. Palaniappan S and Awang R. Intelligent heart disease prediction system using data mining techniques. In: IEEE/ACS International Conference on Computer Systems and Applications Doha, Qatar; 2008, pp.108–115. IEEE.
14. Viceconti M, Hunter PJ, Hose RD. Big data, big knowledge: big data for personalized healthcare. IEEE J Biomed Health Inform. 2015;19(4):1209–15. https://doi.org/10.1109/JBHI.2015.2406883.
15. Kaisler S, Armour F, Espinosa JA, et al. Big data: Issues and challenges moving forward. In: Proceedings of the 46th Hawaii International Conference on System Sciences (HICSS) Maui, HI; 2013, pp.995–1004. IEEE.
16. Kent J. Big data to see explosive growth, challenging healthcare organizations. Health IT Analytics. 2018.
17. Razavian N, Blecker S, Schmidt AM, et al. Population-level prediction of type 2 diabetes from claims data and analysis of risk factors. Big Data. 2015;3(4):277–87. https://doi.org/10.1089/big.2015.0020.
18. Krumholz HM. Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Aff. 2014;33(7):1163–70. https://doi.org/10.1377/hlthaff.2014.0053.
19. Wang L, Porter B, Maynard C, et al. Predicting risk of hospitalization or death among patients receiving primary care in the Veterans Health Administration. Med Care. 2013;51(4):368–73. https://doi.org/10.1097/MLR.0b013e31827da95a PubMed PMID: edsjsr.23434292.
20. Neuvirth H, Ozery-Flato M, Hu J, et al. Toward personalized care management of patients at risk: The diabetes case study. Knowledge discovery and data mining. 2011:395–403. https://doi.org/10.1145/2020408.2020472.
21. Lloyd-Jones DM, Leip EP, Larson MG, et al. Prediction of lifetime risk for cardiovascular disease by risk factor burden at 50 years of age. Circulation. 2006;113(6):791–8. https://doi.org/10.1161/CIRCULATIONAHA.105.548206.
22. Maguire J, Dhar V. Comparative effectiveness for oral anti-diabetic treatments among newly diagnosed type 2 diabetics: data-driven predictive analytics in healthcare. Health Syst. 2013;2(2):73–92. https://doi.org/10.1057/hs.2012.20 PubMed PMID: edselc.2-52.0-84888421992.
23. Adler P, Rajesh R, Jamie SH, et al. Risk prediction for chronic kidney disease progression using heterogeneous electronic health record data and time series analysis. J Am Med Inform Assoc. 2015;22(4):872–80. https://doi.org/10.1093/jamia/ocv024 PubMed PMID: edsovi.00042637.201507000.00016.
24. Letham B, Rudin C, McCormick TH, et al. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann Appl Stat. 2015;9(3):1350–70. https://doi.org/10.1214/15-AOAS848 PubMed PMID: edsjsr.43826424.
25. Sun J, Hu J, Luo D, et al. Combining knowledge and data driven insights for identifying risk factors using electronic health records. AMIA Annu Symp Proc. 2012;2012:901–10.
26. Henry KE, Saria S, Hager DN, et al. A targeted real-time early warning score (TREWScore) for septic shock. Sci Transl Med. 2015;7(299):1–9. https://doi.org/10.1126/scitranslmed.aab3719 PubMed PMID: edselc.2-52.0-84938704873.
27. Bates DW, Saria S, Ohno-Machado L, et al. Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Aff. 2014;33(7):1123–31. https://doi.org/10.1377/hlthaff.2014.0041.
28. Bardhan I, Oh J-h, Zheng Z, et al. Predictive analytics for readmission of patients with congestive heart failure. Inf Syst Res. 2015;26(1):19–39.
29. Xie Z, Li D, Nikolayeva O, et al. Building risk prediction models for type 2 diabetes using machine learning techniques. Prev Chronic Dis. 2019;16(E130):1–9. https://doi.org/10.5888/pcd16.190109 PubMed PMID: edselc.2-52.0-85072402053.
30. Piri S, Delen D, Liu T, et al. A data analytics approach to building a clinical decision support system for diabetic retinopathy: Developing and deploying a model ensemble. Decis Support Syst. 2017;101:12–27. https://doi.org/10.1016/j.dss.2017.05.012.
31. Wang T, Qiu RG, Yu M, et al. Directed disease networks to facilitate multiple-disease risk assessment modeling. Decis Support Syst. 2020;129:113171. https://doi.org/10.1016/j.dss.2019.113171.
32. Steinberg GB, Church BW, McCall CJ, et al. Novel predictive models for metabolic syndrome risk: a "big data" analytic approach. Am J Manag Care. 2014;20(6):e221–8.
33. Delen D, Oztekin A, Tomak L. An analytic approach to better understanding and management of coronary surgeries. Decis Support Syst. 2012;52(3):698–705. https://doi.org/10.1016/j.dss.2011.11.004.
34. Dag A, Oztekin A, Yucel A, et al. Predicting heart transplantation outcomes through data analytics. Decis Support Syst. 2017;94:42–52.
35. Zhu Y, Fang J. Logistic regression-based trichotomous classification tree and its application in medical diagnosis. Med Decis Making. 2016;36(8):973–89. https://doi.org/10.1177/0272989X15618658 PubMed PMID: edselc.2-52.0-84989285855.
36. Turnea M, Ilea M. Predictive simulation for type II diabetes using data mining strategies applied to big data. In: The International Scientific Conference eLearning and Software for Education Bucharest, Romania, 2018, pp.481–486. Carol I National Defence University.
37. Neff G. Why big data won’t cure us. Big Data. 2013;1(3):117–23. https://doi.org/10.1089/big.2013.0029.
38. Talboy AN, Schneider SL. Improving accuracy on Bayesian inference problems using a brief tutorial. J Behav Decis Mak. 2017;30(2):373–88. https://doi.org/10.1002/bdm.1949 PubMed PMID: edselc.2-52.0-84961282329.
39. Eastwood J, Snook B, Luther K. What people want from their professionals: Attitudes toward decision-making strategies. J Behav Decis Mak. 2012;25(5):458–68. https://doi.org/10.1002/bdm.741 PubMed PMID: 82468884.
40. Oztekin A, Kong ZJ, Delen D. Development of a structural equation modeling-based decision tree methodology for the analysis of lung transplantations. Decis Support Syst. 2011;51(1):155–66. https://doi.org/10.1016/j.dss.2010.12.004.
41. Bertsimas D, O’Hair A, Relyea S, et al. An analytics approach to designing combination chemotherapy regimens for cancer. Manage Sci. 2016;62(5):1511–31.
42. Nichols H. The top 10 leading causes of death in the United States, https://www.medicalnewstoday.com/articles/282929.php (2018).
43. National Institute of Diabetes and Digestive and Kidney Diseases. Risk factors for type 2 diabetes, https://www.niddk.nih.gov/health-information/diabetes/overview/risk-factors-type-2-diabetes (2016, 2018).
44. Sperandei S. Understanding logistic regression analysis. Biochemia Medica. 2014;24(1):12–8. https://doi.org/10.11613/BM.2014.003 PubMed PMID: PMC3936971.
45. Centers for Disease Control and Prevention. Overview: BRFSS 2017. 2018.
46. Pierannunzi C, Hu SS, Balluz L. A systematic review of publications assessing reliability and validity of the Behavioral Risk Factor Surveillance System (BRFSS), 2004–2011. BMC Med Res Methodol. 2013;13(1):49. https://doi.org/10.1186/1471-2288-13-49.
47. Li C, Balluz LS, Ford ES, et al. A comparison of prevalence estimates for selected health indicators and chronic diseases or conditions from the Behavioral Risk Factor Surveillance System, the National Health Interview Survey, and the National Health and Nutrition Examination Survey, 2007–2008. Prev Med. 2012;54(6):381–7. https://doi.org/10.1016/j.ypmed.2012.04.003.
48. Centers for Disease Control and Prevention. Calculated variables in the 2017 Behavioral Risk Factor Surveillance System data file. 2018.
49. Garson GD. Missing values analysis and data imputation. Asheboro, NC: Statistical Associates Publishing; 2015.
50. Li X and Liu B. Rule-based classification. In: Aggarwal CC (ed) Data Classification: Algorithms and Applications. Chapman & Hall/CRC; 2014, pp.121–156.
51. Singh NK. Prediction of breast cancer using rule based classification. Appl Med Inf. 2015;37(4):11–22.
52. Bae J-M. The clinical decision analysis using decision tree. Epidemiol Health. 2014;36:e2014025. https://doi.org/10.4178/epih/e2014025 PubMed PMID: PMC4251295.
53. Society of Actuaries. The state of predictive analytics in US healthcare. Modern Healthcare. 2016.
54. Greenhalgh T, Howick J, Maskrey N. Evidence based medicine: a movement in crisis? BMJ. 2014;348:g3725. https://doi.org/10.1136/bmj.g3725.
55. Partnership to Fight Chronic Disease. What is the impact of chronic disease in America? 2016. FightChronicDisease.org.