A Project Report on
Predicting Heart Disease Events based on 11 Clinical Features using
Logistic Regression
Submitted in partial fulfilment of the course
Quantitative Techniques - II
Submitted to
Prof. Pritha Guha
Submitted by
Group 4, Section C
Aditi Goyal (BJ21125), Astha Alankrita (BJ21134),
Ishita Thakur (BJ21143), Nihar Raichada (BJ21152),
Rahul Manna (BJ21161), Shreyas Jayasankar (BJ21170) &
Vishal Singh (BJ21179)
On
January 16, 2022
1. Introduction
According to the World Health Organization [14], cardiovascular diseases (CVDs) are the leading cause of death worldwide. They are responsible for approximately 17.9 million deaths each year, about 31% of all global deaths. Among CVDs, heart attacks and strokes account for 4 out of 5 deaths, and a third of these deaths occur prematurely in people under the age of 70.
The dataset used in this project brings together 11 significant factors that can prove highly useful in predicting heart disease. A model that can accurately predict the presence of heart disease from such bodily factors can be of great value to patients and medical practitioners. It can also help people with related ailments such as diabetes and hypertension, because it offers them early detection and management opportunities.
2. Dataset Description
The data has been sourced from Kaggle [13] and represents different factors that can predict heart disease for a person. The dataset contains 918 observations and 12 columns. A brief description of each column is given below:
S. No.  Column Name     Type  Description
1.      Age             Int   Age of the individual in years.
2.      Sex             Chr   Sex of the individual.
3.      ChestPainType   Chr   Type of chest pain reported (TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic).
4.      RestingBP       Int   Resting blood pressure in mm Hg.
5.      Cholesterol     Int   Serum cholesterol level in mg/dl.
6.      FastingBS       Int   Fasting blood sugar (1 if FastingBS > 120 mg/dl, 0 otherwise).
7.      RestingECG      Chr   Resting electrocardiogram (ECG) result.
8.      MaxHR           Int   Maximum heart rate achieved (value between 60 and 202).
9.      ExerciseAngina  Chr   Exercise-induced angina (Y if chest pain is felt during exercise, N otherwise).
10.     Oldpeak         Num   ST depression induced by exercise.
11.     ST_Slope        Chr   Slope of the exercise ST segment (Up: upsloping, Flat: flat, Down: downsloping).
12.     HeartDisease    Int   Output column indicating whether the patient has heart disease (1 for heart disease, 0 for normal).
Table (2.1)
We first perform exploratory analysis on the above-mentioned dataset and then run regression analysis to understand which factors contribute the most towards heart disease. We have assumed a level of significance (α) of 0.05 in all our tests.
In the dataset we observed that 1 record had a zero (physiologically impossible) value in the RestingBP column, and 172 records had zero values for Cholesterol. As part of the data filtering process, we excluded those records from the exploratory and regression analyses.
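A minimal sketch of this filtering step in R, assuming the Kaggle file has been saved as heart.csv (the file name and the exact code are our illustration, not the original script):

# Load the data; the Chr columns become factors for later modelling.
heart <- read.csv("heart.csv", stringsAsFactors = TRUE)

# Zero values of RestingBP and Cholesterol are physiologically
# impossible, so we treat them as missing and drop those records.
heart <- subset(heart, RestingBP > 0 & Cholesterol > 0)
nrow(heart)  # 746 of the original 918 records remain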
3. Exploratory Analysis:
We can divide the dataset features into two groups:
Categorical Variables: ChestPainType, Sex, RestingECG, ST_Slope, FastingBS,
ExerciseAngina
Quantitative Variables: Age, RestingBP, Cholesterol, MaxHR, OldPeak
Statistical Summary for Quantitative Variables:
Variable Range Mean Median
Age [28, 77] 52.88 54
RestingBP [92, 200] 133 130
Cholesterol [85, 603] 244.6 237
MaxHR [69, 202] 140.2 140
OldPeak [-0.1, 6.2] 0.9016 0.5
Table (3.1)
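Table (3.1) can be reproduced with base R; a sketch, assuming the filtered data frame heart from above and the Kaggle column name Oldpeak:

quant_vars <- c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")
summary(heart[, quant_vars])  # reports min, max, mean and median per variable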
3.1 Q-Q Plots:
Fig (3.1) Fig (3.2)
Fig (3.3) Fig (3.4)
Fig (3.5)
The Q-Q plots show that RestingBP and OldPeak do not follow a normal distribution, as most of their data points lie far from the reference line. For Age, Cholesterol and MaxHR, the points lie closer to the line, but more analysis is required before asserting normality. Density curves and the Shapiro-Wilk test can help us better understand these variables.
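A sketch of how such Q-Q plots can be drawn in base R (one panel per quantitative variable):

par(mfrow = c(2, 3))  # arrange the five plots in a grid
for (v in c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")) {
  qqnorm(heart[[v]], main = v)     # sample vs. theoretical normal quantiles
  qqline(heart[[v]], col = "red")  # reference line through the quartiles
}
par(mfrow = c(1, 1))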
3.2 Density Curves:
Fig (3.6) Fig (3.7)
Fig (3.8) Fig (3.9)
Fig (3.10)
The density curves for all the continuous variables are shown above. From these graphs we observe that only Age and MaxHR appear to follow a normal distribution. RestingBP has a multimodal density, whereas OldPeak and Cholesterol show skewed curves.
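A sketch of the density curves and of the Shapiro-Wilk test mentioned above (H0: the variable is normally distributed; a p-value below 0.05 rejects normality):

plot(density(heart$Age), main = "Density of Age")
plot(density(heart$RestingBP), main = "Density of RestingBP")
shapiro.test(heart$Age)    # formal normality check for Age
shapiro.test(heart$MaxHR)  # and for MaxHR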
3.3 Scatter Plots:
Fig (3.11) Fig (3.12)
Fig (3.13) Fig (3.14)
In Fig (3.11), the Cholesterol vs. Age plot shows widely scattered points with a weak positive correlation.
The Cholesterol vs. OldPeak plot in Fig (3.12) shows a very weak positive correlation between the two variables.
The MaxHR vs. OldPeak plot in Fig (3.13) shows a weak negative correlation.
The data points in the MaxHR vs. Cholesterol plot in Fig (3.14) are widely scattered and show a weak negative correlation.
3.4 Correlation Matrix:
Age RestingBP Cholesterol MaxHR Oldpeak
Age 1 0.25986472 0.05875824 -0.38211212 0.28600628
RestingBP 0.25986472 1 0.09593929 -0.12577393 0.19857506
Cholesterol 0.05875824 0.09593929 1 -0.01985579 0.05848813
MaxHR -0.38211212 -0.12577393 -0.01985579 1 -0.25953263
Oldpeak 0.28600628 0.19857506 0.05848813 -0.25953263 1
Table (3.2)
The correlation table shows the pairwise relationships between the continuous variables in the dataset and helps us identify correlation and independence between different pairs of variables. From the table above we observe that no two quantitative variables are highly correlated.
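Table (3.2) is the pairwise Pearson correlation matrix of the five quantitative variables; a sketch of how it can be computed:

round(cor(heart[, c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")]), 4)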
3.5 Bar Plots:
To understand the Categorical Variables, we have derived the following bar plots:
Fig (3.15) Fig (3.16)
Fig (3.17) Fig (3.18)
Fig (3.19) Fig (3.20)
Inferences based on the bar plots:
- In the Chest Pain Type vs. Heart Disease plot, Fig (3.16), approximately 300 of the 400 patients reporting the "ASY" chest pain type were diagnosed with heart disease, suggesting that this pain type is strongly associated with heart disease.
- According to Fig (3.15), most people report a normal resting ECG.
- Fig (3.17) shows the Sex vs. Heart Disease distribution; in the given dataset, the percentage of men diagnosed with heart disease is much higher than that of women.
- The ST_Slope vs. Heart Disease graph in Fig (3.18) shows that most people with a flat ST slope are diagnosed with heart disease.
- Fig (3.19) shows that heart disease occurs in patients with fasting blood sugar both above and below 120 mg/dl, but the percentage with heart disease is higher when fasting blood sugar exceeds 120 mg/dl.
- Fig (3.20) shows that people who report exercise-induced angina have a higher chance of being diagnosed with heart disease.
3.6 Additional Plots:
HeartDisease is stored as an integer, so we first convert it to a factor for classification purposes and then use the ggplot function to visualise the data, as sketched below. All plots are split by the target variable, HeartDisease:
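A sketch of the factor conversion and of one such chart with ggplot2 (the binwidth and aesthetics are illustrative choices, not the original settings):

library(ggplot2)
heart$HeartDisease <- as.factor(heart$HeartDisease)  # 0/1 integer -> factor

# Age vs. count, split by the target variable, as in Fig (3.21)
ggplot(heart, aes(x = Age, fill = HeartDisease)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  labs(y = "Count")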
1) The first plot, Age vs. Count, shows the frequency of heart disease with increasing age. As visible from the plot below, the onset of heart disease is seen after the age of 40 for a significant number of people.
Fig (3.21)
2) The second plot, Sex vs. Count, shows the frequency of heart disease for the two sexes. As visible from the plot below, heart disease is more frequent in males.
Fig (3.22)
3) The third plot, Chest Pain Type vs. Count, shows the frequency of heart disease across chest pain types. As visible from the plot below, heart disease is more frequent in people experiencing the ASY or NAP types of chest pain, while the anginal types (TA and ATA) are usually not associated with heart disease.
Fig (3.23)
4) The fourth plot, RestingBP vs. Count, shows the frequency of heart disease across different RestingBP values. As we can see, heart disease cases in the dataset are concentrated at RestingBP values above 100 mm Hg.
Fig (3.24)
5) The fifth plot, Cholesterol vs. Count, shows the frequency of heart disease across cholesterol values. As visible from the plot below, the relation between cholesterol and heart disease is inconclusive: there is no particular cholesterol threshold that signals the presence of heart disease, although most values lie between 200 and 600 mg/dl.
Fig (3.25)
4. Building the Logistic Regression Model
Logistic regression is a statistical model used to model a binary dependent variable by estimating the parameters of a logistic model. For the outcome labeled '1', the log of the odds is a linear combination of the independent variables, also known as "predictors". The model is designed for binary (dichotomous) outcomes.
The logit function is logit(p) = log(p / (1 - p)); probabilities transformed this way are known as logit values.
To perform the regression analysis, we model πi = logit(pi) = β0 + β1 * xi.
Once the coefficients are determined from the model, the probability of the event happening is given by pi = e^πi / (1 + e^πi).
In the case of our heart disease data, there are 11 predictor variables, three of which are categorical with more than two levels and therefore expand into several dummy variables, giving 15 terms in total. The logit equation (fitted model) is:
πi = logit(pi) = β0 + β1*a + β2*b + β3*c + β4*d + β5*e + β6*f + β7*g + β8*h + β9*i + β10*j + β11*k + β12*l + β13*m + β14*n + β15*o
The values of β represent the coefficients attached to each term. The list of independent variables is:
Table (4.1)
The categorical variables ChestPainType, Sex, RestingECG, ST_Slope, FastingBS and ExerciseAngina were encoded numerically (as dummy variables) so that they could be used in the regression model. The coefficients were estimated with a Generalized Linear Model (GLM), keeping the significance level α at 5%. The output of the GLM is as follows:
Fig (4.1)
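A minimal sketch of the model fit behind Fig (4.1); glm() with family = binomial fits the logistic regression and expands each Chr column into dummy variables automatically:

model_full <- glm(HeartDisease ~ Age + Sex + ChestPainType + RestingBP +
                    Cholesterol + FastingBS + RestingECG + MaxHR +
                    ExerciseAngina + Oldpeak + ST_Slope,
                  data = heart, family = binomial)
summary(model_full)  # coefficients, z-values, p-values and the AIC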
Thus, the logit equation for the original model was calculated to be:
πi = logit (pi) = -7.3402738 + (0.0313784) * a + (1.8655490) * b + (1.6731804) * c +
(0.01001683) * d + (0.0399275) * e + (0.0117792) * f + (0.0024955) * g + (0.2923999) * h
+ (0.2297888) * i + (0.0551871) * j + (0.0005807) * k + (0.9073515) * l + (0.4108355) * m
+ (1.3038217) * n + (-1.2100372) * o
The probability of having heart disease for a given person can be ascertained by substituting the values of the independent variables and calculating the logit score πi. From the logit score, the probability is computed as pi = e^πi / (1 + e^πi).
It can also be inferred that a 1-unit rise in an independent variable increases the logit score by that variable's coefficient and multiplies the odds by e^β. For example, if the fasting blood sugar indicator increases by 1 unit, the logit score increases by 0.2923999, and the odds of the person having heart disease are multiplied by e^0.2923999 ≈ 1.34.
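A sketch of these computations in R; plogis() is base R's logistic function, p = e^x / (1 + e^x):

logit_score <- predict(model_full, newdata = heart[1, ])  # linear predictor (logit)
plogis(logit_score)   # predicted probability of heart disease
exp(0.2923999)        # ~1.34, the odds multiplier for a 1-unit rise in FastingBS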
4.1 Outlier Detection
Boxplots have been used to detect the presence of outliers.
Fig (4.2) Fig (4.3)
Fig (4.4) Fig (4.5)
All data points with a Cholesterol value of zero were removed, as a cholesterol value of zero is not medically possible and such data points were not meaningful for our analysis. There were 172 such data points.
MaxHR (maximum heart rate) can range from 60 to 200 bpm according to medical science [1]. RestingBP (blood pressure) can also have values up to 200 [3]. OldPeak is typically less than 2 and can reach a maximum of about 6.2 [2]. For Cholesterol, the zero values have already been removed; the remaining high values are not treated as outliers because cholesterol can range up to 600 mg/dL in some patients [4]. If we omitted these high cholesterol values or replaced them with the mean or median, the model would become less effective, assigning a low probability of heart disease even when the cholesterol level is extremely high. Since our dataset is health data, we have relied on these medical justifications and have not treated the outliers purely mathematically, so that the model remains valid for all medically possible values of the predictor variables.
4.2 Multicollinearity Diagnostics
Multicollinearity is a situation in which one or more independent variables are highly linearly
related. This can affect the results of your data. To check multicollinearity, we had to run a
Variance Expansion Factor (VIF) test on an existing model. It measures how much the variance
of the regression coefficients expands due to the multicollinearity of the model. The minimum
possible value for VIF is 1, indicating that there is no multicollinearity. A VIF score of less
than 5 usually indicates that the significance of multicollinearity is low.
The outcome after running the VIF test was:
Table (4.2)
The VIF returned values below 2 in all cases, indicating an absence of multicollinearity in the model.
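A sketch of this diagnostic, assuming the car package (its vif() function reports a generalized VIF when the model contains factor predictors):

library(car)
vif(model_full)  # values below 5 suggest multicollinearity is not a concern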
4.3 Finding the Best Fit Model
To determine which of the models is better, we run the step function, which compares models by their AIC values. The Akaike Information Criterion (AIC) is a score used to compare candidate models on a given dataset to arrive at the best model for that data. The AIC is based on a model's maximum likelihood as a measure of fit, penalized by the number of parameters. The lower the AIC score, the better.
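A sketch of the stepwise search described below; step() repeatedly adds or drops terms to minimize the AIC:

model_new <- step(model_full, direction = "both")  # "forward" and "backward" were also run
model_new$aic  # 509.12 for the 7-predictor model reported below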
We first used the summary command on the full model consisting of all 11 predictor variables. The AIC score of this model came out to be 515.58.
We then carried out stepwise regression using the step function in the forward, backward and both directions to arrive at the best fit model. All three gave the same result: the best model, with an AIC score of 509.12, is based on 7 predictor variables (the variables excluded are FastingBS, Cholesterol, RestingECG and MaxHR). In other words, in a ranking of the candidate models by information loss, the best fit is Model_new, the model consisting of all predictor variables except FastingBS, Cholesterol, RestingECG and MaxHR. Therefore, the GLM was run again excluding these four variables.
The output for that is as follows:
Fig (4.6)
For the above updated model, the logit function changed to:
πi = logit(pi) = -6.821891 + 0.033988*a + 1.835024*b + 1.707323*c + 0.123502*d + 0.098073*e + 0.013504*f + 0.888317*l + 0.403*m + 1.255385*n - 1.292281*o
5. Tests for Regression Coefficients
The observed z-values and the corresponding p-values of the z-tests for the individual predictor variables help us determine the significance of each variable in the model while controlling for the rest of the variables. The z-test indicates whether the coefficient of the variable in question is 0 or not. The null and alternate hypotheses for the z-test are:
Null hypothesis H0 – The variable is insignificant, i.e. βi = 0
Alternate hypothesis H1 – The variable is significant, i.e. βi ≠ 0
Results are summarized in the table below:
Table (5.1)
6. Model Performance / Validation
6.1 Test for Goodness of Fit
To check if the model is effective or not, the G-Test was performed. The Null and Alternate
hypotheses for the same are:
Null Hypothesis H0 – Null model fits the data at least as well as our model, or
βi = 0
Alternate Hypothesis H1 – Our model fits the data better than the null model, or
at least one of βi ≠ 0
PCHISQ function: pchisq(1032.63 - 487.12, df = 745 - 734, lower.tail = F), where 1032.63 and 487.12 are the null and residual deviances of the model and 745 and 734 the corresponding degrees of freedom.
The output was 6.213367e-110, far below the alpha value. Hence we rejected the null hypothesis and concluded that not all βi are 0, i.e. the model is significant.
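The same G-test can be computed directly from the fitted model object; a sketch:

# The difference in deviances is chi-square distributed under H0.
pchisq(model_new$null.deviance - model_new$deviance,
       df = model_new$df.null - model_new$df.residual,
       lower.tail = FALSE)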
ANOVA function: anova(Model_new, test = "Chisq")
This test adds the terms to the model one by one and tests the significance of each addition using a chi-square test. The outcome was:
Fig (6.1)
Since each of the terms gives a p-value < alpha, we rejected the null hypothesis and concluded that the βi values are not all 0. Thus, our model fits the data better than the null model. Both of the above tests suggest that the model is effective.
6.2 Error Analysis: Confusion Matrix
The confusion matrix is frequently used in statistical classification to examine the performance
of a statistical model. The measures obtained from Confusion Matrix are:
- Accuracy: Indicates how correctly the model can classify heart disease
- Sensitivity: Indicates model’s ability to designate a person with disease as positive
- Specificity: Indicates model’s ability to designate a person without disease as negative
- Precision: Indicates model’s ability to correctly classify presence of disease
- F-Measure: A measure created considering both precision and recall [11]
The higher the accuracy, sensitivity and specificity, the better. However, raising the cutoff improves specificity at the expense of sensitivity; as a result, an optimal cutoff probability must be determined to get the best out of the model.
6.3 Analysis:
The code "pred.1[pred.prob.10.5] = 0" assumed a probability cutoff of 0.5, which meant that
only outcomes with a probability greater than 0.5 were examined. However, to find the
optimum model, we constructed confusion matrices with probability values ranging from 0.1
to 0.9 at multiples of 0.1. These three measurements were then plotted against the probability
values, with the probability value at the intersection point serving as our model's cutoff.
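A sketch of that sweep (the variable names are illustrative; HeartDisease is assumed to be coded 0/1):

pred_prob <- predict(model_new, type = "response")  # fitted probabilities
actual <- heart$HeartDisease
for (cutoff in seq(0.1, 0.9, by = 0.1)) {
  pred <- factor(ifelse(pred_prob > cutoff, 1, 0), levels = c(0, 1))
  tab <- table(pred, factor(actual, levels = c(0, 1)))  # 2x2 confusion matrix
  accuracy    <- sum(diag(tab)) / sum(tab)
  sensitivity <- tab["1", "1"] / sum(tab[, "1"])  # positives correctly flagged
  specificity <- tab["0", "0"] / sum(tab[, "0"])  # negatives correctly flagged
  cat(cutoff, accuracy, sensitivity, specificity, "\n")
}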
Table (6.2)
Fig (6.2): Sensitivity and specificity plotted against the cutoff probability; the two curves intersect near a cutoff of 0.53.
Accuracy at each cutoff probability:
Cutoff      Accuracy
0.1         78.28%
0.2         82.71%
0.3         85.25%
0.4         86.60%
0.5         86.86%
0.5286721   87.67%
0.6         86.33%
0.7         84.85%
0.8         81.37%
0.9         74.26%
Fig (6.3)
Table (6.3)
The cutoff probability at the intersection point is 0.528672. At this cutoff the accuracy is at its maximum, 87.67%. Hence, for our model, we use a probability cutoff of 0.528672.
6.4 Scatterplot of residuals against the predicted values:
Fig (6.4)
Through this model we predict the probability of having heart disease: the observed outcome is either 1 (disease) or 0 (no disease). When the observed outcome is 0, the fitted probability is always greater than 0, so the residuals are negative (indicated by the blue curve). Similarly, when the observed outcome is 1, the fitted probability is always below 1, so the residuals are positive (indicated by the red curve).
6.5 Model Validation using Cross Validation
K-fold cross-validation has been used to validate our model. In this technique the data is divided into K folds (parts); K - 1 folds are used to train the model and the remaining fold is used for testing. This is repeated K times and the average accuracy is computed. With K = 10 folds it gave an accuracy of 89.6%. The cv.glm() function from the boot library was used [8]; it reports the misclassification rate and an adjusted misclassification rate as its delta output.
Fig (6.5)
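A sketch of the cross-validation call, assuming the boot package and a numeric 0/1 HeartDisease column (cv.glm's cost function receives the observed responses and the fitted probabilities):

library(boot)
# Misclassification rate at the 0.5 default cutoff.
cost <- function(y, prob) mean(abs(y - prob) > 0.5)
set.seed(42)  # the seed is an illustrative choice
cv_out <- cv.glm(heart, model_new, cost = cost, K = 10)
cv_out$delta  # raw and adjusted misclassification rates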
7. Results & Interpretation
Table (7.1)
The aim of this project is to analyse eleven predictor variables to understand how they contribute towards determining whether a person is at risk of heart disease or not.
We applied a logistic regression model, as the dependent variable is categorical rather than continuous. Since many variables had to be accounted for, we used a multiple logistic regression model, which expresses the probability that the event (presence of heart disease) occurs as a function of k independent variables. After the subsequent tests measuring the significance of each of the 11 predictor variables on the outcome, we found that 7 variables (Age, Sex, ChestPainType, RestingBP, ExerciseAngina, ST_Slope, Oldpeak) had p-values below 0.05, meaning they are significant predictors of the outcome.
To select the most optimal classifier, we compared the cost of failing to detect positives against the cost of raising false alarms. For this we used the confusion matrix to decide on the best threshold value, which came out to be 0.528.
8. Conclusion & Future Work
From the model it was observed that males are more prone to heart disease than females, which is confirmed in [9]. Asymptomatic chest pain is also associated with heart disease, while the other chest pain types do not seem to affect the presence of heart disease significantly, suggesting that most of these diseases progress silently. A flat slope in the exercise ST segment indicated the presence of heart disease, while an upward slope did not, as reported in [10]. Fasting blood sugar, maximum heart rate, resting ECG and cholesterol did not turn out to be significant predictors of heart disease.
The work done in this project could be taken forward in several ways. Given that the dataset is limited to one geography, future studies could be expanded to people from other ethnicities and geographies. Building on such holistic studies would also help in developing drug usage criteria, tailoring heart disease medication to the specific needs of different populations. With the emergence of new technologies such as artificial intelligence in healthcare, such patient information could be monitored by digital systems for early diagnosis, prevention and care, all of which can benefit from such regression models and studies.
9. References
1. Bahar Gholipour, Nicoletta Lanese (2021, December 14), What is a normal heart rate?, https://www.livescience.com/42081-normal-heart-rate.html
2. Jindong Feng, Qian Wang, Na Li (2021, September), An Intelligent System for Heart Disease Prediction using Adaptive Neuro-Fuzzy Inference Systems and Genetic Algorithm, https://www.researchgate.net/figure/CLASSIFICATION-OF-OLD-PEAK_tbl2_44260568
3. Can J Cardiol (2006, May), Things you need to know about blood pressure and hypertension, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2560868/
4. Mose, Romesh Khardori (2021, December 29), Familial Hypercholesterolemia, https://emedicine.medscape.com/article/121298-overview
5. Paul Rubin (2013, December 1), Testing Regression Significance in R, https://spartanideas.msu.edu/2013/12/01/testing-regression-significance-in-r/
6. Zach (2021, April 1), How to Create a Confusion Matrix in R (Step-by-Step), https://www.statology.org/confusion-matrix-in-r/ (creating a confusion matrix and finding the optimal cutoff in R)
7. David Trafimow (2018, August), Confidence intervals, precision and confounding, https://www.sciencedirect.com/science/article/abs/pii/S0732118X17301691 (relation between precision and confidence intervals)
8. cv.glm, boot package documentation, https://www.rdocumentation.org/packages/boot/versions/1.3-28/topics/cv.glm (K-fold cross-validation in R)
9. G. Weidner (2000, May), Why do men get more heart disease than women? An international perspective, https://pubmed.ncbi.nlm.nih.gov/10863872/
10. ST segment, Wikipedia, https://en.wikipedia.org/wiki/ST_segment# (a flat ST slope is strongly related to heart disease, while an upward slope is not)
11. Jason Brownlee (2020, January 3), How to Calculate Precision, Recall, and F-Measure for Imbalanced Classification, https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/ (F-measure)
12. Zach (2019, May 9), How to Calculate Variance Inflation Factor (VIF) in R, https://www.statology.org/variance-inflation-factor-r/
13. Heart Failure Prediction Dataset, Kaggle, https://www.kaggle.com/fedesoriano/heart-failure-prediction (the dataset)
14. WHO (2021, June 11), Cardiovascular Diseases (CVDs), https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)