A Project Report on
Predicting Heart Disease Events based on 11 Clinical Features using
Logistic Regression
Submitted in partial fulfilment of the course
Quantitative Techniques - II
Submitted to
Prof. Pritha Guha
Submitted by
Group 4, Section C
Aditi Goyal (BJ21125), Astha Alankrita (BJ21134),
Ishita Thakur (BJ21143), Nihar Raichada (BJ21152),
Rahul Manna (BJ21161), Shreyas Jayasankar (BJ21170) &
Vishal Singh (BJ21179)
On
January 16, 2022
1. Introduction
According to the World Health Organization [14], cardiovascular diseases (CVDs) are the leading cause of death worldwide. They are responsible for approximately 17.9 million deaths each year, about 31% of all global deaths. Among CVDs, heart attacks and strokes account for 4 out of 5 deaths, and a third of these deaths occur prematurely in people under the age of 70.
The dataset used in this project brings together 11 significant factors that can prove highly useful in predicting heart disease. A model that can accurately predict the presence of heart disease from such bodily factors can be of great value to patients and medical practitioners. It can also help people with related ailments such as diabetes and hypertension, because it offers them early detection and management opportunities.
2. Dataset Description
The data has been sourced from Kaggle [13] and represents different factors that can predict heart disease for a person. The dataset contains 918 observations and 12 columns. A brief description of each column is given below:
S. No.  Column Name     Type  Description
1.      Age             Int   Age of the individual in years.
2.      Sex             Chr   Sex of the individual.
3.      ChestPainType   Chr   Type of chest pain reported (TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic).
4.      RestingBP       Int   Resting blood pressure in mm Hg.
5.      Cholesterol     Int   Serum cholesterol level in mg/dl.
6.      FastingBS       Int   Fasting blood sugar (1 if FastingBS > 120 mg/dl, 0 otherwise).
7.      RestingECG      Chr   Resting electrocardiogram (ECG) result.
8.      MaxHR           Int   Maximum heart rate achieved (value between 60 and 202).
9.      ExerciseAngina  Chr   Exercise-induced angina (Y if chest pain is felt during exercise, N otherwise).
10.     Oldpeak         Num   ST depression induced by exercise.
11.     ST_Slope        Chr   Slope of the exercise ST segment (Up: upsloping, Flat: flat, Down: downsloping).
12.     HeartDisease    Int   Output column indicating whether the patient has heart disease (1 for heart disease, 0 for normal).
Table (2.1)
We first perform exploratory analysis on the above-mentioned dataset and then run regression analysis to understand which factors contribute the most towards heart disease. We have assumed a level of significance (α) of 0.05 in all our tests.
In the dataset we observed that 1 record had a zero (physiologically impossible) value in the RestingBP column, and 172 records had zero values for Cholesterol. As part of the data filtering process, we excluded those records from the exploratory and regression analyses.
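A minimal sketch of this filtering step in R, assuming the Kaggle file has been saved as heart.csv (the file name and the exact code are our illustration, not the original script):

# Load the data; the Chr columns become factors for later modelling.
heart <- read.csv("heart.csv", stringsAsFactors = TRUE)

# Zero values of RestingBP and Cholesterol are physiologically
# impossible, so we treat them as missing and drop those records.
heart <- subset(heart, RestingBP > 0 & Cholesterol > 0)
nrow(heart)  # 746 of the original 918 records remain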
3. Exploratory Analysis:
We can divide the dataset features into two groups:
Categorical Variables: ChestPainType, Sex, RestingECG, ST_Slope, FastingBS,
ExerciseAngina
Quantitative Variables: Age, RestingBP, Cholesterol, MaxHR, OldPeak
Statistical Summary for Quantitative Variables:
Variable Range Mean Median
Age [28, 77] 52.88 54
RestingBP [92, 200] 133 130
Cholesterol [85, 603] 244.6 237
MaxHR [69, 202] 140.2 140
OldPeak [-0.1, 6.2] 0.9016 0.5
Table (3.1)
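Table (3.1) can be reproduced with base R; a sketch, assuming the filtered data frame heart from above and the Kaggle column name Oldpeak:

quant_vars <- c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")
summary(heart[, quant_vars])  # reports min, max, mean and median per variable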
3.1 Q-Q Plots:
Fig (3.1) Fig (3.2)
Fig (3.3) Fig (3.4)
Fig (3.5)
The Q-Q plots show that RestingBP and OldPeak do not follow a normal distribution, as most of their data points lie far from the reference line. For Age, Cholesterol and MaxHR, the points lie closer to the line, but more analysis is required before asserting normality. Density curves and the Shapiro-Wilk test can help us better understand these variables.
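A sketch of how such Q-Q plots can be drawn in base R (one panel per quantitative variable):

par(mfrow = c(2, 3))  # arrange the five plots in a grid
for (v in c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")) {
  qqnorm(heart[[v]], main = v)     # sample vs. theoretical normal quantiles
  qqline(heart[[v]], col = "red")  # reference line through the quartiles
}
par(mfrow = c(1, 1))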
3.2 Density Curves:
Fig (3.6) Fig (3.7)
Fig (3.8) Fig (3.9)
Fig (3.10)
The density curves for all the continuous variables are shown above. From these graphs we observe that only Age and MaxHR appear to follow a normal distribution. RestingBP has a multimodal density, whereas OldPeak and Cholesterol show skewed curves.
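A sketch of the density curves and of the Shapiro-Wilk test mentioned above (H0: the variable is normally distributed; a p-value below 0.05 rejects normality):

plot(density(heart$Age), main = "Density of Age")
plot(density(heart$RestingBP), main = "Density of RestingBP")
shapiro.test(heart$Age)    # formal normality check for Age
shapiro.test(heart$MaxHR)  # and for MaxHR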
3.3 Scatter Plots:
Fig (3.11) Fig (3.12)
Fig (3.13) Fig (3.14)
In Fig (3.11), the Cholesterol vs. Age plot shows widely scattered points with a weak positive correlation.
The Cholesterol vs. OldPeak plot in Fig (3.12) shows a very weak positive correlation between the two variables.
The MaxHR vs. OldPeak plot in Fig (3.13) shows a weak negative correlation.
The data points in the MaxHR vs. Cholesterol plot in Fig (3.14) are widely scattered and show a weak negative correlation.
3.4 Correlation Matrix:
Age RestingBP Cholesterol MaxHR Oldpeak
Age 1 0.25986472 0.05875824 -0.38211212 0.28600628
RestingBP 0.25986472 1 0.09593929 -0.12577393 0.19857506
Cholesterol 0.05875824 0.09593929 1 -0.01985579 0.05848813
MaxHR -0.38211212 -0.12577393 -0.01985579 1 -0.25953263
Oldpeak 0.28600628 0.19857506 0.05848813 -0.25953263 1
Table (3.2)
The correlation table shows the pairwise relationships between the continuous variables in the dataset and helps us identify correlation and independence between different pairs of variables. From the table above we observe that no two quantitative variables are highly correlated.
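Table (3.2) is the pairwise Pearson correlation matrix of the five quantitative variables; a sketch of how it can be computed:

round(cor(heart[, c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")]), 4)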
3.5 Bar Plots:
To understand the Categorical Variables, we have derived the following bar plots:
Fig (3.15) Fig (3.16)
Fig (3.17) Fig (3.18)
Fig (3.19) Fig (3.20)
Inferences based on the bar plots:
- In the Chest Pain Type vs. Heart Disease plot, Fig (3.16), approximately 300 of the 400 patients reporting the "ASY" chest pain type were diagnosed with heart disease, suggesting that this pain type is strongly associated with heart disease.
- According to Fig (3.15), most people report a normal resting ECG.
- Fig (3.17) shows the Sex vs. Heart Disease distribution; in the given dataset, the percentage of men diagnosed with heart disease is much higher than that of women.
- The ST_Slope vs. Heart Disease graph in Fig (3.18) shows that most people with a flat ST slope are diagnosed with heart disease.
- Fig (3.19) shows that heart disease occurs in patients with fasting blood sugar both above and below 120 mg/dl, but the percentage with heart disease is higher when fasting blood sugar exceeds 120 mg/dl.
- Fig (3.20) shows that people who report exercise-induced angina have a higher chance of being diagnosed with heart disease.
3.6 Additional Plots:
HeartDisease is stored as an integer, so we first convert it to a factor for classification purposes and then use the ggplot function to visualise the data, as sketched below. All plots are split by the target variable, HeartDisease:
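A sketch of the factor conversion and of one such chart with ggplot2 (the binwidth and aesthetics are illustrative choices, not the original settings):

library(ggplot2)
heart$HeartDisease <- as.factor(heart$HeartDisease)  # 0/1 integer -> factor

# Age vs. count, split by the target variable, as in Fig (3.21)
ggplot(heart, aes(x = Age, fill = HeartDisease)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  labs(y = "Count")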
1) The first plot, Age vs. Count, shows the frequency of heart disease with increasing age. As visible from the plot below, the onset of heart disease is seen after the age of 40 for a significant number of people.
Fig (3.21)
2) The second plot, Sex vs. Count, shows the frequency of heart disease for the two sexes. As visible from the plot below, heart disease is more frequent in males.
Fig (3.22)
3) The third plot, Chest Pain Type vs. Count, shows the frequency of heart disease across chest pain types. As visible from the plot below, heart disease is more frequent in people experiencing the ASY or NAP types of chest pain, while the anginal types (TA and ATA) are usually not associated with heart disease.
Fig (3.23)
4) The fourth plot, RestingBP vs. Count, shows the frequency of heart disease across different RestingBP values. As we can see, heart disease cases in the dataset are concentrated at RestingBP values above 100 mm Hg.
Fig (3.24)
5) The fifth plot, Cholesterol vs. Count, shows the frequency of heart disease across cholesterol values. As visible from the plot below, the relation between cholesterol and heart disease is inconclusive: there is no particular cholesterol threshold that signals the presence of heart disease, although most values lie between 200 and 600 mg/dl.
Fig (3.25)
4. Building the Logistic Regression Model
Logistic regression is a statistical model used to model a binary dependent variable by estimating the parameters of a logistic model. For the outcome labeled '1', the log of the odds is a linear combination of the independent variables, also known as "predictors". The model is designed for binary (dichotomous) outcomes.
The logit function is logit(p) = log(p / (1 - p)); probabilities transformed this way are known as logit values.
To perform the regression analysis, we model πi = logit(pi) = β0 + β1 * xi.
Once the coefficients are determined from the model, the probability of the event happening is given by pi = e^πi / (1 + e^πi).
In the case of our heart disease data, there are 11 predictor variables, three of which are categorical with more than two levels and therefore expand into several dummy variables, giving 15 terms in total. The logit equation (fitted model) is:
πi = logit(pi) = β0 + β1*a + β2*b + β3*c + β4*d + β5*e + β6*f + β7*g + β8*h + β9*i + β10*j + β11*k + β12*l + β13*m + β14*n + β15*o
The values of β represent the coefficients attached to each term. The list of independent variables is:
Table (4.1)
The categorical variables ChestPainType, Sex, RestingECG, ST_Slope, FastingBS and ExerciseAngina were encoded numerically (as dummy variables) so that they could be used in the regression model. The coefficients were estimated with a Generalized Linear Model (GLM), keeping the significance level α at 5%. The output of the GLM is as follows:
Fig (4.1)
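A minimal sketch of the model fit behind Fig (4.1); glm() with family = binomial fits the logistic regression and expands each Chr column into dummy variables automatically:

model_full <- glm(HeartDisease ~ Age + Sex + ChestPainType + RestingBP +
                    Cholesterol + FastingBS + RestingECG + MaxHR +
                    ExerciseAngina + Oldpeak + ST_Slope,
                  data = heart, family = binomial)
summary(model_full)  # coefficients, z-values, p-values and the AIC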
Thus, the logit equation for the original model was calculated to be:
πi = logit (pi) = -7.3402738 + (0.0313784) * a + (1.8655490) * b + (1.6731804) * c +
(0.01001683) * d + (0.0399275) * e + (0.0117792) * f + (0.0024955) * g + (0.2923999) * h
+ (0.2297888) * i + (0.0551871) * j + (0.0005807) * k + (0.9073515) * l + (0.4108355) * m
+ (1.3038217) * n + (-1.2100372) * o
The probability of having heart disease for a given person can be ascertained by substituting the values of the independent variables and calculating the logit score πi. From the logit score, the probability is computed as pi = e^πi / (1 + e^πi).
It can also be inferred that a 1-unit rise in an independent variable increases the logit score by that variable's coefficient and multiplies the odds by e^β. For example, if the fasting blood sugar indicator increases by 1 unit, the logit score increases by 0.2923999, and the odds of the person having heart disease are multiplied by e^0.2923999 ≈ 1.34.
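A sketch of these computations in R; plogis() is base R's logistic function, p = e^x / (1 + e^x):

logit_score <- predict(model_full, newdata = heart[1, ])  # linear predictor (logit)
plogis(logit_score)   # predicted probability of heart disease
exp(0.2923999)        # ~1.34, the odds multiplier for a 1-unit rise in FastingBS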
4.1 Outlier Detection
Boxplots have been used to detect the presence of outliers.
Fig (4.2) Fig (4.3)
Fig (4.4) Fig (4.5)
All data points with a Cholesterol value of zero were removed, as a cholesterol value of zero is not medically possible and such data points were not meaningful for our analysis. There were 172 such data points.
MaxHR (maximum heart rate) can range from 60 to 200 bpm according to medical science [1]. RestingBP (blood pressure) can also have values up to 200 [3]. OldPeak is typically less than 2 and can reach a maximum of about 6.2 [2]. For Cholesterol, the zero values have already been removed; the remaining high values are not treated as outliers because cholesterol can range up to 600 mg/dL in some patients [4]. If we omitted these high cholesterol values or replaced them with the mean or median, the model would become less effective, assigning a low probability of heart disease even when the cholesterol level is extremely high. Since our dataset is health data, we have relied on these medical justifications and have not treated the outliers purely mathematically, so that the model remains valid for all medically possible values of the predictor variables.
4.2 Multicollinearity Diagnostics
Multicollinearity is a situation in which one or more independent variables are highly linearly
related. This can affect the results of your data. To check multicollinearity, we had to run a
Variance Expansion Factor (VIF) test on an existing model. It measures how much the variance
of the regression coefficients expands due to the multicollinearity of the model. The minimum
possible value for VIF is 1, indicating that there is no multicollinearity. A VIF score of less
than 5 usually indicates that the significance of multicollinearity is low.
The outcome after running the VIF test was:
Table (4.2)
The VIF returned values below 2 in all cases, indicating an absence of multicollinearity in the model.
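A sketch of this diagnostic, assuming the car package (its vif() function reports a generalized VIF when the model contains factor predictors):

library(car)
vif(model_full)  # values below 5 suggest multicollinearity is not a concern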
4.3 Finding the Best Fit Model
To determine which of the models is better, we run the step function, which compares models by their AIC values. The Akaike Information Criterion (AIC) is a score used to compare candidate models on a given dataset to arrive at the best model for that data. The AIC is based on a model's maximum likelihood as a measure of fit, penalized by the number of parameters. The lower the AIC score, the better.
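A sketch of the stepwise search described below; step() repeatedly adds or drops terms to minimize the AIC:

model_new <- step(model_full, direction = "both")  # "forward" and "backward" were also run
model_new$aic  # 509.12 for the 7-predictor model reported below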
We first used the summary command on the full model consisting of all 11 predictor variables. The AIC score of this model came out to be 515.58.
We then carried out stepwise regression using the step function in the forward, backward and both directions to arrive at the best fit model. All three gave the same result: the best model, with an AIC score of 509.12, is based on 7 predictor variables (the variables excluded are FastingBS, Cholesterol, RestingECG and MaxHR). In other words, in a ranking of the candidate models by information loss, the best fit is Model_new, the model consisting of all predictor variables except FastingBS, Cholesterol, RestingECG and MaxHR. Therefore, the GLM was run again excluding these four variables.
The output for that is as follows:
Fig (4.6)
For the above updated model, the logit function changed to:
πi = logit(pi) = -6.821891 + 0.033988*a + 1.835024*b + 1.707323*c + 0.123502*d + 0.098073*e + 0.013504*f + 0.888317*l + 0.403*m + 1.255385*n - 1.292281*o
5. Tests for Regression Coefficients
The observed z-values and the corresponding p-values of the z-tests for the individual predictor variables help us determine the significance of each variable in the model while controlling for the rest of the variables. The z-test indicates whether the coefficient of the variable in question is 0 or not. The null and alternate hypotheses for the z-test are:
Null hypothesis H0 – The variable is insignificant, i.e. βi = 0
Alternate hypothesis H1 – The variable is significant, i.e. βi ≠ 0
Results are summarized in the table below:
Table (5.1)
6. Model Performance / Validation
6.1 Test for Goodness of Fit
To check if the model is effective or not, the G-Test was performed. The Null and Alternate
hypotheses for the same are:
Null Hypothesis H0 – Null model fits the data at least as well as our model, or
βi = 0
Alternate Hypothesis H1 – Our model fits the data better than the null model, or
at least one of βi ≠ 0
PCHISQ function: pchisq(1032.63 - 487.12, df = 745 - 734, lower.tail = F), where 1032.63 and 487.12 are the null and residual deviances of the model and 745 and 734 the corresponding degrees of freedom.
The output was 6.213367e-110, far below the alpha value. Hence we rejected the null hypothesis and concluded that not all βi are 0, i.e. the model is significant.
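The same G-test can be computed directly from the fitted model object; a sketch:

# The difference in deviances is chi-square distributed under H0.
pchisq(model_new$null.deviance - model_new$deviance,
       df = model_new$df.null - model_new$df.residual,
       lower.tail = FALSE)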
ANOVA function: anova(Model_new, test = "Chisq")
This test adds the terms to the model one by one and tests the significance of each addition using a chi-square test. The outcome was:
Fig (6.1)
Since each of the terms gives a p-value < alpha, we rejected the null hypothesis and concluded that the βi values are not all 0. Thus, our model fits the data better than the null model. Both of the above tests suggest that the model is effective.
6.2 Error Analysis: Confusion Matrix
The confusion matrix is frequently used in statistical classification to examine the performance
of a statistical model. The measures obtained from Confusion Matrix are:
- Accuracy: Indicates how correctly the model can classify heart disease
- Sensitivity: Indicates model’s ability to designate a person with disease as positive
- Specificity: Indicates model’s ability to designate a person without disease as negative
- Precision: Indicates model’s ability to correctly classify presence of disease
- F-Measure: A measure created considering both precision and recall [11]
The higher the accuracy, sensitivity and specificity, the better. However, raising the cutoff improves specificity at the expense of sensitivity; as a result, an optimal cutoff probability must be determined to get the best out of the model.
6.3 Analysis:
The code "pred.1[pred.prob.10.5] = 0" assumed a probability cutoff of 0.5, which meant that
only outcomes with a probability greater than 0.5 were examined. However, to find the
optimum model, we constructed confusion matrices with probability values ranging from 0.1
to 0.9 at multiples of 0.1. These three measurements were then plotted against the probability
values, with the probability value at the intersection point serving as our model's cutoff.
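A sketch of that sweep (the variable names are illustrative; HeartDisease is assumed to be coded 0/1):

pred_prob <- predict(model_new, type = "response")  # fitted probabilities
actual <- heart$HeartDisease
for (cutoff in seq(0.1, 0.9, by = 0.1)) {
  pred <- factor(ifelse(pred_prob > cutoff, 1, 0), levels = c(0, 1))
  tab <- table(pred, factor(actual, levels = c(0, 1)))  # 2x2 confusion matrix
  accuracy    <- sum(diag(tab)) / sum(tab)
  sensitivity <- tab["1", "1"] / sum(tab[, "1"])  # positives correctly flagged
  specificity <- tab["0", "0"] / sum(tab[, "0"])  # negatives correctly flagged
  cat(cutoff, accuracy, sensitivity, specificity, "\n")
}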
Table (6.2)
Fig (6.2): Sensitivity and specificity plotted against the cutoff probability; the two curves intersect near a cutoff of 0.53.
Accuracy at each cutoff probability:
Cutoff      Accuracy
0.1         78.28%
0.2         82.71%
0.3         85.25%
0.4         86.60%
0.5         86.86%
0.5286721   87.67%
0.6         86.33%
0.7         84.85%
0.8         81.37%
0.9         74.26%
Fig (6.3)
Table (6.3)
The cutoff probability at the intersection point is 0.528672. At this cutoff the accuracy is at its maximum, 87.67%. Hence, for our model, we use a probability cutoff of 0.528672.
6.4 Scatterplot of residuals against the predicted values:
Fig (6.4)
Through this model we predict the probability of having heart disease: the observed outcome is either 1 (disease) or 0 (no disease). When the observed outcome is 0, the fitted probability is always greater than 0, so the residuals are negative (indicated by the blue curve). Similarly, when the observed outcome is 1, the fitted probability is always below 1, so the residuals are positive (indicated by the red curve).
6.5 Model Validation using Cross Validation
K-fold cross-validation has been used to validate our model. In this technique the data is divided into K folds (parts); K - 1 folds are used to train the model and the remaining fold is used for testing. This is repeated K times and the average accuracy is computed. With K = 10 folds it gave an accuracy of 89.6%. The cv.glm() function from the boot library was used [8]; it reports the misclassification rate and an adjusted misclassification rate as its delta output.
Fig (6.5)
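A sketch of the cross-validation call, assuming the boot package and a numeric 0/1 HeartDisease column (cv.glm's cost function receives the observed responses and the fitted probabilities):

library(boot)
# Misclassification rate at the 0.5 default cutoff.
cost <- function(y, prob) mean(abs(y - prob) > 0.5)
set.seed(42)  # the seed is an illustrative choice
cv_out <- cv.glm(heart, model_new, cost = cost, K = 10)
cv_out$delta  # raw and adjusted misclassification rates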
7. Results & Interpretation
Table (7.1)
The aim of this project is to analyse eleven predictor variables to understand how they contribute towards determining whether a person is at risk of heart disease or not.
We applied a logistic regression model, as the dependent variable is categorical rather than continuous. Since many variables had to be accounted for, we used a multiple logistic regression model, which expresses the probability that the event (presence of heart disease) occurs as a function of k independent variables. After the subsequent tests measuring the significance of each of the 11 predictor variables on the outcome, we found that 7 variables (Age, Sex, ChestPainType, RestingBP, ExerciseAngina, ST_Slope, Oldpeak) had p-values below 0.05, meaning they are significant predictors of the outcome.
To select the most optimal classifier, we compared the cost of failing to detect positives against the cost of raising false alarms. For this we used the confusion matrix to decide on the best threshold value, which came out to be 0.528.
8. Conclusion & Future Work
From the model it was observed that males are more prone to heart disease than females, which is confirmed in [9]. Asymptomatic chest pain is also associated with heart disease, while the other chest pain types do not seem to affect the presence of heart disease significantly, suggesting that most of these diseases progress silently. A flat slope in the exercise ST segment indicated the presence of heart disease, while an upward slope did not, as reported in [10]. Fasting blood sugar, maximum heart rate, resting ECG and cholesterol did not turn out to be significant predictors of heart disease.
The work done in this project could be taken forward in several ways. Given that the dataset is limited to one geography, future studies could be expanded to people from other ethnicities and geographies. Building on such holistic studies would also help in developing drug usage criteria, tailoring heart disease medication to the specific needs of different populations. With the emergence of new technologies such as artificial intelligence in healthcare, such patient information could be monitored by digital systems for early diagnosis, prevention and care, all of which can benefit from such regression models and studies.
9. References
1. Bahar Gholipour, Nicoletta Lanese (2021, December 14), What is a normal heart rate?, https://www.livescience.com/42081-normal-heart-rate.html
2. Jindong Feng, Qian Wang, Na Li (2021, September), An Intelligent System for Heart Disease Prediction using Adaptive Neuro-Fuzzy Inference Systems and Genetic Algorithm, https://www.researchgate.net/figure/CLASSIFICATION-OF-OLD-PEAK_tbl2_44260568
3. Can J Cardiol (2006, May), Things you need to know about blood pressure and hypertension, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2560868/
4. Mose, Romesh Khardori (2021, December 29), Familial Hypercholesterolemia, https://emedicine.medscape.com/article/121298-overview
5. Paul Rubin (2013, December 1), Testing Regression Significance in R, https://spartanideas.msu.edu/2013/12/01/testing-regression-significance-in-r/
6. Zach (2021, April 1), How to Create a Confusion Matrix in R (Step-by-Step), https://www.statology.org/confusion-matrix-in-r/ (creating a confusion matrix and finding the optimal cutoff in R)
7. David Trafimow (2018, August), Confidence intervals, precision and confounding, https://www.sciencedirect.com/science/article/abs/pii/S0732118X17301691 (relation between precision and confidence intervals)
8. cv.glm, boot package documentation, https://www.rdocumentation.org/packages/boot/versions/1.3-28/topics/cv.glm (K-fold cross-validation in R)
9. G. Weidner (2000, May), Why do men get more heart disease than women? An international perspective, https://pubmed.ncbi.nlm.nih.gov/10863872/
10. ST segment, Wikipedia, https://en.wikipedia.org/wiki/ST_segment# (a flat ST slope is strongly related to heart disease, while an upward slope is not)
11. Jason Brownlee (2020, January 3), How to Calculate Precision, Recall, and F-Measure for Imbalanced Classification, https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/ (F-measure)
12. Zach (2019, May 9), How to Calculate Variance Inflation Factor (VIF) in R, https://www.statology.org/variance-inflation-factor-r/
13. Heart Failure Prediction Dataset, Kaggle, https://www.kaggle.com/fedesoriano/heart-failure-prediction (the dataset)
14. WHO (2021, June 11), Cardiovascular Diseases (CVDs), https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)