Heart Disease Prediction Using Machine Learning Techniques
                                               Mohammed Ramadan Mohammeed
                                                   Computer Sciences (AI)
                                                   University of Benghazi
                                                      Benghazi, Libya
                                               mohammed.ramadan@uob.edu.ly
   Abstract—Machine learning (ML), one of the best-known applications of artificial
intelligence, is revolutionizing many fields of study. This work discusses the use of
machine learning to determine a person's risk of heart disease. Cardiovascular diseases
(CVDs) are common and can be fatal for people anywhere in the world. Machine learning can
take a person's age, cholesterol level, chest pain, and other characteristics into account
to determine whether they have a cardiovascular disease. Classification algorithms based on
supervised learning can facilitate the diagnosis of cardiovascular disease. This study uses
two supervised machine learning algorithms, Random Forest and K-Nearest Neighbor (K-NN), to
distinguish between individuals with and without heart disease. K-NN yielded a prediction
accuracy of 79.0%, whereas Random Forest produced an accuracy of 80.7%.

   Keywords
Heart Disease, Random Forest, K-Nearest Neighbor (K-NN), Machine Learning

                        I. INTRODUCTION

    The human body is made up of various organs, each with its own function. The heart is
one such organ: it pumps blood throughout the body, and if it fails to do so, the
consequences can be fatal. Heart disease is one of the main causes of mortality today [1].
It is therefore necessary to keep the cardiovascular system, like every other system in the
human body, healthy. Unfortunately, people all around the world suffer from cardiovascular
diseases. Any technology that can help diagnose these diseases before much damage is done
will help save people's money and, more importantly, their lives. Data mining techniques can
be useful in predicting heart disease. Predictive models can be built by finding previously
unknown patterns and trends in databases and using the information obtained [2]. Data mining
is the extraction of knowledge from vast volumes of data [3]. Machine learning is one
technological advancement that can assist in diagnosing heart disease early, before
significant harm is done to an individual. It is a rapidly developing field of science and
technology that can diagnose and categorize patients based on their risk of heart disease.

                        II. LITERATURE REVIEW

    Research in this area has produced techniques for predicting cardiovascular disease
using supervised machine learning algorithms, and several studies have been published on the
subject. One report surveys the performance of many models based on machine learning
algorithms and methodologies [4]. Another study describes efforts to develop a graphical
user interface (GUI) that uses a weighted association rule-based classifier to determine
whether a person has heart disease [5]. A novel method for predicting heart disease based on
the coactive neuro-fuzzy inference system (CANFIS) has been reported in another study [6].
One publication [7] summarizes the methods frequently used to predict heart disease and
their associated difficulties. Another [8] describes a classifier approach to identifying
heart disease and demonstrates the use of Naive Bayes for classification. Finally, [9]
surveys several papers in which one or more data mining techniques have been applied to
predict heart disease.

                        III. PROPOSED METHODS
A. K-Nearest Neighbor (K-NN)

    K-Nearest Neighbors (K-NN) is a popular machine learning algorithm for classification
tasks. This non-parametric algorithm categorizes data points according to how close they are
to one another in a feature space. When a new data point of unknown class is encountered,
K-NN finds its k nearest neighbors among the training points and assigns the class that is
most common among them. The value of k is usually specified by the user in advance or found
through cross-validation.

    K-NN rests on the notion that data points belonging to the same class tend to lie close
to one another in the feature space, which makes it a straightforward but effective
technique for classification tasks. It predicts the class of an unknown data point from the
local structure of the data, taking into account the class labels of the closest neighbors.

B. Random Forest

    Random Forest is a powerful ensemble learning method that combines multiple decision
trees to make predictions. It constructs a set of decision trees from the training set, each
of which independently outputs a predicted class. For classification tasks, the final
prediction is the class predicted most frequently across all decision trees.
    By building a variety of decision trees and combining their predictions, Random Forest
draws on the wisdom of the crowd to produce predictions that are more reliable and accurate.
The premise of this ensemble approach is that, although each decision tree has its own
biases and limitations, collective decision-making can compensate for them.
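
    Both methods in this section end with a majority vote: K-NN votes among the nearest
training points, and Random Forest votes among its trees. The following is a minimal sketch
of that shared idea, not the implementation used in the experiment (which was run in
Orange). The function names, the toy data, and the hand-written threshold rules standing in
for trained trees are all assumptions made for illustration. Manhattan distance is used
because it is the metric mentioned in the results.

```python
from collections import Counter


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def majority_vote(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by a majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y), key=lambda pair: manhattan(pair[0], query)
    )
    return majority_vote([label for _, label in by_distance[:k]])


def forest_predict(trees, query):
    """Combine per-tree predictions; a 'tree' is any callable returning a class."""
    return majority_vote([tree(query) for tree in trees])


# Hypothetical toy data, already scaled to [0, 1]: two features per patient.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8), (0.85, 0.75)]
train_y = [0, 0, 0, 1, 1, 1]

knn_label = knn_predict(train_X, train_y, (0.7, 0.8), k=3)

# Three hand-written threshold rules stand in for trained decision trees.
toy_trees = [
    lambda p: int(p[0] > 0.5),
    lambda p: int(p[1] > 0.5),
    lambda p: int(p[0] + p[1] > 1.2),
]
forest_label = forest_predict(toy_trees, (0.7, 0.8))
```

    In a real Random Forest each tree is trained on a bootstrap sample of the training set
with a random subset of features; the sketch only illustrates the voting step.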
                        IV. EXPERIMENTAL SETUP
    The first step is to obtain a dataset containing the attributes of people with and
without heart disease. The dataset for this experiment can be retrieved from the Kaggle
website (https://www.kaggle.com). The experiment was carried out in the Orange Machine
Learning software, which was used to analyze the data. To get a quick overview of the
dataset, a tool called Data Info was used.
                         Fig. 1. Data info
    The Distributions tool offered by Orange Machine Learning was used to obtain summary
statistics for the dataset, such as the average values of the attributes. The target
attribute has the value 1 if the patient has heart disease and 0 otherwise.
             Fig. 2. Distribution of the target attribute
    The distribution of the target attribute shows that the dataset used in this study is
balanced.
    The Distributions tool was also applied to other attributes of the dataset, such as sex,
which takes the values 1 (male) and 0 (female), and cp (chest pain), which encodes the type
of chest pain as a value from 0 to 3.
             Fig. 3. Distribution of the sex attribute
             Fig. 4. Distribution of the chest pain attribute
    After checking the class balance, the correlation between the attributes was examined
using a tool called Heat Map.
             Fig. 5. Correlation between variables
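
    The two checks described above, class balance and correlation with the target, were
performed with Orange tools. As a rough illustration of what those tools compute, the
sketch below does the same in plain Python. The helper names and the toy values are
assumptions and are not taken from the Kaggle dataset.

```python
from collections import Counter
from math import sqrt


def class_balance(labels):
    """Return the share of each class label in the data."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy rows: target (1 = heart disease), chest pain type, max heart rate.
target = [1, 1, 1, 0, 0, 0, 1, 0]
chest_pain = [2, 3, 1, 0, 0, 1, 2, 0]
max_heart_rate = [172, 165, 158, 120, 131, 140, 168, 125]

balance = class_balance(target)
corr_with_target = {
    "chest_pain": pearson(chest_pain, target),
    "max_heart_rate": pearson(max_heart_rate, target),
}
```

    A balanced dataset gives shares close to 0.5 for each class, and a positive coefficient
means the attribute tends to be higher for patients with heart disease.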
    The heat map clearly shows a positive correlation between the target attribute and
attributes such as maximum heart rate achieved (thalach) and chest pain type (cp). After
examining the correlations, the dataset was processed to convert categorical variables such
as sex, cp, fbs, restecg, exang, slope, ca, and thal into dummy variables. To get the best
results when training the models, the attribute values were rescaled to lie between 0 and 1.

    The original dataset was split into training data, which makes up 80% of the dataset,
and testing data, which makes up the remaining 20%. This split allows the machine learning
algorithms used in the study to be assessed on data they were not trained on.

    Once the dataset was ready, the chosen machine learning algorithms were applied to the
training data. The algorithms built heart disease prediction models from the available
attributes. To make accurate predictions, the models had to learn the underlying
relationships and patterns in the data.

    A confusion matrix was used to evaluate the performance of the trained models. It gives
a comprehensive view of an algorithm's predictive performance by tabulating the predicted
classes against the actual classes. A confusion matrix can be laid out in the following way:

             Fig. 6. Distribution of the sex attribute

    The accuracy of an algorithm can be calculated using the formula:

    Accuracy = {(TP + TN) / (TP + FP + TN + FN)} * 100

where TP, FP, TN, and FN are the numbers of true positives, false positives, true
negatives, and false negatives.

    Examining the accuracy of each algorithm shows how well the machine learning models
predict heart disease. A higher accuracy score indicates a more reliable algorithm, one
that makes more correct predictions from the given attributes.

                        V. RESULTS
A. K-Nearest Neighbor (K-NN)
    The value of k was set to 5 with the Manhattan distance metric, as 5 was one of the
values that gave the highest accuracy for the algorithm. The confusion matrix obtained was
as follows:

                 TABLE I.  CONFUSION MATRIX FOR K-NN

                               Actual
                          0       1       ∑
    Predicted   0        83      26     109
                1        19     115     134
                ∑       102     141     243

    The accuracy reported for K-NN is 79.0%.

B. Random Forest
    The number of trees was set to 10. The confusion matrix obtained was as follows:

                 TABLE II.  CONFUSION MATRIX FOR RANDOM FOREST

                               Actual
                          0       1       ∑
    Predicted   0        79      30     109
                1        21     113     134
                ∑       100     143     243

    The accuracy reported for Random Forest is 80.7%.

C. Results after applying each algorithm

                 TABLE III.  RESULTS FOR EACH ALGORITHM

    Algorithm Used     TP    FP     TN    FN    Accuracy
    K-NN               83    26    115    19    79.0%
    Random Forest      79    30    113    21    80.7%

                        VI. CONCLUSION
    After applying the two algorithms, it can be concluded that machine learning is proving
very helpful in predicting heart disease, which is one of the biggest issues facing society
today. As more research is carried out in this area, new techniques may soon make machine
learning even more beneficial to the healthcare industry. With the attributes at hand, the
algorithms employed in this experiment achieved reported accuracies of 79.0% (K-NN) and
80.7% (Random Forest). Finally, it can be concluded that by predicting heart disease early,
machine learning can lessen the harm done to a person's physical and mental health.

                        VII. ACKNOWLEDGMENTS
    Thanks and appreciation to Dr. Muhammad Salem and Dr. Younis Al-Badri for everything
you gave me this semester. I hope that we will meet again in future lessons.
                        VIII. REFERENCES

[1]   Mohan, S., Thirumalai, C., & Srivastava, G. (2019). Effective heart
      disease prediction using hybrid machine learning techniques. IEEE
      Access, 7, 81542-81554.
[2]   Bhatla, N., & Jyoti, K. (2012). An analysis of heart disease prediction
      using different data mining techniques. International Journal of
      Engineering, 1(8), 1-4.
[3]   Patel, J., TejalUpadhyay, D., & Patel, S. (2015). Heart disease
      prediction using machine learning and data mining technique. Heart
      Disease, 7(1), 129-137.
[4]   Ramalingam, V. V., Dandapath, A., & Raja, M. K. (2018). Heart
      disease prediction using machine learning techniques: a survey.
      International Journal of Engineering & Technology, 7(2.8), 684-687.
[5]   Soni, J., Ansari, U., Sharma, D., & Soni, S. (2011). Intelligent and
      effective heart disease prediction system using weighted associative
      classifiers. International Journal on Computer Science and
      Engineering, 3(6), 2385-2392.
[6]   Parthiban, L., & Subramanian, R. (2008). Intelligent heart disease
      prediction system using CANFIS and genetic algorithm. International
      Journal of Biological, Biomedical and Medical Sciences, 3(3).
[7]   Chitra, R., & Seenivasagam, V. (2013). Review of heart disease
      prediction system using data mining and hybrid intelligent techniques.
      ICTACT Journal on Soft Computing, 3(04), 605-09.
[8]   Medhekar, D. S., Bote, M. P., & Deshmukh, S. D. (2013). Heart disease
      prediction system using naive Bayes. Int. J. Enhanced Res. Sci.
      Technol. Eng., 2(3).
[9]   Kaur, B., & Singh, W. (2014). Review on heart disease prediction
      system using data mining techniques. International Journal on Recent
      and Innovation Trends in Computing and Communication, 2(10), 3003-3008.