978-1-5386-0569-1/$31.00 ©2017 IEEE

Abstract—Data mining plays an important role in the prediction of diseases in the medical field. With growing research on disease prediction systems, it has become important to discover hidden patterns and relationships in medical databases. Classical clinical diagnosis requires many tests, which can complicate disease prediction. Data mining techniques can therefore help medical experts reach a decision about a disease using a computer-aided decision support system. This paper presents a comprehensive survey of the data mining techniques used for disease prediction.

Keywords—Data mining, disease prediction, breast cancer, heart, feature selection.

I. Introduction

Data mining is a novel approach for extracting knowledge from databases. One of its most active research areas is the healthcare industry, as healthcare companies are making efforts to gather patients' records. Estimates show that there are approximately 1,099,511,627,776 bytes of data, and this volume is increasing day by day. This data has to be mined to extract useful information. Patients sometimes fail to explain their symptoms correctly, and laboratory reports may contain some degree of error. Doctors may also find it difficult to reach a decision about a disease because they may not have expertise in all fields. To solve these problems, there is a need for a decision prediction system that combines the knowledge of medical experts with an automated system to achieve the best results and serve society.

II. Description of data mining techniques for disease prediction

This section describes the significant body of research on data mining techniques for disease prediction.

P. K. Anooj (2012) proposed a clinical decision support system using weighted fuzzy rules for risk-level prediction of heart disease. In this method, data preprocessing is first applied to eliminate missing values. Weighted fuzzy rules are then generated and a fuzzy rule-based decision support system is developed. The method selects suitable attributes to generate fuzzy rules, which are weighted based on their frequency in the learning database. These fuzzy rules are then used to build the decision support system. The proposed system is tested on three datasets. The Cleveland dataset, collected from the V.A. Medical Center, contains 303 records, of which 202 are training data and 101 are test data. The Hungarian dataset, collected from the Hungarian Institute of Cardiology, Budapest, contains 294 records, of which 196 are training data and 98 are test data. The Switzerland dataset, collected from University Hospital, Zurich, Switzerland, contains 123 records, of which 82 are training data and 41 are test data. Compared with a neural-network-based system, the method achieved higher accuracy. It reported an accuracy of 57.851% on the Cleveland dataset and 50.583% on the Hungarian dataset. On the Switzerland dataset, the reported accuracy is 20%, higher than that of the neural-network-based system. [1]

S. Apte et al. (2012) propose a data mining classification technique for the prediction of heart disease. In this approach, data preprocessing is applied to handle missing values, which are replaced using the mean-mode method. A multi-layer perceptron neural network is then used for mapping the data. Three data mining classification techniques, namely naïve Bayes, neural network and decision trees, are analyzed on a heart disease database, using Weka 3.6.6 as the data mining tool. The method uses 303 records from the Cleveland heart disease database as the training set and 270 records from the Statlog heart disease database as the test set, so 573 records in total are used to detect the disease. The dataset consists of three types of attributes: predictable attribute, input and key. Initially 13 input attributes are used; two more attributes, smoking and obesity, are then added, as these are considered important factors for heart disease. The neural network reported an accuracy of 100%, the decision tree 99.62% and naïve Bayes 90.74%. After comparing these three classification techniques, the
result was that the neural network achieved the highest accuracy, compared with decision trees and naïve Bayes. [2]

D.I. Kotsia et al. (2008) proposed an automatically generated system for the diagnosis of coronary artery disease using data mining and fuzzy modeling. It consists of the following steps: induction of a decision tree from data, extraction of rules, formulation of a crisp model, transformation of the crisp model into a fuzzy model and, finally, optimization. The data for testing this method is collected from the Invasive Cardiology department of the University Hospital of Ioannina. The dataset consists of 199 subjects, each characterized by 19 features. The method reports a sensitivity of 80% and a specificity of 65% after fuzzification and optimization. [3]

T. Turner et al. (2012) proposed a method that integrates k-means clustering and a decision tree for diagnosing heart disease patients. The method employs different centroid selection methods for the k-means clustering algorithm and a decision tree for determining the clusters. The dataset is the Cleveland Clinic Foundation heart disease dataset, which contains 13 different attributes. The combination of k-means clustering and decision tree achieved better results than a traditional decision tree, with the integrated algorithm reporting an accuracy of 83.9%. [4]

R. Stocker et al. (2012) proposed a method for diagnosing heart disease patients by integrating k-means clustering and naïve Bayes with different centroid selection methods. The dataset is obtained from the Cleveland Clinic Foundation. By integrating k-means clustering and naïve Bayes, the reported accuracy is 84.5%, which is higher than that of the individual algorithms. [5]

R. Subramanian et al. (2007) proposed an intelligent method for predicting heart disease. The method integrates three models: a neural network, a coactive neuro-fuzzy inference system (CANFIS) for discovering nonlinear relationship maps between different attributes, and a genetic algorithm. The simulation is performed in the NeuroSolution software. The dataset is obtained from UCI. CANFIS reported a mean square error of 0.000842. [6]

Montazer et al. (2010) proposed a model for coronary heart disease risk assessment using a fuzzy-evidential hybrid inference engine. In this method, fuzzy set rules are first applied to the information that is not clear, and a fuzzy rule set is then extracted. This result is considered as the basic belief, and from this belief the plausibility functions are positioned. This is referred to as decision-making uncertainty, and information from various sources is then fused. The dataset is the Hungarian Institute of Cardiology heart disease dataset in the University of California, Irvine machine learning repository, and consists of 294 samples. The accuracy achieved is 91.58%. [7]

Olabode et al. (2012) propose a multilayer feed-forward artificial neural network with error backpropagation for classification of cerebrovascular accident attack. This method first predicts the dependent variable. A model with a hidden layer of 10 nodes and an output layer of one node is used. The dataset contains 100 records, 40 female and 60 male, from the Federal Medical Centre, Owo, Nigeria. The neural network was trained for 150 epochs and reached an MSE of 0.0698843. The simulation results show that the model was capable of producing a reasonable forecasting accuracy. [8]

I.H. Elhajj et al. (2010) proposed an anticipated decision support system to detect agitation transition. The model uses a decision confidence measure and two new support vector machine architectures, namely confidence-based SVM and confidence-based multilevel SVM, for detecting agitation transition. The dataset is obtained using sensors placed around the body. The patient then undergoes the trait scale of the State-Trait Anxiety Inventory (T-STAI), which is used to measure anxiety in adults. In this way 240 samples are collected. An accuracy of 91.4% was achieved, compared with 90.9% for a conventional support vector machine. [9]

S. Pal et al. (2013) describe a model for predicting heart disease using data mining techniques. The proposed methodology surveys three different classifiers, namely ID3 (Iterative Dichotomiser 3), decision tree and CART (Classification and Regression Tree). The dataset is collected from the Cleveland Clinic Foundation. The comparison showed that CART achieved an accuracy of 83.49%, which was better than ID3 and decision tree. The average error reported by CART was 0.3, and the time taken to build the CART model was 0.23 seconds. [10]

K. Chandra Shekar et al. (2011) proposed a method for classification of heart attack patients. In this approach, the dataset is first preprocessed and a modified equal-width binning interval approach is then applied. Numeric attributes are converted into categorical form, and frequent patterns applicable to heart disease are mined from the extracted data using the pruning-classification association rule (PCAR) algorithm. The model uses only the selected class label for effective prediction. The dataset is obtained from UCI. The model is reported to be capable of predicting heart attack effectively. [11]

S. Soni et al. (2011) proposed a model for heart attack prediction using a weighted associative classifier (WAC). The dataset is obtained from the University of California Irvine (UCI) machine learning repository. Instead of using five class labels (four for the types of heart disease and one for no heart disease), the method considers only two class labels, one for "Heart Disease" and another for "No Heart Disease", because the dataset has few records for the different types of heart disease. Using a support value of 25% and a confidence value of 80%, the model achieved an accuracy of 81.51%. Thus, weighted associative classifier is the best
A. Govrdhan et al. (2011) proposed a data mining application in the medical industry for predicting heart attacks. The method uses the one dependency augmented naïve Bayes classifier (ODANB) and the naive credal classifier 2 (NCC2) for data preprocessing. The application uses three data mining algorithms, namely decision list, naïve Bayes and K-NN, for predicting heart attacks. The dataset consists of plain-text ARFF files together with a dataset from the University of California Irvine (UCI) machine learning repository. The method has been validated on 3000 instances with 14 different attributes, with 70% of the data used for training and the remaining 30% for testing. The decision list reported an accuracy of 52%, naïve Bayes 52.33% and K-NN 45.67%. The algorithms were compared on the basis of accuracy and the time taken to diagnose heart disease. Naïve Bayes was chosen as the best classification algorithm because the time it took, 609 ms, was lower than that of the decision list and K-NN algorithms. [13]

E. Anupriya et al. (2010) employ a model to predict heart disease. First, a genetic algorithm is used to determine the significant attributes. A new population is then constructed using survival of the fittest. The model uses three classification techniques, namely decision tree, classification via clustering and naïve Bayes, to predict the disease. The dataset consists of 909 records, initially with 13 attributes, which were reduced to 6 attributes using a crossover probability of 0.6 and a mutation probability of 0.033. The decision tree reported the highest accuracy, 99.2%, compared with the other classification techniques. [14]

C. Ardil et al. (2013) proposed a method using data mining techniques for predicting acute coronary syndrome. The model first uses a data reduction technique to reduce the dimensions. After applying principal component analysis to the ten independent numeric variables, the model finds that the first eight principal components cover more than 98% of the total variability of the continuous data space. The model uses datasets from two different cardiac hospitals in Karachi, Pakistan. After data reduction, the 14 independent variables are hypertension, gender, fasting blood sugar, cholesterol, pulse rate, heart rate, smoking, age, blood pressure (diastolic), family history, hypertension, diabetes mellitus, streptokinase and blood pressure (systolic). The observations showed that smoking is the most significant risk factor for acute coronary syndrome, compared with the other factors. [15]

Author (year)                      Reported result
S. Apte et al. (2012)              100%
D.I. Kotsia et al. (2008)          80%
T. Turner et al. (2012)            83.9%
R. Stocker et al. (2012)           84.5%
R. Subramanian et al. (2007)       MSE 0.000842
Montazer et al. (2010)             91.58%
Olabode et al. (2012)              MSE 0.0698843
I.H. Elhajj et al. (2010)          91.4%
S. Pal et al. (2013)               83.49%
K. Chandra Shekar et al. (2011)    Predicts effectively
S. Soni et al. (2011)              81.51%
A. Govrdhan et al. (2011)          52.33%
E. Anupriya et al. (2010)          99.2%
C. Ardil et al. (2013)             Smoking is significant factor
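
The surveyed studies report their results as accuracy, sensitivity, specificity or mean square error (MSE). As a minimal sketch of how these figures are conventionally computed, the following Python fragment derives them from true and predicted values. The function names and the sample labels are hypothetical and chosen only for illustration; they are not taken from any of the surveyed papers.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels
    (1 = disease present, 0 = disease absent)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of diseased cases that are detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else math.nan,
        # Specificity: share of healthy cases that are correctly cleared.
        "specificity": tn / (tn + fp) if (tn + fp) else math.nan,
    }


def mean_squared_error(targets, outputs):
    """Average squared difference between target and model output."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


# Hypothetical labels, not data from any surveyed study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)

# Hypothetical network outputs against 0/1 targets.
mse = mean_squared_error([1.0, 0.0, 1.0, 0.0], [0.9, 0.2, 0.8, 0.1])

print(metrics)
print(mse)
```

With these illustrative labels, three of the four diseased cases and four of the six healthy cases are classified correctly, so the accuracy of 0.7 hides the difference between a sensitivity of 0.75 and a specificity of about 0.67. This is one reason why accuracy figures from different studies are hard to compare directly.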
Burke B.H et al. (1999) proposed a model for evaluating the accuracy of an ANN in predicting 5-, 10- and 15-year breast cancer specific survival. The eight input variables entered in this model are nuclear pleomorphism, tumor necrosis, tubule formation, age, axillary nodal status, mitotic count, histological and tumor size. The dataset is obtained from the City Hospital of Turku and Turku University Central Hospital. It consists of 951 instances, divided into a training set of 651 and a validation set of 300 patients. In this model, the results of the artificial neural network and logistic regression are compared. The accuracy is reported to be 0.909 for 5-year survival, 0.086 for 10 years and 0.883 for 15 years. The comparison showed that the artificial neural network reports consistently high accuracy over time when compared with logistic regression. [17]

G. Walker et al. (2005) compared three data mining techniques for predicting breast cancer survivability: logistic regression, decision tree and artificial neural network. The model uses a large dataset with more than 200,000 cases. Logistic regression reported an accuracy of 89.2%, the decision tree (C5) 93.6% and the artificial neural network 91.2%. 10-fold cross-validation is performed to obtain an unbiased estimate of the prediction performance of the three techniques. The comparative study concludes that the decision tree (C5) is the best predictor of breast cancer survivability, compared with the artificial neural network and logistic regression. [18]

Behnam H et al. (2005) propose a model to predict the disease and assist radiologists in diagnosing breast cancer. The model integrates multiwavelet-based sub-band image decomposition and an artificial neural network (ANN). The method is tested on the Mammographic Image Analysis Society (MIAS) mammographic database. Among the different multiwavelets, the biorthogonal Geronimo-Hardin-Massopust multiwavelet of length 2 (BiGHM2) performed best, achieving an area under the receiver operating characteristic curve of around 0.96. [21]

S. Nahavandi et al. (2015) propose an automated medical data classification method using an interval type-2 fuzzy logic system (IT2FLS) and wavelets. The model deals with the challenges of uncertainty and high-dimensional data. The implementation is carried out on two medical datasets from the UCI machine learning repository: Cleveland heart disease and Wisconsin breast cancer. The results demonstrate that IT2FLS has an advantage over other machine learning methods. [22]

S. Sulong et al. (2012) propose a method to detect cervical cancer using confounding effects such as age, marital status and treatment among Malaysian women. The cervical cancer patient records are taken from the department databank of the Universiti Kebangsaan Malaysia (UKM) Medical Center. The model considers four stages, with 444 records of patients suffering from cervical cancer, and identifies the treatment for women according to their age and marital status. It found that women at the age of 46 years have a higher chance of cervical cancer, so Malaysian women are advised to take a test before the age of 45 years. It also found that married and Chinese women less than 57 years old are more likely to be diagnosed in the early stage of cervical cancer, with treatment either by operation or by combined radiotherapy and operation rather than any other treatment. [23]

E. Gauven et al. (2006) proposed a model for the prediction of breast cancer survivability using data mining techniques. In this approach, a pre-classification process is first performed by considering three fields, namely vital status recode, cause of death and survival time recode. For classification, the