Proceedings of the 2017 Industrial and Systems Engineering Conference
K. Coperich, E. Cudney, H. Nembhard, eds.
Prediction of Chronic Kidney Disease Using Data Mining Techniques
Maryam Soltanpour Gharibdousti, Kamran Azimi, Saraswathi Hathikal, Dae H Won
Department of Systems Science and Industrial Engineering
State University of New York at Binghamton, Binghamton, NY 13902

Abstract
The procedure of finding hidden and unidentified patterns and trends in big datasets, extracting information from them and building predictive models is defined as data mining. In other words, it is the process of collecting and exploring large data sets and building models from huge data stores to expose previously unknown patterns. Healthcare management is one of the areas that uses machine learning techniques broadly for different objectives. Chronic kidney disease has become increasingly prevalent in recent years, and much research is being done to predict its progression and to classify datasets based on related features. In this paper, we focus on applying different machine learning classification algorithms to a dataset with 400 observations and 24 attributes for the diagnosis of chronic kidney disease. The classification techniques used are decision tree, logistic regression, support vector machine, Naïve Bayes and neural network. We obtained the correlation matrix and examined the correlations among the features. The performance measures of the different methods before and after feature selection are also calculated and compared.
Keywords— Chronic Kidney Disease, Data Mining, Decision Tree, Logistic Regression, Support Vector Machine, Naïve Bayes, Neural Network.
1. Introduction and Background
The procedure of finding hidden and unidentified patterns and trends in big datasets, extracting information from them and building predictive models is defined as data mining. In other words, it is the process of collecting and exploring large data sets and building models from huge data stores to expose previously unknown patterns.
Due to the complexity and vagueness of the data generated by healthcare transactions, it is impossible to analyze them with traditional tools. To make the decision-making process easier and more trustworthy, data mining techniques are used to transform these data into useful information, making it feasible to extract useful results, patterns and trends from these huge amounts of data.
Data mining has been widely used in many areas, and healthcare management is one of the areas adopting it more and more as an essential tool. All agents in the healthcare industry can benefit significantly from data mining applications. Data mining is not new; it has been used intensively and extensively by financial institutions for credit scoring and fraud detection; by marketers for direct marketing and cross-selling or up-selling; by retailers for market segmentation and store layout; and by manufacturers for quality control and maintenance scheduling [1].
Heart disease prediction has been performed using three data mining techniques, namely neural network, decision tree and Naïve Bayes. The results show that a neural network with 15 features surpassed the other two techniques and was accordingly selected as the predictive model [3].
In a recent paper, the authors used a hybrid of a boosted regression machine learning algorithm and logistic regression to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009-2010). Combining a machine learning algorithm with traditional statistical modelling, handling a complex survey design and transforming the missing data are the main contributions of that study. The features considered in the questionnaire include gender, age, race, smoking, food security, poverty income ratio, body mass index, physical activity, alcohol use, medical conditions and medications [4].
Another study predicts four types of kidney disease, namely nephritic syndrome, chronic kidney disease, acute renal failure and chronic glomerulonephritis. The supervised classification algorithms Support Vector Machine (SVM) and Artificial Neural Network (ANN) are used to predict the kidney disease. Experimental results show that ANN is the better classifier: its classification accuracy is higher than that of SVM, although the execution time of SVM is lower [5]. Different machine learning classification algorithms for the diagnosis of chronic kidney disease have also been discussed. The classification techniques used there are decision tree, linear discriminant classifier, quadratic discriminant classifier, linear SVM, quadratic SVM, fine KNN, medium KNN, cosine KNN, cubic KNN, weighted KNN, feed-forward back-propagation neural network using gradient descent, and feed-forward back-propagation neural network [12].
A comparison study illustrates the importance of feature/attribute selection for classifier performance. The classification algorithms Sequential Minimal Optimization, Naïve Bayes and the k-nearest neighbor classifier (IBK) were used to separate CKD patients from non-CKD patients, with WEKA as the data mining tool. The wrapper subset attribute evaluator with best-first search was used for feature selection. Results showed that the performance of the classifiers improved after reducing the number of features, and the IBK classifier performed best on the reduced dataset [13].
Another work predicts the presence of chronic kidney disease based on health parameters such as random blood glucose level, serum creatinine level and blood pressure. Missing values in the dataset were imputed with the average value of the corresponding feature column. Classification algorithms were applied to predict the presence of chronic kidney disease (CKD), and clustering algorithms were used to group the data based on the presence of CKD. The classification algorithms implemented were decision tree, logistic regression, AdaBoost and support vector machine. Principal component analysis was employed to reduce the dimensionality, and clustering algorithms such as k-means and hierarchical clustering were applied. Experimental results showed that SVM (with a linear kernel) performed best, followed by the random forest classifier, AdaBoost, logistic regression and decision tree [14].
2. Data Description and Analysis
2.1. Chronic Kidney Disease Data
The dataset used in this paper has been obtained from the UCI repository [15]. It contains 400 samples from the southern part of India, with ages ranging between 2 and 90 years. There are twenty-four features in total, most of which are clinical in nature and the rest physiological. Table 1 lists the features. As part of data pre-processing, missing values and outliers are imputed with the mean value of the feature for continuous data and with the mode for categorical data. Nominal data are converted to numerical values; for example, the nominal value 'Normal' is labelled "1" and 'Abnormal' is labelled "0".
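The pre-processing described above can be sketched as follows. This is a minimal, illustrative example rather than the authors' actual code: it assumes the UCI file has already been loaded into a pandas DataFrame, and the nominal-to-numeric mappings beyond 'Normal'/'Abnormal' are assumptions taken from the UCI attribute descriptions.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # continuous feature: impute missing values with the column mean
            df[col] = df[col].fillna(df[col].mean())
        else:
            # categorical feature: impute missing values with the mode
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    # convert nominal values to numbers, e.g. 'normal' -> 1, 'abnormal' -> 0
    # (mappings other than normal/abnormal are assumed from the UCI attribute list)
    return df.replace({"normal": 1, "abnormal": 0,
                       "present": 1, "notpresent": 0,
                       "yes": 1, "no": 0,
                       "good": 1, "poor": 0,
                       "ckd": 1, "notckd": 0})
```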
Table 1 Features
 1 Specific Gravity            13 Pus Cell Clumps
 2 Albumin                     14 Age
 3 Sugar                       15 Blood Pressure
 4 Red Blood Cells             16 Blood Glucose Random
 5 Pus Cell                    17 Blood Urea
 6 Bacteria                    18 Serum Creatinine
 7 Hypertension                19 Sodium
 8 Diabetes Mellitus           20 Potassium
 9 Coronary Artery Disease     21 Hemoglobin
10 Appetite                    22 Packed Cell Volume
11 Pedal Edema                 23 White Blood Cell Count
12 Anemia                      24 Red Blood Cell Count
3. Methodology
This section describes the proposed methodology for mining the CKD dataset. As shown in Fig. 1, the first and most important step is pre-processing and cleaning the data; the cleaning process fills missing values based on whether the feature is continuous or categorical. Second, five different machine learning methods, namely support vector machine (SVM), logistic regression (LR), neural network (NN), decision tree (DT) and Naïve Bayes (NB), are applied to both the original and the normalized data set. Next, feature selection based on LR and SVM is performed, and finally performance measurement criteria are used to compare the techniques. Logistic-regression-based and SVM-based (L1) feature selection is used for variable selection; as the control parameter C is decreased, fewer features are selected. The classification techniques are applied to all features and to the selected features, for both the original and the normalized data. The data are split into a training set and a testing set: 70% of the data is used for training the model and the remaining 30% for testing. The classification algorithms, decision tree, logistic regression, support vector machine, Naïve Bayes and artificial neural network, are applied to the original data (with and without feature selection) and to the normalized data (with and without feature selection). Performance measures such as accuracy, sensitivity, specificity and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve are used to evaluate the techniques. Keeping the split percentage constant, the training and testing data are selected randomly five times, and the average performance measures are reported.
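As an illustration of this evaluation protocol, the sketch below performs the 70/30 split, trains the five classifiers and derives accuracy, sensitivity, specificity and AUC from the confusion matrix. It is a minimal sketch under stated assumptions, not the authors' implementation: scikit-learn estimators stand in for the unspecified classifiers, min-max scaling stands in for the unspecified normalization, and X, y denote the pre-processed features and the CKD/not-CKD label.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score

MODELS = {
    "DT": DecisionTreeClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),        # probability=True enables predict_proba for AUC
    "NB": GaussianNB(),
    "ANN": MLPClassifier(max_iter=1000),
}

def evaluate(X, y, normalize=False, seed=0):
    # 70/30 train/test split, as described in the methodology
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    if normalize:
        scaler = MinMaxScaler().fit(X_tr)  # min-max scaling assumed for the "normalized" runs
        X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    results = {}
    for name, model in MODELS.items():
        model.fit(X_tr, y_tr)
        tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
        results[name] = {
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),   # true positive rate
            "specificity": tn / (tn + fp),   # true negative rate
            "auc": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
        }
    return results
```

Calling evaluate with different random seeds and averaging the returned measures corresponds to the five repeated random splits reported in the results.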
4. Results
In this section the experiments are described and the results are discussed. First, five machine learning techniques, namely decision tree, logistic regression, Naïve Bayes, support vector machine and neural network, are applied to the original data, and performance is measured using criteria such as sensitivity, specificity, accuracy and area under the ROC curve. Second, the same techniques are applied to the normalized data using the same performance measurement criteria so that they can be compared. Third, feature selection methods are applied to both the original and the normalized data set, and the performance of the techniques is compared again.
Tables 2-9 show the results of each step. Comparing and analyzing the results shows that, except for the neural network, which is very sensitive to the scale of the data, the performance of all techniques is almost the same for the original and the normalized data sets, meaning that any of them could be used for decision making on this data set. Another insight from the feature selection step is that the performance with roughly 8 to 10 features is almost the same as with all 24 features. This is an important result and it validates the earlier observation: the features are highly correlated, so it is feasible to eliminate features that are highly correlated with, and dependent on, other features and to run the classification with fewer features. In the real world, it is difficult to gather data for many features, especially when there are missing values, as in our data set, so it is useful to take the correlation between features into account and eliminate the dependent ones. The correlation matrix computed earlier showed highly correlated relationships between some features, and the feature selection results validate and confirm those findings.
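The L1-based feature selection behind the following tables can be sketched as below. This is an illustrative assumption rather than the paper's exact implementation: it uses scikit-learn's SelectFromModel with an L1-penalized logistic regression or linear SVM, where decreasing C strengthens the penalty, leaving fewer non-zero coefficients and therefore fewer selected features.

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def l1_selected_features(X, y, C, base="lr"):
    # L1-penalized estimator: smaller C -> stronger penalty -> fewer features kept
    if base == "lr":
        clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    else:
        clf = LinearSVC(penalty="l1", C=C, dual=False)
    selector = SelectFromModel(clf).fit(X, y)
    return selector.get_support(indices=True)   # indices of the retained features

# Example sweep over C, mirroring the rows of the feature selection tables:
# for C in (1, 0.1, 0.01, 0.001, 0.0001):
#     print(C, len(l1_selected_features(X, y, C)))
```

Inspecting the pairwise feature correlations (for example with pandas' DataFrame.corr) and dropping highly correlated columns would be a complementary way to reach a similar reduced feature set.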
Table 2 Original Data with All Features
Method Accuracy Sensitivity Specificity AUC
NB 0.967 0.947 1.000 0.974
DT 0.992 0.987 1.000 0.993
LR 0.983 0.987 0.977 0.982
SVM 0.983 0.987 0.977 0.982
ANN 0.633 1.000 0.000 0.500
Table 3 Normalized Data with All Features
Method Accuracy Sensitivity Specificity AUC
NB 0.967 0.947 1.000 0.974
DT 0.992 0.987 1.000 0.993
LR 1.000 1.000 1.000 1.000
SVM 1.000 1.000 1.000 1.000
ANN 1.000 1.000 1.000 1.000
Table 4 Confusion Matrix for Original Data with All Features
Method   TN   FP   FN   TP
NB       44    0    4   72
DT       44    0    2   74
LR       43    1    1   75
SVM      43    1    1   75
ANN       0   44    0   76
Table 5 Confusion Matrix for Normalized Data with All Features
Method   TN   FP   FN   TP
NB       44    0    4   72
DT       44    0    1   75
LR       44    0    0   76
SVM      44    0    0   76
ANN      44    0    0   76
Table 6 Logistic Regression Based Feature Selection with Original Data
C        Number of Features   AUC-NB   AUC-LR   AUC-DT   AUC-SVM   AUC-ANN
-        24                   0.97     0.98     0.99     0.98      0.50
1        16                   0.97     0.98     0.99     0.98      0.50
0.1      10                   0.97     0.98     0.99     0.98      0.50
0.01     5                    0.95     0.89     0.96     0.90      0.50
0.001    4                    0.83     0.81     0.86     0.80      0.50
0.0001   1                    0.57     0.50     0.71     0.49      0.50
Table 7 Logistic Regression Based Feature Selection with Normalized Data
C        Number of Features   AUC-NB   AUC-LR   AUC-DT   AUC-SVM   AUC-ANN
-        24                   0.97     1.00     0.99     1.00      1.00
1        16                   0.97     0.98     0.99     0.98      0.97
0.1      10                   0.92     0.97     0.99     0.98      0.94
0.01     5                    0.95     0.91     0.92     0.93      0.90
0.001    4                    0.83     0.84     0.86     0.84      0.80
0.0001   1                    0.57     0.50     0.71     0.49      0.50
Table 8 SVM Based Feature Selection with Original Data
C        Number of Features   AUC-NB   AUC-LR   AUC-DT   AUC-SVM   AUC-ANN
-        24                   0.97     0.98     0.99     0.98      0.50
1        19                   0.97     0.98     1.00     0.98      0.50
0.1      14                   0.97     0.98     0.99     0.98      0.50
0.01     10                   0.97     0.98     0.99     0.98      0.50
0.001    5                    0.95     0.89     0.96     0.90      0.50
0.0001   2                    0.84     0.74     0.75     0.81      0.50
Table 9 SVM Based Feature Selection with Normalized Data
C        Number of Features   AUC-NB   AUC-LR   AUC-DT   AUC-SVM   AUC-ANN
-        24                   0.97     1.00     0.99     1.00      1.00
1        19                   0.97     0.98     1.00     0.99      0.99
0.1      14                   0.96     0.98     0.99     0.98      0.90
0.01     10                   0.92     0.97     0.99     0.98      0.94
0.001    5                    0.95     0.91     0.96     0.93      0.90
0.0001   2                    0.84     0.79     0.77     0.80      0.79
Figure 1 Performance of LR Classifier for LR Feature Selection Method for Original Data
Figure 2 Performance of NB Classifier for LR Feature Selection Method for Original Data
5. Conclusion
In this paper, a data set containing 400 samples and 24 features was first selected from the UCI database, and pre-processing was done to remove noisy and unreliable data. To do so, missing values were filled with the mean for continuous features and with the mode for categorical features. The dataset was then normalized so that all data share a unit scale. The correlation matrix of the features was obtained, and it was observed that the features are highly correlated with each other. Classification was done in three stages. In the first stage, two L1-based feature selection methods were applied for different values of the control parameter, and accordingly different numbers of features were selected. Next, the performance of five techniques, namely DT, NN, LR, SVM and NB, in classifying both the original and the normalized data was compared based on their AUC. In the third stage, classification was done using all features of the original and normalized data sets, and the performance of the classifiers was compared by their sensitivity, specificity, accuracy and AUC. The aim was to analyze the results and assess the effect of the features on the classification outcomes. The results show that, first, except for NN, which is sensitive to the scale of the data, the performance of the classifiers is almost the same for the original and the normalized data sets. Second, the same results are obtainable using 8 or 9 features instead of 24, which validates the correlation matrix results showing high correlation between the features.
6. References
[1] Milley, A. (2000). Healthcare and data mining. Health Management Technology, 21(8), 44-47.
[2] Bhatla, N., & Jyoti, K. (2012). An analysis of heart disease prediction using different data mining techniques. International Journal of Engineering, 1(8), 1-4.
[3] Dipnall, J. F., Pasco, J. A., Berk, M., Williams, L. J., Dodd, S., Jacka, F. N., & Meyer, D. (2016). Fusing
Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression.
PloS one, 11(2), e0148195.
[4] SA, S. (2013). Intelligent heart disease prediction system using data mining techniques. International Journal
of Healthcare & Biomedical Research, 1, 94-101.
[5] Boukenze, B., Mousannif, H., & Haqiq, A. Predictive analytics in healthcare system using data mining techniques. Computer Science & Information Technology, 1.
[6] Vijayarani, S., Dhayanand, M. S., & Phil, M. (2015). Kidney disease prediction using SVM and ANN algorithms. International Journal of Computing and Business Research (IJCBR), ISSN (Online): 2229-6166.
[7] Sharma, S., Sharma, V., & Sharma, A. (2016). Performance Based Evaluation of Various Machine Learning
Classification Techniques for Chronic Kidney Disease Diagnosis. arXiv preprint arXiv:1606.09581.
[8] UCI Machine Learning Repository: Chronic_Kidney_Disease Data Set. (n.d.).
https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease
[9] Yoo, I., Alafaireet, P., Marinov, M., Pena-Hernandez, K., Gopidi, R., Chang, J. F., & Hua, L. (2012). Data
mining in healthcare and biomedicine: a survey of the literature. Journal of medical systems, 36(4), 2431-2448.
[10] Mitchell, T. M. (1997). Machine Learning.
[11] Delen, D., Walker, G., & Kadam, A. (2005). Predicting breast cancer survivability: a comparison of three
data mining methods. Artificial intelligence in medicine, 34(2), 113-127.
[12] Tomar, D., & Agarwal, S. (2013). A survey on Data Mining approaches for Healthcare. International Journal
of Bio-Science and Bio-Technology, 5(5), 241-266.
[13] Dey, A., Singh, J., & Singh, N. (2016). Analysis of Supervised Machine Learning Algorithms for Heart Disease Prediction with Reduced Number of Attributes using Principal Component Analysis. Analysis, 140(2).