CHAPTER I
INTRODUCTION
1.1 DATA AND INFORMATION MINING
Data and information have become major assets for most businesses. Knowledge
discovery in medical databases is a well-defined process and data mining an essential step.
Databases are collections of data with a specific, well-defined structure and purpose, and the programs used to develop and manipulate these data are called database management systems (DBMS). Knowledge discovery in databases (KDD) is the overall process involved in unearthing knowledge from data. Data mining is concerned with computationally extracting hidden knowledge structures, represented as models and patterns, from large data repositories. In [1], the author defines KDD as a non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. According to this definition, data are any facts, numbers or text that can be processed by a computer. The term pattern denotes models and regularities that can be observed within the data. The patterns, associations or relationships among these data can provide information, which in turn can be converted into knowledge about historical patterns and future trends. Other steps, such as data preprocessing, data selection, data cleaning and data visualization, are also part of the KDD process.
Data mining is an interdisciplinary field of study in databases, machine learning and
visualization. It helps to identify the patterns of successful medical therapies for different
illnesses and also it aims to find useful information from large collections of data [2].
Data mining is the core of KDD and is used to extract interesting patterns from data that are easy to perceive, interpret and manipulate. It is the science of finding patterns in huge reserves of data in order to generate useful information. The KDD process comprises a few steps leading from raw data collections to new knowledge.
As shown in Figure 1.1, the knowledge discovery process consists of an iterative sequence of data cleaning, data integration, data selection, data mining, pattern evaluation and knowledge presentation. Data mining is the search for relationships and global patterns that lie hidden in large databases. A target data set
must be assembled before data mining algorithms can be used. A common source for data is a
data mart or data warehouse and pre-processing is essential to analyze these multivariate data
sets. The final step of knowledge discovery from data is to verify that the patterns produced by
the data mining algorithms occur in the wider data set.
[Figure 1.1 shows the stages from the data source through data integration, data cleaning, data selection, data transformation and data mining to pattern evaluation and knowledge presentation, ending in decisions / use of the discovered knowledge.]
Figure 1.1 Knowledge discovery process
The discovered knowledge may contain rules that describe the properties of the data, patterns that occur frequently, objects that are found to be in clusters in the database, and so on. The motivation for handling data and performing computation is the discovery of knowledge. The KDD process employs data mining methods to identify patterns at some measure of interestingness; it is the process of turning low-level data into high-level knowledge.
1.2 DATA MINING TASKS
Data mining tasks are used to specify the kind of patterns to be found in the data mining process. Basically, the algorithms try to fit a model closest to the characteristics of the data under consideration, and models can be either predictive or descriptive.
Predictive models are used to make predictions, for example, of the diagnosis of a
particular disease. They analyze past performance to assess the likelihood of a customer exhibiting a specific behavior, in order to improve marketing effectiveness.
Descriptive models are used to identify the patterns in data. For example, a physician
might be interested in discovering the influence of climate among typhoid patients by grouping
patients in different climate zones. Unlike the predictive models that focus on predicting a single
customer behavior, descriptive models identify many different relationships between customers
or products. As shown in Figure 1.2, classification, regression and time series analysis are some
of the tasks of predictive modeling. Clustering, association rules, visualization are some of the
tasks of descriptive modeling.
Figure 1.2 Data mining tasks
Classification: In machine learning, classification is the task of identifying to which of a set of categories a new observation belongs. This is done on the basis of a training set of data containing observations or instances whose category membership is known. Classification is the process of finding a model or function that describes and distinguishes data classes so that the model can be used to predict the class of objects whose class label is unknown [3]. It is a learning function that maps or classifies a data item into one of several predefined groups or classes, and it comes under supervised learning. The classification model makes use of a training data set in order to build a predictive model, and a test data set is used to assess classification efficiency. Binary classification and multiclass classification can be considered its two components: in binary classification only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes.
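To make the binary/multiclass distinction concrete, the following pure-Python sketch implements a minimal nearest-centroid classifier on hypothetical toy data (the class names and feature values are illustrative, not drawn from any real medical dataset). The same code handles binary and multiclass problems, since prediction simply picks the nearest class centroid.

```python
# A minimal nearest-centroid classifier on hypothetical data.

def train(samples, labels):
    """Compute one centroid (mean vector) per class from the training set."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Binary example: two hypothetical classes, "healthy" and "at_risk".
X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
y = ["healthy", "healthy", "at_risk", "at_risk"]
model = train(X, y)
print(predict(model, (1.1, 0.9)))   # → healthy
print(predict(model, (5.1, 5.1)))   # → at_risk
```

Adding a third label to `y` would turn this into a multiclass problem with no change to the code.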
Regression: Prediction is achieved with the help of regression, the process of analyzing the current and past states of an attribute and predicting its future state. Regression is a data mining technique used to predict a value. It takes a numeric dataset and develops a
mathematical formula to fit the data. A regression task begins with a dataset of known target
values and regression analysis can be used to model the relationship between one or more
independent or predictor variables and a dependent or response variable. The types of regression
methods are linear regression, multivariate linear regression, nonlinear regression and
multivariate nonlinear regression.
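As a minimal illustration of developing a mathematical formula to fit numeric data, the sketch below performs simple (one-variable) linear regression by ordinary least squares on hypothetical points; real regression work would typically use a statistical package.

```python
# Simple linear regression y = a + b*x by ordinary least squares.

def fit_line(xs, ys):
    """Return intercept a and slope b minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]          # exactly y = 1 + 2x
a, b = fit_line(xs, ys)
print(a, b)                    # → 1.0 2.0
predicted = a + b * 6          # predict the response for an unseen x
print(predicted)               # → 13.0
```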
Time series: Time series is a sequence of data points, measured typically at successive points in
time spaced at uniform time intervals. Time series analysis comprises methods for analyzing time
series data in order to extract meaningful statistics and other characteristics of such data.
Methods for time series analysis may be divided into two classes, namely, frequency-domain
methods, which include spectral analysis, and time-domain methods, which include autocorrelation and cross-correlation analysis. Several kinds of time series analysis are available, each appropriate for a different purpose. In the context of data
mining, pattern recognition and machine learning, time series analysis can be used for clustering,
classification, query by content, anomaly detection as well as forecasting.
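One of the time-domain methods mentioned above, autocorrelation, can be sketched in a few lines of pure Python; the series used here is hypothetical.

```python
# Lag-k autocorrelation: correlation of a series with a shifted copy of itself.

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]                       # deviations from the mean
    num = sum(dev[i] * dev[i + lag] for i in range(n - lag))
    den = sum(d * d for d in dev)
    return num / den

series = [1, 2, 3, 4, 5]
print(autocorrelation(series, 1))   # → 0.4
```

A strongly trending series like this one shows positive autocorrelation at short lags.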
Clustering: A cluster is a collection of objects that are similar to one another and dissimilar to objects belonging to other clusters. Clustering has no predefined classes; it identifies groups of items that share specific characteristics and comes under unsupervised learning. It analyzes data objects without consulting a known class label. The objects are clustered or grouped on the principle of maximizing intra-class similarity and minimizing inter-class similarity. Clustering is the main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval and bioinformatics. Clustering can be roughly divided into hard clustering, in which each object either belongs to a cluster or does not, and soft clustering, in which each object belongs to each cluster to a certain degree. One of the most widely used clustering algorithms is the K-means algorithm.
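The K-means algorithm mentioned above alternates between assigning points to the nearest centroid and recomputing each centroid as its cluster's mean. The bare-bones sketch below, on hypothetical one-dimensional data, illustrates the two steps.

```python
# Bare-bones K-means on one-dimensional points.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:                      # assignment step
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]   # update step
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]       # two obvious groups
centroids, clusters = kmeans(points, centroids=[0.0, 5.0])
print(sorted(centroids))                       # ≈ [1.0, 9.0]
```

K-means needs the number of clusters chosen in advance and can converge to different solutions from different starting centroids, which is why it is usually run several times in practice.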
Association rule learning: Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. According to Jagjeevan Rao et al. (2012), there are several association rule algorithms that are mainly useful in summarizing and identifying patterns. They also use correlation along with support and confidence in order to find the right patterns. Rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Rule mining is split into two separate steps: first, minimum support is applied to find all frequent item sets in a database; then these frequent item sets and the minimum confidence constraint are used to form rules. Association and correlation are usually meant for locating frequent item sets among large data sets. Association differs from classification in that it can predict any attribute, not just the class, and it can predict more than one attribute's value at a time. The types of association rules are multilevel association rules, multidimensional association rules and quantitative association rules.
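The two interestingness measures named above, support and confidence, can be computed directly from a transaction list; the market-basket transactions below are hypothetical.

```python
# Support and confidence over a hypothetical list of transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(A union C) / support(A): how often the rule A -> C holds."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))         # → 0.5 (2 of the 4 transactions)
print(confidence({"bread"}, {"milk"}))    # 2/3 of bread baskets also contain milk
```

An algorithm such as Apriori would enumerate all item sets above a minimum support before forming rules; this sketch only shows how the two measures themselves are defined.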
A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier. If the output is a discrete or categorical attribute, the task is called classification; if the output is a numerical or continuous attribute, it is termed regression. Unsupervised learning refers to the problem of trying to find hidden structure in unlabeled data.
1.3 DATA MINING IN HEALTH CARE
Data mining is an integration of multiple disciplines such as statistics, machine learning,
neural networks and pattern recognition. It is concerned with the process of computationally
extracting hidden knowledge structures represented in models and patterns from large data
repositories. Healthcare is a data intensive process. Many processes run simultaneously
producing new data every second. It is a research intensive field and the largest consumer of
public funds. With the emergence of computers and new algorithms, health care has seen an increase in computer tools and can no longer ignore them. This has resulted in the unification of healthcare and computing to form health informatics, an emerging field whose systems typically work through an analysis of medical data and a knowledge base of clinical expertise. In [5], the authors describe the need for, and algorithms of, data mining in healthcare: in medical areas today, collecting data about different diseases is very important. Medical and health areas are among the most important sectors in industrial societies. The extraction of knowledge from a massive volume of disease-related data and medical records using the data mining process can help identify the laws governing the onset and development of epidemic diseases.
Some medical applications of data mining are:
Prediction of health issues
Determination of disease treatment
Diagnosis and prediction of diseases of most kinds
Health informatics is defined as an evolving scientific discipline that deals with the collection, storage, retrieval, communication and optimal use of health-related data, information and knowledge. It is a field of study applied to clinical care, nursing, public health and biomedical research, all dedicated to the improvement of patient care and population health. It is one of the
fastest growing areas within the health sector and covers a wide range of applications and
research. It deals with biomedical information, data and knowledge. With the help of smart
algorithms and machine intelligence, quality healthcare can be provided through problem solving
and decision-making systems. In this domain, Decision Support Systems are defined as knowledge-based systems that support information sciences and assist decision-making activities. Physicians can input patient data through electronic health forms and run a decision support system on that input to get an opinion on the patient's health and the
care required. The success of healthcare data mining hinges on the availability of clean
healthcare data. Possible directions include the standardization of clinical vocabulary and the
sharing of data across organizations to enhance the benefits of healthcare data mining
applications.
Data mining for healthcare is useful in evaluating the effectiveness of medical treatments.
Through comparing and contrasting various causes, symptoms, and treatment methodologies,
data mining can produce an analysis of treatments that can correct specific symptoms most
effectively. It is widely used in healthcare fields due to its descriptive and predictive power. It
can predict health insurance fraud, healthcare cost, disease prognosis, disease diagnosis, and
length of stay needed in a hospital. It also obtains frequent patterns from biomedical and
healthcare databases such as relationships between health conditions and a disease, relationships
among diseases and relationships among drugs etc. Data mining today has successful
applications in various fields including health care. This industry generates large amounts of
complex data about patient records, hospital resources, disease diagnosis, medical devices etc.
These data are a key resource to be processed and analyzed for knowledge extraction and data mining in various areas of healthcare. Some of the challenges of data mining in the medical domain lie in the following areas:
Identification of the patterns of successful medical therapies for different ailments.
Too many disease markers (attributes) now available for decision making.
Voluminous data now being collected with the help of computerization (text, graphs, images).
Handling medical data that are noisy (containing errors or outliers), inconsistent (containing discrepancies in codes or names) or incomplete (lacking attribute values or containing only aggregates), all of which must be preprocessed.
Data mining not only focuses on collecting and managing data, it also includes analysis and
prediction. The wide range of applications from business tasks to scientific tasks has led to a
huge variety of learning methods and algorithms for rule extraction and prediction. For medical
diagnosis, there are many expert systems based on logical rules for decision making and
prediction. Even though there are many data mining techniques for the prediction of heart disease, there is a weakness in the availability of data for predicting heart disease from a diabetes dataset. Prediction of risk using data mining can help in understanding the possible risk of developing the disease. It has prophylactic value: with the advancement of tools like RapidMiner, which can be used with ease on large datasets with a large number of attributes, the difficulty today lies in determining the appropriate machine learning technique that can ensure accuracy. The healthcare industry is one where the available data are voluminous and sensitive, and they require careful handling without any mismanagement. Various data mining classification techniques have been used in the healthcare industry, and the best among them can be chosen.
1.4 DATA MINING CLASSIFICATION TECHNIQUES
In the early days of data warehousing, data mining was viewed as a subset of the
activities associated with the warehouse. Today, a warehouse may be a good source for the data
to be mined, and data mining is recognized as an independent activity. One of the greatest strengths of data mining lies in its wide range of methodologies and techniques that can be applied to various problem sets. Data mining is a natural activity to be performed on large
datasets. Data classification process involves learning and classification. In learning, the training
data are analyzed by classification algorithms and in classification, test data are used to estimate
the accuracy of the classification rules.
According to Vikas Chaurasia et al. (2013), many researchers have used data mining techniques in the diagnosis of diseases such as tuberculosis, diabetes, cancer and heart disease. Several data mining techniques are used in the diagnosis of heart disease, such as neural networks, Bayesian classification, classification based on clustering, genetic algorithms, naive Bayes and decision trees, which show accuracy at different levels. Each data mining technique serves a
different purpose depending on the modeling objective.
According to Lashari et al. (2013) classification in data mining is used to predict group
membership for data instances. Data mining involves the use of sophisticated data analysis tools
to discover the relationship in large datasets. Decision tree based classification methods are
widely used in data mining for the decision support application. Thus, there is a great potential
for the use of data mining techniques for medical data classification.
Figure 1.3 Data mining classification methods
As shown in Figure 1.3, classification is the discovery of a predictive learning function that classifies a data item into one of several predefined classes. These techniques have been widely applied with great success in the field of medical databases. The types of classification
models are Bayesian classification, Support vector machine, neural networks, classification by
decision tree induction and classification based on associations. The classification and
association rules play a major part in data mining. Classification is the process of dividing a
dataset into mutually exclusive groups and enables us to categorize records in a large database
into a predefined set of classes. Association rules provide a process for finding relationships among data items in a given dataset. They enable us to establish associations and relationships between large unclassified data items based on certain attributes and characteristics, defining rules of associability between data items and then using those rules to establish relationships.
According to Varun Kumar et al. (2011) association analysis is the discovery of association rules
showing attribute-value conditions that occur frequently together in a given set of data. It is also
widely used for market basket or transaction data analysis.
The classification process is divided into two phases: training, when a classification model is built from the training set, and testing, when the model is evaluated on the test set. One
of the major goals of a classification algorithm is to maximize the predictive accuracy obtained
by the classification model.
Bayesian classifiers have a structural model and a set of conditional probabilities. They assume that the contributions of all variables are independent. A Bayesian classifier first estimates the prior probability of each class and of the occurrence of each variable value, and applies these to an unknown case. A Bayes network classifier is based on a Bayesian network, which represents a joint probability distribution over a set of categorical attributes.
Naive Bayes classifier refers to a family of simple probabilistic classifiers. The method is based on probabilistic knowledge and on supervised learning. It reads a set of examples from the training set and uses Bayes' theorem to estimate the probabilities of all classifications. For each instance, the classification with the highest probability is chosen as the predicted class.
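The estimation step just described can be sketched in pure Python for categorical data. The symptom table below is hypothetical, and the Laplace smoothing is a common implementation choice rather than something specified in the text.

```python
# Minimal categorical naive Bayes: estimate P(class) and
# P(feature=value | class) from counts, then pick the class with the
# highest posterior under the conditional-independence assumption.

from collections import Counter, defaultdict

def train_nb(rows, labels):
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)      # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict_nb(model, row):
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for label, count in class_counts.items():
        p = count / total                    # prior P(class)
        for i, value in enumerate(row):      # likelihoods, Laplace-smoothed
            p *= (value_counts[(i, label)][value] + 1) / (count + 2)
        if p > best_p:
            best, best_p = label, p
    return best

# Hypothetical rows: (blood sugar level, symptom present?) with a class label.
rows = [("high", "yes"), ("high", "no"), ("low", "no"), ("low", "no")]
labels = ["sick", "sick", "well", "well"]
model = train_nb(rows, labels)
print(predict_nb(model, ("high", "yes")))   # → sick
```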
The Support Vector Machine (SVM) is a concept in statistics and computer science for a set of related supervised learning methods that analyze data and recognize patterns. Introduced by Corinna Cortes and Vladimir Vapnik (1995), it is used for classification and regression analysis. It constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what are called kernel functions, which implicitly map their inputs into high-dimensional feature spaces. It is a learning machine that plots the training vectors in a high-dimensional space and labels each vector by its class. (http://en.wikipedia.org/wiki/Support_vector_machine)
According to Barakat et al. (2007), SVM is based on the principle of risk minimization, which aims to minimize the error rate. SVM uses a supervised learning approach to classify data. It uses kernel functions to map the data set to a high-dimensional space in which classification is performed, and the major advantage of SVM is its classification accuracy.
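The kernel functions mentioned above can be illustrated with the widely used RBF (Gaussian) kernel, which compares two vectors as if they had been mapped into a high-dimensional feature space without ever forming that mapping explicitly; the vectors and the gamma value below are hypothetical.

```python
# The RBF (Gaussian) kernel used by many SVM implementations.

import math

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2): similarity in feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel((1.0, 2.0), (1.0, 2.0)))   # → 1.0 (identical points)
print(rbf_kernel((0.0, 0.0), (3.0, 4.0)))   # small value for distant points
```

An SVM never needs the coordinates of the mapped points, only these pairwise kernel values, which is what makes the implicit high-dimensional mapping computationally feasible.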
Decision trees belong to the classification methods. They construct a hierarchical, tree-like structure whose goal is to create a model that predicts the value of a target variable based on several input variables. Each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The decision tree is a popular classifier and prediction method for handling high-dimensional data. Construction of the decision tree is the training step of classification, and the measure used to select the splitting attribute at each node is called an attribute selection measure (ASM).
Decision trees used in data mining are of two main types: classification trees, used when the predicted outcome is the class to which the data belong, and regression trees, used when the predicted outcome can be considered a real number. C4.5 is an
algorithm used to generate a decision tree developed by Ross Quinlan (1996) and it is an
extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used
for classification, and for this reason, C4.5 is often referred to as a statistical classifier.
(http://en.wikipedia.org/wiki/Decision_tree_learning)
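The attribute selection step can be illustrated with information gain, the entropy-based criterion behind ID3 and C4.5; the weather-style rows below are hypothetical.

```python
# Information gain: how much splitting on an attribute reduces entropy.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting on attribute index `attr`."""
    gain = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [l for row, l in zip(rows, labels) if row[attr] == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

rows = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))   # → 1.0 (attribute 0 separates perfectly)
print(information_gain(rows, labels, 1))   # → 0.0 (attribute 1 is uninformative)
```

A tree-building algorithm would place the highest-gain attribute at the root and recurse on each branch; C4.5 refines plain gain into gain ratio to avoid favoring many-valued attributes.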
Tina Patil et al. (2013) have described classification as an important data mining technique with broad applications for classifying various kinds of data. It is used to classify an item according to its features with respect to a predefined set of classes. The performance of a classification algorithm is usually examined by evaluating the accuracy of the classification. For classification, Bayesian networks are used to construct classifiers from a given set of training examples with class labels.
Shruti Ratnakar et al. (2013) have illustrated that, in the field of artificial intelligence, a genetic algorithm is a search heuristic that imitates the process of natural evolution. This heuristic is routinely used to generate useful solutions and belongs to the larger class of evolutionary algorithms, which generate optimized solutions using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover. The evolution usually starts
from a population of randomly generated individuals, and is an iterative process, with the
population in each iteration called a generation. In each generation, the fitness of every
individual in the population is evaluated; the fitness is usually the value of the objective function
in the optimization problem being solved.
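The generational loop described above (selection, crossover, mutation, fitness evaluation) can be sketched with a toy objective, here maximizing the number of 1 bits in a fixed-length string; every parameter below is illustrative.

```python
# Toy genetic algorithm: a population of bit strings evolves by
# selection, crossover and mutation toward maximum fitness.

import random

random.seed(0)                       # reproducible run
LENGTH, POP, GENERATIONS = 12, 20, 60

def fitness(bits):
    return sum(bits)                 # objective: count of 1 bits

def crossover(a, b):
    cut = random.randrange(1, LENGTH)
    return a[:cut] + b[cut:]         # single-point crossover

def mutate(bits, rate=0.05):
    return [1 - b if random.random() < rate else b for b in bits]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP // 2]              # truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

best = max(population, key=fitness)
print(fitness(best))                 # close to the maximum of 12
```

Because the fittest half survives unchanged each generation, the best fitness never decreases; mutation keeps the search from stalling on a local optimum.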
According to Chitra et al. (2013), the application of Artificial Neural Networks (ANN) can be time-consuming due to the selection of input features for the multi-layer perceptron. The number of layers and the number of neurons in each layer are also determined by the input attributes. The ANN is inspired by attempts to simulate biological neural systems. Each node or neuron is interconnected with other nodes via weighted links. During the learning phase, the network learns by adjusting weights to enable prediction of the correct class labels of the input tuples. The nodes are classified into three
categories: input, hidden and output layers. Neural networks are ideal for identifying patterns or trends in data and are well suited for prediction or forecasting needs; the most widely used is the multi-layer perceptron with the back-propagation algorithm. Some disadvantages of neural networks are that they require many parameters, which are empirically determined, and classification performance is sensitive to the parameters selected. Training is a very slow process, and clinicians find it difficult to understand how classification decisions are taken and cannot interpret the results easily.
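The weight-adjustment idea described above can be illustrated, in miniature, with a single perceptron learning the logical AND function; this toy stands in for back-propagation in a full multi-layer network.

```python
# A single perceptron: adjust weights whenever the predicted label is wrong.

def predict(weights, bias, inputs):
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > 0 else 0

def train(samples, epochs=20, lr=0.1):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error               # nudge weights toward the target
    return weights, bias

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(and_gate)
print([predict(weights, bias, x) for x, _ in and_gate])   # → [0, 0, 0, 1]
```

A single neuron can only learn linearly separable functions such as AND; problems like XOR are what make the hidden layers, and hence back-propagation, necessary.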
1.5 BIOLOGICAL LINK BETWEEN DIABETES AND HEART DISEASE
There is a strong link between diabetes and heart disease. Diabetes by itself is now regarded as the strongest risk factor for heart disease. Diabetes is about blood glucose control, and heart disease is about blood pressure and cholesterol control; both diseases have insulin resistance in common, which increases the chances of developing type 2 diabetes and heart disease. Both type 1 diabetes and type 2 diabetes are independent risk factors for CHD. In fact, from the point of view of cardiovascular medicine, it may be appropriate to say that diabetes is a cardiovascular disease.
Mai Shouman et al. (2011) say heart disease is the leading cause of death in the world
over the past 10 years. The European Public Health Alliance reports that heart attacks, strokes
and other circulatory diseases account for 41% of all deaths. The Australian Bureau of Statistics reports that heart and circulatory system diseases are the leading cause of death in Australia, accounting for 33.7% of deaths. Motivated by the world-wide increasing mortality of heart disease patients
each year and the availability of huge amount of patient data from which to extract useful
knowledge, researchers have been using data mining techniques to help health care professionals
in the diagnosis of heart disease.
Muhamad Hariz et al. (2012) point out that diabetes is a metabolic disorder in which the body cannot make proper use of carbohydrates and which is greatly affected by the patient's lifestyle. CHD is a serious disease that causes many deaths, especially in China.
Researchers have found that high blood sugar (hyperglycemia) activates a biological pathway that causes irregular heartbeats, a condition called cardiac arrhythmia, which triggers heart failure and sudden cardiac death. (http://www.medicalnewstoday.com/articles/266891.php)
People who suffer from diabetes are two to four times more likely to develop cardiovascular disease compared to non-diabetics. (http://www.world-heart-federation.org/cardiovascular-health/cardiovascular-disease-risk-factors/diabetes)
The American Heart Association says that around 65% of diabetics die from heart disease or stroke, emphasizing the need for new research into the links between the conditions. There is also evidence that obesity, a sedentary lifestyle and poor blood glucose control contribute to increased chances of high blood pressure. Women prior to menopause usually have less risk of heart disease than men of the same age. However, women of all ages with diabetes have an increased risk of heart disease because diabetes cancels out the protective effects of being a woman in her child-bearing years.
In fact, cardiovascular disease leading to heart attack or stroke is by far the leading cause of death in both male and female diabetics. Another major component of cardiovascular
disease is poor circulation in the legs, which contributes to a greatly increased risk of foot ulcers
and amputations.
Control of the ABCs of diabetes can reduce risk for heart disease and stroke, where A
stands for A1C, a test that measures blood glucose control and it shows the average blood
glucose level over the past 3 months. B stands for blood pressure and C stands for cholesterol.
The best way to combat cardiovascular disease lies in preventing or delaying its development.
Weight control and smoking cessation are two important lifestyle measures that have an impact
on preventing heart disease. In addition, good control of blood glucose levels and low-dose
aspirin can enhance these benefits. (http://diabetes.niddk.nih.gov/dm/pubs/stroke/)
1.6 PROBLEM DEFINITION
Discovery of new information, in terms of patterns or rules, from large amounts of data is based on machine learning techniques. Disease prediction plays an important role in data mining. Diagnosis of a disease requires the performance of a number of tests on the patient. However, the use of data mining techniques can reduce the number of tests, and this reduced test set plays an important role in time and performance. Diabetes data mining is important because it allows doctors to see which features or attributes are more important for diagnosis, such as age, weight, etc. This helps doctors diagnose diabetes more efficiently. There are various data mining techniques in use in the healthcare industry, but research has to be done on the performance of the various classification techniques so that the best among them can be chosen.
The research presented in this thesis is intended to address the challenge of improving the prediction model to predict heart disease and diabetic disease, and of providing a timely response in predicting the disease. Briefly, the important research functions are:
Various datasets are used in the proposed classifier and prediction technique.
Classification techniques help in developing the prediction model so as to predict accurately the risk of heart disease among diabetic patients.
1.7 OBJECTIVES OF THE RESEARCH
Application of data mining to the analysis of medical data is a good method for investigating the relationships that exist between variables. Nowadays, data stored in medical databases are growing at an increasingly rapid rate. It has been widely recognized that medical data analysis can lead to an enhancement of health care.
The primary objective of the research work is the effective development of a prediction model, using various classification techniques, to predict diabetes and heart disease and to evaluate prediction performance. The work also shows that data mining can be applied to medical databases to predict or classify data with reasonable accuracy.
The following are the objectives leading to the achievement of the primary objective mentioned above:
To identify the best classification technique that can help in predicting the risk of heart disease and diabetes from various attributes.
To recognize and classify patterns in multivariate patient attributes.
To predict the class score based on that, the mild and extreme of the disease can be
identified.
To improve the classification and prediction accuracy by utilizing improved
classification techniques.
To propose a weighted Genetic Algorithm and Principal Component Analysis (PCA) based feature selection approach, and to compare the performance of existing and proposed feature selection algorithms on clinical datasets.
To design a fusion based Classifier for diabetes and heart disease diagnosis and to predict
the severity of heart disease in patients.
To propose a scoring system to find the severity of heart disease for patients who are
suffering with diabetes.
1.8 ORGANIZATION OF THE THESIS
Chapter two describes the literature review on data mining, its major predictive
techniques, applications, survey of the comparative analysis by other researchers and the criteria
to be used for model comparison in this work.
Chapter three presents the proposed system and the methodologies involved in it.
Chapter four gives the analysis of the experiments done by combining three data mining
techniques. The various heart, diabetes disease risk prediction models are created by categorizing
the dataset based on certain attribute value pairs. It also describes the summary of the results,
compares the results of the techniques on the data sets and the performances are compared
through accuracy, sensitivity, specificity and F-score. Chapter five gives the conclusions and
future enhancement.
CHAPTER 2
BACKGROUND STUDY
2.1 INTRODUCTION TO RISK PREDICTION:
Data mining plays a key role in the intelligent health domain [1]. Several software
packages and tools have been used to diagnose and classify health information based on its
attributes. Huge databases are taken as input to this process, which complicates data collection.
The following is basic information about diabetes, its basic causes and its symptoms.
A diabetes risk prediction model can support medical professionals and practitioners in
predicting risk status based on clinical data records. In the biomedical field, data mining
techniques play an essential role in predicting and analyzing different types of health issues. The
healthcare industry generates huge amounts of healthcare data that need to be mined to ascertain
hidden information for valuable decision making. Determining hidden patterns and relationships
is often tough and unreliable. A health record is classified and predicted as showing symptoms of
diabetes risk using the risk factors of the disease [2]. It is indispensable to find the best-fit
algorithm with greater accuracy, speed and memory utilization for prediction in the case of
diabetes.
A. Diabetes:
Diabetes is classified into three types:
Type 1 Diabetes: It is a chronic condition in which the pancreas produces little or no insulin.
This type of diabetes results from the pancreas's failure to produce enough insulin, and it
necessitates that the individual inject insulin or use an insulin pump. This form was previously
referred to as "insulin-dependent diabetes mellitus" (IDDM). The cause of type 1 diabetes is
unknown [3].
Type 2 Diabetes: It begins with insulin resistance, a condition in which cells fail to respond to
insulin properly. As the disease progresses, a lack of insulin may also develop. This form was
previously referred to as "non-insulin-dependent diabetes mellitus" (NIDDM) or "adult-onset
diabetes". The primary causes are excessive body weight and insufficient exercise.
The risk factors for type 2 diabetes are being 45 years of age or older, being overweight,
having a parent or sibling with diabetes (family heredity), having high blood pressure (140/90 or
higher), having high cholesterol (High Density Lipoprotein 35 or lower; triglycerides 250 or
higher) and acute stress. Over 80 per cent of people with type 2 diabetes are overweight; they are
treated with diet and exercise, and the blood sugar level is lowered with suitable drugs. In
summary, type 1 (Insulin Dependent Diabetes Mellitus) is usually diagnosed in children and
young adults and was previously known as juvenile diabetes; because insulin production is
deficient, patients require lifelong insulin injections for survival. Type 2 (Non-Insulin Dependent
Diabetes Mellitus) is due to the body's ineffective use of insulin and often occurs in adulthood.
The most severe form of the disease is IDDM, usually seen in individuals less than
30 years of age. It occurs mostly in children in the 10-14 age group and occasionally in adults.
NIDDM is much more common than IDDM and is often discovered by chance. It is typically
gradual in onset and occurs mainly in the middle-aged and elderly. Prediabetes is a condition in
which a patient's blood sugar level is higher than normal, but not high enough to be classified as
type 2 diabetes. Gestational diabetes is a form of diabetes which affects pregnant women; it is
thought that the hormones created during pregnancy reduce a woman's receptivity to insulin,
leading to high blood sugar levels. It affects 4% of all pregnant women and occurs more
frequently among African Americans and American Indians.
Diabetes is a silent epidemic and according to the literature, there are 246 million people in the
world living with diabetes. This number is predicted to double by 2025. It is estimated that
currently there are 40 million people with diabetes in India and by 2025 this number will swell to
70 million. It causes 6 deaths every minute.
Several researchers have conducted epidemiological and public health studies regarding
the associations between anthropometric measurements and type 2 diabetes. In general, diabetes
detection and heart disease risk detection are handled using effective data mining algorithms.
The common steps to detect the risk of disease are represented in Fig 2.1.
Health dataset -> Pre-processing -> Feature extraction -> Feature analysis -> Feature grouping -> Predicted result
Fig 2.1 Steps involved in the health risk prediction process
Fig 2.1 represents the basic process involved in risk prediction on a health dataset. In
order to predict the risk from its features, it is important to find the pertinent and appropriate
features in the health dataset. The process covers feature selection and feature-based risk
calculation for diabetes and other health datasets.
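As an illustration of the pipeline in Fig 2.1, the following Python sketch chains pre-processing (z-scoring), feature analysis (a crude class-mean separation score), feature grouping (keeping the top-k columns) and prediction (nearest class centroid). The patient values, feature names and the selection score are hypothetical and are not taken from any dataset used in this thesis.

```python
def standardize(X):
    # Pre-processing: z-score every feature column.
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [((sum((v - m) ** 2 for v in c) / len(c)) ** 0.5) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def select_features(X, y, k):
    # Feature analysis + grouping: keep the k columns whose class means
    # differ most after standardization (a crude relevance score).
    pos = [r for r, lab in zip(X, y) if lab == 1]
    neg = [r for r, lab in zip(X, y) if lab == 0]
    scores = [abs(sum(c1) / len(c1) - sum(c0) / len(c0))
              for c1, c0 in zip(zip(*pos), zip(*neg))]
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sorted(top)

def predict(row, X, y, keep):
    # Predicted result: label of the nearest class centroid over the
    # selected features.
    def centroid(label):
        rows = [[r[i] for i in keep] for r, lab in zip(X, y) if lab == label]
        return [sum(c) / len(c) for c in zip(*rows)]
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    probe = [row[i] for i in keep]
    return min((0, 1), key=lambda lab: dist(probe, centroid(lab)))

# Hypothetical records: [age, systolic BP, cholesterol, glucose]; 1 = at risk.
X_raw = [[45, 120, 180, 90], [50, 125, 190, 95], [48, 118, 175, 88],
         [62, 155, 260, 150], [66, 160, 270, 160], [59, 150, 255, 145]]
y = [0, 0, 0, 1, 1, 1]
X = standardize(X_raw)
keep = select_features(X, y, k=2)
risk = predict(standardize(X_raw + [[64, 158, 265, 155]])[-1], X, y, keep)
```

Any of the stages can be swapped for a stronger method (PCA for feature grouping, a classifier from Section 2.2 for prediction) without changing the overall flow.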
2.2 DATA MINING IN HEALTHCARE MANAGEMENT
To aid healthcare organization and decision making, data mining applications can be
developed to improved classification and track risk of several diseases from patient health
records, this types of design, analysis and decision making processes reduces the many work in
real time. Consider, in order to create enhanced diagnosis and treatment procedures, this is quite
tough when comparing the decision making from scientific literature. In this case, data mining is
used to avoid data analysis and clinical problems by managing effective healthcare datasets. Data
mining can be used to analyse vast amount of data and statistics to search for patterns [4]. This
survey brings the tools and techniques used in the health care management.
In this chapter, we provide a detailed study of traditional data mining techniques in
healthcare applications, such as heart disease analysis, type classification, and diabetes and heart
risk assessment.
2.2.1 Data mining techniques for Heart Disease:
Healthcare data clustering is the process of segmenting healthcare data into
different groups based on its level of similarity. Clustering is an unsupervised learning process,
which does not need any training samples for grouping [5].
A number of studies have carried out heart disease identification and risk prediction
using data mining. From statistical data, the risk factors are associated, and from those
associations the risks are detected. Heart disease details, covering factors such as the patient's
age, gender, blood pressure, food habits, cholesterol, heredity and hypertension, are collected
and stored as a huge training sample, from which the useful information is extracted. In paper [6]
the author detected useful patterns from the database to find the risk of heart disease. These
works focus on the classification process.
Classification algorithms such as decision trees [7], naive Bayes [8], SVM [9] and neural
networks [10] are included, and some bagging and semi-supervised classifiers are also used to
detect heart disease risk.
Table 2.1 Data mining algorithms proposed for heart disease

Algorithm | Description | Papers using the algorithm | Results (accuracy)
Decision Tree | A tree-like graph model used for classification. | Tu et al. 2009 (J4.8 decision tree); Andreeva 2006; Palaniappan et al. 2007 | 78.9%
SVM | A fully supervised learning process; cost effective and suitable for massive data. | Kangwanariyakul et al. 2010 | linear SVM 74.9%; polynomial SVM 70.59%; radial basis function kernel SVM 60.89%
Naive Bayes | Based entirely on the training samples; a successful classifier when the training data is huge. | Srinivas et al. 2010 (Naive Bayes and One-Dependency Augmented Naive Bayes classifier) | 84.14%
Neural Network | Depends on neural schemas; divided into linear, probabilistic, radial and polynomial approaches. | Kangwanariyakul et al. 2010 | 76.5%
Bagging | An iterative classification process in which the data are resampled and passed to different classifiers. | Tu et al. 2009 | 81.43%
Table 2.1 lists the algorithms used to handle heart disease, giving a basic description and
paper details along with the reported result. The results are given as accuracy, collected from the
corresponding papers. The table shows that the basic neural network and the SVM variants give
lower accuracy than the others; the highest accuracies are obtained by Naive Bayes (84.14%)
and bagging (81.43%). The studies on heart disease and diabetes can therefore be extended from
these results. The papers use different sets of attributes, so the results may vary according to the
dataset.
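Since bagging scores well in Table 2.1, a minimal sketch of the idea may help: each base classifier is trained on a bootstrap resample of the data and the individual predictions are aggregated by majority vote. The records, thresholds and the decision-stump base classifier below are invented for illustration and are not taken from Tu et al.

```python
import random

def stump_fit(X, y):
    # Base classifier: a one-feature threshold rule, chosen by
    # exhaustive search over features and observed thresholds.
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [1 if sign * (row[f] - t) > 0 else 0 for row in X]
                acc = sum(p == lab for p, lab in zip(pred, y))
                if best is None or acc > best[0]:
                    best = (acc, f, t, sign)
    return best[1:]

def stump_predict(model, row):
    f, t, sign = model
    return 1 if sign * (row[f] - t) > 0 else 0

def bagging_fit(X, y, n_models=11, seed=0):
    # Bagging: each stump is trained on a bootstrap resample
    # (sampling with replacement) of the training set.
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(stump_fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, row):
    # Aggregate the individual predictions by majority vote.
    votes = sum(stump_predict(m, row) for m in models)
    return 1 if 2 * votes > len(models) else 0

# Hypothetical records: [age, systolic BP]; 1 = heart disease risk.
X = [[45, 120], [50, 130], [48, 118], [62, 160], [66, 165], [59, 150]]
y = [0, 0, 0, 1, 1, 1]
ensemble = bagging_fit(X, y)
```

The resampling reduces the variance of the unstable base classifier, which is the usual explanation for bagging's good accuracy in the surveyed papers.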
2.2.2 Data mining techniques for Diabetes:
As with heart disease, diabetes and its risk are classified and predicted by various data
mining techniques in the literature, such as regression [11], decision trees [12] and artificial
neural networks (ANN) [13].
Algorithm | Description | Papers using the algorithm | Results (accuracy)
ANN (Artificial Neural Network) | Follows an evolutionary training process. | Lee SM, Kang JO 2004 | 73.52%
Decision Tree (C5.0) | The most recent version of the decision tree, with high accuracy and the ability to handle missing and null values; C5.0 follows a post-pruning approach, which removes branches from a fully grown tree. | Hu FB, Manson JE | 78.27%
Regression | A non-linear regression method for predicting diabetes risk; its main advantage is support for categorical data. | Lai CL, Chien SW, Fang K. 2007 | 72.74%

Table 2.2 Data mining algorithms proposed for diabetes disease

Table 2.2 lists the algorithms used to handle diabetes. Detection of diabetes is necessary
because it leads to several complications, such as retinopathy, neurological disorders, other
eye-related issues and stroke. The table gives a basic description and paper details along with the
reported result, given as accuracy and collected from the corresponding papers. Table 2.2 clearly
shows that the basic ANN and regression give a lower percentage of accuracy than the decision
tree; the highest accuracy is achieved by the C5.0 decision tree.
In paper [14], the authors compared three prediction models for diabetes using 12
different attributes: the C5.0 decision tree algorithm, ANN and regression. The results from the
paper show that the C5.0 decision tree model performed best on classification accuracy. The
paper concluded with suggestions for future work on optimal predictive models. This also proves
that the resulting accuracy varies with how effectively the attributes are utilized.
Paper [15] presents knowledge about diabetes extracted from web data sources such as
MEDLINE. With the use of text mining, the reviews and web data are mined effectively; from
this, the research outlined brief knowledge about diabetes.
SVM is also used for diabetes disease detection; paper [16] proposed a supervised
learning method, i.e. SVM, to improve the discriminating power of terms for the disease
detection task. The paper utilizes a vector space model for text representation, which transforms
the content of a healthcare record into a point in a high-dimensional space that an SVM separates
with a hyperplane. In this study, the authors investigated numerous unsupervised and supervised
classification methods with SVM and ANN algorithms. Finally, the paper shows that supervised
term weighting methods give good performance.
2.3 EXISTING SYSTEM:
2.3.1 Background on Naive Bayes method
Asha Rajkumar et al. (2010) have illustrated that the naive Bayes classifier is a simple
probabilistic classifier based on applying Bayes' theorem with strong independence assumptions.
It assumes that the presence or absence of a particular feature of a class is unrelated to the
presence or absence of any other feature. It is based on conditional probabilities and uses Bayes'
theorem, which finds the probability of an event occurring given the probability of another event
that has already occurred. If B represents the dependent event and A represents the prior event,
Bayes' theorem can be stated as follows:
Prob (B given A) = Prob(A and B)/Prob(A)
In order to calculate the probability of B given A, the algorithm counts the number of
cases where A and B occur together and divides it by the number of cases where A occurs alone.
An advantage of the Naive Bayes classifier is that it requires only a small amount of training data
to estimate the parameters (means and variances of the variables) necessary for classification.
Since independent variables are assumed, only the variances of the variables for each class need
to be determined. It can be used for both binary and multi class classification problems.
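The counting procedure described above can be sketched in a few lines of Python. The symptom values and labels below are invented for illustration, and Laplace (add-one) smoothing is added so that an unseen feature value does not zero out the whole product of probabilities.

```python
from collections import Counter, defaultdict

def nb_fit(X, y):
    # Count class frequencies and, per class, how often each
    # feature value co-occurs with that class.
    priors = Counter(y)
    cond = defaultdict(Counter)   # (class, feature index) -> value counts
    vocab = defaultdict(set)      # feature index -> values seen in training
    for row, lab in zip(X, y):
        for i, v in enumerate(row):
            cond[(lab, i)][v] += 1
            vocab[i].add(v)
    return priors, cond, vocab

def nb_predict(model, row):
    # P(class | row) is proportional to P(class) * prod_i P(value_i | class);
    # each conditional is estimated by counting, with add-one smoothing.
    priors, cond, vocab = model
    n = sum(priors.values())
    def score(lab):
        p = priors[lab] / n
        for i, v in enumerate(row):
            p *= (cond[(lab, i)][v] + 1) / (priors[lab] + len(vocab[i]))
        return p
    return max(priors, key=score)

# Hypothetical categorical records: (blood pressure, blood sugar).
X = [("high", "high"), ("high", "normal"), ("normal", "high"),
     ("normal", "normal"), ("high", "high"), ("normal", "normal")]
y = ["risk", "risk", "risk", "no", "risk", "no"]
model = nb_fit(X, y)
```

Because only per-class value counts are stored, the training pass is a single scan over the data, which is why naive Bayes needs so little training data to estimate its parameters.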
2.3.2 Background on Support Vector Machines (SVM)
SVM is a supervised learning method, a useful technique for data classification. In other
terms, it is a classification and regression prediction tool that uses machine learning theory to
maximize predictive accuracy while automatically avoiding over-fit to the data. A classification
task mainly involves separating the datasets into training and testing sets. It can be defined as a
collection of systems which use hypothesis space of linear functions in a high dimensional
feature space, trained with a learning algorithm from optimization theory that implements a
learning bias derived from statistical learning theory. SVM, when used to build the regression
models, is known as Support Vector Regression.
The SVM is very popular as a high-performance classifier in several classification
domains and is particularly suited to analyzing large amounts of data, for example thousands of
predictor fields. It uses a supervised learning approach for classifying data: SVM produces a
model from the given training data, which is then used to predict the target values of the test
data.
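A linear SVM of the kind described above can be sketched without any library by minimizing the hinge loss with stochastic sub-gradient descent (the Pegasos scheme). The data below are invented and pre-scaled; production SVM implementations such as LIBSVM instead solve the dual optimization problem and support kernels, so this is only an illustrative sketch.

```python
def svm_train(X, y, lam=0.01, epochs=300):
    # Pegasos-style training: labels must be +1 / -1; lam trades off a
    # wide margin (small w) against hinge-loss violations.
    w = [0.0] * len(X[0])
    b, t = 0.0, 0
    for _ in range(epochs):
        for row, lab in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            score = sum(wi * xi for wi, xi in zip(w, row)) + b
            if lab * score < 1:            # margin violated: hinge sub-gradient
                w = [(1 - eta * lam) * wi + eta * lab * xi
                     for wi, xi in zip(w, row)]
                b += eta * lab
            else:                          # only shrink w (regularization)
                w = [(1 - eta * lam) * wi for wi in w]
    return w, b

def svm_predict(w, b, row):
    # Classify by the side of the learned hyperplane w.x + b = 0.
    return 1 if sum(wi * xi for wi, xi in zip(w, row)) + b >= 0 else -1

# Invented, already-scaled features, e.g. [normalized BP, normalized glucose].
X = [[0.2, 0.1], [0.1, 0.2], [0.15, 0.1],
     [0.9, 0.8], [0.8, 0.9], [0.85, 0.95]]
y = [-1, -1, -1, 1, 1, 1]
w, b = svm_train(X, y)
```

The same train/predict split mirrors the classification task described in the text: the model is fitted on training data and then applied to unseen test rows.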
2.3.3 Background on classification by Decision trees
Sudha et al. (2012) have described the decision tree as a popular classification and
prediction method for handling high-dimensional data; it looks like a tree structure. It is one of
the successful data mining techniques used in the diagnosis of heart disease. It applies a
straightforward idea to solve the classification problem and is a very simple and easy way of
handling a dataset. It divides the dataset into multiple groups by evaluating individual data
records, each of which can be described by its attributes. It requires no domain knowledge or
parameter setting and can handle high-dimensional data. The classification process is also simple
and easy to visualize: the predicates return discrete values and can be explained by a series of
nested if-then-else statements. The results obtained from decision trees are easy to read and
interpret.
The goal is to create a model that predicts the value of a target variable based on several
input variables. Here, each internal node denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node holds a class label. It is a popular classifier and prediction
method for handling high dimensional data. There are many types of decision trees. The
difference between them is the mathematical model that is used in selecting the splitting attribute
in extracting the decision tree rules.
The main advantages of this algorithm are its simplicity and speed, which allow it to run
on large datasets. It breaks down a dataset into smaller subsets while, at the same time, an
associated decision tree is incrementally developed. The final result is a tree with decision nodes
and leaf nodes. A decision tree can easily be transformed into a set of rules by mapping the paths
from the root node to the leaf nodes one by one. By creating a decision tree, the data can be
mined based on past history to determine the likelihood that a person is at risk of heart disease.
The decision trees generated by the C4.5 algorithm can be used for classification, and
C4.5 is often referred to as a statistical classifier. It builds decision trees from a set of training
data using the concept of information entropy. It works top-down: at each node of the tree, C4.5
chooses the attribute of the data that most effectively splits the samples. The attribute with the
highest normalized information gain is chosen to make the decision; the entropy (information
gain) approach selects the splitting attribute that minimizes the weighted entropy of the resulting
partitions, thus maximizing the information gain.
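The entropy-based attribute choice can be made concrete with a small sketch. The toy records below are hypothetical; note that C4.5 additionally divides this gain by the split information to obtain the gain ratio, which is the "normalized" gain mentioned above.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    # H = -sum p_i * log2(p_i) over the class distribution.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    # Gain = H(parent) - weighted H of the partitions induced by
    # the attribute's values; the splitter picks the largest gain.
    parts = defaultdict(list)
    for row, lab in zip(rows, labels):
        parts[row[feature]].append(lab)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

# Toy records: (cholesterol level, gender); label 1 = heart disease.
rows = [("high", "m"), ("high", "f"), ("low", "m"), ("low", "f")]
labels = [1, 1, 0, 0]
gain_chol = information_gain(rows, labels, 0)    # splits the classes perfectly
gain_gender = information_gain(rows, labels, 1)  # carries no class information
```

Here the cholesterol attribute yields the maximum possible gain of 1 bit, so a C4.5-style splitter would place it at the root, while the uninformative gender attribute yields zero gain.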
CHAPTER SUMMARY
This chapter summarized various review and technical articles on data mining
classification techniques applied to healthcare datasets. Data mining has the potential to
generate a knowledge-rich environment which can help to significantly improve the quality of
clinical decisions. Here, data mining is applied to medical databases. With advancements in
tools like RapidMiner, it can be used with ease on large datasets with a large number of
attributes, and it predicts and classifies the data with reasonable accuracy.