UNIT V
CLASSIFICATION & CLUSTERING:
What is classification?
The term "classification" is usually used when there are exactly two target classes
called binary classification. When more than two classes may be predicted, specifically
in pattern recognition problems, this is often referred to as multinomial classification.
However, multinomial classification is also used for categorical response data, where
one wants to predict which category amongst several categories has the instances with
the highest probability.
Classification is one of the most important tasks in data mining. It refers to the process of assigning pre-defined class labels to instances based on their attributes. Classification and clustering may look similar, but they are different: the major difference is that classification involves labelling items according to their membership in pre-defined groups. Let's understand this concept with the help of an example; suppose you are using a self-organizing map neural network algorithm for image recognition where there are 10 different kinds of objects. If you label each image with one of these 10 classes, the classification task is solved.
On the other hand, clustering does not involve any labelling. Assume that you are given
an image database of 10 objects and no class labels. Using a clustering algorithm to
find groups of similar-looking images will result in determining clusters without object
labels.
Some of the important data mining classification methods are given below:
The K-Nearest Neighbours method classifies an instance by looking at its K nearest observations and determining the similarities between the neighbours.
The Naive Bayes method scans the data set and locates the records whose predictor values are equal, using them to estimate the most probable class.
Neural Networks resemble the structure of our brain, which is built of neurons. The sets of data pass through these networks and finally come out as output. The network compares the resulting classifications with the actual ones; errors that occur in the classifications are rectified and fed back into the network. This is a recurring process.
In the linear classification method, a linear function is built and used to predict the class of an observation whose class is unknown.
What is clustering?
Clustering refers to a technique of grouping objects so that objects with similar characteristics come together and objects with different characteristics go apart. In other words, clustering is a process of partitioning a data set into a set of meaningful subclasses, known as clusters. Clustering resembles classification in that data is grouped; however, unlike classification, the groups are not previously defined. Instead, the grouping is achieved by determining similarities between data according to characteristics found in the actual data. The groups are called clusters.
Methods of clustering
o Partitioning methods
o Hierarchical clustering
o Fuzzy Clustering
o Density-based clustering
o Model-based clustering
Classification vs Clustering:
Classification uses algorithms to categorize new data as per the observations of the training set, whereas clustering uses statistical concepts in which the data set is divided into subsets with the same features.
In classification, there are labels for training data; in clustering, there are no labels for training data.
The objective of classification is to find which class a new object belongs to from the set of predefined classes, whereas the objective of clustering is to group a set of objects to find whether there is any relationship between them.
1R ALGORITHM
OneR, short for "One Rule", is a simple, yet accurate, classification algorithm that
generates one rule for each predictor in the data, then selects the rule with the
smallest total error as its "one rule". To create a rule for a predictor, we construct a
frequency table for each predictor against the target. It has been shown that OneR
produces rules only slightly less accurate than state-of-the-art classification algorithms
while producing rules that are simple for humans to interpret.
For each predictor,
For each value of that predictor, make a rule as follows:
Count how often each class appears
Find the most frequent class
Make the rule assign that class to this value of the predictor
Calculate the total error of the rules of each predictor
Choose the predictor with the smallest total error.
Example:
Finding the best predictor with the smallest total error using OneR algorithm
based on related frequency tables.
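A minimal Python sketch of the OneR procedure on a toy data set; the attribute names and values below are illustrative and do not reproduce the frequency tables from the example above:

import pandas as pd

# Toy weather-style data (illustrative values only)
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast", "Sunny", "Rainy"],
    "Windy":    ["False", "True", "False", "False", "True", "True", "False", "True"],
    "PlayGolf": ["No", "No", "Yes", "Yes", "No", "Yes", "Yes", "No"],
})

def one_r(df, target):
    best_rule, best_error = None, float("inf")
    for predictor in df.columns.drop(target):
        # Frequency table: for each value of the predictor, pick the most frequent class
        table = df.groupby(predictor)[target].agg(lambda s: s.value_counts().idxmax())
        predictions = df[predictor].map(table)
        error = (predictions != df[target]).sum()   # total error of this predictor's rules
        if error < best_error:
            best_rule, best_error = (predictor, table.to_dict()), error
    return best_rule, best_error

rule, error = one_r(data, "PlayGolf")
print("One Rule:", rule, "total error:", error)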
Predictors Contribution
Simply, the total error calculated from the frequency tables is the measure of each predictor's contribution. A low total error means a higher contribution to the predictability of the model.
Model Evaluation
The following confusion matrix shows significant predictive power. OneR does not generate a score or probability, which means evaluation charts (Gain, Lift, K-S and ROC) are not applicable.
Confusion Matrix (Play Golf):
                            Actual Yes    Actual No
OneR predicts Yes                7             2        Positive Predictive Value = 0.78
OneR predicts No                 2             3        Negative Predictive Value = 0.60
Sensitivity = 0.78    Specificity = 0.60    Accuracy = 0.71
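The evaluation measures above can be recomputed directly from the four cells of the confusion matrix; a short Python check:

# Recomputing the evaluation measures from the confusion matrix above
tp, fp, fn, tn = 7, 2, 2, 3                       # predicted Yes/No vs actual Yes/No
accuracy    = (tp + tn) / (tp + fp + fn + tn)     # 10 / 14 = 0.71
sensitivity = tp / (tp + fn)                      # 7 / 9   = 0.78
specificity = tn / (tn + fp)                      # 3 / 5   = 0.60
ppv         = tp / (tp + fp)                      # 7 / 9   = 0.78
npv         = tn / (tn + fn)                      # 3 / 5   = 0.60
print(accuracy, sensitivity, specificity, ppv, npv)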
DECISION TREES
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a
test, and each leaf node holds a class label. The topmost node in the tree is the root
node.
The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node
represents a test on an attribute. Each leaf node represents a class.
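A small sklearn sketch of inducing a decision tree for buy_computer-style data; the tiny training set and attribute values below are made up purely for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import OrdinalEncoder

# Illustrative buy_computer-style data: [age, student, credit_rating]
X_raw = [["youth", "no", "fair"], ["youth", "yes", "fair"],
         ["middle_aged", "no", "excellent"], ["senior", "yes", "fair"],
         ["senior", "no", "excellent"]]
y = ["no", "yes", "yes", "yes", "no"]

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)                 # encode categorical attributes as numbers

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "student", "credit_rating"]))

# Predict the class for a new customer
print(tree.predict(enc.transform([["youth", "yes", "excellent"]])))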
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
Pre-pruning − The tree is pruned by halting its construction early.
Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity of a tree is measured as a function of the number of leaves in the tree and the error rate of the tree; the subtree with the lowest cost complexity is preferred.
COVERING RULES
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
The IF part of the rule is called rule antecedent or precondition.
The THEN part of the rule is called rule consequent.
The antecedent part, the condition, consists of one or more attribute tests,
and these tests are logically ANDed.
The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
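Rule R1 can be read as a simple function; a minimal Python sketch (the attribute names follow the rule above):

def r1(tuple_):
    """R1: IF age = youth AND student = yes THEN buys_computer = yes."""
    if tuple_["age"] == "youth" and tuple_["student"] == "yes":
        return "yes"       # rule consequent
    return None            # rule does not cover this tuple

print(r1({"age": "youth", "student": "yes"}))   # -> "yes" (antecedent satisfied)
print(r1({"age": "senior", "student": "yes"}))  # -> None  (rule does not fire)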
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from
a decision tree.
Points to remember −
To extract a rule from a decision tree −
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed and the process continues for the rest of the tuples. This is because the path to each leaf in a decision tree corresponds to a rule.
Note − Decision tree induction can be considered as learning a set of rules simultaneously.
The following is the sequential learning algorithm where rules are learned for one class at a time. When learning a rule from a class Ci, we want the rule to cover all the tuples from class Ci only and no tuple from any other class.
Algorithm: Sequential Covering
Input:
D, a data set of class-labeled tuples,
Att_vals, the set of all attributes and their possible values.
Output: a set of IF-THEN rules.
Method:
Rule_set = { };  // the initial set of learned rules is empty
for each class c do
   repeat
      Rule = Learn_One_Rule(D, Att_vals, c);
      remove tuples covered by Rule from D;
   until termination condition;
   Rule_set = Rule_set + Rule;  // add the new rule to the rule set
end for
return Rule_set;
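A simplified Python sketch of sequential covering, assuming the data is a pandas DataFrame; Learn_One_Rule is reduced here to picking a single attribute test, which is only a stand-in for the full procedure:

import pandas as pd

def learn_one_rule(data, target, cls):
    # Greedily pick the single attribute test (attr = value) whose covered
    # tuples are most accurate for class `cls` (a simplified Learn_One_Rule).
    best, best_acc = None, -1.0
    for attr in data.columns.drop(target):
        for value in data[attr].unique():
            covered = data[data[attr] == value]
            acc = (covered[target] == cls).mean()
            if acc > best_acc:
                best, best_acc = (attr, value), acc
    return best

def sequential_covering(data, target):
    rules = []
    for cls in data[target].unique():
        d = data.copy()
        while (d[target] == cls).any():          # while tuples of this class remain
            attr, value = learn_one_rule(d, target, cls)
            rules.append((attr, value, cls))     # IF attr = value THEN class = cls
            d = d[d[attr] != value]              # remove tuples covered by the rule
    return rules

data = pd.DataFrame({"age": ["youth", "youth", "senior", "senior"],
                     "student": ["yes", "no", "no", "yes"],
                     "buys_computer": ["yes", "no", "no", "yes"]})
print(sequential_covering(data, "buys_computer"))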
Rule Pruning
For a given rule R, FOIL_Prune(R) = (pos - neg) / (pos + neg), where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
TASK PREDICTION
Data mining deals with the kind of patterns that can be mined. On the basis of the kind
of data to be mined, there are two categories of functions involved in Data Mining −
Descriptive
Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For
example, in a company, the classes of items for sales include computer and printers,
and concepts of customers include big spenders and budget spenders. Such
descriptions of a class or a concept are called class/concept descriptions. These
descriptions can be derived by the following two ways −
Data Characterization − This refers to summarizing the data of the class under
study. This class under study is called the Target Class.
Data Discrimination − It refers to the mapping or classification of a class
with some predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here is the list of kinds of frequent patterns −
Frequent Item Set − It refers to a set of items that frequently appear
together, for example, milk and bread.
Frequent Subsequence − A sequence of patterns that occur frequently,
such as purchasing a camera being followed by purchasing a memory card.
Frequent Sub Structure − Substructure refers to different structural
forms, such as graphs, trees, or lattices, which may be combined with
item-sets or subsequences.
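A tiny Python sketch of counting frequent item sets with a minimum support threshold; the transactions below are illustrative:

from itertools import combinations
from collections import Counter

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread", "butter"}, {"milk", "bread"}]
min_support = 2  # minimum number of transactions an item set must appear in

counts = Counter()
for t in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

frequent = {s: c for s, c in counts.items() if c >= min_support}
print(frequent)   # e.g. ('bread', 'milk') appears in 3 transactions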
Mining of Association
Associations are used in retail sales to identify patterns that are frequently purchased
together. This process refers to the process of uncovering the relationship among data
and determining association rules.
For example, a retailer generates an association rule that shows that 70% of time milk
is sold with bread and only 30% of times biscuits are sold with bread.
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical
correlations between associated-attribute-value pairs or between two item sets to
analyze that if they have positive, negative or no effect on each other.
Mining of Clusters
Cluster refers to a group of similar kind of objects. Cluster analysis refers to forming
group of objects that are very similar to each other but are highly different from the
objects in other clusters.
Classification is the process of finding a model that describes the data classes or
concepts. The purpose is to be able to use this model to predict the class of objects
whose class label is unknown. This derived model is based on the analysis of sets of
training data. The derived model can be presented in the following forms −
Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks
The list of functions involved in these processes are as follows −
Classification − It predicts the class of objects whose class label is
unknown. Its objective is to find a derived model that describes and
distinguishes data classes or concepts. The derived model is based on the
analysis of a set of training data, i.e., data objects whose class labels are
known.
Prediction − It is used to predict missing or unavailable numerical data
values rather than class labels. Regression Analysis is generally used for
prediction. Prediction can also be used for identification of distribution
trends based on available data.
Outlier Analysis − Outliers may be defined as the data objects that do
not comply with the general behavior or model of the data available.
We can specify a data mining task in the form of a data mining query.
This query is input to the system.
A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data mining system. Here is the list of Data Mining Task Primitives −
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
Set of task relevant data to be mined
This is the portion of database in which the user is interested. This portion includes
the following −
Database Attributes
Data Warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Background knowledge
The background knowledge allows data to be mined at multiple levels of abstraction.
For example, the Concept hierarchies are one of the background knowledge that
allows data to be mined at multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
This is used to evaluate the patterns discovered by the process of knowledge discovery. There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These
representations may include the following. −
Rules
Tables
Charts
Graphs
Decision Trees
Cubes
STATISTICAL CLASSIFICATION
Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns. Theoreticians and practitioners are continually seeking improved techniques to make the process more efficient, cost-effective, and accurate. Any situation can be analyzed in two ways in data mining:
Statistical Analysis: In statistics, data is collected, analyzed, explored, and
presented to identify patterns and trends. Alternatively, it is referred to as
quantitative analysis.
Non-statistical Analysis: This analysis provides generalized information and
includes sound, still images, and moving images.
In statistics, there are two main categories:
Descriptive Statistics: The purpose of descriptive statistics is to organize data
and identify the main characteristics of that data. Graphs or numbers
summarize the data. Average, Mode, SD(Standard Deviation), and Correlation
are some of the commonly used descriptive statistical methods.
Inferential Statistics: The process of drawing conclusions based on probability
theory and generalizing the data. By analyzing sample statistics, you can infer
parameters about populations and make models of relationships within data.
There are various statistical terms that one should be aware of while dealing with
statistics. Some of these are:
Population
Sample
Variable
Quantitative Variable
Qualitative Variable
Discrete Variable
Continuous Variable
Now, let’s start discussing statistical methods. Statistical analysis is the analysis of raw data using mathematical formulas, models, and techniques. Through the use of statistical methods, information is extracted from research data, and different ways are available to judge the robustness of research outputs.
As a matter of fact, today’s statistical methods used in the data mining field are typically derived from the vast statistical toolkit developed to answer problems arising in other fields. These techniques are taught in science curricula. It is necessary to check and test several hypotheses; such hypothesis tests help us assess the validity of our data mining endeavor when attempting to draw inferences from the data under study. When using more complex and sophisticated statistical estimators and tests, these issues become more pronounced.
For extracting knowledge from databases containing different types of observations,
a variety of statistical methods are available in Data Mining and some of these are:
Logistic regression analysis
Correlation analysis
Regression analysis
Discriminant analysis
Linear discriminant analysis (LDA)
Classification
Clustering
Outlier detection
Classification and regression trees
Correspondence analysis
Nonparametric regression
Statistical pattern recognition
Categorical data analysis
Time-series methods for trends and periodicity
Artificial neural networks
Now, let’s try to understand some of the important statistical methods which are used
in data mining:
Linear Regression: The linear regression method uses the best linear
relationship between the independent and dependent variables to predict the
target variable. In order to achieve the best fit, make sure that all the distances
between the shape and the actual observations at each point are as small as
possible. A good fit can be determined by determining that no other position
would produce fewer errors given the shape chosen. Simple linear regression
and multiple linear regression are the two major types of linear regression. By
fitting a linear relationship to the independent variable, the simple linear
regression predicts the dependent variable. Using multiple independent
variables, multiple linear regression fits the best linear relationship with the
dependent variable. For more details, you can refer to linear regression.
Classification: This is a method of data mining in which a collection of data is
categorized so that a greater degree of accuracy can be predicted and analyzed.
An effective way to analyze very large datasets is to classify them. Classification
is one of several methods aimed at improving the efficiency of the analysis
process. A Logistic Regression and a Discriminant Analysis stand out as two
major classification techniques.
o Logistic Regression: It can also be applied to machine learning
applications and predictive analytics. In this approach, the dependent
variable is either binary (binary regression) or multinomial (multinomial
regression): either one of the two or a set of one, two, three, or four
options. With a logistic regression equation, one can estimate
probabilities regarding the relationship between the independent
variable and the dependent variable. For understanding logistic
regression analysis in detail, you can refer to logistic regression.
o Discriminant Analysis: A Discriminant Analysis is a statistical method of
analyzing data based on the measurements of categories or clusters and
categorizing new observations into one or more populations that were
identified a priori. The discriminant analysis models each response class
independently then uses Bayes’s theorem to flip these projections
around to estimate the likelihood of each response category given the
value of X. These models can be either linear or quadratic.
Linear Discriminant Analysis: In Linear Discriminant Analysis, each
observation is assigned a discriminant score to classify it into a
response variable class. These scores are obtained by combining the
independent variables in a linear fashion. The model assumes that
observations are drawn from a Gaussian distribution and that the
predictor variables share a common covariance across all k levels of
the response variable Y. For further details, refer to linear
discriminant analysis.
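A minimal sklearn sketch comparing logistic regression and linear discriminant analysis as classifiers; the iris data set is used purely for illustration:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit both classifiers on the same training split and compare test accuracy
for model in (LogisticRegression(max_iter=1000), LinearDiscriminantAnalysis()):
    model.fit(X_train, y_train)
    print(type(model).__name__, "accuracy:", model.score(X_test, y_test))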
BAYESIAN THEOREM:
In numerous applications, the connection between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record cannot be predicted with certainty even though its attribute set is the same as that of some of the training examples. These circumstances may emerge due to noisy data or the presence of certain confounding factors that influence classification but are not included in the analysis. For example, consider the task of predicting whether an individual is at risk of liver illness based on the individual's eating habits and working efficiency. Although most people who eat healthily and exercise consistently have a lower probability of liver disease, they may still develop it due to other factors, for example, consumption of high-calorie street food and alcohol abuse. Determining whether an individual's eating routine is healthy or the workout efficiency is sufficient is also subject to analysis, which in turn may introduce uncertainties into the learning problem.
Bayesian classification uses Bayes' theorem to predict the occurrence of any event. Bayesian classifiers are statistical classifiers built on Bayesian probability. The theorem expresses how a degree of belief, expressed as a probability, should rationally change to account for the evidence.
Bayes' theorem is named after Thomas Bayes, who first utilized conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter. The theorem states that
P(X | Y) = P(Y | X) P(X) / P(Y)
where P(X | Y) is the probability of X given that Y has occurred, P(Y | X) is the probability of Y given that X has occurred, and P(X) and P(Y) are the probabilities of observing X and Y independently of each other. The latter are known as the marginal probabilities.
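A short worked example of Bayes' theorem in Python for the liver-disease illustration above; all the probabilities are assumed numbers, not real estimates:

# Illustrative numbers only: probability of liver disease given an unhealthy diet,
# computed with Bayes' theorem P(D|U) = P(U|D) * P(D) / P(U).
p_disease = 0.05            # P(D): prior probability of liver disease
p_unhealthy_given_d = 0.70  # P(U|D): unhealthy diet among those with the disease
p_unhealthy = 0.30          # P(U): overall probability of an unhealthy diet

p_d_given_unhealthy = p_unhealthy_given_d * p_disease / p_unhealthy
print(round(p_d_given_unhealthy, 3))   # 0.117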
Bayesian interpretation:
In the Bayesian interpretation, probability measures a degree of belief, and Bayes' theorem links the degree of belief in a proposition before and after accounting for evidence. It follows from the definition of conditional probability, P(X | Y) P(Y) = P(X ⋂ Y) = P(Y | X) P(X), where P(X ⋂ Y) is the joint probability of both X and Y being true, because P(Y ⋂ X) = P(X ⋂ Y).
Bayesian network:
A Directed Acyclic Graph is used to show a Bayesian Network, and like some other
statistical graph, a DAG consists of a set of nodes and links, where the links signify the
connection between the nodes.
The nodes here represent random variables, and the edges define the relationship
between these variables.
A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network.
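In practice, Bayesian classification is most often applied through the naive Bayes classifier, which assumes the attributes are conditionally independent given the class; a minimal sklearn sketch (the iris data set is used only for illustration):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = GaussianNB()   # assumes attributes are conditionally independent given the class
print(cross_val_score(clf, X, y, cv=5).mean())   # average cross-validated accuracy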
INSTANCE-BASED METHODS
Instance-based learning stores the training instances themselves and works by relating new instances whose class is unknown to existing instances whose class is known. Rather than trying to create rules, it works directly from the examples themselves. This is called instance-based learning.
In instance-based learning, all the real work is done when the time comes to classify a new instance, rather than when the training set is processed. The difference between this approach and the others is the time at which the "learning" takes place.
Instance-based learning is lazy, deferring the real work as long as possible, whereas other methods are eager, generalizing as soon as the data has been seen.
In instance-based classification, each new instance is compared with existing ones using a distance metric, and the closest existing instance is used to assign its class to the new one. This is known as the nearest-neighbour classification method.
Sometimes more than one nearest neighbour is used, and the majority class of the nearest k neighbours (or the distance-weighted average if the class is numeric) is assigned to the new instance. This is called the k-nearest-neighbour method.
When nominal attributes are present, it is essential to come up with a "distance" between the different values of that attribute. Some attributes will be more significant than others, and this is usually reflected in the distance metric by some kind of attribute weighting. Deriving suitable attribute weights from the training set is an important problem in instance-based learning.
An apparent limitation of instance-based representations is that they do not make explicit the structures that are learned. The instances combine with the distance metric to carve out boundaries in instance space that distinguish one class from another, and this is a kind of explicit description of knowledge.
For instance, given a single instance of each of two classes, the nearest-neighbour rule effectively splits the instance space along the perpendicular bisector of the line joining the instances. Given several instances of each class, the space is divided by a set of lines that represent the perpendicular bisectors of selected lines joining an instance of one class to one of another class.
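A minimal sklearn sketch of the k-nearest-neighbour method described above (k = 3 and the iris data set are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)   # the k nearest neighbours vote on the class
knn.fit(X_train, y_train)                   # "training" simply stores the instances
print(knn.score(X_test, y_test))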
LINEAR MODELS
Linear regression may be defined as the statistical model that analyzes the linear
relationship between a dependent variable with given set of independent variables.
Linear relationship between variables means that when the value of one or more
independent variables will change (increase or decrease), the value of dependent
variable will also change accordingly (increase or decrease).
Mathematically the relationship can be represented with the help of following
equation −
Y = mX + b
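A minimal sketch of fitting Y = mX + b with sklearn; the five data points are made up so that m ≈ 2 and b ≈ 1:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable
Y = np.array([3, 5, 7, 9, 11])            # dependent variable (here Y = 2X + 1)

model = LinearRegression().fit(X, Y)
print(model.coef_[0], model.intercept_)   # slope m = 2, intercept b = 1
print(model.predict([[6]]))               # predicted value = 13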
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar
objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups
based on data similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable
to changes and helps single out useful features that distinguish different
groups.
The following points throw light on why clustering is required in data mining −
Scalability − We need highly scalable clustering algorithms to deal with
large databases.
Ability to deal with different kinds of attributes − Algorithms should be
capable of being applied to any kind of data, such as interval-based
(numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm
should be capable of detecting clusters of arbitrary shape. It should
not be bounded to distance measures that tend to find only spherical
clusters of small size.
High dimensionality − The clustering algorithm should not only be able
to handle low-dimensional data but also the high dimensional space.
Ability to deal with noisy data − Databases contain noisy, missing or
erroneous data. Some algorithms are sensitive to such data and may lead
to poor quality clusters.
Interpretability − The clustering results should be interpretable,
comprehensible, and usable.
Clustering Methods
Partitioning Methods
Suppose we are given a database of n objects; a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n.
For a given number of partitions (say k), the partitioning method creates an initial partitioning.
It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition
is formed. There are two approaches here −
o Agglomerative Approach
o Divisive Approach
o Agglomerative Approach
This approach is also known as the bottom-up approach. In this,
we start with each object forming a separate group. It keeps on merging
the objects or groups that are close to one another until all of the
groups are merged into one or until the termination condition holds.
o Divisive Approach
This approach is also known as the top-down approach. In this,
we start with all of the objects in the same cluster. In each successive
iteration, a cluster is split up into smaller clusters. This is done until each
object is in its own cluster or the termination condition holds. Hierarchical
methods are rigid, i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Perform careful analysis of object linkages at each hierarchical
partitioning.
Integrate hierarchical agglomeration by first using a hierarchical
agglomerative algorithm to group objects into micro-clusters, and then
performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing
the given cluster as long as the density in the neighborhood exceeds some threshold,
i.e., for each data point within a given cluster, the radius of a given cluster has to
contain at least a minimum number of points.
Grid-based Method
In this, the objects together form a grid. The object space is quantized into finite
number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the
quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for
a given model. This method locates the clusters by clustering the density function. It
reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of
desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. Constraints can be specified by the user
or the application requirement.
COBWEB
COBWEB is a popular and simple method of incremental conceptual learning.
It creates a hierarchical clustering in the form of a classification tree.
Each node refers to a concept and contains a probabilistic description of that
concept.
Classification Tree
Limitations of COBWEB
The assumption that the attributes are independent of each other is often too strong
because correlation may exist.
It is not suitable for clustering large databases: the tree may become skewed and the probability distributions are expensive to compute.
CLASSIT
It is an extension of COBWEB for incremental clustering of continuous data.
It suffers similar problems as COBWEB.
Competitive learning
It involves a hierarchical architecture of several units (neurons).
Neurons compete in a “winner-takes-all” fashion for the object currently
being presented.
Clustering is also performed by having several units competing for the current object.
The unit whose weight vector is closest to the current object wins
The winner and its neighbors learn by having their weights adjusted.
Self-Organizing Feature Maps (SOMs)
SOMs are believed to resemble processing that can occur in the brain.
They are useful for visualizing high-dimensional data in 2-D or 3-D space.
k-means
k-means is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
Hence each cluster has data points with some commonalities and is away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise the model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them
into different clusters. It means here we will try to group these datasets into
two different clusters.
o We need to choose some random k points or centroids to form the clusters.
These points can be either points from the dataset or any other points. So,
here we are selecting the below two points as k points, which are not part
of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute this by applying the mathematics we have studied
to calculate the distance between two points. So, we will draw a median
between both the centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are near
the K1 or blue centroid, and the points to the right of the line are close to the yellow
centroid. Let's color them blue and yellow for clear visualization.
o Next, we will reassign each datapoint to the new centroid. For this, we will
repeat the same process of finding a median line. The median will be like
below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so the
new centroids will be as shown in the below image:
o As we got the new centroids so again will draw the median line and reassign
the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either
side of the line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two
final clusters will be as shown in the below image:
The performance of the K-means clustering algorithm depends upon highly efficient
clusters that it forms. But choosing the optimal number of clusters is a big task. There
are some different ways to find the optimal number of clusters, but here we are
discussing the most appropriate method to find the number of clusters or value of K.
The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of
clusters. This method uses the concept of WCSS value. WCSS stands for Within Cluster
Sum of Squares, which defines the total variations within a cluster. The formula to
calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²
∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squares of the distances between each data point and its centroid within Cluster1, and the same applies to the other two terms.
To measure the distance between data points and centroid, we can use any method
such as Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes K-means clustering on the given dataset for different K values (ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, this is known as the elbow method. The graph for the elbow method looks like the below image:
In the above section, we have discussed the K-means algorithm, now let's see how it
can be implemented using Python.
Before implementation, let's understand what type of problem we will solve here. So,
we have a dataset of Mall_Customers, which is the data of customers who visit the
mall and spend there.
In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and
Spending Score (which is the calculated value of how much a customer has spent in
the mall, the more the value, the more he has spent). From this dataset, we need to
calculate some patterns, as it is an unsupervised method, so we don't know what to
calculate exactly.
o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
The first step will be the data pre-processing, as we did in our earlier topics of
Regression and Classification. But for the clustering problem, it will be different from
other models. Let's discuss it:
o Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our model,
which is part of data pre-processing. The code is given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the above code, we have imported numpy for performing mathematical calculations, matplotlib for plotting the graph, and pandas for managing the dataset.
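A minimal sketch of the data-loading step referred to next; the file name used here is an assumption:

# Importing the dataset (the file name is assumed)
dataset = pd.read_csv('Mall_Customers_data.csv')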
By executing the above lines of code, we will get our dataset in the Spyder IDE. The
dataset looks like the below image:
Here we don't need any dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea about what to determine. So we will just add a line of code for the matrix of features, as sketched below.
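A one-line sketch of that step, assuming the Annual Income and Spending Score columns sit at positions 3 and 4 as described in the next paragraph:

# Extracting only the 3rd and 4th features (Annual Income and Spending Score)
x = dataset.iloc[:, [3, 4]].values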
As we can see, we are extracting only the 3rd and 4th features. This is because we need a 2D plot to visualize the model, and some features, such as customer_id, are not required.
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering
problem. So, as discussed above, here we are going to use the elbow method for this
purpose.
As we know, the elbow method uses the WCSS concept to draw the plot by plotting
WCSS values on the Y-axis and the number of clusters on the X-axis. So we are going
to calculate the value for WCSS for different k values ranging from 1 to 10. Below is
the code for it:
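A sketch of the elbow-method code described below, under the assumption that the feature matrix x has already been extracted:

# Finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans

wcss_list = []                       # empty list to hold the WCSS value for each k
for i in range(1, 11):               # k ranging from 1 to 10 (11 is excluded)
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)   # inertia_ is the WCSS of the fitted model

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()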
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
Next, we have created the wcss_list variable to initialize an empty list, which is used
to contain the value of wcss computed for different values of k ranging from 1 to 10.
After that, we have initialized the for loop to iterate over different values of k ranging from 1 to 10; since the for loop in Python excludes the upper bound, it is written as 11 to include the 10th value.
The rest part of the code is similar as we did in earlier topics, as we have fitted the
model on a matrix of features and then plotted the graph between the number of
clusters and WCSS.
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number of clusters
here will be 5.
As we have got the number of clusters, so we can now train the model on the dataset.
To train the model, we will use the same two lines of code as we have used in the
above section, but here instead of using i, we will use 5, as we know there are 5 clusters
that need to be formed. The code is given below:
# training the K-means model on the dataset (n_clusters=5 as found above)
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)
The first line is the same as above for creating the object of KMeans class.
In the second line of code, we have created the dependent variable y_predict to train
the model.
By executing the above lines of code, we will get the y_predict variable. We can check
it under the variable explorer option in the Spyder IDE. We can now compare the
values of y_predict with our original dataset. Consider the below image:
From the above image, we can now relate that the CustomerID 1 belongs to a cluster
3(as index starts from 0, hence 2 will be considered as 3), and 2 belongs to cluster 4,
and so on.
The last step is to visualize the clusters. As we have 5 clusters for our model, so we will
visualize each cluster one by one.
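A sketch of the cluster-visualization code described below; the colours and labels are arbitrary choices:

# Visualizing the five clusters and their centroids
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income ($)')
mtp.ylabel('Spending Score')
mtp.legend()
mtp.show()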
In the above lines of code, we have written code for each of the clusters, ranging from 1 to 5. The first argument of mtp.scatter, i.e., x[y_predict == 0, 0], contains the x values of the feature matrix for the points assigned to cluster 0, and the second index (0 or 1) selects the x or y feature column.
The output image is clearly showing the five different clusters with different colors.
The clusters are formed between two parameters of the dataset; Annual income of
customer and Spending. We can change the colors and labels as per the requirement
or choice. We can also observe some points from the above patterns, which are given
below:
o Cluster1 shows the customers with average salary and average spending so we
can categorize these customers as
o Cluster2 shows the customer has a high income but low spending, so we can
categorize them as careful.
o Cluster3 shows the low income and also low spending so they can be
categorized as sensible.
o Cluster4 shows the customers with low income with very high spending so they
can be categorized as careless.
o Cluster5 shows the customers with high income and high spending so they can
be categorized as target, and these customers can be the most profitable
customers for the mall owner.
HIERARCHICAL METHODS
1. Determine the similarity between individuals and all other clusters. (Find
proximity matrix).
2. Consider each data point as an individual cluster.
3. Combine similar clusters.
4. Recalculate the proximity matrix for each cluster.
5. Repeat step 3 and step 4 until you get a single cluster.
Let’s understand this concept with the help of a graphical representation using a dendrogram.
With the help of the given demonstration, we can understand how the actual algorithm works. No calculation has been done below; all the proximities among the clusters are assumed.
Step 1:
Consider each alphabet (P, Q, R, S, T, V) as an individual cluster and find the distance
between the individual cluster from all other clusters.
Step 2:
Now, merge the comparable clusters into a single cluster. Let’s say cluster Q and cluster R are similar to each other, so we merge them in the second step, and similarly clusters S and T. Finally, we get the clusters [(P), (QR), (ST), (V)].
Step 3:
Here, we recalculate the proximity as per the algorithm and combine the two closest
clusters [(ST), (V)] together to form new clusters as [(P), (QR), (STV)]
Step 4:
Repeat the same process. The clusters QR and STV are comparable and are combined to form a new cluster. Now we have [(P), (QRSTV)].
Step 5:
Finally, the remaining two clusters are merged together to form a single cluster
[(PQRSTV)]
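A small scipy sketch that builds a dendrogram for six illustrative 2-D points standing in for P, Q, R, S, T and V (the coordinates are made up):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Six 2-D points standing in for P, Q, R, S, T, V (illustrative coordinates)
points = np.array([[1, 1], [2, 1], [2.2, 1.1], [5, 5], [5.2, 5.1], [8, 8]])
labels = ["P", "Q", "R", "S", "T", "V"]

Z = linkage(points, method='single')   # agglomerative merging based on proximity
dendrogram(Z, labels=labels)
plt.title('Dendrogram of agglomerative clustering')
plt.show()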
Some of the well-known data mining methods are decision tree analysis, Bayes
theorem analysis, Frequent item-set mining, etc. The software market has many
open-source as well as paid tools for data mining such as Weka, Rapid Miner, and
Orange data mining tools.
The data mining process starts with giving a certain input of data to the data mining
tools that use statistics and algorithms to show the reports and patterns. The results
can be visualized using these tools that can be understood and further applied to
conduct business modification and improvements.
Some of the data mining examples are given below for your reference.
1) Mobile Service Providers
From a large amount of data such as billing information, email, text messages, web data transmissions, and customer service records, the data mining tools can predict "churn", which identifies the customers who are likely to change vendors.
With these results, a probability score is given. The mobile service providers are then
able to provide incentives, offers to customers who are at higher risk of churning.
This kind of mining is often used by major service providers such as broadband,
phone, gas providers, etc.
2) Retail Sector
Data Mining helps the supermarket and retail sector owners to know the choices of
the customers. Looking at the purchase history of the customers, the data mining
tools show the buying preferences of the customers.
With the help of these results, the supermarkets design the placements of products
on shelves and bring out offers on items such as coupons on matching products, and
special discounts on some products.
These campaigns are based on RFM grouping. RFM stands for recency, frequency,
and monetary grouping. The promotions and marketing campaigns are customized
for these segments. The customer who spends a lot but very less frequently will be
treated differently from the customer who buys every 2-3 days but of less amount.
3) Artificial Intelligence
A system is made artificially intelligent by feeding it with relevant patterns. These
patterns come from data mining outputs. The outputs of the artificially intelligent
systems are also analyzed for their relevance using the data mining techniques.
4) Ecommerce
Many E-commerce sites use data mining to offer cross-selling and upselling of their
products. The shopping sites such as Amazon, Flipkart show “People also viewed”,
“Frequently bought together” to the customers who are interacting with the site.
These recommendations are provided using data mining over the purchasing history
of the customers of the website.
5) Science and Engineering
Data mining in computer science helps to monitor system status, improve performance, find software bugs, discover plagiarism, and detect faults. Data mining also helps in analyzing user feedback on products and articles to deduce the opinions and sentiments of the users.
6) Crime Prevention
Data Mining detects outliers across a vast amount of data. The criminal data includes
all details of the crime that has happened. Data Mining will study the patterns and
trends and predict future events with better accuracy.
The agencies can find out which area is more prone to crime, how much police
personnel should be deployed, which age group should be targeted, vehicle numbers
to be scrutinized, etc.
7) Research
Researchers use Data Mining tools to explore the associations between the
parameters under research such as environmental conditions like air pollution and
the spread of diseases like asthma among people in targeted regions.
8) Farming
Farmers use Data Mining to find out the yield of vegetables with the amount of
water required by the plants.
9) Automation
By using data mining, the computer systems learn to recognize patterns among the
parameters which are under comparison. The system will store the patterns that will
be useful in the future to achieve business goals. This learning is automation as it
helps in meeting the targets through machine learning.
11) Transportation
Data Mining helps in scheduling the moving of vehicles from warehouses to outlets
and analyze the product loading patterns.
12) Insurance
Data mining methods help in forecasting the customers who buy the policies, analyze
the medical claims that are used together, find out fraudulent behaviors and risky
customers.
Data Mining in the Finance Sector
The finance sector includes banks, insurance companies, and investment companies. These institutions collect a huge amount of data. This data is often complete, reliable, and of high quality, and it demands systematic data analysis.
To store financial data, data warehouses that store data in the form of data cubes
are constructed. To analyze this data, advanced data cube concepts are used. Data
mining methods such as clustering and outlier analysis, characterization are used in
financial data analysis and mining.
Some cases in finance where data mining is used are given below.
1) Loan Payment Prediction
Data mining methods like attribute selection and attribute ranking will analyze the
customer payment history and select important factors such as payment to income
ratio, credit history, the term of the loan, etc. The results will help the banks decide
its loan granting policy, and also grant loans to the customers as per factor analysis.
2) Targeted Marketing
Clustering and classification data mining methods will help in finding the factors that
influence the customer’s decisions towards banking. Similar behavioral customers’
identification will facilitate targeted marketing.
Data mining helps to identify customer buying behavior, improve customer service,
focus on customer retention, enhance sales, and reduce the cost of businesses.
2) Anomaly Detection
Data mining techniques are deployed to detect any abnormalities in data that may
cause any kind of flaw in the system. The system will scan thousands of complex
entries to perform this operation.
3) System Security
Data Mining tools detect intrusions that may harm the database offering greater
security to the entire system. These intrusions may be in the form of duplicate
entries, viruses in the form of data by hackers, etc.
Some data mining examples of the healthcare industry are given below for your
reference.
1) Healthcare Management
The data mining method is used to identify chronic diseases, track high-risk regions
prone to the spread of disease, design programs to reduce the spread of disease.
Healthcare professionals will analyze the diseases, regions of patients with maximum
admissions to the hospital.
With this data, they will design the campaigns for the region to make people aware
of the disease and see how to avoid it. This will reduce the number of patients
admitted to hospitals.
2) Effective Treatments
Using data mining, the treatments can be improved. By continuous comparison of
symptoms, causes, and medicines, data analysis can be performed to make effective
treatments. Data mining is also used for the treatment of specific diseases, and the
association of side-effects of treatments.
Data Mining For Recommender Systems
The recommended items are either similar to the items queried by the user in the past or are chosen by looking at the preferences of other customers who have a similar taste to the user.
Many techniques like information retrieval, statistics, machine learning, etc are used
in recommender systems.
There are many challenges in this approach. The recommendation system needs to
search through millions of data in real-time.
False negatives are products that were not recommended by the system but the
customer would want them. False-positive are products that were recommended by
the system but not wanted by the customer. Another challenge is the
recommendation for the users who are new without any purchasing history.
An intelligent query answering technique is used to analyze the query and provide
generalized, associated information relevant to the query. For Example: Showing the
review of restaurants instead of just the address and phone number of the
restaurant searched for.
Data Mining For CRM (Customer Relationship Management)
Customer Relationship Management can be reinforced with data mining. Good
customer Relations can be built by attracting more suitable customers, better cross-
selling and up-selling, better retention.
Decision Tree Induction
In a decision tree, the internal nodes represent attribute tests and the leaf nodes represent class labels. Some algorithms used for decision tree induction include Hunt’s Algorithm, CART, ID3, C4.5, SLIQ, and SPRINT.
1) Banks are the first users of data mining technology as it helps them with credit
assessment. Data mining analyzes what services offered by banks are used by
customers, what type of customers use ATM cards and what do they generally buy
using their cards (for cross-selling).
Banks use data mining to analyze the transactions that customers perform before they decide to change banks, in order to reduce customer attrition. Also, some outliers in transactions are analyzed for fraud detection.
ADVANCED TOPICS:
As the amount of research and industry data being collected daily continues to grow,
intelligent software tools are increasingly needed to process and filter the data, detect
new patterns and similarities within it, and extract meaningful information from it.
Data mining and predictive modeling offer a means of effective classification and
analysis of large, complex, multi-dimensional data, leading to discovery of functional
models, trends and patterns.
Building upon the skills learned in previous courses, this course covers advanced data
mining, data analysis, and pattern recognition concepts and algorithms, as well as
models and machine learning algorithms.
Topics include:
o Radial-basis functions
o Recurrent neural networks
Probability graph models and Bayesian learning
Hidden Markov models
Support vector machines
Ensemble learning: bagging, boosting, stacking
Random forests
Data mining tools
Text mining
DATA MINING VS BIG DATA
Data Mining uses tools such as statistical models, machine learning, and visualization to "mine" (extract) useful data and patterns from Big Data, whereas Big Data deals with high-volume and high-velocity data, which is challenging to handle in older databases and analysis programs.
Big Data:
Big Data refers to vast amounts of data that can be structured, semi-structured, or unstructured, ranging in size up to terabytes and beyond. It is challenging to process such a huge amount of data on a single system, because the RAM of the computer has to store the interim calculations during processing and analysis. When we try to process such a huge amount of data, it takes a lot of time to perform these processing steps on a single system, and the computer system may not work correctly due to the overload.
Here we will understand the concept (how much data is produced) with a live example. We all know about Big Bazaar. We as customers go to Big Bazaar at least once a month. These stores monitor each product that customers purchase from them, and from which store location around the world. They have a live information-feeding system that stores all the data in huge central servers. The number of Big Bazaar stores in India alone is around 250. Monitoring every single item purchased by every customer, along with the item description, makes the data grow to around 1 TB in a month.
We know some promotions run in Big Bazaar on certain items. Do we genuinely believe Big Bazaar would run those promotions without any analysis to confirm that they would increase sales and generate a surplus? That is where Big Data analysis plays a vital role. Using data analysis techniques, Big Bazaar targets its new customers as well as existing customers to purchase more from its stores.
Big data comprises of 5Vs that is Volume, Variety, Velocity, Veracity, and Value.
Volume: In Big Data, volume refers to the amount of data, which can be extremely large.
Variety: In Big Data, variety refers to the various types of data, such as web server logs, social media data, and company data.
Veracity: In Big Data, veracity refers to the uncertainty or trustworthiness of the data, i.e., how accurate and reliable the collected data is.
Velocity: In Big Data, velocity refers to how data is growing with respect to time. In
general, data is increasing exponentially at a very fast rate.
Value: In Big Data, value refers to the data which we are storing, and processing is
valuable or not and how we are getting the advantage of these huge data sets.
A very efficient framework, known as Hadoop, is primarily used for Big Data processing. It is open-source software that works on a distributed parallel processing model. Its main modules are:
Hadoop Common: the common utilities and libraries that support the other Hadoop modules.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
Data Mining:
As the name suggests, Data Mining refers to the mining of huge data sets to identify trends and patterns and to extract useful information.
In data mining, we are looking for hidden information, but without any idea about what exact type of data we are looking for or what we plan to use it for once we find it. When we discover interesting information, we start thinking about how to make use of it to boost the business.
UNIT - V 51
[ADVANCED DATABASE AND MINING]
For example, a data miner starts exploring the call records of a mobile network operator without any specific target from his manager, beyond the broad objective of discovering at least a few new patterns in a month. As he analyses the data, he discovers a pattern: there are more international calls on Fridays than on other days. He shares this finding with management, and they come up with a plan to reduce international call rates on Fridays and start a campaign. Call duration goes up, customers are happy with the lower rates, more customers join, and the organization makes more profit because utilization has increased.
Data Integration:
In the first step, data are collected and integrated from various sources.
Data Selection:
We may not need all the data that has been collected, so in this step we select only the data we consider useful for data mining.
Data Cleaning:
In this step, the information we have collected is not yet clean: it may contain errors, noise, inconsistencies, or missing values, so we implement various strategies to get rid of such problems.
Data Transformation:
Even after cleaning, the data is not ready for mining, so we need to transform it into forms suitable for mining. The methods used to achieve this are aggregation, normalization, smoothing, etc.
Data Mining:
Once the data has been transformed, we are ready to apply data mining methods to extract useful information and patterns from the data sets. Techniques such as clustering and association rules are among the many techniques used for data mining.
Pattern Evaluation:
In this step, the discovered patterns are evaluated, and only those that are genuinely interesting and useful are retained.
Decision:
It is the last step in data mining. It helps users make use of the acquired knowledge to take better, data-driven decisions.
Data Mining | Big Data
It can be used for large-volume as well as low-volume data. | It deals with huge volumes of data.
It is a technique primarily used for data analysis. | It is a broad concept rather than a single technique.
It mainly uses structured, relational, and dimensional data. | It uses structured, semi-structured, and unstructured data.
But what about problems whose answers are not clear? That is where the traditional approach fails, and that is why neural networks came into existence. Neural networks process information in a way similar to the human brain, and these networks actually learn from examples; you cannot program them to perform a specific task. They learn from past experience and examples, which is why you don't need to provide all the information about a specific task up front.
Neural networks are modeled on the human brain so as to imitate its functionality. Just as the human brain can be viewed as a network made up of many neurons, an Artificial Neural Network is made up of numerous perceptrons.
o Input layer: The input layer accepts all the inputs that are provided by the
programmer.
o Hidden layer: In between the input and output layer, there is a set of hidden
layers on which computations are performed that further results in the output.
o Output layer: After the input layer undergoes a series of transformations while
passing through the hidden layer, it results in output that is delivered by the
output layer.
Basically, a neural network is based on neurons, which are nothing but brain cells. A biological neuron receives inputs from other sources, combines them in some way, performs a generally nonlinear operation on the result, and produces the final output.
The dendrites act as receivers that pick up signals from other neurons and pass them on to the cell body. The cell body performs some operations, which can be a summation, multiplication, etc. After the operations are performed on the set of inputs, the result is transferred to the next neuron via the axon, which is the transmitter of the neuron's signal.
Artificial Neural Networks are computing systems designed to simulate the way the human brain analyzes and processes information. They have self-learning capabilities that enable them to produce better results as more data becomes available; if the network is trained on more data, it becomes more accurate, because these neural networks learn from examples. A neural network can be configured for specific applications such as data classification, pattern recognition, etc.
With the help of neural networks, a lot of technology has evolved, from translating webpages into other languages to having a virtual assistant order groceries online. All of these things are possible because of neural networks. So, an artificial neural network is nothing but a network of various artificial neurons.
o Without Neural Network: Let's have a look at the example given below. Here we have a machine that we have trained with four types of cats, as shown in the image below. Once the training is done, we provide a random image of a cat to that machine. Since this cat is not similar to the cats on which we trained our system, without a neural network the machine would not identify the cat in the picture; basically, the machine would get confused about where the cat is.
o With Neural Network: With a neural network, however, even if we have not trained our machine on that particular cat, it can still identify certain features of cats that it learned during training, match those features with the cat in the image, and identify the cat. With the help of this example, you can clearly see the importance of the concept of a neural network.
Instead of directly getting into the working of Artificial Neural Networks, let's break down and try to understand the Neural Network's basic unit, which is called a Perceptron.
A perceptron can be defined as a single-layer neural network that classifies linearly separable data. It consists of four major components, which are as follows:
1. Inputs
2. Weights and Bias
3. Summation Functions
4. Activation or transformation function
The inputs (x) are fed into the input layer and multiplied by the allotted weights (w); the products are then added to form a weighted sum. This weighted sum is then passed through the pertinent activation function.
As and when the input variable is fed into the network, a random value is given as a
weight of that particular input, such that each individual weight represents the
importance of that input in order to make correct predictions of the result.
Summation Function
After the weights are assigned to the input, it then computes the product of each input
and weights. Then the weighted sum is calculated by the summation function in which
all of the products are added.
Activation Function
In the linear activation function, the output of functions is not restricted in between
any range. Its range is specified from -infinity to infinity. For each individual neuron,
the inputs get multiplied with the weight of each respective neuron, which in turn
leads to the creation of output signal proportional to the input. If all the input layers
are linear in nature, then the final activation of the last layer will actually be the linear
function of the initial layer's input.
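To make the summation and activation steps concrete, here is a minimal sketch of a single perceptron forward pass in Python; the input values, weights, bias, and the simple step activation are hypothetical choices for illustration, not values from the text.

import numpy as np

def perceptron_output(inputs, weights, bias):
    # summation function: weighted sum of the inputs plus the bias
    weighted_sum = np.dot(inputs, weights) + bias
    # step activation: output 1 if the weighted sum is positive, else 0
    return 1 if weighted_sum > 0 else 0

x = np.array([1.0, 0.5, -0.2])   # hypothetical inputs
w = np.array([0.4, -0.6, 0.9])   # hypothetical weights
print(perceptron_output(x, w, bias=0.1))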
Non-linear activation functions are among the most widely used activation functions. They help the model generalize and adapt to all sorts of data and perform correct differentiation among the outputs. They solve the following problems faced by linear activation functions:
The non-linear activation function is further divided into the following parts:
4. Softmax Function
It is a kind of sigmoid function used for solving classification problems. It is mainly used to handle multiple classes: it squeezes the output for each class to a value between 0 and 1 and then divides each output by the sum of the outputs. This kind of function is typically used by the classifier in the output layer.
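As an illustration, here is a minimal NumPy sketch of the softmax computation described above; the raw class scores are hypothetical values chosen only for the example.

import numpy as np

def softmax(z):
    # subtract the maximum for numerical stability, exponentiate, then normalize
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs for three classes
print(softmax(scores))               # probabilities between 0 and 1 that sum to 1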
Gradient Descent Procedure:
It all starts with the initial value of the function's coefficient, which may be either 0.0 or any small arbitrary value.
coefficient = 0.0
For estimating the cost of the coefficients, they are plugged into the function that
helps in evaluating.
cost = f(coefficient)
or, cost = evaluate(f(coefficient))
Next, the derivative is calculated; the derivative is a concept from calculus that gives the function's slope at any given point. We need to calculate the slope to know the direction in which to move the coefficient value so as to achieve a lower cost in the next iteration.
delta = derivative(cost)
Now that we have found the downhill direction, we can update the value of the coefficient. We also need to specify alpha, a learning-rate parameter that controls how much the coefficient is changed on each update:
coefficient = coefficient - (alpha * delta)
Until the cost of the coefficient reaches 0.0 or somewhat close enough to it, the whole
process will reiterate again and again.
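Putting the whole procedure together, here is a minimal Python sketch of gradient descent on a one-dimensional cost function; the cost function, its derivative, the learning rate, and the number of iterations are all hypothetical choices for illustration.

def evaluate(coefficient):
    # hypothetical cost function with its minimum at coefficient = 3
    return (coefficient - 3.0) ** 2

def derivative(coefficient):
    # slope of the cost function above
    return 2.0 * (coefficient - 3.0)

coefficient = 0.0   # initial value
alpha = 0.1         # learning rate
for _ in range(100):                     # repeat until the cost is close enough to zero
    delta = derivative(coefficient)      # slope at the current coefficient
    coefficient = coefficient - alpha * delta
print(coefficient, evaluate(coefficient))  # coefficient approaches 3, cost approaches 0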
In batch gradient descent, every iteration processes all of the training examples. When we have a large number of training examples, batch gradient descent turns out to be expensive and therefore less preferable.
Now assume that hθ represents the hypothesis for linear regression and Σ computes the sum over all training examples from i = 1 to m. Then the cost function is computed by:
J(θ) = (1 / (2m)) * Σ_{i=1..m} (hθ(x(i)) − y(i))²
and batch gradient descent repeatedly updates every parameter using all m examples:
Repeat {
    θj := θj − α * (1/m) * Σ_{i=1..m} (hθ(x(i)) − y(i)) * xj(i)    (simultaneously for every j)
}
Here xj(i) denotes the jth feature of the ith training example. If m is very large, every single update requires a pass over all m examples, which makes each iteration very expensive.
Stochastic gradient descent processes only one training example per iteration, which means that all the parameters are updated after each single training example is processed. It tends to be much faster than batch gradient descent, but because it still handles one example at a time, a huge number of training examples leads to a large number of iterations. To train the parameters evenly on every type of data, the dataset should be properly shuffled.
Repeat {
    For i = 1 to m {
        θj := θj − α * (hθ(x(i)) − y(i)) * xj(i)    (simultaneously for every j)
    }
}
The Batch Gradient Descent algorithm follows a straight-line path towards the minimum. It converges to the global minimum if the cost function is convex, and to a local minimum otherwise. Here the learning rate is typically kept constant.
In the case of Stochastic Gradient Descent, however, the algorithm fluctuates around the global minimum rather than converging smoothly, so the learning rate is decreased slowly to allow it to converge. Since it processes only one example per iteration, its path is noisy.
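A minimal NumPy sketch of the stochastic update for simple linear regression (one parameter, no intercept); the data, learning rate, and number of epochs are hypothetical and chosen only to illustrate the one-example-at-a-time update with shuffling.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])      # hypothetical feature values
y = np.array([2.1, 3.9, 6.2, 8.0])      # hypothetical targets, roughly y = 2x
theta, alpha = 0.0, 0.01                # initial parameter and learning rate

for epoch in range(100):
    for i in np.random.permutation(len(X)):   # shuffle so each pass sees the data in a new order
        error = theta * X[i] - y[i]           # error on a single training example
        theta = theta - alpha * error * X[i]  # update after every single example
print(theta)                                   # approaches 2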
Backpropagation
The backpropagation consists of an input layer of neurons, an output layer, and at least
one hidden layer. The neurons perform a weighted sum upon the input layer, which is
then used by the activation function as an input, especially by the sigmoid activation
function. It also makes use of supervised learning to teach the network. It constantly
updates the weights of the network until the desired output is met by the network. It
includes the following factors that are responsible for the training and performance of
the network:
Working of Backpropagation
Need of Backpropagation
Building an ANN
Before starting with building an ANN model, we will require a dataset on which our
model is going to work. The dataset is the collection of data for a particular problem,
which is in the form of a CSV file.
CSV stands for Comma-Separated Values; a CSV file saves the data in a tabular format. We are using a fictional bank dataset that contains the details of 10,000 customers. We are doing all this because the bank is seeing unusual churn rates, i.e., customers are leaving at an unusually high rate, and it wants to know the reason so that it can assess and address the problem.
Here we are going to solve this business problem using artificial neural networks. The
problem that we are going to deal with is a classification problem. We have several
independent variables like Credit Score, Balance, and Number of Products on the basis
of which we are going to predict which customers are leaving the bank. Basically, we
are going to do a classification problem, and artificial neural networks can do a terrific
job at making such kind of predictions.
So, we will start with installing the Keras library, TensorFlow library, as well as
the Theano library on Anaconda Prompt, and for that, you need to open it as
administrator followed by running the commands one after other as given below.
Now that we are done with the installation, the next step is to update all these libraries
to the most updated version, and it can be done by following the given code.
Since we are doing it for the very first time, it will ask whether to proceed or not.
Confirm it with y and press enter.
After the libraries are updated successfully, we will close the Anaconda prompt and
get back to the Spyder IDE.
Now we will start building our model in two parts: in the first part, we will do the data pre-processing, and in the second part, we will create the ANN model.
Data pre-processing is necessary to prepare the data correctly for building a deep learning model. Since we are dealing with a classification problem, we have some independent variables containing information about the bank's customers, and we are trying to predict the binary outcome of the dependent variable, i.e., 1 if the customer leaves the bank and 0 if the customer stays.
NumPy
NumPy is a Python library (short for Numerical Python) that allows the implementation of linear-algebra, mathematical, and logical operations on arrays, as well as Fourier transforms and routines for manipulating array shapes.
1. import numpy as np
Matplotlib
Matplotlib is an open-source library with whose help charts can be plotted in Python. The sole purpose of this library is to visualize the data, for which we need to import its pyplot sub-library.
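The import statement itself is not shown in the text; it would typically be:

import matplotlib.pyplot as plt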
Pandas
Pandas is an open-source library that provides high-performance data structures, such as the DataFrame, and data analysis tools; we will use it to import and manage the dataset.
1. import pandas as pd
An output image is given below, which shows that the libraries have been successfully
imported.
Next, we will import the data file from the current working directory with the help of Pandas. We will use read_csv() to read the CSV file, either locally or from a URL.
1. dataset = pd.read_csv('Churn_Modelling.csv')
In the code given above, dataset is the name of the variable in which we are going to save the data, and we have passed the name of the data file to read_csv(). Once the code is run, we can see that the data is uploaded successfully.
By clicking on the Variable explorer and selecting the dataset, we can check the
dataset, as shown in the following image.
Next, we will create the matrix of features, which is simply the matrix of the independent variables. Since we don't know in advance which independent variable has the most impact on the dependent variable, the artificial neural network will spot this by looking at the correlations and will give bigger weights to the independent variables that have the most impact.
So, we will include all the independent variables from the credit score to the last one, which is the estimated salary.
1. X = dataset.iloc[:, 3:13].values
After running the above code, we will see that we have successfully created the matrix
of feature X. Next, we will create a dependent variable vector.
1. y = dataset.iloc[:, 13].values
By clicking on y, we can have a look that y contains binary outcome, i.e., 0 or 1 for all
the 10,000 customers of the bank.
Next, we will split the dataset into a training set and a test set. But before that, we need to encode the matrix of features, as it contains categorical data. The dependent variable is also categorical, but it already takes numerical values (0 and 1), so it does not need to be encoded. The independent variables, however, include string categories, so we do need to encode the categorical independent variables.
The encoding of the categorical data is done before splitting because both the matrix of features X and the dependent variable y must be fully numerical before the split.
Before encoding the categorical independent variables, we will have a look at our matrix of features; for that, we just need to type X at the console.
Output:
From the image given above, we can see that we have only two categorical
independent variables, which is the country variable containing three countries, i.e.,
France, Spain, and Germany, and the other one is the gender variable, i.e., male and
female. So, we have got these two variables, which we will encode in our matrix of
features.
So we will need to create two label encoder objects. We create our first label encoder object, named labelencoder_X_1, and apply its fit_transform method to encode this variable, which will, in turn, convert the strings France, Spain, and Germany into the numbers 0, 1, and 2.
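The snippet itself is not reproduced in the text; a minimal sketch consistent with the object name labelencoder_X_1 mentioned above (the country column is assumed to be at index 1 of X):

from sklearn.preprocessing import LabelEncoder

labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])   # encode the country column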
After executing the code, we will now have a look at the X variable, simply by pressing
X in the console, as we did in the earlier step.
Output:
So, from the output image given above, we can see that France became 0, Germany
became 1, and Spain became 2.
Now in a similar manner, we will do the same for the other variable, i.e., Gender
variable but with a new object.
1. labelencoder_X_2 = LabelEncoder()
2. X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
Output:
We can clearly see that females became 0 and males became 1. Since there is no ordinal relationship between the categories of our categorical variables, we need to create dummy variables for the country variable, because it contains three categories; the gender variable has only two categories, so it is useless to create dummy variables for it. We will later remove one of the country dummy columns to avoid the dummy variable trap. We will use the OneHotEncoder class (through a ColumnTransformer) to create the dummy variables.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

transformer = ColumnTransformer(
    transformers=[
        ("OneHot",         # just a name
         OneHotEncoder(),  # the transformer class
         [1])              # the column(s) to be applied on
    ],
    remainder='passthrough'  # don't apply anything to the remaining columns
)
X = transformer.fit_transform(X.tolist())
X = X.astype('float64')
Output:
By having a look at X, we can see that all the columns are of the same type now. Also,
the type is no longer an object but float64. We can see that we have twelve
independent variables because we have three new dummy variables.
Next, we will remove one dummy variable to avoid falling into a dummy variable trap.
We will take a matrix of features X and update it by taking all the lines of this matrix
and all the columns except the first one.
1. X = X[:, 1:]
Output:
It can be seen that we are left with only two dummy variables, so no more dummy
variable trap.
Now we are ready to split the dataset into the training set and the test set. We set the test size to 0.2, so the ANN will be trained on 8,000 observations and its performance tested on 2,000 observations.
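The split code itself is not reproduced in the text; a minimal sketch using scikit-learn, with the variable names used above (the fixed random_state is an assumption added for reproducibility):

from sklearn.model_selection import train_test_split

# 80/20 split: 8,000 observations for training, 2,000 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)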
By executing the code given above, we will get four different variables that can be seen
under the variable explorer section.
Output:
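The next pre-processing step is feature scaling, whose code is also not reproduced in the text; a minimal sketch using scikit-learn's StandardScaler, with the variable names used above:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # fit the scaler on the training set, then scale it
X_test = sc.transform(X_test)         # scale the test set with the same parameters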
After executing the above code, we can have a quick look at X_train and X_test to
check if all the independent variables are scaled properly or not.
Now that our data is well pre-processed, we will start by building an artificial neural
network.
We will start by importing the Keras library and the required packages; Keras will build the neural network on top of TensorFlow.
1. import keras
After importing the Keras library, we will now import two modules, i.e., the Sequential
module, which is required to initialize our neural network, and the Dense module that
is needed to build the layer of our ANN.
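The import lines are not reproduced in the text; with a standalone Keras installation they would typically look like the following (with newer TensorFlow versions, the same classes live under tensorflow.keras):

from keras.models import Sequential   # to initialize the neural network
from keras.layers import Dense        # to build the layers of the ANN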
Next, we will initialize the ANN, or simply we can say we will be defining it as a
sequence of layers. The deep learning model can be initialized in two ways, either by
defining the sequence of layers or defining a graph. Since we are going to make our
ANN with successive layers, so we will initialize our deep learning model by defining it
as a sequence of layers.
It can be done by creating an object of the Sequential class. The object we create is the model itself, i.e., a neural network that will act as a classifier, because we are solving a classification problem in which we have to predict a class. Since in a later step we will predict the test set results using this object, we will call our model classifier; it is nothing but the future artificial neural network that we are going to build.
Since this classifier is an object of the Sequential class, we will not pass any argument to it, because we will define the layers step by step, starting with the input layer, then adding some hidden layers, and then the output layer.
1. classifier = Sequential()
After this, we will add the input layer and the first hidden layer. We take the classifier that we initialized in the previous step and use its add() method to add the different layers of our neural network. In add(), we pass a layer object; since we are going to add two layers at once, i.e., the input and first hidden layer, we do this with the Dense() function mentioned above, whose arguments are described below (a sketch of the resulting add() calls appears after the output-layer discussion).
o units are the very first argument, which can be defined as the number of nodes
that we want to add in the hidden layer.
o The second argument is the kernel_initializer that randomly initializes the
weight as a small number close to zero so that they can be randomly initialized
with a uniform function. Here we have a simple uniform function that will
initialize the weight according to the uniform distribution.
o The third argument is the activation, which is the activation function that we want to use in the hidden layer. We will use the rectifier function for the hidden layers and the sigmoid function for the output layer. Since we are in a hidden layer, we use the "relu" parameter, as it corresponds to the rectifier function.
o The last argument is input_dim, which specifies the number of nodes in the input layer, i.e., the number of independent variables. It is necessary to add this argument because, so far, we have only initialized our ANN and have not created any layer yet, so the hidden layer we are creating does not know how many inputs to expect. After the first hidden layer is created, we don't need to specify this argument for the next hidden layers.
Next, we will add the second hidden layer by using the same add method followed by
passing the same parameter, which is the Dense() as well as the same parameters
inside it as we did in the earlier step except for the input_dim.
After adding the two hidden layers, we will now add the final output layer. This is similar to the previous step, except that we change the units parameter: in the output layer we only need one node, because our dependent variable is a categorical variable with a binary outcome, and a binary outcome requires only one node in the output layer. Therefore, we set units equal to 1, and since we are in the output layer, we replace the rectifier function with the sigmoid activation function.
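The add() calls themselves are not reproduced in the text; a minimal sketch under the assumptions stated above (11 input features after removing one dummy column; 6 nodes per hidden layer is an assumed choice, a common rule of thumb being the average of the input and output node counts):

# input layer + first hidden layer: 11 inputs, 6 hidden nodes, rectifier activation
classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu', input_dim=11))
# second hidden layer: input_dim is no longer needed
classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))
# output layer: a single node with the sigmoid activation for the binary outcome
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))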
As we are done with adding the layers of our ANN, we will now compile the whole artificial neural network by applying stochastic gradient descent. We take our classifier object, call the compile method, and pass the following arguments (a sketch of the call follows the list):
o The first argument is the optimizer, which is simply the algorithm we want to use to find the optimal set of weights in the neural network. The algorithm we are going to use is a stochastic gradient descent algorithm; among the several types of stochastic gradient descent algorithms, a very efficient one is called "adam", which is the value we pass for this optimizer parameter.
o The second argument is the loss, the loss function within the stochastic gradient descent algorithm that is used to find the optimal weights. Since our dependent variable has a binary outcome, we use the binary_crossentropy logarithmic loss; if the dependent variable had more than two categories, we would use categorical_crossentropy instead.
o The last argument is metrics, which is simply the criterion used to evaluate our model; here we use "accuracy". As the weights are updated during training, the algorithm reports this accuracy metric so that we can track the model's performance.
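The compile call is not reproduced in the text; a minimal sketch with the arguments described above:

classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])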
Next, we will fit the ANN to the training set using the fit method. In the fit method, we pass the following arguments (a sketch of the call follows the list):
o The first argument is the data on which we want to train our classifier, i.e., the training set, passed as two arguments: X_train (the matrix of features containing the observations of the training set) and y_train (the actual outcomes of the dependent variable for all the observations in the training set).
o The next argument is the batch_size, which is the number of observations after which we want to update the weights.
o And lastly, the number of epochs, which lets us watch the algorithm in action and see the improvement in accuracy over the successive epochs.
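The fit call is not reproduced in the text; a minimal sketch, where the batch size of 10 and 100 epochs are assumed values commonly used with this dataset rather than values given in the text:

classifier.fit(X_train, y_train, batch_size=10, epochs=100)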
From the output image given above, you can see that our model is ready and has reached an accuracy of approximately 84%; this is how the stochastic gradient descent algorithm performs.
Since we are done with training the ANN on the training set, we will now make predictions on the test set.
1. y_pred = classifier.predict(X_test)
From the output image given above, we can see, for each of the 2,000 customers of the test set, the probability that the customer will leave the bank. For example, if the first probability is about 20%, it means that the first customer of the test set, indexed by zero, has roughly a 20% chance of leaving the bank.
Since the predict method returns the probability that each customer leaves the bank, and the confusion matrix needs the predicted results in the form of True or False rather than probabilities, we have to transform these probabilities into predicted results.
We choose a threshold value to decide when the predicted result is one and when it is zero: we predict 1 above the threshold and 0 below it. The natural threshold to take is 0.5, i.e., 50%. If y_pred is larger than the threshold, it returns True, otherwise False.
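The thresholding line is not reproduced in the text; a minimal sketch:

y_pred = (y_pred > 0.5)   # True if the predicted probability is above 50%, else False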
Now, if we have a look at y_pred, we will see that it has updated the results in the
form of "False" or "True".
So, the first five customers of the test set don't leave the bank according to the model,
whereas the sixth customer in the test set leaves the bank.
Next, we will execute the following code to get the confusion matrix.
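Since the original snippet is not reproduced here, a minimal sketch using scikit-learn's confusion_matrix, with the variable names used above:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)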
From the output given above, we can see that out of 2,000 new observations, we get 1542 + 141 = 1683 correct predictions and 264 + 53 = 317 incorrect predictions.
So, now we will compute the accuracy on the console, which is the number of correct
predictions divided by the total number of predictions.
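Computed on the console, with the counts from the confusion matrix above:

accuracy = (1542 + 141) / 2000   # correct predictions divided by total predictions
print(accuracy)                  # approximately 0.84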
So, we can see that we got an accuracy of about 84% on new observations on which the ANN was not trained, which is a good level of accuracy. It is roughly the same accuracy as we obtained on the training set, but this time it is obtained on the test set.
So, eventually, we can validate our model, and now the bank can use it to make a
ranking of their customers, ranked by their probability to leave the bank, from the
customer that has the highest probability to leave the bank, down to the customer
that has the lowest probability to leave the bank.
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data
point in the correct category in the future. This best decision boundary is called a
hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs, so if we
want a model that can accurately identify whether it is a cat or dog, so such a model
can be created by using the SVM algorithm. We will first train our model with lots of
images of cats and dogs so that it can learn about different features of cats and dogs,
and then we test it with this strange creature. So as support vector creates a decision
boundary between these two data (cat and dog) and choose extreme cases (support
vectors), it will see the extreme case of cat and dog. On the basis of the support
vectors, it will classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means if a
dataset can be classified into two classes by using a single straight line, then
such data is termed as linearly separable data, and classifier is used called as
Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separated data,
which means if a dataset cannot be classified by using a straight line, then such
data is termed as non-linear data and classifier used is called as Non-linear SVM
classifier.
The dimensions of the hyperplane depend on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane. We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the data points of the two classes.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the
position of the hyperplane are termed as Support Vector. Since these vectors support
the hyperplane, hence called a Support vector.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose
we have a dataset that has two tags (green and blue), and the dataset has two features
x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either
green or blue. Consider the below image:
As this is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane. SVM algorithm finds the closest point of
the lines from both the classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called as margin. And the goal of SVM is to
maximize this margin. The hyperplane with maximum margin is called the optimal
hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the
below image:
Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we
convert it in 2d space with z=1, then it will become as:
Now we will implement the SVM algorithm using Python. Here we will use the same
dataset user_data, which we have used in Logistic regression and KNN classification.
Till the Data pre-processing step, the code will remain the same. Below is the code:
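Since the original code block is not reproduced here, a compact sketch of the usual pre-processing steps for this dataset; the column indices (Age and EstimatedSalary as features, Purchased as the target), the 25% test split, and the random_state are assumptions carried over from the earlier classification examples:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values   # Age and EstimatedSalary (assumed columns)
y = dataset.iloc[:, 4].values        # Purchased (assumed column)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

sc = StandardScaler()                # feature scaling
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)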
After executing the above code, we will pre-process the data. The code will give the
dataset as:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier,
we will import SVC class from Sklearn.svm library. Below is the code for it:
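A minimal sketch, consistent with the kernel='linear' choice described below; the random_state is an assumed value:

from sklearn.svm import SVC

classifier = SVC(kernel='linear', random_state=0)   # linear kernel for linearly separable data
classifier.fit(x_train, y_train)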
In the above code, we have used kernel='linear', as here we are creating an SVM for linearly separable data; however, we could change it for non-linear data. We then fitted the classifier to the training data (x_train, y_train).
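The prediction step itself is not shown in the text; a one-line sketch:

y_pred = classifier.predict(x_test)   # predict the test set results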
After getting the y_pred vector, we can compare the result of y_pred and y_test to
check the difference between the actual value and predicted value
As we can see in the above output image, there are 66 + 24 = 90 correct predictions and 8 + 2 = 10 incorrect predictions. Therefore, we can say that our SVM model improved compared to the Logistic Regression model.
As we can see, the above output is appearing similar to the Logistic regression output.
In the output, we got the straight line as hyperplane because we have used a linear
kernel in the classifier. And we have also discussed above that for the 2d space, the
hyperplane in SVM is a straight line.
Output:
As we can see in the above output image, the SVM classifier has divided the users into two regions (Purchased or Not Purchased). Users who purchased the SUV are in the red region with the red scatter points, and users who did not purchase the SUV are in the green region with the green scatter points. The hyperplane has thus separated the two classes, Purchased and Not Purchased.
Bagging Vs Boosting
We all use the decision tree technique in day-to-day life to make decisions. Organizations use supervised machine learning techniques like decision trees to make better decisions and to generate more surplus and profit.
Two techniques used to build ensembles of decision trees are given below.
Bagging
Bagging is used when our objective is to reduce the variance of a decision tree. The idea is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset is then used to train its own decision tree, so we end up with an ensemble of different models. The average of the predictions from the numerous trees is used, which is more robust than a single decision tree.
Random Forest is an extension of bagging. It takes one additional step: besides taking a random subset of the data, it also takes a random selection of features, rather than using all features, to grow the trees. When we have numerous random trees, it is called a Random Forest.
These are the following steps which are taken to implement a Random forest:
o Let us consider a training data set with X observations and Y features. First, a sample is taken from the training data set randomly with replacement.
o Each tree is grown to its largest extent.
o The given steps are repeated, and the final prediction is based on the collection of predictions from the n trees.
Since the final prediction is based on the mean of the predictions from the subset trees, it won't give a precise value for the regression model.
Boosting:
If a given input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly; by combining the whole set of hypotheses at the end, weak learners are converted into a better-performing model.
Gradient boosting utilizes a gradient descent algorithm that can optimize any differentiable loss function. The trees of the ensemble are built one at a time and summed successively; each new tree tries to recover the loss (the difference between the actual and predicted values).
Bagging | Boosting
Various training data subsets are randomly drawn with replacement from the whole training dataset. | Each new subset contains the components that were misclassified by previous models.
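As an illustration, a minimal scikit-learn sketch of the two techniques; the estimator counts and other parameters are arbitrary choices for the example (BaggingClassifier uses a decision tree as its default base learner):

from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# bagging: many trees, each trained on a bootstrap sample drawn with replacement
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# boosting: trees added sequentially, each one trying to correct the errors of the previous ones
boosting = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)

# both expose the usual fit/predict interface, e.g.:
# bagging.fit(X_train, y_train); boosting.fit(X_train, y_train)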
STACKING
This ensemble technique works by combining the predictions of multiple weak learners by means of a meta-learner, so that a better output prediction model can be achieved.
Architecture of Stacking
The architecture of the stacking model is designed in such a way that it consists of two or more base (learner) models and a meta-model that combines the predictions of the base models. The base models are called level-0 models, and the meta-model is known as the level-1 model. So, the stacking ensemble method involves original (training) data, primary-level models, primary-level predictions, a secondary-level model, and the final prediction. The basic architecture of stacking can be represented as shown in the image below.
o Original data: This data is divided into n-folds and is also considered test data
or training data.
o Base models: These models are also referred to as level-0 models. These
models use training data and provide compiled predictions (level-0) as an
output.
o Level-0 Predictions: Each base model is trained on some training data and provides different predictions, which are known as level-0 predictions.
o Meta Model: The architecture of the stacking model consists of one meta-
model, which helps to best combine the predictions of the base models. The
meta-model is also known as the level-1 model.
o Level-1 Prediction: The meta-model learns how to best combine the
predictions of the base models and is trained on different predictions made by
individual base models, i.e., data not used to train the base models are fed to
the meta-model, predictions are made, and these predictions, along with the
expected outputs, provide the input and output pairs of the training dataset
used to fit the meta-model.
There are some important steps to implementing stacking models in machine learning (a compact scikit-learn sketch follows the list). These are as follows:
o Split training data sets into n-folds using the RepeatedStratifiedKFold as this is
the most common approach to preparing training datasets for meta-models.
o Now a base model is fitted on the first n-1 folds, and it makes predictions for the nth fold.
o The predictions made in the above step are added to the x1_train list.
o Steps 2 and 3 are repeated for the remaining folds, which gives an x1_train array covering all n folds.
o Now, the model is trained on all the n parts, which will make predictions for the
sample data.
o Add this prediction to the y1_test list.
o In the same way, we can find x2_train, y2_test, x3_train, and y3_test by using
Model 2 and 3 for training, respectively, to get Level 2 predictions.
o Now train the Meta model on level 1 prediction, where these predictions will
be used as features for the model.
o Finally, Meta learners can now be used to make a prediction on test data in the
stacking model.
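A compact sketch using scikit-learn's StackingClassifier; the choice of a decision tree and a k-nearest-neighbours model as level-0 learners and logistic regression as the meta-model is illustrative only:

from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# level-0 (base) models
base_models = [('tree', DecisionTreeClassifier()), ('knn', KNeighborsClassifier())]

# out-of-fold predictions of the base models become the training data of the meta-model
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),  # level-1 (meta) model
                           cv=cv)
# stack.fit(X_train, y_train); stack.predict(X_test)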
There are some other ensemble techniques that can be considered the forerunner of
the stacking method. For better understanding, we have divided them into the
different frameworks of essential stacking so that we can easily understand the
differences between methods and the uniqueness of each technique. Let's discuss a
few commonly used ensemble techniques related to stacking.
Voting ensembles:
This is one of the simplest stacking ensemble methods, which uses different algorithms
to prepare all members individually. Unlike the stacking method, the voting ensemble
uses simple statistics instead of learning how to best combine predictions from base
models separately.
The voting ensemble differs from the stacking ensemble in that it does not weigh models based on each member's performance; here, all models are considered to have the same skill level.
Member Assessment: In the voting ensemble, all members are assumed to have the
same skill sets.
Combine with Model: Instead of using combined prediction from each member, it
uses simple statistics to get the final prediction, e.g., mean or median.
The weighted average ensemble is considered the next level of the voting ensemble,
which uses a diverse collection of model types as contributing members. This method
uses some training datasets to find the average weight of each ensemble member
based on their performance. An improvement over this naive approach is to weigh
each member based on its performance on a hold-out dataset, such as a validation set
or out-of-fold predictions during k-fold cross-validation. Furthermore, it may also
involve tuning the coefficient weightings for each model using an optimization
algorithm and performance on a holdout dataset.
Combine With Model: It considers the weighted average of prediction from each
member separately.
Blending Ensemble:
Combine With Model: Linear model (e.g., linear regression or logistic regression).
RANDOM FORESTS
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the
model.
The greater number of trees in the forest leads to higher accuracy and prevents
the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Since the random forest combines multiple trees to predict the class of the dataset, it
is possible that some decision trees may predict the correct output, while others may
not. But together, all the trees predict the correct output. Therefore, below are two
assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so
that the classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions using every tree created in the first phase.
The working process can be explained in the below steps and diagram:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 and 2 until N trees have been built.
Step-5: For new data points, find the prediction of each decision tree, and assign the new data point to the category that wins the majority of votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into subsets and
given to each decision tree. During the training phase, each decision tree produces a
prediction result, and when a new data point occurs, then based on the majority of
results, the Random Forest classifier predicts the final decision. Consider the below
image:
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan
risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
o Although Random Forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.
Now we will implement the Random Forest algorithm using Python. For this, we will use the same dataset, "user_data.csv", which we have used in previous classification models. By using the same dataset, we can compare the Random Forest classifier with other classification models such as the Decision Tree classifier, KNN, SVM, Logistic Regression, etc.
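The implementation code is not reproduced in the text; a minimal sketch that reuses the pre-processed x_train, x_test, y_train variables from the SVM example above (10 trees and the entropy criterion are assumed illustrative values):

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)        # train the forest on the training set
y_pred = classifier.predict(x_test)     # predict the test set results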
Data Mining is the set of techniques that utilize specific algorithms, statistical analysis, artificial intelligence, and database systems to analyze data from different dimensions and perspectives.
A data mining tool is a framework, such as RStudio or Tableau, that allows you to perform different types of data mining analysis. We can run various algorithms, such as clustering or classification, on our data set and visualize the results. Such a framework provides better insights into our data and into the phenomena the data represent.
The market for data mining tools is shining: as per the latest report from ReportLinker, the market would top $1 billion in sales by 2023, up from $591 million in 2018.
1. Orange:
Orange is a perfect machine learning and data mining software suite. It supports visualization and is component-based software written in the Python computing language, developed at the Bioinformatics Laboratory of the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.
Why Orange?
Data coming into Orange is quickly formatted to the desired pattern, and it can easily be moved where needed simply by moving the widgets. Orange is quite interesting to users: it allows them to make smarter decisions in a short time by rapidly comparing and analyzing the data. It is a good open-source data visualization and evaluation tool that suits beginners as well as professionals. Data mining can be performed via visual programming or Python scripting. Many analyses are feasible through its visual programming interface (drag-and-drop, connected widgets), and many visual tools are supported, such as bar charts, scatterplots, trees, dendrograms, and heat maps. A substantial number of widgets (more than 100) are supported.
The tool has machine learning components and add-ons for bioinformatics and text mining, and it is packed with features for data analytics. It can also be used as a Python library.
Python scripts can run in a terminal window, in an integrated environment like PyCharm or PythonWin, or in shells like IPython. Orange consists of a canvas interface onto which the user places widgets and creates a data analysis workflow. The widgets provide fundamental operations, for example reading the data, showing a data table, selecting features, training predictors, comparing learning algorithms, visualizing data elements, etc. Orange runs on Windows, macOS, and a variety of Linux operating systems, and it comes with multiple regression and classification algorithms.
Orange can read documents in its native format and in other data formats. Orange is dedicated to machine learning techniques for classification, or supervised data mining. Two types of objects are used in classification: learners and classifiers. Learners take class-labeled data and return a classifier. Regression methods are very similar to classification in Orange; both are designed for supervised data mining and require class-labeled data. Ensemble learning combines the predictions of individual models for a gain in precision. The models can either come from different training data or use different learners on the same data sets.
Learners can also be diversified by altering their parameter sets. In orange, ensembles
are simply wrappers around learners. They act like any other learner. Based on the
data, they return models that can predict the results of any data instance.
2. SAS Data Miner:
SAS stands for Statistical Analysis System. It is a product of the SAS Institute created for analytics and data management. SAS can mine data, alter it, manage information from various sources, and perform statistical analysis. It offers a graphical UI for non-technical users.
SAS data miner allows users to analyze big data and provide accurate insight for timely
decision-making purposes. SAS has distributed memory processing architecture that is
highly scalable. It is suitable for data mining, optimization, and text mining purposes.
3. DataMelt (DMelt):
DMelt is a multi-platform utility written in Java. It can run on any operating system compatible with the JVM (Java Virtual Machine). It consists of science and mathematics libraries:
o Scientific libraries:
Scientific libraries are used for drawing the 2D/3D plots.
o Mathematical libraries:
Mathematical libraries are used for random number generation, algorithms,
curve fitting, etc.
DMelt can be used for the analysis of large volumes of data, for data mining, and for statistical analysis. It is extensively used in the natural sciences, financial markets, and engineering.
4. Rattle:
Rattle is a GUI-based data mining tool. It uses the R statistical programming language, and it exposes the statistical power of R by offering significant data mining features. While Rattle has a comprehensive and well-developed user interface, it also has an integrated log code tab that records the R code corresponding to any GUI operation.
The data set produced by Rattle can be viewed and edited. Rattle gives the other
facility to review the code, use it for many purposes, and extend the code without any
restriction.
5. Rapid Miner:
RapidMiner is one of the most popular predictive analytics systems, created by the company of the same name. It is written in the Java programming language. It offers an integrated environment for text mining, deep learning, machine learning, and predictive analytics.
The tool can be used for a wide range of applications, including business and commercial applications, research, education, training, application development, and machine learning.
RapidMiner provides the server on-site as well as in public or private cloud infrastructure. It has a client/server model as its base. RapidMiner comes with template-based frameworks that enable fast delivery with few errors (which are commonly expected in the manual coding process).
TEXT MINING
Text data mining can be described as the process of extracting essential data from
standard language text. All the data that we generate via text messages, documents,
emails, files are written in common language text. Text mining is primarily used to
draw useful insights or patterns from such data.
The text mining market has experienced exponential growth and adoption over the last few years and is expected to see significant further growth and adoption in the coming years. One of the primary reasons behind the adoption of text mining is the higher competition in the business market: many organizations are seeking value-added solutions to compete with other organizations. With increasing competition in business and changing customer perspectives, organizations are making huge investments to find solutions that are capable of analyzing customer and competitor data to improve competitiveness. The primary sources of data are e-commerce websites, social media platforms, published articles, surveys, and many more. The larger part of the generated data is unstructured, which makes it challenging and expensive for organizations to analyze with human effort alone. This challenge, combined with the exponential growth in data generation, has led to the development of analytical tools that are not only able to handle large volumes of text data but also help in decision-making. Text mining software empowers a user to draw useful information from a huge set of available data sources.
depend on many complex variables, including slang, social context, and regional
dialects.
o Data Mining: Data mining refers to the extraction of useful data, hidden
patterns from large data sets. Data mining tools can predict behaviors and
future trends that allow businesses to make a better data-driven decision. Data
mining tools can be used to resolve many business problems that have
traditionally been too time-consuming.
o Information Retrieval: Information retrieval deals with retrieving useful data from the data stored in our systems. As an analogy, the search functions on websites such as e-commerce sites can be viewed as a form of information retrieval.
The text mining process incorporates the following steps to extract the data from the
document.
o Text transformation
A text transformation is a technique used to control the representation of the document, for example the capitalization of the text. The two major ways of representing a document are given below (a small sketch of both follows this list):
1. Bag of words
2. Vector Space
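As an illustration of the two representations, a minimal scikit-learn sketch; the two documents are hypothetical, and TF-IDF weighting is used here as one common way of building a vector space model:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["data mining extracts useful patterns",
        "text mining extracts patterns from text"]   # hypothetical documents

# bag of words: each document becomes a vector of raw term counts
bow = CountVectorizer().fit_transform(docs)

# vector space model with TF-IDF weights instead of raw counts
vsm = TfidfVectorizer().fit_transform(docs)

print(bow.toarray())
print(vsm.toarray())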
o Text Pre-processing
Pre-processing is a significant task and a critical step in Text Mining, Natural Language Processing (NLP), and Information Retrieval (IR).
o Social Media Analysis: Social media analysis helps to track the online data, and
there are numerous text mining tools designed particularly for performance
analysis of social media sites. These tools help to monitor and interpret the text
generated via the internet from the news, emails, blogs, etc. Text mining tools
can precisely analyze the total number of posts, followers, and likes of your brand on a social media platform, enabling you to understand the response of the individuals who interact with your brand and content.
These are the following text mining approaches that are used in data mining.
Keyword-based Association Analysis:
It collects sets of keywords or terms that frequently occur together and then discovers association relationships among them. First, it pre-processes the text data by parsing, stemming, removing stop words, etc. Once the data is pre-processed, association mining algorithms are applied. Here, human effort is not required, so the number of unwanted results and the execution time are reduced.
Document Classification Analysis:
This analysis is used for the automatic classification of huge numbers of online text documents, such as web pages, emails, etc. Text document classification differs from the classification of relational data, as document databases are not organized according to attribute-value pairs.
Numericizing text: