
[ADVANCED DATABASE AND MINING]

UNIT V
CLASSIFICATION & CLUSTERING:

The primary difference between classification and clustering is that classification is a supervised learning approach in which a specific label is provided to the machine so that it can classify new observations; the machine needs proper training and testing against those labels, which makes classification a more complex process than clustering. Clustering, on the other hand, is an unsupervised learning approach in which grouping is done on the basis of similarities; the machine learns from the existing data and does not need any labelled training. In this unit, we discuss classification and clustering separately, and then we look at the major differences.

What is classification?

When there are exactly two target classes, the task is called binary classification. When more than two classes may be predicted, particularly in pattern recognition problems, the task is referred to as multinomial (or multiclass) classification. Multinomial classification is also used for categorical response data, where one wants to predict which of several categories an instance belongs to with the highest probability.

Classification is one of the most important tasks in data mining. It refers to the process of assigning pre-defined class labels to instances based on their attributes. Classification and clustering may look similar, but they are different: the major difference is that classification involves the labelling of items according to their membership in pre-defined groups. Let's understand this concept with the help of an example. Suppose you are using a self-organizing map neural network algorithm for image recognition where there are 10 different kinds of objects. If you label each image with one of these 10 classes, the task is a classification task.

On the other hand, clustering does not involve any labelling. Assume that you are given an image database of 10 objects and no class labels. Using a clustering algorithm to find groups of similar-looking images will result in clusters without object labels.

Classification methods in data mining

Some of the important classification methods used in data mining are given below:

1. Logistic Regression Method

The Logistic Regression method is used to predict a categorical (typically binary) response variable from one or more predictor variables.


2. K-Nearest Neighbors Method

The K-Nearest Neighbors method classifies a new observation by examining its K most similar observations (its nearest neighbours) in the dataset and assigning the class that is most common among them.

3. Naive Bayes Method

The Naive Bayes method scans the set of data to estimate, for each class, how often each predictor value occurs, and then uses these probabilities (assuming the predictors are independent given the class) to assign new records to the most probable class.

4. Neural Networks Method

Neural networks resemble the structure of the brain, being built from interconnected units called neurons. The data passes through these networks and finally comes out as an output classification. Errors that occur in the classifications are fed back into the network and the weights are adjusted, and this process is repeated.

5. Discriminant Analysis Method

In this method, a linear function of the attributes is built and used to predict the class of an observation whose class is unknown.

What is clustering?

Clustering refers to a technique of grouping objects so that objects with similar characteristics come together and objects with different characteristics go apart. In other words, clustering is a process of partitioning a data set into a set of meaningful subclasses, known as clusters. Clustering is similar to classification in that data is grouped; however, unlike classification, the groups are not previously defined. Instead, the grouping is achieved by determining similarities between data items according to characteristics found in the actual data. The groups are called clusters.

Methods of clustering

o Partitioning methods
o Hierarchical clustering
o Fuzzy Clustering
o Density-based clustering
o Model-based clustering


Difference between Classification and Clustering

Classification | Clustering

Classification is a supervised learning approach where a specific label is provided to the machine to classify new observations; the machine needs proper training and testing for label verification. | Clustering is an unsupervised learning approach where grouping is done on the basis of similarities.

Supervised learning approach. | Unsupervised learning approach.

It uses a training dataset. | It does not use a training dataset.

It uses algorithms to categorize new data as per the observations of the training set. | It uses statistical concepts in which the data set is divided into subsets with similar features.

In classification, there are labels for the training data. | In clustering, there are no labels for the training data.

Its objective is to find which class a new object belongs to from the set of predefined classes. | Its objective is to group a set of objects to find whether there is any relationship between them.

It is more complex as compared to clustering. | It is less complex as compared to classification.

1R ALGORITHM
OneR, short for "One Rule", is a simple, yet accurate, classification algorithm that
generates one rule for each predictor in the data, then selects the rule with the
smallest total error as its "one rule". To create a rule for a predictor, we construct a
frequency table for each predictor against the target. It has been shown that OneR
produces rules only slightly less accurate than state-of-the-art classification algorithms
while producing rules that are simple for humans to interpret.

For each predictor:
    For each value of that predictor, make a rule as follows:
        Count how often each class (value of the target) appears
        Find the most frequent class
        Make the rule assign that class to this value of the predictor
    Calculate the total error of the rules of each predictor
Choose the predictor with the smallest total error.
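The procedure above can be turned into a short program. The following is a minimal Python sketch written for this unit (the weather-style records, column names, and helper function are illustrative assumptions, not taken from the original example):

# Minimal OneR sketch: build, for each predictor, the rules implied by its
# frequency table against the target, then keep the predictor with the fewest errors.
from collections import Counter

def one_r(rows, predictors, target):
    best = None
    for p in predictors:
        rules, errors = {}, 0
        for v in set(r[p] for r in rows):
            classes = Counter(r[target] for r in rows if r[p] == v)
            majority, count = classes.most_common(1)[0]
            rules[v] = majority                      # rule: p = v -> majority class
            errors += sum(classes.values()) - count  # rows this rule misclassifies
        if best is None or errors < best[2]:
            best = (p, rules, errors)
    return best  # (chosen predictor, its value-to-class rules, total error)

# Illustrative data (invented, not the original frequency tables).
data = [
    {"Outlook": "Sunny",    "Windy": "False", "Play": "No"},
    {"Outlook": "Sunny",    "Windy": "True",  "Play": "No"},
    {"Outlook": "Overcast", "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy",    "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy",    "Windy": "True",  "Play": "No"},
]
print(one_r(data, ["Outlook", "Windy"], "Play"))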

Example:
Finding the best predictor with the smallest total error using the OneR algorithm, based on the related frequency tables (the frequency tables and the resulting best predictor are shown as figures).


Predictors Contribution
Put simply, the total error calculated from the frequency tables is the measure of each predictor's contribution. A low total error means a higher contribution to the predictability of the model.
Model Evaluation
The following confusion matrix shows significant predictive power. OneR does not generate a score or probability, which means evaluation charts (Gain, Lift, K-S and ROC) are not applicable.

Confusion Matrix          Play Golf = Yes   Play Golf = No
OneR predicts Yes                7                 2          Positive Predictive Value = 0.78
OneR predicts No                 2                 3          Negative Predictive Value = 0.60
                          Sensitivity = 0.78   Specificity = 0.60
Accuracy = 0.71

DECISION TREES
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a
test, and each leaf node holds a class label. The topmost node in the tree is the root
node.
The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node
represents a test on an attribute. Each leaf node represents a class.


The benefits of having a decision tree are as follows −


 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and
fast.

Decision Tree Induction Algorithm

A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a greedy approach. In these algorithms, there is no backtracking; the trees are constructed in a top-down recursive divide-and-conquer manner.
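As a hedged illustration of top-down tree induction, the sketch below uses scikit-learn's DecisionTreeClassifier (which implements an optimized CART algorithm related to, but not identical to, ID3/C4.5); the buys_computer-style data is invented for this sketch:

# Fitting a small decision tree on an invented buys_computer-style dataset.
from sklearn.tree import DecisionTreeClassifier

# Encoded attributes: age (0=youth, 1=middle_aged, 2=senior), student (0/1), credit (0=fair, 1=excellent)
X = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [2, 0, 1], [2, 1, 1], [1, 1, 0]]
y = ["no", "yes", "yes", "no", "yes", "yes"]   # buys_computer

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

# Classify a new customer: a youth who is a student with fair credit.
print(clf.predict([[0, 1, 0]]))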

Tree Pruning

Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
 Pre-pruning − The tree is pruned by halting its construction early.
 Post-pruning - This approach removes a sub-tree from a fully grown tree.

Cost Complexity

The cost complexity is measured by the following two parameters −


 Number of leaves in the tree, and


 Error rate of the tree.

COVERING RULES

IF-THEN Rules

Rule-based classifier makes use of a set of IF-THEN rules for classification. We can
express a rule in the following from −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
 The IF part of the rule is called rule antecedent or precondition.
 The THEN part of the rule is called rule consequent.
 The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
 The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buy_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.

Rule Extraction

Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from
a decision tree.
Points to remember −
To extract a rule from a decision tree −
 One rule is created for each path from the root to the leaf node.
 To form a rule antecedent, each splitting criterion is logically ANDed.
 The leaf node holds the class prediction, forming the rule consequent.
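For example, scikit-learn can print a fitted tree as nested test conditions, from which the IF-THEN rules can be read off; this is a hedged sketch on invented data, not the buy_computer tree from the text:

# Extracting IF-THEN style rules from a fitted decision tree (illustrative data).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [2, 1], [2, 0]]   # encoded age, student flag (invented)
y = ["no", "yes", "yes", "no", "yes"]          # buy_computer
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each root-to-leaf path in the printed tree reads as one rule: the antecedent is the
# logical AND of the split tests on the path, and the leaf supplies the consequent.
print(export_text(clf, feature_names=["age", "student"]))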

Rule Induction Using Sequential Covering Algorithm

The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.


Some well-known sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by that rule are removed and the process continues for the rest of the tuples.
Note − Decision tree induction can be considered as learning a set of rules simultaneously, because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one class at a time. When learning a rule for a class Ci, we want the rule to cover all the tuples of class Ci only and no tuple from any other class.
Algorithm: Sequential Covering

Input:
    D, a data set of class-labeled tuples;
    Att_vals, the set of all attributes and their possible values.

Output: A set of IF-THEN rules.

Method:
    Rule_set = { };   // initial set of rules learned is empty

    for each class c do
        repeat
            Rule = Learn_One_Rule(D, Att_vals, c);
            remove tuples covered by Rule from D;
            Rule_set = Rule_set + Rule;   // add the new rule to the rule set
        until termination condition;
    end for

    return Rule_set;
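A hedged Python sketch of this strategy is given below. Learn_One_Rule is simplified to a greedy search over single attribute = value tests; real systems such as CN2 or RIPPER grow multi-conjunct rules and use more careful stopping criteria:

# Simplified sequential covering: learn one single-test rule at a time for a class,
# remove the tuples it covers, and repeat while useful rules can still be found.
def learn_one_rule(rows, attributes, target, cls):
    best = None  # (attribute, value, covered positives, covered total)
    for a in attributes:
        for v in set(r[a] for r in rows):
            covered = [r for r in rows if r[a] == v]
            pos = sum(1 for r in covered if r[target] == cls)
            if best is None or pos / len(covered) > best[2] / best[3]:
                best = (a, v, pos, len(covered))
    return best

def sequential_covering(rows, attributes, target, cls):
    rule_set, remaining = [], list(rows)
    while any(r[target] == cls for r in remaining):
        rule = learn_one_rule(remaining, attributes, target, cls)
        if rule is None or rule[2] == 0:                     # termination condition
            break
        a, v, _, _ = rule
        rule_set.append(f"IF {a} = {v} THEN {target} = {cls}")
        remaining = [r for r in remaining if r[a] != v]      # remove covered tuples
    return rule_set

# Illustrative data (invented for this sketch).
rows = [
    {"age": "youth",  "student": "yes", "buy_computer": "yes"},
    {"age": "youth",  "student": "no",  "buy_computer": "no"},
    {"age": "senior", "student": "no",  "buy_computer": "yes"},
]
print(sequential_covering(rows, ["age", "student"], "buy_computer", "yes"))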

Rule Pruning

Rules are pruned for the following reasons −

 The assessment of quality is made on the original set of training data. The rule may perform well on the training data but less well on subsequent data. That is why rule pruning is required.
 A rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)


where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note − This value increases with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for the pruned version of R, then we prune R.
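A small worked example of the measure (the coverage counts are invented for illustration):

# FOIL pruning measure: (pos - neg) / (pos + neg), computed on an independent pruning set.
def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

# Suppose rule R covers 90 positive and 10 negative tuples (value 0.80), while the version
# of R with one conjunct removed covers 85 positives and 3 negatives (value about 0.93).
print(foil_prune(90, 10), foil_prune(85, 3))
# The pruned version scores higher, so according to the criterion above we prune R.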
TASK PREDICTION
Data mining deals with the kind of patterns that can be mined. On the basis of the kind
of data to be mined, there are two categories of functions involved in Data Mining −
 Descriptive
 Classification and Prediction

Descriptive Function

The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
 Class/Concept Description
 Mining of Frequent Patterns
 Mining of Associations
 Mining of Correlations
 Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For
example, in a company, the classes of items for sales include computer and printers,
and concepts of customers include big spenders and budget spenders. Such
descriptions of a class or a concept are called class/concept descriptions. These
descriptions can be derived by the following two ways −
 Data Characterization − This refers to summarizing the data of the class under study. The class under study is called the Target Class.
 Data Discrimination − It refers to the mapping or classification of a class with respect to some predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here
is the list of kind of frequent patterns −
 Frequent Item Set − It refers to a set of items that frequently appear
together, for example, milk and bread.
 Frequent Subsequence − A sequence of patterns that occurs frequently, such as purchasing a camera followed by a memory card.
 Frequent Sub Structure − Substructure refers to different structural
forms, such as graphs, trees, or lattices, which may be combined with
item-sets or subsequences.


Mining of Association
Associations are used in retail sales to identify patterns that are frequently purchased
together. This process refers to the process of uncovering the relationship among data
and determining association rules.
For example, a retailer generates an association rule showing that 70% of the time milk is sold with bread and only 30% of the time biscuits are sold with bread.
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical
correlations between associated-attribute-value pairs or between two item sets to
analyze that if they have positive, negative or no effect on each other.
Mining of Clusters
Cluster refers to a group of similar kind of objects. Cluster analysis refers to forming
group of objects that are very similar to each other but are highly different from the
objects in other clusters.

Classification and Prediction

Classification is the process of finding a model that describes the data classes or
concepts. The purpose is to be able to use this model to predict the class of objects
whose class label is unknown. This derived model is based on the analysis of sets of
training data. The derived model can be presented in the following forms −
 Classification (IF-THEN) Rules
 Decision Trees
 Mathematical Formulae
 Neural Networks
The list of functions involved in these processes are as follows −
 Classification − It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e., data objects whose class labels are well known.
 Prediction − It is used to predict missing or unavailable numerical data
values rather than class labels. Regression Analysis is generally used for
prediction. Prediction can also be used for identification of distribution
trends based on available data.
 Outlier Analysis − Outliers may be defined as the data objects that do
not comply with the general behavior or model of the data available.


 Evolution Analysis − Evolution analysis refers to the description and modelling of regularities or trends for objects whose behavior changes over time.

Data Mining Task Primitives

 We can specify a data mining task in the form of a data mining query.
 This query is input to the system.
 A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data mining system. Here is the list of Data Mining Task Primitives −
 Set of task relevant data to be mined.
 Kind of knowledge to be mined.
 Background knowledge to be used in discovery process.
 Interestingness measures and thresholds for pattern evaluation.
 Representation for visualizing the discovered patterns.
Set of task relevant data to be mined
This is the portion of database in which the user is interested. This portion includes
the following −
 Database Attributes
 Data Warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are −
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Clustering
 Outlier Analysis
 Evolution Analysis
Background knowledge
Background knowledge allows data to be mined at multiple levels of abstraction. Concept hierarchies, for example, are one form of background knowledge that allows data to be mined at multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation


This is used to evaluate the patterns that are discovered by the process of knowledge discovery. There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These
representations may include the following. −
 Rules
 Tables
 Charts
 Graphs
 Decision Trees
 Cubes

STATISTICAL CLASSIFICATION
Data mining refers to extracting or mining knowledge from large amounts of data. In
other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns. Theoreticians and
practitioners are continually seeking improved techniques to make the process more
efficient, cost-effective, and accurate. Any situation can be analyzed in two ways in
data mining:
 Statistical Analysis: In statistics, data is collected, analyzed, explored, and
presented to identify patterns and trends. Alternatively, it is referred to as
quantitative analysis.
 Non-statistical Analysis: This analysis provides generalized information and
includes sound, still images, and moving images.
In statistics, there are two main categories:
 Descriptive Statistics: The purpose of descriptive statistics is to organize data
and identify the main characteristics of that data. Graphs or numbers
summarize the data. Average, Mode, SD(Standard Deviation), and Correlation
are some of the commonly used descriptive statistical methods.
 Inferential Statistics: The process of drawing conclusions based on probability
theory and generalizing the data. By analyzing sample statistics, you can infer
parameters about populations and make models of relationships within data.
There are various statistical terms that one should be aware of while dealing with
statistics. Some of these are:
 Population
 Sample

 Variable
 Quantitative Variable
 Qualitative Variable
 Discrete Variable
 Continuous Variable
Now, let’s start discussing statistical methods. This is the analysis of raw data using
mathematical formulas, models, and techniques. Through the use of statistical
methods, information is extracted from research data, and different ways are available
to judge the robustness of research outputs.
As a matter of fact, today's statistical methods used in the data mining field are typically derived from the vast statistical toolkit developed to answer problems arising in other fields, and these techniques are taught in science curricula. It is necessary to formulate and test several hypotheses; such hypothesis testing helps us assess the validity of a data mining endeavor when attempting to draw inferences from the data under study. When using more complex and sophisticated statistical estimators and tests, these issues become more pronounced.
For extracting knowledge from databases containing different types of observations,
a variety of statistical methods are available in Data Mining and some of these are:
 Logistic regression analysis
 Correlation analysis
 Regression analysis
 Discriminant analysis
 Linear discriminant analysis (LDA)
 Classification
 Clustering
 Outlier detection
 Classification and regression trees
 Correspondence analysis
 Nonparametric regression
 Statistical pattern recognition
 Categorical data analysis
 Time-series methods for trends and periodicity
 Artificial neural networks
Now, let’s try to understand some of the important statistical methods which are used
in data mining:


 Linear Regression: The linear regression method uses the best linear
relationship between the independent and dependent variables to predict the
target variable. In order to achieve the best fit, the distances between the fitted line and the actual observations at each point should be as small as possible. A good fit is one for which no other position of the line would produce fewer errors. Simple linear regression
and multiple linear regression are the two major types of linear regression. By
fitting a linear relationship to the independent variable, the simple linear
regression predicts the dependent variable. Using multiple independent
variables, multiple linear regression fits the best linear relationship with the
dependent variable. For more details, you can refer linear regression.
 Classification: This is a method of data mining in which a collection of data is
categorized so that a greater degree of accuracy can be predicted and analyzed.
An effective way to analyze very large datasets is to classify them. Classification
is one of several methods aimed at improving the efficiency of the analysis
process. Logistic Regression and Discriminant Analysis stand out as two major classification techniques.
o Logistic Regression: It can also be applied to machine learning applications and predictive analytics. In this approach, the dependent variable is either binary (binary logistic regression) or multinomial (multinomial logistic regression), i.e., it takes one of two or one of several possible categories. With a logistic regression equation, one can estimate probabilities describing the relationship between the independent variables and the dependent variable. For understanding logistic regression analysis in detail, you can refer to logistic regression.
o Discriminant Analysis: A Discriminant Analysis is a statistical method of
analyzing data based on the measurements of categories or clusters and
categorizing new observations into one or more populations that were
identified a priori. Discriminant analysis models each response class separately and then uses Bayes' theorem to flip these distributions around and estimate the probability of each response category given the value of X. These models can be either linear or quadratic.
 Linear Discriminant Analysis: In Linear Discriminant Analysis, each observation is assigned a discriminant score that is used to classify it into a response variable class. These scores are obtained by combining the independent variables in a linear fashion. The model assumes that the observations are drawn from a Gaussian distribution and that the predictor variables share a common covariance structure across all k levels of the response variable Y. For further details, refer to linear discriminant analysis.


 Quadratic Discriminant Analysis: An alternative approach is


provided by Quadratic Discriminant Analysis. LDA and QDA both
assume Gaussian distributions for the observations of the Y
classes. Unlike LDA, QDA considers each class to have its own
covariance matrix. As a result, the predictor variables have
different variances across the k levels in Y.
o Correlation Analysis: In statistical terms, correlation analysis captures
the relationship between variables in a pair. The value of such variables
is usually stored in a column or rows of a database table and represents
a property of an object.
o Regression Analysis: Based on a set of numeric data, regression is a data
mining method that predicts a range of numerical values (also known as
continuous values). You could, for instance, use regression to predict the
cost of goods and services based on other variables. A regression model
is used across numerous industries for forecasting financial data,
modeling environmental conditions, and analyzing trends.
The first step in creating good statistics is having good data that was derived with an
aim in mind. There are two main types of data: an input (independent or predictor)
variable, which we control or are able to measure, and an output (dependent or
response) variable which is observed. A few will be quantitative measurements, but
others may be qualitative or categorical variables (called factors).
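To make two of the classification techniques above concrete, here is a hedged scikit-learn sketch that fits a logistic regression model and a linear discriminant analysis model on a small invented dataset:

# Comparing two statistical classifiers on a tiny invented dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[1.0, 2.1], [1.3, 1.8], [0.9, 2.4],     # class 0
              [3.2, 0.8], [3.0, 1.1], [2.8, 0.7]])    # class 1
y = np.array([0, 0, 0, 1, 1, 1])

log_reg = LogisticRegression().fit(X, y)
lda = LinearDiscriminantAnalysis().fit(X, y)

new_point = np.array([[2.5, 1.0]])
print(log_reg.predict_proba(new_point))   # class probabilities from the logistic model
print(lda.predict(new_point))             # class label chosen from the discriminant scores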

BAYESIAN THEOREM:

In numerous applications, the connection between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record cannot be predicted with certainty even though its attribute set is the same as that of some of the training examples. These circumstances may arise due to noisy data or the presence of certain confounding factors that influence classification but are not included in the analysis. For example, consider the task of predicting whether an individual is at risk of liver disease based on the individual's eating habits and exercise routine. Although most people who eat healthily and exercise consistently have a lower probability of developing liver disease, they may still develop it due to other factors, such as consumption of high-calorie street food or alcohol abuse. Determining whether an individual's eating routine is healthy or their exercise routine is sufficient is also subject to interpretation, which in turn may introduce uncertainties into the learning problem.

Bayesian classification uses Bayes' theorem to predict the probability of an event. Bayesian classifiers are statistical classifiers built on Bayesian probability. The theorem expresses how a level of belief, expressed as a probability, should be updated in the light of new evidence.


Bayes' theorem is named after Thomas Bayes, who first used conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter.

Bayes' theorem is expressed mathematically by the following equation:

P(X/Y) = [ P(Y/X) · P(X) ] / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X/Y) is the conditional probability of event X occurring given that Y is true.

P(Y/X) is the conditional probability of event Y occurring given that X is true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other.
This is known as the marginal probability.

Bayesian interpretation:

In the Bayesian interpretation, probability measures a "degree of belief." Bayes' theorem connects the degree of belief in a hypothesis before and after accounting for evidence. For example, let us consider a coin. If we toss a coin, we get either heads or tails, and the probability of each outcome is 50%. If the coin is flipped a number of times and the outcomes are observed, the degree of belief may rise, fall, or remain the same depending on the outcomes.

For proposition X and evidence Y,

o P(X), the prior, is the initial degree of belief in X.

o P(X/Y), the posterior, is the degree of belief after accounting for Y.

o The quotient P(Y/X)/P(Y) represents the support Y provides for X.

Bayes' theorem can be derived from the definition of conditional probability:

P(X/Y) = P(X⋂Y) / P(Y)   and   P(Y/X) = P(X⋂Y) / P(X)

where P(X⋂Y) is the joint probability of both X and Y being true. Since both expressions equal P(X⋂Y), we have P(X/Y)·P(Y) = P(Y/X)·P(X); dividing both sides by P(Y) gives Bayes' theorem.
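A small numerical check of the theorem (the probabilities are invented for illustration):

# Invented example: X = "has the disease", Y = "test comes back positive".
p_x = 0.01                          # prior P(X)
p_y_given_x = 0.95                  # P(Y/X), the test's sensitivity
p_y = 0.95 * 0.01 + 0.05 * 0.99     # total probability of a positive test, P(Y)

p_x_given_y = p_y_given_x * p_x / p_y   # Bayes' theorem gives the posterior P(X/Y)
print(round(p_x_given_y, 3))            # about 0.161: belief in X rises from 1% to roughly 16%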

Bayesian network:

A Bayesian Network is a Probabilistic Graphical Model (PGM) that is used to compute uncertainties using the concept of probability. Also known as Belief Networks, Bayesian Networks represent uncertainties using Directed Acyclic Graphs (DAGs).

A Directed Acyclic Graph is used to represent a Bayesian Network and, like any other statistical graph, a DAG consists of a set of nodes and links, where the links signify the connections between the nodes.

The nodes here represent random variables, and the edges define the relationship
between these variables.

A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network.
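A hedged sketch of the idea, using plain Python dictionaries as conditional probability tables for an invented two-node network Rain → WetGrass (the numbers are illustrative):

# Tiny Bayesian network, Rain -> WetGrass, represented by explicit CPTs.
p_rain = {True: 0.2, False: 0.8}                        # P(Rain)
p_wet_given_rain = {True:  {True: 0.9, False: 0.1},     # P(WetGrass | Rain = True)
                    False: {True: 0.2, False: 0.8}}     # P(WetGrass | Rain = False)

# The joint probability factorizes along the DAG: P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain).
def joint(rain, wet):
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# Inference by enumeration: P(Rain = True | WetGrass = True).
p_wet = joint(True, True) + joint(False, True)
print(joint(True, True) / p_wet)   # 0.18 / 0.34, roughly 0.53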

INSTANCE BASED METHODS

The simplest form of learning is plain memorization, or rote learning. Once a set of training instances has been memorized, then on encountering a new instance the memory is searched for the training instance that most strongly resembles the new one.
The only problem is how to interpret "resembles." This is a completely different way of representing the "knowledge" extracted from a set of instances: the instances themselves are stored, and new instances whose class is unknown are related to existing ones whose class is known. Rather than trying to create rules, the method works directly from the instances themselves. This is called instance-based learning. In instance-based learning, all the real work is done at the time a new instance must be classified, rather than when the training set is processed. The key difference between this approach and the others is the time at which the "learning" takes place.
Instance-based learning is lazy, deferring the real work for as long as possible, whereas other methods are eager, generalizing as soon as the data has been seen. In instance-based classification, each new instance is compared with existing ones using a distance metric, and the closest existing instance is used to assign its class to the new one. This is known as the nearest-neighbour classification method. Sometimes more than one nearest neighbour is used, and the majority class of the nearest k neighbours (or the distance-weighted average if the class is numeric) is assigned to the new instance. This is known as the k-nearest-neighbour method.
When nominal attributes are present, it is essential to come up with a "distance" between the different values of that attribute. Some attributes will be more significant than others, and this is usually reflected in the distance metric by some form of attribute weighting. Deriving suitable attribute weights from the training set is a key problem in instance-based learning. An apparent limitation of instance-based representations is that they do not make explicit the structures that are learned. In a sense, however, the instances combine with the distance metric to carve out boundaries in instance space that distinguish one class from another, and this is a kind of explicit representation of knowledge.
For instance, given a single example of each of two classes, the nearest-neighbour rule effectively splits the instance space along the perpendicular bisector of the line joining the two examples. Given several examples of each class, the space is divided by a set of lines that represent the perpendicular bisectors of selected lines joining an instance of one class to an instance of another class.
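A hedged scikit-learn sketch of the k-nearest-neighbour method described above (the training points are invented):

# k-nearest-neighbour classification: the majority class among the k closest training instances.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # class "a"
           [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]   # class "b"
y_train = ["a", "a", "a", "b", "b", "b"]

knn = KNeighborsClassifier(n_neighbors=3)        # k = 3, Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.predict([[2.8, 3.1]]))                 # the three nearest neighbours are all "b"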
LINEAR MODELS
Linear regression may be defined as the statistical model that analyzes the linear relationship between a dependent variable and a given set of independent variables. A linear relationship between variables means that when the value of one or more independent variables changes (increases or decreases), the value of the dependent variable also changes accordingly (increases or decreases).
Mathematically, the relationship can be represented with the help of the following equation −
Y = mX + b


Here, Y is the dependent variable we are trying to predict.

X is the independent variable we are using to make predictions.
m is the slope of the regression line, which represents the effect X has on Y.
b is a constant, known as the Y-intercept. If X = 0, Y would be equal to b.
Furthermore, the linear relationship can be positive or negative in nature as explained
below −
Positive Linear Relationship
A linear relationship is called positive if both the independent and the dependent variable increase. It can be understood with the help of the following graph −

Negative Linear Relationship

A linear relationship is called negative if the independent variable increases and the dependent variable decreases. It can be understood with the help of the following graph −

Types of Linear Regression

Linear regression is of the following two types −


 Simple Linear Regression
 Multiple Linear Regression
The following are some assumptions about the dataset that are made by the Linear Regression model −

Multi-collinearity − The Linear Regression model assumes that there is very little or no multi-collinearity in the data. Basically, multi-collinearity occurs when the independent variables or features are dependent on one another.
Auto-correlation − Another assumption the Linear Regression model makes is that there is very little or no auto-correlation in the data. Basically, auto-correlation occurs when there is dependency between the residual errors.
Relationship between variables − The Linear Regression model assumes that the relationship between the response and feature variables is linear.
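A minimal sketch of fitting the model Y = mX + b (the data points are invented):

# Ordinary least-squares fit of Y = mX + b on invented one-dimensional data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])        # independent variable
Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])       # dependent variable (roughly Y = 2X)

model = LinearRegression().fit(X, Y)
print(model.coef_[0], model.intercept_)        # estimated slope m and intercept b
print(model.predict([[6]]))                    # prediction for a new value of X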
CLUSTERING
Cluster is a group of objects that belongs to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another
cluster.

What is Clustering?

Clustering is the process of making a group of abstract objects into classes of similar
objects.
Points to Remember
 A cluster of data objects can be treated as one group.
 While doing cluster analysis, we first partition the set of data into groups
based on data similarity and then assign the labels to the groups.
 The main advantage of clustering over classification is that, it is adaptable
to changes and helps single out useful features that distinguish different
groups.

Applications of Cluster Analysis

 Clustering analysis is broadly used in many applications such as market


research, pattern recognition, data analysis, and image processing.
 Clustering can also help marketers discover distinct groups in their
customer base. And they can characterize their customer groups based
on the purchasing patterns.
 In the field of biology, it can be used to derive plant and animal
taxonomies, categorize genes with similar functionalities and gain insight
into structures inherent to populations.
 Clustering also helps in identification of areas of similar land use in an
earth observation database. It also helps in the identification of groups
of houses in a city according to house type, value, and geographic
location.
 Clustering also helps in classifying documents on the web for information
discovery.


 Clustering is also used in outlier detection applications such as detection


of credit card fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight
into the distribution of data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining −
 Scalability − We need highly scalable clustering algorithms to deal with
large databases.
 Ability to deal with different kinds of attributes − Algorithms should be
capable to be applied on any kind of data such as interval-based
(numerical) data, categorical, and binary data.
 Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bound to distance measures that tend to find only spherical clusters of small size.
 High dimensionality − The clustering algorithm should not only be able
to handle low-dimensional data but also the high dimensional space.
 Ability to deal with noisy data − Databases contain noisy, missing or
erroneous data. Some algorithms are sensitive to such data and may lead
to poor quality clusters.
 Interpretability − The clustering results should be interpretable,
comprehensible, and usable.

Clustering Methods

Clustering methods can be classified into the following categories −


 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
 Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs
‘k’ partition of data. Each partition will represent a cluster and k ≤ n. It means that it
will classify the data into k groups, which satisfy the following requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.
Points to remember −


 For a given number of partitions (say k), the partitioning method will
create an initial partitioning.
 Then it uses the iterative relocation technique to improve the
partitioning by moving objects from one group to other.
 Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition
is formed. There are two approaches here −
o Agglomerative Approach
o Divisive Approach
o Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps on merging the objects or groups that are close to one another, and it keeps doing so until all of the groups are merged into one or until the termination condition holds.
o Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters. This is done until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
 Perform careful analysis of object linkages at each hierarchical
partitioning.
 Integrate hierarchical agglomeration by first using a hierarchical
agglomerative algorithm to group objects into micro-clusters, and then
performing macro-clustering on the micro-clusters.
 Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points (a code sketch of this idea is given after this list of methods).
 Grid-based Method
In this, the objects together form a grid. The object space is quantized into finite
number of cells that form a grid structure.


Advantages
 The major advantage of this method is fast processing time.
 It is dependent only on the number of cells in each dimension in the
quantized space.
 Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for
a given model. This method locates the clusters by clustering the density function. It
reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.
 Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of
desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. Constraints can be specified by the user
or the application requirement.
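As a hedged illustration of the density-based method mentioned earlier, the sketch below uses scikit-learn's DBSCAN; the eps and min_samples parameters and the data points are illustrative:

# Density-based clustering with DBSCAN: grow clusters while each point's eps-neighbourhood
# contains at least min_samples points; points in sparse regions are labelled -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],     # dense group 1
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],     # dense group 2
              [9.0, 0.5]])                            # isolated point

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)    # e.g. [0 0 0 1 1 1 -1]: two clusters of arbitrary shape plus one noise point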
COBWEB
COBWEB is a popular and simple method of incremental conceptual clustering.
It creates a hierarchical clustering in the form of a classification tree.
Each node refers to a concept and contains a probabilistic description of that
concept.
Classification Tree


Limitations of COBWEB

The assumption that the attributes are independent of each other is often too strong
because correlation may exist.

It is not suitable for clustering large database data: the classification tree can become skewed, and the probability distributions are expensive to compute and store.

Some of the other methods alike COBWEB are:

CLASSIT
 It is an extension of COBWEB for incremental clustering of continuous data.
 It suffers similar problems as COBWEB.

AutoClass (Cheeseman and Stutz, 1996)


 It uses Bayesian statistical analysis to estimate the number of clusters.
 It has been popular in the industry.

Other Model-Based Clustering Methods

Neural network approaches


 It represents each cluster as an exemplar, acting as a “prototype” of the
cluster.
 Then new objects are distributed to the cluster whose exemplar is the most
similar according to some distance measure.

Competitive learning
 It involves a hierarchical architecture of several units (neurons).
 Neurons compete in a “winner-takes-all” fashion for the object currently
being presented.


Self-Organizing Feature Maps

Clustering is also performed by having several units compete for the current object.
The unit whose weight vector is closest to the current object wins.
The winner and its neighbors learn by having their weights adjusted.


SOMs are believed to resemble processing that can occur in the brain.
Useful for visualizing high-dimensional data in 2-D or 3-D space.
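A minimal numpy sketch of the winner-takes-all weight update described above (the number of units, learning rate, and data are illustrative, and for brevity only the winning unit is updated, without the neighbourhood adjustment of a full SOM):

# Toy competitive-learning step used by self-organizing maps: find the best-matching
# unit for each input and pull that unit's weight vector toward the input.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.random((4, 2))     # 4 units, each with a 2-D weight vector
data = rng.random((100, 2))      # 100 two-dimensional input objects
lr = 0.1                         # learning rate

for x in data:
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))   # the closest weight vector wins
    weights[winner] += lr * (x - weights[winner])              # move the winner toward the input

print(weights)   # the units end up spread over the regions occupied by the data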

k-means

K-Means Clustering is an unsupervised learning algorithm that is used to solve the


clustering problems in machine learning or data science. In this topic, we will learn
what is K-means clustering algorithm, how the algorithm works, along with the Python
implementation of k-means clustering.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the


unlabeled dataset into different clusters. Here K defines the number of pre-defined
clusters that need to be created in the process, as if K=2, there will be two clusters,
and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The


main aim of this algorithm is to minimize the sum of distances between the data point
and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the best clusters are found. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative


process.
o Assigns each data point to its closest k-center. Those data points which are near
to the particular k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.

The below diagram explains the working of the K-means Clustering Algorithm:


How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (It can be other from the input dataset).

Step-3: Assign each data point to their closest centroid, which will form the predefined
K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:


o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them
into different clusters. It means here we will try to group these datasets into
two different clusters.
o We need to choose some random k points or centroid to form the cluster.
These points can be either the points from the dataset or any other point. So,
here we are selecting the below two points as k points, which are not the part
of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by applying some mathematics that we have
studied to calculate the distance between two points. So, we will draw a
median between both the centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.


o As we need to find the closest cluster, so we will repeat the process by


choosing a new centroid. To choose the new centroids, we will compute the
center of gravity of these centroids, and will find new centroids as below:

o Next, we will reassign each datapoint to the new centroid. For this, we will
repeat the same process of finding a median line. The median will be like
below image:

From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line. So, these three points will be assigned to new centroids.


As reassignment has taken place, so we will again go to the step-4, which is finding
new centroids or K-points.

o We will repeat the process by finding the center of gravity of centroids, so the
new centroids will be as shown in the below image:

o As we got the new centroids so again will draw the median line and reassign
the data points. So, the image will be:


o We can see in the above image; there are no dissimilar data points on either
side of the line, which means our model is formed. Consider the below image:

As our model is ready, so we can now remove the assumed centroids, and the two
final clusters will be as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends upon highly efficient
clusters that it forms. But choosing the optimal number of clusters is a big task. There
are some different ways to find the optimal number of clusters, but here we are
discussing the most appropriate method to find the number of clusters or value of K.
The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of
clusters. This method uses the concept of WCSS value. WCSS stands for Within Cluster
Sum of Squares, which defines the total variations within a cluster. The formula to
calculate the value of WCSS (for 3 clusters) is given below:


WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²

In the above formula of WCSS,

Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point Pi and its centroid C1 within Cluster 1, and the same holds for the other two terms.

To measure the distance between data points and centroid, we can use any method
such as Euclidean distance or Manhattan distance.

To find the optimal value of clusters, the elbow method follows the below steps:

o It executes the K-means clustering on a given dataset for different K values


(ranges from 1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend in the plot (where the curve looks like an arm or elbow) is considered as the best value of K.

Since the graph shows the sharp bend, which looks like an elbow, hence it is known as
the elbow method. The graph for the elbow method looks like the below image:

Python Implementation of K-means Clustering Algorithm

In the above section, we have discussed the K-means algorithm, now let's see how it
can be implemented using Python.

Before implementation, let's understand what type of problem we will solve here. So,
we have a dataset of Mall_Customers, which is the data of customers who visit the
mall and spend there.

In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and
Spending Score (which is the calculated value of how much a customer has spent in
the mall, the more the value, the more he has spent). From this dataset, we need to
calculate some patterns, as it is an unsupervised method, so we don't know what to
calculate exactly.

The steps to be followed for the implementation are given below:

o Data Pre-processing
o Finding the optimal number of clusters using the elbow method


o Training the K-means algorithm on the training dataset


o Visualizing the clusters

Step-1: Data pre-processing Step

The first step will be the data pre-processing, as we did in our earlier topics of
Regression and Classification. But for the clustering problem, it will be different from
other models. Let's discuss it:

o Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our model,
which is part of data pre-processing. The code is given below:

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd

In the above code, numpy is imported for performing mathematical calculations, matplotlib for plotting the graph, and pandas for managing the dataset.

o Importing the Dataset:


Next, we will import the dataset that we need to use. So here, we are using the
Mall_Customer_data.csv dataset. It can be imported using the below code:

1. # Importing the dataset


2. dataset = pd.read_csv('Mall_Customers_data.csv')

By executing the above lines of code, we will get our dataset in the Spyder IDE. The
dataset looks like the below image:

From the above dataset, we need to find some patterns in it.

o Extracting Independent Variables

Here we don't need any dependent variable for data pre-processing step as it is a
clustering problem, and we have no idea about what to determine. So we will just add
a line of code for the matrix of features.

1. x = dataset.iloc[:, [3, 4]].values


As we can see, we are extracting only 3rd and 4th feature. It is because we need a 2d
plot to visualize the model, and some features are not required, such as customer_id.

Step-2: Finding the optimal number of clusters using the elbow method

In the second step, we will try to find the optimal number of clusters for our clustering
problem. So, as discussed above, here we are going to use the elbow method for this
purpose.

As we know, the elbow method uses the WCSS concept to draw the plot by plotting
WCSS values on the Y-axis and the number of clusters on the X-axis. So we are going
to calculate the value for WCSS for different k values ranging from 1 to 10. Below is
the code for it:

1. #finding optimal number of clusters using the elbow method
2. from sklearn.cluster import KMeans
3. wcss_list= []  #Initializing the list for the values of WCSS
4.
5. #Using for loop for iterations from 1 to 10.
6. for i in range(1, 11):
7.     kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
8.     kmeans.fit(x)
9.     wcss_list.append(kmeans.inertia_)
10. mtp.plot(range(1, 11), wcss_list)
11. mtp.title('The Elbow Method Graph')
12. mtp.xlabel('Number of clusters(k)')
13. mtp.ylabel('wcss_list')
14. mtp.show()

As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.

Next, we have created the wcss_list variable to initialize an empty list, which is used
to contain the value of wcss computed for different values of k ranging from 1 to 10.

After that, we have initialized the for loop for iterating over different values of k ranging from 1 to 10; since a for loop in Python excludes the upper bound, the limit is taken as 11 so that the 10th value is included.

The rest part of the code is similar as we did in earlier topics, as we have fitted the
model on a matrix of features and then plotted the graph between the number of
clusters and WCSS.

Output: After executing the above code, we will get the below output:

From the above plot, we can see the elbow point is at 5. So the number of clusters
here will be 5.

Step- 3: Training the K-means algorithm on the training dataset

As we have got the number of clusters, so we can now train the model on the dataset.

To train the model, we will use the same two lines of code as we have used in the
above section, but here instead of using i, we will use 5, as we know there are 5 clusters
that need to be formed. The code is given below:

1. #training the K-means model on a dataset


2. kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)


3. y_predict= kmeans.fit_predict(x)

The first line is the same as above for creating the object of KMeans class.

In the second line of code, we have created the variable y_predict, which stores the cluster predicted for each observation by the fitted model.

By executing the above lines of code, we will get the y_predict variable. We can check
it under the variable explorer option in the Spyder IDE. We can now compare the
values of y_predict with our original dataset. Consider the below image:

From the above image, we can see that CustomerID 1 belongs to cluster 3 (since the index starts from 0, the label 2 corresponds to the 3rd cluster), CustomerID 2 belongs to cluster 4, and so on.

Step-4: Visualizing the Clusters

The last step is to visualize the clusters. As we have 5 clusters for our model, so we will
visualize each cluster one by one.
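The scatter-plot code itself is not reproduced in these notes. A minimal sketch of what it typically looks like is given below, assuming matplotlib.pyplot is imported as mtp and that x, y_predict, and kmeans come from the previous steps; the colors and labels are arbitrary choices and can be changed as needed.

# visualizing the clusters (one scatter call per predicted label 0..4)
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
# plotting the centroids of the clusters found by K-means
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income')
mtp.ylabel('Spending Score')
mtp.legend()
mtp.show()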

In the above lines of code, we have written one scatter plot for each of the 5 clusters. The first coordinate of mtp.scatter, i.e., x[y_predict == 0, 0], selects the x values (from the matrix of features) of the points whose predicted label is 0; the labels in y_predict range from 0 to 4.

The output image is clearly showing the five different clusters with different colors.
The clusters are formed between two parameters of the dataset; Annual income of
customer and Spending. We can change the colors and labels as per the requirement
or choice. We can also observe some points from the above patterns, which are given
below:

o Cluster1 shows the customers with average salary and average spending, so we can categorize these customers as standard.
o Cluster2 shows the customer has a high income but low spending, so we can
categorize them as careful.
o Cluster3 shows the low income and also low spending so they can be
categorized as sensible.
o Cluster4 shows the customers with low income with very high spending so they
can be categorized as careless.
o Cluster5 shows the customers with high income and high spending so they can
be categorized as target, and these customers can be the most profitable
customers for the mall owner.

HIERARCHICAL METHODS

Hierarchical clustering refers to an unsupervised learning procedure that determines


successive clusters based on previously defined clusters. It works via grouping data
into a tree of clusters. Hierarchical clustering starts by treating each data point as an individual cluster. The endpoint is a set of clusters, where each cluster is distinct from the others, and the objects within each cluster are broadly similar to one another.

There are two types of hierarchical clustering

o Agglomerative Hierarchical Clustering


o Divisive Clustering

o Agglomerative hierarchical clustering

Agglomerative clustering is one of the most common types of hierarchical clustering


used to group similar objects in clusters. Agglomerative clustering is also known as
AGNES (Agglomerative Nesting). In agglomerative clustering, each data point acts as an individual cluster, and at each step, data objects are grouped in a bottom-up manner. Initially, each data object is in its own cluster. At each iteration, the closest clusters are combined until one single cluster remains.

Agglomerative hierarchical clustering algorithm

1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all other clusters (compute the proximity matrix).
3. Combine the most similar (closest) clusters.
4. Recalculate the proximity matrix for the newly formed clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.

Let’s understand this concept with the help of graphical representation using a
dendrogram.

With the help of the demonstration given below, we can understand how the actual algorithm works. No calculations are carried out here; the proximities among the clusters are simply assumed.

Let's suppose we have six different data points P, Q, R, S, T, V.



Step 1:

Consider each alphabet (P, Q, R, S, T, V) as an individual cluster and find the distance
between the individual cluster from all other clusters.

Step 2:

Now, merge the comparable clusters into single clusters. Let's say cluster Q and cluster R are similar to each other, as are cluster S and cluster T, so we merge them in this step. We get the clusters [(P), (QR), (ST), (V)].

Step 3:

Here, we recalculate the proximity as per the algorithm and combine the two closest clusters, (ST) and (V), to form the new set of clusters [(P), (QR), (STV)].

Step 4:

Repeat the same process. The clusters (QR) and (STV) are now the closest and are combined together to form a new cluster. Now we have [(P), (QRSTV)].

Step 5:

Finally, the remaining two clusters are merged together to form a single cluster
[(PQRSTV)]
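The same bottom-up merging can be reproduced in code. Below is a minimal sketch, assuming scipy and matplotlib are installed; the six 2-D coordinates standing in for P, Q, R, S, T, V are made up purely for illustration.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# six illustrative 2-D points standing in for P, Q, R, S, T, V
points = np.array([[1, 1], [4, 4], [4.5, 4], [8, 8], [8.5, 8], [12, 1]])
labels = ['P', 'Q', 'R', 'S', 'T', 'V']

# build the hierarchy bottom-up; 'ward' merges the pair of clusters that
# gives the smallest increase in within-cluster variance at every step
merge_history = linkage(points, method='ward')

# the dendrogram shows the order in which the clusters were merged
dendrogram(merge_history, labels=labels)
plt.title('Agglomerative clustering dendrogram')
plt.ylabel('Merge distance')
plt.show()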

o Divisive Hierarchical Clustering


Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, all the data points initially belong to one single cluster, and in every iteration, the data points that are not similar are separated from the cluster. The separated data points are treated as individual clusters. Finally, we are left with N clusters.

Advantages of Hierarchical clustering

o It is simple to implement and gives the best output in some cases.


o It is easy and results in a hierarchy, a structure that contains more information.
o It does not need us to pre-specify the number of clusters.

Disadvantages of hierarchical clustering

o It breaks the large clusters.


o It is difficult to handle clusters of different sizes and convex shapes.
o It is sensitive to noise and outliers.
o A merge or split, once performed, can never be undone by the algorithm.

MINING REAL DATA:


This Tutorial Covers Most Popular Data Mining Examples in Real Life. Learn About
Data Mining Application In Finance, Marketing, Healthcare, and CRM:
In this Free Data Mining Training Series, we had a look at the Data Mining Process in
our previous tutorial. Data Mining, which is also known as Knowledge Discovery in
Databases (KDD), is a process of discovering patterns in a large set of data and data
warehouses.


Various techniques such as regression analysis, association, clustering, classification, and outlier analysis are applied to data to identify useful outcomes. These techniques use software and backend algorithms that analyze the data and reveal patterns.

Some of the well-known data mining methods are decision tree analysis, Bayes
theorem analysis, Frequent item-set mining, etc. The software market has many
open-source as well as paid tools for data mining such as Weka, Rapid Miner, and
Orange data mining tools.

The data mining process starts with giving a certain input of data to the data mining
tools that use statistics and algorithms to show the reports and patterns. The results
can be visualized using these tools that can be understood and further applied to
conduct business modification and improvements.

Data mining is widely used by organizations in building a marketing strategy, by


hospitals for diagnostic tools, by eCommerce for cross-selling products through
websites and many other ways.

Some of the data mining examples are given below for your reference.

Examples Of Data Mining In Real Life


The importance of data mining and analysis is growing day by day in our real life.
Today most organizations use data mining for analysis of Big Data.

Let us see how these technologies benefit us.


1) Mobile Service Providers
Mobile service providers use data mining to design their marketing campaigns and to
retain customers from moving to other vendors.

From a large amount of data such as billing information, email, text messages, web
data transmissions, and customer service records, the data mining tools can predict "churn", that is, identify the customers who are likely to switch to another vendor.

With these results, a probability score is given. The mobile service providers are then
able to provide incentives, offers to customers who are at higher risk of churning.
This kind of mining is often used by major service providers such as broadband,
phone, gas providers, etc.

2) Retail Sector
Data Mining helps the supermarket and retail sector owners to know the choices of
the customers. Looking at the purchase history of the customers, the data mining
tools show the buying preferences of the customers.

With the help of these results, the supermarkets design the placements of products
on shelves and bring out offers on items such as coupons on matching products, and
special discounts on some products.

These campaigns are based on RFM grouping. RFM stands for recency, frequency,
and monetary grouping. The promotions and marketing campaigns are customized
for these segments. The customer who spends a lot but very less frequently will be
treated differently from the customer who buys every 2-3 days but of less amount.
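A minimal sketch of RFM grouping with pandas is given below; the transactions table, its column names (customer_id, invoice_date, amount), and all values are invented purely for illustration.

import pandas as pd

# hypothetical transaction data: one row per purchase
transactions = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2, 3],
    'invoice_date': pd.to_datetime(['2023-01-05', '2023-03-01', '2023-02-10',
                                    '2023-02-20', '2023-03-02', '2023-01-15']),
    'amount': [250, 120, 40, 35, 50, 900],
})

snapshot = transactions['invoice_date'].max()

# recency   = days since the customer's last purchase
# frequency = how many purchases the customer has made
# monetary  = total amount the customer has spent
rfm = transactions.groupby('customer_id').agg(
    recency=('invoice_date', lambda d: (snapshot - d.max()).days),
    frequency=('invoice_date', 'count'),
    monetary=('amount', 'sum'),
)
print(rfm)
# customers can then be segmented on these three measures, e.g. "spends a lot
# but rarely" versus "buys every few days but in small amounts", and the
# promotions customized per segment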

Data Mining can be used for product recommendation and cross-referencing of


items.

Data Mining In Retail Sector From Different Data Sources.

3) Artificial Intelligence
A system is made artificially intelligent by feeding it with relevant patterns. These
patterns come from data mining outputs. The outputs of the artificially intelligent
systems are also analyzed for their relevance using the data mining techniques.

The recommender systems use data mining techniques to make personalized


recommendations when the customer is interacting with the machines. The artificial
intelligence is used on mined data such as giving product recommendations based on
the past purchasing history of the customer in Amazon.


4) Ecommerce
Many E-commerce sites use data mining to offer cross-selling and upselling of their
products. The shopping sites such as Amazon, Flipkart show “People also viewed”,
“Frequently bought together” to the customers who are interacting with the site.

These recommendations are provided using data mining over the purchasing history
of the customers of the website.

5) Science And Engineering


With the advent of data mining, scientific applications are moving beyond purely statistical techniques to a "collect and store data" approach: mining is performed on the newly collected data, new results are produced, and the process is refined through experimentation. A large
amount of data is collected from scientific domains such as astronomy, geology,
satellite sensors, global positioning system, etc.

Data mining in computer science helps to monitor system status, improve its
performance, find out software bugs, discover plagiarism and find out faults. Data
mining also helps in analyzing user feedback regarding products and articles to deduce the opinions and sentiments of the users.

6) Crime Prevention
Data Mining detects outliers across a vast amount of data. The criminal data includes
all details of the crime that has happened. Data Mining will study the patterns and
trends and predict future events with better accuracy.

The agencies can find out which area is more prone to crime, how much police
personnel should be deployed, which age group should be targeted, vehicle numbers
to be scrutinized, etc.

7) Research
Researchers use Data Mining tools to explore the associations between the
parameters under research such as environmental conditions like air pollution and
the spread of diseases like asthma among people in targeted regions.

8) Farming
Farmers use Data Mining to find out the yield of vegetables with the amount of
water required by the plants.

9) Automation
By using data mining, the computer systems learn to recognize patterns among the
parameters which are under comparison. The system will store the patterns that will
be useful in the future to achieve business goals. This learning is automation as it
helps in meeting the targets through machine learning.


10) Dynamic Pricing


Data mining helps the service providers such as cab services to dynamically charge
the customers based on the demand and supply. It is one of the key factors for the
success of companies.

11) Transportation
Data Mining helps in scheduling the moving of vehicles from warehouses to outlets
and analyze the product loading patterns.

12) Insurance
Data mining methods help in forecasting the customers who buy the policies, analyze
the medical claims that are used together, find out fraudulent behaviors and risky
customers.

Data Mining Examples In Finance

The finance sector includes banks, insurance companies, and investment companies.
These institutions collect a huge amount of data. The data is often complete, reliable
and of high quality and demands a systematic data analysis.

To store financial data, data warehouses that store data in the form of data cubes
are constructed. To analyze this data, advanced data cube concepts are used. Data


mining methods such as clustering, outlier analysis, and characterization are used in financial data analysis and mining.

Some cases in finance where data mining is used are given below.
1) Loan Payment Prediction
Data mining methods like attribute selection and attribute ranking will analyze the
customer payment history and select important factors such as payment to income
ratio, credit history, the term of the loan, etc. The results help the banks decide their loan-granting policies and grant loans to customers according to the factor analysis.

2) Targeted Marketing
Clustering and classification data mining methods will help in finding the factors that
influence the customer’s decisions towards banking. Similar behavioral customers’
identification will facilitate targeted marketing.

3) Detect Financial Crimes


Banking data come from many different sources, various cities, and different bank
locations. Multiple data analysis tools are deployed to study and to detect unusual
trends like big value transactions. Data visualization tools, outlier analysis tools,
clustering tools, etc are used to identify the relationships and patterns of action.

Applications Of Data Mining In Marketing


Data mining boosts the company’s marketing strategy and promotes business. It is
one of the key factors for the success of companies. A huge amount of data is
collected on sales, customer shopping, consumption, etc. This data is increasing day
by day due to e-commerce.

Data mining helps to identify customer buying behavior, improve customer service,
focus on customer retention, enhance sales, and reduce the cost of businesses.

Some examples of data mining in marketing are:


1) Forecasting Market
To forecast the market, marketing professionals use data mining techniques like regression to study customer behavior, habits, and responses, along with other factors such as the marketing budget and other incurred costs. This makes it easier for professionals to predict how customers will respond if any of these factors change.

2) Anomaly Detection
Data mining techniques are deployed to detect any abnormalities in data that may
cause any kind of flaw in the system. The system will scan thousands of complex
entries to perform this operation.


3) System Security

Data Mining tools detect intrusions that may harm the database offering greater
security to the entire system. These intrusions may be in the form of duplicate
entries, viruses in the form of data by hackers, etc.

Examples Of Data Mining Applications In Healthcare

In healthcare, data mining is becoming increasingly popular and essential.

Data generated by healthcare is complex and voluminous. To avoid medical fraud


and abuse, data mining tools are used to detect fraudulent items and thereby
prevent loss.

Some data mining examples of the healthcare industry are given below for your
reference.
1) Healthcare Management
The data mining method is used to identify chronic diseases, track high-risk regions
prone to the spread of disease, design programs to reduce the spread of disease.
Healthcare professionals will analyze the diseases, regions of patients with maximum
admissions to the hospital.

With this data, they will design the campaigns for the region to make people aware
of the disease and see how to avoid it. This will reduce the number of patients
admitted to hospitals.

2) Effective Treatments
Using data mining, the treatments can be improved. By continuous comparison of
symptoms, causes, and medicines, data analysis can be performed to make effective
treatments. Data mining is also used for the treatment of specific diseases, and the
association of side-effects of treatments.

3) Fraudulent And Abusive Data


Data mining applications are used to find abnormal patterns such as laboratory,
physician’s results, inappropriate prescriptions, and fraudulent medical claims.

Data Mining And Recommender Systems


Recommender systems give customers with product recommendations that may be
of interest to the users.

The recommended items are either similar to the items the user has queried in the past or are chosen by looking at the preferences of other customers who have similar tastes to the user. These approaches are called the content-based approach and the collaborative approach, respectively.

Many techniques like information retrieval, statistics, machine learning, etc are used
in recommender systems.

Recommender systems search for keywords, user profiles, user transactions,


common features among items to estimate an item for the user. These systems also
find the other users who have a similar history of buying and predict items that those
users could buy.

There are many challenges in this approach. The recommendation system needs to
search through millions of data in real-time.

There are two types of errors made by Recommender Systems:


False negatives and False positives.

False negatives are products that were not recommended by the system but the
customer would want them. False-positive are products that were recommended by
the system but not wanted by the customer. Another challenge is the
recommendation for the users who are new without any purchasing history.
An intelligent query answering technique is used to analyze the query and provide
generalized, associated information relevant to the query. For Example: Showing the
review of restaurants instead of just the address and phone number of the
restaurant searched for.
Data Mining For CRM (Customer Relationship Management)
Customer Relationship Management can be reinforced with data mining. Good
customer Relations can be built by attracting more suitable customers, better cross-
selling and up-selling, better retention.

Data Mining can enhance CRM by:


1. Data mining can help businesses create targeted programs for higher
response and better ROI.
2. Businesses can offer more products and services as desired by the
customers through up-selling and cross-selling thereby increasing
customer satisfaction.
3. With data mining, a business can detect which customers are looking
for other options. Using that information companies can build ideas to
retain the customer from leaving.
Data Mining helps CRM in:
1. Database Marketing: Marketing software enables companies to send
messages and emails to customers. This tool along with data mining
can do targeted marketing. With data mining, automation, and


scheduling of jobs can be performed. It helps in better decision making.


It will also help in technical decisions as to what kind of customers are
interested in a new product, which market area is good for product
launching.
2. Customer Acquisition Campaign: With data mining, the market
professional will be able to identify potential customers who are
unaware of the products or new buyers. They will be able to design the
offers and initiatives for such customers.
3. Campaign Optimization: Companies use data mining for the
effectiveness of the campaign. It can model customer responses to
marketing offers.
Data Mining Using Decision Tree Example
A widely used family of decision tree algorithms is CART (Classification and Regression Trees). Decision tree learning is a supervised learning method in which a tree structure is built from the chosen features, the conditions for splitting, and the criteria for when to stop splitting. Decision trees are used to predict the value of the class variable based on learning from previous training data.

The internal node represents an attribute and the leaf node represents a class label.

Following steps are used to build a Decision Tree Structure:


1. Place the best attribute at the top of the tree (root).
2. Subsets are created in such a way that each subset represents data
with the same value for an attribute.
3. Repeat the same steps to find the leaf nodes of all branches.
To predict a class label, the record's attribute is compared with the condition at the root of the tree, and the matching branch is followed. The internal nodes are compared in the same way until a leaf node is reached, which gives the predicted class.

Some algorithms used for Decision Tree Induction include Hunt’s Algorithm, CART,
ID3, C4.5, SLIQ, and SPRINT.
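A minimal sketch of decision tree induction using scikit-learn's CART-based DecisionTreeClassifier is shown below; the tiny training set (age and income versus whether the customer bought a product) is made up purely for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# made-up training data: [age, income] -> whether the customer bought the product
X_train = [[25, 30000], [40, 60000], [35, 40000], [50, 80000], [23, 20000], [45, 75000]]
y_train = ['no', 'yes', 'no', 'yes', 'no', 'yes']

# scikit-learn grows the tree with an optimized CART algorithm; the splitting
# criterion and the maximum depth control how and when splitting stops
tree = DecisionTreeClassifier(criterion='gini', max_depth=3)
tree.fit(X_train, y_train)

# each printed rule is an internal node testing an attribute; each leaf is a class label
print(export_text(tree, feature_names=['age', 'income']))

# a new record is classified by walking from the root down to a leaf
print(tree.predict([[30, 65000]]))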


Most Popular Example Of Data Mining: Marketing And Sales


Marketing and Sales are the domains in which companies have large volumes of
data.

1) Banks are the first users of data mining technology as it helps them with credit
assessment. Data mining analyzes what services offered by banks are used by
customers, what type of customers use ATM cards and what do they generally buy
using their cards (for cross-selling).
Banks use data mining to analyze the transactions which the customer do before
they decide to change the bank to reduce customer attrition. Also, some outliers in
transactions are analyzed for fraud detection.

2) Cellular Phone Companies use data mining techniques to avoid churning.


Churning is a measure showing the number of customers leaving the services. It
detects patterns that show how customers can benefit from the services to retain
customers.
3) Market Basket Analysis is the technique to find the groups of items that are
bought together in stores. Analysis of the transactions show the patterns such as
which things are bought together often like bread and butter, or which items have
higher sales volume on certain days such as beer on Fridays.
This information helps in planning the store layouts, offering a special discount to the
items that are less in demand, creating offers such as “buy 2 get 1 free” or “get 50%
on second purchase” etc.
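A minimal sketch of market basket analysis is given below, assuming the third-party mlxtend library is available; the store transactions are invented purely for illustration.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# made-up store transactions
transactions = [
    ['bread', 'butter', 'milk'],
    ['bread', 'butter'],
    ['beer', 'chips'],
    ['bread', 'butter', 'chips'],
    ['beer', 'chips', 'milk'],
]

# one-hot encode the transactions into a boolean item matrix
encoder = TransactionEncoder()
item_matrix = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                           columns=encoder.columns_)

# frequent itemsets: groups of items that are bought together often enough
frequent_itemsets = apriori(item_matrix, min_support=0.4, use_colnames=True)

# association rules such as {bread} -> {butter}, with support and confidence
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])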

ADVANCED TOPICS:

As the amount of research and industry data being collected daily continues to grow,
intelligent software tools are increasingly needed to process and filter the data, detect
new patterns and similarities within it, and extract meaningful information from it.
Data mining and predictive modeling offer a means of effective classification and
analysis of large, complex, multi-dimensional data, leading to discovery of functional
models, trends and patterns.

Building upon the skills learned in previous courses, this course covers advanced data
mining, data analysis, and pattern recognition concepts and algorithms, as well as
models and machine learning algorithms.

Topics include:

 Data mining with big data


 Artificial neural networks
o Back-propagation
o Feed-forward networks


o Radial-basis functions
o Recurrent neural networks
 Probability graph models and Bayesian learning
 Hidden Markov models
 Support vector machines
 Ensemble learning: bagging, boosting, stacking
 Random forests
 Data mining tools
 Text mining

 DATA MINING WITH BIG DATA

Data Mining uses tools such as statistical models, machine learning, and visualization
to "Mine" (extract) the useful data and patterns from the Big Data, whereas Big Data
processes high-volume and high-velocity data, which is challenging to do in older
databases and analysis program.

Big Data:

Big Data refers to vast amounts of structured, semi-structured, and unstructured data, often ranging into terabytes. It is challenging to process such a huge amount of data on a single system: the RAM of the computer has to hold the interim calculations during processing and analysis, the processing steps take a very long time, and the system may fail to work correctly due to overload.

Here we will understand the concept (how much data is produced) with a live example.
We all know about Big Bazaar. We as a customer goes to Big Bazaar at least once a
month. These stores monitor each of its product that the customers purchase from
them, and from which store location over the world. They have a live information
feeding system that stores all the data in huge central servers. Imagine the number of
Big bazaar stores in India alone is around 250. Monitoring every single item purchased
by every customer along with the item description will make the data go around 1 TB
in a month.


What does Big Bazaar do with that data:

We know some promotions are running in Big Bazaar on some items. Do we genuinely
believe Big Bazaar would just run those products without any full back-up to find those
promotions would increase their sales and generate a surplus? That is where Big Data
analysis plays a vital role. Using Data Analysis techniques, Big Bazaar targets its new
customers as well as existing customers to purchase more from its stores.

Big data is characterized by the 5 Vs, that is, Volume, Variety, Velocity, Veracity, and Value.

Volume: In Big Data, volume refers to an amount of data that can be huge when it
comes to big data.

Variety: In Big Data, variety refers to various types of data such as web server logs,
social media data, company data.


Velocity: In Big Data, velocity refers to how data is growing with respect to time. In
general, data is increasing exponentially at a very fast rate.

Veracity: Big Data Veracity refers to the uncertainty of data.

Value: In Big Data, value refers to the data which we are storing, and processing is
valuable or not and how we are getting the advantage of these huge data sets.

How to Process Big Data:

A very efficient method, known as Hadoop, is primarily used for Big data processing.
It is an Open-source software that works on a Distributed Parallel processing method.

The Apache Hadoop framework is composed of the following modules:

Hadoop Common:

It contains the libraries and utilities required by the other Hadoop modules.

Hadoop Distributed File System(HDFS):

A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

Hadoop YARN:

It is a resource-management platform responsible for managing the compute resources in the cluster and using them to schedule users' applications.

Hadoop MapReduce:

It is a programming model for large-scale data processing.
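The MapReduce programming model can be illustrated with the classic word-count example. The sketch below only simulates the map, shuffle, and reduce phases in plain Python; it is not actual Hadoop code.

from collections import defaultdict

documents = ["big data needs big storage", "data mining finds patterns in data"]

# map phase: emit a (word, 1) pair for every word in every document
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# shuffle phase: group all emitted values belonging to the same key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# reduce phase: aggregate the grouped values for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. 'big' -> 2, 'data' -> 3

In real Hadoop, the map and reduce functions run in parallel on many machines, while HDFS supplies the input splits and stores the output.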

Data Mining:

As the name suggests, data mining refers to mining huge data sets to identify trends and patterns and to extract useful information.

In data mining, we look for hidden information without knowing in advance exactly what kind of patterns we are looking for or what we will use them for once they are found. When interesting information is discovered, we then think about how to make use of it to boost the business.

We will understand the data mining concept with an example:


A Data Miner starts discovering the call records of a mobile network operator without
any specific target from his manager. The manager probably gives him a significant
objective to discover at least a few new patterns in a month. As he begins extracting
the data to discover a pattern that there are some international calls on Friday
(example) compared to all other days. Now he shares this data with management, and
they come up with a plan to shrink international call rates on Friday and start a
campaign. Call duration goes high, and customers are happy with low call rates, more
customers join, the organization makes more profit as utilization percentage has
increased.

There are various steps involved in Data Mining:

Data Integration:

In step first, Data are integrated and collected from various sources.

Data Selection:

Since we may not need all of the collected data, in this step we select only the data that we think is useful for data mining.

Data Cleaning:

In this step, we address the fact that the collected information may not be clean: it can contain errors, noisy or inconsistent data, and missing values, so various strategies must be applied to get rid of such problems.

Data Transformation:

Even after cleaning, the data is not ready for mining, so we need to transform it into forms suitable for mining. The methods used to achieve this are aggregation, normalization, smoothing, etc.


Data Mining:

Once the data has been transformed, we are ready to apply data mining methods to extract useful patterns from the data sets. Clustering and association rule mining are among the many techniques used for data mining.

Pattern Evaluation:

Pattern evaluation involves visualizing the generated patterns, transforming them, and removing random or uninteresting ones.

Decision:

It is the last step in data mining. It helps users to make use of the acquired user data
to make better data-driven decisions.

Difference Between Data Mining and Big Data:

o Focus: Data mining primarily targets the analysis of data to extract useful information, whereas big data primarily targets the relationships within the data.
o Volume: Data mining can be used for large as well as low volumes of data, whereas big data by definition involves huge volumes of data.
o Nature: Data mining is a method used for data analysis, whereas big data is a whole concept rather than a single technique.
o Basis: Data mining is primarily based on statistical analysis, generally targeting prediction and the discovery of business factors on a small scale, whereas big data is primarily based on data analysis, targeting prediction and the discovery of business factors on a large scale.
o Data types: Data mining typically uses structured, relational, and dimensional databases, whereas big data uses structured, semi-structured, and unstructured data.
o Question answered: Data mining expresses the "what" of the data, whereas big data refers to the "why" of the data.
o View: Data mining gives the closest (detailed) view of the data, whereas big data gives a broad view of the data.
o Usage: Data mining is primarily used for strategic decision-making purposes, whereas big data is primarily used for dashboards and predictive measures.
 ARTIFICIAL NEURAL NETWORKS

In earlier times, conventional computers used an algorithmic approach, i.e., the computer followed a set of instructions to solve a problem; unless the specific steps the computer needs to follow are known, it cannot solve the problem. So a person is needed who can provide instructions to the computer on how to solve that particular problem. This restricted the problem-solving capacity of conventional computers to problems that we already understand and know how to solve.

But what about problems whose solution steps are not clear? That is where the traditional approach fails, and that is why neural networks came into existence. Neural networks process information in a way similar to the human brain; they learn from examples rather than being programmed to perform a specific task. Because they learn from past experience and examples, you do not need to provide all of the instructions for a specific task.

An Artificial Neural Network is biologically inspired by the neural network that constitutes the human brain.

Neural networks are modeled on the human brain so as to imitate its functionality. Just as the human brain can be seen as a neural network made up of many neurons, an Artificial Neural Network is made up of numerous perceptrons.

A neural network comprises three main layers, which are as follows:


o Input layer: The input layer accepts all the inputs that are provided by the
programmer.
o Hidden layer: In between the input and output layer, there is a set of hidden
layers on which computations are performed that further results in the output.
o Output layer: After the input undergoes a series of transformations while passing through the hidden layers, the resulting output is delivered by the output layer.

Motivation behind Neural Network

Basically, the neural network is based on the neurons, which are nothing but the brain
cells. A biological neuron receives input from other sources, combines them in some
way, followed by performing a nonlinear operation on the result, and the output is the
final result.

The dendrites will act as a receiver that receives signals from other neurons, which are
then passed on to the cell body. The cell body will perform some operations that can
be a summation, multiplication, etc. After the operations are performed on the set of inputs, the result is transferred to the next neuron via the axon, which is the transmitter of the neuron's signal.

What are Artificial Neural Networks?

Artificial Neural Networks are the computing system that is designed to simulate the
way the human brain analyzes and processes the information. Artificial Neural
Networks have self-learning capabilities that enable it to produce a better result as
more data become available. So, if the network is trained on more data, it will be more
accurate because these neural networks learn from the examples. The neural network
can be configured for specific applications like data classification, pattern recognition,
etc.

With the help of the neural network, we can actually see that a lot of technology has
been evolved from translating webpages to other languages to having a virtual
assistant to order groceries online. All of these things are possible because of neural
networks. So, an artificial neural network is nothing but a network of various artificial
neurons.

Importance of Neural Network:

o Without Neural Network: Let's have a look at the example given below. Here
we have a machine, such that we have trained it with four types of cats, as you
can see in the image below. And once we are done with the training, we will


provide a random image to that particular machine that has a cat. Since this cat
is not similar to the cats through which we have trained our system, so without
the neural network, our machine would not identify the cat in the picture.
Basically, the machine will get confused in figuring out where the cat is.
o With Neural Network: However, when we talk about the case with a neural
network, even if we have not trained our machine with that particular cat. But
still, it can identify certain features of a cat that we have trained on, and it can
match those features with the cat that is there in that particular image and can
also identify the cat. So, with the help of this example, you can clearly see the
importance of the concept of a neural network.

Working of Artificial Neural Networks

Instead of directly getting into the working of Artificial Neural Networks, let's break down and try to understand the neural network's basic unit, which is called a perceptron.

So, a perceptron can be defined as a neural network with a single layer that classifies
the linear data. It further constitutes four major components, which are as follows;

1. Inputs
2. Weights and Bias
3. Summation Functions
4. Activation or transformation function

The main logic behind the concept of Perceptron is as follows:


The inputs (x) are fed into the input layer and multiplied by their allotted weights (w); the products are then added together to form a weighted sum. This weighted sum (together with the bias) is then passed through the pertinent activation function.
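A minimal NumPy sketch of this forward pass is given below; the inputs, weights, and bias values are made up, and a sigmoid is used as the activation function.

import numpy as np

def sigmoid(z):
    # activation function that squashes the weighted sum into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# made-up inputs, weights and bias for a single perceptron
x = np.array([0.5, 0.2, 0.8])      # inputs
w = np.array([0.4, -0.6, 0.9])     # one weight per input
b = 0.1                            # bias shifts the activation curve

weighted_sum = np.dot(x, w) + b    # summation function
output = sigmoid(weighted_sum)     # activation / transformation function
print(output)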

Weights and Bias

As and when the input variable is fed into the network, a random value is given as a
weight of that particular input, such that each individual weight represents the
importance of that input in order to make correct predictions of the result.

However, bias helps in the adjustment of the curve of activation function so as to


accomplish a precise output.

Summation Function

After the weights are assigned to the input, it then computes the product of each input
and weights. Then the weighted sum is calculated by the summation function in which
all of the products are added.

Activation Function

The main objective of the activation function is to perform a mapping of a weighted


sum upon the output. The transformation function comprises of activation functions
such as tanh, ReLU, sigmoid, etc.

The activation function is categorized into two main parts:

1. Linear Activation Function


2. Non-Linear Activation Function

1. Linear Activation Function

In the linear activation function, the output of functions is not restricted in between
any range. Its range is specified from -infinity to infinity. For each individual neuron,
the inputs get multiplied with the weight of each respective neuron, which in turn
leads to the creation of output signal proportional to the input. If all the input layers
are linear in nature, then the final activation of the last layer will actually be the linear
function of the initial layer's input.


2. Non-Linear Activation Function

These are the most widely used activation functions. They help the model generalize and adapt to all sorts of data in order to correctly differentiate among the outputs. They solve the following problems faced by linear activation functions:

o Since non-linear functions have well-defined derivatives, the problems related to backpropagation are successfully solved.
o For the creation of deep neural networks, it permits the stacking up of several
layers of the neurons.


The non-linear activation function is further divided into the following parts:

1. Sigmoid or Logistic Activation Function


It provides a smooth gradient by preventing sudden jumps in the output values. Its output ranges between 0 and 1, which helps in normalizing each neuron's output. For X values between -2 and 2, the curve is very steep; in simple language, even a small change in X in that region brings a large change in Y.
Because its value ranges between 0 and 1, it is highly preferred for binary classification, whose result is either 0 or 1.

2. Tanh or Hyperbolic Tangent Activation Function


The tanh activation function works much better than the sigmoid function; it can be seen as an advanced version of the sigmoid activation function. Since its value ranges between -1 and 1, its output is zero-centered, which is why it is commonly utilized in the hidden layers of a neural network.

3. ReLU(Rectified Linear Unit) Activation Function


ReLU is one of the most widely used activation functions for the hidden layers of a neural network. Its value ranges from 0 to infinity. It helps in easing the vanishing gradient problem during backpropagation, and it is computationally cheaper than the sigmoid and tanh activation functions. It allows only a few neurons to be activated at a particular instant, which leads to effective as well as easier computations.

4. Softmax Function
It is a generalization of the sigmoid function used for solving classification problems. It is mainly used to handle multiple classes: it squeezes the output for each class to between 0 and 1 and divides by the sum of the outputs, so that the outputs form a probability distribution. This kind of function is especially used by the classifier in the output layer.
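The sketch below implements the four activation functions with NumPy; the softmax is written in its standard numerically stable form, and the sample input vector is arbitrary.

import numpy as np

def sigmoid(z):
    # output in (0, 1); useful for binary classification
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # output in (-1, 1); a zero-centered curve with the same S shape
    return np.tanh(z)

def relu(z):
    # 0 for negative inputs, identity for positive inputs
    return np.maximum(0.0, z)

def softmax(z):
    # squeezes a vector of scores into probabilities that sum to 1
    shifted = z - np.max(z)          # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

z = np.array([-2.0, 0.0, 1.5, 3.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z), sep='\n')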

Gradient Descent Algorithm

Gradient descent is an optimization algorithm that is utilized to minimize the cost


function used in various machine learning algorithms so as to update the parameters
of the learning model. In linear regression, these parameters are coefficients, whereas,
in the neural network, they are weights.

Procedure:

It all starts with an initial value for the coefficient of the function, which may be 0.0 or any small arbitrary value.


coefficient = 0.0

To estimate the cost of the coefficient, it is plugged into the cost function and evaluated.

cost = f(coefficient)
or, cost = evaluate(f(coefficient))

Next, the derivative will be calculated, which is a concept from calculus that gives the function's slope at a given point. We need to calculate the slope in order to know the direction in which the value of the coefficient should move so as to achieve a lower cost in the next iteration.

delta = derivative(cost)

Now that we have found the downhill direction, we can update the value of the coefficient. We also need to specify alpha, the learning rate parameter, which controls the size of the adjustment made to the coefficient on each update.

coefficient = coefficient - (alpha * delta)

Until the cost of the coefficient reaches 0.0 or somewhat close enough to it, the whole
process will reiterate again and again.

It can be concluded that gradient descent is a very simple as well as straightforward


concept. It just requires you to know about the gradient of the cost function or simply
the function that you are willing to optimize.
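A minimal sketch of this procedure in Python is given below, minimizing a made-up convex cost function f(c) = (c - 3)^2 whose derivative is known analytically; the learning rate and the number of iterations are arbitrary choices.

def cost(coefficient):
    # made-up convex cost function with its minimum at coefficient = 3
    return (coefficient - 3.0) ** 2

def derivative(coefficient):
    # analytical derivative (slope) of the cost function above
    return 2.0 * (coefficient - 3.0)

coefficient = 0.0        # initial value
alpha = 0.1              # learning rate: size of each update

for step in range(50):
    delta = derivative(coefficient)              # slope at the current point
    coefficient = coefficient - (alpha * delta)  # move downhill

print(coefficient, cost(coefficient))            # coefficient ends up close to 3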

Batch Gradient Descent

In every iteration of gradient descent, batch gradient descent processes all of the training examples. When we have a large number of training examples, batch gradient descent therefore tends to be expensive and less preferable.

Algorithm for Batch Gradient Descent

Let m be the number of training examples and n be the number of features.

Now assume that hƟ represents the hypothesis for linear regression and ∑ computes
the sum over all training examples from i = 1 to m. Then the cost function will be computed by:


Jtrain(Ɵ) = (1/2m) * Σ i=1..m (hƟ(x(i)) - y(i))2

Repeat {

Ɵj = Ɵj - (learning rate/m) * Σ i=1..m (hƟ(x(i)) - y(i)) * xj(i)   (simultaneously for every j = 0 ... n)

}

Here xj(i) denotes the jth feature of the ith training example. If m is very large, each single update becomes computationally very expensive, because the sum over all m examples has to be evaluated before every step.

Stochastic Gradient Descent

In a single iteration, stochastic gradient descent processes only one training example, which means all the parameters are updated after each single training example is processed. Each update tends to be much faster than in batch gradient descent, but because only one example is processed per iteration, the algorithm may need a large number of iterations when there are many training examples. To train the parameters evenly across all types of data, the dataset should be properly shuffled.

Algorithm for Stochastic Gradient Descent

Suppose that (x(i), y(i)) is the ith training example. Then

Cost(Ɵ, (x(i), y(i))) = (1/2) * (hƟ(x(i)) - y(i))2

Jtrain(Ɵ) = (1/m) * Σ i=1..m Cost(Ɵ, (x(i), y(i)))

Repeat {

For i = 1 to m {

Ɵj = Ɵj - (learning rate) * (hƟ(x(i)) - y(i)) * xj(i)   (for every j = 0 ... n)

}

}

Note that, unlike batch gradient descent, the single-example cost and the update rule contain no summation over the training set, because only one example is used per update.
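A minimal sketch contrasting the two update rules on a tiny linear regression problem is given below; the data and learning rate are made up for illustration, and hƟ(x) = Ɵ0 + Ɵ1*x.

import numpy as np

# made-up training data roughly following y = 1 + 2x
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.0, 4.9, 7.2, 9.1])
m = len(X)
lr = 0.05

# batch gradient descent: every update sums the error over all m examples
theta0, theta1 = 0.0, 0.0
for epoch in range(200):
    errors = (theta0 + theta1 * X) - y
    theta0 -= (lr / m) * np.sum(errors)
    theta1 -= (lr / m) * np.sum(errors * X)

# stochastic gradient descent: parameters are updated after every single example
s0, s1 = 0.0, 0.0
for epoch in range(200):
    for i in np.random.permutation(m):           # shuffle so training is even
        error = (s0 + s1 * X[i]) - y[i]
        s0 -= lr * error
        s1 -= lr * error * X[i]

print('batch     :', theta0, theta1)
print('stochastic:', s0, s1)

Both runs should end up near Ɵ0 ≈ 1 and Ɵ1 ≈ 2, but the stochastic path gets there through many small, noisy updates.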


Convergence trends in different variants of Gradient Descent

The Batch Gradient Descent algorithm follows a straight-line path towards the
minimum. The algorithm converges towards the global minimum, in case the cost
function is convex, else towards the local minimum, if the cost function is not convex.
Here the learning rate is typically constant.

However, in the case of Stochastic Gradient Descent, the algorithm fluctuates around the global minimum rather than converging smoothly. The learning rate is decreased slowly so that it can converge. Since it processes only one example per iteration, the path it takes tends to be noisy.

Backpropagation

A backpropagation network consists of an input layer of neurons, an output layer, and at least one hidden layer. Each neuron computes a weighted sum of its inputs, which is then passed through an activation function, typically the sigmoid activation function. Backpropagation uses supervised learning to train the network: it repeatedly updates the weights of the network until the network produces the desired output. Its training and performance depend on the following factors:

o Random (initial) values of weights.


o A number of training cycles.
o A number of hidden neurons.
o The training set.
o Teaching parameter values such as learning rate and momentum.

Working of Backpropagation

Consider the diagram given below.


1. The preconnected paths transfer the inputs X.


2. Then the weights W are randomly selected, which are used to model the input.
3. After then, the output is calculated for every individual neuron that passes from
the input layer to the hidden layer and then to the output layer.
4. Lastly, the error at the output is evaluated: Error = Actual Output - Desired Output.
5. The errors are sent back to the hidden layer from the output layer for adjusting
the weights to lessen the error.
6. Until the desired result is achieved, keep iterating all of the processes.
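A minimal NumPy sketch of these steps for a tiny network with one hidden layer and sigmoid activations is given below; the toy XOR-style data, the layer sizes, the learning rate, and the number of epochs are all made-up choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# made-up inputs X and desired outputs y
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# step 2: weights (and biases) are randomly initialized
W1 = rng.normal(size=(2, 4))          # input layer  -> hidden layer
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1))          # hidden layer -> output layer
b2 = np.zeros((1, 1))
lr = 1.0

for epoch in range(10000):
    # step 3: forward pass through the hidden layer and the output layer
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # step 4: error at the output
    error = y - output

    # step 5: send the error back and adjust the weights
    # (the derivative of the sigmoid s is s * (1 - s))
    d_output = error * output * (1 - output)
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)
    W2 += lr * hidden.T @ d_output
    b2 += lr * np.sum(d_output, axis=0, keepdims=True)
    W1 += lr * X.T @ d_hidden
    b1 += lr * np.sum(d_hidden, axis=0, keepdims=True)

# step 6: after enough iterations the outputs typically end up close to 0, 1, 1, 0
print(np.round(output, 2))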

Need of Backpropagation

o Since it is fast as well as simple, it is very easy to implement.


o Apart from the number of inputs, it does not require any other parameters to be tuned.
o As it does not require any prior knowledge about the network, it tends to be more flexible.
o It is a standard method that generally works well.

Building an ANN

Before starting with building an ANN model, we will require a dataset on which our
model is going to work. The dataset is the collection of data for a particular problem,
which is in the form of a CSV file.


CSV stands for Comma-Separated Values, a format that saves data in tabular form. We are using a fictional bank dataset that contains the details of 10,000 of its customers. The exercise is carried out because the bank is seeing unusual churn rates, which means customers are leaving at an unusually high rate, and it wants to know the reason behind this so that it can assess and address the problem.

Here we are going to solve this business problem using artificial neural networks. The
problem that we are going to deal with is a classification problem. We have several
independent variables like Credit Score, Balance, and Number of Products on the basis
of which we are going to predict which customers are leaving the bank. Basically, we
are going to do a classification problem, and artificial neural networks can do a terrific
job at making such kind of predictions.

So, we will start with installing the Keras library, TensorFlow library, as well as
the Theano library on Anaconda Prompt, and for that, you need to open it as
administrator followed by running the commands one after other as given below.

1. pip install theano


2. pip install tensorflow
3. pip install keras

So, we have installed Keras library too.

Now that we are done with the installation, the next step is to update all these libraries
to the most updated version, and it can be done by following the given code.

1. conda update --all

Since we are doing it for the very first time, it will ask whether to proceed or not.
Confirm it with y and press enter.

After the libraries are updated successfully, we will close the Anaconda prompt and
get back to the Spyder IDE.

Now we will start building our model in two parts, such that in part 1st, we will do data
pre-processing, however in 2nd part, we will create the ANN model.

Data pre-processing is very necessary to prepare the data correctly for building a
future deep learning model. Since we are in front of a classification problem, so we
have some independent variables encompassing some information about customers
in a bank, and we are trying to predict the binary outcome for the dependent variable,
i.e., either 1 if the customer leaves the bank or 0 if the customer stays in the bank.


Part1: Data Pre-processing

We will start by importing some of the pre-defined Python libraries such as


NumPy, Matplotlib, and Pandas so as to perform data-preprocessing. All these
libraries perform some sort of specific tasks.

NumPy

NumPy (Numerical Python) is a Python library that allows linear-algebraic, mathematical, and logical operations on arrays, as well as Fourier transforms and routines to manipulate array shapes.

1. import numpy as np

Matplotlib

It is an open-source library with the help of which charts can be plotted in Python. It is used here to visualize the data, for which we import its pyplot sub-library.

1. import matplotlib.pyplot as plt

Pandas

Pandas is also an open-source library that provides high-performance data manipulation and analysis tools. It is mainly used to load and handle the data for analysis.

1. import pandas as pd

An output image is given below, which shows that the libraries have been successfully
imported.

Next, we will import the data file from the current working directory with the help of Pandas. We will use read_csv(), which can read a CSV file both locally and from a URL.

1. dataset = pd.read_csv('Churn_Modelling.csv')

In the code given above, dataset is the name of the variable in which we save the data, and the file name is passed to read_csv(). Once the code is run, we can see that the data is loaded successfully.


By clicking on the Variable explorer and selecting the dataset, we can check the
dataset, as shown in the following image.

Next, we will create the matrix of features, which is nothing but a matrix of the independent variables. Since we don't know which independent variable might have the most impact on the dependent variable, the artificial neural network will discover this by looking at the correlations; it will give bigger weights to those independent variables that have the most impact.

So, we will include all the independent variables from the credit score to the last one
that is the estimated salary.

1. X = dataset.iloc[:, 3:13].values

After running the above code, we will see that we have successfully created the matrix
of feature X. Next, we will create a dependent variable vector.

1. y = dataset.iloc[:, 13].values

By clicking on y, we can have a look that y contains binary outcome, i.e., 0 or 1 for all
the 10,000 customers of the bank.

Next, we will split the dataset into a training set and a test set. But before that, we need to encode the matrix of features, as it contains categorical data. The dependent variable is also categorical, but it already takes numerical values (0 or 1), so it does not need to be encoded. Our independent variables, however, contain categories given as strings, so we need to encode the categorical independent variables.

The categorical data is encoded before splitting so that the encoding is applied consistently to the whole matrix of features X. We start by importing the required encoder classes:

1. from sklearn.preprocessing import LabelEncoder, OneHotEncoder

Before encoding, let's have a look at our matrix of features from the console; for that, we just need to type X in the console and press Enter.

Output:


From the image given above, we can see that we have only two categorical
independent variables, which is the country variable containing three countries, i.e.,
France, Spain, and Germany, and the other one is the gender variable, i.e., male and
female. So, we have got these two variables, which we will encode in our matrix of
features.

So we will need to create two label encoder objects. We first create a label encoder object named labelencoder_X_1 and apply its fit_transform method to encode the country variable, which will, in turn, convert the strings France, Spain, and Germany into the numbers 0, 1, and 2.

1. from sklearn.preprocessing import LabelEncoder, OneHotEncoder


2. labelencoder_X_1 = LabelEncoder()
3. X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])

After executing the code, we will now have a look at the X variable, simply by pressing
X in the console, as we did in the earlier step.

Output:

So, from the output image given above, we can see that France became 0, Germany
became 1, and Spain became 2.

Now in a similar manner, we will do the same for the other variable, i.e., Gender
variable but with a new object.

1. labelencoder_X_2 = LabelEncoder()
2. X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

Output:

We can clearly see that females became 0 and males became 1. Since there is no relational order between the categories of a categorical variable, we need to create dummy variables for the country variable, as it contains three categories; the gender variable has only two categories, so creating dummy variables for it is unnecessary. Later we will also remove one of the dummy columns to avoid the dummy variable trap. We will use the OneHotEncoder class to create the dummy variables.

1. from sklearn.compose import ColumnTransformer


2. label_encoder_x_1 = LabelEncoder()
3. X[: , 2] = label_encoder_x_1.fit_transform(X[:,2])
4. transformer = ColumnTransformer(


5. transformers=[
6. ("OneHot", # Just a name
7. OneHotEncoder(), # The transformer class
8. [1] # The column(s) to be applied on.
9. )
10. ],
11. remainder='passthrough' # don't apply anything to the remaining columns
12. )
13. X = transformer.fit_transform(X.tolist())
14. X = X.astype('float64')

Output:

By having a look at X, we can see that all the columns are of the same type now. Also,
the type is no longer an object but float64. We can see that we have twelve
independent variables because we have three new dummy variables.

Next, we will remove one dummy variable to avoid falling into a dummy variable trap.
We will take a matrix of features X and update it by taking all the lines of this matrix
and all the columns except the first one.

1. X = X[:, 1:]

Output:

It can be seen that we are left with only two dummy variables, so no more dummy
variable trap.

Now we are ready to split the dataset into the training set and test set. We have taken
the test size to 0.2 for training the ANN on 8,000 observations and testing its
performance on 2,000 observations.

1. from sklearn.model_selection import train_test_split


2. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_
state = 0)

By executing the code given above, we will get four different variables that can be seen
under the variable explorer section.

Output:


Since we are going to have highly compute-intensive calculations, and we don't want one independent variable dominating the others, we will apply feature scaling to ease the calculations.

1. from sklearn.preprocessing import StandardScaler


2. sc = StandardScaler()
3. X_train = sc.fit_transform(X_train)
4. X_test = sc.transform(X_test)

After executing the above code, we can have a quick look at X_train and X_test to
check if all the independent variables are scaled properly or not.

Now that our data is well pre-processed, we will start by building an artificial neural
network.

Part2: Building an ANN

We will start by importing the Keras library as well as the desired packages, as Keras will build the neural network on top of TensorFlow.

1. import keras

After importing the Keras library, we will now import two modules, i.e., the Sequential
module, which is required to initialize our neural network, and the Dense module that
is needed to build the layer of our ANN.

1. from keras.models import Sequential


2. from keras.layers import Dense

Next, we will initialize the ANN, i.e., define it as a sequence of layers. A deep
learning model can be initialized in two ways: by defining a sequence of layers or by
defining a graph. Since we are going to build our ANN with successive layers, we will
initialize our deep learning model as a sequence of layers.

This is done by creating an object of the Sequential class. The object we create is the
model itself, i.e., a neural network that will play the role of a classifier, because we
are solving a classification problem in which we have to predict a class. Since we will
later predict the test set results through this object, we will name it classifier; it
is nothing but the Artificial Neural Network that we are about to build.


Since this classifier is an object of the Sequential class, we create it without passing
any argument, because we will define the layers step by step: starting with the input
layer, followed by some hidden layers, and then the output layer.

classifier = Sequential()

After this, we will add the input layer and the first hidden layer. We will take the
classifier that we initialized in the previous step and use the add() method to add the
different layers of our neural network. In add(), we pass the layer to be added; here
the input layer and the first hidden layer are added together with the help of the
Dense() function mentioned above.

Within the Dense() function, we will pass the following arguments:

o units is the very first argument, which can be defined as the number of nodes
  that we want to add to the hidden layer.
o The second argument is the kernel_initializer, which initializes the weights
  randomly as small numbers close to zero. Here we use the simple 'uniform'
  initializer, which draws the weights from a uniform distribution.
o The third argument is the activation, i.e., the activation function that we want
  to use in the layer. We will use the rectifier function for the hidden layers and
  the sigmoid function for the output layer. Since we are in a hidden layer, we pass
  the "relu" value, as it corresponds to the rectifier function.
o And the last is the input_dim argument, which specifies the number of nodes in
  the input layer, i.e., the number of independent variables. This argument is
  necessary because so far we have only initialized our ANN and have not created any
  layer yet, so the hidden layer we are creating would not otherwise know how many
  inputs to expect. After the first hidden layer is created, we do not need to
  specify this argument for the next hidden layers.

classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))

Next, we will add the second hidden layer using the same add() method with the Dense()
function and the same parameters as in the earlier step, except for input_dim, which is
no longer needed.

classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

After adding the two hidden layers, we will now add the final output layer. This is
again similar to the previous step, except that we change the units parameter: the
output layer needs only one node, because our dependent variable is a categorical
variable with a binary outcome, and a binary outcome requires a single output node.
Therefore, we set units equal to 1, and since we are in the output layer, we replace
the rectifier function with the sigmoid activation function.

classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

Now that we are done adding the layers of our ANN, we will compile the whole artificial
neural network by applying stochastic gradient descent. We start with our classifier
object, call the compile method on it, and pass the following arguments:

o The first argument is the optimizer, which is simply the algorithm that we want
  to use to find the optimal set of weights in the neural network. The algorithm we
  are going to use is stochastic gradient descent; there are several variants of it,
  and a very efficient one is called "adam", which is what we pass to this optimizer
  parameter.
o The second parameter is the loss, the loss function that the stochastic gradient
  descent algorithm minimizes to find the optimal weights. Since our dependent
  variable has a binary outcome, we use the binary_crossentropy logarithmic loss; if
  the dependent variable had more than two categories, we would use
  categorical_crossentropy instead.
o The last argument is the metrics, which is simply a criterion used to evaluate the
  model; here we use "accuracy". When the weights are updated after each batch of
  observations, the algorithm uses this accuracy criterion to track and improve the
  model's performance.

classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])


Next, we will fit the ANN to the training set using the fit method. In the fit method,
we will pass the following arguments:

o The first argument is the dataset on which we want to train our classifier; the
  training set is passed as two arguments, X_train (the matrix of features containing
  the observations of the training set) and y_train (containing the actual outcomes
  of the dependent variable for all the observations in the training set).
o The next argument is the batch_size, which is the number of observations after
  which we want to update the weights.
o And lastly, the number of epochs, i.e., how many times the whole training set is
  passed through the network; over the epochs we can watch the algorithm in action
  and the accuracy improve.

classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

From the training output shown above, you can see that our model is ready and has
reached an accuracy of approximately 84%; this is how the stochastic gradient descent
algorithm performs.

Part3: Making the Predictions and Evaluating the Model

Since we are done with training the ANN on the training set, we will now make
predictions on the test set.

y_pred = classifier.predict(X_test)

From the output, we can see the probability that each of the 2,000 customers of the
test set will leave the bank. For example, the first probability, roughly 20%, means
that the first customer of the test set, indexed by zero, has about a 20% chance of
leaving the bank.

The predict method returns the probability that each customer leaves the bank, but to
build a confusion matrix we do not need these probabilities; we need the predicted
results in the form of True or False. So, we need to transform these probabilities into
predicted classes.

We will choose a threshold value to decide when the predicted result is one and when it
is zero: we predict 1 above the threshold and 0 below it. The natural threshold to take
is 0.5, i.e., 50%; if y_pred is larger than 0.5 it returns True, otherwise False.


y_pred = (y_pred > 0.5)

Now, if we have a look at y_pred, we will see that it has updated the results in the
form of "False" or "True".

So, the first five customers of the test set don't leave the bank according to the model,
whereas the sixth customer in the test set leaves the bank.

Next, we will execute the following code to get the confusion matrix.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

From the output given above, we can see that out of 2,000 new observations we get
1542 + 141 = 1683 correct predictions and 264 + 53 = 317 incorrect predictions.

So, now we will compute the accuracy on the console, which is the number of correct
predictions divided by the total number of predictions.
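For example, a minimal sketch of this computation from the confusion matrix cm obtained above (assuming the usual 2x2 layout returned by scikit-learn) would be:

accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()   # correct predictions / total predictions
print(accuracy)                               # (1542 + 141) / 2000 = 0.8415 for the figures quoted above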

So, we got an accuracy of about 84% on new observations on which we did not train our
ANN, which is a good result. It is essentially the same accuracy that we obtained on
the training set, now confirmed on the test set.

So, eventually, we can validate our model, and now the bank can use it to make a
ranking of their customers, ranked by their probability to leave the bank, from the
customer that has the highest probability to leave the bank, down to the customer
that has the lowest probability to leave the bank.

SUPPORT VECTOR MACHINES

Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data
point in the correct category in the future. This best decision boundary is called a
hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed the Support
Vector Machine. Consider the below diagram in which there are two different categories
that are classified using a decision boundary or hyperplane:


Example: SVM can be understood with the example that we used for the KNN classifier.
Suppose we see a strange cat that also has some features of dogs; if we want a model
that can accurately identify whether it is a cat or a dog, such a model can be created
using the SVM algorithm. We will first train our model with lots of images of cats and
dogs so that it can learn their different features, and then we test it with this
strange creature. The SVM draws a decision boundary between the two classes (cat and
dog) using the extreme cases (support vectors), so it looks at the extreme cases of cats
and dogs. On the basis of the support vectors, it will classify the creature as a cat.
Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.


Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a
  dataset can be classified into two classes by using a single straight line, then
  such data is termed linearly separable data, and the classifier used is called the
  Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means
  that if a dataset cannot be classified by using a straight line, then such data is
  termed non-linear data, and the classifier used is called the Non-linear SVM
  classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes


in n-dimensional space, but we need to find out the best decision boundary that helps
to classify the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset: if
there are 2 features (as shown in the image), then the hyperplane will be a straight
line, and if there are 3 features, then the hyperplane will be a two-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the
maximum distance between the data points.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect its
position are termed support vectors. Since these vectors support the hyperplane, they
are called support vectors.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose
we have a dataset that has two tags (green and blue), and the dataset has two features
x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either
green or blue. Consider the below image:


Since this is a 2-D space, we can easily separate these two classes using just a
straight line. But there can be multiple lines that separate these classes. Consider
the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points
of the lines from both classes; these points are called support vectors. The distance
between the vectors and the hyperplane is called the margin, and the goal of SVM is to
maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.


Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:


So now, SVM will divide the datasets into classes in the following way. Consider the
below image:

Since we are in 3-D space, it looks like a plane parallel to the x-axis. If we convert
it back to 2-D space with z = 1, then it will become:


Hence we get a circle of radius 1 in the case of non-linear data.

Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the same
dataset user_data, which we have used in Logistic regression and KNN classification.

o Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the code:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

After executing the above code, we will pre-process the data. The code will give the
dataset as:

The scaled output for the test set will be:

Fitting the SVM classifier to the training set:

Now the training set will be fitted to the SVM classifier. To create the SVM classifier,
we will import SVC class from Sklearn.svm library. Below is the code for it:

from sklearn.svm import SVC # "Support vector classifier"

classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', because here we are creating an SVM
for linearly separable data; however, we can change it for non-linear data. We then
fitted the classifier to the training dataset (x_train, y_train).

The model performance can be altered by changing the value of C (the regularization
factor), gamma, and the kernel.
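As an illustration only (these particular values are not taken from the original example), a non-linear SVM with different hyperparameters could be created in exactly the same way:

from sklearn.svm import SVC

# RBF (Gaussian) kernel for non-linearly separable data; C and gamma are
# illustrative values that would normally be tuned, e.g. with a grid search.
nonlinear_classifier = SVC(kernel='rbf', C=10.0, gamma=0.1, random_state=0)
nonlinear_classifier.fit(x_train, y_train)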

o Predicting the test set result:


Now, we will predict the output for the test set. For this, we will create a new
vector y_pred. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

After getting the y_pred vector, we can compare y_pred and y_test to check the
difference between the actual values and the predicted values.


o Creating the confusion matrix:


Now we will check the performance of the SVM classifier, i.e., how many incorrect
predictions it makes compared to the Logistic Regression classifier. To create the
confusion matrix, we need to import the confusion_matrix function from the sklearn
library. After importing the function, we will call it and store the result in a new
variable cm. The function takes two parameters, mainly y_true (the actual values) and
y_pred (the values returned by the classifier). Below is the code for it:
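A minimal version consistent with the imports and variable names used in the earlier steps would be:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)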

As we can see in the above output image, there are 66 + 24 = 90 correct predictions and
8 + 2 = 10 incorrect predictions. Therefore, we can say that our SVM model improved
compared to the Logistic Regression model.

o Visualizing the training set result:


Now we will visualize the training set result, below is the code for it:
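A version that mirrors the test-set visualization code given further below, with the training data and the plot title swapped in, would be:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()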

By executing the above code, we will get the output as:

As we can see, the output appears similar to the Logistic Regression output. We got a
straight line as the hyperplane because we used a linear kernel in the classifier, and
as discussed above, in 2-D space the SVM hyperplane is a straight line.

o Visualizing the test set result:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:

As we can see in the above output image, the SVM classifier has divided the users into
two regions (Purchased or Not Purchased). Users who purchased the SUV are in the red
region with the red scatter points, and users who did not purchase the SUV are in the
green region with the green scatter points. The hyperplane has divided the users into
the two classes, Purchased and Not Purchased.

Bagging Vs Boosting

We all use the decision tree technique in day-to-day life to make decisions.
Organizations use supervised machine learning techniques like decision trees to make
better decisions and to generate more surplus and profit.

Ensemble methods combine several decision trees to deliver better predictive results
than utilizing a single decision tree. The primary principle behind the ensemble model
is that a group of weak learners come together to form a strong learner.

There are two techniques given below that are used to perform ensemble decision
tree.

Bagging

Bagging is used when our objective is to reduce the variance of a decision tree. Here
the concept is to create a few subsets of data from the training sample, chosen
randomly with replacement. Each collection of subset data is then used to train its own
decision tree; thus, we end up with an ensemble of various models. The average of all
the predictions from the numerous trees is used, which is more robust than a single
decision tree.

Random Forest is an expansion over bagging. It takes one additional step: besides
taking a random subset of the data, it also takes a random selection of features rather
than using all features to grow the trees. When we have numerous random trees, it is
called a Random Forest.

These are the following steps which are taken to implement a Random forest:

o Let us consider a training data set with X observations and Y features. First, a
  sample is taken randomly from the training data set with replacement.
o A tree is grown on this sample to its largest extent.
o The given steps are repeated, and the prediction is given based on the collection
  of predictions from the n trees (a brief code sketch follows below).
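As a minimal sketch of these ideas in scikit-learn (the names X_train and y_train stand for any already prepared training set and are not taken from the text):

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Plain bagging: each of the 100 trees is trained on a bootstrap sample drawn
# with replacement (the default base estimator is a decision tree).
bagging = BaggingClassifier(n_estimators=100, random_state=0)
bagging.fit(X_train, y_train)

# Random Forest: bagging plus a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
forest.fit(X_train, y_train)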

Advantages of using Random Forest technique:

o It manages high-dimensional data sets very well.
o It handles missing values and maintains accuracy for missing data.

Disadvantages of using Random Forest technique:


Since the final prediction is based on the mean of the predictions from the subset
trees, it won't give precise continuous values for the regression model.

Boosting:

Boosting is another ensemble procedure used to make a collection of predictors. In
other words, we fit consecutive trees, usually on random samples, and at each step the
objective is to reduce the net error from the prior trees.

If a given input is misclassified by a hypothesis, its weight is increased so that the
next hypothesis is more likely to classify it correctly; combining the whole set at the
end converts weak learners into a better-performing model.

Gradient Boosting is an expansion of the boosting procedure

Gradient Boosting = Gradient Descent + Boosting

It utilizes a gradient descent algorithm that can optimize any differentiable loss
function. An ensemble of trees is constructed tree by tree, and the individual trees
are summed successively. The next tree tries to reduce the residual loss (the
difference between the actual and predicted values).
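A minimal sketch of gradient boosting with scikit-learn (the hyperparameter values are illustrative, and X_train, y_train, X_test are placeholder names for an already prepared dataset):

from sklearn.ensemble import GradientBoostingClassifier

# Each new tree is fitted to the residual errors of the trees built so far.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)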

Advantages of using Gradient Boosting methods:

o It supports different loss functions.


o It works well with interactions.

Disadvantages of using a Gradient Boosting methods:

o It requires cautious tuning of different hyper-parameters.

Difference between Bagging and Boosting:

o Training data: In bagging, various training data subsets are randomly drawn with
  replacement from the whole training dataset; in boosting, each new subset contains
  the components that were misclassified by previous models.
o Goal: Bagging attempts to tackle the over-fitting issue, whereas boosting tries to
  reduce bias.
o When to use: If the classifier is unstable (high variance), we apply bagging; if
  the classifier is steady and straightforward (high bias), we apply boosting.
o Model weighting: In bagging, every model receives an equal weight; in boosting,
  models are weighted by their performance.
o Objective: Bagging aims to decrease variance, not bias; boosting aims to decrease
  bias, not variance.
o Combining predictions: Bagging is the easiest way of combining predictions that
  belong to the same type; boosting is a way of combining predictions that belong to
  different types.
o Model construction: In bagging, every model is constructed independently; in
  boosting, new models are affected by the performance of the previously developed
  model.

STACKING

Stacking is one of the popular ensemble modeling techniques in machine learning.
Various weak learners are ensembled in a parallel manner in such a way that, by
combining them with a meta learner, we can make better predictions for the future.

This ensemble technique works by feeding the combined predictions of multiple weak
learners to a meta learner so that a better output prediction model can be achieved.

In stacking, an algorithm takes the outputs of sub-models as input and attempts to


learn how to best combine the input predictions to make a better output prediction.

Stacking is also known as stacked generalization and is an extended form of the Model
Averaging Ensemble technique, in which the sub-models contribute according to their
performance weights to build a new model with better predictions. This new model is
stacked up on top of the others, which is the reason why it is named stacking.


Architecture of Stacking

The architecture of the stacking model is designed in such a way that it consists of
two or more base/learner models and a meta-model that combines the predictions of the
base models. These base models are called level 0 models, and the meta-model is known
as the level 1 model. So, the stacking ensemble method includes original (training)
data, primary level models, primary level predictions, a secondary level model, and
the final prediction. The basic architecture of stacking is shown in the image below.

o Original data: This data is divided into n-folds and is also considered test data
or training data.
o Base models: These models are also referred to as level-0 models. These
models use training data and provide compiled predictions (level-0) as an
output.
o Level-0 Predictions: Each base model is triggered on some training data and
provides different predictions, which are known as level-0 predictions.
o Meta Model: The architecture of the stacking model consists of one meta-
model, which helps to best combine the predictions of the base models. The
meta-model is also known as the level-1 model.
o Level-1 Prediction: The meta-model learns how to best combine the
predictions of the base models and is trained on different predictions made by
individual base models, i.e., data not used to train the base models are fed to
the meta-model, predictions are made, and these predictions, along with the
expected outputs, provide the input and output pairs of the training dataset
used to fit the meta-model.

Steps to implement Stacking models:

There are some important steps to implementing stacking models in machine learning.
These are as follows:


o Split the training dataset into n folds using RepeatedStratifiedKFold, as this is
  the most common approach to preparing training datasets for meta-models.
o Now a base model is fitted on the first n-1 folds and makes predictions for the
  nth fold.
o The predictions made in the above step are added to the x1_train list.
o Repeat steps 2 & 3 for the remaining folds, which gives an x1_train array covering
  all n parts of the data.
o Now the model is trained on all the n parts and makes predictions for the sample
  (test) data.
o Add these predictions to the y1_test list.
o In the same way, we can find x2_train, y2_test, x3_train, and y3_test by using
  Model 2 and Model 3 for training, respectively, to get the level 2 predictions.
o Now train the meta-model on the level 1 predictions, where these predictions are
  used as features for the model.
o Finally, the meta learner can be used to make predictions on test data in the
  stacking model (a brief code sketch follows below).
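A compact sketch of this workflow, assuming scikit-learn's built-in StackingClassifier (which performs the cross-validated level-0/level-1 procedure internally) and placeholder data X_train, y_train, X_test:

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Level-0 (base) models
base_models = [('rf', RandomForestClassifier(random_state=0)),
               ('svc', SVC(probability=True, random_state=0))]

# Level-1 (meta) model trained on the out-of-fold predictions of the base models
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(), cv=cv)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)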

Stacking Ensemble Family

There are some other ensemble techniques that can be considered the forerunner of
the stacking method. For better understanding, we have divided them into the
different frameworks of essential stacking so that we can easily understand the
differences between methods and the uniqueness of each technique. Let's discuss a
few commonly used ensemble techniques related to stacking.

Voting ensembles:

This is one of the simplest stacking ensemble methods, which uses different algorithms
to prepare all members individually. Unlike the stacking method, the voting ensemble
uses simple statistics instead of learning how to best combine predictions from base
models separately.

It is well suited to regression problems, where we predict the mean or median of the
predictions from the base models. It is also helpful in various classification
problems, where the prediction is based on the total votes received. Predicting the
class that receives the highest number of votes is referred to as hard voting, whereas
predicting the class with the largest summed predicted probability is referred to as
soft voting.
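A small sketch of both variants with scikit-learn's VotingClassifier (the member models and the data names X_train, y_train are illustrative):

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

members = [('lr', LogisticRegression()),
           ('dt', DecisionTreeClassifier()),
           ('svc', SVC(probability=True))]

hard_vote = VotingClassifier(estimators=members, voting='hard')  # class with most votes
soft_vote = VotingClassifier(estimators=members, voting='soft')  # largest summed probability
hard_vote.fit(X_train, y_train)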


The voting ensemble differs from the stacking ensemble in terms of weighing models
based on each member's performance: here, all models are considered to have the same
skill level.

Member Assessment: In the voting ensemble, all members are assumed to have the
same skill sets.

Combine with Model: Instead of using combined prediction from each member, it
uses simple statistics to get the final prediction, e.g., mean or median.

Weighted Average Ensemble

The weighted average ensemble is considered the next level of the voting ensemble,
which uses a diverse collection of model types as contributing members. This method
uses some training datasets to find the average weight of each ensemble member
based on their performance. An improvement over this naive approach is to weigh
each member based on its performance on a hold-out dataset, such as a validation set
or out-of-fold predictions during k-fold cross-validation. Furthermore, it may also
involve tuning the coefficient weightings for each model using an optimization
algorithm and performance on a holdout dataset.

Member Assessment: Weighted average ensemble method uses member


performance based on the training dataset.

Combine With Model: It considers the weighted average of prediction from each
member separately.
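A weighted average ensemble can be approximated in scikit-learn by passing per-member weights to a soft VotingClassifier (the weights and data names here are purely illustrative; in practice they would come from each member's score on a hold-out set):

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

members = [('lr', LogisticRegression()),
           ('dt', DecisionTreeClassifier()),
           ('svc', SVC(probability=True))]

# Weights would normally be proportional to each member's hold-out performance.
weighted_vote = VotingClassifier(estimators=members, voting='soft', weights=[0.2, 0.3, 0.5])
weighted_vote.fit(X_train, y_train)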

Blending Ensemble:

Blending is a similar approach to stacking with a specific configuration. It is
considered a stacking method that uses a hold-out validation set, rather than k-fold
cross-validation, to prepare the out-of-sample predictions for the meta-model. In this
method, the training dataset is first split into a training set and a validation set,
and the learner models are trained on the training set. Predictions are then made on
the validation set and on the test set; the validation predictions are used as features
to build a new model, which is later used to make the final predictions on the test set
using the prediction values as features.

Member Predictions: The blending stacking ensemble uses out-of-sample predictions


on a validation set.

Combine With Model: Linear model (e.g., linear regression or logistic regression).

RANDOM FORESTS

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the
model.

As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the
random forest takes the prediction from each tree and, based on the majority vote of
the predictions, predicts the final output.

The greater number of trees in the forest leads to higher accuracy and prevents
the problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:

Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, it
is possible that some decision trees may predict the correct output, while others may


not. But together, all the trees predict the correct output. Therefore, below are two
assumptions for a better Random forest classifier:

o There should be some actual values in the feature variable of the dataset so
that the classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.

Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:

<="" li="">

o It takes less training time as compared to other algorithms.


o It predicts output with high accuracy, and it runs efficiently even on large
  datasets.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by
combining N decision trees, and the second is to make predictions with each tree
created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the
new data points to the category that wins the majority votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into subsets and
given to each decision tree. During the training phase, each decision tree produces a
prediction result, and when a new data point occurs, then based on the majority of


results, the Random Forest classifier predicts the final decision. Consider the below
image:

Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of loan
risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.


Advantages of Random Forest

o Random Forest is capable of performing both Classification and Regression


tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest

o Although random forest can be used for both classification and regression tasks,
  it is not as well suited to regression tasks.

Python Implementation of Random Forest Algorithm

Now we will implement the Random Forest Algorithm tree using Python. For this, we
will use the same dataset "user_data.csv", which we have used in previous
classification models. By using the same dataset, we can compare the Random Forest
classifier with other classification models such as Decision tree
Classifier, KNN, SVM, Logistic Regression, etc.

Implementation Steps are given below:

o Data Pre-processing step


o Fitting the Random forest algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result (a condensed code sketch of these steps is given
  below).
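A condensed sketch of these steps, assuming the same pre-processed user_data.csv split (x_train, x_test, y_train, y_test) used for the SVM example above, might look like:

# Fitting the Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

# Predicting the test set result and checking it with a confusion matrix
y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)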

DATA MINING TOOLS

Data Mining is the set of techniques that utilize specific algorithms, statistical
analysis, artificial intelligence, and database systems to analyze data from different
dimensions and perspectives.

Data Mining tools have the objective of discovering patterns/trends/groupings among


large sets of data and transforming data into more refined information.

It is a framework, such as Rstudio or Tableau that allows you to perform different types
of data mining analysis.


We can run various algorithms, such as clustering or classification, on our data set
and visualize the results within it. It is a framework that gives us better insight
into our data and the phenomenon the data represent; such a framework is called a data
mining tool.

The market for data mining tools is shining: a recent report from ReportLinker noted
that the market would top $1 billion in sales by 2023, up from $591 million in 2018.

These are the most popular data mining tools:

1. Orange Data Mining:

Orange is a perfect machine learning and data mining software suite. It supports
visualization and is a component-based software written in the Python language,
developed at the bioinformatics laboratory of the Faculty of Computer and Information
Science, University of Ljubljana, Slovenia.

As it is component-based software, the components of Orange are called "widgets."
These widgets range from preprocessing and data visualization to the assessment of
algorithms and predictive modeling.


Widgets deliver significant functionalities such as:

o Displaying data table and allowing to select features


o Data reading
o Training predictors and comparison of learning algorithms
o Data element visualization, etc.

Besides, Orange provides a more interactive and enjoyable atmosphere than dull
analytical tools. It is quite exciting to operate.

Why Orange?

Data coming into Orange is quickly formatted to the desired pattern, and the widgets
can easily be moved to where they are needed. Orange is quite interesting to users. It
allows its users to make smarter decisions in a short time by rapidly comparing and
analyzing the data. It is a good open-source data visualization and evaluation tool
that suits beginners as well as professionals. Data mining can be performed via visual
programming or Python scripting. Many analyses are feasible through its visual
programming interface (drag and drop connected with widgets), and many visual tools
are supported, such as bar charts, scatterplots, trees, dendrograms, and heat maps. A
substantial number of widgets (more than 100) are supported.

The instrument has machine learning components, add-ons for bioinformatics and text
mining, and it is packed with features for data analytics. This is also used as a python
library.

Python scripts can run in a terminal window, in an integrated environment like PyCharm
or PythonWin, or in shells like IPython. Orange comprises a canvas interface onto which
the user places widgets and creates a data analysis workflow. The widgets provide
fundamental operations, for example, reading the data, showing a data table, selecting
features, training predictors, comparing learning algorithms, visualizing data
elements, etc. Orange operates on Windows, Mac OS X, and a variety of Linux operating
systems. Orange comes with multiple regression and classification algorithms.

Orange can read documents in native and other data formats. Orange is dedicated to
machine learning techniques for classification, or supervised data mining. There are
two types of objects used in classification: learners and classifiers. Learners
consider class-labeled data and return a classifier. Regression methods are very
similar to classification in Orange; both are designed for supervised data mining and
require class-labeled data. Ensemble learning combines the predictions of individual
models for a gain in precision. The models can either come from different training
data or use different learners on the same sets of data.


Learners can also be diversified by altering their parameter sets. In Orange, ensembles
are simply wrappers around learners; they act like any other learner. Based on the
data, they return models that can predict the outcome of any data instance.

2. SAS Data Mining:

SAS stands for Statistical Analysis System. It is a product of the SAS Institute created
for analytics and data management. SAS can mine data, change it, manage information
from various sources, and analyze statistics. It offers a graphical UI for non-technical
users.

SAS data miner allows users to analyze big data and provide accurate insight for timely
decision-making purposes. SAS has distributed memory processing architecture that is
highly scalable. It is suitable for data mining, optimization, and text mining purposes.

3. DataMelt Data Mining:

DataMelt is a computation and visualization environment which offers an interactive


structure for data analysis and visualization. It is primarily designed for students,
engineers, and scientists. It is also known as DMelt.

DMelt is a multi-platform utility written in JAVA. It can run on any operating system
which is compatible with JVM (Java Virtual Machine). It consists of Science and
mathematics libraries.

o Scientific libraries:
Scientific libraries are used for drawing the 2D/3D plots.
o Mathematical libraries:
Mathematical libraries are used for random number generation, algorithms,
curve fitting, etc.

DMelt can be used for the analysis of the large volume of data, data mining, and
statistical analysis. It is extensively used in natural sciences, financial markets, and
engineering.

4. Rattle:

Rattle is a GUI-based data mining tool. It uses the R statistical programming language.
Rattle exposes the statistical power of R by offering significant data mining features.
While Rattle has a comprehensive and well-developed user interface, it also has an
integrated log code tab that produces the R code duplicating any GUI operation.


The data sets produced by Rattle can be viewed and edited. Rattle also gives the
facility to review the code, use it for many purposes, and extend the code without any
restriction.

5. Rapid Miner:

Rapid Miner is one of the most popular predictive analysis systems, created by the
company of the same name. It is written in the Java programming language and offers an
integrated environment for text mining, deep learning, machine learning, and
predictive analysis.

The instrument can be used for a wide range of applications, including company
applications, commercial applications, research, education, training, application
development, machine learning.

Rapid Miner provides the server on-site as well as in public or private cloud
infrastructure. It has a client/server model as its base. Rapid Miner comes with
template-based frameworks that enable fast delivery with few errors (which are
commonly expected in the manual code writing process).

TEXT MINING

Text data mining can be described as the process of extracting essential data from
standard language text. All the data that we generate via text messages, documents,
emails, files are written in common language text. Text mining is primarily used to
draw useful insights or patterns from such data.


The text mining market has experienced exponential growth and adoption over the last
few years and is also expected to gain significant growth and adoption in the coming
years. One of the primary reasons behind the adoption of text mining is higher
competition in the business market, with many organizations seeking value-added
solutions to compete with other organizations. With increasing competition in business
and changing customer perspectives, organizations are making huge investments to find
solutions that are capable of analyzing customer and competitor data to improve
competitiveness. The primary sources of data are e-commerce websites, social media
platforms, published articles, surveys, and many more. The larger part of the generated
data is unstructured, which makes it challenging and expensive for organizations to
analyze with the help of people alone. This challenge, combined with the exponential
growth in data generation, has led to the growth of analytical tools that are not only
able to handle large volumes of text data but also help in decision-making. Text mining
software empowers a user to draw useful information from a huge set of available data
sources.

Areas of text mining in data mining:

These are the following area of text mining :

o Information Extraction: The automatic extraction of structured data such as
  entities, entity relationships, and attributes describing entities from an
  unstructured source is called information extraction.
o Natural Language Processing: NLP stands for Natural Language Processing. It enables
  computer software to understand human language as it is spoken. NLP is primarily a
  component of artificial intelligence (AI). The development of NLP applications is
  difficult because computers generally expect humans to "speak" to them in a
  programming language that is accurate, clear, and exceptionally structured, whereas
  human speech is often imprecise, as it can depend on many complex variables,
  including slang, social context, and regional dialects.
o Data Mining: Data mining refers to the extraction of useful data, hidden
patterns from large data sets. Data mining tools can predict behaviors and
future trends that allow businesses to make a better data-driven decision. Data
mining tools can be used to resolve many business problems that have
traditionally been too time-consuming.
o Information Retrieval: Information retrieval deals with retrieving useful data from
  the data that is stored in our systems. Alternatively, as an analogy, we can view
  the search engines that operate on websites such as e-commerce sites, or any other
  sites, as part of information retrieval.

Text Mining Process:

The text mining process incorporates the following steps to extract the data from the
document.

o Text transformation
A text transformation is a technique that is used to control the capitalization
of the text. Here the two major ways of document representation are given:
1. Bag of words
2. Vector Space
o Text Pre-processing
Pre-processing is a significant task and a critical step in Text Mining, Natural
Language Processing (NLP), and Information Retrieval (IR). In the field of text
mining, data pre-processing is used for extracting useful information and knowledge
from unstructured text data. Information Retrieval (IR) is a matter of choosing which
documents in a collection should be retrieved to fulfill the user's need.
o Feature selection: Feature selection is a significant part of data mining. Feature
selection can be defined as the process of reducing the input of processing or
finding the essential information sources. The feature selection is also called
variable selection.
o Data Mining: Now, in this step, the text mining procedure merges with the
  conventional process. Classic data mining procedures are applied to the structured
  database.
o Evaluate: Afterward, the results are evaluated. Once the result has been evaluated,
  the process is complete.

Applications:

These are the following text mining applications:
o Risk Management: Risk Management is a systematic and logical procedure of
analyzing, identifying, treating, and monitoring the risks involved in any action
or process in organizations. Insufficient risk analysis is usually a leading cause
of disappointment. It is particularly true in the financial organizations where
the adoption of Risk Management Software based on text mining technology can
effectively enhance the ability to diminish risk. It enables the administration of
millions of sources and petabytes of text documents, and gives the ability to connect
the data, which helps to access the appropriate data at the right time.
o Customer Care Service: Text mining methods, particularly NLP, are finding
increasing significance in the field of customer care. Organizations are investing in
text analytics software to improve their overall customer experience by accessing the
textual data from different sources such as customer feedback, surveys, customer
calls, etc. The primary objective of text analysis is to reduce the response time of
the organization and help address customer complaints rapidly and productively.
o Business Intelligence: Companies and business firms have started to use text
mining strategies as a major aspect of their business intelligence. Besides
providing significant insights into customer behavior and trends, text mining
strategies also help organizations analyze the strengths and weaknesses of their
competitors, giving them a competitive advantage in the market.

o Social Media Analysis: Social media analysis helps to track the online data, and
there are numerous text mining tools designed particularly for performance
analysis of social media sites. These tools help to monitor and interpret the text
generated via the internet from the news, emails, blogs, etc. Text mining tools
can precisely analyze the total number of posts, followers, and likes of your brand on
a social media platform, which enables you to understand the response of the
individuals who are interacting with your brand and content.

Text Mining Approaches in Data Mining:

These are the following text mining approaches that are used in data mining.

1. Keyword-based Association Analysis:

It collects sets of keywords or terms that often occur together and afterward discovers
the association relationships among them. First, it pre-processes the text data by
parsing, stemming, removing stop words, etc. Once the data is pre-processed, it runs
association mining algorithms. Here, human effort is not required, so the number of
unwanted results and the execution time are reduced.
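A toy sketch of the pre-processing half of this idea in plain Python (the documents and the stop-word list are made up for illustration; a real system would use a proper stemmer and an association-mining library):

from collections import Counter
from itertools import combinations

documents = ["data mining finds hidden patterns in data",
             "text mining applies data mining to text"]
stop_words = {"in", "to", "the", "a"}

# Parse, lower-case and remove stop words
token_sets = [set(w for w in doc.lower().split() if w not in stop_words)
              for doc in documents]

# Count keyword pairs that occur together in the same document
pair_counts = Counter(pair for tokens in token_sets
                      for pair in combinations(sorted(tokens), 2))
print(pair_counts.most_common(3))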

2. Document Classification Analysis:

Automatic document classification:

This analysis is used for the automatic classification of the huge number of online
text documents like web pages, emails, etc. Text document classification differs from
the classification of relational data, as document databases are not organized
according to attribute-value pairs.

Numericizing text:

o Stemming algorithms: A significant pre-processing step before the ordering of input
  documents is the stemming of words. The term "stemming" can be defined as a
  reduction of words to their roots, so that, for example, different grammatical
  forms of a word such as "order," "ordered," and "ordering" are treated as the same.
  The primary purpose of stemming is to ensure that similar words are recognized as
  such by the text mining program.
o Support for different languages: There are some highly language-dependent
  operations such as stemming, synonyms, and the letters that are allowed in words.
  Therefore, support for various languages is important.


o Exclude certain characters: Excluding numbers, specific characters, series of
  characters, or words that are shorter or longer than a specific number of letters
  can be done before the ordering of the input documents.
o Include lists, exclude lists (stop-words): A particular list of words to be indexed
  can be defined, which is useful when we want to search for specific words and
  classify the input documents based on the frequencies with which those words occur.
  Additionally, "stop words," i.e., terms that are to be rejected from the ordering,
  can be defined. Normally, a default list of English stop words includes "the," "a,"
  "since," etc. These words are used very often in the respective language but
  communicate very little information in the document.
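A small sketch of several of these options using scikit-learn's CountVectorizer (the parameter values and sample sentences are illustrative only):

from sklearn.feature_extraction.text import CountVectorizer

# stop_words drops very common English terms; token_pattern keeps only
# alphabetic tokens of at least three letters, which excludes numbers and
# very short words.
vectorizer = CountVectorizer(stop_words='english',
                             token_pattern=r'(?u)\b[a-zA-Z]{3,}\b',
                             lowercase=True)
X_text = vectorizer.fit_transform(["The system ordered 20 new parts",
                                   "Ordering parts since 2018"])
print(vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn versions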

