Adm Unit-4,5

UNIT-IV

Data Mining Task Primitives


Data mining task primitives refer to the basic building blocks or components that are used to construct a data
mining process. These primitives are used to represent the most common and fundamental tasks that are
performed during the data mining process. The use of data mining task primitives can provide a modular and
reusable approach, which can improve the performance, efficiency, and understandability of the data mining
process.

The Data Mining Task Primitives are as follows:


1. The set of task relevant data to be mined: It refers to the specific data that is relevant and necessary for
a particular task or analysis being conducted using data mining techniques. This data may include specific
attributes, variables, or characteristics that are relevant to the task at hand, such as customer
demographics, sales data, or website usage statistics. The data selected for mining is typically a subset of
the overall data available, as not all data may be necessary or relevant for the task.
For example: extracting the database name, the relevant tables, and the required attributes from the provided input database.
2. Kind of knowledge to be mined: It refers to the type of information or insights that are being sought
through the use of data mining techniques. This describes the data mining tasks that must be carried out. It
includes various tasks such as classification, clustering, discrimination, characterization, association, and
evolution analysis. For example, it determines whether classification, clustering, prediction, discrimination, outlier detection, or correlation analysis should be performed on the relevant data in order to mine useful information.
3. Background knowledge to be used in the discovery process: It refers to any prior information or
understanding that is used to guide the data mining process. This can include domain-specific knowledge,
such as industry-specific terminology, trends, or best practices, as well as knowledge about the data itself.
The use of background knowledge can help to improve the accuracy and relevance of the insights obtained
from the data mining process. For example, background knowledge such as concept hierarchies and user beliefs about relationships in the data can be used to evaluate patterns and guide the mining process more efficiently.
4. Interestingness measures and thresholds for pattern evaluation: It refers to the methods and criteria
used to evaluate the quality and relevance of the patterns or insights discovered through data mining.
Interestingness measures are used to quantify the degree to which a pattern is considered to be interesting
or relevant based on certain criteria, such as its frequency, confidence, or lift. These measures are used to
identify patterns that are meaningful or relevant to the task. Thresholds for pattern evaluation, on the other
hand, are used to set a minimum level of interestingness that a pattern must meet in order to be considered
for further analysis or action. For example: evaluating patterns against interestingness measures such as utility, certainty, and novelty, and setting an appropriate threshold value for pattern evaluation.
5. Representation for visualizing the discovered pattern: It refers to the methods used to represent the
patterns or insights discovered through data mining in a way that is easy to understand and interpret.
Visualization techniques such as charts, graphs, and maps are commonly used to represent the data and
can help to highlight important trends, patterns, or relationships within the data. Visualizing the discovered
pattern helps to make the insights obtained from the data mining process more accessible and
understandable to a wider audience, including non-technical stakeholders. For example: presenting and visualizing the discovered patterns using techniques such as bar plots, charts, graphs, and tables. A minimal sketch of how these primitives can be specified together is shown after this list.
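These five primitives can be thought of as the fields of a single mining-task specification. The sketch below is a minimal, purely illustrative Python example (none of the names are a real API) of how such a specification might be collected before a mining run.

# Illustrative only: a data mining task described by its five primitives.
task_specification = {
    # 1. Task-relevant data: database, tables, and attributes to mine
    "relevant_data": {
        "database": "sales_db",
        "tables": ["customers", "transactions"],
        "attributes": ["age", "income", "item", "amount"],
    },
    # 2. Kind of knowledge to be mined
    "knowledge_type": "association",   # or "classification", "clustering", ...
    # 3. Background knowledge guiding the discovery process
    "background_knowledge": {
        "concept_hierarchy": {"city": "region", "region": "country"},
    },
    # 4. Interestingness measures and thresholds for pattern evaluation
    "interestingness": {"min_support": 0.02, "min_confidence": 0.6},
    # 5. Representation for visualizing the discovered patterns
    "presentation": ["rules_table", "bar_chart"],
}

if __name__ == "__main__":
    for primitive, value in task_specification.items():
        print(primitive, "->", value)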

Advantages of Data Mining Task Primitives

The use of data mining task primitives has several advantages, including:
1. Modularity: Data mining task primitives provide a modular approach to data mining, which allows for
flexibility and the ability to easily modify or replace specific steps in the process.
2. Reusability: Data mining task primitives can be reused across different data mining projects, which can
save time and effort.
3. Standardization: Data mining task primitives provide a standardized approach to data mining, which can
improve the consistency and quality of the data mining process.
4. Understandability: Data mining task primitives are easy to understand and communicate, which can
improve collaboration and communication among team members.
5. Improved Performance: Data mining task primitives can improve the performance of the data mining
process by reducing the amount of data that needs to be processed, and by optimizing the data for specific
data mining algorithms.
6. Flexibility: Data mining task primitives can be combined and repeated in various ways to achieve the goals
of the data mining process, making it more adaptable to the specific needs of the project.
7. Efficient use of resources: Data mining task primitives help make more efficient use of resources, as they allow specific tasks to be performed with the right tools, avoiding unnecessary steps and reducing the time and computational power needed.

Data visualization

Data visualization is the graphical representation of data points and information, intended to make them easy and quick for users to understand. A visualization is good if it has a clear meaning and purpose and is easy to interpret without requiring additional context. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data by using visual elements such as charts, graphs, and maps.
Characteristics of Effective Graphical Visuals:
 It shows or visualizes the data clearly, in an understandable manner.
 It encourages viewers to compare different pieces of data.
 It closely integrates the statistical and verbal descriptions of the data set.
 It grabs our interest, focuses our mind, and keeps our eyes on the message, as the human brain tends to focus on visual data more than on written data.
 It helps in identifying areas that need more attention and improvement.
 Using graphical representation, a story can be told more efficiently; it also takes less time to understand a picture than to understand textual data.
Categories of Data Visualization:
Data visualization is very important in market research, where both numerical and categorical data can be visualized. This increases the impact of insights and reduces the risk of analysis paralysis. Data visualization is categorized into the following categories:

Figure – Categories of Data Visualization


Numerical Data:
Numerical data is also known as quantitative data. It is any data that represents an amount, such as a person's height, weight, or age. Numerical data visualization is the easiest way to visualize data and is generally used to help others digest large data sets and raw numbers in a way that makes them easier to interpret and act on. Numerical data is categorized into two categories:
 Continuous Data – data that can take any value within a range (Example: height measurements).
 Discrete Data – data that is not continuous, i.e. takes only distinct values (Example: the number of cars or children a household has).
The visualization techniques used to represent numerical data are charts and numerical values. Examples are pie charts, bar charts, averages, scorecards, etc.
Categorical Data:
Categorical data is also known as qualitative data. It is any data that represents groups. It consists of categorical variables that are used to represent characteristics such as a person's ranking, a person's gender, etc. Categorical data visualization is all about depicting key themes, establishing connections, and lending context. Categorical data is classified into three categories:
 Binary Data – classification based on two opposing positions (Example: agree or disagree).
 Nominal Data – classification based on attributes with no inherent order (Example: male or female).
 Ordinal Data – classification based on the ordering of information (Example: timelines or process stages).
The visualization techniques used to represent categorical data are graphics, diagrams, and flowcharts. Examples are word clouds, sentiment mapping, Venn diagrams, etc.
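As a small illustration of both categories, the following sketch (assuming the matplotlib library is installed; the data values are made up) draws a bar chart for numerical data and a pie chart for categorical data.

import matplotlib.pyplot as plt

# Numerical (quantitative) data: heights of four people, shown as a bar chart.
names = ["A", "B", "C", "D"]
heights_cm = [158, 172, 165, 180]

# Categorical (qualitative) data: counts per category, shown as a pie chart.
genres = ["Comedy", "Drama", "Action"]
counts = [12, 7, 5]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(names, heights_cm)                         # numerical -> chart
ax1.set_title("Numerical data: height (cm)")
ax2.pie(counts, labels=genres, autopct="%1.0f%%")  # categorical -> diagram
ax2.set_title("Categorical data: genre share")
plt.tight_layout()
plt.show()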

Implement Data Visualization Using WEKA


The method of representing data through graphs and plots with the aim to understand data clearly is data
visualization.

There are many ways to represent data. Some of them are as follows:
1) Pixel-Oriented Visualization: Here the color of each pixel represents a dimension value; attribute values are mapped directly to pixel colors.

2) Geometric Representation: Multidimensional datasets are represented in 2D, 3D, and 4D scatter plots.

3) Icon-Based Visualization: The data is represented using Chernoff faces and stick figures. Chernoff faces use the human mind's ability to recognize facial characteristics and the differences between them. Stick-figure visualization maps the data onto small stick figures, using the angles and lengths of the limbs to encode additional dimensions.

4) Hierarchical Data Visualization: The datasets are represented using treemaps, which show hierarchical data as a set of nested rectangles.
Types of Association Rules in Data Mining
Association rule learning is a machine learning technique used for discovering interesting relationships between variables in large databases. It is designed to detect strong rules in a database based on interestingness metrics. For any given multi-item transaction, association rules aim to obtain rules that determine how or why certain items are linked. Association rules express general if-then patterns and are evaluated using support and confidence to define what the key relationships are: support reflects how frequently the items appear in the data, while confidence reflects the number of times the if-then statement is found to be true.

Types of Association Rules:

1. Multi-relational association rules: Multi-Relation Association Rules (MRAR) are a class of association rules in which, in contrast to primitive, simple, and even multi-relational association rules (usually extracted from multi-relational databases), each rule item consists of one entity but several relationships. These relationships represent indirect relationships between the entities.
2. Generalized association rules: Generalized association rule extraction is a powerful tool for getting a rough
idea of interesting patterns hidden in data. However, since patterns are extracted at each level of abstraction, the
mined rule sets may be too large to be used effectively for decision-making. Therefore, in order to discover
valuable and interesting knowledge, post-processing steps are often required. Generalized association rules
should have categorical (nominal or discrete) properties on both the left and right sides of the rule.
3. Quantitative association rules: Quantitative association rules are a special type of association rule. Unlike general association rules, where both the left and right sides of the rule should be categorical (nominal or discrete) attributes, at least one attribute (left or right) of a quantitative association rule must be numeric.

Uses of Association Rules

 Medical Diagnosis: Association rules can be used in medical diagnosis to assist doctors in treating patients. Diagnosis is not an easy process and is prone to errors that can lead to unreliable results. Using multi-relational association rules, we can determine the probability of disease occurrence associated with various factors and symptoms.
 Market Basket Analysis: It is one of the most popular examples and uses of association rule mining. Big retailers typically use this technique to determine the associations between items.

Frequent Item set in Data set (Association Rule Mining)


The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in which frequent k-itemsets are used to find frequent (k+1)-itemsets.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used, which reduces the search space.
Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
All subsets of a frequent itemset must be frequent (the Apriori property).
If an itemset is infrequent, all of its supersets will be infrequent.

Consider the following dataset; we will find the frequent itemsets and generate association rules for it, using:
minimum support count = 2
minimum confidence = 60%
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).

(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove that item. This gives us the itemset L1.

Step-2: K=2
 Generate the candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (k-2) elements in common.
 Check whether all subsets of each itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
 Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate (C2) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove that itemset. This gives us the itemset L2.

Step-3:
 Generate the candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (k-2) elements in common, so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, and {I2, I3, I5}.
 Check whether all subsets of these itemsets are frequent and, if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Check every itemset in the same way.)
 Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate (C3) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove that itemset. This gives us the itemset L3.

Step-4:
 Generate the candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4) is that the itemsets should have (k-2) elements in common, so here, for L3, the first two elements (items) should match.
 Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there is no itemset in C4.
 We stop here because no further frequent itemsets are found.

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that, we need to calculate the confidence of each rule.

Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
Confidence(A->B) = Support_count(A∪B) / Support_count(A)
So here, taking one of the frequent itemsets as an example, we show the rule generation.
Itemset {I1, I2, I3} // from L3
So the rules can be:
[I1^I2]=>[I3] // confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100 = 50%
[I1^I3]=>[I2] // confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100 = 50%
[I2^I3]=>[I1] // confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100 = 50%
[I1]=>[I2^I3] // confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100 = 33%
[I2]=>[I1^I3] // confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100 = 28%
[I3]=>[I1^I2] // confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100 = 33%
So if the minimum confidence were 50%, the first three rules could be considered strong association rules (with the 60% threshold stated earlier, none of these rules would qualify).
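Since the transaction table itself is not reproduced above, the sketch below assumes the standard nine-transaction dataset over items I1-I5 that is consistent with the support counts used in the rule calculations (this dataset is an assumption, not taken from the text). It implements the join, prune, and count steps of Apriori in plain Python.

from itertools import combinations

# Assumed example data: nine transactions over items I1..I5, chosen to be
# consistent with the support counts quoted in the rule calculations above.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUPPORT_COUNT = 2

def support_count(itemset):
    # Number of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

def apriori():
    # L1: frequent 1-itemsets.
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i])) >= MIN_SUPPORT_COUNT]
    all_frequent = list(level)
    k = 2
    while level:
        # Join step: union pairs of L(k-1) itemsets that share (k-2) items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
        # Count step: keep candidates that meet the minimum support count.
        level = [c for c in candidates if support_count(c) >= MIN_SUPPORT_COUNT]
        all_frequent.extend(level)
        k += 1
    return all_frequent

for itemset in apriori():
    print(sorted(itemset), "support =", support_count(itemset))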
Limitations of Apriori Algorithm
The Apriori algorithm can be slow. Its main limitation is the time and memory required to hold a vast number of candidate sets when there are many frequent itemsets, a low minimum support, or large itemsets; it is therefore not an efficient approach for very large datasets. For example, if there are 10^4 frequent 1-itemsets, the algorithm needs to generate more than 10^7 candidate 2-itemsets, which must then be tested and their counts accumulated. Furthermore, to detect a frequent pattern of size 100, e.g. {v1, v2, ..., v100}, it has to generate on the order of 2^100 candidate itemsets, making candidate generation costly and time-consuming. The algorithm also scans the database repeatedly to count the candidate itemsets, so it becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.

Correlation Analysis in Data Mining


Correlation analysis is a statistical method used to measure the strength of the linear relationship between two
variables and compute their association. Correlation analysis calculates the level of change in one variable due
to the change in the other. A high correlation points to a strong relationship between the two variables, while a
low correlation means that the variables are weakly related.

Correlation analysis can reveal meaningful relationships between different metrics or groups of metrics.
Information about those connections can provide new insights and reveal interdependencies, even if the metrics
come from different parts of the business.

Types of Correlation Analysis in Data Mining


Usually, in statistics, we measure four types of correlations: Pearson correlation, Kendall rank correlation,
Spearman correlation, and the Point-Biserial correlation.

1. Pearson r correlation

Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship
between linearly related variables. For example, in the stock market, if we want to measure how two stocks are
related to each other, Pearson r correlation is used to measure the degree of relationship between the two. The
point-biserial correlation is conducted with the Pearson correlation formula, except that one of the variables is
dichotomous. The following formula is used to calculate the Pearson r correlation:

r_xy = [ n Σ x_i y_i − (Σ x_i)(Σ y_i) ] / sqrt( [ n Σ x_i² − (Σ x_i)² ] [ n Σ y_i² − (Σ y_i)² ] )

where
r_xy = Pearson r correlation coefficient between x and y
n = number of observations
x_i = value of x for the ith observation
y_i = value of y for the ith observation

2. Kendall rank correlation

Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables. Considering two samples a and b, each of size n, the total number of pairings of a with b is n(n-1)/2. The following formula is used to calculate the value of the Kendall rank correlation:

τ = (Nc − Nd) / ( n(n−1)/2 )

where
Nc = number of concordant pairs
Nd = number of discordant pairs

3. Spearman rank correlation

Spearman rank correlation is a non-parametric test that is used to measure the degree of association between
two variables. The Spearman rank correlation test does not carry any assumptions about the data distribution. It
is the appropriate correlation analysis when the variables are measured on an at least ordinal scale.

This coefficient requires a table of data that displays the raw data, its ranks, and the difference between the two
ranks. This squared difference between the two ranks will be shown on a scatter graph, which will indicate
whether there is a positive, negative, or no correlation between the two variables. The constraint that this
coefficient works under is -1 ≤ r ≤ +1, where a result of 0 would mean that there was no relation between the data
whatsoever. The following formula is used to calculate the Spearman rank correlation:

ρ = 1 − ( 6 Σ d_i² ) / ( n(n² − 1) )

where
ρ = Spearman rank correlation coefficient
d_i = the difference between the ranks of corresponding variables
n = number of observations

The two methods outlined above will be used according to whether there are parameters associated with the
data gathered. The two terms to watch out for are:

o Parametric:(Pearson's Coefficient) The data must be handled with the parameters of populations or
probability distributions. Typically used with quantitative data already set out within said parameters.
o Non-parametric: (Spearman's Rank) Where no assumptions can be made about the probability distribution. Typically used with qualitative data, but can be used with quantitative data if Spearman's Rank proves inadequate.
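Assuming the SciPy library is available (the paired sample below is made up), the following sketch computes all three coefficients; which one to report depends on the parametric/non-parametric considerations above.

import numpy as np
from scipy import stats

# Made-up paired observations, e.g. hours studied vs. exam score.
x = np.array([2, 4, 5, 7, 9, 10])
y = np.array([50, 55, 60, 66, 78, 85])

pearson_r, pearson_p = stats.pearsonr(x, y)       # parametric, linear relationship
spearman_rho, spearman_p = stats.spearmanr(x, y)  # non-parametric, rank based
kendall_tau, kendall_p = stats.kendalltau(x, y)   # non-parametric, concordant/discordant pairs

print("Pearson r    =", round(pearson_r, 3), "p =", round(pearson_p, 3))
print("Spearman rho =", round(spearman_rho, 3), "p =", round(spearman_p, 3))
print("Kendall tau  =", round(kendall_tau, 3), "p =", round(kendall_p, 3))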

Interpreting Results

Typically, the best way to gain a generalized but more immediate interpretation of the results of a set of
data is to visualize it on a scatter graph such as these:

1. Positive Correlation: Any score from +0.5 to +1 indicates a very strong positive correlation, which means that both variables increase simultaneously. In this case the data points trend upwards, indicating the positive correlation, and the line of best fit (the trend line) is placed so as to best represent the graph's data.

2. Negative Correlation: Any score from -0.5 to -1 indicates a strong negative correlation, which means that as one variable increases, the other decreases proportionally. The line of best fit indicates the negative correlation; in these cases it slopes downwards from the point of origin.

3. No Correlation: Very simply, a score of 0 indicates no correlation, or relationship, between the two variables. This holds regardless of which formula is used. The more data is put into the formula, the more accurate the result will be; the larger the sample size, the more accurate the result.

Benefits of Correlation Analysis

Here are the following benefits of correlation analysis, such as:

1. Reduce Time to Detection


In anomaly detection, working with many metrics and surfacing correlated anomalous metrics helps draw
relationships that reduce time to detection (TTD) and support shortened time to remediation (TTR). As data-
driven decision-making has become the norm, early and robust detection of anomalies is critical in every industry
domain, as delayed detection adversely impacts customer experience and revenue.

2. Reduce Alert Fatigue

Another important benefit of correlation analysis in anomaly detection is reducing alert fatigue by filtering
irrelevant anomalies (based on the correlation) and grouping correlated anomalies into a single alert. Alert storms
and false positives are significant challenges organizations face - getting hundreds, even thousands of separate
alerts from multiple systems when many of them stem from the same incident.

3. Reduce Costs

Correlation analysis helps significantly reduce the costs associated with the time spent investigating meaningless
or duplicative alerts. In addition, the time saved can be spent on more strategic initiatives that add value to the
organization.
UNIT-V

Learn-One-Rule Algorithm
This method is used in the sequential learning algorithm for learning the rules. It returns a single
rule that covers at least some examples (as shown in Fig 1). However, what makes it really powerful is its
ability to create relations among the attributes given, hence covering a larger hypothesis space.
For example:
IF Mother(y, x) and Female(y), THEN Daughter(x, y).
Here, any person can be associated with the variables x and y.

Fig 1: Learn-One-Rule Example

Learn-One-Rule Algorithm
The Learn-One-Rule algorithm follows a greedy search paradigm: it searches for rules with high accuracy, even though their coverage may be low. At each step it tries to cover as many of the positive examples as possible and returns a single rule that covers at least some of them.
Learn-One-Rule(target_attribute, attributes, examples, k):

    Pos = positive examples
    Neg = negative examples
    best-hypothesis = the most general hypothesis
    candidate-hypotheses = {best-hypothesis}

    while candidate-hypotheses is not empty:
        // Generate the next, more specific candidate hypotheses
        all_constraints = all constraints of the form "attribute = value"
        new-candidate-hypotheses = all specializations of the members of
            candidate-hypotheses obtained by adding a constraint from all_constraints
        remove all duplicate/inconsistent hypotheses from new-candidate-hypotheses

        // Update best-hypothesis
        best-hypothesis = argmax over h in new-candidate-hypotheses of
            Performance(h, examples, target_attribute)

        // Update candidate-hypotheses
        candidate-hypotheses = the k best members of new-candidate-hypotheses,
            according to Performance

    prediction = the most frequent value of target_attribute among the examples
        that match best-hypothesis
    return the rule: IF best-hypothesis THEN prediction

It involves a Performance method that calculates the performance of each candidate hypothesis (i.e. how well the hypothesis matches the given set of examples in the training data):

Performance(h, examples, target_attribute):
    h-examples = the set of examples that match h
    return a measure of how well target_attribute is predicted over h-examples
        (for instance, the accuracy or the negative entropy of target_attribute over h-examples)
It starts with the most general rule precondition, then greedily adds the variable that most improves performance
measured over the training examples.
Learn-One-Rule Example
Let us understand the working of the algorithm using an example:

Day   Weather    Temp   Wind     Rain    PlayBadminton

D1    Sunny      Hot    Weak     Heavy   No
D2    Sunny      Hot    Strong   Heavy   No
D3    Overcast   Hot    Weak     Heavy   No
D4    Snowy      Cold   Weak     Light   Yes
D5    Snowy      Cold   Weak     Light   Yes
D6    Snowy      Cold   Strong   Light   Yes
D7    Overcast   Mild   Strong   Heavy   No
D8    Sunny      Hot    Weak     Light   Yes

Step 1 - best_hypothesis = IF h THEN PlayBadminton(x) = Yes


Step 2 - candidate-hypothesis = {best-hypothesis}
Step 3 - constraints_list = {Weather(x)=Sunny, Temp(x)=Hot, Wind(x)=Weak, ......}
Step 4 - new-candidate-hypothesis = {IF Weather=Sunny THEN PlayBadminton=YES,
IF Weather=Overcast THEN PlayBadminton=YES, ...}
Step 5 - best-hypothesis = IF Weather=Sunny THEN PlayBadminton=YES
Step 6 - candidate-hypothesis = {IF Weather=Sunny THEN PlayBadminton=YES,
IF Weather=Sunny THEN PlayBadminton=YES...}
Step 7 - Go to Step 2 and keep doing it till the best-hypothesis is obtained.
You can refer to Fig 1. for a better understanding of how the best-hypothesis is obtained. [Step 5
& 6]

Sequential Learning Algorithm uses this algorithm, improving on it and increasing the
coverage of the hypothesis space. It can be modified to accept an argument that specifies
the target value of interest.
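To make the procedure concrete, here is a small runnable sketch of the greedy search using the PlayBadminton table above. It keeps only the single best condition set at each step (a beam width of k = 1), which is a simplifying assumption relative to the pseudocode.

# Minimal Learn-One-Rule sketch (beam width k = 1), using the table above.
examples = [
    {"Weather": "Sunny",    "Temp": "Hot",  "Wind": "Weak",   "Rain": "Heavy", "PlayBadminton": "No"},
    {"Weather": "Sunny",    "Temp": "Hot",  "Wind": "Strong", "Rain": "Heavy", "PlayBadminton": "No"},
    {"Weather": "Overcast", "Temp": "Hot",  "Wind": "Weak",   "Rain": "Heavy", "PlayBadminton": "No"},
    {"Weather": "Snowy",    "Temp": "Cold", "Wind": "Weak",   "Rain": "Light", "PlayBadminton": "Yes"},
    {"Weather": "Snowy",    "Temp": "Cold", "Wind": "Weak",   "Rain": "Light", "PlayBadminton": "Yes"},
    {"Weather": "Snowy",    "Temp": "Cold", "Wind": "Strong", "Rain": "Light", "PlayBadminton": "Yes"},
    {"Weather": "Overcast", "Temp": "Mild", "Wind": "Strong", "Rain": "Heavy", "PlayBadminton": "No"},
    {"Weather": "Sunny",    "Temp": "Hot",  "Wind": "Weak",   "Rain": "Light", "PlayBadminton": "Yes"},
]
TARGET = "PlayBadminton"

def matches(rule, example):
    # A rule is a dict of attribute=value constraints; {} matches everything.
    return all(example[a] == v for a, v in rule.items())

def performance(rule, target_value="Yes"):
    # Fraction of the examples covered by the rule that have the target value.
    covered = [e for e in examples if matches(rule, e)]
    if not covered:
        return 0.0
    return sum(e[TARGET] == target_value for e in covered) / len(covered)

def learn_one_rule(target_value="Yes"):
    rule = {}  # most general precondition
    attributes = [a for a in examples[0] if a != TARGET]
    while attributes:
        # Greedily add the single attribute=value constraint that performs best.
        candidates = [dict(rule, **{a: e[a]}) for a in attributes for e in examples]
        best = max(candidates, key=lambda r: performance(r, target_value))
        if performance(best, target_value) <= performance(rule, target_value):
            break  # no constraint improves the rule any further
        rule = best
        attributes = [a for a in attributes if a not in rule]
    return rule

rule = learn_one_rule("Yes")
print("IF", rule, "THEN PlayBadminton = Yes")
print("Accuracy on covered examples:", performance(rule))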

Decision tree in Data mining


A decision tree is a data mining technique that builds a model for the classification of data. The models are built in the form of a tree structure and hence belong to the supervised form of learning. Besides classification models, decision trees are also used to build regression models for predicting class labels or values, aiding the decision-making process. Both numerical and categorical data, such as age and gender, can be used by a decision tree.
Structure of a decision tree
The structure of a decision tree consists of a root node, branches, and leaf nodes. The branches represent the outcomes of a test, the internal nodes represent tests on attributes, and the leaf nodes represent class labels.

Working of a decision tree

1. A decision tree works under the supervised learning approach for both discrete and continuous variables. The dataset is split into subsets on the basis of the dataset's most significant attribute. The identification of this attribute and the splitting are done by the algorithms.

2. The structure of the decision tree consists of the root node, which is the significant predictor node. The
process of splitting occurs from the decision nodes which are the sub-nodes of the tree. The nodes which do
not split further are termed as the leaf or terminal nodes.

3. The dataset is divided into homogenous and non-overlapping regions following a top-down approach. The
top layer provides the observations at a single place which then splits into branches. The process is termed as
“Greedy Approach” due to its focus only on the current node rather than the future nodes.

4. The decision tree keeps growing until a stopping criterion is reached.

5. While building a decision tree, noise and outliers in the data can produce spurious branches. To remove the effect of these outliers and noisy data, a method called "tree pruning" is applied. Hence, the accuracy of the model increases.

6. The accuracy of a model is checked on a test set consisting of test tuples and their class labels, and is measured as the percentage of test set tuples that are correctly classified by the model.

Types of Decision Tree

Decision trees lead to the development of models for classification and regression based on a tree-like
structure. The data is broken down into smaller subsets. The result of a decision tree is a tree with decision
nodes and leaf nodes. Two types of decision trees are explained below:

1. Classification

Classification involves building models that describe important class labels. They are applied in the areas of machine learning and pattern recognition. Decision trees used as classification models in machine learning support applications such as fraud detection and medical diagnosis. The two-step process of a classification model includes:

 Learning: A classification model based on the training data is built.


 Classification: Model accuracy is checked and then used for classification of the new data. Class labels
are in the form of discrete values like “yes”, or “no”, etc.

2. Regression
Regression models are used for the regression analysis of data, i.e. the prediction of numerical attributes.
These are also called continuous values. Therefore, instead of predicting the class labels, the regression
model predicts the continuous values .

List of Algorithms Used

A decision tree algorithm known as ID3 was developed in 1980 by the machine learning researcher J. Ross Quinlan. This algorithm was succeeded by other algorithms, such as C4.5, also developed by him. Both algorithms apply a greedy approach. C4.5 does not use backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner. The algorithm uses a training dataset with class labels, which is divided into smaller subsets as the tree is constructed.

ID3

The whole data set S is considered as the root node while forming the decision tree. The algorithm then iterates over every attribute and splits the data into fragments, considering on each iteration only those attributes that have not been used before. Splitting the data in the ID3 algorithm is time-consuming, and ID3 is not an ideal algorithm as it tends to overfit the data.

C4.5

It is an advanced form of the ID3 algorithm in which the data are classified as samples. Both continuous and discrete values can be handled efficiently, unlike in ID3. A pruning method is included which removes the unwanted branches.

CART

Both classification and regression tasks can be performed by this algorithm. Unlike ID3 and C4.5, the decision points are created by considering the Gini index. A greedy algorithm is applied for the splitting method, aiming to reduce the cost function. In classification tasks, the Gini index is used as the cost function to indicate the purity of the leaf nodes. In regression tasks, the sum of squared errors is used as the cost function to find the best prediction.
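As a hedged illustration of a CART-style classifier (assuming scikit-learn is installed; the tiny loan dataset is made up), the sketch below fits a tree using the Gini index as the splitting criterion and prints the learned decision rules.

from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up training data: [age, income] -> loan decision (0 = risky, 1 = safe).
X = [[25, 20000], [40, 60000], [35, 25000], [50, 80000], [23, 18000], [45, 70000]]
y = [0, 1, 0, 1, 0, 1]

# CART-style tree: Gini index as the cost function, max_depth to limit overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))
print("Prediction for a 30-year-old with income 30000:", tree.predict([[30, 30000]])[0])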

CHAID

As the name suggests, it stands for Chi-square Automatic Interaction Detector, a process dealing with any
type of variables. They might be nominal, ordinal, or continuous variables. Regression trees use the F-test,
while the Chi-square test is used in the classification model.

Classification and Prediction in Data Mining


There are two forms of data analysis that can be used to extract models describing important classes or predict
future data trends. These two forms are as follows:

Classification and Prediction

We use classification and prediction to extract models that represent the data classes and to predict future data trends. This analysis helps provide a good understanding of data at a large scale.

Classification models predict categorical class labels, and prediction models predict continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as either safe or risky or a
prediction model to predict the expenditures in dollars of potential customers on computer equipment given their
income and occupation.
What is Classification?
Classification is the task of identifying the category or class label of a new observation. First, a set of data is used as
training data. The set of input data and the corresponding outputs are given to the algorithm. So, the training data
set includes the input data and their associated class labels. Using the training dataset, the algorithm derives a
model or the classifier. The derived model can be a decision tree, mathematical formula, or a neural network. In
classification, when unlabeled data is given to the model, it should find the class to which it belongs. The new
data provided to the model is the test data set.

Classification is the process of classifying a record. One simple example of classification is to check whether it is
raining or not. The answer can either be yes or no. So, there is a particular number of choices. Sometimes there
can be more than two classes to classify. That is called multiclass classification.

How does Classification Works?

The functioning of classification can be illustrated with the bank loan application example mentioned above. There are two stages in the data classification system: building the classifier or model, and applying the classifier for classification (a minimal code sketch follows the list below).

1. Developing the Classifier or model creation: This level is the learning stage or the learning process.
The classification algorithms construct the classifier in this stage. A classifier is constructed from a
training set composed of the records of databases and their corresponding class names. Each category
that makes up the training set is referred to as a category or class. We may also refer to these records
as samples, objects, or data points.
2. Applying classifier for classification: The classifier is used for classification at this level. The test data
are used here to estimate the accuracy of the classification algorithm. If the consistency is deemed
sufficient, the classification rules can be expanded to cover new data records. It includes:
o Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring. We can use it to extract social media insights. We can build sentiment analysis models that read and analyze even misspelled words using advanced machine learning algorithms. Accurately trained models provide consistently accurate outcomes in a fraction of the time.
o Document Classification: We can use document classification to organize the documents into
sections according to the content. Document classification refers to text classification; we can
classify the words in the entire document. And with the help of machine learning classification
algorithms, we can execute it automatically.
o Image Classification: Image classification is used to assign an image to one of a set of trained categories, such as the caption of the image, a statistical value, or a theme. You can tag images to train your model for the relevant categories by applying supervised learning algorithms.
o Machine Learning Classification: It uses the statistically demonstrable algorithm rules to
execute analytical tasks that would take humans hundreds of more hours to perform.
3. Data Classification Process: The data classification process can be divided into five steps:
o Define the goals, strategy, workflows, and architecture of data classification.
o Classify the confidential data that we store.
o Apply labels by tagging the data.
o Use the results to improve security and compliance.
o Remember that data is dynamic, and classification is a continuous process.
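As a minimal sketch of the two stages described above (assuming scikit-learn is installed; the loan records are made up), the code below builds a classifier from a training set and then estimates its accuracy on held-out test tuples before classifying new data.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Made-up records: [income, existing_debt] -> class label (0 = risky, 1 = safe).
X = [[30, 8], [60, 5], [25, 9], [80, 4], [40, 7], [75, 3], [35, 9], [90, 2]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Stage 1: learning -- construct the classifier from the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = GaussianNB().fit(X_train, y_train)

# Stage 2: classification -- estimate accuracy on test tuples, then classify new data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("New application [50, 6] classified as:", model.predict([[50, 6]])[0])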

What is Data Classification Lifecycle?

The data classification life cycle produces an excellent structure for controlling the flow of data to an enterprise.
Businesses need to account for data security and compliance at each level. With the help of data classification,
we can perform it at every stage, from origin to deletion. The data life-cycle has the following stages, such as:

1. Origin: Sensitive data is produced in various formats, such as emails, Excel, Word, Google documents, social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it based on in-house protection policies and compliance rules.
3. Storage: Here, we have the obtained data, including access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers from various
devices and platforms.
5. Archive: Here, data is eventually archived within an industry's storage systems.
6. Publication: Through the publication of data, it can reach customers. They can then view and download
in the form of dashboards.

What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. Same as in classification, the
training dataset contains the inputs and corresponding numerical output values. The algorithm derives the model
or a predictor according to the training dataset. The model should find a numerical output when the new data is
given. Unlike in classification, this method does not have a class label. The model predicts a continuous-valued
function or ordered value.

Regression is generally used for prediction. Predicting the value of a house depending on the facts such as the
number of rooms, the total area, etc., is an example for prediction.

For example, suppose a marketing manager needs to predict how much a particular customer will spend at his company during a sale. In this case, we need to forecast a numerical value, so this data processing activity is an example of numeric prediction. Here, a model or predictor is developed that forecasts a continuous-valued or ordered-value function.

Classification and Prediction Issues

The major issue is preparing the data for Classification and Prediction. Preparing the data involves the following
activities, such as:

1. Data Cleaning: Data cleaning involves removing the noise and treatment of missing values. The noise
is removed by applying smoothing techniques, and the problem of missing values is solved by replacing
a missing value with the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is used to
know whether any two given attributes are related.
3. Data Transformation and reduction: The data can be transformed by any of the following methods.
o Normalization: The data is transformed using normalization. Normalization involves scaling all
values for a given attribute to make them fall within a small specified range. Normalization is
used when the neural networks or the methods involving measurements are used in the
learning step.
o Generalization: The data can also be transformed by generalizing it to the higher concept. For
this purpose, we can use the concept hierarchies.

Comparison of Classification and Prediction Methods


Here are the criteria for comparing the methods of Classification and Prediction, such as:

o Accuracy: The accuracy of the classifier can be referred to as the ability of the classifier to predict the
class label correctly, and the accuracy of the predictor can be referred to as how well a given predictor
can estimate the unknown value.
o Speed: The speed of the method depends on the computational cost of generating and using the
classifier or predictor.
o Robustness: Robustness is the ability to make correct predictions or classifications. In the context of
data mining, robustness is the ability of the classifier or predictor to make correct predictions from
incoming unknown data.
o Scalability: Scalability refers to the ability to construct the classifier or predictor efficiently as the amount of given data grows.
o Interpretability: Interpretability is how readily we can understand the reasoning behind predictions or
classification made by the predictor or classifier.

Difference between Classification and Prediction


1. Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. Prediction is the process of identifying the missing or unavailable numerical data for a new observation.

2. In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data.

3. In classification, the model can be known as the classifier. In prediction, the model can be known as the predictor.

4. In classification, a model or classifier is constructed to find the categorical labels. In prediction, a model or predictor is constructed that predicts a continuous-valued function or ordered value.

5. For example, the grouping of patients based on their medical records can be considered classification, whereas predicting the correct treatment for a particular disease for a person can be thought of as prediction.

Bayesian network:

A Bayesian network falls under the category of probabilistic graphical models (PGMs), which are used to compute uncertainties using the concept of probability. Also known as belief networks, Bayesian networks represent uncertainty using Directed Acyclic Graphs (DAGs).

A Directed Acyclic Graph is used to represent a Bayesian network; like any other statistical graph, a DAG consists of a set of nodes and links, where the links signify the connections between the nodes.

The nodes here represent random variables, and the edges define the relationship between these variables.

A DAG models the uncertainty of an event occurring based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network.
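A minimal sketch (plain Python, with made-up probabilities) of a two-node network Rain -> WetGrass shows how a CPT is stored and how probabilities are read off the DAG via the chain rule P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain).

# Tiny Bayesian network: Rain -> WetGrass, with made-up probabilities.
p_rain = {True: 0.2, False: 0.8}            # prior for the parentless node Rain

# Conditional Probability Table (CPT) for WetGrass given its parent Rain.
p_wet_given_rain = {
    True:  {True: 0.9, False: 0.1},         # P(WetGrass | Rain = True)
    False: {True: 0.2, False: 0.8},         # P(WetGrass | Rain = False)
}

def joint(rain, wet):
    # P(Rain = rain, WetGrass = wet) via the chain rule over the DAG.
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# P(Rain = True | WetGrass = True) by Bayes' rule (normalize over Rain).
evidence = joint(True, True) + joint(False, True)
posterior = joint(True, True) / evidence
print("P(WetGrass = True) =", round(evidence, 2))
print("P(Rain = True | WetGrass = True) =", round(posterior, 2))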
Instance-based learning
The machine learning systems that are categorized as instance-based learning are systems that learn the training examples by heart and then generalize to new instances based on some similarity measure. It is called instance-based because it builds its hypotheses from the training instances themselves. It is also known as memory-based learning or lazy learning (because processing is delayed until a new instance must be classified). The time complexity of this algorithm depends upon the size of the training data: each time a new query instance is encountered, the previously stored data is examined and a target function value is assigned to the new instance. The worst-case time complexity of this algorithm is O(n), where n is the number of training instances. For example, if we were to create a spam filter with an instance-based learning algorithm, instead of just flagging emails that are already marked as spam, our spam filter would also flag emails that are very similar to them. This requires a measure of resemblance between two emails; a similarity measure between two emails could be the same sender, the repetitive use of the same keywords, or something else.
Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the target function.
2. This algorithm can easily adapt to new data, which is collected as we go.
Disadvantages:
1. Classification costs are high
2. A large amount of memory is required to store the data, and each query involves building a local model from scratch.
Some of the instance-based learning algorithms are :
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning
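K Nearest Neighbor, the first algorithm in the list above, is a typical lazy learner: fitting essentially just stores the training instances, and a query is classified by the majority label of its k nearest stored neighbours. A minimal sketch (assuming scikit-learn is installed; the toy spam data is made up):

from sklearn.neighbors import KNeighborsClassifier

# Made-up training instances: [word_count, exclamation_marks] -> label.
X_train = [[100, 0], [120, 1], [30, 8], [25, 10], [90, 1], [20, 7]]
y_train = ["ham", "ham", "spam", "spam", "ham", "spam"]

# Lazy learning: fit() mostly just stores the instances in memory.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# At query time, the 3 most similar stored emails vote on the label.
query = [[28, 9]]
print("Predicted label:", knn.predict(query)[0])
distances, indices = knn.kneighbors(query)
print("Nearest neighbours:", indices[0], "at distances", distances[0])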

Generalized Linear Models


 Linear Regression
 Logistic Regression
Generalized Linear Models (GLMs) are a class of regression models that can be used to model a wide range of
relationships between a response variable and one or more predictor variables. Unlike traditional linear
regression models, which assume a linear relationship between the response and predictor variables, GLMs
allow for more flexible, non-linear relationships by using a different underlying statistical distribution.

Some of the features of GLMs include:

1. Flexibility: GLMs can model a wide range of relationships between the response and predictor variables,
including linear, logistic, Poisson, and exponential relationships.
2. Model interpretability: GLMs provide a clear interpretation of the relationship between the response and
predictor variables, as well as the effect of each predictor on the response.
3. Robustness: GLMs can be robust to outliers and other anomalies in the data, as they allow for non-normal
distributions of the response variable.
4. Scalability: GLMs can be used for large datasets and complex models, as they have efficient algorithms for
model fitting and prediction.
5. Ease of use: GLMs are relatively easy to understand and use, especially compared to more complex models
such as neural networks or decision trees.
6. Hypothesis testing: GLMs allow for hypothesis testing and statistical inference, which can be useful in many
applications where it’s important to understand the significance of relationships between variables.
7. Regularization: GLMs can be regularized to reduce overfitting and improve model performance, using
techniques such as Lasso, Ridge, or Elastic Net regression.
8. Model comparison: GLMs can be compared using information criteria such as AIC or BIC, which can help to
choose the best model among a set of alternatives.

Some of the disadvantages of GLMs include:

 Assumptions: GLMs make certain assumptions about the distribution of the response variable, and these assumptions may not always hold.
 Model specification: Specifying the correct underlying statistical distribution for a GLM can be challenging, and incorrect specification can result in biased or incorrect predictions.
 Overfitting: Like other regression models, GLMs can be prone to overfitting if the model is too complex or has too many predictor variables.
 Limited flexibility: While GLMs are more flexible than traditional linear regression models, they may still not be able to capture more complex relationships between variables, such as interactions or non-linear effects.
 Data requirements: GLMs require a sufficient amount of data to estimate model parameters and make accurate predictions, and may not perform well with small or imbalanced datasets.
 Model assumptions: GLMs rely on certain assumptions about the distribution of the response variable and the relationship between the response and predictor variables, and violation of these assumptions can lead to biased or incorrect predictions.

Overall, GLMs are a powerful and flexible tool for modeling relationships between response and predictor variables, and they are widely used in many fields, including finance, marketing, and epidemiology. For further reading, see an introductory textbook on regression analysis such as "An Introduction to Generalized Linear Models" by Annette J. Dobson and Adrian G. Barnett.
Linear Regression Model: To show that linear regression is a special case of the GLMs, the output labels are considered to be continuous values and therefore to follow a Gaussian distribution. So we have:

y | x; θ ~ N(μ, σ²)
h_θ(x) = E[y | x; θ] = μ
η = θ^T x

The first equation above corresponds to the first assumption, that the output labels (or target variables) should be members of an exponential family; the second equation corresponds to the assumption that the hypothesis equals the expected value (mean) of the distribution; and the third equation corresponds to the assumption that the natural parameter and the input features follow a linear relationship. Since the mean of the Gaussian is its natural parameter (μ = η), these together give h_θ(x) = θ^T x.
 Logistic Regression Model: To show that logistic regression is a special case of the GLMs, the output labels are considered to be binary valued and therefore to follow a Bernoulli distribution. So we have:

y | x; θ ~ Bernoulli(φ)
h_θ(x) = E[y | x; θ] = φ
η = θ^T x

From the third assumption, it follows that:

h_θ(x) = φ = 1 / (1 + e^(−η)) = 1 / (1 + e^(−θ^T x))

 The function that maps the natural parameter to the canonical parameter of the distribution is known as the canonical response function (here, the sigmoid function), and its inverse is known as the canonical link function.
Therefore, by using the three assumptions mentioned before, it can be shown that logistic and linear regression belong to a much larger family of models known as GLMs.
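Assuming the statsmodels library is available (the data below is randomly generated), the sketch fits both special cases through the same GLM interface: a Gaussian family for linear regression and a Binomial family for logistic regression.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
X = sm.add_constant(x)                       # intercept plus one predictor

# Linear regression as a GLM: Gaussian family, identity link.
y_continuous = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)
gaussian_fit = sm.GLM(y_continuous, X, family=sm.families.Gaussian()).fit()

# Logistic regression as a GLM: Binomial family, logit link.
y_binary = (x + rng.normal(0, 1, size=100) > 5).astype(int)
binomial_fit = sm.GLM(y_binary, X, family=sm.families.Binomial()).fit()

print("Gaussian GLM coefficients:", gaussian_fit.params)
print("Binomial GLM coefficients:", binomial_fit.params)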
Clustering in Data Mining
The process of making a group of abstract objects into classes of similar objects is known as clustering.
Points to Remember:
 One group of data objects is treated as a cluster.
 In the process of cluster analysis, the first step is to partition the data set into groups based on data similarity, and then labels are assigned to the groups.
 The biggest advantage of clustering over classification is that it can adapt to changes and helps single out useful features that differentiate different groups.
Applications of cluster analysis :
 It is widely used in many applications such as image processing, data analysis, and pattern recognition.
 It helps marketers to find the distinct groups in their customer base and they can characterize their customer
groups by using purchasing patterns.
 It can be used in the field of biology, by deriving animal and plant taxonomies and identifying genes with the
same capabilities.
 It also helps in information discovery by classifying documents on the web.
Clustering Methods:
It can be classified based on the following categories.
1. Model-Based Method
2. Hierarchical Method
3. Constraint-Based Method
4. Grid-Based Method
5. Partitioning Method
6. Density-Based Method
Requirements of clustering in data mining:
The following are some points on why clustering is important in data mining.
 Scalability – We require highly scalable clustering algorithms to work with large databases.
 Ability to deal with different kinds of attributes – Algorithms should be able to work with different types of data, such as categorical, numerical, and binary data.
 Discovery of clusters with arbitrary shape – The algorithm should be able to detect clusters of arbitrary shape and should not be bounded to distance measures that favour spherical clusters.
 Interpretability – The results should be comprehensive, usable, and interpretable.
 High dimensionality – The algorithm should be able to handle high-dimensional data instead of only handling low-dimensional data.
Cobweb (clustering)
COBWEB is an incremental system for hierarchical conceptual clustering. COBWEB was invented by
Professor Douglas H. Fisher, currently at Vanderbilt University.[1][2]
COBWEB incrementally organizes observations into a classification tree. Each node in a classification tree
represents a class (concept) and is labeled by a probabilistic concept that summarizes the attribute-value
distributions of objects classified under the node. This classification tree can be used to predict missing attributes
or the class of a new object.[3]
There are four basic operations COBWEB employs in building the classification tree. Which operation is selected
depends on the category utility of the classification achieved by applying it. The operations are:

 Merging Two Nodes


Merging two nodes means replacing them with a node whose children are the union of the original nodes' sets of children and which summarizes the attribute-value distributions of all objects classified under them.
 Splitting a node
A node is split by replacing it with its children.
 Inserting a new node
A node is created corresponding to the object being inserted into the tree.
 Passing an object down the hierarchy
Effectively calling the COBWEB algorithm on the object and the subtree rooted in the node.

The COBWEB Algorithm

COBWEB(root, record):
    Input: a COBWEB node root, an instance to insert record
    if root has no children then
        children := {copy(root)}
        newcategory(record)   \\ adds a child with record's feature values
        insert(record, root)  \\ update root's statistics
    else
        insert(record, root)
        for child in root's children do
            calculate Category Utility for insert(record, child),
            set best1, best2 to the children with the best CU
        end for
        if newcategory(record) yields the best CU then
            newcategory(record)
        else if merge(best1, best2) yields the best CU then
            merge(best1, best2)
            COBWEB(root, record)
        else if split(best1) yields the best CU then
            split(best1)
            COBWEB(root, record)
        else
            COBWEB(best1, record)
        end if
    end

K-Means Clustering Algorithm


K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems in
machine learning or data science. In this topic, we will learn what is K-means clustering algorithm, how
the algorithm works, along with the Python implementation of k-means clustering.
What is K-Means Algorithm?
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different
clusters. Here K defines the number of pre-defined clusters that need to be created in the process, as if K=2,
there will be two clusters, and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, whose members have similar properties.

It allows us to cluster the data into different groups and a convenient way to discover the categories of groups in
the unlabeled dataset on its own without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm
is to minimize the sum of distances between the data point and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and repeats the
process until it does not find the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points that are nearest to a particular k-center form a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.


Step-7: The model is ready.

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:

o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different clusters. It
means here we will try to group these datasets into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the two points shown below as k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute
it by applying some mathematics that we have studied to calculate the distance between two points. So,
we will draw a median between both the centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster and obtain new centroids as below:

o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of
finding a median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or K-points.

o We will repeat the process by finding the center of gravity of the points in each cluster, so the new centroids will be as shown in the below image:

o As we have the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
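To make the walkthrough concrete, here is a minimal NumPy sketch of the same loop for K = 2; the array X and all variable names are illustrative, and edge cases such as empty clusters are not handled.

import numpy as np

def kmeans(X, k=2, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # Step 2: random initial centroids
    for _ in range(max_iters):
        # Step 3: assign each data point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean (center of gravity) of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # Step 6: no change -> finished
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative data with two variables M1 and M2:
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
labels, centroids = kmeans(X, k=2)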

How to choose the value of "K number of clusters" in K-means Clustering?


The performance of the K-means clustering algorithm depends upon the efficiency of the clusters that it forms. But choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters, but here we discuss the most appropriate method to find the number of clusters, or the value of K. The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the
concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total variations
within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:

WCSS = Σ_{Pi in Cluster1} distance(Pi, C1)² + Σ_{Pi in Cluster2} distance(Pi, C2)² + Σ_{Pi in Cluster3} distance(Pi, C3)²

In the above formula of WCSS,

Σ_{Pi in Cluster1} distance(Pi, C1)²: the sum of the squared distances between each data point in Cluster1 and its centroid C1; the other two terms are defined analogously.
To measure the distance between data points and centroid, we can use any method such as Euclidean distance
or Manhattan distance.

To find the optimal value of clusters, the elbow method follows the below steps:

o It executes K-means clustering on a given dataset for different K values (e.g., K ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm's elbow, is considered the best value of K.

Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the elbow method. The
graph for the elbow method looks like the below image:

Note: We can choose the number of clusters equal to the given data points. If we choose the number of clusters equal to the
data points, then the value of WCSS becomes zero, and that will be the endpoint of the plot.
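As a rough illustration, the elbow curve can be produced with scikit-learn and matplotlib (both assumed to be installed); KMeans exposes the WCSS of a fitted model as inertia_, and the synthetic data below is illustrative.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)   # illustrative data

wcss = []
for k in range(1, 11):                                        # K values from 1 to 10
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)                               # inertia_ = WCSS for this K

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()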

Hierarchical clustering in data mining

Hierarchical clustering refers to an unsupervised learning procedure that determines successive clusters based on previously defined clusters. It works by grouping data into a tree of clusters. Hierarchical clustering starts by treating each data point as an individual cluster. The endpoint is a set of clusters, where each cluster is distinct from the others, and the objects within each cluster are broadly similar to one another.

There are two types of hierarchical clustering

o Agglomerative Hierarchical Clustering


o Divisive Clustering

Agglomerative hierarchical clustering

Agglomerative clustering is one of the most common types of hierarchical clustering used to group similar objects into clusters. Agglomerative clustering is also known as AGNES (Agglomerative Nesting). In agglomerative clustering, each data point acts as an individual cluster, and at each step data objects are grouped in a bottom-up manner. Initially, each data object is in its own cluster. At each iteration, clusters are merged with other clusters until a single cluster remains.

Agglomerative hierarchical clustering algorithm

1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all other clusters (compute the proximity matrix).
3. Combine the most similar clusters.
4. Recalculate the proximity matrix for the new clusters.
5. Repeat step 3 and step 4 until you get a single cluster.
Let’s understand this concept with the help of graphical representation using a dendrogram.

With the help of the following demonstration, we can understand how the actual algorithm works. No calculations are carried out below; the proximities among the clusters are assumed.

Let's suppose we have six different data points P, Q, R, S, T, V.

Step 1:

Consider each alphabet (P, Q, R, S, T, V) as an individual cluster and find the distance between the individual
cluster from all other clusters.

Step 2:

Now, merge the comparable clusters into single clusters. Let's say cluster Q and cluster R are similar to each other, and clusters S and T are likewise similar, so we can merge them in the second step. Finally, we get the clusters [(P), (QR), (ST), (V)].

Step 3:

Here, we recalculate the proximity as per the algorithm and combine the two closest clusters [(ST), (V)] together
to form new clusters as [(P), (QR), (STV)]

Step 4:

Repeat the same process. The clusters QR and STV are comparable and are combined to form a new cluster. Now we have [(P), (QRSTV)].

Step 5:

Finally, the remaining two clusters are merged together to form a single cluster [(PQRSTV)]
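A minimal sketch of the same bottom-up idea, using SciPy's hierarchical clustering on six illustrative points labelled P, Q, R, S, T, V (the coordinates are assumptions, not taken from the example above):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

point_labels = ["P", "Q", "R", "S", "T", "V"]
points = np.array([[0.0, 0.0], [4.0, 4.0], [4.2, 4.1], [9.0, 9.0], [9.1, 9.2], [12.0, 1.0]])

Z = linkage(points, method="single")        # repeatedly merges the two closest clusters
dendrogram(Z, labels=point_labels)          # tree of successive merges
plt.show()

# Cut the tree to obtain, for example, three flat clusters:
print(fcluster(Z, t=3, criterion="maxclust"))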

Divisive Hierarchical Clustering

Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, all the data points start in a single cluster, and in every iteration the data points that are not similar are separated from the cluster. The separated data points are treated as individual clusters. Finally, we are left with N clusters.
Advantages of Hierarchical clustering

o It is simple to implement and gives the best output in some cases.


o It is easy and results in a hierarchy, a structure that contains more information.
o It does not need us to pre-specify the number of clusters.

Disadvantages of hierarchical clustering

o It breaks the large clusters.


o It is difficult to handle clusters of different sizes and convex shapes.
o It is sensitive to noise and outliers.
o A merge or split, once performed, can never be undone.

Data Mining Techniques

1. Association

Association analysis is the finding of association rules showing attribute-value conditions that occur frequently
together in a given set of data. Association analysis is widely used for a market basket or transaction data
analysis. Association rule mining is a significant and exceptionally dynamic area of data mining research. One
method of association-based classification, called associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the association rules discovered.
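As a rough sketch of how such rules can be mined in practice, the following uses the mlxtend library (assumed to be installed) on a few illustrative market-basket transactions; it shows only the rule-generation step, not the full associative-classification procedure.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"],
                ["bread", "milk", "diapers", "beer"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

frequent = apriori(df, min_support=0.5, use_colnames=True)            # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])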

2. Classification

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model depends on the analysis of a set of training data (i.e., data objects whose class label is known). The derived model may be represented in various forms, such as classification (if-then) rules, decision trees, and neural networks. Data mining uses different types of classifiers (a small sketch follows the list below):
 Decision Tree
 SVM(Support Vector Machine)
 Generalized Linear Models
 Bayesian classification:
 Classification by Backpropagation
 K-NN Classifier
 Rule-Based Classification
 Frequent-Pattern Based Classification
 Rough set theory
 Fuzzy Logic
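A minimal sketch of one of these classifiers, a decision tree, trained and evaluated with scikit-learn on its bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)    # learns if-then split rules
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))    # predicts unseen class labels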
3. Prediction

Data prediction is a two-step process, similar to that of data classification. However, for prediction, we do not use the phrase "class label attribute" because the attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). The attribute can be
referred to simply as the predicted attribute. Prediction can be viewed as the construction and use of a model
to assess the class of an unlabeled object, or to assess the value or value ranges of an attribute that a given
object is likely to have.
4. Clustering
Unlike classification and prediction, which analyze class-labeled data objects or attributes, clustering analyzes
data objects without consulting an identified class label. In general, the class labels do not exist in the training
data simply because they are not known to begin with. Clustering can be used to generate these labels. The
objects are clustered based on the principle of maximizing the intra-class similarity and minimizing the
interclass similarity. That is, clusters of objects are created so that objects inside a cluster have high similarity
with one another but are dissimilar to objects in other clusters. Each cluster that is generated can be seen as a class of objects, from which rules can be inferred. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.

5. Regression

Regression can be defined as a statistical modeling method in which previously obtained data is used to predict a continuous quantity for new observations. This classifier is also known as the Continuous Value Classifier. There are two types of regression models: linear regression and multiple linear regression models.
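A minimal linear regression sketch with scikit-learn; the experience/salary values are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

years_experience = np.array([[1], [2], [3], [4], [5]])    # single continuous predictor
salary = np.array([30, 35, 42, 48, 55])                   # continuous target (in thousands)

model = LinearRegression().fit(years_experience, salary)
print(model.predict([[6]]))                               # predict the value for a new observation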

6. Artificial Neural network (ANN) Classifier Method

An artificial neural network (ANN), also referred to simply as a "neural network" (NN), is a computational model inspired by biological neural networks. It consists of an interconnected collection of artificial neurons. A neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct
class label of the input samples. Neural network learning is also denoted as connectionist learning due to the
connections between units. Neural networks involve long training times and are therefore more appropriate for
applications where this is feasible. They require a number of parameters that are typically best determined
empirically, such as the network topology or “structure”. Neural networks have been criticized for their poor
interpretability, since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These features initially made neural networks less desirable for data mining.
The advantages of neural networks, however, contain their high tolerance to noisy data as well as their ability
to classify patterns on which they have not been trained. In addition, several algorithms have newly been
developed for the extraction of rules from trained neural networks. These issues contribute to the usefulness
of neural networks for classification in data mining.
An artificial neural network is an adaptive system that changes its structure based on information that flows through the network during a learning phase. The ANN relies on the principle of learning by example. There are two classical types of neural networks: the perceptron and the multilayer perceptron.

7. Outlier Detection

A database may contain data objects that do not comply with the general behavior or model of the data. These
data objects are Outliers. The investigation of OUTLIER data is known as OUTLIER MINING. An outlier may
be detected using statistical tests which assume a distribution or probability model for the data, or using
distance measures where objects having a small fraction of “close” neighbors in space are considered outliers.
Rather than using statistical or distance measures, deviation-based techniques identify outliers by inspecting differences in the principal characteristics of objects in a group.
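As a rough illustration of the distance-based view, scikit-learn's LocalOutlierFactor flags objects that have few close neighbours; the one-dimensional data below is illustrative.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[10.1], [9.8], [10.3], [10.0], [9.9], [25.0], [10.2]])
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)        # -1 = outlier, 1 = inlier
print(X[labels == -1])             # expected to flag the isolated value 25.0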
8. Genetic Algorithm
Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, provided with historical data to direct the search into regions of better performance in the solution space. They are commonly used to generate high-quality solutions for optimization and search problems. Genetic algorithms simulate the process of natural selection, which means that those individuals that can adapt to changes in their environment are able to survive, reproduce, and pass on to the next generation. In simple words, they simulate "survival of the fittest" among individuals of consecutive generations when solving a problem. Each generation consists of a population of individuals, and each individual represents a point in the search space and a possible solution. Each individual is represented as a string of characters/integers/floats/bits. This string is analogous to a chromosome.
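A tiny, self-contained sketch of these ideas: a genetic algorithm that evolves bit strings towards all ones (the classic OneMax toy problem). The population size, mutation rate, and fitness function are illustrative choices.

import random

def fitness(chromosome):
    return sum(chromosome)                        # more 1-bits = fitter individual

def evolve(pop_size=20, length=16, generations=50, mutation_rate=0.05):
    population = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]     # selection: the fittest half survives
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)     # single-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if random.random() < mutation_rate else bit
                     for bit in child]            # random mutation
            children.append(child)
        population = parents + children           # next generation
    return max(population, key=fitness)

print(evolve())                                   # usually close to all ones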
Advantages of Data Mining

Data mining is a powerful tool that offers many benefits across a wide range of industries. The following are some of the advantages of data mining:

Better Decision Making:


Data mining helps to extract useful information from large datasets, which can be used to make informed and
accurate decisions. By analyzing patterns and relationships in the data, businesses can identify trends and
make predictions that help them make better decisions.
Improved Marketing:
Data mining can help businesses identify their target market and develop effective marketing strategies. By
analyzing customer data, businesses can identify customer preferences and behavior, which can help them
create targeted advertising campaigns and offer personalized products and services.
Increased Efficiency:
Data mining can help businesses streamline their operations by identifying inefficiencies and areas for
improvement. By analyzing data on production processes, supply chains, and employee performance,
businesses can identify bottlenecks and implement solutions that improve efficiency and reduce costs.
Fraud Detection:
Data mining can be used to identify fraudulent activities in financial transactions, insurance claims, and other
areas. By analyzing patterns and relationships in the data, businesses can identify suspicious behavior and
take steps to prevent fraud.
Customer Retention:
Data mining can help businesses identify customers who are at risk of leaving and develop strategies to retain
them. By analyzing customer data, businesses can identify factors that contribute to customer churn and take
steps to address those factors.
Competitive Advantage:
Data mining can help businesses gain a competitive advantage by identifying new opportunities and emerging
trends. By analyzing data on customer behavior, market trends, and competitor activity, businesses can
identify opportunities to innovate and differentiate themselves from their competitors.
Improved Healthcare:
Data mining can be used to improve healthcare outcomes by analyzing patient data to identify patterns and
relationships. By analyzing medical records and other patient data, healthcare providers can identify risk
factors, diagnose diseases earlier, and develop more effective treatment plans.
Disadvantages Of Data mining:
While data mining offers many benefits, there are also some disadvantages and challenges associated with
the process. The following are some of the main disadvantages of data mining:
Data Quality:
Data mining relies heavily on the quality of the data used for analysis. If the data is incomplete, inaccurate, or
inconsistent, the results of the analysis may be unreliable.
Data Privacy and Security:
Data mining involves analyzing large amounts of data, which may include sensitive information about
individuals or organizations. If this data falls into the wrong hands, it could be used for malicious purposes,
such as identity theft or corporate espionage.
Ethical Considerations:
Data mining raises ethical questions around privacy, surveillance, and discrimination. For example, the use of
data mining to target specific groups of individuals for marketing or political purposes could be seen as
discriminatory or manipulative.
Technical Complexity:
Data mining requires expertise in various fields, including statistics, computer science, and domain knowledge.
The technical complexity of the process can be a barrier to entry for some businesses and organizations.
Cost:
Data mining can be expensive, particularly if large datasets need to be analyzed. This may be a barrier to
entry for small businesses and organizations.
Interpretation of Results:
Data mining algorithms generate large amounts of data, which can be difficult to interpret. It may be
challenging for businesses and organizations to identify meaningful patterns and relationships in the data.
Dependence on Technology:
Data mining relies heavily on technology, which can be a source of risk. Technical failures, such as hardware
or software crashes, can lead to data loss or corruption.
Text Mining in Data Mining
Text mining is a component of data mining that deals specifically with unstructured text data. It involves the use
of natural language processing (NLP) techniques to extract useful information and insights from large amounts of
unstructured text data. Text mining can be used as a preprocessing step for data mining or as a standalone
process for specific tasks.
By using text mining, the unstructured text data can be transformed into structured data that can be used for data
mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain
insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.
Text mining is widely used in various fields, such as natural language processing, information retrieval, and social
media analysis. It has become an essential tool for organizations to extract insights from unstructured text data
and make data-driven decisions.
“Extraction of interesting information or patterns from data in large databases is known as data mining.”
Text mining is a process of extracting useful information and nontrivial patterns from a large volume of text
databases. There exist various strategies and devices to mine the text and find important data for the prediction
and decision-making process. The selection of the right and accurate text mining procedure helps to enhance the
speed and the time complexity also. This article briefly discusses and analyzes text mining and its applications in
diverse fields.
“Text Mining is the procedure of synthesizing information, by analyzing relations, patterns, and rules
among textual data.”
As we discussed above, the size of information is expanding at exponential rates. Today all institutes, companies, organizations, and business ventures store their information electronically. A huge collection of data is available on the internet and stored in digital libraries, database repositories, and other textual sources such as websites, blogs, social media networks, and e-mails. It is a difficult task to determine appropriate patterns and trends to extract knowledge from this large volume of data. Text mining is a part of data mining that extracts valuable text information from a text database repository. Text mining is a multi-disciplinary field based on information retrieval, data mining, AI, statistics, machine learning, and computational linguistics.
Conventional Process of Text Mining
 Gathering unstructured information from various sources accessible in various document organizations, for
example, plain text, web pages, PDF records, etc.
 Pre-processing and data cleansing tasks are performed to identify and eliminate inconsistencies in the data. The data cleansing process makes sure to capture the genuine text; it typically removes stop words and applies stemming (the process of identifying the root of a word) before indexing the data.
 Processing and controlling tasks are applied to review and further clean the data set.
 Pattern analysis is implemented in Management Information System.
 Information processed in the above steps is utilized to extract important and applicable data for a powerful
and convenient decision-making process and trend analysis.


Procedures for Analyzing Text Mining


 Text Summarization: To automatically produce a condensed version of a text that reflects its whole content.
 Text Categorization: To assign a category to a text from among categories predefined by users.
 Text Clustering: To segment texts into several clusters, depending on their content similarity.

Text Mining Techniques


Information Retrieval
In the process of information retrieval, we try to process the available documents and text data into a structured form so that we can apply different pattern recognition and analytical processes. It is a process of extracting relevant and associated patterns according to a given set of words or text documents. For this, we use processes like tokenization of the document or stemming, in which we try to extract the base word, or root word, present there.
Information Extraction
It is a process of extracting meaningful words from documents.
 Feature Extraction – In this process, we try to develop some new features from existing ones. This objective
can be achieved by parsing an existing feature or combining two or more features based on some
mathematical operation.
 Feature Selection – In this process, we try to reduce the dimensionality of the dataset which is generally a
common issue while dealing with the text data by selecting a subset of features from the whole dataset.
Natural Language Processing
Natural Language Processing includes tasks that are accomplished by using Machine Learning and Deep
Learning methodologies. It concerns the automatic processing and analysis of unstructured text information.
 Named Entity Recognition (NER): Identifying and classifying named entities such as people, organizations,
and locations in text data.
 Sentiment Analysis: Identifying and extracting the sentiment (e.g. positive, negative, neutral) of text data.
 Text Summarization: Creating a condensed version of a text document that captures the main points.

Application Area of Text Mining

Digital Library
Academic and Research Field
Life Science
Social-Media
Business Intelligence
Issues in Text Mining
1. The efficiency and effectiveness of decision-making.
2. The uncertain problem can come at an intermediate stage of text mining. In the pre-processing stage, different
rules and guidelines are characterized to normalize the text which makes the text-mining process efficient.
Before applying pattern analysis to the document, there is a need to change over unstructured data into a
moderate structure.
3. Sometimes original message or meaning can be changed due to alteration.
4. Another issue in text mining is that not all algorithms and techniques support multi-language text, which may create ambiguity in text meaning. This problem can lead to false-positive results.
5. The use of synonyms, polysemy, and antonyms in document text creates issues for text mining tools that treat them in the same context. It is difficult to categorize such kinds of text/words.
Advantages of Text Mining
1. Large Amounts of Data: Text mining allows organizations to extract insights from large amounts of
unstructured text data. This can include customer feedback, social media posts, and news articles.
2. Variety of Applications: Text mining has a wide range of applications, including sentiment analysis, named
entity recognition, and topic modeling. This makes it a versatile tool for organizations to gain insights from
unstructured text data.
3. Improved Decision Making: Text mining can be used to extract insights from unstructured text data, which
can be used to make data-driven decisions.
4. Cost-effective: Text mining can be a cost-effective way to extract insights from unstructured text data, as it
eliminates the need for manual data entry.
5. Broader benefits: Cost reductions, productivity increases, the creation of novel new services, and new
business models are just a few of the larger economic advantages mentioned by those consulted.
Disadvantages of Text Mining
1. Complexity: Text mining can be a complex process that requires advanced skills in natural language
processing and machine learning.
2. Quality of Data: The quality of text data can vary, which can affect the accuracy of the insights extracted from
text mining.
3. High Computational Cost: Text mining requires high computational resources, and it may be difficult for
smaller organizations to afford the technology.
4. Limited to Text Data: Text mining is limited to extracting insights from unstructured text data and cannot be
used with other data types.
5. Noise in text mining results: Text mining of documents may result in mistakes. It’s possible to find false links
or to miss others. In most situations, if the noise (error rate) is sufficiently low, the benefits of automation
exceed the chance of a larger mistake than that produced by a human reader.
6. Lack of transparency: Text mining is frequently viewed as a mysterious process where large corpora of text
documents are input and new information is produced. Text mining is in fact opaque when researchers lack
the technical know-how or expertise to comprehend how it operates, or when they lack access to corpora or
text mining tools.

What is Text Classification?


Text Classification Algorithms are at the heart of many software systems that process large amounts of text data.
Text Classification is used by email software to determine whether incoming mail is sent to the inbox or filtered
into the spam folder. Text classification is used in discussion forums to determine whether comments should be
flagged as inappropriate.

These are two examples of Topic Classification, in which a text document is classified into one of a predefined
set of topics. Many topic classification problems rely heavily on textual keywords for categorization.

Sentiment Analysis is another common type of text classification, with the goal of determining the polarity of text
content: the type of opinion it expresses. This can be expressed as a binary like/dislike rating or as a more
granular set of options, such as a star rating from 1 to 5.

Sentiment Analysis can be used to determine whether or not people liked the Black Panther movie by analyzing
Twitter posts or extrapolating the general public’s opinion of a new brand of Nike shoes based on Walmart
reviews.
How does Text Classification in Data Mining Work?

The process of categorizing text into organized groups is known as text classification, also known as text tagging
or text categorization. Text Classification in Data Mining can automatically analyze text and assign a set of pre-
defined tags or categories based on its content using Natural Language Processing (NLP).

Text Classification in Data Mining is becoming an increasingly important part of the business because it enables
easy data insights and the automation of business processes.

The following are some of the most common examples and use cases in Text Classification in Data Mining for
Automatic Text Classification:

 Sentiment Analysis for determining whether a given text is speaking positively or negatively about a
particular subject (e.g. for brand monitoring purposes).
 The task of determining the theme or topic of a piece of text is known as topic detection (e.g. knowing if
a product review is about Ease of Use, Customer Support, or Pricing when analyzing customer
feedback).
 Language detection refers to the process of determining the language of a given text (e.g. knowing if an
incoming support ticket is written in English or Spanish for automatically routing tickets to the
appropriate team).

Here is the Text Classification in Data Mining workflow:

Step 1: Collect Information

The most important step in solving any Supervised Machine Learning problem is gathering data. Your Text
Classifier is only as good as the dataset it is trained on.

If you don’t have a specific problem in mind and are simply interested in learning about Text Classification or Text
Classification in Data Mining in general, there are a plethora of open-source datasets available. If, on the other
hand, you are attempting to solve a specific problem, you will need to gather the necessary data.

Text Classification in Data Mining is not just a buzzword. Many organizations, such as Twitter and the New York Times, provide public APIs for accessing their data. You might be able to use these to solve the problem you're trying to solve.

Here are some things to keep in mind when you gather data for Text Classification in Data Mining:

 Before you use a Public API, make sure you understand its limitations. Some APIs, for example, limit the
number of queries you can make per second.
 The more training examples (referred to as samples throughout this guide), the better. This will help your
model generalize more effectively.
 Make certain that the number of samples for each class or topic is not excessively imbalanced. That is,
each class should have a comparable number of samples.
 Make certain that your samples adequately cover the space of possible inputs, rather than just the
common cases.

Step 2: Investigate Your Data

Building and Training a model is only one step in the process. Understanding the characteristics of your data
ahead of time will allow you to build a more accurate model. This could simply mean achieving greater accuracy.
It could also imply requiring less data or fewer computational resources for training.

Examine the Data

After loading the data, it’s a good idea to run some checks on it: select a few samples and manually check if they
match your expectations. Print a few random samples, for example, to see if the sentiment label corresponds to
the sentiment of the review.
Step 2.5: Select a Model

We have assembled our dataset and gained insights into the key characteristics of our data at this point.
Following that, we should consider which classification model to employ based on the metrics gathered in Step 2.
This includes questions like, “How do we present the text data to an algorithm that expects numeric input?”
(this is known as data preprocessing and vectorization), “What type of model should we use?“, and “What
configuration parameters should we use for our model?” and so on.

We now have access to a wide range of data preprocessing and model configuration options as a result of
decades of research. The availability of a very large array of viable options to choose from, on the other hand,
greatly increases the complexity and scope of the specific problem at hand.

Given that the best options may not be obvious, a naive solution would be to try every possible option exhaustively, pruning poor choices along the way.

Step 3: Prepare Your Data

Before we can feed our data to a model, it must be transformed into a format that the model can understand.

For Text Classification in Data Mining, First, the data samples that we have gathered may be in a particular
order. We don’t want any information related to sampling order to influence the relationship between texts and
labels. For example, if a dataset is sorted by class and then divided into training/validation sets, the
training/validation sets will not be representative of the overall data distribution.

If your data has already been divided into training and validation sets, make sure to transform your validation
data in the same way you did your training data. If you don’t already have separate training and validation sets,
you can split the samples after shuffling; typically, 80% of the samples are used for training and 20%
for validation.

Second, Machine Learning Algorithms are fed numerical inputs. This means we’ll have to turn the texts into
numerical vectors. This procedure consists of two steps:

 Tokenization: Break the texts down into words or smaller sub-texts to allow for better generalization of
the relationship between the texts and the labels. This determines the dataset’s “vocabulary” (set of
unique tokens present in the data).
 Vectorization: It’s the process of defining a good numerical measure to characterize these texts.

Step 4: Create, Train, and Test Your Model

In this section, we will work on developing, training, and assessing our model. In Step 3, we decided whether to
use an n-gram model or a sequence model based on our S/W ratio. It is now time to write and train our
classification algorithm. TensorFlow and the tf.keras API will be used for this.

Building Machine Learning Models with Keras is as simple as putting together layers of data-processing building
blocks, similar to how we would put together Lego bricks. These layers allow us to specify the order in which we
want to perform transformations on our input. Because Learning Algorithm accepts single text input and produces
a single classification, we can use the Sequential model API to build a Linear Stack of Layers.

We need to train the model now that we’ve built the model architecture. Training entails making a prediction
based on the current state of the model, calculating how inaccurate the prediction is, and updating the network’s
weights or parameters to minimize this error and improve the model’s prediction. This process is repeated until
our model has converged and can no longer learn.
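A minimal sketch of such a model with tf.keras, assuming the texts have already been vectorized into dense arrays x_train/x_val with binary labels y_train/y_val (these variable names are assumptions, not from the text); the layer sizes are illustrative.

import tensorflow as tf

def build_model(input_dim, dropout_rate=0.2, units=64):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # binary classification output
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_model(input_dim=x_train.shape[1])
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=10, batch_size=32,
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=2)])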

Step 5: Fine-tune the Hyperparameters

For defining and training the model, we had to select a number of hyperparameters. We relied on our instincts,
examples, and best practice recommendations. However, our initial selection of hyperparameter values may not
produce the best results. It merely provides us with a good starting point for training. Every problem is unique,
and fine-tuning these hyperparameters will aid in refining our model to better represent the specifics of the
problem at hand.
Let’s look at some of the hyperparameters we used and what tuning them entails:

 The model’s number of layers


 The number of units in each layer
 Dropout rates
 Learning rates

Step 6: Put Your Model to Work

When deploying your model, please keep the following points in mind:

 Check that your production data is distributed in the same way as your training and evaluation data.
 Re-evaluate on a regular basis by gathering more training data.
 Retrain your model if your data distribution changes.

Benefits of Text Classification in Data Mining

Here are some benefits of text classification approaches in data mining:

 Text Classification in Data Mining provides an accurate representation of the language and how
meaningful words are used in context.
 Text Classification in Data Mining can work at a higher level of abstraction, it makes it easier to write
simpler rules.
 Text Classification in Data Mining uses the fundamental features of semantic technology to understand
the meaning of words in context. Because semantic technology allows words to be understood in their
proper context, this provides superior precision and recall.
 Documents that do not “fit” into a specific category are identified and automatically separated once the
system is deployed, and the system administrator can fully understand why they were not classified.

Web Mining

Web mining can broadly be seen as the application of adapted data mining techniques to the web, whereas data mining is defined as the application of algorithms to discover patterns, mostly on structured data, embedded in a knowledge discovery process. Web mining has the distinctive property of providing a set of various data types. The web has multiple aspects that yield different approaches for the mining process: web pages consist of text, web pages are linked via hyperlinks, and user activity can be monitored via web server logs. These three features lead to the differentiation between three areas: web content mining, web structure mining, and web usage mining.

Web Content Mining:

Web content mining can be used to extract useful data, information, knowledge from the web page
content. In web content mining, each web page is considered as an individual document. The individual
can take advantage of the semi-structured nature of web pages, as HTML provides information that
concerns not only the layout but also logical structure. The primary task of content mining is data
extraction, where structured data is extracted from unstructured websites. The objective is to facilitate
data aggregation over various web sites by using the extracted structured data. Web content mining can
be utilized to distinguish topics on the web. For Example, if any user searches for a specific task on the
search engine, then the user will get a list of suggestions.
Web Structure Mining:

Web structure mining can be used to discover the link structure of hyperlinks, that is, how data is connected either through direct links between web pages or through the wider link network. In web structure mining, an individual considers the web as a directed graph, with the web pages being the vertices that are connected by hyperlinks.
The most important application in this regard is the Google search engine, which estimates the ranking
of its outcomes primarily with the PageRank algorithm. It characterizes a page to be exceptionally
relevant when frequently connected by other highly related pages. Structure and content mining
methodologies are usually combined. For example, web structured mining can be beneficial to
organizations to regulate the network between two commercial sites.
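As a rough illustration of the PageRank idea, the following power-iteration sketch ranks a tiny, illustrative link graph; pages that are linked to by important pages end up with higher scores.

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
n = len(pages)

rank = {p: 1.0 / n for p in pages}
damping = 0.85
for _ in range(50):                                    # power iteration
    new_rank = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / n + damping * incoming
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))     # C should rank highest here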

Web Usage Mining:

Web usage mining is used to extract useful data, information, knowledge from the weblog records, and
assists in recognizing the user access patterns for web pages. When mining the usage of web resources, one considers the records of requests made by visitors to a website, which are often collected in web server logs. While the content and structure of the collection of web pages follow the intentions of the
authors of the pages, the individual requests demonstrate how the consumers see these pages. Web
usage mining may disclose relationships that were not proposed by the creator of the pages.

Some of the methods to identify and analyze the web usage patterns are given below:

I. Session and visitor analysis:

The analysis of preprocessed data can be accomplished in session analysis, which incorporates the guest
records, days, time, sessions, etc. This data can be utilized to analyze the visitor's behavior.

A report is created after this analysis, containing the details of repeatedly visited web pages and common entry and exit points.

II. OLAP (Online Analytical Processing):

OLAP accomplishes a multidimensional analysis of advanced data.

OLAP can be accomplished on various parts of log related data in a specific period.

OLAP tools can be used to infer important business intelligence metrics

Challenges in Web Mining:


The web poses incredible challenges for resource and knowledge discovery, based on the following observations:

o The complexity of web pages:

The site pages don't have a unifying structure. They are extremely complicated as compared to traditional text
documents. There are enormous amounts of documents in the digital library of the web. These libraries are not
organized according to a specific order.

o The web is a dynamic data source:

The data on the internet is quickly updated. For example, news, climate, shopping, financial news, sports, and so
on.

o Diversity of client networks:

The client network on the web is quickly expanding. These clients have different interests, backgrounds, and usage purposes. There are over a hundred million workstations connected to the internet, and the number is still increasing tremendously.
o Relevancy of data:

It is considered that a specific person is generally concerned with only a small portion of the web, while the rest of the web contains data that is not familiar to the user and may lead to unwanted results.

o The web is too broad:

The size of the web is tremendous and rapidly increasing. It appears that the web is too huge for data
warehousing and data mining.

Application of Web Mining:

Web mining has an extensive application because of various uses of the web. The list of some applications of
web mining is given below.

o Marketing and conversion tool


o Data analysis of website and application performance.
o Audience behavior analysis.
o Advertising and campaign performance analysis.
o Testing and analysis of a site.
