Adm Unit-4,5
The use of data mining task primitives has several advantages, including:
1. Modularity: Data mining task primitives provide a modular approach to data mining, which allows for
flexibility and the ability to easily modify or replace specific steps in the process.
2. Reusability: Data mining task primitives can be reused across different data mining projects, which can
save time and effort.
3. Standardization: Data mining task primitives provide a standardized approach to data mining, which can
improve the consistency and quality of the data mining process.
4. Understandability: Data mining task primitives are easy to understand and communicate, which can
improve collaboration and communication among team members.
5. Improved Performance: Data mining task primitives can improve the performance of the data mining
process by reducing the amount of data that needs to be processed, and by optimizing the data for specific
data mining algorithms.
6. Flexibility: Data mining task primitives can be combined and repeated in various ways to achieve the goals
of the data mining process, making it more adaptable to the specific needs of the project.
7. Efficient use of resources: Data mining task primitives help make more efficient use of resources, as they allow specific tasks to be performed with the right tools, avoiding unnecessary steps and reducing the time and computational power needed.
Data visualization
Data visualization is the graphical representation of data points and information so that users can understand them quickly and easily. A visualization is effective when it has a clear meaning and purpose and can be interpreted with little additional context. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data using visual elements such as charts, graphs, and maps.
Characteristics of an Effective Graphical Visualization:
It presents data clearly and in an understandable manner.
It encourages viewers to compare different pieces of data.
It closely integrates statistical and verbal descriptions of the data set.
It grabs our interest, focuses our mind, and keeps our eyes on the message, as the human brain tends to focus on visual data more than on written data.
It also helps in identifying areas that need more attention and improvement.
Using graphical representation, a story can be told more efficiently, and a picture takes less time to understand than textual data.
Categories of Data Visualization:
Data visualization is critical to market research, where both numerical and categorical data can be visualized, increasing the impact of insights and reducing the risk of analysis paralysis. Common ways of representing data include the following:
1) Pixel-Oriented Visualization: Each attribute value is mapped to a pixel, and the pixel's color represents the corresponding value.
2) Geometric Representation: Multidimensional datasets are represented in 2D, 3D, and 4D scatter plots.
3) Icon-Based Visualization: The data is represented using Chernoff faces and stick figures. Chernoff faces exploit the human mind's ability to recognize facial characteristics and the differences between them. The stick-figure technique maps multidimensional data onto five-piece stick figures.
4) Hierarchical Data Visualization: The datasets are represented using treemaps, which display hierarchical data as a set of nested rectangles. (A small plotting sketch of the first two styles follows.)
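The representation styles above can be tried out with any plotting library. The sketch below is a minimal illustration, assuming matplotlib and NumPy are installed and using randomly generated data: a pixel-oriented heat map next to a 2-D geometric scatter plot. Treemaps and Chernoff faces would need extra plotting extensions, so only the first two styles are shown.

    # Hedged sketch: random data, illustrative only.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

    # Pixel-oriented: each pixel's colour encodes one value.
    ax1.imshow(rng.random((20, 20)), cmap="viridis")
    ax1.set_title("Pixel-oriented")

    # Geometric: a 2-D scatter plot; colour adds a third dimension.
    x, y = rng.normal(size=100), rng.normal(size=100)
    ax2.scatter(x, y, c=x + y, cmap="coolwarm")
    ax2.set_title("Geometric (scatter)")

    plt.tight_layout()
    plt.show()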
Types of Association Rules in Data Mining
Association rule learning is a machine learning technique used for discovering interesting relationships between
variables in large databases. It is designed to detect strong rules in the database based on some interesting
metrics. For any given multi-item transaction, association rules aim to obtain rules that determine how or why certain items are linked. Association rules are created to find general if-then patterns, using the support and confidence criteria to define the key relationships. Support shows how frequently an itemset appears in the data, while confidence measures how often the if-then statement has been found to be true.
1. Multi-relational association rules: Multi-Relation Association Rules (MRAR) are a class of association rules, different from primitive, simple, and even multi-relational association rules (usually extracted from multi-relational databases), in which each rule element consists of one entity but several relationships. These relationships represent indirect relationships between the entities.
2. Generalized association rules: Generalized association rule extraction is a powerful tool for getting a rough
idea of interesting patterns hidden in data. However, since patterns are extracted at each level of abstraction, the
mined rule sets may be too large to be used effectively for decision-making. Therefore, in order to discover
valuable and interesting knowledge, post-processing steps are often required. Generalized association rules
should have categorical (nominal or discrete) properties on both the left and right sides of the rule.
3. Quantitative association rules: Quantitative association rules are a special type of association rule. Unlike general association rules, where both the left and right sides of the rule should be categorical (nominal or discrete) attributes, at least one attribute (left or right) of a quantitative association rule must contain a numeric attribute.
Applications of Association Rules:
Medical Diagnosis: Association rules can be used in medical diagnosis to assist doctors in treating patients. Diagnosis is not an easy process, and errors can lead to unreliable results. Using multi-relational association rules, we can determine the probability of a disease occurring in association with various factors and symptoms.
Market Basket Analysis: It is one of the most popular examples and uses of association rule mining. Big
retailers typically use this technique to determine the association between items.
Apriori Algorithm
Before working through the algorithm, keep in mind the definitions of support count and confidence used below.
Consider the following dataset and we will find frequent itemsets and generate association rules for them.
minimum support count is 2
minimum confidence is 60%
Step-1: K=1
(I) Create a table containing support count of each item present in dataset – Called C1(candidate set)
(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove that item. This gives us itemset L1.
Step-2: K=2
Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common.
Check whether all subsets of each itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
(I) Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate (C2) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove that itemset. This gives us itemset L2.
Step-3:
Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
Check whether all subsets of these itemsets are frequent and, if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
(I) Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidate (C3) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove that itemset. This gives us itemset L3.
Step-4:
Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K = 4) is that the itemsets should have (K-2) elements in common, so here, for L3, the first 2 elements (items) should match.
Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there is no itemset in C4.
We stop here because no further frequent itemsets are found.
Thus, we have discovered all the frequent item-sets. Now generation of strong association rule comes into
picture. For that we need to calculate confidence of each rule.
Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
Confidence(A->B)=Support_count(A∪B)/Support_count(A)
So here, by taking an example of any frequent itemset, we will show the rule generation.
Itemset {I1, I2, I3} //from L3
So the rules can be:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if minimum confidence is 50%, then first 3 rules can be considered as strong association rules.
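The worked example above can be reproduced with a short script. The sketch below is a simplified Apriori implementation; the nine transactions are an assumption, reconstructing the classic dataset that these support counts correspond to, since the transaction table itself is not reproduced in the text. The join and prune steps mirror Step-2 through Step-4, and the final loop prints the rules generated from {I1, I2, I3}.

    # Hedged sketch of Apriori; the transaction list below is assumed, not taken
    # verbatim from the (missing) table in the text.
    from itertools import combinations

    transactions = [
        {'I1', 'I2', 'I5'}, {'I2', 'I4'}, {'I2', 'I3'},
        {'I1', 'I2', 'I4'}, {'I1', 'I3'}, {'I2', 'I3'},
        {'I1', 'I3'}, {'I1', 'I2', 'I3', 'I5'}, {'I1', 'I2', 'I3'},
    ]

    def support_count(itemset):
        """Number of transactions that contain every item of the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    def apriori(min_support=2):
        """Return all frequent itemsets as {frozenset: support_count}."""
        items = sorted({i for t in transactions for i in t})
        frequent = {}
        level = [frozenset([i]) for i in items if support_count({i}) >= min_support]
        k = 1
        while level:
            for s in level:
                frequent[s] = support_count(set(s))
            # Join step: unite itemsets sharing k-1 items, then prune.
            candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
            level = [c for c in candidates
                     if all(frozenset(sub) in frequent for sub in combinations(c, k))
                     and support_count(set(c)) >= min_support]
            k += 1
        return frequent

    freq = apriori()
    abc = frozenset({'I1', 'I2', 'I3'})
    for lhs_size in (1, 2):
        for lhs in combinations(sorted(abc), lhs_size):
            conf = freq[abc] / freq[frozenset(lhs)]
            print(set(lhs), '=>', set(abc - frozenset(lhs)), f'confidence = {conf:.0%}')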
Limitations of Apriori Algorithm
The Apriori algorithm can be slow. Its main limitation is the time required to hold a vast number of candidate sets when there are many frequent itemsets, a low minimum support, or large itemsets; that is, it is not an efficient approach for very large datasets. For example, if there are 10^4 frequent 1-itemsets, the algorithm needs to generate more than 10^7 candidate 2-itemsets, which in turn must be tested and accumulated. Furthermore, to detect a frequent pattern of size 100, e.g. v1, v2 ... v100, it has to generate 2^100 candidate itemsets, which makes candidate generation costly and time consuming. The algorithm checks many candidate itemsets and scans the database repeatedly to find them. Apriori becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.
Correlation analysis can reveal meaningful relationships between different metrics or groups of metrics.
Information about those connections can provide new insights and reveal interdependencies, even if the metrics
come from different parts of the business.
1. Pearson r correlation
Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship
between linearly related variables. For example, in the stock market, if we want to measure how two stocks are
related to each other, Pearson r correlation is used to measure the degree of relationship between the two. The
point-biserial correlation is conducted with the Pearson correlation formula, except that one of the variables is
dichotomous. The following formula is used to calculate the Pearson r correlation:
r = Σ(xi - x̄)(yi - ȳ) / √[ Σ(xi - x̄)² · Σ(yi - ȳ)² ]
where xi, yi = the individual observations, x̄, ȳ = the means of the two variables, and n = number of observations.
2. Kendall rank correlation
Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables. Considering two samples, a and b, where each sample size is n, we know that the total number of pairings of a with b is n(n-1)/2. The following formula is used to calculate the value of Kendall rank correlation:
τ = (nc - nd) / (n(n-1)/2)
where nc = number of concordant pairs, nd = number of discordant pairs, and n = number of observations.
3. Spearman rank correlation
Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test does not make any assumptions about the data distribution. It is the appropriate correlation analysis when the variables are measured on at least an ordinal scale.
This coefficient requires a table of data that displays the raw data, its ranks, and the difference between the two ranks. The squared differences between the ranks can also be shown on a scatter graph, which will indicate whether there is a positive, negative, or no correlation between the two variables. The coefficient is constrained to -1 ≤ r ≤ +1, where a result of 0 means that there is no relationship between the variables at all. The following formula is used to calculate the Spearman rank correlation:
ρ = 1 - (6 Σ di²) / (n(n² - 1))
where di = the difference between the ranks of corresponding observations and n = number of observations.
The two methods outlined above will be used according to whether there are parameters associated with the
data gathered. The two terms to watch out for are:
o Parametric:(Pearson's Coefficient) The data must be handled with the parameters of populations or
probability distributions. Typically used with quantitative data already set out within said parameters.
o Non-parametric:(Spearman's Rank) Where no assumptions can be made about the probability
distribution. Typically used with qualitative data, but can be used with quantitative data if Spearman's
Rank proves inadequate.
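Both coefficients (and Kendall's tau) can be computed directly with SciPy's stats module. The sketch below uses two small made-up arrays; pearsonr, spearmanr, and kendalltau are standard SciPy calls.

    # Hedged sketch: the data are invented for illustration.
    import numpy as np
    from scipy import stats

    x = np.array([2, 4, 6, 8, 10])
    y = np.array([1, 3, 7, 9, 15])

    pearson_r, _ = stats.pearsonr(x, y)      # parametric, assumes a linear relationship
    spearman_rho, _ = stats.spearmanr(x, y)  # non-parametric, rank-based
    kendall_tau, _ = stats.kendalltau(x, y)  # non-parametric, concordant/discordant pairs

    print(f"Pearson r    = {pearson_r:.3f}")
    print(f"Spearman rho = {spearman_rho:.3f}")
    print(f"Kendall tau  = {kendall_tau:.3f}")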
Interpreting Results
Typically, the best way to gain a generalized but more immediate interpretation of the results of a set of
data is to visualize it on a scatter graph such as these:
1. Positive Correlation: Any score from +0.5 to +1 indicates a very strong positive correlation, which means that both variables increase simultaneously. The data points trend upwards, indicating the positive correlation. The line of best fit, or trend line, is placed so as to best represent the data on the graph.
2. Negative Correlation: Any score from -0.5 to -1 indicates a strong negative correlation, which means that as one variable increases, the other decreases proportionally. The line of best fit can be seen here to indicate the negative correlation. In these cases, it will slope downwards from the point of origin.
3. No Correlation: Very simply, a score of 0 indicates no correlation, or relationship, between the two variables.
This fact will stand true for all, no matter which formula is used. The more data inputted into the formula, the
more accurate the result will be. The larger the sample size, the more accurate the result.
Another important benefit of correlation analysis in anomaly detection is reducing alert fatigue by filtering
irrelevant anomalies (based on the correlation) and grouping correlated anomalies into a single alert. Alert storms
and false positives are significant challenges organizations face - getting hundreds, even thousands of separate
alerts from multiple systems when many of them stem from the same incident.
3. Reduce Costs
Correlation analysis helps significantly reduce the costs associated with the time spent investigating meaningless
or duplicative alerts. In addition, the time saved can be spent on more strategic initiatives that add value to the
organization.
UNIT-V
Learn-One-Rule Algorithm
This method is used in the sequential learning algorithm for learning the rules. It returns a single rule that covers at least some examples. What makes it really powerful is its ability to create relations among the given attributes, hence covering a larger hypothesis space.
For example:
IF Mother(y, x) and Female(y), THEN Daughter(x, y).
Here, any person can be associated with the variables x and y
Learn-One-Rule Algorithm
The Learn-One-Rule algorithm follows a greedy search paradigm: it searches for a rule with high accuracy, even if that rule's coverage is low. It does not need to cover all of the positive examples; it returns a single rule that covers at least some of them.
Learn-One-Rule(target_attribute, attributes, examples, k):
    best_hypothesis := the most general hypothesis (empty precondition)
    candidate_hypotheses := {best_hypothesis}
    while candidate_hypotheses is not empty:
        //Generate the next, more specific candidate hypotheses
        //Update best_hypothesis to the candidate that performs best on examples
        candidate_hypotheses := the k best candidates (beam of width k)
    return the rule "IF best_hypothesis THEN the most frequent value of target_attribute among the examples it covers"
(Fragment of the training-examples table; the column headers are not preserved in the source.)
D4: Snowy, Cold, Weak, Light, Yes
D5: Snowy, Cold, Weak, Light, Yes
D6: Snowy, Cold, Strong, Light, Yes
Sequential Learning Algorithm uses this algorithm, improving on it and increasing the
coverage of the hypothesis space. It can be modified to accept an argument that specifies
the target value of interest.
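A minimal sketch of the greedy search Learn-One-Rule performs is shown below. It uses a beam width of 1 (plain greedy specialization) and a tiny hypothetical dataset whose attribute names (Weather, Wind, Traffic, Play) merely echo the table fragment above; none of these values come from the original text.

    # Hedged, beam-width-1 sketch of Learn-One-Rule on made-up data.
    examples = [
        {"Weather": "Snowy", "Wind": "Weak",   "Traffic": "Light", "Play": "Yes"},
        {"Weather": "Snowy", "Wind": "Strong", "Traffic": "Light", "Play": "Yes"},
        {"Weather": "Rainy", "Wind": "Strong", "Traffic": "Heavy", "Play": "No"},
        {"Weather": "Sunny", "Wind": "Weak",   "Traffic": "Heavy", "Play": "No"},
    ]

    def accuracy(rule, target_value):
        """Accuracy of the rule on the examples it covers, plus coverage count."""
        covered = [e for e in examples if all(e[a] == v for a, v in rule.items())]
        if not covered:
            return 0.0, 0
        hits = sum(1 for e in covered if e["Play"] == target_value)
        return hits / len(covered), len(covered)

    def learn_one_rule(target_value="Yes", attributes=("Weather", "Wind", "Traffic")):
        rule = {}                                  # most general hypothesis
        best_acc, _ = accuracy(rule, target_value)
        improved = True
        while improved:
            improved = False
            # Greedily add the attribute test that most improves accuracy.
            for attr in attributes:
                if attr in rule:
                    continue
                for value in {e[attr] for e in examples}:
                    cand = dict(rule, **{attr: value})
                    acc, covered = accuracy(cand, target_value)
                    if covered and acc > best_acc:
                        best_acc, rule, improved = acc, cand, True
        return rule, best_acc

    rule, acc = learn_one_rule()
    print("IF", rule, "THEN Play = Yes  (accuracy on covered examples: %.0f%%)" % (100 * acc))

Running it prints a single high-accuracy rule, such as IF {'Weather': 'Snowy'} THEN Play = Yes, which covers only some of the examples.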
Decision Trees
1. A decision tree works under the supervised learning approach for both discrete and continuous variables. The dataset is split into subsets on the basis of the dataset's most significant attribute. The identification of that attribute and the splitting are done by the algorithm.
2. The structure of the decision tree consists of the root node, which is the significant predictor node. The
process of splitting occurs from the decision nodes which are the sub-nodes of the tree. The nodes which do
not split further are termed as the leaf or terminal nodes.
3. The dataset is divided into homogenous and non-overlapping regions following a top-down approach. The
top layer provides the observations at a single place which then splits into branches. The process is termed as
“Greedy Approach” due to its focus only on the current node rather than the future nodes.
4. The decision tree keeps growing until a stopping criterion is reached.
5. With the building of a decision tree, lots of noise and outliers are generated. To remove these outliers and
noisy data, a method of “Tree pruning” is applied. Hence, the accuracy of the model increases.
6. The accuracy of a model is checked on a test set consisting of test tuples and their class labels. An accurate model is defined by the percentage of test-set tuples that it classifies correctly.
Decision trees lead to the development of models for classification and regression based on a tree-like
structure. The data is broken down into smaller subsets. The result of a decision tree is a tree with decision
nodes and leaf nodes. Two types of decision trees are explained below:
1. Classification
Classification includes the building of models that describe important class labels. They are applied in the areas of machine learning and pattern recognition. Decision trees in machine learning, through classification models, lead to fraud detection, medical diagnosis, etc. The two-step process of a classification model comprises a learning step, in which the model is built from training data, and a classification step, in which the model labels new data.
2. Regression
Regression models are used for the regression analysis of data, i.e. the prediction of numerical attributes.
These are also called continuous values. Therefore, instead of predicting the class labels, the regression
model predicts the continuous values .
A decision tree algorithm known as ID3 was developed in the early 1980s by J. Ross Quinlan, a machine learning researcher. It was succeeded by other algorithms, such as C4.5, developed by the same author. Both algorithms apply the greedy approach: no backtracking is used, and the trees are constructed in a top-down, recursive, divide-and-conquer manner. The algorithms use a training dataset with class labels, which is divided into smaller subsets as the tree is constructed.
ID3
The whole set of data S is considered as the root node while forming the decision tree. Iteration is then carried out on every attribute, and the data is split into fragments. On each iteration, the algorithm considers only attributes that have not already been used. Splitting data in ID3 is time-consuming, and it is not an ideal algorithm because it tends to overfit the data.
C4.5
C4.5 is an advanced version of ID3 in which the data are classified as samples. Both continuous and discrete values can be handled efficiently, unlike in ID3. A pruning method is included that removes unwanted branches.
CART
Both classification and regression tasks can be performed by the algorithm. Unlike ID3 and C4.5, decision
points are created by considering the Gini index. A greedy algorithm is applied for the splitting method aiming
to reduce the cost function. In classification tasks, the Gini index is used as the cost function to indicate the
purity of leaf nodes. In regression tasks, sum squared error is used as the cost function to find the best
prediction.
CHAID
As the name suggests, it stands for Chi-square Automatic Interaction Detector, a process dealing with any
type of variables. They might be nominal, ordinal, or continuous variables. Regression trees use the F-test,
while the Chi-square test is used in the classification model.
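As a hedged illustration of CART-style tree building, the sketch below uses scikit-learn's DecisionTreeClassifier with the Gini index on the bundled iris dataset; the dataset and the max_depth value are arbitrary choices for the example, not something prescribed by the text.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    tree = DecisionTreeClassifier(criterion="gini",   # CART-style split quality
                                  max_depth=3)        # simple pre-pruning via a depth limit
    tree.fit(X_train, y_train)

    print("test accuracy:", tree.score(X_test, y_test))
    print(export_text(tree, feature_names=load_iris().feature_names))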
We use classification and prediction to extract a model, representing the data classes to predict future data
trends. Classification predicts the categorical labels of data with the prediction models. This analysis provides us
with the best understanding of the data at a large scale.
Classification models predict categorical class labels, and prediction models predict continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as either safe or risky or a
prediction model to predict the expenditures in dollars of potential customers on computer equipment given their
income and occupation.
What is Classification?
Classification is to identify the category or the class label of a new observation. First, a set of data is used as
training data. The set of input data and the corresponding outputs are given to the algorithm. So, the training data
set includes the input data and their associated class labels. Using the training dataset, the algorithm derives a
model or the classifier. The derived model can be a decision tree, mathematical formula, or a neural network. In
classification, when unlabeled data is given to the model, it should find the class to which it belongs. The new
data provided to the model is the test data set.
Classification is the process of classifying a record. One simple example of classification is to check whether it is
raining or not. The answer can either be yes or no. So, there is a particular number of choices. Sometimes there
can be more than two classes to classify. That is called multiclass classification.
The functioning of classification has been illustrated above with the bank loan application example. There are two stages in a data classification system: creating the classifier (model construction) and applying the classifier for classification.
1. Developing the Classifier or model creation: This level is the learning stage or the learning process.
The classification algorithms construct the classifier in this stage. A classifier is constructed from a training set composed of database records and their corresponding class names. Each record in the training set belongs to a category or class; these records may also be referred to as samples, objects, or data points.
2. Applying the classifier for classification: The classifier is used for classification at this level. The test data are used here to estimate the accuracy of the classification algorithm. If the accuracy is considered acceptable, the classification rules can be applied to new data records. Applications include:
o Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring. We can
use it to extract social media insights. We can build sentiment analysis models to read and
analyze misspelled words with advanced machine learning algorithms. Accurately trained models provide consistently reliable outcomes in a fraction of the time.
o Document Classification: We can use document classification to organize the documents into
sections according to the content. Document classification refers to text classification; we can
classify the words in the entire document. And with the help of machine learning classification
algorithms, we can execute it automatically.
o Image Classification: Image classification is used to assign an image to one of a set of trained categories. These could be based on the caption of the image, a statistical value, or a theme. You can tag images to train your model for the relevant categories by applying supervised learning algorithms.
o Machine Learning Classification: It uses statistically demonstrable algorithmic rules to execute analytical tasks that would take humans hundreds of hours to perform.
3. Data Classification Process: The data classification process can be categorized into five steps:
o Create the goals of data classification, strategy, workflows, and architecture of data
classification.
o Classify the confidential details that we store.
o Apply labels by tagging the data.
o Use the results to improve security and compliance.
o Data is constantly changing, so classification is a continuous process.
The data classification life cycle produces an excellent structure for controlling the flow of data to an enterprise.
Businesses need to account for data security and compliance at each level. With the help of data classification,
we can perform it at every stage, from origin to deletion. The data life-cycle has the following stages, such as:
1. Origin: Sensitive data is produced in various formats, such as emails, Excel and Word files, Google documents, social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it according to in-house protection policies and compliance rules.
3. Storage: The obtained data is stored with access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers across various devices and platforms.
5. Archive: Eventually, data is archived within the organization's storage systems.
6. Publication: Through publication, data can reach customers, who can then view and download it in the form of dashboards.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. As in classification, the
training dataset contains the inputs and corresponding numerical output values. The algorithm derives the model
or a predictor according to the training dataset. The model should find a numerical output when the new data is
given. Unlike in classification, this method does not have a class label. The model predicts a continuous-valued
function or ordered value.
Regression is generally used for prediction. Predicting the value of a house depending on the facts such as the
number of rooms, the total area, etc., is an example for prediction.
For example, suppose a marketing manager needs to predict how much a particular customer will spend at his company during a sale. In this case we need to forecast a numerical value, so this data processing activity is an example of numeric prediction. Here, a model or predictor is developed that forecasts a continuous-valued or ordered function.
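The contrast between the two tasks can be made concrete with scikit-learn. In the sketch below, the loan and spending numbers are invented purely for illustration: a logistic-regression classifier outputs a categorical label (risky or safe), while a linear-regression predictor outputs a continuous value.

    import numpy as np
    from sklearn.linear_model import LogisticRegression, LinearRegression

    # Classification: label a loan application as risky (1) or safe (0)
    # from [income in $1000s, existing debt in $1000s].
    X_loans = np.array([[30, 20], [80, 10], [45, 40], [120, 5], [25, 30]])
    y_risky = np.array([1, 0, 1, 0, 1])
    clf = LogisticRegression().fit(X_loans, y_risky)
    print("class label for (60, 15):", clf.predict([[60, 15]])[0])

    # Prediction: estimate spending (a continuous value) from income.
    X_income = np.array([[30], [50], [70], [90], [110]])
    y_spend = np.array([1.2, 2.0, 2.9, 3.8, 4.9])      # spending in $1000s
    reg = LinearRegression().fit(X_income, y_spend)
    print("predicted spending for income 60:", round(float(reg.predict([[60]])[0]), 2))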
The major issue is preparing the data for Classification and Prediction. Preparing the data involves the following
activities, such as:
1. Data Cleaning: Data cleaning involves removing the noise and treatment of missing values. The noise
is removed by applying smoothing techniques, and the problem of missing values is solved by replacing
a missing value with the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is used to
know whether any two given attributes are related.
3. Data Transformation and reduction: The data can be transformed by any of the following methods.
o Normalization: The data is transformed using normalization. Normalization involves scaling all
values for a given attribute to make them fall within a small specified range. Normalization is
used when the neural networks or the methods involving measurements are used in the
learning step.
o Generalization: The data can also be transformed by generalizing it to the higher concept. For
this purpose, we can use the concept hierarchies.
The criteria for comparing classification and prediction methods include the following:
o Accuracy: The accuracy of the classifier refers to its ability to predict the class label correctly, and the accuracy of the predictor refers to how well a given predictor can estimate the unknown value.
o Speed: The speed of the method depends on the computational cost of generating and using the
classifier or predictor.
o Robustness: Robustness is the ability to make correct predictions or classifications. In the context of data mining, robustness is the ability of the classifier or predictor to make correct predictions even from noisy data or data with missing values.
o Scalability: Scalability refers to an increase or decrease in the performance of the classifier or predictor
based on the given data.
o Interpretability: Interpretability is how readily we can understand the reasoning behind predictions or
classification made by the predictor or classifier.
Classification vs. Prediction:
Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. Prediction is the process of identifying the missing or unavailable numerical value for a new observation.
In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy depends on how well a given predictor can guess the value of the predicted attribute for new data.
In classification, the model can be known as the classifier. In prediction, the model can be known as the predictor.
In classification, a model or classifier is constructed to find the categorical labels. In prediction, a model or predictor is constructed that predicts a continuous-valued function or ordered value.
For example, grouping patients based on their medical records can be considered classification, whereas predicting the correct treatment for a particular disease for a person can be thought of as prediction.
Bayesian network:
A Bayesian Network is a Probabilistic Graphical Model (PGM) used to compute uncertainties using the concept of probability. Generally known as Belief Networks, Bayesian Networks represent uncertainty using Directed Acyclic Graphs (DAGs).
A Directed Acyclic Graph is used to show a Bayesian Network, and like some other statistical graph, a DAG
consists of a set of nodes and links, where the links signify the connection between the nodes.
The nodes here represent random variables, and the edges define the relationship between these variables.
A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network.
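A Bayesian network can be sketched by storing one conditional probability table per node and applying the chain rule over the DAG. The example below uses the familiar Rain/Sprinkler/GrassWet structure with made-up probabilities (none of these numbers come from the text) and answers a query by brute-force enumeration over the hidden variable.

    # Hedged sketch: illustrative CPTs for a three-node DAG Rain -> Sprinkler -> GrassWet.
    P_rain = {True: 0.2, False: 0.8}                    # P(Rain)
    P_sprinkler = {True: {True: 0.01, False: 0.99},     # P(Sprinkler | Rain)
                   False: {True: 0.40, False: 0.60}}
    P_wet = {(True, True): 0.99, (True, False): 0.90,   # P(GrassWet | Sprinkler, Rain)
             (False, True): 0.80, (False, False): 0.00}

    def joint(rain, sprinkler, wet):
        """Chain rule over the DAG: P(R) * P(S|R) * P(W|S,R)."""
        p_w = P_wet[(sprinkler, rain)] if wet else 1 - P_wet[(sprinkler, rain)]
        return P_rain[rain] * P_sprinkler[rain][sprinkler] * p_w

    # P(Rain | GrassWet) by enumerating the hidden Sprinkler variable.
    num = sum(joint(True, s, True) for s in (True, False))
    den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
    print("P(Rain | GrassWet) =", round(num / den, 3))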
Instance-based learning
The machine learning systems categorized as instance-based learning are systems that learn the training examples by heart and then generalize to new instances based on some similarity measure. It is called instance-based because it builds its hypotheses from the training instances themselves. It is also known as memory-based learning or lazy learning (because processing is delayed until a new instance must be classified). The time complexity of this approach depends on the size of the training data: whenever a new query instance is encountered, the previously stored data is examined and a target function value is assigned to the new instance. The worst-case time complexity of this algorithm is O(n), where n is the number of training instances. For example, if we were to create a spam filter with an instance-based learning algorithm, instead of just flagging emails that are already marked as spam, our spam filter would also flag emails that are very similar to them. This requires a measure of resemblance between two emails; a similarity measure could be a shared sender, the repetitive use of the same keywords, or something else.
Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the target function.
2. This algorithm can adapt to new data easily, one which is collected as we go .
Disadvantages:
1. Classification costs are high
2. Large amount of memory required to store the data, and each query involves starting the identification of a
local model from scratch.
Some of the instance-based learning algorithms are :
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning
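Of the algorithms listed above, k-nearest neighbours is the simplest to demonstrate. The sketch below uses made-up 2-D points and shows the lazy-learning behaviour: fit() merely stores the training instances, and the similarity comparison happens only when predict() is called.

    from sklearn.neighbors import KNeighborsClassifier

    X_train = [[1, 1], [1, 2], [2, 1],      # class 0 cluster
               [6, 6], [7, 6], [6, 7]]      # class 1 cluster
    y_train = [0, 0, 0, 1, 1, 1]

    knn = KNeighborsClassifier(n_neighbors=3)   # lazy learner: training data is just stored
    knn.fit(X_train, y_train)

    print(knn.predict([[2, 2], [6, 5]]))        # each query is compared to the stored instances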
Advantages of Generalized Linear Models (GLMs):
1. Flexibility: GLMs can model a wide range of relationships between the response and predictor variables, including linear, logistic, Poisson, and exponential relationships.
2. Model interpretability: GLMs provide a clear interpretation of the relationship between the response and
predictor variables, as well as the effect of each predictor on the response.
3. Robustness: GLMs can be robust to outliers and other anomalies in the data, as they allow for non-normal
distributions of the response variable.
4. Scalability: GLMs can be used for large datasets and complex models, as they have efficient algorithms for
model fitting and prediction.
5. Ease of use: GLMs are relatively easy to understand and use, especially compared to more complex models
such as neural networks or decision trees.
6. Hypothesis testing: GLMs allow for hypothesis testing and statistical inference, which can be useful in many
applications where it’s important to understand the significance of relationships between variables.
7. Regularization: GLMs can be regularized to reduce overfitting and improve model performance, using
techniques such as Lasso, Ridge, or Elastic Net regression.
8. Model comparison: GLMs can be compared using information criteria such as AIC or BIC, which can help to
choose the best model among a set of alternatives.
Limitations of GLMs:
Assumptions: GLMs make certain assumptions about the distribution of the response variable, and these assumptions may not always hold.
Model specification: Specifying the correct underlying statistical distribution for a GLM can be challenging, and
incorrect specification can result in biased or incorrect predictions.
Overfitting: Like other regression models, GLMs can be prone to overfitting if the model is too complex or has
too many predictor variables.
Overall, GLMs are a powerful and flexible tool for modeling relationships between response and predictor
variables, and are widely used in many fields, including finance, marketing, and epidemiology. If you’re
interested in learning more about GLMs, you might consider reading an introductory textbook on regression analysis, such as “An Introduction to Generalized Linear Models” by Annette J. Dobson and Adrian G. Barnett.
Limited flexibility: While GLMs are more flexible than traditional linear regression models, they may still not be
able to capture more complex relationships between variables, such as interactions or non-linear effects.
Data requirements: GLMs require a sufficient amount of data to estimate model parameters and make
accurate predictions, and may not perform well with small or imbalanced datasets.
Model assumptions: GLMs rely on certain assumptions about the distribution of the response variable and the
relationship between the response and predictor variables, and violation of these assumptions can lead to
biased or incorrect predictions.
Linear Regression Model: To show that Linear Regression is a special case of the GLMs, it is considered that the output labels are continuous values and therefore follow a Gaussian distribution. So, we have:
y | x; θ ~ N(μ, σ²)
hθ(x) = E[y | x; θ] = μ
η = θᵀx
The first equation above corresponds to the first assumption, that the output labels (or target variables) should be members of an exponential family; the second equation corresponds to the assumption that the hypothesis equals the expected value (mean) of the distribution; and the third equation corresponds to the assumption that the natural parameter and the input features follow a linear relationship.
Logistic Regression Model: To show that Logistic Regression is a special case of the GLMs, it is considered that the output labels are binary valued and therefore follow a Bernoulli distribution. So, we have:
y | x; θ ~ Bernoulli(φ)
hθ(x) = E[y | x; θ] = φ = 1 / (1 + e^(-η))
η = θᵀx
The function that maps the natural parameter to the canonical parameter is known as the canonical response function (here, the sigmoid/logistic function), and its inverse is known as the canonical link function.
Therefore, by using the three assumptions mentioned before, it can be shown that Logistic and Linear Regression belong to a much larger family of models known as GLMs.
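To make the GLM view of logistic regression concrete, the sketch below fits a Bernoulli/binomial-family GLM with statsmodels on synthetic data; the true coefficients (0.5 and 2.0) are arbitrary values chosen only for the simulation.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))   # true logistic relationship
    y = rng.binomial(1, p)                    # Bernoulli-distributed labels

    X = sm.add_constant(x)                    # design matrix with an intercept column
    result = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    print(result.params)                      # estimates close to the true 0.5 and 2.0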
Clustering in Data Mining
The process of making a group of abstract objects into classes of similar objects is known as clustering.
Points to Remember:
One group is treated as a cluster of data objects
In the process of cluster analysis, the first step is to partition the set of data into groups with the help of data
similarity, and then groups are assigned to their respective labels.
The biggest advantage of clustering over classification is that it can adapt to changes and helps single out useful features that differentiate different groups.
Applications of cluster analysis :
It is widely used in many applications such as image processing, data analysis, and pattern recognition.
It helps marketers to find the distinct groups in their customer base and they can characterize their customer
groups by using purchasing patterns.
It can be used in the field of biology, by deriving animal and plant taxonomies and identifying genes with the
same capabilities.
It also helps in information discovery by classifying documents on the web.
Clustering Methods:
Clustering methods can be classified into the following categories.
1. Model-Based Method
2. Hierarchical Method
3. Constraint-Based Method
4. Grid-Based Method
5. Partitioning Method
6. Density-Based Method
Requirements of clustering in data mining:
The following are the main requirements of clustering algorithms in data mining.
Scalability – we require highly scalable clustering algorithms to work with large databases.
Ability to deal with different kinds of attributes – Algorithms should be able to work with the type of data
such as categorical, numerical, and binary data.
Discovery of clusters with arbitrary shape – The algorithm should be able to detect clusters of arbitrary shape and should not be bounded to distance measures.
Interpretability – The results should be comprehensible, usable, and interpretable.
High dimensionality – The algorithm should be able to handle high dimensional space instead of only
handling low dimensional data.
Cobweb (clustering)
COBWEB is an incremental system for hierarchical conceptual clustering. COBWEB was invented by
Professor Douglas H. Fisher, currently at Vanderbilt University.[1][2]
COBWEB incrementally organizes observations into a classification tree. Each node in a classification tree
represents a class (concept) and is labeled by a probabilistic concept that summarizes the attribute-value
distributions of objects classified under the node. This classification tree can be used to predict missing attributes
or the class of a new object.[3]
There are four basic operations COBWEB employs in building the classification tree; which operation is selected depends on the category utility of the classification achieved by applying it. The operations are merging two nodes, splitting a node, inserting a new node, and passing the object down the hierarchy, as shown in the pseudocode below:
COBWEB(root, record):
  Input: A COBWEB node root, an instance to insert record
  if root has no children then
    children := {copy(root)}
    newcategory(record) \\ adds child with record’s feature values.
    insert(record, root) \\ update root’s statistics
  else
    insert(record, root)
    for child in root’s children do
      calculate Category Utility for insert(record, child),
      set best1, best2 children w. best CU.
    end for
    if newcategory(record) yields best CU then
      newcategory(record)
    else if merge(best1, best2) yields best CU then
      merge(best1, best2)
      COBWEB(root, record)
    else if split(best1) yields best CU then
      split(best1)
      COBWEB(root, record)
    else
      COBWEB(best1, record)
    end if
  end
K-Means Clustering
K-Means is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group with similar properties.
It allows us to cluster the data into different groups and a convenient way to discover the categories of groups in
the unlabeled dataset on its own without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm
is to minimize the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and repeats the
process until it does not find the best clusters. The value of k should be predetermined in this algorithm.
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the particular k-
center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (These can be points other than those from the input dataset.)
Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid. If any reassignment occurs, go back to Step-4; otherwise, the model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different clusters. It
means here we will try to group these datasets into two different clusters.
o We need to choose some random k points or centroid to form the cluster. These points can be either the
points from the dataset or any other point. So, here we are selecting the below two points as k points,
which are not the part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute
it by applying some mathematics that we have studied to calculate the distance between two points. So,
we will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of the points assigned to each centroid, and the new centroids will be found as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of
finding a median line. The median will be like below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of each cluster's points, so the new centroids will be as shown in the below image:
o Having obtained the new centroids, we again draw the median line and reassign the data points. So, the image will be:
o We can see in the above image; there are no dissimilar data points on either side of the line,
which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the
concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total variations
within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
Here ∑(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point in Cluster1 and its centroid C1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such as Euclidean distance
or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges from 1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend in the plot (the point where the curve looks like the elbow of an arm) is considered the best value of K.
Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the elbow method. The
graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the given data points. If we choose the number of clusters equal to the
data points, then the value of WCSS becomes zero, and that will be the endpoint of the plot.
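The K-Means walk-through and the elbow method can both be reproduced with scikit-learn. In the sketch below the blobs are synthetic, the K range of 1 to 10 mirrors the range mentioned above, and KMeans.inertia_ is the WCSS value that the elbow method plots.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

    wcss = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        wcss.append(km.inertia_)            # within-cluster sum of squares for this K

    for k, value in zip(range(1, 11), wcss):
        print(k, round(value, 1))           # the "elbow" in these values suggests the best K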
Agglomerative clustering is one of the most common types of hierarchical clustering used to group
similar objects in clusters. Agglomerative clustering is also known as AGNES (Agglomerative Nesting). In
agglomerative clustering, each data point initially acts as an individual cluster, and at each step data objects are grouped in a bottom-up manner. Initially, each data object is in its own cluster. At each iteration, clusters are merged with other clusters until a single cluster is formed.
1. Determine the similarity between individuals and all other clusters. (Find proximity matrix).
2. Consider each data point as an individual cluster.
3. Combine similar clusters.
4. Recalculate the proximity matrix for each cluster.
5. Repeat step 3 and step 4 until you get a single cluster.
Let’s understand this concept with the help of graphical representation using a dendrogram.
With the help of the demonstration below, we can understand how the actual algorithm works. No calculations have been done here; all the proximities among the clusters are assumed.
Step 1:
Consider each alphabet (P, Q, R, S, T, V) as an individual cluster and find the distance between the individual
cluster from all other clusters.
Step 2:
Now, merge the comparable clusters into single clusters. Let's say cluster Q and cluster R are similar to each other, so we merge them in this step; similarly, S and T are merged. We are left with the clusters [(P), (QR), (ST), (V)].
Step 3:
Here, we recalculate the proximity as per the algorithm and combine the two closest clusters [(ST), (V)] together
to form new clusters as [(P), (QR), (STV)]
Step 4:
Repeat the same process. The clusters (STV) and (QR) are comparable and are combined to form a new cluster. We are now left with [(P), (QRSTV)].
Step 5:
Finally, the remaining two clusters are merged together to form a single cluster [(PQRSTV)]
Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, all the data points start in a single cluster, and in every iteration the data points that are not similar are separated from the cluster. The separated data points are treated as individual clusters. Finally, we are left with N clusters.
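An AGNES-style bottom-up clustering can be run with SciPy's hierarchy module. The points below are made up; linkage() performs the repeated merging of the closest clusters and produces the matrix from which a dendrogram is usually drawn, and fcluster() cuts the resulting tree into a chosen number of clusters.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 1], [9.2, 0.8]])

    Z = linkage(points, method="average")             # bottom-up merging of closest clusters
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
    print(labels)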
Data Mining Techniques
1. Association
Association analysis is the finding of association rules showing attribute-value conditions that occur frequently
together in a given set of data. Association analysis is widely used for a market basket or transaction data
analysis. Association rule mining is a significant and exceptionally dynamic area of data mining research. One
method of association-based classification, called associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the association rules discovered.
2. Classification
Classification is the processing of finding a set of models (or functions) that describe and distinguish data
classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class
label is unknown. The determined model depends on the investigation of a set of training data information (i.e.
data objects whose class label is known). The derived model may be represented in various forms, such as
classification (if – then) rules, decision trees, and neural networks. Data Mining has a different type of
classifier:
Decision Tree
SVM(Support Vector Machine)
Generalized Linear Models
Bayesian classification:
Classification by Backpropagation
K-NN Classifier
Rule-Based Classification
Frequent-Pattern Based Classification
Rough set theory
Fuzzy Logic
3. Prediction
Data prediction is a two-step process, similar to that of data classification. However, for prediction, we do not use the phrase "class label attribute" because the attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). The attribute can be
referred to simply as the predicted attribute. Prediction can be viewed as the construction and use of a model
to assess the class of an unlabeled object, or to assess the value or value ranges of an attribute that a given
object is likely to have.
4. Clustering
Unlike classification and prediction, which analyze class-labeled data objects or attributes, clustering analyzes
data objects without consulting an identified class label. In general, the class labels do not exist in the training
data simply because they are not known to begin with. Clustering can be used to generate these labels. The
objects are clustered based on the principle of maximizing the intra-class similarity and minimizing the
interclass similarity. That is, clusters of objects are created so that objects inside a cluster have high similarity
in contrast with each other, but are different objects in other clusters. Each Cluster that is generated can be
seen as a class of objects, from which rules can be inferred. Clustering can also facilitate classification
formation, that is, the organization of observations into a hierarchy of classes that group similar events
together.
5. Regression
Regression can be defined as a statistical modeling method in which previously obtained data is used to predict a continuous quantity for new observations. This classifier is also known as the Continuous Value
Classifier. There are two types of regression models: Linear regression and multiple linear regression models.
6. Artificial Neural Network
An artificial neural network (ANN), also referred to simply as a “Neural Network” (NN), is a computational model inspired by biological neural networks. It consists of an interconnected collection of artificial neurons.
A neural network is a set of connected input/output units where each connection has a weight associated with
it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input samples. Neural network learning is also denoted as connectionist learning due to the
connections between units. Neural networks involve long training times and are therefore more appropriate for
applications where this is feasible. They require a number of parameters that are typically best determined
empirically, such as the network topology or “structure”. Neural networks have been criticized for their poor interpretability, since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These features initially made neural networks less desirable for data mining.
The advantages of neural networks, however, include their high tolerance to noisy data as well as their ability to classify patterns on which they have not been trained. In addition, several algorithms have recently been developed for the extraction of rules from trained neural networks. These factors contribute to the usefulness of neural networks for classification in data mining.
An artificial neural network is an adaptive system that changes its structure based on the information that flows through the network during a learning phase. The ANN relies on the principle of learning by example. There are two classical types of neural networks: the perceptron and the multilayer perceptron.
7. Outlier Detection
A database may contain data objects that do not comply with the general behavior or model of the data. These
data objects are Outliers. The investigation of OUTLIER data is known as OUTLIER MINING. An outlier may
be detected using statistical tests which assume a distribution or probability model for the data, or using
distance measures where objects having a small fraction of “close” neighbors in space are considered outliers.
Rather than utilizing statistical or distance measures, deviation-based techniques identify exceptions/outliers by inspecting differences in the principal attributes of items in a group.
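Both detection styles mentioned above can be sketched in a few lines: a simple statistical z-score test, which assumes a roughly normal distribution, and a distance/density-based Local Outlier Factor from scikit-learn. The numbers are illustrative only.

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    values = np.array([10, 11, 9, 10, 12, 11, 10, 45])      # 45 looks anomalous
    z = (values - values.mean()) / values.std()
    print("z-score outliers:", values[np.abs(z) > 2])        # statistical test

    X = values.reshape(-1, 1)
    lof = LocalOutlierFactor(n_neighbors=3)
    print("LOF labels (-1 = outlier):", lof.fit_predict(X))  # distance/density-based test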
8. Genetic Algorithm
Genetic algorithms are adaptive heuristic search algorithms that belong to the larger part of evolutionary
algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. These are intelligent
exploitation of random search provided with historical data to direct the search into the region of better
performance in solution space. They are commonly used to generate high-quality solutions for optimization
problems and search problems. Genetic algorithms simulate the process of natural selection which means
those species who can adapt to changes in their environment are able to survive and reproduce and go to the
next generation. In simple words, they simulate “survival of the fittest” among individuals of consecutive
generations for solving a problem. Each generation consist of a population of individuals and each individual
represents a point in search space and possible solution. Each individual is represented as a string of
character/integer/float/bits. This string is analogous to the Chromosome.
Text mining is a powerful tool that offers many benefits across a wide range of industries. The following are some of its main application areas:
Digital Library
Academic and Research Field
Life Science
Social-Media
Business Intelligence
Issues in Text Mining
1. The efficiency and effectiveness of decision-making.
2. The uncertain problem can come at an intermediate stage of text mining. In the pre-processing stage, different
rules and guidelines are characterized to normalize the text which makes the text-mining process efficient.
Before applying pattern analysis to the document, there is a need to change over unstructured data into a
moderate structure.
3. Sometimes original message or meaning can be changed due to alteration.
4. Another issue in text mining is many algorithms and techniques support multi-language text. It may create
ambiguity in text meaning. This problem can lead to false-positive results.
5. The use of synonyms, polysemy, and antonyms in document text creates problems for text mining tools that treat them all in the same way. It is difficult to categorize such kinds of text/words.
Advantages of Text Mining
1. Large Amounts of Data: Text mining allows organizations to extract insights from large amounts of
unstructured text data. This can include customer feedback, social media posts, and news articles.
2. Variety of Applications: Text mining has a wide range of applications, including sentiment analysis, named
entity recognition, and topic modeling. This makes it a versatile tool for organizations to gain insights from
unstructured text data.
3. Improved Decision Making: Text mining can be used to extract insights from unstructured text data, which
can be used to make data-driven decisions.
4. Cost-effective: Text mining can be a cost-effective way to extract insights from unstructured text data, as it
eliminates the need for manual data entry.
5. Broader benefits: Cost reductions, productivity increases, the creation of novel services, and new business models are just a few of the larger economic advantages of text mining.
Disadvantages of Text Mining
1. Complexity: Text mining can be a complex process that requires advanced skills in natural language
processing and machine learning.
2. Quality of Data: The quality of text data can vary, which can affect the accuracy of the insights extracted from
text mining.
3. High Computational Cost: Text mining requires high computational resources, and it may be difficult for
smaller organizations to afford the technology.
4. Limited to Text Data: Text mining is limited to extracting insights from unstructured text data and cannot be
used with other data types.
5. Noise in text mining results: Text mining of documents may produce mistakes; it is possible to find false
associations or to miss real ones. In most situations, if the noise (error rate) is sufficiently low, the benefits of
automation outweigh the risk of making more mistakes than a human reader would.
6. Lack of transparency: Text mining is frequently viewed as a mysterious process where large corpora of text
documents are input and new information is produced. Text mining is in fact opaque when researchers lack
the technical know-how or expertise to comprehend how it operates, or when they lack access to corpora or
text mining tools.
Text Classification in Data Mining
Topic Classification is a common text classification task in which a text document is assigned to one of a
predefined set of topics. Many topic classification problems rely heavily on textual keywords for categorization.
Sentiment Analysis is another common type of text classification, with the goal of determining the polarity of text
content: the type of opinion it expresses. This can be expressed as a binary like/dislike rating or as a more
granular set of options, such as a star rating from 1 to 5.
Sentiment Analysis can be used to determine whether or not people liked the Black Panther movie by analyzing
Twitter posts or extrapolating the general public’s opinion of a new brand of Nike shoes based on Walmart
reviews.
How does Text Classification in Data Mining Work?
The process of categorizing text into organized groups is known as text classification, also known as text tagging
or text categorization. Text Classification in Data Mining can automatically analyze text and assign a set of pre-
defined tags or categories based on its content using Natural Language Processing (NLP).
Text Classification in Data Mining is becoming an increasingly important part of the business because it enables
easy data insights and the automation of business processes.
The following are some of the most common examples and use cases in Text Classification in Data Mining for
Automatic Text Classification:
Sentiment Analysis for determining whether a given text is speaking positively or negatively about a
particular subject (e.g. for brand monitoring purposes).
The task of determining the theme or topic of a piece of text is known as topic detection (e.g. knowing if
a product review is about Ease of Use, Customer Support, or Pricing when analyzing customer
feedback).
Language detection refers to the process of determining the language of a given text (e.g. knowing if an
incoming support ticket is written in English or Spanish for automatically routing tickets to the
appropriate team).
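As a deliberately simple illustration (not from these notes), the sketch below assigns one of the pre-defined tags positive, negative, or neutral by counting keywords; production systems would instead use NLP and machine-learning models, but the input/output shape of the task is the same.

# Toy keyword-based sentiment tagger: maps a text to one pre-defined tag.
POSITIVE = {"good", "great", "love", "excellent", "amazing"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def tag_sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(tag_sentiment("I love these shoes, they are amazing"))  # positive
print(tag_sentiment("Terrible quality and poor fit"))          # negative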
The most important step in solving any Supervised Machine Learning problem is gathering data. Your Text
Classifier is only as good as the dataset it is trained on.
If you don’t have a specific problem in mind and are simply interested in learning about Text Classification or Text
Classification in Data Mining in general, there are a plethora of open-source datasets available. If, on the other
hand, you are attempting to solve a specific problem, you will need to gather the necessary data.
Text Classification in Data Mining is not just a buzzword. Many organizations, such as Twitter and the New York
Times, provide public APIs for accessing their data, and you might be able to use these to gather data for the
problem you are trying to solve.
Here are some things to keep in mind when you gather data for Text Classification in Data Mining:
Before you use a Public API, make sure you understand its limitations. Some APIs, for example, limit the
number of queries you can make per second.
The more training examples (referred to as samples throughout this guide), the better. This will help your
model generalize more effectively.
Make certain that the number of samples for each class or topic is not excessively imbalanced. That is,
each class should have a comparable number of samples.
Make certain that your samples adequately cover the space of possible inputs, rather than just the
common cases.
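A quick way to check class balance is to count samples per label before training; the label list below is a made-up placeholder for your own dataset.

from collections import Counter

labels = ["positive", "negative", "positive", "neutral", "positive", "negative"]  # placeholder labels
counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    print(f"{cls}: {n} samples ({n / total:.1%})")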
Building and training a model is only one step in the process. Understanding the characteristics of your data
ahead of time will allow you to build a better model. This could simply mean achieving greater accuracy; it could
also mean requiring less data or fewer computational resources for training.
After loading the data, it’s a good idea to run some checks on it: select a few samples and manually check if they
match your expectations. Print a few random samples, for example, to see if the sentiment label corresponds to
the sentiment of the review.
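A minimal sketch of such a spot check, assuming hypothetical parallel lists texts and labels loaded from your own dataset:

import random

texts = ["Great movie, loved it", "Worst purchase ever", "It was okay, nothing special"]  # placeholder data
labels = ["positive", "negative", "neutral"]

# Print a few random samples and eyeball whether the label matches the text.
for i in random.sample(range(len(texts)), k=min(3, len(texts))):
    print(f"label={labels[i]!r}  text={texts[i]!r}")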
Step 2.5: Select a Model
We have assembled our dataset and gained insights into the key characteristics of our data at this point.
Following that, we should consider which classification model to employ based on the metrics gathered in Step 2.
This includes questions like, “How do we present the text data to an algorithm that expects numeric input?”
(this is known as data preprocessing and vectorization), “What type of model should we use?“, and “What
configuration parameters should we use for our model?” and so on.
We now have access to a wide range of data preprocessing and model configuration options as a result of
decades of research. The availability of a very large array of viable options to choose from, on the other hand,
greatly increases the complexity and scope of the specific problem at hand.
Given that the best options may not be obvious, a naive solution would be to try every possible option
exhaustively, pruning the less promising choices through intuition and experimentation.
Before we can feed our data to a model, it must be transformed into a format that the model can understand.
For Text Classification in Data Mining, there are two main concerns. First, the data samples that we have
gathered may be in a particular order, and we don't want information related to sampling order to influence the relationship between texts and
labels. For example, if a dataset is sorted by class and then divided into training/validation sets, the
training/validation sets will not be representative of the overall data distribution.
If your data has already been divided into training and validation sets, make sure to transform your validation
data in the same way you did your training data. If you don’t already have separate training and validation sets,
you can split the samples after shuffling; typically, 80% of the samples are used for training and 20%
for validation.
Second, Machine Learning Algorithms are fed numerical inputs. This means we’ll have to turn the texts into
numerical vectors. This procedure consists of two steps:
Tokenization: Break the texts down into words or smaller sub-texts to allow for better generalization of
the relationship between the texts and the labels. This determines the dataset’s “vocabulary” (set of
unique tokens present in the data).
Vectorization: It’s the process of defining a good numerical measure to characterize these texts.
In this section, we will work on developing, training, and assessing our model. In Step 3, we decided whether to
use an n-gram model or a sequence model based on our S/W ratio (the ratio of the number of samples to the
number of words per sample). It is now time to write and train our classification model; TensorFlow and the
tf.keras API will be used for this.
Building Machine Learning Models with Keras is as simple as putting together layers of data-processing building
blocks, similar to how we would put together Lego bricks. These layers allow us to specify the order in which we
want to perform transformations on our input. Because our learning algorithm accepts a single text input and
produces a single classification, we can use the Sequential model API to build a linear stack of layers.
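A minimal tf.keras Sequential stack for binary classification over n-gram/TF-IDF features might look like the sketch below; the layer sizes, dropout rate, and vocabulary size are illustrative assumptions, not values prescribed by the notes.

import tensorflow as tf

input_dim = 20000  # hypothetical vocabulary size (should match your vectorizer)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(input_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),      # hidden layer
    tf.keras.layers.Dropout(0.2),                      # regularization
    tf.keras.layers.Dense(1, activation="sigmoid"),    # single output: positive vs. negative
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()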
We need to train the model now that we’ve built the model architecture. Training entails making a prediction
based on the current state of the model, calculating how inaccurate the prediction is, and updating the network’s
weights or parameters to minimize this error and improve the model’s prediction. This process is repeated until
our model has converged and can no longer learn.
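Continuing the sketch above, training with an EarlyStopping callback approximates "train until the model can no longer learn"; the random arrays simply stand in for real vectorized training and validation data, and model and input_dim are taken from the previous sketch.

import numpy as np
import tensorflow as tf

# Placeholder data: replace with your vectorized texts and real labels.
x_train = np.random.rand(200, input_dim).astype("float32")
y_train = np.random.randint(0, 2, size=(200,)).astype("float32")
x_val = np.random.rand(50, input_dim).astype("float32")
y_val = np.random.randint(0, 2, size=(50,)).astype("float32")

# Stop training once the validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=2, restore_best_weights=True)

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=20, batch_size=32,
                    callbacks=[early_stop], verbose=2)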
For defining and training the model, we had to select a number of hyperparameters. We relied on our instincts,
examples, and best practice recommendations. However, our initial selection of hyperparameter values may not
produce the best results. It merely provides us with a good starting point for training. Every problem is unique,
and fine-tuning these hyperparameters will aid in refining our model to better represent the specifics of the
problem at hand.
Let’s look at some of the hyperparameters we used and what tuning them entails:
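The notes do not enumerate the exact hyperparameters, but for a Keras text classifier they typically include the learning rate, the number of layers and units per layer, the dropout rate, and the number of training epochs. The sketch below shows one naive way to tune two of them with a small grid search; it assumes the x_train/y_train/x_val/y_val arrays and input_dim from the previous sketches.

import tensorflow as tf

best_acc, best_config = 0.0, None
for learning_rate in (1e-3, 1e-4):
    for dropout_rate in (0.2, 0.4):
        # Rebuild the model with the candidate hyperparameter values.
        candidate = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(input_dim,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dropout(dropout_rate),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        candidate.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                          loss="binary_crossentropy", metrics=["accuracy"])
        candidate.fit(x_train, y_train, epochs=3, batch_size=32, verbose=0)
        _, acc = candidate.evaluate(x_val, y_val, verbose=0)
        if acc > best_acc:
            best_acc, best_config = acc, (learning_rate, dropout_rate)

print("Best validation accuracy:", best_acc, "with (learning rate, dropout):", best_config)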
When deploying your model, please keep the following points in mind:
Check that your production data is distributed in the same way as your training and evaluation data.
Re-evaluate on a regular basis by gathering more training data.
Retrain your model if your data distribution changes.
Text Classification in Data Mining also offers several benefits:
Text Classification in Data Mining provides an accurate representation of the language and how
meaningful words are used in context.
Because Text Classification in Data Mining can work at a higher level of abstraction, it makes it easier to
write simpler rules.
Text Classification in Data Mining uses the fundamental features of semantic technology to understand
the meaning of words in context. Because semantic technology allows words to be understood in their
proper context, this provides superior precision and recall.
Documents that do not “fit” into a specific category are identified and automatically separated once the
system is deployed, and the system administrator can fully understand why they were not classified.
Web Mining
Web mining can broadly be seen as the application of adapted data mining techniques to the web, whereas data
mining is defined as the application of algorithms to discover patterns in mostly structured data, embedded into a
knowledge discovery process. A distinctive property of web mining is that it deals with a variety of data types. The
web has multiple aspects that yield different approaches for the mining process: web pages consist of text, web
pages are linked via hyperlinks, and user activity can be monitored via web server logs. These three features lead
to the differentiation of three areas: web content mining, web structure mining, and web usage mining.
Web Content Mining:
Web content mining can be used to extract useful data, information, and knowledge from web page
content. In web content mining, each web page is considered an individual document. One can take
advantage of the semi-structured nature of web pages, as HTML provides information not only about the
layout but also about the logical structure. The primary task of content mining is data extraction, where
structured data is extracted from unstructured websites. The objective is to facilitate data aggregation
over various web sites by using the extracted structured data. Web content mining can also be used to
distinguish topics on the web; for example, if a user searches for a specific task on a search engine, the
user will get a list of suggestions.
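As a small, self-contained illustration of content extraction (not from the notes), the sketch below pulls the title and link targets out of a made-up HTML page using only Python's standard-library HTMLParser.

from html.parser import HTMLParser

class LinkAndTitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._current_tag = None
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        self._current_tag = tag
        if tag == "a":
            # attrs is a list of (name, value) pairs; keep the href if present.
            self.links.append(dict(attrs).get("href", ""))

    def handle_data(self, data):
        if self._current_tag == "title":
            self.title += data.strip()

    def handle_endtag(self, tag):
        self._current_tag = None

# Made-up page for illustration only.
page = """<html><head><title>Product Reviews</title></head>
<body><a href="/review/1">Great shoes</a><a href="/review/2">Poor fit</a></body></html>"""

extractor = LinkAndTitleExtractor()
extractor.feed(page)
print("Title:", extractor.title)
print("Links:", extractor.links)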
Web Structure Mining:
Web structure mining is used to discover the link structure of hyperlinks, that is, how web pages are
connected to each other either directly or through a network of links. In web structure mining, the web is
considered a directed graph, with the web pages as vertices connected by hyperlinks. The most
important application in this regard is the Google search engine, which estimates the ranking of its
results primarily with the PageRank algorithm; PageRank considers a page to be highly relevant when it
is frequently linked to by other highly relevant pages. Structure and content mining methodologies are
usually combined; for example, web structure mining can help organizations examine the link network
between two commercial sites.
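A minimal PageRank sketch (simplified power iteration over a made-up four-page link graph) illustrates the idea that a page is ranked highly when highly ranked pages link to it; the damping factor 0.85 is the conventionally used value.

def pagerank(links, damping=0.85, iterations=50):
    # links maps each page to the set of pages it links to.
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # A page receives rank from every page that links to it,
            # split evenly among that page's outgoing links.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / len(pages) + damping * incoming
        rank = new_rank
    return rank

# Hypothetical link structure for illustration.
links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))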
Web Usage Mining:
Web usage mining is used to extract useful data, information, and knowledge from weblog records, and it
assists in recognizing user access patterns for web pages. In mining the usage of web resources, one
examines the records of requests made by visitors to a website, which are often collected as web server
logs. While the content and structure of the collection of web pages reflect the intentions of the pages'
authors, the individual requests show how consumers actually use these pages. Web usage mining may
therefore reveal relationships that were not intended by the creator of the pages.
Some of the methods to identify and analyze the web usage patterns are given below:
Session analysis of the preprocessed data, which incorporates visitor records, days, times, sessions, etc.;
this information can be used to analyze visitor behavior.
A report produced after this analysis, containing details of frequently visited web pages and common
entry and exit points.
OLAP performed on different parts of the log-related data over a specific period.
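As a tiny illustration of usage analysis (not from the notes), the sketch below groups made-up web server log entries by visitor IP and reconstructs each visitor's click path.

from collections import defaultdict

# Made-up log entries of the form "ip timestamp path".
log_lines = [
    "10.0.0.1 2024-01-01T10:00:00 /home",
    "10.0.0.1 2024-01-01T10:00:42 /products",
    "10.0.0.2 2024-01-01T10:01:10 /home",
    "10.0.0.1 2024-01-01T10:02:05 /checkout",
]

visits_per_visitor = defaultdict(list)
for line in log_lines:
    ip, timestamp, path = line.split()
    visits_per_visitor[ip].append(path)

for ip, pages in visits_per_visitor.items():
    print(f"{ip}: {len(pages)} requests -> {' -> '.join(pages)}")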
Challenges in Web Mining
o Complexity of web pages: Web pages do not have a unifying structure and are far more complex than
traditional text documents. The digital library of the web contains an enormous number of documents, and
these libraries are not organized according to any particular order.
o The web is dynamic: The data on the internet is updated very quickly, for example news, weather, shopping,
financial news, sports, and so on.
o Diversity of users: The client network on the web is expanding rapidly. These clients have different interests,
backgrounds, and usage purposes; there are over a hundred million workstations connected to the internet,
and the number is still increasing tremendously.
o Relevancy of data: A specific person is generally interested in only a small portion of the web, while the rest
of the web contains data that is unfamiliar to the user and may lead to unwanted results.
o Size of the web: The size of the web is tremendous and rapidly increasing, and it appears to be too huge for
data warehousing and data mining.
Web mining has an extensive application because of various uses of the web. The list of some applications of
web mining is given below.