+
Data Mining
Kusum Lata, Assistant Professor, DTU
+
What is data mining?
Data mining is about mining data. Just as gold
miners sieve through a lot of material to find the
precious metal, in data mining we sieve through a
lot of data to find useful information.
Data mining is the art and science of extracting
hidden information from large datasets.
Due to the huge amount of data created by normal
business operations and the competitive business
environment, there is a demand to utilize this data to
make effective decisions.
+
Data query vs. Data mining?
Data query: is about searching in data when we
know exactly what we are looking for.
E.g., a list of customers who used a MasterCard for a
purchase.
Data mining:
Find all the items which are frequently purchased
with bread.
Construct some rules about the lifestyle of the residents
of a locality that may reduce the risk of developing
diabetes at an early age.
+
Decision support progress to data
mining
+
Knowledge discovery phases
+
Knowledge discovery phases
Step 1: Define Business Objectives.
State your objectives. Are you looking to improve
your direct marketing campaigns? Do you want to
detect fraud in credit card usage? Are you looking
for associations between products that sell together?
Step 2: Prepare Data. This step consists of data
selection, pre-processing of data, and data
transformation. Select the data to be extracted from
the data warehouse. Pre-processing is meant to
improve the quality of selected data. When you
select from the data warehouse, it is assumed
that the data is already cleansed.
+
Knowledge discovery phases
Step 3: Perform Data Mining. Obviously, this is the crucial
step. Apply the selected algorithm to the prepared data.
The output from this step is a set of relationships or
patterns.
Step 4: Evaluate Results. There are potentially many
patterns or relationships. In this step, you examine all the
resulting patterns. You will apply a filtering mechanism
and select only the promising patterns to be presented
and applied.
+
Knowledge discovery phases
Step 5: Present Discoveries. Presentation of
knowledge discoveries may be in the form of visual
navigation, charts, graphs, or free-form texts.
Presentation also includes storing of interesting
discoveries in the knowledge base for repeated use.
Step 6: Incorporate Usage of Discoveries. This step is
for using the results to create actionable items in the
business. You assemble the results of the discovery
in the best way so that they can be exploited to
improve the business.
+
Data mining tasks
Descriptive
Descriptive data mining characterizes the general
properties of the data in the database.
Predictive
Predictive data mining tasks perform inference on
the current data in order to make predictions.
+
Classification
Classification is a form of data analysis in which a model
or classifier is constructed to predict class
(categorical) labels, such as “safe” or “risky” for
loan application data; “yes” or “no” for
marketing data; or “treatment A,” “treatment B,” or
“treatment C” for medical data.
These categories can be represented by discrete
values, where the ordering among values has no
meaning.
Predictor: The model constructed to predict a
continuous-valued function, or a numeric value.
+
How does classification work?
Data classification is a two-step process, consisting of a
learning step (where a classification model is
constructed) and a classification step (where the model is
used to predict class labels for given data).
In the learning step (or training phase), a
classification algorithm builds the classifier by analyzing,
or “learning from,” a training set made up of database
tuples and their associated class labels.
Because the class label of each training tuple is
provided, this step is also known as supervised
learning (i.e., the learning of the classifier is
“supervised” in that it is told to which class each training
tuple belongs).
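To make the two steps concrete, here is a minimal Python sketch, assuming scikit-learn is available; the tiny loan dataset, its column layout, and the attribute values are invented purely for illustration and are not from these slides.

# A minimal sketch of the two-step process, assuming scikit-learn.
# The tiny loan dataset and its columns are invented for illustration only.
from sklearn.tree import DecisionTreeClassifier

# Step 1 (learning): training tuples with known class labels ("safe"/"risky").
X_train = [[30, 30000], [45, 80000], [25, 20000], [50, 120000]]  # [age, income]
y_train = ["risky", "safe", "risky", "safe"]
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)   # the classifier "learns from" the training set

# Step 2 (classification): the learned model predicts labels for new tuples.
print(classifier.predict([[40, 60000]]))   # e.g. ['safe']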
+
Step 1- Learning
+
Step 2- Classification
+
Classification: Decision Tree
Decision tree induction is the learning of decision
trees from class-labeled training tuples. A decision
tree is a flowchart-like tree structure, where each
internal node (non leaf node) denotes a test on an
attribute, each branch represents an outcome of the
test, and each leaf node (or terminal node) holds a
class label.
Learned trees can also be re-represented as sets
of if-then rules to improve human readability.
Decision trees classify instances by sorting them
down the tree from the root to some leaf node.
+
Decision Tree Induction
A decision tree starts with the first node, the ‘root’ of
the tree, and ends with terminal nodes known as the
‘leaves’ of the tree.
During classification, the entire tree is traversed,
beginning from the root of the tree, until a leaf node
is reached.
+
Basic Methodology
The most critical aspect of a decision tree algorithm
is the attribute selection method employed at each
node of the decision tree.
There are some attributes that split the data more
purely than the other attributes.
This means that splitting on such attributes produces
subsets that are purer (more homogeneous in class)
than those produced by the other attributes.
+
Basic Methodology
Although the working of these decision tree
building algorithms differs from one algorithm to
another, all of them work on the principle of
greediness.
In this tutorial, we will explain in detail the working
and rationale behind the algorithms used for
decision tree construction, and how the attribute
that best splits the given dataset is selected.
The ID3 (Iterative Dichotomiser 3) algorithm, invented
by Ross Quinlan, is the most widely used algorithm
for decision tree construction.
+
Basic Methodology
Figure 1: A pictorial representation of a decision tree, showing the root node, decision nodes (e.g., node A), terminal nodes (e.g., nodes B and C), and a branch/sub-tree.
+
Basic Terminology
Root Node: It represents the entire population or
sample, and this further gets divided into two or
more sets.
Splitting: It is the process of dividing a node into two
or more sub-nodes.
Decision Node: When a sub-node splits into further
sub-nodes, it is called a decision node.
Leaf/Terminal Node: Nodes that do not split are
called leaf or terminal nodes.
+
Basic Terminology
Branch / Sub-Tree: A subsection of the entire tree is
called a branch or sub-tree.
Parent and Child Node: A node that is divided
into sub-nodes is called the parent node of those
sub-nodes, whereas the sub-nodes are the children
of the parent node.
For instance, in Figure 1, node A is a parent of nodes B and C.
+
ID3 Algorithm
Examples: the training examples
Target-attribute: the attribute to be predicted
Attributes: the list of predictor attributes
ID3(Examples, Target-attribute, Attributes)
  Create a Root node for the tree.
  If all Examples are positive, return the single-node tree Root with label = +.
  If all Examples are negative, return the single-node tree Root with label = -.
  If Attributes is empty, return the single-node tree Root with label = the most common value of Target-attribute in Examples.
  Otherwise:
    Let A be the attribute in Attributes that best classifies Examples (highest information gain).
    Set the decision attribute for Root to A.
    For each possible value v of A:
      Add a new branch below Root for the test A = v.
      Let Examples_v be the subset of Examples that have value v for A.
      If Examples_v is empty, add below this branch a leaf node with label = the most common value of Target-attribute in Examples.
      Otherwise, add below this branch the subtree ID3(Examples_v, Target-attribute, Attributes - {A}).
  Return Root.
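The pseudocode above can be turned into a short program. The following is a minimal sketch, assuming a dict-per-tuple data representation; the helper names (entropy, info_gain, id3) are my own and not taken from the slides.

# A minimal ID3 sketch in Python for categorical attributes, for illustration.
import math
from collections import Counter

def entropy(labels):
    """Info(D) = sum over classes of -(p_i * log2(p_i))."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Information gain of splitting `rows` (list of dicts) on `attr`."""
    gain = entropy([r[target] for r in rows])
    n = len(rows)
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def id3(rows, target, attributes):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # all examples in one class
        return labels[0]
    if not attributes:                        # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, target, remaining)
    return tree

Run on the play-tennis data used later in these slides, this sketch should reproduce the Outlook / Humidity / Wind tree shown on the later "Decision Trees" slide.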
+
Splitting Based on Nominal
Attributes
Multi-way split: Use as many partitions as distinct values.
For example, CarType splits into Family, Sports, and Luxury.
Binary split: Divides the values into two subsets; we need to
find the optimal partitioning.
For example, CarType splits into {Sports, Luxury} vs. {Family},
or {Family, Luxury} vs. {Sports}.
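Since the binary split requires searching over candidate groupings, the small sketch below (illustrative only; the function name is my own) enumerates the distinct two-way partitions of a nominal attribute's values.

# Enumerate candidate binary splits of a nominal attribute such as CarType.
from itertools import combinations

def binary_splits(values):
    """Yield (subset, complement) pairs, excluding the empty and full subsets."""
    values = list(values)
    for r in range(1, len(values)):
        for subset in combinations(values, r):
            left = set(subset)
            right = set(values) - left
            # avoid emitting each split twice (both ({A},{B,C}) and ({B,C},{A}))
            if min(left) == min(values):
                yield left, right

for left, right in binary_splits(["Family", "Sports", "Luxury"]):
    print(left, "vs", right)
# e.g. {'Family'} vs {'Luxury', 'Sports'}, {'Family', 'Sports'} vs {'Luxury'}, ...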
+
Splitting Based on Ordinal
Attributes
Multi-way split: Use as many partitions as distinct
values.
For example, Size splits into Small, Medium, and Large.
Binary split: Divides the values into two subsets; we need to
find the optimal partitioning.
For example, Size splits into {Small, Medium} vs. {Large},
or {Medium, Large} vs. {Small}.
+
Splitting Based on Ordinal
Attributes
They can be grouped as long as the grouping does
not violate the order property of the attribute values.
What about the split {Small, Large} vs. {Medium}?
It is not allowed, as it disturbs the order.
+
Splitting based on Continuous
Attributes
Different ways of handling:
Discretization to form an ordinal categorical attribute.
Binary decision: (A < v) or (A ≥ v);
consider all possible splits and find the best cut;
this can be more compute intensive.
Figure: examples of a binary split and a multiway split on a continuous attribute.
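As a hedged illustration of the "find the best cut" idea (the toy data values and function names below are invented, not from the slides), one can try the midpoints between consecutive sorted values and keep the threshold with the lowest weighted entropy.

# Finding the best binary cut (A < v) vs (A >= v) for a continuous attribute.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Try midpoints between consecutive sorted values; return (cut, weighted entropy)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no cut between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v < cut]
        right = [lab for v, lab in pairs if v >= cut]
        w = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if w < best[1]:
            best = (cut, w)
    return best

# e.g. a hypothetical continuous attribute with yes/no labels
print(best_cut([20, 35, 40, 60, 85], ["no", "no", "yes", "yes", "yes"]))
# about (37.5, 0.0): this cut cleanly separates the two classes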
+
Which Attribute Is the Best Classifier?
We would like to select the attribute that is most
useful for classifying examples.
We will define a statistical property, called
information gain, that measures how well a
given attribute separates the training
examples according to their target classification
+
Entropy
Entropy is a measure from information theory that is
used in the ID3 algorithm and, more generally, in
decision tree construction.
The entropy of a dataset measures the impurity of the
dataset.
Informally, entropy can be thought of as a measure of
how disordered the dataset is.
+
Entropy
It has been shown that there is a relationship
between entropy and information.
That is, the higher the uncertainty or entropy of some
data, the more information is required to
completely describe that data.
In building a decision tree, the aim is to decrease
the entropy of the dataset until we reach leaf nodes
at which point the subset that we are left with is
pure, or has zero entropy and represents instances
all of one class.
+
Entropy
Let D, the data partition, be a training set of
class-labelled examples, and suppose the class label
attribute has m distinct values defining classes C1, ..., Cm.
With pi the probability that a tuple in D belongs to class
Ci, the expected information (entropy) needed to classify
a tuple in D is
Info(D) = - Σ (i = 1 to m) pi * log2(pi)
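As a small illustrative sketch (the helper name is my own), the definition above translates directly into code, taking the class counts of a partition as input.

# Illustrative sketch: Info(D) from the class counts of a partition D.
import math

def info(counts):
    """Info(D) = -sum(p_i * log2(p_i)), with p_i estimated from class counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c)

print(round(info([9, 5]), 3))   # the play-tennis partition used later: about 0.94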
+
Infogain
+
Example
+
Which attribute to select?
+
Example
+
Example
There are two classes of the target attribute “play
tennis”, namely “yes” (9 examples) and “no” (5 examples),
in the data. Therefore the entropy can be calculated as below:
Entropy = Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14)
= -0.643 log2(0.643) - 0.357 log2(0.357)
= 0.643(0.637) + 0.357(1.486)
= 0.410 + 0.530 = 0.940
+
Example
Now, the next step is to select the most significant
input variable among the four input variables
(Outlook, Temperature, Humidity, Wind), i.e., the one that will
split the data more purely than the other attributes.
For this, we will calculate the information gain that
would result over the entire dataset after
splitting on each of the attributes (Outlook, Temperature,
Humidity, Wind).
+
Example
Infogain(C | Outlook) = Entropy(C) - (5/14) Entropy(C | Outlook = Sunny)
- (4/14) Entropy(C | Outlook = Overcast) - (5/14) Entropy(C | Outlook = Rain)
= 0.940 - (5/14)(-(2/5) log2(2/5) - (3/5) log2(3/5))
- (4/14)(-(4/4) log2(4/4))
- (5/14)(-(3/5) log2(3/5) - (2/5) log2(2/5))
= 0.940 - (5/14)(0.529 + 0.442) - (4/14)(0) - (5/14)(0.442 + 0.529)
= 0.940 - 0.347 - 0 - 0.347
≈ 0.247 bits
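A quick, hedged numeric check of the figures above, using only the class counts stated on these slides (the helper name is my own):

# Recompute Info(D) and Infogain(C | Outlook) from the stated class counts.
import math

def info(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c)

info_d = info([9, 5])                                   # 9 "yes", 5 "no"
outlook = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}
info_a = sum(sum(c) / 14 * info(c) for c in outlook.values())
print(round(info_d, 3), round(info_d - info_a, 3))      # about 0.94 and 0.247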
+
Example
We will perform the above calculation for all the
remaining attributes (Temperature, Humidity,
Wind).
Then we select for the root node the attribute which
will result in the highest reduction in entropy.
In other words, the attribute which results in the
highest information gain will be selected as the
root node.
+
Example
Performing the above calculation for all the
remaining attributes gives the following values of
information gain:
Infogain (C| Temperature) = 0.028 bits
Infogain (C| Humidity) = 0.152 bits
Infogain (C|Wind) = 0.048 bits
We can clearly see that the attribute Outlook
results in the highest reduction in entropy or the
highest information gain.
We would therefore choose this attribute at the root
node, splitting the data up into subsets corresponding to
the different values of the Outlook attribute.
+ Example
Which attribute
should be
tested here (under the Sunny branch)?
+ Example
We select Humidity as it has the highest Gain.
+ Example
Figure: the partially grown tree. The full set {D1, D2, ..., D14} (9 Yes, 5 No) is split on Outlook. Sunny gives {D1, D2, D8, D9, D11} (2 Yes, 3 No), where Humidity is tested; Overcast gives {D3, D7, D12, D13} (4 Yes, 0 No), labelled Yes; Rain gives {D4, D5, D6, D10, D14} (3 Yes, 2 No).
Which attribute should be tested here (under the Rain branch)?
+
Example
We select Wind as it has the highest Gain.
+
Decision Trees
Figure: the final decision tree.
Outlook = Sunny: test Humidity (High → No, Normal → Yes).
Outlook = Overcast: Yes.
Outlook = Rain: test Wind (Strong → No, Weak → Yes).
+
Example
Construct a Decision Tree using Infogain as the
splitting criterion.
Weekend Weather Parents Money Decision
W1 Sunny Yes Rich Cinema
W2 Sunny No Rich Tennis
W3 Windy Yes Rich Cinema
W4 Rainy Yes Poor Cinema
W5 Rainy No Rich Stay In
W6 Rainy Yes Poor Cinema
W7 Windy No Poor Cinema
W8 Windy No Rich Shopping
W9 Windy Yes Rich Cinema
W10 Sunny No Rich Tennis
+
Solution
Entropy(S) = -pcinema log2(pcinema) -ptennis log2(ptennis) -
pshopping log2(pshopping) -pstay_in log2(pstay_in)
= -(6/10) * log2(6/10) -(2/10) * log2(2/10) -(1/10) *
log2(1/10) -(1/10) * log2(1/10)
= -(6/10) * -0.737 -(2/10) * -2.322 -(1/10) * -3.322 -
(1/10) * -3.322
= 0.4422 + 0.4644 + 0.3322 + 0.3322 = 1.571
+
Solution
Now,
Gain (S,Weather) = 0.70
Gain (S, Parents) = 0.61
Gain (S, Money) = 0.2816
Therefore, the first attribute selected for splitting is
weather.
Now,
Gain (Ssunny, Parents) = 0.918
Gain (Ssunny, Money) = 0
The chosen attribute is Parents.
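To reproduce these numbers, here is a small sketch (the tuple layout and helper names are my own) that recomputes the gains directly from the weekend table above:

# Recompute the information gains for the weekend example.
import math
from collections import Counter

data = [  # (Weather, Parents, Money, Decision), transcribed from the table above
    ("Sunny", "Yes", "Rich", "Cinema"),  ("Sunny", "No",  "Rich", "Tennis"),
    ("Windy", "Yes", "Rich", "Cinema"),  ("Rainy", "Yes", "Poor", "Cinema"),
    ("Rainy", "No",  "Rich", "Stay In"), ("Rainy", "Yes", "Poor", "Cinema"),
    ("Windy", "No",  "Poor", "Cinema"),  ("Windy", "No",  "Rich", "Shopping"),
    ("Windy", "Yes", "Rich", "Cinema"),  ("Sunny", "No",  "Rich", "Tennis"),
]

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    g = entropy([r[-1] for r in rows])
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

for name, col in [("Weather", 0), ("Parents", 1), ("Money", 2)]:
    print(name, round(gain(data, col), 3))   # about 0.70, 0.61, 0.28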
+
Solution
The complete tree is as follows:
+
Decision Trees
+
Problem with Information Gain
Approach
Biased towards tests with many outcomes (attributes
having a large number of values)
E.g., an attribute acting as a unique identifier will
produce a large number of partitions (1 tuple per
partition).
Each resulting partition D is pure, so Info(D) = 0
and the information gain is maximized.
E.g., in a dataset with customer attributes (Customer ID,
Gender, and Car Type), a split on “Customer ID” is
preferred over other attributes such as
“Gender” and “Car Type”. But “Customer ID” is not a
predictive attribute.
+
Extension to Information Gain
C4.5, a successor of ID3, uses an extension to
information gain known as the gain ratio.
It overcomes the bias of information gain by
applying a kind of normalization to information gain
using a “split information” value.
The split information value represents the potential
information generated by splitting the training data
set D into v partitions, corresponding to v outcomes
on attribute A.
SplitInfoA(D) = - Σ (j = 1 to v) (|Dj| / |D|) * log2(|Dj| / |D|)
+
Extension to Information Gain
Gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfoA(D)
The attribute with the maximum gain ratio is
selected as the splitting attribute.
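A small sketch of the computation (the function names are my own), using the Outlook partition sizes from the play-tennis example as input:

# Illustrative sketch: split information and gain ratio from partition sizes.
import math

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return sum(-(s / n) * math.log2(s / n) for s in partition_sizes if s)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# Outlook splits the 14 play-tennis tuples into 5 (Sunny), 4 (Overcast), 5 (Rain).
print(round(split_info([5, 4, 5]), 3))          # about 1.577
print(round(gain_ratio(0.247, [5, 4, 5]), 3))   # about 0.157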
+
Example
For Outlook
Gain: 0.247
Split Info:
Sunny: 5 examples, Overcast: 4 examples, Rain: 5 examples
SplitInfo = -(5/14) log2(5/14) - (4/14) log2(4/14) - (5/14) log2(5/14) ≈ 1.577
Gain Ratio: 0.247 / 1.577 ≈ 0.157
+
Example
For Humidity
Split Info: 1.000
Gain Ratio: 0.152
For Temperature
Split Info: 1.557
Gain Ratio: 0.019
For Wind
Split Info: 0.985
Gain Ratio: 0.049
Again, Outlook is selected, as it has the highest gain ratio.
+
Gini Index
It is used as the splitting criterion in Classification and
Regression Trees (CART).
It measures the impurity of a data partition D:
Gini(D) = 1 - Σ (i = 1 to m) pi^2
where m is the number of classes and
pi is the probability that a tuple in D belongs to class
Ci.
+
Gini Index
The Gini index considers a binary split for each
attribute A, say into partitions D1 and D2. The Gini index of D
given that partitioning is:
GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
The reduction in impurity is given by:
ΔGini(A) = Gini(D) - GiniA(D)
The attribute that maximizes the reduction in
impurity (equivalently, minimizes GiniA(D)) is chosen as the splitting attribute.
+
Gini Index
Examine the partitions resulting from all
possible subsets of {a1, ..., av}, the values of attribute A.
There are 2^v possible subsets. Excluding the full
set and the empty set (which do not represent a
split), 2^v - 2 subsets remain.
The subset that gives the minimum Gini
index for attribute A is selected as its
splitting subset.
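A hedged sketch (helper names are mine) that computes Gini(D) and searches the candidate binary groupings of a nominal attribute for the one with the minimum GiniA(D):

# Gini index and best binary split of a nominal attribute (illustrative).
from collections import Counter
from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_gini_split(rows, col):
    """rows: list of tuples whose last element is the class label."""
    values = sorted(set(r[col] for r in rows))
    n = len(rows)
    best = (None, float("inf"))
    # all non-empty proper subsets; each split appears twice, which is harmless here
    for r in range(1, len(values)):
        for subset in combinations(values, r):
            d1 = [row[-1] for row in rows if row[col] in subset]
            d2 = [row[-1] for row in rows if row[col] not in subset]
            g = len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
            if g < best[1]:
                best = (set(subset), g)
    return best

rows = [("low", "yes"), ("low", "yes"), ("medium", "no"), ("high", "no")]
print(best_gini_split(rows, 0))   # e.g. ({'low'}, 0.0) for this toy data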
+
Example
+
Example
Compute the Gini Index for the overall collection of
training examples.
Using attribute income, there are three values: low,
medium, and high.
Choosing the subset {low, medium} results in two partitions:
D1 (income ∈ {low, medium} ): 10 tuples
D2 (income ∈ {high} ): 4 tuples
+
Example
The Gini index values of the remaining candidate partitions are:
The best binary split for attribute income is on {medium,
high} and {low}
+
Summarizing Splitting Criteria
Information gain is used in ID3.
Gain ratio is used in C4.5 and J48 (the Weka implementation of C4.5).
Gini index is used in CART.
+
Comparing Attribute Selection
Measures
Information Gain
Biased towards multivalued attributes
Gain Ratio
Tends to prefer unbalanced splits in which one
partition is much smaller than the other
Gini Index
Biased towards multivalued attributes
Has difficulties when the number of classes is large
Tends to favor tests that result in equal-sized
partitions and purity in both partitions
+
Exercise
Consider the training examples shown in Table for
a binary classification problem.
(a) What is the entropy of this collection of training
examples?
+
Exercise
(b) What are the information gains of a1 and a2
relative to these training examples?
(c) What is the best split (among a1 and a2)
according to the information gain?
(d) What is the best split (among a1 and a2)
according to the Gini index?
+
Solution
What is the entropy of this collection of training
examples with respect to the positive class?
There are four positive examples and five negative
examples. Thus, P(+) = 4/9 and P(−) = 5/9. The
entropy of the training examples is −4/9 log2(4/9)
− 5/9 log2(5/9) = 0.9911.
+
Solution
What are the information gains of a1 and a2 relative
to these training examples?
The entropy for a1 is
The information gain is 0.9911-0.7616 = 0.2294.
+
Solution
What are the information gains of a1 and a2 relative
to these training examples?
The entropy for a2 is
The information gain is 0.9911-0.9839 = 0.0072.
+
Solution
What is the best split (among a1 and a2) according
to the information gain?
According to information gain a1 produces the best
split.
+
Solution
What is the best split (among a1 and a2) according
to Gini Index?
For attribute a1, the Gini index is
For attribute a2, the Gini index is
Since Gini index of a1 is smaller, it is selected for
splitting.