
Machine Learning
Prepared By: Dr. Sara Sweidan

Divide and Conquer – Classification Using Decision Trees and Rules
To Know

Decision trees and rule learners: two machine learning methods that make complex decisions from sets of simple choices.

These methods present their knowledge in the form of logical structures that can be understood with no statistical knowledge. This aspect makes these models particularly useful for business strategy and process improvement.
Understanding of decision tree
• Concept of Decision Tree

• Use of Decision Tree to classify data

• Basic algorithm to build Decision Tree

• Concept of Entropy
• Basic concept of entropy in information theory
• Mathematical formulation of entropy
• Calculation of entropy of a training set

• Decision Tree induction algorithms


• ID3, CART, C4.5, C5.0
Why use decision trees
• Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
• In a decision tree, there are two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• It is a graphical representation for obtaining all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands into further branches and constructs a tree-like structure.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into sub-trees.
Decision tree terminology

The decision tree comprises a root node, leaf nodes, branches, parent/child nodes, etc. The following is an explanation of this terminology.
• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
• Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes according to the given conditions.
• Branch/Sub-Tree: A sub-tree formed by splitting the tree.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child Node: The root node of the tree is called the parent node, and the other nodes are called child nodes.
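To make this terminology concrete, here is a minimal Python sketch of how decision nodes and leaf nodes might be represented; the class names, the Salary attribute values, and the leaf labels are illustrative assumptions, not taken from the slides.

class LeafNode:
    """Leaf node: the final output, labeled with a class; no further branches."""
    def __init__(self, label):
        self.label = label

class DecisionNode:
    """Decision (parent) node: tests one attribute and has one branch per value,
    each leading to a child node (a sub-tree or a leaf)."""
    def __init__(self, attribute, children):
        self.attribute = attribute   # attribute tested at this node
        self.children = children     # dict: attribute value -> child node

def classify(node, record):
    """Follow the branches from the root node down to a leaf node."""
    while isinstance(node, DecisionNode):
        node = node.children[record[node.attribute]]
    return node.label

# Hypothetical tree for a job-offer decision: the root splits on Salary,
# and each branch here ends in a leaf.
tree = DecisionNode("Salary", {
    "high": LeafNode("Accept"),
    "medium": LeafNode("Negotiate"),
    "low": LeafNode("Decline"),
})
print(classify(tree, {"Salary": "high"}))   # -> Accept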
Understanding of decision tree
• A Decision Tree is an important data structure known to solve many computational problems

Example 1: Binary Decision Tree

A B C f
0 0 0 m0
0 0 1 m1
0 1 0 m2
0 1 1 m3
1 0 0 m4
1 0 1 m5
1 1 0 m6
1 1 1 m7
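Read as a program, the binary decision tree of Example 1 is a chain of two-way tests on A, B, and C, with one leaf (minterm) per row of the truth table; the sketch below simply encodes that table.

def f(A, B, C):
    """Binary decision tree for the truth table of Example 1: each level tests one
    Boolean attribute and each leaf corresponds to one minterm m0..m7."""
    if A == 0:
        if B == 0:
            return "m0" if C == 0 else "m1"
        return "m2" if C == 0 else "m3"
    if B == 0:
        return "m4" if C == 0 else "m5"
    return "m6" if C == 0 else "m7"

# Every row of the table reaches its own leaf:
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, f(a, b, c))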
Understanding of decision tree
• In Example 1, we considered a decision tree where the values of every attribute are binary. Decision trees are also possible where attributes are of a continuous data type.

Example 2: Decision Tree with numeric data


DT characteristics
• Decision tree may be n-ary, n ≥ 2.
• There is a special node called root node.
• All nodes drawn with circle (ellipse) are called internal nodes.
• All nodes drawn with square boxes are called terminal nodes or leaf nodes.
• Edges of a node represent the outcome for a value of the node.
• In a path, a node with the same label is never repeated.

• Decision tree is not unique, as different ordering of internal nodes can give different
decision tree.
• Example: Suppose a candidate has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by an Attribute Selection Measure).
Building Decision Tree
• In principle, there are exponentially many decision trees that can be constructed from a given dataset (also called training data).

• Some of the trees may not be optimum
• Some of them may give inaccurate results

• Two approaches are known:

➢ Greedy strategy
• A top-down, recursive, divide-and-conquer approach
➢ Modification of the greedy strategy
• ID3, C4.5, CART, C5.0
Definition of decision tree

Definition 1: Decision Tree

Given a database D = {t₁, t₂, …, tₙ}, where tᵢ denotes a tuple defined by a set of attributes A = {A₁, A₂, …, Aₘ}. Also given is a set of classes C = {c₁, c₂, …, cₖ}.

A decision tree T is a tree associated with D that has the following properties:

• Each internal node is labeled with an attribute Aᵢ
• Each edge is labeled with a predicate (value) that can be applied to the attribute associated with its parent node
• Each leaf node is labeled with a class cⱼ

Decision Tree algorithm for classification

• Step-1: Begin the tree with the root node, say D, which contains the complete dataset.

• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM), e.g., information gain or the Gini index.

• Step-3: Divide D into subsets, one for each possible value of the best attribute.

• Step-4: Generate the decision tree node that contains the best attribute.

• Step-5: Recursively build new decision trees using the subsets created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such final nodes are called leaf nodes.
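As a hedged sketch of these steps in practice, scikit-learn's DecisionTreeClassifier lets the ASM of Step-2 be chosen through the criterion parameter ("entropy" for information gain, "gini" for the Gini index); the toy job-offer data below is invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy data: [salary_in_thousands, commute_minutes] -> accept the offer (1) or not (0)
X = [[40, 60], [55, 45], [70, 30], [80, 20], [45, 90], [90, 15]]
y = [0, 0, 1, 1, 0, 1]

# Step-2: choose the ASM via `criterion`; Steps 3-5 (splitting the data and recursing
# until only leaf nodes remain) happen inside fit().
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Print the learned tree: internal nodes test attributes, leaves give the class.
print(export_text(clf, feature_names=["salary", "commute"]))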
Build Decision Tree (BuildDT) Algorithm
• Input: D : Training dataset
• Output: T : Decision tree

Steps

1. If all tuples in D belong to the same class Cj
   Add a leaf node labeled as Cj
   Return // Termination condition

2. Select an attribute Ai (so that it is not selected twice in the same branch)

3. Partition D = {D1, D2, …, Dp} based on the p different values of Ai in D

4. For each Dk ∈ D
   Create a node and add an edge between D and Dk labeled with the value of Ai in Dk

5. For each Dk ∈ D
   BuildDT(Dk) // Recursive call

6. Stop
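A minimal Python sketch of this recursion follows; to keep it short, the attribute in Step 2 is chosen naively as the first unused one rather than by an ASM, and the data and attribute names in the usage example are made up.

from collections import Counter

def build_dt(D, attributes, target):
    """BuildDT sketch. D: list of dicts (tuples); attributes: names not yet used on this branch."""
    classes = [row[target] for row in D]

    # Step 1: all tuples belong to the same class Cj -> leaf labeled Cj (termination condition)
    if len(set(classes)) == 1:
        return classes[0]
    # No attribute left to select -> label the leaf with the majority class
    if not attributes:
        return Counter(classes).most_common(1)[0][0]

    # Step 2: select an attribute Ai (naively, the first unused one; an ASM would go here)
    Ai, rest = attributes[0], attributes[1:]

    # Steps 3-5: partition D on the distinct values of Ai, add one labeled edge per partition,
    # and recurse on each partition Dk
    node = {}
    for value in sorted({row[Ai] for row in D}):
        Dk = [row for row in D if row[Ai] == value]
        node[(Ai, value)] = build_dt(Dk, rest, target)
    return node

# Made-up usage: decide whether to accept a job offer
D = [{"Salary": "high", "Commute": "short", "Accept": "yes"},
     {"Salary": "low",  "Commute": "short", "Accept": "no"},
     {"Salary": "high", "Commute": "long",  "Accept": "yes"}]
print(build_dt(D, ["Salary", "Commute"], target="Accept"))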
Node Splitting in BuildDT Algorithm
• The BuildDT algorithm must provide a method for expressing an attribute test condition and the corresponding outcomes for different attribute types

• Case: Binary attribute


• This is the simplest case of node splitting

• The test condition for a binary attribute generates only two outcomes
Node Splitting in BuildDT Algorithm
• Case: Nominal attribute
• Since a nominal attribute can have many values, its test condition can be expressed in two
ways:
• A multi-way split
• A binary split

• Multi-way split: the outcome depends on the number of distinct values of the corresponding attribute

• Binary splitting by grouping attribute values


Node Splitting in BuildDT Algorithm
• Case: Ordinal attribute
• It also can be expressed in two ways:
• A multi-way split
• A binary split

• Multi-way split: the same as in the case of a nominal attribute

• Binary split: attribute values should be grouped while maintaining the order property of the attribute values
Node Splitting in BuildDT Algorithm
• Case: Numerical attribute
• For a numeric attribute (with discrete or continuous values), a test condition can be expressed as a comparison test
• Binary outcome: A > v or A ≤ v
• In this case, decision tree induction must consider all possible split positions

• Range query: vᵢ ≤ A < vᵢ₊₁ for i = 1, 2, …, q (if q ranges are chosen)
• Here, q should be decided a priori
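As a small illustration of enumerating the possible split positions v for a numeric attribute, one common choice (a sketch, not prescribed by the slides) is the midpoints between consecutive distinct sorted values; the height values used below are taken from the training set on the next slide.

def candidate_split_positions(values):
    """Candidate thresholds v for a numeric attribute A: midpoints between consecutive
    distinct sorted values; each v defines one binary test A <= v versus A > v."""
    v = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(v, v[1:])]

# Height values from the training set that follows
heights = [1.6, 2.0, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 2.2, 2.1, 1.8, 1.95, 1.9, 1.8, 1.75]
print(candidate_split_positions(heights))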
Node Splitting in BuildDT Algorithm
Example illustration of BuildDT Algorithm
• Consider a training data set as shown.

Person  Gender  Height  Class
1       F       1.6     S
2       M       2.0     M
3       F       1.9     M
4       F       1.88    M
5       F       1.7     S
6       M       1.85    M
7       F       1.6     S
8       M       1.7     S
9       M       2.2     T
10      M       2.1     T
11      F       1.8     M
12      M       1.95    M
13      F       1.9     M
14      F       1.8     M
15      F       1.75    S

Attributes:
Gender = {Male (M), Female (F)}   // Binary attribute
Height = {1.5, …, 2.5}            // Continuous attribute
Class  = {Short (S), Medium (M), Tall (T)}

Given a person, we are to test to which class s/he belongs.
Node Splitting in BuildDT Algorithm
Example illustration of BuildDT Algorithm
• To build a decision tree, we can select attributes in two different orderings: <Gender, Height> or <Height, Gender>

• Further, for each ordering, we can choose different ways of splitting

• Different instances are shown in the following.

• Approach 1 : <Gender, Height>
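A minimal sketch of one tree consistent with Approach 1: split on Gender first and then threshold Height. The thresholds (1.75 for females; 1.7 and 2.0 for males) are chosen here by inspection of the 15 tuples above, not produced by any particular induction algorithm.

# Training set from the previous slide: (Gender, Height, Class)
data = [
    ("F", 1.6, "S"), ("M", 2.0, "M"), ("F", 1.9, "M"), ("F", 1.88, "M"),
    ("F", 1.7, "S"), ("M", 1.85, "M"), ("F", 1.6, "S"), ("M", 1.7, "S"),
    ("M", 2.2, "T"), ("M", 2.1, "T"), ("F", 1.8, "M"), ("M", 1.95, "M"),
    ("F", 1.9, "M"), ("F", 1.8, "M"), ("F", 1.75, "S"),
]

def classify(gender, height):
    """<Gender, Height> ordering: the root node splits on Gender, then Height is thresholded."""
    if gender == "F":
        return "S" if height <= 1.75 else "M"
    # Male branch
    if height <= 1.7:
        return "S"
    return "M" if height <= 2.0 else "T"

# This hand-built tree reproduces every class label in the training set.
assert all(classify(g, h) == c for g, h, c in data)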


Node Splitting in BuildDT Algorithm
• Approach 2: <Height, Gender>
Concept of entropy

More ordered / organized data is less probable and has less entropy; less ordered / disorganized data is more probable and has higher entropy.
Concept of entropy
• Entropy is a measure of the uncertainty associated with a random variable
• As the uncertainty and/or randomness of a result set increases, so does the entropy
• For a two-class problem, values range from 0 to 1
• To represent the entropy of information:

Entropy(D) = − Σ pᵢ log₂(pᵢ), where the sum runs over the classes i = 1, …, k

(the greater the uncertainty, the greater the entropy)
Concept of entropy
Entropy(D) = − Σ pᵢ log₂(pᵢ), summed over the classes i = 1, …, k

• Note:
• The above formula should be summed over the non-empty classes only, that is, classes for which pᵢ ≠ 0

• E is always a non-negative quantity

• E takes its minimum value (zero) if and only if all the instances have the same class (i.e., the training set has only one non-empty class, whose probability is 1).

• Entropy takes its maximum value when the instances are equally distributed among the k possible classes. In this case, the maximum value of E is log₂ k.
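A minimal sketch of this entropy computation in Python (the example label lists are illustrative):

from math import log2
from collections import Counter

def entropy(labels):
    """Entropy(D) = -sum(p_i * log2(p_i)), summed over the non-empty classes only."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 10))          # 0.0: all instances in the same class (minimum)
print(entropy(["yes", "no"] * 5))     # 1.0: two equally likely classes (maximum, log2 2)
print(entropy(["a", "b", "c", "d"]))  # 2.0: four equally likely classes (maximum, log2 4)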
Concept of entropy
• Entropy is an important concept used in physics in the context of heat and, thereby, the uncertainty of the states of matter.

• At a later stage, with the growth of information technology, entropy became an important concept in information theory.

• For the classification job, entropy is an important concept, considered as
• an information-theoretic measure of the "uncertainty" contained in training data due to the presence of more than one class.
Concept of entropy
A question about an event that is certain (or impossible) is answered with 0 uncertainty.

Does the sun rise in the east? (answered with 0 uncertainty)

Will a mother give birth to a male baby? (answered with ½ uncertainty)

Is there a planet like Earth in the galaxy? (answered with extreme uncertainty)
Concept of entropy

➢ Entropy is a measure of the randomness in the information being


processed. The higher the entropy, the harder it is to draw any
conclusions from that information.
➢ It is a measure of disorder or impurity or unpredictability or
uncertainty.
➢ Low entropy means less uncertain and high entropy means more
uncertain.
Information Gain:

• Information gain is the measurement of the change in entropy after the dataset is segmented on an attribute.
• It calculates how much information a feature provides about a class.
• According to the value of information gain, we split the node and build the decision tree. A decision tree algorithm always tries to maximize information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula below:

Information Gain = Entropy(S) − [Weighted Average] × Entropy(each feature)
Entropy:
Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. For a two-class (yes/no) problem it can be calculated as:

Entropy(S) = − P(yes) log₂ P(yes) − P(no) log₂ P(no)
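A minimal sketch of information gain built on the same entropy definition (the function names, the grouping logic, and the example split are illustrative, not from the slides):

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy(S) minus the weighted average entropy of the subsets
    obtained by segmenting the dataset on one attribute."""
    n = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted_avg = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted_avg

# e.g. a split into pure subsets recovers all of the parent entropy:
print(information_gain(["yes", "yes", "no", "no"], ["a", "a", "b", "b"]))   # 1.0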
Table: Class‐labeled training tuples from All Electronics customer database.
• Class Y: buys_computer = “yes”
• Class N: buys_computer = “no”
• Entropy(D) = − (9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940
• Compute the expected information requirement for each attribute: start with the
attribute age
• Gain(age, D) = Entropy(D) − Σ (|Sᵥ| / |S|) · Entropy(Sᵥ), summed over v ∈ {Youth, Middle-aged, Senior}

  = 0.940 − (5/14) Entropy(S_youth) − (4/14) Entropy(S_middle) − (5/14) Entropy(S_senior)

  = 0.940 − (5/14)(−(2/5) log₂(2/5) − (3/5) log₂(3/5)) − (4/14)(−(4/4) log₂(4/4)) − (5/14)(−(3/5) log₂(3/5) − (2/5) log₂(2/5)) = 0.247

• Gain(income, D) = 0.029
• Gain(student, D) = 0.151
• Gain(credit_rating, D) = 0.048
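The figures above can be reproduced from the class counts stated on the slide (9 "yes" / 5 "no" overall; age = Youth: 2 yes / 3 no, Middle-aged: 4 yes / 0 no, Senior: 3 yes / 2 no); the count-based entropy helper below is a small sketch.

from math import log2

def entropy(counts):
    """Entropy from per-class counts, skipping empty classes."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

E_D = entropy([9, 5])                     # overall: 9 "yes", 5 "no"
remainder = (5 / 14) * entropy([2, 3]) \
          + (4 / 14) * entropy([4, 0]) \
          + (5 / 14) * entropy([3, 2])    # Youth, Middle-aged, Senior partitions
print(round(E_D, 3))                      # 0.94
print(round(E_D - remainder, 3))          # 0.247  (= Gain(age, D))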
The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age, and the tuples are partitioned accordingly.
Strengths
• An all-purpose classifier that does well on most problems
• Highly automatic learning process, which can handle numeric or nominal features, as well as missing data
• Excludes unimportant features
• Can be used on both small and large datasets
• Results in a model that can be interpreted without a mathematical background (for relatively small trees)
• More efficient than other complex models

Weaknesses
• Decision tree models are often biased toward splits on features having a large number of levels
• It is easy to overfit or underfit the model
• Can have trouble modeling some relationships due to reliance on axis-parallel splits
• Small changes in the training data can result in large changes to decision logic
• Large trees can be difficult to interpret and the decisions they make may seem counterintuitive
Underfitting & overfitting

• Underfitting model: Training = 50%, Test = 48%, Bias = 50% (high bias), Variance = 2 (low variance)
• Fitting model: Training = 98%, Test = 92%, Bias = 2% (low bias), Variance = 6 (low variance)
• Overfitting model: Training = 99%, Test = 60%, Bias = 1% (low bias), Variance = 39 (high variance)

Underfitting & overfitting
• Bias
• measures the accuracy or quality of the estimator
• low bias implies that, on average, we will accurately estimate the true parameter from training data

• Variance
• measures the precision or specificity of the estimator
• low variance implies that the estimator does not change much as the training set varies
[Figure: model fit versus complexity, showing an underfit region and an overfit region.]

Regression: Complexity versus Goodness of Fit
• Model complexity = low: highest bias, lowest variance (underfitting)
• Model complexity = medium: low bias, low variance (good fit)
• Model complexity = high: low bias, high variance (overfitting)
Credit: Stanford Machine Learning
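A hedged sketch of the complexity-versus-goodness-of-fit idea using polynomial regression; the data, noise level, and polynomial degrees are invented for illustration, and typically the high-degree fit has the lowest training error but a much higher test error.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)
X_train, y_train = X[::2], y[::2]      # half the points for training
X_test, y_test = X[1::2], y[1::2]      # the other half for testing

for degree in (1, 4, 15):              # low / medium / high model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")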
Fixes to try:

• Try getting more training examples → fixes high variance.
• Try a smaller set of features → fixes high variance.
• Try a larger set of features → fixes high bias.
• Run gradient descent for more iterations → fixes the optimization algorithm.
