CSCI417
Machine Intelligence
Lecture # 3
Spring 2024
Tentative Course Topics
1. Machine Learning Basics
2. Classifying with k-Nearest Neighbors
3. Splitting datasets one feature at a time: decision trees
4. Classifying with probability theory: naïve Bayes
5. Logistic regression
6. Support vector machines
7. Model Evaluation and Improvement: Cross-validation, Grid Search, Evaluation Metrics, and Scoring
8. Ensemble learning and improving classification with the AdaBoost meta-algorithm
9. Introduction to Neural Networks - Building NNs for classification (binary/multiclass)
10. Convolutional Neural Networks (CNN)
11. Pretrained models (VGG, AlexNet, ...)
12. Machine learning pipeline and use cases
Recommending App
For a woman who works at an office, which app do we recommend?
For a man who works at a factory, which app do we recommend?
The ML question:
Between Gender and Occupation, which feature seems more decisive for predicting which app the users will download?
Recommending App
A possible decision tree for this example:
  Split on Occupation: School -> Pokemon Go; Work -> split on Gender
  Split on Gender (under Work): F -> WhatsApp; M -> Snapchat
Between a horizontal and a vertical line, which one would cut the data better?
Non-parametric Estimation
• A non-parametric model is not fixed, but its complexity
depends on the size of the training set or, rather, the
complexity of the problem inherent in the data.
• Here, a non-parametric model does not mean that the model
has no parameters; it means that the number of parameters
is not fixed and that their number can grow depending on the
size of the data or, better still, depending on the complexity of
the regularity that underlies the data.
Decision tree
• A decision tree is a hierarchical data structure implementing
the divide-and-conquer strategy.
• It is an efficient non-parametric method that can be used for
both classification and regression.
• A decision tree is also a non-parametric model in the sense that we do not assume any parametric form for the class densities, and the tree structure is not fixed a priori: the tree grows as branches and leaves are added during learning, depending on the complexity of the problem inherent in the data.
Function Approximation
Ref: https://www.seas.upenn.edu, Eric Eaton.
Sample Dataset (Will Nadal Play Tennis?)
• Columns denote features X_i
• Rows denote labeled instances ⟨x_i, y_i⟩
• Class label denotes whether a tennis game was played
Ref: https://www.seas.upenn.edu, Eric Eaton.
Decision Tree
• A possible decision tree for the data:
• Each internal node: test one attribute X_i
• Each branch from a node: selects one value for X_i
• Each leaf node: predict Y
Ref: https://www.seas.upenn.edu, Eric Eaton.
Decision Tree
• A possible decision tree for the data:
• What prediction would we make for
<outlook=sunny, temperature=hot, humidity=high, wind=weak> ?
Ref: https://www.seas.upenn.edu, Eric Eaton.
Decision Tree
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label
  – or a probability distribution over labels
Ref: https://www.seas.upenn.edu, Eric Eaton.
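To see the axis-parallel structure concretely, here is a small sketch (assuming scikit-learn, which these slides do not prescribe): every test printed by export_text is a single-feature threshold, so each decision region is a rectangle in feature space.

```python
# Hypothetical illustration: fit a small tree on 2-D points and print its rules.
# Every test has the form "feature <= threshold", i.e. an axis-parallel cut.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 1], [1, 4], [2, 3], [6, 1], [7, 2], [8, 5]]  # toy 2-D points
y = [0, 0, 0, 1, 1, 1]                                # two classes

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
# The output is a nest of "x1 <= ..." / "x2 <= ..." tests: rectangles in (x1, x2) space.
```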
Stages of Machine Learning
Given: labeled training data X, Y = {⟨x_i, y_i⟩}, i = 1..n
• Assumes each x_i ~ D(X) with y_i = f_target(x_i)
Train the model:
model ← classifier.train(X, Y)
Apply the model to new data:
• Given: new unlabeled instance x ~ D(X)
y_prediction ← model.predict(x)
Ref: https://www.seas.upenn.edu, Eric Eaton.
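The train/apply stages above as a minimal runnable sketch; the use of scikit-learn's DecisionTreeClassifier and the toy data are assumptions for illustration, not part of the slides.

```python
from sklearn.tree import DecisionTreeClassifier

# Labeled training data X, Y = {<x_i, y_i>}, i = 1..n
X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # feature vectors x_i
Y = [0, 1, 1, 0]                       # labels y_i

# Train the model:  model <- classifier.train(X, Y)
model = DecisionTreeClassifier().fit(X, Y)

# Apply the model to a new unlabeled instance x ~ D(X)
x_new = [[1, 0]]
y_prediction = model.predict(x_new)    # y_prediction <- model.predict(x)
print(y_prediction)                    # e.g. [1]
```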
Top-down learning
node ← root of decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort the training examples to the leaf nodes.
5. If the training examples are perfectly classified, stop; else recurse over the new leaf nodes.
How do we choose which attribute is best? (A compact sketch of this loop follows below.)
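A compact Python sketch of this main loop (ID3-style), using information gain as the "best attribute" criterion introduced on the following slides. The dict-based example format and the helper names are illustrative assumptions, not the lecture's reference code.

```python
from collections import Counter
from math import log2

def entropy(examples):
    """Entropy of the class labels in a list of {attribute: value, 'label': y} dicts."""
    n = len(examples)
    counts = Counter(e['label'] for e in examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(examples, attribute):
    """Entropy(parent) minus the weighted entropy of the children after splitting."""
    n = len(examples)
    weighted = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        weighted += (len(subset) / n) * entropy(subset)
    return entropy(examples) - weighted

def build_tree(examples, attributes):
    labels = {e['label'] for e in examples}
    if len(labels) == 1:                                # perfectly classified: stop
        return labels.pop()
    if not attributes:                                  # nothing left to split on: majority label
        return Counter(e['label'] for e in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))   # step 1
    tree = {best: {}}                                   # step 2: best attribute at this node
    for value in {e[best] for e in examples}:           # step 3: one branch per value
        subset = [e for e in examples if e[best] == value]                # step 4: sort examples
        rest = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, rest)    # step 5: recurse
    return tree
```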
Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples
• Some possibilities are:
– Random: Select any attribute at random
– Least-Values: Choose the attribute with the smallest number of possible values
– Most-Values: Choose the attribute with the largest number of possible values
– Max-Gain: Choose the attribute that has the largest expected information gain
• i.e., attribute that results in smallest expected size of subtrees rooted at its children
• The ID3 algorithm uses the Max-Gain method of selecting the best
attribute
Information Gain
Impurity/Entropy (informal)
– Measures the level of impurity in a group of examples
Entropy can be roughly thought of as how much variance the data has.
Entropy: a common way to measure impurity
Entropy
Entropy = 0: all examples belong to the same class (minimum impurity).
  entropy = -1 * log2(1) = 0
Entropy = 1: examples are evenly split between the two classes (maximum impurity).
  entropy = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
2-Class Cases:
Entropy: H(X) = - Σ (over i = 1..n) P(x = i) * log2 P(x = i)
• What is the entropy of a group in which all examples belong to the same class?
  – entropy = -1 * log2(1) = 0 (minimum impurity)
  – not a good training set for learning
• What is the entropy of a group with 50% in either class?
  – entropy = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1 (maximum impurity)
  – a good training set for learning
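A quick numeric check of the two extreme cases above:

```python
from math import log2

print(-1 * log2(1))                        # -0.0, i.e. zero: pure group, minimum impurity
print(-0.5 * log2(0.5) - 0.5 * log2(0.5))  # 1.0: 50/50 group, maximum impurity
```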
Information Gain
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
Information Gain of feature WIND
Information Gain of feature HUMIDITY
Information Gain of feature TEMP
Information Gain of feature OUTLOOK
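These per-feature gains can be reproduced from class counts. The sketch below assumes the standard 14-example PlayTennis table (9 Yes / 5 No) used in the referenced Eric Eaton slides; if the lecture's table differs, the counts (and gains) would change.

```python
from math import log2

def entropy(counts):
    """Entropy of a class-count list, e.g. [9, 5] for 9 Yes / 5 No."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """IG = H(parent) - sum_k (n_k / n) * H(child_k), all given as class-count lists."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [9, 5]  # assumed PlayTennis totals: 9 Yes, 5 No
splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],  # Sunny, Overcast, Rain
    "Humidity":    [[3, 4], [6, 1]],          # High, Normal
    "Wind":        [[6, 2], [3, 3]],          # Weak, Strong
    "Temperature": [[2, 2], [4, 2], [3, 1]],  # Hot, Mild, Cool
}
for feature, children in splits.items():
    print(feature, round(information_gain(parent, children), 3))
# Outlook has the largest gain (~0.247), so ID3 would split on it first.
```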
After several iterations
Information Gain
Example: a family of 10 members, where 5 members are pursuing their studies and the rest have completed or never pursued them.

Node A (mixed):                     Node B (pure):
  % pursuing = 50%                    % pursuing = 0%
  % not pursuing = 50%                % not pursuing = 100%

First, calculate the entropy of each node:
  Node A: Entropy = -(0.5) * log2(0.5) - (0.5) * log2(0.5) = 1
  Node B: Entropy = -(0) * log2(0) - (1) * log2(1) = 0   (taking 0 * log2(0) = 0)

If a node contains only one class (the node is pure), its entropy is zero, and by the information gain formula the gain for such a node is higher. The higher the entropy, the lower the information gain, and the less pure the node.
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
Entropy for Parent Node
Now, according to the performance of the students:
  Students = 20
  Curricular activity = 10
  No curricular activity = 10
  Entropy = -(0.5) * log2(0.5) - (0.5) * log2(0.5) = 1

Entropy for Child Nodes
  Child 1: Students = 14
    Curricular activity = 8/14 = 57%
    No curricular activity = 6/14 = 43%
    Entropy = -(0.43) * log2(0.43) - (0.57) * log2(0.57) = 0.98
  Child 2: Students = 6
    Curricular activity = 2/6 = 33%
    No curricular activity = 4/6 = 67%
    Entropy = -(0.33) * log2(0.33) - (0.67) * log2(0.67) = 0.91

Having calculated the entropy for the parent and child nodes, the weighted sum of the child entropies gives the weighted entropy of the split.
Weighted Entropy: (14/20) * 0.98 + (6/20) * 0.91 = 0.959
https://analyticsindiamag.com/a-complete-guide-to-decision-tree-split-using-information-gain/
Split based on the class
Entropy for parent node:
  Entropy = -(0.5) * log2(0.5) - (0.5) * log2(0.5) = 1
Entropy for child nodes:
  Class 11th: Entropy = -(0.8) * log2(0.8) - (0.2) * log2(0.2) = 0.722
  Class 12th: Entropy = -(0.2) * log2(0.2) - (0.8) * log2(0.8) = 0.722
Weighted Entropy: (10/20) * 0.722 + (10/20) * 0.722 = 0.722
Calculation of Information Gain
• Entropy measures the amount of uncertainty due to a process or a given random variable.
• Information gain measures how much that uncertainty is reduced by splitting a node, and is used to decide which split to make.
• Information Gain = Entropy(parent) - Weighted Entropy(children). Here the parent entropy is 1, so Information Gain = 1 - Weighted Entropy. A higher information gain means more entropy removed, which is what we want.

Split                      Weighted Entropy    Information Gain
Performance of the class   0.959               0.041
Class (11th / 12th)        0.722               0.278

The split on class yields the higher information gain, so it is the better split.
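A quick check of the two candidate splits above, using the counts from the worked example. Note that the exact weighted entropy of the performance split is 0.965 (gain 0.035); the slide's 0.959 / 0.041 come from using the child entropies rounded to 0.98 and 0.91.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def weighted_entropy(children):
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * entropy(child) for child in children)

parent = [10, 10]                     # 10 with curricular activity, 10 without
by_performance = [[8, 6], [2, 4]]     # children of the "performance of the class" split
by_class = [[8, 2], [2, 8]]           # children of the 11th / 12th class split

for name, children in [("performance", by_performance), ("class", by_class)]:
    w = weighted_entropy(children)
    print(name, round(w, 3), round(entropy(parent) - w, 3))
# performance 0.965 0.035
# class 0.722 0.278
```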
Which Tree Should We Output?
Overfitting in DTs
Avoiding overfitting
How can we avoid overfitting?
• Stop growing when data split is not statistically significant
• Acquire more training data
• Remove irrelevant attributes (manual process – not always possible)
• Grow full tree, then post-prune
How to select “best” tree:
• Measure performance over training data
• Measure performance over separate validation data set
• Add complexity penalty to performance measure
(heuristic: simpler is better)
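In practice, several of these ideas map onto decision-tree hyperparameters. A hedged sketch with scikit-learn (an assumed library choice): pre-pruning via max_depth / min_samples_leaf, and post-pruning via cost-complexity pruning with the pruning strength chosen on a validation set (similar in spirit to, though not identical to, reduced-error pruning).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early via depth / leaf-size limits.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: enumerate candidate pruning strengths for the full tree, then
# keep the one that scores best on the held-out validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alphas = [a for a in path.ccp_alphas if a >= 0.0]   # guard against tiny negative round-off
best_alpha = max(
    alphas,
    key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
    .fit(X_train, y_train)
    .score(X_val, y_val),
)
post = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
print(pre.score(X_val, y_val), post.score(X_val, y_val))
```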
Reduced-error pruning
Split training data further into training and validation sets
Grow tree based on training set
Do until further pruning is harmful:
1. Evaluate impact on validation set of pruning each
possible node (plus those below it)
2. Greedily remove the node that most improves
validation set accuracy
Effect of Reduced-Error Pruning
The tree is pruned back to the point (the red line in the figure) where it gives more accurate results on the test data.
Decision Tree in summary
• Representation: decision trees
• Bias: prefer small decision trees
• Search algorithm: greedy
• Heuristic function:
information gain or information
content or others
• Overfitting / pruning
Decision Tree PROS & CONS
Pros:
• Fast and simple to implement.
• DTs work well with any type of nonlinearly separable data.
• A "path" through the possibilities, with alternatives, always leads toward a desirable outcome.
Cons:
• Prone to over-fitting, especially when many features are to be considered.
• While DTs visualize the decisions to be made, they condense a complex process into discrete steps (which may be a good or a bad thing).
• Non-incremental (batch method).