ML Unit-2.1
Picking the best splitting attribute: entropy and information gain, searching for simple trees and computational
complexity, Occam's razor, Overfitting, noisy data, and pruning. Experimental Evaluation of Learning Algorithms:
Measuring the accuracy of learned hypotheses. Comparing learning algorithms: cross-validation, learning curves, and
statistical hypothesis testing.
The main point to remember while creating a machine learning model is to choose the algorithm that best suits the given dataset and problem. There are two reasons for using a Decision Tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.
1. Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which further gets
divided into two or more homogeneous sets.
2. Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a leaf
node.
3. Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given
conditions.
4. Branch/Sub Tree: A tree formed by splitting the tree.
5. Pruning: Pruning is the process of removing the unwanted branches from the tree.
6. Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child
nodes.
Features of Decision Tree Learning
• Method for approximating discrete-valued functions (including boolean functions)
• Learned functions are represented as decision trees (or if-then-else rules)
• Expressive hypothesis space, including disjunctions
• Robust to noisy data
Representation of Decision Tree Learning
➢ Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.
➢ Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that
node corresponds to one of the possible values for this attribute.
➢ An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then
moving down the tree branch corresponding to the value of the attribute in the given example. This process is
repeated for the subtree rooted at the new node.
➢ Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances.
➢ Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a
disjunction of these conjunctions.
Entropy and Information Gain
Entropy measures the impurity of a collection of examples S. For a boolean classification,
Entropy(S) = -p+ log2(p+) - p- log2(p-)
where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
The information gain is simply the expected reduction in entropy caused by partitioning the examples according to this attribute. The information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}).
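As an illustration of these two formulas, here is a minimal Python sketch (not part of the original notes; the function names and the dictionary-per-example data layout are assumptions made for illustration):

```python
import math
from collections import Counter

def entropy(examples, target):
    """Entropy(S) = -sum over classes c of p_c * log2(p_c)."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, attribute, target):
    """Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder
```

ID3 evaluates information_gain for every candidate attribute at a node and splits on the attribute with the largest value.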
Problem Definition:
Build a decision tree using ID3 algorithm for the given training data in the table (Buy Computer data), and predict the
class of the following new example: age<=30, income=medium, student=yes, credit-rating=fair
Table: Buy Computer training data with attributes age, income, student, credit_rating and class label buys_computer.
Solution:
We choose the attribute that provides the highest Information Gain and split the training set on that attribute. To find it, we calculate the expected information needed to classify the whole set and the entropy of each attribute.
The information gain of an attribute is the expected information (entropy) of the two classes over the whole set minus the entropy of that attribute.
Since Age has the highest Information Gain we start splitting the dataset using the age attribute.
The same process of splitting has to happen for the two remaining branches.
Left sub-branch
For branch age<=30 we still have attributes income, student, and credit_rating. Which one should be used to split the
partition?
We can then safely split on attribute student without checking the other attributes since the information gain is
maximized.
Figure: Decision tree after step 2
Right sub-branch
The expected information for this branch is Entropy(S_age>40) = I(3,2) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97
For Income, we have two values, income=medium (2 yes and 1 no) and income=low (1 yes and 1 no):
Entropy(income) = 3/5 (-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) = 0.95
Gain(income) = 0.97 – 0.95 = 0.02
For Student, we have two values, student=yes (2 yes and 1 no) and student=no (1 yes and 1 no):
Entropy(student) = 3/5 (-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) = 0.95
Gain(student) = 0.97 – 0.95 = 0.02
For Credit_Rating, we have two values, credit_rating=fair (3 yes and 0 no) and credit_rating=excellent (0 yes and 2 no):
Entropy(credit_rating) = 0
Gain(credit_rating) = 0.97 – 0 = 0.97
Then we split based on credit_rating. These splits give partitions, each containing records from the same class, so we just make them into leaf nodes with their class label attached:
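The arithmetic above can be checked with a short Python snippet (a verification sketch, not part of the original notes):

```python
import math

def I(p, n):
    """Expected information I(p, n) for p positive and n negative examples."""
    total = p + n
    return -sum((x / total) * math.log2(x / total) for x in (p, n) if x > 0)

s = I(3, 2)                                # Entropy(S_age>40) ~ 0.971
e_income = 3/5 * I(2, 1) + 2/5 * I(1, 1)   # ~ 0.951
e_student = 3/5 * I(2, 1) + 2/5 * I(1, 1)  # ~ 0.951
e_credit = 3/5 * I(3, 0) + 2/5 * I(0, 2)   # 0.0
print(round(s - e_income, 2), round(s - e_student, 2), round(s - e_credit, 2))
# 0.02 0.02 0.97 -> credit_rating gives the largest gain on this branch
```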
Decision Tree for Buys Computer
New example: age<=30, income=medium, student=yes, credit_rating=fair. Following the tree from the root: age<=30 leads to the student test, and student=yes leads to the leaf buys_computer = yes, so the new example is classified as yes.
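The resulting tree can also be written directly as nested conditionals. The sketch below is an illustration only: the attribute value spellings ('<=30', '31..40', '>40') and the pure 'yes' leaf for the middle age range are taken from the standard version of this example rather than stated explicitly above.

```python
def buys_computer(age, income, student, credit_rating):
    """Hand-coded form of the decision tree learned in the worked example."""
    # income is not tested anywhere in the learned tree
    if age == "<=30":
        return "yes" if student == "yes" else "no"          # left branch splits on student
    if age == "31..40":
        return "yes"                                        # assumed pure 'yes' leaf
    return "yes" if credit_rating == "fair" else "no"       # age>40 branch splits on credit_rating

print(buys_computer("<=30", "medium", "yes", "fair"))  # -> yes
```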
Appropriate Problems for Decision Tree Learning
1. Instances are represented by attribute-value pairs: Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot).
2. The target function has discrete output values: The decision tree assigns a boolean or, more generally, a discrete-valued classification to each example.
3. Disjunctive descriptions may be required: Decision trees naturally represent disjunctive expressions.
4. The training data may contain errors: Decision tree learning methods are robust to errors, both errors in
classifications of the training examples and errors in the attribute values that describe these examples.
5. The training data may contain missing attribute values: Decision tree methods can be used even when some
training examples have unknown values (e.g., if the Humidity of the day is known for only some of the training
examples)
Gini Index:
Gini Index or Gini impurity measures the degree or probability of a particular variable being wrongly classified when it is randomly chosen. It is computed as
Gini = 1 - Σ_i (p_i)^2
where p_i is the probability of an element being classified into class i.
If all the elements belong to a single class, then the node can be called pure. The value of the Gini Index varies between 0 and 1, where
▪ '0' denotes that all elements belong to a certain class or there exists only one class (pure),
▪ '1' denotes that the elements are randomly distributed across various classes (impure), and
▪ a Gini Index of 0.5 denotes elements equally distributed over some classes.
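A minimal sketch of this calculation in Python (function and variable names are mine, not from the notes):

```python
from collections import Counter

def gini_index(labels):
    """Gini = 1 - sum over classes i of p_i**2."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_index(["yes"] * 10))               # 0.0 -> pure node
print(gini_index(["yes"] * 5 + ["no"] * 5))   # 0.5 -> two equally likely classes
```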
Issues in Decision tree learning
The basic decision tree learning algorithm grows each branch of the tree just deeply enough to perfectly classify the training
examples. While this is sometimes a reasonable strategy, in fact it can lead to difficulties when there is noise in the
data, or when the number of training examples is too small to produce a representative sample of the true target
function. In either of these cases, this simple algorithm can produce trees that overfit the training examples.
Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists
some alternative hypothesis h' ∈ H, such that h has smaller error than h' over the training examples, but h'
has a smaller error than h over the entire distribution of instances.
Overfitting is a significant practical difficulty for decision tree learning and many other learning methods. There are
several approaches to avoiding overfitting in decision tree learning. These can be grouped into two classes:
• approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the
training data.
• approaches that allow the tree to overfit the data, and then post-prune the tree.
The first of these approaches might seem more direct. The second approach of post-pruning overfit trees has been
found to be more successful in practice.
This is due to the difficulty in the first approach of estimating precisely when to stop growing the tree. Regardless of
whether the correct tree size is found by stopping early or by post-pruning, a key question is what criterion is to be used
to determine the correct final tree size.
Approaches include:
o Use a separate set of examples, distinct from the training examples, to evaluate the utility of post-
pruning nodes from the tree.
o Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set. For example, Quinlan (1986) uses a chi-square test to estimate whether further expanding a node is likely to improve performance over the entire instance distribution, or only on the current sample of training data (a minimal sketch of such a test follows this list).
o Use an explicit measure of the complexity for encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimized. This approach is based on a heuristic called the Minimum Description Length principle.
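As a rough illustration of the second approach above (statistical pre-pruning), the sketch below uses SciPy's chi-square test of independence between an attribute's values and the class labels at a node. The 0.05 significance level and the simple expand/don't-expand rule are my own assumptions, not Quinlan's exact procedure.

```python
import numpy as np
from scipy.stats import chi2_contingency

def worth_expanding(contingency, alpha=0.05):
    """Expand a node on this attribute only if attribute value and class label
    appear significantly associated on the examples reaching the node."""
    chi2, p_value, dof, expected = chi2_contingency(np.asarray(contingency))
    return p_value < alpha

# Rows = attribute values, columns = class counts (yes, no) at the node.
print(worth_expanding([[30, 0], [0, 20]]))    # True: strong association, keep expanding
print(worth_expanding([[20, 20], [30, 30]]))  # False: no association, stop growing here
```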
➢ How exactly might we use a validation set to prevent overfitting? One approach, called reduced-error pruning, is to consider each of the decision nodes in the tree to be a candidate for pruning. Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.
➢ Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.
This has the effect that any leaf node added due to coincidental regularities in the training set is likely to be
pruned because these same coincidences are unlikely to occur in the validation set.
➢ Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree accuracy over the validation set. Pruning of nodes continues until further pruning is harmful. The impact of reduced-error pruning on the accuracy of the decision tree is illustrated in the figure below; a minimal code sketch of the procedure is also given.
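A minimal sketch of reduced-error pruning over a toy tree structure. The Node class, the dictionary-per-example data layout, and the greedy loop are my own simplifications of the procedure described above, not a reference implementation.

```python
from collections import Counter

class Node:
    def __init__(self, attribute=None, branches=None, label=None, examples=None):
        self.attribute = attribute      # attribute tested here (None for a leaf)
        self.branches = branches or {}  # attribute value -> child Node
        self.label = label              # class label when the node is a leaf
        self.examples = examples or []  # training examples that reached this node

def majority(examples):
    return Counter(ex["class"] for ex in examples).most_common(1)[0][0]

def classify(node, example):
    while node.attribute is not None:
        child = node.branches.get(example[node.attribute])
        if child is None:               # unseen attribute value: stop here
            return majority(node.examples)
        node = child
    return node.label

def accuracy(tree, data):
    return sum(classify(tree, ex) == ex["class"] for ex in data) / len(data)

def internal_nodes(node):
    if node.attribute is not None:
        yield node
        for child in node.branches.values():
            yield from internal_nodes(child)

def reduced_error_prune(tree, validation):
    """Greedily replace internal nodes with majority-class leaves as long as
    validation-set accuracy does not decrease."""
    while True:
        best, best_acc = None, accuracy(tree, validation)
        for node in internal_nodes(tree):
            saved = (node.attribute, node.branches, node.label)
            node.attribute, node.branches, node.label = None, {}, majority(node.examples)
            acc = accuracy(tree, validation)                   # try pruning this node
            if acc >= best_acc:
                best, best_acc = node, acc
            node.attribute, node.branches, node.label = saved  # undo the trial prune
        if best is None:
            return tree
        best.attribute, best.branches, best.label = None, {}, majority(best.examples)
```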
C. Rule post-pruning
1. Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and
allowing overfitting to occur.
2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node (a code sketch of this step follows the list).
3. Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy.
4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
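A minimal sketch of step 2, turning each root-to-leaf path into a rule, using a nested-dictionary tree representation. The representation, the helper name, and the example tree (including its '31..40' leaf) are assumptions made for illustration.

```python
def extract_rules(tree, preconditions=()):
    """Yield (preconditions, class_label) pairs, one per root-to-leaf path.
    A tree is either a class label (leaf) or a dict {attribute: {value: subtree}}."""
    if not isinstance(tree, dict):                      # leaf: emit one rule
        yield list(preconditions), tree
        return
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        yield from extract_rules(subtree, preconditions + ((attribute, value),))

tree = {"age": {"<=30": {"student": {"yes": "yes", "no": "no"}},
                "31..40": "yes",
                ">40": {"credit_rating": {"fair": "yes", "excellent": "no"}}}}
for conditions, label in extract_rules(tree):
    print(" AND ".join(f"{a}={v}" for a, v in conditions) or "TRUE", "=>", label)
# e.g. age=<=30 AND student=yes => yes
```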
Why convert the decision tree to rules before pruning? There are three main advantages.
▪ Converting to rules allows distinguishing among the different contexts in which a decision node is used.
Because each distinct path through the decision tree node produces a distinct rule, the pruning decision
regarding that attribute test can be made differently for each path. In contrast, if the tree itself were pruned, the
only two choices would be to remove the decision node completely, or to retain it in its original form.
▪ Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those
that occur near the leaves. Thus, we avoid messy bookkeeping issues such as how to reorganize the tree if the
root node is pruned while retaining part of the subtree below this test.
▪ Converting to rules improves readability. Rules are often easier for people to understand.
Incorporating Continuous-Valued Attributes
Our initial definition of ID3 is restricted to attributes that take on a discrete set of values.
❖ First, the target attribute whose value is predicted by the learned tree must be discrete valued. Second, the
attributes tested in the decision nodes of the tree must also be discrete valued.
❖ This second restriction can easily be removed so that continuous-valued decision attributes can be incorporated
into the learned tree. This can be accomplished by dynamically defining new discrete valued attributes that
partition the continuous attribute value into a discrete set of intervals.
❖ Example: Suppose we wish to include the continuous-valued attribute Temperature. Suppose further that the training examples associated with a particular node in the decision tree have the following values for Temperature and the target attribute PlayTennis:
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
i) Pick the threshold, c, that produces the greatest information gain. Candidate thresholds are found by sorting the examples according to the continuous attribute A and identifying adjacent examples that differ in their target classification; the candidate thresholds lie midway between the attribute values of such pairs (see the sketch after this list).
ii) In the current example, there are two candidate thresholds, corresponding to the values of Temperature at
which the value of Play Tennis changes: (48 + 60)/2, and (80 + 90)/2.
iii) The information gain can then be computed for each of the candidate attributes, Temperature >54, and
Temperature >85 and the best can be selected (Temperature >54)
iv) This dynamically created boolean attribute can then compete with the other discrete-valued candidate
attributes available for growing the decision tree.
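A minimal sketch of this threshold-selection step, using the Temperature / PlayTennis values from the example above (the helper names are mine):

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Candidate thresholds lie midway between adjacent (sorted) values whose
    class labels differ; return the one with the highest information gain."""
    pairs = sorted(zip(values, labels))
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1)
                  if pairs[i][1] != pairs[i + 1][1]]
    def gain(c):
        below = [label for value, label in pairs if value <= c]
        above = [label for value, label in pairs if value > c]
        remainder = (len(below) * entropy(below) + len(above) * entropy(above)) / len(pairs)
        return entropy(labels) - remainder
    return max(candidates, key=gain)

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temperature, play_tennis))   # 54.0 -> new boolean attribute Temperature > 54
```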
Handling Attributes with Differing Costs
❖ In some learning tasks the instance attributes may have associated costs. For example, in learning to classify medical diseases we might describe patients in terms of attributes such as Temperature, Biopsy Result, Pulse, Blood Test Results, etc.
❖ These attributes vary significantly in their costs, both in terms of monetary cost and cost to patient comfort. In
such tasks, we would prefer decision trees that use low-cost attributes where possible, relying on high-cost
attributes only when needed to produce reliable classifications.
❖ ID3 can be modified to take into account attribute costs by introducing a cost term into the attribute selection measure. For example, we might divide the Gain by the cost of the attribute, so that lower-cost attributes would be preferred. Such cost-sensitive measures do not guarantee finding an optimal cost-sensitive decision tree, but they do bias the search in favor of low-cost attributes.
❖ In one such approach (Tan and Schlimmer, applied to a robot perception task), attribute cost is measured by the number of seconds required to obtain the attribute value by positioning and operating the sonar. They demonstrate that more efficient recognition strategies are learned, without sacrificing classification accuracy, by replacing the information gain attribute selection measure by the following measure:
Gain^2(S, A) / Cost(A)
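A minimal sketch of attribute selection with this cost-sensitive measure; the gain values and costs below are made-up illustrative numbers, not taken from the notes.

```python
def cost_sensitive_score(gain, cost):
    """Score an attribute by Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

# Hypothetical attributes with (information gain, cost in seconds to obtain).
attributes = {"Temperature": (0.25, 1.0), "BloodTestResult": (0.40, 30.0), "Pulse": (0.20, 2.0)}
best = max(attributes, key=lambda a: cost_sensitive_score(*attributes[a]))
print(best)  # Temperature: the cheap test wins despite its lower raw gain
```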
Handling Training Examples with Missing Attribute Values
In certain cases, the available data may be missing values for some attributes. For example, in a medical domain in
which we wish to predict patient outcome based on various laboratory tests, it may be that the lab test Blood-Test-Result
is available only for a subset of the patients. In such cases, it is common to estimate the missing attribute value based
on other examples for which this attribute has a known value.
❖ Consider the situation in which Gain(S, A) is to be calculated at node n in the decision tree to evaluate whether
the attribute A is the best attribute to test at this decision node. Suppose that (x, c(x)) is one of the training
examples in S and that the value A(x) is unknown.
❖ One strategy for dealing with the missing attribute value is to assign it the value that is most common among
training examples at node n. Alternatively, we might assign it the most common value among examples at node
n that have the classification c(x). The elaborated training example using this estimated value for A(x) can then
be used directly by the existing decision tree learning algorithm.
❖ A second, more complex procedure is to assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x). These probabilities can be estimated based on the observed frequencies of the various values for A among the examples at node n. A minimal sketch of both strategies is given below.
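A minimal sketch of the two strategies just described, for examples stored as dictionaries; the attribute names and toy data are assumptions made for illustration.

```python
from collections import Counter

def most_common_value(examples, attribute, target=None, target_value=None):
    """Strategy 1: most common known value of `attribute` among the examples at
    this node, optionally restricted to examples with classification `target_value`."""
    pool = [ex for ex in examples
            if ex.get(attribute) is not None
            and (target is None or ex[target] == target_value)]
    return Counter(ex[attribute] for ex in pool).most_common(1)[0][0]

def value_probabilities(examples, attribute):
    """Strategy 2: estimated probability of each known value of `attribute` at this node."""
    known = [ex[attribute] for ex in examples if ex.get(attribute) is not None]
    return {value: count / len(known) for value, count in Counter(known).items()}

node_examples = [{"humidity": "high", "class": "no"},
                 {"humidity": "high", "class": "no"},
                 {"humidity": "normal", "class": "yes"},
                 {"humidity": None, "class": "yes"}]       # missing Humidity value
print(most_common_value(node_examples, "humidity"))                  # 'high'
print(most_common_value(node_examples, "humidity", "class", "yes"))  # 'normal'
print(value_probabilities(node_examples, "humidity"))                # high ~0.67, normal ~0.33
```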
Alternative Measures for Selecting Attributes
❖ There is a natural bias in the information gain measure that favors attributes with many values over those with few values.
❖ As an extreme example, consider the attribute Date, which has a very large number of possible values. What is wrong with the attribute Date?
❖ Simply put, it has so many possible values that it is bound to separate the training examples into very small subsets. Because of this, it will have a very high information gain relative to the training examples. Despite having very high information gain, it is a very poor predictor of the target function over unseen instances.
METHOD-1: The gain ratio measure penalizes attributes such as Date by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data:
SplitInformation(S, A) = - Σ_{i=1}^{c} |S_i|/|S| log2(|S_i|/|S|)
where S1 through Sc are the c subsets of examples resulting from partitioning S by the c-valued attribute A. The GainRatio measure is defined in terms of the earlier Gain measure, as well as this SplitInformation, as follows:
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
The SplitInformation term discourages the selection of attributes with many uniformly distributed values (e.g., Date).
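A minimal sketch of SplitInformation, showing why a Date-like attribute is penalized (names and the toy data are mine):

```python
import math
from collections import Counter

def split_information(examples, attribute):
    """SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|)."""
    total = len(examples)
    counts = Counter(ex[attribute] for ex in examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(gain, split_info):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_info if split_info > 0 else 0.0

# A Date-like attribute with one distinct value per example has SplitInformation = log2(n),
# so even a large Gain is divided by a large denominator.
examples = [{"date": f"day{i}", "wind": "weak" if i < 7 else "strong"} for i in range(14)]
print(split_information(examples, "date"))  # log2(14) ~ 3.81
print(split_information(examples, "wind"))  # 1.0 (two equal-sized subsets)
```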
Hypothesis Space Search by ID3
As with other inductive learning methods, ID3 can be characterized as searching a space of hypotheses for one that fits
the training examples. The hypothesis space searched by ID3 is the set of possible decision trees. ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with the empty tree, then considering
progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data. The
evaluation function that guides this hill-climbing search is the information gain measure. This search is depicted in Figure 3.5. By viewing ID3 in terms of its search space and search strategy, we can get some insight into its capabilities and limitations.
1. ID3 maintains only a single current hypothesis as it searches through the space of decision trees. This contrasts, for example, with the earlier version space CANDIDATE-ELIMINATION method, which maintains the set of all hypotheses consistent with the available training examples. By determining only a single hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses. For example, it does not have the ability to determine how many alternative decision trees are consistent with the available training data, or to pose new instance queries that optimally resolve among these competing hypotheses.
2. ID3 in its pure form performs no backtracking in its search. Once it selects an attribute to test at a particular level in
the tree, it never backtracks to reconsider this choice. Therefore, it is susceptible to the usual risks of hill-climbing search
without backtracking: converging to locally optimal solutions that are not globally optimal. In the case of ID3, a locally
optimal solution corresponds to the decision tree it selects along the single search path it explores. However, this locally
optimal solution may be less desirable than trees that would have been encountered along a different branch of the
search. Below we discuss an extension that adds a form of backtracking (post-pruning the decision tree).
3. ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis. This contrasts with methods that make decisions incrementally, based on individual training examples (e.g., FIND-S, CANDIDATE-ELIMINATION). One advantage of using statistical properties of all the examples (e.g., information gain) is that the resulting search is much less sensitive to errors in individual training examples. ID3 can be easily extended to handle noisy training data by modifying its termination criterion to accept hypotheses that imperfectly fit the training data.
Space complexity
Space complexity measures how much memory an algorithm needs to run, in terms of the input size. An ML program cannot be run successfully if the ML algorithm loads too much data into the working memory of a machine.
Sample complexity
Sample complexity measures the number of training examples needed to train the model in order to guarantee valid generalization. For example, a deep neural network has high sample complexity, since a lot of training data is needed to train it.
Bias-variance tradeoff
Different ML algorithms have different bias-variance tradeoffs. The bias errors come from the fact that a model is biased towards a specific solution or assumption. For example, if a linear decision boundary is fit to nonlinear data, the bias will be high. Variance, on the other hand, measures the errors coming from the variance of the model: it is the average squared difference between the prediction of a model and the prediction of the expected (average) model. A small simulation sketch of this tradeoff is given below.
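A small simulation sketch of this tradeoff (the data-generating function, sample sizes, and model degrees are arbitrary choices for illustration): fitting a straight line to quadratic data over many resampled training sets gives predictions whose average is far from the truth (high bias) but which vary relatively little, while a high-degree polynomial behaves the opposite way.

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: x ** 2        # nonlinear target function
x0 = 0.8                          # point at which we study the predictions

def predictions(degree, runs=500, n=30, noise=0.3):
    preds = []
    for _ in range(runs):
        x = rng.uniform(-1, 1, n)
        y = true_fn(x) + rng.normal(0, noise, n)
        coef = np.polyfit(x, y, degree)          # least-squares polynomial fit
        preds.append(np.polyval(coef, x0))
    return np.array(preds)

for degree in (1, 9):
    p = predictions(degree)
    bias_sq = (p.mean() - true_fn(x0)) ** 2      # squared bias at x0
    variance = p.var()                           # spread of predictions at x0
    print(f"degree {degree}: bias^2={bias_sq:.3f}, variance={variance:.3f}")
# The linear fit (degree 1) typically shows much higher bias and lower variance
# than the flexible degree-9 fit.
```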
Online and offline learning
Online and offline learning refer to the way a machine learning system updates its model. Online learning means training data can be presented one example at a time, so that parameters can be updated immediately when new data become available. Offline learning requires the training to start over again (re-training the whole model) when new data are presented in order to update the parameters.
ID3 Decision tree algorithm is an example of offline learning. The way ID3 works is to look at the data globally and do
a greedy search to maximize information gain.
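A small sketch of the difference, assuming scikit-learn is available: SGDClassifier supports online updates through partial_fit, whereas a decision tree is an offline learner and must be refit on the full dataset when new examples arrive.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(100, 2)), rng.integers(0, 2, 100)
X_new, y_new = rng.normal(size=(10, 2)), rng.integers(0, 2, 10)

# Online learning: update the existing model with only the new batch.
online = SGDClassifier()
online.partial_fit(X_old, y_old, classes=np.array([0, 1]))
online.partial_fit(X_new, y_new)          # parameters updated incrementally

# Offline learning: retrain from scratch on old + new data.
offline = DecisionTreeClassifier()
offline.fit(np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))
```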
Parallelizability
A parallel algorithm is one that can perform multiple operations at the same time. This can be done by distributing the workload across different workers, such as processors in a single machine or multiple machines. The nature of the k-nearest neighbors (k-NN) model allows it to be easily run on multiple machines at the same time.
Parametricity
The concept of parametricity is widely used in the field of statistical learning. A parametric model is one in which the number of parameters is fixed, whereas the number of parameters of a non-parametric model grows as more data become available. Another way of defining a parametric model is based on its underlying assumptions about the shape of the probability distribution of the data: if no such assumption is made, then it is a non-parametric model. Parametric models are very common in machine learning; examples are linear regression, neural networks and many other ML models. k-NN and SVM (support vector machines), on the other hand, are non-parametric models.