ML Unit-2.1

UNIT-II: Decision Tree Learning: Representing concepts as decision trees, Recursive induction of decision trees,

Picking the best splitting attribute: entropy and information gain, searching for simple trees and computational
complexity, Occam's razor, Overfitting, noisy data, and pruning. Experimental Evaluation of Learning Algorithms:
Measuring the accuracy of learned hypotheses. Comparing learning algorithms: cross-validation, learning curves, and
statistical hypothesis testing.

Decision Tree Learning


• Decision Tree is a Supervised learning technique that can be used for both classification and Regression
problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.
• In a decision tree, there are two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• It is a graphical representation for getting all the possible solutions to a problem/decision based on given
conditions.
• In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree
algorithm.

Why use Decision Trees

Choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. There are two main reasons for using a decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies

1. Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which further gets
divided into two or more homogeneous sets.
2. Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a leaf
node.
3. Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given
conditions.
4. Branch/Sub Tree: A subtree formed by splitting the tree.
5. Pruning: Pruning is the process of removing unwanted branches from the tree.
6. Parent/Child node: A node that is divided into sub-nodes is called the parent node, and the sub-nodes are called its child nodes.
Features of Decision Tree Learning
Method for approximating discrete-valued functions (including boolean)
Learned functions are represented as decision trees (or if-then-else rules)
Expressive hypothesis space, including disjunction
The decision tree is robust to noisy data
Representation of Decision Tree Learning
➢ Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.
➢ Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that
node corresponds to one of the possible values for this attribute.
➢ An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then
moving down the tree branch corresponding to the value of the attribute in the given example. This process is
repeated for the subtree rooted at the new node.
➢ Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances.
➢ Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a
disjunction of these conjunctions.

1. Outlook=Sunny, Humidity=High –> No


2. Outlook=Sunny, Humidity=Normal –> Yes
3. Outlook=Overcast –> Yes
4. Outlook=Rain, Wind=Strong –> No
5. Outlook=Rain, Wind=Weak–> Yes
A new instance is then classified using the above rules.
Example:

<Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong> —> No
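The rules above can be read directly as code. Below is a minimal Python sketch (an illustration added here, not part of the original notes) that encodes this tree as nested conditions; the dictionary keys used for the attributes are an assumed representation.

```python
# Minimal sketch: the PlayTennis decision tree above encoded as nested conditions.
# The dictionary keys (Outlook, Humidity, Wind) are assumed attribute names.
def play_tennis(instance):
    if instance["Outlook"] == "Sunny":
        return "Yes" if instance["Humidity"] == "Normal" else "No"
    if instance["Outlook"] == "Overcast":
        return "Yes"
    # Outlook == "Rain"
    return "No" if instance["Wind"] == "Strong" else "Yes"

# The new instance from the example:
print(play_tennis({"Outlook": "Sunny", "Temp": "Hot",
                   "Humidity": "High", "Wind": "Strong"}))   # -> No
```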

Appropriate Problems for Decision Tree Learning


1. Instances are represented by attribute-value pairs.
A fixed set of attributes, such as Temperature, and their values, such as Hot, are used to characterize instances. The easiest situation for decision tree learning is when each attribute takes a small number of disjoint possible values (e.g., Hot, Mild, Cold). Extensions to the basic algorithm also allow handling real-valued attributes (for example, representing Temperature numerically).
2. The target function has discrete output values.
Decision trees are most commonly used for Boolean classification, such as yes or no. Decision tree approaches extend easily to learning functions with more than two possible output values. A more substantial extension allows learning target functions with real-valued outputs, although decision trees are less frequently used in this setting.
3. Disjunctive descriptions may be required. Decision trees naturally represent disjunctive expressions.
4. The training data may contain errors.
Decision tree learning methods are robust to errors, both errors in the classifications of the training examples and errors in the attribute values that describe these examples.
5. The training data may contain missing attribute values.
Decision tree approaches can be utilized even when some training examples have unknown values (for instance, if only part of the training examples have data on the day's Humidity).
Entropy And Information Gain in Decision Tree Learning
"Which attribute is the best classifier?" is the first question that needs to be addressed while building a decision tree.
The ID3 algorithm's essential decision is which attribute to test at each node in the tree.
We want to choose the attribute that is most useful for classifying examples. For this, we define information gain, a statistical property that measures how well a given attribute separates the training examples according to their target classification.
ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree.

ENTROPY MEASURES THE HOMOGENEITY OF EXAMPLES


Entropy characterizes the (im)purity of an arbitrary collection of examples.
Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is

Entropy(S) = -p+ log2(p+) - p- log2(p-)

where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.

In all calculations involving entropy, we define 0 log 0 to be 0.
INFORMATION GAIN MEASURES THE EXPECTED REDUCTION IN ENTROPY
Given entropy as a measure of the impurity of a collection of training examples, we can now define a measure of the effectiveness of an attribute in classifying the training data.

The information gain is simply the expected reduction in entropy caused by partitioning the examples according to this attribute. The information gain, Gain(S, A), of an attribute A relative to a collection of examples S is defined as

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}).
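The two formulas above translate directly into code. The following is a minimal plain-Python sketch (added for illustration; the representation of examples as (attribute_dict, label) pairs is an assumption).

```python
# A minimal sketch of the entropy and information gain formulas above.
from math import log2
from collections import Counter

def entropy(examples):
    """Entropy(S) = -sum_i p_i log2 p_i over the class labels in S."""
    n = len(examples)
    counts = Counter(label for _, label in examples)
    return -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)

def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(examples)
    # Partition S into subsets S_v, one per observed value v of the attribute.
    subsets = {}
    for x, label in examples:
        subsets.setdefault(x[attribute], []).append((x, label))
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(examples) - remainder
```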

ID3 Algorithm Decision Tree – Solved Example – Machine Learning

Problem Definition:
Build a decision tree using ID3 algorithm for the given training data in the table (Buy Computer data), and predict the
class of the following new example: age<=30, income=medium, student=yes, credit-rating=fair
age       income   student   credit_rating   buys_computer
<=30      high     no        fair            no
<=30      high     no        excellent       no
31...40   high     no        fair            yes
>40       medium   no        fair            yes
>40       low      yes       fair            yes
>40       low      yes       excellent       no
31...40   low      yes       excellent       yes
<=30      medium   no        fair            no
<=30      low      yes       fair            yes
>40       medium   yes       fair            yes
<=30      medium   yes       excellent       yes
31...40   medium   no        excellent       yes
31...40   high     yes       fair            yes
>40       medium   no        excellent       no

Solution:
We select the attribute that provides the highest information gain and split the training set on that attribute. To do this, we first compute the expected information (entropy) required to classify the whole set, and then the expected entropy after splitting on each attribute; the information gain is the difference between the two.

The entropy of the full set (9 yes, 5 no) is

Entropy(S) = E(9,5) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94


Now consider the Age attribute.
For Age, we have three values: age <= 30 (2 yes and 3 no), age 31...40 (4 yes and 0 no), and age > 40 (3 yes and 2 no).
Entropy(age) = 5/14 (-2/5 log2(2/5) - 3/5 log2(3/5)) + 4/14 (0) + 5/14 (-3/5 log2(3/5) - 2/5 log2(2/5))
             = 5/14 (0.9709) + 0 + 5/14 (0.9709) = 0.6935
Gain(age) = 0.94 - 0.6935 = 0.2465

Next, consider Income Attribute


For Income, we have three values: income = high (2 yes and 2 no), income = medium (4 yes and 2 no), and income = low (3 yes and 1 no).
Entropy(income) = 4/14 (-2/4 log2(2/4) - 2/4 log2(2/4)) + 6/14 (-4/6 log2(4/6) - 2/6 log2(2/6)) + 4/14 (-3/4 log2(3/4) - 1/4 log2(1/4))
                = 4/14 (1) + 6/14 (0.918) + 4/14 (0.811)
                = 0.285714 + 0.393428 + 0.231714 = 0.9108
Gain(income) = 0.94 - 0.9108 = 0.0292

Next, consider Student Attribute


For Student, we have two values: student = yes (6 yes and 1 no) and student = no (3 yes and 4 no).
Entropy(student) = 7/14 (-6/7 log2(6/7) - 1/7 log2(1/7)) + 7/14 (-3/7 log2(3/7) - 4/7 log2(4/7))
                 = 7/14 (0.5916) + 7/14 (0.9852)
                 = 0.2958 + 0.4926 = 0.7884
Gain(student) = 0.94 - 0.7884 = 0.1516

Finally, consider Credit_Rating Attribute


For Credit_Rating, we have two values: credit_rating = fair (6 yes and 2 no) and credit_rating = excellent (3 yes and 3 no).
Entropy(credit_rating) = 8/14 (-6/8 log2(6/8) - 2/8 log2(2/8)) + 6/14 (-3/6 log2(3/6) - 3/6 log2(3/6))
                       = 8/14 (0.8112) + 6/14 (1)
                       = 0.4635 + 0.4285 = 0.8920
Gain(credit_rating) = 0.94 - 0.8920 = 0.048

Since Age has the highest Information Gain we start splitting the dataset using the age attribute.

Decision Tree after step 1


Since all records under the branch age 31...40 belong to the same class (Yes), we can replace that branch with a leaf labeled Class = Yes.
Decision Tree after step 1_1

Now build the decision tree for the left subtree

The same process of splitting has to happen for the two remaining branches.

Left sub-branch
For branch age<=30 we still have attributes income, student, and credit_rating. Which one should be used to split the
partition?

The entropy of this subset is Entropy(S_age<=30) = E(2,3) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97


For Income, we have three values: income = high (0 yes and 2 no), income = medium (1 yes and 1 no), and income = low (1 yes and 0 no).
Entropy(income) = 2/5 (0) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) + 1/5 (0) = 2/5 (1) = 0.4
Gain(income) = 0.97 - 0.4 = 0.57
For Student, we have two values: student = yes (2 yes and 0 no) and student = no (0 yes and 3 no).
Entropy(student) = 2/5 (0) + 3/5 (0) = 0
Gain(student) = 0.97 - 0 = 0.97

We can then safely split on attribute student without checking the other attributes since the information gain is
maximized.

Decision Tree after step 2


Since each of these two new branches contains records of a single class, we make them into leaf nodes with their respective class labels:

Decision Tree after step 2_2

Now build the decision tree for the right subtree.

Right sub-branch
The entropy of this subset is Entropy(S_age>40) = E(3,2) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97

For Income, we have two values: income = medium (2 yes and 1 no) and income = low (1 yes and 1 no).
Entropy(income) = 3/5 (-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2))
                = 3/5 (0.9182) + 2/5 (1) = 0.55 + 0.4 = 0.95
Gain(income) = 0.97 - 0.95 = 0.02

For Student, we have two values: student = yes (2 yes and 1 no) and student = no (1 yes and 1 no).
Entropy(student) = 3/5 (-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2)) = 0.95
Gain(student) = 0.97 - 0.95 = 0.02

For Credit_Rating, we have two values: credit_rating = fair (3 yes and 0 no) and credit_rating = excellent (0 yes and 2 no).
Entropy(credit_rating) = 0
Gain(credit_rating) = 0.97 - 0 = 0.97

We therefore split on credit_rating. These splits give partitions in which all records belong to the same class, so we just need to make them into leaf nodes with their class labels attached:
Decision Tree for Buys Computer
New example: age<=30, income=medium, student=yes, credit-rating=fair

Following the branch age <= 30 and then student = yes, we predict Class = yes; that is, Buys_computer = yes.
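As a quick check of the hand calculation above, the following self-contained Python script (an added illustration, not part of the original notes) recomputes Gain(S, A) for each attribute of the Buys Computer data and confirms that Age has the highest information gain.

```python
# Hedged, self-contained check of the root split in the worked example.
from math import log2
from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31-40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31-40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no",  "excellent", "yes"),
    ("31-40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

for i, name in enumerate(attributes):
    print(f"Gain({name}) = {gain(rows, i):.4f}")
# Expected output (matching the hand calculation, up to rounding):
# Gain(age) = 0.2467, Gain(income) = 0.0292,
# Gain(student) = 0.1518, Gain(credit_rating) = 0.0481
```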

Appropriate problems for Decision tree learning


1. Instances are represented by attribute-value pairs:
➢ Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot).
➢ The easiest situation for decision tree learning is each attribute takes on a small number of disjoint possible
values (e.g., Hot, Mild, Cold).
➢ Extensions to the basic algorithm allow handling real-valued attributes as well (e.g., representing Temperature
numerically).

2. The target function has discrete output values:


➢ Decision trees are most often applied to Boolean classification examples (e.g., yes or no).
➢ Decision tree methods easily extend to learning functions with more than two possible output values.
➢ A more substantial extension allows learning target functions with real-valued outputs, though the application
of decision trees in this setting is less common.

3. Disjunctive descriptions may be required: Decision trees naturally represent disjunctive expressions.

4. The training data may contain errors: Decision tree learning methods are robust to errors, both errors in
classifications of the training examples and errors in the attribute values that describe these examples.

5. The training data may contain missing attribute values: Decision tree methods can be used even when some
training examples have unknown values (e.g., if the Humidity of the day is known for only some of the training
examples)
Gini Index:
Gini Index, or Gini impurity, measures the degree or probability of a particular element being wrongly classified when it is randomly chosen. It is defined as

Gini = 1 - Σ_i (p_i)^2

where p_i is the probability of an element belonging to class i.

If all the elements belong to a single class, then the node can be called pure. The value of the Gini Index varies between 0 and 1, where:

▪ '0' denotes that all elements belong to a certain class or there exists only one class (pure), and
▪ '1' denotes that the elements are randomly distributed across various classes (impure).
▪ A Gini Index of 0.5 denotes elements equally distributed across some classes.
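A minimal plain-Python sketch of the Gini impurity formula above (the function name is an assumption added for illustration):

```python
# Gini = 1 - sum_i p_i^2 over the class proportions in `labels`.
from collections import Counter

def gini_index(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_index(["yes"] * 10))              # 0.0  -> pure node
print(gini_index(["yes"] * 5 + ["no"] * 5))  # 0.5  -> two equally likely classes
```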
Issues in Decision tree learning

A. Avoiding Overfitting the Data

The algorithm (DECISION TREE) grows each branch of the tree just deeply enough to perfectly classify the training
examples. While this is sometimes a reasonable strategy, in fact it can lead to difficulties when there is noise in the
data, or when the number of training examples is too small to produce a representative sample of the true target
function. In either of these cases, this simple algorithm can produce trees that overfit the training examples.

Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists
some alternative hypothesis h' ∈ H, such that h has smaller error than h' over the training examples, but h'
has a smaller error than h over the entire distribution of instances.

Overfitting is a significant practical difficulty for decision tree learning and many other learning methods. There are
several approaches to avoiding overfitting in decision tree learning. These can be grouped into two classes:

• approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the
training data.

• approaches that allow the tree to overfit the data, and then post-prune the tree.

The first of these approaches might seem more direct. The second approach of post-pruning overfit trees has been
found to be more successful in practice.

This is due to the difficulty in the first approach of estimating precisely when to stop growing the tree. Regardless of
whether the correct tree size is found by stopping early or by post-pruning, a key question is what criterion is to be used
to determine the correct final tree size.

Approaches include:

o Use a separate set of examples, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree.

o Use all the available data for training, but apply a statistical test to estimate whether expanding (or
pruning) a particular node is likely to produce an improvement beyond the training set. For example,
Quinlan (1986) uses a chi-square test to estimate whether further expanding a node is likely to improve
performance over the entire instance distribution, or only on the current sample of training data.

o Use an explicit measure of the complexity for encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimized. This approach is based on a heuristic called the Minimum Description Length (MDL) principle.

B. Reduced error pruning

➢ How exactly might we use a validation set to prevent overfitting? One approach, called reduced-error pruning, is to consider each of the decision nodes in the tree as a candidate for pruning. Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.

➢ Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set. This has the effect that any leaf node added due to coincidental regularities in the training set is likely to be pruned, because these same coincidences are unlikely to occur in the validation set.
➢ Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree's accuracy over the validation set. Pruning of nodes continues until further pruning is harmful. A simplified code sketch of this procedure is given below.
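The following is a hedged sketch of a simplified, bottom-up variant of reduced-error pruning (the text describes an iterative, best-node-first procedure). The nested-dict tree representation and all helper names are assumptions added for illustration, not part of the original material.

```python
# Assumed tree format: internal nodes are dicts
#   {"attr": name, "branches": {value: subtree}, "majority": class_label}
# and leaves are dicts {"leaf": class_label}.
# Validation data is a list of (attribute_dict, class_label) pairs.

def classify(node, example):
    while "leaf" not in node:
        child = node["branches"].get(example.get(node["attr"]))
        if child is None:                 # unseen attribute value: use majority
            return node["majority"]
        node = child
    return node["leaf"]

def accuracy(root, validation):
    return sum(classify(root, x) == y for x, y in validation) / len(validation)

def reduced_error_prune(node, root, validation):
    if "leaf" in node:
        return
    for child in node["branches"].values():           # prune children first
        reduced_error_prune(child, root, validation)
    before = accuracy(root, validation)
    saved = dict(node)                                 # keep subtree to allow undo
    node.clear()
    node["leaf"] = saved["majority"]                   # tentatively make it a leaf
    if accuracy(root, validation) < before:            # pruning hurt accuracy:
        node.clear()                                   # restore the subtree
        node.update(saved)

# Usage: reduced_error_prune(tree, tree, validation_set)
```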

C. Rule post-pruning

Rule post-pruning involves the following steps:

1. Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and
allowing overfitting to occur.

2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node (a small sketch of this step is shown after the list).

3. Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy.

4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
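A minimal sketch of step 2 (one rule per root-to-leaf path), reusing the assumed nested-dict tree format from the pruning sketch above; the PlayTennis tree used as input is reconstructed from the rules listed earlier in these notes.

```python
def tree_to_rules(node, preconditions=()):
    """Return a list of (preconditions, class_label) rules, one per leaf."""
    if "leaf" in node:
        return [(list(preconditions), node["leaf"])]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, preconditions + ((node["attr"], value),))
    return rules

# Example with the PlayTennis tree from earlier (structure assumed):
tree = {"attr": "Outlook", "majority": "Yes", "branches": {
    "Sunny": {"attr": "Humidity", "majority": "No",
              "branches": {"High": {"leaf": "No"}, "Normal": {"leaf": "Yes"}}},
    "Overcast": {"leaf": "Yes"},
    "Rain": {"attr": "Wind", "majority": "Yes",
             "branches": {"Strong": {"leaf": "No"}, "Weak": {"leaf": "Yes"}}},
}}
for conditions, label in tree_to_rules(tree):
    print(" AND ".join(f"{a}={v}" for a, v in conditions), "->", label)
```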

Why convert the decision tree to rules before pruning? There are three main advantages.

▪ Converting to rules allows distinguishing among the different contexts in which a decision node is used.
Because each distinct path through the decision tree node produces a distinct rule, the pruning decision
regarding that attribute test can be made differently for each path. In contrast, if the tree itself were pruned, the
only two choices would be to remove the decision node completely, or to retain it in its original form.

▪ Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those
that occur near the leaves. Thus, we avoid messy bookkeeping issues such as how to reorganize the tree if the
root node is pruned while retaining part of the subtree below this test.

▪ Converting to rules improves readability. Rules are often easier for people to understand.

D. Incorporating continuous-valued attributes

Our initial definition of ID3 is restricted to attributes that take on a discrete set of values.

❖ First, the target attribute whose value is predicted by the learned tree must be discrete valued. Second, the
attributes tested in the decision nodes of the tree must also be discrete valued.
❖ This second restriction can easily be removed so that continuous-valued decision attributes can be incorporated
into the learned tree. This can be accomplished by dynamically defining new discrete valued attributes that
partition the continuous attribute value into a discrete set of intervals.

❖ Example: Suppose we wish to include the continuous-valued attribute Temperature. Suppose further that the training examples associated with a particular node in the decision tree have the following values for Temperature and the target attribute PlayTennis:

Temperature: 40   48   60   72   80   90
PlayTennis:  No   No   Yes  Yes  Yes  No

i) Pick the threshold, c, that produces the greatest information gain. Candidate thresholds are found by sorting the examples according to the continuous attribute A and identifying adjacent examples that differ in their target classification; the midpoints of such adjacent values are the candidates.
ii) In the current example, there are two candidate thresholds, corresponding to the values of Temperature at which the value of PlayTennis changes: (48 + 60)/2 = 54 and (80 + 90)/2 = 85.
iii) The information gain can then be computed for each of the candidate attributes, Temperature > 54 and Temperature > 85, and the best can be selected (Temperature > 54).
iv) This dynamically created boolean attribute can then compete with the other discrete-valued candidate attributes available for growing the decision tree. A short sketch of this threshold search is given below.
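A hedged sketch of the threshold search described in steps i)-iv), using the Temperature/PlayTennis values from the example; the helper names are assumptions added for illustration.

```python
from math import log2
from collections import Counter

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

def entropy(ys):
    n = len(ys)
    return -sum(c / n * log2(c / n) for c in Counter(ys).values())

def gain_for_threshold(threshold):
    below = [y for t, y in zip(temps, labels) if t <= threshold]
    above = [y for t, y in zip(temps, labels) if t > threshold]
    remainder = (len(below) * entropy(below) + len(above) * entropy(above)) / len(labels)
    return entropy(labels) - remainder

# Candidate thresholds: midpoints between adjacent (sorted) values whose class changes.
candidates = [(a + b) / 2 for (a, ya), (b, yb) in
              zip(zip(temps, labels), zip(temps[1:], labels[1:])) if ya != yb]
print(candidates)                                        # [54.0, 85.0]
best = max(candidates, key=gain_for_threshold)
print(best, round(gain_for_threshold(best), 3))          # 54.0 0.377
```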

E. Handling attributes with differing costs

❖ In some learning tasks the instance attributes may have associated costs. For example, in learning to classify
medical diseases we might describe patients in terms of attributes such as Temperature, Biopsy Result, Pulse,
Blood Test Results, etc.

❖ These attributes vary significantly in their costs, both in terms of monetary cost and cost to patient comfort. In
such tasks, we would prefer decision trees that use low-cost attributes where possible, relying on high-cost
attributes only when needed to produce reliable classifications.

❖ ID3 can be modified to take attribute costs into account by introducing a cost term into the attribute selection measure. For example, we might divide the Gain by the cost of the attribute, so that lower-cost attributes would be preferred. Such cost-sensitive measures do not guarantee finding an optimal cost-sensitive decision tree, but they do bias the search in favor of low-cost attributes.

❖ In one robot perception task described by Tan and Schlimmer, attribute cost is measured by the number of seconds required to obtain the attribute value by positioning and operating a sonar. They demonstrate that more efficient recognition strategies are learned, without sacrificing classification accuracy, by replacing the information gain attribute selection measure with Gain^2(S, A) / Cost(A).

F. Handling training examples with missing attribute values

In certain cases, the available data may be missing values for some attributes. For example, in a medical domain in
which we wish to predict patient outcome based on various laboratory tests, it may be that the lab test Blood-Test-Result
is available only for a subset of the patients. In such cases, it is common to estimate the missing attribute value based
on other examples for which this attribute has a known value.
❖ Consider the situation in which Gain(S, A) is to be calculated at node n in the decision tree to evaluate whether
the attribute A is the best attribute to test at this decision node. Suppose that (x, c(x)) is one of the training
examples in S and that the value A(x) is unknown.

❖ One strategy for dealing with the missing attribute value is to assign it the value that is most common among
training examples at node n. Alternatively, we might assign it the most common value among examples at node
n that have the classification c(x). The elaborated training example using this estimated value for A(x) can then
be used directly by the existing decision tree learning algorithm.

❖ A second, more complex procedure is to assign a probability to each of the possible values of A rather than
simply assigning the most common value to A(x). These probabilities can be estimated again based on the
observed frequencies of the various values for A among the examples at node n.
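A minimal sketch (assumed names, plain Python) of the first strategy: estimating a missing value as the most common value among examples at node n, optionally restricted to examples with the same classification c(x).

```python
from collections import Counter

def most_common_value(examples, attribute, same_class=None):
    """examples: list of (attribute_dict, class_label) pairs at node n."""
    observed = [x[attribute] for x, y in examples
                if x.get(attribute) is not None
                and (same_class is None or y == same_class)]
    return Counter(observed).most_common(1)[0][0]

def fill_missing(example, label, examples, attribute):
    # Fill A(x) with the most common value among examples at this node
    # that share the classification c(x).
    if example.get(attribute) is None:
        example[attribute] = most_common_value(examples, attribute, same_class=label)
    return example
```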

G. Alternative Measures for Selecting Attributes

❖ There is a natural bias in the information gain measure that favors attributes with many values over those
with few values.
❖ As an extreme example, consider the attribute Date, which has a very large number of possible values. What is wrong with the attribute Date?
❖ Simply put, it has so many possible values that it is bound to separate the training examples into very small subsets. Because of this, it will have a very high information gain relative to the training examples. Despite this very high information gain, it is a very poor predictor of the target function over unseen instances.

METHOD-1: The gain ratio measure penalizes attributes such as Date by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data:

SplitInformation(S, A) = - Σ_{i=1..c} (|S_i| / |S|) log2(|S_i| / |S|)

where S_1 through S_c are the c subsets of examples resulting from partitioning S by the c-valued attribute A. The GainRatio measure is defined in terms of the earlier Gain measure, as well as this split information, as follows:

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

The split information term discourages the selection of attributes with many uniformly distributed values (e.g., Date).
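A minimal sketch of the gain ratio measure above, repeating the plain-Python entropy and information-gain helpers so the block is self-contained; examples are again assumed to be (attribute_dict, label) pairs.

```python
from math import log2
from collections import Counter

def entropy(examples):
    n = len(examples)
    return -sum(c / n * log2(c / n)
                for c in Counter(y for _, y in examples).values())

def partition(examples, attribute):
    subsets = {}
    for x, y in examples:
        subsets.setdefault(x[attribute], []).append((x, y))
    return subsets

def information_gain(examples, attribute):
    n = len(examples)
    subsets = partition(examples, attribute)
    return entropy(examples) - sum(len(s) / n * entropy(s) for s in subsets.values())

def split_information(examples, attribute):
    n = len(examples)
    return -sum(len(s) / n * log2(len(s) / n)
                for s in partition(examples, attribute).values())

def gain_ratio(examples, attribute):
    si = split_information(examples, attribute)
    return information_gain(examples, attribute) / si if si > 0 else 0.0
```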
Hypothesis Space Search by ID3

As with other inductive learning methods, ID3 can be characterized as searching a space of hypotheses for one that fits the training examples. The hypothesis space searched by ID3 is the set of possible decision trees. ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with the empty tree, then considering progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data. The evaluation function that guides this hill-climbing search is the information gain measure. By viewing ID3 in terms of its search space and search strategy, we can get some insight into its capabilities and limitations.

1. ID3 maintains only a single current hypothesis as it searches through the space of decision trees. This contrasts, for example, with the earlier version space CANDIDATE-ELIMINATION method, which maintains the set of all hypotheses consistent with the available training examples. By determining only a single hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses. For example, it does not have the ability to determine how many alternative decision trees are consistent with the available training data, or to pose new instance queries that optimally resolve among these competing hypotheses.

2. ID3 in its pure form performs no backtracking in its search. Once it selects an attribute to test at a particular level in the tree, it never backtracks to reconsider this choice. Therefore, it is susceptible to the usual risks of hill-climbing search without backtracking: converging to locally optimal solutions that are not globally optimal. In the case of ID3, a locally optimal solution corresponds to the decision tree it selects along the single search path it explores. However, this locally optimal solution may be less desirable than trees that would have been encountered along a different branch of the search. Below we discuss an extension that adds a form of backtracking (post-pruning the decision tree).

3. ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis. This contrasts with methods that make decisions incrementally, based on individual training examples (e.g., FIND-S or CANDIDATE-ELIMINATION). One advantage of using statistical properties of all the examples (e.g., information gain) is that the resulting search is much less sensitive to errors in individual training examples. ID3 can be easily extended to handle noisy training data by modifying its termination criterion to accept hypotheses that imperfectly fit the training data.

Inductive bias in decision tree learning


Given a collection of training examples, there are typically many decision trees consistent with these examples.
Describing the inductive bias of ID3 therefore consists of describing the basis by which it chooses one of these consistent
hypotheses over the others. Which of these decision trees does ID3 choose? It chooses the first acceptable tree it
encounters in its simple-to-complex, hill-climbing search through the space of possible trees. Roughly speaking, the ID3 search strategy
(a) selects in favor of shorter trees over longer ones, and
(b) selects trees that place the attributes with the highest information gain closest to the root.
Because of the subtle interaction between the attribute selection heuristic used by ID3 and the particular training examples it encounters, it is difficult to characterize precisely the inductive bias exhibited by ID3. However, we can approximately characterize its bias as a preference for short decision trees over complex trees.
Approximate inductive bias of ID3:
❖ Shorter trees are preferred over larger trees. In fact, one could imagine an algorithm similar to ID3 that exhibits
precisely this inductive bias.
❖ Consider an algorithm that begins with the empty tree and searches breadth-first through progressively more
complex trees, first considering all trees of depth 1, then all trees of depth 2, etc. Once it finds a decision tree
consistent with the training data, it returns the smallest consistent tree at that search depth (e.g., the tree with
the fewest nodes).
❖ Let us call this breadth-first search algorithm BFS-ID3. BFS-ID3 finds a shortest decision tree and thus exhibits
precisely the bias "shorter trees are preferred over longer trees."
❖ ID3 can be viewed as an efficient approximation to BFS-ID3, using a greedy heuristic search to attempt to find the shortest tree without conducting the entire breadth-first search through the hypothesis space.
❖ Because ID3 uses the information gain heuristic and a hill climbing strategy, it exhibits a more complex bias
than BFS-ID3. In particular, it does not always find the shortest consistent tree, and it is biased to favor trees
that place attributes with high information gain closest to the root.
A closer approximation to the inductive bias of ID3:
Shorter trees are preferred over longer trees. Trees that place high information gain attributes close to the root are
preferred over those that do not.
A. Restriction Biases and Preference Biases
❖ Consider the difference between the hypothesis space search in these two approaches: ID3 searches a
complete hypothesis space (i.e., one capable of expressing any finite discrete valued function). It
searches incompletely through this space, from simple to complex hypotheses, until its termination
condition is met (e.g., until it finds a hypothesis consistent with the data). Its inductive bias is solely a
consequence of the ordering of hypotheses by its search strategy. Its hypothesis space introduces no
additional bias.
❖ The version space CANDIDATE-ELIMINATION algorithm searches an incomplete hypothesis space
(i.e., one that can express only a subset of the potentially teachable concepts). It searches this space
completely, finding every hypothesis consistent with the training data. Its inductive bias is solely a
consequence of the expressive power of its hypothesis representation. Its search strategy introduces no
additional bias.
❖ The inductive bias of ID3 is thus a preference for certain hypotheses over others (e.g., for shorter
hypotheses), with no hard restriction on the hypotheses that can be eventually enumerated. This form
of bias is typically called a preference bias (or, alternatively, a search bias). In contrast, the bias of the
CANDIDATE-ELIMINATION algorithm is in the form of a categorical restriction on the set of
hypotheses considered. This form of bias is typically called a restriction bias (or, alternatively, a
language bias).
Computational complexity:
We define the computational complexity of an algorithm by the number of "operations" it needs to perform, where we assume that for any machine implementing the underlying abstract machine there exists a constant c such that any such "operation" can be performed in c seconds.
(OR)
A learning algorithm has access to a domain of examples Z, a hypothesis class H, a loss function, and a training set of examples from Z sampled according to an unknown distribution D. Given parameters ε, δ, the algorithm should output a hypothesis h such that, with probability of at least 1 - δ, the error of h is within ε of the best achievable error in H.

Consider the rate of change of that complexity along a sequence of tasks.


1. Given a function f : (0, 1)² → N, a learning task (Z, H, ℓ), and a learning algorithm A, we say that A solves the learning task in time O(f) if there exists some constant number c such that, for every probability distribution D over Z and inputs ε, δ ∈ (0, 1), when A has access to samples generated by D, it terminates after performing at most c·f(ε, δ) operations and outputs a hypothesis h that, with probability of at least 1 - δ, satisfies L_D(h) ≤ min_{h' ∈ H} L_D(h') + ε.
Why Prefer Short Hypotheses?
Is ID3's inductive bias favoring shorter decision trees a sound basis for generalizing beyond the training data?
William of Occam was one of the first to discuss the question:
Occam's razor: Prefer the simplest hypothesis that fits the data.
One argument in its favor is that, because there are fewer short hypotheses than long ones, it is less likely that a short hypothesis that fits the training data does so merely by coincidence. Upon closer examination, it turns out there is a major difficulty with this argument. By the same reasoning we could have argued that one should prefer decision trees containing exactly 17 leaf nodes with 11 non-leaf nodes, that use the decision attribute A1 at the root, and test attributes A2 through A11, in numerical order.
A second problem with the above argument for Occam's razor is that the size of a hypothesis is determined by the particular representation used internally by the learner. Two learners using different internal representations could therefore arrive at different hypotheses, both justifying their contradictory conclusions by Occam's razor.
This last argument shows that Occam's razor will produce two different hypotheses from the same training examples
when it is applied by two learners that perceive these examples in terms of different internal representations.
Occam's razor:
Occam's razor is a popular inductive bias that can be summarized as "choose the shortest explanation for the observed data." There are several long-standing arguments in the debate regarding Occam's razor. Here we consider a Bayesian perspective on this issue and a closely related principle called the Minimum Description Length (MDL) principle. The MDL principle is motivated by interpreting the definition of h_MAP in the light of basic concepts from information theory. Recall the definition of the maximum a posteriori hypothesis:

h_MAP = argmax_{h ∈ H} P(D | h) P(h)

Significance of Occam's Razor


❖ It promotes simplicity and elegance in explanations, enabling clearer understanding and communication
❖ Occam's razor also aids in decision-making processes.
❖ Occam's razor fosters a mindset of critical thinking and intellectual integrity. It encourages individuals to
question complex explanations and seek simpler, more coherent ones.
Comparing learning algorithms in machine learning:
Time complexity
The "time" an algorithm takes is measured by the elementary operations of the algorithm. Although users and developers may care more about the wall-clock time an algorithm takes to train a model, it is fairer to use the standard worst-case computational time complexity to compare the time the models take to train. For example, parametric models like linear regression can have a long training time, but they are efficient at test time.

Space complexity

Space complexity measures how much memory an algorithm needs to run, in terms of the input size. An ML program cannot run successfully if the algorithm loads too much data into the working memory of a machine.
Sample complexity

Sample complexity measures the number of training examples needed to train a model in order to guarantee valid generalization. For example, a deep neural network has high sample complexity because a large amount of training data is needed to train it.

Bias-variance tradeoff

Different ML algorithms make different bias-variance tradeoffs. Bias errors come from the fact that a model is biased towards a specific solution or assumption; for example, if a linear decision boundary is fit to nonlinear data, the bias will be high. Variance, on the other hand, measures the errors coming from the variability of the model: it is the average squared difference between the prediction of a model and the prediction of the expected (average) model.
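As an illustration of this tradeoff (an added sketch assuming scikit-learn and NumPy are available), the following script trains shallow and deep decision tree regressors on repeated noisy samples of a sine target and estimates the squared bias and variance of their predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_dataset(n=100):
    X = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=n)  # noisy target
    return X, y

X_test = np.linspace(0, 1, 200).reshape(-1, 1)
true_f = np.sin(2 * np.pi * X_test[:, 0])

for depth in (1, 15):                         # shallow (high bias) vs deep (high variance)
    preds = []
    for _ in range(50):                       # train on 50 independent samples
        X, y = sample_dataset()
        model = DecisionTreeRegressor(max_depth=depth).fit(X, y)
        preds.append(model.predict(X_test))
    preds = np.array(preds)
    avg = preds.mean(axis=0)
    bias2 = np.mean((avg - true_f) ** 2)      # squared bias of the average model
    variance = np.mean(preds.var(axis=0))     # spread of predictions across samples
    print(f"depth={depth:2d}  bias^2={bias2:.3f}  variance={variance:.3f}")
```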

Online and Offline Learning

Online and offline learning refer to the way a machine learning system updates its model. Online learning means training data can be presented one example at a time, so that parameters can be updated immediately when new data become available. Offline learning requires the training to start over (re-training the whole model) when new data are presented in order to update the parameters.

The ID3 decision tree algorithm is an example of offline learning: it looks at the data globally and performs a greedy search to maximize information gain.

Parallelizability

A parallel algorithm is one that can perform multiple operations at the same time. This can be done by distributing the workload across different workers, such as processors in a single machine or multiple machines. The nature of the k-nearest neighbors (k-NN) model allows it to be easily run on multiple machines at the same time.

Parametricity

The concept of parametricity is widely used in the field of statistical learning. A parametric model is one in which the number of parameters is fixed, whereas the number of parameters of a non-parametric model grows as more data become available. Another way of defining a parametric model is based on its underlying assumptions about the shape of the probability distribution of the data: if no such assumption is made, it is a non-parametric model. Parametric models are very common in machine learning; examples include linear regression, neural networks, and many other ML models. k-NN and SVM (support vector machines), on the other hand, are non-parametric models.
