Computational Learning Theory Guide
Probably approximately correct (PAC) learning; sample complexity for infinite hypothesis
spaces; the Vapnik–Chervonenkis (VC) dimension.
Rule Learning: Propositional and first-order rules; translating decision trees into rules;
heuristic rule induction using separate-and-conquer and information gain; first-order
Horn-clause induction (Inductive Logic Programming) and FOIL; learning recursive rules;
inverse resolution; GOLEM and Progol.
This division of learning tasks vs. learning algorithms is arbitrary, and in practice, there
is quite a large degree of overlap between these two fields.
In supervised learning, an algorithm is given samples that are labeled in some useful
way. For example, the samples might be descriptions of mushrooms, and the labels
could be whether or not the mushrooms are edible. The algorithm takes these
previously labeled samples and uses them to induce a classifier. This classifier is a
function that assigns labels to samples, including samples that have not been seen
previously by the algorithm. The goal of the supervised learning algorithm is to optimize
some measure of performance such as minimizing the number of mistakes made on
new samples. In computational learning theory, a computation is considered feasible if
it can be done in polynomial time.
Negative results often rely on commonly believed, but as yet unproven, assumptions, such as P ≠ NP or the existence of one-way functions.
VC-DIMENSION:
The VC dimension quantifies the richness, or flexibility, of a hypothesis space: it is the
cardinality of the biggest set of points that the method can shatter. In the context of a
dataset, to shatter a set of points means that, for every possible assignment of labels to
the points, the points can be picked out or divided from one another using hypotheses in
the space such that the labels of the samples in the distinct groups are right.
❖ The VC dimension measures the complexity of a hypothesis space, for example, the
models that can be fit given a representation and learning method. One way to assess
the complexity of a hypothesis space is the number of distinct hypotheses it contains
and the ways the space may be traversed, but this fails for infinite spaces.
❖ The VC dimension is an ingenious alternative that instead counts the number of cases
from the target problem that can be distinguished by hypotheses in the space.
Given a set of n points S = {x1, x2, …, xn} in a feature space and a binary hypothesis
class h, the VC dimension of h is the largest integer d such that there exists a set of d
points that can be shattered by h, i.e., for any labeling of those d points, there is a
hypothesis consistent with it. The VC dimension of h is:
VC(h) = max{d | there exists a set of d points that can be shattered by h}.
This is why the VC dimension of H, the class of linear classifiers in the 2-D plane, is 3:
three points in general position can be shattered by a line, but for any 4 points in the
plane there is some labeling (for example, the XOR-style labeling of the corners of a
square) for which no separating hyperplane can be drawn. So the VC dimension is 3.
Note that some 3-point sets, such as three collinear points or three points that coincide,
cannot be shattered by a line either. This does not matter: the definition of the VC
dimension requires only that some set of 3 points be shatterable, not every set.
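To make shattering concrete, here is a minimal Python sketch (the helper names linearly_separable/shattered and the LP encoding are our own, for illustration) that tests whether 2-D linear classifiers can shatter a point set by checking every labeling for strict linear separability:

import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """True if some w, b satisfy label_i * (w . x_i + b) >= 1 for all i."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    n, d = X.shape
    # Variables [w_1, ..., w_d, b]; zero objective makes this a feasibility LP.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(points):
    """True if every +1/-1 labeling of the points is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1.0, 1.0], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: 3 points in general position
print(shattered([(0, 0), (1, 1), (1, 0), (0, 1)]))  # False: the XOR labeling fails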
PAC LEARNING:
➢ To find information about an unknown target function, the learner is given access to
examples of the function drawn randomly according to some distribution D, and the
aim is to produce a hypothesis (e.g., a neural network) that classifies any further
unclassified instances.
➢ In general, a PAC algorithm may be run on whatever data is supplied and its
inaccuracy then measured, because most of these algorithms use the input for more
than just establishing the required sample size.
PAC-learnability:
• Training examples are obtained by taking random samples from X. We assume that the
samples are drawn according to an arbitrary, but fixed, probability distribution.
• Definition (informal): The concept class C is said to be PAC-learnable if there is an
algorithm A which, for samples drawn with any probability distribution F and any
concept c ∈ C, will with high probability produce a hypothesis whose error is small.
• True error: To formally define PAC-learnability, we require the concept of the true error
of a hypothesis h with respect to a target concept c and distribution F, which is defined by
errorF(h) = Px∈F(h(x) ≠ c(x))
where the notation Px∈F indicates that the probability is taken for x drawn from X
according to the distribution F. This error is the probability that h will misclassify an
instance drawn at random according to F. The true error is not directly observable to the
learner; it can only see the training error of each hypothesis (that is, how often
h(x) ≠ c(x) over the training instances).
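Since the true error is defined over the distribution F, the learner cannot compute it, but we can approximate it by sampling when F is known. A minimal sketch, assuming a toy target c, hypothesis h, and F uniform on [0, 1]:

import random

def c(x):   # target concept (illustrative): x >= 0.5
    return x >= 0.5

def h(x):   # learned hypothesis (illustrative): x >= 0.6
    return x >= 0.6

def estimated_true_error(h, c, draw, n=100_000):
    """Fraction of n samples drawn from F on which h disagrees with c."""
    disagreements = 0
    for _ in range(n):
        x = draw()
        disagreements += (h(x) != c(x))
    return disagreements / n

# F = uniform on [0, 1]; h and c disagree exactly on [0.5, 0.6),
# so the estimate should be close to errorF(h) = 0.1.
print(estimated_true_error(h, c, random.random))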
The online learning model, also known as the mistake-bounded learning model, is a form
of learning model in which the worst case over all environments is considered. There is a
known concept class for each situation, and the target concept is chosen from it.
The target function and the sequence in which the instances are presented are then
chosen by an adversary with unlimited computational power and knowledge of the
learner's algorithm.
Problem setting: the learner is told the correct value after each trial and can use this
feedback to improve its hypothesis. The objective in this model is to bound the total
number of mistakes the learner ever makes, over any such sequence.
Definition 1: An algorithm A is said to learn C in the mistake bound model if for any
concept c ∈ C, and for any ordering of examples consistent with c, the total number of
mistakes ever made by A is bounded by p(n, size(c)), where p is a polynomial. We say that
A is a polynomial time learning algorithm if its running time per stage is also polynomial
in n and size(c). Let us now examine a few problems that are learnable in the mistake
bound model.
Conjunctions: Let us assume that we know that the target concept c will be a
conjunction of a set of (possibly negated) variables, with an example space of n-bit
strings. The standard algorithm is:
1. Initialize h to the conjunction of all 2n literals: x1, ¬x1, x2, ¬x2, …, xn, ¬xn.
2. Given an example x, predict True if and only if x satisfies every literal in h.
3. If the prediction is False but the label is actually True, remove all the literals in h which
are False in x. (So if the first mistake is on 1001, the new h will be x1 ∧ ¬x2 ∧ ¬x3 ∧ x4.)
4. If the prediction is True but the label is actually False, then output "no consistent
conjunction".
5. Return to step 2.
An invariant of this algorithm is that the set of literals in c will always be a subset of the
set of literals in h. The first mistake on a positive example will bring the size of h down to
n literals. Each subsequent such mistake will remove at least one literal from h, so that
the total number of mistakes is at most n + 1.
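A minimal runnable sketch of this algorithm, assuming examples arrive as n-bit strings and encoding each literal as an (index, is_positive) pair:

def learn_conjunction(examples, n):
    # Step 1: h is the conjunction of all 2n literals x1, ¬x1, ..., xn, ¬xn,
    # encoded as (index, is_positive) pairs; this h is satisfied by nothing.
    h = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
    mistakes = 0
    for x, label in examples:
        # Step 2: predict True iff x satisfies every literal currently in h.
        prediction = all((x[i] == '1') == positive for (i, positive) in h)
        if prediction == label:
            continue
        mistakes += 1
        if label:
            # Step 3: mistake on a positive example -> drop literals False in x.
            h = {(i, pos) for (i, pos) in h if (x[i] == '1') == pos}
        else:
            # Step 4: cannot happen if the target really is a conjunction.
            raise ValueError("no consistent conjunction")
    return h, mistakes

# After the first mistake, on positive example 1001, h is x1 ∧ ¬x2 ∧ ¬x3 ∧ x4:
h, m = learn_conjunction([("1001", True)], n=4)
print(sorted(h), m)   # [(0, True), (1, False), (2, False), (3, True)] 1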
WEAK LEARNING:
The PAC learning model requires the learner to be able to produce hypotheses that are
arbitrarily close to the target concept. While simple rules of thumb that are merely
"generally right" are easy to find, it is often extremely difficult to find a single hypothesis
that is highly accurate. A weak learner is only required to output hypotheses that do
slightly better than random guessing.
The problem of converting such a weak learner into a PAC learner is known as hypothesis
boosting. Since Schapire's original result, boosting schemes have worked by re-weighting
the training data, driving the weak learner to provide hypotheses that can be merged to
produce a highly accurate combined hypothesis, which is then tested.
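The notes above only name hypothesis boosting; as a concrete illustration, here is a minimal sketch of AdaBoost with decision-stump weak learners (a standard boosting scheme; the dataset, round count, and helper names best_stump/adaboost/predict are assumptions for this example). Each round re-weights the examples toward the previous mistakes, and a weighted vote of the weak hypotheses forms the strong hypothesis.

import numpy as np

def best_stump(X, y, w):
    """Weak learner: the single-feature threshold test minimizing weighted error."""
    best = (0, 0.0, 1, np.inf)          # (feature j, threshold t, sign s, error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def adaboost(X, y, rounds=20):
    w = np.full(len(y), 1.0 / len(y))   # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        j, t, s, err = best_stump(X, y, w)
        err = max(err, 1e-12)           # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] <= t, s, -s)
        w *= np.exp(-alpha * y * pred)  # up-weight the examples it got wrong
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(X[:, j] <= t, s, -s) for a, j, t, s in ensemble)
    return np.sign(score)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])      # no single stump fits this labeling
model = adaboost(X, y)
print((predict(model, X) == y).mean())  # expect 1.0 (zero training error)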
➢ The sample complexity bounds above are restricted to finite hypothesis spaces. Some
infinite hypothesis spaces are more expressive than others, e.g., rectangles vs. 17-sided
convex polygons vs. general convex regions. We therefore need a measure of the
expressiveness of an infinite hypothesis space other than its size. Analogous to the
bounds using |H|, there are bounds for sample complexity using VC(H).
➢ An unbiased hypothesis space H shatters the entire instance space X (it is able to
induce every possible partition on the set of all possible instances). The larger the subset
of X that can be shattered, the more expressive the hypothesis space is, i.e., the less
biased.
Each hypothesis h imposes a dichotomy on an instance set S; that is, h partitions S into
the two subsets {x ∈ S | h(x) = 1} and {x ∈ S | h(x) = 0}. Given some instance set S,
there are 2^|S| possible dichotomies, though H may be unable to represent some of them
as concepts over S.
For example, with hypotheses that are intervals on the real line, we can shatter any
dataset of two reals, but we cannot shatter any dataset of three real values. The ability to
shatter a set of instances is thus a measure of H's capacity to represent target concepts
defined over these instances.
VC dimension of H:
The VC dimension of the hypothesis space H over instance space X, written VC(H), is the
size of the largest finite subset of X that can be shattered by H. If arbitrarily large finite
subsets of X can be shattered by H, then VC(H) = ∞.
❖ If there exists at least one subset of size d that can be shattered, then VC(H) ≥ d.
❖ If no subset of size d can be shattered, then VC(H) < d.
❖ The VC dimension of a 2-D linear classifier is 3: that is the size of the largest set of
points that can be labeled arbitrarily. Note that |H| is infinite here, yet the
expressiveness is quite low.
❖ If H is finite: VC(H) ≤ log2|H|. A set S with d instances has 2^d distinct
dichotomies, so H requires at least 2^d distinct hypotheses to shatter d instances.
Thus, if VC(H) = d, then 2^d ≤ |H|, and hence VC(H) = d ≤ log2|H|.
A hypothesis h is approximately correct when, with probability at least 1 − δ, h has error
less than ε. We use the VC(H)-based bound rather than the |H|-based bound when the
hypothesis space is infinite. Note that if we want to shatter m points, |H| has to be at
least 2^m in order to realize every configuration of labels on those m examples.
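As a worked illustration of these bounds, the sketch below computes the standard PAC sample-complexity bound for finite hypothesis spaces, m ≥ (1/ε)(ln|H| + ln(1/δ)), alongside the VC-dimension-based bound m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)) (the forms given in Mitchell's Machine Learning; the example numbers are assumptions):

from math import log, log2, ceil

def m_finite(H_size, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)), for a consistent learner."""
    return ceil((log(H_size) + log(1 / delta)) / eps)

def m_vc(vc, eps, delta):
    """m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return ceil((4 * log2(2 / delta) + 8 * vc * log2(13 / eps)) / eps)

# Conjunctions over n = 10 boolean variables: |H| = 3^10 distinct conjunctions.
print(m_finite(3 ** 10, eps=0.1, delta=0.05))   # 140 examples
# 2-D linear separators: VC(H) = 3 even though |H| is infinite.
print(m_vc(3, eps=0.1, delta=0.05))             # 1899 examples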
COLT SUMMARY:
The PAC framework provides a reasonable model for theoretically analyzing the
effectiveness of learning algorithms.
The sample complexity for any consistent learner using hypothesis space H can be
determined from a measure of H's expressiveness (|H|, VC(H)).
If the sample complexity is tractable, then the computational complexity of finding
a consistent hypothesis governs the complexity of the problem.
The sample complexity bounds given here are far from tight, but they separate
learnable classes from non-learnable classes (and show what is important).
The computational complexity results exhibit cases where learning is feasible in the
information-theoretic sense, but finding a good hypothesis is intractable.
The theoretical framework allows for a concrete analysis of the complexity of
learning as a function of various assumptions (e.g., the number of relevant variables).
*********************************************************************************
RULE LEARNING:
1. It is useful to learn the target function represented as a set of if-then rules that
jointly define the function. One way to learn sets of rules is to first learn a decision
tree, then translate the tree into an equivalent set of rules-one rule for each leaf
node in the tree.
2. A variety of algorithms directly learn rule sets, and they differ from tree-based
algorithms in two key respects. First, they are designed to learn sets of first-order
rules that contain variables. This is significant because first-order rules are much
more expressive than propositional rules. Second, the algorithms discussed here
use sequential covering algorithms that learn one rule at a time to incrementally
grow the final set of rules.
Learn-One-Rule:
We consider a family of algorithms for learning rule sets based on the strategy of
learning one rule, removing the data it covers, then iterating this process. Such
algorithms are called sequential covering algorithms. To elaborate, imagine we have
a subroutine LEARN-ONE-RULE that accepts a set of positive and negative training
examples as input, then outputs a single rule that covers many of the positive
examples and few of the negative examples. We require that this output rule have
high accuracy, but not necessarily high coverage. By high accuracy, we mean the
predictions it makes should be correct. By accepting low coverage, we mean it need
not make predictions for every training example.
For example:
IF Mother(y, x) and Female(y), THEN Daughter(x, y).
Here, any person can be bound to the variables x and y. The Learn-One-Rule
algorithm follows a greedy search paradigm: it searches for a rule with high
accuracy even if its coverage is low, and it returns a single rule that covers some of
the positive examples.
Learn-One-Rule(target_attribute, attributes, examples, k):
    Pos = positive examples
    Neg = negative examples
    best_hypothesis = the most general hypothesis (an empty precondition)
    candidate_hypotheses = {best_hypothesis}
    while candidate_hypotheses is not empty:
        // Generate more specific candidates by adding one (attribute = value)
        // constraint to each member of candidate_hypotheses
        new_candidates = all such specializations of candidate_hypotheses
        // Update best_hypothesis
        best_hypothesis = argmax over h in new_candidates ∪ {best_hypothesis}
                          of Performance(h, examples, target_attribute)
        // Keep only the k best new candidates (beam search)
        candidate_hypotheses = the k best members of new_candidates
    return the rule: IF best_hypothesis THEN the most common value of
        target_attribute among the examples that match best_hypothesis
It starts with the most general rule precondition, then greedily adds the attribute
constraint that most improves performance measured over the training examples.
Day Weather Temp Wind Rain PlayBadminton
D1 Sunny Hot Weak Heavy No
D2 Sunny Hot Strong Heavy No
D3 Overcast Hot Weak Heavy No
D4 Snowy Cold Weak Light Yes
D5 Snowy Cold Weak Light Yes
D6 Snowy Cold Strong Light Yes
D7 Overcast Mild Strong Heavy No
D8 Sunny Hot Weak Light Yes
The Sequential Covering algorithm takes care, to some extent, of the low-coverage
problem of the Learn-One-Rule algorithm, by learning rules that cover the data in a
sequential manner.
Working of the Algorithm:
Step 1 – Start with an empty decision list, 'R', and the full set of training examples.
Step 2 – Use Learn-One-Rule to find the best candidate rule for the examples that remain.
Step 3 – The rule becomes 'desirable' when it covers a majority of the positive
examples.
Step 4 – When this rule is obtained, delete all the training data covered by that rule
(i.e., when the rule is applied to the dataset, the examples it covers are removed).
Step 5 – The new rule is added to the bottom of the decision list 'R', and the process
repeats from Step 2.
➢ Let us understand step by step how the algorithm works on an example (the
accompanying figure is not reproduced here).
➢ First, we create an empty decision list. During Step 1, we see that there are
three groups of positive examples present in the dataset. As per the algorithm,
we consider the group with the maximum number of positive examples (6, in
this example).
➢ Once we cover these 6 positive examples, we get our first rule R1, which is
then pushed into the decision list, and those positive examples are removed
from the dataset.
➢ Next, we take the largest remaining group of positive examples (5, in this
example) and follow the same process until we get rule R2 (and likewise R3).
➢ In the end, we obtain our final decision list with all the desirable rules.
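Putting the pieces together, here is a minimal runnable sketch of Sequential Covering with a greedy Learn-One-Rule on the PlayBadminton table above (the helper names and the stopping conditions are simplifying assumptions; rules are learned for the positive class only):

DATA = [  # (Weather, Temp, Wind, Rain, PlayBadminton) from the table above
    ("Sunny",    "Hot",  "Weak",   "Heavy", "No"),
    ("Sunny",    "Hot",  "Strong", "Heavy", "No"),
    ("Overcast", "Hot",  "Weak",   "Heavy", "No"),
    ("Snowy",    "Cold", "Weak",   "Light", "Yes"),
    ("Snowy",    "Cold", "Weak",   "Light", "Yes"),
    ("Snowy",    "Cold", "Strong", "Light", "Yes"),
    ("Overcast", "Mild", "Strong", "Heavy", "No"),
    ("Sunny",    "Hot",  "Weak",   "Light", "Yes"),
]
ATTRS = ["Weather", "Temp", "Wind", "Rain"]

def covers(rule, row):
    """A rule is a list of (attribute, value) tests; the empty rule covers all rows."""
    return all(row[ATTRS.index(a)] == v for a, v in rule)

def accuracy(rule, rows):
    """(fraction of covered rows labeled Yes, number of covered rows)."""
    covered = [r for r in rows if covers(rule, r)]
    if not covered:
        return (0.0, 0)
    return (sum(r[-1] == "Yes" for r in covered) / len(covered), len(covered))

def learn_one_rule(rows):
    """Greedily add the test that most improves accuracy (ties broken by coverage)."""
    rule = []
    while accuracy(rule, rows)[0] < 1.0 and len(rule) < len(ATTRS):
        used = {a for a, _ in rule}
        candidates = [rule + [(a, v)]
                      for i, a in enumerate(ATTRS) if a not in used
                      for v in {r[i] for r in rows}]
        rule = max(candidates, key=lambda c: accuracy(c, rows))
    return rule

def sequential_covering(rows):
    """Learn rules until every positive example is covered; remove covered positives."""
    rules, remaining = [], list(rows)
    while any(r[-1] == "Yes" for r in remaining):
        rule = learn_one_rule(remaining)
        rules.append(rule)
        remaining = [r for r in remaining
                     if not (covers(rule, r) and r[-1] == "Yes")]
    return rules

for rule in sequential_covering(DATA):
    print("IF", " AND ".join(f"{a}={v}" for a, v in rule), "THEN PlayBadminton=Yes")
# -> IF Rain=Light THEN PlayBadminton=Yes  (covers all 4 positive rows)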
A Horn clause H ← (L1 ∧ … ∧ Ln) is equivalent, using our earlier rule notation, to
IF L1 ∧ … ∧ Ln, THEN H. In this notation, the Horn clause preconditions L1 ∧ … ∧ Ln are
called the clause body or, alternatively, the clause antecedents. The literal H that forms
the postcondition is called the clause head or, alternatively, the clause consequent.
As an example, consider FOIL learning the target relation GrandDaughter(x, y); the
search starts from the most general rule, GrandDaughter(x, y) ← , and specializes it one
literal at a time.
Step 4 - FOIL now considers all the literals from the previous step as well as
Female(z), Father(z, w), Father(w, z), etc., and their negations.
Step 5 - FOIL might select Father(z, x), and on the next step Female(y), leading to
NewRule: GrandDaughter(x, y) ← Father(y, z) ∧ Father(z, x) ∧ Female(y)
Step 6 - If the rule produced by this greedy search covers only positive examples,
FOIL terminates the search for further refinements of it.
FOIL then removes all positive examples covered by the new rule. If any positive examples
remain, the outer while loop continues.
FOIL: PERFORMANCE EVALUATION MEASURE
The performance of a new rule is not measured by entropy (as in the PERFORMANCE
method of the Learn-One-Rule algorithm). FOIL uses a gain measure to decide which new,
more specialized rule to adopt; each candidate literal's utility is estimated via the number
of bits required to encode the classification of the positive bindings:
Foil_Gain(L, R) = t · ( log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0)) )
where:
L is the candidate literal to add to rule R
p0 = number of positive bindings of R
n0 = number of negative bindings of R
p1 = number of positive bindings of R + L
n1 = number of negative bindings of R + L
t = number of positive bindings of R also covered by R + L
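A minimal sketch of the Foil_Gain computation just defined (the binding counts in the example call are assumptions):

from math import log2

def foil_gain(p0, n0, p1, n1, t):
    """Foil_Gain(L, R) = t * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Rule R: 10 positive / 10 negative bindings. Adding literal L leaves
# 8 positive / 2 negative bindings, with t = 8 positive bindings surviving.
print(foil_gain(p0=10, n0=10, p1=8, n1=2, t=8))   # ~5.42 bits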
In summary, FOIL is a rule-based learning algorithm that extends the Sequential
Covering + Learn-One-Rule approach, using the gain measure above (rather than
entropy/information gain) to determine the best rule.
LEARNING RECURSIVE RULE SETS:
1. Given an appropriate set of training examples, recursive rules, such as the two
clauses defining Ancestor (a base clause plus a recursive clause that uses the target
predicate Ancestor itself), can be learned following a trace similar to the one above
for GrandDaughter. The second, recursive rule is among the rules that are
potentially within reach of FOIL's search.
2. This holds provided Ancestor is included in the list Predicates that determines
which predicates may be considered when generating new literals. Whether this
particular rule is learned depends on whether these literals outscore competing
candidates during FOIL's greedy search for increasingly specific rules. So learning
recursive rules is possible; a remaining question is how to avoid learning rule sets
that produce infinite recursion.
INDUCTION AS INVERTED DEDUCTION
➢ A second, quite different approach to inductive logic programming is based on the
simple observation that induction is just the inverse of deduction.
➢ In general, machine learning involves building theories that explain the observed
data.
➢ Let xi denote the ith training instance and f(xi) its target value. Then learning is
the problem of discovering a hypothesis h such that the classification f(xi) of
each training instance xi follows deductively from the hypothesis h, the description
of xi, and any other background knowledge B known to the system.
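Written formally, the learner seeks a hypothesis h such that
(∀⟨xi, f(xi)⟩ ∈ D) (B ∧ h ∧ xi) ⊢ f(xi),
where D is the training data and ⊢ denotes deductive entailment; this is the standard
statement of the inverted-deduction constraint.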
Decision-Trees to Rules:
For each path in a decision tree from the root to a leaf, create a rule with the
conjunction of tests along the path as an antecedent and the leaf label as the
consequent.
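A minimal sketch of this conversion, assuming a toy tree encoded as nested (attribute, branches) tuples with string class labels at the leaves:

TREE = ("Rain", {                      # toy tree over the PlayBadminton attributes
    "Light": "Yes",
    "Heavy": ("Wind", {"Weak": "No", "Strong": "No"}),
})

def tree_to_rules(tree, path=()):
    if isinstance(tree, str):          # leaf: emit one rule for this path
        ante = " AND ".join(f"{a}={v}" for a, v in path) or "TRUE"
        return [f"IF {ante} THEN PlayBadminton={tree}"]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, path + ((attr, value),))
    return rules

for rule in tree_to_rules(TREE):
    print(rule)
# IF Rain=Light THEN PlayBadminton=Yes
# IF Rain=Heavy AND Wind=Weak THEN PlayBadminton=No
# IF Rain=Heavy AND Wind=Strong THEN PlayBadminton=No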
❖ Post-processing of decision-tree rules: the resulting rules may contain
unnecessary antecedents that are not needed to exclude negative examples and
that lead to over-fitting.
❖ Rules are post-pruned by greedily removing antecedents or rules as long as
performance on the training data or a validation set is not significantly harmed.
❖ The resulting rules may lead to conflicting conclusions on some instances.
❖ Sort rules by training (validation) accuracy to create an ordered decision list.
The first rule in the list that applies is used to classify a test instance.
PROGOL:
Progol reduces the combinatorial explosion by generating the most specific acceptable
hypothesis h.
1. The user specifies H by stating the predicates, functions, and forms of arguments
allowed for each.
2. Progol uses a sequential covering algorithm: for each ⟨xi, f(xi)⟩ it finds the most
specific hypothesis hi such that B ∧ hi ∧ xi ⊢ f(xi) (in fact, it considers only k-step
entailment).
3. It then conducts a general-to-specific search bounded by that most specific
hypothesis hi, choosing the hypothesis with the minimum description length.
GOLEM:
A second dimension along which approaches vary is the direction of the search in LEARN-
ONE-RULE. In the algorithm described above, the search is from general to specific
hypotheses. Other algorithms we have discussed (e.g., FIND-S) search from specific to
general. One advantage of general to specific search in this case is that there is a single
maximally general hypothesis from which to begin the search, whereas there are very many
specific hypotheses in most hypothesis spaces (i.e., one for each possible instance). Given
many maximally specific hypotheses, it is unclear which to select as the starting point of
the search. One program that conducts a specific-to-general search, called GOLEM,
addresses this issue by choosing several positive examples at random to initialize and to
guide the search. The best hypothesis obtained through multiple random choices is then
selected.