Computational Learning Theory Guide
Probably approximately correct (PAC) learning; sample complexity for infinite hypothesis
spaces; the Vapnik–Chervonenkis (VC) dimension.
Rule Learning: Propositional and first-order rules; translating decision trees into rules;
heuristic rule induction using separate-and-conquer and information gain; first-order
Horn-clause induction (Inductive Logic Programming) and FOIL; learning recursive rules;
inverse resolution; GOLEM and Progol.
This division of learning tasks vs. learning algorithms is arbitrary, and in practice, there
is quite a large degree of overlap between these two fields.
In supervised learning, an algorithm is given samples that are labeled in some useful
way. For example, the samples might be descriptions of mushrooms, and the labels
could be whether or not the mushrooms are edible. The algorithm takes these
previously labeled samples and uses them to induce a classifier. This classifier is a
function that assigns labels to samples, including samples that have not been seen
previously by the algorithm. The goal of the supervised learning algorithm is to optimize
some measure of performance such as minimizing the number of mistakes made on
new samples. In computational learning theory, a computation is considered feasible if
it can be done in polynomial time.
Negative results often rely on commonly believed, but as yet unproven, assumptions, such as P ≠ NP or the existence of one-way functions.
VC-DIMENSION:
The VC dimension quantifies the richness, or flexibility, of a hypothesis space: it is the
cardinality of the biggest set of points that the method can shatter. In the context of a
dataset, to shatter a set of points means that, for every possible assignment of labels to
the points, the points can be picked out or divided from one another using hypotheses in
the space such that the labels of the samples in the distinct groups are right.
❖ The VC dimension measures the complexity of a hypothesis space, for example, the
models that can be fit given a representation and learning method. One way to assess
the complexity of a hypothesis space is the number of distinct hypotheses it contains
and the ways the space may be traversed, but this fails for infinite spaces.
❖ The VC dimension is an ingenious alternative that instead counts the number of cases
from the target problem that can be distinguished by hypotheses in the space.
Given a set of n points S = {x1, x2, …, xn} in a feature space and a binary hypothesis
class h, the VC dimension of h is the largest integer d such that there exists a set of d
points that can be shattered by h, i.e., for any labeling of those d points, there is a
hypothesis consistent with it. The VC dimension of h is:
VC(h) = max{d | there exists a set of d points that can be shattered by h}.
This is why the VC dimension of H, the class of linear classifiers in the 2-D plane, is 3:
three points in general position can be shattered by a line, but for any 4 points in the
plane there is some labeling (for example, the XOR-style labeling of the corners of a
square) for which no separating hyperplane can be drawn. So the VC dimension is 3.
Note that some 3-point sets, such as three collinear points or three points that coincide,
cannot be shattered by a line either. This does not matter: the definition of the VC
dimension requires only that some set of 3 points be shatterable, not every set.
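To make shattering concrete, here is a minimal Python sketch (the helper names linearly_separable/shattered and the LP encoding are our own, for illustration) that tests whether 2-D linear classifiers can shatter a point set by checking every labeling for strict linear separability:

import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """True if some w, b satisfy label_i * (w . x_i + b) >= 1 for all i."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    n, d = X.shape
    # Variables [w_1, ..., w_d, b]; zero objective makes this a feasibility LP.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(points):
    """True if every +1/-1 labeling of the points is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1.0, 1.0], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: 3 points in general position
print(shattered([(0, 0), (1, 1), (1, 0), (0, 1)]))  # False: the XOR labeling fails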
PAC LEARNING:
➢ To find information about an unknown target function, the learner is given access to
examples of the function drawn randomly according to some distribution D, and the
aim is to produce a hypothesis (e.g., a neural network) that classifies any further
unclassified instances.
➢ In general, a PAC algorithm may be run on whatever data is supplied and its
inaccuracy then measured, because most of these algorithms use the input for more
than just establishing the required sample size.
PAC-learnability:
• Training examples are obtained by taking random samples from X. We assume that the
samples are drawn according to an arbitrary, but fixed, probability distribution.
• Definition (informal): The concept class C is said to be PAC-learnable if there is an
algorithm A which, for samples drawn with any probability distribution F and any
concept c ∈ C, will with high probability produce a hypothesis whose error is small.
• True error: To formally define PAC-learnability, we require the concept of the true error
of a hypothesis h with respect to a target concept c and distribution F, which is defined by
errorF(h) = Px∈F(h(x) ≠ c(x))
where the notation Px∈F indicates that the probability is taken for x drawn from X
according to the distribution F. This error is the probability that h will misclassify an
instance drawn at random according to F. The true error is not directly observable to the
learner; it can only see the training error of each hypothesis (that is, how often
h(x) ≠ c(x) over the training instances).
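Since the true error is defined over the distribution F, the learner cannot compute it, but we can approximate it by sampling when F is known. A minimal sketch, assuming a toy target c, hypothesis h, and F uniform on [0, 1]:

import random

def c(x):   # target concept (illustrative): x >= 0.5
    return x >= 0.5

def h(x):   # learned hypothesis (illustrative): x >= 0.6
    return x >= 0.6

def estimated_true_error(h, c, draw, n=100_000):
    """Fraction of n samples drawn from F on which h disagrees with c."""
    disagreements = 0
    for _ in range(n):
        x = draw()
        disagreements += (h(x) != c(x))
    return disagreements / n

# F = uniform on [0, 1]; h and c disagree exactly on [0.5, 0.6),
# so the estimate should be close to errorF(h) = 0.1.
print(estimated_true_error(h, c, random.random))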
The online learning model, also known as the mistake-bounded learning model, is a form
of learning model in which the worst case over all environments is considered. There is a
known concept class for each situation, and the target concept is chosen from it.
The target function and the sequence in which the instances are presented are then
chosen by an adversary with unlimited computational power and knowledge of the
learner's algorithm.
Problem setting: the learner is told the correct value after each trial and can use this
feedback to improve its hypothesis. The objective in this model is to bound the total
number of mistakes the learner ever makes, over any such sequence.
Definition 1: An algorithm A is said to learn C in the mistake bound model if for any
concept c ∈ C, and for any ordering of examples consistent with c, the total number of
mistakes ever made by A is bounded by p(n, size(c)), where p is a polynomial. We say that
A is a polynomial time learning algorithm if its running time per stage is also polynomial
in n and size(c). Let us now examine a few problems that are learnable in the mistake
bound model.
Conjunctions: Let us assume that we know that the target concept c will be a
conjunction of a set of (possibly negated) variables, with an example space of n-bit
strings. The standard algorithm is:
1. Initialize h to the conjunction of all 2n literals: x1, ¬x1, x2, ¬x2, …, xn, ¬xn.
2. Given an example x, predict True if and only if x satisfies every literal in h.
3. If the prediction is False but the label is actually True, remove all the literals in h which
are False in x. (So if the first mistake is on 1001, the new h will be x1 ∧ ¬x2 ∧ ¬x3 ∧ x4.)
4. If the prediction is True but the label is actually False, then output "no consistent
conjunction".
5. Return to step 2.
An invariant of this algorithm is that the set of literals in c will always be a subset of the
set of literals in h. The first mistake on a positive example will bring the size of h down to
n literals. Each subsequent such mistake will remove at least one literal from h, so that
the total number of mistakes is at most n + 1.
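A minimal runnable sketch of this algorithm, assuming examples arrive as n-bit strings and encoding each literal as an (index, is_positive) pair:

def learn_conjunction(examples, n):
    # Step 1: h is the conjunction of all 2n literals x1, ¬x1, ..., xn, ¬xn,
    # encoded as (index, is_positive) pairs; this h is satisfied by nothing.
    h = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
    mistakes = 0
    for x, label in examples:
        # Step 2: predict True iff x satisfies every literal currently in h.
        prediction = all((x[i] == '1') == positive for (i, positive) in h)
        if prediction == label:
            continue
        mistakes += 1
        if label:
            # Step 3: mistake on a positive example -> drop literals False in x.
            h = {(i, pos) for (i, pos) in h if (x[i] == '1') == pos}
        else:
            # Step 4: cannot happen if the target really is a conjunction.
            raise ValueError("no consistent conjunction")
    return h, mistakes

# After the first mistake, on positive example 1001, h is x1 ∧ ¬x2 ∧ ¬x3 ∧ x4:
h, m = learn_conjunction([("1001", True)], n=4)
print(sorted(h), m)   # [(0, True), (1, False), (2, False), (3, True)] 1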
WEAK LEARNING:
The PAC learning model requires the learner to be able to produce hypotheses that are
arbitrarily close to the target concept. While simple rules of thumb that are merely
"generally right" are easy to find, it is often extremely difficult to find a single hypothesis
that is highly accurate. A weak learner is only required to output hypotheses that do
slightly better than random guessing.
The problem of converting such a weak learner into a PAC learner is known as hypothesis
boosting. Since Schapire's original result, boosting schemes have worked by re-weighting
the training data, driving the weak learner to provide hypotheses that can be merged to
produce a highly accurate combined hypothesis, which is then tested.
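The notes above only name hypothesis boosting; as a concrete illustration, here is a minimal sketch of AdaBoost with decision-stump weak learners (a standard boosting scheme; the dataset, round count, and helper names best_stump/adaboost/predict are assumptions for this example). Each round re-weights the examples toward the previous mistakes, and a weighted vote of the weak hypotheses forms the strong hypothesis.

import numpy as np

def best_stump(X, y, w):
    """Weak learner: the single-feature threshold test minimizing weighted error."""
    best = (0, 0.0, 1, np.inf)          # (feature j, threshold t, sign s, error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def adaboost(X, y, rounds=20):
    w = np.full(len(y), 1.0 / len(y))   # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        j, t, s, err = best_stump(X, y, w)
        err = max(err, 1e-12)           # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] <= t, s, -s)
        w *= np.exp(-alpha * y * pred)  # up-weight the examples it got wrong
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(X[:, j] <= t, s, -s) for a, j, t, s in ensemble)
    return np.sign(score)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])      # no single stump fits this labeling
model = adaboost(X, y)
print((predict(model, X) == y).mean())  # expect 1.0 (zero training error)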
➢ The sample complexity bounds above are restricted to finite hypothesis spaces. Some
infinite hypothesis spaces are more expressive than others, e.g., rectangles vs. 17-sided
convex polygons vs. general convex regions. We therefore need a measure of the
expressiveness of an infinite hypothesis space other than its size. Analogous to the
bounds using |H|, there are bounds for sample complexity using VC(H).
➢ An unbiased hypothesis space H shatters the entire instance space X (it is able to
induce every possible partition on the set of all possible instances). The larger the subset
of X that can be shattered, the more expressive the hypothesis space is, i.e., the less
biased.
Each hypothesis h imposes a dichotomy on an instance set S; that is, h partitions S into
the two subsets {x ∈ S | h(x) = 1} and {x ∈ S | h(x) = 0}. Given some instance set S,
there are 2^|S| possible dichotomies, though H may be unable to represent some of them
as concepts over S.
For example, with hypotheses that are intervals on the real line, we can shatter any
dataset of two reals, but we cannot shatter any dataset of three real values. The ability to
shatter a set of instances is thus a measure of H's capacity to represent target concepts
defined over these instances.
VC dimension of H:
The VC dimension of the hypothesis space H over instance space X, written VC(H), is the
size of the largest finite subset of X that can be shattered by H. If arbitrarily large finite
subsets of X can be shattered by H, then VC(H) = ∞.
❖ If there exists at least one subset of size d that can be shattered, then VC(H) ≥ d.
❖ If no subset of size d can be shattered, then VC(H) < d.
❖ The VC dimension of a 2-D linear classifier is 3: that is the size of the largest set of
points that can be labeled arbitrarily. Note that |H| is infinite here, yet the
expressiveness is quite low.
❖ If H is finite: VC(H) ≤ log2|H|. A set S with d instances has 2^d distinct
dichotomies, so H requires at least 2^d distinct hypotheses to shatter d instances.
Thus, if VC(H) = d, then 2^d ≤ |H|, and hence VC(H) = d ≤ log2|H|.
A hypothesis h is approximately correct when, with probability at least 1 − δ, h has error
less than ε. We use the VC(H)-based bound rather than the |H|-based bound when the
hypothesis space is infinite. Note that if we want to shatter m points, |H| has to be at
least 2^m in order to realize every configuration of labels on those m examples.
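As a worked illustration of these bounds, the sketch below computes the standard PAC sample-complexity bound for finite hypothesis spaces, m ≥ (1/ε)(ln|H| + ln(1/δ)), alongside the VC-dimension-based bound m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)) (the forms given in Mitchell's Machine Learning; the example numbers are assumptions):

from math import log, log2, ceil

def m_finite(H_size, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)), for a consistent learner."""
    return ceil((log(H_size) + log(1 / delta)) / eps)

def m_vc(vc, eps, delta):
    """m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return ceil((4 * log2(2 / delta) + 8 * vc * log2(13 / eps)) / eps)

# Conjunctions over n = 10 boolean variables: |H| = 3^10 distinct conjunctions.
print(m_finite(3 ** 10, eps=0.1, delta=0.05))   # 140 examples
# 2-D linear separators: VC(H) = 3 even though |H| is infinite.
print(m_vc(3, eps=0.1, delta=0.05))             # 1899 examples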
COLT SUMMARY:
The PAC framework provides a reasonable model for theoretically analyzing the
effectiveness of learning algorithms.
The sample complexity for any consistent learner using hypothesis space H can be
determined from a measure of H's expressiveness (|H|, VC(H)).
If the sample complexity is tractable, then the computational complexity of finding
a consistent hypothesis governs the complexity of the problem.
The sample complexity bounds given here are far from tight, but they separate
learnable classes from non-learnable classes (and show what is important).
The computational complexity results exhibit cases where learning is feasible in the
information-theoretic sense, but finding a good hypothesis is intractable.
The theoretical framework allows for a concrete analysis of the complexity of
learning as a function of various assumptions (e.g., the number of relevant variables).
*********************************************************************************
RULE LEARNING:
1. It is useful to learn the target function represented as a set of if-then rules that
jointly define the function. One way to learn sets of rules is to first learn a decision
tree, then translate the tree into an equivalent set of rules-one rule for each leaf
node in the tree.
2. A variety of algorithms directly learn rule sets, and they differ from tree-based
algorithms in two key respects. First, they are designed to learn sets of first-order
rules that contain variables. This is significant because first-order rules are much
more expressive than propositional rules. Second, the algorithms discussed here
use sequential covering algorithms that learn one rule at a time to incrementally
grow the final set of rules.
Learn-One-Rule:
We consider a family of algorithms for learning rule sets based on the strategy of
learning one rule, removing the data it covers, then iterating this process. Such
algorithms are called sequential covering algorithms. To elaborate, imagine we have
a subroutine LEARN-ONE-RULE that accepts a set of positive and negative training
examples as input, then outputs a single rule that covers many of the positive
examples and few of the negative examples. We require that this output rule have
high accuracy, but not necessarily high coverage. By high accuracy, we mean the
predictions it makes should be correct. By accepting low coverage, we mean it need
not make predictions for every training example.
For example:
IF Mother(y, x) and Female(y), THEN Daughter(x, y).
Here, any person can be bound to the variables x and y. The Learn-One-Rule
algorithm follows a greedy search paradigm: it searches for a rule with high
accuracy even if its coverage is low, and it returns a single rule that covers some of
the positive examples.
Learn-One-Rule(target_attribute, attributes, examples, k):
    Pos = positive examples
    Neg = negative examples
    best_hypothesis = the most general hypothesis (an empty precondition)
    candidate_hypotheses = {best_hypothesis}
    while candidate_hypotheses is not empty:
        // Generate more specific candidates by adding one (attribute = value)
        // constraint to each member of candidate_hypotheses
        new_candidates = all such specializations of candidate_hypotheses
        // Update best_hypothesis
        best_hypothesis = argmax over h in new_candidates ∪ {best_hypothesis}
                          of Performance(h, examples, target_attribute)
        // Keep only the k best new candidates (beam search)
        candidate_hypotheses = the k best members of new_candidates
    return the rule: IF best_hypothesis THEN the most common value of
        target_attribute among the examples that match best_hypothesis
It starts with the most general rule precondition, then greedily adds the attribute
constraint that most improves performance measured over the training examples.
Day Weather Temp Wind Rain PlayBadminton
D1 Sunny Hot Weak Heavy No
D2 Sunny Hot Strong Heavy No
D3 Overcast Hot Weak Heavy No
D4 Snowy Cold Weak Light Yes
D5 Snowy Cold Weak Light Yes
D6 Snowy Cold Strong Light Yes
D7 Overcast Mild Strong Heavy No
D8 Sunny Hot Weak Light Yes
The Sequential Covering algorithm takes care, to some extent, of the low-coverage
problem of the Learn-One-Rule algorithm, by learning rules that cover the data in a
sequential manner.
Working of the Algorithm:
Step 1 – Start with an empty decision list, 'R', and the full set of training examples.
Step 2 – Use Learn-One-Rule to find the best candidate rule for the examples that remain.
Step 3 – The rule becomes 'desirable' when it covers a majority of the positive
examples.
Step 4 – When this rule is obtained, delete all the training data covered by that rule
(i.e., when the rule is applied to the dataset, the examples it covers are removed).
Step 5 – The new rule is added to the bottom of the decision list 'R', and the process
repeats from Step 2.
➢ Let us understand step by step how the algorithm works on an example (the
accompanying figure is not reproduced here).
➢ First, we create an empty decision list. During Step 1, we see that there are
three groups of positive examples present in the dataset. As per the algorithm,
we consider the group with the maximum number of positive examples (6, in
this example).
➢ Once we cover these 6 positive examples, we get our first rule R1, which is
then pushed into the decision list, and those positive examples are removed
from the dataset.
➢ Next, we take the largest remaining group of positive examples (5, in this
example) and follow the same process until we get rule R2 (and likewise R3).
➢ In the end, we obtain our final decision list with all the desirable rules.
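Putting the pieces together, here is a minimal runnable sketch of Sequential Covering with a greedy Learn-One-Rule on the PlayBadminton table above (the helper names and the stopping conditions are simplifying assumptions; rules are learned for the positive class only):

DATA = [  # (Weather, Temp, Wind, Rain, PlayBadminton) from the table above
    ("Sunny",    "Hot",  "Weak",   "Heavy", "No"),
    ("Sunny",    "Hot",  "Strong", "Heavy", "No"),
    ("Overcast", "Hot",  "Weak",   "Heavy", "No"),
    ("Snowy",    "Cold", "Weak",   "Light", "Yes"),
    ("Snowy",    "Cold", "Weak",   "Light", "Yes"),
    ("Snowy",    "Cold", "Strong", "Light", "Yes"),
    ("Overcast", "Mild", "Strong", "Heavy", "No"),
    ("Sunny",    "Hot",  "Weak",   "Light", "Yes"),
]
ATTRS = ["Weather", "Temp", "Wind", "Rain"]

def covers(rule, row):
    """A rule is a list of (attribute, value) tests; the empty rule covers all rows."""
    return all(row[ATTRS.index(a)] == v for a, v in rule)

def accuracy(rule, rows):
    """(fraction of covered rows labeled Yes, number of covered rows)."""
    covered = [r for r in rows if covers(rule, r)]
    if not covered:
        return (0.0, 0)
    return (sum(r[-1] == "Yes" for r in covered) / len(covered), len(covered))

def learn_one_rule(rows):
    """Greedily add the test that most improves accuracy (ties broken by coverage)."""
    rule = []
    while accuracy(rule, rows)[0] < 1.0 and len(rule) < len(ATTRS):
        used = {a for a, _ in rule}
        candidates = [rule + [(a, v)]
                      for i, a in enumerate(ATTRS) if a not in used
                      for v in {r[i] for r in rows}]
        rule = max(candidates, key=lambda c: accuracy(c, rows))
    return rule

def sequential_covering(rows):
    """Learn rules until every positive example is covered; remove covered positives."""
    rules, remaining = [], list(rows)
    while any(r[-1] == "Yes" for r in remaining):
        rule = learn_one_rule(remaining)
        rules.append(rule)
        remaining = [r for r in remaining
                     if not (covers(rule, r) and r[-1] == "Yes")]
    return rules

for rule in sequential_covering(DATA):
    print("IF", " AND ".join(f"{a}={v}" for a, v in rule), "THEN PlayBadminton=Yes")
# -> IF Rain=Light THEN PlayBadminton=Yes  (covers all 4 positive rows)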
A Horn clause H ← (L1 ∧ … ∧ Ln) is equivalent, using our earlier rule notation, to
IF L1 ∧ … ∧ Ln, THEN H. In this notation, the Horn clause preconditions L1 ∧ … ∧ Ln are
called the clause body or, alternatively, the clause antecedents. The literal H that forms
the postcondition is called the clause head or, alternatively, the clause consequent.
As an example, consider FOIL learning the target relation GrandDaughter(x, y); the
search starts from the most general rule, GrandDaughter(x, y) ← , and specializes it one
literal at a time.
Step 4 - FOIL now considers all the literals from the previous step as well as
Female(z), Father(z, w), Father(w, z), etc., and their negations.
Step 5 - FOIL might select Father(z, x), and on the next step Female(y), leading to
NewRule: GrandDaughter(x, y) ← Father(y, z) ∧ Father(z, x) ∧ Female(y)
Step 6 - If the rule produced by this greedy search covers only positive examples,
FOIL terminates the search for further refinements of it.
FOIL then removes all positive examples covered by the new rule. If any positive examples
remain, the outer while loop continues.
FOIL: PERFORMANCE EVALUATION MEASURE
The performance of a new rule is not measured by entropy (as in the PERFORMANCE
method of the Learn-One-Rule algorithm). FOIL uses a gain measure to decide which new,
more specialized rule to adopt; each candidate literal's utility is estimated via the number
of bits required to encode the classification of the positive bindings:
Foil_Gain(L, R) = t · ( log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0)) )
where:
L is the candidate literal to add to rule R
p0 = number of positive bindings of R
n0 = number of negative bindings of R
p1 = number of positive bindings of R + L
n1 = number of negative bindings of R + L
t = number of positive bindings of R also covered by R + L
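A minimal sketch of the Foil_Gain computation just defined (the binding counts in the example call are assumptions):

from math import log2

def foil_gain(p0, n0, p1, n1, t):
    """Foil_Gain(L, R) = t * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Rule R: 10 positive / 10 negative bindings. Adding literal L leaves
# 8 positive / 2 negative bindings, with t = 8 positive bindings surviving.
print(foil_gain(p0=10, n0=10, p1=8, n1=2, t=8))   # ~5.42 bits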
In summary, FOIL is a rule-based learning algorithm that extends the Sequential
Covering + Learn-One-Rule approach, using the gain measure above (rather than
entropy/information gain) to determine the best rule.
LEARNING RECURSIVE RULE SETS:
1. Given an appropriate set of training examples, recursive rules, such as the two
clauses defining Ancestor (a base clause plus a recursive clause that uses the target
predicate Ancestor itself), can be learned following a trace similar to the one above
for GrandDaughter. The second, recursive rule is among the rules that are
potentially within reach of FOIL's search.
2. This holds provided Ancestor is included in the list Predicates that determines
which predicates may be considered when generating new literals. Whether this
particular rule is learned depends on whether these literals outscore competing
candidates during FOIL's greedy search for increasingly specific rules. So learning
recursive rules is possible; a remaining question is how to avoid learning rule sets
that produce infinite recursion.
INDUCTION AS INVERTED DEDUCTION
➢ A second, quite different approach to inductive logic programming is based on the
simple observation that induction is just the inverse of deduction.
➢ In general, machine learning involves building theories that explain the observed
data.
➢ Let xi denote the ith training instance and f(xi) its target value. Then learning is
the problem of discovering a hypothesis h such that the classification f(xi) of
each training instance xi follows deductively from the hypothesis h, the description
of xi, and any other background knowledge B known to the system.
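Written formally, the learner seeks a hypothesis h such that
(∀⟨xi, f(xi)⟩ ∈ D) (B ∧ h ∧ xi) ⊢ f(xi),
where D is the training data and ⊢ denotes deductive entailment; this is the standard
statement of the inverted-deduction constraint.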
Decision-Trees to Rules:
For each path in a decision tree from the root to a leaf, create a rule with the
conjunction of tests along the path as an antecedent and the leaf label as the
consequent.
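A minimal sketch of this conversion, assuming a toy tree encoded as nested (attribute, branches) tuples with string class labels at the leaves:

TREE = ("Rain", {                      # toy tree over the PlayBadminton attributes
    "Light": "Yes",
    "Heavy": ("Wind", {"Weak": "No", "Strong": "No"}),
})

def tree_to_rules(tree, path=()):
    if isinstance(tree, str):          # leaf: emit one rule for this path
        ante = " AND ".join(f"{a}={v}" for a, v in path) or "TRUE"
        return [f"IF {ante} THEN PlayBadminton={tree}"]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, path + ((attr, value),))
    return rules

for rule in tree_to_rules(TREE):
    print(rule)
# IF Rain=Light THEN PlayBadminton=Yes
# IF Rain=Heavy AND Wind=Weak THEN PlayBadminton=No
# IF Rain=Heavy AND Wind=Strong THEN PlayBadminton=No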
❖ Post-processing of decision-tree rules: the resulting rules may contain
unnecessary antecedents that are not needed to exclude negative examples and
that lead to over-fitting.
❖ Rules are post-pruned by greedily removing antecedents or rules as long as
performance on the training data or a validation set is not significantly harmed.
❖ The resulting rules may lead to conflicting conclusions on some instances.
❖ Sort rules by training (validation) accuracy to create an ordered decision list.
The first rule in the list that applies is used to classify a test instance.
PROGOL:
Progol reduces the combinatorial explosion by generating the most specific acceptable
hypothesis h.
1. The user specifies H by stating the predicates, functions, and forms of arguments
allowed for each.
2. Progol uses a sequential covering algorithm: for each ⟨xi, f(xi)⟩ it finds the most
specific hypothesis hi such that B ∧ hi ∧ xi ⊢ f(xi) (in fact, it considers only k-step
entailment).
3. It then conducts a general-to-specific search bounded by that most specific
hypothesis hi, choosing the hypothesis with the minimum description length.
GOLEM:
A second dimension along which approaches vary is the direction of the search in LEARN-
ONE-RULE. In the algorithm described above, the search is from general to specific
hypotheses. Other algorithms we have discussed (e.g., FIND-S) search from specific to
general. One advantage of general to specific search in this case is that there is a single
maximally general hypothesis from which to begin the search, whereas there are very many
specific hypotheses in most hypothesis spaces (i.e., one for each possible instance). Given
many maximally specific hypotheses, it is unclear which to select as the starting point of
the search. One program that conducts a specific-to-general search, called GOLEM,
addresses this issue by choosing several positive examples at random to initialize and to
guide the search. The best hypothesis obtained through multiple random choices is then
selected.