UNIT 1
Introduction
Syllabus:
Introduction: Well posed learning problems, Designing a Learning system,
Perspectives and Issues in Machine Learning.
Concept Learning and the General-to-Specific Ordering: A Concept
Learning Task, Concept Learning as Search, Find-S: Finding a Maximally
Specific Hypothesis, Version Spaces and the Candidate-Elimination Algorithm,
Remarks on Version Spaces and Candidate-Elimination, Inductive Bias.
1. What is Machine Learning?
2. Some successful applications of machine learning
3. Why is Machine Learning Important?
Some tasks cannot be defined well, except by examples (e.g., recognizing people).
Relationships and correlations can be hidden within large amounts of data.
Machine Learning/Data Mining may be able to find these relationships.
Human designers often produce machines that do not work as well as desired in
the environments in which they are used.
The amount of knowledge available about certain tasks might be too large for
explicit encoding by humans (e.g., medical diagnosis).
Environments change over time.
New knowledge about tasks is constantly being discovered by humans. It may be
difficult to continuously re-design systems “by hand”.
WELL-POSED LEARNING PROBLEMS
Definition: A computer program that learns from experience is called a
machine learning program or simply a learning program. Such a program is
sometimes also referred to as a learner.
Tom Mitchell provides a more modern definition: A computer program is said
to learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with
experience E.
In general, to have a well-posed learning problem, we must identify three features:
1. The class of tasks (T)
2. The measure of performance to be improved (P)
3. The source of experience (E)
Examples
Checkers game: A computer program that learns to play checkers might improve
its performance, as measured by its ability to win at the class of tasks
involving playing checkers games, through experience obtained by playing
games against itself. Here:
Task T: playing checkers
Performance measure P: percent of games won
Training experience E: playing practice games against itself
Fig: Checkers game board
DESIGNING A LEARNING SYSTEM
The basic design issues and approaches to machine learning are illustrated
by designing a program to learn to play checkers, with the goal of entering
it in the world checkers tournament
1. Choosing the Training Experience
2. Choosing the Target Function
3. Choosing a Representation for the Target Function
4. Choosing a Function Approximation Algorithm
a) Estimating training values
b) Adjusting the weights
5. The Final Design
1. Choosing the Training Experience
The first design choice is to choose the type of training experience from
which the system will learn.
The type of training experience available can have a significant impact on
success or failure of the learner.
There are three key attributes that impact the success or failure of the
learner:
a) Whether the training experience provides direct or indirect feedback
regarding the choices made by the performance system.
b) The degree to which the learner controls the sequence of training
examples.
c) How well the training experience represents the distribution of examples
over which the final system performance P must be measured.
a) Whether the training experience provides direct or indirect feedback
regarding the choices made by the performance system.
Learning from direct training feedback is typically easier than learning
from indirect feedback.
In learning to play checkers, the system might learn from direct training
examples consisting of individual checkers board states and the correct
move for each.
Alternatively, the system might have available only indirect information,
consisting of the move sequences and final outcomes of various games played;
the learner then faces the problem of credit assignment, i.e., determining
the degree to which each move in a sequence deserves credit or blame for
the final outcome.
b) The degree to which the learner controls the sequence of training examples.
The learner might depend on the teacher to select informative board states and to
provide the correct move for each.
The learner might itself propose board states that it finds particularly confusing and
ask the teacher for the correct move.
The learner may have complete control over both the board states and (indirect)
training classifications, as it does when it learns by playing against itself with no
teacher present.
c) How well the training experience represents the distribution of examples over
which the final system performance P must be measured.
A third important attribute of the training experience is how well it represents
the distribution of examples over which the final system performance P must
be measured.
In checkers learning scenario, the performance metric P is the percent of
games the system wins in the world tournament.
If its training experience E consists only of games played against itself, there is
a danger that this training experience might not be fully representative of the
distribution of situations over which it will later be tested.
In practice, it is often necessary to learn from a distribution of examples
that is somewhat different from the one on which the final system will be
evaluated; this is problematic, because mastery of one distribution of
examples does not necessarily lead to strong performance over another.
2. Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and how
this will be used by the performance program.
Let us consider a checkers-playing program that can generate the legal moves from any
board state. The program needs only to learn how to choose the best move from among
these legal moves. Since the program must learn to choose among the legal moves, the
most obvious choice for the type of information to be learned is a program, or function,
that chooses the best move for any given board state.
Let ChooseMove be the target function and the notation is
ChooseMove : B→ M
which indicates that this function accepts as input any board from the set of legal board states
B and produces as output some move from the set of legal moves M.
ChooseMove is a natural choice for the target function in the checkers example, but this
function will turn out to be very difficult to learn given the kind of indirect training
experience available to our system.
An alternative target function is an evaluation function V : B → R that assigns a
numerical score to any given board state.
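To make this concrete, here is a minimal Python sketch of how an evaluation function can realize ChooseMove indirectly; the helper names (legal_moves, apply_move, V) are hypothetical and not part of the original notes:

    def choose_move(board, legal_moves, apply_move, V):
        # Pick the legal move leading to the successor board state
        # that the (assumed) evaluation function V scores highest.
        return max(legal_moves, key=lambda m: V(apply_move(board, m)))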
3. Choosing a Representation for the Target Function
Let us choose a simple representation: for any given board state b, the
learned value V̂(b) will be calculated as a linear combination of the
following board features:
x1: the number of black pieces on the board
x2: the number of red pieces on the board
x3: the number of black kings on the board
x4: the number of red kings on the board
x5: the number of black pieces threatened by red (i.e.,
which can be captured on red's next turn)
x6: the number of red pieces threatened by black
Thus, the learning program will represent V̂(b) as a linear function of the form
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
where:
1. w0 through w6 are numerical coefficients, or weights, to be chosen by the
learning algorithm.
2. The learned values for the weights w1 through w6 will determine the relative
importance of the various board features in determining the value of the board.
3. The weight w0 will provide an additive constant to the board value.
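As an illustration, a minimal Python sketch of this linear evaluation function; the weight values below are hypothetical, chosen only for demonstration:

    # Hypothetical weights w0..w6; a real learner would estimate these.
    weights = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]

    def v_hat(x):
        # V̂(b) = w0 + w1*x1 + ... + w6*x6, with x = [x1, ..., x6]
        return weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))

    # A board where black has 3 pieces (one a king) and red has none:
    print(v_hat([3, 0, 1, 0, 0, 0]))   # 0.5 + 1.0*3 + 2.0*1 = 5.5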
4. Choosing a Function Approximation Algorithm
In order to learn the target function V̂ we require a set of training examples,
each describing a specific board state b and the training value Vtrain(b) for b.
Each training example is an ordered pair of the form (b, Vtrain(b)).
For instance, the following training example describes a board state b in
which black has won the game (note x2 = 0 indicates that red has no
remaining pieces) and for which the target function value Vtrain(b) is
therefore +100:
((x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100)
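The sub-steps listed in the design outline, (a) estimating training values and (b) adjusting the weights, are commonly handled with the LMS (least mean squares) update rule wi ← wi + η (Vtrain(b) − V̂(b)) xi. A minimal self-contained sketch, with the learning rate η = 0.1 a hypothetical choice:

    def v_hat(w, x):
        # V̂(b) = w0 + w1*x1 + ... + w6*x6
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

    def lms_update(w, x, v_train, eta=0.1):
        # One LMS step: w_i <- w_i + eta * (Vtrain(b) - V̂(b)) * x_i
        error = v_train - v_hat(w, x)
        w[0] += eta * error                  # bias weight w0 (x0 = 1)
        for i, xi in enumerate(x, start=1):
            w[i] += eta * error * xi
        return w

    w = [0.0] * 7
    # The training example above: black has won, Vtrain(b) = +100
    w = lms_update(w, [3, 0, 1, 0, 0, 0], 100)
    print(w)   # [10.0, 30.0, 0.0, 10.0, 0.0, 0.0, 0.0]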
5. The Final Design
The final design of the checkers learning system can be described by four
distinct program modules that represent the central components in many
learning systems: the Performance System, the Critic, the Generalizer,
and the Experiment Generator.
PERSPECTIVES AND ISSUES IN MACHINE LEARNING
One useful perspective on machine learning is that it involves searching a very large space of possible
hypotheses to determine one that best fits the observed data and any prior knowledge held by the learner.
Issues in Machine Learning
The field of machine learning is concerned with answering questions such as the
following:
What algorithms exist for learning general target functions from specific training examples? In what
settings will particular algorithms converge to the desired function, given sufficient training data?
Which algorithms perform best for which types of problems and representations?
How much training data is sufficient? What general bounds can be found to relate the confidence in
learned hypotheses to the amount of training experience and the character of the learner's hypothesis
space?
When and how can prior knowledge held by the learner guide the process of generalizing from
examples? Can prior knowledge be helpful even when it is only approximately correct?
What is the best strategy for choosing a useful next training experience, and how does the choice of
this strategy alter the complexity of the learning problem?
What is the best way to reduce the learning task to one or more function approximation problems? Put
another way, what specific functions should the system attempt to learn? Can this process itself be
automated?
How can the learner automatically alter its representation to improve its ability to represent and learn
the target function?
CONCEPT LEARNING
A Concept is a subset of objects or events defined over a larger set.
For example, we refer to the set of everything (i.e., all objects) as the set
of things.
Animals are a subset of things, and birds are a subset of animals.
In more technical terms, a Concept is a Boolean valued function
defined over this larger set.
For example, a concept may be a function defined over all animals whose value
is true for birds and false for every other animal.
A CONCEPT LEARNING TASK
Consider the example task of learning the target concept "days on which
Aldo enjoys his favorite water sport".
Table: Positive and negative training examples for the target
concept EnjoySport.
The task is to learn to predict the value of EnjoySport for an arbitrary
day, based on the values of its other attributes.
Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes
What hypothesis representation is provided to the learner?
Let us begin by considering a simple representation in which each hypothesis
consists of a conjunction of constraints on the instance attributes.
In particular, let each hypothesis be a vector of six constraints, specifying the values
of the six attributes
o Sky – (values: Sunny, Cloudy, Rainy),
o AirTemp – (values: Warm, Cold),
o Humidity – (values: Normal, High),
o Wind – (values: Strong, Weak),
o Water – (values: Warm, Cold),
o Forecast – (values: Same, Change).
For each attribute, the hypothesis will either
o Indicate by a "?" that any value is acceptable for this attribute,
o Specify a single required value (e.g., Warm) for the attribute, or
o Indicate by a "Ø" that no value is acceptable.
If some instance x satisfies all the constraints of hypothesis h, then h
classifies x as a positive example (h(x) = 1).
To illustrate, the hypothesis that
– A person enjoys her favorite sport only on warm days with high
humidity (independent of the values of the other attributes) is
represented by the expression
<?, Warm, High, ?, ?, ?>
– The most general hypothesis, that every day is a positive example, is
represented by
<?, ?, ?, ?, ?, ?>
– The most specific possible hypothesis, that no day is a positive example,
is represented by
<Ø, Ø, Ø, Ø, Ø, Ø>
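A hypothesis in this representation can be checked against an instance with a few lines of Python; this is a sketch using tuples of strings, with '?' for "any value" and 'Ø' for "no value":

    def satisfies(h, x):
        # '?' accepts any value; a specific constraint must match exactly;
        # a 'Ø' constraint never equals an attribute value, so it rejects x.
        return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

    h = ('?', 'Warm', 'High', '?', '?', '?')
    x = ('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same')
    print(satisfies(h, x))   # True, i.e. h classifies x as positive: h(x) = 1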
Notation
The set of items over which the concept is defined is called the set of instances, which is
denoted by X.
Example: X is the set of all possible days, each represented by the attributes: Sky,
AirTemp, Humidity, Wind, Water, and Forecast.
The concept or function to be learned is called the target concept, which is denoted by
c. Here c can be any Boolean-valued function defined over the instances X:
c : X → {0, 1}
Example: In the current example, the target concept corresponds to the value of the
attribute EnjoySport. i.e.,
o c(x) = 1 if EnjoySport = Yes, and
o c(x) = 0 if EnjoySport = No.
Instances for which c(x) = 1 are called positive examples, or members of the target
concept.
Instances for which c(x) = 0 are called negative examples, or non-members of
the target concept.
THE INDUCTIVE LEARNING HYPOTHESIS
Inductive learning is a learning approach in which rules (hypotheses) are inferred
from facts or data.
Inductive learning algorithms can at best guarantee that the output hypothesis
fits the target concept over the training data.
Assumption: a hypothesis that approximates the training data well will also
approximate the target function over unobserved examples; i.e., given a
sufficiently large training set, the output hypothesis is able to make
reliable predictions.
The inductive learning hypothesis: Any hypothesis found to approximate the target
function well over a sufficiently large set of training examples will also approximate
the target function well over other unobserved examples
CONCEPT LEARNING AS SEARCH
Concept learning can be viewed as the task of searching through a space of hypotheses.
The goal of this search is to find the hypothesis that best fits the training examples.
It is important to note that by selecting a hypothesis representation, the designer of the
learning algorithm implicitly defines the space of all hypotheses that the program can ever
represent and therefore can ever learn.
Example:
Consider the instances X and hypotheses H in the EnjoySport learning task. The attribute
Sky has three possible values, and AirTemp, Humidity, Wind, Water, and Forecast each
have two possible values, so the instance space X contains exactly
3 · 2 · 2 · 2 · 2 · 2 = 96 distinct instances.
Since each attribute in a hypothesis can additionally take the values "?" and "Ø",
there are
5 · 4 · 4 · 4 · 4 · 4 = 5120 syntactically distinct hypotheses within H.
Every hypothesis containing one or more "Ø" symbols represents the empty set of
instances; that is, it classifies every instance as negative. All such hypotheses
are therefore semantically identical, giving
1 + (4 · 3 · 3 · 3 · 3 · 3) = 973 semantically distinct hypotheses.
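These counts can be checked with a short computation; a sketch:

    from math import prod

    sizes = [3, 2, 2, 2, 2, 2]             # values per attribute: Sky, AirTemp, ...
    print(prod(sizes))                      # 96 distinct instances
    print(prod(s + 2 for s in sizes))       # 5120 syntactic hypotheses ('?' and 'Ø' added)
    print(1 + prod(s + 1 for s in sizes))   # 973 semantic hypotheses (one empty concept)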
GENERAL-TO-SPECIFIC ORDERING OF HYPOTHESES
Many algorithms for concept learning organize the search through the hypothesis space
by relying on a very useful structure that exists for any concept learning problem: a
general-to-specific ordering of hypotheses.
By taking advantage of this naturally occurring structure over the hypothesis space, we
can design learning algorithms that exhaustively search even infinite hypothesis spaces
without explicitly enumerating every hypothesis.
To illustrate the general-to-specific ordering, consider the two hypotheses
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
Consider the sets of instances that are classified positive by h1 and by h2.
Because h2 imposes fewer constraints on the instance, it classifies more instances as
positive: any instance classified positive by h1 will also be classified positive by h2.
Therefore, h2 is more general than h1.
Given hypotheses hj and hk, hj is more-general-than-or-equal-to hk if and only if any
instance that satisfies hk also satisfies hj.
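For conjunctive hypotheses of this form, the relation can be tested constraint by constraint; a sketch (it ignores the degenerate case where hk contains 'Ø'):

    def more_general_or_equal(hj, hk):
        # hj >=g hk iff every constraint of hj is '?' or equals
        # the corresponding constraint of hk.
        return all(a == '?' or a == b for a, b in zip(hj, hk))

    h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
    h2 = ('Sunny', '?', '?', '?', '?', '?')
    print(more_general_or_equal(h2, h1))   # True: h2 is more general than h1
    print(more_general_or_equal(h1, h2))   # False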
VERSION SPACES AND THE CANDIDATE-ELIMINATION ALGORITHM
Note the difference between the definitions of consistent and satisfies:
1. An example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is a
positive or negative example of the target concept.
2. An example x is said to be consistent with hypothesis h iff h(x) = c(x).
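The distinction can be stated in two lines of Python; a sketch:

    def satisfies(h, x):
        return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

    def consistent(h, x, c_of_x):
        # h is consistent with the example (x, c(x)) iff h(x) = c(x)
        return (1 if satisfies(h, x) else 0) == c_of_x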
• Version Space: the subset of hypotheses from H that are consistent with the training
examples. It lies between the general boundary and the specific boundary, and it
contains not just a single hypothesis but the set of all hypotheses consistent with
the training data set.
• The candidate elimination algorithm incrementally builds the version space given a
hypothesis space H and a set D of examples.
• The examples are added one by one; each example possibly shrinks the version space
by removing the hypotheses that are inconsistent with the example.
• The candidate elimination algorithm does this by updating the general and specific
boundary for each new example.
– You can consider this an extended form of the Find-S algorithm.
– It considers both positive and negative examples.
– Positive examples are used, as in the Find-S algorithm, to generalize the
specific boundary.
– Negative examples are used to specialize the general boundary.
CANDIDATE-ELIMINATION Learning Algorithm
The CANDIDATE-ELIMINATION algorithm computes the version space containing all
hypotheses from H that are consistent with an observed sequence of training
examples.
The Find-S and Candidate Elimination algorithms are both used in concept learning, but
they differ in their approach to finding a hypothesis.
Find-S Algorithm
The Find-S algorithm is a straightforward approach that focuses on finding the most
specific hypothesis that is consistent with all the positive training examples. It operates as
follows:
It only considers positive examples. Negative examples are completely ignored.
It starts with the most specific possible hypothesis, which is a hypothesis that matches
no examples (e.g., all attributes are set to 'Ø').
For each positive example, it generalizes the current hypothesis only as much as
necessary to accommodate that example.
The final hypothesis is a single, maximally specific hypothesis that covers all positive
training instances.
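A minimal Find-S sketch in Python, run on the EnjoySport examples from the table above:

    def find_s(examples):
        # examples: list of (instance_tuple, label) pairs, labels 'Yes'/'No'.
        # Returns the maximally specific hypothesis covering the positives.
        n = len(examples[0][0])
        h = ['Ø'] * n                    # most specific hypothesis
        for x, label in examples:
            if label != 'Yes':
                continue                 # Find-S ignores negative examples
            for i, xi in enumerate(x):
                if h[i] == 'Ø':
                    h[i] = xi            # first positive: copy its values
                elif h[i] != xi:
                    h[i] = '?'           # generalize where values disagree
        return tuple(h)

    data = [
        (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
        (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
        (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
        (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
    ]
    print(find_s(data))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')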
Candidate Elimination Algorithm
The Candidate Elimination algorithm is a more comprehensive approach that identifies a set
of all hypotheses consistent with the training data, known as the version space. This
algorithm is more powerful than Find-S because it uses both positive and negative examples
to narrow down the possible hypotheses.
It maintains two sets of hypotheses:
o G (General boundary): A set of the most general hypotheses consistent with the data.
o S (Specific boundary): A set of the most specific hypotheses consistent with the data.
It uses positive examples to generalize the specific boundary (S) and negative examples
to specialize the general boundary (G).
The algorithm continues to refine these boundaries as training examples arrive;
the learned version space has converged to the target concept when the S and G
boundaries become identical.
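A simplified Candidate-Elimination sketch for conjunctive hypotheses (it omits some pruning steps of the full algorithm, such as removing non-maximal members of G), run on the EnjoySport data:

    def satisfies(h, x):
        return all(c == '?' or c == xi for c, xi in zip(h, x))

    def covers(general, specific):
        # True iff 'general' is more-general-than-or-equal-to 'specific'.
        return all(g == '?' or g == s for g, s in zip(general, specific))

    def min_generalize(h, x):
        # Minimal generalization of h so that it covers the positive instance x.
        return tuple(xi if c == 'Ø' else (c if c == xi else '?')
                     for c, xi in zip(h, x))

    def min_specializations(g, x, domains):
        # Minimal specializations of g that exclude the negative instance x.
        out = []
        for i, c in enumerate(g):
            if c == '?':
                for v in domains[i]:
                    if v != x[i]:
                        s = list(g)
                        s[i] = v
                        out.append(tuple(s))
        return out

    def candidate_elimination(examples, domains):
        n = len(domains)
        S = {tuple(['Ø'] * n)}           # specific boundary
        G = {tuple(['?'] * n)}           # general boundary
        for x, label in examples:
            if label == 'Yes':
                G = {g for g in G if satisfies(g, x)}
                S = {min_generalize(s, x) for s in S}
                S = {s for s in S if any(covers(g, s) for g in G)}
            else:
                S = {s for s in S if not satisfies(s, x)}
                G = {sp for g in G
                        for sp in (min_specializations(g, x, domains)
                                   if satisfies(g, x) else [g])
                        if any(covers(sp, s) for s in S)}
        return S, G

    domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'),
               ('Normal', 'High'), ('Strong', 'Weak'),
               ('Warm', 'Cool'), ('Same', 'Change')]
    data = [
        (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
        (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
        (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
        (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
    ]
    S, G = candidate_elimination(data, domains)
    print(S)   # {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
    print(G)   # {('Sunny','?','?','?','?','?'), ('?','Warm','?','?','?','?')}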
INDUCTIVE BIAS
The inductive bias (also known as learning bias) of a learning algorithm is the set
of assumptions that the learner uses to predict outputs of given inputs that it has
not encountered.
In machine learning, the term inductive bias refers to a set of (explicit or implicit)
assumptions made by a learning algorithm in order to perform induction, that is, to
generalize a finite set of observations (the training data) into a general model of
the domain.
Inductive reasoning is the process of learning general principles on the basis of
specific instances – in other words, it’s what any machine learning algorithm does
when it produces a prediction for any unseen test instance on the basis of a finite
number of training instances. Inductive bias describes the tendency for a system to
prefer a certain set of generalizations over others that are equally consistent with
the observed data.
Inductive Bias: Advantages
It provides a nonprocedural means of characterizing a learner's policy for
generalizing beyond the observed data.
It allows comparison of different learners according to the
strength of the inductive bias they employ.
Modeling Inductive Systems by Equivalent Deductive Systems
An inductive system can be modeled by an equivalent deductive system: one whose
inputs include not only the training examples and the new instance to classify,
but also an explicit assertion of the learner's inductive bias. Given these
inputs, the classification of the new instance follows deductively.
Fig: Modeling inductive systems by equivalent deductive systems.