Figure 2.10: Normal Q-Q Plot (Q-Q plot of sample data versus standard normal; x-axis: standard normal quantiles, y-axis: quantiles of input sample)
Ideally, the points fall along the 45-degree reference line if the data follows a normal distribution. If the deviation is larger, then there is greater evidence that the dataset follows some different distribution, that is, one other than the normal distribution shape. In such a case, careful statistical investigation should be carried out before interpretation.
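Such a plot can also be generated programmatically. Below is a minimal sketch, assuming NumPy, SciPy and Matplotlib are available; the sample data is synthetic and only illustrative.

```python
# A minimal sketch of a normal Q-Q plot (illustrative data, not from the text).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=200)   # data to be tested for normality

# probplot orders the sample and plots it against standard normal quantiles;
# points near the 45-degree reference line suggest normality.
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Q-Q plot of sample data versus standard normal")
plt.show()
```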
Thus, skewness, kurtosis, mean absolute deviation and the coefficient of variation help in assessing univariate data.
2.6 BIVARIATE DATA AND MULTIVARIATE DATA
Bivariate data involves two variables. Bivariate data deals with causes and relationships. The aim is to find relationships among the data. Consider the following Table 2.3, with data of the temperature in a shop and the sales of sweaters.
Table 2.3: Temperature in a Shop and Sales Data

Temperature (in centigrade)    Sales of Sweaters
5                              200
10                             150
15                             140
20                             75
25
30                             55
35                             20
Here, the aim of bivariate analysis is to find relationships among variables. The relationships can then be used in comparisons, finding causes, and in further explorations. To do that, graphical display of the data is necessary. One such graph method is called a scatter plot.

A scatter plot is used to visualize bivariate data. It is useful to plot two variables with or without nominal variables to illustrate the trends, and also to show differences. It is a plot between explanatory and response variables. It is a 2D graph showing the relationship between two variables.

The scatter plot (refer Figure 2.11) indicates strength, shape, direction and the presence of outliers. It is useful in exploratory data analysis before calculating a correlation coefficient or fitting a regression curve.
Figure 2.11: Scatter Plot (x-axis: temperature, y-axis: sales)
Line graphs are similar to scatter plots. The Line Chart for sales data is shown in Figure 2.12.
Figure 2.12: Line Chart of Sales Data (x-axis: temperature, y-axis: sales)
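For reference, here is a minimal Matplotlib sketch that draws plots in the style of Figures 2.11 and 2.12, using only the clearly legible rows of Table 2.3 (one illegible row is omitted):

```python
# A scatter plot and a line chart of the sweater-sales data (partial table).
import matplotlib.pyplot as plt

temperature = [5, 10, 15, 20, 30, 35]   # in centigrade
sales = [200, 150, 140, 75, 55, 20]     # sweaters sold

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.scatter(temperature, sales)                       # Figure 2.11 style
ax1.set(title="Scatter plot", xlabel="Temperature", ylabel="Sales")
ax2.plot(temperature, sales, marker="o")              # Figure 2.12 style
ax2.set(title="Sales data", xlabel="Temperature", ylabel="Sales")
plt.tight_layout()
plt.show()
```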
2.6.1 Bivariate Statistics
Covariance and correlation are examples of bivariate statistics. Covariance is a measure of the joint variability of random variables, say X and Y. Generally, random variables are represented in capital letters. It is denoted as covariance(X, Y) or COV(X, Y) and is used to measure the variance between two dimensions. The formula for finding the covariance for specific $x_i$ and $y_i$ is:

$$\text{COV}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - E(X)\right)\left(y_i - E(Y)\right) \qquad (2.17)$$

Here, $x_i$ and $y_i$ are data values from X and Y, $E(X)$ and $E(Y)$ are the mean values of $x_i$ and $y_i$, and $N$ is the number of given data points. Also, COV(X, Y) is the same as COV(Y, X).
Example 2.5: Find the covariance of the data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.

Solution: Mean(X) = E(X) = 15/5 = 3 and Mean(Y) = E(Y) = 55/5 = 11. The covariance is computed using Eq. (2.17) as:

$$\frac{(1-3)(1-11) + (2-3)(4-11) + (3-3)(9-11) + (4-3)(16-11) + (5-3)(25-11)}{5} = \frac{60}{5} = 12$$

The covariance between X and Y is 12. It can be normalized to a value between -1 and +1 by dividing it by the product of the standard deviations of the variables. The result is called the Pearson correlation coefficient. Sometimes, N - 1 can also be used instead of N; in that case, the covariance is 60/4 = 15.
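Example 2.5 can be cross-checked with NumPy; note that np.cov divides by N - 1 by default, while bias=True gives the 1/N form of Eq. (2.17):

```python
# Cross-checking Example 2.5 with both covariance denominators.
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([1, 4, 9, 16, 25])

cov_n  = np.cov(X, Y, bias=True)[0, 1]   # divides by N   -> 12.0
cov_n1 = np.cov(X, Y)[0, 1]              # divides by N-1 -> 15.0
print(cov_n, cov_n1)
```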
Correlation
The Pearson correlation coefficient is the most common test for determining any association
between two phenomena. It measures the strength and direction of a linear relationship between
the x and y variables.
The correlation indicates the relationship between dimensions using its sign. The sign is more important than the actual value:

1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension decreases.
3. If the value is zero, then it indicates that the dimensions are uncorrelated, that is, they have no linear relationship.

If two dimensions are highly correlated, then it is better to remove one of them, as it is a redundant dimension.
If the given attributes are $X = (x_1, x_2, \ldots, x_N)$ and $Y = (y_1, y_2, \ldots, y_N)$, then the Pearson correlation coefficient, denoted as $r$, is given as:

$$r = \frac{\text{COV}(X, Y)}{\sigma_X \sigma_Y} \qquad (2.18)$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.
Example 2.6: Find the correlation coefficient of the data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.

Solution: The mean values of X and Y are 3 and 11. The standard deviations of X and Y are 1.41 and 8.6486, respectively. Therefore, the correlation coefficient is given as the ratio of the covariance (12, from Example 2.5) to the product of the standard deviations of X and Y, as per Eq. (2.18):

$$r = \frac{12}{1.41 \times 8.6486} = 0.984$$
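A quick numerical check of Example 2.6 follows; the unrounded result is about 0.981, and the 0.984 above comes from rounding the standard deviation of X to 1.41:

```python
# Verifying Example 2.6 with population (1/N) statistics throughout.
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([1, 4, 9, 16, 25])

cov = np.cov(X, Y, bias=True)[0, 1]   # 12.0
r = cov / (X.std() * Y.std())         # std() uses the population form by default
print(round(r, 3))                    # 0.981; np.corrcoef(X, Y)[0, 1] agrees
```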
2.7 MULTIVARIATE STATISTICS
In machine learning, almost all datasets are multivariate. Multivariate data is the analysis of more than two observable variables, and often, thousands of measurements need to be conducted for one or more subjects.

Multivariate data is like bivariate data but may have more than two dependent variables. Some of the multivariate analyses are regression analysis, principal component analysis, and path analysis.

Consider the following data with three attributes:
Id    Attribute 1    Attribute 2    Attribute 3
1     1              4              1
2     2              5              2
3     3              6              1
The mean of multivariate data is a mean vector, and the mean of the above three attributes is given as (2, 5, 1.33). The variance of multivariate data becomes the covariance matrix. The mean vector is called the centroid and the variance is called the dispersion matrix. This is discussed in the next section.
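A small NumPy sketch computing the centroid and dispersion matrix for the data above:

```python
# Mean vector (centroid) and covariance (dispersion) matrix of the 3-attribute data.
import numpy as np

data = np.array([[1, 4, 1],
                 [2, 5, 2],
                 [3, 6, 1]])   # rows = observations, columns = attributes

centroid = data.mean(axis=0)                         # [2.0, 5.0, 1.33...]
dispersion = np.cov(data, rowvar=False, bias=True)   # 3x3 covariance matrix
print(centroid)
print(dispersion)
```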
Multivariate data has three or more variables. The aims of multivariate analysis are much broader, and include regression analysis, factor analysis and multivariate analysis of variance, which are explained in the subsequent chapters of this book.
Heatmap

A heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and colours it. The darker colours indicate very large values and the lighter colours indicate smaller values. The advantage of this method is that humans perceive colours well, so by colour shading, larger values can be perceived easily. For example, in vehicle traffic data, heavy traffic regions can be differentiated from low traffic regions through a heatmap.
In Figure 2.13, patient data highlighting weight and health status is plotted. Here, the x-axis is weights and the y-axis is patient counts. The darker colour regions highlight patients' weights versus patient counts for each health status.
Figure 2.13: Heatmap for Patient Data (count of self-assessed health status vs. weight)
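A minimal heatmap sketch with Matplotlib follows; the count matrix here is random stand-in data, not the actual counts behind Figure 2.13:

```python
# A heatmap of illustrative patient counts; darker cells mean larger counts.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
counts = rng.integers(0, 60, size=(5, 30))   # 5 health statuses x 30 weight bins

plt.imshow(counts, cmap="Greys", aspect="auto")   # "Greys": larger -> darker
plt.colorbar(label="Patient count")
plt.xlabel("Weight")
plt.ylabel("Self-assessed health status")
plt.show()
```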
Pairplot
A pairplot or scatter matrix is a data visualization technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data. All the results are presented in a matrix format. By visual examination of the chart, one can easily find relationships among the variables, such as correlation between the variables.

A random matrix of three columns is chosen and the relationships of the columns are plotted as a pairplot (or scatter matrix) as shown in Figure 2.14.
Figure 2.14: Pairplot for Random Data
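A pairplot sketch mirroring Figure 2.14 can be built with the pandas scatter_matrix helper on a random three-column matrix:

```python
# A scatter matrix (pairplot) of three random columns.
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])

scatter_matrix(df, diagonal="hist")   # off-diagonal panels reveal correlations
plt.show()
```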
2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA
Machine learning involves many mathematical concepts from the domains of linear algebra, statistics, probability and information theory. The subsequent sections discuss important aspects of linear algebra and probability.

'Linear Algebra' is a branch of mathematics that is central to many scientific applications and other mathematical subjects. While all branches of mathematics are crucial for machine learning, linear algebra plays a major role, as it is the mathematics of data. Linear algebra deals with linear equations, vectors, matrices, vector spaces and transformations. These are the driving forces of machine learning, and machine learning cannot exist without these data types.
Let us discuss some of the important concepts of linear algebra now.
2.8.1 Linear Systems and Gaussian Elimination for Multivariate Data
A linear system of equations is a group of equations with unknown variables.

Let $Ax = y$; then the solution $x$ is given as:

$$x = y/A = A^{-1}y \qquad (2.19)$$

This is true if $y$ is not zero and $A$ is not zero (i.e., $A$ is invertible). The logic can be extended for a set of $N$ equations with $n$ unknown variables.

It means if $A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}$ and $y = (y_1, y_2, \ldots, y_n)^T$, then the unknown variable $x$ can be computed as:

$$x = y/A = A^{-1}y \qquad (2.20)$$
If there is a unique solution, then the system is called consistent independent. If there are various solutions, then the system is called consistent dependent. If there are no solutions and the equations are contradictory, then the system is called inconsistent.
For solving a large system of equations, Gaussian elimination can be used. The procedure for applying Gaussian elimination is given as follows:

1. Write the given matrix $A$.
2. Append the vector $y$ to the matrix $A$. This matrix is called the augmented matrix.
3. Keep the element $a_{11}$ as the pivot and eliminate $a_{21}$ in the second row using the row operation $R_2 \leftarrow R_2 - \left(\frac{a_{21}}{a_{11}}\right)R_1$, where $R_2$ is the second row and $\frac{a_{21}}{a_{11}}$ is called the multiplier. The same logic can be used to remove $a_{i1}$ in all other rows.
4. Repeat the same logic and reduce the matrix to row echelon form. Then, with the reduced coefficients, the last unknown variable is obtained as:

$$x_n = \frac{y_n}{a_{nn}} \qquad (2.21)$$

Then, the remaining unknown variables can be found by back-substitution as:

$$x_i = \frac{1}{a_{ii}}\left(y_i - \sum_{j=i+1}^{n} a_{ij}\,x_j\right) \qquad (2.22)$$

This part is called backward substitution.
To facilitate the application of the Gaussian elimination method, the following row operations are applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it
These concepts are illustrated in Example 2.8.
Example 2.8: Solve the following set of equations using the Gaussian elimination method:

$$2x_1 + 4x_2 = 6$$
$$4x_1 + 3x_2 = 7$$

Solution: Rewrite this in augmented matrix form as follows:

$$\left(\begin{array}{cc|c} 2 & 4 & 6 \\ 4 & 3 & 7 \end{array}\right)$$

Apply the transformation by dividing row 1 by 2. There are no general guidelines for row operations other than reducing the given matrix to row echelon form. The operator ~ means "reduces to". The above matrix can further be reduced as follows:

$$\sim \left(\begin{array}{cc|c} 1 & 2 & 3 \\ 4 & 3 & 7 \end{array}\right) \quad R_1 \leftarrow R_1/2$$

$$\sim \left(\begin{array}{cc|c} 1 & 2 & 3 \\ 0 & -5 & -5 \end{array}\right) \quad R_2 \leftarrow R_2 - 4R_1$$

$$\sim \left(\begin{array}{cc|c} 1 & 2 & 3 \\ 0 & 1 & 1 \end{array}\right) \quad R_2 \leftarrow R_2/(-5)$$

$$\sim \left(\begin{array}{cc|c} 1 & 0 & 1 \\ 0 & 1 & 1 \end{array}\right) \quad R_1 \leftarrow R_1 - 2R_2$$

Therefore, in the reduced echelon form, it can be observed that:

$$x_1 = 1 \quad \text{and} \quad x_2 = 1$$
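Example 2.8 can be cross-checked numerically; np.linalg.solve performs an elimination-based (LU) solve internally:

```python
# Cross-checking Example 2.8: solve 2*x1 + 4*x2 = 6 and 4*x1 + 3*x2 = 7.
import numpy as np

A = np.array([[2.0, 4.0],
              [4.0, 3.0]])
y = np.array([6.0, 7.0])

x = np.linalg.solve(A, y)
print(x)   # [1. 1.], i.e. x1 = 1 and x2 = 1
```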
2.8.2 Matrix Decompositions
It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations can be performed. These methods are also known as matrix factorization methods.

The most popular matrix decomposition is called eigen decomposition. It is a way of reducing the matrix into eigen values and eigen vectors. Then, the matrix A can be decomposed as:

$$A = Q \Lambda Q^T \qquad (2.23)$$

where Q is the matrix of eigen vectors, $\Lambda$ is the diagonal matrix of eigen values, and $Q^T$ is the transpose of matrix Q (this form holds when A is symmetric; in general, $Q^{-1}$ replaces $Q^T$).
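A short sketch of eigen decomposition with NumPy follows, assuming a symmetric example matrix so that the transpose form applies:

```python
# Eigen decomposition A = Q * Lambda * Q^T for a symmetric matrix.
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])            # symmetric example matrix

eigvals, Q = np.linalg.eigh(A)        # eigh: for symmetric/Hermitian matrices
Lam = np.diag(eigvals)
print(np.allclose(A, Q @ Lam @ Q.T))  # True: the decomposition reconstructs A
```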
LU Decomposition

One of the simplest matrix decompositions is LU decomposition, where the matrix A can be decomposed into two matrices:

$$A = LU$$

Here, L is the lower triangular matrix and U is the upper triangular matrix. The decomposition can be done using the Gaussian elimination method as discussed in the previous section. First, an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination are applied to reduce the given matrix to get the matrices L and U. Example 2.9 illustrates the application of Gaussian elimination to get L and U.
Example 2.9: Find the LU decomposition of the given matrix:

$$A = \begin{pmatrix} 1 & 2 & 4 \\ 3 & 3 & 2 \\ 3 & 4 & 2 \end{pmatrix}$$

Solution: Augment an identity matrix alongside A and apply Gaussian elimination, recording each multiplier in the identity matrix as the corresponding entry is eliminated:

$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 2 & 4 \\ 3 & 3 & 2 \\ 3 & 4 & 2 \end{pmatrix} \quad \text{(Initial matrix)}$$

$$\begin{pmatrix} 1 & 0 & 0 \\ 3 & 1 & 0 \\ 3 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 2 & 4 \\ 0 & -3 & -10 \\ 0 & -2 & -10 \end{pmatrix} \quad R_2 \leftarrow R_2 - 3R_1,\; R_3 \leftarrow R_3 - 3R_1$$

$$\begin{pmatrix} 1 & 0 & 0 \\ 3 & 1 & 0 \\ 3 & 2/3 & 1 \end{pmatrix} \begin{pmatrix} 1 & 2 & 4 \\ 0 & -3 & -10 \\ 0 & 0 & -10/3 \end{pmatrix} \quad R_3 \leftarrow R_3 - (2/3)R_2$$

Now, it can be observed that the first matrix is L, the lower triangular matrix whose values are the multipliers used in the reduction of equations above, namely 3, 3 and 2/3. The second matrix is U, the upper triangular matrix whose values are the values of the reduced matrix obtained from Gaussian elimination.
It can be cross-verified that the multiplication of L and U yields the original matrix A. What are the applications of LU decomposition? Some of the applications are finding matrix inverses and determinants. If the order of the matrix is large, then this method can be used.
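Example 2.9 can be checked with SciPy; note that scipy.linalg.lu uses partial pivoting, so its L and U may differ from the hand-computed factors even though the product reproduces A:

```python
# Verifying an LU decomposition of the Example 2.9 matrix.
import numpy as np
from scipy.linalg import lu

A = np.array([[1.0, 2.0, 4.0],
              [3.0, 3.0, 2.0],
              [3.0, 4.0, 2.0]])

P, L, U = lu(A)                    # A = P @ L @ U with row pivoting
print(np.allclose(A, P @ L @ U))   # True: the factors reproduce A
```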
2.8.3 Machine Learning and Importance of Probability and Statistics
Machine learning is linked with statistics and probability. Like linear algebra, statistics is the heart of machine learning. The importance of statistics needs to be stressed, as without statistics, analysis of data is difficult. Probability is especially important for machine learning. Any data can be assumed to be generated by a probability distribution. Machine learning datasets have multiple data that are generated by multiple distributions. So, knowledge of probability distributions and random variables is a must for a better understanding of machine learning concepts.

What is the objective of an experiment? This is about a hypothesis or constructing a model. Machine learning has many models. So, the theory of hypothesis, construction of models, evaluating the models using hypothesis testing, significance of the model and evaluation of the model are all linking factors of machine learning and probability theory. Similarly, the construction of datasets using sampling theory is also required.

The subsequent section will deal with the important concepts of statistics and probability.
Probability Distributions
A probability distribution of a variable, say X, summarizes the probability associated with X's events. A distribution is a parameterized mathematical function; in other words, a distribution is a function that describes the relationship between the observations in a sample space.

Consider a set of data. The data is said to follow a distribution if it obeys a mathematical function that characterizes that distribution. The function can be used to calculate the probability of individual observations.
Probability distributions are of two types:
1. Discrete probability distribution
2. Continuous probability distribution
The relationship between the events of a continuous random variable and their probabilities is called a continuous probability distribution. It is summarized by a Probability Density Function (PDF). The PDF gives the relative likelihood (density) of observing an instance, and the plot of the PDF shows the shape of the distribution. The Cumulative Distribution Function (CDF) computes the probability of an observation being less than or equal to a given value.

Both PDF and CDF are defined over continuous values. The discrete equivalent of the PDF in a discrete distribution is called the Probability Mass Function (PMF).

For a continuous variable, the probability of an exact outcome cannot be obtained directly; it should be computed as the area under the PDF curve for a small interval around the specific outcome, which can equivalently be obtained as a difference of CDF values.
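A brief illustration with scipy.stats for the standard normal shows that interval probabilities are differences of CDF values:

```python
# PDF and CDF of the standard normal distribution.
from scipy.stats import norm

print(norm.pdf(0.0))                   # density at x = 0 (~0.3989), not a probability
print(norm.cdf(1.0))                   # P(X <= 1) ~ 0.8413
print(norm.cdf(1.0) - norm.cdf(-1.0))  # P(-1 <= X <= 1) ~ 0.6827
```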
Let us discuss some of the distributions that are encountered in machine learning.
Continuous Probability Distributions: Normal, Rectangular, and Exponential distributions fall under this category.
learning framework. One such important framework is PAC. The learning framework is discussed in a detailed manner in Section 3.7. COLT focuses on supervised learning tasks. Since the complexity of analysis is difficult, normally, binary classification tasks are considered for analysis.
3.3 DESIGN OF A LEARNING SYSTEM

A system that is built around a learning algorithm is called a learning system. The design of a learning system focuses on these steps:
1. Choosing a training experience
2. Choosing a target function
3. Representation of a target function
4. Function approximation
Training Experience
Let us consider designing of a chess game. In direct experience, individual board states and correct moves of the chess game are given directly. In an indirect system, only the move sequences and results are given. The training experience also depends on the presence of a supervisor who can label all valid moves for a board state. In the absence of a supervisor, the game agent plays against itself and learns the good moves, if the training samples cover all scenarios, or in other words, are distributed enough for performance computation. If the training samples and testing samples have the same distribution, the results would be good.
Determine the Target Function
The next step is the determination of a target function. In this step, the type of knowledge that needs to be learnt is determined. In direct experience, a board move is selected and it is determined whether it is a good move or not against all other moves. If it is the best move, then it is chosen as B → M, where B and M are legal moves. In indirect experience, all legal moves are accepted and a score is generated for each. The move with the largest score is then chosen and executed.
Determine the Target Function Representation
The representation of knowledge may be a table, a collection of rules or a neural network. The linear combination of these factors can be coined as:

$$V = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$$

where $x_1$, $x_2$ and $x_3$ represent different board features and $w_0$, $w_1$, $w_2$ and $w_3$ represent weights.
Choosing an Approximation Algorithm for the Target Function
The focus is to choose weights that fit the given training samples effectively. The aim is to reduce the error, given as:

$$E \equiv \sum_{b} \left( V_{train}(b) - \hat{V}(b) \right)^2$$

Here, $b$ is the sample and $\hat{V}(b)$ is the predicted hypothesis value. The approximation is carried out as:
* Computing the error as the difference between trained and expected hypothesis. Let error
be error(b).
* Then, for every board feature $x_i$, the weights are updated as:

$$w_i = w_i + \mu \times error(b) \times x_i$$

Here, $\mu$ is a small constant that moderates the size of the weight update.
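A minimal sketch of this LMS-style update follows; the features, initial weights and training value below are illustrative placeholders, with the first feature fixed at 1 to act as the bias term:

```python
# One LMS-style weight update: w_i <- w_i + mu * error(b) * x_i.
def lms_update(weights, features, v_train, mu=0.01):
    v_hat = sum(w * x for w, x in zip(weights, features))  # predicted value
    error = v_train - v_hat                                # error(b)
    return [w + mu * error * x for w, x in zip(weights, features)]

# Example: three board features plus a constant bias feature x0 = 1.
w = [0.5, 0.1, -0.2, 0.3]
w = lms_update(w, [1.0, 2.0, 0.0, 5.0], v_train=1.0)
print(w)
```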
Thus, the learning system has the following components:
* A Performance system to allow the game to play against itself.
* A Critic system to generate the samples.
* A Generalizer system to generate a hypothesis based on samples.
* An Experimenter system to generate a new system based on the currently learnt function. This is sent as input to the Performance system.
3.4 INTRODUCTION TO CONCEPT LEARNING
Concept learning is a learning strategy of acquiring abstract knowledge or inferring a general concept or deriving a category from the given training samples. It is a process of abstraction and generalization from the data.

Concept learning helps to classify an object that has a set of common, relevant features. Thus, it helps a learner compare and contrast categories based on the similarity and association of positive and negative instances in the training data to classify an object. The learner tries to simplify by observing the common features from the training samples and then apply this simplified model to the future samples. This task is also known as learning from experience.

Each concept or category obtained by learning is a Boolean-valued function which takes a true or false value. For example, humans can identify different kinds of animals based on common relevant features and categorize all animals based on specific sets of features. The special features that distinguish one animal from another can be called a concept. This way of learning categories for objects and recognizing new instances of those categories is called concept learning.
It is formally defined as inferring a Boolean-valued function by processing training instances. Concept learning requires three things:

1. Input - Training dataset, which is a set of training instances, each labeled with the name of a concept or category to which it belongs. This past experience is used to train and build the model.
2. Output - Target concept or target function f. It is a mapping function f(x) from input x to output y. It is to determine the specific features or common features that identify an object. In other words, it is to find the hypothesis that determines the target concept. For example, the specific set of features to identify an elephant from all animals.
3. Test - New instances to test the learned model.
Formally, concept learning is defined as - "Given a set of hypotheses, the learner searches through the hypothesis space to identify the best hypothesis that matches the target concept".

Consider the following set of training instances shown in Table 3.1.
Table 3.1: Sample Training Instances

Sl.No  Horns  Tail   Tusks  Paws  Fur  Color  Hooves  Size    Elephant
1      No     Short  Yes    No    No   Black  No      Big     Yes
2      Yes    Short  No     No    No   Brown  Yes     Medium  No
3      No     Short  Yes    No    No   Black  No      Medium  Yes
4      No     Long   No     Yes   Yes  White  No      Medium  No
5      No     Short  Yes    Yes   Yes  Black  No      Big     Yes
Here, in this set of training instances, the independent attributes considered are 'Horns', 'Tail', 'Tusks', 'Paws', 'Fur', 'Color', 'Hooves' and 'Size'. The dependent attribute is 'Elephant'. The target concept is to identify the animal to be an elephant.

Let us now take this example and understand further the concept of hypothesis.

Target Concept: Predict the type of animal - for example, 'Elephant'.
3.4.1 Representation of a Hypothesis
A hypothesis 'h' approximates a target function 'f' to represent the relationship between the independent attributes and the dependent attribute of the training instances. The hypothesis is the predicted approximate model that best maps the inputs to outputs. Each hypothesis is represented as a conjunction of attribute conditions in the antecedent part.

For example, (Tail = Short) ∧ (Color = Black) ...

The set of hypotheses in the search space is called hypotheses. Hypotheses is the plural form of hypothesis. Generally, 'H' is used to represent the hypotheses and 'h' is used to represent a candidate hypothesis.
Each attribute condition is a constraint on the attribute, which is represented as an attribute-value pair. In the antecedent of an attribute condition of a hypothesis, each attribute can take the value '?' or 'ϕ', or can hold a single value.

* "?" denotes that the attribute can take any value [e.g., Color = ?]
* "ϕ" denotes that the attribute cannot take any value, i.e., it represents a null value [e.g., Horns = ϕ]
* A single value denotes a specific single value from the acceptable values of the attribute, i.e., the attribute 'Tail' can take a value such as 'Short' [e.g., Tail = Short]
For example, a hypothesis will look like:

Horns  Tail  Tusks  Paws  Fur  Color  Hooves  Size
h = < ... >

Given a test instance x, we say h(x) = 1 if the test instance x satisfies this hypothesis h.
The training dataset given above has 5 training instances with 8 independent attributes and one dependent attribute. Here, the different hypotheses that can be predicted for the target concept are:

Horns  Tail  Tusks  Paws  Fur  Color  Hooves  Size
h = < ... > (or) h = < ... >

The task is to predict the best hypothesis for the target concept (an elephant). The most general hypothesis can allow any value for each of the attributes. It is represented as:

<?, ?, ?, ?, ?, ?, ?, ?>. This hypothesis indicates that any animal can be an elephant.

The most specific hypothesis will not allow any value for any of the attributes. It is represented as:

<ϕ, ϕ, ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>.

Hypotheses H: Set of hypotheses, each with conjunctions of literals as propositions [i.e., each literal is represented as an attribute-value pair].

Solution: The hypothesis 'h' for the concept learning task of an Elephant is given as:

h = < ... >
When reading the second instance I2, it is a negative instance, so ignore it.

I2: Yes  Short  No  No  No  Brown  Yes  Medium  No (Negative instance)
h2 = < ... >

Similarly, when reading the third instance I3, it is a positive instance, so generalize h2 to accommodate it. The resulting h3 is generalized.

I3: No  Short  Yes  No  No  Black  No  Medium  Yes (Positive instance)
h3 = <?  ?  ?  ?  ?  ?  No  ?>
     <?  ?  ?  ?  ?  ?  ?  Big>

I4: No  Long  No  Yes  Yes  White  No  Medium  No (Negative instance)
h4 = <?  ?  Yes  ?  ?  ?  ?  ?>
     <?  ?  ?  ?  ?  Black  ?  ?>
     <?  ?  ?  ?  ?  ?  ?  Big>

Remove any hypothesis inconsistent with this negative instance.

I5: No  Short  Yes  Yes  Yes  Black  No  Big  Yes (Positive instance)
h5 = <?  ?  Yes  ?  ?  ?  ?  ?>
     <?  ?  ?  ?  ?  Black  ?  ?>
     <?  ?  ?  ?  ?  ?  ?  Big>

Thus, h5 is the hypothesis space generated, which will classify the positive instances to true and the negative instances to false.
3.4.5 Hypothesis Space Search by Find-S Algorithm

The Find-S algorithm is guaranteed to converge to the most specific hypothesis in H that is consistent with the positive instances in the training dataset. It will also be consistent with the negative instances, provided the training data is noise-free. Thus, this algorithm considers only the positive instances and eliminates negative instances while generating the hypothesis. It initially starts with the most specific hypothesis.

Input: Positive instances in the training dataset
Output: Hypothesis 'h'

1. Initialize 'h' to the most specific hypothesis:
h = <ϕ, ϕ, ..., ϕ>
Step 3: Scan the next instance I2. Since I2 is a positive instance, generalize 'h' to include positive instance I2. For each of the non-matching attribute values in 'h', put a '?' to include this positive instance. The third attribute value is mismatching in 'h' with I2, so put a '?'.

I2: ≥9  Yes  Good  Good  Fast  Yes (Positive instance)
h = <≥9  Yes  ?  Good  Fast  Yes>

Now, scan I3. Since it is a negative instance, ignore it. Hence, the hypothesis remains the same without any change after scanning I3.

I3: ≥8  No  Good  Good  Fast  No (Negative instance)
h = <≥9  Yes  ?  Good  Fast  Yes>

Now scan I4. Since it is a positive instance, check for mismatch in the hypothesis 'h' with I4. The 5th and 6th attribute values are mismatching, so add '?' to those attributes in 'h'.

I4: ≥9  Yes  Good  Good  Slow  No (Positive instance)
h = <≥9  Yes  ?  Good  ?  ?>

Now, the final hypothesis generated with the Find-S algorithm is:

h = <≥9  Yes  ?  Good  ?  ?>

It includes all positive instances and obviously ignores any negative instance.
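As an illustration, here is a compact Find-S sketch applied to the elephant data of Table 3.1 (the worked example above uses a different dataset); the function name and data layout are illustrative choices:

```python
# A sketch of Find-S: keep the most specific hypothesis covering all positives.
def find_s(instances):
    h = None
    for attrs, label in instances:
        if label != "Yes":
            continue                        # negative instances are ignored
        if h is None:
            h = list(attrs)                 # initialise with the first positive
        else:
            h = [hv if hv == av else "?"    # generalise mismatching attributes
                 for hv, av in zip(h, attrs)]
    return h

table_3_1 = [
    (("No", "Short", "Yes", "No", "No", "Black", "No", "Big"), "Yes"),
    (("Yes", "Short", "No", "No", "No", "Brown", "Yes", "Medium"), "No"),
    (("No", "Short", "Yes", "No", "No", "Black", "No", "Medium"), "Yes"),
    (("No", "Long", "No", "Yes", "Yes", "White", "No", "Medium"), "No"),
    (("No", "Short", "Yes", "Yes", "Yes", "Black", "No", "Big"), "Yes"),
]
print(find_s(table_3_1))  # ['No', 'Short', 'Yes', '?', '?', 'Black', 'No', '?']
```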
Limitations of Find-S Algorithm

1. The Find-S algorithm tries to find a hypothesis that is consistent with the positive instances, ignoring all negative instances. As long as the training dataset is consistent, the hypothesis found by this algorithm may be consistent.
2. The algorithm finds only one unique hypothesis, wherein there may be many other hypotheses that are consistent with the training dataset.
3. Many times, the training dataset may contain some errors; hence such inconsistent data instances can mislead this algorithm in determining the consistent hypothesis, since it ignores negative instances.

Hence, it is necessary to find the set of hypotheses that are consistent with the training data, including the negative examples. To overcome the limitations of the Find-S algorithm, the Candidate Elimination algorithm was proposed to output the set of all hypotheses consistent with the training dataset.
3.4.6 Version Spaces

The version space contains the subset of hypotheses from the hypothesis space that is consistent with all training instances in the training dataset.
List-Then-Eliminate Algorithm

The principal idea of this learning algorithm is to initialize the version space to contain all hypotheses and then eliminate any hypothesis that is found inconsistent with any training instance. Initially, the algorithm starts with a version space containing all hypotheses and scans each training instance. The hypotheses that are inconsistent with the training instance are eliminated. Finally, the algorithm outputs the list of remaining hypotheses that are all consistent.
Input: Version space - a list of all hypotheses
Output: Set of consistent hypotheses

1. Initialize the version space with a list of hypotheses.
2. For each training instance,
* remove from the version space any hypothesis that is inconsistent.
This algorithm works fine if the hypothesis space is finite, but practically it is difficult to deploy. Hence, a variation of this idea is introduced in the Candidate Elimination algorithm.
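A compact sketch of List-Then-Eliminate follows, under the simplifying assumption that the candidate hypotheses have already been enumerated; the toy attributes and data are illustrative:

```python
# List-Then-Eliminate: drop every hypothesis inconsistent with some instance.
def matches(h, attrs):
    return all(hv == "?" or hv == av for hv, av in zip(h, attrs))

def list_then_eliminate(candidates, instances):
    version_space = list(candidates)
    for attrs, label in instances:
        # A hypothesis is consistent if it classifies the instance correctly.
        version_space = [h for h in version_space
                         if matches(h, attrs) == (label == "Yes")]
    return version_space

# Toy usage with two attributes:
candidates = [("?", "?"), ("Red", "?"), ("?", "Big"), ("Red", "Big")]
data = [(("Red", "Big"), "Yes"), (("Blue", "Big"), "No")]
print(list_then_eliminate(candidates, data))  # [('Red', '?'), ('Red', 'Big')]
```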
Version Spaces and the Candidate Elimination Algorithm

Version space learning is to generate all consistent hypotheses. This algorithm computes the version space by the combination of two cases, namely:

* Specific to General learning - Generalize S to include the positive example
* General to Specific learning - Specialize G to exclude the negative example

Using the Candidate Elimination algorithm, we can compute the version space containing all (and only those) hypotheses from H that are consistent with the given observed sequence of training instances. The algorithm defines two boundaries called the 'general boundary', which is a set of all hypotheses that are the most general, and the 'specific boundary', which is a set of all hypotheses that are the most specific. Thus, the algorithm limits the version space to contain only those hypotheses that are most general and most specific, and it provides a compact representation of the List-Then-Eliminate algorithm.
Input: Set of instances in the training dataset
Output: Hypotheses G and S

1. Initialize G to the maximally general hypotheses.
2. Initialize S to the maximally specific hypotheses.
   * Generalize the initial hypothesis for the first positive instance.
3. For each subsequent new training instance,
   * If the instance is positive,
     o Generalize S to include the positive instance.
       - Check the attribute values of the positive instance and S:
         If an attribute value of the positive instance and S are different, fill that field value with '?'.
         If an attribute value of the positive instance and S are the same, then make no change.
     o Prune G to exclude all hypotheses in G that are inconsistent with the positive instance.
   * If the instance is negative,
     o Specialize G to exclude the negative instance.
       - Add to G all minimal specializations that exclude the negative example and are consistent with S:
         If an attribute value of S and the negative instance are different, then fill that attribute value with the S value.
         If an attribute value of S and the negative instance are the same, there is no need to update 'G'; fill that attribute value with '?'.
     o Remove from S all hypotheses that are inconsistent with the negative instance.
Generating Positive Hypothesis 'S': If it is a positive example, refine S to include the positive instance. We need to generalize S to include the positive instance. The hypothesis is the conjunction of 'S' and the positive instance. When generalizing, for the first positive instance, add to S all minimal generalizations such that S is filled with the attribute values of the positive instance. For the subsequent positive instances scanned, check the attribute values of the positive instance against the S obtained in the previous iteration. If the attribute values of the positive instance and S are different, fill that field value with a '?'. If the attribute values of the positive instance and S are the same, no change is required.

If it is a negative instance, it is skipped.

Generating Negative Hypothesis 'G': If it is a negative instance, refine G to exclude the negative instance. Then, prune G to exclude all hypotheses in G inconsistent with the positive instance. The idea is to add to G all minimal specializations that exclude the negative instance and remain consistent with the positive instance. A negative hypothesis indicates a general hypothesis.

If the attribute values of the positive and negative instances are different, then fill that field with the positive instance value, so that the hypothesis does not classify that negative instance as true. If the attribute values of the positive and negative instances are the same, then there is no need to update 'G'; fill that attribute value with a '?'.

Generating Version Space [Consistent Hypotheses]: We need to take the combination of the sets in 'G' and check them with 'S'. When the combined set fields are matched with the fields in 'S', then only those are included in the version space as consistent hypotheses.
Example 3.3: Consider the same set of instances from the training dataset shown in Table 3.3 and generate the version space as consistent hypotheses.

Solution:

Step 1: Initialize the 'G' boundary to the maximally general hypotheses:

G0 = <?  ?  ?  ?  ?  ?>

Step 2: Initialize the 'S' boundary to the maximally specific hypothesis. There are 6 attributes, so for each attribute, we initially fill 'ϕ' in the hypothesis 'S':

S0 = <ϕ  ϕ  ϕ  ϕ  ϕ  ϕ>

Generalize the initial hypothesis for the first positive instance. I1 is a positive instance, so generalize the most specific hypothesis 'S' to include this positive instance. Hence,

I1: ≥9  Yes  Excellent  Good  Fast  Yes (Positive instance)
S1 = <≥9  Yes  Excellent  Good  Fast  Yes>
G1 = <?  ?  ?  ?  ?  ?>

Step 3:

Iteration 1
Scan the next instance I2. Since I2 is a positive instance, generalize 'S1' to include positive instance I2. For each of the non-matching attribute values in 'S1', put a '?' to include this positive instance. The third attribute value is mismatching in 'S1' with I2, so put a '?'.

I2: ≥9  Yes  Good  Good  Fast  Yes (Positive instance)
S2 = <≥9  Yes  ?  Good  Fast  Yes>

Prune G1 to exclude all hypotheses inconsistent with the positive instance. Since G1 is consistent with this positive instance, there is no change. The resulting G2 is:

G2 = <?  ?  ?  ?  ?  ?>
Iteration 2
Now scan I3:

I3: ≥8  No  Good  Good  Fast  No (Negative instance)

Since it is a negative instance, specialize G2 to exclude the negative example but stay consistent with S2. Generate a hypothesis for each non-matching attribute value in S2 and fill it with the attribute value of S2; in those generated hypotheses, fill all matching attribute values with a '?'. The first, second and 6th attribute values do not match, hence 3 hypotheses are generated in G3. There is no hypothesis in S2 inconsistent with the negative instance, hence S3 remains the same.

G3 = <≥9  ?  ?  ?  ?  ?>
     <?  Yes  ?  ?  ?  ?>
     <?  ?  ?  ?  ?  Yes>
S3 = <≥9  Yes  ?  Good  Fast  Yes>
Iteration 3
Now scan I4. Since it is a positive instance, check for mismatches in the hypothesis 'S3' with I4. The 5th and 6th attribute values are mismatching, so add '?' to those attributes in 'S4'.

I4: ≥9  Yes  Good  Good  Slow  No (Positive instance)
S4 = <≥9  Yes  ?  Good  ?  ?>

Prune G3 to exclude all hypotheses inconsistent with the positive instance I4:

G3 = <≥9  ?  ?  ?  ?  ?>
     <?  Yes  ?  ?  ?  ?>
     <?  ?  ?  ?  ?  Yes>  (Inconsistent)

Since the third hypothesis in G3 is inconsistent with this positive instance, remove it. The resulting G4 is:

G4 = <≥9  ?  ?  ?  ?  ?>
     <?  Yes  ?  ?  ?  ?>
Using the two boundary sets, S4 and G4, the version space is converged to contain the set of consistent hypotheses. The final version space is:

<≥9  Yes  ?  ?  ?  ?>
<≥9  ?  ?  Good  ?  ?>
<?  Yes  ?  Good  ?  ?>

The algorithm finds the version space to contain only those hypotheses that are most general and most specific. The diagrammatic representation of deriving the version space is shown in Figure 3.2.
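As a closing check, the final version space can be verified programmatically against the four training instances used in the trace above; the attribute encodings below are illustrative:

```python
# Checking that each final hypothesis classifies all four instances correctly.
def matches(h, attrs):
    return all(hv == "?" or hv == av for hv, av in zip(h, attrs))

instances = [
    ((">=9", "Yes", "Excellent", "Good", "Fast", "Yes"), True),   # I1, positive
    ((">=9", "Yes", "Good", "Good", "Fast", "Yes"), True),        # I2, positive
    ((">=8", "No", "Good", "Good", "Fast", "No"), False),         # I3, negative
    ((">=9", "Yes", "Good", "Good", "Slow", "No"), True),         # I4, positive
]
version_space = [
    (">=9", "Yes", "?", "?", "?", "?"),
    (">=9", "?", "?", "Good", "?", "?"),
    ("?", "Yes", "?", "Good", "?", "?"),
]
for h in version_space:
    ok = all(matches(h, attrs) == label for attrs, label in instances)
    print(h, "consistent" if ok else "inconsistent")   # all print "consistent"
```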