Machine Learning Module 2
Find the correlation coefficient of data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution: The mean values of X and Y are x̄ = 3 and ȳ = 11. The standard deviations of X and Y are 1.41 and 8.6486, respectively. Therefore, the correlation coefficient is given as the ratio of the covariance (12, from the previous problem 2.5) to the product of the standard deviations of X and Y, as per Eq. (2.18):
r = 12 / (1.41 × 8.6486) ≈ 0.98
2.7 MULTIVARIATE STATISTICS
In machine learning, almost all datasets are multivariate. Multivariate data is the analysis of more than two observable variables, and often, thousands of multiple measurements need to be conducted for one or more subjects.
Multivariate data is like bivariate data but may have more than two dependent variables. Some of the multivariate analyses are regression analysis, principal component analysis, and path analysis.
Id    Attribute 1    Attribute 2    Attribute 3
1     4              4              1
2     2              5              2
3     3              6              1
The mean of multivariate data is a mean vector, and the mean of the above three attributes is given as (2, 7.5, 1.33). The variance of multivariate data becomes the covariance matrix. The mean vector is called the centroid and the variance is called the dispersion matrix. This is discussed in the next section.
Multivariate data has three or more variables. The aims of multivariate analysis are much broader; they include regression analysis, factor analysis and multivariate analysis of variance, which are explained in the subsequent chapters of this book.
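A minimal sketch of these quantities, assuming NumPy is available and using a hypothetical multivariate dataset (not the table above), is shown below.

```python
import numpy as np

# Hypothetical multivariate dataset: each row is one observation,
# each column is one attribute.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

centroid = X.mean(axis=0)              # mean vector (centroid)
dispersion = np.cov(X, rowvar=False)   # covariance (dispersion) matrix

print("Centroid:", centroid)
print("Dispersion matrix:\n", dispersion)
```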
Heatmap
Heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and colours it. The darker colours indicate very large values and the lighter colours indicate smaller values. The advantage of this method is that humans perceive colours well. So, by colour shading, larger values can be perceived easily. For example, in vehicle traffic data, heavy traffic regions can be differentiated from low traffic regions through a heatmap.
In Figure 2.13, patient data highlighting weight and health status is plotted. Here, the X-axis is weight and the Y-axis is patient count. The dark colour regions highlight patients' weights vs patient counts for each health status.
Figure 2.13: Heatmap for Patient Data (count of self-assessed health status vs. weight)
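A heatmap along the lines of Figure 2.13 can be drawn with standard plotting libraries; the following is a minimal sketch assuming seaborn and matplotlib are available and using a hypothetical count matrix in place of the patient data.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical 2D matrix: rows are health-status categories,
# columns are weight bins, cells are patient counts.
rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(4, 20))

sns.heatmap(counts,
            yticklabels=["Excellent", "Fair", "Good", "Poor"],
            cmap="Greys")          # darker cells correspond to larger counts
plt.xlabel("Weight bin")
plt.ylabel("Self-assessed health status")
plt.show()
```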
Pairplot
Pairplot or scatter matrix is a data visualization technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data. All the results are presented in a matrix format. By visual examination of the chart, one can easily find relationships among the variables, such as correlation between the variables.
A random matrix of three columns is chosen and the relationships of the columns are plotted as a pairplot (or scatter matrix) as shown in Figure 2.14.
Figure 2.14: Pairplot for Random Data
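A pairplot similar to Figure 2.14 can be produced with a short sketch, assuming seaborn and pandas are available; the three-column random matrix below is a hypothetical stand-in for the data used in the figure.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# A random matrix of three columns, similar in spirit to Figure 2.14.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=["c1", "c2", "c3"])

# Scatter matrix: every pair of columns is plotted against each other,
# with the distribution of each column on the diagonal.
sns.pairplot(df)
plt.show()
```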
2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA
Machine learning involves many mathematical concepts from the domains of linear algebra, statistics, probability and information theory. The subsequent sections discuss important aspects of linear algebra and probability.
'Linear Algebra' is a branch of mathematics that is central to many scientific applications and other mathematical subjects. While all branches of mathematics are crucial for machine learning, linear algebra plays a major role as it is the mathematics of data. Linear algebra deals with linear equations, vectors, matrices, vector spaces and transformations. These are the driving forces of machine learning, and machine learning cannot exist without these data types.
Let us discuss some of the important concepts of linear algebra now.
2.8.1 Linear Systems and Gaussian Elimination for Multivariate Data
A linear system of equations is a group of equations with unknown variables.
Let Ax = y. Then the solution x is given as:
x = y/A = A⁻¹y    (2.19)
This is true if y is not zero and A is not zero. The logic can be extended for an N-set of equations with 'n' unknown variables.
It means that if A is the matrix of coefficients (a_ij) and y = (y_1, y_2, ..., y_n)ᵀ, then the unknown vector x can be computed as:
x = y/A = A⁻¹y    (2.20)
If there is a unique solution, then the system is called consistent independent. If there are various solutions, then the system is called consistent dependent. If there are no solutions and the equations are contradictory, then the system is called inconsistent.
For solving a large number of equations, Gaussian elimination can be used. The procedure for applying Gaussian elimination is given as follows:
1. Write the given matrix A.
2. Append the vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element a_11 as the pivot and eliminate a_21 in the second row using the row operation R_2 - (a_21/a_11) R_1, where R_2 is the second row and (a_21/a_11) is called the multiplier. The same logic can be used to remove a_i1 in all other equations.
4. Repeat the same logic and reduce the matrix to reduced echelon form. Then, the last unknown variable is obtained directly as:
   x_n = y'_n / a'_nn    (2.21)
5. The remaining unknown variables can then be found by back-substitution as:
   x_i = ( y'_i - Σ_{k=i+1..n} a'_ik x_k ) / a'_ii    (2.22)
This part is called backward substitution.
To facilitate the application of the Gaussian elimination method, the following row operations are applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it
These concepts are illustrated in Example 2.8.
Example 2.8: Solve the following set of equations using the Gaussian elimination method.
2x_1 + 4x_2 = 6
4x_1 + 3x_2 = 7
Solution: Rewrite this in augmented matrix form as follows:
( 2  4 | 6 )
( 4  3 | 7 )
Apply the transformation by dividing row 1 by 2. There are no general guidelines for row operations other than reducing the given matrix to row echelon form. The operator ~ means 'reduces to'. The above matrix can further be reduced as follows:
~ ( 1  2 | 3 )        R_1/2
  ( 4  3 | 7 )
~ ( 1  2 |  3 )       R_2 = R_2 - 4R_1
  ( 0 -5 | -5 )
~ ( 1  2 | 3 )        R_2/(-5)
  ( 0  1 | 1 )
~ ( 1  0 | 1 )        R_1 = R_1 - 2R_2
  ( 0  1 | 1 )
Therefore, from the reduced echelon form, it can be observed that:
x_1 = 1 and x_2 = 1
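The same result can be checked numerically. This is a minimal sketch, assuming NumPy is available; np.linalg.solve uses a standard elimination-based (LU) routine internally.

```python
import numpy as np

# Coefficient matrix and right-hand side of Example 2.8:
#   2x1 + 4x2 = 6
#   4x1 + 3x2 = 7
A = np.array([[2.0, 4.0],
              [4.0, 3.0]])
y = np.array([6.0, 7.0])

x = np.linalg.solve(A, y)   # elimination-based linear solver
print(x)                    # [1. 1.]  -> x1 = 1, x2 = 1
```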
2.8.2 Matrix Decompositions
It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations
can be performed. These methods are also known as matrix factorization methods.
The most popular matrix decomposition is called eigen decomposition. It is a way of reducing
the matrix into eigen values and eigen vectors.
Then, the matrix A can be decomposed as:
A = QΛQᵀ    (2.23)
where Q is the matrix of eigen vectors, Λ is the diagonal matrix of eigen values, and Qᵀ is the transpose of matrix Q.

LU Decomposition
One of the simplest matrix decompositions is LU decomposition, where the matrix A can be decomposed into two matrices:
A = LU
Here, L is the lower triangular matrix and U is the upper triangular matrix. The decomposition can be done using the Gaussian elimination method as discussed in the previous section. First, an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination are applied to reduce the given matrix and obtain the matrices L and U.
Example 2.9 illustrates the application of Gaussian elimination to get L and U.
Example 2.9: Find the LU decomposition of the given matrix:
A = ( 1  2  4 )
    ( 3  3  2 )
    ( 3  4  2 )
Solution: First, augment an identity matrix and apply Gaussian elimination. The steps are as shown below:
( 1 0 0 | 1  2  4 )
( 0 1 0 | 3  3  2 )    [Initial matrix]
( 0 0 1 | 3  4  2 )

( 1 0 0 | 1  2   4 )
( 3 1 0 | 0 -3 -10 )   [R_2 = R_2 - 3R_1]
( 0 0 1 | 3  4   2 )

( 1 0 0 | 1  2   4 )
( 3 1 0 | 0 -3 -10 )   [R_3 = R_3 - 3R_1]
( 3 0 1 | 0 -2 -10 )

( 1  0   0 | 1  2    4   )
( 3  1   0 | 0 -3  -10   )   [R_3 = R_3 - (2/3)R_2]
( 3 2/3  1 | 0  0  -10/3 )

Now, it can be observed that the first matrix is L, the lower triangular matrix whose entries are the multipliers used in the reduction above, namely 3, 3 and 2/3. The second matrix is U, the upper triangular matrix whose entries are the values of the reduced matrix obtained from Gaussian elimination.
L = ( 1   0   0 )   and   U = ( 1  2    4   )
    ( 3   1   0 )             ( 0 -3  -10   )
    ( 3  2/3  1 )             ( 0  0  -10/3 )
It can be cross-verified that the multiplication LU yields the original matrix A. What are the applications of LU decomposition? Some of the applications are finding matrix inverses and determinants. If the order of the matrix is large, then this method can be used.
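The factorization can also be obtained numerically. The sketch below assumes SciPy is available; note that scipy.linalg.lu applies partial pivoting, so the L and U it returns may differ from the hand-computed factors above even though their product still reconstructs A.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[1.0, 2.0, 4.0],
              [3.0, 3.0, 2.0],
              [3.0, 4.0, 2.0]])

# SciPy also returns a permutation matrix P, i.e. A = P @ L @ U.
P, L, U = lu(A)
print(np.allclose(P @ L @ U, A))   # True: the factorization reconstructs A
```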
2.8.3 Machine Learning and Importance of Probability and Statistics
Machine learning is linked with statistics and probability. Like linear algebra, statistics is the heart of machine learning. The importance of statistics needs to be stressed, as without statistics, analysis of data is difficult. Probability is especially important for machine learning. Any data can be assumed to be generated by a probability distribution. Machine learning datasets have multiple data that are generated by multiple distributions. So, a knowledge of probability distributions and random variables is a must for a better understanding of machine learning concepts.
What is the objective of an experiment? It is about forming a hypothesis or constructing a model. Machine learning has many models. So, the theory of hypothesis, construction of models, evaluating the models using hypothesis testing, significance of the model and evaluation of the model are all linking factors of machine learning and probability theory. Similarly, the construction of datasets using sampling theory is also required.
The subsequent section will deal with the important concepts of statistics and probability.
Probability Distributions
A probability distribution of a variable, say X, summarizes the probability associated with X's
events. Distribution is a parameterized mathematical function. In other words, distribution is a
function that describes the relationship between the observations in a sample space.
Consider a set of data. The data is said to follow a distribution if it obeys a mathematical
function that characterizes that distribution. The function can be used to calculate the probability
of individual observations.
Probability distributions are of two types:
1. Discrete probability distribution
2. Continuous probability distribution
The relationship between the events of a continuous random variable and their probabilities is called a continuous probability distribution. It is summarized by a Probability Density Function (PDF). The PDF calculates the probability of observing an instance, and the plot of the PDF shows the shape of the distribution. The Cumulative Distribution Function (CDF) computes the probability of an observation being less than or equal to a given value.
Discrete Distributions  Binomial, Poisson, and Bernoulli distributions fall under this category.
1. Binomial Distribution - The binomial distribution is another distribution that is often encountered in machine learning. It has only two outcomes: success or failure. A single such trial is also called a Bernoulli trial.
The objective of this distribution is to find the probability of getting k successes out of n trials. The number of ways of getting k successes out of n trials is given as:
(n choose k) = n! / (k!(n - k)!)    (2.27)
The binomial distribution function is given as follows, where p is the probability of success and the probability of failure is (1 - p). The probability of k successes in a given sequence of n trials is:
p^k (1 - p)^(n-k)    (2.28)
Combining both, one gets the PMF of the binomial distribution as:
P(X = k) = (n choose k) p^k (1 - p)^(n-k)    (2.29)
Here, p is the probability of success in each trial, k is the number of successes, and n is the total number of trials. The mean of the binomial distribution is given as:
μ = n × p    (2.30)
And the variance is given as:
σ² = np(1 - p)    (2.31)
Hence, the standard deviation is given as:
σ = √(np(1 - p))    (2.32)
2. Poisson Distribution - It is another important distribution that is quite useful. Given an interval of time, this distribution is used to model the probability of a given number of events k occurring in that interval, where the mean rate λ of events is known and events occur independently of previous events. Some examples of the Poisson distribution are the number of emails received, the number of customers visiting a shop, and the number of phone calls received by an office.
The PMF of the Poisson distribution is given as follows:
f(X = x; λ) = P[X = x] = (e^(-λ) λ^x) / x!    (2.33)
Here, x is the number of times the event occurs and λ is the mean number of times the event occurs.
The mean of the distribution is λ (for example, the mean number of emails received) and the standard deviation is √λ.
3. Bernoulli Distribution - This distribution models an experiment whose outcome is binary. The outcome is positive with probability p and negative with probability 1 - p. The PMF of this distribution is given as:
P(X = k) = 1 - p   if k = 0
           p       if k = 1    (2.34)
The mean is p and the variance is p(1 - p).
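A minimal numerical sketch of these three distributions, assuming SciPy is available, is shown below; the parameter values are arbitrary illustrative choices.

```python
from scipy.stats import binom, poisson, bernoulli

# Binomial: probability of k = 3 successes in n = 10 trials with p = 0.5
print(binom.pmf(k=3, n=10, p=0.5))                        # ~0.117
print(binom.mean(n=10, p=0.5), binom.std(n=10, p=0.5))    # 5.0 and ~1.58

# Poisson: probability of x = 2 events when the mean rate lambda = 3
print(poisson.pmf(2, mu=3))                               # ~0.224

# Bernoulli: P(k = 1) = p and P(k = 0) = 1 - p
print(bernoulli.pmf(1, p=0.3), bernoulli.pmf(0, p=0.3))   # 0.3 and 0.7
```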
Density Estimation
Let there be a set of observed values x_1, x_2, ..., x_n from a larger set of data whose distribution is not known. Density estimation is the problem of estimating the density function from the observed data. The estimated density function, denoted as p(x), can be evaluated directly for any unknown data point, say x_new, as p(x_new). If this value is less than a threshold ε, then x_new is categorized as an outlier or anomaly; otherwise, it is treated as normal data.
There are two types of density estimation methods, namely parametric density estimation and non-parametric density estimation.
Parametric Density Estimation  It assumes that the data is from a known probability distribution and its density can be estimated as p(x | Θ), where Θ is the parameter. The maximum likelihood function is a parametric estimation method.
Maximum Likelihood Estimation  For a sample of observations, one can estimate the probability distribution. This is called density estimation. Maximum Likelihood Estimation (MLE) is a probabilistic framework that can be used for density estimation. It involves formulating a function called the likelihood function, which is the conditional probability of observing the observed samples given the distribution function and its parameters. For example, if the observations are X = {x_1, x_2, ..., x_n}, then density estimation is the problem of choosing a PDF with suitable parameters to describe the data. MLE treats this problem as a search or optimization problem where the joint probability of X given its parameter θ should be maximized. This is expressed as p(X; θ), where X = (x_1, x_2, ..., x_n).
The likelihood of observing the data is given as a function L(X; θ). The objective of MLE is to maximize this function as max L(X; θ).
The joint probability of this problem can be restated as ∏_i p(x_i; θ).
The computation of the above formula is numerically unstable, and hence the problem is restated as the maximum of the log conditional probability given θ. This is given as:
Σ_i log p(x_i; θ)    (2.35)
Instead of maximizing, one can minimize the negative of this function as:
min -Σ_i log p(x_i; θ), as, often, minimization is preferred over maximization.
This is called the negative log-likelihood function.
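A minimal sketch of this idea, assuming SciPy is available: the negative log-likelihood of a hypothetical Gaussian sample is minimized numerically to recover the parameters.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical sample assumed to come from a Gaussian with unknown parameters.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=500)

def neg_log_likelihood(theta):
    mu, log_sigma = theta                 # optimize log(sigma) so sigma stays positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                  # close to the true values 5.0 and 2.0
```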
The relevance of this theory of MLE for machine learning is that MLE can solve the problem of predictive modelling in machine learning. The regression problem is treated in Chapter 5 using the principle of the least-squares approach. The same problem can be viewed from the MLE perspective too. If one assumes that the regression problem can be framed as predicting output y given input x, then for p(y | x), the MLE framework can be applied as:
max Σ_i log p(y_i | x_i, h)    (2.36)
Here, h is the linear regression model. If a Gaussian distribution is assumed, as most data is often assumed to follow a Gaussian distribution, then the MLE can be stated as:
max Σ_i log N(y_i; h(x_i, β), σ²)    (2.37)
Here, β is the regression coefficient and x_i is the given sample. One can maximize this function or minimize the negative log-likelihood function to provide a solution for the linear regression problem. Eq. (2.37) yields the same answer as the least-squares approach.
Stochastic gradient descent (SGD) is an algorithm that is normally used to maximize or minimize any function. This algorithm can be used to minimize the above function to get the results effectively. Hence, many statistical packages use this approach for solving the linear regression problem. Theoretically, any model can be used instead of linear regression in the above equation. Logistic regression is another model that can be used in the MLE framework. Logistic regression is discussed in Chapter 5 of this book.
Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm In machine learning,
clustering is one of the important tasks. It is discussed in Chapter 13. MLE framework is quite
useful for designing model-based methods for clustering data. A model is a statistical method, and the data is assumed to be generated by a distribution model with its parameter θ. There may be many distributions involved, and that is why it is called a mixture model. Since Gaussian distributions are normally assumed for the data, this mixture model is categorized as a Gaussian Mixture Model (GMM).
The EM algorithm is one algorithm that is commonly used for estimating the MLE in the
presence of latent or missing variables. What is a latent variable? Let us assume that the dataset
includes weights of boys and girls. Considering the fact that the boys’ weights would be slightly
more than the weights of the girls, one can assume that the larger weights are generated by one
gaussian distribution with one set of parameters while girls’ weights are generated with another
set of parameters. There is an influence of gender in the data, but it is not directly present or
observable. These are called latent variables. The EM algorithm is effective for estimating the PDF
in the presence of latent variables.
Generally, there can be many unspecified distributions with different sets of parameters. The EM algorithm has two stages:
1. Expectation (E) Stage - In this stage, the expected PDF and its parameters are estimated for each latent variable.
2. Maximization (M) Stage - In this stage, the parameters are optimized using the MLE function.
This process is iterative, and the iteration is continued till all the latent variables are fitted by probability distributions effectively along with their parameters.
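As a minimal sketch of this idea, assuming scikit-learn is available, a two-component GMM can be fitted with EM on hypothetical weight data in which the group (the latent variable) is not observed.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical weights: two latent groups generate the data,
# but the group label itself is never observed.
rng = np.random.default_rng(0)
weights = np.concatenate([rng.normal(55, 5, 300),    # lighter group
                          rng.normal(70, 6, 300)])   # heavier group

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(weights.reshape(-1, 1))                      # EM iterations run internally

print(gmm.means_.ravel())    # estimated component means, close to 55 and 70
print(gmm.weights_)          # estimated mixing proportions, close to 0.5 each
```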
Non-parametric Density Estimation  A non-parametric estimation can be generative or discriminative. The Parzen window is a generative estimation method that finds p(x | θ) as a conditional density. Discriminative methods directly compute p(θ | x) as a posterior probability. The Parzen window and the k-Nearest Neighbour (KNN) rule are examples of non-parametric density estimation. Let us discuss them now.
Parzen Window  Let there be 'n' samples, X = {x_1, x_2, ..., x_n}. The samples are drawn independently from the same distribution, that is, they are independent and identically distributed. Let R be a region that covers 'k' samples of the total 'n' samples. Then, the probability of the region is given as:
p = k/n    (2.38)
The density estimate is given as:
p(x) = (k/n) / V    (2.39)
where V is the volume of the region R. If R is a hypercube centred at x and h is the side length of the hypercube, the volume V is h² for a 2D square and h³ for a 3D cube.
The Parzen window function is given as follows:
φ(u) = 1 if |u_j| ≤ 1/2 for every component u_j of u, and 0 otherwise    (2.40)
The window indicates whether a sample, centred and scaled as u = (x_i - x)/h, falls inside the region or not. The Parzen probability density function estimate using Eq. (2.40) is then:
p(x) = (k/n) / V, where k = Σ_i φ((x_i - x)/h)    (2.41)
This window function can be replaced by any other function too. If a Gaussian function is used, then it is called a Gaussian density function.
KNN Estimation  The KNN estimation is another non-parametric density estimation method. Here, the parameter k is fixed in advance and, based on it, the k nearest neighbours of a point are determined. The probability density function estimate is then based on the values returned by the neighbours (for example, their average).
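A minimal one-dimensional sketch of the box-window Parzen estimate described above, assuming only NumPy and using a hypothetical sample, is shown below.

```python
import numpy as np

def parzen_density(x, samples, h):
    """Box (hypercube) Parzen-window density estimate at point x."""
    # window counts a sample when |x_i - x| / h <= 1/2, i.e. it falls in the box
    k = np.sum(np.abs(samples - x) / h <= 0.5)
    n = len(samples)
    volume = h                        # length of the interval in one dimension
    return (k / n) / volume

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=1000)     # hypothetical data
print(parzen_density(0.0, samples, h=0.5))    # roughly 0.4, the standard normal peak
```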
2.9 OVERVIEW OF HYPOTHESIS
Data collection alone is not enough. Data must be interpreted to give a conclusion. The conclusion
should be a structured outcome. This assumption of the outcome is called a hypothesis. Statistical
methods are used to confirm or reject the hypothesis. The assumption of the statistical test is called
null hypothesis. It is also called as hypothesis zero (H0). In other words, hypothesis is the existing
belief. The violation of this hypothesis is called first hypothesis (H1) or hypothesis one. This is the
hypothesis the researcher is trying to establish.
There are two types of hypothesis tests, parametric and non-parametric. Parametric tests are
based on parameters such as mean and standard deviation. Non-parametric tests are dependent on
characteristics such as independence of events or data following certain distribution.
Statistical tests help to:
1. Define null and alternate hypothesis
2. Describe the hypothesis using parameters
3. Identify the statistical test and the test statistic
4. Decide the criteria called significance value a
5. Compute p-value (probability value)
6. Take the final decision of accepting or rejecting the hypothesis based on the parameters
Hypothesis testing is particularly important as it is an integral part of learning algorithms. Generally, the data size is small. So, one may have to know whether the hypothesis will work for additional samples and how accurate it is. No matter how effective the statistical tests are, two kinds of errors are involved, namely Type I and Type II.
A Type I error is the incorrect rejection of a true null hypothesis and is called a false positive. A Type II error is the failure to reject a false null hypothesis and is called a false negative.
During these calculations, one must include the size of the data sample. Degree of freedom
indicates the number of independent pieces of information used for the test. It is indicated as v. The
mean or variance can be used to indicate the degree of freedom.
Hypothesis Testing
Let us define two important errors called the sample error and the true (or actual) error. Let us assume that D is the unknown distribution, the target function is f(x): X → {0, 1}, x is an instance, h(x) is the hypothesis, and S is the sample set whose instances are drawn from X. Then, the actual error is defined as:
error_D(h) = Pr_{x ∈ D} [ f(x) ≠ h(x) ]    (2.42)
In other words, the true error is the probability that the hypothesis will misclassify an instance that is drawn at random. The point is that the population is very large and hence it is not possible to determine the true error; it can only be estimated. So, another error, called the sample error or estimator, is used.
The sample error is defined with respect to the sample S. It is the fraction of instances drawn from X, that is, the fraction of S, that are misclassified. The sample error is given as follows:
error_S(h) = (1/n) Σ_{x ∈ S} δ(f(x), h(x))    (2.43)
Here, δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and zero otherwise.
The data associated with the sample has two problems. One is bias and the other is variance. Since the dataset is small in size, the model may overfit. Generalizing the concept of the population with a smaller dataset is difficult. If S is the sample set, then the bias of the estimate can be expressed as E[error_S(h)] - error_D(h). Another source of error is the variance, that is, the deviation of error_S(h) from error_D(h). Here, the errors are the sample error and the true or actual error.
p-value
Statistical tests can be performed to either accept or reject the null hypothesis. This is done by a value called the p-value or probability value. It indicates the probability of obtaining the observed result if the null hypothesis is true. The p-value is used to interpret or quantify the test. For example, a statistical test result may give a value of 0.03. Then, one can compare it with the level 0.05. As 0.03 < 0.05, the result is assumed to be significant. This means that the variables tested are not independent. Here, 0.05 is called the significance level. In general, the significance level is denoted α and the p-value is compared with α. If p-value ≤ α, then the null hypothesis H0 is rejected; if p-value > α, then H0 is not rejected.
Confidence Intervals
The acceptance or rejection of the hypothesis can also be done using a confidence interval. The confidence level is computed as:
Confidence level = 1 - significance level    (2.44)
The confidence interval is the range of values that indicates the location of the true mean. Confidence intervals indicate the confidence of the result. If the confidence level is 90%, then it infers that there is a 90% chance that the true mean lies in this range, and the remaining 10% indicates that the true mean is not present in it. For finding this, one requires the mean and the standard deviation. Then, the interval for x can be given as mean(x) ± z × s/√N. Here, s is the standard deviation, N is the number of samples, and z is the value associated with the 90% confidence level. The term z × s/√N is also called the margin of error.
The sample error is an unbiased estimate of the true error. If no information is provided, then both errors are the same. It is, however, often safer to attach a margin of confidence to the hypothesis. The hypothesis with 95% confidence about the sample error can be given as follows:
error_S(h) ± 1.96 × √( error_S(h)(1 - error_S(h)) / n )    (2.45)
The value 1.96 indicates 95% confidence in the error. The number 1.96 can be replaced by any number that is associated with a different level of confidence.
The procedure to estimate the difference between two hypotheses, say h_1 and h_2, is as follows:
1. A parameter d can be chosen to estimate the difference in the errors of the two hypotheses:
   d = error_D(h_1) - error_D(h_2)    (2.46)
   Here, the two hypotheses h_1 and h_2 are tested on two sample sets S_1 and S_2, with n_1 and n_2 randomly drawn samples, respectively.
2. The estimator d̂ can be computed as the difference:
   d̂ = error_S1(h_1) - error_S2(h_2)
3. Confidence intervals can be used for the estimator as well, as follows:
   d̂ ± z × √( error_S1(h_1)(1 - error_S1(h_1))/n_1 + error_S2(h_2)(1 - error_S2(h_2))/n_2 )    (2.47)
Sometimes, it is desirable to find an interval [L, U] such that N% of the probability mass falls in this interval.
Let us discuss about comparing two algorithms based on the hypothesis. This is done based on
statistical tests. Some of the important statistical tests are discussed in the next section.
2.9.1 Comparing Learning Methods
Some of the methods for comparing learning programs are given below:
Z-test
The Z-test assumes a normal distribution of data whose population variance is known. The sample size is assumed to be large. The focus is to test the population mean. The z-statistic is given as:
z = (x̄ - μ) / (σ/√n)    (2.48)
Here, x̄ is the sample mean of the input data X, n is the number of data elements, and μ and σ are the population mean and standard deviation, respectively.
Example: Let 12 be the population mean (μ) with a population standard deviation (σ) of 2. Consider the sample X = {1, 2, 3, 4, 5}. Apply the z-test and show whether the result is significant.
Solution: The sample mean of X is x̄ = 3 and the number of samples is n = 5. Substituting in Eq. (2.48) gives:
z = (x̄ - μ) / (σ/√n) = (3 - 12) / (2/√5) ≈ -10.06
By checking the critical value at significance level 0.05 (|z| = 10.06 is far greater than 1.96), one can conclude that the null hypothesis H0 is rejected.
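The numbers in this example can be reproduced with a short sketch, assuming NumPy and SciPy are available and that the population standard deviation is 2, as used in the worked computation.

```python
import numpy as np
from scipy.stats import norm

X = np.array([1, 2, 3, 4, 5])
mu, sigma = 12.0, 2.0                  # population mean and standard deviation (assumed known)

z = (X.mean() - mu) / (sigma / np.sqrt(len(X)))
p_value = 2 * norm.sf(abs(z))          # two-tailed p-value

print(round(z, 2))                     # about -10.06
print(p_value < 0.05)                  # True -> reject H0
```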
t-test and Paired t-test
The t-test is a hypothesis test that checks whether the difference between two samples' means is real or by chance. Here, the data is continuous and randomly selected, there is only a small number of samples, and the variance between groups is real. The t-test statistic follows a t-distribution under the null hypothesis and is used when the number of samples is less than 30. The distribution followed is the t-distribution rather than the Gaussian distribution. The test indicates whether two groups are different or not.
One Sample t-test  In this test, the mean of one group is checked against a set average that can be either a theoretical value or the population mean. So, the procedure is:
• Select a group
• Compute its average
• Compare it with the theoretical value and compute the t-statistic:
  t = (m - μ) / (s/√n)    (2.49)
Here, t is the t-statistic, m is the mean of the group, μ is the theoretical value or population mean, s is the standard deviation, and n is the group size or sample size.
Independent Two Sample t-test  The t-statistic for two groups A and B is computed as follows:
t = ( mean(A) - mean(B) ) / √( s² (1/N_A + 1/N_B) )    (2.50)
Here, mean(A) and mean(B) are the means of the two different samples, and N_A and N_B are the sample sizes of the two groups A and B. s² is the pooled variance of the two samples, and the degree of freedom is given as N_A + N_B - 2. The t-statistic is then compared with the critical value.
Let there be one group {8, 6, 7, 6, 3} and another group {7, 8, 8, 8, 2}. The number of elements in each group (n) is 5. For the first group, the mean and variance are 6 and 3.5, respectively. For the second group, the mean and variance are 6.6 and 6.8, respectively. The degree of freedom within each group is the number of samples minus one, N - 1 = 4. The t-statistic formula applied to the above data gives a value of about -0.42. Since the magnitude of this value is smaller than the critical value at the 0.05 level, one can say the test is insignificant.
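The same result can be obtained with a short sketch, assuming SciPy is available; ttest_ind performs the independent two-sample t-test with the equal-variance (pooled) assumption used above.

```python
from scipy.stats import ttest_ind

group_a = [8, 6, 7, 6, 3]
group_b = [7, 8, 8, 8, 2]

# Independent two-sample t-test (equal variances assumed, as in the text).
t_stat, p_value = ttest_ind(group_a, group_b)
print(round(t_stat, 2))      # about -0.42
print(p_value > 0.05)        # True -> the difference is not significant
```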
Paired t-test  It is used to evaluate a hypothesis before and after an intervention. The point is that these samples are not independent. For example, consider the effect of medication on a diabetic patient. The sequence is that first the sugar level is tested, then the medication is given, and the sugar test is conducted again to study the effect of the medication. In short, in a paired t-test, the data is taken from the same subject twice. In an unpaired t-test, the samples are taken independently. In the paired test only one group is involved. The t-statistic is computed as:
t = (m - μ) / (s/√n)
Here, t is the t-statistic, m is the mean of the group of paired differences, μ is the theoretical value or population mean of the difference, s is the standard deviation, and n is the group size or sample size.
Chi-Square Test
The Chi-Square test is a non-parametric test. The goodness-of-fit test statistic follows a Chi-Square distribution under the null hypothesis and measures the statistical significance between observed frequency and expected frequency; each observation is independent of the others and follows a normal distribution. This is used to compare the observed frequency of some observation (such as the frequency of buying different brands of fans) with the expected frequency of the observation (of buying each fan by customers) and decide whether any statistical difference exists. This comparison is used to calculate the value of the Chi-Square statistic as:
χ² = Σ_i (O_i - E_i)² / E_i    (2.51)
Here, E_i is the expected frequency, O_i is the observed frequency, and the degree of freedom is C - 1, where C is the number of categories. The Chi-Square test allows us to detect duplication of data and helps to remove redundancy of values.
Consider the following Table 2.4, where machine learning course registration is done by both boys and girls. There are 50 boys and 50 girls in the class, and the registration for the course is given in the table. Apply the Chi-Square test and find out whether any difference exists between boys and girls for course registration.
Table 2.4: Observed Data
         Registered    Not Registered    Total
Boys     35            15                50
Girls    25            25                50
Total    60            40                100
Solution: Let the null hypothesis H0 be that there is no difference between boys and girls, and H1 be the alternate hypothesis that there is a significant difference between boys and girls.
For applying the Chi-Square test, the expected frequencies are obtained from the observations by multiplying the row total by the column total and dividing by the grand total (for example, expected boys registered = 50 × 60 / 100 = 30 and expected boys not registered = 50 × 40 / 100 = 20), as shown in Table 2.5.
Table 2.5: Expected Data
         Registered    Not Registered    Total
Boys     30            20                50
Girls    30            20                50
Total    60            40                100
The Chi-Square statistic is obtained using Eq. (2.51) as follows:
χ² = (35 - 30)²/30 + (15 - 20)²/20 + (25 - 30)²/30 + (25 - 20)²/20 ≈ 4.166
for degree of freedom = number of categories - 1 = 2 - 1 = 1. The p-value for this statistic is 0.0412, which is less than 0.05. Therefore, the result is significant.
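The computation can be reproduced with a minimal sketch, assuming SciPy is available and using the observed counts as reconstructed in the worked solution above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed registrations (rows: boys, girls; columns: registered, not registered).
observed = np.array([[35, 15],
                     [25, 25]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 3), dof)    # about 4.167 with 1 degree of freedom
print(round(p, 4))            # about 0.0412 -> significant at the 0.05 level
```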
2.10 FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION
TECHNIQUES
Features are attributes. Feature engineering is about determining the subset of features that form
an important part of the input that improves the performance of the model, be it classification or
any other model in machine learning.
Feature engineering deals with two problems - Feature Transformation and Feature Selection. Feature transformation is extraction of features and creating new features that may be helpful in increasing performance. For example, height and weight may give a new attribute called Body Mass Index (BMI).
Feature subset selection is another important aspect of feature engineering that focuses on selection of features to reduce the training time, but not at the cost of reliability.
The subset selection reduces the dataset size by removing irrelevant features and constructs a minimum set of attributes for machine learning. If the dataset has n attributes, then the time complexity is extremely high, as n dimensions need to be processed for the given dataset. For n attributes, there are 2ⁿ possible subsets. If the value of n is high, the problem becomes intractable. This is called the 'curse of dimensionality': as the number of dimensions increases, the time complexity increases. The remedy is that some of the components that do not contribute much can be deleted. This results in a reduction of dimensionality. Choosing optimal attributes becomes a graph search problem. Typically, the feature subset selection problem uses a greedy approach, looking for the best choice at each step using a locally optimal choice, while hoping that it would lead to a globally optimal solution.
The features can be removed based on two aspects:
1. Feature relevancy - Some features contribute more to classification than other features. For example, a mole on the face can help more in face detection than a common feature like the nose. In simple words, the features should be relevant. The relevancy of the features can be determined based on information measures such as mutual information, correlation-based measures like the correlation coefficient, and distance measures. Distance measures are discussed in Chapter 13 of this book.
2. Feature redundancy - Some features are redundant. For example, when a database table has a field called Date of Birth, then an Age field is redundant, as age can be computed easily from the date of birth. Removing the Age column leads to a reduction of the dimension by one.
So, the procedure is:
1. Generate all possible subsets
2. Evaluate the subsets and model performance
3. Evaluate the results for optimal feature selection
Filter-based selection uses statistical measures for assessing features. In this approach,
no learning algorithm is used. Correlation and information gain measures like mutual information
and entropy are all examples of this approach.
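A minimal sketch of filter-based selection, assuming scikit-learn is available and using the built-in iris dataset as an illustrative input: each feature is scored with mutual information and only the top-scoring ones are kept, without training any classifier.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Filter-based selection: score each feature with mutual information
# (no learning algorithm is involved) and keep the best two.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)      # relevance score of each feature
print(X_reduced.shape)       # (150, 2)
```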
Wrapper-based methods use classifiers to identify the best features. These are selected
and evaluated by the learning algorithms. This procedure is computationally intensive but has
superior performance.
Let us discuss some of the important algorithms that fall under this category.
2.10.1 Stepwise Forward Selection
This procedure starts with an empty set of attributes. Every time, an attribute is tested for statistical
significance for best quality and is added to the reduced set. This process is continued till a good
reduced set of attributes is obtained.
2.10.2 Stepwise Backward Elimination
This procedure starts with a complete set of attributes. At every stage, the procedure removes the
worst attribute from the set, leading to the reduced set.
Combined Approach Both forward and reverse methods can be combined so that the procedure
can add the best attribute and remove the worst attribute.
2.10.3 Principal Component Analysis
The idea of the principal component analysis (PCA) or KL transform is to transform a given
set of measurements to a new set of features so that the features exhibit high information
packing properties. This leads to a reduced and compact set of features. Basically, this elimination is made possible because of the information redundancies. This compact representation is of a reduced dimension.
Consider a group of random vectors of the form:
x = (x_1, x_2, ..., x_n)ᵀ
The mean vector of the set of random vectors is defined as:
m_x = E{x}
The operator E refers to the expected value of the population. This is calculated theoretically using the probability density functions (PDF) of the elements x_i and the joint probability density functions between the elements x_i and x_j. From this, the covariance matrix can be calculated as:
C = E{ (x - m_x)(x - m_x)ᵀ }    (2.52)
For M random vectors, when M is large enough, the mean vector and covariance matrix can be approximately calculated as:
m_x = (1/M) Σ_{k=1..M} x_k    (2.53)
C = (1/M) Σ_{k=1..M} x_k x_kᵀ - m_x m_xᵀ    (2.54)
This covariance matrix is real and symmetric. If e_i and λ_i (where i = 1, 2, ..., n) are the set of eigen vectors and corresponding eigen values of the covariance matrix, the eigen values can be arranged in descending order so that λ_i ≥ λ_{i+1}, for i = 1, 2, ..., n - 1. The corresponding eigen vectors are calculated. Based on this, the transform kernel is constructed. Let the transform kernel be A. Then, the matrix rows are formed from the eigen vectors of the covariance matrix.
The mapping of the vectors x to y using the transformation can now be described as:
y = A(x - m_x)    (2.55)
This transform is also called the Karhunen-Loeve or Hotelling transform. The original vector x can now be reconstructed as follows:
x = Aᵀ y + m_x    (2.56)
The goal of PCA is to reduce the set of attributes to a newer, smaller set that captures the variance
of the data. The variance is captured by fewer components, which would give the same result as the
original, with all the attributes. Here, instead of using all the eigen vectors of the covariance matrix,
only a small set that exhibits the variance can be used. By using a small set, the KL transform can
achieve maximum compression of the available data.
If only the K largest eigen values (and their eigen vectors, forming A_K) are used, the recovered information would be:
x̂ = A_Kᵀ y + m_x    (2.57)
The advantages of PCA are immense. It reduces the attribute list by eliminating all irrelevant
attributes. The PCA algorithm is as follows:
1. The target dataset x is obtained
2. The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset is
X—m. The objective of this process is to transform the dataset with zero mean.
3. The covariance of dataset x is obtained. Let it be C.
4. Eigen values and eigen vectors of the covariance matrix are calculated.
5. The eigen vector of the highest eigen value is the principal component of the dataset. The eigen values are arranged in descending order. The feature vector is formed with these eigen vectors in its columns.
   Feature vector = {eigen vector_1, eigen vector_2, ..., eigen vector_n}
6. Obtain the transpose of the feature vector. Let it be A.
7. The PCA transform is y = A × (x - m), where x is the input dataset, m is the mean, and A is the transpose of the feature vector.
The original data can be retrieved using the formula given below:
Original data = (A⁻¹ × y) + m    (2.58)
              = (Aᵀ × y) + m    (2.59)
The new data is a dimensionally reduced matrix that represents the original data. Therefore, PCA is effective in removing the attributes that do not contribute. If the original data is required, it can be obtained back with no loss of information. A scree plot is a visualization technique used to see which principal components (or variables) play a more important role compared to the other attributes. For a randomly selected dataset of 246 attributes, PCA is applied and its scree plot is shown in Figure 2.15. The scree plot indicates that only 6 out of the 246 attributes are important.
Figure 2.15: Scree Plot (eigen value vs. component number)
From Figure 2.15, one can infer the relevance of the attributes. The scree plot indicates that the first principal component is more important than all the others.
Let the data points be (2, 6)ᵀ and (1, 7)ᵀ. Apply PCA and find the transformed data. Then, apply the inverse transform and verify that PCA works.
Solution: One can combine the two vectors into a matrix (one data point per column) as follows:
X = ( 2  1 )
    ( 6  7 )
The mean vector can be computed as per Eq. (2.53) as:
m = ( (2 + 1)/2 )  =  ( 1.5 )
    ( (6 + 7)/2 )     ( 6.5 )
As part of PCA, the mean must be subtracted from the data to get the adjusted data:
( 2 - 1.5 ) = (  0.5 )     and     ( 1 - 1.5 ) = ( -0.5 )
( 6 - 6.5 )   ( -0.5 )             ( 7 - 6.5 )   (  0.5 )
One can find the covariance of these data vectors. The outer product of each adjusted vector with itself is:
(  0.25  -0.25 )      and      (  0.25  -0.25 )
( -0.25   0.25 )               ( -0.25   0.25 )
The final covariance matrix is obtained by adding these two matrices:
C = (  0.5  -0.5 )
    ( -0.5   0.5 )
The eigen values of matrix C can be obtained (left as an exercise) as λ_1 = 1 and λ_2 = 0. The corresponding eigen vectors are (1, -1)ᵀ and (1, 1)ᵀ. The matrix A is obtained by packing the eigen vectors of these eigen values (after sorting them) as its rows. Each eigen vector is normalized by dividing its elements by the norm of the vector, to get:
A = ( 1/√2  -1/√2 )
    ( 1/√2   1/√2 )
One can check that the PCA matrix A is orthogonal. A matrix is orthogonal if A⁻¹ = Aᵀ and AAᵀ = I:
A Aᵀ = ( 1  0 )
       ( 0  1 )
The transformed matrix y using Eq. (2.55) is given as (recollect that (x - m) is the adjusted matrix):
y = A × (x - m)
  = ( 1/√2  -1/√2 ) (  0.5  -0.5 )  =  ( 1/√2  -1/√2 )
    ( 1/√2   1/√2 ) ( -0.5   0.5 )     (   0      0  )
One can check that the original matrix can be retrieved from this matrix as:
x = (Aᵀ × y) + m
Aᵀ × y = (  0.5  -0.5 )
         ( -0.5   0.5 )
Adding the mean vector to each column gives back:
( 2  1 )
( 6  7 )
Therefore, one can infer that the original data is obtained back without any loss of information.
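The same steps can be written as a minimal NumPy sketch; the covariance here is formed as the sum of outer products, following the worked example above, and the reconstruction check confirms that no information is lost.

```python
import numpy as np

# Data points (2, 6) and (1, 7) stored as columns, as in the worked example.
X = np.array([[2.0, 1.0],
              [6.0, 7.0]])

m = X.mean(axis=1, keepdims=True)        # mean vector (1.5, 6.5)
C = (X - m) @ (X - m).T                  # sum of outer products, as in the example
eig_vals, eig_vecs = np.linalg.eigh(C)

# Sort eigen vectors by descending eigen value and use them as the rows of A.
order = np.argsort(eig_vals)[::-1]
A = eig_vecs[:, order].T

y = A @ (X - m)                          # PCA transform
X_back = A.T @ y + m                     # inverse transform (A is orthogonal)
print(np.allclose(X_back, X))            # True -> no loss of information
```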
2.10.4 Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is also a feature reduction technique like PCA. The focus of LDA is to project higher dimensional data onto a line (lower dimensional data). LDA is also used to classify the data. Let there be two classes, c_1 and c_2. Let μ_1 and μ_2 be the means of the patterns of the two classes, computed as:
μ_1 = (1/N_1) Σ_{x ∈ c_1} x    and    μ_2 = (1/N_2) Σ_{x ∈ c_2} x
The aim of LDA is to optimize the function:
J(V) = (Vᵀ σ_B V) / (Vᵀ σ_W V)    (2.60)
where V is the linear projection, and σ_B and σ_W are the between-class scatter matrix and the within-class scatter matrix, respectively. For the two-class problem, these matrices are given as:
σ_B = N_1 (μ_1 - μ)(μ_1 - μ)ᵀ + N_2 (μ_2 - μ)(μ_2 - μ)ᵀ    (2.61)
σ_W = Σ_{x ∈ c_1} (x - μ_1)(x - μ_1)ᵀ + Σ_{x ∈ c_2} (x - μ_2)(x - μ_2)ᵀ    (2.62)
The maximization of J(V) should satisfy the equation:
σ_B V = λ σ_W V    or    σ_W⁻¹ σ_B V = λ V    (2.63)
As σ_B V is always in the direction of (μ_1 - μ_2), V can be given as:
V = σ_W⁻¹ (μ_1 - μ_2)    (2.64)
Let V = {v_1, v_2, ..., v_d} be the generalized eigen vectors of σ_B and σ_W, where d is the number of largest eigen values retained, as in PCA. The transformation of x is then given as:
y = Vᵀ x    (2.65)
As in PCA, only the eigen vectors with the largest eigen values need to be retained to obtain the projections.
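A minimal usage sketch, assuming scikit-learn is available and using its built-in iris dataset as an illustrative input: LDA projects the 4-dimensional, 3-class data onto the 2 most discriminative directions.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project the 4-dimensional data onto the 2 most discriminative directions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)

print(X_projected.shape)                 # (150, 2)
print(lda.explained_variance_ratio_)     # share of discriminative variance per direction
```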
2.10.5 Singular Value Decomposition
Singular Value Decomposition (SVD) is another useful decomposition technique. Let A be the given matrix. Then the matrix A can be decomposed as:
A = U S Vᵀ    (2.66)
Here, A is the given matrix of dimension m × n, U is an orthogonal matrix of dimension m × m, S is a diagonal matrix of dimension m × n, and V is an orthogonal matrix of dimension n × n. The procedure for finding the decomposition matrices is given as follows:
1. For the given matrix, find A Aᵀ.
2. Find the eigen values of A Aᵀ.
3. Sort the eigen values in descending order. Pack the corresponding eigen vectors as a matrix U.
4. Arrange the square roots of the eigen values along the diagonal. This is the diagonal matrix S.
5. Find the eigen values and eigen vectors of Aᵀ A. Pack the eigen vectors as a matrix called V.
Thus, A = U S Vᵀ. Here, U and V are orthogonal matrices. The columns of U and V are the left and right singular vectors, respectively. SVD is useful in compression, as one can decide to retain only certain components instead of the original matrix A:
A ≈ Σ_{i=1..k} u_i s_i v_iᵀ    (2.67)
Based on the choice of retention k, the compression can be controlled.
Example: Find the SVD of the matrix:
A = ( 1  2 )
    ( 4  9 )
Solution: The first step is to compute:
A Aᵀ = ( 1  2 ) ( 1  4 )  =  (  5  22 )
       ( 4  9 ) ( 2  9 )     ( 22  97 )
The eigen values and eigen vectors of this matrix are calculated to get U. The eigen values of this matrix are 0.0098 and 101.9902. The corresponding eigen vectors, normalized to unit length and sorted by descending eigen value, are:
u_1 = ( 0.2212 )     and     u_2 = ( -0.9752 )
      ( 0.9752 )                   (  0.2212 )
The matrix U can be obtained by concatenating the above vectors as its columns:
U = ( 0.2212  -0.9752 )
    ( 0.9752   0.2212 )
The matrix V can be obtained by finding Aᵀ A, which is:
Aᵀ A = ( 17  38 )
       ( 38  85 )
The eigen values are again 0.0098 and 101.9902. The normalized eigen vectors, sorted the same way, are:
v_1 = ( 0.4082 )  (for λ = 101.9902)     and     v_2 = ( -0.9129 )  (for λ = 0.0098)
      ( 0.9129 )                                       (  0.4082 )
The matrix V can be obtained by concatenating the above vectors as its columns:
V = ( 0.4082  -0.9129 )
    ( 0.9129   0.4082 )
The matrix S is the diagonal matrix of the square roots of the eigen values:
S = ( √101.9902       0    )  =  ( 10.099    0    )
    (     0       √0.0098  )     (   0     0.099  )
Therefore, the matrix decomposition A = U S Vᵀ is complete.
The main advantage of SVD is compression. A matrix, say an image, can be decomposed, and selectively only certain components can be retained by making all the other elements zero. This reduces the size of the image content while retaining the quality of the image. SVD is useful in data reduction too.
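A minimal sketch, assuming NumPy is available, that verifies the worked example above and illustrates rank-1 compression by keeping only the largest singular value:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [4.0, 9.0]])

U, s, Vt = np.linalg.svd(A)
print(np.round(s, 3))                        # [10.099  0.099]

# Rank-1 compression: keep only the largest singular value and its vectors.
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.round(A1, 2))                       # close to A, since s[1] is tiny
print(np.allclose(U @ np.diag(s) @ Vt, A))   # full reconstruction is exact
```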
1. Data is fact. Data must be converted to information.
2. Data analysis is an operation that converts data to information. Data analytics is a general area encompassing data analysis.
3. Exploratory Data analysis aims to understand the data better.
4. Descriptive statistics aims to summarize the data and data visualization aims to understand the data
using charts.
5. Data types are qualitative and quantitative. The qualitative data types include nominal and ordinal. The quantitative data types include interval and ratio.
6. Data can also be classified as univariate data, bivariate data, and multivariate data based on the variables used.
7. The measures of central tendency include mean, median, mode, and midrange.
8. The most common dispersion measures are range, 5-number summary, interquartile range and standard deviation.
9. Skewness and kurtosis are shape measures. Skewness, kurtosis, mean absolute deviation, and
coefficient of variation help in assessing the shape.
10. Visualization is an important aspect of data mining which helps the user to recognize and interpret
the results quickly.
11. Bivariate data involves two variables and hence the focus is on finding relationships among two
variables.
12. Multivariate data has more variables. The focus is on finding the relations, cause and effects, and statistical inference.
13. Covariance is a measure of the joint variability of random variables, say X and Y. It is defined as cov(X, Y). Covariance is used to measure variance between two dimensions. Correlation measures the strength of the linear relationship among variables.
14. The idea of feature reduction is to transform a given set of measurements to a new set of features, so that the features exhibit high information packing properties. This leads to a reduced and compact set of features.
15. Sampling is a statistical procedure of data reduction. Statistics starts by identifying the group for carrying out a study.
16. The assumption of an experiment is called a hypothesis. Statistical methods are used to confirm or reject the hypothesis.
17. Z-test, t-tests, and Chi-Square tests are used to evaluate the hypothesis.
* Data — Raw facts.
* Information - Processed data.
* Data Analysis - An operation that extracts knowledge from facts.
* Data Analytics ~ A general area of analysis that encompasses data analysis.
* Big Data - Data that has characteristics like volume, velocity, variety and veracity.
* Data Preprocessing — A set of operations that are used to condition the data that is needed for
machine learning.
* Noisy Data - Abnormal data that is different from normal data. Outlier data is the same as noisy data but can be useful.
+ Data Transformations - Steps that are necessary to convert data to a processable form.
* Descriptive Statistics — A branch of statistics that summarizes data.
* Data Visualization — A branch of study that deals with the presentation of data.
* Central Tendency — It is a summary of data in terms of its center like mean, median and mode.
* Five Point Summary - A summary of data in terms of minimum, maximum, median, and the upper/lower quartiles.
* Kurtosis - A measure of the peakedness of the data.
* Bivariate Data —A dataset that has two variables.
+ Pairplot - A technique of visualizing multivariate data. Multivariate data is one that is characterized
by multiple variables.
‘* Expectation of X - Expectation is the mean of random variable observations.
+ Variance - The average squared difference of each value from the expected value.
* Covariance - A measure of the variance between two sets of data.
* Pearson Correlation Coefficient - The normalized covariance is called correlation coefficient.
+ Feature Reduction - Reduction of a set of attributes to a new or simpler set that captures the variance
of data.
* Random Variable — A set of all possible values of an experiment.
+ Distribution — Parameterized mathematical function that characterizes the data.
+ Sampling Techniques — A set of procedures for reducing the data.
* Hypothesis - The assumption being tested on the data. The current assumption is called the null hypothesis and the one that we wish to establish is called the alternate hypothesis.
* Error - The incorrect rejection of a true null hypothesis (called a Type I error) or the failure to reject a false null hypothesis (called a Type II error).
Short Questions
1. What is univariate data?
2. What are the types of data?
3. Distinguish between 'good' and 'bad' data.
4. What are the problems of data collection?
5. Explain missing data analysis.
6. What are the measures of central tendencies?
7. Why are central tendency and dispersion measures important for data miners?
8. What are the measures of skewness and kurtosis?
9. How is the interquartile range useful in eliminating outliers?
10. List out the visualization aids available for exploratory data analysis.
11. What is the use of the correlation coefficient for data mining?
12. What are the advantages of LU decomposition?
13. What is the difference between correlation and covariance?
14. What is the need for PCA analysis?
15. Justify the use of scree plots in PCA.
16. Explain the role of probability distributions in machine learning.
17. Explain the basics of binomial distribution.
18. Distinguish between sample error and actual error.
19. List out the types of sampling techniques.
20. What is a hypothesis?
Long Questions
1. Explain in stages the data management life cycle.
2. Explain the types of Big Data with an example.
3. Explain in detail data cleaning processes.
4. Explain in detail univariate data analysis.
5. Explain at least five charts in detail that help data visualization.
6. Explain the procedure for SVD.
7. Explain the process of obtaining principal components and its relevance in feature reduction.
8. Explain the procedure for hypothesis testing.
9. Explain the procedure for paired t-tests and the Chi-Square goodness-of-fit test.
Numerical Problems
1. For a given univariate dataset S = {5, 10, 15, 20, 25, 30} of marks, find the mean, median, mode, standard deviation and variance.
2. For a given univariate dataset S = {5, 10, 15, 20, 25, 30} of marks, find the arithmetic mean and geometric mean.
3. For a given univariate dataset S = {5, 10, 15, 20, 25, 30} of marks, find the five-point summary and plot the box chart.
4. For the given Tables 2.6 and 2.7, perform the descriptive analysis of data:
Table 2.6: Sample Data
42
45
47
52
6
62
7
72
75
85
Table 2.7: Students Marks Table
Tee noo
1 45 705 90 40
2 60 25 80 5
3 eo 80 90 50
4 80 80 90 80
5 85. 72 70 60
(a) Find the min and max marks scored in each subject.
(b) Find the details of the student who scored the highest marks in Maths.
(c) Find the students with marks in English > 60 and Maths > 70.
5. For univariate attributes such as Weight, English and Maths marks (consider Tables 2.6 and 2.7), find the following:
(a) Mean, Median, Mode
(b) Weighted Mean, Geometric Mean and Harmonic Mean
(c) Variance and Standard Deviation
(d) Absolute Deviation, Mean Absolute Deviation, and Median Absolute Deviation
(e) Coefficient of Variation
(f) Skewness and Kurtosis
(g) Five-point summary, IQR, Semi-Quartile
6. For the Bivariate data such as English and Math (consider Tables 2.6 and 2.7), find:
(a) Covariance and Correlation between two variables
(b) Covariance between English and Hindi marks
7. Use appropriate data visualization to plot the data in Table 2.6 using the following charts:
(a) Bar Plot and Pie Chart
(b) Histogram, Box Plot, and QQ Plot
(c) Dot Plot, Line Chart, Scatterplot
(d) Stem and Leaf Plot
8. Solve the following set of equations using the Gaussian elimination method.
2x_1 + 5x_2 = 7
6x_1 + 12x_2 = 18
9. Solve the following set of equations using the LU decomposition method.
10. Apply PCA for the following matrix and prove that it works. Again, apply the inverse and recover the original data.
11. Apply SVD for the following matrix. Perform the matrix decomposition and prove that SVD works.
12. Find the covariance and correlation coefficients for the following two sets of data:
X: 1, 2, 6, 12
Y: 8, 12, 18, 22
Crossword
Across
2. Mean and mode are examples of ______ of univariate data.
4. Noisy data is ______ (Normal/Abnormal) data.
7. A branch of statistics that is used for summarizing data is called ______ statistics.
10. Kurtosis characterizes the ______ of data.
12. The assumption of testing of data is called a ______.
13. Raw facts are called ______.
14. Data wrangling refers to making data suitable for processing. (Yes/No)
15. Pairplot is used to visualize univariate data. (Yes/No)
Down
1. The averaged squared distance from its mean is called ______.
The characteristics of Big Data are volume, velocity and ______.
A dataset of two variables is called ______ data.
Visualization helps in presentation of data. (Yes/No)
Normalized covariance is called ______.
Processed data is ______.
Incorrect rejection of a true hypothesis is called ______ error.
Word Search
Find and mark the words listed below:
Yes, Abnormal, Central tendency, Shape, Variety, Correlation, Information, Data, Bivariate, Type I, Descriptive, No, Hypothesis, Variance
Chapter 3
Basics of Learning Theory
“Predicting the future isn’t magic, it’s artificial intelligence.”
— Dave Waters
Learning is a process by which one can acquire knowledge and construct new ideas or concepts based on experiences. Machine learning is an intelligent way of learning general concepts from training examples without writing a program. There are many machine learning algorithms through which computers can intelligently learn from past data or experiences, identify patterns, and make predictions when new data is fed. This chapter deals with an introduction to concept learning, hypothesis spaces and the modelling of machine learning.
Learning Objectives
* Understand the basics of learning theory and the key elements of machine learning
* Introduce concept learning to obtain an abstraction and generalization from the data
* Learn about hypothesis representation language and explore searching the hypothesis space using the Find-S algorithm and its limitations
* Study version spaces, the List-Then-Eliminate algorithm and the Candidate Elimination algorithm
* Introduce inductive bias, a set of prior assumptions considered by a learning algorithm beyond the training data in order to perform induction
* Discuss the tradeoff between the two factors called bias and variance which exists when modelling a machine learning algorithm
* Understand the different challenges in model selection and model evaluation
* Study the popular re-sampling model selection method called Cross-Validation (K-fold, LOOCV, etc.) to tune machine learning models
* Introduce the various learning frameworks and learn to evaluate hypotheses
3.1 INTRODUCTION TO LEARNING AND ITS TYPES
The process of acquiring knowledge and expertise through study, experience, or being taught is
called learning. Generally, humans learn in different ways. To make machines learn, we need to simulate the strategies of human learning in machines. But, can computers learn? This question has been raised over many centuries by philosophers, mathematicians and logicians.
First, let us address the question - what sort of tasks can computers learn? This depends on the nature of problems that computers can solve. There are two kinds of problems - well-posed and ill-posed. Computers can solve only well-posed problems, as these have well-defined specifications and have the following components inherent to them.
1. Class of learning tasks (T)
2. A measure of performance (P)
3. A source of experience (E)
The standard definition of learning proposed by Tom Mitchell is that a program is said to learn from experience E with respect to task T and performance measure P, if its performance on T, as measured by P, improves with experience E. Let us formalize the concept of learning as follows:
Let x be the input and X be the input space, which is the set of all inputs, and Y be the output space, which is the set of all possible outputs, that is, yes/no.
Let D be the input dataset with examples (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ).
Let the unknown target function be f: X → Y, which maps the input space to the output space.
The objective of the learning program is to pick a function g: X → Y that approximates the target function f. All the possible formulae form a hypothesis space. In short, let H be the set of all formulae from which the learning algorithm chooses. The choice is good when the hypothesis g replicates f for all samples. This is shown in Figure 3.1.
Figure 3.1: Learning Environment (the ideal target function f(x), training samples, candidate formulae in the hypothesis space H, and the generated hypothesis g(x))
It can be observed that training samples and target function are dependent on the given
problem. The learning algorithm and hypothesis set are independent of the given problem. Thus,
learning model is informally the hypothesis set and learning algorithm. Thus, learning model can
be stated as follows:
Learning Model = Hypothesis Set + Learning Algorithm
Let us assume a problem of predicting a label for a given input data. Let D be the input dataset
with both positive and negative examples. Let y be the output with class +1 or -1. The simple learning model can be given as:
Σᵢ xᵢwᵢ > Threshold, belongs to class +1, and
Σᵢ xᵢwᵢ < Threshold, belongs to the other class
This can be put into a single equation as follows:
h(x) = sign((Σᵢ₌₁ⁿ xᵢwᵢ) + b)
where x₁, x₂, …, xₙ are the components of the input vector, w₁, w₂, …, wₙ are the weights, and +1 and -1 represent the two classes. This simple model is called the perceptron model. One can simplify this by taking the threshold into the weights as w₀ = b, with the corresponding input fixed as x₀ = 1; then the model can further be simplified as:
h(x) = sign(wᵀx)
This is called the perceptron learning algorithm. The formal learning models are discussed later in Section 3.7 of this chapter.
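As a concrete illustration of this model, the following minimal Python sketch implements h(x) = sign(wᵀx) together with the classical perceptron weight-update rule. The sample data, learning rate and function names are assumptions for illustration and are not taken from the text.

import numpy as np

def predict(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x); x[0] is fixed to 1 so w[0] acts as the bias b."""
    return 1 if np.dot(w, x) >= 0 else -1

def perceptron_train(X, y, epochs=10, lr=1.0):
    """Simple perceptron learning: adjust the weights whenever a sample is misclassified."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1 for the bias term
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if predict(w, xi) != yi:                # misclassified sample
                w += lr * yi * xi                   # move the decision boundary towards the sample
    return w

# Tiny illustrative dataset (assumed, not from the text): two linearly separable classes
X = np.array([[2.0, 3.0], [1.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print("learned weights:", w, "prediction for [1.5, 2.0]:", predict(w, np.hstack([1.0, [1.5, 2.0]])))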
Classical and Adaptive Machine Learning Systems
A classical machine learning system has components such as Input, Process and Output. The input
values are taken from the environment directly. These values are processed and a hypothesis is
generated as output model. This model is then used for making predictions. The predicted values
are consumed by the environment.
In contrast to the classical systems, adaptive systems interact with the input for getting
labelled data as direct inputs are not available. This process is called reinforcement learning.
In reinforcement learning, a learning agent interacts with the environment and in return gets
feedback. Based on the feedback, the learning agent generates input samples for learning, which
are used for generating the learning model. Such learning agents are not static and change their
behaviour according to the external signal received from the environment. The feedback is known
as reward and learning here is the ability of the learning agent adapting to the environment
based on the reward. These are the characteristics of an adaptive system.
Learning Types
There are different types of learning. Some of the different learning methods are as follows:
1. Learn by memorization or learn by repetition also called as rote learning is done by
memorizing without understanding the logic or concept. Although rote learning is
basically learning by repetition, in machine learning perspective, the learning occurs by
simply comparing with the existing knowledge for the same input data and producing the
output if present.
2. Learn by examples also called as learn by experience or previous knowledge acquired
at some time, is like finding an analogy, which means performing inductive learning
from observations that formulate a general concept. Here, the learner learns by inferring a
general rule from the set of observations or examples. Therefore, inductive learning is also
called as discovery learning.
3. Learn by being taught by an expert or a teacher, generally called as passive learning.
However, there is a special kind of learning called active learning where the learner can
interactively query a teacher/expert to label unlabelled data instances with the desired
outputs.
4. Learning by critical thinking, also called as deductive learning, deduces new facts or
conclusion from related known facts and information.
5. Self learning, also called as reinforcement learning, is a self-directed learning that
normally learns from mistakes, punishments and rewards.
6. Learning to solve problems is a type of cognitive learning where learning happens
in the mind and is possible by devising a methodology to achieve a goal. Here, the
learner initially is not aware of the solution or the way to achieve the goal but only knows
the goal. The learning happens either directly from the initial state by following the
steps to achieve the goal or indirectly by inferring the behaviour.
7. Learning by generalizing explanations, also called as explanation-based learning (EBL),
is another learning method that exploits domain knowledge from experts to improve
the accuracy of learned concepts by supervised learning.
Acquiring general concept from specific instances of the training dataset is the main challenge
of machine learning.
Scan for the figure showing 'Types of Learning'.
3.2 INTRODUCTION TO COMPUTATIONAL LEARNING THEORY
There are many questions that have been raised by mathematicians and logicians over time about the ability of computers to learn. Some of the questions are as follows:
1. How can a learning system predict an unseen instance?
2. How close is the hypothesis h to the target function f, when f itself is unknown?
3. How many training samples are required?
4. Can we measure the performance of a learning system?
5. Is the solution obtained local or global?
These questions are the basis of a field called ‘Computational Learning Theory’ or in short
(COLT). It is a specialized field of study of machine learning. COLT deals with formal methods
used for learning systems. It deals with frameworks for quantifying learning tasks and learning algorithms, and it provides a fundamental basis for the study of machine learning. It deals with Probably Approximately Correct (PAC) learning and the Vapnik-Chervonenkis (VC) dimension. The focus of PAC learning is the quantification of the computational difficulty of learning tasks and algorithms, while the quantification of computational capacity is the focus of the VC dimension.
Computational Learning Theory uses many concepts from diverse areas such as Theoretical
Computer Science, Artificial Intelligence and Statistics. The core concept of COLT is the concept of a learning framework. One such important framework is PAC. The learning framework is discussed in a detailed manner in Section 3.7. COLT focuses on supervised learning tasks. Since the complexity of analysis is difficult, normally, binary classification tasks are considered for analysis.
3.3 DESIGN OF A LEARNING SYSTEM
A system that is built around a learning algorithm is called a learning system. The design of systems
focuses on these steps:
1. Choosing a training experience
2. Choosing a target function
3. Representation of a target function
4. Function approximation
Training Experience
Let us consider the design of a chess game. In direct experience, individual board states and correct moves of the chess game are given directly. In an indirect system, the move sequences and results are
only given. The training experience also depends on the presence of a supervisor who can label all
valid moves for a board state. In the absence of a supervisor, the game agent plays against itself and
learns the good moves, provided the training samples cover all scenarios or, in other words, are distributed enough for performance computation. If the training samples and testing samples have the same
distribution, the results would be good.
Determine the Target Function
The next step is the determination of a target function. In this step, the type of knowledge that
needs to be learnt is determined. In direct experience, a board move is selected and is determined
whether it is a good move or not against all other moves. If it is the best move, then it is chosen as:
B → M, where B is the set of legal board states and M is the set of legal moves. In indirect experience, all legal moves are accepted and a score is generated for each. The move with the largest score is then chosen and executed.
Determine the Target Function Representation
The representation of knowledge may be a table, collection of rules or a neural network. The linear
combination of these factors can be coined as:
V = w₀ + w₁x₁ + w₂x₂ + w₃x₃
where x₁, x₂ and x₃ represent different board features, and w₀, w₁, w₂ and w₃ represent the weights.
Choosing an Approximation Algorithm for the Target Function
The focus is to choose weights and fit the given training samples effectively. The aim is to reduce
the error, given as:
E = Σ_b [V_train(b) − V̂(b)]²
Here, b is a training sample, V_train(b) is its training value and V̂(b) is the value predicted by the current hypothesis. The approximation is carried out as:
* Computing the error as the difference between the trained and predicted values. Let this error be error(b).
* Then, for every board feature xᵢ, the weights are updated as:
wᵢ = wᵢ + μ × error(b) × xᵢ
Here, μ is a small constant that moderates the size of the weight update.
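A minimal Python sketch of this LMS-style update is shown below; the board feature values, the training value and the constant μ are assumed numbers for illustration only, not values given in the text.

def lms_update(weights, features, v_train, mu=0.01):
    """Update the weights towards the training value V_train(b) for one board sample b."""
    v_hat = weights[0] + sum(w * x for w, x in zip(weights[1:], features))  # V̂(b) = w0 + Σ wi·xi
    error = v_train - v_hat                                                 # error(b)
    weights[0] += mu * error                                                # bias term uses x0 = 1
    for i, x in enumerate(features, start=1):
        weights[i] += mu * error * x                                        # wi = wi + μ · error(b) · xi
    return weights

# Example: three assumed board features and one assumed training value
w = [0.0, 0.0, 0.0, 0.0]
w = lms_update(w, features=[3, 1, 0], v_train=100)
print(w)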
Thus, the learning system has the following components:
* A Performance system to allow the game to play against itself.
+ A Critic system to generate the samples.
+ A Generalizer system to generate a hypothesis based on samples.
+ An Experimenter system to generate a new system based on the currently learnt function.
This is sent as input to the performance system.
3.4 INTRODUCTION TO CONCEPT LEARNING
Concept learning is a learning strategy of acquiring abstract knowledge or inferring a general
concept or deriving a category from the given training samples. It is a process of abstraction and
generalization from the data.
Concept learning helps to classify an object that has a set of common, relevant features. Thus,
it helps a learner compare and contrast categories based on the similarity and association of
positive and negative instances in the training data to classify an object. The learner tries to
simplify by observing the common features from the training samples and then apply this
simplified model to the future samples. This task is also known as learning from experience.
Each concept or category obtained by learning is a Boolean valued function which takes a true
or false value. For example, humans can identify different kinds of animals based on common
relevant features and categorize all animals based on specific sets of features. The special features
that distinguish one animal from another can be called a concept. This way of learning categories for objects and recognizing new instances of those categories is called concept learning.
It is formally defined as inferring a Boolean valued function by processing training instances.
Concept learning requires three things:
1. Input - Training dataset which is a set of training instances, each labeled with the name
of a concept or category to which it belongs. Use this past experience to train and build
the model.
2. Output - Target concept or Target function f. It is a mapping function f(x) from input x to
output y. It is to determine the specific features or common features to identify an object.
In other words, it is to find the hypothesis to determine the target concept. For example, the
specific set of features to identify an elephant from all animals.
3. Test - New instances to test the learned model.
Formally, concept learning is defined as: "Given a set of hypotheses, the learner searches through the hypothesis space to identify the best hypothesis that matches the target concept."
Consider the following set of training instances shown in Table 3.1.
Table 3.1: Sample Training Instances
Id   Horns   Tail    Tusks   Paws   Fur   Color   Hooves   Size     Elephant
1    No      Short   Yes     No     No    Black   No       Big      Yes
2    Yes     Short   No      No     No    Brown   Yes      Medium   No
3    No      Short   Yes     No     No    Black   No       Medium   Yes
4    No      Long    No      Yes    Yes   White   No       Medium   No
5    No      Short   Yes     Yes    Yes   Black   No       Big      Yes
Here, in this set of training instances, the independent attributes considered are 'Horns', 'Tail', 'Tusks', 'Paws', 'Fur', 'Color', 'Hooves' and 'Size'. The dependent attribute is 'Elephant'. The target
concept is to identify the animal to be an Elephant.
Let us now take this example and understand further the concept of hypothesis.
Target Concept: Predict the type of animal - for example, 'Elephant'.
3.4.1 Representation of a Hypothesis
A hypothesis ‘h’ approximates a target function ‘f’ to represent the relationship between the
independent attributes and the dependent attribute of the training instances. The hypothesis is the
predicted approximate model that best maps the inputs to outputs. Each hypothesis is represented
as a conjunction of attribute conditions in the antecedent part.
For example, (Tail = Short) ∧ (Color = Black)
The set of hypotheses in the search space is called hypotheses. Hypotheses is the plural form of hypothesis. Generally, 'H' is used to represent the hypothesis space and 'h' is used to represent
a candidate hypothesis.
Each attribute condition is the constraint on the attribute which is represented as attribute-value
pair. In the antecedent of an attribute condition of a hypothesis, each attribute can take the value '?' or 'ϕ', or can hold a single value.
* '?' denotes that the attribute can take any value [e.g., Color = ?]
* 'ϕ' denotes that the attribute cannot take any value, i.e., it represents a null value [e.g., Horns = ϕ]
* A single value denotes a specific value from the acceptable values of the attribute, i.e., the attribute 'Tail' can take a value such as 'Short' [e.g., Tail = Short]
For example, a hypothesis 'h' over the attributes Horns, Tail, Tusks, Paws, Fur, Color, Hooves and Size will look like:
h = <No, Short, Yes, ?, ?, Black, No, ?>
Given a test instance x, we say h(x) = 1 if the test instance x satisfies this hypothesis h.
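To make the evaluation of h(x) = 1 concrete, the short Python sketch below checks whether an instance satisfies a conjunctive hypothesis containing '?' and 'ϕ' constraints. The function name is an assumption; the values are taken from the elephant example.

def satisfies(hypothesis, instance):
    """Return 1 if the instance satisfies every attribute constraint of the hypothesis, else 0."""
    for h_val, x_val in zip(hypothesis, instance):
        if h_val == 'phi':                       # 'ϕ' rejects every value
            return 0
        if h_val != '?' and h_val != x_val:      # a specific constraint must match exactly
            return 0
    return 1

# Hypothesis h = <No, Short, Yes, ?, ?, Black, No, ?> against training instance 1 of Table 3.1
h = ['No', 'Short', 'Yes', '?', '?', 'Black', 'No', '?']
x1 = ['No', 'Short', 'Yes', 'No', 'No', 'Black', 'No', 'Big']
print(satisfies(h, x1))   # prints 1, i.e., h classifies instance 1 as an elephant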
The training dataset given above has 5 training instances with 8 independent attributes
and one dependent attribute. Here, the different hypotheses that can be predicted for the target
concept are,
Horns   Tail   Tusks   Paws   Fur   Color   Hooves   Size
h = < … >   (or)   h = < … >
The task is to predict the best hypothesis for the target concept (an elephant). The most general
hypothesis can allow any value for each of the attributes. It is represented as:
<?, ?, ?, ?, ?, ?, ?, ?>. This hypothesis indicates that any animal can be an elephant.
The most specific hypothesis will not allow any value for any of the attributes: <ϕ, ϕ, ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>. This hypothesis indicates that no animal can be an elephant.
The target concept mentioned in this example is to identify the conjunction of specific features
from the training instances to correctly identify an elephant.
Explain Concept Learning Task of an Elephant from the dataset given in Table 3.1.
Given,
Input: 5 instances each with 8 attributes
Target concept/function c: Elephant → {Yes, No}
Hypotheses H: Set of hypotheses, each a conjunction of literals as propositions [i.e., each literal is represented as an attribute-value pair]
Solution: The hypothesis 'h' for the concept learning task of an Elephant is given as:
h = <No, Short, Yes, ?, ?, Black, No, ?>
This hypothesis h is expressed in propositional logic form as below:
(Horns = No) ∧ (Tail = Short) ∧ (Tusks = Yes) ∧ (Paws = ?) ∧ (Fur = ?) ∧ (Color = Black) ∧ (Hooves = No) ∧ (Size = ?)
Output: Learn the hypothesis 'h' to predict 'Elephant' such that, for a given test instance x, h(x) = c(x).
This hypothesis produced is also called as concept description which is a model that can be
used to classify subsequent instances.
"="
Thus, concept learning can also be called Inductive Learning, which tries to induce a general function from specific training instances. This way of learning a hypothesis that can produce an approximate target function with a sufficiently large set of training instances, and that can also approximately classify other unobserved instances, is called the inductive learning hypothesis. We can only determine an approximate target function because it is very difficult to find an exact target function with the observed training instances. That is why a hypothesis is an approximate target function that best maps the inputs to outputs.
3.4.2 Hypothesis Space
Hypothesis space is the set of all possible hypotheses that approximates the target function f.
In other words, the set of all possible approximations of the target function can be defined as
hypothesis space. From this set of hypotheses in the hypothesis space, a machine learning algorithm
would determine the best possible hypothesis that would best describe the target function or best
fit the outputs. Generally, a hypothesis representation language represents a larger hypothesis
space. Every machine learning algorithm would represent the hypothesis space in a different
manner about the function that maps the input variables to output variables. For example,
a regression algorithm represents the hypothesis space as a linear function whereas a decision
tree algorithm represents the hypothesis space as a tree.
The set of hypotheses that can be generated by a learning algorithm can be further reduced by
specifying a language bias.
The subset of the hypothesis space that is consistent with all observed training instances is called
as Version Space. Version space represents the only hypotheses that are used for the classification.
For example, each of the attributes given in Table 3.1 has the following possible set of
values.
Horns - Yes, No
Tail - Long, Short
Tusks - Yes, No
Paws - Yes, No
Fur - Yes, No
Color - Brown, Black, White
Hooves - Yes, No
Size - Medium, Big
Considering these values for each of the attributes, there are (2 × 2 × 2 × 2 × 2 × 3 × 2 × 2) = 384 distinct instances covering all the 5 instances in the training dataset.
So, we can generate (4 × 4 × 4 × 4 × 4 × 5 × 4 × 4) = 81,920 distinct hypotheses when including the two additional values ['?', 'ϕ'] for each attribute. However, any hypothesis containing one or more 'ϕ' symbols represents the empty set of instances; that is, it classifies every instance as a negative instance. Therefore, there will be (3 × 3 × 3 × 3 × 3 × 4 × 3 × 3 + 1) = 8,749 distinct hypotheses by including only '?' for each attribute, plus the one hypothesis representing the empty set of instances. Thus, the hypothesis space is much larger, and hence we need efficient learning algorithms to search for the best hypothesis from the set of hypotheses.
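These counts follow directly from multiplying the number of choices available per attribute; the short Python check below (a sketch using the attribute value counts listed above) reproduces the three figures.

from math import prod

# Number of possible values per attribute: Horns, Tail, Tusks, Paws, Fur, Color, Hooves, Size
values_per_attribute = [2, 2, 2, 2, 2, 3, 2, 2]

distinct_instances = prod(values_per_attribute)                      # 384
syntactic_hypotheses = prod(v + 2 for v in values_per_attribute)     # add '?' and 'ϕ' -> 81,920
semantic_hypotheses = prod(v + 1 for v in values_per_attribute) + 1  # only '?' added, plus the empty hypothesis -> 8,749

print(distinct_instances, syntactic_hypotheses, semantic_hypotheses)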
Hypothesis ordering is also important wherein the hypotheses are ordered from the most
specific one to the most general one in order to restrict searching the hypothesis space exhaustively.
3.4.3 Heuristic Space Search
Heuristic search is a search strategy that finds an optimized hypothesis/solution to a problem
by iteratively improving the hypothesis/solution based on a given heuristic function or a cost
measure. Heuristic search methods will generate a possible hypothesis that can be a solution in the hypothesis space or a path from the initial state. This hypothesis will be tested with the target function or the goal state to see if it is a real solution. If the tested hypothesis is a real solution, then it will be selected. This method generally increases the efficiency because it is guaranteed to find a better hypothesis, but it may not find the best hypothesis. It is useful for solving tough problems which could not be solved by any other method. A typical example problem solved by heuristic search is the travelling salesman problem.
Several commonly used heuristic search methods are hill climbing, constraint satisfaction, best-first search, simulated annealing, the A* algorithm, and genetic algorithms.
3.4.4 Generalization and Specialization
In order to understand about how we construct this concept hierarchy, let us apply this general
principle of generalization/specialization relation. By generalization of the most specific hypothesis,
and by specialization of the most general hypothesis, the hypothesis space can be searched for
an approximate hypothesis that matches all positive instances but does not match any negative
instance.
Searching the Hypothesis Space
There are two ways of learning the hypothesis, consistent with all training instances from the large
hypothesis space.
1. Specialization - General to Specific learning,
2. Generalization — Specific to General learning
Scan for information on 'Additional Examples on Generalization and Specialization'.
Generalization - Specific to General Learning  This learning methodology will search through the hypothesis space for an approximate hypothesis by generalizing the most specific hypothesis.
Consider the training instances shown in Table 3.1 and illustrate Specific to
General Learning.
Solution: We will start from all false or the most specific hypothesis to determine the most
restrictive specialization. Consider only the positive instances and generalize the most specific
hypothesis. Ignore the negative instances.
This learning is illustrated as follows:
The most specific hypothesis is taken first; it will not classify any instance as true.
h = <ϕ, ϕ, ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>
Read the first instance I1 and generalize the hypothesis h so that this positive instance can be classified by the hypothesis h1.
I1: No Short Yes No No Black No Big Yes (Positive instance)
h1 = <No, Short, Yes, No, No, Black, No, Big>
When reading the second instance I2, it is a negative instance, so ignore it.
I2: Yes Short No No No Brown Yes Medium No (Negative instance)
h2 = <No, Short, Yes, No, No, Black, No, Big>
Similarly, when reading the third instance I3, it is a positive instance, so generalize h2 to h3 to accommodate it. The resulting h3 is generalized: the mismatching attribute 'Size' becomes '?'.
I3: No Short Yes No No Black No Medium Yes (Positive instance)
h3 = <No, Short, Yes, No, No, Black, No, ?>
Ignore I4 since it is a negative instance.
I4: No Long No Yes Yes White No Medium No (Negative instance)
h4 = <No, Short, Yes, No, No, Black, No, ?>
When reading the fifth instance I5, h4 is further generalized to h5: the mismatching attributes 'Paws' and 'Fur' become '?'.
I5: No Short Yes Yes Yes Black No Big Yes (Positive instance)
h5 = <No, Short, Yes, ?, ?, Black, No, ?>
h5 is general enough to classify all the positive instances as true.
Specialization - General to Specific Learning  This learning methodology will search through the hypothesis space for an approximate hypothesis by specializing the most general hypothesis using the negative instances, while staying consistent with the positive instances. Using the training instances of Table 3.1 again:
I1: No Short Yes No No Black No Big Yes (Positive instance)
h1 = <?, ?, ?, ?, ?, ?, ?, ?>
I2: Yes Short No No No Brown Yes Medium No (Negative instance)
h2 = <?, ?, Yes, ?, ?, ?, ?, ?>
     <?, ?, ?, ?, ?, Black, ?, ?>
     <?, ?, ?, ?, ?, ?, No, ?>
     <?, ?, ?, ?, ?, ?, ?, Big>
h2 imposes constraints so that it will not classify the negative instance as true.
I3: No Short Yes No No Black No Medium Yes (Positive instance)
h3 = <?, ?, Yes, ?, ?, ?, ?, ?>
     <?, ?, ?, ?, ?, Black, ?, ?>
     <?, ?, ?, ?, ?, ?, No, ?>
     <?, ?, ?, ?, ?, ?, ?, Big>
I4: No Long No Yes Yes White No Medium No (Negative instance)
h4 = <?, ?, Yes, ?, ?, ?, ?, ?>
     <?, ?, ?, ?, ?, Black, ?, ?>
     <?, ?, ?, ?, ?, ?, ?, Big>
Remove any hypothesis inconsistent with this negative instance.
I5: No Short Yes Yes Yes Black No Big Yes (Positive instance)
h5 = <?, ?, Yes, ?, ?, ?, ?, ?>
     <?, ?, ?, ?, ?, Black, ?, ?>
     <?, ?, ?, ?, ?, ?, ?, Big>
Thus, h5 is the set of hypotheses generated, which classifies the positive instances as true and the negative instances as false.
3.4.5 Hypothesis Space Search by Find-S Algorithm
Find-S algorithm is guaranteed to converge to the most specific hypothesis in H that is consistent
with the positive instances in the training dataset. Obviously, it will also be consistent with the
negative instances. Thus, this algorithm considers only the positive instances and eliminates negative
instances while generating the hypothesis. It initially starts with the most specific hypothesis.
Algorithm: Find-S
Input: Positive instances in the Training dataset
Output: Hypothesis ‘h’
1. Initialize ‘h’ to the most specific hypothesis.
h = <ϕ, ϕ, ϕ, …, ϕ>
2. Generalize the initial hypothesis for the first positive instance [since 'h' is the most specific].
3. For each subsequent instance:
   If it is a positive instance,
      Check each attribute value in the instance against the hypothesis.
      If the attribute value is the same as the hypothesis value, then do nothing.
      Else, if the attribute value is different from the hypothesis value, change it to '?' in 'h'.
   Else, if it is a negative instance,
      Ignore it.
Consider the training dataset of 4 instances shown in Table 3.2. It contains the
details of the performance of students and their likelihood of getting a job offer or not in their final
semester. Apply the Find-S algorithm.
Table 3.2: Training Dataset
Id   Attribute 1   Attribute 2   Attribute 3   Attribute 4   Attribute 5   Attribute 6   Job Offer
I1   ≥9            Yes           Excellent     Good          Fast          Yes           Yes
I2   ≥9            Yes           Good          Good          Fast          Yes           Yes
I3   ≥8            No            Good          Good          Fast          No            No
I4   ≥9            Yes           Good          Good          Slow          No            Yes
Solution:
Step 1: Initialize 'h' to the most specific hypothesis. There are 6 attributes, so for each attribute we initially fill 'ϕ' in the initial hypothesis 'h':
h = <ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>
Step 2: Scan the first instance I1. Since it is a positive instance, generalize the most specific hypothesis to it:
I1: ≥9 Yes Excellent Good Fast Yes (Positive instance)
h = <≥9, Yes, Excellent, Good, Fast, Yes>
Step 3: Scan the next instance I2. Since I2 is a positive instance, generalize 'h' to include it. For each non-matching attribute value in 'h', put a '?' to include this positive instance. The third attribute value is mismatching in 'h' with I2, so put a '?':
I2: ≥9 Yes Good Good Fast Yes (Positive instance)
h = <≥9, Yes, ?, Good, Fast, Yes>
Now, scan I3. Since it is a negative instance, ignore it. Hence, the hypothesis remains the same without any change after scanning I3.
I3: ≥8 No Good Good Fast No (Negative instance)
h = <≥9, Yes, ?, Good, Fast, Yes>
Now scan I4. Since it is a positive instance, check for mismatches in the hypothesis 'h' with I4. The 5th and 6th attribute values are mismatching, so add '?' to those attributes in 'h':
I4: ≥9 Yes Good Good Slow No (Positive instance)
h = <≥9, Yes, ?, Good, ?, ?>
Now, the final hypothesis generated with the Find-S algorithm is:
h = <≥9, Yes, ?, Good, ?, ?>
It includes all positive instances and obviously ignores any negative instance.
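The steps above can be mirrored by a short Python sketch of Find-S; the attribute values come from Table 3.2, while the function and variable names are assumptions for illustration.

def find_s(training_data):
    """Find-S: start from the most specific hypothesis and generalize it with every positive instance."""
    n_attributes = len(training_data[0][0])
    h = ['phi'] * n_attributes                      # most specific hypothesis <ϕ, ϕ, ..., ϕ>
    for instance, label in training_data:
        if label != 'Yes':                          # negative instances are ignored
            continue
        for i, value in enumerate(instance):
            if h[i] == 'phi':                       # first positive instance: copy its values
                h[i] = value
            elif h[i] != value:                     # mismatch with a later positive instance
                h[i] = '?'                          # generalize that attribute to '?'
    return h

# Training dataset of Table 3.2: (attribute values, Job Offer)
data = [
    (['>=9', 'Yes', 'Excellent', 'Good', 'Fast', 'Yes'], 'Yes'),
    (['>=9', 'Yes', 'Good',      'Good', 'Fast', 'Yes'], 'Yes'),
    (['>=8', 'No',  'Good',      'Good', 'Fast', 'No'],  'No'),
    (['>=9', 'Yes', 'Good',      'Good', 'Slow', 'No'],  'Yes'),
]
print(find_s(data))   # ['>=9', 'Yes', '?', 'Good', '?', '?']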
Limitations of Find-S Algorithm
1. Find-S algorithm tries to find a hypothesis that is consistent with positive instances, ignoring
all negative instances. As long as the training dataset is consistent, the hypothesis found by this
algorithm may be consistent.
2. The algorithm finds only one unique hypothesis, wherein there may be many other hypotheses
that are consistent with the training dataset.
3. Many times, the training dataset may contain some errors; hence such inconsistent data instances
can mislead this algorithm in determining the consistent hypothesis since it ignores negative
instances.
Hence, it is necessary to find the set of hypotheses that are consistent with the training data
including the negative examples. To overcome the limitations of Find-S algorithm, Candidate
Elimination algorithm was proposed to output the set of all hypotheses consistent with the training
dataset.
3.4.6 Version Spaces
The version space contains the subset of hypotheses from the hypothesis space that is consistent
with all training instances in the training dataset.
‘Scan for information on ‘Additional Examples on Version Spaces’
List-Then-Eliminate Algorithm
The principle idea of this learning algorithm is to initialize the version space to contain all hypotheses
and then eliminate any hypothesis that is found inconsistent with any training instances. Initially,
the algorithm starts with a version space to contain all hypotheses scanning each training instance.
The hypotheses that are inconsistent with the training instance are eliminated. Finally, the algorithm
outputs the list of remaining hypotheses that are all consistent.
Algorithm 3:
Input: Version Space — a list of all hypotheses
Output: Set of consistent hypotheses
1. Initialize the version space with a list of hypotheses.
2. For each training instance,
«remove from version space any hypothesis that is inconsistent.
This algorithm works fine if the hypothesis space is finite but practically it is difficult to
deploy this algorithm. Hence, a variation of this idea is introduced in the Candidate Elimination
algorithm.
Version Spaces and the Candidate Elimination Algorithm
Version space learning generates all consistent hypotheses. This algorithm computes
the version space by the combination of the two cases namely,
* Specific to General learning - Generalize S to include the positive example
* General to Specific learning - Specialize G to exclude the negative example
Using the Candidate Elimination algorithm, we can compute the version space containing
all (and only those) hypotheses from H that are consistent with the given observed sequence of
training instances. The algorithm defines two boundaries called ‘general boundary’ which is a set of
all hypotheses that are the most general and ‘specific boundary’ which is a set of all hypotheses that
are the most specific. Thus, the algorithm limits the version space to contain only those hypotheses
that are most general and most specific. Thus, it provides a compact representation of the List-Then-Eliminate algorithm.
Algorithm: Candidate Elimination
Input: Set of instances in the Training dataset
Output: Hypotheses G and S
1. Initialize G to the maximally general hypotheses.
2. Initialize S to the maximally specific hypotheses.
   * Generalize the initial hypothesis for the first positive instance.
3. For each subsequent new training instance,
   * If the instance is positive,
     o Generalize S to include the positive instance:
       - Check the attribute value of the positive instance and S.
       - If the attribute value of the positive instance and S are different, fill that field value with '?'.
       - If the attribute value of the positive instance and S are the same, then make no change.
     o Prune G to exclude all hypotheses in G that are inconsistent with the positive instance.
   * If the instance is negative,
     o Specialize G to exclude the negative instance:
       - Add to G all minimal specializations that exclude the negative example and are consistent with S.
       - If the attribute value of S and the negative instance are different, then fill that attribute value with the S value.
       - If the attribute value of S and the negative instance are the same, no need to update 'G'; fill that attribute value with '?'.
     o Remove from S all hypotheses inconsistent with the negative instance.
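As a small illustration of the minimal-specialization step used for negative instances, the sketch below (an assumed helper written for this discussion, not the full Candidate Elimination algorithm) enumerates one specialization of the all-'?' general hypothesis for every attribute where the specific boundary S differs from the negative instance.

def minimal_specializations(s, negative):
    """Specialize the most general hypothesis <?, ..., ?> to exclude one negative instance.

    For every attribute where the specific boundary s holds a concrete value that differs
    from the negative instance, emit a hypothesis that constrains only that attribute.
    """
    specializations = []
    for i, s_val in enumerate(s):
        if s_val not in ('?', 'phi') and s_val != negative[i]:
            g = ['?'] * len(s)
            g[i] = s_val
            specializations.append(g)
    return specializations

# Specific boundary after the first positive instance of Table 3.1, and the negative instance I2
s = ['No', 'Short', 'Yes', 'No', 'No', 'Black', 'No', 'Big']
i2 = ['Yes', 'Short', 'No', 'No', 'No', 'Brown', 'Yes', 'Medium']
for g in minimal_specializations(s, i2):
    print(g)   # one specialized hypothesis per differing attribute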
Generating Positive Hypothesis ‘S’ If it is a positive example, refine $ to include the positive
instance. We need to generalize $ to include the positive instance. The hypothesis is the conjunction
of 'S' and the positive instance. When generalizing, for the first positive instance, add to S all minimal generalizations such that S is filled with the attribute values of the positive instance. For the subsequent positive instances scanned, check the attribute value of the positive instance and S obtained in the