Chapter I - Neat
A high-level view of AI is shown in Fig. 1.1, where the tasks related to conventional AI, machine learning (ML), pattern recognition (PR) and data mining are shown separately. Here, ML may be viewed as dealing with more than just pattern recognition tasks. Classification and clustering are the typical tasks of a PR system. However, ML deals with regression problems also. Data mining is concerned with the efficient organization of data.
• In ML, we deal with vectors and vector spaces, and these topics are better appreciated through linear algebra. The data input to an ML system may be viewed as a matrix, popularly called the data matrix. If there are n data items, each represented as an l-dimensional vector, the corresponding data matrix A is of size n × l. Linear algebra is useful in analysing the weights associated with the edges in a neural network. Matrix multiplication and eigen analysis are important in initializing the weights of the neural network and in weight updates. It can also help in weight normalization. The whole activity of clustering may be viewed as data matrix factorization. (A short code sketch of this data-matrix view appears after this list.)
• The role of probability and statistics need not be explained as ML is,
in fact, statistical ML. These topics help in estimating the distributions
underlying the data. Further, they play a crucial role in analysis and
inference in ML.
• Optimization (along with calculus) is essential in training neural networks, where gradients and their computations are important. Gradient descent-based optimization is an essential ingredient of any DL system.
• Information theoretic concepts like entropy, mutual information and Kullback-Leibler divergence are essential to understand topics such as decision tree classifiers, feature selection and deep neural networks.
We will provide details of all these background topics in their respective
chapters.
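The data-matrix view described in the list above can be made concrete with a small NumPy sketch; the values used here are arbitrary and only meant to fix the n × l shape convention.

```python
import numpy as np

# A minimal sketch: n = 4 data items, each an l = 3 dimensional vector,
# stacked as the rows of the n x l data matrix A.
A = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [4.7, 3.2, 1.3],
              [4.6, 3.1, 1.5]])

n, l = A.shape              # n = 4 data items, l = 3 features
col_means = A.mean(axis=0)  # per-feature means, useful for centring the data
print(n, l, col_means)
```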
1.2.2 Learning by Deduction
Deductive learning deals with the exploitation of deductions made earlier. This
type of learning is based on reasoning that is truth preserving. Given A, and
if A then B(A → B), we can deduce B. We can use B along with if B then
C(B → C) to deduce C. Note that whenever A and A → B are True, then B is
True, ratifying the truth preserving nature of learning by deduction. Consider
the following statements:
1. It is raining.
2. If it rains, the roads become wet.
3. If the roads are wet, they become slippery.
From (1) and (2), we can infer using deduction that (4) the roads are wet. This deduction can then be used with (3) to deduce or learn that (5) the roads are slippery. Here, if statements (1), (2) and (3) are True, then statements (4) and (5) are automatically True.
A digital computer is primarily a deductive engine and is ideally suited for this form of learning. Deductive learning is applied in well-defined domains like game playing, including in chess.
1. An aeroplane is a flying object.
2. A is a flying object.
From (1) and (2), we infer using abduction that A is an aeroplane. This kind of reasoning may lead to incorrect conclusions. For example, A could be a bird or a kite.
1. Classification: Consider the handwritten digits shown in Fig. 1.3. Here, each row has 15 examples of one of the digits. The problem is to learn an ML model using such data to classify a new data pattern. This is also called supervised learning as the model is learnt with the help of such exemplar data. The data may be provided by an expert in several practical situations. For example, a medical doctor may provide examples of normal patients and patients infected by COVID-19 based on some test results. In the case of handwritten digits, we have 10 class labels, one class label corresponding to each of the digits from 0 to 9. In classification, we would like to assign an appropriate class label from these labels to a new pattern.
2. Regression: Contrary to classification, there are several prediction applications where the labels come from a possibly infinite set. For example, the share value of a stock could be a positive real number. The stock may have different values at different points in time and each of these values is a real number. This is a typical regression or curve fitting problem. The practical need here is to predict the share value of a stock at a future time instant based on past data in the form of examples.
Fig. 1.3 Examples of handwritten digits labelled 0 to 9
3. Clustering: Here, the patterns are grouped into clusters so that similar patterns fall into the same cluster. Each cluster could be represented by its centroid or mean. Let x1, x2, . . . , xp be the p elements of a cluster. Then the centroid of the cluster is defined by
$$\frac{1}{p}\sum_{i=1}^{p} x_i$$
Let us consider the handwritten digit data set of 3 classes: 0, 1 and 3. By using the class labels and clustering the patterns of each class separately into 3 clusters, we obtain the 9 centroids shown in Fig. 1.4.
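A brief sketch of this per-class clustering, using KMeans from scikit-learn; load_digits (scikit-learn's bundled 8 × 8 digits) is used here only as a stand-in for the data in Fig. 1.3, and the choice of 3 clusters per class mirrors the example above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

digits = load_digits()                  # 8x8 digit images flattened to 64-d vectors
X, y = digits.data, digits.target

centroids = []
for label in (0, 1, 3):                 # the 3 classes used in the example
    X_class = X[y == label]
    # cluster each class separately into 3 clusters; the cluster means are the centroids
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_class)
    centroids.append(km.cluster_centers_)

centroids = np.vstack(centroids)        # 9 centroids, one per row
print(centroids.shape)                  # (9, 64)
```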
Fig. 1.5 Cluster centroids without using the class labels in clustering
In Table 1.1, the different types of data from the Wisconsin Breast Cancer database are shown. The domain of Diagnosis, the class label, is a binary set with values Malignant and Benign. The domain of ID Number is a subset of integers in the range [8670, 917897] and the domain of Area_Mean is a collection of floating point numbers (an interval) in the range [143.5, 2501]. It is possible to have binary values in the domain for categorical or numerical data. For example, the domain of Status could be {Pass, Fail} and this variable is nominal; an example of a binary ordinal type is {short, tall} for humans based on their height. A very popular binary numerical type is {0, 1}; also, in the classification context, the class label data can have the domain {−1, +1}, where −1 stands for the label of the negative class and +1 stands for the label of the positive class.

Table 1.1 Different types of data from the Wisconsin Breast Cancer database
Typically, each pattern or data item is represented as a vector of feature values. For example, a data item corresponding to a patient with ID 92751 is represented by the five-dimensional vector (Benign, 92751, 47.92, 181, 0.05263), where each component of the vector represents the corresponding feature shown in Table 1.1. Benign is the value of feature 1, Diagnosis; similarly, the third entry 47.92 corresponds to feature 3, that is, Perimeter_Mean, and so on. Note that Diagnosis is a nominal feature and ID Number is an ordinal attribute. The remaining three features are numerical.
Here, Diagnosis, the class label, is a dependent feature and the remaining four features are independent features. Given a collection of such data items or patterns in the form of five-dimensional vectors, the ML system learns an association or mapping between the independent features and the dependent feature.
1.4 Matching
Matching is an important activity in ML. It is used in both supervised learning
and in learning from observations. Matching is carried out by using a proximity
measure which can be a distance/dissimilarity measure or a similarity measure.
Two data items, u and v, represented as l-dimensional vectors, match better when the distance between them is smaller or when the similarity between them is larger. A popular distance measure is the Euclidean distance and a popular similarity measure is the cosine of the angle between the vectors. The Euclidean distance is given by
$$d(u, v) = \sqrt{\sum_{i=1}^{l} \left(u(i) - v(i)\right)^2}$$
and the cosine of the angle between u and v is given by
$$\cos(u, v) = \frac{u^t v}{\|u\|\,\|v\|}$$
where u^t v is the dot product between vectors u and v and ∥u∥ is the Euclidean distance between u and the origin; it is also called the Euclidean norm. One of the important applications of matching in ML is in assigning a pattern x to the class or cluster Ci whose representative xi is nearest to x, that is,
$$d(x, x_i) \le d(x, x_j), \quad \text{for } j \in \{1, 2, \ldots, K\}$$
This idea is useful in clustering or learning from observations, where Ci is the ith group or cluster of patterns or observations and xi is the representative of Ci. The centroid of the data vectors in Ci is a popularly used representative, xi, of Ci.
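A minimal NumPy sketch of these matching operations, using small made-up vectors; it computes the Euclidean distance, the cosine similarity, and assigns a pattern to the nearest of a set of representatives.

```python
import numpy as np

def euclidean(u, v):
    # d(u, v) = sqrt(sum_i (u(i) - v(i))^2)
    return np.sqrt(np.sum((u - v) ** 2))

def cosine(u, v):
    # cos(u, v) = u^t v / (||u|| ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 2.0, 4.0])
print(euclidean(u, v), cosine(u, v))

# Assign x to the cluster whose representative (centroid) is nearest.
representatives = np.array([[1.0, 1.0, 1.0],
                            [5.0, 5.0, 5.0]])
x = np.array([1.2, 0.9, 1.1])
nearest = np.argmin([euclidean(x, r) for r in representatives])
print("x is assigned to cluster", nearest)
```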
Data Preprocessing
In several practical applications, the raw data available needs to be updated before it can be used by an ML model. The common problems encountered with raw data are missing data values, different ranges for different variables and the presence of outliers. We will now explain how to deal with these problems.
Missing Data: It is likely that in some domains, there could be missing data. This occurs as a consequence of the inability to measure a feature value or due to unavailability or erroneous data entry. Some ML algorithms can work even when there are a reasonable number of missing data values and, in such cases, there is no need for preprocessing. However, there are a large number of other cases where the ML models cannot handle missing values. So, there is a need to examine techniques for dealing with missing data. Different schemes are used for dealing with the prediction of missing values:
• Use the K nearest neighbors (KNNs): Locate the K nearest neighbors of the pattern, based on its remaining feature values, and predict the missing ith component as the mean of the ith components of these neighbors:
$$\frac{1}{K}\sum_{j=1}^{K} x_j(i)$$
Example 1: Consider the following set of patterns:
(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, −, 2), (1, 1, −), (6, 6, 1)
There are 6 three-dimensional pattern vectors in the set. Missing values are indicated by −. Let us see how to predict the missing value in (1, −, 2). Let us use K = 3 and find the 3 nearest neighbors (NNs) based on the remaining two feature values. The three NNs are (1, 1, 1), (1, 1, 2) and (1, 1, 3). Note that the second feature value of all these three neighbors is 1, which is the predicted value for the missing value in (1, −, 2). So, the vector becomes (1, 1, 2).
• Cluster the data and locate the nearest cluster: This approach is based on clustering the training data and locating the cluster to which
x belongs based on the remaining l − 1 components. Let x, with its ith value missing, belong to cluster Cq. Let µq be the centroid of Cq. Then the predicted value of x(i) is µq(i), the ith component of µq. We will explain
clustering in detail in a later chapter; it is sufficient for now to note that
a clustering algorithm can be used to group patterns in the training data
into K clusters where patterns in each cluster are similar to each other
and patterns belonging to different clusters are dissimilar.
Example 2: Consider the following data matrix. It has 5 data vectors in a four-dimensional space.
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
Suppose the first component of the first vector (5.1) and the second component of the fourth vector (3.1) are missing. Predicting each missing value by the mean of the remaining values of the respective feature gives the imputed matrix
4.8 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.325 1.5 0.2
5.0 3.6 1.4 0.2
We can compute the mean squared error (MSE) of the predicted values based on their deviations from the original values. The computation of the MSE may be explained as follows: Given the n true (target) values $y_1, y_2, \ldots, y_n$ and the predicted values $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n$, the MSE is defined as
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
In the above example, we have predicted two missing values based on the mean of the remaining values of the respective feature. In the first case, instead of 5.1, our estimated value is 4.8. Similarly, in the second case, for the value 3.1, our estimate is 3.325. So, the MSE here is
$$\frac{(5.1 - 4.8)^2 + (3.1 - 3.325)^2}{2} \approx 0.07$$
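A small sketch of the two imputation schemes discussed above, using scikit-learn's SimpleImputer (mean imputation) and KNNImputer, followed by the MSE over the imputed entries; the matrix is the one from Example 2, and the choice of n_neighbors=3 mirrors Example 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X_true = np.array([[5.1, 3.5, 1.4, 0.2],
                   [4.9, 3.0, 1.4, 0.2],
                   [4.7, 3.2, 1.3, 0.2],
                   [4.6, 3.1, 1.5, 0.2],
                   [5.0, 3.6, 1.4, 0.2]])

X_missing = X_true.copy()
X_missing[0, 0] = np.nan        # remove 5.1
X_missing[3, 1] = np.nan        # remove 3.1

mask = np.isnan(X_missing)      # remember where the holes are

for imputer in (SimpleImputer(strategy="mean"), KNNImputer(n_neighbors=3)):
    X_imputed = imputer.fit_transform(X_missing)
    mse = np.mean((X_true[mask] - X_imputed[mask]) ** 2)
    print(type(imputer).__name__, "MSE =", round(mse, 4))
```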
The MSE of the predictions of the regressor is shown in Fig. 1.7.
Fig. 1.7 MSE of the regressor on data imputed using KNN and mean
It is easy to observe that the regressor performs best when the whole data is
available. However, when prediction is made by removing some values and
guessing them, the performance of the regressor suffers; this is natural. Note
that between the KNN-based and the mean-based imputations, the former made
better predictions leading to smaller MSE. This is because the KNN-based
scheme is more local to the respective point x and so is more focussed.
2. Data from Different Domains: The scales of values of different features could be very different. This would bias the matching process to depend more upon features that assume larger values, toning down the contributions of
features with smaller values. So, in applications where different components of
the vectors have different domain ranges, it is possible for some components to
dominate in contributing to the distance between any pair of patterns. Consider
for example, classification of objects into one of two classes: adult or child. Let
the objects be represented by height in metres and weight in grams. Consider
an adult represented by the vector ( 1.6, 75000 ) and a child represented by the
vector ( 0.6, 5000 ), where the heights of the adult and the child in metres are
1.6 and 0.6 , respectively, and the weights of the adult and the child in grams
are 75000 and 5000 , respectively. Assume that the domain of height is [0.5, 2.5]
and the domain of weight is [2000, 200000] in this example. So, there is a large
difference in the ranges of values of these two features.
Now the Euclidean distance between the adult and child vectors given above is
$$\sqrt{(1.6 - 0.6)^2 + (75000 - 5000)^2} = \sqrt{1 + 4.9 \times 10^9} \approx 70000$$
Similarly, the cosine of the angle between the adult and child vectors is
$$\frac{0.96 + 375 \times 10^6}{\sqrt{25000000.36} \times \sqrt{5625000002.56}} \approx 1.0$$
Note that the proximity values computed between the two vectors, whether
it is the Euclidean distance or the cosine of the angle between the two vectors,
are dependent largely upon only
one of the two features, that is, weight, while the contributions of height are
negligible. This is because of the difference in the magnitudes of the ranges of
values of the two features. This example illustrates how the magnitudes/ranges of values of different features contribute differently to the overall proximity. This can be handled by scaling different components differently, and such a process of scaling is called normalization. There are two popular normalization schemes:
• Scaling using the range: On any categorical feature, the values of two patterns either match or mismatch, and the contribution to the distance is either zero (0) (match) or 1 (mismatch). If we want to be consistent, then in the case of any numerical feature also we want the contribution to be in the range [0, 1]. This is achieved by scaling the difference by the range of the values of the feature. So, if the pth component is of numerical type, its contribution to the distance between patterns xi and xj is
$$\frac{x_i(p) - x_j(p)}{\text{Range}_p},$$
where Range_p is the range of the pth feature values. Note that the value of this term is in the range [0, 1]; the value of 1 is achieved when xi(p) − xj(p) = Range_p and it is 0 (zero) if patterns xi and xj have the same value for the pth feature. Such a scaling will ensure that the contribution, to the distance, of either a categorical feature or a numerical feature will be in the range [0, 1].
• Scaling using the mean and standard deviation: Here, each feature is transformed to have zero mean and unit standard deviation. For example, let the qth components of 5 data items be 60, 80, 20, 100 and 40. Their mean is
$$\frac{60 + 80 + 20 + 100 + 40}{5} = 60$$
We get zero-mean data by subtracting this mean from each of the 5 data items to obtain 0, 20, −40, 40, −20 for their qth components. Note that this is zero-mean data as these values add up to 0. To make the standard deviation of this data 1, we divide each of the zero-mean data values by the standard deviation of the data. Note that the variance of the zero-mean data is
$$\frac{0^2 + 20^2 + (-40)^2 + 40^2 + (-20)^2}{5} = 800,$$
so the standard deviation is $\sqrt{800} \approx 28.28$, and dividing each zero-mean value by it gives data with zero mean and unit variance.
3. Outliers: An outlier is a data item that:
• Assumes values that are far away from those of the average data items
• Deviates from the normally behaving data item
• Is not connected/similar to any other object in terms of its characteristics
• Erroneous data entry: Outlying data can occur at the data entry level
itself. For example, it is very common for spelling mistakes to occur when
names are entered. Also, it is possible to enter numbers such as salary
erroneously as 2000000 instead of 200000 by typing an extra zero (0).
• Evolving systems: It is possible to encounter data items in sparse regions
during the evolution of a system. For example, it is common to encounter
isolated entities in the early days of a social network. Such isolated entities
may or may not be outliers.
• Very naturally: Instead of viewing an outlier as a noisy or unwanted data
item, it may be better to view it as something useful. For example, a novel
idea or breakthrough in a scientific discipline, a highly paid sportsperson
or an expensive car can all be natural and influential outliers.
An outlying data item can be either out-of-range or within-range. For
example, consider an organization in which the salary could be from
{10000, 150000, 225000, 300000}. In this case, an entry like 2250000 is
an out-of-range outlier that occurs possibly because of an erroneous zero (
0 ). Also, if there are only 500 people drawing 10000, 400 drawing 150000,
300 at 225000 and 175 drawing 300000 , then an entry like 270000 could
be a within-range outlier.
There are different schemes for detecting outliers. They are based on the density around points in the data. If a data point is located in a sparse region, it could be a possible outlier. It is possible to use clustering to locate such outliers; it does not matter whether the outlier is within-range or out-of-range. If the clustering output has a singleton cluster, that is, a one-element cluster, then that element could be a possible outlier.
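A minimal sketch of the singleton-cluster idea using KMeans from scikit-learn; the data values and the number of clusters are arbitrary choices made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [1.2, 1.0], [0.8, 1.0],
              [6.0, 6.0]])                # a point far from the rest

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels, counts = np.unique(km.labels_, return_counts=True)

for label, count in zip(labels, counts):
    if count == 1:                        # a singleton cluster: possible outlier
        outlier = X[km.labels_ == label][0]
        print("possible outlier:", outlier)
```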
• Performance of the model: It is well known that as the dimensionality increases, we require a larger training data set to build an ML model. There is a result, popularly called the peaking phenomenon, that shows that as the dimensionality keeps increasing, the accuracy of a classification model increases until some value, and beyond that value, the accuracy starts decreasing.
This may be attributed to the well-known concept of overfitting. The model will tend to remember the training data and fail to perform well on validation data. With a larger training data set we can afford to use a higher dimensional data set and still avoid overfitting. Even though the dimensionality of the data set in an application is large, it is possible that the number of available training vectors is small. In such cases, a popular technique used in ML is to reduce the dimensionality so that the learnt model does not overfit the available data. Well-known dimensionality reduction approaches are:
• Feature selection: Let F = {f1, f2, . . . , fL} be the set of L features. In the feature selection approach, we would like to select a subset Fl of F having l (< L) features such that Fl maximizes the performance of the ML model.
• Feature extraction: Here, from the set F of L features, a set H = {h1, h2, . . . , hl} of l (< L) features is extracted. It is possible to categorize these schemes as follows:
1. Linear feature extraction: Here, each extracted feature hj is a linear combination of the original features:
$$h_j = \sum_{i=1}^{L} \alpha_{ij} f_i$$
B can be viewed as linear combinations of the columns of A because of linear
independence.
We will examine, in detail, the concepts of eigenvalue, eigenvector and linear
independence in later chapters.
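One widely used linear feature extraction method is principal component analysis (PCA), in which each new feature is a linear combination of the original ones. The sketch below uses PCA from scikit-learn purely as an illustration of the idea; the digits data and the choice of l = 2 are arbitrary, and PCA itself is not introduced until later.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data             # n x L data matrix, L = 64 features

pca = PCA(n_components=2)          # extract l = 2 linear features
H = pca.fit_transform(X)           # n x 2 matrix of extracted features h1, h2

# Each row of components_ holds the coefficients alpha_ij of one linear combination.
print(H.shape, pca.components_.shape)    # (1797, 2) (2, 64)
```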
2. Non-linear feature extraction: Here, we represent the data using H = {h1, . . . , hl}, such that
$$h_i = t(f_1, f_2, \ldots, f_L)$$
where t is a non-linear function of the features. For example, if F = {f1 , f2 },
then h1 = af1 + bf2 + cf1 f2 is one such non-linear combination; it is non-linear
because we have a term of the form f1 f2 in h1 .
Autoencoder is a popular, state-of-the-art, non-linear feature extraction tool.
Here, a neural network which has an encoder and a decoder is used. The middle
layer has l neurons so that the l outputs from the middle layer give an l(< L)-
dimensional representation of the L-dimensional pattern that is input to the
autoencoder. Note that the encoder encodes or represents the L-dimensional
pattern in the l-dimensional space while the decoder decodes or converts the
l-dimensional pattern into the L-dimensional space. Note that it is called au-
toencoder because the same L-dimensional pattern is associated with the input
and output layers.
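A minimal autoencoder sketch, assuming PyTorch is available; the sizes L = 64 and l = 8, the random placeholder data and the training settings are illustrative choices, not settings prescribed by the text.

```python
import torch
import torch.nn as nn

L, l = 64, 8                        # input dimensionality and bottleneck size

# Encoder maps L -> l; decoder maps l -> L; the same pattern is input and target.
autoencoder = nn.Sequential(
    nn.Linear(L, l), nn.ReLU(),     # encoder
    nn.Linear(l, L)                 # decoder
)

X = torch.rand(100, L)              # 100 random L-dimensional patterns (placeholder data)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(X), X)   # reconstruct the input itself
    loss.backward()
    optimizer.step()

# The l-dimensional representation is the output of the middle (encoder) layer.
with torch.no_grad():
    codes = autoencoder[:2](X)      # apply only Linear(L, l) + ReLU
print(codes.shape)                  # torch.Size([100, 8])
```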
1.5.6 Model Evaluation
This step is also called model validation. This step requires specifically ear-
marked data called validation data. It is possible that the ML model works well
on the training data; then we say
that the model is well trained. However, it may not work well on the validation
data. In such a case, we say that the ML model overfits the training data. In order to overcome overfitting, we typically use the validation data to tune the
ML model so that it works well on both the training and validation data sets.
search in finding a solution. In information theory, we search for purity (low entropy). So, several activities including optimization, inference and matrix factorization that are essential for ML are all based on search. Learning itself is search. We will examine how search aids learning for each ML model in the respective chapters.
1.8 Data Sets Used
In this book, we make use of two data sets to conduct experiments and present
results in various chapters. These are:
• All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).
• Each image is of size 64 × 64, that is, 4096 pixels.
• It is available on the scikit-learn platform.
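This description matches scikit-learn's Olivetti faces data set; assuming that is the data set intended, it can be loaded as below (fetch_olivetti_faces downloads and caches the data on first use).

```python
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()       # downloaded and cached on the first call
print(faces.images.shape)            # (400, 64, 64): 64 x 64 grey-scale images
print(faces.data.shape)              # (400, 4096): each image flattened to 4096 features
print(faces.target.shape)            # (400,): one label per image (40 subjects)
```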
4. Wisconsin Breast Cancer Data Set:
• There are two classes, Benign and Malignant. The number of Benign class patterns is 357 and the number of Malignant class patterns is 212.
• It is available on the scikit-learn platform.
• For more details, visit
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
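A short sketch loading this data set from scikit-learn; load_breast_cancer returns the 569 patterns with 30 numerical features and the Benign/Malignant labels.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.data.shape)                 # (569, 30): 569 patterns, 30 numerical features
print(list(data.target_names))         # ['malignant', 'benign']
print((data.target == 1).sum(),        # 357 Benign patterns
      (data.target == 0).sum())        # 212 Malignant patterns
```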
• Data Sets for Regression
• This data set provides monthly totals of US airline passengers from 1949
to 1960.
• This data set is taken from an inbuilt data set of Kaggle called AirPas-
sengers.
• For more details, visit
https://www.kaggle.com/datasets/chirag19/air-passengers
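Assuming the CSV from the Kaggle page above has been downloaded locally as AirPassengers.csv, a minimal pandas sketch for loading it might look like the following; the column names 'Month' and '#Passengers' are assumptions based on the commonly distributed file and should be checked against the downloaded copy.

```python
import pandas as pd

# Assumes the file has been downloaded from the Kaggle link above.
# Column names ('Month', '#Passengers') are assumptions; verify against the actual CSV.
df = pd.read_csv("AirPassengers.csv", parse_dates=["Month"])
series = df.set_index("Month")["#Passengers"]

print(series.head())        # monthly totals of US airline passengers, starting 1949
print(len(series))          # 144 months: 1949 to 1960
```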
Summary
Machine learning (ML) is an important topic and has affected research practices in both science and engineering significantly. The important steps in building an ML system are:
Exercises
1. You are given that 9 × 17 is 153 and 4 × 17 is 68. From this data, you need to learn the value of 13 × 17. Which learning paradigm is useful here? Specify any assumptions you need to make.
2. You are given the following statements:
a. The sum of two even numbers is even.
b. 12 is an even number.
c. 22 is an even number.
e. Feature that takes values from {short, medium height, tall}
6. Let xi and xj be two l-dimensional unit norm vectors; that is, ∥xi ∥ = 1 and
∥xj ∥ = 1. Derive a relation between the Euclidean distance d (xi , xj ) and cosine
of the angle between xi and xj .
7. Consider the data set:
(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 2), (1, 1, −), (6, 6, 10)