
CHAPTER 1

Introduction to Machine Learning


Learning Objectives
At the end of this chapter, you will be able to:

• Give a brief overview of machine learning (ML)


• Describe the learning paradigms used in ML
• Explain the important steps in ML, including data acquisition, feature engineering, model selection, model learning, model validation, model explanation, representation, and search

1.1 Evolution of Machine Learning


Machine learning (ML) is the process of learning a model that can be used in
prediction based on data. Prediction involves assigning a data item to one of
the classes or associating the data item with a number. The former activity is
classification while the latter is regression. ML is an important and state-of-
the-art topic. It gained prominence because of the improved processing speed
and storage space of computers and the availability of large data sets for exper-
imentation. Deep learning (DL) is an offshoot of ML. In fact, perceptron was
the earliest popular ML tool and it forms the basic building block of various DL
architectures, including multi-layer perceptron networks, convolutional neural
networks (CNNs) and recurrent neural networks (RNNs).
In the early days of artificial intelligence (AI), it was opined that mathe-
matical logic was the ideal vehicle for building AI systems. Some of the initial
contributions in this area like the General Problem Solver (GPS), Automatic
Theorem Proving (ATP), rule-based systems and programming languages like
Prolog and Lisp (lambda calculus-based) were all outcomes of this view. Various
problem solving and game playing solutions also had this flavour. During the
twentieth century, a majority of prominent AI researchers were of the view that
logic is AI and AI is logic. Most of the reasoning systems were developed based
on this view. Further, the role of artificial neural networks in solving complex
real-world AI problems was under-appreciated.
However, this view was challenged in the early twenty-first century and the
current view is that AI is deep learning and deep learning is AI. The advent
of efficient graphics processing units (GPUs), platforms like TensorFlow and
PyTorch along with the demonstrated success stories of convolutional neural
networks, gated recurrent units and generative models based on neural networks
have impacted every aspect of science and engineering activities across the globe.
So, ML along with DL has become a state-of-the-art subject. Artificial neural networks form the backbone of DL.

A high-level view of AI is shown in Fig. 1.1. The tasks related to conventional AI and ML are shown separately. Here, ML may be viewed as dealing with more than just pattern recognition (PR) tasks. Classification and clustering are the typical tasks of a PR system. However, ML deals with regression problems also. Data mining deals with the efficient organization of large volumes of data.

Fig. 1.1 A high-level view of AI


The typical background topics of AI are shown in Fig. 1.2.

Fig. 1.2 Background topics of AI


Note that data structures and algorithms are basic to both conventional and current systems. Logic and discrete structures played an important role in the analysis and synthesis of conventional AI systems. The importance of the other background topics may be summarized as follows:

• In ML, we deal with vectors and vector spaces, and these topics are better appreciated through linear algebra. The data input to an ML system may be viewed as a matrix, popularly called the data matrix. If there are n data items, each represented as an l-dimensional vector, the corresponding data matrix A is of size n × l. Linear algebra is useful in analysing the weights associated with the edges in a neural network. Matrix multiplication and eigen analysis are important in initializing the weights of the neural network and in weight updates. It can also help in weight normalization. The whole activity of clustering may be viewed as data matrix factorization. (A small illustrative sketch of the data matrix appears at the end of this section.)
• The role of probability and statistics needs no elaboration, as ML is, in fact, statistical ML. These topics help in estimating the distributions underlying the data. Further, they play a crucial role in analysis and inference in ML.
• Optimization (along with calculus) is essential in training neural networks, where gradients and their computations are important. Gradient descent-based optimization is an essential ingredient of any DL system.
• Information theoretic concepts like entropy, mutual information and Kullback-Leibler divergence are essential to understand topics such as decision tree classifiers, feature selection and deep neural networks.
We will provide details of all these background topics in their respective
chapters.
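To make the data matrix concrete, here is a minimal sketch in Python (NumPy assumed; all numbers are made up for illustration) of a small n × l data matrix and a couple of the linear algebra operations mentioned above.

```python
import numpy as np

# A toy data matrix A with n = 4 data items, each an l = 3 dimensional vector.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 0.0],
              [4.0, 5.0, 6.0],
              [0.0, 1.0, 1.0]])

n, l = A.shape                                   # n = 4 items, l = 3 features
mean_vector = A.mean(axis=0)                     # column-wise mean (used for centroids, standardization)
cov = np.cov(A, rowvar=False)                    # l x l covariance matrix of the features
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigen analysis of the symmetric covariance matrix
print(n, l, mean_vector, eigenvalues)
```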

1.2 Paradigms for ML


There are different ways or paradigms for ML, such as learning by rote, learning
by deduction, learning by abduction, learning by induction and reinforcement
learning. We shall look at each of these in detail.

1.2.1 Learning by Rote


This involves memorization in an effective manner. It is a form of learning that is
popular in elementary schools where the alphabet and numbers are memorized.
Memorizing simple addition and multiplication tables is also an example of rote
learning. In the case of data caching, we store computed values so that we do not
have to recompute them later. Caching is implemented by search engines and it
may be viewed as another popular scheme of rote learning. When computation
is more expensive than recall, this strategy can save a significant amount of time.
Chess masters spend a lot of time memorizing the great games of the past. It
is this rote learning that teaches them how to ’think’ in chess. Various board
positions and their potential to reach the winning configuration are exploited in
games like chess and checkers.

1.2.2 Learning by Deduction
Deductive learning deals with the exploitation of deductions made earlier. This
type of learning is based on reasoning that is truth preserving. Given A, and
if A then B(A → B), we can deduce B. We can use B along with if B then
C(B → C) to deduce C. Note that whenever A and A → B are True, then B is
True, ratifying the truth preserving nature of learning by deduction. Consider
the following statements:

1. It is raining.

2. If it rains, the roads get wet.


3. If a road is wet, it is slippery.

From (1) and (2), we can infer using deduction that (4) the roads are wet. This deduction can then be used with (3) to deduce or learn that (5) the roads are slippery. Here, if statements (1), (2) and (3) are True, then statements (4) and (5) are automatically True.
A digital computer is primarily a deductive engine and is ideally suited for this form of learning. Deductive learning is applied in well-defined domains like game playing, including in chess.
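The truth-preserving chaining described above can be sketched in a few lines of Python. The rule representation below is a deliberate simplification for illustration, not a full inference engine.

```python
# Facts and if-then rules mirroring the rain example above.
facts = {"it_is_raining"}
rules = [
    ("it_is_raining", "roads_are_wet"),       # if it rains, the roads get wet
    ("roads_are_wet", "roads_are_slippery"),  # if a road is wet, it is slippery
]

# Repeatedly apply modus ponens until no new fact can be deduced.
changed = True
while changed:
    changed = False
    for antecedent, consequent in rules:
        if antecedent in facts and consequent not in facts:
            facts.add(consequent)
            changed = True

print(facts)  # {'it_is_raining', 'roads_are_wet', 'roads_are_slippery'}
```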

1.2.3 Learning by Abduction


Here, we infer A from B and (A → B). Notice that this is not truth preserving like in deduction, as both B and (A → B) can be True and A can be False.
Consider the following inference:

1. An aeroplane is a flying object (aeroplane → flying object).

2. A is a flying object.

From (1) and (2), we infer using abduction that A is an aeroplane. This kind of reasoning may lead to incorrect conclusions. For example, A could be a bird or a kite.

1.2.4 Learning by Induction


This is the most popular and effective form of ML. Here, learning is achieved
with the help of examples or observations. It may be categorized as follows:

• Learning from Examples: Here, it is assumed that a collection of labelled examples is provided and the ML system uses these examples to make a prediction on a new data pattern. In supervised classification or learning from examples, we deal with two ML problems: classification and regression.

1. Classification: Consider the handwritten digits shown in Fig. 1.3. Here, each row has 15 examples of each of the digits. The problem is to learn an ML model using such data to classify a new data pattern. This is also called supervised learning as the model is learnt with the help of such exemplar data. It may be provided by an expert in several practical situations. For example, a medical doctor may provide examples of normal patients and patients infected by COVID-19 based on some test results. In the case of handwritten digits, we have 10 class labels, one class label corresponding to each of the digits from 0 to 9. In classification, we would like to assign an appropriate class label from these labels to a new pattern.
2. Regression: Contrary to classification, there are several prediction applications where the labels come from a possibly infinite set. For example, the share value of a stock could be a positive real number. The stock may have different values at a particular time and each of these values is a real number. This is a typical regression or curve fitting problem. The practical need here is to predict the share value of a stock at a future time instance based on past data in the form of examples.

• Learning from Observations: Observations are also instances like examples but they are different because observations need not be labelled. In this case, we cluster or group the observations into a smaller number of groups. Such grouping is performed with the help of a clustering algorithm that assigns similar patterns to the same group/cluster.

Fig. 1.3 Examples of handwritten digits labelled 0 to 9

Each cluster
could be represented by its centroid or mean. Let x1, x2, . . . , xp be p elements of a cluster. Then the centroid of the cluster is defined by

(1/p) ∑_{i=1}^{p} x_i
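As a quick illustration, the centroid computation can be written directly from the definition (a minimal NumPy sketch with toy three-dimensional observations):

```python
import numpy as np

# A cluster with p = 3 three-dimensional observations (toy values).
cluster = np.array([[1.0, 1.0, 1.0],
                    [1.0, 2.0, 3.0],
                    [1.0, 3.0, 2.0]])

centroid = cluster.mean(axis=0)   # (1/p) * (x_1 + x_2 + ... + x_p)
print(centroid)                   # [1. 2. 2.]
```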

Let us consider the handwritten digit data set of 3 classes: 0, 1 and 3. By using the class labels and clustering the patterns of each class separately into 3 clusters, we obtain the 9 centroids shown in Fig. 1.4.

Fig. 1.4 Cluster centroids using the class labels in clustering
However, when we cluster the entire data of digits 0,1 and 3 into 9 clusters, we
obtain the centroids shown in Fig. 1.5. So, the clusters and their representatives
could differ based on how we exploit the class labels.

Fig. 1.5 Cluster centroids without using the class labels in clustering

1.2.5 Reinforcement Learning


In supervised learning, the ML model is learnt in such a way as to maximize a performance measure like prediction accuracy. In the case of reinforcement learning, an agent learns an optimal policy to optimize some reward function. The learnt policy helps the agent in taking an action based on the current configuration or state of the problem. Robot path planning is a typical application of reinforcement learning.

1.3 Types of Data


In this book, we primarily deal with inductive learning as it is the most popular paradigm for ML. It is important to observe that in both supervised learning and learning from observations, we deal with data. In general, data can be categorical or numerical.

• Categorical: This type of data can be nominal or ordinal. In the case of nominal data, there is no order among the elements of the domain. For example, for colour of hair, the domain could be {brown, black, red}. This data is of categorical type and the elements of the domain are not ordered. On the contrary, in ordinal data, there is an order among the values of the domain. For example, the domain of the variable employee number could be {1, 2, . . . , 1011} if there are 1011 employees in an organization. Here, an ordering among the elements of the domain is observed, indicating that senior employees have smaller employee numbers compared to junior employees; the most senior employee will have employee number 1.
• Numerical: In the case of numerical data, the domain of values of the data type could be a set/subset of integers or a set/subset of real numbers. For example, in Table 1.1, a subset of the features used by the Wisconsin Breast Cancer data is shown. The domain of Diagnosis, the class label, is a binary set with values Malignant and Benign. The domain of ID Number is a subset of integers in the range [8670, 917897] and the domain of Area_Mean is a collection of floating point numbers (interval) in the range [143.5, 2501]. It is possible to have binary values in the domain for categorical or numerical data. For example, the domain of Status could be {Pass, Fail} and this variable is nominal; an example of a binary ordinal type is {short, tall} for humans based on their height. A very popular binary numerical type is {0, 1}.

Table 1.1 Different types of data from the Wisconsin Breast Cancer database

Feature Number   Attribute          Type of Data   Domain
1                Diagnosis          Nominal        {Malignant, Benign}
2                ID Number          Ordinal        [8670, 917897]
3                Perimeter_Mean     Numerical      [43.79, 188.5]
4                Area_Mean          Numerical      [143.5, 2501]
5                Smoothness_Mean    Numerical      [0.05263, 0.1634]

Also, in the classification context, the class label data can have the domain {−1, +1}, where −1 stands for the label of the negative class and +1 stands for the label of the positive class.
Typically, each pattern or data item is represented as a vector of feature values. For example, a data item corresponding to a patient with ID 92751 is represented by a five-dimensional vector (Benign, 92751, 47.92, 181, 0.05263), where each component of the vector represents the corresponding feature shown in Table 1.1. Benign is the value of feature 1, Diagnosis; similarly, the third entry 47.92 corresponds to feature 3, that is, Perimeter_Mean, and so on. Note that Diagnosis is a nominal feature and ID Number is an ordinal attribute. The remaining three features are numerical. Here, Diagnosis, or the class label, is a dependent feature and the remaining four features are independent features. Given a collection of such data items or patterns in the form of five-dimensional vectors, the ML system learns an association or mapping between the independent features and the dependent feature.

1.4 Matching
Matching is an important activity in ML. It is used in both supervised learning and in learning from observations. Matching is carried out by using a proximity measure which can be a distance/dissimilarity measure or a similarity measure. Two data items, u and v, represented as l-dimensional vectors, match better when the distance between them is smaller or when the similarity between them is larger.
A popular distance measure is the Euclidean distance and a popular similarity measure is the cosine of the angle between the vectors. The Euclidean distance is given by

d(u, v) = √( ∑_{i=1}^{l} (u(i) − v(i))² )

The cosine similarity is given by

cos(u, v) = uᵀv / (∥u∥ ∥v∥)

where uᵀv is the dot product between vectors u and v, and ∥u∥ is the Euclidean distance between u and the origin; it is also called the Euclidean norm.
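Both proximity measures can be sketched directly from these definitions (NumPy assumed; the vectors are arbitrary toy values):

```python
import numpy as np

def euclidean_distance(u, v):
    """d(u, v) = sqrt(sum over i of (u(i) - v(i))^2)."""
    return np.sqrt(np.sum((u - v) ** 2))

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| ||v||)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 2.0, 1.0])
print(euclidean_distance(u, v))   # 1.414...
print(cosine_similarity(u, v))    # 0.888...
```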
Some of the important applications of matching in ML are in:

• Finding the Nearest Neighbor of a Pattern: Let x be an l-dimensional pattern vector. Let X = {x1, x2, . . . , xn} be a collection of n data vectors. The nearest neighbor of x from X, denoted by NN_x(X), is xj if

d (x, xj ) ≤ d (x, xi ) , ∀xi ∈ X


This is an approximate search where a pattern that best matches x is ob-
tained. If there is a tie, that is, when both xp ∈ X and xq ∈ X are the nearest
neighbors of x, we can break the tie arbitrarily or choose either of the two to
be the nearest neighbor of x. This step is useful in classification and will be
discussed in the next chapter.

• Assigning to a Set with the Nearest Representative: Let C1, C2, . . . , CK be K sets with x1, x2, . . . , xK as their respective representatives. A pattern x is assigned to Ci if

d(x, xi) ≤ d(x, xj), for all j ∈ {1, 2, . . . , K}

This idea is useful in clustering or learning from observations, where Ci is the ith group or cluster of patterns or observations and xi is the representative of Ci. The centroid of the data vectors in Ci is a popularly used representative, xi, of Ci. Both matching-based steps are sketched below.
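A minimal sketch of both steps, written directly from the definitions above (NumPy assumed; the data values are illustrative):

```python
import numpy as np

def nearest_neighbour(x, X):
    """Return the vector in X closest to x in Euclidean distance, i.e. NN_x(X)."""
    distances = np.linalg.norm(X - x, axis=1)
    return X[np.argmin(distances)]            # ties are broken by the first minimum

def assign_to_nearest_representative(x, representatives):
    """Return the index i of the set C_i whose representative is closest to x."""
    distances = np.linalg.norm(representatives - x, axis=1)
    return int(np.argmin(distances))

X = np.array([[1.0, 1.0], [4.0, 4.0], [0.0, 2.0]])
reps = np.array([[1.0, 2.0], [5.0, 5.0]])
x = np.array([1.5, 1.5])
print(nearest_neighbour(x, X))                    # [1. 1.]
print(assign_to_nearest_representative(x, reps))  # 0
```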

1.5 Stages in Machine Learning


Building a machine learning system involves a number of steps, as illustrated in Fig. 1.6. Note the emphasis on data in the form of training, validation and test data.

Fig. 1.6 Important steps in a practical machine learning system

Typically, the available data is split into training, validation and test data. Training data is used in model learning or training, and validation data is used to tune the ML model. Test data is used to examine how the learnt model is performing. We now describe the components of an ML system.
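A minimal sketch of such a three-way split, assuming scikit-learn is available; the 60/20/20 proportions and the random toy data are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labelled data: 100 two-dimensional patterns with binary labels.
X = np.random.rand(100, 2)
y = np.random.randint(0, 2, size=100)

# First carve out the test data, then split the remainder into training and validation data.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20
```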

1.5.1 Data Acquisition


This depends on the domain of the application. For example, to distinguish between adults and children, measurements of their height or weight are adequate; however, to distinguish between normal and COVID-19-infected humans, their body temperature and chest congestion may be more important than their height or weight. Typically, data collection is carried out before feature engineering.

1.5.2 Feature Engineering


This step involves a combination of data preprocessing and data representation.

Data Preprocessing
In several practical applications, the raw data available needs to be updated before it can be used by an ML model. The common problems encountered with raw data are missing values, different ranges for different variables and the presence of outliers. We will now explain how to deal with these problems.
Missing Data: It is likely that in some domains, there could be missing data. This occurs as a consequence of the inability to measure a feature value or due to unavailability or erroneous data entry. Some ML algorithms can work even when there are a reasonable number of missing data values and, in such cases, there is no need for preprocessing. However, there are a large number of other cases where the ML models cannot handle missing values. So, there is a need to examine techniques for dealing with missing data. Different schemes are used for dealing with the prediction of missing values:

• Use the nearest neighbor: Let x be an l-dimensional data vector that has its ith component x(i) missing. Let X = {x1, x2, . . . , xn} be the set of n training pattern vectors. Let xj ∈ X be the nearest neighbor of x based on the remaining l − 1 (excluding the ith) components. Predict the value of x(i) to be xj(i); that is, if the ith component x(i) of x is missing, use the ith component of xj = NN_x(X) instead.
• Use a larger neighborhood: Use the k-nearest neighbors (KNNs) of x to predict the missing value x(i). Let the KNNs of x, using the remaining l − 1 components, from X be x1, x2, . . . , xK. Now the predicted value of x(i) is the average of the ith components of these KNNs. That is, the predicted value of x(i) is

(1/K) ∑_{j=1}^{K} xj(i)

Example 1: Consider the set of data vectors

(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, −, 2), (1, 1, −), (6, 6, 1)

There are 6 three-dimensional pattern vectors in the set. Missing values are indicated by '−'. Let us see how to predict the missing value in (1, −, 2). Let us use K = 3 and find the 3 nearest neighbors (NNs) based on the remaining two feature values. The three NNs are (1, 1, 1), (1, 1, 2) and (1, 1, 3). Note that the second feature value of all three of these neighbors is 1, which is the predicted value for the missing value in (1, −, 2). So, the vector becomes (1, 1, 2).

• Cluster the data and locate the nearest cluster: This approach is based on clustering the training data and locating the cluster to which x belongs based on the remaining l − 1 components. Let x with its ith value missing belong to cluster Cq. Let µq be the centroid of Cq. Then the predicted value of x(i) is µq(i), the ith component of µq. We will explain clustering in detail in a later chapter; it is sufficient for now to note that a clustering algorithm can be used to group patterns in the training data into K clusters, where patterns in each cluster are similar to each other and patterns belonging to different clusters are dissimilar.

Example 2: Consider the following data matrix. It has 5 data vectors in a four-dimensional space.

    5.1   3.5    1.4   0.2
    4.9   3.0    1.4   0.2
    4.7   3.2    1.3   0.2
    4.6   3.1    1.5   0.2
    5.0   3.6    1.4   0.2

Suppose the first value of the first vector (5.1) and the second value of the fourth vector (3.1) are missing. Imputing each missing value with the mean of the remaining values of the respective feature gives:

    4.8   3.5    1.4   0.2
    4.9   3.0    1.4   0.2
    4.7   3.2    1.3   0.2
    4.6   3.325  1.5   0.2
    5.0   3.6    1.4   0.2
We can compute the mean squared error (MSE) of the predicted values based on their deviations from the original values. The computation of MSE may be explained as follows: Given the n true (target) values y1, y2, . . . , yn and the predicted values ŷ1, ŷ2, . . . , ŷn, the MSE is defined as

MSE = (1/n) ∑_{i=1}^{n} (yi − ŷi)²
In the above example, we have predicted two missing values based on the mean of the remaining values of the respective feature. In the first case, instead of 5.1, our estimated value is 4.8. Similarly, in the second case, for the value 3.1, our estimate is 3.325. So, the MSE here is

((5.1 − 4.8)² + (3.1 − 3.325)²) / 2 = (0.09 + 0.050625) / 2 ≈ 0.07
Example 3: Consider three clusters of points and their centroids:
a. Cluster 1: {(1, 1, 1), (1, 2, 3), (1, 3, 2)}, Centroid 1: (1, 2, 2)
b. Cluster 2: {(3, 4, 3), (3, 5, 3), (3, 3, 3)}, Centroid 2: (3, 4, 3)
c. Cluster 3: {(6, 6, 6), (6, 8, 6), (6, 7, 6)}, Centroid 3: (6, 7, 6)
Consider a pattern vector with a missing value given by (1, −, 2). Its nearest centroid among the three centroids, based on the remaining two features, is Centroid 1, (1, 2, 2). So, the missing value in the second location is predicted to be the second component of Centroid 1, that is, 2. So, the pattern with the missing value becomes (1, 2, 2).
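The KNN-based and cluster-centroid schemes just described can be sketched as follows; the helper function name is our own, and the toy vectors are those of Examples 1 and 3:

```python
import numpy as np

def knn_impute(x, X, missing_idx, k=3):
    """Predict the missing component of x as the average of that component
    over the k nearest neighbours, matching on the remaining components."""
    keep = [i for i in range(len(x)) if i != missing_idx]
    distances = np.linalg.norm(X[:, keep] - x[keep], axis=1)
    nearest = X[np.argsort(distances)[:k]]
    return nearest[:, missing_idx].mean()

# Example 1: impute the second component of (1, -, 2) from the complete vectors.
X = np.array([[1, 1, 1], [1, 1, 2], [1, 1, 3], [6, 6, 1]], dtype=float)
x = np.array([1.0, np.nan, 2.0])
print(knn_impute(x, X, missing_idx=1))     # 1.0, so the vector becomes (1, 1, 2)

# Example 3: pick the nearest centroid on the available components and copy its value.
centroids = np.array([[1, 2, 2], [3, 4, 3], [6, 7, 6]], dtype=float)
keep = [0, 2]
nearest = centroids[np.argmin(np.linalg.norm(centroids[:, keep] - x[keep], axis=1))]
print(nearest[1])                          # 2.0, so the pattern becomes (1, 2, 2)
```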
We will now illustrate how the KNN-based and mean-based schemes work on a bigger data set. We consider the 20,640 patterns of the California Housing data set. It has 8 features and the target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). This is a regression problem. Some values in the data set are removed to create missing values. The missing values are imputed using the KNN-based scheme and the mean-based scheme. A regressor (a function to predict the target) is then used on the whole data set without missing values, on the KNN-based imputed data set and on the mean-based imputed data set. The resulting mean squared error of the
predictions of the regressor is shown in Fig. 1.7.

Fig. 1.7 MSE of the regressor on data imputed using KNN and mean
It is easy to observe that the regressor performs best when the whole data is available. However, when prediction is made by removing some values and guessing them, the performance of the regressor suffers; this is natural. Note that between the KNN-based and the mean-based imputations, the former made better predictions, leading to a smaller MSE. This is because the KNN-based scheme is more local to the respective point x and so is more focussed.
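An experiment in the spirit of Fig. 1.7 can be approximated with scikit-learn. The sketch below is only indicative: the fraction of removed values, the choice of a linear regressor and the evaluation on the full data set are our assumptions, not necessarily the exact setup behind the figure.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)   # 20,640 patterns, 8 features

# Remove 5% of the entries at random to create missing values (illustrative fraction).
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.05] = np.nan

def mse_after_imputation(imputer):
    X_imputed = imputer.fit_transform(X_missing)
    predictions = LinearRegression().fit(X_imputed, y).predict(X_imputed)
    return mean_squared_error(y, predictions)

print("full data   :", mean_squared_error(y, LinearRegression().fit(X, y).predict(X)))
print("KNN imputed :", mse_after_imputation(KNNImputer(n_neighbors=5)))
print("mean imputed:", mse_after_imputation(SimpleImputer(strategy="mean")))
```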
Data from Different Domains: The scales of values of different features could be very different. This would bias the matching process to depend more upon features that assume larger values, toning down the contributions of features with smaller values. So, in applications where different components of the vectors have different domain ranges, it is possible for some components to dominate in contributing to the distance between any pair of patterns. Consider,
dominate in contributing to the distance between any pair of patterns. Consider
for example, classification of objects into one of two classes: adult or child. Let
the objects be represented by height in metres and weight in grams. Consider
an adult represented by the vector ( 1.6, 75000 ) and a child represented by the
vector ( 0.6, 5000 ), where the heights of the adult and the child in metres are
1.6 and 0.6 , respectively, and the weights of the adult and the child in grams
are 75000 and 5000 , respectively. Assume that the domain of height is [0.5, 2.5]
and the domain of weight is [2000, 200000] in this example. So, there is a large
difference in the ranges of values of these two features.
Now the Euclidean distance between the adult and child vectors given above is

√((1.6 − 0.6)² + (75000 − 5000)²) = √(1 + 4.9 × 10⁹) ≈ 70000
Similarly, the cosine of the angle between the adult and child vectors is

(0.96 + 375 × 10⁶) / (√25000000.36 × √5625000002.56) ≈ 1.0
Note that the proximity values computed between the two vectors, whether it is the Euclidean distance or the cosine of the angle between the two vectors, depend largely upon only one of the two features, that is, weight, while the contributions of height are negligible. This is because of the difference in the magnitudes of the ranges of values of the two features. This example illustrates how the magnitudes/ranges of values of different features contribute differently to the overall proximity. This can be handled by scaling different components differently and such a process of scaling is called normalization. There are two popular normalization schemes:

• Scaling using the range: On any categorical feature, the values of two patterns either match or mismatch and the contribution to the distance is either zero (0) (match) or 1 (mismatch). If we want to be consistent, then in the case of any numerical feature also we want the contribution to be in the range [0, 1]. This is achieved by scaling the difference by the range of the values of the feature. So, if the pth component is of numerical type, its contribution to the distance between patterns xi and xj is

|xi(p) − xj(p)| / Range_p

where Range_p is the range of the pth feature values. Note that the value of this term is in the range [0, 1]; the value of 1 is achieved when |xi(p) − xj(p)| = Range_p and it is 0 (zero) when patterns xi and xj have the same value for the pth feature. Such a scaling will ensure that the contribution, to the distance, of either a categorical feature or a numerical feature will be in the range [0, 1].

• Standardization: Here, the data is normalized so that it will have 0 (zero)


mean and unit variance. This is motivated by the property of standard
normal distribution, which is characterized by zero mean and unit vari-
ance.
Example 4: Let there be 5 l-dimensional data vectors and let the qth components of the 5 vectors be 60, 80, 20, 100 and 40. The mean of this collection is

(60 + 80 + 20 + 100 + 40) / 5 = 60

We get zero-mean data by subtracting this mean from each of the 5 data items to obtain 0, 20, −40, 40, −20 for their qth components. Note that this is zero-mean data as these values add up to 0. To make the standard deviation
of this data equal to 1, we divide each of the zero-mean data values by the standard deviation of the data. Note that the variance of the zero-mean data is

(0² + 20² + (−40)² + 40² + (−20)²) / 5 = 800

and the standard deviation is 28.284. So, the scaled data is 0, 0.707, −1.414, 1.414, −0.707. Note that this data, corresponding to the qth feature value of the 5 vectors, has zero mean and unit variance. A small sketch of both normalization schemes follows.
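Both normalization schemes can be sketched in a few lines (NumPy assumed); the values are the adult/child example and the qth-component data of Example 4:

```python
import numpy as np

# Scaling using the range, applied to the adult/child example.
adult, child = np.array([1.6, 75000.0]), np.array([0.6, 5000.0])
ranges = np.array([2.5 - 0.5, 200000.0 - 2000.0])         # feature ranges assumed in the text

scaled_contributions = np.abs(adult - child) / ranges      # each contribution now lies in [0, 1]
print(scaled_contributions)                                 # [0.5   0.3535...]

# Standardization, applied to the qth components of Example 4.
q = np.array([60.0, 80.0, 20.0, 100.0, 40.0])
standardized = (q - q.mean()) / q.std()                     # zero mean, unit variance
print(standardized)                                          # [ 0.     0.707 -1.414  1.414 -0.707]
```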
Outliers in the Data: An outlier is a data item that is either noisy or erroneous. Faulty measuring instruments or erroneous data recordings are responsible for the presence of such outliers. A common problem across various applications is the presence of outliers. A data item is usually called an outlier if it

• Assumes values that are far away from those of the average data items
• Deviates from the normally behaving data item
• Is not connected/similar to any other object in terms of its characteristics

Outliers can occur because of different reasons:

• Noisy measurements: The measuring instruments may malfunction and


may lead to recording of noisy data. It is possible that the recorded value
lies outside the domain of the data type.

• Erroneous data entry: Outlying data can occur at the data entry level
itself. For example, it is very common for spelling mistakes to occur when
names are entered. Also, it is possible to enter numbers such as salary
erroneously as 2000000 instead of 200000 by typing an extra zero (0).
• Evolving systems: It is possible to encounter data items in sparse regions
during the evolution of a system. For example, it is common to encounter
isolated entities in the early days of a social network. Such isolated entities
may or may not be outliers.
• Very naturally: Instead of viewing an outlier as a noisy or unwanted data
item, it may be better to view it as something useful. For example, a novel
idea or breakthrough in a scientific discipline, a highly paid sportsperson
or an expensive car can all be natural and influential outliers.
An outlying data item can be either out-of-range or within-range. For example, consider an organization in which the salary could be from {10000, 150000, 225000, 300000}. In this case, an entry like 2250000 is an out-of-range outlier that occurs possibly because of an erroneous extra zero (0). Also, if there are only 500 people drawing 10000, 400 drawing 150000, 300 at 225000 and 175 drawing 300000, then an entry like 270000 could be a within-range outlier.

There are different schemes for detecting outliers. They are based on the density around points in the data. If a data point is located in a sparse region, it could be a possible outlier. It is possible to use clustering to locate such outliers; it does not matter whether the outlier is within-range or out-of-range. If the clustering output has a singleton cluster, that is, a one-element cluster, then that element could be a possible outlier.
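A minimal sketch of this clustering-based idea with scikit-learn's KMeans (the toy data and the number of clusters are illustrative): points that end up in singleton clusters are flagged as possible outliers.

```python
import numpy as np
from sklearn.cluster import KMeans

# Twenty points near the origin plus one isolated point far away.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)), [[25.0, 25.0]]])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
counts = np.bincount(labels)
singleton_clusters = np.where(counts == 1)[0]
possible_outliers = X[np.isin(labels, singleton_clusters)]
print(possible_outliers)    # expected to contain the isolated point (25, 25)
```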

1.5.3 Data Representation


Representation is an important step in building ML models. This subsection introduces how data items are represented. It also discusses the importance of representation in ML. In the process, it deals with both feature selection and feature extraction and introduces different categories of dimensionality reduction.
It is often stated in DL literature that feature engineering is important in ML, but not in DL because DL systems have automatic representation learning capability. This is a highly debatable issue. It is possible that, in some application domains, DL systems can avoid the representation step explicitly. However, preprocessing, including handling missing data and eliminating outliers, is still an important part of any DL system. Even though representation is not explicit, it is implicitly handled in DL by choosing the appropriate number of layers and the number of neurons in each layer of the neural network.

Representation of Data Items


The most active and state-of-the-art paradigm for ML is statistical machine learning. Here, each data item is represented as a vector. Typically, we consider addition of vectors in computing the mean or centroid of a collection of vectors, multiplication of a vector by a scalar in dealing with operations on matrices, and the dot product between a pair of vectors for computing similarity as important operations on the set of vectors. In most practical applications, the dimensionality of the data, or correspondingly the size of the vectors representing the data items, L, can be very large. For example, there are around 468 billion Google Ngrams. In this case, the dimensionality of the vectors is the vocabulary size or the number of Ngrams; so, the dimensionality could be very large. Such high-dimensional data is common in bioinformatics, information retrieval, satellite imagery, and so on. So, representation is an important component of any ML system. An arbitrary representation may also be adequate to build an ML model. However, the predictions made using such a model may not be meaningful.
Current-day applications deal with high-dimensional data. Some of the difficulties associated with ML using such high-dimensional data vectors are:

• Computation time increases with the dimensionality.

• Storage space requirement also increases with the dimensionality.

• Performance of the model: It is well known that as the dimensionality increases, we require a larger training data set to build an ML model. There is a result, popularly called the peaking phenomenon, which shows that as the dimensionality keeps increasing, the accuracy of a classification model increases until some value, and beyond that value, the accuracy starts decreasing.
This may be attributed to the well-known concept of overfitting. The model will tend to remember the training data and fail to perform well on validation data. With a larger training data set, we can afford to use a higher-dimensional data set and still avoid overfitting. Even though the dimensionality of the data set in an application is large, it is possible that the number of available training vectors is small. In such cases, a popular technique used in ML is to reduce the dimensionality so that the learnt model does not overfit the available data. Well-known dimensionality reduction approaches are:
• Feature selection: Let F = {f1, f2, . . . , fL} be the set of L features. In the feature selection approach, we would like to select a subset Fl of F having l (< L) features such that Fl maximizes the performance of the ML model.
• Feature extraction: Here, from the set F of L features, a set H = {h1, h2, . . . , hl} of l (< L) features is extracted. It is possible to categorize these schemes as follows:

1. Linear schemes: In this case,

hj = ∑_{i=1}^{L} αij fi

That is, each element of H is a linear combination of the original features. Note that feature selection is a specialization of feature extraction. Some prominent schemes under this category are:
a. Principal components (PCs): Consider the data set of n vectors in an L-dimensional space; this may be represented as a matrix A of size n × L. The covariance matrix Σ of size L × L associated with the data is computed and the eigenvectors of Σ form the principal components. The eigenvector corresponding to the largest eigenvalue is the first principal component (PC). Similarly, the second largest eigenvalue provides its corresponding eigenvector as the second PC. Finally, the eigenvector corresponding to the lth largest eigenvalue is the lth PC. Both the original features and the PCs are sufficiently powerful to represent any data vector. So, PCs are linear combinations of the given features.
b. Non-negative matrix factorization (NMF): It is possible that PCs have negative entries. However, if the data is non-negative, it may be desirable to factorize the data matrix using non-negative entries; NMF is such a factorization of A(n×L) into a product of B(n×l) and C(l×L). Its use is motivated by the notion that NMF can be used to characterize objects in an image represented by A. In NMF, the columns of B can be viewed as linear combinations of the columns of A because of linear independence. A small sketch of both schemes is given below.
We will examine, in detail, the concepts of eigenvalue, eigenvector and linear
independence in later chapters.
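A small sketch of both linear schemes using scikit-learn; the random non-negative matrix and the choice of l = 3 components are arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA, NMF

rng = np.random.default_rng(0)
A = rng.random((100, 10))            # n = 100 non-negative vectors in an L = 10 dimensional space

pca = PCA(n_components=3)            # keep the first l = 3 principal components
A_pca = pca.fit_transform(A)         # PCs come from the eigenvectors of the covariance matrix of A
print(A_pca.shape, pca.explained_variance_ratio_)

nmf = NMF(n_components=3, init="random", random_state=0, max_iter=500)
B = nmf.fit_transform(A)             # B is n x l, non-negative
C = nmf.components_                  # C is l x L, non-negative, and A is approximately B @ C
print(np.linalg.norm(A - B @ C))     # reconstruction error of the factorization
```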
2. Non-linear feature extraction: Here, we represent the data using H = {h1, . . . , hl}, such that

hi = t(f1, f2, . . . , fL)

where t is a non-linear function of the features. For example, if F = {f1, f2}, then h1 = a f1 + b f2 + c f1 f2 is one such non-linear combination; it is non-linear because we have a term of the form f1 f2 in h1.
Autoencoder is a popular, state-of-the-art, non-linear feature extraction tool.
Here, a neural network which has an encoder and a decoder is used. The middle
layer has l neurons so that the l outputs from the middle layer give an l(< L)-
dimensional representation of the L-dimensional pattern that is input to the
autoencoder. Note that the encoder encodes or represents the L-dimensional
pattern in the l-dimensional space while the decoder decodes or converts the
l-dimensional pattern into the L-dimensional space. Note that it is called au-
toencoder because the same L-dimensional pattern is associated with the input
and output layers.
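As a rough sketch of the autoencoder idea without a dedicated DL framework, a multi-layer perceptron regressor can be trained to reproduce its own input; reading the l-dimensional encoding off the learnt first-layer weights, as done below, is our simplification.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 8))                       # L = 8 dimensional input patterns (toy data)

# One hidden layer with l = 3 neurons; the network is trained to map each pattern to itself.
net = MLPRegressor(hidden_layer_sizes=(3,), activation="relu",
                   max_iter=2000, random_state=0)
net.fit(X, X)

# The encoder is the mapping from the input layer to the hidden layer.
encoding = np.maximum(0.0, X @ net.coefs_[0] + net.intercepts_[0])   # l-dimensional codes
reconstruction = net.predict(X)                                      # back to L dimensions
print(encoding.shape, reconstruction.shape)                          # (200, 3) (200, 8)
```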

1.5.4 Model Selection


Selection of the model to be used to train an ML system depends upon the nature
of the data and knowledge of the application domain. For some applications,
only a subset of the ML models can be used. For example, if some features
are numerical and others are categorical, then classifiers based on perceptrons
and support vector machines (SVMs) are not suitable as they compute the dot
product between vectors and dot products do not make sense when some values
in the corresponding vectors are non-numerical. On the other hand, Bayesian
models and decision tree-based models are ideally suited to deal with such data
as they depend upon the frequency of occurrence of values.

1.5.5 Model Learning


This step depends on the size and type of the training data. In practice, a
subset of the labelled data is used as training data for learning the model and
another subset is used for model validation or model evaluation. Some of the ML
models are highly transparent while others are opaque or black box models. For
example, decision tree-based models are ideally suited to provide transparency;
this is because in a decision tree, at each internal or decision node, branching is
carried out based on the value assumed by a feature. For example, if the height
of an object is larger than 5 feet, it is likely to be an adult and not a child; such
easy-to-understand rules are abstracted by decision trees. Neural networks are
typically opaque as the outputs of intermediate/hidden layer neurons may not
offer transparency.

1.5.6 Model Evaluation
This step is also called model validation. It requires specifically earmarked data called validation data. It is possible that the ML model works well on the training data; then we say that the model is well trained. However, it may not work well on the validation data. In such a case, we say that the ML model overfits the training data. In order to overcome overfitting, we typically use the validation data to tune the ML model so that it works well on both the training and validation data sets.

1.5.7 Model Prediction


This step deals with testing the model that is learnt and validated. It is used for prediction because both classification and regression tasks are predictive tasks. This step employs the test data set earmarked for the purpose. In the real world, the model is used for prediction as new patterns keep coming in. Imagine an ML model built for medical diagnosis. It is like a doctor who predicts and makes a diagnosis when a new patient comes in.

1.5.8 Model Explanation


This step is important to explain to an expert or a manager why a decision was taken by the ML model. This will help in obtaining explicit or implicit feedback from the user to further improve the model. Explanation had an important role earlier in expert systems and other AI systems. However, explanation has become very important in the era of DL. This is because DL systems typically employ neural networks that are relatively opaque. So, their functioning cannot be easily explained at a level of detail that can be appreciated by the domain expert/user. Such opaque behaviour has created the need for explainable AI.

1.6 Search and Learning


Search is a very basic and fundamental operation in both ML and AI. Search
had a special role in conventional AI where it was successfully used in problem
solving, theorem proving, planning and knowledge-based systems.
Further, search plays an important role in several computer science applications.
Some of them are as follows:

• Exact search is popular in databases for answering queries, in operating systems for operations like grep, and in looking for entries in symbol tables.
• In ML, search is important in learning a classification model, a proximity measure for clustering and classification, and the appropriate model for regression.
• Inference is search in logic and probability. In linear algebra, matrix factorization is search. In optimization, we use a regularizer to simplify the search in finding a solution. In information theory, we search for purity (low entropy).
So, several activities, including optimization, inference and matrix factorization, that are essential for ML are all based on search. Learning itself is search. We will examine how search aids the learning of each ML model in the respective chapters.

1.7 Explanation Offered by the Model

Conventional AI systems were logic-based or rule-based systems. So, the corresponding reasoning systems naturally exhibited transparency and, as a consequence, explainability. Both forward and backward reasoning were possible. In fact, the same knowledge base, based on experts' input, was used in both diagnosis and in teaching because of this flexibility. Specifically, the knowledge base used by the MYCIN expert system was used in tutoring medical students through another expert system called GUIDON.
However, there were some problems associated with conventional AI systems:

• There was no general framework for building AI systems. Acquiring knowledge, using additional heuristics and dealing with exceptions led to adhocism; experience in building one AI system did not simplify the building of another AI system.
• Acquiring knowledge was a great challenge. Different experts typically differed in their conclusions, leading to inconsistencies. Conventional logic-based systems found it difficult to deal with such inconsistent knowledge.

There has been a gradual shift from using knowledge to using data in building AI systems. Current-day AI systems, which are mostly based on DL, are by and large data dependent. They can learn representations automatically. They employ variants of multi-layer neural networks and backpropagation algorithms in training models.
Some difficulties associated with DL systems are:

• They are data dependent. Their performance improves as the size of the data set increases. So, they need larger data sets. Fortunately, it is not difficult to provide large data sets in most of the current applications.
• Learning in DL systems involves a simple change of weights in the neural network to optimize the objective function. This is done with the help of backpropagation, which is a gradient descent algorithm and which can get stuck with a locally optimal solution. Combining this with large data sets may possibly lead to overfitting. This is typically avoided by using a variety of simplifications in the form of regularizers and other heuristics.
• A major difficulty is that DL systems are black box systems and lack explanation capability. This problem is currently attracting the attention of AI researchers.

We will be discussing how each of the ML models is equipped with explanation capability in the respective chapters.

1.8 Data Sets Used
In this book, we make use of two types of data sets, for classification and for regression, to conduct experiments and present results in various chapters. These are:

• Data Sets for Classification

1. MNIST Handwritten Digits Data Set:

• There are 10 classes (corresponding to digits 0, 1, . . . , 9 ) and each digit is


viewed as an image of size 28 × 28(= 784) pixels; each pixel having values
in the range 0 to 255 .
• There are around 6000 digits as training patterns and around 1000 test
patterns in each class and the class label is also provided for each of the
digits.

• For more details, visit http://yann.lecun.com/exdb/mnist/

2. Fashion MNIST Data Set:

• It is a data set of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples.
• Each example is a 28 × 28 greyscale image, associated with a label from 10 classes.
• It is intended to serve as a possible replacement for the original MNIST data set for benchmarking ML models.
• It has the same image size and structure of training and testing splits as the original MNIST data.
• For more details, visit https://www.kaggle.com/datasets/zalandoresearch/fashionmnist

3. Olivetti Face Data Set:

• It consists of 10 different images each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expression (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses).

• All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).
• Each image is of size 64 × 64 = 4096.
• It is available on the scikit-learn platform.

• For more details, visit https://ai.stanford.edu/~marinka/nimfa/nimfa.examples.orl_images.html
4. Wisconsin Breast Cancer Data Set:

• It consists of 569 patterns and each is a 30-dimensional vector.

• There are two classes, Benign and Malignant. The number of Benign class patterns is 357 and the number of Malignant class patterns is 212.
• It is available on the scikit-learn platform.
• For more details, visit https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
• Data Sets for Regression

1. Boston Housing Data Set:

• It has 506 patterns.


• Each pattern is a 13-dimensional vector.
• It is available on the scikit-learn platform.
• For more details, visit https://scikit-learn.org/0.15/modules/generated/sklearn.datasets.load_boston.html

2. Airline Passengers Data Set:

• This data set provides monthly totals of US airline passengers from 1949
to 1960.

• This data set is taken from an inbuilt data set of Kaggle called AirPas-
sengers.
• For more details, visit
https://www.kaggle.com/datasets/chirag19/air-passengers

3. Australian Weather Data Set:

• It provides various weather record details for cities in Australia.


• The features include location, min and max temperature, etc.
• For more details, visit https://www.kaggle.com/datasets/arunavakrchakraborty/australia-weather-data
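Two of the data sets listed above are directly available on the scikit-learn platform and can be loaded as follows (a minimal sketch; the MNIST and Kaggle data sets require separate downloads):

```python
from sklearn.datasets import load_breast_cancer, fetch_olivetti_faces

# Wisconsin Breast Cancer: 569 patterns, 30 features, two classes (Malignant/Benign).
cancer = load_breast_cancer()
print(cancer.data.shape, list(cancer.target_names))   # (569, 30) ['malignant', 'benign']

# Olivetti faces: 400 images (10 each of 40 subjects), each 64 x 64 = 4096 pixels.
faces = fetch_olivetti_faces()
print(faces.data.shape, faces.target.shape)           # (400, 4096) (400,)
```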

Summary
Machine learning (ML) is an important topic and has affected research practices in both science and engineering significantly. The important steps in building an ML system are:

• Data acquisition that is application domain dependent.


• Feature engineering that involves both data preprocessing and represen-
tation.
• Selecting a model based on the type of data and the knowledge of the
domain.
• Learning the model based on the training data.
• Evaluating and tuning the learnt model based on validation data.
• Providing explanation capability so that the model is transparent to the
user/expert.

Exercises
1. You are given that 9 × 17 is 153 and 4 × 17 is 68. From this data, you need to learn the value of 13 × 17. Which learning paradigm is useful here? Specify any assumptions you need to make.
2. You are given the following statements:
a. The sum of two even numbers is even.
b. 12 is an even number.
c. 22 is an even number.

What can you deduce from the above statements?


3. Consider the following statements:
a. If x is an even number, then x + 1 is odd and x + 2 is even.
b. 34 is an even number.
Which learning paradigm is used to learn that 37 is odd and 38 is even?
4. Consider the following reasoning:
a. If x is odd, then x + 1 is even.
b. 22 is even.
You have learnt from the above that 21 is odd. Which learning paradigm is used? Specify any assumptions to be made.
5. Consider the following attributes. Find out whether they are nominal, ordinal or numerical features. Give a reason for your choice.
a. Telephone number
b. Feature that takes values from {ball, bat, wicket, umpire, batsman, bowler}
c. Temperature
d. Weight

e. Feature that takes values from {short, medium height, tall}
6. Let xi and xj be two l-dimensional unit norm vectors; that is, ∥xi ∥ = 1 and
∥xj ∥ = 1. Derive a relation between the Euclidean distance d (xi , xj ) and cosine
of the angle between xi and xj .
7. Consider the data set:

(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 2), (1, 1, −), (6, 6, 10)
