Machine Learning
Dr. Shylaja S S
Professor
Department of Computer Science & Engineering
Eager vs Lazy Learning

[Diagram: in eager learning, the training data is used to build a MODEL that is stored in memory and answers every test query; in lazy learning, the training data itself is stored in memory and is processed for each test query to produce a prediction.]

Eager Learning
● Creates a model using the training data, which is used across all test queries.

Lazy Learning
● For every test query, it processes the training data to make a prediction.
Lazy Learning
● No global model is created; the training data is simply stored in memory.
● For every test instance, a common procedure is run on the training data to find the prediction value.
Inductive Bias
The inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs for inputs it has not encountered.
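For k-NN, this bias is the assumption that instances lying close to each other in feature space tend to share the same label.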
K-NN for Classification and Regression
● All instances correspond to points in the n-D space.
● Given a query point, find where it belongs in the space.
● Assign the mode of the neighbours for classification and the mean for regression.

Euclidean distance between two instances x and y:
d(x, y) = [(x1 - y1)^2 + (x2 - y2)^2 + ... + (xd - yd)^2]^(1/2)
where,
● d: number of attributes
● xi and yi: the ith attributes of x and y
● Any data instance xi will have a label f(xi), one of the classes from the set V.
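As an illustration, a minimal Python sketch of this distance measure, checked against the worked example later in the deck:

import math

def euclidean(x, y):
    # Euclidean distance between two d-dimensional points.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Query point xq = (4, 10) and point A = (2, 8) from the example below:
print(euclidean((4, 10), (2, 8)))   # 2.8284..., shown as 2.82 on the slides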
Practice Problem 1: K-NN Classification

Practice Problem 2: K-NN Regression
Id   Weight
id6  60
id5  72
id4  59

We choose the mean of these 3 nearest neighbours as our answer.
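Working that out, the predicted weight is the mean of the 3 nearest neighbours: (60 + 72 + 59) / 3 ≈ 63.67.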
Choices a designer has to make while implementing KNN
Point  X  Y   Class
A      2  8   triangle
B      3  9   triangle
C      5  11  circle
D      6  13  circle
Find appropriate value of k

[Scatter plot of points A-D in the X-Y plane, with the query point xq]
Euclidean distance: let us calculate the distance from the query point xq = (4, 10) to all other points.

Pair    Distance  Class
xq, A   2.82      triangle
xq, B   1.414     triangle
xq, C   1.414     circle
xq, D   3.6055    circle

• Let's set k = 2.
• We have a dilemma while choosing the mode: the two nearest neighbours, B (triangle) and C (circle), give a tie.
What k value should I start with?

• Let's set k = 3.
• Now there is no problem: the three nearest neighbours are B (triangle), C (circle) and A (triangle), so we can assign the query point as a triangle.
• But we won't stop here!
Suppose the training data also contains points E, F (heart) and G (square). Let's calculate the distance between each point and the query point xq = (4, 10):

Pair    Distance  Class
xq, A   2.82      triangle
xq, B   1.414     triangle
xq, C   1.414     circle
xq, D   3.6055    circle
xq, E   1.414     heart
xq, F   2.23606   heart
xq, G   4.4721    square

• The label set now has 4 classes, which is even, so by the earlier reasoning k should be odd.
• So let k = 5.
• We are stuck again!
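Why stuck? With k = 5 the nearest neighbours are B, C and E (1.414 each), F (2.23606) and A (2.82): two triangles (B, A), two hearts (E, F) and one circle (C), so the mode is tied between triangle and heart.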
Find appropriate value of k

[Elbow plot: error-rate (y-axis) against K-value (x-axis)]

• Plot the error-rate for a range of K values and pick the K at the elbow of the curve (a code sketch follows below).
• Run the K-NN algorithm with the best K value found and redo the classification report and the confusion matrix.
• Any changes to the data will require you to redo the elbow method and re-determine the best K.
• K is very low: the query point is labelled by only the few points nearest to it, since only a small number of points are considered. Chance of high variance, i.e. overfitting.
• K is very high: the query point mostly gets the class label that holds a majority in the dataset. Chance of high bias, i.e. underfitting.
• K is neither high nor low: a good trade-off without high bias or high variance.
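A minimal sketch of this elbow search, assuming scikit-learn is available and using the small red/green dataset from the next slide with its 80:20 split:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Red/green dataset from the next slide; first 8 rows train, last 2 test.
X = np.array([[3, 5], [3, 6], [4, 6], [4, 4],
              [7, 10], [8, 9], [8, 8], [9, 10], [5, 5], [4, 7]])
y = np.array(["red"] * 4 + ["green"] * 4 + ["red", "green"])
X_train, y_train, X_test, y_test = X[:8], y[:8], X[8:], y[8:]

error_rate = []
for k in range(1, 9):                                    # candidate K values
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate.append(1 - knn.score(X_test, y_test))     # error = 1 - accuracy

best_k = int(np.argmin(error_rate)) + 1                  # K at the minimum of the curve
print(best_k, error_rate)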
kNN algorithm

• For most practical purposes, the dataset is divided into production (training) and test data sets (80:20).
• Some kind of pre-processing (learning) is being done: it is not used as a naive (plain-vanilla) version.

X  Y   Class
3  5   red
3  6   red
4  6   red
4  4   red
7  10  green
8  9   green
8  8   green
9  10  green
5  5   red
4  7   green
Working algorithm for kNN Classification

• Upload the production data into memory.
• Decide the distance measure (we will choose Euclidean).
• Decide the k value (we will choose k = 4).
• For each data point in the test data, find the distance to all stored points, choose the k nearest neighbours, and assign the prediction as the mode of the neighbours' labels (a code sketch follows below).
• Calculate the error and accuracy.

Training data:
X  Y   Class
3  5   red
3  6   red
4  6   red
4  4   red
7  10  green
8  9   green
8  8   green
9  10  green

Test data:
X  Y  Class
5  5  red
4  7  green
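A minimal plain-Python sketch of these steps on the slide's data (a lazy learner: no model is built, the stored points are scanned per query):

from collections import Counter
import math

# Production (training) data from the slide: (X, Y) -> class.
train = [((3, 5), "red"), ((3, 6), "red"), ((4, 6), "red"), ((4, 4), "red"),
         ((7, 10), "green"), ((8, 9), "green"), ((8, 8), "green"), ((9, 10), "green")]
test = [((5, 5), "red"), ((4, 7), "green")]
k = 4

def predict(query):
    # Sort stored points by Euclidean distance to the query (the lazy step).
    nearest = sorted(train, key=lambda p: math.dist(query, p[0]))[:k]
    # Mode of the k nearest neighbours' labels.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

wrong = sum(predict(q) != label for q, label in test)
print(f"error = {wrong}/{len(test)}")   # 1/2: (4, 7) gets predicted red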
Working algorithm for kNN Classification

Tally TP, TN, FP and FN over the test queries, then:
Error = (FP + FN) / all predictions

With k = 4, the test point (5, 5) is correctly predicted as red, while (4, 7) is also predicted red (all four of its nearest neighbours are red) although its actual class is green. One of the two predictions is wrong, so Error = 1/2 = 0.5.
Working algorithm for kNN Regression

Training data:
X  Y   Value
3  5   3.5
3  6   2.6
4  6   2.8
4  4   3.1
7  10  8.7
8  9   8.5
8  8   9.3
9  10  9.7

Test data:
X  Y  Value
5  5  3.7
7  8  8.4

For the test point (5, 5), the mean of the k = 4 nearest values (2.8, 3.1, 3.5, 2.6) is 3.0, so:
e1 = (3.7 - 3.0)^2 = 0.49
Working algorithm for kNN Regression

For the test point (7, 8), the mean of the k = 4 nearest values (9.3, 8.5, 8.7, 9.7) is 9.05, so:
e2 = (8.4 - 9.05)^2 = 0.4225

So far: e1 = 0.49, e2 = 0.4225.
Working algorithm for kNN Regression

Mean squared error over the test set:
MSE = (e1 + e2) / 2 = (0.49 + 0.4225) / 2 = 0.45625
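A minimal plain-Python sketch reproducing these numbers (the same lazy procedure, with the mean in place of the mode):

import math

# Training data from the slides: (X, Y) -> value.
train = [((3, 5), 3.5), ((3, 6), 2.6), ((4, 6), 2.8), ((4, 4), 3.1),
         ((7, 10), 8.7), ((8, 9), 8.5), ((8, 8), 9.3), ((9, 10), 9.7)]
test = [((5, 5), 3.7), ((7, 8), 8.4)]
k = 4

def predict(query):
    nearest = sorted(train, key=lambda p: math.dist(query, p[0]))[:k]
    return sum(value for _, value in nearest) / k   # mean of the k nearest values

squared_errors = [(actual - predict(q)) ** 2 for q, actual in test]
mse = sum(squared_errors) / len(squared_errors)
print(squared_errors, mse)   # ~[0.49, 0.4225] -> MSE ~0.45625, up to float rounding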
Weighted KNN
• Weigh the contribution of each of the k neighbours according to their distance to the query xq (a sketch follows below).
• Give greater weight to closer neighbours.
• Let's say with k = 5 we get A, B, C, D, E as the closest points, with distances as depicted in the plot:

[Plot of the distances of A-E from xq]
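One common way to implement this; the inverse-square weighting below is an assumption, as the slides do not fix a formula:

import math
from collections import defaultdict

def weighted_predict(query, train, k):
    # train: list of ((x, y), label) pairs, as in the earlier sketches.
    nearest = sorted(train, key=lambda p: math.dist(query, p[0]))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = math.dist(query, point)
        votes[label] += 1.0 / (d * d + 1e-9)   # closer neighbours weigh more
    return max(votes, key=votes.get)           # label with the largest total weight

With the 7-point example above, these 1/d^2 weights give heart 0.5 + 0.2 = 0.7 versus triangle 0.5 + 0.126 ≈ 0.63, breaking the earlier triangle/heart tie.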
Disadvantages of KNN
• Time: prediction is computationally expensive, since we need to compute the distance between the query point and all other points (and N is often in the 1000s).
• Space: high memory requirement, because this lazy algorithm stores all of the training data.
• It does not work well with large datasets: the cost of calculating the distance between the new point and each existing point is high.
• It is sensitive to scaling.
Pre-normalizing:
Euclidean distance between points 1 and 2 =
[(100000 - 80000)^2 + (30 - 25)^2]^(1/2) ≈ 20000

• The high magnitude of income dominated the distance between the two points.
• This will impact performance, since higher weightage is effectively given to variables with higher magnitude.
Effect of normalizing

How to normalize? For example, rescale each variable to z-scores: x' = (x - mean) / standard deviation (a sketch follows below).

Post-normalizing:
Euclidean distance between points 1 and 2 =
[(0.608 + 0.260)^2 + (-0.447 + 1.192)^2]^(1/2) = 1.14

• The distance is not biased towards the income variable anymore.
• Similar weightage is given to both variables.
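A sketch of z-score normalization with NumPy; only the first two rows (income, age) come from the slides, and the third row is made up so that the standard deviation is meaningful:

import numpy as np

# Columns: income, age. Rows 1 and 2 are from the slide; row 3 is invented.
data = np.array([[100000.0, 30.0],
                 [80000.0, 25.0],
                 [60000.0, 35.0]])

normed = (data - data.mean(axis=0)) / data.std(axis=0)   # x' = (x - mean) / std
print(np.linalg.norm(normed[0] - normed[1]))             # distance now weighs both columns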
The curse of dimensionality

• One might think that one rescue could be to increase the number of training samples, n, until the nearest neighbours are truly close to the test point.
• How many data points would we need so that ℓ, the edge length of the small hypercube around the test point that contains its k nearest neighbours, becomes truly small? For data spread uniformly over a d-dimensional unit cube, ℓ ≈ (k/n)^(1/d), so n ≈ k / ℓ^d.
• Fix ℓ = 1/10 = 0.1: then n ≈ k · 10^d.
• For d > 100 we would need far more data points than there are electrons in the universe...
How to deal with it?

• Assign weights to the attributes when calculating distances. Say we are predicting the price of a house: we give higher weights to features like area and locality than to colour (a sketch follows below).
• Iteratively leave out one of the attributes and test the algorithm; this exercise can lead us to the best set of attributes.
• Non-binary characterizations: if there is a natural order among the data, convert to numerical values based on that order.
  e.g. Educational attainment: HS, College, MS, PhD -> 1, 2, 3, 4
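A sketch of the attribute-weighting idea; the features and weight values here are hypothetical:

# Hypothetical house features: area, locality score, colour code.
weights = [0.5, 0.4, 0.1]   # made-up weights: colour matters least

def weighted_euclidean(x, y, w):
    # One weight per attribute, applied to the squared differences.
    return sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)) ** 0.5

print(weighted_euclidean((120, 8, 3), (100, 7, 9), weights))   # sample call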
Quiz

Which of the following statements about k-NN are true?
1. k-NN performs much better if all of the data have the same scale.
2. k-NN works well with a small number of input variables (p), but struggles when the number of inputs is very large.
3. k-NN makes no assumptions about the functional form of the problem being solved.
Options: 1 and 2; 1 and 3

True or False: to be more sure of which classifications you make, you can try increasing the value of k. (True)
Machine Learning
Dr. Shylaja S S
Department of Computer Science & Engineering
shylaja.sharath@pes.edu