K Nearest Neighbors
Introduction To Machine Learning
Dr. Hikmat Ullah Khan
KNN - Definition
Idea:
k-NN stands for k-Nearest Neighbour
k-NN is a simple algorithm
It classifies a new case based on a similarity measure between
the new case and the stored instances that already have class labels
KNN – different names
• K-Nearest Neighbors
  • Considers the k nearest neighbors only
• Memory-Based Reasoning
  • Keeps the training data in memory for computation
• Instance-Based Learning
• Example-Based Reasoning
• Case-Based Reasoning
  • Takes existing examples into account, considers the test
    instance, and computes the result
• Lazy Learning
  • All computation is deferred until the classification decision
WHY NEAREST NEIGHBOR?
Used to classify objects based on the closest training
examples in the feature space
Listed among the Top 10 Data Mining Algorithms
(ICDM paper – December 2007)
A very good starting point for students
Among the simplest of all data mining algorithms
A classification method
Instance-Based Classifiers
[Figure: a set of stored training cases with attributes Atr1 … AtrN and class labels (A, B, C), and an unseen case with the same attributes but no label]
• Store the training records
• Use the training records to predict the class label of unseen cases
Nearest Neighbor Classifiers
Basic idea:
If it swims like a duck and quacks like a duck, then it’s probably a
duck
[Figure: compute the distance from the test record to the training records, then choose the k “nearest” records]
KNN – Number of Neighbors
If K = 1,
select the nearest neighbor
If K > 1,
select the class by majority vote among the k nearest
neighbors
Nearest-Neighbor Classifiers
[Figure: an unknown record among labeled training points]
Requires three things
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
Nearest-Neighbor Classifiers
[Figure: an unknown record among labeled training points]
To classify an unknown record:
– Compute the distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine
the class label of the unknown record (e.g., by taking a
majority vote)
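The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the function name `knn_classify` and the data layout are our own choices.

```python
import math
from collections import Counter

def knn_classify(training, test_point, k=3):
    """Classify test_point by majority vote among its k nearest
    training records. `training` is a list of (features, label) pairs."""
    # Step 1: compute the Euclidean distance to every training record
    distances = []
    for features, label in training:
        d = math.sqrt(sum((p - q) ** 2 for p, q in zip(features, test_point)))
        distances.append((d, label))
    # Step 2: identify the k nearest neighbors
    neighbors = sorted(distances)[:k]
    # Step 3: majority vote over the neighbors' class labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two clusters with labels "A" and "B"
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify(training, (1.1, 0.9), k=3))  # "A"
```

Note that nothing happens at "training time" — the function simply stores and scans the records, which is exactly the lazy-learning behavior described earlier.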
Definition of Nearest Neighbor
[Figure: three panels showing the same test point X with neighborhoods of increasing size]
(a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor
The k-nearest neighbors of a record x are the data points
that have the k smallest distances to x
k NEAREST NEIGHBOR
[Figure: a test point “?” surrounded by square and triangle points]
k = 1: belongs to the square class
k = 3: belongs to the triangle class
k = 7: belongs to the square class
Choosing the value of k:
If k is too small, the classifier is sensitive to noise points
If k is too large, the neighborhood may include
points from other classes
For two-class problems, choose an odd value of k to avoid ties
KNN Classification
[Figure: scatter plot of Loan$ ($0 to $250,000) against Age (0 to 70), with points labeled Default and Non-Default]
KNN Classification – Distance

Age   Loan       Default   Distance
25    $40,000    N         102000
35    $60,000    N          82000
45    $80,000    N          62000
20    $20,000    N         122000
35    $120,000   N          22000
52    $18,000    N         124000
23    $95,000    Y          47000
40    $62,000    Y          80000
60    $100,000   Y          42000
48    $220,000   Y          78000
33    $150,000   Y           8000
48    $142,000   ?

D = √((x1 − x2)² + (y1 − y2)²)
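The distance column above can be reproduced with a short script; the variable names are illustrative. With k = 1, the nearest record (distance ≈ 8000) has Default = Y, so the unlabeled case would be classified as a default.

```python
import math

# Training data from the table above: (Age, Loan, Default)
records = [
    (25, 40000, "N"), (35, 60000, "N"), (45, 80000, "N"),
    (20, 20000, "N"), (35, 120000, "N"), (52, 18000, "N"),
    (23, 95000, "Y"), (40, 62000, "Y"), (60, 100000, "Y"),
    (48, 220000, "Y"), (33, 150000, "Y"),
]
query = (48, 142000)  # the unlabeled case

# Euclidean distance D = sqrt((x1 - x2)^2 + (y1 - y2)^2)
def distance(age, loan):
    return math.sqrt((age - query[0]) ** 2 + (loan - query[1]) ** 2)

# The nearest record is (33, 150000, "Y"), at distance ~= 8000
nearest = min(records, key=lambda r: distance(r[0], r[1]))
print(nearest)
```

Notice that the Loan values completely dominate the distance here, which is exactly why the next slide standardizes the attributes.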
KNN Classification – Standardized Distance

Age     Loan   Default   Distance
0.125   0.11   N         0.7652
0.375   0.21   N         0.5200
0.625   0.31   N         0.3160
0       0.01   N         0.9245
0.375   0.50   N         0.3428
0.8     0.00   N         0.6220
0.075   0.38   Y         0.6669
0.5     0.22   Y         0.4437
1       0.41   Y         0.3650
0.7     1.00   Y         0.3861
0.325   0.65   Y         0.3771
0.7     0.61   ?

Xs = (X − Min) / (Max − Min)
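The min-max formula above maps every attribute into [0, 1]. A small sketch (function name `min_max_scale` is our own) reproduces the standardized Age column, including the 0.7 for the query row:

```python
def min_max_scale(values):
    """Min-max standardization: Xs = (X - Min) / (Max - Min),
    mapping every value into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Age column from the table, with the query age (48) appended last
ages = [25, 35, 45, 20, 35, 52, 23, 40, 60, 48, 33, 48]
scaled = min_max_scale(ages)
print(scaled[:3])  # [0.125, 0.375, 0.625]
```

After scaling, Age and Loan contribute on comparable scales, so neither attribute dominates the Euclidean distance.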
Nearest Neighbor Classification
Compute the distance between two points:
Euclidean distance
d(p, q) = √(Σᵢ (pᵢ − qᵢ)²)
Determine the class from the nearest-neighbor list:
take the majority vote of class labels among the k nearest
neighbors
k NEAREST NEIGHBOR
Common Distance Metrics:
Euclidean distance (continuous attributes)
d(p, q) = √(Σ(pᵢ − qᵢ)²)
Hamming distance (overlap metric)
bat vs. cat (distance = 1)    toned vs. roses (distance = 3)
Discrete metric (boolean metric)
if x = y then d(x, y) = 0; otherwise d(x, y) = 1
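The Hamming and discrete metrics are both one-liners; a quick sketch verifies the examples above (function names are our own):

```python
def hamming(a, b):
    """Hamming distance: the number of positions at which two
    equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(1 for x, y in zip(a, b) if x != y)

def discrete(x, y):
    """Discrete (boolean) metric: 0 if the values are equal, else 1."""
    return 0 if x == y else 1

print(hamming("bat", "cat"))      # 1: only the first character differs
print(hamming("toned", "roses"))  # 3: positions 0, 2, and 4 differ
```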
Example
Test instance: Durability = 3, Strength = 7, Class = ?

Type No   Item Durability   Item Strength   Class
Type-1    7                 7               Bad
Type-2    7                 4               Bad
Type-3    3                 4               Good
Type-4    1                 4               Good
Example

Type No   Item Durability   Item Strength   Class   Distance
Type-1    7                 7               Bad     √((7−3)² + (7−7)²) = 4
Type-2    7                 4               Bad
Type-3    3                 4               Good
Type-4    1                 4               Good
Type No   Item Durability   Item Strength   Class   Distance
Type-1    7                 7               Bad     √((7−3)² + (7−7)²) = 4
Type-2    7                 4               Bad     5
Type-3    3                 4               Good    3
Type-4    1                 4               Good    3.6
Type No   Item Durability   Item Strength   Class   Distance                  Rank
Type-1    7                 7               Bad     √((7−3)² + (7−7)²) = 4    3
Type-2    7                 4               Bad     5                         4
Type-3    3                 4               Good    3                         1
Type-4    1                 4               Good    3.6                       2
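The whole worked example can be checked in a few lines. With k = 3 the nearest items are Type-3 (distance 3), Type-4 (≈3.6), and Type-1 (4): two "Good" votes against one "Bad", so the test instance is classified as Good.

```python
import math
from collections import Counter

# Training items from the table above: (durability, strength, class)
items = {
    "Type-1": (7, 7, "Bad"),
    "Type-2": (7, 4, "Bad"),
    "Type-3": (3, 4, "Good"),
    "Type-4": (1, 4, "Good"),
}
test = (3, 7)  # Durability = 3, Strength = 7

# Euclidean distance from the test instance to each item, sorted (= ranked)
dists = sorted(
    (math.sqrt((d - test[0]) ** 2 + (s - test[1]) ** 2), cls)
    for d, s, cls in items.values()
)
# Majority vote among the k = 3 nearest neighbors
k = 3
prediction = Counter(cls for _, cls in dists[:k]).most_common(1)[0][0]
print(prediction)  # Good
```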
k NEAREST NEIGHBOR
The accuracy of all NN-based algorithms depends on the data.
Scaling issues
Attributes may have to be scaled to prevent the
distance measure from being dominated by one
of the attributes.
Examples
Height of a person may vary from 4’ to 6’
Weight of a person may vary from 100 lbs to 300 lbs
Income of a person may vary from $10k to $500k
Distance – Categorical Variables

X      Y        Distance
Male   Male     0
Male   Female   1

If x = y then D = 0; if x ≠ y then D = 1
KNN - Applications
• Classification
– legal,
– medical,
– news,
– banking
• Problem-solving
– planning,
– pronunciation
• Teaching and aiding
– help desk,
– user training
Merits and Demerits
Advantages
Simple technique that is easily implemented
Can work with relatively little information
Well suited for multi-class problems
Learning is simple (does not involve preprocessing)
Performs best in some cases (gene and protein identification)
Disadvantages
Memory issues and expensive computation for large datasets
Low accuracy in the presence of noisy or irrelevant features
Feature selection problem
Exercise: KNN using R
KNN tutorial using the R language
https://www.youtube.com/watch?v=lDCWX6vCLFA
All steps
UCI dataset
WDBC data
Use any data you like
Use Python or any other tool
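For those preferring Python over R, a minimal version of the exercise is sketched below. It assumes scikit-learn is installed; scikit-learn ships the WDBC (Wisconsin Diagnostic Breast Cancer) data as `load_breast_cancer`. The choices of test split and k = 5 are ours.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Load the WDBC data and hold out 30% for testing
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Min-max scale the features so no attribute dominates the distance
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Odd k to avoid ties in this two-class problem
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```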