INTRODUCTION TO MACHINE LEARNING
K-NEAREST NEIGHBOR ALGORITHM
        Mingon Kang, PhD
        Department of Computer Science @ UNLV
KNN
   K-Nearest Neighbors (KNN)
   Simple, but a very powerful classification algorithm
   Classifies based on a similarity measure
   Non-parametric
   Lazy learning
     Does not “learn” until a test example is given
     Whenever we have a new example to classify, we find its
      K nearest neighbors from the training data
Ref: https://www.slideshare.net/tilanigunawardena/k-nearest-neighbors
KNN: Classification Approach
   Classifies by a “MAJORITY VOTE” of its neighbors’
    classes
     A test example is assigned to the most common class amongst its
      K nearest neighbors (found by measuring the “distance” between
      data points)
Ref: https://www.slideshare.net/tilanigunawardena/k-nearest-neighbors
KNN: Example
Ref: https://www.slideshare.net/tilanigunawardena/k-nearest-neighbors
KNN: Pseudocode
Ref: https://www.slideshare.net/PhuongNguyen6/text-categorization
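Since the pseudocode figure is not reproduced here, the following is a minimal sketch of the KNN classification procedure in Python/NumPy; the function and variable names are illustrative, not taken from the slides.

import numpy as np

def knn_classify(X_train, y_train, x_test, k=3):
    """Classify one test example by a majority vote of its k nearest neighbors."""
    # Euclidean distance from the test example to every training example
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.0], [5.8, 6.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([5.5, 5.9]), k=3))  # -> 1

Note that, because the algorithm is lazy, all the work happens at classification time; there is no training step beyond storing the data.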
KNN: Example
Ref: http://www.scholarpedia.org/article/K-nearest_neighbor
KNN: Euclidean distance matrix
Ref: http://www.scholarpedia.org/article/K-nearest_neighbor
Decision Boundaries
   Voronoi diagram
     Describes  the areas that are nearest to any given point,
      given a set of data.
     Each line segment is equidistant between two points of
      opposite class
Ref: https://www.slideshare.net/tilanigunawardena/k-nearest-neighbors
Decision Boundaries
   With a large number of examples and possible noise
    in the labels, the decision boundary can become
    nasty!
     The “overfitting” problem
Ref: https://www.slideshare.net/tilanigunawardena/k-nearest-neighbors
Effect of K
   A larger K produces a smoother decision boundary
   When K == N, KNN always predicts the majority class
Ref: https://www.slideshare.net/tilanigunawardena/k-nearest-neighbors
Discussion
   Which model is better between K=1 and K=15?
   Why?
How to choose k?
   Empirically optimal k?
Ref: https://www.slideshare.net/tilanigunawardena/k-nearest-neighbors
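One common way to choose k empirically is cross-validation: evaluate several candidate values and keep the one with the best validation accuracy. The sketch below is an assumption of mine (it uses scikit-learn and the iris dataset, neither of which appears in the slides).

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate several candidate values of k with 5-fold cross-validation
scores = {}
for k in [1, 3, 5, 7, 9, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)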
Pros and Cons
   Pros
     Learning and implementation are extremely simple and
      intuitive
     Flexible decision boundaries
   Cons
     Irrelevant or correlated features have a high impact and
      must be eliminated
     Typically difficult to handle high dimensionality
     Computational cost: memory and classification-time
      computation
Ref: https://www.slideshare.net/tilanigunawardena/k-nearest-neighbors
Similarity and Dissimilarity
   Similarity
     Numerical measure of how alike two data objects are.
     Is higher when objects are more alike.
     Often falls in the range [0,1]
   Dissimilarity
     Numerical measure of how different two data objects are
     Lower when objects are more alike
     Minimum dissimilarity is often 0
     Upper limit varies
   Proximity refers to a similarity or dissimilarity
Euclidean Distance
   Euclidean Distance
               $\mathrm{dist} = \sqrt{\sum_{k=1}^{p} (a_k - b_k)^2}$
Where p is the number of dimensions (attributes) and
$a_k$ and $b_k$ are, respectively, the k-th attributes
(components) of data objects a and b.
   Standardization is necessary, if scales differ.
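As a quick illustration of the formula, here is a small NumPy sketch (the vectors are made up for this example) computing the Euclidean distance between two p-dimensional points.

import numpy as np

a = np.array([3.0, 2.0, 0.0, 5.0])
b = np.array([1.0, 0.0, 0.0, 1.0])

# dist = sqrt( sum_k (a_k - b_k)^2 )
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # equivalent to np.linalg.norm(a - b)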
Euclidean Distance
Minkowski Distance
   Minkowski Distance is a generalization of Euclidean
    Distance
               $\mathrm{dist} = \left( \sum_{k=1}^{p} |a_k - b_k|^r \right)^{1/r}$
Where r is a parameter, p is the number of dimensions
(attributes), and $a_k$ and $b_k$ are, respectively, the k-th
attributes (components) of data objects a and b.
Minkowski Distance: Examples
   r = 1. City block (Manhattan, taxicab, L1 norm) distance.
       A common example of this is the Hamming distance, which is just
        the number of bits that are different between two binary vectors
   r = 2. Euclidean distance
   r →∞. “supremum” (𝐿𝑚𝑎𝑥 norm, 𝐿∞ norm) distance.
       This is the maximum difference between any component of the
        vectors
   Do not confuse r with p, i.e., all these distances are defined
    for all numbers of dimensions.
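To make the three special cases concrete, the sketch below (an illustrative addition, assuming NumPy) computes the L1, L2, and L∞ distances for the same pair of vectors.

import numpy as np

a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])

l1   = np.sum(np.abs(a - b))         # r = 1: city block / Manhattan -> 7.0
l2   = np.sqrt(np.sum((a - b) ** 2)) # r = 2: Euclidean              -> 5.0
linf = np.max(np.abs(a - b))         # r -> infinity: supremum       -> 4.0
print(l1, l2, linf)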
Cosine Similarity
   If 𝑑1 and 𝑑2 are two document vectors
            $\cos(d_1, d_2) = \dfrac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$,
Where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d.
   Example:
           𝑑1 = 3 2 0 5 0 0 0 2 0 0
           𝑑2 = 1 0 0 0 0 0 0 1 0 2
𝑑1 ∙ 𝑑2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||𝑑1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||𝑑2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
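The arithmetic above can be checked with a short NumPy sketch (illustrative only, not part of the original slides).

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos)  # ~0.315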
Cosine Similarity
   $\cos(d_1, d_2) = \begin{cases} 1 & \text{exactly the same} \\ 0 & \text{orthogonal} \\ -1 & \text{exactly opposite} \end{cases}$
Feature scaling
   Standardize the range of independent variables
    (features of data)
   A.k.a. Normalization or Standardization
Standardization
   Standardization or Z-score normalization
     Rescale the data so that the mean is zero and the
      standard deviation is one (standard scores)
                       $x_{\mathrm{norm}} = \dfrac{x - \mu}{\sigma}$
       where $\mu$ is the mean and $\sigma$ is the standard deviation
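A minimal sketch of z-score standardization, assuming the features are the columns of a NumPy array (the data values are made up).

import numpy as np

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Standardize each column: zero mean, unit standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~[0, 0] and [1, 1]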
Min-Max scaling
   Scale the data to a fixed range – between 0 and 1
                $x_{\mathrm{norm}} = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$
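The corresponding min-max sketch, mapping each column to [0, 1] (illustrative; it ignores the degenerate case where a column is constant).

import numpy as np

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Rescale each column to the range [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_minmax)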
Efficient implementation
   Consider data as a matrix or a vector
   Matrix/vector computation is much more efficient
    than computing with loops, as in the sketch below
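For instance, all test-to-training Euclidean distances can be computed in one vectorized NumPy expression instead of two nested Python loops (a sketch under the assumption that rows are examples and columns are features; the array sizes are arbitrary).

import numpy as np

X_train = np.random.rand(1000, 10)   # 1000 training examples, 10 features
X_test  = np.random.rand(200, 10)    # 200 test examples

# Broadcasting builds the full 200 x 1000 distance matrix at once,
# replacing a loop over test examples and a loop over training examples.
diff  = X_test[:, None, :] - X_train[None, :, :]
dists = np.sqrt(np.sum(diff ** 2, axis=2))
print(dists.shape)  # (200, 1000)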
Discussion
   Can we use KNN for regression problems?