Instance-Based Learning
AIML/ BDA
Topics covered
• K-Nearest Neighbors (K-NN) concept
• Distance metrics
• K-NN for classification
Machine Learning Classification
Instance-Based Learning
• Training instances are stored in memory.
• For a test (unseen) instance, compare it with the instances seen in training and give the result.
• Also known as memory-based learning.
• IBL algorithms build local rather than global approximations.
Instance-Based Learning
• IBL methods learn by simply storing the training data.
• When a new query instance is encountered, a set of similar
  related instances is retrieved from memory and used to classify
  the new query instance.
• For each distinct query, IBL constructs a different local
  approximation to the target function.
• These are local rather than global approximations.
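The store-then-compare idea above can be shown as a minimal sketch; the function name `knn_classify` and the use of Euclidean distance with majority voting are illustrative assumptions, not a specific library API.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify a query point by majority vote among its k nearest
    stored training instances (Euclidean distance).

    train: list of ((feature, ...), label) pairs - the "memorized" data.
    """
    # "Training" is just storage; all work happens at query time:
    # compute the distance from the query to every stored instance.
    ranked = sorted((math.dist(x, query), label) for x, label in train)
    # Take the labels of the k closest instances.
    votes = [label for _, label in ranked[:k]]
    # Majority vote decides the class of the query.
    return Counter(votes).most_common(1)[0][0]
```

Note that each query builds its own local neighbourhood, which is exactly the "local rather than global approximation" property.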
Advantages and Disadvantages of IBL
Advantage:
• Suitable for problems with very complex target functions.
Disadvantages:
• The cost of classifying new instances is high.
• All attributes of an instance are considered, so the cost grows as the number of dimensions increases.
Example
• A company produces tissues (used by biological labs).
• The company's objective is to predict how well its products are
  accepted by its clients.
• It conducted a survey with its clients to find the acceptance of
  the product. Quality is based on the acid durability and acid strength parameters.
Example
• The data set pertains to a company that produces tissues for use in biological
  labs.
      Name        Acid Durability    Acid Strength    Acceptability
      Type-1             7                 7               Low
      Type-2             7                 4               Low
      Type-3             3                 4               High
      Type-4             1                 4               High

      Test data: Type-5    Acid Durability = 3    Acid Strength = 7
      • Build a classifier to predict the class of the new type of tissue.
• Apply the Euclidean distance measure to find the distance of each instance from
  the new data point, Type-5.

       Name     Acid Durability   Acid Strength   Distance                    Neighbour Rank
       Type-1          7                7         √((7−3)² + (7−7)²) = 4            3
       Type-2          7                4         √((7−3)² + (4−7)²) = 5            4
       Type-3          3                4         √((3−3)² + (4−7)²) = 3            1
       Type-4          1                4         √((1−3)² + (4−7)²) ≈ 3.6          2
 • If k = 1: the single nearest neighbour is Type-3 = High, so the new type is classified as High.
 • If k = 2: the two nearest neighbours are Type-3 and Type-4, both High, so the new type is High.
 • If k = 3: the three nearest neighbours are Type-3 = High, Type-4 = High and Type-1 = Low; High
   has the majority, so the new type is classified as High.
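The worked table and votes above can be reproduced in a short script; the variable names are illustrative, and the data is exactly the tissue table from the example.

```python
import math
from collections import Counter

# Training instances from the table: (acid durability, acid strength) -> acceptability.
data = [
    ("Type-1", (7, 7), "Low"),
    ("Type-2", (7, 4), "Low"),
    ("Type-3", (3, 4), "High"),
    ("Type-4", (1, 4), "High"),
]
query = (3, 7)  # Type-5

# Rank the training instances by Euclidean distance to the query.
ranked = sorted(data, key=lambda row: math.dist(row[1], query))

# Majority vote among the k nearest neighbours for k = 1, 2, 3.
predictions = {}
for k in (1, 2, 3):
    votes = Counter(label for _, _, label in ranked[:k])
    predictions[k] = votes.most_common(1)[0][0]
```

Running this ranks Type-3 first (distance 3) and predicts High for k = 1, 2 and 3, matching the table.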
Example 2
• Assume a Boolean target function and a 2-dimensional instance space
  (shown in the figure).
• Determine how the k-Nearest Neighbour learning algorithm would
  classify the new instance xq for k = 1, 3, 5 and 7.
• The + and − signs in the instance space refer to positive and
  negative examples respectively.

     Distance from query instance    Classification
             1.00                          +
             1.35                          −
             1.40                          −
             1.60                          −
             1.90                          +
             2.00                          +
             2.20                          −
             2.40                          +
             2.80                          −
[Figure: 2-dimensional instance space with + and − examples scattered around the query point xq]

     k       Classification of xq
     1-NN            +
     3-NN            −
     5-NN            −
     7-NN            −
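The k-NN results above follow from a majority vote over the given distances; a minimal sketch (variable names are illustrative):

```python
from collections import Counter

# (distance from xq, class) pairs taken from the example's table.
neighbours = sorted([
    (1.00, "+"), (1.35, "-"), (1.40, "-"), (1.60, "-"), (1.90, "+"),
    (2.00, "+"), (2.20, "-"), (2.40, "+"), (2.80, "-"),
])

# Majority vote among the k nearest neighbours for each k.
results = {}
for k in (1, 3, 5, 7):
    votes = Counter(label for _, label in neighbours[:k])
    results[k] = votes.most_common(1)[0][0]
```

For k = 1 only the + at distance 1.00 votes; for k = 3, 5 and 7 the − examples hold the majority, so xq flips to −.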
Selection of the K value?
• Try many different values of K and see what works best for your
  problem.
• K should be an odd number (3, 5, 7, 9, etc.) so that majority votes
  cannot tie in binary classification.
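One common way to "try many values and see what works best" is to score each candidate K on held-out data; the sketch below uses leave-one-out evaluation with a toy dataset (the helper names `knn_predict`, `loo_accuracy` and the data values are illustrative assumptions).

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority-vote prediction from the k nearest training instances."""
    ranked = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: classify each point using all the others."""
    hits = sum(
        knn_predict(data[:i] + data[i + 1:], x, k) == y
        for i, (x, y) in enumerate(data)
    )
    return hits / len(data)

# Toy data: two well-separated clusters, labelled A and B.
data = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
        ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]

# Score each odd candidate K and keep the best-scoring one.
best_k = max((1, 3, 5), key=lambda k: loo_accuracy(data, k))
```

On this toy data k = 5 fails (each left-out point sees more opposite-cluster neighbours than same-cluster ones), illustrating why K must be tuned rather than fixed.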
How do the efficiency and accuracy of k-NN search
change as k increases?
• With a sufficiently large number of training examples, accuracy
  should increase.
• The computational complexity of KNN grows with the size of the
  training dataset.
     • The time to compute a prediction also increases.
     • In that sense, KNN becomes less efficient.
• Why is KNN a lazy learning algorithm?
• It builds no model during training.
• It simply "memorizes" the training dataset.
• In contrast, a decision tree (DT) algorithm learns its model during training time.
• Why is KNN a non-parametric algorithm?
• It makes no assumptions about the functional form of the
  problem being solved.
• Is KNN a supervised or unsupervised learning algorithm?
• KNN is a supervised learning algorithm: it uses labeled data for
  classification problems.
• Note: K-means is an unsupervised learning algorithm used for
  clustering problems.
Thank you