Polynomial Curve Fitting
                        Polynomial Curve Fitting
                            A Simple Regression Problem
• We observe a real-valued input variable x and we wish to use this observation to
  predict the value of a real-valued target variable t.
• We use synthetically generated data from the function sin(2πx) with random noise
  included in the target values.
    – A small level of random noise having a Gaussian distribution
• We have a training set comprising N observations of x, written x ≡ (x1, . . . , xN)T,
  together with corresponding observations of the values of t, denoted t ≡ (t1, . . . , tN)T.
• Our goal is to predict the value of t for some new value of x.
                        Polynomial Curve Fitting
• A training data set of N = 10 points (blue circles).
• The green curve shows the actual
  function sin(2πx) used to generate the
  data.
• Our goal is to predict the value of t for
  some new value of x, without
  knowledge of the green curve.
                       Polynomial Curve Fitting
• We try to fit the data using a polynomial function of the form
    y(x, w) = w0 + w1 x + w2 x^2 + … + wM x^M = Σj wj x^j   (j = 0, …, M)
  where M is the order of the polynomial and the coefficients w0, …, wM are collectively denoted by the vector w.
                      Polynomial Curve Fitting
• The values of the coefficients will be
  determined by fitting the polynomial to
  the training data.
• This can be done by minimizing an
  error function that measures the misfit
  between the function y(x,w), for any
  given value of w, and the training set
  data points.
• Error Function: the sum of the squares of the errors between the predictions y(xn, w) for
  each data point xn and the corresponding target values tn,
    E(w) = (1/2) Σn ( y(xn, w) − tn )^2   (sum over the N training points)
                        Polynomial Curve Fitting
• We can solve the curve fitting problem by choosing the value of w for which E(w) is
  as small as possible.
• Since the error function is a quadratic function of the coefficients w, its derivatives
  with respect to the coefficients will be linear in the elements of w, and so the
  minimization of the error function has a unique solution, denoted by w*.
• The resulting polynomial is given by the function y(x,w*).
• Choosing the order M of the polynomial → model selection.
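• A minimal NumPy sketch of this least-squares fit, for illustration (the toy data, the noise level, and the helper names fit_polynomial and predict are assumptions for the example, not part of the slides):

    import numpy as np

    # Toy data: N = 10 noisy samples of sin(2*pi*x), as in the slides.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 10)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

    def fit_polynomial(x, t, M):
        """Return w* minimizing E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
        # Design matrix with columns x^0, x^1, ..., x^M.
        Phi = np.vander(x, M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
        return w

    def predict(x, w):
        """Evaluate y(x, w) = sum_j w_j * x^j."""
        return np.vander(np.atleast_1d(x), len(w), increasing=True) @ w

    w_star = fit_polynomial(x, t, M=3)   # cubic fit, as in the M = 3 slide
    print(predict(0.5, w_star))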
0th Order Polynomial
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
                        Polynomial Curve Fitting
• The 0th order (M=0) and first order (M=1) polynomials give rather poor fits to the data
  and consequently rather poor representations of the function sin(2πx).
• The third order (M=3) polynomial seems to give the best fit to the function sin(2πx)
  among the examples shown.
• When we go to a much higher order polynomial (M=9), we obtain an excellent fit to
  the training data.
    – In fact, the polynomial passes exactly through each data point and E(w*) = 0.
                        Polynomial Curve Fitting
• We obtain an excellent fit to the training data with the 9th order (M=9) polynomial.
• However, the fitted curve oscillates wildly and gives a very poor representation of the
  function sin(2πx).
• This behaviour is known as over-fitting.
                         Polynomial Curve Fitting
                                         Over-fitting
• We can then evaluate the residual value of E(w*) for the training data, and we can also
  evaluate E(w*) for the test data set.
• Root-Mean-Square (RMS) Error:
    E_RMS = sqrt( 2 E(w*) / N )
    – in which the division by N allows us to compare different sizes of data sets, and the square
      root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t.
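• A small sketch of this quantity (the name rms_error is illustrative; the commented usage reuses predict and w_star from the earlier sketch):

    import numpy as np

    def rms_error(y_pred, t):
        """E_RMS = sqrt(2 * E(w*) / N) with E(w) = 1/2 * sum_n (y_n - t_n)^2."""
        e = 0.5 * np.sum((np.asarray(y_pred) - np.asarray(t)) ** 2)
        return np.sqrt(2.0 * e / len(t))

    # Example: compare training and test error of a fitted polynomial.
    # rms_error(predict(x_train, w_star), t_train)
    # rms_error(predict(x_test, w_star), t_test)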
                       Polynomial Curve Fitting
                              Polynomial Coefficients
• Magnitude of coefficients increases dramatically as order of polynomial increases.
• Large positive and negative values so that the corresponding polynomial function
  matches each of the data points exactly, but between data points the function exhibits
  large oscillations → over-fitting
                       Polynomial Curve Fitting
• Increasing the size of the data set reduces the over-fitting problem.
• 9th Order Polynomial.
                       Polynomial Curve Fitting
                                    Regularization
• We may wish to use relatively complex and flexible models with data sets of limited
  size.
• The over-fitting phenomenon can be controlled with regularization, which involves
  adding a penalty term to the error function.
• Regularization: penalize large coefficient values by minimizing
    Ẽ(w) = (1/2) Σn ( y(xn, w) − tn )^2 + (λ/2) ‖w‖^2,   where ‖w‖^2 = w0^2 + w1^2 + … + wM^2
• The coefficient λ governs the relative importance of the regularization term compared
  with the sum-of-squares error term.
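• A sketch of the corresponding closed-form fit (ridge-style regularized least squares; the name fit_polynomial_regularized is illustrative):

    import numpy as np

    def fit_polynomial_regularized(x, t, M, lam):
        """Minimize 1/2 * sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2."""
        Phi = np.vander(x, M + 1, increasing=True)   # columns x^0 ... x^M
        A = lam * np.eye(M + 1) + Phi.T @ Phi        # regularized normal equations
        return np.linalg.solve(A, Phi.T @ t)

    # lam = 0 recovers the unregularized fit; large lam shrinks the coefficients.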
                       Polynomial Curve Fitting
                                     Regularization
• Plots of M = 9 polynomials fitted to the data set using the regularized error function:
  (left) no regularization (λ = 0);   (right) too much regularization
                     Polynomial Curve Fitting
                             Regularization
• Polynomial Coefficients
• Graph of the root-mean-square error versus ln λ for the M = 9 polynomial.
Linear Basis Function Models
                  Linear Basis Function Models
• Generally,
    y(x, w) = Σj wj 𝜙j(x)   (j = 0, …, M−1),   with 𝜙0(x) = 1 so that w0 acts as a bias,
  where the 𝜙j(x) are known as basis functions.
                   Linear Basis Function Models
                                 Linear Regression
• The simplest linear model for regression is one that involves a linear combination of
  the input variables:
    y(x, w) = w0 + w1 x1 + … + wD xD
• It is often simply known as linear regression.
• In the basis-function notation this corresponds to
    𝜙0(x) = 1   (a constant dummy input, so that w0 is the bias)
    𝜙j(x) = xj   for j > 0
                Linear Basis Function Models
           Polynomial Curve Fitting: Polynomial basis functions
• 𝜙j(x) = x^j
                   Linear Basis Function Models
                              Gaussian basis functions
    𝜙j(x) = exp( −(x − μj)^2 / (2 s^2) )
where the μj govern the locations of the basis functions in input space, and the parameter s
governs their spatial scale.
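• A sketch of how such a design matrix can be built (the centres and the scale below are arbitrary choices for illustration):

    import numpy as np

    def gaussian_basis(x, mus, s):
        """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per centre mu_j."""
        x = np.atleast_1d(x)[:, None]      # shape (N, 1)
        mus = np.asarray(mus)[None, :]     # shape (1, M)
        return np.exp(-(x - mus) ** 2 / (2.0 * s ** 2))

    # Nine basis functions with centres spread over [0, 1] and scale s = 0.1.
    Phi = gaussian_basis(np.linspace(0, 1, 50), np.linspace(0, 1, 9), s=0.1)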
                    Linear Basis Function Models
                              Sigmoidal basis functions
    𝜙j(x) = σ( (x − μj) / s )
where σ(a) is the logistic sigmoid function defined by
    σ(a) = 1 / (1 + exp(−a))
• Equivalently, we can use the ‘tanh’ function, because it is related to the logistic sigmoid by
    tanh(a) = 2σ(2a) − 1
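• A quick numerical check of this identity (illustrative only):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    a = np.linspace(-5, 5, 101)
    # tanh(a) = 2*sigma(2a) - 1, so tanh basis functions differ from logistic
    # ones only by a linear rescaling of the weights and the bias.
    print(np.allclose(np.tanh(a), 2.0 * sigmoid(2.0 * a) - 1.0))   # True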
INSTANCE-BASED LEARNING
                  INSTANCE-BASED LEARNING
• Instance-based learning methods simply store the training examples instead of learning
  an explicit description of the target function.
   – Generalizing the examples is postponed until a new instance must be classified.
   – When a new instance is encountered, its relationship to the stored examples is
      examined in order to assign a target function value for the new instance.
• Instance-based learning includes nearest neighbor, locally weighted regression and
  case-based reasoning methods.
• Instance-based methods are sometimes referred to as lazy learning methods because
  they delay processing until a new instance must be classified.
• A key advantage of lazy learning is that instead of estimating the target function once
  for the entire instance space, these methods can estimate it locally and differently for
  each new instance to be classified.
                        k-Nearest Neighbor Learning
• The k-Nearest Neighbor Learning algorithm assumes all instances correspond to points in
  the n-dimensional space R^n.
• The nearest neighbors of an instance are defined in terms of Euclidean distance.
• The Euclidean distance between the instances xi = <xi1, …, xin> and xj = <xj1, …, xjn> is:
    d(xi, xj) = sqrt( Σr (xir − xjr)^2 )   (sum over r = 1, …, n)
• For a given query instance xq, f(xq) is estimated from the function values of the k nearest
  neighbors of xq.
                    k-Nearest Neighbor Learning
• Store all training examples <xi,f(xi)>
• Calculate f(xq) for a given query instance xq using k-nearest neighbor
• Nearest neighbor: (k=1)
   – Locate the nearest training example xn, and estimate f(xq) as f(xq) ← f(xn)
• k-Nearest neighbor:
   – Locate the k nearest training examples, and estimate f(xq) as follows:
   – If the target function is real-valued, take the mean of the f-values of the k nearest neighbors:
        f(xq) = (1/k) Σi f(xi)   (sum over the k nearest neighbors xi)
   – If the target function is discrete-valued, take a vote among the f-values of the k nearest
     neighbors.
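• A compact sketch of this procedure (the name knn_predict and the array-based interface are assumptions for the example):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_query, k, discrete=True):
        """k-nearest-neighbor estimate of f(x_query) under Euclidean distance."""
        d = np.sqrt(np.sum((np.asarray(X_train) - np.asarray(x_query)) ** 2, axis=1))
        nearest = np.argsort(d)[:k]
        if discrete:
            # Discrete-valued target: majority vote among the k nearest neighbors.
            return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]
        # Real-valued target: mean of the k nearest f-values.
        return float(np.mean(np.asarray(y_train)[nearest]))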
             When To Consider Nearest Neighbor
• Instances map to points in R^n
• Less than 20 attributes per instance
• Lots of training data
• Advantages
   – Training is very fast
   – Learn complex target functions
   – Can handle noisy data
   – Does not lose any information
• Disadvantages
   – Slow at query time
   – Easily fooled by irrelevant attributes
Distance-Weighted kNN
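• A common refinement of kNN is to weight the contribution of each of the k nearest neighbors by
  the inverse of its distance to the query instance xq (for example wi = 1/d(xq, xi)), so that closer
  neighbors have a larger influence on the vote or on the average; the distance-weighted example
  below uses exactly this kind of weighting.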
       k-Nearest Neighbor Classification - Example
  3-Nearest Neighbor Classification of instance <3,3>

    A   B   Class   Squared distance to <3,3>
    1   1   no             8
    2   1   no             5
    3   2   yes            1
    7   7   yes           32
    8   8   yes           50
• The first three examples are the 3 nearest neighbors of instance <3,3>.
• Two of them are no and one of them is yes.
• Since the majority class among its neighbors is no, the classification of instance <3,3> is no.
  Distance Weighted kNN Classification- Example
  Distance-Weighted 3-Nearest Neighbor Classification of instance <3,3>

    A   B   Class   Squared distance to <3,3>
    1   1   no             8
    2   1   no             5
    3   2   yes            1
    7   7   yes           32
    8   8   yes           50
• The first three examples are the 3 nearest neighbors of instance <3,3>.
• Weight of no = 1/8 + 1/5 = 13/40          Weight of yes = 1/1 = 1
• Since 1 > 13/40, the classification of instance <3,3> is yes.
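• A small sketch reproducing the arithmetic of this example (squared Euclidean distances, as in the table, and 1/distance weights):

    from collections import defaultdict

    # Training examples (A, B) -> class, and the query <3,3>, from the table above.
    train = [((1, 1), "no"), ((2, 1), "no"), ((3, 2), "yes"),
             ((7, 7), "yes"), ((8, 8), "yes")]
    query = (3, 3)

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    # Three nearest neighbors, each weighted by the inverse of its distance.
    neighbours = sorted(train, key=lambda ex: sq_dist(ex[0], query))[:3]
    votes = defaultdict(float)
    for x, label in neighbours:
        votes[label] += 1.0 / sq_dist(x, query)

    print(dict(votes))                 # {'yes': 1.0, 'no': 0.325}
    print(max(votes, key=votes.get))   # yes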
         k-Nearest Neighbor Classification - Issues
• Choosing the value of k:
   – If k is too small, sensitive to noise points
   – If k is too large, neighborhood may
     include points from other classes.
• Scaling issues:
   – Attributes may have to be scaled to
     prevent distance measures from being
     dominated by one of the attributes
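• One common choice is min-max scaling of each attribute to [0, 1]; a small sketch (the helper name min_max_scale is illustrative):

    import numpy as np

    def min_max_scale(X):
        """Rescale each attribute to [0, 1] so that no single attribute
        dominates the Euclidean distance."""
        X = np.asarray(X, dtype=float)
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / np.where(hi > lo, hi - lo, 1.0)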
Curse of Dimensionality
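• With many attributes, Euclidean distances are dominated by the large number of (possibly
  irrelevant) attributes, so the "nearest" neighbors in the full space may not be near at all with
  respect to the few attributes that actually determine the target; this is the curse of
  dimensionality that makes kNN easily fooled by irrelevant attributes.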
                     Locally Weighted Regression
• kNN forms a local approximation to f for each query point xq.
• Why not form an explicit approximation f(x) for the region surrounding xq?
   → Locally Weighted Regression
• Locally weighted regression uses nearby or distance-weighted training examples to
  form this local approximation to f.
• We might approximate the target function in the neighborhood surrounding x using a
  linear function, a quadratic function, or a multilayer neural network.
• The phrase "locally weighted regression" is so called because it is
    – local because the function is approximated based only on data near the query point,
    – weighted because the contribution of each training example is weighted by its distance
      from the query point, and
    – regression because this is the term used widely in the statistical learning community for the
      problem of approximating real-valued functions.
                    Locally Weighted Regression
• Given a new query instance xq, the general approach in locally weighted regression
  is to construct an approximation f that fits the training examples in the
  neighborhood surrounding xq.
• This approximation is then used to calculate the value f(xq), which is output as the
  estimated target value for the query instance.
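• A sketch of locally weighted linear regression with a Gaussian distance kernel (the kernel choice, the bandwidth tau, and the function name are illustrative assumptions):

    import numpy as np

    def locally_weighted_regression(X_train, y_train, x_query, tau=0.5):
        """X_train: (N, D) array; y_train: (N,) array; x_query: (D,) array.
        Fit a linear approximation around x_query, weighting each training
        example by a Gaussian kernel of its distance to the query."""
        X_train = np.asarray(X_train, dtype=float)
        y_train = np.asarray(y_train, dtype=float)
        x_query = np.asarray(x_query, dtype=float)
        X = np.column_stack([np.ones(len(X_train)), X_train])   # add bias column
        xq = np.concatenate(([1.0], x_query))
        # Kernel weights: nearby examples count more than distant ones.
        k = np.exp(-np.sum((X_train - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
        # Weighted least squares: minimize sum_n k_n * (w @ x_n - y_n)^2.
        w = np.linalg.solve(X.T @ (k[:, None] * X), X.T @ (k * y_train))
        return float(xq @ w)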
                           Case-based reasoning
• Instance-based methods
   – lazy
   – classification based on classifications of near (similar) instances
   – data: points in n-dim. space
• Case-based reasoning
   – as above, but data represented in symbolic form
    – new distance metrics required
                        Lazy & eager learning
• Lazy: generalize at query time
   – kNN, CBR
• Eager: generalize before seeing query
   – regression, ANN, ID3, …
• Difference
   – eager learning must create a single global approximation
   – lazy learning can create many local approximations
   – lazy learning can represent more complex functions using the same hypothesis space H
     (e.g., H = linear functions)