STAT6121-ML
Introduction
Machine learning
Machine learning approaches to data analysis cannot be done without computers
Common with statistical modelling/analysis
• For prediction and classification
• Requires an optimisation procedure
• Obtain parameters or functions from observations
• Uncertainty of learning vs. prediction/classification
New elements or techniques
• Explanation or theoretical constructs are not the emphasis
• Data can be ‘organic’, such as text or images
• Distinction of training vs. validation/test data
• Reliance on ready-made software for implementation
                Some broad remarks
Supervised vs. unsupervised learning
• Target outcome y and covariates/features x?
  NB. log-linear models of contingency tables
• Can the learned result be applied to unseen units?
  NB. principal components, clustering
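A small R illustration of the ‘unseen units’ question (simulated data; predict() has a method for prcomp objects, whereas kmeans returns no prediction rule, so an unseen unit must be assigned by hand):

x = matrix(rnorm(200), 100, 2)
pc = prcomp(x)
predict(pc, newdata=matrix(rnorm(10), 5, 2))   # PC scores for unseen units

km = kmeans(x, centers=3)
x.new = rnorm(2)                               # an unseen unit
which.min(colSums((t(km$centers) - x.new)^2))  # nearest-centroid assignment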
Prediction vs. classification
• Best prediction of y is its expectation µx = E(y | x)
          E{(y − µ)² | x} = (µx − µ)² + E{(y − µx)² | x}
• Best classification of categorical y is
                    y0 = arg max_{y′} Pr(y = y′ | x)
  e.g. if y ∼ N(µ, σ²), then E(y) = µ but Pr(y = µ) = 0;
  however, let z = I(y > µ − σ), then z0 = 1, since Pr(z = 1) = Φ(1) ≈ 0.84 > 0.5
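A quick numerical check of this example in R (µ = 0 and σ = 1 assumed for illustration):

mu = 0; sigma = 1
# Pr(y = mu) = 0 for continuous y, but
# Pr(z = 1) = Pr(y > mu - sigma) = pnorm(1) ~ 0.84 > 0.5, hence z0 = 1
pnorm(mu - sigma, mean=mu, sd=sigma, lower.tail=FALSE)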
                  Some broad remarks
Parametric vs. non-parametric models
• function/model f(x; θ) fixed given θ, i.e. the parameters
              f(x) = E(y | x)   or   f(x) = Pr(y | x)
• parametric if θ contains a fixed number of constants
  NB. linear regression model as a typical example
• non-parametric if the number of unknowns in θ grows with the
  number of observations, or if f is indeterminate in advance
Error vs. residual
• Given f(x) = E(y | x) or y0 = arg max_{y′} Pr(y = y′ | x), the error is
               e = y − f(x)     or     e = I(y ≠ y0)
• Given fˆ or ŷ0 as the estimate of f or y0, the residual is
               ê = y − fˆ(x)    or     ê = I(y ≠ ŷ0)
  if (y, x) are used for obtaining fˆ or ŷ0
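A minimal illustration of the distinction in R, under an assumed linear setup (β0 = 0.5, β1 = 1, σ = 1, as in the exercise code below):

x = runif(100, 1, 10)
f = 0.5 + x                # true f(x), known here only because we simulate
y = f + rnorm(100)
fit = lm(y ~ x)
e = y - f                  # errors: unobservable in practice
e.hat = residuals(fit)     # residuals: computed from the fitted model
c(var(e), var(e.hat))      # residuals tend to be slightly less variable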
              Bias-variance trade-off
Eq. (2.7), mean squared error (MSE) of fˆ(x) for y given x
    E{(y − fˆ(x))²} = E{(y − f(x) + f(x) − fˆ(x))²}
                    = E{(y − f(x))²} + E{(f(x) − fˆ(x))²}
                      − 2 E{(y − f(x))(f(x) − fˆ(x))}
                    = V(e(x)) + V(fˆ(x)) + Bias(fˆ(x))²
over fˆ(x) and y that are independent of each other, so the cross
term vanishes, since E{y − f(x) | x} = 0
NB. V(e(x)) is unaffected by whichever fˆ is used
Q: Can we reduce V(fˆ(x)) and Bias(fˆ(x))² at the same time?
• to reduce V(fˆ(x)), let fˆ(x) be obtained based on many
  observations, e.g. by using parametric f(x; θ)...
• to reduce Bias(fˆ(x))², let fˆ(x) only depend on close-by
  observations, provided f is reasonably smooth...
• hence, the bias-variance trade-off (see the R sketch below)
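A minimal simulation sketch of the trade-off in R, under assumed settings (true f(x) = log(x), σ = 1; the name sim.bv and the two estimators, a global linear fit and a K-nearest average, are illustrative choices, not prescribed here):

sim.bv <- function(R=1000, n=100, x0=5, K=5)
{
  f = function(x) log(x)            # assumed true mean function
  x = seq(1, 10, length=n)
  fhat1 = fhat2 = numeric(R)
  for (r in 1:R) {
    y = f(x) + rnorm(n)
    b = coef(lm(y ~ x))             # global fit: low variance, biased here
    fhat1[r] = b[1] + b[2]*x0
    idx = order(abs(x - x0))[1:K]   # K closest observations to x0
    fhat2[r] = mean(y[idx])         # local average: low bias, higher variance
  }
  c(var.lin=var(fhat1), bias2.lin=(mean(fhat1)-f(x0))^2,
    var.knn=var(fhat2), bias2.knn=(mean(fhat2)-f(x0))^2)
}
sim.bv()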
                        Ch. 3, exercise 4
Answer by ML, e.g. x ∈ (1, 10) and f(x) = β0 + β1 log(x) for (c)
get.dta <- function(n=100, beta=c(0.5,1), nonlnr=FALSE)
{
  # equally spaced x over (1, 10)
  x = seq(1, 10, length=n)
  # true mean function: linear or log-linear in x
  if (nonlnr) { f = beta[1] + beta[2]*log(x) }
  else { f = beta[1] + beta[2]*x }
  # standard normal noise around the true mean
  y = f + rnorm(n, 0, 1)
  x2 = x^2; x3 = x^3
  data.frame(y, x, x2, x3, f)
}
main <- function(n=100, beta=c(0.5,1), nonlnr=FALSE, vis=FALSE)
{
  dta = get.dta(n=n, beta=beta, nonlnr=nonlnr)
  if (nonlnr) { cat("data generated under nonlinear model\n\n") }
  else { cat("data generated under linear model\n\n") }
  cat("fitting simple linear regression:\n")
  print(summary(lm(y ~ x, data=dta)))
  cat("fitting cubic (polynomial) regression:\n")
  print(summary(lm(y ~ x + x2 + x3, data=dta)))
  # scatter plot of the data with the true mean function overlaid
  if (vis) { plot(dta$x, dta$y); lines(dta$x, dta$f) }
}
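For instance (illustrative calls; output not shown):

main(n=100, nonlnr=FALSE)            # data generated under the linear model
main(n=100, nonlnr=TRUE, vis=TRUE)   # the log model of (c), with the plot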
                    Additional exercise
[Figure: scatter plot of y against equally spaced x, with the two fitted lines]
Equally spaced x, fˆ1(x) = β̂x (solid), fˆ2(x) = y (dashed)
• What is V(fˆ(x)) at any given x, for fˆ = fˆ1 or fˆ2?
  What can you say about Bias(fˆ(x))?
Consider KNN predictor given K
• How would you apply the method if x = 5 or 10?
• What about V(fˆ(x)) and Bias(fˆ(x)) in this case?
                      Additional exercise
        β̂ = Σᵢ xᵢyᵢ / Σᵢ xᵢ²    (sums over i = 1, ..., n)
V(fˆ1(x)) = V(β̂x) = x² V(β̂)
          = x² V(yᵢ | xᵢ) Σᵢ xᵢ² / (Σᵢ xᵢ²)² = x² V(yᵢ | xᵢ) / Σᵢ xᵢ²
V̂(yᵢ | xᵢ) = Σᵢ (yᵢ − β̂xᵢ)² / (n − 1)
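In R, with dta as produced by get.dta above (a small sketch; the names beta.hat, v.res and v.f1 are illustrative):

dta = get.dta(n=100)                                       # data as in Ch. 3 ex. 4
beta.hat = sum(dta$x * dta$y) / sum(dta$x^2)               # LS through the origin
v.res = sum((dta$y - beta.hat*dta$x)^2) / (nrow(dta) - 1)  # Vhat(y_i | x_i)
v.f1 = function(x0) x0^2 * v.res / sum(dta$x^2)            # Vhat(fhat1(x0))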
V(fˆ2(x)) = V(y | x) = V(yᵢ | xᵢ)    NB. V̂(yᵢ | xᵢ) non-existent here, with
only one observation at each xᵢ
For the KNN predictor given K, averaging the K nearest neighbours yⱼ(x):
      fˆ(x) = Σⱼ yⱼ(x) / K    (sums over j = 1, ..., K)
V(fˆ(x)) = Σⱼ V(yⱼ(x)) / K² = V(y | x) / K, if V(yⱼ(x)) ≈ V(y | x) close by
V̂(yⱼ(x)) = Σⱼ (yⱼ(x) − fˆ(x))² / (K − 1)    NB. from K obs.
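A corresponding R sketch of the KNN predictor and its variance estimate (the function knn.fit is illustrative):

knn.fit <- function(x0, x, y, K=5)
{
  idx = order(abs(x - x0))[1:K]     # the K nearest neighbours of x0
  f.hat = mean(y[idx])              # fhat(x0)
  v.y = var(y[idx])                 # Vhat(y_j(x0)), divisor K - 1 as above
  c(f.hat=f.hat, v.fhat=v.y/K)      # Vhat(fhat(x0)) = Vhat(y_j(x0))/K
}
knn.fit(5, x=dta$x, y=dta$y)        # e.g. at x0 = 5, with dta from above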
Assume unbiasedness in all the cases...