Linear Regression
Yijun Zhao
   Northeastern University
       Fall 2016
Regression Examples
     Any attributes x   =⇒   continuous value y

         {age, major, gender, race}           =⇒  GPA
         {income, credit score, profession}   =⇒  loan
         {college, major, GPA}                =⇒  future income
         ...
Regression Examples
   Data often comes in, or can be converted into, matrix form:
        Age   Gender   Race   Major      GPA
        20    0        A      Art        3.85
        22    0        C      Engineer   3.90
        25    1        A      Engineer   3.50
        24    0        AA     Art        3.60
        19    1        H      Art        3.70
        18    1        C      Engineer   3.00
        30    0        AA     Engineer   3.80
        25    0        C      Engineer   3.95
        28    1        A      Art        4.00
        26    0        C      Engineer   3.20
Formal Problem Setup
   Given N observations
             {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)},
   a regression problem tries to uncover the function
               y_i = f(x_i)   ∀ i = 1, 2, ..., N
   such that, for a new input value x*, we can
   accurately predict the corresponding value
                           y* = f(x*).
Linear Regression
      Assume the function f is a linear combination of the components of x.
      Formally, let x = (1, x_1, x_2, ..., x_d)^T. Then

          y = ω_0 + ω_1 x_1 + ω_2 x_2 + ··· + ω_d x_d = w^T x

      where w = (ω_0, ω_1, ω_2, ..., ω_d)^T.
      w is the parameter vector to estimate!
      Prediction:  y* = w^T x*
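      As a tiny numeric illustration (the weights and input below are made up,
      not from the slides), prediction is just a dot product once the constant 1
      is prepended to the input:

          import numpy as np

          # Hypothetical learned weights w = (ω_0, ω_1, ω_2): intercept plus two feature weights.
          w = np.array([0.5, 0.03, -0.2])

          # A new input x* = (1, x_1, x_2); the leading 1 pairs with the intercept ω_0.
          x_star = np.array([1.0, 21.0, 1.0])

          y_star = w @ x_star     # prediction y* = w^T x*
          print(y_star)           # 0.5 + 0.03*21 - 0.2*1 = 0.93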
Visual Illustration
         Figure: 1D and 2D linear regression
Error Measure
  Mean Squared Error (MSE):
       E(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)^2
            = (1/N) ||Xw − y||^2

   where X is the matrix whose n-th row is x_n^T (one observation per row)
   and y = (y_1, y_2, ..., y_N)^T.
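   A minimal numpy sketch of this error measure, under the same conventions
   (one observation per row of X, each row starting with the constant 1):

       import numpy as np

       def mse(w, X, y):
           """E(w) = (1/N) * ||Xw - y||^2, with one observation per row of X."""
           residual = X @ w - y
           return residual @ residual / len(y)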
Minimizing Error Measure
   E(w) = (1/N) ||Xw − y||^2

   Setting the gradient to zero:

       ∇E(w) = (2/N) X^T (Xw − y) = 0
       X^T X w = X^T y
       w = X† y

   where X† = (X^T X)^{-1} X^T is the 'pseudo-inverse' of X.
LR Algorithm Summary
     Ordinary Least Squares (OLS) Algorithm
       Construct the matrix X and the vector y from the dataset
       {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} (each x includes x_0 = 1):

           X = matrix with rows x_1^T, x_2^T, ..., x_N^T,      y = (y_1, y_2, ..., y_N)^T

       Compute X† = (X^T X)^{-1} X^T
       Return w = X† y
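     A possible numpy version of this summary (np.linalg.pinv computes the
     pseudo-inverse X†); the function name and interface are illustrative:

         import numpy as np

         def ols_fit(X_raw, y):
             """Ordinary least squares: build X with a leading column of 1s, return w = X† y."""
             N = X_raw.shape[0]
             X = np.hstack([np.ones((N, 1)), X_raw])   # each row starts with x_0 = 1
             return np.linalg.pinv(X) @ y              # X† y; equals (X^T X)^{-1} X^T y when X^T X is invertible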
Gradient Descent
      Why?
      To minimize the target function E(w) by moving downhill in the
      direction of steepest descent.
Gradient Descent
        Gradient Descent Algorithm
       Initialize the weights w(0) at time t = 0
       for t = 0, 1, 2, ... do
           Compute the gradient g_t = ∇E(w(t))
           Set the direction to move: v_t = −g_t
           Update w(t + 1) = w(t) + η v_t
           Iterate until it is time to stop
       Return the final weights w
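       A sketch of this loop for the MSE objective; the fixed iteration budget
       and the default learning rate are placeholder choices, not part of the
       algorithm as stated:

           import numpy as np

           def gradient_descent(X, y, eta=0.1, n_iters=1000):
               """Minimize E(w) = (1/N) ||Xw - y||^2 by plain gradient descent."""
               N, d = X.shape
               w = np.zeros(d)                              # w(0)
               for _ in range(n_iters):                     # fixed budget as the stopping rule
                   g = (2.0 / N) * (X.T @ (X @ w - y))      # g_t = ∇E(w(t))
                   w = w - eta * g                          # w(t+1) = w(t) + η * (-g_t)
               return w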
Gradient Descent
   How does η affect the algorithm?
       Fixed step size: η = 0.1 works well in practice
       Variable step size: η_t = η ||∇E||
OLS or Gradient Descent?
Computational Complexity
        OLS vs. Gradient Descent:
            OLS forms and inverts X^T X, which costs roughly O(N·d^2 + d^3).
            Each gradient-descent step costs only roughly O(N·d).
         OLS is expensive when the number of features d is large!
Linear Regression
  What is the Probabilistic Interpretation?
Normal Distribution
        Figure: right-skewed, left-skewed, and random distributions, compared with the normal distribution
Normal Distribution
     mean = median = mode
     symmetry about the center
      x ∼ N(µ, σ^2)  =⇒  f(x) = (1 / (σ√(2π))) e^{−(x−µ)^2 / (2σ^2)}
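      The density above, written out directly as a small numpy sketch:

          import numpy as np

          def normal_pdf(x, mu=0.0, sigma=1.0):
              """Density of N(mu, sigma^2) evaluated at x."""
              return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))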
Central Limit Theorem
     All things bell shaped!
      Random occurrences over a large population tend to wash out the
      asymmetry or uniformity of the individual events; a more 'natural'
      distribution ensues. The name for it is the normal distribution
      (the bell curve).
      Formal definition: if y_1, ..., y_n are i.i.d. with mean µ_y and
      variance σ_y^2, where 0 < σ_y^2 < ∞, then for large n the
      distribution of the sample mean ȳ is well approximated by the
      normal distribution N(µ_y, σ_y^2 / n).
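      A quick simulation of the theorem (the sample size n = 50 and the choice
      of an exponential source distribution are arbitrary): averages of i.i.d.
      right-skewed draws concentrate around µ_y with spread close to σ_y/√n.

          import numpy as np

          rng = np.random.default_rng(0)
          n, trials = 50, 10_000

          # Exponential(1) is right-skewed with mean 1 and variance 1.
          samples = rng.exponential(scale=1.0, size=(trials, n))
          means = samples.mean(axis=1)        # 10,000 realizations of the sample mean ȳ

          print(means.mean())   # ≈ µ_y = 1
          print(means.std())    # ≈ σ_y / √n = 1 / √50 ≈ 0.141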
Central Limit Theorem
  Example:
LR: Probabilistic Interpretation
LR: Probabilistic Interpretation
   Assume y_i = w^T x_i + ε_i with Gaussian noise ε_i ∼ N(0, σ^2). Then

       prob(y_i | x_i) = (1 / (√(2π) σ)) e^{−(w^T x_i − y_i)^2 / (2σ^2)}
LR: Probabilistic Interpretation
  Likelihood of the entire dataset:
         L ∝ Π_i e^{−(w^T x_i − y_i)^2 / (2σ^2)}
           = e^{−Σ_i (w^T x_i − y_i)^2 / (2σ^2)}

   Maximize L  ⇐⇒  Minimize Σ_i (w^T x_i − y_i)^2
Non-linear Transformation
     Linear is limited:
     Linear models become powerful when we
     consider non-linear feature transformations:
      X_i = (1, x_i, x_i^2)  =⇒  y_i = ω_0 + ω_1 x_i + ω_2 x_i^2
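      A sketch of this idea on made-up quadratic data, reusing the
      pseudo-inverse solution from the OLS slides:

          import numpy as np

          rng = np.random.default_rng(1)
          x = rng.uniform(-3, 3, size=100)
          y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=100)   # quadratic data + noise

          # Non-linear feature map: X_i = (1, x_i, x_i^2); the model is still linear in w.
          X = np.column_stack([np.ones_like(x), x, x**2])
          w = np.linalg.pinv(X) @ y
          print(w)   # roughly (1.0, -2.0, 0.5)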
Overfitting
Overfitting
       How do we know we have overfitted?
       E_in : error on the training data
       E_out : error on the test data
      Example:
Overfitting
  How to avoid overfitting?
     Use more data
      Evaluate on a parameter tuning set
      Regularization
Regularization
      Attempts to impose the "Occam's razor" principle
      Add a penalty term for model complexity
      Most commonly used:
          L2 regularization (ridge regression) minimizes:
                   E(w) = ||Xw − y||^2 + λ ||w||^2
          where λ ≥ 0 and ||w||^2 = w^T w
          L1 regularization (LASSO) minimizes:
                   E(w) = ||Xw − y||^2 + λ ||w||_1
          where λ ≥ 0 and ||w||_1 = Σ_{i=1}^{d} |ω_i|
Regularization
      L2: closed-form solution (a small numeric sketch follows below)
               w = (X^T X + λI)^{-1} X^T y
      L1: no closed-form solution; use quadratic programming:
               minimize ||Xw − y||^2    s.t.  ||w||_1 ≤ s
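      A minimal sketch of the ridge closed form; for simplicity it penalizes
      the intercept along with the other weights, which is a convention choice,
      and the function name is illustrative:

          import numpy as np

          def ridge_fit(X, y, lam):
              """L2-regularized least squares: w = (X^T X + λI)^{-1} X^T y."""
              d = X.shape[1]
              return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)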
L2 Regularization Example
Model Selection
  Which model?
     A central problem in supervised learning
      Simple model: "underfits" the data
          Constant function
          Linear model applied to quadratic data
      Complex model: "overfits" the data
          High-degree polynomials
          A model with enough hidden logic to fit the data to completion
Bias-Variance Trade-off
   Consider E[ (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)^2 ];  let ŷ = w^T x_n
       E[(ŷ − y_n)^2] can be decomposed into (reading):
            var{noise} + bias^2 + var{ŷ}
       var{noise}: can't be reduced
       bias^2 + var{ŷ} is what counts for prediction
       High bias^2: model mismatch, often due to "underfitting"
       High var{ŷ}: training-set / test-set mismatch, often due to "overfitting"
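       One way to see the decomposition empirically (the true function, noise
       level, and polynomial degrees below are arbitrary choices): refit models
       on many resampled training sets and measure, at one test point, the
       squared bias of the average prediction and the variance of the predictions.

           import numpy as np

           rng = np.random.default_rng(2)
           truth = np.sin            # assumed "true" function; noise std = 0.1
           x_star = 1.0              # fixed test input

           for degree in (1, 3, 9):
               preds = []
               for _ in range(500):                               # many independent training sets
                   x = rng.uniform(-np.pi, np.pi, size=20)
                   y = truth(x) + rng.normal(scale=0.1, size=20)
                   coeffs = np.polyfit(x, y, deg=degree)          # least-squares polynomial fit
                   preds.append(np.polyval(coeffs, x_star))
               preds = np.array(preds)
               bias_sq = (preds.mean() - truth(x_star)) ** 2      # squared bias at x*
               variance = preds.var()                             # var{ŷ} at x*
               print(degree, bias_sq, variance)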
Bias-Variance Trade-off
     Often:       low bias ⇒ high variance
                  low variance ⇒ high bias
     Trade-off:
How to choose λ ?
  But we still need to pick λ.
       Use the test-set data? NO!
       Set aside another evaluation set
            Small evaluation set ⇒ inaccurate error estimate
            Large evaluation set ⇒ small training set
       Cross-validation
Cross Validation (CV)
      Divide the data into K folds
      In turn, train on all folds except the k-th, and test on the k-th fold
Cross Validation (CV)
     How to choose K?
     Common choices: K = 5, 10, or N (leave-one-out CV, LOOCV)
     Measure the average performance across the K folds
     Computational cost: K folds × number of candidate λ values
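     A sketch of K-fold cross-validation for choosing λ, reusing the hypothetical
     ridge_fit from the regularization sketch above; K = 5 and the candidate λ
     grid passed in are up to the user:

         import numpy as np

         def cv_choose_lambda(X, y, lambdas, K=5, seed=0):
             """Pick the λ with the lowest average validation MSE over K folds."""
             idx = np.random.default_rng(seed).permutation(len(y))
             folds = np.array_split(idx, K)
             avg_err = []
             for lam in lambdas:
                 fold_err = []
                 for k in range(K):
                     val = folds[k]                                 # held-out k-th fold
                     train = np.concatenate([folds[j] for j in range(K) if j != k])
                     w = ridge_fit(X[train], y[train], lam)         # train on the remaining folds
                     r = X[val] @ w - y[val]
                     fold_err.append(r @ r / len(val))
                 avg_err.append(np.mean(fold_err))
             return lambdas[int(np.argmin(avg_err))]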
Learning Curve
   A learning curve plots the performance of the algorithm as a
   function of the training-set size.
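   A sketch of how such a curve could be computed for OLS (the use of
   matplotlib and the specific train/test split are assumptions):

       import numpy as np
       import matplotlib.pyplot as plt

       def plot_learning_curve(X_train, y_train, X_test, y_test, sizes):
           """Plot train/test MSE of OLS as the training-set size grows."""
           train_err, test_err = [], []
           for m in sizes:
               w = np.linalg.pinv(X_train[:m]) @ y_train[:m]          # OLS on the first m examples
               train_err.append(np.mean((X_train[:m] @ w - y_train[:m]) ** 2))
               test_err.append(np.mean((X_test @ w - y_test) ** 2))
           plt.plot(sizes, train_err, label="E_in (training error)")
           plt.plot(sizes, test_err, label="E_out (test error)")
           plt.xlabel("training set size")
           plt.ylabel("MSE")
           plt.legend()
           plt.show()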
Learning Curve