Logistic Regression
Statistics 102
Colin Rundel
1 Background
2 GLMs
3 Logistic Regression
4 Additional Example
Background
Odds
Odds are another way of quantifying the probability of an event,
commonly used in gambling (and logistic regression).
Odds
For some event E ,
$$\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1 - P(E)}$$
Similarly, if we are told the odds of E are x to y, then

$$\text{odds}(E) = \frac{x}{y} = \frac{x/(x+y)}{y/(x+y)}$$

which implies

$$P(E) = \frac{x}{x+y}, \qquad P(E^c) = \frac{y}{x+y}$$
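The odds/probability conversions above can be sketched in code; the lecture itself uses R, so this Python version is only a hypothetical illustration:

```python
def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1 - p)

def prob_from_odds(x, y):
    """Probability implied by odds of x to y."""
    return x / (x + y)

print(odds(0.75))            # 3.0, i.e. odds of 3 to 1
print(prob_from_odds(3, 1))  # 0.75
```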
GLMs
In 1846 the Donner and Reed families left Springfield, Illinois, for California
by covered wagon. In July, the Donner Party, as it became known, reached
Fort Bridger, Wyoming. There its leaders decided to attempt a new and
untested route to the Sacramento Valley. Having reached its full size of 87
people and 20 wagons, the party was delayed by a difficult crossing of the
Wasatch Range and again in the crossing of the desert west of the Great
Salt Lake. The group became stranded in the eastern Sierra Nevada
mountains when the region was hit by heavy snows in late October. By
the time the last survivor was rescued on April 21, 1847, 40 of the 87
members had died from famine and exposure to extreme cold.
From Ramsey, F.L. and Schafer, D.W. (2002). The Statistical Sleuth: A Course in Methods of Data Analysis (2nd ed)
[Figure: bar chart of Donner Party members who Died vs. Survived]
It seems clear that both age and gender have an effect on someone's
survival. How do we come up with a model that will let us explore this
relationship?
One way to think about the problem: we can treat Survived and Died as
successes and failures arising from a binomial distribution where the
probability of a success is given by a transformation of a linear model of
the predictors.
It turns out that this is a very general way of addressing this type of
problem in regression, and the resulting models are called generalized
linear models (GLMs). Logistic regression is just one example of this type
of model.
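A toy Python sketch of this idea, with made-up coefficients (not fitted values from the Donner data), showing how a linear model of a predictor is transformed into a success probability:

```python
import math

def inv_logit(eta):
    """Transform a linear predictor into a probability in (0, 1)."""
    return 1 / (1 + math.exp(-eta))

# Coefficients below are made up purely for illustration
for age in [5, 20, 35, 50, 65]:
    eta = 1.8 - 0.07 * age                 # linear model of the predictor
    print(age, round(inv_logit(eta), 2))   # implied survival probability
```

Survival probability decreases smoothly with age, yet always stays between 0 and 1.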
Logistic Regression
Logit function

$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right), \quad \text{for } 0 \le p \le 1$$
The logit function takes a value between 0 and 1 and maps it to a value
between −∞ and ∞.
The inverse logit function takes a value between −∞ and ∞ and maps it
to a value between 0 and 1.
This formulation is also useful when interpreting the model, since the
logit can be read as the log odds of a success; more on this later.
$$y_i \sim \text{Binom}(p_i)$$
$$\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$
$$\text{logit}(p) = \eta$$
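A quick Python sketch (illustrative, not part of the lecture) of the logit and its inverse, showing the mapping between probabilities and the real line:

```python
import math

def logit(p):
    """Map a probability in (0, 1) to a value in (-inf, inf)."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Map a real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-eta))

print(logit(0.5))             # 0.0
print(inv_logit(0.0))         # 0.5
print(inv_logit(logit(0.9)))  # round trip: ~0.9
```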
##    Call:
##    glm(formula = Status ~ Age, family = binomial, data = donner)
##
##    Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
##    (Intercept) 1.81852     0.99937   1.820   0.0688 .
##    Age         -0.06647    0.03222 -2.063    0.0391 *
##
##        Null deviance: 61.827               on 44        degrees of freedom
##    Residual deviance: 56.291               on 43        degrees of freedom
##    AIC: 60.291
##
##    Number of Fisher Scoring iterations: 4
     Statistics 102 (Colin Rundel)                    Lec 20                    April 15, 2013   13 / 30
Model:

$$\log\left(\frac{p}{1-p}\right) = 1.8185 - 0.0665 \times \text{Age}$$
[Figure: fitted probability of survival vs. Age]
Simple interpretation is only possible in terms of log odds and log odds
ratios for intercept and slope terms.
Intercept: The log odds of survival for a party member with an age of 0.
From this we can calculate the odds or probability, but additional
calculations are necessary.
Slope: For a unit increase in age (being 1 year older), how much the log
odds ratio changes; not particularly intuitive. More often than not we
care only about sign and relative magnitude.
                                                                 
$$\log\left(\frac{p_1}{1-p_1}\right) = 1.8185 - 0.0665(x+1) = 1.8185 - 0.0665x - 0.0665$$

$$\log\left(\frac{p_2}{1-p_2}\right) = 1.8185 - 0.0665x$$

$$\log\left(\frac{p_1}{1-p_1}\right) - \log\left(\frac{p_2}{1-p_2}\right) = -0.0665$$

$$\log\left(\frac{p_1}{1-p_1} \Big/ \frac{p_2}{1-p_2}\right) = -0.0665$$

$$\frac{p_1}{1-p_1} \Big/ \frac{p_2}{1-p_2} = \exp(-0.0665) = 0.94$$
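The derivation above can be checked numerically; a small Python sketch using the fitted intercept and slope from the model output above:

```python
import math

intercept, beta_age = 1.8185, -0.0665  # fitted values from the model above

def log_odds(age):
    """Fitted log odds of survival at a given age."""
    return intercept + beta_age * age

# A one-year increase in age changes the log odds by exactly the slope:
print(log_odds(31) - log_odds(30))  # ~ -0.0665
# The corresponding odds ratio:
print(math.exp(beta_age))           # ~ 0.94
```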
##   Call:
##   glm(formula = Status ~ Age + Sex, family = binomial, data = donner)
##
##   Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
##   (Intercept) 1.63312     1.11018   1.471   0.1413
##   Age         -0.07820    0.03728 -2.097    0.0359 *
##   SexFemale    1.59729    0.75547   2.114   0.0345 *
##   ---
##
##   (Dispersion parameter for binomial family taken to be 1)
##
##       Null deviance: 61.827           on 44      degrees of freedom
##   Residual deviance: 51.256           on 42      degrees of freedom
##   AIC: 57.256
##
##   Number of Fisher Scoring iterations: 4
Gender slope: When the other predictors are held constant this is the log
odds ratio between the given level (Female) and the reference level (Male).
Just like MLR we can plug in gender to arrive at two status vs age models
for men and women respectively.
General model:

$$\log\left(\frac{p_1}{1-p_1}\right) = 1.63312 - 0.07820 \times \text{Age} + 1.59729 \times \text{Sex}$$

Male model:

$$\log\left(\frac{p_1}{1-p_1}\right) = 1.63312 - 0.07820 \times \text{Age} + 1.59729 \times 0 = 1.63312 - 0.07820 \times \text{Age}$$

Female model:

$$\log\left(\frac{p_1}{1-p_1}\right) = 1.63312 - 0.07820 \times \text{Age} + 1.59729 \times 1 = 3.23041 - 0.07820 \times \text{Age}$$
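The two fitted curves can be evaluated directly; a small Python check using the coefficients from the output above (age 30 is chosen arbitrarily for illustration):

```python
import math

def p_survive(age, female):
    """Fitted survival probability under the Age + Sex model above."""
    eta = 1.63312 - 0.07820 * age + 1.59729 * (1 if female else 0)
    return 1 / (1 + math.exp(-eta))

print(round(p_survive(30, female=False), 3))  # ~0.329
print(round(p_survive(30, female=True), 3))   # ~0.708
```

At the same age, the fitted survival probability is substantially higher for women, matching the positive SexFemale coefficient.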
[Figure: fitted probability of survival vs. Age, with separate curves for Males and Females]
Note that the model output does not include an F-statistic; as a general
rule there are no single-model hypothesis tests for GLMs.
The only tricky bit, which is well beyond the scope of this course, is
how the standard error is calculated.
$$H_0: \beta_{age} = 0 \qquad H_A: \beta_{age} \neq 0$$

$$Z = \frac{\hat{\beta}_{age} - \beta_{age}}{SE_{age}} = \frac{-0.0782 - 0}{0.0373} = -2.10$$

$$\text{p-value} = P(|Z| > 2.10) = P(Z > 2.10) + P(Z < -2.10) = 2 \times 0.01797 \approx 0.0359$$
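This calculation can be reproduced numerically; a Python sketch using the estimate and standard error from the output above (the normal tail probability comes from math.erfc):

```python
import math

beta_hat, se = -0.07820, 0.03728  # Age estimate and SE from the output above

z = (beta_hat - 0) / se
# Two-sided p-value under the standard normal: P(|Z| > |z|)
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 2))        # ~ -2.1
print(round(p_value, 4))  # ~ 0.0359
```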
Remember, the interpretation for a slope is the change in log odds ratio
per unit change in the predictor.
Odds ratio:

$$\frac{p_1/(1-p_1)}{p_2/(1-p_2)} = e^{\beta_{age}} = e^{-0.0782} \approx 0.92$$
Additional Example
From Ramsey, F.L. and Schafer, D.W. (2002). The Statistical Sleuth: A Course in Methods of Data Analysis (2nd ed)
[Figure: lung cancer status (Lung Cancer / No Lung Cancer) by bird keeping (Bird / No Bird)]
##   Call:
##   glm(formula = LC ~ FM + SS + BK + AG + YR + CD, family = binomial,
##       data = bird)
##
##   Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
##   (Intercept) -1.93736    1.80425 -1.074 0.282924
##   FMFemale     0.56127    0.53116   1.057 0.290653
##   SSHigh       0.10545    0.46885   0.225 0.822050
##   BKBird       1.36259    0.41128   3.313 0.000923 ***
##   AG          -0.03976    0.03548 -1.120 0.262503
##   YR           0.07287    0.02649   2.751 0.005940 **
##   CD           0.02602    0.02552   1.019 0.308055
##
##   (Dispersion parameter for binomial family taken to be 1)
##
##       Null deviance: 187.14           on 146      degrees of freedom
##   Residual deviance: 154.20           on 140      degrees of freedom
##   AIC: 168.2
##
##   Number of Fisher Scoring iterations: 5
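As with the Age slope earlier, the BK (bird keeping) coefficient translates into an odds ratio; a quick Python check using the coefficient from the output above:

```python
import math

beta_bk = 1.36259  # BKBird coefficient from the output above

# Holding the other predictors constant, bird keeping multiplies the
# odds of lung cancer by exp(beta):
odds_ratio = math.exp(beta_bk)
print(round(odds_ratio, 2))  # ~3.91
```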