
STAT 654

Chapter 3: Classification

Dr. Sharmistha Guha

Department of Statistics, Texas A&M University

Spring 2024

An Overview of Classification

The response variable is qualitative, e.g., y ∈ {0, 1}.

Major reasons not to perform classification using a regression method (studied earlier): (1) a regression method cannot naturally accommodate a qualitative response with more than two classes, and (2) a regression method will not provide meaningful estimates of Pr(Y|X), even with just two classes.
Consider Y = 1 if stroke; 2 if drug overdose; 3 if epileptic
seizure. This coding implies an ordering on the outcomes.
In practice there is no particular reason that this needs to be
the case.
One could choose an equally reasonable coding, Y = 1 if
epileptic seizure; 2 if stroke; 3 if drug overdose.
This would imply a totally different relationship among the
three conditions.

Default Data Illustration

Data to illustrate classification: Default data set (ISLR2).


Goal: Predict if an individual will default on credit card
payment, on the basis of (1) annual income and (2) monthly
credit card balance.

Individuals who defaulted in a given month are shown in orange, and those who did
not in blue. The ’Default’ variable is binary, i.e., defaulted or not.

Default Data Illustration - Regression vs. Classification

Classification using the Default data. Left: Estimated probability of default using linear
regression. Some estimated probabilities are negative! The orange ticks indicate the
0/1 values coded for default (No or Yes). Right: Predicted probabilities of default
using logistic regression. All probabilities lie between 0 and 1.

The Logistic Model
Model the relationship between p(X) = Pr(Y = 1|X) and X.
We model p(X) using a function that gives outputs between 0 and 1 for all values of X.
E.g., we use the logistic function (or sigmoid function) in logistic regression:

p(X) = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))

[Figure: the logistic function, an S-shaped curve taking values between 0 and 1]
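As a quick illustration (not from the slides), a minimal Python sketch of the logistic function; the values β0 = 0 and β1 = 1 are arbitrary choices for display, not estimates from any data set.

```python
# Minimal sketch of the logistic (sigmoid) function; b0 = 0, b1 = 1 are
# arbitrary illustrative values, not estimates from any data.
import numpy as np

def logistic(x, b0=0.0, b1=1.0):
    # p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)); output always lies in (0, 1)
    return np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))

x = np.linspace(-6, 6, 5)
print(logistic(x))  # values rise monotonically from near 0 to near 1
```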

The Logistic Model

This implies p(X) / (1 − p(X)) = e^(β0 + β1 X).
The quantity p(X) / (1 − p(X)) is called the odds. It can take on any value between 0 and ∞.
Taking logs, log(p(X) / (1 − p(X))) = β0 + β1 X. The LHS is called the log odds or logit.
In a linear regression model, β1 gives the average change in Y associated with a 1-unit increase in X.
In a logistic regression model, increasing X by one unit changes the log odds by β1.
Because the relationship between p(X) and X is not a straight line, β1 does not correspond to the change in p(X) associated with a 1-unit increase in X.
The amount that p(X) changes due to a 1-unit change in X depends on the current value of X.

Estimating the Regression Coefficients
The coefficients β0 and β1 are unknown, and must be
estimated.
In linear regression, we used the least squares approach to
estimate the unknown linear regression coefficients.
Here the method of maximum likelihood is preferred due to
good statistical properties.
Consider the likelihood function:

L(β0, β1) = ∏_{i: yi = 1} p(xi) × ∏_{j: yj = 0} (1 − p(xj))

The estimates β̂0 and β̂1 are chosen to maximize this likelihood function.
The z-statistic for β1 is β̂1 / SE(β̂1), so a large (absolute) value of the z-statistic indicates evidence against the null hypothesis H0: β1 = 0.
H0 implies that p(X) = e^(β0) / (1 + e^(β0)), i.e., the probability of {Y = 1} does not depend on X.
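A minimal sketch (not part of the slides) of this maximum likelihood fit in Python with statsmodels, assuming a hypothetical Default.csv export of the ISLR2 Default data with columns 'default' (Yes/No) and 'balance':

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical CSV export of the ISLR2 Default data.
df = pd.read_csv("Default.csv")
df["default01"] = (df["default"] == "Yes").astype(int)  # code Yes/No as 1/0

logit_fit = smf.logit("default01 ~ balance", data=df).fit()  # MLE fit
print(logit_fit.summary())  # coefficients, standard errors, z-statistics
```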
Making Predictions

Think of the Default data.


Once the coefficients have been estimated, we can compute the
probability of default for any given credit card balance.

For the Default data, estimated coefficients of the logistic regression model that
predicts the probability of default using balance. A 1-unit increase in balance is
associated with an increase in the log odds of default by 0.0055 units.

Using the coefficient estimates given above, we predict that the default probability for an individual with a balance of $1000 is:

p̂(X) = e^(β̂0 + β̂1 X) / (1 + e^(β̂0 + β̂1 X)) = 0.00576 (plugging in the estimates and X = 1000).
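To check the arithmetic, a short sketch plugging the textbook estimates from ISLR (β̂0 = −10.6513, β̂1 = 0.0055) into the fitted logistic function:

```python
import numpy as np

# ISLR reports beta0_hat = -10.6513 and beta1_hat = 0.0055 for default ~ balance.
b0_hat, b1_hat = -10.6513, 0.0055
x = 1000
p_hat = np.exp(b0_hat + b1_hat * x) / (1 + np.exp(b0_hat + b1_hat * x))
print(round(p_hat, 5))  # 0.00576
```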

Qualitative Predictors
We can use qualitative predictors with the logistic regression
model using the dummy variable approach.
E.g., the Default dataset contains the qualitative variable student.
To use student status as a predictor variable, create a dummy
variable that takes value 1 for students, and 0 for non-students.

For the Default data, estimated coefficients of the logistic regression model that
predicts the probability of default using student status.

The coefficient associated with the dummy variable is positive and statistically significant (small p-value). This indicates that students tend to have higher default probabilities than non-students:

p̂(default = 1|student = 1) = e^(−3.5 + 0.405×1) / (1 + e^(−3.5 + 0.405×1)) = 0.0431.
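A sketch of the same fit with the dummy variable, continuing the earlier Default sketch and assuming a 'student' (Yes/No) column:

```python
# Continuing the Default sketch above; assumes a 'student' (Yes/No) column.
df["student01"] = (df["student"] == "Yes").astype(int)  # dummy variable

fit_stu = smf.logit("default01 ~ student01", data=df).fit()
print(fit_stu.params)  # positive coefficient on student01, as on the slide
```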
Multiple Logistic Regression

Predict a binary response using multiple predictors X1, ..., Xp.
We can generalize the logistic regression model as follows:

log(p(X) / (1 − p(X))) = β0 + β1 X1 + ... + βp Xp.

Hence, p(X) = e^(β0 + β1 X1 + ... + βp Xp) / (1 + e^(β0 + β1 X1 + ... + βp Xp)).
We use the maximum likelihood method to estimate β0, β1, ..., βp.
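Fitting with multiple predictors follows the same pattern; a sketch continuing the Default example, assuming an 'income' column is also present:

```python
# Continuing the Default sketches; assumes 'income' is present in the CSV.
fit_multi = smf.logit("default01 ~ balance + income + student01", data=df).fit()
print(fit_multi.summary())  # one estimate and z-statistic per predictor
```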

Multinomial Logistic Regression

Classify a response variable that has more than two classes.


E.g., we may have three categories of medical condition in the emergency room: stroke, drug overdose, epileptic seizure.
The logistic regression approach that we have seen only allows
for K = 2 classes for the response variable, e.g., Y ∈ {0, 1}.
It is possible to extend the two-class logistic regression
approach to the setting of K > 2 classes ⇒ Multinomial
logistic regression.
Select a single class to serve as the baseline. Without loss of generality, we select the Kth class as the baseline. Then the model is:

p(Y = k|X = x) = e^(βk0 + βk1 x1 + ... + βkp xp) / (1 + Σ_{l=1}^{K−1} e^(βl0 + βl1 x1 + ... + βlp xp)), for k = 1, ..., K − 1.

Multinomial Logistic Regression

For the baseline class,

p(Y = K|X = x) = 1 / (1 + Σ_{l=1}^{K−1} e^(βl0 + βl1 x1 + ... + βlp xp)).

We can show that for k = 1, ..., K − 1,

log(p(Y = k|X = x) / p(Y = K|X = x)) = βk0 + βk1 x1 + ... + βkp xp.
Once again, the log odds between any pair of classes is linear in
the features.
The decision to treat the Kth class as the baseline is unimportant.
E.g., when classifying emergency room visits into stroke, drug overdose, and epileptic seizure, suppose we fit two multinomial logistic regression models, treating (1) stroke and (2) drug overdose as the baseline, respectively.

Multinomial Logistic Regression

The coefficient estimates will differ between the two fitted models due to the differing choice of baseline.
The fitted values (predictions) and the log odds between any pair of classes will remain the same.
Be careful with interpretation of the coefficients in a
multinomial logistic regression model, since they are tied to the
choice of baseline!
E.g., setting epileptic seizure as the baseline, interpret β_stroke,0 as the log odds of stroke versus epileptic seizure, given that x1 = ... = xp = 0.
A 1-unit increase in Xj is associated with a β_stroke,j increase in the log odds of stroke over epileptic seizure, i.e., for a 1-unit increase in Xj, the ratio p(Y = stroke|X = x) / p(Y = epileptic seizure|X = x) is multiplied by e^(β_stroke,j).
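A sketch of a multinomial fit with scikit-learn on invented toy data; note that scikit-learn uses a symmetric softmax parameterization rather than a baseline class, but as noted above the fitted probabilities do not depend on that choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                                # toy features
y = rng.choice(["stroke", "overdose", "seizure"], size=300)  # toy labels

clf = LogisticRegression().fit(X, y)   # softmax (multinomial) fit for K > 2
print(clf.predict_proba(X[:3]))        # each row sums to 1 over the 3 classes
```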

Probit Regression

Binary response y ∈ {0, 1}. Examples: Yes/No, Success/Failure, Disease/No Disease.
Vector of regressors X, influencing the outcome Y. We assume that the model takes the form

p(Y = 1|X) = Φ(X^T β).
Here Φ is the Cumulative Distribution Function (CDF) of the
standard normal distribution.
The parameters β are usually estimated by maximum likelihood.
We can motivate the probit model as a latent variable model.

Probit Regression

Think of a latent variable Y* as the underlying propensity that Y = 1. Note that Y* is unobserved.
E.g., for the binary variable Disease/No Disease, Y* is the propensity for Disease.
Consider Y* = X^T β + ε, where ε ∼ N(0, 1).
Then Y can be viewed as an indicator for whether this latent variable is positive: Y = 1 if Y* > 0, and Y = 0 if Y* ≤ 0.
To see that the two formulations are equivalent, note that

p(Y = 1|X) = p(Y* > 0) = p(X^T β + ε > 0) = p(ε > −X^T β) = p(ε < X^T β) = Φ(X^T β),

where the second-to-last equality uses the symmetry of the standard normal distribution.
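A sketch simulating this latent-variable story and fitting a probit model with statsmodels; the coefficients and data below are invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(500, 2)))   # intercept + two regressors
beta = np.array([0.5, 1.0, -1.0])                # invented 'true' coefficients
y_star = X @ beta + rng.normal(size=500)         # latent propensity Y*
y = (y_star > 0).astype(int)                     # observed Y = 1{Y* > 0}

probit_fit = sm.Probit(y, X).fit()
print(probit_fit.params)                         # estimates near beta
```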

Maximum Likelihood Estimation

Suppose the data set {yi, xi}, i = 1, ..., n, contains n independent observations.
For a single observation, we have p(yi = 1|xi) = Φ(xi^T β), and p(yi = 0|xi) = 1 − Φ(xi^T β).
The likelihood of a single observation {yi, xi} is

L(β; yi, xi) = Φ(xi^T β)^yi [1 − Φ(xi^T β)]^(1−yi).

The observations are independent, so the likelihood of the entire sample (joint likelihood) is

L(β; Y, X) = ∏_{i=1}^n Φ(xi^T β)^yi [1 − Φ(xi^T β)]^(1−yi).
We can take the joint log likelihood, and maximize w.r.t. β .
Thus we obtain the estimator β̂ , which has desirable theoretical
properties.
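As a check on the statsmodels fit, a sketch maximizing this log likelihood directly with scipy, reusing y and X from the probit sketch above:

```python
from scipy.optimize import minimize
from scipy.stats import norm
import numpy as np

def neg_log_lik(beta, y, X):
    eta = X @ beta
    # log L(β) = Σ yi log Φ(xiᵀβ) + (1 − yi) log(1 − Φ(xiᵀβ));
    # norm.logcdf(-eta) equals log(1 − Φ(eta)) by symmetry.
    return -np.sum(y * norm.logcdf(eta) + (1 - y) * norm.logcdf(-eta))

res = minimize(neg_log_lik, x0=np.zeros(X.shape[1]), args=(y, X))
print(res.x)  # agrees with probit_fit.params up to optimizer tolerance
```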

Generalized Linear Models
So far, we have covered linear regression (for a continuous response) and logistic and probit regression (for a binary response).
All of these are generalized linear models, with different link functions.
Now we will consider count data.
Consider the Bikeshare dataset in ISLR2.
The response is ‘bikers’, the number of hourly users of a bike
sharing program in Washington, DC.
Consider predicting bikers using mnth (month of the year), hr
(hour of the day, from 0 to 23), workingday (an indicator
variable that equals 1 if it is neither a weekend nor a holiday),
temp (the normalized temperature, in Celsius), and weathersit
(a qualitative variable that takes on one of four possible values:
clear; misty or cloudy; light rain or light snow; or heavy rain or heavy snow).
Treat mnth, hr, and weathersit as qualitative variables.
Regression with Count Data

Consider using linear regression for data with a count response.
Some obvious issues appear:
Some fitted values (predictions) may be negative, which calls into question our ability to make meaningful predictions on these data.
It also raises concerns about the accuracy of the coefficient estimates, confidence intervals, and other outputs of the regression model.
Heteroscedasticity may be observed, which questions the suitability of a linear regression model.
While the response is integer-valued, the response in a linear model is necessarily continuous-valued. So a linear regression model is not entirely satisfactory for this dataset.

Poisson Regression

To overcome the inadequacies of linear regression for count data, use Poisson regression.
Recall the Poisson distribution: suppose a random variable Y takes on nonnegative integer values, i.e., Y ∈ {0, 1, 2, ...}.
If Y follows the Poisson distribution, then p(Y = k) = e^(−λ) λ^k / k! for k = 0, 1, 2, ....
Here, λ > 0 is E(Y). Also, λ = V(Y).
Thus if Y follows a Poisson distribution, the larger E(Y) is, the larger V(Y) is.
The Poisson distribution is typically used to model counts,
since counts, like the Poisson distribution, take on nonnegative
integer values.
For regression, rather than a fixed λ, we would like to allow the
mean to vary as a function of the covariates.

Poisson Regression

Consider the model:

log(λ(X1, ..., Xp)) = β0 + β1 X1 + ... + βp Xp,

or equivalently, λ(X1, ..., Xp) = e^(β0 + β1 X1 + ... + βp Xp).
To estimate the coefficients β0, β1, ..., βp, we use the same maximum likelihood approach that we adopted for logistic regression.
Specifically, given n independent observations from the Poisson regression model, the likelihood takes the form

L(β0, β1, ..., βp) = ∏_{i=1}^n e^(−λ(xi)) λ(xi)^yi / yi!, where λ(xi) = e^(β0 + β1 xi1 + ... + βp xip).

We estimate the coefficients that maximize the likelihood L(β0, β1, ..., βp), i.e., that make the observed data as likely as possible.
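A sketch of this fit with statsmodels, assuming a hypothetical Bikeshare.csv export of the ISLR2 data with the columns named earlier (bikers, mnth, hr, workingday, temp, weathersit):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical CSV export of the ISLR2 Bikeshare data.
bike = pd.read_csv("Bikeshare.csv")

# C(...) treats mnth, hr, and weathersit as qualitative, as noted earlier.
pois_fit = smf.poisson(
    "bikers ~ C(mnth) + C(hr) + workingday + temp + C(weathersit)",
    data=bike,
).fit()
print(pois_fit.params.head())  # each coefficient βj scales E(bikers) by e^βj
```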

Poisson Regression

Interpretation: To interpret the coefficients in the Poisson regression, note that an increase in Xj by 1 unit is associated with a change in E(Y) = λ by a factor of e^(βj).
Mean-variance relationship: Under the Poisson model, λ = E(Y) = V(Y).
Nonnegative fitted values: There are no negative predictions using the Poisson regression model.

Generalized Linear Models: Closing Remarks

We have now discussed the following regression models: linear, logistic, probit, and Poisson.
All approaches use predictors X1, ..., Xp to predict a response Y. Conditional on X1, ..., Xp, Y belongs to a certain family of distributions.
Linear regression: Y ∼ Normal and E(Y | X1, ..., Xp) = β0 + β1 X1 + ... + βp Xp.
Logistic regression: Y ∼ Bernoulli and E(Y | X1, ..., Xp) = p(Y = 1 | X1, ..., Xp) = e^(β0 + β1 X1 + ... + βp Xp) / (1 + e^(β0 + β1 X1 + ... + βp Xp)).

Generalized Linear Models: Closing Remarks

Poisson regression: Y ∼ Poisson and E(Y | X1, ..., Xp) = λ(X1, ..., Xp) = e^(β0 + β1 X1 + ... + βp Xp).
These equations can be expressed using a link function η.
The link function applies a transformation to E(Y | X1, ..., Xp) so that the transformed mean is a linear function of the predictors, i.e., η(E(Y | X1, ..., Xp)) = β0 + β1 X1 + ... + βp Xp.
The link functions are the following:
linear: η(µ) = µ,
logistic: η(µ) = log(µ / (1 − µ)), and
Poisson: η(µ) = log(µ).
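A sketch expressing all three models as GLMs in statsmodels, where the family fixes the response distribution and its default link; the toy responses below are simulated for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(200, 2)))
eta = X @ np.array([0.2, 0.8, -0.5])              # linear predictor η

y_cont = eta + rng.normal(size=200)               # Normal response
y_bin = rng.binomial(1, 1 / (1 + np.exp(-eta)))   # Bernoulli response
y_cnt = rng.poisson(np.exp(eta))                  # Poisson response

for y, fam in [(y_cont, sm.families.Gaussian()),  # identity link: η(µ) = µ
               (y_bin, sm.families.Binomial()),   # logit link: log(µ/(1 − µ))
               (y_cnt, sm.families.Poisson())]:   # log link: log(µ)
    print(sm.GLM(y, X, family=fam).fit().params)
```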

Receiver Operating Characteristic (ROC) curve

The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied.
The method was developed for operators of military radar receivers, hence the name.
The ROC curve is created by plotting the true positive rate
(TPR) against the false positive rate (FPR) at various
thresholds.
The true-positive rate is also known as sensitivity or recall
(probability of detection).
The false-positive rate is also known as the probability of false alarm, and equals 1 − specificity.
The performance of a classifier is given by the area under the
(ROC) curve (AUC).
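A sketch tracing an ROC curve and computing the AUC with scikit-learn; the labels and scores below are invented toy data:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)
y_true = rng.binomial(1, 0.3, size=200)                             # toy 0/1 labels
scores = np.clip(0.6 * y_true + 0.5 * rng.uniform(size=200), 0, 1)  # toy scores

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) per threshold
print(roc_auc_score(y_true, scores))              # AUC; 0.5 means chance level
```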

Receiver Operating Characteristic (ROC) curve

An ideal ROC curve will be very close to the top left corner, so
the larger the AUC the better the classifier.
We expect a classifier that performs no better than chance to
have an AUC of 0.5.

The ideal ROC curve is close to the top left corner, indicating a high AUC. The dotted line represents the “no information” classifier.

Chapter Reference

An Introduction to Statistical Learning by G. James, D. Witten, T. Hastie, and R. Tibshirani.

