
Lecture 3 - Classification and Non-Linear

Regression
Statistical Learning (CFAS420)

Alex Gibberd

Lancaster University

18th Feb 2020


Outline

Learning Outcomes:
• Understand the basic generalisations of linear models to binary, categorical and ordinal outputs
• Understand how to interpret the logistic regression coefficients (log-odds)
• Know how to convert a probability into a binary/ordinal decision via cutpoints
• Know a range of ways to assess performance of a (binary) classification model (Accuracy, Sensitivity, ROC)
• Recognise the mathematical construction of Generalised Additive Models
• Understand how GAMs may be used to model non-linear relationships

2
Logistic Regression

• Suppose we have binary outcomes coded as Y ∈ {1, 0}
• We want to model P(Y = 1 | X = x)
  – Treat this as a Bernoulli trial with probability p(x) := E[Y | X] = P(Y = 1 | X = x)
  – Note: 0 ≤ p(x) ≤ 1, so we need to transform it somehow
• Taking log(p(x)) is only unbounded on one side (as p(x) → 0)
• The alternative is to take the log of the ratio (the log-odds):

  log( p(x) / (1 − p(x)) ) = f_lin(x; β)

Generalised Linear Models (GLM) 4
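A minimal R sketch of fitting this model with glm(); the data frame `dat` and the variables `y`, `x1`, `x2` are hypothetical placeholders, not objects from the lecture:

```r
# Logistic regression: the log-odds of y are modelled linearly in x1 and x2.
# 'dat', 'y', 'x1', 'x2' are hypothetical names used only for illustration.
fit_logit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))
summary(fit_logit)   # coefficients are reported on the log-odds scale
```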


Logistic Regression

• For generalisation purposes, let us call

  g_logit(z) = log( z / (1 − z) )

• Solving g_logit(z) = f_lin(x; β) for z gives:

  z = 1 / (1 + exp{−f_lin(x; β)})

Generalised Linear Models (GLM) 5
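As a quick numerical check, the inverse logit can be written directly; a small base-R sketch (no extra packages assumed):

```r
# Inverse logit: maps a linear predictor back to a probability in (0, 1)
inv_logit <- function(eta) 1 / (1 + exp(-eta))

eta <- c(-2, 0, 2)
inv_logit(eta)                            # approx. 0.119, 0.500, 0.881
all.equal(inv_logit(eta), plogis(eta))    # plogis() is the same logistic CDF in base R
```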


Logistic Regression

• In summary:
  – We map the binary outcome Y ∈ {0, 1}, via its probability, to a continuous range (related to X)
  – We then model this range as we would in linear regression, via f_lin(x; β)
  – We assumed the outcome was a Bernoulli trial with probability p(x)
• Over n trials we have a Binomial distribution: the probability of getting k positive outcomes from n coin tosses

Generalised Linear Models (GLM) 6


Interpreting Logistic Regression

• The odds of an outcome (say this is success) are given by

  Odds(success) := P[success] / P[not success] = P[success] / (1 − P[success]) .

• In the logit model, replacing z = P[success] gives

  log(Odds(success)) = α + ∑_{i=1}^{p} X_i β_i ,

• The scale of the regression coefficients determines how much the log-odds of the outcome change in response to a covariate.

Generalised Linear Models (GLM) 7
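Continuing from the hypothetical `fit_logit` sketched earlier, a one-line way to read the coefficients on the odds scale:

```r
# exp() turns log-odds coefficients into odds ratios: a one-unit increase in a
# covariate multiplies the odds of success by exp(beta_i)
exp(coef(fit_logit))
```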


Generalised Linear Models (GLM)

• More generally, we can consider different link functions g(z) and their inverses g⁻¹(z)
  – A common alternative (the probit model) is to use the cumulative distribution function of the normal: g⁻¹_probit(z) = Φ(z)
• If there are multiple outputs, we need to rethink our distributional assumption
  – Can either model the cumulative probability P(Y ≤ k | X = x)
  – Or utilise a multinomial distribution (Y_1, . . . , Y_K) when taken over n trials
• The general form of the model looks like:

  g(E[Y | X = x]) = f_lin(x; β)

Generalised Linear Models (GLM) 8
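A sketch of swapping the link function: under the same hypothetical data as before, the probit fit only changes the `link` argument of binomial():

```r
# Probit regression: g^{-1} is the standard normal CDF Phi rather than the logistic CDF
fit_probit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "probit"))
```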


Binomial Classification

• The GLM models the expectation of the outcome, E[Y]
• If Y can take two outcomes, coded as Y = 1 or Y = 0, the binomial noise distribution is appropriate
• However, our logistic (GLM) regression gives us

  E[Y] = P(Y = 1) × 1 + P(Y = 0) × 0
       = P(Y = 1)

• To make a prediction about Y | X we have to decide how to convert this probability to either ŷ = 1 or ŷ = 0
• To do this, we introduce a decision rule

Converting Probabilities to Predictions 10


Cutpoints

• The simplest way to convert the probability to a class is via a hard-threshold
• Let us refer to this as a cutpoint, τ ∈ (0, 1)
• Specifically, we may consider

  ŷ = +  if P̂[Y = 1 | X = x] > τ
  ŷ = −  if P̂[Y = 1 | X = x] ≤ τ

  – Note: I have recoded our classes here as (1, 0) ⟺ (+, −)
• Remember that P̂[Y = + | X = x] = g⁻¹(f_lin(x; β̂)) is given via the GLM model

Converting Probabilities to Predictions 11
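A sketch of applying a cutpoint in R, reusing the hypothetical `fit_logit` and `dat` from the earlier sketches:

```r
# Predicted probabilities P(Y = 1 | X = x) on the response (probability) scale
p_hat <- predict(fit_logit, newdata = dat, type = "response")

# Hard-threshold at a cutpoint tau to obtain class predictions
tau   <- 0.5
y_hat <- ifelse(p_hat > tau, 1, 0)
```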


Types of Error for Classification

• In a binary setting with classes (+, −), there are only four possible outcomes:
  – A true positive (TP): ŷ = + and y = +
  – A false positive (FP): ŷ = + and y = −
  – A true negative (TN): ŷ = − and y = −
  – A false negative (FN): ŷ = − and y = +
• A common way to analyse these rates is in terms of a confusion matrix
• We count the number of each outcome and tabulate:

                 True
                 +      −
  Pred   +      #TP    #FP
         −      #FN    #TN

Evaluating Classification Performance 13
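A sketch of building this table with base R, assuming the `y_hat` from the previous sketch and true labels stored in `dat$y` (both hypothetical names):

```r
# Cross-tabulate predicted against true classes: rows = predictions, columns = truth
conf_mat <- table(Predicted = y_hat, True = dat$y)
conf_mat
```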


Predicting Survival on Titanic

• One of the lab exercises will get you to predict survival using a GLM
  – Predict class (outcome) probabilities
  – Define a decision rule (cutpoint) and apply the rule
  – Evaluate performance via a confusion matrix

Evaluating Classification Performance 14


Summarising Classification Error

• There are a few further popular ways to summarise classification performance:
  – Sensitivity: the empirical probability of correctly predicting class "+"

    Sensitivity := TP / P ≡ TP / (TP + FN)

  – Specificity: the empirical probability of correctly predicting class "−"

    Specificity := TN / N ≡ TN / (TN + FP)

  – Accuracy: the empirical probability of predicting the correct class

    Accuracy := (TP + TN) / (P + N) ≡ (TP + TN) / (TP + TN + FP + FN)

Evaluating Classification Performance 15
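A sketch of computing these summaries from the hypothetical `conf_mat` built above (this assumes both classes appear among the predictions and the truth, so all four cells exist):

```r
# Pull the four counts out of the 2x2 confusion matrix
TP <- conf_mat["1", "1"]; FP <- conf_mat["1", "0"]
FN <- conf_mat["0", "1"]; TN <- conf_mat["0", "0"]

sensitivity <- TP / (TP + FN)                  # true positive rate
specificity <- TN / (TN + FP)                  # true negative rate
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
```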


The Receiver Operating Characteristic (ROC)

• In the previous examples, we evaluated the classification performance for a single cutpoint τ
• In practice, we do not always know where to place τ
  – A large τ leads to high specificity but low sensitivity, and vice versa
  – It is often informative to summarise classification performance across a range of τ
• The ROC curve is a plot of specificity vs sensitivity, where each point on the curve is obtained by evaluating a different τ.

Evaluating Classification Performance 16
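A base-R sketch of tracing out an ROC curve by sweeping the cutpoint over a grid; dedicated packages such as pROC provide roc() and auc() for the same job. `p_hat` and `dat$y` are the hypothetical objects from the earlier sketches:

```r
# For each cutpoint tau, record the sensitivity and specificity of the induced classifier
taus <- seq(0, 1, by = 0.01)
roc_points <- t(sapply(taus, function(tau) {
  y_hat <- ifelse(p_hat > tau, 1, 0)
  c(sensitivity = mean(y_hat[dat$y == 1] == 1),   # TP / P
    specificity = mean(y_hat[dat$y == 0] == 0))   # TN / N
}))

# Plot sensitivity against specificity (x-axis reversed, as pROC draws it)
plot(roc_points[, "specificity"], roc_points[, "sensitivity"],
     type = "l", xlim = c(1, 0), xlab = "Specificity", ylab = "Sensitivity")
```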


ROC Curve for Titanic Predictions

Evaluating Classification Performance 17


Multinomial Regression

• We will now look at a form of GLM which can be used for the case where we have K > 2 classes
• Consider the case where the outcomes are unordered, nominal categorical data
• Assumption: Independence of Irrelevant Alternatives (IIA)
  – The odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives
  – For example, the relative probabilities of walking or taking a bus to work do not change if a bicycle is added as an additional possibility
• This allows the choice between K alternatives to be modelled as a set of K − 1 independent binary choices

Multiclass Classification 19
Multinomial Regression

• One way to model this data is as a chain of log-odds relating to the different outcomes¹

  ln( P(Y_i = 1) / P(Y_i = K) ) = f_lin(x_i; β^(1))
  ...
  ln( P(Y_i = K − 1) / P(Y_i = K) ) = f_lin(x_i; β^(K−1))

• Using the fact that all K probabilities sum to one, we find

  P(Y_i = K) = 1 / ( 1 + ∑_{k=1}^{K−1} exp{ f_lin(x_i; β^(k)) } ) .

• Exponentiating the chained equations then gives P(Y_i = k) for general k.

¹ Note: x_i ∈ ℝ^p
Multiclass Classification 20
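One common implementation of this model in R is nnet::multinom(); a sketch, with a hypothetical factor outcome `class` and the usual placeholder covariates:

```r
library(nnet)

# Multinomial logistic regression: one coefficient vector per non-reference class,
# with log-odds taken relative to the reference (first) level of the factor 'class'
fit_multi <- multinom(class ~ x1 + x2, data = dat)
head(predict(fit_multi, newdata = dat, type = "probs"))   # fitted class probabilities
```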
Proportional Odds Regression (Ordinal Data)

• In some cases we may have categorical outcomes (multiple classes) to which we can ascribe some order; these are known as ordinal data
• In these cases, we can use the ordering of the outcomes to simplify the model
• We introduce multiple cutpoints

  τ_0 = −∞ < τ_1 < · · · < τ_{K−1} < τ_K = ∞

  and link these with the GLM

  g(P(Y ≤ k | X = x)) = τ_k − f_lin(x; β)

• Note: generally g(z) = logit(z)
• Importantly, we have far fewer parameters β, as we do not need one set for each outcome

Multiclass Classification 21
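This proportional-odds form is what MASS::polr() fits; a sketch with a hypothetical ordered-factor outcome `grade`:

```r
library(MASS)

# polr() fits g(P(Y <= k)) = tau_k - f_lin(x; beta), with a logistic link by default;
# 'grade' must be an ordered factor, and the names here are purely illustrative
fit_po <- polr(grade ~ x1 + x2, data = dat, Hess = TRUE)
summary(fit_po)   # one beta per covariate plus the K - 1 cutpoints (reported as "intercepts")
```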
Aside: Kolmogorov-Arnold Representation

• The Kolmogorov–Arnold representation theorem states that any (continuous) multivariate function f(x_1, . . . , x_p) can be written as a superposition of functions acting on the individual variables x_1, . . . , x_p separately
• Specifically, one may write:

  u(x_1, . . . , x_p) = ∑_{q=0}^{2p} φ_q( ∑_{l=1}^{p} f_{q,l}(x_l) )

  where φ_q and f_{q,l} are some (potentially non-linear) functions.
• The main problem is that, while the theorem states a function of the above form exists, it does not tell us how to actually identify such a form.

Generalised Additive Models 23


Generalised Additive Models (GAM)

• The KA theorem is interesting as it suggests modelling a multivariate function in terms of a sum of univariate functions
• A GAM is similar: instead of assuming f_lin(x; β) as before, we model each contribution as a smooth function f_j(x_j), which we will approximate using a sum of basis functions
• We use the GLM framework according to:

  g(E[Y]) = α + f_1(x_1) + f_2(x_2) + · · · + f_p(x_p) .

• For more details on GAMs see the book by Simon Wood [1]

Generalised Additive Models 24
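A sketch of this model form with the mgcv package; each s() term plays the role of one f_j, and the data and names are the same hypothetical placeholders as before:

```r
library(mgcv)

# Each s(x_j) is a smooth f_j built from spline basis functions;
# the link g and noise model are set by 'family', here a logit link for a binary y
fit_gam <- gam(y ~ s(x1) + s(x2), data = dat, family = binomial)
summary(fit_gam)
```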


Basis Expansions

• To approximate the individual f_j(x_j) we now use a further set of basis functions
• Let ψ_{j,k}(x_j) represent the k-th basis function for covariate j
• Assuming that for each covariate we have K_j such functions, we can construct an approximation of a smooth function f_j(x_j) as

  f_j(x_j) = ∑_{k=1}^{K_j} β_{j,k} ψ_{j,k}(x_j) ,

  where the β_{j,k} are a set of coefficients which need to be estimated.
• This is a linear (sum) approximation, but in terms of non-linear basis functions

Generalised Additive Models 25
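A sketch of the idea with an explicit basis: build B-spline basis columns for a single covariate (via the splines package shipped with R) and estimate their coefficients with an ordinary linear model; `y_cont` denotes a hypothetical continuous response:

```r
library(splines)

# bs() constructs 8 B-spline basis functions psi_k(x1); f(x1) is then approximated
# as a linear combination of these columns, so the beta_k come from a plain lm() fit
fit_basis <- lm(y_cont ~ bs(x1, df = 8), data = dat)
summary(fit_basis)   # one estimated coefficient per basis function
```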


Spline Approximation

• There are a variety of different basis functions ψ_{j,k} one can use; however, in GAMs a popular choice is to use splines (piecewise polynomial curves)
  – In the figure, the black line is being approximated by a weighted sum of the other curves

Generalised Additive Models 26


Example of GAM Estimation
• Consider the data Y = sin(X) + ε, where X ∼ U(0, 2π) and ε ∼ N(0, σ²)
• Fit a GAM using caret (via mgcv)
• View the results... pretty impressive

Generalised Additive Models 27
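A sketch reproducing this toy example directly with mgcv (caret's method = "gam" wraps the same underlying fit); the sample size and noise level are arbitrary choices:

```r
library(mgcv)
set.seed(1)

# Simulate Y = sin(X) + noise with X uniform on (0, 2*pi)
n <- 200
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.3)
dat_sim <- data.frame(x = x, y = y)

# A single smooth term recovers the sine shape without it being specified anywhere
fit_sin <- gam(y ~ s(x), data = dat_sim)
plot(fit_sin)   # plots the estimated smooth, which should track sin(x)
```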


Summary

• Introduced the logit transform to enable us to apply regression to binary classification tasks
• Demonstrated how this can be generalised to multinomial, categorical outcomes
• Presented various ways to assess binary classification performance (ROC, Accuracy, Sensitivity, ...)
• Introduced GAMs, motivated by the KA representation theorem
• Sketched how basis functions are used in GAMs to approximate continuous smooth functions

Generalised Additive Models 28


In The Lab

1. How to fit logit and probit models for binary outcome data
2. How to use binary variables as covariates via dummy variables
3. Predict survival on the Titanic, and analyse your classifier's performance
4. Fit a proportional odds model using caret

Generalised Additive Models 29


References I

[1] S. Wood. Generalized Additive Models: An Introduction with R. 2nd edition, CRC Press, 2017.

Appendix 30
