
Classification

Logistic regression, k-NN, Naïve Bayes

Gertraud Malsiner-Walli
based on Laura Vana’s slides



Outline

Overview of classification

Logistic regression model


Simple logistic regression model
Multiple logistic regression model

K -Nearest Neighbors classifier

Naïve Bayes classifier



Overview of classification



Classification

▶ The linear regression model assumes that the dependent variable Y is quantitative.
▶ However, often the target variable of interest is qualitative/categorical.
▶ Predicting a qualitative outcome can also be interpreted as classifying the observation: assigning it a category means assigning it to the class of all observations sharing that category.


Examples for classifications

Classification problems occur often, perhaps even more often than regression problems.

QUIZ: Is this a regression or a classification problem?


Classifiers
▶ A classifier is an algorithm that maps input data (i.e., the
predictors) to a category of the output variable.
▶ Examples:
▶ Logistic regression
▶ k-NN - Nearest neighbor classifier
▶ Naïve Bayes classifier
▶ Classification trees
▶ Random forests
▶ Support vector machines (SVM)
▶ Discriminant analysis (DA)
▶ Artificial neural networks (ANN)
▶ (for bachelor thesis)



Example: Click data

▶ An advertising company wants to target ads to users based on a user's likelihood to click.
▶ The data set contains information about the behavior of 1000 users to whom an ad was shown (available on Kaggle).
▶ Nine variables are available for each user:
  ▶ Target variable: a binary variable Clicked.on.Ad indicating whether the user clicked the ad.
  ▶ Personal information such as gender, age, and time spent on the website.
  ▶ Other available information: the timestamp of the click, area income, city and country.
▶ Aim: Based on the available information about the user, we are interested in predicting whether the user will click or not.


Visualization of classification

[Figure: scatterplot of Age against Area.Income for the 1000 users, with points colored by Clicked (0 vs. 1).]


Classification evaluation
▶ One typically evaluates the classification by comparing the
distribution of the observed response to that of the
predicted response.
▶ In the case of binary classification, the distributions of the
observed and predicted response can be summarized by the
confusion matrix:

observed = 0 observed = 1
predicted = 0 true negatives (TN) false negatives (FN)
predicted = 1 false positives (FP) true positives (TP)

▶ A binary classifier can make 2 types of errors: FP, FN.


▶ It is sometimes important to know which of these errors are
being made.
Evaluation measures
obs = 0 obs = 1
pred = 0 TN FN
pred = 1 FP TP

▶ Accuracy: the percentage of correctly classified observations:

  accuracy = (TN + TP)/(TN + TP + FN + FP)

▶ Recall: the percentage of correctly classified positives:

  recall = TP/(FN + TP)

▶ Precision: the percentage of predicted positives which are correctly classified:

  precision = TP/(FP + TP)

▶ F1 score: the harmonic mean of precision and recall:

  F1 = 2 × recall × precision/(recall + precision)
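A minimal R sketch of these measures, assuming two 0/1 vectors obs and pred of observed and predicted classes (the example vectors are made up):

# hypothetical observed and predicted classes
obs  <- c(0, 0, 1, 1, 1, 0, 1, 0, 1, 1)
pred <- c(0, 1, 1, 0, 1, 0, 1, 0, 1, 0)

TP <- sum(pred == 1 & obs == 1); TN <- sum(pred == 0 & obs == 0)
FP <- sum(pred == 1 & obs == 0); FN <- sum(pred == 0 & obs == 1)

accuracy  <- (TN + TP) / (TN + TP + FN + FP)
recall    <- TP / (FN + TP)
precision <- TP / (FP + TP)
f1        <- 2 * recall * precision / (recall + precision)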
Details on evaluation measures

▶ The choice of evaluation measure is application- and data-dependent.
▶ When choosing an appropriate evaluation measure, one typically weighs the costs of false negatives against the costs of false positives.
  ▶ If false negatives are more costly, recall is the better measure.
  ▶ If false positives ought to be minimized, precision is the better measure.
  ▶ If both are important, accuracy or the F1 score can be used.
▶ Caveat: Accuracy can be misleading when the class distribution of the response is unbalanced.


Exercise (5 minutes)

Would you choose precision or recall in the following cases? Why?


(Refer to the costs associated with false positives or false negatives.)
a) Classifying an email as spam (1) vs. no-spam (0).
b) Classifying a bank customer as a default (1) or no default (0).
c) An advertiser predicts clicks (1) vs no-clicks (0) on one of their
ads.



Logistic regression model



Simple logistic regression model



Scatterplot of area income and clicks
[Figure: scatterplot of the binary click outcome (shown on a 0–1 "probability of click" axis) against Area.Income.]

▶ Is there a relationship?
▶ First idea: Use a regression model: Yclick = β0 + β1 Xincome
Click data: Fitted models - linear vs logistic model

[Figure: two panels plotting the probability of click against Area.Income, with the fitted linear model (left) and the fitted logistic model (right) overlaid.]


Logistic function
The logistic function describes the relationship between a binary outcome and an independent variable t by modeling the probability of a positive outcome as

  P(Y = 1 | t) = e^t / (1 + e^t).

[Figure: the S-shaped logistic curve p(t) for t between −6 and 6, increasing from 0 to 1 and passing through 0.5 at t = 0.]
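The curve can be reproduced in R with the built-in logistic distribution function plogis(), which equals e^t / (1 + e^t):

curve(plogis(x), from = -6, to = 6, xlab = "t", ylab = "p(t)")   # S-shaped logistic curve
abline(h = 0.5, lty = 2)                                         # p(0) = 0.5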



Simple logistic regression model

▶ We replace t by the linear predictor β0 + β1 X:

  P(Y = 1 | X) = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))

  ("logistic regression model")
▶ Writing P(Y = 1) = p_click:

  p_click = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))

▶ Difficult to see: how does the predictor X affect the probability of clicking?


Odds
▶ The equation can be rewritten as:

  p_click / (1 − p_click) = e^(β0 + β1 X)

▶ Note: 1 − p_click is the probability of observing Y = 0.
▶ This quantity is called the odds; it can take any value between 0 and ∞.
▶ Odds are traditionally used in horse racing instead of probabilities.
▶ Example: If p_click = 0.2, the odds for clicking are

  p_click / (1 − p_click) = 0.2 / 0.8 = 1/4 (= 0.25).

  "It is 4 times more likely to 'not click' than to 'click'."
Interpretation of β1 in regard to log-odds

▶ By taking the logarithm on both sides, we arrive at

  log( p_click / (1 − p_click) ) = β0 + β1 X

  The left-hand side is called the log-odds or logit.
▶ Logistic regression: increasing X by one unit changes the log-odds by β1.


Interpretation of β1 in regard to odds

▶ The logistic regression model relates X to the odds of Y = 1:

  p_click / (1 − p_click) = e^(β0 + β1 X)

▶ What happens if X is increased by one unit?

  e^(β0 + β1 (X + 1)) = e^(β0 + β1 X + β1) = e^(β0 + β1 X) · e^β1 = odds · e^β1

▶ The odds are multiplied by e^β1.
▶ Note: it is not p_click that changes by a fixed amount; it is the odds that change (multiplicatively).
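A small numeric illustration with a made-up coefficient: suppose β1 = 0.7, so e^β1 ≈ 2.01. Starting from p_click = 0.2, i.e., odds of 0.25, a one-unit increase in X gives

  new odds = 0.25 · e^0.7 ≈ 0.50,   new p_click = 0.50 / (1 + 0.50) ≈ 0.33.

The odds double, but the probability rises from 0.20 to about 0.33, not to 0.40.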



Multiple logistic regression model



Multiple logistic regression model

▶ Given multiple covariates X1, X2, . . . , Xp, the probability of observing Y = 1 is given by:

  p = P(Y = 1 | X1, . . . , Xp) = e^(β0 + β1 X1 + . . . + βp Xp) / (1 + e^(β0 + β1 X1 + . . . + βp Xp))

▶ Again, transforming the equation shows that the log-odds are given by the linear predictor of the p covariates:

  log( p / (1 − p) ) = β0 + β1 X1 + . . . + βp Xp


Interpretation of the coefficients in the logit model

▶ Logit model:

  log( p / (1 − p) ) = β0 + β1 X1 + . . . + βp Xp

  Similar to the simple logistic regression case:
▶ β0 gives the (baseline) log-odds when X1 = X2 = . . . = Xp = 0.
▶ A one-unit increase in Xj changes the log-odds by βj, keeping all other explanatory variables constant (ceteris paribus).


Interpretation of the coefficients on the odds-scale

▶ The interpretation in terms of the odds:

  p / (1 − p) = e^(β0 + β1 X1 + . . . + βp Xp)

▶ e^β0 gives the (baseline) odds of Y = 1 when all explanatory variables are equal to zero.
▶ βj: a one-unit increase in Xj multiplies the odds of Y = 1 by e^βj, ceteris paribus.


Estimating the coefficients

▶ Maximum likelihood estimation is used rather than the least squares estimation of traditional multiple regression:

  L(β0, β1, . . . , βp | Y, X1, . . . , Xp) = ∏_{i=1}^{n} p_i^{Y_i} (1 − p_i)^{1 − Y_i}

▶ As in OLS, the first idea that comes to mind is to compute the derivatives with respect to βj and set them equal to zero ⇒ there is no closed-form solution.
▶ Iterative methods such as Newton's method or stochastic gradient descent are used instead. In R, the algorithm employed is called iteratively reweighted least squares (IRLS).
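In R, this fitting step is a call to glm() with family = binomial. A minimal sketch for the click data (the data frame name click_data and the exact predictor set are assumptions for illustration):

# logistic regression of the click indicator on area income and age
fit <- glm(Clicked.on.Ad ~ Area.Income + Age,
           data = click_data, family = binomial)

summary(fit)      # estimated coefficients on the log-odds scale
exp(coef(fit))    # multiplicative effects on the odds (see the interpretation slides)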



Predictions

▶ The output of the logistic regression model is a probability.
▶ We get the predicted conditional probability of clicking:

  p̂(x1, . . . , xp) = P̂(Y = 1 | X1 = x1, . . . , Xp = xp) = e^(β̂0 + β̂1 x1 + . . . + β̂p xp) / (1 + e^(β̂0 + β̂1 x1 + . . . + β̂p xp))

▶ The prediction of the binary outcome is natural:

  Ŷ | (X1 = x1, . . . , Xp = xp) = 0 if p̂(x1, . . . , xp) < 0.5,  and 1 if p̂(x1, . . . , xp) > 0.5.
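Continuing the glm() sketch from the estimation slide (fit and click_data are the assumed objects from there), predicted probabilities and the 0.5 decision rule look as follows:

p_hat <- predict(fit, type = "response")    # predicted P(Y = 1 | x) for each observation
y_hat <- ifelse(p_hat > 0.5, 1, 0)          # classify using the 0.5 threshold
table(predicted = y_hat, observed = click_data$Clicked.on.Ad)   # confusion matrix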



Evaluation

▶ Focus on classification:
▶ Accuracy, precision, recall, F-measure.
▶ Note: One needs a decision rule for obtaining outcomes from
probabilities e.g., if p̂i > 0.5 predict one, else zero.



Further topics
▶ Confidence intervals and hypothesis tests can be computed as in the linear regression case.
▶ Goodness of fit: information criteria such as the AIC or pseudo-R² can be used to compare models in-sample.
▶ Variable selection can be performed similarly to the linear regression case, e.g., by stepwise selection.
▶ Numerical or categorical covariates can be used as predictors (same as in the linear regression case).
▶ As in linear regression, multicollinearity, outliers, omitted variable bias, etc. are potential problems.
▶ Logistic regression is a linear classifier, i.e., the decision boundary (the function separating the observations of the two classes) is a straight line (a hyperplane when there are more than two predictors).



Logistic regression as linear classifier
[Figure: scatterplot of Age against Area.Income, with points colored by Clicked (0 vs. 1).]

▶ Let's have a look at Area.Income and Age in the two clicking classes.
▶ The decision boundary consists of those points for which P(Y = 1) = P(Y = 0) = 0.5 (see the derivation below).
▶ In the logistic regression case this decision boundary is a straight line.
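To see why, set p = 0.5 in the model with these two predictors (coefficients β1 for Area.Income and β2 for Age, left unspecified):

  p = 0.5  ⇔  log( p / (1 − p) ) = 0  ⇔  β0 + β1 · Area.Income + β2 · Age = 0  ⇔  Age = −(β0 + β1 · Area.Income) / β2,

which is a linear equation in Area.Income and Age.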



K -Nearest Neighbors classifier



Nearest neighbor classifier: Idea

Source: Tan, Steinbach, Karpatne, Kumar, 2018



Instance based classifiers

▶ Instance-based classifiers delay the model building until a test observation needs to be classified.
▶ Example:
  ▶ Nearest neighbor: uses the k 'closest' points (nearest neighbors) of the test observation to classify it.


Nearest neighbor classifier

▶ Requires:
1. The set of labeled records.
2. Distance metric to compute the distance between records.
3. The value of k, the number of nearest neighbors to retrieve.
▶ To classify an unknown record:
▶ Compute the distance to other training records.
▶ Identify k nearest neighbors.
▶ Use class labels of nearest neighbors to determine class label of
unknown record (e.g., by majority rule).



Definition of nearest neighbor

[Figure: three panels illustrating the 1, 2, and 3 nearest neighbors of a test point.]



Nearest neighbor for classification and regression

▶ Compute the distance between two points:
  ▶ Typically, the Euclidean distance is used:

    d(x, y) = sqrt( (x1 − y1)² + . . . + (xp − yp)² )

▶ Classification: determine the class from the nearest neighbor list (see the R sketch after this list):
  ▶ Take the majority vote of the class labels among the k nearest neighbors.
  ▶ Alternative: weight each vote according to distance, e.g., w = 1/d².
▶ Regression: determine the predicted value from the nearest neighbor list:
  ▶ Take the average value of Y among the k nearest neighbors.
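A minimal k-NN classification sketch in R using knn() from the class package; the click_data data frame, the chosen predictors, the train/test split, and k = 5 are all illustrative assumptions:

library(class)

X <- scale(click_data[, c("Area.Income", "Age")])   # scale so no attribute dominates the distance
y <- click_data$Clicked.on.Ad

train_id <- 1:800                                   # assumed split of the 1000 users
test_id  <- 801:1000

# majority vote among the k = 5 nearest training neighbors of each test point
pred <- knn(train = X[train_id, ], test = X[test_id, ], cl = y[train_id], k = 5)
table(predicted = pred, observed = y[test_id])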



k-NN: issues
▶ Crucial choice of k (typically between 1 and 20) – it can be determined by accuracy metrics on a "hold-out" data set.
  ▶ If k is too small, the classifier is sensitive to noise points (overfitting).
  ▶ If k is too large, the neighborhood may include points from other classes.
▶ Scaling issues: to prevent the distance measure from being dominated by one of the attributes, the attributes should be scaled (e.g., by subtracting the mean and dividing by the standard deviation).
▶ Selection of the right similarity measure is critical. E.g., the two pairs of data points below have the same Euclidean distance; a distance metric for binary variables (e.g., the Jaccard distance) might be more appropriate.

  X = (1, 1, 1, 1, 1, 0), Y = (0, 1, 1, 1, 1, 1), d(X, Y) = 1.4142
  X* = (1, 0, 0, 0, 0, 0), Y* = (0, 0, 0, 0, 0, 1), d(X*, Y*) = 1.4142
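This can be checked directly in R; dist() with method = "binary" computes a Jaccard-type distance for 0/1 vectors, which distinguishes the two pairs even though their Euclidean distances coincide:

x      <- c(1, 1, 1, 1, 1, 0);  y      <- c(0, 1, 1, 1, 1, 1)
x_star <- c(1, 0, 0, 0, 0, 0);  y_star <- c(0, 0, 0, 0, 0, 1)

dist(rbind(x, y))                               # Euclidean: 1.414214
dist(rbind(x_star, y_star))                     # Euclidean: 1.414214
dist(rbind(x, y), method = "binary")            # 2 mismatches / 6 'on' positions = 0.33
dist(rbind(x_star, y_star), method = "binary")  # 2 mismatches / 2 'on' positions = 1.00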
k-NN classifiers: summary

▶ The k-NN classifier is a lazy learner since it does not build a model explicitly.
▶ Classifying unknown records is relatively expensive.
▶ It can produce arbitrarily shaped decision boundaries.
▶ The selection of the proximity/distance measure and the choice of k are essential.



Click box question: Comparison of the kNN decision boundaries

kNN is fitted twice to data consisting of two groups (orange and blue points): once with k = 100 and once with k = 1. The resulting decision boundaries are plotted as black lines (the purple dashed line is the 'true' discriminant curve). Which plot was fitted with which k?



Naïve Bayes classifier



Bayes classifier
▶ A probabilistic framework for solving classification problems.
▶ It is based on conditional probabilities of two variables X and Y.
  E.g., rolling a die: X = 'the roll is a 6' and Y = 'the number is even'.
  What are P(X), P(Y), P(X|Y), P(Y|X)?
▶ Bayes' theorem is a mathematical rule for inverting conditional probabilities.

  Bayes' rule:  P(Y|X) = P(X|Y) P(Y) / P(X)

  where P(Y) is referred to as the prior probability and P(Y|X) as the posterior probability.



Bayes rule - Example

▶ If a patient has a stiff neck, what is the probability that she/he has meningitis?
▶ Given:
  ▶ The doctor has the following knowledge: among patients with meningitis, 50% experience the stiff-neck symptom, i.e., P(S|M) = 0.5.
  ▶ The prior probability of any patient having meningitis is P(M) = 1/50 000.
  ▶ The prior probability of any patient having a stiff neck is P(S) = 1/20.
▶ If a patient has a stiff neck, the probability that she/he has meningitis is therefore

  P(M|S) = P(S|M) P(M) / P(S) = (0.5 · 1/50 000) / (1/20) = 0.0002



Using Bayes rule for classification

▶ The goal is to predict Y, given a record with attributes (X1, X2, . . . , Xp).
▶ Idea: calculate

  P(Y = 0 | X1, X2, . . . , Xp)  and  P(Y = 1 | X1, X2, . . . , Xp)

  and take the value of Y which has the higher probability.
▶ How can we estimate P(Y | X1, X2, . . . , Xp) directly from data?



Bayes classifier: Approach
▶ By applying Bayes' rule, we compute the posterior probability for each category of Y after observing X:

  P(Y | X1, X2, . . . , Xp) = P(X1, X2, . . . , Xp | Y) P(Y) / P(X1, X2, . . . , Xp)

▶ Then, we choose the class Y for which the posterior probability is maximized (maximum a posteriori).
▶ This is equivalent to choosing the Y which maximizes

  P(X1, X2, . . . , Xp | Y) P(Y),

  as we can ignore the denominator (it is the same for all classes and does not affect the decision).
▶ Task: estimate P(X1, X2, . . . , Xp | Y) and P(Y).



Conditional independence

▶ We assume independence among the attributes Xi when the class Y is given ("conditional independence"):

  P(X1, X2, . . . , Xp | Y) = P(X1 | Y) · P(X2 | Y) · · · P(Xp | Y).

▶ Then we can estimate P(Xi | Y) for all combinations of Xi and Y from the data.



Example data: Defaulted borrowers
Given the customer’s record

X = (Home = No, Status = Divorced), and target variable ‘Default’,

we would like to calculate

P(Default = Yes|X ) and P(Default = No|X ),

and predict ‘Default’ according to the largest probability.


## Home Status Default
## 1 Yes Single No
## 2 No Married No
## 3 No Single No
## 4 Yes Married No
## 5 No Divorced Yes
## 6 No Married No
## 7 Yes Divorced No
## 8 No Single Yes
## 9 No Married No
## 10 No Single Yes



Example: Defaulted borrowers
For X = (Home = No, Status = Divorced), we want to calculate
P(Xi |Y ) and P(Y ) from the data:
## Home Status Default
## 1 Yes Single No
## 2 No Married No
## 3 No Single No
## 4 Yes Married No
## 5 No Divorced Yes
## 6 No Married No
## 7 Yes Divorced No
## 8 No Single Yes
## 9 No Married No
## 10 No Single Yes

P(Y = no | X1, X2) ∝ P(X1 = No | Y = no) · P(X2 = Divorced | Y = no) · P(Y = no) = 4/7 · 1/7 · 7/10 = 4/70.

P(Y = yes | X1, X2) ∝ P(X1 = No | Y = yes) · P(X2 = Divorced | Y = yes) · P(Y = yes) = 3/3 · 1/3 · 3/10 = 1/10.

Since 1/10 > 4/70, the predicted class is Default = Yes.
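The same calculation can be reproduced in R after typing in the ten records from the slide (a minimal sketch; the object names are arbitrary):

borrowers <- data.frame(
  Home    = c("Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"),
  Status  = c("Single", "Married", "Single", "Married", "Divorced",
              "Married", "Divorced", "Single", "Married", "Single"),
  Default = c("No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"),
  stringsAsFactors = TRUE
)

prior    <- prop.table(table(borrowers$Default))                       # P(Y)
p_home   <- prop.table(table(borrowers$Default, borrowers$Home), 1)    # P(Home | Y)
p_status <- prop.table(table(borrowers$Default, borrowers$Status), 1)  # P(Status | Y)

# unnormalized posteriors for X = (Home = No, Status = Divorced)
p_home[, "No"] * p_status[, "Divorced"] * prior
#      No     Yes
# 0.05714 0.10000   -> predict Default = Yes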



Estimate probabilities from continuous data

▶ For continuous attributes:
  ▶ Discretization: partition the range into bins and replace the continuous value with the bin value (the attribute is changed from continuous to ordinal).
▶ Probability density estimation:
▶ We assume the attribute follows a normal distribution and use
data to estimate parameters of distribution (i.e., mean and
standard deviation) for Y = no and Y = yes separately.
▶ Once probability distributions are known, use them to estimate
the conditional probabilities P(Xi |Y ).



Estimate probabilities from continuous data
## Home Status Default Income
## 1 Yes Single No 0
## 2 No Married No 175
## 3 No Single No 74
## 4 Yes Married No 40
## 5 No Divorced Yes 191
## 6 No Married No 206
## 7 Yes Divorced No 134
## 8 No Single Yes 187
## 9 No Married No 132
## 10 No Single Yes 150

▶ The sample mean of Income in class 'No' is 109; the sample variance is 5464.
▶ Density of the normal distribution:

  f(Xi | Yj) = 1 / sqrt(2π σj²) · exp( −(Xi − µj)² / (2 σj²) )

  where µj is the mean and σj² is the variance of Xi in class Yj.
▶ P(Income = 120 | No) ≈ 1 / sqrt(2π · 5464) · exp( −(120 − 109)² / (2 · 5464) ) = 0.0053
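In R this value is a single dnorm() call with the class-specific mean and standard deviation from above:

dnorm(120, mean = 109, sd = sqrt(5464))   # ≈ 0.0053, used as P(Income = 120 | No)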



Issues with Naïve Bayes classifier
▶ If one of the conditional probabilities is zero, then the entire expression becomes zero.
▶ One therefore needs other estimates of the conditional probabilities than simple fractions:

  Original:   P(Xi | C) = Nic / Nc
  Laplace:    P(Xi | C) = (Nic + 1) / (Nc + c)
  m-estimate: P(Xi | C) = (Nic + m·p) / (Nc + m)

  where
  - Nic is the number of observations in class C having attribute value Xi,
  - Nc is the number of instances in class C,
  - c is the number of classes, m is a parameter, and p is a prior estimate of the probability.
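A minimal sketch of the Laplace estimate in R, following the formula above (the function name is made up):

# Laplace-smoothed estimate of P(Xi | C): (Nic + 1) / (Nc + number of classes)
laplace_est <- function(N_ic, N_c, n_classes) {
  (N_ic + 1) / (N_c + n_classes)
}

laplace_est(0, 7, 2)   # a zero count no longer yields probability 0: 1/9 ≈ 0.111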



Naïve Bayes classifier

▶ Robust to isolated noise points.
▶ Handles missing values by ignoring the instance during probability estimate calculations.
▶ Robust to irrelevant attributes.
▶ The independence assumption may not hold for some attributes.
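For completeness, a ready-made implementation is available in R, e.g., naiveBayes() from the e1071 package. A minimal sketch, reusing the borrowers data frame entered on the earlier slide (the package choice is just one of several options):

library(e1071)

nb_fit <- naiveBayes(Default ~ Home + Status, data = borrowers)
nb_fit$tables                      # the estimated P(Xi | Y) tables

# row 5 of the data is exactly Home = No, Status = Divorced
predict(nb_fit, newdata = borrowers[5, c("Home", "Status")])
# expected prediction: Yes, matching the hand calculation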



Strong assumption: Conditional independence: example

▶ We have assumed that the Xi are conditionally independent given Y.
▶ Only then does the factorization P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y) hold.
▶ Example: arm length and reading skills
  ▶ A young child has a shorter arm length and more limited reading skills than adults.
  ▶ If age is fixed, there is no apparent relationship between arm length and reading skills.
  ▶ Arm length and reading skills are conditionally independent given age.



Example
X1 and X2 are positively correlated (r = 0.6909582)
[Figure: scatterplot of x2 against x1 for the full data set, showing a positive association.]



Example
X1 and X2 are conditionally independent given the class (A or B): rA = −0.0387753, rB = 0.0295995.

[Figure: the same scatterplot with points colored by class A and B; within each class the correlation is approximately zero.]



Exercise: How does Naïve Bayes perform on the following data set?

[Figure: scatterplot of x2 against x1 with all points shown in black, i.e., the class labels hidden.]



[Figure: the same data set with the class labels (A and B) revealed.]

