
Chapter 4 – Linear Model

Prepared by: Shier Nee, SAW


Based on: Probabilistic Machine Learning by Kevin Murphy
Answer for Week 2: Exercise 5
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# iris.data columns: (sepal length, sepal width, petal length, petal width)
X = iris.data[:, :3]  # take the first three features as X
y = iris.data[:, 3]   # take the last feature as y

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

alphas = np.logspace(-10, 1.3, 20)  # regularization strengths
nalphas = len(alphas)
mse_train = np.empty(nalphas)
mse_test = np.empty(nalphas)
ytest_pred_stored = dict()

for i, alpha in enumerate(alphas):
    model = Ridge(alpha=alpha, fit_intercept=False)
    model.fit(X_train, y_train)
    ytrain_pred = model.predict(X_train)
    ytest_pred = model.predict(X_test)
    mse_train[i] = mse(ytrain_pred, y_train)
    mse_test[i] = mse(ytest_pred, y_test)
    ytest_pred_stored[alpha] = ytest_pred

# Plot MSE vs regularization strength
fig, ax = plt.subplots()
mask = [True] * nalphas
ax.plot(alphas[mask], mse_test[mask], color='r', marker='x', label='test')
ax.plot(alphas[mask], mse_train[mask], color='b', marker='s', label='train')
ax.set_xscale('log')
ax.legend(loc='upper right', shadow=True)
plt.xlabel('L2 regularizer')
plt.ylabel('mse')
plt.show()

print('The best L2 regularizer = ', alphas[np.argmin(abs(mse_train - mse_test))])


Recap
• Probability: Univariate Model – Gaussian
• Probability: Multivariate Model – Gaussian
• Statistics – Maximum Likelihood Estimation, Regularization
• Decision Theory - Bayesian
• Information Theory - Entropy
• Optimization - Stochastic Gradient Descent
Outline
• Logistic Regression
• Linear Regression
• Generalized Linear Model

Logistic Regression
Binary logistic regression is defined by the following model:

$p(y = 1 \mid x;\theta) = \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad \text{where } a = \log\frac{p}{1-p} = b + \boldsymbol{w}^\top \boldsymbol{x}$

Equivalently, $p(y \mid x;\theta) = \mathrm{Ber}\big(y \mid \sigma(b + \boldsymbol{w}^\top \boldsymbol{x})\big)$: the label follows a Bernoulli distribution whose parameter is the sigmoid of a linear function with weights $\boldsymbol{w}$ and bias $b$.
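A minimal NumPy sketch of this model (the weights, bias, and input below are made-up illustrative values, not from the slides):

import numpy as np

def sigmoid(a):
    # map the log-odds a to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([2.0, -1.0])   # weight vector (illustrative)
b = 0.5                     # bias (illustrative)
x = np.array([1.0, 3.0])    # one input
a = b + w @ x               # log-odds a = b + w^T x
print(sigmoid(a))           # predicted p(y = 1 | x)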

Linear Classifier
The prediction can be written as

$f(x;\theta) = b + \boldsymbol{w}^\top \boldsymbol{x}$

where $\boldsymbol{w}$ is the normal vector of the hyperplane and $b$ is the offset from the origin.

This linear hyperplane separates the (here 3-D) input space into two halves  it is the decision boundary.

In general, there will be uncertainty about the correct class label, so we need to predict a probability distribution over labels: feed $f(x;\theta)$ into the sigmoid function.

Sigmoid Function
Sigmoid function

$\sigma(a) = \frac{1}{1 + e^{-a}}, \qquad \text{where } a = \log\frac{p}{1-p} = b + \boldsymbol{w}^\top \boldsymbol{x}$

The log-odds $a$ is a linear function of the input; the weights and bias determine the steepness and location of the sigmoid curve.



Non-Linear Classifier

Transform the input features in a suitable way.

For example, with quadratic features the decision boundary (where f(x) = 0) defines a circle with radius R (see the sketch below).
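A minimal sketch of such a transform, assuming a quadratic feature map (the map phi and the radius R below are illustrative assumptions, not taken from the slide):

import numpy as np

def phi(x):
    # quadratic feature map: a linear classifier on phi(x) gives a circular boundary in x-space
    return np.array([1.0, x[0]**2, x[1]**2])

R = 2.0
w = np.array([R**2, -1.0, -1.0])   # f(x) = R^2 - x1^2 - x2^2, so f(x) = 0 is a circle of radius R

x = np.array([1.0, 1.5])
f = w @ phi(x)                     # positive inside the circle, negative outside
print('inside' if f > 0 else 'outside')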



Logistic Regression – Cost Function


Maximize the likelihood (maximum likelihood estimation) / minimize the negative log-likelihood (NLL):

$\mathrm{NLL}(\theta) = -\frac{1}{N}\sum_{n=1}^{N}\big[\,y_n \log \mu_n + (1 - y_n)\log(1 - \mu_n)\,\big]$

where $N$ is the number of samples and $\mu_n$ is the predicted probability $\sigma(f(x_n;\theta))$.

This is the binary cross-entropy (a sketch follows below).
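A minimal NumPy sketch of this cost function (the toy data below is made up for illustration):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, b, X, y):
    # average negative log-likelihood (binary cross-entropy) over the N samples
    mu = sigmoid(X @ w + b)       # predicted probabilities
    eps = 1e-12                   # avoid log(0)
    return -np.mean(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))

X = np.array([[0.5, 1.0], [2.0, -1.0], [-1.0, 0.3]])  # toy inputs
y = np.array([1, 0, 1])                               # toy labels
print(nll(np.array([0.1, -0.2]), 0.0, X, y))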

Logistic Regression – Cost Function

To check convexity, we first look at the gradient of the NLL:

$\nabla_{\boldsymbol{w}}\,\mathrm{NLL}(\boldsymbol{w}) = \frac{1}{N}\sum_{n=1}^{N} (\mu_n - y_n)\,\boldsymbol{x}_n$

where $(\mu_n - y_n)$ is the error for sample $n$. Here, we can see that the gradient is weighted by the error for each input.
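A matching sketch of the gradient (same kind of made-up toy data), showing each input weighted by its error:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_grad(w, X, y):
    # gradient of the average NLL w.r.t. w: each x_n is weighted by its error (mu_n - y_n)
    mu = sigmoid(X @ w)
    err = mu - y
    return X.T @ err / len(y)

X = np.array([[0.5, 1.0], [2.0, -1.0], [-1.0, 0.3]])
y = np.array([1, 0, 1])
print(nll_grad(np.array([0.1, -0.2]), X, y))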

Logistic Regression – Cost Function


Ensure the NLL has a bowl shape (a single global minimum)
 check the Hessian matrix

$\mathbf{H} = \frac{1}{N}\sum_{n=1}^{N} \mu_n (1 - \mu_n)\,\boldsymbol{x}_n \boldsymbol{x}_n^\top = \mathbf{X}^\top \mathbf{S}\mathbf{X}$

Check convexity: $\mathbf{H}$ is positive semi-definite for any $\boldsymbol{w}$, so the NLL is convex.

Logistic Regression – Cost Function



Logistic Regression – Optimizer


1. First-order methods
• Stochastic gradient descent
 slow convergence when the gradient is small

2. Second-order methods
• Newton's method (iteratively reweighted least squares, IRLS)

$\theta_{t+1} = \theta_t - \alpha\,\frac{f'(\theta_t)}{f''(\theta_t)}$
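A rough sketch of Newton's method / IRLS for binary logistic regression (toy data, no intercept, and a tiny damping term are illustrative assumptions, not from the slides):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_logreg(X, y, n_iter=10):
    # Newton's method for the logistic regression NLL (no intercept, no regularizer)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X @ w)
        g = X.T @ (mu - y)                            # gradient
        S = np.diag(mu * (1 - mu))                    # diagonal weight matrix
        H = X.T @ S @ X + 1e-6 * np.eye(X.shape[1])   # Hessian (tiny ridge for numerical stability)
        w = w - np.linalg.solve(H, g)                 # Newton step: w - H^{-1} g
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [1.5, -0.5]])  # toy inputs
y = np.array([1, 0, 0, 1])                                         # toy labels
print(newton_logreg(X, y))

Unlike plain SGD, each step uses curvature information (the Hessian), so far fewer iterations are needed.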

Logistic Regression – Overfitting

See any trend?

As the polynomial degree increases, do the weights w increase or decrease?

Logistic Regression – Overfitting


Reduce overfitting:
 do not let the weights grow too large
 add a regularizer as a penalty term

Logistic Regression – Overfitting

Big λ / small C (C = 1/λ)  a less flexible model



Logistic Regression – Overfitting


Binary logistic regression vs. multinomial (multiclass) logistic regression:

• Probability: Bernoulli distribution vs. categorical distribution
• Activation function: σ = sigmoid activation vs. σ = softmax activation
• Cost function: binary cross-entropy vs. multiclass cross-entropy
• Gradient and Hessian: analogous forms, with the softmax outputs in place of the sigmoid output (a softmax sketch follows below)
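A minimal sketch of the softmax activation for the multiclass case (the weight matrix and bias below are made-up illustrative values):

import numpy as np

def softmax(a):
    # turn a vector of class scores into a probability distribution
    a = a - a.max()            # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])   # one weight row per class (C = 3, D = 2)
b = np.array([0.0, 0.1, -0.1])
x = np.array([0.5, 2.0])
p = softmax(W @ x + b)         # p(y = c | x) for each class c
print(p, p.sum())              # probabilities sum to 1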

Handling large number of classes

Using the regular softmax function, the computational cost of computing H grows as the number of classes C increases.

To handle this, we can use hierarchical softmax.

The idea behind decomposing the output layer into a binary tree is to reduce the complexity of obtaining the probability distribution over classes.

Handling imbalanced classes

With imbalanced data, the model pays
- more attention to the 'common' classes
- less attention to the 'rare' classes

Approach
- Resample the data – oversample the rare class / undersample the common class

Handling outliers
Use a mixture model for the likelihood: with probability π the label is generated uniformly at random; otherwise it is generated by the usual conditional model:

$p(y \mid x) = \pi\,\mathrm{Ber}(y \mid 0.5) + (1 - \pi)\,\mathrm{Ber}\big(y \mid \sigma(f(x;\theta))\big)$

Handling outliers – Bi-tempered loss


Tempered cross-entropy

Tempered softmax

Take 15min break



Linear Regression
Linear regression follows the equation

$f(x;\theta) = b + \boldsymbol{w}^\top \boldsymbol{x}$

where $b$ is the bias and $\boldsymbol{w}$ is the slope / weight vector.

If the input is 1-dimensional, this is simple linear regression.
If the input is N-dimensional, this is multiple (multivariate) linear regression.

Least squares regression

$p(y \mid x;\theta) = \mathcal{N}\big(y \mid b + \boldsymbol{w}^\top \boldsymbol{x},\; \sigma^2\big)$

i.e. a Gaussian likelihood with weights $\boldsymbol{w}$ and error variance $\sigma^2$.

The MLE is the point where $\nabla_\theta\,\mathrm{NLL}(\theta) = 0$.

We can first optimize with respect to w, and then solve for the optimal σ.

Ordinary least squares – 1D

The residual sum of squares is given by

$\mathrm{RSS}(w) = \sum_{n=1}^{N} (y_n - w\,x_n)^2$

(Figure: scatter plot of y against x with the fitted line; each term $(y_n - w\,x_n)^2$ is the squared vertical distance from a data point to the line.)
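A minimal sketch of the 1-D case (illustrative data, no intercept): setting dRSS/dw = 0 gives the closed-form slope w = Σₙ xₙyₙ / Σₙ xₙ².

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # illustrative 1-D inputs
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # illustrative targets

w = np.sum(x * y) / np.sum(x ** 2)         # closed-form slope from dRSS/dw = 0
rss = np.sum((y - w * x) ** 2)             # residual sum of squares at the optimum
print(w, rss)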

Ordinary least squares – 2D

The residual sum of squares is given by

$\mathrm{RSS}(\boldsymbol{w}) = \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2$

Adding one input dimension, the fitted model becomes a plane in 3-D space. (Figure: data points scattered about the fitted plane, with axes x, y, z.)

How to get w?
Minimize the RSS: set $\nabla_{\boldsymbol{w}}\,\mathrm{RSS}(\boldsymbol{w}) = 0$.

We know X and y (the data), so $\boldsymbol{w}$ can be obtained in closed form from the normal equations:

$\hat{\boldsymbol{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{y}$

(Figure: scatter plot of the data with the fitted line.)
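A minimal NumPy sketch of the normal equations (illustrative data):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0], [5.0, 2.5], [6.0, 4.0]])
y = np.array([3.0, 2.5, 5.0, 8.0, 8.5, 11.0])

# minimizing RSS(w) = ||y - Xw||^2 gives (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)

# lstsq solves the same problem and is more robust when X^T X is ill-conditioned
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_lstsq)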

Ridge regression
Least squares estimation can result in overfitting.

(Figure: a wiggly fit that matches the training data closely but fits the test data badly.)

Ridge regression adds an L2 regularizer to avoid overfitting (it keeps the weights from growing very large):

RSS(w) + λ‖w‖²

Ridge regression
(Figure: three fits, with zero λ (= least squares), a big λ, and a very big λ = 100000; as λ grows the fitted curve becomes flatter.)

Ridge regression works by penalizing weights that become too large in magnitude.
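A minimal sketch of the ridge solution in closed form (illustrative data; the no-intercept form is an assumption):

import numpy as np

def ridge_fit(X, y, lam):
    # closed-form ridge solution: w = (X^T X + lam I)^{-1} X^T y
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0], [5.0, 2.5]])
y = np.array([3.0, 2.5, 5.0, 8.0, 8.5])
for lam in [0.0, 1.0, 100.0, 100000.0]:
    print(lam, ridge_fit(X, y, lam))       # as lambda grows, the weights shrink towards zero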

How to choose lambda


Ridge regression adds a penalty term, the L2 regularizer, whose strength is controlled by λ.

Methods:
1. Start with a strong λ and then gradually weaken it, checking the results each time  the regularization path.
2. Cross-validation (a sketch follows below).
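A minimal sketch of option 2 using scikit-learn's RidgeCV (the data here is synthetic, for illustration only):

import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(0)
X = rng.randn(100, 3)                                       # synthetic inputs
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(100)   # synthetic targets

alphas = np.logspace(-4, 4, 50)            # candidate regularization strengths
model = RidgeCV(alphas=alphas, cv=5)       # 5-fold cross-validation over the candidates
model.fit(X, y)
print('chosen alpha =', model.alpha_)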

Lasso Regression
Least squares regression (RSS)  less bias, high variance
Ridge regression (RSS + λ‖w‖²)  high bias, low variance

Ridge allows the parameters to become small, but they cannot reach zero.
Lasso allows parameters to be exactly zero.

Lasso regression: RSS + λ‖w‖₁, where ‖w‖₁ is the sum of the absolute values of the weights.

This is useful because it can be used to perform feature selection: the weights of certain features can become exactly zero  this can make the model simpler (see the sketch below).


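A minimal sketch of lasso's feature-selection effect, using scikit-learn on synthetic data in which only the first two features matter:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(200)   # only features 0 and 1 are relevant

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print('ridge coefficients:', np.round(ridge.coef_, 3))   # small but non-zero everywhere
print('lasso coefficients:', np.round(lasso.coef_, 3))   # irrelevant features driven exactly to 0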

Q-norm
General equation: the Lq penalty is $\lambda \sum_j \lvert w_j \rvert^q$.

q = 0 gives the L0 penalty, q = 1 the L1 (lasso) penalty, and q = 2 the L2 (ridge) penalty.

Elastic Net – Lasso + Ridge: RSS + λ₁‖w‖₁ + λ₂‖w‖²

Example - Cancer Data

Least squares – worst performance.

Ridge – weights are smaller but never reach zero; better than least squares.
Lasso – some features' weights are exactly zero, so those features are eliminated.
Elastic Net – best performance.

Generalized Linear Model


If we have any of the following, ordinary least squares is not suitable:
• The relationship between x and y is exponential (figure: y grows exponentially with x).
• The variance of the errors in y is not constant, and varies with x.
• The response variable is not continuous, but categorical (figure: y takes only a few discrete values).

Generalized Linear Model


We cannot use linear regression here: the variance increases with x.

A suitable model is Poisson regression, one type of GLM.

GLM – Poisson Regression


A GLM is normally made up of three components (see the sketch after this list):
1. Linear predictor – b0+b1x
2. Link function – log link function
3. Probability distribution – Poisson distribution
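A minimal sketch of these three components using scikit-learn's PoissonRegressor (available in scikit-learn 0.23 and later; the data is synthetic):

import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
x = rng.uniform(0, 2, size=200)
mu = np.exp(0.5 + 1.2 * x)          # log link: log(mu) = b0 + b1 * x (linear predictor)
y = rng.poisson(mu)                 # Poisson-distributed counts

model = PoissonRegressor(alpha=0.0)            # GLM with log link and Poisson distribution
model.fit(x.reshape(-1, 1), y)
print(model.intercept_, model.coef_)           # should be close to (0.5, 1.2)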

GLM – Linear/Logistic Regression


Linear Regression
1. Linear predictor – b0+b1x
2. Link function – identity link function
3. Probability distribution – Normal distribution

Logistic Regression
1. Linear predictor – b0+b1x
2. Link function – logit link function
3. Probability distribution – Binomial / Bernoulli distribution

Custom GLM
Relationship between x and y is not linear.
Link function = Log link function

Custom GLM
The variance seems constant.

Which probability distribution should we choose for the variance?
1. Normal
2. Poisson



Custom GLM
The variance seems constant.

Which probability distribution should we choose for the variance?
1. Normal
2. Poisson

Let's go to Colab to try out creating logistic regression with PyTorch.
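As a starting point, here is a minimal PyTorch logistic-regression sketch on synthetic data (the actual Colab notebook may differ):

import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(100, 2)                              # synthetic inputs
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)     # synthetic labels in {0, 1}

model = nn.Linear(2, 1)                  # computes a = b + w^T x
loss_fn = nn.BCEWithLogitsLoss()         # sigmoid + binary cross-entropy in one step
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)          # NLL of the Bernoulli model
    loss.backward()                      # gradients via autograd
    optimizer.step()                     # gradient step with the SGD optimizer (full batch here)

with torch.no_grad():
    probs = torch.sigmoid(model(X))      # predicted p(y = 1 | x)
    acc = ((probs > 0.5).float() == y).float().mean()
print('training accuracy:', acc.item())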
