Chapter 4 – Linear Model
Prepared by: Shier Nee, SAW
Based on: Probabilistic Machine Learning by Kevin Murphy
Answer for Week 2: Exercise 5
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error as mse
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()
# iris.data columns: (Sepal Length, Sepal Width, Petal Length, Petal Width)
X = iris.data[:, :3]  # take the first three features as X
y = iris.data[:, 3]   # take the last feature (petal width) as y

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

alphas = np.logspace(-10, 1.3, 20)  # regularization strengths
nalphas = len(alphas)
mse_train = np.empty(nalphas)
mse_test = np.empty(nalphas)
ytest_pred_stored = dict()

for i, alpha in enumerate(alphas):
    model = Ridge(alpha=alpha, fit_intercept=False)
    model.fit(X_train, y_train)
    ytrain_pred = model.predict(X_train)
    ytest_pred = model.predict(X_test)
    mse_train[i] = mse(y_train, ytrain_pred)
    mse_test[i] = mse(y_test, ytest_pred)
    ytest_pred_stored[alpha] = ytest_pred

# Plot MSE vs regularization strength
fig, ax = plt.subplots()
ax.plot(alphas, mse_test, color='r', marker='x', label='test')
ax.plot(alphas, mse_train, color='b', marker='s', label='train')
ax.set_xscale('log')
ax.legend(loc='upper right', shadow=True)
plt.xlabel('L2 regularizer')
plt.ylabel('mse')
plt.show()

# here the 'best' alpha is taken where train and test MSE are closest
print('The best L2 regularizer = ', alphas[np.argmin(abs(mse_train - mse_test))])
Recap
• Probability: Univariate Model – Gaussian
• Probability: Multivariate Model – Gaussian
• Statistics – Maximum Likelihood Estimation, Regularization
• Decision Theory - Bayesian
• Information Theory - Entropy
• Optimization - Stochastic Gradient Descent
Outline
• Logistic Regression
• Linear Regression
• Generalized Linear Model
Logistic Regression
Binary logistic regression models the label y with a Bernoulli distribution whose parameter is the sigmoid of a linear function of the input (weights w and bias b):

p(y = 1 | x; θ) = σ(a) = 1 / (1 + e^(−a)), where a = log( p / (1 − p) ) = wᵀx + b
Linear Classifier
The prediction can be written as

f(x; θ) = b + wᵀx

where w is the normal vector of the separating hyperplane and b is its offset from the origin. This linear hyperplane separates the input space into two halves; it is the decision boundary.

In general there will be uncertainty about the correct class label, so we need to predict a probability distribution over labels: we feed f(x; θ) into the sigmoid function.
Sigmoid Function
The sigmoid (logistic) function is

σ(a) = 1 / (1 + e^(−a)), where a = log( p / (1 − p) ) = b + wᵀx

The weights determine the steepness of the sigmoid function, and the bias shifts it.
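A minimal NumPy sketch of the sigmoid and the resulting class probability; the weights, bias, and input below are made-up illustrative values, not fitted ones.

import numpy as np

def sigmoid(a):
    # logistic function 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

# made-up weights and bias for illustration
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([0.3, 1.2])

a = b + w @ x    # linear score (log-odds)
p = sigmoid(a)   # p(y = 1 | x)
print(p)         # a probability between 0 and 1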
Non-Linear Classifier
Transform the input features in a suitable way, e.g. φ(x) = [1, x₁², x₂²] with weights w = [−R², 1, 1].
The decision boundary (where f(x) = 0) then defines a circle with radius R.
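A small sketch of this feature-transform idea, using the quadratic features φ(x) = [1, x₁², x₂²] and radius R above; the test points are made up.

import numpy as np

R = 1.0
w = np.array([-R**2, 1.0, 1.0])    # weights acting on phi(x) = [1, x1^2, x2^2]

def f(x):
    phi = np.array([1.0, x[0]**2, x[1]**2])
    return w @ phi                  # = x1^2 + x2^2 - R^2

print(f(np.array([0.5, 0.5])))      # negative: inside the circle
print(f(np.array([1.5, 0.0])))      # positive: outside the circle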
Logistic Regression – Cost Function
Maximizing the likelihood (maximum likelihood estimation) is the same as minimizing the negative log likelihood (NLL):

NLL(w) = −(1/N) Σ_n [ yₙ log μₙ + (1 − yₙ) log(1 − μₙ) ]

where N is the number of samples and μₙ = σ(wᵀxₙ + b) is the predicted probability. This is the binary cross-entropy loss.
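A small NumPy sketch of this binary cross-entropy / NLL computation; the labels y and predicted probabilities mu are made up for illustration.

import numpy as np

def binary_cross_entropy(y, mu, eps=1e-12):
    # NLL(w) = -1/N * sum_n [ y_n log mu_n + (1 - y_n) log(1 - mu_n) ]
    mu = np.clip(mu, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(mu) + (1 - y) * np.log(1 - mu))

y  = np.array([1, 0, 1, 1])            # true labels
mu = np.array([0.9, 0.2, 0.6, 0.8])    # predicted probabilities
print(binary_cross_entropy(y, mu))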
Logistic Regression – Cost Function
Check convexity: first compute the gradient of the NLL,

∇_w NLL(w) = (1/N) Σ_n (μₙ − yₙ) xₙ

Here we can see that the gradient is weighted by the error (μₙ − yₙ) for each input.
Logistic Regression – Cost Function
To ensure the NLL has a bowl shape (a single global minimum), check the Hessian matrix:

H(w) = (1/N) Σ_n μₙ(1 − μₙ) xₙxₙᵀ

Since μₙ(1 − μₙ) > 0, the Hessian is positive (semi)definite, so the NLL is convex.
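A small sketch that computes this gradient and Hessian on toy data and checks that the Hessian's eigenvalues are non-negative (the bowl shape); X, y, and w are made-up values.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# toy data and weights (illustrative only)
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]])
y = np.array([1.0, 0.0, 0.0, 1.0])
w = np.array([0.1, -0.2])

mu = sigmoid(X @ w)                    # predicted probabilities
grad = X.T @ (mu - y) / len(y)         # gradient: weighted by the error (mu - y)
S = np.diag(mu * (1 - mu))             # diagonal weighting matrix
H = X.T @ S @ X / len(y)               # Hessian

print(grad)
print(np.linalg.eigvalsh(H) >= 0)      # all True -> positive semidefinite -> convex NLL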
Logistic Regression – Optimizer
1. First order method
• Stochastic Gradient Descent
Slow convergence when the gradient is small.
2. Second order method
• Newton's Method (iteratively reweighted least squares)

θ_{t+1} = θ_t − α f′(θ_t) / f″(θ_t)

(In the multivariate case the step is θ_{t+1} = θ_t − α H⁻¹g, i.e. the gradient g is rescaled by the inverse Hessian.)
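A minimal sketch comparing one gradient-descent step with one Newton-style step for logistic regression, reusing the gradient and Hessian formulas above; the toy data, step size, and starting point are made up.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_hess(w, X, y):
    mu = sigmoid(X @ w)
    g = X.T @ (mu - y) / len(y)
    H = X.T @ np.diag(mu * (1 - mu)) @ X / len(y)
    return g, H

X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]])
y = np.array([1.0, 0.0, 0.0, 1.0])
w = np.zeros(2)

g, H = grad_hess(w, X, y)
w_gd     = w - 0.1 * g                  # first-order: small step along the negative gradient
w_newton = w - np.linalg.solve(H, g)    # second-order: step rescaled by the inverse Hessian
print(w_gd, w_newton)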
Logistic Regression – Overfitting
See any trend? As the polynomial degree increases, do the weights w increase or decrease?
Logistic Regression – Overfitting
Reduce overfitting:
Do not let the weights grow too large.
Add a regularizer to the NLL as a penalty term.
Logistic Regression – Overfitting
A big λ (a small C in sklearn) gives a less flexible model.
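A short sklearn sketch showing the effect of C (the inverse of λ): a small C means a large λ and hence smaller weights; the synthetic data is made up.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

for C in [0.01, 1.0, 100.0]:                 # small C = strong regularization (big lambda)
    clf = LogisticRegression(C=C, penalty='l2').fit(X, y)
    print(C, np.round(clf.coef_, 3))          # weights shrink as C decreases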
Logistic Regression – Binary vs. Multinomial
• Probability distribution: Bernoulli vs. Categorical
• Activation function: σ = sigmoid vs. σ = softmax
• Cost function: binary cross-entropy vs. cross-entropy
• Gradient and Hessian: analogous forms, with the softmax probabilities taking the place of the sigmoid (see the softmax sketch below)
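For the multinomial case, a brief NumPy sketch of the softmax activation and the corresponding cross-entropy for a single example; the logits and the true class index are made up.

import numpy as np

def softmax(a):
    a = a - a.max()          # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])    # scores for C = 3 classes
probs = softmax(logits)
y = 0                                   # true class index
loss = -np.log(probs[y])                # cross-entropy for this example
print(probs, loss)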
Handling a large number of classes
With the regular softmax function, as the number of classes C increases, the computational cost of computing the normalization (and the Hessian H) increases.
To handle this, we can use hierarchical softmax: the output layer is decomposed into a binary tree, which reduces the complexity of obtaining the probability distribution (roughly from O(C) to O(log C)).
Handling imbalanced classes
The model pays more attention to the more 'common' class and less attention to the 'rare' class.
Approach:
- Resample the data – oversample the rare class / undersample the common class (see the sketch below).
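A sketch of oversampling the rare class with sklearn's resample utility (class weighting would be an alternative); the imbalanced toy data is made up.

import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)        # 90 'common' vs 10 'rare' examples

X_rare, y_rare = X[y == 1], y[y == 1]
X_up, y_up = resample(X_rare, y_rare, n_samples=90, replace=True, random_state=0)

X_bal = np.vstack([X[y == 0], X_up])     # balanced training set
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))                # [90 90]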
Handling outliers
Use a mixture model for the likelihood: with probability π the label is generated uniformly at random, otherwise it is generated by the usual conditional model:

p(y | x) = π Ber(y | 0.5) + (1 − π) Ber(y | σ(wᵀx + b))
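A tiny sketch of this robust likelihood: with probability pi the label is uniform over {0, 1}, otherwise it comes from the logistic model; w, b, x, and pi are made-up values.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def robust_lik(y, x, w, b, pi=0.1):
    # p(y|x) = pi * Ber(y | 0.5) + (1 - pi) * Ber(y | sigmoid(w.x + b))
    mu = sigmoid(w @ x + b)
    return pi * 0.5 + (1 - pi) * (mu if y == 1 else 1 - mu)

w, b = np.array([1.0, -0.5]), 0.2
print(robust_lik(1, np.array([2.0, 1.0]), w, b))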
Handling outliers – Bi-tempered loss
Replace the standard cross-entropy with a tempered cross-entropy (a bounded loss) and the softmax with a tempered softmax (a heavier-tailed transfer function), each controlled by a temperature parameter.
Take a 15-minute break
Linear Regression
Linear regression follows the equation

f(x; θ) = b + wᵀx

where b is the bias and w is the slope (weight) vector. If the input is 1-dimensional, this is simple linear regression; if the input is N-dimensional, it is multiple (multivariate) linear regression.
Least squares regression
Least squares corresponds to a Gaussian likelihood with weights w and error variance σ²:

p(y | x; θ) = N(y | wᵀx + b, σ²)

The MLE is the point where the gradient of the NLL is zero. We can first optimize with respect to w, and then solve for the optimal σ.
Ordinary least squares – 1D
The residual sum of squares is

RSS(w) = Σ_n (yₙ − w xₙ)²

(figure: scatter plot of the data with the squared vertical residuals, e.g. (y₁ − w x₁)² and (y₅ − w x₅)²)
Ordinary least squares – 2D
Adding one more input dimension, the residual sum of squares becomes

RSS(w) = Σ_n (yₙ − wᵀxₙ)²

(figure: data points and the fitted plane in 3-D)
How to get w?
Minimize the RSS: we know y and X, so set the gradient of the RSS to zero. This gives the normal equations XᵀXw = Xᵀy, and w can be obtained in closed form as

ŵ = (XᵀX)⁻¹Xᵀy

(a minimal sketch follows below)
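A minimal NumPy sketch of this closed-form solution; the toy 1-D data is made up, and a column of ones is appended for the bias term.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=20)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=20)    # true bias 3, true slope 2

X = np.column_stack([np.ones_like(x), x])             # design matrix with a bias column
w_hat = np.linalg.solve(X.T @ X, X.T @ y)             # solve the normal equations
print(w_hat)                                           # approximately [3, 2]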
Ridge regression
Least squares estimation can result in overfitting: the fit matches the training data closely but performs badly on the test data.

(figure: good fit on the training data, bad fit on the test data)

Ridge regression adds an L2 regularizer to avoid overfitting (it keeps the weights, and hence the slopes, from becoming very large):

RSS(w) + λ‖w‖²
Ridge regression
(figure: fits with λ = 0, which is plain least squares; a big λ; and a very big λ = 100000)

The regularizer works by penalizing weights that become too large in magnitude (a closed-form sketch follows below).
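A sketch of the ridge closed-form solution ŵ = (XᵀX + λI)⁻¹Xᵀy on made-up data; the λ values are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=30)

for lam in [0.0, 1.0, 1e5]:                                # lambda = 0 recovers least squares
    w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
    print(lam, np.round(w, 3))                              # weights shrink toward zero as lambda grows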
How to choose lambda
Ridge regression adds a penalty function, the L2 regularizer. How do we choose λ?
Methods:
1. Start with a strong λ and gradually soften it, checking the results each time (this traces out the regularization path).
2. Cross-validation (see the sketch below).
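A sketch of choosing λ (called alpha in sklearn) by cross-validation with RidgeCV; the candidate grid and the toy data are made up.

import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.3, 0.0]) + rng.normal(scale=0.5, size=100)

model = RidgeCV(alphas=np.logspace(-4, 4, 20), cv=5).fit(X, y)
print(model.alpha_)    # lambda selected by 5-fold cross-validation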
Lasso Regression
Least squares regression (RSS): less bias, high variance.
Ridge regression (RSS + λ‖w‖²): higher bias, lower variance.
Lasso regression (RSS + λ‖w‖₁, using the absolute values of the weights):
• Ridge allows parameters to become small but never exactly zero.
• Lasso allows parameters to be exactly zero.
This is useful because lasso can perform feature selection: the weights of certain features are driven to zero, which makes the model simpler.
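A short sklearn sketch showing lasso driving some weights exactly to zero (feature selection), while ridge only shrinks them; the data with two irrelevant features is made up.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)   # features 2 and 3 are irrelevant

print(np.round(Ridge(alpha=1.0).fit(X, y).coef_, 3))   # small but nonzero everywhere
print(np.round(Lasso(alpha=0.1).fit(X, y).coef_, 3))   # irrelevant weights driven to exactly 0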
Q-norm
The general equation for the penalty is λ Σ_d |w_d|^q: q = 0 gives the L0 penalty, q = 1 the L1 penalty (lasso), and q = 2 the L2 penalty (ridge).
Elastic net combines the lasso and ridge penalties: RSS + λ₁‖w‖₁ + λ₂‖w‖².
Example - Cancer Data
Least squares – worst.
Ridge – weights are smaller but never reach zero; better than least squares.
Lasso – some features' weights are exactly zero, so those features are eliminated.
Elastic net – best.
Generalized Linear Model
If we have any of the following, ordinary least squares is not suitable:
• The relationship between x and y is exponential rather than linear.
• The variance of the errors in y is not constant, and varies with x.
• The response variable is not continuous, but categorical.

(figures: scatter plots illustrating an exponential trend and a categorical response)
Generalized Linear Model
We can't use linear regression here: the variance increases with x.
A suitable alternative is Poisson regression, one type of GLM.
GLM – Poisson Regression
A GLM is normally made up of three components:
1. Linear predictor – b0 + b1x
2. Link function – the log link function
3. Probability distribution – the Poisson distribution
(a small fitting sketch follows below)
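A sketch of fitting a Poisson GLM with statsmodels (assuming statsmodels is available); the count data is simulated so the example is self-contained.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
y = rng.poisson(np.exp(0.5 + 1.2 * x))               # counts with a log-linear mean

X = sm.add_constant(x)                                # linear predictor b0 + b1*x
model = sm.GLM(y, X, family=sm.families.Poisson())    # log link is the Poisson default
print(model.fit().params)                             # approximately [0.5, 1.2]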
GLM – Linear/Logistic Regression
Linear Regression
1. Linear predictor – b0+b1x
2. Link function – identity link function
3. Probability distribution – Normal distribution
Logistic Regression
1. Linear predictor – b0+b1x
2. Link function – logit link function
3. Probability distribution – Binomial / Bernoulli distribution
Custom GLM
The relationship between x and y is not linear, so we choose the log link function.
Custom GLM
The variance seems constant.
Which probability distribution should we choose for the noise?
1. Normal
2. Poisson
Custom GLM
Since the variance seems constant (it does not grow with the mean), the Normal distribution is the appropriate choice; a Poisson distribution would imply a variance that increases with the mean.
Let's go to Colab to try out creating a logistic regression model with PyTorch (a minimal sketch of what this might look like is below).
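A minimal sketch, assuming a simple synthetic dataset: logistic regression as a single linear layer trained with binary cross-entropy; all layer sizes and hyperparameters are placeholders, not the Colab notebook's actual settings.

import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 2)                              # toy inputs
y = (X[:, 0] - X[:, 1] > 0).float().unsqueeze(1)     # toy binary labels

model = nn.Linear(2, 1)                              # computes w^T x + b
loss_fn = nn.BCEWithLogitsLoss()                     # sigmoid + binary cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(loss.item())                                   # training loss after 100 epochs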