
Unit 2

Regression Analysis in Machine Learning


Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable while the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose a marketing company A runs various advertisements every year and earns sales from them. The list below shows the company's advertising spend over the last 5 years and the corresponding sales:

Now, the company plans to spend $200 on advertising in 2019 and wants to predict the sales for that year. To solve such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and determining the cause-and-effect relationship between variables.
In regression, we fit a line or curve to the given data points; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through the data points on the target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum." The distance between the data points and the line tells whether the model has captured a strong relationship or not.
Some examples of regression can be as:
o Prediction of rain using temperature and other factors
o Determining Market trends
o Prediction of road accidents due to rash driving.
Terminologies Related to the Regression Analysis:
o Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variables or which are used to
predict the values of the dependent variables are called independent variable, also called as
a predictor.
o Outliers: An outlier is an observation that contains either a very low or a very high value in comparison to the other observed values. An outlier may distort the results, so it should be avoided or handled carefully.
o Multicollinearity: If the independent variables are highly correlated with each other, this condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not
well with test dataset, then such problem is called Overfitting. And if our algorithm does not
perform well even with training dataset, then such problem is called underfitting.
Why do we use Regression Analysis?
As mentioned above, regression analysis helps in the prediction of a continuous variable. There are various real-world scenarios where we need future predictions, such as weather conditions, sales, and marketing trends, and for such cases we need a technique that can make accurate predictions. Regression analysis is such a statistical method, used in machine learning and data science. Below are some other reasons for using regression analysis:
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the others.
Types of Regression
There are various types of regressions which are used in data science and machine learning. Each type
has its own importance on different scenarios, but at the core, all the regression methods analyze the
effect of the independent variable on dependent variables. Here we are discussing some important
types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest and most widely used regression algorithms; it shows the relationship between continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in the linear regression model can be explained using the
below image. Here we are predicting the salary of an employee on the basis of the year of
experience.
o Below is the mathematical equation for linear regression:
Y = aX + b
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.
Some popular applications of linear regression are:
o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.
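As a quick illustration of the Y = aX + b form above, the following minimal sketch fits a line with NumPy; the advertising-spend and sales numbers are made up for demonstration and are not from the original example.

    import numpy as np

    # Hypothetical data: advertising spend (X) and sales (Y)
    X = np.array([90, 120, 150, 100, 130], dtype=float)
    Y = np.array([1000, 1300, 1800, 1150, 1380], dtype=float)

    # np.polyfit with degree 1 returns the slope (a) and intercept (b) of Y = a*X + b
    a, b = np.polyfit(X, Y, deg=1)

    print(f"Y = {a:.2f}*X + {b:.2f}")
    print("Predicted sales for an advertising spend of 200:", a * 200 + b)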
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a binary
or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function (also called the logistic function) to map predicted values to probabilities. This sigmoid function is used to model the data in logistic regression. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
o f(x) = output between the 0 and 1 value.
o x = input to the function
o e = base of the natural logarithm.
When we provide the input values (data) to the function, it gives an S-shaped curve as follows:
o It uses the concept of threshold levels: values above the threshold are rounded up to 1, and values below the threshold are rounded down to 0.
There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
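A minimal sketch of the sigmoid and the threshold rule described above; the 0.5 threshold and the example inputs are illustrative choices, not values taken from the text.

    import numpy as np

    def sigmoid(x):
        # Logistic function: maps any real value into the (0, 1) range
        return 1.0 / (1.0 + np.exp(-x))

    scores = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
    probabilities = sigmoid(scores)

    # Apply a threshold (commonly 0.5) to turn probabilities into class labels
    labels = (probabilities >= 0.5).astype(int)
    print(probabilities)
    print(labels)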
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using a
linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x
and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear
fashion, so for such case, linear regression will not best fit to those datapoints. To cover such
datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial features
of given degree and then modeled using a linear model. Which means the datapoints are
best fitted using a polynomial line.

o The equation for polynomial regression is also derived from the linear regression equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is our independent/input variable.
o The model is still linear because it is linear in the coefficients, even though the features are polynomial (e.g., quadratic or cubic) terms of x.
Note: This is different from Multiple Linear regression in such a way that in Polynomial regression, a
single element has different degrees instead of multiple variables with the same degree.
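One way to implement this transformation is scikit-learn's PolynomialFeatures combined with LinearRegression; the sketch below uses made-up quadratic data and a degree of 2, both chosen only for illustration.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    # Made-up non-linear data: y roughly follows a quadratic function of x
    x = np.linspace(-3, 3, 20).reshape(-1, 1)
    y = 2 + 1.5 * x.ravel() + 0.8 * x.ravel() ** 2 + np.random.normal(0, 0.5, 20)

    # Transform x into [1, x, x^2] and fit an ordinary linear model on those features
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(x, y)

    print(model.predict([[1.5]]))  # prediction at x = 1.5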
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm which can be used for regression as well
as classification problems. So if we use it for regression problems, then it is termed as Support Vector
Regression.
Support Vector Regression is a regression algorithm which works for continuous variables. Below are
some keywords which are used in Support Vector Regression:
o Kernel: A function used to map lower-dimensional data into a higher-dimensional space.
o Hyperplane: In a general SVM, it is the separation line between two classes, but in SVR it is the line which helps to predict the continuous variable and covers most of the data points.
o Boundary lines: The two lines drawn on either side of the hyperplane, which create a margin for the data points.
o Support vectors: The data points which are nearest to the hyperplane and the boundary lines; they define the margin.
In SVR, we always try to determine a hyperplane with a maximum margin, so that the maximum number of data points is covered within that margin. The main goal of SVR is to include as many data points as possible within the boundary lines, with the hyperplane (best-fit line) covering the maximum number of data points. Consider the below image:

Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
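A minimal scikit-learn sketch of Support Vector Regression, assuming an RBF kernel and made-up one-dimensional data; the epsilon parameter corresponds to the width of the margin formed by the boundary lines.

    import numpy as np
    from sklearn.svm import SVR

    # Made-up data: a noisy sine wave
    X = np.sort(np.random.uniform(0, 5, 60)).reshape(-1, 1)
    y = np.sin(X).ravel() + np.random.normal(0, 0.1, 60)

    # epsilon defines the tube (margin) around the hyperplane; points inside it incur no loss
    svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
    svr.fit(X, y)

    print(svr.predict([[2.5]]))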
Decision Tree Regression:
o Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
o It can solve problems for both categorical and numerical data
o Decision Tree regression builds a tree-like structure in which each internal node represents the
"test" for an attribute, each branch represent the result of the test, and each leaf node
represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (dataset), which splits
into left and right child nodes (subsets of dataset). These child nodes are further divided into
their children node, and themselves become the parent node of those nodes. Consider the
below image:
The above image shows an example of Decision Tree regression; here, the model is trying to predict the choice of a person between a sports car and a luxury car.
Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms which is capable of
performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines multiple decision trees and predicts the final output as the average of the individual tree outputs. The combined decision trees are called base models, and the prediction can be represented more formally as:
g(x) = (f1(x) + f2(x) + ... + fN(x)) / N
o Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble learning, in which the aggregated decision trees run in parallel and do not interact with each other.
o With the help of Random Forest regression, we can reduce overfitting in the model by creating random subsets of the dataset for each tree.
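A short scikit-learn sketch of the bagged-tree averaging described above; the synthetic dataset and n_estimators=100 are illustrative assumptions.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic data: 200 samples with 3 features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=200)

    # Each tree is trained on a bootstrap sample; predictions are averaged across trees
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(X, y)

    print(forest.predict(X[:3]))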
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression in which a small
amount of bias is introduced so that we can get better long term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. The penalty term is computed by multiplying the regularization parameter lambda by the sum of the squared weights of the individual features.
o The equation (cost function) for ridge regression will be:
Cost = Σ(yi − ŷi)² + λ Σ(wj)²
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the
model. It is also called as L2 regularization.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model.
o It is similar to Ridge Regression, except that the penalty term contains the absolute values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a slope exactly to 0, whereas Ridge Regression can only shrink it close to 0.
o It is also called L1 regularization. The equation (cost function) for Lasso regression will be:
Cost = Σ(yi − ŷi)² + λ Σ|wj|
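A brief scikit-learn sketch contrasting the two penalties; here alpha plays the role of lambda, and the synthetic data (where only the first two features matter) is an assumption for illustration.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 5))
    # Only the first two features actually influence this synthetic target
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward 0
    lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set coefficients exactly to 0

    print("Ridge coefficients:", np.round(ridge.coef_, 3))
    print("Lasso coefficients:", np.round(lasso.coef_, 3))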
Linear Regression in Machine Learning


Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:

Mathematically, we can represent a linear regression as:


y= a0+a1x+ ε
Here,
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
Linear Regression Line
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis and independent variable increases on X-
axis, then such a relationship is termed as a Positive linear relationship.

o Negative Linear Relationship:


If the dependent variable decreases on the Y-axis and independent variable increases on the
X-axis, then such a relationship is called a negative linear relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line that means the error
between predicted values and actual values should be minimized. The best fit line will have the least
error.
Different values of the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best-fit line; to do this, we use a cost function.
Cost function-
o Different values of the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best-fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the
input variable to the output variable. This mapping function is also known as Hypothesis
function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For the above linear equation, MSE can be calculated as:

MSE = (1/N) Σ (Yi − (a1xi + a0))²
Where,
N=Total number of observation
Yi = Actual value
(a1xi+a0)= Predicted value.
Residuals: The distance between an actual value and its predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high and so will the cost function. If the scatter points are close to the regression line, the residuals will be small and hence so will the cost function.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
o This is done by randomly selecting initial values for the coefficients and then iteratively updating those values to reach the minimum of the cost function.
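A from-scratch sketch of gradient descent for the simple model y = a0 + a1*x with the MSE cost above; the learning rate, iteration count, and toy data are arbitrary illustrative choices.

    import numpy as np

    # Toy training data, roughly following y = 1 + 2x
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

    a0, a1 = 0.0, 0.0          # initial coefficient values
    lr, n = 0.01, len(x)       # learning rate and number of observations

    for _ in range(5000):
        y_pred = a0 + a1 * x
        error = y_pred - y
        # Gradients of the MSE cost with respect to a0 and a1
        grad_a0 = (2 / n) * error.sum()
        grad_a1 = (2 / n) * (error * x).sum()
        a0 -= lr * grad_a0
        a1 -= lr * grad_a1

    print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")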
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process of
finding the best model out of various models is called optimization. It can be achieved by below
method:
1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables
on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted values and the actual values, and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
o It can be calculated from the below formula:

R² = Explained variation / Total variation
Assumptions of Linear Regression


Below are some important assumptions of Linear Regression. These are some formal checks while
building a Linear Regression model, which ensures to get the best possible result from the given
dataset.
o Linear relationship between the features and target:
Linear regression assumes the linear relationship between the dependent and independent
variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable. In other words, it is difficult to determine which predictor variable is affecting the target variable and which is not. So, the model assumes either little or no multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern distribution of
data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error terms follow a normal distribution. If the error terms are not normally distributed, then confidence intervals will become either too wide or too narrow, which may cause difficulties in estimating the coefficients.
Normality can be checked using a Q-Q plot: if the plot shows an approximately straight line without large deviations, the errors are normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually occurs when there is a dependency between residual errors.

Simple Linear Regression in Machine Learning


Simple Linear Regression is a type of Regression algorithms that models the relationship between a
dependent variable and a single independent variable. The relationship shown by a Simple Linear
Regression model is linear or a sloped straight line, hence it is called Simple Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a continuous/real
value. However, the independent variable can be measured on continuous or categorical values.
Simple Linear regression algorithm has mainly two objectives:
o Model the relationship between the two variables. Such as the relationship between
Income and expenditure, experience and Salary, etc.
o Forecasting new observations. Such as Weather forecasting according to temperature,
Revenue of a company according to the investments in a year, etc.
Simple Linear Regression Model:
The Simple Linear Regression model can be represented using the below equation:
y= a0+a1x+ ε
Where,
a0= It is the intercept of the Regression line (can be obtained putting x=0)
a1= It is the slope of the regression line, which tells whether the line is increasing or decreasing.
ε = The error term. (For a good model it will be negligible)
Implementation of Simple Linear Regression Algorithm using Python
Problem Statement example for Simple Linear Regression:
Here we take a dataset that has two variables: salary (dependent variable) and experience (independent variable). The goals of this problem are:
o To find out whether there is any correlation between these two variables.
o To find the best-fit line for the dataset.
o To see how the dependent variable changes as the independent variable changes.
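A minimal sketch of this implementation, assuming a small hand-made salary/experience dataset (the numbers are illustrative, not taken from the original notes) and scikit-learn's LinearRegression.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    # Illustrative dataset: years of experience (X) and salary (y)
    X = np.array([[1.1], [2.0], [3.2], [4.5], [5.1], [6.8], [8.2], [10.3]])
    y = np.array([39000, 43500, 56000, 61000, 66000, 83000, 98000, 121000])

    model = LinearRegression()
    model.fit(X, y)                      # estimates a0 (intercept_) and a1 (coef_)

    print("Intercept a0:", model.intercept_)
    print("Slope a1:", model.coef_[0])
    print("R-squared:", r2_score(y, model.predict(X)))
    print("Predicted salary at 7 years of experience:", model.predict([[7.0]])[0])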
Introduction to Simple Linear Regression

Simple linear regression is a statistical method used to model the relationship


between a dependent variable (response variable) and a single independent
variable (predictor variable). The primary goal is to find the linear equation that best
predicts the dependent variable based on the independent variable.

Assumptions of Simple Linear Regression


1. Linearity: The relationship between the independent and dependent variable
is linear.
2. Independence: The observations are independent of each other.
3. Homoscedasticity: The variance of residuals is constant across all levels of the
independent variable.
4. Normality: The residuals of the model are normally distributed.
5. No Autocorrelation: Residuals are not correlated with each other (especially
important in time series data).

Simple Linear Regression Model Building

The simple linear regression model is represented by the equation:


Y = β0 + β1X + ε
where:
• Y is the dependent variable.
• X is the independent variable.
• β0 is the intercept.
• β1 is the slope coefficient.
• ε is the error term.

Ordinary Least Squares (OLS) Estimation

OLS is a method for estimating the parameters (coefficients) of a linear regression


model. The goal is to minimize the sum of the squared differences between the
observed and predicted values:
minimize Σ (Yi − (β0 + β1Xi))²
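As an illustration, the OLS estimates for simple linear regression can be computed in closed form using the standard formulas β̂1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and β̂0 = Ȳ − β̂1X̄; the data below is made up.

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    Y = np.array([2.2, 2.8, 4.5, 4.9, 6.3, 7.1])

    x_bar, y_bar = X.mean(), Y.mean()

    # OLS estimates that minimize the sum of squared differences
    beta1_hat = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar

    print("beta0_hat:", beta0_hat)
    print("beta1_hat:", beta1_hat)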

Properties of Least Squares Estimators


1. Unbiasedness: The expected value of the estimator is equal to the true
parameter value.
2. Consistency: As the sample size increases, the estimator converges to the true
parameter value.
3. Efficiency: Among all linear unbiased estimators, the OLS estimator has the smallest variance.
4. Best Linear Unbiased Estimator (BLUE): Under the Gauss-Markov theorem, OLS
is the best linear unbiased estimator of the coefficients.
Fitted Regression Model

The fitted regression model is the estimated version of the regression equation, where
the coefficients are replaced with their estimated values:
Ŷ = β̂0 + β̂1X

Interval Estimation in Simple Linear Regression

Interval estimation involves calculating a range (interval) within which the true
parameter values are expected to fall, with a certain level of confidence (e.g., 95%).
• Confidence Interval for Coefficients: Provides a range for the possible values
of the regression coefficients.
• Prediction Interval: Provides a range for the possible values of a new
observation.

Residuals

Residuals are the differences between the observed values and the predicted values
from the regression model:
Residual_i = Yi − Ŷi
• Analysis of Residuals: Residuals are analyzed to check the validity of the regression assumptions. They should ideally be randomly distributed with constant variance and no pattern.

Multiple Linear Regression (MLR)

Multiple linear regression is a statistical technique that models the relationship


between one dependent variable and two or more independent variables. The model
is represented by the equation:
Y = β0 + β1X1 + β2X2 + … + βnXn + ε
where:
• Y is the dependent variable.
• X1, X2, …, Xn are the independent variables.
• β0 is the intercept.
• β1, β2, …, βn are the coefficients for each independent variable.
• ε is the error term (residual).
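A short sketch of fitting this model with statsmodels (assuming it is available), using synthetic data with two predictors; the OLS summary it prints includes the coefficients, R-squared, F-statistic, and p-values discussed later in this section.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))                     # two independent variables
    y = 5 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

    X_with_const = sm.add_constant(X)                 # adds the intercept term β0
    model = sm.OLS(y, X_with_const).fit()

    print(model.params)                               # estimates of β0, β1, β2
    print(model.summary())                            # R-squared, F-statistic, p-values, etc.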

Detailed Explanation of Assumptions


1. Linearity:
o Description: The assumption of linearity states that there is a linear
relationship between the dependent and independent variables.
o Verification: This can be checked using scatter plots and by examining
the residual plots. If the residuals display a random pattern, linearity is
likely satisfied.
2. Independence:
o Description: Observations should be independent of each other. This
means that the data points are not influenced by previous points.
o Verification: This can be assessed by checking the study design. For
time series data, independence can be checked using techniques like
the Durbin-Watson test to detect autocorrelation.
3. Homoscedasticity:
o Description: The variance of the residuals should be constant across all
levels of the independent variables. This means the spread of residuals
should be similar across the range of predicted values.
o Verification: Homoscedasticity can be checked by plotting the residuals against the predicted values. A funnel shape indicates heteroscedasticity, which violates this assumption.
4. Normality of Residuals:
o Description: The residuals should be approximately normally
distributed, which is essential for hypothesis testing and confidence
intervals.
o Verification: Normality can be assessed using a Q-Q plot (quantile-
quantile plot) or a histogram of the residuals. The Shapiro-Wilk test can
also be used to statistically test for normality.
5. No Multicollinearity:
o Description: Multicollinearity occurs when two or more independent
variables are highly correlated, making it difficult to determine their
individual effects.
o Verification: Multicollinearity can be checked using Variance Inflation
Factor (VIF). A VIF value above 10 is often regarded as indicating
significant multicollinearity.
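A sketch of the VIF check mentioned above, assuming statsmodels and pandas are available; each predictor's VIF is computed in turn, and values above roughly 10 would flag multicollinearity (x2 is deliberately constructed to be correlated with x1).

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    rng = np.random.default_rng(7)
    x1 = rng.normal(size=200)
    x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)   # strongly correlated with x1
    x3 = rng.normal(size=200)
    X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

    # VIF for each predictor (the constant column is skipped)
    for i, name in enumerate(X.columns):
        if name != "const":
            print(name, variance_inflation_factor(X.values, i))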

Additional Points
• Outliers: Outliers can disproportionately affect the regression model, leading to
biased estimates. They should be identified and handled appropriately,
potentially by using robust regression techniques or transformations.
• Sample Size: A larger sample size can provide more reliable estimates and
help ensure the assumptions are met.

Understanding these assumptions is critical for ensuring the reliability of a multiple linear regression model. If the assumptions are violated, it can lead to inaccurate estimates and inferences.

[Example multiple regression outputs, showing coefficients, R-squared, and related statistics, from Laerd Statistics, Scribbr, and Statistics Solutions, appeared here as images.]
R-squared (R²)
• Definition: R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It is a statistical measure of how close the data are to the fitted regression line.
• Range: It ranges from 0 to 1. An R² of 0 means that the independent variables do not explain any of the variability of the dependent variable, while an R² of 1 means they explain all of the variability.
• Usefulness: A higher R² indicates a better fit for the model, but it doesn't imply causation. It is also sensitive to the number of predictors in the model.
• Adjusted R-squared: Adjusted R² modifies R² to account for the number of predictors. It is a more accurate measure when comparing models with different numbers of predictors.

Standard Error
• Definition: The standard error of the regression measures the average
distance that the observed values fall from the regression line. It is essentially
the standard deviation of the residuals.
• Importance: A smaller standard error indicates a more precise estimate of the
dependent variable. It gives an idea of the typical size of the prediction errors.

F-statistic
• Definition: The F-statistic tests the overall significance of the regression model.
It evaluates whether there is a significant relationship between the dependent
and independent variables.
• Interpretation: A larger F-statistic suggests that at least one of the predictors
is significantly related to the dependent variable. The null hypothesis is that all
regression coefficients are equal to zero, meaning no effect.
• Calculation: The F-statistic is calculated as the ratio of the model mean
square to the error mean square.

Significance F (p-value for F-statistic)


• Definition: The significance F is the p-value associated with the F-statistic. It
indicates the probability that the observed F-statistic could occur under the
null hypothesis.
• Interpretation: A small p-value (typically < 0.05) suggests that the model is
statistically significant, meaning that at least one predictor variable has a
significant effect on the dependent variable.

Coefficient P-values
• Definition: Each regression coefficient has an associated p-value that tests the
null hypothesis that the coefficient is equal to zero (i.e., the variable has no
effect).
• Interpretation: A low p-value (< 0.05) indicates that the corresponding
predictor is statistically significant. This suggests that changes in the predictor
are associated with changes in the response variable.
• Significance: P-values help determine which predictors are meaningful
contributors to the model.

R-squared (R²)
• Definition: R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination.
• Calculation: R-squared is calculated as the proportion of the total variance in the dependent variable that is explained by the independent variables.
• Interpretation:
o An R² of 0 indicates that the model explains none of the variance in the dependent variable.
o An R² of 1 indicates that the model explains all the variance in the dependent variable.
o In practice, an R² value between 0 and 1 indicates how well the independent variables explain the variability of the dependent variable.
• Usefulness:
o A high R-squared value indicates a good fit, but it does not necessarily mean the model is correct. It can be artificially inflated by adding more predictors, even if they are not meaningful.
o Adjusted R-squared is often used to account for the number of predictors and provides a more accurate assessment when comparing models with different numbers of predictors.

Standard Error
• Definition: The standard error of the regression (also known as residual
standard error) measures the average distance that the observed values fall
from the regression line. It is the standard deviation of the residuals.
• Interpretation:
o A smaller standard error indicates that the observed values fall closer to
the regression line, suggesting a better fit.
o It provides an absolute measure of fit, unlike R-squared, which is a
relative measure.
• Calculation: The standard error is calculated as the square root of the sum of
squared residuals divided by the degrees of freedom (number of observations
minus number of predictors minus one).
• Usefulness:
o It helps gauge the precision of predictions made by the regression model.
o It is particularly useful when comparing different models applied to the same dataset, as a lower standard error generally implies a more accurate model.

R-squared and Standard Error in Regression

[Images illustrating the standard error of the estimate and its comparison with R-squared, from a YouTube tutorial and Statistics by Jim, appeared here.]

Detailed Explanation
• R-squared (R²):
o Purpose: Measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
o Interpretation: A higher R² value indicates a better fit, showing that the model explains a significant portion of the variability. However, it does not indicate causation or model validity.
• Standard Error:
o Purpose: Measures the average distance that the observed values fall
from the regression line, providing an absolute measure of fit.
o Interpretation: A smaller standard error indicates more precise
predictions, as the observed values are closer to the predicted values. It
helps assess the accuracy of the model's predictions.
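To make the two measures concrete, here is a small sketch computing both for a fitted line; the toy data and the degrees-of-freedom divisor (n − 2 for a single predictor) are assumptions for illustration.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.2])

    slope, intercept = np.polyfit(x, y, deg=1)
    y_hat = intercept + slope * x
    residuals = y - y_hat

    # R-squared: proportion of variance explained by the fitted line
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1 - ss_res / ss_tot

    # Standard error of the regression: sqrt(SS_res / (n - number of parameters))
    std_error = np.sqrt(ss_res / (len(x) - 2))

    print("R-squared:", r_squared)
    print("Standard error of the regression:", std_error)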
Principal Component Analysis (PCA)
The Principal Component Analysis (PCA) technique was introduced by the mathematician Karl Pearson in 1901. It works on the condition that while data in a higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximized.
• Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is one of the most widely used tools in exploratory data analysis and in machine learning for predictive models.
• Principal Component Analysis (PCA) is an unsupervised learning technique used to examine the interrelations among a set of variables. It is also known as general factor analysis, where regression determines a line of best fit.
• The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a
dataset while preserving the most important patterns or relationships between the variables
without any prior knowledge of the target variables.
Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set by finding a
new set of variables, smaller than the original set of variables, retaining most of the sample’s
information, and useful for the regression and classification of data.

Principal Component Analysis


1. Principal Component Analysis (PCA) is a technique for dimensionality reduction that
identifies a set of orthogonal axes, called principal components, that capture the maximum
variance in the data. The principal components are linear combinations of the original
variables in the dataset and are ordered in decreasing order of importance. The total variance
captured by all the principal components is equal to the total variance in the original dataset.
2. The first principal component captures the most variation in the data, while the second principal component captures the maximum variance that is orthogonal to the first principal component, and so on.
3. Principal Component Analysis can be used for a variety of purposes, including data
visualization, feature selection, and data compression. In data visualization, PCA can be used
to plot high-dimensional data in two or three dimensions, making it easier to interpret. In
feature selection, PCA can be used to identify the most important variables in a dataset. In
data compression, PCA can be used to reduce the size of a dataset without losing important
information.
4. In Principal Component Analysis, it is assumed that the information is carried in the variance
of the features, that is, the higher the variation in a feature, the more information that features
carries.
Overall, PCA is a powerful tool for data analysis and can help to simplify complex datasets, making
them easier to understand and work with.
Step-By-Step Explanation of PCA (Principal Component Analysis)
Step 1: Standardization
First, we need to standardize our dataset to ensure that each variable has a mean of 0 and a standard deviation of 1:

Z = (X − μ) / σ

Here,
• μ is the mean of the independent features, μ = {μ1, μ2, ⋯, μm}
• σ is the standard deviation of the independent features, σ = {σ1, σ2, ⋯, σm}
Step 2: Covariance Matrix Computation
Covariance measures the strength of joint variability between two or more variables, indicating how much they change in relation to each other. To find the covariance we can use the formula:

cov(x1, x2) = Σ (x1i − x̄1)(x2i − x̄2) / (n − 1), where the sum runs over i = 1, …, n

The value of the covariance can be positive, negative, or zero.
• Positive: as x1 increases, x2 also increases.
• Negative: as x1 increases, x2 decreases.
• Zero: no direct relationship.
Step 3: Compute Eigenvalues and Eigenvectors of the Covariance Matrix to Identify Principal Components
Let A be a square n×n matrix and X be a non-zero vector for which

AX = λX

for some scalar value λ. Then λ is known as an eigenvalue of matrix A, and X is known as the eigenvector of matrix A for the corresponding eigenvalue.
It can also be written as:

AX − λX = 0
(A − λI)X = 0

where I is the identity matrix of the same shape as matrix A. The above condition holds for a non-zero X only if (A − λI) is non-invertible (i.e., a singular matrix). That means

|A − λI| = 0

From the above equation we can find the eigenvalues λ, and the corresponding eigenvectors can then be found using the equation AX = λX.
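A compact NumPy sketch of these three steps on a small synthetic data matrix; in practice a library routine such as scikit-learn's PCA wraps the same idea.

    import numpy as np

    rng = np.random.default_rng(0)
    mixing = np.array([[2.0, 0.5, 0.0],
                       [0.0, 1.0, 0.3],
                       [0.0, 0.0, 0.2]])
    X = rng.normal(size=(100, 3)) @ mixing            # synthetic correlated features

    # Step 1: standardize each feature to mean 0 and standard deviation 1
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: covariance matrix of the standardized data
    cov = np.cov(Z, rowvar=False)

    # Step 3: eigenvalues/eigenvectors, sorted by decreasing eigenvalue (variance captured)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Project the data onto the first two principal components
    Z_reduced = Z @ eigenvectors[:, :2]
    print("Explained variance ratio:", eigenvalues[:2] / eigenvalues.sum())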

Advantages of Principal Component Analysis


1. Dimensionality Reduction: Principal Component Analysis is a popular technique used
for dimensionality reduction, which is the process of reducing the number of variables in a
dataset. By reducing the number of variables, PCA simplifies data analysis, improves
performance, and makes it easier to visualize data.
2. Feature Selection: Principal Component Analysis can be used for feature selection, which is
the process of selecting the most important variables in a dataset. This is useful in machine
learning, where the number of variables can be very large, and it is difficult to identify the
most important variables.
3. Data Visualization: Principal Component Analysis can be used for data visualization. By
reducing the number of variables, PCA can plot high-dimensional data in two or three
dimensions, making it easier to interpret.
4. Multicollinearity: Principal Component Analysis can be used to deal with multicollinearity,
which is a common problem in a regression analysis where two or more independent
variables are highly correlated. PCA can help identify the underlying structure in the data and
create new, uncorrelated variables that can be used in the regression model.
5. Noise Reduction: Principal Component Analysis can be used to reduce the noise in data. By
removing the principal components with low variance, which are assumed to represent noise,
Principal Component Analysis can improve the signal-to-noise ratio and make it easier to
identify the underlying structure in the data.
6. Data Compression: Principal Component Analysis can be used for data compression. By
representing the data using a smaller number of principal components, which capture most of
the variation in the data, PCA can reduce the storage requirements and speed up processing.
7. Outlier Detection: Principal Component Analysis can be used for outlier
detection. Outliers are data points that are significantly different from the other data points in
the dataset. Principal Component Analysis can identify these outliers by looking for data
points that are far from the other points in the principal component space.
Disadvantages of Principal Component Analysis
1. Interpretation of Principal Components: The principal components created by Principal
Component Analysis are linear combinations of the original variables, and it is often difficult
to interpret them in terms of the original variables. This can make it difficult to explain the
results of PCA to others.
2. Data Scaling: Principal Component Analysis is sensitive to the scale of the data. If the data is
not properly scaled, then PCA may not work well. Therefore, it is important to scale the data
before applying Principal Component Analysis.
3. Information Loss: Principal Component Analysis can result in information loss. While
Principal Component Analysis reduces the number of variables, it can also lead to loss of
information. The degree of information loss depends on the number of principal components
selected. Therefore, it is important to carefully select the number of principal components to
retain.
4. Non-linear Relationships: Principal Component Analysis assumes that the relationships
between variables are linear. However, if there are non-linear relationships between variables,
Principal Component Analysis may not work well.
5. Computational Complexity: Computing Principal Component Analysis can be
computationally expensive for large datasets. This is especially true if the number of variables
in the dataset is large.
6. Overfitting: Principal Component Analysis can sometimes result in overfitting, which is
when the model fits the training data too well and performs poorly on new data. This can
happen if too many principal components are used or if the model is trained on a small
dataset.
Linear Discriminant Analysis (LDA) in Machine Learning
When dealing with a high-dimensional dataset, we must apply dimensionality reduction techniques so that we can explore the data and use it for modeling efficiently. Here we will learn about one such dimensionality reduction technique, which is used to map high-dimensional data to a comparatively lower dimension without much loss of information.
What is Linear Discriminant Analysis:
Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis or Discriminant
Function Analysis, is a dimensionality reduction technique primarily utilized
in supervised classification problems. It facilitates the modeling of distinctions between groups,
effectively separating two or more classes. LDA operates by projecting features from a higher-
dimensional space into a lower-dimensional one. In machine learning, LDA serves as a supervised
learning algorithm specifically designed for classification tasks, aiming to identify a linear combination
of features that optimally segregates classes within a dataset.
For example, we have two classes and we need to separate them efficiently. Classes can have multiple
features. Using only a single feature to classify them may result in some overlapping as shown in the
below figure. So, we will keep on increasing the number of features for proper classification.

Assumptions of LDA
LDA assumes that the data has a Gaussian distribution and that the covariance matrices of the
different classes are equal. It also assumes that the data is linearly separable, meaning that a
linear decision boundary can accurately classify the different classes.
Suppose we have two sets of data points belonging to two different classes that we want to classify.
As shown in the given 2D graph, when the data points are plotted on the 2D plane, there’s no straight
line that can separate the two classes of data points completely. Hence, in this case, LDA (Linear
Discriminant Analysis) is used which reduces the 2D graph into a 1D graph in order to maximize the
separability between the two classes.
Linearly Separable Dataset

Here, Linear Discriminant Analysis uses both axes (X and Y) to create a new axis and projects data
onto a new axis in a way to maximize the separation of the two categories and hence, reduces the 2D
graph into a 1D graph.
Two criteria are used by LDA to create a new axis:
1. Maximize the distance between the means of the two classes.
2. Minimize the variation within each class.
The perpendicular distance between the line and points

In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D graph
such that it maximizes the distance between the means of the two classes and minimizes the variation
within each class. In simple terms, this newly generated axis increases the separation between the data
points of the two classes. After generating this new axis using the above-mentioned criteria, all the
data points of the classes are plotted on this new axis and are shown in the figure given below.

But Linear Discriminant Analysis fails when the mean of the distributions are shared, as it becomes
impossible for LDA to find a new axis that makes both classes linearly separable. In such cases, we
use non-linear discriminant analysis.
How does LDA work?
LDA works by projecting the data onto a lower-dimensional space that maximizes the separation
between the classes. It does this by finding a set of linear discriminants that maximize the ratio of
between-class variance to within-class variance. In other words, it finds the directions in the feature
space that best separates the different classes of data.
Mathematical Intuition Behind LDA
Let's suppose we have two classes and d-dimensional samples x1, x2, …, xn, where:
• n1 samples come from class c1 and n2 samples come from class c2.

If xi is a data point, then its projection onto the line represented by the unit vector v can be written as v^T xi.
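A short scikit-learn sketch of LDA used for supervised dimensionality reduction, assuming synthetic two-class data; n_components=1 projects the 2D points onto the single discriminant axis described above.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    # Two Gaussian classes in 2D with different means
    class0 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))
    class1 = rng.normal(loc=[3, 3], scale=1.0, size=(50, 2))
    X = np.vstack([class0, class1])
    y = np.array([0] * 50 + [1] * 50)

    lda = LinearDiscriminantAnalysis(n_components=1)
    X_projected = lda.fit_transform(X, y)      # projection onto the discriminant axis

    print("Projected shape:", X_projected.shape)
    print("Training accuracy:", lda.score(X, y))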
Extensions to LDA
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used
such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of
the variance (actually covariance), moderating the influence of different variables on LDA.

Advantages & Disadvantages of using LDA


Advantages of using LDA
1. It is a simple and computationally efficient algorithm.
2. It can work well even when the number of features is much larger than the number of training
samples.
3. It can handle multicollinearity (correlation between features) in the data.
Disadvantages of LDA
1. It assumes that the data has a Gaussian distribution, which may not always be the case.
2. It assumes that the covariance matrices of the different classes are equal, which may not be
true in some datasets.
3. It assumes that the data is linearly separable, which may not be the case for some datasets.
4. It may not perform well in high-dimensional feature spaces.
Applications of LDA
1. Face Recognition: In the field of Computer Vision, face recognition is a very popular
application in which each face is represented by a very large number of pixel values. Linear
discriminant analysis (LDA) is used here to reduce the number of features to a more
manageable number before the process of classification. Each of the new dimensions
generated is a linear combination of pixel values, which form a template. The linear
combinations obtained using Fisher’s linear discriminant are called Fisher’s faces.
2. Medical: In this field, Linear discriminant analysis (LDA) is used to classify the patient’s
disease state as mild, moderate, or severe based on the patient’s various parameters and the
medical treatment he is going through. This helps the doctors to intensify or reduce the pace
of their treatment.
3. Customer Identification: Suppose we want to identify the type of customers who are most
likely to buy a particular product in a shopping mall. By doing a simple question and answers
survey, we can gather all the features of the customers. Here, a Linear discriminant analysis
will help us to identify and select the features which can describe the characteristics of the
group of customers that are most likely to buy that particular product in the shopping mall.
Independent Component Analysis(ICA)
Independent Component Analysis is a technique used to separate mixed signals into their independent sources. Applications of ICA range from audio and image processing to biomedical signal analysis. This section discusses the fundamentals of ICA.
What is Independent Component Analysis?
Independent Component Analysis (ICA) is a statistical and computational technique used in machine
learning to separate a multivariate signal into its independent non-Gaussian components. The goal of
ICA is to find a linear transformation of the data such that the transformed data is as close to being
statistically independent as possible.
The heart of ICA lies in the principle of statistical independence. ICA identify components within
mixed signals that are statistically independent of each other.
Statistical Independence Concept:
In probability theory, two random variables X and Y are statistically independent if the joint probability distribution of the pair is equal to the product of their individual probability distributions, i.e., P(X, Y) = P(X)P(Y). This means that knowing the outcome of one variable does not change the probability of the other outcome.
Advantages of Independent Component Analysis (ICA):
• ICA is a powerful tool for separating mixed signals into their independent components. This
is useful in a variety of applications, such as signal processing, image analysis, and data
compression.
• ICA is a non-parametric approach, which means that it does not require assumptions about
the underlying probability distribution of the data.
• ICA is an unsupervised learning technique, which means that it can be applied to data
without the need for labeled examples. This makes it useful in situations where labeled data is
not available.
• ICA can be used for feature extraction, which means that it can identify important features
in the data that can be used for other tasks, such as classification.
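A minimal scikit-learn sketch of ICA separating two mixed signals, assuming the classic "two sources, two mixtures" setup with synthetic signals; FastICA is one common ICA implementation.

    import numpy as np
    from sklearn.decomposition import FastICA

    # Two independent source signals: a sine wave and a square wave
    t = np.linspace(0, 8, 2000)
    s1 = np.sin(2 * t)
    s2 = np.sign(np.sin(3 * t))
    S = np.column_stack([s1, s2])

    # Mix the sources with an arbitrary mixing matrix to simulate observed signals
    A = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
    X = S @ A.T

    # Recover statistically independent components from the mixtures
    ica = FastICA(n_components=2, random_state=0)
    S_estimated = ica.fit_transform(X)
    print("Recovered components shape:", S_estimated.shape)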

Difference between PCA and ICA


Both the techniques are used in signal processing and dimensionality reduction, but they have different goals.
• PCA reduces the dimensions to avoid the problem of overfitting, whereas ICA decomposes the mixed signal into its independent sources' signals.
• PCA deals with the Principal Components, whereas ICA deals with the Independent Components.
• PCA focuses on maximizing the variance, whereas ICA does not focus on the issue of variance among the data points.
• PCA focuses on the mutual orthogonality property of the principal components, whereas ICA does not focus on the mutual orthogonality of the components.
• PCA does not focus on the mutual independence of the components, whereas ICA focuses on the mutual independence of the components.
