20AIPC302
FUNDAMENTAL OF MACHINE LEARNING TECHNIQUES
UNIT 4 - REGRESSION MODELLING
Introduction to regression modelling – Mathematical model for linear regression – Simple linear regression – Multiple linear regression – Improving accuracy of the linear regression model – Polynomial regression – Logistic regression – Maximum likelihood estimation
4.1 REGRESSION:
In this chapter, we will build concepts on the prediction of numerical variables – another key area of supervised learning. This area, known as regression, focuses on solving problems such as predicting the value of real estate, forecasting demand in retail, forecasting the weather, etc. First, you will be introduced to the most popular and simplest algorithm, namely simple linear regression. This model is rooted in the statistical concepts of fitting a straight line and the least squares method. We will explore this algorithm in detail. In the same context, we will also explore the concept of multiple linear regression. We will then briefly touch upon the other important algorithms in regression, namely multivariate adaptive regression splines, logistic regression, and maximum likelihood estimation. By the end of this chapter, you will have gained sufficient knowledge of all these aspects of supervised learning and be ready to start solving problems on your own.
4.2 EXAMPLE OF REGRESSION
New City is the primary hub of the commercial activities in the country. In the last couple of decades, with
increasing globalization, commercial activities have intensified in New City. Together with that, a large number of
people have come and settled in the city with a dream to achieve professional growth in their lives. As an obvious
fall-out, a large number of housing projects have started in every nook and corner of the city.
But the demand for apartments has still outgrown the supply. To benefit from this boom in the real estate business, Karen has started a digital market agency for buying and selling real estate (including apartments, independent houses, town houses, etc.). Initially, when the business was small, she used to interact with buyers and sellers personally and help them arrive at a price quote – either for selling a property (for a seller) or for buying a property (for a buyer). Her long experience in the real estate business helped her develop an intuition about what the correct price quote of a property could be – given the values of certain standard parameters such as the area (sq. m.) of the property, location, floor, number of years since purchase, amenities available, etc.
However, with the huge surge in the business, she is facing a big challenge. She is not able to manage the personal interactions and set the correct price quotes for the properties all by herself. She hired an assistant for managing customer interactions. But the assistant, being new to the real estate business, is struggling with price quotations. How can Karen solve this problem? Fortunately, Karen has a friend, Frank, who is a data scientist with in-depth knowledge of machine learning models. Frank comes up with a solution to Karen’s problem.
He builds a model which can predict the correct value of a property given certain standard inputs such as the area (sq. m.) of the property, location, floor, number of years since purchase, amenities available, etc. Wow, that sounds like Karen herself doing the job! Curious to know what model Frank has used? Yes, you guessed it right. He used a regression model to solve Karen’s real estate price prediction problem. So, we have just discussed one problem which can be solved using regression. In the same way, a bunch of other problems related to the prediction of numerical values can be solved using regression models. In the context of regression, the dependent variable (Y) is the one whose value is to be predicted, e.g. the price quote of the real estate in Karen’s problem.
This variable is presumed to be functionally related to one (say, X) or more independent variables called
predictors. In the context of Karen’s problem, Frank used area of the property, location, floor, etc. as predictors of
the model that he built. In other words, the dependent variable depends on independent variable(s) or predictor(s).
Regression is essentially finding a relationship or association between the dependent variable (Y) and the
independent variable(s) (X), i.e. to find the function ‘f ’ for the association Y = f (X).
COMMON REGRESSION ALGORITHMS:
The most common regression algorithms are
Simple linear regression
Multiple linear regression
Polynomial regression
Multivariate adaptive regression splines
Logistic regression
Maximum likelihood estimation (least squares)
4.3 SIMPLE LINEAR REGRESSION:
As the name indicates, simple linear regression is the simplest regression model which involves only one
predictor. This model assumes a linear relationship between the dependent variable and the predictor variable.
In the context of Karen’s problem, if we take Price of a Property as the dependent variable and the Area of
the Property (in sq. m.) as the predictor variable, we can build a model using simple linear regression.
Assuming a linear association, we can reformulate the model as
Price_Property = a + b × Area_Property
where ‘a’ and ‘b’ are the intercept and slope of the straight line, respectively. Just to recall, a straight line can be defined in slope–intercept form as Y = a + bX, where a = intercept and b = slope of the straight line. The value of the intercept indicates the value of Y when X = 0. It is known as ‘the intercept or Y-intercept’ because it specifies where the straight line crosses the vertical or Y-axis.
Slope of the simple linear regression model
Slope of a straight line represents how much the line in a graph changes in the vertical direction (Y-axis) over a
change in the horizontal direction (X-axis) as shown.
Slope = Change in Y/Change in X
Rise is the change along the Y-axis (Y2 − Y1) and Run is the change along the X-axis (X2 − X1). So, slope is represented as given below:
Slope = Rise/Run = (Y2 − Y1) / (X2 − X1)
Let us find the slope of the graph where the lower point on the line is represented as (−3, −2) and the higher point on the line is represented as (2, 2).
(X1, Y1) = (−3, −2) and (X2, Y2) = (2, 2)
Rise = (Y2 − Y1) = (2 − (−2)) = 2 + 2 = 4
Run = (X2 − X1) = (2 − (−3)) = 2 + 3 = 5
Slope = Rise/Run = 4/5 = 0.8
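To make the arithmetic above concrete, here is a minimal Python sketch of the slope calculation (the function name slope_between is ours, purely for illustration):

```python
def slope_between(p1, p2):
    """Slope of the line through two points (x1, y1) and (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    rise = y2 - y1          # change along the Y-axis
    run = x2 - x1           # change along the X-axis
    return rise / run

# Points from the example: lower point (-3, -2), higher point (2, 2)
print(slope_between((-3, -2), (2, 2)))   # 0.8
```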
There can be two types of slopes in a linear regression model: positive slope and negative slope. Different
types of regression lines based on the type of slope include
Linear positive slope
Curve linear positive slope
Linear negative slope
Curve linear negative slope
Linear positive slope
A positive slope always moves upward on a graph from left to right.
Slope = Rise/Run = (Y2 − Y1) / (X2 − X1) = Delta(Y) / Delta(X)
Scenario 1 for positive slope: Delta(Y) is positive and Delta(X) is positive
Scenario 2 for positive slope: Delta(Y) is negative and Delta(X) is negative
Curve linear positive slope
Curves in these graphs (refer to Fig. 8.4) slope upward from left to right.
Slope = (Y2 − Y1) / (X2 − X1) = Delta(Y) / Delta(X)
Slope for a variable (X) may vary between two graphs, but it will always be positive;
hence, the above graphs are called graphs with curve linear positive slope.
Linear negative slope:
A negative slope always moves downward on a graph from left to right.
As X value (on X-axis) increases, Y value decreases
Slope = Rise/Run = (Y2 − Y1) / (X2 − X1) = Delta(Y) / Delta(X)
Scenario 1 for negative slope: Delta(Y) is positive and Delta(X) is negative
Scenario 2 for negative slope: Delta(Y) is negative and Delta(X) is positive
Curve linear negative slope:
Curves in these graphs (refer to Fig. 8.6) slope downward from left to right.
Slope = (Y2 − Y1) / (X2 − X1) = Delta(Y) / Delta(X)
Slope for a variable (X) may vary between two graphs, but it will always be negative;
hence, the above graphs are called graphs with curve linear negative slope.
No relationship graph:
A scatter graph indicates a ‘no relationship’ curve when it is very difficult to conclude whether the relationship between X and Y is positive or negative.
Error in simple regression
The regression equation model in machine learning uses the above slope–intercept format in algorithms.
X and Y values are provided to the machine, and it identifies the values of a (intercept) and b (slope) by
relating the values of X and Y.
However, identifying the exact values of a and b that fit every observation is not always possible. There will be some error value (ε) associated with the prediction, i.e. Y = a + bX + ε.
This error is called the marginal or residual error.
Now that we have some context of the simple regression model, let us try to explore an example to
understand clearly how to decide the parameters of the model (i.e. values of a and b) for a given problem.
Example of simple regression:
A college professor believes that if the grade for the internal examination is high in a class, the grade for the external examination will also be high. A random sample of 15 students in that class was selected, and the data is given below:
A scatter plot was drawn to explore the relationship between the independent variable (internal marks) mapped
to X-axis and dependent variable (external marks) mapped to Y-axis.
As you can observe from the above graph, the line (i.e. the regression line) does not predict the data exactly
(refer to Fig. 8.8). Instead, it just cuts through the data. Some predictions are lower than expected, while some
others are higher than expected.
Residual is the distance between the predicted point (on the regression line) and the actual point.
As we know, in simple linear regression, the line is drawn using the regression formula Y = a + bX.
If we know the values of ‘a’ and ‘b’, then it is easy to predict the value of Y for any given X by using the above formula.
But the question is how to calculate the values of ‘a’ and ‘b’ for a given set of X and Y values?
A straight line is drawn as close as possible over the points on the scatter plot. Ordinary Least Squares
(OLS) is the technique used to estimate a line that will minimize the error (ε), which is the difference
between the predicted and the actual values of Y.
This is done by summing the errors of each prediction or, more appropriately, the Sum of the Squares of the Errors (SSE), i.e. SSE = Σ(Yi − Ŷi)², and finding the line for which this sum is smallest.
It is observed that the SSE is least when b takes the value
b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
The corresponding value of ‘a’ calculated using the above value of ‘b’ is
a = Ȳ − b·X̄
where X̄ and Ȳ are the means of X and Y, respectively.
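As a rough sketch of how these formulas translate into code (using NumPy; the arrays x and y below are placeholder observations, not the marks data from this example):

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS estimates for simple linear regression y = a + b*x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    return a, b

# Placeholder data for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 4.2, 5.8, 8.1, 9.9]
a, b = ols_fit(x, y)
print(a, b)                      # intercept and slope
print(np.polyfit(x, y, 1))       # [slope, intercept] as a cross-check
```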
So, let us calculate the values of a and b for the given example. For the detailed calculation, refer to the section ‘Detailed calculation of regression parameters’ below.
Calculation summary
Sum of X = 299
Sum of Y = 852
Mean of X, X̄ = 19.93
Mean of Y, Ȳ = 56.8
Hence, for the above example, the estimated regression equation is constructed on the basis of the estimated values of a and b:
Ŷ = 19.05 + 1.89 X
So, in the context of the given problem, we can say
Marks in external exam = 19.05 + 1.89 × (Marks in internal exam)
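As a quick usage sketch, a hypothetical internal mark of 20 would give a predicted external mark of about 19.05 + 1.89 × 20 ≈ 56.85:

```python
def predict_external(internal_marks, a=19.05, b=1.89):
    """Predicted external-exam marks from the fitted simple linear regression."""
    return a + b * internal_marks

print(predict_external(20))   # ~56.85 (20 is a hypothetical internal mark)
```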
Detailed calculation of regression parameters
The model built above can be represented graphically as an extended version (refer to Fig. 8.11) and as a zoom-in version.
Interpretation of the intercept
As we have already seen, the simple linear regression model built on the data in the example is
Marks in external exam = 19.05 + 1.89 × (Marks in internal exam)
The value of the intercept from the above equation is 19.05. However, none of the internal marks is 0. So, intercept = 19.05 indicates that 19.05 is the portion of the external examination marks not explained by the internal examination marks.
Interpretation of the slope
Slope measures the estimated change in the average value of Y as a result of a one-unit change in X. Here,
slope = 1.89 tells us that the average value of the external examination marks increases by 1.89 for each additional
1 mark in the internal examination.
Now that we have a complete understanding of how to build a simple linear regression model for a given
problem, it is time to summarize the algorithm.
OLS algorithm
Step 1: Calculate the means of X and Y
Step 2: Calculate the deviations of X and Y from their respective means
Step 3: Get the product of the paired deviations
Step 4: Get the summation of the products
Step 5: Square the deviations of X
Step 6: Get the sum of the squared deviations
Step 7: Divide the output of Step 4 by the output of Step 6 to calculate ‘b’
Step 8: Calculate ‘a’ using the value of ‘b’ (a = Ȳ − b·X̄)
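A minimal pure-Python sketch that follows these eight steps literally (the function name and the sample lists are ours, for illustration only):

```python
def ols_simple(x, y):
    n = len(x)
    mean_x = sum(x) / n                                   # Step 1: means
    mean_y = sum(y) / n
    dev_x = [xi - mean_x for xi in x]                     # Step 2: deviations from the means
    dev_y = [yi - mean_y for yi in y]
    products = [dx * dy for dx, dy in zip(dev_x, dev_y)]  # Step 3: products
    sum_products = sum(products)                          # Step 4: sum of products
    sq_dev_x = [dx ** 2 for dx in dev_x]                  # Step 5: squared deviations of X
    sum_sq_dev_x = sum(sq_dev_x)                          # Step 6: sum of squared deviations
    b = sum_products / sum_sq_dev_x                       # Step 7: slope
    a = mean_y - b * mean_x                               # Step 8: intercept
    return a, b

print(ols_simple([1, 2, 3, 4], [3, 5, 7, 9]))   # (1.0, 2.0) for y = 1 + 2x
```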
Maximum and minimum points of curves:
Maximum points (shown in Fig. 8.13) and minimum points (shown in Fig. 8.14) on a graph are found where the slope of the curve is zero; the slope becomes zero as it changes from a positive to a negative value, or vice versa. The maximum point is the point on the curve of the graph with the highest y-coordinate and a slope of zero. The minimum point is the point on the curve of the graph with the lowest y-coordinate and a slope of zero.
FIG. 8.13 Maximum point of curve: Point 63 is the maximum point for this curve. It has a greater y-coordinate value than any other point on the curve and has a slope of zero.
FIG. 8.14 Minimum point of curve: Point 40 (marked with an arrow) is the minimum point for this curve. It has a lesser y-coordinate value than any other point on the curve and has a slope of zero.
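A small numerical sketch of this idea, assuming a hypothetical curve: we estimate the slope along the curve and locate the point where it is closest to zero.

```python
import numpy as np

x = np.linspace(-5, 5, 1001)
y = -(x - 1) ** 2 + 63            # hypothetical curve with a maximum value of 63 at x = 1
slope = np.gradient(y, x)         # numerical estimate of the slope at each point
i = np.argmin(np.abs(slope))      # index where the slope is closest to zero
print(x[i], y[i])                 # ~1.0, ~63.0 -> the maximum point of the curve
```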
4.4 MULTIPLE LINEAR REGRESSION:
In a multiple regression model, two or more independent variables, i.e. predictors are involved in the
model. If we think in the context of Karen’s problem, in the last section, we came up with a simple linear
regression by considering Price of a Property as the dependent variable and the Area of the Property (in sq. m.) as
the predictor variable. However, location, floor, number of years since purchase, amenities available, etc. are also
important predictors which should not be ignored. Thus, if we consider Price of a Property (in $) as the dependent
variable and Area of the Property (in sq. m.), location, floor, number of years since purchase and amenities
available as the independent variables, we can form a multiple regression equation as shown below:
Price_Property = a + b1 × (Area of the Property) + b2 × (Location) + b3 × (Floor) + b4 × (Number of years since purchase) + b5 × (Amenities available)
The simple linear regression model and the multiple regression model assume that the dependent variable is continuous. The following expression describes the equation involving the relationship with two predictor variables, namely X1 and X2:
Ŷ = a + b1·X1 + b2·X2
The model describes a plane in the three-dimensional space of Ŷ, X1, and X2. Parameter ‘a’ is the intercept of this plane. Parameters ‘b1’ and ‘b2’ are referred to as partial regression coefficients. Parameter b1 represents the change in the mean response corresponding to a unit change in X1 when X2 is held constant. Parameter b2 represents the change in the mean response corresponding to a unit change in X2 when X1 is held constant.
Consider the following example of a multiple linear regression model with two predictor variables, namely
X1 and X2.
The multiple regression estimating equation when there are ‘n’ predictor variables is as follows:
Ŷ = a + b1·X1 + b2·X2 + … + bn·Xn
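A minimal NumPy sketch of fitting such a model with two predictors by least squares (the arrays below are placeholder data, not Karen’s real-estate records):

```python
import numpy as np

# Placeholder data: two predictors X1, X2 and a response Y (roughly Y = 1 + 2*X1 + X2)
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y  = np.array([5.1, 5.9, 11.2, 11.8, 16.1])

# Design matrix with a leading column of ones for the intercept 'a'
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares estimates of (a, b1, b2)
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)   # should be close to 1, 2 and 1 for this placeholder data
```

The column of ones plays the role of the intercept ‘a’; each remaining column supplies one partial regression coefficient.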
While finding the best-fit line, we can also fit a polynomial or a curvilinear function of the predictors instead of a straight line. Such models are known as polynomial regression and curvilinear regression, respectively.
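A brief polynomial regression sketch using NumPy’s polyfit with a degree-2 polynomial (hypothetical data that follow a roughly quadratic trend):

```python
import numpy as np

# Hypothetical data that roughly follow y = x**2 + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.9, 10.2, 16.8, 26.1, 37.0])

# Fit y ~ c2*x**2 + c1*x + c0
coeffs = np.polyfit(x, y, deg=2)
poly = np.poly1d(coeffs)

print(coeffs)        # estimated polynomial coefficients (highest degree first)
print(poly(3.5))     # predicted y at x = 3.5
```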
Assumptions in Regression Analysis:
1. The dependent variable (Y) can be calculated/predicted as a linear function of a specific set of independent variables (X’s) plus an error term (ε).
2. The number of observations (n) is greater than the number of parameters (k) to be estimated, i.e. n > k.
3. Relationships determined by regression are only relationships of association based on the data set and are not necessarily relationships of cause and effect.
4. A regression line is valid only over a limited range of data. If the line is extended beyond that range (extrapolation), it may lead to wrong predictions.
5. If the business conditions change and the business assumptions underlying the regression model are no
longer valid, then the past data set will no longer be able to predict future trends.
6. The variance of the error term is the same for all values of X (homoskedasticity).
7. The error term (ε) is normally distributed. This also means that the expected value of the error (ε) is 0.
8. The values of the error (ε) are independent and are not related to any values of X. This means that the error for one particular observation (X, Y) is not related to the error for any other observation (X, Y).
Given the above assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE), and this is
known as the Gauss-Markov theorem.
Main Problems in Regression Analysis:
In multiple regression, there are two primary problems: multicollinearity and heteroskedasticity.
Multicollinearity:
Two variables are perfectly collinear if there is an exact linear relationship between them.
Multicollinearity is the situation in which there is correlation not only between the dependent variable and the independent variables, but also a strong correlation among the independent variables themselves.
A multiple regression equation can make good predictions when there is multicollinearity, but it is
difficult for us to determine how the dependent variable will change if each independent variable is
changed one at a time.
When multicollinearity is present, it increases the standard errors of the coefficients. By inflating the standard errors, multicollinearity makes some variables appear statistically insignificant when they should actually be significant (i.e. they would have lower standard errors without it).
One way to gauge multicollinearity is to calculate the Variance Inflation Factor (VIF), which assesses how much the variance of an estimated regression coefficient increases if the predictors are correlated. If no factors are correlated, the VIFs will all be equal to 1 (a short computation sketch is given after this list).
The assumption of no perfect collinearity states that there is no exact linear relationship among the
independent variables. This assumption implies two aspects of the data on the independent variables.
First, none of the independent variables, other than the variable associated with the intercept term, can
be a constant.
Second, variation in the X’s is necessary. In general, the more variation in the independent variables,
the better will be the OLS estimates in terms of identifying the impacts of the different independent
variables on the dependent variable.
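The following is a minimal NumPy sketch of the VIF calculation mentioned above, assuming VIF_j = 1 / (1 − R_j²), where R_j² is obtained by regressing predictor j on the remaining predictors (the data matrix is hypothetical):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of the predictor matrix X (n x k)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]                                   # predictor j treated as the response
        others = np.delete(X, j, axis=1)              # remaining predictors
        A = np.column_stack([np.ones(n), others])     # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r_squared = 1 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Hypothetical predictors: the second column is almost twice the first (strong collinearity)
X = np.array([[1.0,  2.1, 0.5],
              [2.0,  3.9, 1.2],
              [3.0,  6.2, 0.8],
              [4.0,  7.8, 1.9],
              [5.0, 10.1, 1.1]])
print(vif(X))   # large VIFs for the first two columns indicate multicollinearity
```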
Heteroskedasticity:
Heteroskedasticity refers to the changing variance of the error term.
If the variance of the error term is not constant across observations, the predictions will be erroneous.
In general, for a regression equation to make accurate predictions, the error terms should be independent and identically (normally) distributed (iid).
Mathematically, this assumption is written as
var(ui | X) = σ² for all i, and cov(ui, uj | X) = 0 for all i ≠ j,
where ‘var’ represents the variance, ‘cov’ represents the covariance, ‘u’ represents the error terms, and ‘X’ represents the independent variables.
This assumption is more commonly written as
u ~ iid(0, σ²), i.e. the error terms are independent and identically distributed with mean 0 and constant variance σ².
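A short simulation sketch of what heteroskedasticity looks like in practice (the data are entirely hypothetical; the error spread grows with X, violating the constant-variance assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)

# Homoskedastic errors: constant standard deviation
y_homo = 3 + 2 * x + rng.normal(0, 1.0, size=x.size)

# Heteroskedastic errors: standard deviation grows with x
y_hetero = 3 + 2 * x + rng.normal(0, 0.5 * x, size=x.size)

# Compare residual spread in the lower and upper halves of the x range
for name, y in [("homoskedastic", y_homo), ("heteroskedastic", y_hetero)]:
    resid = y - (3 + 2 * x)                 # residuals around the true line
    low, high = resid[x < 5.5], resid[x >= 5.5]
    print(name, round(low.std(), 2), round(high.std(), 2))
```

For the homoskedastic case the two spreads are roughly equal; for the heteroskedastic case the residual spread is clearly larger at higher values of X.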