
UNIT-II

Supervised Machine
Learning
• Basics of linear regression
• Assumptions, limitations, and industry applications
• Least squares based and gradient descent based regression
Basics of linear regression
Regression

• It is a supervised learning method.

• The goal of regression is to predict the value of one or more continuous target variables 't' given the value of a D-dimensional vector 'x' of input variables.

• Examples:
• Predicting the selling price and buying price of a house based on factors like square footage, number of bedrooms, and location.
• Predicting a person's blood pressure based on the number of hours they exercise per week.

• Given a training data set comprising 'N' observations {xn}, where n = 1, ..., N, together with corresponding target values {tn}, the goal is to predict the value of 't' for a new value of 'x'.
• In the simplest approach, this can be done by directly constructing an appropriate function y(x) whose values for new inputs 'x' constitute the predictions for the corresponding values of 't'.

• From a probabilistic perspective, we aim to model the predictive distribution p(t|x), because this expresses our uncertainty about the value of 't' for each value of 'x'.
• From this conditional distribution, we can make predictions of 't' for any new value of 'x' in such a way as to minimize the expected value of a suitably chosen loss function.
• The simplest linear regression models are also linear functions of the input variables.

• However, we can take linear combinations of a fixed set of nonlinear functions of the input variables, known as "basis functions".

• Such models are linear functions of the parameters, and can be nonlinear with respect to the input variables.
Linear Basis Function Models
• The simplest linear model for regression is one that involves a linear combination of the input variables:

y(x, w) = w0 + w1 x1 + ... + wD xD,  where x = (x1, ..., xD)^T

• It is a linear function of the parameters w0, ..., wD.
• It is also a linear function of the input variables xi, which imposes significant limitations on the model.

• Therefore, we extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form

y(x, w) = w0 + Σ_{j=1}^{M-1} wj φj(x)

(this reduces to the simple linear model above if φj(x) = xj)

• The parameter 'w0' allows for any fixed offset in the data and is sometimes called a bias parameter.
• It is convenient to define an additional dummy 'basis function' φ0(x) = 1, so that

y(x, w) = Σ_{j=0}^{M-1} wj φj(x) = w^T φ(x)
• By using nonlinear basis functions, we allow the function y(x, w) to be a non-linear function of the input vector 'x'.
• Such models are still called linear models because the function is linear in the parameters 'w'.
• An example of the above model is polynomial regression, where there is a single input variable x and the basis functions are φj(x) = x^j.
• A limitation of polynomial basis functions is that they are global functions of the input variable, so that changes in one region of input space affect all other regions.
• This can be resolved by dividing the input space up into regions and fitting a different polynomial in each region, leading to spline functions.
• Other common choices include 'Gaussian' basis functions and sigmoidal basis functions, among other basis functions.

Source: Simple linear regression-Machine Learning-4-1-1-Supervised Learning-Regression-Slopes-JNTUA-CSE
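To make the basis-function idea concrete, here is a minimal NumPy sketch (function names and data are illustrative assumptions, not from the slides) that builds polynomial and Gaussian design matrices and fits the weights by least squares:

```python
import numpy as np

def polynomial_design(x, M):
    """Design matrix with polynomial basis functions phi_j(x) = x**j, j = 0..M-1.
    The j = 0 column is the dummy basis function phi_0(x) = 1 (the bias)."""
    return np.column_stack([x**j for j in range(M)])

def gaussian_design(x, centers, s):
    """Design matrix with Gaussian basis functions plus a bias column."""
    phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))
    return np.column_stack([np.ones_like(x), phi])

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy targets

Phi = polynomial_design(x, M=4)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)        # least-squares fit of w
y_new = polynomial_design(np.array([0.5]), 4) @ w  # prediction at a new x
```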
Deriving the least squares estimators of the slope and intercept
(simple linear regression)

• Source: https://youtu.be/ewnc1cXJmGA?si=DilxFVpp7ELg1d8t
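The derivation in the linked video arrives at the standard closed-form estimators: slope b1 = Σ(xi − x̄)(ti − t̄) / Σ(xi − x̄)² and intercept b0 = t̄ − b1 x̄. A minimal NumPy sketch of these formulas (variable names are illustrative):

```python
import numpy as np

def simple_ols(x, t):
    """Closed-form least squares estimates for the model t ≈ b0 + b1 * x."""
    xm, tm = x.mean(), t.mean()
    b1 = np.sum((x - xm) * (t - tm)) / np.sum((x - xm)**2)  # slope
    b0 = tm - b1 * xm                                       # intercept
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
t = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # roughly t = 2x
print(simple_ols(x, t))
```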
Assumptions of linear regression
• The validity of linear regression relies on certain assumptions about the data.

• Violations of these assumptions can affect the model's performance.

Linearity
• Assumption: The relationship between the independent and dependent variables is linear, i.e., a change in an independent variable results in a proportional change in the dependent variable.

• If the relationship is not linear, the model may underfit the data, leading to inaccurate predictions. In such cases, transformations of the data or the use of non-linear regression models may be more appropriate.
• The simplest way to check for linearity is a residual plot, which helps us identify poor or incorrect curve fitting between the data and the regression model (see the sketch after the example below).

• Example:
• Consider a dataset where the relationship between
temperature and ice cream sales is being studied.
• Linear Relationship: This is where the increase in temperature
results in a consistent increase in ice cream sales.
• Non-Linear Relationship: In this case, the increase in temperature
leads to a more significant increase in ice cream sales at higher
temperatures, indicating a non-linear relationship.
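A minimal sketch of such a residual plot in Python (the temperature/sales data here are synthetic stand-ins for the example above):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, deliberately non-linear relationship between temperature and sales
temperature = np.linspace(10, 35, 50)
sales = 5 + 2.5 * temperature + 0.15 * temperature**2

b1, b0 = np.polyfit(temperature, sales, 1)      # straight-line fit
residuals = sales - (b0 + b1 * temperature)

plt.scatter(temperature, residuals)
plt.axhline(0, color="red")
plt.xlabel("temperature"); plt.ylabel("residual")
plt.show()  # a curved band of residuals signals a non-linear relationship
```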
Homoscedasticity of Residuals

• Homoscedasticity states that the residuals (differences between observed and predicted values) should have a constant variance across all levels of the independent variable(s), i.e., the spread of the errors should be relatively uniform, regardless of the value of the predictor.
• If the variance of the residuals remains constant, the model is homoscedastic; if the variance of the residuals changes with the independent variable, the model is heteroscedastic.

• Heteroscedasticity can lead to several issues:
• Inefficient estimates: the estimates of the coefficients may not be the best linear unbiased estimators (BLUE), meaning that they could be less accurate than they should be.
• Impact on hypothesis testing: standard errors can become biased, leading to unreliable significance tests and confidence intervals.
• Left plot (homoscedasticity): the residuals are scattered evenly around the horizontal line at zero, indicating a constant variance.

• Right plot (heteroscedasticity): the residuals are not evenly scattered; there is a clear pattern of increasing variance as the predicted values increase, indicating heteroscedasticity.
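A common formal check is the Breusch-Pagan test; a minimal sketch using statsmodels (the simulated data is an illustrative assumption):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
t = 1.0 + 2.0 * x + rng.normal(0, 0.5 * x)   # noise grows with x: heteroscedastic

X = sm.add_constant(x)                        # add the intercept column
fit = sm.OLS(t, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")  # small p -> heteroscedasticity
```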
Multivariate Normality – Normal Distribution of errors
• Multivariate normality means that the residuals (differences between observed and predicted values) should follow a normal distribution when considering multiple predictors together.
• A histogram or Q-Q plot of the residuals reveals any deviation from normality.
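A minimal sketch of a normality check on the residuals, using SciPy's Shapiro-Wilk test and a Q-Q plot (the residuals here are simulated placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.standard_normal(200)        # stand-in for model residuals

stat, p = stats.shapiro(residuals)          # Shapiro-Wilk test of normality
print(f"Shapiro-Wilk p-value: {p:.4f}")     # large p: no evidence against normality

stats.probplot(residuals, dist="norm", plot=plt)  # Q-Q plot
plt.show()  # points near the reference line indicate normal residuals
```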


Independence of Errors
(No autocorrelation of Errors)

• Independence of errors ensures that the residuals are NOT correlated with one another, i.e., the error associated with one observation should not influence the error of any other observation.

• When errors are correlated, this may indicate that some underlying pattern or trend in the data has been overlooked by the model, and it can lead to underestimated standard errors, resulting in overconfident predictions and misleading significance tests.
• Violation of this assumption is most common in time series data, where the error at one point in time may influence errors at subsequent time points. Such patterns suggest the presence of autocorrelation.
• The Residuals vs. Time plot shows a random scatter of points, suggesting no clear pattern or correlation over time.
• The ACF (autocorrelation function) of residuals plot shows a few spikes at low lags, but they are not significant enough to indicate strong autocorrelation.
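A standard numeric check for autocorrelation of residuals is the Durbin-Watson statistic; a minimal statsmodels sketch (data simulated for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
t = 3.0 + 0.5 * x + rng.standard_normal(100)

fit = sm.OLS(t, sm.add_constant(x)).fit()
dw = durbin_watson(fit.resid)   # ~2: none; <2: positive; >2: negative autocorrelation
print(f"Durbin-Watson statistic: {dw:.2f}")
```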
Lack of Multicollinearity
• Assumption: The independent VARIABLES are not highly correlated with
each other.

• Multicollinearity occurs when two or more independent variables in the model are highly correlated, leading to redundancy in the information they provide. This can inflate the standard errors of the coefficients, making it difficult to determine the effect of each independent variable.

• When multicollinearity is present, it becomes challenging to interpret the coefficients of the regression model accurately, and it can lead to overfitting, where the model performs well on training data but poorly on unseen data.

• Example:
• In a model predicting health outcomes based on multiple health metrics, if both blood pressure and heart rate are included as predictors, their high correlation may lead to multicollinearity (see the VIF sketch below).
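Multicollinearity is commonly quantified with the variance inflation factor (VIF); a minimal statsmodels sketch using simulated blood pressure and heart rate values (illustrative, not real data):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
blood_pressure = rng.normal(120, 10, 200)
heart_rate = 0.6 * blood_pressure + rng.normal(0, 3, 200)  # strongly correlated

X = sm.add_constant(np.column_stack([blood_pressure, heart_rate]))
for i, name in enumerate(["const", "blood_pressure", "heart_rate"]):
    print(name, variance_inflation_factor(X, i))  # VIF > 10 is a common red flag
```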
Absence of Endogeneity

• No endogeneity is an important assumption in the context of multiple linear regression.

• The assumption of NO endogeneity states that the independent variables in the regression model should NOT be correlated with the error term.
• If it is violated, it leads to biased and inconsistent estimates of the regression coefficients.

• Bias and consistency:
• When endogeneity is present, the estimates of the regression coefficients are biased: they do not accurately reflect the true relationships between the variables.
• The estimates also become inconsistent, which means they do not converge to the true parameter values as the sample size increases.
• Valid Inference:
• The assumption of no endogeneity is critical for conducting valid
hypothesis tests and creating reliable confidence intervals.
• If endogeneity exists, the statistical tests based on these estimates may
lead to incorrect conclusions.
Limitations of Linear Regression

• 1. Limited to Linear Relationships


- Linear regression can only model linear relationships between independent and dependent variables. It fails when the relationship is
nonlinear unless transformed appropriately.
• 2. Sensitive to Outliers
- Since linear regression minimizes squared errors, outliers can significantly impact the regression line, leading to misleading
predictions.
• 3. Poor Performance with High-Dimensional Data (Overfitting)
- When the number of independent variables is large relative to the number of observations, linear regression may overfit, making it
less generalizable.
• 4. Feature Interaction is Not Captured
- Standard linear regression does not account for interactions between independent variables unless explicitly included through
interaction terms.
• 5. Assumes Independence of Features
- Highly correlated (multicollinear) features can make it difficult to determine the effect of individual predictors, leading to unstable
coefficient estimates.
• 6. Cannot Handle Categorical Variables Directly
- Linear regression requires categorical variables to be converted into numerical representations (e.g., one-hot encoding; see the sketch after this list), increasing
complexity and potential collinearity.
• 7. Lack of Robustness to Missing Data
- Missing values need to be handled separately, as linear regression does not inherently manage incomplete datasets.
• 8. Not Suitable for Complex Relationships
- Real-world problems often involve nonlinearity, interactions, and dependencies that linear regression cannot effectively capture
without modifications.
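As a brief illustration of point 6, a minimal pandas sketch of one-hot encoding a categorical predictor before regression (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "sqft": [1200, 1500, 900],
    "location": ["urban", "suburban", "rural"],   # categorical predictor
})
# One-hot encode 'location'; drop_first=True avoids perfect collinearity
# with the intercept column (the "dummy variable trap").
encoded = pd.get_dummies(df, columns=["location"], drop_first=True)
print(encoded)
```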
Gradient Descent Based Regression

Source: Linear Regression: Gradient Descent Approach, Learning rate, parameters, Simple and Multiple
To understand gradient descent, first consider the cost function for linear regression: for a hypothesis f(x) = w0 + w1 x and training pairs (xn, tn), the mean squared error cost is

J(w0, w1) = (1/2N) Σ_{n=1}^{N} (f(xn) − tn)²

Gradient descent minimizes J by repeatedly updating each parameter in the direction of the negative gradient, scaled by a learning rate α.
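A minimal sketch of batch gradient descent for simple linear regression (the learning rate, iteration count, and data are illustrative choices, not values from the slides):

```python
import numpy as np

def gradient_descent(x, t, alpha=0.05, n_iters=5000):
    """Fit t ≈ w0 + w1 * x by batch gradient descent on the MSE cost J."""
    w0, w1 = 0.0, 0.0
    N = len(x)
    for _ in range(n_iters):
        error = (w0 + w1 * x) - t          # prediction errors f(xn) - tn
        grad_w0 = error.sum() / N          # dJ/dw0
        grad_w1 = (error * x).sum() / N    # dJ/dw1
        w0 -= alpha * grad_w0              # step against the gradient
        w1 -= alpha * grad_w1
    return w0, w1

x = np.array([1.0, 2.0, 3.0, 4.0])
t = np.array([3.1, 4.9, 7.2, 8.8])         # roughly t = 1 + 2x
print(gradient_descent(x, t))              # should approach (1.0, 2.0)
```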
Industry applications of LR
• Market analysis
• establishing the relationships between several quantitative
variables, such as social media engagement, pricing and
number of sales.
• This information allows you to utilise specific marketing
strategies to maximise sales and increase revenue.
• For example, you can use a simple linear model to ascertain how price affects sales and to evaluate the strength of the relationship between the two variables.

• Financial analysis
• Financial analysts use linear models to evaluate a company's
operational performance and forecast returns on investment.
• They also use it in the capital asset pricing model, which
studies the relationship between the expected investment
returns and the associated market risks. It shows companies if
an investment has a fair price and contributes to decisions on
whether or not to invest in the asset.
• Sports analysis
• This involves sports analysts using statistics to determine a team's or
player's performance in a game. They can use this information to
compare teams and players and provide essential information to their
followers.
• They can also use this data to predict game attendance based on the
status of the teams playing and the market size, so they can advise
team managers on game venues and ticket prices that can maximize
profits.

• Environmental health
• Specialists in this field use this regression model to evaluate the
relationship between natural elements, such as soil, water and air.
• An example is the relationship between the amount of water and
plant growth. This can help environmentalists predict the effects of air
or water pollution on environmental health.

• Medicine
• Medical researchers can use this regression model to determine the
relationship between independent characteristics, such as age and
body weight, and dependent ones, such as blood pressure.
• This can help reveal the risk factors associated with diseases. They can
use this information to identify high-risk patients and promote healthy
lifestyles.
• In statistics, linear regression refers to a model that can show the relationship between two variables and how
one can impact the other. In essence, it involves showing how the variation in the “dependent variable” can be
captured by change in the “independent variables”.
• In business, this dependent variable can also be called the response or the factor of interest, e.g., sales of a product, pricing, performance, risk, etc. Independent variables are also called explanatory variables as they
can explain the factors that influence the dependent variable along with the degree of the impact which can be
calculated using “parameter estimates” or “coefficients”. These coefficients are tested for statistical
significance by building confidence intervals around them so that the model that we are building is statistically
robust and based on objective data. The elasticity based on the coefficient can tell us the extent to which a
certain factor explains the dependent variable. Further, a negative coefficient can be interpreted to have a negative or an inverse relation with the dependent variable, and a positive coefficient can be said to have a positive influence. The key factor in any statistical model is the right understanding of the domain and its business
application.
• Linear Regression is a very powerful statistical technique and can be used to generate insights on consumer
behaviour, understanding business and factors influencing profitability. Linear regressions can be used in
business to evaluate trends and make estimates or forecasts. For example, if a company’s sales have
increased steadily every month for the past few years, by conducting a linear analysis on the sales data with
monthly sales, the company could forecast sales in future months.
• Linear regression can also be used to analyze the marketing effectiveness, pricing and promotions on sales of
a product. For instance, if company XYZ, wants to know if the funds that they have invested in marketing a
particular brand has given them substantial return on investment, they can use linear regression. The beauty
of linear regression is that it enables us to capture the isolated impacts of each of the marketing campaigns
along with controlling the factors that could influence the sales. In real life scenarios there are multiple
advertising campaigns that run during the same time period. Supposing two campaigns are run on TV and
Radio in parallel, a linear regression can capture the isolated as well as the combined impact of running these ads together.
• Linear Regression can be also used to assess risk in financial services or insurance domain. For example, a
car insurance company might conduct a linear regression to come up with a suggested premium table using
predicted claims to Insured Declared Value ratio. The risk can be assessed based on the attributes of the car,
driver information or demographics. The results of such an analysis might guide important business decisions.
• In the credit card industry, a financial company may be interested in minimizing its risk portfolio and wants to
understand the top five factors that cause a customer to default. Based on the results the company could
implement specific EMI options so as to minimize default among risky customers.
• While Linear regression has limited applicability in business situations because it can work only when the
dependent variable is of a continuous nature, it is still a very well-known technique in the situations where it can be used. It assumes a linear relation between the independent and dependent variables. It must be noted that
sometimes transformations can also be applied to non linear relationships to make them applicable in a linear
regression model.
• The following are some of the areas where Simple Linear Regression is used
• Economics and Finance: Simple linear regression is employed in economics to
analyse relationships between economic variables, such as the impact of
interest rates on consumer spending or the relationship between inflation and
unemployment.
• Marketing and Sales: Businesses use simple linear regression for sales
forecasting. By analysing historical sales data and factors like advertising
expenditure or price changes, companies can make predictions about future
sales and adjust their strategies accordingly.
• Medical and Healthcare: Simple linear regression can be applied in healthcare
to study the relationship between variables like patient age and medical
expenses, drug dosage and treatment outcomes, or patient satisfaction and
hospital wait times.
• Sports Analytics: In sports analytics, simple linear regression can be used to
analyze player performance metrics (e.g., batting average in baseball or
shooting percentage in basketball) and their relationship with factors like
training intensity, player fatigue, or coaching strategies.
• Energy and Utilities: Energy companies can use simple linear regression to
predict energy consumption based on historical data and weather conditions.
This helps in resource planning and optimizing energy distribution.
• Many other areas exist in which linear relationships between variables persist.
