
UNIT-II

Supervised Machine
Learning
• Basics of linear regression
• Assumptions, limitations, and industry applications
• Least squares based and gradient descent based regression
Basics of linear regression
Regression

• It is a supervised learning method.

• The goal of regression is to predict the value of one or more continuous target variables 't' given the value of a D-dimensional vector 'x' of input variables.

• Examples:
• Predicting the selling price and buying price of a house based on factors like square footage, number of bedrooms, and location.
• Predicting a person's blood pressure based on the number of hours they exercise per week.

• Given a training data set comprising 'N' observations {xn}, where n = 1, ..., N, together with corresponding target values {tn}, the goal is to predict the value of 't' for a new value of 'x'.
• In the simplest approach, this can be done by directly constructing an appropriate function y(x) whose values for new inputs 'x' constitute the predictions for the corresponding values of 't'.

• From a probabilistic perspective, we aim to model the predictive distribution p(t|x), because this expresses our uncertainty about the value of 't' for each value of 'x'.
• From this conditional distribution, we can make predictions of 't' for any new value of 'x' in such a way as to minimize the expected value of a suitably chosen loss function.
• The simplest linear regression models are also linear functions of the input variables.

• However, we can take linear combinations of a fixed set of nonlinear functions of the input variables, known as "basis functions".

• Such models are linear functions of the parameters, and can be nonlinear with respect to the input variables.
Linear Basis Function Models
• The simplest linear model for regression is one that involves a linear combination of the input variables:

y(x, w) = w0 + w1 x1 + ... + wD xD,  where x = (x1, ..., xD)^T

• It is a linear function of the parameters w0, ..., wD.
• It is also a linear function of the input variables xi, which imposes significant limitations on the model.

• Therefore, we extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form

y(x, w) = w0 + Σ_{j=1}^{M-1} wj φj(x)

(this reduces to the simple linear model above if φj(x) = xj)

• The parameter 'w0' allows for any fixed offset in the data and is sometimes called a bias parameter.
• It is convenient to define an additional dummy 'basis function' φ0(x) = 1, so that

y(x, w) = Σ_{j=0}^{M-1} wj φj(x) = w^T φ(x)
• By using nonlinear basis functions, we allow the function y(x, w) to be a non-linear function of the input vector 'x'.
• Such models are still called linear models because the function is linear in the parameters 'w'.
• An example of the above model is polynomial regression, where there is a single input variable x and the basis functions are φj(x) = x^j.
• A limitation of polynomial basis functions is that they are global functions of the input variable, so that changes in one region of input space affect all other regions.
• This can be resolved by dividing the input space up into regions and fitting a different polynomial in each region, leading to spline functions.
• Other common choices include 'Gaussian' basis functions and sigmoidal basis functions, among other basis functions.

Source: Simple linear regression-Machine Learning-4-1-1-Supervised Learning-Regression-Slopes-JNTUA-CSE
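To make the basis-function idea concrete, here is a minimal NumPy sketch (function names and data are illustrative assumptions, not from the slides) that builds polynomial and Gaussian design matrices and fits the weights by least squares:

```python
import numpy as np

def polynomial_design(x, M):
    """Design matrix with polynomial basis functions phi_j(x) = x**j, j = 0..M-1.
    The j = 0 column is the dummy basis function phi_0(x) = 1 (the bias)."""
    return np.column_stack([x**j for j in range(M)])

def gaussian_design(x, centers, s):
    """Design matrix with Gaussian basis functions plus a bias column."""
    phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))
    return np.column_stack([np.ones_like(x), phi])

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy targets

Phi = polynomial_design(x, M=4)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)        # least-squares fit of w
y_new = polynomial_design(np.array([0.5]), 4) @ w  # prediction at a new x
```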
Deriving the least squares estimators of the slope and intercept
(simple linear regression)

• Source: https://youtu.be/ewnc1cXJmGA?si=DilxFVpp7ELg1d8t
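The derivation in the linked video arrives at the standard closed-form estimators: slope b1 = Σ(xi − x̄)(ti − t̄) / Σ(xi − x̄)² and intercept b0 = t̄ − b1 x̄. A minimal NumPy sketch of these formulas (variable names are illustrative):

```python
import numpy as np

def simple_ols(x, t):
    """Closed-form least squares estimates for the model t ≈ b0 + b1 * x."""
    xm, tm = x.mean(), t.mean()
    b1 = np.sum((x - xm) * (t - tm)) / np.sum((x - xm)**2)  # slope
    b0 = tm - b1 * xm                                       # intercept
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
t = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # roughly t = 2x
print(simple_ols(x, t))
```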
Assumptions of linear regression
• The validity of linear regression relies on certain assumptions about the data.

• Violations of these assumptions can affect the model's performance.

Linearity
• Assumption: The relationship between the independent and dependent variables is linear, i.e., a change in an independent variable results in a proportional change in the dependent variable.

• If the relationship is not linear, the model may underfit the data, leading to inaccurate predictions. In such cases, transformations of the data or the use of non-linear regression models may be more appropriate.
• The simplest way to check for linearity is a residual plot, which helps us identify poor or incorrect curve fitting between the data and the regression model (see the sketch after the example below).

• Example:
• Consider a dataset where the relationship between
temperature and ice cream sales is being studied.
• Linear Relationship: This is where the increase in temperature
results in a consistent increase in ice cream sales.
• Non-Linear Relationship: In this case, the increase in temperature
leads to a more significant increase in ice cream sales at higher
temperatures, indicating a non-linear relationship.
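A minimal sketch of such a residual plot in Python (the temperature/sales data here are synthetic stand-ins for the example above):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, deliberately non-linear relationship between temperature and sales
temperature = np.linspace(10, 35, 50)
sales = 5 + 2.5 * temperature + 0.15 * temperature**2

b1, b0 = np.polyfit(temperature, sales, 1)      # straight-line fit
residuals = sales - (b0 + b1 * temperature)

plt.scatter(temperature, residuals)
plt.axhline(0, color="red")
plt.xlabel("temperature"); plt.ylabel("residual")
plt.show()  # a curved band of residuals signals a non-linear relationship
```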
Homoscedasticity of Residuals

• Homoscedasticity states that the residuals (differences between observed and predicted values) should have a constant variance across all levels of the independent variable(s), i.e., the spread of the errors should be relatively uniform, regardless of the value of the predictor.
• If the variance of the residuals remains constant, the model is homoscedastic; if the variance of the residuals changes with the independent variable, the model is heteroscedastic.

• Heteroscedasticity can lead to several issues:
• Inefficient estimates: the estimates of the coefficients may not be the best linear unbiased estimators (BLUE), meaning that they could be less accurate than they should be.
• Impact on hypothesis testing: standard errors can become biased, leading to unreliable significance tests and confidence intervals.
• Left plot (homoscedasticity): the residuals are scattered evenly around the horizontal line at zero, indicating a constant variance.

• Right plot (heteroscedasticity): the residuals are not evenly scattered; there is a clear pattern of increasing variance as the predicted values increase, indicating heteroscedasticity.
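A common formal check is the Breusch-Pagan test; a minimal sketch using statsmodels (the simulated data is an illustrative assumption):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
t = 1.0 + 2.0 * x + rng.normal(0, 0.5 * x)   # noise grows with x: heteroscedastic

X = sm.add_constant(x)                        # add the intercept column
fit = sm.OLS(t, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")  # small p -> heteroscedasticity
```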
Multivariate Normality – Normal Distribution of errors
• Multivariate normality means that the residuals (differences between observed and predicted values) should follow a normal distribution when considering multiple predictors together.
• A histogram or Q-Q plot of the residuals reveals any deviation from normality.
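A minimal sketch of a normality check on the residuals, using SciPy's Shapiro-Wilk test and a Q-Q plot (the residuals here are simulated placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.standard_normal(200)        # stand-in for model residuals

stat, p = stats.shapiro(residuals)          # Shapiro-Wilk test of normality
print(f"Shapiro-Wilk p-value: {p:.4f}")     # large p: no evidence against normality

stats.probplot(residuals, dist="norm", plot=plt)  # Q-Q plot
plt.show()  # points near the reference line indicate normal residuals
```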


Independence of Errors
(No autocorrelation of Errors)

• Independence of errors ensures that the residuals are NOT correlated with one another, i.e., the error associated with one observation should not influence the error of any other observation.

• When errors are correlated, this may indicate that some underlying pattern or trend in the data has been overlooked by the model, and it can lead to underestimated standard errors, resulting in overconfident predictions and misleading significance tests.
• Violation of this assumption is most common in time series data, where the error at one point in time may influence errors at subsequent time points. Such patterns suggest the presence of autocorrelation.
• The Residuals vs. Time plot shows a random scatter of points, suggesting no clear pattern or correlation over time.
• The ACF (autocorrelation function) of residuals plot shows a few spikes at low lags, but they are not significant enough to indicate strong autocorrelation.
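A standard numeric check for autocorrelation of residuals is the Durbin-Watson statistic; a minimal statsmodels sketch (data simulated for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
t = 3.0 + 0.5 * x + rng.standard_normal(100)

fit = sm.OLS(t, sm.add_constant(x)).fit()
dw = durbin_watson(fit.resid)   # ~2: none; <2: positive; >2: negative autocorrelation
print(f"Durbin-Watson statistic: {dw:.2f}")
```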
Lack of Multicollinearity
• Assumption: The independent VARIABLES are not highly correlated with
each other.

• Multicollinearity occurs when two or more independent variables in the model are highly correlated, leading to redundancy in the information they provide. This can inflate the standard errors of the coefficients, making it difficult to determine the effect of each independent variable.

• When multicollinearity is present, it becomes challenging to interpret the coefficients of the regression model accurately, and it can lead to overfitting, where the model performs well on training data but poorly on unseen data.

• Example:
• In a model predicting health outcomes based on multiple health metrics, if both blood pressure and heart rate are included as predictors, their high correlation may lead to multicollinearity (see the VIF sketch below).
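Multicollinearity is commonly quantified with the variance inflation factor (VIF); a minimal statsmodels sketch using simulated blood pressure and heart rate values (illustrative, not real data):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
blood_pressure = rng.normal(120, 10, 200)
heart_rate = 0.6 * blood_pressure + rng.normal(0, 3, 200)  # strongly correlated

X = sm.add_constant(np.column_stack([blood_pressure, heart_rate]))
for i, name in enumerate(["const", "blood_pressure", "heart_rate"]):
    print(name, variance_inflation_factor(X, i))  # VIF > 10 is a common red flag
```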
Absence of Endogeneity

• No endogeneity is an important assumption in the context of multiple linear regression.

• The assumption of NO endogeneity states that the independent variables in the regression model should NOT be correlated with the error term.
• If it is violated, it leads to biased and inconsistent estimates of the regression coefficients.

• Bias and consistency:
• When endogeneity is present, the estimates of the regression coefficients are biased: they do not accurately reflect the true relationships between the variables.
• The estimates also become inconsistent, which means they do not converge to the true parameter values as the sample size increases.
• Valid Inference:
• The assumption of no endogeneity is critical for conducting valid
hypothesis tests and creating reliable confidence intervals.
• If endogeneity exists, the statistical tests based on these estimates may
lead to incorrect conclusions.
Limitations of Linear Regression

• 1. Limited to Linear Relationships


- Linear regression can only model linear relationships between independent and dependent variables. It fails when the relationship is
nonlinear unless transformed appropriately.
• 2. Sensitive to Outliers
- Since linear regression minimizes squared errors, outliers can significantly impact the regression line, leading to misleading
predictions.
• 3. Poor Performance with High-Dimensional Data (Overfitting)
- When the number of independent variables is large relative to the number of observations, linear regression may overfit, making it
less generalizable.
• 4. Feature Interaction is Not Captured
- Standard linear regression does not account for interactions between independent variables unless explicitly included through
interaction terms.
• 5. Assumes Independence of Features
- Highly correlated (multicollinear) features can make it difficult to determine the effect of individual predictors, leading to unstable
coefficient estimates.
• 6. Cannot Handle Categorical Variables Directly
- Linear regression requires categorical variables to be converted into numerical representations (e.g., one-hot encoding; see the sketch after this list), increasing
complexity and potential collinearity.
• 7. Lack of Robustness to Missing Data
- Missing values need to be handled separately, as linear regression does not inherently manage incomplete datasets.
• 8. Not Suitable for Complex Relationships
- Real-world problems often involve nonlinearity, interactions, and dependencies that linear regression cannot effectively capture
without modifications.
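As a brief illustration of point 6, a minimal pandas sketch of one-hot encoding a categorical predictor before regression (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "sqft": [1200, 1500, 900],
    "location": ["urban", "suburban", "rural"],   # categorical predictor
})
# One-hot encode 'location'; drop_first=True avoids perfect collinearity
# with the intercept column (the "dummy variable trap").
encoded = pd.get_dummies(df, columns=["location"], drop_first=True)
print(encoded)
```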
Gradient Descent Based Regression

Source: Linear Regression: Gradient Descent Approach, Learning rate, parameters, Simple and Multiple
To understand gradient descent, first consider the cost function for linear regression: for a hypothesis f(x) = w0 + w1 x and training pairs (xn, tn), the mean squared error cost is

J(w0, w1) = (1/2N) Σ_{n=1}^{N} (f(xn) − tn)²

Gradient descent minimizes J by repeatedly updating each parameter in the direction of the negative gradient, scaled by a learning rate α.
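A minimal sketch of batch gradient descent for simple linear regression (the learning rate, iteration count, and data are illustrative choices, not values from the slides):

```python
import numpy as np

def gradient_descent(x, t, alpha=0.05, n_iters=5000):
    """Fit t ≈ w0 + w1 * x by batch gradient descent on the MSE cost J."""
    w0, w1 = 0.0, 0.0
    N = len(x)
    for _ in range(n_iters):
        error = (w0 + w1 * x) - t          # prediction errors f(xn) - tn
        grad_w0 = error.sum() / N          # dJ/dw0
        grad_w1 = (error * x).sum() / N    # dJ/dw1
        w0 -= alpha * grad_w0              # step against the gradient
        w1 -= alpha * grad_w1
    return w0, w1

x = np.array([1.0, 2.0, 3.0, 4.0])
t = np.array([3.1, 4.9, 7.2, 8.8])         # roughly t = 1 + 2x
print(gradient_descent(x, t))              # should approach (1.0, 2.0)
```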
Industry applications of LR
• Market analysis
• establishing the relationships between several quantitative
variables, such as social media engagement, pricing and
number of sales.
• This information allows you to utilise specific marketing
strategies to maximise sales and increase revenue.
• For example, you can use a simple linear model to ascertain how price affects sales and to evaluate the strength of the relationship between the two variables.

• Financial analysis
• Financial analysts use linear models to evaluate a company's
operational performance and forecast returns on investment.
• They also use it in the capital asset pricing model, which
studies the relationship between the expected investment
returns and the associated market risks. It shows companies if
an investment has a fair price and contributes to decisions on
whether or not to invest in the asset.
• Sports analysis
• This involves sports analysts using statistics to determine a team's or
player's performance in a game. They can use this information to
compare teams and players and provide essential information to their
followers.
• They can also use this data to predict game attendance based on the
status of the teams playing and the market size, so they can advise
team managers on game venues and ticket prices that can maximize
profits.

• Environmental health
• Specialists in this field use this regression model to evaluate the
relationship between natural elements, such as soil, water and air.
• An example is the relationship between the amount of water and
plant growth. This can help environmentalists predict the effects of air
or water pollution on environmental health.

• Medicine
• Medical researchers can use this regression model to determine the
relationship between independent characteristics, such as age and
body weight, and dependent ones, such as blood pressure.
• This can help reveal the risk factors associated with diseases. They can
use this information to identify high-risk patients and promote healthy
lifestyles.
• In statistics, linear regression refers to a model that can show the relationship between two variables and how
one can impact the other. In essence, it involves showing how the variation in the “dependent variable” can be
captured by change in the “independent variables”.
• In business, this dependent variable can also be called the response or the factor of interest, e.g., sales of a product, pricing, performance, risk, etc. Independent variables are also called explanatory variables as they
can explain the factors that influence the dependent variable along with the degree of the impact which can be
calculated using “parameter estimates” or “coefficients”. These coefficients are tested for statistical
significance by building confidence intervals around them so that the model that we are building is statistically
robust and based on objective data. The elasticity based on the coefficient can tell us the extent to which a
certain factor explains the dependent variable. Further, a negative coefficient can be interpreted to have a negative or an inverse relation with the dependent variable, and a positive coefficient can be said to have a positive influence. The key factor in any statistical model is the right understanding of the domain and its business
application.
• Linear Regression is a very powerful statistical technique and can be used to generate insights on consumer
behaviour, understanding business and factors influencing profitability. Linear regressions can be used in
business to evaluate trends and make estimates or forecasts. For example, if a company’s sales have
increased steadily every month for the past few years, by conducting a linear analysis on the sales data with
monthly sales, the company could forecast sales in future months.
• Linear regression can also be used to analyze the marketing effectiveness, pricing and promotions on sales of
a product. For instance, if company XYZ, wants to know if the funds that they have invested in marketing a
particular brand has given them substantial return on investment, they can use linear regression. The beauty
of linear regression is that it enables us to capture the isolated impacts of each of the marketing campaigns
along with controlling the factors that could influence the sales. In real life scenarios there are multiple
advertising campaigns that run during the same time period. Supposing two campaigns are run on TV and
Radio in parallel, a linear regression can capture the isolated as well as the combined impact of running these ads together.
• Linear Regression can be also used to assess risk in financial services or insurance domain. For example, a
car insurance company might conduct a linear regression to come up with a suggested premium table using
predicted claims to Insured Declared Value ratio. The risk can be assessed based on the attributes of the car,
driver information or demographics. The results of such an analysis might guide important business decisions.
• In the credit card industry, a financial company may be interested in minimizing its risk portfolio and wants to
understand the top five factors that cause a customer to default. Based on the results the company could
implement specific EMI options so as to minimize default among risky customers.
• While Linear regression has limited applicability in business situations because it can work only when the
dependent variable is of a continuous nature, it is still a very well-known technique in the situations where it can be used. It assumes a linear relation between the independent and dependent variables. It must be noted that
sometimes transformations can also be applied to non linear relationships to make them applicable in a linear
regression model.
• The following are some of the areas where Simple Linear Regression is used
• Economics and Finance: Simple linear regression is employed in economics to
analyse relationships between economic variables, such as the impact of
interest rates on consumer spending or the relationship between inflation and
unemployment.
• Marketing and Sales: Businesses use simple linear regression for sales
forecasting. By analysing historical sales data and factors like advertising
expenditure or price changes, companies can make predictions about future
sales and adjust their strategies accordingly.
• Medical and Healthcare: Simple linear regression can be applied in healthcare
to study the relationship between variables like patient age and medical
expenses, drug dosage and treatment outcomes, or patient satisfaction and
hospital wait times.
• Sports Analytics: In sports analytics, simple linear regression can be used to
analyze player performance metrics (e.g., batting average in baseball or
shooting percentage in basketball) and their relationship with factors like
training intensity, player fatigue, or coaching strategies.
• Energy and Utilities: Energy companies can use simple linear regression to
predict energy consumption based on historical data and weather conditions.
This helps in resource planning and optimizing energy distribution.
• Many other areas exist in which linear relationships between variables persist.
