Unit-3 Notes
1 Regression Concepts
Regression analysis is a statistical technique used to model the relationship between a dependent variable
and one or more independent variables. The most common form is linear regression, where the model
assumes a linear relationship between the variables.
Y = B0 + B1 X + ϵ

where:
• B0 is the intercept,
• B1 is the slope (the rate of change of Y with respect to X),
• ϵ is the random error term.

The key assumptions of the linear regression model are:
1. Linearity: The relationship between the independent variables (predictors) and the dependent variable
is linear in terms of the parameters (β coefficients), though not necessarily in terms of the explanatory
variables themselves.
2. Independence of Errors: The residuals (errors) are independent, meaning there is no relationship
between the residuals and the predicted values (ŷ).
3. Normality of Errors: The error terms follow a normal distribution with mean 0 and variance σ², i.e. ϵ ∼ N(0, σ²).
4. Equal Variances (Homoscedasticity): The error terms have a constant variance σ 2 across all levels
of the independent variables.
5. No Auto-Correlation: Successive error terms are uncorrelated, with the covariance between them
being zero.
6. Errors are Uncorrelated with Independent Variables: The error terms should not be correlated
with any of the independent variables.
7. No Multicollinearity: The independent variables are not correlated with each other.
2 Best Linear Unbiased Estimator (BLUE)
The term “Best” signifies that the BLUE estimator is selected to be the best among all possible linear
estimators in a specific sense. Specifically, it aims to minimize the variance of the parameter estimates. In
simpler terms, it tries to find the parameter estimates that have the smallest spread or uncertainty while
maintaining certain important properties.
The BLUE estimator is linear because it involves linear combinations of the observed data or predictors.
In linear regression, you assume that the relationship between the dependent variable (the one you’re trying
to predict) and the independent variables (predictors) is linear. The coefficients in front of these predictors
are what you’re estimating, and these coefficients are linear in nature.
The BLUE estimator is unbiased, which means that, on average, it provides parameter estimates that
are equal to the true population values. In other words, if you were to use the BLUE estimator on many
different samples from the same population, the average of the estimates would converge to the actual
population parameter values.
3 Ordinary Least Squares (OLS)
In linear regression, OLS is a commonly used BLUE estimator. It finds the coefficients that minimize
the sum of the squared differences between the observed and predicted values of the dependent variable.
3.1 Objective
Minimize the residual sum of squares (RSS):

RSS = Σ_{i=1}^{n} (yi − ŷi)² = Σ_{i=1}^{n} (yi − (B0 + B1 xi))²

where yi are the observed values and ŷi are the predicted values given by:

ŷi = B0 + B1 xi
Thus,

RSS = Σ_{i=1}^{n} (yi − (B0 + B1 xi))²

To find the estimates for B0 and B1, we compute the partial derivatives of RSS with respect to B0 and B1 and set them to zero.

First, the partial derivative with respect to B0:

∂RSS/∂B0 = ∂/∂B0 Σ_{i=1}^{n} (yi − (B0 + B1 xi))² = −2 Σ_{i=1}^{n} (yi − B0 − B1 xi) = 0

Similarly, the partial derivative with respect to B1:

∂RSS/∂B1 = −2 Σ_{i=1}^{n} xi (yi − B0 − B1 xi) = 0

These are the normal equations.
Solving these equations simultaneously yields the Ordinary Least Squares (OLS) estimates for B0 and
B1 :
B1 = [ n Σ_{i=1}^{n} xi yi − (Σ_{i=1}^{n} xi)(Σ_{i=1}^{n} yi) ] / [ n Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)² ]

B0 = (1/n) Σ_{i=1}^{n} yi − B1 (1/n) Σ_{i=1}^{n} xi
To simplify these expressions, first expand the numerator of B1. Note that

Σ_{i=1}^{n} xi yi = Σ_{i=1}^{n} (xi − X̄)(yi − Ȳ) + X̄ Σ_{i=1}^{n} (yi − Ȳ) + Ȳ Σ_{i=1}^{n} (xi − X̄) + n X̄ Ȳ
                 = Σ_{i=1}^{n} (xi − X̄)(yi − Ȳ) + n X̄ Ȳ

since Σ (xi − X̄) = 0 and Σ (yi − Ȳ) = 0. Substituting into the numerator:

n Σ_{i=1}^{n} xi yi − (Σ_{i=1}^{n} xi)(Σ_{i=1}^{n} yi) = n [ Σ_{i=1}^{n} (xi − X̄)(yi − Ȳ) + n X̄ Ȳ ] − (n X̄)(n Ȳ)
                                                      = n Σ_{i=1}^{n} (xi − X̄)(yi − Ȳ) + n² X̄ Ȳ − n² X̄ Ȳ
                                                      = n Σ_{i=1}^{n} (xi − X̄)(yi − Ȳ)

For the denominator, n Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)², rewrite Σ xi² by expanding:

Σ_{i=1}^{n} xi² = Σ_{i=1}^{n} (xi − X̄)² + n X̄²

so that

n Σ_{i=1}^{n} xi² = n Σ_{i=1}^{n} (xi − X̄)² + n² X̄²

Subtracting (Σ_{i=1}^{n} xi)² = n² X̄²:

n Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)² = n Σ_{i=1}^{n} (xi − X̄)²

Dividing the simplified numerator by the simplified denominator, the common factor n cancels and

B1 = Σ_{i=1}^{n} (xi − X̄)(yi − Ȳ) / Σ_{i=1}^{n} (xi − X̄)²

and similarly

B0 = Ȳ − B1 X̄
These estimates minimize the Residual Sum of Squares and provide the best linear unbiased estimators
under the Gauss-Markov assumptions.
In matrix form, for multiple regression the same least-squares solution is written as:

β̂ = (XᵀX)⁻¹ Xᵀ y

where:
• β̂ is the vector of estimated coefficients,
• X is the design matrix of predictor values (with a leading column of ones for the intercept),
• y is the vector of observed values of the dependent variable.
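As a quick illustration (not part of the derivation itself), the sketch below computes the closed-form estimates with NumPy on a small made-up dataset; the arrays x and y are hypothetical.

```python
import numpy as np

# Hypothetical sample data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Closed-form OLS estimates for simple linear regression
B1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
B0 = y.mean() - B1 * x.mean()

# Equivalent matrix form: beta_hat = (X'X)^{-1} X'y
X = np.column_stack([np.ones(n), x])          # design matrix with intercept column
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y   # gives the same B0, B1 as above

print(B0, B1)
print(beta_hat)
```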
6 Metrics
Metrics for regression models serve to quantify the performance and accuracy of predictions for continuous
numerical outcomes. Mean Absolute Error (MAE) calculates the average of absolute differences between
predicted and actual values, offering insight into the typical magnitude of errors. Mean Squared Error
(MSE) computes the average of squared differences, emphasizing larger errors more than MAE. Root Mean
Squared Error (RMSE), the square root of MSE, provides an interpretable measure in the same unit as the
target variable. R-squared (R²) gauges the proportion of variance in the dependent variable explained by
the independent variables, with values ranging from 0 (no explanatory power) to 1 (a perfect linear fit). These
metrics collectively aid in assessing regression model fit and predictive accuracy, guiding model selection and
refinement based on specific problem requirements.
Table 1: Formulas for Regression Metrics

Metric                             Formula
Mean Absolute Error (MAE)          MAE = (1/n) Σ_{i=1}^{n} |yi − ŷi|
Mean Squared Error (MSE)           MSE = (1/n) Σ_{i=1}^{n} (yi − ŷi)²
Root Mean Squared Error (RMSE)     RMSE = √[ (1/n) Σ_{i=1}^{n} (yi − ŷi)² ]
R-squared (R²)                     R² = 1 − Σ_{i=1}^{n} (yi − ŷi)² / Σ_{i=1}^{n} (yi − Ȳ)²
Worked example. For the five data points (1, 1), (2, 3), (4, 3), (3, 2), (5, 5):

Step 1: Compute the required sums:

n = 5
Σ X = 1 + 2 + 4 + 3 + 5 = 15
Σ Y = 1 + 3 + 3 + 2 + 5 = 14
Σ XY = (1 × 1) + (2 × 3) + (4 × 3) + (3 × 2) + (5 × 5) = 50
Σ X² = 1² + 2² + 4² + 3² + 5² = 55

Step 2: Compute the slope and intercept:

B1 = (n Σ XY − Σ X Σ Y) / (n Σ X² − (Σ X)²) = (5 × 50 − 15 × 14) / (5 × 55 − 15²) = 40 / 50 = 0.8
B0 = Ȳ − B1 X̄ = 2.8 − 0.8 × 3 = 0.4

Step 3: Compute the predicted values Ŷ = 0.4 + 0.8X:

Ŷ = 1.2, 2.0, 3.6, 2.8, 4.4
Step 4: Compute RMSE (Root Mean Squared Error):

RMSE = √[ (1/n) Σ_{i=1}^{n} (Yi − Ŷi)² ]
     = √[ ((1 − 1.2)² + (3 − 2.0)² + (3 − 3.6)² + (2 − 2.8)² + (5 − 4.4)²) / 5 ]
     = √[ (0.04 + 1 + 0.36 + 0.64 + 0.36) / 5 ]
     = √(2.40 / 5)
     = √0.48 ≈ 0.69
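A minimal NumPy sketch, assuming the five data points above, that reproduces these steps (the variable names are illustrative):

```python
import numpy as np

# Data points from the worked example
X = np.array([1, 2, 4, 3, 5], dtype=float)
Y = np.array([1, 3, 3, 2, 5], dtype=float)
n = len(X)

# Steps 1-2: closed-form OLS estimates
B1 = (n * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / (n * np.sum(X**2) - np.sum(X)**2)
B0 = Y.mean() - B1 * X.mean()      # B1 = 0.8, B0 = 0.4

# Step 3: predicted values
Y_hat = B0 + B1 * X                # [1.2, 2.0, 3.6, 2.8, 4.4]

# Step 4: error metrics
mae = np.mean(np.abs(Y - Y_hat))
mse = np.mean((Y - Y_hat) ** 2)
rmse = np.sqrt(mse)                # ≈ 0.69
r2 = 1 - np.sum((Y - Y_hat) ** 2) / np.sum((Y - Y.mean()) ** 2)

print(f"B0={B0:.2f}, B1={B1:.2f}, MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}")
```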
Example: multiple regression with matrices. Suppose we want to predict a student's test score from an IQ
score and the number of hours studied, using the model

ŷ = b0 + b1 x1 + b2 x2

where ŷ is the predicted test score; b0, b1, and b2 are regression coefficients; x1 is an IQ score; and x2 is
the number of hours that the student studied.
On the right side of the equation, the only unknowns are the regression coefficients. To define the
regression coefficients, we use the following equation:
b = (X ′ X)−1 X ′ Y
To solve this equation, we need to complete the following steps:
1. Define X.
2. Define X ′ .
3. Compute X ′ X.
4. Find the inverse of X ′ X.
5. Define Y .
Let’s begin with matrix X. Matrix X has a column of 1’s plus one column of values for each independent
variable. So, this is matrix X and its transpose X′:
    [ 1  110  40 ]
    [ 1  120  30 ]
X = [ 1  100  20 ]
    [ 1   90   0 ]
    [ 1   80  10 ]

     [   1    1    1    1    1 ]
X′ = [ 110  120  100   90   80 ]
     [  40   30   20    0   10 ]
Given X ′ and X, it is a simple matter to compute X ′ X:
      [   5     500     100 ]
X′X = [ 500   51000   10800 ]
      [ 100   10800    3000 ]
Finding the inverse of X ′ X takes a little more effort. A way to find the inverse is described on this site
at https://stattrek.com/matrix-algebra/how-to-find-inverse. Ultimately, we find:
          [ 101/5   −7/30    1/6   ]
(X′X)⁻¹ = [ −7/30   1/360   −1/450 ]
          [  1/6    −1/450   1/360 ]
Next, we define Y , the vector of dependent variable scores. For this problem, it is the vector of test
scores:
    [ 100 ]
    [  90 ]
Y = [  80 ]
    [  70 ]
    [  60 ]
With all of the essential matrices defined, we are ready to compute the least squares regression coefficients:
b = (X ′ X)−1 X ′ Y
    [ b0 ]   [ 20  ]
b = [ b1 ] = [ 0.5 ]
    [ b2 ]   [ 0.5 ]
To conclude, here is our least-squares regression equation:
ŷ = 20 + 0.5x1 + 0.5x2
where ŷ is the predicted test score; x1 is an IQ score; and x2 is the number of hours that the student
studied. The regression coefficients are b0 = 20, b1 = 0.5, and b2 = 0.5.
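The following NumPy sketch reproduces this worked example numerically; it assumes the same X and Y values and simply evaluates b = (X′X)⁻¹X′Y.

```python
import numpy as np

# Design matrix: a column of 1's, IQ scores, and hours studied
X = np.array([
    [1, 110, 40],
    [1, 120, 30],
    [1, 100, 20],
    [1,  90,  0],
    [1,  80, 10],
], dtype=float)

# Vector of observed test scores
Y = np.array([100, 90, 80, 70, 60], dtype=float)

XtX = X.T @ X                      # matches the 3x3 matrix shown above
b = np.linalg.inv(XtX) @ X.T @ Y   # b = (X'X)^{-1} X'Y

print(np.round(np.linalg.inv(XtX), 4))  # the inverse shown above
print(np.round(b, 4))                   # [20.   0.5  0.5]
```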
9 Variable Rationalization
A dataset often contains many attributes, but not all of them are relevant. Some attributes may be redundant
or irrelevant. The goal of variable rationalization is to improve data processing by selecting a subset of
attributes that are most important. This process helps in reducing the size of the dataset without losing
much useful information, which also makes the data analysis more efficient and cost-effective.
Working with a smaller, more focused dataset makes it easier to understand the patterns that are dis-
covered. Several methods are commonly used for selecting the most important attributes:
• Stepwise Forward Selection: This method starts with an empty set of attributes and adds the most
relevant ones (based on the lowest p-value) one at a time (a minimal sketch is given after this list).
• Stepwise Backward Elimination: This method begins with all attributes and removes those that
have the highest p-value in each iteration, which are considered less significant.
• Combination of Forward Selection and Backward Elimination: This method merges both
forward selection and backward elimination to efficiently select the most important attributes. This
combined approach is widely used.
• Decision Tree Induction: This method uses a decision tree to choose attributes. It builds a flowchart
where nodes represent tests on attributes, and branches represent possible outcomes. Attributes that
are not part of the final decision tree are considered irrelevant and can be discarded.
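As a rough illustration of stepwise forward selection, here is a minimal sketch assuming a pandas DataFrame X of candidate predictors and a target series y (both hypothetical names), using statsmodels to obtain the p-values:

```python
import pandas as pd
import statsmodels.api as sm

def forward_select(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05) -> list[str]:
    """Greedy forward selection: repeatedly add the candidate predictor with the
    lowest p-value, stopping when no remaining predictor is significant at alpha."""
    selected: list[str] = []
    remaining = list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                      # nothing left that is significant
        selected.append(best)
        remaining.remove(best)
    return selected
```

A backward-elimination variant would start from all columns and drop the predictor with the highest p-value in each round instead.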
10 Model Building
The primary goal of model building in linear regression is to develop the most accurate and precise estimates
of effects using your data and statistical software.
Strategies
• Adjust All: Start by including all potential confounders in the model to understand their impact on
the outcome variable.
• Predictor Selection: Choose predictors based on their ability to accurately predict the outcome.
Utilize methods such as stepwise selection or regularization techniques.
• Change in Estimate (CIE) Selection: Evaluate how excluding certain predictors changes the
estimated effects of other variables to make informed decisions about which variables to include.
Preparation
• Data Checking: Thoroughly check, describe, and summarize the data to ensure its quality and
suitability for modeling.
• Centering and Rescaling: Center quantitative variables to ensure that zero has a meaningful refer-
ence point and rescale them to make differences more interpretable.
• Variable Selection: Use statistical distributions and contextual knowledge to choose the appropriate
categories or flexible forms for the variables.
Avoid Pitfalls
• Over-Control: Avoid including too many variables in the model as this can lead to issues such as
data sparsity and multicollinearity, where predictors are highly correlated.
• Control Intermediates: Do not control for variables that lie on the causal pathway between the
predictor and the outcome, as this can introduce bias. Only include necessary control variables.
Review and Considerations
• Methodology Review: If your background knowledge is limited, review methodologies from related
studies to ensure robust model building.
• Forced vs. Data-Based Variables: Determine which variables should always be controlled (e.g.,
age, gender) and which should be selected based on the data analysis.
• Complexity: Avoid overcomplicating the model by including too many variables. A simpler model is
often more interpretable and avoids issues such as data sparsity and multicollinearity.
11 Logistic Regression
11.1 Model Theory
Logistic regression is a statistical method used for binary classification. The model predicts the probability
of a binary outcome based on one or more predictor variables.
The logistic regression model is formulated as:
P(Y = 1 | X) = 1 / (1 + e^(−(β0 + β1 X)))
where:
• P (Y = 1 | X) is the probability of the outcome Y being 1 given the predictor X.
• β0 is the intercept term.
• β1 is the coefficient for the predictor X.
• e is the base of the natural logarithm.
For multiple predictors, the model is extended to:
P(Y = 1 | X) = 1 / (1 + e^(−(β0 + β1 X1 + β2 X2 + · · · + βp Xp)))
where X = (X1 , X2 , . . . , Xp ) represents the vector of predictors.
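A minimal sketch of fitting this model with scikit-learn on made-up study-hours data (note that scikit-learn's LogisticRegression applies L2 regularization by default, so the fitted coefficients are close to, but not exactly, the unpenalized maximum-likelihood estimates):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (X) and pass/fail outcome (Y)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
Y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, Y)

# Estimated beta_0 and beta_1
print(model.intercept_, model.coef_)

# P(Y = 1 | X) for new observations, via the logistic function
print(model.predict_proba([[1.0], [3.2]])[:, 1])
```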
11.2 Model Fit and Evaluation
Several statistics are used to assess how well a fitted logistic regression model describes the data (a combined
sketch follows this list):
• Likelihood Ratio Test: The likelihood ratio test compares the fit of the full model to a reduced
model:
LR = −2 (Log-Likelihoodreduced − Log-Likelihoodfull )
This test assesses whether the full model significantly improves fit over the reduced model.
• Hosmer-Lemeshow Test: The Hosmer-Lemeshow test evaluates the goodness-of-fit by grouping
observations into deciles based on predicted probabilities and comparing observed and expected fre-
quencies within these groups. A high p-value suggests a good fit.
• Akaike Information Criterion (AIC): The AIC is used for model comparison, penalizing the
likelihood of the model by the number of parameters:
AIC = −2Log-Likelihood + 2k
where k is the number of parameters in the model. Lower AIC values indicate a better fit.
• Pseudo-R²: Pseudo-R² measures the proportion of variance explained by the model. Common types
include:
– Cox and Snell R²:
R²_CS = 1 − (L0 / L1)^(2/n)
where L0 is the likelihood of the null model, L1 is the likelihood of the fitted model, and n is the
sample size.
– Nagelkerke R²:
R²_N = R²_CS / (1 − L0^(2/n))
• Classification Accuracy: Classification accuracy is the proportion of correctly predicted cases out
of the total number of cases:
Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
• Receiver Operating Characteristic (ROC) Curve: The ROC curve plots the true positive rate
(sensitivity) against the false positive rate (1-specificity) at various threshold levels. The area under
the ROC curve (AUC) measures the model’s ability to discriminate between classes:
AUC = ∫₀¹ Sensitivity d(1 − Specificity)
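The sketch below, referenced above, illustrates several of these measures on synthetic data: the likelihood ratio test and AIC via statsmodels, and accuracy and ROC AUC via scikit-learn. The data and variable names are made up for the example.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic data: two predictors, binary outcome (x2 is deliberately irrelevant)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x1)))
y = rng.binomial(1, p)

X_full = sm.add_constant(np.column_stack([x1, x2]))
X_reduced = sm.add_constant(x1)

full = sm.Logit(y, X_full).fit(disp=0)
reduced = sm.Logit(y, X_reduced).fit(disp=0)

# Likelihood ratio test: LR = -2 (LL_reduced - LL_full), chi-square with 1 df here
LR = -2 * (reduced.llf - full.llf)
p_value = chi2.sf(LR, df=1)
print("LR =", round(LR, 3), "p =", round(p_value, 3))

# AIC = -2 LL + 2k (statsmodels reports it directly)
print("AIC full =", round(full.aic, 1), "AIC reduced =", round(reduced.aic, 1))

# Classification accuracy and ROC AUC from the predicted probabilities
probs = full.predict(X_full)
print("accuracy =", accuracy_score(y, probs >= 0.5))
print("AUC =", roc_auc_score(y, probs))
```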
12 Model Construction
• Specify the Model: Determine the form of the logistic regression model based on the dependent
variable and independent variables.
• Select Predictors: Choose relevant predictors based on theoretical knowledge, exploratory data
analysis, and variable selection methods.
• Estimate Parameters: Fit the model to the data using maximum likelihood estimation. This involves
finding the parameter values that maximize the likelihood of observing the given data.
• Assess Model Fit: Use fit statistics and diagnostic tools to evaluate how well the model represents
the data. Check for the adequacy of the model fit and the appropriateness of the assumptions.
• Refine the Model: Based on the results, refine the model by adding or removing predictors, checking
for multicollinearity, and addressing any model assumptions violations.
• Validate the Model: Test the model on a separate validation set or use cross-validation techniques
to assess its generalizability and robustness (a minimal end-to-end sketch follows this list).
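The end-to-end sketch referenced above, using scikit-learn on synthetic data; a real analysis would add the fit statistics and diagnostics discussed earlier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1-2. Specify the model and candidate predictors (synthetic data here)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 3. Estimate parameters by (penalized) maximum likelihood
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# 4-5. Assess and refine: training accuracy is one simple check
print("train accuracy:", model.score(X_train, y_train))

# 6. Validate: held-out test set and 5-fold cross-validation
print("test accuracy:", model.score(X_test, y_test))
print("cv accuracy:", cross_val_score(model, X, y, cv=5).mean())
```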
13 Applications of Data Analytics
• Marketing and Sales
– Customer Segmentation: Analytics helps identify distinct customer groups based on purchas-
ing behavior, demographics, and preferences. This allows for targeted marketing strategies.
– Campaign Effectiveness: Measure the success of marketing campaigns by analyzing conversion
rates, customer engagement, and return on investment (ROI).
– Sales Forecasting: Predict future sales trends using historical data and seasonal patterns to
optimize inventory and sales strategies.
• Finance
– Risk Management: Analyze financial data to assess and mitigate risks, including credit risk,
market risk, and operational risk.
– Fraud Detection: Use anomaly detection and predictive modeling to identify and prevent fraud-
ulent transactions.
– Portfolio Management: Optimize investment portfolios by analyzing market trends, asset
performance, and risk factors.
• Customer Service
– Customer Satisfaction: Analyze feedback, surveys, and support interactions to improve cus-
tomer service and enhance the overall customer experience.
– Churn Prediction: Identify customers at risk of leaving and develop strategies to retain them
through targeted interventions and personalized offers.
– Service Efficiency: Optimize customer support processes by analyzing response times, resolu-
tion rates, and resource allocation.
• Retail
– Pricing Strategy: Use analytics to determine optimal pricing strategies based on market de-
mand, competitor pricing, and cost structures.
– Product Assortment: Analyze sales data to determine the best mix of products to stock in
stores or online, maximizing sales and profitability.
– Customer Behavior: Track and analyze customer purchasing patterns to improve store layout,
product placement, and promotional strategies.
• Healthcare
– Patient Care: Analyze patient data to improve treatment outcomes, personalize care plans, and
predict patient needs.
– Operational Efficiency: Optimize hospital and clinic operations by analyzing patient flow,
resource utilization, and staff scheduling.
– Predictive Analytics: Use data to forecast disease outbreaks, patient admissions, and other
critical healthcare events.
Explain with examples when logistic regression is used over linear regression
Logistic regression is preferred over linear regression when the target variable is categorical rather than
continuous. Linear regression is used for predicting a real-valued outcome, such as house prices or temper-
atures, where the target variable is continuous and can take any value. In contrast, logistic regression is
ideal for classification tasks, where the target variable is categorical, often binary, like predicting whether an
event will occur (yes/no, 0/1). For instance, linear regression can predict a house’s price based on features
like size and location, whereas logistic regression can determine the likelihood of the house being sold (e.g.,
sale or no sale). The output of linear regression is not bounded and can range from −∞ to +∞, making
it unsuitable for probabilities. Logistic regression, however, produces probabilities constrained between 0
and 1 using a sigmoid function, which is crucial for classification tasks. For example, it can predict whether
an email is spam or not or if a student will pass an exam based on study hours. While linear regression
assumes a linear relationship between predictors and the outcome, logistic regression maps predictions to
probabilities, making it more suitable for problems involving categorical outcomes.
The logistic transformation is a mathematical function that maps any real-valued number to a value
between 0 and 1, making it suitable for modeling probabilities in classification problems. This transformation
is central to logistic regression and is expressed using the sigmoid function:
P(y = 1 | x) = 1 / (1 + e^(−z))
where z is the linear combination of the input features and their weights:
z = w0 + w1 x1 + w2 x2 + · · · + wn xn
• Range: The output is always between 0 and 1.
• S-shaped Curve: The sigmoid function has an "S" shape, transitioning smoothly from 0 to 1.
• Interpretability: The logistic transformation directly relates to odds through the logit function:
logit(P) = ln( P / (1 − P) ) = z
This relationship shows how the linear combination z is linked to the odds of the event occurring.
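A short sketch of the sigmoid and its inverse (the logit), just to make the mapping concrete:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued z to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid: the log-odds of probability p."""
    return np.log(p / (1 - p))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid(z)
print(p)         # values strictly between 0 and 1, S-shaped in z
print(logit(p))  # recovers the original z values
```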
The logistic transformation is widely used in machine learning for:
• Binary classification tasks, like spam detection or disease diagnosis.
• Probabilistic outputs, which allow for a confidence score in predictions.
• Logistic regression, where it models the probability of a categorical outcome.
Demonstrate maximum likelihood estimation.
MLE is a method used to estimate the parameters of a model so that it best fits the observed data. It
works by maximizing the likelihood function, which measures how likely the observed data is given the
model parameters.
Steps to Perform MLE
1. Write the Likelihood Function: The likelihood function represents the probability of the observed
data based on the model and its parameters. For n data points x1, x2, . . . , xn, the likelihood is:
L(θ) = f(x1; θ) · f(x2; θ) · · · f(xn; θ) = ∏_{i=1}^{n} f(xi; θ)

2. Take the Log-Likelihood: Because a product is awkward to differentiate, work with the log-likelihood
ℓ(θ) = ln L(θ) = Σ_{i=1}^{n} ln f(xi; θ).

3. Maximize the Log-Likelihood: Set the derivative of ℓ(θ) with respect to θ to zero and solve:

∂ℓ(θ)/∂θ = 0

This gives the parameter values that make the observed data most likely.
Example: MLE for a Normal Distribution
Suppose the data come from a normal distribution with mean µ and variance σ². The probability density of
each data point is:

f(x; µ, σ²) = (1 / √(2πσ²)) e^(−(x − µ)² / (2σ²))

Step 1: Write the Likelihood Function. For n independent data points:

L(µ, σ²) = ∏_{i=1}^{n} (1 / √(2πσ²)) e^(−(xi − µ)² / (2σ²))

Step 2: Take the Log-Likelihood:

ℓ(µ, σ²) = −(n/2) ln(2πσ²) − (1 / (2σ²)) Σ_{i=1}^{n} (xi − µ)²
Step 3: Maximize the Log-Likelihood.
1. Solve for µ by setting the derivative of ℓ(µ, σ²) with respect to µ to zero:

∂ℓ/∂µ = (1/σ²) Σ_{i=1}^{n} (xi − µ)

Setting ∂ℓ/∂µ = 0, we get:

µ̂ = (1/n) Σ_{i=1}^{n} xi

2. Solve for σ² by setting the derivative of ℓ(µ, σ²) with respect to σ² to zero. Setting ∂ℓ/∂σ² = 0, we get:

σ̂² = (1/n) Σ_{i=1}^{n} (xi − µ̂)²
This process works for other distributions by using their specific probability functions. MLE provides
the most likely parameter values for a given dataset.
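As a quick numerical check of the normal-distribution example (using randomly generated data, so the exact numbers are illustrative): the MLE of µ is the sample mean and the MLE of σ² is the divide-by-n sample variance.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # true mu = 5, sigma = 2
n = len(x)

# Closed-form MLEs derived above
mu_hat = x.mean()                       # (1/n) * sum(x_i)
sigma2_hat = np.mean((x - mu_hat)**2)   # divides by n, not n - 1

# Log-likelihood evaluated at the estimates, for reference
loglik = -0.5 * n * np.log(2 * np.pi * sigma2_hat) - np.sum((x - mu_hat)**2) / (2 * sigma2_hat)

print(mu_hat, np.sqrt(sigma2_hat), loglik)
```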
The sigmoid function is a mathematical function that maps any input to a value between 0 and 1,
making it ideal for tasks that require probability predictions, such as binary classification. It is commonly
used in models like logistic regression and neural networks. The function is smooth and continuous, which
makes it differentiable and useful for optimization techniques like gradient descent. Its non-linear nature
allows models to capture complex patterns in data. However, it can suffer from the vanishing gradient
problem, where the gradient becomes very small for large inputs, slowing down learning in deep neural
networks. Despite this, the sigmoid function is still important because it provides interpretable probability
outputs and enables effective model training.