
UNIT-III

REGRESSION: Concepts, BLUE property assumptions, Least Squares Estimation, Variable Rationalization, and Model Building etc. Logistic Regression: Model Theory, Model Fit Statistics, Model Construction, Analytics applications to various Business Domains etc.

1 Regression Concepts
Regression analysis is a statistical technique used to model the relationship between a dependent variable
and one or more independent variables. The most common form is linear regression, where the model
assumes a linear relationship between the variables.

1.1 Simple Linear Regression


Involves one dependent variable (Y ) and one independent variable (X). The model takes the form:

Y = B0 + B1 X + ϵ

where:
• B0 is the intercept,
• B1 is the slope (rate of change of Y with respect to X),

• ϵ is the error term (residual).

1.2 Multiple Linear Regression


Extends simple linear regression to include multiple independent variables (X1 , X2 , . . . , Xn ). The equation
is:
Y = B0 + B1 X1 + B2 X2 + · · · + Bn Xn + ϵ

2 BLUE Property Assumptions


BLUE stands for Best Linear Unbiased Estimator. According to the Gauss-Markov theorem, Ordinary
Least Squares (OLS) estimators are BLUE if the following assumptions hold:

1. Linearity: The relationship between the independent variables (predictors) and the dependent variable
is linear in terms of the parameters (β coefficients), though not necessarily in terms of the explanatory
variables themselves.
2. Independence of Errors: The residuals (errors) are independent, meaning there is no relationship
between the residuals and the predicted values (ŷ).

3. Normality of Errors: The error terms follow a normal distribution with mean 0 and variance σ², i.e., ϵ ∼ N(0, σ²).
4. Equal Variances (Homoscedasticity): The error terms have a constant variance σ 2 across all levels
of the independent variables.
5. No Auto-Correlation: Successive error terms are uncorrelated, with the covariance between them
being zero.

6. Errors are Uncorrelated with Independent Variables: The error terms should not be correlated
with any of the independent variables.
7. No Multicollinearity: The independent variables are not correlated with each other.

The term “Best” signifies that the BLUE estimator is selected to be the best among all possible linear
estimators in a specific sense. Specifically, it aims to minimize the variance of the parameter estimates. In
simpler terms, it tries to find the parameter estimates that have the smallest spread or uncertainty while
maintaining certain important properties.
The BLUE estimator is linear because it involves linear combinations of the observed data or predictors.
In linear regression, you assume that the relationship between the dependent variable (the one you’re trying
to predict) and the independent variables (predictors) is linear. The coefficients in front of these predictors
are what you’re estimating, and these coefficients are linear in nature.
The BLUE estimator is unbiased, which means that, on average, it provides parameter estimates that
are equal to the true population values. In other words, if you were to use the BLUE estimator on many
different samples from the same population, the average of the estimates would converge to the actual
population parameter values.
In linear regression, OLS is a commonly used BLUE estimator. It finds the coefficients that minimize
the sum of the squared differences between the observed and predicted values of the dependent variable.

3 Least Squares Estimation (LSE)


Least Squares Estimation (LSE) is the method used to estimate the coefficients in a regression model. The
goal of LSE is to minimize the sum of the squared differences (residuals) between the observed and predicted
values.

3.1 Objective
Minimize the residual sum of squares (RSS):

RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \bigl(y_i - (B_0 + B_1 x_i)\bigr)^2

where yi are the actual values and ŷi are the predicted values.

3.2 Steps in Least Squares


• Set up the regression equation: The model is Y = B0 + B1 X.
• Find the partial derivatives of RSS with respect to B0 and B1 .
• Solve the system of equations to find the OLS estimates for B0 and B1 .

4 Detailed Steps in Least Squares Estimation


The Residual Sum of Squares (RSS) is defined as:
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where yi are the observed values and ŷi are the predicted values given by:

ŷi = B0 + B1 xi

Thus,
RSS = \sum_{i=1}^{n} \bigl(y_i - (B_0 + B_1 x_i)\bigr)^2

To find the estimates for B0 and B1 , we need to compute the partial derivatives of RSS with respect to
B0 and B1 and set them to zero.
First, compute the partial derivative with respect to B0 :
\frac{\partial\, \mathrm{RSS}}{\partial B_0} = \frac{\partial}{\partial B_0} \sum_{i=1}^{n} \bigl(y_i - (B_0 + B_1 x_i)\bigr)^2

Expanding the squared term and differentiating:


\frac{\partial\, \mathrm{RSS}}{\partial B_0} = -2 \sum_{i=1}^{n} \bigl(y_i - (B_0 + B_1 x_i)\bigr)

Setting this derivative to zero:


-2 \sum_{i=1}^{n} \bigl(y_i - (B_0 + B_1 x_i)\bigr) = 0

\sum_{i=1}^{n} y_i - nB_0 - B_1 \sum_{i=1}^{n} x_i = 0

nB_0 + B_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i

Next, compute the partial derivative with respect to B1 :


\frac{\partial\, \mathrm{RSS}}{\partial B_1} = \frac{\partial}{\partial B_1} \sum_{i=1}^{n} \bigl(y_i - (B_0 + B_1 x_i)\bigr)^2

Expanding the squared term and differentiating:


\frac{\partial\, \mathrm{RSS}}{\partial B_1} = -2 \sum_{i=1}^{n} x_i \bigl(y_i - (B_0 + B_1 x_i)\bigr)

Setting this derivative to zero:


-2 \sum_{i=1}^{n} x_i \bigl(y_i - (B_0 + B_1 x_i)\bigr) = 0

\sum_{i=1}^{n} x_i y_i - B_0 \sum_{i=1}^{n} x_i - B_1 \sum_{i=1}^{n} x_i^2 = 0

B_0 \sum_{i=1}^{n} x_i + B_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i

We now have a system of two equations:


nB_0 + B_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i

B_0 \sum_{i=1}^{n} x_i + B_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i

Solving these equations simultaneously yields the Ordinary Least Squares (OLS) estimates for B0 and
B1 :
B_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}

B_0 = \frac{1}{n}\sum_{i=1}^{n} y_i - B_1 \cdot \frac{1}{n}\sum_{i=1}^{n} x_i

The means of X and Y are:

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} y_i

Consider first the numerator of B_1:

n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)

Rewrite \sum_{i=1}^{n} x_i y_i by adding and subtracting \bar{X}\bar{Y}:

\sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n}\left[(x_i-\bar{X})(y_i-\bar{Y}) + \bar{X}(y_i-\bar{Y}) + \bar{Y}(x_i-\bar{X}) + \bar{X}\bar{Y}\right]
= \sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y}) + \bar{X}\sum_{i=1}^{n}(y_i-\bar{Y}) + \bar{Y}\sum_{i=1}^{n}(x_i-\bar{X}) + n\bar{X}\bar{Y}
= \sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y}) + n\bar{X}\bar{Y}

since the deviations from the mean sum to zero. Substituting into the numerator:

n\left[\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y}) + n\bar{X}\bar{Y}\right] - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)
= n\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y}) + n^2\bar{X}\bar{Y} - n^2\bar{X}\bar{Y}
= n\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})

Now consider the denominator:

n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2

Rewrite \sum_{i=1}^{n} x_i^2 by expanding:

\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n}(x_i-\bar{X})^2 + n\bar{X}^2

so that

n\sum_{i=1}^{n} x_i^2 = n\sum_{i=1}^{n}(x_i-\bar{X})^2 + n^2\bar{X}^2

Subtracting \left(\sum_{i=1}^{n} x_i\right)^2 = n^2\bar{X}^2:

n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 = n\sum_{i=1}^{n}(x_i-\bar{X})^2

Dividing the numerator by the denominator, the common factor n cancels, giving

B_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{\sum_{i=1}^{n}(x_i-\bar{X})^2}

Similarly,

B_0 = \bar{Y} - B_1\bar{X}
These estimates minimize the Residual Sum of Squares and provide the best linear unbiased estimators
under the Gauss-Markov assumptions.
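To make the closed-form result concrete, here is a minimal Python sketch (not part of the original notes) that evaluates B1 and B0 from the deviation-form formulas above; the function name ols_simple and the sample data are illustrative assumptions.

# Minimal sketch: OLS estimates for simple linear regression using
# B1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2) and B0 = y_bar - B1 * x_bar.
def ols_simple(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Illustrative data (assumed)
b0, b1 = ols_simple([1, 2, 4, 3, 5], [1, 3, 3, 2, 5])
print(b0, b1)  # expected: 0.4 0.8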

4.1 OLS in Matrix Form (for Multiple Regression)


The OLS estimates for multiple regression can be expressed in matrix form as:

\hat{\beta} = (X^T X)^{-1} X^T y

where:
• β̂ is the vector of estimated coefficients,

• X is the matrix of independent variables,


• y is the vector of dependent variable values.
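As an illustration of the matrix formula, a minimal NumPy sketch follows (an assumption, not part of the original notes); the design matrix and responses are made up, and np.linalg.lstsq is shown as a numerically preferable alternative to forming the explicit inverse.

import numpy as np

# beta_hat = (X^T X)^{-1} X^T y; X carries a leading column of ones for the intercept.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 5.0],
              [1.0, 4.0, 2.0],
              [1.0, 3.0, 6.0]])          # illustrative design matrix (assumed)
y = np.array([10.0, 12.0, 9.0, 15.0])     # illustrative responses (assumed)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y      # textbook formula
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # more stable least-squares solver
print(beta_hat, beta_ls)                         # both give the same coefficients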

5 Key Properties of OLS Estimators


• Unbiasedness: The OLS estimators will be unbiased as long as the assumptions hold true.
• Consistency: As the sample size increases, the OLS estimates converge to the true parameter values.
• Efficiency: The OLS estimators have the least variance among all unbiased linear estimators, making
them the best.

6 Metrics
Metrics for regression models serve to quantify the performance and accuracy of predictions for continuous
numerical outcomes. Mean Absolute Error (MAE) calculates the average of absolute differences between
predicted and actual values, offering insight into the typical magnitude of errors. Mean Squared Error
(MSE) computes the average of squared differences, emphasizing larger errors more than MAE. Root Mean
Squared Error (RMSE), the square root of MSE, provides an interpretable measure in the same unit as the
target variable. R-squared (R²) gauges the proportion of variance in the dependent variable explained by the independent variables; values close to 1 indicate the model explains most of the variance, while values near 0 indicate little explanatory power. These
metrics collectively aid in assessing regression model fit and predictive accuracy, guiding model selection and
refinement based on specific problem requirements.

Table 1: Formulas for Regression Metrics

Metric                             Formula
Mean Absolute Error (MAE)          MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|
Mean Squared Error (MSE)           MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Root Mean Squared Error (RMSE)     RMSE = \sqrt{\mathrm{MSE}}
R-squared (R²)                     R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
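The formulas in Table 1 translate directly into code. Below is a minimal NumPy sketch (an illustration, not part of the original notes); the sample arrays are assumed values.

import numpy as np

def mae(y, y_hat):
    # Mean Absolute Error: average magnitude of the residuals
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    # Mean Squared Error: average squared residual (penalizes large errors more)
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    # Root Mean Squared Error: same units as the target variable
    return np.sqrt(mse(y, y_hat))

def r_squared(y, y_hat):
    # R^2: one minus the ratio of residual variation to total variation
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Assumed example values
y = np.array([1.0, 3.0, 3.0, 2.0, 5.0])
y_hat = np.array([1.2, 2.0, 3.6, 2.8, 4.0])
print(mae(y, y_hat), mse(y, y_hat), rmse(y, y_hat), r_squared(y, y_hat))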

7 Building SLR Example


Given data:
X = {1, 2, 4, 3, 5}
Y = {1, 3, 3, 2, 5}
Step 1: Compute necessary sums:

n = 5
\sum X = 1 + 2 + 4 + 3 + 5 = 15
\sum Y = 1 + 3 + 3 + 2 + 5 = 14
\sum XY = (1 \times 1) + (2 \times 3) + (4 \times 3) + (3 \times 2) + (5 \times 5) = 50
\sum X^2 = 1^2 + 2^2 + 4^2 + 3^2 + 5^2 = 55

Step 2: Compute slope b1 and y-intercept b0 :

b_1 = \frac{(5 \times 50) - (15 \times 14)}{(5 \times 55) - (15)^2}, \qquad b_0 = \frac{14 - b_1 \times 15}{5}

Now, we compute b_1:

b_1 = \frac{250 - 210}{275 - 225} = \frac{40}{50} = 0.8

Next, we compute b_0:

b_0 = \frac{14 - (0.8 \times 15)}{5} = \frac{14 - 12}{5} = 0.4
So, the linear regression model is:
Ŷ = 0.8X + 0.4
Step 3: Compute predicted values Ŷ using the linear regression model:

Ŷ1 = (0.8 × 1) + 0.4 = 1.2


Ŷ2 = (0.8 × 2) + 0.4 = 2.0
Ŷ3 = (0.8 × 4) + 0.4 = 3.6
Ŷ4 = (0.8 × 3) + 0.4 = 2.8
Ŷ5 = (0.8 × 5) + 0.4 = 4.0

Step 4: Compute RMSE (Root Mean Squared Error):
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}
= \sqrt{\frac{(1-1.2)^2 + (3-2.0)^2 + (3-3.6)^2 + (2-2.8)^2 + (5-4.0)^2}{5}}
= \sqrt{\frac{0.04 + 1 + 0.36 + 0.64 + 1}{5}}
= \sqrt{\frac{3.04}{5}} = \sqrt{0.608} \approx 0.78

Step 5: Compute R² (R-squared), using the mean of the observed values \bar{Y} = 14/5 = 2.8:

R^2 = 1 - \frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}
= 1 - \frac{(1-1.2)^2 + (3-2.0)^2 + (3-3.6)^2 + (2-2.8)^2 + (5-4.0)^2}{(1-2.8)^2 + (3-2.8)^2 + (3-2.8)^2 + (2-2.8)^2 + (5-2.8)^2}
= 1 - \frac{0.04 + 1 + 0.36 + 0.64 + 1}{3.24 + 0.04 + 0.04 + 0.64 + 4.84}
= 1 - \frac{3.04}{8.8}
\approx 1 - 0.3455
\approx 0.6545
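As a quick check of this worked example, a short NumPy sketch (an assumption that NumPy is available, not part of the original notes) reproduces the coefficients and the two metrics.

import numpy as np

X = np.array([1, 2, 4, 3, 5], dtype=float)
Y = np.array([1, 3, 3, 2, 5], dtype=float)

b1, b0 = np.polyfit(X, Y, 1)     # degree-1 fit returns [slope, intercept]
Y_hat = b0 + b1 * X

rmse = np.sqrt(np.mean((Y - Y_hat) ** 2))
r2 = 1 - np.sum((Y - Y_hat) ** 2) / np.sum((Y - np.mean(Y)) ** 2)
print(b1, b0, rmse, r2)          # approximately 0.8, 0.4, 0.78, 0.65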

8 Building MLR Example


Consider the table below. It shows three performance measures for five students:

Student Test score IQ Study hours


1 100 110 40
2 90 120 30
3 80 100 20
4 70 90 0
5 60 80 10
Using least squares regression, we want to develop a regression equation to predict test scores based on
(1) IQ and (2) the number of hours that the student studied.
For this problem, we have some raw data, and we want to use this raw data to define a least-squares
regression equation:

ŷ = b0 + b1 x1 + b2 x2
where ŷ is the predicted test score; b0 , b1 , and b2 are regression coefficients; x1 is an IQ score; and x2 is
the number of hours that the student studied.
On the right side of the equation, the only unknowns are the regression coefficients. To define the
regression coefficients, we use the following equation:

b = (X ′ X)−1 X ′ Y
To solve this equation, we need to complete the following steps:

1. Define X.
2. Define X ′ .
3. Compute X ′ X.
4. Find the inverse of X ′ X.
5. Define Y .
Let’s begin with matrix X. Matrix X has a column of 1’s plus two columns of values for each independent
variable. So, this is matrix X and its transpose X':

X = \begin{bmatrix} 1 & 110 & 40 \\ 1 & 120 & 30 \\ 1 & 100 & 20 \\ 1 & 90 & 0 \\ 1 & 80 & 10 \end{bmatrix}, \qquad
X' = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 110 & 120 & 100 & 90 & 80 \\ 40 & 30 & 20 & 0 & 10 \end{bmatrix}
Given X' and X, it is a simple matter to compute X'X:

X'X = \begin{bmatrix} 5 & 500 & 100 \\ 500 & 51000 & 10800 \\ 100 & 10800 & 3000 \end{bmatrix}
Finding the inverse of X ′ X takes a little more effort. A way to find the inverse is described on this site
at https://stattrek.com/matrix-algebra/how-to-find-inverse. Ultimately, we find:

(X'X)^{-1} = \begin{bmatrix} \frac{101}{5} & -\frac{7}{30} & \frac{1}{6} \\ -\frac{7}{30} & \frac{1}{360} & -\frac{1}{450} \\ \frac{1}{6} & -\frac{1}{450} & \frac{1}{360} \end{bmatrix}
Next, we define Y , the vector of dependent variable scores. For this problem, it is the vector of test
scores:

Y = \begin{bmatrix} 100 \\ 90 \\ 80 \\ 70 \\ 60 \end{bmatrix}
With all of the essential matrices defined, we are ready to compute the least squares regression coefficients:

b = (X'X)^{-1} X'Y = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} 20 \\ 0.5 \\ 0.5 \end{bmatrix}
To conclude, here is our least-squares regression equation:

ŷ = 20 + 0.5x1 + 0.5x2
where ŷ is the predicted test score; x1 is an IQ score; and x2 is the number of hours that the student
studied. The regression coefficients are b0 = 20, b1 = 0.5, and b2 = 0.5.
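The matrix arithmetic above can be verified in a few lines of NumPy (a sketch, not part of the original notes).

import numpy as np

# Design matrix: column of ones, IQ, study hours (values from the table above)
X = np.array([[1, 110, 40],
              [1, 120, 30],
              [1, 100, 20],
              [1,  90,  0],
              [1,  80, 10]], dtype=float)
Y = np.array([100, 90, 80, 70, 60], dtype=float)

b = np.linalg.inv(X.T @ X) @ X.T @ Y   # b = (X'X)^{-1} X'Y
print(b)                               # approximately [20.0, 0.5, 0.5]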

9 Variable Rationalization
A dataset often contains many attributes, but not all of them are relevant. Some attributes may be redundant
or irrelevant. The goal of variable rationalization is to improve data processing by selecting a subset of
attributes that are most important. This process helps in reducing the size of the dataset without losing
much useful information, which also makes the data analysis more efficient and cost-effective.
Working with a smaller, more focused dataset makes it easier to understand the patterns that are dis-
covered. Several methods are commonly used for selecting the most important attributes:

• Stepwise Forward Selection: This method starts with an empty set of attributes and adds the most relevant ones (based on the lowest p-value) one at a time; a code sketch follows this list.
• Stepwise Backward Elimination: This method begins with all attributes and removes those that
have the highest p-value in each iteration, which are considered less significant.
• Combination of Forward Selection and Backward Elimination: This method merges both
forward selection and backward elimination to efficiently select the most important attributes. This
combined approach is widely used.
• Decision Tree Induction: This method uses a decision tree to choose attributes. It builds a flowchart
where nodes represent tests on attributes, and branches represent possible outcomes. Attributes that
are not part of the final decision tree are considered irrelevant and can be discarded.
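As an illustration of the first method above, here is a minimal sketch of stepwise forward selection using statsmodels (not part of the original notes); the toy DataFrame, column names, and the 0.05 entry threshold are assumptions.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_selection(df, target, candidates, alpha=0.05):
    # Start with no predictors; at each step add the candidate whose coefficient
    # has the lowest p-value, as long as that p-value is below alpha.
    selected = []
    remaining = list(candidates)
    while remaining:
        pvals = {}
        for col in remaining:
            X = sm.add_constant(df[selected + [col]])
            pvals[col] = sm.OLS(df[target], X).fit().pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Assumed toy data: y depends on x1 and x2 but not on x3
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=100),
                   "x2": rng.normal(size=100),
                   "x3": rng.normal(size=100)})
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(scale=0.5, size=100)
print(forward_selection(df, "y", ["x1", "x2", "x3"]))  # likely ['x1', 'x2']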

10 Model Building
The primary goal of model building in linear regression is to develop the most accurate and precise estimates
of effects using your data and statistical software.
Strategies

• Adjust All: Start by including all potential confounders in the model to understand their impact on
the outcome variable.
• Predictor Selection: Choose predictors based on their ability to accurately predict the outcome.
Utilize methods such as stepwise selection or regularization techniques.
• Change in Estimate (CIE) Selection: Evaluate how excluding certain predictors changes the
estimated effects of other variables to make informed decisions about which variables to include.

Preparation

• Data Checking: Thoroughly check, describe, and summarize the data to ensure its quality and
suitability for modeling.
• Centering and Rescaling: Center quantitative variables to ensure that zero has a meaningful refer-
ence point and rescale them to make differences more interpretable.
• Variable Selection: Use statistical distributions and contextual knowledge to choose the appropriate
categories or flexible forms for the variables.

Avoid Pitfalls

• Over-Control: Avoid including too many variables in the model as this can lead to issues such as
data sparsity and multicollinearity, where predictors are highly correlated.
• Control Intermediates: Do not control for variables that lie on the causal pathway between the
predictor and the outcome, as this can introduce bias. Only include necessary control variables.

Review and Considerations

• Methodology Review: If your background knowledge is limited, review methodologies from related
studies to ensure robust model building.
• Forced vs. Data-Based Variables: Determine which variables should always be controlled (e.g.,
age, gender) and which should be selected based on the data analysis.
• Complexity: Avoid overcomplicating the model by including too many variables. A simpler model is
often more interpretable and avoids issues such as data sparsity and multicollinearity.

11 Logistic Regression
11.1 Model Theory
Logistic regression is a statistical method used for binary classification. The model predicts the probability
of a binary outcome based on one or more predictor variables.
The logistic regression model is formulated as:
P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
where:
• P (Y = 1 | X) is the probability of the outcome Y being 1 given the predictor X.
• β0 is the intercept term.
• β1 is the coefficient for the predictor X.
• e is the base of the natural logarithm.
For multiple predictors, the model is extended to:
P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)}}
where X = (X1 , X2 , . . . , Xp ) represents the vector of predictors.
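Once the coefficients are known, the predicted probability follows directly from the formula above. A minimal NumPy sketch (the coefficient values and inputs are illustrative assumptions, not part of the original notes):

import numpy as np

def predict_proba(X, beta0, beta):
    # Logistic model: P(Y = 1 | X) = 1 / (1 + exp(-(beta0 + X . beta)))
    z = beta0 + X @ beta
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta = -1.5, np.array([0.8, 0.3])   # assumed fitted coefficients
X = np.array([[1.0, 2.0],
              [0.5, 4.0]])                  # two observations, two predictors
p = predict_proba(X, beta0, beta)
print(p)                                    # probabilities between 0 and 1
print((p >= 0.5).astype(int))               # class labels at a 0.5 threshold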

11.2 Model Fit Statistics


Several statistics are used to assess the fit of a logistic regression model:

• Likelihood Ratio Test: The likelihood ratio test compares the fit of the full model to a reduced
model:
\mathrm{LR} = -2\,\bigl(\text{Log-Likelihood}_{\text{reduced}} - \text{Log-Likelihood}_{\text{full}}\bigr)
This test assesses whether the full model significantly improves fit over the reduced model.
• Hosmer-Lemeshow Test: The Hosmer-Lemeshow test evaluates the goodness-of-fit by grouping
observations into deciles based on predicted probabilities and comparing observed and expected fre-
quencies within these groups. A high p-value suggests a good fit.
• Akaike Information Criterion (AIC): The AIC is used for model comparison, penalizing the
likelihood of the model by the number of parameters:

AIC = −2Log-Likelihood + 2k

where k is the number of parameters in the model. Lower AIC values indicate a better fit.

• Pseudo-R²: Pseudo-R² measures the proportion of variance explained by the model. Common types include:
  – Cox and Snell R²:
    R^2_{CS} = 1 - \left(\frac{L_0}{L_1}\right)^{2/n}
    where L_0 is the likelihood of the null model, L_1 is the likelihood of the fitted model, and n is the sample size.
  – Nagelkerke R²:
    R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{2/n}}
• Classification Accuracy: Classification accuracy is the proportion of correctly predicted cases out of the total number of cases:
  \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}

• Receiver Operating Characteristic (ROC) Curve: The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold levels. The area under the ROC curve (AUC) measures the model's ability to discriminate between classes and equals the integral of the true positive rate over the false positive rate:
  \mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d(\mathrm{FPR})
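Several of these statistics can be computed directly from the true labels and the predicted probabilities. A minimal sketch with assumed example values (not part of the original notes; scikit-learn is used only for the AUC):

import numpy as np
from sklearn.metrics import roc_auc_score

# Assumed example: true binary labels and model-predicted probabilities
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])
k = 3  # assumed number of fitted parameters (intercept + two predictors)

log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # Bernoulli log-likelihood
aic = -2 * log_lik + 2 * k                                  # AIC = -2 log L + 2k
accuracy = np.mean((p >= 0.5).astype(int) == y)             # classification accuracy
auc = roc_auc_score(y, p)                                   # area under the ROC curve
print(log_lik, aic, accuracy, auc)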

12 Model Construction
• Specify the Model: Determine the form of the logistic regression model based on the dependent
variable and independent variables.
• Select Predictors: Choose relevant predictors based on theoretical knowledge, exploratory data
analysis, and variable selection methods.
• Estimate Parameters: Fit the model to the data using maximum likelihood estimation. This involves
finding the parameter values that maximize the likelihood of observing the given data.
• Assess Model Fit: Use fit statistics and diagnostic tools to evaluate how well the model represents
the data. Check for the adequacy of the model fit and the appropriateness of the assumptions.
• Refine the Model: Based on the results, refine the model by adding or removing predictors, checking
for multicollinearity, and addressing any model assumptions violations.
• Validate the Model: Test the model on a separate validation set or use cross-validation techniques
to assess its generalizability and robustness.
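As an end-to-end illustration of these steps, here is a minimal sketch using statsmodels (not part of the original notes); the simulated data, coefficient values, and 0.5 threshold are assumptions, and in practice a held-out set or cross-validation would be used for the validation step.

import numpy as np
import statsmodels.api as sm

# Assumed toy data: two predictors and a binary outcome generated from a logistic model
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
z = -0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-z)))

# Specify the model and estimate parameters by maximum likelihood
X_design = sm.add_constant(X)
result = sm.Logit(y, X_design).fit(disp=0)

# Assess fit: coefficients, p-values, and AIC
print(result.params, result.pvalues, result.aic)

# Validate (here on the training data for brevity): accuracy at a 0.5 threshold
pred = (result.predict(X_design) >= 0.5).astype(int)
print((pred == y).mean())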

13 Analytics Applications to Various Business Domains


In today’s data-driven world, analytics plays a crucial role across various business domains by providing
valuable insights that drive decision-making and strategic planning. From marketing and finance to opera-
tions and human resources, analytics helps businesses to understand trends, optimize processes, and improve
overall efficiency. By leveraging data, companies can make informed decisions that enhance their competitive
edge and achieve better outcomes. In this section, we explore how analytics is applied in different sectors to
address specific challenges and opportunities, demonstrating its impact on business performance and growth.

• Marketing and Sales
– Customer Segmentation: Analytics helps identify distinct customer groups based on purchas-
ing behavior, demographics, and preferences. This allows for targeted marketing strategies.
– Campaign Effectiveness: Measure the success of marketing campaigns by analyzing conversion
rates, customer engagement, and return on investment (ROI).
– Sales Forecasting: Predict future sales trends using historical data and seasonal patterns to
optimize inventory and sales strategies.
• Finance
– Risk Management: Analyze financial data to assess and mitigate risks, including credit risk,
market risk, and operational risk.
– Fraud Detection: Use anomaly detection and predictive modeling to identify and prevent fraud-
ulent transactions.
– Portfolio Management: Optimize investment portfolios by analyzing market trends, asset
performance, and risk factors.

• Operations and Supply Chain


– Demand Forecasting: Predict future demand for products to improve inventory management
and reduce stockouts or overstock situations.
– Supply Chain Optimization: Analyze supply chain processes to identify inefficiencies and
optimize logistics, procurement, and production.
– Quality Control: Monitor and analyze production processes to ensure product quality and
reduce defects.
• Human Resources
– Talent Acquisition: Use data to improve recruitment strategies, including candidate sourcing,
selection processes, and hiring practices.
– Employee Retention: Analyze employee satisfaction, performance metrics, and turnover rates
to develop strategies for retaining top talent.
– Workforce Planning: Forecast workforce needs based on business growth, employee turnover,
and skill requirements.

• Customer Service
– Customer Satisfaction: Analyze feedback, surveys, and support interactions to improve cus-
tomer service and enhance the overall customer experience.
– Churn Prediction: Identify customers at risk of leaving and develop strategies to retain them
through targeted interventions and personalized offers.
– Service Efficiency: Optimize customer support processes by analyzing response times, resolu-
tion rates, and resource allocation.
• Retail

– Pricing Strategy: Use analytics to determine optimal pricing strategies based on market de-
mand, competitor pricing, and cost structures.
– Product Assortment: Analyze sales data to determine the best mix of products to stock in
stores or online, maximizing sales and profitability.

– Customer Behavior: Track and analyze customer purchasing patterns to improve store layout,
product placement, and promotional strategies.
• Healthcare
– Patient Care: Analyze patient data to improve treatment outcomes, personalize care plans, and
predict patient needs.
– Operational Efficiency: Optimize hospital and clinic operations by analyzing patient flow,
resource utilization, and staff scheduling.
– Predictive Analytics: Use data to forecast disease outbreaks, patient admissions, and other
critical healthcare events.

• Travel and Hospitality


– Customer Experience: Analyze guest feedback and booking data to enhance the guest experi-
ence, personalize offers, and improve service quality.
– Revenue Management: Optimize pricing and booking strategies for hotels and airlines to
maximize revenue and occupancy rates.
– Demand Prediction: Forecast travel and accommodation demand to manage capacity and
improve operational efficiency.
• Education
– Student Performance: Use data to track and analyze student performance, identify at-risk
students, and implement interventions to improve learning outcomes.
– Resource Allocation: Optimize the allocation of educational resources, including faculty, facil-
ities, and funding, based on data-driven insights.
– Enrollment Trends: Analyze enrollment data to forecast future student numbers and plan for
academic program offerings and staffing needs.

Explain with examples when logistic regression is used over linear regression

Logistic regression is preferred over linear regression when the target variable is categorical rather than
continuous. Linear regression is used for predicting a real-valued outcome, such as house prices or temper-
atures, where the target variable is continuous and can take any value. In contrast, logistic regression is
ideal for classification tasks, where the target variable is categorical, often binary, like predicting whether an
event will occur (yes/no, 0/1). For instance, linear regression can predict a house’s price based on features
like size and location, whereas logistic regression can determine the likelihood of the house being sold (e.g.,
sale or no sale). The output of linear regression is not bounded and can range from −∞ to +∞, making
it unsuitable for probabilities. Logistic regression, however, produces probabilities constrained between 0
and 1 using a sigmoid function, which is crucial for classification tasks. For example, it can predict whether
an email is spam or not or if a student will pass an exam based on study hours. While linear regression
assumes a linear relationship between predictors and the outcome, logistic regression maps predictions to
probabilities, making it more suitable for problems involving categorical outcomes.

Discuss the logistic transformation and its components.

The logistic transformation is a mathematical function that maps any real-valued number to a value
between 0 and 1, making it suitable for modeling probabilities in classification problems. This transformation
is central to logistic regression and is expressed using the sigmoid function:
P(y = 1 \mid x) = \frac{1}{1 + e^{-z}}
where z is the linear combination of the input features and their weights:

z = w0 + w1 x1 + w2 x2 + · · · + wn xn

1. Sigmoid Function: The logistic transformation is represented by the sigmoid function:


\sigma(z) = \frac{1}{1 + e^{-z}}
It squashes the real number z (which could be any value from −∞ to +∞) into a range between 0
and 1. This makes it ideal for modeling probabilities, where values near 0 indicate low likelihood and
values near 1 indicate high likelihood.
2. Exponential Term (e−z ): This term governs the steepness of the sigmoid curve. When z is large
and positive, e−z approaches 0, making the probability close to 1. Conversely, when z is large and
negative, e−z grows exponentially, making the probability close to 0.
3. Linear Combination of Inputs (z): The value of z is determined by the weighted sum of the input
features:
z = w0 + w1 x1 + w2 x2 + · · · + wn xn
Here:
• w0 is the bias or intercept term.
• w1 , w2 , . . . , wn are the feature weights or coefficients.
• x1 , x2 , . . . , xn are the input features.
4. Probability Output: The sigmoid function transforms z into a probability value. For logistic regres-
sion, this output represents the probability of the positive class (y = 1). For binary classification, a
threshold (commonly 0.5) is applied to assign the predicted class.

Key Properties of the Logistic Transformation

• Range: The output is always between 0 and 1.
• S-shape Curve: The sigmoid function has an "S" shape, transitioning smoothly from 0 to 1.
• Interpretability: The logistic transformation directly relates to odds through the logit function:

  \mathrm{logit}(P) = \ln\!\left(\frac{P}{1 - P}\right) = z
This relationship shows how the linear combination z is linked to the odds of the event occurring.
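A short numeric sketch (an illustration, not part of the original notes) showing that the logit is the inverse of the sigmoid, so z can be read as the log-odds; the value z = 1.5 is an assumption.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

z = 1.5                  # assumed linear combination w0 + w1*x1 + ...
p = sigmoid(z)           # probability of the positive class
print(p)                 # about 0.82
print(logit(p))          # recovers z = 1.5, the log-odds
print(p / (1 - p))       # odds = e^z, about 4.48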
The logistic transformation is widely used in machine learning for:
• Binary classification tasks, like spam detection or disease diagnosis.
• Probabilistic outputs, which allow for a confidence score in predictions.
• Logistic regression, where it models the probability of a categorical outcome.
Demonstrate maximum likelihood estimation.

MLE is a method used to estimate the parameters of a model so that it best fits the observed data. It
works by maximizing the likelihood function, which measures how likely the observed data is given the
model parameters.
Steps to Perform MLE

1. Write the Likelihood Function: The likelihood function represents the probability of the observed data based on the model and its parameters. For n data points x_1, x_2, \ldots, x_n, the likelihood is:

L(\theta) = f(x_1; \theta) \cdot f(x_2; \theta) \cdots f(x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)

Here, θ represents the parameters to estimate.


2. Take the Log of the Likelihood Function: To simplify calculations, we take the natural logarithm
of the likelihood function:
\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i; \theta)

3. Find the Parameter Values: Maximize ℓ(θ) by solving:

\frac{\partial \ell(\theta)}{\partial \theta} = 0
This gives the parameter values that make the observed data most likely.
Example: MLE for a Normal Distribution

Suppose the data is from a normal distribution with mean µ and variance σ². The probability density of each data point is:

f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
Step 1: Write the Likelihood Function. For n independent data points:

L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}

Step 2: Log-Likelihood. Take the log of the likelihood function:

\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2

Step 3: Maximize the Log-Likelihood.

1. Solve for µ by setting the derivative of ℓ(µ, σ²) with respect to µ to zero:

\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)

Setting \frac{\partial \ell}{\partial \mu} = 0, we get:

\mu = \frac{1}{n}\sum_{i=1}^{n} x_i

(the sample mean).


2. Solve for σ² by setting the derivative of ℓ(µ, σ²) with respect to σ² to zero:

\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2

Setting \frac{\partial \ell}{\partial \sigma^2} = 0, we get:

\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2

(the sample variance).


For a normal distribution, the MLE estimates are:

• Mean (µ): the sample mean

  \mu = \frac{1}{n}\sum_{i=1}^{n} x_i

• Variance (σ²): the sample variance

  \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2

This process works for other distributions by using their specific probability functions. MLE provides
the most likely parameter values for a given dataset.
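A minimal NumPy sketch (an illustration with simulated data; the true parameters 5 and 2 are assumptions, not part of the original notes) confirming that the MLE for a normal sample is the sample mean and the divide-by-n sample variance:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # assumed sample from N(5, 2^2)

mu_hat = x.mean()                        # MLE of the mean: the sample mean
sigma2_hat = np.mean((x - mu_hat) ** 2)  # MLE of the variance: divide by n, not n - 1
print(mu_hat, sigma2_hat)                # close to 5 and 4
print(np.var(x))                         # np.var uses ddof=0, so it equals the MLE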

Discuss the importance of the sigmoid function.

The sigmoid function is a mathematical function that maps any input to a value between 0 and 1,
making it ideal for tasks that require probability predictions, such as binary classification. It is commonly
used in models like logistic regression and neural networks. The function is smooth and continuous, which
makes it differentiable and useful for optimization techniques like gradient descent. Its non-linear nature
allows models to capture complex patterns in data. However, it can suffer from the vanishing gradient
problem, where the gradient becomes very small for large inputs, slowing down learning in deep neural
networks. Despite this, the sigmoid function is still important because it provides interpretable probability
outputs and enables effective model training.

