QBM 101 Business Statistics
Dr. Lai Kee Huong
Department of Business Studies
Faculty of Business, Economics & Accounting
keehuong.lai@help.edu.my
SUBJECT OUTLINE:
Module 1: Introduction; organizing and graphing data; numerical descriptive measures
Module 2: Probability; discrete random variables; continuous random variables and the normal distribution
Module 3: Sampling distributions; estimation; hypothesis testing
Module 4: Simple linear regression
CHAPTER 10: SIMPLE LINEAR REGRESSION
10.1 Simple linear regression
10.2 Standard deviation of errors and coefficient of determination
10.3 Inferences about B
10.4 Linear correlation
10.5 Regression analysis: A complete example
10.6 Interpretation of Excel output
A regression model is a mathematical equation
that describes the relationship between two or
more variables. A simple regression model
includes only two variables: one independent and
one dependent. The dependent variable is the one
being explained, and the independent variable is
the one used to explain the variation in the
dependent variable.
A (simple) regression model that gives a straight-line relationship
between two variables is called a linear regression model.
Regression: describing the nature of the relationship
between variables: positive, negative, linear, or
nonlinear.
Correlation: determining whether a relationship
between variables exists.
Questions: Are the two variables related? If so,
what is the strength? What kind of relationship?
What prediction can be made?
Examples: height and weight of humans; number
of cigarettes smoked and weights of infants;
time spent studying and exam marks.
Dependent variable (DV) (y, the one being
explained) vs. independent variable (IV) (x, used
to explain the variation).
Simple (only 1 IV) vs. multiple (> 1 IV)
regression
Linear (straight-line relationship) vs. nonlinear
regression
SIMPLE LINEAR REGRESSION ANALYSIS
In the regression model \(y = A + Bx + \epsilon\), A
is called the y-intercept or constant term, B
is the slope, and \(\epsilon\) is the random error term.
The dependent and independent variables
are y and x, respectively.
In the model \(\hat{y} = a + bx\), a and b, which are
calculated using sample data, are called the
estimates of A and B, respectively.
SCATTER PLOT/DIAGRAM
ERROR SUM OF SQUARE (SSE)
The error sum of squares, denoted SSE, is
\(SSE = \sum e^2 = \sum (y - \hat{y})^2\)
The values of a and b that give the minimum
SSE are called the least squares estimates of A
and B, and the regression line obtained with
these estimates is called the least squares line.
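Least squares/best-fit line:
\(\hat{y} = a + bx\)
\(b = \dfrac{SS_{xy}}{SS_{xx}}\), \(a = \bar{y} - b\bar{x}\)
\(SS_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}\)
\(SS_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n}\)
\(SS_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}\)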
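Where the formulas for a and b come from: a short sketch of the standard derivation, which the slides do not show. It minimizes SSE with respect to a and b, using the same symbols as the definitions above.

```latex
% Minimize SSE(a, b) = \sum (y - a - bx)^2 over a and b.
\begin{align*}
\frac{\partial\, SSE}{\partial a} &= -2\sum (y - a - bx) = 0
  &&\Rightarrow\quad \sum y = na + b\sum x
  \quad\Rightarrow\quad a = \bar{y} - b\bar{x} \\
\frac{\partial\, SSE}{\partial b} &= -2\sum x\,(y - a - bx) = 0
  &&\Rightarrow\quad \sum xy = a\sum x + b\sum x^2
\end{align*}
% Substitute a = \bar{y} - b\bar{x} into the second equation:
\begin{align*}
\sum xy - \frac{(\sum x)(\sum y)}{n}
  &= b\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)
  \quad\Rightarrow\quad b = \frac{SS_{xy}}{SS_{xx}}
\end{align*}
```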
FORMULAS
Source: http://mathworld.wolfram.com/LeastSquaresFitting.html
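Least squares/best-fit line:
\(\bar{x} = \dfrac{\sum x}{n} = \dfrac{386}{7} = 55.1429\), \(\bar{y} = \dfrac{\sum y}{n} = \dfrac{108}{7} = 15.4286\)
\(SS_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n} = 6403 - \dfrac{(386)(108)}{7} = 447.5714\)
\(SS_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 23058 - \dfrac{(386)^2}{7} = 1772.8571\)
\(b = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{447.5714}{1772.8571} = 0.2525\)
\(a = \bar{y} - b\bar{x} = 15.4286 - (0.2525)(55.1429) = 1.5050\)
\(\hat{y} = a + bx = 1.5050 + 0.2525x\)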
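The arithmetic above can be checked with a short script. This is a minimal sketch, assuming only the summary sums quoted in the worked example (n = 7, Σx = 386, Σy = 108, Σxy = 6403, Σx² = 23058); the function names are illustrative and not part of the course material.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (SSxy, SSxx, b) computed from summary sums."""
    ss_xy = sum_xy - sum_x * sum_y / n   # SSxy = sum(xy) - sum(x)*sum(y)/n
    ss_xx = sum_x2 - sum_x ** 2 / n      # SSxx = sum(x^2) - (sum(x))^2/n
    b = ss_xy / ss_xx                    # slope estimate
    return ss_xy, ss_xx, b


def intercept(n, sum_x, sum_y, b):
    """Return a = y-bar - b * x-bar."""
    return sum_y / n - b * sum_x / n


# Summary sums from the food expenditure example.
n, sum_x, sum_y, sum_xy, sum_x2 = 7, 386, 108, 6403, 23058

ss_xy, ss_xx, b = least_squares_from_sums(n, sum_x, sum_y, sum_xy, sum_x2)
a_full = intercept(n, sum_x, sum_y, b)               # b at full precision
a_slides = intercept(n, sum_x, sum_y, round(b, 4))   # b rounded first, as in the slides

print(f"SSxy = {ss_xy:.4f}, SSxx = {ss_xx:.4f}, b = {b:.4f}")
print(f"a (b rounded to 4 d.p.) = {a_slides:.4f}")
print(f"a (full precision)      = {a_full:.4f}")
```

Note that the intercept depends slightly on when b is rounded: rounding b to four decimals first reproduces the 1.5050 used here, while carrying full precision gives a value closer to 1.5073.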
Least squares/best-fit line (estimation and its reliability):
\(b = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{447.5714}{1772.8571} = 0.2525\)
\(a = \bar{y} - b\bar{x} = 15.4286 - (0.2525)(55.1429) = 1.5050\)
\(\hat{y} = a + bx = 1.5050 + 0.2525x\)
Estimate the amount of food expenditures when the income is $6100.
\(\hat{y} = a + bx = 1.5050 + 0.2525(61) = 16.9075\) hundred dollars = $1690.75
Error, \(e = y - \hat{y} = 16 - 16.9075 = -0.9075\) hundred dollars = -$90.75
Estimate the amount of food expenditures when the income is $6000.
\(\hat{y} = a + bx = 1.5050 + 0.2525(60) = 16.655\) hundred dollars = $1665.50
The estimation is reliable because \(60 \in (33, 83)\).
Estimate the amount of food expenditures when the income is $2000.
\(\hat{y} = a + bx = 1.5050 + 0.2525(20) = 6.555\) hundred dollars = $655.50
The estimation is not reliable because \(20 \notin (33, 83)\) (extrapolation).
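The three estimates and the reliability check can be sketched in code. The coefficients and the observed income range (33 to 83, in hundreds) are taken from the example; treating the endpoints as inside the range is an assumption, and the function name is hypothetical.

```python
A_HAT, B_HAT = 1.5050, 0.2525   # fitted intercept and slope from the example
X_MIN, X_MAX = 33, 83           # observed range of income (in hundreds)


def predict_food_expenditure(income_hundreds):
    """Return (y_hat, reliable): prediction in hundreds and an in-range flag."""
    y_hat = A_HAT + B_HAT * income_hundreds
    # Predictions outside the observed x range are extrapolations.
    reliable = X_MIN <= income_hundreds <= X_MAX
    return y_hat, reliable


for x in (61, 60, 20):
    y_hat, reliable = predict_food_expenditure(x)
    note = "reliable" if reliable else "not reliable (extrapolation)"
    print(f"x = {x}: y_hat = {y_hat:.4f} hundred, {note}")

# Error of prediction for the household with x = 61 and observed y = 16.
error = 16 - predict_food_expenditure(61)[0]
print(f"e = {error:.4f} hundred")
```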
ERROR OF PREDICTION
Least squares/best-fit line (interpretation of
regression coefficients):
\(\hat{y} = a + bx = 1.5050 + 0.2525x\)
y-intercept, \(a = 1.5050\):
A family with RM0 income is expected to
spend RM1.5050 hundred
= RM150.50 on food.
Slope coefficient, \(b = 0.2525\):
For every one-unit (RM100) increase
in income, the expenditure on food is expected to
increase by RM0.2525 hundred = RM25.25.
Least squares/best-fit line (assumptions of
regression models):
1. Error has a mean of zero.
2. The errors are independent.
3. The distribution of error is normal.
4. The distribution of population errors has
the same (constant) standard deviation.
\(\epsilon \sim N(0, \sigma^2)\)
Degrees of Freedom for a Simple Linear
Regression Model
The degrees of freedom for a simple linear
regression model are
\(df = n - 2\)
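Standard deviation of errors:
\(\sigma\) is estimated by \(s_e\):
\(s_e = \sqrt{\dfrac{SSE}{n - 2}}\), where \(SSE = \sum (y - \hat{y})^2\) and \(df = n - 2\)
Equivalently, \(s_e = \sqrt{\dfrac{SS_{yy} - b\,SS_{xy}}{n - 2}}\)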
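Standard deviation of errors:
\(b = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{447.5714}{1772.8571} = 0.2525\)
\(SS_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n} = 6403 - \dfrac{(386)(108)}{7} = 447.5714\)
\(SS_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n} = 1792 - \dfrac{(108)^2}{7} = 125.7143\)
\(s_e = \sqrt{\dfrac{SS_{yy} - b\,SS_{xy}}{n - 2}} = \sqrt{\dfrac{125.7143 - (0.2525)(447.5714)}{7 - 2}} = 1.5939\)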
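A minimal sketch of the same calculation in code, assuming the summary values from the food expenditure example (Σy² = 1792, Σy = 108, n = 7, SSxy = 447.5714, b = 0.2525). The function name is illustrative only.

```python
import math


def std_dev_of_errors(n, ss_yy, ss_xy, b):
    """Return s_e = sqrt((SSyy - b*SSxy) / (n - 2))."""
    sse = ss_yy - b * ss_xy          # shortcut formula for SSE
    return math.sqrt(sse / (n - 2))  # n - 2 degrees of freedom


n = 7
ss_yy = 1792 - 108 ** 2 / n          # SSyy = sum(y^2) - (sum(y))^2/n
s_e = std_dev_of_errors(n, ss_yy, ss_xy=447.5714, b=0.2525)

print(f"SSyy = {ss_yy:.4f}")
print(f"s_e  = {s_e:.4f}")
```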
Coefficient of determination (COD)
\(r^2 = \dfrac{b\,SS_{xy}}{SS_{yy}}\), \(0 \le r^2 \le 1\)
\(b = 0.2525\), \(SS_{xy} = 447.5714\), \(SS_{yy} = 125.7143\)
\(r^2 = \dfrac{b\,SS_{xy}}{SS_{yy}} = \dfrac{0.2525(447.5714)}{125.7143} = 0.899 = 89.9\%\)
Interpretation: 89.9% of the total variation in the food expenditures
of households can be explained by the variation in incomes, and
the remaining 10.1% is due to randomness and other variables.
Coefficient of correlation (COC)
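\(r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}\), \(-1 \le r \le 1\)
\(SS_{xx} = 1772.8571\), \(SS_{xy} = 447.5714\), \(SS_{yy} = 125.7143\)
\(r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \dfrac{447.5714}{\sqrt{(1772.8571)(125.7143)}} = 0.9481\)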
Interpretation: the sign shows whether the variables are positively or
negatively correlated; the magnitude shows the strength
(very weak, average/moderate, strong, very strong).
\(r = 0.9481\): very strong and positively correlated
Other example:
\(r = -0.1111\): very weak and negatively correlated
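The coefficient of correlation and the coefficient of determination for the food expenditure example can be reproduced as follows. This is a sketch using the SS values quoted in the slides; the function name is an assumption made for illustration.

```python
import math


def coefficient_of_correlation(ss_xy, ss_xx, ss_yy):
    """Return r = SSxy / sqrt(SSxx * SSyy); r always lies in [-1, 1]."""
    return ss_xy / math.sqrt(ss_xx * ss_yy)


r = coefficient_of_correlation(ss_xy=447.5714, ss_xx=1772.8571, ss_yy=125.7143)
r_squared = r ** 2                    # coefficient of determination

direction = "positive" if r > 0 else "negative"
print(f"r   = {r:.4f} ({direction} correlation)")
print(f"r^2 = {r_squared:.3f}")
```

Squaring r gives the same coefficient of determination (to three decimals) as the formula b·SSxy/SSyy.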
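Test statistic: \(t_{calc} = \dfrac{b - B}{s_b}\), \(df = n - 2\), where \(s_b = \dfrac{s_e}{\sqrt{SS_{xx}}}\)
\(H_0: B = 0\)
\(H_1: B \ne 0\) (two-tailed test)
\(H_1: B > 0\) (positive) or \(B < 0\) (negative) (one-tailed test)
\(\sigma\) is unknown, so use the t distribution.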
HT about the slope coefficient, B
Test at the 1% significance level whether the
slope of the regression line is positive.
\(H_0: B = 0\), \(H_1: B > 0\) (one-tailed test)
\(\alpha = 0.01\)
\(df = n - 2 = 7 - 2 = 5\)
\(t_{calc} = \dfrac{b - B}{s_b} = \dfrac{0.2525 - 0}{0.0379} = 6.662\)
\(t_{critical} = t_{\alpha,\,n-2} = t_{0.01,\,5} = 3.365\)
\(t_{critical} = 3.365 < t_{calc} = 6.662\)
Reject \(H_0\). There is sufficient evidence to conclude
that the slope is positive; that is, income is
positively related to food expenditure.
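The test can be sketched in code. The standard error of the slope is computed with the standard formula s_b = s_e / sqrt(SSxx), which the slides use implicitly (it gives the 0.0379 above); the critical value 3.365 is taken from the slides rather than computed, and the function name is hypothetical.

```python
import math


def slope_t_statistic(b, s_e, ss_xx, b_null=0.0):
    """Return (t_calc, s_b) for testing H0: B = b_null."""
    s_b = s_e / math.sqrt(ss_xx)     # standard error of the slope estimate
    t_calc = (b - b_null) / s_b
    return t_calc, s_b


# Values from the food expenditure example (n = 7, so df = 5).
t_calc, s_b = slope_t_statistic(b=0.2525, s_e=1.5939, ss_xx=1772.8571)

t_critical = 3.365                   # t(0.01, df = 5), as quoted in the slides
reject_h0 = t_calc > t_critical      # right-tailed test, H1: B > 0

print(f"s_b = {s_b:.4f}, t_calc = {t_calc:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Because s_b is not rounded before dividing, the computed t differs slightly in the second decimal from the 6.662 shown above; the conclusion is unchanged.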
A random sample of eight drivers was selected from a small city.
All were insured with the same company and had similar minimum
required auto insurance policies. The following
table lists their driving experience (in years) and monthly
auto insurance premiums (in dollars).
Regression Analysis: A Complete Example
(a) Identify the IV and DV. Do you expect a positive or negative relationship?
(b) Compute \(SS_{xx}\), \(SS_{yy}\), and \(SS_{xy}\).
(c) Find the least squares regression line.
(d) Interpret the regression coefficients in (c).
(e) Calculate the COC and COD. Interpret their meanings.
(f) Predict the monthly premium for a driver with 10 years of experience.
Comment on the reliability of the estimation.
(g) Compute the standard deviation of errors.
(h) Test at a 5% significance level whether B is negative.
(a) IV: Driving experience, DV: Monthly auto insurance premium
A negative linear relationship.
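(b) \(\bar{x} = \dfrac{\sum x}{n} = \dfrac{90}{8} = 11.25\), \(\bar{y} = \dfrac{\sum y}{n} = \dfrac{474}{8} = 59.25\)
\(SS_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n} = 4739 - \dfrac{(90)(474)}{8} = -593.5\)
\(SS_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 1396 - \dfrac{(90)^2}{8} = 383.5\)
\(SS_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n} = 29{,}642 - \dfrac{(474)^2}{8} = 1557.5\)
(c) \(b = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{-593.5}{383.5} = -1.5476\)
\(a = \bar{y} - b\bar{x} = 59.25 - (-1.5476)(11.25) = 76.6605\)
\(\hat{y} = a + bx = 76.6605 - 1.5476x\)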
(d) \(\hat{y} = a + bx = 76.6605 - 1.5476x\)
y-intercept, \(a = 76.6605\):
A driver with 0 years of driving experience is expected to pay
a monthly premium of $76.66.
Slope coefficient, \(b = -1.5476\):
For every extra year of driving experience, the monthly
premium is expected to decrease by $1.55.
(e) COC, \(r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \dfrac{-593.5}{\sqrt{(383.5)(1557.5)}} = -0.7679\)
A moderately strong negative correlation.
COD, \(r^2 = \dfrac{b\,SS_{xy}}{SS_{yy}} = \dfrac{(-1.5476)(-593.5)}{1557.5} = 0.5897\)
Alternative: COD, \(r^2 = (-0.7679)^2 = 0.5897\)
58.97% of the variation in monthly premium can be explained by
driving experience, whereas the remaining 41.03% is due to
randomness and other unaccounted factors.
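(f) \(\hat{y}(10) = 76.6605 - 1.5476(10) = 61.18\), that is, $61.18
The estimation is reliable because \(10 \in (2, 25)\).
(g) \(s_e = \sqrt{\dfrac{SS_{yy} - b\,SS_{xy}}{n - 2}} = \sqrt{\dfrac{1557.5 - (-1.5476)(-593.5)}{8 - 2}} = 10.3199\)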
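(h) \(H_0: B = 0\), \(H_1: B < 0\)
\(\alpha = 0.05\), \(df = n - 2 = 8 - 2 = 6\)
\(t_{calc} = \dfrac{b - B}{s_b} = \dfrac{-1.5476 - 0}{0.5270} = -2.937\)
\(t_{critical} = -t_{\alpha,\,df} = -t_{0.05,\,6} = -1.943\)
\(t_{calc} = -2.937 < t_{critical} = -1.943\)
Reject \(H_0\). There is sufficient evidence to conclude that the slope is negative.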
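Parts (b) to (h) can be cross-checked end to end from the summary sums alone. This is a minimal sketch, assuming the sums used in the example (n = 8, Σx = 90, Σy = 474, Σxy = 4739, Σx² = 1396, Σy² = 29,642) and the critical value 1.943 from the slides; the function and key names are illustrative.

```python
import math


def simple_regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Return the main simple linear regression quantities as a dict."""
    ss_xy = sum_xy - sum_x * sum_y / n
    ss_xx = sum_x2 - sum_x ** 2 / n
    ss_yy = sum_y2 - sum_y ** 2 / n
    b = ss_xy / ss_xx                             # slope
    a = sum_y / n - b * sum_x / n                 # intercept
    r = ss_xy / math.sqrt(ss_xx * ss_yy)          # coefficient of correlation
    s_e = math.sqrt((ss_yy - b * ss_xy) / (n - 2))  # std. deviation of errors
    s_b = s_e / math.sqrt(ss_xx)                  # std. error of the slope
    return {
        "ss_xy": ss_xy, "ss_xx": ss_xx, "ss_yy": ss_yy,
        "a": a, "b": b, "r": r, "r2": r ** 2,
        "s_e": s_e, "s_b": s_b, "t_calc": b / s_b,  # t for H0: B = 0
    }


res = simple_regression_from_sums(8, 90, 474, 4739, 1396, 29642)

y_hat_10 = res["a"] + res["b"] * 10    # (f) premium at 10 years of experience
t_critical = -1.943                    # -t(0.05, df = 6), as quoted in the slides
reject_h0 = res["t_calc"] < t_critical # left-tailed test, H1: B < 0

for key in ("ss_xy", "ss_xx", "ss_yy", "b", "a", "r", "r2", "s_e", "s_b", "t_calc"):
    print(f"{key:>6} = {res[key]:.4f}")
print(f"y_hat(10) = {y_hat_10:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Values may differ from the hand calculation in the last decimal place because the script does not round intermediate results.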
The hypothesis test on B can be
performed using the p-value approach,
using the output obtained from
statistical software.
EXCEL OUTPUT
Source: http://www.excel-easy.com/examples/regression.html
SUMMARY
Identify IV (x) and DV (y)
Calculate \(SS_{xx}\), \(SS_{yy}\), and \(SS_{xy}\)
Determine the best fit line
Calculate and interpret regression coefficients
Calculate and interpret COC and COD
Estimate and comment on its reliability
Hypothesis test on B (critical value approach using manual calculation, or p-value approach from the Excel output)
Find missing values from the given Excel output