
BEAM078 Applied Empirical Accounting and Finance
BEFM022 Quantitative Research Methods
Module Leader: Dr Anthony Wood Email: a.p.wood@exeter.ac.uk
Workshop Tutor: Dr Wanling Rudkin Email: w.rudkin@exeter.ac.uk
Lecture 3
OLS Regression
Part 1
Correlation
Correlation

DATA LINK
STATA Code:
graph twoway scatter grade student_effort
Correlation
A scatterplot gives us a means of observing the relationship between two variables.

We may have:

• A positive (or negative) linear relationship


• A positive (or negative) non-linear / curved relationship
• Other relationships
• No relationship
Correlation
The correlation coefficient (r) is a number that measures the strength and direction of a bivariate linear
relationship.

It can be thought of as the standardised covariance between the two variables.


The correlation coefficient is defined as:

r = cov(X, Y) / (s_X × s_Y)

where the covariance is equal to:

cov(X, Y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1)

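We can compute this directly in STATA (a minimal sketch, assuming the grade/student_effort data from the scatterplot example is still loaded; pwcorr's sig option adds the p-value):

STATA Code:
correlate grade student_effort
pwcorr grade student_effort, sig    // also reports the p-value for H0: r = 0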
Correlation
Correlation coefficient properties

The value of r is not dependent upon the units of measurement. For example, if X is the weight in kilograms
and Y is the height in cm of a person, then the correlation between X and Y would be the same if we measured X
in pounds (lbs) and Y in inches.

The value of r does not depend on which variable is labelled X and which variable is labelled Y. That is to say,
the correlation of X and Y is equal to the correlation of Y and X.

r lies between −1 and +1 (−1 ≤ r ≤ 1). A positive value of r means a positive linear relationship; a negative value of r means a negative linear relationship.

When r = +1 there is a perfect positive linear relationship and when r = -1 there is a perfect negative linear relationship.

Values of r close to zero imply that there is no linear relationship. Note that "no linear relationship" does not mean that there is no relationship at all; r only measures linear relationships.

The strength of correlation can generally be defined as follows:

Strong: |r| ≥ 0.8
Moderate: 0.5 ≤ |r| < 0.8
Weak: |r| < 0.5
Correlation
Real life examples

[Figures: real-life correlation scatterplot examples. Source: rchsbowman (2010)]
Correlation
Correlation does not mean or imply causation

We must be very careful and mindful in interpreting correlation coefficients.

Just because two variables are highly correlated does not mean that one
variable causes the other to change.

In statistical terms, we say that correlation does not imply causation. There are
many good examples of correlated variables which are nonsensical (not causal).

• E.g. Ice cream sales and the number of shark attacks on swimmers are positively
correlated

We call these spurious correlations.


Correlation

[Figures: spurious correlation examples. Sources: Gizmodo.com, theburningplatform.com, businessweek.com]

You can find more examples here: http://tylervigen.com/page?page=1

Correlation
Further terminology
Common response: Both X and Y respond to changes in a third (highly correlated) variable.
This could be:

• an unobserved (lurking) variable which is not included within the regression but is likely
to better explain the relationship.

• Ice cream sales and shark attacks both increase during summer.

• a confounding variable which is included in the regression, is non-causal, and adversely
affects the estimated relationship between X and Y.

• Shark attacks = β0 + β1 × Ice Cream Sales + β2 × Summer Month + … other causal factors

AGAIN - correlation does not imply causation

Causation: Changes in X directly cause changes in Y. For example, football weekends cause
heavier traffic, more food sales, etc.
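To see common response in action, here is a small simulation (entirely hypothetical data generated in STATA, not from the lecture): shark attacks and ice cream sales are both driven by summer, so they correlate strongly, but the ice cream coefficient collapses once the summer variable is controlled for.

STATA Code:
clear
set obs 200
set seed 42
generate summer = runiform() < 0.5                   // 1 = summer month
generate icecream = 100 + 50*summer + rnormal(0, 10) // sales driven by summer
generate sharks = 2 + 3*summer + rnormal(0, 1)       // attacks driven by summer
pwcorr sharks icecream, sig                          // strong spurious correlation
regress sharks icecream summer                       // icecream loses significance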
Part 2
Bivariate OLS Regression
Bivariate OLS Regression

DATA LINK
STATA Code:
graph twoway scatter grade student_effort || lfit grade student_effort
Bivariate OLS Regression
The objective of a bivariate OLS linear regression is to fit a straight line through
the points on a scatter plot that best represents all the points.

This is called the regression line, or "line of best fit", and has the equation:

y = β0 + β1x + e        (example: y = 2 + 5x)

where:
y = Dependent variable
β0 = Intercept (on the y-axis)
β1 = Slope
x = Independent variable
e = Residual error (sometimes denoted u)
Bivariate OLS Regression
Finding the “best fit”

We need to find values of β0 and β1 such that the estimated line ŷ = β0 + β1x fits the data as well as possible.

Note that we use ŷ as our estimated value of y.

In effect we are asking: "What is our best estimate of y (the true value), given x, β0 and β1?"

For each data point within the regression, the residual error is the true value of y minus the predicted value ŷ:

Residual error (e) = y − ŷ = y − (β0 + β1x)

The line of best fit must be simultaneously closest to all the data points.

This occurs when the sum of the squared residual errors across all data points is minimised (see the next slides).
Bivariate OLS Regression
Visual depiction of the residual error (e)
Bivariate OLS Regression
Since residuals can be positive or negative, we square them to remove the sign.

By summing the squared residuals, we get a total measure of how far our line is from the data.

Therefore the "best-fit" line is the one where the sum of squared residuals is minimised.

O - Ordinary
L - Least
S - Squares
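As a quick illustration (a minimal sketch using the auto data set introduced on the next slides; e(rss) and predict are standard post-estimation tools):

STATA Code:
sysuse auto, clear
regress price mpg
display "Sum of squared residuals (minimised by OLS) = " e(rss)
predict e, residuals                 // stores each observation's residual
summarize e                          // the residuals average (almost exactly) zero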
Bivariate OLS Regression
For bivariate OLS regressions we can calculate the slope b1 and intercept b0
of the regression equation as follows:

Remember that:

cov(X, Y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1)

And that:

r = cov(X, Y) / (s_X × s_Y)

The slope and intercept can then be calculated as:

b1 = cov(X, Y) / s_X² = r × (s_Y / s_X)
b0 = ȳ − b1 x̄

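These formulas can be checked by hand in STATA (a minimal sketch, again using the auto data from the next slides; r(C) is the covariance matrix stored by correlate):

STATA Code:
sysuse auto, clear
quietly correlate price mpg, covariance
matrix C = r(C)                      // 2x2 covariance matrix of (price, mpg)
scalar b1 = C[1,2] / C[2,2]          // cov(price, mpg) / var(mpg)
quietly summarize price
scalar ybar = r(mean)
quietly summarize mpg
scalar b0 = ybar - b1*r(mean)        // b0 = ybar - b1*xbar
display "b1 = " b1 ",  b0 = " b0     // matches the output of: regress price mpg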
Bivariate OLS Regression
Let's run a simple bivariate OLS regression in STATA.

STATA has several inbuilt data sets which can be viewed using the command: sysuse dir

For this example we will be using the auto.dta file which contains data on different cars.

To load the data use: sysuse auto (you may need to clear any previous data using the “clear” command first).

We will now construct a regression model of the form: price = β0 + β1·mpg + e


So y = price of the car
x = mpg (miles per gallon – fuel efficiency)

What impact does mpg have on the price of a car?


To run the regression use the command: regress price mpg
Bivariate OLS Regression

[Figure: STATA output of the command "regress price mpg"]

Bivariate OLS Regression
Understanding the output

The annotated output shows:

• information regarding the whole regression (the upper panels)
• the y (dependent) variable
• information regarding the variable "mpg": the slope β1 (coefficient)
• the intercept β0 (the constant, _cons)
Bivariate OLS Regression
Information regarding the variable “mpg”

where:

Coef. = Coefficient (slope): how much y changes when x increases by 1 unit.
Std. Err. = Standard Error
t = t-test statistic
P>|t| = p-value (the probability, under the null hypothesis, of observing a t-statistic at least as large in magnitude as |t|)
[95% Conf. Interval] = The 95% confidence interval for the coefficient.
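Each of these columns can be reproduced by hand after the regression (a minimal sketch; _b[], _se[], e(df_r) and ttail() are standard STATA post-estimation tools):

STATA Code:
quietly regress price mpg
display "Coef.     = " _b[mpg]
display "Std. Err. = " _se[mpg]
display "t         = " _b[mpg]/_se[mpg]
display "P>|t|     = " 2*ttail(e(df_r), abs(_b[mpg]/_se[mpg]))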
Bivariate OLS Regression
Remember how we calculated a one-sample t-test statistic:

t = (x̄ − μ0) / SE

What if we wish to determine whether the value (or influence) of a regression coefficient is significantly different from zero at, say, a 95% confidence level (significance level α = 0.05)?
Bivariate OLS Regression
We can make inferences about β0 and β1 by using the t-distribution; we just need to calculate the corresponding standard errors.

Note that the standard errors here are different to those of a one-sample t-test.

Do not worry, STATA calculates this for us!


Bivariate OLS Regression
Standard Error, t and p-value:

Because the p-values are very small, we can reject both null hypotheses and conclude that the slope of mpg is not equal to zero, and that the intercept is not equal to zero.

Note that t = (βi − 0) / SE = −238.894 / 53.076 = −4.50
Bivariate OLS Regression
Confidence intervals:

Recall that a 95% confidence interval takes the form: estimate ± CV × SE

From the t-table we can deduce that, for a 95% confidence interval with n − m − 1 = 74 − 1 − 1 = 72 degrees of freedom, the critical value is approximately 1.993.

Therefore lower = −238.894 − (1.993 × 53.076) ≈ −344.70

And the upper = −238.894 + (1.993 × 53.076) ≈ −133.09

(In repeated samples, 95% of intervals constructed this way would contain the true value of the coefficient.)
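The same interval can be computed directly (a minimal sketch; invttail() returns the exact critical value STATA uses):

STATA Code:
quietly regress price mpg
scalar cv = invttail(e(df_r), 0.025)          // ≈ 1.993 with 72 df
display "lower = " _b[mpg] - cv*_se[mpg]      // ≈ -344.70
display "upper = " _b[mpg] + cv*_se[mpg]      // ≈ -133.09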
Bivariate OLS Regression
How well does the OLS regression explain the variation in Y?
Bivariate OLS Regression
Goodness of fit
How do we think about how well our sample regression line fits our sample data?

We can compute the fraction of the total sum of squares (SST) that is explained by the model.

This is called the R-squared of the regression:

R2 = SSR/SST = 1 − SSE/SST

R2 lies between 0 (no fit) and 1 (perfect fit), where:

SSR = sum of squares of the regression (explained variation)
SSE = sum of squares of the errors (residual variation)
SST = total sum of squares
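We can verify this decomposition on the auto regression (a minimal sketch; e(mss) and e(rss) are the model and residual sums of squares stored by regress):

STATA Code:
quietly regress price mpg
display "R2 = SSR/SST = " e(mss) / (e(mss) + e(rss))
display "e(r2)        = " e(r2)               // the value reported in the output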
Bivariate OLS Regression
How well does the OLS regression explain the variation in Y?

• R2 can never decrease when another independent variable is added to a regression, and usually it will increase.

• Because R2 will usually increase with the number of independent variables, it is usually not a good way to compare models.
Bivariate OLS Regression
How well does the OLS regression explain the variation in Y?

• Adjusted R2 adjusts for the number of predictor variables, and is usually reported as the regression's measure of fit.

• Adjusted R2 = 1 − (1 − R2) × (n − 1) / (n − m − 1), where n = number of observations and m = number of predictor variables

• For our example: Adjusted R2 = 1 − (1 − 0.2196) × (73 / 72) = 0.2087
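This can also be checked after the regression (a minimal sketch; e(N), e(df_m) and e(r2_a) are stored by regress):

STATA Code:
quietly regress price mpg
display "Adj. R2 by hand = " 1 - (1 - e(r2))*(e(N) - 1)/(e(N) - e(df_m) - 1)
display "e(r2_a)         = " e(r2_a)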
Bivariate OLS Regression
Is the regression significant?

The ANOVA section of the output reports, for each source of variation, a sum of squares (SS), degrees of freedom (df) and mean square (MS):

Model (regression): MSR = SSR / m
Residual (error):   MSE = SSE / (n − m − 1)
Total:              SST, with n − 1 degrees of freedom

Simply: MS = SS/df

Bivariate OLS Regression
The final part of the output uses an F-test to determine whether the formulated regression is significantly better than a regression with just an intercept.
The hypotheses for this test are as follows:

H0: β1 = 0 (the model explains nothing beyond the intercept)
H1: β1 ≠ 0

The test statistic F = MSR/MSE is compared against the critical value F(m, n−m−1, α).
Bivariate OLS Regression

F-statistic and significance


Bivariate OLS Regression
The F-distribution:

Unlike the Z and t-distributions that we have previously looked at, the F-distribution has two different degrees of freedom, and is not symmetrical.

The critical value can be found via an F-table (ANOVA table).

Note that for each level of significance there is a different table. Here I have just provided the F-table for 5% significance.

F(m, n−m−1, α) here means that we have m (= 1) numerator degrees of freedom and n − m − 1 (= 72) denominator degrees of freedom.
Bivariate OLS Regression
Critical Value F(1, 72, 0.05)

F-statistic = MSR/MSE = 139449474 / 6883554.48 = 20.258

Critical value = 3.97 (from the table)

F-statistic (20.258) > critical value (3.97)

Therefore we reject the null hypothesis that there is no significant regression fit, in favour of the alternative hypothesis that the slope of the regression line is significant.
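The same comparison can be made directly in STATA (a minimal sketch; e(F), invFtail() and Ftail() are built in):

STATA Code:
quietly regress price mpg
display "F-statistic    = " e(F)                             // ≈ 20.26
display "critical value = " invFtail(e(df_m), e(df_r), 0.05) // ≈ 3.97
display "p-value        = " Ftail(e(df_m), e(df_r), e(F))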
Part 3
Multivariate OLS Regression
Multivariate OLS Regression

Properties of Multivariate OLS regression (more than one x variable):

y = b0 + b1x1 + b2x2 + … + bnxn + e

b0 is still the intercept
b1 to bn are slope parameters (coefficients) for each variable x1, x2, …, xn
e is still the error term (or disturbance)

We are still minimising the sum of squared residuals, but now in multiple dimensions.
Multivariate OLS Regression

[Figure: visual depiction of a fitted regression plane with 2 independent x variables (source: Python)]
Multivariate OLS Regression
Multivariate OLS in STATA
Recall that in the last session we ran the bivariate OLS regression price = β0 + β1·mpg + e
(using the inbuilt STATA auto.dta data set).

For this session we will expand this regression to include multiple independent x variables.

Given this data, what are the main factors that influence the price of a car?

The regression model we will be using is as follows:

price = β0 + β1·mpg + β2·headroom + β3·trunk + β4·weight + β5·length + β6·turn + β7·displacement + β8·gear_ratio + e

STATA Code:
sysuse auto
reg price mpg headroom trunk weight length turn displacement gear_ratio
Multivariate OLS Regression

[Figure: STATA output of the multivariate regression]

Multivariate OLS Regression
A closer look at the coefficients

Within the regression, which variables are significant in explaining price?


How do we interpret the coefficients?
Multivariate OLS Regression
Variables significant at 10% (significantly different to a zero slope)
• headroom
• weight
• length
• turn
• gear_ratio

Variables significant at 5%
• weight
• turn
• gear_ratio

Variables significant at 1%
• weight
Multivariate OLS Regression
What do the coefficients mean within a multivariate regression?

Let us take weight as an example:

Weight has a coefficient of 4.85 and is significantly different to 0 at the 1% significance level.

Weight is measured in lbs (pounds)


Price is measured in $USD

Holding all other variables constant: A 1 unit increase in weight leads to a 4.85 unit increase in price.

Namely, if the weight of the car is increased by 1 lb, the price of the car will be $4.85 higher.

We can take the same approach with all other variables, but if they are not significant within the regression, we cannot have any confidence that this is the correct inference.
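Individual coefficients can be pulled out of the fitted model directly (a minimal sketch; _b[] and _se[] are standard):

STATA Code:
quietly reg price mpg headroom trunk weight length turn displacement gear_ratio
display "extra weight of 1 lb adds " _b[weight] " USD to the predicted price"
display "t-statistic for weight: " _b[weight]/_se[weight]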
Multivariate OLS Regression
Making your regressions look presentable for publication

Commonly you do not want to simply paste an image copy of the STATA output into your paper/assignment/thesis

There is a lot of superfluous information that the reader will not care about.

The reader will primarily be interested in the coefficients, t-stat, p-value, #obs, and the adjusted r-squared.

The following slides show a typical presentation of a regression.

This is taken from Horton, Krishna Kumar, & Wood (2020). Detecting Academic Fraud using Benford Law.

The table presents 9 different regressions.

Coefficients and t-stats are provided (in parentheses).

Note the starring system *, **, *** representing significance at the 10, 5, and 1% levels respectively.
Multivariate OLS Regression
How to do this in STATA (link to code).

“help estout” will provide you with a very long list of options. The options provided are what I would generally use.
You can then edit the resulting output in word to make any final changes/amendments.
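A minimal sketch of the estout workflow (assuming the package is installed; these options are illustrative, not necessarily those in the linked code):

STATA Code:
ssc install estout, replace          // one-off installation
sysuse auto, clear
eststo clear
eststo: quietly reg price mpg weight
eststo: quietly reg price mpg weight length turn gear_ratio
esttab, t ar2 star(* 0.10 ** 0.05 *** 0.01) label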
Multivariate OLS Regression
TASKS

Experiment with Multivariate OLS regressions in STATA

Understand what the coefficients mean and how to interpret them

Complete the Mini-Quiz on ELE

Complete Workshop_03 Questions


Building Block

Assignment Structure
