Aea 309 Notes-Econometrics
Prepared by
Khamaldin D. S. Mutabazi (PhD)
Philip, Goodluck 1
BSc. AEA
September, 2008
COURSE OUTLINE
1. Introduction to econometrics
Definition of econometrics
Statistical estimation and hypothesis testing
4. Practical applications
Econometric software
Data coding and entry (introduction)
Data processing and interpretation of estimation results
1.0 INTRODUCTION
1.1. Definition of econometrics
Econometrics is a social science in which the tools of economic theory, mathematics, and
statistical inference are applied to the analysis of economic phenomena.
Economic theory
Puts forward testable hypotheses, which are mostly qualitative in nature (e.g. the law of
demand, the law of supply, the quantity theory of money, the inverse relationship between
the marginal propensity to consume and income, and the positive relationship between the
marginal propensity to save and income).
Mathematical economics
Expresses economic theory in the form of mathematical equations. To quantify the
relationships stated as testable hypotheses, econometricians make use of the equations
developed by mathematical economists.
Economic statistics
Collects, processes, and presents economic data in the form of charts, diagrams, and
tables.
Econometrics
Exposes the economic data collected by the economic statistician to empirical testing,
using the mathematical equations developed by the mathematical economist, to give
empirical content to economic theory.
Depending on the activity of the organization you join after graduation, you may be called
upon to forecast (e.g. sales, interest rates, exchange rates, prices) or to build models
(e.g. of the demand for and supply of agricultural products), which may be crucial for
formulating new policies or evaluating existing ones.
Schematic description of steps involved in an econometric analysis (simplified flow).
[Flow chart: Economic theory -> Estimation -> Specification testing and diagnostic
checking -> decision (No / Yes)]
The following steps are followed in an econometric study, with the ultimate aim of
estimating parameters and testing hypotheses:
Step 1: Creating a testable hypothesis
Microeconomic theory states that other things remaining the same (i.e. ceteris paribus),
an increase in the price of beef is expected to decrease the quantity demanded of beef i.e.
a negative relationship between the price and quantity demanded of beef. This is the law
of demand.
We want to test whether the theory holds, i.e. whether demand for beef and price of beef
are negatively related.
Once the negative relationship is confirmed, we want to calculate by how much demand for
beef decreases following an increase of one Tanzanian shilling (TAS) in the price of beef
(i.e. we want to measure the responsiveness of demand to price, from which the price
elasticity of demand for beef can be calculated).
There are four types of data: time series, cross-sectional, pooled and panel.
Time series data: data collected over a period of time.
Cross-sectional data: data collected on one or more variables at one point of time.
Pooled data: data that includes both time series and cross-sectional data.
Panel data: a special type of data in which data from the same cross-sectional unit is
collected over time.
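Consider 10 smallholder cattle farms in Morogoro that produced beef (cattle off-take)
from 1990 to 2002. The data typology can be understood as in Table 1.
Table 1: Columns are the farms F1, F2, ..., F10 (the cross-sectional units) and rows are
the years (the time series). The full farm-by-year grid forms the PANEL.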
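The layout described in Table 1 can be sketched in a few lines of Python. This is only an illustration of the indexing, assuming the farm labels F1 to F10 and the years 1990 to 2002 mentioned above; the reference year 1995 is an arbitrary choice and no data values are invented.

```python
# Index structure of a panel: every farm is observed in every year.
farms = [f"F{i}" for i in range(1, 11)]   # 10 cross-sectional units
years = list(range(1990, 2003))            # 1990, ..., 2002

panel_index = [(farm, year) for farm in farms for year in years]

# A time series follows ONE unit over time.
time_series_f1 = [key for key in panel_index if key[0] == "F1"]

# A cross-section looks at ALL units at ONE point in time (1995 is arbitrary).
cross_section_1995 = [key for key in panel_index if key[1] == 1995]

print(len(panel_index), len(time_series_f1), len(cross_section_1995))
```

Each (farm, year) pair would hold one observation, so a panel of 10 farms over 13 years has 130 cells.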
Assume we want to see how the demand for beef behaves in relation to the price of beef and
to calculate the magnitude of change in the demand for beef following a unit change in
the price of beef.
Step one: Plot the data for these variables in a scatter diagram (i.e. price on the x axis
and demand on the y axis). This gives a hint regarding the relationship between demand for
beef and price of beef, and also whether a linear or non-linear mathematical model is
appropriate.
Step two: Assume that demand for beef and price of beef are linearly related. Their
relationship can then be quantified using the following mathematical model:
DBEEF = b1 + b2PRICE
b1 and b2 are the parameters of the linear function: b1 is the intercept of the model and
b2 is the slope. The slope coefficient can be positive or negative. The sign of the slope
parameter determines the direction of the relationship between demand for beef and price
of beef.
Step three: Superimpose a fitted straight line on the scatter diagram to judge whether the
relationship between the two variables is linear, i.e. to see whether all points on the
scatter diagram lie on the straight line.
The mathematical model assumes an exact or deterministic relationship between the two
variables. In reality, the relationships are inexact or statistical in nature, which is
why the points on the scatter diagram do not all lie exactly on the fitted straight line.
The observed relationship between the two variables is imprecise for various reasons,
such as the nature of the data collection (i.e. the data are non-experimentally collected)
and the omission of other forces that affect demand for beef (e.g. taste of consumers,
income of consumers, population). How can this problem be tackled?
To allow the influence of all variables not explicitly introduced in the mathematical
model, let us introduce a random variable u to get a statistical or econometric model.
DBEEF = b1 + b2PRICE + u
1.2.5. Steps 5 & 6: Estimation of Parameters (i.e. b1 and b2) from the Selected
Model and Testing for model adequacy
A methodology called ordinary least squares (OLS) will be developed to compute the
parameters of the chosen model.
Suppose the following result was obtained using OLS: DBEEF = 102.28 + 0.12PRICE
According to this result, if price increases by one unit, demand for beef is expected to
increase on average by 0.12 units. The value 102.28 implies that the average value of
demand for beef will be about 102.3 if the price of beef is zero.
Is the model we estimated, using beef price as the only explanatory variable, adequate?
If yes, we are rejecting the hypothesis (the law of demand), because the estimated slope
is positive. Is it a sin to reject an established hypothesis? How do we know that this is
not due to a problem of model misspecification? What will happen to the relationship
between the variables if we rewrite the model to include other forces such as disposable
income?
There are other factors that affect the demand for beef such as income. To see the
importance of the inclusion of this variable in the model, a multiple linear regression
model of the form given below was computed using OLS.
Which model do we choose? This one or the one with one explanatory variable? The
model chosen should be a representative of the reality. We will discuss in detail how one
can go about developing a model.
The next task is to test whether the estimated model makes economic sense, i.e. whether
the results conform to the underlying economic theory.
In our example (i.e. the multiple linear regression model) the result is in conformity
with the hypothesis.
Hypothesis testing goes beyond looking at the signs of the estimated coefficients. It
makes use of various tools to test complicated hypotheses.
Prediction and Forecasting: In addition to testing hypotheses, the estimated model may be
used for prediction or forecasting purposes (see Table 2).
Example of forecast
Table 2
Two functions: the population regression function (PRF) and the sample regression function
(SRF).
Y = a0 + a1X (PRF)
Ŷ = â0 + â1X (SRF)
2.2. The Linear Regression Model
2.2.1 Preamble
Empirically, we believe there is a causal link between class size and achievement
measured in terms of test score. As a result, you are asked to estimate the effect of a
change in class size on test score.
$\beta_{ClassSize} = \dfrac{\Delta TestScore}{\Delta ClassSize}$ (1)
Equation (1) can be seen as the definition of a slope coefficient. A straight line
relating test score to class size can therefore be written as:
$TestScore = \beta_0 + \beta_{ClassSize} \times ClassSize$ (2)
where β0 is the intercept (the test score with a class size of 0; in some applications the
intercept is not meaningful).
(3)
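Figure 1: The line Y = β0 + β1X with the points Q1 to Q4 lying on it at X1 to X4. The
observations P1 to P4 deviate from the line by their disturbances (u1 is shown for P1).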
Each value of Y thus has a non-random (systematic) component, β0 + β1X, and a random
component, u, which represents the non-systematic part: factors or influences that are
small and possibly unpredictable or non-measurable. The first observation has been
decomposed into these two components.
(4)
The parameters to be estimated are β0 and β1, whose estimates are denoted β̂0 and β̂1,
respectively. The line Ŷ = β̂0 + β̂1X is then called the regression of Y on X, where Ŷ
denotes a fitted value of Y, calculated from the equation of the line. In particular, the
estimated values of the dependent variable corresponding to the observed values of the
explanatory variable or regressor are as in equation 5:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, i = 1, ..., n (5)
The discrepancies between the observed and estimated values of the dependent variable
are the regression residuals, denoted e_i, as shown in equation 6 and in Figure 2:
$e_i = Y_i - \hat{Y}_i$, i = 1, ..., n (6)
These can be thought of as the observable counterpart to the unobservable disturbances,
u_i.
Figure 2: Actual values of Y (points P1 to P4), fitted values (points R1 to R4 on the
line Ŷi = β̂0 + β̂1Xi) and the residuals e1 to e4, measured vertically between them, at
X1 to X4.
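The least squares method of estimation consists of choosing, for a given set of data, those
estimates β̂0 and β̂1 which minimize the sum of squared residuals, as shown in equation 7:
$S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$ (7)
To minimize S with respect to β̂0 and β̂1 we set the first partial derivatives with respect
to β̂0 and β̂1 equal to zero, as shown in equations 8, 9, 10 and 11. By applying the
differentiation rule for a power, $\frac{d}{d\beta} z^n = n z^{n-1} \frac{dz}{d\beta}$,
with n = 2:
$\dfrac{\partial S}{\partial \hat{\beta}_0} = -2 \sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$ (8)
$\dfrac{\partial S}{\partial \hat{\beta}_1} = -2 \sum X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$ (9)
$\sum Y_i = n \hat{\beta}_0 + \hat{\beta}_1 \sum X_i$ (10)
$\sum X_i Y_i = \hat{\beta}_0 \sum X_i + \hat{\beta}_1 \sum X_i^2$ (11)
We now solve equations 10 and 11 for β̂0 and β̂1. Dividing equation 10 by n gives equation
12:
$\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}$ (12)
where X̄ and Ȳ are the respective sample means, and so we see that the regression line
passes through the point of the sample means (X̄, Ȳ). Rearranging slightly, we get
equation 13:
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$ (13)
Substituting equation 13 into 11 gives equation 14 as follows:
$\sum X_i Y_i = (\bar{Y} - \hat{\beta}_1 \bar{X}) \sum X_i + \hat{\beta}_1 \sum X_i^2$ (14)
Solving for β̂1 in equation 14 we obtain the result shown in equation 15:
$\hat{\beta}_1 = \dfrac{\sum X_i Y_i - n \bar{X} \bar{Y}}{\sum X_i^2 - n \bar{X}^2} = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$ (15)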
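As a small numerical check of the derivation, the two normal equations can be solved directly. The sketch below assumes the four hypothetical (Y, X) pairs that appear later in these notes under assumption A7; the variable names and the use of Cramer's rule are illustrative choices, not part of the original notes.

```python
# Hypothetical (X, Y) pairs taken from the table under assumption A7.
X = [1, 4, 5, 6]
Y = [4, 5, 7, 12]
n = len(X)

sum_x = sum(X)
sum_y = sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_xx = sum(x * x for x in X)

# Normal equations:
#   sum_y  = n * b0     + b1 * sum_x
#   sum_xy = b0 * sum_x + b1 * sum_xx
# Solved here by Cramer's rule for the 2x2 system.
det = n * sum_xx - sum_x ** 2
b1 = (n * sum_xy - sum_x * sum_y) / det
b0 = (sum_y * sum_xx - sum_x * sum_xy) / det

print(f"intercept b0 = {b0:.4f}, slope b1 = {b1:.4f}")
```

The fitted line passes through the point of sample means, which can be confirmed by evaluating b0 + b1 * (sum_x / n) and comparing it with sum_y / n.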
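Some further implications of the least squares estimates can be obtained by re-examining
the first order conditions and normal equations. The first gives:
$\sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$
That is,
$\sum e_i = 0$
So the residuals sum to zero and have sample mean, ē, equal to zero. Since
Y_i = Ŷ_i + e_i, it follows that the means of the observed and fitted values of Y are
equal. The second first order condition gives:
$\sum X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$
That is,
$\sum X_i e_i = 0$
This implies that the covariance, m, between the residuals and the explanatory variable is
zero:
$m_{Xe} = \frac{1}{n} \sum (X_i - \bar{X})(e_i - \bar{e}) = \frac{1}{n} \sum X_i e_i = 0$
Since the deviation of the fitted value from its mean is given as
$\hat{Y}_i - \bar{Y} = \hat{\beta}_1 (X_i - \bar{X})$, i = 1, ..., n
it follows that the covariance between the residuals and the fitted values is also zero:
$\frac{1}{n} \sum (\hat{Y}_i - \bar{Y}) e_i = \hat{\beta}_1 \frac{1}{n} \sum (X_i - \bar{X}) e_i = 0$
Thus OLS estimation splits the dependent variable Y into two components, namely an
estimate of the systematic part of Y, Ŷ = β̂0 + β̂1X, and a remainder or residual e, and
these two components are uncorrelated.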
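These properties of the residuals are easy to verify numerically. The sketch below is an illustration only: it borrows the irrigation frequency and yield figures for farmers A to J from the practical example later in these notes, and the variable names are assumptions.

```python
# Irrigation frequency (X) and yield (Y) for farmers A to J, from the
# practical example later in the notes.
X = [2, 3, 0, 4, 5, 1, 6, 3, 7, 5]
Y = [2, 3, 2, 8, 10, 2, 15, 5, 18, 10]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# OLS slope and intercept in deviation form.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
    (x - x_bar) ** 2 for x in X
)
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in X]
resid = [y - f for y, f in zip(Y, fitted)]

sum_resid = sum(resid)                 # residuals sum to zero
fitted_bar = sum(fitted) / n           # mean of fitted values equals mean of Y
cov_x_e = sum((x - x_bar) * e for x, e in zip(X, resid)) / n
cov_fit_e = sum((f - fitted_bar) * e for f, e in zip(fitted, resid)) / n

print(sum_resid, fitted_bar - y_bar, cov_x_e, cov_fit_e)
```

All four printed quantities should be zero up to floating-point rounding error.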
A2. X is non-stochastic, that is, it contains no random part. This implies that:
, for all i, j
This means that the values taken by the regressor X are considered fixed in repeated
samples. In this sense, the regression analysis is conditional on the given values of the
regressor(s) X.
E(u_i | X_i) = 0, for all i.
The disturbance term, which represents the non-systematic part made up of unobservable
influences affecting the outcome, has a mean of 0 given the value of X. Technically, the
conditional mean value of u_i is zero. Geometrically, this assumption can be pictured as
in the following figure:
Figure: Conditional distribution of the disturbances u_i. At each of X1, X2 and X3 the
density f(u) is centred on the PRF, with positive (+u_i) and negative (-u_i) disturbances
on either side of the mean.
The term serial correlation is often used as an alternative for autocorrelation. Lack of
serial correlation (zero correlation) is exhibited in Fig (c) as opposed to Figures (a) and
(b) with u’s positively and negatively correlated, respectively.
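Figure: Patterns of correlation among the disturbances, each panel plotting +u_i / -u_i
on both axes: (a) positive serial correlation, (b) negative serial correlation, (c) zero
correlation.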
Suppose in our PRF (Yt = β0 + β1Xt + ut) that ut and ut-1 are positively correlated. Then
Yt depends not only on Xt but also on ut-1, for ut-1 to some extent determines ut. By
invoking assumption 5, we are saying that we will consider the systematic effect, if any,
of Xt on Yt and not worry about the other influences that might act on Y as a result of
the possible intercorrelations among the u's.
$cov(u_i, X_i) = E[u_i - E(u_i)][X_i - E(X_i)]$
$= E[u_i (X_i - E(X_i))]$, since E(ui) = 0
$= E(u_i X_i) - E(X_i) E(u_i)$, since E(Xi) is non-stochastic
$= E(u_i X_i)$, since E(ui) = 0
$= 0$, by assumption
Basically, assumption 6 states that the disturbance u and the explanatory variable X are
uncorrelated. The relevance of this assumption is that X and u (which represents the
influence of all the omitted variables) are assumed to have separate influences on Y. If X
and u are correlated, it is not possible to assess their individual effects on Y.
A7. The number of observations n must be greater than the number of parameters to be
estimated.
Consider a hypothetical example in the following table. Imagine that we had only the first
pair of observations on Y and X (4 and 1). From this single observation there is no way to
estimate the two unknowns, β0 and β1.
Yi    Xi
4     1
5     4
7     5
12    6
probabilistic assumptions made about the Yi, the Xi and the ui entering the model. By not
addressing these questions properly, the validity of the outcomes of the estimation will be
highly questioned.
The theorem states that in a linear model in which errors have expectation zero and are
uncorrelated and have equal variances, a best linear unbiased estimator (BLUE) of the
coefficients is given by the least-squares estimator.
An estimator, say the OLS estimator β̂2, is said to be BLUE of β2 if the following hold:
of the values taken by β̂2 in repeated sampling experiments. As the figure shows, the
mean of the β̂2 values, E(β̂2), is equal to the true β2. In this situation, we say that β̂2
is an unbiased estimator of β2.
Figure: (a) sampling distribution of β̂2, with E(β̂2) = β2; (b) sampling distribution of
β2*, with E(β2*) = β2; (c) sampling distributions of β̂2 and β2* compared.
3.0 Practical computation of the parameters
Tasks:
Scatter plot of the relationship between yield and frequency of irrigation
Yield is related to irrigation frequency (r = +0.94)
But what is the slope (b) of the 'best fit' straight line for these data?
The slope (b) estimates the expected change in yield (Y) for each unit increase in
frequency of irrigation (X). The slope (b) of the 'best fit' line is also called the
regression coefficient.
The intercept (a) of the 'best fit' line is also called the regression constant.
Nota Bene:
When the slope b = 0.0, the intercept a = Ȳ, the mean of Y. In other words, if X is not
related to Y, your best guess about the value of Y is Ȳ.
The intercept of the line (a)
Regression equation
Yield = -1.26 + 2.432 (irrigation frequency)
Interpretation
The best fit straight line intersects the Y-axis at -1.26
When irrigation frequency increases by 1, the yield increases by 2.432 tons
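The regression equation above can be reproduced from the raw data. The following is a minimal sketch, assuming the irrigation frequency and yield figures for farmers A to J listed in the sums of squares table of these notes; the variable names are illustrative.

```python
# Irrigation frequency (X) and yield in tons (Y) for farmers A to J.
X = [2, 3, 0, 4, 5, 1, 6, 3, 7, 5]
Y = [2, 3, 2, 8, 10, 2, 15, 5, 18, 10]
n = len(X)

x_bar = sum(X) / n   # mean irrigation frequency
y_bar = sum(Y) / n   # mean yield

# Sums of cross-products and squares in deviation form.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)

b = s_xy / s_xx          # slope (regression coefficient)
a = y_bar - b * x_bar    # intercept (regression constant)

print(f"Yield = {a:.2f} + {b:.3f} (irrigation frequency)")
```

Rounded to the precision used in the notes, the slope and intercept agree with the regression equation quoted above.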
In regression analysis, it is a common practice to partition the sums of squares into additive
parts.
SStotal = SSregression + SSerror
Figure: Yield (tons) plotted against irrigation frequency. For one observation the total
deviation (Y - Ȳ) is split into the regression part (Ŷ - Ȳ) and the residual or error
(Y - Ŷ), with Ȳ = 7.5 tons.
Calculations of Sums of Squares
Yield = -1.26 + 2.432 (irrigation)
Farmer   Irrigation (X)   Yield (Y)   Total (Y - Ȳ)²   Regression (Ŷ - Ȳ)²   Error e² = (Y - Ŷ)²
A        2                2           30.25            15.18                 2.573
B        3                3           20.25            2.14                  9.217
C        0                2           30.25            76.74                 10.628
D        4                8           0.25             0.94                  0.219
E        5                10          6.25             11.56                 0.810
F        1                2           30.25            40.04                 0.686
G        6                15          56.25            34.01                 2.782
H        3                5           6.25             2.14                  1.073
I        7                18          110.25           68.29                 5.000
J        5                10          6.25             11.56                 0.810
Total    36               75          296.5            262.7                 33.80
r2 = SSregression / SStotal = (262.7)/(296.5) = 0.886
Alternatively
r2 = (SStotal - SSerror) / SStotal = (296.5 - 33.8)/(296.5) = 0.886
r = √0.886 = 0.94
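The partition of the sums of squares can be checked with a short script. This is a sketch under the assumption that the fitted values come from the unrounded least squares line, so the column totals may differ from the hand-calculated table in the last decimal place.

```python
import math

# Irrigation frequency (X) and yield (Y) for farmers A to J.
X = [2, 3, 0, 4, 5, 1, 6, 3, 7, 5]
Y = [2, 3, 2, 8, 10, 2, 15, 5, 18, 10]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

# Least squares fit (unrounded coefficients).
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
    (x - x_bar) ** 2 for x in X
)
a = y_bar - b * x_bar
fitted = [a + b * x for x in X]

ss_total = sum((y - y_bar) ** 2 for y in Y)
ss_regression = sum((f - y_bar) ** 2 for f in fitted)
ss_error = sum((y - f) ** 2 for y, f in zip(Y, fitted))

r_squared = ss_regression / ss_total
r = math.sqrt(r_squared)   # positive root, because the slope is positive

print(f"SStotal={ss_total:.1f} SSregression={ss_regression:.1f} "
      f"SSerror={ss_error:.1f} r2={r_squared:.3f} r={r:.2f}")
```

The identity SStotal = SSregression + SSerror holds exactly only when the unrounded coefficients are used, which is why the script refits the line instead of using -1.26 and 2.432.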
3.2. Linear Regression and Analysis of Variance
In analysis of variance, the total sum of squares is partitioned into:
The between SS and
The within SS
Analysis of variance can be used to determine the significance of the regression equation.
QN: Is the SS explained by the regression equation significantly different from 0.0?
Source of Variation   SS      df              MS      F
Regression            262.7   k = 1           262.7   62.18
Error                 33.8    N - k - 1 = 8   4.225
Total                 296.5   9
(k = number of independent variables)
Null hypothesis
In the population, the regression sum of squares is zero, i.e. irrigation frequency is not
related to yield, i.e. r = 0.00
Decision
The null hypothesis is rejected
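s² = SSerror / (N - k - 1) = variance estimate
s² = (33.8)/(10-1-1) = 4.225
s_b = √(s² / Σ(X - X̄)²) = √(4.225/44.40) = 0.31
t = b / s_b = (2.432/0.31) = 7.85
Since 7.85 > 1.860 (the one-tailed critical value of t for α = 0.05 and df = 8), p < 0.05
and the null hypothesis is rejected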
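The F and t statistics can be recomputed from the summary figures. The sketch below assumes the sums of squares from the ANOVA table, the slope 2.432, and a sum of squared deviations of irrigation frequency of 44.40 (the value that appears among the deviation sums later in these notes); the variable names are illustrative.

```python
import math

ss_regression = 262.7   # from the ANOVA table
ss_error = 33.8
n, k = 10, 1            # observations, independent variables
b = 2.432               # estimated slope
s_xx = 44.40            # sum of squared deviations of X (assumed, see lead-in)

ms_regression = ss_regression / k
ms_error = ss_error / (n - k - 1)      # variance estimate
f_stat = ms_regression / ms_error

se_b = math.sqrt(ms_error / s_xx)      # standard error of the slope
t_stat = b / se_b

print(f"MSerror={ms_error:.3f} F={f_stat:.2f} se(b)={se_b:.2f} t={t_stat:.2f}")
```

With one explanatory variable the square of t equals F up to rounding, which is why the two tests lead to the same decision. The t value printed here uses the unrounded standard error, so it differs slightly from a hand calculation based on 0.31.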
Nota Bene
Both F and t test the same null hypothesis, i.e. that there is no relationship between
irrigation frequency and green maize yield in the farming population.
In this example, the critical value of t for α/2 = (0.05 / 2) = 0.025 and df = 8 is equal
to t = 2.306
Therefore
2.432 ± (2.306) (0.31) = 2.432 ± 0.715 = 1.7 to 3.1 tons
In generalizing the relationship between irrigation frequency and green maize yield, we are
95% confident that the population parameter falls between 1.7 tons and 3.1 tons
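The interval arithmetic can be laid out explicitly. This is a minimal sketch that takes the slope, its standard error and the critical t value as given in the notes rather than computing them from a statistical library.

```python
b = 2.432        # estimated slope
se_b = 0.31      # standard error of the slope, as rounded in the notes
t_crit = 2.306   # two-tailed critical t, alpha = 0.05, df = 8 (from the notes)

margin = t_crit * se_b
lower = b - margin
upper = b + margin

print(f"95% confidence interval for the slope: {lower:.3f} to {upper:.3f} tons")
```

Because the interval excludes zero, it leads to the same conclusion as the t test: the population slope is unlikely to be zero.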
Figure: The sample slope b = 2.432 is used to estimate the unknown population slope β,
with the 95% confidence limits marked around it.
Farmer Age (X) Yield (Y)
A 60 2
B 50 3
C 65 2
D 40 8
E 36 10
F 55 2
G 38 15
H 48 5
I 25 18
J 39 10
Total 456 75
Q: What is the relationship between the age of the farmer and the yield of green maize?
Q: What is the correlation between age and green maize yield (r)?
Q: How much of the variance in green maize yield is accounted for by age (r2)?
Q: How much of the variance in green maize yield is not accounted for by age (1 - r2)?
Q: When the age of a farmer increases by one year, by how much does their yield
increase or decrease?
Problem 1 – What does the scatterplot of green maize yield as a function of farmer's age
look like?
Regression
Variables Entered/Removed (b)
Model   Variables Entered       Variables Removed   Method
1       age of the farmer (a)   .                   Enter
a All requested variables entered.
b Dependent Variable: yield (ton/ha)
Model Summary
Model   R   R Square   Adjusted R Square   Std. Error of the Estimate
The correlation r = -0.918 (multiple R = 0.918)
The coefficient of determination r2 = 0.843
df = (1 & 8), F = 42.848, p < 0.001
ANOVA (b)
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   249.851          1    249.851       42.848   .000 (a)
  Residual     46.649           8    5.831
  Total        296.500          9
a Predictors: (Constant), age of the farmer
b Dependent Variable: yield (ton/ha)
Linear regression equation
Coefficients (a)
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient)
Model                 B        Std. Error   Beta    t        Sig.
1 (Constant)          27.143   3.097                8.766    .000
  age of the farmer   -0.431   .066         -.918   -6.546   .000
a Dependent Variable: yield (ton/ha)
What is the probability that the regression coefficient β in the population is 0.0?
Residuals Statistics (a)
                       Minimum   Maximum   Mean     Std. Deviation   N
Predicted Value        -.8571    16.3740   7.5000   5.26890          10
Residual               -2.6046   4.2261    .0000    2.27666          10
Std. Predicted Value   -1.586    1.684     .000     1.000            10
Std. Residual          -1.079    1.750     .000     .943             10
a Dependent Variable: yield (ton/ha)
r = -0.918, r2 = 0.843
(1-r2) = 0.157
F = 42.848, p < 0.001
Q: If significant, how much of the variance in Y can be accounted for by X, i.e. what is
the coefficient of determination? r2 = 0.843, or 84%
Q: How much of the variance in Y cannot be accounted for by X, i.e. what is the coefficient
of non-determination? 1-r2 = 0.157, or 16%
Residual (e) = (Y - Ŷ)
If the data fit the assumptions of the regression model, then the residuals will be randomly
distributed
Histogram of the residuals with a normal curve overlay
Interpretation
Not perfectly normally distributed
Roughly symmetric
Not too bad
Interpretation
Confirms results in the histogram
Scatter plot of the residuals against the predicted values of Y (Ŷ)
[Scatter plot: residuals on the vertical axis, predicted values of Y on the horizontal axis]
To standardize a residual (e) or a prediction (Ŷ) is to convert it into a z-score. The
z-score for an item indicates how far, and in what direction, that item deviates from its
distribution's mean, expressed in units of its distribution's standard deviation. The
mathematics of the z-score transformation are such that if every item in a distribution is
converted to its z-score, the transformed scores will necessarily have a mean of zero and a
standard deviation of one.
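The z-score transformation can be illustrated with the yield figures from the example. The sketch below is an assumption-laden illustration: it standardizes the raw yields (not the residuals) and uses the sample standard deviation with n - 1 in the denominator.

```python
import statistics

# Yield (tons) for farmers A to J, from the example data.
yields = [2, 3, 2, 8, 10, 2, 15, 5, 18, 10]

mean_y = statistics.mean(yields)
sd_y = statistics.stdev(yields)      # sample standard deviation (n - 1)

# z-score: distance from the mean in units of the standard deviation.
z_scores = [(y - mean_y) / sd_y for y in yields]

print([round(z, 2) for z in z_scores])
print(round(statistics.mean(z_scores), 10), round(statistics.stdev(z_scores), 10))
```

Whatever the original units, the transformed scores have mean zero and standard deviation one, which makes values from different variables directly comparable.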
In SPSS, standardized residuals and predictions can be saved in the regression analysis. They
are called zre_1 and zpr_1
Possible outliers
Cases that produce errors beyond +/- 1.30 should be examined as possible outliers
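Farmer   Irrigation (X1)   Age (X2)   Yield (Y)
A        2                 60         2
B        3                 50         3
C        0                 65         2
D        4                 40         8
E        5                 36         10
F        1                 55         2
G        6                 38         15
H        3                 48         5
I        7                 25         18
J        5                 39         10
Total    36                456        75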
Multiple linear regression involves more than one predictor variable (Xk)
Regression with one predictor variable geometrically involves fitting a “best fit” straight line
to a scatterplot of the data
With two predictors, it involves fitting a “best fit” plane to the data
[Figure: best-fit plane for yield over the two predictors X1 and X2]
Green maize yield as a function of irrigation frequency and age of the farmer
Y = a + b1(irrigation) + b2(age)
Inter-correlation matrix
                  Irrigation (X1)   Age (X2)   Yield (Y)
Irrigation (X1)   1.00              -0.960     +0.941
Age (X2)                            1.00       -0.918
Yield (Y)                                      1.00
R2 = 0.8882 ≈ 0.888
R = 0.943
Interpretation
The multiple correlation of yield with irrigation and age is 0.943
89% of the variance in yield is accounted for by the combination of irrigation and age
11% of the variability in yield is accounted for by the factors other than irrigation and
age
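For irrigation:
$b_1 = \dfrac{(\sum x_1 y)(\sum x_2^2) - (\sum x_2 y)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}$
For age:
$b_2 = \dfrac{(\sum x_2 y)(\sum x_1^2) - (\sum x_1 y)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}$
where lower-case letters denote deviations from the sample means, and
Σx1² = 44.40      Σx1y = 108.00
Σx2² = 1346.40    Σx2y = -580.00
Σy² = 296.50      Σx1x2 = -234.60
For irrigation:
b1 = 9343.2/4743.0 = 1.969892 ≈ 1.969
For age:
b2 = -415.2/4743.0 = -0.08754 ≈ -0.088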
Calculation of intercept
Y = a + b1 (irrigation) + b2 (Age)
Ȳ = ΣY/N = 75/10 = 7.5
X̄1 = ΣX1/N = 36/10 = 3.6
X̄2 = ΣX2/N = 456/10 = 45.6
a = Ȳ - b1X̄1 - b2X̄2 = 7.5 - 1.969(3.6) + 0.088(45.6) = 4.424
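The two-predictor coefficients can be recomputed from the deviation sums. The sketch below is illustrative: the labels attached to the six sums (which sum is which) are inferred from the raw data for farmers A to J, and the formulas are the standard two-regressor least squares solution in deviation form.

```python
# Deviation sums (x1 = irrigation, x2 = age, y = yield); labels are inferred.
s_11 = 44.40      # sum of x1 squared
s_22 = 1346.40    # sum of x2 squared
s_yy = 296.50     # sum of y squared
s_1y = 108.00     # sum of x1 * y
s_2y = -580.00    # sum of x2 * y
s_12 = -234.60    # sum of x1 * x2

y_bar, x1_bar, x2_bar = 7.5, 3.6, 45.6

denominator = s_11 * s_22 - s_12 ** 2
b1 = (s_1y * s_22 - s_2y * s_12) / denominator
b2 = (s_2y * s_11 - s_1y * s_12) / denominator
a = y_bar - b1 * x1_bar - b2 * x2_bar

# Share of the variation in yield explained by both predictors together.
r_squared = (b1 * s_1y + b2 * s_2y) / s_yy

print(f"b1={b1:.6f} b2={b2:.5f} a={a:.3f} R2={r_squared:.3f}")
```

With unrounded coefficients the intercept comes out at about 4.40, the value reported in the SPSS coefficients table; the hand calculation gives 4.424 because it uses the rounded values 1.969 and -0.088.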
SPSS Multiple Regression Results
Regression
Variables Entered/Removed (b)
Model   Variables Entered                                           Variables Removed   Method
1       age of the farmer, frequency of irrigation per week (a)     .                   Enter
a All requested variables entered.
b Dependent Variable: yield (ton/ha)
R2=0.889
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .943 (a)   .889       .857                2.17054
a Predictors: (Constant), age of the farmer, frequency of irrigation per week
b Dependent Variable: yield (ton/ha)
What is the probability that the multiple correlation in the population = 0.0? p < 0.001
ANOVA (b)
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   263.521          2    131.761       27.967   .000 (a)
  Residual     32.979           7    4.711
  Total        296.500          9
a Predictors: (Constant), age of the farmer, frequency of irrigation per week
b Dependent Variable: yield (ton/ha)
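The summary statistics reported by SPSS follow directly from the ANOVA table. The sketch below recomputes them from the sums of squares, assuming n = 10 farmers and k = 2 predictors; the variable names are illustrative.

```python
import math

ss_regression = 263.521   # from the ANOVA table
ss_residual = 32.979
ss_total = ss_regression + ss_residual
n, k = 10, 2              # observations, predictors

ms_regression = ss_regression / k
ms_residual = ss_residual / (n - k - 1)
f_stat = ms_regression / ms_residual

r_squared = ss_regression / ss_total
# Adjusted R squared penalizes the extra predictor.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
std_error_estimate = math.sqrt(ms_residual)

print(f"F={f_stat:.3f} R2={r_squared:.3f} "
      f"adjR2={adj_r_squared:.3f} SEE={std_error_estimate:.3f}")
```

Adding age raises R squared only slightly relative to the irrigation-only model (0.886), while using up one more degree of freedom.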
Yield = 4.424 + 1.969 (irrigation) – 0.088 (age)
Coefficients (a)
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient)
Model                                B        Std. Error   Beta    t       Sig.
1 (Constant)                         4.400    13.639               .323    .756
  frequency of irrigation per week   1.970    1.156        .762    1.703   .132
  age of the farmer                  -0.088   .210         -.187   -.417   .689
a Dependent Variable: yield (ton/ha)
Are the two predictor variables significant? No
Are the residuals randomly distributed? No, they are heteroskedastic. The data may violate
one or more regression assumptions
[Scatter plot of the standardized residuals]