
SOKOINE UNIVERSITY OF AGRICULTURE

Department of Agricultural Economics and Agribusiness

Sokoine University of Agriculture, Morogoro, Tanzania

Lecture Notes (LM 01)

AEA 309 Econometrics

Prepared by
Khamaldin D. S. Mutabazi (PhD)

Philip, Goodluck
BSc. AEA
September, 2008

COURSE OUTLINE

1. Introduction to econometrics
 Definition of econometrics
 Statistical estimation and hypothesis testing

2. Econometric models


 Regression analysis (simple and multiple regression models)
 Linear and non-linear models
 Consequences of violating OLS assumptions and how to address the problems

3. Data management methods


 Time series
 Panel
 Cross-sectional

4. Practical applications
 Econometric software
 Data coding and entry (introduction)
 Data processing and interpretation of estimation results

1.0 INTRODUCTION

1.1. Definition of econometrics

Econometrics is a social science in which the tools of economic theory, mathematics, and
statistical inference are applied to the analysis of economic phenomena.

Economic theoreticians

Put forward testable hypotheses, which are qualitative in nature (e.g. the law of demand, the law of supply, the quantity theory of money, the inverse relationship between the marginal propensity to consume and income, and the positive relationship between the marginal propensity to save and income, etc.)

Mathematical economics

Expresses economic theory in the form of mathematical equations (to quantify the kind of testable relationships which are put in the form of testable hypotheses, econometricians make use of mathematical equations developed by mathematical economists).

Economic statistics

Collects, processes, and presents economic data in the form of charts, diagrams, and
tables.

What the Econometrician Does:

Exposes economic data collected by the statistical economist to empirical testing using
mathematical equations developed by the mathematical economist to give empirical
content to economic theory.

Why study econometrics?

Depending on the activity of the organization you join after graduation, you may be called upon to forecast (e.g. sales, interest rates, exchange rates, prices) or to build models (e.g. demand for and supply of agricultural products) which may be crucial for formulating new policies or evaluating existing ones.

1.2. Statistical estimation and hypothesis testing procedures

Schematic description of the steps involved in an econometric analysis (simplified flow):

Economic theory → Econometric model (combined with data) → Estimation → Specification testing and diagnostic checking. If the model fails the checks (No), it is respecified and re-estimated; if it passes (Yes), it proceeds to tests of any hypotheses and then to using the model for prediction and policy.

The following are the steps followed in an econometric study, with the ultimate aim of estimating parameters and testing hypotheses:

Step 1: Creating a testable hypothesis

Step 2: Collection of Data

Step 3: Specification of the mathematical model

 Assumes an exact or deterministic relationship

 Applicable to experimental data

Step 4: Specification of the econometric model

 Economic relationships are inexact in nature

 Differs from deterministic mathematical model

Step 5: Estimation of parameters

Step 6: Checking for model adequacy (conformity to theory or reality)

Step 7: Testing the hypothesis derived from the model

Step 8: Using the model for forecasting or predicting

1.2.1. Step 1: Creating Hypothesis

Microeconomic theory states that other things remaining the same (i.e. ceteris paribus),
an increase in the price of beef is expected to decrease the quantity demanded of beef i.e.
a negative relationship between the price and quantity demanded of beef. This is the law
of demand.

We want to test if the theory holds i.e. whether demand for beef and price of beef are
negatively related.

After the negative relationship is confirmed we want to calculate by how much demand
for beef decreases following a one TAS increase in the price of beef (i.e. we want to
calculate price elasticity of demand for beef).

1.2.2. Step 2: Data Collection

There are four types of data: time series, cross-sectional, pooled, and panel.

Time series data: data collected over a period of time.

Cross-sectional data: data collected on one or more variables at one point of time.

Pooled data: data that includes both time series and cross-sectional data.

Panel data: a special type of data in which data from the same cross-sectional unit is
collected over time.

Consider 10 smallholder cattle farms in Morogoro that produced beef (cattle off-take)
from 1990 to 2002, the data typology can be comprehended as in Table 1.

Table 1: Summary of typology of data (farms F1–F10 across the columns; years 1990–2002 down the rows)

 A single row (all 10 farms in one year, e.g. 1990) is cross-sectional data.
 A single column (one farm followed from 1990 to 2002) is time series data.
 The full grid (all farms over all years) is PANEL data.

1.2.3. Step 3: Specifying the Mathematical Model

Assume we want to see how the demand for beef behaves in relation to the price of beef, and to calculate the magnitude of change in the demand for beef following a unit change in the price of beef.

Step one: plot the data for these variables in a scatter diagram (i.e. price on the x axis and demand on the y axis). This gives a hint regarding the relationship between demand for beef and price of beef, and also whether a linear or non-linear mathematical model is appropriate.

Step two: Assuming that demand for beef and price of beef are linearly related, their relationship can be quantified using the following mathematical model:

DBEEF = b1 + b2PRICE

b1 and b2 are parameters of the linear function: b1 is the intercept of the model and b2 is the slope. The slope coefficient can be positive or negative; its sign determines the relationship between demand for beef and price of beef.

Step three: Superimpose a fitted straight line to examine the relationship between the two variables (i.e. to see whether all points on the scatter diagram lie on the straight line).

1.2.4. Step 4: Specifying the econometric model

The mathematical model assumes an exact or deterministic relationship between the two
variables. In reality, the relationships are inexact or statistical in nature that is why all
points on the scatter diagram do not lie exactly on the straight line estimated by the trend
line. The observed relationship between the two variables is imprecise for various reasons
such as the nature of the data collection (i.e. the data are non-experimentally collected)
and the omission of other forces that affect demand for beef (i.e. taste of consumers,
income of consumers, population, etc). How can this problem be tackled?

To allow the influence of all variables not explicitly introduced in the mathematical
model, let us introduce a random variable u to get a statistical or econometric model.

DBEEF = b1 + b2PRICE + u

DBEEF is called dependent variable

PRICE is independent variable and

u is the error term. It is equal to DBEEF − b1 − b2PRICE

1.2.5. Steps 5 & 6: Estimation of Parameters (i.e. b1 and b2) from the Selected
Model and Testing for model adequacy

A methodology called ordinary least squares (OLS) will be developed to compute the
parameters of the chosen model.

Suppose the following result was obtained using OLS: DBEEF = 102.28 + 0.12PRICE

According to the result obtained, if price increases by one unit, demand for beef is
expected to increase on average by 0.12 units. The value 102.3 implies that the average
value of demand for beef will be about 102.3 if the price of beef is zero.

Is the model we estimated, using beef price as the only explanatory variable, adequate? Yes or no? Note that the positive price coefficient contradicts the law of demand: if we accept this model, we are rejecting an established hypothesis. Is it a sin to reject an established hypothesis? How do we know this is not due to a problem of model misspecification?

What will happen to the relationship between the variables if we rewrite the model to
include other forces such as disposable income?

There are other factors that affect the demand for beef such as income. To see the
importance of the inclusion of this variable in the model, a multiple linear regression
model of the form given below was computed using OLS.

DBEEF = 37.5 – 0.88PRICE + 11.9INCOME

Which model do we choose? This one or the one with one explanatory variable? The
model chosen should be a representative of the reality. We will discuss in detail how one
can go about developing a model.
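The choice between the two specifications can be explored numerically. Below is a minimal sketch in Python with NumPy on synthetic data (the data set and the generating coefficients are invented for illustration; they are not the lecture's actual beef series):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
price = rng.uniform(40, 80, n)
income = rng.uniform(8, 14, n)
# Invented "true" relationship plus noise, echoing the multiple model's form
demand = 37.5 - 0.88 * price + 11.9 * income + rng.normal(0, 2, n)

# Simple model: DBEEF = b1 + b2*PRICE
X_simple = np.column_stack([np.ones(n), price])
b_simple, *_ = np.linalg.lstsq(X_simple, demand, rcond=None)

# Multiple model: DBEEF = b1 + b2*PRICE + b3*INCOME
X_multi = np.column_stack([np.ones(n), price, income])
b_multi, *_ = np.linalg.lstsq(X_multi, demand, rcond=None)

print(b_simple)  # price effect estimated with income omitted
print(b_multi)   # recovers coefficients close to those used to generate the data
```

Because price and income are generated independently here, the simple model's price coefficient stays near the true value; with real data, omitting a regressor that is correlated with price (such as income) can bias the estimated price effect, which is one reason model specification matters.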

1.2.6. Steps 7 & 8: Testing of hypotheses, prediction & forecasting

Here we test whether the estimated model makes economic sense, i.e. whether the results conform to the underlying economic theory.

In our example (i.e. the multiple linear regression model), the result is in conformity with the hypothesis.

Hypothesis testing goes beyond looking at the signs of the estimated model. It makes use of various tools to test more complicated hypotheses.

Prediction and Forecasting: In addition to testing hypothesis the estimated model may be
used for prediction or forecasting purposes (see Table 2).

Table 2: Example of a forecast

Year   Demand   Income   Price   Forecast
2002   103.3    10.97    61.1    114.022
2003   -        10.97    64      111.4613
2004   -        10.97    66      109.6953

2.0 Econometric models: Regressions

2.1. Population versus sample regression functions

Two functions: the population regression function (PRF) and the sample regression function (SRF).

PRF: Y = a0 + a1X

SRF: Ŷ = â0 + â1X

The SRF is an approximation of the PRF.

Error (ui): ui = Yi − Ŷi

Residual sum of squares: Σ(ui)² = Σ(Yi − â0 − â1Xi)²

Σ(ui)² = f(â0, â1)

2.2. The Linear Regression Model

2.2.1 Preamble

Empirically, we believe there is a causal link between class size and achievement
measured in terms of test score. As a result, you are asked to estimate the effect of a
change in class size on test score.

Basically, you are thinking of the following relationship in equation 1:

β1 = ΔTestScore / ΔClassSize    (1)

Equation (1) can be seen as the definition of a slope coefficient; thus a straight line relating test score to class size can be written as:

TestScore = β0 + β1 ClassSize    (2)

Where β0 is the intercept (the test score with a class size of 0!! In some applications, the intercept is not meaningful).

So in general, we believe there is a linear relationship between an independent variable (X) and a dependent variable (Y). This relationship holds on average, thus for each observation i there exists an error term (ui). Thus, we have the generic equation 3:

Yi = β0 + β1Xi + ui    (3)

Y = β0 + β1X is called the population regression line. β0 and β1 are the coefficients (parameters) to be estimated using the available data as depicted in Figure 1.

Figure 1: Observed points P1–P4 scattered around the population regression line Y = β0 + β1X; Q1–Q4 are the corresponding fitted points on the line, and u1 is the error for the first observation, whose systematic value is β0 + β1X1.

Each value of Y thus has a non-random (systematic) component, β0 + β1X, and a random component, u, which captures small and possibly unpredictable or non-measurable factors or influences. In Figure 1 the first observation has been decomposed into these two components.

2.2.2. Estimating the coefficients of the linear regression model

Consider the equation for OLS estimation as in equation 4:

Y = β0 + β1X + u    (4)

The parameters to be estimated are β0 and β1, whose estimates are denoted β̂0 and β̂1, respectively; the line Ŷ = β̂0 + β̂1X is then called the regression of Y on X, where Ŷ denotes a fitted value of Y, calculated from the equation of the line. In particular, the estimated values of the dependent variable corresponding to the observed values of the explanatory variable or regressor are as in equation 5:

Ŷi = β̂0 + β̂1Xi,  i = 1, …, n    (5)

The discrepancies between the observed and estimated values of the dependent variable are the regression residuals, denoted ei, as shown in equation 6 and the respective Figure 2:

ei = Yi − Ŷi,  i = 1, …, n    (6)

These can be thought of as the observable counterpart to the unobservable disturbances, ui.

Figure 2: Actual values (P1–P4) and fitted values (R1–R4) on the fitted line Ŷi = β̂0 + β̂1Xi; each residual ei = Yi − Ŷi is the vertical distance between an actual point and the fitted line, and b0 is the intercept.

The least squares method of estimation consists of choosing, for a given set of data, those estimates β̂0, β̂1 which minimize the sum of squared residuals, as shown in equation 7:

S = Σei² = Σ(Yi − β̂0 − β̂1Xi)²    (7)

To minimize S with respect to β̂0 and β̂1 we set the first partial derivatives with respect to β̂0 and β̂1 equal to zero, as shown in equations 8, 9, 10 and 11.
By applying the differentiation rule:

∂S/∂β̂0 = −2 Σ(Yi − β̂0 − β̂1Xi) = 0    (8)

∂S/∂β̂1 = −2 ΣXi(Yi − β̂0 − β̂1Xi) = 0    (9)

Rearranging gives what are known as the least squares normal equations:

ΣYi = nβ̂0 + β̂1 ΣXi    (10)

ΣXiYi = β̂0 ΣXi + β̂1 ΣXi²    (11)

We now solve equations 10 and 11 for β̂0 and β̂1. Dividing equation 10 by n gives equation 12:

Ȳ = β̂0 + β̂1X̄    (12)

Where X̄ and Ȳ are the respective sample means, and so we see that the regression line passes through the point of the sample means (X̄, Ȳ). Rearranging slightly, we get equation 13:

β̂0 = Ȳ − β̂1X̄    (13)

Substituting equation 13 into 11 gives equation 14 as follows:

ΣXiYi = (Ȳ − β̂1X̄) ΣXi + β̂1 ΣXi²    (14)

Solving for β̂1 in equation 14 we obtain the result shown in equation 15:

β̂1 = (ΣXiYi − nX̄Ȳ) / (ΣXi² − nX̄²) = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²    (15)

Having found β̂1 we can then calculate β̂0 from equation 13.
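Equations 13 and 15 translate directly into code. A minimal Python sketch (the function name is mine, not from the notes):

```python
import numpy as np

def ols_simple(x, y):
    """Closed-form OLS for one regressor (equations 13 and 15):
    slope = sum((x - xbar)*(y - ybar)) / sum((x - xbar)**2)
    intercept = ybar - slope * xbar
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    slope = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    intercept = ybar - slope * xbar
    return intercept, slope

# Check on points lying exactly on the line Y = 2 + 3X
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
intercept, slope = ols_simple(x, 2 + 3 * x)
print(intercept, slope)  # 2.0 3.0
```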

Some further implications of the least squares estimates can be obtained by re-examining the first-order conditions and normal equations. The first gives:

Σ(Yi − β̂0 − β̂1Xi) = 0

That is,

Σei = 0

So the residuals sum to zero and have sample mean, ē, equal to zero. Since Yi = Ŷi + ei, it follows that the means of the observed and fitted values of Y are equal:

(1/n) ΣYi = (1/n) ΣŶi = Ȳ

The second of the two first-order conditions gives:

ΣXi(Yi − β̂0 − β̂1Xi) = 0

That is,

ΣXiei = 0

This implies that the sample covariance, m, between the residuals and the explanatory variable is zero:

m = (1/n) Σ(Xi − X̄)(ei − ē) = (1/n) ΣXiei = 0

Since the deviation of the fitted value from its mean is given as

Ŷi − Ȳ = β̂1(Xi − X̄),  i = 1, …, n

it follows that the covariance between the residuals and the fitted values also is zero:

(1/n) Σ(Ŷi − Ȳ)ei = β̂1 (1/n) Σ(Xi − X̄)ei = 0

Thus OLS estimation splits the dependent variable Y into two components, namely an estimate Ŷ of the systematic part of Y, and a remainder or residual e, and these two components are uncorrelated.
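These by-construction properties (Σe = 0, ΣXe = 0, ΣŶe = 0) make a handy numerical sanity check on any OLS fit. A sketch using an arbitrary small data set:

```python
import numpy as np

x = np.array([2, 3, 0, 4, 5, 1, 6, 3, 7, 5], dtype=float)
y = np.array([2, 3, 2, 8, 10, 2, 15, 5, 18, 10], dtype=float)

# Closed-form OLS estimates
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
e = y - yhat

print(np.sum(e))                         # ~0: residuals sum to zero
print(np.sum(x * e))                     # ~0: zero covariance with the regressor
print(np.sum((yhat - yhat.mean()) * e))  # ~0: zero covariance with fitted values
```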

2.2.3. Properties of least squares estimates


So far no assumptions have been made about ui, and indeed none are needed to obtain the OLS estimates. However, the properties of the estimates vary according to the assumptions made; such assumptions provide the desirable properties of the estimates.

The classical assumptions of linear regression model are as follows:


A1. The model is linear in the parameters

A2. X is non-stochastic, that is, it contains no random part. This implies that:

cov(Xi, uj) = 0, for all i, j

This means the values taken by the regressor X are considered fixed in repeated samples. In this sense, the regression analysis is conditional on the given values of the regressor(s) X.

A3. The expected value of the errors is always zero:

E(ui | Xi) = 0, for all i

The disturbance term represents the non-systematic part, entailing unobservable influences affecting the outcome; it has, on average, a mean of zero given the value of X. Technically, the conditional mean value of ui is zero. Geometrically, this assumption can be pictured as in the following figure:

Figure: Conditional distribution of the disturbances ui. At each X value (X1, X2, X3) along the PRF, the density f(u) of the disturbances is centred on the line, so the positive (+ui) and negative (−ui) deviations have a conditional mean of zero.
A4. The residuals have constant variance:

var(ui | Xi) = σ², a constant, for all i

Thus each pair of observations is assumed to be equally reliable. If var(u1) < var(u2), then the likelihood is that (X1, Y1) lies nearer to the line Y = α + βX than does (X2, Y2), in which case we would want to put more emphasis on (X1, Y1) in the estimation procedure. If the constant variance assumption holds, the disturbances are said to be homoskedastic; if not, they are said to be heteroskedastic.

A5. Errors are not autocorrelated:

cov(ui, uj) = 0, for all i, j with i ≠ j

Where i and j are two different observations and cov means covariance. This assumption is critical in time series data. In a general context this assumption corresponds to a random sample.

The term serial correlation is often used as an alternative for autocorrelation. Lack of
serial correlation (zero correlation) is exhibited in Fig (c) as opposed to Figures (a) and
(b) with u’s positively and negatively correlated, respectively.

Figures (a)–(c): Scatter plots of successive disturbances (+ui/−ui on each axis): (a) positively correlated u's, (b) negatively correlated u's, (c) zero (serial) correlation.
Suppose in our PRF (Yt = α + βXt + ut) that ut and ut−1 are positively correlated. Then Yt depends not only on Xt but also on ut−1, for ut−1 to some extent determines ut. By invoking assumption 5, we are saying that we will consider the systematic effect, if any, of Xt on Yt and not worry about the other influences that might act on Y as a result of the possible intercorrelations among the u's.

A6. Zero covariance between ui and Xi, or:

cov(ui, Xi) = E[(ui − E(ui))(Xi − E(Xi))]
            = E[ui(Xi − E(Xi))]        since E(ui) = 0
            = E(uiXi) − E(Xi)E(ui)     since E(Xi) is non-stochastic
            = E(uiXi)                  since E(ui) = 0
            = 0                        by assumption

Basically, assumption 6 states that the disturbance u and the explanatory variable X are uncorrelated. The relevance of this assumption is that we assume that X and u (which represents the influence of all the omitted variables) have separate influences on Y. But if X and u are correlated, it is not possible to assess their individual effects on Y.

A7. The number of observations n must be greater than the number of parameters to be estimated.

Consider the hypothetical example in the following table. Imagine that we had only the first pair of observations on Y and X (4 and 1). From this single observation there is no way to estimate the two unknowns, β̂0 and β̂1.
Yi Xi
4 1
5 4
7 5
12 6

A8. The regressors exhibit variation.

That is, the values Xi, i = 1, …, n are not all the same. Our model is designed to explain the changes in Y that result from changes in X, but if no changes in X are observed then this effect clearly cannot be evaluated. If all the X's are identical, then Xi = X̄ and the denominator of equation 15 will be zero, making it impossible to estimate β̂1 and therefore β̂0.
A9. The regression model is correctly specified.

Alternatively, there is no specification bias or error in the model used in the empirical analysis. Important questions underlying model specification include: 1) what variables should be included in the model, 2) what is the functional form of the model, and 3) what probabilistic assumptions are made about the Yi, the Xi, and the ui entering the model. If these questions are not addressed properly, the validity of the estimation outcomes will be highly questionable.

A10. There is no perfect multicollinearity.


That is, there are no perfect linear relationships among the explanatory variables.

2.2.4. Properties of Least Squares Estimators: The Gauss-Markov Theorem


Given the assumptions of the classical linear regression model, least squares estimates
possess some ideal or optimum properties. These properties are contained in a well-
known Gauss-Markov Theorem.

The theorem states that in a linear model in which errors have expectation zero and are
uncorrelated and have equal variances, a best linear unbiased estimator (BLUE) of the
coefficients is given by the least-squares estimator.
An estimator, say the OLS estimator β̂2, is said to be BLUE of β2 if the following hold:

1. It is linear, that is, a linear function of a random variable, such as the dependent variable Y in the regression model.
2. It is unbiased, that is, its average or expected value, E(β̂2), is equal to the true value, β2.
3. It has minimum variance in the class of all such linear unbiased estimators; an unbiased estimator with least variance is known as an efficient estimator.

Consider the following figures:

Fig (a) shows the sampling distribution of the OLS estimator β̂2, that is, the distribution of the values taken by β̂2 in repeated sampling experiments. As the figure shows, the mean of the β̂2 values, E(β̂2), is equal to the true β2. In this situation, we say that β̂2 is an unbiased estimator of β2. In figure (b) the sampling distribution of β2*, an alternative estimator of β2 obtained using a non-OLS method, is shown. Assume that β2*, like β̂2, is unbiased, that is, its average or expected value is equal to β2. Assume further that both β̂2 and β2* are linear estimators, that is, they are linear functions of Y. An important question is: which estimator, β̂2 or β2*, would one choose? To answer this question, superimpose the two figures to obtain figure (c). It is obvious that although both β̂2 and β2* are unbiased, the distribution of the latter is more widespread around the mean value than that of the former. In other words, the variance of β2* is larger than the variance of β̂2. Now, given two estimators that are both linear and unbiased, one would choose the estimator with the smaller variance because it is more likely to be close to β2 than the alternative estimator. In short, one would choose the BLUE estimator.

̂ 2
E ( ˆ 2 )   2
(a) Sampling distribution of

 2*
E (  2* )   2
(b) Sampling distribution of

̂ 2

 2*

Philip, Goodluck 21
ˆ 2 ,  2*
BSc. AEA 2
(c) Sampling distribution of and
Philip, Goodluck 22
BSc. AEA
3.0 Practical computation of the parameters

3.1. Practical example


Case example: Consider 10 farmers, i = A, …, J, producing irrigated green maize during the off-season. We can say yield (ton/ha) is a function of the frequency of irrigation. We can ask ourselves some questions:

Farmer Irrigate Yield


A 2 2
B 3 3
C 0 2
D 4 8
E 5 10
F 1 2
G 6 15
H 3 5
I 7 18
J 5 10
Total 36 75

The mathematical equation:


Y = a + bX

Q1: Is the yield of maize related to the frequency of irrigation?


Q2: What is the direction and magnitude of the relationship?
Q3: When the frequency of irrigation increases by 1, by how much does the yield increase?

Tasks:
Scatter plot of the relationship between yield and frequency of irrigation

Yield is related to irrigation frequency (r = +0.94).
But what is the slope (b) of the 'best fit' straight line for these data?
The slope (b) estimates the expected change in yield (Y) for each unit increase in frequency of irrigation (X). The slope (b) of the 'best fit' line is also called the regression coefficient.

Equations for estimating the 'best fit' straight line:

b = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n]

a = Ȳ − bX̄

The intercept (a) of the 'best fit' line is also called the regression constant.

Nota Bene:
When the slope b = 0.0, the intercept a = Ȳ, the mean of Y. In other words, if X is not related to Y, your best guess about the value of Y is Ȳ.

Calculating the regression equation


Farmer Irrigate Yield X2 Y2 XY
(X) (Y)
A 2 2 4 4 4
B 3 3 9 9 9
C 0 2 0 4 0
D 4 8 16 64 32
E 5 10 25 100 50
F 1 2 1 4 2
G 6 15 36 225 90
H 3 5 9 25 15
I 7 18 49 324 126
J 5 10 25 100 50
Total 36 75 174 859 378

N=10 ∑X ∑Y ∑X2 ∑Y2 ∑XY

The slope of the line (b):

b = [378 − (36)(75) / 10] / [174 − (36)² / 10] = 108 / 44.4

b = 2.432 tons/ha per unit increase in irrigation frequency

The intercept of the line (a):

a = (75 / 10) − 2.432 (36 / 10) = −1.26

Regression equation:
Yield = −1.26 + 2.432 (irrigation frequency)

Interpretation
The best fit straight line intersects the Y-axis at −1.26.
When the irrigation frequency increases by 1, the yield increases by 2.432 tons/ha.
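The hand computation above can be replicated in a few lines; a Python/NumPy sketch with the same farmer data:

```python
import numpy as np

x = np.array([2, 3, 0, 4, 5, 1, 6, 3, 7, 5], dtype=float)     # irrigation frequency
y = np.array([2, 3, 2, 8, 10, 2, 15, 5, 18, 10], dtype=float)  # yield (ton/ha)
n = len(x)

# Same computing formulas as above
b = (np.sum(x * y) - x.sum() * y.sum() / n) / (np.sum(x ** 2) - x.sum() ** 2 / n)
a = y.mean() - b * x.mean()
print(round(b, 3), round(a, 2))  # 2.432 -1.26
```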

Making predictions with the regression equation

Yield = −1.26 + 2.432 (irrigation)

Farmer  Irrigate (X)  Yield (Y)  Prediction (Ŷ)  Error (e)  e²
A       2             2          +3.604          -1.604     2.573
B       3             3          +6.036          -3.036     9.217
C       0             2          -1.260          +3.260     10.628
D       4             8          +8.468          -0.468     0.219
E       5             10         +10.900         -0.900     0.810
F       1             2          +1.172          +0.828     0.686
G       6             15         +13.332         +1.668     2.782
H       3             5          +6.036          -1.036     1.073
I       7             18         +15.764         +2.236     5.000
J       5             10         +10.900         -0.900     0.810
Total   36            75                         0.0        33.80

Nota Bene
Σe = 0.0 (this will always be true)
Σe² = SSerror = 33.8; this is called the error or residual sum of squares
Partitioning of the Sums of Squares

In regression analysis, it is a common practice to partition the sum of squares into additive parts:

SStotal = SSregression + SSerror

Total sum of squares: SStotal = Σ(Y − Ȳ)²

Regression sum of squares: SSregression = Σ(Ŷ − Ȳ)²

Error or residual sum of squares: SSerror = Σ(Y − Ŷ)²

Figure: Scatter plot of yield against irrigation frequency, with the fitted line and the mean line Ȳ = 7.5 tons. For each point, the total deviation (Y − Ȳ) splits into a regression part (Ŷ − Ȳ) and a residual or error part (Y − Ŷ).
Calculations of sums of squares
Yield = −1.26 + 2.432 (irrigation)

Farmer  Irrigate (X)  Yield (Y)  Total   Regress.  Error e²
A       2             2          30.25   15.18     2.573
B       3             3          20.25   2.14      9.217
C       0             2          30.25   76.74     10.628
D       4             8          0.25    0.94      0.219
E       5             10         6.25    11.56     0.810
F       1             2          30.25   40.04     0.686
G       6             15         56.25   34.01     2.782
H       3             5          6.25    2.14      1.073
I       7             18         110.25  68.29     5.000
J       5             10         6.25    11.56     0.810
Total   36            75         296.5   262.7     33.80

SStotal = SSregression + SSerror: 296.5 = 262.7 + 33.8

The relationship between linear regression and Correlation


Coefficient of determination (r2):

r2 = SSregression / SStotal = 262.7 / 296.5 = 0.886

Alternatively,

r2 = (SStotal − SSerror) / SStotal = (296.5 − 33.8) / 296.5 = 0.886

The correlation coefficient (r):

r = √0.886 = 0.94
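The sum-of-squares partition and r2 can be verified numerically; a Python sketch continuing with the same data:

```python
import numpy as np

x = np.array([2, 3, 0, 4, 5, 1, 6, 3, 7, 5], dtype=float)
y = np.array([2, 3, 2, 8, 10, 2, 15, 5, 18, 10], dtype=float)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
yhat = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)   # total sum of squares
ss_reg = np.sum((yhat - y.mean()) ** 2)  # regression sum of squares
ss_err = np.sum((y - yhat) ** 2)         # error (residual) sum of squares
r2 = ss_reg / ss_total

print(ss_total, round(ss_reg, 1), round(ss_err, 1), round(r2, 3))
# 296.5 262.7 33.8 0.886
```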
3.2. Linear Regression and Analysis of Variance
In analysis of variance, the total sum of squares is partitioned into:
The between SS and

The within SS

In regression, the total sum of squares is partitioned into:


The regression SS and
The residual or error SS

Analysis of variance can be used to determine the significance of the regression equation.
QN: Is the regression SS explained by the regression equation significantly different from 0.0?

Analysis of Variance of the Results of the Regression Analysis

Source of Variation   SS      df           MS      F
Regression            262.7   (k) 1        262.7   62.18
Error                 33.8    (N − k − 1) 8  4.225
Total                 296.5   9

(k = number of independent variables)

Null hypothesis
In the population, the regression sum of squares is zero, i.e. irrigation frequency is not
related to yield, i.e. r = 0.00

F = 62.18 for 1 and 8 df is significant at p < 0.0001

Decision
The null hypothesis is rejected

Testing the Significance of the Regression Coefficient (b)

s² = SSerror / (N − k − 1) = variance estimate

k = number of independent variables in the equation

s² = 33.8 / (10 − 1 − 1) = 4.225

SE(b) = √(s² / Σ(X − X̄)²) = √(4.225 / 44.4) = 0.31

t = b / SE(b), or

t = 2.432 / 0.31 = 7.845

The critical value of t for (10 − 1 − 1) = 8 df at α = 0.05 is 1.860.

Since 7.845 > 1.860, p < 0.05 and the null hypothesis is rejected.

Nota Bene

With one predictor variable (X), t² = F.

From the ANOVA, F = 62.18 ≈ 62

From the t test, t = 7.845
(7.845)² = 61.54 ≈ 62

Both F and t test the same null hypothesis: i.e. there is no relationship between irrigation
frequency and green maize yield in the farming population
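Both test statistics can be computed directly from the sums of squares. A Python sketch (note the exact t here is 7.89; the hand calculation above gets 7.845 because it rounds SE(b) to 0.31):

```python
import numpy as np

x = np.array([2, 3, 0, 4, 5, 1, 6, 3, 7, 5], dtype=float)
y = np.array([2, 3, 2, 8, 10, 2, 15, 5, 18, 10], dtype=float)
n, k = len(x), 1

sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
a = y.mean() - b * x.mean()
e = y - (a + b * x)

ss_err = np.sum(e ** 2)
ss_reg = np.sum((y - y.mean()) ** 2) - ss_err
ms_err = ss_err / (n - k - 1)   # the variance estimate s^2

F = (ss_reg / k) / ms_err       # ANOVA F statistic
t = b / np.sqrt(ms_err / sxx)   # t statistic for the slope

print(round(F, 2), round(t, 2), round(t ** 2, 2))  # 62.18 7.89 62.18
```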

3.3. 95% confidence interval of the regression coefficient


Q: In generalizing the regression coefficient (b = 2.432) to the population, we do not expect to be exactly right. The statistical question is: how far off will the generalization be?

The confidence interval of b:

b ± t × SE(b)

where t = the critical value of t for two-tailed α (α/2) at df = (N − 1 − 1).

In this example, the critical value of t for α/2 = (0.05 / 2) = 0.025 and df = 8 is t = 2.306.

Therefore:

2.432 ± (2.306)(0.31) = 1.7 to 3.2 tons

In generalizing the relationship between irrigation frequency and green maize yield, we are 95% confident that the population parameter β falls between 1.7 and 3.2 tons.

Population: β = ? (between 1.7 and 3.2)

Sample: b = 2.432
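The interval can be reproduced in code; a sketch that hard-codes the tabulated critical value t = 2.306 used above (exact arithmetic gives roughly 1.72 to 3.14, matching the rounded 1.7 to 3.2):

```python
import numpy as np

x = np.array([2, 3, 0, 4, 5, 1, 6, 3, 7, 5], dtype=float)
y = np.array([2, 3, 2, 8, 10, 2, 15, 5, 18, 10], dtype=float)
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
a = y.mean() - b * x.mean()
e = y - (a + b * x)
se_b = np.sqrt((np.sum(e ** 2) / (n - 2)) / sxx)  # standard error of b

t_crit = 2.306  # two-tailed 5% critical value for df = 8, from a t table
lo, hi = b - t_crit * se_b, b + t_crit * se_b
print(round(lo, 2), round(hi, 2))  # 1.72 3.14
```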

3.4. Estimation by SPSS Program


Another Example (handle it using SPSS)

Farmer Age Yield
(X) (Y)
A 60 2
B 50 3
C 65 2
D 40 8
E 36 10
F 55 2
G 38 15
H 48 5
I 25 18
J 39 10
Total 456 75

Q: What is the relationship between the age of the farmer and the yield of green maize?

Q What is the correlation between age and green maize yield (r)?

Q How much of the variance in green maize yield is accounted for by age (r2)?

Q How much of the variance in green maize yield is not accounted for by age (1 - r2)?

Q When the age of a farmer increases by one year, by how much does their yield
increase or decrease?

Q What is the probability that there is no relationship in the population of farmers


growing irrigated green maize between age and yield?

Problem 1 – What does the scatterplot of green maize yield as a function of the farmer's age look like?

Problem 2 – The regression of yield on age

Regression

Variables Entered/Removed(b)

Model  Variables Entered     Variables Removed  Method
1      age of the farmer(a)  .                  Enter

a All requested variables entered.
b Dependent Variable: yield (ton/ha)

The correlation r = 0.918
The coefficient of determination r2 = 0.843

Model Summary

Model  R        R Square  Adjusted R Square  Std. Error of the Estimate
1      .918(a)  .843      .823               2.41476

a Predictors: (Constant), age of the farmer

The analysis of variance

What is the probability that the correlation in the population is 0.0? The probability < 0.001 (df = 1 & 8, F = 42.848, p < 0.001).

ANOVA(b)

Model          Sum of Squares  df  Mean Square  F       Sig.
1  Regression  249.851         1   249.851      42.848  .000(a)
   Residual    46.649          8   5.831
   Total       296.500         9

a Predictors: (Constant), age of the farmer
b Dependent Variable: yield (ton/ha)

Linear regression equation

Yield = 27.143 − 0.431 (Age)

Coefficients(a)

Model                 Unstandardized B  Std. Error  Standardized Beta  t       Sig.
1  (Constant)         27.143            3.097                          8.766   .000
   age of the farmer  -0.431            .066        -.918              -6.546  .000

a Dependent Variable: yield (ton/ha)

What is the probability that the regression coefficient β in the population is 0.0?

t = −6.546, df = 8, p < 0.001

Summary statistics on the errors of prediction, the residuals

Residuals Statistics(a)

                      Minimum   Maximum   Mean     Std. Deviation  N
Predicted Value       -.8571    16.3740   7.5000   5.26890         10
Residual              -2.6046   4.2261    .0000    2.27666         10
Std. Predicted Value  -1.586    1.684     .000     1.000           10
Std. Residual         -1.079    1.750     .000     .943            10

a Dependent Variable: yield (ton/ha)

Graph of the regression of yield as a function of age

Figure: Scatterplot of yield (ton/ha) against age of the farmer, with the fitted line yield (ton/ha) = 27.14 + (−0.43) × age and R-Square = 0.84.

r = 0.918, r2 = 0.843

(1-r2) = 0.157

F = 42.848, p < 0.001
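The key numbers in the SPSS output can be reproduced from the raw data; a NumPy sketch:

```python
import numpy as np

age = np.array([60, 50, 65, 40, 36, 55, 38, 48, 25, 39], dtype=float)
yld = np.array([2, 3, 2, 8, 10, 2, 15, 5, 18, 10], dtype=float)
n = len(age)

sxx = np.sum((age - age.mean()) ** 2)
b = np.sum((age - age.mean()) * (yld - yld.mean())) / sxx
a = yld.mean() - b * age.mean()

e = yld - (a + b * age)
ss_err = np.sum(e ** 2)
ss_tot = np.sum((yld - yld.mean()) ** 2)
r2 = 1 - ss_err / ss_tot
F = (ss_tot - ss_err) / (ss_err / (n - 2))

print(round(a, 3), round(b, 3), round(r2, 3), round(F, 3))
# 27.143 -0.431 0.843 42.848
```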

Residual analysis: Determining the Goodness of Fit

How well does the regression model fit the data?

Q: Is the correlation r significantly different from 0.0? Yes, p < 0.001

Q: If significant, how much of the variance in Y can be accounted for by X, i.e. the coefficient of determination? r2 = 0.843, or 84%

Q: How much of the variance in Y cannot be accounted for by X, i.e. the coefficient of non-determination? 1 - r2 = 0.157, or 16%

Q: Are the prediction errors distributed randomly?


A residual (an error) is the difference between a prediction (Ŷ) and the actual value of the dependent variable Y:

Residual (e) = (Y − Ŷ)

If the data fit the assumptions of the regression model, then the residuals will be randomly distributed.

How to test whether the residuals are random

Histogram of the residuals (e)


Normal probability plots of the residuals (e)
Plot of the residuals (e) against the predictions (Ŷ)

Histogram of the residuals with a normal curve overlay

Interpretation
Not perfectly normally distributed
Roughly symmetric
Not too bad

Interpretation
Confirms results in the histogram
Scatter plot of the residuals against the predicted values of Y (Ŷ)



Figure: Scatterplot of the residuals against the predicted values of Y. The line fitted through this plot is residuals = 0.00 + 0.00 × predicted Y (R-Square = 0.00).

The residuals should be distributed about equally on either side of 0.0


This condition is known as homogeneity of the residuals

Plotting Standardized Residuals and Standardized Predictions


Standardizing the residuals and the predictions and graphing them in a scatterplot is helpful
in identifying outliers. Outliers are cases which may have an undue influence on the
estimation of the regression constant (a) and the regression coefficient (b)

To standardize a residual (e) or a prediction (Ŷ) is to convert it into a z-score. The z-score for
an item indicates how far, and in what direction, that item deviates from its distribution's
mean, expressed in units of its distribution's standard deviation. The mathematics of the z-
score transformation are such that if every item in a distribution is converted to its z-score,
the transformed scores will necessarily have a mean of zero and a standard deviation of one.
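A minimal Python sketch of the z-score transformation (the residual values below are illustrative, not taken from the SPSS output):

```python
# Z-score transformation: after standardizing, the values have
# mean 0 and standard deviation 1.
def zscores(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

# Illustrative residuals (hypothetical values, for demonstration only)
residuals = [0.66, -2.64, 2.81, -1.94, -0.12, -0.69, 1.2, -1.5, 1.5, 0.72]
z = zscores(residuals)

mean_z = sum(z) / len(z)                         # ~0 after standardizing
sd_z = (sum(v ** 2 for v in z) / len(z)) ** 0.5  # ~1 after standardizing
print(round(mean_z, 6), round(sd_z, 6))
```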

In SPSS, standardized residuals and predictions can be saved in the regression analysis. They
are called zre_1 and zpr_1

Possible outliers

Cases that produce standardized residuals beyond +/- 1.30 should be examined as possible outliers

[Figure: scatterplot of standardized residuals against standardized predicted values; the fitted line is Standardized Residual = 0.00 + 0.00 × zpr_1, with R-square = 0.00]

Multiple linear regression

Farmer   Irrigation (X1)   Age (X2)   Yield (Y)
A        2                 60         2
B        3                 50         3
C        0                 65         2
D        4                 40         8
E        5                 36         10
F        1                 55         2
G        6                 38         15
H        3                 48         5
I        7                 25         18
J        5                 39         10
Total    36                456        75

Y = a + b1X1 + b2X2 + … + bkXk

Multiple linear regression involves more than one predictor variable (Xk)

Regression with one predictor variable geometrically involves fitting a “best fit” straight line
to a scatterplot of the data

With two predictors, it involves fitting a “best fit” plane to the data


Green maize yield as a function of irrigation frequency and age of the farmer

Y = a +b1(irrigation) + b2(age)

Irrigation = X1; Age = X2; Yield = Y

Inter-correlation matrix

                   Irrigation (X1)   Age (X2)   Yield (Y)
Irrigation (X1)    1.00              -0.960     +0.941
Age (X2)                             1.00       -0.918
Yield (Y)                                       1.00

R² = [0.941² + (−0.918)² − 2(0.941)(−0.918)(−0.960)] / [1 − (−0.960)²]

R² = 0.8882 ≈ 0.888

R = 0.943
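This calculation can be checked with a few lines of Python, using the rounded correlations from the matrix above (so R comes out as 0.942 here, versus the 0.943 SPSS reports from full-precision correlations):

```python
# R-squared of Y on X1 and X2 from the three pairwise correlations:
# R^2 = (r_y1^2 + r_y2^2 - 2*r_y1*r_y2*r_12) / (1 - r_12^2)
r_y1 = 0.941    # yield with irrigation
r_y2 = -0.918   # yield with age
r_12 = -0.960   # irrigation with age

r2 = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
print(round(r2, 3))         # 0.888
print(round(r2 ** 0.5, 3))  # 0.942 (0.943 in SPSS, from unrounded correlations)
```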

Interpretation
The multiple correlation of yield with irrigation and age is 0.943
89% of the variance in yield is accounted for by the combination of irrigation and age
11% of the variability in yield is accounted for by factors other than irrigation and
age

Calculation of the regression coefficients b1 and b2

Deviation score sums (deviations from the means):

For irrigation: Σx1² = 44.40, Σx1y = 108.00
For age: Σx2² = 1346.40, Σx2y = −580.00
Cross-product and total: Σx1x2 = −234.60, Σy² = 296.50

For irrigation:

b1 = [(Σx1y)(Σx2²) − (Σx2y)(Σx1x2)] / [(Σx1²)(Σx2²) − (Σx1x2)²]
   = [(108)(1346.4) − (−580)(−234.6)] / [(44.4)(1346.4) − (−234.6)²]
   = 1.9699 ≈ 1.969

For age:

b2 = [(Σx2y)(Σx1²) − (Σx1y)(Σx1x2)] / [(Σx1²)(Σx2²) − (Σx1x2)²]
   = [(−580)(44.4) − (108)(−234.6)] / [(44.4)(1346.4) − (−234.6)²]
   = −0.08754 ≈ −0.088
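The arithmetic above can be verified with a short Python sketch using the deviation-score sums as given:

```python
# Verifying b1 and b2 from the deviation-score sums.
s_x1x1 = 44.40     # sum of squared X1 deviations
s_x2x2 = 1346.40   # sum of squared X2 deviations
s_x1y  = 108.00    # sum of X1*Y cross-deviations
s_x2y  = -580.00   # sum of X2*Y cross-deviations
s_x1x2 = -234.60   # sum of X1*X2 cross-deviations

denom = s_x1x1 * s_x2x2 - s_x1x2 ** 2
b1 = (s_x1y * s_x2x2 - s_x2y * s_x1x2) / denom   # coefficient for irrigation
b2 = (s_x2y * s_x1x1 - s_x1y * s_x1x2) / denom   # coefficient for age
print(round(b1, 3), round(b2, 4))                # 1.97 -0.0875
```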

Calculation of the intercept

Y = a + b1 (irrigation) + b2 (age)

Ȳ = ΣY/N = 75/10 = 7.5
X̄1 = ΣX1/N = 36/10 = 3.6
X̄2 = ΣX2/N = 456/10 = 45.6

a = Ȳ − b1X̄1 − b2X̄2
  = 7.5 − 1.969 × 3.6 − (−0.088 × 45.6)
  = 4.4244

Yield as a function of irrigation and age of the farmer

Yield = 4.424 + 1.969 (irrigation) – 0.088 (age)
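A short Python sketch computing the intercept from the means and then using the fitted equation for prediction (the sample farmer at the end is hypothetical):

```python
# Intercept from the means, then a prediction with the fitted equation.
b1, b2 = 1.969, -0.088            # rounded coefficients from the notes
x1_bar, x2_bar, y_bar = 3.6, 45.6, 7.5

a = y_bar - b1 * x1_bar - b2 * x2_bar
print(round(a, 4))                # 4.4244

def predict(irrigation, age):
    return a + b1 * irrigation + b2 * age

# Hypothetical case: a 40-year-old farmer irrigating 4 times per week
print(round(predict(4, 40), 2))   # 8.78
```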



Interpretation of the regression equation

R = 0.943, R² = 0.888

The multiple correlation of yield with irrigation and age is 0.943
The percentage of the variance in yield accounted for by irrigation and age is 88.8%;
the percentage not accounted for is 11.2%
When irrigation frequency increases by one, yield increases by 1.969 tons, holding the
age of the farmer constant
When the age of the farmer increases by one year, yield decreases by 0.088 tons,
holding irrigation frequency constant

SPSS Multiple Regression Results

Regression

Variables Entered/Removed(b)

Model   Variables Entered                                        Variables Removed   Method
1       age of the farmer, frequency of irrigation per week(a)   .                   Enter

a All requested variables entered.
b Dependent Variable: yield (ton/ha)

Multiple correlation R=0.943

R2=0.889

Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .943(a)   .889       .857                2.17054

a Predictors: (Constant), age of the farmer, frequency of irrigation per week
b Dependent Variable: yield (ton/ha)

What is the probability that the multiple correlation in the population = 0.0? p <
0.001

ANOVA(b)

Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   263.521          2    131.761       27.967   .000(a)
   Residual     32.979           7    4.711
   Total        296.500          9

a Predictors: (Constant), age of the farmer, frequency of irrigation per week
b Dependent Variable: yield (ton/ha)
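The F statistic and R-squared can be recovered from the sums of squares in this table (a minimal Python sketch):

```python
# F and R-squared recovered from the ANOVA sums of squares.
ss_reg, ss_res = 263.521, 32.979
df_reg, df_res = 2, 7             # k predictors; n - k - 1 residual df

ms_reg = ss_reg / df_reg          # mean square regression = 131.761
ms_res = ss_res / df_res          # mean square residual ~ 4.711
f = ms_reg / ms_res
r2 = ss_reg / (ss_reg + ss_res)   # SS_regression / SS_total

print(round(f, 3), round(r2, 3))  # 27.967 0.889
```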

Yield = 4.424 + 1.969 (irrigation) – 0.088 (age)

Coefficients(a)

                                      Unstandardized Coefficients   Standardized Coefficients
Model                                 B        Std. Error           Beta       t        Sig.
1  (Constant)                         4.400    13.639                          .323     .756
   frequency of irrigation per week   1.970    1.156                .762       1.703    .132
   age of the farmer                  -0.088   .210                 -.187      -.417    .689

a Dependent Variable: yield (ton/ha)

Are the two predictor variables individually significant? No (p = .132 and p = .689)

Are the residuals randomly distributed? No, they are heteroskedastic. The data may violate
one or more regression assumptions


[Figure: scatterplot of standardized residuals against standardized predicted values for the multiple regression; the fitted line is Standardized Residual = 0.00 + 0.00 × zpr_2, with R-square = 0.00]
