Introduction to Econometrics
Econometrics consists of the application of mathematical statistics to economic data to lend empirical
support to the models constructed by mathematical economics and to obtain numerical results.
Econometrics may be defined as the social science in which the tools of economic theory, mathematics,
and statistical inference are applied to the analysis of economic phenomena.
Econometrics is concerned with the empirical determination of economic laws.
Econometrics is based upon the development of statistical methods for estimating economic relationships,
testing economic theories, and evaluating and implementing government and business policy.
The most common application of econometrics is the forecasting of such important macroeconomic
variables as interest rates, inflation rates, and gross domestic product. Whereas forecasts of economic
indicators are highly visible and often widely published, econometric methods can be used in economic
areas that have nothing to do with macroeconomic forecasting.
Econometrics has evolved as a separate discipline from mathematical statistics because the former
focuses on the problems inherent in collecting and analyzing nonexperimental economic data.
Nonexperimental data are not accumulated through controlled experiments on individuals, firms, or
segments of the economy. (Nonexperimental data are sometimes called observational data, or
retrospective data, to emphasize the fact that the researcher is a passive collector of the data.)
Experimental data are often collected in laboratory environments in the natural sciences, but they are
much more difficult to obtain in the social sciences. Although some social experiments can be devised, it
is often impossible, prohibitively expensive, or morally repugnant to conduct the kinds of controlled
experiments that would be needed to address economic issues.
Economic model
wage = f(educ, exper, training)
where:
wage = hourly wage,
educ = years of formal education,
exper = years of workforce experience, and
training = weeks spent in job training.
Econometric model
wage = β0 + β1 educ + β2 exper + β3 training + u
where the term u contains factors such as “innate ability,” quality of education, family background, and
the myriad other factors that can influence a person’s wage. This term, also known as the error term or
disturbance, represents the difference between the actual value of the dependent variable and the value
predicted by the regression model; it captures the unexplained variation in the dependent variable and the
impact of factors not included in the model.
For the most part, econometric analysis begins by specifying an econometric model.
Once an econometric model has been specified, various hypotheses of interest can be stated.
An empirical analysis, by definition, requires data. After data on the relevant variables have been
collected, econometric methods are used to estimate the parameters in the econometric model and to
formally test hypotheses of interest. In some cases, the econometric model is used to make predictions in
either the testing of a theory or the study of a policy’s impact.
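As an illustration of these steps, the sketch below simulates data consistent with the wage equation above and estimates the β’s by ordinary least squares. It uses Python with the statsmodels package; the sample size, the “true” coefficient values, and the variable ranges are assumptions made purely for illustration, not values from the text.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
educ = rng.integers(8, 21, n)        # years of formal education
exper = rng.integers(0, 31, n)       # years of workforce experience
training = rng.integers(0, 53, n)    # weeks spent in job training

# Assumed "true" parameter values and a normally distributed disturbance u.
u = rng.normal(0, 2.0, n)
wage = 1.0 + 0.9 * educ + 0.2 * exper + 0.1 * training + u

X = sm.add_constant(np.column_stack([educ, exper, training]))
results = sm.OLS(wage, X).fit()
print(results.params)     # estimates of beta0, beta1, beta2, beta3
print(results.summary())  # t statistics for hypotheses such as beta3 = 0
```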
The Significance of the Stochastic Disturbance Term (Gujarati, pages 41-42)
The disturbance term ui is a surrogate for all those variables that are omitted from the model but that
collectively affect Y. The obvious question is: Why not introduce these variables into the model
explicitly? Stated otherwise, why not develop a multiple regression model with as many variables as
possible? The reasons are many.
1. Vagueness of theory:
The theory, if any, determining the behavior of Y may be, and often is, incomplete. We might know for
certain that weekly income X influences weekly consumption expenditure Y, but we might be ignorant or
unsure about the other variables affecting Y. Therefore, ui may be used as a substitute for all the excluded
or omitted variables from the model.
2. Unavailability of data:
Even if we know what some of the excluded variables are and therefore consider a multiple regression
rather than a simple regression, we may not have quantitative information about these variables. It is a
common experience in empirical analysis that the data we would ideally like to have often are not
available. For example, in principle we could introduce family wealth as an explanatory variable in
addition to the income variable to explain family consumption expenditure. But unfortunately,
information on family wealth generally is not available. Therefore, we may be forced to omit the wealth
variable from our model despite its great theoretical relevance in explaining consumption expenditure.
3. Core variables versus peripheral variables:
Assume in our consumption-income example that besides income X1, the number of children per family
X2, sex X3, religion X4, education X5, and geographical region X6 also affect consumption expenditure.
But it is quite possible that the joint influence of all or some of these variables may be so small and at best
nonsystematic or random that as a practical matter and for cost considerations it does not pay to introduce
them into the model explicitly. One hopes that their combined effect can be treated as a random variable
ui.
4. Intrinsic randomness in human behavior:
Even if we succeed in introducing all the relevant variables into the model, there is bound to be some
“intrinsic” randomness in individual Y’s that cannot be explained no matter how hard we try. The
disturbances, the u’s, may very well reflect this intrinsic randomness.
5. Poor proxy variables:
Although the classical regression model assumes that the variables Y and X are measured accurately, in
practice the data may be plagued by errors of measurement. Consider, for example, Friedman’s well-known
theory of the consumption function, which regards permanent consumption (Yp) as a function of
permanent income (Xp). Since data on these variables are not directly observable, in practice we use proxy
variables, such as current consumption (Y) and current income (X), which are observable. Because the
observed Y and X may not equal Yp and Xp, there is the problem of errors of measurement. The
disturbance term u may in this case then also represent the errors of measurement.
there are such errors of measurement, they can have serious implications for estimating the regression
coefficients, the β’s.
6. Principle of parsimony:
Following Occam’s razor, we would like to keep our regression model as simple as possible. If we can
explain the behavior of Y “substantially” with two or three explanatory variables and if our theory is not
strong enough to suggest what other variables might be included, why introduce more variables? Let ui
represent all other variables. Of course, we should not exclude relevant and important variables just to
keep the regression model simple.
The Coefficient of Determination r2: A Measure of “Goodness of Fit” (Gujarati page 73)
We consider the goodness of fit of the fitted regression line to a set of data; that is, we shall find out how
“well” the sample regression line fits the data. It is clear that if all the observations were to lie on the
regression line, we would obtain a “perfect” fit, but this is rarely the case. The coefficient of
determination r2 (two-variable case) or R2 (multiple regression) is a summary measure that tells how well
the sample regression line fits the data.
The quantity r2 is known as the (sample) coefficient of determination and is the most
commonly used measure of the goodness of fit of a regression line. Verbally, r2 measures the proportion
or percentage of the total variation in Y explained by the regression model.
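As a concrete illustration, the sketch below computes r2 as one minus the ratio of the residual sum of squares to the total sum of squares on simulated data (all values invented) and checks that it matches the r2 reported by a fitted regression in statsmodels.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum(fit.resid ** 2)        # residual (unexplained) sum of squares
r2 = 1 - rss / tss                  # proportion of variation in y explained
print(r2, fit.rsquared)             # the two numbers agree
```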
The Structure of Economic Data
Cross-Sectional Data
A cross-sectional data set consists of a sample of individuals, households, firms, cities, states, countries,
or a variety of other units, taken at a given point in time. Cross-section data are data on one or more
variables collected at the same point in time.
Time Series Data
A time series data set consists of observations on a variable or several variables over time. Examples of
time series data include stock prices, money supply, consumer price index, gross domestic product,
annual homicide rates, and automobile sales figures.
Pooled Cross Sections
Some data sets have both cross-sectional and time series features. Pooled, or combined, data contain
elements of both time series and cross-section data. For example, a researcher might collect data on
student test scores in different schools (a cross section) in the year before a policy is implemented, and
then again in the year after. By combining these two cross-sectional data sets, the researcher can analyze
the effect of the policy on test scores over time. Similarly, suppose we have data on 250 houses for 1993
and on 270 houses for 1995; pooling the two samples yields a pooled cross section.
Panel or Longitudinal Data
A panel data (or longitudinal data) set consists of a time series for each cross-sectional member in the
data set. This is a special type of pooled data in which the same cross-sectional unit (say, a family or a
firm) is surveyed over time. As an example, suppose we have wage, education, and employment history
for a set of individuals followed over a ten-year period. Or we might collect information, such as
investment and financial data, about the same set of firms over a five-year time period.
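To make these structures concrete, the sketch below lays out a tiny panel data set with the pandas library; the firm identifiers, years, and investment figures are invented for illustration.

```python
import pandas as pd

# Panel (longitudinal) data: the same firms observed in each of several years.
panel = pd.DataFrame({
    "firm_id":    [1, 1, 2, 2, 3, 3],
    "year":       [2019, 2020, 2019, 2020, 2019, 2020],
    "investment": [4.2, 4.8, 7.1, 6.9, 3.3, 3.9],
}).set_index(["firm_id", "year"])   # unit and time jointly identify each row
print(panel)

# A pooled cross section would instead stack independent samples drawn in
# different years, so a unit appearing in one year need not appear in the next.
```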
Key Assumptions of Linear Regression
1. Linearity
Assumption: The relationship between the independent and dependent variables is linear.
The first and foremost assumption of linear regression is that the relationship between the
predictor(s) and the response variable is linear. This means that a change in the independent
variable results in a proportional change in the dependent variable. This can be visually
assessed using scatter plots or residual plots.
If the relationship is not linear, the model may underfit the data, leading to inaccurate
predictions. In such cases, transformations of the data or the use of non-linear regression models
may be more appropriate.
Example:
Consider a dataset where the relationship between temperature and ice cream sales is being
studied. If sales increase non-linearly with temperature (e.g., significantly more sales at high
temperatures), a linear model may not capture this effect well. Two scenarios illustrate the point:
Linear relationship: an increase in temperature produces a roughly proportional increase in ice cream
sales.
Non-linear relationship: an increase in temperature produces a much larger increase in sales at high
temperatures than at low ones, indicating a non-linear relationship.
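A minimal sketch of such a check, using invented temperature and sales figures: a straight line is fitted to a quadratic relationship, and the plot of residuals against fitted values shows the systematic curvature that signals non-linearity.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
temp = rng.uniform(15, 40, 200)                       # temperature in degrees C
sales = 5 + 0.1 * temp ** 2 + rng.normal(0, 5, 200)   # sales rise faster when hot

fit = sm.OLS(sales, sm.add_constant(temp)).fit()      # straight-line fit

plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("A curved residual pattern signals non-linearity")
plt.show()
```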
2. Multivariate Normality - Normal Distribution
Multivariate normality is a key assumption for linear regression models when making
statistical inferences. Specifically, it means that the residuals (the differences between observed
and predicted values) should follow a normal distribution when considering multiple
predictors together. This assumption ensures that hypothesis tests, confidence intervals, and p-
values are valid. This assumption can be assessed by examining histograms or Q-Q plots of the
residuals, or through statistical tests such as the Kolmogorov-Smirnov test.
This assumption is crucial because it allows us to make valid inferences about the model's
parameters and the relationship between the dependent and independent variables.
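The sketch below, on simulated data, illustrates two common checks of this assumption: a Q-Q plot of the residuals and the Shapiro-Wilk normality test (used here in place of the Kolmogorov-Smirnov test mentioned above; either is common).

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 200)
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

sm.qqplot(resid, line="45", fit=True)   # points should hug the reference line
plt.show()

stat, p_value = stats.shapiro(resid)    # H0: residuals are normally distributed
print(p_value)                          # a small p-value casts doubt on normality
```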
3. Lack of Multicollinearity
Assumption: The independent variables are not highly correlated with each other.
Multicollinearity occurs when two or more independent variables in the model are highly
correlated, leading to redundancy in the information they provide. This can inflate the standard
errors of the coefficients, making it difficult to determine the effect of each independent variable.
It is essential that the independent variables not be too highly correlated with each other; when they are,
multicollinearity is present. This can be checked using:
Correlation matrices, where correlation coefficients should ideally be below 0.80.
The Variance Inflation Factor (VIF), with VIF values above 10 indicating problematic multicollinearity.
Solutions may include centering the data (subtracting the mean score from each observation) or removing
the variables causing the multicollinearity.
Example: In a model predicting health outcomes from multiple health metrics, including both blood
pressure and heart rate as predictors may lead to multicollinearity because the two are highly correlated.
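A minimal sketch of the VIF check, using invented blood pressure and heart rate values and the variance_inflation_factor function from statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
bp = rng.normal(120, 10, 300)              # blood pressure
hr = 0.6 * bp + rng.normal(0, 5, 300)      # heart rate tracks blood pressure closely
X = sm.add_constant(pd.DataFrame({"blood_pressure": bp, "heart_rate": hr}))

for i, name in enumerate(X.columns):
    if name != "const":                    # VIF of the constant is not informative
        print(name, variance_inflation_factor(X.values, i))
```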
4. Homoscedasticity of Residuals in Linear Regression
Homoscedasticity is one of the key assumptions of linear regression, which asserts that the
residuals (the differences between observed and predicted values) should have a constant
variance across all levels of the independent variable(s). In simpler terms, it means that the
spread of the errors should be relatively uniform, regardless of the value of the predictor.
When the residuals maintain constant variance, the model is said to be homoscedastic.
Conversely, when the variance of the residuals changes with the level of the independent
variable, we refer to this phenomenon as heteroscedasticity.
Heteroscedasticity can lead to several issues:
Inefficient Estimates: The estimates of the coefficients may not be the best linear unbiased
estimators (BLUE), meaning that they could be less accurate than they should be.
Impact on Hypothesis Testing: Standard errors can become biased, leading to unreliable
significance tests and confidence intervals.
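The sketch below applies the Breusch-Pagan test (discussed again under detection) to simulated data whose error spread widens with the predictor; all values are invented.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 300)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, 300) * x   # error spread widens with x

fit = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, fit.model.exog)
print(lm_pvalue)   # a small p-value points to heteroscedasticity
```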
5. Absence of Endogeneity
No endogeneity is an important assumption in the context of multiple linear regression. The
assumption of no endogeneity states that the independent variables in the regression model
should not be correlated with the error term. If this assumption is violated, it leads to biased and
inconsistent estimates of the regression coefficients.
Bias and Consistency: When endogeneity is present, the estimates of the regression coefficients
are biased, meaning they do not accurately reflect the true relationships between the variables.
Additionally, the estimates become inconsistent, which means they do not converge to the true
parameter values as the sample size increases.
Valid Inference: The assumption of no endogeneity is critical for conducting valid hypothesis
tests and creating reliable confidence intervals. If endogeneity exists, the statistical tests based on
these estimates may lead to incorrect conclusions.
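The simulation below, with invented numbers, illustrates the bias: the regressor is constructed to share an omitted factor with the error term, and the OLS slope estimate settles away from the true value even in a very large sample.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 100_000
omitted = rng.normal(0, 1, n)        # factor driving both the regressor and the error
x = omitted + rng.normal(0, 1, n)    # regressor correlated with the error term
u = omitted + rng.normal(0, 1, n)
y = 3.0 + 2.0 * x + u                # the true slope is 2.0

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params[1])                 # settles near 2.5, not 2.0, despite the huge sample
```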
Detecting Violations of Assumptions
It is crucial to assess whether the assumptions of linear regression are met before fitting a model.
Here are some common techniques to detect violations:
1. Residual Plots: Plotting the residuals against the fitted values or independent variables can help
visualize linearity, homoscedasticity, and independence of errors. Ideally, the residuals should
show no pattern, indicating a linear relationship and constant variance.
2. Q-Q Plots: A Quantile-Quantile plot can be used to assess the normality of residuals. If the
residuals lie close to the straight reference line in a Q-Q plot, they are approximately normally distributed.
3. Variance Inflation Factor (VIF): To check for multicollinearity, calculate the VIF for each
independent variable. A VIF value greater than 5 or 10 indicates significant multicollinearity.
4. Durbin-Watson Test: This statistical test helps detect the presence of autocorrelation in the
residuals. A value close to 2 indicates no autocorrelation, while values significantly less than or
greater than 2 indicate positive or negative autocorrelation, respectively (a short sketch follows this list).
5. Statistical Tests: Perform statistical tests like the Breusch-Pagan test for homoscedasticity and
the Shapiro-Wilk test for normality.
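As a sketch of the Durbin-Watson test in item 4, the snippet below constructs residuals with positive first-order autocorrelation and computes the statistic with statsmodels; all values are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
n = 200
x = np.arange(n, dtype=float)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal()   # AR(1) errors: positively autocorrelated
y = 1.0 + 0.5 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))            # well below 2, signalling positive autocorrelation
```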
Addressing Violations of Assumptions
If any of the assumptions are violated, there are various strategies to mitigate the issue:
Transformations: Apply transformations (e.g., logarithmic, square root) to the dependent
variable to address non-linearity and heteroscedasticity.
Adding Variables: If autocorrelation or omitted variable bias is suspected, consider adding
relevant predictors to the model.
Generalized Least Squares (GLS): This approach can be used when the residuals are
heteroscedastic or correlated.
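The sketch below illustrates two of these remedies on invented data: a log transformation of the dependent variable, and weighted least squares (a simple form of GLS) with weights set to the inverse of an assumed error variance.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(1, 10, 300)
X = sm.add_constant(x)

# Remedy 1: a log transformation when y grows multiplicatively with x.
y_mult = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.2, 300))
log_fit = sm.OLS(np.log(y_mult), X).fit()

# Remedy 2: weighted least squares when the error standard deviation is
# proportional to x, so the weights are 1 / x**2 (the inverse error variance).
y_het = 2.0 + 1.5 * x + rng.normal(0, 1.0, 300) * x
wls_fit = sm.WLS(y_het, X, weights=1.0 / x ** 2).fit()

print(log_fit.params, wls_fit.params)
```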