The notion of ceteris paribus, which means “other (relevant) factors being
equal”, plays an important role in causal analysis.
CHAPTER 2: Simple regression model
2.1. Definition of simple regression model
“Explain y in terms of x”:
y = β0 + β1x + u
Where:
1. x and y: the explanatory (independent) variable and the dependent variable
2. β1: slope parameter: the relationship between x and y, holding the other
factors in u fixed
3. β0: intercept parameter
The variable u, called the error term or disturbance in the relationship,
represents factors other than x that affect y. A simple regression analysis
effectively treats all factors affecting y other than x as being unobserved. You
can usefully think of u as standing for “unobserved.”
Population regression function (PRF)
4. The regression analysis is based on the entire population
5. E(y|x) is a linear function of x. Linearity means that a one-unit increase
in x changes the expected value of y by the amount β1
If the population regression function has the form E(Y|X) = β1 + β2X
(here β1 and β2 play the roles of β0 and β1 above),
then β1 = E(Y|X = 0): intercept term (INPT)
β2 = ΔE(Y|X) / ΔX: slope coefficient
The PRF shows the relationship between the dependent variable and the
explanatory variable on average in the population.
SESSION 1 (6/5/2021)
1. Write the sample regression model
Y^ = 3.108 + 0.305X1 − 0.003X2 − 0.127X3 + 0.107X4 − 0.092X5
SESSION 2 (10/5/2021) : Introduction
1. DEFINITION
Econometrics is based on the development of statistical methods for
estimating economic relationships, testing economic theories, and evaluating
government and business policies
• What is econometrics for?
– Quantifying relationships among economic variables
– Empirically testing economic theories: law of demand, money supply and
inflation
– Evaluating the impact of a change in one variable on another variable:
measuring the Return to Education, effect of the Minimum Wage on
Unemployment
– Forecasting (demand for goods, stock/gold prices,...)
2. ANALYSIS STEPS
• Steps in empirical economic analysis
1. Question of interest
2. Economic model
3. Econometric model
4. Data collection
5. Estimation of econometric model
6. Diagnosing model problems (examples: multicollinearity; heteroskedasticity;
normality)
7. Hypotheses postulated
8. Result analysis and policy implications
Step 1. Question of interest based on economic theories
• The relationship among variables in theory
• Example: Keynes’s theory states that the consumption of households has a
positive relationship with their income
Step 2. Set up mathematical model
Keynes’ theory in Step 1 can be modeled as follows:
C = β0 + β1I; β1 > 0 (1)
In which
• C: Consumption of the households
• I: Income of the households
• β0, β1: Parameters/coefficients
– β0: Constant / intercept parameter
– β1: Slope parameter
Step 3. Set up econometric model
• The mathematical model in Step 2 describes an exact relationship between
the consumption and income of households.
• But relationships among economic variables are in general not exact.
• For example, besides income, other variables can affect household
consumption: the number of family members, the age of the household head, ...
• To measure inexact relationships between variables, we use an econometric model:
Ci = β0 + β1Ii + ui (2)
• ui: error term (or disturbance); it represents factors other than income
that can affect household consumption
• The choice of variables to include in the model is based on economic theory
and available data.
*The reasons for the existence of the error term:
• Researchers cannot know all the factors that affect the dependent variable Y.
• Even if they knew all the factors, it would be impossible to get data on all of them.
• The model becomes very complicated if we include all the variables.
Step 4. Data collection
• Primary vs secondary data
• The Structure of Economic Data
– Cross-sectional Data
– Time-series Data
– Pooled Data
• Pooled cross-sectional data
• Panel Data
A. Cross-sectional data
• A cross-sectional data set consists of a sample of individuals, households,
firms, cities... taken at a given point of time.
• These data are obtained by random sampling from the underlying population.
• Sometimes, different variables can correspond to different time periods in
cross-sectional data sets.
B. Time-series data
A time-series data set consists of observations on a variable or several variables
over time (stock price, consumer price index, GDP...).
– The chronological ordering of observations conveys important information.
– Economic observations can rarely, if ever, be assumed to be independent
across time.
Example: a time-series data set on the consumption and income of one person
C. Pooled cross-sectional data
• Some data sets have both cross-sectional and time-series features.
Purpose: to increase the number of observations
-> The estimation is more precise
D. Panel data
• A panel data set consists of a time series for each cross-sectional member in
the data set.
• The same cross-sectional units are followed over a given time period.
A panel data set on provinces’ characteristics
Panel ID  Province         YEAR  FDI     ODA      POPU    IZ  MOUNTAIN
1         An Giang         2004  145     40.61    2170.1  0   0
1         An Giang         2005  139     41.51    2194    0   0
1         An Giang         2006  140     30.60    2210.4  0   0
2         Ba Ria Vung Tau  2004  64776   1220.01  897.6   7   0
2         Ba Ria Vung Tau  2005  71441   157.99   913.1   7   0
2         Ba Ria Vung Tau  2006  106618  11.55    926.3   7   0
....      ....             ....  ....    ....     ....    ....  ....
63        Vinh Phuc        2004  7340    5.24     1154.8  2   0
63        Vinh Phuc        2005  9340    7.36     1169    2   0
63        Vinh Phuc        2006  12776   27.73    1180.4  3   0
64        Yen Bai          2004  96      3.04     723.5   0   1
64        Yen Bai          2005  103     6.13     731.8   0   1
64        Yen Bai          2006  113     9.80     740.7   0   1
• The quality of data depends on:
– Errors in data collection
– Sampling methods
• Data sources: (the source of the data must be stated when doing analysis)
– Experimental data
– Available data
Step 5. Estimate parameters of the model
Data -> Stata, Eviews, SPSS -> estimate the parameters of model (2)
β̂0 = −184.08 and β̂1 = 0.7064
Ĉi = −184.08 + 0.7064Ii (3)
The “hat” above C shows that this is an estimated (fitted) value of the
variable.
Slope parameter = 0.7064: if income increases by 1 billion USD, consumption
is expected to increase by about 706 million USD.
Step 6: Diagnose problems of the model
• To test whether the assumptions of the model are violated
- Multicollinearity
- Heteroskedasticity
- Normality of u
Step 7: Test hypotheses
• To test the appropriateness of the model and the estimated parameters
• Tests: Fisher (F), Durbin-Watson, Lagrange multiplier, Hausman, ...
Step 8: Analyze the estimated results and Forecasting / policy implications
To see whether the estimated results are consistent with / supportive of the theories.
If the model is appropriate and the estimated results are consistent with the
theories -> provide policy implications
Step 9: Forecasting
Lecture 2: The Linear Regression Model 1
1. Introduction to regression model
• The term “regression” comes from “regression to mediocrity” (regression
means returning to the mean value)
The regression line is the line that connects the mean points
• Introduced by Galton (1886) when he studied the relationship between the height
of sons and the height of fathers
Distribution of the height of sons with respect to the height of the fathers
The study shows that:
• Given the height of fathers, the height of sons is distributed around a mean
value
• On average, when the height of fathers increases, the height of sons also
increases
The positive effect is the most important thing we have to know
• If we connect all the mean points, we get a straight line
• This line is called the regression line; it shows the relationship between the
height of sons and the height of fathers on average
2. Population Regression Function (PRF) and Sample Regression Function
(SRF)
2.1. Definition of PRF
PRF is a regression function that is constructed based on the survey of the
population
For example: Galton studied the relationship between the height of fathers and
the height of sons in one city. He collected the data of all fathers having adult
sons, so he could build the PRF.
So, E(Y|Xi) is a function of the independent variable Xi:
E(Y|Xi) = f(Xi) = β0 + β1Xi [1]
The conditional expected value of Y is the mean value of Y given X: the
regression line passes through these mean values, so the dependent side of the
PRF is the expectation, because the expectation equals the mean value.
• Equation [1] is called the Population Regression Function (PRF).
– The PRF shows how the expected value of Y changes at different values of X
– If the PRF has 1 independent variable -> simple regression function
– If the PRF has 2 or more independent variables -> multiple regression function
• Suppose that the PRF E(Y|Xi) is a linear function, then:
E(Y|Xi) = β0 + β1Xi [2]
- β0, β1: regression coefficients/parameters
• β0: constant coefficient
• β1: slope coefficient
• Equation [2] is a simple regression function
2.2. Error/ disturbance term
• Because E(Y|Xi) is the expected value of Y given Xi, individual values Yi are
not necessarily equal to E(Y|Xi), but they lie around E(Y|Xi).
• Let ui be the difference between Yi and E(Y|Xi) (the distance from the actual
value of observation i to the expected value, i.e. the mean value); we have:
ui = Yi − E(Y|Xi) [3]
Or:
Yi = E(Y|Xi) + ui [4]
-> ui is a random variable/component, or disturbance
2.3. Sample regression function (SRF)
• In reality, we usually cannot survey the whole population -> we cannot build the PRF
• We can then only estimate the expected value of Y, or in other words, estimate
the PRF based on sample(s) taken from the population
• Obviously the estimated SRF cannot be perfectly exact
The regression function that is constructed based on a sample is called the
Sample Regression Function (SRF).
Graph 2.03. Scatter graph and regression lines of the two samples, SRF1 and
SRF2
• From the population, we can draw many samples. Each sample gives its own
SRF
• To obtain the “best” SRF, meaning the SRF that is the closest estimate of the
PRF, we have to rely on some criteria even though we do not have the PRF
to compare with.
Study some techniques for this.
2.3. Sample Regression Function (SRF) – Simple (1 X)
Ŷi = β̂0 + β̂1Xi, and Yi = Ŷi + ûi
Ŷi is an estimate of E(Y|Xi) and is the fitted/predicted value of Yi
β̂0, β̂1 are estimates of β0, β1
ûi is an estimate of ui and is called the residual
2.3. Sample Regression Function (SRF) – Multiple (2 or more X)
Yi = β̂0 + β̂1Xi1 + β̂2Xi2 + ... + β̂kXik + ûi
Where: k is the number of independent variables X
i is the index of the observation
Yi: the actual value of observation i
ûi: the residual
3. The Ordinary Least Squares (OLS)
• The OLS method is attributed to the German mathematician Carl Friedrich
Gauss.
• It is used to estimate parameters given some assumptions.
• The estimators have some properties (linearity, unbiasedness, and efficiency).
• It is the most widely used estimation method today.
3.1. The Ordinary Least Squares (OLS)
• Assume that:
The PRF has the form: Yi = β0 + β1Xi + ui [3.01]
• Because we cannot observe the PRF, we have to estimate it through the SRF
SRF: Yi = β̂0 + β̂1Xi + ûi = Ŷi + ûi [3.02]
Where Ŷi is the predicted/fitted value of Yi
From [3.02], we have [3.03]: ûi = Yi − Ŷi
ûi is the difference between the actual value and the predicted value of Yi
The smaller ûi is in absolute value, the smaller the difference between Yi and
Ŷi, and the closer the estimated value Ŷi is to Yi
• Suppose that we have n observations of Y and X; we try to find the SRF such
that Ŷ is as close as possible to Y.
• A first idea is to choose the SRF so that the sum of residuals, Σ ûi,
has the minimum value
(find the regression coefficients such that the sum of the ûi is smallest)
• However, this is not the best choice, for the following reason:
Residuals of opposite sign cancel each other out in the sum, even though the
distances themselves do not cancel.
OLS therefore chooses β̂0 and β̂1 to minimize the sum of squared residuals, Σ ûi².
6. Formula for calculating the OLS estimates:
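The formula itself did not survive in these notes. As a sketch, the standard textbook derivation for the simple regression model is reproduced here, using the same symbols as [3.02] and [3.03]; it is the usual result, not a transcription of the original slide.

```latex
% OLS objective: minimize the sum of squared residuals
\min_{\hat\beta_0,\hat\beta_1}\ \sum_{i=1}^{n}\hat u_i^{2}
  = \sum_{i=1}^{n}\bigl(Y_i-\hat\beta_0-\hat\beta_1 X_i\bigr)^{2}

% First-order conditions (normal equations)
\sum_{i=1}^{n}\bigl(Y_i-\hat\beta_0-\hat\beta_1 X_i\bigr)=0,
\qquad
\sum_{i=1}^{n}X_i\bigl(Y_i-\hat\beta_0-\hat\beta_1 X_i\bigr)=0

% Solution
\hat\beta_1=\frac{\sum_{i=1}^{n}(X_i-\bar X)(Y_i-\bar Y)}{\sum_{i=1}^{n}(X_i-\bar X)^{2}},
\qquad
\hat\beta_0=\bar Y-\hat\beta_1\bar X
```

The slope formula requires that the Xi are not all equal, which is Assumption 3 in section 3.6.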
SESSION 3 (13/5/2021)
Example
A random sample is given as follows:
X: Personal income per day, in thousand VND
Y: Personal consumption per day, in thousand VND
X 5 4 2 8 8
Y 1 2 3 4 5
1. Calculate the main descriptive statistics of X and Y (mean, variance, median,
mode)
2. Estimate the parameters of the SRF
3. Write the SRF
4. Calculate SST, SSE, SSR, R-squared
5. Explain the meaning of R-squared
• β̂0 = 69/68; β̂1 = 25/68
• SST = 10, SSE = 125/34, SSR = 215/34, R-squared = SSE/SST = 0.3676
It means that income explains 36.76% of the sample variation in people's
consumption. The remaining 63.24% of the sample variation in consumption
is due to other factors that are not included in the
model.
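The hand calculation above can be checked in Stata. This is a minimal sketch: the variable names x and y are chosen here for illustration. Note that Stata labels the explained sum of squares "Model" (e(mss)) and the residual sum of squares "Residual" (e(rss)), which correspond to SSE and SSR in the notation of these notes.

```stata
* Example data: x = daily income, y = daily consumption (thousand VND)
clear
input x y
5 1
4 2
2 3
8 4
8 5
end

regress y x

* Stored results, to compare with 25/68, 69/68, 125/34, 215/34 and 0.3676
display "beta1 hat       = " _b[x]
display "beta0 hat       = " _b[_cons]
display "SSE (explained) = " e(mss)
display "SSR (residual)  = " e(rss)
display "SST (total)     = " e(mss)+e(rss)
display "R-squared       = " e(r2)
```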
Example
X: Personal income/ day in thousand vnd
Y: Personal consumption/ day in thousand vnd
X 6 5 2 4 4
Y 5 2 2 3 1
1. Calculate the main descriptive statistics of X and Y (mean, variance, median,
mode)
2. Estimate the parameters of the SRF
3. Write the SRF
4. Calculate SST, SSE, SSR, R-squared
5. Explain the meaning of R-squared
E(X) = 21/5, E(Y) = 13/5
Var(X) = 1.76, Var(Y) = 1.84 (dividing by n)
β̂0 = 1/44; β̂1 = 27/44
SST = 9.2, SSE = 729/220 ≈ 3.31, SSR ≈ 5.89, R-squared = SSE/SST ≈ 36.02%
It means that income explains about 36.02% of the sample variation in people's
consumption. The remaining 63.98% of the sample variation in consumption
is due to other factors that are not included in the
model.
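Part 1 of this example (mean, variance, median, mode) can be sketched in Stata as follows. The variable names are illustrative. One caveat: summarize reports the sample variance with divisor n − 1, while the values 1.76 and 1.84 above use divisor n, so the sketch rescales r(Var) to make the two comparable.

```stata
clear
input x y
6 5
5 2
2 2
4 3
4 1
end

* Mean, variance (divisor n-1) and median (50th percentile)
summarize x y, detail

* Rescale to the divisor-n variance used in the notes
quietly summarize x
display "Var(X), divisor n = " r(Var)*(r(N)-1)/r(N)
quietly summarize y
display "Var(Y), divisor n = " r(Var)*(r(N)-1)/r(N)

* Mode of each variable (each has a single mode in this sample)
egen xmode = mode(x)
egen ymode = mode(y)
list xmode ymode in 1
```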
3.2. The statistical properties of OLS estimators
The OLS estimators are expressed solely in terms of the observable (i.e.,
sample) quantities.
They are point estimators; that is, given the sample, each estimator will
provide only a single (point) value of the relevant population parameter.
Once the OLS estimates are obtained from the sample data, the sample
regression line can be easily obtained.
The regression line thus obtained has the following properties:
1. It passes through the sample means of Y and X, i.e. through the point (X̄, Ȳ)
2. The mean value of the fitted values Ŷi is equal to the mean value of the
actual Yi: mean(Ŷi) = Ȳ
3. The mean value of the residuals is zero.
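These three properties can be checked numerically. The sketch below reuses the second example's data (X: 6 5 2 4 4, Y: 5 2 2 3 1); the names yhat and uhat are chosen here for illustration.

```stata
clear
input x y
6 5
5 2
2 2
4 3
4 1
end

regress y x
predict yhat, xb          // fitted values
predict uhat, residuals   // residuals

* Properties 2 and 3: mean(yhat) equals mean(y); mean(uhat) is zero
* (up to floating-point rounding)
summarize y yhat uhat

* Property 1: the fitted line evaluated at mean(x) returns mean(y)
quietly summarize x
display "fitted value at mean of x = " _b[_cons] + _b[x]*r(mean)
```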
3.3. The sum of squares
3.4. Determination Coefficient (R-squared)
R² is the fraction (percentage) of the sample variation in Y that is explained
by X.
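The formulas for sections 3.3 and 3.4 are missing from these notes. The sketch below gives the standard definitions, using the convention of the worked examples above (SST total, SSE explained, SSR residual).

```latex
SST=\sum_{i=1}^{n}(Y_i-\bar Y)^{2},\qquad
SSE=\sum_{i=1}^{n}(\hat Y_i-\bar Y)^{2},\qquad
SSR=\sum_{i=1}^{n}\hat u_i^{2}

SST = SSE + SSR,\qquad
R^{2}=\frac{SSE}{SST}=1-\frac{SSR}{SST}
```

For the first example: SST = 10 and SSE = 125/34, so R² = (125/34)/10 = 0.3676.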
3.6. Assumptions of the OLS
Assumption 1 – Linear in parameters: In the PRF, the dependent variable, y, is
related to the independent variable, x, and the error term, u, as
y = β0 + β1x + u
“Linear” here means linear in the parameters, not necessarily in the variables
Assumption 2 – Random sampling: We have a random sample of size n
Assumption 3 – Sample variation in the explanatory variable: The sample
outcomes on x, namely {xi , i = 1,..., n}, are not all the same value.
• Assumption 4 – No perfect collinearity (for multiple regression): In
the sample, there are no exact linear relationships among the independent
variables.
Example:
Note: If the model violates Assumptions 1-4 => we cannot run the
model
• Assumption 5: The error term has an expected value of zero given any value
of the explanatory variable. In other words, E(u|X) = 0.
This assumption simply says that the factors not explicitly included in the
model, and therefore subsumed in ui, do not systematically affect the mean value of
Y; the positive ui values cancel out the negative ui values so that their average
or mean effect on Y is zero.
Geometrically, this assumption can be pictured as in Figure 3.3, which
shows a few values of the variable X and the Y populations associated with each of
them. As shown, each Y population corresponding to a given X is distributed around its
mean value.
Figure 3.3. Conditional distribution of the disturbances ui
Assumption 6 – Homoskedasticity (constant variance): The error term
ui has the same variance given any value of the independent variable. In other
words,
Var(ui|Xi) = E{[ui − E(ui|Xi)]² | Xi} = E(ui² | Xi) = σ²
Var(u|X) reflects how Y is spread around its conditional mean E(Y|X).
This assumption means that the Y populations corresponding to various X values have the
same variance. The variance around the regression line is the same across
the X values; it neither increases nor decreases as X varies.
Figure 3.4. The simple regression model under homoskedasticity
Consider Figure 3.5, where the conditional variance of the Y population varies
with X.
Let Y represent weekly consumption expenditure and X represent
weekly income.
Figures 3.4 and 3.5 show that as income increases, the average consumption
expenditure also increases.
In Figure 3.4, the variance of consumption expenditure remains the same at all
levels of income.
In Figure 3.5, it increases as income increases.
Richer families on average consume more than poor families. There is also
more variability in the consumption expenditure of the former.
Figure 3.5. The simple regression model under heteroskedasticity
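In practice the two cases in Figures 3.4 and 3.5 are told apart by looking at the residuals. The sketch below is an illustration only: it uses the auto dataset shipped with Stata rather than the data of these notes, and the choice of regressors is arbitrary. rvfplot plots residuals against fitted values; estat hettest runs the Breusch-Pagan / Cook-Weisberg test, whose null hypothesis is constant variance.

```stata
* Illustrative data shipped with Stata (not the course data)
sysuse auto, clear
regress price mpg weight

* Residuals vs fitted values; a fan shape suggests heteroskedasticity
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test; H0: constant variance
estat hettest
```

A small p-value from estat hettest is evidence against homoskedasticity.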
3.7. Properties of the OLS estimators - Gauss-Markov Theorem
The Gauss-Markov Theorem: Under the OLS assumptions (Assumptions 1 through 6),
the OLS estimators are BLUE (best linear unbiased estimators).
– Linear: the OLS estimators are linear functions of a random variable (the dependent variable Y)
– Unbiased: the expected value of β̂j equals the population βj
– Best: smallest variance (reflecting the precision/efficiency of the estimation)
within the class of linear unbiased estimators
Theorem 1: Unbiasedness of OLS
Under Assumptions 1 through 5, we have:
E(β̂0) = β0 and E(β̂1) = β1
for any values of β0 and β1. In other words, β̂0 is unbiased
for β0, and β̂1 is unbiased for β1
Each variable has its own β̂; for example, the expected value of β̂2 equals
the population β2.
3.9: The components of the OLS variances
Example: Consider a model with 2 independent variables:
wage = β0 + β1educ + β2iq + u
R² = 0.3
To check whether the two independent variables educ and iq are strongly correlated
with each other, there are two approaches. One is to run the auxiliary regression
educ = a0 + a1iq + u
If the coefficient of determination of this model, denoted Rj², is greater than 0.8
-> the two independent variables educ and iq are strongly correlated -> the model
has multicollinearity.
Note: R² differs from Rj². R² reflects the relationship between the dependent
variable and the independent variables. Rj² reflects the relationship among the
independent variables themselves
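The auxiliary-regression check can be sketched in Stata. The wage/educ/iq data are not available here, so this illustration uses the auto dataset shipped with Stata, with weight and length standing in for the two regressors; both the dataset and the variable choice are assumptions made for this sketch. estat vif reports the variance inflation factor 1/(1 − Rj²) for each regressor, so Rj² > 0.8 corresponds to a VIF above 5.

```stata
* Illustrative data shipped with Stata (not the wage data of the example)
sysuse auto, clear

* Main regression, then variance inflation factors
regress price weight length
estat vif

* The same check by hand: auxiliary regression of one regressor on the other
regress weight length
display "R_j^2 = " e(r2)
display "VIF   = " 1/(1-e(r2))
```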
Note: R²
Formula for R-squared.
The formula for R-squared comes from the idea that the total variation of the
dependent variable is split into two parts: the variation due to the regression and
the variation not due to the regression (the residual part).
R² = 1 − RSS/TSS
Residual Sum of Squares (RSS): the sum of squared residuals
Explained (regression) Sum of Squares (ESS): the sum of squared deviations explained
by the regression model
Total Sum of Squares (TSS): the total sum of squared deviations
(In the notation of the worked examples above, TSS = SST, ESS = SSE and RSS = SSR.)
The value of R-squared ranges from 0 to 1. The closer R-squared is to 1, the better
the model fits the data used in the regression. The closer R-squared is to
0, the worse the model fits the data used in the regression.
In the special case of simple regression (only 1 independent variable), R²
is the square of the correlation coefficient r between the two variables.
Meaning of R-squared
Specifically: suppose R-squared is 0.60; then 60% of the variation in the dependent
variable is explained by the independent variables. (The remaining 40% comes from
measurement error, from the way the data were collected, from other independent
variables that explain the dependent variable but were not included in the research
model, etc.) As a rule of thumb, R² is often expected to be above 50% for the model
to be considered a good fit. However, depending on the type of research, for
instance in finance models, not every R² has to exceed 50% (because it is very hard
to predict gold or stock prices purely from independent variables such as GDP,
ROA, ROE, ...).
Limitations of R-squared
The more variables are added to the model, even without knowing whether the added
variable is meaningful, the higher R² becomes. The reason is that each added
explanatory variable reduces the residual part (since whatever is not explained
ends up in the residual), so adding variables reduces the Residual Sum of Squares
while the Total Sum of Squares stays the same, and R² never decreases.
A higher R² suggests more explanatory power, but it does not reveal the importance
of the added variable. Judging the quality of a model by R² alone can therefore
mislead, because too many unnecessary variables get included and the model becomes
needlessly complex.
To prevent this, another measure of fit is used more often. It is called adjusted
R², or R² adjusted for degrees of freedom.
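The notes do not give the formula for adjusted R². The standard definition is sketched below, with RSS the residual sum of squares as in R² = 1 − RSS/TSS, n the number of observations and k the number of independent variables.

```latex
\bar R^{2}
  = 1-\frac{RSS/(n-k-1)}{TSS/(n-1)}
  = 1-\bigl(1-R^{2}\bigr)\,\frac{n-1}{n-k-1}
```

Adding a variable lowers RSS but also lowers n − k − 1, so adjusted R² can fall when the added variable contributes little.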
Theorem 2: Sampling variances of the OLS estimators
Under assumptions 1 through 6,
Var(β̂j) = σ² / [SSTj (1 − Rj²)], where SSTj = Σ(i=1..n) (Xij − X̄j)²
3.10. Units of measurement
• Example: data set “CEO Salary and Return on Equity”
salary: annual salary of the CEO, in thousands of dollars
roe: average return on equity, in percent
salary^ = 963.191 + 18.501 roe [1]
=> When roe increases by one percentage point, the CEO's annual salary is expected
to increase by 18.501 thousand USD
Case 1
• salary is measured in USD:
salarydol = 1000*salary
• The unit of roe is unchanged
salarydol^ = 963191 + 18501 roe
=> If the dependent variable is multiplied or divided by a constant c, then the OLS
intercept and slope estimates are also multiplied or divided by c.
Case 2
• The unit of salary is unchanged
• The unit of roe is changed: roedec = roe/100
salary^ = 963.191 + 1850.1 roedec
• The coefficient of roedec is 100 times greater than the coefficient of roe in [1]
=> If the independent variable is divided or multiplied by some nonzero constant c, then
the OLS slope coefficient is multiplied or divided by c, respectively. The intercept is
unchanged.
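Both cases can be reproduced on any small data set. The sketch below uses the five observations from the first worked example instead of the CEO data (which is not included in these notes); the names y1000 and xdec are illustrative.

```stata
clear
input x y
5 1
4 2
2 3
8 4
8 5
end

* Baseline regression
regress y x

* Case 1: rescale the dependent variable by c = 1000;
* intercept and slope are both multiplied by 1000
generate y1000 = 1000*y
regress y1000 x

* Case 2: divide the independent variable by 100;
* the slope is multiplied by 100, the intercept is unchanged
generate xdec = x/100
regress y xdec
```

R-squared is the same in all three regressions, since rescaling does not change how well the line fits.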
LECTURE 3: HYPOTHESIS TEST
20/5/2021. LECTURE 4
3 types of file:
Data: wage.dta
Log: stores the output of the Stata session (.smcl, .log)
Do-file: contains the commands
Log file: stores all the results and commands
log using "…"
Example: log using "D:\Practice_econometrics"
*Command
1. des (describe): provides the meaning and the measurement of the variables
Obs: observations
Vars: variables
2 KINDS OF VARIABLE
- Quantitative and qualitative
a) Quantitative variable: a random variable whose values are numbers, and
the values have algebraic meaning
Ex: educ: education level (years of schooling)
Obs Educ
1 16
2 12
3 15
4 9
b) Qualitative variable: a random variable whose values may be recorded as numbers
BUT the values have NO algebraic meaning
Qualitative variables are sometimes coded as numbers
We have to convert a qualitative variable into dummy variables
DUMMY VARIABLE (ZERO-ONE; BINARY): a variable that
takes the value 0 or 1
NOTE
If a qualitative variable has n categories, then we can create n dummy
variables
BUT we include only (n-1) dummy variables in the model. The dummy
excluded from the model serves as the base group or benchmark (the group for
which all included dummies = 0) for comparison
Example: Gender has 2 categories: Male and Female -> We create 2 dummy
variables: Male and Female
Variable Male = 1 if the observation is male, = 0 otherwise
Variable Female = 1 if the observation is female, = 0 otherwise
Obs Edu Gender Male Female
1 11 M 1 0
2 16 F 0 1
3 4 F 0 1
4 6 M 1 0
5 18 F 0 1
Example:
SOE: State-Owned Enterprise
FDI: Foreign Direct Investment
Enterprise Ownership Private FDI SOE
1 Private 1 0 0
2 Fdi 0 1 0
3 Soe 0 0 1
4 Private 1 0 0
5 Fdi 0 1 0
6 Fdi 0 1 0
7 Private 1 0 0
8 Soe 0 0 1
9 Fdi 0 1 0
10 Soe 0 0 1
The variable Private = 1 if the ownership of the firm is private, = 0 otherwise
The variable FDI = 1 if the ownership of the firm is FDI, = 0 otherwise
The variable SOE = 1 if the ownership of the firm is SOE, = 0 otherwise
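The dummy variables in the table above do not have to be typed by hand. A minimal sketch, assuming the ownership type is stored as a string variable named ownership (a name chosen here): tabulate with the generate() option creates one 0/1 variable per category, in sorted order of the category values.

```stata
clear
input str8 ownership
"Private"
"Fdi"
"Soe"
"Private"
"Fdi"
"Fdi"
"Private"
"Soe"
"Fdi"
"Soe"
end

* Creates own1, own2, own3: one dummy per category of ownership
tabulate ownership, generate(own)

* The variable labels show which category each dummy stands for
describe own*
list ownership own1 own2 own3
```

In a regression only two of the three dummies would be included; the omitted one is the base group.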
ANALYSIS
Step 1: Question of interest
Topic: Analysing factors affecting the income of individuals in the USA
Income is the dependent variable
Choose X and Y
Y: wage
X: educ exper nonwhite female married south
Descriptive statistics (descriptive statistical analysis)
The purpose of descriptive statistics is to understand the structure of the
data
sum
sum (summarize) reports summary statistics of the variables (mean, standard deviation,
min, max)
Example: sum wage educ exper nonwhite female married south
*Wage
Has 526 observations
Mean = 5.896
Mean: in statistics, a measure of the central tendency of the data. It is also
regarded as an expected value.
Standard deviation = 3.693
SD: standard deviation: the spread around the mean of the variable. A small value
shows that the observations do not differ much from the mean. Conversely, a high
value shows that the surveyed subjects differ considerably on that variable, so
their values are widely spread.
(We usually do not analyse dummy variables in this way, because their numeric codes
have no algebraic meaning)
sum can be combined with if or by
Arithmetic operators: + - * /
Comparison operators: >=, <=
| means “or”
& means “and”
!= means “not equal”
After if, use == to test equality
- Homework: Calculate the average wage of these groups:
Female vs. Male
Married vs. Unmarried
Nonwhite vs. White
Graduated from university vs. Not yet graduated from university
Answer:
a, Female vs. Male
Female: sum wage if female == 1
Male: sum wage if female == 0
What information does this give us?
Mean -> The average wage of females is much lower than that of males
b, Married vs. Unmarried
Married: sum wage if married == 1
Unmarried: sum wage if married == 0
c, Nonwhite vs. White
Nonwhite: sum wage if nonwhite == 1
White: sum wage if nonwhite == 0
The gap in average wage is quite small
d, Graduated from university vs. Not yet graduated from university
Graduated: sum wage if educ >= 16
Not yet: sum wage if educ < 16
The average wage of people who graduated is about twice that of people who have not
graduated yet
educ shows the largest gap in average wage (the difference between the two group
means), followed by gender, then marital status. The gap between nonwhite and
white people is small
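The group comparisons above can also be produced side by side, with a formal test of whether the gap differs from zero. The sketch assumes that the course file wage.dta is in the current working directory and contains wage and female as used above.

```stata
* Assumes wage.dta (the course data set) is in the working directory
use "wage.dta", clear

* Mean, standard deviation and group size of wage, by gender
tabstat wage, by(female) statistics(mean sd n)

* Two-sample t test; H0: the two groups have the same mean wage
ttest wage, by(female)
```

The same two commands work for married and nonwhite by changing the variable inside by().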
20/5/2021 SESSION 5
Another way to summarize by a dummy variable
bysort female: sum wage
bysort nonwhite: sum wage
bysort married: sum wage
Average wage of females and males among married people
sum wage if female == 1 & married == 1
sum wage if female == 0 & married == 1
3. tab (tabulate) shows the distribution of the values of a variable so that we can
understand the structure of the data set more clearly
For the variable educ:
Mean = 12.562
SD = 2.762
→ Many values are concentrated around the mean because the SD is quite
small
The sample has 526 observations in total
2 people did not go to school (educ = 0), accounting for 0.38%
Cum. (cumulative percent) = 0.38
The most common value in the sample is educ = 12 (198 people) => average educ is around 12
(the cumulative percent at educ = 12 is 59.70, the percentage of people whose
years of education are less than or equal to 12)
tab wage
wage is a continuous variable and takes a lot of values
→ we should not use the tab command for a continuous variable
4. gen (generate): creates a new variable that is not yet in the
data set
gen newvar = expression
gen educsq = educ^2
gen lneduc = ln(educ)
drop varname: deletes the variable varname
After creating the variable, we should document its meaning with the
command:
label variable varname "…" (attach a label)
NOTE: Use straight double quotes ", not curly quotes
E.g. label variable educsq "the squared value of educ"
Ex:
Create one dummy variable showing the education level of 2 groups: Graduated
from university vs. Not yet graduated from university
gen … if …
replace … if …
Answer:
gen graduated = 1 if educ >= 16
replace graduated = 0 if educ < 16
We now have the new variable graduated
Create one dummy variable showing the experience of 2 groups: less than 20
years and 20 years or more
experience = 1 if exper >= 20, = 0 otherwise
gen experience = 1 if exper >= 20
replace experience = 0 if exper < 20
label variable experience "=1 if exper >= 20"
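The two-step gen/replace pattern can be written in one line, because a logical expression in Stata evaluates to 1 (true) or 0 (false). One point to watch: Stata treats a missing numeric value as larger than any number, so educ >= 16 is true for a missing educ. The sketch below guards against that with !missing(); it assumes wage.dta is in the working directory, and the new variable names (graduated2, experience2) are chosen here to avoid clashing with the ones created above.

```stata
* Assumes wage.dta (the course data set) is in the working directory
use "wage.dta", clear

* One-line dummies; observations with missing educ/exper stay missing
generate graduated2  = (educ  >= 16) if !missing(educ)
generate experience2 = (exper >= 20) if !missing(exper)

label variable graduated2  "=1 if educ >= 16"
label variable experience2 "=1 if exper >= 20"

* Check the coding against the original variable
tabulate educ graduated2, missing
```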
5. list in/if
sort …
list … in range (a range of observation numbers)
Exercise: list the 10 people with the lowest wage and the 10 with the highest wage
sort wage
list wage in 1/10 (observations 1 to 10)
list wage in 517/526 (observations 517 to 526)
Calculate the average wage of the 10 people with the lowest wage and of the 10
with the highest wage
sort wage
sum wage in 1/10
sum wage in 517/526
6. drop / keep in/if
drop / keep varname
drop varname
drop in 1/20
7. rename
rename oldname newname
rename educ hocvan
- The distribution plots of science in the two groups look approximately normal.
Now suppose we want to know whether the means of the two groups are equal at the 5%
significance level; use the ttest command as follows:
ttest var, by(groupvar)
After that, use the rvfplot command to plot the residuals against the fitted
values of the dependent variable in the model. Adding the option yline(0) to the
command draws a horizontal line at residual = 0. The value 0 is the
mean of the residuals.
Step 2: Set up the mathematical model (skip)
Step 3: Set up the econometric model
Step 4: Collect data
Data source
Number of observations
Years of survey
Step 5: Estimate the model
Econometric model: linear regression model (the data set is cross-sectional and
the dependent variable is continuous)
Method to estimate coefficients: OLS
Check the correlation of Y and X: before running the regression, compute the
correlation matrix (report it in the research)
corr Y X -> correlation matrix
corr wage educ exper nonwhite female married south
(Correlation and a statistically significant effect are 2 different things:
X being correlated with Y does not mean that X affects Y)
The correlation matrix presents:
Correlation between Y and X -> r(Y,X)
If r(Y,X) ≠ 0 -> X is correlated with Y -> we can include X in the model
Ex: r(wage, educ) = 0.4059 ≠ 0 -> educ is correlated with wage
Correlation between Xj and Xk (to check for the multicollinearity problem)
Note: Choosing Y and X
Topic 1: Analyze the impact of foreign direct investment on the GDP growth of
Vietnam
Y: GDP growth of Vietnam
X1: FDI (main independent variable)
X2, X3, …, Xk: control variables
Topic 2: Analyze the relationship between FDI and GDP growth of Vietnam
(1): GDP growth = f(FDI)
(2): FDI = f(GDP growth)
Topic 3: Analyze the relationship between the rice output and the rainfall of
Vietnam
Y: Rice output
X: Rainfall
If we have 2 variables A and B and want to decide which should be Y and which X ->
we have to check the nature of the correlation between A and B
+ Calculate the correlation of A and B -> r(A,B)
+ If r(A,B) ≠ 0 -> we can include A, B in the model
+ If A is correlated with B -> check whether the correlation reflects causation or not
If causation -> Y: Result/Consequence (rice output)
X: Cause (rainfall)
Regression (running the model)
reg Y X
E.g. reg wage educ exper nonwhite female married south
- What is the difference between R-squared and adjusted R-squared?
R-squared is a basic metric that tells you how much of the variance is
explained by the model.
+ R-squared: in a multiple linear regression, if you keep adding new
variables, R-squared never decreases, irrespective of the significance of the added variable
+ Adjusted R-squared increases only when the added independent variable improves the fit
enough to offset the lost degree of freedom (this happens when its t statistic exceeds 1 in absolute value)
In this case, why is R-squared greater than adjusted R-squared?
Adjusted R-squared penalizes every regressor, so it lies below R-squared; here the nonwhite variable has no statistically significant effect on wage (its p-value is large)
Type I error
Null hypothesis
In the analysis, why do we always mention/analyse R-squared instead of adjusted
R-squared?
Because some variables in the model have no significant effect, but that does not
mean that they are useless in terms of providing information.
We should not automatically exclude a variable just because it has no significant effect on the dependent variable
24/5/2021. Session 6
The sum of squares
- df: degrees of freedom
- Model df = 6 = k
- k is the number of independent variables in the model (k=6)
- Residual df = 519 = n-k-1 = 526 – 7
- n: number of observations
- MS = SS/df: mean square (the sum of squares divided by its degrees of freedom)
- Root MSE: square root of the residual mean square
- _cons: the constant, i.e. the estimated intercept (Beta0 hat)
- Coef. = coefficient: the estimated regression coefficient
- Std. Err.: standard error of the coefficient estimate
SRF: wage = -1.41 + 0.569educ + 0.55exper + 0.072nonwhite – 2.092female +
0.715married - 0.646south + u^
R2 = SSE/SST = 2309.32588/7160.41429 = 0.3225
(SSE here is the explained sum of squares, the Model SS in the Stata output)
It means that the independent variables in the model (educ, exper, nonwhite,
female, married, south) can explain 32.25% of the sample variation of wage
So, 67.75% of the sample variation of wage is explained by other factors that are
not included in the model. In theory, they are contained in u (the error term)
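The R-squared figure above can be reproduced from the two sums of squares. The sketch below also computes adjusted R-squared with the usual degrees-of-freedom formula, 1 - (1 - R2)(n - 1)/(n - k - 1); the notes do not report the adjusted value, so treat that number as an illustration rather than a figure from the Stata output.

```python
# Figures copied from the regression output discussed above.
sse = 2309.32588   # explained sum of squares (Model SS)
sst = 7160.41429   # total sum of squares
n, k = 526, 6      # observations and independent variables

r_squared = sse / sst
# Adjusted R-squared penalizes each extra regressor through the degrees of freedom.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"R-squared          = {r_squared:.4f}")
print(f"Adjusted R-squared = {adj_r_squared:.4f}")
```

Because (n - 1)/(n - k - 1) is greater than 1 whenever k >= 1, the adjusted value always comes out below R-squared.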
P-VALUE EXPLAINED
If we calculate t = 1.76, with alpha = 5% and n > 300 -> critical value = 1.96
When testing a single variable: |t| = 1.76 < 1.96, so we cannot reject H0 at alpha = 5%
p-value
In statistics there are 2 kinds of error: Type I error and Type II error
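The critical-value and p-value approaches give the same decision. A minimal sketch for the t = 1.76 case, assuming the large-sample normal approximation (which is what the 1.96 critical value already relies on); `normal_cdf` is a helper written for this illustration.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

t_stat = 1.76      # t statistic from the note above
critical = 1.96    # two-sided 5% critical value for a large sample

# Two-sided p-value under the large-sample normal approximation.
p_value = 2.0 * (1.0 - normal_cdf(abs(t_stat)))

reject_by_critical_value = abs(t_stat) > critical
reject_by_p_value = p_value < 0.05
print(f"p-value = {p_value:.4f}")
print("reject H0:", reject_by_critical_value, reject_by_p_value)
```

Both rules fail to reject H0 at the 5% level; the p-value is between 5% and 10%, so the same coefficient would be significant at alpha = 10%.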
27/5/2021 SESSION 7:
Step 6: Diagnosing the problems of the model
(1): E(u|X) = 0 (Assumption 5)
(2): Multicollinearity
(3): Homoskedasticity vs. Heteroskedasticity (Assumption 6)
(4): Normal distribution of u (Assumption 7)
(5): Autocorrelation (only for time-series data)
(1): E(u|X) = 0 (Assumption 5)
This assumption is satisfied when:
- E(u) = 0
- Cov(Xj,u) = 0
This assumption ensures that u is white noise, so that the estimation of the
regression coefficients is not affected by the factors contained in u
If either of these two conditions is not satisfied -> the assumption is violated
Cause:
- Important variables are not included in the model
- Misspecification of functional form (see 6.2)
Consequence
- Biased estimation
Theorems 4.1 and 4.2 (lecture 3) no longer hold (see textbook 3.3:
the Expected Value of the OLS Estimators)
The t statistic does not have a t distribution
Hypothesis tests are inexact
How to detect and correct the problem
- Important variable not included: if we suspect that a variable Z has a
statistically significant effect on Y, we should include Z in the model and apply
the t-test to see whether it has a statistically significant effect on Y
- Check for misspecification of functional form: Ramsey RESET test
If the test shows that the model has a misspecified functional form, we need to
change the functional form
Stata command: ovtest
reg wage educ exper nonwhite female married south
ovtest
Ho: model has no omitted variables (the functional form is not misspecified)
P-value < alpha = 0.05 -> reject Ho -> the functional form is misspecified
-> we have to change the functional form
Generate a new variable
gen educsq = educ^2
reg lwage educ educsq exper nonwhite female married south
ovtest
-> p-value > alpha -> do not reject Ho -> no evidence that the model has omitted variables
(2) Multicollinearity
+ Problem: Multicollinearity happens when independent variables have a strong
(but not exact) correlation
This problem does not violate Assumption 4 (no perfect collinearity)
Assumption 4: In the sample, there are no exact linear relationships among the
independent variables
-> the model is still valid, but this problem has some consequences.
Example:
wage = B0 + B1educ + B2iq + u
educ = a0 + a1iq + v
If 0.8 < Rj2 < 1 -> multicollinearity (Rj2 is the R-squared from regressing Xj on
the other independent variables)
- Consequence
The variances (and standard errors) of the affected coefficients become larger, so
the estimates are less precise
NOTE: when the model has multicollinearity, the estimators are still BLUE
+ How to detect:
- Correlation matrix
Command: corr lwage educ educsq exper nonwhite female married south
- vif (variance inflation factor)
vif = 1/(1-Rj2)
If a variable's vif or the mean vif > 10 -> multicollinearity
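The vif formula is easy to check by hand. The sketch below evaluates it for a few values of Rj2; the helper name `vif` and the chosen values are for illustration only, and in practice Rj2 comes from the auxiliary regression of Xj on the other independent variables.

```python
def vif(r_squared_j):
    """Variance inflation factor for regressor j, given R_j^2 from the
    auxiliary regression of x_j on the other regressors."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("R_j^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared_j)

for r2 in (0.0, 0.5, 0.8, 0.9):
    print(f"R_j^2 = {r2:.1f} -> VIF = {vif(r2):.2f}")
```

Note that the two rules of thumb in these notes are not equivalent: Rj2 = 0.8 corresponds to a vif of 5, while vif > 10 requires Rj2 > 0.9.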
+ Solution
- Exclude one of the highly correlated variables from the model (but if we
exclude one of them, we may create problem (1): an omitted important variable)
- Increase the sample size
If we have a large sample, we do not need to worry about
multicollinearity
So in this case, the sample has 526 observations -> large enough -> we do not
have to worry about the multicollinearity problem
(What counts as large enough? As a rule of thumb, about n = 384.16, which is
the value of the common sample-size formula 1.96^2 × 0.5 × 0.5 / 0.05^2)
31/5/2021 SESSION 8
(3): Homoskedasticity vs. Heteroskedasticity (Assumption 6)
+ Problem
Assumption 6: Var(ui|Xi) = σ^2 for all i -> Homoskedasticity (constant error
variance)
When this assumption is violated, Var(ui|Xi) = σi^2 (for i = 1, 2,..., n)
-> Heteroskedasticity (the error variance changes across observations).
Violates Assumption 6
+Consequence
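- (Multiple regression)
Homoskedasticity, Var(ui) = σ^2:
Var(βj-hat) = σ^2 / [SSTj (1 − Rj^2)]
Heteroskedasticity, Var(ui) = σi^2:
Var(βj-hat) = Σi (rij^2 σi^2) / SSRj^2
(SSTj: total sample variation in Xj; Rj^2: R-squared from regressing Xj on the other
independent variables; rij: i-th residual from that regression; SSRj: its sum of
squared residuals)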
+ If the model has heteroskedasticity -> the estimators are still linear and unbiased,
but no longer the best (they do not have minimum variance)
+ The usual OLS formula for the variance of the coefficients is biased -> the
standard errors (SE) are biased -> hypothesis tests are inexact
When heteroskedasticity is present and OLS is still used to estimate the model,
the OLS estimators are still linear and unbiased, but their estimated variance
is biased.
- The variance of the estimators is no longer accurate.
- Confidence intervals and the conclusions of hypothesis tests about the
regression coefficients are no longer valid.
- Forecasts are no longer reliable.
- Simple regression
Consider the following example:
Regressing consumption Y on the income X of households, we have the regression
model:
Yi = β0 + β1Xi + ui
- Case 1: constant error variance (homoskedasticity)
Var(ui) = σ^2
-> The variability of spending is the same for families with different
incomes.
- Case 2: changing error variance (heteroskedasticity)
Var(ui) = σi^2
Families with higher income have higher variability in spending. This is
commonly observed in practice.
This violates Assumption 6 of the classical linear regression model. -> It is the
strongest assumption, because most real-world data violate it
+ How can we detect it?
a, Graphical method
STATA
reg (dependent variable) (independent variables)
rvfplot, yline(0) -> plot of the residuals against the fitted values
If the spread of the residuals around the zero line is not roughly constant across
the fitted values -> we suspect that the model has heteroskedasticity
b, White test
E.g. Question 3: If the original regression model has 4 independent variables, then
when the White test is applied to the residuals of the estimated regression, how
many independent variables does the auxiliary regression have?
We have X1, X2, X3, X4,
(X1)^2, (X2)^2, (X3)^2, (X4)^2,
X1X2, X1X3, X1X4, X2X3, X2X4, X3X4
-> 4 + 4 + 6 = 14 variables
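The count generalizes to any number of regressors: k levels, k squares, and k(k-1)/2 cross products. A small sketch that lists the auxiliary regressors explicitly (the function name `white_regressors` is made up for this illustration):

```python
from itertools import combinations

def white_regressors(names):
    """List the regressors of White's auxiliary regression:
    levels, squares, and pairwise cross products."""
    levels = list(names)
    squares = [f"{x}^2" for x in names]
    cross = [f"{a}*{b}" for a, b in combinations(names, 2)]
    return levels + squares + cross

terms = white_regressors(["X1", "X2", "X3", "X4"])
print(len(terms), terms)
```

If some regressors are dummies or already squares of others, several of these terms are redundant, so the number actually used in practice can be smaller.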
STATA:
Command: imtest, white
H0: the error variance is constant (homoskedasticity)
If (Prob > chi2) > alpha = 5% -> do not reject H0 -> no evidence that the model
has heteroskedasticity
Example:
c, Breusch-Pagan test
Command: hettest
If (Prob > chi2) < alpha = 5% -> reject H0 at alpha = 5%
The model has heteroskedasticity
Note: If the methods give different results -> follow the test that
concludes that there is heteroskedasticity.
Solution: how can we correct the problem?
- Simple method: robust standard errors
reg Y X, robust
reg lwage educ educsq exper nonwhite female married south, robust
Weakness of this method: it only adjusts the standard errors, it does not correct
the model itself -> people often use it anyway because it is simple
(4): Normal distribution of u (Assumption 7)
+ Problem: u does not have a normal distribution -> violates Assumption 7
+ Cause: The sample is not large enough
+ Consequence: The t statistic may not have a t distribution -> hypothesis tests are
inexact
+ How to detect?
Ho: u has a normal distribution
- Graph
predict u, residuals
histogram u, normal
- Jarque-Bera test (in Stata, sktest runs a skewness and kurtosis test of normality)
Ho: u has a normal distribution
predict u, residuals (skip this line if u was already created for the histogram)
sktest u
+ How to correct: increase the sample size
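For reference, the textbook Jarque-Bera statistic combines the skewness S and kurtosis K of the residuals as JB = n/6 × (S^2 + (K − 3)^2/4) and is compared with a chi-square distribution with 2 degrees of freedom. The sketch below implements that textbook formula on made-up residuals; it is not a reproduction of what Stata's sktest computes internally.

```python
def jarque_bera(residuals):
    """Jarque-Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4)."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)

# Hypothetical residuals, for illustration only.
u_hat = [-1.2, -0.4, 0.1, 0.3, -0.2, 0.9, 0.5, -0.7, 0.2, 0.5]
jb = jarque_bera(u_hat)
# 5.99 is the 5% critical value of a chi-square with 2 degrees of freedom.
print(f"JB = {jb:.3f}, reject normality at 5%: {jb > 5.99}")
```

The chi-square approximation is only reliable in large samples, so a ten-observation example like this one illustrates the arithmetic rather than a usable test.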
(5) Autocorrelation
3/6/2021 SESSION 9
Step 7: Test hypotheses and explain the effect of the independent variables on
the dependent variable
- 3 methods of hypothesis testing
Critical value (t test)
p-value
Confidence interval
NOTE: carry out the hypothesis tests on the results of the final model,
after the problems have been fixed
Test each regression coefficient and test the overall fit of the model
Step 8: Estimated result analysis and policy implication
For educ variable:
As: p-value of educ > 0.05 but p-value of educsq < 0.05; the coefficient of educsq =
0.0044 > 0
+ Education has a nonlinear effect on wage, with an increasing marginal effect on
wage
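The "increasing marginal effect" follows from differentiating the quadratic specification. A short derivation, writing \beta_{educ} and \beta_{educsq} for the coefficients on educ and educsq (this notation is assumed here; only the value 0.0044 comes from the notes):

```latex
\frac{\partial\, \mathit{lwage}}{\partial\, \mathit{educ}}
  = \beta_{educ} + 2\,\beta_{educsq}\,\mathit{educ},
\qquad
\hat\beta_{educsq} = 0.0044
\;\Rightarrow\;
\text{the slope rises by } 2 \times 0.0044 = 0.0088 \text{ for each extra year of education.}
```

Since the dependent variable is lwage, a slope change of 0.0088 means the approximate percentage return to one more year of education grows by about 0.88 percentage points per year already completed.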
Example:
- Policy implication: (of government on reducing the gap, …)
Knowledge for research with cross-sectional data sets
CHAPTER 10: MULTIPLE REGRESSION WITH A BINARY DEPENDENT
VARIABLE
1. A single dummy independent variable
wage = β0 + δ0female + β1educ + u
δ0 = E(wage|female = 1, educ) − E(wage|female = 0, educ)
female = 1 corresponds to females, female = 0 corresponds to males
δ0 = E(wage|female, educ) − E(wage|male, educ)
The level of education is the same in both expectations, so the difference, δ0, is due to
gender only.
- If δ0 < 0, at any level of education men have a higher wage than women.
- The difference in wage comes from the difference in gender only, not from
education (educ has a linear effect), because the slopes are the same in the 2 cases
- Higher education means a higher wage for both men and women.
Note: If one qualitative variable has n categories
-> include only n-1 dummy variables in the regression.
The category whose dummy variable is not included in the model
-> is the base group or benchmark group.
E.g.: Gender has 2 categories, male and female -> use only 1 dummy variable,
male or female
- If female is the base group, we have the model:
wage = β0 + δ0male + β1educ + u
- Using 2 dummy variables would introduce perfect collinearity because female +
male = 1, which means that male is a perfect linear function of female.
2. Using multiple dummy variables in the model
- We can include more than 1 dummy variable in the model
- Weakness of the model: we only know the difference in wage between:
+ Female vs. male group
+ Married vs. unmarried group
We do not know the difference in wage between the 4 groups: married men,
single men, married women, single women.
- Solution
METHOD 1:
We can overcome this disadvantage by generating 4 groups: married men,
married women, single men, single women
- If the base group is single men, the model will be:
wage = β0 + δ1marrmale + δ2marrfemale + δ3singfemale + β1educ + u
Note: We have to exclude the variables female and married from the model
METHOD 2:
We can generate an interaction variable of the 2 dummy
variables:
wage = β0 + γ1female + γ2married + γ3female·married + β1educ + u
The estimated results of the 2 models are equivalent.
STATA:
singmale is the base group
reg wage marrmale marrfemale singfemale educ (3)
gen femalemarried = female*married
reg wage female married femalemarried educ (4)
(3)
-> wage^ = -1.024421 + 2.641066marrmale - .5567235marrfemale - .368964singfemale
+ .493559educ
(4)
-> same fitted values as (3); the coefficients implied by (3) are
wage^ = -1.024421 - .368964female + 2.641066married - 2.8288255femalemarried
+ .493559educ
Base group: singmale = 1 if female == 0 & married == 0
(3) Singmale: wage^ = -1.024421 + .493559educ
Marrmale: wage^ = -1.024421 + 2.641066 + .493559educ = 1.616645 + .493559educ
(4) Singmale: wage^ = -1.024421 + .493559educ
Marrmale: wage^ = -1.024421 + 2.641066 + .493559educ = 1.616645 + .493559educ
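The equivalence of the two specifications can be checked numerically. The sketch below takes the coefficients of model (3) from the estimated equation above and derives the model (4) coefficients they imply (female = singfemale, married = marrmale, interaction = marrfemale − singfemale − marrmale); these derived values are a consequence of the reparameterization, not output copied from Stata, and educ = 12 is an arbitrary illustration value.

```python
# Coefficients of model (3), copied from the estimated equation above.
b0, b_educ = -1.024421, 0.493559
d_marrmale, d_marrfemale, d_singfemale = 2.641066, -0.5567235, -0.368964

def wage_model3(female, married, educ):
    """Model (3): one dummy per group, single men as the base group."""
    marrmale = int(female == 0 and married == 1)
    marrfemale = int(female == 1 and married == 1)
    singfemale = int(female == 1 and married == 0)
    return (b0 + d_marrmale * marrmale + d_marrfemale * marrfemale
            + d_singfemale * singfemale + b_educ * educ)

# Coefficients of model (4) implied by model (3).
g_female = d_singfemale
g_married = d_marrmale
g_inter = d_marrfemale - d_singfemale - d_marrmale

def wage_model4(female, married, educ):
    """Model (4): female, married, and their interaction."""
    return (b0 + g_female * female + g_married * married
            + g_inter * female * married + b_educ * educ)

for female in (0, 1):
    for married in (0, 1):
        w3 = wage_model3(female, married, educ=12)
        w4 = wage_model4(female, married, educ=12)
        print(female, married, round(w3, 4), round(w4, 4))
```

Each of the four groups gets the same predicted wage under both models, which is why the choice between group dummies and an interaction term is a matter of which comparison you want to read directly off the coefficients.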
3. Interaction between dummy and quantitative variables
Case 2
- The intercept for women is below that for men, but the slope on education is
larger for women.
- This means that women earn less than men at low levels of education, but the gap
narrows as education increases.
- At some point, a woman earns more than a man.
Stata: gen femaleeduc = female*educ
reg wage female educ femaleeduc
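The point at which the two wage lines cross can be worked out from the regression above. A sketch, writing the model with \delta_0 on female and \delta_1 on femaleeduc (this labeling of the coefficients is an assumption made here to match the earlier dummy-variable notation):

```latex
\mathit{wage} = \beta_0 + \delta_0\,\mathit{female} + \beta_1\,\mathit{educ}
              + \delta_1\,\mathit{female}\cdot\mathit{educ} + u
\\[4pt]
E(\mathit{wage}\mid \mathit{female}=1,\mathit{educ}) - E(\mathit{wage}\mid \mathit{female}=0,\mathit{educ})
  = \delta_0 + \delta_1\,\mathit{educ}
\\[4pt]
\delta_0 < 0,\ \delta_1 > 0
\;\Rightarrow\;
\text{the gap is zero at } \mathit{educ}^{*} = -\frac{\delta_0}{\delta_1}
```

Whether a woman actually earns more than a man "at some point" depends on whether educ* lies inside the range of education observed in the sample.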
7/6/2021 SESSION 10
CHAPTER 10: MULTIPLE REGRESSION WITH A BINARY DEPENDENT
VARIABLE
Regression with a binary dependent variable
1. Linear Probability Model (LPM)
2. Logit Model
3. Probit Model
4. Comparing LPM, Logit, Probit
I. Linear Probability Model (LPM)
P = E(Yi|Xi) = P(Yi = 1|Xi) = β0 + β1X1i
We have a model: Yi = β0 + β1X1i + ui
Let pi be the probability of event A happening
given Xi: pi = P(A|Xi)
* The differences when interpreting the coefficient:
- Linear regression model: when X changes by 1 unit, the average value of Y
(E(Y|X)) changes by β1 units.
- LPM: when X changes by 1 unit, the probability of event A happening
changes by β1.
For example:
inlf = 0.5 + 0.038educ – 0.02female + u^i
+ Holding other factors fixed, another year of education increases the
probability of participating in the labor force by 3.8 percentage points
+ Holding other factors fixed, females have a lower probability of participating in
the labor force, by 2 percentage points, compared with males
*The weaknesses of LPM
1. The predicted probability can be smaller than 0 or greater than 1.
2. The probability is linear in the independent variables, so the effect of X is
constant. In reality, it cannot be.
Exercise 1
A survey of 40 students 6 months after graduation, with variables GS (graduate
score) and EN (English grade). Grades are on a 100-point scale.
Y = 1 if the student has a job, Y = 0 if the student has not got a job yet. (inlf)
Let "probability of getting a job" be the probability that a student gets a job
within 6 months of graduation; α = 5%
1. Write the LPM.
2. Interpret the coefficients.
3. Estimate the probability of getting a job when (GS, EN) equals (70, 80) and
(60, 60).
2. Interpretation:
- P-value < 0.05 -> both GS and EN have a statistically significant effect on Y.
- Holding other factors fixed, when GS increases by 1 unit, the probability of
getting a job increases by about 2 percentage points.
- Holding other factors fixed, when EN increases by 1 unit, the probability of
getting a job increases by about 2.9 percentage points.
3. When
- GS = 70, EN = 80
p^lpm(Y=1|X) = -3.01567 + 0.020158*70 + 0.02922*80 = 0.733
- GS = 60, EN = 60
p^lpm(Y=1|X) = -3.01567 + 0.020158*60 + 0.02922*60 = -0.053
The negative value (-0.053) is not a valid probability, so we cannot draw a
conclusion in this case
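The two predictions can be reproduced directly from the coefficients used in the calculation above. A minimal sketch; the function name `lpm_probability` is introduced here for illustration.

```python
# Coefficients copied from the estimated LPM in the exercise above.
B0, B_GS, B_EN = -3.01567, 0.020158, 0.02922

def lpm_probability(gs, en):
    """Fitted LPM value; nothing constrains it to lie in [0, 1]."""
    return B0 + B_GS * gs + B_EN * en

for gs, en in [(70, 80), (60, 60)]:
    p = lpm_probability(gs, en)
    flag = "" if 0.0 <= p <= 1.0 else "  <- outside [0, 1]"
    print(f"GS={gs}, EN={en}: p_hat = {p:.3f}{flag}")
```

The second case shows the first weakness of the LPM in practice: a perfectly ordinary pair of grades produces a fitted "probability" below zero.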
II. Logit and Probit model
- As LPM has 2 weaknesses:
+ P(y=1|x) can be smaller than 0 or greater than 1
+ the effect of X on Y is constant
- To overcome these weaknesses, we rewrite the model:
P(y=1|x) = G(β0 + β1X1 + ... + βkXk) = G(z)
+ If we want G(z) to have value in the interval (0,1) => Logit and Probit model
are among the choices.
1. Logit Model
G(z) = e^z / (1 + e^z)
+ In the logit model, G(z) is the logistic function, which is strictly between 0 and 1
for all real numbers z.
+ This is the cdf of a standard logistic random variable.
+ P(y=1|x) = G(z) = pi and P(y=0|x) = 1 − G(z) = 1 − pi
At each value of Xi, the probability of event A happening is pi.
When Xj changes by 1 unit, the probability changes by approximately pi(1 − pi)βj.
(The coefficient cannot be interpreted as in the linear model)
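The logistic function and the marginal effect p(1 − p)βj are simple to evaluate. The sketch below uses hypothetical index values z and a hypothetical coefficient of 0.5, only to show how the same coefficient produces different changes in probability at different points.

```python
import math

def logistic(z):
    """Logistic CDF G(z) = e^z / (1 + e^z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit_marginal_effect(z, beta_j):
    """Approximate change in P(y=1|x) for a one-unit change in x_j."""
    p = logistic(z)
    return p * (1.0 - p) * beta_j

# Hypothetical index values and coefficient, for illustration only.
for z in (-3.0, 0.0, 3.0):
    print(z, round(logistic(z), 4), round(logit_marginal_effect(z, 0.5), 4))
```

The effect is largest where p = 0.5 and shrinks toward zero as p approaches 0 or 1, which is why a logit coefficient cannot be read as a constant change in probability.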
2. Estimating the Logit model
With the Logit and Probit models, we do not use OLS
b. Maximum likelihood estimation (MLE)
- Because E(y|x) is nonlinear, OLS is no longer appropriate.
- Maximum likelihood estimation is more appropriate because it is based on the
conditional distribution of y.
- Conditional density of y:
f(y=1|x; β) = P(y=1|x) = G(xβ)
f(y=0|x; β) = P(y=0|x) = 1 − G(xβ)
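Combining the two densities over all observations gives the log-likelihood that MLE maximizes: the sum over i of yi·log G(xiβ) + (1 − yi)·log(1 − G(xiβ)). The sketch below evaluates it for a logit model on a tiny made-up data set; the data and the candidate β values are hypothetical, and no optimizer is included.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, X, y):
    """Sum over i of y_i*log G(x_i b) + (1 - y_i)*log(1 - G(x_i b))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(b * x for b, x in zip(beta, x_i))
        p = logistic(z)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

# Hypothetical data: each row is (1, x) so beta[0] is the intercept.
X = [(1, 0.5), (1, 1.5), (1, 2.0), (1, 3.0), (1, 3.5)]
y = [0, 0, 1, 0, 1]

print(log_likelihood((0.0, 0.0), X, y))    # all probabilities equal 0.5
print(log_likelihood((-2.0, 1.0), X, y))   # a candidate with a positive slope
```

The second candidate has the higher log-likelihood, so it fits these data better than the all-zero β; the logit and probit commands search for the β that makes this value as large as possible.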
Exercise
A survey of 40 students 6 months after graduation, with variables GS (graduate
score) and EN (English grade). Grades are on a 100-point scale.
Y = 1 if the student has a job, Y = 0 if the student has not got a job yet.
Let "probability of getting a job" be the probability that a student gets a job
within 6 months of graduation; α = 5%
b, When GS increases by 1 unit, with EN fixed, the probability of having a job
increases by 1.78%
ANSWER:
Because the coefficient is positive
STATA
Command:
logit y x
probit y x
Example: logit coursechoice read math
Note:
Use scalar only for Probit, not for Logit
mfx can be used for both Logit and Probit
Command: mfx
Describing the predictive ability of the model
3.
14/6/2021 SESSION 12
Panel Data
1. Definition
- Panel data: the same units of observation (N) (households, enterprises,
individuals, countries...) are observed over time (T)
- Panel data can contain:
+ Variables that differ across units but do not change over time
(location, gender...)
+ Variables that differ across units and change over time (exchange rate,
FDI, consumption, income...)
2. Advantages of Panel Data
For example: (cross-section data set) Analyze the relationship between quantity of
fertilizer and productivity
Problem:
– There are omitted unobserved variables (in u), and u correlates with X
– These variables differ across units => biased OLS estimation
Advantages of Panel Data
• Overcome the problem of omitted unobserved variables
• The units are observed over time, so the sample is larger and we can track
all the changes of the units over time.
The unobserved variables can:
– Change / not change across units (i)
– Change / not change over time (t)
– Change across both units (i) and time (t)
3. Econometric Model for Panel Data
ai is unobserved -> it is included in the error term
• Let: vit = ai + uit
• Depending on the characteristics of ai, we have 3 models:
– Pooled OLS (POLS)
– Fixed effects (FE)
– Random effects (RE)
xtset id year: command to declare that you want to treat this data as panel
data