Econometrics and Machine Learning in
Economics
- Topics 2 and 3 -
Probability Models and Data Generating
Processes
Higher School of Economics
A. Duplinskiy
What we did until now!
1 Reviewed some concepts from Time-Series Econometrics
2 Stationarity, Conditional vs Unconditional moments
This week
Ergodicity
Assignment
Multivariate Regression: More than 1 feature
Clustering: K-Means and Hierarchical, Affinity
Propagation
Dimensionality Reduction: PCA, Lasso-Ridge
Logistic Regression
Consider checking out
https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning
Next Week
1 Advanced time-series models
The models you will implement in your work!
Private companies, central banks, governments, public
institutions, research institutions, universities
Simple linear models everyone can estimate! (click buttons)
Econometricians (are expected to) do more!
Econometricians are supposed to be specialists in
cutting-edge, state-of-the-art models (not available in
software packages).
2 Machine Learning Vs Econometrics:
Cases when one has to be careful with ML
Experimental vs Observational data
Predictive vs Causal Models
Homework: Lecture 1
Solutions to Selected Exercises
Exercises 1.2 and 1.3
Exercise 1.2: Make a Venn diagram specifying the relation
between the following types of stochastic processes:
(a) iid processes (b) weakly stationary processes
(c) strictly stationary processes
Exercise 1.3: Give examples of time series that characterize
each set (including each intersection and union) in the Venn
diagram elaborated in the previous question.
Exercises 1.2 and 1.3
(1) WS, but not SS or IID
(2) WS, SS and IID
(3) WS and SS, but not IID
(4) SS and IID, but not WS
(5) SS, but not WS or IID
Exercises 1.2 and 1.3
(1) {Xt} independent with Xt ∼ N(µ, σ²) for every t ≤ t∗ and Xt ∼ t(µ, σ²) for every t > t∗
(2) {Xt} iid with Xt ∼ N(µ, σ²) for every t
(3) (Xt, Xt−1) ∼ N(µ, Σ) for every t, where µ = [0, 0]′ and
\[ \Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \]
(4) {Xt} iid with Xt ∼ Cauchy for every t
(5) (Xt, Xt−1) ∼ t(µ, Σ, λ) for every t, where µ = [0, 0]′, λ < 2, and
\[ \Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \]
Exercise 1.6
Exercise 1.6: Show that ρX (t + h, t) = ρX (h) if the time series
is weakly stationary.
Answer: The ACF is given by
\[ \rho_X(t+h, t) = \frac{\mathrm{Cov}(X_{t+h}, X_t)}{\sqrt{\mathrm{Var}(X_{t+h})\,\mathrm{Var}(X_t)}} \]
If {Xt} is weakly/strictly stationary, then, for every (t, h),
\[ \mathrm{Cov}(X_{t+h}, X_t) = \gamma_X(h) \quad \text{and} \quad \mathrm{Var}(X_{t+h}) = \mathrm{Var}(X_t) = \sigma_X^2 \]
Therefore, we can re-write
\[ \rho_X(t+h, t) = \frac{\gamma_X(h)}{\sqrt{\sigma_X^2\,\sigma_X^2}} = \frac{\gamma_X(h)}{\sigma_X^2} = \frac{\gamma_X(h)}{\gamma_X(0)} = \rho_X(h) \]
Exercise 1.11
Exercise 1.11: Derive the autocorrelation function of the
random walk starting at t = 1.
Xt = ε1 + ε2 + ... + εt ∀ t ∈ N, where {εt} ∼ WN(0, σε²).
Recall: Var(Xt) = tσε² and Cov(Xt, Xt−h) = (t − h)σε².
Hence:
\[ \rho_X(t, t-h) = \frac{\gamma_X(t, t-h)}{\sqrt{\mathrm{Var}(X_t)\,\mathrm{Var}(X_{t-h})}} = \frac{(t-h)\sigma_\varepsilon^2}{\sqrt{(t\sigma_\varepsilon^2)\,((t-h)\sigma_\varepsilon^2)}} = \frac{\sqrt{t-h}}{\sqrt{t}} \]
The correlations between elements of {Xt} change over time!
Exercise 1.17
(a) Xt = a + bZt + cZt−2, {Zt} ∼ NID(0, σ²)
\[ \mathrm{E}(X_t) = \mathrm{E}(a + bZ_t + cZ_{t-2}) = a + b\,\mathrm{E}(Z_t) + c\,\mathrm{E}(Z_{t-2}) = a + b \cdot 0 + c \cdot 0 = a. \]
\[ \mathrm{Var}(X_t) = \mathrm{Var}(a + bZ_t + cZ_{t-2}) = 0 + b^2\,\mathrm{Var}(Z_t) + c^2\,\mathrm{Var}(Z_{t-2}) = (b^2 + c^2)\sigma^2. \]
Exercise 1.17
(a) (continued)
\[ \mathrm{Cov}(X_t, X_{t-h}) = \mathrm{Cov}(a + bZ_t + cZ_{t-2},\; a + bZ_{t-h} + cZ_{t-h-2}) \]
\[ = b^2\,\mathrm{Cov}(Z_t, Z_{t-h}) + bc\,\mathrm{Cov}(Z_t, Z_{t-h-2}) + cb\,\mathrm{Cov}(Z_{t-2}, Z_{t-h}) + c^2\,\mathrm{Cov}(Z_{t-2}, Z_{t-h-2}) \]
Now since Cov(Zt, Zt−h) = 0 ∀ h ≠ 0, we have
\[ \gamma(0) = (b^2 + c^2)\sigma^2, \quad \gamma(1) = 0, \quad \gamma(2) = bc\sigma^2, \quad \gamma(h) = 0 \;\; \forall\, h > 2. \]
Since expectation, variance and covariance are all finite and constant over time, we conclude that {Xt} is weakly stationary.
Exercise 1.18
Exercise 1.18: Suppose {Xt } and {Yt } are uncorrelated weakly
stationary sequences. Show that {Xt + Yt } is weakly stationary.
Let {Wt } := {Xt + Yt }. Then,
E(Wt) = E(Xt + Yt) = E(Xt) + E(Yt) = µX + µY.
Var(Wt) = Var(Xt + Yt) = Var(Xt) + Var(Yt) = σX² + σY².
\[ \mathrm{Cov}(W_t, W_{t-h}) = \mathrm{Cov}(X_t + Y_t,\; X_{t-h} + Y_{t-h}) = \mathrm{Cov}(X_t, X_{t-h}) + \mathrm{Cov}(X_t, Y_{t-h}) + \mathrm{Cov}(Y_t, X_{t-h}) + \mathrm{Cov}(Y_t, Y_{t-h}) = \gamma_X(h) + 0 + 0 + \gamma_Y(h). \]
Since expectation, variance and covariance are all finite and constant over time, we conclude that {Wt} is weakly stationary.
Useful info
Note: groups should have three or four students.
No more, no less. (contact me if you do not have partners)
Deadline for forming groups: Sunday, November 1
Assignment deadlines:
Part 1 and 2: Monday, December 14, at 18:30.
Important: Failure to deliver on time implies zero points on that
part!
Good grade: answers should be correct, clear and complete.
Deliver a report with appropriate justifications and insightful
comments and remarks. Think about your report: too much
information is unreadable; too little information is ill-advised.
Assignment and Software
Software: the assignment should be completed with Python or any
other software (MATLAB, R, ...)
Topics: Clustering, Text Processing
Doing extra: Go ahead, but make sure you know what you are
doing
Autosuggestions
About this assignment:
In this assignment you have data about IDs and when and
where they were issued. Do not worry, there is no ID
number and I modified the data!
Assignment objectives:
1 Work with text data and learn how to cluster text data.
2 Get a feeling for how to calculate the value of your work.
3 Visualize data and use clustering for data cleaning.
Id data
Load the data and process it
Visualise data
Visualise cleaned data
Autosuggestion accuracy
Figure: Proportion of entries whose error is smaller than the value on
the x-axis, according to the Levenshtein distance. Axes: x – Levenshtein
distance; y – proportion of entries with error smaller than x.
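A minimal sketch of how such a figure could be produced (the entry lists below are placeholders, not the actual assignment data; the Levenshtein distance is implemented by hand so no extra package is needed):

import numpy as np
import matplotlib.pyplot as plt

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return d[len(a), len(b)]

# placeholder data: autosuggested entries vs. the true (cleaned) entries
suggestions = ["district office 12", "city department", "regional office"]
truth       = ["district office 12", "city department 5", "regionl office"]

errors = np.array([levenshtein(s, t) for s, t in zip(suggestions, truth)])
grid = np.arange(errors.max() + 1)
proportion = [(errors <= k).mean() for k in grid]

plt.step(grid, proportion, where="post")
plt.xlabel("Levenshtein distance")
plt.ylabel("proportion of entries with error <= x")
plt.show()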
Multivariate regression
Extension of the univariate regression:
Strictly speaking, the univariate models we worked with often
already have two variables – a regressor and a constant.
Mathematically very similar – the same techniques apply.
Typically the same packages, but more things to look at.
statsmodels is more statistics-driven; sklearn is more
machine-learning oriented.
Similar functionality, slightly different focus and notation
(see the sketch below).
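A minimal sketch of the same multivariate regression in both libraries; the data frame and column names (shops, population, horeca) are placeholders standing in for the data set used in the slides:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# placeholder data frame; in the lecture this comes from the shops data set
rng = np.random.default_rng(0)
df = pd.DataFrame({"population": rng.uniform(1e3, 1e5, 200),
                   "horeca": rng.integers(1, 50, 200)})
df["shops"] = 0.002 * df["population"] + 0.5 * df["horeca"] + rng.normal(0, 5, 200)

# statsmodels: statistics-driven output (t-stats, R^2, confidence intervals)
X = sm.add_constant(df[["population", "horeca"]])
ols_fit = sm.OLS(df["shops"], X).fit()
print(ols_fit.summary())

# sklearn: machine-learning style API (fit / predict)
lr = LinearRegression().fit(df[["population", "horeca"]], df["shops"])
print(lr.intercept_, lr.coef_)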
Multivariate regression examples
Number of shops regressed on population and the number of horeca
establishments.
Load Packages
Fit Plot
Multivariate Regression
Plot fit
Multivariate Plot
Why is this mess here?
Multivariate Scatter Plot
Residuals Plot
Clustering
What is it? Why do we do it?
Clustering refers to a very broad set of techniques for
finding subgroups, or clusters, in a data set.
We want to partition the data into distinct groups based on
the information we have, so that similar shoes end up in the
same group and dissimilar shoes in different groups.
What does it mean for shoes to be similar or different?
In non-trivial cases, we need to combine categorical and
numerical features and use domain-specific knowledge.
Clustering
What is it? Why do we do it?
An alternative to moving to more sophisticated models: instead
of going to semi-nonparametric or non-parametric methods
such as ANNs, we can split the data into groups and fit a
simpler model to each group.
Netflix and Spotify recommendations are essentially a
sophisticated version of two-sided clustering: find people
with similar tastes; find products with similar properties.
Clustering Recom systems
Clustering
What is it? Why do we do it?
Solve the problem of cold start: if time-series data is not
available, how do we predict sales of a new article?
Imagine a map and us trying to guess the value of a
property based on its k nearest neighbors.
Clustering Example
Sales and Product data of Shoes:
Suppose we have a large number of shoe characteristics
(e.g. number of shoes sold last year, average price, product
category (e.g. running, football, outdoor), and so forth) for
a large number of shoes.
Our purpose is to group similar shoes together to make a
separate demand/promotion model for each group.
This task can be linked to user segmentation: similar
people like similar shoes (related to recommender systems).
Clustering Result Plot
K-means Clustering Details
Goal:
Assign a cluster index to each observation.
Minimize the within-cluster variation given the number of clusters K, where
\[ WCV(C_k) = \frac{1}{|C_k|} \sum_{x_i, x_{i'} \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \]
K-means Clustering Details
Procedure:
1 Randomly assign a number, from 1 to K, to each of the
observations. These serve as initial cluster assignments for
the observations.
2 Iterate until the cluster assignments stop changing:
1 For each of the K clusters, compute the cluster centroid.
The kth cluster centroid is the vector of the p feature means
for the observations in the kth cluster.
2 Assign each observation to the cluster whose centroid is
closest (where closest is defined using Euclidean distance).
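A minimal NumPy sketch of this procedure, on synthetic data and with a simple convergence check (a robust implementation would also handle clusters that become empty):

import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=X.shape[0])          # step 1: random assignment
    for _ in range(n_iter):
        # step 2(a): compute the centroid (vector of feature means) of each cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2(b): reassign each observation to the nearest centroid (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):            # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])  # placeholder data
labels, centroids = kmeans(X, K=2)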
Example by Robert Tibshirani and Trevor Hastie
Figure: Steps of the K-means algorithm. Source: Statistical Learning at
https://online.stanford.edu/
Breaking down steps
The progress of the K-means algorithm with K=3.
Top left: Scatter plot of the data.
Top center: Step 1 – Randomly assign each observation to
a cluster.
Top right: Step 2(a) – Compute the cluster centroids.
Bottom left: Step 2(b) – Assign each observation to the
nearest centroid.
Bottom center: Step 2(a) – Compute new cluster centroids.
Bottom right: Results after 10 iterations.
Scale Data
Residuals Plot
Residuals Plot
K-means code
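The slide shows the lecture's own code; a generic sklearn equivalent, assuming a placeholder feature matrix X, might look like this:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])  # placeholder data

X_scaled = StandardScaler().fit_transform(X)            # scale features first
km = KMeans(n_clusters=3, n_init=10, random_state=0)    # K chosen beforehand
labels = km.fit_predict(X_scaled)                        # cluster index per observation
print(km.cluster_centers_, km.inertia_)                  # centroids and within-cluster SS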
Scatter Plot
Clustering Result Plot
K-means Clustering. How to Choose K?
No simple answer:
Number given by domain knowledge.
Elbow method.
Silhouette Score.
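A sketch of both heuristics with sklearn, on placeholder data; in the assignment X would be the matrix of clustered features:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.vstack([np.random.randn(60, 2), np.random.randn(60, 2) + 5])  # placeholder data

ks = range(2, 9)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                          # total within-cluster SS (elbow)
    silhouettes.append(silhouette_score(X, km.labels_))

plt.plot(list(ks), inertias, marker="o")                  # look for the "elbow"
plt.xlabel("K"); plt.ylabel("inertia")
plt.show()

best_k = list(ks)[int(np.argmax(silhouettes))]            # K with the highest silhouette
print("silhouette-preferred K:", best_k)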
Elbow Method
Silhouette Score k = 2
Silhouette Score k = 3
Silhouette Score k = 4
Code For Silhouette Score. Part 1
Code For Silhouette Score. Part 2
Bonus
Get Cluster Labels
Scatter Plot
Compare with K-means?
Two types of Clustering
Top-down and Bottom-Up:
In K-means clustering, we seek to partition the
observations into a pre-specified number of clusters.
In hierarchical clustering, we do not know in advance how
many clusters we want; in fact, we end up with a tree-like
visual representation of the observations, called a
dendrogram, that allows us to view at once the clusterings
obtained for each possible number of clusters, from 1 to n.
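A minimal sketch of agglomerative (bottom-up) clustering with SciPy on placeholder data; cutting the dendrogram at different heights yields every possible number of clusters:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])  # placeholder data

Z = linkage(X, method="ward")          # bottom-up merging of the closest clusters
dendrogram(Z)                          # tree of all clusterings, from n clusters down to 1
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters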
Affinity Prop
Based on distance matrix
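A minimal sklearn sketch on placeholder data; unlike K-means, the number of clusters is not fixed in advance but emerges from the similarity matrix and the preference parameter:

import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])  # placeholder data

# by default similarities are negative squared Euclidean distances between points
ap = AffinityPropagation(random_state=0).fit(X)
print("number of clusters found:", len(ap.cluster_centers_indices_))
labels = ap.labels_                    # cluster index per observation; exemplars act as centroids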
Affinity Prop
What are the clusters here?
Compare with K-means?
https://scikit-learn.org/stable/modules/clustering.html
Logit
Implementation:
Dependent variable is binary ∈ {0, 1} – we model the
probability of success.
Use statsmodels and sklearn
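A minimal sketch of both implementations on synthetic data (variable names are placeholders):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))            # true success probability
y = rng.binomial(1, p)                            # binary dependent variable

# statsmodels: full inferential output
logit_fit = sm.Logit(y, sm.add_constant(x)).fit()
print(logit_fit.summary())

# sklearn: prediction-oriented API (regularized by default; large C mimics plain ML)
clf = LogisticRegression(C=1e6).fit(x.reshape(-1, 1), y)
print(clf.intercept_, clf.coef_)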
Logit with sm ols
Logit with sm ols
Get Odds Ratio and Confidence intervals
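A sketch of how odds ratios and their confidence intervals can be obtained from a fitted statsmodels logit, by exponentiating the coefficients and the bounds of their confidence intervals (synthetic data again):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

fit = sm.Logit(y, sm.add_constant(x)).fit()
print(np.exp(fit.params))       # odds ratios
print(np.exp(fit.conf_int()))   # 95% confidence intervals for the odds ratios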
Logit with sklearn
Logit Performance measures
Area under Curve
Area under Curve
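A sketch of the usual performance measures for a fitted classifier, on placeholder data: accuracy, the confusion matrix, the ROC curve and the area under it (AUC):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]               # predicted success probabilities

print(accuracy_score(y_test, clf.predict(X_test)))
print(confusion_matrix(y_test, clf.predict(X_test)))
print("AUC:", roc_auc_score(y_test, proba))

fpr, tpr, _ = roc_curve(y_test, proba)                # ROC curve: area under it = AUC
plt.plot(fpr, tpr); plt.plot([0, 1], [0, 1], "--")
plt.xlabel("false positive rate"); plt.ylabel("true positive rate")
plt.show()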
Lasso
Penalized estimation: add a penalty term to the usual ML / least-squares objective.
Ridge penalty: \( \lambda \sum_{i=1}^{p} \beta_i^2 \); Lasso penalty: \( \lambda \sum_{i=1}^{p} |\beta_i| \).
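A minimal sklearn sketch of both penalties on synthetic data; note that sklearn calls the penalty weight λ alpha:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta = np.array([3, 0, 0, 1.5, 0, 0, 0, 2, 0, 0])     # sparse true coefficients
y = X @ beta + rng.normal(0, 1, 200)

lasso = Lasso(alpha=0.1).fit(X, y)    # |beta| penalty: sets some coefficients exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)    # beta^2 penalty: shrinks all coefficients towards zero
print("lasso:", np.round(lasso.coef_, 2))
print("ridge:", np.round(ridge.coef_, 2))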
Chapter 3: Probability models
Some more reading material:
1 Davidson (1994), “Stochastic Limit Theory”
Chapter 1.6, 2.3, 3.1 and 7.1
2 Billingsley (1995), “Probability and Measure”
Chapter 2 and 5
3 White (1996), “Estimation Inference and
Specification Analysis”
Chapter 2.1, 2.2 and 20
4 Fan and Yao (2005), “Nonlinear Time-Series”
Chapter 1.3
Probability spaces and random variables
Note: we need to learn some basic concepts of set theory and
measure theory in order to find an appropriate definition of
probability model and data generating process.
A very brief history of 20th century mathematics:
1 Late 19th and 20th centuries: foundations!
2 Gottlob Frege attempted to give proper definitions of numbers,
functions and variables. What is the number 2 after all?
3 Days before publication (after 10 years of work), Bertrand Russell pointed
out a ‘small inconsistency’ which turned out to destroy Frege’s work.
4 Bertrand Russell (with Alfred North Whitehead) gives foundations to
mathematics in ‘Principia Mathematica’ in 3 volumes.
5 Kurt Gödel’s Incompleteness Theorems show that any consistent axiomatic
system rich enough for arithmetic is incomplete!
Probability space
A probability space is a triplet (E, F, P ) where E is the ‘event
space’, F is a σ-field defined on the event space E and P is a
probability measure defined on the σ-field F.
Event space E is the collection of all possible outcomes for the
random variable.
Probability measure P defines probability associated to each
event and each collection of events in E.
σ-field F contains all the relevant collections of events.
Note: P : F → [0, 1] maps elements of F to the interval [0, 1].
Probability space
Examples of event spaces E:
Coin tosses: E = {heads, tails}
Dice tosses: E = {1, 2, 3, 4, 5, 6}
Gaussian random variable E = R
Question: Why is P defined on collections of sets in F?
Answer: It lets us describe joint and disjoint events!
Probability space: Coin toss example
Example: A σ-field F of the event space E = {heads, tails} is
F := { ∅, {heads}, {tails}, {heads, tails} }.
Note:
F contains the empty set ∅
F contains each element of E
F contains the event space E = {heads, tails}
Hence: Probability measure P must define a probability of
Nothing happening P (∅)
Drawing heads P (heads)
Drawing tails P (tails)
Drawing either heads or tails P ({heads, tails})
σ-fields (σ-algebras)
Note: there are certain rules that must be followed for
constructing a σ-algebra F.
Banach-Tarski Paradox (1924): it is possible to take a ball,
cut it into pieces, and re-arrange those pieces in such a manner
as to obtain two balls of the exact same size, with no parts
missing! The σ-algebra solves this problem!
A σ-field F of a set E is a collection of subsets of E satisfying:
(i) E ∈ F.
(ii) If F ∈ F, then F c ∈ F.
(iii) If {Fn}n∈N is a sequence of sets in F, then \( \bigcup_{n=1}^{\infty} F_n \in \mathcal{F} \).
Measurable spaces and probability measures
A measurable space is just a pair (E, F) composed of an event
space E and respective σ-algebra F.
A probability measure P defined on a measurable space (E, F)
is a function P : F → [0, 1] satisfying:
(i) P (F ) ≥ 0 ∀ F ∈ F.
(ii) P (E) = 1.
(iii) If {Fn}n∈N is a collection of pairwise disjoint sets in F, then
\[ P\Big(\bigcup_{n=1}^{\infty} F_n\Big) = \sum_{n=1}^{\infty} P(F_n). \]
Random variable
Given two measurable spaces (A, F_A) and (B, F_B), a function
f : A → B is said to be measurable if every element b ∈ F_B
satisfies f⁻¹(b) ∈ F_A.
Note: the inverse map f⁻¹ always exists; it may just not be a
function! Do you remember the properties of a function?
Given a probability space (E, F, P ) and a measurable space
(R, FR ), a random-variable xt is a measurable map xt : E → R
that maps elements of E to the real numbers R.
Random variable
Note: The definition of random variable is very intuitive!
Note: measurability of xt : E → R implies that we can assign
probabilities to each interval R ⊆ R of the real line,
\[ P_R(R) = P\big(x_t^{-1}(R)\big) = P\big(\{e \in E : x_t(e) \in R\}\big), \]
and we obtain a new probability space (R, F_R, P_R).
Note: we can now define the cumulative distribution function F
that you know so well as
\[ F(a) = P_R(x_t \le a) \quad \forall\, a \in R. \]
Important: xt is a random variable. xt (e) ∈ R is the
realization of the random variable produced by event e ∈ E.
Random vectors and random elements
Note: The concept of random variable is easy to generalize!
Given a probability space (E, F, P ) and a measurable space
(Rn , FRn ) with n ∈ N, an n-variate random-vector xt is a
measurable map xt : E → Rn that maps elements of E to Rn .
Given a probability space (E, F, P ) and the measurable space
(A, FA ), a random-element at taking values in A is a measurable
map at : E → A that maps elements of E to A.
Is this a random variable? Borel σ-algebra
Important: definition of random variable depends on the
σ-algebra that one is using!
Question: Consider the case where xt is a normal random
variable xt ∼ N(0, σ²). Is xt² also a random variable? How
about exp(xt)?
Answer: Yes, if we use the Borel σ-algebra! (Émile Borel)
Given a set A, the Borel σ-algebra B_A is the smallest σ-algebra
containing all open sets of A.
Is this a random variable? Continuous functions
Important: all continuous functions are measurable under the
Borel σ-algebra! Any continuous transformation f (xt ) of a
random variable xt is also a random variable!
Note: It is obvious that all continuous functions are
measurable! Just look at the definition of continuous function.
Let (A, T_A) and (B, T_B) be topological spaces. A function
f : A → B is said to be continuous if its inverse f⁻¹ maps open
sets to open sets; i.e. if for every b ∈ T_B we have f⁻¹(b) ∈ T_A.
What is a probability model?
Question: What exactly is a model?
Example: Given T tosses of a coin, it is reasonable to suppose
that x1 , ..., xT are realizations of T Bernoulli random variables,
xt ∼ Bern(θ) with unknown probability parameter θ ∈ [0, 1].
Important: Each θ defines a probability distribution for the
random vector (x1, ..., xT) taking values in R^T. Our model is a
collection of probability distributions on R^T.
This definition of model is the one you have been
always using. Even if you did not realize it!
What is a probability model?
Question: What exactly is a model?
Example 2: Gaussian linear AR(1) model,
xt = α + β xt−1 + εt , εt ∼ N(0, σε²) , ∀ t ∈ Z
Important: Each θ = (α, β, σε²) defines a distribution for the
time series {xt}t∈Z. Our model is a collection of probability
distributions on R^∞.
This definition of model is the one you have been
always using. Even if you did not realize it!
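As a small illustration (not part of the original slides): simulating the Gaussian AR(1) under two different parameter values θ = (α, β, σε²) produces draws from two different members of the model:

import numpy as np

def simulate_ar1(alpha, beta, sigma_eps, T=200, seed=0):
    """Draw one path from the Gaussian AR(1): x_t = alpha + beta*x_{t-1} + eps_t."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    x[0] = alpha / (1 - beta)                  # start at the unconditional mean (|beta| < 1)
    for t in range(1, T):
        x[t] = alpha + beta * x[t - 1] + rng.normal(0, sigma_eps)
    return x

# two different theta = (alpha, beta, sigma_eps): two different probability distributions
path_a = simulate_ar1(alpha=0.0, beta=0.5, sigma_eps=1.0)
path_b = simulate_ar1(alpha=1.0, beta=0.9, sigma_eps=0.5)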
Probability model
Given a measurable space (E, F) and a parameter space Θ, a
probability model is a collection PΘ := {Pθ , θ ∈ Θ} of
probability measures defined on F.
Given the measurable space (R^∞, F_{R^∞}) and a parameter space
Θ, a probability model is a collection PΘ := {Pθ , θ ∈ Θ} of
probability measures defined on F_{R^∞}.
Some more useful definitions...
A probability model PΘ := {Pθ , θ ∈ Θ} is said to be:
‘parametric’ if the parameter space Θ is finite dimensional;
‘nonparametric’ if Θ is infinite dimensional;
‘semi-parametric’ if Θ = Θ1 × Θ2 where Θ1 is finite
dimensional and Θ2 is infinite dimensional;
‘semi-nonparametric’ if the parameter space ΘT is indexed by the
sample size T, with ‘sieves’ {ΘT}T∈N of increasing dimension.
Given a measurable space (E, F) and two parametric models
PΘ := {Pθ , θ ∈ Θ} and P*_{Θ*} := {P*_{θ*} , θ* ∈ Θ*}, we say that model PΘ
nests model P*_{Θ*} if and only if P*_{Θ*} ⊆ PΘ.