Lecture 1 - Statistical Challenges for Prediction
Statistical Learning (CFAS420)
Alex Gibberd
Lancaster University
18th Feb 2020
Outline
Learning Outcomes:
I Gain awareness of challenges in real-world statistical modelling
I Understand the concepts of prediction error, bias and variance
I Have some intuition for the degrees-of-freedom in a model
I Recognise the principles of cross-validation and information criteria
2
Traditional Assumptions
I Data is drawn independently and identically (i.i.d) from a population
I Independence: Probability of one observation not dependent on
others
P[Xi = xi | Xj = xj ] = P[Xi = xi ]
P[Xi = xi , Xj = xj ] = P[Xi = xi ]P[Xj = xj ]
I Identical sampling: Probability distribution is not a function of
observation index/position
P[Xi = xi ] = P[Xj = xj ]
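As a quick aside, here is a minimal R sketch (purely synthetic; the sample size and drift pattern are assumptions for illustration) contrasting an i.i.d. sample with one whose mean changes with the observation index, so the "identical" assumption fails:

set.seed(1)
n <- 200

# i.i.d. sample: every observation comes from the same N(0, 1) distribution
x_iid <- rnorm(n, mean = 0, sd = 1)

# Non-identical sample: the mean drifts with the observation index i,
# so the distribution of X_i depends on i
drift   <- seq(0, 3, length.out = n)
x_drift <- rnorm(n, mean = drift, sd = 1)

# Compare the first and second halves of each sample
c(mean(x_iid[1:(n/2)]),   mean(x_iid[(n/2 + 1):n]))    # similar means
c(mean(x_drift[1:(n/2)]), mean(x_drift[(n/2 + 1):n]))  # clearly different means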
Statistics in the Real World 4
Linear Regression
I Linear Regression: Predict an output Y ∈ R from a set of
covariates (or features) X1 , . . . , Xp
Y = flin(X; β) + ε                                                    (1)

flin(X; β) := α + β1 X1 + β2 X2 + · · · + βp Xp = α + ∑_{j=1}^{p} βj Xj .

I Noise source: ε ∼ N(0, σ²), drawn i.i.d.
I The outcome Y is a random variable
I When Xi are random variables, we call this random design
I When x1 , . . . , xp are fixed values, we call this fixed design
Statistics in the Real World 5
Linear Regression (Least Squares)
I In reality, we only have observations {yi, x1i, . . . , xpi} for i = 1, . . . , n.
I We posit that a linear model (1) exists and try to estimate its parameters {βj}_{j=1}^{p}
I Least-squares estimator:

β̂ := arg min_β (1/n) ∑_{i=1}^{n} ( yi − flin(x1i, . . . , xpi; β) )² ,          (2)

where arg min_x [f(x)] is the value of x which minimises the function f(x).
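A minimal R sketch of the least-squares fit (2) using lm(); the data frame and the "true" coefficients below are assumptions made purely for illustration:

set.seed(1)
n  <- 100
# Synthetic covariates and outcome from an assumed linear model plus noise
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.5)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# lm() computes the least-squares estimates of (alpha, beta_1, beta_2)
fit <- lm(y ~ x1 + x2, data = dat)
coef(fit)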
Statistics in the Real World 6
Aside: Vectors and Norms
I For a vector x ∈ Rn, the function ‖x‖2 is known as the ℓ2 (pronounced "ell-two") norm of x; it is defined as

‖x‖2 := √( ∑_{i=1}^{n} xi² ) .

I Norms are very useful as they allow us to easily describe the size of a vector. For instance, the least-squares problem (2) can be written as

β̂ := arg min_β (1/n) ‖ y − X⊤β ‖2² ,

where y − X⊤β is a vector.
I More generally, we define the ℓq norm as

‖x‖q := ( ∑_{i=1}^{n} |xi|^q )^{1/q} .
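A small R sketch of the ℓq norm exactly as defined above (the helper name lq_norm is just for illustration):

# l_q norm of a vector x: (sum_i |x_i|^q)^(1/q)
lq_norm <- function(x, q = 2) {
  sum(abs(x)^q)^(1 / q)
}

x <- c(3, -4)
lq_norm(x, q = 2)                          # 5: the usual Euclidean (l_2) length
lq_norm(x, q = 1)                          # 7: the l_1 norm
all.equal(lq_norm(x, 2), sqrt(sum(x^2)))   # agrees with the l_2 definition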
Statistics in the Real World 7
Some Consequences of False Assumptions
I In the real world, the assumption that our linear relationship (1) holds across the whole population is almost certainly false
I Consider that the population we observe {y, x} from has a linear relationship with the same slope β1 everywhere, but with different intercepts αA and αB, where the intercept depends on the region of x
Statistics in the Real World 8
Simpson’s Paradox
I The true slope β1 is negative; however, when the data are taken as a whole, we obtain β̂ > 0
I When the data are partitioned into the correct regions, we find β̂ < 0
I This reversal of the regression coefficient is known as Simpson's Paradox.
I This is a consequence of non-homogeneity in the distribution of
(Yi , Xi )
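A short R sketch of this situation with two assumed sub-populations that share a negative slope but have different intercepts; pooling the data flips the sign of the fitted slope:

set.seed(1)
n <- 100
# Group A: low intercept, small x values; Group B: high intercept, large x values
xA <- runif(n, 0, 2); yA <- 1 - xA + rnorm(n, sd = 0.2)
xB <- runif(n, 3, 5); yB <- 6 - xB + rnorm(n, sd = 0.2)

pooled <- data.frame(x = c(xA, xB), y = c(yA, yB),
                     group = rep(c("A", "B"), each = n))

coef(lm(y ~ x, data = pooled))                        # pooled slope: positive
coef(lm(y ~ x, data = subset(pooled, group == "A")))  # within-group slope: negative
coef(lm(y ~ x, data = subset(pooled, group == "B")))  # within-group slope: negative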
Statistics in the Real World 9
Generalisation
I Ideally we want our models to generalise to any new examples of
data
I However, as we observe more data, we typically sample across more
sub-populations
I Our modelling assumptions may only be locally valid
I For instance, in the Simpson's paradox example, the linear model assumptions were valid, but only within each localised region.
I The best models will be flexible enough to adapt to
non-homogeneity, but simple enough to be able to predict new
observations
Statistics in the Real World 10
Prediction Error
I If we observe a new sample of covariates x1^(new), . . . , xp^(new), we hope that the prediction

Ŷ := flin(x^(new); β̂)

is close to the corresponding new outcome y^(new) (which includes its own noise term ε)
I The new outcome may only be revealed later
I It is usually something we are interested in knowing in order to take action, e.g. a stock price, or whether someone has cancer or not.
I Consider that y(new) is drawn from the random variable Y
I The expected prediction error can be written as¹

e_pred := E[ (Y − f̂lin(X))² ] ,

where we take the expectation over (Y, X) and operate in the random-design case
¹ I shortened the notation to f̂lin(X) := flin(X1, . . . , Xp; β̂1, . . . , β̂p)
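Continuing the earlier lm() sketch (fit, dat and the covariate names x1, x2 are assumed from that example), a point prediction for a new observation is obtained with predict():

# A new observation of the covariates (values chosen arbitrarily for illustration)
x_new <- data.frame(x1 = 0.3, x2 = -1.2)

# Point prediction f_lin(x_new; beta_hat); the realised y_new will also include
# a fresh noise term, so we do not expect an exact match
y_hat <- predict(fit, newdata = x_new)
y_hat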
The Bias-Variance Trade-off 12
Prediction Error (Conditional on X = x)
I Let's try to understand e_pred in the case where we observe X = x:

E[ (Y − f̂lin(X))² | X = x ] = σ² + E[ (flin(x) − f̂lin(x))² ] ,

where the second term is the risk, Risk(f̂(x)).
I Can we ever expect to predict Y with 100% accuracy?
I Generally, the answer is no, since typically σ² > 0
I Let's decompose the second term:

E[ (Y − f̂lin(X))² | X = x ] = σ² + ( flin(x) − E[f̂lin(x)] )² + E[ ( f̂lin(x) − E[f̂lin(x)] )² ] ,

where the middle term is Bias²(f̂(x)) and the final term is Var(f̂(x)).
The Bias-Variance Trade-off 13
Prediction Error (Marginalising over X)
I To remove the dependence on x in (Y − f̂(x))², we can consider integrating across all possible values of x
I This is known as marginalising:

e_pred := E[ (Y − f̂lin(X))² ] = ∫_{Rp} σ² pX(x) dx + ∫_{Rp} Bias²(f̂(x)) pX(x) dx + ∫_{Rp} Var(f̂(x)) pX(x) dx ,

where pX(x) is the probability density function for X ∈ Rp
I Note: since ∫_{Rp} pX(x) dx = 1, what is ∫_{Rp} σ² pX(x) dx?
I Don’t worry if you don’t follow all the maths here!!
I Where we know the ground-truth distributions (i.e. the synthetic setting), we can easily approximate the above quantities by simulation (you will do so in the lab); a rough sketch is given below
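A rough Monte Carlo sketch of these quantities at a single (assumed) evaluation point x0; the true coefficients, noise level and sample size are all assumptions for the synthetic setting. Averaging over many random draws of x0 would approximate the marginal versions above:

set.seed(1)
n     <- 50
sigma <- 1
beta  <- c(2, -1)                       # assumed true coefficients (no intercept)
f_true <- function(x) sum(beta * x)     # true regression function
x0    <- c(1, 0.5)                      # fixed evaluation point

B <- 2000                               # number of Monte Carlo replications
fhat_x0 <- replicate(B, {
  X   <- matrix(rnorm(n * 2), n, 2)     # fresh design
  y   <- drop(X %*% beta) + rnorm(n, sd = sigma)
  fit <- lm(y ~ X - 1)                  # refit the model on the new sample
  sum(coef(fit) * x0)                   # fitted value f_hat(x0)
})

bias2 <- (f_true(x0) - mean(fhat_x0))^2
vari  <- var(fhat_x0)
c(bias2 = bias2, variance = vari,
  pred_error = sigma^2 + bias2 + vari)  # sigma^2 + Bias^2 + Var at x0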
The Bias-Variance Trade-off 14
Bias-Variance as a Function of (p, n)
I Suppose we estimate a linear model which includes only the first p̂ ≤ p covariates, i.e. X1, . . . , Xp̂
I How do we expect the three error quantities to change as a function of p̂?
I The specific behaviour depends on the true β1, . . . , βp
I However, generally we get something like the figure below
[Figure: Bias-Variance Trade-off — Bias, Prediction Error and Std (Variance) plotted against p_hat.]
Degrees of Freedom 16
Bias-Variance as a Function of (p, n)
I The shape of the function depends on σ 2 , n and p alongside the true
β = (3, 2, 1, 0.1, . . . , 0.1) ∈ Rp
I What happens if we fix σ² and p, but let n → ∞?
[Figure: Bias-Variance Trade-off — Bias, Prediction Error and Std (Variance) against p_hat, with one panel per (increasing) sample size n.]
Degrees of Freedom 17
Degrees of Freedom
I Bias usually decreases as a function of increasing p; however, the variance increases.
I In a linear model, the number K of non-zero parameters β1, . . . , βK represents the degrees-of-freedom in the model.
I To balance bias and variance, it is often useful, when comparing two models, to introduce a penalty on the loss function.
I We can try to maximise a criterion which rewards good model fit and penalises complexity:

best model := arg max over set of models [ Lmodel(x; y) − |Kmodel| × penalty(n) ] ,

where Lmodel(x; y) is the likelihood of the model.
Degrees of Freedom 18
Information Criteria for Linear Regression
I Generally, we use the log-likelihood in place of Lmodel(x; y)
I We then minimise a complexity-penalised (negative) log-likelihood.
I Two common approaches:
– Akaike Information Criterion (AIC):

best model := arg min over set of models [ −2 log{Lmodel(x; y)} + 2|Kmodel| ]

– Bayesian Information Criterion (BIC):

best model := arg min over set of models [ −2 log{Lmodel(x; y)} + |Kmodel| log(n) ]
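In R, the functions AIC() and BIC() compute these criteria from a fitted model's log-likelihood. A small sketch comparing nested linear models on synthetic data (the data-generating coefficients below are assumptions for the example):

set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)   # x3 is irrelevant by construction

fits <- list(
  m1 = lm(y ~ x1),
  m2 = lm(y ~ x1 + x2),
  m3 = lm(y ~ x1 + x2 + x3)
)

# Lower is better; BIC penalises the extra (useless) parameter in m3 more heavily
sapply(fits, AIC)
sapply(fits, BIC)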
Degrees of Freedom 19
Cross-Validation
I For many statistical models we may not know the theoretical
degrees-of-freedom function
I In real life we do not have the luxury of re-sampling from the ground-truth generating distribution, i.e. we do not know the real f(β), the true β, or the error structure ε
I Cross-validation is a useful method to approximate (or estimate) the prediction error of a model from observations
I Our aim is to use the observations (X, y) to form an estimate ê_pred(model) of the prediction error
I We can then pick the model which minimises ê_pred
Cross-Validation 21
Cross-Validation (Splitting Data)
I Split the data-set indexed i = 1, . . . , n into k = 1, . . . , K different
groups (or folds)
I Denote these sets of data as Ck .
I These groups should be (roughly) the same size, nk, and cover the full data, i.e. C1 ∪ C2 ∪ · · · ∪ CK = {1, . . . , n}.
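A short sketch of fold creation with caret::createFolds() (caret is the package used in the lab); the outcome vector y below is an assumed stand-in for real data:

library(caret)

set.seed(1)
y <- rnorm(100)    # assumed outcome vector, n = 100

# createFolds() returns a list of index sets C_1, ..., C_K of roughly equal size
folds <- createFolds(y, k = 5, list = TRUE, returnTrain = FALSE)

sapply(folds, length)   # each fold contains about n/K observations
sort(unlist(folds))     # together the folds cover 1, ..., n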
Cross-Validation 22
Cross-Validation
I 1. Let us denote by (y^(/k), x^(/k)) the observations in all groups excluding group k
I 2. Estimate the model parameters on this data:

(y^(/k), x^(/k))  =⇒  β̂ ,   by fitting fmodel(β)

I 3. Let the data (y^(k), x^(k)) be the observations in group k
I 4. Test the prediction ability of f(β̂; x^(k)) on this test data:

ê_pred^(k) := (1/nk) ‖ y^(k) − f(β̂; x^(k)) ‖2²

I 5. Repeat for folds k = 1, . . . , K and average the prediction errors:

ê_pred := (1/K) ∑_{k=1}^{K} ê_pred^(k) .
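A base-R sketch of steps 1-5 for a simple linear model; the synthetic data and the choice K = 5 are assumptions for illustration:

set.seed(1)
n <- 100; K <- 5
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(y = y, x = x)

# Randomly partition the indices 1, ..., n into K folds C_1, ..., C_K
folds <- split(sample(1:n), rep(1:K, length.out = n))

cv_err <- sapply(folds, function(test_idx) {
  fit   <- lm(y ~ x, data = dat[-test_idx, ])      # fit on all folds except k
  y_hat <- predict(fit, newdata = dat[test_idx, ]) # predict on fold k
  mean((dat$y[test_idx] - y_hat)^2)                # e_hat_pred^(k)
})

mean(cv_err)   # e_hat_pred, averaged over the K folds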
Cross-Validation 23
Important Considerations
I How good an estimate ê_pred is will depend on how well the assumptions behind cross-validation hold
I In particular, we have to pay attention to how we split the data
I If the data are not independently or identically drawn, then the distribution of samples within Ck may not be representative of those in Ck′
I We can mitigate these issues to some extent by making Ck large.
I This can have large consequences, cf. back-testing of financial portfolios: what performs well under some economic conditions may perform badly in others.
Cross-Validation 24
Summary
I Introduced Simpson's paradox to highlight issues when assumptions break down (non-homogeneity)
I Discussed the concepts of risk, bias, and variance.
I Demonstrated how increasing degrees-of-freedom can be linked with
overfitting (bias-variance trade-off)
I How to compare linear models via information criteria
I The basic concepts of cross-validation
Cross-Validation 25
In The Lab
1. How to test for linear relationships. Demonstrate that data is
typically non-homogeneous at large scale.
2. Demonstrate the bias-variance trade-off when we know the
generative distribution (we know the true distribution which
produces X, Y)
3. Learn how to easily split data-sets in R using caret
4. Implement and use cross-validation to generate estimates of the
prediction error
Cross-Validation 26