
Lecture 1 - Statistical Challenges for Prediction

Statistical Learning (CFAS420)

Alex Gibberd

Lancaster University

18th Feb 2020


Outline

Learning Outcomes:
I Gain awareness of challenges in real-world statistical modelling
I Understand the concepts of prediction error, bias and variance
I Have some intuition for the degrees-of-freedom in a model
I Recognise the principles of cross-validation and information criteria

2
Traditional Assumptions

I Data is drawn independently and identically (i.i.d.) from a population

I Independence: the probability of one observation does not depend on
  the others

    $$P[X_i = x_i \mid X_j = x_j] = P[X_i = x_i]$$
    $$P[X_i = x_i,\, X_j = x_j] = P[X_i = x_i]\, P[X_j = x_j]$$

I Identical sampling: the probability distribution is not a function of the
  observation index/position

    $$P[X_i = x_i] = P[X_j = x_j]$$

Statistics in the Real World 4


Linear Regression

I Linear Regression: predict an output Y ∈ R from a set of
  covariates (or features) X_1, . . . , X_p

    $$Y = f_{\mathrm{lin}}(X; \beta) + \varepsilon \qquad (1)$$

    $$f_{\mathrm{lin}}(X; \beta) := \alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = \alpha + \sum_{j=1}^{p} \beta_j X_j .$$

I Noise source $\varepsilon \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$
I The outcome Y is a random variable
I When Xi are random variables, we call this random design
I When x1 , . . . , xp are fixed values, we call this fixed design

Statistics in the Real World 5


Linear Regression (Least Squares)

I In reality, we only have observations $\{y_i, x_{1i}, \ldots, x_{pi}\}_{i=1}^{n}$.
I We posit that a linear model (1) exists and try to estimate its
  parameters $\{\beta_j\}_{j=1}^{p}$
I Least-squares estimator:

    $$\hat{\beta} := \arg\min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} \big( y_i - f_{\mathrm{lin}}(x_{1i}, \ldots, x_{pi}; \beta) \big)^2 \qquad (2)$$

  where $\arg\min_x [f(x)]$ is the value of x which minimises the function f(x)
  (a short R sketch of this estimator follows below).
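A minimal R sketch of (1) and (2) on simulated data; the sample size, number of covariates and coefficient values below are illustrative choices, not taken from the slides:

```r
# Simulate n observations from the linear model (1), then fit by least squares (2).
set.seed(1)
n <- 100
p <- 3
X <- matrix(rnorm(n * p), nrow = n)              # random-design covariates X_1, ..., X_p
alpha_true <- 1                                  # illustrative intercept
beta_true  <- c(2, -1, 0.5)                      # illustrative slopes
y <- drop(alpha_true + X %*% beta_true) + rnorm(n, sd = 1)   # noise ~ N(0, 1)

fit <- lm(y ~ X)    # lm() solves the least-squares problem (2)
coef(fit)           # estimates of (alpha, beta_1, ..., beta_p)
```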

Statistics in the Real World 6


Aside: Vectors and Norms
I For a vector $x \in \mathbb{R}^n$, the function $\|x\|_2$ is known as the $\ell_2$
  (pronounced "ell-two") norm of x. It is defined as

    $$\|x\|_2 := \sqrt{\sum_{i=1}^{n} x_i^2} .$$

I Norms are very useful as they allow us to easily describe the size of a
  vector. For instance, the least-squares problem (2) can be written as

    $$\hat{\beta} := \arg\min_{\beta} \; \frac{1}{n} \, \| y - X^{\top} \beta \|_2^2 ,$$

  where $y - X^{\top}\beta$ is a vector.
I More generally, we define the $\ell_q$ norm as

    $$\|x\|_q := \left( \sum_{i=1}^{n} |x_i|^q \right)^{1/q} .$$
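These norms each take one line in R; the vector below is just an example:

```r
x <- c(3, -4, 1)

l2_norm <- sqrt(sum(x^2))                       # ||x||_2
lq_norm <- function(x, q) sum(abs(x)^q)^(1/q)   # general l_q norm

l2_norm
lq_norm(x, 1)   # l_1 norm: sum of absolute values
lq_norm(x, 2)   # matches l2_norm above
```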

Statistics in the Real World 7


Some Consequences of False Assumptions

I In the real world, the assumption that our linear relationship (i.e. (1))
  holds across the whole population is almost certainly false
I Consider that our population (from which we observe {y, x}) has a linear
  relationship with the same slope β1 , but different intercepts αA and
  αB , where the intercept term depends on x

Statistics in the Real World 8


Simpson’s Paradox

I The true β is negative; however, when the data is taken as a whole,
  we obtain β̂ > 0
I When the data is partitioned into the correct regions, we find β̂ < 0
I This reversal of the regression coefficient is known as Simpson's Paradox
I It is a consequence of non-homogeneity in the distribution of (Yi , Xi )
  (see the small simulation below)
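A small R simulation of this setting, assuming two sub-populations A and B that share a negative slope but have different intercepts; all of the numbers are illustrative:

```r
set.seed(42)
n_per_group <- 100
beta_true   <- -1        # shared (negative) slope in both sub-populations

# Sub-population A: small x, low intercept; sub-population B: large x, high intercept
x_A <- rnorm(n_per_group, mean = 1)
x_B <- rnorm(n_per_group, mean = 5)
d <- data.frame(
  x     = c(x_A, x_B),
  y     = c(0 + beta_true * x_A, 8 + beta_true * x_B) + rnorm(2 * n_per_group, sd = 0.5),
  group = rep(c("A", "B"), each = n_per_group)
)

coef(lm(y ~ x, data = d))["x"]                          # pooled fit: slope typically comes out positive
coef(lm(y ~ x, data = d, subset = group == "A"))["x"]   # within A: close to beta_true = -1
coef(lm(y ~ x, data = d, subset = group == "B"))["x"]   # within B: close to beta_true = -1
```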
Statistics in the Real World 9
Generalisation

I Ideally, we want our models to generalise to any new examples of data
I However, as we observe more data, we typically sample across more
  sub-populations
I Our modelling assumptions may only be locally valid
I For instance, in the Simpson's paradox example, the linear model
  assumptions were valid, but only in some localised region
I The best models will be flexible enough to adapt to non-homogeneity,
  but simple enough to be able to predict new observations

Statistics in the Real World 10


Prediction Error

I If we observe a new sample of covariates $x_1^{(\mathrm{new})}, \ldots, x_p^{(\mathrm{new})}$, we hope
  that the prediction
    $$\hat{Y} := f_{\mathrm{lin}}(x^{(\mathrm{new})}; \hat{\beta})$$
  is close to the corresponding new outcome $y^{(\mathrm{new})}$
I Usually, the new outcome is only revealed later
I It is usually something we are interested in knowing in order to take
  action, e.g. a stock price, or whether someone has cancer or not
I Consider that $y^{(\mathrm{new})}$ is drawn from the random variable Y
I The expected prediction error can be written as¹

    $$e_{\mathrm{pred}} := \mathbb{E}\big[ (Y - \hat{f}_{\mathrm{lin}}(X))^2 \big]$$

  where we take the expectation over (Y, X) and operate in the random
  design case

¹ Shorthand: $\hat{f}_{\mathrm{lin}}(X) := f_{\mathrm{lin}}(X_1, \ldots, X_p; \hat{\beta}_1, \ldots, \hat{\beta}_p)$
The Bias-Variance Trade-off 12
Prediction Error (Conditional on X = x)

I Let’s try to understand e_pred in the case where we observe X = x:

    $$\mathbb{E}\big[(Y - \hat{f}_{\mathrm{lin}}(X))^2 \mid X = x\big] = \sigma^2 + \underbrace{\mathbb{E}\big[(f_{\mathrm{lin}}(x) - \hat{f}_{\mathrm{lin}}(x))^2\big]}_{\mathrm{Risk}(\hat{f}(x))} .$$

I Can we ever expect to predict Y with 100% accuracy?
I Generally, the answer is no, since σ² > 0
I Let’s decompose the second term:

    $$\mathbb{E}\big[(Y - \hat{f}_{\mathrm{lin}}(X))^2 \mid X = x\big] = \sigma^2 + \underbrace{\big(f_{\mathrm{lin}}(x) - \mathbb{E}[\hat{f}_{\mathrm{lin}}(x)]\big)^2}_{\mathrm{Bias}^2(\hat{f}(x))} + \underbrace{\mathbb{E}\big[(\hat{f}_{\mathrm{lin}}(x) - \mathbb{E}[\hat{f}_{\mathrm{lin}}(x)])^2\big]}_{\mathrm{Var}(\hat{f}(x))}$$

The Bias-Variance Trade-off 13


Prediction Error (Marginalising over X)

I To remove the dependence on x in (Y − f̂(x))², we can consider
  integrating across all possible values of x
I This is known as marginalising

    $$e_{\mathrm{pred}} := \mathbb{E}\big[(Y - \hat{f}_{\mathrm{lin}}(X))^2\big] = \int_{\mathbb{R}^p} \sigma^2 p_X(x)\, dx + \int_{\mathbb{R}^p} \mathrm{Bias}^2(\hat{f}(x))\, p_X(x)\, dx + \int_{\mathbb{R}^p} \mathrm{Var}(\hat{f}(x))\, p_X(x)\, dx ,$$

  where $p_X(x)$ is the probability density function for $X \in \mathbb{R}^p$
I Note: since $\int_{\mathbb{R}^p} p_X(x)\, dx = 1$, what is $\int_{\mathbb{R}^p} \sigma^2 p_X(x)\, dx$?
I Don’t worry if you don’t follow all the maths here!
I Where we know the ground-truth distributions (i.e. the synthetic
  setting), we can easily approximate the above quantities, as sketched
  below (you will do so in the lab)
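A hedged R sketch of that kind of Monte Carlo approximation: because we choose the ground truth ourselves, we can repeatedly re-sample, refit, and average to estimate the squared bias and variance of f̂ at a fixed point x₀. The generating model, sample size and x₀ below are illustrative choices:

```r
set.seed(1)
n <- 50; sigma <- 1; beta_true <- 2; x0 <- 1.5   # illustrative ground truth: Y = 2 X + eps
n_rep <- 2000                                    # number of Monte Carlo replications

f_hat_at_x0 <- replicate(n_rep, {
  x   <- runif(n, -2, 2)                         # fresh sample from the known generating distribution
  y   <- beta_true * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  unname(predict(fit, newdata = data.frame(x = x0)))   # fitted value f_hat(x0) for this data set
})

bias_sq  <- (mean(f_hat_at_x0) - beta_true * x0)^2   # squared bias at x0
variance <- var(f_hat_at_x0)                         # variance of f_hat(x0) across replications
c(bias_sq = bias_sq, variance = variance,
  pred_error = sigma^2 + bias_sq + variance)         # sigma^2 + Bias^2 + Var
```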

The Bias-Variance Trade-off 14


Bias-Variance as a Function of (p, n)

I Consider that we estimate a linear model which includes up to p̂ ≤ p
  covariates, i.e. X_1 , . . . , X_p̂
I How do we expect the three error quantities to change as a function
  of p̂?
I The specific behaviour depends on the true β_1 , . . . , β_p
I However, generally we get something like the following

[Figure: Bias-Variance Trade-off, showing bias, prediction error and standard deviation (variance) plotted against p_hat]
Degrees of Freedom 16
Bias-Variance as a Function of (p, n)

I The shape of the function depends on σ², n and p, alongside the true
  β = (3, 2, 1, 0.1, . . . , 0.1) ∈ Rp
I What happens if we fix σ² and p, but let n → ∞?

[Figure: Bias-Variance Trade-off, showing bias, prediction error and standard deviation (variance) against p_hat for several values of n]

Degrees of Freedom 17
Degrees of Freedom

I Bias usually decreases as a function of increasing p; however, the
  variance increases
I In a linear model, the number K of non-zero parameters β1 , . . . , βK
  represents the degrees-of-freedom of the model
I To balance bias and variance, it is often useful, when comparing two
  models, to introduce a penalty on the loss function
I We can try to maximise a function which balances good model fit
  against a penalty on complexity:

    $$\text{best model} := \underset{\text{over set of models}}{\arg\max} \Big[ \underbrace{L_{\text{model}}(x; y)}_{\text{Likelihood}} - |K_{\text{model}}| \times \text{penalty}(n) \Big]$$

Degrees of Freedom 18
Information Criteria for Linear Regression

I Generally, we can use the log-likelihood in place of $L_{\text{model}}(x; y)$
I We then minimise a complexity-penalised negative log-likelihood
I Two common approaches:
  – Akaike Information Criterion (AIC):

      $$\text{best model} := \underset{\text{over set of models}}{\arg\min} \big[ -2 \log\{L_{\text{model}}(x; y)\} + 2 |K_{\text{model}}| \big]$$

  – Bayesian Information Criterion (BIC):

      $$\text{best model} := \underset{\text{over set of models}}{\arg\min} \big[ -2 \log\{L_{\text{model}}(x; y)\} + |K_{\text{model}}| \log(n) \big]$$
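In R, the AIC() and BIC() functions compute these criteria directly from fitted lm objects, so nested models with different numbers of covariates can be compared in one line. The simulated data below are purely illustrative:

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)   # x3 is irrelevant by construction

fits <- list(m1 = lm(y ~ x1),
             m2 = lm(y ~ x1 + x2),
             m3 = lm(y ~ x1 + x2 + x3))

sapply(fits, AIC)   # lower is better; each extra parameter costs 2
sapply(fits, BIC)   # each extra parameter costs log(n), so BIC favours smaller models
```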

Degrees of Freedom 19
Cross-Validation

I For many statistical models we may not know the theoretical
  degrees-of-freedom of the fitted model
I In real life we do not have the luxury of re-sampling from the
  ground-truth generating distribution, i.e. we do not know the real
  f(β), β, or error structure ε
I Cross-validation is a useful method to approximate (or estimate) the
  prediction error of a model from observations
I Our aim is to use the observations (X, y) to form an estimate
  ê_pred(model)
I We can then pick the best model, the one that minimises ê_pred

Cross-Validation 21
Cross-Validation (Splitting Data)

I Split the data set, indexed i = 1, . . . , n, into k = 1, . . . , K different
  groups (or folds)
I Denote these sets of data as C_k
I The groups should be (roughly) the same size, n_k , and cover the full data,
  i.e. C_1 ∪ C_2 ∪ · · · ∪ C_K = {1, . . . , n} (see the sketch below)
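The caret package used in the lab builds such folds directly; a minimal sketch, assuming any outcome vector y and K = 5:

```r
library(caret)

set.seed(1)
y <- rnorm(100)                  # any outcome vector of length n
folds <- createFolds(y, k = 5)   # list of K index sets C_1, ..., C_K covering 1, ..., n
sapply(folds, length)            # fold sizes n_k (roughly equal)
```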

Cross-Validation 22
Cross-Validation

I 1. Denote by (y^(/k), x^(/k)) the observations in all groups
  excluding group k
I 2. Estimate the model parameters on this data:

    $$(y^{(/k)}, x^{(/k)}) \;\overset{f_{\text{model}}(\beta)}{\Longrightarrow}\; \hat{\beta}$$

I 3. Let the data (y^(k), x^(k)) be the observations in group k
I 4. Test the prediction ability of f(β̂; x^(k)) on this test data:

    $$\hat{e}_{\mathrm{pred}}^{(k)} := \frac{1}{n_k} \big\| y^{(k)} - f(\hat{\beta}; x^{(k)}) \big\|_2^2$$

I 5. Repeat for folds k = 1, . . . , K and average the prediction errors:

    $$\hat{e}_{\mathrm{pred}} := \frac{1}{K} \sum_{k=1}^{K} \hat{e}_{\mathrm{pred}}^{(k)} .$$
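Putting steps 1 to 5 together, a hedged R sketch of K-fold cross-validation for a simple linear model; the simulated data, K = 5 and variable names are illustrative (the lab builds the folds with caret instead):

```r
set.seed(1)
n <- 200; K <- 5
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

fold_id <- sample(rep(1:K, length.out = n))   # randomly assign each observation to a fold C_k

e_pred_k <- sapply(1:K, function(k) {
  train <- d[fold_id != k, ]                  # (y^(/k), x^(/k)): every fold except k
  test  <- d[fold_id == k, ]                  # (y^(k),  x^(k)):  held-out fold k
  fit   <- lm(y ~ x, data = train)            # step 2: estimate beta_hat without fold k
  mean((test$y - predict(fit, newdata = test))^2)   # step 4: fold-k prediction error
})

e_pred_hat <- mean(e_pred_k)                  # step 5: average over the K folds
e_pred_hat
```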

Cross-Validation 23
Important Considerations

I How good an estimate ê_pred is will depend on how well the
  assumptions behind cross-validation hold
I In particular, we have to pay attention to how we split the data
I If the data are not independently, OR identically, drawn, then the
  distribution of samples within C_k may not be representative of those
  in C_k′
I We can mitigate these issues to some extent by making C_k large
I This can have large consequences, cf. back-testing for financial
  portfolios: what performs well under some economic conditions
  performs badly in others

Cross-Validation 24
Summary

I Introduced Simpson's paradox to highlight issues when modelling
  assumptions break down (non-homogeneity)
I Discussed the concepts of risk, bias, and variance
I Demonstrated how increasing the degrees-of-freedom can be linked with
  overfitting (the bias-variance trade-off)
I How to compare linear models via information criteria
I The basic concepts of cross-validation

Cross-Validation 25
In The Lab

1. How to test for linear relationships, and demonstrate that data are
   typically non-homogeneous at large scale
2. Demonstrate the bias-variance trade-off when we know the
   generative distribution (we know the true distribution which
   produces X, Y)
3. Learn how to easily split data sets in R using caret
4. Implement and use cross-validation to generate estimates of the
   prediction error

Cross-Validation 26
