Lecture 1 - Statistical Challenges for Prediction
Statistical Learning (CFAS420)
Alex Gibberd
Lancaster University
18th Feb 2020
Outline
Learning Outcomes:
I Gain awareness of challenges in real-world statistical modelling
I Understand the concepts of prediction error, bias and variance
I Have some intuition for the degrees-of-freedom in a model
I Recognise the principles of cross-validation and information criteria
2
Traditional Assumptions
I Data is drawn independently and identically (i.i.d) from a population
I Independence: Probability of one observation not dependent on
others
P[Xi = xi | Xj = xj ] = P[Xi = xi ]
P[Xi = xi , Xj = xj ] = P[Xi = xi ]P[Xj = xj ]
I Identical sampling: Probability distribution is not a function of
observation index/position
P[Xi = xi ] = P[Xj = xj ]
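As a quick aside, here is a minimal R sketch (purely synthetic; the sample size and drift pattern are assumptions for illustration) contrasting an i.i.d. sample with one whose mean changes with the observation index, so the "identical" assumption fails:

set.seed(1)
n <- 200

# i.i.d. sample: every observation comes from the same N(0, 1) distribution
x_iid <- rnorm(n, mean = 0, sd = 1)

# Non-identical sample: the mean drifts with the observation index i,
# so the distribution of X_i depends on i
drift   <- seq(0, 3, length.out = n)
x_drift <- rnorm(n, mean = drift, sd = 1)

# Compare the first and second halves of each sample
c(mean(x_iid[1:(n/2)]),   mean(x_iid[(n/2 + 1):n]))    # similar means
c(mean(x_drift[1:(n/2)]), mean(x_drift[(n/2 + 1):n]))  # clearly different means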
Statistics in the Real World 4
Linear Regression
I Linear Regression: Predict an output Y ∈ R from a set of
covariates (or features) X1 , . . . , Xp
Y = flin(X; β) + ε                                                    (1)

flin(X; β) := α + β1 X1 + β2 X2 + · · · + βp Xp = α + ∑_{j=1}^{p} βj Xj .

I Noise source: ε ∼ N(0, σ²), drawn i.i.d.
I The outcome Y is a random variable
I When Xi are random variables, we call this random design
I When x1 , . . . , xp are fixed values, we call this fixed design
Statistics in the Real World 5
Linear Regression (Least Squares)
I In reality, we only have observations {yi, x1i, . . . , xpi} for i = 1, . . . , n.
I We posit that a linear model (1) exists and try to estimate its parameters {βj}_{j=1}^{p}
I Least-squares estimator:

β̂ := arg min_β (1/n) ∑_{i=1}^{n} ( yi − flin(x1i, . . . , xpi; β) )² ,          (2)

where arg min_x [f(x)] is the value of x which minimises the function f(x).
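A minimal R sketch of the least-squares fit (2) using lm(); the data frame and the "true" coefficients below are assumptions made purely for illustration:

set.seed(1)
n  <- 100
# Synthetic covariates and outcome from an assumed linear model plus noise
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.5)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# lm() computes the least-squares estimates of (alpha, beta_1, beta_2)
fit <- lm(y ~ x1 + x2, data = dat)
coef(fit)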
Statistics in the Real World 6
Aside: Vectors and Norms
I For a vector x ∈ Rn, the function ‖x‖2 is known as the ℓ2 (pronounced "ell-two") norm of x; it is defined as

‖x‖2 := √( ∑_{i=1}^{n} xi² ) .

I Norms are very useful as they allow us to easily describe the size of a vector. For instance, the least-squares problem (2) can be written as

β̂ := arg min_β (1/n) ‖ y − X⊤β ‖2² ,

where y − X⊤β is a vector.
I More generally, we define the ℓq norm as

‖x‖q := ( ∑_{i=1}^{n} |xi|^q )^{1/q} .
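A small R sketch of the ℓq norm exactly as defined above (the helper name lq_norm is just for illustration):

# l_q norm of a vector x: (sum_i |x_i|^q)^(1/q)
lq_norm <- function(x, q = 2) {
  sum(abs(x)^q)^(1 / q)
}

x <- c(3, -4)
lq_norm(x, q = 2)                          # 5: the usual Euclidean (l_2) length
lq_norm(x, q = 1)                          # 7: the l_1 norm
all.equal(lq_norm(x, 2), sqrt(sum(x^2)))   # agrees with the l_2 definition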
Statistics in the Real World 7
Some Consequences of False Assumptions
I In the real world, the assumption that our linear relationship (1) holds across the whole population is almost certainly false
I Consider that the population we observe {y, x} from has a linear relationship with the same slope β1 everywhere, but with different intercepts αA and αB, where the intercept depends on the region of x
Statistics in the Real World 8
Simpson’s Paradox
I The true slope β1 is negative; however, when the data are taken as a whole, we obtain β̂ > 0
I When the data are partitioned into the correct regions, we find β̂ < 0
I This reversal of the regression coefficient is known as Simpson's Paradox.
I This is a consequence of non-homogeneity in the distribution of
(Yi , Xi )
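A short R sketch of this situation with two assumed sub-populations that share a negative slope but have different intercepts; pooling the data flips the sign of the fitted slope:

set.seed(1)
n <- 100
# Group A: low intercept, small x values; Group B: high intercept, large x values
xA <- runif(n, 0, 2); yA <- 1 - xA + rnorm(n, sd = 0.2)
xB <- runif(n, 3, 5); yB <- 6 - xB + rnorm(n, sd = 0.2)

pooled <- data.frame(x = c(xA, xB), y = c(yA, yB),
                     group = rep(c("A", "B"), each = n))

coef(lm(y ~ x, data = pooled))                        # pooled slope: positive
coef(lm(y ~ x, data = subset(pooled, group == "A")))  # within-group slope: negative
coef(lm(y ~ x, data = subset(pooled, group == "B")))  # within-group slope: negative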
Statistics in the Real World 9
Generalisation
I Ideally we want our models to generalise to any new examples of
data
I However, as we observe more data, we typically sample across more
sub-populations
I Our modelling assumptions may only be locally valid
I For instance, in the Simpson's paradox example, the linear model assumptions were valid, but only within each localised region.
I The best models will be flexible enough to adapt to
non-homogeneity, but simple enough to be able to predict new
observations
Statistics in the Real World 10
Prediction Error
I If we observe a new sample of covariates x1^(new), . . . , xp^(new), we hope that the prediction

Ŷ := flin(x^(new); β̂)

is close to the corresponding new outcome y^(new) (which includes its own noise term ε)
I The new outcome may only be revealed later
I It is usually something we are interested in knowing in order to take action, e.g. a stock price, or whether someone has cancer or not.
I Consider that y(new) is drawn from the random variable Y
I The expected prediction error can be written as¹

e_pred := E[ (Y − f̂lin(X))² ] ,

where we take the expectation over (Y, X) and operate in the random-design case
¹ I shortened the notation to f̂lin(X) := flin(X1, . . . , Xp; β̂1, . . . , β̂p)
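Continuing the earlier lm() sketch (fit, dat and the covariate names x1, x2 are assumed from that example), a point prediction for a new observation is obtained with predict():

# A new observation of the covariates (values chosen arbitrarily for illustration)
x_new <- data.frame(x1 = 0.3, x2 = -1.2)

# Point prediction f_lin(x_new; beta_hat); the realised y_new will also include
# a fresh noise term, so we do not expect an exact match
y_hat <- predict(fit, newdata = x_new)
y_hat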
The Bias-Variance Trade-off 12
Prediction Error (Conditional on X = x)
I Let's try to understand e_pred in the case where we observe X = x:

E[ (Y − f̂lin(X))² | X = x ] = σ² + E[ (flin(x) − f̂lin(x))² ] ,

where the second term is the risk, Risk(f̂(x)).
I Can we ever expect to predict Y with 100% accuracy?
I Generally, the answer is no, since typically σ² > 0
I Let's decompose the second term:

E[ (Y − f̂lin(X))² | X = x ] = σ² + ( flin(x) − E[f̂lin(x)] )² + E[ ( f̂lin(x) − E[f̂lin(x)] )² ] ,

where the middle term is Bias²(f̂(x)) and the final term is Var(f̂(x)).
The Bias-Variance Trade-off 13
Prediction Error (Marginalising over X)
I To remove the dependence on x in (Y − f̂(x))², we can consider integrating across all possible values of x
I This is known as marginalising:

e_pred := E[ (Y − f̂lin(X))² ] = ∫_{Rp} σ² pX(x) dx + ∫_{Rp} Bias²(f̂(x)) pX(x) dx + ∫_{Rp} Var(f̂(x)) pX(x) dx ,

where pX(x) is the probability density function for X ∈ Rp
I Note: since ∫_{Rp} pX(x) dx = 1, what is ∫_{Rp} σ² pX(x) dx?
I Don’t worry if you don’t follow all the maths here!!
I Where we know the ground-truth distributions (i.e. the synthetic setting), we can easily approximate the above quantities by simulation (you will do so in the lab); a rough sketch is given below
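A rough Monte Carlo sketch of these quantities at a single (assumed) evaluation point x0; the true coefficients, noise level and sample size are all assumptions for the synthetic setting. Averaging over many random draws of x0 would approximate the marginal versions above:

set.seed(1)
n     <- 50
sigma <- 1
beta  <- c(2, -1)                       # assumed true coefficients (no intercept)
f_true <- function(x) sum(beta * x)     # true regression function
x0    <- c(1, 0.5)                      # fixed evaluation point

B <- 2000                               # number of Monte Carlo replications
fhat_x0 <- replicate(B, {
  X   <- matrix(rnorm(n * 2), n, 2)     # fresh design
  y   <- drop(X %*% beta) + rnorm(n, sd = sigma)
  fit <- lm(y ~ X - 1)                  # refit the model on the new sample
  sum(coef(fit) * x0)                   # fitted value f_hat(x0)
})

bias2 <- (f_true(x0) - mean(fhat_x0))^2
vari  <- var(fhat_x0)
c(bias2 = bias2, variance = vari,
  pred_error = sigma^2 + bias2 + vari)  # sigma^2 + Bias^2 + Var at x0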
The Bias-Variance Trade-off 14
Bias-Variance as a Function of (p, n)
I Suppose we estimate a linear model which includes only the first p̂ ≤ p covariates, i.e. X1, . . . , Xp̂
I How do we expect the three error quantities to change as a function of p̂?
I The specific behaviour depends on the true β1, . . . , βp
I However, generally we get something like the figure below
[Figure: Bias-Variance Trade-off — Bias, Prediction Error and Std (Variance) plotted against p_hat.]
Degrees of Freedom 16
Bias-Variance as a Function of (p, n)
I The shape of the function depends on σ 2 , n and p alongside the true
β = (3, 2, 1, 0.1, . . . , 0.1) ∈ Rp
I What happens if we fix σ² and p, but let n → ∞?
[Figure: Bias-Variance Trade-off — Bias, Prediction Error and Std (Variance) against p_hat, with one panel per (increasing) sample size n.]
Degrees of Freedom 17
Degrees of Freedom
I Bias usually decreases as a function of increasing p; however, the variance increases.
I In a linear model, the number K of non-zero parameters β1, . . . , βK represents the degrees-of-freedom in the model.
I To balance bias and variance, it is often useful, when comparing two models, to introduce a penalty on the loss function.
I We can try to maximise a criterion which rewards good model fit and penalises complexity:

best model := arg max over set of models [ Lmodel(x; y) − |Kmodel| × penalty(n) ] ,

where Lmodel(x; y) is the likelihood of the model.
Degrees of Freedom 18
Information Criteria for Linear Regression
I Generally, we use the log-likelihood in place of Lmodel(x; y)
I We then minimise a complexity-penalised (negative) log-likelihood.
I Two common approaches:
– Akaike Information Criterion (AIC):

best model := arg min over set of models [ −2 log{Lmodel(x; y)} + 2|Kmodel| ]

– Bayesian Information Criterion (BIC):

best model := arg min over set of models [ −2 log{Lmodel(x; y)} + |Kmodel| log(n) ]
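In R, the functions AIC() and BIC() compute these criteria from a fitted model's log-likelihood. A small sketch comparing nested linear models on synthetic data (the data-generating coefficients below are assumptions for the example):

set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)   # x3 is irrelevant by construction

fits <- list(
  m1 = lm(y ~ x1),
  m2 = lm(y ~ x1 + x2),
  m3 = lm(y ~ x1 + x2 + x3)
)

# Lower is better; BIC penalises the extra (useless) parameter in m3 more heavily
sapply(fits, AIC)
sapply(fits, BIC)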
Degrees of Freedom 19
Cross-Validation
I For many statistical models we may not know the theoretical
degrees-of-freedom function
I In real life we do not have the luxury of re-sampling from the ground-truth generating distribution, i.e. we do not know the real f(β), the true β, or the error structure ε
I Cross-validation is a useful method to approximate (or estimate) the prediction error of a model from observations
I Our aim is to use the observations (X, y) to form an estimate ê_pred(model) of the prediction error
I We can then pick the model which minimises ê_pred
Cross-Validation 21
Cross-Validation (Splitting Data)
I Split the data-set indexed i = 1, . . . , n into k = 1, . . . , K different
groups (or folds)
I Denote these sets of data as Ck .
I These groups should be (roughly) the same size, nk, and cover the full data, i.e. C1 ∪ C2 ∪ · · · ∪ CK = {1, . . . , n}.
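A short sketch of fold creation with caret::createFolds() (caret is the package used in the lab); the outcome vector y below is an assumed stand-in for real data:

library(caret)

set.seed(1)
y <- rnorm(100)    # assumed outcome vector, n = 100

# createFolds() returns a list of index sets C_1, ..., C_K of roughly equal size
folds <- createFolds(y, k = 5, list = TRUE, returnTrain = FALSE)

sapply(folds, length)   # each fold contains about n/K observations
sort(unlist(folds))     # together the folds cover 1, ..., n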
Cross-Validation 22
Cross-Validation
I 1. Let us denote by (y^(/k), x^(/k)) the observations in all groups excluding group k
I 2. Estimate the model parameters on this data:

(y^(/k), x^(/k))  =⇒  β̂ ,   by fitting fmodel(β)

I 3. Let the data (y^(k), x^(k)) be the observations in group k
I 4. Test the prediction ability of f(β̂; x^(k)) on this test data:

ê_pred^(k) := (1/nk) ‖ y^(k) − f(β̂; x^(k)) ‖2²

I 5. Repeat for folds k = 1, . . . , K and average the prediction errors:

ê_pred := (1/K) ∑_{k=1}^{K} ê_pred^(k) .
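A base-R sketch of steps 1-5 for a simple linear model; the synthetic data and the choice K = 5 are assumptions for illustration:

set.seed(1)
n <- 100; K <- 5
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(y = y, x = x)

# Randomly partition the indices 1, ..., n into K folds C_1, ..., C_K
folds <- split(sample(1:n), rep(1:K, length.out = n))

cv_err <- sapply(folds, function(test_idx) {
  fit   <- lm(y ~ x, data = dat[-test_idx, ])      # fit on all folds except k
  y_hat <- predict(fit, newdata = dat[test_idx, ]) # predict on fold k
  mean((dat$y[test_idx] - y_hat)^2)                # e_hat_pred^(k)
})

mean(cv_err)   # e_hat_pred, averaged over the K folds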
Cross-Validation 23
Important Considerations
I How good an estimate ê_pred is will depend on how well the assumptions behind cross-validation hold
I In particular, we have to pay attention to how we split the data
I If the data are not independently or identically drawn, then the distribution of samples within Ck may not be representative of those in Ck′
I We can mitigate these issues to some extent by making Ck large.
I This can have large consequences, cf. back-testing of financial portfolios: what performs well under some economic conditions may perform badly in others.
Cross-Validation 24
Summary
I Introduced Simpson's paradox to highlight issues when assumptions break down (non-homogeneity)
I Discussed the concepts of risk, bias, and variance.
I Demonstrated how increasing degrees-of-freedom can be linked with
overfitting (bias-variance trade-off)
I How to compare linear models via information criteria
I The basic concepts of cross-validation
Cross-Validation 25
In The Lab
1. How to test for linear relationships. Demonstrate that data is
typically non-homogeneous at large scale.
2. Demonstrate the bias-variance trade-off when we know the
generative distribution (we know the true distribution which
produces X, Y)
3. Learn how to easily split data-sets in R using caret
4. Implement and use cross-validation to generate estimates of the
prediction error
Cross-Validation 26