2DI90 ch11

Chapter 11 of the Probability & Statistics course discusses the inadequacies of traditional statistical models when analyzing relationships between two related quantities, exemplified by the correlation between hydrocarbon levels and oxygen purity in a chemical distillation process. It introduces simple linear regression and least squares methods for modeling these relationships, emphasizing the importance of model validity and the interpretation of results. The chapter also covers confidence intervals, prediction intervals, and the significance of regression models while cautioning against assuming causation from correlation.

2DI90

Probability & Statistics

2DI90 – Chapter 11 of MR
Motivation
So far in the course we have dealt with statistical models that
assume the data are a random sample from some distribution.

Although this is quite powerful, it is inadequate in many
situations, for instance when we observe two quantities that
are related.

Example: Data was collected about several models of laptop
computers available in a big online store. In particular, for each
model of computer we noted the processor speed and the
time it took to perform a certain benchmark task (e.g. encoding
a 1-minute .divx video).

As expected, computers with faster processors can typically do
the task in less time, but can we say what the exact relation
between processor speed and the time to complete the task is?

2
Example 11.1 (MR)
Data was collected in a chemical distillation plant producing
oxygen for medical applications. One of the steps reduces the
impurities by condensation. The percentage of hydrocarbons
collected in the condenser might give a good indication of the
oxygen purity:

Hydrocarbon level (%)   Purity (%)
0.99                    90.01
1.02                    89.05
1.15                    91.43
1.29                    93.74
1.46                    96.73
1.36                    94.45
0.87                    87.59
1.23                    91.77
1.55                    99.42
1.40                    93.65
1.19                    93.54
1.15                    92.52
0.98                    90.56
1.01                    89.54
1.11                    89.85
1.20                    90.39
1.26                    93.25
1.32                    93.41
1.43                    94.98
0.95                    87.33

[Scatter plot: Purity of oxygen (%) vs. hydrocarbon level (%)]

3
Example 11.1 (MR)

[Scatter plot: Purity of oxygen (%) vs. hydrocarbon level (%)]

Clearly it seems the hydrocarbon level is telling us
something about the purity of the oxygen produced.

4
Example 11.1 (MR)

[Scatter plot: Purity of oxygen (%) vs. hydrocarbon level (%)]

One possible way to construct a model for the observations
is to assume the oxygen purity levels are measured with a
(small) error.

5
Simple Linear Regression
Definition: Simple Regression Model
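A sketch of the simple linear regression model in the usual textbook notation (standard form, not copied verbatim from the slide):

```latex
\[
  Y_i = \beta_0 + \beta_1 x_i + \epsilon_i , \qquad i = 1, \dots, n ,
\]
where the errors $\epsilon_i$ are independent with $\mathbb{E}[\epsilon_i] = 0$ and
$\mathrm{Var}(\epsilon_i) = \sigma^2$; $\beta_0$ is the intercept and $\beta_1$ the slope.
```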

6
Simple Linear Regression

[Scatter plot: Purity of oxygen (%) vs. hydrocarbon level (%)]

We want ALL the distances between the fitted line and the
points to be small:

7
Least Squares

[Scatter plot: Purity of oxygen (%) vs. hydrocarbon level (%)]

Minimize instead the sum of the SQUARED distances !!!

8
Least Squares
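A sketch of the least squares criterion in the usual notation (minimizing the sum of squared vertical distances from the data points to the line):

```latex
\[
  (\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0, \beta_1} \; L(\beta_0, \beta_1) ,
  \qquad
  L(\beta_0, \beta_1) = \sum_{i=1}^{n} \bigl( y_i - \beta_0 - \beta_1 x_i \bigr)^2 .
\]
```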

9
Least Squares
Definition: Least Squares Estimates
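Setting the partial derivatives of the least squares criterion to zero gives the standard estimates (usual textbook form):

```latex
\[
  \hat\beta_1 = \frac{S_{xy}}{S_{xx}}
             = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                    {\sum_{i=1}^{n} (x_i - \bar{x})^2} ,
  \qquad
  \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} .
\]
```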

10
Least Squares

11
Least Squares
Definition: Least Squares Estimates

Definition: Fitted Regression Line and Residuals
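In the standard notation, the fitted regression line and the residuals are:

```latex
\[
  \hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i ,
  \qquad
  e_i = y_i - \hat{y}_i , \qquad i = 1, \dots, n .
\]
```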

12
Example 11.1 (MR)
(Same data and scatter plot as before: hydrocarbon level (%) in the condenser
versus oxygen purity (%) for the chemical distillation plant.)

[Scatter plot: Purity of oxygen (%) vs. hydrocarbon level (%)]

13
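As a hedged sketch (the vector names `hydrocarbon` and `purity` are my own choice), the least squares fit for these data can be reproduced in R:

```r
# Oxygen purity data from Example 11.1 (n = 20 observations)
hydrocarbon <- c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
                 1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)
purity <- c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
            93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)

# Fit the simple linear regression: purity = beta0 + beta1 * hydrocarbon + error
model <- lm(purity ~ hydrocarbon)

coef(model)                # least squares estimates of the intercept and slope
plot(hydrocarbon, purity)  # scatter plot of the data
abline(model)              # add the fitted regression line
```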
Example 11.1 (MR)

[Scatter plot: Purity of oxygen (%) vs. hydrocarbon level (%)]

You must always be careful with the interpretation of the
results you have… Your inferences are only valid insofar as
the model you are using is reasonable !!!

14
Estimating the Variance

[Scatter plot: Purity of oxygen (%) vs. hydrocarbon level (%)]

Definition: Estimate of the Variance
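In the standard form, the error variance is estimated from the residual sum of squares:

```latex
\[
  \hat{\sigma}^2 = \frac{SS_E}{n-2} ,
  \qquad
  SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 ,
\]
where the divisor $n-2$ accounts for the two estimated coefficients.
```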

15
Least Squares

16
Properties of Least Squares

Proposition: Least Squares Coefficients
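A sketch of the standard properties (unbiasedness and variances of the least squares coefficients):

```latex
\[
  \mathbb{E}[\hat\beta_0] = \beta_0 , \qquad \mathbb{E}[\hat\beta_1] = \beta_1 ,
\]
\[
  \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{S_{xx}} ,
  \qquad
  \mathrm{Var}(\hat\beta_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right] .
\]
```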

17
Partial proof on the board (maybe)…
Relation to Maximum Likelihood Estimation

Proposition: Maximum Likelihood in Regression
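A sketch of the standard result under normally distributed errors: the log-likelihood is maximized in the coefficients exactly where the sum of squares is minimized, so the MLEs coincide with the least squares estimates:

```latex
\[
  \ell(\beta_0, \beta_1, \sigma^2)
  = -\frac{n}{2} \log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
  \;\Longrightarrow\;
  \hat\beta_j^{\mathrm{ML}} = \hat\beta_j^{\mathrm{LS}} ,
  \quad
  \hat\sigma^2_{\mathrm{ML}} = \frac{SS_E}{n} .
\]
```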

Home Exercise: Prove the above proposition

18
Distribution of the Regression Coeff.’s

19
Distribution of the Regression Coeff.’s
Theorem:
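In the usual form, under normally distributed errors the estimators satisfy:

```latex
\[
  \hat\beta_1 \sim N\!\left( \beta_1 , \frac{\sigma^2}{S_{xx}} \right) ,
  \qquad
  \hat\beta_0 \sim N\!\left( \beta_0 , \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right] \right) ,
  \qquad
  \frac{(n-2)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2} .
\]
```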

20
Testing in Linear Regression

21
Example 11.1 (MR)
(Same data and scatter plot as in Example 11.1: hydrocarbon level (%) in the
condenser versus oxygen purity (%) for the chemical distillation plant.)

[Scatter plot: Purity of oxygen (%) vs. hydrocarbon level (%)]

22
Example

23
Example 11.2 (MR)

24
Summary of Testing Procedures
Tests for the Slope:
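A sketch of the standard t-test for the slope (with $\hat\sigma^2$ estimated as above):

```latex
\[
  H_0 : \beta_1 = \beta_{1,0}
  \quad\text{vs.}\quad
  H_1 : \beta_1 \neq \beta_{1,0} ,
  \qquad
  T_0 = \frac{\hat\beta_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2 / S_{xx}}}
  \;\overset{H_0}{\sim}\; t_{n-2} ;
\]
reject $H_0$ at level $\alpha$ when $|T_0| > t_{\alpha/2,\,n-2}$ (one-sided versions are analogous).
```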

25
Summary of Testing Procedures
Tests for the Intercept:
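Similarly, the standard t-test for the intercept:

```latex
\[
  H_0 : \beta_0 = \beta_{0,0} ,
  \qquad
  T_0 = \frac{\hat\beta_0 - \beta_{0,0}}
             {\sqrt{\hat{\sigma}^2 \left[ \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}} \right]}}
  \;\overset{H_0}{\sim}\; t_{n-2} .
\]
```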

26
Summary of Testing Procedures
Tests for the Variance:
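And the standard chi-square test for the error variance:

```latex
\[
  H_0 : \sigma^2 = \sigma_0^2 ,
  \qquad
  X_0^2 = \frac{(n-2)\,\hat{\sigma}^2}{\sigma_0^2}
  \;\overset{H_0}{\sim}\; \chi^2_{n-2} .
\]
```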

27
Analysis of Variance

28
ANOVA - ANalysis Of VAriance

[Three scatter plots of y versus x]

29
ANOVA - Analysis of Variance
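In the usual notation, the ANOVA decomposition and the F-test for significance of regression are:

```latex
\[
  \underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{SS_T}
  =
  \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{SS_R}
  +
  \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{SS_E} ,
  \qquad
  F_0 = \frac{SS_R / 1}{SS_E / (n-2)}
  \;\overset{H_0 : \beta_1 = 0}{\sim}\; F_{1,\, n-2} .
\]
```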

30
Example 11.3 (MR)
(Same data and scatter plot as in Example 11.1: hydrocarbon level (%) in the
condenser versus oxygen purity (%) for the chemical distillation plant.)

[Scatter plot: Purity of oxygen (%) vs. hydrocarbon level (%)]

31
Example 11.3 (MR)

32
Example 11.3 (MR)

33
Example 11.3 (MR)

34
Example 11.3 (MR)

[Software output: descriptive statistics of the residuals; test statistics and p-values]
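Output of this kind (coefficient estimates, standard errors, test statistics, p-values, and residual summaries) can be obtained in R from the fitted `model` object; a minimal sketch:

```r
summary(model)             # coefficients, standard errors, t statistics, p-values, R-squared
anova(model)               # ANOVA table: regression and residual sums of squares, F statistic
summary(model$residuals)   # descriptive statistics of the residuals
```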

35
Relation Between ANOVA and t-tests
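A sketch of the standard connection: for testing $H_0 : \beta_1 = 0$ in simple linear regression, the ANOVA F statistic is the square of the t statistic, so the two tests are equivalent:

```latex
\[
  F_0 = \frac{SS_R}{SS_E / (n-2)} = T_0^2 ,
  \qquad
  T_0^2 \sim F_{1,\,n-2} \ \text{ whenever } \ T_0 \sim t_{n-2} .
\]
```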

36
Confidence Intervals
Recall the result we have shown before:

Theorem:

37
CI for the Regr. Coeff.’s and the Variance
Confidence Intervals:
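In the standard form, two-sided $100(1-\alpha)\%$ confidence intervals for the coefficients and the variance are:

```latex
\[
  \hat\beta_1 \pm t_{\alpha/2,\,n-2} \sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} ,
  \qquad
  \hat\beta_0 \pm t_{\alpha/2,\,n-2} \sqrt{\hat{\sigma}^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right]} ,
  \qquad
  \left( \frac{(n-2)\,\hat{\sigma}^2}{\chi^2_{\alpha/2,\,n-2}} ,\;
         \frac{(n-2)\,\hat{\sigma}^2}{\chi^2_{1-\alpha/2,\,n-2}} \right)
  \ \text{for } \sigma^2 .
\]
```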

One-sided confidence intervals are obtained as we have seen before…


38
Example 11.4 (MR)

39
CIs on the Mean Response

Definition:
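In the usual notation, the mean response at a predictor value $x_0$ and its natural estimator are:

```latex
\[
  \mu_{Y \mid x_0} = \mathbb{E}[Y \mid x = x_0] = \beta_0 + \beta_1 x_0 ,
  \qquad
  \hat{\mu}_{Y \mid x_0} = \hat\beta_0 + \hat\beta_1 x_0 .
\]
```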

Proposition:

40
CIs on the Mean Response

Proposition:

We have all the pieces needed to construct a nice
confidence interval for the mean response…

41
CIs on the Mean Response
Confidence Interval for the Mean Response:
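A sketch of the standard CI for the mean response at $x_0$:

```latex
\[
  \hat{\mu}_{Y \mid x_0} \pm t_{\alpha/2,\,n-2}
  \sqrt{\hat{\sigma}^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]} .
\]
```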

It is quite interesting to note that the CI is narrowest
when x0 equals the sample mean of the predictor values, and
becomes wider elsewhere.

As before, using these is a matter of plugging in the
proper quantities…

42
Example 11.5 (MR)
We can plot these CIs for each point, and get a nice
(pointwise) confidence band around the regression line !!!

[Plot: Purity of oxygen (%) vs. hydrocarbon level (%), showing the regression line and the 95% CI for the mean response]

43
Regression Prediction Intervals

Proposition:

44
Prediction Interval for Regression
Prediction Interval for Regression:
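In the standard form, the $100(1-\alpha)\%$ prediction interval for a new observation at $x_0$ adds the variance of the new error to the variance of the estimated mean response:

```latex
\[
  \hat{y}_0 \pm t_{\alpha/2,\,n-2}
  \sqrt{\hat{\sigma}^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]} ,
  \qquad
  \hat{y}_0 = \hat\beta_0 + \hat\beta_1 x_0 .
\]
```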

Note these intervals are always significantly wider than
the CIs of the previous slides…

45
Example 11.6 (MR)
[Plot: Purity of oxygen (%) vs. hydrocarbon level (%), showing the regression line, the 95% CI for the mean response, and the 95% prediction interval]

46
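A sketch of how both bands in such a figure can be computed in R with `predict()` (the grid and object names are my own, building on the `model` fitted earlier):

```r
# Pointwise 95% confidence band for the mean response and 95% prediction band
new_x <- data.frame(hydrocarbon = seq(0.5, 2.0, by = 0.01))

ci   <- predict(model, newdata = new_x, interval = "confidence", level = 0.95)
pred <- predict(model, newdata = new_x, interval = "prediction", level = 0.95)

plot(hydrocarbon, purity, xlim = c(0.5, 2.0), ylim = c(80, 105))
lines(new_x$hydrocarbon, ci[, "fit"])             # fitted regression line
lines(new_x$hydrocarbon, ci[, "lwr"], lty = 2)    # 95% CI for the mean response (lower)
lines(new_x$hydrocarbon, ci[, "upr"], lty = 2)    # 95% CI for the mean response (upper)
lines(new_x$hydrocarbon, pred[, "lwr"], lty = 3)  # 95% prediction interval (lower)
lines(new_x$hydrocarbon, pred[, "upr"], lty = 3)  # 95% prediction interval (upper)
```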
Important Remarks
• All the predictions, estimates, and confidence statements are
only valid if the model assumptions are reasonable…
• It is quite dangerous to extrapolate the response for
predictor values outside the range you observed.
• If the regression model is significant, it means the response
and predictor variable are correlated, but it doesn't say
anything about a causal relation !!!

47
Correlation does NOT IMPLY Causation

48
Adequacy of the Regression Model
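The adequacy question is whether the assumed model structure actually holds; a sketch in the usual form (not copied verbatim from the slide):

```latex
\[
  \mathbb{E}[Y \mid x] = \beta_0 + \beta_1 x ,
  \qquad
  \epsilon_i \ \text{i.i.d.}, \quad \mathbb{E}[\epsilon_i] = 0 , \quad \mathrm{Var}(\epsilon_i) = \sigma^2 .
\]
```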

Our regression model only makes sense if the above expression
is approximately true. Furthermore, to do hypothesis tests we
also need to assume that the error variables follow a normal
distribution.

You should always take such assumptions with a grain of salt,
and see if there is enough evidence to reject them !!!

49
Residual Analysis

[Two plots: residuals (model$residuals) versus x, and a normal Q-Q plot of the residuals (sample quantiles versus theoretical quantiles); Shapiro-Wilk p-value = 0.9293]

50
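A sketch of how such residual diagnostics can be produced in R from the fitted `model` (variable names as before):

```r
# Residuals versus the predictor, and a normal Q-Q plot of the residuals
plot(hydrocarbon, model$residuals)
abline(h = 0)                    # residuals should scatter around zero with no pattern

qqnorm(model$residuals)          # normal Q-Q plot of the residuals
qqline(model$residuals)          # reference line through the quartiles

shapiro.test(model$residuals)    # Shapiro-Wilk test of normality of the residuals
```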
Residual Analysis

51
Coefficient of Determination

Definition:
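In the standard notation, the coefficient of determination is the proportion of the total variability explained by the regression:

```latex
\[
  R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T} ,
  \qquad
  0 \le R^2 \le 1 .
\]
```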

52
Coefficient of Determination

53
Final Remarks
The topic of regression models is very broad, and we barely
scratched the surface !!! There are other courses offered at the
TU/e exclusively on this topic…

• We can consider models with transformed variables

• We can consider logistic regression, which allows us to deal
with categorical data

• There are also non-parametric regression models, as well as
non-linear regression procedures

• Most practical methods in machine learning are regression-like
procedures (e.g. Support Vector Machines).

Always question the validity of the models you are using.

All models are wrong…
…but some are useful.

54
