2DI90 Probability & Statistics
2DI90 – Chapter 11 of MR
Motivation
So far in the course we have dealt with statistical models that
assume the data are a random sample from some distribution.
Although this is quite powerful, it is inadequate in many
situations, for instance when one observes two quantities that
are related.
Example: Data were collected about several models of laptop
computers available in a big online store. In particular, for each
model of computer we took note of the processor speed and the
time it took to perform a certain benchmark task (e.g. encoding
a 1-minute .divx video).
As expected, computers with faster processors can typically do
the task in less time, but can we say what the exact relation
between processor speed and time to complete the task is?
Example 11.1 (MR)
Data was collected in a chemical distillation plant producing
oxygen for medical applications. One of the steps reduces the
impurities by condensation. The percentage of hydrocarbons
collected in the condenser might give a good indication of the
oxygen purity:
Hydrocarbon level (%)   Purity (%)
0.99                    90.01
1.02                    89.05
1.15                    91.43
1.29                    93.74
1.46                    96.73
1.36                    94.45
0.87                    87.59
1.23                    91.77
1.55                    99.42
1.40                    93.65
1.19                    93.54
1.15                    92.52
0.98                    90.56
1.01                    89.54
1.11                    89.85
1.20                    90.39
1.26                    93.25
1.32                    93.41
1.43                    94.98
0.95                    87.33
[Figure: scatter plot of oxygen purity against hydrocarbon level]
Example 11.1 (MR)
The hydrocarbon level clearly seems to tell us something about
the purity of the oxygen produced.
[Figure: scatter plot of oxygen purity against hydrocarbon level]
Example 11.1 (MR)
One possible way to construct a model for the observations
is to assume the oxygen purity levels are measured with a
(small) error.
[Figure: scatter plot of oxygen purity against hydrocarbon level]
Simple Linear Regression
Definition: Simple Regression Model
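A sketch of the standard statement of the simple linear regression model, as given in MR Chapter 11, which this definition presumably refers to. The symbols (response Y_i, predictor x_i, intercept \beta_0, slope \beta_1, error \varepsilon_i) follow MR's notation and are assumed here:

```latex
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i , \qquad i = 1, \dots, n ,
\qquad
E[\varepsilon_i] = 0 , \quad V(\varepsilon_i) = \sigma^2 ,
\quad \varepsilon_1, \dots, \varepsilon_n \ \text{uncorrelated} .
```

In the oxygen example, x_i is the hydrocarbon level and Y_i the measured purity; the error term plays the role of the (small) measurement error.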
Simple Linear Regression
[Figure: scatter plot of oxygen purity against hydrocarbon level]
We want ALL the distances between the fitted line and the
points to be small.
Least Squares
Minimize instead the sum of the SQUARED distances !!!
[Figure: scatter plot of oxygen purity against hydrocarbon level]
Definition: Least Squares Estimates
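A minimal sketch of how the least squares estimates can be computed for the oxygen purity data, assuming the usual closed-form solution: slope = S_xy / S_xx and intercept = (mean of y) - slope * (mean of x). The function name `least_squares` and the variable names are illustrative choices, not taken from the slides.

```python
# Oxygen purity data of Example 11.1 (MR):
# x = hydrocarbon level (%), y = purity (%)
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]


def least_squares(x, y):
    """Return (beta0_hat, beta1_hat) minimizing the sum of squared distances."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # S_xx: spread of the predictor; S_xy: joint variation of x and y
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


beta0_hat, beta1_hat = least_squares(x, y)
print(f"fitted line: purity = {beta0_hat:.3f} + {beta1_hat:.3f} * hydrocarbon")
```

For these data the intercept comes out at roughly 74.28 and the slope at roughly 14.95, so each extra percentage point of hydrocarbon level goes with about 15 points of purity.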
Definition: Fitted Regression Line and Residuals
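A short sketch of the fitted line and the residuals for the oxygen purity data, assuming the usual definitions: fitted value = intercept estimate + slope estimate * x_i, and residual = observed y_i minus fitted value. Variable names are illustrative.

```python
# Oxygen purity data of Example 11.1 (MR)
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values on the regression line and the residuals e_i = y_i - fitted_i
fitted = [beta0_hat + beta1_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# With an intercept in the model the residuals sum to zero
# (up to floating-point rounding) and are orthogonal to x.
print(f"sum of residuals       = {sum(residuals):.2e}")
print(f"sum of x_i * residuals = {sum(xi * e for xi, e in zip(x, residuals)):.2e}")
```

The two printed sums being essentially zero is a handy sanity check that the line really is the least squares fit.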
Example 11.1 (MR)
Same setting and data as in the earlier statement of this example.
[Figure: scatter plot of oxygen purity against hydrocarbon level]
Example 11.1 (MR)
You must always be careful with the interpretation of your
results… Your inferences are only valid to the extent that
the model you are using is reasonable !!!
[Figure: scatter plot of oxygen purity against hydrocarbon level]
Estimating the Variance
[Figure: scatter plot of oxygen purity against hydrocarbon level]
Definition: Estimate of the Variance
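A sketch of the variance estimate for the oxygen purity data, assuming the usual unbiased estimator: the sum of squared residuals SS_E divided by n - 2 (two degrees of freedom are used up by the intercept and the slope). Names such as `ss_e` and `sigma2_hat` are illustrative.

```python
# Oxygen purity data of Example 11.1 (MR)
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Error sum of squares: squared vertical distances to the fitted line
ss_e = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))

# Unbiased estimate of the error variance
sigma2_hat = ss_e / (n - 2)
print(f"SS_E = {ss_e:.2f}, sigma^2 estimate = {sigma2_hat:.2f}")
```

For these data SS_E is about 21.25 and the variance estimate about 1.18.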
Properties of Least Squares
Proposition: Least Squares Coefficients
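A sketch of the standard properties of the least squares coefficients under the simple linear regression model, as stated in MR Chapter 11, which this proposition presumably covers. The notation S_{xx} = \sum_i (x_i - \bar{x})^2 is assumed:

```latex
E[\hat\beta_1] = \beta_1 , \qquad
V(\hat\beta_1) = \frac{\sigma^2}{S_{xx}} , \qquad
E[\hat\beta_0] = \beta_0 , \qquad
V(\hat\beta_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right] .
```

Both estimators are unbiased, and the slope is estimated more precisely when the x values are more spread out (larger S_{xx}).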
Partial proof on the board (maybe)…
Relation to Maximum Likelihood Estimation
Proposition: Maximum Likelihood in Regression
Home Exercise: Prove the above proposition
Distribution of the Regression Coefficients
Theorem:
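A sketch of the standard distributional result for normal errors, as in MR Chapter 11, which this theorem presumably states. It assumes the errors are independent N(0, \sigma^2), with \hat\sigma^2 = SS_E / (n-2) and S_{xx} = \sum_i (x_i - \bar{x})^2:

```latex
\frac{\hat\beta_1 - \beta_1}{\sqrt{\hat\sigma^2 / S_{xx}}} \sim t_{n-2} ,
\qquad
\frac{\hat\beta_0 - \beta_0}
     {\sqrt{\hat\sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right]}} \sim t_{n-2} ,
\qquad
\frac{(n-2)\,\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-2} .
```

The denominators are the estimated standard errors of the two coefficients.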
Testing in Linear Regression
Example 11.1 (MR)
Same setting and data as in the earlier statement of this example.
[Figure: scatter plot of oxygen purity against hydrocarbon level]
Example
Example 11.2 (MR)
Summary of Testing Procedures
Tests for the Slope:
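A sketch of the two-sided test of H0: slope = 0 against H1: slope ≠ 0 for the oxygen purity data, assuming normal errors and the usual t statistic (slope estimate divided by its estimated standard error, n - 2 degrees of freedom). The critical value 2.101 is the tabulated t quantile for 18 degrees of freedom at level 0.05 (two-sided); it is hard-coded so the example needs only the standard library.

```python
import math

# Oxygen purity data of Example 11.1 (MR)
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
ss_e = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
sigma2_hat = ss_e / (n - 2)

# Estimated standard error of the slope and the test statistic under H0
se_beta1 = math.sqrt(sigma2_hat / s_xx)
t0 = (beta1_hat - 0.0) / se_beta1

T_CRIT = 2.101  # t quantile t_{0.025, 18} from a t table (n - 2 = 18)
reject_h0 = abs(t0) > T_CRIT
print(f"t0 = {t0:.2f}, reject H0 at the 5% level: {reject_h0}")
```

The statistic is far beyond the critical value, so the data give strong evidence of a linear relation between hydrocarbon level and purity.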
Tests for the Intercept:
Tests for the Variance:
Analysis of Variance
[Figure: three panels, each plotting y against x]
ANOVA - Analysis of Variance
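A sketch of the analysis of variance decomposition for the oxygen purity data, assuming the usual identity SS_T = SS_R + SS_E (total = regression + error) and the F statistic MS_R / MS_E with 1 and n - 2 degrees of freedom. The critical value 4.41 is the tabulated 5% point of the F distribution with (1, 18) degrees of freedom; variable names are illustrative.

```python
# Oxygen purity data of Example 11.1 (MR)
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
fitted = [beta0_hat + beta1_hat * xi for xi in x]

ss_t = sum((yi - y_bar) ** 2 for yi in y)                # total variability
ss_r = sum((fi - y_bar) ** 2 for fi in fitted)           # explained by the line
ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # left in the residuals

ms_r = ss_r / 1        # regression mean square (1 degree of freedom)
ms_e = ss_e / (n - 2)  # error mean square (n - 2 degrees of freedom)
f0 = ms_r / ms_e

F_CRIT = 4.41  # f_{0.05, 1, 18} from an F table
print(f"SS_T = {ss_t:.2f} = SS_R {ss_r:.2f} + SS_E {ss_e:.2f}")
print(f"f0 = {f0:.2f}, significant at the 5% level: {f0 > F_CRIT}")
```

Most of the variability in purity (about 152 out of 173) is accounted for by the regression line.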
Example 11.3 (MR)
This example uses the oxygen purity data of Example 11.1.
[Figure: scatter plot of oxygen purity against hydrocarbon level]
[Slide annotations: descriptive statistics of the residuals; test statistics and p-values]
Relation Between ANOVA and t-tests
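A short derivation sketch of why the ANOVA F-test and the two-sided t-test for H0: \beta_1 = 0 agree in simple linear regression. It assumes the usual notation \hat\beta_1 = S_{xy} / S_{xx}, SS_R = \hat\beta_1 S_{xy} and MS_E = \hat\sigma^2 = SS_E / (n-2):

```latex
T_0^2
= \frac{\hat\beta_1^2}{\hat\sigma^2 / S_{xx}}
= \frac{\hat\beta_1^2 \, S_{xx}}{MS_E}
= \frac{\hat\beta_1 \, S_{xy}}{MS_E}
= \frac{SS_R / 1}{SS_E / (n-2)}
= \frac{MS_R}{MS_E}
= F_0 .
```

Since the square of a t_{n-2} variable has an F_{1, n-2} distribution, the two tests reject in exactly the same cases. For the oxygen data, 11.35^2 is approximately 128.9, which matches the F statistic.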
Confidence Intervals
Recall the result we showed before:
Theorem:
CI for the Regr. Coeff.’s and the Variance
Confidence Intervals:
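A sketch of two-sided 95% confidence intervals for the slope and the intercept on the oxygen purity data, assuming normal errors and the usual form: estimate ± t quantile * estimated standard error, with n - 2 degrees of freedom. The quantile 2.101 (t table, 18 degrees of freedom) is hard-coded; names are illustrative.

```python
import math

# Oxygen purity data of Example 11.1 (MR)
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
ss_e = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
sigma2_hat = ss_e / (n - 2)

T_QUANTILE = 2.101  # t_{0.025, 18} from a t table

# Estimated standard errors of the two coefficients
se_beta1 = math.sqrt(sigma2_hat / s_xx)
se_beta0 = math.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / s_xx))

slope_ci = (beta1_hat - T_QUANTILE * se_beta1, beta1_hat + T_QUANTILE * se_beta1)
intercept_ci = (beta0_hat - T_QUANTILE * se_beta0, beta0_hat + T_QUANTILE * se_beta0)
print(f"95% CI for the slope:     ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
print(f"95% CI for the intercept: ({intercept_ci[0]:.3f}, {intercept_ci[1]:.3f})")
```

The slope interval runs from roughly 12.18 to 17.71 and does not contain zero, in line with the outcome of the test for the slope.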
One-sided confidence intervals are obtained as we have seen before…
Example 11.4 (MR)
CIs on the Mean Response
Definition:
Proposition:
Proposition:
We have all the pieces needed to construct a nice
confidence interval for the mean response…
Confidence Interval for the Mean Response:
It is quite interesting to note that the CI is
narrowest when the point x0 at which we estimate the mean response
equals the sample mean of the x values, and becomes wider elsewhere.
As before, using these is a matter of plugging in the
proper quantities…
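A sketch of the 95% confidence interval for the mean response at a point x0 on the oxygen purity data, assuming normal errors and the usual standard error sqrt(variance estimate * [1/n + (x0 - mean of x)^2 / S_xx]). The function name `mean_response_ci` and the quantile 2.101 (t table, 18 degrees of freedom) are choices made for this illustration.

```python
import math

# Oxygen purity data of Example 11.1 (MR)
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
ss_e = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
sigma2_hat = ss_e / (n - 2)

T_QUANTILE = 2.101  # t_{0.025, 18} from a t table


def mean_response_ci(x0):
    """95% CI for the mean purity at hydrocarbon level x0."""
    mu_hat = beta0_hat + beta1_hat * x0
    se = math.sqrt(sigma2_hat * (1 / n + (x0 - x_bar) ** 2 / s_xx))
    return mu_hat - T_QUANTILE * se, mu_hat + T_QUANTILE * se


lower, upper = mean_response_ci(1.00)
print(f"95% CI for the mean purity at x0 = 1.00: ({lower:.2f}, {upper:.2f})")

# The interval is narrowest at the sample mean of x and widens away from it
width_at_mean = mean_response_ci(x_bar)[1] - mean_response_ci(x_bar)[0]
width_at_edge = mean_response_ci(1.50)[1] - mean_response_ci(1.50)[0]
print(f"width at the mean of x: {width_at_mean:.2f}, width at x0 = 1.50: {width_at_edge:.2f}")
```

At x0 = 1.00 the interval is roughly 89.23 ± 0.75. Evaluating the function over a grid of x0 values gives the pointwise confidence band around the regression line.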
Example 11.5 (MR)
We can plot these CIs for each point, and get a nice
(pointwise) confidence band around the regression line !!!
[Figure: oxygen purity against hydrocarbon level, showing the regression line and the 95% CI for the mean response]
Regression Prediction Intervals
Proposition:
Prediction Interval for Regression
Prediction Interval for Regression:
Note these intervals are always wider than the CIs for the mean
response of the previous slides, since they must also account for
the error of the new observation itself…
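A sketch of the 95% prediction interval for a new observation at x0 on the oxygen purity data, assuming normal errors and the usual standard error sqrt(variance estimate * [1 + 1/n + (x0 - mean of x)^2 / S_xx]). The extra "1 +" inside the bracket is the variance of the new observation, which is why this interval is wider than the CI for the mean response. The function name `prediction_interval` and the quantile 2.101 (t table, 18 degrees of freedom) are illustrative choices.

```python
import math

# Oxygen purity data of Example 11.1 (MR)
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
ss_e = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
sigma2_hat = ss_e / (n - 2)

T_QUANTILE = 2.101  # t_{0.025, 18} from a t table


def prediction_interval(x0):
    """95% prediction interval for a new purity measurement at x0."""
    y0_hat = beta0_hat + beta1_hat * x0
    se = math.sqrt(sigma2_hat * (1 + 1 / n + (x0 - x_bar) ** 2 / s_xx))
    return y0_hat - T_QUANTILE * se, y0_hat + T_QUANTILE * se


pi_lower, pi_upper = prediction_interval(1.00)
print(f"95% prediction interval at x0 = 1.00: ({pi_lower:.2f}, {pi_upper:.2f})")
```

At x0 = 1.00 the prediction interval is roughly (86.83, 91.63), compared with about (88.49, 89.98) for the mean response at the same point.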
Example 11.6 (MR)
[Figure: oxygen purity against hydrocarbon level, showing the regression line, the 95% CI for the mean response and the 95% prediction interval]
Important Remarks
• All the predictions, estimates, and confidence statements are
only valid if the model assumptions are reasonable…
• It is quite dangerous to extrapolate the response for
predictor values outside the range you observed.
• If the regression model is significant it means the response
and predictor variable are correlated, but it doesn’t say
anything about a causal relation !!!
Correlation does NOT IMPLY Causation
Adequacy of the Regression Model
Our regression model only makes sense if the above expression
is approximately true. Furthermore, to do hypothesis tests we
also need to assume that the error variables follow a normal
distribution.
You should always take such assumptions with a grain of salt,
and check whether there is enough evidence to reject them !!!
Residual Analysis
[Figure: two panels. Left, "Residuals": model$residuals plotted against x. Right, "Normal Q-Q Plot": sample quantiles against theoretical quantiles]
Shapiro-Wilk p-value = 0.9293
Coefficient of Determination
Definition:
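A sketch of the coefficient of determination for the oxygen purity data, assuming the usual definition R^2 = SS_R / SS_T = 1 - SS_E / SS_T, i.e. the fraction of the variability in the response that is accounted for by the regression line. Variable names are illustrative.

```python
# Oxygen purity data of Example 11.1 (MR)
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

ss_t = sum((yi - y_bar) ** 2 for yi in y)
ss_e = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))

# Fraction of the total variability explained by the fitted line
r_squared = 1 - ss_e / ss_t
print(f"R^2 = {r_squared:.3f}")
```

For these data R^2 is about 0.877, so the line accounts for roughly 88% of the variability in purity. A high R^2 on its own does not show that the model assumptions hold.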
Final Remarks
The topic of regression models is very broad, and we barely
scratched the surface !!! There are other courses offered at the
TU/e exclusively on this topic…
• We can consider models with transformed variables
• We can consider logistic regression, which allows us to deal
with categorical data
• There are also non-parametric regression models, as well as
non-linear regression procedures
• Most practical methods in machine learning are regression-like
procedures (e.g. Support Vector Machines).
Always question the validity of the models you are using.
All models are wrong…
…but some are useful.