MATH 240 – INTRODUCTION
TO PROBABILITY AND
STATISTICS FOR ENGINEERS
Chapter 1
Introduction to
Statistics and
Data Analysis
Copyright © 2017 Pearson Education, Ltd. All rights reserved.
Probability and Statistics
• Probability: a field devoted to the study of
random variation in systems
– Inferential statistics (where we use the information
in a sample to draw a conclusion about the
population)
– Other applications in engineering
• Statistics deals with collection, presentation,
analysis and use of data to make decisions and
solve problems.
Statistical Problem-Solving
The engineering or scientific method of formulating and solving
problems follows certain steps:
[Figure: flowchart of the steps; labels include "Collect data" and "Where we use engineering statistics"]
Data, Information, and
Knowledge
• Data:
– Quantifiable measurements of some physical
phenomenon
– “Patient A weighs 80 kg”
– “The date is April 1, 2019”
• Information:
– Synthesized data that produces meaning
– “The average weight of patients over a 10-week period
dropped by 5 kg when taking drug X”
• Knowledge:
– Our internal model of the way the world works
– Present in the human mind
– Used to make predictions about cause/effect
relationships
– “I think that drug X causes weight loss”
Data, Information, Decision
Data (Measure) → Statistics → Information (Compare) → Knowledge (Decide)
Figure 1.2 Fundamental relationship
between probability and inferential
statistics
• Estimating properties of the population without
examining the entire population
Population and Sample
Random sample
If you were to take two different random
samples from the same population and calculate
the sample means, you would expect them to be
different.
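The point about two random samples can be illustrated with a small simulation. This is only a sketch: the population below (10,000 normally distributed values with mean 12.0 and standard deviation 0.4), the sample size of 14, and the seed are all assumptions chosen for illustration, not values from the course.

```python
import random
import statistics

# Hypothetical population: 10,000 values from a normal distribution.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [rng.gauss(12.0, 0.4) for _ in range(10_000)]

# Two independent random samples of size 14 from the same population.
sample_a = rng.sample(population, 14)
sample_b = rng.sample(population, 14)

mean_a = statistics.mean(sample_a)
mean_b = statistics.mean(sample_b)
population_mean = statistics.mean(population)

print(f"population mean: {population_mean:.3f}")
print(f"sample A mean:   {mean_a:.3f}")
print(f"sample B mean:   {mean_b:.3f}")
```

Both sample means should land near the population mean, but they will generally not equal it or each other.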
Variability
• Inconsistency
– Repeated experiments (runs) will yield
slightly different results
• Sources of Variation
– Natural variation
– Assignable causes
Making Comparisons
Dot diagram: useful
for displaying a small
number of data points
(≈20).
Allows us to see the
location, and scatter
or variability of the
dataset.
• Does wall thickness have an effect?
• How confident can you be? Do we know that another
specimen will not give another result?
• Is this sample adequate?
• Statistical methodology can answer these questions.
Techniques of Statistical Inference
• Point estimation (lecture 7)
• Interval estimation (lecture 8)
• Hypothesis testing (lecture 9)
Section 1.3
Measures of
Location: The
Sample Mean
and Median
Three things to check while
analyzing data
• Central Tendency
– Where the population is located (its "address")
• Spread (Scatter)
– How wide the population is
– How high the variability is
• Shape
– How the distribution looks
Definition 1.1
Measures of Central
Tendency
1. Mean (Average)
Sample mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where x_i is an individual observation in the sample and n is the sample size.
Class Exercise 1:
x_i: 11.8, 11.9, 11.8, 12.4, 12.8, 12.4, 12.1, 12.6, 12.0, 11.3, 11.8, 11.7, 11.5, 11.9
• What is the sample average?
– Sample average = 12.0
Geometric Meaning of Mean
Repeated runs
Mean is the target value.
Geometric Meaning of Mean
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
• Centroid of the data
• Fulcrum that balances the weights
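The balancing property can be checked numerically. The sketch below reuses the Class Exercise 1 data; the variable names are only illustrative, and the sum is zero only up to floating-point rounding.

```python
# Class Exercise 1 data from these slides.
data = [11.8, 11.9, 11.8, 12.4, 12.8, 12.4, 12.1,
        12.6, 12.0, 11.3, 11.8, 11.7, 11.5, 11.9]

mean = sum(data) / len(data)
deviations = [x - mean for x in data]

# Deviations above and below the mean cancel (up to floating-point rounding).
total_deviation = sum(deviations)
print(f"mean = {mean:.1f}, sum of deviations = {total_deviation:.2e}")
```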
Measures of Central
Tendency
1. Mean (Average)
Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
where x_i is an individual observation in the population and N is the population size.
Watch out for the difference between a sample and a population!
Mean
• Advantages of Using the mean
– It is the center of gravity of the data
– It uses all data
– No sorting is needed
• Disadvantages of Using the mean
– The mean may not be the actual value of any
data points.
– Extreme data values may distort the picture
Effect of Extreme Points
Extreme point:
outlier
mean
• Assignable cause
• Potential reasons: Wrong measurement,
different population
Trimmed means
• In order to alleviate the problems of extreme
points on the mean, the trimmed means method
may be used.
– Calculated by trimming away a certain percentage of
both the largest and the smallest values.
– A 10% trimmed mean omits the top and bottom 10% of
the data points from the calculation of the mean.
Trimmed means
• Consider the following 5 data points from an
experiment: 1.5, 2.2, 2.5, 3.1, 5.2
– The mean is 2.90.
– The 20% trimmed mean is 2.60, which reduces
the effect of the extreme values.
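A minimal sketch of the trimmed-mean calculation for these five points follows. The function name `trimmed_mean` and the choice to round the number of dropped points down are assumptions made for illustration; the slides do not say how to handle percentages that do not give a whole number of points.

```python
import math

def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the points from each end.

    Assumption: the number dropped per end is rounded down.
    """
    if not 0 <= percent < 50:
        raise ValueError("percent must be in [0, 50)")
    ordered = sorted(values)
    k = math.floor(len(ordered) * percent / 100)  # points to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [1.5, 2.2, 2.5, 3.1, 5.2]
plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 20)
print(f"mean = {plain:.2f}, 20% trimmed mean = {trimmed:.2f}")
```

With 5 points, trimming 20% removes one value from each end (1.5 and 5.2), leaving the three middle values.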
Measures of Central Tendency
2. Median
middle value when the data is ordered
in ascending or descending order
If n is odd:
$$\tilde{x} = x_{(n+1)/2}$$
If n is even:
$$\tilde{x} = \frac{x_{n/2} + x_{n/2+1}}{2}$$
Definition 1.2
Class Exercise 2:
x_i: 11.8, 11.9, 11.8, 12.4, 12.8, 12.4, 12.1, 12.6, 12.0, 11.3, 11.8, 11.7, 11.5, 11.9
• What is the median?
– Median = 11.9
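The odd and even cases of the median definition can be sketched directly in code. The function below is an illustrative implementation, not part of the course material; it converts the 1-based positions of the definition to Python's 0-based list indices.

```python
def median(values):
    """Median via the odd/even rule, using 1-based positions in sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the single middle value, at position (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the values at positions n / 2 and n / 2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

data = [11.8, 11.9, 11.8, 12.4, 12.8, 12.4, 12.1,
        12.6, 12.0, 11.3, 11.8, 11.7, 11.5, 11.9]
print(median(data))
```

Here n = 14 is even, so the result is the average of the 7th and 8th ordered values, both of which are 11.9.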
Median
• Advantages of using median
– provides an idea where most data are
located
– little calculation required
• Disadvantages of using median
– data must be sorted and arranged
– does not use all the data
– ignores extreme values, which may be important
Measures of Central
Tendency
3. Mode
the most frequently occurring number
in a data set
Class Exercise 3:
x_i: 11.8, 11.9, 11.8, 12.4, 12.8, 12.4, 12.1, 12.6, 12.0, 11.3, 11.8, 11.7, 11.5, 11.9
• What is the sample mode?
– Mode = 11.8
Mode
• Advantages
– no calculations necessary
– not influenced by extreme values
– an actual value
• Disadvantage
– The data may not have a mode!
Example
• Suppose a data set consists of the following
observations:
0.32 0.53 0.28 0.37 0.47 0.43 0.36 0.42 0.38 0.43
• Find the mean, median, and mode.
• Mean: 0.399
• Median: 0.40
• Mode: 0.43
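The three measures for this example can be cross-checked with Python's standard `statistics` module. This is a sketch for verification only; `multimode` is used instead of `mode` so that a tie between several values would be visible.

```python
import statistics

data = [0.32, 0.53, 0.28, 0.37, 0.47, 0.43, 0.36, 0.42, 0.38, 0.43]

mean = statistics.mean(data)
median = statistics.median(data)
# multimode returns every most-frequent value, so ties are not hidden.
modes = statistics.multimode(data)

print(f"mean = {mean:.3f}, median = {median:.2f}, modes = {modes}")
```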
Section 1.4
Measures of
Variability
Measures of Spread
1. Range
R = Max - Min
• Advantages
• easy to calculate
• Disadvantages
• does not use all the data
• if n>7, use standard
deviation
Class Exercise 4:
x_i: 11.8, 11.9, 11.8, 12.4, 12.8, 12.4, 12.1, 12.6, 12.0, 11.3, 11.8, 11.7, 11.5, 11.9
• What is the range?
– R = 12.8 – 11.3 = 1.5
Measures of Spread
2. Standard Deviation
Population:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Sample:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
If we knew the mean of the population, we would not
need a sample. In practice, μ is almost never known and
so a sample needs to be used.
However, observations tend to be closer to x̄ than to μ. To
compensate for this, we use n − 1 as the divisor rather
than n.
Measures of Spread
3. Variance
Population:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Sample:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
• Variance is the square of the standard deviation
• Standard deviation is used more frequently
– It has the same unit as the measures of central tendency
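The reason for dividing by n − 1 in the sample formulas can be seen in a simulation. The sketch below draws many small samples from a hypothetical normal population (μ = 10, σ = 2, so the true variance is 4; these numbers, the sample size, and the seed are assumptions for illustration) and averages the two candidate estimates.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable
TRUE_SIGMA = 2.0        # hypothetical population: normal, mu = 10, sigma = 2
n, trials = 5, 20_000

sum_div_n = 0.0
sum_div_n_minus_1 = 0.0
for _ in range(trials):
    sample = [rng.gauss(10.0, TRUE_SIGMA) for _ in range(n)]
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)  # squared deviations from x-bar
    sum_div_n += ss / n
    sum_div_n_minus_1 += ss / (n - 1)

avg_div_n = sum_div_n / trials
avg_div_n_minus_1 = sum_div_n_minus_1 / trials
print(f"true variance:          {TRUE_SIGMA ** 2:.2f}")
print(f"average with divisor n: {avg_div_n:.2f}")
print(f"average with n - 1:     {avg_div_n_minus_1:.2f}")
```

On average, the divisor n underestimates the true variance, because deviations are measured from x̄ rather than from μ; the divisor n − 1 should come out close to the true value.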
Definition 1.3
How to calculate s
$$\bar{x} = \frac{104.0}{8} = 13.0$$
$$s^2 = \frac{1.6}{7} = 0.229$$
$$s = \sqrt{0.229} = 0.479$$
Standard Deviation
[Figure annotated with s = 0.479]
What is the sample standard deviation for the
given dataset?
x_i     x_i − x̄    (x_i − x̄)²
11.8    -0.2       0.04
11.9    -0.1       0.01
11.8    -0.2       0.04
12.4     0.4       0.16
12.8     0.8       0.64
12.4     0.4       0.16
12.1     0.1       0.01
12.6     0.6       0.36
12.0     0.0       0.00
11.3    -0.7       0.49
11.8    -0.2       0.04
11.7    -0.3       0.09
11.5    -0.5       0.25
11.9    -0.1       0.01
Sum                2.30
x̄ = 12.0
Sample variance = s² = 2.30/(14 − 1) = 0.177
Sample standard deviation = s = √0.177 = 0.42
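The column-by-column calculation in this table can be reproduced in a few lines. This is a sketch for checking the arithmetic; the variable names are illustrative.

```python
import math

data = [11.8, 11.9, 11.8, 12.4, 12.8, 12.4, 12.1,
        12.6, 12.0, 11.3, 11.8, 11.7, 11.5, 11.9]

n = len(data)
x_bar = sum(data) / n
squared_devs = [(x - x_bar) ** 2 for x in data]  # third column of the table
ss = sum(squared_devs)                            # sum of squared deviations

sample_variance = ss / (n - 1)
sample_sd = math.sqrt(sample_variance)
print(f"sum = {ss:.2f}, s^2 = {sample_variance:.3f}, s = {sample_sd:.2f}")
```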
What is the population variance?
(Treating the same 14 values as the entire population, so μ = 12.0.)
x_i     x_i − μ    (x_i − μ)²
11.8    -0.2       0.04
11.9    -0.1       0.01
11.8    -0.2       0.04
12.4     0.4       0.16
12.8     0.8       0.64
12.4     0.4       0.16
12.1     0.1       0.01
12.6     0.6       0.36
12.0     0.0       0.00
11.3    -0.7       0.49
11.8    -0.2       0.04
11.7    -0.3       0.09
11.5    -0.5       0.25
11.9    -0.1       0.01
Sum                2.30
Population variance = σ² = 2.30/14 = 0.164
Population standard deviation = σ = √0.164 = 0.405
Class Exercise
• Preventing fatigue crack propagation in aircraft
structures is an important element of aircraft
safety. An engineering study to investigate
fatigue crack in n = 9 cyclically loaded wing
boxes reported the following crack lengths (in
mm): 2.13, 2.96, 3.02, 1.82, 1.15, 1.37, 2.04,
2.47, and 2.60. Calculate the sample average and
sample standard deviation. Construct a dot
diagram of the data.
Section 1.6
Statistical
Modeling,
Scientific
Inspection, and
Graphical
Diagnostics
• Graphical presentation of data may yield
important information in engineering.
– Dot diagram
– Scatter plot
– Stem and leaf diagram
– Histogram
– Box plots
Table 1.1 Data Set for
Example 1.2
Observing Processes Over Time
Example: Tensile strength
Figure 1.5 Scatter plot of tensile
strength and cotton percentages
One method of presentation is
the scatter plot.
Example: Car Battery Life
Table 1.5 Stem-and-Leaf Plot of
Battery Life
Another method is the stem and leaf plot
[Figure: histogram of Frequency (number) against Battery Life (years), with bins centered at 1.5, 2.5, 3.5, and 4.5]
Table 1.6 Double-Stem-and-Leaf
Plot of Battery Life
Table 1.7 Relative Frequency
Distribution of Battery Life
Figure 1.6 Relative frequency
histogram
A histogram takes the
information from the
stem and leaf table to
graphically represent
data
Figure 1.7 Estimating frequency
distribution
Stem and Leaf example –
number of bins
Figure 1.8 Skewness of data
Characteristics of a Stable
Distribution
• Most of the data are
near the average
• centerline divides
curve into two
symmetrical halves
• few points near MAX
and MIN
• bell-shaped
• no points beyond
curve
Unstable Distributions
Skewed Spikes
Unstable Distributions
not a bell-shaped curve
• unstable process
• unpredictable
• assignable causes
• Variation
– Natural
– Assignable causes → Determine
Bi-modal
Here is another example on graphical
representation of data
Graphical Presentation of Data
BOX PLOTS
Percentiles
• The kth percentile is the point below which
approximately k% of the ordered samples lie
(and above which (100 − k)% lie)
• Computation:
– Compute the position 0.01*k*(n+1) in the ordered data
to find the samples just below and just above it
– Use the fractional part of that position to interpolate
between the two samples
Percentile Example
• Compute the 25th percentile of samples of RMS voltage in an
office electrical outlet:
– 110 V, 111 V, 110.5 V, 109 V, 110 V
– Ordered: 109, 110, 110, 110.5, 111
• 0.01*k*(n+1) = 0.01*25*(5+1) = 1.5
• Interpolate between the 1st and 2nd ordered samples:
– 109 and 110
• 25th percentile = 109 + (110 − 109)*0.5 = 109.5 V
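The position-and-interpolate procedure can be sketched as a small function. The name `percentile` and the clamping behavior at the ends of the data are assumptions for illustration; the slides do not say what to do when the computed position falls before the first or after the last sample.

```python
import math

def percentile(values, k):
    """kth percentile using the 0.01*k*(n+1) position rule with interpolation.

    Assumption (not stated on the slides): positions before the first or
    after the last ordered sample are clamped to that sample.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / 100       # same as 0.01*k*(n+1), 1-based
    lower = math.floor(position)       # ordered sample just below
    fraction = position - lower        # how far to move toward the next one
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + (above - below) * fraction

volts = [110, 111, 110.5, 109, 110]
p25 = percentile(volts, 25)
print(p25)
```

For the voltage data the position is 1.5, so the result lies halfway between the 1st and 2nd ordered samples.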
Quartiles
• When an ordered set of data is divided into four
equal parts, the division points/values are called
quartiles
• 1st Quartile (q1): 25th percentile (25% of dataset
below value)
– Also called Lower Quartile
• 2nd Quartile (q2): 50th percentile (equal to the
median)
• 3rd Quartile (q3): 75th percentile
– Also called Upper Quartile
Quartiles
• Quartiles may not coincide with data points. You might
need the value at, say, position 20.25 in the ordered dataset
– Use interpolation
• Position of q1: (n+1)/4; position of q3: 3(n+1)/4
• Interquartile Range (IQR):
– Difference between the upper and lower quartiles: q3 − q1
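Quartiles and the IQR can be computed with Python's standard `statistics.quantiles`. As far as can be told from its documented behavior, the `"exclusive"` method places the quartiles at positions (n+1)/4, 2(n+1)/4, and 3(n+1)/4 with linear interpolation, which matches the rule above; treat that correspondence as an assumption to verify. The data are the 14 values from the earlier class exercises.

```python
import statistics

data = [11.8, 11.9, 11.8, 12.4, 12.8, 12.4, 12.1,
        12.6, 12.0, 11.3, 11.8, 11.7, 11.5, 11.9]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(f"q1 = {q1:.3f}, q2 = {q2:.3f}, q3 = {q3:.3f}, IQR = {iqr:.3f}")
```

With n = 14, q1 sits at position 3.75 (between 11.7 and 11.8) and q3 at position 11.25 (between two values of 12.4).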
Box Plots to represent data
• A box plot is a graphical display that describes several important
features of the data, such as the center, spread, departure from symmetry, and
observations that lie unusually far from the bulk of the data (outliers).
• The 3 quartiles are displayed on a rectangular box. q1 is the left or lower
edge of the box, q3 is the right or upper edge, and the median (q2) is a line inside the box.
• A whisker extends outward from each end of the box to the most extreme data point within 1.5 IQR of that edge.
• Data beyond the whiskers are plotted as individual points.
• Points beyond the whiskers but within 3 IQR of the box edges are outliers.
Points more than 3 IQR from the edges are extreme outliers.
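The 1.5 IQR and 3 IQR rules can be sketched as a classification routine. The function name, the returned dictionary, and the data set below are all hypothetical and chosen only so that one point of each kind appears; the quartiles use the `"exclusive"` method of `statistics.quantiles`, assumed here to match the (n+1) position rule.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers by the 1.5 IQR / 3 IQR rules."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    # Whiskers stop at the most extreme data points inside the inner limits.
    inside = [x for x in ordered if inner_low <= x <= inner_high]
    outliers = [x for x in ordered
                if outer_low <= x < inner_low or inner_high < x <= outer_high]
    extreme = [x for x in ordered if x < outer_low or x > outer_high]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
        "extreme_outliers": extreme,
    }

# Hypothetical data, not taken from the course examples.
sample = [10, 11, 12, 12, 13, 13, 14, 15, 16, 23, 30]
summary = box_plot_summary(sample)
print(summary)
```

For this sample the box runs from 12 to 16, so the limits are 22 (1.5 IQR above the box) and 28 (3 IQR above): 23 is an outlier and 30 is an extreme outlier.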
Box Plots to represent data
[Figure: box plot with Q1, the median, and Q3 marked; IQR = Q3 − Q1]
Between 1.5 and 3 IQR away → outlier
Beyond 3 IQR away → extreme outlier
Box Plot of Compressive Data
Table 1.8 Nicotine Data for
Example 1.5
Figure 1.9 Box-and-whisker plot
for Example 1.5
Exercise
• A battery-operated pacemaker device helps the
human heart to beat in regular rhythm. The
activation rate is important in stimulating the
heart, when necessary. Fourteen activation rates
(in sec.) were collected on a newly designed
device: 0.670, 0.697, 0.699, 0.707, 0.732, 0.733,
0.737, 0.747, 0.751, 0.774, 0.777, 0.804, 0.819,
0.827
• (a) Compute the sample mean and standard
deviation
• (b) Find the sample upper and lower quartiles
• (c) Find the sample median
• (d) Construct a box plot of the data