Introduction to Statistics for Engineers
Chapter One
Introduction to Statistics
Outline
Introduction to statistics
Importance of statistics
Limitation of statistics
Classification of statistics
Definitions:
Statistics is a science that helps us collect, analyze, and
present data systematically.
Application areas of statistics
Some of the diverse fields in which Statistical methodology
has extensive applications are:
Engineering:
Improving product design, testing product performance,
determining reliability and maintainability, working out safer
systems of flight control for airports, etc.
Business:
Estimating the volume of retail sales, designing optimum
inventory control system, producing auditing and accounting
procedures, improving working conditions in industrial plants,
assessing the market for new products.
Quality Control:
Determining techniques for evaluating quality through
adequate sampling, in-process control, consumer surveys, and
experimental design in product development, etc.
* Recognizing its importance, large organizations maintain
their own Statistical Quality Control Department *.
Economics:
Measuring indicators such as volume of trade, size of labor
force, and standard of living, analyzing consumer behavior,
computation of national income accounts, formulation of
economic laws, etc.
* In particular, regression analysis is used extensively in the field
of Economics *.
Health and Medicine:
Developing and testing new drugs, delivering improved medical care,
preventing, diagnosing, and treating disease, etc. In particular, inferential
Statistics has wide application in the fields of health and
medicine.
Biology:
Exploring the interactions of species with their environment, creating
theoretical models of the nervous system, studying genetic evolution,
etc.
Psychology:
Measuring learning ability, intelligence, and personality characteristics,
creating psychological scales, studying abnormal behavior, etc.
Sociology:
Testing theories about social systems, designing and conducting sample
surveys to study social attitudes, exploring cross-cultural differences,
studying the growth of human population, etc.
Classification of statistics
There are two main branches of statistics:
1. Descriptive statistics
2. Inferential statistics
1. Descriptive statistics:
It is the first phase of Statistics;
Steps/stages in Statistical Investigation
1. Collection of Data:
Data collection is the process of gathering information or
data about the variable of interest. Data are inputs for
Statistical investigation. Data may be obtained either
from primary source or secondary source.
2. Organization of Data
Organization of data includes three major steps.
1. Editing: checking the data and omitting inconsistencies and
irrelevancies.
2. Classification: grouping the collected and edited
data.
3. Tabulation: putting the classified data in the form of a table.
Cont’d
3. Presentation of Data
The purpose of presentation in the statistical analysis is to display
what is contained in the data in the form of Charts, Pictures,
Diagrams and Graphs for an easy and better understanding of the
data.
4. Analysis of Data
In a statistical investigation, analyzing data means computing
statistical constants from the collected mass of data, such as
measures of central tendency (averages), measures of dispersion,
and so on.
It mainly involves mathematical operations: computing measures of
central tendency, measures of variation, regression analysis, etc.
In its most demanding cases, analysis requires knowledge of
advanced mathematics.
Cont’d
5. Interpretation of Data
Interpretation involves explaining the statistical constants computed in
the analysis stage in order to form valid conclusions and
inferences.
It is the most difficult stage and requires the most skill.
It is also the stage at which Statistics is most
vulnerable to misuse.
Correct interpretation of results leads to a valid
conclusion and hence aids in making correct
decisions.
Improper (incorrect) interpretation may lead to wrong
conclusions and defeat the whole objective of the
study.
THE ENGINEERING METHOD AND
STATISTICAL THINKING
An engineer is someone who solves problems of
interest to society by the efficient application
of scientific principles.
Data collection and representation
The term “Data Collection” refers to all the issues
related to data sources, scope of investigation and sampling
techniques.
Sampling Techniques
Inferential statistics is a systematic method of drawing
sound conclusions about a population on the basis
of examining a few representative units, termed a sample.
The process of selecting a sample is called sampling.
The number of units in the sample is called the sample size.
The sample size for a study is determined on the basis
of the following factors:
The size of the population
The availability of resources
The degree of accuracy
The homogeneity or heterogeneity of the population
The nature of the study
The method of sampling technique adopted
The nature of respondents
Reason for sampling
We study a sample instead of
the entire population for one or more of
the following reasons:
Shortage of money, time, and labor
Limited data available
Minimizing destruction (when measuring a unit destroys it)
Obtaining more complete, detailed, and accurate
information about the characteristics of the
population with less time, energy, and expenditure
Cont’d
A good sample possesses two characteristics, which are:
i. Random/Probability sampling
Cont’d
The random sampling method selects
a sample such that each item within the
population has an equal chance of being selected.
Random sampling is further divided into the
following:
i. Simple,
ii. Stratified,
iii. Systematic and
iv. Cluster sampling
i. Simple Random Sampling Method:
This is the simplest method of drawing
a sample from a given population. It is used when the
population under consideration is homogeneous.
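Simple random sampling can be sketched in a few lines of Python. The population of 100 numbered units, the sample size of 10, and the fixed seed are all assumptions made for illustration only.

```python
import random

# Hypothetical population: 100 numbered units (an assumption for illustration).
population = list(range(1, 101))

# A fixed seed is used only so that the sketch is reproducible.
rng = random.Random(42)

# Draw 10 units without replacement; every unit has the same chance of selection.
sample = rng.sample(population, k=10)
print(sample)
```

Because the draw is without replacement, no unit can appear twice in the sample.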
Cont’d
ii. Stratified Random Sampling Method:
This method of sampling is used when the population under
study is heterogeneous.
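One way to picture stratified sampling is to split the population into homogeneous groups (strata) and draw a simple random sample from each. The sketch below assumes proportional allocation, which is one common choice and is not stated in these notes; the strata names and sizes are hypothetical.

```python
import random

# Hypothetical heterogeneous population divided into strata (assumed sizes).
strata = {
    "first_year": list(range(0, 60)),
    "second_year": list(range(60, 90)),
    "third_year": list(range(90, 100)),
}
sample_size = 10
total_units = sum(len(units) for units in strata.values())
rng = random.Random(0)  # fixed seed only for reproducibility

stratified_sample = {}
for name, units in strata.items():
    # Proportional allocation (an assumption): each stratum contributes
    # in proportion to its share of the population.
    k = round(sample_size * len(units) / total_units)
    stratified_sample[name] = rng.sample(units, k)

for name, chosen in stratified_sample.items():
    print(name, len(chosen), chosen)
```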
Judgment
Convenient and
Quota sampling.
Judgment Sampling: -
In Judgment sampling, personal judgment plays a significant role in
the selection of the sample
Data Presentation
Raw data:- It is collected numerical data which has
not been arranged in order of magnitude.
Tabular method
Graphical method
Diagrammatic method
Cont’d
1. Tabular presentation of data:
It is a tabular arrangement of
numerical data in order of magnitude
showing the distinct values with the
corresponding frequencies.
Example:
Suppose the following are the test scores of 16 students in a
class. Construct an ungrouped frequency distribution.
“14, 17, 10, 19, 14, 10, 14, 8, 10, 17, 19, 8, 10, 14, 17, 14”
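The ungrouped frequency distribution for these 16 scores can be tallied as in the following minimal sketch, which lists each distinct value in order of magnitude with its frequency. The use of `collections.Counter` is simply one convenient way to do the tally.

```python
from collections import Counter

scores = [14, 17, 10, 19, 14, 10, 14, 8, 10, 17, 19, 8, 10, 14, 17, 14]

# Count how many times each distinct score occurs.
frequency = Counter(scores)

print("Score  Frequency")
for value in sorted(frequency):
    print(f"{value:>5}  {frequency[value]:>9}")
```

The frequencies should add up to the number of observations, here 16.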
18-25 5
26-32 15
33-39 10
Components of grouped frequency distribution
Cont’d
5. Class width / class interval
is the difference between two consecutive lower
class limits or two consecutive upper class
limits. (OR)
It can be obtained by taking the difference of two
adjoining class marks or two adjoining lower class
boundaries.
Class width = Range / Number of classes desired,
where Number of classes = 1 + 3.322 log10(N) and N is
the number of observations.
6. Unit of measure
is the smallest possible positive difference
between any two measurements in the given data
set that shows the degree of precision.
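The class-count and class-width formulas can be checked numerically. The sketch below uses N = 40 and a range of 96 - 39 = 57, taken from the aptitude-score exercise in this chapter; the function names are hypothetical, and rounding the width up to a whole number is an assumption about practice rather than a rule stated here.

```python
import math

def number_of_classes(n_observations):
    """Number of classes = 1 + 3.322 * log10(N)."""
    return 1 + 3.322 * math.log10(n_observations)

def class_width(data_range, n_classes):
    """Class width = range / number of classes desired."""
    return data_range / n_classes

k = number_of_classes(40)        # about 6.32, so roughly 6 classes
w = class_width(96 - 39, 6)      # 57 / 6 = 9.5

print(f"suggested number of classes: {k:.2f}")
print(f"class width: {w} (rounded up to {math.ceil(w)})")
```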
Cont’d
Class boundaries:
Cont’d
The unit of measure is 1
Rules to construct Grouped
Frequency Distribution (GFD):
i. Find the unit of measure of the given data;
ii. Find the range;
iii. Determine the number of classes required;
iv. Find class width (size);
v. Determine the lowest class limit and then find the successive
lower and upper class limits, forming non-overlapping
intervals such that each observation falls into exactly one
of the class intervals;
vi. Find the number of observations falling into each class
interval, which is taken as the frequency of the class (class
interval); this is best done using a tally.
Exercise:-
Construct a GFD of the following aptitude test scores
of 40 applicants for accountancy positions in a company
with
a. 6 classes b. 8 classes
96 89 58 61 46 59 75 54
41 56 77 49 58 60 63 82
66 64 69 67 62 55 67 70
78 65 52 76 69 86 44 76
57 68 64 52 53 74 68 39
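Part (a) of the exercise can be sketched as follows. The lowest class limit of 38 and the class width of 10 (57/6 = 9.5, rounded up) are assumptions chosen so that the classes agree with the class boundaries 37.5-47.5, 47.5-57.5, and so on used in this chapter; the helper `grouped_frequency` is a hypothetical name.

```python
scores = [
    96, 89, 58, 61, 46, 59, 75, 54,
    41, 56, 77, 49, 58, 60, 63, 82,
    66, 64, 69, 67, 62, 55, 67, 70,
    78, 65, 52, 76, 69, 86, 44, 76,
    57, 68, 64, 52, 53, 74, 68, 39,
]

def grouped_frequency(data, lowest_limit, width, n_classes, unit=1):
    """Return (lower limit, upper limit, frequency) for each class."""
    table = []
    for i in range(n_classes):
        lower = lowest_limit + i * width
        upper = lower + width - unit   # limits differ by width minus one unit of measure
        freq = sum(1 for x in data if lower <= x <= upper)
        table.append((lower, upper, freq))
    return table

# Assumed starting limit 38 and width 10 for the 6-class case.
gfd = grouped_frequency(scores, lowest_limit=38, width=10, n_classes=6)
for lower, upper, freq in gfd:
    print(f"{lower}-{upper}: {freq}")
```

Each observation falls into exactly one class, so the frequencies sum to 40.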
Types of Grouped Frequency
Distribution
Cont’d
1. Relative frequency distribution (RFD):
For example
Test score    F     RFD               PFD
37.5-47.5     4     4/40 = 0.1        10%
47.5-57.5     8     8/40 = 0.2        20%
57.5-67.5     13    13/40 = 0.325     32.5%
67.5-77.5     10    10/40 = 0.25      25%
77.5-87.5     3     3/40 = 0.075      7.5%
87.5-97.5     2     2/40 = 0.05       5%
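The relative (RFD) and percentage (PFD) columns follow directly from the class frequencies. A minimal sketch, using the frequencies from the test-score table:

```python
frequencies = [4, 8, 13, 10, 3, 2]   # class frequencies from the test-score table
total = sum(frequencies)             # 40 observations

# Relative frequency = class frequency / total frequency.
relative = [f / total for f in frequencies]

# Percentage frequency = relative frequency * 100.
percentage = [100 * r for r in relative]

for f, r, p in zip(frequencies, relative, percentage):
    print(f"{f:>3}  {r:.3f}  {p:.1f}%")
```

The relative frequencies sum to 1 and the percentages to 100%.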
Cont’d
2. Cumulative Frequency Distribution (CFD):
LRCF MRCF
Test score CF Test score CF
3. Relative Cumulative Frequency Distribution (RCFD)
It is used to determine the ratio or the percentage of observations
that lie below or above a certain value/class boundary, to the total
frequency of all the classes. These are of two types: The LRCFD and
MRCFD.
Cont’d
MRCFD
Test score        MCF    MRCF               MPCF
More than 37.5    40     40/40 = 1          100%
More than 47.5    36     36/40 = 0.9        90%
More than 57.5    28     28/40 = 0.7        70%
More than 67.5    15     15/40 = 0.375      37.5%
More than 77.5    5      5/40 = 0.125       12.5%
More than 87.5    2      2/40 = 0.05        5%
More than 97.5    0      0/40 = 0           0%
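The more-than cumulative frequencies in the table can be reproduced by subtracting each class frequency in turn from the total. The sketch also computes the less-than counterpart for comparison; the variable names are illustrative.

```python
boundaries = [37.5, 47.5, 57.5, 67.5, 77.5, 87.5, 97.5]
frequencies = [4, 8, 13, 10, 3, 2]
total = sum(frequencies)

more_than = []   # observations above each class boundary
less_than = []   # observations below each class boundary
below = 0
for f in frequencies:
    less_than.append(below)
    more_than.append(total - below)
    below += f
less_than.append(below)          # everything lies below the last boundary
more_than.append(total - below)  # nothing lies above the last boundary

for b, mc in zip(boundaries, more_than):
    print(f"More than {b}: {mc}  ({mc / total:.3f}, {100 * mc / total:.1f}%)")
```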
Graphic Methods of Data presentation
1. Histogram
2. Frequency Polygon (Line graph)
1. Histogram:
A graphical presentation of grouped
frequency distribution consisting of a series
of adjacent rectangles whose bases are the
class intervals specified in terms of class
boundaries (equal to the class width of the
corresponding classes) shown on the x-axis
and whose heights are proportional to the
corresponding class frequencies shown on
the y-axis.
Cont’d
Histogram: E.g.
[Figure: histogram with frequency (0 to 20) on the y-axis and class intervals 20-30, 30-40, 40-50, 50-60, 60-70, 70-80 on the x-axis]
Cont’d
Uses for a Histogram
A Histogram can be used:
[Figure: line graph with frequency (0 to 20) on the y-axis and class intervals 20-30 through 70-80 on the x-axis]
Steps to draw Frequency polygon
i. Mark the class mid points on the x-axis and
the frequency on the y-axis.
ii. Mark dots which correspond to the
frequency of the marked class mid points.
iii. Join each successive dot by a series of
line segments to form line graph, including
classes with zero frequencies at both ends
of the distribution to form a polygon.
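The three steps can be sketched by computing the points to be plotted. The class boundaries and frequencies are taken from the test-score table in this chapter; only the coordinates are computed here, and the actual drawing is left to any plotting tool.

```python
boundaries = [37.5, 47.5, 57.5, 67.5, 77.5, 87.5, 97.5]
frequencies = [4, 8, 13, 10, 3, 2]
width = boundaries[1] - boundaries[0]   # common class width, 10

# Step i: class midpoints (class marks) for the x-axis.
midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]

# Steps ii and iii: pair each midpoint with its frequency, then add one
# zero-frequency class at each end so the line closes into a polygon.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]
points = list(zip(xs, ys))

for x, y in points:
    print(x, y)
```

Joining consecutive points with straight line segments gives the frequency polygon.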
Find the midpoints of each class
Create a frequency polygon using the data
3. O-GIVE curve (Cumulative Frequency Curve /
percentage Cumulative Frequency Curve)
{The number of values less than the upper class boundary for the
current class. This is a running total of the frequencies.}
O-gives are of two types: The Less than O-give and The
More than O-give.
Ogive: E.g.
[Figure: ogive with cumulative frequency (0 to 45) on the y-axis and class boundaries 20, 30, 40, 50, 60, 70, 80 on the x-axis]
Steps to draw O-gives
i. Mark class boundaries on the x-axis and mark non
overlapping intervals of equal length on the y-axis to
represent the cumulative frequencies.
ii. For each class boundary marked on the x-axis, plot a
point with height equal to the corresponding cumulative
frequency.
iii. Connect the marked points by a series of line segments
where the less than O-give is done by plotting the less
than cumulative frequency against the upper class
boundaries
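For the less-than O-give, the points to plot can be sketched as a running total of the frequencies paired with the upper class boundaries. The data come from the test-score table in this chapter; starting the curve at the lowest class boundary with a cumulative frequency of 0 is a common convention assumed here.

```python
from itertools import accumulate

lowest_boundary = 37.5
upper_boundaries = [47.5, 57.5, 67.5, 77.5, 87.5, 97.5]
frequencies = [4, 8, 13, 10, 3, 2]

# Running total of the frequencies: the less-than cumulative frequencies.
cumulative = list(accumulate(frequencies))

# Each point is (upper class boundary, less-than cumulative frequency).
less_than_points = [(lowest_boundary, 0)] + list(zip(upper_boundaries, cumulative))

for x, y in less_than_points:
    print(x, y)
```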
Draw the x and y axis
• Bar charts
• Pie chart
• Pictograph and
• Pareto diagram
1.3 Measures of Central Tendency
The mean
The mode
The median
Percentile/Quantiles and
Midrange
The Mean:
It is the most commonly used measure of
central value.
Types of Mean:
1. Arithmetic Mean
2. Geometric Mean
3. Harmonic Mean
4. Quadratic Mean
5. Trimmed Mean
6. Weighted Mean
7. Combination mean
1. Arithmetic Mean (simply Mean)
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Cont’d
Advantages
• It is the most commonly used measure of location or
central tendency for continuous variables.
• The arithmetic mean uses all observations in the
data set.
• All observations are given equal weight.
Disadvantage
• The mean is affected by extreme values that may
not be representative of the sample.
2. Geometric mean
• It is the nth root of the product of the
data elements.
Geometric mean = $\sqrt[n]{x_1 \cdot x_2 \cdots x_n}$
Harmonic mean = $\frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$
• Example: Suppose a boy rode a bicycle three miles. Due
to the topography, for the first mile he rode 2 mph; for
the second mile 3 mph; for the final mile the average
speed was 4 mph. What was the average speed for the
three miles?
Solution:
Using the harmonic mean:
Average speed = $\frac{3}{\frac{1}{2} + \frac{1}{3} + \frac{1}{4}} = \frac{3}{13/12} = \frac{36}{13} \approx 2.77$ mph
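The bicycle example can be verified with a short sketch. It computes the harmonic mean exactly with fractions and then with the standard library function `statistics.harmonic_mean`.

```python
from fractions import Fraction
from statistics import harmonic_mean

speeds_mph = [2, 3, 4]   # speed over each of the three one-mile segments

# Harmonic mean = n / (sum of reciprocals), computed exactly.
exact = Fraction(len(speeds_mph)) / sum(Fraction(1, v) for v in speeds_mph)

# The same quantity as a float, from the standard library.
approx = harmonic_mean(speeds_mph)

print(exact)              # 36/13
print(round(approx, 2))   # 2.77
```

The harmonic mean is the right average here because each speed applies to the same distance, not the same time.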
4. Quadratic mean
5. Trimmed mean
It usually refers to the arithmetic mean computed
without the top 10% and bottom 10% of
the ordered scores.
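A minimal sketch of the 10% trimmed mean follows. The ten scores are hypothetical, and the function name `trimmed_mean` is an illustrative choice; with ten values, 10% trimming drops one value from each end.

```python
def trimmed_mean(values, proportion=0.10):
    """Arithmetic mean after dropping the given proportion from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical scores with one extreme value (40).
scores = [2, 4, 5, 5, 6, 6, 7, 7, 8, 40]

plain = sum(scores) / len(scores)
trimmed = trimmed_mean(scores)
print(plain, trimmed)
```

Trimming removes the influence of the extreme value 40, which pulls the ordinary mean upward.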
6. Weighted mean
It is the average of differently weighted
scores;
The Mode
Cont’d
Example
‘23, 22, 12, 14, 22, 18, 20, 22, 18, 18’
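The mode of this data set can be found by counting how often each value occurs. The sketch below uses `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest frequency.

```python
from collections import Counter
from statistics import multimode

data = [23, 22, 12, 14, 22, 18, 20, 22, 18, 18]

counts = Counter(data)      # frequency of each distinct value
modes = multimode(data)     # all values sharing the highest frequency

print(counts)
print(sorted(modes))
```

Here 18 and 22 each occur three times, so the data set has two modes (it is bimodal).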
Cont’d
Advantages
Requires no calculations.
Disadvantage
The mode for continuous measurements is
dependent on the grouping of the intervals.
The Median
Cont’d
Example
‘23, 22, 12, 14, 22, 18, 20, 22, 18, 18’
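The median of this data set can be sketched as follows: sort the values and, because the number of observations is even, average the two middle values.

```python
from statistics import median

data = [23, 22, 12, 14, 22, 18, 20, 22, 18, 18]

ordered = sorted(data)
n = len(ordered)

# With an even number of values, the median is the mean of the two middle ones.
by_hand = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(ordered)
print(by_hand, median(data))
```

The two middle values are 18 and 20, giving a median of 19.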
Cont’d
Advantages
• The median always exists and is unique.
Disadvantages
• The values must be sorted in order of
magnitude.
Percentiles / Quartiles
Percentiles are values that divide a distribution
into two groups where the Pth percentile is larger
than P% of the values.
Some specific percentiles have special names:
First Quartile : Q1 = the 25th percentile
Median : Q2 = the 50th percentile
A percentile provides information about how the
data are spread over the interval from the
smallest value to the largest value.
Cont’d
Interpretation
• Q1 = a. This means that 25% of the data
values are smaller than a
Midrange
The midrange is the average of the largest and
smallest observations.
1.4 Measure of Dispersion (of Variability)
The range
The variance
Cont’d
Advantage
• The range is easily understood and gives a
quick estimate of dispersion.
• The range is easy to calculate
Disadvantage
• The range is inefficient because it uses only
the two extreme values and ignores all other
available data. The larger the sample size,
the more inefficient the range becomes.
The Variance
The variance is the average of the squared
differences between each data value and the mean.
If the data set is a sample, the variance is denoted
by $s^2$:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
If the data set is a population, the variance is
denoted by $\sigma^2$:
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Cont’d
Advantages
• The variance is an efficient estimator
Disadvantage
• The calculation of the variance can be
tedious without the aid of a calculator or
computer.
The Standard Deviation
• The standard deviation is one of the most important measures
of dispersion. It uses every observation, so it is more informative than the range or
interquartile range.
$s = \sqrt{s^2}$
Cont’d
What does it measure?
• It measures the dispersion (or spread) of
figures around the mean.
5, 9, 3, 2, 7, 9, 8, 2, 2, 3 (°C)
Solution
x     x̄     (x - x̄)    (x - x̄)²
5     5      0          0
9     5      4          16
3     5     -2          4
2     5     -3          9
7     5      2          4
9     5      4          16
8     5      3          9
2     5     -3          9
2     5     -3          9
3     5     -2          4
∑x = 50                 ∑(x - x̄)² = 80
x̄ = ∑x/n = 50/10 = 5
∑(x - x̄)²/n = 80/10 = 8
√(∑(x - x̄)²/n) = √8 ≈ 2.8 °C
x = temperature, x̄ = mean temperature, √ = square root,
∑ = total of, ² = squared, n = number of values
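The worked temperature example can be checked with Python's `statistics` module. The example divides by the number of values, which corresponds to the population formulas (`pvariance`, `pstdev`); the sample versions, which divide by n - 1, are shown alongside for comparison.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

temps = [5, 9, 3, 2, 7, 9, 8, 2, 2, 3]   # degrees Celsius

m = mean(temps)
squared_deviations = sum((x - m) ** 2 for x in temps)

pop_var = pvariance(temps)    # sum of squared deviations / n
pop_sd = pstdev(temps)        # square root of the population variance
samp_var = variance(temps)    # sum of squared deviations / (n - 1)
samp_sd = stdev(temps)

print(m, squared_deviations)
print(pop_var, round(pop_sd, 2))
print(round(samp_var, 2), round(samp_sd, 2))
```

The population standard deviation, about 2.83, matches the 2.8 °C obtained in the table; the sample standard deviation is slightly larger because of the n - 1 divisor.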
The coefficient of variation
• The coefficient of variation indicates how large
the standard deviation is in relation to the mean.
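As a hedged illustration, the coefficient of variation is commonly computed as the standard deviation divided by the mean, expressed as a percentage; that formula is the standard definition and is not spelled out in these notes. The sketch reuses the temperature readings from the standard deviation example and assumes the population standard deviation.

```python
from statistics import mean, pstdev

temps = [5, 9, 3, 2, 7, 9, 8, 2, 2, 3]   # degrees Celsius

# CV = (standard deviation / mean) * 100%  (population SD assumed here)
cv_percent = 100 * pstdev(temps) / mean(temps)

print(f"CV = {cv_percent:.1f}%")
```

Because the CV is a ratio, it has no units, which makes it useful for comparing the variability of data sets measured on different scales.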
10, 25, 45, 47, 49, 51, 52, 52, 54, 56, 57, 58, 60, 62, 66, 68, 70, 90
Q1 = 49, Q3 = 62
IR = Q3 - Q1 = 62 - 49 = 13
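The quartiles in this example can be reproduced by taking the median of the lower half and of the upper half of the ordered data. This median-of-halves rule is an assumption that happens to match the values shown; other quartile conventions can give slightly different results.

```python
from statistics import median

data = [10, 25, 45, 47, 49, 51, 52, 52, 54, 56, 57, 58, 60, 62, 66, 68, 70, 90]

ordered = sorted(data)
half = len(ordered) // 2

# Median-of-halves convention (an assumption): Q1 is the median of the
# lower half, Q3 the median of the upper half.
q1 = median(ordered[:half])
q3 = median(ordered[-half:])
interquartile_range = q3 - q1

print(q1, q3, interquartile_range)
```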
Measure of skewness and kurtosis
Skewness (mean deviation)
• Skewness is a measure of the tendency of the deviations
to be larger in one direction than in the other.
• Skewness is the degree of asymmetry or departure
from symmetry of a distribution.
• If the frequency curve of a distribution has a longer tail
to the right of the central value than to the left, the
distribution is said to be skewed to the right or to
have positive skewness.
• If the reverse is true, it is said that the distribution is
skewed to the left or has negative skewness.
Cont’d
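The following formulas can be used to determine skew and kurtosis:
skew: $s^3 = \frac{\sum (X - \bar{X})^3 / N}{\left( \sum (X - \bar{X})^2 / N \right)^{3/2}}$
kurtosis: $s^4 = \frac{\sum (X - \bar{X})^4 / N}{\left( \sum (X - \bar{X})^2 / N \right)^{2}}$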
s2, s3, & s4
Collectively, the variance (s2), skew (s3), and
kurtosis (s4) describe the shape of the
distribution
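As a rough sketch, skew and kurtosis can be computed from central moments. The moment-ratio definitions used below (third and fourth central moments divided by powers of the second) are the standard population forms and are an assumption about what these notes intend; the function names are hypothetical, and the data are the temperature readings from the standard deviation example.

```python
def central_moment(values, order):
    """Average of (x - mean) raised to the given power."""
    m = sum(values) / len(values)
    return sum((x - m) ** order for x in values) / len(values)

def skewness(values):
    # Third central moment divided by the variance to the power 3/2.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    # Fourth central moment divided by the squared variance.
    return central_moment(values, 4) / central_moment(values, 2) ** 2

temps = [5, 9, 3, 2, 7, 9, 8, 2, 2, 3]
skew = skewness(temps)
kurt = kurtosis(temps)
print(round(skew, 3), round(kurt, 3))
```

A positive skew value indicates a longer tail to the right of the mean, consistent with the description of positive skewness.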
Skewness and Kurtosis
End of Chapter 1