Business Statistics
CHAPTER 3
DESCRIBING DATA
Dang Quan Tri
Summary Measures
Describing Data Numerically
Center and Location: Mean, Median, Mode, Weighted Mean
Other Measures of Location: Percentiles, Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Content
Mean
Median
Mode
Weighted Mean
Range
Mean (Arithmetic Average)
The most common measure of central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers)
Data 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3
Data 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10) / 5 = 20 / 5 = 4
Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-
Hall, Inc.
Mean
Mean is the average of numbers.
How to calculate?
Adding up all the numbers.
Dividing by how many numbers there are
Example
What is the mean of these numbers?
6, 11, 7
S1: Adding the numbers:
6 + 11 + 7 = 24
S2: Dividing by how many numbers:
24 / 3 = 8
-> The mean is 8
Why does this work?
Because 6, 11 and 7 added together is the same as 3 lots of 8.
It is like you are "flattening out" the numbers.
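The two steps above (add up, then divide by the count) can be sketched in a few lines of Python. The function name `mean` is only an illustrative choice, not something defined in these slides.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty list is undefined")
    return sum(values) / len(values)

numbers = [6, 11, 7]
print(mean(numbers))                       # 24 / 3 = 8.0
print(sum(numbers) == 3 * mean(numbers))   # True: the total equals 3 lots of 8
```

The second print checks the "3 lots of 8" idea: replacing every number by the mean leaves the total unchanged.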
Negative number
How to handle negative numbers?
Adding a negative number is the same as
subtracting the number (without the negative):
3 + (-2)= 3 - 2 = 1
Example
What is the mean of each of these sets of numbers?
a) 6, 7, 9, 5, 2, 3
b) 3, -7, 5, 13, -2
c) -1, -3, -6, -9, -10
Mean (Arithmetic Average)
The Mean is the arithmetic average of data values
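Sample mean (n = Sample Size):
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
Population mean (N = Population Size):
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$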
Ex: Suppose that in thirty shots at a target, a marksman makes the following scores:
5 2 2 3 4 4 3 2 0 3 0 3 2 1 5 1 3 1 5 5 2 4 0 0 4
5 4 4 5 5
Calculate the mean.
Weighted Mean
Used when values are grouped by frequency or
relative importance
Weighted Mean - Case 1
Example: Sample of 26 Repair Projects
Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2
When the weights don’t add to 1:
Multiply each weight w by its matching value x, sum
that all up, and divide by the sum of weights:
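Weighted mean days to complete:
$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31 \text{ days}$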
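As a rough check of the repair-project calculation, the weighted mean can be sketched as follows. The function name `weighted_mean` is an illustrative choice. Dividing by the sum of the weights also covers the case where the weights already add to 1.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

days = [5, 6, 7, 8]          # days to complete
frequency = [4, 12, 8, 2]    # number of projects (the weights)
print(round(weighted_mean(days, frequency), 2))   # 164 / 26 -> 6.31
```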
Weighted Mean - Case 2
Decisions: Weighted means can help with decisions where some things are more important than others.
Ex: Sam wants to buy a new camera and decides on the following rating system:
Image Quality 50%
Battery Life 30%
Zoom Range 20%
The Sony camera gets 8 (out of 10) for Image Quality, 6 for Battery Life and 7 for Zoom Range.
The Canon camera gets 9 for Image Quality, 4 for Battery Life and 6 for Zoom Range.
Which camera is best?
Sony: 0.5 x 8 + 0.3 x 6 + 0.2 x 7 = 7.2
Canon: 0.5 x 9 + 0.3 x 4 + 0.2 x 6 = 6.9
-> Sam decides to buy the Sony.
When the weights add to 1:
Multiply each weight by the matching value and sum it all up.
Ex: Sam wants to buy a new phone and decides on the following rating system:
System 40%
Battery Life 30%
Zoom Range 30%
The iPhone gets 8 (out of 10) for System, 6 for Battery Life and 7 for Zoom Range.
The Samsung gets 9 for System, 4 for Battery Life and 6 for Zoom Range.
Which phone is best?
Median
Not affected by extreme values
[Figure: two dot plots on a 0 to 10 axis; in both, Median = 3]
In an ordered array, the median is the “middle”
number
How to calculate Median
S1: Collect the data
S2: Put the data in order
S3: Calculate the median index (i)
S4: Find the median (value)
Median index
i = ½ n
Where: i = index of the point in the data set corresponding to the median value
n = sample size
If i is not an integer, round its value up to the next highest integer. This next highest integer is then the position of the median in the data array.
If i is an integer, the median is the average of the values in position i and position i+1.
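The index rule above can be sketched in Python as follows. The function name `median_by_index` is an illustrative choice, and the second data set is made up purely to show the even-sized case.

```python
import math

def median_by_index(data):
    """Median using the index rule i = n / 2 on the ordered data."""
    ordered = sorted(data)                 # S2: put the data in order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    i = n / 2                              # S3: median index
    if i != int(i):
        # i is not an integer: round up, and that position holds the median
        # (positions start at 1, Python list indexes start at 0)
        return ordered[math.ceil(i) - 1]
    # i is an integer: average the values in positions i and i + 1
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2

print(median_by_index([1, 2, 3, 4, 10]))   # n = 5, i = 2.5 -> 3rd value = 3
print(median_by_index([6, 11, 7, 2]))      # n = 4, i = 2 -> (6 + 7) / 2 = 6.5
```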
Example
Find the median of 3, 13 , 7, 5 ,21, 23, 39, 23, 40,
14, 12, 56, 23, 15 and 29?
Find the median of 5, 6,7, 10, 4 ,2 ,3 ,8
In statistics, we often describe where the data are centered by calculating the mean, but the mean can sometimes be misleading.
Skewed and Symmetric Data
Data in a population or sample can be either
symmetric or skewed ( shape of data), depending
on how the data are distributed around the center.
Skewed and Symmetric Data
Symmetric data: data sets whose values are evenly spread around the center. Median = mean.
Skewed data: data sets that are not symmetric. For skewed data, the mean will be larger or smaller than the median.
Mode
The mode is simply the number which appears
most often
Mode is most
How to find?
Put the numbers in order
Count how many times each number appears (the value with the highest frequency is the mode)
Find the mode
1, 3, 3, 3, 4, 4, 6, 6, 6, 9
3 appears three times, as does 6
-> there are two modes: 3 and 6
We can have more than one mode
Having two modes is called “bimodal”
Having more than two modes is called “multimodal”
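Counting frequencies by hand gets tedious, so here is a minimal sketch that returns every value tied for the highest frequency. The function name `modes` is an illustrative choice; `collections.Counter` is part of the Python standard library.

```python
from collections import Counter

def modes(data):
    """Return all values that share the highest frequency, in ascending order."""
    counts = Counter(data)                  # value -> number of appearances
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([1, 3, 3, 3, 4, 4, 6, 6, 6, 9]))   # [3, 6] -> bimodal
```

A result with one value means a single mode, two values means bimodal, and more than two means multimodal.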
Find the mode of 3, 7 ,5 ,13 ,20, 23, 39, 23, 40 ,23,
14 ,12 ,56, 23 and 29?
Range
Simplest measure of variation
Difference between the largest and the smallest
observations:
Range = x_maximum - x_minimum
Example: for data whose smallest value is 1 and largest value is 14,
Range = 14 - 1 = 13
Disadvantage of range
The range can sometimes be misleading when there are extremely high or low values.
[Figure: two data sets plotted on a 7 to 12 axis; both have Range = 12 - 7 = 5]
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
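The effect of a single extreme value on the range can be reproduced with a short sketch. The function name `data_range` is an illustrative choice, and the two lists are the data sets shown above.

```python
def data_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

base = [1] * 11 + [2] * 8 + [3] * 4 + [4]   # the 24 values shared by both lists
without_outlier = base + [5]
with_outlier = base + [120]

print(data_range(without_outlier))   # 5 - 1 = 4
print(data_range(with_outlier))      # 120 - 1 = 119
```

Only one of the 25 values changed, yet the range jumped from 4 to 119.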
Review Example
Five houses on a hill by the beach
House Prices:
$2,000,000
$500,000
$300,000
$100,000
$100,000
Summary Statistics
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Sum 3,000,000
Mean: ($3,000,000/5) = $600,000
Median: middle value of ranked data = $300,000
Mode: most frequent value = $100,000
Quartiles
Quartiles are the values that divide a list of
numbers into quarters
How to find?
* First put the list of numbers in order
* Then cut the list into four equal parts
-> The quartiles are at the "cuts"
Example
5, 8, 4, 4, 6, 3, 8
• Put them in order: 3, 4, 4, 5, 6, 8, 8
• Cut the list into quarters: the cuts fall on the 2nd, 4th and 6th values
• And the result is:
• Quartile 1 (Q1) = 4
• Quartile 2 (Q2), which is also the median, = 5
• Quartile 3 (Q3) = 8
How to calculate
i = p * n
Q1 -> p = 25% -> lower quartile
Q2 -> p = 50% -> median
Q3 -> p = 75% -> upper quartile
If i is not an integer -> round up to the next highest integer; the quartile is the value in that position
If i is an integer, the quartile (the pth percentile) is the average of the values in position i and position i+1
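A minimal sketch of this rule, applied to the seven-number example above. The function name `value_at_fraction` is an illustrative choice. Note that other textbooks and software use slightly different quartile conventions, so results can differ from library functions.

```python
import math

def value_at_fraction(data, p):
    """Value located by the index rule i = p * n, for 0 < p < 1."""
    ordered = sorted(data)
    n = len(ordered)
    i = p * n
    if i != int(i):
        # not an integer: round up and take the value in that position
        return ordered[math.ceil(i) - 1]
    # an integer: average the values in positions i and i + 1
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2

data = [5, 8, 4, 4, 6, 3, 8]
q1 = value_at_fraction(data, 0.25)   # i = 1.75 -> 2nd value
q2 = value_at_fraction(data, 0.50)   # i = 3.5  -> 4th value
q3 = value_at_fraction(data, 0.75)   # i = 5.25 -> 6th value
print(q1, q2, q3)                    # 4 5 8
```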
COMPARING HURDLES SCORES
Here are the top eleven 50 m goat racing times in seconds for 2007 and 2008.
2007    2008
12.1    12.3
14.0    13.7
15.3    15.5
15.4    15.5
15.4    15.6
15.6    15.9
15.7    16.0
15.7    16.1
16.1    16.1
16.7    17.1
17.0    22.9
Work out the mean and range.
         2007    2008
Mean     15.4    16.1
Range    4.9     10.6
Which year was better and why?
Why might this comparison be unfair?
The interquartile range is a better measure of spread when the data contains an outlier.
Note
Sometimes a "cut" is between two numbers.
-> The quartile is the average of the two numbers
FINDING THE INTERQUARTILE RANGE
When there are outliers in the data, it is more appropriate to
calculate the interquartile range.
The interquartile range (IQR) is the
range of the middle half of the data.
The lower quartile is the data value that is one quarter of the way along the list (when written in order of size).
The upper quartile is the data value that is three quarters of the way along the ordered list.
interquartile range = upper quartile - lower quartile
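To see why the interquartile range is less sensitive to an outlier than the range, here is a sketch using the goat racing times listed earlier. It assumes the index rule i = p * n from the quartile slides; the helper names `quartile` and `spread` are illustrative.

```python
import math

def quartile(ordered, p):
    """Quartile of already ordered data by the index rule i = p * n."""
    i = p * len(ordered)
    if i != int(i):
        return ordered[math.ceil(i) - 1]
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2

def spread(times):
    """Return (range, interquartile range), rounded to one decimal place."""
    ordered = sorted(times)
    data_range = ordered[-1] - ordered[0]
    iqr = quartile(ordered, 0.75) - quartile(ordered, 0.25)
    return round(data_range, 1), round(iqr, 1)

times_2007 = [12.1, 14.0, 15.3, 15.4, 15.4, 15.6, 15.7, 15.7, 16.1, 16.7, 17.0]
times_2008 = [12.3, 13.7, 15.5, 15.5, 15.6, 15.9, 16.0, 16.1, 16.1, 17.1, 22.9]

print(spread(times_2007))   # (4.9, 0.8)
print(spread(times_2008))   # (10.6, 0.6)
```

The 22.9 second time more than doubles the 2008 range, while the interquartile ranges of the two years stay close.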
Box and Whisker Diagrams.
Anatomy of a Box and Whisker Diagram.
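[Figure: a box and whisker diagram on a 4 to 12 axis, labelled from left to right: lower limit, whisker, lower quartile, median, upper quartile, whisker, upper limit. The box runs from the lower quartile to the upper quartile.]
Box plots are useful for comparing two or more sets of data, such as the heights of boys and girls in a class.
[Figure: box plots of heights in cm for boys and girls on a 130 to 190 axis]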
S1: Sort the data
S2: Calculate the quartiles
S3: Draw the box from Q1 to Q3
S4: Draw a vertical line through the box at the median
S5: Compute the upper and lower limits
*** Lower limit: Q1 - 1.5 (Q3 - Q1)
*** Upper limit: Q3 + 1.5 (Q3 - Q1)
S6: Draw the whiskers out to the most extreme data values inside the limits
S7: Plot the outliers (data values outside the limits)
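The numbers needed for steps S1 to S7 can be computed before anything is drawn. The sketch below uses the data from Example 1 of "Drawing a Box Plot" and assumes the index rule i = p * n for quartiles; the names `quartile` and `box_plot_summary` are illustrative.

```python
import math

def quartile(ordered, p):
    """Quartile of already ordered data by the index rule i = p * n."""
    i = p * len(ordered)
    if i != int(i):
        return ordered[math.ceil(i) - 1]
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2

def box_plot_summary(data):
    ordered = sorted(data)                                   # S1
    q1 = quartile(ordered, 0.25)                             # S2
    median = quartile(ordered, 0.50)
    q3 = quartile(ordered, 0.75)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr                             # S5
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_limit <= x <= upper_limit]
    outliers = [x for x in ordered if x < lower_limit or x > upper_limit]
    return {
        "q1": q1, "median": median, "q3": q3,
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        "whiskers": (inside[0], inside[-1]),                 # S6
        "outliers": outliers,                                # S7
    }

summary = box_plot_summary([4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12])
print(summary)
```

For this data set the limits are 0.25 and 14.25, so no value is an outlier and the whiskers reach the smallest and largest values, 4 and 12.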
* n = 45
Drawing a Box Plot.
Example 1: Draw a Box plot for the data below
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9
[Figure: box plot drawn on a 4 to 12 axis]
Drawing a Box Plot.
Example 2: Draw a Box plot for the data below
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15
Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10
[Figure: box plot drawn on a 3 to 15 axis]
Drawing a Box Plot.
Question: Stuart recorded the heights in cm of boys in his
class as shown below. Draw a box plot for this data.
137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186
Lower Quartile (QL) = 158
Median (Q2) = 171
Upper Quartile (Qu) = 180
[Figure: box plot drawn on a 130 to 190 cm axis]
Drawing a Box Plot.
Question: Gemma recorded the heights in cm of girls in the same
class and constructed a box plot from the data. The box plots for both
boys and girls are shown below. Use the box plots to choose some
correct statements comparing heights of boys and girls in the class.
Justify your answers.
[Figure: box plots of heights in cm for boys and girls on a 130 to 190 axis]
1. The girls are taller on average.
2. The boys are taller on average.
3. The girls show less variability in height.
4. The boys show less variability in height.
5. The smallest person is a girl.
6. The tallest person is a boy.
PERCENTILES
A percentile is a measure that tells us what percent
of the total frequency scored at or below that
measure.
Ex: You are the fourth tallest person in a group of 20, so 80% of people are shorter than you
-> That means you are at the 80th percentile
If your height is 1.85 m, then 1.85 m is the 80th percentile height in that group.
i = (p/100) * n
Where p: desired percentage, n: number of values
If i is not an integer -> round up to the next highest integer
If i is an integer, the pth percentile is the average of the values in position i and position i+1
Estimating Percentiles from a Line graph
A total of 10,000 people visited Aeon Mall over 12 hours
Time (hours)    People
0               0
2               350
4               1100
6               2400
8               6500
10              8850
12              10,000
Estimate the 30th percentile (when 30% of the visitors had arrived).
Estimate what percentile of visitors had arrived after 11 hours.
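Reading values off a line graph amounts to straight-line interpolation between the plotted points. The sketch below assumes the People column is cumulative (it ends at the stated total of 10,000); the helper name `interpolate` is illustrative.

```python
hours = [0, 2, 4, 6, 8, 10, 12]
people = [0, 350, 1100, 2400, 6500, 8850, 10000]   # cumulative arrivals

def interpolate(x, xs, ys):
    """Straight-line estimate of y at x between the known (xs, ys) points."""
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the range of the data")

total = people[-1]

# 30th percentile: the time by which 30% of visitors (3,000 people) had arrived
time_30th = interpolate(0.30 * total, people, hours)

# percentile reached after 11 hours: share of visitors who had arrived by then
percent_at_11h = 100 * interpolate(11, hours, people) / total

print(round(time_30th, 1))     # about 6.3 hours
print(round(percent_at_11h))   # about the 94th percentile
```

These are estimates only: they assume arrivals were spread evenly within each two-hour interval.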
Key Points:
Percentile rank is a number between 0 and 100
indicating the percent of cases falling at or below that
score.
Percentile ranks are usually written to the nearest whole percent: 74.5% rounds to 75%, which is the 75th percentile
Scores are arranged in rank order from lowest to
highest
There is no 0th percentile rank - the lowest score is at the first percentile
There is no 100th percentile - the highest score is at
the 99th percentile
You have 25 test scores, and in order from lowest
to highest they look like this:
43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99.
Find the 90th percentile and the 20th percentile for
these (ordered) scores.
Practice Problems
Measures of variation
Plan A Plan B
15 23
25 26
35 25
20 24
30 27
Calculate mean and median of data above
Variance
Variance is the average of the squared differences of the data values from the mean.
MEASURES OF VARIABILITY
POPULATION VARIANCE
• The population variance is the mean squared deviation
from the population mean:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• Where $\sigma^2$ stands for the population variance
• $\mu$ is the population mean
• N is the total number of values in the population
• $x_i$ is the value of the i-th observation.
• $\sum$ represents a summation
These are the numbers of newspapers sold at the local shop over the
last 20 days:
22, 20, 18, 23, 20, 25, 22, 20, 18, 20, 19 ,19 ,20, 22, 21, 20, 21,
23, 25, 29.
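As a sketch of the population variance formula, the newspaper data can be processed as below. This assumes the 20 days are treated as the whole population (so the divisor is N, not n - 1); the function name `population_variance` is illustrative.

```python
import math

def population_variance(values):
    """Mean squared deviation from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

newspapers = [22, 20, 18, 23, 20, 25, 22, 20, 18, 20,
              19, 19, 20, 22, 21, 20, 21, 23, 25, 29]

variance = population_variance(newspapers)
print(sum(newspapers) / len(newspapers))   # mean = 427 / 20 = 21.35
print(round(variance, 4))                  # about 6.8275
print(round(math.sqrt(variance), 2))       # standard deviation, about 2.61
```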
Standard Deviation
Standard deviation is a measure of how spread out
numbers are
Its symbol is σ (the Greek letter sigma)
The formula is the square root of the variance
Population Variance
In practice population variance cannot be
computed directly because the entire population is
not ordinarily observed.
An analogous measure of variability may be
determined with sample data.
This is referred to as the sample variance
MEASURES OF VARIABILITY
SAMPLE VARIANCE
• The sample variance is defined as follows:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• Where $s^2$ stands for the sample variance
• $\bar{x}$ is the sample mean
• n is the total number of values in the sample
• $x_i$ is the value of the i-th observation.
• $\sum$ represents a summation
MEASURES OF VARIABILITY
POPULATION/SAMPLE STANDARD DEVIATION
• The standard deviation is the positive square root of the
variance:
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
• Compute the standard deviations of advertising and sales.
• Compute the sample standard deviation of advertising
data: 2.5, 1.3, 1.4, 1.0 and 2.0
• Compute the population standard deviation of sales data:
264, 116, 165, 101 and 209
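The only difference between the two calculations is the divisor: n - 1 for a sample, N for a population. A minimal sketch with the advertising and sales data above (the function names are illustrative):

```python
import math

def sample_std(values):
    """Square root of the sample variance (divisor n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

def population_std(values):
    """Square root of the population variance (divisor N)."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)

advertising = [2.5, 1.3, 1.4, 1.0, 2.0]
sales = [264, 116, 165, 101, 209]

print(round(sample_std(advertising), 2))   # about 0.60
print(round(population_std(sales), 2))     # about 60.09
```

Python's standard `statistics` module provides `stdev` and `pstdev` for the same two quantities.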
MEASURES OF VARIABILITY
POPULATION/SAMPLE CV
• The coefficient of variation is the standard deviation divided by the mean
Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Is used to compare two or more sets of data
measured in different units
Population: $CV = \left(\frac{\sigma}{\mu}\right) \cdot 100\%$
Sample: $CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$
Comparing Coefficient of Variation
Stock A:
Average price last year = $50
Standard deviation = $5
$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
Stock B:
Average price last year = $100
Standard deviation = $5
$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price.
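The stock comparison can be reproduced with a one-line function. The name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return 100 * std / mean

cv_a = coefficient_of_variation(std=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std=5, mean=100)   # Stock B
print(cv_a, cv_b)   # 10.0 5.0
```

Because the result is a percentage with no units, it can also be used to compare data sets measured in different units.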
Data can be "distributed" (spread out) in different ways.
Skewed and Symmetric Data
Data in a population or sample can be either
symmetric or skewed ( shape of data), depending
on how the data are distributed around the center.
Use for
In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution.
Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.
Examples
Many things closely follow a Normal Distribution:
Heights of people
Size of things produced by machines
Errors in measurements
Blood pressure
Marks on a test
=> we say the data is "normally distributed"
Characteristics
The normal distribution has:
Mean= median = mode
Symmetry about the center
50% of values less than the mean and 50% greater than
the mean.
What is the standard deviation?
A measure of how spread out numbers are.
When you calculate the standard deviation of
your data, you will find that:
Example
95% of students at school are between 1.1 m and 1.7 m tall. Assuming this data is normally distributed, can you calculate the mean and standard deviation?
The mean is halfway between 1.1 m and 1.7 m:
Mean = (1.1 + 1.7) / 2 = 1.4 m
95% is 2 standard deviations either side of the mean (a total of 4 standard deviations), so:
1 standard deviation = (1.7 - 1.1) / 4 = 0.15 m
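The same reasoning works for any interval that covers the mean plus or minus k standard deviations. A small sketch, where the function name `normal_from_interval` and the parameter `k` are illustrative:

```python
def normal_from_interval(low, high, k):
    """Mean and standard deviation when [low, high] is the mean +/- k standard deviations."""
    mean = (low + high) / 2            # the mean is halfway between the ends
    std = (high - low) / (2 * k)       # the interval spans 2k standard deviations
    return mean, std

mean, std = normal_from_interval(1.1, 1.7, k=2)   # 95% -> k = 2
print(round(mean, 2), round(std, 2))              # 1.4 0.15
```

Use k = 1 for an interval that holds about 68% of the values and k = 3 for about 99.7%.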
The Empirical Rule
If the data distribution is bell-shaped, then the interval:
μ ± 1σ contains about 68% of the values in the population or the sample
[Figure: bell curve showing 68% of the values within μ ± 1σ]
The Empirical Rule
μ ± 2σ contains about 95% of the values in the population or the sample
μ ± 3σ contains about 99.7% of the values in the population or the sample
[Figure: bell curves showing 95% of the values within μ ± 2σ and 99.7% within μ ± 3σ]
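The three percentages of the empirical rule can be checked against an exact normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal population lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))   # roughly 68.3, 95.4 and 99.7
```

The rule quotes rounded figures (68%, 95%, 99.7%), which is why it says "about".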
Standardized Data Values
A standardized data value refers to the
number of standard deviations a value is
from the mean
Standardized data values are sometimes
referred to as z-scores
-> The number of standard deviations from the mean is also called the "standard score", "sigma", or "z-score"
One of your friends is 1.85 m tall. You can see on the bell curve that 1.85 m is 3 standard deviations from the mean of 1.4 m, so:
-> your friend's height has a "z-score" of 3.0
It is also possible to calculate how many standard deviations 1.85 m is from the mean.
How far is 1.85 m from the mean?
It is 1.85 - 1.4 = 0.45 m from the mean
How many standard deviations is that? The standard deviation is 0.15 m, so:
0.45 / 0.15 = 3 standard deviations
To convert a value to a standard score ("z-score"):
First subtract the mean,
Then divide by the standard deviation
-> doing that is called "standardizing"
Example
A survey of daily time had these results (in minutes):
26, 33, 65, 28, 34, 55, 25, 44, 50, 36, 26, 37, 43, 62,
35, 38, 45, 32, 28, 34
The mean is 38.8 minutes and the standard deviation is
11.4 minutes
Convert the values to z-scores ("standard scores")
To convert 26:
First subtract the mean: 26 - 38.8 = -12.8
Then divide by the standard deviation: -12.8 / 11.4 = -1.12
So 26 is -1.12 standard deviations from the mean
Standardized Population Values
$z = \frac{x - \mu}{\sigma}$
where:
x = original data value
μ = population mean
σ = population standard deviation
z = standard score
(number of standard deviations x is from μ)
Standardized Sample Values
$z = \frac{x - \bar{x}}{s}$
where:
x = original data value
$\bar{x}$ = sample mean
s = sample standard deviation
z = standard score
(number of standard deviations x is from $\bar{x}$)
Why standardize…?
It can help you make decisions about your data.
Example: Professor Willoughby is marking a test.
Here are the students' results (out of 60 points):
20, 15, 26, 32, 18, 35, 14, 26, 22, 17
Most students didn’t even get 30 out of 60 and most
will fail
The test must have been really hard, so the Prof
decides to standardize all the scores and only fail
people 1 standard deviation below the mean.
The mean is 22.5 and the standard deviation is about 6.76, and these are the standard scores:
-0.37, -1.11, 0.52, 1.41, -0.67, 1.85, -1.26, 0.52, -0.07, -0.81
Therefore, 2 students will fail (the ones who scored 15 and 14 on the test)
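The professor's rule can be sketched as below. It assumes the ten marks are treated as a whole population, so the population standard deviation (`statistics.pstdev`) is used; with a sample standard deviation the z-scores would be slightly smaller in size.

```python
from statistics import mean, pstdev

scores = [20, 15, 26, 32, 18, 35, 14, 26, 22, 17]

mu = mean(scores)        # 22.5
sigma = pstdev(scores)   # population standard deviation, about 6.76

# standardize: subtract the mean, then divide by the standard deviation
z_scores = [round((x - mu) / sigma, 2) for x in scores]

# fail only the students more than 1 standard deviation below the mean
failing = [x for x in scores if (x - mu) / sigma < -1]

print(z_scores)
print(failing)   # [15, 14]
```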
In more detail
Here is the standard normal distribution with
percentages for every half of a standard deviation
and cumulative percentages.
Your score in a recent test was 0.5 standard deviations above the average. How many people scored lower than you did? (P(Z < 0.5))
Question 1
95% of students at school weigh between 62 kg and 90 kg.
Assuming this data is normally distributed, what are the mean and standard deviation?
Question 2
A machine produces electrical components
99.7% of the components have lengths between
1.176 cm and 1.224 cm.
Assuming this data is normally distributed, what
are the mean and standard deviation?
Question 3
68% of the marks in a test are between 51 and 64.
Assuming this data is normally distributed, what
are the mean and standard deviation?
Question 4
A company makes parts for a machine. The lengths
of the parts must be within certain limits or they
will be rejected.
A large number of parts were measured and the mean and standard deviation were calculated as 3.1 m and 0.005 m respectively.
Assuming this data is normally distributed and
99.7% of the parts were accepted, what are the
limits?
Question 5
Students pass a test if they score 50% or more.
The marks of a large number of students were sampled and the mean and standard deviation were calculated as 42% and 8% respectively.
Assuming this data is normally distributed, what
percentage of students pass the test?
Standard normal distribution table
It shows the percent of the population:
Between 0 and z
Less than z
Greater than z
Example
Find the percent of the population between 0 and 0.45
Start at the row for 0.4 and read along until 0.45: there is the value 0.1736
And 0.1736 is 17.36%
So 17.36% of the population are between 0 and 0.45 standard deviations from the mean.
Because the curve is symmetrical, the same table can be used for values going in either direction, so negative 0.45 (-0.45) also has an area of 0.1736.
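When no printed table is at hand, the same three kinds of areas can be computed. This is a sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the three helper names are illustrative and mirror the three table readings listed above.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

def between_zero_and(z):
    """Area between 0 and z (the same for +z and -z, by symmetry)."""
    return abs(Z.cdf(z) - 0.5)

def less_than(z):
    """Area to the left of z."""
    return Z.cdf(z)

def greater_than(z):
    """Area to the right of z."""
    return 1 - Z.cdf(z)

print(round(between_zero_and(0.45), 4))    # 0.1736
print(round(between_zero_and(-0.45), 4))   # 0.1736
```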
Example
Find the percent of population z between -1 and + 2
Use the standard Normal Distribution table to find
P(0<Z
P (Z
P ( -1.65 <Z
P (0.85 < Z
P (Z>1.75)
P (Z -0.69)
P (-1.27 < Z
P (Z >-2.64)
P (Z 0.96)
Find Z when you know percentage.
P(Z to +∞) = 50.8%
P(-∞ to Z) = 30.85%
P(-2 to Z) = 11.29%
P(Z to 3) = 0.3%
Question 9
The mean July daily rainfall in Waterville is 10mm
and the standard deviation is 1.5mm
Assume that this data is normally distributed
How many days in July would you expect the daily
rainfall to be less than 8.5 mm?
In Hương Tràm's music video "Em gái mưa", the mean rainfall measured in each rain scene is 20 mm, and the standard deviation is 3 mm.
Assume that the rainfall is normally distributed.
How many rain scenes in total have rainfall below 16 mm, given that there are 40 rain scenes in Hương Tràm's video?
The mean age in a population of local residents is 43, and the standard deviation is 14. The locality has 5,000 people. How many people are aged from 22 to 57? Assume that this is a normal distribution.
The mean of a population is 20, and the standard deviation is 3. The population has 2,000 members.
Assume that this is a normal distribution. Questions:
a. How many values lie between 14 and 17? (1 point)
b. It is known that in this survey the values from 17 to x account for 68.28%. Find x. (1 point)
c. It is known that in this survey 131 people (unrounded number) have a value greater than or equal to x. Find x.
The mean July daily rainfall in Waterville is 10mm
and the standard deviation is 1.5mm
Assume that this data is normally distributed
How many days in July would you expect the daily rainfall to be from 8.5 mm to 11.5 mm?
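Questions of the "how many would you expect" type multiply a normal probability by the number of cases. Below is a sketch for the two Waterville rainfall questions, assuming July has 31 days and using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

rainfall = NormalDist(mu=10, sigma=1.5)   # daily rainfall in mm
days_in_july = 31

# 8.5 mm is 1 standard deviation below the mean, 11.5 mm is 1 above
p_below = rainfall.cdf(8.5)
p_between = rainfall.cdf(11.5) - rainfall.cdf(8.5)

print(round(days_in_july * p_below, 1))     # about 4.9 days below 8.5 mm
print(round(days_in_july * p_between, 1))   # about 21.2 days from 8.5 mm to 11.5 mm
```

With the rounded empirical rule (68% within one standard deviation, 16% in each tail) the estimates come out as roughly 5 days and 21 days.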
Thank you