Chapter 3, Part A
Descriptive Statistics: Numerical Measures
- Measures of Location
- Measures of Variability

Numerical Measures
If the measures are computed for data from a sample, they are called sample statistics.
If the measures are computed for data from a population, they are called population parameters.
A sample statistic is referred to as the point estimator of the corresponding population parameter.

Measures of Location
- Mean
- Median
- Mode
- Weighted Mean
- Geometric Mean
- Percentiles
- Quartiles

Mean
Perhaps the most important measure of location is the mean.
The mean provides a measure of central location.
The mean of a data set is the average of all the data values.
The sample mean x̄ is the point estimator of the population mean μ.

Sample Mean x̄
$\bar{x} = \frac{\sum x_i}{n}$
where:
Σxi = sum of the values of the n observations
n = number of observations in the sample

Population Mean μ
$\mu = \frac{\sum x_i}{N}$
where:
Σxi = sum of the values of the N observations
N = number of observations in the population

Sample Mean x̄
Example: Monthly Starting Salary
A placement office wants to know the average starting salary of business graduates. Monthly starting salaries for a sample of 12 business school graduates are provided here.
$\bar{x} = \frac{\sum x_i}{n} = \frac{47{,}280}{12} = 3{,}940$

Median
The median of a data set is the value in the middle when the data items are arranged in ascending order.
Whenever a data set has extreme values, the median is the preferred measure of central location.
The median is the measure of location most often reported for annual income and property value data.
A few extremely large incomes or property values can inflate the mean.
For an odd number of observations (7 observations, in ascending order):
Median is the middle value; Median = 19
For an even number of observations (8 observations, in ascending order):
Median is the average of the middle two values.
Median = (19 + 26)/2 = 22.5
Example: Monthly Starting Salary
Averaging the 6th and 7th data values:
Median = (3,890 + 3,920)/2 = 3,905

Trimmed Mean
Another measure sometimes used when extreme values are present is the trimmed mean.
It is obtained by deleting a percentage of the smallest and largest values from a data set and then computing the mean of the remaining values.
For example, the 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values.
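The mean and median figures for the Monthly Starting Salary example can be rechecked with a short Python sketch. The twelve salary values are an assumption: the data table did not survive in these notes, so the list is a reconstruction chosen to match the reported sum (47,280) and the quoted 6th and 7th ordered values. The helper `trimmed_mean`, its round-down rule, and the 10% trimming level are hypothetical illustrations rather than part of the original example.

```python
import statistics

# ASSUMED data: reconstruction consistent with sum = 47,280 and with the
# 6th and 7th ordered values (3,890 and 3,920) quoted in the notes.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

# Sample mean: sum of the observations divided by n.
mean_salary = sum(salaries) / len(salaries)

# Median: with n = 12 (even), the 6th and 7th ordered values are averaged.
median_salary = statistics.median(salaries)


def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the smallest and largest values.

    Hypothetical helper: the count removed from each end is rounded down.
    """
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)   # values removed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


print(mean_salary)                 # 3940.0
print(median_salary)               # 3905.0
print(trimmed_mean(salaries, 10))  # 3924.5
```

With 12 observations, a 10% trim under the round-down assumption removes one value from each end (3,710 and 4,325), which pulls the result slightly below the untrimmed mean.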
Mode
The mode of a data set is the value that occurs with greatest frequency.
The greatest frequency can occur at two or more different values.
If the data have exactly two modes, the data are bimodal.
If the data have more than two modes, the data are multimodal.
Example: Monthly Starting Salary
The only monthly starting salary that occurs more than once is $3,880.
Mode = 3,880

Geometric Mean
The geometric mean is calculated by finding the nth root of the product of n values.
It is often used in analyzing growth rates in financial data (where using the arithmetic mean will provide misleading results).
It should be applied anytime you want to determine the mean rate of change over several successive periods (be it years, quarters, weeks, . . .).
Other common applications include: changes in populations of species, crop yields, pollution levels, and birth and death rates.
$\bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n}$
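A minimal Python sketch of the mode and the geometric mean follows. The salary list is an assumed reconstruction (the original table is not in these notes), and the four growth factors are purely hypothetical values chosen to illustrate the nth-root formula.

```python
import math
import statistics

# ASSUMED salary data (reconstruction; the original table is not in the notes).
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

# multimode returns every value tied for the greatest frequency, so it also
# covers the bimodal and multimodal cases.
modes = statistics.multimode(salaries)

# HYPOTHETICAL growth factors for four successive periods
# (1.10 means +10% in that period, 0.95 means -5%).
growth_factors = [1.10, 0.95, 1.20, 1.05]

# Geometric mean: nth root of the product of the n values.
n = len(growth_factors)
geo_mean = math.prod(growth_factors) ** (1 / n)

# Arithmetic mean of the same factors, for comparison.
arith_mean = sum(growth_factors) / n

print(modes)       # [3880]
print(geo_mean)    # mean growth factor per period
print(arith_mean)  # larger than the geometric mean for these factors
```

Compounding `geo_mean` over the four periods reproduces the overall growth (the product of the factors), which the arithmetic mean does not.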
Weighted Mean
In some instances the mean is computed by giving each observation a weight that reflects its relative importance.
The choice of weights depends on the application.
The weights might be the number of credit hours earned for each grade, as in GPA.
In other weighted mean computations, quantities such as pounds, dollars, or volume are frequently used.
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
where:
xi = value of observation i
wi = weight for observation i
Numerator: sum of the weighted data values
Denominator: sum of the weights
If the data are from a population, μ replaces x̄.
Example: Purchase of Raw Material
Consider the following sample of five purchases of a raw material over a period of three months:
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i} = \frac{18{,}500}{6{,}250} = 2.96$ (that is, $2.96)
FYI, the equally weighted (simple) mean = $3.07

Percentiles
A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
Admission test scores for colleges and universities are frequently reported in terms of percentiles.
The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.
To compute the pth percentile:
- Arrange the data in ascending order.
- Compute Lp, the location of the pth percentile: Lp = (p/100)(n + 1)
80th Percentile
Example: Monthly Starting Salary
Lp = (p/100)(n + 1) = (80/100)(12 + 1) = 10.4
(the 10th value plus .4 times the difference between the 11th and 10th values)
80th Percentile = 4,050 + 0.4(4,130 - 4,050) = 4,082
“At least 80% of the items take on a value of 4,082 or less.” 10/12 = .833 or 83.3%
“At least 20% of the items take on a value of 4,082 or more.” 2/12 = .167 or 16.7%
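The weighted mean and the percentile location rule Lp = (p/100)(n + 1) can both be sketched in a few lines of Python. Both data sets are assumptions: the purchase table and the salary table are missing from these notes, so the values are reconstructions consistent with the reported totals (18,500 and 6,250) and the quoted 10th and 11th salary values. The function names and the clamping at the ends of the data are hypothetical choices.

```python
# ASSUMED purchase data: cost per unit and quantity for five purchases,
# reconstructed to match the totals 18,500 and 6,250 reported in the notes.
costs = [3.00, 3.40, 2.80, 2.90, 3.25]
quantities = [1200, 500, 2750, 1000, 800]


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def percentile(values, p):
    """pth percentile via the location rule L_p = (p/100)(n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    location = (p / 100) * (n + 1)   # 1-based position, possibly fractional
    lower = int(location)            # integer part of the position
    fraction = location - lower      # fractional part used to interpolate
    if lower < 1:                    # ASSUMPTION: clamp to the smallest value
        return ordered[0]
    if lower >= n:                   # ASSUMPTION: clamp to the largest value
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# ASSUMED salary data (reconstruction; the original table is not in the notes).
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

print(round(weighted_mean(costs, quantities), 2))  # 2.96
print(round(sum(costs) / len(costs), 2))           # 3.07 (equally weighted)
print(round(percentile(salaries, 80), 2))          # 4082.0
```

The gap between 2.96 and 3.07 arises because the largest purchase was made at the lowest unit cost, so weighting by quantity pulls the mean down.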
Quartiles
Quartiles are specific percentiles.
First Quartile = 25th Percentile
Second Quartile = 50th Percentile = Median
Third Quartile = 75th Percentile
Third Quartile (75th Percentile)
Example: Monthly Starting Salary
Lp = (p/100)(n + 1) = (75/100)(12 + 1) = 9.75
(the 9th value plus .75 times the difference between the 10th and 9th values)
Third quartile = 3,950 + .75(4,050 - 3,950) = 4,025

Interquartile Range
The interquartile range of a data set is the difference between the third quartile and the first quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data values.
Example: Monthly Starting Salary
3rd Quartile (Q3) = 4,025
1st Quartile (Q1) = 3,857.5
IQR = Q3 - Q1 = 4,025 - 3,857.5 = 167.5
Measures of Variability
It is often desirable to consider measures of variability (dispersion), as well as measures of location.
For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.
- Range
- Interquartile Range
- Variance
- Standard Deviation
- Coefficient of Variation

Range
The range of a data set is the difference between the largest and smallest data values.
Range = Largest value - Smallest value
It is the simplest measure of variability.
It is very sensitive to the smallest and largest data values.
Example: Monthly Starting Salary
Range = largest value - smallest value
Range = 4,325 - 3,710 = 615

Variance
The variance is a measure of variability that utilizes all the data.
It is based on the difference between the value of each observation (xi) and the mean (x̄ for a sample, μ for a population).
The variance is useful in comparing the variability of two or more variables.
The variance is the average of the squared differences between each data value and the mean.
The variance is computed as follows:
for a sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
for a population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

Standard Deviation
The standard deviation of a data set is the positive square root of the variance.
It is measured in the same units as the data, making it more easily interpreted than the variance.
for a sample: $s = \sqrt{s^2}$
for a population: $\sigma = \sqrt{\sigma^2}$
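The range, variance, and standard deviation of the Monthly Starting Salary sample can be recomputed with Python's standard `statistics` module, as sketched below. The salary list is an assumed reconstruction (the data table is missing from these notes) chosen to match the reported largest and smallest values. The last line also forms the ratio of the standard deviation to the mean as a percentage, which is the coefficient of variation covered in the section that follows.

```python
import statistics

# ASSUMED data: reconstruction consistent with the reported minimum (3,710),
# maximum (4,325), and sum (47,280); the original table is not in the notes.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

# Range: largest value minus smallest value.
data_range = max(salaries) - min(salaries)

# Sample variance and standard deviation (divisor n - 1).
sample_variance = statistics.variance(salaries)
sample_sd = statistics.stdev(salaries)

# Population variance (divisor N), shown only to contrast the two divisors;
# these 12 values are a sample, so the sample formulas apply to the example.
population_variance = statistics.pvariance(salaries)

# Standard deviation relative to the mean, as a percentage.
cv_percent = sample_sd / statistics.mean(salaries) * 100

print(data_range)                 # 615
print(round(sample_variance, 2))  # 27440.91
print(round(sample_sd, 2))        # 165.65
print(round(cv_percent, 1))       # 4.2
```

Because the population formula divides by N rather than n - 1, `population_variance` is always the smaller of the two for the same data.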
Coefficient of Variation
The coefficient of variation indicates how large the standard deviation is in relation to the mean.
The coefficient of variation is computed as follows:
for a sample: $\left[ \frac{s}{\bar{x}} \times 100 \right]\%$
for a population: $\left[ \frac{\sigma}{\mu} \times 100 \right]\%$

Sample Variance, Standard Deviation, and Coefficient of Variation
Example: Monthly Starting Salary
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 27{,}440.91$
Standard Deviation: $s = \sqrt{s^2} = \sqrt{27{,}440.91} = 165.65$
Coefficient of Variation: $\left[ \frac{s}{\bar{x}} \times 100 \right]\% = \left[ \frac{165.65}{3{,}940} \times 100 \right]\% = 4.2\%$

Chapter 3, Part B
Descriptive Statistics: Numerical Measures
Measures of Distribution Shape, Relative Location, and Detecting Outliers
- Distribution Shape
- z-Scores
- Chebyshev’s Theorem
- Empirical Rule
- Detecting Outliers

Distribution Shape: Skewness
An important measure of the shape of a distribution is called skewness.
The formula for the skewness of sample data is
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$
Skewness can be easily computed using statistical software.
1. Symmetric (not skewed)
- Skewness is zero
- Mean and median are equal
2. Moderately Skewed Left
- Skewness is negative
- Mean will usually be less than the median
3. Moderately Skewed Right
- Skewness is positive
- Mean will usually be more than the median
4. Highly Skewed Right
- Skewness is positive (often above 1.0)
- Mean will usually be more than the median

z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data value xi is from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
Excel’s STANDARDIZE function can be used to compute the z-score.
An observation’s z-score is a measure of the relative location of the observation in a data set.
A data value less than the sample mean will have a z-score less than zero.
A data value greater than the sample mean will have a z-score greater than zero.
A data value equal to the sample mean will have a z-score of zero.
Example: Class Size data
$z_i = \frac{x_i - \bar{x}}{s}$
Note: x̄ = 44 and s = 8 for the given data.

Bell-shaped distribution:
- Approximately 68% of the data values will be within +/- 1 standard deviation of the mean.
- Approximately 95% of the data values will be within +/- 2 standard deviations of the mean.
- Almost all of the data values will be within +/- 3 standard deviations of the mean.
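The z-score formula for the Class Size example, and the skewness formula that is built from the same standardized values, can be sketched in Python as follows. The five class sizes are an assumption: the data table is not in these notes, so the list is a reconstruction consistent with the stated mean of 44 and standard deviation of 8.

```python
import statistics

# ASSUMED data: reconstruction consistent with mean = 44 and s = 8;
# the original class-size table is not in the notes.
class_sizes = [46, 54, 42, 46, 32]

mean = statistics.mean(class_sizes)
s = statistics.stdev(class_sizes)   # sample standard deviation (n - 1)

# z-score: number of standard deviations each value lies from the mean.
z_scores = [(x - mean) / s for x in class_sizes]

# Sample skewness: n / ((n - 1)(n - 2)) times the sum of cubed z-scores.
n = len(class_sizes)
skewness = n / ((n - 1) * (n - 2)) * sum(z ** 3 for z in z_scores)

print(mean, s)             # 44 8.0
print(z_scores)            # [0.25, 1.25, -0.25, 0.25, -1.5]
print(round(skewness, 4))  # -0.5859
```

The negative skewness here comes mainly from the fifth class (32 students), which lies 1.5 standard deviations below the mean.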
Chebyshev’s Theorem
At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.
Chebyshev’s theorem requires z > 1, but z need not be an integer.
- At least 75% of the data values must be within z = 2 standard deviations of the mean.
- At least 89% of the data values must be within z = 3 standard deviations of the mean.
- At least 94% of the data values must be within z = 4 standard deviations of the mean.
Example: Marks of students
Suppose the marks of 100 students in a course had a mean of 70 and a standard deviation of 5. We want to know the number of students having test scores between 60 and 80.
60 and 80 are 2 standard deviations below and above the mean, respectively.
- 60 = 70 - 2(5)
- 80 = 70 + 2(5)
- With z = 2, at least $(1 - 1/2^2)$ = 75% of the students, that is, at least 75 of the 100, scored between 60 and 80.
Number of students having test scores between 58 and 82:
(58 - 70)/5 = -2.4
(82 - 70)/5 = 2.4
z = 2.4
$(1 - 1/z^2) = (1 - 1/(2.4)^2)$ = 0.826, so at least 82.6% of the students scored between 58 and 82.

Detecting Outliers
An outlier is an unusually small or unusually large value in a data set.
A data value with a z-score less than -3 or greater than +3 might be considered an outlier.
It might be:
- an incorrectly recorded data value
- a data value that was incorrectly included in the data set
- a correctly recorded unusual data value that belongs in the data set
Example: Class Size data
The z-score of -1.5 shows that the fifth class size is farthest from the mean.
No outliers are present, as all z-values are within the +/- 3 guideline for outliers.
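The Chebyshev bounds quoted for the marks example can be reproduced with a small Python sketch. The function names `chebyshev_lower_bound` and `z_distance` are hypothetical helpers introduced only for this illustration; the mean of 70 and standard deviation of 5 come from the example.

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of any data set within z standard deviations."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


def z_distance(value, mean, std_dev):
    """How many standard deviations `value` lies from `mean` (unsigned)."""
    return abs(value - mean) / std_dev


mean_mark, sd_mark = 70, 5           # from the marks example

z_a = z_distance(80, mean_mark, sd_mark)   # interval 60 to 80
z_b = z_distance(82, mean_mark, sd_mark)   # interval 58 to 82

print(z_a, chebyshev_lower_bound(z_a))            # 2.0 0.75
print(z_b, round(chebyshev_lower_bound(z_b), 3))  # 2.4 0.826
```

The bound is a guaranteed minimum for any distribution, so the actual share of students inside each interval may be considerably higher.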
Empirical Rule
When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.
The empirical rule is based on the normal distribution, which is covered in Chapter 6.

Five-Number Summaries and Box Plots
Summary statistics and easy-to-draw graphs can be used to quickly summarize large quantities of data.
Two tools that accomplish this are five-number summaries and box plots.

Five-Number Summary
- Smallest Value
- First Quartile
- Median
- Third Quartile
- Largest Value
Example: Monthly Starting Salary
Lowest Value = 3,710
First Quartile = 3,857.5
Median = 3,905
Third Quartile = 4,025
Largest Value = 4,325

Example: Monthly Starting Salary (box plot limits and whiskers)
Lower Limit: Q1 - 1.5(IQR) = 3,857.5 - 1.5(167.5) = 3,606.25
The upper limit is located 1.5(IQR) above Q3.
Upper Limit: Q3 + 1.5(IQR) = 4,025 + 1.5(167.5) = 4,276.25
There is one outlier, 4,325, in this example.
Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits.
Smallest value inside limits = 3,710
Largest value inside limits = 4,130
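The five-number summary and the 1.5(IQR) limits for the Monthly Starting Salary example can be checked with the Python sketch below. The salary list is an assumed reconstruction, since the data table is missing from these notes. The sketch relies on `statistics.quantiles` with `method="exclusive"`, which for this data set gives the same quartiles as the location rule Lp = (p/100)(n + 1); treating the two as interchangeable in general is an assumption of this sketch.

```python
import statistics

# ASSUMED data (reconstruction; the original table is not in the notes).
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]
ordered = sorted(salaries)

# Quartiles: the three cut points that split the data into four parts.
q1, median, q3 = statistics.quantiles(ordered, n=4, method="exclusive")

five_number = (ordered[0], q1, median, q3, ordered[-1])

# Limits are placed 1.5 * IQR below Q1 and above Q3.
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits are flagged as outliers; the whiskers end at
# the smallest and largest values that are still inside the limits.
outliers = [x for x in ordered if x < lower_limit or x > upper_limit]
inside = [x for x in ordered if lower_limit <= x <= upper_limit]
whisker_ends = (inside[0], inside[-1])

print(five_number)               # (3710, 3857.5, 3905.0, 4025.0, 4325)
print(lower_limit, upper_limit)  # 3606.25 4276.25
print(outliers)                  # [4325]
print(whisker_ends)              # (3710, 4130)
```

Note that the limits themselves are not data values: the whiskers stop at 3,710 and 4,130, not at 3,606.25 and 4,276.25.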
Box Plot
A box plot is a graphical summary of data that is based on a five-number summary.
A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3.
Box plots provide another way to identify outliers.
Example: Monthly Starting Salary
A box is drawn with its ends located at the first and third quartiles.
A vertical line is drawn in the box at the location of the median (second quartile).
Limits are located (not drawn) using the interquartile range (IQR).
Data outside these limits are considered outliers.
The location of each outlier is shown with a symbol.
The lower limit is located 1.5(IQR) below Q1.

Measures of Association Between Two Variables
Thus far we have examined numerical methods used to summarize the data for one variable at a time.
Often a manager or decision maker is interested in the relationship between two variables.
Two descriptive measures of the relationship between two variables are the covariance and the correlation coefficient.

Covariance
The covariance is a measure of the linear association between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
For sample: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
For population: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$

Correlation Coefficient
Correlation is a measure of linear association and not necessarily causation.
Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.
For sample: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
For population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear relationship.
Values near +1 indicate a strong positive linear relationship.
The closer the correlation is to zero, the weaker the relationship.
Example: Stereo and Sound Equipment Store
The store’s manager wants to determine the relationship between the number of weekend television commercials shown and the sales at the store during the following week.
Sample Covariance
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{99}{9} = 11$
Sample Correlation Coefficient
$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{11}{(1.49)(7.93)} = 0.93$
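The sample covariance and correlation coefficient in the store example can be retraced step by step in Python. The ten paired observations are an assumption: the data table is not in these notes, so the lists are a reconstruction consistent with the reported figures (a cross-product sum of 99 with n - 1 = 9, and standard deviations of 1.49 and 7.93).

```python
import statistics

# ASSUMED data: reconstruction consistent with the sums and standard
# deviations reported in the notes; the original table is missing.
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]       # weekend commercials (x)
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]   # following-week sales (y)

n = len(commercials)
x_bar = statistics.mean(commercials)
y_bar = statistics.mean(sales)

# Numerator of the sample covariance: sum of products of the deviations.
cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(commercials, sales))

s_xy = cross / (n - 1)                # sample covariance
s_x = statistics.stdev(commercials)   # sample standard deviation of x
s_y = statistics.stdev(sales)         # sample standard deviation of y
r_xy = s_xy / (s_x * s_y)             # sample correlation coefficient

print(cross, s_xy)                    # 99 11.0
print(round(s_x, 2), round(s_y, 2))   # 1.49 7.93
print(round(r_xy, 2))                 # 0.93
```

Unlike the covariance, `r_xy` does not depend on the units of either variable, which is why it is the easier of the two to interpret.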
Data Dashboards: Adding Numerical Measures to Improve Effectiveness
Data dashboards are not limited to graphical displays.
The addition of numerical measures, such as the mean and standard deviation of KPIs, to a data dashboard is often critical.
Dashboards are often interactive.
Drilling down refers to functionality in interactive dashboards that allows the user to access information and analyses at increasingly detailed levels.