(IN) Measures


12/7/2024

Lecturer: Nguyen Van Dung, Ph.D.
Slides are based on the slides accompanying the book “Business Analytics: Methods, Models, and Decisions”, with improvements from the lecturer.

• Population: all items of interest for a particular decision or investigation
- all married drivers over 25 years old
- all subscribers to Netflix
• Sample: a subset of the population
- a list of individuals who rented a comedy from Netflix in the past year
• The purpose of sampling is to obtain sufficient information to draw a valid inference about a population.
- Saves time, cost, and labor compared with surveying or studying the entire population.
- Sampling done properly achieves the level of accuracy the results require.
- Data are collected faster, which keeps the statistics timely.
- Availability of population units.
- Many statistical indicators can be collected, especially complex ones that cannot feasibly be surveyed on a wide scale.
- Sampling in research helps reduce non-sampling errors (errors in weighing, measuring, counting, reporting, recording, etc.).
• Drawback of sampling: “sampling error” exists.

• We typically label the elements of a data set using subscripted variables, \(x_1, x_2, \ldots\), and so on, where \(x_i\) represents the \(i\)th observation.
• It is common practice in statistics to use Greek letters, such as \(\mu\) (mu), \(\sigma\) (sigma), and \(\pi\) (pi), to represent population measures and italic letters such as \(\bar{x}\) (called x-bar), \(s\), and \(p\) to represent sample statistics.
• \(N\) represents the number of items in a population and \(n\) represents the number of observations in a sample.
• \(\Sigma\) represents summation: \(\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n\)

• Population mean: \(\mu = \frac{\sum_{i=1}^{N} x_i}{N}\)
• Sample mean: \(\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\)
• Excel function: =AVERAGE(data range)
• Property of the mean: \(\sum_{i=1}^{n} (x_i - \bar{x}) = 0\)
• Outliers can affect the value of the mean. Outliers are observations that are radically different from the rest, and they pull the value of the mean toward these values.
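The following is a minimal Python sketch of the sample mean, the zero-sum property of deviations, and the pull of an outlier. The numbers are hypothetical and are not taken from the Purchase Orders file.

```python
from statistics import mean

# Hypothetical order costs (illustration only).
costs = [120.0, 135.0, 150.0, 160.0, 185.0]

x_bar = mean(costs)                              # sum(costs) / len(costs)
deviation_sum = sum(x - x_bar for x in costs)    # deviations from the mean cancel out

# One extreme observation pulls the mean toward itself.
with_outlier = costs + [5000.0]
x_bar_outlier = mean(with_outlier)

print(x_bar, deviation_sum, x_bar_outlier)
```

Here the mean moves from 150 to roughly 958 after a single extreme value is added, which is the sensitivity the slide warns about.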

• Purchase Orders database
◦ Using formula: =SUM(B2:B95)/COUNT(B2:B95)
  Mean = $2,471,760/94 = $26,295.32
◦ Using Excel AVERAGE function: =AVERAGE(B2:B95)

• The median specifies the middle value when the data are arranged from least to greatest.
◦ Half the data are below the median, and half the data are above it.
◦ For an odd number of observations, the median is the middle of the sorted numbers.
◦ For an even number of observations, the median is the mean of the two middle numbers.
• We could use the Sort option in Excel to rank-order the data and then determine the median. The Excel function =MEDIAN(data range) could also be used.
• The median is meaningful for ratio, interval, and ordinal data.
• Not affected by outliers.
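As a small illustration of the odd and even cases, here is a Python sketch using the standard `statistics.median` function. The short lists are hypothetical.

```python
from statistics import median

odd_sample = [7, 1, 5]            # sorted: 1, 5, 7 -> middle value
even_sample = [7, 1, 5, 3]        # sorted: 1, 3, 5, 7 -> mean of 3 and 5

median_odd = median(odd_sample)
median_even = median(even_sample)

# Replacing the largest value with an extreme one leaves the median unchanged.
median_with_outlier = median([7000, 1, 5, 3])

print(median_odd, median_even, median_with_outlier)
```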


• Sort the data from smallest to largest. Since we have 94 observations, the median is the average of the 47th and 48th observations.
  Median = ($15,562.50 + $15,750.00)/2 = $15,656.25
  =MEDIAN(B2:B95)

• The mode is the observation that occurs most frequently.
• The mode is most useful for data sets that contain a relatively small number of unique values.
• You can easily identify the mode from a frequency distribution by identifying the value or group having the largest frequency, or from a histogram by identifying the highest bar.
• Excel function: =MODE.SNGL(data range).
• For multiple modes: =MODE.MULT(data range)
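A rough Python counterpart to MODE.SNGL and MODE.MULT, sketched with the standard library. The payment-term values below are hypothetical.

```python
from statistics import mode, multimode

# Hypothetical payment terms with few unique values.
terms = [30, 30, 15, 45, 30, 15]
single_mode = mode(terms)                     # most frequent value

# With a tie, multimode returns every most-frequent value.
tied = [15, 15, 30, 30, 45]
all_modes = multimode(tied)

print(single_mode, all_modes)
```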

• Purchase Orders database: A/P Terms
◦ Mode = 30 months
• Cost per order
◦ Mode is the group between $0 and $13,000

• The midrange is the average of the greatest and least values in the data set.
• Caution must be exercised when using the midrange because extreme values easily distort the result. This is because the midrange uses only two pieces of data, whereas the mean uses all the data; thus, it is usually a much rougher estimate than the mean and is often used for only small sample sizes.
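A minimal sketch of the midrange in Python. The smallest and largest values are the cost-per-order extremes quoted in these slides ($68.78 and $127,500); the values in between are made up for illustration.

```python
def midrange(values):
    """Average of the smallest and largest observation."""
    return (min(values) + max(values)) / 2

# Only the two extremes come from the slides; the middle values are hypothetical.
costs = [68.78, 1_250.00, 15_656.25, 26_295.32, 127_500.00]
cost_midrange = midrange(costs)

print(round(cost_midrange, 2))
```

Because only the two extremes enter the calculation, changing any of the middle values leaves the result untouched.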

• Purchase Orders data
• Use the Excel MIN and MAX functions or sort the data and find them easily.
• Cost per order midrange:
  = ($68.78 + $127,500)/2 = $63,784.39

• The Excel file Computer Repair Times includes 250 repair times for customers.
• What repair time would be reasonable to quote to a new customer?
• Median repair time is 2 weeks; mean and mode are about 15 days.
• Examine the histogram.

• Computer Repair Times histogram: 90% of repairs are completed within 3 weeks.

• Dispersion refers to the degree of variation in the data; that is, the numerical spread (or compactness) of the data.
• Key measures:
◦ Range
◦ Interquartile range
◦ Variance
◦ Standard deviation

• The range is the simplest and is the difference between the maximum value and the minimum value in the data set.
• In Excel, compute as =MAX(data range) - MIN(data range).
• The range is affected by outliers, and is often used only for very small data sets.

• Purchase Orders data
• For the cost per order data:
◦ Maximum = $127,500
◦ Minimum = $68.78
• Range = $127,500 - $68.78 = $127,431.22

• The interquartile range (IQR), or the midspread, is the difference between the third and first quartiles, Q3 - Q1.
• This includes only the middle 50% of the data and, therefore, is not influenced by extreme values.

• Purchase Orders data
• For the Cost per order data:
◦ Third Quartile = Q3 = $27,593.75
◦ First Quartile = Q1 = $6,757.81
◦ Interquartile Range = $27,593.75 - $6,757.81 = $20,835.94
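The range and IQR can be sketched in Python as follows. The data are hypothetical, and the choice of `method="inclusive"` is an assumption: it interpolates between the minimum and maximum, which is believed to correspond to Excel's QUARTILE.INC, but other quartile conventions give slightly different values.

```python
from statistics import quantiles

# Hypothetical sample (not the Purchase Orders data).
data = [9, 1, 4, 7, 3, 8, 2, 6, 5]

data_range = max(data) - min(data)                       # like =MAX(range) - MIN(range)
q1, q2, q3 = quantiles(data, n=4, method="inclusive")    # quartile cut points
iqr = q3 - q1                                            # spread of the middle 50%

# Subtracting the quartiles quoted on the slide reproduces the stated IQR.
slide_iqr = 27_593.75 - 6_757.81

print(data_range, q1, q3, iqr, round(slide_iqr, 2))
```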


• The variance is the “average” of the squared deviations from the mean.
• For a population: \(\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}\)
◦ In Excel: =VAR.P(data range)
• For a sample: \(s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}\)
◦ In Excel: =VAR.S(data range)
• Note the difference in denominators!

• Purchase Orders Cost per order data

• The standard deviation is the square root of the variance.
◦ Note that the dimension of the variance is the square of the dimension of the observations, whereas the dimension of the standard deviation is the same as the data. This makes the standard deviation more practical to use in applications.
• For a population: \(\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}\)
◦ In Excel: =STDEV.P(data range)
• For a sample: \(s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}\)
◦ In Excel: =STDEV.S(data range)

• Purchase Orders Cost per order data
◦ Using the results of Example 4.8, take the square root of the variance.
◦ Alternatively, use the STDEV.S function for the data range.
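To make the difference in denominators concrete, here is a short Python sketch using the standard library counterparts of VAR.P, VAR.S, STDEV.P, and STDEV.S. The data set is hypothetical.

```python
from statistics import pvariance, variance, pstdev, stdev

# Hypothetical observations with mean 5 and squared deviations summing to 32.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_var = pvariance(data)     # divides by N      -> 32 / 8
samp_var = variance(data)     # divides by n - 1  -> 32 / 7
pop_sd = pstdev(data)         # square root of the population variance
samp_sd = stdev(data)         # square root of the sample variance

print(pop_var, samp_var, pop_sd, samp_sd)
```

The sample versions are always a little larger than the population versions for the same data, because they divide by n - 1 instead of N.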

• Excel file: Closing Stock Prices
◦ Intel (INTC): mean = $18.81, standard deviation = $0.50
◦ General Electric (GE): mean = $16.19, standard deviation = $0.35
◦ INTC is a higher risk investment than GE.

• For any data set, the proportion of values that lie within k (k > 1) standard deviations of the mean is at least \(1 - 1/k^2\).
• Examples:
◦ For k = 2: at least 3/4 or 75% of the data lie within two standard deviations of the mean
◦ For k = 3: at least 8/9 or 89% of the data lie within three standard deviations of the mean
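The bound 1 - 1/k² can be checked numerically. The sketch below computes the bound and compares it with the observed fraction for a deliberately skewed, hypothetical data set; the function names are illustrative only.

```python
from statistics import fmean, pstdev

def lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

def fraction_within(data, k):
    """Observed proportion of values within k population standard deviations of the mean."""
    mu, sigma = fmean(data), pstdev(data)
    return sum(abs(x - mu) <= k * sigma for x in data) / len(data)

skewed = [1, 1, 1, 2, 2, 3, 4, 5, 8, 40]     # hypothetical, strongly right-skewed
observed = fraction_within(skewed, 2)

print(lower_bound(2), lower_bound(3), observed)
```

Even for this skewed sample the observed fraction within two standard deviations stays above the guaranteed 75%.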


• For many data sets encountered in practice:
◦ Approximately 68% of the observations fall within one standard deviation of the mean
◦ Approximately 95% fall within two standard deviations of the mean
◦ Approximately 99.7% fall within three standard deviations of the mean
• These rules are commonly used to characterize the natural variation in manufacturing processes and other business phenomena.

• The process capability index (Cp) is a measure of how well a manufacturing process can achieve specifications.
• Using a sample of output, measure the dimension of interest, and compute the total variation using the third empirical rule.
• Compare results to specifications using:
  \[ C_p = \frac{\text{upper specification} - \text{lower specification}}{\text{total variation}} \]

• A Cp value less than 1.0 is not good; it means that the variation in the process is wider than the specification limits, signifying that some of the parts will not meet the specifications. In practice, many manufacturers want to have Cp values of at least 1.5.
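A hedged sketch of the Cp calculation in Python. It assumes that "total variation" means six standard deviations (mean ± 3 standard deviations, per the third empirical rule); the measurements and specification limits are hypothetical.

```python
from statistics import stdev

def process_capability(lower_spec, upper_spec, sigma):
    """Cp = (upper spec - lower spec) / total variation, assuming total variation = 6 * sigma."""
    total_variation = 6 * sigma          # spans mean - 3*sigma to mean + 3*sigma
    return (upper_spec - lower_spec) / total_variation

# Hypothetical sample of a machined dimension and hypothetical spec limits.
measurements = [10.02, 9.97, 10.05, 9.94, 10.01, 9.99, 10.03, 9.98]
s = stdev(measurements)
cp = process_capability(9.7, 10.3, s)

print(round(s, 4), round(cp, 2))
```

With these made-up numbers Cp comes out well above 1.5, so the process variation fits comfortably inside the specification limits.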

[Figure: empirical rules]

Copyright © 2013 Pearson Education, Inc., publishing as Prentice Hall

• A standardized value, commonly called a z-score, provides a relative measure of the distance an observation is from the mean, which is independent of the units of measurement.
• The z-score for the ith observation in a data set is calculated as follows:
  \[ z_i = \frac{x_i - \bar{x}}{s} \]
• The numerator represents the distance that \(x_i\) is from the sample mean; a negative value indicates that \(x_i\) lies to the left of the mean, and a positive value indicates that it lies to the right of the mean. By dividing by the standard deviation, s, we scale the distance from the mean to express it in units of standard deviations. Thus,
◦ a z-score of 1.0 means that the observation is one standard deviation to the right of the mean;
◦ a z-score of -1.5 means that the observation is 1.5 standard deviations to the left of the mean.
◦ Excel function: =STANDARDIZE(x, mean, standard_dev).
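A minimal Python version of the z-score calculation, mirroring what STANDARDIZE does with the sample mean and sample standard deviation. The data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def z_scores(data):
    """Standardize each observation: (x - sample mean) / sample standard deviation."""
    x_bar, s = mean(data), stdev(data)
    return [(x - x_bar) / s for x in data]

data = [10.0, 12.0, 14.0, 16.0, 18.0]    # hypothetical; mean 14, sample variance 10
z = z_scores(data)

print([round(v, 3) for v in z])
```

The observation equal to the mean gets a z-score of 0, and observations equally far on either side get z-scores of equal size and opposite sign.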


• Purchase Orders Cost per order data
  =(B2 - $B$97)/$B$98, or
  =STANDARDIZE(B2,$B$97,$B$98).

• The coefficient of variation (CV) provides a relative measure of dispersion in data relative to the mean:
  \[ CV = \frac{\text{standard deviation}}{\text{mean}} \]
• Sometimes expressed as a percentage.
• Provides a relative measure of risk to return.
• Return to risk = 1/CV is often easier to interpret, especially in financial risk analysis.
• The Sharpe ratio is a related measure in finance.
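As a quick check, the CV and return-to-risk can be computed in Python from the Intel and GE means and standard deviations quoted earlier in these slides. The helper name is illustrative only.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Relative dispersion: standard deviation divided by the mean."""
    return std_dev / mean_value

# Means and standard deviations as quoted for the Closing Stock Prices file.
cv_intc = coefficient_of_variation(0.50, 18.81)
cv_ge = coefficient_of_variation(0.35, 16.19)

# Return to risk is the reciprocal of the CV.
return_to_risk_intc = 1 / cv_intc
return_to_risk_ge = 1 / cv_ge

print(round(cv_intc, 4), round(cv_ge, 4))
```

INTC's CV (about 2.7%) is higher than GE's (about 2.2%), so INTC is riskier relative to its mean price, and its return-to-risk is correspondingly lower.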

• Closing Stock Prices worksheet
◦ Intel (INTC) is slightly riskier than the other stocks.
◦ The Index fund has the least risk (lowest CV).

• Skewness describes the lack of symmetry of data.
◦ Distributions that tail off to the right are called positively skewed; those that tail off to the left are said to be negatively skewed.

[Figure: histograms of a positively skewed distribution and a symmetrical distribution]

• Coefficient of Skewness (CS):
  \[ CS = \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^3}{\sigma^3} \]
• Excel function: =SKEW(data range)
• CS is negative for left-skewed data.
• CS is positive for right-skewed data.
• |CS| > 1 suggests high degree of skewness.
• 0.5 ≤ |CS| ≤ 1 suggests moderate skewness.
• |CS| < 0.5 suggests relative symmetry.

• Purchase Orders database
◦ Cost per order data: CS = 1.66 (high positive skewness)
◦ A/P terms data: CS = 0.60 (moderate positive skewness)
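The population form of the coefficient of skewness and the rule-of-thumb thresholds can be sketched in Python as below. The function names and data are hypothetical, and Excel's SKEW applies a sample adjustment, so it will generally not match this population formula exactly.

```python
from statistics import fmean, pstdev

def coefficient_of_skewness(data):
    """Population CS: mean cubed deviation divided by sigma cubed."""
    mu, sigma = fmean(data), pstdev(data)
    third_moment = sum((x - mu) ** 3 for x in data) / len(data)
    return third_moment / sigma**3

def describe_skewness(cs):
    """Apply the slide's thresholds to |CS|."""
    size = abs(cs)
    if size > 1:
        return "high"
    if size >= 0.5:
        return "moderate"
    return "relatively symmetric"

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
cs_sym = coefficient_of_skewness(symmetric)
cs_right = coefficient_of_skewness(right_skewed)

print(round(cs_sym, 3), round(cs_right, 3), describe_skewness(cs_right))
```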


• Comparing measures of location can sometimes reveal information about the shape of the distribution of observations.
◦ For example, if the distribution were perfectly symmetrical and unimodal, the mean, median, and mode would all be the same.
◦ If it were negatively skewed, we would generally find that mean < median < mode.
◦ Positive skewness would suggest that mode < median < mean.

• Kurtosis refers to the peakedness (i.e., high, narrow) or flatness (i.e., short, flat-topped) of a histogram.
• The coefficient of kurtosis (CK) measures the degree of kurtosis of a population:
  \[ CK = \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^4}{\sigma^4} \]
• CK < 3 indicates the data are somewhat flat with a wide degree of dispersion.
• CK > 3 indicates the data are somewhat peaked with less dispersion.
• Excel function: =KURT(data range). Note that KURT reports sample excess kurtosis, which is near 0 rather than 3 for a normal distribution, so its result should be compared with 0.
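For comparison with the threshold of 3, the population coefficient of kurtosis can be computed directly. This is a sketch with a hypothetical function name and data, and it is not the adjusted sample formula that Excel's KURT uses.

```python
from statistics import fmean, pvariance

def coefficient_of_kurtosis(data):
    """Population CK: mean fourth-power deviation divided by sigma to the fourth."""
    mu = fmean(data)
    fourth_moment = sum((x - mu) ** 4 for x in data) / len(data)
    return fourth_moment / pvariance(data) ** 2    # sigma^4 = (sigma^2)^2

flat = [1, 2, 3, 4, 5]                 # evenly spread values, no peak
ck_flat = coefficient_of_kurtosis(flat)

print(round(ck_flat, 3))
```

Evenly spread values give CK = 1.7, below 3, which matches the "somewhat flat" reading.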

This tool provides a summary of numerical statistical measures for sample data.
Data > Data Analysis > Descriptive Statistics
• Enter Input Range
• Labels (optional)
• Check Summary Statistics box
• The data must be in a single row or column. If the data are in multiple columns, the tool treats each row or column as a separate data set.

• Purchase Orders database
Note: Results of the Analysis Toolpak do not change when changes are made to the data.

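For data summarized in a frequency distribution, where \(f_i\) is the frequency of value \(x_i\):
• Population mean: \(\mu = \frac{\sum_{i} f_i x_i}{N}\)
• Sample mean: \(\bar{x} = \frac{\sum_{i} f_i x_i}{n}\)
• Population variance: \(\sigma^2 = \frac{\sum_{i} f_i (x_i - \mu)^2}{N}\)
• Sample variance: \(s^2 = \frac{\sum_{i} f_i (x_i - \bar{x})^2}{n - 1}\)

• Computer Repair Times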


• If the data are grouped into k cells in a frequency distribution, we can use modified versions of the formulas to estimate the mean and variance by replacing \(x_i\) with a representative value (such as the midpoint) for all the observations in each cell.

[Figure label: representative group value]
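A small Python sketch of the grouped-data estimate, using cell midpoints as the representative values. The cells and frequencies are hypothetical.

```python
# Hypothetical grouped frequency distribution: cells 0-10, 10-20, 20-30.
midpoints = [5.0, 15.0, 25.0]      # representative value for each cell
frequencies = [2, 5, 3]            # number of observations in each cell

n = sum(frequencies)

# Estimated sample mean: frequency-weighted average of the midpoints.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Estimated sample variance: frequency-weighted squared deviations over n - 1.
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)

print(grouped_mean, round(grouped_variance, 2))
```

These are estimates only: every observation in a cell is treated as if it sat at the midpoint, so the results usually differ a little from statistics computed on the raw data.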

• The proportion, denoted by p, is the fraction of data that have a certain characteristic.
• Proportions are key descriptive statistics for categorical data, such as defects or errors in quality control applications or consumer preferences in market research.
• Proportion of orders placed by Spacetime Technologies:
  =COUNTIF(A4:A97, "Spacetime Technologies")/94 = 12/94 = 0.128
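The COUNTIF-style proportion can be mirrored in Python. The list below is a stand-in built to match the counts quoted on the slide (12 of 94 orders); "Other Supplier" is a hypothetical placeholder, not a name from the data file.

```python
# Stand-in supplier column: 12 matching orders out of 94, as quoted on the slide.
suppliers = ["Spacetime Technologies"] * 12 + ["Other Supplier"] * 82

matches = suppliers.count("Spacetime Technologies")    # like COUNTIF
p = matches / len(suppliers)                           # fraction with the characteristic

print(round(p, 3))
```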

Value Field Settings include several statistical measures:
• Average
• Max and Min
• Product
• Standard deviation
• Variance

• Credit Risk Data
◦ First, create a PivotTable.
◦ In the PivotTable Field List, move Job to the Row Labels field and Checking and Savings to the Values field. Then change the field settings from “Sum of Checking” and “Sum of Savings” to the averages.


• Two variables have a strong statistical relationship with one another if they appear to move together.
• When two variables appear to be related, you might suspect a cause-and-effect relationship.
• Sometimes, however, statistical relationships exist even though a change in one variable is not caused by a change in the other.

• Covariance is a measure of the linear association between two variables, X and Y. Like the variance, different formulas are used for populations and samples.
• Population covariance: \(\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}\)
◦ Excel function: =COVARIANCE.P(array1,array2)
• Sample covariance: \(\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}\)
◦ Excel function: =COVARIANCE.S(array1,array2)
• The covariance between X and Y is the average of the product of the deviations of each pair of observations from their respective means.
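A minimal Python sketch of both covariance formulas, written out by hand so the denominators are visible. The function name and the paired data are hypothetical.

```python
from statistics import fmean

def covariance(x, y, sample=True):
    """Average product of paired deviations; divides by n - 1 for a sample, N for a population."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mean_x, mean_y = fmean(x), fmean(y)
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return total / (len(x) - 1 if sample else len(x))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]           # moves together with x, so covariance is positive

cov_pop = covariance(x, y, sample=False)     # like COVARIANCE.P
cov_samp = covariance(x, y, sample=True)     # like COVARIANCE.S

print(cov_pop, round(cov_samp, 4))
```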

• Colleges and Universities data

• The larger the absolute value of the covariance, the higher is the degree of linear association between the two variables.
• The sign of the covariance tells us whether there is a direct relationship or an inverse relationship.


• Correlation is a measure of the linear relationship between two variables, X and Y, which does not depend on the units of measurement.
• Correlation is measured by the correlation coefficient, also known as the Pearson product moment correlation coefficient.
• Correlation coefficient for a population: \(\rho_{xy} = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y}\)
• Correlation coefficient for a sample: \(r_{xy} = \frac{\operatorname{cov}(X, Y)}{s_x s_y}\)
• The correlation coefficient is scaled between -1 and 1.
• Excel function: =CORREL(array1,array2)


• Colleges and Universities data

• When using the CORREL function, it does not matter if the data represent samples or populations. In other words,

CORREL(array1,array2) = COVARIANCE.P(array1,array2) / (STDEV.P(array1)*STDEV.P(array2))

and

CORREL(array1,array2) = COVARIANCE.S(array1,array2) / (STDEV.S(array1)*STDEV.S(array2))
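The equality of the population and sample forms can be verified numerically. The following Python sketch uses hypothetical paired data and a hand-written covariance helper (an illustrative name, not an Excel or library function).

```python
from statistics import fmean, pstdev, stdev

def covariance(x, y, sample):
    """Paired-deviation covariance; n - 1 denominator for a sample, N for a population."""
    mean_x, mean_y = fmean(x), fmean(y)
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return total / (len(x) - 1 if sample else len(x))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]      # hypothetical pairs

# Population form: COVARIANCE.P / (STDEV.P * STDEV.P)
r_pop = covariance(x, y, sample=False) / (pstdev(x) * pstdev(y))
# Sample form: COVARIANCE.S / (STDEV.S * STDEV.S)
r_samp = covariance(x, y, sample=True) / (stdev(x) * stdev(y))

print(round(r_pop, 4), round(r_samp, 4))
```

Both forms give 0.8 here because the N and n - 1 factors cancel between the numerator and the denominator.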

Data > Data Analysis > Correlation
• Excel computes the correlation coefficient between all pairs of variables in the Input Range. Input Range data must be in contiguous columns.

• Colleges and Universities data
◦ Moderate negative correlation between acceptance rate and graduation rate, indicating that schools with lower acceptance rates have higher graduation rates.
◦ Acceptance rate is also negatively correlated with the median SAT and Top 10% HS, suggesting that schools with lower acceptance rates have higher student profiles.
◦ The correlations with Expenditures/Student suggest that schools with higher student profiles spend more money per student.

• There is no standard definition of what constitutes an outlier.
• Some typical rules of thumb:
◦ z-scores greater than +3 or less than -3
◦ Extreme outliers are more than 3*IQR to the left of Q1 or right of Q3
◦ Mild outliers are between 1.5*IQR and 3*IQR to the left of Q1 or right of Q3

• Home Market Value data
◦ None of the z-scores exceeds 3. However, while individual variables might not exhibit outliers, combinations of them might.
◦ The last observation has a high market value ($120,700) but a relatively small house size (1,581 square feet) and may be an outlier.
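The IQR rules of thumb can be sketched in Python as follows. The function name and data are hypothetical, and the `inclusive` quartile method is an assumption; a different quartile convention would shift the fences slightly.

```python
from statistics import quantiles

def iqr_outliers(data):
    """Label values as 'mild' or 'extreme' outliers using 1.5*IQR and 3*IQR fences."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    labels = {}
    for x in data:
        distance = max(q1 - x, x - q3)     # how far x lies outside [Q1, Q3]
        if distance > 3 * iqr:
            labels[x] = "extreme"
        elif distance > 1.5 * iqr:
            labels[x] = "mild"
    return labels

# Hypothetical data: Q1 = 3.5, Q3 = 8.5, IQR = 5, so fences sit at 16 and 23.5.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 30]
flagged = iqr_outliers(data)

print(flagged)
```

This checks one variable at a time, so it would not catch a case like the Home Market Value example, where an unusual combination of two variables is what stands out.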


• Statistical Thinking is a philosophy of learning and action for improvement, based on principles that:
◦ all work occurs in a system of interconnected processes
◦ variation exists in all processes
◦ better performance results from understanding and reducing variation
• Work gets done in any organization through processes: systematic ways of doing things that achieve desired results.
• Understanding business processes provides the context for determining the effects of variation and the proper type of action to be taken.

• Excel file Surgery Infections
◦ Is month 12 simply random variation or some explainable phenomenon?

• Applying the three-standard deviation empirical rule suggests that month 12 is statistically different from the rest of the data.

• Different samples from any population will vary.
◦ They will have different means, standard deviations, and other statistical measures.
◦ They will have differences in the shapes of histograms.
• Samples are extremely sensitive to the sample size, that is, the number of observations included in the samples.

• Samples from Computer Repair Times data
◦ Population statistics: μ = 14.91 days, σ² = 35.5 days²
◦ Two samples of size 50:
◦ Two samples of size 25:

• Excel practice of Mean, Median, Mode, Midrange
