BA1 - PPTs Merged
Discussion topics:
• What is Business Analytics?
• Evolution of Business Analytics
• Framework of Business Analytics
• Scope of Business Analytics
Dimensions of Statistics
Statistics
a) You apply for a credit card for the first time. How does the bank
assess your creditworthiness?
b) How does Amazon or Flipkart know which books and other products
to recommend to you when you log in to their website?
c) How do airlines determine what price to quote to you when you are
buying a plane ticket?
d) How do Uber and OLA determine their prices?
e) How do insurance companies determine the risk on a person and
decide premium?
a) Even though you are applying for a credit card for the first time, millions of people have
applied before you. Many of them have repaid the amount spent on time, many have
deferred payment, and some have defaulted. The bank compares your profile with those
of similar card holders to estimate which category you belong to.
b) Similarly, Amazon or Flipkart has access to millions of previous purchases made by
customers on its website. They examine your previous purchases, the products you
have viewed, and any product recommendations you have provided. From their huge
database of customers with similar profile, they create recommendations for you.
c) The price quoted to you for a flight between New Delhi and Hyderabad today could
be very different from the price quoted tomorrow. These changes happen because
airlines use a variable pricing strategy. It works by examining vast amounts of data on
past purchases and using these data to forecast future purchases. These forecasts are
then fed into sophisticated optimization algorithms that determine the optimal price
to charge for a particular flight and when to change that price.
What is Business Analytics?
Support for Decision Making
• Uncertain economics
• Rapidly changing environments
• Global competition
• Demanding customers
• Taking advantage of information acquired by companies is a Critical
Success Factor.
Business Analytics Defined
• Business analytics is the scientific process of transforming data into
insight for making better decisions.
• Business analytics is used for data-driven or fact-based decision
making, which is often seen as more precise than other alternatives
for decision making.
• The tools of business analytics can aid decision making by creating
insights from data, by improving our ability to more accurately
forecast for planning, by helping us quantify risk, and by yielding
better alternatives through analysis and optimization.
Business Analytics Vs. Business Intelligence
• Business Intelligence (BI) is a set of methodologies, processes, architectures, and
technologies that leverage the output of information management processes for
analysis, reporting, performance management, and information delivery.
• Business Analytics is the process of examining data sets in order to draw
conclusions about the information they contain, increasingly with the aid of
specialized systems and software.
• Business Intelligence deals with the present, while business analytics is more
focused on the future.
• A focus of business intelligence is to take data and use it for better decision making.
Through the use of aggregation, visualization, and careful analysis, companies can
use BI to achieve better efficiency in how the organization is operating now.
• Business analytics, on the other hand, places emphasis on the future. Data
analytics engages in data mining, essentially analyzing a set of information to pick
out patterns and predict future trends that can inform organizations as to what
they should do.
Business Analytics Vs. Data Science
• Data Science is the field that uses statistics, trends, algorithms, and
technology to understand data and segregate it into aspects that make
sense.
• The main contribution of data science in business and management is to provide
actionable insights from a wide range of data that are either already segregated or need
to be mined, presenting facts about business operations, customer trends,
and behavior in a bite-sized format.
• Business Analytics, on the other hand, is a statistical study of
segregated/structured data. It provides solutions that overcome
hurdles and improve business performance.
• Because the two terms are often used interchangeably, a business analytics
problem may be wrongly approached with a Data Science solution. Using the
wrong set of tools to solve a business analytics problem can be
counterproductive and bring undesirable results.
Importance of Business Analytics
Evolution of Business Analytics
BA in the 1800s: The need to stay ahead
The first recorded use of data to stay ahead of competitors dates back to 1865. Sir Henry Furnese, a
banker, was always one step ahead because he actively gathered information and acted on it before
any of his competitors. This shows that professionals such as Sir Furnese relied on data and
empirical evidence rather than gut instinct.
Healthcare Analytics
• The use of analytics in health care is on the increase because of pressure to simultaneously
control cost and provide more effective treatment.
• Descriptive, predictive, and prescriptive analytics are used to improve patient, staff, and
facility scheduling; patient flow; purchasing; and inventory control.
Supply Chain Analytics
• One of the earliest applications of analytics was in logistics and supply chain management.
• The core service of companies such as UPS and FedEx is the efficient delivery of goods and
analytics has long been used to achieve efficiency.
• The optimal sorting of goods, vehicle and staff scheduling, and vehicle routing are all key to
profitability for logistics companies such as UPS, FedEx, and others like them.
Web Analytics
• Web analytics is the analysis of online activity, which includes, but is not limited to, visits to
websites and social media sites such as Facebook and LinkedIn.
• Web analytics obviously has huge implications for promoting and selling products and
services via the Internet.
• Leading companies apply descriptive and advanced analytics to data collected in online
experiments to determine the best way to configure websites, position ads, and utilize
social networks for the promotion of products and services.
Example-1: Retail Markdown Decisions
Analytics in Practice: Harrah’s Entertainment
Topic-2
Exploring Data
Understanding Data
Data and data set:
• Data are facts and figures collected, analysed and summarized for presentation and
interpretation.
• Facts are the truths which could be numeric or non-numeric in nature and figures are
information which are numeric.
• In a more technical sense, data are a set of values of qualitative/categorical or
quantitative nature pertaining to one or more individuals or objects.
• Qualitative data are descriptive information (about colour of an object, taste of food,
religion, education, ethnicity etc.) while quantitative data are numerical information
(about marks obtained, age, height, no. of employees, interest rate etc.).
• Quantitative data can be discrete (information relating to no. of households in a
society, no. of IPL teams, no. of warehouses etc.) or continuous (information relating
to height, weight, speed, sales figures, growth rate etc.).
• All the data collected for a particular study are referred to as data set for the study.
Elements, variables and observations:
• Elements are the entities on which data are collected, e.g., individuals,
objects, nations, companies etc.
• Variable is a characteristic of interest for the elements, e.g., height of
individual, dimension of object, GDP of nation, sales figure of company.
• The set of measurements or values obtained for each element and
concerned variable are called observations.
Vehicle Size Cylinders Mileage per gallon Fuel
Federal Reserve Board: Data on the money supply, installment credit, exchange rates, and discount rates
Office of Management and Budget: Data on revenue, expenditures, and debt of the federal government
Department of Commerce: Data on business activity, value of shipments by industry, level of profits by industry, and growing and declining industries
Bureau of Labor Statistics: Consumer spending, hourly earnings, unemployment rate, safety records, and international statistics
Panel data on sales, advt. and R&D expenditure (million $) from 2012-2016
Company A Company B Company C
Sales Advt R&D Sales Advt R&D Sales Advt R&D
2012 12 5 2 15 4 3 20 5 5
2013 15 6 3 18 6 4 22 7 5
2014 14 7 4 12 8 4 24 10 6
2015 18 8 5 10 10 6 26 12 8
2016 20 10 7 15 12 8 30 15 10
Scales of Measurement
• Two scales for measuring ‘qualitative data’
➢ Nominal Scale: A qualitative scale for which there is no meaningful ordering, or ranking of the
categories. It only measures the presence or absence of an attribute and does not contribute to any
higher order analysis. For example, name, address, gender, income category, education etc.
➢ Ordinal Scale: It measures a qualitative phenomenon that exists with a varying degree. An ordinal scale
is used when the phenomenon can be arranged in ascending or descending order and hence also
known as ranked order scale. For example, customer satisfaction, leadership, credit rating etc.
• Two scales for measuring ‘quantitative data’
➢ Interval Scale: An interval scale measures qualitative data in a quantitative manner. It is based on equal
intervals between the scale points, and ‘zero’ has no meaning. For example, on a Likert scale the
measurement is done on a scale of 1 to 5: rate the service of a restaurant on a scale of 1 to 5,
where 1 represents very bad, 2 represents bad, 3 represents neither good nor bad, 4 represents good
and 5 represents very good.
➢ Ratio Scale: Any quantitative data where ‘zero’ has a meaning and we can also perform mathematical
operations. For example, sales data, advertisement expenditure, profit/loss, distance etc. In some of the
statistical analysis, ratio scales are also converted into interval scales for the ease of analyses.
Let’s Answer…
1. If the grading of diabetes is classified as mild, moderate and severe the scale of
measurement used is ordinal scale.
2. The faculty of BA-1 records the answers that each student got correct in the last
test. Interval
4. Asking for preference levels of various apparel brands like Van Heusen, Arrow,
Levi, Peter England and UCB. Ordinal scale
5. Asking whether the customer will like tea or coffee. Nominal scale
Common Features
• Division of the subjects/elements into groups (control, experimental).
• Use of a "treatment" (usually the independent variable) which is
introduced into the research context or manipulated by the researcher.
• In contrast to qualitative research, virtually all experiments are designed
to test hypotheses.
• It is highly analytical.
Qualitative & Quantitative Research
➢ Qualitative research: explore perceptions, attitudes and motivations
• To understand how they are formed.
• To provide depth of information
• To determine what attributes will subsequently be measured in quantitative
studies
➢ Quantitative research: descriptive
• provides raw data on the numbers of people exhibiting certain behaviors,
attitudes, etc.
• allows sampling of large numbers of the population.
• It is highly data-intensive and mathematical.
Examples of each type of research
• Exploratory: Understanding the immunity built by Covid vaccine, new type of
health supplement products liked by consumers
• Descriptive: Sales analysis, consumer perception and behavior analysis, market
characteristics analysis
• Causal: Impact of medicines in curing certain disease, impact of advt. on sales,
impact of FDI on economic growth
• Experimental: Drug experiments on two or more group of patients, product
experiments with different set of consumers
• Qualitative: Branding perception among customers, adding/deleting features in
new products, assessment of services
• Quantitative: Customer feedback surveys, employee satisfaction surveys,
financial assessment of companies, stock market analysis
Attitude Measurement
• The negative adjective or phrase sometimes appears at the left side of the scale and sometimes at
the right.
• This controls the tendency of some respondents, particularly those with very positive or very
negative attitudes, to mark the right- or left-hand sides without reading the labels.
• Scored on either a -3 to +3 or a 1 to 7 scale.
A Semantic Differential Scale for Measuring Self- Concepts,
Person Concepts, and Product Concepts
1) Rugged :---:---:---:---:---:---:---: Delicate
2) Excitable :---:---:---:---:---:---:---: Calm
3) Uncomfortable :---:---:---:---:---:---:---: Comfortable
4) Dominating :---:---:---:---:---:---:---: Submissive
5) Thrifty :---:---:---:---:---:---:---: Indulgent
6) Pleasant :---:---:---:---:---:---:---: Unpleasant
7) Contemporary :---:---:---:---:---:---:---: Obsolete
8) Organized :---:---:---:---:---:---:---: Unorganized
9) Rational :---:---:---:---:---:---:---: Emotional
10) Youthful :---:---:---:---:---:---:---: Mature
11) Formal :---:---:---:---:---:---:---: Informal
12) Orthodox :---:---:---:---:---:---:---: Liberal
13) Complex :---:---:---:---:---:---:---: Simple
14) Colorless :---:---:---:---:---:---:---: Colorful
15) Modest :---:---:---:---:---:---:---: Vain
Stapel scale
• This is a unipolar ten-point rating scale.
• Ranges from +5 to -5 and has no neutral zero point.
• Measures intensity of an attitude
• This scale is usually presented vertically.
Stapel Scale Example
Croma products and service:
+5 (describes very well) +5
+4 +4
+3 +3
+2 +2X
+1 +1
HIGH QUALITY POOR SERVICE
-1 -1
-2 -2
-3 -3
-4X -4
-5 (describes poorly) -5
The data obtained by using a Stapel scale can be analyzed in the same way as semantic
differential data.
Graphic Rating Scale
• A measure of attitude that allows respondents to rate an object by
choosing any point along a graphic continuum.
Exercise:
Name the scale? [Figures A, B and C]
Topic-3
Descriptive Analytics
Ice cream | Nos. sold | Percentage | For Pie Chart (1% = 3.6 degrees)
Butterscotch | 10 | 20 | 72 deg.
[Figure: charts of ice cream sales by flavour (Butterscotch, Chocolate, Strawberry, Vanilla); pie chart shares: Butterscotch 20%, Chocolate 34%, Strawberry 16%, Vanilla 30%]
Country | Share of Mfg in GDP (%) | Share of Exports in GDP (%)
China | 34 | 15
S. Korea | 28 | 4
Thailand | 36 | 2
Japan | 21 | 6
Germany | 19 | 11
[Figure: charts of Share of Mfg in GDP (%) and Share of Exports in GDP (%) for China, S. Korea, Thailand, Japan, Germany and India]
Year | Sales (million $)
2012 | 12
2013 | 14
2014 | 17
2015 | 15
2016 | 18
[Figure: two scatter plots of Sales (million $) against Year, one with sharp lines and one with smooth lines]
Comparison of countries on ease of starting business
[Figure: mixed chart of Time (days) and No. of Procedures for ten countries, listed here alphabetically: Brazil, Canada, China, France, India, New Zealand, Russia, South Africa, UK and USA]
Time (days) 1 5 5 7 12 33 19 27 108 15
No. of Procedures 1 1 6 5 6 13 5 12 13 7
Points | No. of Students
0-10 8
10-20 11
20-30 22
30-40 29
40-50 36
50-60 55
60-70 39
70-80 21
80-90 11
90-100 3
Ogive – Cumulative Frequency Curves
Summarizing Data
(Grouped, continuous)
Salary ($1000s) | No. of Employees (f) | Mid-value (X) | f*X
10-20 | 2 | 15 | 2*15=30
20-30 | 4 | 25 | 4*25=100
30-40 | 6 | 35 | 6*35=210
40-50 | 8 | 45 | 8*45=360
50-60 | 10 | 55 | 10*55=550
Average salary calculation:
Average salary of employees = ΣfX/Σf = 2750/50 = 55
1. The mean weight of five computer stations is 167.2 lbs. The weights of four of
them are 158.4 lbs, 162.8 lbs, 165.0 lbs and 178.2 lbs respectively. What is the
weight of the fifth computer?
2. The following table gives the weights of wooden items being sold by a timber
merchant. Calculate mean weight of the items sold.
Weight (lbs) 1-3 4-6 7-9 10-12 13-15
No. of items 8 25 45 18 4
3. An ice-cream parlor sells six varieties of ice-creams which have generated the
following revenue. Find the mean price of an ice-cream sold.
Ice-cream Butter scotch Chocolate Lychee Choco chips Tooty fruity Vanilla
Price (Rs.) 40 90 65 55 75 45
Sales (Rs.) 5,00,000 4,50,000 3,38,000 3,01,180 4,93,800 3,14,415
Weighted Arithmetic Mean
The weighted mean is a type of mean that is calculated by multiplying the weight (or
probability) associated with a particular event or outcome with its associated quantitative
outcome and then summing all the products together.
Weighted Mean = ΣWiXi/ΣWi; i = 1,2,3,……,n.
Examination | Score (X) | Weight (W) | W*X | Weighted Mean
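The weighted mean formula above can be sketched in a few lines of Python. The examination scores and weights below are hypothetical values chosen only for illustration; they are not taken from the slides.

```python
def weighted_mean(values, weights):
    """Return sum(W_i * X_i) / sum(W_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical examination scores (X) and weights (W), for illustration only.
scores = [80, 70, 90]
weights = [2, 3, 5]
result = weighted_mean(scores, weights)
print(result)
```

With equal weights the function reduces to the ordinary arithmetic mean.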
2. A batsman scored 1, 113, 148, 22, 24, 27, 15, 16, 16 & 28 runs in the last 10 innings. Using an
appropriate measure, find his average score.
Key: Since there are 2 extreme scores 113 & 148, hence mean would be affected by these values.
Here, median would be an appropriate measure.
Arrangement: 1, 15, 16, 16, 22, 24, 27, 28, 113, 148.
No. of observations, n = 10 (even)
Median = Mean of (n/2)th and (n/2+1)th observations
= Mean of 5th and 6th observations
= (22+24)/2 = 23
The average score of the batsman is 23 runs.
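As a quick check of the reasoning above, the sketch below compares the mean and the median of the batsman's scores using Python's standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean, median

# Runs scored in the last 10 innings (from the example above).
runs = [1, 113, 148, 22, 24, 27, 15, 16, 16, 28]

avg = mean(runs)    # pulled upwards by the two extreme scores
med = median(runs)  # mean of the 5th and 6th ordered observations

print("mean:", avg)
print("median:", med)
```

The gap between the two values shows why the median is preferred when a few extreme observations are present.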
Grouped, discrete Grouped, continuous
Mode
• Mode is the value which occurs most
frequently in a distribution.
• A distribution can have one or more
than one modes.
• Mode is widely used while compiling
the results of surveys. The options
with maximum frequencies are
considered and decisions are taken
accordingly.
• The demerits of arithmetic mean and
median can be overcome with the
help of mode.
• Mode can be calculated for grouped,
ungrouped, discrete and continuous
data.
Discrete series Continuous series
Practice Problems
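Mode - Grouping method
Grouping of frequencies
Marks (Col.1: actual frequency, f): 11 (5), 12 (5), 13 (10), 14 (15), 15 (15), 16 (10), 17 (7), 18 (5), 19 (4), 20 (4)
Col.2 (sum of two freq.): 11-12: 5+5 = 10; 13-14: 10+15 = 25; 15-16: 15+10 = 25; 17-18: 7+5 = 12; 19-20: 4+4 = 8
Col.3 (sum of two freq., leaving the first freq.): 12-13: 5+10 = 15; 14-15: 15+15 = 30; 16-17: 10+7 = 17; 18-19: 5+4 = 9
Col.4 (sum of three freq.): 11-13: 5+5+10 = 20; 14-16: 15+15+10 = 40; 17-19: 7+5+4 = 16
Col.5 (sum of three freq., leaving the first freq.): 12-14: 5+10+15 = 30; 15-17: 15+10+7 = 32; 18-20: 5+4+4 = 13
Col.6 (sum of three freq., leaving the first two freq.): 13-15: 10+15+15 = 40; 16-18: 10+7+5 = 22
Counting for highest frequency
Marks | Col.1 | Col.2 | Col.3 | Col.4 | Col.5 | Col.6 | Total
11 | - | - | - | - | - | - | 0
12 | - | - | - | - | - | - | 0
13 | - | 1 | - | - | - | 1 | 2
14 | 1 | 1 | 1 | 1 | - | 1 | 5
15 | 1 | 1 | 1 | 1 | 1 | 1 | 6
16 | - | 1 | - | 1 | 1 | - | 3
17 | - | - | - | - | 1 | - | 1
18 | - | - | - | - | - | - | 0
19 | - | - | - | - | - | - | 0
20 | - | - | - | - | - | - | 0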
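The grouping method can be sketched in code as follows. This is a minimal illustration that assumes six grouping columns as in the table above; the function name `grouping_mode` is made up for this sketch.

```python
def grouping_mode(values, freqs):
    """Mode by the grouping method: tally how often each value falls in a
    highest-sum group across the six grouping columns."""
    n = len(freqs)
    tally = [0] * n
    # (group size, number of leading frequencies skipped) for Col.1 to Col.6
    columns = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
    for size, start in columns:
        groups = [(sum(freqs[i:i + size]), range(i, i + size))
                  for i in range(start, n - size + 1, size)]
        best = max(total for total, _ in groups)
        for total, members in groups:
            if total == best:
                for i in members:
                    tally[i] += 1
    # If several values tie on the tally, the first one is returned.
    return values[tally.index(max(tally))], tally


marks = list(range(11, 21))
freqs = [5, 5, 10, 15, 15, 10, 7, 5, 4, 4]
mode, tally = grouping_mode(marks, freqs)
print("tally per mark:", dict(zip(marks, tally)))
print("mode:", mode)
```

The value with the highest tally is taken as the mode, which is how the counting table is read.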
Partition Values: Percentile, Decile and Quartile
Pk = (k.n/100)th observation, where k = 1, 2, 3, ..., 99 and n is the no. of observations (arranged in ascending order).
If k.n/100 is an integer, the percentile value is the average of the (k.n/100)th and (k.n/100 + 1)th
observations.
If k.n/100 is not an integer, round it up to the next integer and take that observation as the percentile value.
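A minimal sketch of this positional rule is shown below, applied to the batsman's scores from the median example. The function name is illustrative, and library routines may follow other percentile conventions and return slightly different values.

```python
def percentile(data, k):
    """k-th percentile (k = 1..99) using the positional rule above."""
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(data)
    # Position k*n/100 split into whole part q and remainder r.
    q, r = divmod(k * len(ordered), 100)
    if r == 0:
        # Integer position: average the q-th and (q+1)-th observations.
        return (ordered[q - 1] + ordered[q]) / 2
    # Non-integer position: round up, i.e. take the (q+1)-th observation.
    return ordered[q]


runs = [1, 113, 148, 22, 24, 27, 15, 16, 16, 28]
q1 = percentile(runs, 25)      # first quartile
median = percentile(runs, 50)  # second quartile
q3 = percentile(runs, 75)      # third quartile
print(q1, median, q3)
```

Quartiles are the 25th, 50th and 75th percentiles, and deciles are the 10th, 20th, ..., 90th percentiles.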
7 |7-9|= 2
8 |8-9|= 1
13 |13-9|= 4
Mean
Σ|X-m(X)|= 14
m(X) = 9
Variance, Standard Deviation and Coeff. of Variation
Measure | Population | Sample
Variance (ungrouped) | $\sigma^2 = \sum (x-\bar{x})^2 / n$ | $s^2 = \sum (x-\bar{x})^2 / (n-1)$
Std. Deviation (ungrouped) | $\sigma = \sqrt{\sum (x-\bar{x})^2 / n}$ | $s = \sqrt{\sum (x-\bar{x})^2 / (n-1)}$
Variance (grouped) | $\sigma^2 = \sum f(x-\bar{x})^2 / \sum f$ | $s^2 = \sum f(x-\bar{x})^2 / (\sum f - 1)$
Std. Deviation (grouped) | $\sigma = \sqrt{\sum f(x-\bar{x})^2 / \sum f}$ | $s = \sqrt{\sum f(x-\bar{x})^2 / (\sum f - 1)}$
Coeff. of Variation | $C.V. = (\sigma/\bar{x}) \times 100$ | $C.V. = (s/\bar{x}) \times 100$
Example: Ungrouped & grouped, discrete series
Ungrouped series:
x | $x-\bar{x}$ | $(x-\bar{x})^2$
8 | 8-7=1 | 1
4 | 4-7=-3 | 9
9 | 9-7=2 | 4
11 | 11-7=4 | 16
3 | 3-7=-4 | 16
Σx = 35; $\bar{x}$ = 7; $\sum (x-\bar{x})^2$ = 46
Sample variance ($s^2$) = $\sum (x-\bar{x})^2/(n-1)$ = 46/4 = 11.5
Sample Std. Deviation (s) = Sqrt (Sample variance) = Sqrt (11.5) = 3.391
CV = $(s/\bar{x}) \times 100$ = (3.391/7)*100 = 48.44%
Grouped, discrete series:
x | f | fx | $x-\bar{x}$ | $(x-\bar{x})^2$ | $f(x-\bar{x})^2$
8 | 3 | 24 | 8-8=0 | 0 | 3*0=0
4 | 4 | 16 | 4-8=-4 | 16 | 4*16=64
10 | 6 | 60 | 10-8=2 | 4 | 6*4=24
12 | 4 | 48 | 12-8=4 | 16 | 4*16=64
4 | 3 | 12 | 4-8=-4 | 16 | 3*16=48
Σfx = 160; Σf = 20; $\bar{x}$ = 160/20 = 8; $\sum f(x-\bar{x})^2$ = 200
Sample Variance ($s^2$) = $\sum f(x-\bar{x})^2/(\sum f - 1)$ = 200/19 = 10.526
Std Deviation (s) = Sqrt (Sample variance) = Sqrt (10.526) = 3.244
CV = $(s/\bar{x}) \times 100$ = (3.244/8)*100 = 40.55%
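The two worked examples above can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n-1) divisor. For the grouped series, the sketch simply repeats each value f times, which is one possible way to reuse the ungrouped functions.

```python
from statistics import mean, stdev, variance

# Ungrouped series from the first example.
xs = [8, 4, 9, 11, 3]
s2 = variance(xs)          # sample variance, divisor n-1
s = stdev(xs)              # sample standard deviation
cv = s / mean(xs) * 100    # coefficient of variation in percent

# Grouped, discrete series: expand each value x by its frequency f.
values = [8, 4, 10, 12, 4]
freqs = [3, 4, 6, 4, 3]
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]
s2_grouped = variance(expanded)
s_grouped = stdev(expanded)
cv_grouped = s_grouped / mean(expanded) * 100

print(round(s2, 3), round(s, 3), round(cv, 2))
print(round(s2_grouped, 3), round(s_grouped, 3), round(cv_grouped, 2))
```

Small differences in the last decimal place of the C.V. can arise because the slides round the standard deviation before dividing.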
Example: Grouped, continuous series
A showroom of cars displays its sales figures for the last 30 days. Calculate the mean no. of
cars sold per day and std. deviation.
No. of cars sold | Days (f) | x | fx | $x-\bar{x}$ | $(x-\bar{x})^2$ | $f(x-\bar{x})^2$
0-2 | 14 | 1 | 14 | 1-3=-2 | 4 | 14*4=56
2-4 | 7 | 3 | 21 | 3-3=0 | 0 | 7*0=0
4-6 | 5 | 5 | 25 | 5-3=2 | 4 | 5*4=20
6-8 | 3 | 7 | 21 | 7-3=4 | 16 | 3*16=48
8-10 | 1 | 9 | 9 | 9-3=6 | 36 | 1*36=36
Σfx = 90; Σf = 30; $\bar{x}$ = 90/30 = 3; $\sum f(x-\bar{x})^2$ = 160
Sample Variance ($s^2$) = $\sum f(x-\bar{x})^2/(\sum f - 1)$ = 160/29 = 5.517
Sample Standard Deviation (s) = Sqrt (Variance) = Sqrt (5.517) = 2.349
CV = $(s/\bar{x}) \times 100$ = (2.349/3)*100 = 78.30%
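For a grouped, continuous series the class mid-values stand in for the observations. The sketch below applies the grouped sample formulas to the showroom data; the function name `grouped_stats` is illustrative.

```python
import math


def grouped_stats(classes, freqs):
    """Mean, sample variance, sample SD and CV (%) for class intervals."""
    mids = [(lo + hi) / 2 for lo, hi in classes]   # mid-value x of each class
    n = sum(freqs)                                 # sum of f
    mean = sum(f * x for f, x in zip(freqs, mids)) / n
    ss = sum(f * (x - mean) ** 2 for f, x in zip(freqs, mids))
    var = ss / (n - 1)
    sd = math.sqrt(var)
    return mean, var, sd, sd / mean * 100


classes = [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)]  # no. of cars sold
days = [14, 7, 5, 3, 1]                              # frequency f
mean, var, sd, cv = grouped_stats(classes, days)
print(mean, round(var, 3), round(sd, 3), round(cv, 2))
```

Using mid-values is an approximation: it assumes observations are spread evenly within each class.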
Example: Consistency of data using C.V.
Coefficient of Variation is used to study consistency whenever there is a
comparison between two or more datasets.
Two companies Dawson Suppliers and Clark Distributors deliver construction materials. The following data
shows days of delivery for both the companies on 8 occasions. Which company is more consistent in
deliveries?
Dawson (x) | Clark (y) | $x-\bar{x}$ | $(x-\bar{x})^2$ | $y-\bar{y}$ | $(y-\bar{y})^2$
11 | 8 | 1 | 1 | -2 | 4
10 | 10 | 0 | 0 | 0 | 0
9 | 17 | -1 | 1 | 7 | 49
10 | 7 | 0 | 0 | -3 | 9
8 | 10 | -2 | 4 | 0 | 0
8 | 11 | -2 | 4 | 1 | 1
10 | 10 | 0 | 0 | 0 | 0
14 | 7 | 4 | 16 | -3 | 9
Mean $\bar{x}$ = 10; Mean $\bar{y}$ = 10; $\sum (x-\bar{x})^2$ = 26; $\sum (y-\bar{y})^2$ = 72
Coefficient of Variation:
Variance of x = 26/7 = 3.71; SD of x = Sqrt (3.71) = 1.93; CV of x = (1.93/10)*100 = 19.3%
Variance of y = 72/7 = 10.29; SD of y = Sqrt (10.29) = 3.21; CV of y = (3.21/10)*100 = 32.1%
Dawson Suppliers is more consistent in delivering the materials.
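The comparison can be reproduced with a short sketch using the standard `statistics` module. The helper name `coeff_of_variation` is made up for this illustration.

```python
from statistics import mean, stdev


def coeff_of_variation(xs):
    """Sample coefficient of variation in percent: (s / mean) * 100."""
    return stdev(xs) / mean(xs) * 100


# Days of delivery on 8 occasions (from the example above).
dawson = [11, 10, 9, 10, 8, 8, 10, 14]
clark = [8, 10, 17, 7, 10, 11, 10, 7]

cv_dawson = coeff_of_variation(dawson)
cv_clark = coeff_of_variation(clark)
# The lower C.V. indicates the more consistent supplier.
more_consistent = "Dawson" if cv_dawson < cv_clark else "Clark"
print(round(cv_dawson, 1), round(cv_clark, 1), more_consistent)
```

Because both suppliers have the same mean here, comparing standard deviations would give the same ranking; the C.V. matters most when the means differ.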
Practice Problems
1. Find quartile deviation from the following data:
109, 189, 167, 209, 309, 265, 189, 187, 165, 239, 308, 378, 367, 109, 198, 209, 218, 387
2. The share prices of two companies X and Y are given below for twelve days. Which
company’s share prices are more consistent?
Days 1 2 3 4 5 6 7 8 9 10 11 12
X 201 200 199 203 206 208 206 201 197 199 198 196
Y 291 293 293 287 292 298 298 299 302 302 302 304
When computing the z-score for each observation in the data set, a threshold must be specified. A general
‘thumb-rule’ for detecting outliers is |Z| > 3.0; however, it varies. Sometimes |Z| values of more than 2.0 or
2.5 are also considered outliers.
Meal price($) Z-score = (x-mean)/s.d.
18 -0.3545
19 -0.3151
20 -0.2757
17 -0.3939
21 -0.2363
22 -0.1969
99 2.8362 (Outlier)
21 -0.2363
15 -0.4726
18 -0.3545
Mean=27
S.D.=25.3859
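A minimal sketch of the z-score method on the meal price data is given below. The threshold of 2.5 is one of the values mentioned above and is passed in as a parameter; the function name is illustrative.

```python
from statistics import mean, stdev


def z_outliers(xs, threshold=2.5):
    """Return (value, z-score) pairs whose |z| exceeds the threshold."""
    m, s = mean(xs), stdev(xs)   # sample mean and sample standard deviation
    scored = [(x, (x - m) / s) for x in xs]
    return [(x, z) for x, z in scored if abs(z) > threshold]


prices = [18, 19, 20, 17, 21, 22, 99, 21, 15, 18]
flagged = z_outliers(prices)
for value, z in flagged:
    print(value, round(z, 4))
```

With the stricter threshold of 3.0 the same data set would produce no outliers, which illustrates why the threshold has to be stated.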
‘IQR’ method:
Interquartile range (IQR) is a measure of variability and also referred to as ‘midspread’. It is calculated as:
IQR = Q3 – Q1
With the help of IQR, we determine lower limit and upper limit for the dataset. Any value less than lower limit or
more than upper limit is considered as an outlier.
Lower limit = Q1 – 1.5*IQR
Upper limit = Q3 + 1.5*IQR
Meal price ($), in ascending order: 15, 17, 18, 18, 19, 20, 21, 21, 22, 99
Calculations:
Q1 = 18 and Q3 = 21
IQR = Q3 – Q1 = 3
Lower limit = 18 – 1.5*3 = 13.5
Upper limit = 21 + 1.5*3 = 25.5
Any value less than $13.5 or more than $25.5 will be considered an outlier.
Hence the meal price of $99 is an outlier.
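The IQR method can be sketched as below. The quartiles are obtained with the positional percentile rule given earlier in these notes; this is an assumption of the sketch, and library quartile functions may use other conventions and give different limits.

```python
def positional_percentile(ordered, k):
    """k-th percentile of already sorted data using position k*n/100."""
    q, r = divmod(k * len(ordered), 100)
    if r == 0:
        return (ordered[q - 1] + ordered[q]) / 2  # average of two neighbours
    return ordered[q]                             # position rounded up


def iqr_outliers(xs):
    """Return (lower limit, upper limit, list of outliers)."""
    ordered = sorted(xs)
    q1 = positional_percentile(ordered, 25)
    q3 = positional_percentile(ordered, 75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return lower, upper, [x for x in ordered if x < lower or x > upper]


prices = [18, 19, 20, 17, 21, 22, 99, 21, 15, 18]
lower, upper, outliers = iqr_outliers(prices)
print(lower, upper, outliers)
```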
Association between two variables
• In a bivariate distribution, we are often interested in knowing the association between
the two variables, say X & Y. One such technique of establishing the relationship
between two variables is correlation analysis.
• For example, we might be interested in finding whether there is any association
between height and weight of kids, sales and advertisement expenditure, work
experience and salary, stress level and BP etc.
• Correlation is defined as the relationship between two variables in such a way that any
change in one variable is accompanied by a corresponding change in the other. Correlation analysis
provides a tool to measure the strength and direction of such associations, if
any.
• Correlation can be positive or negative. If the two variables exhibit changes in the
same direction then there exists a positive correlation and if the changes are in
opposite direction then there is a negative correlation between the variables.
• However, it does not establish any causal or dependence-independence relationship
between the two variables.
Covariance
• Covariance measures how the two variables move with respect to each other and is an extension of the
concept of variance, which tells how a single variable varies. It can take any value between -∞
and +∞.
• A positive number signifies positive covariance and denotes that there is a direct relationship.
Effectively this means that an increase in one variable would also lead to a corresponding increase in the
other variable provided other conditions remain constant.
• On the other hand, a negative number signifies negative covariance, which denotes an inverse
relationship between the two variables. Though covariance is perfect for defining the type of
relationship, it is bad for interpreting its magnitude.
• Covariance (for sample data) is defined by the formula:
$Cov(X,Y) = \sum (x-\bar{x})(y-\bar{y}) / (n-1)$
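Week | x (television commercials) | y (sales) | $x-\bar{x}$ | $y-\bar{y}$ | $(x-\bar{x})^2$ | $(y-\bar{y})^2$ | $(x-\bar{x})(y-\bar{y})$
1 | 2 | 50 | -1 | -1 | 1 | 1 | 1
2 | 5 | 57 | 2 | 6 | 4 | 36 | 12
3 | 1 | 41 | -2 | -10 | 4 | 100 | 20
4 | 3 | 54 | 0 | 3 | 0 | 9 | 0
5 | 4 | 54 | 1 | 3 | 1 | 9 | 3
6 | 1 | 38 | -2 | -13 | 4 | 169 | 26
7 | 5 | 63 | 2 | 12 | 4 | 144 | 24
8 | 3 | 48 | 0 | -3 | 0 | 9 | 0
9 | 4 | 59 | 1 | 8 | 1 | 64 | 8
10 | 2 | 46 | -1 | -5 | 1 | 25 | 5
Σx = 30; Σy = 510; $\bar{x}$ = 3; $\bar{y}$ = 51; $\sum (x-\bar{x})^2$ = 20; $\sum (y-\bar{y})^2$ = 566; $\sum (x-\bar{x})(y-\bar{y})$ = 99
Cov(X,Y) = $\sum (x-\bar{x})(y-\bar{y})/(n-1)$ = 99/9 = 11
Sx = Sqrt [$\sum (x-\bar{x})^2/(n-1)$] = Sqrt (20/9) = Sqrt (2.22) = 1.49
Sy = Sqrt [$\sum (y-\bar{y})^2/(n-1)$] = Sqrt (566/9) = Sqrt (62.89) = 7.93
rxy = Cov(X,Y)/SxSy = 11/((1.49)(7.93)) = 0.931
There exists a very strong positive correlation between television commercials and sales of the store.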
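The covariance and correlation calculation above can be sketched as follows. The week 1 pair (2, 50) is not legible in the table as extracted; it is inferred here from the stated totals Σx = 30 and Σy = 510, so treat it as an assumption.

```python
import math


def covariance_and_r(xs, ys):
    """Sample covariance and Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    cov = sxy / (n - 1)
    sd_x = math.sqrt(sxx / (n - 1))
    sd_y = math.sqrt(syy / (n - 1))
    return cov, cov / (sd_x * sd_y)


# Weeks 1 to 10; the first pair is inferred from the column totals.
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]
cov, r = covariance_and_r(commercials, sales)
print(cov, round(r, 2))
```

Unlike covariance, r is unit-free and always lies between -1 and +1, which is why it is used to judge the strength of the relationship.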
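Karl Pearson’s Coefficient of Correlation (r)
S.No. | X | Y | XY | X² | Y²
4 | 5 | 10 | 50 | 25 | 100
5 | 6 | 11 | 66 | 36 | 121
6 | 9 | 13 | 117 | 81 | 169
Total | ∑X = 41 | ∑Y = 66 | ∑XY = 461 | ∑X² = 291 | ∑Y² = 742
$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}}$ = 60/78.9936
r = 0.760
There is strong direct (or strong positive) correlation between age and weight.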
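When only the column totals are available, r can be computed directly from the sums. The sketch below assumes n = 6 observations, which is inferred from the row numbering and is consistent with the numerator 60 and denominator 78.9936 shown above.

```python
import math


def pearson_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Karl Pearson's r from the totals of X, Y, XY, X^2 and Y^2."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                            (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Totals from the table; n = 6 is an inferred assumption.
r = pearson_from_sums(6, 41, 66, 461, 291, 742)
print(round(r, 3))
```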
Topic-6
Random Variables and Probability Distributions
Topics to be discussed:
• Random Variables
• Discrete Random Variables
• Continuous random variables
• Expected Value and Variance of Discrete Random variables
• Binomial Probability Distribution
• Poisson Probability Distribution
• Normal Distribution
Random Variables
• A random variable is a numerical description of the outcome of an experiment.
• A random variable can be classified as being either discrete or continuous depending
on the numerical values it assumes.
• A discrete random variable may assume either a finite number of values or an
infinite sequence of values.
Experiment Random Variable (x) Possible values of x
Ticket sale in a hall of size 100 No. of viewers coming in the hall 0, 1, 2, 3………., 100
Inspect a lot of 25 chairs No. of defective chairs 0, 1, 2, 3………., 25
[Figure: chart of the probability distribution; vertical axis: Probability (.10 to .40); horizontal axis: Values of Random Variable ‘x’ (TV sales), 0 to 4]
Expected Value and Variance
x f(x) x.f(x)
0 .40 0
1 .25 .25
2 .20 .40
3 .05 .15
4 .10 .40
Σx.f(x) =1.20 = E(x)
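The expected value in the table, and the corresponding variance, can be computed with a short sketch. The variance formula used here, Σ(x - μ)²·f(x), is the standard definition for a discrete random variable; the function names are illustrative.

```python
import math


def expected_value(dist):
    """E(x) = sum of x * f(x) over all values of x."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(x) = sum of (x - mu)^2 * f(x)."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


# Probability distribution of x from the table above.
tv_sales = {0: 0.40, 1: 0.25, 2: 0.20, 3: 0.05, 4: 0.10}
assert math.isclose(sum(tv_sales.values()), 1.0)  # probabilities sum to 1

ev = expected_value(tv_sales)
var = variance(tv_sales)
sd = math.sqrt(var)
print(round(ev, 2), round(var, 2), round(sd, 4))
```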
(i) What is the probability that delivery will be made within 3 to 6 days?
(ii) What is the probability that the delivery will be late?
(iii) What is the probability that the delivery will be early?
2. The following data show the no. of hours cars are parked at a parking slot, along with the
probabilities. The parking supervisor wants to know the expected no. of hours and the standard deviation of
the no. of hours cars are parked in the slot.
No. of hours 1 2 3 4 5 6 7 8
Probability 0.24 0.18 0.13 0.10 0.07 0.04 0.04 0.20
Binomial Probability Distribution
✓Properties of a Binomial Experiment
• The experiment consists of a sequence of n identical trials.
• Two outcomes, success and failure, are possible on each trial.
• The probability of a success, denoted by p, does not change from trial
to trial. The probability of failure q or (1-p) also does not change from
trial to trial.
• The trials are independent.
✓ Points to be remembered
If an experiment fulfils all the above four conditions then it is called a
Binomial Experiment.
If the first point is not fulfilled and rest three are fulfilled, it is termed
as a Bernoulli Experiment (given by Jacob Bernoulli).
Binomial Probability Distribution
$f(x) = \frac{n!}{x!(n-x)!} \, p^x (1-p)^{(n-x)}$
where
f(x) = the probability of x successes in n trials
n = the number of trials
p = the probability of success on any one trial
Example: Evans Electronics
• Binomial Probability Distribution
Evans Electronics is concerned about a low retention rate for
employees. On the basis of past experience, management
has seen a turnover of 10% of the hourly employees
annually. Thus, for any hourly employee chosen at random,
management estimates a probability of 0.1 that the person
will not be with the company next year.
Choosing 3 hourly employees at random, what is the
probability that exactly 1 of them will leave the company this year?
Let: p = .10, n = 3, x = 1
$f(1) = \frac{3!}{1!(3-1)!} (0.1)^1 (0.9)^2$
= (3)(0.1)(0.81)
= 0.243
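The binomial probability function can be sketched with `math.comb` from the Python standard library. The function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Evans Electronics: n = 3 employees, p = 0.10 chance that each one leaves.
probs = [binomial_pmf(x, 3, 0.10) for x in range(4)]
for x, prob in enumerate(probs):
    print(x, round(prob, 4))
```

The four probabilities cover every possible value of x, so they add up to 1.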
p
n x .10 .15 .20 .25 .30 .35 .40 .45 .50
3 0 .7290 .6141 .5120 .4219 .3430 .2746 .2160 .1664 .1250
1 .2430 .3251 .3840 .4219 .4410 .4436 .4320 .4084 .3750
2 .0270 .0574 .0960 .1406 .1890 .2389 .2880 .3341 .3750
3 .0010 .0034 .0080 .0156 .0270 .0429 .0640 .0911 .1250
Sample: Binomial Probability Table
Binomial Probability using a Tree Diagram
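First Worker | Second Worker | Third Worker | Value of x | Probab.
Leaves (.1) | Leaves (.1) | L (.1) | 3 | 0.1*0.1*0.1 = .0010
Leaves (.1) | Leaves (.1) | S (.9) | 2 | .0090
Leaves (.1) | Stays (.9) | L (.1) | 2 | .0090
Leaves (.1) | Stays (.9) | S (.9) | 1 | .0810
Stays (.9) | Leaves (.1) | L (.1) | 2 | .0090
Stays (.9) | Leaves (.1) | S (.9) | 1 | .0810
Stays (.9) | Stays (.9) | L (.1) | 1 | .0810
Stays (.9) | Stays (.9) | S (.9) | 0 | .7290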
Expected value and Variance
• Expected Value
$E(x) = \mu = np$
• Variance
$Var(x) = \sigma^2 = np(1-p)$
• Standard Deviation
$SD(x) = \sigma = \sqrt{np(1-p)}$
Ex.4: Twenty three percent of the vehicles are not covered by any insurance. On a special
checking day, 30 vehicles are checked randomly. What is the probability that more than 27
vehicles are insured? (here, p=0.77, n=30)
What is the expected no. of vehicles not covered by any insurance? What is the variance and
std. deviation? (here, n=30, p=0.23, q=0.77)
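A possible way to work through Ex.4 in code is sketched below: the tail probability sums the binomial probabilities for 28, 29 and 30 insured vehicles, and the mean, variance and standard deviation use np, np(1-p) and its square root. Variable names are illustrative.

```python
from math import comb, sqrt

n = 30

# Part 1: "success" = vehicle is insured, p = 0.77.
p_insured = 0.77
p_more_than_27 = sum(
    comb(n, x) * p_insured ** x * (1 - p_insured) ** (n - x)
    for x in range(28, n + 1)          # x = 28, 29, 30
)

# Part 2: "success" = vehicle is not insured, p = 0.23.
p_uninsured = 0.23
expected = n * p_uninsured                      # E(x) = np
var = n * p_uninsured * (1 - p_uninsured)       # Var(x) = np(1-p)
sd = sqrt(var)

print(round(p_more_than_27, 4))
print(round(expected, 2), round(var, 3), round(sd, 3))
```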
Poisson Probability Distribution
• Properties of a Poisson Experiment
• It is used to estimate the probability of no. of occurrences
over a specified period of time.
• The probability of an occurrence is the same for any two
intervals of equal length.
• The occurrence or nonoccurrence in any interval is
independent of the occurrence or nonoccurrence in any
other interval.
Poisson Probability Distribution
• Poisson Probability Function
$f(x) = \frac{\mu^x e^{-\mu}}{x!}$
where
f(x) = probability of x occurrences in an interval; x = 0, 1, 2, 3, …………∞
μ = mean number of occurrences in an interval
e = 2.71828
$f(4) = \frac{3^4 (2.71828)^{-3}}{4!} = .1680$
Example: Mercy Hospital
• Using the Tables of Poisson Probabilities
μ
x 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
0 .1225 .1108 .1003 .0907 .0821 .0743 .0672 .0608 .0550 .0498
1 .2572 .2438 .2306 .2177 .2052 .1931 .1815 .1703 .1596 .1494
2 .2700 .2681 .2652 .2613 .2565 .2510 .2450 .2384 .2314 .2240
3 .1890 .1966 .2033 .2090 .2138 .2176 .2205 .2225 .2237 .2240
4 .0992 .1082 .1169 .1254 .1336 .1414 .1488 .1557 .1622 .1680
5 .0417 .0476 .0538 .0602 .0668 .0735 .0804 .0872 .0940 .1008
6 .0146 .0174 .0206 .0241 .0278 .0319 .0362 .0407 .0455 .0504
7 .0044 .0055 .0068 .0083 .0099 .0118 .0139 .0163 .0188 .0216
8 .0011 .0015 .0019 .0025 .0031 .0038 .0047 .0057 .0068 .0081
9 .0003 .0004 .0005 .0007 .0009 .0011 .0014 .0018 .0022 .0027
10 .0001 .0001 .0001 .0002 .0002 .0003 .0004 .0005 .0006 .0008
11 .0000 .0000 .0000 .0000 .0000 .0001 .0001 .0001 .0002 .0002
12 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0001
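Table entries such as these can be reproduced from the Poisson probability function. The sketch below rebuilds the μ = 3.0 column; the function name `poisson_pmf` is illustrative.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean is mu."""
    return mu ** x * exp(-mu) / factorial(x)


# Rebuild the column for mu = 3.0, x = 0..12, rounded to 4 decimals.
column = [round(poisson_pmf(x, 3.0), 4) for x in range(13)]
for x, prob in enumerate(column):
    print(x, prob)
```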
Poisson Probability Distribution:
An Approximation of Binomial Distribution
Ex.1: At the drive-thru window of a burger shop, an average of 10 customers arrive in a 30-minute interval.
What is the probability that exactly 5 customers arrive in 30 minutes?
What is the probability that exactly 5 customers arrive in a 15-minute interval?
Soln: For the 30-minute interval, μ = 10 and x = 5.
$$f(5) = \frac{e^{-10}\,10^{5}}{5!} = (0.0000454)(100000/120) = (0.0000454)(833.33) = 0.0378$$
For the 15-minute interval, μ = 10 × (15/30) = 5 and x = 5.
$$f(5) = \frac{e^{-5}\,5^{5}}{5!} = (0.006738)(3125/120) = 0.1755$$
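The Poisson probability function above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `poisson_pmf` is a label chosen here for illustration, not something from the slides.

```python
import math

def poisson_pmf(x: int, mu: float) -> float:
    """P(X = x) for a Poisson variable with mean mu: mu^x * e^(-mu) / x!"""
    return math.exp(-mu) * mu**x / math.factorial(x)

# Ex.1: on average 10 customers arrive in a 30-minute interval.
p_30 = poisson_pmf(5, 10.0)             # exactly 5 arrivals in 30 minutes

# The mean scales with the interval length: 10 * (15 / 30) = 5 for 15 minutes.
p_15 = poisson_pmf(5, 10.0 * 15 / 30)   # exactly 5 arrivals in 15 minutes

print(f"30-minute interval: {p_30:.4f}")
print(f"15-minute interval: {p_15:.4f}")
```

The same function with μ = 3 and x = 4 gives the value in the μ = 3.0 column of the Poisson table.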
Ex.2: An average of 15 aircraft accidents occur each year. Compute
(i) the mean number of accidents per month;
(ii) the probability of no accidents during a month;
(iii) the probability of exactly one accident during a month;
(iv) the probability of more than one accident during a month.
Soln: $P(X = x) = \frac{e^{-\mu}\,\mu^{x}}{x!}$
For a 12-month interval μ = 15, so for a one-month interval μ = 15/12 = 1.25.
(i) Mean no. of accidents per month = 15/12 = 1.25
(ii) For x = 0, $P(X=0) = \frac{e^{-1.25}\,(1.25)^{0}}{0!}$ = (0.2865)(1/1) = 0.2865
(iii) For x = 1, $P(X=1) = \frac{e^{-1.25}\,(1.25)^{1}}{1!}$ = (0.2865)(1.25/1) = 0.3581
(iv) For x > 1, P(X>1) = 1 − P(X≤1) = 1 − [P(X=0) + P(X=1)] = 1 − (0.2865 + 0.3581) = 0.3554
Ex.3: Phone calls arrive at a call centre at the rate of 48 per hour. Compute
(i) the mean number of calls in a 5-minute interval. (4)
(ii) the probability of receiving exactly three calls in 5 minutes. (0.1954)
(iii) the probability of receiving exactly 10 calls in 15 minutes. (0.1048)
(iv) the probability of receiving at least one call in 10 minutes. (0.9997)
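Every part of Ex.3 first rescales the hourly rate to the interval in question. The sketch below does this in Python; `scaled_mean` and `poisson_pmf` are helper names assumed here for illustration.

```python
import math

def poisson_pmf(x: int, mu: float) -> float:
    """P(X = x) for a Poisson variable with mean mu."""
    return math.exp(-mu) * mu**x / math.factorial(x)

def scaled_mean(rate: float, rate_minutes: float, target_minutes: float) -> float:
    """Mean count in target_minutes, given `rate` occurrences per rate_minutes."""
    return rate * target_minutes / rate_minutes

# Ex.3: 48 calls per 60 minutes.
mu_5 = scaled_mean(48, 60, 5)     # mean calls in 5 minutes
mu_15 = scaled_mean(48, 60, 15)   # mean calls in 15 minutes
mu_10 = scaled_mean(48, 60, 10)   # mean calls in 10 minutes

p_three_in_5 = poisson_pmf(3, mu_5)
p_ten_in_15 = poisson_pmf(10, mu_15)
p_at_least_one_in_10 = 1 - poisson_pmf(0, mu_10)   # complement of "no calls"

print(mu_5, round(p_three_in_5, 4), round(p_ten_in_15, 4),
      round(p_at_least_one_in_10, 4))
```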
Ex.4: It is estimated that 0.5 percent of the callers to the customer service department will receive
a busy signal. What is the probability that out of 1200 callers, at least 3 will receive a busy signal?
Soln: n = 1200, p = 0.5/100 = 0.005
Since n is large and p is very small, the binomial distribution can be approximated by a Poisson distribution with mean μ = np = (1200)(0.005) = 6.
$$P(X = x) = \frac{e^{-\mu}\,\mu^{x}}{x!}$$
For x ≥ 3, P(X ≥ 3) = 1 − P(X < 3) = 1 − [P(X=0) + P(X=1) + P(X=2)]
$P(X=0) = \frac{e^{-6}\,(6)^{0}}{0!} = 0.0025$, P(X=1) = 0.0149, P(X=2) = 0.0446
P(X ≥ 3) = 1 − (0.0025 + 0.0149 + 0.0446) = 0.938
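To see how close the Poisson approximation is in Ex.4, the exact binomial probability can be computed alongside it. This is a sketch with the standard library; the function names are assumptions made for this illustration.

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """Exact binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, mu: float) -> float:
    """Poisson probability of k occurrences with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

n, p = 1200, 0.005
mu = n * p   # Poisson mean used for the approximation

# P(X >= 3) = 1 - [P(0) + P(1) + P(2)] under each model
exact = 1 - sum(binom_pmf(k, n, p) for k in range(3))
approx = 1 - sum(poisson_pmf(k, mu) for k in range(3))

print(f"binomial (exact):      {exact:.4f}")
print(f"Poisson approximation: {approx:.4f}")
```

With n this large and p this small, the two answers agree to about three decimal places.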
Ex.5: Patients arrive at a hospital at the rate of 6 per hour. Find the probability that in a 90-minute interval (μ = 6 × 1.5 = 9)
(i) exactly 7 patients arrive at the hospital.
(ii) between 7 and 10 patients arrive at the hospital. P(7≤X≤10) = P(X=7)+P(X=8)+P(X=9)+P(X=10)
(iii) If a patient arrives at 11:30 am, what is the probability that at least one other patient arrives before 11:45 am?
For this 15-minute interval μ = 6 × (15/60) = 1.5, so P(X≥1) = 1 − P(X<1) = 1 − P(X=0) = 1 − e^(−1.5) = 1 − 0.2231 = 0.7769
Continuous Probability Distributions
▪ Bell Shaped
▪ Symmetrical about the mean
[Figure: bell-shaped density f(X) centered at X = μ]
The Standardized Normal Distribution
• Any normal distribution (with any mean and standard deviation
combination) can be transformed into the standardized normal
distribution (Z distribution).
• Need to transform X units into Z units.
• The standardized normal distribution (Z) has a mean of 0 and a
standard deviation of 1.
• Translate from X to the standard normal variate ‘Z’ by
subtracting the mean of X and dividing by its standard deviation:
$$Z = \frac{X - \mu}{\sigma}$$
The Z distribution always has mean = 0 and standard deviation = 1
The Standardized Normal Probability Density Function
$$f(Z) = \frac{1}{\sqrt{2\pi}}\, e^{-(1/2)Z^{2}}$$
[Figure: standardized normal curve centered at Z = 0]
Values to the right of Z = 0 have positive Z-values and values to the left of Z = 0 have negative Z-values.
Example: Transforming X into Z
• If X is distributed normally with mean of 100 and standard
deviation of 50, the Z value for X = 200 is
$$Z = \frac{X - \mu}{\sigma} = \frac{200 - 100}{50} = 2.0$$
• This says that X = 200 is two standard deviations (2
increments of 50 units) above the mean of 100.
• P(X > 200) = P(Z > 2), because X = 200 corresponds to Z = 2
• P(0 < X < 200) = P(−2 < Z < 2), because X = 0 corresponds to Z = (0 − 100)/50 = −2
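The equivalence between the X scale and the Z scale can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_score` is an assumption made for this illustration.

```python
from statistics import NormalDist

def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardize x: the number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

mu, sigma = 100, 50
z_200 = z_score(200, mu, sigma)   # X = 200 on the Z scale
z_0 = z_score(0, mu, sigma)       # X = 0 on the Z scale

x_dist = NormalDist(mu, sigma)    # distribution in original units
z_dist = NormalDist()             # standardized: mean 0, std. deviation 1

# The same probabilities, expressed in X units and in Z units
p_upper_x = 1 - x_dist.cdf(200)
p_upper_z = 1 - z_dist.cdf(z_200)
p_between_x = x_dist.cdf(200) - x_dist.cdf(0)
p_between_z = z_dist.cdf(z_200) - z_dist.cdf(z_0)

print(z_200, z_0, round(p_upper_x, 4), round(p_between_x, 4))
```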
Comparing X and Z values
[Figure: the normal curve on the standardized scale Z (μ = 0, σ = 1), with Z = 0 and Z = 2.0 marked]
Note that the shape of the distribution is the same, only the scale has changed.
We can express the problem in original units (X) or in standardized units (Z)
Probability and the Normal Curve
• The total area under the normal curve is equal to 1.
• The probability that X is greater than ‘a’ equals the area under the normal curve
bounded by ‘a’ and plus infinity (as indicated by the non-shaded area in the figure
below).
• The probability that X is less than ‘a’ equals the area under the normal curve
bounded by ‘a’ and minus infinity (as indicated by the shaded area in the figure
below).
[Figure: normal curve over X with the points a and b marked on the axis]
Probability as Area Under the Curve
The total area under the curve is 1.0, and the curve is symmetric, so half
is on the right of mean and half is on the left.
P(−∞ < X < μ) = 0.5 and P(μ < X < ∞) = 0.5
P(−∞ < X < ∞) = 1.0
The Standardized Normal Table
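Example:
P(Z < 2) = P(−∞ < Z < 2) = 0.50 + 0.4772 = 0.9772
P(Z > 2) = 1.00 − 0.9772 = 0.0228
[Figure: standardized normal curve; the area between Z = 0 and Z = 2 is 0.4772 and the upper-tail area beyond Z = 2 is 0.0228]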
Finding Normal Probabilities
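Suppose X is normal with μ = 8.0 and σ = 5.0. Find P(X < 8.6).
$$Z = \frac{X - \mu}{\sigma} = \frac{8.6 - 8.0}{5.0} = 0.12$$
P(X < 8.6) = P(Z < 0.12) = P(−∞ < Z < 0.12) = 0.50 + 0.0478 = 0.5478
Standard Normal Probability Table (Portion)
Z     0.00    0.01    0.02
0.0   0.5000  0.5040  0.5080
0.1   0.5398  0.5438  0.5478
[Figure: X = 8.6 on the original scale (μ = 8, σ = 5) corresponds to Z = 0.12 on the standardized scale (μ = 0, σ = 1)]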
Finding Normal Upper Tail Probabilities
P(X > 8.6) = P(Z > 0.12) = 1.000 − P(Z < 0.12) = 1.000 − 0.5478 = 0.4522
[Figure: the area to the left of Z = 0.12 is 0.5478; the remaining upper-tail area is 0.4522]
Finding a normal probability between two values
Calculate Z-values for X = 8 and X = 8.6 (μ = 8, σ = 5):
$$Z = \frac{X - \mu}{\sigma} = \frac{8 - 8}{5} = 0$$
$$Z = \frac{X - \mu}{\sigma} = \frac{8.6 - 8}{5} = 0.12$$
P(8 < X < 8.6)
= P(0 < Z < 0.12)
= P(−∞ < Z < 0.12) − P(−∞ < Z < 0)
= 0.5478 − 0.5 = 0.0478
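The three probabilities worked out above (below a value, above a value, and between two values) all come from the same cumulative distribution function. This minimal Python sketch uses the standard library's `NormalDist` in place of the printed table; the variable names are chosen for this illustration.

```python
from statistics import NormalDist

x_dist = NormalDist(mu=8.0, sigma=5.0)

p_below = x_dist.cdf(8.6)                       # P(X < 8.6)
p_above = 1 - x_dist.cdf(8.6)                   # P(X > 8.6), upper tail
p_between = x_dist.cdf(8.6) - x_dist.cdf(8.0)   # P(8 < X < 8.6)

print(f"P(X < 8.6)     = {p_below:.4f}")
print(f"P(X > 8.6)     = {p_above:.4f}")
print(f"P(8 < X < 8.6) = {p_between:.4f}")
```

A lower-tail probability such as P(X < 7.4) is obtained the same way, as `x_dist.cdf(7.4)`.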
Probabilities in the Lower Tail
[Figure: normal curve with X = 8.0 and X = 7.4 marked; the lower-tail probability is the area to the left of 7.4]
2. The mean weight of 500 college students is 70 kg and the standard deviation is 3 kg. Assuming that the
weight is normally distributed, determine how many students weigh:
a. between 60 kg and 75 kg
b. more than 90 kg
c. less than 64 kg
d. exactly 64 kg
e. 64 kg or less
3. For borrowers with good credit scores, the mean debt amount is $15,000. Assuming the debt amounts
to be normally distributed with standard deviation $3000, calculate the probability that
a. debt for a borrower is more than $18,000
b. debt for a borrower is less than $10,000
c. debt for a borrower is between $12,000 and $18,000
Topic 7
Sampling and Sampling Distributions
Chapter topics
• Concept of sampling
• Probability and nonprobability sampling methods
• Concept of sampling distributions
• Sampling distribution of the mean
• For normal populations
• Using the Central Limit Theorem
• Sampling distribution of a proportion
• Probabilities using sampling distributions
Why Sample?
• Unbiased sample: an unbiased, representative sample drawn at random from the entire population (of male and female students).
• Biased sample: a biased, unrepresentative sample consisting of more female students than male students.
Sampling Process begins with a Sampling Frame
[Figure: types of sampling, including systematic and cluster sampling]
Types of Sampling: Non-probability Sampling
In non-probability sampling, items included are chosen without
considering their probability of occurrence.
• In convenience sampling, items are selected based only on the fact that they are
easy, inexpensive, or convenient to sample.
• In judgment sampling, one gets the opinions of pre-selected individuals or
experts in the subject matter.
• In quota sampling, individuals or items are selected on the basis of specific traits
or qualities. A fixed number of units is selected so that every trait is represented.
• In snowball sampling, research units are selected with the help of other research
units. It is used where potential participants are difficult to identify. For example,
customers in life insurance, network marketing, survey on ‘social evils’ etc.
Types of Sampling: Probability Sampling
[Figure: excerpt of a numbered sampling frame, e.g. Joann P. = 849, Paul F. = 850]
Probability Sampling: Stratified Random Sampling
• Divide population into two or more subgroups (called strata) according to some common
characteristic
• A simple random sample is selected from each subgroup, with sample sizes proportional
to strata sizes
• Samples from subgroups are combined into one
• This is a common technique when sampling population of voters, stratifying across racial
or socio-economic lines.
[Figure: population divided into 4 strata]
Probability Sampling: Systematic Sampling
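Example: N = 40, n = 4, so k = N/n = 10. One item is selected at random from the first group of k = 10 items, and every 10th item after it is then selected.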
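The N = 40, n = 4, k = 10 example can be sketched in Python as follows. The function name `systematic_sample`, the item labels 1 to 40 and the seed are assumptions made for this illustration, and the sketch assumes N is a multiple of n.

```python
import random

def systematic_sample(population: list, n: int, rng: random.Random) -> list:
    """Pick a random start in the first group of k items, then every k-th item."""
    k = len(population) // n     # sampling interval k = N / n
    start = rng.randrange(k)     # random position within the first group
    return [population[start + i * k] for i in range(n)]

rng = random.Random(42)              # fixed seed so the run is repeatable
population = list(range(1, 41))      # N = 40 items labelled 1..40
sample = systematic_sample(population, 4, rng)

print(sample)
```

Only the starting point is random; once it is chosen, the remaining n − 1 items are determined by the interval k.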
Probability Sampling: Cluster Sampling
[Figure: population divided into 16 clusters; randomly selected clusters form the sample]
Probability Sample: Comparing Sampling Methods
Sample proportion: p̄ = x/n
• ‘x’ is the number of elements in the sample that possess the characteristic of
interest and ‘n’ is the sample size.
• 0 ≤ p̄ ≤ 1
• p̄ is approximately normally distributed when n is large
(assuming sampling with replacement from a finite population or without replacement from an
infinite population)
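Sampling Distribution of p̄
$$\mu_{\bar{p}} = p \quad\text{and}\quad \sigma_{\bar{p}} = \sqrt{\frac{p(1-p)}{n}}$$
$$Z = \frac{\bar{p} - p}{\sigma_{\bar{p}}} = \frac{\bar{p} - p}{\sqrt{\dfrac{p(1-p)}{n}}}$$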
Example
• If the true proportion of voters who support Proposition
A is 0.4, what is the probability that a sample of size
200 yields a sample proportion between 0.40 and 0.45?
• i.e. if p = 0.4 and n = 200, what is P(0.40 ≤ p̄ ≤ 0.45)?
Example (continued)
$$\sigma_{\bar{p}} = \sqrt{\frac{0.4(1-0.4)}{200}} = 0.03464$$
Standardize: P(0.40 ≤ p̄ ≤ 0.45) = P((0.40 − 0.40)/0.03464 ≤ Z ≤ (0.45 − 0.40)/0.03464) = P(0 ≤ Z ≤ 1.44) = 0.4251
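The Proposition A example can be reproduced in Python as a check. The sketch keeps the Z value unrounded, so its answer differs from the table-based 0.4251 in the fourth decimal place; the variable names are assumptions made for this illustration.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.4, 200
se = sqrt(p * (1 - p) / n)   # standard error of the sample proportion

# Standardize both ends of the interval 0.40 <= p_bar <= 0.45
z_low = (0.40 - p) / se
z_high = (0.45 - p) / se

std_normal = NormalDist()
prob = std_normal.cdf(z_high) - std_normal.cdf(z_low)

print(f"standard error = {se:.5f}")
print(f"Z from {z_low:.2f} to {z_high:.2f}")
print(f"P(0.40 <= p_bar <= 0.45) = {prob:.4f}")
```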
1. The Grocery Manufacturers of America reported that 76% of consumers read the ingredients
listed on a product’s label. Assume the population proportion is p = .76 and a sample of 400
consumers is selected from the population.
(a) Show the sampling distribution of the sample proportion p̄, where p̄ is the proportion of the
sampled consumers who read the ingredients listed on a product’s label.
(b) What is the probability that the sample proportion will be within ±.03 of the population
proportion?
(c) Answer part (b) for a sample of 750 consumers.
2. The Food Marketing Institute shows that 17% of households spend more than $100 per week
on groceries. Assume the population proportion is p = .17 and a sample of 800 households will
be selected from the population.
(a) Show the sampling distribution of p̄, the sample proportion of households spending more
than $100 per week on groceries.
(b) What is the probability that the sample proportion will be within ±.02 of the population
proportion?
(c) Answer part (b) for a sample of 1600 households.
Point Estimation
• Point estimation is the process of using the sample data available to estimate the unknown
value of a parameter. The point estimate obtained from the data will be a single number like
sample mean, sample standard deviation, sample proportion etc.
• Suppose we have an unknown population parameter, such as a population mean μ or a
population proportion p, which we'd like to estimate. For example, suppose we are interested
in estimating:
p = the (unknown) proportion of American college students, 18-24, who have a
smart phone
μ = the (unknown) mean number of days it takes patients to respond to a drug
In either case, we cannot possibly survey the entire population: we can neither survey all
American college students between the ages of 18 and 24 nor survey all patients with
a specific disease. Instead, we take a random sample from
the population and use the resulting data to estimate the value of the population parameter.
Of course, we want the estimate to be "good" in some way.
Consider a sample of 30 managers drawn from a company's total of
2,500 managers.
• The sample mean annual salary (x̄ = $51,814) is a point estimate of the population mean
salary (μ = $51,800).
• Similarly, the sample std. dev. (s = $3,348) is a point estimate of the population std. dev.
(σ = $4,000).
• The sample proportion of managers who have completed training (p̄ = 0.63) is a point
estimate of the population proportion (p = 0.60).
Properties of a Point Estimator
Unbiasedness: If the expected value of the sample statistic is equal to the population
parameter being estimated, the sample statistic is said to be an unbiased estimator of
the population parameter.
In discussing the sampling distributions of the sample mean and the sample proportion,
we stated that E(x̄) = μ and E(p̄) = p. Thus, both x̄ and p̄ are unbiased estimators of their
corresponding population parameters μ and p. In the case of the sample standard
deviation s and the sample variance s², it can be shown that E(s²) = σ².
Efficiency: The most efficient point estimator is the unbiased estimator with the smallest
variance. The variance measures how widely the estimates are dispersed, so the
estimator with the smallest variance varies the least from one sample to
another.
Consistency: A third property associated with good point estimators is consistency. A
point estimator is consistent if the values of the point estimator tend to become closer to
the population parameter as the sample size becomes larger. In other words, a large
sample size tends to provide a better point estimate than a small sample size.
CONSTRUCTION OF
QUESTIONNAIRES
Types of Questionnaire
Structured Questionnaire
• definite, concrete, and pre-determined questions
• prepared in advance, not constructed on the spot
• additional questions may be asked only when some clarification is required
Unstructured Questionnaire
• set of questions which are not structured in advance
• questions may be adjusted as per the need
• these questionnaires are flexible in nature
Construction of Questionnaire
A. General Considerations
• Well-defined goals are the best way to assure a good
questionnaire design. Questionnaires are developed
directly to address the goals of study.
• Keep it short and simple to maximize responses.
• Try to eliminate unimportant questions…involve
experts and decision-makers while doing this.
• Provide a well written cover page…it gives the first
impression and provides you the best chance to
convince the respondent to complete the survey.
• Give your questionnaire a title that is short and
meaningful to the respondents.
• Place the most important items in the first half
of the questionnaire. Respondents often send
back partially completed questionnaires.
• Leave adequate space for respondents to make
comments and provide valuable information.
• Use professional printing methods and
materials for the questionnaires.
B. Language
• Wording of a question is extremely important.
Researchers strive for objectivity in surveys and,
therefore, must be careful not to lead the
respondent into giving a desired answer.
• Questionnaires require special measures to cast
questions that are clear and straightforward in
four important aspects: simple language,
common concepts, manageable tasks and
widespread information.
• The nature and structure of the population to be
studied should be kept in mind. Technical terms
and jargon should be avoided to the maximum
possible extent.
• Common concepts should be used in the
questionnaire. Mathematical abstractions tend
to be difficult for the general public.
C. Type of Questions
Researchers use two basic types of questions:
• Closed-ended (dichotomous, multiple choice, and scales)
• Open-ended
Examples of each kind of questions are:
Closed-ended: Dichotomous Questions
1. Do you have a car: (a) Yes (b) No
2. What kind of petrol do you use: (a) Normal (b) Premium
3. Your working hours are: (a) Fixed (b) Flexible
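[Figure: a confidence interval runs from the lower confidence limit to the upper confidence limit, with the point estimate at its center; the distance between the limits is the width of the confidence interval]
Point Estimators
Population parameter    Point estimator (sample statistic)
Mean μ                  X̄
Std. Deviation σ        S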
Confidence Intervals
General Formula
• The general formula for all confidence
intervals is:
Point Estimate ± (Critical Value)(Standard Error)
• Point Estimate is the sample statistic estimating the population
parameter of interest
[Diagram: confidence intervals are constructed for the population mean (σ known or σ unknown) and for the population proportion]
Confidence Interval for μ
(σ Known)
• Assumptions
• Population standard deviation σ is known
• Population is normally distributed
• If population is not normal, use large sample
• Confidence interval estimate:
$$\bar{X} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
where X̄ is the point estimate,
Zα/2 is the normal distribution critical value for a probability of α/2 in each tail, and
σ/√n is the standard error.
Finding the Critical Value, Zα/2
• Consider a 95% confidence interval:
1 − α = 0.95, so α = 0.05 and α/2 = 0.025 in each tail
Zα/2 = 1.96
Confidence Level   Confidence Coefficient, 1 − α   Zα/2 value
80%                0.80                            1.28
90%                0.90                            1.645
95%                0.95                            1.96
98%                0.98                            2.33
99%                0.99                            2.58
99.8%              0.998                           3.09
99.9%              0.999                           3.29
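The critical values in such a table can also be computed rather than looked up. This sketch uses `NormalDist.inv_cdf` from the Python standard library; the function name `z_critical` is an assumption made for this illustration.

```python
from statistics import NormalDist

def z_critical(confidence: float) -> float:
    """Two-sided critical value: leaves alpha/2 in each tail of the Z distribution."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.80, 0.90, 0.95, 0.98, 0.99, 0.998, 0.999):
    print(f"{level:.1%}  Z = {z_critical(level):.3f}")
```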
Example:1
• A sample of 11 circuits from a large normal population has a
mean resistance of 2.20 ohms. We know from past testing that
the population standard deviation is 0.35 ohms. Determine a
95% confidence interval for the true mean resistance of the
population.
• Solution: Confidence interval: X̄ ± Zα/2 (σ/√n)
= 2.20 ± 1.96 (0.35/√11)
= 2.20 ± 0.2068
1.9932 ≤ μ ≤ 2.4068
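The circuit-resistance interval can be reproduced with a small helper. This is a sketch only; the function name `z_interval` is an assumption made for this illustration, and the critical value 1.96 is passed in explicitly.

```python
from math import sqrt

def z_interval(x_bar: float, sigma: float, n: int, z_crit: float):
    """Confidence interval for the mean when the population sigma is known."""
    margin = z_crit * sigma / sqrt(n)   # critical value times standard error
    return x_bar - margin, x_bar + margin

# Example 1: n = 11 circuits, sample mean 2.20 ohms, sigma = 0.35 ohms, 95% level
lower, upper = z_interval(2.20, 0.35, 11, 1.96)

print(f"{lower:.4f} <= mu <= {upper:.4f}")
```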
Interpretation
• We are 95% confident that the true mean resistance is between 1.9932 and 2.4068 ohms.
Do You Ever Truly Know σ?
• Probably not! Computing σ requires knowing µ.
• If you truly knew µ, there would be no need to collect a sample to estimate it.
Confidence Interval for μ
(σ Unknown)
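• If the population standard deviation σ is unknown, substitute the sample standard deviation S and use the t distribution instead of the normal distribution:
$$\bar{X} \pm t_{\alpha/2}\,\frac{S}{\sqrt{n}}$$
where tα/2 is the critical value of the t distribution with d.f. = n − 1.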
Degrees of Freedom (df)
Idea: Number of observations that are free to vary
after the sample mean has been calculated
Example: Suppose the mean of 3 numbers is 8.0
Let X1 = 7
Let X2 = 8
What is X3?
If the mean of these three values is 8.0, then X3 must be 9
(i.e., X3 is not free to vary)
Here, n = 3, so degrees of freedom = n – 1 = 3 – 1 = 2
(2 values can be any numbers, but the third is not free to vary
for a given mean)
Student’s t Distribution
Note: t → Z as n increases
t-distributions are bell-shaped and symmetric, but have ‘fatter’ tails than the normal.
[Figure: standard normal distribution compared with t (df = 15) and t (df = 10), all centered at 0]
Selected t distribution values
With comparison to the Z value
Confidence Level   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z (∞ d.f.)
0.80               1.372         1.325         1.310         1.28
0.90               1.812         1.725         1.697         1.645
0.95               2.228         2.086         2.042         1.96
0.99               3.169         2.845         2.750         2.58
Note: t → Z as n increases
Example-1: t-distribution confidence interval
46.6976 ≤ μ ≤ 53.3024
Example-2: t-distribution confidence interval
A restaurant owner thinks that the time spent by customers in the restaurant is directly
proportional to sales. He took a random sample of 20 customers and found the mean
time spent by them to be 55 minutes with a std. deviation of 7.8 minutes. Find the
interval estimate of mean time spent by all the customers visiting that restaurant using (a)
95%, and (b) 99% confidence level.
Confidence interval estimate: X̄ ± tα/2 (S/√n)
(since sample size is small and population std. deviation is unknown)
(a) X̄ = 55, S = 7.8, n = 20 and tα/2 = 2.093 (at 95% conf. level and 19 d.f.)
Hence, interval estimate of the mean time spent is:
X̄ − tα/2(S/√n) ≤ μ ≤ X̄ + tα/2(S/√n)
55 – 2.093(7.8/√20) ≤ μ ≤ 55 + 2.093(7.8/√20)
(55 – 3.65) ≤ μ ≤ (55 + 3.65)
51.35 ≤ μ ≤ 58.65, i.e. the mean time spent by all the customers is between 51.35 mins. and
58.65 mins. at 95% confidence level.
(b) At the 99% confidence level, tα/2 = 2.861 (19 d.f.), so the margin of error is 2.861(7.8/√20) = 4.99 and
50.01 ≤ μ ≤ 59.99.
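The restaurant example can be checked with the sketch below. The Python standard library has no t-distribution quantile function, so the critical values for 19 d.f. are entered by hand from a standard t table; the dictionary and function names are assumptions made for this illustration.

```python
from math import sqrt

# Two-sided t critical values for 19 degrees of freedom, taken from a t table
T_CRIT_19_DF = {0.95: 2.093, 0.99: 2.861}

def t_interval(x_bar: float, s: float, n: int, t_crit: float):
    """Confidence interval for the mean when sigma is unknown (uses sample S)."""
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin

ci_95 = t_interval(55, 7.8, 20, T_CRIT_19_DF[0.95])
ci_99 = t_interval(55, 7.8, 20, T_CRIT_19_DF[0.99])

print(f"95%: {ci_95[0]:.2f} to {ci_95[1]:.2f} minutes")
print(f"99%: {ci_99[0]:.2f} to {ci_99[1]:.2f} minutes")
```

Raising the confidence level from 95% to 99% widens the interval, because a larger critical value multiplies the same standard error.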
Ex.3: The average annual premium for automobile insurance in the United States is $1503 (Insure.com
website, March 6, 2014). The following annual premiums ($) are representative of the website’s findings
for the state of Michigan:
1905, 3112, 2312, 2725, 2545, 2981, 2677, 2525, 2627, 2600, 2370, 2857, 2962, 2545, 2675, 2184, 2529,
2115, 2332, 2442
Assuming the population to be approximately normal, provide
(i) A point estimate of the mean annual automobile insurance premium in Michigan. [Hint: calculate
x̄ for Michigan state; Ans: $2551]
(ii) Develop a 95% confidence interval for the mean annual automobile insurance premium in
Michigan. [Hint: calculate S for Michigan state using x̄ and find the confidence interval using
tα/2 with (n−1) d.f.; Ans: (2409.99, 2692.01)]
(iii) Does the 95% confidence interval for the annual automobile insurance premium in Michigan
include the national average for the United States? What is your interpretation of the relationship
between auto insurance premiums in Michigan and the national average?
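One way to check parts (i) to (iii) is sketched below using the `statistics` module. The t critical value 2.093 (95% level, 19 d.f.) is entered by hand from a t table, and the variable names are assumptions made for this illustration.

```python
from math import sqrt
from statistics import mean, stdev

premiums = [1905, 3112, 2312, 2725, 2545, 2981, 2677, 2525, 2627, 2600,
            2370, 2857, 2962, 2545, 2675, 2184, 2529, 2115, 2332, 2442]
national_average = 1503

n = len(premiums)
x_bar = mean(premiums)   # (i) point estimate of the Michigan mean
s = stdev(premiums)      # sample standard deviation (divides by n - 1)

t_crit = 2.093           # t table value: 95% confidence, n - 1 = 19 d.f.
margin = t_crit * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin   # (ii) 95% interval

# (iii) is the national average inside the Michigan interval?
contains_national = lower <= national_average <= upper

print(f"x_bar = {x_bar}, s = {s:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
print("interval contains national average:", contains_national)
```

The interval endpoints can differ from the hinted answer in the second decimal place, depending on how S is rounded.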
Summary of Interval Estimation for Population Mean
Interval Estimation for Population Proportion
The general form of an interval estimate of a population proportion is:
Point Estimate ± (Critical Value)(Standard Error)
p̄ ± (critical value × standard error)
p̄ ± (margin of error)
We know that the sampling distribution of p̄ can be approximated by a normal
distribution whenever np ≥ 5 and n(1−p) ≥ 5. The mean of the sampling distribution of
p̄ is the population proportion p, and the standard error of p̄ is
$$\sigma_{\bar{p}} = \sqrt{\frac{p(1-p)}{n}}$$
But since p is unknown and must be estimated, p is replaced by p̄ in the formula for the standard error.
Hence, margin of error = $Z_{\alpha/2}\sqrt{\dfrac{\bar{p}(1-\bar{p})}{n}}$
Determining Sample Size: For the Mean
The sampling error (margin of error) e is the half-width of the confidence interval $\bar{X} \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$:
$$e = Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
Now solve for n to get
$$n = \frac{Z_{\alpha/2}^{2}\,\sigma^{2}}{e^{2}}$$
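Determining Sample Size
• To determine the required sample size for the mean, you
must know:
  • the desired level of confidence (1 − α), which determines the critical value Zα/2
  • the acceptable sampling error, e
  • the population standard deviation, σ
• Example: with Zα/2 = 1.645 (90% confidence), σ = 45 and e = 5:
$$n = \frac{Z_{\alpha/2}^{2}\,\sigma^{2}}{e^{2}} = \frac{(1.645)^{2}(45)^{2}}{5^{2}} = 219.19$$
So use n = 220 (always round up).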
Determining Sample Size: For the Proportion
Solution:
For 95% confidence, use Zα/2 = 1.96
e = 0.03
p* = 0.2, so use this to estimate p
$$n = \frac{Z_{\alpha/2}^{2}\,p^{*}(1-p^{*})}{e^{2}} = \frac{(1.96)^{2}(0.2)(1-0.2)}{(0.03)^{2}} = 682.95$$
So use n = 683
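Both sample-size formulas, for the mean and for the proportion, end with the same step: round the result up to the next whole unit. The sketch below does this with `math.ceil`; the function names are assumptions made for this illustration.

```python
from math import ceil

def sample_size_mean(z_crit: float, sigma: float, e: float) -> int:
    """n = Z^2 * sigma^2 / e^2, rounded up to a whole observation."""
    return ceil((z_crit * sigma / e) ** 2)

def sample_size_proportion(z_crit: float, p_est: float, e: float) -> int:
    """n = Z^2 * p(1 - p) / e^2, rounded up; p_est is a planning value for p."""
    return ceil(z_crit**2 * p_est * (1 - p_est) / e**2)

# Mean: 90% confidence (Z = 1.645), sigma = 45, sampling error e = 5
n_mean = sample_size_mean(1.645, 45, 5)

# Proportion: 95% confidence (Z = 1.96), p* = 0.2, sampling error e = 0.03
n_prop = sample_size_proportion(1.96, 0.2, 0.03)

print(n_mean, n_prop)
```

Rounding up rather than to the nearest integer ensures the margin of error is no larger than e.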
Steps in Hypothesis Testing
1. State the null hypothesis, H0 and the alternative
hypothesis, Ha
2. Choose the level of significance, α, and the sample
size, n
3. Determine the appropriate test statistic and
sampling distribution
4. Determine the critical values that divide the
rejection and non-rejection regions
Steps in Hypothesis Testing
5. Collect data and compute the value of the test
statistic
6. Make the statistical decision and state the
managerial conclusion. If the test statistic falls into
the non-rejection region, do not reject the null
hypothesis H0. If the test statistic falls into the
rejection region, reject the null hypothesis.
Express the managerial conclusion in the context
of the problem.
Z values for One-tail and Two-tail Tests
Level of Significance (α)   One-tail value   Two-tail value
0.10                        1.28             1.645
0.05                        1.645            1.96
0.01                        2.33             2.58
$$Z_{STAT} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
• If you truly know µ there would be no need to gather a sample to estimate it.
Hypothesis Testing:
σ Unknown
• If the population standard deviation is unknown, we instead use the
sample standard deviation s.
[Diagram: hypothesis tests for μ: σ known (Z test), σ unknown (t test)]
The test statistic is:
$$t_{STAT} = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
Acceptance/rejection criteria:
If |tSTAT| < |tCRT|, then accept (do not reject) H0; if |tSTAT| > |tCRT|, then reject H0.
t-statistic and Rejection Criteria
Ex-1: Two-tailed t-test (σ unknown)
t = 1.46
As per the table, t = 1.46 lies between 1.318 and 1.711, whose two-tailed areas are 0.20 and 0.10
respectively, so the p-value lies between 0.10 and 0.20.
Since 1.46 is closer to 1.318 than to 1.711, the p-value will also be
closer to 0.20 than to 0.10.
Let us take p-value = 0.16, which is more than α = 0.05; hence there is no reason to
reject the null hypothesis H0.
We conclude that the average cost of hotel rooms in New York can be considered
$168 per night.
Ex-2: Two-tailed t-test (σ unknown)
Ex-3: One-tailed t-test (σ unknown)
A shareholders’ group, in lodging a protest, claimed that the mean tenure for a CEO was at
least nine years. A survey of 25 companies reported in the Wall Street Journal found a
sample mean tenure of 5.81 years for CEOs with a standard deviation of 6.38 years. Test
the hypothesis to challenge the validity of the claim made by the shareholders’ group.
What is the p-value for your hypothesis test? At α = 0.01, what is your conclusion?
The mean annual premium for automobile insurance in the United States is $1503
(insure.com website, March 6, 2014). A researcher from Pennsylvania believes that
automobile insurance is cheaper there and wishes to develop statistical support for
his opinion. A sample of 25 automobile insurance policies from the state of
Pennsylvania showed a mean annual premium of $1440 with a standard deviation of
$165. Develop hypotheses to test whether the mean annual premium in
Pennsylvania is lower than the national mean annual premium. Use α = 0.05.
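For the two lower-tail exercises above (CEO tenure and Pennsylvania premiums), the t statistic and the critical-value decision can be sketched as follows. The critical values for 24 d.f. are entered by hand from a t table, and the function and variable names are assumptions made for this illustration. The sketch treats both as lower-tail tests of H0: μ ≥ μ0 against Ha: μ < μ0.

```python
from math import sqrt

def t_statistic(x_bar: float, mu0: float, s: float, n: int) -> float:
    """t_STAT = (x_bar - mu0) / (S / sqrt(n)) with n - 1 degrees of freedom."""
    return (x_bar - mu0) / (s / sqrt(n))

# One-tail t table values for 24 d.f. (n = 25 in both exercises)
T_CRIT_24_DF = {0.05: 1.711, 0.01: 2.492}

# CEO tenure: H0: mu >= 9 vs Ha: mu < 9, alpha = 0.01
t_ceo = t_statistic(5.81, 9, 6.38, 25)
reject_ceo = t_ceo < -T_CRIT_24_DF[0.01]

# Pennsylvania premium: H0: mu >= 1503 vs Ha: mu < 1503, alpha = 0.05
t_pa = t_statistic(1440, 1503, 165, 25)
reject_pa = t_pa < -T_CRIT_24_DF[0.05]

print(f"CEO tenure:   t = {t_ceo:.3f}, reject H0: {reject_ceo}")
print(f"Pennsylvania: t = {t_pa:.3f}, reject H0: {reject_pa}")
```

In a lower-tail test H0 is rejected only when the statistic falls below the negative critical value.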
Topic-9
Hypothesis Testing-1
Topics of discussion:
*Concept of null and alternative hypothesis
*Developing null and alternative hypothesis
*Errors in hypothesis testing
*Testing of hypothesis for population mean (σ known and unknown)
*Testing of hypothesis for population proportion
What is a Hypothesis?
• A hypothesis is a claim
(assumption) about a
population parameter:
• population mean
Example: The mean monthly cell phone bill
in this city is μ = $42
• population proportion
The Null Hypothesis, H0
• Is always stated about a population parameter, not a sample statistic: H0: μ = 3 is valid, H0: X̄ = 3 is not
• Null hypothesis assumes that the statement is true
• Similar to the notion of innocent until proven guilty
The Hypothesis Testing Process
• Suppose the sample mean age was X̄ = 20
• This is significantly lower than the claimed mean
population age of 50.
• If the null hypothesis were true, the probability of getting
such a different sample mean would be very small, so you
reject the null hypothesis.
The Hypothesis Testing Process
[Figure: sampling distribution of X̄ centered at μ = 50, with the observed X̄ = 20 far in the lower tail]
If it is unlikely that you would get a sample mean of this value (X̄ = 20) when in fact μ = 50 were
the population mean (i.e., if H0 is true), then you reject the null hypothesis that μ = 50.
The Test Statistic and Critical Values
[Figure: distribution of the test statistic centered at 0; the critical values separate the regions of rejection in the tails from the region of non-rejection]
These are one-tail tests because rejection region lies only in one tail
Hypothesis Tests for the Mean
[Diagram: hypothesis tests for μ: σ known (Z test) or σ unknown (t test)]
Z-test of Hypothesis for the Mean
(σ Known)
• Convert the sample statistic (X̄) to a ZSTAT test statistic
The test statistic is:
$$Z_{STAT} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
Critical Value Approach to Testing
[Figure: two-tail test on the Z scale, with the lower critical value −Zα/2 and the upper critical value +Zα/2 on either side of 0]
Critical Value Approach: Steps in Hypothesis Testing
$$Z_{STAT} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{3325 - 3173}{1000/\sqrt{180}} = 2.04$$
Assume α = 0.05
P(Z ≥ 2.04) = 1 – 0.9793 = 0.0207 (p-value)
Since p-value < α, we reject H0 and conclude that the mean credit card balance for
undergraduate students has continued to increase.
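The p-value approach used in this credit card example can be sketched in Python. The numbers come from the computation above; treating it as an upper-tail test and the variable names are assumptions made for this illustration.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sigma, n = 3325, 3173, 1000, 180
alpha = 0.05

# Test statistic: how many standard errors the sample mean lies above mu0
z_stat = (x_bar - mu0) / (sigma / sqrt(n))

# Upper-tail p-value: probability of a Z at least this large if H0 is true
p_value = 1 - NormalDist().cdf(z_stat)

reject_h0 = p_value < alpha

print(f"Z_STAT = {z_stat:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```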
Z-test for Population Proportion
This section shows how to conduct a hypothesis test about a population proportion p.
Using p0 to denote the hypothesized value for the population proportion, the three
forms for a hypothesis test about a population proportion are as follows.
H0: p ≥ p0 H0: p ≤ p0 H0: p = p0
Ha: p < p0 Ha: p > p0 Ha: p ≠ p0
The first form is called a lower tail test, the second form is called an upper tail test, and
the third form is called a two-tailed test. Hypothesis tests about a population
proportion are based on the difference between the sample proportion p̄ and the
hypothesized population proportion p0.
The methods used to conduct the hypothesis tests for population proportion are similar
to those used for hypothesis tests about a population mean. The only difference is that
we use the sample proportion and its standard error to compute the test statistic. The
p-value approach or the critical value approach is then used to determine whether the
null hypothesis should be rejected.
Z-statistic and Rejection Criteria
Ex-1: A study showed that 64% of supermarket shoppers believe supermarket brands
to be as good as national/international brands. To investigate whether this result applies
to its own product, the manufacturer of a national ketchup brand asked a sample of 100
shoppers whether they believed that supermarket ketchup was as good as the national
brand ketchup. Out of 100 shoppers, 52 stated that supermarket brand was as good as
national brands. Test the hypotheses to determine whether the percentage of
supermarket shoppers who believe that the supermarket ketchup was as good as the
national brand ketchup differed from 64%. Take α = 0.05.
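A sketch of the ketchup example as a two-tailed Z test for a proportion follows. It assumes the usual test statistic Z = (p̄ − p0)/√(p0(1 − p0)/n), where the standard error is computed from the hypothesized proportion p0; the variable names are chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.64       # hypothesized proportion (H0: p = 0.64, Ha: p != 0.64)
n = 100         # shoppers sampled
agree = 52      # shoppers who rated the supermarket brand as good
alpha = 0.05

p_bar = agree / n
se = sqrt(p0 * (1 - p0) / n)   # standard error under H0
z_stat = (p_bar - p0) / se

# Two-tailed p-value: both tails beyond |z_stat|
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

reject_h0 = p_value < alpha

print(f"Z_STAT = {z_stat:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```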
Two-Population Tests
• Population means, independent samples. Example: Group 1 vs. Group 2
• Population means, related samples. Example: same group before vs. after treatment
• Population proportions. Example: Proportion 1 vs. Proportion 2
• Population variances. Example: Variance 1 vs. Variance 2
Difference Between Two Means:
Independent Samples
• Different data sources
  • Unrelated populations
  • Independent samples
• Sample selected from one population has no effect
on the sample selected from the other population.
• For example, we may want to test whether the mean
starting salary for a population of men and the mean
starting salary for a population of women differ
significantly.
• Conduct a hypothesis test to determine whether any
difference is present between the proportion of
defective parts in a population of parts produced by
supplier A and the proportion of defective parts in a
population of parts produced by supplier B.
[Diagram: population means, independent samples: difference of means using the Z-statistic, or difference of means using the t-statistic]
Hypothesis Tests for Two Population Means
Lower-tail test (rejection area α): Reject H0 if ZSTAT ≤ −Zα
Upper-tail test (rejection area α): Reject H0 if ZSTAT ≥ Zα
Two-tail test (rejection area α/2 in each tail): Reject H0 if ZSTAT ≤ −Zα/2 or ZSTAT ≥ Zα/2
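(1) Hypothesis tests for µ1-µ2 with σ1 and σ2
known and unequal (Z-test)
$$Z_{STAT} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^{2}}{n_1} + \dfrac{\sigma_2^{2}}{n_2}}}$$
If |ZSTAT| < |ZCRT|, then accept H0; if |ZSTAT| > |ZCRT|, then reject H0.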
Example-1
Greystone Department Stores operates two stores in Buffalo, New York: one is in the inner city and the other
is in a suburban shopping center. The Regional Manager noticed that products that sell well in one store do
not always sell well in the other. The manager believes this situation may be attributable to differences in
customer demographics at the two locations. Customers may differ in age, education, income, and so on. The
manager wants to investigate the difference between the mean ages of the customers who shop at the two
stores. From the inner city store, a random sample of 36 customers was taken whose mean age was 40 yrs
and from the suburban store, a random sample of 49 customers was taken whose mean age was 35 yrs. The
population std. deviations of age in the two areas are 9 yrs and 10 yrs respectively. The manager wants to
know whether the mean ages of customers in the two areas differ significantly. Assume α=0.05.
Solution: Example-1
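A minimal Python sketch of the two-sample Z computation for Example-1 (H0: μ1 − μ2 = 0 against Ha: μ1 − μ2 ≠ 0) is given below. The function name `two_sample_z` and the variable names are assumptions made for this illustration; the critical value 1.96 is the two-tail Z value at α = 0.05.

```python
from math import sqrt

def two_sample_z(x1, x2, sigma1, sigma2, n1, n2, d0=0.0):
    """Z statistic and standard error for the difference of two means (sigmas known)."""
    se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return ((x1 - x2) - d0) / se, se

# Inner-city store: n1 = 36, mean age 40, sigma1 = 9
# Suburban store:   n2 = 49, mean age 35, sigma2 = 10
z_stat, se = two_sample_z(40, 35, 9, 10, 36, 49)

z_crit = 1.96                       # two-tail critical value at alpha = 0.05
reject_h0 = abs(z_stat) > z_crit

# The same standard error gives the 95% margin of error for (mu1 - mu2)
margin = z_crit * se

print(f"Z_STAT = {z_stat:.3f}, reject H0: {reject_h0}")
print(f"95% margin of error for the difference = {margin:.2f} years")
```

Under these inputs the statistic exceeds 1.96, so the sketch rejects H0 and indicates that the mean ages at the two stores differ significantly.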
Test at the 5% level of significance whether the mean distances travelled in the two cities differ significantly.
Solution: The hypotheses are
H0: μ1−μ2=0 and Ha: μ1−μ2≠0
ZSTAT = 2.34 (calculated using the formula); ZCRT = 1.96 (at the 5% level of significance)
Rule for acceptance/rejection:
If |ZSTAT| < |ZCRT|, then accept H0; if |ZSTAT| > |ZCRT|, then reject H0.
Here, |ZSTAT| > |ZCRT|; hence we reject the null hypothesis and conclude that the mean distances travelled in
the two cities differ significantly.
Example-3
To compare customer satisfaction levels of two competing cable television companies, 174
customers of Company 1 and 355 customers of Company 2 were randomly selected and were
asked to rate their cable companies on a five-point scale, with 1 being least satisfied and 5
most satisfied. The survey results are summarized in the table. Test at the 1% level of
significance whether the data provide sufficient evidence to conclude that Company 1 has a
higher mean satisfaction rating than does Company 2.
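                          Company 1   Company 2
Sample size               174         355
Sample mean               3.51        3.24
Standard deviation (σ)    0.51        0.52
ZSTAT = 5.684 and ZCRT = 2.33 (upper-tail test at α = 0.01). Since ZSTAT > ZCRT, we reject H0 and conclude that Company 1 has a higher mean satisfaction rating than Company 2.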
Test at the 5% level of significance whether the data provide sufficient evidence to
conclude that more passengers ride the 8:30 train.
Confidence Interval Estimate for (µ1-µ2) with
σ1 and σ2 known
Example: Greystone Department Stores, operates two stores in Buffalo, New York: One is in the inner city and the
other is in a suburban shopping center. The manager wants to investigate the difference between the mean ages of the
customers who shop at the two stores. From the inner city store, a random sample of 36 customers was taken whose
mean age was 40 yrs and from the suburban store, a random sample of 49 customers was taken whose mean age was
35 yrs. The population standard deviations of age in the two areas are 9 yrs and 10 yrs respectively. Find a 95% confidence
interval estimate for the difference of means.
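Solution: x̄1=40, x̄2=35, σ1=9, σ2=10, n1=36, n2=49 and Zα/2 = 1.96
The confidence interval estimate is calculated as:
$$(\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = (40 - 35) \pm 1.96\sqrt{\frac{9^2}{36} + \frac{10^2}{49}} = 5 \pm 4.06$$
Thus, the margin of error is 4.06 years and the
95% confidence interval estimate of the difference between the
two population means is (5−4.06 to 5+4.06) i.e. 0.94 years to 9.06 years.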
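The Greystone interval can be reproduced with the following minimal sketch. The helper name `diff_means_ci_known_sigma` is an assumption made for illustration; the inputs are the values given in the example.

```python
import math
from statistics import NormalDist

def diff_means_ci_known_sigma(mean1, mean2, sigma1, sigma2, n1, n2, confidence=0.95):
    """Confidence interval for mu1 - mu2 when both population sigmas are known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z * math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    diff = mean1 - mean2
    return diff - margin, diff + margin, margin

# Inner-city store (sample 1) versus suburban store (sample 2)
lower, upper, margin = diff_means_ci_known_sigma(40, 35, 9, 10, 36, 49)
print(f"margin = {margin:.2f} yrs, interval = ({lower:.2f}, {upper:.2f}) yrs")
```

The interval excludes zero, which agrees with a two-tailed test at α=0.05 finding a significant difference in mean ages.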
(2) Testing of Hypothesis for difference between
Two Population Proportions
Let p1 denote the proportion for Population 1 and p2 denote the proportion for Population
2. We consider the difference between the two population proportions (p1−p2). To make an
inference about this difference, we select two independent random samples consisting
of n1 units from Population 1 and n2 units from Population 2.
Let us now consider hypothesis tests about the difference between the proportions of two
populations. We focus on tests involving no difference between the two population
proportions. In this case, the three forms for a hypothesis test are:
Example: A tax preparation firm is interested in comparing the quality of work at two of its regional offices. By
randomly selecting samples of tax returns prepared at each office and verifying the accuracy of the sampled returns, the firm
will be able to estimate the proportion of erroneous returns prepared at each office. In Office 1, errors were found in
35 of 250 returns, whereas in Office 2 errors were found in 27 of 300 returns. Find a 90% confidence interval
estimate for the difference of proportions.
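The slides do not show the worked answer for this example, so the sketch below is only one way to carry it out. It assumes the standard large-sample interval (p̄1 − p̄2) ± Zα/2 √(p̄1(1−p̄1)/n1 + p̄2(1−p̄2)/n2); the function name is illustrative.

```python
import math
from statistics import NormalDist

def diff_proportions_ci(x1, n1, x2, n2, confidence=0.90):
    """Large-sample confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.645 for 90%
    margin = z * standard_error
    return (p1 - p2) - margin, (p1 - p2) + margin

# Office 1: 35 erroneous returns out of 250; Office 2: 27 out of 300
lower, upper = diff_proportions_ci(35, 250, 27, 300)
print(f"90% CI for p1 - p2: ({lower:.4f}, {upper:.4f})")
```

Under these assumptions the point estimate is 0.14 − 0.09 = 0.05 and the interval runs from roughly 0.005 to 0.095.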
tSTAT = 2.019 and df = 40 (the calculated value of df is 40.5, but we round it down to 40 to get a larger, more
conservative value of t). Also tCRT = 2.021 (two-tail value at 0.05 level and 40 df).
Since |tSTAT| < |tCRT|, we do not reject the null hypothesis H0 and conclude that there is no significant difference
between the mean yields.
Example-2
Specific Motors of Detroit has developed a new automobile known as the M car. 24 M cars and 28 J
cars (from Japan) were road tested to compare miles-per-gallon (mpg) performance. The sample
statistics are shown below. Can we conclude, using a 0.05 level of significance, that the miles-per-
gallon performance of M cars is greater than the miles-per-gallon performance of J cars?
                     M cars           J cars
Sample size          n1 = 24 cars     n2 = 28 cars
Sample mean          x̄1 = 29.8 mpg   x̄2 = 27.3 mpg
Sample std. dev.     s1 = 2.56 mpg    s2 = 1.81 mpg
Solution: H0: μ1 – μ2 ≤ 0 and Ha: μ1 – μ2 > 0
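$$t_{STAT} = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2}$$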
tSTAT = 4.003 and df = 40 (the calculated value of df is 40.6, but we round it down to 40 to get a larger, more
conservative value of t). Also tCRT = 1.684 (one-tail value at 0.05 level and 40 df).
Since |tSTAT| > |tCRT|, we reject the null hypothesis H0 and conclude that the miles-per-gallon (mpg)
performance of M cars is greater than the miles-per-gallon performance of J cars.
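As a check on the M car figures, here is a minimal sketch of the unequal-variance t statistic and its degrees of freedom. The function name `welch_t` is an assumption for illustration; the critical value is taken from the slide rather than computed.

```python
import math

def welch_t(mean1, mean2, s1, s2, n1, n2, d0=0.0):
    """t statistic and degrees of freedom when sigma1 and sigma2 are unknown and not assumed equal."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # per-sample variance of the mean
    t_stat = (mean1 - mean2 - d0) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t_stat, df

# M cars (sample 1) versus J cars (sample 2)
t_stat, df = welch_t(29.8, 27.3, 2.56, 1.81, 24, 28)
df_floor = math.floor(df)                     # round df down, as the slides do

t_crit = 1.684                                # one-tail value at 0.05 level and 40 df (from the slide)
print(f"tSTAT = {t_stat:.3f}, df = {df:.1f} -> {df_floor}, reject H0: {t_stat > t_crit}")
```

The same function applied to the nurses data in Example-3 should give a tSTAT near −2.41 with df near 87.6.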
Example-3
A study of 40 staff nurses in Tampa and 50 staff nurses in Dallas gives the following results:
                     Tampa              Dallas
Sample size          n1 = 40 nurses     n2 = 50 nurses
Sample mean          x̄1 = $56,100      x̄2 = $59,400
Sample std. dev.     s1 = $6,000        s2 = $7,000
Does this data show enough evidence that the mean salary of Tampa staff nurses is lower than that
of Dallas staff nurses? Test at 5% level of significance.
tSTAT = −2.41 and df = 87 (the calculated value of df is 87.6, but we round it down to 87 to get a larger, more
conservative value of t). Also |tCRT| = 1.663 (one-tail value at 0.05 level and 87 df; for this lower-tail test the critical value is −1.663).
Since |tSTAT| > |tCRT|, we reject the null hypothesis H0 and conclude that the mean salary of Tampa
staff nurses is significantly lower than that of Dallas staff nurses.
Confidence Interval Estimate for (µ1-µ2) with
σ1 and σ2 unknown
Example: Clearwater National Bank is conducting a study designed to identify differences between checking account
practices by customers at two of its branch banks. A simple random sample of 28 checking accounts from the Cherry
Grove Branch showed a mean balance of $1025 with a standard deviation of $150. Similarly, a simple random sample
of 22 checking accounts selected from the Beechmont Branch showed mean balance of $910 and a standard deviation
of $125. Find a 95% interval estimate for the difference of means.
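The slides do not include the worked answer here, so the following is a sketch under stated assumptions: the unequal-variance interval (x̄1 − x̄2) ± tα/2 √(s1²/n1 + s2²/n2) with the df formula shown for the tests above, and a table value t0.025 = 2.012 for 47 df. The table value and the variable names are assumptions, not taken from the slides.

```python
import math

# Cherry Grove (sample 1) and Beechmont (sample 2) summary statistics
n1, mean1, s1 = 28, 1025.0, 150.0
n2, mean2, s2 = 22, 910.0, 125.0

v1, v2 = s1**2 / n1, s2**2 / n2
standard_error = math.sqrt(v1 + v2)

# Degrees of freedom for unequal variances, rounded down
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
df_floor = math.floor(df)

t_crit = 2.012            # ASSUMPTION: t table value for 0.025 upper-tail area and 47 df
margin = t_crit * standard_error
lower, upper = (mean1 - mean2) - margin, (mean1 - mean2) + margin
print(f"df = {df_floor}, margin = {margin:.1f}, 95% CI = (${lower:.1f}, ${upper:.1f})")
```

With these inputs the point estimate is $115 and the interval is roughly $37 to $193.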
Test at the 5% level of significance whether the mean body fat of men and women differs significantly.
Solution: H0: μ1−μ2=0 and H1: μ1−μ2≠0
$$t_{STAT} = \frac{(\bar{X}_1 - \bar{X}_2) - D_0}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \qquad S_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)}$$
Sp2 = 38.88; tSTAT = 2.80; d.f. = n1+n2−2 = (10+13−2) = 21; tCRT = ±2.08 (at 5% l.s. and 21 d.f.)
Here |tSTAT| > |tCRT|, hence we reject H0 and conclude that the body fat for men and women differs significantly.
Confidence interval for (µ1-µ2) with σ1 and σ2
unknown and assumed equal
For population means from independent samples, with σ1 and σ2 unknown and assumed equal,
the confidence interval for μ1–μ2 is:
$$(\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2}\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
Solution: H0: μd = 0 and Ha: μd ≠ 0
District office   UPX   INTEX   Difference (d)
A                 32    25      7
B                 30    24      6
C                 19    15      4
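$$\chi^2_{(1-\alpha/2)} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{\alpha/2}$$
[Figure: chi-square distribution starting at 0, with area α/2 in each tail; 95% of the possible χ² values lie between χ²(1−α/2) and χ²(α/2).]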
Interval Estimate of Population Standard Deviation (σ)
$$\sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2}}} \le \sigma \le \sqrt{\frac{(n-1)s^2}{\chi^2_{(1-\alpha/2)}}}$$
Interval Estimation of σ2
Example-1: Buyer’s Digest
Buyer’s Digest rates thermostats manufactured for home
temperature control. In a recent test, 10 thermostats manufactured
by ThermoRite were selected and placed in a test room that was
maintained at a temperature of 68°F. The temperature readings of
the ten thermostats are shown below:
Thermostat 1 2 3 4 5 6 7 8 9 10
Temperature 67.4 67.8 68.2 69.3 69.5 67.0 68.1 68.6 67.9 67.2
Example-3: Consumer Reports uses a 100-point customer satisfaction score to rate the nation’s
major chain stores. Assume that from past experience with the satisfaction rating score, a
population standard deviation of σ=12 is expected. In 2012, Costco, with its 432 warehouses in 40
states, was the only chain store to earn an outstanding rating for overall quality (Consumer
Reports, March 2012). A sample of 15 Costco customer satisfaction scores are: 95, 90, 83, 75, 95,
98, 80, 83, 82, 93, 86, 80, 94, 64 and 62.
(a) What is the sample mean customer satisfaction score for Costco? (84)
(b) What is the sample variance? (118.71)
(c) What is the sample standard deviation? (10.90)
(d) Construct a 95% confidence interval estimate for population variance and standard deviation.
(63.63, 295.25) and (7.98, 17.18)
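The Costco answers can be checked with the sketch below. The two chi-square values for 14 degrees of freedom (26.119 and 5.629) are standard table values supplied here as assumptions; they are not stated on the slides.

```python
import math
from statistics import mean, variance

scores = [95, 90, 83, 75, 95, 98, 80, 83, 82, 93, 86, 80, 94, 64, 62]
n = len(scores)

sample_mean = mean(scores)
sample_var = variance(scores)        # sample variance with n - 1 in the denominator
sample_sd = math.sqrt(sample_var)

# ASSUMPTION: chi-square table values for n - 1 = 14 df
chi2_upper = 26.119                  # upper-tail area 0.025
chi2_lower = 5.629                   # upper-tail area 0.975

# 95% interval for the variance, then square roots for the standard deviation
var_low = (n - 1) * sample_var / chi2_upper
var_high = (n - 1) * sample_var / chi2_lower
sd_low, sd_high = math.sqrt(var_low), math.sqrt(var_high)

print(f"mean = {sample_mean}, s^2 = {sample_var:.2f}, s = {sample_sd:.2f}")
print(f"variance CI = ({var_low:.2f}, {var_high:.2f}), sd CI = ({sd_low:.2f}, {sd_high:.2f})")
```

Small differences in the last decimal place can arise because the slide answers use the rounded variance 118.71.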
Rejection Criteria for χ² Test about a Population Variance
Hypothesis Testing for σ2
Example: Buyer’s Digest
Buyer’s Digest rates thermostats manufactured for home temperature
control. In a recent test, 10 thermostats manufactured by ThermoRite
were selected and placed in a test room that was maintained at a
temperature of 68°F. The temperature readings of the ten thermostats
are shown below:
Thermostat 1 2 3 4 5 6 7 8 9 10
Temperature 67.4 67.8 68.2 69.3 69.5 67.0 68.1 68.6 67.9 67.2
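$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = 12.6$$
[Figure: chi-square distribution with upper-tail rejection area α = 0.10 and critical value 14.684. The test statistic 12.6 falls in the "Do not Reject H0" region, to the left of the "Reject H0" region.]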
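The test statistic of 12.6 can be reproduced from the ten readings with the sketch below. The hypothesized variance of 0.5 does not appear in this part of the slides; it is an assumption inferred from the reported statistic, since 9 × 0.70 / 0.5 = 12.6.

```python
from statistics import variance

temps = [67.4, 67.8, 68.2, 69.3, 69.5, 67.0, 68.1, 68.6, 67.9, 67.2]
n = len(temps)
s2 = variance(temps)                 # sample variance, about 0.70

sigma0_sq = 0.5                      # ASSUMPTION: hypothesized variance implied by the reported 12.6
chi2_stat = (n - 1) * s2 / sigma0_sq

chi2_crit = 14.684                   # upper-tail critical value from the slide (alpha = 0.10, 9 df)
reject_h0 = chi2_stat > chi2_crit
print(f"s^2 = {s2:.2f}, chi-square = {chi2_stat:.1f}, reject H0: {reject_h0}")
```

Since 12.6 is below 14.684, the sketch reaches the same "do not reject H0" decision as the figure.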
Inference: p-value Approach
Each home sold by Finger Lakes Homes can be classified according to price and to
style. Finger Lakes’ manager would like to determine if the price of the home and the
style of the home are independent variables. The number of homes sold for each
model and price for the past two years is shown below. For convenience, the price of
the home is listed as either $99,000 or less or more than $99,000.
                   Style of Home
Price          Colonial   Log   Split-level   A-frame
≤ $99,000      18         6     19            12
> $99,000      12         14    16            3
Test the hypothesis whether price and style of the home are independent of each
other. Consider 0.05 level of significance.
Solution: Finger Lakes Homes
1. Hypotheses formulation
H0: Price of the home is independent of the style of the home that is purchased
Ha: Price of the home is not independent of the style of the home that is purchased
2. The sample of size 100 has been selected and observed frequencies are recorded.
3. Calculation of Expected Frequencies:
                   Style of Home
Price          Colonial             Log                Split-level           A-frame              Row total
≤ $99,000      (55*30/100) = 16.5   (55*20/100) = 11   (55*35/100) = 19.25   (55*15/100) = 8.25   55
> $99,000      (45*30/100) = 13.5   (45*20/100) = 9    (45*35/100) = 15.75   (45*15/100) = 6.75   45
Column total   30                   20                 35                    15                   Sample size = 100
4. Calculation of Test statistic 𝝌𝟐
fij eij (fij−eij) (fij−eij)² (fij−eij)²/eij
18 16.5 1.5 2.25 2.25/16.5=0.136
6 11 -5 25 25/11=2.272
19 19.25 -0.25 0.0625 0.0625/19.25=0.003
12 8.25 3.75 14.0625 14.0625/8.25=1.705
12 13.5 -1.5 2.25 2.25/13.5=0.167
14 9 5 25 25/9=2.778
16 15.75 0.25 0.0625 0.0625/15.75=0.004
3 6.75 -3.75 14.0625 14.0625/6.75=2.083
TOTAL (𝝌𝟐 value) 𝝌𝟐 = 9.148
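Steps 3 and 4 can be sketched in a few lines. The critical value 7.815 (α = 0.05, 3 degrees of freedom) is a standard table value supplied here as an assumption, since the slides do not state it at this point.

```python
# Observed frequencies: rows are price bands, columns are Colonial, Log, Split-level, A-frame
observed = [
    [18, 6, 19, 12],    # $99,000 or less
    [12, 14, 16, 3],    # more than $99,000
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: (row total * column total) / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

chi2_crit = 7.815       # ASSUMPTION: chi-square table value for alpha = 0.05 and 3 df
print(f"chi-square = {chi2:.3f}, df = {df}, reject H0: {chi2 > chi2_crit}")
```

The statistic comes out at about 9.15, in line with the table total up to rounding of the individual terms. It exceeds the assumed critical value, so price and style would not be treated as independent.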
Rejection Rule
p-value Approach: Reject H0 if p-value < α
Hypotheses
H0: μ1 = μ2 = μ3
Ha: Not all the means are equal
where:
μ1 = mean number of washes using Type 1 wax
μ2 = mean number of washes using Type 2 wax
μ3 = mean number of washes using Type 3 wax
1. Mean Square Between Treatments (MSTR) calculation:
Because the sample sizes are all equal:
$$\bar{\bar{x}} = (\bar{x}_1 + \bar{x}_2 + \bar{x}_3)/3 = (29 + 30.4 + 30)/3 = 29.8$$
Total 38.4 14
Rejection Rule
Total 798 14
p –Value Approach
Test Statistic
$$t = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{MSE\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}}$$
Rejection Rule
p-value Approach:
Reject H0 if p-value < α
➢ Sign Test
➢ Wilcoxon Signed Rank Test
➢ Kruskal-Wallis Test
Nonparametric Methods
❑ Most of the statistical methods referred to as parametric require the use of
interval- or ratio-scaled data.
❑ Whenever the data are quantitative, we will transform the data into
categorical data in order to conduct the nonparametric test.
Sign Test
• The sign test is a versatile method for hypothesis testing
that uses the binomial distribution with p=0.50 as the
sampling distribution.
• There are two applications of the sign test:
✓ A hypothesis test about a population median
✓ A matched-sample test about the difference between two
populations
Hypothesis Test about a Population Median
▪ The assigning of the plus and minus signs makes the situation
into a binomial distribution application.
▪ The sample size is the number of trials.
▪ There are two outcomes possible per trial, a plus sign or a
minus sign.
▪ The trials are independent.
▪ We let p denote the probability of a plus sign.
▪ If the population median is in fact a particular value, p should
equal 0.5.
Hypothesis Test about a Population Median:
Small-Sample Case
The small-sample case for this sign test should be
used whenever n < 20.
The hypotheses are:
𝐻0 : 𝑝 = .50
The population median equals the
value assumed.
𝐻a : 𝑝 ≠ .50
The population median is different
than the value assumed.
The number of plus signs is our test statistic.
Assuming H0 is true, the sampling distribution for
the test statistic is a binomial distribution with p = 0.5
H0 is rejected if the p-value < level of significance.
Hypothesis Test about a Population Median:
Smaller Sample Size
Example: Potato Chip Sales
Lawler’s Grocery Store made the decision to carry Cape May Potato Chips based
on the manufacturer’s estimate that the median sales should be $450 per week
on a per-store basis.
Lawler’s has been carrying the potato chips for three months. Data showing one-
week sales at 10 randomly selected Lawler’s stores are shown below:
H0: p = .50
Ha: p ≠ .50
Example: Potato Chip Sales
Number of Number of
Plus Signs Probability Plus Signs Probability
0 .0010 6 .2051
1 .0098 7 .1172
2 .0439 8 .0439
3 .1172 9 .0098
4 .2051 10 .0010
5 .2461
Example: Potato Chip Sales
Since observed number of plus signs is 7, we begin
by computing the probability of obtaining 7 or more
plus signs.
The probability of 7, 8, 9, or 10 plus signs is:
.1172 + .0439 + .0098 + .0010 = .1719.
We are using a two-tailed hypothesis test, so:
p-value = 2(.1719) = .3438.
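The binomial probabilities and the two-tailed p-value above can be reproduced with this minimal sketch. It assumes n = 10 trials (no sales figure exactly equal to $450) and 7 plus signs, as stated in the example; the function name is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k plus signs in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_trials, plus_signs = 10, 7

# Probability of 7, 8, 9 or 10 plus signs when p = 0.5
upper_tail = sum(binomial_pmf(k, n_trials) for k in range(plus_signs, n_trials + 1))

# Two-tailed test: double the tail area, capped at 1
p_value = min(1.0, 2 * upper_tail)
print(f"P(X >= {plus_signs}) = {upper_tail:.4f}, p-value = {p_value:.4f}")
```

A p-value of about 0.34 is well above common significance levels, so this sample gives no reason to doubt a median of $450.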
Conclusion:
Do not reject H0. The p-value for this two-tail test is
0.0548. There is insufficient evidence in the sample
to conclude that the median age is not 34 for female
members of Trim Fitness Center.
Wilcoxon Signed-Rank Test
Sampling Distribution of T +
for the Wilcoxon Signed-Rank Test
$$\text{Mean: } \mu_{T^+} = \frac{n(n+1)}{4}$$
$$\text{Standard Deviation: } \sigma_{T^+} = \sqrt{\frac{n(n+1)(2n+1)}{24}}$$
District office   UPX   INTEX
Seattle           32    25
Los Angeles       30    24
Boston            19    15
Cleveland         16    15
New York          15    13
Houston           18    15
Atlanta           14    15
St. Louis         10    8
Milwaukee         7     9
Denver            16    11
Wilcoxon Signed-Rank Test
Hypotheses:
$$P(T^+ \ge 49.5) = P\left(z \ge \frac{49 - 27.5}{9.81}\right) = P(z \ge 2.19)$$
p-value:
p-value = 2(1.0000 - 0.9857) = 0.0286
Wilcoxon Signed-Rank Test
Rejection Rule:
Using 0.05 level of significance,
Reject H0 if p-value < .05
Conclusion:
Reject H0. The p-value for this two-tail test is 0.0286.
There is sufficient evidence in the sample to
conclude that a difference exists in the median
delivery times provided by the two services.
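The Wilcoxon computation can be traced with the sketch below, using the ten district-office pairs listed earlier. It assumes the first column is UPX and the second INTEX, averages ranks for tied absolute differences, and applies the same 0.5 continuity correction as the slide. Because it does not round z to 2.19 first, its p-value differs slightly in the fourth decimal.

```python
import math
from statistics import NormalDist

upx = [32, 30, 19, 16, 15, 18, 14, 10, 7, 16]
intex = [25, 24, 15, 15, 13, 15, 15, 8, 9, 11]

# Paired differences; zero differences (none here) would be discarded
diffs = [a - b for a, b in zip(upx, intex) if a != b]
n = len(diffs)

ordered = sorted(abs(d) for d in diffs)

def avg_rank(value):
    """Average rank of an absolute difference, sharing ranks among ties."""
    first = ordered.index(value) + 1
    count = ordered.count(value)
    return first + (count - 1) / 2

# T+ is the sum of the ranks belonging to positive differences
t_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)

mu = n * (n + 1) / 4
sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)

# Continuity correction for the upper tail (valid here because T+ is above its mean)
z = (t_plus - 0.5 - mu) / sigma
p_value = 2 * (1 - NormalDist().cdf(z))
print(f"T+ = {t_plus}, mean = {mu}, sd = {sigma:.2f}, z = {z:.2f}, p-value = {p_value:.4f}")
```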
Kruskal-Wallis Test
The Kruskal-Wallis test is used to compare three or more
populations.
where:
k = number of populations
ni = number of observations in sample i
nT = Σni = total number of observations in all samples
Ri = sum of the ranks for sample i
Kruskal-Wallis Test
When the populations are identical, the sampling
distribution of the test statistic H can be approximated
by a chi-square distribution with k–1 degrees of freedom.
This approximation is acceptable if each of the sample
sizes ni is > 5.
Rejection Rule:
Using test statistic: Reject H0 if χ2 > 4.605 (2 d.f.)
Using p-value: Reject H0 if p-value < 0.10
Kruskal-Wallis Test Statistic:
k = 3 populations, n1 = 6, n2 = 7, n3 = 7, nT = 20
$$H = \left[\frac{12}{n_T(n_T + 1)}\sum_{i=1}^{k}\frac{R_i^2}{n_i}\right] - 3(n_T + 1)$$
Conclusion:
Do not reject H0. There is insufficient evidence to
conclude that the populations are not identical.
(H = 0.3532 < 4.60517)
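The raw observations behind H = 0.3532 are not reproduced in this section, so the sketch below only illustrates the mechanics of the H formula on a small hypothetical data set. The function name and the toy samples are assumptions, and no tie correction is applied.

```python
def kruskal_wallis_h(samples):
    """Kruskal-Wallis H statistic from raw samples (average ranks for ties, no tie correction)."""
    pooled = sorted(value for sample in samples for value in sample)

    def avg_rank(value):
        first = pooled.index(value) + 1
        return first + (pooled.count(value) - 1) / 2

    n_total = len(pooled)
    # Sum over samples of (rank sum squared) / (sample size)
    rank_term = sum(
        sum(avg_rank(value) for value in sample) ** 2 / len(sample)
        for sample in samples
    )
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

# HYPOTHETICAL toy data, chosen so the rank sums are easy to verify by hand:
# rank sums are 6, 15 and 24 with n_T = 9
toy_samples = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(toy_samples)
print(f"H = {h:.2f}")
```

For the toy data, H = (12/90)(12 + 75 + 192) − 30 = 7.2. In practice H is compared with a chi-square critical value with k−1 degrees of freedom, as in the rejection rule above.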