BA1 - PPTs Merged

Uploaded by

shrayan189

Topic-1

Introduction to Business Analytics


Business Analytics

Discussion topics:
• What is Business Analytics?
• Evolution of Business Analytics
• Framework of Business Analytics
• Scope of Business Analytics

Dimensions of Statistics

Statistics
• Statistical Data
  - Numerical facts
  - Comparable
• Statistical Methods
  - Formulae and models
  - Descriptive and inferential


Statistical Data:
*Statistics is an aggregate of facts; a single numerical figure cannot be termed statistics.
*Numerical information that is relevant to fulfilling the objectives of a study is considered
statistical data.
*All statistical data are expressed in numbers; even non-numeric facts
are converted into numbers so that they can be treated as statistical data.
*Since the accuracy of statistical data is essential to achieving the study objectives, data
should be collected in a systematic manner.
Statistical Methods:
*These are mathematical formulae, models and techniques that are used in statistical
analysis of raw research data.
*The application of statistical methods extracts information from research data and
provides different ways to assess the robustness of research outputs.
*Statistical methods can be generally divided into two categories – descriptive statistics
and statistical inference.
Statistics Vocabulary
Data: Data are facts and figures collected, analyzed and summarized for presentation and
interpretation, e.g., GDP of a country, number of employees in a company, sales in past 12
months etc.
Population: A population is the entire group of subjects, defined by one or more shared
characteristics, about which statisticians want to draw conclusions in a study. A population can be
vague or specific, e.g., graduates in Hyderabad, IT companies in
Hyderabad.
Sample: A sample refers to a smaller, manageable version of a larger group. It is a subset
containing the characteristics of a population. Samples are used in statistical testing when
population sizes are too large for the test to include all possible members or observations. A
sample should represent the population as a whole and not reflect any bias toward a
specific attribute.
Frequency: The total number of occurrences of any observation or event is termed as
frequency. The cumulative frequency is the total of the absolute frequencies of all events at
or below a certain point in an ordered list of events. The relative frequency of an event is
the absolute frequency normalized by the total number of events.
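The three kinds of frequency defined above can be sketched in a few lines of Python. The observations are hypothetical and only serve to illustrate the definitions.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical observations: number of items bought per customer visit.
observations = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

counts = Counter(observations)      # occurrences of each distinct value
values = sorted(counts)             # the ordered list of events
total = len(observations)           # total number of events

abs_freq = [counts[v] for v in values]       # absolute frequency
rel_freq = [f / total for f in abs_freq]     # absolute frequency / total
cum_freq = list(accumulate(abs_freq))        # total at or below each value

for v, a, r, c in zip(values, abs_freq, rel_freq, cum_freq):
    print(f"value={v}  absolute={a}  relative={r:.2f}  cumulative={c}")
```

The relative frequencies always add up to 1, and the last cumulative frequency equals the total number of observations.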
Descriptive Statistics: Descriptive statistics summarize a given data set, which can
represent either an entire population or a sample of it. Descriptive statistics are
broken down into data representation, measures of central tendency and measures of
variability.
Inferential Statistics: Inferential statistics are techniques that use the samples to make
generalizations about the population(s) from which the samples were drawn. The methods of
inferential statistics are estimation of parameters and testing of hypothesis.
Predictive Statistics: Predictive statistics deals with extracting information from data and using
it to predict trends and behavior patterns.
Prescriptive Statistics: Prescriptive statistics gathers data from both descriptive and predictive
sources for its models and applies them to the process of decision-making. This includes
combining existing conditions and considering the consequences of each decision to
determine how the future would be impacted.
Business Analytics: An Idea

a) You apply for a credit card for the first time. How does the bank
assess your creditworthiness?
b) How does Amazon or Flipkart know which books and other products
to recommend to you when you log in to their website?
c) How do airlines determine what price to quote to you when you are
buying a plane ticket?
d) How do Uber and OLA determine their prices?
e) How do insurance companies determine the risk on a person and
decide premium?
a) Even though you are applying for a credit card for the first time, millions of people have
also applied. Many of them have paid back the amount spent on time, many of them
have deferred the payment and some of them have been defaulters. The bank wants
to know which category you belong to by comparing your profile with those of similar card
holders.
b) Similarly, Amazon or Flipkart has access to millions of previous purchases made by
customers on its website. They examine your previous purchases, the products you
have viewed, and any product recommendations you have provided. From their huge
database of customers with similar profile, they create recommendations for you.
c) The price quoted to you for a flight between New Delhi and Hyderabad today could
be very different from the price quoted tomorrow. These changes happen because
airlines use a variable pricing strategy. It works by examining vast amounts of data on
past purchases and using these data to forecast future purchases. These forecasts are
then fed into sophisticated optimization algorithms that determine the optimal price
to charge for a particular flight and when to change that price.
What is Business Analytics?

Business Analytics is the use of:


• data,
• information technology,
• statistical analysis,
• quantitative methods, and
• mathematical or computer-based models

to gain improved insight about business operations


and make better, fact-based decisions.

Support for Decision Making

• Uncertain economics
• Rapidly changing environments
• Global competition
• Demanding customers
• Taking advantage of information acquired by companies is a Critical
Success Factor.
Business Analytics Defined
• Business analytics is the scientific process of transforming data into
insight for making better decisions.
• Business analytics is used for data-driven or fact-based decision
making, which is often seen as more precise than other alternatives
for decision making.
• The tools of business analytics can aid decision making by creating
insights from data, by improving our ability to more accurately
forecast for planning, by helping us quantify risk, and by yielding
better alternatives through analysis and optimization.
Business Analytics Vs. Business Intelligence
• Business Intelligence (BI) is a set of methodologies, processes, architectures, and
technologies that leverage the output of information management processes for
analysis, reporting, performance management, and information delivery.
• Business Analytics is the process of examining data sets in order to draw
conclusions about the information they contain, increasingly with the aid of
specialized systems and software.
• Business Intelligence deals with the present, while business analytics is more
focused on the future.
• A focus of business intelligence is to take data and use it for better decision making.
Through the use of aggregation, visualization, and careful analysis, companies can
use BI to achieve better efficiency in how the organization is operating now.
• Business analytics, on the other hand, places emphasis on the future. Data
analytics engages in data mining, essentially analyzing a set of information to pick
out patterns and predict future trends that can inform organizations as to what
they should do.
Business Analytics Vs. Data Science
• Data Science is the field that uses statistics, trends, algorithms, and
technology to understand data and segregate it into different aspects that make
sense.
• The main contribution of data science in business and management is to provide
actionable insights from a wide range of data that are either already segregated or need
to be mined, bringing out facts about business operations, customer trends,
and behavior in a bite-sized format.
• On the other hand, Business Analytics is a statistical study of
segregated/structured data. Business Analytics provides solutions to overcome
hurdles and improve business performance.
• Because these two terms are often used interchangeably, there is a chance that a
business analytics problem could be wrongly approached with a Data Science
solution. Using the wrong set of tools to solve a business analytics problem could be
counterproductive and bring undesirable results.
Importance of Business Analytics

• There is a strong relationship of BA with


• revenue of businesses
• profitability of business
• shareholder return
• BA enhances understanding of data
• BA is vital for businesses to remain competitive
• BA enables creation of informative reports

Evolution of Business Analytics
BA in the 1800s: The need to stay ahead
An early account of using data to stay ahead of competitors dates back to 1865. Sir Henry Furnese, a
banker, was always one step ahead by actively gathering information and acting on it before
any of his competitors. This shows that professionals such as Sir Henry Furnese relied more
on data and empirical evidence than on gut instinct.

BA in the late 1800s: The Advent of Scientific Management


During this time, Frederick Taylor introduced the first-ever system of business analytics in the
United States of America, and he called it scientific management. The purpose of this system
was to analyze the production techniques and laborers' activities to identify greater
efficiencies.

BA in the early 1900s: The Transformation of Manufacturing Industry


Frederick Taylor's scientific management system inspired Henry Ford, who hired Taylor as his
consultant. Ford was willing to measure the time each component of his Ford Model T took to
complete on his assembly line. This analysis transformed his work and the manufacturing
industry across the globe.
BA in the 1950s: The first hard disk drive by IBM
Computers were not easily accessible in the early 1900s but were in massive demand during World War II. As they were
still rudimentary, punch cards or tapes were used to store information. However, in 1956, the tech giant IBM invented
the first hard disk drive. This allowed users to save a vast amount of data with better flexibility.

BA in the late 1900s: The Emergence of Business Intelligence


Owing to the lower prices for storage space and better databases, the next generation of business intelligence solutions
was all set to step in. By now, there was a considerable amount of data available but not a centralized place to store it.
To address this problem, Ralph Kimball and Bill Inmon proposed similar strategies to build data warehouses (DW).

BA in the new Millennium: Availability of different analytical solutions


By this time, medium and large-sized businesses had already realized the value of business intelligence solutions.
Companies such as IBM, Microsoft, SAP, and Oracle were at the forefront of offering such solutions to change the way
businesses function.

BA in 2005: Accessibility of Data for the Common People


Considering the extensive usage of data, companies started directing their efforts on improving the speed at which the
information was available. New business analytics tools were introduced to ensure technical as well as non-technical
people were able to mine the data and gain insights.
Around this time, the increasing interconnectivity of the business world led to the need for real-time information. This
was when Google Analytics was introduced. Google wanted to provide a free and accessible way for users to analyze
their website data.
BA from 2005 till date: The Bread and Butter for Companies globally
With the internet available to almost everyone and the increasing data, companies
needed better solutions to store and analyze all the information. Building computers
with more storage capacity and better speed was not possible for many, so companies
resorted to using several machines at the same time. This was the beginning of cloud
computing.
Over the last decade, big data, cloud computing, and business analytics have become
integral for almost all companies. The new advancements have made these
technologies even better. Now, data analytics and data science are known to be the future.
From advertising and marketing to recruiting and planning operational activities, these
terms are tossed around in every field.
Categorization of Business Analytics
1. Descriptive Analytics:
Descriptive analytics encompasses the set of techniques that describes what has
happened in the past. Examples are data queries, reports, descriptive statistics, data
visualization including data dashboards, some data-mining techniques, and basic
what-if spreadsheet models.
2. Predictive Analytics:
Predictive analytics consists of techniques that use models constructed from
past data to predict the future or ascertain the impact of one variable on
another.
➢ For example, past data on product sales may be used to construct a
mathematical model to predict future sales, which can factor in the product’s
growth trajectory and seasonality based on past patterns.
➢ A packaged food manufacturer may use point-of-sale scanner data from retail
outlets to help in estimating the lift in unit sales due to coupons or sales
events. Survey data and past purchase behavior may be used to help predict
the market share of a new product.
➢ Linear regression, time series analysis, some data-mining techniques, and
simulation, often referred to as risk analysis, all fall under the banner of
predictive analytics.
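As a minimal sketch of the first example above, a straight-line trend can be fitted to past sales by ordinary least squares and then used to predict a future year. The figures are the five-year sales series used in this deck's time series example; the choice of a simple linear trend (no seasonality) and the function name predict are assumptions made for illustration.

```python
# Yearly sales in million $ (the time series figures used in this deck).
years = [2012, 2013, 2014, 2015, 2016]
sales = [12, 14, 17, 15, 18]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n

# Ordinary least squares for a straight line: sales = intercept + slope * year
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales))
sxx = sum((x - mean_x) ** 2 for x in years)
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def predict(year):
    """Predicted sales (million $) for a given year under the fitted line."""
    return intercept + slope * year


forecast_2017 = predict(2017)
print(f"slope = {slope:.2f} million $ per year")
print(f"forecast for 2017 = {forecast_2017:.1f} million $")
```

A real forecasting model would also check how well the line fits and would usually account for seasonality; this sketch only shows the basic idea of building a model from past data.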
3. Prescriptive Analytics:
Prescriptive analytics differs from descriptive or predictive analytics in that prescriptive
analytics indicates a best course of action to take; that is, the output of a prescriptive model
is a best decision. The airline industry’s use of revenue management is an example of
prescriptive analytics. Airlines use past purchasing data as inputs into a model that
recommends the best pricing strategy across all flights for maximizing revenue.
Prescriptive analytics uses tools such as statistical process control, optimization
modelling, decision analysis etc.
Big Data Analytics
Big data analytics examines large amounts of data to uncover hidden patterns,
correlations and other insights. With today’s technology, it’s possible to analyze your
data and get answers from it almost immediately – an effort that’s slower and less
efficient with more traditional business intelligence solutions.
Cloud Analytics
Cloud analytics is the process of storing and analyzing data in the cloud and using it
to extract actionable business insights. Similar to on-premises data analytics, cloud
analytics algorithms are applied to large data collections to identify patterns, predict
future outcomes and produce other information useful to business decision makers.
Business Analytics in Practice
Financial Analytics
• Predictive models are used to forecast future financial performance, to assess the risk of
investment portfolios and projects, and to construct financial instruments such as derivatives.
• Prescriptive models are used to construct optimal portfolios of investments, to allocate
assets, and to create optimal capital budgeting plans.

Human Resource Analytics


• A relatively new area of application for analytics is the management of an organization’s
human resources.
• The HR function is charged with ensuring that the organization has the mix of skill sets
necessary to meet its needs.
• It is responsible for hiring quality talent and providing an environment that retains it.
• The team uses descriptive and predictive analytics to support employee hiring and to track
and influence retention.
Marketing Analytics
• Marketing is one of the fastest growing areas for the application of analytics.
• A better understanding of consumer behavior through the use of data generated from social
media has led to an increased interest in marketing analytics.
• As a result, descriptive, predictive, and prescriptive analytics are all heavily used in
marketing.
• A better understanding of consumer behavior through analytics leads to the better use of
advertising budgets, more effective pricing strategies, improved forecasting of demand,
improved product line management, and increased customer satisfaction and loyalty.

Healthcare Analytics
• The use of analytics in health care is on the increase because of pressure to simultaneously
control cost and provide more effective treatment.
• Descriptive, predictive, and prescriptive analytics are used to improve patient, staff, and
facility scheduling; patient flow; purchasing; and inventory control.
Supply Chain Analytics
• One of the earliest applications of analytics was in logistics and supply chain management.
• The core service of companies such as UPS and FedEx is the efficient delivery of goods and
analytics has long been used to achieve efficiency.
• The optimal sorting of goods, vehicle and staff scheduling, and vehicle routing are all key to
profitability for logistics companies such as UPS, FedEx, and others like them.

Web Analytics
• Web analytics is the analysis of online activity, which includes, but is not limited to, visits to
websites and social media sites such as Facebook and LinkedIn.
• Web analytics obviously has huge implications for promoting and selling products and
services via the Internet.
• Leading companies apply descriptive and advanced analytics to data collected in online
experiments to determine the best way to configure websites, position ads, and utilize
social networks for the promotion of products and services.
Example-1: Retail Markdown Decisions

• Most department stores clear seasonal inventory by reducing prices.
• The question is: When to reduce the price and by how
much?
• Descriptive analytics: examine historical data for
similar products (prices, units sold, advertising,…)
• Predictive analytics: predict sales based on price
• Prescriptive analytics: find the best sets of pricing and
advertising to maximize sales revenue
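The three steps of the markdown decision can be sketched as follows. All numbers are hypothetical, advertising is left out for brevity, and demand is assumed to be a straight-line function of price; none of this comes from the original example.

```python
# Hypothetical history for similar products: (price, units sold per week).
history = [(100, 20), (90, 30), (80, 40), (70, 50)]

# Descriptive: summarize what happened in the past.
avg_price = sum(p for p, _ in history) / len(history)
avg_units = sum(u for _, u in history) / len(history)

# Predictive: fit units = a + b * price by least squares.
sxy = sum((p - avg_price) * (u - avg_units) for p, u in history)
sxx = sum((p - avg_price) ** 2 for p, _ in history)
b = sxy / sxx
a = avg_units - b * avg_price


def predicted_units(price):
    """Units expected to sell at a given price (never below zero)."""
    return max(0.0, a + b * price)


# Prescriptive: choose the candidate price with the highest predicted revenue.
candidates = [100, 90, 80, 70, 60, 50]
best_price = max(candidates, key=lambda p: p * predicted_units(p))
best_revenue = best_price * predicted_units(best_price)

print(f"average past price = {avg_price}, average units = {avg_units}")
print(f"recommended price = {best_price}, predicted revenue = {best_revenue}")
```

In practice the prescriptive step would search over price and advertising together and would use an optimization routine instead of a short list of candidate prices.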

Analytics in Practice: Harrah’s Entertainment

• Harrah’s owns numerous hotels and casinos


• Uses analytics to:
➢ forecast demand for rooms
➢ segment customers by gaming activities
• Uses prescriptive models to:
➢ set room rates
➢ allocate rooms
➢ offer perks and rewards to customers

Topic-2
Exploring Data
Understanding Data
Data and data set:
• Data are facts and figures collected, analysed and summarized for presentation and
interpretation.
• Facts are the truths which could be numeric or non-numeric in nature and figures are
information which are numeric.
• In a more technical sense, data are a set of values of qualitative/categorical or
quantitative nature pertaining to one or more individuals or objects.
• Qualitative data are descriptive information (about colour of an object, taste of food,
religion, education, ethnicity etc.) while quantitative data are numerical information
(about marks obtained, age, height, no. of employees, interest rate etc.).
• Quantitative data can be discrete (information relating to no. of households in a
society, no. of IPL teams, no. of warehouses etc.) or continuous (information relating
to height, weight, speed, sales figures, growth rate etc.).
• All the data collected for a particular study are referred to as data set for the study.
Elements, variables and observations:
• Elements are the entities on which data are collected, e.g., individuals,
objects, nations, companies etc.
• Variable is a characteristic of interest for the elements, e.g., height of
individual, dimension of object, GDP of nation, sales figure of company.
• The set of measurements or values obtained for each element and
concerned variable are called observations.
Vehicle            Size       Cylinders   Mileage per gallon   Fuel
Audi A8            Large      12          13.62                Petrol
BMW 328Xi          Compact    6           17.81                Diesel
Ford Focus         Compact    4           19.35                Petrol
Toyota Camry       Mid size   4           15.45                Diesel
Volkswagen Jetta   Compact    5           18.76                Petrol
Hyundai Elantra    Mid size   4           15.90                Petrol
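To connect the vocabulary with the vehicle table, the short sketch below stores the table as a data set and picks out its elements, variables and one observation. The classification rule (numbers are quantitative, text is qualitative) is a simplifying assumption for this table only.

```python
# Variables (columns) and elements (rows) transcribed from the vehicle table.
variables = ["Size", "Cylinders", "Mileage per gallon", "Fuel"]
data_set = {
    "Audi A8":          ("Large",    12, 13.62, "Petrol"),
    "BMW 328Xi":        ("Compact",   6, 17.81, "Diesel"),
    "Ford Focus":       ("Compact",   4, 19.35, "Petrol"),
    "Toyota Camry":     ("Mid size",  4, 15.45, "Diesel"),
    "Volkswagen Jetta": ("Compact",   5, 18.76, "Petrol"),
    "Hyundai Elantra":  ("Mid size",  4, 15.90, "Petrol"),
}

elements = list(data_set)                        # entities the data describe
n_data_values = len(elements) * len(variables)   # one value per element and variable

# The observation for one element is its full set of measurements.
observation = dict(zip(variables, data_set["Ford Focus"]))

# Classify each variable by the type of value it holds (assumed rule).
first_row = next(iter(data_set.values()))
kind = {
    name: "quantitative" if isinstance(value, (int, float)) else "qualitative"
    for name, value in zip(variables, first_row)
}

print(len(elements), "elements,", len(variables), "variables,", n_data_values, "data values")
print(observation)
print(kind)
```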
Practice Exercise
Types of data
Primary and secondary data:
• Primary data are collected and used for the first time as first-hand
information.
• This information can be collected by surveys, interviews or
observations.
• Customer satisfaction surveys, interview of scientists, health observation of
patients etc. are some examples of primary data.
• Secondary data are second hand information which already exist in
published or unpublished forms.
• This information can be obtained from journals, magazines, reports,
websites etc.
• Financial reports, population census, CMIE reports, ProwessIQ database,
IMF reports etc. are some examples of secondary data.
SERVICE SURVEY Excellent Good Average Fair Poor
Overall Experience ❑ ❑ ❑ ❑ ❑
Greeting by Hostess ❑ ❑ ❑ ❑ ❑
Manager (Table Visit) ❑ ❑ ❑ ❑ ❑
Overall Service ❑ ❑ ❑ ❑ ❑
Professionalism ❑ ❑ ❑ ❑ ❑
Menu Knowledge ❑ ❑ ❑ ❑ ❑
Friendliness ❑ ❑ ❑ ❑ ❑
Wine Selection ❑ ❑ ❑ ❑ ❑
Menu Selection ❑ ❑ ❑ ❑ ❑
Food Quality ❑ ❑ ❑ ❑ ❑
Food Presentation ❑ ❑ ❑ ❑ ❑
Value for $ Spent ❑ ❑ ❑ ❑ ❑
EXAMPLES OF DATA AVAILABLE FROM SELECTED GOVERNMENT AGENCIES

Government Agency                 Some of the Data Typically Available
Census Bureau                     Population data, number of households, and household income
Federal Reserve Board             Data on the money supply, installment credit, exchange rates, and discount rates
Office of Management and Budget   Data on revenue, expenditures, and debt of the federal government
Department of Commerce            Data on business activity, value of shipments by industry, level of profits by industry, and growing and declining industries
Bureau of Labor Statistics        Consumer spending, hourly earnings, unemployment rate, safety records, and international statistics

EXAMPLES OF DATA AVAILABLE FROM INTERNAL COMPANY RECORDS

Source               Some of the Data Typically Available
Employee records     Name, address, social security number, salary, number of vacation days, number of sick days, and bonus
Production records   Part or product number, quantity produced, direct labor cost, and materials cost
Inventory records    Part or product number, number of units on hand, reorder level, economic order quantity, and discount schedule
Sales records        Product number, sales volume, sales volume by region, and sales volume by customer type
Credit records       Customer name, address, phone number, credit limit, and accounts receivable balance
Customer profile     Age, gender, income level, household size, address, and preferences
Time series, cross sectional and panel data:
• Time series data are time variant and collected over a period of time at regular
intervals.
• This type of data helps us understand the changes in the values of any variable.
• Cross sectional data are collected at the same point in time for different elements (e.g., companies) and variables.
• Panel data are the combination of time series and cross sectional data.
Time series data
Year   Sales (million $)
2012   12
2013   14
2014   17
2015   15
2016   18

Cross sectional data (expenditure for 2018 in million $)
Company   Sales   Advertisement   R&D
A         18      4               8
B         16      2               7
C         22      7               12
D         19      6               10
E         24      8               12

Panel data on sales, advt. and R&D expenditure (million $) from 2012-2016
       Company A            Company B            Company C
Year   Sales  Advt  R&D     Sales  Advt  R&D     Sales  Advt  R&D
2012   12     5     2       15     4     3       20     5     5
2013   15     6     3       18     6     4       22     7     5
2014   14     7     4       12     8     4       24     10    6
2015   18     8     5       10     10    6       26     12    8
2016   20     10    7       15     12    8       30     15    10
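One way to see how panel data combine the other two types is to store the panel table once and slice it in two directions. The sketch below uses the panel figures above; the helper names time_series and cross_section and the field labels are made up for this illustration.

```python
FIELDS = ("sales", "advt", "rd")   # order of the values in each tuple

# Panel data (million $): company -> year -> (sales, advt, R&D).
panel = {
    "A": {2012: (12, 5, 2), 2013: (15, 6, 3), 2014: (14, 7, 4),
          2015: (18, 8, 5), 2016: (20, 10, 7)},
    "B": {2012: (15, 4, 3), 2013: (18, 6, 4), 2014: (12, 8, 4),
          2015: (10, 10, 6), 2016: (15, 12, 8)},
    "C": {2012: (20, 5, 5), 2013: (22, 7, 5), 2014: (24, 10, 6),
          2015: (26, 12, 8), 2016: (30, 15, 10)},
}


def time_series(company, field):
    """One company, one variable, followed over the years."""
    idx = FIELDS.index(field)
    return [(year, values[idx]) for year, values in sorted(panel[company].items())]


def cross_section(year):
    """All companies and variables at a single point in time."""
    return {company: dict(zip(FIELDS, by_year[year]))
            for company, by_year in panel.items()}


print(time_series("A", "sales"))
print(cross_section(2014))
```

Fixing the company and varying the year gives a time series; fixing the year and varying the company gives a cross section.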
Scales of Measurement
• Two scales for measuring ‘qualitative data’
➢ Nominal Scale: A qualitative scale for which there is no meaningful ordering, or ranking of the
categories. It only measures the presence or absence of an attribute and does not contribute to any
higher order analysis. For example, name, address, gender, income category, education etc.
➢ Ordinal Scale: It measures a qualitative phenomenon that exists with a varying degree. An ordinal scale
is used when the phenomenon can be arranged in ascending or descending order and hence also
known as ranked order scale. For example, customer satisfaction, leadership, credit rating etc.
• Two scales for measuring ‘quantitative data’
➢ Interval Scale: An interval scale is often used to measure qualitative data in a quantitative manner. It is based on equal
intervals between the scale points, and ‘zero’ has no meaning (it does not indicate absence of the attribute). For example, a Likert scale where the
measurement is done on a scale of 1 to 5: rate the service of a
restaurant on a scale of 1 to 5, where 1 represents very bad, 2 represents bad, 3 represents neither good
nor bad, 4 represents good and 5 represents very good.
➢ Ratio Scale: Any quantitative data where ‘zero’ has a meaning and on which we can also perform mathematical
operations. For example, sales data, advertisement expenditure, profit/loss, distance etc. In some
statistical analyses, ratio scales are also converted into interval scales for ease of analysis.
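A small sketch of what the scales allow in practice, using hypothetical answers: nominal categories can only be counted, ordinal categories can be sorted once their ranking is stated, and ratio data support statements such as "twice as much".

```python
from collections import Counter

# Nominal: categories can be counted but have no meaningful order.
drinks = ["tea", "coffee", "tea", "tea"]            # hypothetical answers
drink_counts = Counter(drinks)

# Ordinal: the ranking must be supplied explicitly before sorting.
levels = ["low", "medium", "high"]                  # hypothetical ordered categories
rank = {label: position for position, label in enumerate(levels)}
ratings = ["high", "low", "medium", "low"]
ordered = sorted(ratings, key=lambda label: rank[label])
alphabetical = sorted(ratings)                      # ignores the ranking

# Ratio: zero is meaningful, so a ratio of two values makes sense.
sales_x, sales_y = 24, 12                           # hypothetical sales, million $
ratio = sales_x / sales_y

print(drink_counts)
print(ordered, "vs alphabetical:", alphabetical)
print(f"x sold {ratio:.1f} times as much as y")
```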
Let’s Answer…
1. If the grading of diabetes is classified as mild, moderate and severe the scale of
measurement used is ordinal scale.

2. The faculty of BA-1 records the number of answers that each student got correct in the last
test. Interval scale

3. Age of the customer: 20-25, 25-30, 30-35 etc. Interval scale

4. Asking for preference levels of various apparel brands like Van Heusen, Arrow,
Levi, Peter England and UCB. Ordinal scale

5. Asking whether the customer will like tea or coffee. Nominal scale

6. Satisfaction level of customers: very satisfied, satisfied, neutral, dissatisfied, very
dissatisfied. Interval scale
Research
• Research is an ORGANIZED and SYSTEMATIC way of FINDING
ANSWERS to QUESTIONS.
• QUESTIONS: should be relevant, useful, and important.
• FINDING ANSWERS: Research is successful when we find answers.
Sometimes the answer is no, but it is still an answer.
• ORGANIZED: focused and limited to a specific scope.
• SYSTEMATIC: a definite set of procedures and steps
Types of Research
• Exploratory Research (explore)
• Generate basic knowledge, clarify relevant issues, uncover variables associated with a
problem, uncover information needs, and/or define alternatives for addressing research
objectives.
• A very flexible, open-ended process.
• Descriptive Research (who, what, where, how)
• Provide further insight into the research problem by describing the variables of interest.
• Applications: Profiling, defining, segmentation, estimating, predicting, and examining
associative relationships.
• Causal Research (If-then)
• Designed to provide information on potential cause-and-effect relationships.
• Most practical in marketing to talk about associations or impact of one variable on another.
Experimental Research

Common Features
• Division of the subjects/elements into groups (control, experimental).
• Use of a "treatment" (usually the independent variable) which is
introduced into the research context or manipulated by the researcher.
• In contrast to qualitative research, virtually all experiments are designed
to test hypotheses.
• It is highly analytical.
Qualitative & Quantitative Research
➢ Qualitative research: explore perceptions, attitudes and motivations
• To understand how they are formed.
• To provide depth of information
• To determine what attributes will subsequently be measured in quantitative
studies
➢ Quantitative research: descriptive
• provides raw data on the numbers of people exhibiting certain behaviors,
attitudes, etc.
• allows sampling of large numbers of the population.
• It is highly data-intensive and mathematical
Qualitative & Quantitative Research
Examples of each type of research
• Exploratory: Understanding the immunity built by Covid vaccine, new type of
health supplement products liked by consumers
• Descriptive: Sales analysis, consumer perception and behavior analysis, market
characteristics analysis
• Causal: Impact of medicines in curing certain disease, impact of advt. on sales,
impact of FDI on economic growth
• Experimental: Drug experiments on two or more groups of patients, product
experiments with different sets of consumers
• Qualitative: Branding perception among customers, adding/deleting features in
new products, assessment of services
• Quantitative: Customer feedback surveys, employee satisfaction surveys,
financial assessment of companies, stock market analysis
Attitude Measurement

• Attitude: The way of feeling or acting towards a person, thing or situation


• composed of cognitive, emotional and behavioural components
• Inferred from actions, behaviour and statements
• Hypothetical Construct: Variables that are not directly observable but are
measurable through indirect indicators, such as verbal expression or overt
behaviour.
Techniques for Measuring Attitudes

• Rating: to estimate the magnitude of a characteristic or quality


• Ex: brand loyalty, store service quality, or product life
• Ranking: to rank order a small number of stores, brands, or objects
• On the basis of overall preference or some characteristics
• Sorting: Presents a respondent with several objects or product
concepts
• Requires the respondent to arrange the objects into piles or classify the
product concepts.
• Choice: identifies preferences
• Requiring respondents to choose between two or more alternatives
Attitude Rating Scales
• Category scale: A rating scale that consists of several response
categories, often providing respondents with alternatives to indicate
positions on a continuum.
Likert Scale
• Likert scale - Respondents are asked to indicate the amount of
agreement or disagreement (from strongly agree to strongly disagree)
on a five- or seven-point scale. The same format is used for multiple
questions.
Likert Scale Example
                                            Strongly   Disagree   Neither agree   Agree   Strongly
                                            disagree              nor disagree            agree
1. D-Mart sells high quality merchandise.   1          2          3               4       5
2. D-Mart has poor in-store service.        1          2          3               4       5
3. I like to shop at D-Mart.                1          2          3               4       5

• The analysis can be conducted on an item-by-item basis (profile analysis), or a total
(summated) score can be calculated.
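Both kinds of analysis can be sketched for the three D-Mart items. The responses are hypothetical. Reverse-scoring the negatively worded item 2 before summing is a common practice that is assumed here; the slide itself does not state it.

```python
SCALE_MAX = 5                     # five-point scale, scored 1 to 5
items = [
    "D-Mart sells high quality merchandise.",
    "D-Mart has poor in-store service.",
    "I like to shop at D-Mart.",
]
negatively_worded = {1}           # index of item 2 (assumption: reverse-score it)

# Hypothetical answers of two respondents to items 1, 2 and 3.
responses = {"R1": [4, 2, 5], "R2": [2, 4, 1]}


def reverse(score):
    """Flip a score so that 1 becomes 5, 2 becomes 4, and so on."""
    return SCALE_MAX + 1 - score


def summated_score(answers):
    """Total score with negatively worded items reverse-scored."""
    return sum(reverse(s) if i in negatively_worded else s
               for i, s in enumerate(answers))


def item_means(all_answers):
    """Profile analysis: mean raw score of each item across respondents."""
    n = len(all_answers)
    return [sum(a[i] for a in all_answers) / n for i in range(len(items))]


totals = {name: summated_score(ans) for name, ans in responses.items()}
profile = item_means(list(responses.values()))
print(totals)
print(profile)
```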
Semantic Differential Scale
A seven-point rating scale with end points associated with bipolar labels that have semantic meaning.
Jio fiber is:
Powerful --:--:--:--:-X-:--:--: Weak
Unreliable --:--:--:--:--:-X-:--: Reliable
Modern --:--:--:--:--:--:-X-: Old-fashioned

• The negative adjective or phrase sometimes appears at the left side of the scale and sometimes at
the right.
• This controls the tendency of some respondents, particularly those with very positive or very
negative attitudes, to mark the right- or left-hand sides without reading the labels.
• Scored on either a -3 to +3 or a 1 to 7 scale.
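The two scoring options can be sketched with the Jio fiber example. Positions are counted from the left as transcribed above (5, 6 and 7). Which adjective of each pair counts as the favourable one is an assumption made for this illustration.

```python
POINTS = 7
MIDPOINT = (POINTS + 1) // 2      # position 4 is the neutral middle

# (left label, right label, marked position from the left, favourable label on left?)
ratings = [
    ("Powerful", "Weak", 5, True),
    ("Unreliable", "Reliable", 6, False),
    ("Modern", "Old-fashioned", 7, True),
]


def favourable_1_to_7(position, favourable_on_left):
    """Re-orient a mark so that 7 always means the favourable end."""
    return POINTS + 1 - position if favourable_on_left else position


def to_signed(score_1_to_7):
    """Convert a 1 to 7 score to the -3 to +3 scoring."""
    return score_1_to_7 - MIDPOINT


scores = {
    f"{left}/{right}": to_signed(favourable_1_to_7(pos, fav_left))
    for left, right, pos, fav_left in ratings
}
print(scores)
```

Re-orienting the items first is what makes scores comparable when the negative adjective appears on different sides of the scale.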
A Semantic Differential Scale for Measuring Self-Concepts,
Person Concepts, and Product Concepts
1) Rugged :---:---:---:---:---:---:---: Delicate
2) Excitable :---:---:---:---:---:---:---: Calm
3) Uncomfortable :---:---:---:---:---:---:---: Comfortable
4) Dominating :---:---:---:---:---:---:---: Submissive
5) Thrifty :---:---:---:---:---:---:---: Indulgent
6) Pleasant :---:---:---:---:---:---:---: Unpleasant
7) Contemporary :---:---:---:---:---:---:---: Obsolete
8) Organized :---:---:---:---:---:---:---: Unorganized
9) Rational :---:---:---:---:---:---:---: Emotional
10) Youthful :---:---:---:---:---:---:---: Mature
11) Formal :---:---:---:---:---:---:---: Informal
12) Orthodox :---:---:---:---:---:---:---: Liberal
13) Complex :---:---:---:---:---:---:---: Simple
14) Colorless :---:---:---:---:---:---:---: Colorful
15) Modest :---:---:---:---:---:---:---: Vain
Stapel scale
• This is a unipolar ten-point rating scale.
• Ranges from +5 to -5 and has no neutral zero point.
• Measures intensity of an attitude
• This scale is usually presented vertically.
Stapel Scale Example
Croma products and service:
+5 (describes very well)    +5
+4                          +4
+3                          +3
+2                          +2 X
+1                          +1
HIGH QUALITY                POOR SERVICE
-1                          -1
-2                          -2
-3                          -3
-4 X                        -4
-5 (describes poorly)       -5

The data obtained by using a Stapel scale can be analyzed in the same way as semantic
differential data.
Graphic Rating Scale
• A measure of attitude that allows respondents to rate an object by
choosing any point along a graphic continuum.
Exercise:
Name the scale?
[Figures A, B and C: rating-scale examples]
Topic-3
Descriptive Analytics
Ice cream      Nos. sold   Percentage   For Pie Chart (1% = 3.6 degrees)
Butterscotch   10          20           72 deg.
Chocolate      17          34           122.4 deg.
Strawberry     8           16           57.6 deg.
Vanilla        15          30           108 deg.
[Figure: Simple Bar Chart of Nos. sold for Butterscotch, Chocolate, Strawberry and Vanilla]
[Figure: Pie Chart of Nos. sold: Butterscotch 20%, Chocolate 34%, Strawberry 16%, Vanilla 30%]
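The percentage and angle columns of the ice cream table can be reproduced with a few lines of Python. This is only a sketch; the function name pie_slices is illustrative and not part of the slides.

```python
# Units sold per flavour, taken from the ice cream table above.
sold = {"Butterscotch": 10, "Chocolate": 17, "Strawberry": 8, "Vanilla": 15}

def pie_slices(counts):
    """Return {category: (percentage, angle in degrees)} for a pie chart."""
    total = sum(counts.values())
    slices = {}
    for name, count in counts.items():
        pct = 100 * count / total
        slices[name] = (pct, pct * 3.6)  # 1% of the circle = 3.6 degrees
    return slices

for name, (pct, angle) in pie_slices(sold).items():
    print(f"{name}: {pct:.0f}% -> {angle:.1f} deg.")
```

The angles always add up to 360 degrees because the percentages add up to 100.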
Country     Share of Mfg in GDP (%)   Share of Exports in GDP (%)
China       34                        15
S. Korea    28                        4
Thailand    36                        2
Japan       21                        6
Germany     19                        11
India       14                        2
[Figure: Multiple Bar Chart, "Comparison of Global Mfg countries", Share of Mfg in GDP (%) and Share of Exports in GDP (%) by country]
[Figure: Percentage/Stacked Bar Chart, "Comparison of Global Mfg countries", same two series by country]
Year   Sales (million $)
2012   12
2013   14
2014   17
2015   15
2016   18
[Figure: Scatter Plot with Sharp Lines, Sales (million $) by year]
[Figure: Scatter Plot with Smooth Lines, Sales (million $) by year]
Comparison of countries on ease of starting business
[Figure: Mixed Chart of Time (days) and No. of Procedures by country]
New South
Canada USA France UK China India Brazil Russia
Zealand Africa
Time (days) 1 5 5 7 12 33 19 27 108 15
No. of Procedures 1 1 6 5 6 13 5 12 13 7

Time (days) No. of Procedures


Histogram
Points   No. of Students
0-10 8
10-20 11
20-30 22
30-40 29
40-50 36
50-60 55
60-70 39
70-80 21
80-90 11
90-100 3
Ogive – Cumulative Frequency Curves
Summarizing Data

Beverage   Frequency   Cumulative Frequency   Relative Frequency   Percent Frequency
Coca Cola 25 25 25/220 = 0.11 11
Diet Coke 35 60 35/220 = 0.16 16
Thums Up 30 90 30/220 = 0.14 14
Pepsi 20 110 0.09 9
Sprite 37 147 0.17 17
Maaza 23 170 0.11 11
Nescafe 25 195 0.11 11
Nestea 25 220 0.11 11
TOTAL 220 1.00 100
Cross Tabulation
Restaurant   Quality Rating   Meal Price ($)
1            Good             18
2            Very good        22
3            Good             28
4            Excellent        38
5            Very good        33
6            Good             28
7            Very good        19
8            Very good        11
9            Very good        23
10           Good             13
.            .                .
.            .                .
.            .                .

Cross-tabulation for quality rating and meal price
                 Meal price ($)
Quality Rating   $10-19   $20-29   $30-39   $40-49   Total
Good             42       40       2        0        84
Very good        34       64       46       6        150
Excellent        2        14       28       22       66
Total            78       118      76       28       300
Central Tendency
• Central tendency is a descriptive summary of a dataset through a single value that
reflects the center of the data distribution.
• A measure of central tendency lies around the central part of the distribution and
represents its average characteristic.
• The most common measures of central tendency are mean (arithmetic, weighted,
geometric, harmonic), median and mode.
• Mean: The most common measure of central tendency, it can be used with both
discrete and continuous data, although its use is most often with continuous data.
• Median: The middle value in a dataset that is arranged in ascending order (from the
smallest value to the largest value). If a dataset contains an even number of values,
the median of the dataset is the mean of the two middle values.
• Mode: Defines the most frequently occurring value in a dataset. In some cases, a
dataset may contain multiple modes while some datasets may not have any mode.
Arithmetic Mean
Examples of Arithmetic Mean
(Ungrouped) The following table shows monthly sales of 10 stores of a retail chain.
Calculate the average monthly sales.
Store   Sales ($ 1000s)
A       22
B       25
C       27
D       29
E       30
F       31
G       32
H       33
I       35
J       36
Average sales calculation:
Average monthly sales = ΣX/n
= (22+25+27+29+30+31+32+33+35+36)/10
= 300/10
= 30
The average monthly sales is $30,000 per store.

(Grouped, discrete) The number of LED lamps used in households is given below.
Calculate the average number of LED lamps used.
No. of LED lamps (X)   No. of households (f)   f*X
1                      2                       2*1=2
2                      4                       4*2=8
3                      6                       6*3=18
4                      8                       8*4=32
Total                  Σf = 20                 ΣfX = 60
Average LED calculation:
Average no. of LED lamps used = ΣfX/Σf
= 60/20
= 3
The average usage of LED is 3 lamps per household.
The following table provides data on the distribution of salaries ($1000s)
of 50 employees of an organization. Calculate the average salary.

(Grouped, continuous)
Salary ($1000s)   No. of Employees (f)   Mid-value (X)   f*X
10-20             2                      15              2*15=30
20-30             4                      25              4*25=100
30-40             6                      35              6*35=210
40-50             8                      45              8*45=360
50-60             10                     55              10*55=550
60-70             8                      65              8*65=520
70-80             6                      75              6*75=450
80-90             4                      85              4*85=340
90-100            2                      95              2*95=190
Total             Σf = 50                                ΣfX = 2750
Average salary calculation:
Average salary of employees = ΣfX/Σf
= 2750/50
= 55
The average salary of the employees is $55,000.
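The three arithmetic mean examples (ungrouped, grouped discrete, grouped continuous) can be checked with a short Python sketch. The helper name grouped_mean is illustrative and not part of the slides; for the continuous case it assumes, as the table does, that each class is represented by its mid-value.

```python
def grouped_mean(values, freqs):
    """Mean of grouped data: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)

# Ungrouped: monthly sales ($1000s) of the 10 stores.
sales = [22, 25, 27, 29, 30, 31, 32, 33, 35, 36]
mean_sales = sum(sales) / len(sales)

# Grouped, discrete: LED lamps per household (X) and households (f).
mean_lamps = grouped_mean([1, 2, 3, 4], [2, 4, 6, 8])

# Grouped, continuous: salary classes ($1000s) represented by mid-values.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60),
           (60, 70), (70, 80), (80, 90), (90, 100)]
employees = [2, 4, 6, 8, 10, 8, 6, 4, 2]
midpoints = [(low + high) / 2 for low, high in classes]
mean_salary = grouped_mean(midpoints, employees)

print(mean_sales, mean_lamps, mean_salary)
```

The same function serves both grouped cases; only the choice of X (actual value or class mid-value) changes.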
Practice Problems:

1. The mean weight of five computer stations is 167.2 lbs. The weights of four of
them are 158.4 lbs, 162.8 lbs, 165.0 lbs and 178.2 lbs respectively. What is the
weight of the fifth computer station?
2. The following table gives the weights of wooden items being sold by a timber
merchant. Calculate mean weight of the items sold.
Weight (lbs) 1-3 4-6 7-9 10-12 13-15
No. of items 8 25 45 18 4

3. An ice-cream parlor sells six varieties of ice-creams which have generated the
following revenue. Find the mean price of an ice-cream sold.
Ice-cream Butter scotch Chocolate Lychee Choco chips Tooty fruity Vanilla
Price (Rs.) 40 90 65 55 75 45
Sales (Rs.) 5,00,000 4,50,000 3,38,000 3,01,180 4,93,800 3,14,415
Weighted Arithmetic Mean
The weighted mean is a type of mean that is calculated by multiplying the weight (or
probability) associated with a particular event or outcome with its associated quantitative
outcome and then summing all the products together.
Weighted Mean = ΣWiXi/ΣWi; i = 1,2,3,……,n.
Examination   Weight (W)   Score (X)   W*X
Test-1        0.15         65%         0.15*65=9.75
Test-2        0.15         85%         0.15*85=12.75
CP-1          0.05         90%         0.05*90=4.50
CP-2          0.05         90%         0.05*90=4.50
Mid-term      0.20         75%         0.20*75=15.00
End-term      0.40         65%         0.40*65=26.00
Total         ΣW=1.00                  ΣWX=72.50
Weighted Arithmetic Mean = ΣWX/ΣW
= 72.50/1.00
= 72.50
The weighted average score is 72.50%.


Decisions based on Weighted Mean:
Example: Sam wants to buy a new camera, and decides on the following rating system:
Image Quality 50%
Battery Life 30%
Zoom Range 20%
The brand ‘X’ camera gets 8 for Image Quality, 6 for Battery Life and 7 for Zoom Range, all out
of 10.
The brand ‘Y’ camera gets 9 for Image Quality, 4 for Battery Life and 6 for Zoom Range, all out
of 10.

Which camera will Sam buy?


Weighted score of X : (0.5 × 8) + (0.3 × 6) + (0.2 × 7) = 4 + 1.8 + 1.4 = 7.2
Weighted score of Y: (0.5 × 9) + (0.3 × 4) + (0.2 × 6) = 4.5 + 1.2 + 1.2 = 6.9

Sam decides to buy the brand ‘X’.
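The camera decision and the examination example both use the same formula, ΣWX/ΣW. A minimal Python sketch follows; the function and variable names are illustrative, not from the slides.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Sam's rating system: Image Quality, Battery Life, Zoom Range.
weights = [0.5, 0.3, 0.2]
camera_x = [8, 6, 7]
camera_y = [9, 4, 6]

score_x = weighted_mean(camera_x, weights)
score_y = weighted_mean(camera_y, weights)
best = "X" if score_x > score_y else "Y"
print(round(score_x, 2), round(score_y, 2), best)

# Examination example: weights and percentage scores from the table.
exam_score = weighted_mean([65, 85, 90, 90, 75, 65],
                           [0.15, 0.15, 0.05, 0.05, 0.20, 0.40])
print(round(exam_score, 2))
```

Dividing by the sum of the weights means the function also works when the weights do not add up to 1.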


Median
Examples of Median (Ungrouped data):
1. The class size of five sections of first year students are 32, 56, 42, 46, 48 respectively.
Find the median no. of students.
Key: Arrange the nos. in ascending order: 32, 42, 46, 48, 56
No. of observations n = 5 (odd)
Median value = [(n+1)/2]th observation
= [(5+1)/2]th observation
= 3rd observation = 46
The median no. of students is 46.

2. A batsman scored 1, 113, 148, 22, 24, 27, 15, 16, 16 & 28 runs in the last 10 innings. Using an
appropriate measure, find his average score.
Key: Since there are two extreme scores (113 and 148), the mean would be distorted by these values.
Here, the median is the more appropriate measure.
Arrangement: 1, 15, 16, 16, 22, 24, 27, 28, 113, 148.
No. of observations, n = 10 (even)
Median = Mean of (n/2)th and (n/2+1)th observations
= Mean of 5th and 6th observations
= (22+24)/2 = 23
The average score of the batsman is 23 runs.
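The two median rules (odd and even number of observations) translate directly into code. This is a sketch with an illustrative function name; Python's standard library offers statistics.median for the same purpose.

```python
def median(data):
    """Median of ungrouped data using the odd/even rules above."""
    ordered = sorted(data)            # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # [(n+1)/2]th observation
    # Mean of the (n/2)th and (n/2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

class_sizes = [32, 56, 42, 46, 48]
runs = [1, 113, 148, 22, 24, 27, 15, 16, 16, 28]
print(median(class_sizes), median(runs))
print(sum(runs) / len(runs))          # the mean, pulled up by 113 and 148
```

Comparing the last two printed values shows how far the two extreme scores pull the mean away from the median.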
Grouped, discrete Grouped, continuous
Mode
• Mode is the value which occurs most
frequently in a distribution.
• A distribution can have one mode or
more than one mode.
• Mode is widely used while compiling
the results of surveys. The options
with maximum frequencies are
considered and decisions are taken
accordingly.
• The demerits of arithmetic mean and
median can be overcome with the
help of mode.
• Mode can be calculated for grouped,
ungrouped, discrete and continuous
data.
Discrete series Continuous series
Practice Problems
Mode - Grouping method
Grouping of frequencies
Col.1: actual frequency (f)
Col.2: sum of two freq.
Col.3: sum of two freq., leaving the first freq.
Col.4: sum of three freq.
Col.5: sum of three freq., leaving the first freq.
Col.6: sum of three freq., leaving the first two freq.

Marks   Col.1 (f)
11      5
12      5
13      10
14      15
15      15
16      10
17      7
18      5
19      4
20      4

Col.2: 11-12: 5+5 = 10; 13-14: 10+15 = 25; 15-16: 15+10 = 25; 17-18: 7+5 = 12; 19-20: 4+4 = 8
Col.3: 12-13: 5+10 = 15; 14-15: 15+15 = 30; 16-17: 10+7 = 17; 18-19: 5+4 = 9
Col.4: 11-13: 5+5+10 = 20; 14-16: 15+15+10 = 40; 17-19: 7+5+4 = 16
Col.5: 12-14: 5+10+15 = 30; 15-17: 15+10+7 = 32; 18-20: 5+4+4 = 13
Col.6: 13-15: 10+15+15 = 40; 16-18: 10+7+5 = 22

Counting for highest frequency
Marks   Col.1   Col.2   Col.3   Col.4   Col.5   Col.6   Total
11                                                      0
12                                                      0
13              1                               1       2
14      1       1       1       1               1       5
15      1       1       1       1       1       1       6
16              1               1       1               3
17                                      1               1
18                                                      0
19                                                      0
20                                                      0
The highest total (6) is against 15 marks, so the mode is 15.
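The grouping method can be sketched in Python as follows. The function name and its return format are illustrative assumptions; the six column definitions follow the grouping table above, and tied maxima within a column are all counted, as in the counting table.

```python
from collections import Counter

def grouping_method_mode(values, freqs):
    """Mode by the grouping method: build six columns of grouped
    frequencies, then tally which values fall in each column's maximum."""
    n = len(freqs)
    # (group size, leading frequencies skipped) for Col.1 to Col.6.
    columns = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
    tally = Counter()
    for size, skip in columns:
        # Non-overlapping groups of `size` consecutive frequencies.
        groups = [(start, sum(freqs[start:start + size]))
                  for start in range(skip, n - size + 1, size)]
        best = max(total for _, total in groups)
        for start, total in groups:
            if total == best:                      # count every tied group
                tally.update(values[start:start + size])
    top = max(tally.values())
    return [v for v in values if tally[v] == top], tally

marks = list(range(11, 21))
freqs = [5, 5, 10, 15, 15, 10, 7, 5, 4, 4]
modes, tally = grouping_method_mode(marks, freqs)
print(modes, dict(tally))
```

The tally returned alongside the mode corresponds to the Total column of the counting table, which makes it easy to compare the code with a hand calculation.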
Partition Values: Percentile, Decile and Quartile
Arrange the observations in ascending order before locating any partition value.

Pk = (k.n/100)th observation where k = 1, 2, 3, ..., 99 and n is the no. of observations.
If k.n/100 is an integer, the percentile value is the average of the observation at that
position and the next observation.
If k.n/100 is not an integer, round it up to the next integer and take the observation at that position.

Dk = (k.n/10)th observation where k = 1, 2, 3, ..., 9 and n is the no. of observations.
If k.n/10 is an integer, the decile value is the average of the observation at that position and the next observation.
If k.n/10 is not an integer, round it up to the next integer and take the observation at that position.

Qk = (k.n/4)th observation where k = 1, 2, 3 and n is the no. of observations.
If k.n/4 is an integer, the quartile value is the average of the observation at that position and the next observation.
If k.n/4 is not an integer, round it up to the next integer and take the observation at that position.
3710 3755 3850 3880 3880 3890 3920 3940 3950 4050 4130 4325
Find 85th percentile, 6th decile and 1st & 3rd quartiles from the above distribution.
1) P85 = (85*12/100)th observation = 10.2th observation or 11th observation
Hence P85 = 4130
2) D6 = (6*12/10)th observation = 7.2th observation or 8th observation
Hence D6 = 3940
3) Q1 = (1*12/4)th observation = 3rd observation
Hence Q1 will be calculated by taking the average of 3rd and 4th observations.
Hence Q1 = (3850 + 3880)/2 = 3865
Q3 = (3*12/4)th observation = 9th observation
Hence Q3 will be calculated by taking the average of 9th and 10th observations.
Q3 = (3950 + 4050)/2 = 4000
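The percentile, decile and quartile rules differ only in the divisor (100, 10 or 4), so one function can cover all three. The sketch below uses an illustrative name, partition_value, and follows the positional rule of these slides. Library routines such as numpy.percentile interpolate by default and may return slightly different values.

```python
def partition_value(data, k, parts):
    """k-th partition value when the data are cut into `parts` parts
    (parts = 100 for percentiles, 10 for deciles, 4 for quartiles)."""
    ordered = sorted(data)
    # Integer arithmetic gives an exact test for a whole-number position.
    position, remainder = divmod(k * len(ordered), parts)
    if remainder == 0:
        # Whole-number position: average that observation and the next one.
        return (ordered[position - 1] + ordered[position]) / 2
    # Fractional position: round up to the next observation
    # (1-based position + 1 is index `position` in 0-based Python).
    return ordered[position]

values = [3710, 3755, 3850, 3880, 3880, 3890,
          3920, 3940, 3950, 4050, 4130, 4325]
p85 = partition_value(values, 85, 100)
d6 = partition_value(values, 6, 10)
q1 = partition_value(values, 1, 4)
q3 = partition_value(values, 3, 4)
print(p85, d6, q1, q3, (q3 - q1) / 2)   # last value is the quartile deviation
```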
Measures of Dispersion
• Let us consider the series of numbers:
5, 5, 5, 5, 5 (Mean = 5) 1, 3, 5, 7, 9 (Mean = 5) 1, 3, 4, 6, 11 (Mean = 5)
• Are the datasets the same?
• Do they have the same characteristics?
• Does any difference exist among various observations of the datasets?
• Dispersion means spread or scatteredness of the various observations.
• Dispersion measures the extent to which the observations vary from central value.
• Dispersion only measures the degree of variation, not the direction.
• Measures of dispersion are also known as ‘average of second order’ as they depend
on some or the other average value.
• The common measures of dispersion are range, quartile deviation, mean deviation
and standard deviation.
Range
• Range is defined as the difference between largest observation (L) and smallest
observation (S) in a distribution. Range = L-S
• If the average of two distributions are almost same, then the distribution with
smaller range is said to have less dispersion.
• Lesser value of range indicates more consistency in the distribution.
• Coefficient of range = (L-S)/(L+S)
• Range is widely used for statistical quality control. If the dimensions of products
are beyond a defined range, they are discarded.
• It facilitates the study of variations in the prices of shares, agricultural products
and other commodities.
• It also helps in weather forecasts by indicating minimum and maximum
temperature.
Quartile Deviation
• Quartile deviation or semi inter quartile range is half of the difference between
upper quartile (Q3) and lower quartile (Q1).
• Quartile deviation (QD) = (Q3 - Q1)/2
• The quartile deviation does not take into consideration the extreme points of the
distribution. Thus, the dispersion or spread of only the central 50% of the data is considered.
• It is the best measure of dispersion for open-ended distributions (those whose
extreme classes are open-ended).
• Interquartile range IQR = (Q3 - Q1)
Ex: 3710, 3755, 3850, 3880, 3880, 3890, 3920, 3940, 3950, 4050, 4130, 4325
Q1 = 3865
Q3 = 4000
QD = (4000 – 3865)/2 = 67.5
Mean Deviation
• Mean deviation is arithmetic mean of the absolute deviations of all items from a
measure of central tendency.
• Mean deviation (MD) = Σ|X-m(X)|/n
where ‘m(X)’ is any central tendency value and ‘n’ is no. of observations.
Example: Calculate mean deviation about mean:
X       |X-m(X)|
5       |5-9|= 4
7       |7-9|= 2
8       |8-9|= 1
9       |9-9|= 0
10      |10-9|= 1
11      |11-9|= 2
13      |13-9|= 4
Mean m(X) = 9; Σ|X-m(X)|= 14
MD about mean = Σ|X-m(X)|/n
= 14/7
= 2
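The mean deviation calculation can be sketched as below. The function name is illustrative; the optional center argument is an assumption added so that the deviation can also be taken about the median or mode, as the definition allows.

```python
def mean_deviation(data, center=None):
    """Mean of absolute deviations from a central value (default: the mean)."""
    if center is None:
        center = sum(data) / len(data)
    return sum(abs(x - center) for x in data) / len(data)

observations = [5, 7, 8, 9, 10, 11, 13]
md_about_mean = mean_deviation(observations)
print(md_about_mean)
```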
Variance, Standard Deviation and Coeff. of Variation

                             Population                   Sample
Variance (ungrouped)         σ² = Σ(x−x̄)²/n               s² = Σ(x−x̄)²/(n-1)
Std. Deviation (ungrouped)   σ = Sqrt {Σ(x−x̄)²/n}         s = Sqrt {Σ(x−x̄)²/(n-1)}
Variance (grouped)           σ² = Σf(x−x̄)²/Σf             s² = Σf(x−x̄)²/(Σf-1)
Std. Deviation (grouped)     σ = Sqrt {Σf(x−x̄)²/Σf}       s = Sqrt {Σf(x−x̄)²/(Σf-1)}
Coeff. of Variation          C.V. = (σ/x̄)*100             C.V. = (s/x̄)*100
Example: Ungrouped & grouped, discrete series
Ungrouped series:
x      (x−x̄)      (x−x̄)²
8      8-7=1      1
4      4-7=-3     9
9      9-7=2      4
11     11-7=4     16
3      3-7=-4     16
Σx=35; x̄ = 7; Σ(x−x̄)² = 46
Sample variance (s²) = Σ(x−x̄)²/(n-1) = 46/4 = 11.5
Sample Std. Deviation (s) = Sqrt (Sample variance) = Sqrt (11.5) = 3.391
CV = (s/x̄)*100 = (3.391/7)*100 = 48.44%

Grouped, discrete series:
x      f      fx     (x−x̄)      (x−x̄)²     f(x−x̄)²
8      3      24     8-8=0      0          3*0=0
4      4      16     4-8=-4     16         4*16=64
10     6      60     10-8=2     4          6*4=24
12     4      48     12-8=4     16         4*16=64
4      3      12     4-8=-4     16         3*16=48
Σfx = 160; Σf = 20; x̄ = 160/20 = 8; Σf(x−x̄)² = 200
Sample Variance (s²) = Σf(x−x̄)²/(Σf-1) = 200/19 = 10.526
Std Deviation (s) = Sqrt (Sample variance) = Sqrt (10.526) = 3.244
CV = (s/x̄)*100 = (3.244/8)*100 = 40.55%
Example: Grouped, continuous series
A showroom of cars displays its sales figures for the last 30 days. Calculate the mean no. of
cars sold per day and std. deviation.

No. of cars sold   Days (f)   x     fx     (x−x̄)      (x−x̄)²     f(x−x̄)²
0-2                14         1     14     1-3=-2     4          14*4=56
2-4                7          3     21     3-3=0      0          7*0=0
4-6                5          5     25     5-3=2      4          5*4=20
6-8                3          7     21     7-3=4      16         3*16=48
8-10               1          9     9      9-3=6      36         1*36=36
Σfx = 90; Σf = 30; x̄ = 90/30 = 3; Σf(x−x̄)² = 160
Sample Variance (s²) = Σf(x−x̄)²/(Σf-1) = 160/29 = 5.517
Sample Standard Deviation (s) = Sqrt (Variance) = Sqrt (5.517) = 2.349
CV = (s/x̄)*100 = (2.349/3)*100 = 78.30%
Example: Consistency of data using C.V.
Coefficient of Variation is used to study consistency whenever there is a
comparison between two or more datasets.
Two companies Dawson Suppliers and Clark Distributors deliver construction materials. The following data
shows days of delivery for both the companies on 8 occasions. Which company is more consistent in
deliveries?
Dawson (x)   Clark (y)   (x−x̄)   (x−x̄)²   (y−ȳ)   (y−ȳ)²
11           8           1       1        -2      4
10           10          0       0        0       0
9            17          -1      1        7       49
10           7           0       0        -3      9
8            10          -2      4        0       0
8            11          -2      4        1       1
10           10          0       0        0       0
14           7           4       16       -3      9
Mean x̄ = 10; Mean ȳ = 10; Σ(x−x̄)² = 26; Σ(y−ȳ)² = 72
Coefficient of Variation:
Variance of x = 26/7 = 3.71
SD of x = Sqrt (3.71) = 1.93
CV of x = (1.93/10)*100 = 19.3%
Variance of y = 72/7 = 10.29
SD of y = Sqrt (10.29) = 3.21
CV of y = (3.21/10)*100 = 32.1%
Dawson Suppliers has the lower CV and is therefore more consistent in delivering the materials.
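The consistency comparison can be reproduced with Python's standard statistics module, whose variance and stdev functions use the sample (n-1) formulas. The helper name coefficient_of_variation is illustrative and not from the slides.

```python
import statistics

def coefficient_of_variation(data):
    """Sample CV in percent: (s / mean) * 100, with s based on n-1."""
    return statistics.stdev(data) / statistics.mean(data) * 100

deliveries = {
    "Dawson": [11, 10, 9, 10, 8, 8, 10, 14],
    "Clark": [8, 10, 17, 7, 10, 11, 10, 7],
}

for name, days in deliveries.items():
    print(name,
          round(statistics.variance(days), 2),
          round(statistics.stdev(days), 2),
          round(coefficient_of_variation(days), 1))

# The supplier with the smaller CV is the more consistent one.
more_consistent = min(deliveries,
                      key=lambda name: coefficient_of_variation(deliveries[name]))
print(more_consistent)
```

Because both suppliers have the same mean (10 days), comparing standard deviations would give the same ranking here; the CV matters when the means differ.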
Practice Problems
1. Find quartile deviation from the following data:
109, 189, 167, 209, 309, 265, 189, 187, 165, 239, 308, 378, 367, 109, 198, 209, 218, 387
2. The share prices of two companies X and Y are given below for twelve days. Which
company’s share prices are more consistent?
Days 1 2 3 4 5 6 7 8 9 10 11 12
X 201 200 199 203 206 208 206 201 197 199 198 196
Y 291 293 293 287 292 298 298 299 302 302 302 304

3. Calculate standard deviation from the following distribution:


Class 10-20 20-30 30-40 40-50 50-60 60-70
Frequency 5 10 15 15 10 5
Measures of Shape
Skewness
It is the degree of distortion from the symmetrical bell curve or the normal distribution. It
measures the lack of symmetry in data distribution.
It compares the extreme values in one tail with those in the other. A symmetrical distribution
will have a skewness of 0. There are two types of skewness: positive and negative.
Positive skewness is when the tail on the right side of the distribution is longer
or flatter than the tail on the left side. The mean and median will typically be greater than the mode.
Negative skewness is when the tail on the left side of the distribution is longer or
flatter than the tail on the right side. The mean and median will typically be less than the
mode.

So, when is the skewness too much?


A common rule of thumb is:
• If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
• If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and
1 (positively skewed), the data are moderately skewed.
• If the skewness is less than -1 (negatively skewed) or greater than 1 (positively
skewed), the data are highly skewed.
Kurtosis
Kurtosis is about the tails of the distribution, not only its peakedness or flatness. It describes
how extreme the values in the tails are and is, in effect, a measure of the outliers present in
the distribution.
High kurtosis in a data set is an indicator that the data have heavy tails or outliers. If kurtosis is high,
we need to investigate why there are so many outliers.
Low kurtosis in a data set is an indicator that the data have light tails or lack outliers. If kurtosis is
low (too good to be true), we also need to investigate and trim the dataset of unwanted results.
Mesokurtic distribution (Kurtosis = 3): This distribution has a kurtosis statistic similar to
that of the normal distribution, meaning that its extreme values behave like those of a
normal distribution. Kurtosis is defined so that the standard normal distribution has a
kurtosis of three.
Leptokurtic distribution (Kurtosis > 3): The peak is higher and sharper than in the
mesokurtic case, which means that the data are heavy-tailed or have a profusion of outliers.
Outliers stretch the horizontal axis of the histogram graph, which makes the bulk of
the data appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness”
of a leptokurtic distribution.
Platykurtic distribution (Kurtosis < 3): The peak is lower and broader than in the mesokurtic
case, which means that the data are light-tailed or lack outliers, because the extreme values
are less extreme than those of the normal distribution.
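The slides do not state formulas for skewness and kurtosis. The sketch below assumes the simple moment-based definitions (skewness = m3/m2^1.5 and kurtosis = m4/m2², where mk is the k-th central moment divided by n), under which a normal distribution has kurtosis 3. Statistical packages often apply sample-size corrections or report excess kurtosis (kurtosis minus 3), so their output may differ.

```python
def moment_skew_kurtosis(data):
    """Moment-based skewness and kurtosis (no sample-size correction)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2

symmetric = [1, 3, 5, 7, 9]
with_outlier = [18, 19, 20, 17, 21, 22, 99, 21, 15, 18]

print(moment_skew_kurtosis(symmetric))     # zero skewness, kurtosis below 3
print(moment_skew_kurtosis(with_outlier))  # highly positive skew, kurtosis above 3
```

The second dataset contains a single extreme value on the right, which is enough to push the skewness above 1 and the kurtosis above 3.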
Outlier Detection
Outliers are extreme values that deviate from the other observations in the data; they may
indicate variability in a measurement, experimental errors or a novelty. In other
words, an outlier is an observation that diverges from the overall pattern of a sample.
Most common causes of outliers in a data set:
• Data entry errors (human errors)
• Measurement errors (instrument errors)
• Experimental errors (data extraction or experiment planning/executing errors)
• Intentional (dummy outliers made to test detection methods)
• Data processing errors (data manipulation or data set unintended mutations)
• Sampling errors (extracting or mixing data from wrong or various sources)
• Natural (not an error, novelties in data)
‘Z-Score’ method:
The z-score or standard score of an observation is a metric that indicates how many standard deviations
a data point is from the sample mean. After making the appropriate transformations to the selected
feature space of the dataset, the z-score of any data point can be calculated with the following
expression:
Z = (x − x̄)/s, where x̄ is the sample mean and s is the sample standard deviation.
When computing the z-score for each observation in the data set, a threshold must be specified. A general
‘thumb-rule’ for detecting outliers is |Z| > 3.0; however, it varies. Sometimes |Z| values more than 2.0 or
2.5 are also considered as outliers.
Meal price ($)   Z-score = (x-mean)/s.d.
18               -0.3545
19               -0.3151
20               -0.2757
17               -0.3939
21               -0.2364
22               -0.1970
99               2.8362 (Outlier at the |Z| > 2.5 threshold)
21               -0.2364
15               -0.4727
18               -0.3545
Mean = 27
S.D. = 25.3859
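A short Python sketch of the z-score method on the meal price data follows. The function name zscore_outliers is illustrative; statistics.stdev is the sample (n-1) standard deviation, which matches the S.D. shown in the table.

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Return the observations whose |z| exceeds the chosen threshold."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)          # sample standard deviation (n-1)
    return [x for x in data if abs((x - mean) / sd) > threshold]

meal_prices = [18, 19, 20, 17, 21, 22, 99, 21, 15, 18]
print(zscore_outliers(meal_prices, threshold=2.5))
print(zscore_outliers(meal_prices))      # default threshold of 3.0
```

With the sample standard deviation, |Z| can never exceed (n-1)/√n, which is about 2.85 for n = 10. The 3.0 threshold therefore cannot flag anything in a sample this small, which is one reason lower thresholds such as 2.0 or 2.5 are sometimes used.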
‘IQR’ method:
Interquartile range (IQR) is a measure of variability and also referred to as ‘midspread’. It is calculated as:
IQR = Q3 – Q1
With the help of IQR, we determine lower limit and upper limit for the dataset. Any value less than lower limit or
more than upper limit is considered as an outlier.
Lower limit = Q1 – 1.5*IQR
Upper limit = Q3 + 1.5*IQR
Meal price ($): 15, 17, 18, 18, 19, 20, 21, 21, 22, 99
Calculations:
Q1 = 18 and Q3 = 21
IQR = Q3 – Q1 = 3
Lower limit = 18 – 1.5*3 = 13.5
Upper limit = 21 + 1.5*3 = 25.5
Any value less than $13.5 or more than $25.5 will be considered as an outlier.
Hence the meal price of $99 is an outlier.
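The IQR method can be sketched as below. Function names are illustrative. The quartile helper is an assumption that follows the positional rule used earlier in these slides (average two observations when k.n/4 is a whole number, otherwise round up); interpolating library routines may give slightly different quartiles.

```python
def quartile(ordered, k):
    """k-th quartile of already-sorted data, positional rule."""
    position, remainder = divmod(k * len(ordered), 4)
    if remainder == 0:
        return (ordered[position - 1] + ordered[position]) / 2
    return ordered[position]             # rounded-up position, 0-based index

def iqr_outliers(data):
    """Return (lower limit, upper limit, outliers) using 1.5 * IQR fences."""
    ordered = sorted(data)
    q1, q3 = quartile(ordered, 1), quartile(ordered, 3)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return lower, upper, [x for x in ordered if x < lower or x > upper]

meal_prices = [15, 17, 18, 18, 19, 20, 21, 21, 22, 99]
lower, upper, outliers = iqr_outliers(meal_prices)
print(lower, upper, outliers)
```

Unlike the z-score, the fences are built from quartiles, so the extreme value itself has little influence on the limits used to judge it.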
Association between two variables
• In a bivariate distribution, we are often interested in knowing the association between
the two variables, say X & Y. One such technique of establishing the relationship
between two variables is correlation analysis.
• For example, we might be interested in finding whether there is any association
between height and weight of kids, sales and advertisement expenditure, work
experience and salary, stress level and BP etc.
• Correlation is defined as the relationship between two variables in such a way that any
change in one variable is accompanied by a corresponding change in the other. Correlation
analysis provides a tool to measure the strength and direction of such associations, if
any.
• Correlation can be positive or negative. If the two variables exhibit changes in the
same direction then there exists a positive correlation and if the changes are in
opposite direction then there is a negative correlation between the variables.
• However, it does not establish any causal or dependence-independence relationship
between the two variables.
Covariance
• Covariance measures how two variables move with respect to each other and is an extension of the
concept of variance, which tells how a single variable varies. It can take any value between -∞
and +∞.
• A positive number signifies positive covariance and denotes a direct relationship.
Effectively this means that an increase in one variable is accompanied by a corresponding increase in the
other variable, provided other conditions remain constant.
• On the other hand, a negative number signifies negative covariance, which denotes an inverse
relationship between the two variables. Though covariance shows the direction of the
relationship well, its magnitude depends on the units of the variables and is hard to interpret.
• Covariance (sample) is defined by the formula:
Cov(X,Y) = Σ(xi − x̄)(yi − ȳ)/(n-1)
xi = data value of x; yi = data value of y; x̄ = mean of x; ȳ = mean of y; n = number of data values.


The data for ‘Stereo and Sound Equipment Store’ is given pertaining to the no. of television commercials shown and sales during
10 weeks. The store manager wants to determine the association between television commercials and sales revenue. Calculate
sample covariance and interpret the result.
Week   No. of Commercials (x)   Sales ($100s) (y)   (x−x̄)   (y−ȳ)   (x−x̄)(y−ȳ)
1      2                        50                  -1      -1      1
2      5                        57                  2       6       12
3      1                        41                  -2      -10     20
4      3                        54                  0       3       0
5      4                        54                  1       3       3
6      1                        38                  -2      -13     26
7      5                        63                  2       12      24
8      3                        48                  0       -3      0
9      4                        59                  1       8       8
10     2                        46                  -1      -5      5
Σx=30; Σy=510; x̄ = 3; ȳ = 51; Σ(x−x̄)(y−ȳ) = 99
Cov(X,Y) = Σ(x−x̄)(y−ȳ)/(n-1)
= 99/(10-1)
= 11
From the data, it is evident that there exists a positive association between television commercials and sales of
the store, i.e., sales increase with the increase in television commercials. But this analysis only gives the direction
of the relationship, not the strength. The strength or degree of the relationship can be understood with correlation
analysis.
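The sample covariance for the commercials and sales data can be verified with a few lines of Python. The function name sample_covariance is illustrative; Python 3.10 and later also provide statistics.covariance for the same calculation.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum((x - mean_x) * (y - mean_y)) / (n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross_products / (n - 1)

commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]      # in $100s
cov_xy = sample_covariance(commercials, sales)
print(cov_xy)

# Rescaling sales to dollars multiplies the covariance by 100,
# which is why its magnitude alone says little about strength.
print(sample_covariance(commercials, [100 * y for y in sales]))
```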
Correlation
• Correlation is a step ahead of covariance as it quantifies the relative strength of the
relationship between two variables. In simple terms, it is a measure of how the variables change
with respect to each other.
• Unlike covariance, the correlation is bounded. It can only take
values between -1 and +1.
• A positive correlation indicates that the variables have a direct relationship and they move in
the same direction. On the other hand, a negative correlation indicates that there is an inverse
relationship between the variables and they move in opposite directions. A ‘zero’ correlation
means that the two variables have no linear association.
Positive correlation Negative correlation
Value of ‘r’ Degree of relationship Value of ‘r’ Degree of relationship
r = +1 Perfect positive correlation r = -1 Perfect negative correlation
0.9 ≤ r < 1 Very strong -1 < r ≤ -0.9 Very strong
0.75 ≤ r ≤ 0.9 Strong -0.9 ≤ r ≤ -0.75 Strong
0.5 ≤ r ≤ 0.75 Moderate -0.75 ≤ r ≤ -0.5 Moderate
0.25 ≤ r ≤ 0.5 Weak -0.5 ≤ r ≤ -0.25 Weak
0 < r ≤ 0.25 Very weak -0.25 ≤ r < 0 Very weak
The sample data for ‘Stereo and Sound Equipment Store’ is given pertaining to the no. of television commercials shown and sales
during 10 weeks. The store manager wants to determine the association between television commercials and sales revenue. Calculate
Karl Pearson’s coefficient of correlation and interpret the result.
Week   No. of Commercials (x)   Sales ($100s) (y)   (x−x̄)   (y−ȳ)   (x−x̄)²   (y−ȳ)²   (x−x̄)(y−ȳ)
1      2                        50                  -1      -1      1        1        1
2      5                        57                  2       6       4        36       12
3      1                        41                  -2      -10     4        100      20
4      3                        54                  0       3       0        9        0
5      4                        54                  1       3       1        9        3
6      1                        38                  -2      -13     4        169      26
7      5                        63                  2       12      4        144      24
8      3                        48                  0       -3      0        9        0
9      4                        59                  1       8       1        64       8
10     2                        46                  -1      -5      1        25       5
Σx=30; Σy=510; x̄ = 3; ȳ = 51; Σ(x−x̄)² = 20; Σ(y−ȳ)² = 566; Σ(x−x̄)(y−ȳ) = 99
Cov(X,Y) = Σ(x−x̄)(y−ȳ)/(n-1)
= 99/(10-1)
= 11
Sx = Sqrt [Σ(x−x̄)²/(n-1)]
= Sqrt (20/9)
= Sqrt (2.22) = 1.49
Sy = Sqrt [Σ(y−ȳ)²/(n-1)]
= Sqrt (566/9)
= Sqrt (62.89) = 7.93
rxy = Cov(X,Y)/(Sx*Sy)
= 11/(1.49*7.93)
= 0.931
There exists a very strong positive correlation between television commercials and
sales of the store.
Karl Pearson’s Coefficient of Correlation (r)
Example:
A sample of 6 children was selected, and data about their age in years and weight in kilograms
were recorded as shown in the following table. Calculate the Karl Pearson’s coefficient of
correlation between age and weight.
Serial no.   Age (X)   Weight (Y)   XY        X²        Y²
1            7         12           84        49        144
2            6         8            48        36        64
3            8         12           96        64        144
4            5         10           50        25        100
5            6         11           66        36        121
6            9         13           117       81        169
Total        ∑X=41     ∑Y=66        ∑XY=461   ∑X²=291   ∑Y²=742
r = (n∑XY − ∑X∑Y)/Sqrt {[n∑X² − (∑X)²][n∑Y² − (∑Y)²]}
r = (6x461 − 41x66)/Sqrt {[6x291 − 41x41][6x742 − 66x66]}
r = (2766 − 2706)/Sqrt {[1746 − 1681][4452 − 4356]}
r = 60/Sqrt {65 x 96}
r = 60/78.9937
r = 0.760
There is strong direct (or strong positive) correlation between age and weight.
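The computational formula built from the column totals (∑X, ∑Y, ∑XY, ∑X², ∑Y²) can be sketched in Python. The function name pearson_r is illustrative; it gives the same result as dividing the covariance by the product of the two standard deviations.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's r from raw sums of X, Y, XY, X squared and Y squared."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

age = [7, 6, 8, 5, 6, 9]
weight = [12, 8, 12, 10, 11, 13]
r_age_weight = pearson_r(age, weight)

commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]
r_ads_sales = pearson_r(commercials, sales)

print(round(r_age_weight, 3), round(r_ads_sales, 3))
```

Working from unrounded sums avoids the small differences that appear when intermediate standard deviations are rounded to two decimals in a hand calculation.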
Topic-6

Random Variables
and
Probability Distributions
Topics to be discussed:
• Random Variables
• Discrete Random Variables
• Continuous random variables
• Expected Value and Variance of Discrete Random variables
• Binomial Probability Distribution
• Poisson Probability Distribution
• Normal Distribution
Random Variables
• A random variable is a numerical description of the outcome of an experiment.
• A random variable can be classified as being either discrete or continuous depending
on the numerical values it assumes.
• A discrete random variable may assume either a finite number of values or an
infinite sequence of values.
Experiment                          Random Variable (x)                 Possible values of x
Ticket sale in a hall of size 100   No. of viewers coming in the hall   0, 1, 2, 3, ..., 100
Inspect a lot of 25 chairs          No. of defective chairs             0, 1, 2, 3, ..., 25

• A continuous random variable may assume any numerical value in an interval or
collection of intervals.
Experiment                          Random Variable (x)                 Possible values of x
Filling soft drink bottle of 200 ml. Quantity filled in a bottle        0 ≤ x ≤ 200
Operating a bank                    Time interval between arrivals      x ≥ 0
Example: toss of two coins. Sample space: HH, HT, TH, TT
X = number of heads in a toss of two coins   Prob. of X
0                                            1/4 = 0.25
1                                            2/4 = 0.5
2                                            1/4 = 0.25
• Discrete random variable with a finite number of values:
Let x = number of TV sets sold at the store in one day
where x can take on 5 values (0, 1, 2, 3, 4)

• Discrete random variable with an infinite sequence of values:
Let x = number of customers arriving in one day in a shopping mall
where x can take on the values 0, 1, 2, . . .
We can count the customers arriving, but there is no finite upper
limit on the number that might arrive.
Quiz Time
Experiment                                       Random Variable (x)                                     Type
Take a 20-question examination                   No. of questions answered correctly
Observe cars arriving at a gas station in an hour   No. of cars arriving at the gas station
Audit 50 tax returns                             No. of returns containing errors
Observe an employee’s work                       No. of unproductive hours in an eight-hour working day
Weigh a shipment of goods                        Observed weight in kilograms
Discrete Probability Distributions
• The probability distribution for a random variable describes how
probabilities are distributed over the values of the random
variable.
• The probability distribution is defined by a probability function,
denoted by f(x), which provides the probability for each value of
the random variable.
• The required conditions for a discrete probability function are:
f(x) ≥ 0
Σf(x) = 1
• We can describe a discrete probability distribution with a table,
graph, or equation.
Example: ABL Appliances

Using past data on TV sales (below left), a tabular representation of
the probability distribution for TV sales (below right) was developed.
Units Sold   No. of Days      x   f(x)
0            80               0   80/200 = 0.40
1            50               1   0.25
2            40               2   0.20
3            10               3   0.05
4            20               4   0.10
Total        200                  1.00
Example: ABL Appliances
A graphical representation of the probability distribution for TV
sales in one day
[Figure: bar chart of Probability (vertical axis, 0 to .50) against Values of Random Variable ‘x’ (TV sales): 0, 1, 2, 3, 4]
Expected Value and Variance

• The expected value, or mean, of a random variable is a
measure of its central location.
• Expected value of a discrete random variable:
E(x) = μ = Σx.f(x)
• The variance summarizes the variability in the values of a
random variable.
• Variance of a discrete random variable:
Var(x) = σ² = Σ(x - μ)²f(x)
• The standard deviation, σ, is defined as the positive square
root of the variance.
Example: ABL Appliances
• Expected Value of a Discrete Random Variable

x f(x) x.f(x)
0 .40 0
1 .25 .25
2 .20 .40
3 .05 .15
4 .10 .40
Σx.f(x) =1.20 = E(x)

The expected number of TV sets sold in a day is 1.2


Example: ABL Appliances
• Variance and Standard Deviation of a Discrete Random Variable
x     x - μ    (x - μ)²   f(x)   (x - μ)²f(x)
0     -1.2     1.44       .40    .576
1     -0.2     0.04       .25    .010
2     0.8      0.64       .20    .128
3     1.8      3.24       .05    .162
4     2.8      7.84       .10    .784
Σ(x - μ)²f(x) = 1.66 = σ²
The variance of daily sales is 1.66 TV sets squared.
The standard deviation of sales is 1.29 TV sets.
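The ABL Appliances calculations can be sketched in Python, starting from the past data on days and ending with the expected value, variance and standard deviation. The function name discrete_mean_var is illustrative; it also checks the two required conditions f(x) ≥ 0 and Σf(x) = 1.

```python
import math

def discrete_mean_var(dist):
    """Mean and variance of a discrete distribution {x: f(x)}."""
    probabilities = dist.values()
    if any(p < 0 for p in probabilities) or not math.isclose(sum(probabilities), 1.0):
        raise ValueError("f(x) must be non-negative and sum to 1")
    mean = sum(x * p for x, p in dist.items())
    variance = sum((x - mean) ** 2 * p for x, p in dist.items())
    return mean, variance

# Past data: units sold -> number of days observed.
days = {0: 80, 1: 50, 2: 40, 3: 10, 4: 20}
total_days = sum(days.values())
tv_sales = {x: d / total_days for x, d in days.items()}   # f(x)

mean, variance = discrete_mean_var(tv_sales)
print(round(mean, 2), round(variance, 2), round(math.sqrt(variance), 2))
```

The same function can be applied to any table of values and probabilities, for example the returns of the two investment funds in the next example.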
Example: Investment Returns

Consider the return per $1000 for two types of investments.
                                    Investment
Prob.   Economic Condition    Passive Fund X   Aggressive Fund Y
0.2     Recession             - $25            - $200
0.5     Stable Economy        + $50            + $60
0.3     Expanding Economy     + $100           + $350


Investment Returns:
Expected Value

E(X) = μX = (-25)(.2) +(50)(.5) + (100)(.3) = $50

E(Y) = μY = (-200)(.2) +(60)(.5) + (350)(.3) = $95

Interpretation: Fund X is averaging a $50.00 return
and fund Y is averaging a $95.00 return per $1000
invested.
Investment Returns:
Standard Deviation

σX = Sqrt [(-25 − 50)²(.2) + (50 − 50)²(.5) + (100 − 50)²(.3)]
= 43.30

σY = Sqrt [(-200 − 95)²(.2) + (60 − 95)²(.5) + (350 − 95)²(.3)]
= 193.71

Interpretation: Even though fund Y has a higher
average return, it is subject to much more variability
and its potential loss is larger.
Exercise: Discrete Random Variables
Problem: In a survey, the no. of mobile phones per household was collected as below:
No. of mobiles 0 1 2 3 4 5 Total
No. of households 5 15 30 30 15 5 100
(i) Develop a probability distribution for the no. of mobile phones per household.
(ii) Calculate the expected no. of mobile phones per household and its standard deviation.
(iii) Find the values of P(X≤2), P(X>2) and P(2≤X≤4), where X is the no. of mobiles.
Solution:
x      f(x)    x − μ    (x − μ)²    (x − μ)²·f(x)
0      0.05    -2.5     6.25        0.3125
1      0.15    -1.5     2.25        0.3375
2      0.30    -0.5     0.25        0.0750
3      0.30    0.5      0.25        0.0750
4      0.15    1.5      2.25        0.3375
5      0.05    2.5      6.25        0.3125
Total  1.00                         1.45

Expected value: μ = E(X) = Σ X·P(X) = (0×0.05 + 1×0.15 + 2×0.30 + 3×0.30 + 4×0.15 + 5×0.05) = 2.5
Variance = Σ (x − μ)²·f(x) = 1.45
Std. dev. = √[Σ (x − μ)²·f(x)] = √1.45 = 1.204
P(X≤2) = sum of prob. for X = 0, 1, 2 = 0.05 + 0.15 + 0.30 = 0.50
P(X>2) = sum of prob. for X = 3, 4, 5 = 0.30 + 0.15 + 0.05 = 0.50, or 1 − P(X≤2)
P(2≤X≤4) = sum of prob. for X = 2, 3, 4 = 0.30 + 0.30 + 0.15 = 0.75
Practice Exercises
1. An online pharmacy claims to deliver medicines within 3 to 6 days of order. The manager collected
following data pertaining to the no. of days taken to deliver the orders:
No. of days 0 1 2 3 4 5 6 7 8
Probability 0 0 0.01 0.04 0.28 0.42 0.21 0.02 0.02

(i) What is the probability that delivery will be made within 3 to 6 days?
(ii) What is the probability that the delivery will be late?
(iii) What is the probability that the delivery will be early?
2. The following data show the no. of hours cars are parked in a parking slot, along with the
probabilities. The parking supervisor wants to know the expected no. of hours and the standard deviation of
the no. of hours cars are parked in the slot.

No. of hours 1 2 3 4 5 6 7 8
Probability 0.24 0.18 0.13 0.10 0.07 0.04 0.04 0.20
Binomial Probability Distribution
✓Properties of a Binomial Experiment
• The experiment consists of a sequence of n identical trials.
• Two outcomes, success and failure, are possible on each trial.
• The probability of a success, denoted by p, does not change from trial
to trial. The probability of failure q or (1-p) also does not change from
trial to trial.
• The trials are independent.

✓ Points to be remembered
If an experiment fulfils all four of the above conditions, it is called a
Binomial Experiment.
A single trial with only two possible outcomes is called a Bernoulli trial
(named after Jacob Bernoulli); a binomial experiment is a sequence of n
independent, identical Bernoulli trials.
Binomial Probability Distribution

• Binomial Probability Function

f(x) = [n! / (x!(n − x)!)] p^x (1 − p)^(n − x)
where
f(x) = the probability of x successes in n trials
n = the number of trials
p = the probability of success on any one trial
Example: Evans Electronics
• Binomial Probability Distribution
Evans Electronics is concerned about a low retention rate for
employees. On the basis of past experience, management
has seen a turnover of 10% of the hourly employees
annually. Thus, for any hourly employee chosen at random,
management estimates a probability of 0.1 that the person
will not be with the company next year.
Choosing 3 hourly employees at random, what is the
probability that exactly 1 of them will leave the company this year?

Let: p = .10, n = 3, x = 1
Example: Evans Electronics

• Using the Binomial Probability Function

f(x) = [n! / (x!(n − x)!)] p^x (1 − p)^(n − x)

f(1) = [3! / (1!(3 − 1)!)] (0.1)^1 (0.9)^2
= (3)(0.1)(0.81)
= 0.243
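The binomial probability function can be evaluated directly in code. The following is a minimal Python sketch and not part of the original slides; the function name `binomial_pmf` is an illustrative assumption, and `math.comb` supplies the n!/(x!(n − x)!) term.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Evans Electronics: n = 3 employees, p = 0.10 chance that each one leaves
n, p = 3, 0.10
distribution = {x: binomial_pmf(x, n, p) for x in range(n + 1)}

for x, prob in distribution.items():
    print(x, round(prob, 4))
```

The four probabilities correspond to the n = 3, p = .10 column of a binomial probability table.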
Example: Evans Electronics

• Using the Tables of Binomial Probabilities

p
n x .10 .15 .20 .25 .30 .35 .40 .45 .50
3 0 .7290 .6141 .5120 .4219 .3430 .2746 .2160 .1664 .1250
1 .2430 .3251 .3840 .4219 .4410 .4436 .4320 .4084 .3750
2 .0270 .0574 .0960 .1406 .1890 .2389 .2880 .3341 .3750
3 .0010 .0034 .0080 .0156 .0270 .0429 .0640 .0911 .1250
Sample: Binomial Probability Table
Example: Evans Electronics
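Binomial Probability using a Tree Diagram
(L = leaves, probability .1; S = stays, probability .9; x = number of workers who leave)
First Worker   Second Worker   Third Worker   Value of x   Probab.
Leaves (.1)    Leaves (.1)     L (.1)         3            0.1*0.1*0.1 = .0010
Leaves (.1)    Leaves (.1)     S (.9)         2            .0090
Leaves (.1)    Stays (.9)      L (.1)         2            .0090
Leaves (.1)    Stays (.9)      S (.9)         1            .0810
Stays (.9)     Leaves (.1)     L (.1)         2            .0090
Stays (.9)     Leaves (.1)     S (.9)         1            .0810
Stays (.9)     Stays (.9)      L (.1)         1            .0810
Stays (.9)     Stays (.9)      S (.9)         0            .7290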
Expected value and Variance

• Expected Value
E(x) = μ = np
• Variance
Var(x) = σ² = np(1 − p)
• Standard Deviation
SD(x) = σ = √[np(1 − p)]

• Example: Evans Electronics

E(x) = μ = 3(.1) = 0.3
Var(x) = σ² = 3(.1)(.9) = .27
SD(x) = σ = √[3(.1)(.9)] = .52 employees
Ex.2: Nitrogen gas is filled in automobile tyres to improve the ride quality. A filling station experiences
that 30% of the customers get nitrogen gas filled in their vehicle’s tyres. If 10 customers arrive at a
gas station then what is the probability that,
(i) All of them would fill nitrogen gas?
(ii) None of them would fill nitrogen gas?
(iii) Half of them would fill nitrogen gas?
(iv) Less than 8 customers would fill nitrogen gas?
(v) Find the mean no. of customers who would fill nitrogen gas.
Soln: n = 10, p = 0.3, 1 − p = 1 − 0.3 = 0.7
f(x) = nCx p^x (1 − p)^(n − x)
(i) f(10) = 10C10 (0.3)^10 (0.7)^0 = 1 × (0.000006) × (1) = 0.000006 (almost zero)
(ii) f(0) = 10C0 (0.3)^0 (0.7)^10 = 1 × (1) × (0.0282) = 0.0282
(iii) f(5) = 10C5 (0.3)^5 (0.7)^5 = [10! / (5!(10 − 5)!)] (0.3)^5 (0.7)^5 = 252 × (0.00243) × (0.16807) = 0.1029
(iv) P(x<8) = 1 − P(x≥8) = 1 − [f(8) + f(9) + f(10)] = 1 − (0.00145 + 0.00014 + 0.00001) = 0.9984
(v) Mean no. of customers who fill nitrogen gas = np = 10 × (0.3) = 3
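The parts of Ex.2 that involve "less than" or "at least" are sums of individual binomial probabilities. The following is a hedged Python sketch of that idea, not part of the original slides; the helper names `binomial_pmf` and `binomial_cdf` are illustrative assumptions.

```python
from math import comb

def binomial_pmf(x, n, p):
    """f(x) = nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binomial_cdf(x, n, p):
    """P(X <= x), obtained by summing f(0) through f(x)."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

n, p = 10, 0.3                        # 10 customers, 30% choose nitrogen
all_ten = binomial_pmf(10, n, p)      # (i)   all of them
none = binomial_pmf(0, n, p)          # (ii)  none of them
half = binomial_pmf(5, n, p)          # (iii) exactly half
fewer_than_8 = binomial_cdf(7, n, p)  # (iv)  P(x < 8) = P(x <= 7)
mean = n * p                          # (v)   expected number

print(round(none, 4), round(half, 4), round(fewer_than_8, 4), mean)
```

Summing f(0) through f(7) directly gives the same result as the complement rule 1 − [f(8) + f(9) + f(10)] used in the solution.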
Ex.3: Fifty percent of the people believe that the country is in recession. For a sample of 20
people, make the following calculations:
(i) Probability that 12 people believe the country is in recession. (0.1201)
(ii) Probability that at least 18 people believe the country is in recession. (0.0002)
(iii) Probability that more than three people believe the country is in recession.
P(x>3)=1-P(x≤3) = 1-[f(0)+f(1)+f(2)+f(3)] = (0.9987)
(iv) Expected no. of people who would say that the country is in recession. (10)
(v) Compute variance and std. deviation of the no. of people who believe the country is in
recession. (5, 2.236)

Ex.4: Twenty three percent of the vehicles are not covered by any insurance. On a special
checking day, 30 vehicles are checked randomly. What is the probability that more than 27
vehicles are insured? (here, p=0.77, n=30)
What is the expected no. of vehicles not covered by any insurance? What is the variance and
std. deviation? (here, n=30, p=0.23, q=0.77)
Poisson Probability Distribution
• Properties of a Poisson Experiment
• It is used to estimate the probability of no. of occurrences
over a specified period of time.
• The probability of an occurrence is the same for any two
intervals of equal length.
• The occurrence or nonoccurrence in any interval is
independent of the occurrence or nonoccurrence in any
other interval.
Poisson Probability Distribution
• Poisson Probability Function

f(x) = μ^x e^(−μ) / x!
where
f(x) = probability of x occurrences in an interval; x = 0, 1, 2, 3, …
μ = mean number of occurrences in an interval
e = 2.71828

✓ Both the mean and the variance of a Poisson distribution are equal to μ.
✓ The value of μ should be adjusted according to the time interval.
✓ The Poisson distribution approximates the binomial distribution for large n and small p.
Example: Mercy Hospital

Patients arrive at the emergency room of Mercy Hospital at the average rate of 6 per hour on weekend evenings. What is the probability of 4 arrivals in 30 minutes on a weekend evening?

• Using the Poisson Probability Function

μ = 6/hour = 3/half-hour, x = 4

f(4) = 3^4 (2.71828)^(−3) / 4! = .1680
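The key step in the Mercy Hospital example is rescaling the hourly rate to the interval of interest before applying the Poisson function. The following is a minimal Python sketch of that step, not part of the original slides; the names `poisson_pmf`, `rate_per_hour` and `interval_hours` are illustrative assumptions.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the interval mean is mu."""
    return mu**x * exp(-mu) / factorial(x)

rate_per_hour = 6      # average arrivals per hour
interval_hours = 0.5   # the question asks about a 30-minute interval
mu = rate_per_hour * interval_hours   # mean adjusted to the interval: 3

p_four_arrivals = poisson_pmf(4, mu)
print(round(p_four_arrivals, 4))
```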
Example: Mercy Hospital
• Using the Tables of Poisson Probabilities
μ
x 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
0 .1225 .1108 .1003 .0907 .0821 .0743 .0672 .0608 .0550 .0498
1 .2572 .2438 .2306 .2177 .2052 .1931 .1815 .1703 .1596 .1494
2 .2700 .2681 .2652 .2613 .2565 .2510 .2450 .2384 .2314 .2240
3 .1890 .1966 .2033 .2090 .2138 .2176 .2205 .2225 .2237 .2240
4 .0992 .1082 .1169 .1254 .1336 .1414 .1488 .1557 .1622 .1680
5 .0417 .0476 .0538 .0602 .0668 .0735 .0804 .0872 .0940 .1008
6 .0146 .0174 .0206 .0241 .0278 .0319 .0362 .0407 .0455 .0504
7 .0044 .0055 .0068 .0083 .0099 .0118 .0139 .0163 .0188 .0216
8 .0011 .0015 .0019 .0025 .0031 .0038 .0047 .0057 .0068 .0081
9 .0003 .0004 .0005 .0007 .0009 .0011 .0014 .0018 .0022 .0027
10 .0001 .0001 .0001 .0002 .0002 .0003 .0004 .0005 .0006 .0008
11 .0000 .0000 .0000 .0000 .0000 .0001 .0001 .0001 .0002 .0002
12 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0001
Poisson Probability Distribution:
An Approximation of Binomial Distribution

The Poisson probability distribution can be used as an


approximation of the binomial probability distribution
when p, the probability of success, is small and n, the
number of trials, is large.
• Approximation is good when p < .05 and n > 20
• Set μ = np and use the Poisson tables.
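The quality of the approximation can be checked numerically by computing the same probability both ways. The sketch below is a hedged Python illustration, not part of the original slides; it uses n = 1200 and p = 0.005 (values that satisfy the rule of thumb above), and the helper names are assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, mu):
    return mu**x * exp(-mu) / factorial(x)

n, p = 1200, 0.005   # large n, small p
mu = n * p           # Poisson mean set to np = 6

# P(X >= 3) = 1 - [P(0) + P(1) + P(2)], computed exactly and approximately
exact = 1 - sum(binomial_pmf(x, n, p) for x in range(3))
approx = 1 - sum(poisson_pmf(x, mu) for x in range(3))

print(round(exact, 4), round(approx, 4))
```

The two results should agree to about three decimal places, which is why Poisson tables are a practical substitute when binomial tables do not cover large n.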
Practice Exercises: Poisson Distribution

Ex.1: At the drive-thru window of a burger shop, an average of 10 customers arrive in a 30-minute interval.
What is the probability that exactly 5 customers arrive in 30 minutes?
What is the probability that exactly 5 customers arrive in a 15-minute interval?
Soln: For a 30-minute duration, μ = 10 and x = 5.
f(5) = e^(−10) × 10^5 / 5! = (0.0000454) × (100000/120) = (0.0000454) × (833.33) = 0.0378
For a 15-minute duration, μ = 5 and x = 5.
f(5) = e^(−5) × 5^5 / 5! = (0.006738) × (3125/120) = 0.1755
Ex.2: An average of 15 aircraft accidents occur each year. Compute
(i) The mean no. of accidents per month.
(ii) Prob. of no accidents during a month.
(iii) Prob. of exactly one accident during a month.
(iv) Prob. of more than one accident during a month.
Soln: P(X=x) = e^(−μ) × μ^x / x!
For a 12-month duration μ = 15; for a one-month duration μ = 15/12 = 1.25.
(i) Mean no. of accidents per month = 15/12 = 1.25
(ii) For x=0, P(X=0) = e^(−1.25) × (1.25)^0 / 0! = (0.2865) × (1/1) = 0.2865
(iii) For x=1, P(X=1) = e^(−1.25) × (1.25)^1 / 1! = (0.2865) × (1.25/1) = 0.3581
(iv) For x>1, P(X>1) = 1 − P(X≤1) = 1 − [P(X=0) + P(X=1)] = 1 − (0.2865 + 0.3581) = 0.3554

Ex.3: Phone calls arrive at the rate of 48 per hour in a call centre. Compute
(i) The mean no. of calls in 5 minutes duration. (4)
(ii) The prob. of receiving exactly three calls in 5 minutes. (0.1954)
(iii) Prob. of receiving exactly 10 calls in 15 minutes. (0.1048)
(iv) Prob. of receiving at least one call in 10 minutes. (0.9997)
Ex.4: It is estimated that 0.5 percent of the callers to the customer service department will receive
a busy signal. What is the probability that out of 1200 callers, at least 3 will receive a busy signal?
Soln: n = 1200, p = 0.5/100 = 0.005
Since n is large and p is very small, the Poisson approximation is used with μ = np = (1200) × (0.005) = 6.
P(X=x) = e^(−μ) × μ^x / x!
For x≥3, P(X≥3) = 1 − P(X<3) = 1 − [P(X=0) + P(X=1) + P(X=2)]
P(X=0) = e^(−6) × (6)^0 / 0! = 0.0025, P(X=1) = 0.0149, P(X=2) = 0.0446
P(X≥3) = 1 − (0.0025 + 0.0149 + 0.0446) = 0.938

Ex.5: Patients arrive at a hospital at the rate of 6 per hour. Find the probability that in a 90-minute duration (μ = 9)
(i) exactly 7 patients arrive in the hospital.
(ii) between 7 and 10 patients arrive in the hospital. P(7≤X≤10) = P(X=7) + P(X=8) + P(X=9) + P(X=10)
(iii) If a patient arrives at 11:30am, what is the probability that at least one other patient arrives before
11:45am?
For this 15-minute interval μ = 1.5, so P(X≥1) = 1 − P(X<1) = 1 − P(X=0) = 1 − e^(−1.5) = 1 − 0.2231 = 0.7769
Continuous Probability Distributions

• A continuous random variable is a variable that can assume any


value on a given range (can assume an uncountable number of
values)
• thickness of an item
• time required to complete a task
• temperature of a solution
• height, in inches

• These can potentially take on any value depending only on the


ability to precisely and accurately measure
The Normal Distribution

▪ Bell shaped
▪ Symmetrical
▪ Mean, median and mode are equal

Location is determined by the mean, μ.
Spread is determined by the standard deviation, σ.

The random variable has an infinite theoretical range: −∞ to +∞.

[Figure: bell-shaped curve f(X) centred at X = μ, where mean = median = mode, with spread σ]
The Normal Distribution Function

The formula for the normal probability density function is

f(X) = [1 / (σ√(2π))] e^(−(1/2)((X − μ)/σ)²)

where e = the mathematical constant approximated by 2.71828


π = the mathematical constant approximated by 3.14159
μ = the population mean
σ = the population standard deviation
X = any value of the continuous variable
Many Normal Distributions

By varying the parameters μ and σ, we obtain different normal distributions.
The Normal Distribution Shape

Changing μ shifts the distribution to the left or right.

Changing σ increases or decreases the spread.

[Figure: normal curve f(X) centred at X = μ with spread σ]
The Standardized Normal Distribution
• Any normal distribution (with any mean and standard deviation
combination) can be transformed into the standardized normal
distribution (Z distribution).
• Need to transform X units into Z units.
• The standardized normal distribution (Z) has a mean of 0 and a
standard deviation of 1.
• Translate from X to the standard normal variate ‘Z’ by
subtracting the mean of X and dividing by its standard deviation:
Z = (X − μ) / σ
The Z distribution always has mean = 0 and standard deviation = 1
The Standardized Normal Probability Density Function

• The formula for the standardized normal probability density function is

f(Z) = [1 / √(2π)] e^(−(1/2)Z²)

Where e = the mathematical constant approximated by 2.71828


π = the mathematical constant approximated by 3.14159
Z = any value of the standardized normal distribution
The Standardized Normal Distribution

• Also known as the ‘Z-distribution’


• Mean is 0
• Standard Deviation is 1
[Figure: standardized normal curve f(Z) centred at Z = 0]

Values on right of Z=0 have positive Z-values and values on left of Z=0 have negative Z-values
Example: Transforming X into Z
• If X is distributed normally with mean of 100 and standard
deviation of 50, the Z value for X = 200 is

Z = (X − μ) / σ = (200 − 100) / 50 = 2.0
• This says that X = 200 is two standard deviations (2
increments of 50 units) above the mean of 100.
• P(X>200) = P(Z>2) because (X=200) = (Z=2)
• P(0<X<200) = P(-2<Z<2)
Comparing X and Z values

X=100 X= 200 X (μ = 100, σ = 50)

Z=0 Z=2.0 Z (μ = 0, σ = 1)

Note that the shape of the distribution is the same, only the scale has changed.
We can express the problem in original units (X) or in standardized units (Z)
Probability and the Normal Curve
• The total area under the normal curve is equal to 1.
• The probability that X is greater than ‘a’ equals the area under the normal curve
bounded by ‘a’ and plus infinity (as indicated by the non-shaded area in the figure
below).
• The probability that X is less than ‘a’ equals the area under the normal curve
bounded by ‘a’ and minus infinity (as indicated by the shaded area in the figure
below).

Additionally, every normal curve (regardless of its mean or standard deviation)


conforms to the following "rule".
• About 68.3% of the area under the curve falls within 1 standard deviation of the
mean.
• About 95.4% of the area under the curve falls within 2 standard deviations of the
mean.
• About 99.7% of the area under the curve falls within 3 standard deviations of the
mean.
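The three percentages in this rule follow from the standard normal distribution and can be verified numerically. The following is a minimal Python sketch, not part of the original slides; it relies on `statistics.NormalDist` from the Python standard library (Python 3.8 or later), and the variable names are illustrative assumptions.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean: P(-k < Z < k)
areas = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} standard deviation(s): {area:.1%}")
```

Because any normal variable X can be converted to Z = (X − μ)/σ, the same areas hold for every normal curve regardless of its mean and standard deviation.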
Finding Normal Probabilities

Probability is measured by the area under


the curve
f(X)
P (a ≤ X ≤ b )
= P (a < X < b )
(Note that the probability
of any individual value is
zero)

X
a b
Probability as Area Under the Curve

The total area under the curve is 1.0, and the curve is symmetric, so half
is on the right of mean and half is on the left.

P(−∞ < X < μ) = 0.5    P(μ < X < ∞) = 0.5

P(−∞ < X < ∞) = 1.0

[Figure: normal curve f(X) with area 0.5 on each side of μ]
The Standardized Normal Table

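The standard normal table in the textbook gives the probability of being less than or equal to a desired value of Z (the area from −∞ to Z).

Example:
P(Z<2) = P(−∞<Z<2) = 0.50 + 0.4772 = 0.9772
P(Z>2) = 1.00 − 0.9772 = 0.0228

[Figure: standard normal curve with area 0.9772 to the left of Z = 2 (0.50 to the left of Z = 0 plus 0.4772 between Z = 0 and Z = 2) and area 0.0228 to the right of Z = 2]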
Finding Normal Probabilities

To find P(a<X<b) when X is distributed normally:

• Translate X-values to Z-values for ‘a’ and ‘b’ separately


• Draw a normal curve for the problem in terms of Z
• Identify the target region under the normal curve
• Use Standard normal table to get the required probability
Finding Normal Probabilities
• Let X represent the time it takes to download an image file from the internet.
• Suppose X is normal with mean 8.0 and standard deviation 5.0. Find P(X < 8.6).

Z = (X − μ) / σ = (8.6 − 8.0) / 5.0 = 0.12

P(X < 8.6) on the X scale (μ = 8, σ = 5) equals P(Z < 0.12) on the Z scale (μ = 0, σ = 1).

Solution: Finding P(Z < 0.12)

Standard Normal Probability Table (Portion)
Z     0.00    0.01    0.02
0.0   0.5000  0.5040  0.5080
0.1   0.5398  0.5438  0.5478
0.2   0.5793  0.5832  0.5871
0.3   0.6179  0.6217  0.6255

P(X < 8.6) = P(Z < 0.12)
= P(−∞ < Z < 0.12)
= 0.50 + 0.0478
= 0.5478
Finding Normal Probabilities

• Suppose X is normal with mean 8.0 and


standard deviation 5.0.
• Now Find P(X > 8.6)

Finding Normal Upper Tail Probabilities

Now Find P(X > 8.6)…


P(X > 8.6) = P(Z > 0.12)
= 1.00 − P(−∞ ≤ Z ≤ 0.12)
= 1.00 − 0.5478
= 0.4522

[Figure: standard normal curve with area 0.5478 to the left of Z = 0.12 and area 1.000 − 0.5478 = 0.4522 to the right]
Finding a normal probability between two values

Suppose X is normal with mean 8.0 and standard


deviation 5.0. Find P(8 < X < 8.6)

Calculate Z-values:

For X = 8: Z = (X − μ) / σ = (8 − 8) / 5 = 0
For X = 8.6: Z = (X − μ) / σ = (8.6 − 8) / 5 = 0.12

P(8 < X < 8.6)
= P(0 < Z < 0.12)
= P(−∞ < Z < 0.12) − P(−∞ < Z < 0)
= 0.5478 − 0.5 = 0.0478
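The table look-ups in the download-time example can be reproduced in code. The following is a minimal Python sketch, not part of the original slides; it uses `statistics.NormalDist` from the standard library in place of the printed table, and the variable names are illustrative assumptions.

```python
from statistics import NormalDist

# Download time X is assumed normal with mean 8.0 and standard deviation 5.0
download_time = NormalDist(mu=8.0, sigma=5.0)

z_value = (8.6 - 8.0) / 5.0                                       # Z = (X - mu) / sigma
p_below = download_time.cdf(8.6)                                  # P(X < 8.6)
p_above = 1 - p_below                                             # P(X > 8.6)
p_between = download_time.cdf(8.6) - download_time.cdf(8.0)       # P(8 < X < 8.6)
p_lower_side = download_time.cdf(8.0) - download_time.cdf(7.4)    # P(7.4 < X < 8)

print(round(z_value, 2), round(p_below, 4), round(p_above, 4),
      round(p_between, 4), round(p_lower_side, 4))
```

By symmetry of the normal curve, the last two probabilities are equal.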
Probabilities in the Lower Tail

• Suppose X is normal with mean 8.0 and


standard deviation 5.0.
• Now Find P(7.4 < X < 8)

Probabilities in the Lower Tail

Now Find P(7.4 < X < 8)…


P(7.4 < X < 8) = P(−0.12 < Z < 0)
= P(−∞ < Z < 0) − P(−∞ < Z < −0.12)
= 0.5 − 0.4522
= 0.0478

Standard Normal Probability Table (Portion)
Z      0.00    0.01    0.02
-0.2   0.4207  0.4168  0.4129
-0.1   0.4602  0.4562  0.4522
-0.0   0.5000  0.4960  0.4920
Practice Exercises
1. In a city, it is estimated that the maximum temperature is normally distributed with a mean of 23°C and
a standard deviation of 5°C. Calculate the number of days in this month in which it is expected to reach a
maximum of between 21°C and 27°C.

2. The mean weight of 500 college students is 70 kg and the standard deviation is 3 kg. Assuming that the
weight is normally distributed, determine how many students weigh:
a. between 60 kg and 75 kg
b. more than 90 kg
c. less than 64 kg
d. exactly 64 kg
e. 64 kg or less

3. For borrowers with good credit scores, the mean debt amount is $15,000. Assuming the debt amounts
to be normally distributed with standard deviation $3000, calculate the probability that
a. debt for a borrower is more than $18,000
b. debt for a borrower is less than $10,000
c. debt for a borrower is between $12,000 and $18,000
Topic 7
Sampling and Sampling Distributions
Chapter topics
• Concept of sampling
• Probability and nonprobability sampling methods
• Concept of sampling distributions
• Sampling distribution of the mean
• For normal populations
• Using the Central Limit Theorem
• Sampling distribution of a proportion
• Probabilities using sampling distributions
Why Sample?

• Selecting a sample is less time-consuming than selecting


every item in the population (census).

• Selecting a sample is less costly than selecting every item in


the population.

• An analysis of a sample is less cumbersome and more


practical than an analysis of the entire population.
Selection of Class Representatives

[Figure: two samples drawn from a population of male and female students]
• Unbiased sample: an unbiased, representative sample drawn at random from the entire population.
• Biased sample: a biased, unrepresentative sample consisting of more female students than male students.
Sampling Process begins with a Sampling Frame

• The sampling frame is a listing of items that make up the population


• Frames are data sources such as population lists, directories, or maps
• Inaccurate or biased results can occur if a frame excludes certain
portions of the population
• Using different frames to generate data can lead to dissimilar
conclusions
Types of Sampling

Sampling
• Non-Probability Sampling: Convenience, Judgement, Quota, Snowball
• Probability Sampling: Simple Random, Stratified, Systematic, Cluster
Types of Sampling: Non-probability Sampling
In non-probability sampling, items included are chosen without
considering their probability of occurrence.
• In convenience sampling, items are selected based only on the fact that they are
easy, inexpensive, or convenient to sample.
• In judgment sampling, one gets the opinions of pre-selected individuals or
experts in the subject matter.
• In quota sampling, individuals or items are selected on the basis of specific traits
or qualities. Some fixed number of units are selected including all the traits.
• In snowball sampling, research units are selected with the help of other research
units. It is used where potential participants are difficult to identify. For example,
customers in life insurance, network marketing, survey on ‘social evils’ etc.
Types of Sampling: Probability Sampling

In probability sampling, items in the sample are chosen on the


basis of known probabilities.

Probability sampling methods: Simple Random, Stratified, Systematic, Cluster


Probability Sampling: Simple Random Sampling

• Every individual or item from the frame has an equal chance of


being selected

• Selection may be with replacement (selected individual is


returned to frame for possible reselection) or without
replacement (selected individual isn’t returned to the frame).

• Samples are obtained using either lottery method or random


number tables or computer random number generators.
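A computer random number generator makes both selection schemes straightforward. The following is a hedged Python sketch, not part of the original slides; the frame of 850 numbered items, the sample size of 12 and the fixed seed are illustrative assumptions.

```python
import random

# Hypothetical sampling frame: items numbered 1 to 850
frame = list(range(1, 851))

rng = random.Random(42)  # fixed seed only so the illustration is repeatable

# Without replacement: an item can appear at most once in the sample
without_replacement = rng.sample(frame, k=12)

# With replacement: a selected item is returned to the frame and may be drawn again
with_replacement = rng.choices(frame, k=12)

print(sorted(without_replacement))
print(sorted(with_replacement))
```

In both cases every item in the frame has the same chance of being chosen on each draw.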
Selecting a Simple Random Sample using ‘Random Number Table’

Sampling Frame For Population With 850 Items
Item Name    Item #
Bev R.       001
Ulan X.      002
Roger F.     003
. .
Peter S.     848
Joann P.     849
Paul F.      850

Portion Of A Random Number Table
49280 88924 35779 00283 81163 07275
11100 02340 12860 74697 96644 89439
09893 23997 20048 49420 88872 08401

Using the first 12 random numbers: an item is selected only if the first 3 digits of the number are between 001 and 850.
49280 - select (item 492)    11100 - select (item 111)
88924 - ignore               02340 - select (item 023)
35779 - select (item 357)    12860 - select (item 128)
00283 - select (item 002)    74697 - select (item 746)
81163 - select (item 811)    96644 - ignore
07275 - select (item 072)    89439 - ignore
Probability Sampling: Stratified Random Sampling

• Divide population into two or more subgroups (called strata) according to some common
characteristic
• A simple random sample is selected from each subgroup, with sample sizes proportional
to strata sizes
• Samples from subgroups are combined into one
• This is a common technique when sampling population of voters, stratifying across racial
or socio-economic lines.
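Proportional allocation across strata can be sketched in a few lines. The following Python example is illustrative only and not part of the original slides; the four strata, their sizes, the function name `stratified_sample` and the seed are all hypothetical.

```python
import random

def stratified_sample(strata, n, seed=0):
    """Draw a simple random sample from each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    total = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        # Proportional allocation; with awkward sizes, rounding may make the
        # combined sample differ slightly from n
        k = round(n * len(items) / total)
        sample[name] = rng.sample(items, k)
    return sample

# Hypothetical population of 1000 units split into 4 strata
strata = {
    "A": list(range(0, 400)),
    "B": list(range(400, 700)),
    "C": list(range(700, 900)),
    "D": list(range(900, 1000)),
}

sample = stratified_sample(strata, n=50)
print({name: len(units) for name, units in sample.items()})
```

The per-stratum samples would then be combined into a single sample for analysis.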

[Figure: population divided into 4 strata]
Probability Sampling: Systematic Sampling

• Decide on sample size: n


• Divide frame of N individuals into groups of k
individuals: k=N/n (called sampling interval)
• Randomly select one individual from the 1st group
• Select every kth individual thereafter

N = 40 First Group
n=4
k = 10
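The four steps above translate directly into code. The following is a minimal Python sketch, not part of the original slides; the function name `systematic_sample` and the seed are illustrative assumptions, and the frame is assumed to divide evenly into n groups.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th individual after a random start in the first group."""
    k = len(frame) // n          # sampling interval k = N / n
    rng = random.Random(seed)
    start = rng.randrange(k)     # random position within the first group of k
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 41))       # N = 40 individuals numbered 1 to 40
sample = systematic_sample(frame, n=4, seed=1)
print(sample)
```

With N = 40 and n = 4, the selected individuals are always 10 positions apart.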
Probability Sampling: Cluster Sampling

• Population is divided into several “clusters,” each representative of the


population
• A simple random sample of clusters is selected
• All items in the selected clusters can be used, or items can be chosen from a
cluster using another probability sampling technique
• A common application of cluster sampling involves election exit polls, where
certain election districts are selected and sampled.

[Figure: population divided into 16 clusters, with randomly selected clusters forming the sample]
Probability Sample: Comparing Sampling Methods

• Simple random sample and Systematic sample


✓Simple to use
✓May not be a good representation of the population’s
underlying characteristics
• Stratified sample
✓Ensures representation of individuals across the entire
population
• Cluster sample
✓More cost effective
✓Less efficient (need larger sample to acquire the same level
of precision)
Sampling Distribution

• A sampling distribution is a distribution of all of the possible values of a sample statistic (mean, std. dev., proportion etc.) for a given size of sample selected from a population.
• For example, suppose you sample 50 students from your college regarding their mean GPA. If you obtain different samples of size 50, you will compute a different mean for each sample. We are interested in the distribution of all potential mean GPAs (x̄) we might calculate for all samples of 50 students.
Sampling Distribution
• If we consider the process of selecting a simple random sample as an experiment, the sample mean x̄ is the numerical description of the outcome of the experiment. Thus, the sample mean x̄ is a random variable.
• As a result, just like other random variables, x̄ has a mean or expected value, a standard deviation, and a probability distribution.
• Because the various possible values of x̄ are the result of different simple random samples, the probability distribution of x̄ is called the sampling distribution of x̄.
• Knowledge of this sampling distribution and its properties will enable us to make probability statements about how close the sample mean x̄ is to the population mean μ.
How Large is Large Enough?
• For most distributions, n > 30 will give a sampling
distribution that is nearly normal
• For fairly symmetric distributions, n > 15
• For normal population distributions, the sampling
distribution of the mean is always normally distributed
Example
• Suppose a population has mean μ = 8 and standard
deviation σ = 3. Suppose a random sample of size n = 36 is
selected.
• What is the probability that the sample mean is between
7.8 and 8.2?
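One way to work this example is to treat the sample mean as approximately normal (n = 36 exceeds 30) with mean μ and standard deviation σ/√n. The following is a hedged Python sketch, not part of the original slides; it uses `statistics.NormalDist` instead of the printed table, and the variable names are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 8, 3, 36

# Standard deviation of the sample mean: sigma / sqrt(n) = 0.5
standard_error = sigma / sqrt(n)

# Sampling distribution of the sample mean, approximately normal for n = 36
sample_mean = NormalDist(mu, standard_error)

z_low = (7.8 - mu) / standard_error    # -0.4
z_high = (8.2 - mu) / standard_error   # +0.4
probability = sample_mean.cdf(8.2) - sample_mean.cdf(7.8)

print(round(z_low, 2), round(z_high, 2), round(probability, 4))
```

The answer is P(−0.4 < Z < 0.4), which the standard normal table gives as 0.6554 − 0.3446 = 0.3108.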
Practice Exercises
1. The mean expenditure of all the visitors in a restaurant is Rs.2000 with a std. deviation of
Rs.250. A random sample of 40 customers was taken. Find the probability that
(a) the mean expenditure of customers is more than Rs.1928, (b) the mean expenditure of
customers is between Rs.1950 and Rs.2030.
(a) Z = (x̄ − μ) / (σ/√n) = (1928 − 2000) / (250/√40) = -1.82

P(x̄ > 1928)
= P(Z > -1.82)
= P(−∞ < Z < ∞) − P(−∞ < Z < -1.82)
= 1 − 0.0344
= 0.9656

(b) P(1950 < x̄ < 2030)
= P(-1.26 < Z < 0.76)
= 0.7764 − 0.1038
= 0.6726
2. The numerical population of grade point averages at a college has mean 2.61 and
standard deviation 0.5. If a random sample of size 100 is taken from the population,
what is the probability that the sample mean will be between 2.51 and 2.71?
3. A prototype automotive tire has a mean design life of 38,500 miles with a standard
deviation of 2,500 miles. Five such tires are manufactured and tested. Find the
probability that the sample mean will be less than 36,000 miles. Assume that the
distribution of lifetimes of such tires is normal.
4. An automobile battery manufacturer claims that its midgrade battery has a mean life
of 50 months with a standard deviation of 6 months. Suppose the distribution of battery
lives of this particular brand is approximately normal.
(a) On the assumption that the manufacturer’s claims are true, find the probability that
a randomly selected battery of this type will last less than 48 months. (Normal
distribution problem)
(b) On the same assumption, find the probability that the mean life of a random sample
of 36 such batteries will be less than 48 months. (Sampling distribution problem)
Sampling Distribution for Population Proportion

• Let p = the proportion of the population having some characteristic
• The sample proportion (p̂) provides an estimate of the population proportion (p):
p̂ = x / n = (number of items in the sample having the characteristic of interest) / (sample size)
• ‘x’ is the number of elements in the sample that possess the characteristic of
interest and ‘n’ is the sample size.
• 0 ≤ p̂ ≤ 1
• p̂ is approximately normally distributed when n is large
(assuming sampling with replacement from a finite population or without replacement from an
infinite population)
Sampling Distribution of p̂

• Approximated by a normal distribution if:
np ≥ 5 and n(1 − p) ≥ 5

where
μ_p̂ = p and σ_p̂ = √[p(1 − p) / n]

(where p = population proportion)

[Figure: sampling distribution P(p̂) plotted against p̂ from 0 to 1]


Z-Value for Proportions
Standardize p̂ to a Z-value with the formula:

Z = (p̂ − p) / σ_p̂ = (p̂ − p) / √[p(1 − p) / n]
Example
• If the true proportion of voters who support Proposition
A is 0.4, what is the probability that a sample of size
200 yields a sample proportion between 0.40 and 0.45?
• i.e. if p = 0.4 and n = 200, what is P(0.40 ≤ p̂ ≤ 0.45)?
Example
(continued)
if p = 0.4 and n = 200, what is P(0.40 ≤ p̂ ≤ 0.45)?

Find σ_p̂: σ_p̂ = √[0.4(1 − 0.4) / 200] = 0.03464
Standardize: P(0.40 ≤ p̂ ≤ 0.45) = P((0.40 − 0.40)/0.03464 ≤ Z ≤ (0.45 − 0.40)/0.03464) = P(0 ≤ Z ≤ 1.44)

Use standardized normal table: P(0 ≤ Z ≤ 1.44) = 0.4251

[Figure: the sampling distribution of p̂ between 0.40 and 0.45 standardized to the area 0.4251 between Z = 0 and Z = 1.44]
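The Proposition A example can also be computed without a table. The following is a minimal Python sketch, not part of the original slides; it uses `statistics.NormalDist` from the standard library, and the variable names are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.4, 200

# Check that the normal approximation is reasonable
approximation_ok = n * p >= 5 and n * (1 - p) >= 5

# Standard deviation of the sample proportion: sqrt(p(1 - p) / n)
sigma_p_hat = sqrt(p * (1 - p) / n)

# Sampling distribution of the sample proportion, approximately normal
sample_proportion = NormalDist(p, sigma_p_hat)

z_upper = (0.45 - p) / sigma_p_hat
probability = sample_proportion.cdf(0.45) - sample_proportion.cdf(0.40)

print(approximation_ok, round(sigma_p_hat, 5), round(z_upper, 2), round(probability, 4))
```

The computed probability differs slightly from the table value of 0.4251 because the table rounds Z to 1.44 before the look-up.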
Practice Exercise

1.The Grocery Manufacturers of America reported that 76% of consumers read the ingredients
listed on a product’s label. Assume the population proportion is p = .76 and a sample of 400
consumers is selected from the population.
(a) Show the sampling distribution of the sample proportion p̂, where p̂ is the proportion of the
sampled consumers who read the ingredients listed on a product’s label.
(b) What is the probability that the sample proportion will be within ±.03 of the population
proportion?
(c) Answer part (b) for a sample of 750 consumers.

2. The Food Marketing Institute shows that 17% of households spend more than $100 per week
on groceries. Assume the population proportion is p = .17 and a sample of 800 households will
be selected from the population.
(a) Show the sampling distribution of p̂, the sample proportion of households spending more
than $100 per week on groceries.
(b) What is the probability that the sample proportion will be within ±.02 of the population
proportion?
(c) Answer part (b) for a sample of 1600 households.
Point Estimation
• Point estimation is the process of using the sample data available to estimate the unknown
value of a parameter. The point estimate obtained from the data will be a single number like
sample mean, sample standard deviation, sample proportion etc.
• Suppose we have an unknown population parameter, such as a population mean μ or a
population proportion p, which we'd like to estimate. For example, suppose we are interested
in estimating:
p = the (unknown) proportion of American college students, 18-24, who have a
smart phone
μ = the (unknown) mean number of days it takes patients to respond to a drug
In either case, we can't possibly survey the entire population. That is, we can neither survey all
American college students between the ages of 18 and 24 nor survey all patients with
a specific disease. So, of course, we do what comes naturally and take a random sample from
the population, and use the resulting data to estimate the value of the population parameter.
Of course, we want the estimate to be "good" in some way.
The following table shows a sample of 30 managers of a company out of the total
2500 managers.
• The sample mean annual salary (x̄ = $51,814) is a point estimate of the population mean
salary (μ = $51,800).
• Similarly, the sample std. dev. (s = $3348) is a point estimate of the population std. dev.
(σ = $4000).
• The proportion of managers who have completed training (p̂ = 0.63) is a point
estimate of the population proportion (p = 0.60).
Properties of a Point Estimator
Unbiasedness: If the expected value of the sample statistic is equal to the population
parameter being estimated, the sample statistic is said to be an unbiased estimator of
the population parameter.
In discussing the sampling distributions of the sample mean and the sample proportion,
we stated that E(x̄) = μ and E(p̂) = p. Thus, both x̄ and p̂ are unbiased estimators of their
corresponding population parameters μ and p. In the case of the sample standard
deviation s and the sample variance s², it can be shown that E(s²) = σ².
Efficiency: The most efficient point estimator is the unbiased estimator with the smallest
variance. The variance measures how much the estimate varies from one sample to
another, so the estimator with the smallest variance varies the least across samples.
Consistency: A third property associated with good point estimators is consistency. A
point estimator is consistent if the values of the point estimator tend to become closer to
the population parameter as the sample size becomes larger. In other words, a large
sample size tends to provide a better point estimate than a small sample size.
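The properties above can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it assumes the salary population is normal with the μ and σ quoted in the managers example, and the function name `sample_means`, the seed, and the number of repetitions are all hypothetical choices. It checks that the average of many sample means sits close to μ (unbiasedness) and that the sample means spread less when n is larger (consistency).

```python
import random
import statistics

def sample_means(mu, sigma, n, reps, rng):
    """Draw `reps` samples of size n from Normal(mu, sigma) and return each sample's mean."""
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]

rng = random.Random(42)          # fixed seed so the run is repeatable
MU, SIGMA = 51_800, 4_000        # population values from the managers example (normality is assumed)

means_30 = sample_means(MU, SIGMA, 30, 2000, rng)
means_300 = sample_means(MU, SIGMA, 300, 2000, rng)

avg_of_means = statistics.fmean(means_30)    # should sit near MU if x-bar is unbiased
spread_30 = statistics.pstdev(means_30)      # variability of x-bar for n = 30
spread_300 = statistics.pstdev(means_300)    # variability of x-bar for n = 300

print(f"average of sample means (n=30): {avg_of_means:.0f}")
print(f"std. dev. of sample means: n=30 -> {spread_30:.0f}, n=300 -> {spread_300:.0f}")
```

The spread of the sample means should be roughly σ/√n, which is why the larger sample gives estimates that cluster more tightly around μ.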
CONSTRUCTION OF
QUESTIONNAIRES
Types of Questionnaire
Structured Questionnaire
• definite, concrete, and pre-determined questions
• prepared in advance, not constructed on the spot
• additional questions may be asked only when some clarification is required
Unstructured Questionnaire
• set of questions which are not structured in advance
• questions may be adjusted as per the need
• these questionnaires are flexible in nature
Construction of Questionnaire
A. General Considerations
• Well-defined goals are the best way to assure a good
questionnaire design. Questionnaires are developed
directly to address the goals of study.
• Keep it short and simple to maximize responses.
• Try to eliminate unimportant questions…involve
experts and decision-makers while doing this.
• Provide a well written cover page…it gives the first
impression and provides you the best chance to
convince the respondent to complete the survey.
• Give your questionnaire a title that is short and
meaningful to the respondents.
• Place the most important items in the first half
of the questionnaire. Respondents often send
back partially completed questionnaires.
• Leave adequate space for respondents to make
comments and provide valuable information.
• Use professional printing methods and
materials for the questionnaires.
B. Language
• Wording of a question is extremely important.
Researchers strive for objectivity in surveys and,
therefore, must be careful not to lead the
respondent into giving a desired answer.
• Questionnaires require special care so that
questions are clear and straightforward in
four important respects: simple language,
common concepts, manageable tasks and
widespread information.
• The nature and structure of the population to be
studied should be kept in mind. Technical terms
and jargon should be avoided to the maximum
possible extent.
• Common concepts should be used in the
questionnaire. Mathematical abstractions tend
to be difficult for the general public.
C. Type of Questions
Researchers use two basic types of questions:
• Closed-ended (dichotomous, multiple choice & scales)
• Open-ended
Examples of each kind of questions are:
Closed-ended: Dichotomous Questions
1. Do you have a car: (a) Yes (b) No
2. What kind of petrol do you use: (a) Normal (b) Premium
3. Your working hours are: (a) Fixed (b) Flexible

Multiple Choice Question


1. Which kind of car do you own?
(a) Hatchback (b) Sedan (c) SUV (d) Luxury
(e) Others……….
2. How much do you spend monthly on grocery items?
(a) Up to Rs. 2500 (b) Rs. 2501-5000 (c) More than Rs. 5000
3. Which one do you like the most?
(a) Pizza Hut (b) Domino’s (c) Pizza Corner (d) Nirula’s
(e) Any other (specify…………)
4. Which brand of apparel do you like?
(a) Allen Solly (b) Louis Philippe (c) Levi’s (d) Adidas
(e) Spykar (f) Reebok (g) Any other (specify…………)
• Scales like Ordinal scale and Likert scale are often used while
designing a questionnaire, e.g.,
(A) How would you rate the performance of your car?
1. Excellent 2. Very Good 3. Good 4. Fair 5. Poor
(B) How much do you agree with the following statements?
Sr. No. | Statement                     | Strongly Agree | Agree | Neither agree nor disagree | Disagree | Strongly Disagree
1       | My car is very fuel efficient |                |       |                            |          |
2       | My car gives me max. comfort  |                |       |                            |          |
3       | My car is maintenance free    |                |       |                            |          |
4       | My car is most stylish        |                |       |                            |          |
Open-ended: Numeric Open Ended
1. How much money did you spend on fuel last week?
…………….
2. How many children do you have? …………
3. What is your age? …………
Text Open Ended
1. How can the performance of your car be
improved? …………….
2. What training programme did you attend last week?
…………
3. You like Nescafe because …………………
D. Question Contents
• Clearly specify the issue
Incorrect: Which newspaper you read?
Correct: Which newspaper(s) do you read generally?
• Use simple terminology
Incorrect: Do you think thermal wear provides immunity?
Correct: Do you think that thermal wear protects from cold?
• Avoid Ambiguity in questioning
Incorrect: How often do you visit McDonald’s in a month?
(a) Never (b) Occasionally (c) Often (d) Regularly
Correct: How often do you visit McDonald’s in a month?
(a) Never (b) Less than 2 times (c) 2-5 times (d) More than 5
times
• Avoid Leading Questions
Incorrect: Do you think working mothers should buy ready-to-eat food
knowing that it might contain preservatives?
(a) Yes (b) No (c) Can’t say
Correct: Do you think working mothers should buy ready-to-eat food?
(a) Yes (b) No (c) Can’t say
• Avoid Sensitive/Loaded Questions
Incorrect: Do you support the dowry system? (a) Yes (b) No
Correct: Do you think that most Indian men still support dowry system?
(a) Yes (b) No
• Avoid implicit choices in questions
Incorrect: Would you prefer to work fixed hours, five days a week? (a)
Yes (b) No
Correct: Would you prefer to work fixed hours, five days a week or a
flexi time of 40 hrs per week?
• Avoid double-barrel questions
Incorrect: Do you think Nokia and Samsung have a wide variety of
touch phones? (a) Yes (b) No
Correct: A wide variety of touch phones is available for-
(a) Nokia (b) Samsung (c) Others
Incorrect: Did the training make you feel motivated and effective in
your job? (a) Yes (b) No
Correct: Did the training make you feel motivated in your job?
Did the training make you more effective in your job?
Questionnaire Structure
• General Instructions and Greetings
Sir/Madam,
I, Shahrukh Khan, a student of PGDM-I semester at School
of Management Sciences, Varanasi, am conducting a
market survey of consumers on ‘The Most Preferred Pizza
Brand in Varanasi City.’ In this regard, I seek your valuable
cooperation and suggestions which could be helpful for our
study. I promise to keep the information provided by you
strictly confidential.
• Opening Questions
- Personal questions like name, age, gender, contact
information, occupation, income etc.
- Non-threatening questions like ‘Have you ever
tried Pizza’ or ‘How much do you like Pizza’?
• Study Questions
- Questions related to objectives
1. How often do you order pizzas from outside?
(a) Once in 2-3 months (b) Once a month (c) Once in a fortnight (d) Once a
week (e) 2-3 times a week (f) Everyday
2. How is it purchased?
(a) Personal visit/take away (b) Home delivery
3. What are the preferred days of ordering pizzas?
(a) Week days (b) Weekends (c) Special Occasions
4. What is the general time of ordering?
(a) Lunch time (b) Dinner time (c) Evening (d) Anytime
5. How much do you generally spend on pizzas per month?
(a) Up to Rs.500 (b) Rs.501-1000 (c) Rs.1001-1500 (d) More than Rs.1500
6. From where do you order pizzas generally?
(a) Pizza Hut (b) Domino’s (c) Pizza Corner (d) Nirula’s (e) Local Bakery Shops (f)
Any other (specify…………….)
• End Matter
✓ A questionnaire generally ends by seeking suggestions or remarks
on the subject matter
✓ The researcher must not forget to acknowledge the inputs of the
respondent and thank them for sparing their time.
Assignment # 1
The management of the magazine ‘Outlook’ finds that despite
the changes in the publication’s frequency, it is still facing
stiff competition from its rival ‘India Today’. Thus they
want to conduct a comparative survey of the two
magazines and assess whether Outlook has a different
market.
• Identify some of the research objectives for this survey
• Prepare a questionnaire of at least 10-15 study
questions based on the research objectives
Assignment # 2
‘Rainbow Seven’ is a regional brand of packaged water whose
market share has remained fairly stable for the past few years.
The management wants to increase the brand’s market share
through the use of a more effective advertising theme. For the
last two years, Rainbow Seven’s advertising has featured a well-
known actress who presents a message ‘Safe and secure,
always’.
The company knows it needs to make the brand more
progressive and needs to reposition it. Thus they want to carry
out a short study to know the perception about ‘Rainbow
Seven’ as compared to the other local/regional brands. They
feel that such information will help them structure the
positioning exercise better.
You are required to design a structured questionnaire
containing 10-15 study questions of
different types.
Topic-8
Confidence Interval Estimation
Chapter discussions

• Concept of confidence intervals


• Developing confidence interval estimates for population
mean (σ known and unknown)
• Developing confidence interval estimates for population
proportion
• Determining sample size for mean and proportion
Point and Interval Estimates

• A point estimate is a single number,


• A confidence interval provides additional
information about the variability of the estimate

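[Figure: a number line with the lower confidence limit on the left, the point estimate in the centre and the upper confidence limit on the right; the distance between the two limits is the width of the confidence interval.]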
Point Estimators

We can estimate a population parameter with a sample statistic (a point estimator):

Population Parameter | Sample Statistic (Point Estimator)
Mean μ               | X̄
Std. Deviation σ     | S
Confidence Intervals

• Much uncertainty is associated with a point estimate of a


population parameter.

• An interval estimate provides more information about a


population characteristic than a point estimate.

• Such interval estimates are called confidence intervals.


Confidence Interval Estimate
• An interval gives a range of values:
• Gives information about closeness to unknown population
parameters
• Stated in terms of level of confidence
• e.g. 99% or 95% or 90% confident
• can never be 100% confident
Estimation Process

[Figure: a random sample with mean X̄ = 50 is drawn from a population whose mean μ is unknown; the resulting statement is "I am 95% confident that μ is between 40 & 60."]
General Formula
• The general formula for all confidence
intervals is:
Point Estimate ± (Critical Value)(Standard Error)
• Point Estimate is the sample statistic estimating the population
parameter of interest

• Critical Value is a table value based on the sampling distribution of the


point estimate and the desired confidence level

• Standard Error is the standard deviation of the point estimate


Confidence Level (1 - α)
• Confidence Level
• Confidence that the interval will contain the unknown population parameter
• A percentage (less than 100%)
• Suppose confidence level = 95%
• Also written (1 - α) = 0.95 (so α = 0.05, called the level of significance)
• A relative frequency interpretation:
• 95% of all the confidence intervals that can be constructed will contain
the unknown true parameter
• A specific interval either will contain or will not contain the true parameter
• No probability involved in a specific interval
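The relative frequency interpretation can be checked by simulation. The following is a rough sketch, not a definitive implementation: it assumes a normal population with a known σ, builds each interval as point estimate ± (critical value)(standard error), and counts how often the interval captures the true mean. The function name `coverage`, the population values (μ = 50, σ = 10, n = 25) and the seed are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, conf_level, reps, rng):
    """Fraction of simulated intervals (sigma known) that contain the true mean mu."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # critical value for alpha/2 in each tail
    margin = z * sigma / math.sqrt(n)                     # critical value x standard error
    hits = 0
    for _ in range(reps):
        xbar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / reps

# Hypothetical population: mean 50, std. dev. 10, samples of size 25
rate = coverage(mu=50, sigma=10, n=25, conf_level=0.95, reps=4000, rng=random.Random(1))
print(f"share of intervals containing mu: {rate:.3f}")
```

Each individual interval either contains μ or does not; the 95% describes the long-run share of intervals that do.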
Confidence Intervals
[Diagram: confidence intervals are developed for the population mean (σ known or σ unknown) and for the population proportion.]
Confidence Interval for μ
(σ Known)
• Assumptions
• Population standard deviation σ is known
• Population is normally distributed
• If population is not normal, use large sample

• Confidence interval estimate:

Point Estimate ± (Critical Value)(Standard Error)

\[ \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]

where X̄ is the point estimate,
Zα/2 is the normal distribution critical value for a probability of α/2 in each tail, and
σ/√n is the standard error.
Finding the Critical Value, Zα/2
Zα/2 = 1.96
• Consider a 95% confidence interval:
1 − α = 0.95 so α = 0.05

α α
= 0.025 = 0.025
2 2

Z units: Zα/2 = -1.96 0 Zα/2 = 1.96


Lower Upper
X units: Confidence Point Estimate Confidence
Limit Limit
Common Levels of Confidence
• Commonly used confidence levels are 90%, 95%,
and 99%

Confidence Level | Confidence Coefficient (1 − α) | Zα/2 value
80%              | 0.80                           | 1.28
90%              | 0.90                           | 1.645
95%              | 0.95                           | 1.96
98%              | 0.98                           | 2.33
99%              | 0.99                           | 2.58
99.8%            | 0.998                          | 3.09
99.9%            | 0.999                          | 3.29
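The Zα/2 values in the table can be reproduced without a printed table. This is a minimal sketch using the standard library's `statistics.NormalDist` (Python 3.8 or later); the helper name `z_critical` is a hypothetical choice for illustration.

```python
from statistics import NormalDist

def z_critical(conf_level):
    """Two-sided critical value Z_{alpha/2} for a given confidence level (e.g. 0.95)."""
    alpha = 1 - conf_level
    # The upper critical value leaves alpha/2 in the right tail of the standard normal
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.80, 0.90, 0.95, 0.98, 0.99, 0.998, 0.999):
    print(f"{level:.1%}  Z = {z_critical(level):.3f}")
```

Printed tables round these values, which is why 1.96 and 2.58 are used in hand calculations.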
Example:1
• A sample of 11 circuits from a large normal population has a
mean resistance of 2.20 ohms. We know from past testing that
the population standard deviation is 0.35 ohms. Determine a
95% confidence interval for the true mean resistance of the
population.
• Solution: The confidence interval is

\[ \bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 2.20 \pm 1.96\left(\frac{0.35}{\sqrt{11}}\right) = 2.20 \pm 0.2068 \]

1.9932 ≤ μ ≤ 2.4068
Interpretation

• We are 95% confident that the true mean
resistance is between 1.9932 and 2.4068
ohms.
• The true mean may or may not be in
this interval, but 95% of intervals formed in
this manner will contain the true mean.
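The calculation in Example 1 can be sketched in a few lines. This is an illustrative helper under the σ-known assumptions listed earlier; the name `z_interval` is hypothetical, and the critical value is computed from the normal distribution rather than read from a table, so it differs from 1.96 only beyond the third decimal place.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf_level=0.95):
    """Confidence interval for the mean when sigma is known: xbar +/- Z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

# Example 1: n = 11 circuits, sample mean 2.20 ohms, sigma = 0.35 ohms
low, high = z_interval(2.20, 0.35, 11, 0.95)
print(f"{low:.4f} <= mu <= {high:.4f}")
```

Calling the same function with `conf_level=0.99` gives a wider interval, which is the price of higher confidence.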
Example:2
• The gross salary of all the employees of ABC Technologies has a std. deviation of Rs.5200. To study the
food habits of employees, a random sample of 40 employees was taken and the mean salary was found to
be Rs.41,500. Calculate the interval estimate of gross salaries of all the employees of ABC technologies
using (a) 95% confidence level, and (b) 99% confidence level.

• Solution: The confidence interval estimate is given by X̄ ± Zα/2(σ/√n), with
X̄ = 41500, σ = 5200, n = 40
(a) Zα/2 = 1.96 (for conf. level of 95%)
Hence, the interval estimate of the mean gross salary is:
X̄ - Zα/2(σ/√n) ≤ μ ≤ X̄ + Zα/2(σ/√n)
41500 – 1.96(5200/√40) ≤ μ ≤ 41500 + 1.96(5200/√40)
41500 – (1611.50) ≤ μ ≤ 41500 + (1611.50)
39888.50 ≤ μ ≤ 43111.50
At the 95% confidence level, the mean gross salary of all the employees of ABC Technologies is between Rs. 39,888.50 and Rs. 43,111.50.
(b) Zα/2 = 2.58 (for conf. level of 99%)
41500 – 2.58(5200/√40) ≤ μ ≤ 41500 + 2.58(5200/√40)
41500 – (2121.26) ≤ μ ≤ 41500 + (2121.26)
39378.74 ≤ μ ≤ 43621.26. The confidence interval estimate is (39378.74, 43621.26) at the 99% level.
Do You Ever Truly Know σ?
• Probably not!

• In most of the real world business situations, σ is not known.

• If there is a situation where σ is known then µ is also known (since to calculate σ


you need to know µ.)

• If you truly know µ there would be no need to collect a sample to estimate it.
Confidence Interval for μ
(σ Unknown)

• If the population standard deviation σ is unknown, we


can substitute the sample standard deviation, S
• This introduces extra uncertainty, since S varies from
sample to sample
• So we use the t-distribution instead of the normal
distribution
Confidence Interval for μ
(σ Unknown)
• Assumptions
• Population standard deviation is unknown
• Population is normally distributed
• If population is not normal, use large sample
• Use Student’s t Distribution
• Confidence Interval Estimate:

\[ \bar{X} \pm t_{\alpha/2} \frac{S}{\sqrt{n}} \]

(where tα/2 is the critical value of the t distribution with n - 1 degrees of freedom and
an area of α/2 in each tail)
Student’s t-distribution

• The t is a family of distributions


• The tα/2 value depends on degrees of freedom
(d.f.)
• Number of observations that are free to vary after sample
mean has been calculated

d.f. = n - 1
Degrees of Freedom (df)
Idea: Number of observations that are free to vary
after sample mean has been calculated
Example: Suppose the mean of 3 numbers is 8.0

Let X1 = 7
Let X2 = 8
What is X3?
If the mean of these three values is 8.0, then X3 must be 9
(i.e., X3 is not free to vary)
Here, n = 3, so degrees of freedom = n – 1 = 3 – 1 = 2
(2 values can be any numbers, but the third is not free to vary
for a given mean)
Student’s t Distribution
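Note: t → Z as n increases

t-distributions are bell-shaped and symmetric, but have ‘fatter’ tails than the normal.
[Figure: the standard normal curve plotted against t (df = 15) and t (df = 10), all centred at 0; the lower the degrees of freedom, the heavier the tails.]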
Selected t distribution values
With comparison to the Z value

Confidence Level | t (10 d.f.) | t (20 d.f.) | t (30 d.f.) | Z (∞ d.f.)
0.80             | 1.372       | 1.325       | 1.310       | 1.28
0.90             | 1.812       | 1.725       | 1.697       | 1.645
0.95             | 2.228       | 2.086       | 2.042       | 1.96
0.99             | 3.169       | 2.845       | 2.750       | 2.58

Note: t → Z as n increases. The tabulated entries are tα/2 values.
Example-1: t-distribution confidence interval

A random sample of n = 25 has X̄ = 50 and S = 8.
Form a 95% confidence interval for μ.

• d.f. = n – 1 = 24, so tα/2 = t0.025 = 2.064

The confidence interval is

\[ \bar{X} \pm t_{\alpha/2}\frac{S}{\sqrt{n}} = 50 \pm (2.064)\frac{8}{\sqrt{25}} \]

46.6976 ≤ μ ≤ 53.3024
Example-2: t-distribution confidence interval
A restaurant owner thinks that the time spent by customers in the restaurant is directly
proportional to sales. He took a random sample of 20 customers and found the mean
time spent by them to be 55 minutes with a std. deviation of 7.8 minutes. Find the
interval estimate of mean time spent by all the customers visiting that restaurant using (a)
95%, and (b) 99% confidence level.
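Confidence interval estimate:

\[ \bar{X} \pm t_{\alpha/2}\frac{S}{\sqrt{n}} \]

(since sample size is small and population std. deviation is unknown)
(a) X̄ = 55, S = 7.8, n = 20 and tα/2 = 2.093 (at 95% conf. level and 19 d.f.)
Hence, the interval estimate of the mean time spent is:
X̄ - tα/2(S/√n) ≤ μ ≤ X̄ + tα/2(S/√n)
55 – 2.093(7.8/√20) ≤ μ ≤ 55 + 2.093(7.8/√20)
(55 – 3.65) ≤ μ ≤ (55 + 3.65)
51.35 ≤ μ ≤ 58.65, i.e. the mean time spent by all the customers is between 51.35 mins. and
58.65 mins. at 95% confidence level.
(b) At 99% conf. level and 19 d.f., tα/2 = 2.861, so
55 – 2.861(7.8/√20) ≤ μ ≤ 55 + 2.861(7.8/√20)
(55 – 4.99) ≤ μ ≤ (55 + 4.99)
50.01 ≤ μ ≤ 59.99 at 99% confidence level.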
Ex.3: The average annual premium for automobile insurance in the United States is $1503 (Insure.com
website, March 6, 2014). The following annual premiums ($) are representative of the website’s findings
for the state of Michigan:
1905, 3112, 2312, 2725, 2545, 2981, 2677, 2525, 2627, 2600, 2370, 2857, 2962, 2545, 2675, 2184, 2529,
2115, 2332, 2442
Assuming the population to be approximately normal, provide
(i) A point estimate of the mean annual automobile insurance premium in Michigan. [Hint: calculate
X̄ for Michigan state; Ans: $2551]
(ii) Develop a 95% confidence interval for the mean annual automobile insurance premium in
Michigan. [Hint: calculate S for Michigan state using X̄ and find the confidence interval using
tα/2 with (n-1) d.f.; Ans: (2409.99, 2692.01)]
(iii) Does the 95% confidence interval for the annual automobile insurance premium in Michigan
include the national average for the United States? What is your interpretation of the relationship
between auto insurance premiums in Michigan and the national average?
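The t-based interval for Ex.3 can be worked from the raw data as sketched below. This is only an illustration: the standard library has no inverse t-distribution, so the 95% critical values are hard-coded from a t table (2.093 for 19 d.f. and 2.064 for 24 d.f., the values used in these notes), and the names `T_95` and `t_interval` are hypothetical.

```python
import math
import statistics

# Two-tail t critical values for 95% confidence, keyed by degrees of freedom (from a t table)
T_95 = {19: 2.093, 24: 2.064}

def t_interval(xbar, s, n, t_crit):
    """Confidence interval for the mean when sigma is unknown: xbar +/- t * S / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Ex.3: Michigan annual premiums ($)
premiums = [1905, 3112, 2312, 2725, 2545, 2981, 2677, 2525, 2627, 2600,
            2370, 2857, 2962, 2545, 2675, 2184, 2529, 2115, 2332, 2442]
n = len(premiums)
xbar = statistics.mean(premiums)      # point estimate of the mean premium
s = statistics.stdev(premiums)        # sample standard deviation (divides by n - 1)
low, high = t_interval(xbar, s, n, T_95[n - 1])

print(f"x-bar = {xbar}, S = {s:.2f}")
print(f"95% CI: ({low:.2f}, {high:.2f})")
print("national average of $1503 inside the interval?", low <= 1503 <= high)

# Example-1 (n = 25, x-bar = 50, S = 8) uses the same helper with summary statistics
ex1_low, ex1_high = t_interval(50, 8, 25, T_95[24])
```

If the national average falls outside the interval, the data suggest that mean premiums in Michigan differ from the national figure; here the whole interval lies above it.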
Summary of Interval Estimation for Population Mean
Interval Estimation for Population Proportion
The general form of an interval estimate of a population proportion is:
Point Estimate ± (Critical Value)(Standard Error)
p̄ ± (critical value × standard error)
p̄ ± (margin of error)
We know that the sampling distribution of p̄ can be approximated by a normal
distribution whenever np ≥ 5 and n(1−p) ≥ 5. The mean of the sampling distribution of
p̄ is the population proportion p, and the standard error of p̄ is

\[ \sigma_{\bar{p}} = \sqrt{\frac{p(1-p)}{n}} \]

But since p is unknown and has to be estimated, p is replaced by p̄ in the formula for the standard error.
Hence,

\[ \text{margin of error} = Z_{\alpha/2}\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \]

The confidence interval estimate for the population proportion is:

\[ \bar{p} \pm Z_{\alpha/2}\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \]
Ex-1: The Consumer Reports National Research Center conducted a telephone survey of 2000 adults to
learn about the major economic concerns for the future (Consumer reports, January 2009). The survey
results showed that 1760 of the respondents think the future health of social security is a major economic
concern.
(i) What is the point estimate of the population proportion of adults who think the future health of Social
Security is a major economic concern?
Point estimate of population proportion: p̄ = 1760/2000 = 0.88
(ii) At 90% confidence, what is the margin of error?

\[ \text{margin of error} = Z_{\alpha/2}\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} = 1.645\sqrt{\frac{0.88(1-0.88)}{2000}} = 0.012 \]

(iii) Develop a 90% confidence interval for the population proportion of adults who think the future health
of Social Security is a major economic concern. Ans: (0.88-0.012, 0.88+0.012) = (0.868, 0.892)
(iv) Develop a 95% confidence interval for this population proportion. Ans: (0.88-0.014, 0.88+0.014) = (0.866, 0.894)
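The proportion example can be reproduced with the short sketch below. It is a minimal illustration of p̄ ± Zα/2 √(p̄(1 − p̄)/n), assuming the normal approximation holds; the function name `proportion_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, conf_level):
    """Return (p_bar, margin of error) for a population proportion."""
    p_bar = successes / n
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    margin = z * math.sqrt(p_bar * (1 - p_bar) / n)
    return p_bar, margin

# Ex-1: 1760 of 2000 respondents see Social Security as a major concern
p_bar, m90 = proportion_interval(1760, 2000, 0.90)
_, m95 = proportion_interval(1760, 2000, 0.95)

print(f"p-bar = {p_bar:.2f}")
print(f"90% CI: ({p_bar - m90:.3f}, {p_bar + m90:.3f})")
print(f"95% CI: ({p_bar - m95:.3f}, {p_bar + m95:.3f})")
```

Before relying on the result it is worth confirming that n·p̄ and n·(1 − p̄) are both at least 5, as the approximation requires; here they are 1760 and 240.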
Determining Sample Size

[Diagram: sample size can be determined for the mean or for the proportion.]
Sampling Error
• The required sample size can be found to reach a
desired margin of error (e) with a specified level of
confidence (1 - α)
• The margin of error is also called sampling error
• the amount of imprecision in the estimate of the
population parameter
• the amount added and subtracted to the point estimate
to form the confidence interval
Determining Sample Size

For the mean, the sampling error (margin of error) e is the amount added to and subtracted from X̄ in the confidence interval:

\[ \bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \qquad e = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]

Now solve for n to get

\[ n = \frac{Z_{\alpha/2}^{2}\,\sigma^{2}}{e^{2}} \]
Determining Sample Size
• To determine the required sample size for the mean, you
must know:

• The desired level of confidence (1 - α), which determines the critical
value, Zα/2
• The acceptable sampling error, e
• The standard deviation, σ
Required Sample Size Example

Ex.1: If σ = 45, what sample size is needed to
estimate the mean within a ±5 margin of error
with 90% confidence?

\[ n = \frac{Z_{\alpha/2}^{2}\,\sigma^{2}}{e^{2}} = \frac{(1.645)^{2}(45)^{2}}{5^{2}} = 219.19 \]

So the required sample size is n = 220
(Always round up)
Ex.2: The expenditure of customers in a restaurant is studied by a researcher and he
comes to know that the std. dev. of expenditure is Rs.30. If the researcher is willing to
have a margin of error of 3.2, what should be the sample size at 99% confidence level?
\[ n = \frac{Z_{\alpha/2}^{2}\,\sigma^{2}}{e^{2}} = \frac{(2.58)^{2}(30)^{2}}{(3.2)^{2}} = 585.03 \]

Ans: 586 (approx.)
Ex-3: The operations manager of a large production plant would like to estimate the
average amount of time workers take to assemble a new electronic component. After
observing a number of workers assembling similar devices, she finds that the standard
deviation is 6 minutes. How large a sample of workers should be taken if she wishes to
estimate the mean assembly time with a permissible error of 20 seconds? Assume the
level of confidence of 99%.
Ans: 2157 (approx.), since σ = 6 minutes = 360 seconds and n = (2.58)²(360)²/(20)² = 2156.67
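The three sample-size exercises follow the same pattern and can be checked with a small helper. This is a sketch under the assumption that the critical value is taken from the table used in these notes (1.645 for 90%, 2.58 for 99%); the function name `sample_size_for_mean` is hypothetical.

```python
import math

def sample_size_for_mean(z, sigma, e):
    """Smallest n with margin of error at most e: n = (z * sigma / e)^2, rounded up."""
    return math.ceil((z * sigma / e) ** 2)

n_ex1 = sample_size_for_mean(z=1.645, sigma=45, e=5)     # Ex.1: 90% confidence
n_ex2 = sample_size_for_mean(z=2.58, sigma=30, e=3.2)    # Ex.2: 99% confidence
# Ex-3: sigma and e must share units, so 6 minutes becomes 360 seconds
n_ex3 = sample_size_for_mean(z=2.58, sigma=6 * 60, e=20)

print(n_ex1, n_ex2, n_ex3)
```

Rounding up rather than to the nearest integer guarantees the margin of error is no larger than the one requested.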
Determining Sample Size

For the proportion, the margin of error is

\[ e = Z_{\alpha/2}\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \]

Now solve for n to get

\[ n = \frac{Z_{\alpha/2}^{2}\,\bar{p}(1-\bar{p})}{e^{2}} \]
Determining Sample Size
However, we cannot use this formula to compute the sample size that will provide the desired margin of
error because p̄ will not be known until we select the sample. What we need, then, is a planning value
for p̄ that can be used to make the computation. Using p* to denote the planning value for p̄, the
following formula can be used to compute the sample size that will provide a margin of error of size ‘e’:

\[ n = \frac{Z_{\alpha/2}^{2}\,p^{*}(1-p^{*})}{e^{2}} \]
In practice, the planning value p* can be chosen by one of the following procedures.
1. Use the sample proportion from a previous sample of the same or similar units.
2. Use a pilot study to select a preliminary sample. The sample proportion from this sample can be
used as the planning value, p*.
3. Use judgment or a “best guess” for the value of p*.
4. If none of the preceding alternatives apply, use a planning value of p* = 0.50.
Required Sample Size Example

How large a sample would be necessary to


estimate the true proportion defective in a large
population within ±3%, with 95% confidence?
Assume a pilot sample yields p* = 0.2.
What if there is no pilot sample available?
Required Sample Size Example

Solution:
For 95% confidence, use Zα/2 = 1.96
e = 0.03
p* = 0.2, so use this to estimate p

\[ n = \frac{Z_{\alpha/2}^{2}\,p^{*}(1-p^{*})}{e^{2}} = \frac{(1.96)^{2}(0.2)(1-0.2)}{(0.03)^{2}} = 682.95 \]

So use n = 683

If there is no pilot sample available, then take p* = 0.5:

\[ n = \frac{(1.96)^{2}(0.5)(1-0.5)}{(0.03)^{2}} = 1067.11 \]

So use n = 1068 (approx.)
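The effect of the planning value p* on the required sample size can be seen in the sketch below. It is an illustration only, using the table value Zα/2 = 1.96; the function name `sample_size_for_proportion` is hypothetical.

```python
import math

def sample_size_for_proportion(z, p_star, e):
    """Smallest n giving margin of error at most e for planning value p_star."""
    return math.ceil(z ** 2 * p_star * (1 - p_star) / e ** 2)

n_pilot = sample_size_for_proportion(z=1.96, p_star=0.2, e=0.03)      # pilot sample available
n_no_pilot = sample_size_for_proportion(z=1.96, p_star=0.5, e=0.03)   # no prior information

print(n_pilot, n_no_pilot)

# p*(1 - p*) is largest at p* = 0.5, so 0.5 is the most conservative planning value
largest = max(range(1, 100), key=lambda k: (k / 100) * (1 - k / 100)) / 100
print("p* that maximises p*(1 - p*):", largest)
```

Because p*(1 − p*) peaks at 0.5, that choice never understates the sample size, which is why it is the fallback when nothing else is known.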
Topic-9
Hypothesis Testing-1
Topics of discussion:
*Concept of null and alternative hypothesis
*Developing null and alternative hypothesis
*Errors in hypothesis testing
*Testing of hypothesis for population mean (σ known and unknown)
*Testing of hypothesis for population proportion
What is a Hypothesis?
• A hypothesis is a claim
(assumption) about a
population parameter:

• population mean
Example: The mean monthly cell phone bill
in this city is μ = $42
• population proportion

Example: The proportion of adults in this


city with cell phones is p = 0.68
The Null Hypothesis, H0
• States the claim or assertion to be tested
Example: The average number of TV sets in U.S.
households is equal to three ( H0 : μ = 3 )
• Is always about a population parameter,
not about a sample statistic

Correct: H0: μ = 3    Incorrect: H0: X̄ = 3
The Null Hypothesis, H0
• Null hypothesis assumes that the statement is true
• Similar to the notion of innocent until proven guilty

• Refers to the status quo or historical value


• Always contains “=”, “≤” or “≥” sign
• May or may not be rejected
The Alternative Hypothesis, Ha
• Is the opposite of the null hypothesis
• e.g., The average number of TV sets in U.S. households is not
equal to 3 ( Ha: μ ≠ 3 )
• Challenges the status quo
• Never contains the “=”, “≤” or “≥” sign
• May or may not be proven
• Is generally the hypothesis that the researcher is trying
to prove
Hypothesis Formulation
1. A particular automobile currently attains a fuel efficiency of 24
miles per gallon in city driving. A product research group has
developed a new fuel injection system designed to increase the miles-
per-gallon rating. Formulate the null and alternative hypotheses to test
whether the new fuel injection system provides better results.
Solution: The null hypothesis represents the status quo, so we
assume that the new system brings no improvement,
i.e., mean fuel efficiency is less than or equal to 24 mpg. The alternative
hypothesis, which is the counter statement the researchers hope to support, is that the new
system attains more than 24 miles per gallon.
The null and alternative hypotheses may be written as
H0: μ ≤ 24
Ha: μ > 24
Hypothesis Formulation
2. Because of high production-changeover time and costs, a director of manufacturing
must convince management that a proposed manufacturing method reduces costs
before the new method can be implemented. The current production method
operates with a mean cost of $220 per hour. A research study will measure the cost of
the new method over a sample production period. Develop the null and alternative
hypotheses most appropriate for this study.
Solution: H0: μ ≥ 220 & Ha: μ < 220

3. A production line operation is designed to fill cartons with laundry detergent to a


mean weight of 32 ounces. A sample of cartons is periodically selected and weighed to
determine whether underfilling or overfilling is occurring. If the sample data lead to a
conclusion of underfilling or overfilling, the production line will be shut down and
adjusted to obtain proper filling. Formulate the null and alternative hypotheses that
will help in deciding whether to shut down and adjust the production line.
Solution: H0: μ = 32 & Ha: μ ≠ 32
The Hypothesis Testing Process
• Claim: The population mean age is 50. Test whether the sample mean
differs significantly?
H0: μ = 50, Ha: μ ≠ 50
• Sample the population and find sample mean.

Population

Sample
The Hypothesis Testing Process
• Suppose the sample mean age was X̄ = 20
• This is significantly lower than the claimed mean
population age of 50.
• If the null hypothesis were true, the probability of getting
such a different sample mean would be very small, so you
reject the null hypothesis.
The Hypothesis Testing Process

[Figure: sampling distribution of X̄ centred at μ = 50 (if H0 is true), with the observed sample mean X̄ = 20 far out in the lower tail.]
If it is unlikely that you would get a sample mean of this value when in fact μ = 50 were the
population mean, then you reject the null hypothesis that μ = 50.
The Test Statistic and Critical Values

• If the sample mean is close to the assumed population


mean, the null hypothesis is not rejected.

• If the sample mean is far from the assumed population


mean, the null hypothesis is rejected.

• How far is “far enough” to reject H0?

• The critical value of a test statistic creates a “line in the


sand” for decision making --- it answers the question of
how far is far enough.
The Test Statistic and Critical Values
[Figure: sampling distribution of the test statistic. The critical values separate a central region of non-rejection from a region of rejection in each tail, where the statistic is "too far away" from the mean of the sampling distribution.]


Errors in Hypothesis Testing
• Type I Error
✓Reject a true null hypothesis
✓Considered a serious type of error
✓The probability of a Type-I Error is α (known as the level of significance)
✓Set by researcher in advance
✓Referred to as producer’s risk in business terms
• Type II Error
✓Failure to reject false null hypothesis
✓The probability of a Type-II Error is β
✓Referred to as consumer’s risk in business terms
Errors in Hypothesis Testing

Possible Hypothesis Test Outcomes

Decision         | Actual Situation: H0 True       | Actual Situation: H0 False
Do Not Reject H0 | No Error, Probability (1 – α)   | Type II Error, Probability β
Reject H0        | Type I Error, Probability α     | No Error, Probability (1 – β)
Errors in Hypothesis Testing
• The confidence coefficient (1-α) is the probability of not rejecting H0
when it is true.

• The confidence level of a hypothesis test is (1-α)*100%.

• The power of a statistical test (1-β) is the probability of rejecting H0


when it is false.
Level of Significance and Rejection Region in
Two-tailed Test

H0: μ = 3
Ha: μ ≠ 3
Level of significance = α, split as α/2 in each tail.
[Figure: sampling distribution with a rejection region of area α/2 beyond each of the two critical values.]

This is a two-tail test because there is a rejection region in both tails


Level of Significance and Rejection Region in
One-tailed Test

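Upper-tail test: H0: μ ≤ 3, Ha: μ > 3. The rejection region of area α lies in the right (upper) tail, beyond the critical value.
Lower-tail test: H0: μ ≥ 3, Ha: μ < 3. The rejection region of area α lies in the left (lower) tail, below the critical value.
[Figure: two sampling distributions centred at 0, one with the rejection region shaded in the upper tail and one with it shaded in the lower tail.]

These are one-tail tests because the rejection region lies only in one tail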
Hypothesis Tests for the Mean

[Diagram: hypothesis tests for μ use the Z test when σ is known and the t test when σ is unknown.]
Z-test of Hypothesis for the Mean
(σ Known)
• Convert the sample statistic (X̄) to a ZSTAT test statistic
The test statistic is:

\[ Z_{STAT} = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \]
Critical Value Approach to Testing

• For a two-tail test for the mean, σ known:
• Convert sample statistic (X̄) to test statistic (ZSTAT)
• Determine the critical Z values for a specified
level of significance α from a table or computer
• Decision Rule: If the test statistic falls in the rejection
region, reject H0; otherwise do not reject H0
Two-Tail Tests
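H0: μ = 3
Ha: μ ≠ 3
◼ There are two cutoff values (critical values), defining the regions of rejection
[Figure: sampling distribution of X̄ centred at 3 with an area of α/2 in each tail. On the Z scale, reject H0 below the lower critical value -Zα/2 or above the upper critical value +Zα/2; do not reject H0 in between.]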
Steps in Hypothesis Testing
1. State the null hypothesis, H0 and the alternative
hypothesis, Ha
2. Choose the level of significance, α, and the sample
size, n
3. Determine the appropriate test statistic and
sampling distribution
4. Determine the critical values that divide the
rejection and non-rejection regions
Steps in Hypothesis Testing
5. Collect data and compute the value of the test
statistic
6. Make the statistical decision and state the
managerial conclusion. If the test statistic falls into
the non-rejection region, do not reject the null
hypothesis H0. If the test statistic falls into the
rejection region, reject the null hypothesis.
Express the managerial conclusion in the context
of the problem.
Z values for One-tail and Two-tail Tests
Level of Significance One-tail value Two-tail value

0.10 1.28 1.645

0.05 1.645 1.96

0.025 1.96 2.24

0.01 2.33 2.58

0.005 2.58 2.81


Hypothesis Testing Example-1
Test the claim that the mean number of TV sets in US
households is equal to 3. Assume σ = 0.8

Step-1: State the appropriate null and alternative hypotheses
◼ H0: μ = 3, Ha: μ ≠ 3 (This is a two-tail test)
Step-2: Specify the desired level of significance and the sample size
◼ Suppose that α = 0.05 and n = 100 are chosen for this test
Step-3: Determine the appropriate technique
◼ σ is assumed known so this is a Z test.
Step-4: Determine the critical values
◼ For α = 0.05 the critical Z values are ±1.96
Step-5: Collect the data and compute the test statistic
◼ Suppose the sample results are n = 100, X̄ = 2.84 (σ = 0.8 is assumed known)
So the test statistic is:

\[ Z_{STAT} = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} = \frac{2.84-3}{0.8/\sqrt{100}} = \frac{-0.16}{0.08} = -2.0 \]
Hypothesis Testing Example
Step-6: Check whether the test statistic lies in the rejection region?

/2 = 0.025 /2 = 0.025

Reject H0 if Reject H0 Do not reject H0 Reject H0


ZSTAT < -1.96 or -Zα/2 = -1.96 0 +Zα/2 = +1.96
ZSTAT > 1.96;
otherwise do
not reject H0 Here, ZSTAT = -2.0 < -1.96, so the
test statistic is in the rejection
region
Hypothesis Testing Example
Step-6 (continued): Reach a decision and interpret the result
[Figure: the same curve (α/2 = 0.025 in each tail, critical values ±1.96) with ZSTAT = −2.0 marked in the lower rejection region]
Since ZSTAT = −2.0 < −1.96, reject the null hypothesis
and conclude there is sufficient evidence that the mean
number of TVs in US homes is not equal to 3.
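The six steps of the critical value approach can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `z_test_two_tailed` and its argument names are assumptions made for this sketch, not part of the slides.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tailed(sample_mean, mu0, sigma, n, alpha=0.05):
    """Critical value approach for H0: mu = mu0 vs Ha: mu != mu0 (sigma known)."""
    # Step 5: test statistic ZSTAT = (X-bar - mu) / (sigma / sqrt(n))
    z_stat = (sample_mean - mu0) / (sigma / sqrt(n))
    # Step 4: upper critical value +Z(alpha/2); the lower one is its negative
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 6: reject H0 when ZSTAT falls in either tail
    reject = abs(z_stat) > z_crit
    return z_stat, z_crit, reject

# TV-sets example: X-bar = 2.84, mu0 = 3, sigma = 0.8, n = 100
z_stat, z_crit, reject = z_test_two_tailed(2.84, 3, 0.8, 100)
print(round(z_stat, 2), round(z_crit, 2), reject)
```

With the example's numbers the sketch reproduces ZSTAT = −2.0 and the critical value 1.96, so H0 is rejected.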
Hypothesis Testing Example-2
The average return on equity mutual funds of company ‘X’ is supposed to be 12% p.a. with a std.
deviation of 2%. A sample of 40 investors is randomly selected and their mean return is found to
be 11.3%. Test at 5% level of significance whether there is any significant difference between the
sample mean and population mean. What if the level of significance is 5%?

H0: μ = 12 and Ha: μ ≠ 12 (This is a two-tail test)
Here α = 0.05, X̄ = 11.3, μ = 12, σ = 2, n = 40
ZSTAT = (X̄ − μ)/(σ/√n) = (11.3 – 12)/(2/√40) = -2.21; ZCRT = 1.96 (at 5% l.s.)
Rule: if |ZSTAT| < |ZCRT|, do not reject H0; if |ZSTAT| > |ZCRT|, reject H0
|ZSTAT| = |-2.21| = 2.21, which is more than 1.96 (ZCRT), hence we reject H0
Check whether the test statistic lies in the rejection region
[Figure: standard normal curve with rejection regions of area α/2 = 0.025 in each tail, beyond −Zα/2 = −1.96 and +Zα/2 = +1.96]
Decision rule: Reject H0 if ZSTAT < −1.96 or ZSTAT > 1.96; otherwise do not reject H0.
Here, ZSTAT = -2.21 < -1.96, so the test statistic is in the rejection
region, hence we reject the null hypothesis and conclude that
there is a significant difference between sample mean and
population mean.
Example-3 (Practice)
A national vocabulary test is known to have a mean score of 68 and a standard
deviation of 13. A class of 49 students takes the test and has a mean score of 65. Test
whether the performance of this sample of students is significantly different from the
national standard. (Assume level of significance = 5%, i.e. ZCRT = 1.96)
H0: μ = 68 and Ha: μ ≠ 68
μ = 68, σ = 13, X̄ = 65, n = 49
ZSTAT = (X̄ − μ)/(σ/√n) = (65 − 68)/(13/√49) = -1.62; ZCRT = 1.96
|ZSTAT| = |-1.62| = 1.62
|ZCRT| = 1.96
|ZSTAT| < |ZCRT|, hence we do not reject the null hypothesis and conclude that there is no
significant difference between the sample score and national score.
Do You Ever Truly Know σ?
• Probably not!

• In virtually all real world business situations, σ is not known.

• If there is a situation where σ is known then µ is also known (since to calculate σ
you need to know µ).

• If you truly know µ there would be no need to gather a sample to estimate it.
Hypothesis Testing:
σ Unknown
• If the population standard deviation is unknown, we instead use the
sample standard deviation s.
• Because of this change, we use the t distribution instead of the Z
distribution to test the null hypothesis about the mean.
• When using the t distribution we must assume the population we are
sampling from follows a normal distribution.
• All other steps, concepts, and conclusions are the same.


t-test of Hypothesis for the Mean
(σ unknown)
◼ Convert sample statistic ( X̄ ) to a tSTAT test statistic
[Diagram: hypothesis tests for μ split into σ known (Z test) and σ unknown (t test)]
The test statistic is:
tSTAT = (X̄ − μ)/(s/√n)
Acceptance/rejection criteria:
If |tSTAT| < |tCRT|, do not reject H0; if |tSTAT| > |tCRT|, reject H0
t-statistic and Rejection Criteria
Ex-1: Two-tailed t-test (σ unknown)

The average cost of a hotel room in New York is said to be $168 per
night. To determine if this is true, a random sample of 25 hotels is taken
and the mean tariff of this sample was found to be $172.50 with a
standard deviation of $15.40. Test the appropriate hypotheses at
α = 0.05. (Assume the population distribution is normal)
H0: μ = 168
Ha: μ ≠ 168
Solution: Two-tailed t-test
H0: μ = 168
Ha: μ ≠ 168
• α = 0.05
• n = 25, df = 25 − 1 = 24
• σ is unknown, so use a t statistic:
tSTAT = (X̄ − μ)/(s/√n) = (172.50 − 168)/(15.40/√25) = 1.46
• Critical Value: t24,0.025 = ±2.064
[Figure: t distribution with 24 d.f.; rejection regions of area α/2 = 0.025 beyond −2.064 and +2.064; tSTAT = 1.46 lies in the non-rejection region]
Do not reject H0: insufficient evidence that the true
mean cost is different from $168
Use of p-value in t-test

Level of significance (two-tailed test)
d.f.    0.50     0.40     0.30     0.20     0.10     0.05
24      0.685    0.857    1.059    1.318    1.711    2.064

t = 1.46

As per the table, t = 1.46 lies between 1.318 and 1.711 (two-tailed values), which shows
that the p-value lies between 0.10 and 0.20.
Since 1.46 is closer to 1.318 than to 1.711, the p-value will also be
closer to 0.20 than to 0.10.
Let us take p-value = 0.16, which is more than α = 0.05, hence there is no reason to
reject the null hypothesis H0.
We conclude that the sample gives no evidence against an average hotel room cost in New York of
$168 per night.
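The table lookup above only brackets the p-value. As a cross-check, the two-tailed p-value can be computed numerically. The sketch below is an assumption-laden illustration, not the slides' method: it integrates the Student's t density with Simpson's rule using only the standard library, and the helper names `t_pdf` and `t_two_tailed_p` are hypothetical.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value 2 * P(T >= |t_stat|) via Simpson's rule on [0, |t_stat|]."""
    b = abs(t_stat)
    h = b / steps  # steps is even, as Simpson's rule requires
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # P(0 <= T <= |t_stat|)
    return 2 * (0.5 - central_half)

# Hotel example: X-bar = 172.50, mu0 = 168, s = 15.40, n = 25, so df = 24
t_stat = (172.50 - 168) / (15.40 / sqrt(25))
p_value = t_two_tailed_p(t_stat, df=24)
print(round(t_stat, 2), round(p_value, 3))
```

The computed p-value falls between 0.10 and 0.20, in line with the table-based bracket, and is well above α = 0.05.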
Ex-2: Two-tailed t-test (σ unknown)
Ex-3: One-tailed t-test (σ unknown)

A shareholders’ group, in lodging a protest, claimed that the mean tenure for a CEO was at
least nine years. A survey of 25 companies reported in the Wall Street Journal found a
sample mean tenure of 5.81 years for CEOs with a standard deviation of 6.38 years. Test
the hypothesis to challenge the validity of the claim made by the shareholders’ group.
What is the p-value for your hypothesis test? At α = 0.01, what is your conclusion?

Solution: H0: μ ≥ 9 and Ha: μ < 9
tSTAT = (X̄ − μ)/(s/√n) = (5.81 − 9)/(6.38/√25) = −2.5
tCRT (at 24 d.f. and 0.01 level, one-tailed) = 2.492, so the critical value is −tCRT = −2.492
[Figure: t distribution with 24 d.f.; lower-tail rejection region of area α = 0.01 below −2.492; tSTAT = −2.5 lies just inside it]
Result: tSTAT < −tCRT (−2.5 < −2.492), so the p-value is slightly below 0.01.
Hence we reject the null hypothesis and conclude that
mean tenure of CEOs is less than 9 years.
Ex-4: One-tailed t-test (Practice)

The mean annual premium for automobile insurance in the United States is $1503
(insure.com website, March 6, 2014). A researcher from Pennsylvania believes that
automobile insurance is cheaper there and wishes to develop statistical support for
his opinion. A sample of 25 automobile insurance policies from the state of
Pennsylvania showed a mean annual premium of $1440 with a standard deviation of
$165. Develop a hypothesis to test whether the mean annual premium in
Pennsylvania is lower than the national mean annual premium. Use α = 0.05.
Topic-9
Hypothesis Testing-1
Topics of discussion:
*Concept of null and alternative hypothesis
*Developing null and alternative hypothesis
*Errors in hypothesis testing
*Testing of hypothesis for population mean (σ known and unknown)
*Testing of hypothesis for population proportion
What is a Hypothesis?
• A hypothesis is a claim (assumption) about a population parameter:
• population mean
Example: The mean monthly cell phone bill in this city is μ = $42
• population proportion
Example: The proportion of adults in this city with cell phones is p = 0.68
The Null Hypothesis, H0
• States the claim or assertion to be tested
Example: The average number of TV sets in U.S.
households is equal to three ( H0 : μ = 3 )
• Is always about a population parameter,
not about a sample statistic
H0: μ = 3 is a valid null hypothesis; H0: X̄ = 3 is not
The Null Hypothesis, H0
• Null hypothesis assumes that the statement is true
• Similar to the notion of innocent until proven guilty

• Refers to the status quo or historical value


• Always contains “=” , “≤” or “≥” sign
• May or may not be rejected
The Alternative Hypothesis, Ha
• Is the opposite of the null hypothesis
• e.g., The average number of TV sets in U.S. households is not
equal to 3 ( Ha: μ ≠ 3 )
• Challenges the status quo
• Never contains the “=” , “≤” or “≥” sign
• May or may not be proven
• Is generally the hypothesis that the researcher is trying
to prove
Hypothesis Formulation
1. A particular automobile currently attains a fuel efficiency of 24
miles per gallon in city driving. A product research group has
developed a new fuel injection system designed to increase the miles-
per-gallon rating. Formulate the null and alternative hypotheses to test
whether the new fuel injection system provides better results.
Solution: The null hypothesis represents the status quo, so we assume that
the new system does no better than the current 24 miles per gallon,
i.e., the mean is less than or equal to 24 mpg. The alternative
hypothesis, which is the counter statement, is that the new system
gives a better performance and attains more than 24 miles per gallon.
The null and alternative hypotheses may be written as
H0: μ ≤ 24
Ha: μ > 24
Hypothesis Formulation
2. Because of high production-changeover time and costs, a director of manufacturing
must convince management that a proposed manufacturing method reduces costs
before the new method can be implemented. The current production method
operates with a mean cost of $220 per hour. A research study will measure the cost of
the new method over a sample production period. Develop the null and alternative
hypotheses most appropriate for this study.
Solution: H0: μ ≥ 220 & Ha: μ < 220

3. A production line operation is designed to fill cartons with laundry detergent to a
mean weight of 32 ounces. A sample of cartons is periodically selected and weighed to
determine whether underfilling or overfilling is occurring. If the sample data lead to a
conclusion of underfilling or overfilling, the production line will be shut down and
adjusted to obtain proper filling. Formulate the null and alternative hypotheses that
will help in deciding whether to shut down and adjust the production line.
Solution: H0: μ = 32 & Ha: μ ≠ 32
The Hypothesis Testing Process
• Claim: The population mean age is 50. Test whether the sample mean
differs significantly from this value.
H0: μ = 50, Ha: μ ≠ 50
• Sample the population and find the sample mean.
[Figure: a sample drawn from the population]
The Hypothesis Testing Process
• Suppose the sample mean age was X̄ = 20
• This is significantly lower than the claimed mean
population age of 50.
• If the null hypothesis were true, the probability of getting
such a different sample mean would be very small, so you
reject the null hypothesis.
The Hypothesis Testing Process
[Figure: sampling distribution of X̄ centered at μ = 50 (if H0 is true), with the observed X̄ = 20 far out in the lower tail]
If it is unlikely that you would get a sample mean of this value when in fact μ = 50 were
the population mean, then you reject the null hypothesis that μ = 50.
The Test Statistic and Critical Values

• If the sample mean is close to the assumed population
mean, the null hypothesis is not rejected.
• If the sample mean is far from the assumed population
mean, the null hypothesis is rejected.
• How far is “far enough” to reject H0?
• The critical value of a test statistic creates a “line in the
sand” for decision making; it answers the question of
how far is far enough.
The Test Statistic and Critical Values
[Figure: sampling distribution of the test statistic, with a central region of non-rejection bounded by the critical values and a region of rejection in each tail, “too far away” from the mean of the sampling distribution]


Errors in Hypothesis Testing
• Type I Error
✓Reject a true null hypothesis
✓Considered a serious type of error
✓The probability of a Type-I Error is α (known as the level of significance)
✓Set by researcher in advance
✓Referred to as producer’s risk in business terms
• Type II Error
✓Failure to reject false null hypothesis
✓The probability of a Type-II Error is β
✓Referred to as consumer’s risk in business terms
Errors in Hypothesis Testing

Possible Hypothesis Test Outcomes

                      Actual Situation
Decision              H0 True                 H0 False
Do Not Reject H0      No Error                Type II Error
                      Probability (1 – α)     Probability β
Reject H0             Type I Error            No Error
                      Probability α           Probability (1 – β)
Errors in Hypothesis Testing
• The confidence coefficient (1-α) is the probability of not rejecting H0
when it is true.
• The confidence level of a hypothesis test is (1-α)*100%.
• The power of a statistical test (1-β) is the probability of rejecting H0
when it is false.
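To make α, β and power concrete, the following sketch computes the power of a two-tailed Z test for one specific alternative. It is only an illustration under stated assumptions: the true mean of 2.8 is a hypothetical value chosen for the example (the slides do not give one), and the function name `power_two_tailed_z` is made up for this sketch.

```python
from math import sqrt
from statistics import NormalDist

def power_two_tailed_z(mu0, mu_true, sigma, n, alpha=0.05):
    """P(reject H0: mu = mu0) when the population mean is really mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # When the true mean is mu_true, ZSTAT is normal with this mean and s.d. 1
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    lower_tail = std_normal.cdf(-z_crit - shift)     # P(ZSTAT < -z_crit)
    upper_tail = 1 - std_normal.cdf(z_crit - shift)  # P(ZSTAT > +z_crit)
    return lower_tail + upper_tail

# H0 true: the rejection probability equals alpha (Type I error rate)
type_one_rate = power_two_tailed_z(3, 3, 0.8, 100)
# H0 false, hypothetical true mean 2.8: power = 1 - beta
power = power_two_tailed_z(3, 2.8, 0.8, 100)
beta = 1 - power
print(round(type_one_rate, 3), round(power, 3), round(beta, 3))
```

When the true mean equals the hypothesized mean, the rejection probability is exactly the level of significance; as the true mean moves away from it, power rises and β falls.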
Level of Significance and Rejection Region in Two-tailed Test
H0: μ = 3
Ha: μ ≠ 3
Level of significance = α
[Figure: sampling distribution with a rejection region of area α/2 in each tail, beyond the two critical values]
This is a two-tail test because there is a rejection region in both tails


Level of Significance and Rejection Region in One-tailed Test
Level of significance = α
Upper-tail test: H0: μ ≤ 3, Ha: μ > 3. The rejection region (area α) lies in the right or upper tail, above the critical value.
Lower-tail test: H0: μ ≥ 3, Ha: μ < 3. The rejection region (area α) lies in the left or lower tail, below the critical value.
These are one-tail tests because the rejection region lies only in one tail
Hypothesis Tests for the Mean
[Diagram: hypothesis tests for μ split into σ known (Z test) and σ unknown (t test)]
Z-test of Hypothesis for the Mean
(σ Known)
• Convert sample statistic ( X̄ ) to a ZSTAT test statistic
The test statistic is:
ZSTAT = (X̄ − μ)/(σ/√n)
Critical Value Approach to Testing

• For a two-tail test for the mean, σ known:
• Convert sample statistic ( X̄ ) to test statistic (ZSTAT)
• Determine the critical Z values for a specified
level of significance α from a table or computer
• Decision Rule: If the test statistic falls in the rejection
region, reject H0 ; otherwise do not reject H0
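Two-Tail Tests
H0: μ = 3
Ha: μ ≠ 3
◼ There are two cutoff values (critical values), defining the regions of rejection
[Figure: sampling distribution of X̄ centered at 3 with the matching Z scale centered at 0; reject H0 below the lower critical value −Zα/2 or above the upper critical value +Zα/2 (area α/2 in each tail); do not reject H0 in between]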
Critical Value Approach: Steps in Hypothesis Testing
1. State the null hypothesis, H0 and the alternative
hypothesis, Ha
2. Choose the level of significance, α, and the sample
size, n
3. Determine the appropriate test statistic and
sampling distribution
4. Determine the critical values that divide the
rejection and non-rejection regions
5. Collect data and compute the value of the test
statistic
6. Make the statistical decision and state the
managerial conclusion. If the test statistic falls into
the non-rejection region, do not reject the null
hypothesis H0. If the test statistic falls into the
rejection region, reject the null hypothesis.
Express the managerial conclusion in the context
of the problem.
Critical Value Approach
Z values for One-tail and Two-tail Tests
Level of Significance    One-tail value (α)    Two-tail value (α/2)
0.10                     1.28                  1.645
0.05                     1.645                 1.96
0.025                    1.96                  2.24
0.01                     2.33                  2.58
0.005                    2.58                  2.81


Example-4
In a study entitled how undergraduate students use credit cards, it was reported that
undergraduate students have a mean credit card balance of $3173. This figure was an all-
time high and had increased 44% over the previous five years. A current study is being
conducted on a sample of 180 students with a mean balance of $3325 to determine if it
can be concluded that the mean credit card balance for undergraduate students has
continued to increase compared to the original report. Based on previous studies, the
population standard deviation (σ) is reported to be $1000. Test the hypothesis at 0.05
level and interpret your results.

Solution: H0: μ ≤ 3173 and Ha: μ > 3173
ZSTAT = (X̄ − μ)/(σ/√n) = (3325 − 3173)/(1000/√180) = 2.039
ZCRT = 1.645 (right-tailed test)
Since ZSTAT > ZCRT, we reject H0 and conclude that the mean credit card balance for
undergraduate students has continued to increase.
Example-5
Average annual expenditure for prescription drugs was $838 per person in
the country. A sample of 60 individuals in the Midwest showed a per person
annual expenditure for prescription drugs of $745. Using a population
standard deviation of $300 test whether the sample data support the
conclusion that the population annual expenditure for prescription drugs per
person is lower in the Midwest than the country.

Solution: H0: μ ≥ 838 and Ha: μ < 838
ZSTAT = (X̄ − μ)/(σ/√n) = (745 − 838)/(300/√60) = -2.40
ZCRT = 1.645 (left-tailed test)
Since ZSTAT < -ZCRT or -2.40 < -1.645, we reject H0 and conclude that the
annual expenditure for prescription drugs is lower in the Midwest than in the
country.
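Examples 4 and 5 differ only in the direction of the rejection region. The sketch below, written under the assumption that a single `tail` argument is a convenient way to express this, handles both directions with the standard library; the function name `z_test_one_tailed` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_one_tailed(sample_mean, mu0, sigma, n, tail, alpha=0.05):
    """tail='upper' tests Ha: mu > mu0; tail='lower' tests Ha: mu < mu0."""
    z_stat = (sample_mean - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha)  # whole alpha sits in one tail
    if tail == "upper":
        reject = z_stat > z_crit
    elif tail == "lower":
        reject = z_stat < -z_crit
    else:
        raise ValueError("tail must be 'upper' or 'lower'")
    return z_stat, z_crit, reject

# Example-4 (credit card balances, right-tailed)
credit = z_test_one_tailed(3325, 3173, 1000, 180, tail="upper")
# Example-5 (prescription drug expenditure, left-tailed)
drugs = z_test_one_tailed(745, 838, 300, 60, tail="lower")
print(credit, drugs)
```

Both calls lead to rejecting H0, matching the hand calculations in the two examples.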
p-value Approach of Hypothesis Testing
• p-value is a probability used to determine whether the null hypothesis should be
rejected.
• Computation of p-value:
1) Compute the value of the test statistic (Z or t).
2) If the value of the test statistic is in the upper tail, compute the probability that Z is greater than or
equal to the value of the test statistic (the upper tail area). If the value of the test statistic is in the
lower tail, compute the probability that Z is less than or equal to the value of the test statistic (the
lower tail area).
3) For one-tailed test, the probability obtained in step-2 is the p-value.
4) For a two-tailed test, double the probability obtained from step-2 to get the p-value.
Example (p-value approach)
Average returns in equity mutual funds of company ‘X’ is supposed to be 12% p.a. with a std.
deviation of 2%. A sample of 40 investors is randomly selected and their mean return is found to
be 11.3%. Test at 5% level of significance whether there is any significant difference between the
sample mean and population mean? What if the level of significance is 5%?
Solution:
H0: μ = 12 and Ha: μ ≠ 12 (This is a two-tail test)
Here α = 0.05, X̄ = 11.3, μ = 12, σ = 2, n = 40
Z = (11.3 – 12)/(2/√40) = -2.21
P(Z ≤ -2.21) = 0.0136
p-value = 2*0.0136 = 0.0272
Since p-value < α, we reject H0 and conclude that there is a significant difference between
the sample mean and population mean.
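The four-step p-value computation can be sketched as follows, using the standard normal CDF from the Python standard library. The function name `z_p_value` is an assumption for this illustration; for a one-tailed test the returned tail area is the p-value only when the test statistic lies in the tail named by Ha.

```python
from math import sqrt
from statistics import NormalDist

def z_p_value(z_stat, two_tailed=True):
    """p-value for a Z test statistic, following the steps listed above."""
    std_normal = NormalDist()
    if z_stat >= 0:
        tail_area = 1 - std_normal.cdf(z_stat)  # upper tail area
    else:
        tail_area = std_normal.cdf(z_stat)      # lower tail area
    # One-tailed: the tail area itself; two-tailed: double it
    return 2 * tail_area if two_tailed else tail_area

# Mutual fund example: X-bar = 11.3, mu0 = 12, sigma = 2, n = 40
z_stat = (11.3 - 12) / (2 / sqrt(40))
p_value = z_p_value(z_stat)
print(round(z_stat, 2), round(p_value, 3))
```

Because the code keeps the unrounded test statistic, its p-value differs from the slide's 0.0272 in the fourth decimal place; the conclusion at α = 0.05 is unchanged.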
Example (p-value approach)
In a study entitled how undergraduate students use credit cards, it was reported that
undergraduate students have a mean credit card balance of $3173. This figure was an all-
time high and had increased 44% over the previous five years. A current study is being
conducted on a sample of 180 students with a mean balance of $3325 to determine if the
mean credit card balance for undergraduate students has continued to increase compared
to the original report. Based on previous studies, the population standard deviation (σ) is
reported to be $1000. Test the hypothesis at 0.05 level and interpret your results.
Solution: H0: μ ≤ 3173 and Ha: μ > 3173
ZSTAT = (X̄ − μ)/(σ/√n) = (3325 − 3173)/(1000/√180) = 2.04
Assume α = 0.05
P(Z ≥ 2.04) = 1 – 0.9793 = 0.0207 (p-value)
Since p-value < α, we reject H0 and conclude that the mean credit card balance for
undergraduate students has continued to increase.
Z-test for Population Proportion
This section shows how to conduct a hypothesis test about a population proportion p.
Using p0 to denote the hypothesized value for the population proportion, the three
forms for a hypothesis test about a population proportion are as follows.
H0: p ≥ p0      H0: p ≤ p0      H0: p = p0
Ha: p < p0      Ha: p > p0      Ha: p ≠ p0
The first form is called a lower tail test, the second form is called an upper tail test, and
the third form is called a two-tailed test. Hypothesis tests about a population
proportion are based on the difference between the sample proportion p̄ and the
hypothesized population proportion p0.
The methods used to conduct the hypothesis tests for population proportion are similar
to those used for hypothesis tests about a population mean. The only difference is that
we use the sample proportion and its standard error to compute the test statistic. The
p-value approach or the critical value approach is then used to determine whether the
null hypothesis should be rejected.
Z-statistic and Rejection Criteria
The test statistic is Z = (p̄ − p0)/√(p0(1 − p0)/n), where p̄ is the sample proportion.
Ex-1: A study showed that 64% of supermarket shoppers believe supermarket brands
to be as good as national/international brands. To investigate whether this result applies
to its own product, the manufacturer of a national ketchup brand asked a sample of 100
shoppers whether they believed that supermarket ketchup was as good as the national
brand ketchup. Out of 100 shoppers, 52 stated that supermarket brand was as good as
national brands. Test the hypotheses to determine whether the percentage of
supermarket shoppers who believe that the supermarket ketchup was as good as the
national brand ketchup differed from 64%. Take α = 0.05.

Solution: H0: p = 0.64 and Ha: p ≠ 0.64
p̄ = 52/100 = 0.52, p0 = 0.64, n = 100
Z = (p̄ − p0)/√(p0(1 − p0)/n) = (0.52 − 0.64)/√(0.64(1 − 0.64)/100) = -2.50
P(Z ≤ -2.5) = 0.0062
p-value = 2*0.0062 = 0.0124
p-value < α, hence we reject H0 and conclude that the proportion of shoppers who believe
that the supermarket brand is as good as the national brand differs significantly from 64%.
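A short sketch of the one-sample proportion test follows. It uses the hypothesized proportion p0 in the standard error, as in the ketchup example, and relies only on the standard library; the function name `proportion_z_test` is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0, alpha=0.05):
    """Two-tailed Z test of H0: p = p0 using the standard error under H0."""
    p_bar = successes / n                    # sample proportion
    std_error = sqrt(p0 * (1 - p0) / n)      # standard error assuming H0 is true
    z_stat = (p_bar - p0) / std_error
    p_value = 2 * NormalDist().cdf(-abs(z_stat))
    return p_bar, z_stat, p_value, p_value < alpha

# Ketchup example: 52 of 100 shoppers, hypothesized proportion 0.64
p_bar, z_stat, p_value, reject = proportion_z_test(52, 100, 0.64)
print(p_bar, round(z_stat, 2), round(p_value, 4), reject)
```

The sketch reproduces Z = −2.50 and a p-value of about 0.0124, so H0 is rejected at α = 0.05.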
Ex-2: Last year, 46% of business owners gave a holiday gift to their employees. A survey
of 60 business owners conducted this year indicates that 35% plan to provide a holiday
gift to their employees. Using a 0.05 level of significance, would you conclude that the
proportion of business owners providing gifts decreased?

Ex-3: Members of the millennial generation are continuing to be dependent on their
parents during early adulthood (the Enquirer, March 16, 2014). A family research
organization has claimed that, in past generations, no more than 30% of individuals aged
18 to 32 continued to be dependent on their parents. A sample of 400 individuals aged
18 to 32 showed that 136 of them continue to be dependent on their parents. Develop
hypotheses for a test to determine whether the proportion of millennials continuing to
be dependent on their parents is higher than for past generations. What is your point
estimate of the proportion of millennials that are continuing to be dependent on their
parents? What is the p-value provided by the sample data? What is your hypothesis
testing conclusion? Use α = 0.05 as the level of significance.
Topic-10
Hypothesis Testing-2
Hypothesis Testing with Two Populations

Two-Population Tests
Test                                      Example
Population Means, Independent Samples     Group 1 vs. Group 2
Population Means, Related Samples         Same group before vs. after treatment
Population Proportions                    Proportion 1 vs. Proportion 2
Population Variances                      Variance 1 vs. Variance 2
Difference Between Two Means:
Independent Samples
[Diagram: population means, independent samples, tested as a difference of means using the Z-statistic or using the t-statistic]
• Different data sources
• Unrelated population
• Independent samples
• Sample selected from one population has no effect
on the sample selected from the other population.
• For example, we may want to test whether the mean
starting salary for a population of men and the mean
starting salary for a population of women differ
significantly.
• Conduct a hypothesis test to determine whether any
difference is present between the proportion of
defective parts in a population of parts produced by
supplier A and the proportion of defective parts in a
population of parts produced by supplier B.
Hypothesis Tests for Two Population Means

Two Population Means, Independent Samples

Lower-tail test:       Upper-tail test:       Two-tail test:
H0: μ1 ≥ μ2            H0: μ1 ≤ μ2            H0: μ1 = μ2
Ha: μ1 < μ2            Ha: μ1 > μ2            Ha: μ1 ≠ μ2
OR                     OR                     OR
H0: μ1 – μ2 ≥ 0        H0: μ1 – μ2 ≤ 0        H0: μ1 – μ2 = 0
Ha: μ1 – μ2 < 0        Ha: μ1 – μ2 > 0        Ha: μ1 – μ2 ≠ 0
Hypothesis tests for (μ1–μ2)

Two Population Means, Independent Samples


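Lower-tail test:       Upper-tail test:       Two-tail test:
H0: μ1 – μ2 ≥ 0        H0: μ1 – μ2 ≤ 0        H0: μ1 – μ2 = 0
Ha: μ1 – μ2 < 0        Ha: μ1 – μ2 > 0        Ha: μ1 – μ2 ≠ 0
[Figure: rejection region of area α in the lower tail below −Zα; of area α in the upper tail above Zα; of area α/2 in each tail beyond −Zα/2 and Zα/2]
Lower-tail test: Reject H0 if ZSTAT ≤ -Zα
Upper-tail test: Reject H0 if ZSTAT ≥ Zα
Two-tail test: Reject H0 if ZSTAT ≤ -Zα/2 OR ZSTAT ≥ Zα/2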
(1) Hypothesis tests for µ1-µ2 with σ1 and σ2
known and unequal (Z-test)
[Diagram: population means, independent samples, with σ1 and σ2 known and assumed unequal]
Assumptions:
▪ Samples are randomly and independently drawn
▪ Populations are normally distributed
▪ Population variances are known and unequal
Two-population Z-test
Hypothesis for two-tailed test:
H0: μ1−μ2=0
Ha: μ1−μ2≠0
Test Statistic:
ZSTAT = ((X̄1 − X̄2) − D0)/√(σ1²/n1 + σ2²/n2)
X̄1 = Mean of first sample
X̄2 = Mean of second sample
σ1 = S.D. of first population
σ2 = S.D. of second population
n1 = Size of first sample
n2 = Size of second sample
D0 = Hypothesized difference between μ1 and μ2
Reject H0 if ZSTAT ≤ -Zα/2 OR ZSTAT ≥ Zα/2
Equivalently: if |ZSTAT| < |ZCRT|, do not reject H0; if |ZSTAT| ≥ |ZCRT|, reject H0
Example-1
Greystone Department Stores operates two stores in Buffalo, New York: One is in the inner city and the other
is in a suburban shopping center. The Regional Manager noticed that products that sell well in one store do
not always sell well in the other. The manager believes this situation may be attributable to differences in
customer demographics at the two locations. Customers may differ in age, education, income, and so on. The
manager wants to investigate the difference between the mean ages of the customers who shop at the two
stores. From the inner city store, a random sample of 36 customers was taken whose mean age was 40 yrs
and from the suburban store, a random sample of 49 customers was taken whose mean age was 35 yrs. The
population std. deviations of age in the two areas are 9 yrs and 10 yrs respectively. The manager wants to
know whether the mean ages of customers in the two areas differ significantly. Assume α=0.05.
Solution: Example-1

Hypothesis for two-tailed test: H0: μ1−μ2=0 and Ha: μ1−μ2≠0
Reject H0 if ZSTAT ≤ -Zα/2 OR ZSTAT ≥ Zα/2
X̄1=40, X̄2=35, σ1=9, σ2=10, n1=36, n2=49
D0= 0 (μ1−μ2=0)
Test Statistic: ZSTAT = ((X̄1 − X̄2) − D0)/√(σ1²/n1 + σ2²/n2) = (40 − 35)/√(81/36 + 100/49) = 2.41
ZCRT = 1.96 (known at 5% l.s.)
[Figure: standard normal curve with rejection regions of area α/2 beyond −Zα/2 = −1.96 and Zα/2 = 1.96; ZSTAT = 2.41 lies in the upper rejection region]
Since ZSTAT ≥ Zα/2, we reject the null
hypothesis H0 and conclude that there is
a significant difference between the
mean ages of the two populations.
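The two-sample statistic can be checked with a small sketch. It assumes the usual known-variance form, ZSTAT = ((X̄1 − X̄2) − D0)/√(σ1²/n1 + σ2²/n2), which reproduces the 2.41 quoted above; the function name `two_sample_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, d0=0.0):
    """Z statistic for the difference of two means with known sigmas."""
    std_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return ((mean1 - mean2) - d0) / std_error

# Greystone example: inner-city store vs. suburban store
z_stat = two_sample_z(40, 35, 9, 10, 36, 49)
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-tailed, alpha = 0.05
reject = abs(z_stat) >= z_crit
print(round(z_stat, 2), round(z_crit, 2), reject)
```

The same function can be reused for one-tailed comparisons by comparing the statistic with the one-tail critical value instead.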
Example-2
A survey was conducted in two large cities to study the distance travelled by people per day:

Test at 5% level of significance whether the mean distances travelled in the two cities differ significantly.
Solution: The hypothesis is
H0: μ1−μ2=0 and Ha: μ1−μ2≠0
ZSTAT = 2.34 (calculated using formula); ZCRT = 1.96 (known at 5% l.s.)
Rule for acceptance/rejection:
If |ZSTAT| < |ZCRT|, do not reject H0; if |ZSTAT| > |ZCRT|, reject H0
Here, |ZSTAT| > |ZCRT|, hence we reject the null hypothesis and conclude that the mean distances travelled in
the two cities differ significantly.
Example-3
To compare customer satisfaction levels of two competing cable television companies, 174
customers of Company 1 and 355 customers of Company 2 were randomly selected and were
asked to rate their cable companies on a five-point scale, with 1 being least satisfied and 5
most satisfied. The survey results are summarized in the table. Test at the 1% level of
significance whether the data provide sufficient evidence to conclude that Company 1 has a
higher mean satisfaction rating than does Company 2.
Statistic                 Company 1    Company 2
Sample size               174          355
Sample mean               3.51         3.24
Standard deviation (σ)    0.51         0.52

Solution: H0: μ1-μ2 ≤ 0 and Ha: μ1-μ2 > 0
ZSTAT = (3.51 − 3.24)/√(0.51²/174 + 0.52²/355) = 5.684; ZCRT = 2.33 (one tail value at α=0.01)
[Figure: standard normal curve with an upper-tail rejection region of area α beyond ZCRT = 2.33; ZSTAT = 5.684 lies far inside it]
Since ZSTAT ≥ ZCRT, we reject the null hypothesis H0 and conclude that Company 1 has a higher
mean satisfaction rating than Company 2.
Example-4
The Municipal Transit Authority wants to know if, on weekdays, more passengers ride
the northbound blue line train towards the city center that departs at 8:15 a.m. or
the one that departs at 8:30 a.m. The following sample statistics are assembled by
the Transit Authority.
8:15 a.m. train 8:30 a.m. train
Sample size 30 45
Sample mean 323 356
Standard deviation (σ) 41 45

Test at the 5% level of significance whether the data provide sufficient evidence to
conclude that more passengers ride the 8:30 train.
Confidence Interval Estimate for (µ1-µ2) with
σ1 and σ2 known

Example: Greystone Department Stores, operates two stores in Buffalo, New York: One is in the inner city and the
other is in a suburban shopping center. The manager wants to investigate the difference between the mean ages of the
customers who shop at the two stores. From the inner city store, a random sample of 36 customers was taken whose
mean age was 40 yrs and from the suburban store, a random sample of 49 customers was taken whose mean age was
35 yrs. The population std. deviation of age in both the areas are 9 yrs and 10 yrs respectively. Find a 95% confidence
interval estimate for the difference of means?

Solution: X̄1=40, X̄2=35, σ1=9, σ2=10, n1=36, n2=49 and Zα/2 = 1.96
The confidence interval estimate is calculated as:
(X̄1 − X̄2) ± Zα/2 √(σ1²/n1 + σ2²/n2) = (40 − 35) ± 1.96 √(81/36 + 100/49) = 5 ± 4.06
Thus, the margin of error is 4.06 years and the
95% confidence interval estimate of the difference between the
two population means is (5−4.06 to 5+4.06) i.e. 0.94 years to 9.06 years.
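The interval can be reproduced with a few lines of Python. This is a sketch assuming the usual form (X̄1 − X̄2) ± Zα/2 √(σ1²/n1 + σ2²/n2), which matches the margin of error of 4.06 stated above; the function name `diff_means_ci` is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def diff_means_ci(mean1, mean2, sigma1, sigma2, n1, n2, confidence=0.95):
    """Confidence interval for mu1 - mu2 with known population sigmas."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # e.g. 1.96 for 95%
    margin = z * sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    diff = mean1 - mean2                                # point estimate
    return diff - margin, diff + margin, margin

lower, upper, margin = diff_means_ci(40, 35, 9, 10, 36, 49)
print(round(lower, 2), round(upper, 2), round(margin, 2))
```

Since the interval does not contain 0, it agrees with the two-tailed test on the same data, which rejected H0: μ1 − μ2 = 0 at the 5% level.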
(2) Testing of Hypothesis for difference between
Two Population Proportions
Let p1 denote the proportion for Population 1 and p2 denote the proportion for Population
2, we consider the difference between the two population proportions (p1−p2). To make an
inference about this difference, we will select two independent random samples consisting
of n1 units from Population 1 and n2 units from Population 2.
Let us now consider hypothesis tests about the difference between the proportions of two
populations. We focus on tests involving no difference between the two population
proportions. In this case, the three forms for a hypothesis test are:

Lower-tail test:       Upper-tail test:       Two-tail test:
H0: p1 ≥ p2            H0: p1 ≤ p2            H0: p1 = p2
Ha: p1 < p2            Ha: p1 > p2            Ha: p1 ≠ p2
OR                     OR                     OR
H0: p1 – p2 ≥ 0        H0: p1 – p2 ≤ 0        H0: p1 – p2 = 0
Ha: p1 – p2 < 0        Ha: p1 – p2 > 0        Ha: p1 – p2 ≠ 0
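Z-statistic for difference between two population
proportions

$$Z_{STAT} = \frac{\bar{p}_1 - \bar{p}_2}{\sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad \bar{p} = \frac{n_1\bar{p}_1 + n_2\bar{p}_2}{n_1 + n_2}$$

where
p̄1 = Sample proportion for a simple random sample from Population 1
p̄2 = Sample proportion for a simple random sample from Population 2
p̄ = Pooled estimate of the common proportion when H0: p1 = p2 is true
n1 = Size of the random sample selected from Population 1
n2 = Size of the random sample selected from Population 2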
Example-1
A tax preparation firm is interested in comparing the quality of work at two of its regional offices. By
randomly selecting samples of tax returns prepared at each office and verifying the sample returns
accuracy, the firm will be able to estimate the proportion of erroneous returns prepared at each office. In
Office 1, errors were found in 35 files out of 250 returns whereas 27 errors were found in Office 2 out of 300
returns. The firm wants to test the hypothesis to determine whether the error proportions differ between
the two offices. Test at 10% level of significance.

Solution: H0: p1 – p2 = 0 and Ha: p1 – p2 ≠ 0

n1=250, n2=300, p̄1=35/250=0.14, p̄2=27/300=0.09

p̄=0.1127 and ZSTAT = 1.85 (both calculated from the formulae); ZCRT= 1.645 (two tail value at α=0.1)
Since ZSTAT>ZCRT, we reject the null hypothesis H0 and conclude that the proportions of errors differ significantly in
the two offices.
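As a rough check on the arithmetic, the pooled two-proportion z-test can be sketched in a few lines of Python. The function name `two_proportion_z_test` is hypothetical; the p-value comes from the standard library's `NormalDist`.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled z-test of H0: p1 - p2 = 0.

    x1, x2 are the counts of "successes" (here: erroneous returns).
    Returns (pooled proportion, z statistic, two-tailed p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return pooled, z, p_two_tailed

# Tax preparation example: 35 errors in 250 returns vs. 27 errors in 300 returns
pooled, z, p_value = two_proportion_z_test(35, 250, 27, 300)
print(f"pooled = {pooled:.4f}, z = {z:.2f}, two-tailed p-value = {p_value:.4f}")
```

For a one-tailed test, such as the raise-or-promotion example, the same z statistic is compared with the one-tail critical value instead.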
Example-2
The Adecco Workplace Insights Survey sampled men and women workers and asked if they expected to
get a raise or promotion this year (USA Today, February 16, 2012). Suppose the survey sampled 200 men
and 200 women. If 104 of the men replied Yes and 74 of the women replied Yes, are the results
statistically significant so that you can conclude a greater proportion of men expect to get a raise or a
promotion this year?

Solution: H0: p1 – p2 ≤ 0 and Ha: p1 – p2 > 0

n1=200, n2=200, p̄1=104/200=0.52, p̄2=74/200=0.37

p̄=0.445 and ZSTAT = 3.018 (both calculated from the formulae); ZCRT= 1.28 (one tail value at α=0.1)
Since ZSTAT>ZCRT, we reject the null hypothesis H0 and conclude that a greater proportion of men
expect to get a raise or a promotion this year.
Confidence Interval Estimate for (p1-p2)

Example: A tax preparation firm is interested in comparing the quality of work at two of its regional offices. By
randomly selecting samples of tax returns prepared at each office and verifying the sample returns accuracy, the firm
will be able to estimate the proportion of erroneous returns prepared at each office. In Office 1, errors were found in
35 files out of 250 returns whereas 27 errors were found in Office 2 out of 300 returns. Find a 90% confidence interval
estimate for the difference of proportions?

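Solution: n1=250, n2=300, p̄1=35/250=0.14, p̄2=27/300=0.09 and Zα/2 = 1.645

The confidence interval estimate is calculated as:
$$(\bar{p}_1 - \bar{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\bar{p}_1(1-\bar{p}_1)}{n_1} + \frac{\bar{p}_2(1-\bar{p}_2)}{n_2}} = 0.05 \pm 1.645\sqrt{\frac{0.14(0.86)}{250} + \frac{0.09(0.91)}{300}} = 0.05 \pm 0.045$$
Thus, the margin of error is 0.045 and the 90% confidence interval estimate
of the difference between the two population proportions is
(0.05−0.045 to 0.05+0.045) i.e. 0.005 to 0.095.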
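The interval estimate uses the unpooled standard error, unlike the hypothesis test, because no common proportion is assumed. A minimal sketch (the function name `diff_proportions_ci` is an illustrative assumption):

```python
import math
from statistics import NormalDist

def diff_proportions_ci(x1, n1, x2, n2, confidence=0.90):
    """Confidence interval for p1 - p2 using the unpooled standard error.

    Returns (point estimate, margin of error).
    """
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2, z * standard_error

# Tax preparation example at 90% confidence
estimate, margin = diff_proportions_ci(35, 250, 27, 300, confidence=0.90)
print(f"{estimate:.2f} +/- {margin:.3f} -> ({estimate - margin:.3f}, {estimate + margin:.3f})")
```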
(3) Hypothesis tests for (µ1-µ2) with σ1 and σ2
unknown and assumed unequal (t-test)

Population means, independent samples; σ1 and σ2 unknown, assumed unequal

Assumptions:
▪ Samples are randomly and independently drawn
▪ Populations are normally distributed
▪ Population variances are unknown and unequal
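Two-population t-test
Test Statistic:
$$t_{STAT} = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
Degrees of freedom:
$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}$$

x̄1 = Mean of first sample
x̄2 = Mean of second sample
s1 = S.D. of first sample
s2 = S.D. of second sample
n1 = Size of first sample
n2 = Size of second sample
D0 = Hypothesized difference between μ1 and μ2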
Example-1
You are a financial analyst for a brokerage firm. Is there a difference in dividend yield between stocks
listed on the NYSE & NASDAQ? You collect the following data:
NYSE NASDAQ
Number 21 25
Sample mean 3.27 2.53
Sample std dev 1.30 1.16
Assuming both populations are approximately normal with unequal variances, is there any difference in
mean yield (α = 0.05)?

Solution: H0: μ1 - μ2 = 0 i.e. (μ1 = μ2)
H1: μ1 - μ2 ≠ 0 i.e. (μ1 ≠ μ2)

$$t_{STAT} = \frac{(3.27 - 2.53) - 0}{\sqrt{\frac{1.30^2}{21} + \frac{1.16^2}{25}}} = 2.019$$

tSTAT= 2.019 and df = 40 (the calculated value of df is 40.6, but we round it down to 40, which gives a larger
critical value of t and hence a more conservative test). Also tCRT=2.021 (two-tail value at 0.05 level and 40 df).
Since |tSTAT| < |tCRT|, we do not reject the null hypothesis H0: the data do not provide sufficient evidence of a
difference between the mean yields.
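The test statistic and the degrees-of-freedom formula for the unequal-variance case are tedious by hand. The sketch below reproduces them from summary statistics; the function name `welch_t` is an illustrative choice, and critical values still have to be read from a t table.

```python
import math

def welch_t(x1, s1, n1, x2, s2, n2, d0=0.0):
    """t statistic and approximate degrees of freedom when the two
    population variances are unknown and assumed unequal."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # estimated variances of the two sample means
    t_stat = (x1 - x2 - d0) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t_stat, df

# NYSE vs. NASDAQ dividend yields
t_stat, df = welch_t(3.27, 1.30, 21, 2.53, 1.16, 25)
print(f"t = {t_stat:.3f}, df = {df:.1f}, rounded down to {math.floor(df)}")
```

Rounding the degrees of freedom down, as the slides do, yields a slightly larger critical value and therefore a more conservative test.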
Example-2
Specific Motors of Detroit has developed a new automobile known as the M car. 24 M cars and 28 J
cars (from Japan) were road tested to compare miles-per-gallon (mpg) performance. The sample
statistics are shown below. Can we conclude, using a 0.05 level of significance, that the miles-per-
gallon performance of M cars is greater than the miles-per-gallon performance of J cars?
M cars            J cars
n1=24 cars        n2=28 cars
x̄1=29.8           x̄2=27.3
s1=2.56           s2=1.81

Solution: H0: μ1 – μ2 ≤ 0 and Ha: μ1 – μ2 > 0

$$t_{STAT} = \frac{(29.8 - 27.3) - 0}{\sqrt{\frac{2.56^2}{24} + \frac{1.81^2}{28}}} = 4.003$$

tSTAT= 4.003 and df = 40 (the calculated value of df is 40.6, but we round it down to 40, which gives a larger
critical value of t). Also tCRT=1.684 (one-tail value at 0.05 level and 40 df).
Since |tSTAT| > |tCRT|, we reject the null hypothesis H0 and conclude that the miles-per-gallon (mpg)
performance of M cars is greater than the miles-per-gallon performance of J cars.
Example-3
A study of 40 staff nurses in Tampa and 50 staff nurses in Dallas gives the following results:
Tampa               Dallas
n1= 40 nurses       n2= 50 nurses
x̄1= $56,100         x̄2= $59,400
s1= $6000           s2= $7000

Do these data show enough evidence that the mean salary of Tampa staff nurses is lower than that
of Dallas staff nurses? Test at 5% level of significance.

Solution: H0: μ1 – μ2 ≥ 0 and Ha: μ1 – μ2 < 0

$$t_{STAT} = \frac{(56100 - 59400) - 0}{\sqrt{\frac{6000^2}{40} + \frac{7000^2}{50}}} = -2.41$$

tSTAT= -2.41 and df = 87 (the calculated value of df is 87.6, but we round it down to 87, which gives a larger
critical value of t). Also |tCRT|=1.663 (one-tail value at 0.05 level and 87 df; the lower-tail critical value is -1.663).
Since |tSTAT| > |tCRT|, we reject the null hypothesis H0 and conclude that the mean salary of Tampa
staff nurses is significantly lower than that of Dallas staff nurses.
Confidence Interval Estimate for (µ1-µ2) with
σ1 and σ2 unknown

Example: Clearwater National Bank is conducting a study designed to identify differences between checking account
practices by customers at two of its branch banks. A simple random sample of 28 checking accounts from the Cherry
Grove Branch showed a mean balance of $1025 with a standard deviation of $150. Similarly, a simple random sample
of 22 checking accounts selected from the Beechmont Branch showed a mean balance of $910 and a standard deviation
of $125. Find a 95% interval estimate for the difference of means.

Solution: x̄1=1025, x̄2=910, s1=150, s2=125, n1=28, n2=22
and tα/2 = 2.012 (at 47 df)
The confidence interval estimate is calculated as:
$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = (1025 - 910) \pm 2.012\sqrt{\frac{150^2}{28} + \frac{125^2}{22}} = 115 \pm 78$$
Thus, the margin of error is $78 and the
95% confidence interval estimate of the difference
between the two population means is
(115−78 to 115+78) i.e. $37 to $193.
(4) Hypothesis tests for (µ1-µ2) with σ1 and σ2
unknown and assumed equal (t-test)

Population means, independent samples; σ1 and σ2 unknown, assumed equal

• The pooled variance is:
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}$$

• The test statistic is:
$$t_{STAT} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

• where tSTAT has d.f. = (n1 + n2 – 2)


Example
The sample data is from a group of men and women who did workouts at a gym three times a week for a
year. Then, their trainer measured the body fat. The table below shows the data:
Group     Sample Size (n)     Average (x̄)     Standard deviation (S)
Women     10                  22.29            5.32
Men       13                  14.95            6.84

Test at the 5% level of significance (l.s.) whether the body fat for men and women differs significantly.
Solution: H0: μ1−μ2=0 and H1: μ1−μ2≠0

$$S_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{9(5.32)^2 + 12(6.84)^2}{21} = 38.86$$
$$t_{STAT} = \frac{(\bar{X}_1 - \bar{X}_2) - D_0}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(22.29 - 14.95) - 0}{\sqrt{38.86\left(\frac{1}{10} + \frac{1}{13}\right)}} = 2.80$$

SP2 = 38.86; tSTAT= 2.80; d.f. = n1+n2-2 = (10+13-2) = 21; tCRT = ±2.08 (at 5% l.s. and 21 d.f.)
Here |tSTAT| > |tCRT|, hence we reject H0 and conclude that the body fat for men and women differs significantly.
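The pooled-variance calculation in this example can be verified with a short script. This is a sketch under the equal-variance assumption; `pooled_t` is a hypothetical helper name.

```python
import math

def pooled_t(x1, s1, n1, x2, s2, n2, d0=0.0):
    """Pooled-variance t-test from summary statistics.

    Returns (pooled variance, t statistic, degrees of freedom).
    """
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    t_stat = (x1 - x2 - d0) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return sp2, t_stat, df

# Body fat example: women (group 1) vs. men (group 2)
sp2, t_stat, df = pooled_t(22.29, 5.32, 10, 14.95, 6.84, 13)
print(f"Sp^2 = {sp2:.2f}, t = {t_stat:.2f}, df = {df}")
```

Note that the degrees of freedom are n1 + n2 - 2 = 21, which is the value used to look up the critical value 2.08.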
Confidence interval for (µ1-µ2) with σ1 and σ2
unknown and assumed equal

Population means, independent samples; σ1 and σ2 unknown, assumed equal

The confidence interval for μ1–μ2 is:
$$(\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2}\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where tα/2 has d.f. = n1 + n2 – 2


(5) Hypothesis Testing for difference between two population
means: Matched samples (Paired t-test)
A paired t-test is used on dependent samples. Dependent samples are essentially
connected — they are tested on the same individuals or entities. For example:
• MRI costs at two different hospitals,
• Two tests on the same employees before and after training,
• Two blood pressure measurements on the same patients using different
equipment.
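• Test statistic:
$$t_{STAT} = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}} \quad \text{where} \quad \bar{d} = \frac{\sum d_i}{n} \quad \text{and} \quad s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$$

d is the difference between the paired observations, μd is the hypothesized mean difference and n is the number of pairs.

• Degrees of freedom = (n-1)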
Example-1: A Chicago-based firm has documents that must be quickly distributed to district offices
throughout the U.S. The firm must decide between two delivery services, UPX (United Parcel Express) and
INTEX (International Express), to transport its documents. In testing the delivery times of the two services,
the firm sent two reports to a random sample of its district offices with one report carried by UPX and the
other report carried by INTEX. Do the data given below indicate a difference in mean delivery times for the
two services? Use a 0.05 level of significance.
Delivery time (hours)
District office     UPX     INTEX     Difference (d)
A                   32      25        7
B                   30      24        6
C                   19      15        4
D                   16      15        1
E                   15      13        2
F                   18      15        3
G                   14      15        -1
H                   10      8         2
I                   7       9         -2
J                   16      11        5

Solution: H0: μd = 0 and Ha: μd ≠ 0

d-bar = 2.7 and sd = 2.9
tSTAT = (2.7 – 0)/(2.9/√10) = 2.94
tCRT = 2.26 (at 0.05 level and 9 df)
|tSTAT| > |tCRT|, hence we reject the null hypothesis and
conclude that there is a significant difference in the mean
delivery times for the two services.
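The paired t-test reduces to a one-sample t-test on the differences. A minimal sketch using the delivery-time data from the table (variable names are illustrative):

```python
import math
from statistics import mean, stdev

# Delivery times in hours for district offices A..J
upx   = [32, 30, 19, 16, 15, 18, 14, 10, 7, 16]
intex = [25, 24, 15, 15, 13, 15, 15, 8, 9, 11]

# Work with the paired differences only
d = [u - v for u, v in zip(upx, intex)]
n = len(d)
d_bar = mean(d)    # mean difference
s_d = stdev(d)     # sample standard deviation of the differences (n - 1 divisor)

mu_d0 = 0          # hypothesized mean difference under H0
t_stat = (d_bar - mu_d0) / (s_d / math.sqrt(n))
print(f"d-bar = {d_bar:.1f}, s_d = {s_d:.2f}, t = {t_stat:.2f}, df = {n - 1}")
```

Because the script keeps the unrounded standard deviation, its t value can differ slightly in the third significant digit from the hand calculation that uses sd = 2.9.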
Example-2

tCRT = 2.571 (at 0.05 level and 5 df)


Topic-11
Chi-square Tests
Topics to be discussed
1. Inference about a Population Variance
2. Chi-square Distribution
3. Interval Estimation for Population Variance
4. Chi-square test for Population Variance
5. Chi-square test of Independence
Inferences About a Population Variance

➢ A variance can provide important decision-making information.
➢ Consider the production process of filling containers
with a liquid detergent product.
➢ The mean filling weight is important, but so is the
variance of the filling weights.
➢ By selecting a sample of containers, we can compute a
sample variance for the amount of detergent placed in
a container.
➢ If the sample variance is excessive, overfilling and
underfilling may be occurring even though the mean is
correct.
Chi-Square Distribution
❑ The chi-square distribution is the distribution of a sum of squared
standardized normal random variables such as
(z1)² + (z2)² + (z3)² and so on.
❑ The chi-square distribution is based on sampling from
a normal population.
❑ The sampling distribution of (n-1)s2/σ2 has a chi-
square distribution whenever a simple random sample
of size n is selected from a normal population.
❑ We can use the chi-square distribution to develop
interval estimates and conduct hypothesis tests about
a population variance.
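Interval Estimation of σ²

$$\chi^2_{(1-\alpha/2)} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{\alpha/2}$$

[Figure: chi-square distribution with an area of α/2 in each tail; 95% of the possible χ² values lie between χ²(1−α/2) and χ²(α/2).]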
Interval Estimate of Population Standard Deviation (σ)

Taking the square root of the upper and lower limits
of the variance interval provides the confidence
interval for the population standard deviation.

$$\sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2}}} \le \sigma \le \sqrt{\frac{(n-1)s^2}{\chi^2_{(1-\alpha/2)}}}$$
Interval Estimation of σ2
Example-1: Buyer’s Digest
Buyer’s Digest rates thermostats manufactured for home
temperature control. In a recent test, 10 thermostats manufactured
by ThermoRite were selected and placed in a test room that was
maintained at a temperature of 68°F. The temperature readings of
the ten thermostats are shown below:

Thermostat      1     2     3     4     5     6     7     8     9     10
Temperature     67.4  67.8  68.2  69.3  69.5  67.0  68.1  68.6  67.9  67.2

Develop a 95% confidence interval estimate of the population
variance.

Finding χ²0.025
df = (n–1) = 10-1 = 9 and α/2 = 0.025

Selected Values from the Chi-Square Distribution Table

Degrees of                          Area in Upper Tail
Freedom     .99     .975    .95     .90     .10      .05      .025     .01
5           0.554   0.831   1.145   1.610   9.236    11.070   12.832   15.086
6           0.872   1.237   1.635   2.204   10.645   12.592   14.449   16.812
7           1.239   1.690   2.167   2.833   12.017   14.067   16.013   18.475
8           1.647   2.180   2.733   3.490   13.362   15.507   17.535   20.090
9           2.088   2.700   3.325   4.168   14.684   16.919   19.023   21.666
10          2.558   3.247   3.940   4.865   15.987   18.307   20.483   23.209

Our χ²0.025 value: 19.023 (row df = 9, column .025)

Finding χ²0.975
df = (n-1) = 10-1 = 9 and (1-α/2) = 0.975
From the same table (row df = 9, column .975), our χ²0.975 value: 2.700
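With the two table values for 9 degrees of freedom in hand, the interval for the population variance follows directly. The sketch below is not taken from the slides, which do not state the resulting interval; it copies the critical values 19.023 and 2.700 from the table rather than computing them, since the Python standard library has no chi-square quantile function.

```python
import math
from statistics import variance

temps = [67.4, 67.8, 68.2, 69.3, 69.5, 67.0, 68.1, 68.6, 67.9, 67.2]
n = len(temps)
s2 = variance(temps)     # sample variance (n - 1 divisor)

# Chi-square table values for df = 9 (upper-tail areas .025 and .975)
chi2_025 = 19.023
chi2_975 = 2.700

# (n-1)s^2 / chi2_{alpha/2}  <=  sigma^2  <=  (n-1)s^2 / chi2_{1-alpha/2}
var_low = (n - 1) * s2 / chi2_025
var_high = (n - 1) * s2 / chi2_975

# Square roots give the interval for the standard deviation
sd_low, sd_high = math.sqrt(var_low), math.sqrt(var_high)

print(f"s^2 = {s2:.2f}")
print(f"95% CI for variance: ({var_low:.2f}, {var_high:.2f})")
print(f"95% CI for std. dev.: ({sd_low:.2f}, {sd_high:.2f})")
```

Note that the larger chi-square value produces the lower limit, because it appears in the denominator.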
Example-2: The variance in drug weights is critical in the pharmaceutical industry. For a specific
drug, with weights measured in grams, a sample of 18 units provided a sample variance of s2=0.36.
(a) Construct a 90% confidence interval estimate of the population variance for the weight of this
drug.
(b) Construct a 90% confidence interval estimate of the population standard deviation.

Example-3: Consumer Reports uses a 100-point customer satisfaction score to rate the nation’s
major chain stores. Assume that from past experience with the satisfaction rating score, a
population standard deviation of σ=12 is expected. In 2012, Costco, with its 432 warehouses in 40
states, was the only chain store to earn an outstanding rating for overall quality (Consumer
Reports, March 2012). A sample of 15 Costco customer satisfaction scores are: 95, 90, 83, 75, 95,
98, 80, 83, 82, 93, 86, 80, 94, 64 and 62.
(a) What is the sample mean customer satisfaction score for Costco? (84)
(b) What is the sample variance? (118.71)
(c) What is the sample standard deviation? (10.90)
(d) Construct a 95% confidence interval estimate for population variance and standard deviation.
(63.63, 295.25) and (7.98, 17.18)
Rejection Criteria for χ² test about a Population Variance
Hypothesis Testing for σ2
Example: Buyer’s Digest
Buyer’s Digest rates thermostats manufactured for home temperature
control. In a recent test, 10 thermostats manufactured by ThermoRite
were selected and placed in a test room that was maintained at a
temperature of 68°F. The temperature readings of the ten thermostats
are shown below:

Thermostat      1     2     3     4     5     6     7     8     9     10
Temperature     67.4  67.8  68.2  69.3  69.5  67.0  68.1  68.6  67.9  67.2

Buyer’s Digest gives an “acceptable” rating to a thermostat with a
temperature variance of 0.5 or less. Test (with α=0.10) whether the
ThermoRite thermostat’s temperature variance is “acceptable”.
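Rejection/Non-rejection Region

Hypotheses: H0: σ² ≤ 0.5 and Ha: σ² > 0.5 (upper-tail test)

$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{9(0.70)}{0.5} = 12.6$$

With α = 0.10 and 9 df, the critical value is χ²0.10 = 14.684. Reject H0 if χ² ≥ 14.684.
Since χ² = 12.6 < 14.684, the test statistic falls in the non-rejection region and we do not reject H0.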
Inference: p-value Approach

1. The rejection region for the ThermoRite thermostat example is in the
upper tail; thus, the appropriate p-value is less than 0.90 (χ² = 4.168)
and greater than 0.10 (χ² = 14.684).
2. Because the p–value > α (0.10), we cannot reject the null hypothesis.
3. The sample variance of s²=0.70 is insufficient evidence to conclude that
the temperature variance is unacceptable (>0.50).

The exact p-value is .18156.


Example-2: Consumer Reports uses a 100-point customer satisfaction score to rate the nation’s major
chain stores. Assume that from past experience with the satisfaction rating score, a population standard
deviation of σ=12 is expected. In 2012, Costco, with its 432 warehouses in 40 states, was the only chain
store to earn an outstanding rating for overall quality (Consumer Reports, March 2012). A sample of 15
Costco customer satisfaction scores are: 95, 90, 83, 75, 95, 98, 80, 83, 82, 93, 86, 80, 94, 64 and 62.
Test the hypothesis to determine whether the population standard deviation of σ=12 should be rejected
for Costco. With a 0.05 level of significance, what is your conclusion?

Solution: H0: σ = 12 and Ha: σ ≠ 12

$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{(15-1)(118.714)}{12^2} = 11.542$$

For a two-tailed test, do not reject H0 if χ²(1−α/2) < χ² < χ²(α/2)

χ²(1−α/2) = χ²0.975 = 5.629
χ²(α/2) = χ²0.025 = 26.119
We have 5.629 < χ² (= 11.542) < 26.119, hence we do not reject the null hypothesis: the sample gives no evidence that
the population standard deviation of satisfaction scores differs from 12.
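The chi-square statistic for a variance test is simple to compute from raw data. In this sketch the two critical values are the ones quoted in the solution above (14 degrees of freedom); the variable names are illustrative.

```python
from statistics import mean, variance

scores = [95, 90, 83, 75, 95, 98, 80, 83, 82, 93, 86, 80, 94, 64, 62]
sigma0 = 12                      # hypothesized population standard deviation

n = len(scores)
x_bar = mean(scores)
s2 = variance(scores)            # sample variance (n - 1 divisor)
chi2_stat = (n - 1) * s2 / sigma0**2

# Two-tailed critical values at alpha = 0.05 with 14 df, as given in the solution
chi2_lower, chi2_upper = 5.629, 26.119
reject_h0 = chi2_stat < chi2_lower or chi2_stat > chi2_upper

print(f"mean = {x_bar}, s^2 = {s2:.2f}, chi-square = {chi2_stat:.3f}, reject H0: {reject_h0}")
```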
Chi-square Test for Independence of Two Categorical Variables
▪ An important application of a Chi-square test involves using sample
data to test for the independence of two categorical variables.
▪ For this test we take one sample from a population and record the
observations for two categorical variables.
▪ We summarize the data by counting the number of responses for
each combination of a category for variable 1 and a category for
variable 2.
▪ The null hypothesis for this test is that the two categorical variables
are independent. Thus, the test is referred to as a test of
independence.
Test of Independence
Example-1: Finger Lakes Homes

Each home sold by Finger Lakes Homes can be classified according to price and to
style. Finger Lakes’ manager would like to determine if the price of the home and the
style of the home are independent variables. The number of homes sold for each
model and price for the past two years is shown below. For convenience, the price of
the home is listed as either $99,000 or less or more than $99,000.
                    Style of Home
Price         Colonial    Log    Split-level    A-frame
≤ $99,000     18          6      19             12
> $99,000     12          14     16             3

Test the hypothesis whether price and style of the home are independent of each
other. Consider 0.05 level of significance.
Solution: Finger Lakes Homes

1. Hypotheses formulation
H0: Price of the home is independent of the style of the home that is purchased
Ha: Price of the home is not independent of the style of the home that is purchased
2. The sample of size 100 has been selected and observed frequencies are recorded.
3. Calculation of Expected Frequencies:

                          Style of Home
Price          Colonial       Log            Split-level     A-frame        Row total
≤ $99,000      (55*30/100)    (55*20/100)    (55*35/100)     (55*15/100)    55
               = 16.5         = 11           = 19.25         = 8.25
> $99,000      (45*30/100)    (45*20/100)    (45*35/100)     (45*15/100)    45
               = 13.5         = 9            = 15.75         = 6.75
Column total   30             20             35              15             Sample size = 100
4. Calculation of Test statistic χ²
fij     eij      (fij-eij)    (fij-eij)²    (fij-eij)²/eij
18      16.5     1.5          2.25          2.25/16.5 = 0.136
6       11       -5           25            25/11 = 2.272
19      19.25    -0.25        0.0625        0.0625/19.25 = 0.003
12      8.25     3.75         14.0625       14.0625/8.25 = 1.705
12      13.5     -1.5         2.25          2.25/13.5 = 0.167
14      9        5            25            25/9 = 2.778
16      15.75    0.25         0.0625        0.0625/15.75 = 0.004
3       6.75     -3.75        14.0625       14.0625/6.75 = 2.083
TOTAL (χ² value)                            χ² = 9.148

5. Determine the rejection rule: Reject H0 if χ² ≥ χ²α

df = (2-1)*(4-1) = 3 and α = 0.05; χ²α = 7.815
Since χ² (9.148) > χ²α (7.815), we reject, at 0.05 level of significance, the assumption
that the price of homes is independent of the style of home that is purchased.
Conclusion Using the p-value Approach

Area in Upper Tail     .10      .05      .025     .01      .005
χ² Value (df = 3)      6.251    7.815    9.348    11.345   12.838

Since χ² = 9.148 is between 7.815 and 9.348, the area in
the upper tail of the distribution is between 0.05 and
0.025.
The p-value < α. We reject the null hypothesis and
conclude that the price of homes is not independent of
the style of home that is purchased.
Actual p-value is 0.0274
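The expected frequencies and the χ² statistic for a test of independence follow mechanically from the observed table. A minimal sketch (the function name `chi2_independence` is hypothetical; the critical value is still read from the chi-square table):

```python
def chi2_independence(observed):
    """Chi-square test of independence for a table given as a list of rows.

    Returns (expected frequencies, chi-square statistic, degrees of freedom).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # e_ij = (row i total) * (column j total) / sample size
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2_stat = sum(
        (f - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for f, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return expected, chi2_stat, df

# Finger Lakes Homes: rows = price class, columns = Colonial, Log, Split-level, A-frame
homes = [[18, 6, 19, 12],
         [12, 14, 16, 3]]
expected, chi2_stat, df = chi2_independence(homes)
print(expected)
print(f"chi-square = {chi2_stat:.3f}, df = {df}")
```

Summing the unrounded terms gives a value that can differ in the third decimal place from the total of the rounded entries in the hand-worked table.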
Example-2: Health insurance benefits vary by the size of the company (Atlanta Business Chronicle,
December 31, 2010). The sample data below show the number of companies providing health
insurance for small, medium, and large companies. For purposes of this study, small companies are
companies that have fewer than 100 employees, medium-sized companies have 100 to 999 employees,
and large companies have 1000 or more employees. The questionnaire collected from 225 employees
asked whether or not the employee had health insurance and then asked the employee to indicate the
size of the company. Conduct a test of independence to determine whether health insurance
coverage is independent of the size of the company.
Using a 0.05 level of significance, what is your conclusion?

Solution:
H0: Health insurance coverage is independent of the size of the company
Ha: Health insurance coverage is not independent of the size of the company

Expected frequencies:
Health insurance     Small     Medium     Large     Total
Yes                  42        63         84        189
No                   8         12         16        36
Total                50        75         100       225

$$\chi^2 = \frac{(36-42)^2}{42} + \frac{(65-63)^2}{63} + \frac{(88-84)^2}{84} + \frac{(14-8)^2}{8} + \frac{(10-12)^2}{12} + \frac{(12-16)^2}{16} = 6.944$$

With df = (2-1)*(3-1) = 2 and α = 0.05, χ²α = 5.991.
Since χ² (= 6.944) > χ²α (= 5.991), we reject the null hypothesis and conclude that health insurance
coverage is not independent of the size of the company.


Topic-12
Analysis of Variance (ANOVA)
Introduction to Analysis of Variance
Statistical studies can be classified as being either experimental or
observational.
In an experimental study, one or more factors are controlled so
that data can be obtained about how the factors influence the
variables of interest.
In an observational study, no attempt is made to control the
factors.
Cause-and-effect relationships are easier to establish in
experimental studies than in observational studies.
Analysis of variance (ANOVA) can be used to analyze the data
obtained from experimental or observational studies.
Analysis of Variance: A Conceptual Overview
• Analysis of Variance (ANOVA) can be used to test for the
equality of three or more population means.
• Data obtained from observational or experimental studies
can be used for the analysis.
• We want to use the sample results to test the following
hypotheses:
H0: μ1 = μ2 = μ3 = . . . = μk
Ha: Not all population means are equal

• If H0 is rejected, we cannot conclude that all population
means are different.

• Rejecting H0 means at least two population means have
different values.
Analysis of Variance: A Conceptual Overview
Assumptions for Analysis of Variance

• For each population, the response (dependent) variable is
normally distributed.

• The variance of the response variable, denoted by σ², is the
same for all of the populations.

• The observations must be independent.


Steps in Analysis of Variance
We assume that a simple random sample of size nj has been
selected from each of the k populations or treatments. Let us
consider,
xij = value of observation i for treatment j
nj = number of observations for treatment j
x̄j = sample mean for treatment j
x̿ = overall sample mean
sj² = sample variance for treatment j
sj = sample standard deviation for treatment j
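Between-Treatments Estimate of Population Variance
$$MSTR = \frac{SSTR}{k-1}, \qquad SSTR = \sum_{j=1}^{k} n_j(\bar{x}_j - \bar{\bar{x}})^2$$
Within-Treatments Estimate of Population Variance
$$MSE = \frac{SSE}{n_T - k}, \qquad SSE = \sum_{j=1}^{k} (n_j - 1)s_j^2$$
Comparing the Variance Estimates: The F Test
$$F = \frac{MSTR}{MSE}$$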
ANOVA Table
Example of a typical ANOVA problem

Observation              Sample-1   Sample-2   Sample-3   …   Sample-j   …   Sample-k
1                        x11        x12        x13            x1j            x1k
2                        x21        x22        x23            x2j            x2k
3                        x31        x32        x33            x3j            x3k
.                        .
i                        xi1        xi2        xi3            xij            xik
.
n                        xn1        xn2        xn3            xnj            xnk
Sample Mean (x̄j)         x̄1         x̄2         x̄3             x̄j             x̄k
Sample Variance (sj²)    s1²        s2²        s3²            sj²            sk²
Test for the Equality of k Population Means

Rejection Rule
p-value Approach: Reject H0 if p-value < α

Critical Value Approach: Reject H0 if F > Fα

where the value of Fα is based on an
F distribution with k - 1 numerator d.f.
and nT - k denominator d.f.
Testing for the Equality of k Population
Means: ANOVA
Example: AutoShine, Inc.

AutoShine, Inc. is considering marketing a long-lasting car wax. Three


different waxes (Type 1, Type 2 and Type 3) have been developed.
In order to test the durability of these waxes, 5 new cars were waxed
with Type 1, 5 with Type 2, and 5 with Type 3. Each car was then
repeatedly run through an automatic carwash until the wax coating
showed signs of deterioration.
The number of times each car went through the carwash before its wax
deteriorated is shown on the next slide. AutoShine, Inc. must decide
which wax to market. Are the three waxes equally effective?
Example: AutoShine, Inc.

                   Wax       Wax       Wax
Observation        Type 1    Type 2    Type 3
1                  27        33        29
2                  30        28        28
3                  29        31        30
4                  28        30        32
5                  31        30        31

Sample Mean        29.0      30.4      30.0
Sample Variance    2.5       3.3       2.5
ANOVA: Solution

Hypotheses
H0: μ1 = μ2 = μ3
Ha: Not all the means are equal
where:
μ1 = mean number of washes using Type 1 wax
μ2 = mean number of washes using Type 2 wax
μ3 = mean number of washes using Type 3 wax
1. Mean Square Between Treatments (MSTR) calculation:
Because the sample sizes are all equal:
$$\bar{\bar{x}} = (\bar{x}_1 + \bar{x}_2 + \bar{x}_3)/3 = (29 + 30.4 + 30)/3 = 29.8$$
$$SSTR = \sum_{j=1}^{k} n_j(\bar{x}_j - \bar{\bar{x}})^2 = 5(29 - 29.8)^2 + 5(30.4 - 29.8)^2 + 5(30 - 29.8)^2 = 5.2$$
MSTR = SSTR/(k-1) = 5.2/(3 - 1) = 2.6

2. Mean Square Error (MSE) calculation:
$$SSE = \sum_{j=1}^{k} (n_j - 1)s_j^2 = 4(2.5) + 4(3.3) + 4(2.5) = 33.2$$
MSE = SSE/(nT - k) = 33.2/(15 - 3) = 2.77
ANOVA Table

Source of      Sum of          Degrees of      Mean
Variation      Squares         Freedom         Squares            F                    p-Value
Treatments     SSTR = 5.2      k-1 = 2         5.2/2 = 2.60       2.60/2.77 = 0.939    0.42
Error          SSE = 33.2      nT - k = 12     33.2/12 = 2.77
Total          38.4            14
Rejection Rule

p-Value Approach: Reject H0 if p-value < .05
Critical Value Approach: Reject H0 if F > 3.89

where F.05 = 3.89 is based on an F distribution
with 2 numerator degrees of freedom and 12
denominator degrees of freedom

Because F = 0.939 is less than F.10 = 2.81, the p-value is greater
than .10 (Excel provides a p-value of .42).

There is insufficient evidence to conclude that
the mean numbers of washes for the three wax
types are not all the same.
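The whole ANOVA calculation for the AutoShine data can be reproduced from the raw observations. The sketch below uses only the standard library; `one_way_anova` and `p_value_two_numerator_df` are hypothetical helper names. The p-value helper relies on a closed form for the upper-tail area of the F distribution that is valid only when the numerator degrees of freedom equal 2 (that is, k = 3 treatments); it is an assumption-laden shortcut, not a general F-distribution routine.

```python
from statistics import mean, variance

def one_way_anova(samples):
    """Return (SSTR, SSE, MSTR, MSE, F) for a list of samples, one per treatment."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    sstr = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    sse = sum((len(s) - 1) * variance(s) for s in samples)
    mstr = sstr / (k - 1)
    mse = sse / (n_total - k)
    return sstr, sse, mstr, mse, mstr / mse

def p_value_two_numerator_df(f_stat, df_denominator):
    """Upper-tail area of an F(2, df_denominator) distribution.

    Closed form that holds ONLY for 2 numerator degrees of freedom.
    """
    return (1 + 2 * f_stat / df_denominator) ** (-df_denominator / 2)

# Number of washes before the wax deteriorated
wax_type_1 = [27, 30, 29, 28, 31]
wax_type_2 = [33, 28, 31, 30, 30]
wax_type_3 = [29, 28, 30, 32, 31]

sstr, sse, mstr, mse, f_stat = one_way_anova([wax_type_1, wax_type_2, wax_type_3])
p_value = p_value_two_numerator_df(f_stat, df_denominator=12)
print(f"SSTR = {sstr:.1f}, SSE = {sse:.1f}, MSTR = {mstr:.2f}, MSE = {mse:.2f}")
print(f"F = {f_stat:.2f}, p-value = {p_value:.2f}")
```

For designs with more than three treatments, the p-value would have to come from an F table or a statistics package instead of this shortcut.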
Example: Reed Manufacturing
Janet Reed would like to know if there is any significant difference in
the mean number of hours worked per week for the department
managers at her three manufacturing plants (in Buffalo, Pittsburgh,
and Detroit). A simple random sample of five managers from each of
the three plants was taken and the number of hours worked by each
manager in the previous week is shown in the next slide.
Conduct an appropriate test using α = 0.05.

Factor . . . Manufacturing plant
Treatments . . . Buffalo, Pittsburgh, Detroit
Experimental units . . . Managers
Response variable . . . Number of hours worked
Plant 1 Plant 2 Plant 3
Observation Buffalo Pittsburgh Detroit
1 48 73 51
2 54 63 63
3 57 66 61
4 54 64 54
5 62 74 56
Sample Mean 55 68 57
Sample Variance 26.0 26.5 24.5
p -Value and Critical Value Approaches

1. Develop the hypotheses.

H0: μ1 = μ2 = μ3
Ha: Not all the means are equal
where:
μ1 = mean number of hours worked per
week by the managers at Plant 1
μ2 = mean number of hours worked per
week by the managers at Plant 2
μ3 = mean number of hours worked per
week by the managers at Plant 3
2. Specify the level of significance. α = .05

3. Compute the value of the test statistic.

Mean Square Due to Treatments
(Sample sizes are all equal.)
x̿ = (55 + 68 + 57)/3 = 60
SSTR = 5(55 - 60)² + 5(68 - 60)² + 5(57 - 60)² = 490
MSTR = 490/(3 - 1) = 245
3. Compute the value of the test statistic. (con’t.)

Mean Square Due to Error


SSE = 4(26.0) + 4(26.5) + 4(24.5) = 308
MSE = 308/(15 - 3) = 25.667
F = MSTR/MSE = 245/25.667 = 9.55
ANOVA Table

Source of Sum of Degrees of Mean


Variation Squares Freedom Square F p-value
Treatment 490 2 245 9.55 .0033
Error 308 12 25.667

Total 798 14
p –Value Approach

4. Compute the p-value.

With 2 numerator d.f. and 12 denominator d.f.,
the p-value is .01 for F = 6.93. Therefore, the
p-value is less than .01 for F = 9.55.

5. Determine whether to reject H0.

The p-value < .05, so we reject H0.
We have sufficient evidence to conclude that the
mean number of hours worked per week by
department managers is not the same at all 3 plants.
Critical Value Approach
4. Determine the critical value and rejection rule.
Based on an F distribution with 2 numerator
d.f. and 12 denominator d.f., F.05 = 3.89.
Reject H0 if F > 3.89

5. Determine whether to reject H0.


Because F = 9.55 > 3.89, we reject H0.
We have sufficient evidence to conclude that the
mean number of hours worked per week by
department managers is not the same at all 3 plants.
Multiple Comparison Procedures

• Suppose that analysis of variance has provided
statistical evidence to reject the null hypothesis of
equal population means.
• Fisher’s least significant difference (LSD) procedure
can be used to determine where the differences
occur.
Fisher’s LSD Procedure
Hypotheses
H0: μi = μj
Ha: μi ≠ μj

Test Statistic
$$t = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$$
Rejection Rule
p-value Approach:
Reject H0 if p-value < α

Critical Value Approach:
Reject H0 if t < -tα/2 or t > tα/2

where the value of tα/2 is based on a
t distribution with nT - k degrees of freedom.
LSD for Plants 1 and 2
Hypotheses (A): 𝐻0 : 𝜇1 = 𝜇2
𝐻𝑎 : 𝜇1 ≠ 𝜇2
Rejection Rule:
Reject H0 if |x̄1 − x̄2| > LSD, where
LSD = tα/2 √(MSE(1/n1 + 1/n2)) = 2.179 √(25.667(1/5 + 1/5)) = 6.98
Test Statistic:
|x̄1 − x̄2| = |55 - 68| = 13
Conclusion:
The mean number of hours worked at Plant 1 is
not equal to the mean number worked at Plant 2.
LSD for Plants 1 and 3
Hypotheses (B): 𝐻0 : 𝜇1 = 𝜇3
𝐻𝑎 : 𝜇1 ≠ 𝜇3
Rejection Rule:
Reject H0 if |x̄1 − x̄3| > 6.98
Test Statistic:
|x̄1 − x̄3| = |55 - 57| = 2
Conclusion:
There is no significant difference between the mean
number of hours worked at Plant 1 and the mean number
of hours worked at Plant 3.
LSD for Plants 2 and 3
Hypotheses (C): H0: μ2 = μ3
Ha: μ2 ≠ μ3
Rejection Rule:
Reject H0 if |x̄2 − x̄3| > 6.98
Test Statistic:
|x̄2 − x̄3| = |68 - 57| = 11
Conclusion:
The mean number of hours worked at Plant 2 is
not equal to the mean number worked at Plant 3.
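The three pairwise comparisons can be run in one pass. In this sketch the critical value 2.179 is the t-table value for an upper-tail area of .025 with 12 degrees of freedom; the slides do not state it explicitly, so it is labeled here as an assumption, as are the variable names.

```python
import math
from itertools import combinations

# Reed Manufacturing sample means (hours worked per week)
sample_means = {"Buffalo": 55, "Pittsburgh": 68, "Detroit": 57}
n_per_plant = 5
mse = 308 / 12            # mean square error from the ANOVA table (25.667)

# Assumption: t_{.025} with nT - k = 12 degrees of freedom, read from a t table
t_critical = 2.179

# With equal sample sizes, one LSD value serves every pair
lsd = t_critical * math.sqrt(mse * (1 / n_per_plant + 1 / n_per_plant))

significant = {}
for plant_a, plant_b in combinations(sample_means, 2):
    difference = abs(sample_means[plant_a] - sample_means[plant_b])
    significant[(plant_a, plant_b)] = difference > lsd
    verdict = "significant" if difference > lsd else "not significant"
    print(f"{plant_a} vs {plant_b}: |difference| = {difference}, LSD = {lsd:.2f} -> {verdict}")
```

If the sample sizes differed between plants, the LSD would have to be recomputed for each pair.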
Topic-13
Nonparametric Methods
Topics to be discussed

➢ Sign Test
➢ Wilcoxon Signed Rank Test
➢ Kruskal-Wallis Test
Nonparametric Methods
❑ Most of the statistical methods referred to as parametric require the use of
interval- or ratio-scaled data.

❑ Nonparametric methods are often the only way to analyze categorical
(nominal or ordinal) data and draw statistical conclusions.

❑ Nonparametric methods require no assumptions about the population
probability distributions.

❑ Nonparametric methods are often called distribution-free methods.

❑ Whenever the data are quantitative, we will transform the data into
categorical data in order to conduct the nonparametric test.
Sign Test
• The sign test is a versatile method for hypothesis testing
that uses the binomial distribution with p=0.50 as the
sampling distribution.
• There are two applications of the sign test:
✓ A hypothesis test about a population median
✓ A matched-sample test about the difference between two
populations
Hypothesis Test about a Population Median

We can apply the sign test by:


➢ Using a plus sign whenever the data in the sample are above
the hypothesized value of the median

➢ Using a minus sign whenever the data in the sample are
below the hypothesized value of the median

➢ Discarding any data exactly equal to the hypothesized
median
Hypothesis Test about a Population Median

▪ The assigning of the plus and minus signs makes the situation
into a binomial distribution application.
▪ The sample size is the number of trials.
▪ There are two outcomes possible per trial, a plus sign or a
minus sign.
▪ The trials are independent.
▪ We let p denote the probability of a plus sign.
▪ If the population median is in fact a particular value, p should
equal 0.5.
Hypothesis Test about a Population Median:
Small-Sample Case
The small-sample case for this sign test should be
used whenever n < 20.
The hypotheses are:

𝐻0 : 𝑝 = .50
The population median equals the
value assumed.
𝐻a : 𝑝 ≠ .50
The population median is different
than the value assumed.
The number of plus signs is our test statistic.
Assuming H0 is true, the sampling distribution for
the test statistic is a binomial distribution with p = 0.5
H0 is rejected if the p-value < level of significance.
Hypothesis Test about a Population Median:
Smaller Sample Size
Example: Potato Chip Sales
Lawler’s Grocery Store made the decision to carry Cape May Potato Chips based
on the manufacturer’s estimate that the median sales should be $450 per week
on a per-store basis.
Lawler’s has been carrying the potato chips for three months. Data showing one-
week sales at 10 randomly selected Lawler’s stores are shown below:

Store     Weekly                  Store     Weekly
Number    Sales ($)    Sign       Number    Sales ($)    Sign
56        485          +          63        474          +
19        562          +          39        662          +
36        415          -          84        380          -
128       860          +          102       515          +
12        426          -          44        721          +
Example: Potato Chip Sales
Lawler’s management requested the following
hypothesis test about the population median weekly
sales of Cape May Potato Chips (using α = 0.10).
H0: Median Sales = 450
Ha: Median Sales ≠ 450

In terms of the binomial probability p:

H0: p = .50
Ha: p ≠ .50
Example: Potato Chip Sales

Binomial Probabilities with n = 10 and p = .50

Number of                     Number of
Plus Signs    Probability     Plus Signs    Probability
0             .0010           6             .2051
1             .0098           7             .1172
2             .0439           8             .0439
3             .1172           9             .0098
4             .2051           10            .0010
5             .2461
Example: Potato Chip Sales
Since the observed number of plus signs is 7, we begin
by computing the probability of obtaining 7 or more
plus signs.
The probability of 7, 8, 9, or 10 plus signs is:
.1172 + .0439 + .0098 + .0010 = .1719.
We are using a two-tailed hypothesis test, so:
p-value = 2(.1719) = .3438.

Conclusion:
Because the p-value > α (.3438 > .10), we cannot reject H0.
There is insufficient evidence in the sample to reject the
assumption that the median weekly sales is $450.
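The small-sample calculation above can be checked with a short Python sketch. It is only an illustration: the function name `sign_test_exact` is made up for this example, and the two-tailed p-value is taken as twice the smaller tail probability, as in the slide.

```python
from math import comb

def sign_test_exact(data, median0):
    """Two-tailed exact sign test of H0: population median == median0."""
    plus = sum(1 for x in data if x > median0)
    minus = sum(1 for x in data if x < median0)
    n = plus + minus  # values equal to median0 are discarded
    # Binomial(n, 0.5) probability for each possible number of plus signs
    pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    upper = sum(pmf[plus:])        # P(plus signs >= observed)
    lower = sum(pmf[:plus + 1])    # P(plus signs <= observed)
    p_value = min(1.0, 2 * min(upper, lower))
    return plus, n, p_value

# Weekly sales ($) at the 10 sampled Lawler's stores
weekly_sales = [485, 562, 415, 860, 426, 474, 662, 380, 515, 721]
plus, n, p_value = sign_test_exact(weekly_sales, 450)
print(f"plus signs = {plus}, n = {n}, p-value = {p_value:.4f}")
```

Because the probabilities are computed exactly rather than read from a rounded table, the result may differ from the slide's figure in the last decimal place.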
Hypothesis Test about a Population Median:
Larger Sample Size
With larger sample sizes, we rely on the normal
distribution approximation of the binomial distribution
to compute the p-value, which makes the computations
quicker and easier.
Normal Approximation of
the Number of Plus Signs when H0: p = .50
Mean: $\mu = .50n$
Standard Deviation: $\sigma = \sqrt{.25n}$
Distribution Form: Approximately normal for n > 20
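As a rough sketch of how the normal approximation might be applied, the Python function below computes a two-tailed p-value from the number of plus signs. The function names and the counts in the usage line (21 plus signs out of 30 non-tied observations) are hypothetical and are not taken from the slides. The continuity correction is assumed to move the observed count half a unit toward the mean.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def sign_test_normal(plus, n):
    """Two-tailed large-sample sign test; n excludes discarded ties."""
    mu = 0.5 * n
    sigma = sqrt(0.25 * n)
    if plus == mu:
        return 0.0, 1.0
    if plus > mu:
        # Upper tail: P(x >= plus) is approximated by P(normal >= plus - 0.5)
        z = (plus - 0.5 - mu) / sigma
        tail = 1.0 - normal_cdf(z)
    else:
        # Lower tail: P(x <= plus) is approximated by P(normal <= plus + 0.5)
        z = (plus + 0.5 - mu) / sigma
        tail = normal_cdf(z)
    return z, min(1.0, 2.0 * tail)

# Hypothetical counts, for illustration only
z, p_value = sign_test_normal(plus=21, n=30)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

H0 would be rejected when the returned p-value falls below the chosen level of significance.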
Hypothesis Test about a Population Median:
Larger Sample Size
Example: Trim Fitness Center
A hypothesis test is being conducted about the
median age of female members of the Trim Fitness
Center.

H0: Median Age = 34 years
Ha: Median Age ≠ 34 years

In a sample of 40 female members, 25 are older than
34, 14 are younger than 34, and 1 is 34. Is there
sufficient evidence to reject H0? Use α = .05.
Hypothesis Test about a Population Median:
Larger Sample Size
Example: Trim Fitness Center
• Letting x denote the number of plus signs, we will
use the normal distribution to approximate the
binomial probability P(x ≥ 25).
• Remember that the binomial distribution is discrete
and the normal distribution is continuous.
• To account for this, the binomial probability P(x ≥ 25)
is approximated by the normal probability of 24.5 or more.
Hypothesis Test about a Population Median:
Larger Sample Size
Mean and Standard Deviation (n = 39 after discarding the one member aged 34):
$\mu = .5n = .5(39) = 19.5$
$\sigma = \sqrt{.25n} = \sqrt{.25(39)} = 3.1225$
Test Statistic:
$z = (x - \mu)/\sigma = (24.5 - 19.5)/3.1225 = 1.60$
p-value:
p-value = 2(1.0000 - .9452) = .1096
Hypothesis Test about a Population Median:
Larger Sample Size
Rejection Rule:
Using .05 level of significance:
Reject H0 if p-value < .05

Conclusion:
Do not reject H0. The p-value for this two-tail test is
.1096. There is insufficient evidence in the sample
to conclude that the median age is not 34 for female
members of Trim Fitness Center.
Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is a procedure for
analyzing data from a matched samples experiment.
The test uses quantitative data but does not require
the assumption that the differences between the
paired observations are normally distributed.
It only requires the assumption that the differences
have a symmetric distribution.
This occurs whenever the shapes of the two
populations are the same and the focus is on
determining if there is a difference between the two
populations’ medians.
Wilcoxon Signed-Rank Test
Let T⁻ denote the sum of the negative signed ranks.
Let T⁺ denote the sum of the positive signed ranks.
If the medians of the two populations are equal, we
would expect the sum of the negative signed ranks
and the sum of the positive signed ranks to be
approximately the same.
We use T⁺ as the test statistic.
Wilcoxon Signed-Rank Test

Sampling Distribution of T⁺
for the Wilcoxon Signed-Rank Test
Mean: $\mu_{T^+} = \dfrac{n(n+1)}{4}$
Standard Deviation: $\sigma_{T^+} = \sqrt{\dfrac{n(n+1)(2n+1)}{24}}$
Distribution Form: Approximately normal for n ≥ 10


Example: Express Deliveries
A firm has decided to select one of two express delivery services to provide next-day
deliveries to its district offices.
To test the delivery times (in hours) of the two services, the firm sends two reports to a
sample of 10 district offices, with one report carried by one service and the other report
carried by the second service. Do the data indicate a difference in the two services?
District Office OverNight NiteFlite

Seattle 32 25
Los Angeles 30 24
Boston 19 15
Cleveland 16 15
New York 15 13
Houston 18 15
Atlanta 14 15
St. Louis 10 8
Milwaukee 7 9
Denver 16 11
Wilcoxon Signed-Rank Test

Preliminary Steps of the Test:

✓ Compute the differences between the paired observations.
✓ Discard any differences of zero.
✓ Rank the absolute values of the differences from lowest to
highest. Tied differences are assigned the average ranking of
their positions.
✓ Give the ranks the sign of the original difference in the data.
✓ Sum the positive signed ranks to obtain T⁺.
✓ Determine whether T⁺ differs significantly from the value
expected when the two population medians are equal.
Wilcoxon Signed-Rank Test

Hypotheses:

H0: The difference in the median delivery times of the
two services equals 0.
Ha: The difference in the median delivery times of the
two services does not equal 0.
Wilcoxon Signed-Rank Test

District Office    Diff.    |Diff.|    Rank    Signed Rank
Seattle             7        7         10      +10
Los Angeles         6        6          9      +9
Boston              4        4          7      +7
Cleveland           1        1          1.5    +1.5
New York            2        2          4      +4
Houston             3        3          6      +6
Atlanta            -1        1          1.5    -1.5
St. Louis           2        2          4      +4
Milwaukee          -2        2          4      -4
Denver              5        5          8      +8

T⁺ = 49.5
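Test Statistic:
$$\mu_{T^+} = \frac{n(n+1)}{4} = \frac{10(10+1)}{4} = 27.5$$
$$\sigma_{T^+} = \sqrt{\frac{n(n+1)(2n+1)}{24}} = \sqrt{\frac{10(11)(21)}{24}} = 9.81$$
With the continuity correction (49.5 - 0.5 = 49):
$$P(T^+ \ge 49.5) = P\left(z \ge \frac{49 - 27.5}{9.81}\right) = P(z \ge 2.19)$$

p-value:
p-value = 2(1.0000 - 0.9857) = 0.0286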
Wilcoxon Signed-Rank Test

Rejection Rule:
Using 0.05 level of significance,
Reject H0 if p-value < .05
Conclusion:
Reject H0. The p-value for this two-tail test is 0.0286.
There is sufficient evidence in the sample to
conclude that a difference exists in the median
delivery times provided by the two services.
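The Express Deliveries calculation can be reproduced with the following Python sketch. It is a minimal illustration rather than a general-purpose routine: the helper names `average_ranks` and `normal_cdf` are invented for this example, and the continuity correction is written for the upper tail only, which applies here because T⁺ exceeds its mean.

```python
from math import erf, sqrt

def average_ranks(values):
    """Rank values from lowest to highest; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Delivery times (hours) for the 10 district offices
overnight = [32, 30, 19, 16, 15, 18, 14, 10, 7, 16]
niteflite = [25, 24, 15, 15, 13, 15, 15, 8, 9, 11]

# Differences, discarding zeros
diffs = [a - b for a, b in zip(overnight, niteflite) if a != b]
ranks = average_ranks([abs(d) for d in diffs])
t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)

n = len(diffs)
mu = n * (n + 1) / 4
sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (t_plus - 0.5 - mu) / sigma  # upper-tail continuity correction
p_value = 2 * (1 - normal_cdf(z))
print(f"T+ = {t_plus}, z = {z:.2f}, p-value = {p_value:.4f}")
```

The p-value computed this way may differ slightly from the slide's 0.0286 because z is not rounded to two decimals before the normal probability is looked up.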
Kruskal-Wallis Test
The Kruskal-Wallis test is used to compare three or more
populations.

H0: All populations are identical
Ha: Not all populations are identical

The Kruskal-Wallis test can be used with ordinal data
as well as with interval or ratio data.
Also, the Kruskal-Wallis test does not require the
assumption of normally distributed populations.
Test Statistic
$$H = \left[\frac{12}{n_T(n_T+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i}\right] - 3(n_T+1)$$

where:
k = number of populations
n_i = number of observations in sample i
n_T = Σ n_i = total number of observations in all samples
R_i = sum of the ranks for sample i
Kruskal-Wallis Test
When the populations are identical, the sampling
distribution of the test statistic H can be approximated
by a chi-square distribution with k–1 degrees of freedom.
This approximation is acceptable if each of the sample
sizes n_i is ≥ 5.

This test is always expressed as an upper-tailed test.

The rejection rule is: Reject H0 if p-value < α


Kruskal-Wallis Test
Example: Lakewood High School
John Norr, Director of Athletics at Lakewood High School, is curious about
whether a student’s total number of absences in four years of high school is
the same for students participating in no varsity sport, one varsity sport, and
two varsity sports.
Number of absences data were available for 20 recent graduates and are
listed below. Test whether the three populations are identical in terms of
number of absences. Use α = 0.10.
No Sport    1 Sport    2 Sports
13          18         12
16          12         22
6           19         9
27          7          11
20          15         15
14          20         21
            17         10
Kruskal-Wallis Test

Example: Lakewood High School


No Sport    Rank     1 Sport    Rank     2 Sports    Rank
13          8        18         14       12          6.5
16          12       12         6.5      22          19
6           1        19         15       9           3
27          20       7          2        11          5
20          16.5     15         10.5     15          10.5
14          9        20         16.5     21          18
                     17         13       10          4
Total       66.5     Total      77.5     Total       66
Kruskal-Wallis Test

Rejection Rule:
Using test statistic: Reject H0 if H > 4.605 (χ² critical value, 2 d.f.)
Using p-value: Reject H0 if p-value < 0.10
Kruskal-Wallis Test Statistic:
k = 3 populations, n1 = 6, n2 = 7, n3 = 7, nT = 20
$$H = \left[\frac{12}{n_T(n_T+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i}\right] - 3(n_T+1)$$
$$= \frac{12}{20(20+1)}\left[\frac{(66.5)^2}{6} + \frac{(77.5)^2}{7} + \frac{(66.0)^2}{7}\right] - 3(20+1) = 0.3532$$
Kruskal-Wallis Test

Conclusion:
Do not reject H0. There is insufficient evidence to
conclude that the populations are not identical.
(H = 0.3532 < 4.60517)
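The Lakewood High School computation can be sketched in Python as follows. This is an illustration under stated assumptions: the helper `average_ranks` is a made-up name, no tie correction is applied (matching the formula on the slide), and the p-value line uses the fact that the chi-square upper-tail probability with exactly 2 degrees of freedom equals exp(-H/2), so it does not generalize to other values of k.

```python
from math import exp

def average_ranks(values):
    """Rank values from lowest to highest; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Number of absences for the 20 recent graduates
samples = {
    "No Sport": [13, 16, 6, 27, 20, 14],
    "1 Sport": [18, 12, 19, 7, 15, 20, 17],
    "2 Sports": [12, 22, 9, 11, 15, 21, 10],
}

# Rank all observations together, then sum the ranks within each sample
pooled = [x for values in samples.values() for x in values]
ranks = average_ranks(pooled)
n_total = len(pooled)

rank_sums = {}
weighted = 0.0
start = 0
for name, values in samples.items():
    r = sum(ranks[start:start + len(values)])
    rank_sums[name] = r
    weighted += r ** 2 / len(values)
    start += len(values)

h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
p_value = exp(-h / 2)  # valid only for k - 1 = 2 degrees of freedom
print(rank_sums)
print(f"H = {h:.4f}, p-value = {p_value:.4f}")
```

Since the p-value is far above 0.10, the sketch leads to the same decision as comparing H with the critical value 4.605.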
