Introduction To Statistics

Uploaded by

miki

Introduction to statistics

Chapter-One
Basic concepts, Method of data collection and presentation
1. Introduction
The word statistics comes from the Italian word “statista”, which means “statesman”. Although people had been
recording and using data for various purposes in the distant past, the term was first used in the early 18th
century in many parts of Europe to signify the use of recorded data for the political purposes of the state.
Nowadays statistical information plays a pivotal role in a wide range of fields, many of which influence and
affect our day-to-day activities. Statistics is used in almost all fields of human endeavor. In the recent past
statistics has become part of the natural sciences, social sciences, research, business, management, planning,
economics, industry, behavioral sciences, agriculture, engineering and many other experimental sciences.
2. Definition of Statistics
Before we go further, we should introduce the two meanings of the word statistics:
Statistics in its plural sense
It refers to a collection of numerical facts, figures or statistical data. This meaning of the word is widely
used when reference is made to facts and figures on employment or unemployment, rates of traffic accidents,
deaths, births, student enrollment at a university, etc.
Statistics in its singular sense
In this case statistics has its modern meaning and refers to the subject area that is concerned with extracting
relevant information from available data with the aim of making sound decisions. It is a branch of mathematics
or applied research concerned with the development and application of methods and techniques for collecting,
organizing, presenting, analyzing and interpreting quantitative data in such a way that the reliability of
conclusions based on the data may be evaluated objectively in terms of probability statements.
Classification of statistics
Statisticians commonly classify statistical techniques into two broad categories:
• Descriptive
• Inferential
Descriptive statistics: - this part of statistics deals only with describing some characteristics of the data
without going beyond the data. It encompasses any kind of data processing technique that is designed to
summarize or describe important features of the data. This comprises the first three stages of statistical
investigation, namely collection, organization and presentation of data.
E.g. 20% of the students in my class are married.

Inferential statistics: - this is concerned with drawing statistically valid conclusions about the
characteristics of the population based on information from a sample. It is the part of statistics that is
concerned with generalizing from sample to population using probability, performing hypothesis tests,
determining relationships between variables and making predictions.
E.g. At least 5% of all killings in city Y during the last year were due to terrorists.
Stages in statistical investigation
Data collection: - this is the first stage in a statistical investigation; gathering information is the basic
purpose of this stage. If data are needed and are not readily available, then they have to be collected. Data
may be collected by the investigator directly, using methods like interviews, questionnaires and observation,
or may be available from published or unpublished sources.
Data organization: - this is the stage where we edit our data. The collected data might involve irrelevant
figures, incorrect facts, omissions and mistakes. Errors that were introduced during data collection
have to be corrected. After editing, we may classify (arrange) the data according to their common
characteristics, which is called organizing.
Data presentation: - in this stage the organized data are presented in the form of tables and diagrams.
Graphs may also be used to make the data easier to understand and the presentation more attractive.
Data analysis: - this is the stage where we critically study the data to draw conclusions about them. Analysis
may involve highly complex and sophisticated mathematical techniques.
Data interpretation: - this stage means drawing valid conclusions from the results obtained through data
analysis; these conclusions form the basis for decision making.
Definitions of some statistical terms
➢ Population: - is the complete collection of individuals, objects or measurements that have a
characteristic in common.
A population may be finite or infinite. If a population consists of a fixed number of values, the
population is said to be finite. If, on the other hand, it consists of an endless succession of
values, the population is infinite. A population may also be quantitative or qualitative.
➢ Sample: - when a population is infinite, it is impossible to obtain all possible observations, for they are
infinitely many. Likewise, if the study is destructive in nature, we cannot obtain information from each and
every member of the population. We are therefore forced to analyze a representative part of the
population; such a representative part of the population is called a sample.
➢ Census: - is the process of collecting data covering all the units in the population.

➢ Parameter: - is a numerical characteristic of the population defined for each variable of interest, that is,
a statistical measure obtained from a population.
➢ Statistic: - is a measure obtained from sample data, used to make statements about an unknown
parameter.
➢ Frame: - is a list of elements covering the survey population; it serves as a base for sample selection.
➢ Data: - is a set of related observations from which conclusions may be drawn.
➢ Variable: - is a characteristic or attribute associated with each unit in the population that can assume
different values.
➢ Elementary unit: - is an element or group of elements on which information is required.
➢ Sampling unit: - for the purpose of sample selection, the population is divided into a finite number of
distinct, non-overlapping and identifiable units, called sampling units.
Application and limitation of statistics
Application of statistics
Statistics is used in almost all fields of human activity, and it is used by governmental bodies, private business
firms and research agencies as an indispensable tool. In particular it is used in the following areas.
➢ Design and analysis of experiments for testing new ideas and competing hypotheses.
➢ For short term and long term rational planning and decision making and control.
➢ To assess past trends and current status and to forecast future economic activities for a firm, an
industry or the economy as a whole.
➢ Determination of manpower requirements, personnel selection, market research, financial analysis,
distribution analysis and development.
➢ In public administration and in the social science like in the studies of poverty, population, voting
pattern, accidents etc.
➢ In communicating information, drawing conclusions and inference from data and guiding planning
and decision.
Limitation of statistics
Even though statistics is growing in popularity and is being successfully employed by the seekers of truth
in numerous fields of learning, it still has limitations. Some of them are:
➢ It deals directly only with quantitative characteristics.
➢ It doesn’t deal with individual measurements; rather, it studies aggregates of facts.
➢ Its results are true only in general and on the average.
➢ Ignorant or wrongly motivated persons can misuse statistics.

In summary, statistics is a highly developed science with deep rooted mathematical base. It is
applicable to a large number of economic, social and business phenomena. It is a backbone of
industrial research, basic science research and planning.
Scale of measurement
➢ Nominal data: - as the name implies, it consists of “naming” observations, or classifying them into
various mutually exclusive and collectively exhaustive categories.
It indicates only that there is a qualitative difference among categories.
E.g. sex of an individual
the region numbers of Ethiopia
➢ Ordinal data: - they are nominal data whose categories have an agreed order; measurements on an
ordinal scale are ordered in the sense that higher numbers represent higher values.
They allow meaningful inequalities (greater than, less than), but differences between values are not meaningful.
E.g. military ranks, comparing a 3-star general and a 4-star general
➢ Interval data: - they are ordinal data in which the differences between values have meaning.
There is no true zero; the zero point is arbitrary.
Ratios of different values are meaningless.
E.g. the temperature of town X is 30 °C on Monday
➢ Ratio data: - they are interval data that also have a true zero, which shows the absence of
the quantity measured. This makes it possible to state relations in terms of proportions or ratios.
E.g. income of a person
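The difference between the interval and ratio scales can be checked numerically. The following is a minimal sketch in Python (a language the notes themselves do not use); the 30 °C value is the town X example, while the second temperature of 15 °C and the function name are assumptions made only for illustration.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32

# 30 °C is the town X example; 15 °C is a hypothetical second reading.
t1_c, t2_c = 30.0, 15.0
t1_f, t2_f = celsius_to_fahrenheit(t1_c), celsius_to_fahrenheit(t2_c)

# Differences are meaningful on an interval scale: they change only by the
# fixed factor 9/5 when the unit changes.
diff_c = t1_c - t2_c
diff_f = t1_f - t2_f

# Ratios are not meaningful: they depend on where the arbitrary zero sits.
ratio_c = t1_c / t2_c
ratio_f = t1_f / t2_f

print("difference:", diff_c, "°C =", diff_f, "°F")
print("ratio in °C:", ratio_c, " ratio in °F:", round(ratio_f, 3))
```

Because the two ratios disagree, a statement such as "twice as hot" has no meaning for temperature in °C, whereas for income (ratio scale) "twice as much" holds in any currency unit.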

CHAPTER TWO

2. METHOD OF DATA COLLECTION AND PRESENTATION


1.2.1. Method of Data Collection
There are three major methods of data collection.
I. Observation or measurement
II. Interviews and questionnaires
III. The use of documentary sources
I. Observation or measurement ( direct personal observation)
In this case data are obtained through direct observation or measurement. This requires training and
monitoring of the person taking the measurements to ensure the use of standard procedures.
- It provides accurate information, but it is expensive and inconvenient.
Example: physical examination, clinical measurements, laboratory tests etc.
II. Interviews and questionnaires
1. Direct personal Interview
In direct personal interview method, the investigator presents himself/herself personally before the
informant and questions him/her personally.
This method is best suited to situations where the problem is not completely understood, where questions
cannot be formulated beforehand, and where one question leads to another. Its disadvantage is that it
is time consuming and is not suited to large groups of informants.
2. Enumeration
In this method, well-trained enumerators ask the selected group of respondents a set of questions relevant to
the study. The form that contains the set of questions to be asked, and on which the enumerator records the
responses, is known as a schedule.
E.g. census in Ethiopia
Respondents (interviewees): the individuals who answer the questions on the questionnaire.
Interviewers: the persons who record the responses given to the questions by the respondents.
3. Face To Face Interviews (questionnaires in charge of enumerators)
The interviewer knows exactly who is responding to the questionnaire. This may not be the case with either
telephone or mail administered questionnaire since anyone in the household can answer.
- The interviewer can help the respondent if he/she has difficulty in understanding the questions. The
difficulty could be due to language, concentration or limited intellectual capacity.
- There is more flexibility in presenting the items; they can range from closed to open.
- Skip patterns can be used. A skip pattern means skipping a question or a group of questions
that is not applicable.
Disadvantages

❖ An untrained interviewer may distort the meaning of the questions.
❖ Attributes of the interviewer may affect the responses given, due to:
a) bias of the interviewer and b) his/her social or ethnic characteristics.
❖ It is costly in terms of time and money: interviewers must be trained and paid.
In many instances interviewers go from house to house in order to locate the respondents.
Employing a bilingual interviewer can also increase the cost.

4. Telephone Interviews
Advantages
- It is less expensive in time and money than face-to-face interviews.
- The interviewer is able to help the respondent if he/she doesn’t understand the question (as with
face-to-face interviews).
- Broadly representative samples can be obtained among those who have telephone lines.
- It may ensure uniformity across interviewers.
Disadvantage
❖ Under-representation of those groups which do not have telephones.
❖ Telephone numbers that are not listed in the directory cannot be reached.
❖ The respondent may be substituted by another person.
❖ Depending on the time of day at which the calls are made, different types of persons are reached,
which creates bias in the sample.
❖ Questions with multiple answer options and complicated questions are difficult to administer.
❖ Repeated calls may be necessary.
5. Self administered questionnaire returned by mail (mailed questionnaire)
Here the questionnaire is mailed to respondents to be filled in. This is sometimes known as self-enumeration.
Advantages
- These are the cheapest. There is no need for trained interviewer. There is no interviewer bias.
- Mailed questionnaire can be coordinated from one central location.
Disadvantage
- Low response rate
- Uncompleted questionnaires due to omission or invalid responses.
- No assurance that the questionnaire was answered by the right person
- Needs intense follow up to get a high response rate.
III. The use of documentary sources

Extracting information from existing sources (e.g. Hospital records) is much less expensive than the other
two methods. It can be an important source of data.
Limitation: it is difficult to get the information needed when records are compiled in a non-standardized manner.
1.2.2. Source and Types of Data
Data may be obtained from two sources, primary and secondary.
1. Primary sources: sources that can supply first-hand information for immediate use.
Primary data: data originally collected for the purpose at hand.
Example: observing signs, measuring characteristics, recording symptoms, interviewing respondents, etc.
2. Secondary sources: sources in which the data come from records of individuals that were
collected by persons other than the investigator, for another purpose.
Example: Hospital records, vital statistics and registers, etc.
Secondary data: the data obtained from secondary sources.
1.2.3. Method of Data Presentation
The data collected in a survey or other empirical inquiry are called raw data. These unorganized data are not in a
form that can be readily assimilated. It is therefore necessary to reduce the data and present them with their relevant features.
Tables
A table is a systematic arrangement of statistical data in columns and rows. Important features are:
a. Tables should be simple and self explanatory.
b. Each row or column should be labeled concisely and clearly giving units of measurement for all
quantitative data.
c. The title should describe the content of the table and the scale should be understood without reference to
the text. A good title will answer the question: what? When? And where?
d. Percentages should add up to 100%.
e. Any necessary explanatory footnotes should be included at the bottom of the table.
1.2.3.1.Frequency distribution and tables
Frequency: - is the number of individuals having a particular characteristic.
Frequency distribution: the set of frequencies of all possible values is called the frequency distribution of the
variable.
Based on the type of data, we can have two types of frequency distribution tables.
a) Qualitative frequency tables (categorical frequency distribution)
Table 1: Data on smoking status by gender of a sample of health workers, Jimma Hospital, 1986 E.C.
a.
Observation     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
Gender          M  F  M  M  F  F  F  M  M  M  F  F  F  F  M  F  M  F  M  M
Smoking status  Y  N  N  Y  N  N  Y  N  N  N  Y  N  N  Y  Y  Y  N  N  Y  N
Information on each of the characteristics (gender and smoking status) is displayed for each health worker.

Characteristics    Tally             Frequency
Gender
  Male             //// ////         10
  Female           //// ////         10
Smoking status
  No               //// //// //      12
  Yes              //// ///          8
This table summarizes what is presented in (a).

b.
Characteristics    Frequency    (%)
Gender
  Male             10           50
  Female           10           50
Smoking status
  No               12           60
  Yes              8            40
This table provides both the frequency and the percentage.

The presentation in (b) is a one-way table. The classification by gender is a one-way classification. The
classification by smoking status is another one-way table. When a single variable is used for classification, the
table formed is considered a one-way table.
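The one-way tables above can be reproduced by counting. The following is a minimal sketch in Python, assuming the 20 observations of Table 1 are typed in as two lists; the function name `one_way_table` is hypothetical and chosen only for this illustration.

```python
from collections import Counter

# The 20 observations from Table 1, in order.
gender = "M F M M F F F M M M F F F F M F M F M M".split()
smoking = "Y N N Y N N Y N N N Y N N Y Y Y N N Y N".split()

def one_way_table(values):
    """Return {category: (frequency, percentage)} for one classification variable."""
    counts = Counter(values)
    n = len(values)
    return {category: (f, 100 * f / n) for category, f in counts.items()}

gender_table = one_way_table(gender)
smoking_table = one_way_table(smoking)

for name, table in (("Gender", gender_table), ("Smoking status", smoking_table)):
    print(name)
    for category, (f, pct) in sorted(table.items()):
        print(f"  {category}: {f} ({pct:.0f}%)")
```

Each variable is tabulated on its own, which is exactly what makes the result a one-way table.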
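Table c
                           Smoking status
Gender                     No         Yes        Total
Female    Count            7          3          10
          % of total       35%        15%        50%
          % of row         70%        30%
          % of column      58.33%     37.5%
Male      Count            5          5          10
          % of total       25%        25%        50%
          % of row         50%        50%
          % of column      41.67%     62.5%
Total     Count            12         8          20
          % of total       60%        40%        100%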
What is presented in (c) is a two-way table with four cells, sometimes known as a contingency table. When two
variables are used for classification, the table is called a two-way or contingency table.
Note: the sum of all cell percentages must be equal to 100%.
35+15+25+25=100% (the sum of the cell percentages)
70+30=100% (the sum of the row percentages for row 1)
58.33+41.67=100% (the sum of the column percentages for column 1)
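The three kinds of percentage in the two-way table can be computed from the four cell counts alone. The following is a minimal sketch in Python, assuming the counts are copied from Table c as printed; the dictionary layout and the function name `percentages` are hypothetical.

```python
# Cell counts copied from Table c: rows are gender, columns are smoking status.
counts = {
    "Female": {"No": 7, "Yes": 3},
    "Male": {"No": 5, "Yes": 5},
}

grand_total = sum(sum(row.values()) for row in counts.values())
row_totals = {g: sum(row.values()) for g, row in counts.items()}
col_totals = {}
for row in counts.values():
    for status, f in row.items():
        col_totals[status] = col_totals.get(status, 0) + f

def percentages(gender, status):
    """Return (cell %, row %, column %) for one cell of the table."""
    f = counts[gender][status]
    return (
        100 * f / grand_total,          # share of all 20 health workers
        100 * f / row_totals[gender],   # share within the gender row
        100 * f / col_totals[status],   # share within the smoking column
    )

for g in counts:
    for s in counts[g]:
        cell, row, col = percentages(g, s)
        print(f"{g:6} {s:3} cell={cell:.2f}%  row={row:.2f}%  col={col:.2f}%")
```

The cell percentages use the grand total as denominator, so they sum to 100% over the whole table, while row and column percentages sum to 100% within each row or column.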
When more than two variables are used for classification, the table formed is called a higher-order table.
Example: the following is a higher-order table in which three classification variables are used.
                            Smoking status
Health center    Gender     Yes     No
1                Male       10      32
                 Female     23      98
2                Male       33      65
                 Female     12      21
3                Male       11      32
                 Female     21      21
b) Quantitative Frequency Table
1. Ungrouped Frequency Distribution
When measurements are taken on the entities of a population, the resulting values usually come to the
researcher as a mass of data.
The first step in organizing these data is the preparation of an ordered array.
An ordered array is a listing of the values of a collection (either population or sample) in order of
magnitude, from the smallest value to the largest value.
Table 2: Age in years of 20 women who attended health education at Jimma Health Center in 1986
30 25 23 41 39 27 41 24 32 29 29 35 31 36 33 36 42 35 37 41

Table 3: Ordered array of the ages of the 20 women
23 24 25 27 29 29 30 31 32 33 35 35 36 36 37 39 41 41 41 42
This is slightly more informative than Table 2.
Ungrouped frequency distribution of the ages of the 20 women
Age (xj)        23  24  25  27  29  30  31  32  33  35  36  37  39  41   42  Total
Tally           /   /   /   /   //  /   /   /   /   //  //  /   /   ///  /
Frequency (f)   1   1   1   1   2   1   1   1   1   2   2   1   1   3    1   20
Each individual value is presented separately, which is why it is called an ungrouped frequency distribution. Before
the days of computers, the objective of grouping data was to facilitate the calculation of various descriptive
measures as well as summarization. The main purpose of grouping is now summarization.
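The ordered array and the ungrouped frequency distribution can both be produced mechanically. The following is a minimal sketch in Python, assuming the 20 ages of Table 2 are typed in as a list; the variable names are hypothetical.

```python
from collections import Counter

# Ages of the 20 women, in the order given in Table 2.
ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        29, 35, 31, 36, 33, 36, 42, 35, 37, 41]

# Ordered array: the values from smallest to largest.
ordered_array = sorted(ages)

# Ungrouped frequency distribution: one entry per distinct age.
frequency = dict(sorted(Counter(ages).items()))

print(ordered_array)
for age, f in frequency.items():
    print(f"{age:>3}  {'/' * f:<4} {f}")
print("total", sum(frequency.values()))
```

The tally column is simply the frequency drawn as strokes, so the two always agree.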
Class Intervals: are non overlapping intervals such that each value in the set of observations can be placed in
one, and only one, of the intervals.
Procedures for grouping into class intervals
I. Decide on the number of classes you want. Too few intervals are undesirable because of the loss of
information. On the other hand, if too many intervals are used, the objective of summarization is not
being met.
A commonly followed rule of thumb is to use between 6 and 15 classes (6 ≤ number of classes ≤ 15).
Sturges' formula gives a starting point:
K = 1 + 3.322 log10(n)
K = number of class intervals.
n = number of values in the data.
But this should not be regarded as the final answer. E.g. if we have a sample of 275 observations that we want
to group,
K = 1 + 3.322 log10(275) = 1 + 3.322(2.4393) = 9.10 ≈ 9
In practice other considerations might cause us to use 8 or fewer, or perhaps 10 or more, class intervals.
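Sturges' formula is easy to evaluate for any sample size. The following is a minimal sketch in Python; the function name `sturges_classes` is hypothetical, and the result is left unrounded so that the analyst can decide how to round it.

```python
import math

def sturges_classes(n):
    """Suggested number of classes K = 1 + 3.322 * log10(n), unrounded."""
    return 1 + 3.322 * math.log10(n)

k_275 = sturges_classes(275)
print(round(k_275, 2), "->", round(k_275), "classes")
```

The value is only a suggestion: as noted above, other considerations may lead to a few more or a few fewer classes.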
II. Determine width(W) of the class interval
W= (b-a)/k
b=the largest observation in the data set.
a=the smallest value in the data set.
k=the number of class interval.
III. Round W to the nearest integer.
It is preferable for W to be odd, since the class midpoint is then a whole number.
i.e. 10-19 ⇒ width = 10 ⇒ midpoint = (10+19)/2 = 14.5
5-9 ⇒ width = 5 ⇒ midpoint = (5+9)/2 = 7
Example: consider the age data given previously.
n = 20
k = 1 + 3.322 log10(20) = 1 + 3.322(1.3010) = 5.32 ≈ 5
k = 5 ⇒ w = (42-23)/5 = 3.8 ≈ 4
The grouped frequency table using Sturges' formula
Class            23-26   27-30   31-34   35-38   39-42
Frequency (fi)   3       4       3       5       5
Note: this is only an example.
The data are grouped into a set of non-overlapping intervals.
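The grouping procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes k = 5 classes and starts the first class at the smallest age, as in the table above, and the function name `group_into_classes` is hypothetical.

```python
# Ages of the 20 women from Table 2.
ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        29, 35, 31, 36, 33, 36, 42, 35, 37, 41]

def group_into_classes(values, lower, width, k):
    """Return [(LCL, UCL, frequency), ...] for k classes of integer limits.

    Class i covers lower + i*width up to lower + i*width + width - 1,
    so the classes do not overlap.
    """
    classes = []
    for i in range(k):
        lcl = lower + i * width
        ucl = lcl + width - 1
        f = sum(1 for v in values if lcl <= v <= ucl)
        classes.append((lcl, ucl, f))
    return classes

k = 5
width = round((max(ages) - min(ages)) / k)   # (42 - 23) / 5 = 3.8, rounded to 4
table = group_into_classes(ages, min(ages), width, k)

for lcl, ucl, f in table:
    print(f"{lcl}-{ucl}: {f}")
```

Every age falls in exactly one class, so the class frequencies add up to the sample size of 20.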
Class limits (CL): these are extreme values for each class. They are called lower and upper class limits and
are used for discrete values.
For our example:
Lower class limits (LCL): are 23, 27, 31, 35, and 39
Upper class limits (UCL): are 26, 30, 34, 38, and 42
Note: I. usually class intervals are ordered from smallest to largest.
II. The lower limit of the first class interval should be equal to or smaller than the smallest
measurement in the data.
III. The upper limit of the last class interval should be equal to or greater than the largest measurement.
Class Boundaries (CB)
With continuous data, values such as 26.5 will not fit any of the class given above. It is therefore necessary
to set exact limit or true limits which are known as class boundaries.
Exact limits refer to values of continuous measurement.
a) Lower class boundary (LCB): given the class limits, the LCB is obtained by subtracting half the unit of
measurement from the LCL of the class.
The unit of measurement is the gap between the UCL of a class and the LCL of the next higher class.
Thus LCB_i = LCL_i - (LCL_{i+1} - UCL_i)/2
b) Upper class boundary (UCB): the UCB is the average of the upper class limit of the class and the lower
class limit of the next class,
i.e. the UCB is obtained by adding half the unit of measurement to the UCL of the class.
Thus
UCB_i = UCL_i + (LCL_{i+1} - UCL_i)/2
      = (UCL_i + LCL_{i+1})/2
Note: UCB_i = LCB_{i+1}
Proof:
Consider:
LCB_{i+1} = LCL_{i+1} - (LCL_{i+2} - UCL_{i+1})/2
But UCL_{i+1} - UCL_i = LCL_{i+2} - LCL_{i+1} = w
⇒ UCL_{i+1} = LCL_{i+2} + UCL_i - LCL_{i+1}
Substituting this into the formula, we have
LCB_{i+1} = LCL_{i+1} - (LCL_{i+1} - UCL_i)/2 = (LCL_{i+1} + UCL_i)/2 = UCB_i
Examples:
Convert the following class limits into class boundaries
a) 5-9         b) 44.5-49.4       c) 78.25-80.24
   10-14          49.5-54.4          80.25-82.24
   15-19          54.5-59.4          82.25-84.24
For (a): LCB_1 = LCL_1 - (LCL_2 - UCL_1)/2 = 5 - (10-9)/2 = 4.5
and UCB_1 = (UCL_1 + LCL_2)/2 = (9+10)/2 = 9.5
The class boundaries are:
a) 4.5-9.5     b) 44.45-49.45     c) 78.245-80.245
   9.5-14.5       49.45-54.45        80.245-82.245
   14.5-19.5      54.45-59.45        82.245-84.245
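The conversion from class limits to class boundaries can be sketched as a small Python function. It is an illustration only: the name `class_boundaries` and the `ndigits` rounding parameter are assumptions, and the function assumes at least two classes with the same gap between every pair of adjacent classes.

```python
def class_boundaries(limits, ndigits=6):
    """Convert [(LCL, UCL), ...] into [(LCB, UCB), ...].

    The unit of measurement is taken as the gap between the UCL of the
    first class and the LCL of the second; half of it is subtracted from
    each LCL and added to each UCL. Results are rounded to `ndigits`
    decimals to hide floating-point noise.
    """
    gap = limits[1][0] - limits[0][1]
    half = gap / 2
    return [(round(lcl - half, ndigits), round(ucl + half, ndigits))
            for lcl, ucl in limits]

example_a = class_boundaries([(5, 9), (10, 14), (15, 19)])
example_b = class_boundaries([(44.5, 49.4), (49.5, 54.4), (54.5, 59.4)])
example_c = class_boundaries([(78.25, 80.24), (80.25, 82.24), (82.25, 84.24)])

for name, result in (("a", example_a), ("b", example_b), ("c", example_c)):
    print(name, result)
```

In each result the upper boundary of one class coincides with the lower boundary of the next, which is the property UCB_i = LCB_{i+1}.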
Class marks (m_i): are the midpoints of the classes.
m_i = (LCL_i + UCL_i)/2 or (LCB_i + UCB_i)/2
e.g. (5+9)/2 = 7 or (4.5+9.5)/2 = 7
Note: m_{i+1} = m_i + w
m_2 = 7+5 = 12, also (10+14)/2 = 12
Advantage of grouping
• Provides information about the range of the data.
• Gives an impression about the values that are frequent and infrequent.
• It provides data that can be easily used for graphical representation.
Disadvantage of grouping
• Information may be lost, since individual values are not displayed.
• Some quantities that can be determined from the original data cannot be determined from the grouped data.
Modified frequency distribution
I) The cumulative frequency distribution: is used when one is interested to know how often the
measurements fall below or above a certain level.
Less than cumulative frequency (LCF): the LCF of a value of a variable is the number of
individuals with values less than or equal to that value.
More than cumulative frequency (MCF): the MCF of a value of a variable is the number of cases
with values greater than or equal to that value.
Example:
Class limit   Frequency   Less than      LCF   More than      MCF
<23           0           22.5 (<23)     0     22.5 (>22)     20
23-26         3           26.5 (<27)     3     26.5 (>26)     17
27-30         4           30.5 (<31)     7     30.5 (>30)     13
31-34         3           34.5 (<35)     10    34.5 (>34)     10
35-38         5           38.5 (<39)     15    38.5 (>38)     5
39-42         5           42.5 (<43)     20    42.5 (>42)     0
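The two cumulative frequency columns are running totals taken from opposite ends. The following is a minimal sketch in Python, assuming the class boundaries and class frequencies of the age example; the variable names are hypothetical.

```python
from itertools import accumulate

# Class boundaries and class frequencies of the grouped age data.
boundaries = [22.5, 26.5, 30.5, 34.5, 38.5, 42.5]
frequencies = [3, 4, 3, 5, 5]

n = sum(frequencies)

# "Less than" cumulative frequency at each boundary: the running total of
# the classes lying entirely below that boundary.
lcf = [0] + list(accumulate(frequencies))

# "More than" cumulative frequency at each boundary: whatever is left above it.
mcf = [n - c for c in lcf]

for b, less, more in zip(boundaries, lcf, mcf):
    print(f"{b:>5}  LCF={less:>2}  MCF={more:>2}")
```

At every boundary the two values add up to the total number of observations, here 20.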
Relative frequency distribution: the proportion of individuals in each class, expressed as a percentage of the total.
Rel. freq. = (frequency of the class / total frequency) x 100
⇒ R.f_i = (f_i / n) x 100

Example
Class limit   Freq.   Relative freq. (%)   Cumulative rel. freq. (%)
23-26         3       3/20 x 100 = 15      15
27-30         4       4/20 x 100 = 20      35
31-34         3       3/20 x 100 = 15      50
35-38         5       5/20 x 100 = 25      75
39-42         5       5/20 x 100 = 25      100
Total         20      100
Note: about 75% of the women are in the age group 23-38 years.
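The relative and cumulative relative frequencies follow directly from the class frequencies. The following is a minimal sketch in Python, assuming the five class frequencies of the age example (which total 20); the variable names are hypothetical.

```python
from itertools import accumulate

classes = ["23-26", "27-30", "31-34", "35-38", "39-42"]
frequencies = [3, 4, 3, 5, 5]

n = sum(frequencies)

# Relative frequency of each class as a percentage of the total.
relative = [100 * f / n for f in frequencies]

# Cumulative relative frequency: running total of the percentages.
cumulative = list(accumulate(relative))

for c, f, r, cr in zip(classes, frequencies, relative, cumulative):
    print(f"{c}  f={f}  rel={r:.0f}%  cum={cr:.0f}%")
```

The cumulative column reaches 75% at the class 35-38, which is the basis for saying that about three quarters of the women are aged 38 or younger.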
1.2.3.2. DIAGRAMATIC PRESENTATION OF DATA
The essential advantages of these presentations lie in the fact that they facilitate comparisons.
Bar chart
There are three types of bar charts
I) Simple
II) Component
III) Multiple
I) Simple bar chart: the bars may be vertical or horizontal, with their height or width representing the size
of the data. It helps to make simple comparisons between data.
- The bars do not overlap.
- The spaces between the bars must be equal and narrow.

- It shows changes in the totals of different categories.
Example Construct a simple bar diagram for the following table showing annual cases of HIV reported in
Ethiopia as of July 31, 1993.
Year of report 1986 1987 1988 1989 1990 1991 1992 1993
Cases 2 17 87 190 448 885 3256 2814

II. Component Bar chart


The bar for each category is subdivided into components to allow comparison between parts. It is used to
present more than one variable. There are two types:
a) Actual
b) Percentage
Actual component bar chart: height or length of individual components is represented by its actual figure.
Different parts of a bar are shaded or colored differently to provide contrast. It shows not only changes in the
total of different categories but also the change in component figures within each category.
Example
Construct an actual component bar chart for the number of children who were vaccinated with DPT, Polio and
BCG antigens in Jimma Hospital in 1979 E.C.

Antigen   Male   Female
DPT       250    300
Polio     300    320
BCG       200    210

Percentage component Bar chart
Similar to the actual component bar chart, except that the components are expressed as percentages of the total.
- All bars are equal in height.
- Mostly used to compare relative variation between data.

Example: draw a percentage component bar chart for the vaccination data described previously.
Solution
Antigen   Male                       Female
DPT       250/550 x 100 = 45.5%      300/550 x 100 = 54.5%
Polio     300/620 x 100 = 48.4%      320/620 x 100 = 51.6%
BCG       200/410 x 100 = 48.8%      210/410 x 100 = 51.2%
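The component percentages in the solution can be computed as follows. This is a minimal sketch in Python that only does the arithmetic (no chart is drawn), assuming the vaccination counts given for Jimma Hospital; the function name `component_percentages` is hypothetical.

```python
# Number of children vaccinated, as (male, female), per antigen.
vaccinated = {"DPT": (250, 300), "Polio": (300, 320), "BCG": (200, 210)}

def component_percentages(male, female):
    """Return the male and female shares of one bar, in percent."""
    total = male + female
    return (100 * male / total, 100 * female / total)

shares = {antigen: component_percentages(m, f)
          for antigen, (m, f) in vaccinated.items()}

for antigen, (male_pct, female_pct) in shares.items():
    print(f"{antigen:6} male={male_pct:.1f}%  female={female_pct:.1f}%")
```

Because each pair of shares adds up to 100%, every bar in the percentage component chart has the same height.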
III. Multiple Bar chart
These are used when two or more inter-related data are to be compared. Height of bars shows actual values of
each component. It is used to present more than one variable.
Example: draw a multiple bar chart for the vaccination data.

PIE CHART
Pie charts are used to show the partitioning of a total into its component parts using a circle. The circle should
be divided into sectors proportional in size to the frequencies of the categories they represent.
Steps in drawing a pie chart
1. Convert the frequency distribution into a percentage frequency distribution.
2. Draw a circle of any radius and note that the whole circle corresponds to an angle of 360°.
3. Convert the percentages into degree measures. Since the whole circle (360°) represents 100% of the
observations, 3.6° will represent 1%.
Example
Draw the pie chart for the following table. First construct a table providing the central angles.

Wards        Frequency   Percentage   Central angle (degrees)
Medical A    55          27.5         99
Medical B    30          15           54
Surgical A   40          20           72
Surgical B   25          12.5         45
Pediatrics   50          25           90
Total        200         100          360
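The percentage and central angle columns can be computed from the frequencies. The following is a minimal sketch in Python, assuming the ward frequencies of the table; the variable names are hypothetical, and the angle is computed as 360 x f / total, which is the same as allotting 3.6° to each 1%.

```python
# Frequencies per ward, copied from the table.
wards = {
    "Medical A": 55,
    "Medical B": 30,
    "Surgical A": 40,
    "Surgical B": 25,
    "Pediatrics": 50,
}

total = sum(wards.values())

# For each ward: (percentage of the total, central angle in degrees).
rows = {ward: (100 * f / total, 360 * f / total) for ward, f in wards.items()}

for ward, (pct, angle) in rows.items():
    print(f"{ward:11} {pct:5.1f}%  {angle:5.1f} degrees")
print("sum of angles:", sum(angle for _, angle in rows.values()))
```

The angles always add up to 360°, which is a quick check that no category was left out.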

Histogram
A histogram presents grouped frequency distribution of a continuous type. The real limits of the class make up
the horizontal axis, while the vertical axis has as its scale the frequency of occurrence. Comparison can be made
using the height or areas of the bar.
Method of constructing a histogram
I) Obtain a frequency distribution with class boundaries and class midpoints.
II) Construct bars on the horizontal axis with center at the class midpoint and width equal to the class
width.
III) The height of each bar should correspond to the respective class frequency.
Example: consider the following grouped age data
S.N   Class limit   Class boundaries   Midpoint   Frequency
1.    15-19         14.5-19.5          17         2
2.    20-24         19.5-24.5          22         8
3.    25-29         24.5-29.5          27         6
4.    30-34         29.5-34.5          32         12
5.    35-39         34.5-39.5          37         7
6.    40-44         39.5-44.5          42         6
7.    45-49         44.5-49.5          47         4
8.    50-54         49.5-54.5          52         3
9.    55-59         54.5-59.5          57         1
10.   60-64         59.5-64.5          62         1
Note: each cell contains a certain proportion of the total area, depending on its frequency.
For example, the fourth cell contains 12/50 of the area, which is the relative frequency of occurrence of values
between 29.5 and 34.5.
Histogram

Frequency polygon:- is a multi-sided figure where the frequency is plotted against the class midpoint. The
steps are:
I) Construct a histogram
II) Mark the midpoint on the top of each bar
III) Join these marks with straight lines
IV) Extend these lines on both ends so that it reaches the horizontal axis at the class mid points. This
allows the total area to be enclosed.
Frequency distributions of age
Class limit 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64
Mid point 17 22 27 32 37 42 47 52 57 62
Frequency 2 8 6 12 7 6 4 3 1 1

Note: the total area under the frequency polygon is equal to the area under the histogram.
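The equality of the two areas can be verified numerically. The following is a minimal sketch in Python, assuming the midpoints and frequencies of the age distribution above and equal class widths; the variable names are hypothetical, and no plotting library is used.

```python
# Class midpoints and frequencies of the grouped age data.
midpoints = [17, 22, 27, 32, 37, 42, 47, 52, 57, 62]
frequencies = [2, 8, 6, 12, 7, 6, 4, 3, 1, 1]

width = midpoints[1] - midpoints[0]   # common class width

# Step IV: close the polygon with zero-frequency points one class width
# beyond the first and last midpoints.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]

# Histogram area: each bar has area width * frequency.
histogram_area = width * sum(frequencies)

# Polygon area: sum of the trapezoids between consecutive vertices.
polygon_area = sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
                   for i in range(len(xs) - 1))

print(list(zip(xs, ys)))
print(histogram_area, polygon_area)
```

The two extra vertices are what make the areas agree: each frequency is counted once as the right side of one trapezoid and once as the left side of the next.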
Ogives or cumulative frequency curve
Points are plotted in association with the exact values on the horizontal axis and the cumulative frequency
values on the vertical axis. Then connect the points with straight lines.
-the curves obtained are called the “less than” and “more than” curves.
Note: Cumulative frequencies are plotted at the class boundaries. Frequency polygons are plotted at class
marks. The sum of the frequencies of two or more classes is cumulative frequency.
Consider the age data
Class limit   Frequency   Less than      LCF   More than      MCF
<23           0           22.5 (<23)     0     22.5 (>22)     20
23-26         3           26.5 (<27)     3     26.5 (>26)     17
27-30         4           30.5 (<31)     7     30.5 (>30)     13
31-34         3           34.5 (<35)     10    34.5 (>34)     10
35-38         5           38.5 (<39)     15    38.5 (>38)     5
39-42         5           42.5 (<43)     20    42.5 (>42)     0

CHAPTER 3
SUMMARIZING DATA
MEASURE OF CENTRAL TENDENCY
INTRODUCTION
In this unit we shall discuss measures of central tendency. These are also known as measures of central location or
central values. The most important objective of statistical analysis is to determine a single value for the entire
mass of data that describes the overall level of the group of observations and can be called representative of
the whole set of data. It tells us where the center of the distribution of the data is located on the scale that we are
using. There are several such measures, but here we shall discuss the most commonly used ones:
the mean, the median and the mode.
Section one is concerned with four different types of mean, namely the arithmetic mean, weighted mean,
geometric mean and harmonic mean.
Section two deals with the mode and the median. We hope that you are familiar with the arithmetic mean, median and
mode for ungrouped data.
Do you know how to compute the mean, median and mode for grouped data? This is one of the main concerns of
this unit. There are also measures of non-central location, such as quartiles, deciles and percentiles.
We shall discuss these in section three.
Section one: means
Among the types of mean we discuss four, each of which is suitable for a particular type of data. These are
the arithmetic mean, weighted mean, geometric mean and harmonic mean.
We shall discuss each of these one by one in this section. In chapter one we saw that the meaning of the
word “population” in statistics is quite different from its meaning in everyday language. What about the word
average? The word average occurs frequently in everyday usage: we usually say “average family size”,
“average family income”, “the average marks of students”, etc. In this sense it obviously refers to the arithmetic
mean. In statistics, however, “average” in general means any measure of central location.
Finally, we make the following remark, which applies throughout this course: if the mean is mentioned, it
means the arithmetic mean, while the other means are identified by name.
Properties of summation
The symbol Σ is capital sigma, the Greek letter used to denote summation.
1. $\sum_{i=1}^{n} X_i$ stands for the sum of the X's, i.e. $X_1 + X_2 + \dots + X_n$.
2. $\sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i$
3. $\sum_{i=1}^{n} X_i Y_i = X_1 Y_1 + X_2 Y_2 + \dots + X_n Y_n$
4. $\sum_{i=1}^{n} C = nC$
5. $1 + 2 + 3 + \dots + n = \frac{n(n+1)}{2}$
6. $\sum_{i=1}^{n} C X_i = C \sum_{i=1}^{n} X_i$
7. $\sum_{i=1}^{n} (X_i \pm C) = \sum_{i=1}^{n} X_i \pm nC$
8. $1^2 + 2^2 + 3^2 + \dots + n^2 = \frac{n(n+1)(2n+1)}{6}$
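The summation rules can be checked numerically. The following is a minimal sketch in Python; the function name `check_summation_rules` and the sample numbers are illustrative assumptions, not part of the course text.

```python
def check_summation_rules(xs, ys, c):
    """Return, for each summation rule, whether it holds for the given integer data."""
    n = len(xs)
    return {
        "sum of sums": sum(x + y for x, y in zip(xs, ys)) == sum(xs) + sum(ys),
        "sum of a constant": sum(c for _ in xs) == n * c,
        "constant factor": sum(c * x for x in xs) == c * sum(xs),
        "constant shift": sum(x + c for x in xs) == sum(xs) + n * c,
        "first n integers": sum(range(1, n + 1)) == n * (n + 1) // 2,
        "first n squares": sum(k * k for k in range(1, n + 1))
        == n * (n + 1) * (2 * n + 1) // 6,
    }


# Illustrative integer data (an assumption chosen only for the check).
for rule, holds in check_summation_rules([3, 8, 4], [5, 9, 10], 2).items():
    print(rule, holds)
```

Integers are used so that the comparisons are exact; with decimal data a small tolerance would be needed.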
Desirable properties of good measure of central tendency
1. It should be easy to calculate and understand.
2. It should be rigidly defined. It should have one and only one interpretation so that the personal prejudice or bias
of the investigator does not affect the value or its usefulness.
3. It should be representative of the data. If it is calculated from a sample, then the sample should be random
enough to accurately represent the population.
4. It should have sampling stability. It should not be much affected by sampling fluctuations. This means that if we
pick 10 different groups of college students at random and we compute the average of each group, then we
should expect to get approximately the same value from these groups.
5. It should not be affected much by extreme values. If a few very small or very large values are present in the
data, they will unduly influence the value of the average by shifting it to one side or the other, and hence the
average would not be really typical of the entire series. Hence, the average chosen should be such that it is not
unduly influenced by extreme values.
1.1 Arithmetic mean and its properties
The arithmetic mean of a sample is the sum of all the observations divided by the number of observations in the
sample, i.e.
$$\text{sample mean (arithmetic mean)} = \frac{\text{sum of all values in the sample}}{\text{number of values in the sample}}$$

The arithmetic mean for ungrouped data:

Suppose $x_1, x_2, x_3, \dots, x_n$ are n observed values in a sample of size n taken from a population of size N, n < N.
Then the arithmetic mean of the sample, denoted by $\bar{X}$, is given by
$$\bar{X} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}, \quad i = 1, 2, 3, \dots, n \tag{1.1}$$

where $\bar{X}$ stands for the sample mean or arithmetic mean and is read as “X bar”.
Σ is the Greek capital letter sigma and indicates the operation of addition, so $\sum_{i=1}^{n} x_i$ stands for the sum of all
the $x_i$'s.
If we take an entire population, the population mean, denoted by µ, is given by
$$\mu = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N}$$

where N stands for the total number of observations in the population.


Exercise The net weights of the contents of five perfume bottles selected at random from the production line are
(in grams); 85.4,85.3,84.9,85.4 and 85. What is the arithmetic mean weight of the sample observation?
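As a minimal sketch of formula 1.1 in Python, applied to the five perfume-bottle weights above (the function name `arithmetic_mean` is an illustrative choice, not part of the text):

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations (formula 1.1)."""
    if not values:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / len(values)


bottle_weights = [85.4, 85.3, 84.9, 85.4, 85.0]  # net weights in grams
print(round(arithmetic_mean(bottle_weights), 2))  # 85.2
```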
Exercise The following table gives the daily income of ten operators in a machine tool factory. Find the
arithmetic mean.
Operator   A    B    C    D    E    F    G    H    I    J
Income     12   15   18   20   25   30   22   35   37   26

Activities 1.1 Find the arithmetic mean for the data given below.
Values(xi) 3 5 4 2 7 6
Frequency(fi) 2 1 3 2 1 1
Please try to answer before reading further!
Suppose the data are given in the form of a discrete frequency distribution with frequencies $f_1, f_2, f_3, \dots, f_k$
associated with the values $x_1, x_2, x_3, \dots, x_k$ of the variable, respectively.
As there are $f_1$ items with value $x_1$, $f_2$ items with value $x_2$, etc., the sum of all values equals
$f_1x_1 + f_2x_2 + f_3x_3 + \dots + f_kx_k$ and the total number of items is $f_1 + f_2 + f_3 + \dots + f_k = \sum_{i=1}^{k} f_i = n$.
Thus by formula 1.1 the arithmetic mean is given by
$$\bar{X} = \frac{f_1x_1 + f_2x_2 + f_3x_3 + \dots + f_kx_k}{f_1 + f_2 + f_3 + \dots + f_k} \tag{1.2}$$

That is, the arithmetic mean of data with frequency $f_i$ corresponding to value $x_i$ is
$$\bar{X} = \frac{\sum_{i=1}^{k} f_i x_i}{n}, \quad \text{where } n = \sum_{i=1}^{k} f_i$$
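Formula 1.2 can be sketched in Python as follows, using the data of Activity 1.1. The function name `frequency_mean` is an assumption made for illustration.

```python
def frequency_mean(values, freqs):
    """Mean of a discrete frequency distribution: sum(f_i * x_i) / sum(f_i)."""
    if len(values) != len(freqs):
        raise ValueError("each value needs exactly one frequency")
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    return sum(f * x for x, f in zip(values, freqs)) / n


values = [3, 5, 4, 2, 7, 6]  # x_i from Activity 1.1
freqs = [2, 1, 3, 2, 1, 1]   # f_i from Activity 1.1
print(frequency_mean(values, freqs))  # 4.0
```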

Exercise Calculate the mean of the marks of 46 students given below.

Marks(xi)       9   10   11   12   13   14   15   16   17   18
Frequency(fi)   1   2    3    6    10   11   7    3    2    1

Combined mean
If we have arithmetic means $\bar{X}_1, \bar{X}_2, \dots, \bar{X}_k$ of k groups having the same unit of measurement of a variable,
based on $n_1, n_2, \dots, n_k$ observations respectively, we can compute the combined mean of the values of the
groups taken together from the individual means by the formula
$$\bar{X}_{com} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + \dots + n_k\bar{X}_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum_{i=1}^{k} n_i\bar{X}_i}{\sum_{i=1}^{k} n_i} \tag{1.3}$$

The advantage of formula 1.3 is that we do not have to redo the entire calculation for the mean of the combined
set of observations if the mean and the size of each group are known.
Exercise The mean weight of 150 students in a certain class is 60 kg. The mean weight of the boys in the class is
70 kg and that of the girls is 55 kg. Find the number of boys and girls in the class.
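One way to approach this kind of problem is to write formula 1.3 for two groups and solve it for the unknown group size. The sketch below is a hedged illustration in Python; the names `combined_mean` and `split_two_groups` are hypothetical helpers, not part of the text.

```python
def combined_mean(sizes, means):
    """Combined mean of several groups: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


def split_two_groups(total_size, overall_mean, mean_a, mean_b):
    """Solve n_a*mean_a + (total_size - n_a)*mean_b = total_size*overall_mean for n_a."""
    n_a = total_size * (overall_mean - mean_b) / (mean_a - mean_b)
    return n_a, total_size - n_a


boys, girls = split_two_groups(150, 60, 70, 55)
print(boys, girls)                             # 50.0 100.0
print(combined_mean([boys, girls], [70, 55]))  # 60.0, the class mean is recovered
```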

Arithmetic mean of grouped data


In the ungrouped case, the exact value of each item is known. If, however, the data are grouped so that we are
given a frequency distribution with class intervals of finite size, we do not know the value of every item. The
calculation of the arithmetic mean in such a case is then necessarily a matter of estimation. For this purpose the
values within a particular class are considered to be concentrated at its class mark (midpoint); that is, the
observations in each class are represented by the class mark of the class interval. Recall that the class
mark is the average of the lower and upper limits of the class.
For a grouped frequency distribution with k classes in which the jth class has class mark $m_j$ and
corresponding frequency $f_j$, j = 1, 2, 3, …, k, the mean is given by
$$\bar{X} = \frac{\sum_{j=1}^{k} f_j m_j}{\sum_{j=1}^{k} f_j} = \frac{f_1m_1 + f_2m_2 + \dots + f_km_k}{f_1 + f_2 + \dots + f_k} \tag{1.4}$$

Do you recall how to calculate a class mark? If not please review chapter 2.

Exercise The net income of a sample of large importers of antiques was organized into the following table.
Net income 2-4 5-7 8-10 11-13 14-16
Number of importers 1 4 10 3 2

What is the arithmetic mean of net income?
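Formula 1.4 can be sketched in Python as below, using the net income table of this exercise. The function name `grouped_mean` is an illustrative assumption; class marks are taken as the average of the class limits, as described in the text.

```python
def grouped_mean(class_limits, freqs):
    """Estimate the mean of grouped data from class marks (formula 1.4)."""
    marks = [(lower + upper) / 2 for lower, upper in class_limits]
    return sum(f * m for f, m in zip(freqs, marks)) / sum(freqs)


income_classes = [(2, 4), (5, 7), (8, 10), (11, 13), (14, 16)]
importers = [1, 4, 10, 3, 2]
print(grouped_mean(income_classes, importers))  # 183 / 20 = 9.15
```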


Properties of arithmetic mean
From the foregoing discussion we are in position to understand the properties inherent in an arithmetic mean.
1. The algebraic sum of the deviations of each value ($x_i$) from the mean ($\bar{x}$) is equal to zero. That is,
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0, \quad \text{since} \quad \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0$$

As an example, the mean of 3, 8 and 4 is 5. Then $\sum_{i=1}^{3} (x_i - \bar{x}) = (3-5) + (8-5) + (4-5) = -2 + 3 - 1 = 0$.


For this reason, we can consider the mean as a balance point for a set of data.
2. The sum of the squares of the deviations from the arithmetic mean is the least (minimum); that is, $\sum_{i=1}^{n} (x_i - c)^2$ is
minimum when $c = \bar{x}$.
For example, consider the sample values 5, 9, 4 and 10.
The mean is 7. The squared deviations from the mean are:
Values(xi)    5          9         4          10         Sum
xi − x̄        5−7=−2     9−7=2     4−7=−3     10−7=3     0
(xi − x̄)²     (−2)²=4    2²=4      (−3)²=9    3²=9       26

So $\sum_{i=1}^{4} (x_i - \bar{x})^2 = \sum_{i=1}^{4} (x_i - 7)^2 = 26$.
If the deviations from some other value are squared, their sum will be larger than 26. For instance, take the
value 5 and consider the following table.
Values(xi)    5          9         4          10         Sum
xi − c        5−5=0      9−5=4     4−5=−1     10−5=5     8
(xi − c)²     0²=0       4²=16     (−1)²=1    5²=25      42

So $\sum_{i=1}^{4} (x_i - c)^2 = \sum_{i=1}^{4} (x_i - 5)^2 = 42 > 26$.
Note: the importance of this property will be seen when we discuss standard deviation in unit 4.
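Properties 1 and 2 can be checked numerically for the sample values 5, 9, 4 and 10. This is a minimal sketch in Python; the two helper names are illustrative assumptions.

```python
def sum_of_deviations(values, c):
    """Algebraic sum of the deviations of each value from c."""
    return sum(x - c for x in values)


def sum_of_squared_deviations(values, c):
    """Sum of the squared deviations of each value from c."""
    return sum((x - c) ** 2 for x in values)


values = [5, 9, 4, 10]
mean = sum(values) / len(values)                # 7.0
print(sum_of_deviations(values, mean))          # 0.0  (property 1)
print(sum_of_squared_deviations(values, mean))  # 26.0 (the minimum, property 2)
print(sum_of_squared_deviations(values, 5))     # 42   (any other c gives more)
```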
3. It is easy to calculate and understand.
4. The mean is sensitive to extreme values.
e.g. The values 5, 9, 13, 12 and 16 have mean 11, but if we have 100 instead of 5 the mean becomes 30.
5. Uniqueness: the mean of any set of data is unique.
6. It can be used for further statistical treatment, for example:
- Comparison of means.
- Tests on means.
Advantages and disadvantages of the arithmetic mean
1. It is easy to understand and to compute.
2. All the values are included in computing the mean.
3. A set of data has only one mean, thus, it is unique.
4. Every set of interval level and ratio level data has a mean.

5. The mean is meaningless in the case of nominal or qualitative data.


6. The arithmetic mean is greatly affected by extreme values (very low or very high values). Since the computation
of the mean is based upon all values in the data, extreme values shift the mean towards them, thus making the
mean unrepresentative of the data.
7. In the case of grouped data, if any class interval is open ended, the arithmetic mean cannot be calculated,
since the class mark of this interval cannot be found.

Weighted mean
In the computation of the arithmetic mean we gave equal importance to each observation. Sometimes the
individual values in the data are not equally important. When this is the case, we assign to each value a weight
which is proportional to its relative importance and calculate the weighted mean.
The weighted mean of a set of values $x_1, x_2, x_3, \dots, x_n$ with corresponding weights $w_1, w_2, \dots, w_n$ is denoted by
$\bar{x}_w$ and computed by
$$\bar{x}_w = \frac{w_1x_1 + w_2x_2 + \dots + w_nx_n}{w_1 + w_2 + \dots + w_n}$$
This may be shortened to
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \tag{1.5}$$

The calculation of cumulative grade point average (CGPA) in colleges and universities is a good example of
weighted mean.
Exercise If a student scores “A” in a 3 credit hour course, “B” in a 4 credit hour course, “C” in another 4
credit hour course and “D” in a 2 credit hour course, and the numerical values of the letter grades are A=4, B=3,
C=2, D=1, compute his/her GPA for the semester.
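The GPA calculation is a direct application of formula 1.5 with credit hours as weights. A minimal Python sketch follows; the function name `weighted_mean` is an illustrative assumption.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i) (formula 1.5)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


grade_points = [4, 3, 2, 1]  # A, B, C, D
credit_hours = [3, 4, 4, 2]  # weight of each course
print(round(weighted_mean(grade_points, credit_hours), 2))  # 34 / 13, about 2.62
```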
Geometric mean
In algebra the geometric mean is calculated in the case of a geometric progression, but in statistics we need not
bother about the progression. Here there is a particular type of data for which the geometric mean is of great
importance because it gives a good mean value. If the observed values are measured as ratios, proportions or
percentages, the geometric mean gives a better measure of central tendency than the other means.

The geometric mean of n positive values is defined as the nth root of their product. That is, if all the given
observations $x_1, x_2, x_3, \dots, x_n$ are positive, then
$$G.M = (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n} \tag{1.7}$$
For instance, the G.M of 4, 8 and 16 is
$$G.M = (4 \times 8 \times 16)^{1/3} = (512)^{1/3} = 8$$
Can the G.M be calculated if one or more values are zero or negative? Why?
In case the observed values $x_1, x_2, x_3, \dots, x_k$ have the corresponding frequencies $f_1, f_2, f_3, \dots, f_k$, then
$$G.M = \left(x_1^{f_1} \cdot x_2^{f_2} \cdot x_3^{f_3} \cdots x_k^{f_k}\right)^{1/n}, \quad \text{where } n = \sum_{i=1}^{k} f_i \tag{1.8}$$
In the case of grouped data, the class marks of the class intervals are taken as the $x_i$ and formula 1.8 can be
used as such:
$$G.M = \left(m_1^{f_1} \cdot m_2^{f_2} \cdots m_k^{f_k}\right)^{1/n}, \quad \text{where } n = \sum_{i=1}^{k} f_i$$
Exercise A man gets three annual raises in his salary. At the end of the first year he gets an increase of 4%, at
the end of the second year an increase of 6% and at the end of the third year an increase of 9% of his
salary. What is the average percentage increase over the three periods?
Exercise Compute the Geometric mean of the following values.
2, 8, 6, 4, 10, 6, 8, 4
We present below the method of computing the geometric mean using a logarithm table.
Though standard techniques are available to find square roots and cube roots, for large values of n the nth
root is not easy to compute. To overcome this difficulty, the geometric mean is computed through logarithms. Now,
hoping that you recall the properties of logarithms, we express (1.7) and (1.8) in terms of logarithms (with base ten).
Formula (1.7), when reduced to its logarithmic form, becomes
$$\log(G.M) = \log (x_1 \cdot x_2 \cdots x_n)^{1/n} = \frac{1}{n}\log(x_1 \cdot x_2 \cdots x_n) = \frac{1}{n}(\log x_1 + \log x_2 + \dots + \log x_n)$$
So
$$G.M = \text{antilog}\left(\frac{1}{n}(\log x_1 + \log x_2 + \dots + \log x_n)\right) \tag{1.9}$$
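The logarithmic route of formula (1.9) can be compared with the direct nth-root definition (1.7). The following is a sketch in Python using base-ten logarithms and the values 2, 8, 6, 4, 10, 6, 8, 4 from the exercise above; both function names are illustrative assumptions.

```python
import math


def geometric_mean_direct(values):
    """nth root of the product of n positive values (formula 1.7)."""
    return math.prod(values) ** (1 / len(values))


def geometric_mean_logs(values):
    """Antilog of the mean of the base-ten logarithms (formula 1.9)."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs strictly positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log  # 10**y is the antilog of y in base ten


data = [2, 8, 6, 4, 10, 6, 8, 4]
# Both routes agree up to floating-point rounding.
print(geometric_mean_direct(data), geometric_mean_logs(data))
```

The logarithmic form is also the safer choice in code for long lists, since the raw product can overflow or underflow while the mean of the logarithms stays moderate.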

Formula (1.8), when reduced to its logarithmic form, becomes
$$\log(G.M) = \log\left(x_1^{f_1} \cdot x_2^{f_2} \cdots x_k^{f_k}\right)^{1/n} = \frac{1}{n}\log\left(x_1^{f_1} \cdot x_2^{f_2} \cdots x_k^{f_k}\right)$$
$$= \frac{1}{n}\left(\log x_1^{f_1} + \log x_2^{f_2} + \dots + \log x_k^{f_k}\right)$$
$$= \frac{1}{n}(f_1\log x_1 + f_2\log x_2 + \dots + f_k\log x_k) = \frac{\sum_{i=1}^{k} f_i \log x_i}{\sum_{i=1}^{k} f_i}$$
So
$$G.M = \text{antilog}\left(\frac{1}{n}\sum_{i=1}^{k} f_i \log x_i\right), \quad \text{where } n = \sum_{i=1}^{k} f_i \tag{1.10}$$
for a frequency distribution with frequency $f_i$ corresponding to value $x_i$, i = 1, 2, 3, …, k. Similarly, for grouped
data,
$$\log(G.M) = \frac{\sum_{i=1}^{k} f_i \log m_i}{\sum_{i=1}^{k} f_i} = \frac{1}{n}\sum_{i=1}^{k} f_i \log m_i, \quad \text{where } n = \sum_{i=1}^{k} f_i \tag{1.11}$$
where $m_i$ and $f_i$ are the class mark and frequency of the ith class interval respectively. Taking the antilog of
both sides in (1.11) we obtain
$$G.M = \text{antilog}\left(\frac{1}{n}\sum_{i=1}^{k} f_i \log m_i\right), \quad \text{where } n = \sum_{i=1}^{k} f_i \tag{1.12}$$
The following exercises illustrate how the geometric mean is computed through logarithms.
Exercise Find the geometric mean of 2, 4, 8, 12, 16, 24
Exercise Given the following frequency distribution of grouped data:
CI          10-14   15-19   20-24   25-29   30-34   35-39   40-44
frequency   10      15      17      25      18      12      8

Find the geometric mean.
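For grouped data the logarithmic form uses class marks weighted by frequencies. Below is a hedged Python sketch of that calculation for the table above; the function name `grouped_geometric_mean` is an assumption for illustration, and no numerical answer is asserted here.

```python
import math


def grouped_geometric_mean(class_limits, freqs):
    """Antilog of sum(f_i * log10(m_i)) / sum(f_i), with m_i the class marks."""
    marks = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(freqs)
    mean_log = sum(f * math.log10(m) for f, m in zip(freqs, marks)) / n
    return 10 ** mean_log


classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
freqs = [10, 15, 17, 25, 18, 12, 8]
print(grouped_geometric_mean(classes, freqs))
```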

Harmonic mean
Another important mean is the harmonic mean, which is a suitable measure of central tendency when the data
pertain to speeds, rates and times.
Let $x_1, x_2, x_3, \dots, x_n$ be n values in a set of observations; then the harmonic mean is given by
$$H.M = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}$$
This may be shortened to
$$H.M = \frac{n}{\sum_{i=1}^{n} 1/x_i}$$

The following is a good example in which the application of the harmonic mean is appropriate.
Exercise: A motorist travels for three days, covering 480 km each day. On the first day he travels 10 hours
at a rate of 48 km/h, on the second day 12 hours at a rate of 40 km/h, and on the third day 15 hours at a rate of
32 km/h. What is the average speed?
Note: Here the harmonic mean gives the correct average speed because the man travelled equal distances at the
three speeds. If, however, he had travelled for equal time intervals, the arithmetic mean would have been the
correct average.
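The note can be checked numerically. The sketch below, in Python, assumes the motorist covers 480 km on each of the three days in 10, 12 and 15 hours, so each day's speed is 480 divided by the hours driven; the function name `harmonic_mean` is illustrative.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals of n positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("the harmonic mean needs strictly positive values")
    return len(values) / sum(1 / v for v in values)


daily_distance = 480                             # km covered each day
hours = [10, 12, 15]                             # hours driven on days 1, 2, 3
speeds = [daily_distance / h for h in hours]     # km/h on each day

overall = daily_distance * len(hours) / sum(hours)  # total distance / total time
print(round(harmonic_mean(speeds), 2))  # 38.92
print(round(overall, 2))                # 38.92, the same value
```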

If the data are arranged in the form of a frequency distribution in which an observation $x_i$ has frequency $f_i$
(i = 1, 2, 3, …, k), the harmonic mean is given by
$$H.M = \frac{n}{f_1/x_1 + f_2/x_2 + \dots + f_k/x_k}, \quad \text{where } n = \sum_{i=1}^{k} f_i$$
This may be shortened to
$$H.M = \frac{\sum_{i=1}^{k} f_i}{\sum_{i=1}^{k} f_i/x_i}$$

Finally, we would like to point out the relationship among the three means.
If all the observations are positive, it is given as A.M ≥ G.M ≥ H.M.
The three means are equal if all the (positive) observations are equal.
Median and mode
In this section we deal with two other measures of central tendency, namely the median and the mode. It has been
pointed out that the arithmetic mean cannot be calculated whenever there is a frequency distribution with open-ended
intervals. Also, the mean is to a great extent affected by the extreme values of the set of observations. Hence in
such cases the center of the data is better described using the median or the mode.
In fact there are a number of circumstances in which we use these instead of any other measure of central
tendency. We now discuss them in detail.
Median
Suppose we sort all the observations in numerical order, ranging from smallest to largest or vice versa. The
median is the middle value in the sorted list.
Median of ungrouped data
The median is found by arranging the data in order of magnitude. The median is then the value of the middle
term. We denote it by x̃.
For example: Suppose the sales commission ($) of 15 representatives were as follows:
23, 16, 31, 77, 21, 14, 32, 6, 155, 9, 36, 24, 5, 27, 19
Placing the data in order of magnitude, we have 5, 6, 9, 14, 16, 19, 21, 23, 24, 27, 31, 32, 36, 77, and 155
The value of the middle term is the 8th value, that is 23. (There are seven values smaller than 23 and seven
values larger than 23.) The median is therefore 23.
In the above illustration the number of observations is an odd number, 15. In such a case there is always a single
value in the middle of the list.
? How is the median determined for an even number of ungrouped data?
As before, we first order the observations. In this case there is no clearly defined middle observation. Instead
we find two observations in the middle of the ordered list. The median is then taken as the mean of the two
central observations. For instance, suppose there are six items with values
25, 29, 30, 32, 35, 65
The median is 31, obtained by determining the arithmetic mean of the two central observations 30 and 32.
x̃ = (30+32)/2 = 31
We locate it by counting to the 3.5th item. Notice that 31 is not among the given values.
We summarize the above discussion as follows.
Let $x_1, x_2, x_3, \dots, x_n$ be n ordered observations. Then the median value is given by:
$$\tilde{x} = x_{\frac{n+1}{2}} \quad \text{if n is odd}$$
$$\tilde{x} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} \quad \text{if n is even}$$

That is, the median is the $\left(\frac{n+1}{2}\right)$th observation if n is odd,
or the mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations if n is even.
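The odd and even cases can be sketched in Python as follows, using the two illustrations from the text (the 15 sales commissions and the six items). The function name `median` is an illustrative choice.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # the ((n+1)/2)th observation, zero-based index n//2
    return (ordered[middle - 1] + ordered[middle]) / 2  # (n/2)th and (n/2+1)th


commissions = [23, 16, 31, 77, 21, 14, 32, 6, 155, 9, 36, 24, 5, 27, 19]
print(median(commissions))               # 23
print(median([25, 29, 30, 32, 35, 65]))  # 31.0
```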

Exercise Find the median from the following data of the heights in inches of a group of 14 students.
61,62,63,64,64,60,65,61,63,64,65,66,64,63
Now consider the case where the data are arranged in the form of a frequency distribution. Suppose the ordered
values $x_1, x_2, x_3, \dots, x_k$ have corresponding frequencies $f_1, f_2, \dots, f_k$. The median can be calculated in the
following manner.
Construct the less-than cumulative frequency (LCF) distribution. This is because, as you know, the less-than
cumulative frequency tells us the number of observations that are less than or equal to a specified value.
If $n = \sum_{i=1}^{k} f_i$ is odd, find $\frac{n+1}{2}$ and search for the smallest less-than cumulative frequency which is greater
than or equal to $\frac{n+1}{2}$. The value corresponding to this less-than cumulative frequency is the median.
If n is even, find $\frac{n}{2}$ and $\frac{n}{2}+1$ and, for each of them, search for the smallest less-than cumulative frequency that
is greater than or equal to it. The arithmetic mean of the values corresponding to these less-than
cumulative frequencies is the median.


We illustrate this by the following example.
Exercise Find the median of the data given below.
Values(xi) 3 5 4 2 7 6
Frequency(fi) 2 1 3 2 1 1
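The cumulative-frequency search described above can be sketched in Python as follows, applied to the data of this exercise. The function name `frequency_median` and the helper `value_at` are illustrative assumptions.

```python
from bisect import bisect_left
from itertools import accumulate


def frequency_median(values, freqs):
    """Median of a discrete frequency distribution via less-than cumulative frequencies."""
    pairs = sorted(zip(values, freqs))           # order the values first
    xs = [x for x, _ in pairs]
    lcf = list(accumulate(f for _, f in pairs))  # less-than cumulative frequencies
    n = lcf[-1]

    def value_at(position):
        # Value whose LCF is the smallest one that is >= position.
        return xs[bisect_left(lcf, position)]

    if n % 2 == 1:
        return value_at((n + 1) // 2)
    return (value_at(n // 2) + value_at(n // 2 + 1)) / 2


print(frequency_median([3, 5, 4, 2, 7, 6], [2, 1, 3, 2, 1, 1]))  # 4.0
```

Here n = 10, the LCF column is 2, 4, 7, 8, 9, 10, and both the 5th and 6th observations fall on the value 4.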

Median of grouped data


Recall that the median is defined as the value of the middle item in the sorted list. Since in grouped data the
raw data have been organized into a frequency distribution with class intervals of finite size, some of the
information is not identifiable. That is, we do not know the value of every item, but we can locate the middle
observation by dividing the total number of observations (n) by 2.
The class corresponding to the smallest LCF that is ≥ n/2 is called the median class. The median lies in this class.

The formula we use to compute the median of grouped data is
$$\text{Median} = \tilde{x} = lcb_{\tilde{x}} + \frac{\left(\frac{n}{2} - cf\right) w}{f_m} \tag{1.17}$$

Where: $lcb_{\tilde{x}}$ is the lower class boundary (lcb) of the median class.
n is the total number of observations (sum of frequencies).
$f_m$ is the frequency of the median class.
w is the width of the median class.
cf is the less-than cumulative frequency of the class immediately preceding the median class.
Recall that w = the upper class boundary (ucb) minus the lcb of that particular class.
Observe that in calculating the median in this case the steps are:
1. Construct the LCF table.
2. To determine the median class, divide the total number of observations by 2 and then search for the smallest LCF
which is ≥ n/2.
3. Apply formula (1.17).
? Have you realized that the median is based on the frequencies, the lower and upper class boundaries of the
median class and the LCF of the class preceding it?
Because the median class rarely falls in an open-ended class at either extreme, the median of a frequency
distribution with open-ended classes can still be determined. It can also be determined if percentage
frequencies are given instead of the actual counts. The percentages are considered substitutes for the actual
frequencies; in a sense, they are frequencies whose total is 100.
The following exercises illustrate how the median is computed for grouped data. The second one has both an
open-ended class and percentage frequencies.
Exercise; A sample of the daily production of bulbs at a company was organized into the following distribution.
Daily production 80-89 90-99 100-109 110-119 120-129 130-139
frequency 5 9 20 8 6 2
Calculate the median daily production?
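The three steps can be sketched in Python as below for the bulb production table. The function name `grouped_median` is an illustrative assumption, and the class boundaries are assumed to lie 0.5 below and above the stated class limits.

```python
def grouped_median(boundaries, freqs):
    """Median of grouped data: lcb + (n/2 - cf) * w / f_m (formula 1.17)."""
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cf = 0  # less-than cumulative frequency of the classes before the current one
    for (lcb, ucb), f in zip(boundaries, freqs):
        if cf + f >= half:  # first class whose LCF reaches n/2: the median class
            return lcb + (half - cf) * (ucb - lcb) / f
        cf += f
    raise ValueError("median class not found")


boundaries = [(79.5, 89.5), (89.5, 99.5), (99.5, 109.5),
              (109.5, 119.5), (119.5, 129.5), (129.5, 139.5)]
freqs = [5, 9, 20, 8, 6, 2]
print(grouped_median(boundaries, freqs))  # 99.5 + (25 - 14) * 10 / 20 = 105.0
```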
Exercise The net sales of a sample of small stamping plants were organized into the following frequency
distribution.
Net sales ($ million)(CI) 1-3 4-6 7-9 10-12 13 and greater
Percent of total 13 14 40 23 10

What is the median net sale?

The mode
In every day speech, something is “in the mode” if it is fashionable or popular. In statistics this “popularity”
refers to frequency of observations, and the most frequently observed value in a collection of observations is
therefore called the mode.
The modal wage, for example, is the wage received by more individuals than any other wage.
Mode: is the value of the observations that appears most frequently.

For a given set of data, mode may or may not exist and even if it exists may not be unique. To illustrate this
consider the following three sets of data.
Set A: 10, 10, 9, 8, 5, 4, 5, 12, 10 mode=10
Set B: 10, 10, 9, 9, 8, 12, 15, 5 mode=9 &10
Set C: 4, 6, 7, 15, 12, 9 no mode
Thus it is possible for a frequency distribution to have more than one mode.
A distribution with one mode is called unimodal, one with two modes bimodal, and one with more than two
modes multimodal.
In the above illustration the distribution described in set A is unimodal and that in set B is bimodal.
Remark: If, in a set of observed values, all values occur once or an equal number of times, there is no mode. (See
set C above.)
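The three cases (one mode, two modes, no mode) can be sketched in Python as follows for sets A, B and C. The function name `modes` is an illustrative assumption; it returns an empty list when every distinct value occurs equally often.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # If several distinct values all occur equally often, no value stands out.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([10, 10, 9, 8, 5, 4, 5, 12, 10]))  # [10]     set A, unimodal
print(modes([10, 10, 9, 9, 8, 12, 15, 5]))     # [9, 10]  set B, bimodal
print(modes([4, 6, 7, 15, 12, 9]))             # []       set C, no mode
```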
What about for a grouped data?
If the data are grouped so that we are given a frequency distribution with class intervals of finite size, we do
not know the value of every item, but we can easily determine the class with the highest frequency. The mode of
the distribution lies in this class, which we call the modal class.
In this case, determining the value of the mode is not as straightforward as in the ungrouped
case. Having located the modal class of the data, the next problem is to interpolate the value of the mode within
this modal class.
To compute the modal value of grouped data we use the interpolation formula
$$\text{Mode} = \hat{X} = lcb_{\hat{X}} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) w \tag{1.18}$$
where $\Delta_1 = f_m - f_p$ and $\Delta_2 = f_m - f_s$, and

$lcb_{\hat{X}}$ is the lcb of the modal class (that is, the class with the highest frequency).
$f_m$ is the frequency of the modal class.
$f_p$ is the frequency of the class preceding the modal class.
$f_s$ is the frequency of the class succeeding the modal class.
w is the width of the modal class.
Now we discuss how to apply formula (1.18) to find the modal value of grouped data with the help of the
following exercise.
Exercise The ages of newly hired, unskilled employees were grouped into the following distribution.
Compute the modal age.
Ages 18-20 21-23 24-26 27-29 30-32
number 4 8 11 20 7
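Formula (1.18) can be sketched in Python as below for the age table. The function name `grouped_mode` is illustrative; class boundaries are assumed to lie 0.5 below and above the stated limits, and a missing neighbouring class (when the modal class is first or last) is treated as having frequency 0, which is an assumption not stated in the text.

```python
def grouped_mode(boundaries, freqs):
    """Mode of grouped data: lcb + d1 / (d1 + d2) * w (formula 1.18)."""
    m = max(range(len(freqs)), key=freqs.__getitem__)  # first class with highest frequency
    f_m = freqs[m]
    f_p = freqs[m - 1] if m > 0 else 0                 # class preceding the modal class
    f_s = freqs[m + 1] if m < len(freqs) - 1 else 0    # class succeeding the modal class
    d1, d2 = f_m - f_p, f_m - f_s
    lcb, ucb = boundaries[m]
    return lcb + d1 / (d1 + d2) * (ucb - lcb)


boundaries = [(17.5, 20.5), (20.5, 23.5), (23.5, 26.5), (26.5, 29.5), (29.5, 32.5)]
freqs = [4, 8, 11, 20, 7]
print(round(grouped_mode(boundaries, freqs), 2))  # 26.5 + 9/22 * 3, about 27.73
```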

Exercise The following table shows the distribution of a group of families according to their expenditure per
week.
The median and the mode of the following distribution are known to be 25.50Birr and 24.50 Birr respectively.
Two frequency values are however missing from the table. Calculate the missing frequencies.
Class interval 1-10 11-20 21-30 31-40 41-50
frequency 14 a 27 b 15
Properties of mode
1. It is not affected by extreme values in a set of observations.
2. It can be calculated for distribution with open ended classes.
3. It can be computed for all levels of data nominal, ordinal, interval and ratio.
4. The main drawback of mode is that often it does not exist.
5. Often its values are not unique.
Measures of non-central location (quantiles)
There are three types of quantiles. These are:
1. Quartiles
The quartiles are the three points which divide a given ordered data set into four equal parts. They are
denoted Q1, Q2 and Q3.
Q1 is the value corresponding to the $\left(\frac{n+1}{4}\right)$th ordered observation.
Q2 is the value corresponding to the $2\left(\frac{n+1}{4}\right)$th ordered observation.
Q3 is the value corresponding to the $3\left(\frac{n+1}{4}\right)$th ordered observation.

E.g. Consider the age data given below and calculate Q1, Q2 and Q3.
19, 20, 22, 22, 17, 22, 20, 23, 17, 18
Solution: First arrange the data in ascending order, n = 10:
17, 17, 18, 19, 20, 20, 22, 22, 22, 23
Q1 = ((n+1)/4)th = ((10+1)/4)th = (2.75)th observation = 2nd observation + 0.75(3rd − 2nd) observation
   = 17 + 0.75(18 − 17) = 17.75
Therefore 25% of the observations are below 17.75.
Q2 = 2((n+1)/4)th = 2((10+1)/4)th = (5.5)th observation = 5th + 0.5(6th − 5th) = 20 + 0.5(20 − 20) = 20
Q3 = 3((n+1)/4)th = 3((10+1)/4)th = (8.25)th observation = 8th + 0.25(9th − 8th) = 22 + 0.25(22 − 22) = 22
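The positional rule with interpolation used in this solution can be sketched in Python as follows. The function name `quartile` is illustrative, and clamping positions that fall outside the data to the smallest or largest value is an assumption the text does not cover.

```python
def quartile(values, i):
    """i-th quartile (i = 1, 2, 3) at position i*(n+1)/4 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = i * (n + 1) / 4     # 1-based position, possibly fractional
    whole = int(position)          # e.g. 2 for the 2.75th observation
    fraction = position - whole    # e.g. 0.75
    if whole < 1:
        return ordered[0]          # assumption: clamp below the first value
    if whole >= n:
        return ordered[-1]         # assumption: clamp above the last value
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


ages = [19, 20, 22, 22, 17, 22, 20, 23, 17, 18]
print(quartile(ages, 1), quartile(ages, 2), quartile(ages, 3))  # 17.75 20.0 22.0
```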

Calculation of quartiles from grouped data


For grouped data, the three quartiles can be computed as follows:
Calculate in/4 and search for the minimum lcf which is ≥ in/4, i = 1, 2, 3.
The class corresponding to this lcf is called the ith quartile class. This is the class where Qi lies.
The value of the ith quartile (Qi) is then calculated by the formula
$$Q_i = lcb_{q_i} + \frac{\left(\frac{in}{4} - lcf_p\right) w}{f_{q_i}}, \quad i = 1, 2, 3$$

Where $lcb_{q_i}$ is the lower class boundary of the ith quartile class,
$f_{q_i}$ is the frequency of the ith quartile class,
$lcf_p$ is the lcf corresponding to the class immediately preceding the ith quartile class,
w is the width of the ith quartile class.
Note: Q2 = median
2. Percentiles (P)
Percentiles are symbolized by P1, P2, …, P99 and divide the ordered distribution into 100 groups.
The percentile corresponding to a given value (x) is computed by
$$\text{Percentile } (P) = \frac{(\text{number of values below } x) + 0.5}{\text{total number of observations } (n)} \times 100\% = \frac{c + 0.5}{n} \times 100\%$$
where c is the number of values below x.

Exercise A teacher gives a 20 point test to 10 students. The scores are shown below. Find the percentile rank of
a score of 12.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
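The percentile-rank formula can be sketched in Python as below for the ten test scores. The function name `percentile_rank` is an illustrative assumption.

```python
def percentile_rank(values, x):
    """(number of values below x + 0.5) / n * 100, as in the formula above."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) * 100 / len(values)


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_rank(scores, 12))  # six scores are below 12, so (6 + 0.5) / 10 * 100 = 65.0
```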
To find the data value corresponding to a given percentile p:
1. Arrange the data in increasing order.
2. Find c, where c = np/100.
3. If c is not a whole number, round it up to the next whole number and take the value in that position.
4. If c is a whole number, use the value halfway between the cth and (c+1)th values.
Exercise Using the above data, find the values corresponding to the 25th and 60th percentiles.
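The four-step rule can be sketched in Python as follows for the same ten scores. The function name `percentile_value` is illustrative, and the sketch assumes p is a whole number strictly between 0 and 100.

```python
def percentile_value(values, p):
    """Data value at the p-th percentile using the rounding rule described above."""
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    ordered = sorted(values)
    c, remainder = divmod(len(ordered) * p, 100)  # c = np/100, split into whole part and rest
    if remainder:
        # c is not whole: round up to position c + 1, which is zero-based index c.
        return ordered[c]
    # c is whole: halfway between the c-th and (c+1)-th ordered values.
    return (ordered[c - 1] + ordered[c]) / 2


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_value(scores, 25))  # c = 2.5, rounded up to the 3rd value: 5
print(percentile_value(scores, 60))  # c = 6, halfway between 10 and 12: 11.0
```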
Calculation of percentiles for grouped data
Percentiles are the 99 points which divide a given ordered data set into 100 equal parts, so that each part
contains an equal number of elements.
For grouped data, the 99 percentiles can be computed as follows:
Calculate mn/100 and search for the minimum lcf which is ≥ mn/100, m = 1, 2, 3, …, 99.
The class corresponding to this lcf is called the mth percentile class. This is the class where Pm lies.
The value of the mth percentile (Pm) is then calculated by the formula
$$P_m = lcb_{p_m} + \frac{\left(\frac{mn}{100} - lcf_p\right) w}{f_{p_m}}, \quad m = 1, 2, 3, \dots, 99$$

Where: $lcb_{p_m}$, $f_{p_m}$ and $lcf_p$ have interpretations similar to those for quartiles.
3. Deciles (D)
Deciles are the nine points which divide the given ordered data into 10 equal parts.
For grouped data, the 9 deciles can be computed as follows:
Calculate kn/10 and search for the minimum lcf which is ≥ kn/10, k = 1, 2, 3, …, 9.
The class corresponding to this lcf is called the kth decile class. This is the class where Dk lies.
The value of the kth decile (Dk) is then calculated by the formula
$$D_k = lcb_{d_k} + \frac{\left(\frac{kn}{10} - lcf_p\right) w}{f_{d_k}}, \quad k = 1, 2, 3, \dots, 9$$

Where: $lcb_{d_k}$, $f_{d_k}$ and $lcf_p$ have interpretations similar to those for quartiles and percentiles.
Note that median = Q2 = D5 = P50; D1, D2, D3, …, D9 correspond to P10, P20, P30, …, P90; and
Q1, Q2, Q3 correspond to P25, P50, P75.
Exercise For the following frequency distribution, find
a) Q1, Q2, Q3   b) P25, P30, P50, P75   c) D1, D2, D3 and D5
interval 21-22 23-24 25-26 27-28 29-30
f 10 22 20 14 14
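Because the quartile, decile and percentile formulas for grouped data share one shape, they can be sketched with a single Python function. The name `grouped_quantile` and its `parts` parameter (4, 10 or 100) are illustrative assumptions, and class boundaries are assumed to lie 0.5 below and above the stated limits.

```python
def grouped_quantile(boundaries, freqs, k, parts):
    """k-th quantile when the data are split into `parts` equal parts.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    """
    n = sum(freqs)
    target = k * n / parts
    lcf_prev = 0  # lcf of the class preceding the current one
    for (lcb, ucb), f in zip(boundaries, freqs):
        if lcf_prev + f >= target:  # minimum lcf that is >= k*n/parts
            return lcb + (target - lcf_prev) * (ucb - lcb) / f
        lcf_prev += f
    raise ValueError("quantile class not found")


boundaries = [(20.5, 22.5), (22.5, 24.5), (24.5, 26.5), (26.5, 28.5), (28.5, 30.5)]
freqs = [10, 22, 20, 14, 14]
for i in (1, 2, 3):
    print(f"Q{i} =", round(grouped_quantile(boundaries, freqs, i, 4), 2))
# Q1 = 23.41, Q2 = 25.3, Q3 = 27.64
```

The same call with `parts=10` or `parts=100` gives the deciles and percentiles, so that `grouped_quantile(boundaries, freqs, 5, 10)` and `grouped_quantile(boundaries, freqs, 50, 100)` both reproduce the median.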

CHAPTER FOUR

MEASURES OF VARIATION (DISPERSION), SKEWNESS AND KURTOSIS

Measures of variation (dispersion)

Measures of central tendency locate the center of the distribution, but they do not tell how individual
observations are scattered on either side of the center. The spread of the observations around the center is known
as dispersion or variability.
- Small dispersion indicates high uniformity of the observations, while larger dispersion indicates less
uniformity.
Objectives of measures of dispersion
- To measure the reliability of the average being used.
- To control variation in a product.
- To compare variability among two or more groups.
Measures of dispersion can be classified as absolute or relative.
Absolute measures of dispersion are expressed in concrete units, that is, the units in which the data have
been expressed.
Example: centimeters, kilograms, etc.
Relative measures of dispersion: a relative measure is a quotient obtained by dividing the absolute measure by the
quantity with respect to which the absolute deviation has been computed. It is a pure number and is usually
expressed as a percentage. Relative measures are used for making comparisons between two or more distributions.

Desirable properties of measure of dispersion (variation)


1. It should be based on all observations.
2. It should be easily calculated.
3. It should not be affected much by sampling fluctuations.
4. It should be readily comprehensible. (It should be easily understood.)
5. It should be amenable to algebraic treatment. That means it should be precisely defined (like arithmetic
mean). It should not be located by inspection like mode.
6. It should be capable of further statistical analysis.
Range
- Range is the difference between the largest and the smallest observation.
R = XL − XS, where XL is the largest observation and XS is the smallest observation.
For grouped data, R = UCBL − LCBF, where UCBL is the upper class boundary of the last class and LCBF is the
lower class boundary of the first class.
Although it is the simplest measure to compute, it takes into account only the two extreme values of the data. The
other values have no role to play. Because of this, the range is considered a rough measure of variation. The
major area in which the range is applied is statistical quality control.
Relative range

35
Range is a measure of absolute dispersion and as such can not be used for comparing variability of two
distributions expressed in different units. Measurements made in Kg are not comparable with dispersion
measured in centimeters.
The solution is to use relative range or any other relative measure of variation.
range
Relative= 𝑋
𝐿 +𝑋𝑆

Example: Consider the weekly earnings of workers in two laboratories of the same type.

Laboratory A (in birr)        Laboratory B (in dollars)
C.I        F                  C.I        F
21-22      10                 17-18      2
23-24      22                 19-20      4
25-26      20                 21-22      10
27-28      14                 23-24      14
29-30      14                 25-26      18
                              27-28      16
                              29-30      10
                              31-32      6
Total      80                 Total      80
Range for A is 30 − 21 = 9 birr
Range for B is 32 − 17 = 15 dollars
Because the units are different we cannot use the range for comparing the variability. We have to use the relative
range.
R.R for laboratory A = range/(XL + XS) = 9/(30 + 21) = 0.176

R.R for laboratory B = range/(XL + XS) = 15/(32 + 17) = 0.306

Therefore, the variation is higher in laboratory B.
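The comparison can be reproduced with a short Python sketch; the function name `relative_range` is an illustrative assumption.

```python
def relative_range(largest, smallest):
    """Range divided by the sum of the largest and smallest observations."""
    return (largest - smallest) / (largest + smallest)


print(round(relative_range(30, 21), 3))  # laboratory A: 9 / 51, about 0.176
print(round(relative_range(32, 17), 3))  # laboratory B: 15 / 49, about 0.306
```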


Advantages of range
1. It is easy to calculate.
2. It is easy to understand.
Limitations of range
1. It can be affected by extreme values.
2. It cannot be computed when the distribution has open-ended classes.
3. It does not take into account the entire set of data.
4. It does not tell anything about the distribution of values in the series relative to the measure of central tendency.
Inter-quartile range = (Q3 − Q1) = the difference between the third and the first quartile. The larger the
inter-quartile range, the larger the variability. It is not affected by extreme values. It is a good indicator of
the absolute variability.
Quartile deviation (semi-inter-quartile range): is defined as half of the inter-quartile range.
Quartile deviation = ½ (Q3 − Q1)
Coefficient of quartile deviation: this is a relative measure of variation.
$$\text{Coefficient of quartile deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

If quartile deviation is to be used for comparing the variability of two series, then it is necessary to convert the
absolute measure to a coefficient of quartile deviation.
Characteristic of quartile deviation
1. The size of the quartile deviation gives an indication of uniformity. If the Q.D is small, it denotes large
uniformity. Thus a coefficient of quartile deviation is used for comparing uniformity or variation in
different distributions.
2. Quartile deviation is not a measure of dispersion in the sense that it does not show the scatter around an
average, but only a distance on a scale. Consequently, quartile deviation is regarded as a measure of
partition.
3. It can be computed when the distribution has open-ended classes.

Limitation of quartile deviation

Except for the fact that its computation is simple and it is easy to understand, the quartile deviation does not
satisfy any other test of a good measure of variation.
Example: for the following frequency distribution, find
a. Inter-quartile range
b. Quartile deviation
c. Coefficient of quartile deviation
d. Test whether the distribution is symmetric or not
Distribution Frequency Lcf
21-22 10 10
23-24 22 32
25-26 20 52
27-28 14 66
37
29-30 14 80
Total 80

a. Then inter-quartile range = Q3-Q1 = 27.64-23.41= 4.23


b. Quartile deviation= ½ (Q3-Q1) = 4.23/2 = 2.115
𝑄 −𝑄 4.23
c. Coefficient of quartile deviation =𝑄3+𝑄1 = 23.41+27.64 = 0.083
1 3

d. If the distribution is symmetrical median - Q1 = Q3 – median


Median = Q2 = 25.3
In our case median – Q1 = 25.3 - 23.41 = 1.89
Q3 – median = 27.64 - 25.3 = 2.34
Therefore, the distribution is not symmetrical.
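The quartiles used in the example above can be checked with a short script. This is only a sketch: the helper name `grouped_quantile` is hypothetical, and it assumes integer class limits, so that each class boundary lies 0.5 below the lower limit and 0.5 above the upper limit.

```python
# Grouped frequency table from the example: (lower limit, upper limit, frequency)
classes = [(21, 22, 10), (23, 24, 22), (25, 26, 20), (27, 28, 14), (29, 30, 14)]

def grouped_quantile(classes, fraction):
    """Interpolate the value below which `fraction` of the observations fall."""
    total = sum(freq for _, _, freq in classes)
    target = fraction * total        # position of the quantile, e.g. N/4 for Q1
    cumulative = 0                   # less-than cumulative frequency so far
    for lower, upper, freq in classes:
        if cumulative + freq >= target:
            boundary = lower - 0.5   # lower class boundary (assumes integer limits)
            width = (upper + 0.5) - boundary
            return boundary + (target - cumulative) / freq * width
        cumulative += freq
    raise ValueError("fraction must be between 0 and 1")

q1 = grouped_quantile(classes, 0.25)
q2 = grouped_quantile(classes, 0.50)   # the median
q3 = grouped_quantile(classes, 0.75)

iqr = q3 - q1
quartile_deviation = iqr / 2
coefficient_qd = (q3 - q1) / (q3 + q1)
is_symmetric = abs((q2 - q1) - (q3 - q2)) < 1e-9

print(round(q1, 2), round(q2, 2), round(q3, 2))
print(round(iqr, 2), round(coefficient_qd, 3), is_symmetric)
```

Small differences in the last decimal place compared with the hand calculation come from rounding Q1 and Q3 before subtracting.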
Mean deviation (average deviation)
The average deviation measures the scatter of individual observations around a central value, usually the mean or
the median of a distribution. So, the mean deviation is defined as the arithmetic mean of the absolute deviations of
each observation from either the mean or the median of a distribution.
- If the deviations are taken from the mean, it is called the mean deviation from the mean. If, on the other
hand, the deviations are taken from the median, we call it the mean deviation from the median.
I.e. Mean deviation from the mean = $MD_{\bar{x}} = \dfrac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$
Mean deviation from the median = $MD_{\tilde{x}} = \dfrac{\sum_{i=1}^{n} |X_i - \tilde{X}|}{n}$

E.g.: The weight of a sample of six students from a class (in Kg) is given below
53, 56, 57, 59, 63, and 66
i. What is the mean deviation from the mean?
ii. What is the mean deviation from the median?
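Solution: $\bar{X} = \dfrac{53 + 56 + \dots + 66}{6} = 59$, $\tilde{X} = \dfrac{57 + 59}{2} = 58$
$MD_{\bar{x}} = \dfrac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} = \dfrac{|53 - 59| + |56 - 59| + \dots + |66 - 59|}{6} = \dfrac{22}{6} = 3.67$
$MD_{\tilde{x}} = \dfrac{\sum_{i=1}^{n} |X_i - \tilde{X}|}{n} = \dfrac{|53 - 58| + |56 - 58| + \dots + |66 - 58|}{6} = \dfrac{22}{6} = 3.67$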
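The two mean deviations in this example can be reproduced in a few lines. The sketch below uses Python's standard `statistics` module; the helper name `mean_deviation` is hypothetical.

```python
from statistics import mean, median

weights = [53, 56, 57, 59, 63, 66]  # weights in kg, from the example

def mean_deviation(values, center):
    """Arithmetic mean of the absolute deviations from `center`."""
    return sum(abs(x - center) for x in values) / len(values)

md_from_mean = mean_deviation(weights, mean(weights))
md_from_median = mean_deviation(weights, median(weights))

print(round(md_from_mean, 2), round(md_from_median, 2))
```

That both values coincide here is a feature of this particular data set, not a general rule.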

Mean deviation of grouped data


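For data given as a frequency distribution, the mean deviations from the mean and from the median are given by

For an ungrouped frequency distribution:
$MD_{\bar{x}} = \dfrac{\sum_{i=1}^{n} f_i |X_i - \bar{X}|}{\sum f_i}$
$MD_{\tilde{x}} = \dfrac{\sum_{i=1}^{n} f_i |X_i - \tilde{X}|}{\sum f_i}$

For a grouped frequency distribution, where $m_i$ is the class mark of the i-th class:
$MD_{\bar{x}} = \dfrac{\sum_{i=1}^{n} f_i |m_i - \bar{X}|}{\sum f_i}$
$MD_{\tilde{x}} = \dfrac{\sum_{i=1}^{n} f_i |m_i - \tilde{X}|}{\sum f_i}$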

The coefficients of mean deviation from the mean and from the median are given by
Coefficient of $MD_{\bar{x}} = \dfrac{MD_{\bar{x}}}{\bar{X}}$
Coefficient of $MD_{\tilde{x}} = \dfrac{MD_{\tilde{x}}}{\tilde{X}}$

E.g. Find the coefficient of mean deviation from the mean and from the median for the weights of the six students in
the previous example.

Solution: Coefficient of $MD_{\bar{x}} = \dfrac{MD_{\bar{x}}}{\bar{X}} = \dfrac{3.67\ kg}{59\ kg} = 0.0622$
Coefficient of $MD_{\tilde{x}} = \dfrac{MD_{\tilde{x}}}{\tilde{X}} = \dfrac{3.67\ kg}{58\ kg} = 0.0633$

4. Variance and standard deviation

The variance is a measure of dispersion. It tells us something about the scatter of scores around the mean. The
variance uses the distances of the values from their mean: if the values are clustered near the mean, the variance
will be small. The variance is usually reported without a unit of measurement; if a unit is given, it is the square
of the unit of the data. It is defined as the mean squared deviation from the mean, and symbolized by
a small sigma squared, $\sigma^2$. Let $x_1, x_2, \dots, x_N$ be the values of N observations; then

i. The population variance is denoted by $\sigma^2$ and computed by

$\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, $i = 1, 2, \dots, N$, or $\sigma^2 = \dfrac{\sum_{i=1}^{N} x_i^2}{N} - \left(\dfrac{\sum_{i=1}^{N} x_i}{N}\right)^2$

The standard deviation is the positive square root of the variance of the given observations. It is
given by
$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$, $i = 1, 2, \dots, N$

➢ If the data are given in the form of a frequency distribution in which the variate value $x_i$ has corresponding
frequency $f_i$ (i = 1, 2, 3, …, k), then the population variance is given by

$\sigma^2 = \dfrac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{N}$, where N is the total number of observations

ii. The sample variance of a set $x_1, x_2, x_3, \dots, x_n$ of n observations is denoted by $s^2$ and is computed by
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$ or $s^2 = \dfrac{\sum_{i=1}^{n} x_i^2}{n-1} - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}$

The Standard Deviation


The standard deviation is the square root of the variance and is symbolized by a small Greek sigma, $\sigma$. Its
formula is the square root of any of the formulae for the variance:
$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$ Population standard deviation

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$ Sample standard deviation

➢ If the observation $x_i$ occurs $f_i$ times for $i = 1, 2, 3, \dots, k$, then the sample variance is computed by:
$s^2 = \dfrac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n-1}$, where n is the total number of observations

➢ To compute the variance and standard deviation for grouped data we can use the same formulas as above,
but in this case the $x_i$'s are the class marks of the distribution.

E.g. compute the variance and standard deviation for the following data

𝑥𝑖 : 3 6 5 3 4 3
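Since the example does not say whether the six values are a population or a sample, the sketch below computes both versions from the definitional formulas and cross-checks them against Python's standard `statistics` module. The variable names are illustrative only.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance, stdev, variance

data = [3, 6, 5, 3, 4, 3]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean
sum_sq_dev = sum((x - mean) ** 2 for x in data)

population_variance = sum_sq_dev / n        # divide by N
sample_variance = sum_sq_dev / (n - 1)      # divide by n - 1

population_sd = sqrt(population_variance)
sample_sd = sqrt(sample_variance)

# Cross-check against the standard library implementations
assert isclose(population_variance, pvariance(data))
assert isclose(sample_variance, variance(data))
assert isclose(population_sd, pstdev(data))
assert isclose(sample_sd, stdev(data))

print(round(population_variance, 3), round(population_sd, 3))
print(round(sample_variance, 3), round(sample_sd, 3))
```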

E.g. find the sample variance and standard deviation of the data given below

Class frequency
1 – 5 4
6 – 10 1
11 – 15 2
16 – 20 3

Properties of variance and standard deviation
1. If a constant value is added to or subtracted from each observation, the variance and standard deviation
remain the same.
2. If each value in a given data set is multiplied by a constant k, then the new variance and standard deviation
are obtained by multiplying the original variance and standard deviation by k² and |k| respectively.
3. If each value of a distribution is divided by a nonzero constant k, then the new variance and standard deviation
are obtained by dividing the original variance and standard deviation by k² and |k| respectively.

The pooled or combined variance:- is used when we want to combine the variances of k different distributions,
each of which has its own variance. It is given by:

$S_{pooled}^2 = \dfrac{(n_1 - 1)(s_1^2 - d_1^2) + (n_2 - 1)(s_2^2 - d_2^2) + \dots + (n_k - 1)(s_k^2 - d_k^2)}{n_1 + n_2 + \dots + n_k - k}$

Where $d_i = \bar{x}_i - \bar{x}_c$ and $\bar{x}_c$ is the combined mean of the given groups
E.g. the mean and variance of scores earned by two groups on computation yielded the following results; find
the pooled variance.
$n_1 = 12$, $\bar{x}_1 = 5.5$, $s_1^2 = 25.5$
$n_2 = 15$, $\bar{x}_2 = 8.5$, $s_2^2 = 64.2$

The coefficient of variation (CV)

The coefficient of variation is the relative measure of variation. It is a pure number independent of units of
measurement and thus is suitable for comparing the variability, homogeneity or uniformity of two or more
distributions.

The coefficient of variation is also a useful measure for comparing the variability of two or more distributions
that are measured in the same units but have unequal means. The formula is given by:-

$CV = \dfrac{\sigma}{\mu} \times 100\%$ or $CV = \dfrac{s}{\bar{x}} \times 100\%$

A set of observations with a smaller CV is considered more consistent or stable; the larger the CV, the greater the
variability in the set of data.

E.g. two workers on the same job show the following results over a long period of time

                                            A     B
Mean time of completing the job (in min)    30    25
Standard deviation (in min)                 6     4

Which worker appears to be more consistent?
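The comparison asked for here is a direct application of the CV formula. A minimal sketch, with a hypothetical helper name:

```python
def coefficient_of_variation(std_dev, mean):
    """Return the coefficient of variation as a percentage."""
    return std_dev / mean * 100

cv_a = coefficient_of_variation(6, 30)   # worker A
cv_b = coefficient_of_variation(4, 25)   # worker B

# The worker with the smaller CV is the more consistent one
more_consistent = "A" if cv_a < cv_b else "B"
print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%, more consistent: {more_consistent}")
```

Worker B has the smaller CV (16% against 20%) and so appears more consistent, even though the comparison of raw standard deviations alone would ignore the difference in mean times.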

CHAPTER FIVE

5. Elementary Probability
Introduction
Defn. Probability (p):- is a numerical description of the chance of occurrence of a given phenomenon under certain
conditions. It is used to measure the degree of certainty.
Definition of some probability terms
Random experiment:- is a process that leads to well-defined results called outcomes.
Example: tossing a coin two times and observing the number of heads appearing on top.
An outcome:- is the result of a single trial of a random experiment.
Example: when a coin is tossed, there are two outcomes, i.e. H and T.
Sample space (S):- is the set of all possible outcomes of a random experiment. If each of n trials has the same
number of possible outcomes, the number of elements in S is (number of outcomes per trial)^n.
Example: rolling a die, S = {1, 2, 3, …, 6}
Example: find the sample space for the gender of the children if a family has 3 children. Use B for boy and G for girl.
Solution: n = 3, number of outcomes per child = 2 (B or G), so the number of elements in S is 2^3 = 8
S = {BBB, BBG, BGB, GBB, GGG, GGB, GBG, BGG}
Events:- an event is a subset of the sample space; it consists of one or more outcomes of a random experiment.
Example: getting an odd number in rolling a die.
Solution: Let A be the event of getting an odd number. A = {1, 3, 5}
Complement of an event:- is the set of outcomes in the sample space that are not included in the outcomes of the
event. The complement of E is denoted by E’.
Example: a) find the complement of the event of getting a 4 in rolling a die.
Solution: let B be the event of getting a 4 in rolling a die.
B = {4}; B’ = {1, 2, 3, 5, 6}
NB: B ∪ B’ = S
b) Tossing two coins and getting all heads.
Soln. let A be the event of getting all heads in tossing two coins.
A = {HH}, A’ = {at least one tail} = {HT, TH, TT}, because A ∪ A’ = S
Mutually exclusive events:- two events are mutually exclusive if they cannot occur at the same time (i.e. they have
no outcome in common).
Example: The events of getting a 4 and getting a 6 when a single card is drawn from a deck are mutually
exclusive, since a single card cannot be both a 4 and a 6.
- The events of getting a 4 and getting a heart in a single draw are not mutually exclusive.
Equally likely events:- events that have the same probability of occurring.
Example: when a single die is rolled, each outcome has the same probability (p) of 1/6.

Independent events:- if two events A and B are independent, then the occurrence of A does not affect the
occurrence of B.
Example: Rolling a die and getting a 6, and then rolling a second die and getting a 3.
Drawing a card from a deck and getting a queen, replacing it, and drawing a second card and getting a queen.
Dependent events:- two events are dependent when the occurrence of the 1st event affects the occurrence of the second event.
Example: Drawing a card from a deck, not replacing it, and then drawing a second card.
Principles of Counting
1. Addition Principle: if a task can be accomplished by k distinct procedures, where the ith procedure has ni
alternatives, the total number of ways of accomplishing the task equals
n1 + n2 + … + nk
Example1: there are two means of transportation from city A to city B: bus or train.
There are 3 buses and 2 trains. How many ways of transportation are there from city A to city B?
Example2: suppose one wants to purchase a certain commodity and this commodity is on sale in 5 government
owned shops, 6 public shops and 10 private shops. How many alternatives are there for the person to purchase
this commodity?
2. Multiplication Principle
Rule1: in a sequence of n events in which the first event has k1 possibilities, the second has k2, the third
has k3, and so forth, the total number of possibilities is
k1 × k2 × … × kn
Example1: a paint manufacturer wishes to manufacture several different paints. The categories include 3
colors (red, white, blue), two types (latex and oil) and two uses (outdoor and indoor). How
many different kinds of paint can be made if a person can select one color, one type and one use?
Example2: a nurse has 3 patients to visit. In how many different ways can she make her rounds if she visits each
patient only once?
Rule2: if each event in a sequence of n events has k different possibilities, then the total number of
possibilities of the sequence is
k × k × … × k = k^n
Example1: the digits 0, 1, 2, 3 and 4 are to be used in a 4-digit ID card. How many different cards are possible if
(a) repetitions are permitted? (b) repetitions are not permitted?
Soln. a. 5C1 × 5C1 × 5C1 × 5C1 = 5 × 5 × 5 × 5 = 5^4 = 625 cards
b. 5C1 × 4C1 × 3C1 × 2C1 = 5 × 4 × 3 × 2 = 120 cards

Example 3: (a) an urn contains 4 balls whose colors are red, blue, black and white. A ball is selected, its color
is noted, and it is replaced; then a 2nd ball is selected and its color is noted. How many color schemes are
possible?
(b) If the 1st ball is not replaced, how many different outcomes are there?
Soln. a) 4C1 × 4C1 = 4 × 4 = 16
b) 4C1 × 3C1 = 4 × 3 = 12
3. Permutations
Definition: a permutation is an arrangement of n distinct objects in a specific order.
Permutation Rule 1:- the number of arrangements of n objects taken r at a time is written as $nPr = \dfrac{n!}{(n-r)!}$

Example1:- in how many ways can the letters A, B and C be arranged taken two at a time?
$3P2 = \dfrac{3!}{(3-2)!} = \dfrac{3 \times 2 \times 1}{1!} = 6$, i.e. AB, AC, BA, BC, CA, CB

Example2:- in how many ways can a laboratory technician mount 10 specimens on 4 microscopes?
$10P4 = \dfrac{10!}{(10-4)!} = \dfrac{10 \times 9 \times 8 \times 7 \times 6!}{6!} = 5040$ ways

Permutation Rule 2:- the number of permutations of n distinct objects taken all together is n!, or nPn.
Note:- n! = n × (n-1) × (n-2) × … × 3 × 2 × 1
0! = 1 and 1! = 1
Example:- in how many ways can a student arrange his/her 6 different books on a shelf?
Permutation Rule 3:- the number of ways n distinct objects can be arranged in a circle is (n-1)!.
Note:- 1 is subtracted because one object is fixed as the starting point.
Example:- arranging the letters A, B and C on a circle can be done in (3-1)! = 2 ways.
Permutation Rule 4:- the number of permutations of n objects in which k1 are alike, k2 are alike, etc. is
$\dfrac{n!}{k_1! \times k_2! \times \dots \times k_p!}$
where k1 + k2 + k3 + … + kp = n

Example:- how many different permutations can be made from the letters in the word MISSISSIPPI?
Sol/n: the number of M = 1, I = 4, S = 4 and P = 2
$\dfrac{11!}{1!\,4!\,4!\,2!} = 34650$ different arrangements.
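The four permutation rules can be checked numerically. The sketch below relies on `math.perm` and `math.factorial` from the Python standard library (Python 3.8 or later); the helper `arrangements_with_repeats` is a hypothetical name introduced only for illustration.

```python
from collections import Counter
from itertools import permutations
from math import factorial, perm

# Rule 1: n objects taken r at a time
abc_pairs = ["".join(p) for p in permutations("ABC", 2)]
print(abc_pairs, perm(3, 2))
specimens_on_microscopes = perm(10, 4)

# Rule 2: all n distinct objects in a row (six books on a shelf)
book_orders = factorial(6)

# Rule 3: n distinct objects around a circle
circular_abc = factorial(3 - 1)

# Rule 4: n objects where some are alike
def arrangements_with_repeats(word):
    """n! divided by the factorial of each letter's count."""
    total = factorial(len(word))
    for count in Counter(word).values():
        total //= factorial(count)
    return total

mississippi = arrangements_with_repeats("MISSISSIPPI")
print(specimens_on_microscopes, book_orders, circular_abc, mississippi)
```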

4. COMBINATIONS
A combination is a selection of distinct objects without regard to order.
Combinations are used when the order of arrangement is not important, as in a selection process.
The number of combinations of r objects selected from n objects is denoted by $nCr = \dfrac{n!}{r!(n-r)!}$

Example: given the letters A, B, C and D, list the permutations and combinations for selecting two letters.
Permutations: AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, DC; 4P2 = 12
Combinations: AB, AC, AD, BC, BD, CD; 4C2 = 6
Example: in an English class the students are given the choice of 8 different essay topics. In how many ways can 4
students choose a topic
a. if no 2 students may choose the same topic? 8P4 = 1680 ways
b. if there is no restriction on the choice of topics? 8^4 = 4096 ways of choosing topics
Example: a committee of 5 people must be selected from 5 men and 8 women. In how many ways can the selection be
done if there are at least 3 women on the committee?
Sol/n
The committee can consist of 3 women and 2 men, or 4 women and 1 man, or 5 women.
8C3 × 5C2 + 8C4 × 5C1 + 8C5 × 5C0 = 560 + 350 + 56 = 966
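The combination examples above can be verified with `math.comb` and `itertools.combinations` from the Python standard library. This is a sketch with illustrative variable names, not part of the original solution.

```python
from itertools import combinations
from math import comb, perm

# Selecting two letters from A, B, C, D without regard to order
pairs = ["".join(c) for c in combinations("ABCD", 2)]
print(pairs, comb(4, 2))

# Essay topics: 4 students, 8 topics
distinct_topics = perm(8, 4)     # no two students share a topic
unrestricted_topics = 8 ** 4     # each student chooses freely

# Committee of 5 from 5 men and 8 women with at least 3 women:
# sum over the possible numbers of women (3, 4 or 5)
committees = sum(comb(8, women) * comb(5, 5 - women) for women in range(3, 6))
print(distinct_topics, unrestricted_topics, committees)
```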

Basic approaches to probability


There are 3 basic approaches to probability
1. The classical approach
2. The frequentist approach
3. The axiomatic approach
1. The classical approach
This is based on the assumption that the outcomes of an experiment are equally likely and that the total number of
outcomes is finite. It uses the sample space to determine the numerical probability that an event will happen.
Definition: if there are n equally likely outcomes of an experiment, and event E occurs in k of the n outcomes,
then the probability of the event E, denoted by P(E), is defined as
$P(E) = \dfrac{\text{number of outcomes favorable to event } E}{\text{total number of outcomes}} = \dfrac{n(E)}{n(S)} = \dfrac{k}{n}$

Example: when a single die is rolled, what is the probability of getting a number less than 5?
Sol/n: let A = the event of getting a number less than 5 in rolling a die
S = sample space = {1, 2, …, 6}, A = {1, 2, 3, 4}
$P(A) = \dfrac{n(A)}{n(S)} = \dfrac{4}{6} = \dfrac{2}{3}$

Example 1: a box of 80 candles consists of 30 defective and 50 non defective candles. If 10 of these candles are
selected at random without replacement, what is the probability that
a) All will be defective?
b) 6 will be non defective?
c) All will be non defective?
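One way to approach this example is to count selections with combinations, since every set of 10 candles is equally likely. The sketch below assumes that part (b) means exactly 6 non-defective candles (and therefore 4 defective ones); the function name `prob_non_defective` is hypothetical.

```python
from fractions import Fraction
from math import comb

DEFECTIVE, NON_DEFECTIVE, DRAWN = 30, 50, 10
TOTAL = DEFECTIVE + NON_DEFECTIVE

def prob_non_defective(k):
    """P(exactly k non-defective among DRAWN candles), drawing without replacement."""
    favorable = comb(NON_DEFECTIVE, k) * comb(DEFECTIVE, DRAWN - k)
    return Fraction(favorable, comb(TOTAL, DRAWN))

p_all_defective = prob_non_defective(0)       # part (a)
p_six_non_defective = prob_non_defective(6)   # part (b), assuming "exactly 6"
p_all_non_defective = prob_non_defective(10)  # part (c)

for label, p in [("a", p_all_defective), ("b", p_six_non_defective), ("c", p_all_non_defective)]:
    print(label, float(p))
```

Using `Fraction` keeps the probabilities exact, which makes it easy to confirm that the probabilities over all possible values of k add up to 1.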

2. The frequentist approach (empirical probability):
This approach to probability is based on relative frequencies.
Definition: suppose we repeat a certain experiment n times, and let A be an event of the experiment and k be
the number of times that event A occurs.
Then the probability of the event A happening in the long run is given by:

$P(A) = \dfrac{\text{number of times event } A \text{ has occurred}}{\text{total number of observations}} = \dfrac{k}{n}$

In a given frequency distribution, the probability of an event E being in a given class is
$P(E) = \dfrac{\text{frequency of the class}}{\text{total frequency in the distribution}}$

Example 2: The National Center for Health Statistics reported that of every 539 deaths in recent years, 24
resulted from automobile accidents, 182 from cancer, and 323 from other diseases. What is the probability
that a particular death is due to an automobile accident?

3. Axiomatic approach
Let E be an experiment and S be a sample space associated with E. With each event A we associate a real number,
called the probability of A, that satisfies the following properties, called the axioms (or postulates) of probability.
1. P(A) ≥ 0
2. P(S) = 1, S is the sure event
3. If A and B are mutually exclusive events, the probability that one or the other occurs equals the sum of
the two probabilities, i.e. P(A ∪ B) = P(A) + P(B)
4. P(A’) = 1 - P(A)
5. 0 ≤ P(A) ≤ 1
6. P(ϕ) = 0, ϕ is the impossible event
Example 3: A fair die is thrown twice. Calculate the probability that the sum of the spots on the faces of the die that
turn up is divisible by 2 or 3.
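Because the 36 ordered outcomes of two throws are equally likely, Example 3 can be checked by enumeration. The sketch below also confirms the result with the addition rule P(A ∪ B) = P(A) + P(B) - P(A ∩ B), which applies because the two events are not mutually exclusive; the helper name `prob` is an assumption of this sketch.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of throwing a fair die twice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Classical probability: favorable outcomes over total outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_div_by_2 = prob(lambda o: sum(o) % 2 == 0)
p_div_by_3 = prob(lambda o: sum(o) % 3 == 0)
p_div_by_both = prob(lambda o: sum(o) % 6 == 0)
p_div_by_2_or_3 = prob(lambda o: sum(o) % 2 == 0 or sum(o) % 3 == 0)

print(p_div_by_2, p_div_by_3, p_div_by_both, p_div_by_2_or_3)
```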

5.6 conditional Probability and Independence


5.6.1. Conditional probability
If A and B are events, then the conditional probability of A given B means the probability of occurrence of A
when the event B has already happened.
It is denoted by P(A/B) and is defined by

$P(A/B) = \dfrac{P(A \cap B)}{P(B)}$, if P(B) ≠ 0
$P(B/A) = \dfrac{P(A \cap B)}{P(A)}$, if P(A) ≠ 0

$P(A \cap B) = P(A)\,P(B/A) = P(B)\,P(A/B)$

5.6.2 Multiplication law of probability


If A and B are events in a sample space S, then

P(A ∩ B) = P(A) P(B/A) if P(A) ≠ 0
         = P(B) P(A/B) if P(B) ≠ 0

where P(B/A) represents the conditional probability of B given A and P(A/B) represents the conditional
probability of A given B.
Note: the multiplication law of probability extends to n events A1, A2, …, An:
$P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1)\,P(A_2/A_1)\,P(A_3/A_1 \cap A_2) \cdots P(A_n/A_1 \cap A_2 \cap \dots \cap A_{n-1})$
Example4: Suppose that an office has 100 calculating machines. Some of them use electric power (E) while
others are manual (M), and some machines are new (N) while others are used (U). The table below gives
the number of machines in each category. A person enters the office, picks a machine at random and discovers that
it is new. What is the probability that it uses electric power?
E M Total
N 40 30 70
U 20 10 30
Total 60 40 100
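Example4 asks for P(E/N), which follows directly from the definition of conditional probability applied to the table. A minimal sketch, in which the dictionary layout is simply one convenient way to hold the counts:

```python
from fractions import Fraction

# Machine counts from the table, keyed by (condition, power source)
counts = {
    ("N", "E"): 40, ("N", "M"): 30,
    ("U", "E"): 20, ("U", "M"): 10,
}
total = sum(counts.values())

p_new = Fraction(counts[("N", "E")] + counts[("N", "M")], total)   # P(N)
p_new_and_electric = Fraction(counts[("N", "E")], total)           # P(E ∩ N)

# P(E/N) = P(E ∩ N) / P(N)
p_electric_given_new = p_new_and_electric / p_new
print(p_electric_given_new)
```

The same answer is obtained by restricting attention to the 70 new machines, of which 40 are electric.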
Example5: Let A and B be events such that P(A ∪ B) = 3/4, P(A ∩ B) = 1/4 and P(A’) = 2/3. Find P(A’/B).
Probability of independent events
Two events A and B are said to be independent if the occurrence of A has no bearing on the occurrence of B; that
is, knowing that A has occurred gives no information about the occurrence of B. Formally, two events A and B are
independent if P(A ∩ B) = P(A) P(B)
Example 6: A box contains 4 black and 6 white balls. What is the probability of getting two black balls in
drawing one after the other under the following conditions?
a) The first ball drawn is not replaced
b) The first ball drawn is replaced

Chapter 6: Probability Distribution

6.1. Introduction

This unit introduces the concept of a probability distribution and shows how the various basic probability
distributions (Binomial, Poisson, and Normal) are constructed.

All these probability distributions have immensely useful applications and explain a wide variety of real life
situations which call for computation of desired probabilities.

6.2. Concept of random variable (r.v):-


Variable: - is any characteristic or attribute that can assume different values.
A random variable (r.v):- is a variable whose values are determined by chance.
- It is a function which associates a number (real number) to each possible outcome of an experiment. It is
often the case that our primary interest is in the numerical value of the random variable rather than the
outcome itself. The following examples will help us make this idea clear.
E.g.1: suppose a coin is tossed three times. Let X be the number of heads.
Solution: If we toss a coin three times, then the experiment has a total of eight possible outcomes, and they are
as follows: S= {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}. Since X is the characteristic, which denotes
the number of heads out of the three tosses, a r.v, X is associated with each outcome of this experiment.
Therefore, X is a function defined on the elements of S and the possible values of X are {0, 1, 2, and 3}.

Specifically X(HHH)=3, X(HHT)=X(HTH)=X(THH)=2, X(HTT)=X(THT)=X(TTH)=1, X(TTT)=0

Discrete random variable – let x be a r.v. If the number of possible values of x is finite or countably infinite,
we call x a discrete r.v
- The possible values of x can be listed as x1, x2, x3, …, xn, …
- Let x be a discrete r.v. With each possible value xi we associate a number P(xi) = P(X = xi), called the
probability of xi. The numbers P(xi) must satisfy the following
Requirements for a probability distribution

1. The sum of the probabilities of all the events in the sample space must be equal
to 1, i.e. $\sum P(x) = 1$
2. The probability of each event in the sample space must be between 0 and 1
inclusive, i.e. 0 ≤ P(xi) ≤ 1.

- The collection of pairs (xi, P(xi)) is called a discrete probability distribution


E.g.1: Tossing a coin twice. Let x be the number of heads. Construct a probability distribution.

Continuous random variable – x is continuous if it assumes all values in some interval (c, d), where c, d ∈ R,
and there exists a function f, called the probability density function (pdf) of x, satisfying the following conditions.
a. $f(x) \ge 0$ for all x
b. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
c. For any a and b with -∞ < a < b < ∞, we have
$P(a < x < b) = \int_a^b f(x)\,dx$
Remark: a. P(x = a) = 0
b. P(a < x < b) = P(a ≤ x ≤ b) = P(a < x ≤ b) = P(a ≤ x < b)

Example1. Determine whether each distribution is a probability distribution

a. x      0    1    2    3
   P(x)   ¼    ¼    ¼    ¼

b. x      0    1    2    3
   P(x)   -1   ½    ¼    ¼

c. x      1    2    3    4
   P(x)   ¼    ¼    ½    ¼
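The two requirements for a discrete probability distribution translate directly into a check. The sketch below applies it to the three tables of this example; the function name is hypothetical, and exact fractions are used so that the sum can be compared with 1 without rounding error.

```python
from fractions import Fraction as F

def is_probability_distribution(probabilities):
    """Check both requirements: each P(x) lies in [0, 1] and the total is 1."""
    in_range = all(0 <= p <= 1 for p in probabilities)
    return in_range and sum(probabilities) == 1

table_a = [F(1, 4), F(1, 4), F(1, 4), F(1, 4)]
table_b = [F(-1), F(1, 2), F(1, 4), F(1, 4)]
table_c = [F(1, 4), F(1, 4), F(1, 2), F(1, 4)]

for name, table in [("a", table_a), ("b", table_b), ("c", table_c)]:
    print(name, is_probability_distribution(table), sum(table))
```

Table (a) passes both requirements; table (b) fails because it contains a negative probability; table (c) fails because its probabilities add up to 5/4.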

E.g.3: Construct a probability distribution for rolling a single die.

A probability distribution can be shown graphically by representing the values of x on the
x-axis and the probabilities p(x) on the y-axis.

Exercise: construct a probability distribution for the number of girls in a family with two
children.

6.2. Mean, Variance and Expectation of R.v

Defn:- let x be a discrete r.v with possible values x1, x2, x3, …, xn, … and probabilities P(x1), P(x2), …, P(xn), …
respectively. Then the expected value of x, or the mean value of x, denoted by E(x) or μx, is defined
as
$\mu_x = E(x) = \sum_{i=1}^{\infty} x_i P(x_i)$
- If x assumes a finite number of values
$\mu_x = E(x) = \sum_{i=1}^{n} x_i P(x_i)$
- If all outcomes are equally likely
$E(x) = \dfrac{1}{n} \sum_{i=1}^{n} x_i$
E.g.4: In a family with two children, find the mean of the number of children who will be girls.

E.g.5: One thousand tickets are sold at $1 each for a color television valued at $350. What is the expected
value of the gain if a person purchases one ticket?

Defn:- let x be a r.v. The variance of x, denoted by Var(x) or $\sigma_x^2$, is defined as

$Var(x) = \sigma_x^2 = E[(x - E(x))^2]$ or
$Var(x) = \sigma_x^2 = \sum x^2 P(x) - \mu_x^2$

- The standard deviation is the square root of the variance:

$\sigma_x = \sqrt{Var(x)} = \sqrt{\sum x^2 P(x) - \mu_x^2}$

Note:- The variance and standard deviation cannot be negative
E(ax + b) = a E(x) + b, V(ax + b) = a² V(x)
E.g.6: Find the mean and variance of the number of spots that appear when a die is rolled
X 1 2 3 4 5 6
P(x) 1/6 1/6 1/6 1/6 1/6 1/6
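The mean and variance in E.g.6 follow from the two formulas above. The sketch below applies them to the die; the function name `mean_and_variance` is hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def mean_and_variance(distribution):
    """Return (E(x), Var(x)) for a dict mapping each value x to P(x)."""
    mean = sum(x * p for x, p in distribution.items())
    variance = sum(x ** 2 * p for x, p in distribution.items()) - mean ** 2
    return mean, variance

# Fair die: each face 1..6 has probability 1/6
die = {face: Fraction(1, 6) for face in range(1, 7)}
mu, var = mean_and_variance(die)

print(mu, var, float(var))
```

The same function can be applied to any table of values and probabilities, provided the probabilities satisfy the requirements for a probability distribution.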

Exercise:- Calculate the mean and variance of the following distribution

x      2      3      4      5      6      7
p(x)   1/12   2/12   3/12   3/12   2/12   1/12

6.3. Common Discrete Distributions

1. The Binomial Distributions


A binomial experiment is a probability experiment that satisfies the following four requirements
1. There must be a fixed number of trials (called n)
2. Each trial can have only two outcomes, or outcomes that can be reduced to two outcomes. These
outcomes can be considered either success or failure.
3. The outcomes of the trials must be independent of each other.
4. The probability of success must remain the same for each trial.
The outcomes of a binomial experiment and the corresponding probabilities of these outcomes are called a
binomial distribution.
If x is binomially distributed with number of trials n and probability of success p, written as x ~ Bin(n, p), then
the probability of x successes in n trials is given by
$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, where x = 0, 1, …, n
n - The number of trials
x - The number of successes in n trials
p - The probability of success
q - The probability of failure ⇒ q = 1 - p
E.g.7: A fair coin is tossed 3 times
a. What is the probability of obtaining exactly 2 heads?
b. What is the probability of obtaining 2 heads or more?
c. What is the probability of obtaining less than 3 heads?
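The three parts of E.g.7 can be worked out with a direct implementation of the binomial formula. In this sketch the function name `binomial_pmf` is hypothetical; `math.comb` supplies the binomial coefficient.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 3, 0.5   # three tosses of a fair coin, success = head

exactly_two = binomial_pmf(2, n, p)                                # part (a)
two_or_more = sum(binomial_pmf(x, n, p) for x in range(2, n + 1))  # part (b)
fewer_than_three = sum(binomial_pmf(x, n, p) for x in range(0, 3)) # part (c)

print(exactly_two, two_or_more, fewer_than_three)
```

Part (c) can equally be obtained from the complement, as 1 - P(X = 3).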

E.g.8: In a certain developing country 30% of the children are undernourished. In a random sample of 25
children from this country, what is the probability that the number of undernourished children will be
a) Exactly 10?
b) Less than 5?
c) 5 or more?
d) Between 3 and 5 inclusive?
e) Less than 7, but more than 4?

E.g.9: Suppose that the numbers of female and male students in this class are 50 and 50 respectively.
Suppose a random sample of 5 students is taken from this class. What is the probability that
a. All of them are male students?
b. 2 of them are female students?
c. At least 2 of them are female students?

The mean, variance and standard deviation of a binomial distribution can be found by using the following
formulas
Mean = μ = n.P
Variance = σ2 = n.P.q
s.d = σ =√n. P. q
E.g.: A die is rolled 480 times. Find the mean, variance and s.d of the number of 2s that will be rolled.
Solution: This is a binomial situation, where getting a 2 is a success and not getting a 2 is a failure,
n = 480, p = 1/6, q = 5/6
μ = n·p = 480 × 1/6 = 80
σ² = n·p·q = 480 × 1/6 × 5/6 = 66.7
σ = √(n·p·q) = √66.7 = 8.2
On average, there will be eighty 2s, with a s.d of 8.2
2. The Poisson Distribution
The Poisson distribution is used to model situations where the random variable x is the number of occurrences of
a particular event over a given period of time (or region of space). Together with this property, the following
conditions must also be fulfilled.
• Events are independent of each other
• Events occur singly
• Events occur at a constant rate (in other words, for a given time interval the mean number of
occurrences is proportional to the length of the interval)
The Poisson distribution is used as a distribution of rare events, such as
➢ Number of telephone calls made to a switchboard in a given minute
➢ Number of misprints on a page
➢ Number of bacteria per slide
➢ Number of road accidents on a particular motorway in one day
➢ Number of natural hazards per year, etc.
The processes that give rise to such events are called Poisson processes.
The probability that the number of occurrences, x, over a given period of time is equal to k is
$P(x = k) = \dfrac{e^{-\lambda} \lambda^k}{k!}$, k = 0, 1, 2, …, e = 2.7182818…

where λ is the average or mean number of events that occur in the given interval
X ~ Poi(λ)
The mean and variance of a Poisson distribution are given by
μ = λ
σ² = λ
E.g.10: If 1.6 accidents can be expected on a particular motorway on any given day, what is the probability that
there will be 3 accidents on any given day?

E.g.11: If a bank receives on average 6 bad checks per day, what is the probability that it will receive 4 bad
checks on any given day?
E.g.12: Suppose that bank customers arrive randomly and independently on weekday afternoons at an
average of 3.2 customers every 4 minutes. What is the probability that
a. Exactly 2 customers arrive in a 4-minute interval on a weekday afternoon?
b. At most 1 customer arrives in a 1-minute interval?
c. Two customers will arrive in an 8-minute interval?
d. One or more customers will arrive in a 12-minute interval?
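In E.g.12 the key step is rescaling λ to the length of each interval, since the mean number of occurrences is proportional to the interval length. The sketch below illustrates this together with E.g.10; it assumes that "two customers" in part (c) means exactly two, and the function name `poisson_pmf` is hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poi(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

# E.g.10: mean of 1.6 accidents per day, probability of exactly 3
p_three_accidents = poisson_pmf(3, 1.6)

# E.g.12: 3.2 customers per 4 minutes, i.e. 0.8 customers per minute
rate_per_minute = 3.2 / 4

p_a = poisson_pmf(2, rate_per_minute * 4)                                      # exactly 2 in 4 min
p_b = poisson_pmf(0, rate_per_minute * 1) + poisson_pmf(1, rate_per_minute * 1)  # at most 1 in 1 min
p_c = poisson_pmf(2, rate_per_minute * 8)                                      # exactly 2 in 8 min
p_d = 1 - poisson_pmf(0, rate_per_minute * 12)                                 # one or more in 12 min

print(round(p_three_accidents, 4))
print(round(p_a, 4), round(p_b, 4), round(p_c, 4), round(p_d, 5))
```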

The Poisson approximation to the binomial

If X ~ Bin(n, p), where n is large and p is small such that λ = n·p is less than 5, then X can be approximated by
Poi(λ), where λ = n·p is the mean of the binomial distribution, and the formula for evaluating the probabilities is
$P(x = k) = \dfrac{\lambda^k e^{-\lambda}}{k!} = \dfrac{(np)^k e^{-np}}{k!}$

E.g.13: On a particular production line, the probability that an item is defective is 0.01. Using a suitable
approximation, find the probability that in a batch of 200 items
a. There are no defective items
b. There are exactly 5 defective items.
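To see how close the approximation is in E.g.13, the sketch below evaluates both the Poisson approximation with λ = np = 2 and the exact binomial probabilities. The function names are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Exact P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poi(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

n, p = 200, 0.01
lam = n * p   # mean of the binomial, used as the Poisson parameter

for k in (0, 5):
    approx = poisson_pmf(k, lam)
    exact = binomial_pmf(k, n, p)
    print(k, round(approx, 4), round(exact, 4))
```

For both parts the approximate and exact values agree to about two decimal places, which is why the approximation is considered suitable when n is large and p is small.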
6.4. Common continuous distributions
Normal distribution: is a continuous, symmetric, bell shaped distribution of a variable.

Properties of normal distribution


1. A normal distribution curve is bell shaped.
2. The mean, median, and mode are equal and are located at the center of the distribution.
3. A normal distribution curve is unimodal (i.e. it has only one mode).
4. The curve is symmetric about the mean, which is equivalent to saying that its shape is the same on both
sides of a vertical line passing through the center.
5. The curve is continuous, that is, there is no gap or holes. For each value of x, there is a corresponding
value of y.
6. The curve never touches the x axis. Theoretically, no matter how far the curve extends in either
direction, it never meets the x axis, but it gets increasingly closer.
7. The total area under a normal distribution curve is equal to 1.00 or 100%. This fact may seem unusual,
since the curve never touches the x axis, but one can prove it mathematically by using calculus.
8. The area under the part of a normal curve that lies within 1 s.ds of the mean is approximately 0.68, or
68%; within 2 s.ds, about 0.95, or 95%; and within 3 s.ds, about 0.997, or 99.7%.

The probability density function of a normal distribution with mean μ and variance σ² is given by
$f(x) = \dfrac{e^{-(x-\mu)^2 / (2\sigma^2)}}{\sigma\sqrt{2\pi}}$, -∞ < x < ∞, -∞ < μ < ∞, 0 < σ² < ∞

The probability that a normally distributed value x lies between two numbers a and b is given by

$P(a < x < b) = \int_a^b f(x)\,dx$
But this definite integral is tedious to compute. To overcome this problem we standardize the
value and use the table of the standard normal distribution to compute the probabilities.
The standard normal distribution is a normal distribution with a mean of 0 (zero) and a standard deviation of 1.
The formula for the standard normal distribution is
$f(z) = \dfrac{e^{-z^2/2}}{\sqrt{2\pi}}$

All normally distributed variables can be transformed into the standard normally distributed variable by using
the formula for the standard score:
$Z = \dfrac{\text{value} - \text{mean}}{\text{s.d}}$ or $Z = \dfrac{X - \mu}{\sigma}$

The probability that any value x lies between two values a and b is given by the corresponding area under the
standard normal distribution.

Procedure to find the area under the standard normal distribution curve
1. Between 0 and any Z value: look up the Z value in the table to get the area.
2. In any tail:
a. Look up the Z value in the table to get the area.
b. Subtract the area from 0.5
3. Between two z values on the same side of the mean:
a. Look up both Z values to get the areas.
b. Subtract the smaller area from the larger area.
4. Between two z values on opposite sides of the mean:
a. Look up both Z values to get the areas.
b. Add the areas.
5. To the left of any z value, where z is greater than the mean:
a. Look up the Z value to get the area.
b. Add 0.5 to the area.
6. To the right of any z value, where z is less than the mean:
a. Look up the Z value to get the area.
b. Add 0.5 to the area.
7. In any two tails:
a. Look up the Z values in the table to get the areas.
b. Subtract both areas from 0.5.
c. Add the answers.

Procedure
1. Draw the picture.
2. Shade the area desired.
3. Find the correct figure.
4. Follow the direction.
Note: The table gives the areas between 0 and any z value to the right of 0, and all areas are positive.
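- Then calculate the value of Z using
$Z = \dfrac{x - \mu}{\sigma}$, i.e. Z ~ N(0, 1), with mean 0 and standard deviation 1

- Given a normally distributed r.v x with mean μ and standard deviation σ, the probability that any value x lies
between two values a and b is given by
$P(a < x < b) = P\left(\dfrac{a - \mu}{\sigma} < \dfrac{x - \mu}{\sigma} < \dfrac{b - \mu}{\sigma}\right) = P\left(\dfrac{a - \mu}{\sigma} < Z < \dfrac{b - \mu}{\sigma}\right)$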

E.g.14: Find the area under the standard normal distribution which lies
a. Between Z = 0 and Z = 0.96
P (0<Z<0.96) = 0.3315
b. Between Z = -1.45 and Z = 0
P (-1.45<Z<0) = P (0<Z<1.45), because of symmetry
= 0.4265
c. To the right of Z = -0.35
P (Z>-0.35) = P (-0.35<Z<0) + P (Z>0)
= P (0<Z<0.35) + 0.5
= 0.1368 + 0.5
= 0.6368
d. To the left of Z = -0.35
P (Z<-0.35) = 1 - P (Z ≥ -0.35)
= 1 - 0.6368
= 0.3632

E.g.15: Find the area under the standard normal curve which lies
a. Between Z = -0.67 and Z = 0.75
P (-0.67<Z<0.75) = P (-0.67<Z<0) + P (0<Z<0.75)
= P (0<Z<0.67) + P (0<Z<0.75), since P (-0.67<Z<0) = P (0<Z<0.67) by symmetry
= 0.2486 + 0.2734
= 0.522
b. Between Z = 2.13 and Z = 2.94
P (2.13<Z<2.94) = P (0<Z<2.94) - P (0<Z<2.13)
= 0.4984 - 0.4834 = 0.015

E.g.16: Find z if
a) The normal curve area between 0 and z (positive) is 0.4726.
P (0<Z<z) = 0.4726
z = 1.92
b) The normal curve area to the left of z is equal to 0.9868.
P (Z<z) = P (Z<0) + P (0<Z<z) = 0.5 + P (0<Z<z) = 0.9868
P (0<Z<z) = 0.9868 - 0.5 = 0.4868
z = 2.22 from the normal table
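The table look-ups in E.g.14 to E.g.16 can be cross-checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), which provides the cumulative area to the left of a value (`cdf`) and its inverse (`inv_cdf`). The helper `area_between` is a hypothetical name used only in this sketch.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def area_between(a, b):
    """Area under the standard normal curve between z = a and z = b."""
    return Z.cdf(b) - Z.cdf(a)

# E.g.14
print(round(area_between(0, 0.96), 4))      # between 0 and 0.96
print(round(area_between(-1.45, 0), 4))     # between -1.45 and 0
print(round(1 - Z.cdf(-0.35), 4))           # to the right of -0.35
print(round(Z.cdf(-0.35), 4))               # to the left of -0.35

# E.g.15
print(round(area_between(-0.67, 0.75), 4))
print(round(area_between(2.13, 2.94), 4))

# E.g.16: find z from a given area
z_a = Z.inv_cdf(0.5 + 0.4726)   # area between 0 and z is 0.4726
z_b = Z.inv_cdf(0.9868)         # area to the left of z is 0.9868
print(round(z_a, 2), round(z_b, 2))
```

Values computed this way can differ from the table-based answers in the last decimal place, because the table entries are themselves rounded to four decimals.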

E.g.17: A random variable x has a normal distribution with mean 80 and standard deviation 4.8. What is the
probability that it will take a value
a. Less than 87.2?
b.
c. Greater than 76.4?
d. Between 81.2 and 86.0?
E.g.18: A normal distribution has mean 62.4. Find its standard deviation if 20.05% of the area under the normal curve lies to the right of 72.9.

E.g.19: A random variable has a normal distribution with standard deviation 5. Find its mean if the probability that the random variable will assume a value less than 52.5 is 0.6915.

Jimma University Department of Statistics
CHAPTER SEVEN

7. SAMPLING AND SAMPLING DISTRIBUTION OF SAMPLE MEAN

7.1. Basic concepts


• Sampling: a technique used to draw inferences about the whole population by observing or measuring only a few of its sampling units.
• Sampling unit: one of the possible members of the population that could be included in the sample, i.e. an element of the population (a person or object) on which observations can be made or from which information is obtained.
• Sampling error: the error introduced by the selection of the sample, i.e. the discrepancy between a population parameter and its estimate derived from a random sample.
• Census: an investigation that covers every individual unit of the population.

Reason for sampling

Sampling is common and important in statistics. Some of the major reasons why sampling is necessary are:
1. Sampling saves money, labor and time. The cost of obtaining information through a sample is much lower than the cost of obtaining it through a census.
2. Sampling is the only option in some specialized areas. In the medical sciences, highly trained personnel and specialized equipment are needed. In quality control, observation or experimentation can be destructive, for example when testing the average lifetime of bulbs or the quality of wine or beer. In such areas sampling is the only feasible option for the study.
3. If the population is too large (or infinite) to cover completely, sampling is the only way to study it.

Types of sampling techniques

We can classify these techniques in two groups. Namely


A. Probability sampling
B. Non-probability sampling

A. Probability sampling (random sampling)

Abiyot Negash
A sample is selected in such a way that each item or person in the population being studied has a known, non-zero likelihood of being included in the sample. The main advantage of probability sampling is that one gets estimates that are unbiased and have measurable precision. There are four methods of probability sampling.
i. Simple random sampling
ii. Stratified sampling
iii. Cluster sampling
iv. Systematic sampling
Simple random sampling: the sampling procedure in which each item or person in the population has the same chance of being included in the sample. The elements may be selected using the lottery method or a random number table.
Systematic sampling: the sampling procedure in which every subject of the population is numbered and every Kth element is selected. Suppose the N units in the population are numbered 1 to N. To select a sample of n units, we take a unit at random from the first K units and every Kth unit thereafter. For instance, if K is 10 and the first unit drawn is number 8, the selected units are numbers 8, 18, 28, 38, 48, 58 and so on. The constant K is usually approximated by $K = N/n$, where N is the population size and n is the sample size.
Example: Let N = 150 and n = 10, where N and n are the population size and the sample size respectively. Then K can be approximated by $K = \frac{N}{n} = \frac{150}{10} = 15$, and we select one of the first 15 elements at random.
If 1 is the selected number, then the selected elements are 1, 16, 31, 46, 61, 76, 91, 106, 121, 136.
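The systematic sampling rule can be sketched in a few lines of Python. This is only an illustration, assuming K is taken as the integer part of N/n; the function name `systematic_sample` and its `start` parameter are hypothetical and are used here so that the worked example can be reproduced with a fixed starting unit.

```python
import random


def systematic_sample(N, n, start=None):
    """Return the unit numbers chosen by systematic sampling from units 1..N."""
    k = N // n                          # sampling interval K, approximated by N/n
    if start is None:
        start = random.randint(1, k)    # random start among the first K units
    # take the starting unit and every K-th unit thereafter
    return [start + i * k for i in range(n)]


print(systematic_sample(150, 10, start=1))
# [1, 16, 31, 46, 61, 76, 91, 106, 121, 136]
```

Calling `systematic_sample(150, 10)` without `start` draws the first unit at random, as the procedure in the text requires.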
Stratified sampling: a type of sampling in which the population is first divided into groups, called strata, and a sample is selected from each stratum. The elements in a stratum are supposed to be homogeneous with respect to a given characteristic, but to differ in that characteristic from the elements in the other strata.
Cluster sampling: a type of sampling in which the population is divided into groups (clusters) of several elements that occur together naturally or artificially. The listing units that make up a cluster may be associated geographically, temporally or spatially. The clusters themselves can then be selected by using simple random sampling.

B. non-probability sampling

It is a procedure that relies only on convenience and personal judgment, so not all elements have a chance of being included in the sample. The elements are selected based on the researcher's subjective knowledge about them. The most common types of non-probability sampling are convenience, judgmental and quota sampling.

Convenience sampling: members of the population are chosen based on their relative ease of access.
Example: sampling friends or co-workers.
Judgmental sampling or purposive sampling: the researcher chooses the sample based on who they think would be appropriate for the study. This is used primarily when there is a limited number of people who have expertise in the area being researched.
Quota sampling: a quota is established (say 30% men) and researchers are free to choose any respondents they wish as long as the quota is met.

7.2. Sampling distribution of sample mean

Before we define the sampling distribution of the mean, let us define the following terms.
Sampling with replacement: the process of selecting items one by one, replacing each selected item before the next selection. If we have a population of size N and a sample of size n, then there are $N^n$ possible samples of size n.
Sampling without replacement: the process of selecting items one by one without replacing the selected items. If we have a population of size N and a sample of size n, we have N combination n, $\binom{N}{n}$ (NCn), different samples of size n from the population of size N.
The sampling distribution of the sample mean is the probability distribution of all possible sample means obtained from samples of the same size drawn from the same population.

Properties of sampling distribution of sample mean

1. The mean of the sample means, denoted by $\mu_{\bar{x}}$, will be the same as the population mean ($\mu$). That is, $\mu_{\bar{x}} = \mu$.

2. The standard deviation of the sampling distribution of sample means ($\sigma_{\bar{x}}$) will be smaller than the standard deviation of the population ($\sigma$) and is calculated by

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ ……….. if sampling is with replacement.

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$ …… if sampling is without replacement.
Example: Consider a population of size N = 3 consisting of the values 1, 2 and 6, and samples of size n = 2 drawn with replacement. Find
a. sampling distribution of sample mean

b. the mean of the population
c. the standard deviation of the population
d. the mean of the sampling distribution of sample means
e. the standard deviation of sampling distribution of sample means
f. What can you say about b and d, as well as c and e?
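One way to check answers to this exercise is to list every possible sample by brute force. The sketch below is an assumption-laden illustration rather than the notes' own solution: it treats samples as ordered pairs drawn with replacement (so there are $N^n = 9$ of them) and uses population formulas (`pstdev`) for both standard deviations.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

population = [1, 2, 6]
n = 2

# all N**n ordered samples of size n drawn with replacement
samples = list(product(population, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(population)                 # population mean
sigma = pstdev(population)            # population standard deviation
mu_xbar = mean(sample_means)          # mean of the sampling distribution
sigma_xbar = pstdev(sample_means)     # standard deviation of the sampling distribution

print(len(samples))
print(sorted(sample_means))
print(mu, mu_xbar)
print(round(sigma, 4), round(sigma_xbar, 4), round(sigma / sqrt(n), 4))
```

The last line lets the reader compare $\sigma_{\bar{x}}$ obtained by enumeration with $\sigma/\sqrt{n}$ from property 2.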

7.3. Sampling distribution of sample proportion

A proportion is the fraction of a sample or population possessing a characteristic of interest. The sample proportion is denoted by $\hat{p}$ and the population proportion by $\pi$. Suppose that in a sample of size n there are x items possessing the required characteristic. The sample proportion is given by

$$\hat{p} = \frac{\text{number of items with the characteristic of interest}}{\text{sample size}} = \frac{x}{n}$$

Example: In a class of 200 students, 130 are male. Find the proportion of female students.
Solution:
Let x = number of females = n - 130 = 70
n = number of students in the class = 200
$$\hat{p} = \frac{\text{number of females}}{\text{number of students}} = \frac{x}{n} = \frac{70}{200} = 0.35$$

Properties of sampling distribution of sample proportion

A. The mean of the sample proportions, denoted by $\mu_{\hat{p}}$, is equal to the population proportion ($\pi$):

$\mu_{\hat{p}} = \frac{\sum \hat{p}_i}{\binom{N}{n}}$ …….. sampling without replacement

$\mu_{\hat{p}} = \frac{\sum \hat{p}_i}{N^n}$ ……….. sampling with replacement

B. The standard deviation of the sampling distribution of the sample proportion is

$\sigma_{\hat{p}} = \sqrt{\frac{\pi Q}{n}}$ …… sampling with replacement, where $\pi$ = population proportion and $Q = 1 - \pi$.

$\sigma_{\hat{p}} = \sqrt{\frac{\pi Q}{n}} \sqrt{\frac{N-n}{N-1}}$ ……. sampling without replacement

Example: Suppose that an urn contains three balls, of which two are red and the rest black. If we take a random sample of two balls with replacement, find
a. The sampling distribution of the sample proportion
b. The proportion of red balls
c. The mean of the sampling distribution of the sample proportion of red balls
d. The standard deviation of the sampling distribution of the sample proportion

If a random sample of size n is selected from a population of size N with population proportion $\pi$, then as n goes to infinity the distribution of $\hat{p}$ will be normal with mean $\pi$ and variance $\frac{\pi Q}{n}$, i.e. $\hat{p} \sim N\left(\pi, \frac{\pi Q}{n}\right)$.

The standard normal variable of the sample proportion is calculated by
$$Z = \frac{\hat{p} - \pi}{\sqrt{\pi Q / n}}$$

Example: 40% of the students in a college are business majors. A random sample of 90 students was selected.
a. Calculate the mean and standard error of the sample proportion of business majors.
b. What is the probability that the sample proportion of business majors will be greater than 0.35?
Solution: a) Given n = 90, $\pi = 40\% = 0.4$, $Q = 1 - \pi = 1 - 0.4 = 0.6$.
Since the population proportion and the mean of the sampling distribution of sample proportions are equal, the mean of the sample proportions is $\mu_{\hat{p}} = 0.4$, and the standard deviation of the sample proportion is

$$\sigma_{\hat{p}} = \sqrt{\frac{\pi Q}{n}} = \sqrt{\frac{0.4 \times 0.6}{90}} = 0.052$$

b) $P(\hat{p} > 0.35) = P\left(\frac{\hat{p} - \pi}{\sqrt{\pi Q / n}} > \frac{0.35 - 0.4}{0.052}\right) = P(Z > -0.96)$
$= P(-0.96 < Z < 0) + P(Z > 0)$
$= P(0 < Z < 0.96) + P(Z > 0)$
$= 0.3315 + 0.5$
$= 0.8315$
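The same calculation can be sketched in Python as a cross-check. This is an illustration under the normal approximation described above; the function name `proportion_sampling` is hypothetical. Because the code keeps the unrounded standard error instead of 0.052, the probability it prints differs slightly in the third decimal place from the table-based 0.8315.

```python
from math import sqrt
from statistics import NormalDist


def proportion_sampling(pi, n):
    """Mean and standard error of the sample proportion (sampling with replacement)."""
    return pi, sqrt(pi * (1 - pi) / n)


mean_p, se_p = proportion_sampling(0.4, 90)

# P(p_hat > 0.35) under the normal approximation p_hat ~ N(pi, pi*Q/n)
prob = 1 - NormalDist(mean_p, se_p).cdf(0.35)

print(round(se_p, 3))
print(round(prob, 3))
```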
7.4. Central limit theorem

If a random sample of size n is selected from a population of size N having mean $\mu$ and variance $\sigma^2$, then, as n goes to infinity, the distribution of the sample mean will be normal with mean $\mu$ and variance $\sigma^2/n$, that is, $\bar{x} \sim N(\mu, \sigma^2/n)$.

Suppose that x assumes n values $X_1, X_2, X_3, \ldots, X_n$. The ratio $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ is called a standard normal variable,

where $\bar{x}$ = sample mean,

$\mu$ = population mean,

$\sigma/\sqrt{n}$ = standard deviation of $\bar{x}$.

Example: The average age of a vehicle registered in the U.S. is 8 years, or 96 months. Assume that the standard deviation is 16 months. If a random sample of 36 cars is selected, find the probability that the mean of their ages is between 90 and 100 months.
Solution

Given n = 36, $\sigma = 16$, $\mu = 96$; $P(90 < \bar{x} < 100) = ?$

$$P(90 < \bar{x} < 100) = P\left(\frac{90 - 96}{16/\sqrt{36}} < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < \frac{100 - 96}{16/\sqrt{36}}\right) = P(-2.25 < Z < 1.5)$$

$= P(-2.25 < Z < 0) + P(0 < Z < 1.5)$
$= 0.4878 + 0.4332$
$= 0.921$
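As a hedged cross-check of this example, the sampling distribution of the mean can be modelled directly as $N(\mu, \sigma^2/n)$ with the standard library. The variable names below are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 96, 16, 36
se = sigma / sqrt(n)                 # standard deviation of the sample mean

z_low = (90 - mu) / se               # standardised lower limit
z_high = (100 - mu) / se             # standardised upper limit

xbar_dist = NormalDist(mu, se)       # sampling distribution of the sample mean
prob = xbar_dist.cdf(100) - xbar_dist.cdf(90)

print(round(z_low, 2), round(z_high, 2))   # -2.25 1.5
print(round(prob, 3))                      # 0.921
```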


CHAPTER 8: ESTIMATION AND HYPOTHESIS TESTING

8.1 Point and Interval Estimation of the Mean

Statistical inference may be divided into two major areas: estimation and hypothesis testing.
What is estimation? Statistical estimation is the procedure of using a sample statistic to estimate a population parameter. The statistic used for this purpose is called an estimator, and the particular value taken by the estimator is called an estimate.
Statistical estimation has two components: point estimation and interval estimation.

a) point estimation of the mean

One important problem of statistical inference is the estimation of unknown population parameters from the corresponding sample statistics. Here, the parameter of interest is the population mean, $\mu$, which is to be estimated. We take a simple random sample of size n and get observations $x_1, x_2, \ldots, x_n$. Then the statistic

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$

is an estimator of the population mean $\mu$. The sample mean $\bar{X}$ is an unbiased estimator, since the mean of the sampling distribution of $\bar{X}$ is equal to the population mean: $E(\bar{X}) = \mu$.

Properties of good estimator of the mean:

✓ The estimator should be unbiased, i.e. $E(\bar{X}) = \mu$.
✓ The estimator must be consistent, i.e. as $n \to \infty$, $\bar{X} \to \mu$.
✓ The estimator must be relatively efficient, i.e. it should have the smallest variance among the competing estimators.
b) Interval estimation of the Mean

For the most part, the sample mean will be somewhat different from the population mean due to sampling error. Therefore we can ask the question "how good is a point estimate?" The answer is that a single point estimate gives no way of knowing how close the estimate is to the population mean. This places some doubt on the accuracy of point estimates, and to overcome this problem we use interval estimates; i.e. we can accompany a point estimate by an interval estimate.

An interval estimate of the population mean is a range of values used to estimate the population mean. When an interval estimate is made, a probability statement is attached to it. The confidence interval for the mean is a specific interval estimate of the population mean which is determined by using data obtained from a sample and a specific confidence level of the estimate.

The confidence level of an interval estimate of $\mu$ is the probability that the interval estimate will contain the parameter. That is, $P(a_L < \mu < a_U) = 1 - \alpha$, where $1 - \alpha$ is the confidence level and $a_L$ and $a_U$ are the lower and upper confidence limits respectively. The interval $a_L < \mu < a_U$ computed from the selected sample is called a (1-α)100% confidence interval. When we try to give an interval estimate for $\mu$ we need to consider several conditions on the sample size n and the population variance: whether the sample size is large or small (n ≥ 30 or n < 30), and whether the population variance is known or not.

I) For large sample case (n ≥ 30) and $\sigma^2$ is known

The error we make when we use a sample mean to estimate the population mean is given by $(\bar{X} - \mu)$. The sampling distribution of the mean is closely approximated by the normal distribution as long as n ≥ 30, i.e., $\bar{X} \sim N(\mu, \sigma^2/n)$. The sampling distribution of the mean $\bar{X}$ can be transformed into the standard normal distribution by

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

The magnitude of the error $(\bar{X} - \mu)$ is less than the maximum error E, where $E = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$. That is,

$\mu = \bar{X} \pm E$, so $-E \leq \bar{X} - \mu \leq E$, i.e. $-Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \leq \bar{X} - \mu \leq Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$, which gives

$$\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

$P\left\{\mu \in \left(\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)\right\} = 1 - \alpha$, and (1-α) is known as the level of confidence.

$\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ and $\bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ are known as the lower and upper confidence limits respectively. The interval $\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$ is called the (1-α)100% confidence interval (interval estimate) of $\mu$.

The most commonly used values of α are 0.1, 0.05 and 0.01, so that (1-α)100% gives the 90%, 95% and 99% interval estimates respectively.

Let α = 0.05; then (1-α) = 0.95 and (1-α)100% = 95%.

For α/2 = 0.025, $P(Z > Z_{\alpha/2}) = P(Z > Z_{0.025}) = 0.025$

$P(0 \leq Z \leq Z_{0.025}) = 0.5 - 0.025 = 0.475$, so $Z_{0.025} = 1.96$ and $-Z_{0.025} = -1.96$.

Example 1: The president of a certain university wishes to estimate the average age of the students currently enrolled. From past studies the standard deviation is known to be 2 years. A random sample of 50 students is selected and the mean is found to be 23.2 years. Find the 95% confidence interval for the population mean.

Given: $\sigma = 2$ ($\sigma$ known) and n = 50 (large sample case), with $\bar{X} = 23.2$.

95% confidence interval ⇒ (1-α) = 0.95, α = 0.05, α/2 = 0.025 and $Z_{\alpha/2} = 1.96$.

Therefore the 95% confidence interval for the population mean $\mu$ is given by

$$\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 23.2 \pm 1.96 \times \frac{2}{\sqrt{50}} = 23.2 \pm 0.55 = (22.65, 23.75) \Rightarrow 22.65 < \mu < 23.75$$

Conclusion: We are 95% confident that the true average age of the students in this university is contained between 22.65 years and 23.75 years.

Example 2: In a certain study, the sample mean is $\bar{X} = 18.85$, the sample size is n = 80 and the standard deviation is $\sigma = 5.55$. Construct the 90% confidence interval for $\mu$.

Given: n = 80 > 30 (large sample case), $\sigma = 5.55$ ($\sigma$ is known).

For the 90% confidence interval for $\mu$, we have (1-α) = 0.9, α = 0.1, α/2 = 0.05 and $Z_{\alpha/2} = 1.645$.

Then $\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 18.85 \pm 1.02 = (17.83, 19.87) \Rightarrow 17.83 < \mu < 19.87$

We are 90% sure that the true but unknown population mean is contained within the interval (17.83, 19.87).

Exercise: Construct the 95% and 99% confidence intervals for the above example and compare the results.
In general, one should note that as the level of confidence increases, the interval gets wider.
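The large-sample interval $\bar{X} \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$ and the widening of the interval with the confidence level can be illustrated with a short sketch. It assumes `statistics.NormalDist().inv_cdf` is used in place of the printed table to obtain $Z_{\alpha/2}$; the function name `z_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sd, n, confidence=0.95):
    """Confidence interval for the mean: xbar +/- z_{alpha/2} * sd / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_{alpha/2}
    margin = z * sd / sqrt(n)                 # maximum error E
    return xbar - margin, xbar + margin


# Example 2 data at three confidence levels
for level in (0.90, 0.95, 0.99):
    low, high = z_interval(18.85, 5.55, 80, level)
    print(level, round(low, 2), round(high, 2))
```

The same function applies to case II below if the sample standard deviation s is passed in place of $\sigma$.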

II) For large sample case and 𝝈 is unknown

The (1-α)100% confidence interval for $\mu$, when $\sigma$ is unknown and n ≥ 30, is given by $\bar{X} \pm Z_{\alpha/2}\frac{s}{\sqrt{n}}$, i.e., we substitute $\sigma$ by s, the sample standard deviation.

Example: A random sample of 49 female shoppers showed that they spent an average of birr 23.45 per visit with standard deviation s = 2.80 birr.

a) Find the 90% confidence interval for the true average expenditure.
b) If they spent an average of 18.6 minutes per visit with a standard deviation of 5 minutes, find the 90% confidence interval for the true mean time a female shopper spends in grocery shopping.

Solution: Given: n = 49 (large sample case)

a) Let $\mu_1$ be the true average expenditure of a female shopper per grocery visit.
The (1-α)100% confidence interval for $\mu_1$ is given by $\bar{X} \pm Z_{\alpha/2}\frac{s}{\sqrt{n}}$. For the 90% confidence interval, we have α = 0.1, α/2 = 0.05 ⟹ $Z_{\alpha/2} = 1.645$; then
$\bar{X} \pm Z_{\alpha/2}\frac{s}{\sqrt{n}} = 23.45 \pm 1.645 \times \frac{2.80}{\sqrt{49}} = 23.45 \pm 0.658 = (22.79, 24.11)$

We are 90% confident that the true average expenditure of a female shopper per visit is contained in the interval (22.79, 24.11).
b) Let the true mean time a female shopper spends in the grocery per visit be $\mu_2$. The respective point estimate is $\bar{X} = 18.6$ minutes. The 90% confidence interval will be
$\bar{X} \pm Z_{\alpha/2}\frac{s}{\sqrt{n}} = 18.6 \pm 1.645 \times \frac{5}{\sqrt{49}} = 18.6 \pm 1.175 = (17.42, 19.78)$

We are 90% confident that the true mean time is contained within the interval (17.42, 19.78) minutes.

iii) For small sample case and $\sigma$ is known

When the sample size is less than 30 the central limit theorem does not apply. But we can make one basic assumption: the parent population is normal, which means we are sampling from a normal population. If this assumption is met, then the sampling distribution of the mean is normal, and the quantity $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ will still have the standard normal distribution.

If $X \sim N(\mu, \sigma^2)$, $\sigma^2$ is known and n < 30, then $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$.

Example: The pulse rate of 12 patients increased on average by 22.33 beats per minute. From a previous study it is known that $\sigma$ for this population is 4.28. Construct the 99% confidence interval for the mean.
Solution: Given: n = 12 < 30, $\sigma = 4.28$, $\bar{X} = 22.33$, α = 0.01 ⟹ α/2 = 0.005, $Z_{0.005} = 2.575$
$\mu = \bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 22.33 \pm 2.575 \times \frac{4.28}{\sqrt{12}} = 22.33 \pm 3.18 \Rightarrow 19.15 < \mu < 25.51$

IV) For small sample case and when $\sigma$ is unknown
If n < 30 and we are sampling from a normal population whose variance is unknown, then the quantity $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ will have a t-distribution with (n-1) degrees of freedom, and the (1-α)100% confidence interval for the population mean will be given by $\mu = \bar{X} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$.
Example: The IQs of 16 students from a certain class showed a mean of 107 with a standard deviation of 10. Construct the 90% confidence interval for the mean.
Given: n = 16, $\bar{X} = 107$, s = 10, α = 0.1 ⟹ α/2 = 0.05, $t_{\alpha/2,\,n-1} = t_{0.05,\,15} = 1.753$

$\mu = \bar{X} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} = 107 \pm 1.753 \times \frac{10}{\sqrt{16}} = 107 \pm 4.3825 = (102.6, 111.4)$

8.2 point and interval estimation of the proportion: large sample size

Sometimes the need is to estimate a population proportion or percentage. The sample proportion $\hat{p}$ is a sample statistic, and it possesses a sampling distribution. We know that for large samples:
✓ The sampling distribution of $\hat{p}$ is approximately normal.
✓ The mean $\mu_{\hat{p}}$ of the sampling distribution of $\hat{p}$ is equal to the population proportion p.
✓ The standard deviation $\sigma_{\hat{p}}$ of the sampling distribution of the sample proportion $\hat{p}$ is given as $\sqrt{pq/n}$, where q = 1 - p.

The sample is considered to be large if np and nq are both greater than 5. When estimating the value of a population proportion, we do not know the values of p and q, so we cannot compute $\sigma_{\hat{p}}$. We use the value of $s_{\hat{p}}$ as an estimate of $\sigma_{\hat{p}}$, where $s_{\hat{p}}$ is calculated as

$s_{\hat{p}} = \sqrt{\hat{p}\hat{q}/n}$, with $\hat{q} = 1 - \hat{p}$. The value of the sample proportion $\hat{p}$ computed from a sample is a point estimate of the population proportion p.

Confidence Interval for the population proportion:


The (1-α)100% confidence interval for the population proportion p is $\hat{p} \pm Z s_{\hat{p}}$. The value of Z is obtained from the normal distribution table for the given confidence level.
Example 1: In a study of physicians' attitudes towards the use of deception to resolve ethical problems in medical practice, and of how much to tell or whether or not to tell the truth in various situations, 87% of 109 physicians said that deception is acceptable on rare occasions to benefit their patients. Find a 95% confidence interval for the proportion of all physicians who hold this view.
Solution: Let p be the population proportion and $\hat{p}$ the sample proportion.
Given: n = 109, $\hat{p} = 0.87$, $\hat{q} = 1 - \hat{p} = 1 - 0.87 = 0.13$; the confidence level is 95% or 0.95, so Z = 1.96.

Calculate $s_{\hat{p}} = \sqrt{\hat{p}\hat{q}/n} = \sqrt{0.87 \times 0.13 / 109} = 0.032$. The confidence interval for p is

$p = \hat{p} \pm Z s_{\hat{p}} = 0.87 \pm 1.96 \times 0.032 = 0.87 \pm 0.063 = (0.807, 0.933)$, or 80.7% to 93.3%.

Thus, we can state with 95% confidence that the population proportion is between 80.7% and 93.3%.

Example 2: A survey of voters was conducted and 52% said that they would prefer a candidate from party A. Assuming that the sample size for this study was 1500, construct a 99% confidence interval for the proportion of all voters who hold this view.
Solution: Let p be the proportion of all voters who prefer a candidate from party A and $\hat{p}$ the sample proportion. From the given information,

Given: n = 1500, $\hat{p} = 0.52$, $\hat{q} = 1 - \hat{p} = 1 - 0.52 = 0.48$; the confidence level is 99% or 0.99, so Z = 2.58.

Calculate $s_{\hat{p}} = \sqrt{\hat{p}\hat{q}/n} = \sqrt{0.52 \times 0.48 / 1500} = 0.013$. The confidence interval for p is

$p = \hat{p} \pm Z s_{\hat{p}} = 0.52 \pm 2.58 \times 0.013 = 0.52 \pm 0.034 = (0.486, 0.554)$, or 48.6% to 55.4%.
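The two proportion examples can be reproduced with a small helper. This is a sketch only: the name `proportion_interval` is hypothetical, and the critical value is computed with `statistics.NormalDist().inv_cdf` rather than read from the table, so the limits may differ from the hand calculation in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(p_hat, n, confidence=0.95):
    """Large-sample confidence interval for a proportion: p_hat +/- z * s_p."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    s_p = sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of p_hat
    return p_hat - z * s_p, p_hat + z * s_p


low, high = proportion_interval(0.87, 109, 0.95)    # Example 1
print(round(low, 3), round(high, 3))

low, high = proportion_interval(0.52, 1500, 0.99)   # Example 2
print(round(low, 3), round(high, 3))
```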

8.3 sample size determination


a) Sample size determination for the estimation of the mean

We have observed that $E = Z\frac{\sigma}{\sqrt{n}}$ is the maximum error of estimate for $\mu$. Suppose we predetermine the size of the maximum error E and want to determine the size of the sample that will yield this maximum error. Given the confidence level and the standard deviation of the population, the sample size that will produce a predetermined maximum error E of the confidence interval estimate of $\mu$ is

$$n = \frac{Z^2 \sigma^2}{E^2}$$

If we do not know $\sigma$, we find the sample standard deviation s and substitute s for $\sigma$ in the formula.

Example: A university dean wishes to estimate the average number of hours his part-time instructors teach per week. The standard deviation from a previous study is 2.6 hours. How large a sample must be selected if he wants to be 99% confident that the sample mean is within 1 hour of the true mean?

Solution: $Z_{\alpha/2} = 2.58$, E = 1, $\sigma = 2.6$

$$n = \frac{Z^2 \sigma^2}{E^2} = \frac{(2.58)^2 (2.6)^2}{1^2} = 44.997264 \approx 45$$

b) Sample size determination for the estimation of the proportion

The maximum error E of the interval estimate of the population proportion is $E = Z\sigma_{\hat{p}} = Z\sqrt{\frac{pq}{n}}$. Given the confidence level and the values of p and q, the sample size that will produce a predetermined maximum error E of the confidence interval estimate of p is

$$n = \frac{Z^2 pq}{E^2}$$

In most cases, the values of p and q are not known to us. In such a situation, we can choose one of the following alternatives.

✓ Use the most conservative estimate of the sample size n by using p = 0.5 and q = 0.5, since the product of these two is greater than the product of any other pair of values for p and q.
✓ Use a preliminary sample and calculate $\hat{p}$ and $\hat{q}$ for this sample. Then use these values of $\hat{p}$ and $\hat{q}$ to find n.

Example: The EZ Company wants to estimate the proportion of defective items produced by a machine to within 0.02 of the population proportion at a 95% confidence level. Suppose a preliminary sample of 200 items showed that 7 percent of the items produced on this machine are defective. How large a sample should the EZ Company select?

Solution: The value of Z for the 95% confidence level is 1.96, E = 0.02, $\hat{p} = 0.07$, $\hat{q} = 1 - \hat{p} = 1 - 0.07 = 0.93$

$$n = \frac{Z^2 \hat{p}\hat{q}}{E^2} = \frac{(1.96)^2 (0.07)(0.93)}{(0.02)^2} = 625.22 \approx 626$$

8.4 Hypothesis Testing about the Mean

Introduction:
To establish and investigate the relationship between a sample measure and the corresponding population characteristic (parameter), we make use of hypothesis testing.
Definition: hypothesis testing is a rule or procedure for determining whether or not an assertion or statement about a population parameter (in this case the mean) is true.
Suppose we have a certain sample taken from the population and the sample mean is $\bar{x}$. We set up an assertion that it came from a population with mean $\mu$. This implies that the discrepancy between $\bar{x}$ and $\mu$ is only due to chance; i.e. in the long run, repeated sampling will produce data which will result in a mean discrepancy between $\bar{x}$ and $\mu$ of zero.
We can try to determine the probability of getting a discrepancy between $\bar{x}$ and $\mu$ as large as or larger than the actual one. This can be done from knowledge of the sampling distribution of $\bar{x}$. This probability is compared with a threshold referred to as the level of significance. Then we can conclude that either the assertion (hypothesis) is true or the hypothesis is false.
Steps to perform hypothesis testing:
1. Write the original claim and identify whether it is the null hypothesis or the alternative hypothesis.
2. Write the null and alternative hypothesis .use the alternative hypothesis to identify the type of test.
3. Write down all information from the problem.
4. Compute the test statistic
5. Find the critical value using the tables
6. Make a decision to reject or fail to reject the null hypothesis. A picture showing the critical value and
test statistic may be useful.
7. Write the conclusion.

Every hypothesis testing procedure begins with the statement of a hypothesis.


A statistical hypothesis is a statement about a population parameter .this statement may or may not be true.
There are two types of statistical hypothesis for each situation namely the null hypothesis and the alternative
hypothesis.
Null hypothesis (H0): a claim about a population parameter that is assumed to be true until it is declared false. The null hypothesis always includes the equal sign (≥, ≤, =).
Alternative hypothesis (H1 or Ha): a claim about a population parameter that will be true if the null hypothesis is false.
In the process, a decision is made to reject or not reject H0 on the basis of the data obtained from a sample. In this process two types of errors can occur.
- Rejecting H0 when it is true: Type I error.
- Accepting H0 when it is false: Type II error.

Null hypothesis     Accept H0            Reject H0
H0 is true          Correct decision     Type I error
H0 is false         Type II error        Correct decision
A type I error occurs when a true null hypothesis is rejected.

P{type I error} = P{rejecting H0 when it is true} = α, where α is known as the level of significance and 1 - α is the probability of not rejecting H0 when it is true.
P{type II error} = P{not rejecting H0 when it is false} = β.
Probability value (p-value): the probability of getting results at least as extreme as those observed if the null hypothesis is true. If this probability is too small (smaller than the level of significance), then we reject the null hypothesis.
Types of tests concerning means:
a) Two-sided (two-tailed) test:
H0: µ = µ0   H1: µ ≠ µ0   at level α
We want to accept H0 when it is true with probability (1-α).
b) One-sided tests:
i) H0: µ = µ0   H1: µ < µ0   at level α (left-tailed test)
ii) H0: µ = µ0   H1: µ > µ0   at level α (right-tailed test)
Generalization: In carrying out any test,
❖ We formulate H0 so that P{Type I error} can be calculated.
❖ We formulate H1 in such a way that the rejection of H0 is equivalent to the acceptance of H1.
❖ We specify α, the level of significance.
❖ We set a criterion for testing H0 versus H1.
❖ We arrive at a decision (accept or reject H0) and a conclusion.

a) Large sample test :when σ is known


When n ≥ 30 and $\sigma$ is known, we use the standard normal distribution because of the central limit theorem; i.e., for large n, $\bar{x} \sim N(\mu, \sigma^2/n)$, so that $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$. Therefore the test statistic will be $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, where $\mu_0$ is the value of $\mu$ in H0 ($\mu_0$ is the hypothesized population mean).

Steps: H0: µ = µ0   H1: µ ≠ µ0 or µ < µ0 or µ > µ0, with given α.
Test statistic: the Z value is calculated as $Z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, where the sample results are $\bar{x}$ and n, $\sigma$ is known, and $\mu_0$ is given in H0. This statistic is to be compared with the critical value, which depends on H1 and the level of significance.
i) Two-sided test:
H0: µ = µ0   H1: µ ≠ µ0

$Z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ and $Z_{critical} = \pm Z_{\alpha/2}$, where $P(Z > Z_{\alpha/2}) = \alpha/2 \Rightarrow P(0 \leq Z \leq Z_{\alpha/2}) = 0.5 - \alpha/2$.

Because of symmetry, $P(Z < -Z_{\alpha/2}) = \alpha/2$. Therefore we find $\pm Z_{\alpha/2}$ from the table and then compare $Z_{cal}$ with $\pm Z_{\alpha/2}$ ($Z_{critical}$). If $Z_{cal} < -Z_{\alpha/2}$, we reject H0. If $Z_{cal} > Z_{\alpha/2}$, we reject H0.

If $-Z_{\alpha/2} < Z_{cal} < Z_{\alpha/2}$, we accept H0. This means that we accept H0 if $Z_{cal}$ falls in the acceptance region; we reject H0 if $Z_{cal}$ falls in the rejection regions.
If $Z_{cal} = Z_{critical}$, we reserve judgment: we have to increase the sample size in order to come to a conclusion of accepting or rejecting H0.

II) One-sided test:

✓ H0: µ = µ0   H1: µ < µ0   at level α
Reject H0 if $Z_{cal} < -Z_{\alpha}$; accept H0 if $Z_{cal} > -Z_{\alpha}$.
✓ H0: µ = µ0   H1: µ > µ0   at level α
Reject H0 if $Z_{cal} > Z_{\alpha}$; accept H0 if $Z_{cal} < Z_{\alpha}$, i.e. if it falls in the acceptance region.
a) Large sample tests: when $\sigma$ is unknown

When n ≥ 30 and $\sigma$ is unknown, $Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim N(0, 1)$.

Therefore the calculated value of the test statistic will be compared with $\pm Z_{\alpha/2}$, $-Z_{\alpha}$ or $Z_{\alpha}$, depending on whether we have a two-sided or a one-sided test respectively. We follow the same procedure as given above in (a).
Examples: 1) According to norms established for a reading comprehension test, 8th graders should have an average of 83.2 with a standard deviation of 8.6. If 36 randomly selected students from Tepi School averaged 88.7, test the null hypothesis that µ = 83.2 against µ > 83.2 at α = 0.01, and thus check the directress's claim that her 8th grade students are above average.

H0: µ = 83.2   H1: µ > 83.2 ⟹ one-sided test at α = 0.01

Given information: n = 36, $\bar{X} = 88.7$, $\sigma = 8.6$. We have the large sample case (n = 36 > 30) and $\sigma$ is known.
Therefore, the test statistic is $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ and $Z_{cal} = \frac{88.7 - 83.2}{8.6/\sqrt{36}} = 3.837 \approx 3.84$.

Since we have H1: µ > µ0 (one-sided test), $P(Z > Z_{0.01}) = 0.01$, so

$P(0 \leq Z \leq Z_{0.01}) = 0.5000 - 0.01 = 0.49$. From the table, $Z_{0.01}$ is 2.33, and $Z_{cal} = 3.84 > Z_{0.01}$ ⟹ $Z_{cal}$ falls in the rejection region, so we reject H0. Therefore, the data support the directress's claim that her students are above average.

2) A shopkeeper believes that the average age of customers who purchase a certain brand of jeans is 25 years. A random sample of 35 customers gave an average age of 24.6 years with a standard deviation of 1 year. Is the claim of the shopkeeper true at the 5% level of significance?

Given: n = 35 (large sample case) and $\sigma$ is unknown, in which case we substitute it by s = 1; $\bar{X} = 24.6$, µ0 = 25 years (claim).

H0: µ = 25 years   H1: µ ≠ 25 years (two-sided test)   α = 0.05

$Z_{cal} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{24.6 - 25}{1/\sqrt{35}} = -2.366$, and $Z_{critical} = \pm Z_{\alpha/2} = \pm Z_{0.025} = \pm 1.96$.

Since $Z_{cal} = -2.366 < -1.96$, it falls in the rejection region, so we reject H0.

Conclusion: The average age of the customers is different from 25 years; i.e., the shopkeeper's claim is not true at the 5% level of significance.
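The decision rules for the large-sample Z test can be gathered into one function. This is a minimal sketch, not a definitive implementation: the function name `z_test` and the `tail` labels are hypothetical, and the critical values come from `statistics.NormalDist().inv_cdf` instead of the table. When $\sigma$ is unknown and n ≥ 30, the sample standard deviation s is passed as `sd`.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sd, n, alpha, tail="two-sided"):
    """Return (z_cal, reject_H0) for a large-sample test of H0: mu = mu0."""
    z_cal = (xbar - mu0) / (sd / sqrt(n))
    std = NormalDist()
    if tail == "two-sided":                      # H1: mu != mu0
        reject = abs(z_cal) > std.inv_cdf(1 - alpha / 2)
    elif tail == "right":                        # H1: mu > mu0
        reject = z_cal > std.inv_cdf(1 - alpha)
    elif tail == "left":                         # H1: mu < mu0
        reject = z_cal < -std.inv_cdf(1 - alpha)
    else:
        raise ValueError("tail must be 'two-sided', 'right' or 'left'")
    return z_cal, reject


print(z_test(88.7, 83.2, 8.6, 36, 0.01, tail="right"))   # reading test example
print(z_test(24.6, 25, 1, 35, 0.05))                     # shopkeeper example
```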

b) Small sample tests (n < 30): when $\sigma$ is known

We use the standard normal distribution (Z) as long as the variable is normally distributed and $\sigma$ is known, which is similar to (a) above.
c) Small sample tests (n < 30): when $\sigma$ is unknown

When $X \sim N(\mu, \sigma^2)$, n is small and $\sigma^2$ is unknown, in testing H0 against any alternative the calculated value of the test statistic is $t_{cal} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}$. Here we have estimated $\sigma$ by s, and therefore the degrees of freedom will be (n-1). The critical values are $\pm t_{\alpha/2,\,n-1}$; $-t_{\alpha,\,n-1}$; $t_{\alpha,\,n-1}$, depending on whether H1: µ ≠ µ0; H1: µ < µ0; H1: µ > µ0 respectively.

Example: a job placement director claims that the average starting salary for statistics graduates is birr
24000 (yearly). A random sample of 10 statistics graduates had a mean of birr 23450 and a standard deviation of
birr 400. Test H0: µ = birr 24000 versus the alternative H1: µ ≠ birr 24000 at α = 0.05.

Given information: n = 10 (small-sample case), X̄ = birr 23450, σ is unknown, s = birr 400

H0: µ = birr 24000   H1: µ ≠ birr 24000   α = 0.05

The test statistic is t with n−1 d.f., since n < 30 and σ is unknown.

$t_{cal} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{23450 - 24000}{400/\sqrt{10}} = -4.348$

tcritical = ±t0.025,9 = ±2.262. Since |tcal| = 4.348 > 2.262, tcal falls in the rejection region ⟹ reject H0.

Conclusion: reject the claim that the average starting salary of statistics graduates is birr 24000 annually.
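
As a quick check of the arithmetic in this example, the t statistic can be computed in a few lines of Python. This is a minimal sketch: the helper name `t_statistic` is an assumption for illustration, and the critical value 2.262 is taken from the t table as in the text, because the Python standard library has no t-distribution quantile function.

```python
from math import sqrt


def t_statistic(sample_mean, mu0, s, n):
    """Calculated value of the one-sample t statistic (hypothetical helper)."""
    return (sample_mean - mu0) / (s / sqrt(n))


t_cal = t_statistic(sample_mean=23450, mu0=24000, s=400, n=10)
t_crit = 2.262  # t_{0.025, 9}, read from the t table (two-sided test, alpha = 0.05)

# Two-sided test: reject H0 when |t_cal| exceeds the critical value.
reject_h0 = abs(t_cal) > t_crit
print(round(t_cal, 3), reject_h0)
```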

8.5 Hypothesis testing about the proportion: large samples

This section presents the procedure for testing hypotheses about the population proportion p for large samples. The
procedure is similar in many respects to the one for the population mean (µ) and includes the same
steps as the test of the mean. Again, the test can be two-tailed or one-tailed.

If the observations on various items or objects are categorized into two classes (binomial population), we often
want to test the hypothesis that the proportion of items in a particular class is P0. Thus, for a binomial
population, the hypotheses are
H0: P = P0 versus H1: P ≠ P0 or H1: P > P0 or H1: P < P0

The value of the test statistic Z for the sample proportion p̂ is computed as

$Z = \frac{\hat{p} - p}{\sigma_{\hat{p}}} \sim N(0, 1)$, where $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$.

The value of p used in this formula is the one specified in the null hypothesis, and q = 1 − p.


Example: to test the conjecture (guess) of the management that 60% of employees favor a new bonus scheme, a
sample of 150 employees was drawn and their opinion was taken whether they favored it or not. Only 55
employees out of 150 favored the new bonus scheme. Does this finding indicate that the proportion of
employees favoring the new bonus scheme is different from the management’s claim? (Use 𝛼 = 0.01)

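Solution: let p denote the population proportion of employees favoring the new bonus scheme.

The hypothesis to be tested is H0: p = 0.6 versus H1: p ≠ 0.6, with α = 0.01.

$Z = \frac{\hat{p} - p}{\sigma_{\hat{p}}}$, where $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.6 \times 0.4}{150}} = 0.04$ and $\hat{p} = \frac{55}{150} = 0.367$

$Z = \frac{0.367 - 0.6}{0.04} = -5.825$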

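The critical value is Zα/2 = Z0.005 = 2.575, the table value for which P{0 ≤ Z ≤ Z0.005} = 0.5 − 0.005 = 0.495.

Reject H0 if |Zcal| > Zα/2.

Since |Zcal| = 5.825 > Zα/2 = 2.575, we reject H0.

Therefore, the management's claim that 60% of employees favor the new bonus scheme is not acceptable.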
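
The bonus-scheme example can be sketched in Python as follows. The function name `proportion_z_test` is an illustrative assumption. Note that p̂ is not rounded before use here, so the computed Z can differ in the third significant figure from a hand calculation that rounds p̂ to 0.367, and the critical value from `NormalDist` has more decimals than the table value 2.575.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0, alpha):
    """Two-sided large-sample Z test for a proportion (hypothetical helper).

    The standard error uses p0, the value stated in the null hypothesis.
    Returns (p_hat, z_cal, z_crit, reject_h0).
    """
    p_hat = successes / n
    std_error = sqrt(p0 * (1 - p0) / n)
    z_cal = (p_hat - p0) / std_error
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return p_hat, z_cal, z_crit, abs(z_cal) > z_crit


# 55 of 150 employees favor the scheme; H0: p = 0.6, alpha = 0.01
p_hat, z_cal, z_crit, reject_h0 = proportion_z_test(55, 150, 0.6, 0.01)
print(round(p_hat, 3), round(z_cal, 2), round(z_crit, 3), reject_h0)
```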


8.6 Chi-square test of association
The chi-square distribution can be used to test the association between two variables. When data are arranged in table
form for the chi-square test of association, the table is called a contingency table. The table is made up of rows (R)
and columns (C).
Let us consider two attributes A and B, where A has r categories and B has c categories. If we classify the
elements of a certain population into the r categories of A and the c categories of B, the following contingency table
can be formed. Note: nij denotes the number of elements of the population belonging to category i of A and category j
of B, i = 1, 2, 3, …, r and j = 1, 2, 3, …, c, and n is the total number of elements.

$\sum_{i=1}^{r} n_{ij} = n_{.j}$ and $\sum_{j=1}^{c} n_{ij} = n_{i.}$

A \ B    B1    B2    . . .   Bc    Total
A1       n11   n12   . . .   n1c   n1.
A2       n21   n22   . . .   n2c   n2.
.        .     .             .     .
.        .     .             .     .
.        .     .             .     .
Ar       nr1   nr2   . . .   nrc   nr.
Total    n.1   n.2   . . .   n.c   n
The following notation will be used:

$e_{ij} = \frac{n_{i.} \times n_{.j}}{n}$ is called the expected frequency of cell (i, j), i = 1, 2, …, r and j = 1, 2, …, c. The appropriate test
statistic for the null hypothesis of no association is

$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$

The value of χ² becomes small if the discrepancy between nij and eij is small, and it becomes large if the
discrepancy is large. To test the hypothesis at the α level of significance, we compute the χ² value from the sample
observations and compare it with χ²v(α), where v = (r−1)(c−1) is the number of degrees of freedom. The hypotheses are
H0: the two variables are not associated with each other.
H1: the two variables are associated.
Then we reject H0 at the α level of significance if χ² > χ²v(α); otherwise we do not reject H0.

Example: a researcher wishes to determine whether there is a relation between the gender of an
individual and the amount of alcohol consumed. A sample of 68 people is selected and the following data are
obtained.
                 Alcohol consumption
Gender     Low    Moderate    High    Total
Male       10     9           8       27
Female     13     16          12      41
Total      23     25          20      68
At α = 0.1, can the researcher conclude that alcohol consumption is related to gender?
Solution: state the hypothesis and identify the claim
Ho: the amount of alcohol that a person consumes is not associated with individual gender.
H1: The amount of alcohol that a person consumes is associated with individual gender
Critical value χ2 α (v) = χ2 0.1(2) = 4.605
Compute the test statistic. First compute the expected frequencies:

$e_{11} = \frac{27 \times 23}{68} = 9.13$, $e_{12} = \frac{27 \times 25}{68} = 9.93$, $e_{13} = \frac{27 \times 20}{68} = 7.94$

$e_{21} = \frac{41 \times 23}{68} = 13.87$, $e_{22} = \frac{41 \times 25}{68} = 15.07$, $e_{23} = \frac{41 \times 20}{68} = 12.06$

$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} = \frac{(10 - 9.13)^2}{9.13} + \frac{(9 - 9.93)^2}{9.93} + \dots + \frac{(12 - 12.06)^2}{12.06} = 0.283$

Make the decision: do not reject the null hypothesis, since χ² = 0.283 < χ²α(v) = 4.605.
Summarize the result: there is not enough evidence to support the claim that the amount of alcohol a
person consumes is associated with the individual's gender.
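
The expected frequencies and the χ² statistic for this example can be computed with the following Python sketch. The function name `chi_square_statistic` is an illustrative assumption, and the critical value 4.605 is the table value used in the text. Because the expected frequencies are not rounded to two decimals here, the result may differ slightly in the third decimal place from the hand calculation.

```python
def chi_square_statistic(table):
    """Chi-square statistic and degrees of freedom for an r x c table.

    `table` is a list of rows of observed counts (hypothetical helper).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency e_ij = (row total x column total) / n
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Rows: male, female; columns: low, moderate, high consumption
observed = [[10, 9, 8],
            [13, 16, 12]]
chi2, df = chi_square_statistic(observed)
chi2_crit = 4.605  # chi-square table value for alpha = 0.1 and 2 d.f.

reject_h0 = chi2 > chi2_crit
print(round(chi2, 3), df, reject_h0)
```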


CHAPTER NINE
SIMPLE LINEAR REGRESSION AND CORRELATION
9.1. Introduction to Regression
Regression analysis is a statistical method that helps to formulate an algebraic relationship between two or
more variables in the form of an equation, in order to estimate the value of a continuous dependent variable.
- The variable whose value is estimated using the algebraic equation is called the dependent (response) variable,
and the variable whose value is used as the basis for the estimate is called the independent (predictor) variable.
- The linear algebraic equation used for expressing a dependent variable in terms of an independent variable
is called the linear regression equation.
- The linear relationship between the dependent variable Y and an independent variable X, expressed in terms of
the population parameters β0 and β1, is given by the model
Y = β0 + β1x + ε
where β0 = the intercept, or the average value of Y when X is 0,
β1 = the slope, or the change in Y for a unit change in X, and
ε = the random or error term.
The random term is introduced in the model because the relationship between the dependent and the
independent variable is not exact. In other words, there may be other factors that affect Y in addition to X
but that we cannot measure, for lack of literature or knowledge about those variables. Whatever the number
of independent variables we use in the model, we have to include the random term, as there will still be other
factors that we cannot control but that could affect the dependent variable; human knowledge about natural and
social phenomena is not exhaustive. In the social sciences especially, controlling the behavior of human
beings is not possible. Thus, the inclusion of the random term in the regression model is justified.
β0 and β1 are population parameters and are generally unknown. Sample data are therefore used to compute β̂0 as an
estimate of β0 and β̂1 as an estimate of β1, and the estimated regression line is
ŷ = β̂0 + β̂1x
Parameters β0 and β1 of the simple linear regression model
The population regression equation Y = β0 + β1x + ε can be estimated by the sample regression line ŷ = β̂0 +
β̂1x, where ŷ is the estimated average (mean) value of the dependent variable y for a given value of the independent
variable x.

The coefficients β0 and β1 are estimated by β̂0 and β̂1, which are given by

$\hat{\beta}_1 = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$

$\hat{\beta}_0 = \frac{\sum Y}{n} - \hat{\beta}_1\left(\frac{\sum X}{n}\right) = \bar{Y} - \hat{\beta}_1\bar{X}$

These values are obtained on the assumption that the error term ε has zero mean. The method used to obtain the
coefficients is called the method of ordinary least squares (OLS).
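
As a brief sketch of where these formulas come from (the standard least-squares argument, which the notes do not spell out): OLS chooses β̂0 and β̂1 to minimize the sum of squared residuals S. Setting the two partial derivatives of S to zero gives the so-called normal equations, and solving them yields the expressions for β̂0 and β̂1 stated above.

```latex
\begin{align*}
S(\hat\beta_0, \hat\beta_1) &= \sum_{i=1}^{n} \left(Y_i - \hat\beta_0 - \hat\beta_1 X_i\right)^2 \\
\frac{\partial S}{\partial \hat\beta_0} = 0
  &\;\Rightarrow\; \sum Y = n\hat\beta_0 + \hat\beta_1 \sum X \\
\frac{\partial S}{\partial \hat\beta_1} = 0
  &\;\Rightarrow\; \sum XY = \hat\beta_0 \sum X + \hat\beta_1 \sum X^2 \\
\text{From the first equation:}\quad
  \hat\beta_0 &= \bar{Y} - \hat\beta_1 \bar{X} \\
\text{Substituting into the second:}\quad
  \hat\beta_1 &= \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}
\end{align*}
```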

9.2. Correlation coefficient


Correlation coefficient (ρ): a statistical measure of the strength and direction of the linear relationship between
two variables.
- ρ is the population correlation coefficient between x and y,
where x is the independent (explanatory) variable, which is used to explain the
dependent variable, and
y is the dependent variable, which is explained by the explanatory variable.
The population correlation is estimated by the sample correlation coefficient, denoted by r and given by

$r = \frac{cov(x, y)}{\sqrt{var(x)}\,\sqrt{var(y)}} = \frac{cov(x, y)}{\sqrt{var(x)\,var(y)}}$

$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}}$

- The value of r lies between -1 and 1.

E.g. A publisher wants to determine the relationship between the annual % increase in advertising expense (x)
and the annual % increase in sales revenue (y) for different firms.
a. Fit the regression line.
b. Estimate the increase in sales revenue expected from an increase of 7.5% in advertising.
c. Compute the correlation coefficient and interpret it.
Firms   A   B   C   D   E   F   G    H
X       1   3   4   6   8   9   11   14
Y       1   2   2   4   6   8   8    9
Solution:
∑x = 56, ∑y = 40, ∑xy = 373

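∑x² = 524, ∑y² = 1 + 4 + 4 + 16 + 36 + 64 + 64 + 81 = 270, n = 8

$\bar{X} = \frac{\sum x}{n} = \frac{56}{8} = 7$, $\bar{Y} = \frac{\sum y}{n} = \frac{40}{8} = 5$, then

a. Dividing the numerator and denominator of the formula for β̂1 by n:

$\hat{\beta}_1 = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{373 - 8(7)(5)}{524 - 8(7^2)} = \frac{373 - 280}{524 - 392} = \frac{93}{132} = 0.7045$

$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X} = 5 - 0.7045(7) = 0.07$
⇒ The fitted model is
ŷ = 0.07 + 0.7045x
⇒ When the increase in advertising expense is zero, the expected increase in sales revenue is 0.07%.
b. x is measured in percent, as in the data table, so an increase of 7.5% means x = 7.5:
ŷ = 0.07 + 0.7045(7.5)
= 5.35
⇒ When advertising expense increases by 7.5%, sales revenue is expected to increase by about 5.35%.

c. $r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}} = \frac{373 - 8(7)(5)}{\sqrt{(524 - 8(7^2))(270 - 8(5^2))}} = \frac{373 - 280}{\sqrt{(524 - 392)(270 - 200)}} = \frac{93}{\sqrt{9240}} = 0.97$

⇒ The correlation coefficient of 0.97 implies that there is a strong positive linear relationship between the
increase in advertising expense and the increase in sales revenue.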
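
The whole example can be checked with a short Python sketch that recomputes every sum directly from the data table instead of typing the totals in by hand, which guards against arithmetic slips. Variable names such as `b0`, `b1` and `y_hat` are illustrative assumptions, and x is entered in percent units exactly as it appears in the table (so a 7.5% increase is x = 7.5).

```python
from math import sqrt

# Annual % increase in advertising expense (x) and in sales revenue (y), firms A to H
x = [1, 3, 4, 6, 8, 9, 11, 14]
y = [1, 2, 2, 4, 6, 8, 8, 9]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
x_bar, y_bar = sum_x / n, sum_y / n

# a. OLS estimates of the slope and intercept
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = y_bar - b1 * x_bar

# b. Predicted % increase in sales revenue for a 7.5% increase in advertising
y_hat = b0 + b1 * 7.5

# c. Sample correlation coefficient
r = (sum_xy - n * x_bar * y_bar) / sqrt(
    (sum_x2 - n * x_bar ** 2) * (sum_y2 - n * y_bar ** 2)
)

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(b1, 4), round(b0, 4), round(y_hat, 2), round(r, 2))
```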
