UNIT 2 Stat I
The quality of data greatly affects the final output of an investigation. Hence,
utmost care should be given to the data collection process, and every
possible precaution should be taken to ensure accuracy while collecting data.
Otherwise, with inaccurate or inadequate data, the whole analysis is likely to
be faulty and the decisions based on it will be misleading.
Data obtained from a primary source are called primary data. Likewise, data
gathered from a secondary source are known as secondary data. For example,
assume that a simple study is to be conducted to see the age distribution of
HIV/AIDS patients. Clearly, the variable of study is age. Data about the
age of the patients may be obtained by interviewing them directly.
Note that, in this specific case, the patients are the primary
sources, and the data collected from them are primary data.
Alternatively, one may use the records of hospitals and other related agencies to
obtain the age of the patients without the need to trace them
personally. In that case, the records of the hospitals are secondary
sources and the data copied from such records are secondary data.
In most cases, secondary data are obtained from sources such as census and
survey reports, books, official records, reported experimental results, previous
research papers, bulletins, magazines, newspapers, web sites, and other
publications. Different organizations and government agencies publish
information (data) in the form of reports, periodicals, journals, etc. In
Ethiopia, the Central Statistics Authority (CSA) is the foremost publisher of
such information (secondary data).
Secondary data may be available for subjects (cases) where it is
impossible to collect primary data, for example regions where
there is war.
The choice between primary data and secondary data is determined by factors
such as the nature and scope of the enquiry, availability of financial resources,
availability of time, degree of accuracy desired, and the collecting agency.
Often, primary data are used in situations where secondary data do not provide
an adequate basis for analysis; that is, when the secondary data do not suit a
specific investigation, we use primary data. Except in such cases, most
statistical investigations rest upon secondary data, since this minimizes cost and
saves time. Nevertheless, the following points should be carefully considered
when using secondary data in an investigation.
o One should closely examine whether or not the data are suitable for the
  intended study.
o The source of the data should be checked for reliability. If there is any
  doubt about the reliability of the data, they should not be used.
o One should make sure that the data are not obsolete.
o In case the data are based on a sample, one should see whether the sample
  is properly representative of the population.
o The primary data should have been collected and handled carefully by
  skilled persons only.
Finally, it should be clear that primary data in the hands of one person might
be secondary in the hands of another. That is why it is often said, “the
difference between primary and secondary data is largely one of degree.”
2.3. METHODS OF COLLECTING PRIMARY DATA
After discussing the two sources of data, primary and secondary, it is logical to
say a few words about the methods employed in collecting data from its original
or primary source.
Many authors commonly state three methods of collecting primary data. These
are:
a. Personal Enquiry Method (Interview method)
b. Direct Observation
c. Questionnaire method
The following are the major points that we need to take into account while
preparing a questionnaire.
Questions of a sensitive nature should be avoided. Sensitive questions are
those that are too personal or pecuniary, like “Sources of
income”, “Drinking habit”, etc. The logic here is that respondents do not
willingly answer sensitive questions. Such information, if necessary, may be
gathered through an interview or through other indirect questions.
I. NOMINAL LEVEL
The terms nominal level of measurement and nominal scaled are commonly
used to refer to data that can only be classified into categories. In the strict
sense of the words, however, there are no measurements and no scales involved.
Instead, there are just counts.
Religion                  Total
Protestant                78,952,000
Roman Catholic            30,669,000
Jewish                     3,868,000
Other religion             1,545,000
No religion                3,195,000
Religion not reported      1,104,000
Total                    119,333,000
In the above table, the order of the religions could have been changed. This
indicates that for the nominal level of measurement, there is no particular order for
the groupings. Further, the categories are considered to be mutually exclusive.
The nominal level is considered the most primitive, the lowest, or the most limited
type of measurement.
II. ORDINAL LEVEL
Look at the data below.
Ratings of the company commander
The table lists the ratings of the company commander by the nurses under her
command. This is an illustration of the ordinal level of measurement. One
category is higher than the next one; that is, “superior” is a higher rating than
“good”, “good” is higher than “average”, and so on.
commander rated good is twice as competent as one rated average, or that a
company commander rated superior is twice as competent as one rated good. It
can only be said that a rating of superior is greater than a rating of good, and a
good rating is greater than an average rating.
The major differences between interval and ratio levels of measurement are
these: (1) ratio-level data have a meaningful zero point and (2) the ratio between
two numbers is meaningful. Money is a good illustration: having zero dollars
has meaning, namely that you have none! Weight is another ratio-level measurement.
If the dial on a scale is at zero, there is a complete absence of weight. Also, if you
earn $40,000 a year and John earns $10,000, you earn four times what he
does. Likewise, if you weigh 80 kg and John weighs 40 kg, you weigh twice as much
as John. Such ratio comparisons are not meaningful at the interval level of measurement.
Purposes of Classification:
To eliminate unnecessary detail
To bring out clearly points of similarity and dissimilarity
To enable one to form mental pictures of objects on measurements
To enable one to make comparisons and draw inferences
2.6. METHODS (TYPES) OF CLASSIFICATION
1. Geographical Classification: - Data are arranged according to places
like continents, regions, and countries
Example
2. Chronological Classification:- Data are arranged according to time like
year, month.
Example
Year (in EC) Population (in million)
1974 30
1986 52
1991 60
Employees in a Factory x
Educated Uneducated
Example 4.
Mr. x Height (X) in cm
A 160
B 182
C 175
D 178
A. Discrete Variables – are variables that are associated with enumeration or
counting
Example
Number of students in a class
Number of children in a family, etc
Types of Frequency distribution
0 // 2
1 //// 4
2 //// 4
3 /// 3
4 // 2
Total 15
Exercise
Consider the following scores in a statistics test obtained by 20 students in a
given class.
10, 4, 4, 7, 5, 7, 7, 8, 5, 7, 8, 5, 10, 8, 7, 5, 7, 8, 7, 4
Prepare an ungrouped FD
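One way to check an answer to this exercise is to count the scores by machine. The following is only a minimal sketch in Python; the function name ungrouped_fd is an illustrative assumption, not part of the course material.

```python
from collections import Counter

# Scores of the 20 students, copied from the exercise.
scores = [10, 4, 4, 7, 5, 7, 7, 8, 5, 7, 8, 5, 10, 8, 7, 5, 7, 8, 7, 4]

def ungrouped_fd(values):
    """Return (value, frequency) pairs sorted by value (hypothetical helper)."""
    counts = Counter(values)          # counts how often each value occurs
    return sorted(counts.items())

fd = ungrouped_fd(scores)
for value, freq in fd:
    # Print the value, a tally made of slashes, and the frequency.
    print(f"{value:>5}  {'/' * freq:<8} {freq}")
print(f"Total  {sum(freq for _, freq in fd)}")
```

The printed total should equal the number of students, which is a quick check that no observation was missed.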
B. Grouped Frequency Distribution (GFD)
If the mass of the data is very large, it is necessary to condense the data into
an appropriate number of classes or groups of values of a variable and indicate
the number of observed values that fall into each class. Therefore, a GFD is a
frequency distribution where values of a variable are grouped into classes and
each class is paired with the number of observations it contains.
Example * 2.8
Class            1 – 25   26 – 50   51 – 75   76 – 100
Frequency (fi)   3        10        18        6
In Example *, the GFD contains four classes: 1 – 25, 26 – 50, 51 – 75, and
76 – 100.
LCL1 = 1, UCL1 = 25 LCL3 = 51, UCL3 = 75
LCL2 = 26, UCL2 = 50 LCL4 = 76, UCL4 = 100
ii. Class Frequency (or Simply Frequency): refers to the number of
observations corresponding to a class.
In Example * the class frequency of the 1st, 2nd, 3rd, & 4th classes are
respectively 3, 10, 18 and 6.
iii. Class Boundaries: are boundaries obtained by subtracting half of the unit
of measurement (u) from the lower limits or by adding ½ (u) on the upper limits
of a class.
i.e UCBi = UCLi + ½ (u)
LCBi = LCLi - ½ (u)
Where UCBi = Upper Class Boundaries and
LCBi = Lower Class Boundaries
Remark: The unit of measurement (u) is the gap between any two successive
classes. i.e
u = lower limit of a class – upper limit of the preceding class.
In Example *, consider the 2nd class, 26 – 50, since u = 26 – 25 = 1,
LCL2 = 26 UCL2 = 50
LCB2 = 26 - ½(1) = 25.5 UCB2 = 50 + ½(1) =50.5
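The boundary formulas can be applied to every class of Example * at once. The sketch below is illustrative only; the function name class_boundaries is an assumption, and it takes u from the gap between the first two classes as described in the remark.

```python
def class_boundaries(limits):
    """Return (LCB, UCB) pairs for a list of (LCL, UCL) pairs in increasing order."""
    # u = lower limit of a class - upper limit of the preceding class
    u = limits[1][0] - limits[0][1]
    # LCB = LCL - u/2 and UCB = UCL + u/2
    return [(lcl - u / 2, ucl + u / 2) for lcl, ucl in limits]

# Class limits of the four classes in Example *.
limits = [(1, 25), (26, 50), (51, 75), (76, 100)]
boundaries = class_boundaries(limits)

for (lcl, ucl), (lcb, ucb) in zip(limits, boundaries):
    print(f"{lcl} - {ucl}: LCB = {lcb}, UCB = {ucb}")
```

Note that the upper boundary of each class coincides with the lower boundary of the next one, so the boundaries leave no gaps between classes.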
iv. Class Width (size of a class or class interval): it is the difference between
the upper and lower class limits or the difference between the upper and lower
class boundaries of any class.
Remarks:
1. If both the LCL & UCL are included in a class, it is called an inclusive
class. For inclusive classes,
Class width (cw) = UCBi - LCBi
cw = UCLi – LCLi
R=L–S
a. What is the class frequency of the 3rd class?
b. How many observations (items) are linked into the last class?
c. Find i. The LCL and UCL of the fourth class
ii. The UCB and LCB of the third class
iii. The class interval (class width) of the fifth class
iv. The class mark (midpoint) of the second class
Example 8. Let R/n = 6.8263.
o If all the observations are whole numbers, cw = 7
o If all the observations are to one decimal place, cw = 6.8
o If all the observations are to two decimal places, cw = 6.83, etc.
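The rounding in Example 8 can be reproduced with a short sketch. It assumes that the class width is simply R/n rounded to the same number of decimal places as the observations, and it uses Python's built-in round for that; the helper name class_width is hypothetical.

```python
def class_width(ratio, decimals):
    """Round R/n to the number of decimal places used by the observations."""
    return round(ratio, decimals)

ratio = 6.8263  # the value of R/n given in Example 8

for decimals in (0, 1, 2):
    # decimals = 0 means whole-number observations
    print(decimals, class_width(ratio, decimals))
```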
Note that a suitable number of classes can be obtained by using the formula
n = 1 + 3.322 log N (logarithm to base 10),
rounded up or down to the nearest whole number, where N is the total number of
observations.
Alternatively n can also be determined by formula
4) Determine the class limits
i. Determine the lower class limit of the first class (LCL1), then
LCL2 = LCL1 + cw, LCL3 = LCL2 + cw,… LCLi+1 = LCLi + cw
ii. Determine the upper class limit of the first class (UCL1) i.e.
UCL1 = LCL1 + cw – u, where u = the unit of measurement, then
UCL2 = UCL1 + cw, UCL3 = UCL2 + cw, … , UCLi+1 = UCLi + cw
5) Complete the GFD with the respective class frequencies.
Example 9. The number of customers on 30 consecutive days in a
supermarket was listed as follows:
20 48 65 25 48 49
35 25 72 42 22 58
53 42 23 57 65 37
18 65 37 16 39 42
49 68 69 63 29 67
a. Construct a GFD with a suitable number of classes
b. Complete the distribution obtained in (a) with class boundaries & class
marks
Solution: i. Range = Largest value – smallest value
= 72 – 16 = 56
ii. N = 30 (total number of observations)
Number of classes, n = 1 + 3.322 log 30
= 1 + 3.322 (1.4771)
= 5.9
Hence a suitable number of classes, n, is chosen to be 6.
iii. Class width, cw = Range / n = 56 / 6 = 9.33
For the sake of convenience, take cw to be 10 (note that it is also
possible to choose the cw to be 9).
iv. Take lower limit of the 1st class (LCL1) to be 16 & u = 1
i.e. LCL1 = 16 and UCL1 = LCL1 + cw – u =16+10-1 = 25
LCL2 = LCL1 + cw = 16 + 10 = 26 UCL2 = UCL1 + cw = 25 + 10 = 35
LCL3 = LCL2 + cw = 26 + 10 = 36 UCL3 = UCL2 + cw = 35 + 10 = 45
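Therefore, the GFD would be
a)
Class limits   Frequency (fi)
16 – 25        7
26 – 35        2
36 – 45        6
46 – 55        5
56 – 65        6
66 – 75        4
b)
Class limits   Class boundaries   Class mark   Frequency (fi)
16 – 25        15.5 – 25.5        20.5         7
26 – 35        25.5 – 35.5        30.5         2
36 – 45        35.5 – 45.5        40.5         6
46 – 55        45.5 – 55.5        50.5         5
56 – 65        55.5 – 65.5        60.5         6
66 – 75        65.5 – 75.5        70.5         4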
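The construction steps of Example 9 can be sketched in Python as follows. This is a minimal illustration rather than a prescribed method: the function names are hypothetical, the class mark is taken as the midpoint of the class limits, and the choices LCL1 = 16, cw = 10 and u = 1 are those made in the solution.

```python
import math

# Daily customer counts for the 30 days, copied from Example 9.
data = [20, 48, 65, 25, 48, 49,
        35, 25, 72, 42, 22, 58,
        53, 42, 23, 57, 65, 37,
        18, 65, 37, 16, 39, 42,
        49, 68, 69, 63, 29, 67]

def suggested_classes(n_obs):
    """Number of classes from n = 1 + 3.322 log N, rounded to a whole number."""
    return round(1 + 3.322 * math.log10(n_obs))

def build_gfd(values, lcl1, cw, u, n_classes):
    """Return one dict per class: limits, boundaries, class mark, frequency."""
    rows = []
    for i in range(n_classes):
        lcl = lcl1 + i * cw           # LCL of class i+1 = LCL1 + i * cw
        ucl = lcl + cw - u            # UCL = LCL + cw - u
        rows.append({
            "limits": (lcl, ucl),
            "boundaries": (lcl - u / 2, ucl + u / 2),
            "mark": (lcl + ucl) / 2,  # assumed: midpoint of the class limits
            "freq": sum(lcl <= v <= ucl for v in values),
        })
    return rows

n_classes = suggested_classes(len(data))
gfd = build_gfd(data, lcl1=16, cw=10, u=1, n_classes=n_classes)
for row in gfd:
    print(row)
```

Checking that the class frequencies add up to N = 30 is a simple way to confirm that every observation fell into exactly one class.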
Exercise
Construct a grouped frequency distribution for the following ages of 50 persons
with 6 classes.
37 40 69 35 36 70 72 62 36 72
65 64 47 59 55 42 45 50 46 65
54 63 51 50 61 60 58 58 56 58
55 45 49 51 50 56 44 60 70 44
52 43 55 46 42 62 57 48 60 55
I. CUMULATIVE FREQUENCY DISTRIBUTION (CFD)
Remark: The frequency distribution does not tell us directly the number of
units above or below specified values of the classes; this can be determined
from a “Cumulative Frequency Distribution”.
This means that from the “less than” cumulative frequency distribution there are 4
observations less than 6.5, 11 observations below 10.5, etc., and from the “more
than” cumulative frequency distribution 30 observations are above 2.5, 26
above 6.5, etc.
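The figures quoted above can be reproduced with a short sketch. It assumes that the underlying distribution is the one with classes 3 – 6, 7 – 10, 11 – 14, 15 – 18 and 19 – 22 and frequencies 4, 7, 10, 6 and 3 (the distribution used for the histogram example in this unit), with u = 1; the variable names are illustrative.

```python
from itertools import accumulate

# Assumed distribution (classes 3-6, 7-10, 11-14, 15-18, 19-22).
lower_limits = [3, 7, 11, 15, 19]
upper_limits = [6, 10, 14, 18, 22]
freqs = [4, 7, 10, 6, 3]
u = 1  # unit of measurement

ucb = [x + u / 2 for x in upper_limits]   # upper class boundaries
lcb = [x - u / 2 for x in lower_limits]   # lower class boundaries

# "Less than" cumulative frequency: running total from the first class.
less_than = list(accumulate(freqs))
# "More than" cumulative frequency: running total from the last class.
more_than = list(accumulate(reversed(freqs)))[::-1]

for b, cf in zip(ucb, less_than):
    print(f"less than {b}: {cf}")
for b, cf in zip(lcb, more_than):
    print(f"more than {b}: {cf}")
```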
II. RELATIVE FREQUENCY DISTRIBUTION (RFD)
class by the total frequency. It can be converted into a percentage frequency
by multiplying each relative frequency by 100%, i.e.
Rfi = fi / n
Total 30 1 100%
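As a rough illustration of the formula Rfi = fi / n, the sketch below computes relative and percentage frequencies. The class frequencies 4, 7, 10, 6 and 3 are an assumption chosen only because they total 30, in agreement with the total row above; they are not confirmed to be the frequencies of the original table.

```python
# Assumed class frequencies with total n = 30.
freqs = [4, 7, 10, 6, 3]
n = sum(freqs)

rel_freqs = [f / n for f in freqs]          # Rfi = fi / n
pct_freqs = [100 * rf for rf in rel_freqs]  # percentage frequency

for f, rf, pf in zip(freqs, rel_freqs, pct_freqs):
    print(f"{f:>3}  {rf:.3f}  {pf:.1f}%")
print(f"{n:>3}  {sum(rel_freqs):.3f}  {sum(pct_freqs):.1f}%")
```

The relative frequencies always add up to 1 and the percentage frequencies to 100%, apart from small rounding differences.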
QUESTIONS
32 21 28 31 35 46 48 49 49 48
36 37 22 31 28 34 20 45 44 48
38 33 33 23 28 29 33 26 36 30
43 42 32 36 24 27 27 32 45 45
39 39 38 32 33 25 30 28 37 36
42 43 38 40 35 34 20 30 36 32
40 38 38 40 46 36 35 21 31 35
41 42 39 40 46 44 32 37 22 27
41 39 40 38 44 45 48 36 32 23
40 41 40 44 49 49 49 49 37 33
Construct a Grouped Frequency Distribution (GFD) with five classes for the above data.
PRESENTATION OF DATA
INTRODUCTION
This unit deals with organizing a set of raw data into a Frequency Distribution (FD)
and describing the distribution graphically in a histogram, a frequency polygon, and a cumulative
frequency curve (ogive). Other types of numerical information will be summarized and
presented in the form of a bar chart, a pie chart, or a pictogram.
Definition:
A. HISTOGRAM
After you complete a frequency distribution, your next step will be to construct a “picture” of
these data values using a histogram. A histogram is a graph consisting of a series of adjacent
rectangles whose bases are equal to the class width of the corresponding classes and whose
heights are proportional to the corresponding class frequencies. Here, class boundaries are
marked along the horizontal axis (x-axis) and the class frequencies along the vertical axis
(y-axis) according to a suitable scale. It describes the shape of the data. You can use it to answer
quickly such questions as: Are the data symmetric? Where do most of the data values lie?
3–6 4
7 – 10 7
11 – 14 10
15 – 18 6
19 - 22 3
Total 30
Solution:
[Figure: Histogram for the above distribution. Vertical axis: class frequency (fi).]
5 – 10 4
10 – 15 7
15 – 20 9
20 – 25 12
25 - 30 6
30 – 35 5
B. FREQUENCY POLYGON
It is a line graph of a frequency distribution. Although a histogram does demonstrate the shape of
the data, perhaps the shape can be more clearly illustrated by using a frequency polygon. Here,
you merely connect the centers of the tops of the histogram bars (located at the class midpoints)
with a series of straight lines. The resulting figure is a frequency polygon. Here the class marks
are plotted along the x-axis and the class frequencies along the y-axis. Empty classes are
included at each end so that the curve is anchored to the x-axis.
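The points to be joined in a frequency polygon can be listed with a short sketch. It assumes the distribution with classes 5 – 10 through 30 – 35 and frequencies 4, 7, 9, 12, 6, 5 given just before this section, equal class widths, and one empty class added at each end; the function name polygon_points is hypothetical.

```python
# Assumed distribution: classes 5-10, 10-15, ..., 30-35 with their frequencies.
classes = [(5, 10), (10, 15), (15, 20), (20, 25), (25, 30), (30, 35)]
freqs = [4, 7, 9, 12, 6, 5]

def polygon_points(classes, freqs):
    """Return (class mark, frequency) pairs, with an empty class at each end."""
    width = classes[0][1] - classes[0][0]        # assumes equal class widths
    marks = [(lo + hi) / 2 for lo, hi in classes]
    # Add one empty class (frequency 0) before the first and after the last class.
    xs = [marks[0] - width] + marks + [marks[-1] + width]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))

points = polygon_points(classes, freqs)
for x, y in points:
    print(x, y)
```

Joining these points in order with straight lines gives the polygon; the two zero-frequency points are what anchor it to the x-axis.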
Example 2. Construct a frequency polygon for the frequency distribution given in Example 9
Solution:
[Figure: A frequency polygon for the distribution in Example 9. Horizontal axis: class marks (cmi); vertical axis: frequency (fi).]
CYP 2 Construct a frequency polygon for the frequency distribution given under CYP 1
C. CUMULATIVE FREQUENCY CURVE (OGIVE)
It is the graphic representation of a cumulative frequency distribution. Ogives are of two kinds:
the “less than” ogive (< ogive) and the “more than” ogive (> ogive).
A) “Less than” ogive: here, upper class boundaries are plotted against the “less than”
cumulative frequencies of the respective classes, and the points are joined by straight lines.
Example 3. Draw a “less than” ogive for the frequency distribution in Example 11
Solution:
[Figure: A less than ogive showing the frequency distribution above. Horizontal axis: upper class boundary (UCBi), marked at 6.5, 10.5, 14.5, 18.5, 22.5; vertical axis: less than cumulative frequency (<Cfi).]
B) “More than” ogive: here, lower class boundaries are plotted against the “more than”
cumulative frequencies of their respective classes, and the points are joined by straight lines.
Example 4. Draw a “more than” ogive for the frequency distribution in Example 11
Solution:
[Figure: A more than ogive for the above frequency distribution. Horizontal axis: lower class boundaries (LCBi), marked at 2.5, 6.5, 10.5, 14.5, 18.5; vertical axis: more than cumulative frequency (>Cfi).]
D. LINE GRAPH
It represents the relationship between time (on the x-axis) and the values of a variable (on the y-axis).
The values are recorded with respect to the time of occurrence.
Values 20 10 30 15 1
Solution:
[Figure: A line graph showing the above time series. Horizontal axis: Year (1986 to 1991); vertical axis: Values.]
E. VERTICAL LINE GRAPH
It is a graphical representation of discrete data (or characteristics expressed with whole numbers)
with respect to their frequencies. Vertical solid lines are used to indicate the frequencies.
Family               A  B  C  D  E
Number of children   2  1  5  4  3
Solution:
[Figure: Vertical line graph showing number of children in family A, B, C, D and E. Horizontal axis (X): family; vertical axis (Y): number of children.]
F. BAR CHART (BAR DIAGRAM)
Histograms, frequency polygons, and ogives are used for data having an interval or ratio level of
measurement. Other ways of presenting statistical data, each suited to particular kinds of
situations, are the bar chart, the pie chart, and the pictograph.
A bar chart is a series of equally spaced bars of uniform width where the height (length) of a bar
represents the amount (magnitude) of frequency corresponding to a category. Bars may be
drawn horizontally or vertically. Vertical bar graphs are preferred as they make it easier to
compare the bars with one another.
Example 18: Revenue (in millions of Birr) of company X from 1980 to 1982 is given below
Year Revenue
1980 50
1981 150
1982 200
Solution:
[Figure: A simple bar chart showing revenues of company X from 1980 to 1982. Horizontal axis: year; vertical axis: revenue.]
B. MULTIPLE BAR CHART:
Here two or more bars are grouped together in each category to represent two or more
interrelated sets of data. The bars of related variables are kept adjacent to each other
for every set of values. These charts can be used if the overall total is not required. Each bar
is shaded or colored separately, and a key is given to distinguish them.
Example 19: The following table shows the production of wheat and maize in hundreds of
quintals.
1980 40 80
1981 20 60
1982 60 100
Solution:
[Figure: The number of quintals (in thousands) of wheat and maize production. Horizontal axis: Year (1980 to 1982); vertical axis: number of quintals; legend: maize, wheat.]
Example 20: The number of quintals of wheat and maize (in millions of quintals) produced by
country X in the indicated years.
Year Wheat Maize
Solution:
[Figure: The number of quintals of wheat and maize produced by country X. Vertical axis: number of quintals.]
Solution:
Year % of Wheat Production % of Maize Production
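[Figure: Bar chart of the percentage shares of wheat and maize by year. Horizontal axis: Year (1980 to 1982); vertical axis: percentage (0% to 100%); legend: wheat, maize. Segment labels for 1980, 1981 and 1982: upper segments 50, 22, 40; lower segments 50, 78, 60.]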
G. PIE CHART
A pie chart is a circle divided into sectors with areas proportional to the values of the
components they represent. It shows the components in terms of percentages, not in absolute
magnitude. The angle formed at the center by each sector has to be proportional to the value
represented.
[Figure: Pie chart with sectors Food, House rent, Clothing, and Misc.; sector labels 300, 350, 100, and 250.]
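Since the angle of each sector is proportional to the value it represents, the angles can be computed as value / total × 360 degrees. The sketch below is only an illustration: the pairing of the four values with the four categories is an assumption made for the example, and the function name sector_angles is hypothetical.

```python
def sector_angles(values):
    """Return the central angle in degrees for each component."""
    total = sum(values.values())
    # angle = (value / total) * 360
    return {name: 360 * v / total for name, v in values.items()}

# Assumed pairing of categories and values, for illustration only.
expenses = {"Food": 300, "House rent": 350, "Clothing": 100, "Misc.": 250}

angles = sector_angles(expenses)
total = sum(expenses.values())
for name, value in expenses.items():
    share = 100 * value / total   # percentage of the whole
    print(f"{name:<10} {value:>4}  {share:.1f}%  {angles[name]:.1f} degrees")
```

The angles of all sectors should add up to 360 degrees, just as the percentages add up to 100%.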
H. PICTOGRAPH (PICTOGRAM)
Example 23: In comparing the population of a country from 1990 to 1992, we simply draw
pictures of people where each picture may represent 1,000,000 people.
[Figure: Pictograph with one row of pictures for each of the years 1990, 1991 and 1992. Key: one picture = 1,000,000 people.]
QUESTION
5.5-10.5 1
10.5-15.5 2
15.5-20.5 3
20.5-25.5 5
25.5-30.5 4
30.5-35.5 3
35.5-40.5 2
3. Construct a horizontal and a vertical bar chart for the areas (in square kilometres) of each of the
great lakes in Ethiopia.
Lake Area (km²)
Tana 3600
Abaya 1160
Chamo 551
Ziway 434
Shala 409