Basic Statistical Concepts
Population and Sample
Statistics is a branch of scientific methodology. It deals with the collection, classification,
description and interpretation of data through scientific procedures. Its essential purpose is
to describe the numerical properties of populations and to draw inferences about a
population from samples.
Population: A population is the collection of all items of interest in a particular study. For
example, all the farmers, students, domestic animals, birds, total forest area, total
agricultural land, etc. may constitute a population.
A population may be finite or infinite.
Finite population: A population consisting of a finite number of individuals or items is
called a finite population. The students of an institution, the farmers in a country and the
livestock of a country are examples of finite populations; each has a specific number of
members that can be enumerated.
Infinite population: A population consisting of an infinite number of individuals, or of so
many that they cannot in practice be enumerated, is called an infinite population. Examples
are the fish in a river and the stars in the sky.
Sample: A small but representative part of a population, containing a finite number of
individuals or items, is called a sample. For example, a group of students chosen to represent
the first-year honors students (the population) is a sample. Similarly, only a small quantity
of blood, not the whole, is collected for testing; that quantity is the sample, and the total
quantity of blood in the person is the population.
      A sample is a subset or portion of the population selected to represent the population.
Sample size: The number of elements selected for a sample is known as the sample size. By
convention, a sample of fewer than 30 elements is termed a small sample and one with 30 or
more elements is termed a large sample.
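As a minimal sketch of these definitions, the Python snippet below draws a simple random sample from a finite population and labels it using the 30-element convention. The population (roll numbers 1 to 200), the seed, and the helper name `classify_sample_size` are illustrative assumptions, not part of the original notes.

```python
import random

def classify_sample_size(n):
    """Label a sample as 'small' (n < 30) or 'large' (n >= 30)."""
    return "small" if n < 30 else "large"

# Hypothetical finite population: roll numbers of 200 students.
population = list(range(1, 201))

rng = random.Random(42)               # fixed seed so the draw is repeatable
sample = rng.sample(population, 25)   # simple random sample, without replacement

print(len(sample), classify_sample_size(len(sample)))   # 25 small
print(classify_sample_size(30))                         # large
```

Sampling without replacement guarantees that no student appears twice, which is usually what is wanted when sampling from a finite population.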
Scales of Measurement
This lesson describes the four scales of measurement that are commonly used in statistical
analysis: nominal, ordinal, interval, and ratio scales.
Properties of Measurement Scales
Each scale of measurement satisfies one or more of the following properties of
measurement.
      Identity. Each value on the measurement scale has a unique meaning.
      Magnitude. Values on the measurement scale have an ordered relationship to one
       another. That is, some values are larger and some are smaller.
      Equal intervals. Scale units along the scale are equal to one another. This means,
       for example, that the difference between 1 and 2 would be equal to the difference
       between 19 and 20.
      A minimum value of zero. The scale has a true zero point, below which no values
       exist.
Nominal Scale of Measurement
Nominal variables are placed into categories. The categories have no numeric value, so they
cannot be added, subtracted, multiplied or divided, and they have no order. The nominal
scale of measurement satisfies only the identity property of measurement.
Gender is an example of a variable that is measured on a nominal scale. Individuals may be
classified as "male" or "female", but neither value represents more or less "gender" than the
other. Religion and political affiliation are other examples of variables that are normally
measured on a nominal scale.
Examples:
    Gender: Male, Female, Other.
    Hair Color: Brown, Black, Blonde, Red, Other.
    Type of living accommodation: House, Apartment, Other.
    Religious preference: Buddhist, Mormon, Muslim, Jewish, Christian, Other.
Ordinal Scale of Measurement
The ordinal scale measures a variable in terms of magnitude, or rank. Ordinal scales tell us
the relative order of categories, but give no information about the size of the differences
between them. The ordinal scale has the properties of both identity and magnitude.
Examples:
    High school class ranking: 1st, 9th, 87th…
    Socioeconomic status: poor, middle class, rich.
    The Likert Scale: strongly disagree, disagree, neutral, agree, strongly agree.
    Level of Agreement: yes, maybe, no.
    Time of Day: dawn, morning, noon, afternoon, evening, night.
    Political Orientation: left, center, right.
Interval Scale of Measurement
An interval scale has ordered numbers with meaningful divisions: the distances between
consecutive points on the scale are equal. Interval scales do not have a true zero; zero is
simply another point of measurement. For example, 0 degrees Celsius does not mean the
absence of heat.
The interval scale of measurement has the properties of identity, magnitude, and equal
intervals.
For example, on a Fahrenheit or Celsius thermometer, 90° is hotter than 45°, and the
difference between 10° and 30° is the same as the difference between 60° and 80°.
Examples:
    Celsius Temperature.
    Fahrenheit Temperature.
    IQ (intelligence scale).
    Time on a clock with hands.
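The following Python sketch illustrates why differences are meaningful on an interval scale while ratios are not. It uses the standard Celsius-to-Fahrenheit conversion; the function name `celsius_to_fahrenheit` is chosen here for illustration only.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius reading to Fahrenheit."""
    return c * 9 / 5 + 32

# Equal differences in Celsius stay equal after changing the unit.
diff_low = celsius_to_fahrenheit(30) - celsius_to_fahrenheit(10)
diff_high = celsius_to_fahrenheit(80) - celsius_to_fahrenheit(60)
print(diff_low, diff_high)            # 36.0 36.0

# Ratios are not preserved, because zero is not a true zero.
ratio_c = 90 / 45
ratio_f = celsius_to_fahrenheit(90) / celsius_to_fahrenheit(45)
print(ratio_c, round(ratio_f, 2))     # 2.0 1.72
```

Because the ratio changes with the unit, a statement such as "90° is twice as hot as 45°" has no fixed meaning on an interval scale.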
Ratio Scale of Measurement
The ratio scale of measurement is similar to the interval scale in that it also represents
quantity and has equality of units. However, this scale also has an absolute zero (no numbers
exist below the zero).
The ratio scale of measurement satisfies all four of the properties of measurement: identity,
magnitude, equal intervals, and a minimum value of zero.
The weight of an object would be an example of a ratio scale. Each value on the weight
scale has a unique meaning, weights can be rank ordered, units along the weight scale are
equal to one another, and the scale has a minimum value of zero.
Weight scales have a minimum value of zero because an object can be weightless, but it
cannot have negative weight.
Examples:
    Weight.
    Height.
    Sales Figures.
    Ruler measurements.
    Income earned in a week.
    Years of education.
    Number of children.
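The four scales and the properties each one satisfies can be summarized as a small lookup table. The Python sketch below is only one way to encode that summary; the names `SCALE_PROPERTIES`, `has_property` and `ratios_are_meaningful` are hypothetical, and "true zero" stands for the property called "a minimum value of zero" in the text.

```python
# Properties satisfied by each scale, as listed in the text.
SCALE_PROPERTIES = {
    "nominal":  {"identity"},
    "ordinal":  {"identity", "magnitude"},
    "interval": {"identity", "magnitude", "equal intervals"},
    "ratio":    {"identity", "magnitude", "equal intervals", "true zero"},
}

def has_property(scale, prop):
    """Return True if the given scale satisfies the given property."""
    return prop in SCALE_PROPERTIES[scale]

def ratios_are_meaningful(scale):
    """Ratios such as 'twice as much' require a true zero."""
    return has_property(scale, "true zero")

for scale, props in SCALE_PROPERTIES.items():
    print(f"{scale:<9}{len(props)} of 4 properties: {', '.join(sorted(props))}")

print(ratios_are_meaningful("interval"))   # False
print(ratios_are_meaningful("ratio"))      # True
```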
Variables and Attributes
Measurable characteristics of a population that may vary from element to element either
in magnitude or in quality are called variables.
Suppose we have a set of numbers representing the marks obtained by a group of five students.
The marks may be written as X: 4, 5, 7, 8 and 6. Here X is a variable since
it takes different values.
Variables are of two types: quantitative and qualitative. Variables whose values are
expressed numerically are known as quantitative variables.
Examples of quantitative variables are height, weight, age, yield of crops, length or breadth
of fish, weight of a tomato, number of grains per panicle, income and family size.
Quantitative variables may be further classified as discrete or continuous.
A variable that can take only integral values within a given range is called a discrete
variable. Examples are the number of children in a family, the number of students per class
and the number of grains per panicle.
A variable is said to be continuous if it can assume any value, integral or fractional, within
a given range.
For example, the height or weight of students, the weight of a tomato, the length of a fish,
the height of trees and the price of a commodity are continuous variables.
Some variables express the quality of population elements; they cannot be measured
numerically on a scale but can be classified or categorized. These are called qualitative
variables. A qualitative variable shows variation in objects not in terms of magnitude but
in quality or kind. These qualities are called attributes.
Examples of qualitative variables are type of farmer (big, medium, small), type of fish (sea
fish, river fish), hair color (brown, black, white, etc.), religion (Muslim, Hindu, Christian,
etc.), sex, nationality, type of crime, marital status and literacy. These cannot be measured
numerically but can be grouped into classes or categories.
People vary according to sex as male and female, and according to nationality as American,
French, Italian or Indian. Students in a college may be classified as belonging to the Science,
Arts or Commerce faculty.
Some Statistical terms:
       Population
             A population consists of all the items or individuals about which we
             want to draw a conclusion.
       Sample
             a sub-set of a population
       Variable
             a characteristic which may take on different values
       A parameter is a characteristic of a population
       A statistic is a characteristic of a sample
Types of Data: Primary and Secondary data
Data
The facts and figures that can be measured numerically are studied in statistics. A numerical
measure of a characteristic is known as an observation, and a collection of observations is
termed data. Data are collected by individual research workers or by organizations through
sample surveys or experiments, keeping in view the objectives of the study. The data
collected may be:
   1. Primary Data
   2. Secondary Data
Primary and Secondary Data in Statistics
The difference between primary and secondary data in statistics is that primary data are
collected firsthand by a researcher (an organization, person, authority, agency, party, etc.)
through experiments, surveys, questionnaires, focus groups, interviews and direct
measurements, while secondary data have already been collected by someone else and are
available to the public through publications, journals and newspapers.
Primary Data
Primary data are raw data (data that have not been processed or tailored) which have
just been collected from the source and have not undergone any kind of statistical treatment
such as sorting and tabulation. The term primary data is sometimes also used to refer to
firsthand information.
Sources of Primary Data
The sources of primary data are primary units such as basic experimental units,
individuals and households. The following methods are usually used to collect data from
primary units; the choice of method depends on the nature of the primary unit. (Published
data and data collected in the past are, by contrast, secondary data.)
      Personal Investigation
       The researcher conducts the experiment or survey himself/herself and collects data
       from it. The collected data are generally accurate and reliable. This method of
       collecting primary data is feasible only for small-scale laboratory or field
       experiments and pilot surveys; it is not practicable for large-scale experiments and
       surveys because it takes too much time.
      Through Investigators
       Trained (experienced) investigators are employed to collect the required data. In
       surveys, they contact the individuals and fill in the questionnaires after asking for
       the required information. A questionnaire is an inquiry form containing a number
       of questions designed to obtain information from the respondents. This method of
       collecting data is employed by most organizations. It gives reasonably accurate
       information, but it is very costly and may also be time-consuming.
      Through Questionnaire
       The required information (data) is obtained by sending a questionnaire (printed or
       in soft form) by mail to the selected individuals (respondents), who fill in the
       questionnaire and return it to the investigator. This method is relatively cheap
       compared with the “through investigators” method, but the non-response rate is very
       high because most respondents do not bother to fill in the questionnaire and send it
       back to the investigator.
      Through Local Sources
       Local representatives or agents are asked to send the requisite information, which
       they provide on the basis of their own experience. This method is quick, but
       it gives rough estimates only.
      Through Telephone
       The information may be obtained by contacting the individuals by telephone. This
       method is quick and can provide the required information accurately.
      Through Internet
       With the introduction of information technology, people may be contacted through the
       internet and asked to provide the pertinent information. Google survey is now widely
       used as an online method of data collection. There are many paid online survey
       services too.
It is important to go through the primary data and locate any inconsistent observations
before it is given a statistical treatment.
Secondary Data
Secondary data are data that have already been collected by someone else and may have been
sorted, tabulated and given a statistical treatment. In this sense they are processed
(tailored) data.
Sources of Secondary Data
The secondary data may be available from the following sources:
      Government Organizations
       Federal and Provincial Bureau of Statistics, Crop Reporting Service-Agriculture
       Department, Census and Registration Organization, etc.
      Semi-Government Organization
       Municipal committees, District Councils, Commercial and Financial Institutions
       like banks, etc.
      Teaching and Research Organizations
      Research Journals and Newspapers
      Internet
Presentation of Data
Introduction
Given a large mass of data, it is very hard for a researcher to comprehend all the information
and implications the data contain. Large masses of collected data must therefore be
organized in order to show their significant characteristics.
Methods of Presenting Data:
       1. Tabular form – where the data are presented in rows and columns.
       2. Graphical form – where the data are presented in pictorial or visual form.
Tabular Method. This method of data presentation makes use of a table in which data are
arranged systematically into rows and columns. This systematic arrangement of data is
called a statistical table. In this form, data can be readily understood and
comparisons can be made more easily.
A good statistical table has four essential parts:
       1. Table heading – includes the table number and table title. The title should briefly
           explain the contents of the table.
       2. Stub – the items or classifications written in the first column, which identify
           what is written in the rows.
       3. Caption or box head – the items or classifications written in the first
           row, which identify what is contained in the columns.
       4. Body – the main part of the table, which contains the substance or the figures of
           one’s data.
In the construction of a table, the following guidelines should prove helpful.
   1. Every table must be self-explanatory.
   2. The title should be clear and descriptive.
   3. The title should give information about what, where, how, and when the data were taken.
Example of a statistical table:
                                             Table 1.1
                       Population of the Philippines 1877 - 1980
                    Year                   Population       Average Annual
                                                           Rate of Increase (%)
                    1877                    5,567,685              2.41
                    1887                    5,984,727              0.72
                    1896                    6,261,339              0.50
                    1903                    7,635,426              2.87
                    1918                   10,314,310              1.89
                    1939                   16,000,303              2.22
                    1948                   19,234,182              1.91
                    1960                   27,087,685              3.06
                    1970                   36,684,486              3.01
                    1975                   41,831,045              2.66
                    1980                   48,098,000              2.40
           Source of Data: National Statistics Office.
Frequency Distribution
The number of times a particular observation occurs in a data set is the frequency of that
observation. By the word frequency we mean the repetition of an item or observation.
Frequency is usually denoted by ‘f’.
Definition of Frequency Distribution:
A frequency distribution is a tabular summary of data showing the frequency of items in
each of several nonoverlapping classes.
 If the data are presented as observations with their corresponding frequencies, the
  presentation is called a frequency distribution.
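For ungrouped data, counting frequencies is straightforward. The Python sketch below uses `collections.Counter` from the standard library; the raw data (number of children in 12 families) are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

# Hypothetical raw data: number of children in 12 families.
children = [2, 3, 1, 2, 0, 2, 3, 1, 2, 4, 1, 2]

# Counter maps each distinct observation to its frequency f.
frequency = Counter(children)

print("x   f")
for value in sorted(frequency):
    print(f"{value}   {frequency[value]}")
print("N =", sum(frequency.values()))   # N = 12
```

The frequencies always add up to the number of observations, which gives a quick check on the table.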
How to construct a Frequency Distribution Table
In constructing a quantitative frequency distribution, the following steps are considered:
Step-1: Determine the range R.
 R = highest value – lowest value
Step-2: Decide the number of classes.
The appropriate number of classes (K) may be decided by the following formula:
Sturges’ Formula
  K = 1 + 3.322 log10 N
where N is the number of observations to be included in the distribution.
Step-3: Determine the class interval (C.I.). It is obtained from the relationship
  C.I. = R / K
Step-4: The table has three columns, headed classes of the distribution, tally marks and
frequency. In the first column, write down all the classes of the distribution.
Step-5: Tick off each value in the original table of raw data and, in the second column,
put a tally mark against the appropriate class.
Step-6: Count the tally marks against each class. These counts are the frequencies of the
classes and are written in the third column. Finally, the total of the third column is
checked against the total number of individuals or scores. The whole table is known as a
frequency distribution table.
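Steps 1 to 3 can be sketched in Python as follows. The function names are hypothetical, and rounding both K and the class interval up to the next whole number is an assumption: the steps do not state a rounding rule, but rounding up ensures that K classes of the chosen width cover the whole range.

```python
import math

def sturges_classes(n):
    """Number of classes K = 1 + 3.322 * log10(N), rounded up (assumed rule)."""
    return math.ceil(1 + 3.322 * math.log10(n))

def class_interval(r, k):
    """Class interval C.I. = R / K, rounded up (assumed rule)."""
    return math.ceil(r / k)

n = 30            # number of observations
r = 47 - 8        # range R = highest value - lowest value
k = sturges_classes(n)
ci = class_interval(r, k)

print(k, ci)      # 6 7
print(k * ci >= r)   # True: six classes of width 7 cover a range of 39
```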
The frequency table can be made by two methods:
a) Exclusive method
b) Inclusive method
a) Exclusive method: In this method, the upper limit of any class interval is the same
as the lower limit of the next higher class, so there is no gap between the upper limit of
one class and the lower limit of the next. A value equal to the upper limit of a class is
counted in the next class. This gives a continuous distribution.
Example:
                 C.I.         Tally marks    Frequency(f)
                 0-10
                 10-20
                 20-30
b) Inclusive method: There is a gap between the upper limit of any class and the lower
limit of the next higher class, and both limits are included in the class. This gives a
discontinuous distribution.
Example:
               C.I.       Tally marks       Frequency(f)
               0-9
               10-19
               20-29
To convert a discontinuous distribution to a continuous one, subtract 0.5 from each
lower limit and add 0.5 to each upper limit (that is, half the gap between consecutive
classes when the limits are whole numbers).
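The conversion can be sketched in Python as below. The helper name `to_class_boundaries` is hypothetical, and the sketch assumes at least two classes, sorted in increasing order, with the same gap between every pair of consecutive classes.

```python
def to_class_boundaries(classes):
    """Convert inclusive class limits to continuous class boundaries.

    Half the gap between consecutive classes is subtracted from each
    lower limit and added to each upper limit.
    """
    gap = classes[1][0] - classes[0][1]   # e.g. 10 - 9 = 1
    half = gap / 2                        # e.g. 0.5
    return [(lower - half, upper + half) for lower, upper in classes]

inclusive = [(0, 9), (10, 19), (20, 29)]
print(to_class_boundaries(inclusive))
# [(-0.5, 9.5), (9.5, 19.5), (19.5, 29.5)]
```

After the conversion, the upper boundary of each class coincides with the lower boundary of the next, so the distribution is continuous.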
Note:
       Data are arranged into groups such that each group contains some of the observations.
        These groups are called classes, and the numbers of observations falling in them are
        called frequencies.
       Each class interval has two limits: 1. lower limit and 2. upper limit.
       The difference between the upper limit and the lower limit is called the length of
        the class interval.
       The length of the class interval should be the same for all the classes.
       The average of the two limits is called the mid value of the class.
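As a small illustration of the last three points, the Python sketch below computes the length and mid value of each class for the exclusive classes 0-10, 10-20 and 20-30. The function names are hypothetical.

```python
def class_length(lower, upper):
    """Length of a class interval: upper limit minus lower limit."""
    return upper - lower

def mid_value(lower, upper):
    """Mid value of a class: the average of its two limits."""
    return (lower + upper) / 2

classes = [(0, 10), (10, 20), (20, 30)]
for lower, upper in classes:
    print(f"{lower}-{upper}: length {class_length(lower, upper)}, "
          f"mid value {mid_value(lower, upper)}")

# All classes should have the same length.
lengths = {class_length(lower, upper) for lower, upper in classes}
print(len(lengths) == 1)   # True
```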
Example:
The profits (in lakhs taka) of 30 companies for the year 2001-2002 are given below:
25, 32, 45, 8, 24, 42, 22, 12, 9, 15, 26, 35, 23, 41, 47, 18, 44, 37, 27, 46, 38, 24, 43,
46, 10, 21, 36, 45, 22, 18.
Construct a frequency distribution table taking a suitable class interval.
Solution:
Range (R) = 47 - 8 = 39
Number of observations (N) = 30
Number of classes, K = 1 + 3.322 log10 N
                      = 1 + 3.322 x 1.4771 = 5.91 ≈ 6
C.I. = 39/6 = 6.5 ≈ 7
Inclusive method:
                       C.I.        Tally marks      Frequency(f)
                       8-14        ////                   4
                      15-21        ////                   4
                      22-28        ///// ///              8
                      29-35        //                     2
                      36-42        /////                  5
                      43-49        ///// //               7
                                Total                    30
Exclusive method:
                       C.I.        Tally marks      Frequency(f)
                       8-15        ////                   4
                      15-22        ////                   4
                      22-29        ///// ///              8
                      29-36        //                     2
                      36-43        /////                  5
                      43-50        ///// //               7
                                Total                    30
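The two tables can be reproduced with a short Python sketch, which may help in checking the tallies. The starting value 8, width 7 and 6 classes are taken from the solution above; the function names are hypothetical.

```python
profits = [25, 32, 45, 8, 24, 42, 22, 12, 9, 15, 26, 35, 23, 41, 47,
           18, 44, 37, 27, 46, 38, 24, 43, 46, 10, 21, 36, 45, 22, 18]

def inclusive_table(data, start, width, k):
    """Classes lower..upper with both limits included (whole-number data)."""
    rows = []
    for i in range(k):
        lower = start + i * width
        upper = lower + width - 1
        rows.append((lower, upper, sum(1 for x in data if lower <= x <= upper)))
    return rows

def exclusive_table(data, start, width, k):
    """Classes lower..upper where the upper limit belongs to the next class."""
    rows = []
    for i in range(k):
        lower = start + i * width
        upper = lower + width
        rows.append((lower, upper, sum(1 for x in data if lower <= x < upper)))
    return rows

for name, build in (("Inclusive", inclusive_table), ("Exclusive", exclusive_table)):
    table = build(profits, start=8, width=7, k=6)
    print(name)
    for lower, upper, f in table:
        print(f"  {lower}-{upper}: {f}")
    print("  Total:", sum(f for _, _, f in table))
```

Both methods give the frequencies 4, 4, 8, 2, 5 and 7, with a total of 30.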
The relative frequency of a class is the fraction or proportion of the total number of data
items belonging to the class.
A relative frequency distribution is a tabular summary of a set of data showing the relative
frequency for each class.
The percent frequency of a class is the relative frequency multiplied by 100.
A percent frequency distribution is a tabular summary of a set of data showing the percent
frequency for each class.
Relative Frequency and Percent Frequency Distributions
 Rating                  Relative frequency    Percent frequency
 Poor                    0.10                  10
 Below Average           0.15                  15
 Average                 0.25                  25
 Above Average           0.45                  45
 Excellent               0.05                  5
 Total                   1.00                  100
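Relative and percent frequencies follow directly from the class frequencies. The Python sketch below uses hypothetical counts (20 responses in all), chosen only because they reproduce the proportions in the table; the original counts are not given in the notes.

```python
ratings = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]
counts = [2, 3, 5, 9, 1]   # hypothetical frequencies; not from the notes

total = sum(counts)
relative = [f / total for f in counts]    # relative frequency = f / N
percent = [100 * r for r in relative]     # percent frequency = relative x 100

for rating, r, p in zip(ratings, relative, percent):
    print(f"{rating:<15}{r:>6.2f}{p:>6.0f}")
print(f"{'Total':<15}{sum(relative):>6.2f}{sum(percent):>6.0f}")
```

The relative frequencies should add up to 1 and the percent frequencies to 100, apart from rounding.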