Educational Statistics: Prof. Y. K. A. Etsey, Dept. of Educational Foundations
[Cover figure: plot of Chemistry scores against Mathematics scores, both axes from 0 to 100]
                       EPS 211
                     Prof. Y. K. A. Etsey,
              Dept. of Educational Foundations
                            August 2012
           TABLE OF CONTENTS
Pages
Appendix 92
                                   UNIT 1
INTRODUCTION TO STATISTICS
                      DEFINITIONS OF STATISTICS
  There are three basic definitions of the term, Statistics.
   1. Statistics (plural) is the body of numbers or data collected in any field.
      For example, industrial statistics – number of employees in an industry,
      number of products, value of products; vital statistics – measurements of
      bust, waist, hips; population statistics – number of people in a region, number
      of people with secondary education.
   2. Statistics (singular) is the study of methods and procedures used in
      collecting, organizing, analyzing, and interpreting a body of numbers for
      information and decision making.
   3. Statistics (plural; statistic - singular) are the values computed from a body
      of numerical data. For example, the “average” age of Level 200 students in
      UCC, or the proportion of EPS 211 students who are male.
     would indicate whether the general performance of a class is low, average or
     high.
3.   It helps teachers to evaluate course grades and the differences in ability
     represented by different grades. For students' personal report cards, grades alone
     do not provide enough information on a student's level of performance in a
     subject. This information should be combined with the ranking in the subject.
4.   It helps the teacher in the critical reading and understanding of professional
     journals in education. Journals such as the Journal of Educational Management,
     Journal of Educational Research, Journal of Educational Development and
     Practice, Journal of Research and Development in Education often use statistics
     in their analysis of results.
5.   It is useful for research purposes. Statistics is used for data analysis in project
     work and dissertations/theses. Teachers would also use it in their own research in
     the teaching profession.
6.   It helps the teacher to understand information from standardized achievement
     test manuals. The statistical information provided in the test manuals describes
     the quality of the test and the interpretation of the test scores.
                       INTRODUCTORY CONCEPTS
Variables
        A variable is any characteristic of an individual or object that can take on
different values. A value is an assigned number or label representing the attribute of
a given individual or object. For example, marital status as a variable can be broken
down into categories and given values as never married - 1, married - 2, divorced - 3
and widowed - 4. Number of children in a family as a variable can be given the
values 0, 1, 2, 3, 4 etc. Height can take on values such as 1.2 metres, 1.7 metres, 2.0
metres and 2.2 metres. Religious affiliation can be broken down to categories and
given values as: Christian – 1, Moslem – 2, Traditionalist – 3, Buddhist - 4.
       Variables can also be classified as discrete or continuous.
Discrete variables have values which in theory assume only certain distinct
    values or whole numbers on a number line, for example, 8, 12, 20, 45, 100.
    These variables usually represent counts of indivisible entities, such as the
    number of goals scored in a soccer game or the number of students in a class.
Continuous variables have values which in theory assume any value on a number
    line between two points. The values can differ by infinitesimal amounts, for
    example, 10.5, 14.16, 42.001, 56.2222278. The height of a student
    or the weight of a car is a continuous variable.
Inferential statistics uses data from a small group called a sample to make
    statements or generalizations about a much larger group called a population.
    For example, to know the mean age of first year university students (i.e. the
    population) in Ghana, a small group (i.e. a sample) of, say, 200 first year students
    can be used. Their mean age could be used as an estimate of the mean age of all
    first year university students in Ghana.
                             Scales of Measurement
     Depending upon the traits/attributes/characteristics and the way they are
     measured, different kinds of data result representing different scales of
     measurement. There are 4 types of measurement scales. These are Nominal,
     Ordinal, Interval and Ratio.
Nominal Scales: A nominal scale classifies persons or objects into two or more
   categories. Whatever the classification, a person can be in one and only one
   category, and members of a given category have a common set of
   characteristics. For identification purposes, categories are numbered, e.g.
   Gender: Male - 1, Female - 2. All males have a common characteristic and all
   females have a common characteristic which is different from that of males.
Ordinal Scales: An ordinal scale not only classifies subjects but also ranks them
   in terms of the degree to which they possess a characteristic/attribute of interest.
    An ordinal scale puts subjects in order from highest to lowest, or from most to
    least. With respect to height, 5 students can be ranked from 1 to 5, the subject
    with rank 1 being the shortest. Though ordinal scales do indicate that some
    subjects are higher or better than others, they do not indicate how much higher
    or better. The intervals between the ranks are not equal.
Interval Scales: An interval scale has all the characteristics of both nominal and
    ordinal scales and in addition has equal intervals. The zero point is arbitrary
    and does not mean the absence of the characteristics/trait. Values can be added
    and subtracted to and from each other but not multiplied or divided. Examples
    include Celsius temperature and academic achievement.
Ratio Scales: A ratio scale has all the advantages of the other types of scales and in
      addition it has a meaningful true zero point. Height, weight and time are
      examples. Values can be added, subtracted, multiplied and divided. Sixty
      (60) minutes can be said to be 3 times as long as 20 minutes.
Arithmetic Comparisons
                             Practice Exercises
1. Statistics is important for classroom teachers because it
               A. enables them to write appropriate objectives.
               B. helps them to construct good test items.
               C. helps them to evaluate students' grades.
               D. is useful for promotion and certification.
3. A district director of education measures many variables on a sample of schools.
   An example of a variable measured in an ordinal scale is the
               A. enrolment of the classes in each school.
               B. income in cedis of the teachers.
               C. professional qualification of the teachers.
               D. years of service for each teacher.
     A study was conducted to see how well reading success in primary three
     could be predicted from various kinds of information obtained in
     kindergarten (reading readiness, age, gender, and socio-economic status).
8. The grades, A, B, C, D, E, F in a test were changed to 4, 3, 2, 1, 0 for statistical
   purposes. What scale of measurement was used?
             A. Interval
             B. Nominal
             C. Ordinal
             D. Ratio
11.    Which one of the following variables, measured from Primary 6 pupils, has
       an interval scale?
               A. Dancing ability
               B. Languages spoken
               C. Region of birth
               D. Religious affiliation
                               UNIT 2
                       DATA REPRESENTATION
        Raw scores are often represented by graphics/pictures or tables. The
representation of data in these forms enables more information to be derived from the
scores. In educational statistics, more emphasis is placed on pictorial
representations.
Bar graph/Charts
        Data that are from nominal scales are represented in graphic form with the
use of bar graphs. Bar graphs give a pictorial description of the data and emphasize
how groups compare with one another. They are used to compare the sizes of the
various parts. The height of the bars is the basis for the comparisons and not the area
of the bars.
        Bar graphs are either column or horizontal. Column graphs are more popular
in education. Column bar graphs are simple, compound (multiple) or component.
Examples are shown below.
                       Table 1
                       Performance in Inter-House Athletics at BASS
                               House          Total Points
                        One                       120
                        Two                       100
                        Three                     150
                        Four                      60
                        Five                      170
The figure below is a simple column bar graph showing performance in an Inter-
House Athletics competition at BASS.
[Figure: simple column bar graph, "Inter-House Athletics at BASS"; vertical axis: Total Points (0 to 180)]
                    Table 2
                    School Enrolment at Texas JSS
                      Form      Male      Female
                       1A         18            35
                       1B         30            25
                       1C         36            12
                       2A         22            30
                       2B         25            25
                       2C         40            18
The figure below is a compound column bar graph showing school enrolment at
Texas JSS by gender.
[Figure: compound column bar graph, "School Enrolment at Texas JSS by Gender"; vertical axis: Enrolment (0 to 45); horizontal axis: Forms (1A to 2C); legend: Male, Female]
A component bar graph is also known as a composite or stacked bar chart. It is used
when a set of data combines to form a total. The total is the length/height of the bar.
It allows for visual comparisons between different components, i.e. how components
contribute to the total of the category.
[Figure: component bar graph, "School enrolment at Texas JSS"]
1. Draw two axes, a vertical and a horizontal. Label the vertical axis by the source of
   the values/scores, e.g. enrolment, points etc. Label the horizontal axis by the
   names of the categories.
2. Divide the vertical scale into points, considering the lowest value and the highest
   value. Choose a scale such that the bars are neither too tall nor too short; the
   scale must start at zero.
3. Construct equally wide and equally spaced bars for each category on the horizontal
   axis, with the height of each bar being the value/score for that category.
4. Where computer software such as Microsoft Excel or SPSS is not available, it
   is recommended that graph sheets be used.
5. Shade/colour the bars to differentiate bars and components.
   Teachers can use bar graphs in several ways. Enrolment by classes, courses and
   subjects and inter-house competitions can be represented by bar graphs.
Pie Chart
Pie charts use nominal or categorical data. A pie chart is drawn as a
circle of 360° sliced into sectors in the shape of 'pies'. Each pie is cut from an angle at the
centre of the circle. The angle corresponds to the data for each category or group.
Pie charts give a pictorial view of the contributions of the parts that make a whole.
An example is shown below.
Table 3
Performance in Inter-House Athletics at BASS
[Figure: pie chart with one sector each for Houses One, Two, Three, Four and Five]
   Constructing pie charts
1. Calculate the degree equivalent for the value of each category/group by dividing
   the total points for each group by the overall total points and multiplying the
   result by 360°.
    For example, for House One above we have
    $\frac{120}{600} \times 360^\circ = 72^\circ$ and for House Two, we have $\frac{100}{600} \times 360^\circ = 60^\circ$.
2. Use a pair of compasses and a protractor to draw the circle and the sectors based on
   the degrees calculated.
3. Shade/Colour the sectors to differentiate one from the other.
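
The degree calculation in step 1 can be sketched in Python as a quick check on hand working. The points are those of Table 1; the function name `sector_angles` is only an illustrative choice.

```python
# Total points per house, taken from Table 1.
points = {"One": 120, "Two": 100, "Three": 150, "Four": 60, "Five": 170}


def sector_angles(values):
    """Return the pie-chart angle in degrees for each category."""
    total = sum(values.values())
    return {name: value / total * 360 for name, value in values.items()}


angles = sector_angles(points)
for house, angle in angles.items():
    print(f"House {house}: {angle:.0f} degrees")
```

The angles for all the categories should add up to 360°, which is a useful check before drawing the sectors.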
   Uses
   Pie charts can be used by teachers and educational practitioners for examination
    results by number of passes in various subjects, school enrolment by class, form
    or subjects.
Line Graphs
   Line graphs are best suited to data that are related to time. Time could be days,
   weeks, months or years. Line graphs show changes in the data over a period of
   time. Data from interval and ratio scales are most appropriate. Line graphs
   could be simple or compound. Simple line graphs give a pictorial description of
   the data. Compound line graphs compare group data over a period of time.
   Examples are shown below.
   Table 4
   Attendance at monthly teachers' workshops
                                    Month           Total
                                    January         120
                                    February         85
                                    March           100
                                    April           150
                                    May              90
                                    June             85
                                    July            100
                                    August           60
                                    September        90
                                    October          75
                                    November        100
                                    December        150
The figure below is a simple line graph showing attendance at a monthly teachers'
workshop.
[Figure: simple line graph, "Attendance at monthly teachers' workshops"; vertical axis: Number of attendants (0 to 160); horizontal axis: Month (Jan to Dec)]
        Table 5
         Attendance at monthly teachers' workshops
                                           Attendance
                           Month      Female         Male
                           Jan          60            60
                           Feb          45            40
                           March        60            40
                           April        80            70
                           May          50            40
                           June         40            45
                           July         50            50
                           August       25            35
                           Sep          40            50
                           Oct          30            35
                           Nov          50            50
                           Dec          90            60
[Figure: compound line graph of monthly attendance by gender; horizontal axis: Months (Jan to Dec); legend: Female, Male]
   Constructing line graphs
1. Draw two axes, a vertical and a horizontal. Label the vertical axis by the source of
   the values/scores, e.g. attendance, enrolment, points etc. Label the horizontal axis
   by the time period, e.g. months, days, weeks etc.
2. Divide the vertical scale into points, considering the lowest value and the highest
   value. Choose a scale such that the graph is neither too tall nor too flat; the
   scale must start at zero.
3. Plot the value/quantity for each time period on the graph and join consecutive
   points with straight lines.
4. Where computer software such as Microsoft Excel or SPSS is not available, it
   is recommended that graph sheets be used.
Uses
   Teachers and educational practitioners can use line graphs in several ways.
   Examination results over a period of years in a subject, total school enrolment as
   well as enrolment by subjects and courses for a period of time can be represented
   by line graphs.
FREQUENCY DISTRIBUTIONS
Data normally comes in raw or ungrouped form as shown below for 40 students in a
Statistics class.
The raw data alone do not give much information. In the example above, the most we
can readily identify is the highest score (93) and the lowest score (49). A lot more information
can be obtained if the data are treated or put in other forms. One of the ways to
obtain information from data is to use frequency distributions.
Table 6
Ungrouped frequency distribution table
    Ungrouped frequency distributions are not very useful for further work. Zero
    frequencies are common and the tables are sometimes too tall.
    For grouped frequency distributions, the individual scores are put into groups or
    classes. The scores are most often put in groups/classes of 3, 5, 7, 9, and 10 as
    group sizes. Column 2 provides the mid-points, column 3 the tallies, and
    column 4 the frequencies.
Table 7
    Grouped frequency distribution of Statistics students' performance
Features
Table 8
An expanded frequency distribution table
                                   Practice Exercise
   Given the following scores of 50 students in a Statistics class, and using a class
   width of 5, construct a grouped frequency distribution table. Also obtain the
   cumulative percentage frequencies, and cumulative relative frequencies.
          32       38    25        40    47        22   48    45     20     35
          16       18    10        6     8         11   33    30     28     27
          42       35    30        34    31        21   25    12     20     25
          43       33    36        39    42        17   19    22     26     10
          33       38    32        22    26        42   37    35     40     46
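
A hand-constructed table for this exercise can be checked with a short Python sketch such as the one below. It assumes the lowest class is 6 - 10 (the lowest score is 6), which is only one acceptable choice of class limits; the function name and dictionary keys are illustrative.

```python
# The 50 scores from the practice exercise above.
scores = [
    32, 38, 25, 40, 47, 22, 48, 45, 20, 35,
    16, 18, 10, 6, 8, 11, 33, 30, 28, 27,
    42, 35, 30, 34, 31, 21, 25, 12, 20, 25,
    43, 33, 36, 39, 42, 17, 19, 22, 26, 10,
    33, 38, 32, 22, 26, 42, 37, 35, 40, 46,
]


def grouped_frequencies(data, lowest_limit, width):
    """Count scores per class; classes are returned from lowest to highest."""
    n_classes = (max(data) - lowest_limit) // width + 1
    counts = [0] * n_classes
    for score in data:
        counts[(score - lowest_limit) // width] += 1

    table = []
    cumulative = 0
    for k, freq in enumerate(counts):
        lower = lowest_limit + k * width
        cumulative += freq
        table.append({
            "class": f"{lower} - {lower + width - 1}",
            "freq": freq,
            "cum_rel_freq": cumulative / len(data),
            "cum_pct_freq": 100 * cumulative / len(data),
        })
    return table


# Assumption: the lowest class starts at 6 and the class width is 5.
table = grouped_frequencies(scores, lowest_limit=6, width=5)
for row in reversed(table):  # highest class first, as in the tables of this unit
    print(f"{row['class']:>8}  {row['freq']:>3}  "
          f"{row['cum_rel_freq']:.2f}  {row['cum_pct_freq']:.0f}")
```

The last cumulative percentage frequency should be 100 and the frequencies should add up to 50.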
  GRAPHIC REPRESENTATIONS OF FREQUENCY DISTRIBUTIONS
Histogram
Histograms use data from ratio or interval scales and depend on frequency
distributions. A histogram uses the classes and the frequencies from the frequency
distribution table. An example is shown below.
[Figure: histogram; vertical axis: Freq (0 to 40); horizontal axis: Classes (0 to 60)]
To construct a histogram
1. Draw two axes, a vertical and a horizontal. Label the vertical axis by frequency
   and the horizontal axis by scores/classes.
2. Select an appropriate scale on the vertical axis considering the highest/largest
   value. When using a graph sheet, the scale should be such that the bars are
   neither too tall nor too short.
3. Use class midpoints/marks or class boundaries or class limits to label the points
   on the horizontal axis.
4. Draw bars of equal width representing the classes from a frequency distribution
   table with corresponding heights as the frequencies.
Importance
1. It gives a pictorial description of the raw data, providing information about the
   nature of the data.
2. It shows the direction of academic performance (i.e. skewness).
[Figure: two histograms. Left panel: skewed to the right, group performance tends to be low. Right panel: skewed to the left, group performance tends to be high.]
3. It provides an estimate of the most typical score. This is the intersection of the
   two diagonals of the tallest bar.
  Frequency Polygon
  A frequency polygon uses data from ratio or interval scales and depends on
frequency distributions. It uses the classes and the frequencies from the frequency
distribution table. An example is shown below.
[Figure: frequency polygon; vertical axis: Freq; horizontal axis: Classes]
   Importance
1. It gives a pictorial description of the raw data, providing information about the
   nature of the data.
2. It provides an estimate of the most typical score. This is the point on the
   horizontal axis where the highest point of the polygon is located.
[Figure: frequency polygons for Form 1 and Form 2 drawn on the same axes]
The diagram shows that the Form 2 class, whose polygon lies more to the right, performs better.
The most typical scores, where the highest point of each polygon is located, can be
used to confirm the comparisons. Where the total frequencies are not the same, use
relative frequencies in place of the actual frequencies to draw the polygon.
Plot the graph using the upper class boundaries of each class against the cumulative
percentage frequencies.
[Figure: ogive; vertical axis: Cum % Freq (0 to 100); horizontal axis: Classes (0 to 80)]
     To construct an ogive,
1.   Obtain cumulative percentage frequencies.
2.   Plot the cumulative percentage frequencies for each class against the vertical scale.
     Choose appropriate scales, on a graph sheet, such that the ogive is not distorted.
3.   Label the horizontal axis as scores or classes.
4.   Plot, at the upper class boundary of each class, the corresponding
     cumulative percentage frequency. Join the points with straight lines.
5.   Extend the line one class to the left so that the ogive touches the horizontal
     axis.
   Importance
1. It is used for comparisons of distributions of performance, especially for
   distributions where the class/group sizes are not the same. Generally, the graph
   that moves more to the right has better performance. The median score, read off
   at the cumulative percentage frequency of 50, is also used.
   Given the following performances in a test, draw two ogives. Which school
   performed better?
                     School A                         School B
Classes      Frequency Cum. % Freq          Frequency    Cum. % Freq
91 - 100          1           100                 7            100
81 – 90           2            99                17            95.3
71 – 80          11            97                30             84
61 – 70          24            86                25             64
51 – 60          20            62                15            47.3
41 – 50          16            42                11            37.3
31 – 40          12            26                19             30
21 – 30           8            14                14            17.3
11 - 20           4             6                 6              8
1 - 10            2             2                 6              4
Total           100                             150
2. It is used to determine percentiles and percentile ranks. Later in the course, you
   will learn how to obtain the percentiles and percentile ranks.
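
The cumulative percentage frequencies in the School A and School B table above can be reproduced with a short Python sketch. The frequencies are listed from the lowest class (1 - 10) upwards; the function name and the rounding to one decimal place are illustrative choices.

```python
def cumulative_percentages(frequencies):
    """Cumulative percentage frequencies, lowest class first, to 1 decimal place."""
    total = sum(frequencies)
    running = 0
    result = []
    for f in frequencies:
        running += f
        result.append(round(100 * running / total, 1))
    return result


# Frequencies from the table above, from class 1 - 10 up to class 91 - 100.
school_a = [2, 4, 8, 12, 16, 20, 24, 11, 2, 1]
school_b = [6, 6, 14, 19, 11, 15, 25, 30, 17, 7]

cum_a = cumulative_percentages(school_a)
cum_b = cumulative_percentages(school_b)
print("School A:", cum_a)
print("School B:", cum_b)
```

Because percentages are used rather than raw counts, the two ogives can be compared even though School A has 100 students and School B has 150.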
A box and whisker plot is drawn below. Later in the course, you will learn how to
obtain the percentiles and quartiles.
[Figure: box and whisker plot marked at P10, Q1, Q2, Q3 and P90]
An example.
Assume that the following values were obtained for two classes, Form 1A and Form
1B in a class test in Mathematics.
               P10             Q2               P90
Form 1A        10              40               73
Form 1B        28              56               91
The information is presented below by two box and whisker plots.
[Figure: box and whisker plots for Form 1A and Form 1B on a common scale from 0 to 100, each marked at P10, Q1, Q2, Q3 and P90]
It can be observed that the P10, Q1, Q2, Q3, and P90 values are greater in Form 1B than
in Form 1A. This means that performance is better in Form 1B than in Form 1A.
Also note that the graph for Form 1B has moved more to the right towards higher
values than that of Form 1A.
                               Practice Exercises
1. You have data on the long vacation earnings of a sample of 1,000 University of
   Cape Coast students. What kind of graph is most appropriate to use to describe
   the distribution of their earnings?
               A. Bar chart.
               B. Box and Whisker.
               C. Histogram.
               D. Pie chart.
2. You are writing an article for the SRC newspaper about the cost of attending a
   university. You want to make a graph to compare costs at your institution and
   three similar institutions. The most appropriate choice of a graph would be a
               A. Bar chart.
               B. Frequency polygon.
               C. Histogram.
               D. Pie chart.
4. What is the relative frequency for the class, 61 – 70?
              A. 0.50
              B. 0.20
              C. 10.0
              D. 17.0
6. Histograms are most useful for representing data when the scale of measurement
   is
      I. Interval         II. Nominal        III. Ordinal     IV. Ratio
               A.   I only.
               B.   IV only.
               C.   I and IV.
               D.   I, III, IV.
                       Classes                     Frequency
                        46 - 50                        12
                        41 - 45                        14
                        36 - 40                        12
                       31 – 35                          6
                       26 – 30                          5
                       21 – 25                          1
                         Total                         50
7.     The relative frequency for the class, 36-40, is
            A.      0.024
            B.      0.24
            C.      0.48
            D.      0.52
                             UNIT 3
Illustration
                           THE MEAN ($\bar{X}$)
There are three types of mean. These are the Arithmetic, Geometric and Harmonic means.
In Education, the Arithmetic mean is the most useful.
The Arithmetic Mean. It is the sum of the observations divided by the total number
of observations.
 i.e. add the values and divide by the number of observations.
 For example, 4 + 2 + 3 + 1 + 5 = 15, so Mean = $\frac{15}{5} = 3$.
Methods
The Arithmetic Mean ($\bar{X}$) can be obtained from both the ungrouped and grouped
data. It can also be easily obtained from Microsoft Excel.
1. Ungrouped data
Given the following scores, 15, 12, 10, 10, 9, 20, 14, 11, 13, 16, to obtain the mean,
all the scores are added and divided by the total number of observations. The mean
is $\frac{130}{10} = 13$.
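
The mean of the ten scores listed above can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative.

```python
# The ten ungrouped scores given above.
scores = [15, 12, 10, 10, 9, 20, 14, 11, 13, 16]

total = sum(scores)          # sum of the observations
n = len(scores)              # number of observations
mean = total / n

print(f"Sum = {total}, n = {n}, mean = {mean}")
```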
2. Grouped data
Two methods can be used. These are the long method and the coding method. The
methods are used with frequency distributions.
Long method: $\bar{X} = \frac{\sum fx}{n}$ or $\bar{X} = \frac{\sum fx}{N}$, where $f$ is the frequency and $x$ the class
mark (midpoint).
Example using the long method
Scores    Midpoint (x)    Freq (f)         fx
46 – 50       48              4                192
41 – 45       43              6                258
36 – 40       38              10               380
31 – 35       33              12               396
26 – 30       28              8                224
21 – 25       23              7                161
16 – 20       18              3                 54
Total                         50               1665
Long method: $\bar{X} = \frac{\sum fx}{n} = \frac{1665}{50} = 33.3$
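
The long method for the table above can be sketched in Python as follows. The midpoints and frequencies are those in the table; the function name `grouped_mean` is an illustrative choice.

```python
# Class midpoints (x) and frequencies (f) from the table above,
# listed from the class 46 - 50 down to the class 16 - 20.
midpoints = [48, 43, 38, 33, 28, 23, 18]
freqs = [4, 6, 10, 12, 8, 7, 3]


def grouped_mean(midpoints, freqs):
    """Long method: sum of f*x divided by the total frequency n."""
    sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
    n = sum(freqs)
    return sum_fx / n


sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
mean = grouped_mean(midpoints, freqs)
print(f"Sum of fx = {sum_fx}, n = {sum(freqs)}, mean = {mean:.1f}")
```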
Coding method: $\bar{X} = AM + \left(\frac{\sum fd}{n}\right) i$, which is used for distributions with equal
class intervals. AM is the assumed mean, $f$ is the frequency, $d$ is the code for each
class, $n$ is the total frequency and $i$ the class size.
        To use the coding method, class intervals must have the same size. The class
in the middle or the class with the highest frequency is chosen for the code of 0.
Classes above the zero coded class are given positive codes and those below are
given negative codes in steps of 1.
Coding method: $\bar{X} = AM + \left(\frac{\sum fd}{n}\right) i = 33 + \frac{3 \times 5}{50} = 33 + \frac{15}{50} = 33.3$
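
The coding method can be sketched in Python for the same distribution. The sketch assumes that the class 31 – 35 (midpoint 33, the class with the highest frequency) is given the code 0, so that AM = 33; the function name `coded_mean` is illustrative.

```python
# Frequencies from the long-method table, from class 46 - 50 down to 16 - 20.
freqs = [4, 6, 10, 12, 8, 7, 3]
# Assumed codes (d): class 31 - 35 is coded 0, classes above it get
# positive codes and classes below it get negative codes, in steps of 1.
codes = [3, 2, 1, 0, -1, -2, -3]


def coded_mean(freqs, codes, assumed_mean, class_size):
    """Coding method: AM + (sum of f*d / n) * i."""
    n = sum(freqs)
    sum_fd = sum(f * d for f, d in zip(freqs, codes))
    return assumed_mean + sum_fd * class_size / n


sum_fd = sum(f * d for f, d in zip(freqs, codes))
mean = coded_mean(freqs, codes, assumed_mean=33, class_size=5)
print(f"Sum of fd = {sum_fd}, mean = {mean:.1f}")
```

The result agrees with the long method, as it should for any choice of the zero-coded class.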
OPTIONAL
     Using Microsoft Excel
   1. Open Excel
   2. Type in data to be used in one column, if data is not yet entered.
   3. Click an empty cell where you want the result to be and type in Mean.
   4. Click the empty cell directly below where you typed Mean.
   5. Click white space to the right of the fx symbol.
   6. Type in =AVERAGEA(cell number where data begins from:cell number
      where data ends at). E.g. =AVERAGEA(B2:B32). This means that data
      begins at cell B2 and ends at cell B32.
   7. Press Enter. (The mean is given in the empty cell clicked.)
Properties of the Mean
1. The mean is influenced by every score or value that makes it up. If a score is
   changed, the value of the mean changes.
      3, 4, 2, 4, 7  Mean = 4
      3, 4, 7, 4, 7  Mean = 5. The change of the score 2 to 7 has changed the
      mean to 5.
3. The mean is a function of the sum (or aggregate or total) of the scores.
   $\bar{X} = \frac{\sum X}{N}$, so $N\bar{X} = \sum X$. This implies that the number of observations
multiplied by the mean gives the sum of the scores.
Of the three measures it is the only one that is a function of the sum of the scores.
It is also possible to calculate the mean for a combined group if only the means and
number of scores (N) are available.
4. If the mean is subtracted from each individual score and the differences are
   summed, the result is 0.
               4 – 4 =0
               2 – 4 = -2
               3 – 4 = -1
               6 – 4 = 2
               5 – 4 = 1
The distance of the score from the mean is known as the deviation.
5. If the same value is added to or subtracted from every number in a set of scores,
   the mean goes up or goes down by that value.
For example, given the scores 8, 2, 10, 4, $\bar{X} = 6$.
Now add 2 to each score: 10, 4, 12, 6, $\bar{X} = 8$, i.e. 6 + 2.
6. If each score is multiplied or divided by the same value, the mean is multiplied or
   divided by that value.
For example, given the scores 8, 2, 10, 4, $\bar{X} = 6$.
Now multiply each score by 3: 24, 6, 30, 12, $\bar{X} = 18$, i.e. 6 × 3.
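The properties listed above can be verified numerically. The sketch below uses the scores 8, 2, 10, 4 from the examples; the helper names and the two class means passed to `combined_mean` are hypothetical figures chosen only for illustration.

```python
def mean(scores):
    return sum(scores) / len(scores)


scores = [8, 2, 10, 4]
m = mean(scores)

# Property 3: N times the mean gives back the sum of the scores.
sum_from_mean = len(scores) * m

# Property 4: deviations from the mean add up to zero.
total_deviation = sum(x - m for x in scores)

# Property 5: adding 2 to every score raises the mean by 2.
shifted_mean = mean([x + 2 for x in scores])

# Property 6: multiplying every score by 3 multiplies the mean by 3.
scaled_mean = mean([x * 3 for x in scores])


def combined_mean(groups):
    """Mean of a combined group from (mean, N) pairs, using N * mean = sum."""
    total = sum(group_mean * n for group_mean, n in groups)
    return total / sum(n for _, n in groups)


# Hypothetical example: one class of 20 with mean 60, another of 30 with mean 50.
print(m, total_deviation, shifted_mean, scaled_mean)
print(combined_mean([(60, 20), (50, 30)]))
```

Note that the combined mean is weighted by group size, so it is not simply the average of the two means.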
For an odd set of numbers, the median occupies the $\frac{n+1}{2}$th position.
For an even set of numbers, find the mean of the two middle numbers, that is, the value
at the $\frac{n+1}{2}$th position, which falls halfway between them.
The median can be obtained from both ungrouped and grouped data and also from
Microsoft Excel.
To find the median from ungrouped data
1. Arrange all observations in order of size from smallest to largest or vice versa.
2. If the number of observations, n, is odd, the median is the number at the centre or
    the number at the $\frac{n+1}{2}$th position.
3. If the number of observations, n, is even, the median is the mean of the two
    centre observations.
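The three steps above translate directly into code. This is a minimal sketch; the function name and the two small data sets are hypothetical and are not taken from the notes.

```python
def median_ungrouped(values):
    """Median of raw scores, following the three steps above."""
    ordered = sorted(values)          # step 1: arrange in order of size
    n = len(ordered)
    if n == 0:
        raise ValueError("no observations")
    middle = n // 2
    if n % 2 == 1:
        # odd n: the (n + 1) / 2 th score (index `middle` when counting from 0)
        return ordered[middle]
    # even n: mean of the two centre observations
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_ungrouped([7, 3, 9, 5, 1]))   # odd number of scores
print(median_ungrouped([7, 3, 9, 5]))      # even number of scores
```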
Examples
Step 1. Identify the median class. It is the class that will contain the middle score.
Find the value of $\frac{N}{2}$, where N is the total frequency. This is the position of the middle
score. Checking from the cumulative frequency column, find the number equal to
the position or the smallest number that is greater than the position. From the table
above, $\frac{N}{2} = \frac{50}{2} = 25$, therefore the number is 30. The class that this number
belongs to is the median class. From the table above, the median class is 31 – 35.
Step 2. Use the formula below to obtain the Median.
Mdn = $L_1 + \left(\frac{\frac{N}{2} - cf}{f_{mdn}}\right) i$, where
L1      is the lower class boundary of the median class
N       is the total frequency
cf      is the cumulative frequency of the class just below the median class
i       is the class size/width
fmdn    is the frequency of the median class
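Mdn = $30.5 + \left(\frac{\frac{50}{2} - 18}{12}\right) \times 5 = 30.5 + \left(\frac{25 - 18}{12}\right) \times 5 = 30.5 + \left(\frac{7}{12}\right) \times 5 = 30.5 + 0.583 \times 5 \approx 30.5 + 2.9 = 33.4$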
OPTIONAL
   6. Type in =MEDIAN(cell number where data begins from:cell number where
      data ends at). E.g. =MEDIAN(B2:B32). This means that data begins at cell B2 and
      ends at cell B32.
   7. Press Enter. The median is given in the empty cell clicked.
6. Where there are very few observations, the median is not representative of the
   data.
7. Where the data set is large, it is tedious to arrange the data in an array for
   ungrouped data computation of the median.
                               THE MODE
It is the number that occurs most frequently in a distribution.
Given the following scores, 1, 2, 4, 6, 4, 6, 7, 2, 4 the number that occurs most
frequently is 4. This is the Mode. This number appears 3 times.
Given the following scores, 11, 22, 14, 26, 34, 6, 27, 12, 40, no number occurs more
frequently than any other. There is therefore no mode.
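The two examples above can be reproduced with a simple frequency count. The sketch below is illustrative only; the function name `modes` is made up, and it returns an empty list when every value occurs just once, which matches the "no mode" case described above.

```python
from collections import Counter


def modes(scores):
    """Return the most frequent value(s); an empty list means no mode."""
    if not scores:
        return []
    counts = Counter(scores)
    highest = max(counts.values())
    if highest == 1:
        return []                     # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 4, 6, 4, 6, 7, 2, 4]))             # 4 occurs three times
print(modes([11, 22, 14, 26, 34, 6, 27, 12, 40]))     # no value repeats
```

A list with two or three values in the result would correspond to a bi-modal or trimodal distribution.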
1. The main advantage is that it is the only measure that is useful for nominal scale.
2. It is used when there is the need for a rough estimate of the measure of location.
3. It is used when there is the need to know the most frequently occurring value e.g.
    dress styles.
4. It is not useful for further statistical work because a distribution can be bi-modal
    or trimodal, or have no mode at all.
                                  Practice Exercises
                Use the histogram below to answer questions 1 and 2
                [Histogram: frequency (0 to 40) against classes (0 to 80)]
4. The median score for a group of 19 students was 58. A 20th student had a score
     of 45. What is the new median score?
             A. 10.5
             B. 45.0
             C. 58.0
             D. It cannot be determined
7. In a class quiz, a mean of 62 was obtained with a median of 48. How would the
   performance of the class be described?
               A. Above average
               B. Average
               C. High
               D. Low
                   MEASURES OF VARIABILITY
These are also called measures of variation, dispersion or scatter. The main
measures used in education are:
                 1. The range
                 2. The Variance
                 3. The Standard Deviation
                 4. The Quartile Deviation (Semi-interquartile range)
They are used as single scores to describe individual differences in terms of
achievement.
For example: 48, 51, 47, 50 Total = 196         Mean = 49            …..(i)
                 30, 72, 90, 4   Total = 196    Mean = 49            …..(ii)
However, a closer look at the two sets of data shows that the distribution within each
set is not the same. Where the scores cluster around the mean, performance is said to
be homogeneous as in (i). Where the scores move away from the mean,
performance is said to be heterogeneous as in (ii).
                                   THE RANGE
It is the difference between the highest and the lowest values in a set of data.
e.g.: 48, 51, 47, 50 Total = 196 Mean = 49           …..(i) Range: 51 – 47 = 4
         30, 72, 90, 4 Total = 196 Mean = 49         …..(ii) Range: 90 – 4 = 86
Features
1. It is easy to compute.
2. It is easy to interpret.
3. It is a crude measure of dispersion and does not take into account all the
   data/scores.
4. It ignores the spread of all the scores.
5. It uses only two values and does not consider how the other scores relate to each
   other.
6. The range does not consider the typical observations in the distribution but
   concentrates only on the extreme values.
7. It can give a distorted picture of the variation within a set of data.
8. Different distributions can have the same range which would give misleading
   conclusions.
   Uses
1. When data is too scanty or too scattered to justify the computation of a more
   precise measure.
2. When knowledge of extreme scores or total spread is all that is needed.
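Ungrouped data
This is based on raw data. It is computed by using the following formulae.
Variance ($S^2$, $\sigma^2$)
1. Var ($S^2$) = $\frac{\sum (X - \bar{X})^2}{n}$
2. Var ($S^2$) = $\frac{\sum X^2}{n} - \bar{X}^2$
3. Var ($S^2$) = $\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2$
Std. Dev ($S$) = $\sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2}$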
Given a set of data as 48 51 50 47 and the mean of the distribution as 49, the
variance and the standard deviation could be computed as follows:
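$\bar{X} = \frac{196}{4} = 49$
      X        $X - \bar{X}$      $(X - \bar{X})^2$       $X^2$
     48            -1                 1             2304
     51             2                 4             2601
     47            -2                 4             2209
     50             1                 1             2500
   Total                             10             9614
SD = $\sqrt{\frac{\sum (X - \bar{X})^2}{n}} = \sqrt{\frac{10}{4}} = \sqrt{2.5} = 1.58$; Variance = $1.58^2 = 2.5$
OR
SD = $\sqrt{\frac{\sum X^2}{n} - \bar{X}^2} = \sqrt{\frac{9614}{4} - 49^2} = \sqrt{2403.5 - 2401.0} = \sqrt{2.5} = 1.58$; Var. = $1.58^2 = 2.5$
OR
SD = $\sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2} = \sqrt{\frac{9614}{4} - \left(\frac{196}{4}\right)^2} = \sqrt{2403.5 - 2401.0} = \sqrt{2.5} = 1.58$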
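As a check on the worked example, the three variance formulae can be computed side by side. This is a small sketch using the scores 48, 51, 47, 50; the variable names are illustrative, and `statistics.pvariance` from the Python standard library is used only as an independent cross-check (it divides by n, like the formulae here).

```python
import math
import statistics

scores = [48, 51, 47, 50]
n = len(scores)
mean = sum(scores) / n

# Formula 1: mean of the squared deviations from the mean
var_deviation = sum((x - mean) ** 2 for x in scores) / n

# Formula 2: mean of the squares minus the square of the mean
var_squares = sum(x * x for x in scores) / n - mean ** 2

# Formula 3: same as formula 2, with the mean written as (sum of X) / n
var_sums = sum(x * x for x in scores) / n - (sum(scores) / n) ** 2

sd = math.sqrt(var_deviation)

print(var_deviation, var_squares, var_sums)
print(round(sd, 2))
print(statistics.pvariance(scores))   # cross-check with the standard library
```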
                     
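Grouped data:
This is based on a frequency distribution of the scores.
Long method: SD = $\sqrt{\frac{\sum f(X - \bar{X})^2}{n}}$, Var = $\frac{\sum f(X - \bar{X})^2}{n}$
Short method: SD = $\sqrt{\frac{\sum fX^2}{n} - \left(\frac{\sum fX}{n}\right)^2}$, Var = $\frac{\sum fX^2}{n} - \left(\frac{\sum fX}{n}\right)^2$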
Coding Method: SD = $i\sqrt{\frac{\sum fd^2}{n} - \left(\frac{\sum fd}{n}\right)^2}$. This is useful with equal class
intervals.
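Short method:
SD = $\sqrt{\frac{\sum fX^2}{n} - \left(\frac{\sum fX}{n}\right)^2} = \sqrt{\frac{58765}{50} - \left(\frac{1665}{50}\right)^2} = \sqrt{1175.3 - 1108.89} = \sqrt{66.41} \approx 8.15$
Coding Method
SD = $i\sqrt{\frac{\sum fd^2}{n} - \left(\frac{\sum fd}{n}\right)^2} = 5\sqrt{\frac{133}{50} - \left(\frac{3}{50}\right)^2} = 5\sqrt{2.66 - 0.0036} \approx 8.15$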
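The short and coding methods can be compared on the grouped distribution used earlier (midpoints 48 down to 18, total frequency 50). The sketch below is an illustration under the assumption that the class 31 – 35 (midpoint 33) is the zero-coded class and the class width is 5; all variable names are made up for the example.

```python
import math

# (midpoint X, frequency f) for each class
classes = [(48, 4), (43, 6), (38, 10), (33, 12), (28, 8), (23, 7), (18, 3)]
n = sum(f for _, f in classes)

# Short method: uses the sums of fX and fX^2
sum_fx = sum(f * x for x, f in classes)
sum_fx2 = sum(f * x * x for x, f in classes)
sd_short = math.sqrt(sum_fx2 / n - (sum_fx / n) ** 2)

# Coding method: codes d are counted in class widths from the assumed mean
assumed_mean, width = 33, 5
codes = [((x - assumed_mean) // width, f) for x, f in classes]
sum_fd = sum(f * d for d, f in codes)
sum_fd2 = sum(f * d * d for d, f in codes)
sd_coding = width * math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)

print(sum_fx, sum_fx2, sum_fd, sum_fd2)
print(round(sd_short, 2), round(sd_coding, 2))
```

The two results should agree, because coding only rescales and shifts the midpoints before the same formula is applied.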
      Standard Deviation
   1. Open Excel
   2. Type in data to be used in one column, if data is not yet entered.
   3. Click an empty cell where you want the result to be and type in Std. Dev.
   4. Click the empty cell directly below where you typed Std. Dev.
   5. Click white space to the right of the fx symbol.
   6. Type in =STDEVPA(cell number where data begins from:cell number
      where data ends at). E.g. =STDEVPA(B2:B32). This means that data begins
      at cell B2 and ends at cell B32.
   7. Press Enter. The standard deviation is given in the empty cell clicked.
      Variance
   1. Open Excel
   2. Type in data to be used in one column, if data is not yet entered.
   3. Click an empty cell where you want the result to be and type in Variance.
   4. Click the empty cell directly below where you typed Variance.
   5. Click white space to the right of the fx symbol.
   6. Type in =VARPA(cell number where data begins from:cell number where
      data ends at). E.g. =VARPA(B2:B32). This means that data begins at cell
      B2 and ends at cell B32.
   7. Press Enter. The variance is given in the empty cell clicked.
An example is below:
   USES
1. It is used as the most appropriate measure of variation/dispersion when there is
   reason to believe that the distribution is normal.
2. It helps to find out the variation in achievement among a group of students. i.e. it
   determines if a group is homogeneous or heterogeneous.
   Where the standard deviation is relatively small, the group is believed to be
   homogeneous, i.e. performing at about the same level. On the other hand,
   where the standard deviation is relatively large, the group is believed to be
   heterogeneous, i.e. performing at different levels.
   To be more precise, the coefficient of variation (CV) is computed.
   CV = $\frac{S}{\bar{X}} \times 100$. If the value of CV is greater than 33, the group is
   heterogeneous, otherwise it is homogeneous.
   With this information, the teacher has to adopt a teaching method to suit each
   group.
3. It is helpful in computing other statistics e.g. standard scores, correlation
   coefficients.
4. It is useful in determining the reliability of test scores. The split-half correlation
   method or internal consistency methods use the standard deviation of the scores.
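The coefficient of variation rule described under item 2 can be sketched as follows. The function names are hypothetical; the two data sets are the homogeneous and heterogeneous examples (i) and (ii) given at the start of this unit, and the standard deviation is computed with n in the denominator, as in the formulae above.

```python
import math


def coefficient_of_variation(scores):
    """CV = (standard deviation / mean) * 100."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)
    return sd / mean * 100


def describe_group(scores, cutoff=33):
    """Apply the rule: CV greater than the cutoff means heterogeneous."""
    if coefficient_of_variation(scores) > cutoff:
        return "heterogeneous"
    return "homogeneous"


group_i = [48, 51, 47, 50]
group_ii = [30, 72, 90, 4]

print(round(coefficient_of_variation(group_i), 1), describe_group(group_i))
print(round(coefficient_of_variation(group_ii), 1), describe_group(group_ii))
```

Both groups have the same mean of 49, so the difference in CV comes entirely from the spread of the scores.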
QUARTILE DEVIATION
There are two methods – the median method and the formula method.
Example.
Given the following scores, 8, 10, 12, 7, 6, 13, 18, 25, 4, 22, 9, arrange them in
ascending order as 4, 6, 7, 8, 9, 10, 12, 13, 18, 22, 25 (n = 11).
Q1 = $\frac{1}{4}(n+1)$th position → $\frac{1}{4}(12)$ = 3rd position, so Q1 = 7
Q3 = $\frac{3}{4}(n+1)$th position → $\frac{3}{4}(12)$ = 9th position, so Q3 = 18
QD = $\frac{Q_3 - Q_1}{2} = \frac{18 - 7}{2} = \frac{11}{2} = 5.5$
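The position rule used in this example can be written as a short function. This is a sketch only: the names are made up, and the handling of fractional positions by linear interpolation is an assumption, since the example above only involves whole-number positions.

```python
def value_at(ordered, position):
    """Value at a 1-based position in a sorted list.

    Fractional positions are linearly interpolated (an assumption; the
    worked example only uses whole positions).
    """
    lower = int(position)
    fraction = position - lower
    value = ordered[lower - 1]
    if fraction and lower < len(ordered):
        value += fraction * (ordered[lower] - ordered[lower - 1])
    return value


def quartile_deviation(scores):
    """Return (Q1, Q3, QD) using the (n + 1)/4 and 3(n + 1)/4 positions."""
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three scores")
    q1 = value_at(ordered, (n + 1) / 4)
    q3 = value_at(ordered, 3 * (n + 1) / 4)
    return q1, q3, (q3 - q1) / 2


scores = [8, 10, 12, 7, 6, 13, 18, 25, 4, 22, 9]
q1, q3, qd = quartile_deviation(scores)
print(q1, q3, qd)
```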
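Q1 = $L_{Q1} + \left(\frac{\frac{N}{4} - cf}{f_{Q1}}\right) i$, where
LQ1 is the lower class boundary of the lower quartile class
N     is the total frequency
cf    is the cumulative frequency of the class just below the lower quartile class
i     is the class size/width
fQ1 is the frequency of the lower quartile class
Q3 = $L_{Q3} + \left(\frac{\frac{3N}{4} - cf}{f_{Q3}}\right) i$, where the symbols are defined as for Q1 but refer to the upper quartile class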
Example
Classes              Midpoint           Freq            Cum Freq
                        X               f                  cf
46 – 50                 48              4                  50
41 – 45                 43              6                  46
36 – 40                 38              10                 40
31 – 35                 33              12                 30
26 – 30                 28              8                  18
21 – 25                 23              7                  10
16 – 20                 18              3                   3
Total                                   50
Step 1. Identify the quartile class. It is the class that will contain the quartile of
interest. Find the value of $\frac{N}{4}$ for the lower quartile and $\frac{3N}{4}$ for the upper quartile
(where N is the total frequency) as positions. Checking from the cumulative frequency
column, find the number equal to the position or the smallest number that is greater
than the position. From the table above, $\frac{N}{4} = \frac{50}{4} = 12.5$, therefore the number is 18,
and $\frac{3N}{4} = \frac{150}{4} = 37.5$, therefore the number is 40. The classes that these numbers
belong to are the quartile classes. From the table above, the lower quartile class is
26 – 30 and the upper quartile class is 36 – 40.
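Step 2. Use the formulae below to obtain the lower and upper quartiles.
Q1 = $L_{Q1} + \left(\frac{\frac{N}{4} - cf}{f_{Q1}}\right) i = 25.5 + \left(\frac{\frac{50}{4} - 10}{8}\right) \times 5 = 25.5 + \left(\frac{12.5 - 10}{8}\right) \times 5 = 25.5 + \left(\frac{2.5}{8}\right) \times 5 = 25.5 + 1.5625 = 27.06$
Q3 = $L_{Q3} + \left(\frac{\frac{3N}{4} - cf}{f_{Q3}}\right) i = 35.5 + \left(\frac{\frac{150}{4} - 30}{10}\right) \times 5 = 35.5 + \left(\frac{37.5 - 30}{10}\right) \times 5 = 35.5 + \left(\frac{7.5}{10}\right) \times 5 = 35.5 + 3.75 = 39.25$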
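Steps 1 and 2 can be combined into one routine that walks up the cumulative frequencies until it reaches the class containing the required position. The sketch below is illustrative: the function name is hypothetical, the classes are entered as lower class boundaries from the table above (lowest class first), and the same routine with a fraction of 0.5 gives the grouped median.

```python
# (lower class boundary, frequency), lowest class first, from the table above
classes = [(15.5, 3), (20.5, 7), (25.5, 8), (30.5, 12),
           (35.5, 10), (40.5, 6), (45.5, 4)]
WIDTH = 5


def grouped_quantile(classes, width, fraction):
    """L + ((fraction * N - cf) / f) * i for the class holding the position."""
    n = sum(f for _, f in classes)
    position = fraction * n
    cf = 0                      # cumulative frequency below the current class
    for lower, f in classes:
        if f > 0 and cf + f >= position:
            return lower + (position - cf) / f * width
        cf += f
    raise ValueError("no classes with a positive frequency")


q1 = grouped_quantile(classes, WIDTH, 0.25)
q3 = grouped_quantile(classes, WIDTH, 0.75)
median = grouped_quantile(classes, WIDTH, 0.5)
qd = (q3 - q1) / 2

print(round(q1, 2), round(q3, 2), round(median, 1), round(qd, 2))
```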
OPTIONAL
Using Microsoft Excel
First/Lower Quartile
   1. Open Excel
   2. Type in data to be used in one column, if data is not yet entered.
   3. Click an empty cell where you want the result to be and type in Q1(Lower
      Quartile).
   4. Click the empty cell directly below where you typed Q1.
   5. Click white space to the right of the fx symbol.
   6. Type in =QUARTILE(cell number where data begins from:cell number
      where data ends at,1). E.g. =QUARTILE(B2:B32,1). This means that data
      begins at cell B2 and ends at cell B32, and 1 means first or lower quartile.
   7. Press Enter. The Q1, first/lower quartile is given in the empty cell clicked.
Third/Upper Quartile
   1. Open Excel
   2. Type in data to be used in one column, if data is not yet entered.
   3. Click an empty cell where you want the result to be and type in Q3 (Upper
      Quartile).
   4. Click the empty cell directly below where you typed Q3.
   5. Click white space to the right of the fx symbol.
   6. Type in =QUARTILE(cell number where data begins from:cell number
      where data ends at,3). E.g. =QUARTILE(B2:B32,3). This means that data
      begins at cell B2 and ends at cell B32, and 3 means third or upper quartile.
   7. Press Enter. The Q3, third/upper quartile is given in the empty cell clicked.
An example is shown below.
   CV = $\frac{QD}{Mdn} \times 100$. If the value of CV is greater than 33, the group is
   heterogeneous, otherwise it is homogeneous.
        With this information, the teacher has to adopt a teaching method to suit
        each group.
   3. It does not make use of all the information provided by the scores.
                                     Class Exercise
               Distribution of examination scores for Statistics students
                      Classes                         Frequency
                      46 - 50                            12
                      41 - 45                            14
                      36 - 40                            12
                      31 – 35                               6
                      26 – 30                               5
                      21 – 25                               1
                       Total                             50
                                 Practice Exercises
1. The standard deviation is a measure of the
      A. Center of a distribution.
      B. Center of the mean deviation.
      C. Validity of a measurement.
      D. Variability of a distribution.
5. When a distribution is highly skewed to the right, the most appropriate measure
   of variability is the
      A. first quartile.
      B. mean deviation.
      C. quartile deviation.
      D. standard deviation.
6. The first quartile in a distribution of scores is 10.0. The third quartile in the same
   distribution is 25.0. What is the value of the semi-interquartile range?
       A. 7.5
       B. 10.0
       C. 15.0
       D. It cannot be determined.
7. In a Psychology course quiz, the mean was 45 with a standard deviation of 15.
   The instructor later added 10 points to every student's score. What is the new
   standard deviation?
       A.   10
       B.   15
       C.   25
       D.   It cannot be determined
                              UNIT                 5
There are two main groups of measures: Percentiles and Percentile Ranks, and Z scores
and T scores. Z scores and T scores are often referred to as standard scores.
The main purpose of these measures is to describe an individual's position in relation
to a known group or the norm group.
                                  PERCENTILES
Definition: They are points in a distribution below which a given percent, P, of the
             cases lie.
   • There are 99 percentiles that divide a distribution into 100 equal parts.
   • Percentiles are individual scores.
      Notation: P40 = 60. Sixty is the score below which 40% of the scores lie in a
               specific group after the scores have been arranged sequentially. This
               means that a student who obtains a score of 60 has done better than
               40% of the members in the specific group.
               P75 = 50. Fifty is the score below which 75% of the scores lie in a
               specific group after the scores have been arranged sequentially. This
               means that a student who obtains a score of 50 has done better than
               75% of the members in the specific group.
   • A score in one group may be a different percentile in another group.
     For example, in Statistics Quiz 1, a student with a score of 15 may be at P90 in the
     Social Science group but the same score may put the student at P85 in the Home
     Economics group.
   • P50 is the same as the median. P25 is the first quartile and P75 is the third
     quartile.
                              PERCENTILE RANKS
Definition: The percentage of cases falling below a given point on the measurement
             scale. It is the position, on a scale of 100, at which an individual score
             lies.
Notation:      PR of 60 = 75. Seventy-five is the position for a score of 60 when the
               distribution is divided into 100 parts. This means that a student who
               obtains a score of 60 has 75% of the scores falling below him/her in
               the group.
The easiest way to obtain percentiles and percentile ranks is to use the ogive
(cumulative percentage graph).
[Ogive: cumulative percentage (0 to 100) against Scores (0 to 45)]
From the ogive, P60 = 34. PR of a score of 26 is 40.
Formula: $Z = \frac{X - \bar{X}}{s}$, T = 50 + 10Z, where the T scores have a mean of 50 and a standard
             deviation of 10.
1. A student had a Z score of 2.5. The mean for the class was 60 with a standard
   deviation of 4.0. What was the student's observed score?
$Z = \frac{X - \bar{X}}{s}$ → $2.5 = \frac{X - 60}{4}$ → $10 = X - 60$ → X = 10 + 60 = 70
$Z = \frac{X - \bar{X}}{s}$ → $3.5 = \frac{70 - \bar{X}}{5}$ → $17.5 = 70 - \bar{X}$ → $\bar{X}$ = 70 − 17.5 = 52.5
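The Z and T formulae, and the two rearrangements used in the worked solutions above, can be sketched in a few lines. The function names are hypothetical and exist only for this illustration.

```python
def z_score(x, mean, sd):
    """Z = (X - mean) / s."""
    return (x - mean) / sd


def t_score(z):
    """T = 50 + 10Z."""
    return 50 + 10 * z


def raw_from_z(z, mean, sd):
    """Rearranged for the observed score: X = mean + Z * s."""
    return mean + z * sd


def mean_from_z(x, z, sd):
    """Rearranged for the mean: mean = X - Z * s."""
    return x - z * sd


print(raw_from_z(2.5, mean=60, sd=4))    # observed score in the first solution
print(mean_from_z(70, z=3.5, sd=5))      # class mean in the second solution
print(t_score(2.5))                      # the same Z score expressed as a T score
```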
USES
Salome has done better in Mathematics than Social Studies, considering the class
performance.
    3. It helps the teacher to guide and counsel the student to choose the correct
       course for a future career and vocation
                                    Practice Exercises
     1. Scores on an SSSCE Social Studies paper had a mean of 46. Joana obtained a
        score of 70, giving her a standard score of 3.0. What was the standard deviation
        of the scores?
            A. 3
            B. 8
            C. 24
            D. 72
                         Use the ogive below to answer questions 2 & 3.
[Ogive: %age Cum. Freq. (0 to 100) against Scores (0 to 90)]
5. Joyce's percentile rank in an end of year examination was 80. Her actual
   examination score was 95. This information means that she performed better
   than _____ of the students in the class.
       A.   5%
       B.   20%
       C.   80%
       D.   95%
6. Paul's score in a final examination is at the 80th percentile of the scores in the
   class. Paul's score lies
       A. above the third quartile.
       B. at the median.
       C. below the first quartile.
       D. between the median and the third quartile.
                 MEASURES OF RELATIONSHIPS
Concept
Natural relationships exist in the world. Parents and children as well as twins have
things in common. Males are normally attracted to females and rain results in good
harvest.
The concept of correlation provides information about the extent of the relationship
between two variables. Two variables are correlated if they tend to 'go together'.
For example, if high scores on one variable tend to be associated with high scores on
a second variable, then both variables are correlated.
The statistical summary of the degree and direction of the linear relationship or
association between any two variables is given by the coefficient of correlation.
Correlation coefficients range between -1.0 and +1.0. Correlation coefficients are
normally represented by the symbols, r and ρ (rho).
Scatter plots
A scatter plot or scatter diagram shows the nature of the relationship between any 2
variables. To obtain a scatter plot, marks are made on a graph representing the
intersection of the two variables. Scatter plots could either be linear or curvilinear.
Examples
[Scatter plot: Chemistry against Mathematics (both 0 to 100), linear relationship]
[Scatter plot: English against Accounts (both 0 to 100), curvilinear relationship]
Assumptions
        1. The variables are random. Neither the values of X nor Y are predetermined.
        2. The relationship between the variables is linear.
        3. The probability distribution of X's, given a fixed Y, is normal, i.e. the sample
           is drawn from a joint normal distribution.
        4. The standard deviation of X's, given each value of Y, is assumed to be the
           same, just as the standard deviation of Y's, given each value of X, is the same.
Assume the following scores in two tests.
Student   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
X        14 16 15 10  9 18 18 14 12 13 15 18 10 12 16 20 15 12 14 10
Y        10 12 15 10 12 15 15 12 14 14 14 10 12 15 10 12 15 15 10 14
(a) Direction: Positive, (+)   High values go with high values and low values go
                               with low values.
                  Negative (─) High values go with low values and low values go with
                               high values.
Some examples
[Scatter plot: moderate linear positive correlation]
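Computational examples
r = $\frac{\text{Covariance}(X, Y)}{S_X S_Y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n\, S_X S_Y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \cdot \sum (Y - \bar{Y})^2}}$ .....(1)
r = $\frac{n\sum XY - \sum X \sum Y}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}}$ .....(2)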
Student   Quiz 1   Quiz 2
No.         X        Y      $X - \bar{X}$   $(X - \bar{X})^2$   $Y - \bar{Y}$   $(Y - \bar{Y})^2$   $(X - \bar{X})(Y - \bar{Y})$
1           4        6        -2          4          -1          1             2
2           8        8         2          4           1          1             2
3          10        9         4         16           2          4             8
4           7        7         1          1           0          0             0
5           6        8         0          0           1          1             0
6           3        2        -3          9          -5         25            15
7           8        9         2          4           2          4             4
8           5       10        -1          1           3          9            -3
9           5        6        -1          1          -1          1             1
10          4        5        -2          4          -2          4             4
Total      60       70                   44                     50            33
Note: $\bar{X} = 6$ and $\bar{Y} = 7$
Using Formula 1:
r = $\frac{33}{\sqrt{(44)(50)}} = \frac{33}{46.9} = 0.7$
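Formulas (1) and (2) should give the same coefficient for the quiz data in the table. The sketch below computes both as a check; the function names are made up for this illustration.

```python
import math

quiz1 = [4, 8, 10, 7, 6, 3, 8, 5, 5, 4]    # X
quiz2 = [6, 8, 9, 7, 8, 2, 9, 10, 6, 5]    # Y


def pearson_deviation(xs, ys):
    """Formula (1): based on deviations from the two means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


def pearson_raw(xs, ys):
    """Formula (2): based on raw sums, with no means needed."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r1 = pearson_deviation(quiz1, quiz2)
r2 = pearson_raw(quiz1, quiz2)
print(round(r1, 2), round(r2, 2))
```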
2. The Spearman rank correlation coefficient (ρ): For ordinal scale variables
 ρ = $1 - \frac{6\sum d^2}{N(N^2 - 1)}$
Given the following scores:
ρ = $1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 43}{10(100 - 1)} = 1 - \frac{258}{990} = 1 - 0.26 = 0.74$
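The Spearman formula is easy to evaluate once the sum of squared rank differences is known. The sketch below is illustrative only: the function names are hypothetical, it assumes there are no tied ranks, and it starts from the values Σd² = 43 and N = 10 used in the calculation above rather than from the original scores.

```python
def sum_d_squared(ranks_x, ranks_y):
    """Sum of squared differences between paired ranks."""
    return sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))


def spearman_rho(sum_d2, n):
    """rho = 1 - 6 * sum(d^2) / (N * (N^2 - 1)), assuming no tied ranks."""
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


rho = spearman_rho(43, 10)
print(round(rho, 2))

# Hypothetical ranks for three students, only to show how d^2 is obtained
print(sum_d_squared([1, 2, 3], [1, 3, 2]))
```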
$$ \Phi = \sqrt{\frac{\chi^2}{n}} $$
This is used when both the row variable and the column variable have exactly two
sub-categories, i.e. a 2x2 table.
$$ C = \sqrt{\frac{\chi^2}{n + \chi^2}} $$
This is used when there are more than two sub-categories for either the rows or the
columns, e.g. 2x3, 3x3, 2x4, 3x4, etc.
$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
where Oij is the observed count in each cell and Eij is the expected count in that cell.
                                             Gender
                                   Male                 Female
Result                                                               Total
          
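$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(150 - 125)^2}{125} + \frac{(100 - 125)^2}{125} + \frac{(50 - 75)^2}{75} + \frac{(100 - 75)^2}{75} = 26.67 $$
With n = 150 + 100 + 50 + 100 = 400,
$$ \Phi = \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{26.67}{400}} = 0.258 $$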
The result shows that there is a weak positive association between gender and
passing a driving test.
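The chi-square terms and the Phi coefficient can be checked with the Python sketch below. The observed and expected counts are the ones that appear in the chi-square terms above; how they are arranged in the 2x2 table is an assumption, and the function names are illustrative. The sample size n is taken as the sum of the four observed counts.

```python
import math

# Counts as they appear in the chi-square terms (cell arrangement assumed)
observed = [[150, 100], [50, 100]]
expected = [[125, 125], [75, 75]]

def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e
               for row_o, row_e in zip(observed, expected)
               for o, e in zip(row_o, row_e))

def phi_coefficient(chi2, n):
    """Phi for a 2x2 table: square root of chi-square over sample size."""
    return math.sqrt(chi2 / n)

chi2 = chi_square(observed, expected)
n = sum(sum(row) for row in observed)
phi = phi_coefficient(chi2, n)
print(round(chi2, 2), n, round(phi, 3))
```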
Example 2: Association between Halls of Residence and Region of Birth (3x3)
                                 Hall of Residence
Region of Birth      Hall 1        Hall 2        Hall 3        Total
Region 1             40 (30)       30 (30)       30 (40)        100
Region 2             50 (45)       40 (45)       60 (60)        150
Region 3             30 (45)       50 (45)       70 (60)        150
Total                120           120           160            400
The figures in brackets are the expected counts in each cell.
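$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
$$ = \frac{(40 - 30)^2}{30} + \frac{(30 - 30)^2}{30} + \frac{(30 - 40)^2}{40} + \frac{(50 - 45)^2}{45} + \frac{(40 - 45)^2}{45} + \frac{(60 - 60)^2}{60} + \frac{(30 - 45)^2}{45} + \frac{(50 - 45)^2}{45} + \frac{(70 - 60)^2}{60} $$
$$ = \frac{100}{30} + \frac{0}{30} + \frac{100}{40} + \frac{25}{45} + \frac{25}{45} + \frac{0}{60} + \frac{225}{45} + \frac{25}{45} + \frac{100}{60} $$
= 3.333 + 0 + 2.500 + 0.556 + 0.556 + 0 + 5.000 + 0.556 + 1.667 ≈ 14.17
$$ C = \sqrt{\frac{\chi^2}{n + \chi^2}} = \sqrt{\frac{14.17}{400 + 14.17}} = \sqrt{\frac{14.17}{414.17}} = \sqrt{0.0342} = 0.185 $$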
The result shows that there is a very weak association between region of birth and
hall of residence.
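As a cross-check of Example 2, the sketch below derives the expected counts from the row and column totals (row total x column total / grand total) and then computes chi-square and the contingency coefficient C. It is an illustrative Python sketch; the function names are chosen here and are not part of the course text.

```python
import math

# Observed counts from Example 2 (rows: Regions 1-3, columns: Halls 1-3)
observed = [
    [40, 30, 30],
    [50, 40, 60],
    [30, 50, 70],
]

def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e
               for row_o, row_e in zip(observed, expected)
               for o, e in zip(row_o, row_e))

def contingency_coefficient(chi2, n):
    """C = sqrt(chi-square / (n + chi-square))."""
    return math.sqrt(chi2 / (n + chi2))

expected = expected_counts(observed)
chi2 = chi_square(observed, expected)
n = sum(sum(row) for row in observed)
c_coef = contingency_coefficient(chi2, n)
print(expected)
print(round(chi2, 2), round(c_coef, 3))
```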
                       Uses of correlation in education
                               Practice Exercises
1. The Phi Coefficient (Φ) is the most appropriate measure of linear relationship
   when two variables are:
       A.   Both continuous.
       B.   Both natural dichotomies.
       C.   Continuous and natural dichotomy.
       D.   Continuous and artificial dichotomy.
2. A University professor wishes to find the relationship between the age of the
   students in her Statistics class and the scores in a quiz. What is the most
   appropriate measure of relationship to use?
       A.   Pearson’s product moment correlation coefficient
       B.   Phi coefficient
       C.   Point-biserial correlation coefficient
       D.   Spearman’s rank correlation coefficient
3. Which of the following correlation coefficients indicates the strongest
   relationship?
       A. –0.6
       B. 0.07
       C. 0.25
       D. 0.55
4. A teacher wishes to find the relationship between the spelling ability of the
   students in her class and their performance in the end-of-semester examination.
   What is the most appropriate measure of relationship to use?
       A.   Pearson’s product moment correlation coefficient
       B.   Phi coefficient
       C.   Point-biserial correlation coefficient
       D.   Spearman’s rank correlation coefficient
5. The correlation between study habits and achievement in Statistics has been
   found to be 0.92 in a study. The study implies that a student with a
       A. high score in study habits is more likely to score low in Statistics.
       B. low score in study habits is more likely to obtain a moderate score in
          Statistics.
       C. low score in study habits is more likely to score low in Statistics.
       D. moderate score in study habits is more likely to score high in Statistics.
6. A teacher wishes to find the relationship between the gender of students in his
   class and their performance in the end-of-semester examination. What is the
   most appropriate measure of relationship to use?
       A. Phi coefficient
       B. Spearman’s rank correlation coefficient.
       C. Point-biserial correlation coefficient.
       D. Pearson’s product moment correlation coefficient.
The graph below shows the relationship between achievement in Mathematics and
English. The correlation of the relationship is approximately
[Scatter plot: English scores (vertical axis, 0 to 60) against Mathematics scores (horizontal axis, 0 to 50)]
                      A.   –0.8
                      B.   –0.2
                      C.   0.3
                      D.   0.8
The scatter plot shows that students with low college entrance examination scores are
more likely to have
               A. low college GPA.
               B. high college GPA.
               C. perfect college GPA.
               D. zero college GPA.
                               UNIT                 7
Conditions/Assumptions
1. The possible values of the independent variable, X, are fixed in advance.
2. The true relationship between the variables X and Y is linear and is expressed by
the regression equation Y = a + bX + e_i. Here a and b are population parameters that
have to be estimated, and e_i is the random error. The equation is the line of
regression of Y on X; a is the Y intercept and b is the regression coefficient, or
slope, of the regression line.
3. For any fixed X, the probability distribution of Y is normal.
[Figure: regression line showing the intercept a on the Y axis and the slope b]
1. The first step is to present the variables on a scatter diagram to be sure that the
relationship between the variables is linear.
2. Normal equations are solved to obtain the equations for the parameter estimates in
raw score form.
$$ Y = a + bX $$
$$ \sum Y = na + b\sum X $$
$$ \sum XY = a\sum X + b\sum X^2 $$
Intercept:
$$ a = \frac{\sum Y - b\sum X}{n} \quad \text{OR} \quad a = \bar{Y} - b\bar{X} $$
The intercept is the value of Y at which the regression line crosses the Y axis, that
is, the value of Y when the independent variable X equals 0.
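As a sketch of how the estimates follow from the two normal equations, the algebra can be written out as below. The intermediate steps are added here for illustration and use only the symbols already defined, together with the identity ∑X = nX̄. The resulting slope expression is the one applied in the worked example that follows.

```latex
\begin{align*}
\sum Y &= na + b\sum X
  \;\Rightarrow\; a = \frac{\sum Y - b\sum X}{n} = \bar{Y} - b\bar{X} \\
\sum XY &= a\sum X + b\sum X^2
  = (\bar{Y} - b\bar{X})\,n\bar{X} + b\sum X^2 \\
\sum XY - n\bar{X}\bar{Y} &= b\left(\sum X^2 - n\bar{X}^2\right) \\
b &= \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}
\end{align*}
```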
Example
The following scores were obtained in Quiz 1 and Final Examination.
         Quiz 1   Final Exam
           X         Y         XY          X²
          18         75        1350        324
          12         55         660        144
          10         45         450        100
          20         85        1700        400
          15         65         975        225
          15         65         975        225
          14         60         840        196
          10         60         600        100
          12         50         600        144
          11         50         550        121
          18         70        1260        324
          16         75        1200        256
           9         45         405        81
          13         60         780        169
          17         70        1190        289
∑X = 210                  ∑Y = 930               ∑XY = 13535         ∑X² = 3098
$$ b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{13535 - 15(14)(62)}{3098 - 15(14)^2} = \frac{13535 - 13020}{3098 - 2940} = \frac{515}{158} = 3.26 $$
OR
$$ b = r\left(\frac{s_Y}{s_X}\right) = 0.93\left(\frac{11.77}{3.36}\right) = 3.26 $$
$$ a = \frac{\sum Y - b\sum X}{n} = \frac{930 - 3.26(210)}{15} = \frac{245.4}{15} = 16.36 \approx 16.4 \quad \text{OR} \quad a = \bar{Y} - b\bar{X} = 62 - 3.26(14) = 16.36 \approx 16.4 $$
Use in prediction
After obtaining the estimates of a and b, the least squares regression line can be
drawn using two values of X (including X = 0, which gives the intercept).
The corresponding Y values are computed for these X values, and the two points are
used to draw the estimated regression line. Predicted values can then be read from
the regression line.
Method 1
Given Ŷ = 16.4 + 3.26X,
select two values, say 0 and 10, for X and compute the corresponding Y values.
For example:
X = 0, Ŷ = 16.4 + 3.26(0) = 16.4
X = 10, Ŷ = 16.4 + 3.26(10) = 49
Plot the points (0, 16.4) and (10, 49) on a graph sheet and draw a straight line
through them. Then estimate the value of Y for any given X from the regression line.
[Figure: the line Ŷ = 16.4 + 3.26X drawn through the points (0, 16.4) and (10, 49)]
Method 2
The estimated regression equation can be used by substituting the given X values to
obtain the predicted values for Y.
      i.      What would be the exam score for a student who obtains 12.5 in Quiz 1?
           Ŷ = 16.4 + 3.26(12.5) = 57.15 ≈ 57
     ii.      A student obtained 72 in her exam. However, she did not take part in
              Quiz 1. What would be an estimate of her Quiz 1 score?
           72 = 16.4 + 3.26X
           3.26X = 72 − 16.4 = 55.6
           X = 55.6 / 3.26 ≈ 17
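The estimates and the two Method 2 predictions can be checked with the Python sketch below. It recomputes the slope and intercept directly from the fifteen pairs of scores using the raw-score formulas, then substitutes into the fitted line. The function names `fit_line`, `predict_y` and `solve_for_x` are illustrative choices, not part of the course text.

```python
# Quiz 1 (X) and final examination (Y) scores from the worked example
x = [18, 12, 10, 20, 15, 15, 14, 10, 12, 11, 18, 16, 9, 13, 17]
y = [75, 55, 45, 85, 65, 65, 60, 60, 50, 50, 70, 75, 45, 60, 70]

def fit_line(x, y):
    """Least squares estimates (a, b) for the line Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_x2 = sum(p * p for p in x)
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
    a = mean_y - b * mean_x
    return a, b

def predict_y(a, b, x_value):
    """Predicted Y for a given X."""
    return a + b * x_value

def solve_for_x(a, b, y_value):
    """X value at which the fitted line reaches a given Y."""
    return (y_value - a) / b

a, b = fit_line(x, y)
exam_for_quiz_12_5 = predict_y(a, b, 12.5)
quiz_for_exam_72 = solve_for_x(a, b, 72)
print(round(a, 2), round(b, 2))
print(round(exam_for_quiz_12_5), round(quiz_for_exam_72))
```

Because the script carries full precision through to the end, its slope and intercept can differ in the second decimal place from hand calculations that round at each step.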
                                Practice Exercises
The following scores were obtained by 20 students in an intelligence test which was
used to predict final examination scores in Educational Psychology.
2. A student obtained 98 in the intelligence test. What is his predicted final
   examination score in Educational Psychology?
3. A student obtained a final examination score of 75. What was her score in the
   intelligence test?
             Use the following information to answer questions 4 and 5.
A study gathers data on the outside temperature during February and March, in
degrees Celsius, and the amount of electricity an office complex consumes, in
kilowatts. Call the temperature, x and electricity consumed, y. The office complex
uses an air-conditioner, so x helps explain y. The least-squares regression line for
predicting y from x is
                               y = -1344 + 19x
4. It can be seen from the equation of the line that as the temperature goes up,
    electricity used, y, goes
        A. Down, because –1344 is less than 19.
        B. Down, because the slope 19 is positive.
        C. Up, because the intercept –1344 is negative.
        D. Up, because the slope 19 is positive.
5. When the temperature goes up 1 degree, the electricity usage predicted by the
   regression line goes
       A. Down 1 kilowatt.
       B. Down 19 kilowatts.
       C. Up 1 kilowatt.
       D. Up 19 kilowatts.
6. The following estimate of a least squares regression equation was obtained for
   headteachers’ supervision and achievement in a private school.
       Y = –5.0 + 1.5X, where Y is the achievement score and X is the supervision
       rating.
   Mr. Danso obtained a supervision rating of 40. What is the estimate of his score
   in achievement?
       A.   6.5
       B.   55.0
       C.   65.0
       D.   65.5
                               UNIT                  8
The horizontal axis is measured in terms of standard deviation units. The values
decrease to the left and increase to the right from the centre.
Suppose the standard deviation is 4 with a mean of 21. The distribution takes the
form below.
Standard deviation units:   −3    −2    −1     μ    +1    +2    +3
Score:                        9    13    17    21    25    29    33
Symbol
X ~ N(μ, σ²), where μ is the mean and σ² the variance. This is read as ‘the variable X
is normally distributed with a given mean and a given variance’.
Features
1. It is a bell-shaped curve.
2. It is unimodal.
3. It is symmetrical.
4. It is asymptotic: the tails approach the horizontal axis but never touch it.
5. The total area under the curve is 1.0.
6. The mean, mode and median are all equal.
7. When the values of a normal distribution have been converted to standard z-
scores, a standard normal curve is obtained. The standard normal curve has a
mean of 0 and a standard deviation/variance of 1.
[Figure: standard normal curve, horizontal axis marked from −3 to 3, with mean = mode = median = 0]
The mean of 0 corresponds to a Z value of 0.
8. Areas under the normal curve. These areas are obtained from the table of the
   normal distribution. Refer to the Appendix for the areas.
                                 Basic applications
Finding Probabilities
   1. The distribution for a Statistics examination is normal with a mean of 60 and
      variance of 64, i.e. X ~ N(60, 64). A student is selected at random from the
      class. What is the probability that the student selected obtains a score above
      68? Above 76? Below 52?
      The standard deviation is √64 = 8, so
$$ P(X > 68) = P\left(Z > \frac{68 - 60}{8}\right) = P\left(Z > \frac{8}{8}\right) = P(Z > 1) = 0.5000 - 0.3413 = 0.1587 $$
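The table lookup can be cross-checked numerically. The sketch below is an illustration in Python: it computes the normal cumulative probability from the error function `math.erf` instead of the Appendix table, and evaluates all three probabilities asked for in the question. The helper name `normal_cdf` is an assumption made for this example.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X < x) for X ~ N(mean, sd^2), computed from the error function."""
    z = (x - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mean, variance = 60, 64
sd = math.sqrt(variance)  # standard deviation = 8

p_above_68 = 1 - normal_cdf(68, mean, sd)  # P(Z > 1)
p_above_76 = 1 - normal_cdf(76, mean, sd)  # P(Z > 2)
p_below_52 = normal_cdf(52, mean, sd)      # P(Z < -1)

print(round(p_above_68, 4), round(p_above_76, 4), round(p_below_52, 4))
```

By symmetry of the normal curve, the probability of scoring below 52 (one standard deviation below the mean) equals the probability of scoring above 68.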
   2. The distribution for a Statistics examination is normal with a mean of 60 and
      variance of 64, i.e. X ~ N(60, 64). A student is selected at random from the
      class. What is the probability that the student selected obtains a score
      between 52 and 76? Between 68 and 76?
$$ P(52 < X < 76) = P\left(\frac{52 - 60}{8} < Z < \frac{76 - 60}{8}\right) = P\left(\frac{-8}{8} < Z < \frac{16}{8}\right) = P(-1 < Z < 2) = 0.3413 + 0.4772 = 0.8185 $$
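A probability between two scores is the difference of two cumulative probabilities. The sketch below illustrates this with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). Table-based answers can differ from these values in the fourth decimal place because the table entries are themselves rounded.

```python
from statistics import NormalDist

# X ~ N(60, 64): mean 60, standard deviation 8
scores = NormalDist(mu=60, sigma=8)

# P(a < X < b) = P(X < b) - P(X < a)
p_52_to_76 = scores.cdf(76) - scores.cdf(52)  # P(-1 < Z < 2)
p_68_to_76 = scores.cdf(76) - scores.cdf(68)  # P(1 < Z < 2)

print(round(p_52_to_76, 4), round(p_68_to_76, 4))
```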
$$ P(X < 12) = P\left(Z < \frac{12 - 16}{2}\right) = P\left(Z < \frac{-4}{2}\right) = P(Z < -2) = 0.5000 - 0.4772 = 0.0228 $$
(about 2%; more precisely 2.28%)
2. A distribution is normal with a mean of 50 and a standard deviation of 10. From a
   class of 2000 students, approximately how many students obtained scores above 70?
   Between 40 and 60?
$$ P(X > 70) = P\left(Z > \frac{70 - 50}{10}\right) = P\left(Z > \frac{20}{10}\right) = P(Z > 2) = 0.5000 - 0.4772 = 0.0228 $$
       Number of students: 0.0228 x 2000 = 45.6 ≈ 46
3. In a promotion examination, a pass mark was fixed at 40. Given that the
   distribution is normal, with a mean of 50 and a standard deviation of 5.1,
   approximately how many students failed from a class of 400?
$$ P(X < 40) = P\left(Z < \frac{40 - 50}{5.1}\right) = P\left(Z < \frac{-10}{5.1}\right) = P(Z < -1.96) = 0.5000 - 0.4750 = 0.0250 $$
       Number of students: 0.025 x 400 = 10
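The last two problems both turn a probability into an expected number of students by multiplying by the class size. A short Python sketch of that step, using `statistics.NormalDist` from the standard library; the helper name `expected_number` is an illustrative choice:

```python
from statistics import NormalDist

def expected_number(probability, class_size):
    """Expected count = probability x class size."""
    return probability * class_size

# Mean 50, standard deviation 10, class of 2000
class_a = NormalDist(mu=50, sigma=10)
n_above_70 = expected_number(1 - class_a.cdf(70), 2000)
n_40_to_60 = expected_number(class_a.cdf(60) - class_a.cdf(40), 2000)

# Mean 50, standard deviation 5.1, pass mark 40, class of 400
class_b = NormalDist(mu=50, sigma=5.1)
n_failed = expected_number(class_b.cdf(40), 400)

print(round(n_above_70), round(n_40_to_60), round(n_failed))
```

The results are approximate head counts, so they are rounded to whole students at the end.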
                               Practice Exercises
5. Which of the following statements is true about the standard normal curve?
         A. It is bi-modal.
         B. Mean is 1.0.
         C. Mean is less than median.
         D. Variance is 1.0.
                               References