Unit 3
(Core Course)
                 Semester IV
        221601404 DATA SCIENCE
       Statistics used in Data Science
           Measures of Central Tendency
MEAN
                   MEDIAN
MODE
    Measures of Dispersion
Range
                     MEASURES OF CENTRAL TENDENCY
                 In the previous lesson, we have learnt that the data could be summarised to some extent
                 by presenting it in the form of a frequency table. We have also seen how data were
                 represented graphically through bar graphs, histograms and frequency polygons to get
                 some broad idea about the nature of the data.
                 Some aspects of the data can be described quantitatively to represent certain features of
                 the data. An average is one such representative measure. As an average is a number
                 indicating the representative or central value of the data, it lies somewhere in between the
                 two extremes. For this reason, an average is called a measure of central tendency.
                 In this lesson, we will study some common measures of central tendency, viz. mean,
                 median and mode.
                          OBJECTIVES
                 After studying this lesson, you will be able to
                 •   define mean of raw/ungrouped and grouped data;
                 •   calculate mean of raw/ungrouped data and also of grouped data by ordinary
                     and short-cut-methods;
                 •   define median and mode of raw/ungrouped data;
                 •   calculate median and mode of raw/ungrouped data.
$$\frac{x_1 + x_2 + \cdots + x_n}{n}$$
It is generally denoted by $\bar{x}$, so
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \text{(I)}$$
where the symbol “Σ” is the capital letter ‘SIGMA’ of the Greek alphabet and is used to
denote summation.
To economise the space required in writing such lengthy expression, we use the symbol Σ,
read as sigma.
In $\sum_{i=1}^{n} x_i$, i is called the index of summation.
Example 25.1: The weight of four bags of wheat (in kg) are 103, 105, 102, 104. Find the
mean weight.
Solution: Mean weight ($\bar{x}$) = $\frac{103 + 105 + 102 + 104}{4}$ kg = $\frac{414}{4}$ kg = 103.5 kg
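The calculation in Example 25.1 can be reproduced with a few lines of Python. This is only a minimal sketch; the function name `mean` and the variable names are chosen here for illustration and do not come from the lesson.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their number."""
    return sum(values) / len(values)

# Weights (in kg) of the four bags of wheat from Example 25.1
weights_kg = [103, 105, 102, 104]
mean_weight = mean(weights_kg)
print(mean_weight)
```

The same function applies to any list of raw (ungrouped) observations.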
Example 25.2: The enrolment in a school in the last five years was 605, 710, 745, 835 and
910. What was the average enrolment per year?
Solution: Average enrolment (or mean enrolment) = $\frac{605 + 710 + 745 + 835 + 910}{5} = \frac{3805}{5} = 761$
                        Mean = ($\bar{x}$) = $\frac{\sum_{i=1}^{30} x_i}{n} = \frac{40 + 73 + \cdots + 28}{30} = \frac{1455}{30} = 48.5$
                 Example 25.4: Refer to Example 25.1. Show that the sum of $x_1 - \bar{x}$, $x_2 - \bar{x}$, $x_3 - \bar{x}$ and
                 $x_4 - \bar{x}$ is 0, where the $x_i$'s are the weights of the four bags and $\bar{x}$ is their mean.
                 Solution:      $x_1 - \bar{x}$ = 103 – 103.5 = – 0.5, $x_2 - \bar{x}$ = 105 – 103.5 = 1.5
                                $x_3 - \bar{x}$ = 102 – 103.5 = – 1.5, $x_4 - \bar{x}$ = 104 – 103.5 = 0.5
                        So, $(x_1 - \bar{x}) + (x_2 - \bar{x}) + (x_3 - \bar{x}) + (x_4 - \bar{x})$ = – 0.5 + 1.5 + (–1.5) + 0.5 = 0
                 Example 25.5: The mean of marks obtained by 30 students of Section A of Class X is
                 48, that of 35 students of Section B is 50. Find the mean marks obtained by 65 students
                 in Class X.
                 Solution: Mean marks of 30 students of Section A = 48
                 So, total marks obtained by 30 students of Section A = 30 × 48 = 1440
                 Similarly, total marks obtained by 35 students of Section B = 35 × 50 = 1750
                 Total marks obtained by both sections = 1440 + 1750 = 3190
                 Mean of marks obtained by 65 students = $\frac{3190}{65}$ = 49.1 approx.
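The reasoning of Example 25.5 (recover each section's total from its mean, add the totals, divide by the combined count) can be sketched as follows. The helper name `combined_mean` is an illustrative choice, not part of the lesson.

```python
def combined_mean(groups):
    """groups is a list of (count, mean) pairs; returns the mean of all observations pooled."""
    total = sum(count * group_mean for count, group_mean in groups)
    size = sum(count for count, _ in groups)
    return total / size

# Section A: 30 students with mean 48; Section B: 35 students with mean 50
class_mean = combined_mean([(30, 48), (35, 50)])
print(round(class_mean, 1))
```

Note that the result is not the simple average of 48 and 50, because the two sections have different sizes.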
                 Example 25.6: The mean of 6 observations was found to be 40. Later on, it was detected
                 that one observation 82 was misread as 28. Find the correct mean.
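One way to approach Example 25.6 is to recover the reported total from the reported mean, replace the misread value by the correct one, and divide again. The sketch below follows that idea; the function name `corrected_mean` and its parameter names are assumptions made for illustration.

```python
def corrected_mean(n, reported_mean, wrong_value, right_value):
    """Mean after replacing one misread observation by its correct value."""
    reported_total = n * reported_mean              # 6 x 40 = 240
    corrected_total = reported_total - wrong_value + right_value
    return corrected_total / n

# 6 observations, reported mean 40, the value 82 was misread as 28
correct = corrected_mean(6, 40, wrong_value=28, right_value=82)
print(correct)
```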
                 i.e., $\frac{\sum x_i}{n}$. But this process will be time consuming.
                 We can also find the mean of this data by first making a frequency table of the data and
                 then applying the formula:
                          mean = $\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$          (II)
        Mean = $\frac{\sum f_i x_i}{\sum f_i} = \frac{146}{20} = 7.3$
Example 25.7: The following data represent the weekly wages (in rupees) of the
employees:
    Weekly wages (in ₹)       900      1000      1100      1200      1300      1400      1500
    Number of employees        12        13        14        13        12        11         5
Find the mean weekly wages of the employees.
Solution: In the following table, entries in the first column are the xi's and entries in the second
column are the fi's, i.e., the corresponding frequencies. Recall that to find the mean, we require the
product of each xi with the corresponding frequency fi. So, let us put them in a column as
shown in the following table:
    Weekly wages (in ₹)        Number of employees           fi xi
           (xi)                        (fi)
            900                         12                   10800
           1000                         13                   13000
           1100                         14                   15400
           1200                         13                   15600
           1300                         12                   15600
           1400                         11                   15400
           1500                          5                    7500
                                      Σfi = 80          Σfi xi = 93300
        Mean weekly wages = $\frac{\sum f_i x_i}{\sum f_i}$ = ₹ $\frac{93300}{80}$ = ₹ 1166.25
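The direct use of Formula (II) on the wage data of Example 25.7 can be checked with a short sketch. The function name `frequency_mean` is hypothetical and chosen only for this illustration.

```python
def frequency_mean(values, freqs):
    """Mean of a frequency table: sum of f_i * x_i divided by sum of f_i."""
    total = sum(f * x for x, f in zip(values, freqs))
    return total / sum(freqs)

wages = [900, 1000, 1100, 1200, 1300, 1400, 1500]   # x_i, in rupees
employees = [12, 13, 14, 13, 12, 11, 5]             # f_i
mean_wage = frequency_mean(wages, employees)
print(mean_wage)
```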
Sometimes, when the numerical values of xi and fi are large, finding the products of fi and xi
becomes tedious and time consuming.
We wish to find a short-cut method. Here, we choose an arbitrary constant a, also called
the assumed mean, and subtract it from each of the values xi. The reduced value,
di = xi – a, is called the deviation of xi from a.
Thus, xi = a + di
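                 Multiplying both sides by $f_i$ gives $f_i x_i = a f_i + f_i d_i$, so
                 $$\sum_{i=1}^{r} f_i x_i = \sum_{i=1}^{r} a f_i + \sum_{i=1}^{r} f_i d_i$$   [Summing both sides over i from 1 to r]
                 Hence, dividing by N, $\bar{x} = \frac{a}{N}\sum f_i + \frac{1}{N}\sum f_i d_i$, where $\sum f_i = N$
                 $$\bar{x} = a + \frac{1}{N}\sum f_i d_i \qquad \text{(III)}$$
                 [since $\sum f_i = N$]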
                 This method of calculation of the mean is known as the Assumed Mean Method.
                 In Example 25.7, the values xi were very large, so computing the products fi xi was tedious and
                 time consuming. Let us find the mean by the Assumed Mean Method. Let us take the assumed
                 mean a = 1200
                    Weekly wages          Number of             Deviations              fi di
                     (in ₹) (xi)        employees (fi)        di = xi – 1200
                          900                 12                  – 300               – 3600
                         1000                 13                  – 200               – 2600
                         1100                 14                  – 100               – 1400
                         1200                 13                     0                     0
                         1300                 12                    100               + 1200
                         1400                 11                    200               + 2200
                         1500                  5                    300               + 1500
                                           Σfi = 80                              Σfi di = – 2700
                         Mean = $a + \frac{1}{N}\sum f_i d_i$
                              = $1200 + \frac{1}{80}(-2700)$
                              = 1200 – 33.75 = 1166.25
                 So, the mean weekly wages = ₹ 1166.25
                 Observe that the mean is the same whether it is calculated by Direct Method or by Assumed
                 Mean Method.
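The agreement between the two methods can be verified numerically. The sketch below applies Formula (III) to the wage data of Example 25.7; the function name `assumed_mean_method` is an illustrative assumption, and the second call uses a different assumed mean (a = 1000, not used in the lesson) to suggest that the choice of a does not affect the result.

```python
def assumed_mean_method(values, freqs, a):
    """Mean via Formula (III): a + (1/N) * sum of f_i * d_i, with d_i = x_i - a."""
    n = sum(freqs)
    sum_fd = sum(f * (x - a) for x, f in zip(values, freqs))
    return a + sum_fd / n

wages = [900, 1000, 1100, 1200, 1300, 1400, 1500]
employees = [12, 13, 14, 13, 12, 11, 5]

mean_a1200 = assumed_mean_method(wages, employees, a=1200)  # assumed mean from the text
mean_a1000 = assumed_mean_method(wages, employees, a=1000)  # a different, arbitrary choice
print(mean_a1200, mean_a1000)
```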
Example 25.8: If the mean of the following data is 20.2, find the value of k
                   xi       10            15       20     25           30
                   fi       6             8        20     k            6
Solution:  Mean = $\frac{\sum f_i x_i}{\sum f_i} = \frac{60 + 120 + 400 + 25k + 180}{40 + k} = \frac{760 + 25k}{40 + k}$
          So,      $\frac{760 + 25k}{40 + k} = 20.2$ (Given)
          or       760 + 25k = 20.2 (40 + k)
          or       7600 + 250k = 8080 + 202k
          or       k = 10
                 What we can infer from this table is that there are 5 workers earning a daily wage somewhere
                 from ₹ 150 to ₹ 160 (160 not included). We do not know exactly what the earnings of
                 each of these 5 workers are.
                 Therefore, to find the mean of the grouped frequency distribution, we make the following
                 assumption:
                 Frequency in any class is centred at its class mark or mid point.
                 Now, we can say that there are 5 workers earning a daily wage of ₹ $\frac{150 + 160}{2}$ =
                 ₹ 155 each, 8 workers earning a daily wage of ₹ $\frac{160 + 170}{2}$ = ₹ 165, 15 workers earning
                 a daily wage of ₹ $\frac{170 + 180}{2}$ = ₹ 175 and so on. Now we can calculate the mean of the given
                 data as follows, using Formula (II):
Mean = $\frac{\sum f_i x_i}{\sum f_i} = \frac{6960}{40} = 174$
So, the mean daily wage = ₹ 174
This method of calculating the mean of grouped data is the Direct Method.
We can also find the mean of grouped data by using Formula III, i.e., by Assumed Mean
Method as follows:
We take assumed mean a = 175
   Daily wages          Number of          Class marks      Deviations          fi di
      (in ₹)           workers (fi)            (xi)        di = xi – 175
     150-160                5                  155             – 20             – 100
     160-170                8                  165             – 10              – 80
     170-180               15                  175                0                 0
     180-190               10                  185             + 10               100
     190-200                2                  195             + 20                40
                         Σfi = 40                                           Σfi di = – 40
So, using Formula III,
                   Mean = $a + \frac{1}{N}\sum f_i d_i$
                        = $175 + \frac{1}{40}(-40)$
                        = 175 – 1 = 174
     Thus, the mean daily wage = ₹ 174.
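For grouped data, the only extra step is replacing each class by its class mark. The following sketch repeats the daily-wage calculation by both the Direct Method and the Assumed Mean Method; the variable names are illustrative assumptions.

```python
# Class intervals (lower limit, upper limit) and frequencies from the daily-wage table
classes = [(150, 160), (160, 170), (170, 180), (180, 190), (190, 200)]
workers = [5, 8, 15, 10, 2]

# Class mark = mid point of each class
class_marks = [(low + high) / 2 for low, high in classes]

n = sum(workers)

# Direct Method, Formula (II)
direct_mean = sum(f * x for x, f in zip(class_marks, workers)) / n

# Assumed Mean Method, Formula (III), with a = 175
a = 175
assumed_mean = a + sum(f * (x - a) for x, f in zip(class_marks, workers)) / n

print(class_marks, direct_mean, assumed_mean)
```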
Example 25.9: Find the mean for the following frequency distribution by (i) Direct Method,
(ii) Assumed Mean Method.
                            Class                           Frequency
                            20-40                              9
                            40-60                              11
                            60-80                              14
                           80-100                              6
                           100-120                             8
                           120-140                             15
                           140-160                             12
                            Total                              75
                 So, mean = $\frac{\sum f_i x_i}{\sum f_i} = \frac{6970}{75} = 92.93$
                     Mean = $a + \frac{1}{N}\sum f_i d_i = 90 + \frac{220}{75} = 92.93$
                 Note that mean comes out to be the same in both the methods.
                 In the table above, observe that the values in column 4 are all multiples of 20. So, if we
                 divide these values by 20, we would get smaller numbers to multiply with fi.
                 Note that 20 is also the class size of each class.
                 So, let $u_i = \frac{x_i - a}{h}$, where a is the assumed mean and h is the class size.
    Mean = $\bar{x} = a + \left(\frac{\sum f_i u_i}{\sum f_i}\right) \times h$       (IV)
    Mean = $\bar{x} = a + \left(\frac{\sum f_i u_i}{\sum f_i}\right) \times h = 90 + \frac{11}{75} \times 20$
                                        $= 90 + \frac{220}{75} = 92.93$
Calculating the mean by using Formula (IV) is known as the Step-deviation Method.
Note that the mean comes out to be the same by using the Direct Method, the Assumed Mean Method or
the Step-deviation Method.
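Formula (IV) can be checked against the frequency distribution of Example 25.9. This is a sketch under the values used in the text (a = 90, h = 20); the function name `step_deviation_mean` is an assumption made for illustration.

```python
def step_deviation_mean(class_marks, freqs, a, h):
    """Mean via Formula (IV): a + (sum of f_i * u_i / sum of f_i) * h, with u_i = (x_i - a) / h."""
    sum_fu = sum(f * (x - a) / h for x, f in zip(class_marks, freqs))
    return a + (sum_fu / sum(freqs)) * h

# Example 25.9: classes 20-40, 40-60, ..., 140-160 replaced by their class marks
class_marks = [30, 50, 70, 90, 110, 130, 150]
freqs = [9, 11, 14, 6, 8, 15, 12]

mean_25_9 = step_deviation_mean(class_marks, freqs, a=90, h=20)
print(round(mean_25_9, 2))
```

Dividing the deviations by the class size keeps the intermediate products small, which is the whole point of the method when working by hand.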
Example 25.10: Calculate the mean daily wage from the following distribution by using
the Step deviation method.
Daily wages (in ₹)          150-160       160-170      170-180     180-190    190-200
Number of workers              5               8           15          10          2
                     Mean daily wages = $a + \left(\frac{\sum f_i u_i}{\sum f_i}\right) \times h = 175 + \frac{-4}{40} \times 10$ = ₹ 174
                 Note: Here again, the mean is the same whether it is calculated using the Direct
                 Method, the Assumed Mean Method or the Step deviation Method.
    Calculate mean weekly cost of living index by using Step deviation Method.
4. Find the mean of the following data by using (i) Assumed Mean Method and (ii) Step
   deviation Method.
    Class            150-200     200-250    250-300      300-350 350-400
    Frequency          48           32          35          20          10
 25.2 MEDIAN
In an office there are 5 employees: a supervisor and 4 workers. The workers draw
salaries of ₹ 5000, ₹ 6500, ₹ 7500 and ₹ 8000 per month while the supervisor gets
₹ 20000 per month.
In this case mean (salary) = ₹ $\frac{5000 + 6500 + 7500 + 8000 + 20000}{5}$
                           = ₹ $\frac{47000}{5}$ = ₹ 9400
Note that 4 out of 5 employees have salaries much less than ₹ 9400. The mean salary of
₹ 9400 does not give even an approximate estimate of any one of their salaries.
This is a weakness of the mean. It is affected by the extreme values of the observations in
the data.
This weakness of the mean drives us to look for another average which is unaffected by a few
extreme values. Median is one such measure of central tendency.
Median is a measure of central tendency which gives the value of the middle-
most observation in the data when the data is arranged in ascending (or descending)
order.
(ii) When the number of observations (n) is odd, the median is the value of the $\left(\frac{n+1}{2}\right)$th
     observation.
(iii) When the number of observations (n) is even, the median is the mean of the $\left(\frac{n}{2}\right)$th
     and $\left(\frac{n}{2} + 1\right)$th observations.
                 Let us illustrate this with the help of some examples.
                 Example 25.11: The weights (in kg) of 15 dogs are as follows:
                     9, 26, 10 , 22, 36, 13, 20, 20, 10, 21, 25, 16, 12, 14, 19
                 Find the median weight.
                 Solution: Let us arrange the data in ascending order:
                         9, 10, 10, 12, 13, 14, 16, 19, 20, 20, 21, 22, 25, 26, 36
                 Here, number of observations = 15
                 So, the median will be the $\left(\frac{n+1}{2}\right)$th, i.e., $\left(\frac{15+1}{2}\right)$th, i.e., 8th observation, which is 19 kg.
                 Remark: The median weight 19 kg conveys the information that about half of the dogs have weights
                 less than 19 kg and about half have weights more than 19 kg.
                 Example 25.12: The points scored by a basket ball team in a series of matches are as
                 follows:
                         16, 1, 6, 26, 14, 4, 13, 8, 9, 23, 47, 9, 7, 8, 17, 28
                 Find the median of the data.
                 Solution: Here number of observations = 16
                 So, the median will be the mean of the $\left(\frac{16}{2}\right)$th and $\left(\frac{16}{2} + 1\right)$th, i.e., the mean of the 8th and 9th
                 observations, when the data is arranged in ascending (or descending) order as:
                     1, 4, 6, 7, 8, 8, 9, 9, 13, 14, 16, 17, 23, 26, 28, 47
                 Here the 8th term is 9 and the 9th term is 13.
                 So, the median = $\frac{9 + 13}{2}$ = 11
                 Remark: Here again the median 11 conveys the information that the values of 50% of the
                 observations are less than 11 and the values of 50% of the observations are more than 11.
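The odd and even cases of the rule for the median can be combined in one small function. The sketch below is checked against Examples 25.11 and 25.12; the function name `median` and the variable names are illustrative choices.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th observation (1-based) sits at index mid (0-based)
        return ordered[mid]
    # n even: mean of the (n / 2)th and (n / 2 + 1)th observations
    return (ordered[mid - 1] + ordered[mid]) / 2

dog_weights = [9, 26, 10, 22, 36, 13, 20, 20, 10, 21, 25, 16, 12, 14, 19]   # Example 25.11
points = [16, 1, 6, 26, 14, 4, 13, 8, 9, 23, 47, 9, 7, 8, 17, 28]           # Example 25.12

median_weight = median(dog_weights)
median_points = median(points)
print(median_weight, median_points)
```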
Here n = 35, which is odd. So, the median will be the $\left(\frac{n+1}{2}\right)$th, i.e., $\left(\frac{35+1}{2}\right)$th, i.e., 18th
observation.
To find value of 18th observation, we prepare cumulative frequency table as follows:
      Marks obtained           Number of students                  Cumulative frequency
               3                            4                               4
               5                            6                              10
               6                            5                              15
               7                            3                              18
              10                            1                              19
               11                           7                              26
              12                            3                              29
              13                            2                              31
              14                            3                              34
              15                            1                              35
 From the table above, we see that 18th observation is 7
 So, Median = 7
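The cumulative-frequency lookup used above can be sketched in code: the k-th observation is the first value whose cumulative frequency reaches k. The function names below are hypothetical, and the values are assumed to be listed in ascending order, as in the table.

```python
from itertools import accumulate

def frequency_median(values, freqs):
    """Median of a frequency table whose values are listed in ascending order."""
    cumulative = list(accumulate(freqs))
    n = cumulative[-1]

    def kth_observation(k):
        # First value whose cumulative frequency is at least k
        for value, cf in zip(values, cumulative):
            if cf >= k:
                return value

    if n % 2 == 1:
        return kth_observation((n + 1) // 2)
    return (kth_observation(n // 2) + kth_observation(n // 2 + 1)) / 2

marks = [3, 5, 6, 7, 10, 11, 12, 13, 14, 15]
students = [4, 6, 5, 3, 1, 7, 3, 2, 3, 1]

median_marks = frequency_median(marks, students)
print(list(accumulate(students)), median_marks)
```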
Example 25.14: Find the median of the following data:
 Weight (in kg)          40        41       42       43       44      45        46        48
 Number of               2         5        7        8        13      26        6         3
 students
                 Since n = 70 is even, the median will be the mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th observations,
                 i.e., the 35th and 36th observations. From the table, we see that
                           35th observation is 44
                 and       36th observation is 45
                 So,       Median = $\frac{44 + 45}{2}$ = 44.5
    (b)     xi    5      10      15      20       25      30      35       40
            fi    3      7       12      20       28      31      28       26
 25.3 MODE
Look at the following example:
A company produces readymade shirts of different sizes. The company kept record of its
sale for one week which is given below:
From the table, we see that the sales of shirts of size 105 cm is maximum. So, the company
will go ahead producing this size in the largest number. Here, 105 is nothing but the mode
of the data. Mode is also one of the measures of central tendency.
The observation that occurs most frequently in the data is called mode of the
data.
In other words, the observation with maximum frequency is called mode of the data.
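The definition translates directly into code: count how often each observation occurs and pick the one(s) with the largest count. The sketch below reuses data from Examples 25.11 and 25.14 of this lesson; the function names are illustrative, and returning a list (to allow for ties) is a design choice made here, not something the lesson prescribes.

```python
from collections import Counter

def modes_of_raw_data(values):
    """All observations that occur with the maximum frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

def modes_of_frequency_table(values, freqs):
    """All values of a frequency table that have the maximum frequency."""
    top = max(freqs)
    return [value for value, f in zip(values, freqs) if f == top]

# Raw data: weights of 15 dogs (Example 25.11); 10 and 20 each occur twice
dog_weights = [9, 26, 10, 22, 36, 13, 20, 20, 10, 21, 25, 16, 12, 14, 19]
raw_modes = modes_of_raw_data(dog_weights)

# Frequency table: weights of students (Example 25.14)
weights = [40, 41, 42, 43, 44, 45, 46, 48]
students = [2, 5, 7, 8, 13, 26, 6, 3]
table_modes = modes_of_frequency_table(weights, students)

print(raw_modes, table_modes)
```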
The readymade garments and shoe industries, etc., make use of this measure of central
tendency. Based on the mode of the demand data, these industries decide which size of the
product should be produced in large numbers to meet the market demand.
In the case of raw data, it is easy to pick out the mode by just looking at the data. Let us consider
the following example:
            $\frac{n}{2} = \frac{50}{2} = 25$, so the median class is the 3rd class
     So,     F = 22, $f_m$ = 12, $L_m$ = 21.5 and i = 10
Therefore,
                      Median = $L_m + \left(\frac{\frac{n}{2} - F}{f_m}\right) i$
                             = $21.5 + \left(\frac{25 - 22}{12}\right) \times 10$
                             = 24
Thus, 25 persons take less than 24 minutes to travel to work and another 25 persons
take more than 24 minutes to travel to work.
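The grouped-data median formula used above can be written as a small function. The sketch below plugs in the values substituted in the worked computation (lower boundary 21.5, n = 50, F = 22, f_m = 12, i = 10); the function and parameter names are assumptions made for illustration.

```python
def grouped_median(lower_boundary, n, cum_freq_before, class_freq, class_width):
    """Median = L_m + ((n/2 - F) / f_m) * i for the class containing the (n/2)th observation."""
    return lower_boundary + ((n / 2 - cum_freq_before) / class_freq) * class_width

median_time = grouped_median(
    lower_boundary=21.5,   # L_m as substituted in the worked computation
    n=50,                  # total number of observations
    cum_freq_before=22,    # F: cumulative frequency before the median class
    class_freq=12,         # f_m: frequency of the median class
    class_width=10,        # i: class size
)
print(median_time)
```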
What is Mode of Grouped Data?
Mode is a measure of central tendency that identifies the central position of a data set with a
single number. For ungrouped data, the mode is simply the item with the highest frequency. For
grouped data, the mode is derived using a formula.
Empirical Formula for Mean, Median and Mode
For a moderately skewed frequency distribution, there exists a relationship between mean, median
and mode which is given as below:
                                    Mode = 3 Median – 2 Mean
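The empirical relation is simple enough to evaluate directly. The sketch below applies it to a mean of 44 and a median of 43; the function name `empirical_mode` is an illustrative choice, and the result is only an estimate that presumes a moderately skewed distribution.

```python
def empirical_mode(mean, median):
    """Estimate of the mode from the empirical relation Mode = 3 Median - 2 Mean."""
    return 3 * median - 2 * mean

mode_estimate = empirical_mode(mean=44, median=43)
print(mode_estimate)
```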
Example: For a given distribution the values of mean and median are 44 and 43 respectively.
Find the value of mode.
Solution:
     We know, Mode = 3 Median – 2 Mean = 3 × 43 – 2 × 44 = 129 – 88 = 41
20-30 15
30-40 12
40-50 5
Solution:
     Modal class = 20 – 30
     ⇒ Mode = 20 + 10 × {(15 – 8)/(2 × 15 – 8 – 12)}
     ⇒ Mode = 20 + 10 × {7/10}
     ⇒ Mode = 20 + 7 = 27
     Therefore, Mode = 27
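The substitution above follows the pattern Mode = l + h × (f1 – f0)/(2 f1 – f0 – f2), where l is the lower limit of the modal class, h the class size, f1 the frequency of the modal class, and f0 and f2 the frequencies of the classes just before and after it. The sketch below reproduces the calculation; the symbol and parameter names are assumptions, and f0 = 8 is read from the worked line since the corresponding table row is not shown.

```python
def grouped_mode(lower_limit, class_width, f_modal, f_prev, f_next):
    """Mode of grouped data: l + h * (f1 - f0) / (2*f1 - f0 - f2)."""
    return lower_limit + class_width * (f_modal - f_prev) / (2 * f_modal - f_prev - f_next)

# Modal class 20-30 with frequency 15; neighbouring frequencies 8 (before) and 12 (after)
mode_value = grouped_mode(lower_limit=20, class_width=10, f_modal=15, f_prev=8, f_next=12)
print(mode_value)
```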
where,
L is the largest value in the Distribution, S is the smallest value in the Distribution
Example: Find the range of the data set 10, 20, 15, 0, 100.
Solution:
            • Smallest Value in the data = 0
            • Largest Value in the data = 100
R = 100 – 0
R = 100
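As a quick check of this example, the range is the difference between the largest and smallest values. The function name `value_range` is a hypothetical name used only in this sketch.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

data = [10, 20, 15, 0, 100]
r = value_range(data)
print(r)
```

Because it depends only on the two extreme values, the range changes sharply if a single outlier such as 100 is removed.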
Range = 61 – 17 = 44
Range = 40 – 0
Range = 40
Examples:
where,
Solution:
     Step 1: The given data is unsorted, so sort it in ascending order.
10,20,25,30,40,50,70
Q1 = ((n+1)/4)th term
   = ((7+1)/4)th term
   = (8/4)th term
   = 2nd term
2nd term is 20
So Quartile 1 (Q1) = 20
Q3 = ((3×(n+1))/4)th term
   = ((3×(7+1))/4)th term
   = ((3×8)/4)th term
   = (24/4)th term
   = 6th term
6th term is 50
So Quartile 3 (Q3) = 50
IQR = Q3 – Q1
    = 50 – 20
    = 30
12,22,25,26,30,45,49,55,75
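Q1 = ((n+1)/4)th term
   = ((9+1)/4)th term
   = (10/4)th term
   = 2.5th term
   = mean of the 2nd and 3rd terms
   = (22 + 25)/2
   = 47/2
So Quartile 1 (Q1) = 23.5
Q3 = ((3×(n+1))/4)th term
   = ((3×(9+1))/4)th term
   = ((3×10)/4)th term
   = (30/4)th term
   = 7.5th term
   = mean of the 7th and 8th terms
   = (49 + 55)/2
   = 104/2
   = 52
So Quartile 3 (Q3) = 52
IQR = Q3 – Q1
    = 52 – 23.5
    = 28.5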
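Both quartile examples can be reproduced with the ((n+1)/4)th-term rule. The sketch below is one possible implementation: a fractional position is resolved by linear interpolation between neighbouring terms, which for a position ending in .5 is the same as averaging the two terms as done above. Interpolation for other fractions, the helper names, and the requirement of at least 3 observations are assumptions of this sketch. Other quartile conventions exist and can give different values.

```python
def term_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    # Interpolate between the lower-th and (lower + 1)-th terms
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def quartiles_and_iqr(values):
    """Q1, Q3 and IQR using the ((n+1)/4)th and (3(n+1)/4)th terms; assumes n >= 3."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = term_at(ordered, (n + 1) / 4)
    q3 = term_at(ordered, 3 * (n + 1) / 4)
    return q1, q3, q3 - q1

first = quartiles_and_iqr([10, 20, 25, 30, 40, 50, 70])
second = quartiles_and_iqr([12, 22, 25, 26, 30, 45, 49, 55, 75])
print(first, second)
```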
We know,
      ⇒ μ = (2 + 3 + 4 + 5 + 6)/5
      ⇒ μ = 4