SMTB1402-PROBABILITY & STATISTICS
UNIT 4 CORRELATION AND REGRESSION & CURVE FITTING
CORRELATION
The word correlation combines 'co' (together) and 'relation' (connection): it describes
a connection between two quantities. Correlation is the statistical tool used to measure
the degree to which two variables are linearly related to each other,
i.e., to study the relationship between two variables.
If the quantities (X, Y) vary in such a way that a change in one variable corresponds to a
change in the other variable, then the variables X and Y are correlated.
Example: Price of crude oil and stock price of an oil producing company.
            Price of commodity and amount of demand.
            Years of Experience and Salary of Employees
            Dividend and Premium of Shares
            Population and National Income etc.
Types of Correlation:
   (i) Positive Correlation:
           If an increase in the value of one variable is accompanied by an increase in
           the value of the other variable, and vice versa. (Same Direction)
           Examples: The more time spent running on a treadmill, the more
                            calories burnt.
                           Time spent on Marketing and number of Customers.
                           Temperature and Ice cream sales.
   (ii) Negative Correlation:
           If an increase in the value of one variable is accompanied by a decrease in the
           value of the other variable, and vice versa. (Opposite Direction)
           Examples: Quantity of a commodity demanded and its price are
                               negatively correlated.
                          Tax and dividend.
   (iii) No Correlation:
           If a change in the value of one variable has no connection with the change in
           the value of the other variable.
           Examples: Shoe size and Salary
                         Weight of a person and colour of their hair.
Correlation coefficient: A correlation coefficient is a numerical measure of
the strength and direction of the linear relationship between two variables.
Properties of correlation coefficient
       The coefficient of correlation, denoted by r, lies between -1 and +1.
       When r is positive, the variables x and y increase or decrease together.
       If r = +1 then there is a perfect positive correlation.
       When r is negative, the variables x and y move in opposite directions.
       If r = -1 then there is a perfect negative correlation.
       If r = 0 then the variables are uncorrelated (there is no linear relationship).
                                   Problems
Q1: Calculate the correlation coefficient for the following heights (in inches)
of fathers (X) and their sons (Y).
                X:          65   66     67     67     68   69    70      72
                Y:          67   68     65     68     72   72    69      71
Solution:
Formula (Karl Pearson's Coefficient of Correlation)
$$ r_{XY} = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}} $$
X            Y          X²             Y²          XY
65           67         4225           4489        4355
66           68         4356           4624        4488
67           65         4489           4225        4355
67           68         4489           4624        4556
68           72         4624           5184        4896
69           72         4761           5184        4968
70           69         4900           4761        4830
72           71         5184           5041        5112
∑X=544       ∑Y=552     ∑X²=37028      ∑Y²=38132   ∑XY=37560
 Here n=8.
 Substituting the above values in the formula, we get
$$ r_{XY} = \frac{8(37560) - (544)(552)}{\sqrt{8(37028) - (544)^2}\,\sqrt{8(38132) - (552)^2}} = \frac{192}{\sqrt{288}\,\sqrt{352}} = 0.603 $$
 There is a positive correlation between X and Y.
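The substitution above can be checked numerically. The following is a minimal sketch, assuming the raw-sums form of Karl Pearson's formula; the function name pearson_r and the variable names are illustrative and not part of the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's coefficient computed from the raw sums of paired data."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)               # sum of X^2
    syy = sum(y * y for y in ys)               # sum of Y^2
    sxy = sum(x * y for x, y in zip(xs, ys))   # sum of XY
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

fathers = [65, 66, 67, 67, 68, 69, 70, 72]   # X, heights in inches
sons = [67, 68, 65, 68, 72, 72, 69, 71]      # Y, heights in inches
r = pearson_r(fathers, sons)
print(round(r, 3))   # 0.603
```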
Q2.   A computer, while calculating the correlation coefficient between x and
      y from 25 pairs of observations, obtained the following values:
      n = 25, ∑x = 125, ∑x² = 650, ∑y = 100, ∑y² = 460, ∑xy = 508.
      It was later discovered, at the time of checking, that two pairs had
      been copied down as (6,14) and (8,6) while the correct values
      were (8,12) and (6,8). Obtain the correct value of the correlation
      coefficient.
Solution:
       The corrected sums are
              ∑x  = 125 - 6 - 8 + 8 + 6 = 125
              ∑y  = 100 - 14 - 6 + 12 + 8 = 100
              ∑x² = 650 - 6² - 8² + 8² + 6² = 650
              ∑y² = 460 - 14² - 6² + 12² + 8² = 436
              ∑xy = 508 - (6×14) - (8×6) + (8×12) + (6×8) = 520
       Therefore, the correct value of the correlation coefficient is
$$ r_{XY} = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}} = \frac{(25)(520) - (125)(100)}{\sqrt{(25)(650) - (125)^2}\,\sqrt{(25)(436) - (100)^2}} = \frac{500}{(25)(30)} = 0.667 $$
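The correction step (remove the contribution of each miscopied pair, then add the contribution of each correct pair) can be sketched as follows. The variable names are illustrative assumptions; only the recorded sums and the four pairs come from the problem.

```python
from math import sqrt

# Sums as originally recorded
n, sx, sy, sxx, syy, sxy = 25, 125, 100, 650, 460, 508
wrong_pairs = [(6, 14), (8, 6)]
right_pairs = [(8, 12), (6, 8)]

for x, y in wrong_pairs:      # subtract what the miscopied pairs contributed
    sx -= x; sy -= y
    sxx -= x * x; syy -= y * y; sxy -= x * y
for x, y in right_pairs:      # add what the correct pairs contribute
    sx += x; sy += y
    sxx += x * x; syy += y * y; sxy += x * y

r = (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))
print(sx, sy, sxx, syy, sxy)   # 125 100 650 436 520
print(round(r, 3))             # 0.667
```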
                           RANK CORRELATION
Rank correlation measures the association between two qualitative characteristics
A and B when the individuals are arranged in order of merit (ranked) with respect
to each characteristic.
Examples: Honesty, Beauty, Intelligence, etc.
In general we assume that the values of the variables are exactly measurable.
In some situations, it may not be possible to give precise values for the
variables. In such cases we can use another measure of correlation
called rank correlation.
Let (xi, yi), i = 1, 2, 3, …, n be the ranks of n individuals in the group for two
characteristics A and B respectively. The correlation coefficient between the
xi and yi is called the rank correlation coefficient.
Spearman's Rank Correlation Coefficient
$$ \rho_{xy} = 1 - \frac{6\sum d_i^2}{n(n^2-1)} $$
where d_i = x_i - y_i and n is the number of pairs of observations.
Types:
              1. When ranks are given
              2. When the ranks are not given
              3. When equal ranks are given.
                                   PROBLEMS
Q1. When ranks are given:
The following are the ranks obtained by 10 students in statistics and
mathematics. To what extent is knowledge of students in statistics related to
knowledge in mathematics?
Rank in Statistics  :  1   2   3   4   5   6   7   8   9   10
Rank in Mathematics :  2   4   1   5   3   9   7   10  6   8
Solution:
              Rank in                      Rank in
              Statistics (R1)          Mathematics (R2)         d = R1 - R2    d²
                       1                        2                  -1        1
                       2                        4                  -2        4
                       3                        1                   2        4
                       4                        5                   -1       1
                       5                        3                    2       4
                       6                        9                   -3       9
                       7                        7                    0       0
                       8                       10                   -2       4
                       9                        6                    3       9
                      10                        8                    2       4
                                                                     ∑d² = 40
$$ \rho_{xy} = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 40}{10(100-1)} = 1 - \frac{240}{990} = 0.76 $$
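When the ranks are already given, the computation reduces to summing the squared rank differences. A minimal sketch, assuming no tied ranks; the function name spearman_from_ranks is illustrative.

```python
def spearman_from_ranks(r1, r2):
    """Spearman's rho from two lists of ranks (no ties assumed)."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))   # sum of d^2
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

stats_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
maths_rank = [2, 4, 1, 5, 3, 9, 7, 10, 6, 8]
rho = spearman_from_ranks(stats_rank, maths_rank)
print(round(rho, 2))   # 0.76
```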
Q2. When ranks are not given:
Calculate Spearman’s rank correlation for the following data.
 X: 53   98   95   81   75   71   59   55
 Y: 47   25   32   37   30   40   39   45
Solution:
    X        Y     Rank X       Rank Y   d = R1 - R2       d²
                    (R1)           (R2)
   53         47    8            1       7             49
   98         25    1            8      -7             49
   95         32    2            6      -4             16
   81         37    3            5      -2               4
   75         30    4            7      -3               9
   71         40    5            3       2               4
   59         39    6            4       2               4
   55         45    7            2       5              25
                                                    ∑di² = 160
$$ \rho_{XY} = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 160}{8(64-1)} = 1 - \frac{960}{504} = -0.9048 $$
There is a very high negative correlation between X and Y.
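When ranks are not given, they must first be assigned from the raw values. The sketch below assumes rank 1 goes to the largest value (the convention used in the table above) and that all values within a series are distinct; the helper names are illustrative.

```python
def ranks_descending(values):
    """Rank 1 = largest value. Assumes all values are distinct."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rho for raw data without ties."""
    r1, r2 = ranks_descending(xs), ranks_descending(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

X = [53, 98, 95, 81, 75, 71, 59, 55]
Y = [47, 25, 32, 37, 30, 40, 39, 45]
print(ranks_descending(X))          # [8, 1, 2, 3, 4, 5, 6, 7]
print(round(spearman_rho(X, Y), 4)) # -0.9048
```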
Equal Ranks:
Q3. Find the rank correlation coefficient for the following data.
     x     92 89 87 86 86 77 71 63 53 50
     y     86 83 91 77 68 85 52 82 37 57
Solution:
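Let R1 and R2 denote the ranks in x and y respectively. The value 86 occurs twice
in x and would occupy ranks 4 and 5, so each is given the average rank (4 + 5)/2 = 4.5.
     x      y      R1    R2    d=R1-R2 d²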
     92     86     1     2     -1      1
     89     83     2     4     -2      4
     87     91     3     1      2      4
     86     77     4.5   6     -1.5    2.25
     86     68     4.5   7     -2.5    6.25
     77     85     6     3      3      9.00
     71     52     7     9     -2      4.00
     63     82     8     5      3      9.00
     53     37     9     10    -1      1.00
     50     57     10    8     2       4.00
                                       ∑di² = 44.50
When ranks are tied, a correction term is added for each group of tied items:
$$ \rho_{xy} = 1 - \frac{6\left[\sum d_i^2 + \frac{m(m^2-1)}{12} + \cdots\right]}{n(n^2-1)} $$
where d = R1 - R2 and m is the number of times an item is repeated.
Here n = 10 and the item 86 is repeated twice, i.e. m = 2.
$$ \rho_{xy} = 1 - \frac{6\left[44.5 + \frac{2(2^2-1)}{12}\right]}{10 \times 99} = 1 - \frac{6(44.5 + 0.5)}{990} = 1 - \frac{6 \times 45}{990} = 0.727 $$
There is a high positive correlation between x and y.
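The tied-rank procedure (average ranks plus one correction term m(m² - 1)/12 per tied group) can be sketched as below. This is an illustration under the convention that rank 1 is the largest value; the function names are assumptions, not part of the notes.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest; tied values share the mean of the positions they occupy."""
    positions = {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_with_ties(xs, ys):
    n = len(xs)
    r1, r2 = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    # one term m(m^2 - 1)/12 for every group of m tied values in either series
    correction = sum(m * (m * m - 1) / 12
                     for series in (xs, ys)
                     for m in Counter(series).values() if m > 1)
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))

x = [92, 89, 87, 86, 86, 77, 71, 63, 53, 50]
y = [86, 83, 91, 77, 68, 85, 52, 82, 37, 57]
rho = spearman_with_ties(x, y)
print(round(rho, 3))   # 0.727
```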
                                REGRESSION
Regression is the measure of the average relationship between two or
more variables in terms of original units of data.
Example:
If sales and advertising expenditure are correlated, we can find the expected amount
of sales for a given advertising expenditure, or the expenditure needed to attain
a given amount of sales.
Lines of regression
There are two regression lines: the regression line of X on Y and the
regression line of Y on X.
The regression line of Y on X gives the most probable value of Y for a given
value of X, and the regression line of X on Y gives the most probable value
of X for a given value of Y.
Formula:
Regression Equations:
    (i) Equation of the line of regression of Y on X:
$$ y - \bar{y} = b_{yx}(x - \bar{x}), \qquad b_{yx} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} $$
   (ii) Equation of the line of regression of X on Y:
$$ x - \bar{x} = b_{xy}(y - \bar{y}), \qquad b_{xy} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2} $$
Q4. From the following data, find
      (i)   The two regression equations
      (ii) The coefficient of correlation between the marks in Economics
            and Statistics
      (iii) The most likely marks in statistics when marks in Economics are
            30
              Marks in Economics (x):   25   28   35   32   31   36   29   38   34   32
              Marks in Statistics (y):  43   46   49   41   36   32   31   30   33   39
Solution:
x       y         x-x̄            y-ȳ              (x-x̄)² (y-ȳ)²   (x-x̄)(y-ȳ)
                  = x-32         = y-38
25      43        -7             5                49     25       -35
28      46        -4             8                16     64       -32
35      49         3             11               9      121       33
32      41         0             3                0      9          0
31      36        -1             -2               1      4          2
36      32         4             -6               16     36       -24
29      31        -3             -7               9      49        21
38      30         6             -8               36     64       -48
34      33         2             -5               4      25       -10
32      39         0             1                0      1          0
Total:
320     380       0              0                140    398      -93
    Here x̄ = ∑x / n = 320 / 10 = 32  and  ȳ = ∑y / n = 380 / 10 = 38.
Equation of the line of regression of Y on X:
$$ y - \bar{y} = b_{yx}(x - \bar{x}), \qquad b_{yx} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} = \frac{-93}{140} = -0.6643 $$
Therefore y - 38 = -0.6643(x - 32)
          y = -0.6643x + 38 + 0.6643 × 32
          y = -0.6643x + 59.257
Equation of the line of regression of X on Y:
$$ x - \bar{x} = b_{xy}(y - \bar{y}), \qquad b_{xy} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2} = \frac{-93}{398} = -0.2337 $$
Therefore x - 32 = -0.2337(y - 38)
                 x = -0.2337y + 40.88
Coefficient of correlation:
       r² = byx × bxy
          = (-0.6643) × (-0.2337)
          = 0.1552
        r = ± sqrt(0.1552) = ± 0.394
Since both regression coefficients are negative, r takes the negative sign:
        r = -0.394
Now we have to find the most likely marks in Statistics (y) when the marks in
Economics (x) are 30. We use the line of regression of y on x,
i.e. y = -0.6643x + 59.257
Putting x = 30, we get y = 39.33
                     y ≈ 39
The most likely marks in Statistics (y) when the marks in Economics (x) are 30
are therefore about 39.
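The whole of this problem can be reproduced from the deviation sums. The following is a sketch under the formulas given in this section; the function name regression_lines is illustrative. Note that r is given the common sign of the two regression coefficients.

```python
def regression_lines(xs, ys):
    """Return (x_mean, y_mean, byx, bxy, r) from deviations about the means."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    byx = sxy / sxx                     # slope of the line of Y on X
    bxy = sxy / syy                     # slope of the line of X on Y
    sign = 1 if byx >= 0 else -1        # r shares the sign of byx and bxy
    r = sign * (byx * bxy) ** 0.5
    return x_mean, y_mean, byx, bxy, r

economics = [25, 28, 35, 32, 31, 36, 29, 38, 34, 32]
statistics = [43, 46, 49, 41, 36, 32, 31, 30, 33, 39]
x_mean, y_mean, byx, bxy, r = regression_lines(economics, statistics)
predicted = y_mean + byx * (30 - x_mean)   # line of Y on X evaluated at x = 30
print(round(byx, 4), round(bxy, 4), round(r, 3), round(predicted, 2))
# -0.6643 -0.2337 -0.394 39.33
```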
                              CURVE FITTING
Curve fitting is the process of constructing a curve or a mathematical function
that has the best fit to a set of points possibly subject to constraints.
For example, we measure the rainfall and the yield of fields in Tamilnadu and
represent those values by xi and yi, i = 1, 2, 3, …, n.
We would like to know whether there is any relation between x and y. The empirical
relation is written in the form of an equation y = f(x).
Most of the time we may not be able to get an exact relation but only an
approximate curve. If (xi, yi), i = 1, 2, 3, …, n are the n paired data plotted
on a graph sheet, it is possible to draw a number of smooth curves passing
close to the points. The method of finding such an approximating curve is called
curve fitting.
To fit a straight line y=a+bx
METHODS OF CURVE FITTING
   1. The graphical methods
   2. The Methods of Group Averages
   3. The Method of Least Squares.
The first one is a rough method and in the second method evaluation of
constants may vary. So we adopt another method called the method of least
squares which gives a unique set of values to the constants in the equation of
fitting curves.
METHOD OF LEAST SQUARES
The least squares method is a statistical procedure to find the best fit for a set of
data points by minimizing the sum of the squares of the residuals (the vertical
deviations of the points from the fitted curve).
TYPES OF CURVE
  1. Fitting of a straight line: y = a + bx
                        y → Dependent Variable
                        x → Independent Variable
                        a, b → Constants
      The normal equations are
      ∑y = na + b∑x
      ∑xy = a∑x + b∑x²
   2. Fitting of a parabolic curve: y = a + bx + cx²
      The normal equations are
      ∑y = na + b∑x + c∑x²
      ∑xy = a∑x + b∑x² + c∑x³
      ∑x²y = a∑x² + b∑x³ + c∑x⁴
    1. FITTING OF A STRAIGHT LINE
Let y = a + bx ….(1) be the straight line to be fitted to the given data.
Let (xi, yi), i = 1, 2, …, n be the n sets of observations. We have to select
the values of a and b for which the straight line (1) best fits the given data.
Working procedure:
     To fit the straight line y = a + bx:
     Substitute the observed set of n values in this equation.
     Form the normal equations for each constant, i.e.
       ∑y = na + b∑x
       ∑xy = a∑x + b∑x²
       which are obtained by taking ∑ on both sides of equation (1), and by
       taking ∑ on both sides after multiplying both sides of equation (1) by x.
       Remark: Summing a constant n times gives n times the constant (∑a = na).
     Solve these normal equations as simultaneous equations in a and b.
       Substitute the values of a and b in y = a + bx, which is the required line of
       best fit.
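The same normal equations also follow from calculus, which is the sense in which the line is a least squares fit. A brief sketch, where S (a symbol introduced here only for illustration) denotes the sum of squared residuals:

```latex
\begin{align*}
S(a,b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial S}{\partial a} &= -2\sum_{i=1}^{n} (y_i - a - b x_i) = 0
  \;\Longrightarrow\; \sum y = na + b\sum x \\
\frac{\partial S}{\partial b} &= -2\sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0
  \;\Longrightarrow\; \sum xy = a\sum x + b\sum x^2
\end{align*}
```

Setting both partial derivatives to zero gives exactly the two normal equations, so solving them yields the a and b that minimize S.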
          Q1. Fit a straight line of the form y = a + bx by using the method
             of least squares.
               x     :      3             7                 9       10
               y     :      168           120               72      73
       Solution:
       Let the straight line be y=a + bx ................. (1)
       The normal equations are
       ∑y = na + b∑x ……………………….…..(2) and
       ∑xy = a∑x + b∑x² ..................................... (3)
                      x            y               xy               x²
                      3           168             504               9
                      7           120             840               49
                      9            72             648               81
                     10            73             730               100
                   ∑x = 29      ∑y = 433    ∑xy = 2722        ∑x² = 239
        Therefore (2) → 433 = 4a + 29b (here n = 4) ....................(4)
                  (3) → 2722 = 29a + 239b................................... (5)
       Multiply equation (4) by 29 and equation (5) by 4:
       116a + 841b = 12557
       116a + 956b = 10888
       Subtracting the second equation from the first, we get
       -115b = 1669
       b = -14.513
       Substituting the value of b in equation (4), we get
       4a + 29(-14.513) = 433
       4a - 420.88 = 433
       4a = 853.88
       a = 213.47
       Substituting the values of a and b in equation (1),
       we get y = 213.47 - 14.513x, which is the required line of best fit.
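The two normal equations can also be solved in closed form (Cramer's rule) rather than by elimination. A minimal sketch; the function name fit_line is illustrative. Hand-worked values can differ slightly in the later decimal places when b is rounded before back-substitution.

```python
def fit_line(xs, ys):
    """Least squares line y = a + b*x from the two normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx              # determinant of the 2x2 system
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b

a, b = fit_line([3, 7, 9, 10], [168, 120, 72, 73])
print(round(a, 2), round(b, 2))   # 213.47 -14.51
```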
Q2. Fit a straight line y = a + bx, to the following data .
      Year:         1991 1992 1993 1994 1995 1996 1997
      Sales:        125 128            133 135               140 141 143
      (in lakhs)
Solution:
      Let the straight line be y=a + bx .............. (1)
      The normal equations are
      ∑y = na + b∑x ......................................... (2) and
      ∑xy = a∑x + b∑x² ................................... (3)
      Since n = 7 (odd), we take the origin at the middle year 1994, so that ∑x = 0.
         year      y(sales)       x = year - 1994       x²         xy
         1991       125                 -3                9         -375
         1992       128                 -2                4         -256
         1993       133                 -1                1         -133
         1994       135                 0                 0           0
         1995       140                 1                 1         140
         1996       141                 2                 4         282
         1997       143                 3                 9         429
         Total    ∑y = 945           ∑x = 0           ∑x² = 28    ∑xy = 87
                  i.e., ∑x = 0, ∑y = 945, ∑x² = 28, ∑xy = 87, n = 7
      Substituting the above values in the normal equations we get,
      945 = 7a + b(0) .........................(4)
      87 = a(0) + b(28) ...................... (5)
      From (4), a = 945 / 7 = 135
      From (5), b = 87 / 28 = 3.11
      Therefore the straight line trend equation is y = 135 + 3.11x, where x = year - 1994.
Q3. Fit a straight line y = a + bx to the following data.
    Year : 1971 1972 1973 1974 1975 1976
    Profit : 83          92    71      90      169 191
 Solution:
   Let the straight line be y=a+bx ............. (1)
   The normal equations are
      ∑y =n a + b∑x……………………….…..(2) and
     ∑xy = a∑x + b∑x² ................................... (3)
   Since n = 6 (even), we take the origin midway between the two middle years, i.e. at 1973.5.
     Year      y(profit)    x = year - 1973.5                x²       xy
      1971         83               -2.5               6.25         -207.5
      1972         92               -1.5               2.25        -138.0
      1973         71               -0.5               0.25          -35.5
      1974         90                0.5               0.25           45.0
      1975        169                1.5               2.25          253.5
      1976        191                2.5               6.25          477.5
      Total     ∑y = 696           ∑x = 0           ∑x² = 17.5    ∑xy = 395
The normal equations become
696 = 6a + b(0)
395 = a(0) + b(17.5)
Therefore, a = 696/6 = 116 and b = 395/17.5 = 22.57.
The straight line trend is given by y = 116 + 22.57x, where x = year - 1973.5.
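Shifting the origin to the mean year makes ∑x = 0, so the two normal equations decouple into a = ∑y / n and b = ∑xy / ∑x². A minimal sketch of this shortcut; the function name fit_trend is illustrative.

```python
def fit_trend(years, values):
    """Fit y = a + b*x with x = year - origin, where origin is the mean year."""
    n = len(years)
    origin = sum(years) / n
    xs = [yr - origin for yr in years]      # coded x values; they sum to zero
    a = sum(values) / n                     # from sum(y) = n*a when sum(x) = 0
    b = sum(x * y for x, y in zip(xs, values)) / sum(x * x for x in xs)
    return origin, a, b

origin, a, b = fit_trend(list(range(1971, 1977)), [83, 92, 71, 90, 169, 191])
print(origin, a, round(b, 2))   # 1973.5 116.0 22.57
```

For an odd number of years the mean year is the middle year itself (1994 for 1991 to 1997), and for an even number it falls midway between the two middle years.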
                   FITTING OF A PARABOLA y = a + bx + cx²
Let (xi, yi), i = 1, 2, …, n be a set of observations of two variables x and y.
Let y = a + bx + cx² be the equation which best fits the given data.
The normal equations are
      na + b∑x + c∑x² = ∑y ......................(1)
      a∑x + b∑x² + c∑x³ = ∑xy .................. (2)
      a∑x² + b∑x³ + c∑x⁴ = ∑x²y ................ (3)
They are obtained as follows:
           (i)   In y = a + bx + cx², take ∑ on both sides.
           (ii)  Multiply both sides by x and then take ∑ on both sides.
           (iii) Multiply both sides by x² and then take ∑ on both sides.
Working Procedure:
   Form the normal equations (1), (2) and (3).
   Solve these as simultaneous equations for a, b, c.
   Substitute the values of a, b, c in y = a + bx + cx², the required parabola
    of best fit.
Q1. Fit a second degree parabola y = a + bx + cx² to the following data:
          x: 1      2     3      4       5         6        7  8  9
          y: 2      6     7      8       10        11       11 10 9
Solution:
          Let the parabola be y = a + bx + cx² ……(1)
          whose normal equations are
          ∑y = na + b∑x + c∑x² ........................... (2)
          ∑xy = a∑x + b∑x² + c∑x³ ......................(3)
          ∑x²y = a∑x² + b∑x³ + c∑x⁴ ................... (4)
           x           y           x²      x³       x⁴        xy         x²y
           1           2           1       1        1         2          2
           2           6           4       8        16        12         24
           3           7           9       27       81        21         63
           4           8           16      64       256       32         128
           5           10          25      125      625       50         250
           6           11          36      216      1296      66         396
           7           11          49      343      2401      77         539
           8           10          64      512      4096      80         640
           9           9           81      729      6561      81         729
           ∑x=45       ∑y=74       ∑x²=285 ∑x³=2025 ∑x⁴=15333 ∑xy=421    ∑x²y=2771
Therefore (2) →74 = 9a + 45b + 285c .................... (5) (since n = 9)
           (3)→421 = 45a + 285b + 2025c............. (6)
           (4)→2771 = 285a + 2025b + 15333c… (7)
Eliminating a from Eqns (5) and (6) by multiplying Eqn (5) by 5 and subtracting it
from Eqn (6):
       60b + 600c = 51 ............................................................. (8)
Eliminating a from Eqns (6) and (7) by multiplying Eqn (6) by 285 and Eqn (7) by 45,
subtracting, and dividing the result by 45:
       220b + 2508c = 104.67 ................................................... (9)
       Eliminating b from Eqns (8) and (9) by multiplying Eqn (8) by 220 and Eqn (9) by
       60 and subtracting:
       → 18480c = -4939.8
       → c = -0.2673
       Substituting the value of c in Eqn (8):
       60b = 51 - 600(-0.2673)
       60b = 211.38
       → b = 3.523
      Now substituting the values of b and c in Eqn (5):
      a = -0.9283
      Therefore, y = -0.9283 + 3.523x - 0.2673x² is the required parabola.
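The three normal equations form a 3×3 linear system, which can be solved mechanically. The sketch below uses a small Gaussian elimination written out in plain Python; the function names solve_linear and fit_parabola are illustrative assumptions. Because no intermediate value is rounded, the result can differ from a hand computation in the third or fourth decimal place.

```python
def solve_linear(A, rhs):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]        # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):                         # back-substitution
        tail = sum(M[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (M[i][n] - tail) / M[i][i]
    return sol

def fit_parabola(xs, ys):
    """Least squares parabola y = a + b*x + c*x^2 via the normal equations."""
    n = len(xs)
    S = lambda p: sum(x ** p for x in xs)                      # sum of x^p
    T = lambda p: sum(x ** p * y for x, y in zip(xs, ys))      # sum of x^p * y
    A = [[n, S(1), S(2)],
         [S(1), S(2), S(3)],
         [S(2), S(3), S(4)]]
    return solve_linear(A, [T(0), T(1), T(2)])

a, b, c = fit_parabola([1, 2, 3, 4, 5, 6, 7, 8, 9],
                       [2, 6, 7, 8, 10, 11, 11, 10, 9])
print(round(a, 3), round(b, 3), round(c, 3))
```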
Q2. Fit a second degree polynomial equation to the following data.
          Year(x): 1976 1977 1978 1979 1980 1981 1982 1983 1984
          Sales(y): 50      65        70        85 82             75 65 90 95
          (in lakhs)
Solution:
          The second degree polynomial equation is
           y = a + bx + cx² .......................................... (1)
          The normal equations are
          ∑y = na + b∑x + c∑x² ........................... (2)
          ∑xy = a∑x + b∑x² + c∑x³ ......................(3)
          ∑x²y = a∑x² + b∑x³ + c∑x⁴ ................... (4)
          Here n=9.
           year    y     x=year-1980   x²     x³     x⁴    xy      x²y
          1976     50         -4       16    -64    256   -200    800
          1977     65         -3        9    -27     81   -195    585
          1978     70         -2        4     -8     16   -140    280
          1979     85         -1        1     -1      1    -85     85
          1980     82          0        0      0      0      0      0
          1981     75          1        1      1      1     75     75
          1982     65          2        4      8     16    130    260
          1983     90          3        9     27     81    270    810
          1984     95          4       16     64    256    380   1520
          Total  ∑y=677     ∑x=0     ∑x²=60 ∑x³=0 ∑x⁴=708 ∑xy=235 ∑x²y=4415
         Substituting the above values we get
                 677 = 9a + 60c ............. (5)
                 235 = 60b .................. (6)
                4415 = 60a + 708c .......... (7)
         Solving we get
                a = 77.3509, b = 3.9167, c = -0.3193
         Hence, the second degree polynomial trend equation is
                    y = 77.3509 + 3.9167x - 0.3193x², where x = year - 1980.
Q3. The prices of a commodity during 1993-98 are given below. Fit a parabola
y = a + bx + cx² to these data. Calculate the trend values. Estimate the price of
the commodity for the year 1999.
         Year:  1993  1994  1995  1996  1997  1998
         Price: 100   107   128   140   181   192
Solution:
         The required parabola is y = a + bx + cx² .................. (1)
         The normal equations are
         ∑y = na + b∑x + c∑x² ........................... (2)
         ∑xy = a∑x + b∑x² + c∑x³ ......................(3)
         ∑x²y = a∑x² + b∑x³ + c∑x⁴ ................... (4)
Year   Price y   x=year-1995.5   x²        x³         x⁴          xy       x²y
1993   100       -2.5            6.25      -15.625    39.0625     -250     625
1994   107       -1.5            2.25      -3.375     5.0625      -160.5   240.75
1995   128       -0.5            0.25      -0.125     0.0625      -64      32
1996   140        0.5            0.25       0.125     0.0625       70      35
1997   181        1.5            2.25       3.375     5.0625       271.5   407.25
1998   192        2.5            6.25       15.625    39.0625      480     1200
Total  ∑y=848    ∑x=0            ∑x²=17.5  ∑x³=0      ∑x⁴=88.375  ∑xy=347  ∑x²y=2540
The normal equations are
         848 = 6a + 17.5c ............. (5)
         347 = 17.5b ................. (6)
         2540 = 17.5a + 88.375c ...... (7)
Solving,
         a = 136.12, b = 19.83, c = 1.786
Hence the required parabola is y = 136.12 + 19.83x + 1.786x², where x = year - 1995.5.
The trend values are obtained by substituting each value of x in this equation.
For the year 1999, x = 1999 - 1995.5 = 3.5
Therefore, Price in 1999 = 136.12 + (19.83 × 3.5) + (1.786 × 3.5 × 3.5)
                         = Rs. 227.40
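With the origin at 1995.5 both ∑x and ∑x³ vanish, so b comes from one equation alone and a and c from a 2×2 system. The sketch below uses that shortcut to compute the coefficients, the trend values for 1993 to 1998, and the 1999 estimate; the variable names are illustrative. Unrounded coefficients give an estimate of about 227.40.

```python
years = [1993, 1994, 1995, 1996, 1997, 1998]
prices = [100, 107, 128, 140, 181, 192]
origin = 1995.5
xs = [yr - origin for yr in years]          # -2.5, -1.5, ..., 2.5

n = len(xs)
s2 = sum(x ** 2 for x in xs)                # sum of x^2
s4 = sum(x ** 4 for x in xs)                # sum of x^4
sy = sum(prices)
sxy = sum(x * y for x, y in zip(xs, prices))
sx2y = sum(x * x * y for x, y in zip(xs, prices))

b = sxy / s2                                # from sum(xy) = b * sum(x^2)
det = n * s4 - s2 * s2                      # 2x2 system in a and c
a = (sy * s4 - s2 * sx2y) / det
c = (n * sx2y - s2 * sy) / det

trend = [a + b * x + c * x * x for x in xs] # fitted price for each observed year
x_1999 = 1999 - origin
forecast_1999 = a + b * x_1999 + c * x_1999 ** 2
print(round(a, 3), round(b, 2), round(c, 3))
print([round(t, 2) for t in trend])
print(round(forecast_1999, 2))
```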
        (2)→4a + 32b + 16c = 56
       (3)→32a + 290b + 159c = 505
       (4)→16a + 159b + 94c = 276
       Solving we get a = 0.6444, b = 1.661, c = 0.0169.
Therefore the required equation is
y = a + bx1 + cx2
y = 0.6444 + 1.661x1 + 0.0169x2.