Curve Fitting & Correlation Techniques
Statistical Techniques: Curve fitting by method of least squares: $y = a + bx$, $y = a + bx + cx^2$ and
$y = ab^x$. Correlation–Karl Pearson’s coefficient of correlation, Regression analysis–lines of
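Let $(x_i, y_i)$, $i = 1, 2, \dots, n$, be the given data, and let $E$ denote the sum of the squares of the deviations of the observed values $y_i$ from the values given by the fitted curve.
The curve of best fit is that for which $E$ is minimum. This is called the Principle of Least Squares (PLS).
For the straight line $y = a + bx$ (Eq. (i)), the principle requires that
$$E = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
is minimum. Differentiating $E$ partially with respect to $a$ and $b$, and equating to zero, we get
$$\frac{\partial E}{\partial a} = 0 \quad \text{and} \quad \frac{\partial E}{\partial b} = 0.$$
From $\frac{\partial E}{\partial a} = 0$:
$$
\begin{aligned}
2\sum_{i=1}^{n} (y_i - a - b x_i)(-1) &= 0 \\
\sum_{i=1}^{n} (y_i - a - b x_i) &= 0 \\
\sum_{i=1}^{n} y_i &= \sum_{i=1}^{n} a + b\sum_{i=1}^{n} x_i \\
\sum_{i=1}^{n} y_i &= na + b\sum_{i=1}^{n} x_i
\end{aligned}
$$
From $\frac{\partial E}{\partial b} = 0$:
$$
\begin{aligned}
2\sum_{i=1}^{n} (y_i - a - b x_i)(-x_i) &= 0 \\
\sum_{i=1}^{n} (y_i x_i - a x_i - b x_i^2) &= 0 \\
\sum_{i=1}^{n} x_i y_i &= a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2
\end{aligned}
$$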
Thus the two unknown parameters $a$ and $b$ of Eq. (i) are determined from the two equations
$$\sum y = na + b\sum x \qquad \text{(iii)}$$
$$\sum xy = a\sum x + b\sum x^2 \qquad \text{(iv)}$$
Equations (iii) and (iv) are known as the “normal equations” for fitting a straight line $y = a + bx$.
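The two normal equations form a 2×2 linear system in $a$ and $b$, so they can be solved in closed form. The following is a minimal sketch in plain Python; the function name `fit_line` and the sample data are illustrative assumptions, not part of the notes.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by solving the two normal equations directly."""
    n = len(xs)
    sx = sum(xs)                                # sum of x
    sy = sum(ys)                                # sum of y
    sxy = sum(x * y for x, y in zip(xs, ys))    # sum of x*y
    sxx = sum(x * x for x in xs)                # sum of x^2
    # Normal equations:  sy = n*a + b*sx   and   sxy = a*sx + b*sxx
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are equal; the line is not determined")
    b = (n * sxy - sx * sy) / det
    a = (sy - b * sx) / n
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Because the sample points lie exactly on a line, the fitted coefficients reproduce that line; with noisy data the same function returns the least-squares estimates.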
Note: If $y = a + bx$ is the straight line to be fitted, then the normal equations are
$$\sum y = na + b\sum x \quad \text{and} \quad \sum xy = a\sum x + b\sum x^2.$$
3.2 Fitting a quadratic curve (parabola) by the method of least squares
Assume that the curve to be fitted is the parabola $y = a + bx + cx^2$.
Approximate the data according to PLS. Then the three unknown parameters $a$, $b$, $c$ are
determined from the following three normal equations, obtained in the same way as above:
$$\sum y = na + b\sum x + c\sum x^2 \qquad \text{(iii)}$$
$$\sum xy = a\sum x + b\sum x^2 + c\sum x^3 \qquad \text{(iv)}$$
$$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4 \qquad \text{(v)}$$
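The three normal equations for the parabola are a 3×3 linear system in $a$, $b$, $c$. A minimal sketch of setting up and solving that system with NumPy follows; `fit_parabola` and the sample data are hypothetical names and values chosen for illustration.

```python
import numpy as np

def fit_parabola(xs, ys):
    """Fit y = a + b*x + c*x^2 by solving the three normal equations."""
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)
    n = len(x)
    # Power sums that appear in the normal equations
    s1, s2, s3, s4 = x.sum(), (x**2).sum(), (x**3).sum(), (x**4).sum()
    coeff = np.array([
        [n,  s1, s2],   # sum y     = n*a  + b*s1 + c*s2
        [s1, s2, s3],   # sum x*y   = a*s1 + b*s2 + c*s3
        [s2, s3, s4],   # sum x^2*y = a*s2 + b*s3 + c*s4
    ])
    rhs = np.array([y.sum(), (x * y).sum(), (x**2 * y).sum()])
    a, b, c = np.linalg.solve(coeff, rhs)
    return a, b, c

# Hypothetical data lying exactly on y = 1 + 2x + 3x^2
a, b, c = fit_parabola([0, 1, 2, 3], [1, 6, 17, 34])
print(a, b, c)
```

For large or widely spread $x$ values the power sums grow quickly, which is one practical motivation for shifting the origin before fitting.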
3.3 Fitting a nonlinear curve by least squares
Assume that $y = ab^x$.
Taking logarithms on both sides, we get
$$\log y = \log a + x\log b$$
$$Y = A + BX \qquad \text{(i)}$$
where $Y = \log y$, $A = \log a$, $B = \log b$ and $X = x$.
Equation (i) is a linear equation in $Y$ and $X$. For estimating $A$ and $B$, the normal equations are
$$\sum Y = nA + B\sum X \quad \text{and} \quad \sum XY = A\sum X + B\sum X^2$$
where $n$ is the number of pairs of values of $x$ and $y$.
Finally, $a = \operatorname{antilog}(A)$ and $b = \operatorname{antilog}(B)$.
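The log transformation above turns the exponential fit into a straight-line fit in $(X, Y)$. A minimal sketch, assuming base-10 logarithms and strictly positive $y$ values; the function name `fit_exponential` and the data are illustrative only.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * b**x via the linear normal equations in Y = log10(y)."""
    if not all(y > 0 for y in ys):
        raise ValueError("y values must be positive to take logarithms")
    X = list(xs)
    Y = [math.log10(y) for y in ys]
    n = len(X)
    sX, sY = sum(X), sum(Y)
    sXY = sum(x * y for x, y in zip(X, Y))
    sXX = sum(x * x for x in X)
    # Solve  sY = n*A + B*sX  and  sXY = A*sX + B*sXX
    B = (n * sXY - sX * sY) / (n * sXX - sX * sX)
    A = (sY - B * sX) / n
    # a = antilog(A), b = antilog(B)
    return 10 ** A, 10 ** B

# Hypothetical data lying exactly on y = 2 * 3**x
a, b = fit_exponential([0, 1, 2, 3], [2, 6, 18, 54])
print(a, b)
```

Note that this minimises the squared error in $\log y$ rather than in $y$ itself, which is what the transformed normal equations imply.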
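Example 1. By the method of least squares, find a straight line that best fits the following data
points:
           x      0      1       2      3       4
           y     1.0    2.9     4.8    6.7     8.6
Solution: Let the line of best fit be $y = a + bx$ (i). The normal equations are
$$\sum y = na + b\sum x \qquad \text{(ii)}$$
$$\sum xy = a\sum x + b\sum x^2 \qquad \text{(iii)}$$
             x        y        xy       x^2
             0       1.0       0         0
             1       2.9       2.9       1
             2       4.8       9.6       4
             3       6.7      20.1       9
             4       8.6      34.4      16
Totals: $\sum x = 10$, $\sum y = 24$, $\sum xy = 67$, $\sum x^2 = 30$.
Here $n = 5$ (number of pairs).
The normal equations are
$$24 = 5a + 10b \qquad \text{(iv)}$$
$$67 = 10a + 30b \qquad \text{(v)}$$
Solving (iv) and (v), we get $a = 1$ and $b = 1.9$.
Substituting in Eq. (i), the line of best fit is $y = 1 + 1.9x$.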
Example 2. By the method of least squares, find a straight line that best fits the following data
points:
           x      1      2       3      4       5
           y      14    27      40     55      68
$$\sum xy = a\sum x + b\sum x^2 \qquad \text{(iii)}$$
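As a cross-check on this example, the data can be fitted with NumPy's `polyfit`, which performs a least-squares polynomial fit and is used here in place of solving the normal equations by hand. This is a sketch only; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([14, 27, 40, 55, 68], dtype=float)

# polyfit returns coefficients from highest degree to lowest: [slope, intercept]
b, a = np.polyfit(x, y, 1)

# The same quantities from the normal equations, using the column sums
n = len(x)
b_normal = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum()**2)
a_normal = (y.sum() - b_normal * x.sum()) / n

print(a, b)
print(a_normal, b_normal)
```

For these data the sums are $\sum x = 15$, $\sum y = 204$, $\sum xy = 748$, $\sum x^2 = 55$, so both routes give a slope of 13.6 and an intercept of zero up to rounding, i.e. the fitted line is approximately $y = 13.6x$.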
Example 3. If $P$ is the pull required to lift a load $W$ by means of a pulley block, find a linear law of
the form $P = c + mW$ connecting $P$ and $W$, using the following data:
             P     12     15      21      25
             W     50     70     100     120
where $P$ and $W$ are taken in kg wt. Compute $P$ when $W = 150$ kg wt.
  x      y      xy      x^2     x^2 y     x^3       x^4
  1      1       1        1        1         1         1
  3      2       6        9       18        27        81
  4      4      16       16       64        64       256
  6      4      24       36      144       216      1296
  8      5      40       64      320       512      4096
  9      7      63       81      567       729      6561
 11      8      88      121      968      1331     14641
 14      9     126      196     1764      2744     38416
Totals: $\sum x = 56$, $\sum y = 40$, $\sum xy = 364$, $\sum x^2 = 524$, $\sum x^2 y = 3846$, $\sum x^3 = 5624$, $\sum x^4 = 65348$.
If the data values are equispaced (with interval, or height, $h$) and too large for convenient computation, the work may be
simplified by shifting the origin as follows:
- When the number of observations ($n$) is odd, take the origin at the middle value of the table, say $x_0$, and substitute $u = \frac{x - x_0}{h}$.
- The $y$ values, if small, may be left unchanged; otherwise they may be shifted about the average value $y_0$ of the $y$ data by substituting $v = \frac{y - y_0}{h}$.
- When the number of observations ($n$) is even, take the origin as the mean of the two middle values, with new height $\frac{h}{2}$, and substitute $u = \frac{x - x_0}{h/2}$.
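Solution: Here the number of given data points is $n = 5$ (odd) and $h = 1$, so taking $x_0 = 2$,
$$u = \frac{x - x_0}{h} = \frac{x - 2}{1} = x - 2 \quad \text{and} \quad v = y,$$
so that the parabola of fit $y = a + bx + cx^2$ (i)
becomes $v = A + Bu + Cu^2$ (ii).
The normal equations of (ii) are
$$\sum v = nA + B\sum u + C\sum u^2 \qquad \text{(iii)}$$
$$\sum uv = A\sum u + B\sum u^2 + C\sum u^3 \qquad \text{(iv)}$$
$$\sum u^2 v = A\sum u^2 + B\sum u^3 + C\sum u^4 \qquad \text{(v)}$$
 u = x - 2     v = y     u^2     u^2 v     u^3     u^4      uv
    -2          1.0       4       4.0       -8      16     -2.0
    -1          1.8       1       1.8       -1       1     -1.8
     0          1.3       0       0.0        0       0      0.0
     1          2.5       1       2.5        1       1      2.5
     2          6.3       4      25.2        8      16     12.6
Totals: $\sum u = 0$, $\sum v = 12.9$, $\sum u^2 = 10$, $\sum u^2 v = 33.5$, $\sum u^3 = 0$, $\sum u^4 = 34$, $\sum uv = 11.3$.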
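Substituting the column totals into the normal equations (iii) to (v) and solving is the remaining step. The working below is a sketch that uses only the totals from the table, with $n = 5$, $\sum v = 12.9$ and $\sum u = \sum u^3 = 0$:

$$
\begin{aligned}
12.9 &= 5A + 10C, \\
11.3 &= 10B, \\
33.5 &= 10A + 34C.
\end{aligned}
$$

The second equation gives $B = 1.13$. Subtracting twice the first equation from the third gives $7.7 = 14C$, so $C = 0.55$ and $A = (12.9 - 5.5)/5 = 1.48$. Hence $v = 1.48 + 1.13u + 0.55u^2$, and substituting $u = x - 2$, $v = y$:

$$y = 1.48 + 1.13(x - 2) + 0.55(x - 2)^2 = 1.42 - 1.07x + 0.55x^2,$$

which agrees with the fitted curve quoted in the caption of Fig. 2.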
Fig. 1. Plot of y versus x: given data. Fig. 2. Plot of y versus x: $y = 1.42 - 1.07x + 0.55x^2$.
Example 8. The pressure and volume of a gas are related by the equation $p v^{\gamma} = k$, where $\gamma$ and $k$
are constants. Fit this equation to the following data:
   x = p (kg/cm^2)     0.5     1.0     1.5     2.0     2.5     3.0
   y = v (liters)      1.62    1.00    0.75    0.62    0.52    0.46
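Taking logarithms of $p v^{\gamma} = k$ gives $\log p = \log k - \gamma \log v$, which is a straight line in $\log v$ and $\log p$, so the straight-line normal equations apply. The following is a minimal sketch of that approach on the data of this example; base-10 logarithms and the variable names are assumptions made for illustration.

```python
import math

p = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]          # pressure, kg/cm^2
v = [1.62, 1.00, 0.75, 0.62, 0.52, 0.46]    # volume, liters

# Linearised model: Y = intercept + slope * X with X = log v, Y = log p,
# where slope = -gamma and intercept = log k.
X = [math.log10(vi) for vi in v]
Y = [math.log10(pi) for pi in p]
n = len(X)
sX, sY = sum(X), sum(Y)
sXY = sum(xi * yi for xi, yi in zip(X, Y))
sXX = sum(xi * xi for xi in X)

slope = (n * sXY - sX * sY) / (n * sXX - sX * sX)
intercept = (sY - slope * sX) / n

gamma = -slope
k = 10 ** intercept
print(gamma, k)
```

The fitted law is then $p v^{\gamma} = k$ with the printed values of `gamma` and `k`; as with any log-transformed fit, the squared error is minimised in the logarithms rather than in $p$ itself.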
Example 10. By the method of least squares, find the straight line that best fits the following
data:
           x      1      2       3      4       5
           y      14    27      40     55      68
3.4 Correlation
        In a bivariate distribution, if the change in one variable affects a change in the other
variable, the variables are said to be correlated.
        If the two variables deviate in the same direction i.e., if the increase (or decrease) in one
results in a corresponding increase (or decrease) in the other, correlation is said to be direct or
positive.
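Karl Pearson’s coefficient of correlation between $x$ and $y$ is
$$r = \frac{\operatorname{Covariance}(x, y)}{\sqrt{\operatorname{Variance}(x)}\,\sqrt{\operatorname{Variance}(y)}} = \frac{\sum XY}{n\,\sigma_x \sigma_y} = \frac{\sum XY}{\sqrt{\sum X^2}\,\sqrt{\sum Y^2}} \qquad \text{(remember)}$$
where $X = x - \bar{x}$ and $Y = y - \bar{y}$; $\bar{x} = \frac{\sum x}{n}$ and $\bar{y} = \frac{\sum y}{n}$ are the means of the $x$ and $y$ series respectively; and
$$\sigma_x = \sqrt{\frac{\sum X^2}{n}} \quad \text{and} \quad \sigma_y = \sqrt{\frac{\sum Y^2}{n}}$$
are called the Standard Deviations (SD) of $x$ and $y$ respectively.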
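Alternate form:
$$r(x, y) = \frac{\sum XY}{\sqrt{\sum X^2}\,\sqrt{\sum Y^2}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}$$
That is,
$$r(x, y) = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$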
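The deviation form and the alternate (raw-sum) form of $r$ are algebraically the same quantity. A small sketch that computes both and compares them; the function names and the five data pairs are hypothetical, chosen only to illustrate the formulas.

```python
import math

def pearson_r_deviation(xs, ys):
    """r = sum(XY) / sqrt(sum(X^2) * sum(Y^2)) with X = x - mean(x), Y = y - mean(y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sXY = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sXX = sum((x - mx) ** 2 for x in xs)
    sYY = sum((y - my) ** 2 for y in ys)
    return sXY / math.sqrt(sXX * sYY)

def pearson_r_raw(xs, ys):
    """Alternate form using only the raw sums of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    return num / den

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_dev = pearson_r_deviation(xs, ys)
r_raw = pearson_r_raw(xs, ys)
print(r_dev, r_raw)
```

The raw-sum form avoids computing the means first, which is why it is convenient for hand calculation from a table of column totals.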
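Here $\bar{x} = \frac{\sum x}{n} = \frac{45}{9} = 5$, $\bar{y} = \frac{\sum y}{n} = \frac{108}{9} = 12$.
   x      y     X = x - 5    Y = y - 12     X^2     Y^2     XY
   9     15         4             3          16       9      12
   8     16         3             4           9      16      12
   7     14         2             2           4       4       4
   6     13         1             1           1       1       1
   5     11         0            -1           0       1       0
   4     12        -1             0           1       0       0
   3     10        -2            -2           4       4       4
   2      8        -3            -4           9      16      12
   1      9        -4            -3          16       9      12
Totals: $\sum x = 45$, $\sum y = 108$, $\sum X^2 = 60$, $\sum Y^2 = 60$, $\sum XY = 57$.
The Karl Pearson coefficient of correlation is
$$r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} = \frac{57}{\sqrt{60 \times 60}} = \frac{57}{60} = 0.95.$$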
Here $\bar{x} = \frac{\sum x}{n} = \frac{990}{10} = 99$, $\bar{y} = \frac{\sum y}{n} = \frac{980}{10} = 98$.
Example 4. Find the coefficient of correlation between the values of x and y (using the alternate
form):
x      1      3      5        7       8        10
y      8      12     15       17      18       20
Solution: Here, $n = 6$.
   x       y      x^2      y^2      xy
   1       8        1       64       8
   3      12        9      144      36
   5      15       25      225      75
   7      17       49      289     119
   8      18       64      324     144
  10      20      100      400     200
Totals: $\sum x = 34$, $\sum y = 90$, $\sum x^2 = 248$, $\sum y^2 = 1446$, $\sum xy = 582$.
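The column totals of this table can be substituted directly into the alternate form of $r$. A minimal sketch of that substitution, using only the totals listed above (the variable names are illustrative):

```python
import math

n = 6
sum_x, sum_y = 34, 90
sum_x2, sum_y2, sum_xy = 248, 1446, 582

# Alternate form: r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))
num = n * sum_xy - sum_x * sum_y                  # 3492 - 3060 = 432
den = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
r = num / den
print(round(r, 3))  # 0.988
```

The value close to 1 reflects the strong positive association visible in the data, where $y$ rises steadily with $x$.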
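With $n = 6$, $\sum x = 3$, $\sum y = 3$, $\sum xy = 12$, $\sum x^2 = 19$ and $\sum y^2 = 19$, the alternate form gives
$$r(x, y) = \frac{6(12) - (3)(3)}{\sqrt{6(19) - 3^2}\,\sqrt{6(19) - 3^2}} = \frac{63}{\sqrt{105}\,\sqrt{105}} = \frac{63}{105} = \frac{3}{5}.$$
Therefore, $r(x, y) = 0.6$.
Example 6. Establish the formula
$$r = \frac{\sigma_x^2 + \sigma_y^2 - \sigma_{x-y}^2}{2\sigma_x \sigma_y},$$
where $r$ is the correlation coefficient between x and y. Using the above formula, calculate the
coefficient of correlation from the following data:
x:      21       23       30     54       57     58      72      78      87      90
y:      60       71       72     83       110    84      100     92      113     135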
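The formula can be established from the definitions of variance and of $r$. The following derivation is a sketch, writing $z = x - y$, $X = x - \bar{x}$ and $Y = y - \bar{y}$ (the symbol $z$ is introduced here only for convenience). Since $\bar{z} = \bar{x} - \bar{y}$, we have $z - \bar{z} = X - Y$, so

$$
\begin{aligned}
\sigma_{x-y}^2 &= \frac{1}{n}\sum (X - Y)^2 \\
&= \frac{1}{n}\sum X^2 + \frac{1}{n}\sum Y^2 - \frac{2}{n}\sum XY \\
&= \sigma_x^2 + \sigma_y^2 - 2r\sigma_x\sigma_y,
\end{aligned}
$$

where the last step uses $\sum XY = n r \sigma_x \sigma_y$. Rearranging gives

$$r = \frac{\sigma_x^2 + \sigma_y^2 - \sigma_{x-y}^2}{2\sigma_x\sigma_y}.$$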
3.5 Regression
Regression is a statistical method, used in finance, investing and other disciplines, that attempts to
determine the strength and character of the relationship between one dependent variable (usually
denoted by y) and one or more other variables (known as independent variables, usually denoted by x), and vice
versa.
(ii) In the field of economic planning and sociological studies, projections of population, birth
rates, death rates and other similar variables are of great use.
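$$a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \bar{y} - b\bar{x}.$$
Substituting the value of $a$ in Eq. (i), we get
$$y - \bar{y} = b(x - \bar{x}) \qquad \text{(iv)}$$
Equation (iv) is called the regression line of y on x. Here $b$ is called the regression coefficient of y on x
and is usually denoted by $b_{yx}$.
Hence Eq. (iv) can be written as
$$y - \bar{y} = b_{yx}(x - \bar{x})$$
$$\sum (x - \bar{x})(y - \bar{y}) = a\sum (x - \bar{x}) + b\sum (x - \bar{x})^2 \qquad \text{(v)}$$
We know that $\sum (x - \bar{x}) = 0$,
$$\sigma_x^2 = \frac{\sum X^2}{n} = \frac{\sum (x - \bar{x})^2}{n} \;\Rightarrow\; \sum (x - \bar{x})^2 = n\sigma_x^2,$$
and
$$r = \frac{\sum XY}{\sqrt{\sum X^2}\,\sqrt{\sum Y^2}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n\sigma_x\sigma_y} \;\Rightarrow\; \sum (x - \bar{x})(y - \bar{y}) = rn\sigma_x\sigma_y.$$
Equation (v) reduces to
$$rn\sigma_x\sigma_y = a(0) + bn\sigma_x^2.$$
Therefore $b = r\dfrac{\sigma_y}{\sigma_x}$.
That is, $b_{yx} = r\dfrac{\sigma_y}{\sigma_x}$ is called the regression coefficient (slope of the line of regression) of y on x.
Here $r$ is the coefficient of correlation, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the x and y series
respectively.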
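Note: The regression line of y on x is $y - \bar{y} = b_{yx}(x - \bar{x})$ (remember),
where $b_{yx} = r\dfrac{\sigma_y}{\sigma_x}$ is called the regression coefficient (slope of the line of regression) of y on x.
Similarly, the regression line of x on y is $x - \bar{x} = b_{xy}(y - \bar{y})$, with regression coefficient
$b_{xy} = r\dfrac{\sigma_x}{\sigma_y}$,
where the terms have their usual meanings.
Here $b_{yx}$ and $b_{xy}$ are known as the coefficients of regression and are connected by the relation
$$b_{yx} \cdot b_{xy} = r\frac{\sigma_y}{\sigma_x} \cdot r\frac{\sigma_x}{\sigma_y} = r^2.$$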
Note: If $r = 0$, the two lines of regression become $y = \bar{y}$ and $x = \bar{x}$, which are two straight lines
parallel to the x and y axes respectively and passing through the means $\bar{y}$ and $\bar{x}$. They are
mutually perpendicular.
If $r = \pm 1$, the two lines of regression coincide.
- As $\sqrt{b_{yx} \cdot b_{xy}} = r$, the coefficient of correlation ($r$) is the geometric mean of the
  two regression coefficients.
- Since $\dfrac{b_{yx} + b_{xy}}{2} \ge \sqrt{b_{yx} \cdot b_{xy}} = r$, the arithmetic mean of the two regression coefficients is
  greater than or equal to the correlation coefficient ($r$).
- If there is a perfect correlation between the two variables under consideration, then
  $b_{yx} \cdot b_{xy} = r^2 = 1$, and the two lines of regression coincide. The converse is also true, i.e. if the two
  lines of regression coincide, then there is a perfect correlation.
- Since $b_{yx} \cdot b_{xy} = r^2 \ge 0$, the signs of both regression coefficients $b_{yx}$ and $b_{xy}$ and of the
  coefficient of correlation ($r$) must be the same: either all three negative or all three positive.
- Since $b_{yx} \cdot b_{xy} = r^2 \le 1$, if one of the regression coefficients is greater than unity, the other
  must be less than unity.
- The point of intersection of the two lines of regression is $(\bar{x}, \bar{y})$, where $\bar{x}$ and $\bar{y}$ are the means of the x
  and y series respectively.
- If the two lines of regression cut each other at right angles, there is no correlation between the
  two variables; i.e., $r = 0$.
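Several of the properties listed above can be checked numerically. The sketch below computes both regression coefficients and $r$ from raw data and verifies that their product equals $r^2$ and that both lines pass through the means; the function name `regression_summary` and the data are hypothetical.

```python
import math

def regression_summary(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy, r) for paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sXY = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sXX = sum((x - mx) ** 2 for x in xs)
    sYY = sum((y - my) ** 2 for y in ys)
    r = sXY / math.sqrt(sXX * sYY)
    b_yx = sXY / sXX   # equals r * sigma_y / sigma_x
    b_xy = sXY / sYY   # equals r * sigma_x / sigma_y
    return mx, my, b_yx, b_xy, r

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
mx, my, b_yx, b_xy, r = regression_summary(xs, ys)

# Regression line of y on x, evaluated at the mean of x, returns the mean of y
y_at_mean = my + b_yx * (mx - mx)
# Regression line of x on y, evaluated at the mean of y, returns the mean of x
x_at_mean = mx + b_xy * (my - my)

print(b_yx, b_xy, r)
print(b_yx * b_xy, r ** 2)
```

For positively correlated data such as this sample, the arithmetic mean of the two coefficients is also at least $r$, in line with the property stated above.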
Example 1. The two regression equations of the variables x and y are $x = 18.13 - 0.87y$ and
$y = 11.64 - 0.54x$. Find (1) the means of x and y, (2) the coefficient of correlation between x
and y.
Solution: Given
$$x = 18.13 - 0.87y \qquad \text{(i)}$$
and
$$y = 11.64 - 0.54x \qquad \text{(ii)}$$
Since the means $\bar{x}$ and $\bar{y}$ lie on both regression lines, we have
$$\bar{x} = 18.13 - 0.87\bar{y} \qquad \text{(iii)}$$
and
$$\bar{y} = 11.64 - 0.54\bar{x} \qquad \text{(iv)}$$
(1) On solving equations (iii) and (iv), we get $\bar{x} \approx 15.09$ and $\bar{y} \approx 3.49$.
(2) The regression coefficient of y on x from Eq. (ii) is $b_{yx} = -0.54$ and the regression coefficient of x on y
from Eq. (i) is $b_{xy} = -0.87$.
Therefore, the coefficient of correlation, being the geometric mean of the two regression
coefficients, is given by $r = -\sqrt{b_{yx} \cdot b_{xy}} = -\sqrt{(0.54)(0.87)} = -\sqrt{0.4698} \approx -0.69$.
(Here the negative sign is taken, since both regression coefficients are negative.)
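A quick numerical check of this example is sketched below. It assumes both regression equations have negative slopes, as the remark about the sign indicates, i.e. $x = 18.13 - 0.87y$ and $y = 11.64 - 0.54x$; it solves the two equations simultaneously for the means and takes the signed geometric mean of the slopes. Variable names are illustrative.

```python
import math
import numpy as np

# Assumed signs: x = 18.13 - 0.87*y  and  y = 11.64 - 0.54*x.
# Rearranged as a linear system in the means (x_bar, y_bar):
#   1.00*x_bar + 0.87*y_bar = 18.13
#   0.54*x_bar + 1.00*y_bar = 11.64
coeff = np.array([[1.0, 0.87],
                  [0.54, 1.0]])
rhs = np.array([18.13, 11.64])
x_bar, y_bar = np.linalg.solve(coeff, rhs)

b_xy = -0.87   # slope of the regression line of x on y
b_yx = -0.54   # slope of the regression line of y on x
# r is the geometric mean of the coefficients and carries their common sign
r = math.copysign(math.sqrt(b_xy * b_yx), b_yx)

print(round(x_bar, 2), round(y_bar, 2), round(r, 2))
```

Substituting the computed means back into either equation is a simple way to confirm that the point $(\bar{x}, \bar{y})$ lies on both lines.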
Example 2. In a partially destroyed laboratory record, only the lines of regression of y on x
and x on y are available, as $4x - 5y + 33 = 0$ and $20x - 9y = 107$ respectively. Calculate (i) $\bar{x}$
and $\bar{y}$, and (ii) the coefficient of correlation between x and y.
Solution: Given
$$4x - 5y + 33 = 0 \qquad \text{(i)}$$
and
$$20x - 9y = 107 \qquad \text{(ii)}$$
(i) Since the means $\bar{x}$ and $\bar{y}$ lie on the two regression lines (i) and (ii), we have
$$4\bar{x} - 5\bar{y} + 33 = 0$$
$$20\bar{x} - 9\bar{y} = 107.$$
On solving these equations, we have $\bar{x} = 13$ and $\bar{y} = 17$.
(ii) The regression line of y on x from Eq. (i) is
$$y = \frac{4}{5}x + \frac{33}{5} \qquad \text{(iii)}$$
    The regression coefficient of y on x is $b_{yx} = \frac{4}{5}$.
Similarly, the regression line of x on y from Eq. (ii) is
$$x = \frac{9}{20}y + \frac{107}{20} \qquad \text{(iv)}$$
    The regression coefficient of x on y is $b_{xy} = \frac{9}{20}$.
Therefore, the coefficient of correlation, being the geometric mean of the two regression
coefficients, is given by
$$r = \sqrt{b_{yx} \cdot b_{xy}} = \sqrt{\frac{4}{5} \cdot \frac{9}{20}} = \sqrt{0.36} = 0.6.$$
(Here the positive sign is taken, since both regression coefficients are positive.)
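Example 3. The following table records the test scores made by salesmen
on an intelligence test and their weekly sales:
 Salesmen            1     2     3     4     5     6     7     8     9    10
 Test scores = x    40    70    50    60    80    50    90    40    60    60
 Sales (000) = y   2.5   6.0   4.5   5.0   4.5   2.0   5.5   3.0   4.5   3.0
Calculate the line of regression of sales (y) on test scores (x) and estimate the most probable
weekly sales volume if a salesman makes a score of 70.
Solution: Determine the regression line of sales (y) on test scores (x): $y - \bar{y} = b_{yx}(x - \bar{x})$,
where
$$b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}.$$
$\bar{y}$ = mean of y (sales) $= \frac{\sum y}{n} = \frac{40.5}{10} = 4.05$.
 Test scores = x    Sales (000) = y      xy       x^2
       40                2.5            100      1600
       70                6.0            420      4900
       50                4.5            225      2500
       60                5.0            300      3600
       80                4.5            360      6400
       50                2.0            100      2500
       90                5.5            495      8100
       40                3.0            120      1600
       60                4.5            270      3600
       60                3.0            180      3600
Totals: $\sum x = 600$, $\sum y = 40.5$, $\sum xy = 2570$, $\sum x^2 = 38400$.
Therefore,
$$b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(2570) - (600)(40.5)}{10(38400) - (600)^2} = \frac{1400}{24000} \approx 0.06.$$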
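The remaining steps of this example (forming the regression line and estimating sales at a score of 70) can be sketched in a few lines. The code below recomputes the sums from the data table and keeps the unrounded slope; the helper name `predict_sales` is hypothetical.

```python
scores = [40, 70, 50, 60, 80, 50, 90, 40, 60, 60]             # x
sales = [2.5, 6.0, 4.5, 5.0, 4.5, 2.0, 5.5, 3.0, 4.5, 3.0]    # y, in thousands

n = len(scores)
sum_x, sum_y = sum(scores), sum(sales)
sum_xy = sum(x * y for x, y in zip(scores, sales))
sum_x2 = sum(x * x for x in scores)

# Regression coefficient of y on x
b_yx = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
x_bar, y_bar = sum_x / n, sum_y / n

def predict_sales(score):
    """Regression line of y on x:  y - y_bar = b_yx * (x - x_bar)."""
    return y_bar + b_yx * (score - x_bar)

estimate = predict_sales(70)
print(round(b_yx, 4), round(estimate, 2))
```

With the unrounded slope the estimate at a score of 70 is about 4.63 (thousand); using the rounded value $b_{yx} = 0.06$ instead gives 4.65, so the small difference is purely a rounding effect.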
Example 4. The following data give the statistical values of rainfall and production of wheat in a
region for a specified time period.
                                               Mean    Standard Deviation
 Production of wheat (kg per unit area) = y     10             8
 Rainfall (cm) = x                               8             2
Estimate the production of wheat (y) when rainfall (x) is 9 cm, if the correlation coefficient between
production and rainfall is given to be 0.5.
Solution: Let the variables x and y denote rainfall and production respectively.
Given that $\bar{x} = 8$, $\sigma_x = 2$; $\bar{y} = 10$, $\sigma_y = 8$; $r = 0.5$.
Now the equation of regression of y on x is given by
$$y - \bar{y} = r\frac{\sigma_y}{\sigma_x}(x - \bar{x})$$
$$\Rightarrow\; y - 10 = \frac{(0.5)(8)}{2}(x - 8)$$
$$\Rightarrow\; y - 10 = 2(x - 8)$$
$$\Rightarrow\; y - 2x + 6 = 0$$
$$\Rightarrow\; y = 2x - 6.$$
This is the regression line of the production of wheat on the rainfall.
When the rainfall (x) is 9 cm, the production (y) of wheat is estimated to be $2(9) - 6 = 12$ kg per unit
area.
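Example 5. Find the coefficient of correlation and the lines of regression for the data
given below:
$n = 18$, $\sum x = 12$, $\sum y = 18$, $\sum x^2 = 60$, $\sum y^2 = 96$, $\sum xy = 48$.
Solution: (i) The coefficient of correlation:
$$r(x, y) = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{18(48) - (12)(18)}{\sqrt{18(60) - 12^2}\,\sqrt{18(96) - 18^2}} = 0.57.$$
Now $\bar{x} = \frac{\sum x}{n} = \frac{12}{18} = 0.67$, $\bar{y} = \frac{\sum y}{n} = \frac{18}{18} = 1$.
$$\sigma_x^2 = \frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2 = \frac{60}{18} - \left(\frac{12}{18}\right)^2 = 2.9 \;\Rightarrow\; \sigma_x = 1.7.$$
$$\sigma_y^2 = \frac{\sum y^2}{n} - \left(\frac{\sum y}{n}\right)^2 = \frac{96}{18} - \left(\frac{18}{18}\right)^2 = 4.33 \;\Rightarrow\; \sigma_y = 2.08.$$
$$b_{yx} = r\frac{\sigma_y}{\sigma_x} = (0.57)\left(\frac{2.08}{1.7}\right) = 0.7; \qquad b_{xy} = r\frac{\sigma_x}{\sigma_y} = (0.57)\left(\frac{1.7}{2.08}\right) = 0.47.$$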
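All quantities in this example follow from the six given sums, so the calculation can be sketched compactly. The code below is an illustration using only those sums; it works with unrounded intermediate values, so its results differ slightly in the last digit from the hand-rounded figures.

```python
import math

n = 18
sum_x, sum_y = 12, 18
sum_x2, sum_y2, sum_xy = 60, 96, 48

x_bar, y_bar = sum_x / n, sum_y / n
var_x = sum_x2 / n - x_bar ** 2        # sigma_x squared
var_y = sum_y2 / n - y_bar ** 2        # sigma_y squared
cov_xy = sum_xy / n - x_bar * y_bar    # covariance of x and y

r = cov_xy / math.sqrt(var_x * var_y)
b_yx = cov_xy / var_x                  # equals r * sigma_y / sigma_x
b_xy = cov_xy / var_y                  # equals r * sigma_x / sigma_y

# Lines of regression, both passing through (x_bar, y_bar):
#   y on x:  y - y_bar = b_yx * (x - x_bar)
#   x on y:  x - x_bar = b_xy * (y - y_bar)
print(round(r, 2), round(b_yx, 2), round(b_xy, 2))
print(round(x_bar, 2), round(y_bar, 2))
```

The two lines of regression are then $y - \bar{y} = b_{yx}(x - \bar{x})$ and $x - \bar{x} = b_{xy}(y - \bar{y})$ with the printed coefficients and means.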
************************************END******************************