Curve Fitting & Correlation Techniques

The document discusses curve fitting using the method of least squares. It provides 3 examples of using the method to fit straight lines and parabolas to data points. The key steps are: 1) Assume a model curve (e.g. straight line y=a+bx or parabola y=a+bx+cx^2) 2) Derive "normal equations" by minimizing the sum of squared errors 3) Solve the normal equations to determine the curve parameters (a, b, c etc.) 4) Check how well the fitted curve matches the original data points.

UNIT-3

Statistical Techniques: Curve fitting by the method of least squares: y = a + bx, y = a + bx + cx^2 and y = ab^x. Correlation – Karl Pearson's coefficient of correlation. Regression analysis – lines of regression (without proof) – problems.

3 Curve fitting: Least Squares Methods


Curve fitting is a problem that arises very frequently in science and engineering. The process of constructing an approximate curve y = f(x) which best fits a given discrete set of points (xi, yi), i = 1, 2, 3, ..., n is called curve fitting.

Principle of Least Squares:


The principle of least squares (PLS) is one of the most popular methods for finding the curve of best fit to a given data set (xi, yi), i = 1, 2, 3, ..., n.
Let y = f(x) be the equation of the curve to be fitted to the given set of points P1(x1, y1), P2(x2, y2), P3(x3, y3), ..., Pn(xn, yn).
Then the errors (or residuals) at these points are
e1 = y1 − f(x1)
e2 = y2 − f(x2)
e3 = y3 − f(x3)
……………..
en = yn − f(xn)
Squaring each error (or residual) ei and adding, we get
E = e1^2 + e2^2 + e3^2 + ... + en^2 = Σ ei^2 = Σ [yi − f(xi)]^2 (sums over i = 1 to n) .....(i)
The curve of best fit is that for which E is a minimum. This is called the principle of least squares (PLS).
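The criterion can be applied numerically: for any candidate curve f, compute E and prefer the candidate with the smaller value. A minimal Python sketch (standard library only; the toy data set and both candidate lines are illustrative, not from the text):

```python
def sum_squared_error(f, points):
    """E = sum of (y_i - f(x_i))^2 over the data points."""
    return sum((y - f(x)) ** 2 for x, y in points)

pts = [(0, 1.0), (1, 3.1), (2, 4.9), (3, 7.0)]
E_good = sum_squared_error(lambda x: 1 + 2 * x, pts)   # candidate y = 1 + 2x
E_bad = sum_squared_error(lambda x: 2 + x, pts)        # candidate y = 2 + x
# The least-squares principle prefers the candidate with the smaller E
```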

Some standard approximating curves:

1. y = a + bx (straight line)
2. y = a + bx + cx^2 (parabola or quadratic curve)
3. y = ab^x (exponential curve)

3.1 Fitting a straight line by least squares


Let the straight line be y = f(x) = a + bx .....(ii)
to be fitted to the given set of data points (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn).
To determine the two unknowns a (intercept) and b (slope) in (ii), use the PLS criterion that
E = Σ ei^2 = Σ [yi − f(xi)]^2 = Σ [yi − a − bxi]^2 .....(iii)
is a minimum (all sums over i = 1 to n). Differentiating (iii) partially w.r.t. a and b, and equating to zero, we get
∂E/∂a = 0 and ∂E/∂b = 0.
∂E/∂a = 0 ⇒ Σ 2[yi − a − bxi](−1) = 0 ⇒ Σ yi = Σ a + b Σ xi = na + b Σ xi
∂E/∂b = 0 ⇒ Σ 2[yi − a − bxi](−xi) = 0 ⇒ Σ xi yi = a Σ xi + b Σ xi^2
Thus the two unknown parameters a and b of Eq.(ii) are determined from the two equations
Σy = na + b Σx .....(iv)
Σxy = a Σx + b Σx^2 .....(v)
Equations (iv) and (v) are known as the "normal equations" for fitting a straight line y = a + bx.
Note: If y = a + bx is the straight line to be fitted, then the normal equations are
Σy = na + b Σx and
Σxy = a Σx + b Σx^2.
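The two normal equations can be solved directly for a and b. A minimal Python sketch (standard library only; the data used here is the same as in Example 1 below, so the fit comes out as y = 1 + 1.9x):

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by solving the least-squares normal equations:
       sum(y)  = n*a + b*sum(x)
       sum(xy) = a*sum(x) + b*sum(x^2)"""
    n = len(xs)
    Sx = sum(xs)
    Sy = sum(ys)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    Sxx = sum(x * x for x in xs)
    # Cramer's rule on the 2x2 system
    b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
    a = (Sy - b * Sx) / n
    return a, b

a, b = fit_line([0, 1, 2, 3, 4], [1.0, 2.9, 4.8, 6.7, 8.6])
# Fitted line: y = 1.0 + 1.9 x (matches Example 1)
```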
3.2 Fitting a quadratic curve (parabola) by the method of least squares
Assume the parabola y = a + bx + cx^2.
Approximating the data according to the PLS, the three unknown parameters a, b, c are determined from the following three normal equations, obtained in a similar way as above:
Σy = na + b Σx + c Σx^2 .....(i)
Σxy = a Σx + b Σx^2 + c Σx^3 .....(ii)
Σx^2 y = a Σx^2 + b Σx^3 + c Σx^4 .....(iii)
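The three normal equations form a 3×3 linear system; a minimal Python sketch that builds and solves it by Gaussian elimination (standard library only; the synthetic test data lies exactly on y = 1 + 2x + 3x^2, so the fit recovers those coefficients):

```python
def fit_parabola(xs, ys):
    """Fit y = a + b*x + c*x^2 from the three normal equations."""
    n = len(xs)
    S = lambda k: sum(x ** k for x in xs)                      # sum of x^k
    Sy = lambda k: sum((x ** k) * y for x, y in zip(xs, ys))   # sum of x^k * y
    # Augmented matrix of the normal equations
    M = [[n,    S(1), S(2), Sy(0)],
         [S(1), S(2), S(3), Sy(1)],
         [S(2), S(3), S(4), Sy(2)]]
    # Gaussian elimination with partial pivoting, then back-substitution
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            M[r] = [mr - f * mi for mr, mi in zip(M[r], M[i])]
    coeffs = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        coeffs[i] = (M[i][3] - sum(M[i][j] * coeffs[j]
                                   for j in range(i + 1, 3))) / M[i][i]
    return coeffs  # [a, b, c]

xs = [0, 1, 2, 3, 4]
a, b, c = fit_parabola(xs, [1 + 2 * x + 3 * x ** 2 for x in xs])
```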
3.3 Fitting a nonlinear curve by least squares
Assume that y = ab^x.
Taking logarithms on both sides, we get
log y = log a + x log b
⇒ Y = A + BX .....(i)
where Y = log y, A = log a, B = log b and X = x.
Equation (i) is a linear equation in Y and X. For estimating A and B, the normal equations are
ΣY = nA + B ΣX and ΣXY = A ΣX + B ΣX^2,
where n is the number of pairs of values of x and y.
Ultimately, a = antilog(A) and b = antilog(B).
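A sketch of this log-transform procedure in Python (standard library only; the synthetic data lies exactly on y = 2·3^x, so the fitted constants come back as a = 2, b = 3):

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a*b^x by fitting the line log10(y) = log10(a) + x*log10(b)."""
    n = len(xs)
    Ys = [math.log10(y) for y in ys]
    Sx = sum(xs)
    SY = sum(Ys)
    SxY = sum(x * Y for x, Y in zip(xs, Ys))
    Sxx = sum(x * x for x in xs)
    B = (n * SxY - Sx * SY) / (n * Sxx - Sx ** 2)
    A = (SY - B * Sx) / n
    return 10 ** A, 10 ** B   # a = antilog(A), b = antilog(B)

# Data generated from y = 2 * 3^x, so the fit recovers a = 2, b = 3
a, b = fit_exponential([0, 1, 2, 3], [2 * 3 ** x for x in [0, 1, 2, 3]])
```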

Example 1 By the method of least squares, find a straight line that best fits the following data
points:
x 0 1 2 3 4
y 1.0 2.9 4.8 6.7 8.6

Solution: Let the line of best fit be y = a + bx .....(i)
where a and b are constants to be determined from the normal equations
Σy = na + b Σx .....(ii)
Σxy = a Σx + b Σx^2 .....(iii)
Calculating the values of Σx, Σy, Σxy, Σx^2 from the data:

x    y    xy    x^2
0    1.0  0     0
1    2.9  2.9   1
2    4.8  9.6   4
3    6.7  20.1  9
4    8.6  34.4  16
Σx = 10  Σy = 24  Σxy = 67  Σx^2 = 30

Here n = 5 (number of pairs).
The normal equations become 24 = 5a + 10b .....(iv)
67 = 10a + 30b .....(v)
Solving (iv) and (v), we get a = 1 and b = 1.9.
Substituting in Eq.(i), the line of best fit is y = 1 + 1.9x.

Example 2 By the method of least squares, find a straight line that best fits the following data
points:
x 1 2 3 4 5
y 14 27 40 55 68

Solution: Let the line of best fit be y = a + bx .....(i)
The normal equations are
Σy = na + b Σx .....(ii)
Σxy = a Σx + b Σx^2 .....(iii)
Calculating the values of Σx, Σy, Σxy, Σx^2 from the data:

x    y    xy    x^2
1    14   14    1
2    27   54    4
3    40   120   9
4    55   220   16
5    68   340   25
Σx = 15  Σy = 204  Σxy = 748  Σx^2 = 55

Here n = 5 (number of pairs).
The normal equations become 204 = 5a + 15b .....(iv)
748 = 15a + 55b .....(v)
Solving (iv) and (v), we get a = 0 and b = 13.6.
Substituting in Eq.(i), the line of best fit is y = 13.6x.

Example 3. If P is the pull required to lift a load W by means of a pulley block, find a linear law of the form P = c + mW connecting P and W, using the following data:
P 12 15 21 25
W 50 70 100 120
where P and W are taken in kg.wt. Compute P when W = 150 kg.wt.

Solution: The line of best fit is P = c + mW .....(i)
The corresponding normal equations are
ΣP = nc + m ΣW .....(ii)
ΣPW = c ΣW + m ΣW^2 .....(iii)

W    P   WP    W^2
50   12  600   2500
70   15  1050  4900
100  21  2100  10000
120  25  3000  14400
ΣW = 340  ΣP = 73  ΣPW = 6750  ΣW^2 = 31800

Equations (ii) and (iii) become
73 = 4c + 340m
6750 = 340c + 31800m
i.e., 2c + 170m = 36.5
34c + 3180m = 675.
On solving the above equations, we get m = 0.1879 and c = 2.2759.
Substituting in Eq.(i), the line of best fit is P = 2.2759 + 0.1879W.
When W = 150 kg.wt., P = 2.2759 + 0.1879(150) ≈ 30.46 kg.wt.

Example 4 Fit a 2nd degree parabola to the given data


x 1 3 4 6 8 9 11 14
y 1 2 4 4 5 7 8 9

Solution: Let the parabola of best fit be y = a + bx + cx^2 .....(i)
where a, b, c are constants to be determined.
The normal equations are
Σy = na + b Σx + c Σx^2 .....(ii)
Σxy = a Σx + b Σx^2 + c Σx^3 .....(iii)
Σx^2 y = a Σx^2 + b Σx^3 + c Σx^4 .....(iv)

x    y   xy   x^2  x^2 y  x^3   x^4
1    1   1    1    1      1     1
3    2   6    9    18     27    81
4    4   16   16   64     64    256
6    4   24   36   144    216   1296
8    5   40   64   320    512   4096
9    7   63   81   567    729   6561
11   8   88   121  968    1331  14641
14   9   126  196  1764   2744  38416
Σx = 56  Σy = 40  Σxy = 364  Σx^2 = 524  Σx^2 y = 3846  Σx^3 = 5624  Σx^4 = 65348

Substituting these values in Eqs.(ii)-(iv), we get
40 = 8a + 56b + 524c
364 = 56a + 524b + 5624c
3846 = 524a + 5624b + 65348c
On solving the above equations, we get
a = 0.195, b = 0.77, c = −0.009.
Substituting in (i), the parabola of best fit is y = 0.195 + 0.77x − 0.009x^2.

3.3.1 Change of Scale

If the data values are equispaced (with spacing h) and quite large for computation, the work may be simplified by shifting the origin as follows:
- When the number of observations (n) is odd, take the origin at the middle value of the table, say x0, and substitute u = (x − x0)/h.
- The y values, if small, may be left unchanged; or we can shift them to an assumed value y0 of the y data: v = (y − y0)/h.
- When the number of observations (n) is even, take the origin as the mean of the two middle values, with new spacing h/2, and substitute u = (x − x0)/(h/2).
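Shifting the origin changes only the arithmetic, not the fitted curve; a quick Python check (standard library only), using the straight-line case for brevity, with illustrative data and x0 = 2, h = 1:

```python
def fit_line(xs, ys):
    # Least-squares straight line y = a + b*x via the normal equations
    n, Sx, Sy = len(xs), sum(xs), sum(ys)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    Sxx = sum(x * x for x in xs)
    b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
    return (Sy - b * Sx) / n, b

xs = [0, 1, 2, 3, 4]
ys = [1.0, 2.9, 4.8, 6.7, 8.6]
x0, h = 2, 1                       # middle value, spacing (n odd)
us = [(x - x0) / h for x in xs]    # shifted variable u

a_direct, b_direct = fit_line(xs, ys)
A, B = fit_line(us, ys)            # fit v = A + B*u instead
# Back-substitute u = (x - x0)/h:  y = A + B*(x - x0)/h
a_back, b_back = A - B * x0 / h, B / h
```

Both routes give the same line, which is why the substitution is purely a computational convenience.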

Example 5 Fit a 2nd degree parabola to the following data:

x 0 1 2 3 4
y 1 1.8 1.3 2.5 6.3

Solution: Here the number of data points is n = 5 (odd), h = 1 and x0 = 2; then
u = (x − x0)/h = x − 2 and v = y, so that the parabola of fit y = a + bx + cx^2 .....(i)
becomes v = A + Bu + Cu^2 .....(ii)
The normal equations of (ii) are
Σv = nA + B Σu + C Σu^2 .....(iii)
Σuv = A Σu + B Σu^2 + C Σu^3 .....(iv)
Σu^2 v = A Σu^2 + B Σu^3 + C Σu^4 .....(v)

u = x − 2  v = y  u^2  u^2 v  u^3  u^4  uv
−2         1      4    4      −8   16   −2
−1         1.8    1    1.8    −1   1    −1.8
0          1.3    0    0      0    0    0
1          2.5    1    2.5    1    1    2.5
2          6.3    4    25.2   8    16   12.6
Σu = 0  Σv = 12.9  Σu^2 = 10  Σu^2 v = 33.5  Σu^3 = 0  Σu^4 = 34  Σuv = 11.3

Equations (iii)-(v) become
12.9 = 5A + 10C
11.3 = 10B
33.5 = 10A + 34C
Solving these simultaneous equations, we get
A = 1.48, B = 1.13 and C = 0.55.
Equation (ii) yields
v = A + Bu + Cu^2 = 1.48 + 1.13u + 0.55u^2
Hence y = 1.48 + 1.13(x − 2) + 0.55(x − 2)^2
i.e., y = 1.42 − 1.07x + 0.55x^2.
This is the required parabola of best fit.

[Fig.1: Plot of y versus x for the given data. Fig.2: Plot of the fitted curve y = 1.42 − 1.07x + 0.55x^2.]

Example 6 Fit a 2nd degree parabola to the following data:


x 1.0 1.5 2.0 2.5 3.0 3.5 4.0
y 1.1 1.3 1.6 2.0 2.7 3.4 4.1

Solution: The number of observations n = 7 is odd, with h = 0.5 and x0 = 2.5.
Taking u = (x − x0)/h = (x − 2.5)/0.5 = 2x − 5 and v = y, the parabola of fit
y = a + bx + cx^2 .....(i)
becomes v = A + Bu + Cu^2 .....(ii)
The normal equations are
Σv = nA + B Σu + C Σu^2 .....(iii)
Σuv = A Σu + B Σu^2 + C Σu^3 .....(iv)
Σu^2 v = A Σu^2 + B Σu^3 + C Σu^4 .....(v)

x    y = v  u = 2x − 5  u^2  u^3  u^4  uv    u^2 v
1.0  1.1    −3          9    −27  81   −3.3  9.9
1.5  1.3    −2          4    −8   16   −2.6  5.2
2.0  1.6    −1          1    −1   1    −1.6  1.6
2.5  2.0    0           0    0    0    0     0.0
3.0  2.7    1           1    1    1    2.7   2.7
3.5  3.4    2           4    8    16   6.8   13.6
4.0  4.1    3           9    27   81   12.3  36.9
Σv = 16.2  Σu = 0  Σu^2 = 28  Σu^3 = 0  Σu^4 = 196  Σuv = 14.3  Σu^2 v = 69.9

Using the table values, Eqs.(iii)-(v) reduce to
16.2 = 7A + 0 + 28C, i.e., 7A + 28C = 16.2
14.3 = 0 + 28B + 0, i.e., 28B = 14.3
69.9 = 28A + 0 + 196C, i.e., 28A + 196C = 69.9
On solving the simultaneous equations, we get
A = 2.07, B = 0.511, C = 0.061.
Equation (ii) becomes v = 2.07 + 0.511u + 0.061u^2.
Putting u = 2x − 5, y = 2.07 + 0.511(2x − 5) + 0.061(2x − 5)^2,
i.e., y = 1.04 − 0.198x + 0.244x^2,
which is the parabola of best fit.

Example 7. Fit a 2nd degree parabola for the following data:


x 1989 1990 1991 1992 1993 1994 1995 1996 1997
y 352 356 357 358 360 361 361 360 359

Solution: The number of observations n = 9 is odd, with h = 1, x0 = 1993 and y0 = 357.
Taking u = (x − x0)/h = x − 1993 and v = (y − y0)/h = y − 357, the parabola of fit
y = a + bx + cx^2 .....(i)
becomes v = A + Bu + Cu^2 .....(ii)
The normal equations are
Σv = nA + B Σu + C Σu^2 .....(iii)
Σuv = A Σu + B Σu^2 + C Σu^3 .....(iv)
Σu^2 v = A Σu^2 + B Σu^3 + C Σu^4 .....(v)

x     y    u = x − 1993  v = y − 357  u^2  u^3  u^4  uv  u^2 v
1989  352  −4            −5           16   −64  256  20  −80
1990  356  −3            −1           9    −27  81   3   −9
1991  357  −2            0            4    −8   16   0   0
1992  358  −1            1            1    −1   1    −1  1
1993  360  0             3            0    0    0    0   0
1994  361  1             4            1    1    1    4   4
1995  361  2             4            4    8    16   8   16
1996  360  3             3            9    27   81   9   27
1997  359  4             2            16   64   256  8   32
Σu = 0  Σv = 11  Σu^2 = 60  Σu^3 = 0  Σu^4 = 708  Σuv = 51  Σu^2 v = −9

Using the table values, Eqs.(iii)-(v) reduce to
11 = 9A + 60C, i.e., 9A + 60C = 11
51 = 60B, i.e., 60B = 51
−9 = 60A + 708C, i.e., 60A + 708C = −9
On solving the simultaneous equations, we get
A = 694/231, B = 17/20, C = −247/924.
Equation (ii) becomes v = 694/231 + (17/20)u − (247/924)u^2.
Substituting u = x − 1993 and v = y − 357, we get
y = 357 + 694/231 + (17/20)(x − 1993) − (247/924)(x − 1993)^2,
which is the required parabola of best fit. (Expanding the squares gives an explicit polynomial in x, but the exact fractional form above avoids heavy rounding errors in the large intermediate terms.)

Example 8. The pressure and volume of a gas are related by the equation p v^γ = k, where γ and k are constants. Fit this equation to the following data:
x = p (kg/cm^2): 0.5 1.0 1.5 2.0 2.5 3.0
y = v (litres): 1.62 1.00 0.75 0.62 0.52 0.46

Solution: Given p v^γ = k .....(i)
where γ and k are constants to be determined.
Taking logarithms, log10 p + γ log10 v = log10 k
⇒ γ log10 v = log10 k − log10 p
⇒ log10 v = (1/γ) log10 k − (1/γ) log10 p
i.e., Y = A + BX .....(ii)
where Y = log10 v, A = (1/γ) log10 k, B = −1/γ, X = log10 p.
The normal equations of (ii) are ΣY = nA + B ΣX
ΣXY = A ΣX + B ΣX^2

p    v     X = log10 p  Y = log10 v  XY       X^2
0.5  1.62  −0.3010      0.2095       −0.0630  0.0906
1.0  1.00  0.0000       0.0000       0.0000   0.0000
1.5  0.75  0.1761       −0.1249      −0.0220  0.0310
2.0  0.62  0.3010       −0.2076      −0.0625  0.0906
2.5  0.52  0.3979       −0.2840      −0.1130  0.1583
3.0  0.46  0.4771       −0.3372      −0.1609  0.2276
ΣX = 1.0511  ΣY = −0.7442  ΣXY = −0.4214  ΣX^2 = 0.5981

Here n = 6.
Substituting the values of ΣX, ΣY, ΣXY, ΣX^2 into the normal equations, we get
−0.7442 = 6A + 1.0511B
−0.4214 = 1.0511A + 0.5981B
Solving these, we get A = −0.0009 and B = −0.7030.
Now γ = −1/B = 1/0.7030 = 1.4224.
Again, A = (1/γ) log10 k ⇒ log10 k = γA = (1.4224)(−0.0009) = −0.0013
⇒ k = antilog(−0.0013) = 0.997.
Substituting the values of γ and k in Eq.(i), we get
p v^1.4224 = 0.997,
which is the required curve.
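The same log-transform fit can be sketched in Python (standard library only). The synthetic data below lies exactly on p v^2 = 4 (i.e., γ = 2, k = 4), so the fit recovers those values; the function name is illustrative:

```python
import math

def fit_gas_law(ps, vs):
    """Fit p*v^gamma = k via the line Y = A + B*X,
    where Y = log10(v), X = log10(p), B = -1/gamma, A = (1/gamma)*log10(k)."""
    n = len(ps)
    Xs = [math.log10(p) for p in ps]
    Ys = [math.log10(v) for v in vs]
    SX, SY = sum(Xs), sum(Ys)
    SXY = sum(x * y for x, y in zip(Xs, Ys))
    SXX = sum(x * x for x in Xs)
    B = (n * SXY - SX * SY) / (n * SXX - SX ** 2)
    A = (SY - B * SX) / n
    gamma = -1.0 / B
    k = 10 ** (gamma * A)
    return gamma, k

# v = (k/p)^(1/gamma) with gamma = 2, k = 4, i.e. v = 2/sqrt(p)
ps = [1.0, 4.0, 9.0, 16.0]
gamma, k = fit_gas_law(ps, [2.0 / math.sqrt(p) for p in ps])
```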

Example 9. An experiment gave the following data


v ( ft / min) 350 400 500 600
t (min) 61 26 7 2.6
It is known that v and t are connected by v  a t b . Find the best possible values of a and b.
Solution: Given that v = a t^b, a non-linear equation .....(i)
where a and b are constants to be determined.
Taking logarithms on both sides, we get
log10 v = log10 a + b log10 t
⇒ Y = A + BX, a linear equation,
where Y = log10 v, X = log10 t, A = log10 a, B = b.
The normal equations are ΣY = nA + B ΣX and ΣXY = A ΣX + B ΣX^2.

v    t    X = log10 t  Y = log10 v  XY     X^2
350  61   1.7853       2.5441       4.542  3.187
400  26   1.4150       2.6021       3.682  2.002
500  7    0.8451       2.6990       2.281  0.714
600  2.6  0.4150       2.7782       1.153  0.172
ΣX = 4.4604  ΣY = 10.6234  ΣXY = 11.658  ΣX^2 = 6.075

Here n = 4. The normal equations become
10.6234 = 4A + 4.4604B and 11.658 = 4.4604A + 6.075B
Solving these, A = 2.845 and B = b = −0.1697.
Now A = log10 a ⇒ a = antilog(A) = antilog(2.845) = 699.8.
Substituting the values of a and b into Eq.(i), we get
v = 699.8 t^(−0.1697).

Example 10. By the method of least squares, find the straight line that best fits the following
data:
x 1 2 3 4 5
y 14 27 40 55 68

Home work problem.

3.4 Correlation
In a bivariate distribution, if the change in one variable affects a change in the other
variable, the variables are said to be correlated.
If the two variables deviate in the same direction i.e., if the increase (or decrease) in one
results in a corresponding increase (or decrease) in the other, correlation is said to be direct or
positive.

Fig.1. Positive Correlation Fig.2. Negative Correlation

e.g., the correlation between income and expenditure is positive.


If the two variables deviate in opposite direction i.e., if the increase (or decrease) in one
results in a corresponding decrease (or increase) in the other, correlation is said to be inverse or
negative.
e.g., the correlation between volume and the pressure of a perfect gas or the correlation
between the price and demand is negative.
Correlation is said to be perfect if the deviation in one variable is followed by a corresponding proportional deviation in the other.

3.4.1 Scatter or dot diagrams


It is the simplest method of the diagrammatic representation of bivariate data. Let
( xi , yi ), i  1, 2,3,....., n be a bivariate distribution. Let the values of the variables x and y be
plotted along the x-axis and y-axis on a suitable scale. Then corresponding to every ordered pair,
there corresponds a point or dot in the xy-plane. The diagram of dots so obtained is called a dot
or scatter diagram.
If the dots are very close to each other and the number of observations is not very large, a
fairly good correlation is expected. If the dots are widely scattered, a poor correlation is
expected.

3.4.2 Coefficient of Correlation


The coefficient of correlation (r) lies between −1 and +1, i.e., −1 ≤ r ≤ 1.
If r is zero, there is no correlation between the two variables; correlation is positive (0 < r ≤ 1) when both variables increase or decrease simultaneously, and negative (−1 ≤ r < 0) when an increase in one is associated with a decrease in the other and vice-versa.

3.4.3 Karl Pearson Coefficient of Correlation


The coefficient of correlation (r) between two variables x and y is defined as
r = Cov(x, y) / sqrt(Var(x) Var(y)) = ΣXY / sqrt(ΣX^2 ΣY^2) (remember)
where X = x − x̄, Y = y − ȳ, and x̄, ȳ are the means of the x and y data values.
Here Cov(x, y) = (1/n) ΣXY is the covariance between the variables x and y,
x̄ = Σx/n and ȳ = Σy/n are the means of the x and y series respectively, and
σx = sqrt(ΣX^2/n) and σy = sqrt(ΣY^2/n) are the standard deviations (SD) of x and y respectively, so that r = Cov(x, y)/(σx σy).

Alternate form:
r(x, y) = ΣXY / sqrt(ΣX^2 ΣY^2) = Σ(x − x̄)(y − ȳ) / sqrt(Σ(x − x̄)^2 Σ(y − ȳ)^2)
That is, r(x, y) = [nΣxy − Σx Σy] / [sqrt(nΣx^2 − (Σx)^2) sqrt(nΣy^2 − (Σy)^2)].
Here n is the number of pairs of values of x and y.
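A minimal Python sketch of this definition (standard library only), checked against the data of Example 2 below, where r = 57/60 = 0.95:

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson coefficient of correlation via the mean-deviation form
    r = sum(XY) / sqrt(sum(X^2) * sum(Y^2)), X = x - mean(x), Y = y - mean(y)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    Xs = [x - xbar for x in xs]
    Ys = [y - ybar for y in ys]
    Sxy = sum(X * Y for X, Y in zip(Xs, Ys))
    return Sxy / math.sqrt(sum(X * X for X in Xs) * sum(Y * Y for Y in Ys))

# Data of Example 2 below: r = 57/60 = 0.95
r = pearson_r([9, 8, 7, 6, 5, 4, 3, 2, 1],
              [15, 16, 14, 13, 11, 12, 10, 8, 9])
```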

Example 1. If Cov(x, y) = 10, Var(x) = 25, Var(y) = 9, find the coefficient of correlation.

Solution: r = Cov(x, y) / sqrt(Var(x) Var(y)) = 10 / sqrt(25 × 9) = 10/(5 × 3) = 0.67.

Example 2. Calculate the coefficient of correlation from the following data:

x 9 8 7 6 5 4 3 2 1
y 15 16 14 13 11 12 10 8 9

Solution: The Karl Pearson coefficient of correlation is given by r = ΣXY / sqrt(ΣX^2 ΣY^2),
where X = x − x̄, Y = y − ȳ, and x̄, ȳ are the means of the x and y data values.
Here x̄ = Σx/n = 45/9 = 5, ȳ = Σy/n = 108/9 = 12.

x  y   X = x − x̄  Y = y − ȳ  X^2  Y^2  XY
9  15  4           3           16   9    12
8  16  3           4           9    16   12
7  14  2           2           4    4    4
6  13  1           1           1    1    1
5  11  0           −1          0    1    0
4  12  −1          0           1    0    0
3  10  −2          −2          4    4    4
2  8   −3          −4          9    16   12
1  9   −4          −3          16   9    12
Σx = 45  Σy = 108  ΣX^2 = 60  ΣY^2 = 60  ΣXY = 57

The Karl Pearson coefficient of correlation is
r = ΣXY / sqrt(ΣX^2 ΣY^2) = 57 / sqrt(60 × 60) = 57/60 = 0.95.

Example 3. Psychological tests of intelligence and of engineering ability were applied to 10 students. Below is a record of ungrouped data showing intelligence ratio (I.R) and engineering ratio (E.R). Calculate the coefficient of correlation.
Student A B C D E F G H I J
IR 104 105 102 101 100 99 98 96 93 92
ER 101 103 100 98 95 96 104 92 97 94

Solution: The Karl Pearson coefficient of correlation is given by
r = ΣXY / sqrt(ΣX^2 ΣY^2) .....(i)
where X = x − x̄, Y = y − ȳ, and x̄, ȳ are the means of the x and y data values.
Here x̄ = Σx/n = 990/10 = 99, ȳ = Σy/n = 980/10 = 98.

Student  I.R (x)  E.R (y)  X = x − x̄  Y = y − ȳ  X^2  Y^2  XY
A        104      101      5           3           25   9    15
B        105      103      6           5           36   25   30
C        102      100      3           2           9    4    6
D        101      98       2           0           4    0    0
E        100      95       1           −3          1    9    −3
F        99       96       0           −2          0    4    0
G        98       104      −1          6           1    36   −6
H        96       92       −3          −6          9    36   18
I        93       97       −6          −1          36   1    6
J        92       94       −7          −4          49   16   28
Σx = 990  Σy = 980  ΣX = 0  ΣY = 0  ΣX^2 = 170  ΣY^2 = 140  ΣXY = 94

Substituting these values in Eq.(i), we get
r = ΣXY / sqrt(ΣX^2 ΣY^2) = 94 / sqrt(170 × 140) = 94/154.3 = 0.61.

Example 4. Find the coefficient of correlation between the values of x and y (using alternate
form):
x 1 3 5 7 8 10
y 8 12 15 17 18 20

Solution: Here n = 6.

x   y   x^2  y^2   xy
1   8   1    64    8
3   12  9    144   36
5   15  25   225   75
7   17  49   289   119
8   18  64   324   144
10  20  100  400   200
Σx = 34  Σy = 90  Σx^2 = 248  Σy^2 = 1446  Σxy = 582

Karl Pearson's coefficient of correlation is given by
r(x, y) = [nΣxy − Σx Σy] / [sqrt(nΣx^2 − (Σx)^2) sqrt(nΣy^2 − (Σy)^2)]
= [6(582) − (34)(90)] / [sqrt(6(248) − 34^2) sqrt(6(1446) − 90^2)] = 0.9879.

Shortcut Method for Karl Pearson Coefficient of Correlation

- We can also find the Karl Pearson coefficient of correlation by taking assumed means: take X = x − a, Y = y − b, where a and b are assumed means of the x and y data values.
- If the x's are equispaced with spacing h, we can take u = (x − a)/h. Similarly, if the y's are equispaced with spacing k, we can take v = (y − b)/k.
Then r(x, y) = [nΣuv − Σu Σv] / [sqrt(nΣu^2 − (Σu)^2) sqrt(nΣv^2 − (Σv)^2)].
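Since u and v are linear rescalings of x and y with positive scale factors, r(u, v) = r(x, y); a quick Python check (standard library only) with the data of Example 5 below:

```python
import math

def r_from_sums(n, Su, Sv, Suv, Suu, Svv):
    # r = (n*S_uv - S_u*S_v) / sqrt((n*S_uu - S_u^2)(n*S_vv - S_v^2))
    return (n * Suv - Su * Sv) / math.sqrt((n * Suu - Su ** 2) *
                                           (n * Svv - Sv ** 2))

def pearson(xs, ys):
    n = len(xs)
    return r_from_sums(n, sum(xs), sum(ys),
                       sum(x * y for x, y in zip(xs, ys)),
                       sum(x * x for x in xs),
                       sum(y * y for y in ys))

xs = [10, 14, 18, 22, 26, 30]
ys = [18, 12, 24, 6, 30, 36]
us = [(x - 22) / 4 for x in xs]   # assumed mean 22, spacing 4
vs = [(y - 24) / 6 for y in ys]   # assumed mean 24, spacing 6
r_direct = pearson(xs, ys)        # from the raw data
r_shortcut = pearson(us, vs)      # from the rescaled data - same value
```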

Example 5. Find the coefficient of correlation for the following table:

x 10 14 18 22 26 30
y 18 12 24 6 30 36

Solution: Let u = (x − 22)/4, v = (y − 24)/6.

x   y   u   v   u^2  v^2  uv
10  18  −3  −1  9    1    3
14  12  −2  −2  4    4    4
18  24  −1  0   1    0    0
22  6   0   −3  0    9    0
26  30  1   1   1    1    1
30  36  2   2   4    4    4
Σu = −3  Σv = −3  Σu^2 = 19  Σv^2 = 19  Σuv = 12

Here n = 6.
The Karl Pearson coefficient of correlation is
r(x, y) = [nΣuv − Σu Σv] / [sqrt(nΣu^2 − (Σu)^2) sqrt(nΣv^2 − (Σv)^2)]
= [6(12) − (−3)(−3)] / [sqrt(6(19) − (−3)^2) sqrt(6(19) − (−3)^2)] = 63/105 = 3/5.
Therefore, r(x, y) = 0.6.
Example 6. Establish the formula r = (σx^2 + σy^2 − σ^2(x−y)) / (2 σx σy),
where r is the correlation coefficient between x and y. Using the above formula, calculate the coefficient of correlation from the following data:
x: 21 23 30 54 57 58 72 78 87 90
y: 60 71 72 83 110 84 100 92 113 135

Solution: Let z = x − y; then z̄ = x̄ − ȳ.
Therefore, z − z̄ = (x − y) − (x̄ − ȳ) = (x − x̄) − (y − ȳ).
Squaring both sides, we get
(z − z̄)^2 = (x − x̄)^2 + (y − ȳ)^2 − 2(x − x̄)(y − ȳ)
Summing over the data and dividing by n, we get
Σ(z − z̄)^2/n = Σ(x − x̄)^2/n + Σ(y − ȳ)^2/n − 2Σ(x − x̄)(y − ȳ)/n
⇒ σz^2 = σx^2 + σy^2 − 2r σx σy   [since r = Σ(x − x̄)(y − ȳ)/(n σx σy)]
Therefore, r = (σx^2 + σy^2 − σ^2(x−y)) / (2 σx σy),
which is the required result.
Home work: Using the above formula, calculate the coefficient of correlation from the given data. Try yourself; submit through e-mail (nanjundappace@gmail.com).

3.5 Regression
Regression is a statistical method, used in finance, investing, and many other disciplines, that attempts to determine the strength and character of the relationship between a dependent variable (usually denoted y) and one or more independent variables (denoted x), and vice versa.

Use of Regression Analysis


(i) In the field of Business, this tool of statistical analysis is widely used. Businessmen are
interested in predicting future production, consumption, investment, prices, profits and sales etc.

(ii) In the field of economic planning and sociological studies, projections of population, birth
rates, death rates and other similar variables are of great use.

3.5.1 Linear Regression


Regression describes the functional relationship between dependent and independent variables, which helps us to estimate one variable from the other. Correlation quantifies the association between the two variables, whereas linear regression finds the best line that predicts y from x (and likewise x from y). The difference between correlation and regression is illustrated in the adjoining figure.

[Figure: Correlation vs. Regression]

3.5.2 Lines of Regression


A line of regression is the straight line which gives the best fit, in the least squares sense, to the given frequency distribution.
In the case of n pairs (xi, yi), i = 1, 2, 3, ..., n of bivariate data, we have no reason or justification to assume y as the dependent variable and x as the independent variable. Either of the two may be estimated for given values of the other. Thus, if we wish to estimate y for given values of x, we shall have a regression equation of the form y = a + bx, called the regression line of y on x. If we wish to estimate x for given values of y, we shall have a regression line of the form x = a + by, called the regression line of x on y.
Thus, in general, we always have two lines of regression.

3.5.3 Derivation of Lines of Regression


3.5.3(i) Line of Regression of y on x
To obtain the line of regression of y on x, we assume y as the dependent variable and x as the independent variable. Let the equation of the regression line of y on x be
y = a + bx .....(i)
The normal equations, as derived by the method of least squares, are:
Σy = na + b Σx .....(ii)
Σxy = a Σx + b Σx^2 .....(iii)
Solving (ii) and (iii) for a and b, we get
b = [nΣxy − Σx Σy] / [nΣx^2 − (Σx)^2] and
a = Σy/n − b Σx/n = ȳ − b x̄.
Substituting the value of a in Eq.(i), we get
y − ȳ = b(x − x̄) .....(iv)
Equation (iv) is called the regression line of y on x; b is called the regression coefficient of y on x and is usually denoted by byx.
Hence Eq.(iv) can be written as
y − ȳ = byx(x − x̄),
which is the regression line of y on x,
where x̄, ȳ are the mean values of x and y respectively, while
byx = [nΣxy − Σx Σy] / [nΣx^2 − (Σx)^2]. (Remember)
In equation (iii), shifting the origin to (x̄, ȳ), we get
Σ(x − x̄)(y − ȳ) = a Σ(x − x̄) + b Σ(x − x̄)^2 .....(v)
We know that Σ(x − x̄) = 0, and
σx^2 = ΣX^2/n = Σ(x − x̄)^2/n ⇒ Σ(x − x̄)^2 = n σx^2.
Also r = ΣXY / sqrt(ΣX^2 ΣY^2) = Σ(x − x̄)(y − ȳ)/(n σx σy)
⇒ Σ(x − x̄)(y − ȳ) = r n σx σy.
Equation (v) then reduces to
r n σx σy = a(0) + b n σx^2
Therefore b = r σy/σx.
That is, byx = r σy/σx, called the regression coefficient (slope of the line of regression) of y on x.
Here r is the coefficient of correlation, and σx and σy are the standard deviations of the x and y series respectively.
Note: The regression line of y on x is y − ȳ = byx(x − x̄) (remember),
where byx = r σy/σx is the regression coefficient (slope of the line of regression) of y on x.

3.5.3(ii) Line of Regression of x on y
Proceeding in the same way as in 3.5.3(i), we can derive the regression line of x on y as
x − x̄ = bxy(y − ȳ),
which is the line of regression of x on y.
Here bxy is the regression coefficient of x on y and is given by
bxy = [nΣxy − Σx Σy] / [nΣy^2 − (Σy)^2]
or bxy = r σx/σy,
where the terms have their usual meanings.
Here byx and bxy are known as the coefficients of regression and are connected by the relation
byx · bxy = (r σy/σx)(r σx/σy) = r^2.
Note: If r = 0, the two lines of regression become x = x̄ and y = ȳ, which are straight lines parallel to the y-axis and x-axis respectively, passing through the means x̄ and ȳ. They are mutually perpendicular.
If r = ±1, the two lines of regression will coincide.
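Both regression coefficients come from the same five sums; a minimal Python sketch (standard library only; the test data is exactly linear, y = 2x + 1, so byx = 2, bxy = 1/2 and byx·bxy = r^2 = 1):

```python
def regression_coefficients(xs, ys):
    """Return (byx, bxy, xbar, ybar), using
    byx = (n*Sxy - Sx*Sy)/(n*Sxx - Sx^2),
    bxy = (n*Sxy - Sx*Sy)/(n*Syy - Sy^2)."""
    n = len(xs)
    Sx, Sy = sum(xs), sum(ys)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    Sxx = sum(x * x for x in xs)
    Syy = sum(y * y for y in ys)
    num = n * Sxy - Sx * Sy
    byx = num / (n * Sxx - Sx ** 2)
    bxy = num / (n * Syy - Sy ** 2)
    return byx, bxy, Sx / n, Sy / n

xs = [1, 2, 3, 4, 5]
byx, bxy, xbar, ybar = regression_coefficients(xs, [2 * x + 1 for x in xs])
r_squared = byx * bxy   # = r^2, by the relation above
```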

3.5.4 Properties of Regression Coefficients

- As sqrt(byx · bxy) = r, the coefficient of correlation (r) is the geometric mean between the two regression coefficients.
- Since (byx + bxy)/2 ≥ sqrt(byx · bxy) = r, the arithmetic mean of the two regression coefficients is greater than or equal to the correlation coefficient (r).
- If there is a perfect correlation between the two variables under consideration, then r = ±1, byx · bxy = 1, and the two lines of regression coincide. The converse is also true: if the two lines of regression coincide, there is a perfect correlation.
- Since byx · bxy = r^2 ≥ 0, the signs of the regression coefficients byx and bxy and of the coefficient of correlation (r) must be the same: either all three negative or all three positive.
- Since byx · bxy = r^2 ≤ 1, if one of the regression coefficients is greater than unity, the other must be less than unity.
- The point of intersection of the two lines of regression is (x̄, ȳ), where x̄ and ȳ are the means of the x and y series respectively.
- If the two lines of regression cut each other at right angles, there is no correlation between the two variables, i.e., r = 0.

3.5.5 Angle between the Lines of Regression

If θ is the acute angle between the two regression lines for the variables x and y, then
tan θ = [(1 − r^2)/r] · [σx σy/(σx^2 + σy^2)].
Proof: The two lines of regression are given by
y − ȳ = r (σy/σx)(x − x̄) .....(i)
and x − x̄ = r (σx/σy)(y − ȳ), i.e., y − ȳ = [σy/(r σx)](x − x̄) .....(ii)
If m1 and m2 are the slopes of lines (i) and (ii), then
tan θ = (m2 − m1)/(1 + m1 m2), where m1 = r σy/σx and m2 = σy/(r σx).
tan θ = [σy/(r σx) − r σy/σx] / [1 + (r σy/σx)(σy/(r σx))]
= [(σy/σx)(1/r − r)] / [1 + σy^2/σx^2]
Therefore, tan θ = [(1 − r^2)/r] · [σx σy/(σx^2 + σy^2)] .....(iii)
- When r = 0, tan θ = ∞ ⇒ θ = π/2. Therefore, the two lines of regression are perpendicular to each other.
- When r = ±1, tan θ = 0 ⇒ θ = 0. Therefore, the two lines of regression are coincident.
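The closed form for tan θ can be checked numerically against the slope-based computation it was derived from; a Python sketch (standard library only; the small data set is illustrative):

```python
import math

def angle_between_regression_lines(xs, ys):
    """Return (r, tan(theta) by the closed form, tan(theta) from the slopes)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / n)
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n * sx * sy)
    # Closed form: tan(theta) = (1 - r^2)/r * sx*sy/(sx^2 + sy^2)
    tan_formula = (1 - r ** 2) / r * (sx * sy) / (sx ** 2 + sy ** 2)
    # Direct computation from the two slopes
    m1 = r * sy / sx          # slope of the y-on-x line
    m2 = sy / (r * sx)        # slope of the x-on-y line, solved for y
    tan_direct = (m2 - m1) / (1 + m1 * m2)
    return r, tan_formula, tan_direct

r, t_formula, t_direct = angle_between_regression_lines(
    [1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```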

Example 1. The two regression equations of the variables x and y are x = 18.13 − 0.87y and y = 11.64 − 0.54x. Find (1) the means of the x's and y's, (2) the coefficient of correlation between x and y.
Solution: Given x = 18.13 − 0.87y .....(i)
and y = 11.64 − 0.54x .....(ii)
Since the means of the x's and y's lie on both regression lines, we have
x̄ = 18.13 − 0.87ȳ .....(iii)
and ȳ = 11.64 − 0.54x̄ .....(iv)
(1) On solving equations (iii) and (iv), we get x̄ = 15.09 and ȳ = 3.49.
(2) The regression coefficient of y on x, from Eq.(ii), is byx = −0.54, and the regression coefficient of x on y, from Eq.(i), is bxy = −0.87.
Therefore, the coefficient of correlation, being the geometric mean of the two regression coefficients, is
r = −sqrt(byx · bxy) = −sqrt((0.54)(0.87)) = −sqrt(0.4698) = −0.69
(here the − sign is taken, since both regression coefficients are negative).

Example 2. In a partially destroyed laboratory record, only the lines of regression of y on x and x on y are available, as 4x − 5y + 33 = 0 and 20x − 9y = 107 respectively. Calculate (i) x̄ and ȳ, and (ii) the coefficient of correlation between x and y.
Solution: Given 4x − 5y + 33 = 0 .....(i)
and 20x − 9y = 107 .....(ii)
(i) Since the means of the x's and y's lie on the two regression lines (i) and (ii), we have
4x̄ − 5ȳ + 33 = 0
20x̄ − 9ȳ = 107.
On solving these equations, we get x̄ = 13 and ȳ = 17.
(ii) The regression line of y on x, from Eq.(i), is y = (4/5)x + 33/5 .....(iii)
so the regression coefficient of y on x is byx = 4/5.
Similarly, the regression line of x on y, from Eq.(ii), is x = (9/20)y + 107/20 .....(iv)
so the regression coefficient of x on y is bxy = 9/20.
Therefore, the coefficient of correlation, being the geometric mean of the two regression coefficients, is
r = sqrt(byx · bxy) = sqrt((4/5)(9/20)) = sqrt(0.36) = 0.6
(here the + sign is taken, since both regression coefficients are positive).

Example 3. In the following table are recorded data showing the test scores made by salesmen
on an intelligence test and their weekly sales :
Salesmen 1 2 3 4 5 6 7 8 9 10
Test Scores=x 40 70 50 60 80 50 90 40 60 60
Sales(000)=y 2.5 6.0 4.5 5.0 4.5 2.0 5.5 3.0 4.5 3.0
Calculate the lines of regression of sales (y) on test scores (x) and estimate the most probable
weekly sales volume if a sales man makes a score of 70.
Solution: We determine the regression line of sales (y) on test scores (x): y − ȳ = byx(x − x̄),
where byx = [nΣxy − Σx Σy] / [nΣx^2 − (Σx)^2].
From the table below, we have
x̄ = mean of x (test scores) = Σx/n = 600/10 = 60 and
ȳ = mean of y (sales) = Σy/n = 40.5/10 = 4.05.

Test scores = x  Sales(000) = y  xy    x^2
40               2.5             100   1600
70               6.0             420   4900
50               4.5             225   2500
60               5.0             300   3600
80               4.5             360   6400
50               2.0             100   2500
90               5.5             495   8100
40               3.0             120   1600
60               4.5             270   3600
60               3.0             180   3600
Σx = 600  Σy = 40.5  Σxy = 2570  Σx^2 = 38400

Therefore, byx = [10(2570) − (600)(40.5)] / [10(38400) − (600)^2] = 1400/24000 ≈ 0.06.

The required regression line of y on x is y = 4.05 + 0.06(x − 60),
i.e., y = 0.06x + 0.45.
When x = 70, y = 0.06(70) + 0.45 = 4.65.
Thus the most probable weekly sales volume, if a salesman makes a score of 70, is 4.65 (thousand).

Example 4. Following data depicts the statistical values of rainfall and production of wheat in a
region for a specified time period.
Mean Standard Deviation
Production of Wheat (kg. per unit area) =y 10 8
Rainfall (cm) =x 8 2
Estimate the production of wheat (y) when rainfall (x) is 9cm if correlation coefficient between
production and rainfall is given to be 0.5.
Solution: Let the variables x and y denote rainfall and production respectively.
Given that x̄ = 8, σx = 2; ȳ = 10, σy = 8, r = 0.5.
Now the equation of regression of y on x is given by
y − ȳ = r (σy/σx)(x − x̄)
⇒ y − 10 = [(0.5)(8)/2](x − 8)
⇒ y − 10 = 2(x − 8)
⇒ y = 2x − 6,
which gives the production of wheat in terms of the rainfall.
When rainfall (x) is 9 cm, production (y) of wheat is estimated to be 2(9) − 6 = 12 kg per unit area.
Example 5. Find the coefficient of correlation and the lines of regression for the data given below:
n = 18, Σx = 12, Σy = 18, Σx^2 = 60, Σy^2 = 96, Σxy = 48.

Solution: (i) The coefficient of correlation is
r(x, y) = [nΣxy − Σx Σy] / [sqrt(nΣx^2 − (Σx)^2) sqrt(nΣy^2 − (Σy)^2)]
= [18(48) − (12)(18)] / [sqrt(18(60) − 12^2) sqrt(18(96) − 18^2)] = 0.57.
(ii) The equations of the regression lines are y − ȳ = byx(x − x̄) and x − x̄ = bxy(y − ȳ).
Now x̄ = Σx/n = 12/18 = 0.67, ȳ = Σy/n = 18/18 = 1.
σx^2 = Σx^2/n − (Σx/n)^2 = 60/18 − (12/18)^2 = 2.9 ⇒ σx = 1.7.
σy^2 = Σy^2/n − (Σy/n)^2 = 96/18 − (18/18)^2 = 4.33 ⇒ σy = 2.08.
byx = r σy/σx = (0.57)(2.08/1.7) = 0.7; bxy = r σx/σy = (0.57)(1.7/2.08) = 0.47.
The equations of the regression lines are therefore
y − 1 = 0.7(x − 0.67), i.e., y = 0.7x + 0.53, and
x − 0.67 = 0.47(y − 1), i.e., x = 0.47y + 0.2.
Example 6. Find the correlation coefficient between x and y when the two lines of regression are given by 2x − 9y + 6 = 0 and x − 2y + 1 = 0.
Solution: Let the line of regression of x on y be 2x − 9y + 6 = 0 .....(i)
and the line of regression of y on x be x − 2y + 1 = 0 .....(ii)
From Eq.(i), we have x = (9/2)y − 3, so bxy = 9/2, and
from Eq.(ii), we have y = (1/2)x + 1/2, so byx = 1/2.
Then r = sqrt(byx · bxy) = sqrt((1/2)(9/2)) = 3/2 = 1.5, which is not possible, as −1 ≤ r ≤ 1.
So our choice of regression lines is incorrect.
Therefore, the line of regression of y on x is 2x − 9y + 6 = 0 ⇒ y = (2/9)x + 2/3 ⇒ byx = 2/9.
Also, the line of regression of x on y is x − 2y + 1 = 0 ⇒ x = 2y − 1 ⇒ bxy = 2.
Then r = sqrt(byx · bxy) = sqrt(2 × 2/9) = 2/3.
Hence the coefficient of correlation between x and y is 2/3.

************************************END******************************
