Correlation and
Regression
                  © BGIMS
Outline
    Introduction
    Scatter Plots
   Correlation
   Regression
Objectives
  Draw a scatter plot for a set of
  ordered pairs.
  Find the correlation coefficient.
  Find the equation of the
  regression line.
Objectives
  Find the coefficient of
  determination.
  Find the standard error of
  estimate.
Introduction
    Every day we take personal and
    professional decisions that are based
    on predictions of future events.
   To make these forecasts, we rely on the
    relationship between what is already
    known and what is to be estimated.
    Regression and correlation analysis
    show us how to determine both the
    nature and the strength of a
    relationship between two variables.
Significance of the study of
correlation
1.   Most variables show some kind of relationship, for example
     between price and supply or between income and expenditure.
     Correlation analysis expresses the degree of that relationship
     in a single figure.
2.   Once we know the relationship, we can estimate the
     value of one variable given the value of another.
3.   Correlation analysis contributes to the understanding of
     economic behaviour. In business, correlation analysis enables
     the executive to estimate costs, prices, etc.
Types of correlation
    Positive and negative
    Simple, partial and multiple
    Linear and non-linear
Positive and Negative correlation
   If two variables vary together in the same
    direction or in opposite directions, they are
    said to be correlated.
   If Y increases consistently as X increases, X and Y
    are positively correlated.
   If Y decreases as X increases, and Y increases as X
    decreases, X and Y are negatively correlated.
Simple, partial and multiple
correlation
    When only two variables are studied, the correlation is
    simple.
   When three or more variables are studied, the correlation is
    partial or multiple.
   In multiple correlation, three or more
    variables are studied simultaneously.
   In partial correlation, there are more than two
    variables, but we consider the relationship between only two
    of them (keeping the others constant).
Dependent & Independent
variables
    The known variable is called the
    independent variable and the variable
    we are trying to predict is the
    dependent variable.
Scatter Plots - Example
    Positive Relationship or correlation
    [Figure: scatter plot of Pressure (y-axis, 120 to 150) against Age (x-axis, 40 to 70).]
Scatter Plots - Other Examples
    Negative Relationship or correlation
    [Figure: scatter plot of Final grade (y-axis, 40 to 90) against Number of absences (x-axis, 5 to 15).]
Scatter Plots - Other Examples
    No Relationship
    [Figure: scatter plot of y (y-axis, 0 to 10) against x (x-axis, 0 to 70).]
    If the correlation is perfectly positive, all
    the points lie on a straight line that rises from
    left to right; if the correlation is perfectly
    negative, they lie on a straight line that falls
    from left to right, as shown in the figures that follow.
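Perfect positive correlation
    [Figure: plot of Y against X with all points on a straight line.]
Perfect negative correlation
    [Figure: plot of Y against X with all points on a straight line.]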
Correlation
    Correlation is the statistical tool with
    which the relationship between two or
    more variables is studied.
Correlation Analysis
    Correlation analysis is the statistical
    tool to describe the degree to which one
    variable is linearly related to another.
The coefficient of determination
    The coefficient of determination measures
    the extent, or strength, of the association
    that exists between two variables X and Y.
   The sample coefficient of determination is denoted r².
Sample coefficient of determination
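    $$ r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} $$
    where Ŷ is the value estimated by the regression line and Ȳ is the mean of Y.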
Sample coefficient of
determination
    r² = 1 when there is perfect correlation.
    r² = 0 when there is no correlation.
   Note:
   r² measures only the strength of a
    linear relationship between two
    variables.
Correlation Coefficient
   The correlation coefficient computed
    from the sample data measures the
    strength and direction of a relationship
    between two variables.
   Sample correlation coefficient, r.
   Population correlation coefficient, ρ .
  Range of Values for the
  Correlation Coefficient
Strong negative   No linear      Strong positive
relationship      relationship   relationship
   −1                 0                   +1
   Coefficient of correlation
    $$ r = \pm\sqrt{r^2} $$
   When the slope b of the estimating equation is positive, r is
    the positive square root; if b is negative, r
    is the negative square root.
    The sign of r indicates the direction of the
    relationship between the two variables X and Y.
    Karl Pearson’s Correlation
    coefficient
The coefficient of correlation, denoted by r and named
after Karl Pearson, is defined as
    $$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N \sigma_x \sigma_y} $$
      where there are N pairs (X, Y).
   This is also called the product-moment coefficient of
    correlation.
   The covariance of X and Y is defined as
       $$ \operatorname{cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N} $$
       Therefore
       $$ r = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} $$
The formula for r can be simplified as
    $$ r = \frac{\sum XY - N \bar{X}\bar{Y}}{\sqrt{\sum X^2 - N \bar{X}^2}\,\sqrt{\sum Y^2 - N \bar{Y}^2}} $$
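The two forms of Karl Pearson's coefficient can be checked against each other numerically. The sketch below is only an illustration: the function names and the five data pairs are hypothetical, not taken from the slides, and the standard deviations divide by N, as in the definition of covariance.

```python
import math

def pearson_r(xs, ys):
    """Definitional form: sum of cross-deviations over N * sigma_x * sigma_y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population standard deviations (divide by N), matching the covariance formula.
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n * sigma_x * sigma_y)

def pearson_r_shortcut(xs, ys):
    """Simplified form that works from the raw sums of XY, X^2 and Y^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    spread_x = sum(x * x for x in xs) - n * mean_x ** 2
    spread_y = sum(y * y for y in ys) - n * mean_y ** 2
    return numerator / (math.sqrt(spread_x) * math.sqrt(spread_y))

# Hypothetical data, chosen only for illustration.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

r = pearson_r(X, Y)
r_short = pearson_r_shortcut(X, Y)
print(round(r, 4), round(r_short, 4), round(r * r, 2))
```

For these made-up pairs both forms give r of about 0.77, so r² is 0.6.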
Interpreting r2
    The coefficient of determination expresses the
    proportion of the variation in Y that is
    explained by the regression line.
What does r=0.6 mean?
   r = 0.6 gives r² = 0.36:
    36% of the variation in the amount spent
    on movies is explained by the regression line.
   A correlation of 0.6 between the amount spent on movies
    and family income seems like a fairly strong
    correlation. But r² = 0.36 means that family income
    explains only 36% of the variation in the amount of
    money families spend on movies.
Cont..
    If you designed your marketing
    strategy to appeal only to families with
    high incomes, you’d miss a lot of
    potential customers.
    Instead try to find what else is
    influencing family movie decisions.
Rank Correlation Coefficient
    Rank correlation is used when a quantitative
    measure of certain factors cannot be fixed, but the
    individuals in the group can be
    arranged in order, thereby obtaining
    for each individual a number
    indicating his or her rank in the group.
    The rank correlation coefficient is applied
     to a set of ordinal rank numbers, with 1 for
     the individual ranked first in quantity or
     quality, and so on up to N for the one ranked last.
     R is then defined as
    $$ R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} $$
     where D refers to the difference of ranks between
     paired items.
Example
    Two managers are asked to rank a
    group of employees in order of
    potential for eventually becoming top
    managers. The rankings are as follows.
    Compute the coefficient of rank
    correlation and comment on the value.
    Employees        Ranking by manager 1   Ranking by manager 2
A               10                          9
B               2                           4
C               1                           2
D               4                           3
E               3                           1
F               6                           5
G               5                           6
H               8                           8
I               7                           7
J               9                           10
N=10
    Employees   Ranking by manager 1 (R1)   Ranking by manager 2 (R2)   D² = (R1 − R2)²
A               10                          9                           1
B               2                           4                           4
C               1                           2                           1
D               4                           3                           1
E               3                           1                           4
F               6                           5                           1
G               5                           6                           1
H               8                           8                           0
I               7                           7                           0
J               9                           10                          1
N = 10                                                                  ΣD² = 14
Cont..
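    R = 1 − (6 × 14) / (10 × (10² − 1))
    = 1 − 84/990
    = 1 − 0.085
    = 0.915
    The value is close to +1, so the two managers' rankings agree closely.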
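The rank correlation for the two managers can be reproduced with a few lines of Python. This is a minimal sketch: the function name `rank_correlation` is illustrative, and the ranks are the ones in the table for employees A to J.

```python
def rank_correlation(ranks_1, ranks_2):
    """Rank correlation R = 1 - 6 * sum(D^2) / (N * (N^2 - 1)) for untied ranks."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks given by each manager to employees A..J, in table order.
manager_1 = [10, 2, 1, 4, 3, 6, 5, 8, 7, 9]
manager_2 = [9, 4, 2, 3, 1, 5, 6, 8, 7, 10]

sum_d2 = sum((a - b) ** 2 for a, b in zip(manager_1, manager_2))
R = rank_correlation(manager_1, manager_2)
print(sum_d2, round(R, 3))  # 14 0.915
```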
Where Ranks are not given
    Assign ranks. Then apply the same
    formula
Equal ranks or tie in ranks
   Assign each tied individual or entry the
    average rank.
    Thus if two individuals are ranked equal at
    5th place, give the rank (5+6)/2 = 5.5 to
    both.
   If m is the number of items that share a
    common rank, then R is
Cont..
    $$ R = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots\right]}{N^3 - N} $$
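The tie-corrected formula can be sketched as follows. The helper names and the five pairs of scores are hypothetical; the sketch assumes that rank 1 goes to the largest value and that every group of m tied items adds (m³ − m)/12 to ΣD².

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest value; tied values share the average of the places they occupy."""
    ordered = sorted(values, reverse=True)
    places = {}
    for place, value in enumerate(ordered, start=1):
        places.setdefault(value, []).append(place)
    return [sum(places[v]) / len(places[v]) for v in values]

def tie_correction(ranks):
    """Sum of (m^3 - m) / 12 over every group of m items sharing a rank."""
    return sum((m ** 3 - m) / 12 for m in Counter(ranks).values() if m > 1)

def rank_correlation_with_ties(xs, ys):
    ranks_x, ranks_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    correction = tie_correction(ranks_x) + tie_correction(ranks_y)
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)

# Hypothetical scores; the two 40s tie for 2nd place and each get rank 2.5.
scores_x = [50, 40, 40, 30, 20]
scores_y = [9, 8, 7, 6, 5]

ranks_x = average_ranks(scores_x)
R_ties = rank_correlation_with_ties(scores_x, scores_y)
print(ranks_x, round(R_ties, 2))
```

Here ΣD² = 0.5 and the single tie of m = 2 items adds another 0.5, so R = 1 − 6 × 1/120 = 0.95.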
Regression analysis
Example
   Sales of major appliances vary with the new
    housing market. When new home sales are
    good, so are the sales of dishwashers,
    washing machines, dryers and
    refrigerators. A trade association compiled
    the following historical data (in thousands of
    units) on major appliance sales and housing
    starts.
Housing Starts   Appliances sales
 (thousands)       (thousands)
      2                 5
     2.5               5.5
     3.2                6
     3.6                7
     3.3               7.2
      4                7.7
     4.2               8.4
     4.6                9
     4.8              9.71
      5                10
    How can we fit this line?
    [Figure: scatter plot of appliance sales against housing starts; y-axis from 0 to 12, x-axis from 0 to 6.]
Cont..
   In this case, the data points represent the
    relationship between the housing market and
    sales of household appliances. The relationship
    between X and Y is well described by a straight
    line.
   The direction of the line indicates
    whether the relationship is direct or inverse.
   William C Andrews, an organizational behavior
    consultant for Victory Motorcycles, has
    designed a test to show the company’s
    supervisors the dangers of oversupervising their
    workers. A worker from the assembly line is
    given a series of complicated tasks to perform.
    During the worker’s performance, a supervisor
    constantly interrupts the worker to assist him or
    her in completing the tasks. The worker, upon
    completion of the tasks, is then given a
    psychological test designed to measure the
    worker’s hostility toward authority (a high score
    equals low hostility).
Cont..
   Eight different workers were assigned the
    tasks and then interrupted for the purpose of
    instructional assistance various numbers of
    times. Their corresponding scores on the
    hostility test are as follows. Predict
    the expected test score if the worker is
    interrupted 18 times.
no. of times workers   Worker's score on
      interrupted         hostility test
        5                     58
        10                    41
        10                    45
        15                    27
        15                    26
        20                    12
        20                    16
        25                     3
    How can we fit this line?
    [Figure: scatter plot of Hostility Score (y-axis, 0 to 70) against Number of interrupts (x-axis, 0 to 30).]
How can we fit a line
mathematically?
   To a statistician, the line will have a
    good fit if it minimizes the error
    between the estimated points on the
    line and actual observed points that
    were used to draw it. (method of least
    squares)
The method of least squares
   The method of least squares finds the equation of the line drawn through the
    middle of a set of points in a scatter diagram
    such that the sum of the squares of the errors is a
    minimum. The estimating line, on which the estimated
    points lie, is
    $$ \hat{Y} = a + bX $$
Slope of the best-fitting Regression line &
Y-intercept of the best-fitting Regression
line
    $$ b = \frac{\sum XY - n \bar{X}\bar{Y}}{\sum X^2 - n \bar{X}^2} $$
    $$ a = \bar{Y} - b \bar{X} $$
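The slope and intercept formulas can be applied directly to the worker-interruption data given earlier. The sketch below is one way to do it; the function name `fit_line` is illustrative, not from the slides.

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a for the line Y-hat = a + b * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
    a = mean_y - b * mean_x
    return a, b

# Data from the hostility-test table: number of interruptions and test score.
interruptions = [5, 10, 10, 15, 15, 20, 20, 25]
scores = [58, 41, 45, 27, 26, 12, 16, 3]

a, b = fit_line(interruptions, scores)
predicted = a + b * 18  # expected score for a worker interrupted 18 times
print(round(a, 2), round(b, 2), round(predicted, 2))
```

With these data the fitted line is Ŷ = 70.5 − 2.8X, so a worker interrupted 18 times is expected to score about 20.1.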
    The given equation is regression
    equation of Y on X. It gives most
    probable values of Y for given values
    of X.
    The regression line of X on Y gives the
    probable values of X for given values
    of Y. say X=a + bY.
   The regression equation of Y on X can also be
    represented by
    $$ Y - \bar{Y} = r \frac{\sigma_y}{\sigma_x} (X - \bar{X}) $$
    where r is the coefficient of correlation.
   The regression equation of X on Y can also be
    represented by
    $$ X - \bar{X} = r \frac{\sigma_x}{\sigma_y} (Y - \bar{Y}) $$
Example
    The general sales manager of Kiran Enterprises, an
    enterprise dealing in the sale of ready-made men’s wear, is
    toying with the idea of increasing his sales to 80,000. On
    checking the records of sales during the last 10 years, it was
    found that the annual sale proceeds and advertisement
    expenditure were highly correlated, to the extent of 0.8. It was
    further noted that the annual average sale has been Rs. 45,000
    and the annual average advertisement expenditure Rs. 30,000, with
    variances of 1600 and 625 in sales and advertisement
    expenditure respectively.
    In view of the above, how much expenditure on advertisement
    would you suggest the general sales manager of the enterprise
    incur to meet his target of sales?
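    X: advertisement expenditure
    Y: sales
   Use the regression equation of X on Y with r = 0.8,
    σx ≈ 25 and σy = 40:
    X − 30,000 = 0.8 × (25/40) × (Y − 45,000)
   When Y = 80,000,
   X = 30,000 + 0.5 × 35,000 = 47,500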
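The same calculation can be written as a short Python sketch of the regression equation of X on Y. It assumes a variance of 1600 for sales and 625 for advertisement expenditure, the values consistent with the stated answer of 47,500; the variable and function names are illustrative.

```python
import math

r = 0.8                      # correlation between sales and advertisement expenditure
mean_sales = 45_000          # mean of Y
mean_advertising = 30_000    # mean of X
variance_sales = 1600        # assumed variance of Y
variance_advertising = 625   # assumed variance of X

sigma_y = math.sqrt(variance_sales)
sigma_x = math.sqrt(variance_advertising)

def advertising_for(sales_target):
    """Regression of X on Y: X = mean_X + r * (sigma_x / sigma_y) * (Y - mean_Y)."""
    return mean_advertising + r * (sigma_x / sigma_y) * (sales_target - mean_sales)

required = advertising_for(80_000)
print(required)
```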
Example
   Suppose BMC is interested in the
    relationship between the age of a
    garbage truck and the annual repair
    expense it should expect to incur. In
    order to determine this relationship,
    BMC has accumulated information
    concerning four of the trucks the city
    currently owns.
Cont..
    Organize the data as outlined in table
    Use the equations of a & b to find the
    numerical constants for our regression
    line.
Truck number   Age of truck in years (X)   Repair expense during last year in thousands of Rs. (Y)
    101             5                           7
    102             3                           7
    103             3                           6
    104             1                           4
   b = 0.75
   a = 3.75
   Ŷ = 3.75 + 0.75X
   BMC can estimate the annual repair expense given
    the age of a truck.
   If the truck is 4 years old, use the equation Ŷ = 3.75 + 0.75X to
    get the annual expense as follows:
   Ŷ = 3.75 + 0.75 × 4
   = 6.75
   Expected annual repair expense = Rs. 6,750
How to measure the reliability of the
estimating equation?
    Reliability is measured by the standard error of
    estimate.
   It measures the variability, or scatter,
    of the observed values around the
    regression line.
Standard error
    $$ S_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} $$
    For the above example,
    standard error = 0.866 (thousand Rs.), i.e. Rs. 866.
   If the standard error is zero, we expect the
    estimating equation to be a perfect
    estimator of the dependent variable.
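The standard error of estimate for the truck example can be verified with a short Python sketch. It uses the four trucks from the table and the fitted constants a = 3.75 and b = 0.75; the function name `standard_error` is illustrative.

```python
import math

def standard_error(xs, ys, a, b):
    """Standard error of estimate: sqrt(sum((Y - Y_hat)^2) / (n - 2))."""
    n = len(xs)
    sum_sq_errors = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sum_sq_errors / (n - 2))

# Truck data: age in years and last year's repair expense in thousands of Rs.
ages = [5, 3, 3, 1]
expenses = [7, 7, 6, 4]

se = standard_error(ages, expenses, a=3.75, b=0.75)
print(round(se, 3))  # 0.866
```

The squared errors are 0.25, 1, 0 and 0.25, so Se = √(1.5/2) ≈ 0.866 thousand rupees.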
   Assuming that the observed points are
    normally distributed around the regression line,
    we can expect
   68% of the points within ±1 Se,
   95.5% of the points within ±2 Se, and
   99.7% of the points within ±3 Se.
Multiple regression and
correlation analysis
Cont..
   We can use more than one independent
    variable to estimate the dependent
    variable and thus attempt to increase
    the accuracy of the estimate.
   This process is called multiple
    regression analysis
Example
   Consider the real estate agent who wishes to
    relate the number of houses the firm sells in a
    month to the amount of her monthly
    advertising.
   Certainly we can find a simple estimating
    equation that relates these two variables.
   Could we also improve the accuracy of our
    equation by including the number of
    salespeople she employs each month ?
   Then we can use number of sales agents and the
    advertising expenditures to predict monthly
    house sales.
 Multiple regression equations
    $$ \hat{Y} = a + b_1 X_1 + b_2 X_2 $$
X1 and X2 = values of the two independent variables
Ŷ = estimated value corresponding to the dependent variable
   To obtain a, b1 and b2, solve the normal equations
    $$ \sum Y = na + b_1 \sum X_1 + b_2 \sum X_2 $$
    $$ \sum X_1 Y = a \sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1 X_2 $$
    $$ \sum X_2 Y = a \sum X_2 + b_1 \sum X_1 X_2 + b_2 \sum X_2^2 $$
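The three normal equations form a 3 × 3 linear system, which can be solved in several ways. The sketch below uses Cramer's rule as one simple option. The data are hypothetical and were generated from Y = 1 + 2·X1 + 3·X2, so the solution is known in advance; all function names are illustrative.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve_normal_equations(x1, x2, y):
    """Build the normal equations for Y-hat = a + b1*X1 + b2*X2 and solve them."""
    n = len(y)
    s1, s2, sy = sum(x1), sum(x2), sum(y)
    s11 = sum(u * u for u in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * t for u, t in zip(x1, y))
    s2y = sum(v * t for v, t in zip(x2, y))
    coefficients = [[n, s1, s2],
                    [s1, s11, s12],
                    [s2, s12, s22]]
    right_side = [sy, s1y, s2y]
    d = det3(coefficients)
    if d == 0:
        raise ValueError("normal equations are singular")
    solution = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        m = [row[:] for row in coefficients]
        for row in range(3):
            m[row][col] = right_side[row]
        solution.append(det3(m) / d)
    return tuple(solution)

# Hypothetical data built from Y = 1 + 2*X1 + 3*X2 (not from the slides).
X1 = [1, 2, 3, 4, 5]
X2 = [2, 1, 4, 3, 5]
Y = [9, 8, 19, 18, 26]

a, b1, b2 = solve_normal_equations(X1, X2, Y)
print(a, b1, b2)
```

Because the hypothetical Y values lie exactly on the plane, the solver recovers a = 1, b1 = 2 and b2 = 3.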
Example
   In trying to evaluate the effectiveness of
    its advertising campaign, a firm
    compiled the following information:
   Year    Adv. expenditure (’000 Rs.)    Sales (lakh Rs.)
   1996    12                             5.0
   1997    15                             5.6
   1998    15                             5.8
   1999    23                             7.0
   2000    24                             7.2
   2001    38                             8.8
   2002    42                             9.2
   2003    48                             9.5
   Estimate the probable sales when advertisement expenditure is
    Rs. 60 thousand.
Cont..
    Ŷ = 3.8719 + 0.1250X
    When X = 60,
   Ŷ = 3.8719 + 0.1250 × 60 = 11.37
   Estimated sales = Rs. 11.37 lakh