CORRELATION AND REGRESSION
Dr. Mohamed Sief
                                           Fayoum University
                                            April 8, 2024
Dr. Mohamed Sief (Fayoum University)   CORRELATION AND REGRESSION   April 8, 2024   1 / 27
Introduction
  Statistics is often used to investigate the relationship between two
 (or more) variables of interest. Some examples of relations that are
 often studied include:
        Is there a relationship between high school grade and the first
        year college grade point average (GPA)? If so, what is the
        relationship?
        What is the relationship between the expenditure and income
        of an Egyptian family?
        What is the relationship between age and blood pressure?
        What is the relationship between body mass index and systolic
        blood pressure, or between hours of exercise per week and
        percent body fat?
Introduction (Cont’d)
 Key Questions
 In the above examples, we see that there are two basic questions of
 interest when investigating a pair of variables:
    1   Is there a relationship between the two variables?
    2   What is the relationship (if any) between the two variables?
Simple Linear Correlation
 Problem Statement
 In this section, we measure the linear relationship, also known
 as linear association, between two variables X and Y.
        To study the correlation between X and Y, we typically have data
        in the form of pairs of observations (x_1, y_1), (x_2, y_2), ..., (x_n, y_n),
        where x_i is the value of X for the i-th observation, y_i is the value of
        Y for the i-th observation, and n is the number of observations.
Scatter Plot
 Definition
 The scatter plot is an essential tool in the study of correlation, providing a
 visual representation of the relationship between two variables X and Y .
        The scatter plot gives a preliminary picture of how the response
        variable Y varies with the explanatory variable X.
        It displays the individual data points as dots on a two-dimensional
        graph, with one variable plotted along the x-axis and the other along
        the y-axis.
Scatter Plot (Cont’d)
 Purpose
 The primary purpose of constructing a scatter plot is to visually assess the
 presence and strength of any relationship between X and Y .
        A positive correlation between X and Y suggests that as X increases,
        Y tends to increase.
        A negative correlation indicates that as X increases, Y tends to
        decrease.
        The scatter plot helps identify patterns, trends, outliers, and potential
        nonlinear relationships between the variables.
                  Figure: Illustration of the concept of the Scatter diagram
Remarks
        The scatter plot serves as a crucial initial step in correlation analysis,
        providing valuable insights into the relationship between two variables
        X and Y .
        It allows researchers to make informed decisions regarding further
        analysis and modeling techniques.
        Understanding the scatter plot is fundamental for conducting effective
        correlation studies and drawing meaningful conclusions from the data.
Pearson’s Correlation Coefficient
 Correlation Coefficient
 There are several measures to quantify the correlation between two
 variables X and Y . One commonly used measure is the Pearson coefficient
 of linear correlation.
        Different statistics textbooks may present equivalent formulas for
        computing the correlation coefficient.
        In this section, we introduce two equivalent forms of
        Pearson’s correlation coefficient.
Pearson’s Correlation Coefficient
 Definition 5.1.2
 Let (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) be the given data pairs. Pearson’s
 correlation coefficient, denoted by r, is given by:
 \[
   r = \frac{\sum (x - \bar{x})(y - \bar{y})}
            {\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}
 \]
 or, equivalently, by:
 \[
   r = \frac{n\sum xy - (\sum x)(\sum y)}
            {\sqrt{n\sum x^2 - (\sum x)^2} \cdot \sqrt{n\sum y^2 - (\sum y)^2}}
 \]
        n is the number of data pairs.
        \sum denotes summation from i = 1 to n.
        x and y are the values of variables X and Y, respectively.
        x̄ and ȳ are the sample means of the x and y values.
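 The two relations can be checked against each other numerically. The
 following is a minimal Python sketch, not part of the original slides; the
 function names and the five toy data pairs are assumptions chosen only for
 illustration.

```python
from math import sqrt


def pearson_r_deviation(xs, ys):
    """Pearson's r from deviations about the means (first relation)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (sqrt(s_xx) * sqrt(s_yy))


def pearson_r_raw_sums(xs, ys):
    """Pearson's r from raw sums (second relation)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Hypothetical toy data, used only to exercise the two functions.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_dev = pearson_r_deviation(xs, ys)
r_raw = pearson_r_raw_sums(xs, ys)
print(r_dev, r_raw)  # the two values should agree up to rounding error
```

 Both functions divide by zero when all x values (or all y values) are
 equal, which mirrors the fact that r is not defined in that case.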
Interpretation
 Interpretation
 The correlation coefficient r quantifies the strength and direction of the
 linear relationship between X and Y .
        r > 0: Positive correlation (as X increases, Y tends to increase).
        r < 0: Negative correlation (as X increases, Y tends to decrease).
        |r| ≈ 1: Strong correlation.
        |r| ≈ 0: Weak or no correlation.
Remarks
        Pearson’s correlation coefficient provides a standardized measure of
        linear association between two variables.
        Understanding and interpreting r is essential for analyzing
        relationships between variables and making informed decisions.
        Further analysis and inference can be based on the calculated value of
        r.
Different patterns of data produce different degrees of linear correlation.
                            Figure: Different shapes of linear correlation
Interpreting Scatterplots
 Key Observations
 Several points are evident from the scatterplots:
        When the slope of the line in the plot is negative, the correlation is
        negative; when the slope is positive, the correlation is positive.
        The strongest correlations (r = 1.0 and r = −1.0) occur when data
        points fall exactly on a straight line.
        The correlation becomes weaker as the data points become more
        scattered.
        If the data points fall in a random pattern with unclear direction, the
        correlation is close to zero.
        Correlation is affected by outliers.
Effect of Outliers
 Comparison
 Compare the second scatter plot with the last scatter plot. The single
 outlier in the last plot reduces the correlation from 0.80 to 0.70.
        Outliers can have a significant impact on correlation, especially in
        smaller datasets.
        It’s important to identify and address outliers appropriately in
        correlation analysis.
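 The effect can be reproduced on a small made-up dataset. The sketch below
 is an illustration only: the eight points on the line y = 2x + 1 and the
 single added outlier are hypothetical and are not the data behind the
 figure.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (sqrt(s_xx) * sqrt(s_yy))


# Hypothetical data: eight points lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x + 1 for x in xs]
r_clean = pearson_r(xs, ys)

# Add one hypothetical outlier far below the line and recompute r.
r_outlier = pearson_r(xs + [9], ys + [0])

print(f"without outlier: r = {r_clean:.3f}")
print(f"with outlier:    r = {r_outlier:.3f}")
```

 With only nine observations, a single point is enough to pull r well below
 its original value, which is why small datasets deserve particular care.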
Coefficient of Determination
 Definition
 The square of the correlation coefficient (r^2) is called the coefficient of
 determination.
        r^2 represents the proportion of the variance in Y that is predictable
        from X.
        It ranges from 0 to 1, where 0 indicates no linear relationship and 1
        indicates a perfect linear relationship.
        r^2 provides a measure of the strength of the linear relationship
        between X and Y.
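 One way to see the "proportion of variance" reading is to fit a
 least-squares line and compare its explained share of variation with r^2.
 The sketch below assumes a least-squares line y_hat = a + b*x, which the
 slides have not introduced at this point, and borrows the midterm and final
 marks from Example 5.1.1; the variable names are illustrative.

```python
# Midterm (X) and final (Y) marks from Example 5.1.1.
xs = [77, 54, 71, 72, 81, 94, 96, 99, 83, 67]
ys = [82, 38, 78, 34, 47, 85, 99, 99, 79, 68]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation in Y
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Coefficient of determination as the square of Pearson's r.
r_squared = s_xy ** 2 / (s_xx * s_yy)

# Assumption of this sketch: least-squares line y_hat = a + b * x.
b = s_xy / s_xx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained part

# Share of the variation in Y accounted for by the fitted line.
explained_share = 1 - sse / s_yy

print(f"r^2             = {r_squared:.4f}")
print(f"explained share = {explained_share:.4f}")
```

 The two printed values coincide up to rounding, so for these marks a little
 over half of the variation in Y is accounted for by the linear relationship
 with X.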
Assessment of Correlation Strength
  Correlation Strength                                              r
  Perfect Correlation (all points are located on the line)          |r| = 1.00
  Very Strong Correlation (high degree of linearity)                0.86 < |r| < 1.00
  Strong Correlation (the linearity is very clear)                  0.70 < |r| ≤ 0.86
  Moderate Correlation                                              0.50 < |r| ≤ 0.70
  Weak Correlation (an acceptable degree of linearity)              0.30 < |r| ≤ 0.50
  Very Weak Correlation                                             0.00 < |r| ≤ 0.30
  No Correlation (no linear relationship, or S_X = 0 and S_Y = 0)   |r| = 0
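 The table can be applied mechanically. The following is a small Python
 sketch of such a lookup; the function names and the returned labels are
 assumptions made for illustration, while the thresholds are copied from the
 table.

```python
def correlation_strength(r):
    """Return the strength label for r using the thresholds in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    magnitude = abs(r)
    if magnitude == 1.0:
        return "perfect"
    if magnitude > 0.86:
        return "very strong"
    if magnitude > 0.70:
        return "strong"
    if magnitude > 0.50:
        return "moderate"
    if magnitude > 0.30:
        return "weak"
    if magnitude > 0.0:
        return "very weak"
    return "no correlation"


def describe(r):
    """Combine the strength label with the direction given by the sign of r."""
    strength = correlation_strength(r)
    if r > 0:
        return f"{strength} positive correlation"
    if r < 0:
        return f"{strength} negative correlation"
    return strength


for value in (1.0, -0.9, 0.75, 0.6, -0.4, 0.1, 0.0):
    print(value, "->", describe(value))
```

 Boundary values follow the table's inequalities, so |r| = 0.86 is labelled
 strong rather than very strong.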
EXAMPLE 5.1.1
 The results of a class of 10 students on midterm exam marks (X ) and on
 the final examination marks (Y ) are as follows:
        The values of X : 77, 54, 71, 72, 81, 94, 96, 99, 83, 67
        The values of Y : 82, 38, 78, 34, 47, 85, 99, 99, 79, 68
    1   Represent the given data on a scatter plot.
    2   Is there a linear relationship (linear association) between X and Y ? Is
        it positive or negative?
    3   Calculate the correlation coefficient (r ).
Solution
 a) Scatter Plot
 The scatter plot for the given data is:
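 A rough text rendering of this plot can be produced with plain Python. The
 sketch below is only one possible way to draw it; the function name
 text_scatter and the 40 by 15 character grid are assumptions, and a plotting
 library would normally be used instead.

```python
# Midterm (X) and final (Y) marks from Example 5.1.1.
xs = [77, 54, 71, 72, 81, 94, 96, 99, 83, 67]
ys = [82, 38, 78, 34, 47, 85, 99, 99, 79, 68]


def text_scatter(xs, ys, width=40, height=15):
    """Return text rows in which '*' marks each (x, y) pair.

    Assumes the x values are not all equal and the y values are not all
    equal, otherwise the scaling below would divide by zero.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(cells) for cells in grid]


rows = text_scatter(xs, ys)
for line in rows:
    print("|" + line)       # vertical axis: Y (final marks)
print("+" + "-" * 40)       # horizontal axis: X (midterm marks)
```

 The points drift from the lower left to the upper right of the grid, with
 two visibly low Y values at X = 72 and X = 81.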
Solution (continued)
 b) Interpretation
 The scatter plot suggests a positive linear association between X and Y:
 there is a linear trend in which the value of Y tends to increase as the
 value of X increases.
Solution (continued)
 c) Calculation of Correlation Coefficient (r)
      x_i      y_i     x_i − x̄        y_i − ȳ   (x_i − x̄)^2  (y_i − ȳ)^2   (x_i − x̄)(y_i − ȳ)
     77        82        -2.4           11.1       5.76         123.21            -26.64
     54        38       -25.4          -32.9      645.16       1082.41            835.66
     71        78        -8.4           7.1       70.56         50.41             -59.64
     72        34        -7.4          -36.9      54.76        1361.61            273.06
     81        47        1.6           -23.9       2.56         571.21            -38.24
     94        85       14.6            14.1      213.16        198.81            205.86
     96        99       16.6            28.1      275.56        789.61            466.46
     99        99       19.6            28.1      384.16        789.61            550.76
     83        79        3.6            8.1       12.96         65.61             29.16
     67        68       -12.4           -2.9      153.76         8.41             35.96
    794       709          0             0        1818.4       5040.9             2272.4
Solution (c)
 Calculation of Correlation Coefficient (r ):
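 \[
   \bar{x} = \frac{794}{10} = 79.4, \qquad \bar{y} = \frac{709}{10} = 70.9
 \]
 \[
   r = \frac{\sum (x - \bar{x})(y - \bar{y})}
            {\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}
     = \frac{2272.4}{\sqrt{1818.4} \times \sqrt{5040.9}}
     \approx 0.75
 \]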
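 The column totals and the value of r can be rechecked in a few lines of
 Python. This is a verification sketch only, with variable names chosen
 for illustration.

```python
from math import sqrt

# Midterm (X) and final (Y) marks from Example 5.1.1.
xs = [77, 54, 71, 72, 81, 94, 96, 99, 83, 67]
ys = [82, 38, 78, 34, 47, 85, 99, 99, 79, 68]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Deviations from the means, as in the third and fourth table columns.
dx = [x - x_bar for x in xs]
dy = [y - y_bar for y in ys]

# Column totals of the last three table columns.
s_xx = sum(d * d for d in dx)
s_yy = sum(d * d for d in dy)
s_xy = sum(a * b for a, b in zip(dx, dy))

r = s_xy / (sqrt(s_xx) * sqrt(s_yy))

print(f"means: {x_bar:.1f}, {y_bar:.1f}")
print(f"totals: {s_xx:.1f}, {s_yy:.1f}, {s_xy:.1f}")
print(f"r = {r:.5f}")
```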
Solution
 Alternatively, we can use the relation:
 \[
   r = \frac{n\sum xy - (\sum x)(\sum y)}
            {\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}}
 \]
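 The agreement between this relation and the deviation form follows from
 expanding the sums with x̄ = (Σx)/n and ȳ = (Σy)/n. A short derivation,
 added here as a sketch:

```latex
\begin{align*}
\sum (x - \bar{x})(y - \bar{y})
  &= \sum xy - \bar{y}\sum x - \bar{x}\sum y + n\bar{x}\bar{y}
   = \sum xy - \frac{(\sum x)(\sum y)}{n}, \\
\sum (x - \bar{x})^2
  &= \sum x^2 - \frac{(\sum x)^2}{n}, \qquad
\sum (y - \bar{y})^2
   = \sum y^2 - \frac{(\sum y)^2}{n}.
\end{align*}
```

 Substituting these into the deviation form and multiplying the numerator
 and the denominator by n gives the relation above, since the two factors of
 n under the square root contribute exactly one factor of n outside it.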
Calculation
                           i          x_i      x_i^2      y_i   y_i^2  x_i y_i
                          1            77       5929       82    6724    6314
                          2            54       2916       38    1444    2052
                          3            71       5041       78    6084    5538
                          4            72       5184       34    1156    2448
                          5            81       6561       47    2209    3807
                          6            94       8836       85    7225    7990
                          7            96       9216       99    9801    9504
                          8            99       9801       99    9801    9801
                          9            83       6889       79    6241    6557
                          10           67       4489       68    4624    4556
                         Total         794     64862      709    55309   58567
Calculation (continued)
 Calculation:
 \[
   r = \frac{n\sum xy - (\sum x)(\sum y)}
            {\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}}
 \]
 \[
   r = \frac{585670 - 794 \times 709}
            {\sqrt{648620 - (794)^2}\,\sqrt{553090 - (709)^2}}
     \approx 0.75056
 \]
 Conclusion
 By the strength table above (0.70 < |r| ≤ 0.86), there is a strong positive
 linear relationship between X and Y. (As X values increase, the corresponding
 Y values tend to increase.)
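 The arithmetic in this calculation can be rechecked from the column totals
 alone. The sketch below is a verification aid with illustrative variable
 names; the totals are taken from the table.

```python
from math import sqrt

# Column totals from the calculation table (n = 10 students).
n = 10
sum_x, sum_x2 = 794, 64862
sum_y, sum_y2 = 709, 55309
sum_xy = 58567

# Numerator and the two terms under the square roots.
numerator = n * sum_xy - sum_x * sum_y
x_term = n * sum_x2 - sum_x ** 2
y_term = n * sum_y2 - sum_y ** 2

r = numerator / (sqrt(x_term) * sqrt(y_term))

print(numerator, x_term, y_term)
print(f"r = {r:.5f}")
```

 Working with the integer totals avoids rounding until the final division,
 which is one practical reason to prefer this relation for hand calculation.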
Exercise: Relationship between Hours Studied and Exam Scores
 Consider the following data representing the number of hours students
 studied (X ) and their corresponding exam scores (Y ):
        Hours Studied (X ): 5, 7, 3, 8, 4, 6, 9, 2, 5, 7
        Exam Scores (Y ): 65, 75, 60, 80, 70, 72, 85, 55, 68, 78
    1   Represent the given data on a scatter plot.
    2   Determine if there exists a linear relationship (linear association)
        between X and Y . If so, is it positive or negative?
    3   Calculate the correlation coefficient (r ).
                                       Thank You
                                       Any Questions?