CORRELATION ANALYSIS
1.   What does correlation between variables tell us?
             Correlation is the degree to which two or more quantities or variables are linearly associated. In a two-
     dimensional plot, the degree of correlation between the values on the two axes is quantified by the so-called correlation
     coefficient.
             Correlation is a statistical measurement of the relationship between two variables. Possible correlations range
     from +1 to -1. A zero correlation indicates that there is no linear relationship between the variables. A correlation of -1
     indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down. A correlation of +1
     indicates a perfect positive correlation, meaning that both variables move in the same direction together.
             Correlation is a statistical technique that can show whether and how strongly pairs of variables are related.
     How is it estimated?
              Correlation is quantified by a value known as the correlation coefficient, usually represented by
     r. Statistics provides various types of correlation coefficients; which one to use depends on several factors, such as the
     kind of variables being correlated.
              The table below shows how different types of data, categorized according to measurement scale, may be
     correlated and which statistic provides the appropriate correlation coefficient.
           Variable Y\X      Quantitative X        Ordinal X                                      Nominal X
           Quantitative Y    Pearson r             Biserial rb                                    Point Biserial rpb (ad)
           Ordinal Y         Biserial rb           Spearman rho (naturally dichotomous) /         Rank Biserial rrb
                                                   Tetrachoric rtet (artificially dichotomous)
           Nominal Y         Point Biserial rpb    Rank Biserial rrb                              Phi, L, C, Lambda
2.   What is the range of values that a correlation coefficient may take? How is the particular range of values of
     correlation coefficient interpreted?
             The main result of a correlation is called the correlation coefficient. It ranges from -1.0 to +1.0. The closer r is to
     +1 or -1, the more closely the two variables are related.
             If r is close to 0, it means there is little or no linear relationship between the variables. If r is positive, it means that as one
     variable gets larger the other gets larger. If r is negative it means that as one gets larger, the other gets smaller (often
     called an "inverse" correlation).
             While correlation coefficients are normally reported as r, squaring them makes them easier to understand. The
     square of the coefficient is equal to the proportion (often quoted as a percentage) of the variation in one variable that is
     related to the variation in the other.
             A correlation report can also show a second result of each test - statistical significance. In this case, the
     significance level will tell you how likely it is that the correlations reported may be due to chance in the form of random
     sampling error. If you are working with small sample sizes, choose a report format that includes the significance level.
     This format also reports the sample size.
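     As a small illustration of the point about squaring r, the sketch below converts a few correlation coefficients into
     the proportion of shared variation. The helper name shared_variation and the sample values are illustrative
     assumptions, not part of the source material.

```python
def shared_variation(r: float) -> float:
    """Return r squared: the proportion of variation in one variable
    that is related to variation in the other (hypothetical helper)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    return r * r


# Illustrative values only; note that the sign of r is lost when squaring.
for r in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print(f"r = {r:+.1f}  ->  r^2 = {shared_variation(r):.2f}")
```

     Because the sign disappears, r squared describes only the strength of the relationship; the direction still has to be
     read from r itself.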
3.   For each correlation coefficient, provide a description and an illustrative example to show its appropriateness and
     how it can be computed.
     a. Pearson Product-Moment Correlation
     Description
                                      The Pearson product-moment correlation coefficient is a measure of the
                                      strength of a linear association between two variables and is denoted by r.
                                      Basically, a Pearson product-moment correlation attempts to draw a line of best
                                      fit through the data of two variables, and the Pearson correlation coefficient, r,
                                      indicates how far the data points are from this line of best fit.
     Formula
                                      $$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
 Von Christopher G. Chua, LPT, MST
     Interpretation
                                      The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A
                                      value of 0 indicates that there is no association between the two variables. A value
                                      greater than 0 indicates a positive association; that is, as the value of one variable
                                      increases, so does the value of the other variable. A value less than 0 indicates a
                                      negative association; that is, as the value of one variable increases, the value of the
                                      other variable decreases.
                                      -1.0 to -0.7: strong negative association
                                      -0.7 to -0.3: weak negative association
                                      -0.3 to +0.3: little or no association
                                      +0.3 to +0.7: weak positive association
                                      +0.7 to +1.0: strong positive association
     Example
                                 The following table shows the grades obtained by six students in Algebra and
                                 Trigonometry. Compute the Pearson product-moment correlation coefficient.
                                     Student No.       1      2      3      4      5      6
                                     Algebra          83     78     94     90     88     88
                                     Trigonometry     82     83     93     94     84     86
                                 To solve for the correlation coefficient, some values in the formula must be obtained.
                                               x         y        x^2        y^2        xy
                                              83        82       6889       6724      6806
                                              78        83       6084       6889      6474
                                              94        93       8836       8649      8742
                                              90        94       8100       8836      8460
                                              88        84       7744       7056      7392
                                              88        86       7744       7396      7568
                                     Sum     521       522      45397      45550     45442
                                 Computation:
                                 $$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
                                 $$r = \frac{(6)(45442) - (521)(522)}{\sqrt{\left[(6)(45397) - (521)^2\right]\left[(6)(45550) - (522)^2\right]}}$$
                                 $$r = \frac{690}{\sqrt{(941)(816)}} = \frac{690}{876.27} = 0.79$$
                                With a correlation coefficient equal to 0.79, we can conclude that there is a strong
                                positive association in the grades of the six students in Algebra and Trigonometry.
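                                 The computation above can be checked with a short script. This is a minimal sketch of the
                                 sums formula in plain Python; the function name pearson_r is an illustrative choice and is
                                 not part of the source.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from raw scores using the computational (sums) formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Grades of the six students from the example table.
algebra = [83, 78, 94, 90, 88, 88]
trigonometry = [82, 83, 93, 94, 84, 86]
r = pearson_r(algebra, trigonometry)
print(round(r, 2))  # 0.79
```

                                 The function divides by zero if either list is constant, since the correlation is undefined
                                 when one variable does not vary.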
    b. Phi-coefficient
Description
  The phi coefficient is a measure of the degree of association between two binary or dichotomous variables. This measure is
  similar to the Pearson correlation coefficient in its interpretation; in fact, it equals the Pearson r computed on two variables
  coded 0 and 1. It was also introduced by Karl Pearson.
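Formula
  $$\phi = \frac{ad - bc}{\sqrt{efgh}}$$
  where a, b, c, and d are the cell counts and e, f, g, and h are the marginal totals of the 2 x 2 table:
                  X-       X+       Total
        Y-        a        b          e
        Y+        c        d          f
        Total     g        h          n
  Phi compares the product of the diagonal cells (a*d) to the product of the
  off-diagonal cells (b*c). The denominator is an adjustment that ensures
  that Phi is always between -1 and +1.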
Interpretation
 Two binary variables are considered positively associated if most of the data falls along the diagonal cells (i.e., a and d are
 larger than b and c). In contrast, two binary variables are considered negatively associated if most of the data falls off the
 diagonal.
Example
The table below shows the first time driving test results of a sample of 200 individuals classified by gender and success or
failure in the examination. We wish to explore the association between the two variables, the null hypothesis being that there
is no relationship between gender and success/failure in driving test results.
    Gender             Success               Failure           Total
     Male                 70                   28                98
    Female                50                   52               102
     Total               120                   80               200
  $$\phi = \frac{ad - bc}{\sqrt{efgh}} = \frac{(70)(52) - (28)(50)}{\sqrt{(98)(102)(120)(80)}}$$
  $$\phi = \frac{3640 - 1400}{\sqrt{95961600}} = \frac{2240}{9796.00} = 0.23$$
The data show that gender and success or failure in the driving test have little or no correlation.
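The phi computation can be sketched in a few lines of Python. The function name phi_coefficient is an illustrative assumption; the cell labels a, b, c, d and the marginal totals e, f, g, h follow the 2 x 2 layout given in the Formula part.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table of counts laid out as [[a, b], [c, d]]."""
    e, f = a + b, c + d  # row totals
    g, h = a + c, b + d  # column totals
    denominator = math.sqrt(e * f * g * h)
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator


# Driving test data: rows are Male/Female, columns are Success/Failure.
phi = phi_coefficient(70, 28, 50, 52)
print(round(phi, 2))  # 0.23
```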
         c.   Point Biserial Correlation Coefficient
    Description
                                   The point biserial correlation coefficient (rpb) is a correlation coefficient used when
                                   one variable is dichotomous; Y can either be "naturally" dichotomous, like
                                   gender, or an artificially dichotomized variable. In most situations it is not
                                   advisable to artificially dichotomize variables.
    Formula
                                   To calculate rpb, assume that the dichotomous variable Y has the two values 0 and 1. If
                                   we divide the data set into two groups, group 1 which received the value "1" on Y and
                                   group 2 which received the value "0" on Y, then the point-biserial correlation
                                   coefficient (for a population) is calculated as follows:
                                   $$r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}}$$
                                   where s_n is the standard deviation used when you have data for every member of the
                                   population:
                                   $$s_n = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$
                                   M1 is the mean on the continuous variable X for all data points in group 1, and M0 is
                                   the mean on the continuous variable X for all data points in group 2.
                                   n1 is the number of data points in group 1, n0 is the number of data points in group 2,
                                   and n is the total sample size. There is an equivalent formula that uses s_{n-1}, the
                                   point biserial correlation coefficient for a sample:
                                   $$r_{pb} = \frac{M_1 - M_0}{s_{n-1}} \sqrt{\frac{n_1 n_0}{n(n-1)}}$$
                                   $$s_{n-1} = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$$
    Interpretation
                                   Pett (1997) asserts that the same criteria for evaluating the coefficient of determination
                                   in regard to standard correlation can be applied to rpb^2 because of the close relationship
                                   between rpb and the Pearson r. The coefficient of determination in the form of rpb^2,
                                   therefore, is a useful index for drawing conclusions from the data.
                                   Very strong: .81 and above
                                   Strong:      .49-.80
                                   Moderate:    .25-.48
                                Weak:       .00-.08
    Example
                              An urban planner hypothesizes that the correlation between lack of car ownership and use of
                              public transportation would be positive in a particular urban location. In this case, the
                              dichotomous variable is car ownership, which is the independent variable because it
                              is hypothesized as affecting frequency of public transportation use. The non-dichotomous
                              variable is the number of times in a given time span that a person uses public
                              transportation. The non-dichotomous variable is the dependent variable in this example.
                              Next, the researcher collects a small sample of 18 participants for her study, gathering the
                              following information (Table 1):
                                  Participant     Car Ownership     Use of Public Transportation
                                       1               No                      3
                                       2               No                     12
                                       3               No                     10
                                       4               No                     11
                                       5               No                     12
                                       6               No                     23
                                       7               No                     14
                                       8               No                      0
                                       9               No                     16
                                      10               Yes                     0
                                      11               Yes                     2
                                      12               Yes                     1
                                      13               Yes                     0
                                      14               Yes                     3
                                      15               Yes                     4
                                      16               Yes                     0
                                      17               Yes                     0
                                      18               Yes                     1
                              The next step would be to code the responses Yes as 0 and No as 1, making vehicle
                              ownership into a numerically dichotomous variable. At first glance, this may seem
                              counterintuitive because we associate 0 with a negative response (no) and 1 with a positive
                              response (yes). However, because the researcher hypothesizes that not having a car,
                              rather than having one, will increase public transportation use,
                              the researcher codes No responses as 1 and Yes responses as 0. Recall that the
                              researcher wants to know about lack of car ownership, not car ownership, couching the
                              hypothesis in terms of a positive relationship.
                              With this coding, M1 = 11.22, M0 = 1.22, s_n = 6.80, and n1 = n0 = 9, so
                              $$r_{pb} = \frac{11.22 - 1.22}{6.80} \sqrt{\frac{(9)(9)}{18^2}} = 0.735$$
                              The correlation coefficient, 0.735, means that those who do not own cars tend to use
                              public transportation more.
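                              The value 0.735 can be reproduced from Table 1 with the population form of the formula.
                              The sketch below is a plain-Python illustration; the function name point_biserial and the
                              variable names are assumptions made for this example.

```python
import math


def point_biserial(groups, scores):
    """Point-biserial r using the population standard deviation s_n.

    groups: 0/1 codes of the dichotomous variable
    scores: values of the continuous variable, in the same order
    """
    n = len(scores)
    ones = [s for g, s in zip(groups, scores) if g == 1]
    zeros = [s for g, s in zip(groups, scores) if g == 0]
    n1, n0 = len(ones), len(zeros)
    if n1 == 0 or n0 == 0:
        raise ValueError("both groups must contain at least one observation")
    m1, m0 = sum(ones) / n1, sum(zeros) / n0
    mean = sum(scores) / n
    s_n = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return (m1 - m0) / s_n * math.sqrt(n1 * n0 / n ** 2)


# Table 1: participants 1-9 do not own a car (coded 1), 10-18 do (coded 0).
no_car = [1] * 9 + [0] * 9
trips = [3, 12, 10, 11, 12, 23, 14, 0, 16, 0, 2, 1, 0, 3, 4, 0, 0, 1]
r_pb = point_biserial(no_car, trips)
print(round(r_pb, 3))  # 0.735
```

                              Swapping the coding (Yes as 1, No as 0) would only flip the sign of the result, which is why
                              the choice of coding is tied to how the hypothesis is phrased.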
       d. Spearman's Rank Correlation Coefficient
   Description
                         The Spearman's rank-order correlation is the nonparametric version of the Pearson
                         product-moment correlation. Spearman's correlation coefficient (ρ, also signified by rs)
                         measures the strength of association between two ranked variables.
                             A monotonic relationship is a relationship that does one of the following: (1) as the value
                             of one variable increases, so does the value of the other variable; or (2) as the value of
                             one variable increases, the other variable value decreases. A monotonic relationship is an
                             important underlying assumption of the Spearman rank-order correlation. It is also
                             important to recognize the assumption of a monotonic relationship is less restrictive than
                             a linear relationship.
   Formula
                             There are two methods to calculate Spearman's rank-order correlation depending on
                             whether: (1) your data does not have tied ranks or (2) your data has tied ranks. The
                             formula for when there are no tied ranks is:
                             $$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
                             where d_i is the difference in the paired ranks and n is the number of cases.
                             The formula to use when there are tied ranks is:
                             $$\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$
                             where x_i and y_i are the ranks of the paired observations.
   Interpretation
                               The Spearman correlation coefficient, rs, can take values from +1 to -1. An rs of +1
                               indicates a perfect positive association of ranks, an rs of zero indicates no association between
                               ranks, and an rs of -1 indicates a perfect negative association of ranks. The closer rs is to
                               zero, the weaker the association between the ranks.
   Example
                             The table which follows shows the scores of 10 high school students in an English and
                             Filipino exam. Both were 40-item tests. To compute the Spearman rho, we rank each set of
                             scores (highest score = rank 1) and take the difference d between the paired ranks:
                                English        18    20    14    34    40    35     7    10    28    38
                                Filipino       27    30    25    36    38    29    24    22    35    40
                                Eng (Rank)      7     6     8     4     1     3    10     9     5     2
                                Fil (Rank)      7     5     8     3     2     6     9    10     4     1
                                d               0     1     0     1    -1    -3     1    -1     1     1
                                d^2             0     1     0     1     1     9     1     1     1     1
                             $$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(16)}{10(10^2 - 1)} = 1 - \frac{96}{990} = 1 - 0.097 = 0.90$$
                             The Spearman rho value of 0.90 indicates a strong positive relationship between the two
                             variables.
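                             The ranking and both Spearman formulas can be sketched in plain Python as follows. The
                             function names are illustrative assumptions. Ties are given the mean of the ranks they span,
                             a common convention that the source does not spell out.

```python
import math


def average_ranks(values):
    """Rank values so the largest gets rank 1; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_no_ties(xs, ys):
    """Shortcut formula 1 - 6*sum(d^2) / (n(n^2 - 1)); valid only without ties."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def spearman_general(xs, ys):
    """Pearson correlation computed on the ranks; also handles tied ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


english = [18, 20, 14, 34, 40, 35, 7, 10, 28, 38]
filipino = [27, 30, 25, 36, 38, 29, 24, 22, 35, 40]
print(round(spearman_no_ties(english, filipino), 3))  # 0.903
print(round(spearman_general(english, filipino), 3))  # 0.903
```

                             Since these scores contain no ties, the two formulas agree; with tied scores only the second
                             one should be relied on.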
       e.  Rank Biserial Correlation
   Description
                           The rank-biserial correlation coefficient, rrb, is used for dichotomous nominal data
                           vs. rankings (ordinal).
   Formula
                             $$r_{rb} = \frac{2(\bar{Y}_1 - \bar{Y}_0)}{n}$$
                             where n is the number of data pairs, and \bar{Y}_0 and \bar{Y}_1 are the Y score means for data pairs
                             with an x score of 0 and 1, respectively. These Y scores are ranks and the formula
                             assumes no tied ranks are present.
   Example
                             The table shows the performances of 12 Grade 7 students in Science during the first
                             quarter of the school year.
                               Student No.   Sex   Grade   Rank      Student No.   Sex   Grade   Rank
                                    1         M     82       8            7         F     79      11
                                    2         M     85       7            8         F     81       9
                                    3         M     87       5            9         F     95       1
                                    4         M     80      10           10         F     86       6
                                    5         M     90       2           11         F     89       3
                                    6         M     88       4           12         F     73      12
                             Taking x = 0 for the male and x = 1 for the female students, the mean ranks are
                             \bar{Y}_1 = 7 and \bar{Y}_0 = 6, with n = 12:
                             $$r_{rb} = \frac{2(\bar{Y}_1 - \bar{Y}_0)}{n} = \frac{2(7 - 6)}{12} = \frac{2(1)}{12} = \frac{2}{12} = 0.17$$
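                             The same result can be obtained with a short script. This is a sketch only: the function name
                             rank_biserial is illustrative, and the coding of male as 0 and female as 1 is an assumption
                             inferred from the mean ranks reported in the example.

```python
def rank_biserial(codes, ranks):
    """Rank-biserial r = 2 * (mean rank of group 1 - mean rank of group 0) / n."""
    ones = [r for c, r in zip(codes, ranks) if c == 1]
    zeros = [r for c, r in zip(codes, ranks) if c == 0]
    if not ones or not zeros:
        raise ValueError("both groups must contain at least one observation")
    mean1 = sum(ones) / len(ones)
    mean0 = sum(zeros) / len(zeros)
    return 2 * (mean1 - mean0) / len(ranks)


# Assumed coding (inferred, not stated in the source): male = 0, female = 1.
sex = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
rank = [8, 7, 5, 10, 2, 4, 11, 9, 1, 6, 3, 12]
r_rb = rank_biserial(sex, rank)
print(round(r_rb, 2))  # 0.17
```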
       f.  Biserial Correlation Coefficient
   Description
                             Another measure of association, the biserial correlation coefficient, termed rb, is
                             similar to the point biserial. It also sets quantitative data against dichotomous data, but
                             here the dichotomy is ordinal: the variable has an underlying continuity but is measured
                             discretely as two values (dichotomous).
   Formula
                             $$r_b = \frac{(\bar{Y}_1 - \bar{Y}_0)\,(pq / Y)}{\sigma_Y}$$
                             where \bar{Y}_0 and \bar{Y}_1 are the Y score means for the data pairs with an x score of 0 and 1,
                             respectively, q = 1 - p and p are the proportions of data pairs with x scores of 0 and 1,
                             \sigma_Y is the population standard deviation for the y data, and Y is the height of the
                             standardized normal distribution at the point z that divides it into the proportions q and p.
   Example
                             An example might be test performance vs. anxiety, where anxiety is designated as either
                             high or low. Presumably, anxiety can take on any value in between, perhaps beyond, but
                             it may be difficult to measure. We further assume that anxiety is normally distributed.
                             The following data presents the test scores in Math of eight college students together
                             with their anxiety level during the exam. A two-point scale was used to measure anxiety
                             level where 0 corresponds to relaxed and 1 to anxious.
                               Test Score         65    78    84    90    88    93    70    83
                               Anxiety Level       0     0     1     0     1     1     1     0
                             Here \bar{Y}_0 = 79, \bar{Y}_1 = 83.75, p = 0.5, q = 0.5, and \sigma_Y = 9.16. Because p = q = 0.5,
                             the dividing point is z = 0, where the height of the standard normal curve is Y = 0.399.
                             $$r_b = \frac{(\bar{Y}_1 - \bar{Y}_0)\,(pq / Y)}{\sigma_Y} = \frac{(83.75 - 79)\left(\frac{(0.5)(0.5)}{0.399}\right)}{9.16}$$
                             $$r_b = \frac{(4.75)(0.627)}{9.16} = (4.75)(0.0684) = 0.32$$
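                             The biserial formula can be sketched in Python with the standard library, which supplies the
                             normal curve through statistics.NormalDist. The function name biserial is an illustrative
                             assumption. For p = q = 0.5 the dividing point is z = 0, where the height of the standard
                             normal curve is about 0.399.

```python
import math
from statistics import NormalDist


def biserial(codes, scores):
    """Biserial r = (mean1 - mean0) * (p * q / ordinate) / sigma_y."""
    n = len(scores)
    ones = [s for c, s in zip(codes, scores) if c == 1]
    zeros = [s for c, s in zip(codes, scores) if c == 0]
    if not ones or not zeros:
        raise ValueError("both groups must contain at least one observation")
    p = len(ones) / n  # proportion coded 1
    q = 1 - p          # proportion coded 0
    mean = sum(scores) / n
    sigma_y = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    z = NormalDist().inv_cdf(q)     # point with proportion q below it
    ordinate = NormalDist().pdf(z)  # height of the standard normal curve at z
    mean1 = sum(ones) / len(ones)
    mean0 = sum(zeros) / len(zeros)
    return (mean1 - mean0) * (p * q / ordinate) / sigma_y


scores = [65, 78, 84, 90, 88, 93, 70, 83]
anxiety = [0, 0, 1, 0, 1, 1, 1, 0]
r_b = biserial(anxiety, scores)
print(round(r_b, 2))  # 0.32
```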
       g. Tetrachoric Coefficient
   Description
                              The tetrachoric correlation, for binary data, and the polychoric correlation, for ordered-
                              category data, are excellent ways to measure rater agreement. They estimate what the
                              correlation between raters would be if ratings were made on a continuous scale; they are,
                              theoretically, invariant over changes in the number or "width" of rating categories. The
                              tetrachoric and polychoric correlations also provide a framework that allows testing of
                              marginal homogeneity between raters. Thus, these statistics let one separately assess both
                              components of rater agreement: agreement on trait definition and agreement on
                              definitions of specific categories.
                              The tetrachoric correlation coefficient, rtet, is used when both variables are dichotomous,
                              like the phi, but we need also to be able to assume both variables really are continuous
                              and normally distributed. Thus it is applied to ordinal vs. ordinal data which has this
                              characteristic. Ranks are discrete, so in this manner it differs from the Spearman. The
                              formula involves a trigonometric function called cosine.
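   Formula
                              $$r_{tet} = \cos\left(\frac{180^\circ}{1 + \sqrt{BC / AD}}\right)$$
                              where A, B, C, and D are the four cell frequencies of the 2 x 2 table.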
   Example
       h. Partial Correlation Coefficient
   Description
                            Partial correlation analysis is aimed at finding the correlation between two variables after
                            removing the effects of other variables. This type of analysis helps spot spurious
                            correlations (i.e., correlations explained by the effect of other variables) as well as
                            reveal hidden correlations (i.e., correlations masked by the effect of other variables).
                            The central concept in partial correlation analysis is the partial correlation coefficient
                            rxy.z between variables x and y, adjusted for a third variable z. Both x and y are
                            presumed to be linearly related to z:
                                                            x = Az + B + dx;
                                                            y = Cz + D + dy;
                            The partial correlation coefficient rxy.z is defined as the correlation
                            coefficient between residuals dx and dy in this model.
   Formula
                              The partial correlation coefficient rxy.z between x and y adjusted for z may be computed
                              from the pairwise values of the correlation between variables x, y, and z (rxy, ryz, rxz):
                              $$r_{xy.z} = \frac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$$
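                              As a minimal sketch, the formula translates directly into a small Python function. The
                              function name partial_correlation and the three pairwise correlations used below are
                              hypothetical values chosen only to illustrate the arithmetic; they do not come from any
                              data set in this handout.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, adjusted for z."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("undefined when x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical pairwise correlations, for illustration only.
r = partial_correlation(r_xy=0.6, r_xz=0.5, r_yz=0.4)
print(round(r, 3))  # 0.504
```

                              In this made-up case the correlation drops from 0.6 to about 0.5 once z is held constant,
                              meaning part of the raw association between x and y is shared with z.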
   Example
       References:
           (1) http://www.surveysystem.com/correlation.htm
           (2) https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php
           (3) http://www.pmean.com/definitions/phi.htm
           (4) http://en.wikipedia.org/wiki/Correlation_and_dependence
           (5) http://www.andrews.edu/~calkins/math/edrm611/edrm13.htm
           (6) http://www.statistics.com/index.php?page=glossary&term_id=538