TEACHER’S CORNER
Published online 3 December 2014 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/pst.1659
     The disagreeable behaviour of the
     kappa statistic
Laura Flight and Steven A. Julious*

Medical Statistics Group, University of Sheffield, Sheffield, England
*Correspondence to: Professor Steven Julious, Medical Statistics Group, ScHARR, University of Sheffield, 30 Regent Court, Regent Street, Sheffield, England, S1 4DA. E-mail: s.a.julious@sheffield.ac.uk
     It is often of interest to measure the agreement between a number of raters when an outcome is nominal or ordinal. The kappa
     statistic is used as a measure of agreement. The statistic is highly sensitive to the distribution of the marginal totals and can
     produce unreliable results. Other statistics such as the proportion of concordance, maximum attainable kappa and prevalence
     and bias adjusted kappa should be considered to indicate how well the kappa statistic represents agreement in the data. Each
     kappa should be considered and interpreted based on the context of the data being analysed. Copyright © 2014 John Wiley &
     Sons, Ltd.
     Keywords: kappa statistic; concordance; agreement; PABAK
     1. INTRODUCTION
It is often of interest to measure the agreement between a number of raters when an outcome is nominal or ordinal. The kappa statistic, first proposed by Cohen [1], is a measure of agreement and is frequently used. This method, however, has a number of limitations.

This Teacher's Corner article gives a description of the statistic and important related concepts, followed by a motivational example of how the limitations affect the usefulness of kappa in practice.

      Table I. Interpretations of the kappa statistic

      Kappa          Agreement
      < 0.20         Poor
      0.21–0.40      Fair
      0.41–0.60      Moderate
      0.61–0.80      Good
      0.81–1.00      Very Good
2. THE KAPPA STATISTIC

The measurement of agreement often arises in the context of reliability, where the concern is less about whether there is an association or correlation between the classifications of two raters but more about whether the two raters agree [2]. Bloch and Kraemer [3] suggest that agreement is a distinct kind of association describing how well one rater's classification agrees with another's, or how well a single rater's classifications at one time point agree with their classifications at another time point.

Throughout this article, the main focus is on inter-rater agreement, where it is assumed the objects to be classified are independent, the raters make their classifications independently and the categories are independent [1]. The issues raised can also be generalised to intra-rater agreement.
2.1. Cohen's Kappa Statistic

The kappa statistic was proposed by Cohen in the context of two raters classifying objects into two categories. The proportion of concordance p_o is the number of times the two raters classify objects into the same category divided by the total sample size (N). This is the observed agreement between the raters and can be used as a simple estimate of agreement.

Cohen [1] suggests this estimate alone is not sufficient, as it is necessary to account for agreement expected by chance, denoted by p_e. Consequently, Cohen proposed the kappa statistic:

    \kappa = \frac{p_o - p_e}{1 - p_e}.                                            (1)

This estimates the proportion of agreement between raters after removing any chance agreement. The kappa statistic can be interpreted using the scale taken from Altman [4] given in Table I. Shrout [5] gives more conservative interpretations of the statistic; for example, values in the range 0.41–0.60 represent only fair agreement. Throughout, the Altman scale of interpretation is used.
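To make the calculation concrete, the short sketch below computes p_o, p_e and kappa for a 2 x 2 table laid out as in Table II later in the article (cells a, b, c, d). The code and the helper name cohen_kappa_2x2 are illustrative only and are not taken from the paper.

    def cohen_kappa_2x2(a, b, c, d):
        """Cohen's kappa (Equation 1) for a 2x2 table with cells
        [[a, b], [c, d]]."""
        N = a + b + c + d
        g1, g2 = a + b, c + d              # row marginal totals
        f1, f2 = a + c, b + d              # column marginal totals
        p_o = (a + d) / N                  # observed proportion of concordance
        p_e = (f1 * g1 + f2 * g2) / N**2   # agreement expected by chance
        return (p_o - p_e) / (1 - p_e)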
2.2. Weighted Kappa Statistic

When considering multiple categories to classify objects, ordering is important. Cohen's kappa (Equation 1), however, assumes the disagreement between different categories is equally weighted, and the ordering of categories is not important [6]. If disagreements are thought to have varying consequences, a weighted kappa can be calculated. This method applies a weight v_ij to disagreements in the ith row and jth column of the data table. Larger weights apply greater penalties to disagreements with larger consequences [2]. The determination of these weights should be made prior to the collection of data [6].

The formula for a weighted kappa statistic is given by

    \kappa_w = 1 - \frac{\sum v_{ij} p_{o_{ij}}}{\sum v_{ij} p_{e_{ij}}},          (2)

where p_{o_{ij}} is the proportion of observed classifications in the (i, j)th cell, and p_{e_{ij}} is the proportion of expected classifications. The p_{e_{ij}} are found by multiplying the ith row total by the jth column total and dividing by the square of the total sample size, N^2.
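As a sketch of Equation 2 (illustrative code, not taken from the paper), assuming the disagreement weights are supplied as an r x r matrix with zeros on the diagonal and larger values for more serious disagreements:

    def weighted_kappa(counts, weights):
        """Weighted kappa (Equation 2) for an r x r table of counts.
        weights[i][j] holds the disagreement weight v_ij for cell (i, j)."""
        r = len(counts)
        N = sum(sum(row) for row in counts)
        row_tot = [sum(counts[i]) for i in range(r)]
        col_tot = [sum(counts[i][j] for i in range(r)) for j in range(r)]
        num = 0.0   # sum of v_ij times the observed proportion p_o_ij
        den = 0.0   # sum of v_ij times the expected proportion p_e_ij
        for i in range(r):
            for j in range(r):
                num += weights[i][j] * counts[i][j] / N
                den += weights[i][j] * row_tot[i] * col_tot[j] / N**2
        return 1 - num / den

With all off-diagonal weights equal, this reduces to Cohen's kappa in Equation 1.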
      Table II. Data with two raters categorising into two categories

                  1      2      Total
      1           a      b      g1
      2           c      d      g2
      Total       f1     f2     N

3. THE PARADOXES

The following definitions (Box 2) are given to help explain some of the issues arising with the kappa statistic and are used in the context of Table II. Throughout this article, the special dichotomous (2 x 2) case is considered where no weightings are applied. The principles discussed can be generalised to the (r x r) case.

Feinstein and Cicchetti [7] highlight the following issues, which they term 'paradoxes' of the kappa statistic:

   (1) For high values of concordance, low values of kappa can be recorded.
   (2) Asymmetric, imperfectly imbalanced tables have a higher kappa than perfectly imbalanced and symmetric tables.

By examining Equation 1 for the kappa statistic, it is evident that its value is dependent on the proportion of agreement expected by chance, p_e. The smaller the value of p_e, the larger the kappa statistic will be; the larger the p_e, the smaller the kappa. Feinstein and Cicchetti illustrate that the value of p_e is dependent on the distribution of the marginal totals. This is evident when rewriting the formula for p_e as

    p_e = \frac{f_1 g_1 + f_2 g_2}{N^2}.                                           (3)

The values f_1, f_2 and g_1, g_2 are the marginal totals for rater 1 and rater 2, respectively.

The first paradox occurs when there is symmetrical imbalance in the vertical and horizontal marginal totals. The second paradox occurs if the imbalance is asymmetrical or imperfectly symmetrical.
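To see how p_e drives the first paradox, consider the following two hypothetical 2 x 2 tables (invented here for illustration, not taken from the paper). Both have the same proportion of concordance, but the symmetrically imbalanced margins of the second inflate p_e and drag kappa down, even below zero.

    def kappa(a, b, c, d):
        N = a + b + c + d
        p_o = (a + d) / N
        p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / N**2   # Equation 3
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical tables, each with N = 100 and p_o = 0.8:
    print(kappa(40, 10, 10, 40))   # balanced margins: p_e = 0.50, kappa = 0.60
    print(kappa(80, 10, 10, 0))    # imbalanced margins: p_e = 0.82, kappa about -0.11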
3.1. The Maximum Attainable Kappa, κ_max

Feinstein and Cicchetti also highlight that it is not only the magnitude of kappa that is affected by the marginal totals but also the maximum possible value of the statistic [7]. Cohen notes that kappa can only reach the maximum value of one when the off-diagonal elements (b, c) in Table II are equal to zero [1]. For this to occur, the marginal values must be identical, hence g_1 = f_1 and g_2 = f_2, resulting in a perfectly symmetrical table.

The maximum attainable kappa, κ_max, is calculated using

    \kappa_{max} = \frac{p_{oM} - p_e}{1 - p_e},                                   (4)

where p_{oM} is estimated by

    p_{oM} = \min(g_1/N, f_1/N) + \min(g_2/N, f_2/N) + \dots + \min(g_r/N, f_r/N)  (5)

for an r x r table.

Sim and Wright [2] interpret κ_max as reflecting the extent to which the raters' ability to agree is constrained by pre-existing factors that result in unequal marginal totals. It is helpful to report this statistic alongside Cohen's kappa, as it can illustrate how a low value might be a consequence of the marginal totals rather than poor agreement.
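A minimal sketch of Equations 4 and 5 for the 2 x 2 case follows; the helper name kappa_max_2x2 is illustrative and not from the paper.

    def kappa_max_2x2(a, b, c, d):
        """Maximum attainable kappa (Equations 4 and 5) for a 2x2 table."""
        N = a + b + c + d
        g1, g2 = a + b, c + d                      # row marginal totals
        f1, f2 = a + c, b + d                      # column marginal totals
        p_e = (f1 * g1 + f2 * g2) / N**2           # Equation 3
        p_oM = min(g1, f1) / N + min(g2, f2) / N   # Equation 5
        return (p_oM - p_e) / (1 - p_e)            # Equation 4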
3.2. Prevalence and Bias

Byrt et al. [8] describe the kappa statistic as being affected by both the prevalence and the bias between raters. Prevalence in this context is the probability with which a rater classifies an object as 'one' in the study sample, say. This relates to the balance of the table, with low prevalences giving a balanced table.

Bias is concerned with the frequency at which raters choose a particular category and affects the symmetry of the table. A non-biased table will be symmetrical, as the raters do not differ in their frequency of choosing category one. The extent of prevalence and bias can be evaluated using the Prevalence Index, PI = |a - d|/N, and the Bias Index, BI = |b - c|/N [8]. Prevalence and bias are used along with symmetry and balance to explain when the paradoxes of kappa can occur.
4. PREVALENCE AND BIAS ADJUSTED KAPPA

If, after examining the symmetry and balance of a table and estimating the prevalence index (PI) and the bias index (BI), it is clear that the kappa is likely to be influenced by the distribution of the marginal totals, it is possible to calculate an adjusted kappa statistic. Byrt et al. [8] define a statistic that adjusts for prevalence and bias (PABAK). This statistic is estimated by replacing the diagonal elements of the table (a, d) by their average n = (a + d)/2 and the off-diagonal elements (b, c) by their average m = (b + c)/2. Substituting these values into Equations 1 and 3 and using N = a + b + c + d, the formula for PABAK can be reduced to

    \text{PABAK} = \frac{2n/N - 0.5}{1 - 0.5} = 2p_o - 1.                          (6)
5. MOTIVATING EXAMPLE

In this section, a number of examples are used to illustrate how the kappa statistic should not be solely relied upon when assessing agreement.

In the example that is the motivation for this paper, there are N = 261 students who are categorised by two independent assessors as either 'one' or 'two'. It is important to establish whether there is agreement between the assessors. If agreement is poor, it would be necessary to use additional assessors or to have the grading scheme amended.

5.1. Unadjusted Kappa

Data for this example are given in Table III. Using Feinstein and Cicchetti's definitions, this is an example of a symmetrically imbalanced table, as f_1 and g_1 are large, and f_2 and g_2 are small. The proportion of the objects in class one for both assessors is clearly greater than 0.5. There is a high probability that an assessor will classify a student as 'one' and hence there is high prevalence (PI = 0.628). The frequencies with which the assessors choose a particular category are similar, and so there is low bias (BI = 0.234).

      Table III. Data with two assessors categorising into two categories (N = 261)

                  1      2      Total
      1           171    72     243
      2           11     7      18
      Total       182    79     261

The proportion of concordance p_o = 0.682 indicates good agreement between the two assessors. The kappa statistic, on the other hand, is κ = 0.038, with p_e = 0.670. This suggests little agreement beyond that expected by chance between the assessors. The interpretation of agreement varies substantially depending on the summary statistic chosen.
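As a check on these summaries, a short sketch (illustrative, not from the paper) reproduces the Table III quantities; the computed kappa is roughly 0.04, and the remaining values match those quoted above.

    # Table III: a = 171, b = 72, c = 11, d = 7
    a, b, c, d = 171, 72, 11, 7
    N = a + b + c + d                              # 261
    g1, f1 = a + b, a + c                          # 243 and 182: category-one margins
    p_o = (a + d) / N                              # 0.682
    p_e = (f1 * g1 + (N - f1) * (N - g1)) / N**2   # 0.670
    kappa = (p_o - p_e) / (1 - p_e)                # roughly 0.04
    PI = abs(a - d) / N                            # 0.628
    BI = abs(b - c) / N                            # 0.234
    print(round(p_o, 3), round(p_e, 3), round(kappa, 3), round(PI, 3), round(BI, 3))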
Considering the imbalance in the table and the high PI value, it is likely that the kappa statistic is being influenced by the prevalences and hence the distribution of the marginal totals. The proportion of students in the marginal g_1 is 0.931 and in f_1 is 0.697. The maximum attainable kappa is

    \kappa_{max} = \frac{\min(243/261, 182/261) + \min(18/261, 79/261) - 0.670}{1 - 0.670} = 0.292.        (7)

This low value indicates that there is unlikely to ever be strong agreement between the assessors when considering the interpretations in Table I, a consequence of factors influencing their marginal totals.

As an illustration of how the prevalences and the distribution of the marginals influence kappa, in Table III the number of students in the off-diagonal cells is fixed, and the remaining students are split evenly between the diagonal cells. One student is then moved from cell 'two:two' to 'one:one'; hence, the prevalence of each assessor categorising a student as one is increased. The overall concordance, however, remains constant, as the proportion of students awarded the same grade by each assessor does not change. The proportion of concordance is constant whereas the kappa statistic falls.

Although the proportion of concordance and kappa are not strictly comparable, as concordance does not account for agreement expected by chance, the large difference in their values indicates that the kappa may not be behaving as expected.

The proportion of students in the marginal g_1 and the kappa statistic are calculated as one student is moved as described. These two statistics are plotted in Figure 1 to demonstrate how there is a change in kappa despite no change in the concordance and, arguably, the agreement. The vertical line on the plot marks the proportion of students in g_1 for Table III, and the horizontal line marks the value of the kappa statistic for this table. The kappa statistic even falls below zero when the proportion in the marginal g_1 approaches 0.95. This negative kappa is interpreted by Viera and Garrett [9] as less than chance agreement.

[Figure 1. Kappa statistic plotted against the proportion of students rated as one:one (g_1/N).]
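The behaviour shown in Figure 1 can be reproduced with the short sketch below (illustrative code, not from the paper). It also prints PABAK, which stays constant throughout, anticipating the point made in Section 5.2.

    # Off-diagonal cells fixed at b = 72, c = 11; the remaining 178 students
    # start split evenly across the diagonal and are moved one at a time
    # from cell two:two to cell one:one.
    b, c = 72, 11
    N = 261
    for moved in range(90):
        a, d = 89 + moved, 89 - moved
        g1, f1 = a + b, a + c                          # category-one margins
        p_o = (a + d) / N                              # constant at 0.682
        p_e = (f1 * g1 + (N - f1) * (N - g1)) / N**2
        kappa = (p_o - p_e) / (1 - p_e)                # falls, negative near g1/N = 0.95
        pabak = 2 * p_o - 1                            # constant at 0.364
        print(f"{g1 / N:.3f}  {kappa:+.3f}  {pabak:.3f}")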
Based on the motivational example, scenarios are imagined with a sample size of 261 where symmetry and imbalance impact on kappa. If a table has symmetry and is perfectly imbalanced, such as the hypothetical example in Table IV, this indicates some agreement between the assessors. The PI value is 0.157, a low value reflecting the distribution of students in the marginals being balanced for both assessors. The BI score of 0 is a consequence of the perfect symmetry of the table, with g_1 = f_1 and g_2 = f_2. The proportion of concordance for this table is 0.609, which further indicates good agreement; the kappa, however, is 0.189. Using the interpretations in Table I, this suggests poor agreement.

      Table IV. Perfectly symmetrical (N = 261)

                  1      2      Total
      1           100    51     151
      2           51     59     110
      Total       151    110    261

Table V (PI = 0.149, BI = 0.383) also has p_o = 0.609; however, this table is asymmetrical. One assessor categorises more students in category one and fewer students in category two when compared with the second assessor, so there is a difference in the distribution of the assessors' marginal totals. This suggests less agreement than in the symmetric Table IV, as intuitively assessors who agree will have similar marginal distributions [7]. The kappa statistic for Table V, however, is larger, κ = 0.391. This value indicates greater agreement despite the lack of agreement in the marginal totals.

      Table V. Imperfectly asymmetrical (N = 261)

                  1      2      Total
      1           60     101    161
      2           1      99     100
      Total       61     200    261

5.2. Adjusted Kappa

In the motivating example, PABAK = 0.364, much greater than the unadjusted kappa. The high prevalence and imbalance in the table are decreasing the kappa statistic, which can lead to incorrect inferences about the agreement between the two raters.

Varying the prevalences in Table III by moving a student from two:two to one:one as before, the PABAK statistic remains constant as the proportion of objects in the marginal g_1 changes. This illustrates how the adjusted kappa is less influenced by prevalence and the distribution of the marginals.
6. DISCUSSION

The motivational and hypothetical examples presented demonstrate some of the ways a kappa statistic can behave in an unexpected and disagreeable way. Feinstein and Cicchetti attribute these problems to the distribution of the marginal totals in a table and the effect this has on the proportion of agreement expected by chance, p_e [7].

Byrt et al. describe the properties of the data in terms of prevalence and bias [8]. The Bias Index and the Prevalence Index are proposed as ways of measuring how bias and prevalence might influence the marginal totals and impact on the interpretation of kappa.

The motivational example demonstrates how the kappa statistic can mask high levels of agreement seen when considering only the observed proportion of agreement. The large discrepancies between these two values can result in misleading inferences being made if relying solely on the kappa statistic.

The hypothetical examples show how it is not feasible to use standardised interpretations of kappa such as those given in Table I. When comparing two tables, a higher kappa value might actually be associated with a table that has intuitively less agreement.

Interpretations of the kappa statistic should be made on a case by case basis after considering all the characteristics of the data. These include the proportion of concordance and the symmetry and balance of the marginal totals. The PI, BI and maximum attainable kappa κ_max should also be calculated so that appropriate interpretations can be made.

Alternatives such as the prevalence and bias adjusted kappa (PABAK) can be calculated to facilitate comparisons with the unadjusted kappa and to assess how sensitive the statistic is to the distribution of the marginal totals [8]. It is unlikely that comparing kappa statistics from different tables and data will be of use because of these sensitivities.
7. CONCLUSION

In this Teacher's Corner article, an example is used to describe the kappa statistic. Although a well known and frequently used method for estimating levels of agreement, the kappa statistic has serious limitations. It is highly sensitive to the distribution of the marginal totals and can produce unreliable results.

The kappa statistic should be used with caution, and inferences made should account for these limitations. Kappas from different data are unlikely to be directly comparable, and a generic scale of interpretation should be used with these limitations in mind. It is recommended, when reporting the kappa statistic, to also report the proportion of concordance, PABAK, PI, BI and κ_max to allow a fully informed judgement about the usefulness of kappa regarding the level of agreement seen [10]. Each kappa should be considered and interpreted based on the context in hand.

Acknowledgements

This report is independent research arising from an NIHR Research Methods Fellowship, RMFI-2013-04-011 Goodacre, supported by the National Institute for Health Research. The views expressed in this publication are those of the author(s) and not necessarily those of the NHS, the National Institute for Health Research, the Department of Health or the University of Sheffield.

We would like to thank two anonymous reviewers for their valuable comments, which greatly improved the paper.

REFERENCES

 [1] Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20:37–46.
 [2] Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical Therapy 2005; 85:257–268.
 [3] Bloch DA, Kraemer HC. 2 × 2 kappa coefficients: measures of agreement or association. Biometrics 1989; 45:269–287.
 [4] Altman DG. Practical Statistics for Medical Research. Chapman and Hall: London, 1991, pp. 404.
 [5] Shrout PE. Measurement reliability and agreement in psychiatry. Statistical Methods in Medical Research 1998; 7:301–317.
 [6] Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 1968; 70:213–220.
 [7] Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 1990; 43:543–548.
 [8] Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. Journal of Clinical Epidemiology 1993; 46(5):423–429.
 [9] Viera A, Garrett JM. Understanding interobserver agreement: the kappa statistic. Family Medicine 2005; 37(5):360–363.
[10] Chen G, Faris P, Hemmelgarn B, Walker RL, Quan H. Measuring agreement of administrative data with chart data using prevalence unadjusted and adjusted kappa. BMC Medical Research Methodology 2009; 9:5. DOI: 10.1186/1471-2288-9-5.