Inter-Rater Reliability
Received July 31st, 2012; revised August 28th, 2012; accepted September 14th, 2012
Objective Structured Clinical Examinations (OSCEs) have been used globally to evaluate clinical competence in the education of health professionals. Despite the objective intent of OSCEs, the scoring methods used by examiners remain a potential source of measurement error affecting the precision with which test scores are determined. In this study, we investigated differences in the inter-rater reliabilities of objective checklist and subjective global rating scores given by examiners (who were exposed to an online training program to standardise scoring techniques) across two medical schools. Examiners' perceptions of the e-scoring program were also investigated. Two Australian universities shared three OSCE stations in their end-of-year undergraduate medical OSCEs. The scenarios were video-taped and used for on-line examiner training prior to the actual exams. Examiner ratings of performance at both sites were analysed using generalisability theory. A single-facet, all-random persons-by-raters design [PxR] was used to measure inter-rater reliability for each station, separately for checklist scores and global ratings. The resulting variance components were pooled across stations and examination sites. Decision studies were used to obtain reliability estimates. There was no significant mean score difference between examination sites. Variation in examinee ability accounted for 68.3% of the total variance in checklist scores and 90.2% in global ratings. The rater contribution was 1.4% and 0% of the total variance in checklist scores and global ratings respectively, reflecting high inter-rater reliability of the scores provided by co-examiners across the two schools. Score variance due to interaction and residual error was larger for checklist scores than for global ratings (30.3% vs 9.7%). Reproducibility coefficients for global ratings were higher than for checklist scores. Survey results showed that the e-scoring package facilitated consensus on scoring techniques. This approach to examiner training also allowed examiners to calibrate the OSCEs in their own time. This study revealed that inter-rater reliability was higher for global ratings than for checklist scores, thus providing further evidence for the reliability of subjective examiner ratings.

Keywords: Objective Structured Clinical Examination; Inter-Rater Reliability; Checklist Scores; Global Ratings
et al., 1997; Cushing, 2002). These authors have demonstrated the reliability and validity of global rating scales, thereby providing evidence that subjectivity may not be inherently unreliable. Global ratings have also been reported to better evaluate the performance of advanced students as well as negate some of the nuances associated with checklists (Van der Vleuten et al., 1991; Regehr et al., 1998; Hodges et al., 1999). Some studies have compared the psychometric properties of checklists and global rating scales on OSCEs and concluded that global rating scales scored by experts showed higher inter-station reliability, better construct validity and better concurrent validity than did checklists (Hodges et al., 1997; Regehr et al., 1998).
   Intensive examiner training improves inter-rater reliability, as it ensures that all raters interpret item descriptions similarly and apply similar standards to students' performance (Williams et al., 2003; Spencer & Silverman, 2004). Although earlier studies indicated that examiner training varied in effectiveness as a function of medical experience (Newble et al., 1980; Van der Vleuten et al., 1989), more recent studies have demonstrated the high impact of examiner training on the consistency of scoring (Humphrey-Murto et al., 2005; Chesser et al., 2009).
   However, establishing excellent examiner training sessions remains a major problem for medical schools, given increasing numbers of students, the difficulty of finding a sufficient number of experienced examiners for multi-site exams, and the challenge of getting time-poor clinicians away from their other activities to attend examiner-training sessions. Innovative and feasible approaches to tackling these tasks are necessary. The primary purpose of this study was to compare the inter-rater reliabilities of checklist and global rating scores of examiners who were exposed to an online training program (to standardise scoring techniques) across two medical schools. The study also examined examiners' perceptions of the feasibility and usability of the e-scoring program.

                           Methods

Study Context
   In November 2010, two Australian medical schools (A and B) participated in a collaborative inter-school study of clinical competence in which three OSCE stations were developed and embedded in the end-of-year clinical examinations (of the 3rd and 4th years respectively). School A runs a five-year undergraduate medical programme, while School B runs a six-year undergraduate programme. Both schools have similar horizontally and vertically integrated outcomes-based curricula. The selected year groups were chosen because of their comparable levels of intended learning outcomes.

The Shared OSCE Stations
   The three OSCEs (chest pain, diabetic foot and gallstones) comprised eight-minute stations and were administered to a total of 119 third-year medical students at School A and 94 fourth-year medical students at School B. The three OSCE stations covered a range of core clinical competencies with which examiners at both schools were familiar. Between five and nine task-specific checklist items were developed for each case. The behaviourally anchored 4- to 7-point rating scales assessed degree of coherence, empathy, and verbal and non-verbal expressions.

Examination Procedure
   The examination at School A was conducted over a two-day period with two different cohorts of students, while at School B it was a one-day event with the three shared OSCEs embedded in a 12-station OSCE examination. Two concurrent sessions of each station were conducted at School A and four were conducted at School B, each with one SP and one examiner. Clearance was obtained from the relevant ethics committee for this study.

Examiners
   Three examiners were independently selected from each school to serve as external examiners, one on each of the shared stations, and to double mark with the internal examiners at the other school. Each external examiner independently double marked a total of 20 student observations. Each examiner rated student performance by first scoring the task-specific checklist and then completing a global rating. The two components were then summed to generate an overall performance score.

Examiner Training
   To aid examiner training and standardise marking across the two examination sites, an OSCE e-scoring tool was developed and set up on a secure intranet site in the on-line Blackboard Learning System Vista environment. The three shared OSCE scenarios were videotaped and used for the on-line examiner training; PGY1 residents (interns) were recruited to role play as medical students, and SPs were recruited from the SP pool. Informed consent and confidentiality agreements were obtained from all the video participants.
   A total of 24 examiners were involved in the on-line OSCE training program. All the internal (on only the shared OSCEs) and external examiners were invited via email and given login access and instructions on how to use the program; the video clips were made accessible to the examiners one week prior to the examination. The examiners were able to view the recordings in their own time and assess the interns' performances.
   Each examiner was asked to watch two unlabelled scenarios (poor and good performance) of the OSCE case which they had been assigned to examine. After watching each scenario, they were required to assess the performance using the marking sheet provided in another window. The station information and criteria for marking were also made available. After completing and submitting their marking/scoring sheet, the examiners were then able to view and compare the scores they had given for the checklist task and the global rating with those already submitted. This enabled examiners at both sites to achieve consensus regarding what constituted unsatisfactory, borderline or satisfactory performance. The SPs on the shared OSCE stations were allowed to view the video clips, and they discussed expected performance face-to-face with the internal examiners.
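   The comparison step described above can be illustrated with a small sketch. The snippet below is purely hypothetical (the actual tool ran inside Blackboard Vista and its implementation is not described here); it only shows the kind of feedback an examiner might receive when their checklist score and global rating are compared with submissions from other examiners. The function name summarise_submission and the example scores are illustrative, not part of the study.

# Hypothetical illustration of the "compare with scores already submitted" step;
# not the actual Blackboard-based e-scoring tool described in this paper.
from statistics import mean, pstdev

def summarise_submission(checklist, global_rating, prior_submissions):
    """Compare one examiner's scores with those already submitted.

    prior_submissions: list of (checklist_score, global_rating) tuples from
    examiners who have already scored the same video scenario.
    """
    if not prior_submissions:
        return "You are the first examiner to score this scenario."
    prior_checklists = [c for c, _ in prior_submissions]
    prior_globals = [g for _, g in prior_submissions]
    return (
        f"Your checklist score: {checklist} "
        f"(others: mean {mean(prior_checklists):.1f}, SD {pstdev(prior_checklists):.1f}); "
        f"your global rating: {global_rating} "
        f"(others: mean {mean(prior_globals):.1f}, SD {pstdev(prior_globals):.1f})"
    )

# Example: an examiner scores the 'good performance' video after three colleagues.
print(summarise_submission(78, 5, [(74, 4), (80, 5), (76, 4)]))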
Statistical Analysis
Quantitative Data
   Descriptive statistics of the on-line training scores and comparative analyses of checklist scores and global ratings at both schools were calculated using SAS.
The difference between internal and external examiners' scores was tested using a two-sample t-test. Generalisability analysis was used to test for inter-rater reliability across sites. Multilevel mixed-effects linear regression in STATA was used to calculate the variance components and to evaluate the magnitude of the different sources of variation affecting the measurement. Different pairs of raters assessed examinees at each of the three stations, and the examination at School A was conducted over two days with a different cohort on each day. Because of this disconnected design, variance components for each station within each site were estimated separately, and the estimates were pooled across sites to eliminate confounding of the proficiency of examinee groups with the stringency of examiner groups across sites. For both checklist scores and global ratings, a single-facet, random, raters/examiners (R) by persons/examinees (P) design [PxR], with the person-by-rater interaction confounded with residual error (PxR,e), was used to assess inter-rater reliability. A D-study was used to obtain the reliability estimates.
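   As a rough illustration of this analysis, the sketch below estimates the variance components of a single-facet, fully crossed PxR design from a persons-by-raters score matrix and then projects reliability for one and two raters, as a D-study does. It is a minimal sketch in Python using the classical ANOVA (expected-mean-squares) estimator on simulated data; the study itself used SAS and multilevel mixed-effects regression in STATA, and the numbers below are not the study data.

# A minimal sketch (not the authors' SAS/STATA code) of a single-facet,
# fully crossed persons-by-raters (PxR) generalisability analysis, using
# the classical ANOVA / expected-mean-squares estimator on simulated data.
import numpy as np
from scipy.stats import ttest_ind

def g_study(scores: np.ndarray):
    """Variance components: person, rater, person-x-rater + residual."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA without replication.
    ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
    resid = scores - person_means[:, None] - rater_means[None, :] + grand
    ms_pr_e = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square solutions (negative estimates truncated at zero).
    var_pr_e = ms_pr_e
    var_p = max((ms_p - ms_pr_e) / n_r, 0.0)
    var_r = max((ms_r - ms_pr_e) / n_p, 0.0)
    return var_p, var_r, var_pr_e

def d_study_g(var_p: float, var_pr_e: float, n_raters: int) -> float:
    """Relative G coefficient projected for a D-study with n_raters raters."""
    return var_p / (var_p + var_pr_e / n_raters)

# Simulated example: 20 examinees double marked by 2 examiners at one station.
rng = np.random.default_rng(0)
ability = rng.normal(70, 10, size=20)
scores = ability[:, None] + rng.normal(0, 6, size=(20, 2))

var_p, var_r, var_pr_e = g_study(scores)
print("variance components:", var_p, var_r, var_pr_e)
print("G (1 rater):", d_study_g(var_p, var_pr_e, 1))
print("G (2 raters):", d_study_g(var_p, var_pr_e, 2))

# Two-sample t-test, analogous to comparing internal and external examiners' scores.
print(ttest_ind(scores[:, 0], scores[:, 1]))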
Qualitative Data
   To capture their perceptions of the on-line training/e-scoring program, examiners were prompted to provide anonymous responses to four open-ended on-line survey questions, which were administered to them immediately after they completed their scoring of the OSCE scenarios. The examiners were asked to 1) comment on the aspects they liked most about the e-scoring program; 2) comment on the aspects they did not like; 3) proffer suggestions for improving the program; and 4) provide their views on the effect of the program on future assessments. The survey data were collated, and emerging themes were independently coded and confirmed by two researchers. Illustrative quotes are reported verbatim in Appendix 1.

                           Results

   Table 1 portrays the mean checklist scores and global ratings ± the standard deviation (SD) given by co-examiners during the actual examination. There were no statistical differences in the mean scores given by the internal and external examiners in both schools.
Table 1.
Descriptive statistics for checklist scores and global ratings at both sites (mean scores ± standard deviation).
      Station             Examiner            School A checklist score         School B checklist score              School A global rating        School B global rating
    Chest pain            Internal                     74.3 ± 9.7                        70.0 ± 9.9                        4.3 ± 1.0                      4.3 ± 1.0
      N = 20              External                 71.15 ± 10.2                        72.6 ± 10.6                         4.4 ± 0.8                      4.3 ± 0.8
  Diabetic foot           Internal                     65.0 ± 12.3                     69.0 ± 15.2                         3.5 ± 1.3                      3.6 ± 1.4
      N = 20              External                     63.3 ± 13.1                     67.7 ± 14.9                         3.4 ± 1.3                      3.6 ± 1.4
    Gallstones            Internal                     72.0 ± 11.9                     77.0 ± 12.4                         4.2 ± 1.0                      4.4 ± 1.0
      N = 20              External                 72.15 ± 10.4                        75.4 ± 11.1                         4.1 ± 0.9                      4.2 ± 1.2
   Total Scores                                        69.6 ± 11.3                     71.9 ± 12.4                         3.9 ± 1.1                      4.0 ± 1.1
Table 2.
Variance component estimates and G coefficients for checklist scores and global ratings.

                                           Checklist scores                                              Global ratings
  School   Station(s)           σ²P      σ²R     σ²PxR,e   G (1 rater)  G (2 raters)      σ²P      σ²R    σ²PxR,e   G (1 rater)  G (2 raters)
    A      3                  110.79    5.38      14.87       0.882        0.937         1.553    0.003    0.172       0.9          0.947
           Combined stations  258.66    5.38     113.02       0.696        0.821a        3.003    0.003    0.397       0.883        0.938a
           % variation        68.60%   1.40%     30.00%                                 88.20%    0.10%   11.70%

   The estimated variance components from the generalisability analyses for checklist scores and global ratings are presented in Table 2. Pooled score variance attributed to student ability was higher for global ratings than for checklist scores
(90.2% vs 68.3%). The rater effect accounted for 1.4% and 0% of the total variance in checklist scores and global ratings respectively. Score variance due to interaction and residual error was larger for checklist scores than for global ratings (30.3% vs 9.7%).
   G coefficients for checklist scores and global ratings are also presented in Table 2. G coefficients varied across cases, with the lowest values obtained on the diabetic foot station at both schools. In addition, reliability estimates for the global ratings were higher than for the checklists.
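   These coefficients are consistent with the standard single-facet D-study formula for the relative G coefficient. As a worked check (not an additional analysis), taking the pooled checklist components from the "Combined stations" row of Table 2:

\[
E\rho^{2}(n_r) \;=\; \frac{\sigma^{2}_{P}}{\sigma^{2}_{P} + \sigma^{2}_{P \times R,e}/n_r},
\qquad
E\rho^{2}(1) = \frac{258.66}{258.66 + 113.02} \approx 0.70,
\qquad
E\rho^{2}(2) = \frac{258.66}{258.66 + 113.02/2} \approx 0.82.
\]

The same formula applied to the pooled global-rating components gives approximately 0.88 and 0.94, and the percentage rows follow from dividing each component by the total (e.g., 258.66/377.06 ≈ 68.6% for examinee variance in checklist scores).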
   Survey results showed that examiners valued the process because it gave them an opportunity to see a "dry run" of the station and allowed them to set the "expected standard" for the station prior to the actual exam (Appendix 1). They also indicated that this sort of tool should be used more widely in OSCEs. However, they pointed out that scoring borderline performance, rather than good or poor performance, would make the e-scoring process more useful.

                          Discussion

   The observed low variance in rater effect in our study indicates high inter-rater reliability, meaning each rater's scores are consistent across different students. The results also indicate that there are no significant differences in average scores across raters; hence the assessment clearly reveals the competence of each examinee. Our results show higher inter-rater agreement for global ratings than for checklist scores. A growing body of literature has reported that global ratings have higher reliability than checklist scores and are better able to discriminate between examinees (Hodges et al., 1999; Govaerts et al., 2002; Hodges et al., 2003; Wilkinson et al., 2003). The higher examinee and lower residual variance estimates observed for the global ratings in this study, in comparison to the checklist scores, echo these findings.
   McManus et al. (2006) reported that thorough selection, monitoring and training did not eliminate the examiner stringency/leniency effect. However, our study indicates otherwise, with the observed lower variance due to examiner differences. This might be a result of the online training, which allowed examiners to agree on the "expected standard" for each station prior to the actual examination. The use of two examiners to reduce examiner bias has been proposed (Norcini, 2002; Wilkinson et al., 2003), but our findings demonstrate that, with on-line examiner training, reliabilities of 0.7 and above can be achieved for high-stakes examinations even with one examiner per station, indicating that there is little or no benefit in using examiners to double mark. Interestingly, our study showed that external examiners gave lower scores than internal examiners; this may indicate the effect of examiner familiarity with candidates as a potential source of bias (Stroud et al., 2011).
   Researchers have suggested that variability in performance across cases is not simply related to content variation, but to other factors, such as pattern recognition based on irrelevant contextual features of the case (Govaerts et al., 2002). The observed varying magnitudes of estimated variance components across stations (cases) may indicate that the relative ordering of cases and the specificity of case content have a large effect on the variance. There is therefore a need to explore the magnitude of variance attributable to case, content and/or context specificity.
   The survey results showed that the e-scoring program offered training for both quality assurance and appraisal purposes. The examiners valued the process as it allowed them to reach consensus about their scoring techniques and resulted in similar trends of scoring in both schools. Furthermore, given the busy schedules of clinicians and the challenges of getting away from their other activities to attend examiner-training sessions, the e-scoring package allowed examiners to use it in their own time. Most of them found it easy to navigate through the program, but a few expressed difficulties in understanding the technology as well as the statistics generated for the comparison of scores.
   The examiners also suggested that scoring of borderline performances would be more useful, indicating that it was easier for them to identify and agree on their ratings, particularly for good performance. This is a valid point, given that borderline students are the ones medical educators are most concerned about. It is important for examiners to be able to make accurate pass/fail decisions so that only competent students are allowed to progress academically. On the whole, the examiners concurred on the efficacy and the possibility of wider use of the e-scoring program.
   The major limitation of this study is the small number of stations used. In addition, the rating of the global scales after the checklists could have affected examiner scoring of student performance. Due to the design of the study, inter-case reliability and the comparison between trained and non-trained examiners could not be determined. Further studies should explore these areas.

                          Conclusion

   The results of this study suggest that global rating scales are a more appropriate summative measure than checklists in assessing examinees on performance-based tests, providing further support for the reliability of subjective examiner judgments. This study also indicates the possible elimination of examiner-variance measurement error with the use of an on-line examiner training program. The tool holds great promise for high-stakes performance-based assessments conducted across multiple sites and will afford time-poor, geographically separated clinicians the opportunity to better engage in the assessment process.

                      Acknowledgements

   The authors would like to thank Jo Hanuszewicz, Di Madden, Kathy Spencer, Felicity Ey, Donna-May Brown, Kaspar Willson, Matt Holmes, Milford McArthur, Theresa Mokry, Stefan Blechinger, Leslee Wells, Gail Richardson and Florence Schaeffer for their contributions to the video recordings and data collation. The authors also acknowledge the contributions of the examiners.

                        REFERENCES

Chesser, A., Cameron, H., Evans, P., Cleland, J., Boursicot, K., & Mires, G. (2009). Sources of variation in performance on a shared OSCE station across four UK medical schools. Medical Education, 43, 526-532. doi:10.1111/j.1365-2923.2009.03370.x
Cohen, D. S., Colliver, J. A., Robbs, R. S., & Swartz, M. H. (1997). A large-scale study of the reliabilities of checklist scores and ratings of interpersonal and communication skills evaluated on a standardised-patient examination. Advances in Health Science Education, 1, 209-213. doi:10.1023/A:1018326019953
Cunnington, J. P. W., Neville, A. J., & Norman, G. R. (1997). The risks of thoroughness: Reliability and validity of global ratings and checklists in an OSCE. Advances in Health Science Education, 1, 27-33.
Cushing, A. (2002). Assessment of non-cognitive factors. In G. R. Norman, C. P. M. van der Vleuten, & D. I. Newble (Eds.), International handbook of research in medical education (pp. 711-755). Dordrecht: Kluwer Academic Publishers. doi:10.1007/978-94-010-0462-6_27
Downing, S., & Yudkowsky, R. (2009). Assessment in health professions education. London: Routledge.
Govaerts, M. J. B., Van der Vleuten, C. P. M., & Schuwirth, L. W. T. (2002). Optimising the reproducibility of a performance-based assessment test in midwifery education. Advances in Health Science Education, 7, 133-145. doi:10.1023/A:1015720302925
Harden, R. M., Stevenson, M., Downie, W. W., & Wilson, G. M. (1975). Assessment of clinical competence using objective structured examination. British Medical Journal, 1, 447-451. doi:10.1136/bmj.1.5955.447
Harden, R. M., & Gleeson, F. A. (1979). Assessment of clinical competence using an objective structured clinical examination (OSCE). Medical Education, 13, 41-54. doi:10.1111/j.1365-2923.1979.tb00918.x
Hodges, B., Regehr, G., Hanson, M., & McNaughton, N. (1997). An objective structured clinical examination for evaluating psychiatric clinical clerks. Academic Medicine, 72, 715-721. doi:10.1097/00001888-199708000-00019
Hodges, B., Regehr, G., McNaughton, N., Tiberius, R., & Hanson, M. (1999). Checklists do not capture increasing levels of expertise. Academic Medicine, 74, 1129-1134. doi:10.1097/00001888-199910000-00017
Hodges, B., McNaughton, N., Regehr, G., Tiberius, R., & Hanson, M. (2002). The challenge of creating new OSCE measures to capture the characteristics of expertise. Medical Education, 36, 742-748. doi:10.1046/j.1365-2923.2002.01203.x
Hodges, B., & McIlroy, J. H. (2003). Analytic global OSCE ratings are sensitive to level of training. Medical Education, 37, 1012-1016. doi:10.1046/j.1365-2923.2003.01674.x
Humphrey-Murto, S., Smee, S., Touchie, C., Wood, T. J., & Blackmore, D. E. (2005). A comparison of physician examiners and trained assessors in a high-stakes OSCE setting. Academic Medicine, 80, S59-S62. doi:10.1097/00001888-200510001-00017
Kirby, R. L., & Curry, L. (1982). Introduction of an objective structured clinical examination (OSCE) to an undergraduate clinical skills programme. Medical Education, 16, 362-364. doi:10.1111/j.1365-2923.1982.tb00951.x
McManus, I. C., Thompson, M., & Mollon, J. (2006). Assessment of examiner leniency and stringency ("hawk-dove effect") in the MRCP (UK) clinical examination (PACES) using multi-facet Rasch modelling. BMC Medical Education, 6, 1272-1294.
Newble, D. (2004). Techniques for measuring clinical competence: Objective structured clinical examinations. Medical Education, 38, 199-203. doi:10.1111/j.1365-2923.2004.01755.x
Newble, D. I., Hoare, J., & Sheldrake, P. F. (1980). The selection and training of examiners for clinical examinations. Medical Education, 14, 345-349. doi:10.1111/j.1365-2923.1980.tb02379.x
Norcini, J. J. (2002). The death of the long case? British Medical Journal, 324, 408-409. doi:10.1136/bmj.324.7334.408
Regehr, G., Freeman, R., Hodges, B., & Russell, L. (1999). Assessing the generalisability of OSCE measures across content domains. Academic Medicine, 74, 1320-1322. doi:10.1097/00001888-199912000-00015
Regehr, G., MacRae, H., Reznick, R. K., & Szalay, D. (1998). Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Academic Medicine, 73, 993-997. doi:10.1097/00001888-199809000-00020
SAS (2009). Statistical Analysis System. Cary, NC: SAS Institute.
Spencer, J. A., & Silverman, J. (2004). Communication education and assessment: Taking account of diversity. Medical Education, 38, 116-118. doi:10.1111/j.1365-2923.2004.01801.x
StataCorp. (2011). Stata Statistical Software: Release 12. College Station, TX: StataCorp LP.
Stroud, L., Herold, J., Tomlinson, G., & Cavalcanti, R. B. (2011). Who you know or what you know? Effect of examiner familiarity with residents on OSCE scores. Academic Medicine, 86, S8-S11. doi:10.1097/ACM.0b013e31822a729d
Van der Vleuten, C. P. M., Van Luyk, S. J., Van Ballegooijen, A. M. J., & Swanson, D. B. (1989). Training and experience of examiners. Medical Education, 23, 290-296. doi:10.1111/j.1365-2923.1989.tb01547.x
Van der Vleuten, C. P. M., Norman, G. R., & De Graaff, E. (1991). Pitfalls in the pursuit of objectivity: Issues of reliability. Medical Education, 25, 110-118. doi:10.1111/j.1365-2923.1991.tb00036.x
Wilkinson, T. J., Frampton, C. M., Thompson-Fawcett, M., & Egan, T. (2003). Objectivity in objective structured clinical examinations: Checklists are no substitute for examiner commitment. Academic Medicine, 78, 219-223. doi:10.1097/00001888-200302000-00021
Williams, R. G., Klamen, D. A., & McGaghie, W. C. (2003). Cognitive, social and environmental sources of bias in clinical competence ratings. Teaching and Learning Medicine, 15, 270-292. doi:10.1207/S15328015TLM1504_11