
How to measure the quality of the OSCE: A review of metrics


Godfrey Pell
Richard Fuller
Matthew Homer
Trudie Roberts

AMEE GUIDE
Assessment 49

AMEE Guides in Medical Education www.amee.org


Welcome to AMEE Guides Series 2
The AMEE Guides cover important topics in medical and healthcare professions education and provide
information, practical advice and support. We hope that they will also stimulate your thinking and reflection
on the topic. The Guides have been logically structured for ease of reading and contain useful take-home
messages. Text boxes highlight key points and examples in practice. Each page in the guide provides a
column for your own personal annotations, stimulated either by the text itself or the quotations. Sources of
further information on the topic are provided in the reference list and bibliography. Guides are classified
according to subject:

Teaching and Learning; Curriculum Planning; Research in Medical Education; Assessment; Education Management; Theories of Medical Education

The Guides are designed for use by individual teachers to inform their practice and can be used to support
staff development programmes.

‘Living Guides’: An important feature of this new Guide series is the concept of supplements, which
will provide a continuing source of information on the topic. Published supplements will be available for
download.
If you would like to contribute a supplement based on your own experience, please contact the Guides
Series Editor, Professor Trevor Gibbs (tjg.gibbs@gmail.com).
Supplements may comprise either a ‘Viewpoint’, when you communicate your views and comments on
the Guide or the topic more generally, or a ‘Practical Application’, where you report on implementation
of some aspect of the subject of the Guide in your own situation. Submissions for consideration for inclusion
as a Guide supplement should be maximum 1,000 words.

Other Guides in the new series: A list of topics in this exciting new series are listed below and
continued on the back inside cover.

30 Peer Assisted Learning: a planning and implementation framework
Michael Ross, Helen Cameron (2007)
ISBN: 978-1-903934-38-8
Primarily designed to assist curriculum developers, course organisers and educational researchers develop and implement their own PAL initiatives.

31 Workplace-based Assessment as an Educational Tool
John Norcini, Vanessa Burch (2008)
ISBN: 978-1-903934-39-5
Several methods for assessing work-based activities are described, with preliminary evidence of their application, practicability, reliability and validity.

32 e-Learning in Medical Education
Rachel Ellaway, Ken Masters (2008)
ISBN: 978-1-903934-41-8
An increasingly important topic in medical education – a ‘must read’ introduction for the novice and a useful resource and update for the more experienced practitioner.

33 Faculty Development: Yesterday, Today and Tomorrow
Michelle McLean, Francois Cilliers, Jacqueline M van Wyk (2010)
ISBN: 978-1-903934-42-5
Useful frameworks for designing, implementing and evaluating faculty development programmes.

34 Teaching in the clinical environment
Subha Ramani, Sam Leinster (2008)
ISBN: 978-1-903934-43-2
An examination of the many challenges for teachers in the clinical environment, application of relevant educational theories to the clinical context and practical teaching tips for clinical teachers.

35 Continuing Medical Education
Nancy Davis, David Davis, Ralph Bloch (2010)
ISBN: 978-1-903934-44-9
Designed to provide a foundation for developing effective continuing medical education (CME) for practicing physicians.

36 Problem-Based Learning: where are we now?
David Taylor, Barbara Miflin (2010)
ISBN: 978-1-903934-45-6
A look at the various interpretations and practices that claim the label PBL, and a critique of these against the original concept and practice.

37 Setting and maintaining standards in multiple choice examinations
Raja C Bandaranayake (2010)
ISBN: 978-1-903934-51-7
An examination of the more commonly used methods of standard setting together with their advantages and disadvantages and illustrations of the procedures used in each, with the help of an example.

38 Learning in Interprofessional Teams
Marilyn Hammick, Lorna Olckers, Charles Campion-Smith (2010)
ISBN: 978-1-903934-52-4
Clarification of what is meant by Inter-professional learning and an exploration of the concept of teams and team working.

39 Online eAssessment
Reg Dennick, Simon Wilkinson, Nigel Purcell (2010)
ISBN: 978-1-903934-53-1
An outline of the advantages of on-line eAssessment and an examination of the intellectual, technical, learning and cost issues that arise from its use.

40 Creating effective poster presentations
George Hess, Kathryn Tosney, Leon Liegel (2009)
ISBN: 978-1-903934-48-7
Practical tips on preparing a poster – an important, but often badly executed communication tool.

41 The Place of Anatomy in Medical Education
Graham Louw, Norman Eizenberg, Stephen W Carmichael (2010)
ISBN: 978-1-903934-54-8
The teaching of anatomy in a traditional and in a problem-based curriculum from a practical and a theoretical perspective.
Institution/Corresponding address:
Godfrey Pell, Principal Statistician, Medical Education Unit, Leeds Institute of Medical Education,
Worsley Building, University of Leeds, Leeds LS2 9JT, UK
Tel: +44 (0)113 23434378
Fax: +44 (0)113 23432597
Email: G.Pell@leeds.ac.uk

The authors:
Godfrey Pell is a Senior Statistician who has a strong background in management. Before joining the
University of Leeds he was with the Centre for Higher Education Practice at the Open University. Current
research includes standard setting for practical assessment in higher education, and the value of short
term interventionist programmes in literacy.

Richard Fuller is a Consultant Physician, and Director of the Leeds MB ChB undergraduate degree
programme within the Institute of Medical Education. His research interests include clinical assessment,
in particular monitoring and improving the quality of the OSCE.

Matthew Homer is a Research Fellow at the University of Leeds, working in both the Schools of Medicine
and Education. He works on a range of research projects and provides general statistical support to
colleagues. His research interests include the statistical side of assessment, particularly related to OSCEs.

Trudie Roberts is a Consultant Physician, a Professor of Medical Education and is the Director of the
Leeds Institute of Medical Education. Her research interests include clinical assessment.

This AMEE Guide was first published in Medical Teacher:


Pell G, Fuller R, Homer M & Roberts T (2010). How to measure the quality of the OSCE: A review of metrics.
AMEE Guide No.49. Medical Teacher, 32(10): 802-811.

Guide Series Editor: Trevor Gibbs (tjg.gibbs@gmail.com)


Production Editor: Morag Allan Campbell
Published by: Association for Medical Education in Europe (AMEE), Dundee, UK
Designed by: Lynn Thomson

© AMEE 2011
ISBN: 978-1-903934-62-3

Contents
Abstract ... 1
Introduction ... 2
Understanding quality in OSCE assessments – General Principles ... 3
Which method of standard setting? ... 4
How to generate station level quality metrics ... 5
Metric 1 – Cronbach’s Alpha ... 6
Metric 2 – Coefficient of Determination R2 ... 7
Metric 3 – Inter-grade discrimination ... 9
Metric 4 – Number of failures ... 10
Metric 5 – Between-group variations (including assessor effects) ... 11
Metric 6 – Between-group variations (other effects) ... 12
Metric 7 – Standardised Patients ratings ... 13
The 360 degree picture of OSCE quality ... 13
Quality control through observation – detecting problems ... 15
Post hoc remediation ... 15
Conclusions ... 16
References ... 18
Suggested Further Reading ... 18

Abstract
With an increasing use of criterion based assessment techniques in both undergraduate and postgraduate healthcare programmes, there is a consequent need to ensure the quality and rigour of these assessments. The obvious question for those responsible for delivering assessment is how is this ‘quality’ measured, and what mechanisms might there be that allow improvements in assessment quality over time to be demonstrated? Whilst a small base of literature exists, few papers give more than one or two metrics as measures of quality in Objective Structured Clinical Examinations (OSCEs).

In this Guide, aimed at assessment practitioners, the authors aim to review the metrics that are available for measuring quality and indicate how a rounded picture of OSCE assessment quality may be constructed by using a variety of such measures, and also to consider which characteristics of the OSCE are appropriately judged by which measure(s). The authors will discuss the quality issues both at the individual station level and across the complete clinical assessment as a whole, using a series of ‘worked examples’ drawn from OSCE data sets from the authors’ institution.

Take home messages

• It is important always to evaluate the quality of a high stakes assessment, such as an OSCE, through the use of a range of appropriate metrics.

• When judging the quality of an OSCE, it is very important to employ more than one metric to gain an all round view of the assessment quality.

• Assessment practitioners need to develop a ‘toolkit’ for identifying and avoiding common pitfalls.

• The key to widespread quality improvement is to focus on station level performance and improvements, and apply these within the wider context of the entire OSCE assessment process.

• The routine use of metrics within OSCE quality improvement allows a clear method of measuring the effects of change.

Introduction
With increasing scrutiny of the techniques used to support high level decision
making in academic disciplines, Criterion Based Assessment (CBA) delivers a
reliable and structured methodological approach. As a competency-based
methodology, CBA allows the delivery of ’high stakes’ summative assessment
(e.g. qualifying level or degree level examinations), and the demonstration
of high levels of both reliability and validity. This assessment methodology is
attractive, with a number of key benefits over more ’traditional’ unstructured
forms of assessment (e.g. viva voce) in that it is absolutist, carefully
standardised for all candidates, and assessments are clearly designed and
closely linked with performance objectives. These objectives can be clearly
mapped against curricular outcomes, and where appropriate, standards
laid down by regulatory and licensing bodies, that are available to students
and teachers alike. As such, CBA methodology has seen a wide application
beyond summative assessments, extending into the delivery of a variety of
work-based assessment tools across a range of academic disciplines (Norcini,
2007; Postgraduate Medical Education and Training Board, 2009). CBA is
also now being used in the UK in the recruitment of junior doctors, using a
structured interview similar to that used for selecting admissions to higher
education programmes (Eva et al., 2004).

The Objective Structured Clinical Examination (OSCE) uses CBA principles


within a complex process that begins with ’blueprinting’ course content
against pre-defined objectives (Newble, 2004). The aim here is to ensure
both that the ‘correct’ standard is assessed, and that the content of the
OSCE is objectively mapped to curricular outcomes. Performance is scored,
at the station level, using an item checklist, detailing individual (sequences
of) behaviours, and by a global grade, reliant on a less deterministic overall
assessment by examiners (Cohen, 1997; Regehr, 1998).

Central to the delivery of any successful CBA is the assurance of sufficient


quality and robust standard setting, supported by a range of metrics that
allow thoughtful consideration of the performance of the assessment as a
whole, rather than just a narrow focus on candidate outcomes (Roberts,
2006). ’Assessing the assessment’ is vital, as the delivery of OSCEs is complex
and resource intensive, usually involving large numbers of examiners,
candidates, simulators and patients, and often taking place across
parallel sites. This complexity means CBA may be subject to difficulties with
standardisation, and is heavily reliant on assessor behaviour, even given the
controlling mechanism of item checklists. No single metric is sufficient in itself to meaningfully judge the quality of the assessment process, just as no single assessment is sufficient in judging, for example, the clinical competence of an undergraduate student. Understanding and utilising metrics effectively are therefore central to CBA, both in measuring quality and in directing resources to appropriate further research and development of the assessment (Wass, 2001).

Understanding quality in OSCE assessments
– General Principles
This Guide will examine the metrics available, using final year OSCE results
from recent years as exemplars of how exactly these metrics can be
employed to measure the quality of the assessment. It is important to
recognise that a review of the OSCE metrics is only part of the overall process
of reviewing OSCE quality – which needs to embrace all relationships in the
wider assessment process (Figure 1).

Figure 1
OSCE quality assurance and improvement – a complex process
[Diagram: ‘Quality metrics & continuous improvement’ sits at the centre, linked to eight surrounding elements: curriculum blueprinting & assessment innovation; staff development & examiner training; station writing & amending item checklists; support staff & operating procedures; simulation – patients & technology; reviewing poor metrics – assessing causes, modelling solutions; standard setting; and institutional engagement – oversight and strategy.]

Where OSCEs are used as part of a national examination structure, stations


are designed centrally to a common standard, and typically delivered from
a central administration. However, at the local level with the assessment
designed within specific medical schools, some variation, for example, in
station maxima will result dependent upon the importance and complexity
of the station to those setting the exam. These absolute differences between
stations will adversely affect the reliability metric making the 0.9 value,
often quoted, unobtainable. It is possible to standardise the OSCE data
and thereby obtain a higher reliability metric but this would not be a true
representation of the assessment as set with respect to the objectives of
the assessing body. This Guide is aimed primarily at those involved with
clinical assessment at the local level within individual medical schools where,
although the assessment may take place across multiple sites, it is a single
administration. Those involved with national clinical assessments are likely to
have a different perspective.

Which method of standard setting?
The method of standard setting will determine the metrics available for use
in assessing quality. Standards can be relative (e.g. norm referenced) or
absolute, based either on the test item (Ebel & Angoff), or the performance
of the candidate (borderline methods). With the requirement for standards
to be defensible, evidenced and acceptable (Norcini, 2003), absolute
standards are generally used. Whilst all methods of standard setting will
generate a number of post-hoc metrics (e.g. station pass rates, fixed
effects (time of assessment, comparison across sites) or frequency of mark
distribution), it is important to choose a method of standard setting that
generates additional quality measures. At present, a large number of
institutions favour borderline methods, but only the regression method will give
some indication of the relationship between global grade and checklist score
and also the level of discrimination between weaker and stronger students.
Table 1 highlights the key differences between different borderline methods,
and what they contribute to assessment metrics.

Table 1
Comparison of the borderline methods of standard setting

Borderline Groups (BLG) & Contrasting Groups:
• Easy to compute
• Only 3 global ratings required (fail, borderline, pass)
• Uses only borderline data and only a proportion of assessor/candidate interactions
• Needs sufficient candidates in borderline group (20+)
• Produces limited quality assurance metrics

Borderline Regression (BLR):
• More expertise required for computation
• Usually 5 global ratings (e.g. fail, borderline, pass, credit, distinction)
• Uses all assessor/candidate interactions in analysis
• Requires no borderline grade students
• Wider variety of quality assurance metrics

The authors favour the borderline regression method because it uses all
the assessment interactions between assessors and candidates, and these
interactions are ‘real’. It is objectively based on pre-determined criteria, using
a large number of assessors and generates a wide range of metrics.
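
To make the mechanics of the method concrete, the sketch below shows one way a station pass mark could be derived under borderline regression: checklist scores are regressed on global grades, and the pass mark is read off as the predicted checklist score at the borderline grade. This is a minimal illustration on simulated data, not the authors' own implementation; the variable names are our own, and the grade coding follows the 0-4 scale used later in this Guide.

```python
import numpy as np

# Illustrative data for one station, one row per candidate.
# Global grades coded 0=Clear fail, 1=Borderline, 2=Clear pass,
# 3=Very good pass, 4=Excellent pass (as in this Guide).
rng = np.random.default_rng(0)
grades = rng.integers(0, 5, size=240)                # assessor global grades
checklist = 6 + 5 * grades + rng.normal(0, 3, 240)   # checklist scores (max ~30)

# Borderline regression: fit checklist = intercept + slope * grade.
slope, intercept = np.polyfit(grades, checklist, deg=1)

# The station pass mark is the predicted checklist score at the
# borderline grade (coded 1 here).
BORDERLINE = 1
pass_mark = intercept + slope * BORDERLINE

# The slope is also Metric 3 below (inter-grade discrimination).
print(f"pass mark = {pass_mark:.1f}, inter-grade discrimination = {slope:.2f}")
```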

One of the criticisms sometimes levelled at the borderline regression method


is its possible sensitivity to outliers. These outliers occur in three main groups:

• Students who perform very badly and obtain a near zero checklist score.

• Students who achieve a creditable checklist score but who fail to impress
the assessor overall.

• The assessor who gives the wrong overall grade.

These issues will be discussed in more detail at the appropriate points


throughout the Guide.

How to generate station level quality metrics
Table 2 details a ‘standard’ report of metrics from a typical OSCE (20 stations
over two days, total testing time ~ three hours, spread over four examination
centres). This typically involves ~250 candidates, 500 assessors and 150
simulated patients, and healthy patient volunteers with stable clinical signs
(used for physical examination). Candidates are required to meet a passing
profile comprising an overall pass score, a minimum number of stations
passed (preventing excessive compensation, and adding fidelity to the
requirement for a competent ‘all round’ doctor) and a minimum number
of acceptable patient ratings. Assessors complete an item checklist, and
then an overall global grade (the global grades in our OSCEs are recorded
numerically as 0=Clear fail, 1=Borderline, 2=Clear pass, 3=Very good pass,
4=Excellent pass).
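
The passing profile described above is conjunctive: every element must be satisfied. A minimal sketch of such a decision rule is given below; the threshold values, names and structure are purely illustrative assumptions, not the actual requirements of the authors' OSCE.

```python
from dataclasses import dataclass

@dataclass
class PassingProfile:
    """Conjunctive passing profile of the kind described above.
    All thresholds here are hypothetical, not the Leeds values."""
    overall_pass_score: float   # standard-set total score
    min_stations_passed: int    # limits excessive compensation across stations
    min_acceptable_sp: int      # minimum number of acceptable SP ratings

def meets_profile(total_score, stations_passed, acceptable_sp,
                  profile: PassingProfile) -> bool:
    # A candidate must satisfy every element of the profile to pass.
    return (total_score >= profile.overall_pass_score
            and stations_passed >= profile.min_stations_passed
            and acceptable_sp >= profile.min_acceptable_sp)

# Example with hypothetical thresholds for a 20-station OSCE.
profile = PassingProfile(overall_pass_score=58.0,
                         min_stations_passed=14,
                         min_acceptable_sp=16)
print(meets_profile(61.2, 15, 18, profile))   # True
```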

The borderline regression method was used for standard setting (Pell &
Roberts, 2006). Typically such an OSCE will generate roughly 60,000 data
items (i.e. individual student-level checklist marks), which form a valuable
resource for allowing quality measurement and improvement. As a result
of utilising such data, we have seen our own OSCEs deliver progressively
more innovation, whilst simultaneously maintaining or improving the levels of
reliability.

Under any of the borderline methods of standard setting, where a global grade is awarded in addition to the checklist score, accompanying metrics are useful in measuring the quality of the assessments. For other types of standard setting, where such a global grade does not form part of the standard setting procedure (e.g. Ebel & Angoff), inter-grade discrimination and the coefficient of determination (R2) will not apply (Cusimano, 1996).

Table 2
Final year OSCE Metrics
Station | Cronbach’s alpha if item deleted | R2 | Inter-grade discrimination | Number of failures | Between-group variation (%)

1 0.745 0.465 4.21 53 31.1


2 0.742 0.590 5.23 24 30.1
3 0.738 0.555 5.14 39 33.0
4 0.742 0.598 4.38 39 28.0
5 0.732 0.511 4.14 29 20.5
6 0.750 0.452 4.74 43 40.3
7 0.739 0.579 4.51 36 19.5
8 0.749 0.487 3.45 39 33.8
9 0.744 0.540 4.06 30 36.0
10 0.747 0.582 3.91 26 29.9
11 0.744 0.512 4.68 37 37.6
12 0.744 0.556 2.80 23 32.3
13 0.746 0.678 3.99 30 22.0
14 0.746 0.697 5.27 54 27.3
15 0.739 0.594 3.49 44 25.9
16 0.737 0.596 3.46 41 34.3
17 0.753 0.573 3.58 49 46.5
18 0.745 0.592 2.42 15 25.4
19 0.749 0.404 3.22 52 39.5
20 0.754 0.565 4.50 37 34.1

Number of candidates=241

A selection of these overall summary metrics will be used in this Guide to
illustrate the use of psychometric data ‘in action’, and to outline approaches
to identifying and managing unsatisfactory station-level assessment
performance. We have chosen older OSCE data to illustrate this Guide, to
highlight quality issues, and subsequent actions and improvements.

Metric 1 – Cronbach’s Alpha
Cronbach’s alpha is a measure of internal consistency (commonly, though not entirely


accurately, thought of as ’reliability’), whereby in a good assessment
the better students should do relatively well across the board (i.e. on the
checklist scores at each station). Two forms of alpha can be calculated
– non standardised or standardised – and in this Guide we refer to the non
standardised form (this is the default setting for SPSS). This is a measure of the
mean intercorrelation weighted by variances, and yields the same value as
the G-coefficient for a simple model of items crossed with candidates. The
(overall) value for alpha that is usually regarded as acceptable in this type of
high stakes assessment, where standardised and real patients are used, and
the individual station metrics are not standardised, is 0.7 or above.

Where station metrics are standardised a higher alpha would be expected.


Alpha for this set of stations was 0.754, and it can be seen (from the second
column of Table 2) that no station detracted from the overall ‘reliability’,
although stations 17 and 20 contributed little in this regard.
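
For readers who wish to compute these values outside SPSS, the following is a minimal sketch of the non-standardised alpha and the ‘alpha if item deleted’ column of Table 2, assuming a candidates-by-stations matrix of checklist scores. The data are simulated and the function names are our own, not part of any particular package.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Non-standardised Cronbach's alpha for a (candidates x stations) matrix."""
    k = scores.shape[1]                          # number of stations (items)
    item_vars = scores.var(axis=0, ddof=1)       # variance of each station score
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of candidate totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def alpha_if_deleted(scores: np.ndarray) -> np.ndarray:
    """Alpha recomputed with each station removed in turn (cf. Table 2)."""
    k = scores.shape[1]
    return np.array([cronbach_alpha(np.delete(scores, j, axis=1))
                     for j in range(k)])

# Simulated example: 241 candidates, 20 stations.
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, (241, 1))
scores = 20 + 1.5 * ability + rng.normal(0, 4, (241, 20))

overall = cronbach_alpha(scores)
flagged = np.where(alpha_if_deleted(scores) > overall)[0]
print(f"alpha = {overall:.3f}; stations detracting from alpha: {flagged + 1}")
```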

Since alpha tends to increase with the number of items in the assessment, the
resulting ‘alpha if item deleted’ scores should all be lower than the overall
alpha score if the item/station has performed well. Where this is not the case
this may be caused by any of the following reasons:

• The item is measuring a different construct from the rest of the set of items.

• The item is poorly designed.

• There are teaching issues – either the topic being tested has not been well
taught, or has been taught to a different standard across different groups
of candidates.

• The assessors are not assessing to a common standard.

In such circumstances, quality improvement should be undertaken by


revisiting the performance of the station, and reviewing checklist and station
design, or examining quality of teaching in the curriculum.

However, one cannot rely on alpha alone as a measure of the quality of


an assessment. As we have indicated, if the number of items increases, so
will alpha, and therefore a scale can be made to look more homogenous
than it really is merely by being of sufficient length in terms of the number of
items it contains. This means that if two scales measuring distinct constructs
are combined, to form a single long scale, this can result in a misleadingly
high alpha. Furthermore, a set of items can have a high alpha and still be
multidimensional. This happens when there are separate clusters of items
(i.e. measuring separate dimensions) which intercorrelate highly, even though
the clusters themselves do not correlate with each other particularly highly.

It is also possible for alpha to be too high (e.g. >0.9), possibly indicating
redundancy in the assessment, whilst low alpha scores can sometimes be
attributed to large differences in station mean scores rather than being the
result of poorly designed stations.

We should point out that in the authors’ medical school, and in many similar
institutions throughout the UK, over 1,000 assessors are required for the
OSCE assessment season (usually comprising 2-3 large scale examinations
as previously described). Consequently, recruiting sufficient assessors of
acceptable quality is a perennial issue, so it is not possible to implement
double-marking arrangements that would then make the employment of
G-theory worthwhile in terms of more accurately quantifying differences in
assessors. Such types of analysis are more complex than those covered in this
Guide, and often require the use of additional, less user-friendly, software.
An individual, institution-based decision to use G-theory or Cronbach’s alpha should be made in context with delivery requirements and any constraints.

The hawks and doves effect, either within an individual station, or aggregated to significant site effects, may have the effect of inflating the alpha value. However, it is highly likely that this effect will lead to unsatisfactory metrics in the areas of coefficient of determination, between-group within-station error variance, and, possibly, in fixed effect site differences, as we will explore later in this Guide. Our philosophy is that one metric alone, including alpha, is always insufficient in judging quality, and that in the case of an OSCE with a high alpha but other poor metrics, this would not indicate a high quality assessment.

As an alternative measure to ‘alpha if item is deleted’, it is possible to use the


correlation between station score and ‘total score less station score’. This will
give a more extended scale, but the datum value (i.e. correlation) between
contributing to reliability and detracting from it is to some extent dependent
on the assessment design and is therefore more difficult to interpret.

Metric 2 – Coefficient of Determination R2


The R2 coefficient is the proportion of variation in the dependent variable (checklist score) that is accounted for by variation in the independent variable (global grade). This allows us to determine the degree of (linear) correlation between the checklist score and the overall global rating at each station, with the expectation that higher overall global ratings should generally correspond with higher checklist scores. The square root of the coefficient of determination is the simple Pearson correlation coefficient. SPSS and other statistical software packages also give the adjusted value of R2, which takes into account the sample size and the number of predictors in the model (one in this case); ideally this value should be close to the unadjusted value.
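
As a minimal sketch of how these station-level quantities might be computed outside SPSS (assuming simple arrays of global grades and checklist scores for one station; the data and function name below are illustrative):

```python
import numpy as np
from scipy import stats

def station_r2(grades: np.ndarray, checklist: np.ndarray):
    """R2, adjusted R2 and slope for checklist score regressed on global grade."""
    n = len(grades)
    result = stats.linregress(grades, checklist)   # simple linear regression
    r2 = result.rvalue ** 2                        # coefficient of determination
    # Adjusted R2 with a single predictor, as reported alongside R2 by SPSS.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return r2, adj_r2, result.slope

# Illustrative data for one station.
rng = np.random.default_rng(2)
grades = rng.integers(0, 5, 241)
checklist = 8 + 4.5 * grades + rng.normal(0, 4, 241)

r2, adj_r2, slope = station_r2(grades, checklist)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, slope = {slope:.2f}")
```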

A good correlation (R2 > 0.5) will indicate a reasonable relationship between checklist scores and global grades, but care is needed to ensure that overly detailed global descriptors are not simply translated automatically by assessors into a corresponding checklist score, thereby artificially inflating R2. In Table 2, station 14 (a practical and medico-legal skills station) has a good R2 value of 0.697, implying that 69.7% of the variation in the students’ global ratings is accounted for by variation in their checklist scores. In contrast, station 19 is less satisfactory with an R2 value of 0.404. This was a new station focussing on patient safety and the management of a needlestick injury. To understand why R2 was low, it is helpful to examine the relationship graphically (for example, using SPSS Curve Estimation) to investigate the precise nature of the association between checklist and global grade – see Figure 2. In this figure, assessor global grades are shown on the x-axis and the total item checklist score is plotted on the y-axis. Clustered checklist scores are indicated by the size of the black circle, as shown in the key. SPSS can calculate the R2 coefficient for polynomials of different degree, and thereby provide additional information on the degree of linearity in the relationship. We would recommend always plotting a scatter graph of checklist marks against global ratings as routine good practice, regardless of station metrics.

In station 19 we can see that there are two main problems – a wide spread of marks for each global grade, and a very wide spread of marks for which the fail grade (0 on the x-axis) has been awarded. This indicates that some students have acquired many of the marks from the item checklist, but their overall performance has raised concerns in the assessor, leading to a global fail grade.

In our introduction, we raised the impact of outliers on the regression method.


Examples of poor checklist scores but with reasonable grades can be
observed in Figure 3. In other stations, we sometimes see candidates scoring
very few marks on the checklist score. This has the effect of reducing the
value of the regression intercept with the y-axis, and increasing the slope of
the regression line. For the data indicated in Table 2, the removal of outliers
and re-computation of the passing score and individual station pass marks
makes very little difference, increasing the passing score by less than 0.2%.

Figure 2
Curve estimation (Station 19) – Assessor checklist score (y) versus global grade (x)

This unsatisfactory relationship between checklist marks and global ratings causes some degree of non-linearity, as demonstrated in the accompanying Table 3 (produced by SPSS), where the best fit is clearly the cubic. Note that, mathematically speaking, a cubic will always produce a better fit, but parsimony dictates that the difference between the two fits has to be statistically significant for a higher order model to be preferred. In this example the fit of the cubic polynomial is significantly better than that of the linear. The key point to note is whether the cubic expression is the result of an underlying relationship or of outliers, resulting from inappropriate checklist design or unacceptable assessor behaviour in marking. In making this judgement, readers should review the distribution of marks seen on the scattergraph. Our own experience suggests that where station metrics are generally of good quality, a departure from strict linearity is not a cause for concern.

Table 3
Curve estimation table (Station 19)

Polynomial fitted R Square F df1 df2 Sig.


Linear .401 159.889 1 239 .000
Quadratic .435 91.779 2 238 .000
Cubic .470 70.083 3 237 .000
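
The comparison reported in Table 3 can be reproduced outside SPSS. The sketch below fits linear, quadratic and cubic polynomials and applies a nested-model F-test to judge whether the higher-order fit is significantly better; the data are simulated and the helper functions are our own, shown under those assumptions rather than as the authors' procedure.

```python
import numpy as np
from scipy import stats

def poly_r2_rss(x, y, degree):
    """R2 and residual sum of squares for a polynomial fit of given degree."""
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    rss = np.sum((y - fitted) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss, rss

def nested_f_test(x, y, d_small, d_large):
    """Is the higher-order polynomial a significantly better fit?"""
    n = len(y)
    _, rss_small = poly_r2_rss(x, y, d_small)
    _, rss_large = poly_r2_rss(x, y, d_large)
    df_num = d_large - d_small              # extra parameters in the larger model
    df_den = n - d_large - 1                # residual df of the larger model
    f = ((rss_small - rss_large) / df_num) / (rss_large / df_den)
    return f, stats.f.sf(f, df_num, df_den)  # F statistic and p-value

# Simulated station data (global grades 0-4 versus checklist marks).
rng = np.random.default_rng(3)
grades = rng.integers(0, 5, 241).astype(float)
checklist = 8 + 2 * grades + 0.3 * grades ** 3 + rng.normal(0, 4, 241)

for degree, label in [(1, "Linear"), (2, "Quadratic"), (3, "Cubic")]:
    r2, _ = poly_r2_rss(grades, checklist, degree)
    print(f"{label:9s} R2 = {r2:.3f}")
f_stat, p_value = nested_f_test(grades, checklist, 1, 3)
print(f"cubic vs linear: F = {f_stat:.1f}, p = {p_value:.4f}")
```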

The existence of low R2 values at certain stations and/or a wide spread of


marks for a given grade should prompt a review of the item checklist and
station design. In this particular case, although there was intended to be a
key emphasis on safe, effective management in the station, re-assessment
of the checklist in light of these metrics showed this emphasis was not well
represented. It is clear that weaker candidates were able to acquire many
marks for ‘process’ but did not fulfil the higher level expectations of the
station (the focus on decision making). This has been resolved through a re-
write of the station and the checklist, with plans for re-use of this station and
subsequent analysis of performance within a future OSCE.

Metric 3 – Inter-grade discrimination


This statistic gives the slope of the regression line and indicates the average
increase in checklist mark corresponding to an increase of one grade on the
global rating scale. Although there is no clear guidance on ‘ideal’ values, we
would recommend that this discrimination index should be of the order of a
tenth of the maximum available checklist mark (which is typically 30-35 in our
data).
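
As a rough illustration of this rule of thumb: a station with a maximum checklist mark of 32 would ideally show an inter-grade discrimination of around 3 marks per grade, and the values in Table 2 (roughly 2.4 to 5.3, against station maxima of 30-35) can be judged on the same basis.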

A low value of inter-grade discrimination is often accompanied by other


poor metrics for the station such as low values of R2 (indicating a poor
overall relationship between grade and checklist score), or high levels of
assessor error variance (see metric 5 below) where assessors have failed to
use a common standard. Excessively high inter-grade discrimination may
indicate either a very low pass mark, or a lack of linearity caused by a small
number of badly failing students who tend to steepen the regression line.

Where very poor student performance in terms of the checklist score occurs,
consideration needs to be given to whether these very low scores should be
excluded from standard setting to avoid excessive impact on overall passing
scores in a downward direction.

Returning to Table 2, it is clear that the inter-grade discrimination values


are generally acceptable across the stations (station maxima being in the
region of 30-35 marks), although there are three stations with discrimination
values in excess of 5 (e.g. station 14 – a skills station involving completion of a
cremation form).

Where there is doubt about a station in terms of its performance based on


the discrimination metric, returning to the R2 measure of variance and curve
estimation is often instructive. In Table 2, station 14 has the highest inter-grade
discrimination, and it can be seen in Figure 3 that most global grades again
encompass a wide range of marks, especially the ‘clear pass’ grade – value
2 on the x-axis, ranging from 4 to 27, but that the lower of these values are
clearly outliers. As the rest of the station metrics are acceptable, this station
can remain unchanged but should be monitored carefully when used in
subsequent assessments.

Figure 3
Curve estimation (Station 14) Assessor checklist score (y) versus global grade (x)

Metric 4 – Number of failures


It would be a mistake to automatically assume that an unusually high number of failures indicates a station that is somehow too difficult. The ‘reality check’, which is an essential part of borderline methods, will to a large extent compensate for station difficulty. This represents the expert judgement made by trained assessors in determining the global rating against the expected performance of the minimally competent student.

As previously described, other psychometric data can be used to investigate
station design and performance in order to identify problems. Failure rates
may be used to review the impact of a change in teaching on a particular
topic – with higher such rates indicating where a review of content and
methods of teaching can help course design. There are no major outliers for
this metric in Table 2, but the difficulties with station 19 have allowed us to
identify and deliver additional teaching around elements of patient safety
within the final year curriculum, and introduce this specific safety focus into
checklists.

Metric 5 – Between-group variation (including assessor effects)
When performing analysis on data resulting from complex assessment
arrangements such as OSCEs where, by necessity, the students are
subdivided into groups for practical purposes, it is vital that the design is fully
randomised. However, this is not always possible, with logistical
issues including dealing with special needs students who may require more
time and have to be managed exclusively within a separate cycle. Any non-
random subgroups must be excluded from statistically-based types of analysis
that rely on randomness in the data as a key assumption.

In the ideal assessment process, all the variation in marks will be due
to differences in student performance, and not due to differences in
environment (e.g. local variations in layout or equipment), location (e.g.
hospital based sites having different local policies for management of clinical
conditions), or differences of assessor attitude (i.e. hawks & doves). There are
two ways of measuring such effects, either by performing a one-way ANOVA
on the station (e.g. with the assessor as a fixed effect), or by computing
the proportion of total variance which is group specific. The latter allows
an estimation of the proportion of variation in checklist scores that is due
to student performance as distinct from other possible factors mentioned
above, although this is usually given as the proportion of variance which is
circuit specific.

If the variance components are computed, using group (i.e. circuit) as a


random effect, then the percentage of variance specific to group can be
computed. This is a very powerful metric as it gives a very good indication of
the uniformity of the assessment process between groups. It is also relatively
straightforward to calculate. Ideally between-group variance should be
under 30%, and values over 40% should give cause for concern, indicating
potential problems at the station level due to inconsistent assessor behaviour
and/or other circuit specific characteristics, rather than student performance.
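
A minimal sketch of this computation is given below, estimating the percentage of checklist variance that is circuit-specific via a method-of-moments (one-way random effects) approach. The column names and simulated data are our own assumptions; dedicated variance-components routines in standard statistical packages would serve equally well.

```python
import numpy as np
import pandas as pd

def between_group_variance_pct(df: pd.DataFrame, score="checklist", group="circuit"):
    """Method-of-moments estimate of the % of score variance that is
    group (circuit) specific, treating group as a random effect."""
    groups = df.groupby(group)[score]
    k = groups.ngroups
    n_i = groups.size().to_numpy()
    N = n_i.sum()
    grand_mean = df[score].mean()

    # One-way ANOVA mean squares.
    ss_between = (n_i * (groups.mean().to_numpy() - grand_mean) ** 2).sum()
    ss_within = ((df[score] - groups.transform("mean")) ** 2).sum()
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (N - k)

    # Variance component for group (kept non-negative by convention).
    n0 = (N - (n_i ** 2).sum() / N) / (k - 1)
    var_group = max((ms_between - ms_within) / n0, 0.0)
    return 100 * var_group / (var_group + ms_within)

# Illustrative data: 15 circuits of 16 candidates with a hawks-and-doves effect.
rng = np.random.default_rng(4)
circuit = np.repeat(np.arange(15), 16)
circuit_effect = rng.normal(0, 2.5, 15)[circuit]
df = pd.DataFrame({"circuit": circuit,
                   "checklist": 22 + circuit_effect + rng.normal(0, 4, circuit.size)})

print(f"between-group variance: {between_group_variance_pct(df):.1f}%")
```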

From Table 2, stations 6, 17 and 19 give cause for concern with regard to this
metric, with the highest levels of between-group variance. In addition, station
6 has a poor R2, and the overall combination of poor metrics at this station
tells us that the poor R2 was probably due to poor checklist design. These
observations prompted a review of the design of station 6, and the checklist
was found to consist of a large number of low level criteria where weaker
candidates could attain high scores through ‘process’ only. In other words,

there was a likely mismatch between the nature of the checklist, and the
aims and objectives of the station as understood by the assessors. Hence, in
redesigning the station, a number of the low-level criteria were chunked (that
is, grouped together to form a higher level criterion) in order to facilitate the
assessment of higher level processes as originally intended.

Station 17 tells a different story, as the good R2 coupled with the high
between-group variation indicates that assessors are marking consistently
within groups, but that there is a distinct hawks and doves effect between
groups. In such a case, this ought to be further investigated by undertaking a
one-way ANOVA analysis to determine whether this is an individual assessor
or a site phenomenon. The amount of variance attributable to different sites
is subsumed in the simple computation of within-station between-group
variance as described above. However, its significance may be determined
using a one-way ANOVA analysis with sites as fixed effects.

However, care needs to be exercised in making judgements based on


a single metric, since, with quite large populations, applying ANOVA to
individual stations is likely to reveal at least one significant result, as a result
of a type I error due to multiple significance tests across a large number of
groups (e.g. within our own OSCE assessments, a population of 250 students
and approximately 15 parallel circuits across different sites). Careful post-hoc
analysis will indicate any significant hawks and doves effects, and specific
groups should be tracked across other stations to determine general levels
of performance. If a completely random assessment model of both students
and assessors has been used (mindful of the caveats about local variations in
equipment and exam set up), then many of these effects should be largely
self-cancelling; it is in the aggregate totals that group-specific fixed effects
are important and may require remedial action.

Metric 6 – Between-group variance (other effects)


ANOVA analysis can also be of use when there are non-random allocations
of either assessors or students, as is the case in some medical schools with
large cohorts and associated teaching hospitals where multi-site assessment
may occur. Such complex arrangements can result in the non-random
assignment of assessors to circuits since it is often difficult for clinical staff to
leave their place of work. This may then lead to significant differences due
to ‘site effects’ which can be identified with appropriate action taken in the
analysis of results.
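
A sketch of such a fixed-effects check is given below, using a one-way ANOVA of station checklist scores by site; the site labels, effect sizes and column names are illustrative assumptions only, not a prescribed format.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative station-level data with assessors allocated non-randomly
# to three hospital sites.
rng = np.random.default_rng(5)
site = np.repeat(["A", "B", "C"], 80)
shift = pd.Series(site).map({"A": 0.0, "B": 1.5, "C": -1.0}).to_numpy()
df = pd.DataFrame({"site": site,
                   "checklist": 22 + shift + rng.normal(0, 4, site.size)})

# One-way ANOVA with site as a fixed effect at a single station.
groups = [g["checklist"].to_numpy() for _, g in df.groupby("site")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"site effect: F = {f_stat:.2f}, p = {p_value:.4f}")
```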

Other important fixed effects can also be identified through the use of
ANOVA. For example, assessor training effects, staff/student gender effects,
and associated interactions, which have all been previously described
(Pell, 2008), and which underline the need for complete and enhanced
assessor training as previously highlighted (Holmboe, 2004).

Metric 7 – Standardised Patients ratings
Most centres that use simulated/standardised patients (SPs) require them to
rate candidates, and this typically follows an intensive training programme.
Within our own institution, SPs would be asked a question such as “Would you
like to consult again with this doctor?” with a range of responses (strongly
agree, agree, neither agree nor disagree, disagree, strongly disagree), the
two latter responses being regarded as adverse. Akin to Metric 4 (number of station failures), a higher than normal proportion of candidates (e.g. >10%) receiving adverse SP ratings may indicate problems. There is no available literature on what constitutes an ‘acceptable’ range of SP ratings at station level, so we have chosen an arbitrary cut-off figure of 10%. The critical issue here is that other station metrics should be reviewed, and the impact on SP ratings monitored in response to training or other interventions.

If this is coupled with a higher than normal failure rate it could be the result
of inadequate teaching of the topic. Adverse values of this metric are often
accompanied by high rates of between-group variance; assessors viewing
candidates exhibiting a lower than expected level of competence often
have difficulty in achieving consistency.

The overall reliability of the assessment may be increased by adding the


SP rating to the checklist score; typically the SP rating should contribute
10-20% of the total station score (Homer & Pell, 2009). An alternative
approach, taken within our own institution at graduating level OSCEs, is to set
a ‘minimum’ requirement for SP comments as a proxy for patient satisfaction
(using rigorously trained SPs).
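
As a sketch of the weighting approach described above, the snippet below folds an SP rating into the station score at a chosen weight. The coding of the 5-point SP response scale as 0-4 and the 15% weight (taken from the suggested 10-20% range) are assumptions made for illustration, not the authors' scoring rules.

```python
import numpy as np

def combine_with_sp(checklist, checklist_max, sp_rating, sp_weight=0.15):
    """Combine the checklist score with an SP rating so that the SP rating
    contributes sp_weight (e.g. 10-20%) of the total station score.

    sp_rating is assumed to be coded 0-4 (strongly disagree .. strongly agree
    to 'Would you like to consult again with this doctor?')."""
    checklist_part = (1 - sp_weight) * (np.asarray(checklist) / checklist_max)
    sp_part = sp_weight * (np.asarray(sp_rating) / 4)
    return 100 * (checklist_part + sp_part)   # station score as a percentage

# Example: a candidate scoring 26/32 on the checklist with an 'agree' SP rating.
print(f"{combine_with_sp(26, 32, 3):.1f}%")
```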

The 360 degree picture of OSCE quality


As outlined, it is critical to review station quality in light of all available
station-level metrics before making assumptions about quality, and planning
improvements.

Review of the metrics of station 8 (focusing on consultation, diagnosis and


decision making) shows a positive contribution to overall assessment reliability
(alpha if item deleted 0.749). As can be seen below in the curve estimation in
Figure 4, the R2 coefficient is poor at 0.4, with a wide spread of item checklist
scores within grades, and significant overlap across the higher grades (pass,
credit and distinction).

Coupled with high levels of between-group variance of 33.8%, this suggests


a mismatch between assessor expectations and grading, and the construct
of the item checklist in the provision of higher level performance actions. This
leads to inconsistency within and between stations.

Actions to resolve this would typically include a review of the station content
and translation to the item checklist. Reviewing grade descriptors and
support material for assessors at station level should help overcome the
mismatch revealed by the poor R2 and higher error variance.

Figure 4:
Curve estimation (Station 8) – Assessor checklist score (y) versus global grade (x)
[Scatter plot of checklist scores (approximately 10-50 on the y-axis) against global grades 0-4 (x-axis), with observed points and linear, quadratic and cubic fitted curves.]

Station 9 is represented by the curve estimation seen below in Figure 5.

Here we see a more strongly positive contribution to reliability (alpha if item


deleted 0.74) and better station-level metrics. The R2 coefficient is acceptable
at 0.5, but between-group variance is still high at 36%.

The curve shows wide performance variance at each grade level. The good
R2 suggests that the variation is in the assessor global rating rather than the
assessor checklist scoring, with a hawks and doves effect.

Figure 5
Curve estimation (Station 9) – Assessor checklist score (y) versus global grade (x)
[Scatter plot of checklist scores (approximately 5-30 on the y-axis) against global grades 0-4 (x-axis), with observed points and linear, quadratic and cubic fitted curves.]

Action to investigate and improve this would focus on assessor support
material in relation to global ratings.

Quality control by observation: detecting problems in the run up to OSCEs and on the day
It is essential for those concerned with minimising error variance between
groups to observe the OSCE assessment systematically. When considering
some of the causes of between-group error, all those involved in the wider
OSCE process (Figure 1) must be part of the quality control process.

In advance of the OSCE, many of the contributing factors to error variance


can be anticipated and corrected by applying some of the points below:

• Checking across stations to ensure congruence in design.

• Ensuring that new (and older, established) stations follow up-to-date


requirements in terms of checklist design, weighting and anchor points.

• Reviewing the set up of parallel OSCE circuits – for example, differences


in the placing of disinfectant gel outside a station may mean that the assessor
is not able to score hand hygiene approaches.

• Ensuring that stations carry the same provision of equipment (or permit
flexibility if students are taught different approaches with different
equipment).

Other sources of error variance can occur during the delivery of the OSCE:

• Assessors who arrive late and miss the pre-assessment briefing and who
therefore fail to adhere adequately to the prescribed methodology.

• Unauthorised prompting by assessors (despite training and pre-exam


briefings).

• Inappropriate behaviour by assessors (e.g. changing the ‘tone’ of a station


through excessive interaction).

• Excessively proactive simulated patients whose questions act as prompts to


the students.

• Biased real patients (e.g. gender or race bias). Simulated patients receive
training on how to interact with the candidates, but this may not be
possible with the majority of real patients to the same level undertaken with
simulators.

• Assessors (or assistants) not returning equipment to the start or neutral


position as candidates change over.

Post hoc remediation


When faced with unsatisfactory metrics, a number of pragmatic, post hoc remediation methods can be employed.

1. Adjustment of total marks for site effects: The easiest method is to adjust to a common mean across all sites (a minimal sketch follows the list below). After any such adjustment, the site profile of failing students should be checked to ensure that, for example, all failures are not confined to a single site. The effect of any special needs group (e.g. candidates receiving extra time as a result of health needs) located within a single specific site needs to be discounted when computing the adjustment level.

2. Adjustment at the station level: This is seldom necessary because any


adverse effects will tend to cancel each other out. In the rare cases
where this does not happen, a station level procedure as above can be
carried out.

3. Removal of a station: Again, this is a rare event and the criteria for this are
usually multiple adverse metrics, the result of which would disadvantage
students to such an extent that the assessment decisions are indefensible
against appeal.
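
The sketch below illustrates remediation method 1: each site's total marks are shifted to the overall mean, and rows flagged as a special needs subgroup are ignored when the shifts are computed but still receive their site's adjustment. The column names, function name and data are hypothetical, offered as one possible implementation rather than the authors' own.

```python
import numpy as np
import pandas as pd

def adjust_to_common_mean(df: pd.DataFrame, score="total", site="site",
                          exclude=None):
    """Shift each site's total marks to the overall mean (remediation
    method 1 above). Rows flagged in `exclude` (e.g. a special needs
    cycle run at a single site) are left out of the calculation of the
    shifts but still receive their site's adjustment."""
    mask = ~df[exclude] if exclude else pd.Series(True, index=df.index)
    overall_mean = df.loc[mask, score].mean()
    site_means = df.loc[mask].groupby(site)[score].mean()
    shifts = overall_mean - site_means                 # per-site adjustment
    return df[score] + df[site].map(shifts).fillna(0.0)

# Illustrative use with hypothetical columns.
df = pd.DataFrame({"site": ["A"] * 3 + ["B"] * 3,
                   "total": [55.0, 60.0, 65.0, 50.0, 55.0, 60.0],
                   "special_needs": [False] * 6})
df["adjusted"] = adjust_to_common_mean(df, exclude="special_needs")
print(df)
```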

Conclusions
Using a series of worked examples and ‘live data’, this Guide focuses on
commonly used OSCE metrics and how they can be used to identify and
manage problems, and how such an approach helps to anticipate future
issues at the school/single institution level. This methodology therefore
naturally feeds into the wider assessment processes as described in Figure 1.

In the authors’ institution there is a close relationship between those


who analyse the data and those who design and administer the clinical
assessments and develop/deliver teaching. Routine and detailed review of
station level metrics has revealed mismatches between checklists and global
ratings. This has led to the redesign of certain OSCE stations with a subsequent
improvement of metrics. Some of these redesigns include:

• Chunking of a number of simple criteria into fewer criteria of higher level.

• Chunking to allow for higher level criteria commensurate with the stage of
student progression, allowing assessment of higher level, less process driven
performance

• The inclusion of intermediate grade descriptors on the assessor checklists.

• Ensuring that checklist criteria have three instead of two anchors where
appropriate, thereby allowing greater discrimination by assessors.

• A greater degree of uniformity between the physical arrangements of the


different circuits.

The presence of high failure rates at particular stations has led to a revisiting
of the teaching of specific parts of the curriculum, and was followed by
changes in the way things were taught, resulting in improved student
performance as measured in subsequent OSCEs.

Indications of poor agreement between assessors have, on occasion, led to


a number of changes all of which have been beneficial to the quality of
assessment:

• Upgrading of assessor training methods.

• Updating (‘refreshing’) assessors who were trained some time ago.

• The provision of more detailed support material for assessors.

• Improved assessor briefings prior to the assessment.

• Improved SP briefings prior to the assessment.

• Dummy runs before the formal assessment for both assessors and SPs (this
is only really practicable where student numbers are relatively small e.g.
resits, and in dental OSCEs with smaller cohorts of students).

The need for all the above improvements would be unlikely to have been apparent from using a single reliability metric, such as Cronbach’s alpha or the G coefficient. It is only when a family of metrics is used that a true picture of quality can be obtained and the deficient areas identified. Adopting this approach will be rewarded with a steady improvement in the delivery and standard of clinical assessment.

References
COHEN DS, COLLIVER JA, ROBBS RS & SWARTZ MH (1997). A Large-Scale Study of the
Reliabilities of Checklist Scores and Ratings of Interpersonal and Communication
Skills Evaluated on a Standardized-Patient Examination. Advances in Health Sciences
Education, 1: 209-213.

CUSIMANO M (1996). Standard setting in Medical Education. Academic Medicine,


71(10): S112-S120.

EVA KW, ROSENFELD J, REITER H & NORMAN GR (2004). An Admissions OSCE: the
multiple mini-interview. Medical Education, 38: 314-326.

FIELD A (2000). Discovering Statistics (using SPSS for windows), p.130 (Sage Publications,
London)

HOLMBOE E (2004). Faculty and the observation of trainees’ clinical skills: Problems and
opportunities. Academic Medicine, 79(1): 16-22.

HOMER M & PELL G (2009). The impact of the inclusion of simulated patient ratings on
the reliability of OSCE assessments under the borderline regression method. Medical
Teacher, 31(5): 420-425.

NEWBLE D (2004). Techniques for measuring clinical competence: objective structured


clinical examinations. Medical Education, 38: 199-203.

NORCINI J (2003). Setting standards on educational tests. Medical Education, 37(5):


464-469.

NORCINI J & BURCH V (2007) Workplace-based assessment as an educational tool:


AMEE guide No. 31. Medical Teacher, 29(9): 855-871.

PELL G, HOMER M & ROBERTS TE (2008). Assessor Training: Its Effects on Criterion Based
Assessment in a Medical Context. International Journal of Research & Method in
Education, 31(2): 143-154.

PELL G & ROBERTS TE (2006). Setting standards for student assessment. International
Journal of Research & Method in Education, 29(1): 91-103.

POSTGRADUATE MEDICAL EDUCATION AND TRAINING BOARD (2009).


Workplace based assessment. A guide for Implementation (London).
www.pmetb.org.uk/fileadmin/user/QA/assessment/PMETB_WPBA_Guide_20090501.pdf
(accessed May 11th 2009)

REGEHR G, MACRAE H, REZNICK RK & SZALAY D (1998). Comparing the psychometric


properties of checklists and global rating scales for assessing performance on an
OSCE-format examination. Academic Medicine, 73(9): 993-997.

ROBERTS C, NEWBLE D, JOLLY B, REED M & HAMPTON K (2006). Assuring the quality of
high-stakes undergraduate assessments of clinical competence. Medical Teacher,
28(6): 535-543.

STEVENS J (1992). Applied multivariate statistics for the social sciences (2nd ed),
Chapter 4: 151-182 (Erlbaum, Hillside NJ)

WASS V, MCGIBBON D & VAN DER VLEUTEN C (2001). Composite undergraduate


clinical examinations: how should the components be combined to maximise
reliability? Medical Education, 35: 326-330.

Suggested Further Reading


STREINER DL & NORMAN GR (2008). Health Measurement Scales: a practical guide to
their development and use (4th ed). Oxford University Press, Oxford

CIZEK GJ & BUNCH MB (2007). Standard Setting (1st ed). Sage Publications, London

Guides in the AMEE Guides series

42 The use of simulated patients in medical education
Jennifer A Cleland, Keiko Abe, Jan-Joost Rethans (2010). ISBN: 978-1-903934-55-5
A detailed overview on how to recruit, train and use Standardized Patients from a teaching and assessment perspective.

43 Scholarship, Publication and Career Advancement in Health Professions Education
William C McGaghie (2010). ISBN: 978-1-903934-50-0
Advice for the teacher on the preparation and publication of manuscripts and twenty-one practical suggestions about how to advance a successful and satisfying career in the academic health professions.

44 The Use of Reflection in Medical Education
John Sandars (2010). ISBN: 978-1-903934-56-2
A variety of educational approaches in undergraduate, postgraduate and continuing medical education that can be used for reflection, from text-based reflective journals and critical incident reports to the creative use of digital media and storytelling.

45 Portfolios for Assessment and Learning
Jan van Tartwijk, Erik W Driessen (2010). ISBN: 978-1-903934-57-9
An overview of the content and structure of various types of portfolios, including eportfolios, and the factors that influence their success.

46 Student Selected Components
Simon C Riley (2010). ISBN: 978-1-903934-58-6
An insight into the structure of an SSC programme and its various important component parts.

47 Using Rural and Remote Settings in the Undergraduate Medical Curriculum
Moira Maley, Paul Worley, John Dent (2010). ISBN: 978-1-903934-59-3
A description of an RRME programme in action with a discussion of the potential benefits and issues relating to implementation.

48 Effective Small Group Learning
Sarah Edmunds, George Brown (2010). ISBN: 978-1-903934-60-9
An overview of the use of small group methods in medicine and what makes them effective.

49 How to Measure the Quality of the OSCE: A Review of Metrics
Godfrey Pell, Richard Fuller, Matthew Homer, Trudie Roberts (2011). ISBN: 978-1-903934-62-3
A review of the metrics that are available for measuring quality in assessment, indicating how a rounded picture of OSCE assessment quality may be constructed by using a variety of measures.

50 Simulation in Healthcare Education. Building a Simulation Programme: a Practical Guide
Kamran Khan, Serena Tolhurst-Cleaver, Sara White, William Simpson (2011). ISBN: 978-1-903934-63-0
A very practical approach to designing, developing and organising a simulation programme.

51 Communication Skills: An essential component of medical curricula. Part I: Assessment of Clinical Communication
Anita Laidlaw, Jo Hart (2011). ISBN: 978-1-903934-85-2
An overview of the essential steps to take in the development of assessing the competency of students' communication skills.

52 Situativity Theory: A perspective on how participants and the environment can interact
Steven J Durning, Anthony R Artino (2011). ISBN: 978-1-903934-87-6
Clarification of the theory that our environment affects what we and our students learn.

53 Ethics and Law in the Medical Curriculum
Al Dowie, Anthea Martin (2011). ISBN: 978-1-903934-89-0
An explanation of why, and a practical description of how, this important topic can be introduced into the undergraduate medical curriculum.

54 Post Examination Analysis of Objective Tests
Mohsen Tavakol, Reg Dennick (2011). ISBN: 978-1-903934-91-3
A clear overview of the practical importance of analysing questions to ensure the quality and fairness of the test.

55 Developing a Medical School: Expansion of Medical Student Capacity in New Locations
David Snadden, Joanna Bates, Philip Burns, Oscar Casiro, Richard B Hays, Dan Hunt, Angela Towle (2011). ISBN: 978-1-903934-93-7
As many new medical schools are developed around the world, this guide draws upon the experience of seven experts to provide a very practical and logical approach to this often difficult subject.

56 Research in Medical Education. "The research compass": An introduction to research in medical education
Charlotte Ringsted, Brian Hodges, Albert Scherpbier (2011). ISBN: 978-1-903934-95-1
An introductory guide giving a broad overview of the importance attached to research in medical education.

57 General overview of the theories used in assessment
Lambert WT Schuwirth, Cees PM van der Vleuten (2012). ISBN: 978-1-903934-97-5
As assessment is modified to suit student learning, it is important that we understand the theories that underpin which methods of assessment are chosen. This guide provides an insight into the essential theories used.

58 Self-Regulation Theory: Applications to medical education
John Sandars, Timothy J Cleary (2012). ISBN: 978-1-903934-99-9
Self-regulation theory, as applied to medical education, describes the cyclical control of academic and clinical performance through several key processes, including goal-directed behaviour, the use of specific strategies to attain goals, and the adaptation and modification of behaviours or strategies to optimise learning and performance.

59 How can Self-Determination Theory assist our understanding of the teaching and learning processes in medical education?
Olle ThJ ten Cate, Rashmi A Kusurkar, Geoffrey C Williams (2012). ISBN: 978-1-908438-01-0
Self-Determination Theory (SDT) is among the current major motivational theories in psychology, but its applications in medical education are rare. This guide uncovers the potential of SDT to help understand many common processes in medical education.

60 Building bridges between theory and practice in medical education by using a design-based research approach
Diana HJM Dolmans, Dineke Tigelaar (2012). ISBN: 978-1-908438-03-4
This guide describes how Design-Based Research (DBR) can help to bridge the gap between research and practice, by contributing towards theory testing and refinement on the one hand and improvement of educational practice on the other.

61 Integrating Professionalism into the Curriculum
Helen O'Sullivan, Walther van der Mook, Ray Fewtrell, Val Wass (2012). ISBN: 978-1-908438-05-8
Professionalism is now established as an important component of all medical curricula. This guide clearly explains the why and how of integrating professionalism into the curriculum and ways to overcome many of the obstacles encountered.
About AMEE
What is AMEE?
AMEE is an association for all with an interest in medical and healthcare professions education,
with members throughout the world. AMEE’s interests span the continuum of education from
undergraduate/basic training, through postgraduate/specialist training, to continuing professional
development/continuing medical education.

• Conferences: Since 1973 AMEE has been organising an annual conference, held in a European
city. The conference now attracts over 2300 participants from 80 countries.

• Courses: AMEE offers a series of courses at AMEE and other major medical education conferences
relating to teaching, assessment, research and technology in medical education.

• MedEdWorld: AMEE’s exciting new initiative has been established to help all concerned with
medical education to keep up to date with developments in the field, to promote networking
and sharing of ideas and resources between members and to promote collaborative learning
between students and teachers internationally.

• Medical Teacher: AMEE produces a leading international journal, Medical Teacher, published 12
times a year, included in the membership fee for individual and student members.

• Education Guides: AMEE also produces a series of education guides on a range of topics, including
Best Evidence Medical Education Guides reporting results of BEME Systematic Reviews in medical
education.

• Best Evidence Medical Education (BEME): AMEE is a leading player in the BEME initiative which aims
to create a culture of the use of best evidence in making decisions about teaching in medical and
healthcare professions education.

Membership categories
• Individual and student members (£85/£39 a year): Receive Medical Teacher (12 issues a year, hard
copy and online access), free membership of MedEdWorld, discount on conference attendance
and discount on publications.

• Institutional membership (£200 a year): Receive free membership of MedEdWorld for the institution,
discount on conference attendance for members of the institution and discount on publications.

See the website (www.amee.org) for more information.

If you would like more information about AMEE and its activities, please contact the AMEE Office:
Association for Medical Education in Europe (AMEE), Tay Park House, 484 Perth Road, Dundee DD2 1LR, UK
Tel: +44 (0)1382 381953; Fax: +44 (0)1382 381987; Email: amee@dundee.ac.uk
www.amee.org
ISBN: 978-1-903934-62-3
Scottish Charity No. SC 031618
