
How to measure the quality of the OSCE: A review of metrics


Godfrey Pell
Richard Fuller
Matthew Homer
Trudie Roberts

AMEE GUIDE
Assessment 49

AMEE Guides in Medical Education www.amee.org


Welcome to AMEE Guides Series 2
The AMEE Guides cover important topics in medical and healthcare professions education and provide
information, practical advice and support. We hope that they will also stimulate your thinking and reflection
on the topic. The Guides have been logically structured for ease of reading and contain useful take-home
messages. Text boxes highlight key points and examples in practice. Each page in the guide provides a
column for your own personal annotations, stimulated either by the text itself or the quotations. Sources of
further information on the topic are provided in the reference list and bibliography. Guides are classified
according to subject:

Teaching and Learning; Curriculum Planning; Research in Medical Education; Assessment; Education Management; Theories of Medical Education

The Guides are designed for use by individual teachers to inform their practice and can be used to support
staff development programmes.

‘Living Guides’: An important feature of this new Guide series is the concept of supplements, which
will provide a continuing source of information on the topic. Published supplements will be available for
download.
If you would like to contribute a supplement based on your own experience, please contact the Guides
Series Editor, Professor Trevor Gibbs (tjg.gibbs@gmail.com).
Supplements may comprise either a ‘Viewpoint’, when you communicate your views and comments on
the Guide or the topic more generally, or a ‘Practical Application’, where you report on implementation
of some aspect of the subject of the Guide in your own situation. Submissions for consideration for inclusion
as a Guide supplement should be maximum 1,000 words.

Other Guides in the new series: A list of topics in this exciting new series are listed below and
continued on the back inside cover.

30 Peer Assisted Learning: a planning and implementation framework
Michael Ross, Helen Cameron (2007)
ISBN: 978-1-903934-38-8
Primarily designed to assist curriculum developers, course organisers and educational researchers develop and implement their own PAL initiatives.

31 Workplace-based Assessment as an Educational Tool
John Norcini, Vanessa Burch (2008)
ISBN: 978-1-903934-39-5
Several methods for assessing work-based activities are described, with preliminary evidence of their application, practicability, reliability and validity.

32 e-Learning in Medical Education
Rachel Ellaway, Ken Masters (2008)
ISBN: 978-1-903934-41-8
An increasingly important topic in medical education – a ‘must read’ introduction for the novice and a useful resource and update for the more experienced practitioner.

33 Faculty Development: Yesterday, Today and Tomorrow
Michelle McLean, Francois Cilliers, Jacqueline M van Wyk (2010)
ISBN: 978-1-903934-42-5
Useful frameworks for designing, implementing and evaluating faculty development programmes.

34 Teaching in the clinical environment
Subha Ramani, Sam Leinster (2008)
ISBN: 978-1-903934-43-2
An examination of the many challenges for teachers in the clinical environment, application of relevant educational theories to the clinical context and practical teaching tips for clinical teachers.

35 Continuing Medical Education
Nancy Davis, David Davis, Ralph Bloch (2010)
ISBN: 978-1-903934-44-9
Designed to provide a foundation for developing effective continuing medical education (CME) for practicing physicians.

36 Problem-Based Learning: where are we now?
David Taylor, Barbara Miflin (2010)
ISBN: 978-1-903934-45-6
A look at the various interpretations and practices that claim the label PBL, and a critique of these against the original concept and practice.

37 Setting and maintaining standards in multiple choice examinations
Raja C Bandaranayake (2010)
ISBN: 978-1-903934-51-7
An examination of the more commonly used methods of standard setting together with their advantages and disadvantages and illustrations of the procedures used in each, with the help of an example.

38 Learning in Interprofessional Teams
Marilyn Hammick, Lorna Olckers, Charles Campion-Smith (2010)
ISBN: 978-1-903934-52-4
Clarification of what is meant by Inter-professional learning and an exploration of the concept of teams and team working.

39 Online eAssessment
Reg Dennick, Simon Wilkinson, Nigel Purcell (2010)
ISBN: 978-1-903934-53-1
An outline of the advantages of on-line eAssessment and an examination of the intellectual, technical, learning and cost issues that arise from its use.

40 Creating effective poster presentations
George Hess, Kathryn Tosney, Leon Liegel (2009)
ISBN: 978-1-903934-48-7
Practical tips on preparing a poster – an important, but often badly executed communication tool.

41 The Place of Anatomy in Medical Education
Graham Louw, Norman Eizenberg, Stephen W Carmichael (2010)
ISBN: 978-1-903934-54-8
The teaching of anatomy in a traditional and in a problem-based curriculum from a practical and a theoretical perspective.
Institution/Corresponding address:
Godfrey Pell, Principal Statistician, Medical Education Unit, Leeds Institute of Medical Education,
Worsley Building, University of Leeds, Leeds LS2 9JT, UK
Tel: +44 (0)113 23434378
Fax: +44 (0)113 23432597
Email: G.Pell@leeds.ac.uk

The authors:
Godfrey Pell is a Senior Statistician who has a strong background in management. Before joining the
University of Leeds he was with the Centre for Higher Education Practice at the Open University. Current
research includes standard setting for practical assessment in higher education, and the value of short
term interventionist programmes in literacy.

Richard Fuller is a Consultant Physician, and Director of the Leeds MB ChB undergraduate degree
programme within the Institute of Medical Education. His research interests include clinical assessment,
in particular monitoring and improving the quality of the OSCE.

Matthew Homer is a Research Fellow at the University of Leeds, working in both the Schools of Medicine
and Education. He works on a range of research projects and provides general statistical support to
colleagues. His research interests include the statistical side of assessment, particularly related to OSCEs.

Trudie Roberts is a Consultant Physician, a Professor of Medical Education and is the Director of the
Leeds Institute of Medical Education. Her research interests include clinical assessment.

This AMEE Guide was first published in Medical Teacher:


Pell G, Fuller R, Homer M & Roberts T (2010). How to measure the quality of the OSCE: A review of metrics.
AMEE Guide No.49. Medical Teacher, 32(10): 802-811.

Guide Series Editor: Trevor Gibbs (tjg.gibbs@gmail.com)


Production Editor: Morag Allan Campbell
Published by: Association for Medical Education in Europe (AMEE), Dundee, UK
Designed by: Lynn Thomson

© AMEE 2011
ISBN: 978-1-903934-62-3

Contents
Abstract ... 1
Introduction ... 2
Understanding quality in OSCE assessments – General Principles ... 3
Which method of standard setting? ... 4
How to generate station level quality metrics ... 5
Metric 1 – Cronbach’s Alpha ... 6
Metric 2 – Coefficient of Determination R2 ... 7
Metric 3 – Inter-grade discrimination ... 9
Metric 4 – Number of failures ... 10
Metric 5 – Between-group variations (including assessor effects) ... 11
Metric 6 – Between-group variations (other effects) ... 12
Metric 7 – Standardised Patients ratings ... 13
The 360 degree picture of OSCE quality ... 13
Quality control through observation – detecting problems ... 15
Post hoc remediation ... 15
Conclusions ... 16
References ... 18
Suggested Further Reading ... 18

Abstract
With an increasing use of criterion based assessment techniques in both undergraduate and postgraduate healthcare programmes, there is a consequent need to ensure the quality and rigour of these assessments. The obvious question for those responsible for delivering assessment is how is this ‘quality’ measured, and what mechanisms might there be that allow improvements in assessment quality over time to be demonstrated? Whilst a small base of literature exists, few papers give more than one or two metrics as measures of quality in Objective Structured Clinical Examinations (OSCEs).

In this Guide, aimed at assessment practitioners, the authors aim to review the metrics that are available for measuring quality and indicate how a rounded picture of OSCE assessment quality may be constructed by using a variety of such measures, and also to consider which characteristics of the OSCE are appropriately judged by which measure(s). The authors will discuss the quality issues both at the individual station level and across the complete clinical assessment as a whole, using a series of ‘worked examples’ drawn from OSCE data sets from the authors’ institution.

Take home messages

• It is important always to evaluate the quality of a high stakes assessment, such as an OSCE, through the use of a range of appropriate metrics.

• When judging the quality of an OSCE, it is very important to employ more than one metric to gain an all round view of the assessment quality.

• Assessment practitioners need to develop a ‘toolkit’ for identifying and avoiding common pitfalls.

• The key to widespread quality improvement is to focus on station level performance and improvements, and apply these within the wider context of the entire OSCE assessment process.

• The routine use of metrics within OSCE quality improvement allows a clear method of measuring the effects of change.

Introduction
With increasing scrutiny of the techniques used to support high level decision
making in academic disciplines, Criterion Based Assessment (CBA) delivers a
reliable and structured methodological approach. As a competency-based
methodology, CBA allows the delivery of ’high stakes’ summative assessment
(e.g. qualifying level or degree level examinations), and the demonstration
of high levels of both reliability and validity. This assessment methodology is
attractive, with a number of key benefits over more ’traditional’ unstructured
forms of assessment (e.g. viva voce) in that it is absolutist, carefully
standardised for all candidates, and assessments are clearly designed and
closely linked with performance objectives. These objectives can be clearly
mapped against curricular outcomes, and where appropriate, standards
laid down by regulatory and licensing bodies, that are available to students
and teachers alike. As such, CBA methodology has seen a wide application
beyond summative assessments, extending into the delivery of a variety of
work-based assessment tools across a range of academic disciplines (Norcini,
2007; Postgraduate Medical Education and Training Board, 2009). CBA is
also now being used in the UK in the recruitment of junior doctors, using a
structured interview similar to that used for selecting admissions to higher
education programmes (Eva et al., 2004).

The Objective Structured Clinical Examination (OSCE) uses CBA principles


within a complex process that begins with ’blueprinting’ course content
against pre-defined objectives (Newble, 2004). The aim here is to ensure
both that the ‘correct’ standard is assessed, and that the content of the
OSCE is objectively mapped to curricular outcomes. Performance is scored,
at the station level, using an item checklist, detailing individual (sequences
of) behaviours, and by a global grade, reliant on a less deterministic overall
assessment by examiners (Cohen, 1997; Regehr, 1998).

Central to the delivery of any successful CBA is the assurance of sufficient


quality and robust standard setting, supported by a range of metrics that
allow thoughtful consideration of the performance of the assessment as a
whole, rather than just a narrow focus on candidate outcomes (Roberts,
2006). ’Assessing the assessment’ is vital, as the delivery of OSCEs is complex
and resource intensive, usually involving large numbers of examiners,
candidates, simulators and patients, and often taking place across
parallel sites. This complexity means CBA may be subject to difficulties with
standardisation, and is heavily reliant on assessor behaviour, even given the
controlling mechanism of item checklists. No single metric is sufficient in itself to meaningfully judge the quality of the assessment process, just as no single assessment is sufficient in judging, for example, the clinical competence of an undergraduate student. Understanding and utilising metrics effectively are therefore central to CBA, both in measuring quality and in directing resources to appropriate further research and development of the assessment (Wass, 2001).

Understanding quality in OSCE assessments
– General Principles
This Guide will examine the metrics available, using final year OSCE results
from recent years as exemplars of how exactly these metrics can be
employed to measure the quality of the assessment. It is important to
recognise that a review of the OSCE metrics is only part of the overall process
of reviewing OSCE quality – which needs to embrace all relationships in the
wider assessment process (Figure 1).

Figure 1
OSCE quality assurance and improvement – a complex process
[Diagram: ‘Quality metrics & continuous improvement’ sits at the centre, linked to eight surrounding elements: curriculum blueprinting & assessment innovation; staff development & examiner training; station writing & amending item checklists; support staff & operating procedures; simulation – patients & technology; reviewing poor metrics – assessing causes, modelling solutions; standard setting; and institutional engagement – oversight and strategy.]

Where OSCEs are used as part of a national examination structure, stations


are designed centrally to a common standard, and typically delivered from
a central administration. However, at the local level with the assessment
designed within specific medical schools, some variation, for example, in
station maxima will result dependent upon the importance and complexity
of the station to those setting the exam. These absolute differences between
stations will adversely affect the reliability metric making the 0.9 value,
often quoted, unobtainable. It is possible to standardise the OSCE data
and thereby obtain a higher reliability metric but this would not be a true
representation of the assessment as set with respect to the objectives of
the assessing body. This Guide is aimed primarily at those involved with
clinical assessment at the local level within individual medical schools where,
although the assessment may take place across multiple sites, it is a single
administration. Those involved with national clinical assessments are likely to
have a different perspective.

Which method of standard setting?
The method of standard setting will determine the metrics available for use
in assessing quality. Standards can be relative (e.g. norm referenced) or
absolute, based either on the test item (Ebel & Angoff), or the performance
of the candidate (borderline methods). With the requirement for standards
to be defensible, evidenced and acceptable (Norcini, 2003), absolute
standards are generally used. Whilst all methods of standard setting will
generate a number of post-hoc metrics (e.g. station pass rates, fixed
effects (time of assessment, comparison across sites) or frequency of mark
distribution), it is important to choose a method of standard setting that
generates additional quality measures. At present, a large number of
institutions favour borderline methods, but only the regression method will give
some indication of the relationship between global grade and checklist score
and also the level of discrimination between weaker and stronger students.
Table 1 highlights the key differences between different borderline methods,
and what they contribute to assessment metrics.

Table 1
Comparison of the borderline methods of standard setting

Borderline Groups (BLG) & Contrasting Groups:
• Easy to compute
• Only 3 global ratings required (fail, borderline, pass)
• Uses only borderline data and only a proportion of assessor/candidate interactions
• Needs sufficient candidates in borderline group (20+)
• Produces limited quality assurance metrics

Borderline Regression (BLR):
• More expertise required for computation
• Usually 5 global ratings (e.g. fail, borderline, pass, credit, distinction)
• Uses all assessor/candidate interactions in analysis
• Requires no borderline grade students
• Wider variety of quality assurance metrics

The authors favour the borderline regression method because it uses all
the assessment interactions between assessors and candidates, and these
interactions are ‘real’. It is objectively based on pre-determined criteria, using
a large number of assessors and generates a wide range of metrics.
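
To make the mechanics of the method concrete, the sketch below shows one way a station pass mark could be derived under borderline regression: checklist scores are regressed on global grades, and the pass mark is read off as the predicted checklist score at the borderline grade. This is a minimal illustration on simulated data, not the authors' own implementation; the variable names are our own, and the grade coding follows the 0-4 scale used later in this Guide.

```python
import numpy as np

# Illustrative data for one station, one row per candidate.
# Global grades coded 0=Clear fail, 1=Borderline, 2=Clear pass,
# 3=Very good pass, 4=Excellent pass (as in this Guide).
rng = np.random.default_rng(0)
grades = rng.integers(0, 5, size=240)                # assessor global grades
checklist = 6 + 5 * grades + rng.normal(0, 3, 240)   # checklist scores (max ~30)

# Borderline regression: fit checklist = intercept + slope * grade.
slope, intercept = np.polyfit(grades, checklist, deg=1)

# The station pass mark is the predicted checklist score at the
# borderline grade (coded 1 here).
BORDERLINE = 1
pass_mark = intercept + slope * BORDERLINE

# The slope is also Metric 3 below (inter-grade discrimination).
print(f"pass mark = {pass_mark:.1f}, inter-grade discrimination = {slope:.2f}")
```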

One of the criticisms sometimes levelled at the borderline regression method


is its possible sensitivity to outliers. These outliers occur in three main groups:

• Students who perform very badly and obtain a near zero checklist score.

• Students who achieve a creditable checklist score but who fail to impress
the assessor overall.

• The assessor who gives the wrong overall grade.

These issues will be discussed in more detail at the appropriate points


throughout the Guide.

How to generate station level quality metrics
Table 2 details a ‘standard’ report of metrics from a typical OSCE (20 stations
over two days, total testing time ~ three hours, spread over four examination
centres). This typically involves ~250 candidates, 500 assessors and 150
simulated patients, and healthy patient volunteers with stable clinical signs
(used for physical examination). Candidates are required to meet a passing
profile comprising an overall pass score, a minimum number of stations
passed (preventing excessive compensation, and adding fidelity to the
requirement for a competent ‘all round’ doctor) and a minimum number
of acceptable patient ratings. Assessors complete an item checklist, and
then an overall global grade (the global grades in our OSCEs are recorded
numerically as 0=Clear fail, 1=Borderline, 2=Clear pass, 3=Very good pass,
4=Excellent pass).
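
The passing profile described above is conjunctive: every element must be satisfied. A minimal sketch of such a decision rule is given below; the threshold values, names and structure are purely illustrative assumptions, not the actual requirements of the authors' OSCE.

```python
from dataclasses import dataclass

@dataclass
class PassingProfile:
    """Conjunctive passing profile of the kind described above.
    All thresholds here are hypothetical, not the Leeds values."""
    overall_pass_score: float   # standard-set total score
    min_stations_passed: int    # limits excessive compensation across stations
    min_acceptable_sp: int      # minimum number of acceptable SP ratings

def meets_profile(total_score, stations_passed, acceptable_sp,
                  profile: PassingProfile) -> bool:
    # A candidate must satisfy every element of the profile to pass.
    return (total_score >= profile.overall_pass_score
            and stations_passed >= profile.min_stations_passed
            and acceptable_sp >= profile.min_acceptable_sp)

# Example with hypothetical thresholds for a 20-station OSCE.
profile = PassingProfile(overall_pass_score=58.0,
                         min_stations_passed=14,
                         min_acceptable_sp=16)
print(meets_profile(61.2, 15, 18, profile))   # True
```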

The borderline regression method was used for standard setting (Pell &
Roberts, 2006). Typically such an OSCE will generate roughly 60,000 data
items (i.e. individual student-level checklist marks), which form a valuable
resource for allowing quality measurement and improvement. As a result
of utilising such data, we have seen our own OSCEs deliver progressively
more innovation, whilst simultaneously maintaining or improving the levels of
reliability.

Under any of the borderline methods of standard setting, where a global grade is awarded in addition to the checklist score, accompanying metrics are useful in measuring the quality of the assessments. For other types of standard setting, where such a global grade does not form part of the standard setting procedure (e.g. Ebel & Angoff), inter-grade discrimination and the coefficient of determination (R2) will not apply (Cusimano, 1996).

Table 2
Final year OSCE Metrics
Station | Cronbach’s alpha if item deleted | R2 | Inter-grade discrimination | Number of failures | Between-group variation (%)

1 0.745 0.465 4.21 53 31.1


2 0.742 0.590 5.23 24 30.1
3 0.738 0.555 5.14 39 33.0
4 0.742 0.598 4.38 39 28.0
5 0.732 0.511 4.14 29 20.5
6 0.750 0.452 4.74 43 40.3
7 0.739 0.579 4.51 36 19.5
8 0.749 0.487 3.45 39 33.8
9 0.744 0.540 4.06 30 36.0
10 0.747 0.582 3.91 26 29.9
11 0.744 0.512 4.68 37 37.6
12 0.744 0.556 2.80 23 32.3
13 0.746 0.678 3.99 30 22.0
14 0.746 0.697 5.27 54 27.3
15 0.739 0.594 3.49 44 25.9
16 0.737 0.596 3.46 41 34.3
17 0.753 0.573 3.58 49 46.5
18 0.745 0.592 2.42 15 25.4
19 0.749 0.404 3.22 52 39.5
20 0.754 0.565 4.50 37 34.1

Number of candidates=241

A selection of these overall summary metrics will be used in this Guide to
illustrate the use of psychometric data ‘in action’, and to outline approaches
to identifying and managing unsatisfactory station-level assessment
performance. We have chosen older OSCE data to illustrate this Guide, to
highlight quality issues, and subsequent actions and improvements.

Metric 1 – Cronbach’s Alpha
Cronbach’s alpha is a measure of internal consistency (commonly, though not entirely


accurately, thought of as ’reliability’), whereby in a good assessment
the better students should do relatively well across the board (i.e. on the
checklist scores at each station). Two forms of alpha can be calculated
– non standardised or standardised – and in this Guide we refer to the non
standardised form (this is the default setting for SPSS). This is a measure of the
mean intercorrelation weighted by variances, and yields the same value as
the G-coefficient for a simple model of items crossed with candidates. The
(overall) value for alpha that is usually regarded as acceptable in this type of
high stakes assessment, where standardised and real patients are used, and
the individual station metrics are not standardised, is 0.7 or above.

Where station metrics are standardised a higher alpha would be expected.


Alpha for this set of stations was 0.754, and it can be seen (from the second
column of Table 2) that no station detracted from the overall ‘reliability’,
although stations 17 and 20 contributed little in this regard.
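
For readers who wish to compute these values outside SPSS, the following is a minimal sketch of the non-standardised alpha and the ‘alpha if item deleted’ column of Table 2, assuming a candidates-by-stations matrix of checklist scores. The data are simulated and the function names are our own, not part of any particular package.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Non-standardised Cronbach's alpha for a (candidates x stations) matrix."""
    k = scores.shape[1]                          # number of stations (items)
    item_vars = scores.var(axis=0, ddof=1)       # variance of each station score
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of candidate totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def alpha_if_deleted(scores: np.ndarray) -> np.ndarray:
    """Alpha recomputed with each station removed in turn (cf. Table 2)."""
    k = scores.shape[1]
    return np.array([cronbach_alpha(np.delete(scores, j, axis=1))
                     for j in range(k)])

# Simulated example: 241 candidates, 20 stations.
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, (241, 1))
scores = 20 + 1.5 * ability + rng.normal(0, 4, (241, 20))

overall = cronbach_alpha(scores)
flagged = np.where(alpha_if_deleted(scores) > overall)[0]
print(f"alpha = {overall:.3f}; stations detracting from alpha: {flagged + 1}")
```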

Since alpha tends to increase with the number of items in the assessment, the
resulting ‘alpha if item deleted’ scores should all be lower than the overall
alpha score if the item/station has performed well. Where this is not the case
this may be caused by any of the following reasons:

• The item is measuring a different construct from the rest of the set of items.

• The item is poorly designed.

• There are teaching issues – either the topic being tested has not been well
taught, or has been taught to a different standard across different groups
of candidates.

• The assessors are not assessing to a common standard.

In such circumstances, quality improvement should be undertaken by


revisiting the performance of the station, and reviewing checklist and station
design, or examining quality of teaching in the curriculum.

However, one cannot rely on alpha alone as a measure of the quality of


an assessment. As we have indicated, if the number of items increases, so
will alpha, and therefore a scale can be made to look more homogenous
than it really is merely by being of sufficient length in terms of the number of
items it contains. This means that if two scales measuring distinct constructs
are combined, to form a single long scale, this can result in a misleadingly
high alpha. Furthermore, a set of items can have a high alpha and still be
multidimensional. This happens when there are separate clusters of items
(i.e. measuring separate dimensions) which intercorrelate highly, even though
the clusters themselves do not correlate with each other particularly highly.

It is also possible for alpha to be too high (e.g. >0.9), possibly indicating
redundancy in the assessment, whilst low alpha scores can sometimes be
attributed to large differences in station mean scores rather than being the
result of poorly designed stations.

We should point out that in the authors’ medical school, and in many similar
institutions throughout the UK, over 1,000 assessors are required for the
OSCE assessment season (usually comprising 2-3 large scale examinations
as previously described). Consequently, recruiting sufficient assessors of
acceptable quality is a perennial issue, so it is not possible to implement
double-marking arrangements that would then make the employment of
G-theory worthwhile in terms of more accurately quantifying differences in
assessors. Such types of analysis are more complex than those covered in this
Guide, and often require the use of additional, less user-friendly, software.
An individual, institution-based decision to use G-theory or Cronbach’s alpha should be made in context with delivery requirements and any constraints.

The hawks and doves effect, either within an individual station, or aggregated to significant site effects, may have the effect of inflating the alpha value. However, it is highly likely that this effect will lead to unsatisfactory metrics in the areas of coefficient of determination, between-group within-station error variance, and, possibly, in fixed effect site differences, as we will explore later in this Guide. Our philosophy is that one metric alone, including alpha, is always insufficient in judging quality, and that in the case of an OSCE with a high alpha but other poor metrics, this would not indicate a high quality assessment.

As an alternative measure to ‘alpha if item is deleted’, it is possible to use the


correlation between station score and ‘total score less station score’. This will
give a more extended scale, but the datum value (i.e. correlation) between
contributing to reliability and detracting from it is to some extent dependent
on the assessment design and is therefore more difficult to interpret.

Metric 2 – Coefficient of Determination R2


The R2 coefficient is the proportion of variation in the dependent variable (checklist score) that is accounted for by variation in the independent variable (global grade). This allows us to determine the degree of (linear) correlation between the checklist score and the overall global rating at each station, with the expectation that higher overall global ratings should generally correspond with higher checklist scores. The square root of the coefficient of determination is the simple Pearson correlation coefficient. SPSS and other statistical software packages also give the adjusted value of R2, which takes into account the sample size and the number of predictors in the model (one in this case); ideally this value should be close to the unadjusted value.
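
As a minimal sketch of how these station-level quantities might be computed outside SPSS (assuming simple arrays of global grades and checklist scores for one station; the data and function name below are illustrative):

```python
import numpy as np
from scipy import stats

def station_r2(grades: np.ndarray, checklist: np.ndarray):
    """R2, adjusted R2 and slope for checklist score regressed on global grade."""
    n = len(grades)
    result = stats.linregress(grades, checklist)   # simple linear regression
    r2 = result.rvalue ** 2                        # coefficient of determination
    # Adjusted R2 with a single predictor, as reported alongside R2 by SPSS.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return r2, adj_r2, result.slope

# Illustrative data for one station.
rng = np.random.default_rng(2)
grades = rng.integers(0, 5, 241)
checklist = 8 + 4.5 * grades + rng.normal(0, 4, 241)

r2, adj_r2, slope = station_r2(grades, checklist)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, slope = {slope:.2f}")
```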

A good correlation (R2 > 0.5) will indicate a reasonable relationship between checklist scores and global grades, but care is needed to ensure that overly detailed global descriptors are not simply translated automatically by assessors into a corresponding checklist score, thereby artificially inflating R2. In Table 2, station 14 (a practical and medico-legal skills station) has a good R2 value of 0.697, implying that 69.7% of the variation in the students’ global ratings is accounted for by variation in their checklist scores. In contrast, station 19 is less satisfactory with an R2 value of 0.404. This was a new station focussing on patient safety and the management of a needlestick injury. To understand why R2 was low, it is helpful to examine the relationship graphically (for example, using SPSS Curve Estimation) to investigate the precise nature of the association between checklist and global grade – see Figure 2. In this figure, assessor global grades are shown on the x-axis and the total item checklist score is plotted on the y-axis. Clustered checklist scores are indicated by the size of the black circle, as shown in the key. SPSS can calculate the R2 coefficient for polynomials of different degree, and thereby provide additional information on the degree of linearity in the relationship. We would recommend always plotting a scatter graph of checklist marks against global ratings as routine good practice, regardless of station metrics.

In station 19 we can see that there are two main problems – a wide spread of marks for each global grade, and a very wide spread of marks for which the fail grade (0 on the x-axis) has been awarded. This indicates that some students have acquired many of the marks from the item checklist, but their overall performance has raised concerns in the assessor, leading to a global fail grade.

In our introduction, we raised the impact of outliers on the regression method.


Examples of poor checklist scores but with reasonable grades can be
observed in Figure 3. In other stations, we sometimes see candidates scoring
very few marks on the checklist score. This has the effect of reducing the
value of the regression intercept with the y-axis, and increasing the slope of
the regression line. For the data indicated in Table 2, the removal of outliers
and re-computation of the passing score and individual station pass marks
makes very little difference, increasing the passing score by less than 0.2%.

Figure 2
Curve estimation (Station 19) – Assessor checklist score (y) versus global grade (x)

This unsatisfactory relationship between checklist marks and global ratings causes some degree of non-linearity, as demonstrated in the accompanying Table 3 (produced by SPSS), where the best fit is clearly the cubic. Note that, mathematically speaking, a cubic will always produce a better fit, but parsimony dictates that the difference between the two fits has to be statistically significant for a higher order model to be preferred. In this example the fit of the cubic polynomial is significantly better than that of the linear. The key point to note is whether the cubic expression is the result of an underlying relationship or of outliers, resulting from inappropriate checklist design or unacceptable assessor behaviour in marking. In making this judgement, readers should review the distribution of marks seen on the scattergraph. Our own experience suggests that where station metrics are generally of good quality, a departure from strict linearity is not a cause for concern.

Table 3
Curve estimation table (Station 19)

Polynomial fitted R Square F df1 df2 Sig.


Linear .401 159.889 1 239 .000
Quadratic .435 91.779 2 238 .000
Cubic .470 70.083 3 237 .000
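
The comparison reported in Table 3 can be reproduced outside SPSS. The sketch below fits linear, quadratic and cubic polynomials and applies a nested-model F-test to judge whether the higher-order fit is significantly better; the data are simulated and the helper functions are our own, shown under those assumptions rather than as the authors' procedure.

```python
import numpy as np
from scipy import stats

def poly_r2_rss(x, y, degree):
    """R2 and residual sum of squares for a polynomial fit of given degree."""
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    rss = np.sum((y - fitted) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss, rss

def nested_f_test(x, y, d_small, d_large):
    """Is the higher-order polynomial a significantly better fit?"""
    n = len(y)
    _, rss_small = poly_r2_rss(x, y, d_small)
    _, rss_large = poly_r2_rss(x, y, d_large)
    df_num = d_large - d_small              # extra parameters in the larger model
    df_den = n - d_large - 1                # residual df of the larger model
    f = ((rss_small - rss_large) / df_num) / (rss_large / df_den)
    return f, stats.f.sf(f, df_num, df_den)  # F statistic and p-value

# Simulated station data (global grades 0-4 versus checklist marks).
rng = np.random.default_rng(3)
grades = rng.integers(0, 5, 241).astype(float)
checklist = 8 + 2 * grades + 0.3 * grades ** 3 + rng.normal(0, 4, 241)

for degree, label in [(1, "Linear"), (2, "Quadratic"), (3, "Cubic")]:
    r2, _ = poly_r2_rss(grades, checklist, degree)
    print(f"{label:9s} R2 = {r2:.3f}")
f_stat, p_value = nested_f_test(grades, checklist, 1, 3)
print(f"cubic vs linear: F = {f_stat:.1f}, p = {p_value:.4f}")
```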

The existence of low R2 values at certain stations and/or a wide spread of


marks for a given grade should prompt a review of the item checklist and
station design. In this particular case, although there was intended to be a
key emphasis on safe, effective management in the station, re-assessment
of the checklist in light of these metrics showed this emphasis was not well
represented. It is clear that weaker candidates were able to acquire many
marks for ‘process’ but did not fulfil the higher level expectations of the
station (the focus on decision making). This has been resolved through a re-
write of the station and the checklist, with plans for re-use of this station and
subsequent analysis of performance within a future OSCE.

Metric 3 – Inter-grade discrimination


This statistic gives the slope of the regression line and indicates the average
increase in checklist mark corresponding to an increase of one grade on the
global rating scale. Although there is no clear guidance on ‘ideal’ values, we
would recommend that this discrimination index should be of the order of a
tenth of the maximum available checklist mark (which is typically 30-35 in our
data).
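
As a rough illustration of this rule of thumb: a station with a maximum checklist mark of 32 would ideally show an inter-grade discrimination of around 3 marks per grade, and the values in Table 2 (roughly 2.4 to 5.3, against station maxima of 30-35) can be judged on the same basis.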

A low value of inter-grade discrimination is often accompanied by other


poor metrics for the station such as low values of R2 (indicating a poor
overall relationship between grade and checklist score), or high levels of
assessor error variance (see metric 5 below) where assessors have failed to
use a common standard. Excessively high inter-grade discrimination may
indicate either a very low pass mark, or a lack of linearity caused by a small
number of badly failing students who tend to steepen the regression line.

Where very poor student performance in terms of the checklist score occurs,
consideration needs to be given to whether these very low scores should be
excluded from standard setting to avoid excessive impact on overall passing
scores in a downward direction.

Returning to Table 2, it is clear that the inter-grade discrimination values


are generally acceptable across the stations (station maxima being in the
region of 30-35 marks), although there are three stations with discrimination
values in excess of 5 (e.g. station 14 – a skills station involving completion of a
cremation form).

Where there is doubt about a station in terms of its performance based on


the discrimination metric, returning to the R2 measure of variance and curve
estimation is often instructive. In Table 2, station 14 has the highest inter-grade
discrimination, and it can be seen in Figure 3 that most global grades again
encompass a wide range of marks, especially the ‘clear pass’ grade – value
2 on the x-axis, ranging from 4 to 27, but that the lower of these values are
clearly outliers. As the rest of the station metrics are acceptable, this station
can remain unchanged but should be monitored carefully when used in
subsequent assessments.

Figure 3
Curve estimation (Station 14) Assessor checklist score (y) versus global grade (x)

Metric 4 – Number of failures


It would be a mistake to automatically assume that an unusually high number of failures indicates a station that is somehow too difficult. The ‘reality check’, which is an essential part of borderline methods, will to a large extent compensate for station difficulty. This represents the expert judgement made by trained assessors in determining the global rating against the expected performance of the minimally competent student.

As previously described, other psychometric data can be used to investigate
station design and performance in order to identify problems. Failure rates
may be used to review the impact of a change in teaching on a particular
topic – with higher such rates indicating where a review of content and
methods of teaching can help course design. There are no major outliers for
this metric in Table 2, but the difficulties with station 19 have allowed us to
identify and deliver additional teaching around elements of patient safety
within the final year curriculum, and introduce this specific safety focus into
checklists.

Metric 5 – Between-group variation (including assessor effects)
When performing analysis on data resulting from complex assessment
arrangements such as OSCEs where, by necessity, the students are
subdivided into groups for practical purposes, it is vital that the design is fully
randomised. However, this is not always possible, with logistical
issues including dealing with special needs students who may require more
time and have to be managed exclusively within a separate cycle. Any non-
random subgroups must be excluded from statistically-based types of analysis
that rely on randomness in the data as a key assumption.

In the ideal assessment process, all the variation in marks will be due
to differences in student performance, and not due to differences in
environment (e.g. local variations in layout or equipment), location (e.g.
hospital based sites having different local policies for management of clinical
conditions), or differences of assessor attitude (i.e. hawks & doves). There are
two ways of measuring such effects, either by performing a one-way ANOVA
on the station (e.g. with the assessor as a fixed effect), or by computing
the proportion of total variance which is group specific. The latter allows
an estimation of the proportion of variation in checklist scores that is due
to student performance as distinct from other possible factors mentioned
above, although this is usually given as the proportion of variance which is
circuit specific.

If the variance components are computed, using group (i.e. circuit) as a


random effect, then the percentage of variance specific to group can be
computed. This is a very powerful metric as it gives a very good indication of
the uniformity of the assessment process between groups. It is also relatively
straightforward to calculate. Ideally between-group variance should be
under 30%, and values over 40% should give cause for concern, indicating
potential problems at the station level due to inconsistent assessor behaviour
and/or other circuit specific characteristics, rather than student performance.
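
A minimal sketch of this computation is given below, estimating the percentage of checklist variance that is circuit-specific via a method-of-moments (one-way random effects) approach. The column names and simulated data are our own assumptions; dedicated variance-components routines in standard statistical packages would serve equally well.

```python
import numpy as np
import pandas as pd

def between_group_variance_pct(df: pd.DataFrame, score="checklist", group="circuit"):
    """Method-of-moments estimate of the % of score variance that is
    group (circuit) specific, treating group as a random effect."""
    groups = df.groupby(group)[score]
    k = groups.ngroups
    n_i = groups.size().to_numpy()
    N = n_i.sum()
    grand_mean = df[score].mean()

    # One-way ANOVA mean squares.
    ss_between = (n_i * (groups.mean().to_numpy() - grand_mean) ** 2).sum()
    ss_within = ((df[score] - groups.transform("mean")) ** 2).sum()
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (N - k)

    # Variance component for group (kept non-negative by convention).
    n0 = (N - (n_i ** 2).sum() / N) / (k - 1)
    var_group = max((ms_between - ms_within) / n0, 0.0)
    return 100 * var_group / (var_group + ms_within)

# Illustrative data: 15 circuits of 16 candidates with a hawks-and-doves effect.
rng = np.random.default_rng(4)
circuit = np.repeat(np.arange(15), 16)
circuit_effect = rng.normal(0, 2.5, 15)[circuit]
df = pd.DataFrame({"circuit": circuit,
                   "checklist": 22 + circuit_effect + rng.normal(0, 4, circuit.size)})

print(f"between-group variance: {between_group_variance_pct(df):.1f}%")
```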

From Table 2, stations 6, 17 and 19 give cause for concern with regard to this
metric, with the highest levels of between-group variance. In addition, station
6 has a poor R2, and the overall combination of poor metrics at this station
tells us that the poor R2 was probably due to poor checklist design. These
observations prompted a review of the design of station 6, and the checklist
was found to consist of a large number of low level criteria where weaker
candidates could attain high scores through ‘process’ only. In other words,

there was a likely mismatch between the nature of the checklist, and the
aims and objectives of the station as understood by the assessors. Hence, in
redesigning the station, a number of the low-level criteria were chunked (that
is, grouped together to form a higher level criterion) in order to facilitate the
assessment of higher level processes as originally intended.

Station 17 tells a different story, as the good R2 coupled with the high
between-group variation indicates that assessors are marking consistently
within groups, but that there is a distinct hawks and doves effect between
groups. In such a case, this ought to be further investigated by undertaking a
one-way ANOVA analysis to determine whether this is an individual assessor
or a site phenomenon. The amount of variance attributable to different sites
is subsumed in the simple computation of within-station between-group
variance as described above. However, its significance may be determined
using a one-way ANOVA analysis with sites as fixed effects.

However, care needs to be exercised in making judgements based on


a single metric, since, with quite large populations, applying ANOVA to
individual stations is likely to reveal at least one significant result, as a result
of a type I error due to multiple significance tests across a large number of
groups (e.g. within our own OSCE assessments, a population of 250 students
and approximately 15 parallel circuits across different sites). Careful post-hoc
analysis will indicate any significant hawks and doves effects, and specific
groups should be tracked across other stations to determine general levels
of performance. If a completely random assessment model of both students
and assessors has been used (mindful of the caveats about local variations in
equipment and exam set up), then many of these effects should be largely
self-cancelling; it is in the aggregate totals that group-specific fixed effects
are important and may require remedial action.

Metric 6 – Between-group variance (other effects)


ANOVA analysis can also be of use when there are non-random allocations
of either assessors or students, as is the case in some medical schools with
large cohorts and associated teaching hospitals where multi-site assessment
may occur. Such complex arrangements can result in the non-random
assignment of assessors to circuits since it is often difficult for clinical staff to
leave their place of work. This may then lead to significant differences due
to ‘site effects’ which can be identified with appropriate action taken in the
analysis of results.
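
A sketch of such a fixed-effects check is given below, using a one-way ANOVA of station checklist scores by site; the site labels, effect sizes and column names are illustrative assumptions only, not a prescribed format.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative station-level data with assessors allocated non-randomly
# to three hospital sites.
rng = np.random.default_rng(5)
site = np.repeat(["A", "B", "C"], 80)
shift = pd.Series(site).map({"A": 0.0, "B": 1.5, "C": -1.0}).to_numpy()
df = pd.DataFrame({"site": site,
                   "checklist": 22 + shift + rng.normal(0, 4, site.size)})

# One-way ANOVA with site as a fixed effect at a single station.
groups = [g["checklist"].to_numpy() for _, g in df.groupby("site")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"site effect: F = {f_stat:.2f}, p = {p_value:.4f}")
```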

Other important fixed effects can also be identified through the use of
ANOVA. For example, assessor training effects, staff/student gender effects,
and associated interactions, which have all been previously described
(Pell, 2008), and which underline the need for complete and enhanced
assessor training as previously highlighted (Holmboe, 2004).

Metric 7 – Standardised Patients ratings
Most centres that use simulated/standardised patients (SPs) require them to
rate candidates, and this typically follows an intensive training programme.
Within our own institution, SPs would be asked a question such as “Would you
like to consult again with this doctor?” with a range of responses (strongly
agree, agree, neither agree nor disagree, disagree, strongly disagree), the
two latter responses being regarded as adverse. Akin to Metric 4 (number of station failures), a higher than normal proportion of candidates (e.g. >10%) receiving adverse SP ratings may indicate problems. There is no available literature on what constitutes an ‘acceptable’ range of SP ratings at station level, so we have chosen an arbitrary cut-off figure of 10%. The critical issue here is that other station metrics should be reviewed, and the impact on SP ratings monitored in response to training or other interventions.

If this is coupled with a higher than normal failure rate it could be the result
of inadequate teaching of the topic. Adverse values of this metric are often
accompanied by high rates of between-group variance; assessors viewing
candidates exhibiting a lower than expected level of competence often
have difficulty in achieving consistency.

The overall reliability of the assessment may be increased by adding the


SP rating to the checklist score; typically the SP rating should contribute
10-20% of the total station score (Homer & Pell, 2009). An alternative
approach, taken within our own institution at graduating level OSCEs, is to set
a ‘minimum’ requirement for SP comments as a proxy for patient satisfaction
(using rigorously trained SPs).
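
As a sketch of the weighting approach described above, the snippet below folds an SP rating into the station score at a chosen weight. The coding of the 5-point SP response scale as 0-4 and the 15% weight (taken from the suggested 10-20% range) are assumptions made for illustration, not the authors' scoring rules.

```python
import numpy as np

def combine_with_sp(checklist, checklist_max, sp_rating, sp_weight=0.15):
    """Combine the checklist score with an SP rating so that the SP rating
    contributes sp_weight (e.g. 10-20%) of the total station score.

    sp_rating is assumed to be coded 0-4 (strongly disagree .. strongly agree
    to 'Would you like to consult again with this doctor?')."""
    checklist_part = (1 - sp_weight) * (np.asarray(checklist) / checklist_max)
    sp_part = sp_weight * (np.asarray(sp_rating) / 4)
    return 100 * (checklist_part + sp_part)   # station score as a percentage

# Example: a candidate scoring 26/32 on the checklist with an 'agree' SP rating.
print(f"{combine_with_sp(26, 32, 3):.1f}%")
```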

The 360 degree picture of OSCE quality


As outlined, it is critical to review station quality in light of all available
station-level metrics before making assumptions about quality, and planning
improvements.

Review of the metrics of station 8 (focusing on consultation, diagnosis and


decision making) shows a positive contribution to overall assessment reliability
(alpha if item deleted 0.749). As can be seen below in the curve estimation in
Figure 4, the R2 coefficient is poor at 0.4, with a wide spread of item checklist
scores within grades, and significant overlap across the higher grades (pass,
credit and distinction).

Coupled with high levels of between-group variance of 33.8%, this suggests


a mismatch between assessor expectations and grading, and the construct
of the item checklist in the provision of higher level performance actions. This
leads to inconsistency within and between stations.

Actions to resolve this would typically include a review of the station content
and translation to the item checklist. Reviewing grade descriptors and
support material for assessors at station level should help overcome the
mismatch revealed by the poor R2 and higher error variance.

Figure 4:
Curve estimation (Station 8) – Assessor checklist score (y) versus global grade (x)
[Scatter plot of checklist scores (approximately 10-50 on the y-axis) against global grades 0-4 (x-axis), with observed points and linear, quadratic and cubic fitted curves.]

Station 9 is represented by the curve estimation seen below in Figure 5.

Here we see a more strongly positive contribution to reliability (alpha if item


deleted 0.74) and better station-level metrics. The R2 coefficient is acceptable
at 0.5, but between-group variance is still high at 36%.

The curve shows wide performance variance at each grade level. The good
R2 suggests that the variation is in the assessor global rating rather than the
assessor checklist scoring, with a hawks and doves effect.

Figure 5
Curve estimation (Station 9) – Assessor checklist score (y) versus global grade (x)
[Scatter plot of checklist scores (approximately 5-30 on the y-axis) against global grades 0-4 (x-axis), with observed points and linear, quadratic and cubic fitted curves.]

Action to investigate and improve this would focus on assessor support
material in relation to global ratings.

Quality control by observation: detecting problems in the run up to OSCEs and on the day
It is essential for those concerned with minimising error variance between
groups to observe the OSCE assessment systematically. When considering
some of the causes of between-group error, all those involved in the wider
OSCE process (Figure 1) must be part of the quality control process.

In advance of the OSCE, many of the contributing factors to error variance


can be anticipated and corrected by applying some of the points below:

• Checking across stations to ensure congruence in design.

• Ensuring that new (and older, established) stations follow up-to-date


requirements in terms of checklist design, weighting and anchor points.

• Reviewing the set up of parallel OSCE circuits – for example, differences


in the placing of disinfectant gel outside a station may mean that the assessor
is not able to score hand hygiene approaches.

• Ensuring that stations carry the same provision of equipment (or permit
flexibility if students are taught different approaches with different
equipment).

Other sources of error variance can occur during the delivery of the OSCE:

• Assessors who arrive late and miss the pre-assessment briefing and who
therefore fail to adhere adequately to the prescribed methodology.

• Unauthorised prompting by assessors (despite training and pre-exam


briefings).

• Inappropriate behaviour by assessors (e.g. changing the ‘tone’ of a station


through excessive interaction).

• Excessively proactive simulated patients whose questions act as prompts to


the students.

• Biased real patients (e.g. gender or race bias). Simulated patients receive
training on how to interact with the candidates, but this may not be
possible with the majority of real patients to the same level undertaken with
simulators.

• Assessors (or assistants) not returning equipment to the start or neutral


position as candidates change over.

Post hoc remediation


When faced with unsatisfactory metrics, a number of pragmatic, post hoc remediation methods can be employed.

1. Adjustment of total marks for site effects: The easiest method is to adjust to a common mean across all sites (a minimal sketch follows the list below). After any such adjustment, the site profile of failing students should be checked to ensure that, for example, all failures are not confined to a single site. The effect of any special needs group (e.g. candidates receiving extra time as a result of health needs) located within a single specific site needs to be discounted when computing the adjustment level.

2. Adjustment at the station level: This is seldom necessary because any


adverse effects will tend to cancel each other out. In the rare cases
where this does not happen, a station level procedure as above can be
carried out.

3. Removal of a station: Again, this is a rare event and the criteria for this are
usually multiple adverse metrics, the result of which would disadvantage
students to such an extent that the assessment decisions are indefensible
against appeal.
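
The sketch below illustrates remediation method 1: each site's total marks are shifted to the overall mean, and rows flagged as a special needs subgroup are ignored when the shifts are computed but still receive their site's adjustment. The column names, function name and data are hypothetical, offered as one possible implementation rather than the authors' own.

```python
import numpy as np
import pandas as pd

def adjust_to_common_mean(df: pd.DataFrame, score="total", site="site",
                          exclude=None):
    """Shift each site's total marks to the overall mean (remediation
    method 1 above). Rows flagged in `exclude` (e.g. a special needs
    cycle run at a single site) are left out of the calculation of the
    shifts but still receive their site's adjustment."""
    mask = ~df[exclude] if exclude else pd.Series(True, index=df.index)
    overall_mean = df.loc[mask, score].mean()
    site_means = df.loc[mask].groupby(site)[score].mean()
    shifts = overall_mean - site_means                 # per-site adjustment
    return df[score] + df[site].map(shifts).fillna(0.0)

# Illustrative use with hypothetical columns.
df = pd.DataFrame({"site": ["A"] * 3 + ["B"] * 3,
                   "total": [55.0, 60.0, 65.0, 50.0, 55.0, 60.0],
                   "special_needs": [False] * 6})
df["adjusted"] = adjust_to_common_mean(df, exclude="special_needs")
print(df)
```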

Conclusions
Using a series of worked examples and ‘live data’, this Guide focuses on
commonly used OSCE metrics and how they can be used to identify and
manage problems, and how such an approach helps to anticipate future
issues at the school/single institution level. This methodology therefore
naturally feeds into the wider assessment processes as described in Figure 1.

In the authors’ institution there is a close relationship between those


who analyse the data and those who design and administer the clinical
assessments and develop/deliver teaching. Routine and detailed review of
station level metrics has revealed mismatches between checklists and global
ratings. This has led to the redesign of certain OSCE stations with a subsequent
improvement of metrics. Some of these redesigns include:

• Chunking of a number of simple criteria into fewer criteria of higher level.

• Chunking to allow for higher level criteria commensurate with the stage of
student progression, allowing assessment of higher level, less process driven
performance

• The inclusion of intermediate grade descriptors on the assessor checklists.

• Ensuring that checklist criteria have three instead of two anchors where
appropriate, thereby allowing greater discrimination by assessors.

• A greater degree of uniformity between the physical arrangements of the


different circuits.

The presence of high failure rates at particular stations has led to a revisiting
of the teaching of specific parts of the curriculum, and was followed by
changes in the way things were taught, resulting in improved student
performance as measured in subsequent OSCEs.

Indications of poor agreement between assessors have, on occasion, led to


a number of changes all of which have been beneficial to the quality of
assessment:

• Upgrading of assessor training methods.

• Updating (‘refreshing’) assessors who were trained some time ago.

• The provision of more detailed support material for assessors.

• Improved assessor briefings prior to the assessment.

• Improved SP briefings prior to the assessment.

• Dummy runs before the formal assessment for both assessors and SPs (this
is only really practicable where student numbers are relatively small e.g.
resits, and in dental OSCEs with smaller cohorts of students).

The need for all the above improvements would be unlikely to have been apparent from using a single reliability metric, such as Cronbach’s alpha or the G coefficient. It is only when a family of metrics is used that a true picture of quality can be obtained and the deficient areas identified. Adopting this approach will be rewarded with a steady improvement in the delivery and standard of clinical assessment.

References
COHEN DS, COLLIVER JA, ROBBS RS & SWARTZ MH (1997). A Large-Scale Study of the
Reliabilities of Checklist Scores and Ratings of Interpersonal and Communication
Skills Evaluated on a Standardized-Patient Examination. Advances in Health Sciences
Education, 1: 209-213.

CUSIMANO M (1996). Standard setting in Medical Education. Academic Medicine,


71(10): S112-S120.

EVA KW, ROSENFELD J, REITER H & NORMAN GR (2004). An Admissions OSCE: the
multiple mini-interview. Medical Education, 38: 314-326.

FIELD A (2000). Discovering Statistics (using SPSS for windows), p.130 (Sage Publications,
London)

HOLMBOE E (2004). Faculty and the observation of trainees’ clinical skills: Problems and
opportunities. Academic Medicine, 79(1): 16-22.

HOMER M & PELL G (2009). The impact of the inclusion of simulated patient ratings on
the reliability of OSCE assessments under the borderline regression method. Medical
Teacher, 31(5): 420-425.

NEWBLE D (2004). Techniques for measuring clinical competence: objective structured


clinical examinations. Medical Education, 38: 199-203.

NORCINI J (2003). Setting standards on educational tests. Medical Education, 37(5):


464-469.

NORCINI J & BURCH V (2007) Workplace-based assessment as an educational tool:


AMEE guide No. 31. Medical Teacher, 29(9): 855-871.

PELL G, HOMER M & ROBERTS TE (2008). Assessor Training: Its Effects on Criterion Based
Assessment in a Medical Context. International Journal of Research & Method in
Education, 31(2): 143-154.

PELL G & ROBERTS TE (2006). Setting standards for student assessment. International
Journal of Research & Method in Education, 29(1): 91-103.

POSTGRADUATE MEDICAL EDUCATION AND TRAINING BOARD (2009).


Workplace based assessment. A guide for Implementation (London).
www.pmetb.org.uk/fileadmin/user/QA/assessment/PMETB_WPBA_Guide_20090501.pdf
(accessed May 11th 2009)

REGEHR G, MACRAE H, REZNICK RK & SZALAY D (1998). Comparing the psychometric


properties of checklists and global rating scales for assessing performance on an
OSCE-format examination. Academic Medicine, 73(9): 993-997.

ROBERTS C, NEWBLE D, JOLLY B, REED M & HAMPTON K (2006). Assuring the quality of
high-stakes undergraduate assessments of clinical competence. Medical Teacher,
28(6): 535-543.

STEVENS J (1992). Applied multivariate statistics for the social sciences (2nd ed),
Chapter 4: 151-182 (Erlbaum, Hillside NJ)

WASS V, MCGIBBON D & VAN DER VLEUTEN C (2001). Composite undergraduate


clinical examinations: how should the components be combined to maximise
reliability? Medical Education, 35: 326-330.

Suggested Further Reading


STREINER DL & NORMAN GR (2008). Health Measurement Scales: a practical guide to
their development and use (4th ed). Oxford University Press, Oxford

CIZEK GJ & BUNCH MB (2007). Standard Setting (1st ed). Sage Publications, London

Guides in the AMEE Guides series

42 The use of simulated patients in medical education
Jennifer A Cleland, Keiko Abe, Jan-Joost Rethans (2010). ISBN: 978-1-903934-55-5
A detailed overview on how to recruit, train and use Standardized Patients from a teaching and assessment perspective.

43 Scholarship, Publication and Career Advancement in Health Professions Education
William C McGaghie (2010). ISBN: 978-1-903934-50-0
Advice for the teacher on the preparation and publication of manuscripts and twenty-one practical suggestions about how to advance a successful and satisfying career in the academic health professions.

44 The Use of Reflection in Medical Education
John Sandars (2010). ISBN: 978-1-903934-56-2
A variety of educational approaches in undergraduate, postgraduate and continuing medical education that can be used for reflection, from text-based reflective journals and critical incident reports to the creative use of digital media and storytelling.

45 Portfolios for Assessment and Learning
Jan van Tartwijk, Erik W Driessen (2010). ISBN: 978-1-903934-57-9
An overview of the content and structure of various types of portfolios, including eportfolios, and the factors that influence their success.

46 Student Selected Components
Simon C Riley (2010). ISBN: 978-1-903934-58-6
An insight into the structure of an SSC programme and its various important component parts.

47 Using Rural and Remote Settings in the Undergraduate Medical Curriculum
Moira Maley, Paul Worley, John Dent (2010). ISBN: 978-1-903934-59-3
A description of an RRME programme in action with a discussion of the potential benefits and issues relating to implementation.

48 Effective Small Group Learning
Sarah Edmunds, George Brown (2010). ISBN: 978-1-903934-60-9
An overview of the use of small group methods in medicine and what makes them effective.

49 How to Measure the Quality of the OSCE: A Review of Metrics
Godfrey Pell, Richard Fuller, Matthew Homer, Trudie Roberts (2011). ISBN: 978-1-903934-62-3
A review of the metrics that are available for measuring quality in assessment, indicating how a rounded picture of OSCE assessment quality may be constructed by using a variety of measures.

50 Simulation in Healthcare Education. Building a Simulation Programme: a Practical Guide
Kamran Khan, Serena Tolhurst-Cleaver, Sara White, William Simpson (2011). ISBN: 978-1-903934-63-0
A very practical approach to designing, developing and organising a simulation programme.

51 Communication Skills: An essential component of medical curricula. Part I: Assessment of Clinical Communication
Anita Laidlaw, Jo Hart (2011). ISBN: 978-1-903934-85-2
An overview of the essential steps to take in the development of assessing the competency of students' communication skills.

52 Situativity Theory: A perspective on how participants and the environment can interact
Steven J Durning, Anthony R Artino (2011). ISBN: 978-1-903934-87-6
Clarification of the theory that our environment affects what we and our students learn.

53 Ethics and Law in the Medical Curriculum
Al Dowie, Anthea Martin (2011). ISBN: 978-1-903934-89-0
An explanation of why, and a practical description of how, this important topic can be introduced into the undergraduate medical curriculum.

54 Post Examination Analysis of Objective Tests
Mohsen Tavakol, Reg Dennick (2011). ISBN: 978-1-903934-91-3
A clear overview of the practical importance of analysing questions to ensure the quality and fairness of the test.

55 Developing a Medical School: Expansion of Medical Student Capacity in New Locations
David Snadden, Joanna Bates, Philip Burns, Oscar Casiro, Richard B Hays, Dan Hunt, Angela Towle (2011). ISBN: 978-1-903934-93-7
As many new medical schools are developed around the world, this guide draws upon the experience of seven experts to provide a very practical and logical approach to this often difficult subject.

56 Research in Medical Education. "The research compass": An introduction to research in medical education
Charlotte Ringsted, Brian Hodges, Albert Scherpbier (2011). ISBN: 978-1-903934-95-1
An introductory guide giving a broad overview of the importance attached to research in medical education.

57 General overview of the theories used in assessment
Lambert WT Schuwirth, Cees PM van der Vleuten (2012). ISBN: 978-1-903934-97-5
As assessment is modified to suit student learning, it is important that we understand the theories that underpin which methods of assessment are chosen. This guide provides an insight into the essential theories used.

58 Self-Regulation Theory: Applications to medical education
John Sandars, Timothy J Cleary (2012). ISBN: 978-1-903934-99-9
Self-regulation theory, as applied to medical education, describes the cyclical control of academic and clinical performance through several key processes, including goal-directed behaviour, the use of specific strategies to attain goals, and the adaptation and modification of behaviours or strategies to optimise learning and performance.

59 How can Self-Determination Theory assist our understanding of the teaching and learning processes in medical education?
Olle ThJ ten Cate, Rashmi A Kusurkar, Geoffrey C Williams (2012). ISBN: 978-1-908438-01-0
Self-Determination Theory (SDT) is among the current major motivational theories in psychology, but its applications in medical education are rare. This guide uncovers the potential of SDT to help understand many common processes in medical education.

60 Building bridges between theory and practice in medical education by using a design-based research approach
Diana HJM Dolmans, Dineke Tigelaar (2012). ISBN: 978-1-908438-03-4
This guide describes how Design-Based Research (DBR) can help to bridge the gap between research and practice, by contributing towards theory testing and refinement on the one hand and improvement of educational practice on the other.

61 Integrating Professionalism into the Curriculum
Helen O'Sullivan, Walther van der Mook, Ray Fewtrell, Val Wass (2012). ISBN: 978-1-908438-05-8
Professionalism is now established as an important component of all medical curricula. This guide clearly explains the why and how of integrating professionalism into the curriculum and ways to overcome many of the obstacles encountered.
About AMEE
What is AMEE?
AMEE is an association for all with an interest in medical and healthcare professions education,
with members throughout the world. AMEE’s interests span the continuum of education from
undergraduate/basic training, through postgraduate/specialist training, to continuing professional
development/continuing medical education.

• Conferences: Since 1973 AMEE has been organising an annual conference, held in a European
city. The conference now attracts over 2300 participants from 80 countries.

• Courses: AMEE offers a series of courses at AMEE and other major medical education conferences
relating to teaching, assessment, research and technology in medical education.

• MedEdWorld: AMEE’s exciting new initiative has been established to help all concerned with
medical education to keep up to date with developments in the field, to promote networking
and sharing of ideas and resources between members and to promote collaborative learning
between students and teachers internationally.

• Medical Teacher: AMEE produces a leading international journal, Medical Teacher, published 12
times a year, included in the membership fee for individual and student members.

• Education Guides: AMEE also produces a series of education guides on a range of topics, including
Best Evidence Medical Education Guides reporting results of BEME Systematic Reviews in medical
education.

• Best Evidence Medical Education (BEME): AMEE is a leading player in the BEME initiative which aims
to create a culture of the use of best evidence in making decisions about teaching in medical and
healthcare professions education.

Membership categories
• Individual and student members (£85/£39 a year): Receive Medical Teacher (12 issues a year, hard
copy and online access), free membership of MedEdWorld, discount on conference attendance
and discount on publications.

• Institutional membership (£200 a year): Receive free membership of MedEdWorld for the institution,
discount on conference attendance for members of the institution and discount on publications.

See the website (www.amee.org) for more information.

If you would like more information about AMEE and its activities, please contact the AMEE Office:
Association for Medical Education in Europe (AMEE), Tay Park House, 484 Perth Road, Dundee DD2 1LR, UK
Tel: +44 (0)1382 381953; Fax: +44 (0)1382 381987; Email: amee@dundee.ac.uk
www.amee.org
ISBN: 978-1-903934-62-3
Scottish Charity No. SC 031618
