Chapter 4
This chapter is the second part of the story about designing items. The chapter
concentrates on how to categorize the item responses and then score them to be indicators of the
construct. It introduces the idea of an “outcome space,” which describes how to categorize the
responses. Some important qualities of this categorization include that the categories should be:
Well-defined, finite and exhaustive, ordered, context-specific and research-based. In addition,
the categories need to be scored in order to use them in the Calibration model—the topic of the
next chapter. The chapter concludes with a description of three widely applicable strategies for
jointly developing the outcome space and scoring strategy: phenomenography, the structure of
the learning outcome (SOLO) technique and Guttman items.
Key concepts: outcome space, well-defined categories, finite and exhaustive categories, ordered
categories, context-specific categories, research-based categories, scoring scheme,
phenomenography, structure of the learning outcome (SOLO), Guttman items, raters.
The outcome space is the third building block in the BEAR Assessment System (BAS). It
has already been introduced, lightly, in Chapter 1, and its relationship to the other building
blocks was illustrated there too—see Figure 4.1. In this chapter, it is the main focus.
The term “outcome space” was introduced by Ference Marton (1981) for a set of
outcome categories developed from a detailed (“phenomenographic”) analysis of students’
responses to standardized open-ended items such as the LPS item discussed in the previous
chapter.¹ In much of his writing Marton describes the development of a set of outcome
categories as a process of “discovering” the qualitatively different ways in which students
respond to a task. In this book the term outcome space is adopted and applied in a broader sense
to any set of qualitatively described categories for recording and/or judging how respondents
have responded to items. Several examples of outcome spaces have already been shown in
earlier examples. The LPS Argumentation construct map in Figure 2.9 (Example 4) summarizes
how to categorize the responses to the LPS items attached to the Argumentation construct—this
is a fairly typical outcome space for an open-ended item. The outcome spaces for fixed-response
items look different—they are simply the fixed responses themselves—for example, the outcome
space for an evaluation item in the PF-10 Survey (Example 6) is:
“Yes, limited a lot,”
“Yes, limited a little,” or
“No, not limited at all.”
¹ Note that the ADM item in Figure 1.6 could also be used here, along with the construct map in Figure 1.9 and
Appendix 1A. This holds for all of the references to the LPS item in the remainder of this chapter.
Although these two types of outcome space look quite different, it is important to see that they
are connected in a deep way—in both cases, the response categories are designed to map back to
the waypoints of the construct map. Thus, if two sets of items, some of which were constructed
response and some selected response, related to the same construct map, then, despite how
different they looked, they would ALL have the common feature that their responses could be
mapped to the waypoints of that construct map. As noted above, this connection leads to a good
way to develop a fixed set of responses for selected response items: First construct the open-
ended outcome space, and second, use some of the sample responses in the categories as a way
to generate representative fixed choices for selection. Of course, many considerations must be
borne in mind while making those choices.
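To make this connection concrete, here is a minimal sketch in Python of how both kinds of outcome space can be stored as mappings onto the same set of construct map waypoints. The waypoint labels, response categories, and option letters below are invented for illustration; they are not taken from the LPS or PF-10 materials.

```python
# A minimal sketch: both item types map responses onto the same waypoints.
# All labels below are invented for illustration.

waypoints = ["W1_lowest", "W2", "W3", "W4_highest"]  # hypothetical construct map

# Outcome space for an open-ended item: response category -> waypoint.
open_ended_outcome_space = {
    "irrelevant or off-task": "W1_lowest",
    "claim only": "W2",
    "claim with partial evidence": "W3",
    "claim with evidence and reasoning": "W4_highest",
}

# Outcome space for a selected-response item: each fixed option was drafted
# from sample responses at one waypoint, so the key is also such a mapping.
selected_response_key = {
    "A": "W2",          # distractor drafted from typical W2 responses
    "B": "W4_highest",  # the fully correct option
    "C": "W1_lowest",
    "D": "W3",
}

def to_waypoint(item_type: str, response: str) -> str:
    """Return the construct map waypoint that a categorized response maps to."""
    table = open_ended_outcome_space if item_type == "open" else selected_response_key
    return table[response]

print(to_waypoint("open", "claim only"))   # -> W2
print(to_waypoint("selected", "B"))        # -> W4_highest
```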
Figure 4.1 The four building blocks in the BEAR Assessment System (BAS)
Inherent in the idea of categorization is an understanding that the categories that define
the outcome space are qualitatively distinct. All measures are based, at some point, on qualitative
distinctions. Even fixed-response formats such as multiple-choice test items and Likert-style
survey questions rely upon a qualitative understanding of what constitutes different levels of
response (more or less correct, or more or less agreeable, as the case may be). Rasch (1977, p.
68) pointed out that this principle goes far beyond measurement in the social sciences: “That
science should require observations to be measurable quantities is a mistake of course; even in
physics, observations may be qualitative--as in the last analysis they always are.”
The remainder of this section contains a description of the important qualities of a sound
and useful outcome space. These qualities include: well-defined, finite and exhaustive, ordered,
context-specific and research-based, as detailed below.
4.1.1 Well-Defined Categories.
The categories that make up the outcome space must be well-defined. For our purposes,
this will need to include not only (a) a general definition of what is being measured by that item
(i.e., in the approach described in this book, a description of the construct map), but also (b)
relevant background material and (c) examples of items, item responses and their categorization,
as well as (d) a training procedure for constructed response items. The LPS example displays all
except the last of these characteristics: Figure 2.9 summarizes the Argumentation construct map
including descriptions of different levels of response; Figures 3.3 and 3.4 show an example
item; and the paper cited in the description in Chapter 1 (Osborne et al, 2016) gives a
background discussion to the construct map, including references to the relevant literature.
Construct Mapping. What is not shown in the LPS materials is a training program to
achieve high inter-rater agreement in the types of responses that fall into different categories,
which will in turn support the usefulness of the results. To achieve high levels of agreement, it
is necessary to go beyond written materials; some sort of training is usually required. One such
method that is consistent with the BAS approach is called “construct mapping” (Draney &
Wilson, 2010/11). In the context of education this method has been found to be particularly
helpful for teachers, who can bring their professional experiences to help in the judgement
process, but who also have found the process to enhance their professional development. In this
technique, teachers choose examples of item responses from their own students or others, and
then circulate the responses beforehand to other members of the moderation group. All the
members of the group categorize the responses using the scoring guides and other material
available to them. They then come together to “moderate” those categorizations at a consensus
building meeting. The aim of the meeting is for the group to compare their categorizations,
discuss them until they come to a consensus about the scores, and to discuss the instructional
implications of knowing which categories the students' responses have been placed into. This process
may be repeated several times with different sets of responses to achieve higher levels of initial
agreement, and to track teachers’ improvement over time. In line with the iterative nature of
design, the outcome space may be modified from the original by this process.
One way to check that the outcome space contains sufficiently interpretable detail is to
have different teams of judges use the materials to categorize a set of responses. The agreement
between the two sets of judgments provides an index of how successful the definition of the
outcome space has been (although, of course, standards of success may vary). Marton (1986)
gives a useful distinction between developing an outcome space and using one. In comparing the
work of the measurer to that of a botanist classifying species of plants, he notes that “while there
is no reason to expect that two persons working independently will construct the same taxonomy,
the important question is whether a category can be found or recognized by others once it has
been described... It must be possible to reach a high degree of agreement concerning the presence
or absence of categories if other researchers are to be able to use them” (Marton, 1986, p. 35).
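As a concrete illustration of such an agreement check, the sketch below computes the exact agreement rate and Cohen's kappa for two hypothetical teams of judges categorizing the same ten responses. The data and category labels are invented, and the choice of agreement index (and of the standard of success) is left to the measurer.

```python
from collections import Counter

# Hypothetical categorizations of the same 10 responses by two teams of judges.
team_1 = ["W1", "W2", "W2", "W3", "W1", "W4", "W3", "W2", "W1", "W3"]
team_2 = ["W1", "W2", "W3", "W3", "W1", "W4", "W3", "W2", "W2", "W3"]

n = len(team_1)

# Exact (percent) agreement.
observed = sum(a == b for a, b in zip(team_1, team_2)) / n

# Cohen's kappa: corrects observed agreement for agreement expected by chance.
c1, c2 = Counter(team_1), Counter(team_2)
categories = set(team_1) | set(team_2)
expected = sum((c1[k] / n) * (c2[k] / n) for k in categories)
kappa = (observed - expected) / (1 - expected)

print(f"exact agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```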
4.1.2 Research-Based Categories.
The construction of an outcome space should be part of the process of developing an item
and, hence, should be informed by research aimed at establishing the construct to be measured,
and identifying and understanding the variety of responses students give to that task. In the
domain of measuring achievement, a National Research Council committee concluded:
… coarsely grained, depending on the purpose of the assessment, but it should
always be based on empirical studies of learners in a domain. Ideally, the model
will also provide a developmental perspective, showing typical ways in which
learners progress toward competence (NRC, 2001, pp. 2-5).
Thus, in the achievement context, a research-based model of cognition and learning should be
the foundation for the definition of the construct, and hence also for the design of the outcome
space and the development of items. In other areas, similar advice pertains—in psychological
scales, health questionnaires, even in marketing surveys—there should be a research-based
construct to tie all of the development efforts together. There is a range of formality and depth
that one can expect of the research behind such “research-based” outcome spaces. For example,
the LPS Argumentation construct is based on a close reading of the relevant literature (Osborne
et al, 2016), as are the ADM constructs (Lehrer et al., 2014). The research basis for the PF-10 is
documented in Ware and Gandek (1998), although the construct is not explicitly established
there. For each of the rest of the Examples, there is a basis in the relevant research literature for
the construct map, although (of course) some literatures have more depth than others.
4.1.3 Context-Specific Categories.
In the measurement of a construct, the outcome space must always be specific to that
construct and the contexts in which it is to be used. It is sometimes possible to confuse the
context-specific nature of an outcome space with the generality of the scores that are derived
from it. For example, a multiple-choice item will have distractors that are only meaningful (and
scoreable) in the context of that item, but the usual scores of the item (“correct”/“incorrect” or
“1”/“0”) are interpretable more broadly as indicating “correctness.” This can lead to a certain
problem in developing achievement items, which I call the “correctness fallacy”—that is, the
view (perhaps an unconscious view) that the categorization of the responses to items is simply
according to whether the student supplied a “correct” answer to it. The problem with this
approach is that the “correctness” of a response may not fully capture the complexity of
what is asked for in the relevant construct map. For example, in the “Ice-to-Water-Vapor” Task
in Figure 3.3, note how a student could be asked which student, Anna or Evan, is correct. The
response to this could indeed be judged as correct or not, but nevertheless, the judgement would
have little information regarding Argumentation—what is needed to pry that information out of this context is to
proceed to the next part of the task, as exemplified in Figure 3.4, where the prompts are used to
disassemble the “correctness” into aspects relevant to the Argumentation construct map.
Even when categories are labelled in the same way from context to context, their use
inevitably requires a re-interpretation in each new context. The set of categories for the LPS
tasks, for example, was developed from an analysis of students' answers to the set of tasks used
in the pilot and subsequent years of the assessment development project. The general scoring
guide used for the LPS Argumentation construct needs to be supplemented by an item scoring
guide, including a specific set of exemplars for each specific task (as shown in Table 1.1 for the
ADM MoV Piano Width item).
4.1.4 Finite and Exhaustive Categories.
The responses that the measurer obtains to an open-ended item will generally be a sample
from a very large population of possible responses. Consider a single essay prompt—something
like the classic “What did you do over the summer vacation?” Suppose that there is a restriction
to the length of the essay of, say, five pages. Think of how many possible different essays could
be written in response to that prompt. It is indeed a very large number (although, because there
is only a finite number of words in English, there is in fact a finite upper limit that could be
estimated). Multiply this by the number of different possible prompts (again, very large, but finite),
and then again by all the different possible sorts of administrative conditions (it can be hard to
say what the numerical limit is here, perhaps infinite), and you end up with an even bigger
number. The role of the outcome space is to bring order and sense to this extremely large and
potentially unruly bunch of possible responses. One prime characteristic is that the outcome
space should consist of only a finite number of categories. For example, the LPS scoring guide
categorizes all Argumentation item responses into 13 categories, as shown in Figure 2.6. The
PF-10 outcome space is just three categories: “Yes, limited a lot”, “Yes, limited a little” and “No,
not limited at all”.
The outcome space, to be fully useful, must also be exhaustive: There must be a category
for every possible response. Note that some potential responses may not be covered by the
construct map. First, under broad circumstances, there may be responses that indicate:
(a) that there was no opportunity for a particular respondent to respond (for example, this can
occur due to the measurer’s data collection design, where the items are, say, distributed
across a number of forms, and a given item is included on only some of the forms);
(b) that the respondent was prevented from completing all of the items by matters not related to
the construct or the purpose of the measurement (such as an internet interruption).
For such circumstances, the categorization of the responses to a construct map level would be
misleading, and so a categorization into “missing” or an equivalent would be best. The
implications for this “missing” category need to be borne in mind for the analysis of the resulting
data, which is addressed in Section 5.X below.
Second, there will often be responses found that do not conform with the expected range.
In the constructed response items one can get responses like “tests suk” or “I vote for Mickey
Mouse” etc. Although such responses should not be ignored, as they sometimes contain
information that can be interpreted in a larger context and may even be quite important in that
larger context, they will usually not inform the measurer about the respondent’s location on a
specific construct. In fixed-response item formats like the PF-10 scale, the finiteness and
exhaustiveness of the response categorization are seemingly forced by the format, but one can still
find instances where the respondent has endorsed, say, two of the options for a single item. In
situations like these the choice of a “missing” category may seem automatic, but there are
circumstances where that may be misleading; in educational achievement testing, for example, it
may be more consistent with an underlying construct (i.e., because such responses do not reflect
“achievement”) to categorize them at the lowest waypoint, as was indicated for the
Argumentation construct map in Figure 2.6. Whatever policy is developed, it has to be sensitive
to both the underlying construct, and the circumstances of the measurement.
4.1.5 Ordered Categories.
Most often, the set of categories that come directly out of an outcome space is not yet
sufficient as a basis for measurement. One more step is needed—the provision of a scoring
guide. The scoring guide organizes the ordered categories as waypoints along the construct map:
The categories must be related back to the responses side of the generating construct map. This
can be seen simply as providing numerical values for the ordered levels of the outcome space
(i.e., scoring of the item response categories), but the deeper meaning of this pertains to the
relationship back to the construct map from Chapter 1. In many cases, this process is seen as
integral to the definition of the categories, and that is indeed a good thing, as it means that the
categorization and scoring work in concert with one another. Nevertheless, it is important to be
able to distinguish the two processes, at least in theory, because (a) the measurer must be able to
justify each step in the process of developing the instrument, and (b) sometimes the possibility of
having different scoring schemes is useful in understanding and exploring the construct.
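One way to keep the two processes distinguishable in practice is to store the ordered categories separately from the scoring scheme(s) attached to them. The sketch below is illustrative only; the category labels are invented, and the "collapsed" scheme is simply an example of the kind of alternative scoring one might explore.

```python
# The outcome space fixes the ordered categories (invented labels here);
# scoring schemes are attached separately so they can be varied and compared.

ordered_categories = [
    "irrelevant/off-task",
    "claim only",
    "claim with partial evidence",
    "claim with evidence and reasoning",
]

scoring_schemes = {
    # Default: successive integers, one per category.
    "integer": {cat: i for i, cat in enumerate(ordered_categories)},
    # An exploratory alternative: collapse the two middle categories.
    "collapsed": {
        "irrelevant/off-task": 0,
        "claim only": 1,
        "claim with partial evidence": 1,
        "claim with evidence and reasoning": 2,
    },
}

def score(category: str, scheme: str = "integer") -> int:
    return scoring_schemes[scheme][category]

print(score("claim with partial evidence"))               # -> 2
print(score("claim with partial evidence", "collapsed"))  # -> 1
```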
For fixed-response formats such as Likert-style items, the response options are typically scored
with successive integers (e.g., 0, 1, 2, and 3 from the lowest option to the highest). For items
that have a negative valence with respect to the construct, the scoring will generally be reversed, to be 3, 2,
1, and 0.
With open-ended items, the outcome categories must be ordered into qualitatively
distinct, ordinal categories, such as was done in the LPS example. Just as for Likert-style items,
it makes sense to think of each of these ordinal levels as being scored by successive integers, just
as they are in Figure 3.3, where the successive ordered categories are scored with consecutive integers.
This can be augmented where there are finer gradations available—one way to represent this is
by using “+” and “-” for (a) responses palpably above a waypoint, but not yet up to the next
waypoint, and (b) responses palpably below a waypoint, but not yet down to the next waypoint,
respectively. Note that these may be waypoints in the making. Another way is to increase the
number of scores to incorporate the extra categories. The category of “no opportunity” is scored
as “missing” above. Under some circumstances, say, where the student was not administered an
achievement item because it was deemed too difficult on an a priori basis, then it would make sense
to score this missing consistently with that logic as a “0.” However, if the student was not
administered the item for some reason unrelated to that student’s measure
on the construct, say, that they were ill that day, then it would make sense to maintain the
“missing” and interpret it as indicating “missing data”.
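The scoring conventions discussed in this section (successive integers, reversal for negatively valenced fixed-response items, and context-dependent treatment of missing responses) can be gathered into a small sketch. The option labels and the particular policy choices below are assumptions for illustration, not a prescription.

```python
# Illustrative scoring rules; the policy choices here are assumptions,
# not a prescription for any particular instrument.

LIKERT = ["Strongly Disagree", "Disagree", "Agree", "Strongly Agree"]

def score_likert(response: str, negative_valence: bool = False):
    """Score a Likert response 0..3, reversing for negatively valenced stems."""
    if response not in LIKERT:          # e.g., two options endorsed, or blank
        return None                     # treat as missing data
    s = LIKERT.index(response)          # 0, 1, 2, 3
    return (len(LIKERT) - 1 - s) if negative_valence else s

def score_missing_achievement(reason: str):
    """Context-dependent handling of a missing achievement response."""
    if reason == "not_administered_too_difficult":
        return 0        # consistent with the a priori difficulty judgement
    return None         # e.g., absent/ill, interrupted: keep as missing data

print(score_likert("Disagree"))                         # -> 1
print(score_likert("Disagree", negative_valence=True))  # -> 2
print(score_missing_achievement("absent_ill"))          # -> None
```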
The construction of an outcome space will depend heavily on the specific context, both
theoretical and practical, in which the measurer is developing the instrument. It should begin
with the definition of the construct, and then proceed to the definition of the descriptive
components of the items design; it will also require the initial development of some example
items. Described below are two general schemas that have been developed for this purpose,
focusing on the cognitive domain: (a) phenomenography (Marton, 1981), which was mentioned
above, and (b) the SOLO Taxonomy (Biggs & Collis, 1982). At the end of this section, a third
method, applicable to non-cognitive contexts, and derived from the work of Guttman, is
described.
4.3.1 Phenomenography
Phenomenographic analysis usually involves the presentation of an open-ended task,
question, or problem designed to elicit information about an individual’s understanding of a
particular phenomenon. Most commonly, tasks are attempted in relatively unstructured
interviews during which students are encouraged to explain their approach to the task or
conception of the problem. Researchers have applied phenomenographic analysis to such topics
as physics education (Ornek, 2012), teacher conceptions of success (Carbone et al., 2007),
blended learning (Bliuc et al., 2012), teaching (Gao et al., 2002), nursing research (Sjöström &
Dahlgren, 2002), proportionality (Lybeck, 1981), supply and demand (Dahlgren, 1979), and
speed, distance and time (Dall’alba et al., 1990; Ramsden et al., 1990).
A significant finding of these studies is that students’ responses typically reflect a limited
number of qualitatively different ways of thinking about a phenomenon, concept or principle
(Marton, 1988). An analysis of responses to the question in Figure 4.2, for example, revealed
just a few different ways of thinking about the relationship between light and seeing. The main
result of phenomenographic analysis is a set of categories describing the qualitatively different
kinds of responses students give, forming the outcome space, which Dahlgren (1984) describes
as a “kind of analytic map”:
“It is an empirical concept which is not the product of logical or deductive
analysis, but instead results from intensive examination of empirical data.
Equally important, the outcome space is content-specific: the set of descriptive
categories arrived at has not been determined a priori, but depends on the specific
content of the [item]”. (p. 26)
The data analyzed in studies of this kind are often, but not always, transcripts of
interviews. In the analysis of students' responses, an attempt is made to identify the key features
of each student’s response to the assigned task. The procedure can be quite complex, involving
up to seven steps (Sjöström & Dahlgren, 2002). A search is made for statements that are
particularly revealing of a student’s way of thinking about the phenomenon under discussion.
These revealing statements, with details of the contexts in which they were made, are excerpted
from the transcripts and assembled into a pool of quotes for the next step in the analysis.
Figure 4.2 An item investigating students' understanding of the relationship between light and seeing.
On a clear, dark night, a car is parked on a straight, flat road. The car's
headlights are on and dipped. A pedestrian standing on the road sees the
car's lights. The situation is illustrated in the figure below which is divided
into four sections. In which of the sections is there light? Give reasons for
your answer.
The focus of the analysis then shifts to the pool of quotes. Students’ statements are read
and assembled into groups. Borderline statements are examined in an attempt to clarify
differences between the emerging groups. Of particular importance in this process is the study of
contrasts. “Bringing the quotes together develops the meaning of the category, and at the same
time the evolving meaning of the category determines which quotes should be included and
which should not. This means, of course, a tedious, time-consuming iterative procedure with
repeated changes in the quotes brought together and in the exact meaning of each group of
quotes” (Marton, 1988, p. 198).
Consider now the outcome space in Figure 4.3 based on an investigation of students'
understandings of the relationship between light and seeing (see the item shown in Figure 4.2).
The concept of light as a physical entity that spreads in space and has an existence independent
of its source and effects is an important notion in physics and is essential to understanding the
relationship between light and seeing. Andersson and Kärrqvist (1981) found that very few 9th
grade students in Swedish comprehensive schools understood these basic properties of light.
They observe that authors of science textbooks take for granted an understanding of light and
move rapidly to topics such as lenses and systems of lenses that rely on students’ understanding
of these foundational ideas about light. And teachers similarly assume an understanding of the
fundamental properties of light: “Teachers probably do not systematically teach this fundamental
understanding, which is so much a part of a teacher's way of thinking that they neither think
about how fundamental it is, nor recognize that it can be problematic for students” (Andersson
and Kärrqvist, 1981, p. 82).
To investigate students' understandings of light and sight more closely, 558 students from
the last four grades of the Swedish comprehensive school were given the question in Figure 4.2
and follow-up interviews were conducted with 21 of these students (Marton, 1983). On the basis
of students' written and verbal explanations, five different ways of thinking about light and sight
were identified. These are summarized in the five categories in Figure 4.3.
Figure 4.3 Five ways of thinking about the relationship between light and seeing.
(e) The object reflects light and when the light reaches the eyes we see
the object.
(d) There are beams going back and forth between the eyes and the
object. The eyes send out beams which hit the object, return and
tell the eyes about it.
(c) There are beams coming out from the eyes. When they hit the object
we see (cf. Euclid's concept of “beam of sight”).
(b) There is a picture going from the object to the eyes. When it reaches
the eyes, we see (cf. the concept of “eidola” of the atomists in
ancient Greece).
(a) The link between eyes and object is “taken for granted”. It is not
problematic: 'you can simply see'. The necessity of light may be
pointed out and an explanation of what happens within the system
of sight may be given.
Reading from the bottom of Figure 4.3 up, it can be seen that some students give
responses to this task that demonstrate no understanding of the passage of light between the
object and the eye: according to these students, we simply “see” (a). Other students describe the
passage of “pictures” from objects to the eye (b); the passage of “beams” from the eye to the
object with the eyes directing and focusing these beams in much the same way as a flashlight
directs a beam (c); the passage of beams to the object and their reflection back to the eye (d); and
the reflection of light from objects to the eye (e).
4.3.2 The SOLO Taxonomy
The SOLO (Structure Of the Learning Outcome) taxonomy is a general theoretical
assessment development framework that may be used to construct an outcome space for a task
related to cognition. The taxonomy, which is shown in Figure 4.4, was originally developed by
John Biggs and Kevin Collis (1982) to provide a frame of reference for judging and classifying
students' responses from elementary to higher education (Biggs, 2011).
The SOLO taxonomy is based on Biggs and Collis’ initial observation that attempts to
allocate students to Piagetian stages and to then use these allocations to predict students'
responses to tasks invariably result in unexpected observations (i.e., 'inconsistent' performances
of individuals from task to task). The solution for Biggs and Collis is to shift the focus from a
hierarchy of very broad developmental stages to a hierarchy of observable outcome categories
within a narrow range regarding a specific topic—in our terms, a construct: “The difficulty, from
a practical point of view, can be resolved simply by shifting the label from the student to his
response to a particular task” (1982, p. 22). Thus, the SOLO levels “describe a particular
performance at a particular time and are not meant as labels to tag students” (1982, p. 23).
Figure 4.4 The SOLO Taxonomy.
An extended abstract response is one that not only includes all relevant pieces of
information but extends the response to integrate relevant pieces of
information not in the stimulus.
A relational response integrates all relevant pieces of information from the
stimulus.
A multistructural response is one that responds to several relevant pieces of
information from the stimulus.
A unistructural response is one that responds to only one relevant piece of
information from the stimulus.
A pre-structural response is one that consists only of irrelevant information.
The SOLO Taxonomy has been applied in the context of many instructional and
measurement areas in education, including topics such as science curricula (Brabrand & Dahl,
2009), inquiry-based learning (Damopolii et al., 2020), high school chemistry (Claesgens et al.,
2009), mathematical functions (Wilmot et al., 2011), middle school number sense and algebra
(Junpeng et al., 2020), and middle school science (Wilson & Sloane, 2000).
The example detailed in Figures 4.5 and 4.6 illustrates the construction of an outcome
space by defining categories to match the levels of the SOLO framework. In this example, five
categories corresponding to the five levels of the SOLO taxonomy—pre-structural, unistructural,
multistructural, relational, and extended abstract--have been developed for a task requiring
students to interpret historical data about Stonehenge (Biggs & Collis, 1982, pp. 47-49). The History
task in Figure 4.5 was constructed to assess students' abilities to develop plausible interpretations
from incomplete data. Students aged between seven-and-a-half and 15 years were given
the passage in Figure 4.5 and asked to give in writing their thoughts about whether Stonehenge
might have been a fort rather than a temple. The detailed SOLO scoring guide for this item is
shown in Figure 4.6.
Figure 4.5 A SOLO task in the area of History (from Biggs & Collis, 1982).
This example raises the interesting question of how useful theoretical frameworks of this
kind might be in general. Certainly, Biggs and Collis have demonstrated the possibility of
applying the SOLO taxonomy to a wide variety of tasks and learning areas and other researchers
have observed SOLO-like structures in empirical data. Dahlgren (1984, 29-30), however,
believes that “the great strength of the SOLO taxonomy—its generality of application—is also
its weakness. Differences in outcome which are bound up with the specific content of a
particular task may remain unaccounted for. In some of our analyses, qualitative differences in
outcome similar to those represented in the SOLO taxonomy can be observed, and yet
differences dependent on the specific content are repeatedly found.”
Nevertheless, the SOLO taxonomy has been used in many assessment contexts as a way
to get started. An example of such an adaptation was made for the Using Evidence construct map
for the Issues, Evidence and You (IEY) curriculum (Example 9; Wilson & Sloane, 2000), shown in
Figure 4.7, which began with a SOLO hierarchy as its outcome space, but eventually morphed to
the structure shown. For example, in Figure 4.7:
waypoint I is clearly a pre-structural response, but
waypoint II is a special unistructural response consisting only of subjective reasons
and/or inaccurate or irrelevant evidence;
waypoint III is similar to a multistructural response, but is characterized by
incompleteness;
waypoint IV is a traditional relational response, and is the standard schoolbook “correct
answer,” while
waypoint V adds some of the usual “extras” of extended abstract.
Similar adaptations were made for all of the IEY constructs, which were adapted from the SOLO
structure based on the evidence from student responses to the items. This may be the greatest
strength of the SOLO Taxonomy—its usefulness as a starting place for the analysis of responses.
In subsequent work using the SOLO Taxonomy, several other useful levels have been
developed. A problem in applying the Taxonomy was found—the multistructural level tends to
be quite a bit larger than the other levels—effectively, there are lots of ways to be partially
correct. In order to improve the diagnostic uses of the levels, several intermediate levels within
the multistructural one have been developed by the Berkeley Evaluation and Assessment
Research (BEAR) Center, and hence, the new generic outcome space is called the SOLO-B
Taxonomy. Figure 4.8 gives the revised Taxonomy.
Figure 4.6 SOLO outcome space for the history task (from Biggs & Collis, 1982).
4 Extended Abstract
e.g., 'Stonehenge is one of the many monuments from the past about which there are
a number of theories. It may have been a fort but the evidence suggests it was more
likely to have been a temple. Archaeologists think that there were three different
periods in its construction so it seems unlikely to have been a fort. The circular
design and the blue stones from Wales make it seem reasonable that Stonehenge was
built as a place of worship. It has been suggested that it was for the worship of the
sun god because at a certain time of the year the sun shines along a path to the altar
stone. There is a theory that its construction has astrological significance or that the
outside ring of pits was used to record time. There are many explanations about
Stonehenge but nobody really knows.'
This response reveals the student's ability to hold the result unclosed while he
considers evidence from both points of view. The student has introduced information
from outside the data and the structure of his response reveals his ability to reason
deductively.
3 Relational
e.g., 'I think it would be a temple because it has a round formation with an altar at
the top end. I think it was used for worship of the sun god. There was no roof on it so
that the sun shines right into the temple. There is a lot of hard work and labor in it
for a god and the fact that they brought the blue stone from Wales. Anyway, it's
unlikely they'd build a fort in the middle of a plain.'
This is a more thoughtful response than the ones below; it incorporates most of the
data, considers the alternatives, and interrelates the facts.
2 Multistructural
e.g., 'It might have been a fort because it looks like it would stand up to it. They used to
build castles out of stones in those days. It looks like you could defend it too.'
'It is more likely that Stonehenge was a temple because it looks like a kind of design
all in circles and they have gone to a lot of trouble.'
These students have chosen an answer to the question (i.e., they have required a
closed result) by considering a few features that stand out for them in the data, and
have treated those features as independent and unrelated. They have not weighed
the pros and cons of each alternative and come to a balanced conclusion on the
probabilities.
1 Unistructural
e.g., 'It looks more like a temple because they are all in circles.'
'It could have been a fort because some of those big stones have been pushed
over.'
These students have focused on one aspect of the data and have used it to support
their answer to the question.
0 Prestructural
e.g., 'A temple because people live in it.'
'It can't be a fort or a temple because those big stones have fallen over.'
The first response shows a lack of understanding of the material presented and of the
implication of the question. The student is vaguely aware of 'temple', 'people', and
'living', and he uses these disconnected data from the story, picture, and questions to
form his response. In the second response the pupil has focused on an irrelevant
aspect of the picture.
Figure 4.7 A sketch of the construct map for the Using Evidence construct of the IEY curriculum.
[The sketch arrays students and their responses to items along a single continuum, with increasing sophistication in using evidence toward the top and decreasing sophistication toward the bottom.]
Figure 4.8 The SOLO-B Taxonomy.
An extended abstract response is one that not only includes all relevant pieces of information, but
extends the response to integrate relevant pieces of information not in the stimulus.
A relational response integrates all relevant pieces of information from the stimulus.
A semi-relational response is one that integrates some (but not all) of the relevant pieces of
information into a self-consistent whole.
A multistructural response is one that responds to several relevant pieces of information from the
stimulus, and that relates them together, but that does not result in a self-consistent
whole.
A plural response is one that responds to more than one relevant piece of information, but that
does not succeed in relating them together.
A unitary response is one that responds to only one relevant piece of information from the
stimulus.
4.3.3 Guttman-Style Items
The two general approaches described above relate most effectively to the cognitive
domain—there are also general approaches in the attitudinal and behavioral domains. The most
common general approach to the creation of outcome spaces in areas such as attitude and
behavior surveys has been the Likert style of item. The most generic form of this is the
provision of a stimulus statement (sometimes called a “stem”), and a set of standard options that
the respondent must choose among. Possibly the most common set of options is “Strongly
Agree,” “Agree,” “Disagree,” and “Strongly Disagree,” sometimes with a middle “neutral” option.
The set of options may be adapted to match the context: For example, the PF-10 Health
Outcomes survey uses this approach (see Section 2.2.1). Although this is a very popular
approach, largely, I suspect, because it is relatively easy to come up with many items when all
that is needed is a new stem for each one, there is a certain dissatisfaction with the way that the
response options relate to the construct. The problem is that there is very little to guide a
respondent in judging what the difference is between, say, “Strongly Disagree” and “Agree.”
Indeed, individual respondents may well have radically different ideas about these distinctions
(ref.). This problem is greatly aggravated when the options offered are not even words, but
numerals or letters, such as “1”, “2”, “3”, “4”, and “5”—in this sort of array, the respondent does
not even get a hint as to what it is that she is supposed to be making distinctions between!
The Likert response format has been criticized frequently over the almost 100 years since
Likert (1932/33) wrote his foundational paper, as one might expect for anything that is so widely
used. Among those criticisms are: (a) some respondents have a tendency to respond
on only one response side or the other (i.e., the positive side or the negative side), (b) some have
a tendency to not choose extremes or choose mainly extremes, (c) that some respondents confuse
an “equally-balanced” response (e.g., between Agree and Disagree) with a “don’t know/does not
apply” response (DeVellis, 2017), and (d) that, under some circumstances it has been found to be
better to collapse the alternatives into just two categories (Kaiser & Wilson, 2000). A
particularly disturbing criticism is found in a paper by Andrew Maul (2013), in which he calls
into question the assumption that even the stems (i.e., the questions or statements) are needed for
Likert-style items.
For psychometricians, probably the most common criticism is that the use of integers for
recording the responses (or, even, as noted above, as the options themselves) gives the
impression that the options are “placed” at equal intervals (e.g., Carifio & Perla, 2007; Jamieson,
2005; Kuzon et al., 1996; Uebersax, 2006). This then gives the measurer a false confidence that
there is (at least) interval-level measurement status for the resulting data, and hence that one can
proceed with confidence to employ statistical procedures that assume this (for example, linear
regression, factor analysis, etc.). There is also a literature on the robustness of such statistical
analyses against this violation of assumptions, dating back even to Likert (1932/33) himself, but
with others making similar points over the years (cf., Glass et al., 1972; Labovitz, 1967; Traylor,
1983). This issue will arise again in Chapter 6.
An alternative is to build into each set of options meaningful statements that give the
respondent some context in which to make the desired distinctions. The aim here is to try and
make the relationship between each item and the overall scale interpretable. This approach was
formalized by Guttman (1944), who created his scalogram technique (also known as Guttman
scaling):
If a person endorses a more extreme statement, he should endorse all less extreme
statements if the statements are to be considered a [Guttman] scale…We shall call
a set of items of common content a scale if a person with a higher rank than
another person is just as high or higher on every item than the other person.
(Guttman, 1950, p.62)
To illustrate this idea, suppose there are two dichotomous attitude items that form a
Guttman scale, as described by Guttman. If Item B is more extreme than Item A, which, in our
terms, would mean that Item B was higher on the construct map than Item A (see Figure 4.9),
then the only possible Likert-style responses (in the format (Item A, Item B)) would be:
(a) (Disagree, Disagree)
(b) (Agree, Disagree)
(c) (Agree, Agree).
That is, a respondent could disagree with both, or could agree with the less extreme Item
A and disagree with the more extreme Item B, or could agree with both. But the response
(d) (Disagree, Agree)
would be disallowed. Consider now Figure 4.9, which is a sketch of the (very minimal) meaning
that one might have of a Guttman scale—note how it is represented as a continuum consistent
with the term “scale”, but that otherwise, it is very minimal (which matches the “minimal
meaning”). It can be interpreted thus: if a respondent is below Item A, then they will disagree
with both the items; if they are in between A and B, then they will agree with A but not B; and, if
they are beyond B, they will agree with both. But there is no point where they would agree with
B but not A. One can see that Guttman’s ideas are quite consistent with the ideas behind the
construct map, at least as far as the sketch goes.
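A small sketch of the Guttman logic for the two dichotomous items in Figure 4.9: with the items ordered from least to most extreme (A before B), the allowable response patterns are exactly the cumulative ones, and any other pattern counts as a Guttman error. The coding (1 = Agree, 0 = Disagree) and the sample patterns are only for illustration.

```python
# Items ordered from least to most extreme (A before B), responses coded
# 1 = Agree, 0 = Disagree. Invented response patterns for illustration.

def is_guttman_consistent(pattern):
    """A pattern is consistent if no item is endorsed after a non-endorsed one."""
    return all(not (later and not earlier)
               for earlier, later in zip(pattern, pattern[1:]))

patterns = [
    (0, 0),  # below A: disagree with both
    (1, 0),  # between A and B: agree with A only
    (1, 1),  # beyond B: agree with both
    (0, 1),  # the disallowed pattern: agree with B but not A
]

for p in patterns:
    print(p, "allowed" if is_guttman_consistent(p) else "Guttman error")
```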
Four items developed by Guttman using this approach are shown in Figure 4.10. These
items were used in a study of American soldiers returning from the Second World War
(Guttman, 1944)—the items were designed to be part of a survey to gauge the intentions of
returned soldiers to study (rather than work) on being discharged from their military service.
The logic of the questions is: If I turn down a good job and go back to school regardless of help,
then I will certainly make the same decision for a poor job or no job. Some of the questions have
more than two categories, which makes them somewhat more complicated to interpret as
Guttman items, but nevertheless, they can still be thought of in the same way. Note how for
these items,
(a) The questions are ordered according to the construct (i.e., their intentions
to study); and
(b) the options to be selected within each item are
(i) clear and distinct choices for the respondents
(ii) related in content to the construct and also to the question
(iii) also ordered in terms of the construct (i.e., their intentions to study)
(iv) not necessarily the same from question to question.
It is these features that we will concentrate on here. In the next paragraphs this idea, dubbed
“Guttman-style” items by Wilson (2005), will be explored in terms of our Example 2.
Figure 4.10 Guttman’s example items (Guttman, 1944, p.145).
As noted in Chapter 2, the RIS Project developed the Researcher Identity Scale (RIS)
construct map in Figure 2.3 (Example 2). Following the typical attitude survey development
steps, the developers made items following the Likert response format approach—see Figure
4.11 for some examples. Altogether, they developed 45 Likert response items, with six response
categories for each item as shown in Figure 4.11 (i.e., Strongly Disagree, Disagree, Slightly
Disagree, Slightly Agree, Agree, and Strongly Agree).
Figure 4.11 Some Likert-style items developed for the Researcher Identity Scale (from Wilson et
al., 2020)
The SFHI researchers then decided to transform the Likert-style items to Guttman-style
items. The trick of doing this is that the stems of the Likert-style items (e.g., the first column in
Figure 4.11) become the options in the Guttman-style items. This means that each Guttman-style
item may correspond to several Likert-style items.
21
To do this, the stems of the Likert-style items (i.e., the first column in Figure 4.11) were
grouped together based on the researchers' judgement of the similarity of their content and their
match to the RIS construct map waypoints, to create Guttman-style sets of ordered response
options (see Figure 4.12). Not in every case were there Likert stems that matched the complete
RIS set of waypoints, so the researchers had to create some new options to fill the gaps. The
resulting Guttman-style response options were placed in order based on
(a) the theoretical levels of the construct map that they were intended to map to, and
(b) empirical evidence of how students responded to the items in earlier rounds of testing.
To see an illustration of this, compare the Likert-style items shown in Figure 4.11 with options
(c), (d) and (e) for the Guttman response format item in Figure 4.12. This example shows,
indeed, the matched set of these three Likert response format items with one Guttman response
format item. As there were not any matching items for the two lower levels among the Likert
items, two more options were developed for the Guttman item—options (a) and (b) in Figure
4.12.
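A minimal sketch of the transformation just described. The three stems are the ones quoted in Appendix 4B; the waypoint numbers and the wording of the two added lower options are invented placeholders (the actual added options appear in Figure 4.12, which is not reproduced here).

```python
# Likert stems (quoted in Appendix 4B), each tagged with an assumed waypoint
# of the construct map (1-5, low to high); the numbers are placeholders.
likert_stems = [
    ("I am beginning to consider myself a researcher.", 3),
    ("I consider myself a researcher.", 4),
    ("I consider myself to be a professional researcher.", 5),
]

# Invented fill-in options for waypoints with no matching Likert stem
# (the actual wording used by the project is in Figure 4.12).
fill_in_options = [
    ("I do not think of myself as a researcher.", 1),
    ("I can imagine becoming a researcher some day.", 2),
]

# Assemble one Guttman-style item: the former stems (plus fill-ins) become
# response options ordered by waypoint and labelled (a), (b), (c), ...
options = sorted(likert_stems + fill_in_options, key=lambda pair: pair[1])
for label, (text, waypoint) in zip("abcde", options):
    print(f"({label}) [waypoint {waypoint}] {text}")
```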
Figure 4.12 An example item from the RIS (Guttman response style item) (from Wilson et al.,
2020)
The project developed 12 Guttman response format items, based on 21 of the Likert-style
items, and collected validity evidence for the use of the instrument (Morell et al., 2021). Eleven
of the 12 have at least one Likert-style option that the Guttman options were designed to match
to, with 21 matching levels in all, out of a total possible 60 across all 12 Guttman response
format items (so, approximately 2/3rds of the Guttman options were new). Details of the
matching of the 21 Likert-style items with the 12 Guttman-style items is given in Appendix 4B.
The SFHI researchers found that, comparing the Likert-style item set to the Guttman-style item
set:
(a) the Guttman-style set gave more interpretable results (see details of this in Section 6.X),
(b) the reliabilities were approximately the same, although there were fewer Guttman-style items
(45 vs. 12), resulting in an equivalency of approximately 3.75 Likert-style items to each
Guttman-style item, and
(c) respondents tended to be slower in responding to each Guttman-style item (though the
equivalence in (b) balances that out) (Wilson et al., 2020).
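One way to read the equivalency noted in (b), offered here as an illustration rather than as the authors' own calculation: if a 12-item Guttman-style scale reaches about the same reliability as a 45-item Likert-style scale, then, in Spearman-Brown terms, each Guttman-style item is doing the work of roughly 45/12, or about 3.75, Likert-style items.

```latex
% Spearman--Brown: reliability of a scale lengthened by a factor k,
% where \rho_1 is the reliability of a single (Likert-style) item.
\rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1},
\qquad
\rho_{45}^{\mathrm{Likert}} \approx \rho_{12}^{\mathrm{Guttman}}
\;\Rightarrow\;
k = \frac{45}{12} \approx 3.75 .
```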
Of course, one does not have to develop a set of Likert-style items first in order to get to
the Guttman-style items. That was the path taken in the SFHI project, and it was chosen to share here
because it is a useful account of how to change the numerous existing Likert-style instruments into
Guttman-style instruments. In fact, the account shows the relationship quite clearly: the Guttman-style
items can be thought of as sets of Likert-style items, where the vague Likert-style options have
been swapped out for more concrete and interpretable options (the former Likert-style stems).
4.4 When Humans Become a Part of the Items Design: The Rater
How can raters be a part of measurement?
In the majority of the different item formats described in Section 3.3, there will be
a requirement for the responses to be judged, or rated, into categories relating to the waypoints in
the construct map. Even for the selected response format, it was recommended that the
development process include a constructed response initial phase. It may be possible in some
cases to use machine learning to either assist or replace the human element in large-scale
measurement situations, but most situations still need an initial phase that will require human
rating to gather a training sample for the machine learning to work. Thus, an essential element of
many instrument development efforts will require the use of human raters of the responses to the
items, and this requirement needs to be considered at the design stage. The earlier sections of this
chapter have described important aspects of the categorization of responses, and these are an
essential element of the rating design; they must also be considered when designing the items
that will be generating the responses.
In designing items, it is important to be aware that open-ended items are not without their
drawbacks. They can be expensive and time-consuming to take, code, and score, and they inevitably
introduce a certain amount of subjectivity into the scoring process. This subjectivity is inherent
in the need for the raters to make judgments about the open-ended responses. Guidelines for the
judgments cannot encompass all possible contingencies, and therefore the rater must exercise
judgment, ideally with a high degree of consistency. But, to counter this flaw, it is the judgment that
offers the possibility of a broader and deeper interpretation of the responses and hence the
measurements.
Failures to judge the responses in appropriate categories may be due to several factors related to
the rater, such as: fatigue, failure to understand and correctly apply the guidelines, distractions due to
matters such as poor expression by the respondent, and distractions due to recent judgements about
preceding responses. A traditional classification of the types of problematic patterns that raters tend to
exhibit is described in the next three paragraphs (e.g., Saal, Downey, and Lahey, 1980).
Rater severity or leniency is a consistent tendency by the rater to judge a response into a category
that is lower or higher, respectively, than is appropriate. Detection of this pattern is relatively
straightforward when the construct has been designed as a construct map (as opposed to more traditional
item-development approaches) as the successive qualitative categories implicit in the waypoint
definitions give useful reference points for an observer (and even the self-observer).
A halo effect may reveal itself in three different ways—they all involve how the rating of one
response can affect the rating of another. The first may happen in the circumstance that a single response
is judged against several subscales. The problem is that the judgement of one of the
subscales may influence the judgement of another—a typical case is where the rater makes an overall
determination across the whole set of subscales rather than attending to each of the subscales separately.
The second type of halo effect arises when the rater forms an impression based on the person’s previous
responses rather than scoring each response on its own merit. The third type of halo effect occurs
between respondents—the response from an earlier respondent may influence the judging of a response
from a later respondent.
Restriction in range is a problematic pattern where the rater tends to judge the responses into
only a subset of the full range of scorable categories. There are several different forms of this: (a)
central tendency is where the rater tends to avoid extreme categories (i.e., the judgements tend to be
towards the middle of the range); (b) extreme tendency is the opposite, where the rater tends to avoid
middle categories (i.e., the judgements tend to be towards the extremes of the range); and, of course, (c)
severity or leniency could be seen as a tendency to restrict the range of the categories to low or high end
of the range, respectively.
The causes of these problematic patterns may differ as well. A rater may adopt a rating strategy
that looks like central tendency because they adopt a “least harm” tactic—staying in the middle of the
range reduces the possibility of grossly mis-scoring any respondent (which means that a discrepancy
index used to check up on raters will not be very sensitive). However, a restriction of range, for example
to the low end, may be due to a failure on the part of the rater to see distinctions between the different
levels.
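A hedged sketch of how the problematic patterns above might be flagged numerically, by comparing a rater's scores on a common set of responses to reference scores (consensus scores, or the kind of auxiliary information discussed below): the mean difference serves as a severity/leniency index and the ratio of spreads as a restriction-of-range index. The data, indices, and sign conventions are invented for illustration.

```python
from statistics import mean, pstdev

# Invented scores on the same 8 responses (scale 0-4).
reference = [0, 1, 2, 2, 3, 3, 4, 4]        # consensus / auxiliary scores
rater     = [1, 2, 2, 3, 3, 3, 3, 3]        # one rater's judgements

severity_index = mean(r - ref for r, ref in zip(rater, reference))
# > 0 suggests leniency, < 0 suggests severity (sign convention is a choice).

range_ratio = pstdev(rater) / pstdev(reference)
# Well below 1 suggests restriction of range (e.g., central tendency).

print(f"mean difference = {severity_index:+.2f}, spread ratio = {range_ratio:.2f}")
```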
There are several design strategies that can be taken to avoid problematic patterns of judgement.
One typical strategy is to provide extensive training so that the rater more fully understands the
intentions of the measurer. Carrying out the ratings in the context of a construct map provides an
excellent basis for such training—the waypoints make the definition of the construct itself much clearer,
and, in turn, the exemplars² make the interpretation of the waypoints much clearer also. This training
should include opportunities for raters to score sample responses with established ratings, and scrutiny
of the results of the ratings, as well as repetitions of the training at appropriate intervals. There are many
models of delivery of such training, but one useful approach is to have three foci: (a) general
background provided, e.g., in an online module; (b) small groups working through judgement
scenarios; and (c) individual coaching to address questions and areas of weakness.
Another strategy uses auxiliary information from a second source to check for
consistency with the rater’s judgement about a response. A rater would be considered severe if they
tended to give scores that were lower than would be expected from other sources of information. If the
rater tended to give higher scores, then that would be considered leniency. In Wilson and Case (2000),
for example, where the instrument consisted of achievement items that were both selected response and
constructed response, student responses to the set of selected response items were used to provide
auxiliary information about a student’s location on the construct in comparison to the rater’s judgements
for the constructed response items. In a second example (Shin et al., 2019), where all of the items were
constructed response, a machine-learning algorithm was used to provide auxiliary information about the
ratings of responses.
² That is, examples of typical responses at each waypoint, as will be defined in the next chapter.
For an instrument that requires raters, the measurement developer needs to be aware of the
considerations mentioned above, but there are many others besides that are dependent on the nature of the
construct being measured, and the contexts for those measurements. It is beyond the scope of this book
to discuss all of the many such complexities, and the interested reader should look for support from the
relevant literature. For example, a useful and principled description of these complexities in the context
of educational performance assessments (e.g., written essays and teacher portfolios) is given in
Engelhard and Wind (2017).
4.5 Resources.
The development of an outcome space is a complex and demanding exercise. The scoring
of outcome spaces is an interesting topic by itself—for studies of the effects of applying different
scores to an outcome space, see Wright and Masters (1981) and Wilson (1992). Probably the
largest single collection of accounts of outcome space examples is contained in the volume on
phenomenography by Marton, Hounsell, and Entwistle (1984), but also see the later collection by
Bowden and Green (2005). The seminal reference on the SOLO taxonomy is Biggs and Collis
(1982); extensive information on using the taxonomy in educational settings is given in Biggs
and Moore (1993) and Biggs (2011). The Guttman-style item is a new concept, although based
on an old idea—see Wilson et al. (2020) for the only complete account so far, although it is
presaged in the first edition of this volume (Wilson, 2005).
4.6 Exercises and Activities
2. After developing your outcome space, write it up as a scoring guide (e.g., Table 1.1) for your
items, and incorporate this information into your construct map.
3. Log into BASS and enter the information about Waypoints, Exemplars, etc.
4. Carry out an Item Pilot Investigation as described in Appendix 4A. The analyses for the data
resulting from this investigation will be described in Chapters 6, 7 and 8.
5. Make sure that the data from your Pilot Investigation is entered into BASS. This will be
automatic if you used BASS to collect the data. If you used another way to collect the data
(e.g., pencil and paper, another type of assessment deployment software), then use the “Upload”
options to load it into BASS.
6. Try to think through the steps outlined above in the context of developing your instrument
and write down notes about your plans and accomplishments.
7. Share your plans and progress with others—discuss what you and they are succeeding on, and
what problems have arisen.
Appendix 4A
Appendix 4B
The left column of Figure 4B1 shows the 21 chosen Likert-style items. Each of
these Likert-style items was developed with six ordered response choices: strongly
agree, agree, slightly agree, slightly disagree, disagree, and strongly disagree. To
transform the set into the Guttman-style format, we first grouped Likert-style items
together based on similar content. For example, the first three Likert-style items in Figure
4B1 focus on an individual's comfort with seeing himself/herself as a researcher.
The first item “I am beginning to consider myself a researcher,” targets a relatively lower
level of the construct map in comparison to the second item “I consider myself a
researcher,” which in turn targets a relatively lower level of the construct map in
comparison to the third item “I consider myself to be a professional researcher.” Each of
these items becomes an option in the first Guttman-style item (G1), adapted in some
cases to make the expressions consistent across the Guttman-style options. To match to
the construct map for this variable, we added two more options at levels below that for
Item 1. The development process was similar for four of the Guttman-style items. For
some Likert-style items (e.g., Item 7, and five others) there were no others with
similar content, so we needed to add four Guttman-style options for each of them. We
found that this process produced a somewhat imbalanced set of Guttman-style items, with
two Guttman-style items for the Agency strand and three for the rest. Hence, we
developed one extra Guttman-style item focused on the Agency strand that was not
matched among the 21 Likert-style items (Guttman-style item G9). Through this process,
we obtained the right column of the Figure, the 12 Guttman-style items. In subsequent
analyses, one of the Likert-style items (Likert-style item 12) was found not to fit the
statistical model, so we deleted it from the comparisons, but the corresponding Guttman-
style item did fit, so we left it in the comparisons. Each of the remaining 20 Likert-style
items maps to an option for one of the Guttman-style items. To get comparable estimates,
we ensured each student in our sample of 863 high school students took both formats of
the instrument. We randomized the order of the instruments, meaning that some students
looked at the Likert items first while others looked at the Guttman items first.
Figure 4B1. Mapping Likert-style Items into Guttman-style Items.
[Figure 4B1, excerpt] Likert-style item 11: “A career in research would be a good way for me to help people.”
Corresponding Guttman-style item options include:
d) I would be interested in doing research that helps my community.
e) I am definitely interested in doing research that helps my community.
[Figure 4B1, excerpt] Likert-style item 15: “I plan to get a research-related degree in college.”
Corresponding Guttman-style item options:
a) I do not plan to pursue research in the future.
b) I do not know if doing research is in my future.
c) I am not sure if a research-related degree is right for me.
d) I might get a research-related degree in college.
e) I plan to get a research-related degree in college.