Chapter 4
This chapter is the second part of the story about designing items. The chapter
concentrates on how to categorize the item responses and then score them to be indicators of the
construct. It introduces the idea of an “outcome space,” which describes how to categorize the
responses. Some important qualities of this categorization include that the categories should be:
Well-defined, finite and exhaustive, ordered, context-specific and research-based. In addition,
the categories need to be scored in order to use them in the Calibration model—the topic of the
next chapter. The chapter concludes with a description of three widely applicable strategies for
jointly developing the outcome space and scoring strategy: phenomenography, the structure of
the learning outcome (SOLO) technique and Guttman items.
Key concepts: outcome space, well-defined categories, finite and exhaustive categories, ordered
categories, context-specific categories, research-based categories, scoring scheme,
phenomenography, structure of the learning outcome (SOLO), Guttman items, raters.
The outcome space is the third building block in the BEAR Assessment System (BAS). It
has already been introduced, lightly, in Chapter 1, and its relationship to the other building
blocks was illustrated there too—see Figure 4.1. In this chapter, it is the main focus.
The term “outcome space” was introduced by Ference Marton (1981) for a set of
outcome categories developed from a detailed (“phenomenographic”) analysis of students’
responses to standardized open-ended items such as the LPS item discussed in the previous
chapter.¹ In much of his writing Marton describes the development of a set of outcome
categories as a process of “discovering” the qualitatively different ways in which students
respond to a task. In this book the term outcome space is adopted and applied in a broader sense
to any set of qualitatively described categories for recording and/or judging how respondents
have responded to items. Several examples of outcome spaces have already been shown in
earlier examples. The LPS Argumentation construct map in Figure 2.9 (Example 4) summarizes
how to categorize the responses to the LPS items attached to the Argumentation construct—this
is a fairly typical outcome space for an open-ended item. The outcome spaces for fixed-response
items look different—they are simply the fixed responses themselves—for example, the outcome
space for an evaluation item in the PF-10 Survey (Example 6) is:
“Yes, limited a lot,”
“Yes, limited a little,” or
“No, not limited at all.”
¹ Note that the ADM item in Figure 1.6 could also be used here, along with the construct map in Figure 1.9 and
Appendix 1A. This holds for all of the references to the LPS item in the remainder of this chapter.
Although these two types of outcome space look quite different, it is important to see that they
are connected in a deep way—in both cases, the response categories are designed to map back to
the waypoints of the construct map. Thus, if two sets of items, some of which were constructed
response and some selected response, related to the same construct map, then, despite how
different they looked, they would ALL have the common feature that their responses could be
mapped to the waypoints of that construct map. As noted above, this connection leads to a good
way to develop a fixed set of responses for selected response items: First construct the open-
ended outcome space, and second, use some of the sample responses in the categories as a way
to generate representative fixed choices for selection. Of course, many considerations must be
borne in mind while making those choices.
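To make this connection concrete, here is a minimal sketch in Python of how both kinds of outcome space can be stored as mappings onto the same set of construct map waypoints. The waypoint labels, response categories, and option letters below are invented for illustration; they are not taken from the LPS or PF-10 materials.

```python
# A minimal sketch: both item types map responses onto the same waypoints.
# All labels below are invented for illustration.

waypoints = ["W1_lowest", "W2", "W3", "W4_highest"]  # hypothetical construct map

# Outcome space for an open-ended item: response category -> waypoint.
open_ended_outcome_space = {
    "irrelevant or off-task": "W1_lowest",
    "claim only": "W2",
    "claim with partial evidence": "W3",
    "claim with evidence and reasoning": "W4_highest",
}

# Outcome space for a selected-response item: each fixed option was drafted
# from sample responses at one waypoint, so the key is also such a mapping.
selected_response_key = {
    "A": "W2",          # distractor drafted from typical W2 responses
    "B": "W4_highest",  # the fully correct option
    "C": "W1_lowest",
    "D": "W3",
}

def to_waypoint(item_type: str, response: str) -> str:
    """Return the construct map waypoint that a categorized response maps to."""
    table = open_ended_outcome_space if item_type == "open" else selected_response_key
    return table[response]

print(to_waypoint("open", "claim only"))   # -> W2
print(to_waypoint("selected", "B"))        # -> W4_highest
```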
Figure 4.1 The four building blocks in the BEAR Assessment System (BAS)
Inherent in the idea of categorization is an understanding that the categories that define
the outcome space are qualitatively distinct. All measures are based, at some point, on qualitative
distinctions. Even fixed-response formats such as multiple-choice test items and Likert-style
survey questions rely upon a qualitative understanding of what constitutes different levels of
response (more or less correct, or more or less agreeable, as the case may be). Rasch (1977, p.
68) pointed out that this principle goes far beyond measurement in the social sciences: “That
science should require observations to be measurable quantities is a mistake of course; even in
physics, observations may be qualitative--as in the last analysis they always are.”
The remainder of this section contains a description of the important qualities of a sound
and useful outcome space. These qualities include: well-defined, finite and exhaustive, ordered,
context-specific and research-based, as detailed below.
4.1.1 Well-Defined Categories.
The categories that make up the outcome space must be well-defined. For our purposes,
this will need to include not only (a) a general definition of what is being measured by that item
(i.e., in the approach described in this book, a description of the construct map), but also (b)
relevant background material and (c) examples of items, item responses and their categorization,
as well as (d) a training procedure for constructed response items. The LPS example displays all
except the last of these characteristics: Figure 2.9 summarizes the Argumentation construct map
including descriptions of different levels of response; Figures 3.3 and 3.4 show an example
item; and the paper cited in the description in Chapter 1 (Osborne et al, 2016) gives a
background discussion to the construct map, including references to the relevant literature.
Construct Mapping. What is not shown in the LPS materials is a training program to
achieve high inter-rater agreement in the types of responses that fall into different categories,
which will in turn support the usefulness of the results. To achieve high levels of agreement, it
is necessary to go beyond written materials; some sort of training is usually required. One such
method that is consistent with the BAS approach is called “construct mapping” (Draney &
Wilson, 2010/11). In the context of education this method has been found to be particularly
helpful for teachers, who can bring their professional experiences to help in the judgement
process, but who also have found the process to enhance their professional development. In this
technique, teachers choose examples of item responses from their own students or others, and
then circulate the responses beforehand to other members of the moderation group. All the
members of the group categorize the responses using the scoring guides and other material
available to them. They then come together to “moderate” those categorizations at a consensus
building meeting. The aim of the meeting is for the group to compare their categorizations,
discuss them until they come to a consensus about the scores, and to discuss the instructional
implications of knowing which categories the students' responses have been placed into. This process
may be repeated several times with different sets of responses to achieve higher levels of initial
agreement, and to track teachers’ improvement over time. In line with the iterative nature of
design, the outcome space may be modified from the original by this process.
One way to check that the outcome space contains sufficiently interpretable detail is to
have different teams of judges use the materials to categorize a set of responses. The agreement
between the two sets of judgments provides an index of how successful the definition of the
outcome space has been (although, of course, standards of success may vary). Marton (1986)
gives a useful distinction between developing an outcome space and using one. In comparing the
work of the measurer to that of a botanist classifying species of plants, he notes that “while there
is no reason to expect that two persons working independently will construct the same taxonomy,
the important question is whether a category can be found or recognized by others once it has
been described... It must be possible to reach a high degree of agreement concerning the presence
or absence of categories if other researchers are to be able to use them” (Marton, 1986, p. 35).
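As a concrete illustration of such an agreement check, the sketch below computes the exact agreement rate and Cohen's kappa for two hypothetical teams of judges categorizing the same ten responses. The data and category labels are invented, and the choice of agreement index (and of the standard of success) is left to the measurer.

```python
from collections import Counter

# Hypothetical categorizations of the same 10 responses by two teams of judges.
team_1 = ["W1", "W2", "W2", "W3", "W1", "W4", "W3", "W2", "W1", "W3"]
team_2 = ["W1", "W2", "W3", "W3", "W1", "W4", "W3", "W2", "W2", "W3"]

n = len(team_1)

# Exact (percent) agreement.
observed = sum(a == b for a, b in zip(team_1, team_2)) / n

# Cohen's kappa: corrects observed agreement for agreement expected by chance.
c1, c2 = Counter(team_1), Counter(team_2)
categories = set(team_1) | set(team_2)
expected = sum((c1[k] / n) * (c2[k] / n) for k in categories)
kappa = (observed - expected) / (1 - expected)

print(f"exact agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```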
4.1.2 Research-Based Categories.
The construction of an outcome space should be part of the process of developing an item
and, hence, should be informed by research aimed at establishing the construct to be measured,
and identifying and understanding the variety of responses students give to that task. In the
domain of measuring achievement, a National Research Council committee concluded:
… coarsely grained, depending on the purpose of the assessment, but it should
always be based on empirical studies of learners in a domain. Ideally, the model
will also provide a developmental perspective, showing typical ways in which
learners progress toward competence (NRC, 2001, pp. 2-5).
Thus, in the achievement context, a research-based model of cognition and learning should be
the foundation for the definition of the construct, and hence also for the design of the outcome
space and the development of items. In other areas, similar advice pertains—in psychological
scales, health questionnaires, even in marketing surveys—there should be a research-based
construct to tie all of the development efforts together. There is a range of formality and depth
that one can expect of the research behind such “research-based” outcome spaces. For example,
the LPS Argumentation construct is based on a close reading of the relevant literature (Osborne
et al, 2016), as are the ADM constructs (Lehrer et al., 2014). The research basis for the PF-10 is
documented in Ware and Gandek (1998), although the construct is not explicitly established
there. For each of the rest of the Examples, there is a basis in the relevant research literature for
the construct map, although (of course) some literatures have more depth than others.
4.1.3 Context-Specific Categories.
In the measurement of a construct, the outcome space must always be specific to that
construct and the contexts in which it is to be used. It is sometimes possible to confuse the
context-specific nature of an outcome space with the generality of the scores that are derived
from it. For example, a multiple-choice item will have distractors that are only meaningful (and
scoreable) in the context of that item, but the usual scores of the item (“correct”/“incorrect” or
“1”/“0”) are interpretable more broadly as indicating “correctness.” This can lead to a certain
problem in developing achievement items, which I call the “correctness fallacy”—that is, the
view (perhaps an unconscious view) that the categorization of the responses to items is simply
according to whether the student supplied a “correct” answer to it. The problem with this
approach is that the “correctness” of a response may not fully capture the complexity of
what is asked for in the relevant construct map. For example, in the “Ice-to-Water-Vapor” Task
in Figure 3.3, note how a student could be asked which student, Anna or Evan, is correct. The
response to this could indeed be judged as correct or not, but nevertheless, the judgement would
have little information regarding Argumentation—what is needed to pry that information out of this context is to
proceed to the next part of the task, as exemplified in Figure 3.4, where the prompts are used to
disassemble the “correctness” into aspects relevant to the Argumentation construct map.
Even when categories are labelled in the same way from context to context, their use
inevitably requires a re-interpretation in each new context. The set of categories for the LPS
tasks, for example, was developed from an analysis of students' answers to the set of tasks used
in the pilot and subsequent years of the assessment development project. The general scoring
guide used for the LPS Argumentation construct needs to be supplemented by an item scoring
guide, including a specific set of exemplars for each specific task (as shown in Table 1.1 for the
ADM MoV Piano Width item).
4.1.4 Finite and Exhaustive Categories.
The responses that the measurer obtains to an open-ended item will generally be a sample
from a very large population of possible responses. Consider a single essay prompt—something
like the classic “What did you do over the summer vacation?” Suppose that there is a restriction
to the length of the essay of, say, five pages. Think of how many possible different essays could
be written in response to that prompt. It is indeed a very large number (although, because there
is only a finite number of words in English, there is in fact a finite upper limit that could be
estimated). Multiply this by the number of different possible prompts (again, very large, but finite),
and then again by all the different possible sorts of administrative conditions (it can be hard to
say what the numerical limit is here, perhaps infinite), and you end up with an even bigger
number. The role of the outcome space is to bring order and sense to this extremely large and
potentially unruly bunch of possible responses. One prime characteristic is that the outcome
space should consist of only a finite number of categories. For example, the LPS scoring guide
categorizes all Argumentation item responses into 13 categories, as shown in Figure 2.6. The
PF-10 outcome space is just three categories: “Yes, limited a lot”, “Yes, limited a little” and “No,
not limited at all”.
The outcome space, to be fully useful, must also be exhaustive: There must be a category
for every possible response. Note that some potential responses may not be covered by the
construct map. First, under broad circumstances, there may be responses that indicate:
(a) that there was no opportunity for a particular respondent to respond (for example, this can
occur due to the measurer’s data collection design, where the items are, say, distributed
across a number of forms, and a given item is included on only some of the forms);
(b) that the respondent was prevented from completing all of the items by matters not related to
the construct or the purpose of the measurement (such as an internet interruption).
For such circumstances, the categorization of the responses to a construct map level would be
misleading, and so a categorization into “missing” or an equivalent would be best. The
implications for this “missing” category need to be borne in mind for the analysis of the resulting
data, which is addressed in Section 5.X below.
Second, there will often be responses found that do not conform with the expected range.
In the constructed response items one can get responses like “tests suk” or “I vote for Mickey
Mouse” etc. Although such responses should not be ignored, as they sometimes contain
information that can be interpreted in a larger context and may even be quite important in that
larger context, they will usually not inform the measurer about the respondent’s location on a
specific construct. In fixed-response item formats like the PF-10 scale, the finiteness and
exhaustiveness of the response categorization are seemingly forced by the format, but one can still
find instances where the respondent has endorsed, say, two of the options for a single item. In
situations like these the choice of a “missing” category may seem automatic, but there are
circumstances where that may be misleading; in educational achievement testing, for example, it
may be more consistent with an underlying construct (i.e., because such responses do not reflect
“achievement”) to categorize them at the lowest waypoint, as was indicated for the
Argumentation construct map in Figure 2.6. Whatever policy is developed, it has to be sensitive
to both the underlying construct, and the circumstances of the measurement.
4.1.5 Ordered Categories.
Most often, the set of categories that come directly out of an outcome space is not yet
sufficient as a basis for measurement. One more step is needed—the provision of a scoring
guide. The scoring guide organizes the ordered categories as waypoints along the construct map:
The categories must be related back to the responses side of the generating construct map. This
can be seen simply as providing numerical values for the ordered levels of the outcome space
(i.e., scoring of the item response categories), but the deeper meaning of this pertains to the
relationship back to the construct map from Chapter 1. In many cases, this process is seen as
integral to the definition of the categories, and that is indeed a good thing, as it means that the
categorization and scoring work in concert with one another. Nevertheless, it is important to be
able to distinguish the two processes, at least in theory, because (a) the measurer must be able to
justify each step in the process of developing the instrument, and (b) sometimes the possibility of
having different scoring schemes is useful in understanding and exploring the construct.
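One way to keep the two processes distinguishable in practice is to store the ordered categories separately from the scoring scheme(s) attached to them. The sketch below is illustrative only; the category labels are invented, and the "collapsed" scheme is simply an example of the kind of alternative scoring one might explore.

```python
# The outcome space fixes the ordered categories (invented labels here);
# scoring schemes are attached separately so they can be varied and compared.

ordered_categories = [
    "irrelevant/off-task",
    "claim only",
    "claim with partial evidence",
    "claim with evidence and reasoning",
]

scoring_schemes = {
    # Default: successive integers, one per category.
    "integer": {cat: i for i, cat in enumerate(ordered_categories)},
    # An exploratory alternative: collapse the two middle categories.
    "collapsed": {
        "irrelevant/off-task": 0,
        "claim only": 1,
        "claim with partial evidence": 1,
        "claim with evidence and reasoning": 2,
    },
}

def score(category: str, scheme: str = "integer") -> int:
    return scoring_schemes[scheme][category]

print(score("claim with partial evidence"))               # -> 2
print(score("claim with partial evidence", "collapsed"))  # -> 1
```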
For fixed-response formats such as Likert-style items, the response options are typically scored
with successive integers (e.g., 0, 1, 2, and 3 from the lowest option to the highest). For items
that have a negative valence with respect to the construct, the scoring will generally be reversed, to be 3, 2,
1, and 0.
With open-ended items, the outcome categories must be ordered into qualitatively
distinct, ordinal categories, such as was done in the LPS example. Just as for Likert-style items,
it makes sense to think of each of these ordinal levels as being scored by successive integers, just
as they are in Figure 3.3, where the successive ordered categories are scored with consecutive integers.
This can be augmented where there are finer gradations available—one way to represent this is
by using “+” and “-” for (a) responses palpably above a waypoint, but not yet up to the next
waypoint, and (b) responses palpably below a waypoint, but not yet down to the next waypoint,
respectively. Note that these may be waypoints in the making. Another way is to increase the
number of scores to incorporate the extra categories. The category of “no opportunity” is scored
as “missing” above. Under some circumstances, say, where the student was not administered an
achievement item because it was deemed too difficult on an a priori basis, then it would make sense
to score this missing consistently with that logic as a “0.” However, if the student was not
administered the item for some reason unrelated to that student’s measure
on the construct, say, that they were ill that day, then it would make sense to maintain the
“missing” and interpret it as indicating “missing data”.
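The scoring conventions discussed in this section (successive integers, reversal for negatively valenced fixed-response items, and context-dependent treatment of missing responses) can be gathered into a small sketch. The option labels and the particular policy choices below are assumptions for illustration, not a prescription.

```python
# Illustrative scoring rules; the policy choices here are assumptions,
# not a prescription for any particular instrument.

LIKERT = ["Strongly Disagree", "Disagree", "Agree", "Strongly Agree"]

def score_likert(response: str, negative_valence: bool = False):
    """Score a Likert response 0..3, reversing for negatively valenced stems."""
    if response not in LIKERT:          # e.g., two options endorsed, or blank
        return None                     # treat as missing data
    s = LIKERT.index(response)          # 0, 1, 2, 3
    return (len(LIKERT) - 1 - s) if negative_valence else s

def score_missing_achievement(reason: str):
    """Context-dependent handling of a missing achievement response."""
    if reason == "not_administered_too_difficult":
        return 0        # consistent with the a priori difficulty judgement
    return None         # e.g., absent/ill, interrupted: keep as missing data

print(score_likert("Disagree"))                         # -> 1
print(score_likert("Disagree", negative_valence=True))  # -> 2
print(score_missing_achievement("absent_ill"))          # -> None
```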
The construction of an outcome space will depend heavily on the specific context, both
theoretical and practical, in which the measurer is developing the instrument. It should begin
with the definition of the construct, and then proceed to the definition of the descriptive
components of the items design; it will also require the initial development of some example
items. Described below are two general schemas that have been developed for this purpose,
focusing on the cognitive domain: (a) phenomenography (Marton, 1981), which was mentioned
above, and (b) the SOLO Taxonomy (Biggs & Collis, 1982). At the end of this section, a third
method, applicable to non-cognitive contexts, and derived from the work of Guttman, is
described.
4.3.1 Phenomenography
Phenomenographic analysis usually involves the presentation of an open-ended task,
question, or problem designed to elicit information about an individual’s understanding of a
particular phenomenon. Most commonly, tasks are attempted in relatively unstructured
interviews during which students are encouraged to explain their approach to the task or
conception of the problem. Researchers have applied phenomenographic analysis to such topics
as physics education (Ornek, 2012), teacher conceptions of success (Carbone et al., 2007),
blended learning (Bliuc et al., 2012), teaching (Gao et al., 2002), nursing research (Sjöström &
Dahlgren, 2002), proportionality (Lybeck, 1981), supply and demand (Dahlgren, 1979), and
speed, distance and time (Dall’alba et al., 1990; Ramsden et al., 1990).
A significant finding of these studies is that students’ responses typically reflect a limited
number of qualitatively different ways of thinking about a phenomenon, concept or principle
(Marton, 1988). An analysis of responses to the question in Figure 4.2, for example, revealed
just a few different ways of thinking about the relationship between light and seeing. The main
result of phenomenographic analysis is a set of categories describing the qualitatively different
kinds of responses students give, forming the outcome space, which Dahlgren (1984) describes
as a “kind of analytic map”:
“It is an empirical concept which is not the product of logical or deductive
analysis, but instead results from intensive examination of empirical data.
Equally important, the outcome space is content-specific: the set of descriptive
categories arrived at has not been determined a priori, but depends on the specific
content of the [item]”. (p. 26)
The data analyzed in studies of this kind are often, but not always, transcripts of
interviews. In the analysis of students' responses, an attempt is made to identify the key features
of each student’s response to the assigned task. The procedure can be quite complex, involving
up to seven steps (Sjöström & Dahlgren, 2002). A search is made for statements that are
particularly revealing of a student’s way of thinking about the phenomenon under discussion.
These revealing statements, with details of the contexts in which they were made, are excerpted
from the transcripts and assembled into a pool of quotes for the next step in the analysis.
Figure 4.2 An item investigating students' understanding of the relationship between light and seeing.
On a clear, dark night, a car is parked on a straight, flat road. The car's
headlights are on and dipped. A pedestrian standing on the road sees the
car's lights. The situation is illustrated in the figure below which is divided
into four sections. In which of the sections is there light? Give reasons for
your answer.
The focus of the analysis then shifts to the pool of quotes. Students’ statements are read
and assembled into groups. Borderline statements are examined in an attempt to clarify
differences between the emerging groups. Of particular importance in this process is the study of
contrasts. “Bringing the quotes together develops the meaning of the category, and at the same
time the evolving meaning of the category determines which quotes should be included and
which should not. This means, of course, a tedious, time-consuming iterative procedure with
repeated changes in the quotes brought together and in the exact meaning of each group of
quotes” (Marton, 1988, p. 198).
Consider now the outcome space in Figure 4.3 based on an investigation of students'
understandings of the relationship between light and seeing (see the item shown in Figure 4.2).
The concept of light as a physical entity that spreads in space and has an existence independent
of its source and effects is an important notion in physics and is essential to understanding the
relationship between light and seeing. Andersson and Kärrqvist (1981) found that very few 9th
grade students in Swedish comprehensive schools understood these basic properties of light.
They observe that authors of science textbooks take for granted an understanding of light and
move rapidly to topics such as lenses and systems of lenses that rely on students’ understanding
of these foundational ideas about light. And teachers similarly assume an understanding of the
fundamental properties of light: “Teachers probably do not systematically teach this fundamental
understanding, which is so much a part of a teacher's way of thinking that they neither think
about how fundamental it is, nor recognize that it can be problematic for students” (Andersson
and Kärrqvist, 1981, p. 82).
To investigate students' understandings of light and sight more closely, 558 students from
the last four grades of the Swedish comprehensive school were given the question in Figure 4.2
and follow-up interviews were conducted with 21 of these students (Marton, 1983). On the basis
of students' written and verbal explanations, five different ways of thinking about light and sight
were identified. These are summarized in the five categories in Figure 4.3.
Figure 4.3 Five ways of thinking about the relationship between light and seeing.
(e) The object reflects light and when the light reaches the eyes we see
the object.
(d) There are beams going back and forth between the eyes and the
object. The eyes send out beams which hit the object, return and
tell the eyes about it.
(c) There are beams coming out from the eyes. When they hit the object
we see (cf. Euclid's concept of “beam of sight”).
(b) There is a picture going from the object to the eyes. When it reaches
the eyes, we see (cf. the concept of “eidola” of the atomists in
ancient Greece).
(a) The link between eyes and object is “taken for granted”. It is not
problematic: 'you can simply see'. The necessity of light may be
pointed out and an explanation of what happens within the system
of sight may be given.
Reading from the bottom of Figure 4.3 up, it can be seen that some students give
responses to this task that demonstrate no understanding of the passage of light between the
object and the eye: according to these students, we simply “see” (a). Other students describe the
passage of “pictures” from objects to the eye (b); the passage of “beams” from the eye to the
object with the eyes directing and focusing these beams in much the same way as a flashlight
directs a beam (c); the passage of beams to the object and their reflection back to the eye (d); and
the reflection of light from objects to the eye (e).
4.3.2 The SOLO Taxonomy
The SOLO (Structure Of the Learning Outcome) taxonomy is a general theoretical
assessment development framework that may be used to construct an outcome space for a task
related to cognition. The taxonomy, which is shown in Figure 4.4, was originally developed by
John Biggs and Kevin Collis (1982) to provide a frame of reference for judging and classifying
students' responses from elementary to higher education (Biggs, 2011).
The SOLO taxonomy is based on Biggs and Collis’ initial observation that attempts to
allocate students to Piagetian stages and to then use these allocations to predict students'
responses to tasks invariably result in unexpected observations (i.e., 'inconsistent' performances
of individuals from task to task). The solution for Biggs and Collis is to shift the focus from a
hierarchy of very broad developmental stages to a hierarchy of observable outcome categories
within a narrow range regarding a specific topic—in our terms, a construct: “The difficulty, from
a practical point of view, can be resolved simply by shifting the label from the student to his
response to a particular task” (1982, p. 22). Thus, the SOLO levels “describe a particular
performance at a particular time and are not meant as labels to tag students” (1982, p. 23).
Figure 4.4 The SOLO Taxonomy.
An extended abstract response is one that not only includes all relevant pieces of
information but extends the response to integrate relevant pieces of
information not in the stimulus.
A relational response integrates all relevant pieces of information from the
stimulus.
A multistructural response is one that responds to several relevant pieces of
information from the stimulus.
A unistructural response is one that responds to only one relevant piece of
information from the stimulus.
A pre-structural response is one that consists only of irrelevant information.
The SOLO Taxonomy has been applied in the context of many instructional and
measurement areas in education, including topics such as science curricula (Brabrand & Dahl,
2009), inquiry-based learning (Damopolii et al., 2020), high school chemistry (Claesgens et al.,
2009), mathematical functions (Wilmot et al., 2011), middle school number sense and algebra
(Junpeng et al., 2020), and middle school science (Wilson & Sloane, 2000).
The example detailed in Figures 4.5 and 4.6 illustrates the construction of an outcome
space by defining categories to match the levels of the SOLO framework. In this example, five
categories corresponding to the five levels of the SOLO taxonomy—pre-structural, unistructural,
multistructural, relational, and extended abstract--have been developed for a task requiring
students to interpret historical data about Stonehenge (Biggs & Collis, 1982, pp. 47-49). The History
task in Figure 4.5 was constructed to assess students' abilities to develop plausible interpretations
from incomplete data. Students aged between seven-and-a-half and 15 years were given
the passage in Figure 4.5 and asked to give in writing their thoughts about whether Stonehenge
might have been a fort rather than a temple. The detailed SOLO scoring guide for this item is
shown in Figure 4.6.
Figure 4.5 A SOLO task in the area of History (from Biggs & Collis, 1982).
This example raises the interesting question of how useful theoretical frameworks of this
kind might be in general. Certainly, Biggs and Collis have demonstrated the possibility of
applying the SOLO taxonomy to a wide variety of tasks and learning areas and other researchers
have observed SOLO-like structures in empirical data. Dahlgren (1984, 29-30), however,
believes that “the great strength of the SOLO taxonomy—its generality of application—is also
its weakness. Differences in outcome which are bound up with the specific content of a
particular task may remain unaccounted for. In some of our analyses, qualitative differences in
outcome similar to those represented in the SOLO taxonomy can be observed, and yet
differences dependent on the specific content are repeatedly found.”
Nevertheless, the SOLO taxonomy has been used in many assessment contexts as a way
to get started. An example of such an adaptation was made for the Using Evidence construct map
for the Issues, Evidence and You (IEY) curriculum (Example 9; Wilson & Sloane, 2000), shown in
Figure 4.7, which began with a SOLO hierarchy as its outcome space, but eventually morphed to
the structure shown. For example, in Figure 4.7:
waypoint I is clearly a pre-structural response, but
waypoint II is a special unistructural response consisting only of subjective reasons
and/or inaccurate or irrelevant evidence;
waypoint III is similar to a multistructural response, but is characterized by
incompleteness;
waypoint IV is a traditional relational response, and is the standard schoolbook “correct
answer,” while
waypoint V adds some of the usual “extras” of extended abstract.
Similar adaptations were made for all of the IEY constructs, which were adapted from the SOLO
structure based on the evidence from student responses to the items. This may be the greatest
strength of the SOLO Taxonomy—its usefulness as a starting place for the analysis of responses.
In subsequent work using the SOLO Taxonomy, several other useful levels have been
developed. A problem in applying the Taxonomy was found—the multistructural level tends to
be quite a bit larger than the other levels—effectively, there are lots of ways to be partially
correct. In order to improve the diagnostic uses of the levels, several intermediate levels within
the multistructural one have been developed by the Berkeley Evaluation and Assessment
Research (BEAR) Center, and hence, the new generic outcome space is called the SOLO-B
Taxonomy. Figure 4.8 gives the revised Taxonomy.
Figure 4.6 SOLO outcome space for the history task (from Biggs & Collis, 1982).
4 Extended Abstract
e.g., 'Stonehenge is one of the many monuments from the past about which there are
a number of theories. It may have been a fort but the evidence suggests it was more
likely to have been a temple. Archaeologists think that there were three different
periods in its construction so it seems unlikely to have been a fort. The circular
design and the blue stones from Wales make it seem reasonable that Stonehenge was
built as a place of worship. It has been suggested that it was for the worship of the
sun god because at a certain time of the year the sun shines along a path to the altar
stone. There is a theory that its construction has astrological significance or that the
outside ring of pits was used to record time. There are many explanations about
Stonehenge but nobody really knows.'
This response reveals the student's ability to hold the result unclosed while he
considers evidence from both points of view. The student has introduced information
from outside the data and the structure of his response reveals his ability to reason
deductively.
3 Relational
e.g., 'I think it would be a temple because it has a round formation with an altar at
the top end. I think it was used for worship of the sun god. There was no roof on it so
that the sun shines right into the temple. There is a lot of hard work and labor in it
for a god and the fact that they brought the blue stone from Wales. Anyway, it's
unlikely they'd build a fort in the middle of a plain.'
This is a more thoughtful response than the ones below; it incorporates most of the
data, considers the alternatives, and interrelates the facts.
2 Multistructural
e.g., 'It might have been a fort because it looks like it would stand up to it. They used to
build castles out of stones in those days. It looks like you could defend it too.'
'It is more likely that Stonehenge was a temple because it looks like a kind of design
all in circles and they have gone to a lot of trouble.'
These students have chosen an answer to the question (i.e., they have required a
closed result) by considering a few features that stand out for them in the data, and
have treated those features as independent and unrelated. They have not weighed
the pros and cons of each alternative and come to a balanced conclusion on the
probabilities.
1 Unistructural
e.g., 'It looks more like a temple because they are all in circles.'
'It could have been a fort because some of those big stones have been pushed
over.'
These students have focused on one aspect of the data and have used it to support
their answer to the question.
0 Prestructural
e.g., 'A temple because people live in it.'
'It can't be a fort or a temple because those big stones have fallen over.'
The first response shows a lack of understanding of the material presented and of the
implication of the question. The student is vaguely aware of 'temple', 'people', and
'living', and he uses these disconnected data from the story, picture, and questions to
form his response. In the second response the pupil has focused on an irrelevant
aspect of the picture.
Figure 4.7 A sketch of the construct map for the Using Evidence construct of the IEY curriculum.
[The sketch arrays students and their responses to items along a single continuum, with increasing sophistication in using evidence toward the top and decreasing sophistication toward the bottom.]
Figure 4.8 The SOLO-B Taxonomy.
An extended abstract response is one that not only includes all relevant pieces of information, but
extends the response to integrate relevant pieces of information not in the stimulus.
A relational response integrates all relevant pieces of information from the stimulus.
A semi-relational response is one that integrates some (but not all) of the relevant pieces of
information into a self-consistent whole.
A multistructural response is one that responds to several relevant pieces of information from the
stimulus, and that relates them together, but that does not result in a self-consistent
whole.
A plural response is one that responds to more than one relevant piece of information, but that
does not succeed in relating them together.
A unitary response is one that responds to only one relevant piece of information from the
stimulus.
4.3.3 Guttman-Style Items
The two general approaches described above relate most effectively to the cognitive
domain—there are also general approaches in the attitudinal and behavioral domains. The most
common general approach to the creation of outcome spaces in areas such as attitude and
behavior surveys has been the Likert style of item. The most generic form of this is the
provision of a stimulus statement (sometimes called a “stem”), and a set of standard options that
the respondent must choose among. Possibly the most common set of options is “Strongly
Agree,” “Agree,” “Disagree,” and “Strongly Disagree,” sometimes with a middle “neutral” option.
The set of options may be adapted to match the context: For example, the PF-10 Health
Outcomes survey uses this approach (see Section 2.2.1). Although this is a very popular
approach, largely, I suspect, because it is relatively easy to come up with many items when all
that is needed is a new stem for each one, there is a certain dissatisfaction with the way that the
response options relate to the construct. The problem is that there is very little to guide a
respondent in judging what the difference is between, say, “Strongly Disagree” and “Agree.”
Indeed, individual respondents may well have radically different ideas about these distinctions
(ref.). This problem is greatly aggravated when the options offered are not even words, but
numerals or letters, such as “1”, “2”, “3”, “4”, and “5”—in this sort of array, the respondent does
not even get a hint as to what it is that she is supposed to be making distinctions between!
The Likert response format has been criticized frequently over the almost 100 years since
Likert (1932/33) wrote his foundational paper, as one might expect for anything that is so widely
used. Among those criticisms are: (a) some respondents have a tendency to respond
on only one response side or the other (i.e., the positive side or the negative side), (b) some have
a tendency to not choose extremes or choose mainly extremes, (c) that some respondents confuse
an “equally-balanced” response (e.g., between Agree and Disagree) with a “don’t know/does not
apply” response (DeVellis, 2017), and (d) that, under some circumstances it has been found to be
better to collapse the alternatives into just two categories (Kaiser & Wilson, 2000). A
particularly disturbing criticism is found in a paper by Andrew Maul (2013), in which he calls
into question the assumption that even the stems (i.e., the questions or statements) are needed for
Likert-style items.
For psychometricians, probably the most common criticism is that the use of integers for
recording the responses (or, even, as noted above, as the options themselves) gives the
impression that the options are “placed” at equal intervals (e.g., Carifio & Perla, 2007; Jamieson,
2005; Kuzon et al., 1996; Uebersax, 2006). This then gives the measurer a false confidence that
there is (at least) interval-level measurement status for the resulting data, and hence that one can
proceed with confidence to employ statistical procedures that assume this (for example, linear
regression, factor analysis, etc.). There is also a literature on the robustness of such statistical
analyses against this violation of assumptions, dating back even to Likert (1932/33) himself, but
with others making similar points over the years (cf., Glass et al., 1972; Labovitz, 1967; Traylor,
1983). This issue will arise again in Chapter 6.
An alternative is to build into each set of options meaningful statements that give the
respondent some context in which to make the desired distinctions. The aim here is to try and
make the relationship between each item and the overall scale interpretable. This approach was
formalized by Guttman (1944), who created his scalogram technique (also known as Guttman
scaling):
If a person endorses a more extreme statement, he should endorse all less extreme
statements if the statements are to be considered a [Guttman] scale…We shall call
a set of items of common content a scale if a person with a higher rank than
another person is just as high or higher on every item than the other person.
(Guttman, 1950, p.62)
To illustrate this idea, suppose there are two dichotomous attitude items that form a
Guttman scale, as described by Guttman. If Item B is more extreme than Item A, which, in our
terms, would mean that Item B was higher on the construct map than Item A (see Figure 4.9),
then the only possible Likert-style responses (in the format (Item A, Item B)) would be:
(a) (Disagree, Disagree)
(b) (Agree, Disagree)
(c) (Agree, Agree).
That is, a respondent could disagree with both, or could agree with the less extreme Item
A and disagree with the more extreme Item B, or could agree with both. But the response
(d) (Disagree, Agree)
would be disallowed. Consider now Figure 4.9, which is a sketch of the (very minimal) meaning
that one might have of a Guttman scale—note how it is represented as a continuum consistent
with the term “scale”, but that otherwise, it is very minimal (which matches the “minimal
meaning”). It can be interpreted thus: if a respondent is below Item A, then they will disagree
with both the items; if they are in between A and B, then they will agree with A but not B; and, if
they are beyond B, they will agree with both. But there is no point where they would agree with
B but not A. One can see that Guttman’s ideas are quite consistent with the ideas behind the
construct map, at least as far as the sketch goes.
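A small sketch of the Guttman logic for the two dichotomous items in Figure 4.9: with the items ordered from least to most extreme (A before B), the allowable response patterns are exactly the cumulative ones, and any other pattern counts as a Guttman error. The coding (1 = Agree, 0 = Disagree) and the sample patterns are only for illustration.

```python
# Items ordered from least to most extreme (A before B), responses coded
# 1 = Agree, 0 = Disagree. Invented response patterns for illustration.

def is_guttman_consistent(pattern):
    """A pattern is consistent if no item is endorsed after a non-endorsed one."""
    return all(not (later and not earlier)
               for earlier, later in zip(pattern, pattern[1:]))

patterns = [
    (0, 0),  # below A: disagree with both
    (1, 0),  # between A and B: agree with A only
    (1, 1),  # beyond B: agree with both
    (0, 1),  # the disallowed pattern: agree with B but not A
]

for p in patterns:
    print(p, "allowed" if is_guttman_consistent(p) else "Guttman error")
```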
Four items developed by Guttman using this approach are shown in Figure 4.10. These
items were used in a study of American soldiers returning from the Second World War
(Guttman, 1944)—the items were designed to be part of a survey to gauge the intentions of
returned soldiers to study (rather than work) on being discharged from their military service.
The logic of the questions is: If I turn down a good job and go back to school regardless of help,
then I will certainly make the same decision for a poor job or no job. Some of the questions have
more than two categories, which makes them somewhat more complicated to interpret as
Guttman items, but nevertheless, they can still be thought of in the same way. Note how for
these items,
(a) The questions are ordered according to the construct (i.e., their intentions
to study); and
(b) the options to be selected within each item are
(i) clear and distinct choices for the respondents
(ii) related in content to the construct and also to the question
(iii) also ordered in terms of the construct (i.e., their intentions to study)
(iv) not necessarily the same from question to question.
It is these features that we will concentrate on here. In the next paragraphs this idea, dubbed
“Guttman-style” items by Wilson (2005), will be explored in terms of our Example 2.
Figure 4.10 Guttman’s example items (Guttman, 1944, p.145).
As noted in Chapter 2, the RIS Project developed the Researcher Identity Scale (RIS)
construct map in Figure 2.3 (Example 2). Following the typical attitude survey development
steps, the developers made items following the Likert response format approach—see Figure
4.11 for some examples. Altogether, they developed 45 Likert response items, with six response
categories for each item as shown in Figure 4.11 (i.e., Strongly Disagree, Disagree, Slightly
Disagree, Slightly Agree, Agree, and Strongly Agree).
Figure 4.11 Some Likert-style items developed for the Researcher Identity Scale (from Wilson et
al., 2020)
The SFHI researchers then decided to transform the Likert-style items to Guttman-style
items. The trick of doing this is that the stems of the Likert-style items (e.g., the first column in
Figure 4.11) become the options in the Guttman-style items. This means that each Guttman-style
item may correspond to several Likert-style items.
21
To do this, the stems of the Likert-style items (i.e., the first column in Figure 4.11) were
grouped together based on the researchers' judgement of the similarity of their content and their
match to the RIS construct map waypoints, to create Guttman-style sets of ordered response
options (see Figure 4.12). Not in every case were there Likert stems that matched the complete
RIS set of waypoints, so the researchers had to create some new options to fill the gaps. The
resulting Guttman-style response options were placed in order based on
(a) the theoretical levels of the construct map that they were intended to map to, and
(b) empirical evidence of how students responded to the items in earlier rounds of testing.
To see an illustration of this, compare the Likert-style items shown in Figure 4.11 with options
(c), (d) and (e) for the Guttman response format item in Figure 4.12. This example shows,
indeed, the matched set of these three Likert response format items with one Guttman response
format item. As there were not any matching items for the two lower levels among the Likert
items, two more options were developed for the Guttman item—options (a) and (b) in Figure
4.12.
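A minimal sketch of the transformation just described. The three stems are the ones quoted in Appendix 4B; the waypoint numbers and the wording of the two added lower options are invented placeholders (the actual added options appear in Figure 4.12, which is not reproduced here).

```python
# Likert stems (quoted in Appendix 4B), each tagged with an assumed waypoint
# of the construct map (1-5, low to high); the numbers are placeholders.
likert_stems = [
    ("I am beginning to consider myself a researcher.", 3),
    ("I consider myself a researcher.", 4),
    ("I consider myself to be a professional researcher.", 5),
]

# Invented fill-in options for waypoints with no matching Likert stem
# (the actual wording used by the project is in Figure 4.12).
fill_in_options = [
    ("I do not think of myself as a researcher.", 1),
    ("I can imagine becoming a researcher some day.", 2),
]

# Assemble one Guttman-style item: the former stems (plus fill-ins) become
# response options ordered by waypoint and labelled (a), (b), (c), ...
options = sorted(likert_stems + fill_in_options, key=lambda pair: pair[1])
for label, (text, waypoint) in zip("abcde", options):
    print(f"({label}) [waypoint {waypoint}] {text}")
```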
Figure 4.12 An example item from the RIS (Guttman response style item) (from Wilson et al.,
2020)
The project developed 12 Guttman response format items, based on 21 of the Likert-style
items, and collected validity evidence for the use of the instrument (Morell et al., 2021). Eleven
of the 12 have at least one Likert-style option that the Guttman options were designed to match
to, with 21 matching levels in all, out of a total possible 60 across all 12 Guttman response
format items (so, approximately 2/3rds of the Guttman options were new). Details of the
matching of the 21 Likert-style items with the 12 Guttman-style items is given in Appendix 4B.
The SFHI researchers found that, comparing the Likert-style item set to the Guttman-style item
set:
(a) the Guttman-style set gave more interpretable results (see details of this in Section 6.X),
(b) the reliabilities were approximately the same, although there were fewer Guttman-style items
(45 vs. 12), resulting in an equivalency of approximately 3.75 Likert-style items to each
Guttman-style item, and
(c) respondents tended to be slower in responding to each Guttman-style item (though the
equivalence in (b) balances that out) (Wilson et al., 2020).
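One way to read the equivalency noted in (b), offered here as an illustration rather than as the authors' own calculation: if a 12-item Guttman-style scale reaches about the same reliability as a 45-item Likert-style scale, then, in Spearman-Brown terms, each Guttman-style item is doing the work of roughly 45/12, or about 3.75, Likert-style items.

```latex
% Spearman--Brown: reliability of a scale lengthened by a factor k,
% where \rho_1 is the reliability of a single (Likert-style) item.
\rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1},
\qquad
\rho_{45}^{\mathrm{Likert}} \approx \rho_{12}^{\mathrm{Guttman}}
\;\Rightarrow\;
k = \frac{45}{12} \approx 3.75 .
```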
Of course, one does not have to develop a set of Likert-style items first in order to get to
the Guttman-style items. That was the path taken in the SFHI project, and it was chosen to share here
because it is a useful account of how to change the numerous existing Likert-style instruments into
Guttman-style instruments. In fact, the account shows the relationship quite clearly: the Guttman-style
items can be thought of as sets of Likert-style items, where the vague Likert-style options have
been swapped out for more concrete and interpretable options (the former Likert-style stems).
4.4 When Humans Become a Part of the Items Design: The Rater
How can raters be a part of measurement?
In the majority of the different item formats described in Section 3.3, there will be
a requirement for the responses to be judged, or rated, into categories relating to the waypoints in
the construct map. Even for the selected response format, it was recommended that the
development process include a constructed response initial phase. It may be possible in some
cases to use machine learning to either assist or replace the human element in large-scale
measurement situations, but most situations still need an initial phase that will require human
rating to gather a training sample for the machine learning to work. Thus, an essential element of
many instrument development efforts will require the use of human raters of the responses to the
items, and this requirement needs to be considered at the design stage. The earlier sections of this
chapter have described important aspects of the categorization of responses, and these are an
essential element of the rating design; they must also be considered when designing the items
that will be generating the responses.
In designing items, it is important to be aware that open-ended items are not without their
drawbacks. They can be expensive and time-consuming to take, code, and score, and they inevitably
introduce a certain amount of subjectivity into the scoring process. This subjectivity is inherent
in the need for the raters to make judgments about the open-ended responses. Guidelines for the
judgments cannot encompass all possible contingencies, and therefore the rater must exercise
judgment, ideally with a high degree of consistency. But, to counter this flaw, it is the judgment that
offers the possibility of a broader and deeper interpretation of the responses and hence the
measurements.
Failures to judge the responses in appropriate categories may be due to several factors related to
the rater, such as: fatigue, failure to understand and correctly apply the guidelines, distractions due to
matters such as poor expression by the respondent, and distractions due to recent judgements about
preceding responses. A traditional classification of the types of problematic patterns that raters tend to
exhibit is described in the next three paragraphs (e.g., Saal, Downey, and Lahey, 1980).
Rater severity or leniency is a consistent tendency by the rater to judge a response into a category
that is lower or higher, respectively, than is appropriate. Detection of this pattern is relatively
straightforward when the construct has been designed as a construct map (as opposed to more traditional
item-development approaches) as the successive qualitative categories implicit in the waypoint
definitions give useful reference points for an observer (and even the self-observer).
A halo effect may reveal itself in three different ways—they all involve how the rating of one
response can affect the rating of another. The first may happen in the circumstance that a single response
is judged against several subscales. The problem is that the judgement of one of the
subscales may influence the judgement of another—a typical case is where the rater makes an overall
determination across the whole set of subscales rather than attending to each of the subscales separately.
The second type of halo effect arises when the rater forms an impression based on the person’s previous
responses rather than scoring each response on its own merit. The third type of halo effect occurs
between respondents—the response from an earlier respondent may influence the judging of a response
from a later respondent.
Restriction in range is a problematic pattern where the rater tends to judge the responses into
only a subset of the full range of scorable categories. There are several different forms of this: (a)
central tendency is where the rater tends to avoid extreme categories (i.e., the judgements tend to be
towards the middle of the range); (b) extreme tendency is the opposite, where the rater tends to avoid
middle categories (i.e., the judgements tend to be towards the extremes of the range); and, of course, (c)
severity or leniency could be seen as a tendency to restrict the range of the categories to low or high end
of the range, respectively.
The causes of these problematic patterns may differ as well. A rater may adopt a rating strategy
that looks like central tendency because they adopt a “least harm” tactic—staying in the middle of the
range reduces the possibility of grossly mis-scoring any respondent (which means that a discrepancy
index used to check up on raters will not be very sensitive). However, a restriction of range, for example
to the low end, may be due to a failure on the part of the rater to see distinctions between the different
levels.
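A hedged sketch of how the problematic patterns above might be flagged numerically, by comparing a rater's scores on a common set of responses to reference scores (consensus scores, or the kind of auxiliary information discussed below): the mean difference serves as a severity/leniency index and the ratio of spreads as a restriction-of-range index. The data, indices, and sign conventions are invented for illustration.

```python
from statistics import mean, pstdev

# Invented scores on the same 8 responses (scale 0-4).
reference = [0, 1, 2, 2, 3, 3, 4, 4]        # consensus / auxiliary scores
rater     = [1, 2, 2, 3, 3, 3, 3, 3]        # one rater's judgements

severity_index = mean(r - ref for r, ref in zip(rater, reference))
# > 0 suggests leniency, < 0 suggests severity (sign convention is a choice).

range_ratio = pstdev(rater) / pstdev(reference)
# Well below 1 suggests restriction of range (e.g., central tendency).

print(f"mean difference = {severity_index:+.2f}, spread ratio = {range_ratio:.2f}")
```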
There are several design strategies that can be taken to avoid problematic patterns of judgement.
One typical strategy is to provide extensive training so that the rater more fully understands the
intentions of the measurer. Carrying out the ratings in the context of a construct map provides an
excellent basis for such training—the waypoints make the definition of the construct itself much clearer,
and, in turn, the exemplars² make the interpretation of the waypoints much clearer also. This training
should include opportunities for raters to score sample responses with established ratings, and scrutiny
of the results of the ratings, as well as repetitions of the training at appropriate intervals. There are many
models of delivery of such training, but one useful approach is to have three foci: (a) general
background provided, e.g., in an online module; (b) small groups working through judgement
scenarios; and (c) individual coaching to address questions and areas of weakness.
Another strategy uses auxiliary information from a second source to check for
consistency with the rater’s judgement about a response. A rater would be considered severe if they
tended to give scores that were lower than would be expected from other sources of information. If the
rater tended to give higher scores, then that would be considered leniency. In Wilson and Case (2000),
for example, where the instrument consisted of achievement items that were both selected response and
constructed response, student responses to the set of selected response items were used to provide
auxiliary information about a student’s location on the construct in comparison to the rater’s judgements
for the constructed response items. In a second example (Shin et al., 2019), where all of the items were
constructed response, a machine-learning algorithm was used to provide auxiliary information about the
ratings of responses.
² That is, examples of typical responses at each waypoint, as will be defined in the next chapter.
For an instrument that requires raters, the measurement developer needs to be aware of the
considerations mentioned above, but there are many others besides that are dependent on the nature of the
construct being measured, and the contexts for those measurements. It is beyond the scope of this book
to discuss all of the many such complexities, and the interested reader should look for support from the
relevant literature. For example, a useful and principled description of these complexities in the context
of educational performance assessments (e.g., written essays and teacher portfolios) is given in
Engelhard and Wind (2017).
4.5 Resources.
The development of an outcome space is a complex and demanding exercise. The scoring
of outcome spaces is an interesting topic by itself—for studies of the effects of applying different
scores to an outcome space, see Wright and Masters (1981) and Wilson (1992). Probably the
largest single collection of accounts of outcome space examples is contained in the volume on
phenomenography by Marton, Hounsell, and Entwistle (1984), but also see the later collection by
Bowden and Green (2005). The seminal reference on the SOLO taxonomy is Biggs and Collis
(1982); extensive information on using the taxonomy in educational settings is given in Biggs
and Moore (1993) and Biggs (2011). The Guttman-style item is a new concept, although based
on an old idea—see Wilson et al. (2020) for the only complete account so far, although it is
presaged in the first edition of this volume (Wilson, 2005).
4.6 Exercises and Activities
2. After developing your outcome space, write it up as a scoring guide (e.g., Table 1.1) for your
items, and incorporate this information into your construct map.
3. Log into BASS and enter the information about Waypoints, Exemplars, etc.
4. Carry out an Item Pilot Investigation as described in Appendix 4A. The analyses for the data
resulting from this investigation will be described in Chapters 6, 7 and 8.
5. Make sure that the data from your Pilot Investigation is entered into BASS. This will be
automatic if you used BASS to collect the data. If you used another way to collect the data
(e.g., pencil and paper, another type of assessment deployment software), then use the “Upload”
options to load it into BASS.
6. Try to think through the steps outlined above in the context of developing your instrument
and write down notes about your plans and accomplishments.
7. Share your plans and progress with others—discuss what you and they are succeeding on, and
what problems have arisen.
Appendix 4A
Appendix 4B
The left column of Figure 4B1 shows the 21 chosen Likert-style items. Each of
these Likert-style items was developed with six ordered response choices: strongly
agree, agree, slightly agree, slightly disagree, disagree, and strongly disagree. To
transform the set into the Guttman-style format, we first grouped Likert-style items
together based on similar content. For example, the first three Likert-style items in Figure
4B1 focus on an individual's comfort with seeing himself/herself as a researcher.
The first item “I am beginning to consider myself a researcher,” targets a relatively lower
level of the construct map in comparison to the second item “I consider myself a
researcher,” which in turn targets a relatively lower level of the construct map in
comparison to the third item “I consider myself to be a professional researcher.” Each of
these items becomes an option in the first Guttman-style item (G1), adapted in some
cases to make the expressions consistent across the Guttman-style options. To match to
the construct map for this variable, we added two more options at levels below that for
Item 1. The development process was similar for four of the Guttman-style items. For
some Likert-style items (e.g., Item 7, and five others) there were no others with
similar content, so we needed to add four Guttman-style options for each of them. We
found that this process produced a somewhat imbalanced set of Guttman-style items, with
two Guttman-style items for the Agency strand and three for the rest. Hence, we
developed one extra Guttman-style item focused on the Agency strand that was not
matched among the 21 Likert-style items (Guttman-style item G9). Through this process,
we obtained the right column of the Figure, the 12 Guttman-style items. In subsequent
analyses, one of the Likert-style items (Likert-style item 12) was found not to fit the
statistical model, so we deleted it from the comparisons, but the corresponding Guttman-
style item did fit, so we left it in the comparisons. Each of the remaining 20 Likert-style
items maps to an option for one of the Guttman-style items. To get comparable estimates,
we ensured each student in our sample of 863 high school students took both formats of
the instrument. We randomized the order of the instruments, meaning that some students
looked at the Likert items first while others looked at the Guttman items first.
Figure 4B1. Mapping Likert-style Items into Guttman-style Items.
[Figure 4B1, excerpt] Likert-style item 11: “A career in research would be a good way for me to help people.”
Corresponding Guttman-style item options include:
d) I would be interested in doing research that helps my community.
e) I am definitely interested in doing research that helps my community.
[Figure 4B1, excerpt] Likert-style item 15: “I plan to get a research-related degree in college.”
Corresponding Guttman-style item options:
a) I do not plan to pursue research in the future.
b) I do not know if doing research is in my future.
c) I am not sure if a research-related degree is right for me.
d) I might get a research-related degree in college.
e) I plan to get a research-related degree in college.