
Aphasiology, 1999, vol. 13, no. 6, 445–473

Review

Single-subject clinical-outcome research: designs, data, effect sizes, and analyses

RANDALL R. ROBEY, MARTIN C. SCHULTZ†, AMY B. CRAWFORD and CHERYL A. SINNER
University of Virginia, VA, USA
†Southern Illinois University-Carbondale, IL, USA

(Received 30 January 1998; accepted 22 August 1998)

Abstract
In the last 20 years, single-subject research designs have become important forms of aphasia-treatment research for assessing the effectiveness of treatment on a subject-by-subject (or patient-by-patient) basis. In that time, several important developments centring on the reliability and validity of single-subject research have occurred in the statistical literature. This work assesses the state of aphasia-treatment single-subject research in the context of that scholarship and, through a tutorial-like presentation, details the analysis of published single-subject results and proposes recommendations concerning future applications of single-subject designs. The work focuses on four domains: designs, data, effect sizes, and analyses. The findings indicate that aphasia-treatment single-subject studies, which are well designed for the most part, yield short series of autocorrelated data manifesting generally large treatment effects. However, only one analysis satisfactorily controlled Type I and Type II errors under typical clinical-aphasiology applications. That procedure, ITSACORR, is easily accomplished and it expresses outcome in familiar terms. To facilitate understanding, the review promotes a hands-on understanding of the various analysis options through worked examples and clarifies the (in)appropriateness of each procedure for clinical applications. Although the focus of the work is treatment for aphasia, the central thesis has general application across disorder categories.

Introduction
Communication disorders scientists and practitioners currently experience a transition of great potential. The movement is toward providing evidence arising out of broadly accepted forms of experimentation for testing that treatments are effective, and away from idiosyncratic and less effective forms of experimentation and evidence (Robey and Schultz 1998). The outcome of the transition will be acceptance by (a) the public, (b) those influential in the creation of public policy, and (c) public and private reimbursers of therapy services, that treatments for communication disorders are demonstrably effective as evaluated through stringent

Address correspondence to: Randall R. Robey, Communication Disorders Program, University of Virginia, Suite 202, 2205 Fontaine Avenue, Charlottesville, VA 22903, USA.

0268–7038/99 $12.00 © 1999 Taylor & Francis Ltd


scientific criteria. A necessary aspect of this transition is incorporating valid and reliable procedures for quantifying and synthesizing outcomes in the practice of single-subject research. Since the introduction of single-subject research designs to aphasia-treatment research (e.g. Davis 1978, LaPointe 1978), this form of experimentation has become an increasingly accepted means for testing the effectiveness¹ of treatments for aphasia in one or a few subjects. However, recent scholarship in the statistical literature raises serious questions about the validity and reliability of analysis procedures currently used in single-subject aphasia-treatment clinical-outcome research.
Initially, this work was intended as a meta-analysis of single-subject aphasia-treatment research to synthesize this large body of clinical-outcome research in the same manner that group studies of aphasia treatment have been synthesized (Robey 1994, 1998). However, only 12 single-subject studies could be sufficiently quantified for the application of meta-analysis procedures. Furthermore, too few of those 12 studies were sufficiently similar to warrant averaging. As a result, the focus of the work necessarily moved from meta-analysis per se to experimental-design and quantification issues bearing directly on the potential of aphasia-treatment single-subject studies for eventually yielding converging scientific evidence. Although speech-language pathologists draw upon a rich literature for guidance in planning this form of quasi-experimentation² (e.g. Davis 1978, LaPointe 1978, McReynolds and Kearns 1983, Connell and Thompson 1986, Kearns 1986, McReynolds and Thompson 1986, Fukkink 1996), an equally rich literature has developed in the domain of statistical research regarding the methodology and analysis of single-subject experimentation. The present paper incorporates the principles and conventions of the broader clinical-outcome research community in applications of aphasia-treatment single-subject research.
¹ A few words regarding terminology are in order. The term effectiveness is defined differently than the term efficacy. Effectiveness is the likelihood of beneficial outcome to individuals in a certain population due to a certain treatment administered under usual and routine clinical-practice conditions (Frattali 1998, Robey and Schultz 1998). Efficacy is the likelihood of beneficial outcome to individuals in a certain population due to a certain treatment administered under optimal clinical-experiment conditions (Frattali 1998, Robey and Schultz 1998). The term outcome research comprises evidence of efficacy as well as evidence of effectiveness. Outcome research indexes differences between observations made prior to the administration of a treatment and observations made sometime after the termination of treatment (Sederer et al. 1996, Hopkins 1998). Indices of outcome may take many forms (Kane 1997, Frattali 1998): physiology, impairment, disability, handicap, quality of life, consumer satisfaction, disposition in terms of returning to the workforce, or post-treatment income tax contributions, among others.

² Note that this is not a pejorative or diminishing term; the term quasi-experiment may be used to describe group designs as well as single-subject designs. At the core of the technical definition of a true experiment is random assignment of subjects to either control or treatment conditions (Campbell and Stanley 1963, Cook and Campbell 1979, Fineberg 1990, Cook and Shadish 1994, Johnson et al. 1995, Reichardt and Mark 1998, among others). When subjects cannot be randomly assigned to groups (as in the case of interrupted time-series research), the design is labelled a quasi-experiment. The term denotes, but denotes only, that random assignment is not possible and so neither are the associated mathematical benefits. The term does not imply a lack of scientific rigor; rather it speaks to the complexity of the behavioural and social sciences. The burden for researchers conducting true or quasi-experiments is the same: rule out threats to experimental validity through experimental controls. As Johnson et al. (1995) point out, it is largely possible to assure the validity of single-subject designs. It should also be noted that central to the random assignment of true experiments is a between-subject comparison; central to single-subject designs is a within-subject comparison.

Important developments in the statistical literature can be organized in four
domains: (a) design (e.g. What types of design should (not) be used to assess clinical
outcomes of behavioural treatments?); (b) data (e.g. Are behavioural single-subject data autocorrelated³ and, if so, does the observed degree of autocorrelation affect the validity or reliability of analysis decisions?); (c) effect size (e.g. Can the results of a single-subject design be quantified to yield an estimate of effect size and so make the application of meta-analysis procedures possible?); and (d) analysis (e.g. Is the visual analysis of single-subject data valid and reliable? Are statistical analyses of single-subject data valid and reliable?). Scholarship addressing these issues in the statistical literature has direct application in aphasia-treatment single-subject experimentation and provides answers for timely and focused questions in each of the four domains.

• Given that treatments for aphasia impart a carry-over effect⁴ when they are successful, what single-subject research designs are prescribed and proscribed by this threat to internal validity⁵?
• Are single-subject aphasia-treatment data autocorrelated and, if so, to what extent? What is the impact of this potential threat to statistical conclusion validity?
• Can aphasia-treatment single-subject outcomes be quantified for calculating estimates of effect size? If so, do single-subject designs produce appreciable effect sizes? and
• How are aphasia-treatment single-subject data validly and reliably analysed?

Purpose
This work centres on the designs, data, effect sizes, and analyses of single-subject aphasia-treatment research. The purpose of the work is to accomplish two objectives in each of these four domains: (a) review current scholarship regarding single-subject experimentation reported in the statistical and clinical research literatures; and (b) assess the state of single-subject aphasia-treatment experimentation. The assessment of available effect sizes constitutes an elementary meta-analysis. Recommendations for optimizing future applications of aphasia-treatment single-subject research are proposed throughout.

Methods
The first objective was accomplished through a review of relevant statistical literature. Since writings on the analysis of single-subject data are much more numerous than writings in the other three domains, that section of this work is more extensive than the others. To facilitate communication, the analysis section contains a series of step-by-step worked examples and illustrations.
³ Autocorrelation refers to the relationship (i.e. the degree of predictability, or lack of independence) between an observation of a subject and observations of the same subject made at later points in time. See the Data section of this work for a technical definition and example.

⁴ A carry-over effect occurs when the effect of one treatment is not washed out (i.e. behaviour does not return to baseline) prior to the administration of a second treatment. As a result, the observation of behaviour in the second period will result from an interaction of the first and second treatments (see Robey and Schultz 1998).

⁵ The reader is referred to Cook and Campbell (1979) and to Robey and Schultz (1993) for explanations of internal validity as well as statistical conclusion validity mentioned in the next point.

An extensive search of aphasia-treatment literature reported by Robey (1998) formed the basis of accomplishing the second objective. A literature search of scientific reports commenting on recovery of aphasia (written in English) was
conducted through two means: a systematic manual search for references in relevant literature sources, and searches through electronic data bases of published reports. The former effort centred on the periodicals Aphasiology, Brain and Language, Clinical Aphasiology, Journal of Speech and Hearing Disorders, and Journal of Speech and Hearing Research, as well as chapters, texts, and various bibliographies likely to contain relevant references. The searches of electronic data bases included Carl-Uncover, Dissertation Abstracts, EM-BASE, ERIC, MEDLINE, PsychLit, and Science Citation Index. The search produced 479 reports. Of these, 76 were set aside since they did not report an observation of treatment outcome (i.e. tutorials, essays, letters, and studies not focusing on treatment outcomes). Each of the remaining 403 reports was examined for relevance to the focus of this work (i.e. the application of a single-subject research design to one or more aphasic individuals). Based on Wortman (1994), the inclusion criteria centred on Cook and Campbell's (1979) construct validity (basically, what is being measured) and external validity (basically, to what populations in what environments do the results apply). A total of 69 reports satisfied the inclusion criterion. However, three of these studies were eliminated as redundant of others (e.g. a dissertation and the subsequent publication) and so the total reduced to 66.
Wortman's (1994) exclusion criteria for meta-analysis centre on Cook and Campbell's internal validity (basically, the outcome occurs for the purported reasons) and statistical conclusion validity (basically, the analysis of observations is appropriate and correctly carried out). The exclusion criterion, scientific acceptability, required that a single-subject design include at least one period of no treatment followed by a period of treatment. As is the case for group studies, the experimental validities of some single-subject studies were more threatened than were the experimental validities of others. However, only three reports did not satisfy the acceptability criterion (Wortman 1994): two designs contained no baseline period and the design was not described in a third.

As a result, the extensive search for reports of single-subject quasi-experiments on clinical outcomes for the treatment of aphasia produced a pool of 63 studies. These 63 studies formed the evidentiary base for assessing the designs, data, effect sizes and analyses of single-subject research applications in clinical aphasiology. As is true in all lines of investigation, some of the 63 studies were exemplary in design and experimental controls and others were less so. Because the exclusion of some studies in a survey invites a claim of bias, all studies meeting the general inclusion and exclusion criteria were examined for design, data, effect size, and analysis. It will become apparent that not all of the 63 studies could contribute to the assessment in each domain.

Designs
Levin (1992) describes two classes of motivation for single-subject research designs: those generating hypotheses and those testing hypotheses once generated. The former is appropriate in exploratory research when the focus of inquiry is exploration of the several dimensions of a new treatment: amount of treatment, materials, protocols and contingencies (see Phase I and Phase II research in Robey and Schultz 1998). These early and purposeful considerations are necessary for specifying the null hypothesis to be tested in the efficacy stage of clinical-outcome research (see Phase III research in Robey and Schultz 1998). The second class applies to confirmatory research wherein the focus is an already particularized null hypothesis. That is, hypotheses are selected for single-subject research to determine effectiveness of a specific treatment administered to patients having one or more specific attributes through some particular means of service delivery (see Phase IV research in Robey and Schultz 1998).
Focused hypotheses for testing treatment effectiveness must be replicated. Hilliard (1993) explains that the necessary sequence is first direct replication (additional experiments using similar subjects and similar circumstances to determine the reliability of the effectiveness of treatment) and then systematic replication (i.e. replications with thoughtfully selected differences to determine limits on the generality of effectiveness). Said differently, testing the effectiveness of treatments encompasses a programme of serial testing of focused, theory-driven, single-subject null hypotheses (Kearns and Thompson 1991b) in addition to similarly focused group studies.

The difficulty in effectiveness research is that one cannot compare multiple treatments in a single individual without suffering the negative consequences of a carry-over effect in the analysis of outcome (Shapiro et al. 1982, Kratochwill and Williams 1988, Senn 1993, Franklin et al. 1996, Fukkink 1996, Backman et al. 1997). By way of explanation, a carry-over effect occurs when the effect of a prior treatment influences a measurement made following the next (or later) treatment (Senn 1993). That is, the effect of an early treatment persists and influences measurements made after the administration of (an)other treatment(s). It would be unreasonable to expect the effect of aphasia treatment to wash out between experimental periods, and undesirable if it occurred. When successful, treatment of aphasia imparts permanent change; a carry-over effect is the expected and necessary outcome. Therefore, two treatments applied to one person cannot be compared on the basis of a common baseline of performance unaltered by treatment. Moreover, it is unreasonable to expect the effects of treatments to be linearly additive; that is, one cannot expect that if treatment 1 brings about a units of change, and treatment 2 brings about b units of change, administering treatment 1 and then treatment 2 would yield a total magnitude of a plus b units of change. As a result, direct comparisons of two treatments administered to the same subject or subjects often yield ambiguous findings (Kazdin 1986). The most direct solution is to test one treatment per subject (Eick and Kofoed 1994, Fukkink 1996).

Single-subject designs reported in aphasia-treatment literature

Because authors of primary studies have used a variety of notations to code single-subject research designs, and because technical terms for describing experimental periods have not always been applied appropriately, identifying similarities and differences among and between studies can be a complicated matter. For the present review, all designs were classified using a common code adopted from mainstream single-subject literature (Bloom and Fischer 1982, Portney and Watkins 1993, Yaden 1995, Backman et al. 1997, among others). A code of A was used to code a period of no treatment culminating in a measurement occasion (e.g.
baseline). A withdrawal period in which treatment was stopped for the duration of the period was likewise coded A. Periods labelled 'follow up' and 'maintenance' to designate post-treatment periods of no treatment were also coded A.

Some authors have used the term 'reversal period' to connote a period of no treatment (e.g. Connors and Wells 1982, Yaden 1995) and these were coded A. Classically, the terms 'withdrawal' and 'reversal' are distinct. Bloom and Fischer (1982), Hersen and Barlow (1976), Kazdin (1982), and Kratochwill (1978) point out that a reversal period is one in which active treatment persists. In a reversal period, either the reinforcement-contingency protocol changes or treatment is directed toward a different target behaviour.

The first period of active treatment was always coded B; all later periods in which that same treatment was again implemented were also coded B. Any period for implementing a second and distinct treatment was coded C; periods dedicated to a third treatment were coded D, and so forth.

Nine of the 63 studies reported a classic AB design with multiple baseline controls. Thirty-four studies reported a withdrawal control in the form of an ABA design (withdrawal design). Twenty-nine of those ABA designs also incorporated multiple baseline controls. In all, 45 studies reported some form of a multiple baseline design (e.g. across behaviours, across subjects; see Kearns and Thompson 1991a for a similar finding). In addition, 38 of the 63 designs incorporated some form of generalization probe.

Forty-nine of the 63 studies compared one or more periods of a single treatment with one or more periods of no treatment. The remaining 14 studies compared two or more treatments in sequence. The number of subjects reported in each of the 63 studies ranged from 1 to 10 with a mean of 3 and a standard deviation of 2. The most frequently reported n was 1 (26 studies). A catalogue of the reported designs and their frequencies is listed in table 1.

Table 1. Reported designs and their frequencies

Design                                  Number   Multiple baseline

A, B                                        9        9
A, B, A                                    34       29
A, B, A, B                                  3        1
A, B, A, B, A                               3        1
A, B, A, C, A                               1        0
A, B, A, C, B                               1        0
A, B, B+C, A                                1        1
A, B, C                                     1        1
A, B, C, A                                  3        0
A, B, C, B, C                               1        0
A, B, C, D, E, A                            1        0
A, B r C, A (alternating treatments)        4        2
A, B+C, B, B+C, A, B, B+C, B, A             1        1
Because the number of baseline observations and the number of treatment observations influence the validity and the reliability of conclusions, these design attributes of the 63 studies were indexed. In multiple-baseline designs, the numbers of observations per period were averaged for each subject and those values were then averaged across subjects so that each study contributed one value for each period (all fractions were rounded up to whole numbers, as per Cohen 1988). For
the baseline period, the numbers of observations ranged from 1 to 20 with an average of 8 and a standard deviation of 5. The most frequently occurring number
of observations in the baseline period was 3 (16 studies). Note that the average of 8 baseline observations is somewhat misleading since that average included multiple-baseline designs; the average for initial baseline periods (i.e. the first iteration through an AB sequence) was 4.

The first treatment period involved 11 observations on average, with a range from 1 to 29 and a standard deviation of 6. The most frequently occurring numbers of observations were 5 and 6 (in 24 studies altogether). Third-period observations, which sometimes came from a withdrawal period and sometimes a second-treatment period, averaged 8 with a standard deviation of 7 and a range from 1 to 37. The third period most frequently consisted of five observations (10 studies).
The initial baselines (prior to the administration of any treatment) of the 63 studies were inspected for variability and trend. Only gross estimates were possible for those baselines consisting of only a few data points. Overall, baselines were relatively stable. The baselines of 13 studies demonstrated moderate variability and five demonstrated substantial variability. The initial baselines of only five studies were characterized by moderate trend; four studies demonstrated substantial trend. This stability is likely associated with, but not exclusively attributable to, the recovery state of the subjects studied. Forty studies reported results for individuals who were 1 year or more post ictus; subjects in 13 studies were between 6 and 12 months post ictus; 11 studies reported results for subjects who were less than 6 months post ictus.
For the most part, single-subject aphasia-treatment studies are hypothesis driven. However, studies of only nine treatments (i.e. computer-based visual-communication treatment, cueing-verb treatment, cueing hierarchy for naming, Helm's Elicited Language Training Program for Syntax Stimulation, phonology-based treatment, Response Elaboration Treatment, short-term memory training, facilitating generalized requesting behaviour, and verbal combined with non-verbal treatment) were replicated either directly or systematically (Kratochwill and Williams 1988, Hilliard 1993). Altogether, 25 studies contributed to a replication series; the remaining 38 studies each addressed a different form of treatment.

Data
The property of time-series data known as autocorrelation (also known as autoregression, serial dependence, and serial correlation) biases conclusions drawn from analyses of those data. Positive autocorrelation leads to liberally biased errors and negative autocorrelation leads to conservatively biased errors (Crosbie 1987). By way of explanation, autocorrelation indexes the relationship between observations taken at one point in time and observations on the same subject taken at a different time. The autocorrelation coefficient is nothing more than the Pearson product-moment correlation coefficient for two vectors of the same data. For a first-order autocorrelation coefficient, the first vector contains the observations as they were collected in series. The first element of the second vector is set empty; the second element is the first observation; the third element is the second observation, and so forth. The second vector is said to be a lag-1 variate. A correlation coefficient is then calculated for the two vectors of data. The lag-1 autocorrelation coefficient indexes the relationship between data points and those that immediately follow; it is often termed the first-order autocorrelation influence. A lag-2 autocorrelation coefficient (i.e. the first two elements of the second vector are set empty and the third element contains the first observation) indexes a relationship that plays out not one but two observations later. With a sufficiently long data stream, a researcher might investigate autocorrelation through lag 15 or so. In some applications, the cycle of measurement might focus a researcher's attention on one or another lag (e.g. lag 8). Most often, however, interest centres on the first-order influence.
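The lagged-vector construction described above can be sketched in a few lines of code. The helper names below are illustrative (they do not come from the article), and the sketch simply drops the unpaired leading elements rather than holding them 'empty', which yields the same Pearson coefficient over the paired observations:

```python
from statistics import mean

def pearson(u, v):
    """Pearson product-moment correlation of two equal-length vectors."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

def lagged_autocorr(series, lag=1):
    """Correlate the series with a copy of itself shifted by `lag`
    observations (lag must be at least 1 and less than the series length)."""
    return pearson(series[:-lag], series[lag:])

# A short, steadily rising (hypothetical) data stream: adjacent
# observations are strongly related, so the lag-1 coefficient is high.
scores = [2, 3, 3, 4, 5, 5, 6, 7, 8, 8]
r1 = lagged_autocorr(scores, lag=1)   # first-order influence
r2 = lagged_autocorr(scores, lag=2)   # influence two observations later
```

With a rising series such as this one, r1 comes out strongly positive, in line with the moderate-to-high first-order influence reported below for aphasia-treatment series.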
Statisticians debate the degree to which behavioural data are autocorrelated. Huitema (1985, 1988) argued that the degree of autocorrelation in behavioural data is negligible. Busk and Marascuilo (1988) and Sharpley and Alavosius (1988), among others, have argued that the mathematical assumptions underpinning Huitema's assertion are not warranted and that the degree of autocorrelation in behavioural data is sufficient to cause serious bias in analysis decisions. Huitema and McKean (1991) have since demonstrated that small sample sizes (too few observations) can lead to underestimation of the degree to which behavioural data are in fact autocorrelated.

For aphasia-treatment applications, this debate can be rendered moot by taking on the straightforward task of calculating estimates from the available data. Therefore, coefficients of first-order autocorrelation were calculated from reported raw data. Furthermore, because estimates of variances constitute important information in selecting among different mathematical models for estimating effect size, variances of baseline and treatment data were calculated as well.

Single-subject data reported in aphasia-treatment literature

Twelve of the 63 studies reported recoverable raw data spanning a baseline period and an immediately following treatment period. Few studies reported raw data numerically; for the most part, raw data were extracted from performance-over-time plots (two authors independently double checked all data transcription). Usually, the size and resolution of a plot was insufficient for extracting exact values. When exact values could not be determined, a study was set aside from further consideration. The ratio of the variance of treatment data over the variance of baseline data was calculated for each of the 12 studies. The winsorized mean (Sokal and Rohlf 1981, p. 413) of those ratios was 3.88:1 (i.e. the variance of treatment data is 3.88 times greater than that of baseline data), which is not surprising given the likelihood of a floor effect in baseline data.
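As a worked illustration of these two computations, the sketch below uses hypothetical numbers (not data from the 12 studies): it forms one treatment:baseline variance ratio, then applies a level-1 winsorization (replacing the single smallest and largest ratios with their nearest neighbours) before averaging, in the spirit of Sokal and Rohlf (1981):

```python
from statistics import mean, variance

def winsorized_mean(values, k=1):
    """Replace the k smallest and k largest values with the nearest
    remaining values, then take the ordinary mean."""
    s = sorted(values)
    trimmed = s[k:len(s) - k]
    clipped = [trimmed[0]] * k + trimmed + [trimmed[-1]] * k
    return mean(clipped)

# Hypothetical A-period and B-period observations for one subject:
# treatment data are visibly more variable than the stable baseline.
base = [2, 3, 2, 3, 2]
treat = [5, 6, 8, 7, 9]
ratio = variance(treat) / variance(base)   # treatment : baseline

# Hypothetical ratios from several studies; winsorizing tempers the
# influence of the one extreme value before averaging.
ratios = [1.9, 2.4, 3.1, 4.0, 15.2]
robust_average = winsorized_mean(ratios, k=1)
```

Winsorizing, rather than simply trimming, keeps the sample size constant while blunting outlying ratios, which is why it suits a small set of highly skewed variance ratios.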
First-order autocorrelation coefficients were calculated for each data series. When a study reported multiple baselines, the average of the first-order coefficients was calculated to represent the study. As a result, each of the 12 studies contributed a single (and therefore independent) estimate of first-order autocorrelation. The coefficients ranged from 0.397 to 0.874 with a mean of 0.626 and a standard deviation of 0.181. This estimate suggests a positive, moderate-to-high first-order autoregressive influence in aphasia-treatment time-series data.

When considered separately, different first-order autoregressive influences are found in baseline and treatment data. Eighteen of the 63 studies reported recoverable baseline-period raw data, which produced an average first-order autoregressive coefficient of −0.107 with a standard deviation of 0.327 and estimates ranging from −0.833 to 0.670. Twenty-one studies reported recoverable treatment-period raw data; in these the average first-order autoregressive coefficient was 0.272 with a standard deviation of 0.240 and estimates ranging from −0.174 to 0.713. Altogether, 22 studies contributed data to the calculation of one or another autoregressive coefficient (Salvatore 1976, Thompson 1983, Kearns 1985, Starch and Marshall 1986, Sullivan et al. 1986, Doyle et al. 1987, Warren et al. 1987, Coelho 1991, Conlon and McNeil 1991, Thompson et al. 1996).

Collectively, these findings indicate that: (a) treatment data are nearly four times more variable than are baseline data; (b) the magnitude of autocorrelation in baseline data is low on average but highly variable; (c) the degree of autocorrelation is greater in treatment data, where change is evident; and (d) when baseline and treatment data are combined, the change in performance from the A period to the B period substantially raises the magnitude of overall autocorrelation.

Eåect sizes
The quanti®cation and synthesis of single-subject outcomes was ®rst proposed by
Scruggs and Mastropieri (1987). Their approach has received much criticism (e.g.
White 1987, Allison and Gorman 1993) causing Scruggs and Mastropieri (1994) to
acknowledge that the procedure suåers some shortcomings. Recently, Busk and
Serlin (1992) proposed algorithms for estimating single-subject eåect sizes ; the
Busk and Serlin algorithms are grounded in the principles of conventional meta-
analysis. An eåect size indexes the magnitude of departure from the null state (i.e.
no meaningful change) in a set of experimental observations; it has no associated
probability. Because an eåect size is a scale-free index, its value is independent of
measurement scales and so its absolute size is meaningful and, as a result, it is
possible to compare directly the eåect sizes taken from diåerent studies.
The estimation of effect sizes (ES) for a meta-analysis of single-subject research
is described by Busk and Serlin (1992, pp. 197-198):

    ES = (x̄B − x̄A) / sA    (1)

where B and A designate treatment and baseline periods respectively, x̄ is the mean
of the data collected in a period and s is the corresponding standard deviation. Two
comments are germane. First, this is the most liberal of Busk and Serlin's estimates;
the 3:1 variance ratio of aphasia-treatment data (treatment:baseline) cannot justify
application of their more conservative algorithms. Also important is that meta-analysis
procedures for single-subject research are elementary compared to those
for group studies, and the two expressions of effect size are not comparable.
Nevertheless, the Busk and Serlin calculations quantify the magnitude of change
brought about by treatment and permit relative comparisons across single-subject
studies. It should also be noted that when there is no variability in baseline data (i.e.
baseline values are all equal), ES cannot be calculated.
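Equation (1) is simple enough to sketch in a few lines of Python. This is a minimal illustration of the Busk and Serlin calculation, not the authors' software; the function name and sample values are hypothetical, and the sample (n − 1) standard deviation is assumed for sA:

```python
def effect_size(baseline, treatment):
    """Busk & Serlin (1992) standardized mean difference:
    ES = (mean(B) - mean(A)) / sd(A), using the baseline SD only.
    Returns None when the baseline SD is zero (ES undefined)."""
    n = len(baseline)
    mean_a = sum(baseline) / n
    mean_b = sum(treatment) / len(treatment)
    # sample variance of the baseline period (n - 1 denominator assumed)
    var_a = sum((x - mean_a) ** 2 for x in baseline) / (n - 1)
    sd_a = var_a ** 0.5
    if sd_a == 0:
        return None  # all baseline values equal: ES cannot be calculated
    return (mean_b - mean_a) / sd_a
```

For a hypothetical series with baseline scores of 15-25 % correct rising to 50-65 % correct under treatment, the function returns an ES of roughly 9, a large effect on any of the scales discussed below.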

Effect sizes in aphasia-treatment literature

Effect sizes were calculated for each of 12 studies for which initial AB data were
recoverable. The 12 studies and their effect sizes are described in table 2. The effect
sizes in table 2 were calculated on observations of behaviours following direct
treatment. No generalization probe data were included in the calculations. Only the
first sequence of AB data contributed to the calculation of effect size. In cases of

Table 2. Single-subject design characteristics

Study                              Design             N   Treatment 1                   Treatment 2                             Effect size

Doyle and Goldstein (1985)         A, B, A (m)        2   HELPSS                        Generalization                              3.914
Kearns and Salmon (1984)           A, B, A, B         2   Auxiliary 'is' training       Reverse training                            8.003
Potter and Goodman (1983)          A, B, A            2   Humor augmented               --                                          2.178
Raymer and Thompson (1991)         A, B, A (m)        1   Verbal plus gestural          --                                         23.927
Raymer et al. (1993)               A, B, A (m)        4   Phonological                  --                                          2.007
Steele et al. (1989)               A, B, A            5   C-VIC                         --                                          2.611
Thompson and Byrne (1984)          A, B, A            3   Loose training                --                                          2.974
Thompson and McReynolds (1986)     A, B, BC, A† (m)   4   Direct production             A-V stimulation and direct production       5.117
Thompson et al. (1986)             A, B‡ (m)          3   Hypnosis plus imagery         --                                          3.821
Thompson and Shapiro (1994)        A, B, A            5   Specific linguistic training  --                                         11.037
Thompson et al. (1991)             A, B, A (m)        2   Phonological                  --                                          4.038
Wambaugh and Thompson (1989)       A, B, A (m)        4   Wh- is + nominative           --                                          5.833

Note: A designates a no-treatment period; B designates a treatment period; C designates a different treatment; (m) designates a multiple-baseline control;
C-VIC = computer-based visual-communication treatment; HELPSS = Helm's Elicited Language Training Program for Syntax Stimulation (Helm-Estabrooks
1981); A-V = audio-visual.
† This design was also reversed in the form of A, C, BC, A.
‡ One subject received a second treatment.
§ Subjects 1-3 in table 5.
R. R. Robey et al.
Single subject research 455

multiple baselines, an effect size was calculated for each target behaviour and the set
of effect sizes was averaged to achieve a single index for each study. The subjects
described in the 12 studies were mostly non-fluent aphasic individuals in the third
through eighth decades of life. With few exceptions, subjects were in the chronic
stage of recovery and demonstrated marked aphasia.
Notably, each effect size in table 2 is a large-sized effect (Glass 1976). Overall, the
magnitude of effect size estimates in table 2 compares favourably to analogous
effects reported in counselling literature (Busse et al. 1995). The extraordinarily
large effect size for Raymer and Thompson (1991) is attributed to the very small
variance from a floor effect in those baseline data (many zero-correct observations).
The reader is reminded that, because of differences in the mathematics, these effect
sizes cannot be compared to those from multiple-subject meta-analyses (Robey
1994, 1998). Furthermore, the equation used to obtain these effect sizes produces
liberal estimates. Nevertheless, table 2 demonstrates that single-subject treatment
effects can be quantified and that treatments for aphasia bring about appreciable
change. It is therefore unfortunate that table 2 can contain effects reported by only
19 % of the available studies.
Analysis

Single-subject research designs yield sequences of performance-over-time data
called time-series data. An interruption in the stream of data occurs through
experimenter manipulation of time as an independent variable (e.g. a period of no
treatment followed by a period of treatment). As a consequence, primary analyses
of single-subject aphasia-treatment clinical-outcome research centre on the
tenabilities of two null hypotheses (H0 subscripted seriatim):

    H01: b_slope,A = b_slope,B
    H02: b_level,A = b_level,B

Here, b_slope is a coefficient indexing the slope of the performance-over-time
plot of the data points within a certain period (e.g. the period of treatment
observations). Similarly, b_level is a coefficient indexing the magnitude of overall
performance within a period. The subscripts A and B designate no-treatment and
active-treatment periods respectively. The reader is referred to Bloom and Fischer
(1982, pp. 428-441) for a thorough discussion of changes in slope and level.
The research hypothesis associated with H01 asserts that the slope observed in a
no-treatment period accelerates in the following active-treatment period. Likewise,
the research hypothesis for H02 asserts that the overall level of performance
increases from a no-treatment period to the following active-treatment period.
When a subject is in the acute stage of recovery and spontaneous recovery causes
performance to increase over time, drawing the conclusion that treatment is
effective requires a rejection of H01 applied to the slope data as well as a rejection
of H02; that is, treatment must accelerate the rate of recovery and increase the
overall level of performance. When stability of baseline is evident, effectiveness
may be demonstrated with rejection of H02 only.
Only two analysis procedures providing a direct test of H01 and H02 are designed
to be insensitive to the negative effects of autocorrelation on statistical conclusion
validity: interrupted time-series analysis (ITSA) and an improved interrupted time-series
analysis (ITSACORR). For that reason, these two analyses are presented
first in this review. As will be seen, the latter is more applicable
[Figure 1 appears here: eight panels (A-H), each plotting percentage correct (ordinate, 0-100) against sessions (abscissa, 0-20), with each panel divided into baseline and treatment periods.]

Figure 1. Plots of the example data.

in aphasia-treatment research, but presentation of the former precedes as it eases
exposition.
Several other forms of analysis, including the familiar visual analysis, have been
proposed as means for evaluating the outcomes of single-subject quasi-experiments.
Research in statistics methodology and applications of probability theory

have demonstrated that the statistical conclusion validity of each of these
alternatives is, at best, an open question. Each of these analyses is reviewed in the
context of that literature.
The review is intended to: (a) provide a working understanding of each analysis;
and (b) clarify the reason for (not) selecting each as the analysis-of-choice. A
worked example is presented for every analysis procedure. Each example is a
solution for the fictitious set of percentage-correct data plotted in figure 1(A). For
the sake of simplicity, the examples consist of only two periods: the initial AB
sequence.
The numbers of observations in the example AB sequence were chosen for
consistency with typical clinical-aphasiology applications. Suppose the target
behaviour is production of a certain grammatic construction (or some speech act)
in response to a story-completion task. The data were created to exemplify an
unsatisfactory outcome: moving from the 20 %-correct level without treatment
(i.e. a behaviour that should readily respond to treatment) to a plateau performance
of only 60 % correct with treatment. All things being equal, a valid analysis should
therefore render a negative decision on outcome effectiveness, because an untreated
behaviour, occurring correctly in one of every five attempts, is raised to a level of
only three in five correct. However, two exceptions are possible. If a score of 60 %
correct constitutes a psychometrically valid threshold for desired status (e.g.
functional independence), then clinical importance is achieved. Alternatively, if a
provider and a carrier agree that 60 % correct constitutes a success criterion,
clinical importance is similarly achieved.
The eight data points in the first period have a mean and median of 20.00 and a
standard deviation of 5.35. The 11 data points in the second period have a mean of
44.55, a median of 50.00, and a standard deviation of 16.35. The first-order
autoregression coefficient for these data is 0.366, which is near the low end of the
range observed in actual data.
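For readers who wish to check the serial dependence of their own series, a lag-1 autocorrelation coefficient can be computed as follows. This is a minimal sketch using the simple moment estimator, not the output of any particular statistics package; the function name and demonstration series are hypothetical:

```python
def lag1_autocorr(series):
    """Lag-1 autocorrelation: covariance of the series with itself
    shifted by one observation, divided by the series variance."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in series)
    return num / den
```

A steadily rising series returns a strong positive coefficient, while a series alternating around its mean returns a negative one, which is why a change in performance across the A-B boundary inflates the overall autocorrelation of the combined stream.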

Analyses that are insensitive to autocorrelation

In this section, autocorrelation-insensitive analyses, those maintaining statistical
conclusion validity when data are autocorrelated, are discussed and illustrated with
worked examples.

Interrupted time-series analysis

Gottman (1981) defined the interrupted-time-series analysis (ITSA) for a stream of
serially-dependent time-series observations occurring throughout two experimental
periods. The experimental periods are manipulated and classified by an
experimenter. The most fundamental ITSA yields three statistical tests: (a) an F
test of the null hypothesis that no omnibus change occurred in the progression
from period 1 to period 2; (b) a t test for the null hypothesis that no change in slope
occurred between periods 1 and 2 (which is H01); and (c) a t test for the null
hypothesis that no change in level occurred between periods 1 and 2 (which is H02).
Accomplishing Gottman's original analysis requires a preliminary investigation
of the autoregressive influences in the data through an inspection of the
autocorrelation coefficients (i.e. lag 1, lag 2, and so forth), the partial autocorrelation
coefficients, and the autoregressive parameters. The inspection of the
autocorrelation and partial autocorrelation coefficients is straightforward and
easily obtained using SAS (SAS Institute Inc. 1994) or SPSS (SPSS Inc. 1994)

software. Inspecting the autoregressive parameters is another matter. The
autoregressive parameters (p_i) are the last i elements in the column vector B of the
general linear model (Timm 1975). The analyst iterates through solutions to the
matrix equation B = (X'X)^-1 X'Y, inspecting models having from one to five or
more autoregressive parameters. The analyst's task is to determine the smallest
number of autoregressive parameters necessary to fit the data. Fortunately, the
burden for the analyst has been eased considerably by research demonstrating that
this initial step (i.e. choosing an autoregressive model) is not critical for obtaining
correct decisions regarding the salient null hypotheses (e.g. Velicer and McDonald
1984, Harrop and Velicer 1985, Sharpley 1987).
Having subjectively determined the optimal number of influential autoregressive
parameters, the data are arranged in the X and Y matrices of the general linear
model to accomplish two analyses. Crosbie (1995, p. 377) provides the matrix-algebra
model for completing the first step: obtaining the error sums-of-squares
term when all sources of variance are accounted for in the X matrix. This term is
labelled SS1, where SS designates a sums-of-squares term. The matrix model for the
second step, obtaining an error sums-of-squares term when the distinction between
periods 1 and 2 is ignored (labelled SS0), is found in Gottman (1981, p. 392). Each
configuration of the data is submitted to the usual general linear model matrix
operations

    B = (X'X)^-1 X'Y
    Ŷ = XB
    E = Y − Ŷ
    SSE = E'E

to obtain their respective sums-of-squares-due-to-error term (SSE). Because Y is a
column vector in each case, SSE (i.e. SS1 and SS0) is always a scalar.
The overall F test is given by

    F = [(SS0 − SS1) / 2] / [SS1 / m]

where m = N − 4 − 2p degrees of freedom. Here, N is the total number of
observations in the two periods being contrasted and p is the number of
autoregressive parameters in the model. The value of F is evaluated on 2 and m
degrees of freedom. If statistical significance is achieved, the elements of B in the
SS1 model indexing the period-1 and period-2 levels (i.e. elements B1,1 and B1,3) are
contrasted by means of a t test (see Crosbie 1995, p. 378, equation 12.20). Similarly,
the elements of B indexing the period-1 and period-2 slopes (i.e. elements B1,2 and
B1,4) are contrasted. Lastly, these same four elements of B are entered into
Gottman's (1981, pp. 371-372) equations for generating a straight-line level-slope
function characterizing the fit of the autoregressive model to the obtained data.
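The two-model comparison can be sketched as follows. This illustrates only the SS1-versus-SS0 logic, not Crosbie's or Gottman's actual matrices: the design-matrix layout (separate intercept and slope per period versus a single line) is an assumption, and the autoregressive columns are omitted for clarity (i.e. p = 0, so m = N − 4):

```python
import numpy as np

def glm_sse(X, y):
    """Ordinary least-squares fit B = (X'X)^-1 X'y; returns the
    sum of squared errors E'E for the fitted model."""
    B, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ B
    return float(e @ e)

def overall_f(baseline, treatment):
    """Omnibus F contrasting a model with separate level and slope per
    period (SS1) against one that ignores the period boundary (SS0).
    Autoregressive terms are omitted here (p = 0), so m = N - 4."""
    y = np.asarray(baseline + treatment, dtype=float)
    n_a, n = len(baseline), len(baseline) + len(treatment)
    t = np.arange(1, n + 1, dtype=float)
    period = (t > n_a).astype(float)                    # 0 in A, 1 in B
    # full model: separate intercept and slope in each period
    X1 = np.column_stack([np.ones(n), t, period, period * t])
    # reduced model: one intercept and one slope for the whole series
    X0 = np.column_stack([np.ones(n), t])
    ss1, ss0 = glm_sse(X1, y), glm_sse(X0, y)
    m = n - 4
    return ((ss0 - ss1) / 2) / (ss1 / m), m
```

A series that jumps across the period boundary yields a large F, while one following a single line through both periods yields an F near zero, which is the comparison the omnibus test formalizes.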
Preliminary analyses of the example data suggested a single autoregressive
parameter (i.e. p = 1) as a best fit. The elemental structures of the X and Y matrices
giving the values of SS1 (i.e. 613.746) and SS0 (i.e. 1070.486) are displayed in figure
2. The resulting value of the overall F was 4.837. On 2 and 13 degrees of freedom,
the test is significant at the 0.05 level but not at the 0.01 level. However, neither the
separate t test for a difference in level (t = −0.053) nor the t test for a difference in
slope (t = −0.408) achieved statistical significance. If Type I error tolerance for the
overall F test is set at 0.05, the analysis suggests an omnibus difference from period
Figure 2. Normal equations for SS1 and SS0.

1 to period 2 which cannot be particularized as a robust change in either level or
slope. Figure 1(B) displays the autoregressive fit for these data. An annotated but
rudimentary SAS (PROC MATRIX) program for calculating the various tests is
available from the first author.
Gottman (1981) claimed that ITSA could maintain Type I error control with
series as short as 10 observations per period. The findings of Greenwood and
Matyas (1990) and those of Crosbie (1993) demonstrate that the claim is not
justified. While ITSA controls Type I error satisfactorily with long streams of data,
long streams are not a realistic expectation in clinical applications. As a result,
ITSA cannot be recommended for analysing aphasia-treatment data.

ITSACORR

Crosbie (1993) determined that ITSA algorithms underestimate positive autocorrelation
with short series of data and altered them accordingly. Crosbie (1993)
conducted Monte Carlo simulations to test the resulting Type I and Type II error
characteristics of the optimized algorithms. In general, the improved ITSA, named
ITSACORR, maintained Type I error at or below the nominal level with
satisfactory statistical power in the analysis of short-series data (i.e. fewer than 50
observations per period). The exception occurred when autocorrelation exceeded
0.6 and sample sizes were less than 20. Crosbie (1993) recommended a general
minimum of 10 observations per period, although the greater the number of
observations is, the more accurately autocorrelation is estimated.
ITSACORR is simpler in use than ITSA because it does not require an initial
fitting of autoregressive parameters. Like ITSA, ITSACORR yields a test of
overall change, a test of change in slope, and a test of change in level. The first is
a preliminary test for preserving Type I error control: if the overall test does not
achieve statistical significance, the tests of H01: b_slope,A = b_slope,B and H02:
b_level,A = b_level,B are not interpreted. With multiple-baseline studies in which
ITSACORR would be applied to each iteration of the basic design, preserving
Type I error control would require the additional step of setting an experiment-wise
(see Maxwell and Delaney 1990) Type-I-error tolerance (α) to be divided
equally among each of the ITSACORR applications. It should be noted that the
directional null hypotheses make possible the use of one-tailed α levels.
The ITSACORR F test for the null hypothesis of overall change in the example
data is 1.650 on 2 and 14 degrees of freedom, which does not achieve statistical
significance. The ITSACORR fit for the example data is displayed in figure 1(C).
If the example data are made more consistent with outcomes observed in aphasia-treatment
literature by lowering the baseline observations to centre around the
10 %-correct level and altering the treatment observations to culminate at the
90 %-correct level (see figure 1(D)), the sensitivity of ITSACORR to appreciable
changes is made evident. For these data, the overall F is 5.910 with an exact
probability of 0.014. The t test for a change in level is also significant (i.e. t = 3.341,
p = 0.005); the t test for a change in slope does not achieve statistical significance
(i.e. t = 0.187, p = 0.855). That is, the overall change from period 1 to period 2 in
figure 1(D) is particularized as a change in level but not in slope.
The superiority of ITSACORR among the various analysis alternatives is given
by three large and important advantages. First, ITSACORR provides scientifically
valid evidence on the tenability of change using numbers of observations that are
realistic in clinical applications. Second, the products of ITSACORR (i.e. F tests
and t tests) have well-known noncentrality parameters (i.e. δ) and so the synthesis
of findings by means of advanced meta-analysis procedures becomes possible.
Third, Crosbie (1993, 1995) makes available versatile software for carrying out the
necessary calculations; the program is extraordinary in ease of use.

Analyses affected by autocorrelation

The analyses in this section are neither insensitive nor robust to the presence
of autocorrelation in time-series data.
Visual analysis

The subjectivity and resulting inconsistency of decisions regarding the presence/absence
of a treatment effect, as determined by the appearance of a graph, has long
been suspected as a shortcoming of single-subject research (Kazdin 1982). Since the
late 1970s, much research has been conducted to assess the reliability of visually-determined
decisions about single-subject data. Recently, Ottenbacher (1993)
conducted a meta-analysis of this visual-analysis literature and found an overall
inter-rater agreement coefficient of only 0.58. Defenders of subjective visual
analysis (e.g. Parsonson and Baer 1992) point to imperfections in studies addressing
the question of inter-rater reliability to support their claim that low reliability is an
open question. However, they do not offer empirical evidence demonstrating
satisfactory reliability; their claim is based on criticism rather than positive
evidence.
In 1978, Jones et al. compared two analyses of the same data: visual inspection
and time-series analysis. In general, the rate of agreement between the outcomes of
the two analyses was only slightly greater than chance. Furthermore, as the degree
of autocorrelation increased, visual raters became increasingly inconsistent while
the time-series analysis retained reliability.
DeProspero and Cohen (1979) examined the interjudge reliability for visual
analysis of 250 journal editors and manuscript reviewers with expertise in single-subject
research. They found a reliability coefficient of just 0.61. Furlong and
Wampold (1982) examined another editorial board to determine the source of inter-rater
inconsistency. They found that raters focused on inter-period differences of
level and slope with little consideration of variability as a source for explanation.
Ottenbacher (1990b) found that 61 raters disagreed 'considerably' when changes
in variability or changes in slope characterized single-subject data. Consistent with
Jones et al. (1978), Matyas and Greenwood (1990) found that positive autocorrelation
and random variation contribute to misinterpretation of visual displays
of single-subject data. Furthermore, Matyas and Greenwood (1990) found a Type I
error rate (i.e. concluding an effect is present when none exists) for visual analyses
that ranged from 16 % to 84 %. The situation is considerably improved when data
are not autocorrelated. Bobrovitz and Ottenbacher (1998) found good agreement
between visual analysis and statistical analysis for autocorrelation-equal-zero data
with moderate or larger treatment effects.
Split-middle trend line

A split-middle trend line is obtained by first dividing a period of data (e.g. the
baseline period) in half so that an equal number of data points lie on either side of
a vertical dividing line. The median of the data in each half of the period is then
calculated. Each half of the period is halved by another vertical dividing line so
that the period is segmented into quarters. The value of the first median (i.e. of the
data in the first half of the period) is plotted on the first-quarter reference line; the
second median is plotted on the third-quarter reference line. A straight line
extending throughout the entire period is drawn through the two median data
points. If an equal number of data points do not appear on either side of the
connecting line, a second line, parallel to the first and achieving the even split, is
drawn. The equal dividing line is termed the split-middle trend line. It is
extrapolated into the next period (e.g. the treatment period) as a dashed or dotted
line (see Kazdin 1982, p. 311f).
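The construction above can be sketched in Python. This is a hypothetical helper rather than any published procedure verbatim; with center='mean' the same construction yields the celeration line discussed in the following subsection, and odd-length periods simply drop the centre point here for simplicity:

```python
def trend_line(period, center="median"):
    """Split-middle (median-based) or celeration (mean-based) trend line
    for one period of equally spaced observations.  Returns (slope,
    intercept) of the line through the first- and third-quarter points,
    with sessions numbered from 1."""
    half = len(period) // 2
    first, second = period[:half], period[-half:]

    def middle(vals):
        if center == "median":
            s = sorted(vals)
            k = len(s)
            return s[k // 2] if k % 2 else (s[k // 2 - 1] + s[k // 2]) / 2
        return sum(vals) / len(vals)

    # abscissa positions of the quarter reference lines
    x1 = (1 + half) / 2                                  # midpoint of first half
    x2 = (len(period) - half + 1 + len(period)) / 2      # midpoint of second half
    y1, y2 = middle(first), middle(second)
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1
```

For an eight-point baseline this places the quarter reference lines at 2.5 and 6.5 on the abscissa, matching the worked example below.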
A split-middle trend line fitted for the original example data is found in figure
1(E). In this example, the baseline data are separated into two equal portions, each
consisting of four data points. The median of each portion was 20.0. The first-quarter
reference line occurs at 2.5 on the abscissa; the third-quarter reference line
occurs at 6.5. In this case, the split-middle line is a horizontal line through 20.0 on
the ordinate which extends into the treatment period as a dotted line. To contrast
the level and slope of the two adjacent periods of data, a split-middle line was
drawn for the second-period data (i.e. the solid line through the treatment period).
The split-middle analysis of the example data supports the conclusion that change
occurred.
Hojem and Ottenbacher (1988) examined the value of adding the split-middle
trend line to graphs of single-subject data. Their results suggested that adding the
line may improve the inter-rater reliability of visual analysis somewhat. Johnson and
Ottenbacher (1991) examined the effect on inter-rater agreement of adding a split-middle
trend line to graphs of single-subject data and found only marginal utility in
adding the trend line; overall inter-rater agreement rose to an average of 0.76 when
a split-middle line was included in the graph. Similarly, Ottenbacher and Cusick
(1991) found that the inconsistency of visual analyses without benefit of reference
lines (i.e. inter-rater reliability of 0.54) marginally diminished with the presence of
a split-middle trend line (i.e. inter-rater reliability raised to 0.67), but remained
unacceptable.

Celeration trend line

The procedures for obtaining a celeration line (White 1977) are identical to those
for obtaining a split-middle trend line except that means are used in place of
medians. In the example data, the mean of the first four data points in the baseline
period was 17.50 and the mean of the last four data points was 22.50. In figure 1(F),
a solid line is drawn through these points in the baseline period and is extended into
the treatment period as a dotted line. As in the case of the split-middle trend line,
a separate celeration trend line was fitted through the treatment-period data to
contrast slopes and levels (see Bloom and Fischer 1982, p. 443f). Like the split-middle
analysis, the celeration-line analysis supports the conclusion that change
occurred.
Stocks and Williams (1995) found that celeration lines improved the accuracy of
individuals rating plots of single-subject data only when subject performance
deteriorated in the treatment period. Otherwise, the presence of a celeration line
did not affect rater accuracy. It is clear that such an outcome is undesirable and
should not occur in a well-designed and well-implemented study of aphasia
treatment. Of particular interest, Stocks and Williams (1995) found a 16 % to 42 %
false-positive rate, that is, Type I errors or declaring change when none had
occurred.

Regression trend line

Figure 1(G) displays a plot of linear regression trend lines through each period of
the example data. Each line is simply a function of predicted values plotted over
time. The predicted values are obtained by regressing scores on sessions separately
for each period. The functions are compared visually for differences in level and
slope.
The raw coefficients for the baseline data are b0 = 16.786 (i.e. intercept) and
b1 = 0.714 (i.e. regression coefficient); for the treatment data they are b0 = −16.545
and b1 = 4.364. The regression-line analysis of the example data leads to
a conclusion that change occurred.
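A per-period fit of this kind can be sketched as follows. This is a hypothetical helper applying ordinary least squares to one period at a time, with sessions assumed to be numbered from 1 as in the worked example:

```python
def regression_trend(period, start_session=1):
    """Ordinary least-squares line (intercept b0, slope b1) fitted to one
    period of scores, with sessions numbered from start_session."""
    n = len(period)
    xs = range(start_session, start_session + n)
    mx = sum(xs) / n
    my = sum(period) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, period))
    b1 = sxy / sxx          # regression coefficient (slope)
    b0 = my - b1 * mx       # intercept
    return b0, b1
```

Fitting each period separately and comparing the two (b0, b1) pairs is exactly the visual comparison of level and slope described above.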
Ottenbacher (1990b) found little utility for the inclusion of linear-regression
lines in plots of single-subject data. Ottenbacher (1993) confirmed the finding of
little utility for the inclusion of linear-regression trend lines with one exception.
Ottenbacher (1993) found that raters who were familiar with the mathematics and
rationale of regression analysis did benefit from the inclusion of a regression trend
line in the graphing of single-subject data. Under those circumstances, the
coefficient of inter-rater agreement rose to 0.76 but remained unsatisfactory.

Shewart-chart trend line

The Shewart (1931) procedure (see Bloom and Fischer 1982, Krishef 1991) is a
means for detecting a change in level only. The mean and standard deviation of the
baseline data are calculated. Two horizontal reference lines, at two standard
deviations above and below the mean, are drawn across the baseline period and
extended throughout the treatment period. Two successive data points in the
treatment period falling outside one of these bounds indicate significant change.
Figure 1(H) is a Shewart chart for the example data. The reference lines were drawn
at 20 ± 10.69. The chart indicates a significant change in level. The reliability and
validity of the Shewart-chart procedure have not been systematically investigated.
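The decision rule can be sketched as follows. This is a hypothetical helper assuming the sample (n − 1) standard deviation, which is consistent with the 20 ± 10.69 bands reported above for a baseline standard deviation of about 5.35:

```python
def shewart_change(baseline, treatment, k=2.0):
    """Shewart-chart check: draw bands k standard deviations above and
    below the baseline mean, and flag a change when two successive
    treatment points fall outside the same band."""
    n = len(baseline)
    mean = sum(baseline) / n
    sd = (sum((x - mean) ** 2 for x in baseline) / (n - 1)) ** 0.5
    upper, lower = mean + k * sd, mean - k * sd
    for prev, cur in zip(treatment, treatment[1:]):
        if (prev > upper and cur > upper) or (prev < lower and cur < lower):
            return True
    return False
```

Note that the rule signals only on two successive out-of-band points in the same direction; a single excursion does not count.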

Binomial tests

Two applications of the well-known binomial test (Siegel 1956) have been
proposed as easily obtained tests of the null hypothesis of no change in level across
period boundaries. One application is based upon the split-middle trend line and
the other is based upon the celeration trend line. In each case, the binomial test is
carried out by comparing the proportion of second-period data points falling
above/below the extrapolation of the baseline trend line.
In figures 1(E) (i.e. split-middle trend line) and (F) (i.e. celeration trend line),
only one data point falls below the dotted line extending through the treatment
period. In each case, then, the proportion of data points below the extended line is
0.0909 and the proportion above is 0.9091. As a result, the following calculations
apply for both figures 1(E) and (F).
Cohen (1988) points out that the density of the distribution of proportions is not
uniform, and so the area under the sampling-distribution curve between, say, 0.60
and 0.65 (near the centre of the distribution) is not equivalent to the area between
0.90 and 0.95 (in the extremity of the tail). Therefore, Cohen recommends that the
test of the null hypothesis that one population proportion (P) equals another is
transformed from

    H0: P baseline = P treatment

to

    H0: φ baseline = φ treatment

where

    φ_x = 2 arcsin √P_x

with arcsin expressed in radians (rather than in degrees). As an example, consider
the sample estimates (i.e. p̂ and φ̂) obtained in the baseline period. In this case,

    φ̂ baseline = 2 arcsin √p̂ baseline
               = 2 arcsin √0.0909
               = 2 arcsin (0.3015)
               = 2(0.3063)
               = 0.6126

The test statistic for testing the null hypothesis is given by |0.6126 − 2.5290|, which
equals 1.9165. For a two-tailed test with a Type I error tolerance of 0.01, the value
1.9165 is compared to the critical value given by Cohen (1988, p. 192): 1.098. Since
the obtained test size (1.9165) exceeds the critical value (1.098), the null hypothesis
is rejected; the result indicates change.
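Cohen's transformation and the worked figures above can be reproduced directly in a few lines (radians throughout; the final comparison against Cohen's tabled critical value proceeds as in the text):

```python
import math

def phi(p):
    """Cohen's (1988) arcsine transformation of a proportion:
    phi = 2 * arcsin(sqrt(p)), expressed in radians."""
    return 2 * math.asin(math.sqrt(p))

# Worked example from the text: 1 of 11 treatment-period points falls
# below the extended baseline line (0.0909) and 10 of 11 fall above
# (0.9091); the test statistic is the absolute difference of the two
# transformed proportions, approximately 1.9165.
difference = abs(phi(0.0909) - phi(0.9091))
```

The transformation stabilizes the variance of proportions, which is why 0.0909 and 0.9091 are compared on the φ scale rather than directly.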
Crosbie (1987) conducted a Monte Carlo study to estimate the Type I error
properties of each application (i.e. split-middle and celeration lines) under likely
conditions of autocorrelation. Crosbie found that the actual Type I error rate of the
binomial test exceeded the nominal level when data were autocorrelated. That is,
the binomial test is unacceptably liberal when applied to autocorrelated data. As a
result, neither application of the binomial test (i.e. the split-middle or the
celeration lines) renders a valid test of the null hypothesis (i.e. of no change in
level).

Analysis of variance

For behavioural scientists, an intuitive approach to the analysis of single-subject
data might be the application of a t test (i.e. √F) to compare data from two
different periods for mean difference. However, this approach has two inherent
problems. First, the analysis of variance (ANOVA) model assumes that observations
are independent of one another (i.e. the errors are not correlated). Since the
data points all derive from a single subject, the assumption is not tenable. In
addition, Scheffé (1959) showed mathematically that F is not robust to violations
of the independence assumption. Furthermore, Phillips (1983), Toothaker et al.
(1983), Sharpley and Alavosius (1988), and Suen et al. (1990) have all demonstrated
through Monte Carlo simulation experiments that ANOVA is not at all robust
when the data are even slightly autocorrelated. Autocorrelation causes the test to
be prohibitively liberal in terms of Type I error control. The value of t for the
example data is 4.06 on 17 degrees of freedom, yielding a probability of p < 0.001.

Randomization tests

Several variations of the randomization test have been proposed as solutions to the
analysis difficulties in single-subject research (Edgington 1987). Revusky's Rn test
is one such test. Originally, the algorithm tested only for changes in level (Franzen
and Iverson 1990). Wolery and Billingsley (1982) added mathematics to test for
changes in slope. However, both the original and enhanced procedures require an
experiment that is not interesting: each of four subjects must receive treatments
administered in random order. As a result, the test is beside the point for most
clinical single-subject research applications, including the example data; the test is
meant for a different design altogether.

The C statistic
Tryon (1982) applied the C statistic to single-subject data to contrast the slope of
the baseline data with the slope of the data obtained during the treatment period.
The value of C is given by

C = 1 − [ Σ_{i=1}^{n−1} (x_i − x_{i+1})² ] / [ 2 Σ_{i=1}^{n} (x_i − x̄)² ]

where x_i is the ith data point in the combined stream of n data points in periods 1 and 2. The ratio of C over its standard error yields the standard unit-normal deviate Z, which gives the probability value for assessing the tenability of the null hypothesis. The value of C for the example data is −0.2025 (Z = −0.8058), which does not achieve statistical significance (Tryon 1982). Crosbie (1989) conducted a Monte Carlo analysis examining the Type I error characteristics of the C statistic; the actual Type I error rates of the C statistic were prohibitively high when applied to autocorrelated data.
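In code, C and its Z value take only a few lines. This sketch (the function name `tryon_c` is ours) uses the standard error sqrt((n − 2)/((n − 1)(n + 1))) given by Tryon (1982):

```python
import math

def tryon_c(series):
    """Tryon's C statistic and its Z value for a combined baseline-plus-treatment series."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - series[i + 1]) ** 2 for i in range(n - 1))
    den = 2 * sum((x - mean) ** 2 for x in series)
    c = 1 - num / den
    se = math.sqrt((n - 2) / ((n - 1) * (n + 1)))  # Tryon's standard error of C
    return c, c / se

# A strongly trending (hypothetical) series yields a large positive C:
c, z = tryon_c([1, 2, 3, 4, 5, 6, 7, 8])
print(round(c, 3), round(z, 3))  # 0.917 2.97
```

Crosbie's (1989) point stands regardless of implementation: with autocorrelated data, the nominal Z does not hold its Type I error rate.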

Analyses reported in aphasia-treatment literature


Fifty-six of the 63 aphasia-treatment single-subject studies reported a visual analysis. Of the remaining seven studies, two reported modified Shewhart charts (±1 standard deviation), one reported a celeration line with an accompanying binomial test as well as a C statistic, and four studies reported a t test. None of these analyses can provide adequate assurance on statistical conclusion validity since single-subject aphasia-treatment data are clearly autocorrelated.

Discussion
For the most part, single-subject aphasia-treatment research designs are hypothesis-driven and well controlled. Seventy-eight per cent (i.e. 49) of the 63 studies tested only one treatment, and thereby precluded a carry-over effect. Every one of these 49 studies comprised multiple baseline controls, or withdrawal controls, or both. The great advantage of including a withdrawal period in the sequence is the opportunity to test two important effects: direct treatment and maintenance.
466 R. R. Robey et al.

Typically, the number of initial baseline observations has been insufficient for conducting valid analyses of change. Single-subject research must optimize on the extended baseline series that are possible with multiple-baseline designs. It is highly desirable, for example, to re-order multiple target behaviours for each subject whenever possible. By changing the order in which target behaviours are introduced in each replication of the basic design, an extended baseline is established for each target.
The elementary estimates of effect sizes in table 2 clearly establish that quantitatively-based analysis can capture the outcomes of single-subject quasi-experiments and can express them in standard scientific terms. Although changes brought about by aphasia treatment in single-subject quasi-experiments appear robust, few studies reported the mathematical details required for calculating an effect size. When calculable, effect sizes were large. The lesson is clear: single-subject aphasia-treatment studies merit and require the quantified outcomes expected by the greater clinical-outcome research community, including reimbursers. The interests of the profession and practitioners will be well served by demonstrably valid and reliable applications of hypothesis testing logic.
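One elementary estimate of this kind standardizes the change in level against baseline variability, d = (treatment mean − baseline mean) / baseline SD (Busk and Serlin 1992). A minimal sketch with hypothetical scores:

```python
import statistics

def single_case_d(baseline, treatment):
    """Standardized change in level for one subject:
    (treatment mean - baseline mean) / baseline standard deviation."""
    return (statistics.mean(treatment) - statistics.mean(baseline)) / statistics.stdev(baseline)

# hypothetical per-session scores for one subject
baseline = [10, 12, 11, 9, 13]
treatment = [30, 34, 36, 33, 38]
d = single_case_d(baseline, treatment)
print(round(d, 2))  # 14.67
```

Because the denominator is the subject's own baseline variability rather than between-subject variability, such values are not directly comparable to group-study benchmarks, which is one reason single-case effect sizes run so large.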
Certainly, visual analyses should not be abandoned. As Franklin et al. (1996) point out, visual analyses of single-subject data are necessary descriptive tools and statistical analyses are necessary inferential tools. Taken together, conservative and consistent visual analyses combined with valid statistical analyses capture clinical significance as well as statistical significance; interpretation requires both (Bloom and Fischer 1982, Johnston et al. 1995, Franklin et al. 1996).
Single-subject aphasia-treatment data are clearly autocorrelated. Therefore, the means for testing the two null hypotheses of aphasia-treatment outcome must be insensitive to autocorrelation. That restriction narrows the field of analysis alternatives to two: ITSA and ITSACORR. Because the number of observations per period is typically low in single-subject aphasia-treatment studies, many fewer than 50 per period, ITSACORR is preferred to ITSA. Until it is surpassed by future developments, ITSACORR should be the procedure-of-choice, and essentially the standard, for applying hypothesis testing logic to single-subject data.
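Checking a series for the autocorrelation that drives this choice takes only the conventional lag-1 estimator (whose small-sample bias Huitema and McKean 1991 examine). A sketch with hypothetical data:

```python
def lag1_autocorrelation(series):
    """Conventional lag-1 autocorrelation estimate for a short series."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

# a steadily rising (hypothetical) series is strongly autocorrelated
r1 = lag1_autocorrelation([2, 3, 3, 4, 5, 5, 6, 7])
print(round(r1, 2))  # 0.56
```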
A comment and a caveat regarding ITSACORR are necessary. As in all areas of research, there is likely to be an evolution of ITSACORR algorithms or programming or both. The user and consumer of ITSACORR should therefore expect further developments in the analysis of short-series single-subject data. The caveat relates to the F and t tests reported by ITSACORR. It is important to realize that the rejection of an idiographic null hypothesis (Grossman 1986, Ottenbacher 1990), say H₀₂: b_level A ≥ b_level B, warrants only the inference that change has occurred in one human being; it provides no basis for an inference regarding a clinical population. For that reason, single-subject research designs are the preferred means for testing the effectiveness of treatment provided to a particular individual. The generality of treatment effectiveness for a population must be assessed through a synthesis of many similarly focused and similarly designed single-subject quasi-experiments, a meta-analysis. The same statement regarding generality applies when single-subject designs are used to assess treatment effectiveness in terms of, for example, variations in treatment methods, sub-populations, or different service-delivery models (see Robey and Schultz 1998).
Writings on single-subject research often assert that a compelling concert of
evidence will result through the absolutely vital practice of systematic replication.
Most often, however, those writings do not mention how this evidence is to be formed through combination and synthesis to achieve a broadly based, and so general, conclusion. That is, given extensive and successful systematic replication, how are the individual outcomes combined to form a generalized finding? By definition, one would not combine single-subject data from several studies to obtain a 'group' estimate of central tendency, nor would one desire to do so. What needs combining is not single-subject data but the outcomes of single-subject quasi-experiments bearing on a common research question.
A 'vote counting' approach applied to a series of subjective decisions (i.e. visual analyses) could yield nothing more than a collective subjective conclusion formed out of many individual subjective decisions. It could not satisfy the demand that conclusions be scientifically rigorous in applying current statistics and probability theory. Acceptable procedures for combining experimental results require estimates of effect size that can be meaningfully combined and synthesized (Robey 1997). The accepted means for synthesizing research outcomes bearing on a research question is through meta-analysis: a set of mathematical procedures for estimating an average effect size, with its associated confidence interval, from all available evidence (Hunter and Schmidt 1990, Hall et al. 1994, Petitti 1994). Increasingly, researchers in other clinical sciences recognize the value of meta-analysis for objectively synthesizing individual results to achieve generality in conclusion (e.g. Wilson et al. 1996, Wurthmann et al. 1996). Certainly, the procedures for conducting meta-analyses of single-subject research are elementary at this point but, as certainly, the technology will advance. Aphasiologists will optimize on these advances and the associated benefits only if the products of future single-subject research are quantified. Only one of three measures is necessary to meet this criterion: (a) report statistical tests such as those obtained through ITSACORR; (b) report raw data; or (c) construct performance-over-time plots with sufficient resolution to permit data retrieval.
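The core operation of such a synthesis can be sketched as an inverse-variance weighted mean effect size with its confidence interval (the effect sizes and variances below are hypothetical; the cited texts develop fuller procedures):

```python
import math

def fixed_effect_summary(effects, variances):
    """Inverse-variance weighted mean effect size with a 95% confidence interval."""
    weights = [1 / v for v in variances]
    mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))  # standard error of the weighted mean
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

# hypothetical effect sizes and variances from four single-subject quasi-experiments
mean_d, (lo, hi) = fixed_effect_summary([1.8, 2.4, 2.0, 3.1], [0.40, 0.55, 0.35, 0.60])
print(f"mean d = {mean_d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Precise studies (small variances) pull the summary toward their estimates; the confidence interval narrows as evidence accumulates, which is what gives the synthesis its inferential force.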
There is a great and increasing need for studies testing whether interventions are applicable to persons other than those tested, with the reasonable expectation that the interventions will be generally effective. It seems obvious that, for purposes of documenting or justifying the effectiveness of therapeutic intervention, it is in the professional interest to make the most powerful arguments that one can, using both group and individual study data wherever possible. Results that can be generalized are required, then, both from studies of groups and of single subjects. For either class, individual or group, care is required in design and in implementation so that the results can be entered into meaningful combinations with other studies and so increase the potency of the results. Incorporating valid quantitative analyses of single-subject data in research reports will position the profession to assess the generality of individual outcomes and so optimize the warrant for asserting treatment effectiveness.

Summary
It is clear that the large majority of aphasia-treatment single-subject quasi-experiments are hypothesis-driven tests of single treatments utilizing multiple baseline controls or withdrawal controls or both. It is also clear that single-subject aphasia-treatment designs yield short series of autocorrelated data embodying appreciable effects of the independent variable: the crossing over from no treatment to treatment. Such designs and data deserve programmatic hypothesis testing through valid and reliable means.
Visual analyses of single-subject data do not produce outcomes satisfying the requirements of conservative science: objective and apparent operations, open to public inspection and replication, on which to base decisions. Furthermore, visual analyses do not meet the usual requirement for reliability. Over a period of nearly two decades, a considerable body of empirical evidence on the question of reliability has accumulated, and the finding of unsatisfactory reliability has been consistent. The often-heard counter-argument to low reliability is that visual analysis leads only to conservative decisions regarding the presence/absence of an effect, so that only large effects are acknowledged. Meanwhile, the finding that 16–84% of visual analyses yield false-positive decisions, or Type I errors (Matyas and Greenwood 1990), casts serious doubt on this otherwise unsubstantiated claim.
The available evidence suggests that effect sizes for treatment of aphasia, as indexed by single-subject research, are remarkably large. But integrating the outcomes of single-subject research to form a coherent concert (i.e. establish the weight of all single-subject scientific evidence) requires a meta-analysis. At present, only 19% of single-subject studies present quantifiable outcomes. Said differently, the capacity for single-subject designs to produce standard evidence regarding the effectiveness of a treatment for an individual is largely unrealized. Combined with thoughtfully selected null hypotheses, ITSACORR can produce quantitative outcomes in the form of valid and reliable applications of hypothesis testing logic. Those valid and reliable quantities will be necessary if single-subject research is to realize fully its potential for producing converging scientific evidence.

Acknowledgement
The authors thank Drs Robert S. Barcikowski, John W. Lloyd, and Tonya R.
Moon for their helpful comments regarding drafts of this manuscript.

References
Allison, D. B. and Gorman, B. S. 1993, Calculating effect sizes for meta-analysis: The case of the single-case. Behaviour Research and Therapy, 31, 621–631.
Backman, C. L., Harris, S. R., Chisholm, J.-A. M. and Monette, A. D. 1997, Single-subject research in rehabilitation: A review of studies using AB, withdrawal, multiple baseline, and alternating treatments designs. Archives of Physical Medicine and Rehabilitation, 78, 1145–1153.
Bloom, M. and Fischer, J. 1982, Evaluating Practice: Guidelines for the Accountable Professional (Englewood Cliffs, NJ: Prentice-Hall).
Bobrovitz, C. D. and Ottenbacher, K. J. 1998, Comparison of visual inspection and statistical analysis of single-subject data in rehabilitation research. American Journal of Physical Medicine and Rehabilitation, 77, 94–102.
Busk, P. L. and Marascuilo, L. A. 1988, Autocorrelation in single-subject research: A counter-argument to the myth of no autocorrelation. Behavioral Assessment, 10, 229–242.
Busk, P. L. and Serlin, R. C. 1992, Meta-analysis for single case research. In T. R. Kratochwill and J. R. Levin (Eds) Single-Case Research Design and Analysis (Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.), pp. 197–198.
Busse, R. T., Kratochwill, T. R. and Elliott, S. N. 1995, Meta-analysis for single-case consultation outcomes: Applications to research and practice. Journal of School Psychology, 33, 269–285.
Campbell, D. T. and Stanley, J. C. 1963, Experimental and Quasi-Experimental Designs for Research (Boston: Houghton Mifflin Co.).
Coelho, C. A. 1991, Manual sign acquisition and use in two aphasic subjects. In M. L. Lemme (Ed.) Clinical Aphasiology, vol. 19 (Austin, TX: Pro-Ed), pp. 209–218.†
Cohen, J. 1988, Statistical Power Analysis for the Behavioral Sciences (2nd edn) (Hillsdale, NJ: Lawrence Erlbaum).
Conlon, C. P. and McNeil, M. R. 1991, The efficacy of treatment for two globally aphasic adults using visual action therapy. In M. L. Lemme (Ed.) Clinical Aphasiology, vol. 19 (Austin, TX: Pro-Ed), pp. 185–195.†
Connell, P. J. and Thompson, C. K. 1986, Flexibility of single-subject experimental designs. Part III: Using flexibility to design or modify experiments. Journal of Speech and Hearing Disorders, 51, 214–225.
Conners, C. K. and Wells, K. C. 1982, Single-case designs in psychopharmacology. In A. E. Kazdin and A. H. Tuma (Eds) Single-Case Research Designs (San Francisco, CA: Jossey-Bass, Inc.), pp. 61–77.
Cook, T. D. and Campbell, D. T. 1979, Quasi-Experimentation: Design and Analysis Issues for Field Settings (Boston: Houghton Mifflin).
Cook, T. D. and Shadish, W. R. 1994, Social experiments: Some developments over the past fifteen years. Annual Review of Psychology, 45, 545–580.
Crosbie, J. 1987, The inability of the binomial test to control Type I error with single-subject data. Behavioral Assessment, 9, 141–150.
Crosbie, J. 1989, The inappropriateness of the C statistic for assessing stability or treatment effects with single-subject data. Behavioral Assessment, 11, 315–325.
Crosbie, J. 1993, Interrupted time-series analysis with brief single-subject data. Journal of Consulting and Clinical Psychology, 61, 966–974.
Crosbie, J. 1995, Interrupted time-series analysis with short series: Why is it problematic; how can it be improved. In J. M. Gottman (Ed.) The Analysis of Change (Mahwah, NJ: Lawrence Erlbaum Associates), pp. 361–395.
Davis, G. A. 1978, The clinical application of withdrawal, single-case research designs. In R. H. Brookshire (Ed.) Clinical Aphasiology, vol. 8 (Minneapolis, MN: BRK Publishers), pp. 11–19.
DeProspero, W. and Cohen, S. 1979, Inconsistent visual analysis of intrasubject data. Journal of Applied Behavior Analysis, 12, 573–579.
Doyle, P. J. and Goldstein, H. 1985, Experimental analysis of acquisition and generalization of syntax in Broca's aphasia. In R. H. Brookshire (Ed.) Clinical Aphasiology, vol. 15 (Minneapolis, MN: BRK Publishers), pp. 205–213.*
Doyle, P. J., Goldstein, H. and Bourgeois, M. S. 1987, Experimental analysis of syntax training in Broca's aphasia: A generalization and social validation study. Journal of Speech and Hearing Disorders, 52, 143–155.†
Edgington, E. S. 1987, Randomized single-subject experiments and statistical tests. Journal of Counseling Psychology, 34, 437–442.
Eick, T. J. and Kofoed, L. 1994, An unusual indication for a single-subject clinical trial. The Journal of Nervous and Mental Disease, 182, 587–590.
Franklin, R. D., Gorman, B. S., Beasley, T. M. and Allison, D. B. 1996, Graphical display and visual analysis. In R. D. Franklin, D. B. Allison and B. S. Gorman (Eds) Design and Analysis of Single-Case Research (Mahwah, NJ: Lawrence Erlbaum), pp. 119–158.
Franzen, M. D. and Iverson, G. L. 1990, Applications of single subject design to cognitive rehabilitation. In A. M. Horton (Ed.) Neuropsychology Across the Life-Span: Assessment and Treatment (New York: Springer Publishing Co.), pp. 155–174.
Frattali, C. M. 1998, Outcomes measurement: definitions, dimensions, and perspectives. In C. M. Frattali (Ed.) Measuring Outcomes in Speech-Language Pathology (New York: Thieme), pp. 1–27.
Fukkink, R. 1996, The internal validity of aphasiological single-subject studies. Aphasiology, 10, 741–754.
Furlong, M. J. and Wampold, B. E. 1982, Intervention effects and relative variation as dimensions on experts' use of visual inference. Journal of Applied Behavior Analysis, 15, 415–421.
Glass, G. V. 1976, Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3–8.
Gottman, J. M. 1981, Time-Series Analysis: A Comprehensive Introduction for Social Scientists (Cambridge: Cambridge University Press).
Greenwood, K. M. and Matyas, T. A. 1990, Problems with the application of interrupted time series analysis for brief single-subject data. Behavioral Assessment, 12, 355–370.
Grossman, K. E. 1986, From idiographic approaches to nomothetic hypotheses: Stern, Allport, and the biology of knowledge, exemplified by an exploration of sibling relationships. In J. Valsiner (Ed.) The Individual Subject in Scientific Psychology (New York: Plenum Press), pp. 37–69.
Hall, J. A., Rosenthal, R., Tickle-Degnen, L. and Mosteller, F. 1994, Hypotheses and problems in research synthesis. In H. Cooper and L. V. Hedges (Eds) The Handbook of Research Synthesis (New York: Russell Sage Foundation), pp. 17–28.
Harrop, J. W. and Velicer, W. F. 1985, A comparison of alternative approaches to the analysis of interrupted time-series. Multivariate Behavioral Research, 20, 27–44.
Helm-Estabrooks, N. 1981, Helm's Elicited Language Program for Syntax Stimulation (HELPSS) (Austin, TX: Exceptional Resources, Inc.).
Hersen, M. and Barlow, D. H. 1976, Single-Case Experimental Designs: Strategies for Studying Behavior Change (New York: Pergamon Press).
Hilliard, R. B. 1993, Single-case methodology in psychotherapy process and outcome research. Journal of Consulting and Clinical Psychology, 61, 373–380.
Hojem, M. A. and Ottenbacher, K. J. 1988, Empirical investigation of visual-inspection versus trend-line analysis of single-subject data. Physical Therapy, 68, 983–988.
Hopkins, A. 1998, The measurement of outcomes of health care research. In M. Swash (Ed.) Outcomes in Neurological and Neurosurgical Disorders (Cambridge, UK: Cambridge University Press).
Huitema, B. E. 1985, Autocorrelation in applied behavior analysis: A myth. Behavioral Assessment, 7, 107–118.
Huitema, B. E. 1988, Autocorrelation: 10 years of confusion. Behavioral Assessment, 10, 252–294.
Huitema, B. E. and McKean, J. W. 1991, Autocorrelation estimation and inference with small samples. Psychological Bulletin, 110, 291–304.
Hunter, J. E. and Schmidt, F. L. 1990, Methods of Meta-Analysis: Correcting Error and Bias in Research Findings (Newbury Park, CA: Sage Publications).
Johnson, M. B. and Ottenbacher, K. J. 1991, Trend line influence on visual analysis of single-subject data in rehabilitation research. International Disabilities Studies, 13, 55–59.
Johnston, M. V., Ottenbacher, K. J. and Reichardt, C. S. 1995, Strong quasi-experimental designs for research on the effectiveness of rehabilitation. American Journal of Physical Medicine and Rehabilitation, 74, 383–392.
Jones, R. R., Weinrott, M. R. and Vaught, R. S. 1978, Effects of serial dependency on the agreement between visual and statistical inference. Journal of Applied Behavior Analysis, 11, 277–283.
Kane, R. L. 1997, Approaching the outcomes question. In R. L. Kane (Ed.) Understanding Health Care Outcome Research (Gaithersburg, MD: Aspen Publishers).
Kazdin, A. E. 1982, Single-Case Research Designs: Methods for Clinical and Applied Settings (New York: Oxford University Press).
Kazdin, A. E. 1986, Comparative outcome studies of psychotherapy: methodological issues and strategies. Journal of Consulting and Clinical Psychology, 54, 95–105.
Kearns, K. P. 1985, Response elaboration training for patient initiated utterances. In R. H. Brookshire (Ed.) Clinical Aphasiology, vol. 15 (Minneapolis, MN: BRK Publishers), pp. 196–204.†
Kearns, K. 1986, Flexibility of single-subject experimental designs. Part II: Design selection and arrangement of experimental phases. Journal of Speech and Hearing Disorders, 51, 204–214.
Kearns, K. P. and Salmon, S. J. 1984, An experimental analysis of auxiliary and copula verb generalization in aphasia. Journal of Speech and Hearing Disorders, 49, 152–163.*
Kearns, K. P. and Thompson, C. K. 1991a, Analytical and technical directions in applied aphasia analysis: The Midas touch. In T. E. Prescott (Ed.) Clinical Aphasiology, vol. 19 (Austin, TX: Pro-Ed), pp. 40–54.
Kearns, K. P. and Thompson, C. K. 1991b, Technical drift and conceptual myopia: The Merlin effect. In T. E. Prescott (Ed.) Clinical Aphasiology, vol. 19 (Austin, TX: Pro-Ed), pp. 31–40.
Kratochwill, T. R. 1978, Single Subject Research: Strategies for Evaluating Change (New York: Academic Press).
Kratochwill, T. R. and Williams, B. L. 1988, Perspectives on pitfalls and hassles in single-subject research. Journal of the Association for Persons with Severe Handicaps, 13, 147–154.
Krishef, C. H. 1991, Fundamental Approaches to Single Subject Design and Analysis (Malabar, FL: Krieger Publishing Co.).
LaPointe, L. L. 1978, Multiple baseline design. In R. H. Brookshire (Ed.) Clinical Aphasiology, vol. 8 (Minneapolis, MN: BRK Publishers), pp. 20–29.
Levin, J. R. 1992, Single-case research design and analysis: Comments and concerns. In T. R.
Kratochwill and J. R. Levin (Eds) Single-Case Research Design and Analysis (Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.), pp. 213–224.
Matyas, T. A. and Greenwood, K. M. 1990, Visual analysis of single-case time series: Effects of variability, serial dependence, and magnitude of intervention effects. Journal of Applied Behavior Analysis, 23, 341–351.
Maxwell, S. E. and Delaney, H. D. 1990, Designing Experiments and Analyzing Data: A Model Comparison Perspective (Belmont, CA: Wadsworth Publishing).
McReynolds, L. V. and Kearns, K. P. 1983, Single-Subject Experimental Designs in Communicative Disorders (Baltimore, MD: University Park Press).
McReynolds, L. V. and Thompson, C. K. 1986, Flexibility of single-subject experimental designs. Part I: Review of the basics of single-subject designs. Journal of Speech and Hearing Disorders, 51, 194–203.
Ottenbacher, K. J. 1990a, Clinically relevant designs for rehabilitation research: The idiographic model. American Journal of Physical Medicine and Rehabilitation, 69, 287–292.
Ottenbacher, K. J. 1990b, Visual inspection of single-subject data: An empirical analysis. Mental Retardation, 28, 283–290.
Ottenbacher, K. J. 1993, Interrater agreement of visual analysis in single-subject decisions: Quantitative review and analysis. American Journal of Mental Retardation, 98, 135–142.
Ottenbacher, K. J. and Cusick, A. 1991, An empirical investigation of interrater agreement for single-subject data using graphs with and without trend lines. Journal of the Association for Persons with Severe Handicaps, 16, 48–55.
Parsonson, B. S. and Baer, D. M. 1992, The visual analysis of data, and current research into the stimuli controlling it. In T. R. Kratochwill and J. R. Levin (Eds) Single-Case Research Design and Analysis (Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.), pp. 15–40.
Petitti, D. B. 1994, Meta-Analysis, Decision Analysis, and Cost-Effectiveness Analysis (New York: Oxford University Press).
Phillips, J. P. N. 1983, Serially correlated errors in some single-subject designs. British Journal of Mathematical and Statistical Psychology, 36, 269–280.
Portney, L. G. and Watkins, M. P. 1993, Foundations of Clinical Research: Application to Practice (Norwalk, CT: Appleton and Lange).
Potter, R. E. and Goodman, N. J. 1983, The implementation of laughter as a therapy facilitator with adult aphasics. Journal of Communication Disorders, 16, 41–48.*
Raymer, A. and Thompson, C. K. 1991, Effects of verbal plus gestural treatment in a patient with aphasia and severe apraxia of speech. In M. L. Lemme (Ed.) Clinical Aphasiology, vol. 20 (Austin, TX: Pro-Ed), pp. 285–297.*
Raymer, A. M., Thompson, C. K., Jacobs, B. and Le Grand, H. R. 1993, Phonological treatment of naming deficits in aphasia: model based generalization analysis. Aphasiology, 7, 27–53.*
Reichardt, C. S. and Mark, M. M. 1998, Quasi-experimentation. In L. Bickman and D. L. Rog (Eds) Handbook of Applied Research Methods (Thousand Oaks, CA: Sage Publications Inc.), pp. 193–228.
Robey, R. R. 1994, The efficacy of treatment for aphasic persons: A meta-analysis. Brain and Language, 47, 582–608.
Robey, R. R. 1997, Meta-Analysis of Clinical Outcome Research. A paper presented before the annual meeting of the American Speech-Language-Hearing Association, Boston.
Robey, R. R. 1998, A meta-analysis of clinical outcomes in the treatment of aphasia. Journal of Speech and Hearing Research, 41, 172–187.
Robey, R. R. and Schultz, M. C. 1993, Optimizing Theories and Experiments (San Diego, CA: Singular Publishing Group).
Robey, R. R. and Schultz, M. C. 1998, A model for conducting clinical outcome research: An adaptation of the standard protocol for use in aphasiology. Aphasiology, 12, 787–810.
Salvatore, A. 1976, Training an aphasic adult to respond appropriately to spoken commands by fading pause duration within commands. In R. H. Brookshire (Ed.) Clinical Aphasiology, vol. 6 (Minneapolis, MN: BRK Publishers), pp. 172–191.†
SAS Institute Inc. 1994, SAS System Under Microsoft Windows, Release 6.10 (Cary, NC: SAS Institute Inc.).
Scheffé, H. 1959, The Analysis of Variance (New York: John Wiley and Sons).
Scruggs, T. E. and Mastropieri, M. A. 1994, The utility of the PND statistic: A reply to Allison and Gorman. Behaviour Research and Therapy, 32, 879–883.
Sederer, L. I., Dickey, B. and Hermann, R. C. 1996, The imperative of outcomes assessment in psychiatry. In L. I. Sederer and B. Dickey (Eds) Outcomes Assessment in Clinical Practice (Baltimore, MD: Williams and Wilkins), pp. 1–7.
Senn, S. 1993, Suspended judgment: N-of-1 trials. Controlled Clinical Trials, 14, 1–5.
Shapiro, E. S., Kazdin, A. E. and McGonigle, J. J. 1982, Multiple-treatment interference in the simultaneous- or alternating-treatment design. Behavioral Assessment, 4, 105–115.
Sharpley, C. F. 1987, Time-series analysis of behavioural data: An update. Behaviour Change, 4, 40–45.
Sharpley, C. F. and Alavosius, M. P. 1988, Autocorrelation in behavioral data: An alternative perspective. Behavioral Assessment, 10, 243–251.
Shewhart, W. A. 1931, Economic Control of Quality of Manufactured Product (New York: Van Nostrand Reinhold).
Siegel, S. 1956, Nonparametric Statistics for the Behavioral Sciences (New York: McGraw-Hill).
Sokal, R. R. and Rohlf, F. J. 1981, Biometry: The Principles and Practice of Statistics in Biological Research (2nd edn) (New York: W. H. Freeman and Co.).
SPSS Inc. 1994, SPSS for Windows, Release 6.1 (Chicago: SPSS Inc.).
Starch, S. A. and Marshall, R. C. 1986, Who's on first? A treatment approach for name recall with aphasic patients. In R. H. Brookshire (Ed.) Clinical Aphasiology, vol. 16 (Minneapolis, MN: BRK Publishers), pp. 73–79.†
Steele, R. D., Weinrich, M., Wertz, R. T., Kleczewska, M. K. and Carlson, G. S. 1989, Computer-based visual communication in aphasia. Neuropsychologia, 27, 409–426.*
Stocks, J. T. and Williams, M. 1995, Evaluation of single subject data using statistical hypothesis tests versus visual inspection of charts with and without celeration lines. Journal of Social Service Research, 20, 105–126.
Suen, H. K., Lee, P. S. C. and Owen, S. V. 1990, Effects of autocorrelation on single-subject single-facet crossed-design generalizability assessment. Behavioral Assessment, 12, 305–315.
Sullivan, M. P., Fisher, B. and Marshall, R. C. 1986, Treating the repetition deficit in conduction aphasia. In R. H. Brookshire (Ed.) Clinical Aphasiology, vol. 16 (Minneapolis, MN: BRK Publishers), pp. 172–180.†
Thompson, C. K. 1983, An experimental analysis of the effects of two treatments on Wh interrogative production in agrammatic aphasia. Doctoral dissertation, University of Kansas, Kansas, USA.†
Thompson, C. K. and Byrne, M. E. 1984, Across setting generalization of social conventions in aphasia: An experimental analysis of 'loose training'. In R. H. Brookshire (Ed.) Clinical Aphasiology, vol. 14 (Minneapolis, MN: BRK Publishers).*
Thompson, C. K. and McReynolds, L. V. 1986, Wh interrogative production in agrammatic aphasia: An experimental analysis of auditory-visual stimulation and direct-production treatment. Journal of Speech and Hearing Research, 29, 193–206.*
Thompson, C. K. and Shapiro, L. P. 1994, A linguistic-specific approach to treatment of sentence production deficits in aphasia. In M. L. Lemme (Ed.) Clinical Aphasiology, vol. 22 (Austin, TX: Pro-Ed), pp. 307–323.*
Thompson, C. K., Hall, H. R. and Sison, C. E. 1986, Effects of hypnosis and imagery training on naming behavior in aphasia. Brain and Language, 28, 141–153.*
Thompson, C. K., Raymer, A. and le Grand, H. 1991, Effects of phonologically based treatment on aphasic naming deficits: A model driven approach. In T. E. Prescott (Ed.) Clinical Aphasiology, vol. 20 (Austin, TX: Pro-Ed).*
Thompson, C. K., Shapiro, L. P., Tait, M. E., Jacobs, B. J. and Schneider, S. L. 1996, Training wh-question production in agrammatic aphasia: Analysis of argument and adjunct movement. Brain and Language, 52, 175–228.†
Timm, N. H. 1975, Multivariate Analysis with Applications in Education and Psychology (Monterey, CA: Brooks-Cole).
Toothaker, L. E., Banz, M., Noble, C., Camp, J. and Davis, D. 1983, N = 1 designs: The failure of ANOVA-based tests. Journal of Educational Statistics, 8, 289–309.
Tryon, W. W. 1982, A simplified time-series analysis for evaluating treatment interventions. Journal of Applied Behavior Analysis, 15, 423–429.
Velicer, W. F. and McDonald, R. P. 1984, Time series analysis without model identification. Multivariate Behavioral Research, 19, 33–47.
Wambaugh, J. L. and Thompson, C. K. 1989, Training and generalization of agrammatic aphasic adults' Wh-interrogative productions. Journal of Speech and Hearing Disorders, 54, 509–525.*
Warren, R., Gabriel, C., Johnston, A. and Gaddie, A. 1987, Efficacy during acute rehabilitation. In R. H. Brookshire (Ed.) Clinical Aphasiology, vol. 17 (Minneapolis, MN: BRK Publishers), pp. 1–11.†
White, O. R. 1977, Data-based instruction: Evaluating educational progress. In J. D. Cone and R. P. Hawkins (Eds) Behavioral Assessment: New Directions in Clinical Psychology (New York: Brunner/Mazel).
Wilson, S. L., Powell, G. E., Brock, D. and Thwaites, H. 1996, Vegetative state and responses to sensory stimulation: An analysis of 24 cases. Brain Injury, 10, 807–818.
Wolery, M. and Billingsley, F. F. 1982, The application of Revusky's Rn test to slope and level changes. Behavioral Assessment, 4, 93–103.
Wortman, P. M. 1994, Judging research quality. In H. Cooper and L. V. Hedges (Eds) The Handbook of Research Synthesis (New York: Russell Sage Foundation), pp. 97–109.
Wurthmann, C., Klieser, E., Lehmann, E. and Krauth, J. 1996, Single-subject experiments to determine individually differential effects of anxiolytics in generalized anxiety disorder. Neuropsychobiology, 33, 196–201.
Yaden, D. B. 1995, Reversal designs. In S. B. Neuman and S. McCormick (Eds) Single-Subject Experimental Research: Applications for Literacy (Newark, DE: International Reading Association), pp. 32–46.

References marked with an * indicate studies that were included in the meta-analysis and contributed estimates of autocorrelation. References marked with a † contributed estimates of autocorrelation.
