Principles and Procedures of Exploratory Data Analysis: John T. Behrens
John T. Behrens
Arizona State University
seldom warranted. Even when well-specified theories are held, EDA helps one interpret the results of CDA and may reveal unexpected or misleading patterns in the data. This article introduces the central heuristics and computational tools of EDA and contrasts it with CDA and exploratory statistics in general. EDA techniques are illustrated using previously published psychological data. Changes in statistical training and practice are recommended to incorporate these tools.

This document is copyrighted by the American Psychological Association or one of its allied publishers.
I gratefully acknowledge comments and criticisms of earlier versions of this article, which were provided by Raymond Miller, Joe Rodgers, Larry Toothaker, Alex Yu, and Dan Huston.

Correspondence concerning this article should be addressed to John T. Behrens, Methodological Studies, Division of Psychology in Education, Arizona State University, Tempe, Arizona 85287-0611. Electronic mail may be sent via Internet to behrens@asu.edu.

The widespread availability of software for graphical data analysis and calls for increased use of exploratory data analysis (EDA) on epistemic grounds (e.g., Cohen, 1994) have increased the visibility of EDA. Nevertheless, few psychologists receive explicit training in the beliefs or procedures of this tradition. Huberty (1991) remarked that statistical texts are likely to give cursory references to common EDA techniques such as stem-and-leaf plots, box plots, or residual analysis and yet seldom integrate these techniques throughout a book. A survey of graduate training programs in psychology corroborates such an impression (Aiken, West, Sechrest, & Reno, 1990). In this investigation, 37 (20%) of the 186 responding departments reported teaching some aspect of EDA in introductory graduate courses. However, the percentage of institutions indicating that most or all students could apply a learned technique was as follows: (a) detection and treatment of influential data, 8%; (b) modern graphical display, 15%; (c) data transformations, 31%; (d) alternatives to ordinary least squares (OLS) regression, 3%. These low levels of competencies and the generally bleak picture of methodological instruction presented by Aiken et al. (1990) indicate that little EDA makes its way into graduate training and even less makes its way out as usable skills.

This essay introduces researchers to the philosophical underpinnings and general heuristics of EDA in three sections. First, the background, rationale, and basic principles of EDA are presented. Next, a primer covers heuristics, prototypical beliefs, and procedures of EDA using examples from psychological research. The final section addresses implications of this analysis for psychological method and training.

Background and First Principles

What Is EDA?

Unaware of historical precedent, researchers may develop their own definition of EDA from denotations of its name. Sometimes the term is used to mean exploratory analysis in general. Mulaik (1984), for example, discussed a long history of generic "exploratory statistics" in response to an article concerning EDA (Good, 1983), and yet scarcely mentioned the specific tradition of EDA to be discussed in this essay. Sometimes the model-building approach of Box (e.g., 1980) is considered exploratory, although it relies more heavily on probabilistic measures than does EDA.

In this article, EDA refers to a specific tradition of data analysis that stems from the work of John Tukey and his associates, which dates back to the early 1960s. This tradition of EDA can be loosely characterized by (a) an emphasis on the substantive under-
standing of data that address the broad question of "what is going on here?" (b) an emphasis on graphic representations of data; (c) a focus on tentative model building and hypothesis generation in an iterative process of model specification, residual analysis, and model respecification; (d) use of robust measures, re-expression, and subset analysis; and (e) positions of skepticism, flexibility, and ecumenism regarding which methods to apply.

This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

The goal of EDA is to discover patterns in data. Tukey often likened EDA to detective work. The role of the data analyst is to listen to the data in as many ways as possible until a plausible "story" of the data is apparent, even if such a description would not be borne out in subsequent samples. Finch (1979) asserted that "we claim for exploratory investigation no more than that it is an activity directed toward the formation of analogy. The end of it is simply a statement that the data look as if they could reasonably be thought of in such and such a way" (p. 189).

Classical works in this tradition are Tukey's Exploratory Data Analysis (1977); Mosteller and Tukey's Data Analysis and Regression: A Second Course in Statistics (1977); Hoaglin, Mosteller, and Tukey's studies (1983b, 1985, 1991); volumes three, four, and five of Tukey's collected works (Cleveland, 1988; Jones, 1986a, 1986b); and Velleman and Hoaglin's work (1981, 1992). Summaries of EDA have been presented by Hartwig and Dearing (1979), Leinhardt and Leinhardt (1980), Leinhardt and Wasserman (1979), and more recently by Behrens and Smith (1996) and Smith and Prentice (1993). Erickson and Nosanchuk's (1992) text is for a first course in data analysis that presents a balanced presentation of both EDA and confirmatory data analysis (CDA). Behrens (1996) provided on-line materials for teaching EDA.

Although exploratory techniques have been developed by others, Tukey and his associates began the endeavor and continue to lead the articulation of the purpose and constraints necessary for reasonable EDA (cf. Hoaglin et al., 1991). Tukey (1969) recommended the EDA approach to psychologists at the 1968 meeting of the American Psychological Association in a paper entitled, "Analyzing Data: Sanctification or Detective Work?" Since that time, surprisingly few have responded.

The Need for EDA

Most psychologists are well trained in testing statistical hypotheses at the end of an investigation. Nevertheless, the scientific process of model building and testing often requires learning from the data at all stages of research. For example, while conducting a regression analysis, one may be interested in assessing the specific hypothesis that a particular $\beta_j = 0$ in a model with $X_1$ and $X_2$. When assessing the status of prespecified statistical issues, the researcher is working in what Mayer (1980) called the confirmatory mode. More often, however, researchers are concerned with a broader range of questions about the data than the statistical significance of the partialed slopes: What if responses on $X_2$ occurred only at two levels rather than across all possible levels of the scale? Are there extreme values that unduly affect the estimation of the slopes? Is the shape of the data in the scatter plot like an ellipse, a horseshoe, or a banana? Is there something misleading me? When addressing such a broad set of questions, a researcher is working in an exploratory mode. Because the goals of the two modes of data analysis are different, the modes are complementary rather than antagonistic.

In contrast to EDA, most training in CDA fails to address the early and messy stages of data analysis. This practice constitutes what McGuire (1989) called the hypothesis testing myth. He argued that we do a disservice to training and practice by glossing over or ignoring preliminary data analyses during which we refine hypotheses, evaluate and clarify our auxiliary assumptions, and simply make sure our mental model of the data is well aligned with reality.

Exploratory and Confirmatory

CDA is often likened to Anglo-Saxon jury trials: Researchers play the role of prosecutor, data collection serves as the trial proceeding, and statistical analysis plays the role of jury decision (Kraemer & Thiemann, 1987; Tukey, 1977). The detective analogy for EDA fits well with this jurisprudence model because the role of the detective is to establish pretrial evidence and hunches, the veracity of which are tested at the trial. Kraemer and Thiemann pointed out that prosecutors examine preliminary evidence before deciding whether to prosecute or not. They equate this process with EDA and other pretrial evidence gathering such as power- or meta-analysis. By using both exploratory and confirmatory techniques, a data analyst collects complete pretrial evidence and brings the full weight of CDA to bear at the trial.

In a trial, rules of presenting and evaluating evi-
dence are as well established as the rules of statistical inference. To make the strong claim of innocence or guilt (significance or nonsignificance), one uses specific rules and procedures with strict interpretations of the data. In EDA, the goal is not to draw conclusions regarding guilt and innocence but rather to investigate the actors, generate hunches, and provide preliminary evidence. EDA is more like an interrogation in which clean and corrupted stories are told, whereas CDA is testimony regarding evidence that fits carefully laid-out trial procedures. The goal of EDA is indictment; the goal of CDA is conviction (Behrens & Smith, 1996).

There is, however, a point at which the trial analogy breaks down. In a jury trial a witness may be used to both formulate and test hunches. Alternatively, in scientific practice different data must be used for model formulation (EDA) and testing (CDA). Failure to recognize this important fact will lead to inflation of Type I error and overfitting. Along these lines Giere (1984, cited in Howson & Urbach, 1993) argued:

If the known facts were used in constructing the model and were thus built into the resulting hypothesis . . . then the fit between these facts and the hypothesis provides no evidence that the hypothesis is true [since] these facts had no chance of refuting the hypothesis. (p. 408)

When sufficiently large samples are available, the exploratory data analyst is likely to conduct EDA on one data set to generate hypotheses and assess the model on another. The importance of distinguishing between model building and testing led Mosteller and Tukey (1977) to state that "we plan to cross-validate carefully wherever we can" (p. 40). Cross-validation means that when patterns are discovered, they are considered provisional (consistent with the EDA mode) until their presence is tested in different data.

Tukey (1972/1986b) discussed data analysis as a continuum from EDA to CDA. Between the two is an intermediate mode called rough confirmatory analysis. In EDA the researcher entertains numerous hypotheses, looks for patterns, and suggests hypotheses based on the data, with or without theoretical grounding. Working in this mode, the researcher begins to delineate a set of plausible models and seeks rich descriptions of the data. In rough CDA, the researcher undertakes initial assessment of the plausible models using probabilistic approaches such as confidence intervals or significance tests (cf. Behrens & Smith, 1996). In this step the researcher answers the question, "With what accuracy are the appearances already found to be believed?" (Tukey, 1972/1986b, p. 760). In the confirmatory mode, researchers work to test specific hypotheses using a strict probabilistic framework following a decision theoretic approach.

When trained in all three modes of data analysis, a researcher is likely to move fluidly between modes and work in multiple modes on the same problem. For example, a researcher may have strict hypotheses about main effects in a factorial analysis of variance (ANOVA) and yet have no hypotheses concerning possible interactions. Working in a strict confirmatory mode, the researcher would compute only the main effects test and ignore possible interactions. Working in multiple modes, a researcher would likewise state the hypothesis for main effects and test them using strict CDA. At the same time, however, the researcher working in multiple modes would explore possible interactions with statistical graphics, resistant summary statistics, and even loosely interpreted significance tests. Patterns of unexpected outcomes would be regarded as starting points for hypothesis generation and future testing rather than as statistical conclusions. In addition, the researcher familiar with EDA will also explore data patterns associated with the hypothesized main effect to make sure the CDA was not misled by unrecognized patterns that can lead to conclusions inconsistent with the data.

Tukey summarized the relation between these modes of data analysis, arguing "(a) both exploration and confirmation are important, (b) exploration comes first, (c) any given study can, and usually should, combine both" (Tukey, 1980/1986e, p. 822; cf. Tukey, 1980). Tukey (1982/1986d) presented a more detailed analysis of levels and types of data analysis following this framework. Jöreskog and Sörbom (1993) presented a similar discussion of situations of (a) model generating, (b) analysis of competing models, and (c) strictly confirmatory analysis of a single model, all in the context of structural equation modeling (p. 115).

Researchers who want to know more about their data than they have hypothesized sometimes use confirmatory methods while working in a pseudoconfirmatory mode. Examples of this kind of behavior include interpreting unexpected interactions as if hypothesized and computing t tests or chi-square homogeneity tests on myriad possibilities. This is not EDA. In these cases, researchers have exploratory goals but are approaching them by using confirmatory tools, assumptions, and conclusions improperly. EDA
helps avoid these improper approaches by being clear about hypothesis specificity and conclusion strength and by providing a language for different stages and purposes of data analysis.

EDA and Other Exploratory Methods

[Many statistical methods are] exploratory in their goals, including stepwise regression, some forms of factor analysis, cluster analysis, discriminant analysis, and many applications of structural equation modeling. These and other methods are exploratory when the researcher is trying to determine a "best" set of variables or the "best" model for a sample rather than testing a prespecified model for a specific population. For example, so-called confirmatory factor analysis via structural equation models becomes exploratory when a number of alternate models are assessed. The exploratory nature of these techniques underscores the idea that data exploration and the integration of empirical and theoretical knowledge are well-established aspects of scientific psychology.

Given that EDA is not simply a set of techniques but an attitude toward the data (Tukey, 1977), are researchers conducting EDA when they compute exploratory factor analysis or other exploratory statistics? The answer depends on how the analysis is conducted. A researcher may conduct an exploratory factor analysis without examining the data for possible rogue values, outliers, or anomalies; fail to plot the multivariate data to ensure the data avoid pathological patterns; and leave all decision making up to the default computer settings. Such activity would not be considered EDA because the researcher may be easily misled by many aspects of the data or the computer package. Any description that would come from the factor analysis itself would rest on too many unassessed assumptions to leave the exploratory data analyst comfortable. Henderson and Velleman (1981) demonstrated how an interactive (EDA based) approach to stepwise regression can lead to markedly different results than would be obtained by automated variable selection. This occurs because the researcher plots the data and residuals at each stage and thereby considers numerous patterns in the data while the computer program is blind to all aspects of the data except the $R^2$.

Although holding exploratory goals alone does not necessarily imply EDA, use of exploratory procedures such as the plotting of simple summaries or the tabulation of simple descriptive statistics does not necessarily imply EDA either. In many cases, simple descriptive statistics or plots may hide important patterns as much as they reveal others.

EDA emphasizes that at different stages of research there are different types of questions, different levels of hypothesis specificity used, and different levels of conclusion specificity that are warranted. EDA does not call for the abandonment of CDA but rather for the broadening of data analysis to incorporate a wide range of attitudes and techniques appropriate to the different stages and questions in scientific work. At the same time, EDA is seen as indispensable in any investigation: "Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone—as the first step" (Tukey, 1977, p. 3).

Beliefs, Heuristics, and Trademarks

Although Tukey often argues that EDA is an attitude rather than a set of tools, a number of heuristics have been devised for EDA. To find patterns, reveal structure, and make tentative model assessments, EDA emphasizes the use of graphics and the process of iterative model fit and residual analysis. To avoid being fooled by unwarranted assumptions about the data, EDA is a much more data-driven approach to data analysis than CDA. Because a complete cataloging of techniques is beyond the scope of this article, this section discusses major themes of EDA and presents examples.

It cannot be overemphasized that an appropriate technique for EDA is determined not by computation but rather by a procedure's purpose and use. Whether residuals are obtained from a computer program intended for CDA or EDA is not important. What is important is to obtain a rich description of the data and to understand the relationship between the model and patterns of residuals. The techniques described next have been helpful in EDA, but techniques are secondary to the goal of building rich mental models of the data. The reader may note that the procedures described are highly related, not simply a laundry list.
Each aspect of EDA is used in concert with other aspects so that a single isolated procedure is seldom used. Recommendations presented here are not necessarily unique to EDA. What is unique is the configuration of beliefs and procedures.

Understand the Context

To some, the analogy of the data analyst as detective connotes someone entering an unknown arena

Use Graphic Representations of Data

Graphical analysis is central to EDA. Tukey (1977) summed up the role of graphics in EDA by saying that "the greatest value of a picture is when it forces us to notice what we never expected to see" (p. vi). Graphical summaries are almost universally sought to augment algebraic summaries because graphics can portray numerous data values simultaneously, while algebraic summaries often sum over important attri-

[Figure residue. Axis label: Child IQ. Category labels: Higher Professional, Lower Professional, Clerical, Skilled, Semi-skilled.]

1994) and visual analysis (Light, Singer, & Willett, 1994) of meta-analytic data described in The Handbook of Research Synthesis (Cooper & Hedges, 1994) rely extensively, and explicitly, on EDA.

The stem-and-leaf plot shown in Figure 2 represents a type of frequency table organized graphically to resemble a histogram while retaining information about the exact value of each observation. The left side of the plot contains the "stems" that mark intervals or bins; the right side of the plot contains "leaves,"
[Figure 3 axis categories: Anxiety, Assertiveness, Locus of Control, Self Esteem. Axis title: Variable Measured.]

Figure 3. Dot plot of sex difference effect sizes for different affective and cognitive variables reported by Feingold (1994).
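
The stem-and-leaf display described for Figure 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the scores are made up, values are non-negative, each leaf is the final digit, and the helper names `stem_and_leaf` and `render` are hypothetical rather than taken from any EDA package.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Group non-negative values into (stem, leaves) rows.

    Assumption: the leaf is the last digit after scaling by leaf_unit,
    and the stem is everything to the left of that digit.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(round(v / leaf_unit))
        stem, leaf = divmod(scaled, 10)
        rows[stem].append(leaf)
    lo, hi = min(rows), max(rows)
    # Keep empty stems so gaps in the distribution stay visible.
    return [(stem, rows.get(stem, [])) for stem in range(lo, hi + 1)]


def render(rows):
    """Format rows as text: stems on the left, leaves on the right."""
    return "\n".join(
        f"{stem:>3} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in rows
    )


# Hypothetical scores, not data from the article.
scores = [52, 57, 61, 63, 63, 68, 70, 74, 75, 75, 79, 81, 96]
rows = stem_and_leaf(scores)
display = render(rows)
print(display)
```

Each row keeps the exact values (stem 7 with leaves 0, 4, 5, 5, 9 stands for 70, 74, 75, 75, 79), while the row lengths give the histogram-like shape.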
lewicz, 1989), a common form is shown in Figure 4, which portrays the data from Figure 3. The box plot offers a five-number summary in schematic form. The ends of a box mark the first and third quartiles, and the median is indicated with a line positioned within the box.1 The ranges of most or all of the data in the tails of the distribution are marked using lines extending away from the box, creating "whiskers" or "tails." Rules governing the construction of the whiskers vary. One method suggested by Tukey (1977) was to extend the whisker to the most extreme value, not exceeding a distance of 1.5 times the interquartile spread (interquartile spread is the scale value of the 75th percentile minus the value at the 25th percentile). In this scheme the tails will cover the middle 99.3% of a Gaussian distribution. Data values occurring past this point are typically displayed individually, as shown in Figure 4.

1 More precisely, the key elements of the box plot are based on statistics that Tukey called "hinges." These are robust measures that generally match the quartiles, although slight differences may occur in some cases. See Frigge et al. (1989) for a discussion of these issues.

Comparing the boxplot to the dotplot, one can see that the box plot offers information about the location of key elements in the distribution (including outliers) and omits more subtle details. The summarizing function of this plot is especially useful when a number of distributions are being compared. Other forms of the boxplot have been developed to indicate the confidence interval of the median by shading the center of the box or indenting the box along the length of the interval (cf. McGill, Tukey, & Larsen, 1978). Other modifications superimpose dots over the boxes (Berk, 1994) or alter the appearance of the box (Stock & Behrens, 1991). Emerson and Strenio (1983) presented a complete treatment of basic boxplot design.

Kernel density smoothers are graphic devices that provide estimates of a population shape, as seen in Figure 5. This smooth shape is arrived at by taking the relative frequency of data at each x value and averaging it with that of the surrounding data (Scott, 1992). Figure 5 is a kernel density smooth of the Feingold anxiety data depicted in Figures 3 and 4. What is clear from this graphic is the near bimodality of the data that is hidden in the boxplot and not obvious in the dotplot. By varying the type of averaging across the data at each point, the size of the window of averaging, or the weighting function used around each point, the appearance of the plot can be varied to make the plot appear more jagged or more smooth. Overlaying density functions from each distribution allows direct comparison of their shapes, as shown in Figure 6. From this we can see that the underlying distributions are quite similar, with the exception of the second mode of the anxiety distribution. Further analysis of these data is warranted to ascertain whether there are unique study characteristics associated with this group. This example underscores the importance of multiple depictions of data and the impor-
[Figure 4 axis categories recovered: Anxiety, Assertiveness, Locus of Control. Axis title: Variable Measured.]

Figure 4. Box plot of effect sizes displayed in Figure 3. Data in the tails of the distribution are marked using lines extending away from the box. The O indicates outliers that deserve special attention.
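
The whisker rule attributed to Tukey (1977), and the 99.3% Gaussian coverage it implies, can be checked with a short sketch. Several things here are assumptions: the sample values are hypothetical, the fences are measured from the ends of the box, and the quartiles come from Python's `statistics.quantiles` with `method="inclusive"`, which may differ slightly from Tukey's hinges in small samples.

```python
import statistics


def boxplot_summary(values, k=1.5):
    """Box plot elements with whiskers limited to k interquartile spreads.

    Assumption: quartiles stand in for Tukey's hinges.
    """
    data = sorted(values)
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    spread = q3 - q1
    low_fence, high_fence = q1 - k * spread, q3 + k * spread
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        # Whiskers stop at the most extreme values still inside the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Values past the fences are displayed individually.
        "outside": [x for x in data if x < low_fence or x > high_fence],
    }


def gaussian_coverage(k=1.5):
    """Share of a Gaussian distribution lying between the two fences."""
    nd = statistics.NormalDist()
    q1, q3 = nd.inv_cdf(0.25), nd.inv_cdf(0.75)
    spread = q3 - q1
    return nd.cdf(q3 + k * spread) - nd.cdf(q1 - k * spread)


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical values
summary = boxplot_summary(sample)
coverage = gaussian_coverage()
print(summary)
print(round(coverage, 3))
```

For the hypothetical sample, only the value 30 falls past the upper fence, so it would be plotted on its own while the whiskers run from 1 to 9.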
tance of rejecting the notion of "the plot" of a set of data. Scott (1992) provided a full treatment of univariate and multivariate density-smoothing functions.

A major component of the detective work of EDA is the rough assessment of hunches, a quick look at the question "could it be that . . ." or "what if it is the case that . . ." As Tukey and Wilk (1986) argued, citing Chamberlain (1965), "science is the holding of multiple working hypotheses." To hold and assess multiple working hypotheses, data analysts depend heavily on interactive computer graphics. Interactive graphics can be acted on directly by touching them with the cursor or other pointing device. Linked plots, another important innovation, are organized so that a change made to the color or shape of a point representing an observation in one plot automatically changes the appearance of the observation in all other plots.

In the case of the Feingold data, interactive graphics allow the selection of the observations in the second mode of the anxiety effect size data by drawing a rectangle around the observations of interest. When this is done, the observations are highlighted in all the linked windows. To determine whether there is a common effect in these data based on the country in which the study occurred, a plot of the data organized by country is opened. Figure 7 is an illustration of how linking the two plots allows quick determination of covariation across variables. This figure is a picture of a computer screen obtained using Data Desk (Data Description, Inc., 1995), although linking is common in most EDA-oriented programs. The small window on top of the dot plot is a palette that allows rapid change of observation shape by selecting observations and pointing to the desired shape. The highlighted portions of the bar chart reflect the highlighted (second mode) portions of the dot plot. Of the nine highlighted observations, three are from the United States and two each are from Israel, Canada, and Sweden. All of the positive effect sizes are from studies conducted in the United States, suggesting the possibility of country-related effects. These data are limited in size but are consistent with the tentative hypothesis that sex differences in reported anxiety vary as a function of country of origin. Such an idea provides direction for conducting evaluations of other data sets and for conducting cross-cultural studies in the future.

Because it is impossible to anticipate all relevant aspects of data in either experimental or nonexperimental work, it is difficult to overstate the value of graphics. The multiplicity of data patterns that can match a single mean led early psychologists to consistently report means with histograms. Changes from this convention were heatedly discussed. By 1935, an editorial in Comparative Psychology (Dunlap, 1935) asked, ". . . Should we not exclude reports in which the group averages of performance are presented without interpretive distributions?" (p. 3). Even the staunchest proponents of CDA argued for balance in exploratory and confirmatory methods. Fisher's Statistical Methods for Research Workers (1925) included an entire chapter on "diagrams" that begins noting:
Figure 5. Density estimation plot of the anxiety effect size data indicating bimodality in the
anxiety measures.
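
The averaging idea behind a kernel density smooth, and the effect of the window size on how jagged or smooth the result looks, can be sketched as follows. The effect sizes below are hypothetical values chosen to have two clusters (they are not Feingold's data), a Gaussian weighting function is assumed, and the function names are illustrative only.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a density function that averages Gaussian bumps centered
    on each observation. The bandwidth is the size of the averaging window."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


def count_modes(density, grid):
    """Count strict local maxima of the density along an evaluation grid."""
    ys = [density(x) for x in grid]
    return sum(
        1 for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] > ys[i + 1]
    )


# Hypothetical effect sizes with a main cluster and a smaller second cluster.
effects = [-0.45, -0.40, -0.35, -0.30, -0.30, -0.25, -0.20, -0.15,
           0.25, 0.30, 0.35, 0.40]
step = 0.01
grid = [-1.5 + step * i for i in range(301)]

narrow = gaussian_kde(effects, bandwidth=0.1)
wide = gaussian_kde(effects, bandwidth=0.5)
modes_narrow = count_modes(narrow, grid)
modes_wide = count_modes(wide, grid)
area_narrow = sum(narrow(x) for x in grid) * step  # should be close to 1
print(modes_narrow, modes_wide, round(area_narrow, 3))
```

With the narrow window the second cluster shows up as its own mode; with the wide window it is averaged away. That is one reason to look at more than one depiction of the same data.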
The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitute for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them. (p. 24 of the 11th edition)

Develop Models in an Iterative Process of Tentative Model Specification and Residual Assessment

When working in the exploratory mode, the data analyst takes the goal of developing a plausible description of the data using the framework: data = fit + residual. Following a graphical analogy it is sometimes said that data = smooth + rough.

These formulas reflect the fact that the aim of data analysis is to fit or summarize the data and that all description fails to some degree as reflected in the residuals. Even the use of the mean or median is a fit in this view. The boxplot is valuable because it indicates both the fit and residual of a single set of data. The fit is the median or mean, single values that describe the data well. Residuals are deviations from that point. The singling out of outliers is part of the important process of identifying observa-
[Figure 7: screen image; recoverable labels: d, country, Type.]

Figure 7. Image of computer screen during interactive graphic session with brushing and linking.
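
The framework data = fit + residual can be illustrated for a single batch of numbers, where the fit is one summary value. The batch below is hypothetical, and the helper name `decompose` is only illustrative; the point of the sketch is that the identity holds for any fit, while the choice of fit changes what the residuals show.

```python
from statistics import mean, median


def decompose(values, fit_function):
    """Split a batch into a single-number fit and one residual per value."""
    fit = fit_function(values)
    residuals = [x - fit for x in values]
    return fit, residuals


batch = [2, 3, 3, 4, 5, 6, 40]  # hypothetical batch with one extreme value

median_fit, median_resid = decompose(batch, median)
mean_fit, mean_resid = decompose(batch, mean)

# data = fit + residual holds exactly for either choice of fit.
rebuilt = [median_fit + r for r in median_resid]

print(median_fit, median_resid)
print(mean_fit, mean_resid)
```

With the median as the fit, the extreme value stands out as one large residual and the other residuals stay small. With the mean as the fit, the extreme value pulls the fit upward and every other residual becomes negative.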
overall effects. First, a tentative fit or description of the level of options seen in each ethnicity is found by calculating the median percentage of options in each column. This provides fits of 47, 47, and 42.5 for the Native American, Hispanic, and White groups, respectively. After this first step, unexpected patterns begin to appear: On average, fewer White students rate occupations as options than their Native American or Hispanic counterparts. To complete the initial pass at decomposing data into fit and residual, residuals are computed by subtracting the median value for the corresponding ethnicity from each data value. The bottom of Table 3 displays fits for each ethnicity with

Table 3
Residual Percentage of Individuals Viewing Each Career as an Option After a First Pass at Removing the Ethnicity Fit

                                   Ethnicity
Occupation               Native American   Hispanic   White
X-ray technician               -19            -24      -23.5
Medical technician             -12             -7       -8.5
Physical therapist             -10             -7       -1.5
Social worker                   17              9       11.5
Bookkeeper                       0             -9       -5.5
Secretary                        5              9        5.5
Fashion shop manager             0              9        2.5
Receptionist                   -12              0       -1.5
Librarian                      -11            -24      -24.5
Electrician                      2              0       -0.5
Electronics technician           7              4        0.5
Veterinarian                   -13             -9       12.5
Probation officer               11             -7      -13.5
Armed forces                    29             24       19.5
Accountant                       0              4        6.5
Lawyer                          16             21       19.5
Auto salesperson               -13            -14      -16.5
Photographer                    19             18       24.5

Ethnicity fits                  47             47       42.5

sions. The armed forces leads as a high-option profession.

To model the values associated with these occupation effects, the two-way fit continues by calculating the median residual for each occupation. This provides a summary of the occupation effects, similar in form to the initial summary of the ethnicity effects. Next, the occupation medians are subtracted from the cell values in Table 3, allowing each cell of the original table to be recreated by adding the occupation and ethnicity fits to the residual. This process is then extended to find an overall fit by fitting the ethnicity and occupation fits. After these fits have been obtained, the process is repeated by iteratively refitting the residuals in each direction of the table until additional patterns cannot be extracted (Tukey, 1977).

The final results of such an analysis are presented in Table 4, with occupations reordered by the size of their fits. Each datum in the original table can be recreated by adding the overall, ethnicity, and occupation fits to the residual. When the overall, occupation, and ethnicity fits are perfectly additive, residuals equal zero. Residuals indicate interaction effects over and above the main effects modeled in the ethnicity and occupation fits. For example, 29% of the White respondents consider work as a probation officer an option. This is nine percentage points less than the predicted value of 38 obtained by adding the overall fit (45) plus the White fit (-2) plus the probation officer fit (-5). In contrast, the Native American group considers this occupation as an option 17 percentage points more often than one would expect given the overall fit (45), the Native American fit (+1), and the probation officer fit (-5).

The use of residuals in EDA differs from that in CDA in several ways. First, although the logic of reducing residuals by use of an improved model is inherent in ANOVA and regression, it is seldom discussed explicitly outside of model comparison approaches to these techniques (e.g., Maxwell &
Table 4
Additive Occupation and Ethnicity Fits With Residuals and Overall Fit From Median Smoothing of Table 2

                                  Residuals by ethnicity
Occupation                Native American   Hispanic    White   Occupation fits
Armed forces                            4          0       -7                26
Photographer                            0          0        4                20
Lawyer                                 -2          4        0                19
Social worker                           7          0        0                11
Secretary                               0          5       -1                 6
Accountant                             -5          0        0                 6
Electronics technician                  2          0       -6                 6
Electrician                             1          0       -3                 2
Fashion shop manager                   -1          9        0                 2
Receptionist                           -9          4        0                -2
Probation officer                      17          9       -9                -5
Physical therapist                     -4          0        3                -5
Bookkeeper                              7         -1        0                -6
Veterinarian                           -5          0       19                -7
Medical technician                     -2          4        0                -9
Auto salesperson                        0          0       -5               -12
Librarian                              12          0       -3               -22
X-ray technician                        4          0       -2               -22
Ethnicity fits                          1          0       -2                45 (Overall fit)
Note. Rows are reordered by size of fit.
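The two-way fit described in the text can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to produce Table 4: the function name `median_polish`, the iteration limit, and the stopping tolerance are hypothetical choices, and the five rows of data are percentages recovered by adding the Table 3 residuals to the ethnicity fits. Because only a subset of occupations is used, the resulting fits will not match Table 4.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-9):
    """Split a two-way table into overall + row fit + column fit + residual,
    sweeping out medians alternately by column and by row."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [[float(v) for v in row] for row in table]
    overall = 0.0
    row_fit = [0.0] * n_rows
    col_fit = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep the column medians out of the residuals into the column fits.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_fit[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # Move the median of the row fits into the overall fit.
        shift = median(row_fit)
        overall += shift
        row_fit = [r - shift for r in row_fit]
        # Sweep the row medians out of the residuals into the row fits.
        row_med = [median(resid[i]) for i in range(n_rows)]
        for i in range(n_rows):
            row_fit[i] += row_med[i]
            for j in range(n_cols):
                resid[i][j] -= row_med[i]
        # Move the median of the column fits into the overall fit.
        shift = median(col_fit)
        overall += shift
        col_fit = [c - shift for c in col_fit]
        # Stop once another pass would extract nothing further.
        if max(abs(m) for m in col_med + row_med) < tol:
            break
    return overall, row_fit, col_fit, resid


# Rows: occupations; columns: Native American, Hispanic, White.
# Values are Table 3 residuals plus the ethnicity fits (47, 47, 42.5).
occupations = ["Armed forces", "Lawyer", "Secretary", "Probation officer", "Librarian"]
table = [
    [76, 71, 62.0],
    [63, 68, 62.0],
    [52, 56, 48.0],
    [58, 40, 29.0],
    [36, 23, 18.0],
]

overall, occ_fit, eth_fit, resid = median_polish(table)
print("overall fit:", overall)
print("ethnicity fits:", eth_fit)
for name, fit, row in zip(occupations, occ_fit, resid):
    print(f"{name:18s} fit={fit:6.1f} residuals={row}")
```

Whatever the stopping point, every cell can be recreated as overall fit plus occupation fit plus ethnicity fit plus residual, which is the property the text relies on when it reads the residuals as departures from an additive model.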
Delaney, 1990). Second, CDA generally assesses the size of residuals in global summary statistics such as the mean squared error (MSE). Because the MSE is based on the sum of squared residuals, the size of individual residuals is aggregated and the pattern of residuals obscured. After data are well understood and CDA is asking the constrained question concerning the relative size of residuals compared with model effects, F statistics and related techniques may be appropriate. However, when the underlying form of the data is not well understood, an exploratory data analyst is more likely to ask "Where are the good and bad fits and why?" rather than the more specific questions addressed in CDA.

This analysis represents a valuable start for understanding how perceptions of occupations vary across ethnicity. A bivariate structure of the table is suggested that offers detail about the size of effects well beyond noting the ethnicity with the highest options in each of the six significant chi-square tests reported by Lauver and Jones (1991). In sharp contrast to most applications of CDA, detailed analysis of residuals was used both to assess the model and to understand the data by examining their departure from the model. In EDA these residuals represent important deviations from expectations that inform us about the structure of the data rather than simply "error" that should be minimized.

In this example, the table consisted of percentages, yet the two-way fit is general enough to apply to other types of values, including frequencies and means. Tukey (1986c) noted that such decompositions of tables based on multiplicative or, as in this case, additive models were long considered a standard tool in data analysis. He noted the additive model underlies ANOVA for crossed and nested factors, whereas the multiplicative model underlies the chi-square test of independence in contingency tables. This accounts for the fact that, when using mean smoothing on cell means, the two-way fit provides the same results as the procedures recommended by Rosenthal and Rosnow (1991) for interpreting interaction effects in ANOVA. From the perspective of EDA, Rosenthal and Rosnow are recommending the use of F ratios for hypothesis testing and two-way fits with residual analysis for parallel EDA to help
build a rich description. Most programs for computing log-linear models will give similar results of parameter estimates and cell residuals following a multiplicative model. Hoaglin et al. (1991) discussed the two-way fit in detail using mean smoothing for a number of ANOVA designs.

An elegant graphic representation of two-way fits and residuals is available, although its presentation is lengthy and beyond the scope of this article. Interested readers may consult Tukey (1977) for its original treatment or Becker, Chambers, and Wilks (1988) or Statistical Sciences, Inc. (1993) for some computer implementations. Behrens and Smith (1996) should be consulted for an example using data from instructional psychology.

Use Robust and Resistant Methods

In the analysis of the two-way table, fits were based on medians rather than means. In EDA, robust estimators such as the median are generally preferred. Hoaglin, Mosteller, and Tukey (1983a) defined robustness as a concern for the degree to which statistics are insensitive to underlying assumptions. Mallows (1979) discussed three aspects of robustness: resistance, smoothness, and breadth. Resistance concerns being insensitive to minor perturbations in the data and weaknesses in the model used. Smoothness concerns the degree to which techniques are affected by gradual introduction of bad data. Breadth is the degree to which a statistic is applicable in a wide range of situations. Robustness is important in EDA because the underlying form of the data cannot always be presumed, and statistics that can be easily fooled (like the mean) may mislead.

Several approaches are available to assess the resistance of a statistic (cf. Goodall, 1983b), including the breakdown point (Hampel, 1971). Hampel (1974) defined the breakdown point as "the smallest percentage of free contamination which can carry the value of the estimator over all bounds" (p. 388). Discussing the resistance of regression lines, Emerson and Hoaglin (1983) explained a breakdown point as follows:

Operationally, we can think of dispatching data points "to infinity" in haphazard or even troublesome directions until the calculated slope and intercept can tolerate it no longer and break down by going off to infinity as well. We ask how large a fraction of the data, no matter how they are chosen, can be so drastically changed without greatly changing the fitted line. (p. 159).

Other statistics can be assessed in a similar manner. For example, the percentage of data points that can be arbitrarily changed in a set of data without changing the mean is 0. In contrast, half the data of a distribution can be altered to infinity before the median changes, thereby giving the median a breakdown point of 0.5.

Additional resistant measures include the trimean, which is a measure of central tendency based on the arithmetic average of the value of the first quartile, the third quartile, and the median counted twice. The median absolute distance from the median is a measure of dispersion that follows its name exactly. Winsorizing (pulling tail values of a distribution in to match a preset extreme score) or trimming (dropping values past a preset extreme score) may also be used. Some researchers object to the differential weighting afforded data in these cases. This differential weighting is, however, no different from procedures commonly used by instructors who drop a student's lowest score or Olympic judging that is based on a mean score following the elimination of the highest and lowest scores. As in psychological work, these strategies seem justified if the results downplay errant values while offering an otherwise expected summary. In one of his most influential papers, Fisher (1922) argued that "assuredly an observer need be exposed to no criticism, if after recording data which are not probably normal in distribution, he prefers to adopt some value other than the arithmetic mean" (p. 323). Lind and Zumbo (1993) presented an overview of robustness issues in psychological research as did Wainer (1977a).

Although problematic, data requiring resistant summaries are not uncommon in psychological work. For example, Paap and Johansen (1994) reported the results of reaction time (RT) experiments aimed at evaluating their memory model of word verification. Among the data reported is the frequency with which each word used in the experimental task occurs in a standard corpus. The distribution of this variable is depicted in Figure 8. In addition to the extreme skew, the distribution is marked by an extreme outlier representing the word "that" with word frequency of 10,595 in the reference corpus. The second most frequent word used from this corpus is "than" with a frequency of 1,789. The mean word frequency is 267, and the median frequency is 47. The failure of the mean and median to give a common indication underscores the value of resistant measures. The graphic display and these numbers suggest that the mean can
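The resistant summaries named on this page (trimean, median absolute distance from the median, Winsorizing, trimming) and the breakdown-point idea can be illustrated with a short Python sketch. The helper names are hypothetical, the choice of the "inclusive" quartile rule is an assumption (several quartile conventions exist), and the seven frequencies are invented for illustration; only the value 10,595 echoes the text.

```python
from statistics import mean, median, quantiles


def trimean(xs):
    """Average of the first quartile, the third quartile, and the median counted twice."""
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")  # assumed quartile convention
    return (q1 + 2 * q2 + q3) / 4


def mad(xs):
    """Median absolute distance from the median."""
    center = median(xs)
    return median(abs(x - center) for x in xs)


def winsorize(xs, low, high):
    """Pull tail values in to match preset extreme scores."""
    return [min(max(x, low), high) for x in xs]


def trim(xs, low, high):
    """Drop values past preset extreme scores."""
    return [x for x in xs if low <= x <= high]


# Hypothetical word frequencies with one extreme value.
freqs = [12, 30, 47, 51, 60, 88, 10595]

# Send the extreme point "to infinity": the mean follows it, the median does not.
contaminated = freqs[:-1] + [10**9]

print("mean:   ", mean(freqs), "->", mean(contaminated))
print("median: ", median(freqs), "->", median(contaminated))
print("trimean:", trimean(freqs))
print("MAD:    ", mad(freqs))
print("winsorized mean:", mean(winsorize(freqs, 12, 88)))
print("trimmed mean:   ", mean(trim(freqs, 0, 100)))
```

Changing a single point is enough to move the mean without limit, which is what a breakdown point of 0 means in practice, while the median and the quartile-based summaries stay where the bulk of the data lies.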
[Figure 8 image: dot plot; horizontal axis: Word Frequency in Lexicon]
Figure 8. A dot plot of the distribution of word frequencies used by Paap and Johansen (1994).

easily mislead the researcher from the bulk of the data and that the median is a good fit for most of the data points.

Pay Attention to Outliers

Although resistant measures guard against misinformation from small perturbations, sometimes perturbations are so great that inclusion of the bulk of data along with well-documented oddities leads to meaningless summary statistics. In EDA, extreme or otherwise unusual data are noted as outliers so they may be treated differently or call increased attention to a phenomenon. The problem of outliers has a long history. Hampel, Ronchetti, Rousseeuw, and Stahel (1986) noted that discussion of the omission of outliers goes back as far as Bernoulli (1777/1961) and Bessel and Baeyer (1838). Hampel et al. provided additional references and notes, including Bernoulli's remark that rejection of outliers was commonplace among astronomers of his time. The discussion concerning the separation of extreme values has not ended (cf. Barnett & Lewis, 1994; Hawkins, 1980; Hoaglin & Iglewicz, 1987).

Barnett and Lewis (1994) defined an outlier as "an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data" (p. 7). As the detective analogy suggests, the outlying data are telling a different story from the rest of the data, and to try to summarize all of the data with a single model or statistic leads to a case of combining apples and oranges.

Temporarily setting aside an observation allows a diagnostic assessment of the role of the value in the summary statistics. For example, the effect of the word "that" in the Paap and Johansen experiments can be assessed by computing the mean both with and without the word included. When the word is removed, the mean of the data drops to 183 from the original 267. This change is considerable because the observation comprises only 1/128 or 0.8% of the data. This temporary diagnostic setting aside may lead the data analyst to set the observation aside for the remainder of the analysis or continue with it in the data set. A common extension of "setting one aside" is the generalized jackknife procedures (Efron, 1982; Mosteller & Tukey, 1977). When conducting a jackknife procedure, the data analyst repeatedly removes subsets of the data and recomputes a statistic of interest with an eye for deviations in the statistic across subsamples. Homogeneity of the statistics reflects homogeneity of information in the data, whereas variability in the statistics reflects variability in the data, as seen previously. Although once considered only as EDA techniques, such procedures have become mainstream methods in areas including regression diagnostics that use a "leave one out" approach in measures such as Cook's distance and DFFITS (cf. Atkinson, 1985; Cook & Weisberg, 1994).

An idea closely related to outliers is that of fringeliers. Fringeliers are unusual points that are not as clearly deviant as outliers but may appear with unusual frequency in unexpected ways (Wainer, 1977a). A group of observations clumped three standard de-
viations from the mean would be one example. As with outliers, relating the structure of fringeliers to the phenomenon being studied is the best possible outcome. Hadi and Simonoff (1993) discussed a number of similar issues for outlier detection in multivariate models. Although the data analyst working in the exploratory mode is likely to set an observation aside if it allows sensible description of the remaining data, such a decision must consider other possible representations of the data, weigh gains and losses of information, and document the status of any data that

transformation can be chosen to "prove anything." These concerns can be mollified by noting (a) reexpression of numerical values is common in everyday life and psychological work and (b) the goal of reexpression is to find a scale that represents the phenomenon in a meaningful way.

Reexpressing data as standard scores or log functions is a familiar practice in psychological research. Daily life holds experience of reexpressed scales as well. Hoaglin (1988) noted a number of examples of reexpression from everyday experience, including the
ers may focus on precisely aligning their theoretical hypotheses with statistical tests and yet fail to collect preliminary evidence concerning the distributional form their measurements take. Although a relationship between two variables may be hypothesized on the basis of theoretical work, the statistical analysis occurs on an empirical realization that is a result of the underlying form of the constructs and the way in which the constructs are measured. Failure to delve into a detailed analysis of the form of the distributions and the reexpressions that make them most interpretable can lead to glaring misinterpretations of the data.

Putting It All Together: A Reexamination of the Paap and Johansen Data

The preceding sections have examined a number of foundational tools in the EDA toolbox. The theme common to all the procedures described previously is not the use of a canonical technique but a willingness to use any technique that helps ensure a rich mental model of the data that fits closely with the true form of the data. This requires a high degree of interactivity with the data and a familiarity with a wide range of techniques. To illustrate how these principles and procedures interact, data published in the Journal of Experimental Psychology: Human Perception and Performance, which were introduced previously in the discussion of resistance, are now reexamined from an EDA perspective. In this article, Paap and Johansen (1994) reported numerous analyses of several data sets used to support their theory of word verification, including RT data for 128 words from a standard experimental lexicon. Variables associated with each word include the average RT to respond to each word in an experimental task, the number of high-frequency neighbors (HFN), neighborhood size (NS), word frequency in the lexicon (WF), and the summed bigram frequency (SBF). A neighbor is a word created by changing a single letter in an original word. HFN are words similar to the original word that occur often in standard use. The SBF is a measure of position-specific bigrams that occur in the word. The authors summarized their expectations toward these data arguing "the only variable that directly determines word RT is the number of HFNs . . . Thus, NS, SBF and WF are all indirect effects that should not account for any of the variance in word RT once the effects of HFN were partialed out" (pp. 144-145).

Paap and Johansen tested this hypothesis by using OLS multiple linear regression, which is commonly referred to as "multiple regression." The fuller name, however, reminds us of the assumption of linearity inherent in that procedure and the sensitivity of the regression line to extreme values when the OLS approach is used. While keeping an eye on their original hypothesis, working in an exploratory mode allows broader questions such as: How are the independent variables related to each other and the dependent variable? What patterns underlie the results reported by Paap and Johansen? What can be done to improve the model? What can we find that we did not expect? How might we be fooled by the summaries?

A first look. When working with multivariate data such as these, a common strategy is to examine variables individually and then in bivariate and higher order configurations. Figure 9 depicts the shapes of distributions from this analysis using box plots. An analyst working on these data should view histograms, density plots, and dot plots as well. Before suggesting first aid for these messy distributions using outlier handling or reexpression, it is often helpful to assess how the shapes of these distributions affect assessment of bivariate and higher order relationships in the data. This can be done graphically using a scatter plot matrix (also called a generalized draftsman's display) described in Chambers et al. (1983) and shown in Figure 10. The plot presents all pairwise combinations of the five variables of interest. The graphic may be thought of as a pictorial correlation matrix with scatter plots replacing correlation coefficients. In this version of a scatter plot matrix, normal probability plots are presented in the matrix diagonals, with variable labels indicating the associated row and column scales. For example, on the top row of plots, the RT measure is plotted on each vertical axis while the x-axes vary from HFN to NS to WF and SBF as one moves from left to right in that row. The HFN variable is plotted on the vertical axis of all plots in the second row and on the horizontal axis of plots in the second column. The top right-most plot indicates RT on the vertical axis and SBF on the horizontal axis.

Normal probability plots are a diagnostic aid used to assess the degree to which the empirical distribution matches the Gaussian distribution. This is accomplished by calculating the fraction of data below each data value (i.e., the quantile) and computing the z
[Figure 9 image: box plots of RT, HFN, NS, WF, and SBF]
Figure 9. Box plots of word verification variables from Paap and Johansen (1994) indicat-
ing severe nonnormality and outliers. The locations of data in the upper and lower quartiles
are marked using lines extending away from the box. The O indicates outliers that deserve
special attention. The * indicates extreme outliers. (RT = reaction time; HFN = high-
frequency neighbors; NS = neighborhood size; WF = word frequency; SBF = summed
bigram frequency.)
score for points with corresponding quantiles in the Gaussian distribution. When the scale values are plotted against the expected z score, a straight line is obtained if the distribution is Gaussian. Curves in the normal probability plot indicate skew, whereas S shapes indicate shorter than expected tail regions. Cleveland (1993) presented a complete discussion of the normal probability plot and the more general quantile-quantile plot. The farthest left plots in Rows 1 and 2 of Figure 10 indicate moderate positive skew in RT and HFN while the NS plot indicates relative normality, and WF and SBF plots indicate marked deviation from Gaussian shape. Individual outliers are marked as "xs." Although these patterns were visible in the box plots, normal probability plots are an important adjunct because they are compared directly against the normal distribution and display each datum rather than the five-number summary of the box plot. The extremity of the word "that" in WF can be seen in the bivariate plots.

One natural method for summarizing the bivariate relationships between variables is to use the formula for a line as a fit from which to derive residuals. If an OLS fit is used (as is the default in most computer packages), the line is easily affected by extreme values such as the outliers in WF and SBF, which represent the common words "that" and "than." In interactive data analysis environments common to EDA, quick assessment of such effects is straightforward. In this case, regression lines were added to a number of the scatter plots in Figure 10 to indicate the OLS predictions that would be computed with and without the two outlying values. This was accomplished by selecting options from pull-down menus accessible on the scatter plots themselves (Data Description, Inc., 1995). In each case, the regression line nearest the outlier indicates prediction lines with the outliers included.

It is clear from these plots that the extreme outliers are dramatically different from the bulk of the data and disproportionately influence the fit from the least squares line. Likewise, these extreme points are artificially inflating or deflating the correlation that holds in the mass of the data. For the case of WF, the outliers pull the line toward a slope of zero when compared against the negative slope that exists when the outliers are set aside. In addition to the difficulty with the regression lines being disproportionately affected by these points, their presence compresses the variability in the bulk of the data and may hide important patterns. Because these two outliers appear to be qualitatively different from the bulk of the data, unduly influence the OLS summary, and may distort the visual impression of the data, it is advisable to set them aside for some portion of the analysis.
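The comparison of OLS lines with and without outlying values can be sketched without any interactive software. The example below is a minimal illustration, assuming a hypothetical helper `ols_line` and invented data in which six points follow a line with slope -1 and one point lies far out on the horizontal scale; it does not use the Paap and Johansen data.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the ordinary least squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: a clear negative trend plus one extreme point.
x = [10, 20, 30, 40, 50, 60, 10000]
y = [700, 690, 680, 670, 660, 650, 690]

intercept_all, slope_all = ols_line(x, y)
# Temporarily set the extreme point aside and refit.
intercept_bulk, slope_bulk = ols_line(x[:-1], y[:-1])

print(f"all data:          slope = {slope_all:.4f}")
print(f"outlier set aside: slope = {slope_bulk:.4f}")
```

In this invented case the single extreme point pulls the fitted slope toward zero and masks the negative relationship in the remaining data, the same qualitative pattern the text describes for WF.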
Figure 10. Scatter plot matrix of word verification data from Paap and Johansen (1994).
Outliers are indicated with the "at" symbol. Regression lines have been added to indicate
predicted values as they would occur with and without the outliers included. (RT = reaction
time; NS = neighborhood size; HFN = high-frequency neighbors; SBF = summed bigram
frequency, WF = word frequency.)
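The coordinates of a normal probability plot, such as those on the diagonal of the scatter plot matrix, can be computed as follows. This is a sketch under assumptions: the plotting position (i + 0.5)/n is one common convention for the fraction of data below each value and is not necessarily the one used for the published figures, the function names are hypothetical, and both data sets are invented. The correlation between data values and expected z scores is used here only as a rough numeric stand-in for judging straightness by eye.

```python
from statistics import NormalDist


def normal_probability_points(xs):
    """Pair each sorted value with the z score expected at its quantile."""
    ordered = sorted(xs)
    n = len(ordered)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i + 0.5) / n), v) for i, v in enumerate(ordered)]


def straightness(points):
    """Pearson correlation between expected z scores and data values."""
    n = len(points)
    mean_z = sum(z for z, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    szz = sum((z - mean_z) ** 2 for z, _ in points)
    svv = sum((v - mean_v) ** 2 for _, v in points)
    szv = sum((z - mean_z) * (v - mean_v) for z, v in points)
    return szv / (szz * svv) ** 0.5


symmetric = list(range(1, 21))          # hypothetical, evenly spread values
skewed = [2 ** k for k in range(20)]    # hypothetical, strongly right-skewed values

r_symmetric = straightness(normal_probability_points(symmetric))
r_skewed = straightness(normal_probability_points(skewed))
print(f"symmetric data: r = {r_symmetric:.3f}")
print(f"skewed data:    r = {r_skewed:.3f}")
```

Plotting the second element of each pair against the first gives the normal probability plot itself; the skewed series bends sharply away from a line, which shows up as the lower correlation.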
A better description. Temporarily setting aside the two outlying data points and reconstructing the scatter plot matrix leads to the display in Figure 11. In this plot the relationships with SBF are clearer (although not very strong), and curvilinear relationships between WF and both RT and HFN are visible. These curvilinear relationships are not completely unexpected. The curved form of the data is reflected in the bunching up of the data in the lower left corner of the two-dimensional plot. This is likely to occur given the bunching of data in the lower part of each of the univariate distributions.

A straightforward way to find an appropriate description for the curved function is to find a reexpression of the univariate distributions that leads them to a roughly Gaussian shape. By finding the degree to which the univariate distributions need to be reexpressed to be Gaussian, one also finds the degree to which the line of fit must be bent to meet the data.

When reexpressing variables in EDA, one may use the notion of a ladder of reexpression. A number of versions of the ladder exist. In each case the rungs of the ladder represent an exponent to which scores may be raised. In the simplest case, movement up the ladder refers to raising scores to a higher power. Moving down the ladder refers to raising scores to successively lower powers, continuing through negative exponents (reciprocals). Exponents along this ladder and the corresponding reexpression are listed below for the range of exponents from -2 to +2.
Figure 11. Scatter plot matrix of word verification variables from Paap and Johansen (1994)
with two outliers removed and curvilinear relation between WF and RT as well as WF and
HFN exposed. (RT = reaction time; NS = neighborhood size; HFN = high-frequency
neighbors; SBF = summed bigram frequency, WF = word frequency.)
as the slider values change. Such systems allow quick assessment of a large number of reexpressions.

A choice of transformation is recommended by moving up or down the ladder in the direction of the bulk of the data on the scale. Positively skewed distributions with the bulk of the data lower on the scale can be normalized by moving down the ladder of reexpression; distributions with the bulk of the data high on the scale can be normalized by moving up the ladder of reexpression. In the present case, the WF variable has the bulk of the data in the lower portion of the distribution, so moving down the ladder is appropriate. Starting with WF^1 (the unchanged data), we move down to WF^(1/2), which is the square root of WF, followed by WF^0, which is assigned the value of log10(WF), and -WF^(-1/2), which is equal to minus one over the square root of WF. Box plots of each of these transformations for all data, including the outliers, are presented in Figure 12. As the reader may see, reexpression to a log transformation provides an approximately Gaussian distribution, whereas more extreme reexpression leads to distortion in the opposite direction and less extreme reexpression fails to correct the shape. In practice, normal probability plots rather than box plots would be used to assess normality. Box plots, however, effectively and compactly communicate the effect of the reexpressions.

Panel a of Figure 13 is a scatter plot of RT regressed on log10(WF) with the two outliers indicated by "Xs" and regression lines for models with all the data as well as from outlier-deleted data only. Panel b is a plot of the regression residuals versus log10(WF). These plots indicate that the log reexpression improves the fit dramatically, whereas the similarity of regression lines with and without the outliers (Panel a) indicates that setting aside the outliers does little to change the regression line. In this dimension the outliers may be considered natural extensions of the tail of this log-normal distribution. A very misleading fit would occur in any model assuming RT = a + b(WF) rather than RT = a + b(log(WF)). The residual plots also indicate three words whose residuals are exceptionally large as indicated by their positions above the bulk of the data in the left, center, and right side of the residual plots. These points indicate the values of the words "oaf," "mere," and "came," respectively, with RTs much longer than otherwise expected given their word frequency. Analyses of the SBF variable lead to the conclusion that the SBF variable is roughly Gaussian with the exception of the two outliers.

To properly specify a linear model using the WF variable, it should be reexpressed to log(WF). To appropriately include SBF, the two extreme points should be set aside and noted for their impact on the analysis. Including the two outliers in subsequent analyses would serve no purpose but to demonstrate that the majority of the SBF pattern cannot be well modeled because of two rogue points. Setting them aside will allow appropriate modeling of the bulk of the data. This is a practical application of the principle that it is better to be somewhat right than precisely wrong. All of this information suggests the corpus of words used in this study requires additional attention.

How did we do? To assess the total effect of the
[Figure 12 image: box plots of WF, sqrt(WF), log10(WF), and -1/sqrt(WF)]
Figure 12. Box plots of reexpressions of word frequency (WF) moving down the ladder of
reexpression. The locations of data in the upper and lower quartiles are marked using lines
extending away from the box. The O indicates outliers that deserve special attention. The *
indicates extreme outliers. (RT = reaction time; HFN = high-frequency neighbors; NS =
neighborhood size; WF = word frequency; SBF = summed bigram frequency.)
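The four rungs of the ladder compared in Figure 12 can also be compared numerically. The sketch below is illustrative only: the data are invented so that their logarithms are evenly spaced, and the quartile-based skewness index (third quartile plus first quartile minus twice the median, divided by the interquartile range) is a resistant measure chosen for this example rather than one taken from the article.

```python
import math
from statistics import quantiles


def quartile_skew(xs):
    """Resistant skewness index: 0 when the quartiles are symmetric about the median."""
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")  # assumed quartile convention
    return (q3 + q1 - 2 * q2) / (q3 - q1)


# Rungs of the ladder, moving down from the unchanged data.
ladder = {
    "WF^1": lambda v: v,
    "WF^(1/2)": math.sqrt,
    "log10(WF)": math.log10,
    "-WF^(-1/2)": lambda v: -1 / math.sqrt(v),
}

# Hypothetical positively skewed values whose logs run evenly from 0 to 3.
wf = [10 ** (k / 10) for k in range(31)]

skew_by_rung = {name: quartile_skew([f(v) for v in wf]) for name, f in ladder.items()}
for name, skew in skew_by_rung.items():
    print(f"{name:12s} quartile skew = {skew:+.3f}")
```

For these invented values the positive skew shrinks with each step down the ladder, vanishes at the log rung, and reverses sign at the reciprocal root, mirroring the pattern the box plots are said to show for WF.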
[Figure 13 image: Panels (a) and (b)]
Figure 13. Panel a: Scatter plot of reaction time (RT) versus log (word frequency [WF]) with
regression lines for all data and outlier deleted data. Panel b: Residuals from regression of
data in Panel a.
work we have done up to this point, we replot the scatter plot matrix with the SBF outliers removed and the reexpressed WF variable as shown in Figure 14. Comparing the quality of regression lines for predicting data in Figure 14 with that of Figure 10 underscores the value of EDA in steering researchers toward an appropriate model. The reexpression corrected the curvilinearity in the WF-RT relationship as well as in the WF-HFN relationship. Because it is sometimes difficult to understand the reexpression being used, readers may benefit from seeing predicted values from reexpressed variables plotted in the scale of the original variables as shown in Figure 15. Panel a portrays the predicted values of RT from log(WF) plotted against log(WF), and Panel b portrays the same values plotted against their corresponding WF values. When viewed in conjunction with the RT versus WF plot in Figure 11, Figure 15 reveals that the relation between log(WF) and WF is a natural reexpression that catches the curve in the data that is otherwise missed by failing to reexpress the data.

Interestingly, Paap and Johansen (1994) noted that log transformations have been computed in other research labs and have led to substantive conclusions different from their own. They therefore reanalyzed the data described here using the log(WF) transformation on the grounds of historical precedence in RT experiments. Focusing primarily on the size of the correlations and the role of the logged variable in a multiple regression analysis predicting RT, the improvement in fit observed here was interpreted quite differently.2

In summary, when plain WF was entered as a predictor, the number of HFNs and NS were significant predictors. However, when log WF was used, only log WF was a significant predictor. The skittishness of the variables in these analyses may have occurred because the collinearity problem between the predictors actually became worse when the log transform was applied. The correlation of -.23 between plain WF and the number of HFNs ballooned to -.65 for the log WF. The greater the collinearity between two predictors, the less confident one can be that the statistical model has identified the real winner.... Because of the collinearity problem, some will see the hole (effects of log WF) where others see the doughnut (effects of NS and the number of HFNs) in our data. (pp. 1145-1146)

Without the log transformation, these authors found what they predicted: a significant relationship between RT and HFN and a nonsignificant relation between RT and WF. Alternatively, the logarithmic reexpression led to a nonsignificant correlation between RT and HFN and a significant correlation between RT and log(WF), results inconsistent with their theory. Without understanding the shapes of the distributions involved and the effect of curvilinearity and outliers, these authors were left to hypothesize "skittishness" and "ballooning" variables, collinearity, and a positive-thinking bakery theory for choosing among statistical models. The simple graphics used here, however, explain the situation quite well. WF has a curvilinear (logarithmic) relationship with RT and HFN. This curvilinearity is a violation of an assumption of the linear regression model used. Therefore, no significant slopes can be found, as indicated in Figure

2 The analysis discussed in the passage quoted here discusses a model with a term for the summed log bigram frequency rather than the summed bigram frequency discussed in this article. This difference did not affect the relationship among RT, HFN, and WF discussed here and is omitted for the sake of simplicity.
lt> RT-roo Plot fCTL..LII-l l i b RT-HFNPtotlor ..Dill I f r KTH8 Plot for I.. f]fl| Ib HTTP Plot forl.. Hpl'l l> RT'SeF Plg< for ..
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
This document is copyrighted by the American Psychological Association or one of its allied publishers.
Ib LO-n»PWfcrL..ljPI Ib LO-SSfPkXtor ..
Figare K Scatter plot matrix of the Paap and Johansen (1994) data with log reexpression
of WF and two outliers removed. (RT = reaction time; NS = neighborhood size; HFN =
high-frequency neighbors; SBF = summed bigram frequency, WF = word frequency.)
10. The log transformation specifies the degree of When WF is included in the equation in its original
bend in the data so it can be accommodated by the form, the suppressed measure of relationship leads to
regression model RT = a + i(HFN) + fc(NS) + little correction in the HFN-RT relationship. When,
6(SBF) + fo(log(WF)). The correct log(WF) model however, WF is appropriately reexpressed to account
specification reveals a strong curvilinear relationship for the curvilinearity, its relation with HFN is prop-
between RT and WF as well as HFN and WF. erly expressed as high linear relationship and the re-
The disappearance of the HFN effect needs to be lation between HFN and RT is adjusted downward to
understood in the context of the multiple regression take into account the now large correlation between
models used. In such models, relationships between HFN and log(WF).
each predictor variable and the criterion are adjusted These important aspects of the analysis can be in-
for the presence of all the other predictor variables. ferred from Figure 14, correlation matrices, and the
Figure 15. Plot of predicted values from linear regression of reaction time (RT) on high-frequency neighbor, neighborhood size, log(word frequency), and summed bigram frequency on (Panel a) scale of log(WF) and (Panel b) scale of WF. Note how the predicted values properly model the curve of the data in the original scale.
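The adjustment that multiple regression applies to each predictor can be reproduced by hand with residuals, which is also the computation behind a partial regression plot. The Python sketch below makes several assumptions: the data are simulated rather than taken from Paap and Johansen (1994), the coefficients are arbitrary, and the helper names `ols` and `residualize` are invented for the example. It adjusts both RT and HFN for the remaining predictors and then regresses one set of residuals on the other; the resulting slope agrees with the HFN coefficient from the full multiple regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated predictors. Assumption: HFN depends on log(WF), so the two are collinear.
log_wf = rng.uniform(0, 3, size=n)
ns = rng.normal(5, 2, size=n)
sbf = rng.normal(0, 1, size=n)
hfn = 4.0 - 1.2 * log_wf + rng.normal(0, 0.5, size=n)
# In this simulation RT depends on log(WF), NS, and SBF, but not on HFN.
rt = 700.0 - 40.0 * log_wf + 2.0 * ns + 1.0 * sbf + rng.normal(0, 10, size=n)

def ols(X, y):
    """Least-squares coefficients of y on X, intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def residualize(y, X):
    """Residuals of y after removing its linear relation with the columns of X."""
    design = np.column_stack([np.ones(len(y)), X])
    return y - design @ ols(X, y)

others = np.column_stack([ns, log_wf, sbf])

# Full multiple regression: RT on HFN plus the other predictors.
b_hfn_full = ols(np.column_stack([hfn, others]), rt)[1]

# Partial regression: adjust RT and HFN for the other predictors,
# then relate the two sets of residuals.
rt_adj = residualize(rt, others)
hfn_adj = residualize(hfn, others)
b_hfn_partial = ols(hfn_adj.reshape(-1, 1), rt_adj)[1]

# Unadjusted slope of RT on HFN alone, for contrast.
b_hfn_simple = ols(hfn.reshape(-1, 1), rt)[1]

print(f"HFN slope, RT on HFN alone:       {b_hfn_simple:.2f}")
print(f"HFN slope, multiple regression:   {b_hfn_full:.2f}")
print(f"HFN slope, residuals on residuals: {b_hfn_partial:.2f}")
```

A scatter plot of `rt_adj` against `hfn_adj` is a partial regression plot for HFN. In this simulation the unadjusted slope is large only because HFN stands in for log(WF), and it shrinks toward zero once both variables are adjusted.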
slope estimates of the multiple regressions. Nevertheless, the exploratory analyst may want additional information because partial correlations and conditional slopes may also be distorted by aberrant data patterns. To obtain a more detailed description of the data as represented in the machinery of the multiple regression and to assess the validity of the computational model, partial regression plots can be used. A partial regression plot takes advantage of the fact that adjusting one predictor, such as HFN, for its relationship with another predictor, such as log(WF), is equivalent to regressing HFN onto log(WF) and using the residuals for subsequent analyses. Because the residuals are the data after the effect of the model has been subtracted, the residuals from $HFN = a + b(\log(WF))$ are WF-corrected HFN data. Any analysis with these residuals would be equivalent to a partial regression of HFN. According to Velleman (1992), "A partial regression plot graphs y with the linear effects of the other x-variables removed against x with the linear effects of the other variables removed" (pp. 23-24).

Panel a of Figure 16 is the partial regression plot between RT and HFN when each is adjusted for NS, WF, and SBF. Note that the slope of the line indicates that a relationship exists between these two variables after their adjustment for relations with other variables. Panel b is a partial regression plot between RT and HFN when both are adjusted for NS, log(WF), and SBF. The absence of relationship between the two sets of residuals in Panel b reflects the small partial correlation between these variables that has occurred because the properly specified model leads to appropriate measures of relationship with WF and RT and HFN. The WF variable is not skittish but strongly curvilinear in a world in which these data analysts assumed all relationships are linear.

Figure 16. Partial regression plots of reaction time (RT) and high-frequency neighbor (HFN). Each variable is adjusted for linear relations with all other predictors. Additional explanatory variables are neighborhood size (NS), word frequency (WF), and summed bigram frequency (SBF) for Panel a and NS, log(WF), and SBF for Panel b.

Life without EDA. The interactive analysis described here contrasts with Paap and Johansen's (1994) use of CDA alone. Working in an exploratory mode, an initial model was sought, transformations were attempted, residuals were used for model evaluation, and the cycle of model searching continued. This process quickly revealed numerous unexpected aspects of the data with important consequences for model development. Without the detailed description and open attitude available in EDA, Paap and Johansen were left with seemingly conflicting statistics from the black box of the hypothesis tests. They began with specific hypotheses concerning what variables would be related under what conditions, but, because of a lack of detailed familiarity with the data, they failed to specify a model even close to the empirical outcome. This underscores the idea that theoretical hypotheses need to be balanced with rich knowledge of the data being examined. Even when firm hypotheses are held a priori, working in the exploratory mode is always useful to find out what we did not expect. The only caveat required is that conclusions obtained as the result of exploratory analyses are considered exploratory and that confirmation of such conclusions will occur only when CDA is undertaken on different data.

The analysis reported here is a small part of an exploratory analysis of these data. Although the logarithmic transformation appears to lead to quite good model specification, the RT variable has some departure from symmetry and may benefit from a 1/RT reexpression that would put it in the scale of speed. Other types of regression diagnostics and plots could have been used, such as three-dimensional rotating plots and the assessment of other outliers.

Conclusion: Psychological Method and EDA

EDA is a well-established tradition in the statistical literature. The goal of EDA is to find patterns in the data that allow researchers to build rich mental models of the phenomenon being examined. Examining the Feingold (1994) and Lauver and Jones (1991) data, we found EDA useful when there is little explicit theoretical background to guide prediction and the first stages of model building are desired. Examining the Paap and Johansen data, we also saw that, even when a priori hypotheses exist, EDA can perform a valuable service by providing rich descriptions of the data that can inform researchers whether their mental models are even close enough to the underlying data patterns to consider CDA. In either case, tools for EDA provide a much wider range of information than the answers to a specific probabilistic question.

Theory development and testing are hallmarks of scientific psychology. Good theory development and testing integrate a wide range of information about the data being evaluated, so researchers can be certain they have not missed important aspects of the phenomenon and have not been fooled by pathological data patterns or model misspecification. EDA promotes good theory development and testing by helping researchers ensure their models are aligned with reality and they are not being misled by more removed summaries. As long as researchers are clear about what activity is exploratory and what is confirmatory, and the strength of conclusions from each mode is appropriate, EDA will facilitate, rather than retard, theory development and testing. An increase in our knowledge about the data is always beneficial as long as its limits are clear. From this analysis a number of recommendations can be made.

First, EDA should be recognized as an important aspect of data analysis whose conduct and publication are valued. By admitting EDA as an acceptable set of procedures, researchers can avoid the improper use of CDA techniques for the purposes of data exploration. As long as EDA remains a covert activity, researchers will continue to improperly use CDA for data exploration through model underspecification and overtesting. An increase in EDA will focus more resources at the preliminary stages of investigations and less at the advanced stages. In so doing, the number of irreproducible results may be reduced by the substitution of adequate model building for the cataloging of significant effects. Further, the detail in modeling afforded by EDA may improve our understanding of phenomena otherwise hidden behind simple summary statistics and tests, as seen in the Paap and Johansen (1994) data. In this regard, editors and reviewers should follow the lead of Loftus (1993), whose first editorial statement for Memory and Cognition included headings of "Figures are Good" and "Data Analysis: A Picture is Worth a Thousand Words."

This is not to say that all exploratory work should be published, but rather that all published and initial work should be explored. The field would greatly benefit if all published reports included the statement "we examined the data in detail and found the patterns underlying the summary statistics were not obviously pathological." More detailed reporting would also be welcome. When auxiliary exploratory analysis cannot fit into a standard journal format, additional graphics and reports may be distributed over the Internet or by other electronic means. Behrens and Dugan (1996) provide an example of such supplemental graphic reporting.

Second, quantitative analysis should be thought of "more as applied epistemology and less as applied
mathematics" (Behrens & Smith, 1996). When considering statistics as applied mathematics rather than applied epistemology, many messy real-world issues are often swept under the rug. Instruction addressing the assumptions that must be met for a statistic to be meaningful almost always focuses on assumptions about theoretical distributions rather than assumptions about the world. Sharing this value with EDA, Box (1976) labeled the overemphasis on theoretical issues "mathematistry" for which he prescribed practical experience and trust of the scientist's intuitions. By focusing on understanding the data in whatever way is reasonable (not only probabilistically), EDA opens the data analyst to consider the wide range of ways of knowing about data. This ecumenical view leaves researchers considering mathematics as an epistemic tool rather than a complete answer in itself. Mathematics should be used based on how helpful it is in understanding data, not simply on its syntactical correctness. Such a position will minimize what has been referred to as Type III error: "precisely solving the wrong problem, when you should have been working on the right problem" (Mitroff, Kilmann, & Barabba, 1979, p. 140, cited in Barabba, 1991).

Third, graduate programs should integrate instruction in confirmatory statistics with alternative data analytic methods. Instruction in EDA offers students a view of data analysis from outside traditional statistics. Such an alternate view may allow new appreciation and understanding of CDA. Other complementary methods include meta-analysis (Glass, 1976; Glass, McGaw, & Smith, 1981), Bayesian analysis (Howson & Urbach, 1993; Winkler, 1993), interval estimation approaches, and hybrid combinations (Box, 1980). Just as history and systems of psychology are taught in psychology, might students not benefit from a history and systems of data analysis? The appropriate size of such curricular additions will vary across programs. At the very least, the idea of multiconceptual approaches could be incorporated in already existing classes.

Fourth, data analysts should recognize that subjectivity and potential bias are inherent in all data analysis, exploratory or otherwise. One great danger in overmathematizing data analysis is believing that the reliability and precision of mathematics itself imbue reliability and precision to the data and the data analysis. The artifactual nature of psychological investigation has been well established by Rosenthal (1966), Rosnow (1981), Danziger (1990), and others. Understanding of the role of cognitive, historical, and social artifact in data analysis is also emerging. Rosenthal (Cooper & Rosenthal, 1980; Rosenthal & Gaito, 1963, 1964) demonstrated a consistent overweighing of "significance" in light of varying sample sizes, and Bar-Hillel (1989; Bar-Hillel & Falk, 1982) illustrated the subjectivity inherent in translating mathematical concepts into natural language. Flow charts and expert systems suggest data analysis is a purely rational process, yet choice of data analytic behavior is ultimately dependent on the same psychological factors that affect cognition and behavior in other spheres of life. Bias in data analysis will not be mollified by assent to stricter design and control of Type I error, but by the detailed analysis of data that excludes alternate statistical explanations as demonstrated previously.

Fifth, psychologists should consider the possibility that their craft can improve the conduct of EDA and data analysis in general. For example, Simon (1973; Simon, Langley, & Bradshaw, 1981) has long held that logics of discovery are possible and psychologically tractable, a position supported by the construction of the BACON program (cf. Langley, Simon, Bradshaw, & Zytkow, 1987). Gigerenzer (1991) noted that the heuristics encoded in BACON are quite similar to those of EDA and mentioned Tukey (1977) specifically. Investigations are still needed to examine the processes involved in comprehending common statistical graphics (cf. Simkin & Hastie, 1987; Kosslyn, 1989; Lewandowsky & Spence, 1989, 1990) as well as those specific to EDA (cf. Behrens, Stock, & Sedgwick, 1990; Stock & Behrens, 1991). The statistical community recognizes the potential of transdisciplinary work and has provided open invitations to the psychological community (Kruskal, 1982; Mosteller, 1988; Tukey & Wilk, 1986).

Given dramatic improvements in computational ability and increased sensitivity to the psychological and social aspects of data analysis, the time is ripe for a broad conceptualization of data analysis that includes the principles and procedures of EDA. Lest these recommendations seem dogmatic, the final word is left for Neyman and Pearson (1928) from "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference." This article represented the first great break from the Fisherian view (introducing alternative distributions and Type II error) and the beginning of current practice. Their attitude toward mechanized inference can easily be deduced. It is, in fact, good counsel for consideration of any method:
The process of reasoning, however, is necessarily an individual matter, and we do not claim that the method which has been most helpful to ourselves will be of greatest assistance to others. It would seem to be a case where each individual must reason out for himself his own philosophy. (p. 230)

References

Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A review of Ph.D. programs in North America. American Psychologist, 45, 721-734.
Atkinson, A. C. (1985). Plots, transformation, and regression: An introduction to graphical methods of diagnostic regression analysis. Oxford, England: Oxford University Press.
Bar-Hillel, M. (1989). Discussion: How to solve probability teasers. Philosophy of Science, 56, 348-358.
Bar-Hillel, M., & Falk, R. (1982). Some teasers concerning conditional probabilities. Cognition, 11, 109-122.
Barabba, V. P. (1991). Through a glass less darkly. Journal of the American Statistical Association, 86, 1-8.
Barnett, V., & Lewis, T. (1994). Outliers in statistical data (2nd ed.). New York: Wiley.
Becker, R. A., Chambers, J. M., & Wilks, A. R. (1988). The New S Language: A programming environment for data analysis and graphics. Monterey, CA: Wadsworth & Brooks/Cole.
Behrens, J. T. (1996). Course materials for EDP 691: Graphical and exploratory data analysis. Available http://research.ed.asu.edu/classes/eda.
Behrens, J. T., & Dugan, J. G. (1996). A graphical tour of the White Racial Identity Attitude Scale data in hyper-text and VRML. Available http://research.ed.asu.edu/reports/wrias.
Behrens, J. T., & Smith, M. L. (1996). Data and data analysis. In D. Berliner & B. Calfee (Eds.), The handbook of educational psychology (pp. 945-989). New York: Macmillan.
Behrens, J. T., Stock, W. A., & Sedgwick, C. E. (1990). Judgment errors in elementary box-plot displays. Communications in Statistics B: Simulation and Computation, 19, 245-262.
Berk, K. N. (1994). Data analysis with student Systat. Cambridge, MA: Course Technology, Inc.
Bernoulli, D. (1961). The most probable choice between several discrepant observations and the formation therefrom of the most likely induction. Biometrika, 48, 3-13. (Original work published 1777)
Bertin, J. (1983). Semiology of graphics (W. J. Berg, Trans.). Madison: University of Wisconsin Press.
Bessel, F. W., & Baeyer, J. J. (1838). Gradmessung in Ostpreussen und ihre Verbindung mit Preussischen und Russischen Dreiecksketten. Berlin: Druckerei der Königlichen Akademie der Wissenschaften.
Beveridge, W. I. B. (1950). The art of scientific investigation. New York: Vintage Books.
Bode, H., Mosteller, F., Tukey, J. W., & Winsor, C. (1986). The education of a scientific generalist. In L. V. Jones (Ed.), The collected works of John W. Tukey, Volume III: Philosophy and principles of data analysis: 1949-1964. Pacific Grove, CA: Wadsworth. (Original work published 1949)
Boring, E. G. (1919). Mathematical vs. scientific significance. Psychological Bulletin, 16, 335-339.
Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71, 791-799.
Box, G. E. (1980). Sampling and Bayes' inference in scientific modeling and robustness. Journal of the Royal Statistical Society (A), 143, 383-430.
Burt, C. (1961). Intelligence and social mobility. British Journal of Statistical Psychology, 14, 3-23.
Campbell, D. T. (1988). Descriptive epistemology: Psychological, sociological, and evolutionary. In E. S. Overman (Ed.), Methodology and epistemology for social science: Selected papers of Donald T. Campbell (pp. 435-486). Chicago: University of Chicago Press.
Chamberlain, T. C. (1965). The method of multiple working hypotheses. Science, 148, 754-759.
Chambers, J. M., Cleveland, W. S., Kleiner, B., & Tukey, P. A. (1983). Graphical methods for data analysis. Belmont, CA: Wadsworth.
Cleveland, W. S. (1985). The elements of graphing data. Monterey, CA: Wadsworth.
Cleveland, W. S. (Ed.). (1988). The collected works of John W. Tukey: Vol. V. Graphics. Belmont, CA: Wadsworth.
Cleveland, W. S. (1993). Visualizing data. Summit, NJ: Hobart Press.
Cleveland, W. S., & McGill, M. E. (1988). Dynamic Graphics for Statistics. Monterey, CA: Wadsworth.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cook, R. D., & Weisberg, S. (1994). An introduction to regression graphics. New York: Wiley.
Cooper, H. M., & Hedges, L. V. (Eds.). (1994). The handbook of research synthesis. New York: Russell Sage Foundation.
Cooper, H. M., & Rosenthal, R. (1980). Statistical versus traditional procedures for summarizing research findings. Psychological Bulletin, 87, 422-449.
Danziger, K. (1990). Constructing the subject: Historical origins of psychological research. New York: Cambridge University Press.
Data Description, Inc. (1995). Data desk, 5.0 (computer software). Ithaca, NY: Data Description.
Dunlap, K. (1935). The average animal. Journal of Comparative Psychology, 19, 1-3.
Efron, B. E. (1982). The jackknife, the bootstrap, and other resampling methods. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Emerson, J. D. (1991). Introduction to transformation. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Fundamentals of exploratory analysis of variance (pp. 365-400). New York: Wiley.
Emerson, J. D., & Hoaglin, D. C. (1983). Resistant lines for y versus x. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Understanding robust and exploratory data analysis (pp. 129-165). New York: Wiley.
Emerson, J. D., & Stoto, M. A. (1983). Transforming data. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Understanding robust and exploratory data analysis (pp. 97-128). New York: Wiley.
Emerson, J. D., & Strenio, J. (1983). Boxplots and batch comparisons. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Understanding robust and exploratory data analysis (pp. 58-96). New York: Wiley.
Erickson, B. H., & Nosanchuk, T. A. (1992). Understanding data (2nd ed.). Toronto, Ontario, Canada: University of Toronto Press.
Feingold, A. (1994). Gender differences in personality: A meta-analysis. Psychological Bulletin, 116, 429-456.
Finch, P. D. (1979). Description and analogy in the practice of statistics. Biometrika, 66, 195-208.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, 222, 309-368.
Fisher, R. A. (1925). Statistical methods for research workers. London: Oliver and Boyd.
Frigge, M., Hoaglin, D. C., & Iglewicz, B. (1989). Some implementations of the boxplot. The American Statistician, 43, 50-54.
Giere, R. N. (1984). Understanding Scientific Reasoning (2nd ed.). New York: Holt, Rinehart & Winston.
Gigerenzer, G. (1991). From tools to theories: A heuristic of discovery in cognitive psychology. Psychological Review, 98, 254-267.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Good, I. J. (1983). The philosophy of exploratory data analysis. Philosophy of Science, 50, 238-295.
Goodall, C. (1983a). Examining residuals. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Understanding robust and exploratory data analysis (pp. 211-246). New York: Wiley.
Goodall, C. (1983b). M-Estimators of location: An outline of the theory. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Understanding robust and exploratory data analysis (pp. 339-403). New York: Wiley.
Greenhouse, J. B., & Iyengar, S. (1994). Sensitivity analysis and diagnostics. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 383-398). New York: Russell Sage Foundation.
Hadi, A. S., & Simonoff, J. S. (1993). Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association, 88, 1264-1272.
Hampel, F. R. (1971). A general qualitative definition of robustness. Annals of Mathematical Statistics, 42, 1887-1896.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69, 383-393.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. New York: Wiley.
Hartwig, F., & Dearing, B. E. (1979). Exploratory data analysis. Beverly Hills, CA: Sage.
Hawkins, D. M. (1980). Identification of outliers. New York: Chapman & Hall.
Henderson, H. V., & Velleman, P. F. (1981). Building multiple regression models interactively. Biometrics, 37, 391-411.
Hoaglin, D. C. (1988). Transformations in everyday experience. Chance, 1, 40-45.
Hoaglin, D. C., & Iglewicz, B. (1987). Fine-tuning some resistant rules for outlier labeling. Journal of the American Statistical Association, 82, 1147-1149.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (1983a). Introduction to more refined estimators. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Understanding robust and exploratory data analysis (pp. 283-296). New York: Wiley.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1983b). Understanding robust and exploratory data analysis. New York: Wiley.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1985). Exploring data tables, trends, and shapes. New York: Wiley.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1991). Fundamentals of exploratory analysis of variance. New York: Wiley.
Howson, C., & Urbach, P. (1993). Scientific reasoning: The Bayesian approach (2nd ed.). Peru, IL: Open Court.
Huberty, C. J. (1991). Introduction to the practice of statistics [Review]. Journal of Educational Statistics, 16, 77-81.
Jones, L. V. (Ed.). (1986a). The collected works of John W.
Tukey. Volume III: Philosophy and principles of data analysis: 1949-1964. Belmont, CA: Wadsworth.
Jones, L. V. (Ed.). (1986b). The collected works of John W. Tukey. Volume IV: Philosophy and principles of data analysis (1965-1986). Belmont, CA: Wadsworth.
Jöreskog, K., & Sörbom, D. (1993). LISREL 8: Structural equation modeling with the SIMPLIS command language. Chicago: Scientific Software International.
Kosslyn, S. M. (1989). Understanding charts and graphs. Applied Cognitive Psychology, 3, 185-225.
Kraemer, H. C., & Thiemann, S. (1987). How many subjects?: Statistical power analysis in research. Beverly Hills, CA: Sage.
Kruskal, W. H. (1982). Criteria for judging statistical graphics. Utilitas Mathematica, 21B, 283-310.
Langley, P., Simon, H. A., Bradshaw, G. L., & Zytkow, J. M. (1987). Scientific discovery. Cambridge, MA: MIT Press.
Lauver, P. J., & Jones, R. M. (1991). Factors associated with perceived career options in American Indian, White, and Hispanic rural high school students. Journal of Counseling Psychology, 38, 159-166.
Leinhardt, G., & Leinhardt, S. (1980). Exploratory data analysis: New tools for the analysis of empirical data. In D. Berliner (Ed.), Review of Research in Education (Vol. 8, pp. 85-157).
Leinhardt, S., & Wasserman, S. S. (1979). Exploratory data analysis: An introduction to selected methods. In K. F. Schuessler (Ed.), Sociological methodology (pp. 311-365). San Francisco: Jossey-Bass.
Lent, R. W., & Hackett, G. (1987). Career self-efficacy: Empirical status and future directions. Journal of Vocational Behavior, 30, 347-382.
Lewandowsky, S., & Spence, I. (1989). Discriminating strata in scatterplots. Journal of the American Statistical Association, 84, 682-688.
Lewandowsky, S., & Spence, I. (1990). The perception of statistical graphs. Sociological Methods and Research, 18, 200-242.
Light, R. J., Singer, J. D., & Willett, J. B. (1994). The visual presentation and interpretation of meta-analyses. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 439-454). New York: Russell Sage Foundation.
Lind, J. C., & Zumbo, B. D. (1993). The continuity principle in psychological research: An introduction to robust statistics. Canadian Psychology, 34, 407-414.
Loftus, G. R. (1993). Editorial comment. Memory and Cognition, 21, 1-3.
MacDonald, K. I. (1983). Exploratory data analysis: A process and a problem. In D. McKay, N. Schofield, & P. Whiteley (Eds.), Data analysis and the social sciences (pp. 256-284). London: Pinter.
Mallows, C. L. (1979). Robust methods: Some examples of their use. The American Statistician, 33, 179.
Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data. Belmont, CA: Wadsworth.
Mayer, L. S. (1980). The use of exploratory methods in economic analysis: Analyzing residential energy demand. In J. Kmenta & J. B. Ramsey (Eds.), Evaluation of econometric models (pp. 15-45). New York: Academic Press.
McGill, R., Tukey, J. W., & Larsen, W. A. (1978). Variations of box plots. The American Statistician, 32, 12-16.
McGuire, W. J. (1989). A perspectivist approach to the strategic planning of programmatic scientific research. In B. Gholson, W. R. Shadish, Jr., R. A. Neimeyer, & A. C. Houts (Eds.), Psychology of science: Contributions to metascience. Cambridge, England: Cambridge University Press.
Mitroff, I. I., Kilmann, R. K., & Barabba, V. P. (1979). Management information versus misinformation systems. In G. Zaltman (Ed.), Management principles for nonprofit agencies and organizations (p. 104). New York: AMACOM.
Mosteller, F. (1988). Broadening the scope of statistics and statistics education. The American Statistician, 42, 93-99.
Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression: A second course in statistics. Reading, MA: Addison-Wesley.
Mulaik, S. A. (1984). Empiricism and exploratory statistics. Philosophy of Science, 52, 410-430.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A, 175-240.
Paap, K. R., & Johansen, L. S. (1994). The case of the vanishing frequency effect: A retest of the verification model. Journal of Experimental Psychology: Human Perception and Performance, 20, 1129-1157.
Rosenthal, R. (1966). Experimenter effects in behavioral research. New York: Appleton-Century-Crofts.
Rosenthal, R., & Gaito, J. (1963). The interpretation of levels of significance by psychological researchers. Journal of Psychology, 55, 33-38.
Rosenthal, R., & Gaito, J. (1964). Further evidence for the cliff effect in the interpretation of levels of significance. Psychological Reports, 15, 570.
Rosenthal, R., & Rosnow, R. L. (1991). Essentials of behavioral research: Methods and analysis. New York: McGraw-Hill.
Rosnow, R. L. (1981). Paradigms in transition: The methodology of social inquiry. New York: Oxford University Press.
Scott, D. W. (1992). Multivariate density estimation: Theory, practice, and visualization. New York: Wiley.
Simkin, D., & Hastie, R. (1987). An information processing analysis of graph perception. Journal of the American Statistical Association, 82, 454-465.
Simon, H. A. (1973). Does scientific discovery have a logic? Philosophy of Science, 40, 471-480.
Simon, H. A., Langley, P. W., & Bradshaw, G. L. (1981). Scientific discovery as problem solving. Synthese, 47, 1-27.
Smith, A. F., & Prentice, D. A. (1993). Exploratory data analysis. In G. Keren & C. Lewis (Eds.), A handbook for
sis: 1965-1986 (pp. 517-547). Monterey, CA: Wadsworth & Brooks/Cole.
Tukey, J. W. (1986d). Introduction to styles of data analysis techniques. In L. V. Jones (Ed.), The collected works of John W. Tukey. Volume IV: Philosophy and principles of data analysis: 1965-1986 (pp. 969-983). Belmont, CA: Wadsworth. (Original work published 1982)
Tukey, J. W. (1986e). Methodological comments focused on opportunities. In L. V. Jones (Ed.), The collected works of John W. Tukey. Volume IV: Philosophy and principles of data analysis: 1965-1986 (pp. 819-867). Belmont, CA: Wadsworth. (Original work published
(Appendix follows)
Appendix
As computer graphic capabilities widen in commonly used machines, software for graphic and exploratory analyses gains in popularity. Nevertheless, only a few programs have a decisively strong EDA emphasis. In this article, all the graphics with the exception of kernel density estimates were produced in Data Desk on an Apple Power Macintosh computer. As illustrated previously, Data Desk is an exceptional tool for EDA, having been designed from its inception to be an EDA technology. The documentation that accompanies the software may be the single best source of technical and practical information concerning EDA. Data Desk offers a completely graphical interface for EDA and requires no programming. Another heavily EDA-oriented program is S-plus, the only software package I know of that supports graphics for the two-way fit. S-plus is a completely extensible object-oriented programming language available for UNIX and MS-Windows environments; however, it has less graphical interactivity than Data Desk. S-plus is commonly used in the statistical graphics research community, and there is a large archive of user-created S-plus functions available at the Carnegie Mellon Statlib at http://lib.stat.cmu.edu.
The kernel density estimate plots shown previously were produced in XLISP-STAT (Tierney, 1990), a LISP-based system of statistical functions and graphics that is completely extensible and highly interactive. XLISP-STAT is also gaining a wide following in the statistical graphics community. XLISP-STAT does not require LISP programming skills for most tasks and is available free of charge for UNIX, Macintosh, and Windows environments. Copies of the program can be obtained at http://stat.umn.edu.
Other programs incorporate EDA procedures as well. Systat has a wide variety of graphics and some interactivity, as does SAS-JMP, although SAS-JMP reflects more of a CDA philosophy than may be convenient for strong EDA work, such as its strong association of levels of measurement with types of analyses (see Velleman & Wilkinson, 1993, for a discussion of this concept). Almost all software packages now emphasize the strength and beauty of graphical analysis, with access to box plots, stem-and-leaf plots, and so on. Consumers should be aware that EDA functions best in highly interactive environments that support quick question assessment. This involves complex interface issues that cannot be solved by the simple inclusion of a new plot in the list of options.
Readers interested in learning more about current issues in data visualization and advances in statistical computing should consult the Journal of Computational and Graphical Statistics, which was inaugurated in 1992. When considering specific products, readers may want to consult The American Statistician, which periodically contains software reviews by statistical computing experts.

Received June 7, 1996
Revision received August 26, 1996
Accepted October 12, 1996