Psychon Bull Rev (2018) 25:1968–1972
DOI 10.3758/s13423-017-1348-y
    BRIEF REPORT
Participant Nonnaiveté and the reproducibility
of cognitive psychology
Rolf A. Zwaan 1 & Diane Pecher 1 & Gabriele Paolacci 2 & Samantha Bouwmeester 1 &
Peter Verkoeijen 1,3 & Katinka Dijkstra 1 & René Zeelenberg 1
Published online: 25 July 2017
© The Author(s) 2017. This article is an open access publication
Abstract Many argue that there is a reproducibility crisis in psychology. We investigated nine well-known effects from the cognitive psychology literature (three each from the domains of perception/action, memory, and language) and found that they are highly reproducible. Not only can they be reproduced in online environments, but they also can be reproduced with nonnaïve participants with no reduction of effect size. Apparently, some cognitive tasks are so constraining that they encapsulate behavior from external influences, such as testing situation and prior recent experience with the experiment, to yield highly robust effects.

Keywords Replication · Reproducibility · Perception · Memory · Language

Electronic supplementary material The online version of this article (doi:10.3758/s13423-017-1348-y) contains supplementary material, which is available to authorized users.

* Rolf A. Zwaan
  rolfzwaan@gmail.com

1 Department of Psychology, Educational, and Child Sciences, Erasmus University Rotterdam, Burgemeester Oudlaan 50, 3000 DR Rotterdam, Netherlands
2 Rotterdam School of Management, Erasmus University Rotterdam, Rotterdam, Netherlands
3 Learning and Innovation Center, Avans University of Applied Sciences, Breda, The Netherlands

A hallmark of science is reproducibility. A finding is promoted from anecdote to scientific evidence if it can be reproduced (Lykken, 1968; Popper, 1959). There is growing awareness that problems exist with reproducibility in psychology. A recent estimate is that fewer than half of the findings in cognitive and social psychology are reproducible (Open Science Collaboration, 2015). In addition, there have been several high-profile, preregistered, multi-lab failures to replicate well-known effects in psychology (Eerland et al., 2016; Hagger et al., 2016; Wagenmakers et al., 2016). A similar multi-lab replication in psychology that was considered successful yielded an effect size that was much smaller than the original (Alogna et al., 2014). These findings have engendered pessimism about reproducibility.

Coincident with the start of the reproducibility debate was the advent of online experimentation. Crowd-sourcing websites, such as Amazon Mechanical Turk, offered the prospect of more efficient, powerful, and generalizable ways of testing psychological theories (Buhrmester, Kwang, & Gosling, 2011). The lower monetary costs and the more time-efficient way of conducting experiments online rather than in a physical lab allowed researchers to recruit larger numbers of participants spanning broader geographical, age, and educational ranges than undergraduate samples (Paolacci & Chandler, 2014). However, online experimentation presents challenges, typically associated with the loss of control over the testing environment and conditions (Bohannon, 2016). Most relevant to the reproducibility debate, online participant pools are large but not infinite, and hundreds of studies are conducted on the same participant pool every day, familiarizing participants with study materials and procedures (Chandler, Mueller, & Paolacci, 2014; Stewart et al., 2015). Of particular concern for reproducibility, participants may participate in studies in which they have participated before. A recent preregistered study found sizable reductions in decision-making effects among participants who had previously participated in the same studies, suggesting that nonnaïve participants may pose a threat to reproducibility (Chandler et al., 2015). Indeed, nonnaïve participants have
been implicated in failures to replicate and declining effect sizes (DeVoe & House, 2016; Rand et al., 2014).

Although concerns with reproducibility span the entire field of psychology and beyond, results in cognitive psychology are typically conceived as comparatively robust (Open Science Collaboration, 2015). We put a sample of these findings to a particularly stringent test by running them under circumstances that are increasingly representative of current practices of data collection but also are documented as challenging for reproducibility. In particular, we conducted the first preregistered replication of a large set of cognitive psychological effects in the most popular online participant pool (see Crump, McDonnell, & Gureckis, 2013, and Zwaan & Pecher, 2012, for non-preregistered replications on MTurk). Most importantly, we examined whether reproducibility depends on participant nonnaïveté by conducting the same experiments twice on the same participants a few days apart.

Research suggests that access to knowledge obtained from previous participation (e.g., from alternative conditions or elaboration) can affect people’s responses and may reduce effect sizes when participants accordingly adjust their intuitive responses towards what is perceived as normatively correct (Chandler et al., 2015). However, studies in cognitive psychology typically have nontransparent research goals, making memory of previous experiences irrelevant. Accordingly, a reduction of effect size due to repeated participation should be close to zero.

We tested the hypothesis that cognitive psychology is relatively immune to nonnaïveté effects in a series of nine preregistered experiments (https://osf.io/shej3/wiki/home/; see Table 1 for descriptions of each experiment). We selected these experiments for the following reasons. First, we wanted a broad coverage of cognitive psychology. Therefore, we selected three experiments each from the domains of perception/action, memory, and language, arguably the major areas in the field of cognitive psychology. Second, we selected findings that are both well known and known to be robust. After all, testing immunity to nonnaïveté effects presupposes that one finds effects in the first place. Third, we selected tasks that lend themselves to online testing. And fourth, we selected tasks that our team had experience with.
Table 1   Brief descriptions of and references to all replicated experiments
Number    Task                      Description                                                                                Reference
1         Simon task                Choice-reaction time task that measures spatial compatibility. Responses are               Craft and
                                       faster when a visual target (a red square is presented on the left of the screen)         Simon (1970)
                                       is spatially compatible with the response (pressing the left button) than when
                                       the target is spatially incompatible with the response (presented on the right
                                       of the screen).
2         Flanker task              Response inhibition task in which relevant information is selected and inappropriate       Eriksen and
                                       responses in a certain context are suppressed. Responses are faster for congruent         Eriksen (1974)
                                       trials in which compatible distractors flank a central target (AAAAA) than for
                                       incongruent trials in which incompatible distractors flank a central target (AAEAA).
3         Motor priming             A task with a priming procedure in which responses to stimuli (arrow probes <<)            Forster and
           (a = masked,                are required that are primed by presented compatible (<<) or incompatible (>>)            Davis (1984)
           b = unmasked)               items. Responses are slower for compatible items when primes are masked but
                                       faster when primes are visible.
4         Spacing effect            Learning task in which learning (of words) is spaced over time. Recall of words is         Greene (1989)
                                       higher for spaced item repetitions with intervening items than for massed items
                                       immediately repeated after their first presentation.
5         False memories            Memory task that assesses false recognition of items that have not been                    Roediger and
                                       presented in a word list but tend to be recognized as presented because they              McDermott (1995)
                                       are semantically related to the words in the list.
6         Serial position           Memory task that examines recall probability based on a word’s position in a list.         Murdock (1962)
            (a = primacy,              Recall is higher for the first and last words in the list and lowest for items in the
            b = recency)               middle of the list.
7         Associative priming       Implicit memory task which requires a response to a target word that is preceded           Meyer and
                                       by a prime word. Responses are faster when the prime is related than when the            Schvaneveldt (1971)
                                       prime is unrelated.
8         Repetition priming        Implicit memory task in which speed of response depends on previous exposure               Forster and
            (a = low frequency,        to an item and the word frequency of that item. Responses are faster for                  Davis (1984)
            b = high frequency)        repeated than for new items. This repetition effect is larger for low
                                       frequency words than high frequency words.
9         Shape simulation          Sentence-verification task that requires a response on whether the object in a picture     Zwaan, Yaxley, and
                                       was present in the previous sentence. Yes responses are faster when the picture           Stanfield (2002)
                                       matches the shape implied by the sentence than when it mismatches.
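
Several of the effects in Table 1 (e.g., the Simon, flanker, and priming effects) are response-time differences between two within-subject conditions. As a minimal sketch of how such an effect might be scored for one participant, the Python snippet below applies the 3 SD trial-trimming rule that the paper reports under its exclusion criteria and then takes the difference between condition means. The trial data, the function names `trim_trials` and `compatibility_effect`, and the choice to compute the participant mean over all trials pooled (rather than per condition) are assumptions made for illustration; they are not taken from the authors' Inquisit scripts.

```python
from statistics import mean, stdev


def trim_trials(trials, n_sd=3.0):
    """Keep trials whose RT lies within n_sd sample SDs of this participant's mean RT.

    Assumption: the mean and SD are computed over all of the participant's
    trials pooled across conditions.
    """
    rts = [rt for _, rt in trials]
    m, sd = mean(rts), stdev(rts)
    return [(cond, rt) for cond, rt in trials if abs(rt - m) <= n_sd * sd]


def compatibility_effect(trials):
    """Mean incompatible RT minus mean compatible RT, in ms."""
    compatible = [rt for cond, rt in trials if cond == "compatible"]
    incompatible = [rt for cond, rt in trials if cond == "incompatible"]
    return mean(incompatible) - mean(compatible)


# Hypothetical data for one participant: (condition, RT in ms).
trials = (
    [("compatible", rt) for rt in (400, 410, 420, 405, 415, 410)]
    + [("incompatible", rt) for rt in (440, 450, 460, 445, 455, 5000)]
)

raw_effect = compatibility_effect(trials)
trimmed = trim_trials(trials)
trimmed_effect = compatibility_effect(trimmed)

print(len(trials) - len(trimmed), "trial(s) trimmed")
print("effect before trimming:", raw_effect, "ms; after trimming:", trimmed_effect, "ms")
```

One design consequence worth noting: with a sample SD, a single trial can only lie more than 3 SDs from the mean when the participant has at least 11 trials, so a rule of this kind has no effect on very short trial lists.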
Although these findings have proven to be highly reproducible in the laboratory, their robustness in an online environment has not yet been established in preregistered experiments. More importantly, it is unknown whether these findings are robust to the presence of nonnaïve participants. We tested this hypothesis by replicating each study in the most conservative case, in which all participants encountered the study before.

General method

Detailed method descriptions for each experiment can be found in the Supplementary Materials. Participants were tested in two waves using the Mechanical Turk platform. Approval for data collection was obtained from the Institutional Review Board in the Department of Psychology at Erasmus University Rotterdam. All experiments were programmed in Inquisit. The Inquisit scripts used for collecting the data can be found at https://osf.io/ghv6m/. At the end of wave 1 of each experimental task, participants were asked to provide the following information: age, gender, native language, education. At the end of both waves, we asked the following questions, all of which could be responded to by selecting one of the alternatives “not at all,” “somewhat,” or “very much”: “I’m in a noisy environment”; “There are a lot of distractions here”; “I’m in a busy environment”; “All instructions were clear”; “I found the experiment interesting”; “I followed the instructions closely”; “The experiment was difficult”; “I did my best on the task at hand”; “I was distracted during the experiment.”

In all experiments, different versions of materials and, in some cases, key assignments were created. Different versions ensured counterbalancing of stimulus materials and key assignments. Participants were randomly assigned to one of the versions when they participated in wave 1. Then, upon return 3 or 4 days later for wave 2, half of the participants were assigned to the exact same version of the experiment and the other half were assigned to a different version such that there was zero overlap between the stimuli in the first and second wave. Participants who had participated in one of the experiments were not prohibited from participating in the other experiments.

Sampling plan

For each experiment, we started with recruiting 200 participants: 100 on Monday and 100 on Thursday. Three or four days after the first participation, each participant was invited to participate again. Our goal was to have a final sample size of 80 participants per condition (same items or different items on the second occasion), taking into account nonresponses and the exclusion criteria below. Whenever we ended up with fewer than 80 participants per condition, we recruited another batch. Because we expected null effects for the crucial interactions, power analyses could not be used to determine our sample sizes, because these analyses require that one predicts an effect and that one has strong arguments for its magnitude. Hence, we decided to obtain more observations than is typically done in previous experiments examining the same effects. By doing so, our parameter estimates are relatively precise.

Exclusion criteria

Data from participants with an accuracy <80% in RT tasks or an accuracy <10% in memory tasks or a mean reaction time (RT) longer than the group M + 3SD were excluded. Data from each participant in the RT tasks were trimmed by excluding trials where the trial RT deviated more than 3SD from the subject M. From the remainder, participants were excluded (starting with those who participated last) to create equal numbers of participants per counterbalancing version.

Participants were recruited via Amazon Mechanical Turk. The subjects participated in two waves, held approximately 3 days apart. In the second wave, half of the subjects participated in an exact copy of the experiment they had participated in before; the other half participated in a version that had an identical instruction and procedure but used different stimuli. A recent study demonstrated that certain findings replicated with the same but not with a different set of (similar) stimuli (Bahník & Vranka, 2017). Our manipulation allowed us to examine whether changing the surface features of an experiment (i.e., the stimuli) affects the reproducibility of its effect in the same sample of subjects. Each experiment had a sample size of 80 per between-subjects condition (same stimuli vs. different stimuli).

General results

Detailed results per experiment are described in the Supplementary Materials. Data for all experiments can be found here: https://osf.io/b27fd/. The results can be summarized as follows. First, the first wave yielded highly significant effects for all nine experiments, with in each case Bayes factors in excess of 10,000 in support of the prediction. Second, each effect was replicated during the second wave. Third, effect size did not vary as a function of wave; Bayes factors showed moderate to very strong support for the null hypothesis. Fourth, it did not matter whether subjects had previously participated in the exact same experiment or one with different stimuli. The main results are summarized in Fig. 1. The x-axis displays the wave-1 effect sizes and the y-axis the wave-2 effect sizes. The blue dots indicate the same-stimuli condition and the red dots the different-stimuli condition. The numbers indicate the specific experiment (e.g., 5 = false memory).

In the preregistration, we stated that “Bayesian analysis will be used to determine whether the effect size difference
[Figure 1: scatter plot of effect size in wave 2 (y-axis, 0.0 to 3.0) against effect size in wave 1 (x-axis, 0.0 to 3.0), with one labeled point per experiment (1 to 9, including 3a/3b, 6a/6b, and 8a/8b) for each of the same-stimuli and different-stimuli conditions.]

Fig. 1 Wave 1 effect size versus wave 2 effect size (Cohen’s d). Effect sizes were computed in JASP (JASP Team, 2017). Diagonal line represents equal effect sizes. For each experiment separate effect sizes are plotted for same materials between sessions (blue solid dots) and different materials between sessions (red striped dots). Labels correspond to the different experiments listed in Table 1.
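
To make the quantity plotted in Fig. 1 concrete, the following Python sketch computes a standardized effect size for a within-subject contrast in each wave. It assumes the paired-samples convention in which Cohen's d is the mean of the per-participant differences divided by their standard deviation; the source states only that effect sizes were computed in JASP, so this formula is an assumption rather than a confirmed description of the authors' analysis. The per-participant mean RTs and the function name `paired_cohens_d` are hypothetical.

```python
from statistics import mean, stdev


def paired_cohens_d(condition_a, condition_b):
    """Cohen's d for paired data: mean difference / SD of the differences.

    Assumption: this paired-samples convention is used for illustration only.
    """
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    return mean(diffs) / stdev(diffs)


# Hypothetical per-participant mean RTs (ms) for five participants.
wave1_incompatible = [450, 470, 440, 480, 460]
wave1_compatible = [410, 440, 400, 430, 430]

# In wave 2 every participant is faster overall, but the size of the
# incompatible-minus-compatible difference is unchanged.
wave2_incompatible = [430, 445, 425, 455, 440]
wave2_compatible = [390, 415, 385, 405, 410]

d_wave1 = paired_cohens_d(wave1_incompatible, wave1_compatible)
d_wave2 = paired_cohens_d(wave2_incompatible, wave2_compatible)

print(f"wave 1 d = {d_wave1:.2f}, wave 2 d = {d_wave2:.2f}")
```

In this constructed example the two values are equal, which would place the experiment on the diagonal of Fig. 1 even though overall response times dropped between waves.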
between waves 1 and 2 better fits a 0% reduction model or a 25% reduction model.” However, the absence of a reduction in effect sizes from wave 1 to wave 2 (the wave 2 effect sizes were, if anything, larger than the wave 1 effect sizes) rendered the planned analysis meaningless. We therefore did not conduct this analysis.

General discussion

Overall, these results present good news for the field of psychology. In contrast to findings in other parts of the field (Chandler et al., 2015), the effects we studied were reproducible in samples of nonnaïve participants, which are increasingly becoming the staple of psychological research. What the tasks used in this research have in common is that they (1) use within-subjects designs and (2) have opaque goals. Although it is clear that participants may learn something from their previous experience with the experiments (e.g., response times were often faster in wave 2 than in wave 1), this learning did not extend to the nature of the manipulation. We should note that it is not impossible that some of our participants had previously participated in similar experiments. For these participants, wave 1 would actually be wave N+1 and wave 2 would be wave N+2. Nevertheless, it appears that the tasks used in this study are so constraining that they encapsulate behavior from contextual variation and even from recent relevant experiences to yield highly reproducible effects. We should add a note of caution. What we have examined are the basic effects with each of these paradigms. In the literature, one often finds variations that are designed to examine how the basic effect varies as a function of some other factor, such as manipulations of instructions, stimulus materials (e.g., emotional vs. neutral stimuli), subject population (patients vs. controls), or the addition of a secondary task. The jury is still out on whether such secondary findings are as robust as the more basic findings we have presented here.

Author contributions R.A. Zwaan developed the study concept. All authors contributed to the study design. Testing and data collection were performed by D. Pecher and S. Bouwmeester. D. Pecher and R.A. Zwaan performed the data analysis. R.A. Zwaan, D. Pecher, and G. Paolacci drafted the manuscript, and all other authors provided critical revisions.
All authors approved the final version of the manuscript for submission. We thank Frederick Verbruggen and Hal Pashler for helpful feedback on a previous version of this paper.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

Alogna, V. K., Attaya, M. K., Aucoin, P., Bahník, S., Birch, S., Birt, A. R., & Zwaan, R. A. (2014). Registered replication report: Schooler & Engstler-Schooler (1990). Perspectives on Psychological Science, 9, 556–578.
Bahník, S., & Vranka, M. A. (2017). If it’s difficult to pronounce, it might not be risky: The effect of fluency on judgment of risk does not generalize to new stimuli. Psychological Science, 28, 427–436.
Bohannon, J. (2016). Mechanical Turk upends social sciences. Science, 352, 1263–1264.
Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6, 3–5.
Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46, 112–130.
Chandler, J., Paolacci, G., Peer, E., Mueller, P., & Ratliff, K. A. (2015). Using nonnaive participants can reduce effect sizes. Psychological Science, 26, 1131–1139.
Craft, J. L., & Simon, J. R. (1970). Processing symbolic information from a visual display: Interference from an irrelevant directional cue. Journal of Experimental Psychology, 83, 415–420.
Crump, M. J. C., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon’s Mechanical Turk as a tool for experimental behavioral research. PLoS One, 8, e57410.
DeVoe, S. E., & House, J. (2016). Replications with MTurkers who are naïve versus experienced with academic studies: A comment on Connors, Khamitov, Moroz, Campbell, and Henderson (2015). Journal of Experimental Social Psychology, 67, 65–67.
Eerland, A., Sherrill, A. M., Magliano, J. P., Zwaan, R. A., Arnal, J. D., Aucoin, P., & Prenoveau, J. M. (2016). Registered replication report: Hart & Albarracín (2011). Perspectives on Psychological Science, 11, 158–171.
Eriksen, B. A., & Eriksen, C. W. (1974). Effects of noise letters upon the identification of a target letter in a non search task. Perception and Psychophysics, 16, 143–149.
Forster, K. I., & Davis, C. (1984). Repetition priming and frequency attenuation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10, 4.
Greene, R. L. (1989). Spacing effects in memory: Evidence for a two-process account. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 371–377.
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., ... Zwienenberg, M. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11, 546–573.
JASP Team (2017). JASP (Version 0.8.1.2) [Computer software].
Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70, 151–159.
Meyer, D. E., & Schvaneveldt, R. W. (1971). Facilitation in recognizing pairs of words: Evidence of a dependence between retrieval operations. Journal of Experimental Psychology, 90, 227–234.
Murdock, B. B., Jr. (1962). The serial position effect of free recall. Journal of Experimental Psychology, 64, 482–488.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. doi:10.1126/science.aac4716
Paolacci, G., & Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a participant pool. Current Directions in Psychological Science, 23, 184–188.
Popper, K. R. (1959). The Logic of Scientific Discovery, translation of Logik der Forschung. Oxford: Routledge.
Rand, D. G., Peysakhovich, A., Kraft-Todd, G. T., Newman, G. E., Wurzbacher, O., Nowak, A. W., & Greene, J. D. (2014). Social heuristics shape intuitive cooperation. Nature Communications, 5, 4677.
Roediger, H. L., & McDermott, K. B. (1995). Creating false memories: Remembering words not presented in lists. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 803–814.
Stewart, N., Ungemach, C., Harris, A. J. L., Bartels, D. M., Newell, B. R., Paolacci, G., & Chandler, J. (2015). The average laboratory samples a population of 7,300 Amazon Mechanical Turk workers. Judgment and Decision Making, 10, 479–491.
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., Jr., & Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11, 917–928.
Zwaan, R. A., & Pecher, D. (2012). Revisiting mental simulation in language comprehension: Six replication attempts. PLoS One, 7, e51382.
Zwaan, R. A., Yaxley, R., & Stanfield, R. (2002). Language comprehenders mentally represent the shape of objects. Psychological Science, 13, 168–171. Experiment 1.