Auditory recognition memory is inferior to visual
recognition memory
Michael A. Cohena, Todd S. Horowitza,b, and Jeremy M. Wolfea,b,1
aBrigham and Women's Hospital, bHarvard Medical School, Boston, MA 02115
Edited by Anne Treisman, Princeton University, Princeton, NJ, and approved February 24, 2009 (received for review November 24, 2008)
Visual memory for scenes is surprisingly robust. We wished to examine whether an analogous ability exists in the auditory domain. Participants listened to a variety of sound clips and were tested on their ability to distinguish old from new clips. Stimuli ranged from complex auditory scenes (e.g., talking in a pool hall) to isolated auditory objects (e.g., a dog barking) to music. In some conditions, additional information was provided to help participants with encoding. In every situation, however, auditory memory proved to be systematically inferior to visual memory. This suggests that there exists either a fundamental difference between auditory and visual stimuli, or, more plausibly, an asymmetry between auditory and visual processing.

Author contributions: M.A.C., T.S.H., and J.M.W. designed research; M.A.C. performed research; M.A.C., T.S.H., and J.M.W. analyzed data; and M.A.C., T.S.H., and J.M.W. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

1To whom correspondence should be addressed. E-mail: jmwolfe@rics.bwh.harvard.edu.

For several decades, we have known that visual memory for scenes is very robust (1, 2). In the most dramatic demonstration, Standing (3) showed observers up to 10,000 images for a few seconds each and reported that they could subsequently identify which images they had seen before with 83% accuracy. This memory is far superior to verbal memory (4) and can persist for a week (5). Recent research has extended these findings to show that we have a massive memory for the details of thousands of objects (6). Here, we ask whether the same is true for auditory memory and find that it is not.

Results
For Experiment 1, we recorded or acquired 96 distinctive 5-s sound clips from a variety of sources: birds chirping, a coffee shop, motorcycles, a pool hall, etc. Twelve participants listened to 64 sound clips during the study phase. Immediately following the study phase, we tested participants on another series of 64 clips, half from the study phase and half new. Participants were asked to indicate whether each clip was old or new. Memory was fairly poor for these stimuli: the hit rate was 78% and the false alarm rate 20%, yielding a d′ score* of 1.68 (s.e.m. = 0.14). To put this performance for a mere 64 sound clips in perspective, in Shepard's original study with 600 pictures, he reported a hit rate of 98%, whereas Standing reported a hit rate of 96% for 1,100 images.

*d′, a standard index of detectability derived from signal detection theory (7), is computed from hit and false alarm rates. Because false alarm rates are not available for all of the early picture memory studies, we also report hit rates.
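As a quick illustration of the d′ measure used throughout, the following sketch (ours, not the authors' analysis code) applies the standard signal-detection formula, d′ = z(hit rate) − z(false-alarm rate), to the group-average rates reported above.

```python
# Minimal sketch of the standard d' computation from signal detection theory:
# d' = z(hit rate) - z(false-alarm rate), where z is the inverse normal CDF.
# This is not the authors' analysis code; it simply illustrates the formula.
from scipy.stats import norm

def d_prime(hit_rate, false_alarm_rate):
    """d' for one observer, given hit and false-alarm proportions (0-1)."""
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# Group-average rates from Experiment 1: 78% hits, 20% false alarms.
print(round(d_prime(0.78, 0.20), 2))  # ~1.61
```

Note that the pooled rates give roughly 1.61, whereas the reported group value is 1.68 (s.e.m. = 0.14); the reported value was presumably computed per participant and then averaged, which need not match the value obtained from pooled rates.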
There are several possible explanations for the poor performance on this auditory memory task. It could be that the remarkable ability to rapidly encode and remember meaningful stimuli is a feature of visual processing. Alternatively, these might have been the wrong sounds. A particular stimulus set might yield poor performance for a variety of reasons. Perhaps the perceptual quality was poor; for example, many of our stimuli were recorded monaurally but played over headphones. It is also possible that the sound clips were too closely clustered in the stimulus space for observers to distinguish between them. Or the stimuli might simply be the wrong sort of auditory stimuli for reasons unknown. To distinguish between the poor-memory and poor-stimuli hypotheses, we replicated the experiments with a second set of stimuli that were professionally recorded (e.g., binaurally) and designed to be as unique as possible (e.g., the sound of a tea kettle, the sound of bowling pins falling). Each sound was assigned a brief description (e.g., "small dog barking"). In a separate experiment, 12 participants were asked to choose the correct name for each sound clip from a list of 111 descriptions (chance = 0.90%), and they succeeded exactly with 64% of the sounds. Two-thirds of the remaining errors were "near misses" (e.g., "big dog" for the sound of a small dog barking would be considered a near miss; "tea kettle" for the sound of bowling pins falling would not). Thus, with this second set of sound clips, participants were able to identify the sound clips relatively well. For each sound clip in this new set, we also obtained a picture that matched the description.

There were 5 conditions in Experiment 2. In each condition, 12 new participants were tested using the same testing protocol as Experiment 1. The study phase contained 64 stimuli. In the test phase, participants labeled 64 stimuli as old or new. We measured memory for the sound clips alone, the verbal descriptions alone, and the matching pictures alone. We also added 2 conditions intended to improve encoding of the sound clips. In 1 condition, the sound clips were paired with the pictures during the study phase. In the other, the sound clips were paired with their verbal descriptions during study. In both of these conditions, participants were tested for recognition of the sound clips alone.

The results, shown in Fig. 1, were unambiguous. According to Tukey's WSD test, memory for pictures was significantly better than for all other stimuli, while the remaining conditions did not differ from one another. Recall for sound clips was slightly higher than in the first experiment, but still quite low (d′ = 1.83; s.e.m. = 0.21) and far inferior to recall for pictures (d′ = 3.57; s.e.m. = 0.24). Supplying the participants with descriptions along with the sounds in the study phase did not significantly improve recall for sound clips (d′ = 2.23; s.e.m. = 0.17). This may not be surprising, because recall for the verbal descriptions by themselves was also relatively poor (d′ = 2.39; s.e.m. = 0.15). However, even pairing sound clips with pictures of the objects at the time of encoding did not improve subsequent testing with sound clips alone (d′ = 1.83; s.e.m. = 0.16). Note that these were the same pictures that, by themselves, produced a d′ of 3.57.

Again, it is still possible that these were the wrong stimuli. In terms of information load, the auditory stimuli we used may simply be more impoverished than pictures. Thus, poor memory performance with sounds may be due solely to the nature of the particular stimuli we used. Perhaps richer stimuli would lead to more efficient encoding and storage in memory. To explore this possibility, in Experiment 3 we replicated the testing procedures from Experiments 1 and 2 using 2 new types of stimuli: spoken language and music. Both classes of stimuli might contain more information than the natural auditory sounds used in Experiments 1 and 2. Spoken language conveys information about the speaker's age, gender, and nationality, in addition to a wealth of semantic information about the topic being discussed. Music, when there is a vocalist, can convey much the same information as spoken language, in addition to information about rhythm, harmony, and instrumentation.
Fig. 1. Memory performance in units of d′. Error bars denote the standard error of the mean. The leftmost part shows the results from Experiment 1, the center part shows the results from Experiment 2, and the rightmost part shows the results from Experiment 3.
Experiment 3 consisted of 2 groups of 12 participants, all native English speakers. In the spoken language condition, participants were tested using 90 unique speech clips (7–15 s) on a variety of topics (e.g., politics, sports, current affairs, sections from novels). Participants were debriefed afterward to confirm that they had no problem understanding what was being said, in terms of both content and the speaker's pronunciation. Performance in this condition (d′ = 2.7; s.e.m. = 0.16) was better than in every other sound condition, but was still worse than in the picture-only condition of Experiment 2 [t(11) = 3.31, P < 0.01]. In the music condition, participants were tested using 90 novel popular music clips (5–15 s). Each participant was debriefed after the experiment, and none reported having ever heard any of these specific clips before. Performance in this condition (d′ = 1.28; s.e.m. = 0.11) was actually worse than in the sound-only condition of Experiment 2 [t(11) = 2.509, P < 0.05], and far worse than in the picture-only condition [t(11) = 14.14, P < 0.001]. Thus, memory for a variety of auditory stimulus classes, some of which potentially carry more information than natural auditory sounds, is inferior to visual memory for scenes and objects.

Experiment 3 suggests that poor auditory memory is not simply the product of impoverished stimuli. However, it would be more satisfying to directly measure the quality of the visual and auditory stimulus sets in the same units. Here, we used the classification task previously used to calibrate the auditory stimuli in Experiment 2, asking participants to assign each stimulus a label from a prespecified list of labels. Recall that for the auditory stimuli, participants were able to perform at 64% on this 111-alternative choice task, using a conservative scoring criterion. For comparison, we obtained a set of images that had been created by taking 256 × 256 pixel images, reducing them to 16 × 16 pixel resolution, then upsampling to create 256 × 256 pixel images for display. This resulted in very degraded, blurred versions of the originals (8). Previous work with these same images demonstrated that this procedure leads to a decrease in performance on a broad categorization task as compared to higher-resolution images (8).
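As a concrete illustration of this degradation procedure, here is a minimal sketch assuming the Pillow imaging library; the bilinear/bicubic resampling filters and file names are our assumptions, and this is not the pipeline actually used to create the stimuli (8).

```python
# Minimal sketch of the image degradation described above: reduce a
# 256 x 256 image to 16 x 16 pixels, then upsample back to 256 x 256,
# producing a heavily blurred version. Resampling filters are assumptions;
# the original stimuli (8) may have been generated differently.
from PIL import Image

def degrade(path_in, path_out):
    img = Image.open(path_in).convert("RGB").resize((256, 256))
    small = img.resize((16, 16), resample=Image.BILINEAR)       # discard detail
    blurred = small.resize((256, 256), resample=Image.BICUBIC)  # upsample
    blurred.save(path_out)

degrade("scene.jpg", "scene_degraded.jpg")  # hypothetical file names
```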
For the first part of Experiment 4, we tested 12 participants in the same memory protocol as in the previous experiments using 102 upsampled images. As Fig. 2 shows, performance in this condition (d′ = 1.89; s.e.m. = 0.17) was not significantly different from performance with the auditory stimuli from Experiment 2 [t(11) = 0.21, P > 0.8]. In the second condition, we then asked 12 participants† to choose the correct name for each degraded image from a list of 102 descriptions (chance = 0.98%). Participants successfully matched an image with its description just 21% of the time, significantly worse than the 64% classification performance for the auditory stimuli reported earlier [t(11) = 21.22, P < 0.001]. Using the more liberal scoring criterion that corrects for "near misses" (e.g., "highway" for the image of a forest road would be considered a near miss; "bedroom" for the image of a beach would not), performance was still only 24%, against 83% for the auditory stimuli [t(11) = 30.277, P < 0.001].

†Note that 5 participants took part in both conditions of Experiment 4, but were only allowed to complete the classification condition after having completed the memory condition.

Fig. 2 makes our point graphically. To equate the memorability of visual and auditory stimuli, we needed to render the visual stimuli almost unrecognizable. Participants were much better at classifying/identifying the auditory stimuli than the degraded visual stimuli (triangles, right y-axis). This is consistent with an asymmetry between visual and auditory processing. Stimuli of equal memorability are not equally identifiable. Highly identifiable auditory stimuli are not remembered well.

Fig. 2. Auditory stimuli vs. degraded visual images. Memory performance (squares, solid line) is plotted against the left y-axis in units of d′. Percent correct for the naming experiment is plotted against the right y-axis. Error bars denote standard error of the mean.

Discussion
It is clear from these results that auditory recognition memory performance is markedly inferior to visual recognition memory on this task. Note that we do not claim that long-term auditory memory, in general, is impoverished. Clearly, some form of auditory long-term memory allowed our participants to identify the stimuli as tea kettles, dogs, and so forth. Moreover, with practice, people can commit large bodies of auditory material (e.g., music) to memory. The striking aspects of the original picture memory experiments are the speed and ease with which
complex visual stimuli seem to slide into long-term memory. Hundreds or thousands of images, seen for a few seconds at a time, are available for subsequent recognition. It is this aspect of memory that seems to be markedly less impressive in audition. Two explanations suggest themselves. Auditory objects might be fundamentally different from visual objects. In their physics or psychophysics, they may actually be less memorable than their visual counterparts. Alternatively, auditory memory might be fundamentally different from, and smaller than, visual memory. We might simply lack the capacity to remember more than a few auditory objects, however memorable, when they are presented one after another in rapid succession. In either case, it is unlikely that anyone will find 1,000 sounds that can be remembered with anything like the accuracy of their visual counterparts.

Materials and Methods
Participants. A total of 113 participants (aged 18–54) took part in the experiments. There were 12 participants in each condition, with a total of 11 conditions/experiments. Each participant passed the Ishihara test for color blindness and had normal or corrected-to-normal vision. All participants gave informed consent, as approved by the Partners Healthcare Corporation IRB, and were compensated $10/h for their time.

Stimuli. In Experiment 1, stimuli were gathered using a handheld recording device (Panasonic PV-GS180) or were obtained from a commercially available database (SoundSnap). In Experiment 2, stimuli were gathered from SoundSnap.com. In Experiment 3, music clips came from the collections of members of the laboratory. Songs were uploaded into WavePad, and 7- to 15-s clips were extracted. Speech clips came from various podcasts obtained online and were also uploaded into WavePad to obtain 5- to 15-s clips. Degraded visual images used in Experiment 4 were obtained from A. Torralba (Massachusetts Institute of Technology, Cambridge, MA). A list of the stimuli used is provided on our website: search.bwh.harvard.edu.

Experimental Blocks. The memory experiments consisted of a study block and a test block. In the study block, participants listened to or viewed a set of sound clips, or sound clips paired with their corresponding images/names (60–66 clips), for approximately 10 min. Their instructions were simply to study the clips carefully and try to commit them to memory as best they could. In the test block, participants were presented with another set of clips (60–64 clips), half of which were repeated from the study block (old) and half of which had never been presented before (new). Participants were asked to make an "old/new" discrimination after every trial. In 1 condition of the memory experiments the basic paradigm remained the same, but participants were presented with only visual images (picture only). The naming/classification experiments comprised a single block lasting approximately 20 min. Participants were presented with each stimulus for 5 s and then typed in the name of what they had heard/seen from a provided list (102–110 names).
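For concreteness, the sketch below shows one way the Experiment 1 study and test lists could be assembled. The file names and the use of Python are our own illustration (the experiments themselves were run in Matlab), not the authors' code.

```python
# Illustrative sketch (not the authors' code) of the old/new list construction
# for Experiment 1: 96 clips total, 64 studied, then a 64-item test list with
# 32 studied ("old") and 32 unstudied ("new") clips in random order.
import random

clips = [f"clip_{i:03d}.wav" for i in range(96)]   # hypothetical file names
random.shuffle(clips)

study_list = clips[:64]                 # presented in the ~10-min study block
old_test_items = random.sample(study_list, 32)
new_test_items = clips[64:]             # the 32 clips never studied

test_list = [(c, "old") for c in old_test_items] + \
            [(c, "new") for c in new_test_items]
random.shuffle(test_list)               # each trial: play clip, collect "old"/"new"
```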
Apparatus. Every experiment was conducted on a Macintosh computer running MacOS 9.2, controlled by Matlab 7.5.0 and the Psychophysics Toolbox, version 3.

ACKNOWLEDGMENTS. We thank Christina Chang, Karla Evans, Yair Pinto, Aude Oliva, and Barbara Shinn-Cunningham for helpful comments and suggestions on the project, and Antonio Torralba for providing the degraded images used in Experiment 4. This work was funded in part by NIMH-775561 and AFOSR-887783.
1. Shepard RN (1967) Recognition memory for words, sentences, and pictures. J Verb Learn Verb Behav 6:156–163.
2. Pezdek K, Whetstone T, Reynolds K, Askari N, Dougherty T (1989) Memory for real-world scenes: The role of consistency with schema expectation. J Exp Psychol Learn Mem Cogn 15:587–595.
3. Standing L (1973) Learning 10,000 pictures. Q J Exp Psychol 25:207–222.
4. Standing L, Conezio J, Haber RN (1970) Perception and memory for pictures: Single-trial learning of 2560 visual stimuli. Psychon Sci 19:73–74.
5. Dallett K, Wilcox SG, D'Andrea L (1968) Picture memory experiments. J Exp Psychol 76:312–320.
6. Brady TF, Konkle T, Alvarez GA, Oliva A (2008) Visual long-term memory has a massive storage capacity for object details. Proc Natl Acad Sci USA 105:14325–14329.
7. Macmillan NA, Creelman CD (2005) Detection Theory: A User's Guide (Lawrence Erlbaum, Mahwah, NJ), 2nd Ed.
8. Torralba A (2009) How many pixels make an image? Visual Neurosci, Epub ahead of print.