
Auditory recognition memory is inferior to visual recognition memory

Michael A. Cohen (a), Todd S. Horowitz (a,b), and Jeremy M. Wolfe (a,b,1)

(a) Brigham and Women's Hospital, (b) Harvard Medical School, Boston, MA 02115

Edited by Anne Treisman, Princeton University, Princeton, NJ, and approved February 24, 2009 (received for review November 24, 2008)

Author contributions: M.A.C., T.S.H., and J.M.W. designed research; M.A.C. performed research; M.A.C., T.S.H., and J.M.W. analyzed data; and M.A.C., T.S.H., and J.M.W. wrote the paper. The authors declare no conflict of interest. This article is a PNAS Direct Submission.

(1) To whom correspondence should be addressed. E-mail: jmwolfe@rics.bwh.harvard.edu.

Visual memory for scenes is surprisingly robust. We wished to examine whether an analogous ability exists in the auditory domain. Participants listened to a variety of sound clips and were tested on their ability to distinguish old from new clips. Stimuli ranged from complex auditory scenes (e.g., talking in a pool hall) to isolated auditory objects (e.g., a dog barking) to music. In some conditions, additional information was provided to help participants with encoding. In every situation, however, auditory memory proved to be systematically inferior to visual memory. This suggests that there exists either a fundamental difference between auditory and visual stimuli or, more plausibly, an asymmetry between auditory and visual processing.

For several decades, we have known that visual memory for scenes is very robust (1, 2). In the most dramatic demonstration, Standing (3) showed observers up to 10,000 images for a few seconds each and reported that they could subsequently identify which images they had seen before with 83% accuracy. This memory is far superior to verbal memory (4) and can persist for a week (5). Recent research has extended these findings to show that we have a massive memory for the details of thousands of objects (6). Here, we ask whether the same is true for auditory memory and find that it is not.

Results

For Experiment 1, we recorded or acquired 96 distinctive 5-s sound clips from a variety of sources: birds chirping, a coffee shop, motorcycles, a pool hall, etc. Twelve participants listened to 64 sound clips during the study phase. Immediately following the study phase, we tested participants on another series of 64 clips, half from the study phase and half new. Participants were asked to indicate whether each clip was old or new. Memory was fairly poor for these stimuli: the hit rate was 78% and the false alarm rate 20%, yielding a d′ score* of 1.68 (s.e.m. = 0.14). To put this performance for a mere 64 sound clips in perspective, in Shepard's original study with 600 pictures, he reported a hit rate of 98%, whereas Standing reported a hit rate of 96% for 1,100 images.

*d′, a standard index of detectability derived from signal detection theory (7), is computed from hit and false alarm rates. Because false alarm rates are not available for all of the early picture memory studies, we also report hit rates.
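To make the footnoted computation concrete, here is a minimal sketch of the d′ calculation; the use of scipy is our assumption, and because the paper averages per-participant d′ scores, plugging in the pooled group rates above gives only an approximation of the reported value.

```python
from scipy.stats import norm

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """d' = z(hit rate) - z(false alarm rate), where z is the inverse normal CDF."""
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Group-level rates from Experiment 1: 78% hits, 20% false alarms.
# The paper reports the mean of per-participant d' scores (1.68), so this
# pooled estimate (~1.61) is close but not identical.
print(d_prime(0.78, 0.20))
```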
There are several possible explanations for the poor performance on this auditory memory task. It could be that the remarkable ability to rapidly encode and remember meaningful stimuli is a feature of visual processing. Alternatively, these might have been the wrong sounds. A particular stimulus set might yield poor performance for a variety of reasons. Perhaps the perceptual quality was poor; for example, many of our stimuli were recorded monaurally but played over headphones. It is also possible that the sound clips were too closely clustered in the stimulus space for observers to distinguish between them. Or the stimuli might simply be the wrong sort of auditory stimuli for reasons unknown. To distinguish between the poor memory and poor stimuli hypotheses, we replicated the experiments with a second set of stimuli that were professionally recorded (e.g., binaurally) and designed to be as unique as possible (e.g., the sound of a tea kettle, the sound of bowling pins falling). Each sound was assigned a brief description (e.g., "small dog barking"). In a separate experiment, 12 participants were asked to choose the correct name for each sound clip from a list of 111 descriptions (chance = 0.90%), and they succeeded with exactly 64% of the sounds. Two-thirds of the remaining errors were "near misses" (e.g., "big dog" for the sound of a small dog barking would be considered a near miss; "tea kettle" for the sound of bowling pins falling would not). Thus, with this second set of sound clips, participants were able to identify the sound clips relatively well. For each sound clip in this new set, we also obtained a picture that matched the description.

There were 5 conditions in Experiment 2. In each condition, 12 new participants were tested using the same testing protocol as Experiment 1. The study phase contained 64 stimuli. In the test phase, participants labeled 64 stimuli as old or new. We measured memory for the sound clips alone, the verbal descriptions alone, and the matching pictures alone. We also added 2 conditions intended to improve encoding of the sound clips. In 1 condition, the sound clips were paired with the pictures during the study phase. In the other, the sound clips were paired with their verbal descriptions during study. In both of these conditions, participants were tested for recognition of the sound clips alone.

The results, shown in Fig. 1, were unambiguous. According to Tukey's WSD test, memory for pictures was significantly better than for all other stimuli, while the remaining conditions did not differ from one another. Recall for sound clips was slightly higher than in the first experiment, but still quite low (d′ = 1.83; s.e.m. = 0.21) and far inferior to recall for pictures (d′ = 3.57; s.e.m. = 0.24). Supplying the participants with descriptions along with the sound clips in the study phase did not significantly improve recall for sound clips (d′ = 2.23; s.e.m. = 0.17). This may not be surprising, because recall for the verbal descriptions by themselves was also relatively poor (d′ = 2.39; s.e.m. = 0.15). However, even pairing sound clips with pictures of the objects at the time of encoding did not improve subsequent testing with sound clips alone (d′ = 1.83; s.e.m. = 0.16). Note that these were the same pictures that, by themselves, produced a d′ of 3.57.

Again, it is still possible that these were the wrong stimuli. In terms of information load, the auditory stimuli we used may simply be more impoverished than pictures. Thus, poor memory performance with sounds may be due solely to the nature of the particular stimuli we used. Perhaps richer stimuli would lead to more efficient encoding and storage in memory. To explore this possibility, in Experiment 3 we replicated the testing procedures from Experiments 1 and 2 using 2 new types of stimuli: spoken language and music. Both classes of stimuli might contain more information than the natural auditory sounds used in Experiments 1 and 2. Spoken language conveys information about the speaker's age, gender, and nationality, in addition to a wealth of semantic information about the topic being discussed. Music, when there is a vocalist, can convey much the same information as spoken language, in addition to information about rhythm, harmony, and instrumentation.



Fig. 1. Memory performance in units of d′. Error bars denote the standard error of the mean. The leftmost part shows the results from Experiment 1, the center part shows the results from Experiment 2, and the rightmost part shows the results from Experiment 3.

Experiment 3 consisted of 2 groups of 12 participants, all native English speakers. In the spoken language condition, participants were tested using 90 unique speech clips (7–15 s) on a variety of topics (e.g., politics, sports, current affairs, sections from novels). Participants were debriefed afterward to confirm that they had no problem understanding what was being said, in terms of both content and the speaker's pronunciation. Performance in this condition (d′ = 2.7; s.e.m. = 0.16) was better than in every other sound condition, but was still worse than in the picture only condition of Experiment 2 [t(11) = 3.31, P < 0.01]. In the music condition, participants were tested using 90 novel popular music clips (5–15 s). Each participant was debriefed after the experiment, and none reported having ever heard any of these specific clips before. Performance in this condition (d′ = 1.28; s.e.m. = 0.11) was actually worse than in the sound only condition of Experiment 2 [t(11) = 2.509, P < 0.05], and far worse than in the picture only condition [t(11) = 14.14, P < 0.001]. Thus, memory for a variety of auditory stimulus classes, some of which potentially carry more information than natural auditory sounds, is inferior to visual memory for scenes and objects.

Experiment 3 suggests that poor auditory memory is not simply the product of impoverished stimuli. However, it would be more satisfying to directly measure the quality of visual and auditory stimulus sets in the same units. Here, we used the classification task previously used to calibrate the auditory stimuli in Experiment 2, asking participants to assign each stimulus a label from a prespecified list of labels. Recall that for the auditory stimuli, participants were able to perform at 64% on this 111-alternative choice task, using a conservative scoring criterion. For comparison, we obtained a set of images that had been created by taking 256 × 256 pixel images, reducing them to 16 × 16 pixel resolution, and then upsampling to create 256 × 256 pixel images for display. This resulted in very degraded, blurred versions of the originals (8). Previous work with these same images demonstrated that this procedure leads to a decrease in performance on a broad categorization task as compared to higher resolution images (8).
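The downsample-then-upsample operation just described is easy to picture in code. The sketch below is our illustration only, not the pipeline used to create the published stimuli (ref. 8); the choice of Pillow and of bilinear resampling, and the function name degrade, are assumptions.

```python
from PIL import Image

def degrade(path: str) -> Image.Image:
    """Reduce a 256x256 image to 16x16, then upsample back to 256x256.

    Illustrative only: the resampling filter used for the original
    stimuli is not specified in the text, so bilinear is assumed.
    """
    img = Image.open(path).convert("RGB").resize((256, 256))
    small = img.resize((16, 16), Image.BILINEAR)      # discard fine detail
    return small.resize((256, 256), Image.BILINEAR)   # blurred reconstruction
```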
For the first part of Experiment 4, we tested 12 participants in the same memory protocol as in the previous experiments using 102 upsampled images. As Fig. 2 shows, performance in this condition (d′ = 1.89; s.e.m. = 0.17) was not significantly different from performance with the auditory stimuli from Experiment 2 [t(11) = 0.21, P > 0.8]. In the second condition, we then asked 12 participants† to choose the correct name for each degraded image from a list of 102 descriptions (chance = 0.98%). Participants successfully matched an image with its description just 21% of the time, significantly worse than the 64% classification performance for the auditory stimuli reported earlier [t(11) = 21.22, P < 0.001]. Using the more liberal scoring criterion that corrects for "near misses" (e.g., "highway" for the image of a forest road would be considered a near miss; "bedroom" for the image of a beach would not), performance was still only 24%, against 83% for the auditory stimuli [t(11) = 30.277, P < 0.001].

Fig. 2 makes our point graphically. To equate the memorability of visual and auditory stimuli, we needed to render the visual stimuli almost unrecognizable. Participants were much better at classifying/identifying the auditory stimuli than the degraded visual stimuli (triangles, right y-axis). This is consistent with an asymmetry between visual and auditory processing. Stimuli of equal memorability are not equally identifiable. Highly identifiable auditory stimuli are not remembered well.

Fig. 2. Auditory stimuli vs. degraded visual images. Memory performance (squares, solid line) is plotted against the left y-axis in units of d′. Percent correct for the naming experiment is plotted against the right y-axis. Error bars denote standard error of the mean.

†Note that 5 participants participated in both conditions of Experiment 4, but were only allowed to complete the classification condition after having completed the memory condition.

Discussion

It is clear from these results that auditory recognition memory performance is markedly inferior to visual recognition memory on this task. Note that we do not claim that long-term auditory memory, in general, is impoverished. Clearly, some form of auditory long-term memory allowed our participants to identify the stimuli as tea kettles, dogs, and so forth. Moreover, with practice, people can commit large bodies of auditory material (e.g., music) to memory. The striking aspects of the original picture memory experiments are the speed and ease with which complex visual stimuli seem to slide into long-term memory.


Hundreds or thousands of images, seen for a few seconds at a time, are available for subsequent recognition. It is this aspect of memory that seems to be markedly less impressive in audition. Two explanations suggest themselves. Auditory objects might be fundamentally different from visual objects; in their physics or psychophysics, they may actually be less memorable than their visual counterparts. Alternatively, auditory memory might be fundamentally different from, or smaller than, visual memory. We might simply lack the capacity to remember more than a few auditory objects, however memorable, when they are presented one after another in rapid succession. In either case, it is unlikely that anyone will find 1,000 sounds that can be remembered with anything like the accuracy of their visual counterparts.

Materials and Methods

Participants. One hundred thirteen participants (aged 18–54) took part in the experiments, 12 per condition, across a total of 11 conditions/experiments. Each participant passed the Ishihara test for color blindness and had normal or corrected-to-normal vision. All participants gave informed consent, as approved by the Partners Healthcare Corporation IRB, and were compensated $10/h for their time.

Stimuli. In Experiment 1, stimuli were gathered using a handheld recording device (Panasonic PV-GS180) or were obtained from a commercially available database (SoundSnap). In Experiment 2, stimuli were gathered from SoundSnap.com. In Experiment 3, music clips came from the collections of members of the laboratory; songs were uploaded into WavePad and 7- to 15-s clips were extracted. Speech clips came from various podcasts obtained online and were also uploaded into WavePad to obtain 5- to 15-s clips. Degraded visual images used in Experiment 4 were obtained from A. Torralba (Massachusetts Institute of Technology, Cambridge, MA). A list of the stimuli used is provided on our website: search.bwh.harvard.edu.

Experimental Blocks. The memory experiments consisted of a study block and a test block. In the study block, participants listened to or viewed a set of sound clips, or sound clips paired with their corresponding images/names (60–66 clips), for approximately 10 min. Their instructions were simply to study the clips carefully and try to commit them to memory as best they could. In the test block, participants were presented with another set of clips (60–64 clips), half repeated from the study block (old) and half never presented before (new). Participants were asked to make an "old/new" discrimination after every trial. Note that in 1 condition of the memory experiments the basic paradigm remained the same, but participants were presented with only visual images (picture only). The naming/classification experiments comprised a single block lasting approximately 20 min; participants were shown each stimulus for 5 s and then typed in the name of what they had heard/seen from a list provided (102–110 names).
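As an illustration of the study/test logic described above, here is a minimal sketch of how the two blocks might be assembled and scored. The pool size, the 50/50 old/new split, and the list lengths come from the text; the function names, the random shuffling, and the data layout are our assumptions.

```python
import random

def make_blocks(pool: list[str], n_study: int = 64, n_test: int = 64):
    """Split a stimulus pool (e.g., the 96 clips of Experiment 1) into a
    study list and a test list of half old, half new items."""
    clips = random.sample(pool, len(pool))          # shuffled copy of the pool
    study = clips[:n_study]
    old = random.sample(study, n_test // 2)         # half repeated from study
    new = clips[n_study:n_study + n_test // 2]      # half never presented
    test = [(c, "old") for c in old] + [(c, "new") for c in new]
    random.shuffle(test)
    return study, test

def hit_and_fa_rates(test, responses):
    """Score a list of 'old'/'new' responses against the test labels."""
    hits = sum(r == "old" for (_, label), r in zip(test, responses) if label == "old")
    fas = sum(r == "old" for (_, label), r in zip(test, responses) if label == "new")
    n_old = sum(label == "old" for _, label in test)
    return hits / n_old, fas / (len(test) - n_old)
```

The hit and false alarm rates returned here are the inputs to the d′ computation sketched earlier.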
Apparatus. Every experiment was conducted on a Macintosh computer running MacOS 9.2, controlled by Matlab 7.5.0 and the Psychophysics Toolbox, version 3.

ACKNOWLEDGMENTS. We thank Christina Chang, Karla Evans, Yair Pinto, Aude Oliva, and Barbara Shinn-Cunningham for helpful comments and suggestions on the project, and Antonio Torralba for providing the degraded images used in Experiment 4. This work was funded in part by NIMH-775561 and AFOSR-887783.

1. Shepard RN (1967) Recognition memory for words, sentences, and pictures. J Verb Learn Verb Behav 6:156–163.
2. Pezdek K, Whetstone T, Reynolds K, Askari N, Dougherty T (1989) Memory for real-world scenes: The role of consistency with schema expectation. J Exp Psychol Learn Mem Cogn 15:587–595.
3. Standing L (1973) Learning 10,000 pictures. Q J Exp Psychol 25:207–222.
4. Standing L, Conezio J, Haber RN (1970) Perception and memory for pictures: Single-trial learning of 2500 visual stimuli. Psychon Sci 19:73.
5. Dallet K, Wilcox SG, D'Andrea L (1968) Picture memory experiments. J Exp Psychol 76:312–320.
6. Brady TF, Konkle T, Alvarez GA, Oliva A (2008) Visual long-term memory has a massive storage capacity for object details. Proc Natl Acad Sci USA 105:14325–14329.
7. Macmillan NA, Creelman CD (2005) Detection Theory: A User's Guide (Lawrence Erlbaum, Mahwah, NJ), 2nd Ed.
8. Torralba A (2009) How many pixels make an image? Vis Neurosci, epub ahead of print.

