Tracking Perception of the Sounds of English
Citation: The Journal of the Acoustical Society of America 135, 2995 (2014); doi: 10.1121/1.4870486
Published by the Acoustical Society of America
Natasha Warner a)
Department of Linguistics, University of Arizona, Box 210028, Tucson, Arizona 85721-0028
James M. McQueen b)
Radboud University Nijmegen, Postbus 9104, 6500 HE Nijmegen, The Netherlands
Anne Cutler c)
The MARCS Institute, University of Western Sydney, Locked Bag 1797, Penrith, New South Wales 2751, Australia
(Received 6 August 2013; revised 21 March 2014; accepted 24 March 2014)
Twenty American English listeners identified gated fragments of all 2288 possible English within-word
and cross-word diphones, providing a total of 538 560 phoneme categorizations. The results show
orderly uptake of acoustic information in the signal and provide a view of where information about
segments occurs in time. Information locus depends on each speech sound’s identity and phonological
features. Affricates and diphthongs have highly localized information so that listeners’ perceptual
accuracy rises during a confined time range. Stops and sonorants have more distributed and gradually
appearing information. The identity and phonological features (e.g., vowel vs consonant) of the
neighboring segment also influence when acoustic information about a segment is available. Stressed
vowels are perceived significantly more accurately than unstressed vowels, but this effect is greater for
lax vowels than for tense vowels or diphthongs. The dataset charts the availability of perceptual cues to
segment identity across time for the full phoneme repertoire of English in all attested phonetic contexts.
© 2014 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4870486]
J. Acoust. Soc. Am. 135 (5), May 2014 0001-4966/2014/135(5)/2995/12/$30.00
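The totals quoted in the abstract can be checked from figures given in the Methods (Sec. II A): 2288 diphones, of which 132 have four gates instead of six, heard by the 20 listeners retained in the analysis, each naming two segments per stimulus. The short Python sketch below only redoes that arithmetic; the function and variable names are illustrative and do not come from the paper.

```python
# Stimulus and response counts, using only figures quoted in the paper.
N_DIPHONES = 2288          # all possible within-word and cross-word diphones
N_FOUR_GATE = 132          # stop/affricate-initial diphones without preceding context
N_LISTENERS = 20           # listeners retained in the analysis
SEGMENTS_PER_STIMULUS = 2  # each response names Segment 1 and Segment 2


def count_stimuli(n_diphones: int, n_four_gate: int) -> int:
    """Six gates per diphone, except four gates for the reduced set."""
    return (n_diphones - n_four_gate) * 6 + n_four_gate * 4


def count_categorizations(n_stimuli: int, n_listeners: int) -> int:
    """Every listener categorizes both segments of every stimulus."""
    return n_stimuli * n_listeners * SEGMENTS_PER_STIMULUS


n_stimuli = count_stimuli(N_DIPHONES, N_FOUR_GATE)
n_categorizations = count_categorizations(n_stimuli, N_LISTENERS)
print(n_stimuli)          # 13464
print(n_categorizations)  # 538560
```

Both values agree with the 13 464 stimuli and 538 560 phoneme categorizations reported in the text.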
diphone set comprises all consonants (C) and all vowels (V) of a common variety of American English in all combinations (CV, VC, CC, or VV) that the language allows. Diphones that occur only across word or compound boundaries (e.g., /tʃm/, as in batch mode) were included, as well as more typical diphones that occur syllable internally. All vowels appear both as stressed and unstressed (e.g., /bu/ or /æʃ/ each have two diphones with stressed vs unstressed vowel, while /ojej/ has four, as in annoy eighteen, alloy aging, annoy aging, alloy eighteen). Six gates were created for each diphone, presenting the first third, first two-thirds, and entirety of Segment 1, and Segment 1 plus the first third, first two-thirds, and entirety of Segment 2. (Some diphones, however, only have four gates as explained in Sec. II A.) This yielded a total of 13 464 stimuli. These were presented in random order to listeners who then identified the two sounds of each stimulus as best they could. The resulting dataset will enable investigation of perception of consonant place, manner, or voicing, vowel quality, stress, time point within segments, or properties of and interactions with preceding or following segments. So that all researchers can make use of this substantial dataset, our results are publicly available at http://www.u.arizona.edu/~nwarner/WarnerMcQueenCutler.html.
II. METHODS
A. Materials
We compiled a list of all diphones that can occur either within a word or across word boundaries in American English using the segment inventory in Table I. This inventory reflects the system of the electronic dictionary of American English at http://lexicon.arizona.edu/hammond/newdic.html (accessed 3/5/2014) (related to the dictionary file at http://dingo.sbs.arizona.edu/hammond/lsasummer11/newdic discussed in Pisoni et al., 1985). However, we treated the flap allophone [ɾ] as a separate segment since its occurrence is not fully conditioned by the diphone environment, we omitted /ɔ/ because it does not occur in the Arizona dialect or in many other parts of the United States, and we merged the unstressed central vowels [ɨ, ə] since many speakers and listeners are unsure what the [ɨ] category represents. To avoid duplication we also omitted diphones with syllabic consonants ([n̩, m̩], etc.), given that non-syllabic sequences of the same segments were already in the corpus ([tn] in catnip vs [tn̩] in button up). We did not omit [ɾl̩] as in bottle because only syllabic [l] can follow [ɾ]; non-syllabic [l] cannot.

TABLE I. American English segment inventory for the diphone list. (A) Consonants. (B) Vowels.
(A) Consonants
                         Voiced            Voiceless
Stops/affricates/flap    b, d, g, dʒ, ɾ    p, t, k, tʃ
Fricatives               v, ð, z, ʒ        f, θ, s, ʃ, h
Nasals                   m, n, ŋ
Glides/approximants      j, w, ɹ, l
(B) Vowels
              Front      Central      Back
High          i, ɪ                    u, ʊ
Mid           ej, ɛ      ʌ, ə, ɝ      ow
Low           æ                       a
Diphthongs    aj, oj                  aw

All combinations of two sounds were considered possible unless they did not occur within a word in this dictionary, could not be formed by the end of one word in the dictionary and the beginning of another, and a phonological reason for their impossibility is known. Hence, VV diphones with lax vowels as the first vowel (e.g., /ɛa/) were excluded because lax vowels cannot end a syllable or word. Furthermore, some sequences cannot occur because of vowel mergers before /ɹ, ŋ/ etc. in most varieties of American English (Ladefoged and Johnson, 2015). Thus, we used /ejɹ/, but not /ɛɹ/, /ɪŋ/, but not /iŋ/, etc., with the production representing the speaker’s pronunciation of these strings. These constraints led to a list of 2288 diphones out of the 3136 that would occur if every segment in Table I plus syllabic [l] could precede and follow every other segment. (Notes on further detailed methods decisions appear online at http://www.u.arizona.edu/~nwarner/WarnerMcQueenCutler.html.)
The diphones were recorded by a phonetically trained female speaker who had lived almost her entire life in Tucson, Arizona, and who was monolingual in English until her teenage years. The stimuli thus represent the speech of one speaker, but that speaker is highly appropriate for the choice of dialect. As in Smits et al. (2003), contexts were appended: (i) a following context for each diphone (/k/, /kej/, or /kə/ after vowels and /ej/, /ə/, or /a/ after consonants) to avoid final lengthening within diphones; (ii) a preceding context for some diphones (/a/ before C, /b/ or /ab/ before V). Most CC diphones (e.g., /fp/) cannot occur word-initially, but a preceding vowel makes them pronounceable in a natural way. To avoid preceding context signaling particular diphone types, some remaining diphones also received preceding contexts (giving overall 71% of diphones with preceding context). Preceding and following contexts also helped the speaker to pronounce target stress patterns in VV diphones (e.g., /ˈabiukˈej/ for unstressed-unstressed /iu/, /bˈiˈukə/ for stressed-stressed). CV and VC diphones were followed by /(k)ə/ if the diphone’s vowel was stressed, /ˈ(k)ej/ if unstressed; VV stressed-stressed and unstressed-unstressed diphones had following syllables with opposite stress. The choices of which context to use before and after each diphone were the same as specified in Smits et al. (2003).
We then identified the boundary between the two segments of the diphone, as well as between the diphone and any preceding or following context. Separate boundary criteria were applied for voiceless consonant to voiced segment (onset/offset of voicing), voiced obstruent to voiced segment (F2 onset/offset), nasal to vowel or sonorant (sudden change in frequency of energies), /l/ to vowel (most sudden increase in amplitude of formants), glide or /ɹ/ to vowel (midway through duration of F2 or F3 transition, respectively), voiceless consonant to voiceless consonant (onset/cessation of defining features such as closure, burst noise, or frication
noise) and vowel to vowel (beginning of creak or glottal stop if any, midway through F2 transition, otherwise). Boundary position decisions were closely modeled on the methods of the Dutch work (Smits et al., 2003), in order to make the data for the two languages comparable. Additional details about boundary locations appear in Smits et al. (2003).
Recordings were final-gated to produce (with one exception, next paragraph) six stimuli per diphone, usually with a gate termination at each third of the way through the first and second segment of the diphone. That is, in the diphone /sa/, the shortest stimulus included the preceding context (if any) through to one third of the way through /s/; Gate 2 included that material and extended to the point two-thirds through /s/; Gate 3 ended at the boundary of /s/ and /a/; Gate 4 went to one-third through /a/, Gate 5 to two-thirds through /a/, and Gate 6 to the end of the diphone. Thus, any preceding context recorded with the diphone was always presented as part of the stimulus, and the following context was never presented, as the last gate included the whole diphone but no transition to the following context. Gate endpoints were defined by proportions of duration of segments, rather than by absolute number of milliseconds (e.g., one gate per 20 ms) because this allows one to compare across all segment types how well listeners can perceive sounds by one-third or two-thirds of the way through the segment. Gating at fixed time intervals would make comparison across manners of articulation (e.g., /m/, which is long, vs /d/ or /j/, which are short), or even across individual stimuli, very difficult.
An exception to the equal gate size occurred with stops and affricates following another segment. Here, the second gate point within the segment was just before the beginning of the burst, rather than at two-thirds of the segment’s duration. This avoided having some diphones with the burst in the second gate, but others with the burst only in the third gate. Thus, for all stops and affricates, the burst [and voice onset time (VOT)] information was only available to listeners as of the last gate within that segment (Gate 3 if the stop/affricate was Segment 1 of the diphone, Gate 6 if it was Segment 2). The first gate endpoint within these segments was placed at halfway through the duration from segment onset to pre-burst gate point, thus, halfway through the closure. [ɾ] was not treated as a stop since it often has no burst, but rather had its endpoints at one-third and two-thirds through the flap duration. These gate points were thus close in time, but using three equal gates makes the data comparable across diphones.
The exception to the six-gate pattern concerned stops and affricates as Segment 1 of the diphone, recorded without preceding context. Here, the silent closure phase could not be located. For these 132 diphones, only 4 gates were presented: 1 reaching the end of Segment 1, plus the 3 normal end-points for Segment 2. That is, the two gates that would normally end during the stop/affricate closure were simply omitted as they contained only silence.
Each stimulus token was created by extracting the speech from onset of the diphone or preceding context (if any) to the gate point for that stimulus, then ramping the speech to a square wave with f0 = 500 Hz, which continued for 295 ms after the ramp. The amplitude of the speech was ramped down over a 5 ms time window as the amplitude of the square wave was ramped up. These signals were added to produce a smooth transition from speech to square wave (beep). The square wave amplitude was loud enough to convey a clear beep, but quiet enough not to irritate listeners; the square wave f0 was high enough to prevent resemblance to any speech sound. The square wave and ramp were used to prevent the artifactual perception of a labial consonant that can occur with speech cut suddenly to silence (Öhmann, 1966; Pols and Schouten, 1978).
B. Subjects
Twenty-eight listeners (five male) began participation in the experiment; six (two male) did not finish it. All listeners were monolingual in American English until at least their teenage years and had no substantial exposure to other language(s) in childhood, nor more than a few years’ classroom study of foreign languages in school. All grew up in the Southwest of the United States (some came from Texas or southern California, but most from Arizona), and all were students at the University of Arizona at the time of the study. Thus, the listeners’ dialect was well matched to that of the speaker. The listeners were recruited through the University of Arizona’s honors program to select participants most likely to return reliably for the many sessions, and most able to learn the response symbols easily. No listener had any known speech, hearing, or reading problem. Of the six who did not complete the study, three chose to stop, one missed frequent appointments, and two were dropped due to poor performance in practice.
C. Procedures
The stimuli were randomized and grouped in short experimental blocks, expected to take 10–20 min each to complete, so that listeners could complete 3–5 such blocks during each 1 h experimental session. The order of blocks was varied for individual subjects (though not fully randomized). Four practice blocks were also created using actual stimuli from the experiment with disproportionately many stimuli containing segments for which the response symbol was expected to be relatively difficult to learn (e.g., “dh” for /ð/, “g” as /g/ and not /dʒ/, most vowels).
English spelling is too ambiguous to convey responses, but we used response symbols that were based as closely as possible on typical English spellings (e.g., “oy” for /oj/, “j” for /dʒ/, “p” for /p/). Listeners were first instructed in these symbols, and then performed practice blocks for 45 min (223 or 335 stimuli for most listeners, depending on how many blocks they completed). Data from practice blocks were used only to evaluate listeners’ ability to do the task, and all stimuli presented in the practice sessions appeared again during actual experiments. (Because of the very large number of stimuli, many similar, this is not problematic. A listener is unlikely to recall having heard one of the 335 practice stimuli when hearing it again among 13 464 experimental stimuli.) Two listeners scored <50% correct on both Segments 1 and 2 even at Gate 6 (when both sounds should
be relatively perceptible) during the practice, and had random, perceptually unmotivated error patterns, so their participation was discontinued.
For experimental sessions, listeners sat in an individual sound-protected booth and heard stimuli presented over headphones. Each stimulus was accompanied by a display on a computer screen showing all response alternatives as buttons on the left half of the screen, and the same alternatives on the right half of the screen, with a dividing line between the halves. Listeners used a mouse to click first on the left half of the screen on the first sound they thought they heard, then on the right half on the second sound they heard (or that might have come next). The response options were the same as the inventory of segments (Table I), except that [ɾ] and syllabic [l] were not given as options since English listeners consider these to be types of /t, d, l/. Listeners were also not asked to distinguish /ə/ from /ʌ/, but were trained to use “uh” for both, and to use “er” for both stressed and unstressed /ɝ, ɚ/ (separate symbols in the dictionary used; N.B.: identifying stress was not part of the listening task).
If the diphone was recorded with a preceding context, the context was displayed on the left of the screen in the spelling system of the responses to indicate that the sounds of the preceding context were not part of what the listener should respond to. Thus, for the diphone /iu/ (both unstressed), recorded in /ˈabiukˈej/, “ahb” was shown to the left of the response buttons.
Listeners returned to the lab for multiple one-hour sessions, completing as many experimental blocks as they could per visit (with a brief break between each). Listeners took an average of 32.73 sessions to complete the experiment (range: 28–39). They received a small monetary compensation for each visit, and a bonus equal to five sessions’ compensation on completion. After most listeners had finished the experiment, we realized that we had erroneously omitted 25 stimuli. These 25 were randomized with 55 fillers (stimuli from other diphones that had already been presented), and subjects returned to complete these stimuli; responses to fillers in this session were not analyzed.
III. RESULTS
Percent correct responses and type of incorrect responses were computed for each segment of each diphone. The proportion correct averaged over all diphones containing a given segment as Segment 1 (or 2) was then calculated (stressed and unstressed vowels counted separately). Thus, Subject 1’s proportion correct for stressed /a/ at Gate 1 represents how often Subject 1 correctly chose /a/ as Segment 1 for all 101 Gate 1 stimuli with /a/ as Segment 1, regardless of Segment 2 identity. Tables II and III present confusion matrices, respectively, for consonants and vowels.
In Secs. III A–III D, we present statistical comparisons analyzing several of the most salient differences within each manner class. The choice of which comparisons to present is also informed by the analyses included in Smits et al. (2003). All statistical analyses below were conducted with subject as random factor on proportions correct out of all diphones with the same Segment 1 or Segment 2. Before statistical analysis, proportions were converted to Rationalized Arcsine Transformed Units (RAU), using Eqs. (2) and (3) in Sherbecoe and Studebaker (2004), which adjust for proportions calculated over <150 stimuli. The analyses of variance reported in Tables IV–VII are within-subjects pairwise comparisons of related sound types at each gate, using RAU proportions over all diphones in the relevant category as the dependent variable (e.g., all diphones with /d/ or /t/ as Segment 1 for the comparison of Segment 1 /d/ to /t/). Initial analysis of the data showed that the accuracy of two listeners (one male) was more than 3 standard deviations below the average of all other listeners for perception of either Segment 1 or 2 at three or more gates. They differed in which information they failed to use (in which gates for which segment), but both were clear outliers. These two listeners’ data were excluded as not representative, so all figures and tables show the data of the 20 remaining listeners. No other listeners differed markedly from the rest of the group at multiple gates. Figure 1 shows overall accuracy for all consonants, all vowels, and all segments, and clearly reveals listeners’ increasing uptake of information as the acoustic signal proceeds.
A. Stops, affricates, and flap
Figure 2 shows percent correct results for stops, affricates, and flap [Fig. 2(A) as Segment 1, Fig. 2(B) as Segment 2]. Results for Segment 1 are presented separately for diphones with only four gates (stops, affricates without preceding context) vs with six gates. Gate 3 of diphones with four gates includes only the release burst and any aspiration noise, so for voiceless unaspirated /b, d, g, dʒ/, this gate is short, thus lowering accuracy.
Several overall patterns are evident. Phonemically voiceless stops (/p, t, k/) are usually identified better than their voiced counterparts (/b, d, g/) early in the preceding segment [Gates 1–2, Fig. 2(B)] and once the release burst and any aspiration noise have been heard [Gates 3–6, Fig. 2(A) and Gate 6, Fig. 2(B)]. During the closure of the stop itself, however, the voiced segments are usually perceived as well as the voiceless segments or even more accurately, and this pattern may begin even by the end of the preceding segment [Gates 1–2, Fig. 2(A) and Gates 3–5, Fig. 2(B)]. Statistical results (Table IV) confirm this pattern especially for b/p and g/k (d/t is discussed below). This suggests that early in the preceding segment, listeners may perceive some place information, but use voiceless as a default choice for voicing. By the end of the preceding segment, longer duration before a voiced stop may be conveying information about voicing. During the stop’s closure, voiceless silence conveys no further information, but a voiced closure does, leading to the advantage for voiced segments. Finally, the noisy, longer VOT of /p, t, k/ leads to an increase in perceptibility for these stops during the release.
Figure 2 also shows several individual deviations, for instance, that, relative to other voiceless stops, /t/ is perceived poorly at many gates. The general pattern of better perception of voiceless stops during the preceding segment and once the burst has been heard is shifted by the overall
TABLE II. Confusion matrices for consonants: first segment at Gate 1 (top line of each Stimulus row) and second segment at Gate 4 (bottom line). Responses
are summed over subjects and over all diphones containing the consonant. The next-to-rightmost column is the total number of vowel responses to the conso-
nant (any vowel).
Response
p 542 22 40 5 6 1 15 9 20 6 4 1 1 68 740
482 30 8 95 26 9 2 3 14 21 1 58 14 6 4 2 25 23 5 11 3 13 2 123 980
t 106 414 45 6 49 20 4 22 17 2 2 1 7 2 83 780
9 333 6 2 188 11 26 8 8 68 33 5 60 5 8 7 1 4 48 9 5 4 3 9 120 980
k 19 31 535 1 3 8 3 10 39 1 1 2 1 4 1 61 720
22 100 266 12 50 48 6 3 7 21 3 1 162 7 5 2 5 5 18 4 4 1 7 16 205 980
b 66 2 522 22 11 1 8 29 2 4 1 2 2 2 2 44 720
98 12 494 32 6 5 4 5 13 2 1 41 30 8 34 19 2 15 7 44 2 86 960
d 8 23 1 12 582 3 6 3 2 4 7 2 20 1 1 45 720
13 46 6 21 499 8 6 12 9 23 4 38 24 20 2 3 8 51 3 14 10 6 3 131 960
Q 4 15 3 207 1 1 3 2 1 3 45 15 300
6 50 1 8 333 1 1 6 15 14 11 2 6 40 23 1 2 60 580
g 1 3 4 2 24 557 1 1 4 1 1 3 3 3 5 6 4 57 680
19 53 39 14 86 309 5 10 3 15 4 1 77 24 9 4 4 3 18 4 3 6 12 29 189 940
T 4 326 44 1 53 1 56 1 2 2 23 5 2 6 2 52 580
39 262 13 8 167 7 64 19 6 38 6 19 77 14 5 2 1 4 26 3 7 7 12 5 129 940
D 3 1 3 500 5 2 5 5 1 5 6 44 580
6 28 4 22 558 7 9 20 2 8 1 1 43 11 13 3 3 5 32 4 8 7 10 11 124 940
f 13 8 6 6 2 725 145 5 41 10 3 6 50 1020
8 5 9 1 3 6 3 555 162 11 2 30 75 10 1 1 7 7 4 7 73 980
h 16 19 3 7 248 346 3 19 10 6 3 1 1 1 37 720
2 3 5 1 2 2 3 1 151 595 5 1 38 26 33 2 1 1 4 1 6 1 56 940
s 11 91 3 1 1 1 9 23 807 7 22 1 1 12 1 1 28 1020
4 17 3 3 2 5 3 2 48 696 21 10 3 5 62 1 1 1 2 2 49 940
S 5 7 1 2 1 64 7 3 1 10 611 9 2 4 4 1 8 740
14 3 8 5 39 33 6 15 6 659 21 2 3 2 42 4 3 5 1 69 940
h 14 5 11 4 1 1 11 7 4 502 2 1 1 1 2 1 3 4 1 24 600
16 39 3 5 10 4 7 7 30 41 7 2 346 9 7 1 6 7 23 5 6 5 8 4 282 880
v 8 3 15 2 1 29 16 1 17 743 15 2 1 16 6 15 28 16 86 1020
4 10 16 5 6 1 5 65 31 4 3 29 598 20 3 7 15 14 5 8 2 23 66 940
ð 7 1 84 28 2 9 37 20 454 117 1 25 17 26 22 28 102 980
6 6 1 8 31 5 5 2 6 176 4 4 19 224 230 1 3 3 15 4 24 11 3 1 88 880
z 1 11 5 5 2 7 4 4 15 6 877 2 1 1 24 3 2 50 1020
3 3 4 2 13 1 3 6 17 54 9 2 4 763 9 3 6 1 37 940
Z 3 9 8 1 164 1 1 7 3 3 9 624 1 11 26 113 1 5 30 1020
1 1 2 14 10 17 133 1 2 2 40 14 2 5 10 569 1 7 1 3 3 4 4 74 920
m 15 2 2 32 2 15 8 693 94 13 20 3 19 2 80 1000
9 11 2 9 6 1 3 2 4 6 2 1 17 19 1 644 112 6 8 3 26 3 85 980
n 13 1 34 5 2 1 1 12 2 78 736 4 8 3 58 2 60 1020
2 17 9 5 38 8 4 5 15 2 23 13 12 2 3 44 629 3 29 5 11 4 97 980
˛ 1 1 1 6 1 8 1 23 108 659 25 6 8 3 129 980
2 3 3 9 32 137 2 1 2 9 200
l 42 4 3 60 4 4 2 1 1 14 4 1 13 9 6 655 11 102 84 1020
7 37 1 12 18 3 1 2 3 9 28 14 7 1 1 21 15 3 543 5 27 3 219 980
33 10 14 66 18 6 7 3 13 7 22 12 23 689 51 1 65 1040
16 18 9 20 5 1 2 1 1 17 15 2 4 13 13 1 21 370 58 2 211 800
w 6 56 9 1 7 4 39 6 361 71 560
8 47 2 9 14 6 4 2 5 7 2 29 12 1 2 21 26 8 45 15 349 1 245 860
j 3 4 22 13 4 1 2 1 12 4 1 1 13 7 8 1 6 192 185 480
3 33 4 9 18 2 1 8 2 21 9 1 3 17 2 8 3 195 441 780
Total 938 1003 716 952 1534 617 147 112 188 1074 633 832 632 842 1298 158 905 633 910 1032 687 889 993 675 214
783 1177 389 767 2141 467 222 777 292 885 1340 847 764 1225 1166 424 873 671 871 1163 212 828 497 629 299
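The next-to-rightmost column of Table II gives the number of vowel responses to each consonant, so it can be turned into a proportion of vowel responses per consonant. The Python sketch below does this for the sonorants as Segment 1 at Gate 1, with counts copied from the top line of the relevant rows. Reading the rightmost column as the row total is an assumption (each of these rows does sum to that value); the dictionary and function names are illustrative only.

```python
# (vowel responses, total responses) from the top line (Segment 1, Gate 1)
# of selected Table II rows. Treating the rightmost column as the row total
# is an assumption, checked only by summing the printed counts in each row.
GATE1_SEGMENT1 = {
    "m": (80, 1000),
    "n": (60, 1020),
    "l": (84, 1020),
    "w": (71, 560),
    "j": (185, 480),
}


def vowel_response_rate(vowel_count: int, total: int) -> float:
    """Proportion of responses to a consonant that named any vowel."""
    return vowel_count / total


rates = {
    segment: vowel_response_rate(vowels, total)
    for segment, (vowels, total) in GATE1_SEGMENT1.items()
}

# List segments from least to most often heard as a vowel.
for segment, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"/{segment}/ heard as a vowel in {rate:.1%} of responses")
```

On these counts the glide /j/ attracts by far the largest share of vowel responses, and the nasals the smallest.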
weak perception of /t/, but there is an indication of the same pattern. Further, /t, k, tʃ/ fail to show improvement in accuracy from Gate 4 to 5 for Segment 2, as does /p/ to a lesser extent, and these four segments also show little or no improvement from Gate 1 to 2 for Segment 1. These time periods cover the second half of the silent closure. Listeners
TABLE III. Confusion matrices for vowels: first segment at Gate 1 (top line of each Stimulus row) and second segment at Gate 4 (bottom line). Responses are summed over subjects and over all diphones containing the
vowel. The next-to-rightmost column is the total number of consonant responses given. Syllabic [l] is only unstressed and appears only as Segment 2.
Segment 1 Segment 2
B. Fricatives
Results for fricatives appear in Fig. 3. Here, segment-specific effects outweigh any general effect of voicing. The segments /ð, θ, ʒ/ are all poorly perceived. Table V shows that for Segment 1, /ð/ is perceived worse than /θ/ (the next lowest fricative); /θ/ itself is perceived somewhat worse than /ʒ/ (the next lowest fricative, differences not significant with Bonferroni correction), and /ʒ/ is perceived worse than its voiceless counterpart /ʃ/ at all but one gate. The differences between /ð, θ/ and /ʒ, ʃ/ are far larger than the differences
FIG. 1. Proportion correct as Segment 1 of the diphone (top set of lines) and Segment 2 of the diphone (lower set of lines), over time (gate end point 1–6). Average for all consonants, all vowels, and all segments.
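The six gate end points plotted in Fig. 1 are defined by proportions of segment duration (Sec. II A): one-third, two-thirds, and the end of each segment, except that for a stop or affricate following another segment the second gate falls just before the burst and the first gate halfway to that point. A minimal Python sketch of that rule follows, assuming the segment boundaries and burst onset are already known in milliseconds. The function names, the `burst_onset` parameters, and the example times are hypothetical and not taken from the paper.

```python
from typing import List, Optional


def segment_gates(onset: float, offset: float,
                  burst_onset: Optional[float] = None) -> List[float]:
    """Return the three gate end points (ms) inside one segment.

    Without a burst: one-third, two-thirds, and the end of the segment.
    With a burst (stop/affricate after another segment): halfway through
    the closure, the pre-burst point, and the end of the segment.
    """
    if offset <= onset:
        raise ValueError("segment offset must follow its onset")
    if burst_onset is None:
        duration = offset - onset
        return [onset + duration / 3, onset + 2 * duration / 3, offset]
    if not onset < burst_onset < offset:
        raise ValueError("burst onset must lie inside the segment")
    return [onset + (burst_onset - onset) / 2, burst_onset, offset]


def diphone_gates(seg1_onset: float, boundary: float, seg2_offset: float,
                  burst1: Optional[float] = None,
                  burst2: Optional[float] = None) -> List[float]:
    """Gate end points 1-6 for a six-gate diphone."""
    return (segment_gates(seg1_onset, boundary, burst1)
            + segment_gates(boundary, seg2_offset, burst2))


# Hypothetical /sa/-like diphone: 90 ms fricative followed by a 120 ms vowel.
print(diphone_gates(0.0, 90.0, 210.0))
# Hypothetical vowel + stop diphone with the burst starting at 180 ms.
print(diphone_gates(0.0, 90.0, 210.0, burst2=180.0))
```

For the four-gate diphones described in Sec. II A, only the last of the three Segment 1 gates would be kept, since the two earlier ones contain only silence.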
TABLE V. F ratios (Bonferroni-corrected significant comparisons only) for fricative comparisons with rationalized arcsine transformed proportions. Direction of effect in column headers. Significance level: p < 0.00167 for Segment 1, and p < 0.00417 for Segment 2 (Bonferroni correction for 30 comparisons for Segment 1 and 12 for Segment 2), df: (1,19) for each comparison. /ʒ/ vs /θ/ and /z/ vs /s/ (both Segment 1) were also tested, but were not significant.
Segment 1 Segment 2
1 110.28 17.48
2 110.70 16.88
3 82.53 18.53 748.11 94.91
4 69.69 21.75
5 69.26 20.25 32.12
6 68.04 15.30 22.49
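The significance levels quoted in the captions of Tables V, VI, and VII are consistent with a Bonferroni correction that divides a familywise alpha by the number of comparisons. The Python sketch below reproduces the four thresholds. The familywise alpha of 0.05 is an assumption, since the text states only the corrected values; the function name and labels are illustrative.

```python
ALPHA = 0.05  # assumed familywise error rate; not stated explicitly in the text


def bonferroni_threshold(n_comparisons: int, alpha: float = ALPHA) -> float:
    """Per-comparison significance level under a Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("need at least one comparison")
    return alpha / n_comparisons


# Number of comparisons as given in the table captions.
COMPARISONS = {
    "Table V, Segment 1": 30,
    "Table V, Segment 2": 12,
    "Table VI": 24,
    "Table VII": 18,
}

for label, n in COMPARISONS.items():
    print(f"{label}: p < {bonferroni_threshold(n):.5f}")
```

Rounded to five decimals, these come out as 0.00167, 0.00417, 0.00208, and 0.00278, matching the captions.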
between voiced/voiceless pairs for stops, suggesting that segment-specific factors, rather than perception of voicing, dominate perception of these fricatives.
The interdentals are most often misidentified as labiodentals (/ð/ as /v/, /θ/ as /f/), far more often than the reverse confusion: Segment 1 /ð/ received 19% correct responses, 57.1% /v/ and 9.7% /θ/ (all other responses <3%), while Segment 1 /v/ was 86.8% correct, with /ð/ as the next most common response at 2.3%. This shows that the low accuracy for interdentals is not a spelling confusion stemming from English orthography (both interdentals written as “th”), but is a perceptual effect. Our finding here matches earlier perceptual results (Jongman et al., 2000, 2003), and confirms that interdentals are far less perceptible than other English segments.
The other poorly perceived fricative, /ʒ/, has low frequency across the lexicon and has no standard grapheme, both of which most likely contribute to its low identification accuracy.
Besides the above fricative cases, only /f, v; s, z/ are distinct in voicing. In both pairs, the voiced member was better
C. Sonorants
Figure 4 shows that among sonorant consonants, nasals are most easily perceived, while glides, /j/ in particular, are poorly perceived, often identified as vowels (e.g., /j/ as Segment 1 at Gate 2 is identified as /aj/ in 7.1% of responses and as /i/ in 11.5%, correctly perceived in 68.1%). Identifications of glides as diphthongs such as /aj, aw/ may stem from the glide following /a/ as a preceding context. /w/ was sometimes also misperceived as /b/ (3.8% of responses for Segment 1, Gate 2) and /l/ (3.4%), reflecting its labial constriction and the similarity of dark [ɫ] to a back glide or vowel. Syllabic [l] (used only in the diphone [ɾl̩], as discussed above) shows very similar accuracy to non-syllabic [l] (not tested statistically because there is only one diphone with syllabic [l], so proportion correct cannot be used).
Overall, accuracy of sonorant perception, particularly as Segment 2, mirrors how consonantal a given sonorant is, in the sense of how distinct its acoustic boundaries are. Nasals (with the most discrete boundaries) are most perceptible, then liquids, then glides (the most vowel-like, and the most
FIG. 4. Proportion correct over gate point for sonorants. (A) As Segment 1. (B) As Segment 2.
FIG. 5. Proportion correct over gate point for front vowels and front-ending diphthongs. (A) As Segment 1. (B) As Segment 2.
TABLE VI. F ratios (Bonferroni-corrected significant comparisons only) for vowel tenseness with rationalized arcsine transformed proportions. Direction of effect in column headers. Significance level: p < 0.00208 (Bonferroni correction with 24 comparisons), df: (1,19) for each comparison.
                 Segment 1                          Segment 2
Gate  Stressed   i > ɪ   ej > ɛ   u > ʊ   a > ʌ     i > ɪ   ej > ɛ   u > ʊ   a > ʌ
TABLE VII. F ratios (Bonferroni-corrected significant comparisons only)
for stress (stressed more accurate than unstressed) with rationalized arcsine
transformed proportions. Significance level: p < 0.00278 (Bonferroni correc-
tion with 18 comparisons), df: (1,19) for each comparison.
Segment 1 Segment 2
1
2 48.41
3 101.84 117.50 26.31 23.15 14.31
4 56.15 97.17 13.67 74.93 14.86 32.58
5 68.31 87.12 39.67 78.55 54.99 37.44
6 75.39 253.66 53.03 181.35 145.55 217.44
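The F ratios in Tables V to VII are computed on rationalized arcsine transformed proportions. The paper uses Eqs. (2) and (3) of Sherbecoe and Studebaker (2004), which are not reproduced in the text. As a stand-in, the Python sketch below uses the classic rationalized arcsine transform of Studebaker (1985); whether it agrees term for term with the equations the authors applied is an assumption, and the function name is illustrative.

```python
import math


def rau(correct: int, total: int) -> float:
    """Rationalized arcsine units for `correct` out of `total` responses.

    Classic Studebaker (1985) form, used here as an assumed stand-in for
    the Sherbecoe and Studebaker (2004) equations cited in the paper.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146.0 / math.pi) * theta - 23.0


# Hypothetical scores out of 101 stimuli (the count the text gives for
# stressed /a/ as Segment 1 at Gate 1); the numbers correct are made up.
for n_correct in (10, 50, 95):
    print(n_correct, round(rau(n_correct, 101), 1))
```

The transform stays close to the raw percentage in the middle of the range and stretches the scale near 0% and 100%, which is why it is preferred over raw proportions for analysis of variance.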
consonant is perceived correctly in 90% of CV stimuli, but
only 84% of CC stimuli, and an initial stressed vowel is
perceived correctly in 93% of VV stimuli, but only 86% of VC
stimuli (all vowels stressed). So, in general, coarticulatory
information about consonants in vowels helps consonant
perception, but coarticulation with consonants hinders vowel
perception (a pattern held to underlie listeners’ greater
willingness to alter initial decisions about vowels than about
consonants; Van Ooijen, 1996).

V. CONCLUSIONS

The dataset described here shows how listeners perceive
speech sounds over time for all sounds of American English
in all possible environments. Acoustic information, and
hence perceptual cues, are shown to be distributed through
the speech signal differentially over time, with the precise
timing of the distribution depending on phonological
categories, specific segment identities, and stress.
The present work, and the associated publicly available
complete dataset (http://www.u.arizona.edu/~nwarner/WarnerMcQueenCutler.html),
allows comparison of the timing of perception for any English
segment preceded or followed by any other possible English
segment. All diphones were tested with the same experimental
methods, pronounced by the same speaker, and heard by the same
listeners. This degree of comparability across a whole language
repertoire could never be reached by meta-analyses of studies
of a specific set of segments or diphones. The scale and
comparability of this dataset thus allows current and future
researchers to answer a wide variety of questions about
speech perception. It also allows modeling of spoken word
recognition in English with probabilistic data about how
likely listeners are to think they are hearing a given sound at
a given point in time, not just with a “toy” lexicon, but with
the entire English lexicon. This use of the dataset will be
implemented in a forthcoming release for English of the
Bayesian probabilistic model of continuous speech recognition
Shortlist B (Norris and McQueen, 2008, currently
implemented for Dutch using the dataset of Smits et al.,
2003). Of course, the data could equally well be used as
input to other models. To conclude, the dataset provides a
way for researchers to answer questions about both spoken
word recognition and speech perception in English, without
the need to collect large sets of new data for each question.

ACKNOWLEDGMENTS

The project was supported by a special grant from the
Max Planck Society. The authors thank Priscilla Shin and
Maureen Hoffmann for their work on this project.

Benkí, J. (2003). “Analysis of English nonsense syllable recognition in noise,” Phonetica 60, 129–157.
Fear, B. D., Cutler, A., and Butterfield, S. (1995). “The strong/weak syllable distinction in English,” J. Acoust. Soc. Am. 97, 1893–1904.
Fourakis, M. (1991). “Tempo, stress, and vowel reduction in American English,” J. Acoust. Soc. Am. 90, 1816–1827.
Furui, S. (1986). “On the role of spectral transition for speech perception,” J. Acoust. Soc. Am. 80, 1016–1025.
Greenberg, S. (1999). “Speaking in shorthand—A syllable-centric perspective for understanding pronunciation variation,” Speech Commun. 29, 159–176.
Hillenbrand, J. M., and Nearey, T. M. (1999). “Identification of resynthesized /hVd/ utterances: Effects of formant contour,” J. Acoust. Soc. Am. 105, 3509–3523.
Jesse, A., and Massaro, D. W. (2010). “The temporal distribution of information in audiovisual spoken-word identification,” Atten. Percept. Psychophys. 72, 209–225.
Jongman, A., Wang, Y., and Kim, B. (2003). “Contributions of sentential and facial information to perception of fricatives,” J. Speech Lang. Hear. Res. 46, 1367–1377.
Jongman, A., Wayland, R., and Wong, S. (2000). “Acoustic characteristics of English fricatives,” J. Acoust. Soc. Am. 108, 1252–1263.
Ladefoged, P., and Johnson, K. (2015). A Course in Phonetics (Cengage Learning, Stamford, CT), Chap. 4, pp. 89–114.
McQueen, J. M. (1995). “Processing versus representation: Comments on Ohala and Ohala,” in Phonology and Phonetic Evidence. Papers in Laboratory Phonology IV, edited by B. Connell and A. Arvaniti (Cambridge University Press, London), pp. 61–67.
Miller, G. A., and Nicely, P. E. (1955). “An analysis of perceptual confusions among some English consonants,” J. Acoust. Soc. Am. 27, 338–352.
Nishi, K., Lewis, D. E., Hoover, B. M., Choi, S., and Stelmachowicz, P. G. (2010). “Children’s recognition of American English consonants in noise,” J. Acoust. Soc. Am. 127, 3177–3188.
Norris, D., and McQueen, J. M. (2008). “Shortlist B: A Bayesian model of continuous speech recognition,” Psych. Rev. 115, 357–395.
Ohala, J. J., and Ohala, M. (1995). “Speech perception and lexical representation: The role of vowel nasalization in Hindi and English,” in Phonology and Phonetic Evidence. Papers in Laboratory Phonology IV, edited by B. Connell and A. Arvaniti (Cambridge University Press, London), pp. 41–60.
Öhman, S. E. (1966). “Perception of segments of VCCV utterances,” J. Acoust. Soc. Am. 40, 979–988.
Peláez-Moreno, C., García-Moral, A. I., and Valverde-Albacete, F. J. (2010). “Analyzing phonetic confusions using formal concept analysis,” J. Acoust. Soc. Am. 128, 1377–1390.
Peterson, G. E., and Barney, H. L. (1952). “Control methods used in a study of the vowels,” J. Acoust. Soc. Am. 24, 175–184.
Phatak, S. A., Lovitt, A., and Allen, J. B. (2008). “Consonant confusions in white noise,” J. Acoust. Soc. Am. 124, 1220–1233.
Pisoni, D. B., Nusbaum, H. C., Luce, P. A., and Slowiaczek, L. M. (1985). “Speech perception, word recognition and the structure of the lexicon,” Speech Commun. 4, 75–95.
Pols, L. C. W., and Schouten, M. E. H. (1978). “Identification of deleted consonants,” J. Acoust. Soc. Am. 64, 1333–1337.
Sherbecoe, R. L., and Studebaker, G. A. (2004). “Supplementary formulas and tables for calculating and interconverting speech recognition scores in transformed arcsine units,” Int. J. Audiol. 43, 442–448.
Smits, R. (2000). “Temporal distribution of information for human consonant recognition in VCV utterances,” J. Phonetics 28, 111–135.
Smits, R., Warner, N., McQueen, J. M., and Cutler, A. (2003). “Unfolding of phonetic information over time: A database of Dutch diphone perception,” J. Acoust. Soc. Am. 113, 563–574.
Van Ooijen, B. (1996). “Vowel mutability and lexical selection in English: Evidence from a word reconstruction task,” Mem. Cognit. 24, 573–583.
Wang, M. D., and Bilger, R. C. (1973). “Consonant confusions in noise: A study of perceptual features,” J. Acoust. Soc. Am. 54, 1248–1266.
Warner, N., Smits, R., McQueen, J. M., and Cutler, A. (2005). “Phonological and statistical effects on timing of speech perception: A database of Dutch diphone perception,” Speech Commun. 46, 53–72.