BABBLING
reported in Alhaidary (2012) and figure printed with permission of Susan
Rvachew and Abdulsalam Alhaidary.
Dictionary definitions of babbling (Oxford English Dictionary, 2015)
highlight two aspects of the vocalizations that the infant is producing
during this early stage of speech development: First, these vocalizations
are unintelligible or meaningless to the listener; second, these
vocalizations have a speechlike form in that they are composed of
recognizable speech sounds and syllables. It is not uncommon to see the
definition of babble stretched to cover all of the meaningless,
nonreflexive utterances produced by the infant. This overgeneralization
of the term is inappropriate, as illustrated in Figure 1; therefore, it is
necessary to bring some precision to the use of terms (Oller, 2000). Prior
to the onset of babbling at approximately seven months of age, the
infant produces a range of nonreflexive sounds that are excluded from
the definition of babbling by virtue of being not recognizably speechlike
(e.g., squeals, growls, raspberries). Also excluded are utterances that
contain recognizable speech sounds, but not organized into speechlike
syllables (i.e., single vowels and marginal babble). Identifying the
transition from these less mature types of vocalizations to babbling is
supported by a scientific definition of babbling that takes into account
the perceptual, acoustic, and articulatory characteristics of these more
mature utterances. Canonical babbling is produced with articulatory
gestures that serve to alternate between a closed (or relatively closed)
and open vocal tract to produce syllables 100 to 500 ms long with
formant frequency transitions that have a duration of 25 to 120 ms. The
vocalic portions of the syllables are produced with normal phonation and
resonance. Canonical babble may be composed of a single syllable
meeting these criteria or a rhythmic sequence of syllables, either
reduplicated as shown in Figure 2A or variegated as shown in Figure 2B.
In this article the focus will be on canonical babbling specifically,
although a brief overview of how this skill emerges from earlier
noncanonical stages of speech development will be provided before
proceeding to the details of the phonetic content of this type of infant
vocalization. Definitions of technical terms that will appear in this
overview and subsequently are shown in Table 1.
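The criteria for a canonical syllable given above can be read as a simple checklist. The following Python sketch is illustrative only: the `Syllable` record and its field names are hypothetical, and it assumes that the durations and the perceptual judgments have already been obtained by hand or with acoustic analysis software.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    """Hypothetical record of measurements for one infant syllable."""
    duration_ms: float       # total syllable duration
    transition_ms: float     # duration of the formant frequency transition
    has_closure: bool        # closed (or relatively closed) vocal tract phase
    normal_phonation: bool   # vocalic portion produced with normal phonation
    full_resonance: bool     # vocalic portion produced with full resonance


def is_canonical(s: Syllable) -> bool:
    """Apply the criteria stated in the text: syllables 100 to 500 ms long,
    formant transitions of 25 to 120 ms, normal phonation and resonance."""
    return (
        s.has_closure
        and 100 <= s.duration_ms <= 500
        and 25 <= s.transition_ms <= 120
        and s.normal_phonation
        and s.full_resonance
    )
```

Under this sketch a syllable with a 250 ms duration and a 60 ms transition passes, whereas the same syllable with a 200 ms transition fails the timing criterion and would be treated as noncanonical.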
Table 1. Definitions.
Consonant   Speech sounds produced with the vocal tract closed or
            approximately closed. Includes closures produced with the
            lips, i.e., labial [m], [b]; the tongue tip, i.e., alveolar
            [n], [d]; or the tongue body against the velum (soft
            palate), i.e., velar [ɡ], [k].
Resonance   The sound quality that results from the size and shape of
            the resonating cavity in which speech is produced. Fully
            resonant vowels are produced in an open oral cavity (mouth)
            and quasiresonant vowels are produced with air flow through
            the nasal cavity (nose).
effects.
even with recent improvements to analysis techniques that are specially
adapted to this context (Alku, Pohjalainen, Vainio, Laukkanen, &
Story, 2013; Shadle, Nam, & Whalen, 2016; Vallabha & Tuller, 2002).
Therefore, the best option is to combine perceptual transcription with
instrumental descriptions of infant speech.
Oller (2000) explains that in order to provide a linguistically meaningful
description of infant vocalizations it is necessary to employ an
infraphonological framework that defines how acoustic and articulatory
parameters are manipulated to produce well-formed syllables. This
framework permits the adult listener to identify categories of infant
vocalization according to the infraphonological parameters that are
violated in their production, as will be described in more detail in
section 1.3. Only those vocalizations that meet all requirements for well-
formedness, thus qualifying as canonical babble, can be properly
submitted to phonetic transcription. Even at this level, however,
phonetic transcription may provide excessive detail and overestimate
the infant’s phonetic abilities. Ramsdell, Oller, Buder, Ethington, and
Chorna (2012) suggest that the actual number of syllabic templates in
an infant’s repertoire will be smaller than the repertoire of syllables
revealed by detailed phonetic transcription, so that (for example) each
of the utterances [baba] [βaβa] [baβa] [babβa] can be considered as part
of a single labial obstruent-plus-vowel template.
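The idea that several phonetically distinct transcriptions belong to one syllabic template can be sketched in a few lines of Python. This is not the procedure used by Ramsdell et al. (2012); the symbol table, the vowel set, and the function name are hypothetical and deliberately small, chosen only to show how narrow transcriptions might collapse to a broad template.

```python
# Hypothetical mapping from transcribed consonants to broad classes.
CONSONANT_CLASS = {
    "b": "labial obstruent",
    "p": "labial obstruent",
    "β": "labial obstruent",
    "d": "alveolar obstruent",
    "t": "alveolar obstruent",
}
VOWELS = set("aiuʌ")  # assumed vowel symbols, for illustration only


def template(transcription: str) -> frozenset:
    """Reduce a transcription to its set of consonant-plus-vowel templates."""
    slots = []
    for symbol in transcription:
        # Unknown symbols raise KeyError rather than being guessed at.
        label = "vowel" if symbol in VOWELS else CONSONANT_CLASS[symbol]
        # Adjacent symbols of the same class share one slot, so [bβ] counts once.
        if not slots or slots[-1] != label:
            slots.append(label)
    # Pair each consonant slot with the vowel slot that follows it.
    return frozenset(
        f"{slots[i]} + vowel"
        for i in range(len(slots) - 1)
        if slots[i] != "vowel" and slots[i + 1] == "vowel"
    )
```

With this table, "baba", "βaβa", "baβa", and "babβa" all reduce to the same single template, while "dada" reduces to a different one.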
In addition to empirical investigations of infant vocal output, vocal tract
models (Ménard, Davis, Boë, & Roy, 2009) and simulations of speech
development (Nam, Goldstein, Giulivi, Levitt, & Whalen, 2013) have
become increasingly popular methods of testing theories of the
mechanisms that underlie the development of early speech. A related
methodology employs robots to examine the intersection of infant
speech output and adult responses to model developmental processes
(Howard & Messum, 2011; Moulin-Frier, Nguyen, & Oudeyer, 2014;
Rasilo, Räsänen, & Laine, 2013).
samples using IPA and then focused on place, manner, and voicing
features. Despite these large procedural variations across studies in
English, Swedish, and Dutch language contexts, the studies yielded
remarkably similar stage hierarchies and therefore only the stages
described by Oller (2000) will be outlined here.
Five stages are proposed, each marked by differences in the relative
frequency of particular utterance types rather than the production of
unique categories of vocalization. The utterance types are defined in
relation to the canonical syllable (CS), as described in
section 1.2, revealing the systematic accumulation of the principles of
well-formedness in syllable production during the first year of life. The
resulting set of utterance types makes it possible to classify all human
utterances using the same scheme, while not assuming that the infant
has the same vocal tract structure, motor control capabilities,
articulatory goals, or internalized linguistic categories as the adult talker.
During the phonation stage in the first month or two after birth, the
quasiresonant vowel (QRV) is the primary nonreflexive, nondistress
vocalization, produced with the vocal tract in a more or less resting
position. The primary difference between QRVs and CSs is the absence
in QRVs of clear upper frequency formants and obvious formant
transitions that would accompany deliberate shaping of the vocal tract
to produce a specific vowel or syllable. Although these utterances sound
nasal, the velum may or may not be lowered during production. QRVs
are produced with normal phonation and duration and may occur in
rhythmic sequences, qualities that are shared with CSs.
The primitive articulation stage emerges between one and four months
of age when QRVs are interrupted by an undifferentiated vocal tract
closing gesture to produce an utterance referred to as a goo or coo. IPA
transcription of these utterances leads to the common but misleading
conclusion that prelinguistic phonetic development proceeds from back
to front (velar–coronal–labial) whereas linguistic phonetic development
proceeds in the opposite direction with labial consonants acquired
before velars (Irwin, 1947a). Such a conclusion does not take into
account differences in infant vocal tract morphology or speech motor
control and assumes incorrectly that IPA transcription is a reasonable
representation of the infant’s articulatory gestures during the first few
months of life.
The expansion stage, lasting from approximately four through seven
months, is marked primarily by the appearance of fully resonant vowels
(FRV) alongside a large variety of utterance types (raspberries, squeals,
growls, yells, whispers, and so on) that give the appearance of free
exploration of phonatory and articulatory parameters. Toward the end of
this stage marginal babbling appears in the form of consonant–vowel
syllables that do not meet the requirements of a CS. These highly
variable vocalizations are characterized by unusual timing, phonatory, or
resonance parameters that violate the parameters of well-formedness.
The canonical babbling stage begins on average at seven months, and
always by 11 months, in normally developing infants. This stage is
heralded by the emergence of canonical syllables, easily identified by
parents and other untrained observers especially when in multisyllable
form. There is some controversy about whether the production of
reduplicated babble, in which all the syllables in the utterance are the
same, and variegated babble, in which there are varied consonants and
vowels, occurs during overlapping or sequential stages. Some observers
have described variegated babble as a more complex utterance that
appears subsequent to reduplicated babble (Elbers, 1982; Stoel-
Gammon, 1989). In larger sample studies, it has been observed that
these two kinds of babbling emerge in parallel (Mitchell & Kent, 1990;
Smith, Brown-Sweeney, & Stoel-Gammon, 1989). The impossibility of
knowing the infant’s intention makes it difficult to resolve this conflict.
Recall that Ramsdell et al. (2012) hypothesized that utterances such as
[baba] [babβa] [baβa] can be considered to be members of the same
template—phonetically different utterances that would sound roughly
similar to the casual listener. This seems to be a reasonable hypothesis
if the younger infant produces all three vocalizations as a consequence
of imprecise jaw closing gestures. This hypothesis further implies that
[babβa] and [baβa] are not in fact more complex in form than [baba].
Although it seems theoretically possible that an older infant might
intentionally produce [babβa] and [baβa] in a manner that is
qualitatively and functionally differentiated from [baba], it is not clear
how intentional babbles can be isolated from accidental versions of the
same forms. Further to the issue of intent, the canonical babbling stage
coincides with beginning receptive language skills and overlaps with the
emergence of intentional communication. Babbled utterances do not
immediately serve the communicative needs of the child, however, as
often the infant will resort to nonverbal gestures or more primitive forms
of vocalization to demand attention or comment on the environment
(McCune, Vihman, Roug-Hellichius, Bordenave Delery, & Gogate, 1996).
Babbling frequently occurs when the infant is alone or when the adult is
not attending to the infant and therefore these vocalizations are clearly
not communicative in intent (Locke, 1989).
The final stage is the integrative stage, when babbling coexists with the
production of meaningful words, a period covering roughly 12 through
18 months of age. Babbling may be integrated with meaningful forms
within the same utterance when babies produce jargon. Another form
that is common during this stage is gibberish, consisting of sequences of
nonmeaningful syllables produced with prosodic contours that mimic
those heard in meaningful phrases.
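The approximate age ranges of the five stages can be summarized as a lookup table. The Python sketch below is a convenience for the reader rather than part of Oller's (2000) scheme: the month boundaries are rounded from the text, the upper bound of the canonical babbling stage is an assumption (taken as the point at which the integrative stage begins), and the function name is hypothetical. Ranges overlap deliberately, because the stages are marked by the relative frequency of utterance types rather than by sharp transitions.

```python
# (first_month, last_month), inclusive; values rounded from the text.
STAGES = {
    "phonation": (0, 2),
    "primitive articulation": (1, 4),
    "expansion": (4, 7),
    "canonical babbling": (7, 12),  # upper bound is an assumption
    "integrative": (12, 18),
}


def stages_at(age_months: float) -> list:
    """Return every stage whose approximate age range includes this age."""
    return [
        name
        for name, (first, last) in STAGES.items()
        if first <= age_months <= last
    ]
```

For example, an age of one month falls within both the phonation and primitive articulation ranges, whereas nine months falls within the canonical babbling range only.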
thereafter until the shape of the vocal tract approximates that of the
adult model by the age of six years (Kent & Vorperian, 1995).
Developmental changes in the structure of the vocal tract may partly
explain the emergence of new utterance types during the first year,
especially the shift from the phonation stage with the predominance of
quasiresonant vocalizations to the expansion stage with varied
vocalizations including fully resonant vowels. The relationship between
vocal tract structure and function is reciprocal, however: It is not simply
a matter of structural change permitting new utterance types; rather,
vocal practice changes muscle strength, contributing in turn to changes
in vocal tract structure. Furthermore, developmental changes in the
ability to control and coordinate the vocal system impact the infant’s
vocal repertoire (for review, see Rvachew & Brosseau-Lapré, 2018).
A prominent theory attributes early babbling patterns to the primary role
of mandibular (lower jaw) movements in syllable production (MacNeilage
& Davis, 2000). By this account, the dominant pattern of rhythmic jaw
movements provides a syllabic frame for speech production. Frame
dominance is hypothesized to restrict early syllables to forms requiring
very little differentiation of tongue and lip movements from the
dominant jaw movement pattern: specifically, labial consonants with
central vowels (e.g., [bʌ]), alveolar consonants with front vowels (e.g.,
[di]), and velar consonants with back vowels (e.g., [ɡu]). The dominant
role of the jaw in early syllable production has been established in
kinematic studies (Green, Moore, Higashikawa, & Steeve, 2000).
However, there is more variability in the content—consonants and
vowels produced within syllables—than is predicted, and at earlier ages
than predicted, suggesting that the phonetic content of babbled
syllables is better explained by alternative accounts such as articulatory
phonology (Giulivi, Whalen, Goldstein, Nam, & Levitt, 2011; Sussman,
Duder, Dalston, & Cacciatore, 1999; Sussman, Minifie, Buder, Stoel-
Gammon, & Smith, 1996). Nam et al. (2013) predict that syllables will be
more frequent in infant and adult speech when they are composed of
overlapping consonant and vowel gestures that can be produced
synchronously. By their account, infants have control over a variety of
vocal tract constrictions involving the jaw and tongue body. Kinematic
research with young children has shown that stability in the production
of higher order goals (i.e., lip aperture area) is achieved prior to stability
in the underlying articulatory gestures that contribute to the higher
order goal (Smith & Zelaznik, 2004). Furthermore, practice plays a key
role in the emergence of stable higher order goals (Walsh, Smith, &
Weber-Fox, 2006). It would appear that passive maturational
mechanisms in the structural or functional domains are insufficient to
explain the developmental course of babbling. Therefore, environmental
inputs and learning mechanisms will be considered.
unsupervised learning in which stable action–perception links are
acquired through random articulatory practice with auditory and
somatosensory feedback, in particular during the expansion stage.
Ultimately, an internal model of the mapping between inputs to and
outputs from the speech motor system is acquired (Wolpert,
Ghahramani, & Flanagan, 2001).
Subsequently, this internal model lays the foundation for supervised
learning in which the infant learns through trial and error to achieve
speechlike and perhaps language-specific speech targets during the
canonical babbling stage (Imada et al., 2006; Kuhl, Ramírez, Bosseler,
Lin, & Imada, 2014). Performance is improved in supervised learning
because feedback generates error signals in relation to a specified
target during practice (Wolpert et al., 2001). Traditionally, supervised
learning invokes the notion of an external model for imitation, and at
least one study has suggested imitation of point vowels by infants as
young as six months of age (Kuhl & Meltzoff, 1996). Self-supervised
learning is also possible, especially given the early stabilization of
native-language vowel categories in perceptual learning (Kuhl et
al., 2008), and indeed most babbling appears to occur in the absence of
external models. Moulin-Frier et al. (2014) have described a model in
which early learning is intrinsically motivated and focused on self-
generated auditory targets; a developmental transition to imitation
learning occurs later in development after the achievement of the basic
principles of speech production.
An alternative hypothesis invokes reinforcement learning as the primary
mechanism (Howard & Messum, 2011), with speech learning driven by
adult mimicry of infant vocalizations that capture adult attention when
they approximate phonetic categories in the adult language system (see
also Rasilo et al., 2013). Social reinforcement from parents also shapes
infant vocal output (Goldstein, King, & West, 2003; Goldstein &
Schwade, 2008). These differing learning mechanisms—unsupervised,
supervised, and reinforcement learning—are not mutually exclusive, and
it is likely that all three play a role in early speech development.
contexts, possibly as a consequence of a generalized fall in energy over
the course of syllable production. With respect to place of consonant
articulation, labial and alveolar consonants together account for 80% to
90% of all consonants produced in babble.
Vowel segments are similarly restricted, with central and mid- or low-
front vowels being preferred across many languages (i.e., English,
French, Japanese, Mandarin, Korean, and Swedish: Boysson-Bardies &
Vihman, 1991; Buhr, 1980; Kent & Bauer, 1985; Kent & Murray, 1982;
Lee, Davis, & MacNeilage, 2010; Chen & Kent, 2010). Vocal tract
modeling shows that it is possible to produce the full range of vowels
with the infant vocal tract, but most vowels in the theoretical infant
vowel space are perceived by the adult as low and front (Ménard,
Schwartz, & Boë, 2004). Developmental changes in vocal tract
morphology and functional abilities result in a gradual expansion of the
vowel space along the F1 dimension in the first year and the F2
dimension in the second year; consequently, the corner vowels appear
alongside greater differentiation among vowel categories by
approximately 18 months of age (Ishizuka, Mugitani, Kato, &
Amano, 2007; Kent & Murray, 1982; Rvachew et al., 2008; Rvachew,
Mattock, Polka, & Ménard, 2006; Rvachew et al., 1996).
The syllabic and prosodic structure of babbled utterances has also
received attention. Excluding noncanonical utterances containing only a
single vowel (which remain the most common speechlike utterance
through one year of age; Kent & Bauer, 1985), babbled utterances are
overwhelmingly composed of one or more CV syllables (Mitchell &
Kent, 1990). Syllables containing a coda consonant or consonant cluster
are extremely rare. About three-quarters of canonical babbles consist of
a single CV syllable during the canonical babbling stage (Fagan, 2009).
The frequency of multisyllable babbles and the number of repetitions per
babbled utterance appear to peak at approximately nine months,
decrease through approximately 12 months when first words appear,
and then increase again later in the second year, at least in
normal-hearing English-learning infants (Fagan, 2009, 2015; Smith et al., 1989).
Variegated and reduplicated canonical babbling emerge
contemporaneously; variegation in manner is more common than
variegation in place of articulation (Gildersleeve-Neumann, Davis, &
MacNeilage, 2013; Smith et al., 1989).
Within-utterance pitch contours have been examined as precursors to
the linguistic manipulation of fundamental frequency (F0). Depending
upon the language environment, the infant will hear variations in F0 to
signal lexical contrast or phonological tone. There is little consensus
among these studies in reported preferences for pitch contours in
babble. The most frequently observed patterns are falling, rising–falling,
and level pitch contours (Amano, Nakatani, & Kondo, 2006; Chen &
Kent, 2009; Davis, MacNeilage, Matyear, & Powell, 2000; Kent &
Murray, 1982; Whalen, Levitt, & Wang, 1991). Explanations for the
observed pitch contours also vary: Some researchers place strong
emphasis on physiological factors (Kent & Murray, 1982) while others
explicitly propose learning from the ambient language (Hallé et
al., 1991). It is difficult to determine if infants are deliberately
manipulating prosodic contours because the background level of F0 in
infant babble is unstable: maturational declines in absolute F0 and
variability in F0 occur continuously during infancy; furthermore, the
ability to coordinate all the parameters of prosody—pitch, loudness, and
duration—develops well into late childhood (Kehoe, Stoel-Gammon, &
Buder, 1995; Lee, Potamianos, & Narayanan, 1999; Vorperian &
Kent, 2007).
implants at an early age (before 30 months) phonetic development may
proceed relatively normally, with onset of canonical babble five to 10
months after implantation being a good prognostic indicator.
An important kind of auditory input may be access to feedback of the
infant’s own speech during speech practice. Infants who have undergone
tracheostomy to bypass an obstructed airway have restricted access to
this kind of feedback until the breathing tube is removed
(decannulation). Some infants who have experienced long-term
tracheostomy have neurological or craniofacial conditions that would
otherwise impair speech production but many do not have complications
beyond the tracheostomy. Speech development in this latter group
appears to progress through the normal prelinguistic stages of vocal
development after decannulation, albeit at a faster pace (Kraemer,
Plante, & Green, 2005); specifically, expansion stage vocalizations
appear first followed rapidly by canonical babble consisting of the
expected CV syllables favoring stop consonants; fricatives emerge last
and speech therapy may be required to ensure a complete phonetic
inventory and accurate speech. Outcomes are associated with the age of
initial tracheostomy procedure, duration of cannulation, and age at
decannulation (Jiang & Morrison, 2003). When the tracheostomy is
performed after one year or decannulation occurs in the first three
months of life, speech development will follow a normal trajectory. If the
tracheostomy is performed at approximately four months, there is likely
to be speech and language delay if the duration of cannulation persists
throughout the second year of life; outcomes may be good if
decannulation occurs before or shortly after the first birthday.
Many infants who require tracheostomy were born prematurely, and low
birth weight impacts speech development even without tracheostomy in
this population that is at risk for subtle but long-term issues with motor
coordination. Rvachew, Creighton, Feldman, and Sauve (2005) observed
that very low birth weight infants with a history of bronchopulmonary
dysplasia demonstrated delayed onset of canonical babbling and a
tendency toward unusual rhythmic organization of their babbling (see
also Goldfield, 1999). Infants with more frank motor impairments,
specifically cerebral palsy, have been observed to produce very short
utterances reflecting differences with controlled expiration (Levin, 1999).
In contrast, infants with cleft palate do not demonstrate difficulties with
the rhythmic quality of their vocalizations but do have restricted and
unusual phonetic repertoires reflecting impairments in the structural
domain (Chapman, Hardin-Jones, Schulte, & Halter, 2001).
samples of infants and there are disputes about the appropriate
description of salient ambient language inputs, some cross-linguistic
differences in infant speech have been reported. With respect to
consonant manner for example, French-, Japanese-, and Chinese-
learning infants may produce more nasal consonants than English- and
Swedish-learning infants in their prelinguistic babble (Boysson-Bardies &
Vihman, 1991; Chen & Kent, 2010). However, nasals are a common
consonant type universally, and it is difficult to establish that this cross-
linguistic variation exceeds the degree of within-language variation for
proportion of stops versus nasals in infant vocalizations. It is not clear
that there are reliable differences in place of articulation, but labial place
may be predominant in English and French whereas alveolar place
seems somewhat more common than labial in Japanese, Mandarin,
Korean, and Swedish (Boysson-Bardies & Vihman, 1991; Chen &
Kent, 2005; Kent & Bauer, 1985; Lee et al., 2010). Language-specific
differences in voice onset time have been reported, specifically involving
a more frequent production of voicing lead by French-learning infants in
comparison to English-learning infants (Whalen, Levitt, &
Goldstein, 2007).
maximum [F2−F1] (Diffuse corner), minimum [(F1+F2)/2] (Grave corner)
and minimum [F2−F1] (Compact corner) and using these values to
calculate the triangular vowel space area. Arabic infants were found to
produce larger and more symmetrical vowel spaces in comparison to the
English vowel space throughout the age range 10 through 18 months of
age. Data taken from Alhaidary (2012), and figure printed with permission
of Susan Rvachew and Abdulsalam Alhaidary.
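The corner-based calculation named in the caption can be illustrated with a short Python sketch. It is one plausible reading of that description and not the procedure of Alhaidary (2012): it assumes each vowel token is an (F1, F2) pair in Hz, that the corners are selected from individual tokens, and that the area is that of the triangle joining the three corners. The function name is hypothetical.

```python
def vowel_space_area(tokens):
    """Triangular vowel space area from a list of (F1, F2) pairs.

    Corners follow the caption: diffuse = maximum F2 - F1,
    grave = minimum (F1 + F2) / 2, compact = minimum F2 - F1.
    The result is in squared input units (Hz^2 if the input is in Hz).
    """
    diffuse = max(tokens, key=lambda t: t[1] - t[0])
    grave = min(tokens, key=lambda t: (t[0] + t[1]) / 2)
    compact = min(tokens, key=lambda t: t[1] - t[0])

    # Shoelace formula for the area of a triangle from its three vertices.
    (x1, y1), (x2, y2), (x3, y3) = diffuse, grave, compact
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2
```

For made-up tokens at (300, 2300), (300, 800), (800, 1200), and (500, 1500), the first three are selected as the diffuse, grave, and compact corners, and the triangle has a base of 1500 Hz and a height of 500 Hz, giving an area of 375,000 Hz².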
The approach to the study of the vowel system has focused on more
global characteristics of the infant’s phonetic output. Acoustic analysis of
the infant’s vowels permits a description of the infant’s vowel space that
does not assume infant knowledge of adult phonetic categories, as
shown in Figure 3. For example, developmental changes and cross-
linguistic differences in the location of the center of the vowel space in
F1 and F2 coordinates have been identified in French- versus English-
learning infants (de Boysson-Bardies, Hallé, Sagart, & Durand, 1989;
Rvachew et al., 2006). Furthermore, expansion of the vowel space
toward the corners proceeds differently as a function of the complexity
of the input vowel system: specifically, this expansion appears to be
faster when the vowel system is simple, as in Arabic, when compared to
languages with more complex vowel inventories such as English or
French (Alhaidary, 2012; Rvachew et al., 2008). These studies suggest
that development of the infant vowel space does not reflect a
straightforward process of attempting to match adult phonetic targets;
rather, changes in the shape of the vowel space as a whole seem to take
into account competition for perceptual attention in different corners of
the space.
Some languages are differentiated more by rhythmic characteristics
than phonetic content: For example, English is a stress-timed language
whereas French is a syllable-timed language. Levitt and Wang (1991)
found that French- and English-learning infants produced babble that
reflected striking differences in the prosodic organization of the ambient
language environment: In particular, babbled utterances by the French-
learning infants contained more syllables that were regularly timed
excepting the final syllables, which showed more prominent utterance
final lengthening when compared to those produced by their English-
learning age peers. Further to the topic of cross-linguistic differences in
prosody, numerous studies have reported cross-linguistic differences in
the frequency of rising and falling tones in babbled disyllables with
continuity in the proportion of usage of these patterns into the first word
stage (e.g., for Mandarin, see Chen & Kent, 2009; for French vs.
Japanese, see Hallé et al., 1991; for English vs. French, see Whalen et
al., 1995).
3. Critical Analysis of Scholarship
The latter decades of the twentieth century yielded some highly
significant advancements in the study of vocal development. An
essential breakthrough was the establishment of an objective definition
of canonical babbling that integrates acoustic, articulatory, and phonetic
factors. An understanding of the developmental course of vocal
development during infancy emerged, and the age of onset for the
canonical babbling stage has been established with replications across
multiple laboratories and language groups. It is now clear that vocal
development absolutely requires adequate access to language input:
Hearing-impaired infants do not learn to babble in the normal fashion
unless hearing is habilitated early in life via hearing aids or cochlear
implants. Despite this considerable progress, the learning mechanisms
that underpin the acquisition of babbling have not yet been determined,
and it remains unclear whether infants are acquiring language-specific
articulatory representations for speech sounds during this prelinguistic
stage of vocal development.
Many studies have attempted to test the hypothesis of “babbling drift”
by comparing the phonetic characteristics of babble produced by infants
learning different languages. These studies are marked by
methodological difficulties including small samples of infants, unreliable
descriptive metrics, and a complete lack of replication studies for a
given language comparison and outcome measure. These problems are
exacerbated by the universality of many phonetic factors so that, for
example, stop consonants predominate in adult input and in infant
output across the many languages that have been studied. Therefore,
hypotheses about potential differences (e.g., relatively high numbers of
affricates in Mandarin language input to the infant) concern very low
frequency events in infant speech. The combined effect of these
methodological problems is that it is typically impossible to determine if
the observed differences between language groups are greater than the
variation that might be observed within a language group (Vihman et
al., 1994).
One reason that the sample sizes in these studies are small is that the
speech sample analysis techniques are technically difficult, time-
consuming, and expensive. Automatic speech analysis tools, as a
substitute for phonetic transcription and other methods of hand coding
on a segment-by-segment basis, have the potential to improve the
efficiency of this work (Oller et al., 2010; VanDam & Silbert, 2016).
Acoustic analysis of infant speech has yielded some particularly
promising findings and can be conducted reliably (Rvachew, Creighton,
Feldman, & Sauve, 2002) but not always accurately, especially when F0
is high (Kelso, Tuller, Vatikiotis-Bateson, & Fowler, 1984). Unfortunately,
the most accurate forms of automatic speech analysis are not
sufficiently accurate with natural infant speech to avoid a considerable
amount of hand coding (Shadle, Nam, & Whalen, 2016). Therefore, the
accumulation of data to test the babbling drift hypothesis may continue
at a slow pace, especially if phonetic segments continue to be the
primary focus of investigation.
Recent research has attempted to sidestep the difficulty of working with
real data by testing hypotheses with simulations and robotic applications
(Howard & Messum, 2011; Moulin-Frier et al., 2014; Nam et al., 2013).
While theoretically interesting, the outcome of these studies can only be
validated in relation to reliable and replicable data recorded from live
infants, and therefore the shortage of these data remains a problem
even in the face of these considerable technological innovations.
Furthermore, observational and simulation studies suffer from a lack of
consensus about appropriate units of analysis for comparing infant to
adult speech. Typically, adult languages are differentiated by
phonological units such as the distribution of phonemes, and then it may
be assumed that infant speech will gradually approximate the ambient
language distribution of those units. However, there is little evidence
that prelinguistic infants attend to phonemes per se. Neither is it clear
that speech targets for the supervised learning process in early vocal
learning are phonemes. The issue of what the infant is attending to has
been addressed since the early days of cross-linguistic research when
Vihman et al. (1994) suggested that the most appropriate
characterization of ambient language input would be derived from the
distribution of target phonemes underlying the infant’s first words. More
recently, Masapollo, Polka, and Ménard (2016) demonstrated that infants
prefer to listen to vowels produced with infantlike pitch or formant
frequencies, in comparison to the same vowels with adultlike spectral
characteristics. Ongoing research is required to fully understand the
intersection of ambient language inputs, infant attentional preferences,
and their vocal output.