Speech Communication: S. Shahnawazuddin, Hemant K. Kathania, Abhishek Dey, Rohit Sinha
Keywords: Children's speech recognition; Pitch variation; Low-rank feature projection; PCA; HLDA; SGMM; DNN

Abstract

The work presented in this paper explores the issues in automatic speech recognition (ASR) of children's speech using acoustic models trained on adults' speech. In such contexts, due to a large acoustic mismatch between training and test data, highly degraded recognition rates are noted. Even with the use of vocal tract length normalization (VTLN), the mismatched-case recognition performance is still much below that for the matched case. Our earlier studies have shown that, for the commonly used mel-filterbank-based cepstral features, the acoustic mismatch is exacerbated by insufficient smoothing of pitch harmonics for child speakers. To address this problem, a structured low-rank projection of the feature vectors prior to learning the acoustic models as well as before decoding is proposed in this paper. To accomplish this, first a low-rank transform is learned on the training data (adults' speech). Any dimensionality reduction technique which depends on the variance of the training data may be used for this purpose. In this work, principal component analysis and heteroscedastic linear discriminant analysis have been explored for the same. When the derived low-rank projection is applied in the mismatched testing case, it alleviates the pitch-dependent mismatch. The proposed approach provides a relative recognition performance improvement of 35% over the VTLN-included baseline for the children's mismatched ASR employing acoustic modeling based on hidden Markov models (HMM) with observation densities modeled using Gaussian mixture models (GMM). In addition to that, other acoustic modeling approaches based on subspace GMM (SGMM) and deep neural networks (DNN) have also been explored. Projecting the data to a lower-dimensional subspace is found to be effective in those frameworks as well. In the case of the SGMM- and DNN-based systems, the proposed approach is noted to result in relative recognition performance improvements of 33% and 21%, respectively, over their corresponding baselines.
1. Introduction

The state-of-the-art automatic speech recognition (ASR) systems are developed by pooling speech data from a large number of speakers for learning the acoustic models. As a result, there always remains scope for improving the recognition performance for a particular speaker. The task of improving the performance of an ASR system with respect to any particular speaker or group of speakers is referred to as adaptation. A number of speaker adaptation techniques have been developed during the last few decades (Gauvain and Lee, 1994; Leggetter and Woodland, 1995; Digalakis et al., 1995; Hazen and Glass, 1997; Kuhn et al., 2000; Gales, 1999). In general, addressing any possible mismatch between train and test conditions forms an integral part of all state-of-the-art ASR systems.

In practice, most of the ASR systems are developed using data from adult (male and female) speakers. There do exist some applications such as information retrieval, reading tutors, language learning tools, voice-based search and entertainment (Hagen et al., 2003; Nisimura et al., 2004; Bell and Gustafson, 2007; Hagen et al., 2007; Schalkwyk et al., 2010; Gray et al., 2014) where the ASR system is accessed by both adult and child speakers. For such speech-based applications, one can employ two separately trained ASR systems (one for adults and the other for children) along with a reliable switching between the two. Such an approach is expected to result in good recognition performances for both groups of speakers. As publicly available children's speech data is scarce, it is difficult to create a well-trained ASR system for children. To mitigate the scarcity of children's speech data, training with a mixture of adults' and children's speech was proposed in the past (Potamianos et al., 1997; Gerosa et al., 2009a). Alternatively, one can explore ways of improving the recognition of children's speech on
∗ Corresponding author.
E-mail addresses: s.syed@iitg.ernet.in, s.syed@nitp.ac.in (S. Shahnawazuddin), hemant.ece@nitsikkim.ac.in (H.K. Kathania), abhishekdey@iitg.ernet.in (A. Dey),
rsinha@iitg.ernet.in (R. Sinha).
https://doi.org/10.1016/j.specom.2018.11.001
Received 17 October 2016; Received in revised form 20 October 2018; Accepted 2 November 2018
Available online 8 November 2018
0167-6393/© 2018 Elsevier B.V. All rights reserved.
S. Shahnawazuddin, H.K. Kathania and A. Dey et al. Speech Communication 105 (2018) 103–113
acoustic models trained on adults' speech. The latter approach is followed up in this work and is referred to as children's mismatched ASR. In the following, we present a brief review of the works on children's speech recognition, highlighting the challenges involved in such ASR tasks.

Earlier works on children's ASR have reported that the acoustic attributes of speech vary significantly across adult and child speakers (Potamianos and Narayanan, 2003; Ghai and Sinha, 2010). It is well known that, compared to the adults, the vocal organs of children are much smaller. This physiological difference accounts for the higher values of pitch as well as formants observed in the case of children's speech (Potamianos and Narayanan, 2003; Eguchi and Hirsh, 1969; Kent, 1976). During the growing phase, speech from child speakers undergoes considerable variation along with improvement in their ability to correctly pronounce complex words (Russell and D'Arcy, 2007). Further, the overall speaking rate is slower in the case of children and they exhibit higher variability in the speaking rate as well (Potamianos and Narayanan, 2003). Children's speech is reported to have greater values for the mean and variance of the acoustic correlates of speech than those for the adults' (Eguchi and Hirsh, 1969; Ghai and Sinha, 2010). As a result, children's speech suffers from a higher degree of inter- and intra-speaker acoustic variability than the adults' speech (Potamianos and Narayanan, 2003; Gerosa et al., 2007). From the linguistic perspective, children are more likely to use imaginative words, ungrammatical phrases and incorrect pronunciations, as discussed in (Gray et al., 2014).

The above-mentioned factors aggravate the acoustic mismatch between adults' and children's speech, making the children's mismatched ASR a challenging problem (Lee et al., 1999; Narayanan and Potamianos, 2002; Potamianos and Narayanan, 2003; Gerosa et al., 2009b; Shivakumar et al., 2014). To compensate for these sources of mismatch, a number of techniques have been suggested in the literature for children's speech recognition under mismatched conditions. Burnett and Fanty proposed a fast approach for compensating the formant scaling using a speaker-dependent warping of the frequency scale through offsets in the Bark-domain (Burnett and Fanty, 1996). Gustafson and Sjolander reported an improved recognition of children's speech on a publicly available fixed adults' speech trained ASR with an explicit reduction of the pitch of the signals (Gustafson and Sjolander, 2002). In addition, improved phone classification with pitch-dependent normalization (Singer and Sagayama, 1992) and speech reconstruction by pitch prediction using the mel-frequency cepstral coefficient (MFCC) (Davis and Mermelstein, 1980) features have also been reported (Shao and Milner, 2004). The effectiveness of vocal tract length normalization (VTLN) (Lee and Rose, 1998) for children's mismatched ASR is demonstrated in Potamianos et al. (1997); Ghai (2011); Serizel and Giuliani (2016). Lately, it has been shown that the standard mel-filterbank involved in the MFCC feature computation is not able to provide sufficient smoothing especially for high-pitched child speakers (Ghai and Sinha, 2009). Consequently, a significant mismatch in the variances of the higher-order MFCCs is noted for adults' and children's speech (Sinha and Ghai, 2009). To reduce the resulting mismatch, a simple binary-weighting (BW) of the features that essentially truncates some higher-order coefficients in the base MFCC features was explored in Ghai and Sinha (2009). Some recently reported works have explored DNN-based acoustic modeling for improving children's ASR (Serizel and Giuliani, 2014; Metallinou and Cheng, 2014; Liao et al., 2015; Serizel and Giuliani, 2016). The differences in the acoustic and the linguistic correlates of speech from adult and child speakers have been observed to affect the performance of speaker recognition tasks as well (Safavi et al., 2012; 2014).

Although the BW scheme is noted to be quite effective, it leads to the loss of information when a large number of higher-order coefficients in the MFCC features is truncated. To address the loss in information, a soft-weighting (SW)¹ of the features based on structured low-rank projection is explored in this work. The primary objective of the SW approach is to map the children's test MFCC features to the space of adults' speech. For this purpose, a projection matrix is learned on the base MFCC features corresponding to adults' speech data. This can be done by employing either the principal component analysis (PCA) (Jolliffe, 1986) or the heteroscedastic linear discriminant analysis (HLDA) (Kumar and Andreou, 1998). Considering the nature of mismatch being similar in all three streams of the feature vector, the same projection matrix is also applied to the delta and delta-delta cepstral coefficients. Thus, the employed feature projection matrix has a constrained block-diagonal structure. PCA/HLDA-derived feature projections in ASR modeling have previously been employed for speaker adaptation (Stemmer and Brugnara, 2006). But in that work, the employed projection was neither motivated by addressing pitch-mismatch nor did it have any constrained block-diagonal structure. We have experimentally verified that such a constrained projection is more effective than the unconstrained one for the children's mismatched ASR.

¹ Soft-weighting (SW) and low-rank feature projection have been used interchangeably in the rest of this paper.

The work presented in this paper consists of two distinct parts. The first part deals with mismatched ASR systems employing Gaussian mixture-based hidden Markov modeling (GMM-HMM).² The motivation for using low-rank projection is developed in the GMM paradigm, which also serves as the baseline. In the second part, we have explored the acoustic modeling frameworks based on subspace GMM (SGMM) (Povey et al., 2011a) and deep neural networks (DNN) (Dahl et al., 2012). Further, the effectiveness of low-rank feature projection for children's mismatched ASR is also demonstrated in the context of these current acoustic modeling techniques. The remainder of the paper is organized as follows: The low-rank feature projection technique to address the pitch-dependent mismatch is presented in Section 2. Next, the experimental setup used for the study is described in Section 3. The experimental studies validating the effectiveness of low-rank feature projection in the context of the GMM system are presented in Section 4. The studies on SGMM- and DNN-based ASR systems are presented in Section 5. Finally, the paper is concluded in Section 6.

² In the remainder of the work, the terms GMM, SGMM and DNN refer to the HMM framework with observation densities modeled using GMM, SGMM and DNN, respectively.

2. Structured low-rank projection of features for pitch mismatch reduction

2.1. Motivation

The MFCC feature computation involves low-time liftering of the mel-cepstra. As a result, it is often believed that the MFCC features would remain unaffected by the pitch of the speech signals. On the contrary, our earlier studies (Ghai and Sinha, 2009; Ghai, 2011) have shown that the MFCC features for higher pitch (child) speakers get significantly affected by pitch-dependent distortions in contrast to those for lower pitch (adult) speakers. It has been argued in those works that, due to the narrow (approx. 100 Hz) bandwidth of the lower-order filters in the triangular mel-filterbank, the pitch harmonics are not well smoothed out while analyzing speech signals with higher pitch (more than 200 Hz). Consequently, the pitch-dependent distortions appear especially in the lower frequency region of the spectral envelope for children's speech. On account of these non-harmonic pitch-dependent distortions in the mel-spectra for the children's speech, the resulting MFCC features exhibit enhanced variances for the higher-order coefficients in contrast to those for the adults' speech.
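This insufficient-smoothing argument can be checked with simple arithmetic: when the triangular filters are only about 100 Hz wide at the low end, a 300 Hz pitch leaves several low-order filters with no harmonic at all, whereas a 100 Hz pitch places at least one harmonic in every filter. The following numpy sketch counts harmonics per filter (the 21-filter, 0–4 kHz layout mirrors the setup described later in Section 3.2; the idealized line-spectrum view of voiced speech is our simplifying assumption):

```python
import numpy as np

def mel(f):
    """Hz -> mel."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def imel(m):
    """mel -> Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(n_filt=21, f_lo=0.0, f_hi=4000.0):
    """(lower, centre, upper) frequencies in Hz of triangular filters
    placed uniformly on the mel scale."""
    pts = imel(np.linspace(mel(f_lo), mel(f_hi), n_filt + 2))
    return [(pts[i], pts[i + 1], pts[i + 2]) for i in range(n_filt)]

def harmonics_per_filter(f0, edges):
    """Number of pitch harmonics k*f0 falling inside each filter's support."""
    counts = []
    for lo, _, hi in edges:
        k_first = int(np.ceil(max(lo, f0) / f0))   # first harmonic >= lo (and >= f0)
        k_last = int(np.floor(hi / f0))            # last harmonic <= hi
        counts.append(max(0, k_last - k_first + 1))
    return counts

edges = mel_filter_edges()
low = harmonics_per_filter(100.0, edges)    # adult-like pitch
high = harmonics_per_filter(300.0, edges)   # child-like pitch
# Every filter sees at least one harmonic at F0 = 100 Hz, but several
# low-order filters see none at F0 = 300 Hz -- their outputs then track
# the harmonic structure rather than a smoothed spectral envelope.
```

With these illustrative parameters, the filters below roughly 300 Hz capture no harmonic at all for the higher pitch, which is exactly the regime in which the pitch-dependent ripple survives into the mel-spectrum.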
Fig. 1. (a) Smoothed spectra derived from the base MFCC features for three pitch values (F0 = 100 Hz, 200 Hz and 300 Hz); the plots for F0 = 200 Hz and F0 = 300 Hz include intentional level shifts of −2 dB and −4 dB, respectively, for separating them. For deriving these spectra, the base MFCC features (C0–C12) are transformed back to the frequency domain using a 100-point IDCT. (b) Variance plots for the base MFCC features (C1–C12) for vowel /IY/ corresponding to two broad pitch (F0) ranges. For this analysis, the feature vectors for nearly 2000 speech frames corresponding to the central portion of the vowel are used. For the chosen F0 ranges, the mismatch in the variances of the higher-order coefficients is evident. Both these analyses are performed on data extracted from the TIMIT corpus (Fisher et al., 1986). It is to note that the feature vectors employed in these analyses have been normalized using cepstral mean and variance normalization (CMVN) but not VTLN.
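The smoothed spectra in Fig. 1(a) are obtained by mapping the truncated cepstra back onto the frequency axis. A minimal numpy sketch of such a 100-point inverse-DCT evaluation (the DCT-III-style convention below is one common choice and may differ in detail from the one used for the figure):

```python
import numpy as np

def cepstra_to_envelope(c, n_points=100):
    """Evaluate a smoothed log-spectral envelope from truncated cepstra
    (e.g. C0..C12) via an n_points-long inverse DCT (DCT-III style)."""
    n = np.arange(n_points)
    env = np.zeros(n_points)
    for k, ck in enumerate(c):
        scale = 1.0 if k == 0 else 2.0   # C0 contributes the mean level
        env += scale * ck * np.cos(np.pi * k * (n + 0.5) / n_points)
    return env

flat = cepstra_to_envelope([3.0] + [0.0] * 12)       # only C0 -> constant envelope
tilt = cepstra_to_envelope([0.0, 1.0] + [0.0] * 11)  # only C1 -> spectral tilt
```

Because only 13 low-order cepstral terms are kept, the reconstructed curve is inherently smooth; any pitch-dependent ripple visible in Fig. 1(a) therefore survives the liftering rather than being introduced by it.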
Fig. 2. Smoothed spectral plots for two different values of F0 (210 Hz and 320 Hz) corresponding to the central frame of vowel /IY/ extracted from the PF-STAR children's speech database.

A rank-4 truncation of the D-dimensional base features, for instance, corresponds to the projection matrix

    P_{4×D} = | 1 0 0 0 0 0 … 0 |
              | 0 1 0 0 0 0 … 0 |          (1)
              | 0 0 1 0 0 0 … 0 |
              | 0 0 0 1 0 0 … 0 |

Since the coefficients in the velocity and acceleration streams are also found to exhibit a similar nature, the same transform is applied to those coefficients as well. The structure of the constrained projection matrix P̃ applied to the 3D-dimensional test speech feature vector is given as

    P̃ = | P_{K×D}     0          0       |
         |    0       P_{K×D}     0       |          (2)
         |    0          0       P_{K×D}  |
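The block-diagonal arrangement in (2) is straightforward to build and apply. A numpy sketch, where a simple rank-K truncation stands in for the PCA/HLDA-derived projection (D = 13 and K = 8 are illustrative values, not the paper's prescribed setting):

```python
import numpy as np

D, K = 13, 8                     # base feature dimension and projection rank
# Stand-in for a learned K x D projection; in the paper this matrix would
# be derived via PCA or HLDA on the adults' training features.
P = np.eye(K, D)

# Arrange P block-diagonally so the same transform acts on the static,
# delta and delta-delta streams, as in eq. (2).
P_tilde = np.zeros((3 * K, 3 * D))
for s in range(3):
    P_tilde[s * K:(s + 1) * K, s * D:(s + 1) * D] = P

o = np.arange(3 * D, dtype=float)   # a toy 39-dimensional feature vector
o_proj = P_tilde @ o                # projected (3K = 24)-dimensional vector
```

The constraint is simply that the off-diagonal blocks of P̃ are zero, so no stream can leak into another; only the rank K and the choice of P vary across the experiments.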
[…] information in the higher-order coefficients in all three streams of the acoustic feature vector when those are dropped. Alternatively, one can learn the variation of the lower pitch (training) data and apply the same to the higher pitch (children's test) data. This scheme may be able to avoid the complete loss of information while suppressing the pitch-dependent mismatch. A low-rank projection capturing the principal dimensions of acoustic variations represented by the adults' speech training data is expected to suppress the higher-order coefficients. Thus, dimensionality reduction techniques can be employed to learn such projections. In this work, we have explored both PCA and HLDA for this purpose. Unlike the BW case, such projection matrices do not have a purely diagonal structure and thus amount to a soft-weighting of the features and the acoustic models.

In PCA, the original data is projected to a low-dimensional subspace capturing the maximum variability in the data, thus preserving most of the information. This is done by the eigen-decomposition of the covariance or correlation matrix constructed from the given data. The eigen-decomposition of the covariance/correlation matrix results in a set of D-dimensional eigenvectors {e_d}_{d=1}^{D} and their corresponding eigenvalues {β_d}. The K eigenvectors corresponding to the highest eigenvalues are selected and form the bases of the desired lower-dimensional subspace, P_{K×D}. The P_{K×D} is then arranged to form a block-diagonal projection matrix as shown in (2) to derive P̃.

Unlike the PCA, the low-rank feature projection matrix in the HLDA is learned in a supervised manner. Given the labeled training data and the SI model, an ML approach minimizing the ratio of within-class and between-class scatter is used to derive the HLDA-based projection matrix. In order to derive P̃ in the structure given in (2), a separate SI system is developed using the D-dimensional base MFCC features only. A projection matrix P_{K×D} is then learned using the system and the labeled adults' training data. Since the projection matrix in the PCA case is not learned in the ML sense, it is not expected to be as effective as that in the HLDA case.

In order to depict the degree of suppression pictorially, a 13 × 13 identity matrix I is used to study the effect of the projections on a particular dimension of the feature vector. Each column of I represents one of the 13 dimensions in the feature vector being active while the remaining 12 dimensions are deactivated. The projection matrix P_{K×D} is then multiplied with I. For each of the columns in the transformed matrix, the ℓ2-norm is computed to determine the resulting energy. The dimension-wise resulting energies for the three kinds of projections (BW, SW-PCA and SW-HLDA) are shown in Fig. 3. These distributions are plotted for the cases when K is chosen as 12, 8 and 4. As both the variants of PCA are observed to have a similar effect, the energy distribution for only the correlation-matrix-based PCA case is plotted here. With reducing rank of the projection matrices, an increased suppression of the higher-order coefficients is noticeable for all three cases. But unlike the BW case, some information in the higher-order coefficients would be retained in the SW cases, which will lead to an enhanced recognition performance.

Fig. 3. Feature dimension-wise energy distributions obtained by multiplying the projection matrices with a 13 × 13 identity matrix. The resulting energies are shown for the three different kinds of projection matrices (BW, SW-PCA and SW-HLDA) with varying rank. In these plots, the x-axis denotes the feature coefficient index while the y-axis represents the magnitude of the resulting energy. These plots highlight the nature and degree of suppression of the variance of the feature vector possible in each of the cases.

3. Experimental setup and baseline performance

3.1. Speech corpora

The speech data used for the model training and testing is obtained from two different British English corpora, viz. the WSJCAM0 adults' speech corpus (Robinson et al., 1995) and the PF-STAR children's speech corpus (Batliner et al., 2005). For simulating a telephone-based speech interface to the ASR system, all speech used in this study is re-sampled at 8 kHz rate. It is known that the bandwidth reduction due to down-sampling of the data affects the children's ASR more than the adults' ASR (Russell et al., 2007). Therefore, the main inferences are revalidated on wideband speech data in Section 5.5. More details about the speech corpora and datasets used in the experiments are summarized in Table 1.

Table 1
Specifications of the speech corpora used for the experimental evaluation presented in this work.

Corpus                 WSJCAM0                        PF-STAR
Language               British English                British English
Data set               Adult-Tr    Adult-Ts           Child-Tr    Child-Ts
Purpose                Training    Testing            Training    Testing
Speaker group          Adult       Adult              Child       Child
No. of speakers        92          20                 122         60
(male & female)
Age group              > 18 years  > 18 years         4–13 years  4–13 years
No. of words           132,778     5,608              46,974      5,067
Duration (hrs.)        15.5        0.6                8.3         1.1
Recording conditions   Quiet room, close-talking      Closed room, head-
                       and desk microphones           mounted microphones

3.2. Specifications of GMM-based ASR system

The ASR systems employed in this work for experimental evaluations are developed using the Kaldi speech recognition toolkit (Povey et al., 2011b). For speech data analysis, a Hamming window of length 25 ms with a frame rate of 100 Hz is used. The speech data is pre-emphasized using a factor of 0.97. Employing a 21-channel mel-filterbank, 12-dimensional (C1–C12) base MFCCs are first computed for each of the frames. The log-energy of the frame is added as the zeroth coefficient, making the base feature dimension equal to 13 (C0–C12). Next, time-splicing of the 13-dimensional base MFCC features considering a context size of 9 frames (i.e., ±4) is performed. This is followed by feature dimensionality reduction to 39 through linear discriminant analysis (LDA). Further de-correlation of the resulting features is performed using the maximum likelihood linear transform (MLLT). Cepstral mean subtraction followed by cepstral variance normalization is also applied on the MFCC feature vectors.

The GMM-based SI system is developed using cross-word triphone acoustic modeling along with decision-tree-based state tying. Each triphone model consists of a 3-state HMM with 8 diagonal-covariance Gaussian components per state. The silence and short-pause are modeled using a 3-state HMM with 16 diagonal-covariance Gaussian components.
Table 2
Performances for adults', children's and mixed adults'-and-children's speech trained SI systems under acoustically matched and mismatched test conditions.

Speech data used for      WER (in %)
the SI system training    Adult-Ts      Child-Ts
                          Baseline      Baseline    VTLN
Adult-Tr                  12.15         62.55       35.06
Child-Tr                  70.00         22.17       18.92
Adult-Tr + Child-Tr       20.44         32.38       21.89

Fig. 4. The WER-profiles with varying ranks of the BW and the SW-based feature projections for children's mismatched ASR. All results were obtained after applying VTLN on test data.

The maximum number of tied states (senones) is fixed at 2000. In this work, the SI systems are developed using the Adult-Tr adult training set. Testing is done on the Adult-Ts and Child-Ts sets for the matched and mismatched cases, respectively. Word error rate (WER) is used as the measure of recognition performance.

Decoding of the Adult-Ts set is done using the standard MIT-Lincoln 5k Wall Street Journal bigram language model (LM). The used LM has a perplexity of 95.3 for the Adult-Ts set with no out-of-vocabulary (OOV) words. In the matched case, a lexicon of 5,850 words including the pronunciation variations is used. There are significant differences in the word-list and word frequencies across the adults' and children's datasets. As a result, the use of the MIT-Lincoln LM in decoding the Child-Ts set did not result in a meaningful recognition performance (WER = 110.97%). To prevent the linguistic mismatch from affecting the results in this study, a domain-specific bigram LM is employed in decoding Child-Ts. This bigram LM is trained on the transcripts of speech data in PF-STAR excluding Child-Ts, i.e., on the training transcripts only. The employed domain-specific LM has an OOV rate of 1.20% and a perplexity of 95.8 for the Child-Ts set. A lexicon of 1,969 words including pronunciation variations is used.

3.3. Baseline GMM system performance

The WERs for the adults' speech trained SI system under the matched (Adult-Ts) and mismatched (Child-Ts) test cases are given in Table 2. The severity of the problem targeted in this study can be assessed by noting the large difference in WERs for the matched and mismatched cases, despite the use of domain-specific LMs. For better contrast, separate ASR systems are trained on the children's training set (Child-Tr) as well as on a mix of Adult-Tr and Child-Tr. The WERs for the Adult-Ts and Child-Ts test sets on the trained ASR systems are also given in Table 2. Developing an ASR system using children's speech helps improve the recognition performance for child speakers. At the same time, on decoding adults' speech using acoustic models trained on children's speech, a severely degraded recognition performance is obtained, as evident from Table 2. On the other hand, pooling data from both groups of speakers results in performances that are comparatively poorer than their corresponding matched-case baselines.

For improving the children's mismatched ASR, the first approach we have explored is the VTLN, which is reported to be highly effective under acoustically mismatched test conditions. To implement VTLN, a set of frequency warp factors {α_i}_{i=1}^{13} is employed, varying from 0.88 to 1.12 in steps of 0.02. A grid search is performed by aligning differently warped feature sets against the SI model under the constraints of the first-pass hypothesis. This hypothesis is generated by decoding the non-warped feature set on the SI system. For each utterance, the value of α_i resulting in the highest likelihood is noted and employed during second-pass decoding. The large reductions in WERs possible with the VTLN in the case of the different ASR systems are also reported in Table 2. In this work, VTLN warping was always applied on test data only. It is worth highlighting that despite applying the variance normalization to the features as well as VTLN, WERs for the Child-Ts set remain degraded with respect to those of Adult-Ts.

4. Evaluating the effectiveness of SW in GMM framework

In our initial work on SW for children's mismatched ASR, reported in (Shahnawazuddin et al., 2015), structured low-rank feature projection was compared with the BW approach. In that work, for evaluating the effectiveness of low-rank projections, the rank of P̃ was varied from 36 to 12 in steps of 3. Also, for incorporating VTLN, the projections were applied to the warped feature vectors o_r^{α_i}, with α_i being the optimal warp factor. The transformed feature vectors õ_r^{α_i} were then decoded on the correspondingly transformed acoustic models. For ease of comparison, the WER-profiles for the three kinds of projections (BW, SW-PCA and SW-HLDA) reported in (Shahnawazuddin et al., 2015) are reproduced here in Fig. 4. As evident from the shown WER-profiles, projecting the feature vectors to a lower-dimensional subspace results in huge reductions in error rate. At the same time, the proposed SW-based approaches outperform the BW one, with SW-HLDA being the better of the two.

The experimental evaluations presented in (Shahnawazuddin et al., 2015) were performed on ASR systems employing GMM-based modeling. During the last few years, newer acoustic modeling techniques have been developed and have overtaken the GMM-based approach. Therefore, in this paper, we have extended the preliminary studies reported in (Shahnawazuddin et al., 2015) to state-of-the-art setups employing SGMM- and DNN-based acoustic modeling. It is to note that the model architecture in the SGMM- and DNN-based modeling approaches is not the same as in the GMM-based system. In other words, the observation probabilities for the HMM states are not generated using Gaussian means, mixture weights and diagonal covariance matrices. In the case of SGMM, the covariance matrices are full instead of being diagonal. Moreover, a set of low-dimensional state projection vectors and linear subspace projection matrices are used to map the parameters of the universal background model to derive the parameters of the HMM states. In the case of DNN, on the other hand, the model parameters consist of network weights and the bias vectors. Therefore, the structured low-rank feature projection described in Section 2.3 cannot be used as it is.

An alternate means to implement the SW is to project the speech features to a lower-dimensional subspace before learning the acoustic models. For this purpose, two different schemes are proposed in this paper as shown in Fig. 5. In Scheme-I, the feature extraction approach is identical to that available in the Kaldi toolkit except for the use of varying-rank LDA projections to implement SW. In Scheme-II, the base features are projected to a lower-dimensional subspace prior to time-splicing. In both cases, the acoustic models are trained using the feature vectors that have been projected to a lower-dimensional subspace. In the
following subsections, we first study the efficacy of both the schemes in the GMM framework. This is followed by similar studies performed in the SGMM and DNN frameworks.

4.1. Low-rank projection using Scheme-I

In order to implement Scheme-I SW, the rank of the LDA projection matrix applied to the time-spliced features is varied as stated earlier. Cepstral mean and variance normalization (CMVN) is applied on the resulting feature vectors. Both training and test data feature vectors are processed through these steps. The normalized lower-dimensional feature vectors are then used for training the GMM parameters. The effect of projecting the feature vectors to a lower-dimensional subspace is demonstrated using the WER-profile shown in Fig. 6. To include VTLN with Scheme-I SW, the frequency warped feature vectors for the test data are projected to a lower-dimensional subspace before decoding. The WER-profile for this study is also shown in Fig. 6. The baseline performances for these two cases are 62.55% and 35.06%, respectively, as given in Table 2. Hence, it is evident from the shown WER-profiles that projecting the data to a lower-dimensional subspace significantly improves the recognition rates for children's speech.

Fig. 6. The WER-profiles illustrating the effect of low-rank feature projection (Scheme-I) on children's mismatched ASR employing GMM-based acoustic modeling. These profiles also demonstrate the effect of including VTLN along with low-rank feature projection.

4.2. Low-rank projection using Scheme-II

In the second approach of low-rank projection shown in Fig. 5 (Scheme-II), the base feature vectors are first projected to a lower-dimensional subspace prior to time-splicing. Either HLDA or PCA can be used for projecting the base feature vectors as discussed earlier in Section 2.3. In this paper, we present the results obtained by using HLDA only. The final feature dimension derived using time-splicing followed by LDA+MLLT is kept fixed at 39. Subsequently, the features are normalized using CMVN and VTLN.

Fig. 5. Feature extraction pipeline for Scheme-II: the 13-dimensional base features are projected using HLDA/PCA to lower-dimensional base features, which are then time-spliced and reduced through a fixed-rank LDA + MLLT to the 39-dimensional output features.

The effectiveness of Scheme-II SW is illustrated using the WER-profile shown in Fig. 7. Since the base feature vectors are projected prior to time-splicing […] the case of Scheme-I, the frequency warped feature vectors for the test data are projected to a lower-dimensional subspace before decoding to include VTLN. The WER-profile for this study is also shown in Fig. 7. It is evident from the WER-profiles shown in Fig. 6 and Fig. 7 that Scheme-II of SW is better than Scheme-I.

5. Evaluating SW in SGMM and DNN frameworks

Even though SGMM/DNN are reported to be better than GMM, these two approaches are also data dependent. Therefore, a mismatch in the variance of the training and test data feature vectors, as highlighted earlier, is bound to affect ASR systems employing SGMM/DNN-based acoustic modeling as well. In this section, the effectiveness of the proposed low-rank feature projection is validated in the context of modeling approaches based on the SGMM and DNN frameworks.

5.1. Specifications of SGMM and DNN systems

For developing the SGMM-based system, 400 Gaussians are used in the universal background model. The numbers of leaves and Gaussians in the SGMM are selected as 9000 and 7000, respectively. In the case of the DNN-based system, the hidden layers include tanh nonlinearity. The number of hidden layers is varied from 2 to 8 and finally fixed at 8. There are 1024 nodes in each hidden layer. The output layer uses the soft-max function. An initial learning rate of 0.015 is selected, which is reduced to 0.002 in 20 epochs. An extra 10 epochs are employed after reducing the learning rate to 0.002. The minibatch size for neural net training is selected as 512.

Generally, in order to improve the recognition performance by suppressing speaker-dependent variability, normalization through feature-space maximum-likelihood linear regression (fMLLR) is performed in the case of DNN, as suggested in (Rath et al., 2013). In order to include fMLLR in this study, a revised GMM system is developed using the fMLLR-transformed features employing speaker adaptive training (SAT). The decision tree and the state alignments required as the supervision in the SGMM/DNN training are obtained from the revised system (SAT-GMM system). The fMLLR-transformed features are also employed in the case of the SGMM-based ASR systems. Furthermore, the fMLLR-normalized features are further spliced in time considering a context size of 9 frames and the resulting 351-dimensional features are used for DNN training. The SAT-GMM-, SGMM- and DNN-based ASR systems use the same LMs and lexicons in decoding, as discussed earlier.

5.2. Baseline performances for SGMM and DNN systems

The baseline WERs for the two earlier mentioned test sets (see Table 1) with respect to the SGMM and DNN systems are given in Table 3. It is to note that the reported performances are obtained after applying the CMVN, LDA, MLLT and fMLLR transformations to the training as well as test feature vectors. For proper contrast, baseline WERs for the SAT-GMM system employing fMLLR-based feature normalization are also tabulated.
to time-splicing, the x-axis is indexed from 5 to 13 in steps of 1. As in found to be superior to the SAT-GMM case as expected. Further, all the
S. Shahnawazuddin, H.K. Kathania and A. Dey et al. Speech Communication 105 (2018) 103–113
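The Scheme-I pipeline described in Section 4.1 (time-splicing of the base features, a variable-rank projection learned on the adult training data, then CMVN) can be sketched in a few lines of numpy. The sketch below substitutes an unsupervised, variance-based PCA for the class-supervised LDA+MLLT estimated in the paper (which requires frame-level state alignments), so the projection, the dimensions and the random stand-in features are illustrative only.

```python
import numpy as np

def splice(feats, context=4):
    """Stack each frame with its +/-context neighbours (edge frames repeat)."""
    T, d = feats.shape
    idx = np.clip(np.arange(-context, context + 1)[None, :]
                  + np.arange(T)[:, None], 0, T - 1)
    return feats[idx].reshape(T, (2 * context + 1) * d)

def fit_lowrank(train, rank):
    """Variance-based low-rank transform (PCA stand-in for LDA+MLLT),
    learned on the adult training features only."""
    mu = train.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(train - mu, rowvar=False))
    P = vecs[:, ::-1][:, :rank]        # keep the top-`rank` variance directions
    return mu, P

def cmvn(feats):
    """Per-utterance cepstral mean and variance normalization."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)

rng = np.random.default_rng(0)
adult = rng.standard_normal((500, 13))         # stand-in 13-dim base MFCCs
child = rng.standard_normal((120, 13))

mu, P = fit_lowrank(splice(adult), rank=24)    # e.g. 117 -> 24 dimensions
train_feats = cmvn((splice(adult) - mu) @ P)   # used to train the acoustic model
test_feats = cmvn((splice(child) - mu) @ P)    # same transform applied before decoding
```

The key point mirrored here is that the transform is learned on adults' speech only and then applied unchanged to the children's test features, which is what suppresses the variance directions carrying the pitch-dependent mismatch.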
[Figure, two panels: WER (in %) vs. base feature dimension after projection (5-13); caption not recovered in the extraction.]

Table 3 (caption fragment): "...cally matched and mismatched test conditions with different acoustic modeling approaches. All the reported performances include LDA, MLLT and fMLLR transformations being applied to the features." [Only the header fragment "Speech data | Acoustic | WER (in %)" was recovered; the table body is not present in the extraction.]

Table 4
The WERs for the children's test with respect to adult-data-trained ASR systems employing different kinds of acoustic models with and without the fMLLR/VTLN.

Speech data used for   Acoustic model   WER (in %)
SI system training     kind             No norm.   fMLLR   VTLN
Adult-Tr               GMM              62.55      43.53   35.06
                       SGMM             54.43      31.96   26.12
                       DNN              43.32      24.25   23.57

Fig. 8. The WER-profiles with variations in the rank of the feature projection matrix (Scheme-I) for the three acoustic modeling approaches, namely GMM, SGMM and DNN. Also shown are the effects of normalizing the time-spliced base MFCCs using either the fMLLR or the VTLN. [Panels show WER (in %) vs. feature dimension after projection (12-39).]
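The DNN configuration specified in Section 5.1 (eight 1024-node tanh hidden layers feeding a soft-max output) can be sketched as a plain numpy forward pass. The randomly initialized weights, the 351-dimensional input (39-dim features spliced over 9 frames) and the 3000-class output are placeholders; the actual systems were trained with the learning-rate schedule given in Section 5.1, not by this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
# input, 8 hidden layers of 1024 nodes, senone outputs (3000 is illustrative)
dims = [351] + [1024] * 8 + [3000]
layers = [(rng.standard_normal((i, o)) * np.sqrt(1.0 / i), np.zeros(o))
          for i, o in zip(dims[:-1], dims[1:])]

def forward(x):
    """tanh hidden layers followed by a soft-max over senone posteriors."""
    for W, b in layers[:-1]:
        x = np.tanh(x @ W + b)
    W, b = layers[-1]
    logits = x @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable soft-max
    return e / e.sum(axis=-1, keepdims=True)

post = forward(rng.standard_normal((16, 351)))   # a minibatch of 16 frames
```

During training, the paper's schedule starts the learning rate at 0.015, decays it to 0.002 over 20 epochs, and then runs 10 further epochs at 0.002 with minibatches of 512 frames.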
systems exhibit a large degradation in the mismatched testing case, similar to that noted for the GMM system. One of the reasons for the observed degradation is pitch-induced spectral distortion, as argued earlier.

As demonstrated earlier, the inclusion of VTLN results in a huge reduction in WER in the case of children's mismatched ASR. Therefore, we have also studied the effect of VTLN-based frequency warping in the context of the SGMM- and DNN-based systems. To this end, a set of models was trained without performing fMLLR-based feature normalization while VTLN was applied during testing. The WERs for this study are listed in Table 4. To gauge the relative impact of fMLLR and VTLN, the WERs obtained by fMLLR-based feature normalization are also given in Table 4. From the tabulated WERs, it is clear that both fMLLR and VTLN are very effective in the context of children's mismatched ASR. In the case of the GMM and SGMM systems, the WERs obtained by using VTLN are significantly better than those obtained by employing fMLLR-based feature normalization. At the same time, both fMLLR and VTLN result in very similar WERs in the case of the DNN-based ASR system.

5.3. Low-rank projection using Scheme-I

The effect of the low-rank feature projection using Scheme-I in the case of children's mismatched ASR can be assessed from the WER-profiles shown in Fig. 8. These profiles are shown with and without fMLLR being included in training as well as testing. As noted earlier, fMLLR-based feature normalization is found to result in additional reductions in the WERs. The WER-profiles for the case when VTLN is employed instead of fMLLR are also shown for all three modeling cases. In the case of VTLN, warp factors in the same range as described earlier are used. This is for assessing the relative advantage of VTLN/fMLLR for children's mismatched ASR. As in the case of fMLLR, significant and quite similar improvements in WER are obtained with the use of VTLN as well. For the DNN systems, both fMLLR and VTLN result in almost similar WERs after low-rank projection. To highlight the relative reduction in WERs for the different modeling approaches, the best-case performances obtained with low-rank projection are given in Table 5. It can be noted that significant gains in recognition performance are achieved with the inclusion of low-rank projections in all the explored combinations.

5.3.1. Effect of varying the number of hidden layers in DNN

In the studies presented earlier in Section 5.3, VTLN was observed to be quite effective. On the contrary, VTLN was found to be largely ineffective in the case of DNN systems with a large number of hidden layers (7+), as reported in Seide et al. (2011). According to that study, VTLN is found to be effective in the case of shallow networks only. The experi-
Table 5
Percentage relative improvement (PRI) in the recognition performances obtained through the use of low-rank feature projection (Scheme-I). The WERs are reported for the three acoustic modeling approaches explored. Also shown are the WERs for the cases when fMLLR/VTLN is applied for feature normalization.

Acoustic   No normalization         VTLN                     fMLLR
modeling   Base.   SW      PRI      Base.   SW      PRI      Base.   SW      PRI
GMM        62.55   45.59   27.11    35.06   26.04   25.73    43.53   31.96   26.56
SGMM       54.43   41.01   24.66    26.12   23.06   11.72    31.89   25.33   20.57
DNN        43.32   39.07   9.81     23.57   21.40   9.21     24.25   21.44   11.58
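The percentage relative improvement (PRI) figures in Table 5 are simply the relative WER reduction of the SW system over the corresponding baseline. A quick check against the tabulated GMM entries:

```python
def pri(base_wer, sw_wer):
    """Percentage relative improvement of the SW system over the baseline WER."""
    return 100.0 * (base_wer - sw_wer) / base_wer

# GMM, no normalization: 62.55% -> 45.59%
print(round(pri(62.55, 45.59), 2))   # 27.11
# GMM with VTLN: 35.06% -> 26.04%
print(round(pri(35.06, 26.04), 2))   # 25.73
```

A few printed PRIs differ from the recomputed value in the last digit (e.g. the GMM fMLLR entry), presumably because the underlying WERs are themselves rounded to two decimals.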
Fig. 9. The WER-profiles with variations in the rank of the feature projection matrix (Scheme-I) with respect to the DNN systems involving 2, 5 and 8 hidden layers, respectively. Also shown are the effects of normalizing the time-spliced base MFCCs using the fMLLR/VTLN. [Panels: no normalization, fMLLR and VTLN; WER (in %) vs. feature dimension after projection (12-39).]

Fig. 10. The WER-profile with variations in the rank of the feature projection matrix (Scheme-I) for the PLPCC and MFCC features in the context of the DNN system. The WERs are obtained with VTLN-based warping of the test data. [WER (in %) vs. feature dimension after projection (12-39).]

ments reported in that work are for the matched-case testing only. Thus, it would be worth confirming whether a similar trend holds in the case of the mismatched testing explored in this work or not. Consequently, we varied the number of hidden layers for the mismatched testing, going from shallow to deep networks. At the same time, the low-rank projection of the data was also explored in each of the cases.

The effect of projecting the data to a lower-dimensional subspace on the three different-complexity DNN systems is shown in Fig. 9. The WER-profiles are shown for the cases when the number of hidden layers is 2, 5 and 8 only, for legibility. Also shown are the WER-profiles when the acoustic features are normalized using fMLLR as well as VTLN. It is interesting to note that the effect of varying the number of hidden layers is much more pronounced when no feature normalization is included. Moreover, the effectiveness of VTLN remains intact in all the studied cases. The WERs obtained through the use of fMLLR and VTLN turn out to be quite similar in all the cases. In addition to that, the low-rank feature projection is observed to result in further reductions in WERs for the studied cases.

5.3.2. Exploring PLPCC features in DNN framework

In this subsection, we explore the use of perceptual linear prediction cepstral coefficient (PLPCC) (Hermansky, 1990) based acoustic features along with low-rank feature projection in the context of the DNN system. The 13-dimensional base PLPCC features are derived using 12th-order LP analysis and a 21-channel triangular mel-filterbank. The LP order chosen for PLPCC feature extraction is consistent with that reported in the literature. The base PLPCC features are then spliced in time and processed through LDA, MLLT, and CMVN as explained earlier. The WER-profile for this study is shown in Fig. 10. It is evident from the presented WERs that the use of low-rank projection is similarly effective for the PLPCC features as noted for the MFCC in the context of the DNN-based system. The plotted WER-profiles are for the case when VTLN is included. Similar trends are noted even with fMLLR-based feature normalization.

5.4. Low-rank projection using Scheme-II

The WER-profiles depicting the effect of employing Scheme-II for low-rank feature projection with variations in the dimensionality of the base feature vector are shown in Fig. 11. The WERs are given with respect to the GMM, SGMM and DNN systems with the inclusion of fMLLR-based normalization. Similar WERs are obtained when VTLN is used in place of fMLLR. The percentage relative improvements in recognition performance with the inclusion of the explored schemes for low-rank projection are given in Table 6. The reductions in WERs with the use of Scheme-II are significantly better than those obtained using Scheme-I. It is to be noted that, in the structure developed in Section 2.3, the dynamic features are computed after low-rank projection of the base feature vectors. Similarly, in the case of Scheme-II of SW, time-splicing, LDA, MLLT and fMLLR are performed only after projecting the base features to a lower-dimensional subspace. Hence, the observed gains are larger when Scheme-II is employed.

5.5. Experiments on wideband speech data

For proper contrast, the effectiveness of low-rank feature projection was also explored on ASR systems trained using wideband speech (sampled at a 16 kHz rate). The ASR systems developed for this study had
Fig. 11. The WER-profiles with variations in the rank of the feature projection matrix using Scheme-II in the context of the GMM, SGMM and DNN systems. The WERs include fMLLR-based normalization of the training and test data. [WER (in %) vs. base feature dimension after projection (5-13).]

Fig. 12. The WER-profiles with variations in the rank of the feature projection matrix using Scheme-II in the context of the GMM, SGMM and DNN systems trained on wideband speech data from adult speakers. The WERs are obtained with fMLLR-based normalization of the training and test data. [WER (in %) vs. base feature dimension after projection (5-13).]
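In Scheme-II the rank reduction is applied to the 13-dimensional base features themselves, before the dynamic coefficients and time-splicing, which is the ordering the text above credits for the larger gains. A numpy sketch of that ordering follows; PCA again stands in for the HLDA actually used for the reported results, and the regression-based delta below is a generic formulation rather than the exact recipe of the paper's toolkit.

```python
import numpy as np

def deltas(feats, window=2):
    """Generic regression-based dynamic (delta) coefficients."""
    T, d = feats.shape
    num = np.zeros_like(feats)
    for k in range(1, window + 1):
        fwd = feats[np.minimum(np.arange(T) + k, T - 1)]  # repeat edge frames
        bwd = feats[np.maximum(np.arange(T) - k, 0)]
        num += k * (fwd - bwd)
    return num / (2 * sum(k * k for k in range(1, window + 1)))

def project_base(feats, rank):
    """Low-rank projection of the base features (PCA stand-in for HLDA)."""
    mu = feats.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(feats - mu, rowvar=False))
    return (feats - mu) @ vecs[:, ::-1][:, :rank]

rng = np.random.default_rng(2)
base = rng.standard_normal((300, 13))   # stand-in base MFCC/PLPCC stream
low = project_base(base, rank=9)        # 13 -> 9, within the x-axis range of Fig. 11
d1 = deltas(low)                        # dynamic features computed AFTER projection
d2 = deltas(d1)
full = np.hstack([low, d1, d2])         # 27-dim static+delta+delta-delta stream
```

Time-splicing, LDA, MLLT and fMLLR would then operate on this reduced stream, so every subsequent transform is estimated after the pitch-sensitive dimensions have been discarded.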
Table 6
The PRIs in the recognition performances obtained through the inclusion of low-rank feature projection (Scheme-I & Scheme-II). The WERs are reported for the three acoustic modeling approaches explored with the inclusion of fMLLR-based feature normalization.

Acoustic   Scheme-I                 Scheme-II
modeling
approach   Base.   SW      PRI      Base.   SW      PRI
GMM        43.53   31.96   26.56    43.53   25.78   45.11
SGMM       31.89   25.33   20.57    31.89   20.80   34.78
DNN        24.25   21.44   11.58    24.25   19.11   21.19

Table 7
The WERs for the children's test on adult-data-trained ASR systems employing different kinds of acoustic models. The tabulated WERs demonstrate the effect of using wideband speech data instead of narrowband.

Bandwidth   Acoustic model kind   WER (in %): fMLLR   VTLN
[Table body not recovered in the extraction.]

Table 8
The PRIs obtained through the inclusion of low-rank feature projection (Scheme-II) with respect to ASR systems trained on narrowband and wideband speech data. The WERs are reported for the three acoustic modeling approaches explored with the inclusion of fMLLR-based feature normalization.

Acoustic   Narrowband               Wideband
modeling
approach   Base.   SW      PRI      Base.   SW      PRI
GMM        43.53   25.78   45.11    35.08   20.89   40.45
SGMM       31.89   21.18   33.58    24.73   16.22   34.41
DNN        24.25   19.11   21.19    20.38   15.34   24.73

... of 25% is noted when the feature dimension is reduced from 39 to 24 using structured low-rank projection.
Table 9
The PRIs in the recognition performances obtained through the inclusion of low-rank feature projection (Scheme-II) with respect to ASR systems trained by pooling speech data from both adult and child speakers. The WERs are obtained with fMLLR-based normalization of the training and test data.

Bandwidth    WER (in %)
             Base.   SW      PRI
Narrowband   14.63   12.24   16.4
Wideband     11.79   10.16   13.8

6. Conclusion and future work

The work presented in this paper aims at addressing the pitch-dependent acoustic mismatch in the context of children's speech recognition with adults' speech trained models. To overcome this issue, structured low-rank feature projections are proposed and are found to be effective in the context of GMM-based ASR. Moreover, the use of feature normalization techniques like fMLLR and VTLN along with low-rank projection is also studied. Both the feature normalization techniques are observed to result in added improvements. In order to extend the developed approach to SGMM- and DNN-based modeling, two different schemes for projecting the speech data to a lower-dimensional subspace are presented. The latter of the two schemes is noted to result in greater improvements. From the studies on SGMM- and DNN-based children's mismatched ASR presented in this paper, the following conclusions can be drawn:

• Though SGMM/DNN-based systems are superior to the GMM ones, the ill-effects of pitch-dependent distortions are still evident in the case of mismatched ASR;
• The proposed SW approach is found to be effective even in the case of SGMM- and DNN-based children's mismatched ASR systems;
• Unlike the adults' matched task, VTLN is observed to be as effective as fMLLR;
• VTLN is found to be effective for both shallow and deep networks, unlike what is noted for the adults' matched tasks.

Throughout the paper, we have presumed that the system somehow knows the difference between the children's and adults' test sets and applies SW to the children's test set. Unfortunately, applying SW to the adults' test set is noted to degrade the recognition performance. Consequently, a robust switch to detect the speech from the two groups of speakers is required. Even when two separate ASR systems are employed, such a switch will be required. In future work, we wish to address this shortcoming of the proposed approach.

Acknowledgements

The authors express sincere gratitude to the anonymous reviewers for their thoughtful comments and suggestions that have helped us improve the quality of this paper.

References

Batliner, A., Blomberg, M., D'Arcy, S., Elenius, D., Giuliani, D., Gerosa, M., Hacker, C., Russell, M., Wong, M., 2005. The PF_STAR children's speech corpus. In: Proc. INTERSPEECH, pp. 2761–2764.
Bell, L., Gustafson, J., 2007. Children's convergence in referring expressions to graphical objects in a speech-enabled computer game. In: Proc. INTERSPEECH, pp. 2209–2212.
Burnett, D., Fanty, M., 1996. Rapid unsupervised adaptation to children's speech on a connected-digit task. In: Proc. ICSLP, 2, pp. 1145–1148.
Dahl, G., Yu, D., Deng, L., Acero, A., 2012. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20 (1), 30–42.
Davis, S., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28 (4), 357–366.
Digalakis, V., Rtischev, D., Neumeyer, L., 1995. Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Trans. Speech Audio Process. 3, 357–366.
Eguchi, S., Hirsh, I.J., 1969. Development of speech sounds in children. Acta Otolaryngol. Suppl. 257, 1–51.
Fisher, W.M., Doddington, G.R., Goudie-Marshall, K.M., 1986. The DARPA speech recognition research database: specifications and status. In: Proc. DARPA Workshop on Speech Recognition, pp. 93–99.
Gales, M.J.F., 1999. Cluster adaptive training of hidden Markov models. IEEE Trans. Speech Audio Process. 8 (4), 417–428.
Gauvain, J.L., Lee, C.H., 1994. Maximum a-posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2, 291–298.
Gerosa, M., Giuliani, D., Brugnara, F., 2007. Acoustic variability and automatic recognition of children's speech. Speech Commun. 49 (10–11), 847–860.
Gerosa, M., Giuliani, D., Brugnara, F., 2009. Towards age-independent acoustic modeling. Speech Commun. 51 (6), 499–509.
Gerosa, M., Giuliani, D., Narayanan, S., Potamianos, A., 2009. A review of ASR technologies for children's speech. In: Proc. Workshop on Child, Computer and Interaction, pp. 7:1–7:8.
Ghai, S., 2011. Addressing Pitch Mismatch for Children's Automatic Speech Recognition. Department of EEE, Indian Institute of Technology Guwahati, India.
Ghai, S., Sinha, R., 2009. Exploring the role of spectral smoothing in context of children's speech recognition. In: Proc. INTERSPEECH, pp. 1607–1610.
Ghai, S., Sinha, R., 2010. Exploring the effect of differences in the acoustic correlates of adults' and children's speech in the context of automatic speech recognition. EURASIP J. Audio Speech Music Process. 2010, 7:1–7:15.
Gray, S.S., Willett, D., Pinto, J., Lu, J., Maergner, P., Bodenstab, N., 2014. Child automatic speech recognition for US English: child interaction with living-room-electronic-devices. In: Proc. INTERSPEECH Workshop on Child, Computer and Interaction.
Gustafson, J., Sjolander, K., 2002. Voice transformations for improving children's speech recognition in a publicly available dialogue system. In: Proc. ICSLP, pp. 297–300.
Hagen, A., Pellom, B., Cole, R., 2003. Children's speech recognition with application to interactive books and tutors. In: Proc. ASRU Workshop, pp. 186–191.
Hagen, A., Pellom, B., Cole, R., 2007. Highly accurate children's speech recognition for interactive reading tutors using subword units. Speech Commun. 49 (12), 861–873.
Hazen, T.J., Glass, J.R., 1997. A comparison of novel techniques for instantaneous speaker adaptation. In: Proc. European Conference on Speech Communication and Technology, pp. 2047–2050.
Hermansky, H., 1990. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87 (4), 1738–1752.
Jolliffe, I.T., 1986. Principal Component Analysis. Springer-Verlag, Berlin, Germany.
Kent, R.D., 1976. Anatomical and neuromuscular maturation of the speech mechanism: evidence from acoustic studies. J. Speech Hear. Res. 19, 421–447.
Kuhn, R., Junqua, J.C., Nguyen, P., Niedzielski, N., 2000. Rapid speaker adaptation in eigenvoice space. IEEE Trans. Speech Audio Process. 8 (6), 695–707.
Kumar, N., Andreou, A.G., 1998. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Commun. 26 (4), 283–297.
Lee, L., Rose, R., 1998. A frequency warping approach to speaker normalization. IEEE Trans. Speech Audio Process. 6 (1), 49–60.
Lee, S., Potamianos, A., Narayanan, S., 1999. Acoustics of children's speech: developmental changes of temporal and spectral parameters. J. Acoust. Soc. Am. 105, 1455–1468.
Leggetter, C.J., Woodland, P.C., 1995. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Lang. 9, 171–185.
Liao, H., Pundak, G., Siohan, O., Carroll, M.K., Coccaro, N., Jiang, Q., Sainath, T.N., Senior, A.W., Beaufays, F., Bacchiani, M., 2015. Large vocabulary automatic speech recognition for children. In: Proc. INTERSPEECH, pp. 1611–1615.
Metallinou, A., Cheng, J., 2014. Using deep neural networks to improve proficiency assessment for children English language learners. In: Proc. INTERSPEECH, pp. 1468–1472.
Narayanan, S., Potamianos, A., 2002. Creating conversational interfaces for children. IEEE Trans. Speech Audio Process. 10 (2), 65–78.
Nisimura, R., Lee, A., Saruwatari, H., Shikano, K., 2004. Public speech-oriented guidance system with adult and child discrimination capability. In: Proc. ICASSP, 1, pp. 433–436.
Potamianos, A., Narayanan, S., 2003. Robust recognition of children's speech. IEEE Trans. Speech Audio Process. 11 (6), 603–616.
Potamianos, A., Narayanan, S.S., Lee, S., 1997. Automatic speech recognition for children. In: Proc. INTERSPEECH, 5, pp. 2371–2374.
Povey, D., Burget, L., Agarwal, M., Akyazi, P., Kai, F., Ghoshal, A., Glembek, O., Goel, N., Karafiát, M., Rastrow, A., Rose, R.C., Schwarz, P., Thomas, S., 2011. The subspace Gaussian mixture model: a structured model for speech recognition. Comput. Speech Lang. 25 (2), 404–439.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K., 2011. The Kaldi speech recognition toolkit. In: Proc. ASRU.
Rath, S.P., Povey, D., Veselý, K., Černocký, J., 2013. Improved feature processing for deep neural networks. In: Proc. INTERSPEECH.
Robinson, T., Fransen, J., Pye, D., Foote, J., Renals, S., 1995. WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition. In: Proc. ICASSP, pp. 81–85.
Russell, M., D'Arcy, S., 2007. Challenges for computer recognition of children's speech. In: Proc. Speech and Language Technologies in Education (SLaTE).
Russell, M., D'Arcy, S., Qun, L., 2007. The effects of bandwidth reduction on human and computer recognition of children's speech. IEEE Signal Process. Lett. 14 (12), 1044–1046.
Safavi, S., Najafian, M., Hanani, A., Russell, M.J., Jancovic, P., Carey, M.J., 2012. Speaker recognition for children's speech. In: Proc. INTERSPEECH, pp. 1836–1839.
Safavi, S., Najafian, M., Hanani, A., Russell, M.J., Jancovic, P., 2014. Comparison of speaker verification performance for adult and child speech. In: Proc. Workshop on Child, Computer and Interaction, pp. 27–31.
Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M., Kamvar, M., Strope, B., 2010. Your word is my command: Google search by voice: a case study. In: Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, pp. 61–90.
Seide, F., Li, G., Chen, X., Yu, D., 2011. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In: Proc. ASRU, pp. 24–29.
Serizel, R., Giuliani, D., 2014. Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition. In: Proc. Workshop on Spoken Language Technology (SLT), pp. 135–140.
Serizel, R., Giuliani, D., 2016. Deep-neural network approaches for speech recognition with heterogeneous groups of speakers including children. Nat. Lang. Eng. 1, 325–350.
Shahnawazuddin, S., Kathania, H., Sinha, R., 2015. Enhancing the recognition of children's speech on acoustically mismatched ASR system. In: Proc. IEEE TENCON, pp. 1–5.
Shao, X., Milner, B., 2004. Pitch prediction from MFCC vectors for speech reconstruction. In: Proc. ICASSP, pp. 97–100.
Shivakumar, P.G., Potamianos, A., Lee, S., Narayanan, S., 2014. Improving speech recognition for children using acoustic adaptation and pronunciation modeling. In: Proc. Workshop on Child Computer Interaction.
Singer, H., Sagayama, S., 1992. Pitch dependent phone modelling for HMM based speech recognition. In: Proc. ICASSP, pp. 273–276.
Sinha, R., Ghai, S., 2009. On the use of pitch normalization for improving children's speech recognition. In: Proc. INTERSPEECH, pp. 568–571.
Stemmer, G., Brugnara, F., 2006. Integration of heteroscedastic linear discriminant analysis (HLDA) into adaptive training. In: Proc. ICASSP, 1, pp. 14–19.