          INTEGRATING ACOUSTIC, PROSODIC AND PHONOTACTIC FEATURES
                   FOR SPOKEN LANGUAGE IDENTIFICATION

     Rong Tong^1,2, Bin Ma^1, Donglai Zhu^1, Haizhou Li^1 and Eng Siong Chng^2

               ^1 Institute for Infocomm Research, Singapore
 ^2 School of Computer Engineering, Nanyang Technological University, Singapore
     ^1 {tongrong,mabin,dzhu,hli}@i2r.a-star.edu.sg, ^2 aseschng@ntu.edu.sg
                              ABSTRACT

The fundamental issue in automatic language identification is to find
effective discriminative cues for languages. This paper studies the fusion
of five features at different levels of abstraction for language
identification: spectral, duration, pitch, n-gram phonotactic, and
bag-of-sounds features. We build a system and report test results on the
NIST 1996 and 2003 LRE datasets. The system was also built to participate
in the NIST 2005 LRE. The experiment results show that different levels of
information provide complementary language cues. The prosodic features are
more effective for shorter utterances, while the phonotactic features work
better for longer utterances. For the 12-language task, the system with the
fusion of five features achieved 2.38% EER for 30-sec speech segments on
the NIST 1996 dataset.
                          1. INTRODUCTION

Automatic language identification (LID) is the process of determining the
language identity of a given spoken query. It is an important technology in
many applications, such as spoken language translation, multilingual speech
recognition and spoken document retrieval.
     Recent studies have explored different levels of speech features,
including articulatory parameters [1], spectral information [2], prosody
[3], phonotactics [2] and lexical knowledge [4]. It is generally believed
that spectral and phonotactic features provide complementary language cues
[5]. Human perception experiments also suggest that prosodic features are
informative language cues [1]. However, prosodic features have not been
fully exploited in LID [6]. In general, LID features fall into five groups
according to their level of knowledge abstraction, as shown in Figure 1.
Lower level features, such as spectral features, are easier to obtain but
volatile, because speech variations such as speaker or channel variations
are present. Higher level features, such as lexical/syntactic features,
rely on a large vocabulary speech recognizer, which is language and domain
dependent; they are therefore difficult to generalize across languages and
domains. Phonotactic features offer a trade-off between computational
complexity and performance. It is generally agreed that phonotactics, i.e.
the rules governing the sequences of admissible phones/phonemes, carry more
language discriminative information than the phonemes themselves.
Phonotactic features are extracted from the output of a phoneme recognizer,
which is supposed to be more robust against effects such as speaker and
channel than spectral features. For practicality, research has focused on
acoustic, prosodic and phonotactic features. In this paper, we study how
language cues at these three levels (n-gram LM and bag-of-sounds; spectral
features; duration and pitch) complement one another in LID tasks.

    [Figure 1: Five levels of LID features, from high to low abstraction.
    Syntactic: word n-gram; Lexical: word; Phonotactic: n-gram LM, BOS;
    Prosodic: duration, pitch; Acoustic: MFCC, SDC]

     We typically represent a speech utterance as a collection of
independent spectral feature vectors. The collection of vectors can be
modeled by a Gaussian mixture model (GMM) [7] that captures the spectral
characteristics of a language. The prosody of speech can be characterized
mainly by energy, pitch and duration, among others; these can be modeled in
a similar way to spectral features. Phonotactic features capture the
lexical constraints on admissible phonetic combinations in a language. One
typical implementation is the P-PRLM (Parallel Phone Recognition followed
by Language Model) approach, which employs multiple phoneme recognizers to
tokenize a speech waveform into phoneme sequences and then characterizes a
language by a group of n-gram language models (LMs) over the phoneme
sequences [2]. A new phonotactic model, known as bag-of-sounds, was
proposed recently to model utterance-level phonotactics collectively; its
language discriminative ability is comparable to that of the n-gram LM
[8][9].
     In this paper, we study five LID features: the n-gram LM in P-PRLM,
bag-of-sounds, spectral features, pitch and duration. In Section 2, the
development and evaluation databases are introduced. In Section 3, the
feature fusion LID system is described. In Section 4, we report the
experiment results. Finally, we conclude in Section 5.

               2. DEVELOPMENT AND EVALUATION DATA

The NIST 1996 and 2003 language recognition evaluation (LRE)
sets are used to evaluate the performance of the LID systems. There are 12
target languages in both sets: Arabic, Farsi, French, German, Hindi,
Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese and English.
Dialects of English, Mandarin and Spanish are also included in the 1996
test set, and the 2003 test set contains test segments in Russian. Each
language consists of test segments in 3 length groups: 30, 10 and 3
seconds.
     The development data come from the CallFriend corpus [10]. We use the
same 12 languages and 3 dialects as the target languages specified in the
NIST LRE. In the CallFriend corpus, the data for each language are grouped
into 3 parts: 'train', 'devtest' and 'evaltest'. We use 'train' and
'devtest' as our development data.
     All the development and test data are pre-processed by a speech
activity detection program to remove silence. In the development process,
we treat the dialects of English, Mandarin and Spanish as different
languages; there are therefore 15 languages in the training process. For
our results to be comparable with other reports in the literature, in the
test process we only measure the LID performance of the 12 primary
languages, grouping the dialect labels into their respective primary
languages.
                        3. SYSTEM DESCRIPTION

One solution for fusing multiple features is the ensemble method. An
ensemble of classifiers is a set of classifiers whose individual decisions
are combined in the classification process. Our five-feature fusion LID
system is formulated in this way. In this section, we describe the five
member classifiers in the ensemble.
3.1. n-gram LM in P-PRLM

Following the P-PRLM formulation in [2], seven phoneme tokenizers are used
in our system: English, Korean, Mandarin, Japanese, Hindi, Spanish and
German. The English phoneme models are trained on the IIR-LID database
[11], the Korean models on the LDC Korean corpus (LDC2003S03), the Mandarin
models on the MAT corpus [12], and the others on the OGI-TS corpus [13].
39-dimensional MFCC features are extracted from each frame. Utterance-based
cepstral mean subtraction is applied to the MFCC features to remove channel
distortion. Each phoneme is modeled with a 3-state HMM. The English, Korean
and Mandarin states have 32 mixtures each, while the others have 6
mixtures, considering the availability of training data. Based on the
phoneme sequence from each tokenizer, we train up to 3-gram phoneme LMs for
each tokenizer-target language pair, resulting in 105 = 15 × 7 LMs. For
each input utterance, 105 interpolated language scores are derived to form
a vector. In this way, a set of training utterances is represented by a
collection of 105-dimensional score vectors. The score vectors are
normalized by subtracting the mean of the competing languages.
     The P-PRLM classifier consists of 15 pairs of Gaussian mixture models
(GMMs), known as the backend classifier. For each target language, we build
two GMMs {m+, m-}: m+ is trained on the score vectors of the target
language, called the positive model, while m- is trained on those of its
competing languages, called the negative model. The confidence of a test
utterance O is given by the likelihood ratio
λ_PPRLM = log(p(O|m+) / p(O|m-)).
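To make the backend concrete, here is a minimal sketch of one
positive/negative GMM pair and its log-likelihood ratio. It is an
illustration rather than the authors' implementation: the mixture count,
the diagonal covariances and the synthetic score vectors are our
assumptions.

```python
# Minimal sketch of the positive/negative GMM backend described above.
# Assumptions (not from the paper): 8 diagonal-covariance mixtures and
# synthetic 105-dimensional score vectors as training data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
pos_scores = rng.normal(0.5, 1.0, size=(500, 105))   # target-language score vectors
neg_scores = rng.normal(-0.5, 1.0, size=(500, 105))  # competing-language score vectors

# One positive and one negative model for a single target language.
m_pos = GaussianMixture(n_components=8, covariance_type='diag').fit(pos_scores)
m_neg = GaussianMixture(n_components=8, covariance_type='diag').fit(neg_scores)

def llr_pprlm(score_vector):
    """lambda_PPRLM = log p(O|m+) - log p(O|m-) for one score vector O."""
    o = np.atleast_2d(score_vector)
    return float(m_pos.score_samples(o) - m_neg.score_samples(o))

test_vec = rng.normal(0.5, 1.0, size=105)
print(llr_pprlm(test_vec))  # > 0 favors the target language
```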
3.2. Bag-of-Sounds (BOS)

The bag-of-sounds method uses a universal sound recognizer to tokenize an
utterance into a sound sequence, and then converts the sound sequence into
a count vector, known as the bag-of-sounds vector [8]. The bag-of-sounds
method differs from the P-PRLM method in that it uses a single universal
sound recognizer; with this universal sound recognizer, one does not need
to carry out acoustic modeling when adding a new language capability to the
classifier. Although the sound inventory for the universal sound recognizer
can be derived by unsupervised learning [8], in this paper the universal
sound inventory is a combined phoneme set from 6 languages: English,
Mandarin, Hindi, Japanese, Spanish and German, with 258 phonemes in total.
The phoneme-labeled training corpora for these 6 languages come from the
same sources as described for the P-PRLM system.
     For each sound sequence generated by the universal sound tokenizer, we
count the occurrences of bi-phones. A phoneme sequence is then represented
as a vector of bi-phone occurrences with 66,564 = 258 × 258 elements. A
Support Vector Machine (SVM) is used to partition the high dimensional
vector space [14]. As the SVM is a 2-way classifier, we train pair-wise SVM
classifiers for the 15 target languages, resulting in 105 SVM classifiers.
The linear kernel is adopted, using the SVM-light tool.
     A training utterance is classified by the 105 SVM classifiers to
derive a 105-dimensional score vector. The collection of training score
vectors is used to train a backend classifier in the same way as in P-PRLM.
The likelihood ratio for a test utterance is given by the backend
classifier as λ_BOS.
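As an illustration of the bag-of-sounds representation, the sketch below
maps a tokenized phoneme sequence to the 258 × 258 bi-phone count vector.
The example sequence and the relative-frequency normalization are
assumptions, not details from the paper.

```python
# Sketch of the bag-of-sounds bi-phone count vector described above.
# The inventory size (258) matches the paper; the example sequence and
# the normalization are illustrative assumptions.
import numpy as np

N_PHONES = 258  # size of the combined 6-language phoneme inventory

def bos_vector(phone_ids):
    """Map a decoded phoneme-id sequence to a 258*258 bi-phone count vector."""
    counts = np.zeros((N_PHONES, N_PHONES), dtype=np.float64)
    for prev, curr in zip(phone_ids[:-1], phone_ids[1:]):
        counts[prev, curr] += 1.0
    vec = counts.ravel()                  # 66,564-dimensional vector
    total = vec.sum()
    return vec / total if total else vec  # relative frequencies (an assumption)

seq = [12, 7, 7, 250, 3, 12, 7]           # hypothetical tokenizer output
x = bos_vector(seq)
print(x.shape)                            # (66564,)
```

Pair-wise linear SVMs trained on such vectors then yield the
105-dimensional score vector per utterance.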
3.3. SDC Feature in GMM

Gaussian mixture models are used to model the acoustic characteristics of a
language, known as GMM acoustic in [5]. We use shifted delta cepstral (SDC)
features [7] to capture long-time spectral information across successive
frames, with the parameter setting 7-3-1-7 as in [5]. We build a set of
GMMs to form a classifier. First, a 2,048-mixture GMM is trained from all
the SDC feature vectors of the 15 languages; this is the universal
background model (UBM). Then we adapt the UBM towards each target language,
yielding 15 language-dependent GMMs. We further adapt each
language-dependent GMM by gender, resulting in 30 gender-language-dependent
GMMs. In summary, we obtain 30 gender-language-dependent GMMs, 15
language-dependent GMMs and 3 UBMs.
     An utterance is evaluated on the 45 GMMs and 3 UBMs to generate 45
language-dependent scores in a 45-dimensional vector. The score vectors are
normalized by their respective UBM scores. The collection of training score
vectors is used to train a backend classifier in the same way as in P-PRLM.
The confidence of a test utterance is given by the backend classifier as
λ_SDC.
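A sketch of the SDC computation is given below. The paper states the
parameter setting 7-3-1-7; we read it as N=7 cepstra, d=3 delta spread, P=1
block shift and k=7 blocks, and the edge handling is our own choice, so
treat both as assumptions.

```python
# Sketch of shifted delta cepstra (SDC) computation described above.
# The N-d-P-k reading of the paper's "7-3-1-7" is our assumption.
import numpy as np

def sdc(cepstra, N=7, d=3, P=1, k=7):
    """cepstra: (T, >=N) frame-by-frame cepstral matrix.
    Returns (T, N*k) stacked delta blocks; frames near the utterance
    edges are clamped rather than dropped (an implementation choice)."""
    T = cepstra.shape[0]
    c = cepstra[:, :N]
    out = np.zeros((T, N * k))
    for t in range(T):
        blocks = []
        for i in range(k):
            plus = min(t + i * P + d, T - 1)   # clamp at utterance edges
            minus = max(t + i * P - d, 0)
            blocks.append(c[plus] - c[minus])  # delta over a 2d span
        out[t] = np.concatenate(blocks)
    return out

frames = np.random.default_rng(1).normal(size=(300, 13))  # fake MFCCs
print(sdc(frames).shape)  # (300, 49)
```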
3.4. Duration

As one of the prosodic features, we believe that phoneme duration
statistics provide language discriminative information. Early research has
found duration useful in speaker recognition [15].
     We use the same universal sound recognizer as in the bag-of-sounds
classifier. After tokenization, we obtain duration statistics for each
phoneme. The duration feature vector has 3 elements, representing the
durations of the 3 states in a phoneme. For each phoneme in a target
language, we train a 16-mixture language-dependent GMM on the collection of
duration features. For each phoneme, we also train a 16-mixture
language-independent GMM as the negative model, using the duration features
from all its competing phonemes. As a result, we arrive at 3,870 = 258 × 15
positive models and 258 negative models.
     For each utterance, the likelihood ratios from the 258
positive-negative model pairs are multiplied to generate a score for each
language, resulting in a 15-dimensional score vector representing the 15
languages. The collection of training score vectors is used to train a
backend classifier in the same way as in P-PRLM. The confidence of a test
utterance is given by the backend classifier as λ_DUR.
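The duration scoring step can be sketched as follows: the per-phoneme
likelihood ratios are multiplied, i.e. their logs are summed. The
containers pos_models and neg_models are hypothetical names standing for
the trained 16-mixture GMMs (e.g. fitted sklearn GaussianMixture objects),
and per-token aggregation is our assumption.

```python
# Sketch of the per-language duration score described above: the 258
# per-phoneme likelihood ratios are multiplied, i.e. their logs summed.
# pos_models[lang][ph] and neg_models[ph] are placeholders for the
# trained 16-mixture duration GMMs of Section 3.4.

def duration_language_scores(phone_durations, pos_models, neg_models, languages):
    """phone_durations: dict phoneme_id -> (n_i, 3) array of state durations
    observed in one utterance. Returns one log-score per language."""
    scores = {}
    for lang in languages:
        log_ratio = 0.0
        for ph, feats in phone_durations.items():
            # sum of per-token log-likelihoods under the positive model,
            # minus the same under the shared negative model
            log_ratio += pos_models[lang][ph].score_samples(feats).sum()
            log_ratio -= neg_models[ph].score_samples(feats).sum()
        scores[lang] = log_ratio
    return scores
```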
3.5. Pitch

Pitch is another important prosodic feature. It has been used in some
speaker recognition tasks [16], but has not yet been used successfully in
LID. We initially designed pitch features for Chinese dialect
identification, as Chinese dialects are largely differentiated by their
intonation schemes, and have seen promising results [17]. Here we adopt
pitch features to build one member classifier in the ensemble.
     For a given utterance, 11-dimensional pitch features are extracted
from each frame [17]. A Gaussian mixture model, i.e. a universal background
model (UBM), is trained using feature vectors from all languages. Then a
GMM is adapted from the UBM for each target language. As a result, we build
15 GMMs and one UBM, each with 16 Gaussian mixtures.
     An utterance is evaluated on the 15 GMMs and 1 UBM to generate 15
language-dependent scores in a 15-dimensional vector. The score vectors are
normalized by the UBM score. The collection of training score vectors is
used to train a backend classifier in the same way as in P-PRLM. The
confidence of a test utterance is given by the backend classifier as λ_PIT.
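The paper does not specify how the language GMMs are adapted from the UBM;
a common choice is mean-only MAP adaptation, sketched below under that
assumption. The relevance factor r=16 is a conventional placeholder.

```python
# Sketch of mean-only MAP adaptation of a UBM to one target language,
# a common way to derive the per-language GMMs described above (the
# paper does not name its adaptation method; this is an assumption).
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, X, r=16.0):
    """Return language-adapted means. ubm is a fitted GaussianMixture,
    X the (n_frames, dim) pitch features of one language, r the
    relevance factor (16 is a conventional placeholder)."""
    post = ubm.predict_proba(X)                    # (n, K) responsibilities
    n_k = post.sum(axis=0)                         # soft counts per mixture
    f_k = post.T @ X                               # first-order statistics
    x_bar = f_k / np.maximum(n_k[:, None], 1e-10)  # per-mixture data means
    alpha = (n_k / (n_k + r))[:, None]             # adaptation coefficient
    return alpha * x_bar + (1.0 - alpha) * ubm.means_

rng = np.random.default_rng(2)
ubm = GaussianMixture(n_components=16, covariance_type='diag').fit(
    rng.normal(size=(2000, 11)))                   # fake 11-dim pitch features
lang_means = map_adapt_means(ubm, rng.normal(0.3, 1.0, size=(400, 11)))
print(lang_means.shape)                            # (16, 11)
```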
                           4. EXPERIMENTS

We conduct experiments on the NIST 1996 and 2003 LRE datasets. We use the
NIST 1996 LRE development data for fine-tuning the ensemble, and with the
resulting settings we run the tests on both the 1996 and 2003 datasets.
     To investigate how the different levels of discriminative features
complement each other, we use our P-PRLM classifier as the baseline, and
then fuse the other classifiers one by one into the ensemble. The fusion is
carried out by multiplying the likelihood ratios from the individual member
classifiers, i.e. summing their log-likelihood ratio scores. In the case of
5-feature fusion, we have λ = λ_PPRLM + λ_BOS + λ_SDC + λ_DUR + λ_PIT.
Tables 1 and 2 show the results for incremental fusion of the ensemble,
with the last row extracted from Singer et al. [5] for comparison. The
performance of the individual languages and the confusion matrix among the
12 languages on NIST 1996 30-sec data are shown in Table 3 and Table 4.
Figure 2 shows the DET plots on 3-sec NIST 2003 LRE data. The proposed
ensemble system significantly outperforms previously reported results on
the 3-sec short test utterances and compares favorably on longer test
utterances, except for 30-sec in the 2003 LRE.
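Since each member classifier outputs log-likelihood ratio scores, the
fusion above amounts to a plain sum. A minimal sketch with synthetic scores
(the array shapes and values are illustrative only):

```python
# Sketch of the 5-feature fusion: the ensemble score is the sum of the
# members' log-likelihood ratio scores (equivalently, the product of
# their likelihood ratios). Values below are synthetic.
import numpy as np

def fuse(member_scores):
    """member_scores: (n_members, n_languages) log-likelihood ratios
    for one test utterance. Returns fused scores and the top language."""
    fused = np.sum(member_scores, axis=0)
    return fused, int(np.argmax(fused))

scores = np.array([          # one row per member classifier, 3 languages
    [ 1.2, -0.3, -0.9],      # lambda_PPRLM
    [ 0.8, -0.1, -0.7],      # lambda_BOS
    [ 0.5,  0.2, -0.4],      # lambda_SDC
    [ 0.1,  0.0, -0.1],      # lambda_DUR
    [ 0.0,  0.1, -0.2],      # lambda_PIT
])
fused, best = fuse(scores)
print(fused, best)           # fused scores; index of the top language
```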
 Method                          30-sec   10-sec   3-sec
 P-PRLM                          2.92     8.23     18.61
 P-PRLM+BOS                      2.61     7.11     16.98
 P-PRLM+BOS+SDC                  2.38     6.80     15.70
 P-PRLM+BOS+SDC+Duration         2.38     6.35     14.55
 P-PRLM+BOS+SDC+Duration+Pitch   2.38     6.26     14.31
 MIT fused system [5]            2.70     6.90     17.40

      Table 1 EER% of system fusion on NIST 1996 LRE data

 Method                          30-sec   10-sec   3-sec
 P-PRLM                          4.54     11.31    20.37
 P-PRLM+BOS                      4.17     10.03    18.64
 P-PRLM+BOS+SDC                  3.27     8.55     16.66
 P-PRLM+BOS+SDC+Duration         3.27     8.37     15.94
 P-PRLM+BOS+SDC+Duration+Pitch   3.27     7.97     15.54
 MIT fused system [5]            2.80     7.80     20.30

      Table 2 EER% of system fusion on NIST 2003 LRE data

 Language          EER%    #test utterances
 French (FR)       1.30     80
 Arabic (AR)       1.76     80
 Farsi (FA)        3.15     80
 German (GE)       3.80     80
 Hindi (HI)        7.92     76
 Japanese (JA)     1.20     79
 Korean (KO)       3.51     78
 Tamil (TA)        4.70     73
 Vietnamese (VI)   4.38     79
 Mandarin (MA)     1.86    156
 Spanish (SP)      2.03    153
 English (EN)      1.56    478

 Table 3 EER% for individual languages on NIST 1996 LRE data (30-sec)

       FR  AR  FA  GE  HI  JA  KO  TA  VI  MA   SP   EN
 FR    77   0   0   0   0   0   0   0   1   0    0    2
 AR     1  72   3   0   0   0   0   0   0   0    2    2
 FA     0   0  74   2   0   0   0   0   2   1    0    1
 GE     0   0   5  74   0   0   1   0   0   0    0    0
 HI     2   1   3   0  57   0   6   2   1   0    3    1
 JA     0   0   0   0   0  76   2   0   0   1    0    0
 KO     0   0   2   0   2   1  70   1   0   2    0    0
 TA     0   1   1   0   3   1   2  65   0   0    0    0
 VI     1   0   0   0   1   0   1   0  72   1    1    2
 MA     0   1   0   0   0   0   1   0   0  151   0    3
 SP     1   0   0   0   3   2   0   0   1   0   145   1
 EN     1   0   5   1   0   0   1   0   2   1    0   467

      Table 4 Confusion matrix of NIST 1996 LRE data (30-sec)
     To look into the contribution of each member classifier in the
ensemble, we break down the EER reduction achieved by each classifier when
it is added to the ensemble, as shown in Table 5.
     As the P-PRLM and BOS systems capture phonotactic features in
different ways, fusing the two systems gains an average 10.2% EER
reduction, evenly across the board. The P-PRLM classifier extracts phoneme
3-gram statistics and uses a perplexity measure to evaluate the similarity
between languages. The BOS classifier extracts bi-phone statistics, similar
to a phoneme bigram, but projects the statistics into a high dimensional
space for the SVM to carry out discrimination [8][9].
             30-sec          10-sec          3-sec
             1996   2003     1996   2003     1996   2003
 P-PRLM      -      -        -      -        -      -
 BOS         10.6   8.1      13.6   11.3     9.2    8.5
 SDC         8.8    21.6     4.4    14.7     7.5    10.6
 Duration    0.0    0.0      6.6    2.1      7.3    4.3
 Pitch       0.0    0.0      1.4    4.8      1.6    2.5

 Table 5 EER reduction (%) by member classifiers in the ensemble
    [Figure: DET curves (miss probability vs. false alarm probability,
    in %) for P-PRLM, P-PRLM+BOS, P-PRLM+BOS+SDC, P-PRLM+BOS+SDC+Duration
    and P-PRLM+BOS+SDC+Duration+Pitch]

    Figure 2 DET curve of fused system on NIST 2003 LRE (3-sec)
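For reference, the EER reported in the tables above is the operating point
on such a DET curve where the miss and false alarm rates are equal. A small
sketch with synthetic detection scores:

```python
# Sketch of computing the equal error rate (EER), the point on a DET
# curve where miss and false-alarm rates meet. The score arrays are
# synthetic stand-ins for the ensemble's detection scores.
import numpy as np

def eer(target_scores, nontarget_scores):
    """Return the equal error rate by sweeping thresholds over all scores."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(p_miss - p_fa))   # closest point to p_miss == p_fa
    return (p_miss[i] + p_fa[i]) / 2.0

rng = np.random.default_rng(3)
tgt = rng.normal(1.0, 1.0, 1000)     # scores for true-language trials
non = rng.normal(-1.0, 1.0, 11000)   # scores for impostor-language trials
print(f"EER = {100 * eer(tgt, non):.2f}%")
```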
     The SDC classifier captures low level acoustic information. The
results show that it also contributes significantly to EER reduction across
the board; however, the effect is more obvious in the 2003 LRE than in the
1996 LRE. As for the prosody-based classifiers, we only see an effect in
the 3-sec and 10-sec test cases.
                           5. CONCLUSIONS

We have proposed an effective ensemble method for LID. The ensemble fuses
different levels of discriminative features. We have shown that different
levels of information provide complementary language identification cues.
It is found that the P-PRLM and bag-of-sounds features complement each
other, fully exploiting both local n-gram phonotactics and utterance-level
collective phonotactic statistics. The P-PRLM and bag-of-sounds classifiers
form the backbone of the ensemble. The spectral feature also contributes
consistently to the LID tasks. It is found that fusing the lower level
acoustic information and the high level phonotactic information greatly
improves the overall system. We have also successfully integrated the
prosodic features into the LID task. The experiment results show that even
simple prosodic features such as pitch and phoneme duration are useful,
especially for short speech segments.
     The performance of the proposed ensemble LID system on the NIST 1996
and 2003 LRE datasets is comparable with the best systems reported in the
literature. The experiments in this paper also re-affirm, from a different
angle, the findings of other reports [5] that spectral and phonotactic
features are the most effective features for LID.

                            REFERENCES

[1] Y.K. Muthusamy, N. Jain and R.A. Cole, "Perceptual benchmarks for
automatic language identification," Proc. ICASSP 1994, pp. 333-336.
[2] M.A. Zissman, "Comparison of four approaches to automatic language
identification of telephone speech," IEEE Transactions on Speech and Audio
Processing, 4(1), pp. 31-44, 1996.
[3] A.E. Thymé-Gobbel and S.E. Hutchins, "On using prosodic cues in
automatic language identification," Proc. ICSLP 1996, Philadelphia, USA,
October 1996.
[4] D. Matrouf, M. Adda-Decker, L. Lamel and J. Gauvain, "Language
identification incorporating lexical information," Proc. ICSLP 1998,
Sydney, Australia, December 1998.
[5] E. Singer, P.A. Torres-Carrasquillo, T.P. Gleason, W.M. Campbell and
D.A. Reynolds, "Acoustic, phonetic, and discriminative approaches to
automatic language identification," Proc. Eurospeech 2003, pp. 1345-1348,
September 2003.
[6] T.J. Hazen and V.W. Zue, "Recent improvements in an approach to
segment-based automatic language identification," Proc. ICSLP 1994.
[7] P.A. Torres-Carrasquillo et al., "Approaches to language identification
using Gaussian mixture models and shifted delta cepstral features," Proc.
ICASSP 2002.
[8] B. Ma and H. Li, "A phonotactic-semantic paradigm for automatic spoken
document classification," Proc. SIGIR 2005, Salvador, Brazil, August 2005.
[9] H. Li and B. Ma, "A phonotactic language model for spoken language
identification," Proc. ACL 2005, Ann Arbor, USA, 2005.
[10] Linguistic Data Consortium (LDC), "The CallFriend corpora,"
http://www.ldc.upenn.edu/Catalog/byType.jsp#speech.telephone
[11] Language Identification Corpus of the Institute for Infocomm Research.
[12] H.-C. Wang, "MAT, a project to collect Mandarin speech data through
networks in Taiwan," Int. J. Comput. Linguistics Chinese Language Process.,
1(2), pp. 73-89, February 1997.
[13] OGI-TS corpus, http://cslu.cse.ogi.edu/corpora/corpCurrent.html
[14] SVM-light, http://svmlight.joachims.org/
[15] L. Ferrer et al., "Modeling duration patterns for speaker
recognition," Proc. Eurospeech 2003, Geneva, pp. 2017-2020, 2003.
[16] D.A. Reynolds et al., "The 2004 MIT Lincoln Laboratory speaker
recognition system," Proc. ICASSP 2005.
[17] B. Ma, D. Zhu and R. Tong, "Chinese dialect identification using tone
features based on pitch flux," submitted to ICASSP 2006.