

INTEGRATING ACOUSTIC, PROSODIC AND PHONOTACTIC FEATURES
FOR SPOKEN LANGUAGE IDENTIFICATION

Rong Tong (1,2), Bin Ma (1), Donglai Zhu (1), Haizhou Li (1) and Eng Siong Chng (2)

(1) Institute for Infocomm Research, Singapore
(2) School of Computer Engineering, Nanyang Technological University, Singapore
(1) {tongrong,mabin,dzhu,hli}@i2r.a-star.edu.sg, (2) aseschng@ntu.edu.sg

ABSTRACT

The fundamental issue of automatic language identification is to explore effective discriminative cues for languages. This paper studies the fusion of five features at different levels of abstraction for language identification: spectrum, duration, pitch, n-gram phonotactic, and bag-of-sounds features. We build a system and report test results on the NIST 1996 and 2003 LRE datasets. The system was also built to participate in the NIST 2005 LRE. The experiment results show that different levels of information provide complementary language cues: the prosodic features are more effective for shorter utterances, while the phonotactic features work better for longer utterances. For the task of 12 languages, the system with fusion of five features achieved 2.38% EER for 30-sec speech segments on the NIST 1996 dataset.

1. INTRODUCTION

Automatic language identification (LID) is the process of determining the language identity of a given spoken query. It is an important technology in many applications, such as spoken language translation, multilingual speech recognition and spoken document retrieval.

Recent studies have explored different levels of speech features, including articulatory parameters [1], spectral information [2], prosody [3], phonotactics [2] and lexical knowledge [4]. It is generally believed that spectral features and phonotactic features provide complementary language cues to each other [5]. Human perception experiments also suggest that prosodic features are informative language cues [1]. However, prosodic features have not been fully exploited in LID [6]. In general, LID features fall into five groups according to their level of knowledge abstraction, as shown in Figure 1. Lower level features, such as spectral features, are easier to obtain but volatile, because speech variations such as speaker or channel variations are present. Higher level features, such as lexical/syntactic features, rely on a large vocabulary speech recognizer, which is language and domain dependent. They are therefore difficult to generalize across languages and domains. Phonotactic features offer a trade-off between computational complexity and performance. It is generally agreed that phonotactics, i.e. the rules governing the sequences of admissible phones/phonemes, carry more language discriminative information than the phonemes themselves. They are extracted from the output of a phoneme recognizer, which is supposed to be more robust against effects such as speaker and channel than spectral features. For practicality, research has been focused on acoustic-prosodic-phonotactic features. In this paper, we study how three levels of language cues, namely n-gram LM, bag-of-sounds, spectral features, duration and pitch, complement each other in LID tasks.

    high   Syntactic:    word n-gram
           Lexical:      word
           Phonotactic:  n-gram LM, BOS
           Prosodic:     duration, pitch
    low    Acoustic:     MFCC, SDC

           Figure 1 Five levels of LID features

We typically represent a speech utterance as a collection of independent spectral feature vectors. The collection of vectors can be modeled by a Gaussian mixture model, known as GMM [7], that captures the spectral characteristics of a language. The prosody of speech is characterized mainly by energy, pitch and duration, among others; these can be modeled in a similar way as spectral features. Phonotactic features capture the lexical constraints on admissible phonetic combinations in a language. One typical implementation is the P-PRLM (Parallel Phone Recognition followed by Language Model) approach, which employs multiple phoneme recognizers to tokenize a speech waveform into phoneme sequences and then characterizes a language by a group of n-gram language models (LMs) over the phoneme sequences [2]. A new phonotactic model, known as bag-of-sounds, was proposed recently to model utterance-level phonotactics collectively. Its language discriminative ability is comparable to that of the n-gram LM [8][9].

In this paper, we study five LID features: n-gram LM in P-PRLM, bag-of-sounds, spectral features, pitch and duration. In Section 2, the development and evaluation databases are introduced. In Section 3, the feature fusion LID system is described. In Section 4, we report the experiment results. Finally, we conclude in Section 5.
2. DEVELOPMENT AND EVALUATION DATA

The NIST 1996 and 2003 language recognition evaluation (LRE) sets are used to evaluate the performance of the LID systems. There are 12 target languages in both sets: Arabic, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese and English. Dialects of English, Mandarin and Spanish are also included in the 1996 test set. The 2003 test set additionally contains test segments in Russian. Each language consists of test segments in 3 length groups: 30, 10 and 3 seconds.

The development data come from the CallFriend corpus [10]. We use the same 12 languages and 3 dialects as the target languages specified in the NIST LRE. In the CallFriend corpus, the data for each language are grouped into 3 parts: 'train', 'devtest' and 'evaltest'. We use 'train' and 'devtest' as our development data.

All the development and test data are pre-processed by a speech activity detection program to remove silence. In the development process, we treat the dialects of English, Mandarin and Spanish as different languages; there are therefore 15 languages in the training process. For our results to be comparable with other reports in the literature, in the test process we only measure the LID performance of the 12 primary languages, by grouping the dialect labels into their respective primary languages.

3. SYSTEM DESCRIPTION

One solution for fusing multiple features is the ensemble method. An ensemble of classifiers is a set of classifiers whose individual decisions are combined in the classification process. Our five-feature fusion LID system is formulated in this way. In this section, we discuss the five member classifiers in the ensemble.
3.1 n-gram LM in P-PRLM

Following the P-PRLM formulation in [2], seven phoneme tokenizers are used in our system: English, Korean, Mandarin, Japanese, Hindi, Spanish and German. The English phonemes are trained on the IIR-LID database [11], the Korean phonemes on the LDC Korean corpus (LDC2003S03), the Mandarin phonemes on the MAT corpus [12], and the other phonemes on the OGI-TS corpus [13]. 39-dimensional MFCC features are extracted from each frame. Utterance-based cepstral mean subtraction is applied to the MFCC features to remove channel distortion. Each phoneme is modeled with a 3-state HMM. The English, Korean and Mandarin states have 32 mixtures each, while the others have 6 mixtures, considering the availability of training data. Based on the phoneme sequence from each tokenizer, we train up to 3-gram phoneme LMs for each tokenizer-target language pair, resulting in 105 = 15 × 7 LMs. For each input utterance, 105 interpolated language scores are derived to form a vector. In this way, a set of training utterances is represented by a collection of 105-dimensional score vectors. The score vectors are normalized by subtracting the mean of their competing languages.

The P-PRLM classifier consists of 15 pairs of Gaussian mixture models (GMMs), known as the backend classifier. For each target language, we build two GMMs $\{m^{+}, m^{-}\}$: $m^{+}$ is trained on the score vectors of the target language, and is called the positive model, while $m^{-}$ is trained on those of its competing languages, and is called the negative model. The confidence of a test utterance $O$ is given by the likelihood ratio $\lambda_{PPRLM} = \log\big(p(O \mid m^{+}) / p(O \mid m^{-})\big)$.
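To make the backend scoring concrete, a minimal sketch in Python follows, using scikit-learn's GaussianMixture as a stand-in for the paper's GMM implementation; the mixture count and diagonal covariance are our assumptions, as the paper does not state the backend model size.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_backend_pair(pos_vectors, neg_vectors, n_components=8):
        # pos_vectors: 105-dim score vectors of the target language;
        # neg_vectors: score vectors of its competing languages.
        m_pos = GaussianMixture(n_components, covariance_type="diag").fit(pos_vectors)
        m_neg = GaussianMixture(n_components, covariance_type="diag").fit(neg_vectors)
        return m_pos, m_neg

    def lambda_pprlm(o, m_pos, m_neg):
        # lambda_PPRLM = log(p(O | m+) / p(O | m-));
        # GaussianMixture.score() returns the mean log-likelihood of the rows of o.
        o = np.atleast_2d(o)
        return m_pos.score(o) - m_neg.score(o)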
3.2 Bag-of-Sounds (BOS)

The bag-of-sounds method uses a universal sound recognizer to tokenize an utterance into a sound sequence, and then converts the sound sequence into a count vector, known as the bag-of-sounds vector [8]. The bag-of-sounds method differs from the P-PRLM method in that it uses a single universal sound recognizer; with this universal sound recognizer, one does not need to carry out acoustic modeling when adding a new language capability to the classifier. Although the sound inventory for the universal sound recognizer can be derived by unsupervised learning [8], in this paper the universal sound inventory is a combined phoneme set from 6 languages: English, Mandarin, Hindi, Japanese, Spanish and German. There are 258 phonemes in total. The phoneme-labeled training corpora of these 6 languages come from the same sources as described for the P-PRLM system.

For each sound sequence generated by the universal sound tokenizer, we count the occurrences of bi-phones. A phoneme sequence is then represented as a vector of bi-phone occurrences with 66,564 = 258 × 258 elements. A Support Vector Machine (SVM) is used to partition the high dimensional vector space [14]. As the SVM is a 2-way classifier, we train pair-wise SVM classifiers for the 15 target languages, resulting in 105 SVM classifiers. The linear kernel is adopted, using the SVM-light tool [14].

A training utterance is classified by the 105 SVM classifiers to derive a 105-dimensional score vector. The collection of training score vectors is used to train a backend classifier in the same way as in P-PRLM. The likelihood ratio for a test utterance is given by the backend classifier as $\lambda_{BOS}$.
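As an illustration of the vectorization step, a minimal sketch follows; the 258-phoneme inventory size comes from the paper, while the function name and ID-based indexing are our own assumptions.

    import numpy as np

    def bag_of_sounds_vector(phoneme_ids, inventory_size=258):
        # Map a decoded phoneme ID sequence to bi-phone counts; the vector
        # has inventory_size**2 = 66,564 elements for a 258-phoneme inventory.
        counts = np.zeros(inventory_size * inventory_size)
        for prev, curr in zip(phoneme_ids, phoneme_ids[1:]):
            counts[prev * inventory_size + curr] += 1
        return counts

    # e.g. bag_of_sounds_vector([12, 7, 12, 7]) counts the bi-phones
    # (12,7), (7,12) and (12,7) in the 66,564-dim vector.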
3.3 SDC Feature in GMM

Gaussian mixture models are used to model the acoustic characteristics of a language, as in the GMM acoustic system of [5]. We use shifted delta cepstral (SDC) features [7] to capture long-time spectral information across successive frames; the parameter setting 7-3-1-7 is used, as in [5]. We build a set of GMMs to form a classifier. First, a 2,048-mixture Gaussian mixture model is trained from all the SDC feature vectors of the 15 languages; this is the universal background model (UBM). Then, we adapt the UBM towards each target language, giving 15 language-dependent GMMs. We further adapt the language-dependent GMMs by gender, resulting in 30 gender-language-dependent GMMs. In summary, we obtain 30 gender-language-dependent GMMs, 15 language-dependent GMMs and 3 UBMs.

An utterance is evaluated on the 45 GMMs and 3 UBMs to generate 45 language-dependent scores in a 45-dimensional vector. The score vectors are normalized by their respective UBM scores. The collection of training score vectors is used to train a backend classifier in the same way as in P-PRLM. The confidence of a test utterance is given by the backend classifier as $\lambda_{SDC}$.
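A sketch of the SDC computation is given below, assuming the common N-d-P-k reading of the 7-3-1-7 parameter string (N cepstra, delta spread d, block shift P, k stacked blocks); this interpretation is our assumption, not stated in the paper.

    import numpy as np

    def sdc(cepstra, n=7, d=3, p=1, k=7):
        # cepstra: (num_frames, >= n) array of MFCCs.
        # For each frame t, stack the k shifted deltas
        # c[t + i*p + d] - c[t + i*p - d], i = 0..k-1, into an n*k-dim vector.
        c = np.asarray(cepstra)[:, :n]
        feats = []
        for t in range(d, len(c) - (k - 1) * p - d):
            blocks = [c[t + i * p + d] - c[t + i * p - d] for i in range(k)]
            feats.append(np.concatenate(blocks))
        return np.array(feats)  # 49-dim rows for the 7-3-1-7 setting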
seen promising results [17]. Here we adopt pitch features to build
one member classifier in the ensemble. Language EER% #test utterances
For given utterance, 11 dimensional pitch features are French (FR) 1.30 80
extracted from each frame [17]. A Gaussian mixture model, i.e. Arabic (AR) 1.76 80
universal background model (UBM), is trained using feature Farsi (FA) 3.15 80
vectors from all languages. Then a GMM model is adapted from Geman (GE) 3.80 80
the UBM model for each target language. As a result, we build 15 Hindi (HI) 7.92 76
GMM models and one UBM model. All models have 16 Gaussian Japanese (JA) 1.20 79
mixtures each.
Korean (KO) 3.51 78
An utterance is evaluated on the 15 GMMs and 1 UBM to
Tamil (TA) 4.70 73
generate 15 language dependent scores in a 15-dimensional vector.
Vietnamese (VI) 4.38 79
The score vectors are normalized by the UBM score. The
Mandarin (MA) 1.86 156
collection of training score vectors are used to train a backend
Spanish (SP) 2.03 153
classifier in the same way as it is used in P-PRLM. The confidence
English (EN) 1.56 478
of a test utterance can be given by the backend classifier as λPIT .
Table 3 EER% for individual language on NIST 1996 LRE data
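A hedged sketch of the per-language duration scoring follows; the data structures and model lookup are illustrative assumptions, and score() is assumed to return a mean log-likelihood, as in scikit-learn.

    import numpy as np

    def duration_language_scores(segments, pos_models, neg_models):
        # segments: list of (phoneme_id, state_durations) pairs, where
        # state_durations is the 3-element duration vector of one phoneme token.
        # pos_models[lang][ph]: language-dependent positive GMM for phoneme ph;
        # neg_models[ph]: language-independent negative GMM for phoneme ph.
        scores = {}
        for lang, models in pos_models.items():
            llr = 0.0
            for ph, dur in segments:
                d = np.atleast_2d(dur)
                # multiplying likelihood ratios == summing log-likelihood ratios
                llr += models[ph].score(d) - neg_models[ph].score(d)
            scores[lang] = llr
        return scores  # one duration score per target language (15 in the paper)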
3.5 Pitch

Pitch is another important prosodic feature. It has been used in some speaker recognition tasks [16], but has not yet been used successfully in LID. We initially designed pitch features for Chinese dialect identification, as Chinese dialects are largely differentiated by their intonation schemes, and have seen promising results [17]. Here we adopt the pitch features to build one member classifier in the ensemble.

For a given utterance, 11-dimensional pitch features are extracted from each frame [17]. A Gaussian mixture model, i.e. a universal background model (UBM), is trained using feature vectors from all languages. Then a GMM is adapted from the UBM for each target language. As a result, we build 15 GMM models and one UBM, all with 16 Gaussian mixtures each.

An utterance is evaluated on the 15 GMMs and 1 UBM to generate 15 language-dependent scores in a 15-dimensional vector. The score vectors are normalized by the UBM score. The collection of training score vectors is used to train a backend classifier in the same way as in P-PRLM. The confidence of a test utterance is given by the backend classifier as $\lambda_{PIT}$.
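A minimal sketch of the pitch scoring step, again assuming scikit-learn-style GMMs (the MAP adaptation step itself is omitted):

    import numpy as np

    def pitch_score_vector(pitch_feats, lang_gmms, ubm):
        # pitch_feats: (num_frames, 11) per-frame pitch features of one utterance.
        # lang_gmms: the 15 UBM-adapted language GMMs; ubm: the background model.
        ubm_ll = ubm.score(pitch_feats)
        return np.array([g.score(pitch_feats) - ubm_ll for g in lang_gmms])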
4. EXPERIMENTS

We conduct experiments on the NIST 1996 and 2003 LRE datasets. We use the NIST 1996 LRE development data for fine-tuning of the ensemble. With the same resulting setting, we run the test on both the 1996 and 2003 datasets.

To investigate how the different levels of discriminative features complement each other, we use our P-PRLM classifier as the baseline, and then fuse the other classifiers one by one into the ensemble. The fusion is carried out by multiplying the likelihood ratios of the individual member classifiers, i.e. by summing their log-likelihood-ratio scores. In the case of 5-feature fusion, we have $\lambda = \lambda_{PPRLM} + \lambda_{BOS} + \lambda_{SDC} + \lambda_{DUR} + \lambda_{PIT}$.
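In code, the 5-feature fusion reduces to a sum of per-language log-likelihood-ratio vectors; a sketch (variable and argument names are ours):

    import numpy as np

    def fuse_and_decide(lam_pprlm, lam_bos, lam_sdc, lam_dur, lam_pit, languages):
        # Each argument is a per-language vector of log-likelihood-ratio scores;
        # summing log ratios corresponds to multiplying likelihood ratios.
        fused = lam_pprlm + lam_bos + lam_sdc + lam_dur + lam_pit
        return languages[int(np.argmax(fused))], fused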
Tables 1 and 2 show the results for the incremental fusion of the ensemble, with the last row extracted from Singer et al. [5] for comparison. The performance of individual languages and the confusion matrix among the 12 languages on the NIST 1996 30-sec data are shown in Table 3 and Table 4. Figure 2 shows the DET plots on the 3-sec NIST 2003 LRE data. The proposed ensemble system significantly outperforms previously reported results on the 3-sec short test utterances, and compares favorably on longer test utterances, except for 30-sec in the 2003 LRE.

    Method                           30-sec  10-sec  3-sec
    P-PRLM                           2.92    8.23    18.61
    P-PRLM+BOS                       2.61    7.11    16.98
    P-PRLM+BOS+SDC                   2.38    6.80    15.70
    P-PRLM+BOS+SDC+Duration          2.38    6.35    14.55
    P-PRLM+BOS+SDC+Duration+Pitch    2.38    6.26    14.31
    MIT fused system [5]             2.70    6.90    17.40

    Table 1 EER% of system fusion on NIST 1996 LRE data

    Method                           30-sec  10-sec  3-sec
    P-PRLM                           4.54    11.31   20.37
    P-PRLM+BOS                       4.17    10.03   18.64
    P-PRLM+BOS+SDC                   3.27    8.55    16.66
    P-PRLM+BOS+SDC+Duration          3.27    8.37    15.94
    P-PRLM+BOS+SDC+Duration+Pitch    3.27    7.97    15.54
    MIT fused system [5]             2.80    7.80    20.30

    Table 2 EER% of system fusion on NIST 2003 LRE data

    Language          EER%   #test utterances
    French (FR)       1.30   80
    Arabic (AR)       1.76   80
    Farsi (FA)        3.15   80
    German (GE)       3.80   80
    Hindi (HI)        7.92   76
    Japanese (JA)     1.20   79
    Korean (KO)       3.51   78
    Tamil (TA)        4.70   73
    Vietnamese (VI)   4.38   79
    Mandarin (MA)     1.86   156
    Spanish (SP)      2.03   153
    English (EN)      1.56   478

    Table 3 EER% for individual languages on NIST 1996 LRE data (30-sec)

         FR  AR  FA  GE  HI  JA  KO  TA  VI  MA  SP  EN
    FR   77   0   0   0   0   0   0   0   1   0   0   2
    AR    1  72   3   0   0   0   0   0   0   0   2   2
    FA    0   0  74   2   0   0   0   0   2   1   0   1
    GE    0   0   5  74   0   0   1   0   0   0   0   0
    HI    2   1   3   0  57   0   6   2   1   0   3   1
    JA    0   0   0   0   0  76   2   0   0   1   0   0
    KO    0   0   2   0   2   1  70   1   0   2   0   0
    TA    0   1   1   0   3   1   2  65   0   0   0   0
    VI    1   0   0   0   1   0   1   0  72   1   1   2
    MA    0   1   0   0   0   0   1   0   0 151   0   3
    SP    1   0   0   0   3   2   0   0   1   0 145   1
    EN    1   0   5   1   0   0   1   0   2   1   0 467

    Table 4 Confusion matrix of NIST 1996 LRE data (30-sec)

To look into the contribution of each member classifier in the ensemble, we break down the EER reductions by individual classifier when it is added into the ensemble, as in Table 5.

As both the P-PRLM and BOS systems capture phonotactic features in different ways, by fusing the two systems we gain an average 10.2% EER reduction evenly across the board. The P-PRLM classifier extracts phoneme 3-gram statistics and uses a perplexity measure to evaluate the similarity between languages. The BOS classifier extracts bi-phone statistics, which are similar to phoneme bigrams, but projects the statistics into a high dimensional space for the SVM to carry out discrimination [8][9].

                30-sec        10-sec        3-sec
                1996   2003   1996   2003   1996   2003
    P-PRLM      -      -      -      -      -      -
    BOS         10.6   8.1    13.6   11.3   9.2    8.5
    SDC         8.8    21.6   4.4    14.7   7.5    10.6
    Duration    0.0    0.0    6.6    2.1    7.3    4.3
    Pitch       0.0    0.0    1.4    4.8    1.6    2.5

    Table 5 EER reduction (%) by member classifiers in the ensemble

[Figure 2: Language detection performance. DET curves (miss probability vs. false alarm probability, in %) on the 3-sec NIST 2003 LRE data for P-PRLM, P-PRLM+BOS, P-PRLM+BOS+SDC, P-PRLM+BOS+SDC+Duration and P-PRLM+BOS+SDC+Duration+Pitch.]

    Figure 2 DET curve of fused system on NIST 2003 LRE (3-sec)

The SDC classifier captures low level acoustic information. The results show that it also contributes significantly to EER reduction across the board; however, the effect is more obvious in the 2003 LRE than in the 1996 LRE. As for the prosody based classifiers, we only see an effect in the 3-sec and 10-sec test cases.

5. CONCLUSIONS

We have proposed an effective ensemble method for LID. The ensemble fuses different levels of discriminative features. We have shown that different levels of information provide complementary language identification cues. It is found that the P-PRLM and bag-of-sounds features complement each other to fully exploit both local n-gram phonotactics and utterance-level collective phonotactic statistics; these two classifiers form the backbone of the ensemble. The spectral feature also consistently contributes to the LID tasks. It is found that fusing the lower level acoustic information and the high level phonotactic information greatly improves the overall system. We have also successfully integrated the prosodic features into the LID task. The experiment results show that even simple prosodic features such as pitch and phoneme duration are useful, especially for short speech segments. The performance of the proposed ensemble LID system on the NIST 1996 and 2003 LRE datasets is comparable with the best systems reported in the literature. The experiments in this paper also re-affirm, from a different angle, the findings in other reports [5] that spectral and phonotactic features are the most effective features for LID.

REFERENCES

[1] Y.K. Muthusamy, N. Jain and R.A. Cole, "Perceptual benchmarks for automatic language identification," Proc. ICASSP 1994, pp. 333-336.
[2] M.A. Zissman, "Comparison of four approaches to automatic language identification of telephone speech," IEEE Transactions on Speech and Audio Processing, 4(1), pp. 31-44, 1996.
[3] A.E. Thymé-Gobbel and S.E. Hutchins, "On using prosodic cues in automatic language identification," Proc. ICSLP 1996, Philadelphia, USA, October 1996.
[4] D. Matrouf, M. Adda-Decker, L. Lamel and J. Gauvain, "Language identification incorporating lexical information," Proc. ICSLP 1998, Sydney, Australia, December 1998.
[5] E. Singer, P.A. Torres-Carrasquillo, T.P. Gleason, W.M. Campbell and D.A. Reynolds, "Acoustic, phonetic, and discriminative approaches to automatic language identification," Proc. Eurospeech 2003, pp. 1345-1348, September 2003.
[6] T.J. Hazen and V.W. Zue, "Recent improvements in an approach to segment-based automatic language identification," Proc. ICSLP 1994.
[7] P.A. Torres-Carrasquillo et al., "Approaches to language identification using Gaussian mixture models and shifted delta cepstral features," Proc. ICASSP 2002.
[8] B. Ma and H. Li, "A phonotactic-semantic paradigm for automatic spoken document classification," Proc. SIGIR 2005, Salvador, Brazil, August 15-19, 2005.
[9] H. Li and B. Ma, "A phonotactic language model for spoken language identification," Proc. ACL 2005, Ann Arbor, USA, 2005.
[10] Linguistic Data Consortium (LDC), "The CallFriend corpora," http://www.ldc.upenn.edu/Catalog/byType.jsp#speech.telephone
[11] Language identification corpus of the Institute for Infocomm Research.
[12] H.-C. Wang, "MAT: a project to collect Mandarin speech data through networks in Taiwan," Int. J. Comput. Linguistics Chinese Language Process., 1(2), pp. 73-89, February 1997.
[13] OGI-TS corpus, http://cslu.cse.ogi.edu/corpora/corpCurrent.html
[14] SVM-light, http://svmlight.joachims.org/
[15] L. Ferrer et al., "Modeling duration patterns for speaker recognition," Proc. Eurospeech 2003, Geneva, pp. 2017-2020, 2003.
[16] D.A. Reynolds et al., "The 2004 MIT Lincoln Laboratory speaker recognition system," Proc. ICASSP 2005.
[17] B. Ma, D. Zhu and R. Tong, "Chinese dialect identification using tone features based on pitch flux," submitted to ICASSP 2006.
