Significance of neural phonotactic models for
large-scale spoken language identification
Brij Mohan Lal Srivastava, Hari Vydana, Anil Kumar Vuppala, and Manish Shrivastava
Language Technology Research Center
International Institute of Information Technology, Hyderabad, India
{brijmohanlal.s, hari.vydana}@research.iiit.ac.in
{anil.vuppala, m.shrivastava}@iiit.ac.in
Abstract—Language identification (LID) is a vital front-end for spoken dialogue systems operating in diverse linguistic settings, where it reduces recognition and understanding errors. Existing LID systems that use low-level signal information for classification do not scale well, since their parameters grow exponentially as the number of classes increases, and they suffer performance degradation due to the inherent variabilities of the speech signal. In the proposed approach, we model the language-specific phonotactic information in speech using recurrent neural networks to develop an LID system. The input speech signal is tokenized into phone sequences using a common language-independent phone recognizer with varying phonetic coverage, and we establish a causal relationship between phonetic coverage and LID performance. The phonotactics of the observed phone sequences are modeled using statistical and recurrent neural network language models that predict language-specific symbols from a universal phonetic inventory. The proposed approach is robust, computationally lightweight, and highly scalable. Experiments show that the convex combination of statistical and recurrent neural network language model (RNNLM) phonotactic models significantly outperforms a strong Deep Neural Network (DNN) baseline, which itself has been shown to surpass i-vector based approaches for LID. The proposed approach outperforms the baseline models in terms of mean F1 score over 176 languages. Further, we provide information-theoretic evidence to analyze the mechanism of the proposed approach.

I. INTRODUCTION

Language identification (LID) refers to the task of automatically identifying the language of a spoken utterance from the speech signal. An LID system is a vital module for a wide range of multilingual applications such as call centers, multilingual spoken dialog systems, emergency services, and speech-to-speech translation systems. Human-computer interaction through speech can reach deeper into society if the interaction is available in multiple native languages, and for that purpose an LID system is a preliminary requirement [1], [2]. For the task of developing an LID system, recent works have focused on algorithms that extract suitable features for language identification. I-vector based features have been explored and have exhibited better performance than conventional spectral features such as Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC), and shifted delta cepstral coefficients (SDC) in NIST evaluations for speaker and language recognition tasks [3]. Neural networks have also been employed as feature extractors to compute stacked bottleneck features for LID [4], and [5] shows that DNNs can perform better than i-vector based approaches. Multilingual bottleneck features, and multilingual tandem bottleneck features obtained by stacking SDCs with the corresponding bottleneck features, are exploited in [6], [7]. All the approaches mentioned above rely on the language-discriminative capability present in lower-level acoustic information together with some contextual information in the time neighborhood; the better performance of i-vectors and bottleneck features compared to conventional spectral features can be attributed to their better context modeling capability [3], [4], [6], [7]. Systems that model the language-discriminative information at a higher level (phones, phone frequencies, and phonotactics) have exhibited reliably better performance [8], [9]. The very first attempt to use language-discriminative phonotactic information for LID was made by [10], who proposed the parallel phone recognizer followed by language model (PPRLM) architecture, wherein language-dependent phone recognizers are operated in parallel for every language to decode the test utterance, followed by language models with large numbers of uni-gram and bi-gram counts that capture the spoken language identity. Though PPRLM is quite effective, decoding the test utterance with each language's phone recognizer is computationally expensive, which makes scalability a major issue for PPRLM systems. Additionally, a language-independent phone recognizer followed by a language model (PRLM) is used in [10] to capture the phonotactics of 4 languages.

Recently, LID systems developed using DNN classifiers trained over low-level acoustic features have shown acceptable performance when the number of languages is small, but when the number of languages is large (176 classes in the present study) the performance of the DNN-based LID system degrades. This degradation can be attributed to the lack of sequential modeling: the DNN classifier does not capture the language-discriminative phonotactic information efficiently. Sequential models like recurrent neural networks (RNN) and long short-term memory networks (LSTM) have exhibited superior performance in phone classification on acoustic sequences due to their sequence modeling capability. For the task of LID, however, the language-discriminative relations span across the sequence of words, i.e., the sequential information needed for discriminating languages is longer than that needed for recognizing phones. While learning such long-duration cues, the performance of sequence learners like RNNs and LSTMs is degraded by vanishing and exploding gradient problems. [11] shows that an RNNLM coupled with PPRLM can outperform i-vector based approaches; their results are reported on 6 languages from the KALAKA-3 database [12].

To make the input sequence more amenable to sequence learners, we tokenize the input speech signal using a previously trained language-independent acoustic model; the tokens are the phonemes recognized by that model. The motive behind this transformation is to reduce the length of the input sequence so that the sequence classifiers can better discriminate languages by modeling long-term dependencies. The proposed approach is independent of the language of the acoustic model used, as the acoustic model is treated as a mere tokenizer rather than an accurate phone recognizer; i.e., we rely on the consistency of the acoustic model in tokenizing the input utterance rather than on its accuracy. We exploit the well-known fact that language-discriminative phonotactic information is present in the sequence of tokens (phones), and modeling this sequential information plays a vital role in building an LID system. An initial version of this study, describing a partial approach and experiments, has been published on arxiv.org as [13]. This paper significantly improves over the approach in [13] in terms of clarity of the idea, exhaustive experimentation, and a fair comparison of the proposed approach with state-of-the-art LID techniques.

In this paper, we propose an approach to capture higher-level language-discriminative information, which can serve as complementary information to existing LID systems that model low-level acoustic information. We explore the significance and limits of PRLM-based approaches for developing a large-scale LID system; as part of our experiments, we develop LID systems covering 176 different languages. The pipeline of our system includes a language-independent phone recognizer to generate phone sequences from the raw signal and two different language modeling approaches (SRILM [14] and RNNLM [15]) to capture the phonotactic information. Further, we use a convex combination to capture the best of both modeling techniques and achieve stability and robustness in recognition. As the proposed approach mostly relies on the long-duration phonotactic information of a language provided by the phonetic recognizer, it is more robust than low-level acoustic modeling approaches.

The main contributions of this paper are to study the following hypotheses:
1) A convex combination of phonotactic estimations can lead to highly robust and scalable PRLM-based LID systems.
2) A single language-independent phonetic recognizer can be tuned to generate language-discriminative token sequences by controlling its acoustic coverage.

The rest of the paper is organized as follows. Section 2 describes the dataset used for the experiments. The proposed approach is described in Section 3. Results, observations, and inferences are presented in Section 4. Conclusions and future scope are discussed in Section 5.

II. DATA DESCRIPTION

The language data made openly available by Topcoder (https://community.topcoder.com/tc?module=MatchDetails&rd=16555) as part of its Spoken Language Recognition challenge is used in this work. The dataset comprises recorded speech in 176 languages; refer to Figure 6 for the geographical extent of the languages present. It contains 375 labeled utterances per language, each of 10 seconds duration; each speech recording is given in a separate file, and only one language is spoken in each file. The available data is reorganized into training, testing, and validation sets. For training the SRILM n-grams, 330 utterances were used for developing the language models and the remaining 45 utterances were used for testing. For the RNNLMs, 300 utterances were used for training, 30 for validation, and 45 for testing. The data provided has speech recordings in mp3 format, which are converted to WAV format with a 16 kHz sampling rate. The repository at https://github.com/brijmohan/lid-convex-comb will be updated with the code and feature sets used in this study.

III. PROPOSED APPROACH

In this approach, we rely on higher-level phonotactic information extracted from speech to develop an LID system. To extract this information, the input speech signal is tokenized by decoding it with a language-independent acoustic model and a phonetic language model whose probability mass is distributed uniformly over all phones, i.e., all phones have equal emission probability. Though the phone recognizer is independent of the language being decoded, we assume that similar-sounding acoustic patterns will be decoded as approximately the same phone labels. By using a language-independent phone recognizer, we rely on the consistency of the phone recognizer rather than its accuracy; i.e., similar acoustic sounds will be tagged with the same phone label regardless of the language. According to [8], the statistical patterns present in the resulting phone sequence must contain discriminative information for the corresponding language. N-grams and recurrent neural networks (RNN) are employed to model these statistical patterns, which convey the language-discriminative information in the tokenized phone sequence. The block diagram of the proposed approach is presented in Fig. 1. The neural representation of phonotactics is computed by training an RNN to generate/predict the next acoustic tag given a truncated sequence of past tags; thus an independent RNN learns a generative model for each of the 176 languages. The best performing models use just 30 neurons in the hidden layer, and the vocabulary size equals the number of phones used. This amounts to a tiny classification model: fewer than a quarter of a million parameters for 176 classes, compared to conventional models that require millions of parameters to classify far fewer classes.
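To make this concrete, the sketch below shows one such per-language next-phone model in the spirit of the RNNLM toolkit [15]: a single 30-unit recurrent layer over a phone vocabulary, scored by sequence log-likelihood. This is a minimal illustration under our own assumptions (plain PyTorch, hypothetical names and language codes), not the authors' implementation.

```python
# Minimal sketch (not the authors' code): one tiny next-phone RNN per
# language, mirroring the paper's 30-unit hidden layer and a vocabulary
# equal to the phone set. PyTorch is assumed purely for illustration.
import torch
import torch.nn as nn

class PhoneRNNLM(nn.Module):
    """Predicts the next phone tag given the sequence of past tags."""
    def __init__(self, n_phones, hidden=30):
        super().__init__()
        self.emb = nn.Embedding(n_phones, hidden)
        self.rnn = nn.RNN(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_phones)

    def forward(self, seq):                    # seq: (batch, time) phone ids
        h, _ = self.rnn(self.emb(seq))
        return self.out(h)                     # logits over the next phone

def sequence_log_prob(model, seq):
    """Log-likelihood of a phone sequence under one language's model."""
    inp, tgt = seq[:, :-1], seq[:, 1:]
    logp = torch.log_softmax(model(inp), dim=-1)
    return logp.gather(-1, tgt.unsqueeze(-1)).sum().item()

# One independent generative model per language, as in the paper.
N_PHONES = 40                                  # size of the US English phone set
models = {lang: PhoneRNNLM(N_PHONES) for lang in ("eng", "hin", "tel")}
```

At test time, each language's model scores the decoded phone sequence of the utterance, and the per-language scores feed the combination rule of Eq. (1) below.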
The goal of the language-independent phone recognizer is to provide maximum coverage of the phone units present in all the languages for which the system is being developed. We employ CMU Sphinx [16] as the front-end phone recognizer, which uses an HMM-based phone decoder on the speech signal. A phonetically tied-mixture (PTM) model is used for efficient decoding; it contains 256 mixture components per state and assigns different mixture weights to the shared states of triphones. This model provides a good balance between speed and accuracy, and since it can be trained over huge amounts of data, it gives decent decoding results in under real time. We use the US English phone set with 40 phones and an unbiased phonetic language model for decoding.

We assume that there exists a universal set of phonetic units U_p encompassing all known languages, wherein a particular language l comprises a subset of phonetic units P_l \subseteq U_p. The discriminative information between two languages l and m is encoded by the difference of their subsets, i.e., P_l - P_m, as well as by the interactions/transitions between phonetic units within the subsets. The interaction of phonetic units within a subset is measured as the probability of transition between these units, using either n-grams or recurrent neural networks.

The phone recognizer can be improved by training over multiple languages, which will certainly increase the coverage of common phonetic and acoustic patterns. We would also like to experiment with phonetic units that do not rely on languages, such as articulatory gestures obtained in an unsupervised manner from data, as proposed in [17]. The acoustic models used in this study are trained on US English and French speech data; English and the other languages of its (West Germanic) family are not present in the dataset, which makes English highly suitable as an unbiased tokenizer for this study. SRILM n-grams (uni-grams to 6-grams) and RNNLM have been used in the experiments to model the statistical patterns in the phone sequences.
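The paper builds the n-gram side with SRILM [14]; purely as an illustration of what a per-language phone n-gram model computes, the following stand-in uses add-one smoothing (a deliberate simplification; SRILM's actual smoothing options are more sophisticated), and the class name is our own.

```python
# Hypothetical stand-in for the per-language phone n-gram models built
# with SRILM [14]; add-one smoothing is a simplification for clarity.
from collections import defaultdict
from math import log

class PhoneNgramLM:
    def __init__(self, n=3, vocab_size=40):
        self.n, self.v = n, vocab_size
        self.gram = defaultdict(int)     # counts of full n-grams
        self.ctx = defaultdict(int)      # counts of (n-1)-gram contexts

    def train(self, phone_seqs):
        for seq in phone_seqs:
            padded = ["<s>"] * (self.n - 1) + list(seq) + ["</s>"]
            for i in range(len(padded) - self.n + 1):
                g = tuple(padded[i:i + self.n])
                self.gram[g] += 1
                self.ctx[g[:-1]] += 1

    def log_prob(self, seq):
        """Add-one smoothed log-probability of a decoded phone sequence."""
        padded = ["<s>"] * (self.n - 1) + list(seq) + ["</s>"]
        lp = 0.0
        for i in range(len(padded) - self.n + 1):
            g = tuple(padded[i:i + self.n])
            lp += log((self.gram[g] + 1) / (self.ctx[g[:-1]] + self.v))
        return lp
```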
The sequence of steps in the proposed approach is described below.

Training Phase:
• Decode the input speech using the acoustic model and obtain the corresponding phone sequences.
• Build a language model per language using the decoded phone sequences from the training data.

Testing Phase:
• Decode the test utterance using the acoustic model and obtain the phone sequences.
• Find the probability of the observed phone sequence under each language model.
• Combine the probabilities using the learnt convex combination (⊕) coefficient.
• Infer the language label as the label of the model exhibiting the highest score, according to Equation (1):

\hat{L} = \arg\max_{l} \; P_{RNN}(l) \oplus P_{ngrams}(l) = \arg\max_{l} \left[ a \cdot P_{RNN}(l) + (1 - a) \cdot P_{ngrams}(l) \right] \qquad (1)

Here a is the convex combination coefficient.
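A sketch of this decision rule follows: the per-language log-likelihoods from the two model families are normalized into the probability vectors P_RNN and P_ngrams of Eq. (1) and mixed with a single coefficient a. The softmax normalization and the grid-search tuning of a on validation data are our assumptions; the paper only states that the coefficient is learnt.

```python
# Sketch of Eq. (1): mix the per-language scores of the RNNLM and the
# n-gram models with one convex coefficient a, then take the argmax.
import numpy as np

def posteriors(loglikes):
    """Softmax over per-language log-likelihoods (assumed normalization)."""
    x = np.asarray(loglikes, dtype=float)
    x -= x.max()                         # numerical stability
    p = np.exp(x)
    return p / p.sum()

def identify(rnn_ll, ngram_ll, langs, a):
    p = a * posteriors(rnn_ll) + (1.0 - a) * posteriors(ngram_ll)
    return langs[int(np.argmax(p))]

def tune_a(dev, grid=np.linspace(0.0, 1.0, 21)):
    """Grid-search a on held-out utterances; each item of `dev` is a
    (rnn_ll, ngram_ll, true_lang, langs) tuple."""
    acc = lambda a: np.mean([identify(r, g, L, a) == t for r, g, t, L in dev])
    return max(grid, key=acc)
```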
IV. EXPERIMENTS, RESULTS & DISCUSSION

A. Baseline methods

In order to objectively study the performance of various LID systems, we implemented several baseline systems; their results are tabulated in Table I. 39-dimensional MFCC features with a frame size of 25 ms and a frame shift of 10 ms are extracted; each frame is stacked with the features of the ±10 frames around it, and the resulting 819-dimensional feature vector is used for further analysis. GMM, SVM, and DNN classifiers have been employed for developing the baseline LID systems. The DNN-based LID system is implemented as described in [5]; the hyperparameters are tuned accordingly and the results of the best performing system are reported. This DNN baseline performs better than i-vector based LID systems.

TABLE I
Performance of the baseline LID systems developed to classify 176 languages. The Min and Max columns show the scores of the worst and best performing languages for each model; the last column shows the mean F1 score over 176 languages.

System                            Min   Max   Mean F1%
GMM 8-mixtures                      1    35          6
GMM 16-mixtures                     1    38          8
GMM 32-mixtures                     2    41         10
GMM 64-mixtures                     2    43         12
Kernel SVM                          5    37         13
DNN (8 layers, dropout = 0.2)      31    72         53

From Table I it can be observed that the performance of the baseline LID systems developed for 176 languages is very poor. This poor performance leads to two inferences: a) these approaches do not scale well to a large number of classes, and b) they do not capture the language-discriminative phonotactic information vital for classification. Even though the input feature carries contextual information of about 21 frames in the time neighborhood, obtained by stacking the features, these approaches failed to capture the required information.

The performance of the proposed approach is presented in Table II, which follows the same column layout as Table I. The first six rows use SRILM n-gram language models to model the statistical patterns in the decoded phone sequences, and the next three rows use RNNLMs.
Fig. 1. Block diagram of the proposed method. The ⊕ symbol indicates the convex combination of probability vectors.
TABLE II
Performance of the proposed LID systems developed to classify 176 languages using the US English acoustic model. The following parameters are modified to tune the RNNLM: c = number of classes, h = number of units in the hidden layer, b = number of steps for BPTT.

System                               Min     Max     Mean F1%
1-gram                               43.79   60.00   46.53
2-gram                               77.77   83.61   81.77
3-gram                               84.07   88.88   85.45
4-gram                               81.11   86.66   83.94
5-gram                               77.77   86.66   83.15
6-gram                               77.77   86.66   82.98
RNNLM 1-c                            83.13   93.33   84.42
RNNLM 6-c                            84.44   90.58   87.69
RNNLM 100-c                          82.22   89.44   85.38
RNNLM 1-c ⊕ 3-gram (convex comb.)    88.44   91.24   89.36

A combination of both approaches has also been explored; its performance is presented in the last row of Table II.

From Table II it can be observed that, owing to the use of phonotactic information, the performance of the LID system is significantly higher than that of the baseline methods in Table I. It can also be noted that even though the number of languages is large, the high-level phonotactics exhibit better discriminability for building LID systems. Similar observations can be made from the confusion matrices of the DNN LID system and the proposed RNNLM-based LID system. Along with its superior performance, the proposed system exhibits consistent performance across languages, rather than the performance skewed toward certain languages shown by all the baseline approaches.

Figure 2 shows the confusion matrices obtained for the best performing RNNLM and DNN models, using a grayscale colormap to display the intensity of classification. The colormap of the plot is remapped onto a power-law relationship (i.e., y = x^γ), hence the skewed colorbar; γ is set to 1/3 in order to highlight mis-classifications, which occur with very low frequency in the proposed model. Darker cells correspond to highly correlated classes, and ideally the principal diagonal should be darkest. It can be observed that the LID performance remains consistent across all languages in the case of the RNNLM, while the DNN-based LID system is inferior to the RNNLM-based one.

Fig. 2. Classification confusion matrices for the best performing DNN (8 layers with 0.2 dropout after each layer) (left) and the best performing RNNLM ⊕ 3-gram (phone set: US English) (right) over the test dataset for 176 languages. The scale of the colormap is skewed appropriately for better visualization.

To study the influence of the front-end acoustic model on the LID system, we also used an acoustic model trained on French data with 34 phonemes; its performance is reported in Table III. Though the proposed approach with the French acoustic model as tokenizer performs better than the baseline approaches, it is clearly worse than the LID system with the US English acoustic model. This can be attributed to the fact that the US English model has 40 phones, which gives better acoustic/phonetic coverage than the French model with just 34 phones. The process of tokenizing the input utterance can be treated as quantization of the acoustic sequence into a sequence of tokens (phones); if the number of phones is not sufficiently high, the complexity of the statistical patterns in the phone sequence becomes intractable for the classifier to disentangle, and the LID performance degrades.

The performance of the LID systems developed during the study is compared in Table IV.

TABLE IV
Comparison of the maximum performance of each LID system.

System     GMM   SVM   DNN   3-gram   RNNLM   RNNLM ⊕ 3-gram
Mean F1%    12    13    53    85.45    87.69    89.36
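As an implementation note for the confusion matrices of Fig. 2 above: the power-law remapping y = x^(1/3) corresponds to a gamma-corrected color norm, which in matplotlib is available as PowerNorm. The rendering library is our assumption; the paper does not name one.

```python
# Rendering a confusion matrix with the gamma = 1/3 power-law colormap
# described for Fig. 2; matplotlib's PowerNorm is assumed as the remap.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import PowerNorm

C = np.random.rand(176, 176)        # placeholder confusion matrix
plt.imshow(C, cmap="gray_r", norm=PowerNorm(gamma=1/3))
plt.colorbar()                      # the colorbar ticks appear skewed
plt.savefig("confusion_matrix.png")
```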
TABLE III
Language identification performance (mean F1%) for each language model using the French phone set. This table shows that phonetic coverage is vital for LID: the French phone set performs worse than English due to its inferior phonetic coverage. The following parameters are modified to tune the RNNLM: c = number of classes, h = number of units in the hidden layer, b = number of steps for BPTT.

System                  Min     Max     Mean F1%
RNNLM 1-c               69.21   78.88   71.35
RNNLM 6-c               71.30   82.22   73.53
RNNLM 50-c 100-h 2-b    74.50   84.44   76.56
RNNLM 50-c 100-h 3-b    74.42   82.22   76.46
RNNLM 50-c 100-h 4-b    74.60   86.66   76.76
RNNLM 50-c 200-h 1-b    75.06   82.71   77.11

From Table IV it can be observed that the proposed approaches, which rely on higher-level acoustic features, perform significantly better than approaches relying on low-level acoustic features when the LID system must scale to a large number of languages. The significance of the proposed approach is most clearly seen when the number of language classes is high.

Figure 3 shows the performance range of the best tuned model for each phonotactics estimation method. The points depict the mean performance of each model, and the vertical bars display the asymmetric variance of performance across languages. We observe that the low-level acoustic methods show hugely varying results, whereas the higher-level features are consistent across languages. Note that the combination model scales gracefully with the number of languages due to the augmentation of complementary information.

Fig. 3. Range of mean F1 for each estimation method. The vertical bar at each point indicates the asymmetric distribution of per-language F1 values around the mean. N-grams and RNNLM achieve consistency across languages, whereas the other baseline techniques display highly skewed F1. The high mean and low variance of ConPT (the convex combination) show that this model scales consistently and gracefully to a large number of language classes: the expected value shifts toward the ensemble average of two stochastic processes with complementary information, in this case the phonotactics estimated using n-grams and RNNLM.

B. Information-theoretic analysis of phonotactic models

Due to the fixed vocabulary of phonetic units recognized by the language-independent phonetic recognizer described in Section III, the estimated phonotactics of each language are fundamentally a discrete probability distribution over an identical domain, composed of a sample space of common phonetic uni-, bi-, and tri-grams. For an optimal classification of languages, these probability manifolds must be distinct. According to philological classification, languages can be grouped into language families based on similar phonetic and phonological structure. We claim that, since the probability manifolds are distributed over acoustic/phonetic tags generated by a consistent recognizer, there must exist some similarity between languages belonging to a similar ethnicity or language family; at the very least, the probability distributions of languages belonging to the same family must be closer to each other than to those of other languages. In order to ascertain this claim, we calculated the pairwise Bhattacharyya coefficient [18] and plotted the similarity matrix G shown in Figure 4. The Bhattacharyya coefficient between two phonotactic models p and q over a set of common n-grams X is calculated as

BC(p, q) = \sum_{x \in X} \sqrt{p(x) \, q(x)} \qquad (2)
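Computing Eq. (2) for all language pairs is straightforward once each phonotactic model is flattened into a probability vector over the shared n-gram sample space; a minimal sketch (the vectorized form is our own):

```python
# Pairwise Bhattacharyya coefficients (Eq. 2) between phonotactic models,
# each represented as a probability vector over the common n-gram space.
import numpy as np

def bhattacharyya(p, q):
    """BC(p, q) = sum_x sqrt(p(x) * q(x)) for two discrete distributions."""
    return float(np.sum(np.sqrt(p * q)))

def similarity_matrix(P):
    """P: (n_langs, n_ngrams) array whose rows are distributions summing
    to 1. Returns G with G[i, j] = BC(P[i], P[j]), vectorized as
    sqrt(P) @ sqrt(P).T."""
    S = np.sqrt(P)
    return S @ S.T
```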
We use Laplacian eigenmaps [19] to generate 2D embeddings of the phonotactic models using G as input. We further cluster these embeddings using the affinity propagation clustering algorithm [20]. Figure 5 shows a 2D plot of the language models with cluster centroids connected to the member languages through edges. We noticed that the centroid languages belong to disparate ethnographies and several member languages share ethnographical similarities. For instance, the cluster with Bhojpuri as the centroid language (refer to Figure 5) contains 7 of the 9 Indian languages present in the dataset, including Hindi. This observation provokes the need for a detailed linguistic analysis, which we reserve for future work. Figure 6 presents a world map depicting the 10 most prominent clusters discovered using the Bhattacharyya distance between language models. To analyze the ethnographic correlation of our statistical models, individual languages are now represented by their language family; each cluster is then described by the percentage distribution of its language families, as given in Table V. It is evident from this map that our models, clustered in Bhattacharyya space using affinity propagation, closely resemble the ethnographic clustering of languages into language families that are close in geographical location. This evidence indicates that LID models derived using the phonotactic information in spoken utterances are influenced by the phenomena of language contact, and that language-specific characteristics such as phonological variations are vital for distinguishing between languages.
Fig. 4. Pairwise similarity matrix depicting the Bhattacharyya coefficient between each pair of language models. Darker shades indicate higher correlation. The colormap is skewed to highlight lower values.

TABLE V
Clusters obtained after projecting each language model into Bhattacharyya space. Numbers in parentheses indicate the percentage distribution of the cluster over language families.

Cluster No.   Language families with percentage composition
Cluster 1     Austro-Asiatic (55.5), Niger-Congo (11.1), Sino-Tibetan (11.1), Tai-Kadai (22.2)
Cluster 2     Niger-Congo (72.2), Afro-Asiatic (11.1), Creole (5.6), Nilo-Saharan (11.1)
Cluster 3     Mayan (63.6), Totonacan (9.1), Uto-Aztecan (27.3)
Cluster 4     Turkic (50), Mongolic (25), Chipaya-Uru (12.5), Indo-European (12.5)
Cluster 5     Sino-Tibetan (66.7), Koreanic (16.7), Hmong-Mien (16.7)
Cluster 6     Austronesian (62.5), West Papuan (37.5)
Cluster 7     Nilo-Saharan (33.3), Afro-Asiatic (13.3), Otomanguean (13.3), Niger-Congo (13.3), Arauan (6.7), Cariban (6.7), Chibchan (6.7), Language isolate (6.7)
Cluster 8     Jivaroan (20), Totonacan (20), Austronesian (10), Otomanguean (10), Huavean (10), Quechuan (10), Paezan (10), Jicaquean (10)
Cluster 9     Indo-European (36.4), Dravidian (27.3), Austro-Asiatic (9.1), Austronesian (9.1), Zamucoan (9.1), Barbacoan (9.1)
Cluster 10    Afro-Asiatic (50), Creole (25), Trans-New Guinea (25)

Fig. 5. 2D plot of the phonotactic models of the languages and their corresponding clusters. Central dots represent cluster centroids and the neighboring dots represent the member languages of each cluster. Member language tags are omitted to maintain readability.

Fig. 6. Geographical distribution of the 10 language clusters based on the language families they contain. Notice that clusters often contain languages belonging to language families that are close in geographical region. Details of each cluster are given in Table V.

V. CONCLUSION & FUTURE WORK

In this work, we develop an LID system for 176 languages. The performance of LID systems that rely on low-level acoustic sequences degrades drastically as the number of language classes increases (176 in the present case). In our approach, we instead rely on the phonotactic information obtained using a language-independent acoustic model, an unbiased phonetic language model, and statistical sequence learners such as recurrent neural networks. As the proposed approach relies on phonotactic information, it can serve as complementary information to approaches that draw language information from low-level acoustic features. Experiments show that the proposed approach is computationally efficient and scales gracefully to a large number of language classes. A convex combination of n-gram and RNNLM-based language models has shown
noteworthy potential for developing large-scale LID systems.

Currently, the language models are developed independently of each other. Future tasks include developing linearly and non-linearly interpolated language models for LID, building acoustic models specifically for LID systems, and language model pruning. Low phonetic coverage appears to be a reason for the low recognition accuracy of some languages; hence a more generalized phone recognizer with large phonetic coverage could be developed specifically for LID systems. A phonetic inventory learnt in an unsupervised manner directly from data is also a potential future extension. We also wish to exploit the polyglot neural language models of Tsvetkov et al. [21].

Some initial experiments with the probability distributions of the phonotactic models revealed an interesting correspondence between ethnologic language families and the clusters obtained in the embedding space of Laplacian eigenmaps. We will further conduct a detailed correlation study between the information-theoretic models and the ethnological classes of languages to draw insights related to the similarity of language groups.

REFERENCES

[1] C.-H. Lee, "Principles of spoken language recognition," in Springer Handbook of Speech Processing. Springer, 2008, pp. 785-796.
[2] H. Li, B. Ma, and K. A. Lee, "Spoken language recognition: from fundamentals to practice," Proceedings of the IEEE, vol. 101, no. 5, pp. 1136-1159, 2013.
[3] A. McCree and D. Garcia-Romero, "DNN senone MAP multinomial i-vectors for phonotactic language recognition," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[4] P. Matejka, L. Zhang, T. Ng, H. Mallidi, O. Glembek, J. Ma, and B. Zhang, "Neural network bottleneck features for language identification," Proc. IEEE Odyssey, pp. 299-304, 2014.
[5] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, "Automatic language identification using deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 5337-5341.
[6] R. Fér, P. Matějka, F. Grézl, O. Plchot, and J. Černocký, "Multilingual bottleneck features for language recognition," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[7] W. Geng, J. Li, S. Zhang, X. Cai, and B. Xu, "Multilingual tandem bottleneck feature for language identification," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[8] M. A. Zissman and K. M. Berkling, "Automatic language identification," Speech Communication, vol. 35, no. 1, pp. 115-124, 2001.
[9] K. Hingkeung and K. Hirose, "N-gram modeling based on recognized phonemes in automatic language identification," IEICE Transactions on Information and Systems, vol. 81, no. 11, pp. 1224-1231, 1998.
[10] M. Zissman, E. Singer et al., "Automatic language identification of telephone speech messages using phoneme recognition and n-gram modeling," in Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, vol. 1. IEEE, 1994, pp. I-305.
[11] C. Salamea, L. F. D'Haro, R. de Córdoba, and R. San-Segundo, "On the use of phone-gram units in recurrent neural networks for language identification," Odyssey 2016, pp. 117-123, 2016.
[12] L. J. Rodríguez-Fuentes, M. Penagarikano, A. Varona, M. Diez, and G. Bordel, "KALAKA-3: a database for the assessment of spoken language recognition technology on YouTube audios," Language Resources and Evaluation, pp. 1-23.
[13] B. M. L. Srivastava, H. K. Vydana, A. K. Vuppala, and M. Shrivastava, "A language model based approach towards large scale and lightweight language identification systems," CoRR, vol. abs/1510.03602, 2015. [Online]. Available: http://arxiv.org/abs/1510.03602
[14] A. Stolcke et al., "SRILM: an extensible language modeling toolkit," in Interspeech, vol. 2002, 2002, p. 2002.
[15] T. Mikolov, S. Kombrink, A. Deoras, L. Burget, and J. Cernocky, "RNNLM: recurrent neural network language modeling toolkit," in Proc. of the 2011 ASRU Workshop, 2011, pp. 196-201.
[16] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, A. Rudnicky et al., "Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices," in Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, vol. 1. IEEE, 2006, pp. I-I.
[17] B. M. L. Srivastava and M. Shrivastava, "Articulatory gesture rich representation learning of phonological units in low resource settings," in Statistical Language and Speech Processing, Fourth International Conference, SLSP 2016, vol. 4. Springer, 2016.
[18] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bulletin of the Calcutta Mathematical Society, vol. 35, pp. 99-109, 1943.
[19] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, vol. 15, no. 6, pp. 1373-1396, 2003.
[20] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972-976, 2007.
[21] Y. Tsvetkov, S. Sitaram, M. Faruqui, G. Lample, P. Littell, D. Mortensen, A. W. Black, L. Levin, and C. Dyer, "Polyglot neural language models: A case study in cross-lingual phonetic representation learning," arXiv preprint arXiv:1605.03832, 2016.