Significance of neural phonotactic models for
large-scale spoken language identification
Brij Mohan Lal Srivastava, Hari Vydana, Anil Kumar Vuppala, and Manish Shrivastava
Language Technology Research Center
International Institute of Information Technology, Hyderabad, India
{brijmohanlal.s, hari.vydana}@research.iiit.ac.in
{anil.vuppala, m.shrivastava}@iiit.ac.in
Abstract—Language identification (LID) is a vital front-end for spoken dialogue systems operating in diverse linguistic settings, where it reduces recognition and understanding errors. Existing LID systems that use low-level signal information for classification do not scale well, since their parameters grow exponentially as the number of classes increases, and they suffer performance degradation due to the inherent variabilities of the speech signal. In the proposed approach, we model the language-specific phonotactic information in speech using recurrent neural networks to develop an LID system. The input speech signal is tokenized into phone sequences using a common language-independent phone recognizer with varying phonetic coverage, and we establish a causal relationship between phonetic coverage and LID performance. The phonotactics of the observed phone sequences are modeled using statistical and recurrent neural network language models that predict language-specific symbols from a universal phonetic inventory. The proposed approach is robust, computationally lightweight, and highly scalable. Experiments show that the convex combination of statistical and recurrent neural network language model (RNNLM) phonotactic models significantly outperforms a strong Deep Neural Network (DNN) baseline, which itself has been shown to surpass i-vector based approaches for LID. The proposed approach outperforms the baseline models in terms of mean F1 score over 176 languages. Further, we provide information-theoretic evidence to analyze the mechanism of the proposed approach.

I. INTRODUCTION

Language identification (LID) refers to the task of automatically identifying the language of a spoken utterance from the speech signal. An LID system is a vital module for a wide range of multilingual applications such as call centers, multilingual spoken dialog systems, emergency services, and speech-to-speech translation systems. Human-computer interaction through speech can reach deeper into society if the interaction is available in multiple native languages, and for that purpose an LID system is a preliminary requirement [1], [2]. For the task of developing an LID system, recent works have focused on algorithms that extract suitable features for language identification. I-vector based features have been explored and have exhibited better performance than conventional spectral features such as Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC), and shifted delta cepstral coefficients (SDC) in NIST evaluations for speaker and language recognition tasks [3]. Neural networks have also been employed as feature extractors to compute stacked bottleneck features for LID [4], and [5] shows that DNNs can perform better than i-vector based approaches. Multilingual bottleneck features, and multilingual tandem bottleneck features obtained by stacking SDCs with the corresponding bottleneck features, are exploited in [6], [7]. All the approaches mentioned above rely on the language-discriminative capability present in lower-level acoustic information together with some contextual information in the time neighborhood; the better performance of i-vectors and bottleneck features compared to conventional spectral features can be attributed to their better context modeling capability [3], [4], [6], [7]. Systems that model the language-discriminative information at a higher level (phones, phone frequencies, and phonotactics) have exhibited reliably better performance [8], [9]. The very first attempt to use language-discriminative phonotactic information for LID was made by [10], who proposed the parallel phone recognizer followed by language model (PPRLM) architecture, wherein language-dependent phone recognizers are operated in parallel for every language to decode the test utterance, followed by language models with large numbers of uni-gram and bi-gram counts that capture the spoken language identity. Though PPRLM is quite effective, decoding the test utterance with each language's phone recognizer is computationally expensive, which makes scalability a major issue for PPRLM systems. Additionally, a language-independent phone recognizer followed by a language model (PRLM) is used in [10] to capture the phonotactics of 4 languages.

Recently, LID systems developed using DNN classifiers trained over low-level acoustic features have shown acceptable performance when the number of languages is small, but when the number of languages is large (176 classes in the present study) the performance of the DNN-based LID system degrades. This degradation can be attributed to the lack of sequential modeling: the DNN classifier does not capture the language-discriminative phonotactic information efficiently. Sequential models like recurrent neural networks (RNN) and long short-term memory networks (LSTM) have exhibited superior performance in phone classification on acoustic sequences due to their sequence modeling capability. For the task of LID, however, the language-discriminative relations span across the sequence of words, i.e., the sequential information needed for discriminating languages is longer than that needed for recognizing phones. While learning such long-duration cues, the performance of sequence learners like RNNs and LSTMs is degraded by vanishing and exploding gradient problems. [11] shows that an RNNLM coupled with PPRLM can outperform i-vector based approaches; their results are reported on 6 languages from the KALAKA-3 database [12].

To make the input sequence more amenable to sequence learners, we tokenize the input speech signal using a previously trained language-independent acoustic model; the tokens are the phonemes recognized by that model. The motive behind this transformation is to reduce the length of the input sequence so that the sequence classifiers can better discriminate languages by modeling long-term dependencies. The proposed approach is independent of the language of the acoustic model used, as the acoustic model is treated as a mere tokenizer rather than an accurate phone recognizer; i.e., we rely on the consistency of the acoustic model in tokenizing the input utterance rather than on its accuracy. We exploit the well-known fact that language-discriminative phonotactic information is present in the sequence of tokens (phones), and modeling this sequential information plays a vital role in building an LID system. An initial version of this study, describing a partial approach and experiments, has been published on arxiv.org as [13]. This paper significantly improves over the approach in [13] in terms of clarity of the idea, exhaustive experimentation, and a fair comparison of the proposed approach with state-of-the-art LID techniques.

In this paper, we propose an approach to capture higher-level language-discriminative information, which can serve as complementary information to existing LID systems that model low-level acoustic information. We explore the significance and limits of PRLM-based approaches for developing a large-scale LID system; as part of our experiments, we develop LID systems covering 176 different languages. The pipeline of our system includes a language-independent phone recognizer to generate phone sequences from the raw signal and two different language modeling approaches (SRILM [14] and RNNLM [15]) to capture the phonotactic information. Further, we use a convex combination to capture the best of both modeling techniques and achieve stability and robustness in recognition. As the proposed approach mostly relies on the long-duration phonotactic information of a language provided by the phonetic recognizer, it is more robust than low-level acoustic modeling approaches.

The main contributions of this paper are to study the following hypotheses:
1) A convex combination of phonotactic estimations can lead to highly robust and scalable PRLM-based LID systems.
2) A single language-independent phonetic recognizer can be tuned to generate language-discriminative token sequences by controlling its acoustic coverage.

The rest of the paper is organized as follows. Section 2 describes the dataset used for the experiments. The proposed approach is described in Section 3. Results, observations, and inferences are presented in Section 4. Conclusions and future scope are discussed in Section 5.

II. DATA DESCRIPTION

The language data made openly available by Topcoder (https://community.topcoder.com/tc?module=MatchDetails&rd=16555) as part of its Spoken Language Recognition challenge is used in this work. The dataset comprises recorded speech in 176 languages; refer to Figure 6 for the geographical extent of the languages present. It contains 375 labeled utterances per language, each of 10 seconds duration; each speech recording is given in a separate file, and only one language is spoken in each file. The available data is reorganized into training, testing, and validation sets. For training the SRILM n-grams, 330 utterances were used for developing the language models and the remaining 45 utterances were used for testing. For the RNNLMs, 300 utterances were used for training, 30 for validation, and 45 for testing. The data provided has speech recordings in mp3 format, which are converted to WAV format with a 16 kHz sampling rate. The repository at https://github.com/brijmohan/lid-convex-comb will be updated with the code and feature sets used in this study.

III. PROPOSED APPROACH

In this approach, we rely on higher-level phonotactic information extracted from speech to develop an LID system. To extract this information, the input speech signal is tokenized by decoding it with a language-independent acoustic model and a phonetic language model whose probability mass is distributed uniformly over all phones, i.e., all phones have equal emission probability. Though the phone recognizer is independent of the language being decoded, we assume that similar-sounding acoustic patterns will be decoded as approximately the same phone labels. By using a language-independent phone recognizer, we rely on the consistency of the phone recognizer rather than its accuracy; i.e., similar acoustic sounds will be tagged with the same phone label regardless of the language. According to [8], the statistical patterns present in the resulting phone sequence must contain discriminative information for the corresponding language. N-grams and recurrent neural networks (RNN) are employed to model these statistical patterns, which convey the language-discriminative information in the tokenized phone sequence. The block diagram of the proposed approach is presented in Fig. 1. The neural representation of phonotactics is computed by training an RNN to generate/predict the next acoustic tag given a truncated sequence of past tags; thus an independent RNN learns a generative model for each of the 176 languages. The best performing models use just 30 neurons in the hidden layer, and the vocabulary size equals the number of phones used. This amounts to a tiny classification model: fewer than a quarter of a million parameters for 176 classes, compared to conventional models that require millions of parameters to classify far fewer classes.
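To make this concrete, the sketch below shows one such per-language next-phone model in the spirit of the RNNLM toolkit [15]: a single 30-unit recurrent layer over a phone vocabulary, scored by sequence log-likelihood. This is a minimal illustration under our own assumptions (plain PyTorch, hypothetical names and language codes), not the authors' implementation.

```python
# Minimal sketch (not the authors' code): one tiny next-phone RNN per
# language, mirroring the paper's 30-unit hidden layer and a vocabulary
# equal to the phone set. PyTorch is assumed purely for illustration.
import torch
import torch.nn as nn

class PhoneRNNLM(nn.Module):
    """Predicts the next phone tag given the sequence of past tags."""
    def __init__(self, n_phones, hidden=30):
        super().__init__()
        self.emb = nn.Embedding(n_phones, hidden)
        self.rnn = nn.RNN(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_phones)

    def forward(self, seq):                    # seq: (batch, time) phone ids
        h, _ = self.rnn(self.emb(seq))
        return self.out(h)                     # logits over the next phone

def sequence_log_prob(model, seq):
    """Log-likelihood of a phone sequence under one language's model."""
    inp, tgt = seq[:, :-1], seq[:, 1:]
    logp = torch.log_softmax(model(inp), dim=-1)
    return logp.gather(-1, tgt.unsqueeze(-1)).sum().item()

# One independent generative model per language, as in the paper.
N_PHONES = 40                                  # size of the US English phone set
models = {lang: PhoneRNNLM(N_PHONES) for lang in ("eng", "hin", "tel")}
```

At test time, each language's model scores the decoded phone sequence of the utterance, and the per-language scores feed the combination rule of Eq. (1) below.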
The goal of the language-independent phone recognizer is to provide maximum coverage of the phone units present in all the languages for which the system is being developed. We employ CMU Sphinx [16] as the front-end phone recognizer, which uses an HMM-based phone decoder on the speech signal. A phonetically tied-mixture (PTM) model is used for efficient decoding; it contains 256 mixture components per state and assigns different mixture weights to the shared states of triphones. This model provides a good balance between speed and accuracy, and since it can be trained over huge amounts of data, it gives decent decoding results in under real time. We use the US English phone set with 40 phones and an unbiased phonetic language model for decoding.

We assume that there exists a universal set of phonetic units U_p encompassing all known languages, wherein a particular language l comprises a subset of phonetic units P_l \subseteq U_p. The discriminative information between two languages l and m is encoded by the difference of their subsets, i.e., P_l - P_m, as well as by the interactions/transitions between phonetic units within the subsets. The interaction of phonetic units within a subset is measured as the probability of transition between these units, using either n-grams or recurrent neural networks.

The phone recognizer can be improved by training over multiple languages, which will certainly increase the coverage of common phonetic and acoustic patterns. We would also like to experiment with phonetic units that do not rely on languages, such as articulatory gestures obtained in an unsupervised manner from data, as proposed in [17]. The acoustic models used in this study are trained on US English and French speech data; English and the other languages of its (West Germanic) family are not present in the dataset, which makes English highly suitable as an unbiased tokenizer for this study. SRILM n-grams (uni-grams to 6-grams) and RNNLM have been used in the experiments to model the statistical patterns in the phone sequences.
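The paper builds the n-gram side with SRILM [14]; purely as an illustration of what a per-language phone n-gram model computes, the following stand-in uses add-one smoothing (a deliberate simplification; SRILM's actual smoothing options are more sophisticated), and the class name is our own.

```python
# Hypothetical stand-in for the per-language phone n-gram models built
# with SRILM [14]; add-one smoothing is a simplification for clarity.
from collections import defaultdict
from math import log

class PhoneNgramLM:
    def __init__(self, n=3, vocab_size=40):
        self.n, self.v = n, vocab_size
        self.gram = defaultdict(int)     # counts of full n-grams
        self.ctx = defaultdict(int)      # counts of (n-1)-gram contexts

    def train(self, phone_seqs):
        for seq in phone_seqs:
            padded = ["<s>"] * (self.n - 1) + list(seq) + ["</s>"]
            for i in range(len(padded) - self.n + 1):
                g = tuple(padded[i:i + self.n])
                self.gram[g] += 1
                self.ctx[g[:-1]] += 1

    def log_prob(self, seq):
        """Add-one smoothed log-probability of a decoded phone sequence."""
        padded = ["<s>"] * (self.n - 1) + list(seq) + ["</s>"]
        lp = 0.0
        for i in range(len(padded) - self.n + 1):
            g = tuple(padded[i:i + self.n])
            lp += log((self.gram[g] + 1) / (self.ctx[g[:-1]] + self.v))
        return lp
```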
The sequence of steps in the proposed approach is described below.

Training Phase:
• Decode the input speech using the acoustic model and obtain the corresponding phone sequences.
• Build a language model per language using the decoded phone sequences from the training data.

Testing Phase:
• Decode the test utterance using the acoustic model and obtain the phone sequences.
• Find the probability of the observed phone sequence under each language model.
• Combine the probabilities using the learnt convex combination (⊕) coefficient.
• Infer the language label as the label of the model exhibiting the highest score, according to Equation (1):

\hat{L} = \arg\max_{l} \; P_{RNN}(l) \oplus P_{ngrams}(l) = \arg\max_{l} \left[ a \cdot P_{RNN}(l) + (1 - a) \cdot P_{ngrams}(l) \right] \qquad (1)

Here a is the convex combination coefficient.
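A sketch of this decision rule follows: the per-language log-likelihoods from the two model families are normalized into the probability vectors P_RNN and P_ngrams of Eq. (1) and mixed with a single coefficient a. The softmax normalization and the grid-search tuning of a on validation data are our assumptions; the paper only states that the coefficient is learnt.

```python
# Sketch of Eq. (1): mix the per-language scores of the RNNLM and the
# n-gram models with one convex coefficient a, then take the argmax.
import numpy as np

def posteriors(loglikes):
    """Softmax over per-language log-likelihoods (assumed normalization)."""
    x = np.asarray(loglikes, dtype=float)
    x -= x.max()                         # numerical stability
    p = np.exp(x)
    return p / p.sum()

def identify(rnn_ll, ngram_ll, langs, a):
    p = a * posteriors(rnn_ll) + (1.0 - a) * posteriors(ngram_ll)
    return langs[int(np.argmax(p))]

def tune_a(dev, grid=np.linspace(0.0, 1.0, 21)):
    """Grid-search a on held-out utterances; each item of `dev` is a
    (rnn_ll, ngram_ll, true_lang, langs) tuple."""
    acc = lambda a: np.mean([identify(r, g, L, a) == t for r, g, t, L in dev])
    return max(grid, key=acc)
```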
IV. EXPERIMENTS, RESULTS & DISCUSSION

A. Baseline methods

In order to objectively study the performance of various LID systems, we implemented several baseline systems; their results are tabulated in Table I. 39-dimensional MFCC features with a frame size of 25 ms and a frame shift of 10 ms are extracted; each frame is stacked with the features of the ±10 frames around it, and the resulting 819-dimensional feature vector is used for further analysis. GMM, SVM, and DNN classifiers have been employed for developing the baseline LID systems. The DNN-based LID system is implemented as described in [5]; the hyperparameters are tuned accordingly and the results of the best performing system are reported. This DNN baseline performs better than i-vector based LID systems.

TABLE I
Performance of the baseline LID systems developed to classify 176 languages. The Min and Max columns show the scores of the worst and best performing languages for each model; the last column shows the mean F1 score over 176 languages.

System                            Min   Max   Mean F1%
GMM 8-mixtures                      1    35          6
GMM 16-mixtures                     1    38          8
GMM 32-mixtures                     2    41         10
GMM 64-mixtures                     2    43         12
Kernel SVM                          5    37         13
DNN (8 layers, dropout = 0.2)      31    72         53

From Table I it can be observed that the performance of the baseline LID systems developed for 176 languages is very poor. This poor performance leads to two inferences: a) these approaches do not scale well to a large number of classes, and b) they do not capture the language-discriminative phonotactic information vital for classification. Even though the input feature carries contextual information of about 21 frames in the time neighborhood, obtained by stacking the features, these approaches failed to capture the required information.

The performance of the proposed approach is presented in Table II, which follows the same column layout as Table I. The first six rows use SRILM n-gram language models to model the statistical patterns in the decoded phone sequences, and the next three rows use RNNLMs.
Fig. 1. Block diagram of the proposed method. The ⊕ symbol indicates the convex combination of probability vectors.
TABLE II
Performance of the proposed LID systems developed to classify 176 languages using the US English acoustic model. The following parameters are modified to tune the RNNLM: c = number of classes, h = number of units in the hidden layer, b = number of steps for BPTT.

System                               Min     Max     Mean F1%
1-gram                               43.79   60.00   46.53
2-gram                               77.77   83.61   81.77
3-gram                               84.07   88.88   85.45
4-gram                               81.11   86.66   83.94
5-gram                               77.77   86.66   83.15
6-gram                               77.77   86.66   82.98
RNNLM 1-c                            83.13   93.33   84.42
RNNLM 6-c                            84.44   90.58   87.69
RNNLM 100-c                          82.22   89.44   85.38
RNNLM 1-c ⊕ 3-gram (convex comb.)    88.44   91.24   89.36

A combination of both approaches has also been explored; its performance is presented in the last row of Table II.

From Table II it can be observed that, owing to the use of phonotactic information, the performance of the LID system is significantly higher than that of the baseline methods in Table I. It can also be noted that even though the number of languages is large, the high-level phonotactics exhibit better discriminability for building LID systems. Similar observations can be made from the confusion matrices of the DNN LID system and the proposed RNNLM-based LID system. Along with its superior performance, the proposed system exhibits consistent performance across languages, rather than the performance skewed toward certain languages shown by all the baseline approaches.

Figure 2 shows the confusion matrices obtained for the best performing RNNLM and DNN models, using a grayscale colormap to display the intensity of classification. The colormap of the plot is remapped onto a power-law relationship (i.e., y = x^γ), hence the skewed colorbar; γ is set to 1/3 in order to highlight mis-classifications, which occur with very low frequency in the proposed model. Darker cells correspond to highly correlated classes, and ideally the principal diagonal should be darkest. It can be observed that the LID performance remains consistent across all languages in the case of the RNNLM, while the DNN-based LID system is inferior to the RNNLM-based one.

Fig. 2. Classification confusion matrices for the best performing DNN (8 layers with 0.2 dropout after each layer) (left) and the best performing RNNLM ⊕ 3-gram (phone set: US English) (right) over the test dataset for 176 languages. The scale of the colormap is skewed appropriately for better visualization.

To study the influence of the front-end acoustic model on the LID system, we also used an acoustic model trained on French data with 34 phonemes; its performance is reported in Table III. Though the proposed approach with the French acoustic model as tokenizer performs better than the baseline approaches, it is clearly worse than the LID system with the US English acoustic model. This can be attributed to the fact that the US English model has 40 phones, which gives better acoustic/phonetic coverage than the French model with just 34 phones. The process of tokenizing the input utterance can be treated as quantization of the acoustic sequence into a sequence of tokens (phones); if the number of phones is not sufficiently high, the complexity of the statistical patterns in the phone sequence becomes intractable for the classifier to disentangle, and the LID performance degrades.

The performance of the LID systems developed during the study is compared in Table IV.

TABLE IV
Comparison of the maximum performance of each LID system.

System     GMM   SVM   DNN   3-gram   RNNLM   RNNLM ⊕ 3-gram
Mean F1%    12    13    53    85.45    87.69    89.36
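As an implementation note for the confusion matrices of Fig. 2 above: the power-law remapping y = x^(1/3) corresponds to a gamma-corrected color norm, which in matplotlib is available as PowerNorm. The rendering library is our assumption; the paper does not name one.

```python
# Rendering a confusion matrix with the gamma = 1/3 power-law colormap
# described for Fig. 2; matplotlib's PowerNorm is assumed as the remap.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import PowerNorm

C = np.random.rand(176, 176)        # placeholder confusion matrix
plt.imshow(C, cmap="gray_r", norm=PowerNorm(gamma=1/3))
plt.colorbar()                      # the colorbar ticks appear skewed
plt.savefig("confusion_matrix.png")
```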
TABLE III
Language identification performance (mean F1%) for each language model using the French phone set. This table shows that phonetic coverage is vital for LID: the French phone set performs worse than English due to its inferior phonetic coverage. The following parameters are modified to tune the RNNLM: c = number of classes, h = number of units in the hidden layer, b = number of steps for BPTT.

System                  Min     Max     Mean F1%
RNNLM 1-c               69.21   78.88   71.35
RNNLM 6-c               71.30   82.22   73.53
RNNLM 50-c 100-h 2-b    74.50   84.44   76.56
RNNLM 50-c 100-h 3-b    74.42   82.22   76.46
RNNLM 50-c 100-h 4-b    74.60   86.66   76.76
RNNLM 50-c 200-h 1-b    75.06   82.71   77.11

From Table IV it can be observed that the proposed approaches, which rely on higher-level acoustic features, perform significantly better than approaches relying on low-level acoustic features when the LID system must scale to a large number of languages. The significance of the proposed approach is most clearly seen when the number of language classes is high.

Figure 3 shows the performance range of the best tuned model for each phonotactics estimation method. The points depict the mean performance of each model, and the vertical bars display the asymmetric variance of performance across languages. We observe that the low-level acoustic methods show hugely varying results, whereas the higher-level features are consistent across languages. Note that the combination model scales gracefully with the number of languages due to the augmentation of complementary information.

Fig. 3. Range of mean F1 for each estimation method. The vertical bar at each point indicates the asymmetric distribution of per-language F1 values around the mean. N-grams and RNNLM achieve consistency across languages, whereas the other baseline techniques display highly skewed F1. The high mean and low variance of ConPT (the convex combination) show that this model scales consistently and gracefully to a large number of language classes: the expected value shifts toward the ensemble average of two stochastic processes with complementary information, in this case the phonotactics estimated using n-grams and RNNLM.

B. Information-theoretic analysis of phonotactic models

Due to the fixed vocabulary of phonetic units recognized by the language-independent phonetic recognizer described in Section III, the estimated phonotactics of each language are fundamentally a discrete probability distribution over an identical domain, composed of a sample space of common phonetic uni-, bi-, and tri-grams. For an optimal classification of languages, these probability manifolds must be distinct. According to philological classification, languages can be grouped into language families based on similar phonetic and phonological structure. We claim that, since the probability manifolds are distributed over acoustic/phonetic tags generated by a consistent recognizer, there must exist some similarity between languages belonging to a similar ethnicity or language family; at the very least, the probability distributions of languages belonging to the same family must be closer to each other than to those of other languages. In order to ascertain this claim, we calculated the pairwise Bhattacharyya coefficient [18] and plotted the similarity matrix G shown in Figure 4. The Bhattacharyya coefficient between two phonotactic models p and q over a set of common n-grams X is calculated as

BC(p, q) = \sum_{x \in X} \sqrt{p(x) \, q(x)} \qquad (2)
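Computing Eq. (2) for all language pairs is straightforward once each phonotactic model is flattened into a probability vector over the shared n-gram sample space; a minimal sketch (the vectorized form is our own):

```python
# Pairwise Bhattacharyya coefficients (Eq. 2) between phonotactic models,
# each represented as a probability vector over the common n-gram space.
import numpy as np

def bhattacharyya(p, q):
    """BC(p, q) = sum_x sqrt(p(x) * q(x)) for two discrete distributions."""
    return float(np.sum(np.sqrt(p * q)))

def similarity_matrix(P):
    """P: (n_langs, n_ngrams) array whose rows are distributions summing
    to 1. Returns G with G[i, j] = BC(P[i], P[j]), vectorized as
    sqrt(P) @ sqrt(P).T."""
    S = np.sqrt(P)
    return S @ S.T
```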
We use Laplacian eigenmaps [19] to generate 2D embeddings of the phonotactic models using G as input. We further cluster these embeddings using the affinity propagation clustering algorithm [20]. Figure 5 shows a 2D plot of the language models with cluster centroids connected to the member languages through edges. We noticed that the centroid languages belong to disparate ethnographies and several member languages share ethnographical similarities. For instance, the cluster with Bhojpuri as the centroid language (refer to Figure 5) contains 7 of the 9 Indian languages present in the dataset, including Hindi. This observation provokes the need for a detailed linguistic analysis, which we reserve for future work. Figure 6 presents a world map depicting the 10 most prominent clusters discovered using the Bhattacharyya distance between language models. To analyze the ethnographic correlation of our statistical models, individual languages are now represented by their language family; each cluster is then described by the percentage distribution of its language families, as given in Table V. It is evident from this map that our models, clustered in Bhattacharyya space using affinity propagation, closely resemble the ethnographic clustering of languages into language families that are close in geographical location. This evidence indicates that LID models derived using the phonotactic information in spoken utterances are influenced by the phenomena of language contact, and that language-specific characteristics such as phonological variations are vital for distinguishing between languages.
Fig. 4. Pairwise similarity matrix depicting the Bhattacharyya coefficient between each pair of language models. Darker shades indicate higher correlation. The colormap is skewed to highlight lower values.

TABLE V
Clusters obtained after projecting each language model into Bhattacharyya space. Numbers in parentheses indicate the percentage distribution of the cluster over language families.

Cluster No.   Language families with percentage composition
Cluster 1     Austro-Asiatic (55.5), Niger-Congo (11.1), Sino-Tibetan (11.1), Tai-Kadai (22.2)
Cluster 2     Niger-Congo (72.2), Afro-Asiatic (11.1), Creole (5.6), Nilo-Saharan (11.1)
Cluster 3     Mayan (63.6), Totonacan (9.1), Uto-Aztecan (27.3)
Cluster 4     Turkic (50), Mongolic (25), Chipaya-Uru (12.5), Indo-European (12.5)
Cluster 5     Sino-Tibetan (66.7), Koreanic (16.7), Hmong-Mien (16.7)
Cluster 6     Austronesian (62.5), West Papuan (37.5)
Cluster 7     Nilo-Saharan (33.3), Afro-Asiatic (13.3), Otomanguean (13.3), Niger-Congo (13.3), Arauan (6.7), Cariban (6.7), Chibchan (6.7), Language isolate (6.7)
Cluster 8     Jivaroan (20), Totonacan (20), Austronesian (10), Otomanguean (10), Huavean (10), Quechuan (10), Paezan (10), Jicaquean (10)
Cluster 9     Indo-European (36.4), Dravidian (27.3), Austro-Asiatic (9.1), Austronesian (9.1), Zamucoan (9.1), Barbacoan (9.1)
Cluster 10    Afro-Asiatic (50), Creole (25), Trans-New Guinea (25)

Fig. 5. 2D plot of the phonotactic models of the languages and their corresponding clusters. Central dots represent cluster centroids and the neighboring dots represent the member languages of each cluster. Member language tags are omitted to maintain readability.

Fig. 6. Geographical distribution of the 10 language clusters based on the language families they contain. Notice that clusters often contain languages belonging to language families that are close in geographical region. Details of each cluster are given in Table V.

V. CONCLUSION & FUTURE WORK

In this work, we develop an LID system for 176 languages. The performance of LID systems that rely on low-level acoustic sequences degrades drastically as the number of language classes increases (176 in the present case). In our approach, we instead rely on the phonotactic information obtained using a language-independent acoustic model, an unbiased phonetic language model, and statistical sequence learners such as recurrent neural networks. As the proposed approach relies on phonotactic information, it can serve as complementary information to approaches that draw language information from low-level acoustic features. Experiments show that the proposed approach is computationally efficient and scales gracefully to a large number of language classes. A convex combination of n-gram and RNNLM-based language models has shown
noteworthy potential for developing large-scale LID systems.

Currently, the language models are developed independently of each other. Future tasks include developing linearly and non-linearly interpolated language models for LID, building acoustic models specifically for LID systems, and language model pruning. Low phonetic coverage appears to be a reason for the low recognition accuracy of some languages; hence a more generalized phone recognizer with large phonetic coverage could be developed specifically for LID systems. A phonetic inventory learnt in an unsupervised manner directly from data is also a potential future extension. We also wish to exploit the polyglot neural language models of Tsvetkov et al. [21].

Some initial experiments with the probability distributions of the phonotactic models revealed an interesting correspondence between ethnologic language families and the clusters obtained in the embedding space of Laplacian eigenmaps. We will further conduct a detailed correlation study between the information-theoretic models and the ethnological classes of languages to draw insights related to the similarity of language groups.

REFERENCES

[1] C.-H. Lee, "Principles of spoken language recognition," in Springer Handbook of Speech Processing. Springer, 2008, pp. 785-796.
[2] H. Li, B. Ma, and K. A. Lee, "Spoken language recognition: from fundamentals to practice," Proceedings of the IEEE, vol. 101, no. 5, pp. 1136-1159, 2013.
[3] A. McCree and D. Garcia-Romero, "DNN senone MAP multinomial i-vectors for phonotactic language recognition," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[4] P. Matejka, L. Zhang, T. Ng, H. Mallidi, O. Glembek, J. Ma, and B. Zhang, "Neural network bottleneck features for language identification," Proc. IEEE Odyssey, pp. 299-304, 2014.
[5] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, "Automatic language identification using deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 5337-5341.
[6] R. Fér, P. Matějka, F. Grézl, O. Plchot, and J. Černocký, "Multilingual bottleneck features for language recognition," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[7] W. Geng, J. Li, S. Zhang, X. Cai, and B. Xu, "Multilingual tandem bottleneck feature for language identification," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[8] M. A. Zissman and K. M. Berkling, "Automatic language identification," Speech Communication, vol. 35, no. 1, pp. 115-124, 2001.
[9] K. Hingkeung and K. Hirose, "N-gram modeling based on recognized phonemes in automatic language identification," IEICE Transactions on Information and Systems, vol. 81, no. 11, pp. 1224-1231, 1998.
[10] M. Zissman, E. Singer et al., "Automatic language identification of telephone speech messages using phoneme recognition and n-gram modeling," in Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, vol. 1. IEEE, 1994, pp. I-305.
[11] C. Salamea, L. F. D'Haro, R. de Córdoba, and R. San-Segundo, "On the use of phone-gram units in recurrent neural networks for language identification," Odyssey 2016, pp. 117-123, 2016.
[12] L. J. Rodríguez-Fuentes, M. Penagarikano, A. Varona, M. Diez, and G. Bordel, "KALAKA-3: a database for the assessment of spoken language recognition technology on YouTube audios," Language Resources and Evaluation, pp. 1-23.
[13] B. M. L. Srivastava, H. K. Vydana, A. K. Vuppala, and M. Shrivastava, "A language model based approach towards large scale and lightweight language identification systems," CoRR, vol. abs/1510.03602, 2015. [Online]. Available: http://arxiv.org/abs/1510.03602
[14] A. Stolcke et al., "SRILM: an extensible language modeling toolkit," in Interspeech, vol. 2002, 2002, p. 2002.
[15] T. Mikolov, S. Kombrink, A. Deoras, L. Burget, and J. Cernocky, "RNNLM: recurrent neural network language modeling toolkit," in Proc. of the 2011 ASRU Workshop, 2011, pp. 196-201.
[16] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, A. Rudnicky et al., "Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices," in Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, vol. 1. IEEE, 2006, pp. I-I.
[17] B. M. L. Srivastava and M. Shrivastava, "Articulatory gesture rich representation learning of phonological units in low resource settings," in Statistical Language and Speech Processing, Fourth International Conference, SLSP 2016, vol. 4. Springer, 2016.
[18] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bulletin of the Calcutta Mathematical Society, vol. 35, pp. 99-109, 1943.
[19] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, vol. 15, no. 6, pp. 1373-1396, 2003.
[20] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972-976, 2007.
[21] Y. Tsvetkov, S. Sitaram, M. Faruqui, G. Lample, P. Littell, D. Mortensen, A. W. Black, L. Levin, and C. Dyer, "Polyglot neural language models: A case study in cross-lingual phonetic representation learning," arXiv preprint arXiv:1605.03832, 2016.