Article info

Article history: Received 29 August 2021; Revised 29 January 2022; Accepted 13 April 2022; Available online 16 April 2022
Edited by Prof. S. Sarkar
Keywords: Spoken language identification; Unseen channel condition; Channel-mismatch; Domain-mismatch; Deep learning; Within-sample similarity loss

Abstract

State-of-the-art spoken language identification (LID) systems use sophisticated training strategies to improve the robustness to unseen channel conditions in real-world test samples. However, all these approaches require training samples from multiple channels with corresponding channel-labels, which are not available in many cases. Recent research in this regard has shown the possibility of learning a channel-invariant representation of the speech using an auxiliary loss function called within-sample similarity loss (WSSL), which does not require samples from multiple channels. Specifically, the WSSL encourages the LID network to ignore channel-specific contents in the speech by minimizing the similarities between two utterance-level embeddings of the same sample. However, as the WSSL approach operates at the sample level, it ignores the channel variations that may be present across different training samples within the same dataset. In this work, we propose a modification to the WSSL approach to address this limitation. Specifically, along with the WSSL, the proposed modified WSSL (mWSSL) approach additionally considers the similarities with two global-level embeddings which represent the average channel-specific contents in a given mini-batch of training samples. The proposed modification allows the network to have a better view of the channel-specific contents in the training dataset, leading to improved performance in unseen channel conditions.

© 2022 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.patrec.2022.04.018
fusion weights dynamically based on the channel conditions in the input, leading to a better performance in unseen channel conditions. Another example is [9], where an AMTL-based approach has been used to improve the noise-invariance of a speech recognition system to test samples with unseen types of background noise. In [9], the AMTL uses training samples with four known types of noise to improve the noise-invariance. Similarly, the AMTL-based training in [10] uses samples from "Quiet", "TV", and "Music" environments to improve the generalization of the system to unseen environmental conditions. Note that, in all these works, the training dataset used to train the system contains samples from two or more domains with corresponding domain-labels.

However, there are many cases in which the training dataset contains samples from only one channel, lacking channel-diversity. For example, the "OGI Telephone Speech Corpus" [11] contains samples collected using only telephone lines. Similarly, the speech corpora used in the "AP16/17-OLR Challenge" [12,13] contain samples from only mobile phones, and the Indian languages dataset used in [14] contains speech samples recorded using only high quality recording devices in a controlled environment. In addition to these, the "MIAMI" Spanish dialect corpus [15] and the Arabic dialect corpus used in [16] also contain speech samples recorded using only one type of microphone in a controlled environment. Due to the very limited channel-diversity in the training samples, the LID system built using such datasets becomes highly vulnerable to channel-mismatch when test samples contain unseen channel conditions.

In this work, our aim is to improve the robustness of a LID system to channel-mismatch when the test samples are expected to contain unseen channel conditions. We assume that we do not have access to any labeled/unlabeled samples from the target domain. Furthermore, we also assume that the training dataset used does not contain samples from multiple channels with corresponding channel-labels. This situation is commonly encountered when we are building a LID system for a real-world application using a training dataset containing very limited channel-diversity as in [11–16]. Note that traditional supervised/unsupervised domain adaptation cannot be used in this case as we do not have a priori knowledge about the target domain. Furthermore, state-of-the-art AMTL-based [9,10] and domain attentive fusion based [8] approaches also cannot be used in this case as they require training samples from multiple channels with corresponding channel-labels.

To address this problem, we proposed a novel within-sample similarity loss (WSSL) based approach in our recent work [17]. The proposed WSSL is an auxiliary loss which encourages the LID network to learn a channel-invariant representation of the speech even when the training dataset lacks channel-diversity. Specifically, the WSSL gives a measure of similarity between a pair of utterance-level embeddings of the same speech sample obtained using two embedding extractors present at the front-end of the LID network. These embedding extractors are designed to capture similar information about the channel, but dissimilar LID-specific information in the speech. By including this WSSL score as an additional loss during the training process, the network is penalized for encoding similar information in these two embeddings. This encourages the network to suppress the channel-specific contents in the speech that are common to both the embeddings, leading to a better channel-invariance.

However, as WSSL considers only the similarities present between two embeddings of the same speech sample, it does not capture the intra-domain variations (inter-session variations present within the same dataset). Note that, in spite of using similar types of recording devices, speech samples collected using different devices can still contain certain differences in their channel/background conditions. While state-of-the-art approaches ignore these intra-domain variations, we argue that the inclusion of this factor can lead to better channel-invariance of the LID system.

In this article, we propose a novel modified within-sample similarity loss (mWSSL) based approach for improving the channel-invariance of the LID network. Compared to WSSL [17], the mWSSL contains an additional factor to explicitly consider the intra-domain variations in the training dataset. Specifically, apart from suppressing the similarities between two embeddings of the same speech sample, mWSSL additionally suppresses the similarities of these two embeddings with two global-level embeddings which represent the average channel-specific contents in the given mini-batch of training samples. These global-level embeddings are obtained by averaging the respective utterance-level embeddings of all training samples in the given mini-batch.

The major contributions of this article are given below.

• A novel modified within-sample similarity loss (mWSSL) based approach for improving the channel-invariance of the LID network. The proposed mWSSL encourages the network to suppress the channel-specific contents in the speech at both the sample-level as well as at a global (mini-batch) level, leading to a better channel-invariance.
• Re-implementation of the original WSSL [17] approach with some architectural changes such as self-attention based fusion of the utterance-level embeddings, modification in the method of computing the similarities between those embeddings, etc., leading to improved performance.
• Extensive experimentation on the proposed approach and supplementary evidence to show its effectiveness.

The rest of this article is organized as follows. We first discuss the baseline and WSSL-based LID systems and their limitations in Section 2, followed by the proposed mWSSL approach to overcome those limitations. In Section 3, we give details of the speech corpora used in this work. Performance evaluation and conclusion are given in Sections 4 and 5 respectively.

2. Neural LID models

In this section, we first discuss the architecture of our baseline LID network followed by the WSSL network. We then discuss the limitation in the WSSL approach, followed by a solution to address that limitation – the mWSSL approach.

2.1. Baseline LID network

Fig. 1 shows the block diagram of our baseline system (the WSSL block in Fig. 1 is to be ignored for the baseline LID network; it is present only for the WSSL network). It contains a feature extractor block at the front-end to produce an utterance-level embedding of the speech (represented as u-vector in Fig. 1), followed by a language classifier block. The feature extractor block contains a pre-trained bottleneck feature (BNF) extractor [18] at the front-end to convert the input speech into a sequence of BNF vectors. The input sequence of BNFs is then analyzed by two identical embedding extractors to provide two utterance-level embeddings (represented as e1 and e2 in Fig. 1) of the input speech. These embedding extractors are designed to process the input BNF sequence at two different temporal resolutions. This analysis at different resolutions allows the two embedding extractors to have two different views of the same input, thereby allowing them to capture complementary LID-specific contents in the input speech [17].

As mentioned earlier, both embedding extractors have identical architecture. This architecture is motivated by the networks in [19–22]. As shown in Fig. 2a, each embedding extractor contains a set of BLSTM layers.
Fig. 1. Block diagram of the baseline/WSSL network used in this work. Red coloured frames in sequence of BNFs indicate the frames selected as input within an analysis
window. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Let E = [e1, e2] be the set of embeddings obtained from the two embedding extractors (where e1, e2 ∈ R^D). Using E, an intermediate representation is computed first as:

γn = tanh(tanh(Wa en + ba) Wγ + bγ),   n ∈ {1, 2}   (1)

Here, Wa ∈ R^(D×N) and ba ∈ R^(1×N) are respectively the weights and biases of the dense layer with N nodes, and Wγ ∈ R^(N×1) and bγ ∈ R^(1×1) are the weights and bias of the dense layer with a single node. All these learnable parameters of the attention network are optimized during the training such that more weight is assigned to the embedding with more LID-specific contents. Using γ = [γ1, γ2], the fusion weights α = [α1, α2] are computed as α = softmax(γ). The u-vector is then obtained as:

u = E α = α1 e1 + α2 e2.   (2)

This u-vector is then passed on to the language classifier block. This forms an end-to-end LID network, which can be trained using the language classification (categorical cross-entropy) loss. We denote this network as our baseline LID network (Lnet_baseline). However, training the Lnet_baseline using only the primary language classification loss does not prevent the network from capturing the channel-specific information in the input. This makes the network vulnerable to channel-mismatch. We use WSSL to address this issue.

2.2. Improving the channel-invariance using WSSL

As shown in Fig. 1, the WSSL is a measure of similarity between the embeddings e1 and e2 of the given input sample. During the training, this WSSL score is added as an auxiliary loss to the primary classification loss so that the network is encouraged to suppress the similarity between the embeddings e1 and e2. The motivation for suppressing the similarity between these two embeddings is the following. It is seen that the channel-specific contents in a clean speech utterance remain (almost) constant. Hence, a given clean speech sample can be visualized as a combination of a fast-changing foreground (speech) and a constant or slowly varying background (related to the channel). As the two embedding extractors process the input at different resolutions, they encode dissimilar information about the foreground speech (the fast-changing part in the speech) and similar information about the channel (the constant part). Due to this, the two embeddings e1 and e2 carry dissimilar LID-specific information, but similar channel-specific information. Therefore, encouraging the network to suppress the similarity between e1 and e2 enables it to learn a channel-independent representation of the speech.

We compute the WSSL as follows:

Lw(θF) = Lcos(e1, e2)   (3)

where Lcos(e1, e2) = (e1 · e2) / (‖e1‖ ‖e2‖) is the cosine similarity between the embeddings, and θF represents the parameters of the feature extractor block. Unlike [17], where a combination of cosine similarity and Euclidean distance is used, we use only the cosine similarity in this work, as the inclusion of the Euclidean distance did not provide noticeable improvement in the performance with the new architecture.

With the WSSL as an auxiliary loss, we train our LID network using the following total loss.

LT(θF, θC) = Ll(θF, θC) + β Lw(θF)   (4)

where LT(θF, θC) is the total loss, Ll(θF, θC) is the primary language classification (cross-entropy) loss, and Lw(θF) is the WSSL. The scalar value β is the trade-off parameter, and θF and θC are respectively the parameters of the feature extractor and language classifier blocks of the LID network.

Note that, as the embeddings e1 and e2 are designed to capture dissimilar LID-specific contents, the amount of similarity between them given by WSSL is mainly due to the channel-specific contents present in them. During training, the primary language classification loss continuously forces the network to capture more and more LID-specific contents, and the WSSL simultaneously prevents the network from encoding the channel-specific contents in the speech. Furthermore, even if the embeddings e1 and e2 contain some similarities due to LID-specific contents, the combination of WSSL with the language classification loss encourages them to capture complementary LID-specific contents by suppressing those similarities.

However, the WSSL is designed to suppress only the similarities present between two embeddings of the same speech sample. Hence, it does not capture the intra-domain variations in the channel-specific contents that may be present within the training dataset. To overcome this limitation, we propose a modification to the WSSL approach.

2.3. Proposed modified WSSL (mWSSL) approach

Fig. 3 shows the block diagram of the proposed mWSSL approach. Compared to WSSL, the mWSSL includes an additional factor which allows the network to have a better view of the channel-specific contents in the training dataset. Specifically, this additional factor gives a measure of similarity for each of the two embeddings of a given training sample with two global embeddings, as given below.

Lg(θF) = Lcos(e1, ê2) + Lcos(e2, ê1)   (5)

where ê1 and ê2 are the global embeddings, which are computed as follows.

ên = (1/Nb) Σ_{i=1}^{Nb} e_{n,i},   n ∈ {1, 2}   (6)

Here, Nb represents the total number of training samples in the given mini-batch. Note that, as these global embeddings are obtained by averaging all the sample-level embeddings (e1 and e2) in the given mini-batch, they represent the average channel-specific contents in that mini-batch. Also, as they are averaged irrespective of their language classes, they are class-independent in nature. Hence, suppressing the similarities with these global embeddings leads to further improvement in the channel-invariance of the network without losing any LID-specific information.

With the auxiliary WSSL and the proposed modification term, we compute the total loss as below.

LT(θF, θC) = Ll(θF, θC) + β Lw(θF) + δ Lg(θF)   (7)

Here, Lg(θF) is the proposed modification factor weighted by the trade-off parameter δ.

Note that the network is trained with only the primary language classification loss and WSSL for the very first mini-batch of training samples, excluding the modification factor. For all subsequent mini-batches, the training procedure involves two steps. First, the global embeddings ê1 and ê2 (used in Eq. (5)) are computed using all training samples in the given mini-batch by keeping the model parameters (obtained after the previous mini-batch) fixed. In the second step, the model parameters are updated using these pre-computed global embeddings (according to Eq. (7)) for the given mini-batch. This two-step procedure ensures that the model is aware of the channel-specific contents in the given set of training samples during the parameter optimization. Due to the combination of Lw(θF) and Lg(θF) with the primary classification loss, the network learns to suppress the channel-specific contents present at both the sample-level as well as at a global (mini-batch) level.
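For concreteness, the self-attentive fusion of Eqs. (1)–(2) can be sketched as the following minimal PyTorch-style module. This is only an illustrative reading of the equations; the module and variable names (e.g., AttentiveFusion) are our own choices and not part of the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Sketch of the self-attentive fusion in Eqs. (1)-(2).

    Fuses the two utterance-level embeddings e1, e2 (each of dimension D)
    into a single u-vector using learned attention weights.
    """
    def __init__(self, emb_dim: int, hidden: int = 100):
        super().__init__()
        self.dense_a = nn.Linear(emb_dim, hidden)   # W_a, b_a
        self.dense_g = nn.Linear(hidden, 1)         # W_gamma, b_gamma

    def forward(self, e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
        # e1, e2: (batch, D)
        E = torch.stack([e1, e2], dim=1)                                 # (batch, 2, D)
        gamma = torch.tanh(self.dense_g(torch.tanh(self.dense_a(E))))   # Eq. (1): (batch, 2, 1)
        alpha = torch.softmax(gamma, dim=1)                             # fusion weights alpha_1, alpha_2
        u = (alpha * E).sum(dim=1)                                       # Eq. (2): u = alpha_1*e1 + alpha_2*e2
        return u
```

The softmax over the two scores guarantees that the fusion weights are non-negative and sum to one, so the u-vector is always a convex combination of e1 and e2.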
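The loss terms in Eqs. (3)–(7) and the two-step per-mini-batch procedure can be outlined as below. This is a sketch under our own assumptions (a hypothetical model returning (e1, e2, logits), and standard optimizer and data objects); it is not the authors' code, and the special case of the very first mini-batch, which excludes the modification factor, is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def wssl(e1, e2):
    """Eq. (3): within-sample similarity loss (cosine similarity of e1 and e2)."""
    return F.cosine_similarity(e1, e2, dim=-1).mean()

def mwssl_global(e1, e2, e1_hat, e2_hat):
    """Eq. (5): similarity of each embedding with the opposite global embedding."""
    return (F.cosine_similarity(e1, e2_hat.expand_as(e1), dim=-1).mean()
            + F.cosine_similarity(e2, e1_hat.expand_as(e2), dim=-1).mean())

def train_step(model, optimizer, batch, labels, beta=0.2, delta=0.1):
    """Two-step procedure for one mini-batch (hypothetical model/optimizer API)."""
    # Step 1: compute the global embeddings (Eq. (6)) with the current
    # parameters kept fixed (no gradient flows through this averaging).
    with torch.no_grad():
        e1, e2, _ = model(batch)                  # assumed to return (e1, e2, logits)
        e1_hat = e1.mean(dim=0, keepdim=True)
        e2_hat = e2.mean(dim=0, keepdim=True)

    # Step 2: update the parameters with the total loss of Eq. (7).
    e1, e2, logits = model(batch)
    loss = (F.cross_entropy(logits, labels)                      # L_l: primary LID loss
            + beta * wssl(e1, e2)                                # L_w: Eq. (3)
            + delta * mwssl_global(e1, e2, e1_hat, e2_hat))      # L_g: Eq. (5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```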
Fig. 3. Block diagram of the proposed mWSSL approach. Compared to WSSL, the mWSSL additionally considers the similarities with two global embeddings.
Table 1
Details about the IIT-Mandi Indian languages corpus used in this work.
3. Datasets used in the study

We use the IIT-Mandi Indian languages corpus [17] in this work. This corpus is available online (https://speechiitmandi.github.io/air/). There are two parts in this corpus: the IIT-Mandi read speech and the IIT-Mandi YouTube dataset. Table 1 shows details such as the number of hours of speech data, number of utterances, and number of speakers in each language in these two datasets.

The IIT-Mandi read speech dataset contains audio files obtained from news broadcasts of All India Radio (https://newsonair.gov.in/). Each language contains around 4.5 h of speech data from at least 15 speakers. We use 80% of the samples from this dataset for training (represented as Readsp-train) and the remaining 20% for development. Since this development part has the same channel conditions as the training dataset, it is called the seen test set.

The IIT-Mandi YouTube dataset contains audio files extracted from various YouTube videos on online teaching and personal interviews. Each language contains samples from at least 10 speakers. Note that there is significant domain-mismatch between the IIT-Mandi read speech and IIT-Mandi YouTube datasets in terms of channel, type of speech, background conditions, etc. We use the YouTube dataset only for testing. Since no samples from this dataset are used during training, the channel conditions in it are unseen by the LID system. Hence, it is called the unseen test set.

Note that the amount of speech data and the number of speakers in the IIT-Mandi read speech dataset are limited, leading to limited diversity in the training dataset.
Table 2
Performances of baseline systems when tested in seen and unseen channel conditions. Performances are given
in accuracy (Acc) and Cavg .
In order to have higher intra-domain variations, we additionally use a part of the IIIT-Hyderabad Indian languages corpus [14,23] for training in some of our experiments. Like IIT-Mandi read speech, this dataset also contains read speech samples recorded in a controlled environment. This dataset is available upon request. In each language, we use approximately 6 h of speech (about 25 speakers per language) from this dataset. By combining samples from the IIIT-Hyderabad corpus, we get a combined training dataset called combined-readsp-train, having about 10 h of speech in each language with at least 40 speakers.

In our experiments, we have divided all larger speech files into smaller ones such that the duration of the speech samples used in this work (including train and test datasets) varies between 2 s and 15 s, with a mean of 7.8 s.
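As an illustration of this data preparation step, the following small sketch (assuming the soundfile library and a hypothetical output naming scheme) splits a long recording into chunks whose durations fall in the 2–15 s range; the exact splitting procedure used by the authors is not described beyond the duration statistics above.

```python
import numpy as np
import soundfile as sf

def split_recording(path, out_prefix, min_dur=2.0, max_dur=15.0, seed=0):
    """Split one long audio file into chunks of 2-15 s (illustrative only)."""
    rng = np.random.default_rng(seed)
    audio, sr = sf.read(path)
    start, idx = 0, 0
    while len(audio) - start >= int(min_dur * sr):
        # Draw a chunk length uniformly between the minimum and maximum duration.
        dur = rng.uniform(min_dur, max_dur)
        end = min(start + int(dur * sr), len(audio))
        sf.write(f"{out_prefix}_{idx:04d}.wav", audio[start:end], sr)
        start, idx = end, idx + 1
    # Any tail shorter than 2 s is discarded in this sketch.
```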
4. Results and discussion

In this section, we study the effectiveness of the proposed method. Performances are given in terms of accuracy (%) and the Cavg (%) metric [24]. Lower values of Cavg indicate better performance. We have re-implemented all the systems in the original work.
4.1. Baseline systems

To compare the effectiveness of WSSL and the proposed mWSSL, we use two baselines that are trained using only the primary language classification loss. The first one is the x-vector based LID system [20]. It processes the sequence of BNFs to obtain a 512-dimensional x-vector, which is then classified using a Gaussian back-end [20]. The clean speech samples from the Readsp-train dataset (having approximately 4 h of speech per language) are used for training this system. Results obtained for the x-vector based system on the seen and unseen test sets are shown in the 1st row of Table 2.

The second one is a baseline LID network (Lnet_baseline) which contains only the feature extractor and language classifier blocks in Fig. 1 (excluding the WSSL block). The feature extractor block of this network contains two identical embedding extractors. The first embedding extractor analyzes the input by dividing it into chunks of 0.5 s (T), whereas the second embedding extractor uses 1 s chunks. These chunk sizes are chosen empirically based on the best performance. The embedding extractors contain two BLSTM layers with 256 and 32 nodes respectively in the first and second layers to produce the embeddings e1 and e2 (where e1, e2 ∈ R^64). These embeddings are then processed by a self-attention network which contains a dense layer with 100 (N) nodes followed by a dense layer with a single node to produce the attention (fusion) weights. The two embeddings are then added using these fusion weights to produce the u-vector. This u-vector is then fed to the language classifier (output layer) with softmax activation, containing 8 nodes to represent the languages. Results obtained for this system when trained on Readsp-train are given in the third row of Table 2. The second row in Table 2 corresponds to the baseline system used in the original work [17], which uses fixed weights for the fusion of the two embeddings instead of the self-attention based fusion.
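Putting the quoted dimensions together, a compact sketch of the Lnet_baseline forward pass could look as follows. It reuses the AttentiveFusion module from the earlier sketch, assumes the chunked BNF inputs are prepared outside the module, and uses average pooling over time to obtain the utterance-level embeddings; these choices are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EmbeddingExtractor(nn.Module):
    """One embedding extractor: two BLSTM layers (256 and 32 units) over the
    BNF sequence, pooled into a 64-dimensional utterance-level embedding.
    The average pooling over time is an assumption made for this sketch."""
    def __init__(self, bnf_dim: int):
        super().__init__()
        self.blstm1 = nn.LSTM(bnf_dim, 256, batch_first=True, bidirectional=True)
        self.blstm2 = nn.LSTM(512, 32, batch_first=True, bidirectional=True)

    def forward(self, bnf: torch.Tensor) -> torch.Tensor:
        # bnf: (batch, time, bnf_dim), framed from 0.5 s or 1 s chunks
        h, _ = self.blstm1(bnf)
        h, _ = self.blstm2(h)          # (batch, time, 64)
        return h.mean(dim=1)           # utterance-level embedding e in R^64

class LnetBaseline(nn.Module):
    """Baseline LID network: two extractors + attentive fusion + 8-way classifier."""
    def __init__(self, bnf_dim: int, num_langs: int = 8):
        super().__init__()
        self.ext1 = EmbeddingExtractor(bnf_dim)    # operates on 0.5 s chunks
        self.ext2 = EmbeddingExtractor(bnf_dim)    # operates on 1 s chunks
        self.fusion = AttentiveFusion(emb_dim=64, hidden=100)  # from the earlier sketch
        self.classifier = nn.Linear(64, num_langs)

    def forward(self, batch):
        bnf_05s, bnf_1s = batch        # BNF sequences chunked at the two resolutions
        e1 = self.ext1(bnf_05s)
        e2 = self.ext2(bnf_1s)
        u = self.fusion(e1, e2)        # u-vector, Eq. (2)
        return e1, e2, self.classifier(u)   # logits -> softmax over 8 languages
```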
The fourth row in Table 2 shows the performance of the baseline LID network with the proposed new architecture trained using the combined-readsp-train dataset (containing about 10 h of speech per language with more speakers). The number of nodes in each hidden layer of this network (represented as Lnet_baseline_comb) has been doubled compared to Lnet_baseline due to the increased size of the dataset.

From the results in Table 2, it is seen that the Lnet_baseline with the proposed new architecture (using attentive fusion) has performed better than the baselines on both the seen and unseen test sets. Unlike in [17], the Lnet_baseline in this work uses a self-attention based fusion which estimates the fusion weights dynamically based on the LID-specific contents in the two embeddings, leading to a better performance.

Note that both Lnet_baseline and Lnet_baseline_comb have performed very well on the seen test set (read speech) but poorly on the unseen test set (YouTube dataset). This difference in performance clearly shows the effect of channel-mismatch. As these networks have seen only read speech samples with limited channel diversity during the training, both of them have become vulnerable to the unseen channel conditions in the YouTube dataset. Compared to Lnet_baseline, the baseline network trained on combined-readsp-train has provided slightly better performance in both seen and unseen channel conditions. This improvement is attributed to the better generalization of the network due to the presence of additional samples from the IIIT-Hyderabad corpus.

However, in spite of using a larger training dataset with more diversity (combined-readsp-train), the performance of Lnet_baseline_comb on the unseen test set is far below that on the seen test set. This clearly indicates the necessity for a sophisticated training strategy to further improve the channel-invariance of the network.

4.2. Effectiveness of the WSSL

Here, we experiment by including WSSL in the training process as in Eq. (4). The value of the trade-off parameter β in Eq. (4) is empirically set as 0.30. The first row in Table 3 shows the performance of the original WSSL network [17] trained on Readsp-train (Lnet_WSSL). The performance of the WSSL network with the proposed attentive fusion is given in the second row of Table 3. The performance of the WSSL network trained on combined-readsp-train (represented as Lnet_WSSL_comb) is given in the third row.

It can be seen that the WSSL network with the proposed new architecture (shown in the second row of Table 3) provides better performance than the one used in the original work (first row). This improvement is due to the self-attention based fusion of the embeddings. Furthermore, both Lnet_WSSL and Lnet_WSSL_comb have performed significantly better than the baselines on the unseen test set. This indicates that the inclusion of WSSL reduces the vulnerability of the network to channel-mismatch by encouraging it to generalize better. Since WSSL encourages the two embeddings to capture complementary LID-specific contents, the networks with WSSL have performed better than the baselines on the seen test set also.
Table 3
Performance in accuracy (Acc) and Cavg for WSSL and mWSSL systems in seen and unseen channel conditions.
Next, we experiment by including the proposed modification factor in the training process.

4.3. Effectiveness of the proposed mWSSL

In this case, the network is trained according to Eq. (7), with the values of β and δ empirically set as 0.2 and 0.1 respectively. We use a mini-batch of 100 samples for training. The fourth row in Table 3 shows the results of the system trained on Readsp-train (represented as Lnet_mWSSL). The performance of the network trained on combined-readsp-train (Lnet_mWSSL_comb) is shown in the last row of Table 3.

Compared to WSSL, the proposed mWSSL has provided better performance in unseen channel conditions. Due to the inclusion of the modification term (similarity with the global embeddings), the network learns to ignore the channel-specific contents at both the sample-level as well as at a global level, improving the channel-invariance in a better way. Note that the improvement provided by Lnet_mWSSL_comb is more than that provided by Lnet_mWSSL. This clearly shows that the proposed modification factor is more effective when the training dataset contains significant intra-domain variations. The mWSSL has provided a slight improvement in seen channel conditions too, by encouraging the two embedding extractors to capture complementary LID-specific contents in the speech.

The t-SNE plots in Fig. 4 illustrate the effectiveness of the proposed mWSSL. Here, plot (a) in the top row corresponds to the embeddings e1 and e2 obtained for the test samples from the YouTube dataset using the baseline network (Lnet_baseline). Embeddings e1 and e2 are shown in different shapes. Points in the same color indicate speech samples from the same language. We used samples from 5 languages (Assamese, Bengali, Gujarati, Hindi and Kannada) in the plot, which are shown using different colors. It can be seen that the language clusters of embedding e1 do not overlap with those from e2, indicating that they carry significantly different information. As e1 and e2 are obtained by analyzing the same input at different temporal resolutions, they encode the information differently. However, within the set of clusters from a given embedding extractor, there is high overlap between the clusters of different languages. This overlap is due to the confusion created by channel-mismatch when the test samples are taken from an unseen target domain.

The t-SNE plot in the bottom row of Fig. 4 (plot b) corresponds to the embeddings obtained from the mWSSL network (Lnet_mWSSL). Compared to the baseline, the language clusters in plot b have more compactness and better separation. Since the proposed mWSSL encourages the network to learn a channel-invariant representation of the speech, the network has become less vulnerable to channel-mismatch.

The bivariate kernel density estimate (KDE) plots in Fig. 5 provide some additional insight into how the proposed mWSSL encourages the embeddings e1 and e2 to capture complementary contents in the input. The plots shown here are the KDE plots of e1 and e2 obtained using two randomly selected features. Here, the top row corresponds to the language "Hindi" and the bottom row corresponds to the language "Kannada". In both the top and bottom rows, the plots on the left side are the KDE of the embeddings e1 and e2 obtained
using the baseline network (Lnet_baseline). The deviation between the KDE of e1 and e2 in both these plots clearly indicates that analyzing the input at two different resolutions indeed helps the two embeddings to encode the information quite differently. However, they do contain some similarities, as indicated by the slight overlap between those KDE plots.

In both rows of Fig. 5, the plots on the right side correspond to the KDE of the embeddings e1 and e2 from the network trained with mWSSL (Lnet_mWSSL). Compared to the KDE plots from the baseline network, the KDE plots from the mWSSL network are well separated. This indicates that the proposed mWSSL indeed encourages the two embeddings to encode complementary contents by suppressing the similarities between them. This in turn generalizes the network in a better way, leading to a better performance.
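For readers who wish to reproduce this kind of analysis, a minimal sketch of the embedding visualization is given below, assuming scikit-learn and matplotlib and that the embedding arrays have already been collected from the trained network; the actual plotting setup used for Figs. 4 and 5 is not specified in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_tsne(e1, e2, lang_ids):
    """t-SNE of the two utterance-level embeddings, colored by language.

    e1, e2   : arrays of shape (num_samples, 64), from the two extractors
    lang_ids : array of shape (num_samples,), integer language labels
    """
    emb = np.concatenate([e1, e2], axis=0)
    proj = TSNE(n_components=2, init="pca", random_state=0).fit_transform(emb)
    n = len(e1)
    # e1 points as circles, e2 points as triangles; color encodes the language.
    plt.scatter(proj[:n, 0], proj[:n, 1], c=lang_ids, marker="o", s=10)
    plt.scatter(proj[n:, 0], proj[n:, 1], c=lang_ids, marker="^", s=10)
    plt.title("t-SNE of e1 (circles) and e2 (triangles)")
    plt.show()
```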
5. Conclusion

In this article, we proposed a novel modified within-sample similarity loss (mWSSL) to improve the channel-invariance of a LID network. The proposed mWSSL overcomes the limitations in the recently proposed WSSL approach. The mWSSL encourages the network to suppress the channel-specific contents in the speech at both the sample-level as well as at a global level, leading to improved performance in both seen and unseen channel conditions.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] M. Mclaren, M.K. Nandwana, D. Castán, L. Ferrer, Approaches to multi-domain language recognition, in: Odyssey The Speaker and Language Recognition Workshop, 2018, pp. 90–97.
[2] J.A.V. Lopez, N. Brummer, N. Dehak, End-to-end versus embedding neural networks for language recognition in mismatched conditions, in: Odyssey The Speaker and Language Recognition Workshop, 2018, pp. 112–119.
[3] B.M. Abdullah, T. Avgustinova, B. Mobius, D. Klakow, Cross-domain adaptation of spoken language identification for related languages: the curious case of Slavic languages, in: INTERSPEECH, 2020, pp. 477–481.
[4] Q. Wang, W. Rao, S. Sun, L. Xie, E.S. Chng, H. Li, Unsupervised domain adaptation via domain adversarial training for speaker recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4889–4893.
[5] G. Bhattacharya, J. Alam, P. Kenny, Adapting end-to-end neural speaker verification to new languages and recording conditions with adversarial training, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6041–6045.
[6] H. Wang, H. Dinkel, S. Wang, Y. Qian, K. Yu, Cross-domain replay spoofing attack detection using domain adversarial training, in: INTERSPEECH, 2019, pp. 2938–2942.
[7] R. Duroselle, D. Jouvet, I. Illina, Unsupervised regularization of the embedding extractor for robust language identification, in: Odyssey 2020 The Speaker and Language Recognition Workshop, 2020.
[8] S. Shon, A. Ali, J. Glass, Domain attentive fusion for end-to-end dialect identification with unknown target domain, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5951–5955.
[9] Y. Shinohara, Adversarial multi-task learning of deep neural networks for robust speech recognition, in: INTERSPEECH, 2016, pp. 2369–2372.
[10] Z. Meng, Y. Zhao, J. Li, Y. Gong, Adversarial speaker verification, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6216–6220.
[11] Y.K. Muthusamy, R.A. Cole, B.T. Oshika, The OGI multi-language telephone speech corpus, in: Second International Conference on Spoken Language Processing, 1992.
[12] D. Wang, L. Li, D. Tang, Q. Chen, AP16-OL7: a multilingual database for oriental languages and a language recognition baseline, in: Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016, pp. 1–5.
[13] Z. Tang, D. Wang, Y. Chen, Q. Chen, AP17-OLR challenge: data, plan, and baseline, in: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017, pp. 749–753.
[14] K. Mounika, S. Achanta, H. Lakshmi, S.V. Gangashetty, A.K. Vuppala, An investigation of deep neural network architectures for language recognition in Indian languages, in: INTERSPEECH, 2016, pp. 2930–2933.
[15] M.A. Zissman, T.P. Gleason, D. Rekart, B.L. Losiewicz, Automatic dialect identification of extemporaneous conversational, Latin American Spanish speech, in: IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (ICASSP), 1996, pp. 777–780.
[16] Y. Lei, J.H. Hansen, Dialect classification via text-independent training and testing for Arabic, Spanish, and Chinese, IEEE Trans. Audio Speech Lang. Process. 19 (1) (2010) 85–96.
[17] H. Muralikrishna, S. Kapoor, A.D. Dileep, P. Rajan, Spoken language identification in unseen target domain using within-sample similarity loss, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7223–7227.
[18] A. Silnova, P. Matejka, O. Glembek, O. Plchot, O. Novotny, F. Grezl, P. Schwarz, L. Burget, J. Cernocky, BUT/Phonexia bottleneck feature extractor, in: Odyssey The Speaker and Language Recognition Workshop, 2018, pp. 283–287.
[19] H. Muralikrishna, S. Pulkit, J. Anuksha, A.D. Dileep, Spoken language identification using bidirectional LSTM based LID sequential senones, in: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 320–326.
[20] D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, S. Khudanpur, Spoken language recognition using x-vectors, in: Odyssey The Speaker and Language Recognition Workshop, 2018, pp. 105–111.
[21] A. Lozano-Diez, O. Plchot, P. Matejka, J. Gonzalez-Rodriguez, DNN based embeddings for language recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5184–5188.
[22] H. Muralikrishna, S. Gupta, A.D. Dileep, P. Rajan, Noise-robust spoken language identification using language relevance factor based embedding, in: IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 644–651.
[23] R.K. Vuddagiri, H.K. Vydana, A.K. Vuppala, Curriculum learning based approach for noise robust language identification using DNN with attention, Expert Syst. Appl. 110 (2018) 290–297.
[24] The 2015 NIST Language Recognition Evaluation plan (LRE15), 2015. https://www.nist.gov/itl/iad/mig/2015-language-recognition-evaluation.