Article info

Article history: Received 29 August 2021; Revised 29 January 2022; Accepted 13 April 2022; Available online 16 April 2022
Edited by Prof. S. Sarkar
Keywords: Spoken language identification; Unseen channel condition; Channel-mismatch; Domain-mismatch; Deep learning; Within-sample similarity loss

Abstract

State-of-the-art spoken language identification (LID) systems use sophisticated training strategies to improve the robustness to unseen channel conditions in real-world test samples. However, all these approaches require training samples from multiple channels with corresponding channel-labels, which are not available in many cases. Recent research in this regard has shown the possibility of learning a channel-invariant representation of the speech using an auxiliary loss function called within-sample similarity loss (WSSL), which does not require samples from multiple channels. Specifically, the WSSL encourages the LID network to ignore channel-specific contents in the speech by minimizing the similarities between two utterance-level embeddings of the same sample. However, as the WSSL approach operates at the sample level, it ignores the channel variations that may be present across different training samples within the same dataset. In this work, we propose a modification to the WSSL approach to address this limitation. Specifically, along with the WSSL, the proposed modified WSSL (mWSSL) approach additionally considers the similarities with two global-level embeddings which represent the average channel-specific contents in a given mini-batch of training samples. The proposed modification allows the network to have a better view of the channel-specific contents in the training dataset, leading to improved performance in unseen channel conditions.

© 2022 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.patrec.2022.04.018
fusion weights dynamically based on the channel conditions in the input, leading to a better performance in unseen channel conditions. Another example is [9], where an AMTL-based approach has been used to improve the noise-invariance of a speech recognition system to test samples with unseen types of background noise. In [9], the AMTL uses training samples with four known types of noise to improve the noise-invariance. Similarly, the AMTL-based training in [10] uses samples from "Quiet", "TV", and "Music" environments to improve the generalization of the system to unseen environmental conditions. Note that, in all these works, the training dataset used to train the system contains samples from two or more domains with corresponding domain-labels.

However, there are many cases in which the training dataset contains samples from only one channel, lacking channel-diversity. For example, the "OGI Telephone Speech Corpus" [11] contains samples collected using only telephone lines. Similarly, the speech corpora used in the "AP16/17-OLR Challenge" [12,13] contain samples from only mobile phones, and the Indian languages dataset used in [14] contains speech samples recorded using only high quality recording devices in a controlled environment. In addition to these, the "MIAMI" Spanish dialect corpus [15] and the Arabic dialect corpus used in [16] also contain speech samples recorded using only one type of microphone in a controlled environment. Due to the very limited channel-diversity in the training samples, the LID system built using such datasets becomes highly vulnerable to channel-mismatch when test samples contain unseen channel conditions.

In this work, our aim is to improve the robustness of a LID system to channel-mismatch when the test samples are expected to contain unseen channel conditions. We assume that we do not have access to any labeled/unlabeled samples from the target domain. Furthermore, we also assume that the training dataset used does not contain samples from multiple channels with corresponding channel-labels. This situation is commonly encountered when we are building a LID system for a real-world application using a training dataset containing very limited channel-diversity as in [11–16]. Note that traditional supervised/unsupervised domain adaptation cannot be used in this case as we do not have a priori knowledge about the target domain. Furthermore, state-of-the-art AMTL-based [9,10] and domain attentive fusion based [8] approaches also cannot be used in this case as they require training samples from multiple channels with corresponding channel-labels.

To address this problem, we proposed a novel within-sample similarity loss (WSSL) based approach in our recent work [17]. The proposed WSSL is an auxiliary loss which encourages the LID network to learn a channel-invariant representation of the speech even when the training dataset lacks channel-diversity. Specifically, the WSSL gives a measure of similarity between a pair of utterance-level embeddings of the same speech sample obtained using two embedding extractors present at the front-end of the LID network. These embedding extractors are designed to capture similar information about the channel, but dissimilar LID-specific information in the speech. By including this WSSL score as an additional loss during the training process, the network is penalized for encoding similar information in these two embeddings. This encourages the network to suppress the channel-specific contents in the speech that are common to both the embeddings, leading to a better channel-invariance.

However, as WSSL considers only the similarities present between two embeddings of the same speech sample, it does not capture the intra-domain variations (inter-session variations present within the same dataset). Note that, in spite of using similar types of recording devices, speech samples collected using different devices can still contain certain differences in their channel/background conditions. While state-of-the-art approaches ignore these intra-domain variations, we argue that the inclusion of this factor can lead to better channel-invariance of the LID system.

In this article, we propose a novel modified within-sample similarity loss (mWSSL) based approach for improving the channel-invariance of the LID network. Compared to WSSL [17], the mWSSL contains an additional factor to explicitly consider the intra-domain variations in the training dataset. Specifically, apart from suppressing the similarities between two embeddings of the same speech sample, mWSSL additionally suppresses the similarities of these two embeddings with two global-level embeddings which represent the average channel-specific contents in the given mini-batch of training samples. These global-level embeddings are obtained by averaging the respective utterance-level embeddings of all training samples in the given mini-batch.

The major contributions of this article are given below.

• A novel modified within-sample similarity loss (mWSSL) based approach for improving the channel-invariance of the LID network. The proposed mWSSL encourages the network to suppress the channel-specific contents in the speech at both the sample-level as well as at a global (mini-batch) level, leading to a better channel-invariance.
• Re-implementation of the original WSSL [17] approach with some architectural changes such as self-attention based fusion of the utterance-level embeddings, modification in the method of computing the similarities between those embeddings, etc., leading to improved performance.
• Extensive experimentation on the proposed approach and supplementary evidence to show its effectiveness.

The rest of this article is organized as follows. We first discuss the baseline and WSSL-based LID systems and their limitations in Section 2, followed by the proposed mWSSL approach to overcome those limitations. In Section 3, we give details of the speech corpora used in this work. Performance evaluation and conclusion are given in Sections 4 and 5 respectively.

2. Neural LID models

In this section, we first discuss the architecture of our baseline LID network followed by the WSSL network. We then discuss the limitation in the WSSL approach, followed by a solution to address that limitation – the mWSSL approach.

2.1. Baseline LID network

Fig. 1 shows the block diagram of our baseline system (the WSSL block in Fig. 1 is to be ignored for the baseline LID network; it is present only for the WSSL network). It contains a feature extractor block at the front-end to produce an utterance-level embedding of the speech (represented as u-vector in Fig. 1), followed by a language classifier block. The feature extractor block contains a pre-trained bottleneck feature (BNF) extractor [18] at the front-end to convert the input speech into a sequence of BNF vectors. The input sequence of BNFs is then analyzed by two identical embedding extractors to provide two utterance-level embeddings (represented as e1 and e2 in Fig. 1) of the input speech. These embedding extractors are designed to process the input BNF sequence at two different temporal resolutions. This analysis at different resolutions allows the two embedding extractors to have two different views of the same input, thereby allowing them to capture complementary LID-specific contents in the input speech [17].

As mentioned earlier, both embedding extractors have identical architecture. This architecture is motivated by the networks in [19–22]. As shown in Fig. 2a, each embedding extractor contains a set of BLSTM layers.
Fig. 1. Block diagram of the baseline/WSSL network used in this work. Red coloured frames in sequence of BNFs indicate the frames selected as input within an analysis
window. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Let E = [e1, e2] be the set of embeddings obtained from the two embedding extractors (where e1, e2 ∈ R^D). Using E, an intermediate representation is computed first as:

γn = tanh(tanh(Wa en + ba) Wγ + bγ),   n ∈ {1, 2}   (1)

Here, Wa ∈ R^(D×N) and ba ∈ R^(1×N) are respectively the weights and biases of the dense layer with N nodes, and Wγ ∈ R^(N×1) and bγ ∈ R^(1×1) are the weights and bias of the dense layer with a single node. All these learnable parameters of the attention network are optimized during the training such that more weight is assigned to the embedding with more LID-specific contents. Using γ = [γ1, γ2], the fusion weights α = [α1, α2] are computed as α = softmax(γ). The u-vector is then obtained as:

u = E α = α1 e1 + α2 e2.   (2)

This u-vector is then passed on to the language classifier block. This forms an end-to-end LID network, which can be trained using the language classification (categorical cross-entropy) loss. We denote this network as our baseline LID network (Lnet_baseline). However, training the Lnet_baseline using only the primary language classification loss does not prevent the network from capturing the channel-specific information in the input. This makes the network vulnerable to channel-mismatch. We use WSSL to address this issue.

2.2. Improving the channel-invariance using WSSL

As shown in Fig. 1, the WSSL is a measure of similarity between the embeddings e1 and e2 of the given input sample. During the training, this WSSL score is added as an auxiliary loss to the primary classification loss so that the network is encouraged to suppress the similarity between the embeddings e1 and e2. The motivation for suppressing the similarity between these two embeddings is the following. It is seen that the channel-specific contents in a clean speech utterance remain (almost) constant. Hence, a given clean speech sample can be visualized as a combination of a fast-changing foreground (speech) and a constant or slowly varying background (related to the channel). As the two embedding extractors process the input at different resolutions, they encode dissimilar information about the foreground speech (the fast-changing part in the speech) and similar information about the channel (the constant part). Due to this, the two embeddings e1 and e2 carry dissimilar LID-specific information, but similar channel-specific information. Therefore, encouraging the network to suppress the similarity between e1 and e2 enables it to learn a channel-independent representation of the speech.

We compute the WSSL as follows:

Lw(θF) = Lcos(e1, e2)   (3)

where Lcos(e1, e2) = (e1 · e2) / (‖e1‖ ‖e2‖) is the cosine similarity between the embeddings, and θF represents the parameters of the feature extractor block. Unlike [17], where a combination of cosine similarity and Euclidean distance is used, we use only the cosine similarity in this work, as the inclusion of the Euclidean distance did not provide noticeable improvement in the performance with the new architecture.

With the WSSL as an auxiliary loss, we train our LID network using the following total loss.

LT(θF, θC) = Ll(θF, θC) + β Lw(θF)   (4)

where LT(θF, θC) is the total loss, Ll(θF, θC) is the primary language classification (cross-entropy) loss, and Lw(θF) is the WSSL. The scalar value β is the trade-off parameter, and θF and θC are respectively the parameters of the feature extractor and language classifier blocks of the LID network.

Note that, as the embeddings e1 and e2 are designed to capture dissimilar LID-specific contents, the amount of similarity between them given by WSSL is mainly due to the channel-specific contents present in them. During training, the primary language classification loss continuously forces the network to capture more and more LID-specific contents, and the WSSL simultaneously prevents the network from encoding the channel-specific contents in the speech. Furthermore, even if the embeddings e1 and e2 contain some similarities due to LID-specific contents, the combination of WSSL with the language classification loss encourages them to capture complementary LID-specific contents by suppressing those similarities.

However, the WSSL is designed to suppress only the similarities present between two embeddings of the same speech sample. Hence, it does not capture the intra-domain variations in the channel-specific contents that may be present within the training dataset. To overcome this limitation, we propose a modification to the WSSL approach.

2.3. Proposed modified WSSL (mWSSL) approach

Fig. 3 shows the block diagram of the proposed mWSSL approach. Compared to WSSL, the mWSSL includes an additional factor which allows the network to have a better view of the channel-specific contents in the training dataset. Specifically, this additional factor gives a measure of similarity for each of the two embeddings of a given training sample with two global embeddings, as given below.

Lg(θF) = Lcos(e1, ê2) + Lcos(e2, ê1)   (5)

where ê1 and ê2 are the global embeddings, which are computed as follows.

ên = (1/Nb) Σ_{i=1}^{Nb} e_{n,i},   n ∈ {1, 2}   (6)

Here, Nb represents the total number of training samples in the given mini-batch. Note that, as these global embeddings are obtained by averaging all the sample-level embeddings (e1 and e2) in the given mini-batch, they represent the average channel-specific contents in that mini-batch. Also, as they are averaged irrespective of their language classes, they are class-independent in nature. Hence, suppressing the similarities with these global embeddings leads to further improvement in the channel-invariance of the network without losing any LID-specific information.

With the auxiliary WSSL and the proposed modification term, we compute the total loss as below.

LT(θF, θC) = Ll(θF, θC) + β Lw(θF) + δ Lg(θF)   (7)

Here, Lg(θF) is the proposed modification factor weighted by the trade-off parameter δ.

Note that the network is trained with only the primary language classification loss and WSSL for the very first mini-batch of training samples, excluding the modification factor. For all subsequent mini-batches, the training procedure involves two steps. First, the global embeddings ê1 and ê2 (used in Eq. (5)) are computed using all training samples in the given mini-batch by keeping the model parameters (obtained after the previous mini-batch) fixed. In the second step, the model parameters are updated using these pre-computed global embeddings (according to Eq. (7)) for the given mini-batch. This two-step procedure ensures that the model is aware of the channel-specific contents in the given set of training samples during the parameter optimization. Due to the combination of Lw(θF) and Lg(θF) with the primary classification loss, the network learns to suppress the channel-specific contents present at both the sample-level as well as at a global (mini-batch) level.
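For concreteness, the self-attentive fusion of Eqs. (1)–(2) can be sketched as the following minimal PyTorch-style module. This is only an illustrative reading of the equations; the module and variable names (e.g., AttentiveFusion) are our own choices and not part of the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Sketch of the self-attentive fusion in Eqs. (1)-(2).

    Fuses the two utterance-level embeddings e1, e2 (each of dimension D)
    into a single u-vector using learned attention weights.
    """
    def __init__(self, emb_dim: int, hidden: int = 100):
        super().__init__()
        self.dense_a = nn.Linear(emb_dim, hidden)   # W_a, b_a
        self.dense_g = nn.Linear(hidden, 1)         # W_gamma, b_gamma

    def forward(self, e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
        # e1, e2: (batch, D)
        E = torch.stack([e1, e2], dim=1)                                 # (batch, 2, D)
        gamma = torch.tanh(self.dense_g(torch.tanh(self.dense_a(E))))   # Eq. (1): (batch, 2, 1)
        alpha = torch.softmax(gamma, dim=1)                             # fusion weights alpha_1, alpha_2
        u = (alpha * E).sum(dim=1)                                       # Eq. (2): u = alpha_1*e1 + alpha_2*e2
        return u
```

The softmax over the two scores guarantees that the fusion weights are non-negative and sum to one, so the u-vector is always a convex combination of e1 and e2.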
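The loss terms in Eqs. (3)–(7) and the two-step per-mini-batch procedure can be outlined as below. This is a sketch under our own assumptions (a hypothetical model returning (e1, e2, logits), and standard optimizer and data objects); it is not the authors' code, and the special case of the very first mini-batch, which excludes the modification factor, is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def wssl(e1, e2):
    """Eq. (3): within-sample similarity loss (cosine similarity of e1 and e2)."""
    return F.cosine_similarity(e1, e2, dim=-1).mean()

def mwssl_global(e1, e2, e1_hat, e2_hat):
    """Eq. (5): similarity of each embedding with the opposite global embedding."""
    return (F.cosine_similarity(e1, e2_hat.expand_as(e1), dim=-1).mean()
            + F.cosine_similarity(e2, e1_hat.expand_as(e2), dim=-1).mean())

def train_step(model, optimizer, batch, labels, beta=0.2, delta=0.1):
    """Two-step procedure for one mini-batch (hypothetical model/optimizer API)."""
    # Step 1: compute the global embeddings (Eq. (6)) with the current
    # parameters kept fixed (no gradient flows through this averaging).
    with torch.no_grad():
        e1, e2, _ = model(batch)                  # assumed to return (e1, e2, logits)
        e1_hat = e1.mean(dim=0, keepdim=True)
        e2_hat = e2.mean(dim=0, keepdim=True)

    # Step 2: update the parameters with the total loss of Eq. (7).
    e1, e2, logits = model(batch)
    loss = (F.cross_entropy(logits, labels)                      # L_l: primary LID loss
            + beta * wssl(e1, e2)                                # L_w: Eq. (3)
            + delta * mwssl_global(e1, e2, e1_hat, e2_hat))      # L_g: Eq. (5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```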
Fig. 3. Block diagram of the proposed mWSSL approach. Compared to WSSL, the mWSSL additionally considers the similarities with two global embeddings.
Table 1
Details about the IIT-Mandi Indian languages corpus used in this work.
3. Datasets used in the study

We use the IIT-Mandi Indian languages corpus [17] in this work. This corpus is available online (https://speechiitmandi.github.io/air/). There are two parts in this corpus: the IIT-Mandi read speech and the IIT-Mandi YouTube dataset. Table 1 shows details such as the number of hours of speech data, number of utterances, and number of speakers in each language in these two datasets.

The IIT-Mandi read speech dataset contains audio files obtained from news broadcasts of All India Radio (https://newsonair.gov.in/). Each language contains around 4.5 h of speech data from at least 15 speakers. We use 80% of the samples from this dataset for training (represented as Readsp-train) and the remaining 20% for development. Since this development part has the same channel conditions as the training dataset, it is called the seen test set.

The IIT-Mandi YouTube dataset contains audio files extracted from various YouTube videos on online teaching and personal interviews. Each language contains samples from at least 10 speakers. Note that there is significant domain-mismatch between the IIT-Mandi read speech and IIT-Mandi YouTube datasets in terms of channel, type of speech, background conditions, etc. We use the YouTube dataset only for testing. Since no samples from this dataset are used during training, the channel conditions in it are unseen by the LID system. Hence, it is called the unseen test set.

Note that the amount of speech data and the number of speakers in the IIT-Mandi read speech dataset are limited, leading to limited diversity in the training dataset.
Table 2
Performances of baseline systems when tested in seen and unseen channel conditions. Performances are given
in accuracy (Acc) and Cavg .
In order to have higher intra-domain variations, we additionally use a part of the IIIT-Hyderabad Indian languages corpus [14,23] for training in some of our experiments. Like IIT-Mandi read speech, this dataset also contains read speech samples recorded in a controlled environment. This dataset is available upon request. In each language, we use approximately 6 h of speech (about 25 speakers per language) from this dataset. By combining samples from the IIIT-Hyderabad corpus, we get a combined training dataset called combined-readsp-train, having about 10 h of speech in each language with at least 40 speakers.

In our experiments, we have divided all larger speech files into smaller ones such that the duration of the speech samples used in this work (including train and test datasets) varies between 2 s and 15 s, with a mean of 7.8 s.
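As an illustration of this data preparation step, the following small sketch (assuming the soundfile library and a hypothetical output naming scheme) splits a long recording into chunks whose durations fall in the 2–15 s range; the exact splitting procedure used by the authors is not described beyond the duration statistics above.

```python
import numpy as np
import soundfile as sf

def split_recording(path, out_prefix, min_dur=2.0, max_dur=15.0, seed=0):
    """Split one long audio file into chunks of 2-15 s (illustrative only)."""
    rng = np.random.default_rng(seed)
    audio, sr = sf.read(path)
    start, idx = 0, 0
    while len(audio) - start >= int(min_dur * sr):
        # Draw a chunk length uniformly between the minimum and maximum duration.
        dur = rng.uniform(min_dur, max_dur)
        end = min(start + int(dur * sr), len(audio))
        sf.write(f"{out_prefix}_{idx:04d}.wav", audio[start:end], sr)
        start, idx = end, idx + 1
    # Any tail shorter than 2 s is discarded in this sketch.
```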
4. Results and discussion

In this section, we study the effectiveness of the proposed method. Performances are given in terms of accuracy (%) and the Cavg (%) metric [24]. Lower values of Cavg indicate better performance. We have re-implemented all the systems in the original work.
4.1. Baseline systems

To compare the effectiveness of WSSL and the proposed mWSSL, we use two baselines that are trained using only the primary language classification loss. The first one is the x-vector based LID system [20]. It processes the sequence of BNFs to obtain a 512-dimensional x-vector, which is then classified using a Gaussian back-end [20]. The clean speech samples from the Readsp-train dataset (having approximately 4 h of speech per language) are used for training this system. Results obtained for the x-vector based system on the seen and unseen test sets are shown in the 1st row of Table 2.

The second one is a baseline LID network (Lnet_baseline) which contains only the feature extractor and language classifier blocks in Fig. 1 (excluding the WSSL block). The feature extractor block of this network contains two identical embedding extractors. The first embedding extractor analyzes the input by dividing it into chunks of 0.5 s (T), whereas the second embedding extractor uses 1 s chunks. These chunk sizes are chosen empirically based on the best performance. The embedding extractors contain two BLSTM layers with 256 and 32 nodes respectively in the first and second layers to produce the embeddings e1 and e2 (where e1, e2 ∈ R^64). These embeddings are then processed by a self-attention network which contains a dense layer with 100 (N) nodes followed by a dense layer with a single node to produce the attention (fusion) weights. The two embeddings are then added using these fusion weights to produce the u-vector. This u-vector is then fed to the language classifier (output layer) with softmax activation, containing 8 nodes to represent the languages. Results obtained for this system when trained on Readsp-train are given in the third row of Table 2. The second row in Table 2 corresponds to the baseline system used in the original work [17], which uses fixed weights for the fusion of the two embeddings instead of the self-attention based fusion.
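Putting the quoted dimensions together, a compact sketch of the Lnet_baseline forward pass could look as follows. It reuses the AttentiveFusion module from the earlier sketch, assumes the chunked BNF inputs are prepared outside the module, and uses average pooling over time to obtain the utterance-level embeddings; these choices are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EmbeddingExtractor(nn.Module):
    """One embedding extractor: two BLSTM layers (256 and 32 units) over the
    BNF sequence, pooled into a 64-dimensional utterance-level embedding.
    The average pooling over time is an assumption made for this sketch."""
    def __init__(self, bnf_dim: int):
        super().__init__()
        self.blstm1 = nn.LSTM(bnf_dim, 256, batch_first=True, bidirectional=True)
        self.blstm2 = nn.LSTM(512, 32, batch_first=True, bidirectional=True)

    def forward(self, bnf: torch.Tensor) -> torch.Tensor:
        # bnf: (batch, time, bnf_dim), framed from 0.5 s or 1 s chunks
        h, _ = self.blstm1(bnf)
        h, _ = self.blstm2(h)          # (batch, time, 64)
        return h.mean(dim=1)           # utterance-level embedding e in R^64

class LnetBaseline(nn.Module):
    """Baseline LID network: two extractors + attentive fusion + 8-way classifier."""
    def __init__(self, bnf_dim: int, num_langs: int = 8):
        super().__init__()
        self.ext1 = EmbeddingExtractor(bnf_dim)    # operates on 0.5 s chunks
        self.ext2 = EmbeddingExtractor(bnf_dim)    # operates on 1 s chunks
        self.fusion = AttentiveFusion(emb_dim=64, hidden=100)  # from the earlier sketch
        self.classifier = nn.Linear(64, num_langs)

    def forward(self, batch):
        bnf_05s, bnf_1s = batch        # BNF sequences chunked at the two resolutions
        e1 = self.ext1(bnf_05s)
        e2 = self.ext2(bnf_1s)
        u = self.fusion(e1, e2)        # u-vector, Eq. (2)
        return e1, e2, self.classifier(u)   # logits -> softmax over 8 languages
```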
The fourth row in Table 2 shows the performance of the baseline LID network with the proposed new architecture trained using the combined-readsp-train dataset (containing about 10 h of speech per language with more speakers). The number of nodes in each hidden layer of this network (represented as Lnet_baseline_comb) has been doubled compared to Lnet_baseline due to the increased size of the dataset.

From the results in Table 2, it is seen that the Lnet_baseline with the proposed new architecture (using attentive fusion) has performed better than the baselines on both the seen and unseen test sets. Unlike in [17], the Lnet_baseline in this work uses a self-attention based fusion which estimates the fusion weights dynamically based on the LID-specific contents in the two embeddings, leading to a better performance.

Note that both Lnet_baseline and Lnet_baseline_comb have performed very well on the seen test set (read speech) but poorly on the unseen test set (YouTube dataset). This difference in performance clearly shows the effect of channel-mismatch. As these networks have seen only read speech samples with limited channel diversity during the training, both of them have become vulnerable to the unseen channel conditions in the YouTube dataset. Compared to Lnet_baseline, the baseline network trained on combined-readsp-train has provided slightly better performance in both seen and unseen channel conditions. This improvement is attributed to the better generalization of the network due to the presence of additional samples from the IIIT-Hyderabad corpus.

However, in spite of using a larger training dataset with more diversity (combined-readsp-train), the performance of Lnet_baseline_comb on the unseen test set is far below that on the seen test set. This clearly indicates the necessity for a sophisticated training strategy to further improve the channel-invariance of the network.

4.2. Effectiveness of the WSSL

Here, we experiment by including WSSL in the training process as in Eq. (4). The value of the trade-off parameter β in Eq. (4) is empirically set as 0.30. The first row in Table 3 shows the performance of the original WSSL network [17] trained on Readsp-train (Lnet_WSSL). The performance of the WSSL network with the proposed attentive fusion is given in the second row of Table 3. The performance of the WSSL network trained on combined-readsp-train (represented as Lnet_WSSL_comb) is given in the third row.

It can be seen that the WSSL network with the proposed new architecture (shown in the second row of Table 3) provides better performance than the one used in the original work (first row). This improvement is due to the self-attention based fusion of the embeddings. Furthermore, both Lnet_WSSL and Lnet_WSSL_comb have performed significantly better than the baselines on the unseen test set. This indicates that the inclusion of WSSL reduces the vulnerability of the network to channel-mismatch by encouraging it to generalize better. Since WSSL encourages the two embeddings to capture complementary LID-specific contents, the networks with WSSL have performed better than the baselines on the seen test set also.
Table 3
Performance in accuracy (Acc) and Cavg for WSSL and mWSSL systems in seen and unseen channel conditions.
Next, we experiment by including the proposed modification factor in the training process.

4.3. Effectiveness of the proposed mWSSL

In this case, the network is trained according to Eq. (7), with the values of β and δ empirically set as 0.2 and 0.1 respectively. We use a mini-batch of 100 samples for training. The fourth row in Table 3 shows the results of the system trained on Readsp-train (represented as Lnet_mWSSL). The performance of the network trained on combined-readsp-train (Lnet_mWSSL_comb) is shown in the last row of Table 3.

Compared to WSSL, the proposed mWSSL has provided better performance in unseen channel conditions. Due to the inclusion of the modification term (similarity with the global embeddings), the network learns to ignore the channel-specific contents at both the sample-level as well as at a global level, improving the channel-invariance in a better way. Note that the improvement provided by Lnet_mWSSL_comb is more than that provided by Lnet_mWSSL. This clearly shows that the proposed modification factor is more effective when the training dataset contains significant intra-domain variations. The mWSSL has provided a slight improvement in seen channel conditions too, by encouraging the two embedding extractors to capture complementary LID-specific contents in the speech.

The t-SNE plots in Fig. 4 illustrate the effectiveness of the proposed mWSSL. Here, plot (a) in the top row corresponds to the embeddings e1 and e2 obtained for the test samples from the YouTube dataset using the baseline network (Lnet_baseline). Embeddings e1 and e2 are shown in different shapes. Points in the same color indicate speech samples from the same language. We used samples from 5 languages (Assamese, Bengali, Gujarati, Hindi and Kannada) in the plot, which are shown using different colors. It can be seen that the language clusters of embedding e1 do not overlap with those from e2, indicating that they carry significantly different information. As e1 and e2 are obtained by analyzing the same input at different temporal resolutions, they encode the information differently. However, within the set of clusters from a given embedding extractor, there is high overlap between the clusters of different languages. This overlap is due to the confusion created by channel-mismatch when the test samples are taken from an unseen target domain.

The t-SNE plot in the bottom row of Fig. 4 (plot b) corresponds to the embeddings obtained from the mWSSL network (Lnet_mWSSL). Compared to the baseline, the language clusters in plot b have more compactness and better separation. Since the proposed mWSSL encourages the network to learn a channel-invariant representation of the speech, the network has become less vulnerable to channel-mismatch.

The bivariate kernel density estimate (KDE) plots in Fig. 5 provide some additional insight into how the proposed mWSSL encourages the embeddings e1 and e2 to capture complementary contents in the input. The plots shown here are the KDE plots of e1 and e2 obtained using two randomly selected features. Here, the top row corresponds to the language "Hindi" and the bottom row corresponds to the language "Kannada". In both the top and bottom rows, the plots on the left side are the KDE of the embeddings e1 and e2 obtained
using the baseline network (Lnet_baseline). The deviation between the KDE of e1 and e2 in both these plots clearly indicates that analyzing the input at two different resolutions indeed helps the two embeddings to encode the information quite differently. However, they do contain some similarities, as indicated by the slight overlap between those KDE plots.

In both rows of Fig. 5, the plots on the right side correspond to the KDE of the embeddings e1 and e2 from the network trained with mWSSL (Lnet_mWSSL). Compared to the KDE plots from the baseline network, the KDE plots from the mWSSL network are well separated. This indicates that the proposed mWSSL indeed encourages the two embeddings to encode complementary contents by suppressing the similarities between them. This in turn generalizes the network in a better way, leading to a better performance.
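For readers who wish to reproduce this kind of analysis, a minimal sketch of the embedding visualization is given below, assuming scikit-learn and matplotlib and that the embedding arrays have already been collected from the trained network; the actual plotting setup used for Figs. 4 and 5 is not specified in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_tsne(e1, e2, lang_ids):
    """t-SNE of the two utterance-level embeddings, colored by language.

    e1, e2   : arrays of shape (num_samples, 64), from the two extractors
    lang_ids : array of shape (num_samples,), integer language labels
    """
    emb = np.concatenate([e1, e2], axis=0)
    proj = TSNE(n_components=2, init="pca", random_state=0).fit_transform(emb)
    n = len(e1)
    # e1 points as circles, e2 points as triangles; color encodes the language.
    plt.scatter(proj[:n, 0], proj[:n, 1], c=lang_ids, marker="o", s=10)
    plt.scatter(proj[n:, 0], proj[n:, 1], c=lang_ids, marker="^", s=10)
    plt.title("t-SNE of e1 (circles) and e2 (triangles)")
    plt.show()
```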
5. Conclusion

In this article, we proposed a novel modified within-sample similarity loss (mWSSL) to improve the channel-invariance of a LID network. The proposed mWSSL overcomes the limitations in the recently proposed WSSL approach. The mWSSL encourages the network to suppress the channel-specific contents in the speech at both the sample-level as well as at a global level, leading to improved performance in both seen and unseen channel conditions.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] M. Mclaren, M.K. Nandwana, D. Castán, L. Ferrer, Approaches to multi-domain language recognition, in: Odyssey The Speaker and Language Recognition Workshop, 2018, pp. 90–97.
[2] J.A.V. Lopez, N. Brummer, N. Dehak, End-to-end versus embedding neural networks for language recognition in mismatched conditions, in: Odyssey The Speaker and Language Recognition Workshop, 2018, pp. 112–119.
[3] B.M. Abdullah, T. Avgustinova, B. Mobius, D. Klakow, Cross-domain adaptation of spoken language identification for related languages: the curious case of Slavic languages, in: INTERSPEECH, 2020, pp. 477–481.
[4] Q. Wang, W. Rao, S. Sun, L. Xie, E.S. Chng, H. Li, Unsupervised domain adaptation via domain adversarial training for speaker recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4889–4893.
[5] G. Bhattacharya, J. Alam, P. Kenny, Adapting end-to-end neural speaker verification to new languages and recording conditions with adversarial training, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6041–6045.
[6] H. Wang, H. Dinkel, S. Wang, Y. Qian, K. Yu, Cross-domain replay spoofing attack detection using domain adversarial training, in: INTERSPEECH, 2019, pp. 2938–2942.
[7] R. Duroselle, D. Jouvet, I. Illina, Unsupervised regularization of the embedding extractor for robust language identification, in: Odyssey 2020 The Speaker and Language Recognition Workshop, 2020.
[8] S. Shon, A. Ali, J. Glass, Domain attentive fusion for end-to-end dialect identification with unknown target domain, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5951–5955.
[9] Y. Shinohara, Adversarial multi-task learning of deep neural networks for robust speech recognition, in: INTERSPEECH, 2016, pp. 2369–2372.
[10] Z. Meng, Y. Zhao, J. Li, Y. Gong, Adversarial speaker verification, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6216–6220.
[11] Y.K. Muthusamy, R.A. Cole, B.T. Oshika, The OGI multi-language telephone speech corpus, in: Second International Conference on Spoken Language Processing, 1992.
[12] D. Wang, L. Li, D. Tang, Q. Chen, AP16-OL7: a multilingual database for oriental languages and a language recognition baseline, in: Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016, pp. 1–5.
[13] Z. Tang, D. Wang, Y. Chen, Q. Chen, AP17-OLR challenge: data, plan, and baseline, in: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017, pp. 749–753.
[14] K. Mounika, S. Achanta, H. Lakshmi, S.V. Gangashetty, A.K. Vuppala, An investigation of deep neural network architectures for language recognition in Indian languages, in: INTERSPEECH, 2016, pp. 2930–2933.
[15] M.A. Zissman, T.P. Gleason, D. Rekart, B.L. Losiewicz, Automatic dialect identification of extemporaneous conversational, Latin American Spanish speech, in: IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (ICASSP), 1996, pp. 777–780.
[16] Y. Lei, J.H. Hansen, Dialect classification via text-independent training and testing for Arabic, Spanish, and Chinese, IEEE Trans. Audio Speech Lang. Process. 19 (1) (2010) 85–96.
[17] H. Muralikrishna, S. Kapoor, A.D. Dileep, P. Rajan, Spoken language identification in unseen target domain using within-sample similarity loss, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7223–7227.
[18] A. Silnova, P. Matejka, O. Glembek, O. Plchot, O. Novotny, F. Grezl, P. Schwarz, L. Burget, J. Cernocky, BUT/Phonexia bottleneck feature extractor, in: Odyssey The Speaker and Language Recognition Workshop, 2018, pp. 283–287.
[19] H. Muralikrishna, S. Pulkit, J. Anuksha, A.D. Dileep, Spoken language identification using bidirectional LSTM based LID sequential senones, in: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 320–326.
[20] D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, S. Khudanpur, Spoken language recognition using x-vectors, in: Odyssey The Speaker and Language Recognition Workshop, 2018, pp. 105–111.
[21] A. Lozano-Diez, O. Plchot, P. Matejka, J. Gonzalez-Rodriguez, DNN based embeddings for language recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5184–5188.
[22] H. Muralikrishna, S. Gupta, A.D. Dileep, P. Rajan, Noise-robust spoken language identification using language relevance factor based embedding, in: IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 644–651.
[23] R.K. Vuddagiri, H.K. Vydana, A.K. Vuppala, Curriculum learning based approach for noise robust language identification using DNN with attention, Expert Syst. Appl. 110 (2018) 290–297.
[24] The 2015 NIST Language Recognition Evaluation plan (LRE15), 2015. https://www.nist.gov/itl/iad/mig/2015-language-recognition-evaluation.