
THE TIANJIN UNIVERSITY AUDIOVISUAL COGNITIVE COMPUTING TEAM SPEAKER VERIFICATION FOR THE SSTC2024

Hangming Zhang, Zheng Li, Qianyi Bai, Zhichao Deng

College of Intelligence and Computing


Tianjin University
Tianjin, 300354, China

ABSTRACT

In recent years, speaker verification technology has become a common identity verification method, widely used in fields such as AI assistants, financial security, smart cities, and criminal investigation. However, with the development of artificial intelligence technologies such as voice conversion, speaker verification systems are facing more threats than ever before. The Source Speaker Tracing Challenge (SSTC) plays an important role in addressing these problems, providing a platform for communication and innovation in source speaker verification and giving impetus to the application of new technology in this area. In this challenge, our team, Tianjin University Audiovisual Cognitive Computing, proposed a system based on ResNet152 derived from the WeSpeaker framework. During training, an AAM classifier was used, and eight datasets simulating different deception scenarios were used to fine-tune the model so that it can recognize and counter complex attacks. The model achieved an EER of 19.32% on the test set for voice conversion speaker verification, ranking third in SSTC 2024. In the future, we will continue to explore various ways to further improve robustness and recognition against more sophisticated deception scenarios.

Index Terms— Source speaker verification, speaker verification, the Source Speaker Tracing Challenge (SSTC)

1. INTRODUCTION

In today’s digital era, speaker verification[1] has emerged as a crucial biometric authentication technology, extensively utilized in applications such as mobile devices[2], smart homes[3], smart cities[4], criminal investigation[5] and financial services[6]. These technologies rely heavily on the accuracy and security of voice recognition systems to ensure the integrity of user interactions. However, with the advancement of deep neural networks, the susceptibility of speaker verification systems to spoofing attacks—such as speech synthesis[7], voice conversion, and speech editing—has become an increasing concern. To address these vulnerabilities, various challenges such as ASVspoof and Audio Deepfake Detection have been organized, aiming to foster the development of effective countermeasures.

Amidst these developments, the Source Speaker Tracing Challenge (SSTC) 2024 plays a pivotal role. This challenge is specifically designed to advance source speaker verification, particularly against sophisticated voice conversion attacks. By focusing on the identification of the source speaker in manipulated speech signals, SSTC 2024 seeks to push the boundaries of current technologies, promote innovation, and provide a robust, open dataset for community engagement.

This report introduces a novel speaker verification system specifically designed for voice-converted speech. The system has been trained using the official training set provided by the challenge organizers. Notably, without simulating additional training sets, our best model achieved an Equal Error Rate (EER) of 19.32%. The remainder of this report is organized as follows: Section 2 details our training and fine-tuning strategies, Section 3 presents the experimental results, and Section 4 concludes with a summary of our findings and implications for future research. This system’s development and evaluation are part of a broader effort to enhance the reliability and security of speaker verification technologies in the face of evolving spoofing tactics.

2. TRAINING STRATEGY

Our speaker verification system is built on the pretrained ResNet152[8] model sourced from the WeSpeaker framework[9]. This model was pretrained on the VoxCeleb2 dataset[10], which contains over 1 million utterances from 5994 celebrities collected from YouTube videos. The dataset encompasses a wide range of accents, professions, and backgrounds, making it one of the most diverse and comprehensive datasets available for speaker recognition tasks.
Table 1. EER (%) of the two systems on the 12 Dev sets; ALL is the average over the 12 sets.
System DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7 DEV8 DEV9 DEV10 DEV11 DEV12 ALL
S1 12.82 12.03 9.41 11.15 11.41 14.65 31.92 30.78 36.78 42.88 19.16 23.57 21.38
S2 11.78 10.09 8.53 11.17 10.08 14.05 28.22 27.78 33.04 40.82 14.96 19.27 19.15

To fine-tune the speaker verification system, we used a specialized dataset created from the LibriSpeech corpus and provided by the contest organizers. The converted speech samples were generated using advanced voice conversion (VC) techniques, which manipulate the vocal characteristics of the source speaker’s audio to resemble those of a target speaker while preserving the linguistic content. Three source speech samples were selected for each target speech, simulating scenarios in which attacks are carried out by different voice converters. The diversity of conversion techniques ensures that our system is exposed to a wide range of possible spoofing scenarios, enhancing its ability to generalize across different types of voice conversion attacks.

In our research, we employed a fine-tuning strategy to optimize the ResNet152 model for speaker verification, focusing specifically on voice-converted speech. The model was fine-tuned on eight distinct datasets, each representing a different voice conversion scenario, to ensure broad exposure and robustness against various spoofing types. Training was conducted over 20 epochs with a warmup phase of one epoch. Spectral features were extracted as 80-dimensional Fbank features.
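For concreteness, the following minimal sketch shows how 80-dimensional Fbank features of this kind can be extracted with torchaudio’s Kaldi-compatible frontend. The frame length, frame shift, and per-utterance mean normalization shown here are common defaults assumed for illustration; they are not specified in this report.

import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(wav_path):
    # Load the waveform; LibriSpeech-derived audio is 16 kHz mono.
    waveform, sample_rate = torchaudio.load(wav_path)
    # 80-dimensional log Mel filterbank features; the 25 ms frame length and
    # 10 ms frame shift are assumed defaults, not values taken from this report.
    feats = kaldi.fbank(
        waveform,
        num_mel_bins=80,
        frame_length=25.0,
        frame_shift=10.0,
        sample_frequency=sample_rate,
    )
    # Per-utterance mean normalization, a common preprocessing choice.
    feats = feats - feats.mean(dim=0, keepdim=True)
    return feats  # shape: (num_frames, 80)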
The ResNet152 model was trained with the Additive Angular Margin (AAM) classifier[11], configured with an angular margin m of 0.2 and a scaling factor s of 32, which sharpens the decision boundaries between classes and is crucial for improving the model’s ability to discriminate between different speakers. No dropout was applied, allowing the model to learn from all features. Training ran on four GPUs. This fine-tuning approach not only adapted the model to the intricacies of voice-converted speech but also enhanced its generalization, which is essential for real-world deployment, where speaker verification systems must reliably identify and counteract sophisticated spoofing attacks.
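The following is a self-contained sketch of an additive angular margin (ArcFace-style) classification head with m = 0.2 and s = 32, given purely for illustration; it is a generic reimplementation, not the exact WeSpeaker training code. During fine-tuning, a head of this kind sits on top of the speaker embedding produced by ResNet152 and replaces a plain softmax classifier.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxHead(nn.Module):
    # Additive Angular Margin (ArcFace-style) classifier head.
    def __init__(self, embed_dim, num_classes, margin=0.2, scale=32.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin = margin
        self.scale = scale

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m only to the target-class logit.
        target_logit = torch.cos(theta + self.margin)
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).to(cosine.dtype)
        logits = one_hot * target_logit + (1.0 - one_hot) * cosine
        # Scale by s and apply the standard cross-entropy loss.
        return F.cross_entropy(self.scale * logits, labels)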

3. RESULTS

We constructed and submitted two systems, S1 and S2, to evaluate their performance. The main difference between the two systems lies in a crucial step in the data preprocessing stage, which has a significant impact on the final performance. S1 followed the standard processing flow without introducing additional data transformation steps. Although S1 performed reasonably well, there was still room for optimization in certain details. To further improve performance, we developed S2. Compared with S1, S2 adds a key step in the data preprocessing stage: the mean vector of all embeddings in the Dev set is computed and subtracted from each extracted embedding. This step aims to eliminate certain biases or noise in the data so that the embeddings more accurately reflect the inherent characteristics of the data.
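A minimal sketch of the extra preprocessing step in S2, mean subtraction over Dev-set embeddings, together with simple cosine scoring of a trial pair, is given below. The function names and the use of cosine scoring are illustrative assumptions rather than the exact evaluation pipeline.

import numpy as np

def subtract_dev_mean(dev_embeddings, trial_embeddings):
    # Mean vector computed over all Dev-set embeddings (the key additional step in S2).
    mean_vec = dev_embeddings.mean(axis=0)
    # Subtracting it is intended to remove dataset-level bias or noise from each embedding.
    return trial_embeddings - mean_vec

def cosine_score(emb_a, emb_b):
    # Cosine similarity between an enrollment embedding and a test embedding.
    emb_a = emb_a / np.linalg.norm(emb_a)
    emb_b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(emb_a, emb_b))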
Testing on the development set showed that S2 significantly improves the key indicator, EER, as reported in Table 1. This result demonstrates the effectiveness of our adjustments in the data preprocessing stage. After completing the evaluation on the Dev set, we submitted S1 and S2 to the Test set for more rigorous validation. The EERs of S1 and S2 on the Test set were 23.20% and 19.32%, respectively. This improvement again confirms the effectiveness of our work on data preprocessing and model optimization. We believe that with continued research, our system will achieve even better performance in the future.
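For reference, the EER reported throughout this report can be computed from trial scores as sketched below; this is a generic implementation for illustration, not the official challenge scoring tool.

import numpy as np

def compute_eer(scores, labels):
    # scores: similarity score per trial; labels: 1 for target trials, 0 for non-target trials.
    order = np.argsort(-scores)                  # sort trials from highest to lowest score
    labels = labels[order].astype(float)
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # Sweep the decision threshold down the sorted scores.
    fa = np.cumsum(1.0 - labels) / n_nontarget   # false-acceptance rate at each threshold
    fr = 1.0 - np.cumsum(labels) / n_target      # false-rejection rate at each threshold
    idx = int(np.argmin(np.abs(fa - fr)))        # point where the two error rates meet
    return float((fa[idx] + fr[idx]) / 2.0)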
4. CONCLUSION

Our team ultimately secured third place in the SSTC, achieving an EER of 19.32% on the test set for voice conversion speaker verification. Despite fine-tuning on the simulated datasets, the experimental results were still not entirely satisfactory. This task remains exceedingly challenging with current technologies, indicating significant room for improvement in handling voice conversion spoofing. Moving forward, we are committed to exploring various training strategies to better understand their impact on voice conversion speaker verification systems. By continually refining our approach, we aim to enhance the robustness and accuracy of our system against sophisticated spoofing techniques.

5. REFERENCES

[1] Craig S. Greenberg, Lisa P. Mason, Seyed Omid Sadjadi, and Douglas A. Reynolds, “Two decades of speaker recognition evaluation at the national institute of standards and technology,” Comput. Speech Lang., vol. 60, no. C, Mar. 2020.

[2] Joao Antônio Chagas Nunes, David Macêdo, and Cleber Zanchettin, “Am-mobilenet1d: A portable model for speaker recognition,” in 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–8.
[3] Yudi Dong and Yu-Dong Yao, “Secure mmwave-radar-
based speaker verification for iot smart home,” IEEE
Internet of Things Journal, vol. 8, no. 5, pp. 3500–3511,
2020.
[4] Adil E Rajput, Tayeb Brahimi, and Akila Sarirete, “Au-
tomatic speaker verification, zigbee and lorawan: Po-
tential threats and vulnerabilities in smart cities,” in
Research & Innovation Forum 2019: Technology, Inno-
vation, Education, and their Social Impact 1. Springer,
2019, pp. 277–285.

[5] Joseph P. Campbell, Wade Shen, William M. Campbell, Reva Schwartz, Jean-Francois Bonastre, and Driss Matrouf, “Forensic speaker recognition,” IEEE Signal Processing Magazine, vol. 26, no. 2, pp. 95–103, 2009.
[6] Muteb Aljasem, Aun Irtaza, Hafiz Malik, Noushin Saba,
Ali Javed, Khalid Mahmood Malik, and Mohammad
Meharmohammadi, “Secure automatic speaker verifica-
tion (sasv) system through sm-altp features and asym-
metric bagging,” IEEE Transactions on Information
Forensics and Security, vol. 16, pp. 3524–3537, 2021.

[7] Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu, “A survey on neural speech synthesis,” arXiv preprint arXiv:2106.15561, 2021.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun, “Deep residual learning for image recognition,” in
Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[9] Hongji Wang, Chengdong Liang, Shuai Wang,
Zhengyang Chen, Binbin Zhang, Xu Xiang, Yanlei
Deng, and Yanmin Qian, “Wespeaker: A research
and production oriented speaker embedding learning
toolkit,” in ICASSP 2023-2023 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2023, pp. 1–5.
[10] Joon Son Chung, Arsha Nagrani, and Andrew Zisser-
man, “Voxceleb2: Deep speaker recognition,” arXiv
preprint arXiv:1806.05622, 2018.
[11] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos
Zafeiriou, “Arcface: Additive angular margin loss for
deep face recognition,” in Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition,
2019, pp. 4690–4699.
