THE TIANJIN UNIVERSITY AUDIOVISUAL COGNITIVE COMPUTING TEAM SPEAKER
VERIFICATION FOR THE SSTC2024
                               Hangming Zhang, Zheng Li, Qianyi Bai, Zhichao Deng
                                          College of Intelligence and Computing
                                                    Tianjin University
                                                 Tianjin, 300354, China
                          ABSTRACT

In recent years, speaker verification technology has become a common identity
verification method, widely used in fields such as AI assistants, financial
security, smart cities, and criminal investigation. However, with the
development of artificial intelligence technologies such as voice conversion,
speaker verification systems are facing more threats than ever before. The
Source Speaker Tracing Challenge (SSTC) plays an important role in addressing
these problems, providing a platform for communication and innovation in
source speaker verification and giving impetus to the application of new
technology in this area. For this challenge, our team, Tianjin University
Audiovisual Cognitive Computing, proposed a system based on a ResNet152 model
derived from the WeSpeaker framework. During training, an AAM classifier was
used, and eight datasets simulating different spoofing scenarios were used to
fine-tune the model so that it can recognize and counter complex attacks. The
model achieved an EER of 19.32% on the voice conversion speaker verification
test set, ranking third in SSTC 2024. In the future, we will continue to
explore ways to further improve robustness and recognition against more
sophisticated spoofing scenarios.

    Index Terms— Source speaker verification, speaker verification, the
Source Speaker Tracing Challenge (SSTC)
                     1. INTRODUCTION

In today's digital era, speaker verification[1] has emerged as a crucial
biometric authentication technology, extensively utilized in applications such
as mobile devices[2], smart homes[3], smart cities[4], criminal
investigation[5] and financial services[6]. These applications rely heavily on
the accuracy and security of voice recognition systems to ensure the integrity
of user interactions. However, with the advancement of deep neural networks,
the susceptibility of speaker verification systems to spoofing attacks, such
as speech synthesis[7], voice conversion, and speech editing, has become an
increasing concern. To address these vulnerabilities, various challenges like
ASVspoof and Audio Deepfake Detection have been organized, aiming to foster
the development of effective countermeasures.
    Amidst these developments, the Source Speaker Tracing Challenge (SSTC)
2024 plays a pivotal role. This challenge is specifically designed to advance
source speaker verification, particularly against sophisticated voice
conversion attacks. By focusing on identifying the source speaker in
manipulated speech signals, SSTC 2024 seeks to push the boundaries of current
technologies, promote innovation, and provide a robust, open dataset for
community engagement.
    This report introduces a speaker verification system designed for
voice-converted speech. The system was trained using the official training
set provided by the challenge organizers. Notably, without simulating
additional training sets, our best model achieved an Equal Error Rate (EER)
of 19.32%. The remainder of this report is organized as follows: Section 2
details our training and fine-tuning strategies, Section 3 presents the
experimental results, and Section 4 concludes with a summary of our findings
and implications for future research. This system's development and evaluation
are part of a broader effort to enhance the reliability and security of
speaker verification technologies in the face of evolving spoofing tactics.

                  2. TRAINING STRATEGY

Our speaker verification system uses the pretrained ResNet152[8] model as its
foundation. The model is taken from the WeSpeaker framework[9] and was trained
on the VoxCeleb2 dataset[10]. VoxCeleb2 contains over 1 million utterances
from 5994 celebrities, sourced from YouTube videos. It covers a wide range of
accents, professions, and recording conditions, making it one of the most
diverse and comprehensive datasets available for speaker recognition.
    To fine-tune the speaker verification system, we used a specialized
dataset created from the LibriSpeech corpus and provided by the contest
organizers. The converted speech samples were generated with voice conversion
(VC) techniques, which manipulate the vocal characteristics of a source
speaker's audio to resemble those of a target speaker while preserving the
linguistic content. Three source speech samples were selected for each target
speech, simulating scenarios in which attacks are carried out by different
voice converters. This diversity in conversion techniques exposes our system
to a wide range of possible spoofing scenarios, improving its ability to
generalize across different types of voice conversion attacks.
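To make this labeling scheme concrete, the sketch below shows one plausible
way to organize such fine-tuning data: each converted utterance is keyed by
its file path and labeled with the source speaker it was derived from and the
converter that produced it. The file names, field names, and structure are
hypothetical and only illustrate the setup described above, not the official
data format.

```python
from dataclasses import dataclass

@dataclass
class ConvertedUtterance:
    # Hypothetical manifest entry: all paths and IDs below are illustrative.
    path: str            # converted waveform on disk
    source_speaker: str  # label used for fine-tuning (the speaker to be traced)
    target_speaker: str  # voice the utterance was converted toward
    converter: str       # which VC system produced the sample

manifest = [
    ConvertedUtterance("vc_data/utt_0001.wav", "libri_spk_19", "libri_spk_27", "vc_system_A"),
    ConvertedUtterance("vc_data/utt_0002.wav", "libri_spk_19", "libri_spk_27", "vc_system_B"),
]

# Fine-tuning targets are the *source* speakers, not the (heard) target voices.
source_labels = sorted({u.source_speaker for u in manifest})
label_to_index = {spk: i for i, spk in enumerate(source_labels)}
```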
              Table 1. EER (%) of the two systems on the 12 Dev sets.

 System  DEV1   DEV2   DEV3   DEV4   DEV5   DEV6   DEV7   DEV8   DEV9   DEV10  DEV11  DEV12  ALL
   S1    12.82  12.03   9.41  11.15  11.41  14.65  31.92  30.78  36.78  42.88  19.16  23.57  21.38
   S2    11.78  10.09   8.53  11.17  10.08  14.05  28.22  27.78  33.04  40.82  14.96  19.27  19.15
    In our research, we employed a targeted fine-tuning strategy to adapt the
ResNet152 model to speaker verification on voice-converted speech. The model
was fine-tuned on eight distinct datasets, each representing a different voice
conversion scenario, to ensure broad exposure and robustness against various
spoofing types. Training was conducted for 20 epochs, with a warmup phase of
one epoch, and spectral features were extracted as 80-dimensional filterbank
(Fbank) features.
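As a minimal sketch of this front end, the snippet below extracts
80-dimensional Fbank features with torchaudio's Kaldi-compatible
implementation; the file path and the per-utterance mean normalization are
illustrative assumptions rather than the exact pipeline used in our system.

```python
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(wav_path: str) -> torch.Tensor:
    """Return a (num_frames, 80) Fbank feature matrix for one utterance."""
    waveform, sample_rate = torchaudio.load(wav_path)  # (channels, samples)
    feats = kaldi.fbank(
        waveform,
        num_mel_bins=80,          # 80-dimensional Fbank, as used above
        frame_length=25.0,        # frame length in ms (common default)
        frame_shift=10.0,         # frame shift in ms (common default)
        sample_frequency=sample_rate,
    )
    # Per-utterance mean normalization (a common, assumed choice).
    return feats - feats.mean(dim=0, keepdim=True)

# Example usage (path is illustrative):
# feats = extract_fbank("vc_data/utt_0001.wav")
```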
    The model was trained with an Additive Angular Margin (AAM)
classifier[11], configured with an angular margin m of 0.2 and a scaling
factor s of 32. The margin sharpens the decision boundaries between classes,
which is crucial for improving the model's ability to discriminate between
speakers. No dropout was applied, allowing the model to learn from all
features. Training ran on four GPUs. This fine-tuning approach not only
adapted the model to the characteristics of voice-converted speech but also
improved its generalization, which is essential for real-world deployment,
where speaker verification systems must reliably identify and counteract
sophisticated spoofing attacks.
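The following is a simplified sketch of such an AAM (ArcFace-style)
classification head with m = 0.2 and s = 32; it omits the numerical safeguards
(e.g. the easy-margin variant) found in full implementations such as
WeSpeaker's, and the embedding dimension and class count are placeholders.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxHead(nn.Module):
    """Additive Angular Margin classifier: penalizes the target-class angle by m."""

    def __init__(self, embed_dim: int, num_classes: int,
                 margin: float = 0.2, scale: float = 32.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale = scale
        self.cos_m = math.cos(margin)
        self.sin_m = math.sin(margin)

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        sine = torch.sqrt((1.0 - cosine.pow(2)).clamp(min=0.0, max=1.0))
        # cos(theta + m), applied only to the ground-truth class.
        phi = cosine * self.cos_m - sine * self.sin_m
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).to(cosine.dtype)
        logits = self.scale * (one_hot * phi + (1.0 - one_hot) * cosine)
        return F.cross_entropy(logits, labels)

# head = AAMSoftmaxHead(embed_dim=256, num_classes=1000)  # placeholder sizes
# loss = head(embeddings, speaker_labels)
```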
                         3. RESULTS

We constructed and submitted two systems, S1 and S2, to evaluate their
performance. The main difference between them lies in one step of the data
preprocessing stage, which has a significant impact on the final performance.
S1 follows the standard processing flow without introducing any additional
data transformation. Although S1 performs reasonably well, there is still room
for optimization. To further improve performance, we developed S2, which adds
one key step: the mean vector of all embeddings extracted from the Dev set is
computed and subtracted from each extracted embedding. This step aims to
remove certain biases or noise in the data so that the embeddings more
accurately reflect the inherent characteristics of the speech.
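A minimal sketch of this mean-subtraction step and the subsequent cosine
scoring is shown below, assuming the embeddings are held in NumPy arrays;
the function and variable names are illustrative, not the system's actual
code.

```python
import numpy as np

def subtract_dev_mean(dev_embeddings: np.ndarray, trial_embedding: np.ndarray) -> np.ndarray:
    """S2's extra step: remove the global mean vector estimated on the Dev set."""
    mean_vec = dev_embeddings.mean(axis=0)
    return trial_embedding - mean_vec

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity used to score a verification trial."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-8))
```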
    Testing on the development sets showed that S2 clearly improves the key
metric, EER, as shown in Table 1. This result demonstrates the effectiveness
of the adjustment made at the preprocessing stage. After completing the
evaluation on the Dev sets, we submitted both S1 and S2 to the Test set for
more rigorous validation. The EERs of S1 and S2 on the Test set were 23.20%
and 19.32%, respectively. This improvement again confirms the effectiveness
of our work on data preprocessing and model optimization. We believe that
with continued research, the system can achieve even better performance.
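For reference, the EER values reported above can be computed from trial scores
and labels as in the sketch below, which uses scikit-learn's ROC curve; the
exact scoring tooling used by the challenge organizers may differ.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal Error Rate: the point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Example: labels are 1 for target (same source speaker) trials, 0 otherwise.
# eer = compute_eer(np.array([0.9, 0.2, 0.7]), np.array([1, 0, 1]))
```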
                        4. CONCLUSION

Our team ultimately secured third place in the SSTC, achieving an EER of
19.32% on the test set for voice conversion speaker verification. Despite
fine-tuning on the provided datasets, the results are still not entirely
satisfactory: the task remains exceedingly challenging with current
technologies, indicating significant room for improvement in handling voice
conversion spoofing. Moving forward, we are committed to exploring various
training strategies to better understand their impact on voice conversion
speaker verification systems. By continually refining our approach, we aim to
enhance the robustness and accuracy of our system against sophisticated
spoofing techniques.

                        5. REFERENCES

 [1] Craig S. Greenberg, Lisa P. Mason, Seyed Omid Sadjadi, and Douglas A.
     Reynolds, “Two decades of speaker recognition evaluation at the national
     institute of standards and technology,” Comput. Speech Lang., vol. 60,
     no. C, mar 2020.

 [2] Joao Antônio Chagas Nunes, David Macêdo, and Cleber Zanchettin,
     “Am-mobilenet1d: A portable model for speaker recognition,” in 2020
     International Joint Conference on Neural Networks (IJCNN). IEEE, 2020,
     pp. 1–8.
 [3] Yudi Dong and Yu-Dong Yao, “Secure mmwave-radar-
     based speaker verification for iot smart home,” IEEE
     Internet of Things Journal, vol. 8, no. 5, pp. 3500–3511,
     2020.
 [4] Adil E Rajput, Tayeb Brahimi, and Akila Sarirete, “Au-
     tomatic speaker verification, zigbee and lorawan: Po-
     tential threats and vulnerabilities in smart cities,” in
     Research & Innovation Forum 2019: Technology, Inno-
     vation, Education, and their Social Impact 1. Springer,
     2019, pp. 277–285.
 [5] Joseph P. Campbell, Wade Shen, William M. Campbell,
     Reva Schwartz, Jean-Francois Bonastre, and Driss Ma-
     trouf, “Forensic speaker recognition,” IEEE Signal Pro-
     cessing Magazine, vol. 26, no. 2, pp. 95–103, 2009.
 [6] Muteb Aljasem, Aun Irtaza, Hafiz Malik, Noushin Saba,
     Ali Javed, Khalid Mahmood Malik, and Mohammad
     Meharmohammadi, “Secure automatic speaker verifica-
     tion (sasv) system through sm-altp features and asym-
     metric bagging,” IEEE Transactions on Information
     Forensics and Security, vol. 16, pp. 3524–3537, 2021.
 [7] Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu, “A
     survey on neural speech synthesis,” arXiv preprint
     arXiv:2106.15561, 2021.
 [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
     Sun, “Deep residual learning for image recognition,” in
     Proceedings of the IEEE conference on computer vision
     and pattern recognition, 2016, pp. 770–778.
 [9] Hongji Wang, Chengdong Liang, Shuai Wang,
     Zhengyang Chen, Binbin Zhang, Xu Xiang, Yanlei
     Deng, and Yanmin Qian, “Wespeaker: A research
     and production oriented speaker embedding learning
     toolkit,” in ICASSP 2023-2023 IEEE International
     Conference on Acoustics, Speech and Signal Processing
     (ICASSP). IEEE, 2023, pp. 1–5.
[10] Joon Son Chung, Arsha Nagrani, and Andrew Zisser-
     man, “Voxceleb2: Deep speaker recognition,” arXiv
     preprint arXiv:1806.05622, 2018.
[11] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos
     Zafeiriou, “Arcface: Additive angular margin loss for
     deep face recognition,” in Proceedings of the IEEE/CVF
     conference on computer vision and pattern recognition,
     2019, pp. 4690–4699.