
THE TIANJIN UNIVERSITY AUDIOVISUAL COGNITIVE COMPUTING TEAM SPEAKER VERIFICATION FOR THE SSTC2024

Hangming Zhang, Zheng Li, Qianyi Bai, Zhichao Deng

College of Intelligence and Computing


Tianjin University
Tianjin, 300354, China

ABSTRACT

In recent years, speaker verification technology has become a common identity verification method, widely used in fields such as AI assistants, financial security, smart cities, and criminal investigation. However, with the development of artificial intelligence technologies such as voice conversion, speaker verification systems are facing more threats than ever before. The Source Speaker Tracing Challenge (SSTC) plays an important role in addressing these problems, providing a platform for communication and innovation in source speaker verification and giving impetus to the application of new technology in this area. In this challenge, our team, Tianjin University Audiovisual Cognitive Computing, proposed a system based on ResNet152 derived from the WeSpeaker framework. During training, an AAM classifier was used, and eight datasets simulating different deception scenarios were used to fine-tune the model so that it can recognize and counter complex attacks. The model achieved an EER of 19.32% on the test set for voice conversion speaker verification, ranking third in SSTC 2024. In the future, we will continue to explore various ways to further improve robustness and recognition against more sophisticated deception scenarios.

Index Terms— Source speaker verification, speaker verification, the Source Speaker Tracing Challenge (SSTC)

1. INTRODUCTION

In today’s digital era, speaker verification[1] has emerged as a crucial biometric authentication technology, extensively utilized in applications such as mobile devices[2], smart homes[3], smart cities[4], criminal investigation[5] and financial services[6]. These technologies rely heavily on the accuracy and security of voice recognition systems to ensure the integrity of user interactions. However, with the advancement of deep neural networks, the susceptibility of speaker verification systems to spoofing attacks—such as speech synthesis[7], voice conversion, and speech editing—has become an increasing concern. To address these vulnerabilities, various challenges such as ASVspoof and Audio Deepfake Detection have been organized, aiming to foster the development of effective countermeasures.

Amidst these developments, the Source Speaker Tracing Challenge (SSTC) 2024 plays a pivotal role. This challenge is specifically designed to advance source speaker verification, particularly against sophisticated voice conversion attacks. By focusing on the identification of the source speaker in manipulated speech signals, SSTC 2024 seeks to push the boundaries of current technologies, promote innovation, and provide a robust, open dataset for community engagement.

This report introduces a novel speaker verification system specifically designed for voice-converted speech. The system has been trained using the official training set provided by the challenge organizers. Notably, without simulating additional training sets, our best model achieved an Equal Error Rate (EER) of 19.32%. The remainder of this report is organized as follows: Section 2 details our training and fine-tuning strategies, Section 3 presents the experimental results, and Section 4 concludes with a summary of our findings and implications for future research. This system’s development and evaluation are part of a broader effort to enhance the reliability and security of speaker verification technologies in the face of evolving spoofing tactics.

2. TRAINING STRATEGY

Our speaker verification system is built on the pretrained ResNet152[8] model sourced from the WeSpeaker framework[9]. This model was pretrained on the VoxCeleb2 dataset[10], which contains over 1 million utterances from 5994 celebrities collected from YouTube videos. The dataset encompasses a wide range of accents, professions, and backgrounds, making it one of the most diverse and comprehensive datasets available for speaker recognition tasks.
Table 1. EER (%) of the two systems on the 12 Dev sets; ALL is the average over the 12 sets.
System DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7 DEV8 DEV9 DEV10 DEV11 DEV12 ALL
S1 12.82 12.03 9.41 11.15 11.41 14.65 31.92 30.78 36.78 42.88 19.16 23.57 21.38
S2 11.78 10.09 8.53 11.17 10.08 14.05 28.22 27.78 33.04 40.82 14.96 19.27 19.15

To fine-tune the speaker verification system, we used a specialized dataset created from the LibriSpeech corpus and provided by the contest organizers. The converted speech samples were generated using advanced voice conversion (VC) techniques, which manipulate the vocal characteristics of the source speaker’s audio to resemble those of a target speaker while preserving the linguistic content. Three source speech samples were selected for each target speech, simulating scenarios in which attacks are carried out by different voice converters. The diversity of conversion techniques ensures that our system is exposed to a wide range of possible spoofing scenarios, enhancing its ability to generalize across different types of voice conversion attacks.

In our research, we employed a fine-tuning strategy to optimize the ResNet152 model for speaker verification, focusing specifically on voice-converted speech. The model was fine-tuned on eight distinct datasets, each representing a different voice conversion scenario, to ensure broad exposure and robustness against various spoofing types. Training was conducted over 20 epochs with a warmup phase of one epoch. Spectral features were extracted as 80-dimensional Fbank features.
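For concreteness, the following minimal sketch shows how 80-dimensional Fbank features of this kind can be extracted with torchaudio’s Kaldi-compatible frontend. The frame length, frame shift, and per-utterance mean normalization shown here are common defaults assumed for illustration; they are not specified in this report.

import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(wav_path):
    # Load the waveform; LibriSpeech-derived audio is 16 kHz mono.
    waveform, sample_rate = torchaudio.load(wav_path)
    # 80-dimensional log Mel filterbank features; the 25 ms frame length and
    # 10 ms frame shift are assumed defaults, not values taken from this report.
    feats = kaldi.fbank(
        waveform,
        num_mel_bins=80,
        frame_length=25.0,
        frame_shift=10.0,
        sample_frequency=sample_rate,
    )
    # Per-utterance mean normalization, a common preprocessing choice.
    feats = feats - feats.mean(dim=0, keepdim=True)
    return feats  # shape: (num_frames, 80)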
The ResNet152 model was trained with the Additive Angular Margin (AAM) classifier[11], configured with an angular margin m of 0.2 and a scaling factor s of 32, which sharpens the decision boundaries between classes and is crucial for improving the model’s ability to discriminate between different speakers. No dropout was applied, allowing the model to learn from all features. Training ran on four GPUs. This fine-tuning approach not only adapted the model to the intricacies of voice-converted speech but also enhanced its generalization, which is essential for real-world deployment, where speaker verification systems must reliably identify and counteract sophisticated spoofing attacks.
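The following is a self-contained sketch of an additive angular margin (ArcFace-style) classification head with m = 0.2 and s = 32, given purely for illustration; it is a generic reimplementation, not the exact WeSpeaker training code. During fine-tuning, a head of this kind sits on top of the speaker embedding produced by ResNet152 and replaces a plain softmax classifier.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxHead(nn.Module):
    # Additive Angular Margin (ArcFace-style) classifier head.
    def __init__(self, embed_dim, num_classes, margin=0.2, scale=32.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin = margin
        self.scale = scale

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m only to the target-class logit.
        target_logit = torch.cos(theta + self.margin)
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).to(cosine.dtype)
        logits = one_hot * target_logit + (1.0 - one_hot) * cosine
        # Scale by s and apply the standard cross-entropy loss.
        return F.cross_entropy(self.scale * logits, labels)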

3. RESULTS

We constructed and submitted two systems, S1 and S2, to evaluate their performance. The main difference between the two systems lies in a crucial step in the data preprocessing stage, which has a significant impact on the final performance. S1 followed the standard processing flow without introducing additional data transformation steps. Although S1 performed reasonably well, there was still room for optimization in certain details. To further improve performance, we developed S2. Compared with S1, S2 adds a key step in the data preprocessing stage: the mean vector of all embeddings in the Dev set is computed and subtracted from each extracted embedding. This step aims to eliminate certain biases or noise in the data so that the embeddings more accurately reflect the inherent characteristics of the data.
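A minimal sketch of the extra preprocessing step in S2, mean subtraction over Dev-set embeddings, together with simple cosine scoring of a trial pair, is given below. The function names and the use of cosine scoring are illustrative assumptions rather than the exact evaluation pipeline.

import numpy as np

def subtract_dev_mean(dev_embeddings, trial_embeddings):
    # Mean vector computed over all Dev-set embeddings (the key additional step in S2).
    mean_vec = dev_embeddings.mean(axis=0)
    # Subtracting it is intended to remove dataset-level bias or noise from each embedding.
    return trial_embeddings - mean_vec

def cosine_score(emb_a, emb_b):
    # Cosine similarity between an enrollment embedding and a test embedding.
    emb_a = emb_a / np.linalg.norm(emb_a)
    emb_b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(emb_a, emb_b))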
Testing on the development set showed that S2 significantly improves the key indicator, EER, as reported in Table 1. This result demonstrates the effectiveness of our adjustments in the data preprocessing stage. After completing the evaluation on the Dev set, we submitted S1 and S2 to the Test set for more rigorous validation. The EERs of S1 and S2 on the Test set were 23.20% and 19.32%, respectively. This improvement again confirms the effectiveness of our work on data preprocessing and model optimization. We believe that with continued research, our system will achieve even better performance in the future.
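For reference, the EER reported throughout this report can be computed from trial scores as sketched below; this is a generic implementation for illustration, not the official challenge scoring tool.

import numpy as np

def compute_eer(scores, labels):
    # scores: similarity score per trial; labels: 1 for target trials, 0 for non-target trials.
    order = np.argsort(-scores)                  # sort trials from highest to lowest score
    labels = labels[order].astype(float)
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # Sweep the decision threshold down the sorted scores.
    fa = np.cumsum(1.0 - labels) / n_nontarget   # false-acceptance rate at each threshold
    fr = 1.0 - np.cumsum(labels) / n_target      # false-rejection rate at each threshold
    idx = int(np.argmin(np.abs(fa - fr)))        # point where the two error rates meet
    return float((fa[idx] + fr[idx]) / 2.0)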
4. CONCLUSION

Our team ultimately secured third place in the SSTC, achieving an EER of 19.32% on the test set for voice conversion speaker verification. Despite fine-tuning on the simulated datasets, the experimental results were still not entirely satisfactory. This task remains exceedingly challenging with current technologies, indicating significant room for improvement in handling voice conversion spoofing. Moving forward, we are committed to exploring various training strategies to better understand their impact on voice conversion speaker verification systems. By continually refining our approach, we aim to enhance the robustness and accuracy of our system against sophisticated spoofing techniques.

5. REFERENCES

[1] Craig S. Greenberg, Lisa P. Mason, Seyed Omid Sadjadi, and Douglas A. Reynolds, “Two decades of speaker recognition evaluation at the national institute of standards and technology,” Comput. Speech Lang., vol. 60, no. C, Mar. 2020.

[2] Joao Antônio Chagas Nunes, David Macêdo, and Cleber Zanchettin, “Am-mobilenet1d: A portable model for speaker recognition,” in 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–8.
[3] Yudi Dong and Yu-Dong Yao, “Secure mmwave-radar-
based speaker verification for iot smart home,” IEEE
Internet of Things Journal, vol. 8, no. 5, pp. 3500–3511,
2020.
[4] Adil E Rajput, Tayeb Brahimi, and Akila Sarirete, “Au-
tomatic speaker verification, zigbee and lorawan: Po-
tential threats and vulnerabilities in smart cities,” in
Research & Innovation Forum 2019: Technology, Inno-
vation, Education, and their Social Impact 1. Springer,
2019, pp. 277–285.

[5] Joseph P. Campbell, Wade Shen, William M. Campbell, Reva Schwartz, Jean-Francois Bonastre, and Driss Matrouf, “Forensic speaker recognition,” IEEE Signal Processing Magazine, vol. 26, no. 2, pp. 95–103, 2009.
[6] Muteb Aljasem, Aun Irtaza, Hafiz Malik, Noushin Saba,
Ali Javed, Khalid Mahmood Malik, and Mohammad
Meharmohammadi, “Secure automatic speaker verifica-
tion (sasv) system through sm-altp features and asym-
metric bagging,” IEEE Transactions on Information
Forensics and Security, vol. 16, pp. 3524–3537, 2021.

[7] Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu, “A survey on neural speech synthesis,” arXiv preprint arXiv:2106.15561, 2021.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun, “Deep residual learning for image recognition,” in
Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[9] Hongji Wang, Chengdong Liang, Shuai Wang,
Zhengyang Chen, Binbin Zhang, Xu Xiang, Yanlei
Deng, and Yanmin Qian, “Wespeaker: A research
and production oriented speaker embedding learning
toolkit,” in ICASSP 2023-2023 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2023, pp. 1–5.
[10] Joon Son Chung, Arsha Nagrani, and Andrew Zisser-
man, “Voxceleb2: Deep speaker recognition,” arXiv
preprint arXiv:1806.05622, 2018.
[11] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos
Zafeiriou, “Arcface: Additive angular margin loss for
deep face recognition,” in Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition,
2019, pp. 4690–4699.
