
Seminar Report

On

Towards End-to-End Synthetic Speech


Detection: A Comparative Analysis

Submitted to the Department of Electronics and Communication Engineering in Partial
Fulfilment of the Requirements for the Degree of

Bachelor of Technology
(Electronics and Communication)

by
Parthiv Jasoliya
(U22EC019)
(B. TECH. III(EC), 5th Semester)

Guided by

Dr. P. K. Shah
Assistant Professor, DECE

SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY


SURAT

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

MAY - 2025
Sardar Vallabhbhai National Institute Of Technology
Surat 395 007, Gujarat, India

DEPARTMENT OF ELECTRONICS ENGINEERING

CERTIFICATE

This is to certify that the SEMINAR REPORT entitled “Towards End-to-End
Synthetic Speech Detection: A Comparative Analysis” is presented
& submitted by Candidate Parthiv Jasoliya, bearing Roll No. U22EC019,
of B.Tech. III, 5th Semester in partial fulfillment of the requirements for
the award of the B.Tech. Degree in Electronics & Communication Engineering for the
academic year 2023 - 24.
He has successfully and satisfactorily completed his Seminar Exam in
all respects. We certify that the work is comprehensive, complete, and fit for
evaluation.

(Dr. P. K. Shah) (Dr. Jignesh Sarvaiya)


Assistant Professor & Seminar Guide Head & Professor
DECE, SVNIT

Name of Examiners Signature with Date

1. Dr. Abhishek Acharya

2. Dr. Suresh Dahiya

Seal of The Department


(May 2025)
ACKNOWLEDGEMENTS

I take this opportunity to express my deepest sense of gratitude and sincere thanks to everyone
who helped me to complete this work successfully. I express my sincere thanks to Dr. Jignesh
Sarvaiya, Head of Department, Department of Electronics Engineering, Sardar Vallabhbhai
National Institute of Technology, Surat for providing me with all the necessary facilities and
support. I would like to place on record my sincere gratitude to my seminar guide Dr. P.
K. Shah, Assistant Professor, Sardar Vallabhbhai National Institute of Technology for the
guidance and mentorship throughout the course. I would also like to express my gratitude to
all my colleagues for their advice, help, and support.

Parthiv Jasoliya
Sardar Vallabhbhai National Institute of Technology,
Surat
May 13, 2025

iii
Abstract
The rapid advancement of speech synthesis and voice conversion technologies, particularly
driven by deep learning, enables the creation of synthetic audio increasingly indistinguish-
able from genuine human speech. While beneficial in many domains, this progress poses
significant security threats, undermining the reliability of Automatic Speaker Verification
(ASV) systems and facilitating malicious activities such as disinformation, fraud, and im-
personation. Consequently, the development of robust and generalizable synthetic speech
detection countermeasures has become a critical area of research. This report presents a
comprehensive review and comparative analysis charting the evolution of these detection
methodologies. It begins with traditional approaches, characterized by handcrafted acoustic
features (e.g., MFCC, CQCC, LFCC, prediction-based features) coupled with classical ma-
chine learning classifiers (e.g., GMM, SVM). The report then examines the paradigm shift
towards end-to-end deep learning architectures, such as Time-domain Synthetic Speech De-
tection Networks (TSSDNet), which learn discriminative features directly from raw audio
waveforms. Finally, it delves into the current state-of-the-art: fully automated end-to-
end systems that leverage powerful self-supervised pretrained models (e.g., wav2vec 2.0)
for feature extraction and employ Neural Architecture Search (NAS) techniques (e.g., light-
DARTS) to optimize the detector’s structure, minimizing manual intervention. The analysis
evaluates these distinct paradigms based on key performance metrics (primarily Equal Error
Rate - EER and minimum Tandem Decision Cost Function - min t-DCF), generalization
capabilities across standard datasets (notably ASVspoof 2015 and 2019, with reference to
recent challenges), robustness, and computational efficiency. The comparison highlights a
clear trend towards the superior performance and adaptability of automated end-to-end
systems. Furthermore, the report discusses the inherent advantages and limitations of each
approach and concludes by delineating persistent challenges (including data dependency,
generalization to novel attacks, and interpretability) and outlining promising directions for
future research within this vital domain.

Examiners’ Names

Dr. Abhishek Acharya Dr. Suresh Dahiya

Signature of Student: Signature of Guide:

Parthiv Jasoliya (Dr. P. K. Shah)


(U22EC019) Seminar Guide

May 13, 2025

iv
Contents

1 Introduction 1
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Overview of Speech Synthesis Techniques 3


2.1 Text-To-Speech (TTS) Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Voice Conversion (VC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3 Literature Review 5
3.1 Traditional Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.1.1 Handcrafted Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . 5
3.1.2 Classical Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2 End-to-End Deep Learning Approaches . . . . . . . . . . . . . . . . . . . . . . . 6
3.2.1 Direct Waveform Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2.2 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Fully Automated End-to-End Approaches . . . . . . . . . . . . . . . . . . . . . . 8
3.3.1 Self-Supervised Pretrained Feature Extraction . . . . . . . . . . . . . . . 8
3.3.2 Automated Network Architecture Search . . . . . . . . . . . . . . . . . . 9
3.3.3 Advantages of the Fully Automated Approach . . . . . . . . . . . . . . . 10

4 Methodology 11
4.1 Comparison Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.2 Experimental Setup Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.2.1 Benchmark Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.2.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2.3 Feature Extraction and Model Inputs . . . . . . . . . . . . . . . . . . . . 12
4.2.4 Training and Evaluation Protocols . . . . . . . . . . . . . . . . . . . . . . 13

5 Comparative Analysis 14
5.1 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.2 Generalization Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.3 Advantages and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.4 Impact of Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.5 Computational Cost and Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . 18

6 Discussion 19
6.1 Implications for Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6.2 Challenges and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

7 Ethical Considerations and Societal Impact 21


7.1 Malicious Uses of Synthetic Speech . . . . . . . . . . . . . . . . . . . . . . . . . . 21
7.2 Trustworthy AI and Explainability . . . . . . . . . . . . . . . . . . . . . . . . . . 21
7.3 Data Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

8 Conclusion 23

A Appendices 24
A.1 Code Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

v
List of Figures

1 Conceptual Diagram of a ResNet-Style Block for TSSDNet [[2]]. . . . . . . . . . . 7


2 Conceptual Diagram of an Inception-Style Block for TSSDNet [[2]]. . . . . . . . . 8
3 Conceptual Diagram of the Wav2Vec 2.0 Feature Extractor [[1], [24]]. . . . . . . . 9
4 Visual Comparison of EER Performance on ASVspoof Evaluation Sets. . . . . . 15
5 Example Detection Error Trade-off (DET) Curves. . . . . . . . . . . . . . . . . . 16

vi
List of Tables

1 Overview of Key Synthetic Speech Detection Datasets (Example) . . . . . . . . . 12


2 Performance Comparison of Representative Synthetic Speech Detection Approaches
(EER % on Evaluation Sets) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Ablation Study: EER (%) on ASVspoof 2019 LA Eval for Different Feature/Net-
work Combinations (Adapted from Wang et al. [[1]]) . . . . . . . . . . . . . . . . 15
4 Performance (EER %) of Wav2Vec 2.0 + Light-DARTS Across Different Datasets
(Data from Wang et al. [[1]]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

vii
LIST OF ABBREVIATIONS

Abbreviation Full Form


AI Artificial Intelligence
ASR Automatic Speech Recognition
ASV Automatic Speaker Verification
BER Bit Error Rate
CNN Convolutional Neural Network
CQCC Constant Q Cepstral Coefficients
CQT Constant Q Transform
DARTS Differentiable Architecture Search
DNN Deep Neural Network
EER Equal Error Rate
FFT Fast Fourier Transform
FL Federated Learning
FLOPs Floating Point Operations
GDPR General Data Protection Regulation
GLDS Generalized Linear Discriminant Sequence
GMM Gaussian Mixture Model
GPU Graphics Processing Unit
HMM Hidden Markov Model
HTS HMM-based Speech Synthesis system
IID Independent and Identically Distributed
ISI Inter-Symbol Interference
LA Logical Access (ASVspoof context)
LFCC Linear Frequency Cepstral Coefficients
LPCC Linear Prediction Cepstral Coefficients
MFCC Mel-Frequency Cepstral Coefficients
MFM Max Feature Map
MGD Modified Group Delay
ML Maximum Likelihood
MOS Mean Opinion Score
NAS Network Architecture Search
NN Neural Network
RPS Relative Phase Shift
SSL Self-Supervised Learning
STFT Short-Time Fourier Transform
SVM Support Vector Machine
t-DCF tandem Decision Cost Function
TTS Text-To-Speech
UBM Universal Background Model
VC Voice Conversion
VCTK Voice Cloning Toolkit (Corpus)
WCCN Within-Class Covariance Normalization
WCE Weighted Cross-Entropy
XAI Explainable Artificial Intelligence

viii
1 INTRODUCTION

1.1 Background and Motivation


The remarkable progress in Artificial Intelligence (AI) driven speech synthesis and voice conver-
sion (VC) technologies facilitates the generation of synthetic speech increasingly indistinguish-
able from natural human speech, even by human listeners [[4]]. While offering benefits across
various applications, such as assistive technologies for speech-impaired individuals [[4]], this
advancement concurrently introduces significant security vulnerabilities, particularly for auto-
matic speaker verification (ASV) systems [[3], [4]]. High-quality synthetic voices, often termed
“deepfakes,” can be exploited to deceive ASV systems, posing serious threats in domains such as
access control, telephone banking, and forensic investigations [[3]]. Beyond ASV, the potential
for misuse extends to broader societal challenges, including gaining illicit access to private data,
discrediting public figures, spreading misinformation, and perpetrating fraudulent activities [[5],
[4]]. The facility with which realistic fake audio can now be created, coupled with the poten-
tial for severe negative impact, necessitates the development of robust, reliable, and adaptive
detection methodologies [[4], [5]]. This creates an ongoing “arms race” between increasingly
sophisticated synthesis techniques and the countermeasures designed to detect them [[4]].
Initial approaches to synthetic speech detection relied substantially on manually engineered
acoustic features extracted from the speech signal, subsequently classified using classical machine
learning models [[5], [3]]. These features were often designed based on known properties of
older synthesis methods or general speech characteristics. Although these methods achieved
partial success, they often demanded extensive expert knowledge, manual parameter tuning, and
exhibited limited generalization capabilities when confronted with novel or previously unseen
spoofing attacks, especially those generated by modern neural network-based techniques [[4],
[3], [5]]. The inherent limitations of these traditional techniques have catalyzed a significant
shift towards more automated and data-driven paradigms [[4]].
Contemporary research emphasizes end-to-end deep learning models that directly process
raw audio waveforms or basic spectral representations, enabling the network to autonomously
learn discriminative features without extensive manual feature engineering [[2]]. Expanding
upon this trajectory, the field has advanced further with the introduction of fully automated
systems. These integrate powerful self-supervised learning (SSL) models for feature extraction,
pretrained on vast amounts of unlabeled data, coupled with automated network architecture
search (NAS) methodologies to optimize the detector’s structure [[1]]. These modern strategies
aim to minimize reliance on human expertise, enhance detection performance against state-of-
the-art attacks, and improve generalization efficacy across diverse datasets, acoustic conditions,
and spoofing scenarios [[1], [2]].

1.2 Objectives
The primary objectives of this report are delineated as follows:

• To furnish a comprehensive review and comparison of diverse approaches for synthetic


speech detection, tracing the evolution from traditional methods to modern deep learning
paradigms [[4]]. This encompasses an examination of:

– Traditional methods predicated on handcrafted features (e.g., MFCC, LFCC, CQCC,


prediction-based) and classical classifiers (e.g., GMM, SVM, i-vector) [[3], [5]].
– End-to-end deep learning systems operating directly on raw waveforms or spectro-
grams [[2]].
– Fully automated approaches combining self-supervised feature extraction (e.g., wav2vec
2.0) and automated architecture search (e.g., light-DARTS) [[1]].

1
• To analyze the performance of these distinct methodologies based on key evaluation met-
rics, including detection accuracy (primarily Equal Error Rate (EER) [[3]]) and impact
on ASV system reliability (e.g., minimum Tandem Decision Cost Function (min t-DCF)
[[6]]). Computational efficiency (e.g., model parameters [[2]], inference time [[32]]) and
generalization capabilities across different datasets (particularly the ASVspoof 2015 [[7]]
and 2019 [[6]] challenges) and unseen attacks will also be considered [[2], [3], [1]].

• To deliberate upon the inherent challenges within synthetic speech detection, such as the
difficulty in generalizing to novel and rapidly evolving spoofing attacks [[4]], the necessity
for robustness against varying acoustic conditions (noise, reverberation) and transmission
channel effects (codecs, compression) [[1], [8]], the computational demands of advanced
deep learning models, and data limitations (availability, diversity, legal compliance) [[4]].

• To highlight potential avenues for future research directed towards augmenting detection
performance, robustness, and generalization [[4]]. This includes exploring innovative net-
work architectures, leveraging more advanced SSL techniques, improving NAS efficiency,
developing adaptive learning strategies (continual, federated learning), enhancing model
interpretability (XAI), and creating more representative datasets [[4], [1]].

The remainder of this report is organized as follows: Section 2 provides an overview of


common speech synthesis techniques. Section 3 presents a detailed literature review of the
detection approaches examined. Section 4 outlines the methodology and experimental setup
considerations for comparative analysis. Section 5 presents the comparative results and analysis.
Section 6 discusses implications, challenges, and limitations. Section 7 briefly touches upon
ethical considerations. Section 8 concludes with key insights and directions for future work.
Appendices follow the main text.

2
2 OVERVIEW OF SPEECH SYNTHESIS TECHNIQUES

Understanding the methods used to generate synthetic speech is crucial for developing effective
detection countermeasures. Spoofing attacks typically employ either Text-To-Speech (TTS)
synthesis, which generates speech directly from text input, or Voice Conversion (VC), which
modifies a source speaker’s voice to sound like a target speaker [[4], [5]]. Both areas have seen
significant evolution, driven largely by advances in machine learning.

2.1 Text-To-Speech (TTS) Synthesis


TTS systems aim to produce natural-sounding speech from written text. Major approaches
include:

• Concatenative Synthesis: Historically dominant, these methods select appropriate pre-


recorded speech units (like diphones or triphones) from a large database and concatenate
them [[5]]. Systems like MaryTTS fall into this category [[45]]. While capable of high
intelligibility, they often suffer from audible concatenation artifacts and lack flexibility in
modifying voice characteristics like emotion or style [[5]]. The ASVspoof 2019 dataset
includes examples from waveform concatenation methods (A04, A16, A17) [[10]].

• Parametric Synthesis: These methods first predict acoustic parameters (e.g., funda-
mental frequency F0 , spectral envelope, excitation signals) from the input text and then
synthesize the speech waveform using a vocoder based on these parameters [[5]]. HMM-
based Speech Synthesis (HTS) systems were prominent examples [[47], [5]]. While more
flexible than concatenative methods, traditional parametric approaches often produced
less natural-sounding (“buzzy”) speech [[5]]. Vocoders like WORLD [[46]] and STRAIGHT
[[5]] are frequently used in these pipelines. ASVspoof 2019 includes systems using WORLD
(A02, A05, A07) and STRAIGHT (A14, A15) [[10]].

• Neural Network (NN) Based Synthesis: Modern TTS systems predominantly rely
on deep neural networks. These often follow a two-stage approach:

1. Acoustic Model: Predicts intermediate acoustic features (e.g., Mel spectrograms)


from text. Examples include Tacotron 2 [[43]].
2. Vocoder: Synthesizes the final waveform from the predicted acoustic features. Neural
vocoders like WaveNet [[42]], WaveRNN [[44]], WaveGlow [[4]], HiFi-GAN [[4]], or
LPCNet [[1]] generate significantly more natural-sounding speech than older para-
metric vocoders. Some systems aim for end-to-end generation directly from text to
waveform.

Many ASVspoof 2019 attacks utilize neural components, including WaveNet (A01, A12,
A15), WaveRNN (A10), and other NN-based acoustic models or vocoders (A03, A08,
A09, A11, A13) [[10]]. These methods produce highly realistic speech, posing a significant
challenge for detection [[5], [4]].

2.2 Voice Conversion (VC)


VC aims to modify the speech of a source speaker to sound like a target speaker while preserving
the linguistic content [[4]]. Various techniques exist, including:

• Statistical Methods: Often using GMMs to model the mapping between source and
target acoustic features [[3]].

3
• Neural Network Based Methods: Employing architectures like Variational Autoen-
coders (VAEs) [[48]], Generative Adversarial Networks (GANs), or sequence-to-sequence
models to learn the transformation [[4], [5]]. These often rely on vocoders (like WORLD
or WaveNet) for waveform generation.

• Direct Waveform Modification: Some techniques attempt to modify the source wave-
form directly to match the target speaker’s characteristics [[5]].

Several VC systems are included in the ASVspoof challenges, such as transfer-function based
methods (A02, A03, A06, A19), VAE-based systems (A05, A17), and others (A14, A15, A18)
[[10], [5]].
The wide variety of these synthesis techniques, ranging from older methods leaving distinct
artifacts to highly sophisticated neural approaches generating near-indistinguishable speech,
underscores the complexity of the synthetic speech detection task [[5], [4]]. Detection systems
must be robust and generalizable to handle this diverse and evolving landscape of threats.

4
3 LITERATURE REVIEW

The detection of synthetic speech has evolved significantly, moving from methods relying on ex-
pert knowledge and handcrafted features towards more data-driven and automated deep learning
paradigms. This section reviews these major approaches.

3.1 Traditional Approaches


Traditional methodologies predominantly centered on the extraction of manually engineered
features from the audio signal, followed by classification employing classical machine learning
models [[5]]. These methods were formulated based on expert knowledge of speech production,
signal processing, and characteristics potentially differentiating natural from synthetic speech
[[3]].

3.1.1 Handcrafted Feature Extraction


Handcrafted features were meticulously designed to capture specific acoustic properties hy-
pothesized to differ between natural and synthetic or converted speech [[5]]. Commonly utilized
features encompass:

• Spectral Envelope Features:

– Mel-Frequency Cepstral Coefficients (MFCCs): Derived from the short-term power


spectrum using a Mel-scaled filter bank approximating human auditory perception
[[12]]. Widely used in speech/speaker recognition, but their effectiveness for spoofing
detection can be limited [[3]].
– Linear Frequency Cepstral Coefficients (LFCCs): Utilize a linear filter bank, em-
phasizing higher frequency components compared to MFCCs, potentially capturing
different artifacts [[12], [1]].
– Constant Q Cepstral Coefficients (CQCCs): Based on the constant Q transform
(CQT), offering logarithmically spaced frequency resolution akin to human hearing,
potentially enhancing detection across varied frequency bands [[13], [12]]. CQCCs
were effective baseline features for ASVspoof 2019 [[2]].

• Phase and Excitation Features: Features derived from phase information (e.g., rela-
tive phase shift - RPS), group delay (e.g., modified group delay - MGD), or analysis of
the excitation signal were explored, as synthesis methods might introduce phase inconsis-
tencies [[12], [5]].

• Prediction-Based Features: Methods analyzing linear prediction (LP) coefficients or


the prediction residual, aiming to model the source-filter characteristics of speech pro-
duction. Borrelli et al. proposed features based on short-term and long-term prediction
residuals, considering multiple prediction orders to capture potential inconsistencies [[5]].
Janicki focused on analyzing the LP error directly [[14]].

• High-Order Statistics: Features like bicoherence analyze non-linear interactions and


phase coupling between frequency components, hypothesized to differ between natural
and synthetic speech [[5]].

The efficacy of these features often depended on identifying specific artifacts introduced by
particular synthesis techniques, requiring careful selection and tuning (e.g., window sizes, filter
configurations, prediction orders) [[5], [1]]. This reliance on specific feature engineering poten-
tially hindered generalization across diverse datasets and evolving synthesis methods [[1], [12],
[5]].
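
As an illustration of the front-ends listed above, the following Python sketch (assuming the librosa and scipy packages are available) extracts MFCCs and a simplified CQCC-like representation. Note that true CQCC extraction also resamples the log-CQT spectrum uniformly before the DCT; that step is omitted here for brevity, so the second function is only an approximation.

# Minimal sketch of handcrafted front-ends (assumes librosa and scipy).
import librosa
import numpy as np
from scipy.fftpack import dct

def mfcc_features(path, n_mfcc=20):
    y, sr = librosa.load(path, sr=16000)
    # 25 ms windows with a 10 ms hop, as commonly used in the literature
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=400, hop_length=160)

def cqcc_like_features(path, n_coeff=20):
    y, sr = librosa.load(path, sr=16000)
    cqt = np.abs(librosa.cqt(y, sr=sr))          # constant Q transform magnitudes
    log_power = np.log(cqt ** 2 + 1e-10)
    # DCT over the frequency axis yields cepstral-like coefficients per frame
    return dct(log_power, axis=0, norm='ortho')[:n_coeff]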

5
3.1.2 Classical Classifiers
Subsequent to feature extraction, traditional systems employed classical statistical models and
machine learning classifiers for discrimination [[3]]. Hanilçi et al. (2015) provided a comparative
study using standard MFCC features on ASVspoof 2015 [[3]]:

• Gaussian Mixture Models (GMMs): Probabilistic models representing feature dis-


tributions as a weighted sum of Gaussian components [[3], [15]]. Maximum likelihood
(ML) estimation and adaptation from a Universal Background Model (UBM) were com-
mon training paradigms. GMM-ML showed reasonable robustness to unknown attacks
but potentially higher overall EER [[3]].

• Support Vector Machines (SVMs): Supervised models seeking an optimal separating


hyperplane [[3], [16]]. They were often used with kernel functions or in conjunction with
GMMs.

– GMM Supervectors + SVM: Feature vectors were represented by concatenating


MAP-adapted GMM mean vectors (supervectors) before SVM classification [[3], [17]].
– GLDS-SVM: Used a generalized linear discriminant sequence kernel, involving poly-
nomial expansion of features before linear SVM classification, avoiding the need for
a UBM [[3], [18]]. GLDS-SVM achieved the best performance on the ASVspoof 2015
development set but degraded significantly on unknown evaluation attacks [[3]].

• I-vector Based Systems: Extracted low-dimensional identity vectors (i-vectors) [[19]]


capturing utterance variability. Classification used SVMs or cosine similarity, sometimes
with Within-Class Covariance Normalization (WCCN) to normalize variations from speak-
ers or synthesis methods [[3]]. These systems showed the highest EERs in the comparison
by Hanilçi et al. [[3]].

These comparative studies highlighted the sensitivity of traditional methods to feature and clas-
sifier choices and the significant challenge of generalizing across different (especially unknown)
synthesis techniques [[3]]. Their adaptability remained fundamentally limited by the reliance
on fixed, manually engineered features [[4]].
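
The GMM-ML back-end described above can be sketched as follows: one GMM is trained on frame-level features pooled from bona fide utterances and another on spoofed utterances, and each test utterance is scored by an average log-likelihood ratio. The component count and diagonal covariances are illustrative choices, not the exact configuration of Hanilçi et al. [[3]].

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(genuine_frames, spoof_frames, n_components=512):
    # genuine_frames, spoof_frames: arrays of shape (n_frames, n_features),
    # pooled over all training utterances of each class
    gmm_gen = GaussianMixture(n_components, covariance_type='diag', max_iter=100).fit(genuine_frames)
    gmm_spf = GaussianMixture(n_components, covariance_type='diag', max_iter=100).fit(spoof_frames)
    return gmm_gen, gmm_spf

def llr_score(utterance_frames, gmm_gen, gmm_spf):
    # Average frame log-likelihood ratio; higher values favour bona fide speech
    return (gmm_gen.score_samples(utterance_frames).mean()
            - gmm_spf.score_samples(utterance_frames).mean())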

3.2 End-to-End Deep Learning Approaches


The emergence of deep learning instigated a paradigm shift, offering end-to-end approaches
as a potent alternative [[2]]. These methods aim to circumvent manual feature engineering
limitations by directly processing the raw audio waveform or basic spectral representations,
allowing the network to learn relevant discriminative features autonomously [[2]].

3.2.1 Direct Waveform Processing


End-to-end systems typically accept the raw audio waveform as input, obviating explicit pre-
processing and feature extraction steps like MFCC or CQCC calculation [[2]]. Learning repre-
sentations directly from the time-domain waveform potentially preserves more original signal
information, including subtle phase details often discarded by spectral transforms, and may
capture transient synthesis artifacts lost during averaging or frame-based feature extraction
[[2], [20]]. The Time-domain Synthetic Speech Detection Net (TSSDNet) proposed by Hua et
al. (2021) exemplifies this approach, feeding raw audio segments directly into a DNN [[2]].
This leverages the DNN’s inherent capability to perform hierarchical feature learning, leading
to enhanced detection performance on challenging datasets like ASVspoof 2019 [[2]].

6
3.2.2 Network Architectures
Network architecture design is critical for the success of end-to-end approaches [[2]]. Hua et al.
(2021) introduced two relatively lightweight TSSDNet variants, hypothesizing that shallower
networks might be more suitable than very deep ones for capturing subtle, low-level forgery
artifacts rather than high-level semantics [[2]]:

• Res-TSSDNet: Incorporates ResNet-style blocks with skip connections, often imple-


mented with 1×1 convolution kernels to match dimensions [[2]]. Inspired by ResNet [[21]],
these skip connections facilitate the training of deeper networks by mitigating the van-
ishing gradient problem, allowing the model to learn complex representations without
significant performance degradation [[2]].

Figure 1: Conceptual Diagram of a ResNet-Style Block for TSSDNet [[2]].

7
• Inc-TSSDNet: Employs Inception-style blocks featuring parallel convolutional pathways
with different kernel sizes and, notably, dilated convolutions [[2]]. Inspired by Inception
networks [[22]], this allows simultaneous multi-scale feature processing. Dilated convolu-
tions increase the receptive field without increasing parameters or losing resolution [[2]].
The parallel structure enables concurrent analysis at multiple temporal scales or abstrac-
tion levels [[2]].

Figure 2: Conceptual Diagram of an Inception-Style Block for TSSDNet [[2]].

Common components in these architectures include an initial 1D convolutional layer for wave-
form feature extraction (e.g., using a 1x7 kernel), pooling layers (max pooling often found effec-
tive), batch normalization, activation functions (like ReLU), and final fully connected layers for
classification [[2]]. End-to-end training allows joint optimization of all network parameters, often
yielding improved generalization, as evidenced by TSSDNet’s competitive EERs on benchmarks
like ASVspoof2019 [[2]]. Their ability to learn task-specific representations directly from data
potentially makes them more adaptable to evolving spoofing techniques than methods reliant
on fixed, predefined features [[2], [4]].
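
The following PyTorch sketch illustrates a ResNet-style 1D block and a small raw-waveform front end in the spirit of Res-TSSDNet; the channel widths, kernel sizes, and pooling factors are illustrative assumptions and do not reproduce the published architecture [[2]].

import torch
import torch.nn as nn

class Res1DBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm1d(out_ch),
        )
        # 1x1 convolution on the skip path to match channel dimensions
        self.skip = (nn.Conv1d(in_ch, out_ch, kernel_size=1, bias=False)
                     if in_ch != out_ch else nn.Identity())
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                         # x: (batch, channels, samples)
        return self.act(self.body(x) + self.skip(x))

# Raw waveform -> initial 1x7 convolution -> residual block, with max pooling
frontend = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=7, padding=3),
    nn.MaxPool1d(4),
    Res1DBlock(16, 32),
    nn.MaxPool1d(4),
)
waveform = torch.randn(2, 1, 16000 * 6)           # two 6-second clips at 16 kHz
features = frontend(waveform)                     # learned time-domain features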

3.3 Fully Automated End-to-End Approaches


Fully automated approaches signify the current frontier, aiming to minimize manual interven-
tion even further by integrating powerful self-supervised pretraining models with automated
architecture search techniques [[1]]. This paradigm seeks to leverage large unlabeled datasets
for robust feature learning and automatically discover optimal network structures, potentially
yielding highly performant and generalizable systems [[1]]. The method proposed by Wang et
al. (2022) exemplifies this integration [[1]].

3.3.1 Self-Supervised Pretrained Feature Extraction


A cornerstone of this approach is the use of self-supervised learning (SSL) models, pretrained
on vast amounts of unlabeled audio data, to extract high-level, general-purpose speech repre-
sentations [[1]]. Models like wav2vec [[23]] and particularly wav2vec 2.0 [[24]] are frequently
employed as feature extractors [[1]].

8
• wav2vec 2.0 Architecture: Typically consists of a multi-layer convolutional feature en-
coder that processes the raw waveform, followed by a Transformer-based context network
that builds contextualized representations [[24], [1]]. A quantization module discretizes
the encoder output for use in the self-supervised objective [[24]].

• Pretraining Objective: wav2vec 2.0 is trained using a contrastive task, learning to


predict the correct quantized latent speech representation for masked time steps within
the Transformer’s output, using other sampled quantized representations as negatives
[[24]]. This forces the model to learn meaningful representations from the structure of the
audio itself.

• Benefits: By pretraining on large unlabeled datasets (e.g., 960 hours of Librispeech


[[25]]), these models learn rich representations capturing phonetic, acoustic, and contextual
characteristics potentially useful for downstream tasks like spoofing detection [[1], [24]].
Using these pretrained models as feature extractors bypasses traditional acoustic feature
engineering (MFCC, CQCC), reduces dependency on manual tuning, and can potentially
capture subtle cues missed by handcrafted features [[1]]. The efficacy of SSL models for
spoofing detection is supported by multiple studies [[1], [26], [27], [28]].

Figure 3: Conceptual Diagram of the Wav2Vec 2.0 Feature Extractor [[1], [24]].
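
As a concrete illustration, the HuggingFace transformers implementation of wav2vec 2.0 can be used as a frozen feature extractor; the checkpoint name and the use of the final hidden states below are assumptions for illustration rather than the exact setup of Wang et al. [[1]].

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(16000 * 4)                 # placeholder: 4 s of 16 kHz audio
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state    # shape (1, time_frames, 768) for the base model

# 'hidden' then serves as input to the downstream (e.g., NAS-searched) classifier.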

3.3.2 Automated Network Architecture Search


Complementing robust SSL features, fully automated systems often employ Neural Architecture
Search (NAS) to optimize the subsequent classification network structure [[1]]. NAS automates
the typically labor-intensive process of architecture design.

• DARTS and Light-DARTS: Differentiable Architecture Search (DARTS) [[29]] is a


popular gradient-based NAS technique. It relaxes the discrete search space of network
operations into a continuous one, allowing simultaneous optimization of architecture pa-
rameters and network weights via gradient descent [[1], [29]]. Light-DARTS is an efficient
variant used by Wang et al. (2022) [[1]].

9
• Search Process: NAS typically searches over a predefined space of candidate operations
(e.g., separable convolutions of different kernel sizes, dilated convolutions, pooling opera-
tions, skip connections, zero operation) within basic building blocks called cells (normal
and reduction cells) [[1], [29]]. Light-DARTS uses techniques like softmax weighting over
operations and bilevel optimization (optimizing weights on training data, architecture
parameters on validation data) to find a promising final discrete architecture [[1]].

• Enhancements: Wang et al. incorporated a Max Feature Map (MFM) module [[30]] into
the Light-DARTS search space. MFM acts as a feature selection mechanism by taking the
element-wise maximum across channels, potentially enhancing discrimination [[1]]. Ge et
al. also explored Partially-Connected DARTS for spoofing detection [[31]].

This automated search minimizes manual design effort, reduces reliance on expert intuition,
and can discover novel, task-specific architectures optimized for the extracted SSL features,
potentially improving performance and efficiency over hand-designed networks [[1], [31]].
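
The continuous relaxation at the heart of DARTS-style search can be sketched as follows; the candidate operation set and the placement of the MFM module are simplified assumptions for illustration, not the search space used by Wang et al. [[1]].

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxFeatureMap(nn.Module):
    # Element-wise maximum over two halves of the channel dimension,
    # which halves the number of channels (a feature selection mechanism).
    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)
        return torch.max(a, b)

class MixedOp(nn.Module):
    # Continuous relaxation: a softmax over architecture parameters weights
    # every candidate operation applied to the same input.
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, 3, padding=1),               # standard conv
            nn.Conv1d(channels, channels, 3, padding=2, dilation=2),   # dilated conv
            nn.MaxPool1d(3, stride=1, padding=1),                      # pooling
            nn.Identity(),                                             # skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))          # architecture parameters

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

x = torch.randn(2, 16, 400)                                  # (batch, channels, time frames)
mixed = MixedOp(16)(x)                                       # weighted sum of all candidate ops
mfm = MaxFeatureMap()(nn.Conv1d(16, 32, 3, padding=1)(x))    # conv doubles channels, MFM halves them
# During the search, network weights are updated on training batches and the alpha
# parameters on validation batches (bilevel optimization); afterwards each MixedOp is
# replaced by its highest-weighted candidate operation.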

3.3.3 Advantages of the Fully Automated Approach


This integrated approach, combining SSL feature extraction with NAS-driven architecture de-
sign, offers significant potential advantages [[1]]:

• Reduced Manual Intervention: Fully automates both feature extraction and architec-
ture design, minimizing time-consuming manual tuning and reliance on domain expertise
[[1]].

• Robust Generalization: SSL pretraining on large, diverse unlabeled datasets fosters


generalized speech representations less biased towards specific training data or known
attack types, enhancing performance on unseen data and novel attacks, thereby potentially
improving domain robustness [[1]].

• State-of-the-Art Performance: The synergy between high-level, robust pretrained


features and automatically optimized, task-specific architectures discovered by NAS can
lead to superior results on challenging benchmarks [[1]]. Wang et al. (2022) reported a
state-of-the-art EER of 1.08 % on ASVspoof2019 LA using wav2vec 2.0 large features and
light-DARTS, outperforming other contemporary single systems significantly [[1]].

Fully automated end-to-end approaches represent a substantial advancement, offering scalable,


adaptable, and highly effective solutions to the challenges posed by the rapid evolution of speech
synthesis and voice conversion technologies [[1]].

10
4 METHODOLOGY

This section outlines the common methodologies employed in the literature for comparing differ-
ent synthetic speech detection systems, focusing on the criteria used for evaluation and typical
experimental setups.

4.1 Comparison Criteria


A holistic evaluation of synthetic speech detection approaches typically involves assessing per-
formance across several key dimensions [[4]]:
• Detection Accuracy: Primarily measured by the Equal Error Rate (EER), which
represents the point where the False Acceptance Rate (FAR - spoofed classified as genuine)
equals the False Rejection Rate (FRR - genuine classified as spoofed) [[3], [1], [2]]. A
lower EER indicates better discrimination between bona fide and spoofed speech. It is
the standard metric in ASVspoof challenges [[6]].
• ASV System Impact: Evaluated using the minimum Tandem Detection Cost
Function (min t-DCF), especially relevant in ASVspoof challenges. This metric assesses
the detection system’s performance based on weighted costs of miss and false alarm errors,
specifically considering its impact on the reliability of a downstream ASV system [[6], [4]].
A lower min t-DCF indicates better performance in the context of protecting an ASV
system [[6]].
• Generalization Capability: Assesses the model’s robustness to unseen conditions. This
is evaluated by testing on evaluation sets containing unknown attack types or by cross-
dataset testing (e.g., training on ASVspoof 2019 and testing on ASVspoof 2015/2021/ADD)
[[2], [1], [4], [33], [34]]. Poor generalization is a major limitation of many systems [[4]].
• Robustness to Variations: Examines the system’s sensitivity to variations commonly
encountered in real-world scenarios, such as different audio durations, background noise,
reverberation, and compression/codec artifacts [[1], [4], [8]]. The impact of data augmen-
tation techniques on stability and performance is also considered [[35], [4]].
• Computational Efficiency: Considers the practical aspects of deployment, including
model complexity (e.g., number of parameters [[2]]), training resource requirements (e.g.,
GPUs, training time, FLOPs), and inference speed [[32]]. This is crucial for real-time
applications or deployment on resource-constrained devices [[4]].
These criteria collectively provide a comprehensive assessment considering accuracy, practicality,
generalization, and reliability under diverse conditions [[4]].

4.2 Experimental Setup Considerations


Comparative studies typically adhere to established protocols using standard benchmark datasets
to ensure reproducibility and fair comparison [[4], [1], [2]].

4.2.1 Benchmark Datasets


Several publicly available datasets serve as standard benchmarks:
• ASVspoof 2015: The first iteration of the challenge, focusing on logical access (TTS/VC)
attacks. It includes training, development, and evaluation sets with 5 known (S1-S5, used
in training/dev/eval) and 5 unknown (S6-S10, only in eval) spoofing algorithms [[3], [7]].
It remains a valuable benchmark for assessing generalization to older or different attack
types [[2]].

11
• ASVspoof 2019: A major benchmark focusing on both Logical Access (LA) and Physical
Access (PA - replay attacks). The LA dataset, used in this report’s context, features sig-
nificantly more challenging and diverse attacks based on modern TTS and VC techniques
[[6], [10]]. It uses the VCTK corpus [[11]] as the basis for bona fide speech and includes
6 known attacks (A01-A06) in the training/development sets and 13 attacks (A07-A19,
where A16=A04, A19=A06) in the evaluation set, ensuring most evaluation attacks are
unseen during training [[10], [5]].

• ASVspoof 2021: Introduced a specific Speech DeepFake (DF) track focusing purely on
synthetic speech detection, separate from ASV integration [[8]]. Its evaluation set notably
includes audio subjected to various codecs and compression, testing robustness to channel
effects [[1], [8]]. Training and development data largely overlap with ASVspoof 2019 LA
[[1]].

• ADD 2022: The first Audio Deep Synthesis Detection challenge, featuring tracks like
Low-Quality Fake audio detection (LF) [[9]]. The LF track uses Mandarin speech (AISHELL-
3 corpus) and includes evaluation data with environmental noise and background music,
testing robustness in non-clean conditions [[1], [9]].

These datasets, with their defined splits and varied attack types/conditions, provide a structured
framework for evaluating and comparing detection systems [[4]].

Table 1: Overview of Key Synthetic Speech Detection Datasets (Example)

Dataset        Year   Focus                  Key Characteristics

ASVspoof LA    2015   TTS/VC                 5 Known + 5 Unknown Attacks [[7]]
ASVspoof LA    2019   TTS/VC (Modern)        6 Known + 11 Unknown Attacks (Eval) [[10]]
ASVspoof DF    2021   TTS/VC (DeepFakes)     Codec/Compression Effects (Eval) [[8]]
ADD LF         2022   TTS/VC (Low Quality)   Mandarin, Noise/Music (Eval) [[9]]

4.2.2 Evaluation Metrics


As mentioned in Section 4.1, the primary metrics are:

• EER: Calculated from the detection scores on a test set, it finds the threshold where
FAR equals FRR. It provides a single-number summary of the system’s discrimination
capability, independent of application-specific costs or thresholds [[3], [1]].

• min t-DCF: Defined in the ASVspoof challenges, it evaluates the detector’s performance
in the context of an ASV system. It calculates a weighted cost based on misses (spoofed
accepted by ASV due to detector failure) and false alarms (genuine rejected by ASV due
to detector error), minimized over the decision threshold [[6], [4]]. This metric better
reflects the practical utility of the countermeasure in protecting speaker verification [[6]].

While Mean Opinion Score (MOS) is used to evaluate the *quality* or naturalness of
synthetic speech itself [[4]], it is not a direct metric for evaluating the performance of *detection*
systems.
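
A common way to compute EER from a set of detection scores is sketched below (assuming scikit-learn is available); scores are taken to be higher for bona fide speech, so the false positive rate of the ROC corresponds to the FAR and one minus the true positive rate to the FRR.

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    # labels: 1 for bona fide, 0 for spoofed
    far, tpr, _ = roc_curve(labels, scores, pos_label=1)
    frr = 1 - tpr
    idx = np.nanargmin(np.abs(frr - far))      # operating point where FAR ~= FRR
    return (far[idx] + frr[idx]) / 2

scores = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([1, 1, 0, 0])
print(compute_eer(scores, labels))             # 0.0 for perfectly separated scores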

4.2.3 Feature Extraction and Model Inputs


The input representation varies significantly based on the approach:

12
• Traditional Methods: Rely on extracting pre-defined handcrafted features. Common
parameters include frame length (e.g., 20-30 ms), hop length (e.g., 10 ms), number of coef-
ficients (e.g., 19 MFCCs + Energy, plus deltas and double-deltas yielding 60 dimensions)
or CQT parameters [[3], [6]]. Voice Activity Detection (VAD) is often applied to remove
silence [[3]].

• End-to-End Approaches: Typically use the raw audio waveform directly, often seg-
mented into fixed lengths (e.g., 6 s at 16 kHz sampling rate used in [[2]]). Alternatively,
basic spectral representations like STFT spectrograms might be used as input to 2D CNNs
[[5]].

• Fully Automated Systems: Also start with the raw waveform but feed it into a pre-
trained SSL model (like wav2vec 2.0) to extract high-dimensional features (e.g., 768-dim
for wav2vec 2.0 base, 1024-dim for large) [[1]]. These features, often processed into fixed-
length sequences (e.g., 400 time frames), then serve as input to the NAS-optimized clas-
sifier network [[1]].
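
A small utility of the kind typically used to standardize input duration (e.g., 6 s at 16 kHz, i.e., 96,000 samples, as in the end-to-end setup above) is sketched below; the repeat-and-truncate strategy is one common choice rather than a prescribed procedure.

import numpy as np

def fix_length(waveform, target_len=16000 * 6):
    # Truncate long inputs; repeat short inputs until the target length is reached
    if len(waveform) >= target_len:
        return waveform[:target_len]
    n_repeat = int(np.ceil(target_len / len(waveform)))
    return np.tile(waveform, n_repeat)[:target_len]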

4.2.4 Training and Evaluation Protocols


Standard practices for training and evaluation include:

• Data Preparation: Standardizing audio duration (by truncating or concatenating/re-


peating) and sampling rate (commonly 16 kHz) is typical for DNN inputs [[2], [1]]. Data
is usually processed in batches during training [[2], [1]].

• Training Strategy: DNNs are commonly trained using stochastic gradient descent vari-
ants like Adam [[2], [1]] with learning rate schedules (e.g., exponential decay). Techniques
to improve robustness and prevent overfitting are crucial:

– Data Augmentation: Applying transformations like adding noise, reverberation, time


shifting, pitch shifting, or using spectral augmentation methods like SpecAugment
[[36], [35], [4]].
– Regularization: Techniques like dropout or mixup [[37]], which linearly interpolates
between samples and their labels, have shown significant benefits, especially for gen-
eralization [[2]].
– Loss Functions: Weighted Cross-Entropy (WCE) is often used to handle class im-
balance between genuine and spoofed samples [[2]].

NAS procedures involve specific optimization schedules, often alternating between training
network weights and architecture parameters [[1], [29]].

• Evaluation: Models are typically trained on the designated training set, with hyper-
parameters tuned based on performance (usually EER) on the development set. Final
performance is reported on the unseen evaluation set using metrics like EER and min
t-DCF [[1], [2], [3]]. Cross-dataset testing protocols are used to assess generalization ex-
plicitly [[2], [1]].

Adherence to these common practices allows for more meaningful comparisons between different
proposed detection systems in the literature.
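
A hedged sketch of a typical training step combining these elements (Adam, exponential learning-rate decay, and weighted cross-entropy) is given below; the placeholder model and the approximate 9:1 class weighting are illustrative assumptions, not values taken from the cited papers.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(96000, 2))      # placeholder detector
# Weighted cross-entropy: class 0 = spoofed, class 1 = bona fide (minority class);
# the ~9:1 weighting roughly mirrors the genuine/spoofed imbalance in ASVspoof 2019 LA training data.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 9.0]))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

def train_step(batch_waveforms, batch_labels):
    # batch_waveforms: (batch, 96000) fixed-length inputs; batch_labels: (batch,) class indices
    optimizer.zero_grad()
    loss = criterion(model(batch_waveforms), batch_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
# scheduler.step() is typically called once per epoch.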

13
5 COMPARATIVE ANALYSIS

This section synthesizes findings from the literature to compare the traditional, end-to-end, and
fully automated approaches based on the criteria outlined in Section 4.

5.1 Performance Comparison


Performance evaluation, primarily via EER on standard datasets, reveals a clear progression in
accuracy with advancing methodologies.
• Traditional Methods: Showed variable performance. On ASVspoof 2015, sophisticated
classifiers like GLDS-SVM achieved very low EERs on the development set (0.12 %) but
struggled with unknown attacks in the evaluation set (9.40 % EER for unknown, 4.75 %
average eval EER) [[3]]. The simpler GMM-ML was more robust to unknown attacks (5.52 %
EER) and achieved a lower average eval EER (3.01 %) [[3]]. Baselines using LFCC+GMM
and CQCC+GMM on the more challenging ASVspoof 2019 evaluation set yielded EERs of
9.57 % and 8.09 %, respectively [[2], [6]]. This highlights their limitations against modern
and unseen attacks.
• End-to-End DNNs: Demonstrated significant improvements. Hua et al.’s Res-TSSDNet,
operating directly on raw waveforms, achieved an average EER of 1.66 % on the ASVspoof
2019 LA evaluation set, outperforming contemporary fused systems and traditional base-
lines substantially [[2]]. The lightweight Inc-TSSDNet also achieved a respectable 4.04 %
EER [[2]]. This indicates the effectiveness of learning features directly from data for this
task.
• Fully Automated Methods: Represent the current state-of-the-art in terms of raw
performance. The system by Wang et al., combining wav2vec 2.0 (large) features with
a light-DARTS optimized network, reported an EER of only 1.08 % on the ASVspoof
2019 LA evaluation set [[1]]. This result, achieved by a single system without manual
feature engineering or architecture design, underscores the power of combining large-scale
pretraining with automated optimization [[1]].
Table 2 summarizes these key performance points on
evaluation sets. Performance based on min t-DCF generally follows similar trends, though
specific values depend heavily on the ASVspoof challenge version and scoring parameters [[6]].

Table 2: Performance Comparison of Representative Synthetic Speech Detection Approaches
(EER % on Evaluation Sets)

Approach          Method                               ASVspoof 2015 (Eval Avg)   ASVspoof 2019 LA (Eval)

Traditional       GMM-ML [[3]]                         3.01                       -
Traditional       GLDS-SVM [[3]]                       4.75                       -
Traditional       Baseline LFCC+GMM [[2]]              -                          9.57
Traditional       Baseline CQCC+GMM [[2]]              -                          8.09
End-to-End        Res-TSSDNet [[2]]                    (1.95*)                    1.66
End-to-End        Inc-TSSDNet [[2]]                    (1.96*)                    4.04
Fully Automated   wav2vec 2.0-L + light-DARTS [[1]]    -                          1.08

*Result obtained via cross-dataset testing (Train: ASVspoof 2019, Test: ASVspoof 2015 Eval), requires
mixup regularization [[2]]. Results are indicative; direct comparison across different papers/setups
requires caution.

14
Table 3: Ablation Study: EER (%) on ASVspoof 2019 LA Eval for Different Feature/Network
Combinations (Adapted from Wang et al. [[1]])

Feature             Network Architecture       Dev EER (%)   Eval EER (%)

LFCC                LCNN                       0.13          4.75
LFCC                DARTS                      0.01          4.82
LFCC                Light-DARTS (Proposed)     0.05          4.35
Wav2vec             LCNN                       0.03          3.51
Wav2vec             DARTS                      0.02          2.18
Wav2vec             Light-DARTS (Proposed)     0.06          1.51
Wav2vec 2.0-Base    LCNN                       0.02          3.32
Wav2vec 2.0-Base    DARTS                      0.01          2.23
Wav2vec 2.0-Base    Light-DARTS (Proposed)     0.01          1.19
Wav2vec 2.0-Large   LCNN                       0.14          3.56
Wav2vec 2.0-Large   DARTS                      0.05          1.97
Wav2vec 2.0-Large   Light-DARTS (Proposed)     0.02          1.08
Note: Illustrates the impact of feature choice (LFCC vs. SSL models) and network architecture
(manually designed LCNN vs. automatically searched DARTS/Light-DARTS). Results from Wang et
al. [[1]].

Figure 4: Visual Comparison of EER Performance on ASVspoof Evaluation Sets.

15
Figure 5: Example Detection Error Trade-off (DET) Curves.

5.2 Generalization Capabilities


Generalization across datasets, unseen attacks, and varying conditions is a critical differentiator
[[4]].

• Traditional Methods: Often exhibit poor generalization. As seen with GLDS-SVM on


ASVspoof 2015, performance can drop drastically when moving from known development
attacks to unknown evaluation attacks [[3]]. Their reliance on features tailored to specific
artifacts makes them brittle against novel synthesis techniques [[4], [3]]. Cross-dataset
testing typically reveals significant performance degradation [[2]].

• End-to-End DNNs: Show improved potential for generalization due to learned repre-
sentations. However, naive end-to-end models can still overfit the training data distri-
bution. Hua et al. demonstrated that while their baseline TSSDNet performed poorly
in cross-dataset testing (ASVspoof 2019 train → 2015 test, EER > 39 %), incorporating
mixup regularization dramatically improved generalization, achieving EERs below 2 % on
ASVspoof 2015 eval [[2]]. This suggests the learned features have inherent generalization
capacity that can be unlocked with proper regularization.

• Fully Automated Methods: Exhibit the strongest generalization reported so far. This
is largely attributed to the robust, general representations learned by SSL models (like
wav2vec 2.0) during pretraining on massive, diverse unlabeled audio data [[1]]. NAS
further helps by finding architectures suitable for these general features. Wang et al.
demonstrated their system’s robustness not only on ASVspoof 2019 LA (1.08 % EER) but
also on the ASVspoof 2021 DF evaluation set (7.86 % EER, including codec effects) and
the ADD 2022 Track 1 evaluation set (20.11 % EER, Mandarin speech with noise/music),
outperforming the respective challenge winners in the latter two cases [[1]]. This indicates
remarkable adaptability across different datasets, languages, and acoustic conditions.

5.3 Advantages and Limitations


Each paradigm presents a distinct profile of strengths and weaknesses:

16
Table 4: Performance (EER %) of Wav2Vec 2.0 + Light-DARTS Across Different Datasets
(Data from Wang et al. [[1]])

Dataset (Evaluation Set)   SSL Feature Extractor   EER (%)

ASVspoof 2019 LA           Wav2Vec 2.0-Large        1.08
ASVspoof 2019 LA           Wav2Vec 2.0-Base         1.19
ASVspoof 2021 DF           Wav2Vec 2.0-Large        7.86
ASVspoof 2021 DF           Wav2Vec 2.0-Base         8.16
ADD 2022 Track 1           Wav2Vec 2.0-Large       20.11
ADD 2022 Track 1           Wav2Vec 2.0-Base        21.23
Note: Results demonstrate performance on Logical Access (LA), DeepFake (DF - includes codec
effects), and Audio Deep Synthesis Detection (ADD - includes noise/music) challenges.

• Traditional Methods:
– Advantages: Relatively interpretable features based on acoustic principles; generally
lower computational requirements; can perform well on known attacks if features are
well-chosen [[3], [5]].
– Limitations: Heavy reliance on expert knowledge and manual feature engineering/-
tuning; poor adaptability and generalization to novel/unseen attacks; often subopti-
mal performance compared to modern methods [[4], [3]].
• End-to-End Deep Learning Approaches:
– Advantages: Eliminate manual feature engineering; learn robust, task-specific repre-
sentations directly from data (potentially preserving more information); significantly
improved performance on modern datasets; better generalization potential than tra-
ditional methods [[2], [4]].
– Limitations: Can require large labeled datasets for optimal training; sensitive to ar-
chitecture design, hyperparameters, and regularization (like mixup); generally higher
computational cost than traditional methods; interpretability (“black box” nature)
can be challenging [[4], [2]].
• Fully Automated End-to-End Approaches:
– Advantages: Minimize manual intervention in both feature and architecture design;
leverage large unlabeled datasets via SSL for very robust and general representations;
achieve state-of-the-art performance and excellent generalization across diverse con-
ditions [[1]].
– Limitations: Highest computational complexity, especially for SSL pretraining and
the NAS search phase; heavily dependent on the availability and quality of large
pretraining datasets; resulting architectures can be complex and difficult to interpret;
potential vulnerability if pretraining data contains biases or artifacts [[1], [4]].

5.4 Impact of Data Augmentation


Data augmentation plays a crucial role in improving the robustness and generalization of DNN-
based detectors [[35], [4]].
• Techniques: Common methods include adding noise (various types), simulating rever-
beration, applying frequency/time masking (SpecAugment [[36]]), pitch shifting, time
stretching, and using generative methods to create more training data [[35], [4]].

17
• Effectiveness: Augmentation helps models learn features invariant to common distor-
tions and reduces overfitting to the training set characteristics [[4]]. It demonstrably
improves performance, especially generalization to unseen conditions or datasets [[35]].

• Mixup Regularization: Techniques like mixup [[37]], which creates virtual training
samples by linearly interpolating pairs of samples and their labels, have proven particu-
larly effective for regularization and enhancing cross-dataset generalization in end-to-end
spoofing detection, as shown by Hua et al. [[2]].

• Considerations: The choice and parameters of augmentation techniques must be care-


fully considered, as excessive or inappropriate augmentation might not always improve
performance and could potentially obscure relevant spoofing artifacts [[33], [4]].
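
Mixup itself is straightforward to implement; the sketch below forms virtual training examples by interpolating input pairs and their one-hot labels, with the Beta-distribution parameter as an illustrative choice rather than the value used by Hua et al. [[2]].

import numpy as np
import torch

def mixup_batch(x, y, alpha=0.5):
    # x: (batch, ...) inputs; y: (batch, n_classes) one-hot labels
    lam = np.random.beta(alpha, alpha)                # interpolation coefficient
    perm = torch.randperm(x.size(0))                  # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix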

5.5 Computational Cost and Efficiency


The computational demands vary significantly:

• Traditional Methods: Generally the least computationally intensive, making them suit-
able for resource-constrained environments, although feature extraction can still require
processing [[3]].

• End-to-End DNNs: Require more resources, particularly for training which often ne-
cessitates GPUs [[2]]. However, models can be designed to be relatively lightweight. For
example, Hua et al.’s Inc-TSSDNet had only 0.09M parameters and Res-TSSDNet had
0.35M parameters, significantly fewer than many large contemporary DNNs used in other
speech tasks [[2]]. Inference time is generally manageable for optimized models.

• Fully Automated Methods: Can be very resource-intensive. Pretraining large SSL


models like wav2vec 2.0 requires substantial computational power and data [[1]]. The
NAS phase, especially gradient-based methods like DARTS, can also be computationally
expensive, although efficient variants like light-DARTS aim to mitigate this [[1]]. While
inference with the final discovered model might be efficient, the overall development cost
is high.

There is often a trade-off between peak accuracy and computational efficiency [[4], [2]]. Tech-
niques like model quantization or pruning may be necessary to deploy the most accurate but
complex models on edge devices or in real-time applications [[4], [32]].
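
For reference, parameter counts such as those quoted above can be obtained directly in PyTorch; the model below is only a placeholder used to show the idiom.

import torch.nn as nn

model = nn.Sequential(nn.Conv1d(1, 16, kernel_size=7), nn.ReLU(), nn.Conv1d(16, 2, kernel_size=1))
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e6:.3f}M trainable parameters")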

18
6 DISCUSSION

The comparative analysis highlights the significant progress in synthetic speech detection, par-
ticularly with the advent of automated end-to-end systems. However, it also underscores per-
sistent challenges and illuminates key directions for future research.

6.1 Implications for Future Research


The trajectory from traditional to fully automated methods suggests several promising research
avenues to further enhance detection capabilities [[4]]:

• Advanced Feature Learning: Exploring more sophisticated SSL models beyond wav2vec
2.0, such as HuBERT [[38]], or investigating unsupervised methods tailored specifically for
distinguishing subtle synthesis artifacts could yield even richer representations [[4], [1]].
Multimodal analysis, integrating visual cues (lip movements) when available, might also
offer benefits for deepfake detection in general.

• Efficient Network Architecture Search: While NAS techniques like light-DARTS are
effective, further research into improving their computational efficiency and scalability is
needed. Exploring alternative NAS paradigms or more constrained search spaces could
yield high-performing architectures with lower discovery costs [[1], [4]].

• Adaptability and Robustness: Developing systems robust to the continuous evolution


of synthesis techniques is paramount. Research into domain adaptation, domain general-
ization, continual learning (allowing models to adapt to new attacks without catastrophic
forgetting [[39]]), and adversarial training (explicitly training against attack generation) is
critical [[4]]. Robustness against real-world channel effects (codecs, noise, reverberation)
also needs continued focus [[1]].

• Interpretability and Explainability (XAI): Improving the interpretability of complex
DNN detectors is essential for building trust, enabling diagnostics, and facilitating their
use in sensitive applications (e.g., forensics). Adapting visual XAI methods like Grad-
CAM [[40]] to audio spectrograms or developing novel audio-specific XAI techniques to
highlight *which* parts or characteristics of a signal lead to a spoof decision is an impor-
tant research direction [[4]].

• Resource Efficiency and Deployment: Optimizing models for deployment on resource-
constrained devices (e.g., mobile phones, edge devices) through techniques like quantiza-
tion, pruning, and knowledge distillation is crucial for real-world applicability [[4], [32]].

• Collaborative Learning Paradigms: Exploring federated learning (FL) could enable
collaborative training of robust models across different institutions without sharing sen-
sitive raw audio data. However, significant challenges related to statistical heterogeneity
(non-IID data) across clients need to be addressed for FL to be effective in this domain
[[4]] (a minimal FedAvg-style aggregation sketch appears after this list).

• Improved Datasets: The creation and curation of larger, more diverse, and realistic
benchmark datasets are crucial. Future datasets should ideally include: speech generated
by the very latest synthesis techniques, partially spoofed or manipulated audio (not just
fully synthetic [[41]]), audio subjected to various real-world channel effects and augmenta-
tions, and data collected under clear legal and ethical guidelines (e.g., GDPR compliant)
to facilitate broader research collaboration [[4]].
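
As an illustration of the collaborative paradigm listed above, the following minimal sketch
performs FedAvg-style aggregation: locally trained copies of the same model are combined by
(optionally weighted) averaging of their weights. The function name federated_average, the
uniform weighting, and the commented usage names (train_locally, client_datasets,
global_model) are illustrative assumptions, not components of any cited system.

import copy
import torch

def federated_average(client_state_dicts, client_weights=None):
    """Illustrative FedAvg aggregation: weighted average of client model weights.
    All clients are assumed to share the same architecture; buffers such as
    BatchNorm statistics are averaged as well in this simplified version."""
    if client_weights is None:
        client_weights = [1.0 / len(client_state_dicts)] * len(client_state_dicts)
    global_state = copy.deepcopy(client_state_dicts[0])
    for key in global_state:
        global_state[key] = sum(
            w * sd[key].float() for w, sd in zip(client_weights, client_state_dicts)
        )
    return global_state

# Example usage (names below are placeholders for a local training loop and data):
# client_models = [train_locally(copy.deepcopy(global_model), d) for d in client_datasets]
# new_global_state = federated_average([m.state_dict() for m in client_models])
# global_model.load_state_dict(new_global_state)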

6.2 Challenges and Limitations
Despite significant progress, synthetic speech detection faces persistent challenges [[4]]:

• Data Dependency: Modern deep learning methods, especially SSL pretraining, heavily
depend on the availability of large datasets (both labeled for fine-tuning and unlabeled
for pretraining). Performance can be limited in scenarios where such data is scarce,
imbalanced, or lacks representation of specific attack types or acoustic conditions [[4], [1]].
Data collection itself faces hurdles related to cost, diversity, and increasingly stringent
privacy regulations (like GDPR) [[4]].

• Computational Complexity: Training large SSL models and performing NAS can be
extremely computationally expensive, requiring significant hardware resources (GPUs/TPUs)
and time, potentially limiting accessibility for researchers with fewer resources [[1],
[4]].

• Generalization to Unseen Attacks: The rapid evolution of speech synthesis techniques
creates a continuous "arms race." Models trained on existing datasets may overfit to the
artifacts of known attacks and fail to generalize effectively to entirely novel synthesis
methods that produce different, perhaps more subtle, artifacts [[4], [5], [34]]. Evaluating
true generalization remains a difficult task [[33]].

• Interpretability: The "black box" nature of complex DNNs makes it difficult to under-
stand *why* a particular decision (bona fide vs. spoof) is made. This lack of transparency
hinders trust, debugging, validation, and admissibility in contexts requiring explainable
decisions, such as legal proceedings [[4]].

• Robustness to Real-World Conditions: While performance on benchmark datasets
is improving, ensuring robustness against the wide variety of real-world acoustic environ-
ments, transmission channels, codecs, and potential adversarial perturbations remains a
significant challenge [[4], [8]].

• Practical Deployment: Integrating detection systems seamlessly into real-world ap-
plications (like ASV systems, forensic analysis tools, or social media platforms) requires
addressing latency constraints, computational limits on target devices, and ensuring com-
patibility with existing pipelines [[4], [6]].

Addressing these challenges is crucial for the continued development and reliable deployment
of synthetic speech detection technologies.

7 ETHICAL CONSIDERATIONS AND SOCIETAL IMPACT

The development of synthetic speech detection technologies is intrinsically linked to significant
ethical considerations and potential societal impacts. The very need for such detectors arises
from the potential misuse of increasingly realistic speech synthesis and voice conversion [[5], [4]].

7.1 Malicious Uses of Synthetic Speech


Sophisticated synthetic speech enables a range of harmful activities, motivating the development
of countermeasures:

• Impersonation and Fraud: Synthetic voices can be used to impersonate individuals
for financial fraud (e.g., targeting voice banking systems), social engineering attacks (e.g.,
deceiving relatives or colleagues), or bypassing voice-based authentication systems [[4],
[5]].

• Disinformation and Propaganda: Fabricated audio clips attributed to public figures
(politicians, CEOs) can be used to spread fake news, manipulate public opinion, incite
violence, or damage reputations [[5], [4]].

• Harassment and Abuse: Synthetic voices could potentially be used in targeted ha-
rassment campaigns or to create non-consensual fabricated content involving individuals’
voices.

The potential for these harms underscores the societal necessity for reliable methods to distin-
guish authentic human speech from synthetic fabrications [[4]].

7.2 Trustworthy AI and Explainability


As detection systems become more integrated into security protocols or forensic investigations,
their trustworthiness becomes paramount.

• Bias and Fairness: Detection algorithms trained on biased datasets (e.g., lacking di-
versity in age, gender, accent, or language) may exhibit differential performance across
demographic groups, potentially leading to unfair outcomes. Ensuring fairness in detector
performance is an important ethical consideration.

• Interpretability Needs: The lack of interpretability in complex "black box" models
poses ethical challenges, especially if these systems are used in high-stakes decisions (e.g.,
legal evidence). The demand for Explainable AI (XAI) stems partly from the need to
understand, validate, and trust the outputs of these systems, ensuring they are not mak-
ing decisions based on spurious correlations or biases [[4]]. A ”right to explanation” is
emerging in some regulatory frameworks [[4]].

7.3 Data Privacy


The development of robust detectors, particularly those using supervised or self-supervised
learning, relies on access to speech data.

• Consent and Collection: Collecting large datasets of real human speech requires careful
attention to ethical guidelines and data privacy regulations, such as the GDPR in Europe
[[4]]. Obtaining informed consent and ensuring data anonymization where appropriate
are crucial steps. Using publicly available data scraped from the internet without consent
raises significant ethical and legal issues [[4]].

• Federated Learning Trade-offs: While FL offers a privacy-preserving alternative by
avoiding direct data sharing, it introduces its own challenges (like non-IID data) and
potential privacy risks (e.g., model inversion attacks) that need careful consideration [[4]].

Navigating these ethical considerations responsibly is essential for the sustainable and beneficial
development and deployment of synthetic speech detection technology.

8 CONCLUSION

This report has provided a comprehensive review and comparative analysis of synthetic speech
detection methodologies, tracing their evolution from traditional feature-based classifiers [[3],
[5]] to modern end-to-end deep learning paradigms, culminating in fully automated systems [[2],
[1]]. The analysis indicates a definitive trend towards these automated end-to-end approaches,
which demonstrably advance detection performance and generalization capabilities over earlier
techniques [[1], [2]].
Systems like TSSDNet highlight the effectiveness of learning discriminative features directly
from raw waveforms, bypassing traditional feature engineering [[2]]. Furthermore, fully au-
tomated methods that leverage large-scale self-supervised pretraining (e.g., wav2vec 2.0) for
robust feature extraction and employ Neural Architecture Search (e.g., light-DARTS) for op-
timized classifier design have achieved state-of-the-art results on challenging benchmarks like
ASVspoof 2019, while also exhibiting strong generalization to different datasets, languages, and
acoustic conditions [[1]]. This integrated approach minimizes reliance on manual tuning and
expert knowledge, offering a powerful strategy against sophisticated spoofing attacks [[1]].
Despite these significant advancements, substantial challenges remain critical areas for on-
going research. These include the high computational cost associated with training large SSL
models and performing NAS, the dependency on extensive and diverse datasets (both labeled
and unlabeled), the continuous "arms race" against rapidly evolving synthesis techniques requir-
ing constant adaptation, and the need for improved model interpretability (XAI) and robustness
to real-world conditions [[4], [1], [2]]. Addressing these limitations concerning data accessibility,
computational feasibility, adaptability, transparency, and practical deployability is essential for
developing robust, scalable, and trustworthy detection systems [[4]].
The evolution towards automated end-to-end systems represents a highly promising direction
for the future of synthetic speech detection. These methods offer potent tools to enhance the
security and reliability of voice-based applications, particularly vital ASV systems, against the
increasing threat of audio deepfakes [[1], [2]]. Future research should prioritize optimizing these
advanced methods for efficiency, enhancing their adaptability to unknown and future attacks,
ensuring fairness and ethical deployment, and improving their transparency to foster wider
adoption and trust in real-world scenarios [[4]].

A APPENDICES

A.1 Code Listings


Example: MFCC Feature Extraction
This snippet demonstrates the extraction of Mel-Frequency Cepstral Coefficients (MFCCs), a
common handcrafted feature used in traditional speech analysis and early spoofing detection
systems, using the Librosa library.
import librosa
import numpy as np

# Assume 'audio_signal' is a NumPy array holding the waveform
# Assume 'sampling_rate' is the sampling rate (e.g., 16000 Hz)
# Example: load audio (replace with your actual audio loading)
# audio_signal, sampling_rate = librosa.load('path/to/audio.wav', sr=16000)

# Parameters for MFCC extraction
n_mfcc = 13          # Number of MFCCs to return
n_fft = 512          # FFT window size (e.g., 32 ms for 16 kHz)
hop_length = 256     # Hop length (e.g., 16 ms for 16 kHz)
win_length = n_fft   # Window length, often the same as n_fft

# Extract MFCCs
# Returns a NumPy array of shape (n_mfcc, number_of_frames)
mfccs = librosa.feature.mfcc(
    y=audio_signal,
    sr=sampling_rate,
    n_mfcc=n_mfcc,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=win_length
)

# Optionally add delta and delta-delta features
# mfccs_delta = librosa.feature.delta(mfccs)
# mfccs_delta2 = librosa.feature.delta(mfccs, order=2)
# combined_features = np.vstack([mfccs, mfccs_delta, mfccs_delta2])

print(f"Extracted MFCCs with shape: {mfccs.shape}")

# Further processing (e.g., averaging, feeding to a classifier) would follow

Listing 1: Simplified MFCC Extraction using Librosa
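
As a follow-up, the sketch below shows how such MFCC features could feed a traditional
GMM-based countermeasure of the kind compared in [[3]]: two Gaussian mixture models, one for
bona fide and one for spoofed speech, score an utterance by an average per-frame
log-likelihood ratio. The scikit-learn calls are standard; the training matrices
bonafide_frames and spoof_frames are assumed to be MFCC frames stacked from labeled data.

from sklearn.mixture import GaussianMixture

# Assumed inputs (shape: (num_frames, n_mfcc)):
# bonafide_frames, spoof_frames : training frames pooled from labeled utterances
# mfccs.T                       : frames of the utterance to score (from Listing 1)

def train_gmm(frames, n_components=64):
    """Fit a diagonal-covariance GMM on MFCC frames."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                          max_iter=100, random_state=0)
    gmm.fit(frames)
    return gmm

def llr_score(test_frames, gmm_bonafide, gmm_spoof):
    """Average per-frame log-likelihood ratio; higher means more likely bona fide."""
    return (gmm_bonafide.score_samples(test_frames).mean()
            - gmm_spoof.score_samples(test_frames).mean())

# Example usage (with the assumed training data):
# gmm_bona = train_gmm(bonafide_frames)
# gmm_spf = train_gmm(spoof_frames)
# score = llr_score(mfccs.T, gmm_bona, gmm_spf)
# decision = 'bona fide' if score > 0 else 'spoof'  # zero threshold chosen for illustration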

Example: TSSDNet ResNet-Style Block
This snippet shows a conceptual PyTorch implementation of the ResNet-style convolutional block
used in the Res-TSSDNet architecture [[2]], illustrating the use of 1D convolutions and skip
connections for processing time-domain signals.
import torch
import torch.nn as nn

class ResBlockTSSD(nn.Module):
    """Conceptual ResNet-style block similar to Hua et al. 2021"""
    def __init__(self, channels):
        super().__init__()
        # First 1D Convolution -> BatchNorm -> ReLU
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU(inplace=True)
        # Second 1D Convolution -> BatchNorm
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm1d(channels)
        # Skip connection is the identity because input/output channels match

    def forward(self, x):
        # Input tensor x shape: (batch_size, channels, sequence_length)
        identity = x  # Store input for the skip connection

        # First layer
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        # Second layer
        out = self.conv2(out)
        out = self.bn2(out)

        # Add skip connection (identity)
        out += identity
        # Final ReLU activation
        out = self.relu(out)
        return out

# Example usage:
# input_tensor = torch.randn(32, 64, 1000)  # Batch of 32, 64 channels, length 1000
# res_block = ResBlockTSSD(channels=64)
# output_tensor = res_block(input_tensor)
# print(f"Output shape: {output_tensor.shape}")
Listing 2: Conceptual TSSDNet ResNet-Style Block in PyTorch
Example: Wav2Vec 2.0 Feature Extraction
This snippet demonstrates loading a pretrained Wav2Vec 2.0 model using the Hugging Face
Transformers library and extracting high-level feature representations from a raw audio wave-
form. This illustrates the self-supervised feature extraction step common in fully automated
detection systems [[1]].
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2Model

# Assume 'audio_path' points to an audio file
audio_path = 'path/to/your/audio.wav'  # Replace with the actual path

# 1. Load the audio waveform (resample to 16 kHz as required by most models)
audio_input, sampling_rate = librosa.load(audio_path, sr=16000)

# 2. Load the pretrained processor and model
# Example: using the base Wav2Vec 2.0 model
model_name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

# Ensure the model is in evaluation mode if not fine-tuning
model.eval()

# 3. Process the audio waveform for the model
# The processor handles normalization and formatting
input_values = processor(audio_input,
                         sampling_rate=sampling_rate,
                         return_tensors="pt").input_values  # "pt" for PyTorch tensors

# 4. Extract features (get hidden states from the model)
with torch.no_grad():  # Disable gradient calculations for inference
    outputs = model(input_values)

# Features are typically the 'last_hidden_state'
# Shape: (batch_size, sequence_length, hidden_size) -> (1, num_frames, 768 for the base model)
features = outputs.last_hidden_state

print(f"Extracted Wav2Vec 2.0 features with shape: {features.shape}")

# These features would then be fed into a downstream classifier
# (e.g., one found by NAS like light-DARTS)
Listing 3: Wav2Vec 2.0 Feature Extraction using Transformers
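
To show where such features would go next, the sketch below is a deliberately simple downstream
classifier: it mean-pools the frame-level Wav2Vec 2.0 representations and applies a single
linear layer. The SimplePoolingClassifier is a hypothetical stand-in for the NAS-discovered
classifiers described in [[1]], not a reproduction of them.

import torch
import torch.nn as nn

class SimplePoolingClassifier(nn.Module):
    """Illustrative downstream head: mean-pool frame features, then one linear layer."""
    def __init__(self, feature_dim=768, num_classes=2):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, frame_features):
        # frame_features: (batch_size, num_frames, feature_dim), e.g., Wav2Vec 2.0 output
        utterance_embedding = frame_features.mean(dim=1)  # (batch_size, feature_dim)
        return self.classifier(utterance_embedding)       # (batch_size, num_classes)

# Example usage with the 'features' tensor produced in Listing 3:
# head = SimplePoolingClassifier(feature_dim=features.shape[-1])
# logits = head(features)                  # shape: (1, 2)
# scores = torch.softmax(logits, dim=-1)   # bona fide vs. spoof probabilities

In a complete system this head would be trained (e.g., with cross-entropy on labeled ASVspoof
data) and could be replaced by an architecture discovered via light-DARTS, with the SSL front
end frozen or fine-tuned.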

References

[1] C. Wang, J. Yi, J. Tao, H. Sun, X. Chen, Z. Tian, H. Ma, C. Fan, and R. Fu, ”Fully Auto-
mated End-to-End Fake Audio Detection,” in Proc. 1st Int. Workshop Deepfake Detection
Audio Multimedia (DDAM ’22), Lisboa, Portugal, Oct. 2022, pp. 1-7.

[2] G. Hua, A. B. J. Teoh, and H. Zhang, ”Towards End-to-End Synthetic Speech Detection,”
IEEE Signal Process. Lett., vol. 28, pp. 1265-1269, 2021.

[3] C. Hanilçi, T. Kinnunen, M. Sahidullah, and A. Sizov, ”Classifiers for Synthetic Speech
Detection: A Comparison,” in Proc. INTERSPEECH, Dresden, Germany, Sep. 2015, pp.
2057-2061.

[4] L. Cuccovillo, C. Papastergiopoulos, A. Vafeiadis, A. Yaroshchuk, P. Aichroth, K. Vo-
tis, and D. Tzovaras, ”Open Challenges in Synthetic Speech Detection,” arXiv preprint
arXiv:2209.07180v3, Jan. 2023.

[5] C. Borrelli, P. Bestagini, F. Antonacci, A. Sarti, and S. Tubaro, ”Synthetic speech detection
through short-term and long-term prediction traces,” EURASIP J. Inf. Secur., vol. 2021,
no. 1, pp. 1-14, Apr. 2021.

[6] M. Todisco et al., ”ASVspoof 2019: Future horizons in spoofed and fake audio detection,”
in Proc. INTERSPEECH, Graz, Austria, Sep. 2019, pp. 1008-1012.

[7] Z. Wu et al., ”ASVspoof 2015: the first automatic speaker verification spoofing and coun-
termeasures challenge,” in Proc. INTERSPEECH, Dresden, Germany, Sep. 2015, pp. 2037-
2041.

[8] J. Yamagishi et al., ”ASVspoof 2021: accelerating progress in spoofed and deepfake speech
detection,” arXiv preprint arXiv:2109.00537, Sep. 2021.

[9] J. Yi et al., ”ADD 2022: The first audio deep synthesis detection challenge,” arXiv preprint
arXiv:2202.08433, Feb. 2022.

[10] X. Wang et al., ”ASVspoof 2019: a large-scale public database of synthesized, converted
and replayed speech,” Comput. Speech Lang., vol. 64, p. 101114, Nov. 2020.

[11] J. Yamagishi et al., ”CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice
Cloning Toolkit (Version 0.92),” University of Edinburgh, The Centre for Speech Technol-
ogy Research (CSTR), 2019.

[12] M. Sahidullah, T. Kinnunen, and C. Hanilçi, ”A comparison of features for synthetic speech
detection,” in Proc. INTERSPEECH, Dresden, Germany, Sep. 2015, pp. 2087-2091.

[13] M. Todisco, H. Delgado, and N. Evans, ”Constant Q cepstral coefficients: A spoofing
countermeasure for automatic speaker verification,” Comput. Speech Lang., vol. 45, pp.
516-535, Sep. 2017.

[14] A. Janicki, ”Spoofing countermeasure based on analysis of linear prediction error,” in Proc.
INTERSPEECH, Dresden, Germany, Sep. 2015.

[15] D. A. Reynolds and R. C. Rose, ”Robust text-independent speaker identification using
Gaussian mixture speaker models,” IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp.
72-83, Jan. 1995.

[16] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-
Verlag, 1995.

[17] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, ”Support vector machines using
GMM supervectors for speaker verification,” IEEE Signal Process. Lett., vol. 13, no. 5, pp.
308-311, May 2006.
[18] W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, and P. A. Torres-Carrasquillo,
”Support vector machines for speaker and language recognition,” Comput. Speech Lang.,
vol. 20, no. 2-3, pp. 210-229, Apr. 2006.
[19] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, ”Front-end factor analysis
for speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 4, pp.
788-798, May 2011.
[20] H. Muckenhirn, M. Magimai-Doss, and S. Marcel, ”End-to-end convolutional neural
network-based voice presentation attack detection,” in Proc. IEEE Int. Joint Conf. Biomet.
(IJCB), Denver, CO, USA, Oct. 2017, pp. 335-341.
[21] K. He, X. Zhang, S. Ren, and J. Sun, ”Deep residual learning for image recognition,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun.
2016, pp. 770-778.
[22] C. Szegedy et al., ”Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 1-9.
[23] S. Schneider, A. Baevski, R. Collobert, and M. Auli, ”wav2vec: Unsupervised Pre-Training
for Speech Recognition,” in Proc. INTERSPEECH, Graz, Austria, Sep. 2019, pp. 3465-
3469.
[24] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, ”wav2vec 2.0: A framework for
self-supervised learning of speech representations,” in Adv. Neural Inf. Process. Syst.
(NeurIPS), vol. 33, 2020, pp. 12449-12460.
[25] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, ”Librispeech: an ASR corpus based
on public domain audio books,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process.
(ICASSP), South Brisbane, QLD, Australia, Apr. 2015, pp. 5206-5210.
[26] Y. Xie, Z. Zhang, and Y. Yang, ”Siamese network with wav2vec feature for spoofing speech
detection,” in Proc. INTERSPEECH, Brno, Czechia, Aug./Sep. 2021, pp. 4269-4273.
[27] X. Wang and J. Yamagishi, ”Investigating self-supervised front ends for speech spoofing
countermeasures,” arXiv preprint arXiv:2111.07725, Nov. 2021.
[28] Z. Lv, S. Zhang, K. Tang, and P. Hu, ”Fake audio detection based on unsupervised pre-
training models,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP),
Singapore, May 2022, pp. 9231-9235.
[29] H. Liu, K. Simonyan, and Y. Yang, ”DARTS: Differentiable Architecture Search,” in Proc.
Int. Conf. Learn. Represent. (ICLR), New Orleans, LA, USA, May 2019.
[30] X. Wu, R. He, Z. Sun, and T. Tan, ”A light cnn for deep face representation with noisy
labels,” IEEE Trans. Inf. Forensics Secur., vol. 13, no. 11, pp. 2884-2896, Nov. 2018.
[31] W. Ge, M. Panariello, J. Patino, M. Todisco, and N. Evans, ”Partially-connected
differentiable architecture search for deepfake and spoofing detection,” arXiv preprint
arXiv:2104.03123, Apr. 2021.
[32] N. Subramani and D. Rao, ”Learning efficient representations for fake speech detection,”
in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 4, New York, NY, USA, Feb. 2020, pp.
5859-5866.

[33] N. M. Müller et al., ”Does audio deepfake detection generalize?” in Proc. INTERSPEECH,
Incheon, Korea, Sep. 2022, pp. 2783-2787.

[34] T. Chen et al., ”Generalization of audio deepfake detection,” in Proc. Odyssey 2020 The
Speaker and Language Recognition Workshop, Tokyo, Japan, Nov. 2020, pp. 132-137.

[35] Z. Zhang, X. Yi, and X. Zhao, ”Fake speech detection using residual network with trans-
former encoder,” in Proc. ACM Workshop Inf. Hiding Multimedia Secur. (IH&MMSEC),
Virtual Event, Jun. 2021, pp. 13-22.

[36] D. S. Park et al., ”SpecAugment: A Simple Data Augmentation Method for Automatic
Speech Recognition,” in Proc. INTERSPEECH, Graz, Austria, Sep. 2019, pp. 2613-2617.

[37] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, ”mixup: Beyond Empirical Risk
Minimization,” in Proc. Int. Conf. Learn. Represent. (ICLR), Vancouver, BC, Canada,
Apr./May 2018.

[38] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed,
”HuBERT: Self-supervised speech representation learning by masked prediction of hidden
units,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3451-3460, 2021.

[39] H. Ma, J. Yi, J. Tao, Y. Bai, Z. Tian, and C. Wang, ”Continual Learning for Fake Audio
Detection,” in Proc. INTERSPEECH, Brno, Czechia, Aug./Sep. 2021, pp. 886-890.

[40] R. R. Selvaraju et al., ”Grad-CAM: Visual explanations from deep networks via gradient-
based localization,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct.
2017, pp. 618-626.

[41] J. Yi, Y. Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, and R. Fu, ”Half-Truth:
A Partially Fake Audio Detection Dataset,” in Proc. INTERSPEECH, Brno, Czechia,
Aug./Sep. 2021, pp. 1654-1658.

[42] A. van den Oord et al., ”WaveNet: A generative model for raw audio,” arXiv preprint
arXiv:1609.03499, Sep. 2016.

[43] J. Shen et al., ”Natural TTS synthesis by conditioning wavenet on mel spectrogram pre-
dictions,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Calgary,
AB, Canada, Apr. 2018, pp. 4779-4783.

[44] N. Kalchbrenner et al., ”Efficient neural audio synthesis,” in Proc. Int. Conf. Mach. Learn.
(ICML), Stockholm, Sweden, Jul. 2018.

[45] M. Schröder, M. Charfuelan, S. Pammi, and I. Steiner, ”Open source voice creation toolkit
for the MARY TTS platform,” in Proc. INTERSPEECH, Florence, Italy, Aug. 2011.

[46] M. Morise, F. Yokomori, and K. Ozawa, ”WORLD: a vocoder-based high-quality speech
synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. E99.D, no. 7,
pp. 1877-1884, Jul. 2016.

[47] K. Tokuda, H. Zen, and A. W. Black, ”An HMM-based speech synthesis system applied
to English,” in Proc. IEEE Workshop Speech Synth., Santa Monica, CA, USA, Sep. 2002.

[48] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, ”Voice conversion from non-
parallel corpora using variational auto-encoder,” in Proc. Asia-Pacific Signal Inf. Process.
Assoc. Annu. Summit Conf. (APSIPA ASC), Jeju, South Korea, Dec. 2016.

