Seminar Report
On
Bachelor of Technology
(Electronics and Communication)
by
Parthiv Jasoliya
(U22EC019)
(B. TECH. III(EC), 5th Semester)
Guided by
Dr. P. K. Shah
Assistant Professor, DECE
MAY - 2025
Sardar Vallabhbhai National Institute of Technology
Surat 395 007, Gujarat, India
ACKNOWLEDGEMENTS
I take this opportunity to express my deepest sense of gratitude and sincere thanks to everyone
who helped me to complete this work successfully. I express my sincere thanks to Dr. Jignesh
Sarvaiya, Head of Department, Department of Electronics Engineering, Sardar Vallabhbhai
National Institute of Technology, Surat for providing me with all the necessary facilities and
support. I would like to place on record my sincere gratitude to my seminar guide Dr. P.
K. Shah, Assistant Professor, Sardar Vallabhbhai National Institute of Technology for the
guidance and mentorship throughout the course. I would also like to express my gratitude to
all my colleagues for their advice, help, and support.
Parthiv Jasoliya
Sardar Vallabhbhai National Institute of Technology,
Surat
May 13, 2025
Abstract
The rapid advancement of speech synthesis and voice conversion technologies, particularly
driven by deep learning, enables the creation of synthetic audio increasingly indistinguish-
able from genuine human speech. While beneficial in many domains, this progress poses
significant security threats, undermining the reliability of Automatic Speaker Verification
(ASV) systems and facilitating malicious activities such as disinformation, fraud, and im-
personation. Consequently, the development of robust and generalizable synthetic speech
detection countermeasures has become a critical area of research. This report presents a
comprehensive review and comparative analysis charting the evolution of these detection
methodologies. It begins with traditional approaches, characterized by handcrafted acoustic
features (e.g., MFCC, CQCC, LFCC, prediction-based features) coupled with classical ma-
chine learning classifiers (e.g., GMM, SVM). The report then examines the paradigm shift
towards end-to-end deep learning architectures, such as Time-domain Synthetic Speech De-
tection Networks (TSSDNet), which learn discriminative features directly from raw audio
waveforms. Finally, it delves into the current state-of-the-art: fully automated end-to-
end systems that leverage powerful self-supervised pretrained models (e.g., wav2vec 2.0)
for feature extraction and employ Neural Architecture Search (NAS) techniques (e.g., light-
DARTS) to optimize the detector’s structure, minimizing manual intervention. The analysis
evaluates these distinct paradigms based on key performance metrics (primarily Equal Error
Rate - EER and minimum Tandem Decision Cost Function - min t-DCF), generalization
capabilities across standard datasets (notably ASVspoof 2015 and 2019, with reference to
recent challenges), robustness, and computational efficiency. The comparison highlights a
clear trend towards the superior performance and adaptability of automated end-to-end
systems. Furthermore, the report discusses the inherent advantages and limitations of each
approach and concludes by delineating persistent challenges (including data dependency,
generalization to novel attacks, and interpretability) and outlining promising directions for
future research within this vital domain.
Contents
1 Introduction
1.1 Background and Motivation
1.2 Objectives
2 Overview of Speech Synthesis Techniques
3 Literature Review
3.1 Traditional Approaches
3.1.1 Handcrafted Feature Extraction
3.1.2 Classical Classifiers
3.2 End-to-End Deep Learning Approaches
3.2.1 Direct Waveform Processing
3.2.2 Network Architectures
3.3 Fully Automated End-to-End Approaches
3.3.1 Self-Supervised Pretrained Feature Extraction
3.3.2 Automated Network Architecture Search
3.3.3 Advantages of the Fully Automated Approach
4 Methodology
4.1 Comparison Criteria
4.2 Experimental Setup Considerations
4.2.1 Benchmark Datasets
4.2.2 Evaluation Metrics
4.2.3 Feature Extraction and Model Inputs
4.2.4 Training and Evaluation Protocols
5 Comparative Analysis
5.1 Performance Comparison
5.2 Generalization Capabilities
5.3 Advantages and Limitations
5.4 Impact of Data Augmentation
5.5 Computational Cost and Efficiency
6 Discussion
6.1 Implications for Future Research
6.2 Challenges and Limitations
7 Ethical Considerations and Societal Impact
8 Conclusion
A Appendices
A.1 Code Listings
References
List of Figures
List of Tables
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.2 Objectives
The primary objectives of this report are delineated as follows:
• To analyze the performance of these distinct methodologies based on key evaluation met-
rics, including detection accuracy (primarily Equal Error Rate (EER) [[3]]) and impact
on ASV system reliability (e.g., minimum Tandem Decision Cost Function (min t-DCF)
[[6]]). Computational efficiency (e.g., model parameters [[2]], inference time [[32]]) and
generalization capabilities across different datasets (particularly the ASVspoof 2015 [[7]]
and 2019 [[6]] challenges) and unseen attacks will also be considered [[2], [3], [1]].
• To deliberate upon the inherent challenges within synthetic speech detection, such as the
difficulty in generalizing to novel and rapidly evolving spoofing attacks [[4]], the necessity
for robustness against varying acoustic conditions (noise, reverberation) and transmission
channel effects (codecs, compression) [[1], [8]], the computational demands of advanced
deep learning models, and data limitations (availability, diversity, legal compliance) [[4]].
• To highlight potential avenues for future research directed towards augmenting detection
performance, robustness, and generalization [[4]]. This includes exploring innovative net-
work architectures, leveraging more advanced SSL techniques, improving NAS efficiency,
developing adaptive learning strategies (continual, federated learning), enhancing model
interpretability (XAI), and creating more representative datasets [[4], [1]].
2 OVERVIEW OF SPEECH SYNTHESIS TECHNIQUES
Understanding the methods used to generate synthetic speech is crucial for developing effective
detection countermeasures. Spoofing attacks typically employ either Text-To-Speech (TTS)
synthesis, which generates speech directly from text input, or Voice Conversion (VC), which
modifies a source speaker’s voice to sound like a target speaker [[4], [5]]. Both areas have seen
significant evolution, driven largely by advances in machine learning.
• Parametric Synthesis: These methods first predict acoustic parameters (e.g., fundamental
frequency F0, spectral envelope, excitation signals) from the input text and then
synthesize the speech waveform using a vocoder based on these parameters [[5]]. HMM-
based Speech Synthesis (HTS) systems were prominent examples [[47], [5]]. While more
flexible than concatenative methods, traditional parametric approaches often produced
less natural-sounding ("buzzy") speech [[5]]. Vocoders like WORLD [[46]] and STRAIGHT
[[5]] are frequently used in these pipelines (a minimal analysis/resynthesis sketch with the
WORLD vocoder follows this list). ASVspoof 2019 includes systems using WORLD
(A02, A05, A07) and STRAIGHT (A14, A15) [[10]].
• Neural Network (NN) Based Synthesis: Modern TTS systems predominantly rely
on deep neural networks. These often follow a two-stage approach: an acoustic model
first predicts intermediate acoustic representations (e.g., mel-spectrograms) from the input
text, and a neural vocoder (e.g., WaveNet or WaveRNN) then generates the waveform.
Many ASVspoof 2019 attacks utilize neural components, including WaveNet (A01, A12,
A15), WaveRNN (A10), and other NN-based acoustic models or vocoders (A03, A08,
A09, A11, A13) [[10]]. These methods produce highly realistic speech, posing a significant
challenge for detection [[5], [4]].
• Statistical Methods: Often using GMMs to model the mapping between source and
target acoustic features [[3]].
• Neural Network Based Methods: Employing architectures like Variational Autoen-
coders (VAEs) [[48]], Generative Adversarial Networks (GANs), or sequence-to-sequence
models to learn the transformation [[4], [5]]. These often rely on vocoders (like WORLD
or WaveNet) for waveform generation.
• Direct Waveform Modification: Some techniques attempt to modify the source wave-
form directly to match the target speaker’s characteristics [[5]].
Several VC systems are included in the ASVspoof challenges, such as transfer-function based
methods (A02, A03, A06, A19), VAE-based systems (A05, A17), and others (A14, A15, A18)
[[10], [5]].
The wide variety of these synthesis techniques, ranging from older methods leaving distinct
artifacts to highly sophisticated neural approaches generating near-indistinguishable speech,
underscores the complexity of the synthetic speech detection task [[5], [4]]. Detection systems
must be robust and generalizable to handle this diverse and evolving landscape of threats.
3 LITERATURE REVIEW
The detection of synthetic speech has evolved significantly, moving from methods relying on ex-
pert knowledge and handcrafted features towards more data-driven and automated deep learning
paradigms. This section reviews these major approaches.
• Phase and Excitation Features: Features derived from phase information (e.g., rela-
tive phase shift - RPS), group delay (e.g., modified group delay - MGD), or analysis of
the excitation signal were explored, as synthesis methods might introduce phase inconsis-
tencies [[12], [5]].
The efficacy of these features often depended on identifying specific artifacts introduced by
particular synthesis techniques, requiring careful selection and tuning (e.g., window sizes, filter
configurations, prediction orders) [[5], [1]]. This reliance on specific feature engineering poten-
tially hindered generalization across diverse datasets and evolving synthesis methods [[1], [12],
[5]].
3.1.2 Classical Classifiers
After feature extraction, traditional systems employed classical statistical models and
machine learning classifiers for discrimination [[3]]. Hanilçi et al. (2015) provided a comparative
study using standard MFCC features on ASVspoof 2015, evaluating a range of classical
back-ends, including GMM-based and SVM-based classifiers [[3]].
These comparative studies highlighted the sensitivity of traditional methods to feature and clas-
sifier choices and the significant challenge of generalizing across different (especially unknown)
synthesis techniques [[3]]. Their adaptability remained fundamentally limited by the reliance
on fixed, manually engineered features [[4]].
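To illustrate the traditional pipeline end to end, the sketch below trains two diagonal-covariance
GMMs, one on bona fide frames and one on spoofed frames, and scores an utterance by its average
log-likelihood ratio, in the spirit of the GMM baselines compared in [[3]]. The random features,
component count, and function name are placeholders for illustration only.

import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder frame-level features; in practice these would be MFCC/CQCC/LFCC
# frames pooled over the bona fide and spoofed training utterances.
rng = np.random.default_rng(0)
bonafide_feats = rng.normal(0.0, 1.0, size=(5000, 60))
spoof_feats = rng.normal(0.3, 1.0, size=(5000, 60))

gmm_bonafide = GaussianMixture(n_components=32, covariance_type="diag",
                               max_iter=50, random_state=0).fit(bonafide_feats)
gmm_spoof = GaussianMixture(n_components=32, covariance_type="diag",
                            max_iter=50, random_state=0).fit(spoof_feats)

def llr_score(utterance_feats):
    """Average per-frame log-likelihood ratio; higher values favour bona fide speech."""
    return gmm_bonafide.score(utterance_feats) - gmm_spoof.score(utterance_feats)

test_utterance = rng.normal(0.0, 1.0, size=(300, 60))
print(f"LLR score: {llr_score(test_utterance):.3f}")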
3.2.2 Network Architectures
Network architecture design is critical for the success of end-to-end approaches [[2]]. Hua et al.
(2021) introduced two relatively lightweight TSSDNet variants, hypothesizing that shallower
networks might be more suitable than very deep ones for capturing subtle, low-level forgery
artifacts rather than high-level semantics [[2]]:
• Inc-TSSDNet: Employs Inception-style blocks featuring parallel convolutional pathways
with different kernel sizes and, notably, dilated convolutions [[2]]. Inspired by Inception
networks [[22]], this allows simultaneous multi-scale feature processing. Dilated convolu-
tions increase the receptive field without increasing parameters or losing resolution [[2]].
The parallel structure enables concurrent analysis at multiple temporal scales or abstrac-
tion levels [[2]].
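As a conceptual illustration of parallel convolutional pathways with dilated kernels (not the exact
Inc-TSSDNet block of [[2]]), the sketch below builds a 1D block whose branches use increasing
dilation rates and concatenates their outputs along the channel axis; the channel counts and
dilation rates are arbitrary choices for illustration.

import torch
import torch.nn as nn

class InceptionStyle1DBlock(nn.Module):
    """Parallel 1D convolutions with different dilation rates, concatenated channel-wise."""
    def __init__(self, in_channels, branch_channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_channels, branch_channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm1d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        # Each branch sees a different receptive field; outputs are concatenated
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: x = torch.randn(8, 16, 96000); InceptionStyle1DBlock(16, 8)(x).shape -> (8, 32, 96000)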
Common components in these architectures include an initial 1D convolutional layer for wave-
form feature extraction (e.g., using a 1x7 kernel), pooling layers (max pooling often found effec-
tive), batch normalization, activation functions (like ReLU), and final fully connected layers for
classification [[2]]. End-to-end training allows joint optimization of all network parameters, often
yielding improved generalization, as evidenced by TSSDNet’s competitive EERs on benchmarks
like ASVspoof 2019 [[2]]. Their ability to learn task-specific representations directly from data
potentially makes them more adaptable to evolving spoofing techniques than methods reliant
on fixed, predefined features [[2], [4]].
• wav2vec 2.0 Architecture: Typically consists of a multi-layer convolutional feature en-
coder that processes the raw waveform, followed by a Transformer-based context network
that builds contextualized representations [[24], [1]]. A quantization module discretizes
the encoder output for use in the self-supervised objective [[24]].
Figure 3: Conceptual Diagram of the Wav2Vec 2.0 Feature Extractor [[1], [24]].
• Search Process: NAS typically searches over a predefined space of candidate operations
(e.g., separable convolutions of different kernel sizes, dilated convolutions, pooling opera-
tions, skip connections, zero operation) within basic building blocks called cells (normal
and reduction cells) [[1], [29]]. Light-DARTS uses techniques like softmax weighting over
operations and bilevel optimization (optimizing weights on training data, architecture
parameters on validation data) to find a promising final discrete architecture [[1]] (a minimal
mixed-operation sketch is given below).
• Enhancements: Wang et al. incorporated a Max Feature Map (MFM) module [[30]] into
the Light-DARTS search space. MFM acts as a feature selection mechanism by taking the
element-wise maximum across channels, potentially enhancing discrimination [[1]]. Ge et
al. also explored Partially-Connected DARTS for spoofing detection [[31]].
This automated search minimizes manual design effort, reduces reliance on expert intuition,
and can discover novel, task-specific architectures optimized for the extracted SSL features,
potentially improving performance and efficiency over hand-designed networks [[1], [31]].
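To make the softmax weighting over candidate operations concrete, the following is a conceptual
PyTorch sketch of a DARTS-style mixed operation with a simplified candidate set; it is not the
light-DARTS search space itself. In the bilevel scheme, the alpha parameters would be updated
on validation data while the ordinary weights are updated on training data.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softmax-weighted sum of candidate operations on an edge of a NAS cell."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                                            # skip connection
            nn.MaxPool1d(kernel_size=3, stride=1, padding=1),         # pooling
            nn.Conv1d(channels, channels, 3, padding=1, bias=False),  # convolution
            nn.Conv1d(channels, channels, 3, padding=2, dilation=2,
                      bias=False),                                    # dilated convolution
        ])
        # One architecture parameter (alpha) per candidate operation
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))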
• Reduced Manual Intervention: Fully automates both feature extraction and architec-
ture design, minimizing time-consuming manual tuning and reliance on domain expertise
[[1]].
4 METHODOLOGY
This section outlines the common methodologies employed in the literature for comparing differ-
ent synthetic speech detection systems, focusing on the criteria used for evaluation and typical
experimental setups.
• ASVspoof 2019: A major benchmark focusing on both Logical Access (LA) and Physical
Access (PA - replay attacks). The LA dataset, used in this report’s context, features sig-
nificantly more challenging and diverse attacks based on modern TTS and VC techniques
[[6], [10]]. It uses the VCTK corpus [[11]] as the basis for bona fide speech and includes
6 known attacks (A01-A06) in the training/development sets and 13 attacks (A07-A19,
where A16=A04, A19=A06) in the evaluation set, ensuring most evaluation attacks are
unseen during training [[10], [5]].
• ASVspoof 2021: Introduced a specific Speech DeepFake (DF) track focusing purely on
synthetic speech detection, separate from ASV integration [[8]]. Its evaluation set notably
includes audio subjected to various codecs and compression, testing robustness to channel
effects [[1], [8]]. Training and development data largely overlap with ASVspoof 2019 LA
[[1]].
• ADD 2022: The first Audio Deep Synthesis Detection challenge, featuring tracks like
Low-Quality Fake audio detection (LF) [[9]]. The LF track uses Mandarin speech (AISHELL-
3 corpus) and includes evaluation data with environmental noise and background music,
testing robustness in non-clean conditions [[1], [9]].
These datasets, with their defined splits and varied attack types/conditions, provide a structured
framework for evaluating and comparing detection systems [[4]].
• EER: Calculated from the detection scores on a test set, it is the error rate at the threshold
where the false acceptance rate (FAR) equals the false rejection rate (FRR). It provides a
single-number summary of the system’s discrimination capability, independent of
application-specific costs or thresholds [[3], [1]] (a short computation sketch is given after
this list).
• min t-DCF: Defined in the ASVspoof challenges, it evaluates the detector’s performance
in the context of an ASV system. It calculates a weighted cost based on misses (spoofed
accepted by ASV due to detector failure) and false alarms (genuine rejected by ASV due
to detector error), minimized over the decision threshold [[6], [4]]. This metric better
reflects the practical utility of the countermeasure in protecting speaker verification [[6]].
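As a short computation sketch, the EER can be obtained from detector scores as follows, assuming
higher scores indicate bona fide speech; the Gaussian scores at the end are synthetic placeholders.

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(bonafide_scores, spoof_scores):
    """EER (%) given detector scores, where higher means more likely bona fide."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)), np.zeros(len(spoof_scores))])
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FAR ~= FRR
    return 100 * (fpr[idx] + fnr[idx]) / 2

rng = np.random.default_rng(0)
print(compute_eer(rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000)))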
While the Mean Opinion Score (MOS) is used to evaluate the quality or naturalness of
synthetic speech itself [[4]], it is not a direct metric for evaluating the performance of detection
systems.
• Traditional Methods: Rely on extracting pre-defined handcrafted features. Common
parameters include frame length (e.g., 20-30 ms), hop length (e.g., 10 ms), and number of
coefficients (e.g., 19 MFCCs + energy, plus deltas and double-deltas yielding 60 dimensions;
see the sketch after this list) or CQT parameters [[3], [6]]. Voice Activity Detection (VAD)
is often applied to remove silence [[3]].
• End-to-End Approaches: Typically use the raw audio waveform directly, often seg-
mented into fixed lengths (e.g., 6 s at 16 kHz sampling rate used in [[2]]). Alternatively,
basic spectral representations like STFT spectrograms might be used as input to 2D CNNs
[[5]].
• Fully Automated Systems: Also start with the raw waveform but feed it into a pre-
trained SSL model (like wav2vec 2.0) to extract high-dimensional features (e.g., 768-dim
for wav2vec 2.0 base, 1024-dim for large) [[1]]. These features, often processed into fixed-
length sequences (e.g., 400 time frames), then serve as input to the NAS-optimized clas-
sifier network [[1]].
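The sketch below (referenced in the first bullet above) illustrates a typical traditional front-end
of roughly 25 ms frames with a 10 ms hop: 20 static coefficients (the 0th acting as an energy
proxy) plus deltas and double-deltas, giving 60 dimensions per frame. The file path and exact
parameter values are assumptions for illustration.

import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=16000)          # placeholder path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                            n_fft=512, hop_length=160, win_length=400)
delta = librosa.feature.delta(mfcc)                    # first-order dynamics
delta2 = librosa.feature.delta(mfcc, order=2)          # second-order dynamics
features = np.vstack([mfcc, delta, delta2])            # shape: (60, num_frames)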
• Training Strategy: DNNs are commonly trained using stochastic gradient descent variants
like Adam [[2], [1]] with learning rate schedules (e.g., exponential decay); a minimal
configuration sketch is given at the end of this subsection. Techniques to improve robustness
and prevent overfitting, such as data augmentation and mixup regularization (see
Section 5.4), are crucial.
NAS procedures involve specific optimization schedules, often alternating between training
network weights and architecture parameters [[1], [29]].
• Evaluation: Models are typically trained on the designated training set, with hyper-
parameters tuned based on performance (usually EER) on the development set. Final
performance is reported on the unseen evaluation set using metrics like EER and min
t-DCF [[1], [2], [3]]. Cross-dataset testing protocols are used to assess generalization ex-
plicitly [[2], [1]].
Adherence to these common practices allows for more meaningful comparisons between different
proposed detection systems in the literature.
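As a minimal, self-contained illustration of the training configuration described above (Adam
with an exponentially decaying learning rate), consider the sketch below; the tiny model, random
batches, and hyperparameter values are placeholders rather than any published recipe.

import torch
import torch.nn as nn

# Placeholder detector: a single 1D convolution followed by pooling and a linear head
model = nn.Sequential(nn.Conv1d(1, 8, 7, padding=3), nn.AdaptiveAvgPool1d(1),
                      nn.Flatten(), nn.Linear(8, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):
    x = torch.randn(4, 1, 16000)     # batch of 1 s waveforms at 16 kHz (random placeholder)
    y = torch.randint(0, 2, (4,))    # 0 = spoof, 1 = bona fide (arbitrary labels)
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                 # decay the learning rate once per epoch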
5 COMPARATIVE ANALYSIS
This section synthesizes findings from the literature to compare the traditional, end-to-end, and
fully automated approaches based on the criteria outlined in Section 4.
Table 3: Ablation Study: EER (%) on ASVspoof 2019 LA Eval for Different Feature/Network
Combinations (Adapted from Wang et al. [[1]])
Figure 5: Example Detection Error Trade-off (DET) Curves.
• End-to-End DNNs: Show improved potential for generalization due to learned repre-
sentations. However, naive end-to-end models can still overfit the training data distri-
bution. Hua et al. demonstrated that while their baseline TSSDNet performed poorly
in cross-dataset testing (ASVspoof 2019 train → 2015 test, EER > 39 %), incorporating
mixup regularization dramatically improved generalization, achieving EERs below 2 % on
ASVspoof 2015 eval [[2]]. This suggests the learned features have inherent generalization
capacity that can be unlocked with proper regularization.
• Fully Automated Methods: Exhibit the strongest generalization reported so far. This
is largely attributed to the robust, general representations learned by SSL models (like
wav2vec 2.0) during pretraining on massive, diverse unlabeled audio data [[1]]. NAS
further helps by finding architectures suitable for these general features. Wang et al.
demonstrated their system’s robustness not only on ASVspoof 2019 LA (1.08 % EER) but
also on the ASVspoof 2021 DF evaluation set (7.86 % EER, including codec effects) and
the ADD 2022 Track 1 evaluation set (20.11 % EER, Mandarin speech with noise/music),
outperforming the respective challenge winners in the latter two cases [[1]]. This indicates
remarkable adaptability across different datasets, languages, and acoustic conditions.
Table 4: Performance (EER %) of Wav2Vec 2.0 + Light-DARTS Across Different Datasets
(Data from Wang et al. [[1]])
• Traditional Methods:
– Advantages: Relatively interpretable features based on acoustic principles; generally
lower computational requirements; can perform well on known attacks if features are
well-chosen [[3], [5]].
– Limitations: Heavy reliance on expert knowledge and manual feature engineering/tuning;
poor adaptability and generalization to novel/unseen attacks; often suboptimal
performance compared to modern methods [[4], [3]].
• End-to-End Deep Learning Approaches:
– Advantages: Eliminate manual feature engineering; learn robust, task-specific repre-
sentations directly from data (potentially preserving more information); significantly
improved performance on modern datasets; better generalization potential than tra-
ditional methods [[2], [4]].
– Limitations: Can require large labeled datasets for optimal training; sensitive to ar-
chitecture design, hyperparameters, and regularization (like mixup); generally higher
computational cost than traditional methods; interpretability ("black box" nature)
can be challenging [[4], [2]].
• Fully Automated End-to-End Approaches:
– Advantages: Minimize manual intervention in both feature and architecture design;
leverage large unlabeled datasets via SSL for very robust and general representations;
achieve state-of-the-art performance and excellent generalization across diverse con-
ditions [[1]].
– Limitations: Highest computational complexity, especially for SSL pretraining and
the NAS search phase; heavily dependent on the availability and quality of large
pretraining datasets; resulting architectures can be complex and difficult to interpret;
potential vulnerability if pretraining data contains biases or artifacts [[1], [4]].
• Effectiveness: Augmentation helps models learn features invariant to common distor-
tions and reduces overfitting to the training set characteristics [[4]]. It demonstrably
improves performance, especially generalization to unseen conditions or datasets [[35]].
• Mixup Regularization: Techniques like mixup [[37]], which creates virtual training
samples by linearly interpolating pairs of samples and their labels, have proven particu-
larly effective for regularization and enhancing cross-dataset generalization in end-to-end
spoofing detection, as shown by Hua et al. [[2]].
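A minimal sketch of mixup for a two-class (bona fide vs. spoof) detector, following the
interpolation scheme of [[37]], is given below; the mixing coefficient is drawn from a Beta
distribution and the loss becomes the corresponding interpolation of two cross-entropy terms.

import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(waveforms, labels, alpha=0.5):
    """Create virtual examples by interpolating a batch with a shuffled copy of itself."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(waveforms.size(0))
    mixed_x = lam * waveforms + (1 - lam) * waveforms[perm]
    return mixed_x, labels, labels[perm], lam

# Inside a training step (model outputs logits over {spoof, bona fide}):
# mixed_x, y_a, y_b, lam = mixup_batch(x, y)
# logits = model(mixed_x)
# loss = lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)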
• Traditional Methods: Generally the least computationally intensive, making them suit-
able for resource-constrained environments, although feature extraction can still require
processing [[3]].
• End-to-End DNNs: Require more resources, particularly for training which often ne-
cessitates GPUs [[2]]. However, models can be designed to be relatively lightweight. For
example, Hua et al.’s Inc-TSSDNet had only 0.09M parameters and Res-TSSDNet had
0.35M parameters, significantly fewer than many large contemporary DNNs used in other
speech tasks [[2]]. Inference time is generally manageable for optimized models.
There is often a trade-off between peak accuracy and computational efficiency [[4], [2]]. Tech-
niques like model quantization or pruning may be necessary to deploy the most accurate but
complex models on edge devices or in real-time applications [[4], [32]].
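As a brief illustration of such post-training compression, the sketch below applies dynamic
quantization and unstructured magnitude pruning to a toy classifier using standard PyTorch
utilities; the model is a placeholder, not an actual countermeasure.

import torch
import torch.nn as nn
from torch.nn.utils import prune

# Placeholder classifier head standing in for a trained detector
model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 2))

# Dynamic quantization: Linear layers use int8 weights at inference time
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Unstructured pruning: zero out 50 % of the first layer's weights by magnitude
prune.l1_unstructured(model[0], name="weight", amount=0.5)
print(quantized)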
6 DISCUSSION
The comparative analysis highlights the significant progress in synthetic speech detection, par-
ticularly with the advent of automated end-to-end systems. However, it also underscores per-
sistent challenges and illuminates key directions for future research.
• Advanced Feature Learning: Exploring more sophisticated SSL models beyond wav2vec
2.0, such as HuBERT [[38]], or investigating unsupervised methods tailored specifically for
distinguishing subtle synthesis artifacts could yield even richer representations [[4], [1]].
Multimodal analysis, integrating visual cues (lip movements) when available, might also
offer benefits for deepfake detection in general.
• Efficient Network Architecture Search: While NAS techniques like light-DARTS are
effective, further research into improving their computational efficiency and scalability is
needed. Exploring alternative NAS paradigms or more constrained search spaces could
yield high-performing architectures with lower discovery costs [[1], [4]].
• Improved Datasets: The creation and curation of larger, more diverse, and realistic
benchmark datasets are crucial. Future datasets should ideally include: speech generated
by the very latest synthesis techniques, partially spoofed or manipulated audio (not just
fully synthetic [[41]]), audio subjected to various real-world channel effects and augmenta-
tions, and data collected under clear legal and ethical guidelines (e.g., GDPR compliant)
to facilitate broader research collaboration [[4]].
6.2 Challenges and Limitations
Despite significant progress, synthetic speech detection faces persistent challenges [[4]]:
• Data Dependency: Modern deep learning methods, especially SSL pretraining, heavily
depend on the availability of large datasets (both labeled for fine-tuning and unlabeled
for pretraining). Performance can be limited in scenarios where such data is scarce,
imbalanced, or lacks representation of specific attack types or acoustic conditions [[4], [1]].
Data collection itself faces hurdles related to cost, diversity, and increasingly stringent
privacy regulations (like GDPR) [[4]].
• Computational Complexity: Training large SSL models and performing NAS can be
extremely computationally expensive, requiring significant hardware resources (GPUs/TPUs)
and time, potentially limiting accessibility for researchers with fewer resources [[1],
[4]].
• Interpretability: The "black box" nature of complex DNNs makes it difficult to understand
why a particular decision (bona fide vs. spoof) is made. This lack of transparency
hinders trust, debugging, validation, and admissibility in contexts requiring explainable
decisions, such as legal proceedings [[4]].
Addressing these challenges is crucial for the continued development and reliable deployment
of synthetic speech detection technologies.
7 ETHICAL CONSIDERATIONS AND SOCIETAL IMPACT
• Harassment and Abuse: Synthetic voices could potentially be used in targeted ha-
rassment campaigns or to create non-consensual fabricated content involving individuals’
voices.
The potential for these harms underscores the societal necessity for reliable methods to distin-
guish authentic human speech from synthetic fabrications [[4]].
• Bias and Fairness: Detection algorithms trained on biased datasets (e.g., lacking di-
versity in age, gender, accent, or language) may exhibit differential performance across
demographic groups, potentially leading to unfair outcomes. Ensuring fairness in detector
performance is an important ethical consideration.
• Consent and Collection: Collecting large datasets of real human speech requires careful
attention to ethical guidelines and data privacy regulations, such as the GDPR in Europe
[[4]]. Obtaining informed consent and ensuring data anonymization where appropriate
are crucial steps. Using publicly available data scraped from the internet without consent
raises significant ethical and legal issues [[4]].
• Federated Learning Trade-offs: While FL offers a privacy-preserving alternative by
avoiding direct data sharing, it introduces its own challenges (like non-IID data) and
potential privacy risks (e.g., model inversion attacks) that need careful consideration [[4]].
Navigating these ethical considerations responsibly is essential for the sustainable and beneficial
development and deployment of synthetic speech detection technology.
8 CONCLUSION
This report has provided a comprehensive review and comparative analysis of synthetic speech
detection methodologies, tracing their evolution from traditional feature-based classifiers [[3],
[5]] to modern end-to-end deep learning paradigms, culminating in fully automated systems [[2],
[1]]. The analysis indicates a definitive trend towards these automated end-to-end approaches,
which demonstrably advance detection performance and generalization capabilities over earlier
techniques [[1], [2]].
Systems like TSSDNet highlight the effectiveness of learning discriminative features directly
from raw waveforms, bypassing traditional feature engineering [[2]]. Furthermore, fully au-
tomated methods that leverage large-scale self-supervised pretraining (e.g., wav2vec 2.0) for
robust feature extraction and employ Neural Architecture Search (e.g., light-DARTS) for op-
timized classifier design have achieved state-of-the-art results on challenging benchmarks like
ASVspoof 2019, while also exhibiting strong generalization to different datasets, languages, and
acoustic conditions [[1]]. This integrated approach minimizes reliance on manual tuning and
expert knowledge, offering a powerful strategy against sophisticated spoofing attacks [[1]].
Despite these significant advancements, substantial challenges remain critical areas for on-
going research. These include the high computational cost associated with training large SSL
models and performing NAS, the dependency on extensive and diverse datasets (both labeled
and unlabeled), the continuous "arms race" against rapidly evolving synthesis techniques requir-
ing constant adaptation, and the need for improved model interpretability (XAI) and robustness
to real-world conditions [[4], [1], [2]]. Addressing these limitations concerning data accessibility,
computational feasibility, adaptability, transparency, and practical deployability is essential for
developing robust, scalable, and trustworthy detection systems [[4]].
The evolution towards automated end-to-end systems represents a highly promising direction
for the future of synthetic speech detection. These methods offer potent tools to enhance the
security and reliability of voice-based applications, particularly vital ASV systems, against the
increasing threat of audio deepfakes [[1], [2]]. Future research should prioritize optimizing these
advanced methods for efficiency, enhancing their adaptability to unknown and future attacks,
ensuring fairness and ethical deployment, and improving their transparency to foster wider
adoption and trust in real-world scenarios [[4]].
A APPENDICES
# Extract MFCCs with librosa.
# Returns a numpy array of shape (n_mfcc, number_of_frames).
import librosa

# Assumed parameter values for illustration (~25 ms window, 10 ms hop at 16 kHz)
sampling_rate = 16000
n_mfcc, n_fft, hop_length, win_length = 20, 512, 160, 400

# Placeholder path to any speech recording
audio_signal, _ = librosa.load("example.wav", sr=sampling_rate)

mfccs = librosa.feature.mfcc(
    y=audio_signal,
    sr=sampling_rate,
    n_mfcc=n_mfcc,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=win_length,
)
import torch
import torch.nn as nn

class ResBlockTSSD(nn.Module):
    """Conceptual residual block operating on 1D feature maps."""
    def __init__(self, channels):
        super().__init__()
        # First 1D Convolution -> BatchNorm -> ReLU
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU(inplace=True)
        # Second 1D Convolution -> BatchNorm
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm1d(channels)
        # Skip connection is the identity because input/output channels match

    def forward(self, x):
        identity = x
        # First layer
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        # Second layer
        out = self.conv2(out)
        out = self.bn2(out)
        # Add the skip connection, then apply the activation (standard ResNet ordering)
        return self.relu(out + identity)

# Example usage:
# input_tensor = torch.randn(32, 64, 1000)   # batch of 32, 64 channels, length 1000
# res_block = ResBlockTSSD(channels=64)
# output_tensor = res_block(input_tensor)
# print(f"Output shape: {output_tensor.shape}")
Listing 2: Conceptual TSSDNet ResNet-Style Block in PyTorch
Example: Wav2Vec 2.0 Feature Extraction
This snippet demonstrates loading a pretrained Wav2Vec 2.0 model using the Hugging Face
Transformers library and extracting high-level feature representations from a raw audio wave-
form. This illustrates the self-supervised feature extraction step common in fully automated
detection systems [[1]].
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2Model

# Load a pretrained checkpoint (example choice; any wav2vec 2.0 checkpoint works)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()  # ensure the model is in evaluation mode if not fine-tuning

# Raw waveform at 16 kHz, the sampling rate wav2vec 2.0 expects (placeholder path)
audio, sr = librosa.load("example.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape: (1, num_frames, 768)
print(f"Extracted Wav2Vec 2.0 features with shape: {features.shape}")
References
[1] C. Wang, J. Yi, J. Tao, H. Sun, X. Chen, Z. Tian, H. Ma, C. Fan, and R. Fu, ”Fully Auto-
mated End-to-End Fake Audio Detection,” in Proc. 1st Int. Workshop Deepfake Detection
Audio Multimedia (DDAM ’22), Lisboa, Portugal, Oct. 2022, pp. 1-7.
[2] G. Hua, A. B. J. Teoh, and H. Zhang, ”Towards End-to-End Synthetic Speech Detection,”
IEEE Signal Process. Lett., vol. 28, pp. 1265-1269, 2021.
[3] C. Hanilçi, T. Kinnunen, M. Sahidullah, and A. Sizov, ”Classifiers for Synthetic Speech
Detection: A Comparison,” in Proc. INTERSPEECH, Dresden, Germany, Sep. 2015, pp.
2057-2061.
[5] C. Borrelli, P. Bestagini, F. Antonacci, A. Sarti, and S. Tubaro, ”Synthetic speech detection
through short-term and long-term prediction traces,” EURASIP J. Inf. Secur., vol. 2021,
no. 1, pp. 1-14, Apr. 2021.
[6] M. Todisco et al., ”ASVspoof 2019: Future horizons in spoofed and fake audio detection,”
in Proc. INTERSPEECH, Graz, Austria, Sep. 2019, pp. 1008-1012.
[7] Z. Wu et al., ”ASVspoof 2015: the first automatic speaker verification spoofing and coun-
termeasures challenge,” in Proc. INTERSPEECH, Dresden, Germany, Sep. 2015, pp. 2037-
2041.
[8] J. Yamagishi et al., ”ASVspoof 2021: accelerating progress in spoofed and deepfake speech
detection,” arXiv preprint arXiv:2109.00537, Sep. 2021.
[9] J. Yi et al., ”ADD 2022: The first audio deep synthesis detection challenge,” arXiv preprint
arXiv:2202.08433, Feb. 2022.
[10] X. Wang et al., ”ASVspoof 2019: a large-scale public database of synthesized, converted
and replayed speech,” Comput. Speech Lang., vol. 64, p. 101114, Nov. 2020.
[11] J. Yamagishi et al., ”CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice
Cloning Toolkit (Version 0.92),” University of Edinburgh, The Centre for Speech Technol-
ogy Research (CSTR), 2019.
[12] M. Sahidullah, T. Kinnunen, and C. Hanilçi, ”A comparison of features for synthetic speech
detection,” in Proc. INTERSPEECH, Dresden, Germany, Sep. 2015, pp. 2087-2091.
[14] A. Janicki, ”Spoofing countermeasure based on analysis of linear prediction error,” in Proc.
INTERSPEECH, Dresden, Germany, Sep. 2015.
[16] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-
Verlag, 1995.
[17] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, ”Support vector machines using
GMM supervectors for speaker verification,” IEEE Signal Process. Lett., vol. 13, no. 5, pp.
308-311, May 2006.
[18] W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, and P. A. Torres-Carrasquillo,
”Support vector machines for speaker and language recognition,” Comput. Speech Lang.,
vol. 20, no. 2-3, pp. 210-229, Apr. 2006.
[19] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, ”Front-end factor analysis
for speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 4, pp.
788-798, May 2011.
[20] H. Muckenhirn, M. Magimai-Doss, and S. Marcel, ”End-to-end convolutional neural
network-based voice presentation attack detection,” in Proc. IEEE Int. Joint Conf. Biomet.
(IJCB), Denver, CO, USA, Oct. 2017, pp. 335-341.
[21] K. He, X. Zhang, S. Ren, and J. Sun, ”Deep residual learning for image recognition,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun.
2016, pp. 770-778.
[22] C. Szegedy et al., ”Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 1-9.
[23] S. Schneider, A. Baevski, R. Collobert, and M. Auli, ”wav2vec: Unsupervised Pre-Training
for Speech Recognition,” in Proc. INTERSPEECH, Graz, Austria, Sep. 2019, pp. 3465-
3469.
[24] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, ”wav2vec 2.0: A framework for
self-supervised learning of speech representations,” in Adv. Neural Inf. Process. Syst.
(NeurIPS), vol. 33, 2020, pp. 12449-12460.
[25] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, ”Librispeech: an ASR corpus based
on public domain audio books,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process.
(ICASSP), South Brisbane, QLD, Australia, Apr. 2015, pp. 5206-5210.
[26] Y. Xie, Z. Zhang, and Y. Yang, ”Siamese network with wav2vec feature for spoofing speech
detection,” in Proc. INTERSPEECH, Brno, Czechia, Aug./Sep. 2021, pp. 4269-4273.
[27] X. Wang and J. Yamagishi, ”Investigating self-supervised front ends for speech spoofing
countermeasures,” arXiv preprint arXiv:2111.07725, Nov. 2021.
[28] Z. Lv, S. Zhang, K. Tang, and P. Hu, ”Fake audio detection based on unsupervised pre-
training models,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP),
Singapore, May 2022, pp. 9231-9235.
[29] H. Liu, K. Simonyan, and Y. Yang, ”DARTS: Differentiable Architecture Search,” in Proc.
Int. Conf. Learn. Represent. (ICLR), New Orleans, LA, USA, May 2019.
[30] X. Wu, R. He, Z. Sun, and T. Tan, ”A light cnn for deep face representation with noisy
labels,” IEEE Trans. Inf. Forensics Secur., vol. 13, no. 11, pp. 2884-2896, Nov. 2018.
[31] W. Ge, M. Panariello, J. Patino, M. Todisco, and N. Evans, ”Partially-connected
differentiable architecture search for deepfake and spoofing detection,” arXiv preprint
arXiv:2104.03123, Apr. 2021.
[32] N. Subramani and D. Rao, ”Learning efficient representations for fake speech detection,”
in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 4, New York, NY, USA, Feb. 2020, pp.
5859-5866.
[33] N. M. Müller et al., ”Does audio deepfake detection generalize?” in Proc. INTERSPEECH,
Incheon, Korea, Sep. 2022, pp. 2783-2787.
[34] T. Chen et al., ”Generalization of audio deepfake detection,” in Proc. Odyssey 2020 The
Speaker and Language Recognition Workshop, Tokyo, Japan, Nov. 2020, pp. 132-137.
[35] Z. Zhang, X. Yi, and X. Zhao, ”Fake speech detection using residual network with trans-
former encoder,” in Proc. ACM Workshop Inf. Hiding Multimedia Secur. (IH&MMSEC),
Virtual Event, Jun. 2021, pp. 13-22.
[36] D. S. Park et al., ”SpecAugment: A Simple Data Augmentation Method for Automatic
Speech Recognition,” in Proc. INTERSPEECH, Graz, Austria, Sep. 2019, pp. 2613-2617.
[37] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, ”mixup: Beyond Empirical Risk
Minimization,” in Proc. Int. Conf. Learn. Represent. (ICLR), Vancouver, BC, Canada,
Apr./May 2018.
[38] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed,
”HuBERT: Self-supervised speech representation learning by masked prediction of hidden
units,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3451-3460, 2021.
[39] H. Ma, J. Yi, J. Tao, Y. Bai, Z. Tian, and C. Wang, ”Continual Learning for Fake Audio
Detection,” in Proc. INTERSPEECH, Brno, Czechia, Aug./Sep. 2021, pp. 886-890.
[40] R. R. Selvaraju et al., ”Grad-CAM: Visual explanations from deep networks via gradient-
based localization,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct.
2017, pp. 618-626.
[41] J. Yi, Y. Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, and R. Fu, ”Half-Truth:
A Partially Fake Audio Detection Dataset,” in Proc. INTERSPEECH, Brno, Czechia,
Aug./Sep. 2021, pp. 1654-1658.
[42] A. van den Oord et al., ”WaveNet: A generative model for raw audio,” arXiv preprint
arXiv:1609.03499, Sep. 2016.
[43] J. Shen et al., ”Natural TTS synthesis by conditioning wavenet on mel spectrogram pre-
dictions,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Calgary,
AB, Canada, Apr. 2018, pp. 4779-4783.
[44] N. Kalchbrenner et al., ”Efficient neural audio synthesis,” in Proc. Int. Conf. Mach. Learn.
(ICML), Stockholm, Sweden, Jul. 2018.
[45] M. Schröder, M. Charfuelan, S. Pammi, and I. Steiner, ”Open source voice creation toolkit
for the MARY TTS platform,” in Proc. INTERSPEECH, Florence, Italy, Aug. 2011.
[47] K. Tokuda, H. Zen, and A. W. Black, ”An HMM-based speech synthesis system applied
to English,” in Proc. IEEE Workshop Speech Synth., Santa Monica, CA, USA, Sep. 2002.
[48] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, ”Voice conversion from non-
parallel corpora using variational auto-encoder,” in Proc. Asia-Pacific Signal Inf. Process.
Assoc. Annu. Summit Conf. (APSIPA ASC), Jeju, South Korea, Dec. 2016.