Futureinternet 16 00481
Department of Information Engineering and Technology, German University in Cairo, Cairo 11835, Egypt
* Correspondence: hesham.khalil@student.guc.edu.eg (H.K.); maggie.ezzat@guc.edu.eg (M.M.)
Abstract: Network and cloud environments must be fortified against a dynamic array of threats, and
intrusion detection systems (IDSs) are critical tools for identifying and thwarting hostile activities.
IDSs, classified as anomaly-based or signature-based, have increasingly incorporated deep learning
models into their framework. Recently, significant advancements have been made in anomaly-based
IDSs, particularly those using machine learning, where attack detection accuracy has been notably
high. Our proposed method demonstrates that deep learning models can achieve unprecedented
success in identifying both known and unknown threats within cloud environments. However,
existing benchmark datasets for intrusion detection typically contain more normal traffic samples
than attack samples to reflect real-world network traffic. This imbalance in the training data makes it
more challenging for IDSs to accurately detect specific types of attacks. Thus, the challenges arise
from two key factors: unbalanced training data and the emergence of new, unidentified threats. To
address these issues, we present a hybrid transformer-convolutional neural network (Transformer-CNN)
deep learning model, which combines data resampling techniques, namely adaptive synthetic sampling
(ADASYN), the synthetic minority oversampling technique (SMOTE), and edited nearest neighbors (ENN),
with class weights to overcome class imbalance. The transformer component of our model is employed
for contextual feature extraction, enabling the system to analyze relationships and patterns in the data
effectively. In contrast, the CNN is responsible for final classification, processing the extracted features
to accurately identify specific attack types. The Transformer-CNN model focuses on three primary
objectives to enhance detection accuracy and performance: (1) reducing false positives and false
negatives, (2) enabling real-time intrusion detection in high-speed networks, and (3) detecting
zero-day attacks. We evaluate our proposed model, Transformer-CNN, using the NF-UNSW-NB15-v2 and
CICIDS2017 benchmark datasets, and assess its performance with metrics such as accuracy, precision,
recall, and F1-score. The results demonstrate that our method achieves an impressive 99.71% accuracy
in binary classification and 99.02% in multi-class classification on the NF-UNSW-NB15-v2 dataset,
while for the CICIDS2017 dataset, it reaches 99.93% in binary classification and 99.13% in multi-class
classification, significantly outperforming existing models. This proves the enhanced capability of
our IDS in defending cloud environments against intrusions, including zero-day attacks.
Keywords: ADASYN; data resampling; deep learning; ENN; IDS; multi-class classification; Transformer-CNN
Citation: Kamal, H.; Mashaly, M. Advanced Hybrid Transformer-CNN Deep Learning Model for Effective
Intrusion Detection Systems with Class Imbalance Mitigation Using Resampling Techniques. Future
Internet 2024, 16, 481. https://doi.org/10.3390/fi16120481
Academic Editor: Ugo Fiore
Received: 31 October 2024
Revised: 13 December 2024
Accepted: 16 December 2024
Published: 23 December 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
As the internet has evolved and expanded over time, it now offers a wide array of
valuable services that significantly improve people’s lives. Nevertheless, these services are
accompanied by various security threats. The increasing prevalence of network infections,
eavesdropping, and malicious attacks complicates detection efforts and contributes to a
rise in false alarms. Consequently, network security has become a paramount concern
for a growing number of internet users, including in critical sectors such as banking,
corporations, and government agencies.
2. Related Work
IDSs have become vital safeguards for national, economic, and personal security
due to the rapid expansion of data collection and the increasing interconnectedness of
global internet infrastructures. The concept of intrusion detection was pioneered by James
P. Anderson in 1980 [14], aimed at mitigating vulnerabilities in computer systems and
enhancing monitoring capabilities. Over the years, as security professionals have continued
to refine the effectiveness and accuracy of IDSs, their widespread adoption has followed.
This section delves into the various machine learning and deep learning techniques that
have been explored in the literature for intrusion detection. Given the extensive applications
and remarkable performance of deep learning in fields such as image recognition and
natural language processing, it has emerged as a compelling choice for detecting traffic
anomalies within IDSs. Academic publications have primarily focused on utilizing deep
learning methodologies for the classification of attack types in intrusion detection systems.
advanced machine learning techniques. The study employs a real-world IoT dataset, as-
sessing the performance of various classification algorithms to determine their efficacy
in detecting harmful traffic. In ref. [20], the authors establish key constraints for realis-
tic adversarial cyber-attack scenarios and introduce a robust framework for adversarial
analysis, centered on an evasion attack vector. This framework is utilized to evaluate the
performance of three supervised learning algorithms: random forest (RF), extreme gradient
boosting (XGB), and light gradient boosting machine (LGBM), alongside one unsupervised
algorithm, isolation forest (IFOR). In ref. [21], the study focuses on three primary machine
learning techniques, applied for both binary and multi-class classification within an IDS de-
signed to detect IoT-based attacks. The IoT-23 dataset [16], a comprehensive and up-to-date
collection, serves as the foundation for developing an intelligent IDS capable of identifying
and categorizing attack types in IoT environments. In ref. [18], the authors address the
challenge of creating an IoT/IIoT dataset that includes labeled ground truth differentiating
between normal and attack classes. The dataset also features attack sub-classes for more
detailed multi-classification tasks. Known as ToN-IoT, the dataset encompasses telemetry
data from IoT/IIoT services, OS logs, and network traffic, gathered from a realistic simu-
lation of a medium-scale IoT network, conducted at the University of New South Wales
(UNSW) Canberra’s Cyber Range and IoT Labs. In ref. [22], the study utilizes PySpark
with Apache Spark within the Google Colaboratory (Colab) environment, incorporating
Keras and Scikit-learn libraries. The authors employ the ‘CICIoT2023’ and ‘TON_IoT’
datasets for model training and testing, refining the features through correlation analysis
to reduce dimensionality. A hybrid deep learning model, integrating one-dimensional
CNN and LSTM, is then proposed to optimize performance. Additionally, ref. [20] explores
adversarial robustness by defining constraints for realistic cyber-attacks and introducing a
comprehensive evaluation method. This approach is used to test the effectiveness of RF,
XGB, LGBM, and IFOR algorithms under adversarial conditions.
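The correlation-based feature reduction reported for ref. [22] can be illustrated with a minimal sketch. The code below is not the implementation used in ref. [22]; the feature names, the toy values, and the 0.95 threshold are assumptions chosen only for illustration. It keeps a feature only if its absolute Pearson correlation with every previously kept feature stays below the threshold.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.95):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical flow features (names and values are illustrative only).
flows = {
    "bytes_in": [100, 200, 300, 400, 500],
    "packets_in": [1, 2, 3, 4, 5],  # perfectly correlated with bytes_in
    "duration": [5, 1, 4, 2, 3],    # weakly correlated with bytes_in
}
selected = drop_correlated(flows)
print(selected)
```

In this toy input the redundant packet counter is dropped while the weakly correlated duration feature is retained, which is the dimensionality reduction effect the cited study relies on.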
This study conducts an extensive evaluation of deep learning models, demonstrat-
ing their potential when integrated with big data analytics to optimize IDS. In ref. [2], a
deep neural network (DNN) model achieves an outstanding 99.99% accuracy in binary
classification, underscoring the synergy between deep learning techniques and big data in
enhancing IDS effectiveness. The research utilizes three classifiers, random forest, gradient
boosting tree (GBT), and a deep feed-forward neural network, to classify network traffic,
while employing a homogeneity measure to extract the most relevant features from the
datasets. In ref. [23], a DNN model is proposed, achieving 93.1% accuracy for binary
classification, with a focus on developing a robust and adaptable IDS capable of detecting
both known and emerging cyber threats. Recognizing the ever-evolving nature of network
environments and the rapid emergence of new attack vectors, the study evaluates various
datasets using both static and dynamic analysis techniques to identify optimal methods for
detecting novel threats. It compares the performance of DNN models with traditional ma-
chine learning classifiers on a variety of publicly available malware datasets. The authors
in ref. [24] present a DNN-based IDS model that achieves a 99% accuracy rate, applied to a
newly constructed dataset containing both packet-based and flow-based data, as well as as-
sociated metadata. Despite the dataset’s imbalance and the inclusion of 79 attributes, some
representing classes with minimal training samples, the research highlights the capacity of
deep learning to mitigate the issues inherent in imbalanced datasets. Meanwhile, ref. [25]
introduces a stacked auto encoder (SAE) model, achieving a remarkable 99.92% accuracy.
The study outlines an innovative IDS framework comprising five core components: data
preprocessing, auto encoder compression, database storage, classification, and feedback. By
compressing the preprocessed data to extract lower-dimensional features, the auto encoder
enables more efficient classification while storing the compressed data in a database for
future forensic analysis, post-attack evaluations, and model retraining. In ref. [26], a long
short-term memory (LSTM) model is proposed, achieving 92.2% accuracy in binary classifi-
cation by incorporating attention mechanisms to enhance the capture of both temporal and
spatial features in network traffic data. This model is tested on the UNSW-NB15 dataset,
which offers diverse patterns and significant disparities between training and testing sets,
making it an ideal challenge for evaluating model performance. In ref. [27], the authors
propose a hybrid CNN-BiLSTM model, which achieves 97.90% accuracy in binary classifi-
cation. This model combines bidirectional LSTMs with a lightweight CNN architecture and
utilizes feature selection methods to reduce complexity while maintaining robust detection
performance. Similarly, in ref. [28], a random forest model is presented, achieving 98.6%
accuracy in detecting network attacks on the UNSW-NB15 dataset. This comprehensive
study employs advanced machine learning and deep learning techniques to create a highly
effective attack detection strategy. Finally, in ref. [2], another DNN model reaches 99.16%
accuracy in classifying network traffic, utilizing five-fold cross-validation and incorporating
ensemble learning methods. The model leverages the Apache Spark MLlib alongside the
Keras deep learning framework, illustrating the powerful capabilities of deep learning and
big data technologies in addressing complex network security challenges.
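The five-fold cross-validation mentioned for ref. [2] can be sketched as follows. This is a generic illustration of the splitting scheme and not the Spark-based pipeline of ref. [2]; the function name and the sample count are assumptions. In practice the samples would normally be shuffled (and often stratified by class) before splitting.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Distribute the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Twelve hypothetical samples split into five folds.
folds = list(k_fold_indices(12, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

A model would be trained on each training split and scored on the corresponding test split, with the five scores averaged to give the reported accuracy.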
In ref. [29], an LSTM-based recurrent neural network (RNN) was trained as a category
classifier, utilizing a dataset with 122 features. This model achieved a test accuracy of
82.68%, demonstrating its ability to manage complex classification tasks. The authors in
ref. [30] addressed the issue of class imbalance by combining a CNN with a Bidirectional
LSTM (BiLSTM) and integrating ADASYN, resulting in a notable accuracy of 90.73% on
the test set. In ref. [31], performance enhancement was achieved by optimizing an auto
encoder network for anomaly detection, yielding a test accuracy of 90.61%. Meanwhile,
ref. [32] introduced a multi-CNN model with discrete data preprocessing steps, which
effectively classified attacks on the test set, achieving an accuracy of 83%. In ref. [33],
advancements in IDS for cloud environments were explored by developing and evaluating
two cutting-edge deep neural network models. The first model was a multi-layer perceptron
(MLP) trained using backpropagation (BP), while the second incorporated particle swarm
optimization (PSO) into the MLP training process. Both models demonstrated a significant
improvement in IDS performance and efficiency, achieving an impressive accuracy of
98.97%. This underscores their effectiveness in both intrusion detection and prevention.
In ref. [34], the efficacy of deep learning algorithms for network intrusion detection was
evaluated by comparing frameworks such as Keras, TensorFlow, Theano, fast.ai, and
PyTorch. The researchers employed an MLP model and achieved a test accuracy of 98.68%
in identifying network intrusion traffic and classifying various attack types, utilizing
the CSE-CIC-IDS2018 dataset for validation. Similarly, ref. [35] presented an innovative IDS
model that employed a custom-designed recurrent convolutional neural network (RC-
NN) optimized through the ant lion optimization algorithm, achieving a test accuracy of
94%. This approach significantly improved IDS performance, particularly in detecting
and mitigating network intrusions. In ref. [36], a deep learning framework was proposed
to enhance IDS by utilizing a denoising auto encoder (DAE) as the central component of
the methodology. The DAE was trained using a layer-wise greedy approach to prevent
overfitting and avoid local optima, achieving a robust test accuracy of 96.53%. This strategy
ensured higher reliability in detecting network intrusions. In ref. [37], a hidden naïve Bayes
(HNB) classifier was introduced, specifically tailored to counter denial of service (DoS)
attacks. By relaxing the traditional naïve Bayes assumption of conditional independence,
the model incorporated discretization and feature selection techniques, achieving a test
accuracy of 97%. This approach not only enhanced performance but also minimized
processing time by prioritizing the most relevant features. Lastly, in ref. [38], the authors
developed a novel classifier by employing a cascade of boosting-based artificial neural
networks (ANNs) on two prominent intrusion detection datasets. Their method, which
achieved a test accuracy of 98.25%, significantly improved upon the traditional one-vs-
remaining strategy by introducing an additional example filtering step, ultimately boosting
the model’s overall effectiveness.
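Class imbalance, which ref. [30] addresses through ADASYN oversampling, can also be countered by weighting the loss per class. The sketch below uses the common inverse-frequency heuristic w_c = N / (K · n_c), where N is the number of samples, K the number of classes, and n_c the count of class c; this is the same formula scikit-learn applies for class_weight="balanced". The label names and counts are hypothetical, and the cited works may use a different weighting.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


# Hypothetical, heavily imbalanced label distribution.
labels = ["benign"] * 90 + ["dos"] * 8 + ["worm"] * 2
weights = balanced_class_weights(labels)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Rare classes receive proportionally larger weights, so misclassifying a rare attack sample contributes more to the training loss than misclassifying a benign one.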
In ref. [39], the authors present the design of an IDS tailored for IIoT networks, employing
the RF model for classification. The methodology incorporates the Pearson correlation coefficient
(PCC) to identify and select critical features, alongside isolation forest (IF) to detect outliers.
Both PCC and IF are applied both
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
|---|---|---|---|---|---|---|
| Anandaraj Mahalingam et al. [15] | IoT-23 | 2023 | ROAST-IoT | 99.15% | This paper introduces ROAST-IoT, an AI-based model designed for efficient intrusion detection in IoT environments. It employs a multi-modal architecture to capture complex relationships in diverse network traffic data. System behavior is continuously monitored by sensors and stored on a cloud server for analysis. The model’s performance is thoroughly evaluated using benchmark datasets, including IoT-23, Edge-IIoT, ToN-IoT, and UNSW-NB15. | • The study acknowledges its limitations, particularly the necessity to integrate a broader range of deep learning models to enhance the security of IIoT networks against cyber threats. |
Table 1. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
|---|---|---|---|---|---|---|
| Trifa S. Othman and Saman M. Abdullah [21] | IoT-23 | 2023 | ANN | 99% | This research presents three leading machine learning methodologies employed for binary and multi-class classification, serving as the foundation of an intrusion detection system designed to safeguard Internet of Things environments. These approaches are utilized to identify a range of cyber threats targeting IoT devices while effectively categorizing their respective types. By harnessing the cutting-edge IoT-23 dataset, the study constructs a sophisticated intelligent intrusion detection system capable of detecting malicious behaviors and classifying attack vectors in real-time, thereby bolstering the security posture of IoT networks. | • A significant limitation highlighted in this research is the failure of the SMOTE to enhance the accuracy of the proposed intelligent intrusion detection system model on the IoT-23 dataset, despite its usual efficacy in addressing issues related to imbalanced datasets. |
| Abdallah R. Gad et al. [18] | ToN-IoT | 2020 | XGBoost | 98.2% | This paper presents a new dataset, TON_IoT, designed for the IoT and IIoT, which includes labeled ground truth to differentiate between normal operations and various attack classes. The dataset features attributes for identifying attack subclasses, supporting multi-class classification. It contains telemetry data, operating system logs, and network traffic, collected from a realistic medium-scale network simulation at UNSW Canberra, Australia. Overall, the study significantly enhances the effectiveness of intrusion detection systems in IoT environments by providing a comprehensive dataset for improved classification accuracy. | • This study acknowledges several limitations, particularly the presence of class imbalance and missing values within the ToN-IoT dataset. Although techniques like Chi-squared for feature selection and SMOTE for class balancing were employed, these issues could hinder the model’s ability to generalize effectively and scale in practical, real-world scenarios. Consequently, the findings may not fully reflect the model’s performance under diverse operational conditions, highlighting a need for further refinement and validation in future research. |
Table 1. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
|---|---|---|---|---|---|---|
| Sami Yaras and Murat Dener [22] | ToN-IoT | 2024 | CNN-LSTM | 98.75% | This study employed PySpark with Apache Spark in Google Colaboratory, utilizing Keras and Scikit-Learn to analyze the ‘CICIoT2023’ and ‘TON_IoT’ datasets. It focused on feature reduction via correlation to enhance model relevance and developed a hybrid deep learning algorithm combining one-dimensional CNN and LSTM for better performance. Overall, the research showcases advanced deep learning applications for improving IoT intrusion detection. | • A significant limitation of this study is that, although it attained high accuracy levels, the extensive data volumes resulted in prolonged training and testing durations. This emphasizes the necessity for future optimization efforts to achieve a balance between accuracy, computational efficiency, and cost-effectiveness. |
| João Vitorino et al. [20] | ToN-IoT | 2023 | RF, XGB, LGBM, and IFOR | 85% | This research outlines the critical requirements for developing a credible adversarial cyber-attack and presents a framework for conducting a reliable robustness analysis with a practical adversarial evasion attack vector. The framework was employed to assess the robustness of three supervised machine learning algorithms: random forest, XGBoost, and LightGBM, alongside one unsupervised algorithm, isolation forest. | • The primary limitation of this study is that, despite the improvements in model resilience achieved through adversarial training, specific models such as LightGBM remain highly susceptible to adversarial examples in the context of imbalanced multi-class classification. This underscores the necessity for additional research into defense strategies and the evaluation of new datasets and attack methods. |
Table 1. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
|---|---|---|---|---|---|---|
| R. Vinayakumar et al. [23] | CICIDS2017 | 2019 | DNN | 93.1% | This study explores deep neural networks for a versatile intrusion detection system capable of identifying and categorizing new cyber-attacks, evaluating their performance against conventional machine learning classifiers using standard benchmark datasets. | • Limited scalability and performance evaluation of distributed systems and advanced deep neural network architectures. |
| Mohammad A. Alsharaiah et al. [26] | UNSW-NB15 | 2024 | AT-LSTM | 92.2% | This study introduces a novel network intrusion detection system that employs LSTM networks and attention mechanisms to analyze the temporal and spatial characteristics of network traffic. Utilizing the UNSW-NB15 dataset, the approach evaluates different training and testing set sizes. | • Intricate architecture. • Although the AT-LSTM model demonstrates impressive accuracy on the UNSW-NB15 dataset, it fails to tackle class imbalance and lacks evaluation on alternative datasets, such as NSL-KDD. This limitation may hinder its applicability and effectiveness across various contexts. |
Table 1. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
|---|---|---|---|---|---|---|
| Mohammed Jouhari et al. [27] | UNSW-NB15 | 2024 | CNN-BiLSTM | 97.90% | This research presents a robust intrusion detection system model that combines BiLSTM with a lightweight CNN. The approach incorporates feature selection techniques to streamline the model, enhancing its efficiency and effectiveness in detecting threats. | • Intricate architecture. • This research focused on enhancing the intrusion detection system model to address computational limitations, potentially overlooking important factors like broader generalization and robustness across various datasets. |
| Fuat Türk [28] | UNSW-NB15 | 2023 | RF | 98.6% | This study achieved high attack detection rates on the UNSW-NB15 dataset, recording 98.6% accuracy in binary classification and 98.3% in multi-class classification by employing sophisticated machine learning and deep learning methods. | • Misclassification of attack classes suggests the necessity for improved dataset balancing and the implementation of real-time model updates to boost overall performance. |
| Osama Faker and Erdogan Dogdu [2] | UNSW-NB15 | 2019 | DNN | 99.16% | This study assesses machine learning models through 5-fold cross-validation, employs ensemble techniques alongside Apache Spark, and integrates deep learning by merging Apache Spark with Keras. | • The paper fails to address scalability concerning distributed processing and does not explore advanced feature selection methods. |
Table 1. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
|---|---|---|---|---|---|---|
| Saud Alzughaibi and Salim El Khediri [33] | CSE-CIC-IDS2018 | 2023 | MLP-BP, MLP-PSO | 98.97% | This research enhances intrusion detection systems for cloud environments by developing and assessing two deep neural network models: one based on a multilayer perceptron with backpropagation and the other utilizing particle swarm optimization. These models aim to improve the efficiency and effectiveness of detecting and responding to intrusions. | • Inadequate accuracy. • Intricate architecture. • The study reports high accuracy for both the multilayer perceptron with backpropagation and the multilayer perceptron with particle swarm optimization models. However, these models have yet to be evaluated in real-time or cloud environments. Exploring additional optimization algorithms may further improve the outcomes. |
| T. Thilagam and R. Aruna [35] | CSE-CIC-IDS2018 | 2021 | RC-NN-IDS | 94% | This paper presents a sophisticated IDS that utilizes a customized RC-NN enhanced by the ALO algorithm, with the goal of markedly improving the system’s effectiveness. | • Intricate architecture. • The proposed RC-NN model outperforms existing classifiers in intrusion detection; however, it is missing a management module to initiate preventive measures following detection. |
Table 1. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
|---|---|---|---|---|---|---|
| Mouaad Mohy-Eddine et al. [39] | NF-UNSW-NB15-v2 | 2022 | RF | 99.30% | This study designs an IDS for IIoT networks utilizing the RF model for classification. The approach integrates PCC for selecting relevant features and IF as an outlier detection mechanism. PCC and IF are applied independently as well as interchangeably, with PCC feeding its output to IF and, conversely, IF supplying its output to PCC in different iterations. | • A limitation of this study is that the IDS model’s effectiveness has only been validated on a limited set of datasets, suggesting a need for broader evaluation across diverse IIoT and IoT datasets to ensure generalized applicability. |
| Mohanad Sarhan et al. [10] | NF-UNSW-NB15-v2 | 2022 | Extra Tree classifier | 99.7% | This study addresses limitations in NIDS by introducing and evaluating standardized feature sets based on the NetFlow metadata collection protocol. It systematically compares two variants of these feature sets, one with 12 features and another with 43 features. The study reformulates four well-known NIDS datasets to incorporate these NetFlow-based feature sets. Utilizing an Extra Tree classifier, it assesses the classification performance of the NetFlow-derived feature sets against the original proprietary feature sets included in the datasets. | • Although this study significantly contributes to the establishment of a standardized NetFlow-based feature set for NIDS, it is constrained by its dependence on existing benchmark datasets, which may not comprehensively capture the full spectrum of real-world network environments and diverse attack scenarios encountered in practice. |
In ref. [18], the authors tackle the challenge of identifying malicious activities in
IoT/IIoT environments by introducing a new dataset, named TON_IoT, which features
labeled ground truth to differentiate between normal operations and attack classes. This
dataset further enriches the classification process by including a feature that categorizes
various attack subclasses, thereby facilitating multi-class classification. TON_IoT comprises
telemetry data from IoT/IIoT services, operating system logs, and network traffic, all
gathered from a realistic medium-scale network simulation conducted at the Cyber Range
and IoT Labs at UNSW Canberra, Australia. In ref. [43], the authors utilized both machine
learning and deep learning algorithms to investigate DoS and DDoS attacks. The analysis
was conducted using the Bot-IoT dataset, developed by the UNSW Canberra Cyber Centre,
with relevant features extracted from the pcap files of the UNSW dataset using ARGUS
software, allowing for an in-depth examination of attack patterns. Additionally, in ref. [43],
the authors propose a novel framework known as the privacy-preserving intrusion detec-
tion framework (P2IDF), specifically designed for traffic in Software-Defined IoT and Fog
networks. This framework employs a SAE technique to transform raw data into an encoded
format, effectively safeguarding against inference attacks. Subsequently, an IDS based
on ANN is integrated and evaluated using the ToN-IoT dataset to discern normal from
malicious traffic before and after the data transformation. This dual approach enhances the
security of IoT-Fog networks while maintaining data confidentiality. In ref. [44], the authors
conducted a thorough assessment of feature importance across six NIDS datasets.
They employed three feature selection techniques, Chi-square, information gain (IG), and
correlation, to rank features according to their predictive significance. These features were
then assessed using deep feed forward networks (DFF) and RF classifiers, leading to a total
of 414 experiments. A major finding from this study is that a carefully selected subset of
features can deliver equal or even superior detection performance compared to using the
full feature set, highlighting the efficiency of feature reduction in NIDS performance. In
ref. [45], the authors present a novel, comprehensive cyber security dataset tailored for IoT
and IIoT applications, named Edge-IIoTset. This dataset is designed for use with machine
learning-based intrusion detection systems, supporting both centralized and federated
learning modes. It was created using a purpose-built IoT/IIoT testbed, which incorporates
a wide array of representative devices, sensors, protocols, and cloud/edge configurations,
ensuring its relevance and applicability in real-world scenarios.
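The Chi-square ranking used in ref. [44] can be illustrated on a toy example. The sketch below computes the standard contingency-table statistic, the sum over cells of (observed − expected)² / expected, for categorical features against the class label and sorts the features by it. The feature names and values are hypothetical, and the code is not the experimental setup of ref. [44].

```python
def chi_square(feature, labels):
    """Chi-square statistic between a categorical feature and class labels."""
    n = len(labels)
    stat = 0.0
    for f in sorted(set(feature)):
        for c in sorted(set(labels)):
            observed = sum(1 for a, b in zip(feature, labels) if a == f and b == c)
            # Expected count under independence of feature and label.
            expected = feature.count(f) * labels.count(c) / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical binary features for eight flows (1 = attack, 0 = benign).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "is_tcp": [1, 0, 1, 0, 1, 0, 1, 0],        # independent of the label
    "syn_flag_set": [1, 1, 1, 1, 0, 0, 0, 0],  # matches the label exactly
}
ranking = sorted(features, key=lambda name: chi_square(features[name], labels), reverse=True)
print(ranking)
```

Keeping only the top-ranked features is the kind of reduction that, according to ref. [44], can match or exceed the detection performance of the full feature set.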
In ref. [23], a versatile DNN model reaches 95.6% accuracy, emphasizing the evaluation
of diverse datasets via static and dynamic methodologies to effectively detect emerging
cyber threats. The research in ref. [24] presents a DNN model achieving 99% accuracy
while addressing imbalances in labeled datasets through a comprehensive analysis of
packet-based and flow-based data. In ref. [46], the authors suggest a CNN-gated recurrent
unit (GRU) approach, achieving an accuracy of 98.73%. This method optimizes network
parameters by combining CNN and GRU, demonstrating various CNN-GRU combina-
tions. The CICIDS-2017 benchmark dataset is used, and evaluation metrics such as recall,
precision, false positive rate (FPR), and true positive rate (TPR) are employed. Another
study [2] reports a DNN model achieving 97.01% accuracy with five-fold cross-validation
and Apache Spark for distributed computing. The work of ref. [47] showcases an ANN
model with a notable 99.59% accuracy, employing a holistic dataset approach to enhance
deep learning performance. Similarly, an RF model in ref. [48] achieves 97.37% accuracy by
addressing dataset imbalance and dimensionality through feature clustering techniques. In
ref. [28], an RF model extends its attack detection methodology to the UNSW-NB15 dataset,
reaching 98.3% accuracy, while an RNN model in ref. [49] achieves 94% accuracy, utilizing
recursive feature elimination to improve classification across various attack categories.
Furthermore, ref. [50] introduces a multilayer CNN combined with LSTM networks, achieving
99.5% accuracy, and ref. [51] presents a method utilizing sparse stacked auto encoders, achieving
98.5% accuracy through a three-stage process. Finally, ref. [52] introduces an LSTM
model with a commendable accuracy of 96.9%.
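Because the studies above are compared mainly by accuracy, it is worth recalling how accuracy relates to the other metrics named for ref. [46] (precision, recall/TPR, and FPR). The sketch below derives them from confusion-matrix counts; the counts are hypothetical and describe a detector that labels every flow as benign on a 99:1 imbalanced test set.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also the true positive rate (TPR)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0     # false positive rate
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical test set: 990 benign flows, 10 attacks, no attack detected.
m = binary_metrics(tp=0, fp=0, tn=990, fn=10)
print(m)
```

The detector reaches 99% accuracy while its recall and F1-score are zero, which is why accuracy alone can overstate performance on imbalanced intrusion detection datasets.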
In ref. [34], the research investigates the effectiveness of several deep learning frame-
works for network intrusion detection, comparing notable options such as Keras, TensorFlow,
Theano, fast.ai, and PyTorch, along with MLP integration. The study reports an impressive
accuracy of 98.31% in identifying and classifying network intrusion traffic and various attack
types using the CSE-CIC-IDS2018 dataset. Additionally, the study in ref. [53] underscores
the critical importance of cyber security in safeguarding network infrastructures against
vulnerabilities and intrusions, highlighting advancements in machine learning and deep
learning techniques that facilitate early detection and prevention of attacks through advanced
self-learning and feature extraction methods. Leveraging the CSE-CIC-IDS2018 dataset, which
encompasses normal network behaviors and diverse attacks, an LSTM model achieved an
outstanding detection accuracy of 99%. In ref. [54], the authors evaluate a DNN model that
demonstrates approximately 90% accuracy, emphasizing its capability to effectively identify
network intrusions. The work presented in ref. [55] introduces a dynamic network anomaly
detection system aimed at bolstering network security through deep learning techniques,
specifically a deep neural network based on LSTM integrated with an attention mechanism
(AM) to enhance performance; this approach addresses class imbalance in the CSE-CIC-
IDS2018 dataset using SMOTE and an enhanced loss function, resulting in 96.2% accuracy.
The study detailed in ref. [38] proposes an advanced classifier development technique that
employs a cascade of boosting-based ANNs to construct a highly effective multi-class classifier,
utilizing a one-vs-remaining strategy refined with example filtering, ultimately achieving an
impressive accuracy of 99.36%. Furthermore, in ref. [23], the authors focus on developing a
DNN for a flexible and effective IDS capable of detecting and classifying novel cyber-attacks.
Recognizing the dynamic nature of network behaviors and attack strategies, they emphasize
the necessity of evaluating datasets generated through both static and dynamic methods over
time; the proposed DNN model achieves robust performance with an accuracy of 93.5%,
demonstrating its adaptability for real-time threat detection. Lastly, the authors in ref. [56]
execute a multi-class classification experiment for network intrusion detection utilizing the
KDD-CUP 99 and NSL-KDD datasets, employing a CNN to achieve a remarkable accuracy of
98.2%, thereby showcasing its efficacy in accurately identifying various network attack types.
In ref. [10], the authors propose and systematically evaluate standardized feature sets
for NIDS that are derived from the NetFlow network metadata collection protocol and system.
They assess two distinct variants of NetFlow-based feature sets, one comprising 12 features
and the other encompassing 43 features. For the evaluation, they transformed four widely
recognized NIDS datasets into revised versions that integrate the proposed NetFlow-based
feature sets. Using an Extra Tree classifier, they compare the classification performance
of the NetFlow-derived feature sets with the proprietary feature sets that accompany the
original datasets, thereby analyzing their relative effectiveness in detecting network
intrusions. In ref. [57], the paper
introduces a conditional generative adversarial network (CGAN) enhanced by bidirectional
encoder representations from transformers (BERT), a sophisticated pre-trained language
model, aimed at improving multi-class intrusion detection. The proposed method leverages
CGAN to augment minority attack data, effectively addressing the issue of class imbalance.
Additionally, BERT is incorporated into the CGAN discriminator, facilitating robust feature
extraction that strengthens input-output dependencies and enhances detection capabilities
through adversarial training.
Through an extensive review of prior research, we firmly establish that our proposed
Transformer-CNN model markedly surpasses current methodologies in performance. This
cutting-edge model attains an impressive accuracy of 99.02% in multi-class classification on
the NF-UNSW-NB15-v2 dataset and 99.13% on the CICIDS2017 dataset, underscoring its
effectiveness across diverse datasets. These outcomes not only demonstrate the efficacy of
our model but also highlight the significant advancements it brings to the field of intrusion
detection. A comprehensive comparison of our findings with relevant studies is presented
in Table 2.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| Mohamed ElKashlan et al. [16] | IoT-23 | 2023 | Filtered classifier | 99.2% | This paper proposes a new machine learning-based classifier for detecting malicious traffic in IoT networks, using a real-world IoT dataset to assess the performance of different algorithms. | The study’s reliance on the IoT-23 dataset limits its coverage of all IoT EVCS attack scenarios. Future work should explore diverse datasets and advanced deep learning for more comprehensive results. |
| Bambang Susilo and Riri Fitri Sari [41] | IoT-23 | 2020 | CNN | 91.24% | This research employs ML and DL techniques with standard datasets to enhance IoT security, developing a DL-based algorithm for DoS attack detection. | The study is limited by its focus on RF, CNN, and MLP, with further research needed to optimize batch sizes and integrate multiple ML or DL models for real-time intrusion detection. |
| Mateusz Szczepański et al. [42] | IoT-23 | 2022 | RF | 96.30% | This paper tackles the challenge of handling missing values in computational intelligence applications. It presents two experiments assessing different imputation methods for missing values in random forest classifiers trained on modern cybersecurity benchmark datasets like CICIDS2017 and IoT-23. | The study is limited by its emphasis on comparing imputation methods without examining the impact of deep learning imputation on various ML classifiers. Future research should fill this gap by exploring explainability techniques and the latent representations of autoencoders. |
Table 2. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| Abdallah R. Gad et al. [18] | ToN-IoT | 2020 | XGBoost | 97.8% | This paper addresses the challenge by introducing a novel data-driven IoT/IIoT dataset called TON_IoT, which includes ground truth labels to distinguish between normal and attack classes. It features an additional attribute for various attack subclasses, allowing for multi-class classification. The dataset comprises telemetry data from IoT/IIoT services, operating system logs, and network traffic, all collected from a realistic medium-scale network environment at the Cyber Range and IoT Labs at UNSW Canberra, Australia. | This study’s limitations include class imbalance and missing values in the ToN-IoT dataset. Although Chi2 was used for feature selection and SMOTE for class balancing, these issues may affect the model’s scalability and its ability to generalize to real-world scenarios. |
| Prahlad Kumar et al. [43] | Bot-IoT | 2021 | Decision trees (DT), RF, KNN, NB, and ANN | 99.6% | This paper employs machine learning and deep learning techniques to conduct a thorough analysis of DoS and DDoS attacks. It utilizes the Bot-IoT dataset from the UNSW Canberra Cyber Centre as the main training resource. To achieve precise feature extraction, ARGUS software was used to process and derive features from the pcap files of the UNSW dataset. This methodology enables a detailed investigation of attack behaviors, aiding in the detection and classification of malicious activities within IoT environments. | The study concludes that both machine learning and deep learning models are effective in detecting DoS and DDoS attacks; however, deep learning models require more resources and are best suited for systems with ample resources. In contrast, machine learning models are more appropriate for environments with constrained resources and lower data traffic. |
| Prabhat Kumar et al. [43] | ToN-IoT | 2021 | ANN | 99.44% | This paper presents a P2IDF for Software-Defined IoT-Fog networks, utilizing a SAE for data encoding to mitigate inference attacks. It assesses an ANN-based intrusion detection system on the ToN-IoT dataset, comparing performance before and after data transformation. The framework successfully identifies attacks while ensuring data privacy. | The study emphasizes that the P2IDF framework surpasses recent methods in terms of detection accuracy and precision. Future research will concentrate on creating a real-time prototype to tackle privacy and security issues in Software-Defined IoT-Fog networks. |
Table 2. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| Mohanad Sarhan et al. [44] | ToN-IoT | 2022 | DFF, RF | 96.10%, 97.35% | This paper assesses feature importance across six NIDS datasets by employing three feature selection techniques: Chi-square, information gain, and correlation analysis. The chosen features were evaluated using deep feed-forward networks and random forest classifiers, resulting in a total of 414 experiments. A significant finding is that a streamlined subset of features can achieve detection performance comparable to or better than that of the complete feature set, underscoring the value of feature selection in enhancing the efficiency and accuracy of NIDS. | A major limitation noted in this paper is the absence of a universal guideline for selecting optimal feature sets, as the importance of features varies considerably among different datasets and classifiers. This necessitates thorough analysis for each specific scenario. Additionally, some unrealistic features in synthetic datasets, such as TTL-based attributes in the UNSW-NB15 dataset, should be omitted to guarantee reliable evaluation outcomes. |
Table 2. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| R. Vinaya-Kumar et al. [23] | CICIDS2017 | 2019 | DNN | 95.6% | This research aims to develop a versatile intrusion detection system (IDS) by utilizing deep neural networks to detect and classify emerging cyber threats. It evaluates various datasets and algorithms, comparing DNNs with traditional classifiers using benchmark malware datasets to determine the most effective approach for identifying new threats. | There is a lack of scalability and performance analysis for distributed systems, as well as for advanced deep neural networks. |
| Kaniz Farhana et al. [24] | CICIDS2017 | 2020 | DNN | 99% | This study introduces an IDS based on deep neural networks, evaluated on a contemporary imbalanced dataset featuring 79 attributes. Built using Keras and TensorFlow, the model analyzes packet-based, flow-based data, and associated metadata. | The model’s failure to classify ‘Heartbleed,’ ‘Infiltration,’ and ‘Web Attack SQL Injection’ underscores the challenges posed by class imbalance, stemming from an inadequate number of records for these specific attacks. |
| Azriel Henry et al. [46] | CICIDS2017 | 2023 | CNN-GRU | 98.73% | The study presents a method combining CNN and GRU for optimizing network parameters, evaluated using the CICIDS-2017 dataset and metrics such as recall, precision, FPR, and TPR. | • Complex architecture. • The proposed IDS model, despite its high accuracy, could benefit from improvements in handling imbalanced data, optimizing training for all attack types, and addressing accuracy, false alarms, and execution time in large-scale systems. |
| Osama Faker and Erdogan Dogdu [2] | UNSW-NB15 | 2019 | DNN | 97.01% | This work evaluates machine learning models using five-fold cross-validation, employing Keras with Apache Spark for deep learning and leveraging Apache Spark MLlib for ensemble methods. | The paper fails to include an analysis of scalability concerning distributed processing and does not cover advanced feature selection techniques. |
Table 2. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| Muhammad Ahmad et al. [48] | UNSW-NB15 | 2021 | RF | 97.37% | They introduce feature clusters for Flow, TCP, and MQTT derived from the UNSW-NB15 dataset to address issues of imbalance, dimensionality, and overfitting, using ANN, SVM, and RF for classification. | The model’s ability to generalize to other IoT protocols and datasets may be constrained by the study’s emphasis on particular protocols and imputation techniques. |
| Fuat Türk [28] | UNSW-NB15 | 2023 | RF | 98.3% | This article utilizes advanced machine learning and deep learning techniques for attack detection on the UNSW-NB15 and NSL-KDD datasets, achieving an accuracy of 98.6% in binary classification and 98.3% in multi-class classification for the UNSW-NB15 dataset. | Attack classes are occasionally misclassified, highlighting the necessity for improved dataset balancing and real-time model updates to enhance performance. |
Table 2. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| R. Vinaya-Kumar et al. [23] | KDD-CUP’99 | 2019 | DNN | 93% | The authors develop a DNN-based IDS that attains 93% accuracy in detecting and classifying new cyber-attacks by analyzing various static and dynamic datasets. | • The study employed the KDD99 dataset, which displays a high level of redundancy. • Inadequate analysis of scalability and performance in distributed systems and advanced DNNs. |
Table 2. Cont.
| Author | Dataset | Year | Utilized Technique | Accuracy | Contribution | Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| Mohanad Sarhan et al. [10] | NF-UNSW-NB15-v2 | 2022 | Extra Tree classifier | 98.9% | This paper addresses limitations in NIDS by proposing and evaluating standardized feature sets based on the NetFlow metadata collection protocol. It compares two variants of these feature sets, one with 12 features and another with 43 features, by reformulating four well-known NIDS datasets to include the proposed sets. Using an Extra Tree classifier, the study rigorously assesses and contrasts the classification performance of the NetFlow-based feature sets with the original proprietary feature sets, highlighting their effectiveness in intrusion detection. | While this study advances the development of a standardized NetFlow-based feature set for NIDS, it is limited by the reliance on existing benchmark datasets, which may not fully represent the diverse range of real-world network environments and attack scenarios encountered in practice. |
| Fang Li [57] | NF-UNSW-NB15-v2 | 2024 | CGAN-BERT | 87.40% | This study presents an innovative method that combines a CGAN with BERT to tackle multi-class intrusion detection challenges. The approach focuses on augmenting data for minority attack classes, addressing class imbalance issues. By integrating BERT into the CGAN’s discriminator, the framework strengthens input-output relationships and enhances detection capabilities through adversarial training, resulting in improved feature extraction and a more robust cybersecurity detection mechanism. | A key limitation of this study is the difficulty in accurately distinguishing between attacks with similar characteristics or high concealment, such as Analysis, Backdoor, and DoS in the NF-UNSW-NB15-v2 dataset. |
used to tackle class imbalance across multiple datasets, including CICIDS2017, KDD99, and
UNSW-NB15, improving the classification performance by effectively handling the under-
representation of minority classes. In ref. [72], the performance of DT and RF was enhanced
by applying CatBoost alongside random oversampling and undersampling techniques
on the CIC-IDS-2018 dataset. In ref. [73], a novel feature selection algorithm, improved
non-dominated sorting genetic algorithm III (I-NSGA-III), was proposed to address the
imbalance issue, resulting in a better detection rate, though it did not lead to higher accu-
racy. In ref. [74], random oversampling of the minority classes and random undersampling
of the majority class were applied to improve intrusion detection performance. However,
random oversampling is known to induce overfitting [75], and only accuracy was reported
in this study. A study in ref. [76] evaluated two tree-based classifiers and one deep learning-
based classifier under various sampling rates, showing that sampling techniques improve
the detection of both majority and minority classes. In ref. [77], Zhang et al. proposed a
combination of SMOTE with ENN (SMOTE-ENN) and a DNN for the NSL-KDD dataset.
In ref. [78], a cost-sensitive deep learning model combined with ensemble techniques was
used to tackle class imbalance in intrusion detection. The model enhanced detection of
both majority and minority attacks, but required high computational resources and time.
In ref. [79], to address class imbalance, the study utilized SMOTE for oversampling the
minority class and Tomek Links for undersampling the majority class. This combination of
techniques helped balance the dataset before applying an LSTM model for classification,
improving the model’s ability to detect both majority and minority classes effectively.
2.4. Challenges
State-of-the-art IDSs that utilize deep learning models encounter several significant ob-
stacles. A primary concern is the challenge of achieving high accuracy, which is frequently
impeded by class imbalance within benchmark datasets. In these datasets, normal traffic
typically outnumbers attack traffic, complicating the detection of rare yet critical attack
types. This imbalance results in higher false alarm rates and diminishes overall detection
effectiveness. Moreover, while deep learning has the potential to enhance detection perfor-
mance, it also introduces considerable computational complexity and resource demands.
This raises important concerns regarding scalability and efficiency, especially in large-scale,
real-time operational environments. Another notable challenge is the generalizability of
these models; they often struggle to adapt to varying network conditions or to detect new
attack types that were not present in the training data, thereby limiting their robustness in
practical, real-world applications. Additionally, many existing studies tend to emphasize
theoretical and experimental aspects of deep learning, often overlooking essential practical
deployment issues such as data privacy, system latency, and the integration of these sys-
tems with existing security infrastructures. Lastly, a narrow focus on accuracy can obscure
other critical performance metrics, including precision, recall, F1-score, and the impacts of
false positives and negatives. To address these multifaceted challenges, a comprehensive
approach is required, one that carefully balances data handling, scalability, adaptability,
and practical implementation.
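To make the point about accuracy concrete, the following minimal sketch computes accuracy, precision, recall, and F1-score from a confusion matrix. The counts are hypothetical and chosen only to mimic a heavily imbalanced traffic sample; they are not taken from any dataset discussed here.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for the positive (attack) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical sample: 9900 benign flows and 100 attack flows.
# The detector finds only 20 attacks and raises 10 false alarms.
metrics = binary_metrics(tp=20, fp=10, fn=80, tn=9890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this illustrative case the accuracy exceeds 99% even though four out of five attacks are missed, which is why recall and F1-score need to be reported alongside accuracy.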
Our proposed Transformer-CNN model addresses several key limitations of contem-
porary intrusion detection systems, offering superior performance in terms of accuracy
and other critical metrics compared to traditional approaches. By incorporating advanced
techniques such as ADASYN, SMOTE, ENN, and class weights, the model effectively
mitigates class imbalance, significantly improving its ability to detect rare attack types.
The transformer’s contextual feature extraction capability enables the system to analyze
complex relationships and patterns within the data with exceptional efficacy. Simultane-
ously, the CNN processes these extracted features to accurately classify specific attack types.
Designed for scalability and efficiency, the Transformer-CNN model excels in handling
large-scale datasets, optimizing computational resources, and ensuring real-time process-
ing capabilities. Extensive testing on the NF-UNSW-NB15-v2 and CICIDS2017 datasets
demonstrates the model’s robustness, validating its effectiveness across diverse network
environments and attack scenarios. The model also addresses practical deployment chal-
lenges by minimizing false positives and false negatives, ensuring dependable performance
in real-world applications. Moreover, the evaluation framework for the Transformer-CNN
model goes beyond accuracy, incorporating a wide range of metrics to provide a thor-
ough performance assessment and address potential limitations in detection reliability and
practical implementation.
3. Proposed Approach
The Transformer-CNN model embodies a cutting-edge deep learning architecture
that fuses the strengths of Transformer and CNN to achieve exceptional performance in
both binary and multi-class classification tasks. This innovative framework proficiently
addresses critical challenges faced by IDS, particularly in enhancing classification accuracy
and mitigating class imbalances, with a primary emphasis on the NF-UNSW-NB15-v2 and
CICIDS2017 datasets. In this section, we outline the detailed steps involved in the model,
including comprehensive preprocessing procedures applied to the NF-UNSW-NB15-v2
dataset, followed by an evaluation of its performance on both the NF-UNSW-NB15-v2
and CICIDS2017 datasets. To tackle the issue of class imbalance, the model incorporates
a suite of advanced data preprocessing techniques. It employs ADASYN and SMOTE to
effectively oversample minority classes, thereby bolstering the model’s capacity to learn
from underrepresented yet crucial instances. Additionally, the model utilizes ENN for
strategic undersampling, while also applying class weights to recalibrate the importance of
each class during the training phase. This dual strategy not only ensures that challenging
cases receive adequate focus but also preserves a balanced class distribution throughout the
training process. The transformer component of the model is dedicated to contextual feature
extraction, empowering the system to adeptly analyze relationships and patterns within the
data. Meanwhile, the CNN efficiently processes the extracted features to accurately classify
specific attack types. This synergistic architecture significantly reduces the incidence of false
positives and false negatives, thereby enhancing the model’s ability to detect both known
threats and previously unseen (zero-day) attacks. The model’s outstanding performance is
underscored by its remarkable results on the NF-UNSW-NB15-v2 dataset, where it achieved
an impressive 99.71% accuracy in binary classification and 99.02% accuracy in multi-class
classification, as well as on the CICIDS2017 dataset, achieving 99.93% accuracy in binary
classification and 99.13% accuracy in multi-class classification. Figure 1 illustrates the model
architecture and its application to various classification tasks using the NF-UNSW-NB15-v2
dataset, providing a clear visual representation of its capabilities.
[Figure 1: flowchart. Box labels: Start; Dataset; Combine NF-UNSW-NB15-v2 Dataset Files; Split IoT-23 Dataset Files; Removing Outliers Using Z-Score and LOF; Feature Selection Using Correlation Technique; Numerical Columns Normalization (MinMaxNormalizer); Training File; Testing File; Class Resampling Using ADASYN Technique; Class Resampling Using ENN Technique; Class Weights; Model Training and Update; Model Validation; Output Results.]
Figure 1. Architectural design for binary classification and multi-class classification using NF-UNSW-NB15-v2 dataset.
in Figure 1, which illustrates the full workflow for both binary and multi-class classification
tasks on the NF-UNSW-NB15-v2 dataset.
3.2.1. Removing Outliers Using Z-Score and Local Outlier Factor (LOF)
Z-score was applied to detect and filter out extreme outliers in the dataset. Specifically,
the zscore function from the scipy.stats module calculated the z-scores for all features in the
DataFrame. Z-scores represent how far a data point is from the mean in terms of standard
deviations. A threshold of 6 was set, meaning any data point with a z-score greater than 6
in any feature was considered an outlier and removed. This process was applied for both
binary and multi-class classification to ensure that the dataset remained clean and free of
extreme outliers.
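The z-score filter described above can be sketched as follows. This is a minimal illustration rather than the exact preprocessing code: it assumes the threshold is applied to the absolute z-score, it computes the z-score directly with NumPy (population standard deviation, which matches the default of scipy.stats.zscore), and the demo array and the function name remove_zscore_outliers are hypothetical.

```python
import numpy as np


def remove_zscore_outliers(X, threshold=6.0):
    """Drop rows in which any feature has an absolute z-score above `threshold`."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)          # population std (ddof=0)
    std[std == 0] = 1.0          # constant columns: avoid division by zero
    z = np.abs((X - mean) / std)
    keep = (z <= threshold).all(axis=1)
    return X[keep], keep


# Synthetic demo data (assumption): 1000 rows, 3 features, one injected extreme value.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
X_demo[0, 1] = 100.0
X_clean, keep_mask = remove_zscore_outliers(X_demo, threshold=6.0)
print(X_demo.shape, "->", X_clean.shape)
```

A threshold as high as 6 removes only very extreme values, so most rows survive this step and the finer, density-based filtering is left to LOF.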
Following z-score filtering, the LOF method was implemented to further detect and eliminate
outliers. LOF identifies data points with significantly lower density compared to their
neighbors, making it particularly effective for datasets with varying density distributions.
The LOF was configured with n_neighbors set to 20 and contamination set to 0.1, indicating
that 10% of the data were expected to be outliers. After fitting the LOF model, samples
were classified as either outliers (labeled −1) or inliers (labeled 1). Only the inlier samples
were retained, resulting in a cleaner dataset for subsequent analysis. This dual approach of
outlier removal enhanced the performance and reliability of the classification models for
both binary and multi-class tasks.
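A minimal sketch of the LOF step, assuming scikit-learn's LocalOutlierFactor with the n_neighbors and contamination values stated above. The input array is synthetic and stands in for the z-score-filtered feature matrix.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in for the feature matrix (assumption, not the real dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))

# Settings taken from the text: 20 neighbors, 10% expected outliers.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels = lof.fit_predict(X_demo)      # -1 = outlier, 1 = inlier

# Keep only the inliers for the subsequent steps.
X_inliers = X_demo[labels == 1]
print(len(X_demo), "->", len(X_inliers))
```

Because contamination fixes the share of samples flagged as outliers, roughly 10% of the rows are removed regardless of how clean the data already are, which is consistent with the class counts dropping in every category after LOF.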
(i) Binary Classification
The NF-UNSW-NB15-v2 dataset underwent z-score filtering to remove extreme outliers,
ensuring higher data quality for binary classification. Outliers were identified and removed
based on how far data points deviated from the mean, improving the dataset’s reliability.
As shown in Table 4, the Benign class was reduced from 96,432 to 93,653 samples after
outlier removal. Similarly, other attack categories, such as Exploits and Fuzzers, decreased
from 18,804 to 17,576 and 12,999 to 11,695 samples, respectively. Smaller classes, including
Shellcode, Backdoor, and Worms, also experienced slight reductions. This filtering process
ensured that both majority and minority classes remained balanced while minimizing the
impact of noise and outliers on the classification model’s performance.
a reduction from 322 to 233 samples. The Worms class remained relatively stable, with
a slight decline from 89 to 87 samples. This filtering process ensured a cleaner dataset,
facilitating more accurate classification while reducing the influence of outliers on model
performance. The adjustments made through LOF provide a balanced representation of
both majority and minority classes, enhancing the reliability of the classification tasks.
| Class Type | Number of Samples Before LOF | Number of Samples After LOF |
| --- | --- | --- |
| Benign | 93,653 | 85,680 |
| Exploits | 17,576 | 14,969 |
| Fuzzers | 11,695 | 10,116 |
| Reconnaissance | 6883 | 6759 |
| Generic | 3211 | 2668 |
| DoS | 2180 | 1716 |
| Shellcode | 886 | 605 |
| Backdoor | 322 | 233 |
| Analysis | 324 | 304 |
| Worms | 89 | 87 |

| Class Type | Number of Samples Before LOF | Number of Samples After LOF |
| --- | --- | --- |
| Benign | 93,530 | 85,510 |
| Exploits | 17,492 | 14,933 |
| Fuzzers | 11,730 | 10,131 |
| Reconnaissance | 6881 | 6774 |
| Generic | 3234 | 2688 |
| DoS | 2195 | 1730 |
| Shellcode | 886 | 614 |
| Backdoor | 327 | 243 |
| Analysis | 330 | 316 |
| Worms | 92 | 88 |
Table 8. Selected features of NF-UNSW-NB15-v2 dataset in binary classification using correlation technique.
3.2.3. Normalization
Data scaling, a crucial preprocessing step in machine and deep learning, involves
adjusting numerical values to a specific range, thereby enhancing the efficiency and
effectiveness of the model. This scaling process is applied across all columns, ensuring
consistent data representation. After evaluating multiple normalization methods, the
MinMaxScaler, a widely used tool in the scikit-learn library, was chosen because it
performed best in our study. The normalization formula, as depicted in Equation (1) [80],
calculates each scaled value by subtracting the minimum value in the column and dividing
by the range (the difference between the maximum and minimum values). In this context,
X represents the original values, min(X) is the minimum value in the column, and max(X)
is the maximum value in the column. This normalization technique was applied to the
selected features in the dataset, ensuring consistent scaling for both binary and multi-class
classification tasks.
$$X_{\text{scaled}} = \frac{X - \min(X)}{\max(X) - \min(X)} \tag{1}$$
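As a worked illustration of Equation (1), the sketch below applies column-wise min-max scaling with NumPy. The helper name min_max_scale and the small demo matrix are hypothetical; scikit-learn's MinMaxScaler with its default feature range of (0, 1) applies the same per-column formula.

```python
import numpy as np


def min_max_scale(X):
    """Scale each column to [0, 1] following Equation (1)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # constant columns are mapped to 0
    return (X - col_min) / col_range


# Hypothetical feature matrix with two columns on very different scales.
X_demo = np.array([[10.0, 200.0],
                   [20.0, 400.0],
                   [40.0, 300.0]])
X_scaled = min_max_scale(X_demo)
print(X_scaled)
```

After scaling, the minimum of each column maps to 0 and the maximum to 1, so features with large raw magnitudes no longer dominate those with small ones.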
The testing file was employed for evaluating the NF-UNSW-NB15-v2 dataset, while
the complete training file was used for training in the initial approach.
$$n_i = \frac{N_{Maj} - N_{Min}}{N_{Min}} \cdot \left(1 - \frac{N_i}{k}\right) \tag{2}$$
In this context, NMaj and NMin represent the sample counts of the majority and
minority classes, respectively, highlighting the imbalance between them. The term Ni
denotes the number of minority class samples that fall within the radius defined by the k-
nearest neighbors, which helps in identifying minority instances near decision boundaries,
where synthetic samples are often generated to improve model performance.
For each minority instance $X_i$, synthetic samples are generated using the following
equation, as presented in Equation (3) [84].
$$X_{syn} = X_i + \gamma \cdot (X_j - X_i) \tag{3}$$
where $X_{syn}$ denotes the synthetic sample created to address class imbalance, $X_j$ represents
a randomly selected neighbor from the k-nearest neighbors of $X_i$, the minority sample, and
the term $\gamma$ is a random number between 0 and 1, ensuring that the synthetic sample is
generated along the line segment between $X_i$ and $X_j$.
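The two ADASYN formulas can be illustrated with a short sketch. It follows Equations (2) and (3) as they are written in the text and is not the implementation of any particular library (imbalanced-learn, for example, provides its own ADASYN class). The function names, the two-dimensional points, and the choice of k = 5 with two minority neighbors are assumptions for illustration; the class counts are the binary-classification counts reported for this dataset.

```python
import numpy as np


def synthetic_count(n_maj, n_min, n_i, k):
    """Equation (2) as written in the text: weight for one minority instance."""
    return (n_maj - n_min) / n_min * (1 - n_i / k)


def interpolate(x_i, x_j, gamma):
    """Equation (3): a point on the segment between x_i and its neighbor x_j."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    return x_i + gamma * (x_j - x_i)


rng = np.random.default_rng(0)
gamma = rng.random()                       # uniform in [0, 1)

# Hypothetical minority sample and one of its minority neighbors.
x_syn = interpolate([0.2, 0.4], [0.6, 0.0], gamma)

# 85,680 'Normal' vs 37,457 'Attack' samples; 2 of k=5 neighbors are minority (assumed).
n_generate = synthetic_count(n_maj=85680, n_min=37457, n_i=2, k=5)
print(x_syn, n_generate)
```

The fewer minority neighbors an instance has, the larger the value from Equation (2), so more synthetic points are placed near the decision boundary where the minority class is hardest to learn.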
(i) Binary Classification
The sample distribution in each class before and after applying the ADASYN resam-
pling technique for binary classification on the NF-UNSW-NB15-v2 dataset is presented
in Table 12. Initially, the dataset comprised 85,680 samples for the ‘Normal’ class and
37,457 samples for the ‘Attack’ class. After applying ADASYN, the number of samples
for the ‘Attack’ class increased to 85,777, while the count for the ‘Normal’ class remained
unchanged at 85,680. This adjustment underscores the effectiveness of ADASYN in ad-
dressing class imbalance by generating synthetic samples for the minority class, ultimately
enhancing the model’s ability to learn from a more balanced dataset.
Table 12. Sample distribution in each class before/after resampling using ADASYN for binary
classification on NF-UNSW-NB15-v2 dataset.
Table 13. Sample distribution in each class before/after resampling using ADASYN for multiclass
classification on NF-UNSW-NB15-v2 dataset.
2. ENN
ENN is a data preprocessing technique aimed at refining training datasets by remov-
ing noisy instances and improving class boundaries. This method examines the nearest
neighbors of each instance and eliminates those that are misclassified, thereby enhancing
the overall quality of the training data. ENN effectively reduces class overlap and helps
maintain a balanced representation of the classes, making it particularly useful in both
binary and multi-class classification scenarios. In this study, ENN was applied once for
binary classification and three times for multi-class classification. By applying ENN to
the training data, models can achieve better generalization and improved performance on
unseen data.
For each instance $X_i$ in the dataset, the k-nearest neighbors are determined. The set of
neighbors is defined in Equation (4) [85].
$$N(X_i) = \{X_{j_1}, X_{j_2}, \ldots, X_{j_k}\} \tag{4}$$
where $X_{j_1}, \ldots, X_{j_k}$ are the nearest neighbors of $X_i$ in terms of a distance metric
(e.g., Euclidean distance).
To calculate the majority class among the nearest neighbors, one can use the formula
represented in Equation (5) [85]. This involves determining the class labels of the nearest
neighbors and identifying which class occurs most frequently. By applying this method,
one can ensure that the predicted class for an instance is based on the most common class
among its neighbors, thereby enhancing the classification accuracy.
$$C(X_i) = \arg\max_{c} \sum_{j=1}^{k} \mathbb{1}(y_j = c) \tag{5}$$
Here, $C(X_i)$ denotes the predicted class for instance $X_i$, with $y_j$ representing the class
label of its j-th neighbor. The indicator function $\mathbb{1}(\cdot)$ outputs 1 if the condition
holds true, otherwise returning 0.
An instance $X_i$ is removed if its predicted class $C(X_i)$ does not match its actual class
$y_i$, using the formula in Equation (6) [85].
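The three ENN steps (neighbor search, majority vote, removal) can be sketched as below. This is a simplified illustration under stated assumptions: it uses brute-force Euclidean distances, a majority vote as in Equation (5), and removes mismatching instances from every class. Library implementations such as imbalanced-learn's EditedNearestNeighbours offer other selection rules and by default do not edit the minority class. The function name and toy data are hypothetical.

```python
import numpy as np


def edited_nearest_neighbors(X, y, k=3):
    """Remove every instance whose k-NN majority class differs from its own label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances (O(n^2) memory; fine for a small demo).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbor
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        neighbors = np.argsort(dists[i])[:k]   # neighbor set N(X_i), Eq. (4)
        classes, counts = np.unique(y[neighbors], return_counts=True)
        majority = classes[np.argmax(counts)]  # predicted class C(X_i), Eq. (5)
        keep[i] = (majority == y[i])           # removal rule
    return X[keep], y[keep], keep


# Toy data: two well-separated clusters plus one point labeled 1 inside cluster 0.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                   [10, 10], [10, 11], [11, 10], [11, 11],
                   [0.5, 0.5]])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
X_enn, y_enn, keep = edited_nearest_neighbors(X_demo, y_demo, k=3)
print(keep)
```

Only the point whose label disagrees with its neighborhood is discarded, which is how ENN reduces class overlap without touching well-separated regions.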
Table 14. Sample distribution in each Train class before/after resampling using ENN for binary
classification on NF-UNSW-NB15-v2 dataset.
Table 15. Sample distribution in each Train class before/after resampling using ENN for multi-class
classification on NF-UNSW-NB15-v2 dataset.
3. Class Weights
Class weights are a valuable technique used to address class imbalance in datasets by
assigning different weights to each class during model training. This approach ensures that
the model pays more attention to minority classes, thereby improving its ability to correctly
classify instances from these groups. By applying class weights to the training data, both
binary and multi-class classification tasks benefit from enhanced model performance and
generalization. This method helps mitigate the risks associated with biased predictions,
ensuring a more balanced representation of all classes throughout the learning process.
The class weights can be calculated using the following formula to address class im-
balance within the dataset. This approach assigns different weights to each class, ensuring
that the model pays more attention to minority classes during training. The formula for
calculating class weights is provided in Equation (7) [86].
$$\text{Weight}_c = \frac{N}{k \cdot n_c} \tag{7}$$
In this context, Weightc denotes the weight assigned to class c. The total number of
instances in the dataset is represented by N, while k indicates the total number of classes.
Additionally, nc signifies the number of instances belonging to class c.
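As a brief illustration, Equation (7) can be evaluated directly from a label vector. The sketch below is a minimal example under stated assumptions: the label counts are hypothetical (they are not the NF-UNSW-NB15-v2 counts) and the function name `class_weights` is illustrative.

```python
from collections import Counter

def class_weights(labels):
    """Weight_c = N / (k * n_c), following Equation (7)."""
    counts = Counter(labels)          # n_c for every class c
    n_total = len(labels)             # N
    n_classes = len(counts)           # k
    return {c: n_total / (n_classes * n_c) for c, n_c in counts.items()}

# Hypothetical label vector: 900 majority-class and 100 minority-class samples.
labels = ["Normal"] * 900 + ["Attack"] * 100
weights = class_weights(labels)
```

With these hypothetical counts the minority class receives a weight of 5.0 and the majority class a weight of about 0.56, so each class contributes equally to the weighted loss. This is the same rule that scikit-learn applies when `class_weight="balanced"` is requested.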
(i) Binary Classification
The weights assigned to each class in the training data for binary classification using
class weights on the NF-UNSW-NB15-v2 dataset are presented in Table 16. The ‘Normal’
class is assigned a weight of 1.0150, while the ‘Attack’ class receives a weight of 0.9855.
These weights reflect the importance of each class during model training, with the goal of
addressing any imbalances in the dataset. By incorporating these class weights, the model
can enhance its performance and improve the accuracy of its predictions, particularly for
the minority class.
Table 16. Weight in each train class using class weights for binary classification on NF-UNSW-NB15-
v2 dataset.
Table 17. Weight in each Train class using class weights for multi-class classification on NF-UNSW-
NB15-v2 dataset.
In this context, Z represents the output feature map, while X denotes the input feature
map. The convolution kernel is indicated by K, which is utilized in the convolution
operation to transform the input features into the output features.
The ReLU activation function, shown in Equation (9) [90], is a simple yet powerful
non-linear transformation. It outputs zero for negative inputs, retains positive values,
mitigates the vanishing gradient issue, and promotes sparse activations, enhancing both
training efficiency and model performance.
The max pooling operation reduces the spatial dimensions of the input feature map
while retaining the most important features. This operation selects the maximum value
from a specified pooling window, effectively downsampling the input. The equation for
the max pooling operation can be expressed as shown in Equation (10) [89].
$$P_{i,j} = \max\left(X_{i:i+p,\; j:j+q}\right) \qquad (10)$$
In this context, P refers to the pooled output generated from the pooling operation,
while p and q represent the dimensions of the pooling window used to aggregate the input
features into the pooled output.
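Because the models in this work use one-dimensional pooling layers, the operation can be illustrated with the 1D analogue of Equation (10). The sketch below assumes non-overlapping windows (stride equal to the pool size) and drops trailing elements that do not fill a window; both choices are assumptions, and the input values are hypothetical.

```python
def max_pool_1d(x, pool_size):
    """Non-overlapping 1D max pooling over a list of feature values."""
    n_windows = len(x) // pool_size   # incomplete trailing window is discarded
    return [max(x[i * pool_size:(i + 1) * pool_size]) for i in range(n_windows)]

features = [0.2, 0.9, 0.1, 0.4, 0.7, 0.3, 0.5]
# Windows: [0.2, 0.9], [0.1, 0.4], [0.7, 0.3]; the final 0.5 is dropped.
pooled = max_pool_1d(features, pool_size=2)
```

Each output element is the maximum of one window, so a pool size of 2 roughly halves the sequence length.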
The dropout layer randomly sets a fraction p of input units to zero during training to
prevent overfitting. This technique helps to improve the model’s generalization by ensuring
that it does not rely too heavily on any single input feature, as detailed in Equation (11) [91].
$$\mathrm{Dropout}(x) = \begin{cases} x & \text{with probability } 1 - p \\ 0 & \text{with probability } p \end{cases} \qquad (11)$$
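A literal reading of Equation (11) can be sketched as follows. This is an illustrative example only: the seed, the input vector, and the function name are assumptions. Many frameworks additionally rescale the retained units by 1/(1 − p) during training; that rescaling is not part of Equation (11) and is omitted here.

```python
import random

def dropout(values, p, rng):
    """Zero each element independently with probability p (Equation (11))."""
    return [0.0 if rng.random() < p else v for v in values]

rng = random.Random(0)            # fixed seed so the sketch is reproducible
x = [1.0] * 100_000
dropped_half = dropout(x, p=0.5, rng=rng)
dropped_tiny = dropout(x, p=1e-7, rng=rng)

frac_zero_half = dropped_half.count(0.0) / len(x)
frac_zero_tiny = dropped_tiny.count(0.0) / len(x)
```

With p = 0.5 about half of the units are zeroed, whereas with a rate as small as 1 × 10⁻⁷ almost every unit passes through unchanged.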
For binary classification tasks, the output layer utilizes the sigmoid function, which
outputs a probability score indicating the likelihood of an instance belonging to the positive
class. This is mathematically expressed in Equation (12) [92].
$$\sigma(Z) = \frac{1}{1 + e^{-Z}} \qquad (12)$$
where Z is the output from the last dense layer.
The output layer employs the softmax function for multi-class classification, allow-
ing the model to produce a probability distribution across multiple classes. This can be
mathematically represented as shown in Equation (13) [92].
$$\mathrm{Softmax}(Z_i) = \frac{e^{Z_i}}{\sum_{j} e^{Z_j}} \qquad (13)$$
where Zi is the output for class i, and Zj represents the raw score for class j.
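The two output activations in Equations (12) and (13) can be illustrated with a minimal Python sketch. The raw scores and the 0.5 decision threshold are assumptions chosen for illustration. Subtracting the maximum score before exponentiating is a common numerical-stability step that leaves the softmax result unchanged.

```python
import math

def sigmoid(z):
    """Equation (12): maps a raw score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Equation (13), with the maximum subtracted first for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Binary case: one raw score, thresholded at an assumed cut-off of 0.5.
p_attack = sigmoid(2.0)
label = int(p_attack >= 0.5)

# Multi-class case: one raw score per class; the largest probability wins.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```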
(i) Binary Classification
The architecture of the CNN model designed for binary classification is detailed
in Table 18. The model architecture is shared across both the NF-UNSW-NB15-v2 and
CICIDS2017 datasets, with the input block differing to accommodate the specific features
of each dataset. For the NF-UNSW-NB15-v2 dataset, the input layer processes 25 distinct
features, while the CICIDS2017 dataset input layer accommodates 69 features. This input
layer serves as the foundation for subsequent computations. The CNN model for binary
classification begins with the first hidden block, which includes a one-dimensional (1D)
CNN layer with 256 filters, using the ReLU activation function to introduce non-linearity
and enhance feature extraction. Following this, a 1D max pooling layer with a pool size of
2 is employed to downsample the data, preserving critical features. A dropout layer with a
very low rate of 0.0000001 is incorporated to mitigate the risk of overfitting. The second
hidden block replicates this structure, incorporating another 1D CNN layer with 256 filters
and ReLU activation, followed by a 1D max pooling layer with a pool size of 4, and another
dropout layer with the same low rate to maintain generalization. In the third hidden
block, a dense layer with 1024 neurons is used, employing ReLU activation to facilitate
complex feature interactions. This is followed by another dropout layer to further enhance
robustness against overfitting. The final output block consists of a single neuron configured
with a sigmoid activation function, which is critical for producing binary classification
outputs for both datasets. This carefully structured architecture, as summarized in Table 18,
is optimized to effectively process the unique characteristics of the NF-UNSW-NB15-v2
and CICIDS2017 datasets, ensuring reliable and accurate binary classification performance.
(ii) Multi-Class Classification
The CNN model designed for multi-class classification features a comprehensive
architecture, as detailed in Table 19. This architecture is tailored to process the distinct
features of both the NF-UNSW-NB15-v2 and CICIDS2017 datasets, with the input block
configured specifically for each dataset. For the NF-UNSW-NB15-v2 dataset, the input layer
processes 27 features, while the CICIDS2017 dataset input layer handles 35 features. These
input layers provide the foundation for the model to effectively capture dataset-specific
information relevant to the classification task. The model begins with the first hidden
block, which includes a one-dimensional (1D) CNN layer with 256 filters, employing the
ReLU activation function to enable effective feature extraction. This is followed by a 1D
max pooling layer with a pool size of 2, which reduces dimensionality while retaining
critical information. To minimize overfitting, a dropout layer with an extremely low rate of
0.0000001 is applied. The second hidden block mirrors this structure, featuring another 1D
CNN layer with 256 filters and ReLU activation, followed by a 1D max pooling layer with a
pool size of 4 and a dropout layer with the same low rate to maintain generalization. In the
third hidden block, a dense layer with 1024 neurons is used, employing ReLU activation
to enhance the model’s ability to learn complex feature relationships. This is followed
by another dropout layer to further strengthen the model’s capacity to generalize well to
unseen data. The output block varies depending on the dataset. For the NF-UNSW-NB15-v2
dataset, the output layer consists of 10 neurons with a softmax activation function, allowing
the model to output probabilities across 10 classes. For the CICIDS2017 dataset, the output
layer comprises 15 neurons, also using a softmax activation function to accommodate its
multi-class structure. This carefully designed architecture, as summarized in Table 19, is
optimized to handle the unique characteristics of both datasets, ensuring effective learning
and high-performance multi-class classification.
approach allows the model to make finer adjustments during training, which can help
accelerate convergence. To avoid excessively small updates, the learning rate is bounded below
by a minimum value of 1 × 10⁻⁵. This method ensures more efficient and stable training,
enabling the model to converge steadily without overshooting the optimal solution. Across
both classifier types, the Adam optimizer is employed, known for its adaptive learning
rate capabilities, which enhances training performance. The choice of loss function is
tailored to the nature of the classification task. Binary cross-entropy is adopted for the
binary classification scenario, while categorical cross-entropy is utilized in multi-class
classification, ensuring appropriate measurement of model performance based on the
output format. Lastly, accuracy is designated as the evaluation metric for both classifiers,
providing a straightforward assessment of their performance in correctly classifying the
input data. This careful selection and configuration of hyperparameters are essential for
optimizing the effectiveness of the CNN models in their respective classification tasks.
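The effect of the learning-rate floor can be illustrated with a small sketch. The scheduling rule itself is not fully specified in this passage, so the example assumes a reduce-on-plateau rule; the initial rate, reduction factor, patience, and loss values are hypothetical, and only the minimum of 1 × 10⁻⁵ is taken from the text.

```python
def reduce_on_plateau(val_losses, initial_lr=1e-3, factor=0.5, patience=2, min_lr=1e-5):
    """Return the learning rate used in each epoch.

    initial_lr, factor and patience are illustrative assumptions;
    only min_lr (1e-5) comes from the text.
    """
    lr, best, wait, history = initial_lr, float("inf"), 0, []
    for loss in val_losses:
        history.append(lr)
        if loss < best:
            best, wait = loss, 0          # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)   # never drop below the floor
                wait = 0
    return history

# Hypothetical validation loss that stops improving after the third epoch.
losses = [0.9, 0.5, 0.4] + [0.4] * 30
lrs = reduce_on_plateau(losses)
```

In this sketch the rate is halved every two stagnant epochs until it reaches the floor, after which it stays at 1 × 10⁻⁵.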
In this formulation, h^(l) denotes the output of encoder layer l, while a^(l−1) represents
the output of the previous layer; for the first layer, a^(0) is the input vector. The
weight matrix for layer l is indicated by W^(l), and b^(l) denotes the bias vector for that layer.
The activation function, denoted as f, is the ReLU function in this context.
In a standard auto encoder, the decoder layer reconstructs the input from the com-
pressed representation learned by the encoder. This reconstruction process can be mathe-
matically represented as detailed in Equation (15) [89].
$$a' = g\left(W^{(d)} h^{(l)} + b^{(d)}\right) \qquad (15)$$
In this context, a′ represents the reconstructed output, while h^(l) is the output from the
last encoder layer. The weight matrix for the decoder layer is denoted as W^(d), and b^(d)
indicates the bias vector for the decoder layer. The activation function used for the decoder
is represented by g, which is typically linear for reconstruction purposes.
The classification layer utilizes a specific activation function for binary classification
output. This can be expressed as presented in Equation (16) [89].
$$y = \sigma\left(W^{(out)} h^{(l)} + b^{(out)}\right) \qquad (16)$$
In this framework, y denotes the predicted probability for the positive class. The
weight matrix for the output layer is represented by W (out) , while b(out) signifies the bias
for the output layer. The sigmoid function, denoted as σ, is employed to map the output to
a probability score between 0 and 1.
For multi-class classification output, the classification layer employs the softmax
activation function, which enables the model to generate a probability distribution across
multiple classes. This can be expressed as presented in Equation (17) [92].
$$y = \mathrm{softmax}\left(W^{(out)} h^{(l)} + b^{(out)}\right) \qquad (17)$$
In this context, y represents the vector of predicted probabilities across multiple classes.
The weight matrix W (out) and bias b(out) are associated with the output layer. The softmax
function is utilized to convert the logits into probabilities, ensuring that the predicted
values sum to one across all classes.
(i) Binary Classification
The architecture outlined in Table 21 presents the layers of the Auto Encoder model
designed for binary classification, tailored for both the NF-UNSW-NB15-v2 and CICIDS2017
datasets. The input block is configured to accommodate the unique features of each dataset.
For the NF-UNSW-NB15-v2 dataset, the input layer accepts 25 features, while for the CICIDS2017
dataset it accepts 69 features. This input layer serves as the entry point for data, providing the
foundation for the model’s operations. The encoder structure comprises three dense layers,
with 128, 64, and 32 neurons, respectively. Each dense layer utilizes the ReLU activation
function, which introduces non-linearity and facilitates the extraction of complex patterns
within the data. These layers effectively compress the input data into a lower-dimensional
latent space, capturing the most critical features necessary for effective classification. The
final output block consists of a single neuron activated by a sigmoid function. This layer
generates a probability score indicating the likelihood of the input data belonging to the
positive class, enabling binary classification. The architecture is designed to distinguish
effectively between the two classes, ensuring robust performance across both datasets.
This carefully structured model leverages its shared architecture to handle the unique
characteristics of the NF-UNSW-NB15-v2 and CICIDS2017 datasets, enhancing its overall
classification effectiveness.
enhances the model’s ability to converge effectively. Furthermore, the learning rate is
bounded below by a minimum value of 1 × 10⁻⁵ to prevent it from becoming too small to produce
meaningful updates. This strategy strikes a balance between accelerating convergence in
the early stages and allowing for finer adjustments as the model nears optimal performance,
ultimately leading to more reliable and efficient training. The Adam optimizer is employed
for efficient weight updates, while the choice of loss function depends on the classification
task. Binary cross-entropy is used for binary classification, and categorical cross-entropy is
applied for multi-class classification. For performance evaluation, accuracy is chosen as the
primary metric, offering a comprehensive assessment of the model’s ability to classify data
correctly in both contexts.
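The two loss functions named above can be written out in a few lines. The sketch below is illustrative only: the clipping constant `eps` and the example targets and probabilities are assumptions, not values from the experiments.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; y_prob are sigmoid outputs."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)          # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-7):
    """Mean categorical cross-entropy; each row of y_prob is a softmax output."""
    total = 0.0
    for target_row, prob_row in zip(y_onehot, y_prob):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(target_row, prob_row))
    return total / len(y_onehot)

# Hypothetical predictions for two binary samples and one three-class sample.
bce = binary_cross_entropy([1, 0], [0.9, 0.2])
cce = categorical_cross_entropy([[0, 1, 0]], [[0.1, 0.8, 0.1]])
```

Binary cross-entropy pairs with a single sigmoid output, while categorical cross-entropy pairs with a softmax output and one-hot encoded targets.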
In this context, a(l ) denotes the activation of the current layer l. The weight matrix
for this layer is represented by W (l ) , while b(l ) signifies the bias vector for layer l. The
activation function f is applied element-wise, which may include functions such as ReLU
or sigmoid, to introduce non-linearity into the model.
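The layer computation described above, an affine transformation followed by an element-wise activation, can be sketched as follows. The weights, biases, and input are hypothetical values chosen so the result is easy to check by hand, and the function names are illustrative.

```python
def relu(v):
    """Element-wise ReLU: negative values become zero."""
    return [max(0.0, x) for x in v]

def dense(a_prev, W, b, activation=relu):
    """a = f(W a_prev + b); W is a list of rows, one row per output neuron."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    return activation(z)

# Hypothetical layer with 3 inputs and 2 neurons.
W = [[1.0, -2.0, 0.5],
     [0.0, 1.0, 1.0]]
b = [0.5, -4.0]
a = dense([1.0, 1.0, 2.0], W, b)   # pre-activations are 0.5 and -1.0
```

The second neuron has a negative pre-activation, so ReLU sets its output to zero, which is the sparsity effect mentioned in the text.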
The ReLU activation function, shown in Equation (19) [94], is a simple and efficient
non-linear function that outputs zero for negative values, promotes sparse activations, and
supports effective gradient flow, making it ideal for deep learning.
For binary classification tasks, the sigmoid function is utilized, which outputs a probability
score indicating the likelihood of an instance belonging to the positive class. The sigmoid
function transforms the raw score into a value between 0 and 1, which is then compared against
a decision threshold to assign the class label. This can be represented as presented in Equation (20) [92].
$$\sigma(Z) = \frac{1}{1 + e^{-Z}} \qquad (20)$$
where Z is the output from the last dense layer.
The output layer employs the softmax function for multi-class classification tasks, en-
abling the model to produce probability distributions across multiple classes. This function
takes a vector of raw scores (logits) and normalizes them into a range between 0 and 1,
where the sum of the probabilities equals 1. The softmax function can be mathematically
expressed as shown in Equation (21) [92].
$$\mathrm{Softmax}(Z_i) = \frac{e^{Z_i}}{\sum_{j} e^{Z_j}} \qquad (21)$$
where Zi is the output from the last dense layer for class i, and Zj represents the raw score
for class j.
(i) Binary Classification
The architecture detailed in Table 24 presents the structure of the DNN model specifically
designed for binary classification, with tailored configurations for the NF-UNSW-NB15-v2
and CICIDS2017 datasets. The input block accepts 25 features for the NF-UNSW-NB15-v2
dataset and 69 features for the CICIDS2017 dataset and passes them to a dense layer with
1024 neurons, where the ReLU activation function is applied to introduce non-linearity
and enhance the model’s ability to learn complex representations from the data.
The first hidden block includes a dropout layer with a very low dropout rate to mitigate
overfitting, followed by a dense layer with 768 neurons. The dense layer is equipped with
a ReLU activation function to introduce non-linearity, enhancing the model’s ability to
learn complex patterns. Batch normalization is applied after the dense layer to stabilize
the learning process by normalizing the outputs, ensuring more effective and consistent
training. The second hidden block contains another dropout layer and batch normalization,
further refining the learning dynamics. Ultimately, the architecture concludes with an out-
put layer featuring a single neuron activated by the sigmoid function. This configuration
is meticulously crafted to enhance the model’s effectiveness in binary classification tasks
across both datasets.
scenario. In both cases, accuracy serves as the primary evaluation metric, providing a clear
measure of the models’ effectiveness in classifying data accurately.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \qquad (22)$$
In this context, Q represents the query matrix, K denotes the key matrix, and V signifies
the value matrix. The variable dk refers to the dimension of the keys, which plays a crucial
role in the computation of attention scores within the model.
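Scaled dot-product attention for a single head can be sketched in plain Python as follows. The matrices are hypothetical and deliberately tiny; the function names are illustrative and the example is not the authors' implementation.

```python
import math

def softmax(row):
    """Row-wise softmax with the maximum subtracted for numerical stability."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]      # identical keys give equal attention weights
V = [[2.0, 4.0], [6.0, 8.0]]
context = attention(Q, K, V)
```

Because both keys are identical, the query attends to them equally and the output is the average of the two value rows.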
The attention calculation is performed independently in each head, and the head outputs are
then concatenated, as detailed in Equation (23) [95].
In this context, W^O refers to the output weight matrix, which projects the concatenated
head outputs to form the output of the multi-head attention layer.
Layer normalization stabilizes the output of each layer, using Equation (24) [96].
$$\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta \qquad (24)$$
In this context, µ represents the mean of the inputs, while σ denotes the standard deviation.
Additionally, γ and β are learnable parameters that are utilized in the normalization process.
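Layer normalization over a single feature vector can be sketched as follows. Equation (24) does not show a stabilizing constant, so placing epsilon = 1 × 10⁻⁶ inside the square root is an assumption based on common practice; scalar γ and β are a further simplification, since in practice they are usually learned per feature.

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """(x - mean) / sqrt(var + eps) * gamma + beta over one feature vector."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    sigma = math.sqrt(var + eps)      # eps keeps the division well defined
    return [(v - mu) / sigma * gamma + beta for v in x]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
```

With γ = 1 and β = 0 the output has a mean of approximately zero and a variance of approximately one.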
The FFN processes the output from the attention mechanism as presented in Equation (25) [95].
In this context, W1 and W2 refer to the weight matrices, while b1 and b2 represent
the corresponding biases associated with the layers in the model.
The convolution operation in the CNN layers can be defined as in Equation (26) [89].
In this scenario, Z denotes the output feature map resulting from the convolution
process, while X represents the input feature map. The convolution kernel, denoted as K, is
applied to the input feature map to produce the output feature map.
The ReLU activation function, presented in Equation (27) [90], is efficient and straight-
forward, outputting zero for negative inputs while passing positive values through un-
changed. Its ability to promote sparse activations and facilitate gradient flow makes it
particularly effective in deep learning applications.
The max pooling operation can be expressed as presented in Equation (28) [89].
$$P_{i,j} = \max\left(X_{i:i+p,\; j:j+q}\right) \qquad (28)$$
In this context, P represents the pooled output generated from the pooling operation,
while p and q denote the dimensions of the pooling window applied to the input
feature map.
The dropout layer randomly sets a fraction p of input units to zero during training to
prevent overfitting, as presented in Equation (29) [91].
$$\mathrm{Dropout}(x) = \begin{cases} x & \text{with probability } 1 - p \\ 0 & \text{with probability } p \end{cases} \qquad (29)$$
For binary classification tasks, the sigmoid function is utilized, which outputs a
probability score indicating the likelihood of an instance belonging to the positive class, as
presented in Equation (30) [92].
$$\sigma(Z) = \frac{1}{1 + e^{-Z}} \qquad (30)$$
where Z is the output from the last dense layer.
The output layer employs the softmax function for multi-class classification tasks, en-
abling the model to produce probability distributions across multiple classes, as presented
in Equation (31) [92].
$$\mathrm{Softmax}(Z_i) = \frac{e^{Z_i}}{\sum_{j} e^{Z_j}} \qquad (31)$$
where Zi is the output for class i, and Zj represents the raw score for class j.
(i) Binary Classification
The architecture of the Transformer model designed for binary classification is detailed
in Table 27. The model begins with an input layer that processes data structured as (25, 1) for
the NF-UNSW-NB15-v2 dataset and (69, 1) for the CICIDS2017 dataset, effectively accom-
modating input with 25 features for NF-UNSW-NB15-v2 and 69 features for CICIDS2017.
Following this, the Transformer block employs a multi-head attention mechanism with
eight heads and a key dimension of 128. This mechanism captures complex relationships
within the input data, enhancing the model’s ability to identify intricate patterns. The
output from the attention layer is subsequently normalized using layer normalization with
an epsilon value of 1 × 10⁻⁶, which helps stabilize the output. A residual connection is
implemented to add the original input data back to the attention output, promoting stability
during training. The feed-forward block consists of a dense layer with 512 units and a ReLU
activation function, applying a transformation to the data. This is followed by a dropout
layer with a rate of 0.0000001, aimed at mitigating overfitting by regularizing the network.
Another dense layer with 512 units is included without an activation function, allowing
for additional transformations. A subsequent dropout layer with the same rate further
reinforces regularization, enhancing model robustness. The output from the feed-forward
network is then added back to the previous block’s output via another residual connection,
followed by another layer normalization step with epsilon = 1 × 10⁻⁶ to normalize the
combined output, ensuring stability in the model’s learning process.
The architecture of the CNN model designed for binary classification utilizes the
output of the Transformer model as its input and is tailored for datasets like NF-UNSW-
NB15-v2 and CICIDS2017. The input block processes the Transformer output, providing
structured input for the model. The first hidden block includes a 1D CNN layer with
512 filters and a ReLU activation function, which extracts essential features from the input
data. This is followed by a 1D max pooling layer with a pool size of two, reducing the
dimensionality of the feature maps, and a dropout layer with a rate of 0.0000001 to mitigate
overfitting. The second hidden block repeats this structure with another 1D CNN layer
with 512 filters and ReLU activation, followed by a 1D max pooling layer with a pool size of
four and a dropout layer with the same dropout rate. In the third hidden block, the model
incorporates a dense layer with 1024 units and a ReLU activation function, enhancing
the model’s representational capabilities. A dropout layer with a rate of 0.0000001 is
again applied for additional regularization. The architecture concludes with a single-
output layer employing a sigmoid activation function for binary classification, producing a
probability score to determine class membership. The detailed structure, including layer
sizes, activation functions, and dropout rates, is outlined in Table 28.
| Dataset | Block | Layer Type | Output Size | Activation Function | Parameters | Description |
|---|---|---|---|---|---|---|
| NF-UNSW-NB15-v2 | Input block | Input layer | (25, 1) | - | - | Accepts input data with 25 features. |
| CICIDS2017 | Input block | Input layer | (69, 1) | - | - | Accepts input data with 69 features. |
| Shared Structure | Transformer block | Multi-head attention | - | - | num_heads = 8, key_dim = 128 | Captures complex relationships within input data. |
| Shared Structure | Transformer block | Layer Normalization | - | - | epsilon = 1 × 10⁻⁶ | Normalizes the output from the attention layer. |
| Shared Structure | Transformer block | Add (Residual Connection) | - | - | - | Adds input data to the attention output for stability. |
| Shared Structure | Feed Forward block | Dense layer | 512 | ReLU | units = 512, activation = ‘relu’ | Applies a dense transformation with ReLU activation. |
| Shared Structure | Feed Forward block | Dropout layer | - | - | rate = 0.0000001 | Regularizes the network to prevent overfitting (p = 0.0000001). |
| Shared Structure | Feed Forward block | Dense layer | 512 | - | units = 512 | Another dense transformation without activation. |
| Shared Structure | Feed Forward block | Dropout layer | - | - | rate = 0.0000001 | Further regularization (p = 0.0000001). |
| Shared Structure | Feed Forward block | Add (Residual Connection) | - | - | - | Adds feed-forward output to the previous block output. |
| Shared Structure | Feed Forward block | Layer Normalization | - | - | epsilon = 1 × 10⁻⁶ | Normalizes the combined output for stability. |
from the attention layer is then normalized using layer normalization with an epsilon
value of 1 × 10⁻⁶, which contributes to stabilizing the output. A residual connection is
established to add the original input data back to the attention output, promoting stability
during training. The feed-forward block consists of a dense layer with 512 units and a
ReLU activation function, which applies a transformation to the data. This is followed by
a dropout layer with a rate of 0.0000001, designed to mitigate overfitting by regularizing
the network. An additional dense layer with 512 units is included without an activation
function, allowing for further transformations. A subsequent dropout layer with the same
rate reinforces regularization, enhancing the model’s robustness. The output from the feed-
forward network is then added back to the previous block’s output via another residual
connection, followed by an additional layer normalization step with epsilon = 1 × 10⁻⁶ to
normalize the combined output, ensuring stability in the model’s learning process.
| Dataset | Block | Layer Type | Output Size | Activation Function | Parameters | Description |
|---|---|---|---|---|---|---|
| NF-UNSW-NB15-v2 | Input block | Input layer | (27, 1) | - | - | Accepts input data with 27 features. |
| CICIDS2017 | Input block | Input layer | (35, 1) | - | - | Accepts input data with 35 features. |
| Shared Structure | Transformer block | Multi-head attention | - | - | num_heads = 8, key_dim = 128 | Captures complex relationships within input data. |
| Shared Structure | Transformer block | Layer Normalization | - | - | epsilon = 1 × 10⁻⁶ | Normalizes the output from the attention layer. |
| Shared Structure | Transformer block | Add (Residual Connection) | - | - | - | Adds input data to the attention output for stability. |
| Shared Structure | Feed Forward block | Dense layer | 512 | ReLU | units = 512, activation = ‘relu’ | Applies a dense transformation with ReLU activation. |
| Shared Structure | Feed Forward block | Dropout layer | - | - | rate = 0.0000001 | Regularizes the network to prevent overfitting (p = 0.0000001). |
| Shared Structure | Feed Forward block | Dense layer | 512 | - | units = 512 | Another dense transformation without activation. |
| Shared Structure | Feed Forward block | Dropout layer | - | - | rate = 0.0000001 | Further regularization (p = 0.0000001). |
| Shared Structure | Feed Forward block | Add (Residual Connection) | - | - | - | Adds feed-forward output to the previous block output. |
| Shared Structure | Feed Forward block | Layer Normalization | - | - | epsilon = 1 × 10⁻⁶ | Normalizes the combined output for stability. |
The architecture of the CNN model designed for multi-class classification leverages the
output of the Transformer model as its input, tailored for both the NF-UNSW-NB15-v2 and
CICIDS2017 datasets. The model starts with an input block that processes the Transformer
output. The first hidden block incorporates a 1D CNN layer with 512 filters and a ReLU
activation function, enabling the extraction of critical features from the data. This is
followed by a 1D max pooling layer with a pool size of two, which reduces dimensionality,
and a dropout layer with a rate of 0.0000001 to mitigate overfitting. In the second hidden
block, another 1D CNN layer with 512 filters and a ReLU activation function is utilized,
accompanied by a 1D max pooling layer with a pool size of 4 and another dropout layer
with the same rate, reinforcing regularization. The third hidden block comprises a dense
layer with 1024 units and a ReLU activation function, further enhancing the model’s ability
to represent complex patterns. This block also includes a dropout layer with a rate of
0.0000001 for additional regularization. The output block varies based on the dataset.
For the NF-UNSW-NB15-v2 dataset, the output layer consists of 10 units, while for the
CICIDS2017 dataset, it includes 15 units. Both employ a softmax activation function to
perform multi-class classification. The complete architecture, including layer specifications,
is outlined in Table 30.
Table 30. CNN model layers for multi-class classification.
and 14 distinct attack types [12]. The dataset is organized into eight files representing five
days of benign and attack traffic, with each file containing real-world network data [98,99].
In addition to the core traffic data, the records include supplementary metadata and are
provided in packet-based and bidirectional flow-based formats [97]. The dataset is fully labeled,
making it suitable for both binary and multi-class classification tasks. For binary classifica-
tion, all attack types are labeled as ‘1’, while benign traffic is labeled as ‘0’. For multi-class
classification, all attack types are considered individually, providing a comprehensive view
of the different forms of network attacks. The CICIDS2017 dataset, while extensive, requires
meticulous preprocessing to address missing data and enhance its quality for analysis.
Preprocessing began by consolidating the dataset’s eight constituent files into a single
comprehensive dataset. Missing values, or NaNs, were systematically addressed to prevent
data quality issues. Duplicates were eliminated, and columns with only a single unique
value were removed to optimize feature relevance. Remaining NaN values were carefully
imputed, and feature names were standardized by stripping leading spaces for uniformity.
Sampling was then performed, and for multi-class classification, instances belonging to
the ‘Normal’ class were excluded post-sampling. To eliminate extreme values that could
bias model outcomes, outliers were identified and removed using the LOF. In multi-class
classification, feature selection based on correlation was applied following outlier removal to
refine the feature set further. Then, numerical features were normalized using MinMaxScaler
to ensure consistent scaling across variables. After these steps, the dataset was partitioned
into training and testing subsets. To address class imbalances during training, advanced
resampling techniques were implemented. For binary classification, the enhanced hybrid
ADASYN-SMOTE method was applied to generate synthetic samples within the training
data, while for multi-class classification, an advanced cascaded SMOTE approach was
utilized to balance the training dataset effectively. Additionally, the ENN technique was
employed to undersample the training data, further refining class distribution and im-
proving model robustness. Class weights were dynamically adjusted during the training
process to ensure balanced learning across all classes. Collectively, these preprocessing
strategies transformed the raw CICIDS2017 dataset into a well-balanced and optimized
resource, tailored for binary and multi-class classification tasks.
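The min-max normalization step mentioned above rescales each numerical feature to the range [0, 1]. The sketch below shows the computation for a single column; the column values and the function name are hypothetical, and the guard for constant columns is an added assumption (such columns were removed earlier in the pipeline described here).

```python
def min_max_scale(column):
    """Rescale one feature column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]

# Hypothetical feature column.
flow_bytes = [10.0, 20.0, 40.0]
scaled = min_max_scale(flow_bytes)
```

The smallest value maps to 0, the largest to 1, and intermediate values keep their relative spacing.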
• False Positive (FP): These are the instances that the model incorrectly predicted to be
positive. In the spam filter context, if it mistakenly classified a legitimate email as
spam, this is a false positive.
Equation (32) [101] illustrates the most basic and fundamental metric, accuracy, which
can be derived from the confusion matrix.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (32)$$
It is common to evaluate the model using a variety of additional metrics, including
recall, precision, and the F-score. Precision is determined by dividing the number of
true positive results by the total number of predicted positive results, encompassing both
correct and incorrect identifications. This metric, also known as positive predictive value,
is calculated using Equation (33) [101]. Recall, defined in Equation (34) [101], assesses
the proportion of actual positive instances that the model correctly identifies among all
instances that should have been recognized as positive. The F-score, computed using
Equation (35) [102], serves as the harmonic mean of precision and recall, providing a
balanced measure of the model’s performance.
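$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (33)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (34)$$
$$F_{score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (35)$$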
In this scenario, the goal is to enhance metrics including the F-score, accuracy, recall,
and precision, as outlined by the evaluation criteria.
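Equations (32) to (35) can be evaluated directly from the four confusion-matrix counts. In the sketch below the counts are hypothetical and are not results from this study; the function does not guard against zero denominators, which would need handling for degenerate cases.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)                 # Equation (32)
    precision = tp / (tp + fp)                                 # Equation (33)
    recall = tp / (tp + fn)                                    # Equation (34)
    f_score = 2 * precision * recall / (precision + recall)    # Equation (35)
    return accuracy, precision, recall, f_score

# Hypothetical confusion-matrix counts, not results from the paper.
acc, prec, rec, f1 = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```

For these counts the accuracy is 0.875, while precision (about 0.857) and recall (0.9) differ, which is why the F-score is reported alongside accuracy.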
4.4. Results
The evaluation of the proposed models was conducted across two primary phases,
training and testing, utilizing the train and test subsets of the NF-UNSW-NB15-v2 dataset,
with additional evaluation on other datasets like CICIDS2017 to demonstrate the models’
generalizability. These experiments targeted both binary and multi-class classification tasks,
ensuring accurate detection of malicious activities and precise identification of various
attack types. A comprehensive analysis was performed to assess the impact of data re-
sampling techniques on the models’ performance, offering a thorough comparison of their
effectiveness. The models were also benchmarked against established intrusion detection
systems from the literature, providing valuable insights into their relative strengths and
weaknesses in a broader context. The results from both the NF-UNSW-NB15-v2 and CI-
CIDS2017 datasets underscore the effectiveness and versatility of the proposed models
in addressing complex classification challenges. Among the evaluated approaches, the
Transformer-CNN model consistently emerged as the top performer, demonstrating excep-
tional accuracy in detecting malicious activities and classifying diverse attack types. While
other models, such as auto encoder, DNN and CNN, delivered commendable results, the
Transformer-CNN model proved to be the most resilient and reliable across all evaluation
metrics, highlighting the critical role of applied preprocessing techniques and emphasizing
the robustness and generalizability of the models.
(i) Binary Classification
The performance metrics presented in Table 32 illustrate the results of binary classi-
fication on the NF-UNSW-NB15-v2 and CICIDS2017 datasets, utilizing data resampling
techniques and class weights. Each model demonstrated impressive performance across
all metrics, highlighting their reliability and robustness in binary classification tasks. On
Future Internet 2024, 16, 481 60 of 74
the NF-UNSW-NB15-v2 dataset, the CNN model achieved an accuracy of 99.69%, with
precision, recall, and F-score all matching at 99.69%. The auto encoder reported an ac-
curacy of 99.66%, and similarly, the DNN model achieved an accuracy of 99.68%, with
corresponding precision, recall, and F-score values of 99.68%. The Transformer-CNN model
outperformed the others, achieving the highest accuracy at 99.71%, along with matching
precision, recall, and F-score metrics of 99.71%. On the CICIDS2017 dataset, the CNN model
demonstrated outstanding performance, achieving an accuracy of 99.86%, with precision,
recall, and F-score all equally high at 99.86%. The auto encoder model, while slightly
lower, still achieved a strong accuracy of 99.73%, with corresponding precision, recall,
and F-score values matching at 99.73%, suggesting it is effective in identifying anomalies
and classifying the data accurately. The DNN model reported an impressive accuracy of
99.88%, with precision, recall, and F-score values consistently high at 99.88%, indicating
that it is highly reliable in distinguishing between the different classes within the dataset.
However, the Transformer-CNN model stood out as the best performer, achieving the
highest accuracy of 99.93%, with precision, recall, and F-score all at 99.93%. These results
highlight the impressive performance of each model in binary classification tasks across
both datasets, showcasing their reliability and robustness for real-world applications. The
Transformer-CNN model, in particular, emerged as the most effective, achieving the highest
performance in binary classification on both datasets.
Table 32. Performance metrics in binary classification using data resampling and class weights.
Table 33. Performance metrics in multi-class classification using data resampling and class weights.
5. Discussion
This section provides a comprehensive evaluation of the Transformer-CNN model’s
performance in comparison to other classification methods, such as CNN, auto encoder, and
DNN, across both binary and multi-class classification tasks. We conduct a detailed analysis
of the confusion matrices and key performance metrics, including accuracy, precision, recall,
and F1-score, to offer a comparative assessment of each model’s strengths and weaknesses.
Results obtained from the NF-UNSW-NB15-v2 dataset, along with additional evaluation on
other datasets like CICIDS2017 to demonstrate the models’ generalizability, reveal how the
Transformer-CNN model’s innovative integration of Transformer and CNN architectures
enhances its ability to detect malicious activities and classify various attack types. This
analysis not only highlights the model’s superior performance across multiple metrics but
also underscores its robustness in real-world intrusion detection scenarios, emphasizing
the practical implications of improving the accuracy and reliability of IDS systems.
(i) Binary Classification
In binary classification on both the NF-UNSW-NB15-v2 and CICIDS2017 datasets,
the Transformer-CNN model demonstrated exceptional performance across critical met-
rics such as accuracy, precision, recall, and F1-score, outperforming previously proposed
models. Its ability to extract and leverage essential features from the input data is ev-
ident in the classification outcomes. Figure 2 presents the confusion matrices for the
Transformer-CNN model applied to the NF-UNSW-NB15-v2 and CICIDS2017 datasets. On
the NF-UNSW-NB15-v2 dataset, the model achieved an accuracy of 99.71%, with precision,
recall, and F1-score all at 99.71%. The confusion matrix shows that the model correctly
identified 4342 normal instances and 1797 attack instances. However, 18 normal instances
were misclassified as attacks, with no attack instances misclassified as normal. This per-
formance underscores the model’s robustness in handling imbalanced datasets and its
precision in detecting attacks while minimizing false alarms. On the CICIDS2017 dataset,
the Transformer-CNN model achieved an even higher accuracy of 99.93%, with precision,
recall, and F1-score also at 99.93%. The confusion matrix reveals that the model correctly
classified 13,939 normal instances and 11,033 attack instances. However, 15 normal in-
stances were misclassified as attacks, and 3 attack instances were misclassified as normal.
This result highlights the model’s exceptional ability to distinguish between normal and
malicious traffic effectively, ensuring reliability and precision in real-world intrusion detec-
tion scenarios. These results confirm the Transformer-CNN model’s capability to address
critical challenges in intrusion detection, including managing imbalanced datasets and
reducing false positives and false negatives, making it a highly reliable tool for deployment
in real-world network security applications.
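The counts quoted above can be checked with a few lines of Python. The sketch below recomputes accuracy from the reported confusion-matrix entries and adds the false positive rate and false negative rate, which correspond to the "false alarms" and missed attacks discussed in the text. The helper name `error_rates` is an assumption introduced for illustration, and treating "attack" as the positive class is also an assumption.

```python
def error_rates(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, false positive rate, and false negative rate from counts.

    "Attack" is treated as the positive class (an assumption of this sketch).
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "false_positive_rate": fp / (fp + tn),  # normal traffic flagged as attack
        "false_negative_rate": fn / (fn + tp),  # attacks passed as normal
    }


# Counts as reported in the text for the Transformer-CNN model.
nf_unsw = error_rates(tn=4342, fp=18, fn=0, tp=1797)
cicids = error_rates(tn=13939, fp=15, fn=3, tp=11033)

print(round(100 * nf_unsw["accuracy"], 2))  # 99.71
print(round(100 * cicids["accuracy"], 2))   # 99.93
```

Both recomputed accuracies match the reported values. The same counts give a false positive rate of about 0.41% on NF-UNSW-NB15-v2 and about 0.11% on CICIDS2017, and a false negative rate of 0% and about 0.03%, respectively.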
The comparative performance of the proposed Transformer-CNN model against other
binary classifiers, including a standalone CNN, auto encoder, and DNN, is depicted in
Figures 3 and 4. The evaluation metrics displayed include accuracy, precision, recall, and
F1-score. The results indicate that the Transformer-CNN model excelled, with an accuracy
of 99.71%, a precision of 99.71%, a recall of 99.71%, and an F1-score of 99.71% on the NF-
UNSW-NB15-v2 dataset. This underscores its exceptional capability in detecting intrusions.
The high precision score of 99.71% indicates that the Transformer-CNN model effectively
identified true positives with very few false positives, while the 99.71% recall score shows
that it captured nearly all true positive instances, minimizing false negatives. The F1-score
of 99.71% reflects a nearly perfect balance between precision and recall, showcasing the
model’s overall effectiveness and reliability. On the CICIDS2017 dataset, the Transformer-
CNN model demonstrated even greater performance, achieving an accuracy of 99.93%,
along with matching precision, recall, and F1-score metrics of 99.93%. In contrast, the
standalone auto encoder exhibited lower performance metrics on both datasets, with
accuracy, precision, recall, and F1-score around 99.66% on NF-UNSW-NB15-v2 and 99.73%
on CICIDS2017. The standalone CNN achieved slightly better metrics of 99.69% on NF-
UNSW-NB15-v2 and 99.86% on CICIDS2017. The DNN model had metrics of 99.68% on
NF-UNSW-NB15-v2 and 99.88% on CICIDS2017. Ultimately, the Transformer-CNN model
stands out due to its robust overall performance on both datasets, reinforcing its suitability
for binary classification tasks.
[Confusion matrices, panels (a) and (b): true labels versus predicted labels (Normal, Attack).]
Figure 2. Confusion matrix for binary classification using Transformer-CNN on (a) NF-UNSW-NB15-v2
dataset and (b) CICIDS2017 dataset.
[Figures 3 and 4: bar charts of accuracy, precision, recall, and F-score (y-axis, in %) for the CNN,
Auto Encoder, DNN, and Transformer-CNN models in binary classification.]
Table 34. Performance metrics for Transformer-CNN across several classes in binary classification on
NF-UNSW-NB15-v2 dataset.
Table 35. Performance metrics for Transformer-CNN across several classes in binary classification on
CICIDS2017 dataset.
True labels (rows) versus predicted labels (columns, left to right): Benign, Exploits, Fuzzers,
Reconnaissance, Generic, DoS, Shellcode, Backdoor, Analysis, Worms.
Benign 4294 4 7 0 0 0 0 1 4 0
Exploits 0 720 3 5 0 4 0 5 5 1
Fuzzers 0 0 474 0 0 1 0 2 1 0
Reconnaissance 0 1 0 344 0 0 0 0 1 0
DoS 0 0 0 0 0 76 0 1 1 0
Shellcode 0 0 0 0 0 0 25 0 0 0
Backdoor 0 0 0 0 0 0 0 14 2 0
Analysis 0 0 0 0 0 0 0 2 8 0
Worms 0 0 0 0 0 0 0 0 0 5
Figure 5. Confusion matrix for multi-class classification using Transformer-CNN on NF-UNSW-
NB15-v2 dataset.
In multi-class classification on the CICIDS2017 dataset, the Transformer-CNN model
demonstrates remarkable effectiveness, as illustrated by the confusion matrix shown in
Figure 6. The model achieves outstanding accuracy, precision, recall, and F1-scores across
various attack classes, effectively distinguishing between diverse attack types with minimal
misclassifications. For instance, the Benign class achieved 13,773 correct classifications, with
only a few instances misclassified into other categories, such as 60 instances as PortScan
and 34 as DoS Hulk. The PortScan attack class was classified with high recall, with
1806 out of 1808 instances correctly identified and just 2 instances misclassified. Similarly, the
model correctly classified 2080 instances of DDoS, with 2 instances misclassified into other
categories. For DoS Hulk, the model correctly classified 5609 instances, with only two
misclassifications. In the DoS GoldenEye class, all 480 instances were correctly
identified, showcasing perfect performance. For FTP-Patator, 255 instances were correctly
classified, with just 2 misclassified as DoS Slowloris. The model maintained strong accuracy
for the SSH-Patator class, correctly identifying 112 instances with minimal errors. For
more challenging attack types, such as DoS Slowloris and DoS Slowhttptest, the model
achieved excellent results, correctly classifying 261 and 160 instances, respectively, without
any misclassifications. The model also handled the Bot attack class effectively, correctly
classifying 104 instances, with only 3 misclassified into the Benign category. The Web Attack
- Brute Force class was classified with perfect precision and recall, correctly identifying all
69 instances without any errors, while the Web Attack - XSS class achieved near-perfect
performance, correctly identifying 55 instances with minimal errors. The rarest classes,
Infiltration, Web Attack - SQL Injection, and Heartbleed, each have only a handful of test
instances. For Infiltration, the Transformer-CNN model correctly identified 3 instances but
misclassified 2 as Heartbleed. In the Web Attack - SQL Injection class, the model
classified both instances correctly. Similarly, for Heartbleed, the
model correctly identified all 4 instances with no errors.
These results further emphasize the model’s ability to handle less frequent and challenging
attack classes with high precision. With an overall accuracy of 99.13%, precision of 99.22%,
recall of 99.13%, and an F1-score of 99.16%, the Transformer-CNN model demonstrates
robust capability in handling multi-class classification challenges. Its ability to classify a
wide range of attack types accurately and reliably underscores its potential for real-world
deployment in intrusion detection systems, where precision and reliability are paramount.
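To make the per-class counts above easier to compare, the following sketch computes per-class recall for four of the classes whose correct and misclassified counts are both stated in the text, and contrasts a macro average with a support-weighted average. Only these four classes are included, so the averages are illustrative and are not the overall metrics reported for the model; the function names are assumptions introduced here.

```python
# (correct, total) pairs built from the counts quoted in the text.
reported = {
    "PortScan": (1806, 1808),
    "FTP-Patator": (255, 257),  # 255 correct + 2 misclassified
    "Bot": (104, 107),          # 104 correct + 3 misclassified
    "Infiltration": (3, 5),     # 3 correct + 2 misclassified
}


def per_class_recall(counts: dict) -> dict:
    """Recall per class: correctly classified instances over all true instances."""
    return {name: correct / total for name, (correct, total) in counts.items()}


def macro_average(recalls: dict) -> float:
    """Unweighted mean: every class counts equally, regardless of size."""
    return sum(recalls.values()) / len(recalls)


def weighted_average(counts: dict) -> float:
    """Support-weighted mean of recall, equal to total correct / total instances."""
    return sum(c for c, _ in counts.values()) / sum(t for _, t in counts.values())


recalls = per_class_recall(reported)
macro = macro_average(recalls)
weighted = weighted_average(reported)
for name, value in recalls.items():
    print(f"{name}: {value:.4f}")
print(f"macro: {macro:.4f}, weighted: {weighted:.4f}")
```

In this subset the Infiltration recall is 0.6, yet the weighted average stays above 0.99 because the class contributes only 5 instances. This is one reason to inspect per-class results alongside aggregate scores on imbalanced data.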
The comparative performance of the proposed Transformer-CNN model against other
multi-class classifiers, including a standalone CNN, auto encoder, and DNN, highlights
the Transformer-CNN’s remarkable capability in managing complex classification tasks
on both the NF-UNSW-NB15-v2 and CICIDS2017 datasets, as shown in Figures 7 and 8.
The evaluation metrics, including accuracy, precision, recall, and F1-score, show that the
Transformer-CNN consistently outperforms the other classifiers across both datasets. On
the NF-UNSW-NB15-v2 dataset, the Transformer-CNN achieved an accuracy of 99.02%,
a precision of 99.30%, a recall of 99.02%, and an F1-score of 99.13%. Precision, recall, and
F1-score matter here because accuracy alone can mask poor performance on minority classes
in imbalanced datasets; the high precision indicates few false positives, and the high recall
indicates few missed attacks. In contrast, the CNN achieved an accuracy of 98.36%,
with precision, recall, and F1-score values of 98.66%, 98.36%, and 98.46%, respectively.
The DNN recorded an accuracy of 97.65%, with precision, recall, and F1-score values
of 98.09%, 97.65%, and 97.77%, respectively. The auto encoder exhibited comparatively
lower metrics, achieving 95.57% accuracy, 96.54% precision, 95.57% recall, and 95.77%
F1-score. On the CICIDS2017 dataset, the Transformer-CNN also led with an accuracy of
99.13%, a precision of 99.22%, a recall of 99.13%, and an F1-score of 99.16%. The CNN
achieved an accuracy of 99.05%, with precision, recall, and F1-score values of 99.12%,
99.05%, and 99.07%, respectively. The DNN recorded an accuracy of 99.11%, with precision,
recall, and F1-score values of 99.20%, 99.11%, and 99.14%, respectively. The auto encoder
achieved an accuracy of 99.09%, with precision, recall, and F1-score values of 99.12%,
99.09%, and 99.09%. These results show that the Transformer-CNN leads on both datasets:
by 0.66 percentage points in accuracy over the CNN on NF-UNSW-NB15-v2, and by a narrower
0.02 percentage points over the DNN on CICIDS2017.
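In every multi-class result above, the reported recall equals the reported accuracy (for example, 99.02% and 99.02%). That pattern is what support-weighted averaging produces, because weighted recall reduces algebraically to accuracy. The sketch below shows this on a hypothetical 3-class confusion matrix. The averaging scheme is an assumption, since this section does not state which one was used, and the matrix values are invented for illustration.

```python
def weighted_metrics(cm: list) -> dict:
    """Support-weighted precision, recall, and F1 from a square confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    precision = recall = f1 = 0.0
    for k in range(n):
        support = sum(cm[k])                          # true samples of class k
        predicted = sum(cm[i][k] for i in range(n))   # samples predicted as k
        p = cm[k][k] / predicted if predicted else 0.0
        r = cm[k][k] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced matrix: one majority class and two minority classes.
cm = [
    [950, 30, 20],
    [5, 40, 5],
    [2, 1, 7],
]
result = weighted_metrics(cm)
print({name: round(value, 4) for name, value in result.items()})
```

Weighted recall matches accuracy exactly, while weighted precision and F1 can differ from it, as they do in the results reported above.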
[Figure 6: confusion matrix for multi-class classification using Transformer-CNN on the CICIDS2017
dataset (true labels versus predicted labels).]
[Figures 7 and 8: bar charts of accuracy, precision, recall, and F-score (y-axis, in %) for the CNN,
Auto Encoder, DNN, and Transformer-CNN models in multi-class classification.]
Table 37. Performance metrics for Transformer-CNN across several classes in multi-class classification
on CICIDS2017 dataset.
[Confusion matrix: true labels versus predicted labels (Normal, Attack); recovered row: Attack 6 293.]
Figure 9. Confusion matrix of the Transformer-CNN model on the NF-UNSW-NB15-v2 dataset,
demonstrating its effectiveness in detecting zero-day attacks.
6. Limitations
The Transformer-CNN architecture exemplifies a sophisticated deep learning framework
that combines the capabilities of Transformers and CNNs to bolster performance in
classification tasks. Although this innovative approach effectively tackles key issues in
intrusion detection systems, such as enhancing accuracy and addressing class imbalances,
it is essential to acknowledge various limitations and challenges that may arise:
• Scalability: As the volume of datasets or the complexity of network traffic grows, the
computational demands on the model can intensify, which may hinder its efficiency
and its capacity to manage larger datasets or adapt to changing network environments.
• Generalization: Although the Transformer-CNN exhibits impressive performance on
the NF-UNSW-NB15-v2 and CICIDS2017 datasets, its efficacy across diverse types
of network traffic or newly emerging attack vectors is not yet fully established. To
assess its robustness and generalization capabilities, it is crucial to evaluate the model
against a wider array of datasets, including KDDCup99 [36], NSL KDD [29], and more
recent collections like CSE-CIC-IDS2018 [34], and IoT23 [16].
7. Conclusions
In this paper, we proposed an advanced hybrid Transformer-CNN deep learning
model designed to address the challenges of zero-day attack detection and class imbalance
in IDS. The transformer component of our model is employed for contextual feature extrac-
tion, enabling the system to analyze relationships and patterns in the data effectively. In
contrast, the CNN is responsible for final classification, processing the extracted features to
accurately identify specific attack types. By integrating data resampling techniques such
as ADASYN, SMOTE and ENN, we effectively address class imbalance in the training
data. Additionally, utilizing class weights further enhances our model’s performance by
balancing the influence of different classes during training. As a result, our model sig-
nificantly improves detection accuracy while reducing false positives and negatives. The
results of our evaluation demonstrate the model’s remarkable performance across both the
NF-UNSW-NB15-v2 and CICIDS2017 datasets. On the NF-UNSW-NB15-v2 dataset, the
model achieved an impressive 99.71% accuracy in binary classification and 99.02% accuracy
in multi-class classification. Similarly, on the CICIDS2017 dataset, the model attained
99.93% accuracy in binary classification and 99.13% accuracy in multi-class classification,
showcasing its effectiveness across diverse datasets and classification tasks. This perfor-
mance surpasses that of existing models in both known and unknown threat detection.
This research highlights the potential of hybrid deep learning models in fortifying network
and cloud environments against increasingly sophisticated cyber threats. Our approach
not only enhances real-time detection capabilities but also proves effective in handling
imbalanced datasets, a common challenge in IDS development.
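As an illustration of how class weights can balance the influence of different classes during training, the sketch below computes inverse-frequency weights, w_c = N / (K · n_c), where N is the number of samples, K the number of classes, and n_c the count of class c. This is the common "balanced" heuristic and is an assumption here; the exact weighting scheme used for the proposed model is not restated in this section. The label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels: list) -> dict:
    """Inverse-frequency class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical, heavily imbalanced label distribution.
labels = ["Benign"] * 900 + ["Exploits"] * 90 + ["Worms"] * 10
weights = balanced_class_weights(labels)
for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

Rare classes receive proportionally larger weights, so each class contributes the same total weight to the loss. Such a dictionary is typically passed to the training routine of the chosen framework, which scales each sample's loss by the weight of its class.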
8. Future Work
To address the limitations and challenges outlined in Section 6, future research should
prioritize exploration in the following domains:
• Broader Dataset Evaluation: Future investigations should involve testing the Transformer-CNN
across a more diverse range of datasets, including KDDCup99 [36], NSL KDD [29],
and newer datasets such as CSE-CIC-IDS2018 [34], and IoT23 [16]. This approach
will provide insights into its robustness, generalization potential, and effectiveness in
addressing emerging attack vectors.
• Data Preprocessing Refinement: The data preprocessing procedures should be meticu-
lously refined and customized for each dataset to achieve optimal model performance.
This entails experimenting with various preprocessing techniques and analyzing their
effects on model results. These preprocessing strategies are discussed in
Sections 3.2 and 4.1 of the manuscript.
• Model Adaptation and Hyperparameter Optimization: Ongoing investigation into
model adaptation techniques is essential, emphasizing the refinement of the hyperpa-
rameter optimization process tailored to various datasets. This process should undergo
systematic analysis to uncover best practices for effectively adapting the model to
different data environments. Detailed discussions of these aspects are presented in
Section 3, specifically in Section 3.3.4.
Author Contributions: Conceptualization, H.K. and M.M.; Methodology, H.K. and M.M.; Software,
H.K. and M.M.; Validation, H.K. and M.M.; Writing—original draft, H.K. and M.M.; Supervision,
M.M. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The datasets used in our study, NF-UNSW-NB15-v2 and CICIDS2017,
are publicly available. Below are the URLs for the datasets: NF-UNSW-NB15-v2:
https://staff.itee.uq.edu.au/marius/NIDS_datasets/ (accessed on 15 December 2024); CICIDS2017:
https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 15 December 2024).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Conti, M.; Dargahi, T.; Dehghantanha, A. Cyber Threat Intelligence: Challenges and Opportunities; Springer: Berlin/Heidelberg,
Germany, 2018; pp. 1–6. [CrossRef]
2. Faker, O.; Dogdu, E. Intrusion detection using big data and deep learning techniques. In Proceedings of the 2019 ACM Southeast
Conference. ACM SE’19, Kennesaw, GA, USA, 18–20 April 2019; Association for Computing Machinery: New York, NY, USA,
2019; pp. 86–93. [CrossRef]
3. Kaur, G.; Habibi Lashkari, A.; Rahali, A. Intrusion traffic detection and characterization using deep image learning. In Proceedings
of the 2020 IEEE International Conference on Dependable, Autonomic and Secure Computing, International Conference on
Pervasive Intelligence and Computing, International Conference on Cloud and Big Data Computing, Intl Conf on Cyber Science
and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Calgary, AB, Canada, 17–22 August 2020; pp. 55–62.
[CrossRef]
4. Internet Security Threat Report. Available online: https://docs.broadcom.com/doc/istr-23-2018-en (accessed on 18 July 2022).
5. Cyberattacks Now Cost Companies $200,000 on Average, Putting Many out of Business. Available online:
https://www.cnbc.com/2019/10/13/cyberattacks-cost-small-companies-200k-putting-many-out-of-business.html (accessed on 13 October 2019).
6. Kumar, M.; Singh, A.K. Distributed intrusion detection system using blockchain and cloud computing infrastructure. In
Proceedings of the 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184), Tirunelveli, India,
15–17 June 2020; pp. 248–252.
7. Zhang, X.; Xie, J.; Huang, L. Real-Time Intrusion Detection Using Deep Learning Techniques. J. Netw. Comput. Appl. 2020, 140,
45–53.
8. Kumar, S.; Kumar, R. A Review of Real-Time Intrusion Detection Systems Using Machine Learning Approaches. Comput. Secur.
2020, 95, 101944.
9. Smith, A.; Jones, B.; Taylor, C. Enhancing Network Security with Real-Time Intrusion Detection Systems. Int. J. Inf. Secur. 2021, 21,
123–135.
10. Sarhan, M.; Layeghy, S.; Portmann, M. Towards a standard feature set for network intrusion detection system datasets. Mob.
Netw. Appl. 2022, 27, 357–370. [CrossRef]
11. Sarhan, M.; Layeghy, S.; Moustafa, N.; Portmann, M. Cyber threat intelligence sharing scheme based on federated learning for
network intrusion detection. J. Netw. Syst. Manag. 2023, 31, 3. [CrossRef]
12. UNB. Intrusion Detection Evaluation Dataset (CICIDS2017), University of New Brunswick. Available online:
https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 30 October 2024).
13. Panigrahi, R.; Borah, S. A detailed analysis of CICIDS2017 dataset for designing Intrusion Detection Systems. Int. J. Eng. Technol.
2018, 7, 479–482.
14. Anderson, J.P. Computer security threat monitoring and surveillance. In Technical Report; James P. Anderson Company: Washing-
ton, DC, USA, 1980.
15. Mahalingam, A.; Perumal, G.; Subburayalu, G.; Albathan, M.; Altameem, A.; Almakki, R.S.; Hussain, A.; Abbas, Q. ROAST-IoT: A
novel range-optimized attention convolutional scattered technique for intrusion detection in IoT networks. Sensors 2023, 23, 8044.
[CrossRef]
16. ElKashlan, M.; Elsayed, M.S.; Jurcut, A.D.; Azer, M. A machine learning-based intrusion detection system for iot electric vehicle
charging stations (evcss). Electronics 2023, 12, 1044. [CrossRef]
17. Al Nuaimi, T.; Al Zaabi, S.; Alyilieli, M.; AlMaskari, M.; Alblooshi, S.; Alhabsi, F.; Yusof, M.F.B.; Al Badawi, A. A comparative
evaluation of intrusion detection systems on the edge-IIoT-2022 dataset. Intell. Syst. Appl. 2023, 20, 200298. [CrossRef]
18. Gad, A.R.; Nashat, A.A.; Barkat, T.M. Intrusion detection system using machine learning for vehicular ad hoc networks based on
ToN-IoT dataset. IEEE Access 2021, 9, 142206–142217. [CrossRef]
19. Al-Daweri, M.S.; Ariffin, K.A.Z.; Abdullah, S.; Senan, M.F.E.M. An analysis of the KDD99 and UNSW-NB15 datasets for the
intrusion detection system. Symmetry 2020, 12, 1666. [CrossRef]
20. Vitorino, J.; Praça, I.; Maia, E. Towards adversarial realism and robust learning for IoT intrusion detection and classification. Ann.
Telecommun. 2023, 78, 401–412. [CrossRef]
21. Othman, T.S.; Abdullah, S.M. An intelligent intrusion detection system for internet of things attack detection and identification
using machine learning. Aro-Sci. J. Koya Univ. 2023, 11, 126–137. [CrossRef]
22. Yaras, S.; Dener, M. IoT-Based Intrusion Detection System Using New Hybrid Deep Learning Algorithm. Electronics 2024, 13, 1053.
[CrossRef]
23. Vinayakumar, R.; Alazab, M.; Soman, K.P.; Poornachandran, P.; Al-Nemrat, A.; Venkatraman, S. Deep learning approach for
intelligent intrusion detection system. IEEE Access 2019, 7, 41525–41550. [CrossRef]
24. Farhana, K.; Rahman, M.; Ahmed, M.T. An intrusion detection system for packet and flow based networks using deep neural
network approach. Int. J. Electr. Comput. Eng. 2020, 10, 5514–5525. [CrossRef]
25. Zhang, C.; Chen, Y.; Meng, Y.; Ruan, F.; Chen, R.; Li, Y.; Yang, Y. A novel framework design of network intrusion detection based
on machine learning techniques. Secur. Commun. Netw. 2021, 2021, 6610675. [CrossRef]
26. Alsharaiah, M.; Abualhaj, M.; Baniata, L.; Al-saaidah, A.; Kharma, Q.; Al-Zyoud, M. An innovative network intrusion detection
system (NIDS): Hierarchical deep learning model based on Unsw-Nb15 dataset. Int. J. Data Netw. Sci. 2024, 8, 709–722. [CrossRef]
27. Jouhari, M.; Benaddi, H.; Ibrahimi, K. Efficient Intrusion Detection: Combining χ2 Feature Selection with CNN-BiLSTM on the
UNSW-NB15 Dataset. arXiv 2024, arXiv:2407.14945.
28. Türk, F. Analysis of intrusion detection systems in UNSW-NB15 and NSL-KDD datasets with machine learning algorithms. Bitlis
Eren Üniversitesi Fen Bilim. Derg. 2023, 12, 465–477. [CrossRef]
29. Muhuri, P.; Chatterjee, P.; Yuan, X.; Roy, K.; Esterline, A. Using a long short-term memory recurrent neural network (lstm-rnn) to
classify network attacks. Information 2020, 11, 243. [CrossRef]
30. Fu, Y.; Du, Y.; Cao, Z.; Li, Q.; Xiang, W. A deep learning model for network intrusion detection with imbalanced data. Electronics
2022, 11, 898. [CrossRef]
31. Yin, Y.; Jang-Jaccard, J.; Xu, W.; Singh, A.; Zhu, J.; Sabrina, F.; Kwak, J. IGRF-RFE: A hybrid feature selection method for
MLP-based network intrusion detection on UNSW-NB15 dataset. J. Big Data 2023, 10, 15. [CrossRef]
32. Yoo, J.; Min, B.; Kim, S.; Shin, D.; Shin, D. Study on network intrusion detection method using discrete pre-processing method
and convolution neural network. IEEE Access 2021, 9, 142348–142361. [CrossRef]
33. Alzughaibi, S.; El Khediri, S. A cloud intrusion detection systems based on DNN using backpropagation and PSO on the
CSE-CIC-IDS2018 dataset. Appl. Sci. 2023, 13, 2276. [CrossRef]
34. Basnet, R.B.; Shash, R.; Johnson, C.; Walgren, L.; Doleck, T. Towards Detecting and Classifying Network Intrusion Traffic Using
Deep Learning Frameworks. J. Internet Serv. Inf. Secur. 2019, 9, 1–17.
35. Thilagam, T.; Aruna, R. Intrusion detection for network based cloud computing by custom RC-NN and optimization. ICT Express
2021, 7, 512–520. [CrossRef]
36. Farahnakian, F.; Heikkonen, J. A deep auto-encoder based approach for intrusion detection system. In Proceedings of the 2018
20th International Conference on Advanced Communication Technology (ICACT), Chuncheon, Republic of Korea, 11–14 February
2018; IEEE: Piscataway, NJ, USA, 2018; pp. 178–183.
37. Mahmood, H.A.; Hashem, S.H. Network intrusion detection system (NIDS) in cloud environment based on hidden Naïve Bayes
multiclass classifier. Al-Mustansiriyah J. Sci. 2018, 28, 134–142. [CrossRef]
38. Baig, M.M.; Awais, M.M.; El-Alfy, E.S.M. A multiclass cascade of artificial neural network for network intrusion detection. J. Intell.
Fuzzy Syst. 2017, 32, 2875–2883. [CrossRef]
39. Mohy-Eddine, M.; Guezzaz, A.; Benkirane, S.; Azrour, M.; Farhaoui, Y. An ensemble learning based intrusion detection model for
industrial IoT security. Big Data Min. Anal. 2023, 6, 273–287. [CrossRef]
40. Nicolas-Alin, S. Machine Learning for Anomaly Detection in IoT Networks: Malware Analysis on the IoT-23 Data Set. Bachelor’s
Thesis, University of Twente, Enschede, The Netherlands, 2020.
41. Susilo, B.; Sari, R.F. Intrusion detection in IoT networks using deep learning algorithm. Information 2020, 11, 279. [CrossRef]
42. Szczepański, M.; Pawlicki, M.; Kozik, R.; Choraś, M. The application of deep learning imputation and other advanced methods
for handling missing values in network intrusion detection. Vietnam. J. Comput. Sci. 2023, 10, 1–23. [CrossRef]
43. Kumar, P.; Bagga, H.; Netam, B.S.; Uduthalapally, V. SAD-IoT: Security analysis of DDoS attacks in IoT networks. Wirel. Pers.
Commun. 2022, 122, 87–108. [CrossRef]
44. Sarhan, M.; Layeghy, S.; Portmann, M. Feature analysis for machine learning-based IoT intrusion detection. arXiv 2021,
arXiv:2108.12732.
45. Ferrag, M.A.; Friha, O.; Hamouda, D.; Maglaras, L.; Janicke, H. Edge-IIoTset: A new comprehensive realistic cyber security
dataset of IoT and IIoT applications for centralized and federated learning. IEEE Access 2022, 10, 40281–40306. [CrossRef]
46. Henry, A.; Gautam, S.; Khanna, S.; Rabie, K.; Shongwe, T.; Bhattacharya, P.; Sharma, B.; Chowdhury, S. Composition of hybrid
deep learning model and feature optimization for intrusion detection system. Sensors 2023, 23, 890. [CrossRef] [PubMed]
47. Aleesa, A.; Mohammed, A.A.; Mohammed, A.A.; Sahar, N. Deep-intrusion detection system with enhanced UNSW-NB15 dataset
based on deep learning techniques. J. Eng. Sci. Technol. 2021, 16, 711–727.
48. Ahmad, M.; Riaz, Q.; Zeeshan, M.; Tahir, H.; Haider, S.A.; Khan, M.S. Intrusion detection in internet of things using supervised
machine learning based on application and transport layer features using UNSW-NB15 data-set. EURASIP J. Wirel. Commun.
Netw. 2021, 2021, 10. [CrossRef]
49. Mohammed, B.; Gbashi, E.K. Intrusion detection system for NSL-KDD dataset based on deep learning and recursive feature
elimination. Eng. Technol. J. 2021, 39, 1069–1079. [CrossRef]
50. Umair, M.B.; Iqbal, Z.; Faraz, M.A.; Khan, M.A.; Zhang, Y.D.; Razmjooy, N.; Kadry, S. A network intrusion detection system using
hybrid multilayer deep learning model. Big Data 2022, 12, 367–376. [CrossRef]
51. Choobdar, P.; Naderan, M.; Naderan, M. Detection and multi-class classification of intrusion in software defined networks using
stacked auto-encoders and CICIDS2017 dataset. Wirel. Pers. Commun. 2022, 123, 437–471. [CrossRef]
52. Shende, S.; Thorat, S. Long short-term memory (LSTM) deep learning method for intrusion detection in network security. Int. J.
Eng. Res. 2020, 9, 1615–1620.
53. Farhan, B.I.; Jasim, A.D. Performance analysis of intrusion detection for deep learning model based on CSE-CIC-IDS2018 dataset.
Indones. J. Electr. Eng. Comput. Sci. 2022, 26, 1165–1172. [CrossRef]
54. Farhan, R.I.; Maolood, A.T.; Hassan, N. Performance analysis of flow-based attacks detection on CSE-CIC-IDS2018 dataset using
deep learning. Indones. J. Electr. Eng. Comput. Sci. 2020, 20, 1413–1418. [CrossRef]
55. Lin, P.; Ye, K.; Xu, C.Z. Dynamic network anomaly detection system by using deep learning techniques. In Proceedings of the
Cloud Computing–CLOUD 2019: 12th International Conference, Held as Part of the Services Conference Federation, SCF 2019,
San Diego, CA, USA, 25–30 June 2019; Proceedings 12. Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp.
161–176.
56. Liu, G.; Zhang, J. CNID: Research of network intrusion detection based on convolutional neural network. Discret. Dyn. Nat. Soc.
2020, 2020, 4705982. [CrossRef]
57. Li, F.; Shen, H.; Mai, J.; Wang, T.; Dai, Y.; Miao, X. Pre-trained language model-enhanced conditional generative adversarial
networks for intrusion detection. Peer-to-Peer Netw. Appl. 2024, 17, 227–245. [CrossRef]
58. Wang, S.; Yao, X. Multiclass imbalance problems: Analysis and potential solutions. IEEE Trans. Syst. Man Cybern. Part B 2012, 42,
1119–1130. [CrossRef] [PubMed]
59. Abdelkhalek, A.; Mashaly, M. Addressing the class imbalance problem in network intrusion detection systems using data
resampling and deep learning. J. Supercomput. 2023, 79, 10611–10644. [CrossRef]
60. Yang, H.; Xu, J.; Xiao, Y.; Hu, L. SPE-ACGAN: A resampling approach for class imbalance problem in network intrusion detection
systems. Electronics 2023, 12, 3323. [CrossRef]
61. Zakariah, M.; AlQahtani, S.A.; Al-Rakhami, M.S. Machine learning-based adaptive synthetic sampling technique for intrusion
detection. Appl. Sci. 2023, 13, 6504. [CrossRef]
62. Thiyam, B.; Dey, S. Efficient feature evaluation approach for a class-imbalanced dataset using machine learning. Procedia Comput.
Sci. 2023, 218, 2520–2532. [CrossRef]
63. Albasheer, F.O.; Haibatti, R.R.; Agarwal, M.; Nam, S.Y. A Novel IDS Based on Jaya Optimizer and SMOTE-ENN for Cyberattacks
Detection. IEEE Access 2024, 12, 101506–101527. [CrossRef]
64. Arık, A.O.; Çavdaroğlu, G.Ç. An Intrusion Detection Approach based on the Combination of Oversampling and Undersampling
Algorithms. Acta Infologica 2023, 7, 125–138. [CrossRef]
65. Rao, Y.N.; Suresh Babu, K. An imbalanced generative adversarial network-based approach for network intrusion detection in an
imbalanced dataset. Sensors 2023, 23, 550. [CrossRef]
66. Jamoos, M.; Mora, A.M.; AlKhanafseh, M.; Surakhi, O. A new data-balancing approach based on generative adversarial network
for network intrusion detection system. Electronics 2023, 12, 2851. [CrossRef]
67. Xu, B.; Sun, L.; Mao, X.; Ding, R.; Liu, C. IoT Intrusion Detection System Based on Machine Learning. Electronics 2023, 12, 4289.
[CrossRef]
68. Assy, A.T.; Mostafa, Y.; Abd El-khaleq, A.; Mashaly, M. Anomaly-based intrusion detection system using one-dimensional
convolutional neural network. Procedia Comput. Sci. 2023, 220, 78–85. [CrossRef]
69. Elghalhoud, O.; Naik, K.; Zaman, M.; Manzano, R. Data Balancing and CNN Based Network Intrusion Detection System; IEEE:
Piscataway, NJ, USA, 2023.
70. Almarshdi, R.; Nassef, L.; Fadel, E.; Alowidi, N. Hybrid Deep Learning Based Attack Detection for Imbalanced Data Classification.
Intell. Autom. Soft Comput. 2023, 35, 297–320. [CrossRef]
71. Thockchom, N.; Singh, M.M.; Nandi, U. A novel ensemble learning-based model for network intrusion detection. Complex Intell.
Syst. 2023, 9, 5693–5714. [CrossRef]
72. Jumabek, A.; Yang, S.S.; Noh, Y.T. CatBoost-based network intrusion detection on imbalanced CIC-IDS-2018 dataset. Korean Soc.
Commun. Commun. J. 2021, 46, 2191–2197. [CrossRef]
73. Zhu, Y.; Liang, J.; Chen, J.; Ming, Z. An improved NSGA-III algorithm for feature selection used in intrusion detection. Knowl.-Based
Syst. 2017, 116, 74–85. [CrossRef]
74. Jiang, J.; Wang, Q.; Shi, Z.; Lv, B.; Qi, B. RST-RF: A hybrid model based on rough set theory and random forest for network intrusion
detection. In Proceedings of the 2nd International Conference on Cryptography, Security and Privacy, Guiyang, China, 16–18
March 2018.
75. Chawla, N.; Bowyer, K.; Hall, L.; Kegelmeyer, W. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16,
321–357. [CrossRef]
76. Alikhanov, J.; Jang, R.; Abuhamad, M.; Mohaisen, D.; Nyang, D.; Noh, Y. Investigating the effect of traffic sampling on machine
learning-based network intrusion detection approaches. IEEE Access 2022, 10, 5801–5823. [CrossRef]
77. Zhang, X.; Ran, J.; Mi, J. An intrusion detection system based on convolutional neural network for imbalanced network traffic. In
Proceedings of the 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT), Dalian,
China, 19–20 October 2019; pp. 456–460.
78. Gupta, N.; Jindal, V.; Bedi, P. CSE-IDS: Using cost-sensitive deep learning and ensemble algorithms to handle class imbalance in
Network-based intrusion detection systems. Comput. Secur. 2021, 112, 102499. [CrossRef]
79. Mbow, M.; Koide, H.; Sakurai, K. Handling class imbalance problem in intrusion detection system based on deep learning. Int. J.
Netw. Comput. 2022, 12, 467–492. [CrossRef] [PubMed]
80. Patro, S.G.K.; Sahu, K.K. Normalization: A preprocessing stage. arXiv 2015, arXiv:1503.06462. [CrossRef]
81. Bagui, S.; Li, K. Resampling imbalanced data for network intrusion detection datasets. J. Big Data 2021, 8, 6. [CrossRef]
82. Elmasry, W.; Akbulut, A.; Zaim, A.H. Empirical study on multiclass classification-based network intrusion detection. Comput.
Intell. 2019, 35, 919–954. [CrossRef]
83. El-Habil, B.Y.; Abu-Naser, S.S. Global climate prediction using deep learning. J. Theor. Appl. Inf. Technol. 2022, 100, 4824–4838.
84. He, H.; Wu, D. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the 2008 Fourth
International Conference on Natural Computation, Jinan, China, 18–20 October 2008.
85. Wilson, D.L. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 1972, 3, 408–421.
[CrossRef]
86. He, H.; Garcia, E. Learning from imbalanced data. In IEEE Transactions on Knowledge and Data Engineering; IEEE: Piscataway, NJ,
USA, 2009.
87. Zhendong, S.; Jinping, M. Deep learning-driven MIMO: Data encoding and processing mechanism. Phys. Commun. 2022, 57,
101976. [CrossRef]
88. Xin, Z.; Chunjiang, Z.; Jun, S.; Kunshan, Y.; Min, X. Detection of lead content in oilseed rape leaves and roots based on deep
transfer learning and hyperspectral imaging technology. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2022, 290, 122288.
[CrossRef]
89. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
90. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International
Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
91. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks
from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
92. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006; Volume 4.
93. Nielsen, M.A. Neural Networks and Deep Learning. In Chapter 1 Explains the Basics of Feedforward Operations in Neural Networks;
Determination Press: San Francisco, CA, USA, 2015.
94. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International
Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011.
95. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv
2017, arXiv:1706.03762.
96. Lei Ba, J.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
97. Ring, M.; Wunderlich, S.; Scheuring, D.; Landes, D.; Hotho, A. A survey of network-based intrusion detection data sets. Comput.
Secur. 2019, 86, 147–167. [CrossRef]
98. Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward generating a new intrusion detection dataset and intrusion traffic
characterization. ICISSP 2018, 1, 108–116.
99. Sharafaldin, I.; Habibi Lashkari, A.; Ghorbani, A.A. A detailed analysis of the CICIDS2017 data set. In Proceedings of the
Information Systems Security and Privacy: 4th International Conference, ICISSP 2018, Funchal-Madeira, Portugal, 22–24 January
2018; Revised Selected Papers 4. Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 172–188.
100. Jyothsna, V.; Prasad, K.M. Anomaly-based intrusion detection system. In Computer and Network Security; Intech: Houston, TX,
USA, 2019; Volume 10.
101. Chen, C.; Song, Y.; Yue, S.; Xu, X.; Zhou, L.; Lv, Q.; Yang, L. FCNN-SE: An Intrusion Detection Model Based on a Fusion CNN and
Stacked Ensemble. Appl. Sci. 2022, 12, 8601. [CrossRef]
102. Powers, D.M.W. Evaluation: From Precision, Recall, and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach.
Learn. Technol. 2011, 2, 37–63.