
Article

A Novel Approach for Cyber Threat Analysis Systems Using BERT Model from Cyber Threat Intelligence Data

Doygun Demirol 1, Resul Das 2,* and Davut Hanbay 3

1 Department of Computer Technologies, Bingöl University, 12000 Bingöl, Türkiye; ddemirol@bingol.edu.tr
2 Department of Software Engineering, Technology Faculty, Fırat University, 23119 Elazığ, Türkiye
3 Department of Computer Engineering, Engineering Faculty, İnönü University, 44000 Malatya, Türkiye; davut.hanbay@inonu.edu.tr
* Correspondence: rdas@firat.edu.tr

Abstract: As today's cybersecurity environment becomes increasingly complex, it is crucial to analyse threats quickly and effectively. A delayed response or lack of foresight can lead to data loss, reputational damage, and operational disruptions. Therefore, developing methods that can rapidly extract valuable threat intelligence is a critical need to strengthen defence strategies and minimise potential damage. This paper presents an innovative approach that integrates knowledge graphs and a fine-tuned BERT-based model to analyse cyber threat intelligence (CTI) data. The proposed system extracts cyber entities such as threat actors, malware, campaigns, and targets from unstructured threat reports and establishes their relationships using an ontology-driven framework. A named entity recognition dataset was created and a BERT-based model was trained. To address class imbalance, oversampling and a focal loss function were applied, achieving an F1 score of 96%. The extracted entities and relationships were visualised and analysed using knowledge graphs, enabling advanced threat analysis and the prediction of potential attack targets. This approach enhances cyber-attack prediction and prevention through knowledge graphs.

Keywords: cyber threat intelligence; knowledge graphs; named entity recognition; pre-trained language model

Academic Editor: Jie Yang
Received: 28 February 2025; Revised: 1 April 2025; Accepted: 8 April 2025; Published: 11 April 2025
Citation: Demirol, D.; Das, R.; Hanbay, D. A Novel Approach for Cyber Threat Analysis Systems Using BERT Model from Cyber Threat Intelligence Data. Symmetry 2025, 17, 587. https://doi.org/10.3390/sym17040587
Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

The process of storing and analysing data digitally first emerged in the late 1970s with the concept of the "database machine", a special-purpose technology. The storage and processing capacity of a single host system became insufficient as data volumes grew. In the 1980s, researchers developed a parallel database architecture named "shared-nothing" to overcome this capacity issue and to store data reliably [1]. Following this, the first commercial parallel database system emerged [2]. By the end of the 1990s, the advantages of parallel databases were widely accepted, and they came into widespread use. In 2011, EMC/IDC (International Data Corporation) published an influential research report on the concept of big data, titled "Extracting Value from Chaos" [3].

The rapid increase in the use of IT systems has enabled the transition from traditional to modern IT systems in many areas. As a result of this transition, IT systems have become more attractive to malicious users. In parallel with these developments, cyberattacks have become widespread, more sophisticated, and more capable. For this reason, cyberattacks have become a serious problem for IT infrastructures and


society. To prevent attacks that seriously affect IT systems and society, security researchers have developed and implemented many methods and countermeasures. The effectiveness of artificial intelligence (AI)-based methods depends on their training knowledge. Keeping this training knowledge base up to date requires analysing current security documents and extracting cyber assets efficiently and accurately. Sharing extracted cyber assets with security communities in a timely manner increases the effectiveness of cyber threat intelligence and helps to train cyber defence systems with more accurate information.
Cyber threat intelligence (CTI) is critical information that can help organisations protect their infrastructure from threats. It is vital for cybersecurity experts to keep their knowledge base up to date on new malware and attack scenarios by analysing current CTI reports in order to be prepared for new attacks. Although a great deal of structured CTI data is shared by vendors such as Symantec, McAfee, Trend Micro, and FireEye, a large amount of CTI data is available in an unstructured format via publicly available sources such as cybersecurity blogs and security reports. Manually collecting and analysing large amounts of unstructured cyber intelligence data can make defence against cybersecurity threats inadequate and inefficient. Moreover, the rapid proliferation of open source documents containing CTI information makes it increasingly difficult for human analysts to monitor and process these data efficiently and in a timely manner. To overcome these challenges, we propose a hybrid system for extracting cybersecurity entities from unstructured security texts. The proposed system is based on a combination of natural language processing (NLP), AI methods, and rule-based pattern recognition. The implementation of the proposed system brings several challenges.
First, before extracting cyber entities, some preprocessing steps need to be applied to
the collected data. Depending on the type of data source, text data need to be cleaned
and extracted from these sources and converted into a common format. To do this, we
developed separate methods according to data sources and extracted clean sentences from
text data. Another challenge is the value of the data collected. The “value” parameter,
which appears as a component in big data, is a time-dependent component. As time
passes, the value of the available data decreases. Therefore, it is necessary to keep the
cyber intelligence data constantly up to date. For this purpose, the proposed system was
updated by collecting new CTI reports from data sources at regular intervals. The main contribution of this paper is the construction of hand-annotated NER and relationship extraction datasets for cybersecurity, motivated by the scarcity of publicly available datasets, which makes building CTI datasets challenging. To construct these datasets, we preprocessed and annotated 100 PDF-formatted cybersecurity reports and 540 HTML-formatted web pages.

1.1. Research Issues and Motivation
In the digital age, cybersecurity emerges not just as a technical challenge but as a critical frontier in safeguarding global security and privacy. The exponential growth of digital data and its increasing complexity have unveiled a myriad of vulnerabilities,
making it increasingly difficult to protect against evolving cyber threats. This research is
driven by two pivotal issues: the inefficiency of traditional security measures in keeping
pace with sophisticated cyber threats, and the vast, untapped potential of raw security data
waiting to be decoded into actionable intelligence.
The increasing complexity of cyber threats makes the identification and understanding
of these threats complex and difficult. Traditional defence systems are often based on
predefined threat signatures, making them vulnerable to advanced persistent threats (APTs)
using new attack methods [4]. This gap in cyber defence highlights the need for innovative
approaches that not only detect known threats but also predict emerging new threats and
potential targets. Graph-based approaches offer significant advantages over traditional
methods in the analysis of cyber threat intelligence. For example, Zhou et al. [5] demonstrated that graph-based systems enable better visualisation and understanding of
relationships between advanced persistent threat (APT) actors, achieving higher accuracy
compared to traditional methods. Similarly, Piplai et al. [6] showed that knowledge graphs
constructed from malware reports provide richer contextual information, revealing relation-
ships often missed by text-based analysis. Furthermore, Sarhan and Spruit [7] highlighted
that graph-based approaches systematically reveal hidden connections between entities
such as threat actors, malware, and targets, thereby improving the overall quality of threat
intelligence. These studies highlight the potential of graph-based methods to address the
limitations of traditional approaches, particularly when dealing with complex and dynamic
cybersecurity data.
Massive amounts of security data are generated every day, including threat reports,
blog articles, logs, and structured cyber threat data. These data contain valuable threat
intelligence information such as cyber threat behaviour, vulnerabilities exploited, attack
targets, and technical attack information. However, the volume and irregularity of these
data pose significant challenges for analysis. Traditional text analysis methods are not
sufficient to extract hidden patterns and relationships in such big data, which means that
potential threat intelligence remains untapped.
The motivation for this study is to provide cybersecurity professionals with a deeper
understanding of threat behaviour and the timely extraction of cyber threat information
without losing its value in order to develop effective cyber defence strategies. Based on
this motivation, we have developed our proposed approach by utilising the capabilities
of knowledge graphs. Knowledge graphs represent cyber entities and the relationships
between these entities in a graphical structure, making it easier to discover hidden patterns,
predict threat behaviours, and identify more complex relationships in the cyber security
environment. With the proposed graph-based approach, we aim to keep up with the rapidly changing threat landscape and to develop more proactive defence strategies, going beyond the limitations of traditional approaches.

1.2. Main Contributions
This research represents a significant step in the field of cybersecurity, bridging the
gap between raw data complexity and actionable threat intelligence. This study provides
a comprehensive solution to cyber security threats by performing entity extraction from
unstructured threat data and relationship analysis with knowledge graphs. The main
contributions of our work are listed below:
• By utilising the existing capabilities of knowledge graphs in the cyber security domain,
we demonstrate the structured conversion of raw cyber security data from various
sources into an interconnected cyber intelligence network. This contribution provides
an infrastructure for better understanding, querying, and analysing cybersecurity data
with knowledge graphs.
• Using natural language processing techniques, we improve the process of extracting
entities from raw cybersecurity content. By leveraging an ontology to build the
relationships between entities, we both make the task of relationship building more
meaningful and more clearly reveal insights that traditional methods may miss.
• Our proposed approach provides a comprehensive solution for analysing cyber entities
and their relationships using knowledge graphs. This approach provides detailed
knowledge of the current and future behaviour of cyber threats, enabling cybersecurity
practitioners to effectively predict and identify potential threats.
• By analysing patterns within the knowledge graph, we present a novel approach to
predicting future potential targets based on threat actor behaviour. This contribution
plays an important role in improving cybersecurity by enabling the prediction of
potential targets.
• This contribution aims to enable cyber security professionals to proactively defend
against threats they may encounter. In this way, it is envisaged that more accurate
decisions will be made in the cyber defence process using actionable cyber intelli-
gence data.
• Finally, by integrating concepts from graph theory, data science, and cybersecurity,
our work fosters interdisciplinary collaboration. We highlight the importance of
combining expertise from different fields to tackle the complex and multifaceted
nature of cyber threats.
These contributions underscore our commitment to advancing the state of cyberse-
curity through innovative research and practical applications. By redefining how security
data are analysed and utilised, we aim to create a safer digital world for individuals,
organisations, and governments alike.

2. Related Works
Cyber threat intelligence is vital for organisations and security communities to protect
their cybersecurity assets against rapidly evolving cyber threats. In particular, cyber threat
intelligence related to APT (advanced persistent threat) attacks contains detailed techni-
cal information about attackers, targets, and attack techniques and tactics. Furthermore,
extracting and analysing cyber-related information using traditional methods can be very
time-consuming and require much manual work for security analysts. Therefore, extracting
and using cyber threat intelligence from unstructured data is both vital and challenging. Consequently, most security researchers have focused on automating the extraction of threat intelligence
from public data sources. Publicly available threat sources such as ThreatExchange [8],
Symantec [9], Kaspersky, hacker forums, and social media platforms are useful sources
for threat intelligence and sharing. However, these sources mostly provide threat information only as unstructured text. Unstructured cyber threat texts have been standardised into machine-readable formats such as STIX [10] and MAEC [11] to facilitate the sharing of threat intelligence information and to support the prompt and effective prevention and identification of attacks [12].
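As an illustration of what such a machine-readable representation looks like, the sketch below builds a minimal STIX-2.1-style bundle as plain Python dictionaries. The actor and malware names are invented for the example, and only a small subset of STIX fields is shown; production code would typically use the official stix2 library instead.

```python
import json
import uuid

def stix_id(obj_type: str) -> str:
    """Build a STIX-style identifier of the form '<type>--<uuid4>'."""
    return f"{obj_type}--{uuid.uuid4()}"

# Minimal STIX-2.1-style objects (illustrative field subset; names are hypothetical).
actor = {
    "type": "threat-actor",
    "spec_version": "2.1",
    "id": stix_id("threat-actor"),
    "name": "ExampleGroup",
}
malware = {
    "type": "malware",
    "spec_version": "2.1",
    "id": stix_id("malware"),
    "name": "ExampleRAT",
    "is_family": True,
}
uses = {
    "type": "relationship",
    "spec_version": "2.1",
    "id": stix_id("relationship"),
    "relationship_type": "uses",
    "source_ref": actor["id"],   # the actor uses the malware
    "target_ref": malware["id"],
}
bundle = {"type": "bundle", "id": stix_id("bundle"), "objects": [actor, malware, uses]}

print(json.dumps(bundle, indent=2))
```

Because every object carries a typed identifier, consumers can exchange and merge such bundles without re-parsing free text, which is exactly the benefit the standardised formats provide.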
In this section, we review previous works performed on the extraction of CTI data
from open data sources in the field of cybersecurity. Zhou et al. [5] proposed a CTI analysis
framework called CTI View to extract CTI from unstructured APT texts and analyse
them automatically. The authors extract the threat entities and train the model through
the BERT-GRU-BiLSTM-CRF model based on bidirectional encoder representations from
transformers (BERT). The proposed model achieves an accuracy of more than 72%.
Jones et al. employed a combination of semi-supervised machine learning techniques
and active learning approaches to extract entities and relationships related to network
security [13]. Husari et al. proposed a tool called Ttpdrill that is designed to automatically
extract threat actions from unstructured text sources such as CTI reports. The tool uses
NLP techniques to identify and extract relevant information from the text and has been
evaluated on a dataset of CTI reports. The results show that Ttpdrill can extract threat
actions with high accuracy [12]. Alves et al. presented a Twitter streaming threat monitor
that regularly updates a summary of threats related to the target. They collected tweets that
pertain to cyber security incidents and extracted features using the term frequency–inverse
document frequency method and then employed both a multi-layer perceptron and support
vector machine to be used as classifiers for the collected tweets [14]. Kim et al. proposed a
framework called CyTIME that uses structured CTI data from repositories. The framework
collects intelligence data, continuously generates rules without human intervention, and
then converts them into a JSON format known as Structured Threat Information Expression
(STIX) to mitigate real-time network cyber threats [15]. Zhang et al. proposed a framework
to extract cyber threat actions from CTI reports using NLP. In addition to the extraction
of actions, the framework finds relationships among entities [16]. Piplai et al. prepared a
framework to extract cyber information from the after-action reports and represent that in
a knowledge graph to offer insightful analyses to cybersecurity analysts. The system uses
NER and regular expressions to identify cyber-related entities [6]. Sarhan et al. presented a
neural network-based open information extraction (OIE) system to extract valuable cyber
intelligence from unstructured texts. The proposed approach constructs knowledge graph
representation from the threat reports and performs named entity recognition (NER) using
OIE [7]. Alam et al. designed a transformer-based library that implements an NER system for extracting cyber entities. The library uses a neural language model named XLM-RoBERTa, which is pre-trained on threat reports [17]. Zhu et al. proposed a system that uses NLP
methods to extract IOCs from security-related articles and classify the articles into campaign
stages. To enhance the IOC extraction stage, rule-based methods were used [18].
These studies demonstrate various approaches to cyber threat intelligence extraction.
However, recent developments show a clear trend towards more sophisticated methods
that combine transformer models with specialised knowledge representations. Current
research focuses predominantly on hybrid architectures that combine state-of-the-art lan-
guage models with graph-based approaches, achieving higher performance metrics than
traditional methods.
This study aims to contribute to the fields of cyber threat intelligence, natural language
processing, named entity recognition, deep learning models, and graph-based analysis.
A review of existing studies in the literature shows that cyber threat intelligence processes
are mostly handled using traditional methods and that graph-based analysis approaches are
applied in this area to a limited extent. Furthermore, research integrating ontology-based
approaches is rather scarce.
This study addresses these gaps by leveraging a fine-tuned BERT model to enhance
contextual entity recognition in cyber threat intelligence. This aligns with the literature's preference for hybrid models, as shown in Table 1, which highlights significant advances in
combining transformer-based architectures with graph-based knowledge representations.
Specifically, this study presents a novel approach to map threats and identify hidden links
by integrating graph-based Neo4j models to visualise and analyse the relationships be-
tween cyber threat entities. In addition, the proposed ontology-based model systematically
structures the entity relationships, further improving the interpretability and utilisation of
the extracted knowledge. These innovations, combining BERT-based NER with domain-
specific ontology and graph-based analysis, deliver a comprehensive framework for cyber
threat intelligence. The fine-tuned BERT model enhances contextual entity recognition,
particularly for cyber threat terminology, while the graph-based approach, implemented
with Neo4j, visualises and analyses relationships between threat entities, uncovering poten-
tial connections and mapping threat chains [19]. Additionally, the ontology-based model
systematically structures entities and their relationships, facilitating a more meaningful and
organised representation of cyber threat data. Together, these advancements significantly
improve multi-entity recognition, relationship extraction, and threat visualisation, provid-
ing both methodological and practical contributions to the field of cyber threat intelligence.
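Since a fine-tuned BERT NER model emits token-level BIO labels, turning its output into the entity spans stored in a graph requires a small decoding step. The sketch below shows one common way to group BIO-tagged tokens into (entity text, entity type) pairs; the label names (THREAT_ACTOR, MALWARE) are assumed for illustration and may differ from the paper's actual tag set.

```python
from typing import List, Tuple

def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                      # a new entity starts here
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)              # current entity continues
        else:                                         # 'O' or an inconsistent tag
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:                                # flush a trailing entity
        entities.append((" ".join(current_tokens), current_type))
    return entities

# Example with assumed labels:
tokens = ["APT28", "deployed", "X-Agent", "malware"]
tags   = ["B-THREAT_ACTOR", "O", "B-MALWARE", "O"]
print(bio_to_entities(tokens, tags))
# → [('APT28', 'THREAT_ACTOR'), ('X-Agent', 'MALWARE')]
```

The resulting spans are what a downstream step would pair up into entity–relation–entity triples before loading them into a graph database.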
Table 1. Key comparison of recent cybersecurity intelligence research.

[20] Data: CTI reports collected from open sources. Method: NLP, LSA, SVD, Naïve Bayes, KNN, Decision Tree, Random Forest, DLNN. Key takeaways: Facilitates more accurate classification and analysis of high-level IOCs and threat actors. Key findings: The DLNN model achieved the highest performance with a 94% accuracy rate, producing more reliable results compared to other methods.

[21] Data: Freebuf website and WooYun Vulnerability Database. Method: CRF, LSTM, LSTM-CRF, BiLSTM-CRF, CNN-BiLSTM-CRF, FT-LSTM-CRF, FT-BiLSTM-CRF, FT-CNN-BiLSTM-CRF. Key takeaways: Feature templates (FT) enhance the detection of security entities by combining local and global features. Key findings: The proposed FT-CNN-BiLSTM-CRF model achieved the best performance with 93.31% accuracy and an F-score of 86.

[22] Data: Twitter data. Method: TF-IDF, DBSCAN, TextRank, TextRazor. Key takeaways: The proposed method identifies new and emerging cyber threat events from X streams. Key findings: The system demonstrated an accuracy rate of 93.75%, showcasing the applicability of threat detection with X data.

[23] Data: CVE list, Adobe and Microsoft Security Bulletins, and security blog texts. Method: Stanford NER, LSTM + Dense, LSTM + CRF, XBiLSTM-CRF. Key takeaways: The proposed XBiLSTM-CRF method outperformed others in recognizing entity categories by learning distinctions more effectively. Key findings: The proposed model achieved the best results with 90.54% accuracy and an F1 score of 89.38%.

[24] Data: APT security reports and malware repository data. Method: Regex-based IoC parser, malware repository service (for analysis). Key takeaways: CTIMiner accelerated threat analysis processes by integrating indicators from open source reports and malware analyses to form a high-quality CTI dataset. Key findings: The CTIMiner system generated a high-quality CTI dataset by leveraging open source reports and malware analysis data.

[25] Data: CVE records and security blog texts. Method: BiLSTM-Attention-CRF-crowd, CRF-MA, Dawid and Skene-LSTM, BiLSTM-Attention-CRF, BiLSTM-Attention-CRF-VT. Key takeaways: The proposed method effectively extracts accurate information from low-quality data sources and classifies security entities. Key findings: The BiLSTM-Attention-CRF-crowd model outperformed others with 89.1% accuracy and an F1 score of 89.

[26] Data: CVE records. Method: Named entity recognition (NER), ontology-based data modeling. Key takeaways: An NER-based system was developed for identifying and evaluating IoT security statuses. Key findings: The proposed model successfully identified vulnerabilities in IoT networks, achieving positive results through semantic analysis and ontology contributions.

[6] Data: Malware After Action Reports (AARs), Microsoft and Adobe Security Bulletins, and CVE descriptions. Method: Named entity recognition (NER), relation extraction, ontology-based modeling, knowledge graphs. Key takeaways: Automatic extraction of information from cybersecurity incident reports led to the creation of knowledge graphs. Key findings: The proposed method extracted entities and relations from AARs, creating knowledge graphs and improving the analysis of security threats.

[27] Data: Open source threat intelligence platforms. Method: Multi-layer perceptron (MLP), TF-IDF-based feature extraction. Key takeaways: The proposed method provides an effective framework for extracting accurate and meaningful information from threat articles. Key findings: The system classified articles with an accuracy rate of 94.85% and produced enriched knowledge graphs using multi-source data.

[28] Data: Cyber threat reports, Chinese CTI reports, and vulnerability data. Method: CRF, BiLSTM, BiLSTM-CRF, BiLSTM-CRF + Correction. Key takeaways: Ontology-based correction and BiLSTM-CRF integration improved contextual accuracy and produced effective results for complex entity types. Key findings: The BiLSTM-CRF + Correction model achieved the highest F1 scores across all entity types, providing reliable information extraction for cyber threat intelligence.

[29] Data: Cyber threat reports and vulnerability information. Method: CRF, LSTM-CRF, CNN-CRF, RNN-CRF, GRU-CRF, BiGRU-CRF, BiGRU + CNN-CRF. Key takeaways: The deep learning-based Bi-GRU + CNN + CRF model excelled in contextual information modeling and understanding entity relationships. Key findings: The Bi-GRU + CNN + CRF model achieved the highest performance with an F1 score of 93.4%, producing more accurate results in entity recognition compared to existing approaches.

[30] Data: Cyber threat reports and CVE information. Method: Char-RNN-Bi-LSTM, Char-RNN-Bi-LSTM-CRF, Char-CNN-Bi-LSTM, Char-CNN-Bi-LSTM-CRF, BOC-Bi-LSTM, BOC-Bi-LSTM-CRF. Key takeaways: BOC-Bi-LSTM-CRF produced the best results in accuracy and efficiency by leveraging contextual and character information. Key findings: The proposed model achieved an F1 score of 75.05%, outperforming other methods. CRF improved performance across all models.

[31] Data: Cyber threat reports. Method: CTIBERT model (BERT, BiGRU, CRF, multi-head mechanism). Key takeaways: HAG intuitively modeled tactical and technical information with a hyper-graph structure. The CTIBERT model excelled in extracting complex contextual information, with the multi-head mechanism enhancing accuracy. Key findings: The Hyper Attack Graph (HAG) framework is the first approach to analyse cyber threat intelligence using a hyper-graph structure.

[32] Data: Automatically labeled corpus. Method: GloVe + BiLSTM + CRF, FastText + BiLSTM + CRF, BERT-base-cased + BiLSTM + CRF, BERT-large-cased + BiLSTM + CRF, BERT-large-cased-wwm + BiLSTM + CRF, BERT-base-cased + FFN, BERT-large-cased + FFN, BERT-large-cased-wwm + FFN. Key takeaways: BERT-based approaches, especially with Whole Word Masking, improved accuracy, while CRF successfully modeled label dependencies. Key findings: The BERT-large-cased-wwm + FFN model achieved the best performance with an F1 score of 97.4%, excelling in contextual information extraction.

[33] Data: CTI reports from Microsoft, Cisco, McAfee, Kaspersky, Fortinet, CrowdStrike, and others. Method: CNN-BiLSTM-CRF, CNN-BiGRU-CRF, BERT-BiLSTM-CRF, BERT-BiGRU-CRF, RoBERTa-BiLSTM-CRF, RoBERTa-BiGRU-CRF. Key takeaways: RoBERTa-based models demonstrated superior performance on complex relationships during overlap tests, with high accuracy in "all" tests compared to other models. Key findings: The RoBERTa-BiGRU-CRF model achieved the best performance with an F1 score of 83.2%, excelling in contextual information extraction.

[34] Data: Cyber threat reports from Microsoft, Cisco, McAfee, Kaspersky, and others. Method: Bi-LSTM, BERT. Key takeaways: Automatically extracts attack behaviours from reports, producing results similar to manually prepared graphs. Key findings: High accuracy rates were achieved in threat detection, extracting meaningful information from complex CTI reports.

[7] Data: Cybersecurity reports and open source datasets from Microsoft, Cisco, McAfee, and Kaspersky. Method: Bi-GRU, Bi-GRU + Att, Bi-LSTM + CRF, Bi-GRU + CRF. Key takeaways: The model provides a novel method for cyber threat analysis by extracting contextual and accurate information from unstructured text. Canonicalisation processes standardised knowledge graph creation. Key findings: The Open-CyKG-Bi-GRU + CRF model achieved outstanding success in accurate relationship extraction with an F1 score of 98.9% and sensitivity of 80.8%. Canonicalisation achieved an F1 score of 82.6% in relationship matching.

[35] Data: CVE descriptions, APT reports, Security Bulletins, ATT&CK, MISP, Unit 42, and WatcherLab. Method: BERT, BiGRU, attention mechanism. Key takeaways: The proposed method improved the accuracy of cyber threat intelligence by simultaneously extracting entities and relationships. Ontology alignment ensured contextual consistency. Key findings: The knowledge graph creation process provided reliable information and relationships with an F1 score of 81.37%. BERT-based models demonstrated superior success in contextual analysis.

[36] Data: CTI reports, MITRE ATT&CK, NVD, and open source data. Method: BERT + BiLSTM + CRF, BERT (relation extraction). Key takeaways: The proposed system provided superior accuracy in extracting threat entities and relationships from unstructured text compared to existing methods. Contextual information alignment optimised threat analysis processes. Key findings: The model achieved an F1 score of 97.2% in entity extraction and 98.5% in relation extraction, outperforming existing methods.

[37] Data: Structured data from OpenCVE, ATT&CK, CAPEC, and websites. Method: BERT, graph attention networks (GAT). Key takeaways: The GAT model supported accurate modeling of contextual relationships during knowledge graph creation, optimising threat analysis processes. Key findings: Achieved the best performance in threat intelligence extraction with an F1 score of 90.16% for entity extraction and 81.83% for relationship extraction.
3. Proposed System Architecture and Implementation
This study introduces a comprehensive system for cyber threat intelligence (CTI),
designed to handle the complete process—from extracting information from unstructured
text to performing detailed analyses of the extracted data. The primary objective of the
proposed system is to automate the processing of critical information embedded in security
threat reports and related content, convert it into a structured format, extract actionable
insights, and facilitate efficient analysis. The system’s overall architecture is centred around
the components illustrated in Figure 1.

Figure 1. The main steps in the proposed system.

In the first step, the system collects text-based data from various online sources, such
as websites, APIs, and threat reports. These data can be in both structured and unstructured
formats and form the basis for training and analysing the system. The collected data are preprocessed to remove unnecessary elements and convert them into a standardised format.
In the second step, after data collection and preprocessing, the data are manually annotated
to create a labelled dataset. This dataset includes entities such as threat actors, malware,
campaigns, and targets, which are critical for the system to identify domain-specific entities.
In the third step, the annotated dataset is then used to fine-tune a pre-trained BERT model
for the named entity recognition (NER) task. The model is trained and evaluated to
ensure its ability to extract entities with high precision and recall, making it suitable for
cybersecurity applications. In the fourth step, once the model extracts the entities and their
relationships, these are organised into entity–relation–entity triples and stored in a Neo4j
graph database. This step enables the creation of knowledge graphs that represent the
relationships between cyber entities. The relationships are organised using the Unified
Cybersecurity Ontology (UCO) framework [38], which defines the relationships between
threat actors, malware, campaigns, and targets. In the final step, the knowledge graphs
are analysed to uncover hidden patterns, predict threat behaviours, and identify more
complex relationships. Advanced graph-based techniques, such as PageRank, are applied to
prioritise and analyse the most critical entities and their connections. This process supports
a deeper understanding of cyber threats and helps develop proactive defence strategies.
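To make the fourth step concrete, the following sketch shows how one extracted entity–relation–entity triple could be turned into a parameterised Cypher MERGE statement for loading into Neo4j. The node labels and relationship type (ThreatActor, USES, Malware) are assumed examples; the actual schema follows the UCO ontology, and the real system would execute such statements through a Neo4j driver session (e.g. `session.run(query, params)`).

```python
def triple_to_cypher(head: str, head_label: str, rel: str, tail: str, tail_label: str):
    """Build a parameterised Cypher MERGE for one (entity, relation, entity) triple.

    MERGE is used (rather than CREATE) so that re-loading the same report
    does not duplicate nodes or edges in the knowledge graph.
    """
    query = (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[:{rel}]->(t)"
    )
    return query, {"head": head, "tail": tail}

# Hypothetical triple extracted by the NER/relation step:
query, params = triple_to_cypher("APT28", "ThreatActor", "USES", "X-Agent", "Malware")
print(query)
print(params)
```

Passing entity names as parameters rather than interpolating them into the query string avoids Cypher injection issues and lets Neo4j cache the query plan across many triples.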
The proposed approach, illustrated in Figure 2, is organised as a sequential pipeline of four modules: the data module, the training module, the knowledge construction module, and the graph analysis module. The following subsections provide detailed explanations of each module.

Figure 2. The flowchart of the proposed approach.



3.1. Data Module


The data module is the basis of the proposed system. Data are collected, processed,
annotated, and fed into the next training module at this stage. The sub-components of the
data module are explained in detail below.

3.1.1. Data Collection


Raw cybersecurity data are used as the basic input for the proposed system. In the
data collection phase, these data are obtained from three different sources and used in the
preprocessing phase. The data sources used in our study are explained below.
• Web Sources: The data gathered from web sources are text-based content, such
as cybersecurity articles and blog posts containing technical information. As these
contents are extracted from web pages, the data are in an unstructured format and
are extracted using web scraping techniques specific to the page in question [39,40].
A comprehensive preprocessing phase is performed on the raw data collected from
web pages to remove unnecessary items such as HTML tags. The preprocessed data
are stored in txt format for the next steps.
• Threat Reports: Threat reports are retrieved from a public GitHub repository [41].
The threat reports stored in the relevant GitHub repository are in PDF format, and the
repository is regularly updated with new reports. The text content of these PDF
files was extracted using the Python library PyPDF2 [42]. These contents were then
preprocessed, and the texts were converted into a suitable form. The texts were saved
in txt format for later use.
• API Sources: MITRE ATT&CK [43] is a trusted framework that provides structured
cyber threat data. The data requested from this source can be accessed with API or
Python packages. In this study, malware, threat actor, and campaign lists were re-
trieved from MITRE ATT&CK to be used in the proposed system using the mitreattack-
python [44] package.
The data collected from these sources are passed to the preprocessing stage, where unnecessary content is removed and the texts are converted into structured data.

3.1.2. Preprocessing
In this stage of the proposed system, a detailed preprocessing process was carried
out to clean the raw cyber-texts collected from various sources and to convert them to a
certain standard. This process converted the raw text data into a structured form and made
them ready for the next stage. Processes such as cleaning, tokenisation, the removal of
HTML tags, and the removal of unnecessary characters were applied. Algorithm 1 shows
the processes performed on the texts.
• Text Cleaning: The text cleaning process includes the removal of irrelevant items to
ensure the clarity and consistency of the raw data. In this stage, in order to achieve a
certain quality of the dataset used in the following modules, the texts were converted
to lower case, unicode characters were normalised, extra spaces and newlines were
removed, non-alphanumeric characters such as “!”, “@” or “#” that were not of
analytical importance were removed, and HTML and script tags were removed from
the texts obtained from web pages. This minimised the noise in the data, improved
the quality of the data, and made the data more suitable for the annotation process.
• Tokenisation: Tokenisation is the process of splitting text into small and manageable
pieces. At this stage, text is split into sentences and then into words, and analysis is
performed at the word level. In natural language processing, the tokenisation stage
needs to be effectively implemented in order to perform an effective named entity

recognition process. This is because NER systems perform this analysis at the token
level when extracting entities from text.

Algorithm 1: Preprocess text files.


Input: Directory containing raw text files (source_dir)
Output: Cleaned and preprocessed text files (cleaned_dir)
Initialisation : Initialise NLTK libraries: download ’punkt’
1 foreach file in source_dir do
2 Read the content of the file;
3 Text cleaning;
4 Convert all text to lowercase;
5 Normalise unicode characters;
6 Remove extra whitespaces and newline characters;
7 Remove non-alphanumeric characters;
8 Remove HTML/XML tags and entities;
9 Strip JavaScript and CSS content;
10 Tokenisation;
11 Tokenise text into sentences;
12 Tokenise each sentence into words;
13 Reconstruct cleaned text;
14 Save the cleaned text to cleaned_dir;
15 end
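Algorithm 1 can be sketched in plain Python. The regular expressions below are illustrative assumptions; the paper uses NLTK's punkt tokeniser, which is replaced here by a simple rule-based splitter to keep the sketch self-contained:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Normalise raw CTI text roughly as in Algorithm 1 (illustrative sketch)."""
    text = raw.lower()                                      # convert to lowercase
    text = unicodedata.normalize("NFKC", text)              # normalise unicode
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ",
                  text, flags=re.S)                         # strip JS/CSS blocks
    text = re.sub(r"<[^>]+>|&[a-z]+;", " ", text)           # HTML/XML tags and entities
    text = re.sub(r"[^a-z0-9\s.,:/-]", " ", text)           # drop '!', '@', '#', ...
    return re.sub(r"\s+", " ", text).strip()                # collapse whitespace/newlines

def tokenise(text: str):
    """Split cleaned text into sentences, then word-level tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.split() for s in sentences if s]
```

For example, `clean_text("<p>The  APT29&nbsp;Group!</p>")` yields `"the apt29 group"`, which is then tokenised at sentence and word level for the annotation step.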

3.1.3. Data Annotation


After the preprocessing stage, the CTI data, converted into a structured form, were
annotated using manual and rule-based methods to create a structured dataset for model
training. The accuracy of the annotation process performed in this step is critical in training
the model to make correct predictions. At this stage, the malware, campaign, and threat
actor lists retrieved from MITRE ATT&CK were used to create the dataset using rule-based
methods. This was intended to broaden the range of entities the trained model could extract.
In this study, the dataset used for annotation was obtained from multiple sources.
A total of 540 news articles were collected from WeLiveSecurity [39] and FortiGuard Labs
Threat Research [40]. Additionally, 100 selected threat reports were downloaded from
the public APT Cybercrime Campaign Collections GitHub repository [41], and the lists of
malware, threat actors, and campaigns were obtained from the MITRE ATT&CK framework.
We collected 640 documents from these sources and manually annotated 257 of them to
train the named entity recognition model.
The Inside–Outside–Beginning (IOB) tagging scheme was used to annotate the enti-
ties. The IOB tagging scheme is widely used in NER tasks. It has a useful structure for
determining the boundaries of sentences and representing the categories of entities within
the text. In the study, the process of annotating these entities, including ThreatActors,
Targets, Malware, and Campaigns, was performed in this module, and the distribution of
the entities is shown in Table 2. The tagged dataset was then divided into subsets such
as training, validation, and test. There are 15,248 entities in the training set, 1906 in the
validation set, and 1906 in the test set.
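The roughly 80/10/10 split described above can be reproduced with a few lines of Python; this is an illustrative sketch, and the helper name and fixed seed are our own:

```python
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle annotated examples and split into train/validation/test subsets."""
    rng = random.Random(seed)            # fixed seed for a reproducible split
    data = list(examples)
    rng.shuffle(data)
    n_val = int(len(data) * val_frac)
    n_test = int(len(data) * test_frac)
    train = data[n_val + n_test:]
    return train, data[:n_val], data[n_val:n_val + n_test]
```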

Table 2. Entity distribution of the annotated dataset.

Entity Type Count


O 7926
B-ThreatActor 2848
I-ThreatActor 1196
B-Target 1746
I-Target 1220
B-Malware 2204
I-Malware 516
B-Campaign 760
I-Campaign 644

The structure of the IOB tagging scheme is described below:


• B-Tag (Beginning): Specifies the beginning of an entity in a sequence of tags (e.g.,
B-ThreatActor for the beginning tag of a ThreatActor entity).
• I-Tag (Inside): Identifies subsequent tokens within the same entity (e.g., I-ThreatActor for the second token of a threat actor entity).
• O-Tag (Outside): Identifies the tokens that do not belong to any entity.
The following sentence demonstrates how the IOB tagging scheme is applied in a
cybersecurity context: “The APT29 group used the malware SUNBURST in a campaign targeting
U.S. government agencies”. In the annotation process, entities such as threat actors, malware,
campaigns, and targets were annotated as shown in Table 3.

Table 3. Example of IOB annotation for a cybersecurity sentence.

Token → Tag
The → O
APT29 → B-ThreatActor
group → I-ThreatActor
used → O
the → O
malware → O
SUNBURST → B-Malware
in → O
a → O
campaign → B-Campaign
targeting → O
U.S. → B-Target
government → I-Target
agencies → I-Target
. → O

This systematic labelling enables the creation of a structured dataset that clearly iden-
tifies and separates key entities in cybersecurity text. The remaining portion of the dataset,
comprising unannotated documents, was utilised during the Knowledge Construction
Module to extract entities using the trained NER model. Such annotated datasets are
essential for training machine learning models to automate the recognition of entities and
relationships in unstructured threat reports.
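The rule-based part of the annotation, matching MITRE ATT&CK name lists against tokens, can be sketched as a simple gazetteer tagger. The function and gazetteer format below are illustrative, not the paper's exact implementation:

```python
def iob_annotate(tokens, gazetteer):
    """Rule-based IOB tagging with multi-token gazetteer entries.

    `gazetteer` maps token tuples (e.g., from MITRE ATT&CK name lists)
    to entity types; longer phrases are matched first.
    """
    tags = ["O"] * len(tokens)
    entries = sorted(gazetteer.items(), key=lambda kv: -len(kv[0]))
    i = 0
    while i < len(tokens):
        for phrase, etype in entries:
            n = len(phrase)
            if [t.lower() for t in tokens[i:i + n]] == [p.lower() for p in phrase]:
                tags[i] = f"B-{etype}"               # entity beginning
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{etype}"           # entity continuation
                i += n
                break
        else:
            i += 1                                   # no match: token stays "O"
    return tags
```

Applied to the sentence fragment from Table 3, the tagger reproduces the expected IOB labels for the gazetteer entries it knows about; the manual pass then covers entities absent from the lists.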

3.2. Training Module


The training module is designed to fine-tune a pre-trained BERT model for the named
entity recognition task in the cybersecurity domain [45]. BERT’s most superior feature over
traditional models is its bidirectional transformer approach. BERT can process both the

left and right context of each word in the text simultaneously, and, with this approach, it
understands the texts more contextually and produces numerical representations.
The pre-trained BERT model can extract general entities such as names, places, dates,
etc., from text. In order to extract domain-specific entities from text, it is necessary to
fine-tune the BERT model using a specially designed annotated dataset.

3.2.1. Fine-Tuning BERT Model


Fine-tuning BERT involves training the model on labelled cybersecurity datasets while
retaining the language understanding acquired during pre-training. The fine-tuned BERT model for
named entity recognition (NER) utilises several key layers and parameters from the BERT
architecture. The embeddings layer includes word embeddings with a vocabulary size
of 28,996 and a hidden size of 768, position embeddings for sequences up to 512 tokens,
and token-type embeddings for distinguishing between sentence segments. The encoder
consists of 12 transformer layers (BertLayer), each comprising multi-head self-attention
mechanisms with a hidden size of 768 and an intermediate size of 3072 for feed-forward
layers. Dropout regularisation is applied with a rate of 0.4 in the self-attention layer and
0.1 in other parts of the network, such as the embeddings and output layers. Finally,
a classification head is added on top, consisting of a linear layer with 768 input features and
10 output classes, corresponding to the entity labels in the NER task. The fine-tuning process
adjusts the model’s weights to optimise its performance in domain-specific tasks. One
critical challenge in fine-tuning BERT for named entity recognition (NER) is handling the
class imbalance in NER tasks. In most datasets, the “O” (Outside) class, which represents
tokens that do not belong to any entity, dominates the labelled data. During the fine-tuning
process, several advanced techniques were employed to optimise the model’s performance
and address challenges such as class imbalance and overfitting. To address the class
imbalance issue, oversampling was applied to duplicate samples from minority classes such
as “malware” and “campaign”. This ensured that the model received sufficient exposure
to underrepresented classes during training, reducing the bias towards the majority class.
This technique significantly improved the model’s performance, as evidenced by an F1
score of 0.9600 when oversampling was applied, compared to an F1 score of 0.6043 without
oversampling. Similarly, the accuracy increased from 0.9001 to 0.9600 with the application
of oversampling. Additionally, we utilised focal loss [46] as the loss function during training.
Focal loss modifies the standard cross-entropy loss to focus on harder-to-classify examples
by introducing a modulating factor. The focal loss parameters (alpha and gamma) and
other hyperparameters were determined experimentally through a series of controlled tests.
Specifically, a range of values for alpha (e.g., 0.25, 0.5) and gamma (e.g., 1.0, 2.0, 3.0) were
evaluated to identify the configuration that produced the best F1 score on the validation set.
Similarly, other hyperparameters such as learning rate and batch size were adjusted based
on their impact on model performance. The AdamW optimiser was chosen for its ability to
handle sparse gradients and improve convergence stability. To prevent exploding gradients,
gradient clipping was applied with a maximum norm of 1.0. Additionally, early stopping
was implemented to halt training if no improvement was observed in the validation loss
for five consecutive epochs, reducing the risk of overfitting. When focal loss was applied
in conjunction with oversampling, the model achieved its best performance, with an F1
score of 0.9600. These results demonstrate the critical role of oversampling and focal loss in
addressing class imbalance and improving the model’s ability to classify minority classes
such as malware and campaign. This experimental approach allowed us to systematically
explore the parameter space and select values that optimised model performance.

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)    (1)



In Equation (1), p_t represents the predicted probability of the correct class, α_t is a weighting factor that balances the importance of classes, and γ is a focusing parameter that
reduces the loss for well-classified examples. By emphasising misclassified tokens, focal
loss helps the model better represent minority classes like malware and campaign entities.
This adjustment mitigates the impact of the “O” class dominance and leads to improved
overall performance.
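Both remedies, oversampling of minority classes and focal loss as in Equation (1), can be sketched in a few lines of plain Python. This is an illustrative simplification (per-token focal loss with list inputs), not the batched PyTorch implementation used in training:

```python
import math
import random

def oversample(samples, labels, seed=42):
    """Duplicate minority-class samples until all classes match the majority count."""
    rng = random.Random(seed)
    by_class = {}
    for s, lab in zip(samples, labels):
        by_class.setdefault(lab, []).append(s)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for lab, items in by_class.items():
        extras = rng.choices(items, k=target - len(items))  # duplicated minority samples
        balanced.extend((s, lab) for s in items + extras)
    return balanced

def focal_loss(probs, target, alpha=0.25, gamma=2.0):
    """Per-token focal loss, Eq. (1): FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]                    # predicted probability of the true class
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With γ = 2, a well-classified token (p_t = 0.95) incurs orders of magnitude less loss than a misclassified one (p_t = 0.30), which is how the modulating factor shifts training effort towards hard examples; with γ = 0 the expression reduces to α times the standard cross-entropy.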
Hyperparameters are manually adjustable values that control various aspects of the
model during the training process and affect the model’s performance. Below are brief
descriptions of the hyperparameters used for the BERT model in the training module.
• Maximum length: Specifies the maximum length that an input string (e.g., a phrase)
can be. This parameter specifies the number of tokens that the model will process
for the input. A larger length setting allows the model to capture more context,
but increases the computational cost. On the other hand, a smaller length setting may
lead to loss of information.
• Batch Size: The batch size parameter specifies the number of examples the model
learns at a time during each training step. A model with a high batch size value
performs a more stable learning process but also uses more memory. On the other
hand, setting this parameter to a low value can reduce memory usage but makes the
model’s learning process more unstable.
• Test Size: Specifies the proportion of the dataset to be used for testing. For example,
a value of 0.2 indicates that 20% of the data are used for testing.
• Hidden Dropout: This prevents the model from overfitting to the training data by
temporarily deactivating the neurons in the hidden layers with a specified probability
at each training step.
• Attention Dropout: Specifies the dropout rate applied to the attention mechanisms
in the transformer layers. This regularisation helps prevent overfitting in complex
models like BERT.
• Learning Rate: Controls how much the model’s parameters are updated at each training step. A lower rate makes training slower but more stable, while a higher rate speeds up training at the risk of overshooting the optimal weights.
• Adam Epsilon: A small constant added to the denominator in the Adam optimiser
to prevent division by zero. This stabilises updates, particularly when gradients are
sparse or small.
• Epoch: Refers to the number of complete passes through the entire training dataset.
More epochs provide the model with additional opportunities to learn but may in-
crease the risk of overfitting.
• Maximum Gradient Norm: Limits the magnitude of gradients during backpropa-
gation. Gradient clipping prevents exploding gradients and ensures stable training,
particularly in deep models like BERT.
• Warmup Ratio: Specifies the fraction of training steps used to gradually increase
the learning rate from zero to its maximum value. This prevents abrupt changes in
parameter updates early in training, leading to smoother convergence.
• Weight Decay: Prevents overfitting by penalising large weights, shrinking them slightly at each training step so that the model generalises better to new data.
• Early Stopping Patience: Stops the training process if the validation performance of the model does not improve for the specified number of epochs. This ensures efficient use of resources and prevents overfitting.
• Gradient Accumulation Steps: Accumulates gradients over multiple batches be-
fore performing an update. This allows effective training with smaller batch sizes,
especially on memory-limited hardware.
• Focal Loss Gamma: Controls the focusing factor in the focal loss formula. A higher
gamma value reduces the loss of well-classified examples, allowing the model to focus
more on hard-to-classify tokens.
• Focal Loss Alpha: Balances the importance of different classes. A higher alpha value
emphasises minority classes, helping to address class imbalance during training.
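The early-stopping behaviour listed above can be sketched as a small tracker object; this is an illustrative helper, not the paper's code:

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")      # best validation loss seen so far
        self.bad_epochs = 0           # consecutive epochs without improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```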
The pseudo-code for the training process is provided in Algorithm 2.

Algorithm 2: Fine-tuning BERT for cybersecurity NER.


Input: Pre-trained BERT model, annotated dataset
Output: Fine-tuned BERT model for cybersecurity NER
1 Initialisation: Load pre-trained BERT model;
2 Split annotated dataset into train, validation, and test sets;
3 Initialise optimiser (AdamW) and loss function (focal loss);
4 Set hyperparameters (learning rate, batch size, epochs, etc.);
5 foreach epoch in epochs do
6 foreach batch in train set do
7 Tokenise batch text into subwords;
8 Generate input IDs, attention masks, and token-type IDs;
9 Feed input data into BERT model;
10 Compute predictions and calculate focal loss;
11 Backpropagate loss and update model weights;
12 end
13 Evaluate model on validation set using precision, recall, and F1 score;
14 Save best-performing model checkpoint;
15 end
16 return Fine-tuned BERT model
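The model dimensions described in Section 3.2.1 and used by Algorithm 2 can be collected into a single configuration mapping. The key names follow Hugging Face-style conventions; the dictionary itself is our illustrative summary, with all values taken from the text:

```python
# Configuration mirroring the fine-tuned BERT model described in Section 3.2.1.
bert_ner_config = {
    "vocab_size": 28996,                  # word-embedding vocabulary size
    "hidden_size": 768,
    "max_position_embeddings": 512,       # maximum supported sequence length
    "num_hidden_layers": 12,              # transformer (BertLayer) blocks
    "intermediate_size": 3072,            # feed-forward layer width
    "attention_probs_dropout_prob": 0.4,  # dropout in the self-attention layer
    "hidden_dropout_prob": 0.1,           # dropout in embeddings and output layers
    "num_labels": 10,                     # classification-head output classes
}
```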

3.2.2. Domain-Specific Named Entity Recognition


The fine-tuned BERT model specialises in extracting cybersecurity-specific entities
from unstructured text with high precision. By leveraging the domain knowledge encoded
during fine-tuning, the model effectively transforms raw threat reports into structured
data. For example, as shown in Table 3, the model outputs a sequence of labelled tokens
representing entities such as threat actors, malware, campaigns, and targets, based on
input sentences. This transformation enables downstream modules to analyse the ex-
tracted entities for actionable insights, making the training module a critical component of
the pipeline.
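Turning the model's token-level IOB labels back into entity spans, the structured output this module produces, can be sketched as follows (an illustrative decoder, not the paper's exact code):

```python
def decode_iob(tokens, tags):
    """Collapse parallel token/IOB-tag sequences into (entity_text, type) spans."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                                  # flush previous entity
                spans.append((" ".join(current), ctype))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)                          # continue current entity
        else:
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None                    # "O" token: outside entities
    if current:
        spans.append((" ".join(current), ctype))
    return spans
```

Applied to the labelled sentence in Table 3, the decoder yields the four spans "APT29 group" (ThreatActor), "SUNBURST" (Malware), "campaign" (Campaign), and "U.S. government agencies" (Target).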

3.3. Knowledge Construction Module


The knowledge construction module stores entities in the knowledge base in a struc-
tured format using a domain-specific ontology and a fine-tuned named entity recognition
model. The module defines the relationships between entities according to ontological
rules, generating triples that form the basis for storing and querying threat intelligence data.
This module consists of three components. First, the process of extracting cybersecurity-
related entities from processed threat reports is performed using the trained BERT model.

Then, the relationships between entities are established based on the defined ontology,
and finally, the entities and relationships are represented as triples. These triples are
stored in a structured knowledge base that enables querying and analysis of threat
intelligence information.
The knowledge graph construction process begins with the extraction of entities
using the fine-tuned BERT model. Relationships between these entities are defined based
on a domain-specific ontology, which specifies semantic links such as “isUsedBy” (e.g.,
a malware used by a threat actor) and “isLaunchedBy” (e.g., a campaign launched by a
threat actor). These relationships are validated against predefined ontological rules to
ensure consistency. The entities and relationships are then represented as triples in the
format of “Entity–Relation–Entity” (e.g., “Emotet-isUsedBy-APT28”). These triples are
stored in a Neo4j graph database, enabling efficient querying and visualisation.

3.3.1. Entity Extraction Using the Fine-Tuned Model


The first step in the knowledge construction process involves extracting entities from
preprocessed threat reports using the fine-tuned NER model developed in the training
module. These threat reports, prepared through preprocessing steps in the data module,
are analysed by the model to identify meaningful entities. The model extracts the following
key entity types, forming the basis of the knowledge base:
• ThreatActor : Represents individuals, groups, or organisations involved in malicious
activities (e.g., APT28).
• Malware: Refers to malicious software used in attacks, such as ransomware, trojans,
or spyware (e.g., Emotet).
• Campaign: Denotes organised efforts by threat actors to achieve specific objectives
(e.g., Operation Aurora).
• Target: Identifies sectors, organisations, or systems targeted by campaigns or malware
(e.g., financial institutions).
While the model focuses on entity extraction, relationships between these entities are
established using the ontology described below, which defines and validates the semantic
links based on a predefined schema.

3.3.2. Ontology-Based Relationship Construction


In this study, we drew inspiration from the unified cybersecurity ontology (UCO)
to design an ontology that addresses the specific requirements of cybersecurity analysis.
This ontology provides a standardised schema for modelling interactions between entities,
such as how a specific malware is deployed by a threat actor or how a campaign targets a
particular industry. Table 4 provides an overview of the entity types defined in the ontology
and the relationships between them.
An example graph representation of the relationships between the mentioned entities
is illustrated in Figure 3. Red nodes represent threat actors, blue nodes indicate malware,
yellow nodes show targets, and green nodes depict campaigns.
These relationships provide an organised structure of the knowledge base by asso-
ciating entities such as ThreatActors, malware, campaigns, and targets. For example,
the hasAssociatedCampaign relationship associates a ThreatActor with specific campaigns
it participates in, while the hasAssociatedMalware relationship associates a ThreatActor
with the malware it uses. Similarly, the hasTargetedField relationship associates a ThreatAc-
tor with the domains it targets and the hasAlias relationship defines a ThreatActor’s aliases.

Table 4. Mapping relationships between entities.

Entity Type Relationship Target Entity


ThreatActor hasAssociatedCampaign Campaign
ThreatActor hasAssociatedMalware Malware
ThreatActor hasTargetedField Target
ThreatActor hasAlias ThreatActor
Malware isUsedBy ThreatActor
Campaign isLaunchedBy ThreatActor
ThreatActor isAssociatedWithCampaign Campaign
Campaign usesMalware Malware
Campaign hasTargetedField Target

Figure 3. Graph representation of cyber threat relationships.

To further analyse the roles of campaigns and malware, relationships are defined
with other entities. For example, the isUsedBy relationship links a malware to the threat
actor that uses it, while the isLaunchedBy relationship links a campaign to the threat actor
that launched it. Campaigns are defined by two relationships. The usesMalware rela-
tionship specifies which malware a campaign uses, while the isAssociatedWithCampaign
relationship specifies which campaigns the threat actors are associated with.
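The schema in Table 4 can be encoded directly as a lookup table used to validate candidate triples before they are stored; this is a sketch, with relationship and type names taken verbatim from Table 4:

```python
# (subject type, relationship) -> required object type, per Table 4.
ONTOLOGY = {
    ("ThreatActor", "hasAssociatedCampaign"): "Campaign",
    ("ThreatActor", "hasAssociatedMalware"): "Malware",
    ("ThreatActor", "hasTargetedField"): "Target",
    ("ThreatActor", "hasAlias"): "ThreatActor",
    ("Malware", "isUsedBy"): "ThreatActor",
    ("Campaign", "isLaunchedBy"): "ThreatActor",
    ("ThreatActor", "isAssociatedWithCampaign"): "Campaign",
    ("Campaign", "usesMalware"): "Malware",
    ("Campaign", "hasTargetedField"): "Target",
}

def is_valid_triple(subj_type, relation, obj_type):
    """Check a candidate triple against the ontological rules before storing it."""
    return ONTOLOGY.get((subj_type, relation)) == obj_type
```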

3.3.3. Triple Representation


After extracting the cyber entities and establishing the relationships between them,
the data were transferred to the Neo4j graph database in the form of triples. These triples
allow for the detailed and visual examination of the relationships between threat actors,
malware, campaigns, and targets in the graph analysis module. By leveraging Neo4j’s graph
visualisation capabilities, the proposed approach enabled analyses such as discovering
malware deployment patterns and predicting threat actors’ potential targets.
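A minimal sketch of loading one such triple into Neo4j is shown below: the triple is rendered as a Cypher MERGE statement so that repeated imports do not create duplicate nodes. The helper is illustrative; production code should use parameterised queries through the official Neo4j Python driver rather than string formatting:

```python
def triple_to_cypher(subject, relation, obj, subj_label, obj_label):
    """Render an Entity-Relation-Entity triple as an idempotent Cypher statement."""
    return (
        f"MERGE (s:{subj_label} {{name: '{subject}'}}) "   # subject node
        f"MERGE (o:{obj_label} {{name: '{obj}'}}) "        # object node
        f"MERGE (s)-[:{relation}]->(o)"                    # typed relationship
    )

query = triple_to_cypher("Emotet", "isUsedBy", "APT28", "Malware", "ThreatActor")
```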

3.4. Graph Analysis Module


The graph analysis module is critical in integrating and analysing the RDF triples
generated in the previous module. These triples, which identify extracted entities and their
relationships, are imported into a graph database using Neo4j. By leveraging Neo4j’s graph
storage and querying capabilities, this module transforms the structured data into a flexible
format that supports advanced analytical tasks.
Once the triples are integrated, Neo4j enables the efficient querying and visualisation of
relationships between entities such as ThreatActors, campaigns, malware, targets, and their
relationships. Analysts can explore these connections to uncover valuable insights, such as
identifying frequently targeted industries, analysing the operational scope of specific threat
actors, or discovering patterns in malware deployment strategies.
The graph analysis module leverages the triples stored in the Neo4j database to
perform advanced analytical tasks. For instance, the relationships between entities such
as “ThreatActors”, “campaigns”, and “malware” are visualised to uncover patterns in
malware deployment and identify frequently targeted industries [47]. Predictive analyses,
such as identifying potential future targets of threat actors, are conducted using graph-
based algorithms like common neighbours and the Adamic/Adar index. These analyses
provide actionable insights for cybersecurity experts, enabling them to proactively address
emerging threats. For example, the PageRank analysis revealed that the threat actor with
Node ID 5141 has the highest centrality, indicating its significant influence in the network.
This structured and visual representation not only facilitates the exploration of com-
plex relationships but also supports decision-making processes in threat intelligence by
enabling the faster and more accurate identification of key patterns and trends.
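The graph algorithms named above can be sketched in plain Python over a simple adjacency-dictionary representation of the knowledge graph; these are textbook formulations for illustration, whereas the actual analyses run inside Neo4j:

```python
import math

def pagerank(graph, damping=0.85, iters=50):
    """Iterative PageRank over a directed graph {node: [out-neighbours]}."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if outs:
                for m in outs:
                    new[m] += damping * rank[n] / len(outs)
            else:                                 # dangling node: spread uniformly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank

def adamic_adar(graph, u, v):
    """Adamic/Adar link-prediction score over an undirected neighbour map."""
    common = set(graph[u]) & set(graph[v])        # common neighbours of u and v
    return sum(1.0 / math.log(len(graph[w])) for w in common if len(graph[w]) > 1)
```

A high PageRank score singles out highly connected entities (such as the threat actor with Node ID 5141 mentioned above), while a high Adamic/Adar score between two unlinked nodes suggests a plausible future relationship, e.g., a likely next target.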

4. Experimental Results
This section presents a detailed evaluation of the proposed approach from different
perspectives. The evaluation process includes the performance analysis of the named entity
recognition model and the examination of the analysis obtained from the generated graph
structure. The results show the system’s ability to systematically process threat information
and the effectiveness of its analysis capabilities.
In the following subsections, the results are presented, starting from the evaluation
of the training module in Section 4.1 to the detailed analysis of the graph-based results in
Section 4.2.

4.1. Training Module Results


All training experiments were conducted on Google Colab Pro, leveraging its NVIDIA
A100 GPU resources for efficient model training and evaluation. The implementation
was carried out using Python 3.11 and the PyTorch 2.6.0 deep learning framework, which
provided the necessary tools for fine-tuning the BERT-based model and handling large-
scale datasets. The hyperparameters used to fine-tune the BERT were carefully selected by
combining experimental tests and information from the literature. Values ranging from
1 × 10−5 to 5 × 10−5 were evaluated regarding the learning rate; ultimately, 3 × 10−5 was
chosen as it provided the best trade-off between convergence speed and model performance
on the validation set. Batch sizes of 16 and 32 were tested, with 32 providing more stable
gradient updates and higher F1 scores. Focal loss parameters (alpha and gamma) were
experimentally determined by testing alpha values of 0.25 and 0.5 and gamma values
of 1.0, 2.0 and 3.0. The combination of alpha = 0.25 and gamma = 2.0 was found to be
optimal for removing class imbalance and improving the representation of minority classes.
The dropout rates were set to 0.1 for hidden layers and 0.4 for attention layers to balance
overfitting and underfitting. The number of epochs was determined by monitoring the

validation performance, and early stopping was applied at 13 epochs to prevent overfitting.
The BERT-based model was specifically fine-tuned for named entity recognition (NER) tasks
in the cybersecurity domain, leveraging its pre-trained language understanding capabilities.
These hyperparameter choices were guided by systematic experimentation to ensure that
the model achieved robust and reliable performance. Table 5 lists the hyperparameters
used during fine-tuning.

Table 5. Hyperparameters used for fine-tuning BERT.

Hyperparameters Value
Maximum Sequence Length 128
Batch Size 32
Test Size 0.2
Hidden Dropout 0.1
Attention Dropout 0.4
Learning Rate 3 × 10−5
Adam Epsilon 1 × 10−8
Number of Epochs 20
Max Gradient Norm 1.0
Warmup Ratio 0.1
Weight Decay 0.01
Early Stopping Patience 5
Gradient Accumulation Steps 3
Gamma 2.0
Alpha 0.25

The learning curve presented in Figure 4 shows the progression of training and
validation losses across epochs. The training loss rapidly decreases during the initial
epochs, indicating that the model effectively learns the underlying patterns in the training
data. Similarly, the validation loss decreases significantly, stabilising around the 10th
epoch, which suggests the model’s ability to generalise well to unseen data. The relatively
small gap between the training and validation loss curves highlights minimal overfitting,
supported by the implementation of early stopping at the 13th epoch.
The training process achieved a peak validation F1 score of 0.9357 with validation
accuracy reaching 0.9897 at the 13th epoch. Table 6 summarises the evaluation metrics
for the model’s performance across the entity types. High F1 scores for most classes, such
as 0.97 for threat actors and campaigns and 0.98 for malware, demonstrate the model’s
effectiveness. However, the target class exhibited a relatively lower F1 score of 0.83,
primarily due to lower precision. This indicates potential difficulty in distinguishing
this class from others, warranting further investigation or additional training data for
this class. The achieved F1 score of 96% highlights the robustness and effectiveness of
the proposed model in extracting cybersecurity entities from unstructured data. This
performance surpasses many existing methods, which often struggle with class imbalance
and underrepresented categories. The integration of oversampling and focal loss techniques
played a pivotal role in addressing these challenges, ensuring that minority classes such as
“malware” and “campaign” were accurately identified.
In comparison to traditional approaches, which typically rely on rule-based or less
context-aware models, the fine-tuned BERT model demonstrated superior contextual
understanding and adaptability to domain-specific terminology. This high F1 score not only
validates the model’s technical soundness but also underscores its practical applicability in
real-world cybersecurity scenarios, where precision and recall are critical for timely threat
detection and mitigation.

By achieving this level of performance, the proposed approach sets a new benchmark
for entity recognition in cyber threat intelligence, paving the way for more reliable and
actionable insights in the field. The relatively lower F1 score for the target class can be
attributed to the smaller number of annotated examples and the inherent diversity of
entities within this class, which makes classification more challenging. Previous research
highlights that increasing the amount of annotated data and improving the consistency of
the annotation process can significantly improve model performance for underrepresented
classes [5]. These findings suggest that expanding the dataset and refining the annotation
guidelines could potentially improve the performance of the target class, and this will be
considered in future work.

Figure 4. Loss function of training and validation.

Table 6. Test set performance metrics for named entity recognition.

Entity Class Precision Recall F1 Score Support


Actor 0.96 0.99 0.97 1245
Campaign 0.95 1.00 0.97 212
Malware 0.96 0.99 0.98 1218
Target 0.73 0.95 0.83 414
Micro Avg 0.92 0.99 0.95 3089
Macro Avg 0.90 0.98 0.94 3089
Weighted Avg 0.93 0.99 0.96 3089

The confusion matrix in Figure 5 indicates the strong performance of the model: most
predictions are correct and very few errors are observed. The small number of
misclassifications, six for B-Malware and one for B-Target, reflects the robustness of the
model and is consistent with the high F1 scores in Table 6. This reinforces the reliability
of the model in effectively recognising cybersecurity entities.

Figure 5. Entity-level confusion matrix.

The learning curve, combined with the detailed evaluation metrics, illustrates that the
fine-tuned BERT model successfully captures cybersecurity-specific entities with high
accuracy and generalisability. These results validate the effectiveness of the training
module in transforming unstructured threat intelligence into structured data for
subsequent analysis.
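To make concrete how token-level BIO predictions such as B-Malware and B-Target become the entity spans evaluated above, the helper below merges BIO tags into (type, text) spans. This is our own sketch of the standard decoding step, not code from the described training module; the example sentence is invented.

```python
def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (entity_type, text) spans.

    A 'B-' tag opens a new entity; subsequent 'I-' tags of the same
    type extend it; 'O' (or a type change) closes the current span.
    """
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_type:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["APT28", "deployed", "X", "-", "Agent", "against", "government", "agencies"]
tags   = ["B-Actor", "O", "B-Malware", "I-Malware", "I-Malware", "O", "B-Target", "I-Target"]
print(bio_to_spans(tokens, tags))
# [('Actor', 'APT28'), ('Malware', 'X - Agent'), ('Target', 'government agencies')]
```

The extracted spans are what the subsequent RDF construction step consumes.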

4.2. Graph Analysis


A total of 540 cyber threat intelligence documents were analysed in this study to
extract entities such as threat actors, malware, campaigns, and targets. These entities were
structured and stored in RDF files based on a custom cyber threat ontology designed to
standardise the representation of the extracted data. Each document was first converted
into a separate RDF file, and these files were then combined into a single dataset. The combined RDF
data were transferred to the Neo4j graph database to enable the querying and graph-based
analysis of the relationships and properties between entities extracted from the documents.
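The per-document merge step can be sketched as follows. Real pipelines would typically use an RDF library such as rdflib before importing into Neo4j; this line-level merge assumes the simple, line-oriented N-Triples serialisation, and the file names used with it are invented.

```python
from pathlib import Path

def merge_ntriples(input_paths, output_path):
    """Combine several N-Triples files into one dataset, dropping duplicate triples.

    N-Triples is line-oriented (one triple per line), so a set-based union
    suffices for this simple sketch; a full RDF library would also handle
    blank nodes and other serialisations correctly.
    """
    triples, seen = [], set()
    for path in input_paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line and not line.startswith("#") and line not in seen:
                seen.add(line)
                triples.append(line)
    Path(output_path).write_text("\n".join(triples) + "\n", encoding="utf-8")
    return len(triples)
```

The deduplicated output file is then a single dataset ready for import into the graph database.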
The Neo4j graph database created by processing the cyber threat intelligence documents
contains various nodes and relationships that reflect the complexity of the cybersecurity
domain. The graph structure consists of 416 ThreatActor, 3049 malware, 378 target, and
81 campaign nodes. These nodes are interconnected through 5917 meaningful relationships
such as “isUsedBy”, “isFocusedBy”, “hasAlias”, “hasTargetedField”,
“hasAssociatedMalware”, “hasAssociatedCampaign”, and “isLaunchedBy”. This structured
representation enables the efficient querying and advanced analysis of the relationships
and interactions among cyber threat entities, providing valuable insights into threat actor
behaviours, malware associations, and targeted fields.
In this cyber threat analysis study, the complex relationships between threat actors,
target sectors, and malware were analysed using the Neo4j graph database. In the analysis
process, the interactions of each entity type with other entities were evaluated in detail and
the number of connections between entities was calculated to determine their impact levels.
In the basic analyses performed on the cleaned threat report data, the distribution of
malware usage by threat actors is shown in Table 7. The threat actor with node ID 5141
has the widest malware diversity, with 347 different malware samples. The second-ranked
actor, with node ID 2398, uses 55 malware samples, which shows that the first actor is
clearly dominant compared to the other actors.

Table 7. Distribution of malware count by threat actors.

Node ID (ThreatActor) Malware Count


5141 347
2398 55
261 34
2261 25
687 21

According to the results of the analysis, the actor with node ID 5141 has a wide malware
portfolio, high technical capacity, and rich resources. The group’s use of various malware
indicates that it can mount a wide range of attacks against different target systems and
security measures.
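The Table 7 style aggregation can be reproduced from a plain (actor, malware) edge list, as in the sketch below; in the actual system this count is performed over the “isUsedBy” relationships in the graph database, and the toy IDs here are illustrative.

```python
from collections import defaultdict

def malware_per_actor(uses_edges):
    """Count distinct malware per threat actor from (actor_id, malware_id) edges,
    mirroring the Table 7 aggregation, sorted by descending count."""
    portfolio = defaultdict(set)
    for actor, malware in uses_edges:
        portfolio[actor].add(malware)  # set membership removes duplicate edges
    return sorted(((a, len(m)) for a, m in portfolio.items()),
                  key=lambda pair: pair[1], reverse=True)

# Toy edge list; the real counts come from the Neo4j graph.
edges = [(5141, "m1"), (5141, "m2"), (5141, "m2"), (5141, "m3"), (2398, "m1")]
print(malware_per_actor(edges))  # [(5141, 3), (2398, 1)]
```

Swapping the roles of the two edge columns yields the reverse view, i.e. the number of distinct actors using each malware, as reported in Table 8.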
Another analysis determined how many different threat actors used each malware.
The “isUsedBy” relationship between malware nodes and ThreatActor nodes was used to
analyse the links between malware and the actors. Table 8 shows the number of unique
threat actor nodes associated with each malware node. These results demonstrate the
prevalence of malware in the cyber threat ecosystem and its use by different actors. The
results show that the malware with a node ID of 617 is used by five different threat actors
and is the most commonly used malware in the threat reports analysed.

Table 8. Distribution of malware usage among threat actors.

Node ID (Malware) Actor Count


617 5
1570 3
318 2
1465 2
1140 2

In another analysis, the target distribution of threat actors was examined. Using the
‘hasTargetedField’ relationship between the ‘ThreatActor’ and ‘target’ nodes, the areas on
which threat actors focus and their attack trends were analysed. According to the results
in Table 9, the target with node ID 942 (Google) was the one most frequently chosen by
different threat actors. The overall distribution shows that threat actors focus their attacks
on technology companies, critical infrastructures, and financial institutions.

Table 9. Most targeted fields and number of targeting actors.

Targeted Field Number of Targeting Actors


Google 27
Infrastructure 7
Government 5
Banking 5
Government Agencies 4

Finally, the distribution of the targeted fields of malware was analysed. Using the
‘hasTargetedField’ relationship between ‘malware’ and ‘target’ nodes, the number of targets
attacked by malware was counted. This analysis demonstrates the impact areas and attack
scopes of malware.
According to the results in Table 10, the malware with a node ID value of 2168 attacked
19 different targets, while the malware with a node ID value of 1140 attacked 14 different
targets. A sample graph for the malware with a node ID value of 1140 is shown in Figure 6.

Figure 6. Graph representation of malware node 1140 and its targets.

Table 10. Malware with the most targeted areas.

Node ID (Malware) Target Count


2168 19
1140 14
617 11
1046 9
4206 7

4.2.1. PageRank Analysis


In order to understand the complex structure of the relationships between actors,
malware, and targets in the analysed threat reports, centrality analyses were performed.
The first step in this process is to identify the most influential entities in the network using
the PageRank algorithm.
To determine the importance of threat actors, malware, and targets in the network,
the PageRank algorithm was applied. The nodes ’ThreatActor’, ’malware’, and ’target’
were selected, and a network model was created using ’isUsedBy’ and ’hasTargetedField’
relationships. According to the analysis results in Table 11, among the threat actors,
the actor with the node ID 5141 has the highest PageRank score, indicating that it has a
central position in the analysed reports, as illustrated in Figure 7. This shows that the actor
interacts intensively with other entities and has an important influence in the network.
For malware, the second highest score, obtained by the malware with node ID 318, indicates
that this malware is widely used in cyber-attacks and is favoured by several threat actors.
The fact that Google (node ID 942) has the highest PageRank score in the target category
confirms that this platform is a priority target for cyber attackers.
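A minimal power-iteration PageRank conveys the idea behind these scores; this is a generic sketch with a toy star graph and an assumed damping factor of 0.85, not the Neo4j implementation used in the study.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Basic PageRank by power iteration over a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
            else:  # dangling node: redistribute its mass evenly
                for n in nodes:
                    new_rank[n] += damping * rank[src] / len(nodes)
        rank = new_rank
    return rank

# Toy star graph: malware and target nodes all point at one central actor,
# so that actor accumulates the highest PageRank score.
edges = [("m1", "actor"), ("m2", "actor"), ("t1", "actor")]
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))  # 'actor'
```

The same intuition explains Table 11: the actor with node ID 5141 receives many incoming relationships and therefore dominates the ranking.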

Figure 7. Graph representation of threat actor node 5141 and its central position based on PageRank.

Table 11. PageRank analysis results.

Node ID Entity Type PageRank Score


5141 ThreatActor 47.4870
318 Malware 40.9437
942 Target 30.7374
2005 Malware 25.2274
617 Malware 20.7660

4.2.2. Relationship and Target Analysis of Threat Actors


To understand possible links and behavioural patterns among threat actors, this analysis
identifies the target intersections of threat actors, and the sizes of those intersections,
based on their common target choices.
According to the results of the analysis in Table 12, 5141 (threat actor node ID) has a
significant number of common targets with other threat actors. The four common targets
(banks, organisations, infrastructure, and energy sectors) shared with 2082 (threat actor
node ID) indicate that both groups selected similar targets for attacks against critical
infrastructure and financial systems. This intersection suggests that the groups have a
systematic approach to target selection. Together with the other results, this indicates that
threat actors concentrate on certain sectors when choosing targets: financial institutions,
critical infrastructure systems, and large organisations are among the most frequently
shared targets. These sectors therefore need to strengthen their defences against attacks
from multiple threat actors.

Table 12. Analysis results of actors attacking similar targets.

Actor1 ID Actor2 ID Common Targets Count


5141 2082 banking, organisations, infrastructure, energy 4
5141 2398 bank, Google, banking 3
1125 5141 infrastructure, financial, information 3
2285 5141 payment, infrastructure, banks 3
4515 1581 Google, government agencies 2
5141 463 healthcare, law 2

4.2.3. Predictive Analyses


Predicting the potential future targets of cyber threat actors is critical for the development
of proactive cyber security strategies. Graph-based predictive methods provide a
framework to analyse threat actors’ behavioural patterns and identify their potential
targets. The analysis of the current connections and network structure of threat actors
allows for the prediction of possible future attack targets. In this study, common neighbour
analysis and Adamic/Adar index analysis were used to predict potential targets. The
common neighbours method is based on the number of common neighbours shared by
two nodes and assumes that as the number of common links increases, the probability of
future connections between the two nodes increases. In the context of cyber security, this
approach provides an important tool for understanding the target selection patterns of
threat actors.
The results of the analysis in Table 13 show that the threat actor with node ID 5141 is
particularly interested in government agencies. The government agencies target stands out
with three common connections: nodes 933, 1581, and 4515.
The Adamic/Adar index weights each shared connection by the inverse logarithm of its
degree, giving higher weight to rarer connections. The results of the analysis in Table 14
show an identical score of 2.71, with 10 shared connections, for each of the listed potential targets.
Among the critical infrastructure targets, SSH services and airports stand out, while civil
sector targets such as gaming, cafes, and sports are also noteworthy. Both analyses produced
complementary results. While the common neighbours analysis identified the threat actor’s
basic target profile, the Adamic/Adar index provided a more sophisticated assessment of
this profile. The PageRank value (47.48) shown in Table 11 confirms the central position
of the actor with node ID 5141 in the analysed threat reports, as shown in the graph
representation in Figure 7.

Table 13. Common connections of threat actor node 5141.

Threat Actor Potential Target Common Connections Common Nodes


5141 government agencies 3 933, 1581, 4515
5141 crypto 2 687, 5617
5141 government agency 2 254, 2168
5141 political 2 687, 1581
5141 business 2 1639, 2168

Table 14. Adamic/Adar index results.

Threat Actor Potential Target Adamic/Adar Index Shared Connections


5141 SSH service 2.71 10
5141 gaming 2.71 10
5141 coffee shops 2.71 10
5141 sports 2.71 10
5141 airports 2.71 10
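Both link-prediction measures are easy to express over a neighbour map, as in the sketch below; the undirected toy graph is an invented example, chosen so that one shared neighbour is popular and one is rare, which makes the Adamic/Adar weighting visible.

```python
import math

def common_neighbours(graph, u, v):
    """Number of neighbours shared by nodes u and v."""
    return len(graph[u] & graph[v])

def adamic_adar(graph, u, v):
    """Adamic/Adar index: each shared neighbour z contributes 1/log(deg(z)),
    so rare (low-degree) shared neighbours weigh more than popular ones."""
    return sum(1.0 / math.log(len(graph[z])) for z in graph[u] & graph[v])

# Toy neighbour sets: 'hub' is a popular shared node, 'rare' a low-degree one.
graph = {
    "actor":  {"hub", "rare"},
    "target": {"hub", "rare"},
    "hub":    {"actor", "target", "x1", "x2", "x3"},
    "rare":   {"actor", "target"},
    "x1": {"hub"}, "x2": {"hub"}, "x3": {"hub"},
}
print(common_neighbours(graph, "actor", "target"))  # 2
score = adamic_adar(graph, "actor", "target")
# score = 1/log(5) + 1/log(2): the rare shared neighbour dominates.
```

This mirrors the behaviour discussed above: the common neighbours count gives the raw overlap, while the Adamic/Adar score discounts highly connected shared nodes.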

5. Conclusions
This paper presents a novel approach that combines knowledge graphs and advanced
natural language processing techniques for cyber threat intelligence analysis. The proposed
system successfully extracted critical entities such as threat actors, malware, campaigns,
and targets from unstructured security texts and made the relationships between these
entities analysable in a structured knowledge base. In particular, the analysis results showed
that some threat actors have a high level of similarity with certain target domains. These
results show that knowledge graph-based approaches offer a new perspective in cyber
threat intelligence and make visible the relationships and targets that traditional methods
fail to detect.

While the results of this study demonstrate the effectiveness of the proposed
approach, certain limitations should be acknowledged. The dataset size, although carefully
curated, may limit the generalisability of the findings to broader cybersecurity contexts.
Expanding the dataset with additional sources and increasing its diversity will be a key
focus in future work. Additionally, testing the model’s performance across different
domains and threat scenarios will help validate its robustness and applicability. By addressing
these limitations, we aim to further enhance the reliability and impact of the proposed
approach.

The performance of the methods used in the study is confirmed by both the
overall accuracy of the model and the graphical analysis results obtained. In addition
to understanding current threats, the system can predict potential future targets and is
considered as an important tool that can be used in cyber security studies. The method we
present in this study has significant potential for the early prediction of new attack vectors
and threat trends.
In future work, we plan to test the system with larger datasets and make improvements
to address threat analysis needs in different sectors. In addition, a more comprehensive
examination of the embedding methods and parameter optimisations could further
improve the accuracy of the system. The results of this study show that the proposed
approach can make significant contributions to the field of cyber threat intelligence from
both practical and theoretical perspectives.

Author Contributions: Conceptualisation, D.D.; methodology, D.D. and R.D.; formal analysis,
D.D.; investigation, R.D.; validation, R.D. and D.H.; writing—original draft preparation, D.D.;
writing—review and editing, D.D. and R.D.; supervision, R.D. and D.H. All authors have read and
agreed to the published version of the manuscript.

Funding: This research received no external funding.

Data Availability Statement: The data that support the findings of this study are available on request
from the corresponding author.

Acknowledgments: This study is derived from the PhD dissertation entitled “Centralized Monitoring
of Cyber Threat Intelligence Data and Development of New Approaches to Detection of Attacks”
submitted at Inonu University, Graduate School of Nature and Applied Sciences, Department of
Computer Engineering under the supervision of Davut Hanbay and Resul Das.

Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Demirol, D.; Das, R.; Hanbay, D. A key review on security and privacy of big data: Issues, challenges, and future research
directions. Signal Image Video Process. 2022, 17, 1335–1343. [CrossRef]
2. Lei, J.; Kong, L. Fundamentals of big data in radio astronomy. In Big Data in Astronomy; Kong, L., Huang, T., Zhu, Y., Yu, S., Eds.;
Elsevier: Amsterdam, The Netherlands, 2020; pp. 29–58. [CrossRef]
3. Gantz, J.; Reinsel, D. Extracting value from chaos. IDC IVIEW 2011, 1142, 1–12.
4. Ahmetoglu, H.; Das, R. A comprehensive review on detection of cyber-attacks: Data sets, methods, challenges, and future
research directions. Internet Things 2022, 20, 100615. [CrossRef]

5. Zhou, Y.; Tang, Y.; Yi, M.; Xi, C.; Lu, H. CTI View: APT Threat Intelligence Analysis System. Secur. Commun. Netw. 2022,
2022, 9875199. [CrossRef]
6. Piplai, A.; Mittal, S.; Joshi, A.; Finin, T.; Holt, J.; Zak, R. Creating Cybersecurity Knowledge Graphs From Malware After Action
Reports. IEEE Access 2020, 8, 211691–211703. [CrossRef]
7. Sarhan, I.; Spruit, M. Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph. Knowl.-Based Syst. 2021, 233, 107524.
[CrossRef]
8. Meta. ThreatExchange: A Threat Intelligence Sharing Platform. 2024. Available online: https://developers.facebook.com/
products/threat-exchange/ (accessed on 13 December 2023).
9. Symantec. Symantec Enterprise Blogs-Threat Intelligence. Available online: https://symantec-enterprise-blogs.security.com/
blogs/threat-intelligence (accessed on 25 January 2023).
10. Barnum, S. Standardising Cyber Threat Intelligence Information with the Structured Threat Information eXpression. 2014. Available
online: https://stixproject.github.io/about/STIX_Whitepaper_v1.1.pdf (accessed on 25 January 2023).
11. MITRE. MAEC-Malware Attribute Enumeration and Characterization. 2022. Available online: https://maecproject.github.io/
(accessed on 25 January 2023).
12. Husari, G.; Al-Shaer, E.; Ahmed, M.; Chu, B.; Niu, X. TTPDrill: Automatic and Accurate Extraction of Threat Actions from
Unstructured Text of CTI Sources. In Proceedings of the 33rd Annual Computer Security Applications Conference, Orlando, FL,
USA, 4–8 December 2017; pp. 103–115. [CrossRef]
13. Jones, C.L.; Bridges, R.A.; Huffer, K.M.T.; Goodall, J.R. Towards a Relation Extraction Framework for Cyber-Security Concepts.
In Proceedings of the 10th Annual Cyber and Information Security Research Conference, Oak Ridge, TN, USA, 7–9 April 2015;
pp. 1–4. [CrossRef]
14. Alves, F.; Bettini, A.; Ferreira, P.M.; Bessani, A. Processing tweets for cybersecurity threat awareness. Inf. Syst. 2021, 95, 101586.
[CrossRef]
15. Kim, E.; Kim, K.; Shin, D.; Jin, B.; Kim, H. CyTIME: Cyber Threat Intelligence ManagEment framework for automatically
generating security rules. In Proceedings of the 13th International Conference on Future Internet Technologies, Seoul, Republic of
Korea, 20–22 June 2018; pp. 1–5. [CrossRef]
16. Zhang, H.; Shen, G.; Guo, C.; Cui, Y.; Jiang, C. EX-Action: Automatically Extracting Threat Actions from Cyber Threat Intelligence
Report Based on Multimodal Learning. Secur. Commun. Netw. 2021, 2021, 5586335:1–5586335:12. [CrossRef]
17. Alam, M.T.; Bhusal, D.; Park, Y.; Rastogi, N. CyNER: A Python Library for Cybersecurity Named Entity Recognition. arXiv 2022,
arXiv:2204.05754. [CrossRef]
18. Zhu, Z.; Dumitras, T. ChainSmith: Automatically Learning the Semantics of Malicious Campaigns by Mining Threat Intelligence
Reports. In Proceedings of the 2018 IEEE European Symposium on Security and Privacy (EuroS&P), London, UK, 24–26 April
2018; pp. 458–472. [CrossRef]
19. Ahmetoglu, H.; Das, R. Analysis of Feature Selection Approaches in Large Scale Cyber Intelligence Data with Deep Learning. In
Proceedings of the 2020 28th Signal Processing and Communications Applications Conference (SIU), Gaziantep, Turkey, 5–7
October 2020; pp. 1–4. [CrossRef]
20. Noor, U.; Anwar, Z.; Amjad, T.; Choo, K.K.R. A machine learning-based FinTech cyber threat attribution framework using
high-level indicators of compromise. Future Gener. Comput. Syst. 2019, 96, 227–242. [CrossRef]
21. Qin, Y.; Shen, G.; Zhao, W.; Chen, Y.; Yu, M.; Jin, X. A network security entity recognition method based on feature template and
CNN-BiLSTM-CRF. Front. Inf. Technol. Electron. Eng. 2019, 20, 872–884. [CrossRef]
22. Bose, A.; Behzadan, V.; Aguirre, C.; Hsu, W.H. A novel approach for detection and ranking of trendy and emerging cyber threat
events in Twitter streams. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks
Analysis and Mining, Vancouver, BC, Canada, 27–30 August 2019; pp. 871–878. [CrossRef]
23. Ma, P.; Jiang, B.; Lu, Z.; Li, N.; Jiang, Z. Cybersecurity named entity recognition using bidirectional long short-term memory with
conditional random fields. Tsinghua Sci. Technol. 2021, 26, 259–265. [CrossRef]
24. Kim, D.; Kim, H.K. Automated Dataset Generation System for Collaborative Research of Cyber Threat Analysis. Secur. Commun.
Netw. 2019, 2019, 6268476. [CrossRef]
25. Zhang, H.; Guo, Y.; Li, T. Multifeature Named Entity Recognition in Information Security Based on Adversarial Learning. Secur.
Commun. Netw. 2019, 2019, 6417407. [CrossRef]
26. Georgescu, T.M.; Iancu, B.; Zurini, M. Named-Entity-Recognition-Based Automated System for Diagnosing Cybersecurity
Situations in IoT Networks. Sensors 2019, 19, 3380. [CrossRef]
27. Sun, T.; Yang, P.; Li, M.; Liao, S. An Automatic Generation Approach of the Cyber Threat Intelligence Records Based on
Multi-Source Information Fusion. Future Internet 2021, 13, 40. [CrossRef]
28. Wu, H.; Li, X.; Gao, Y. An Effective Approach of Named Entity Recognition for Cyber Threat Intelligence. In Proceedings of the
2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China,
12–14 June 2020; IEEE: New York, NY, USA, 2020; pp. 1370–1374. [CrossRef]

29. Simran, K.; Sriram, S.; Vinayakumar, R.; Soman, K.P. Deep Learning Approach for Intelligent Named Entity Recognition of
Cyber Security. In Proceedings of the Advances in Signal Processing and Intelligent Recognition Systems, Trivandrum, India,
18–21 December 2019; Thampi, S.M., Hegde, R.M., Krishnan, S., Mukhopadhyay, J., Chaudhary, V., Marques, O., Piramuthu, S.,
Corchado, J.M., Eds.; Springer: Singapore, 2020; pp. 163–172. [CrossRef]
30. Kim, G.; Lee, C.; Jo, J.; Lim, H. Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network. Int. J.
Mach. Learn. Cybern. 2020, 11, 2341–2355. [CrossRef]
31. Jia, J.; Yang, L.; Wang, Y.; Sang, A. Hyper attack graph: Constructing a hypergraph for cyber threat intelligence analysis. Comput.
Secur. 2025, 149, 104194. [CrossRef]
32. Srivastava, S.; Paul, B.; Gupta, D. Study of Word Embeddings for Enhanced Cyber Security Named Entity Recognition. Procedia
Comput. Sci. 2023, 218, 449–460. [CrossRef]
33. Ahmed, K.; Khurshid, S.K.; Hina, S. CyberEntRel: Joint extraction of cyber entities and relations using deep learning. Comput.
Secur. 2024, 136, 103579. [CrossRef]
34. Satvat, K.; Gjomemo, R.; Venkatakrishnan, V.N. Extractor: Extracting Attack Behavior from Threat Reports. In Proceedings of the
2021 IEEE European Symposium on Security and Privacy (EuroS&P), Vienna, Austria, 6–10 September 2021; IEEE: New York, NY,
USA, 2021; pp. 598–615. [CrossRef]
35. Guo, Y.; Liu, Z.; Huang, C.; Wang, N.; Min, H.; Guo, W.; Liu, J. A framework for threat intelligence extraction and fusion. Comput.
Secur. 2023, 132, 103371. [CrossRef]
36. Jo, H.; Lee, Y.; Shin, S. Vulcan: Automatic extraction and analysis of cyber threat intelligence from unstructured text. Comput.
Secur. 2022, 120, 102763. [CrossRef]
37. Wang, G.; Liu, P.; Huang, J.; Bin, H.; Wang, X.; Zhu, H. KnowCTI: Knowledge-based cyber threat intelligence entity and relation
extraction. Comput. Secur. 2024, 141, 103824. [CrossRef]
38. Syed, Z.; Padia, A.; Finin, T.; Mathews, M.L.; Joshi, A. UCO: A Unified Cybersecurity Ontology. In Proceedings of the AAAI
Workshop: Artificial Intelligence for Cyber Security, Phoenix, AZ, USA, 12 February 2016; Martinez, D.R., Streilein, W.W., Carter,
K.M., Sinha, A., Eds.; AAAI Technical Report; AAAI Press: Washington, DC, USA, 2016; Volume WS-16-03.
39. ESET. WeLiveSecurity. 2023. Available online: https://www.welivesecurity.com (accessed on 9 August 2024).
40. Labs, F. FortiGuard Labs Threat Research. Available online: https://www.fortinet.com/blog/threat-research (accessed on 9
August 2024).
41. CyberMonitor. APT Cyber Criminal Campaign Collections. 2023. Available online: https://github.com/CyberMonitor/APT_
CyberCriminal_Campagin_Collections (accessed on 9 August 2024).
42. Fenniak, M. The PyPDF2 Library. 2022. Available online: https://pypi.org/project/PyPDF2/ (accessed on 9 August 2024).
43. MITRE Corporation. MITRE ATT&CK Framework. Available online: https://attack.mitre.org (accessed on 14 December 2023).
44. MITRE Corporation. Mitreattack-Python: Python Library for Interacting with the MITRE ATT&CK Framework. 2024. Available
online: https://github.com/mitre-attack/mitreattack-python (accessed on 14 December 2024).
45. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding;
Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [CrossRef]
46. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell.
2018, 42, 318–327. [CrossRef]
47. Sulu, M.; Das, R. Graph visualization of cyber threat intelligence data for analysis of cyber attacks. Balk. J. Electr. Comput. Eng.
2022, 10, 300–306. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
