A Novel Approach For Cyber Threat Analysis Systems Using BERT Model From Cyber Threat Intelligence Data
society. To prevent attacks that seriously affect IT society and systems, security researchers
have developed and implemented many methods and taken numerous measures. The
efficiency of artificial intelligence (AI)-based methods depends on the knowledge they are
trained on. This knowledge base can be kept up to date by analysing current security
documents and extracting cyber assets efficiently and accurately. The simultaneous sharing
of extracted cyber assets with security communities increases the effectiveness of cyber
threat intelligence and helps to train cyber defence systems with more accurate information.
Cyber threat intelligence (CTI) is critical information that can help organisations
protect their infrastructure from threats. To be prepared for new attacks, cybersecurity
experts must keep their knowledge base current on new malware and attack scenarios by
analysing up-to-date CTI reports. Although vendors such as Symantec, McAfee, Trend Micro,
and FireEye share a large amount of structured CTI data, much CTI data is available only in
an unstructured format via publicly available sources such as cyber security blogs and
security reports. Manually collecting and analysing large amounts of unstructured cyber
intelligence data can make defence against cyber security threats inadequate and
inefficient. Moreover, the rapid proliferation of open-source documents containing CTI
information makes it more difficult for human analysts to monitor and process these data
efficiently and in a timely manner. To overcome these challenges, we propose a hybrid
system for extracting cyber security entities from unstructured security texts. The
proposed system is based on a combination of natural language processing (NLP), AI methods,
and rule-based pattern recognition. Implementing the proposed system raises several challenges.
First, before extracting cyber entities, some preprocessing steps need to be applied to
the collected data. Depending on the type of data source, text data need to be cleaned
and extracted from these sources and converted into a common format. To do this, we
developed separate methods according to data sources and extracted clean sentences from
text data. Another challenge is the value of the data collected. “Value”,
one of the defining components of big data, is time-dependent: as time
passes, the value of the available data decreases. Therefore, it is necessary to keep the
cyber intelligence data constantly up to date. For this purpose, the proposed system is
updated periodically by collecting new CTI reports from the data sources. The main
contribution of this paper is the construction of hand-annotated NER and relationship
extraction datasets for cybersecurity, motivated by the lack of publicly available datasets
for cyber threat intelligence. To construct
these datasets, we preprocessed and annotated 100 PDF-formatted cyber security-related reports
and 540 HTML-formatted web pages.
2. Related Works
Cyber threat intelligence is vital for organisations and security communities to protect
their cybersecurity assets against rapidly evolving cyber threats. In particular, cyber threat
intelligence related to APT (advanced persistent threat) attacks contains detailed technical
information about attackers, targets, and attack techniques and tactics. However,
extracting and analysing cyber-related information using traditional methods can be very
time-consuming and requires much manual work from security analysts. Therefore, extracting
and using cyber threat intelligence from unstructured data is both vital and challenging, and
most security researchers have focused on automating the extraction of threat intelligence
from public data sources. Publicly available threat sources such as ThreatExchange [8],
Symantec [9], Kaspersky, hacker forums, and social media platforms are useful sources
for threat intelligence and sharing. However, these sources provide threat information only
in the form of unstructured text. Machine-readable formats such as STIX [10] and MAEC [11]
have been developed to standardise such information, facilitating the sharing of threat
intelligence and supporting the prompt prevention and identification of
attacks [12].
In this section, we review previous works performed on the extraction of CTI data
from open data sources in the field of cybersecurity. Zhou et al. [5] proposed a CTI analysis
framework called CTI View to extract CTI from unstructured APT texts and analyse
them automatically. The authors extract threat entities using a
BERT-GRU-BiLSTM-CRF model based on bidirectional encoder representations from
transformers (BERT), and report an accuracy of more than 72%.
Jones et al. employed a combination of semi-supervised machine learning techniques
and active learning approaches to extract entities and relationships related to network
security [13]. Husari et al. proposed a tool called TTPDrill that is designed to automatically
extract threat actions from unstructured text sources such as CTI reports. The tool uses
NLP techniques to identify and extract relevant information from the text and has been
evaluated on a dataset of CTI reports. The results show that TTPDrill can extract threat
actions with high accuracy [12]. Alves et al. presented a Twitter streaming threat monitor
that regularly updates a summary of threats related to the target. They collected tweets that
pertain to cyber security incidents and extracted features using the term frequency–inverse
document frequency method and then employed both a multi-layer perceptron and support
vector machine to be used as classifiers for the collected tweets [14]. Kim et al. proposed a
Symmetry 2025, 17, 587 5 of 27
framework called CyTIME that uses structured CTI data from repositories. The framework
collects intelligence data, continuously generates rules without human intervention, and
then converts them into a JSON format known as Structured Threat Information Expression
(STIX) to mitigate real-time network cyber threats [15]. Zhang et al. proposed a framework
to extract cyber threat actions from CTI reports using NLP. In addition to the extraction
of actions, the framework finds relationships among entities [16]. Piplai et al. prepared a
framework to extract cyber information from the after-action reports and represent that in
a knowledge graph to offer insightful analyses to cybersecurity analysts. The system uses
NER and regular expressions to identify cyber-related entities [6]. Sarhan et al. presented a
neural network-based open information extraction (OIE) system to extract valuable cyber
intelligence from unstructured texts. The proposed approach constructs knowledge graph
representation from the threat reports and performs named entity recognition (NER) using
OIE [7]. Alam et al. designed a transformer-based library that implements an NER system for
extracting cyber entities. This library uses a neural language model, XLM-RoBERTa,
which is trained on threat reports [17]. Zhu et al. proposed a system that uses NLP
methods to extract IOCs from security-related articles and classify the articles into campaign
stages. To enhance the IOC extraction stage, rule-based methods were used [18].
These studies demonstrate various approaches to cyber threat intelligence extraction.
However, recent developments show a clear trend towards more sophisticated methods
that combine transformer models with specialised knowledge representations. Current
research focuses predominantly on hybrid architectures that combine state-of-the-art lan-
guage models with graph-based approaches, achieving higher performance metrics than
traditional methods.
This study aims to contribute to the fields of cyber threat intelligence, natural language
processing, named entity recognition, deep learning models, and graph-based analysis.
A review of existing studies in the literature shows that cyber threat intelligence processes
are mostly handled using traditional methods and that graph-based analysis approaches are
applied in this area to a limited extent. Furthermore, research integrating ontology-based
approaches is rather scarce.
This study addresses these gaps by leveraging a fine-tuned BERT model to enhance
contextual entity recognition in cyber threat intelligence. This aligns with the literature’s
preference for hybrid models, as shown in Table 1, which highlights significant advances in
combining transformer-based architectures with graph-based knowledge representations.
Specifically, this study presents a novel approach to map threats and identify hidden links
by integrating graph-based Neo4j models to visualise and analyse the relationships be-
tween cyber threat entities. In addition, the proposed ontology-based model systematically
structures the entity relationships, further improving the interpretability and utilisation of
the extracted knowledge. These innovations, combining BERT-based NER with domain-
specific ontology and graph-based analysis, deliver a comprehensive framework for cyber
threat intelligence. The fine-tuned BERT model enhances contextual entity recognition,
particularly for cyber threat terminology, while the graph-based approach, implemented
with Neo4j, visualises and analyses relationships between threat entities, uncovering poten-
tial connections and mapping threat chains [19]. Additionally, the ontology-based model
systematically structures entities and their relationships, facilitating a more meaningful and
organised representation of cyber threat data. Together, these advancements significantly
improve multi-entity recognition, relationship extraction, and threat visualisation, provid-
ing both methodological and practical contributions to the field of cyber threat intelligence.
Table 1. Cont.
In the first step, the system collects text-based data from various online sources, such
as websites, APIs, and threat reports. These data can be in both structured and unstructured
formats and form the basis for training and analysing the system. The collected data are pre-
processed to remove unnecessary elements and convert them into a standardised format.
In the second step, after data collection and preprocessing, the data are manually annotated
to create a labelled dataset. This dataset includes entities such as threat actors, malware,
campaigns, and targets, which are critical for the system to identify domain-specific entities.
In the third step, the annotated dataset is then used to fine-tune a pre-trained BERT model
for the named entity recognition (NER) task. The model is trained and evaluated to
ensure its ability to extract entities with high precision and recall, making it suitable for
cybersecurity applications. In the fourth step, once the model extracts the entities and their
relationships, these are organised into entity–relation–entity triples and stored in a Neo4j
graph database. This step enables the creation of knowledge graphs that represent the
relationships between cyber entities. The relationships are organised using the Unified
Cybersecurity Ontology (UCO) framework [38], which defines the relationships between
threat actors, malware, campaigns, and targets. In the final step, the knowledge graphs
are analysed to uncover hidden patterns, predict threat behaviours, and identify more
complex relationships. Advanced graph-based techniques, such as PageRank, are applied to
prioritise and analyse the most critical entities and their connections. This process supports
a deeper understanding of cyber threats and helps develop proactive defence strategies.
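As a rough illustration of the final step, PageRank can be computed by power iteration over the directed edges of a knowledge graph. The sketch below is a minimal pure-Python version; the toy edges, the node names, the damping factor of 0.85, and the function name `pagerank` are assumptions made for illustration and do not reflect the configuration or data used in this study.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy graph: edges point from malware/campaigns to actors.
toy_edges = [
    ("MalwareA", "ActorX"),
    ("MalwareB", "ActorX"),
    ("CampaignC", "ActorX"),
    ("MalwareB", "ActorY"),
]
scores = pagerank(toy_edges)
top_node = max(scores, key=scores.get)
```

In this toy graph the actor that receives the most incoming relationships obtains the highest score, which is the intuition behind using PageRank to prioritise central entities.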
The proposed approach, illustrated in Figure 2, is a sequential pipeline
consisting of four modules: the data module,
the training module, the knowledge construction module, and the graph analysis module.
The following subsections provide detailed explanations of each module.
3.1.2. Preprocessing
In this stage of the proposed system, the raw cyber-texts collected from various sources
were cleaned and converted to a common standard format. This preprocessing turned the raw
text data into a structured form and made them ready for the next stage. The steps applied
include cleaning, tokenisation, the removal of
HTML tags, and the removal of unnecessary characters. Algorithm 1 shows
the processes performed on the texts.
• Text Cleaning: The text cleaning process removes irrelevant items to
ensure the clarity and consistency of the raw data. To ensure the
quality of the dataset used in the following modules, the texts were converted
to lower case, Unicode characters were normalised, extra spaces and newlines were
removed, non-alphanumeric characters of no analytical importance, such as “!”, “@” or “#”,
were removed, and HTML and script tags were removed from
the texts obtained from web pages. This minimised the noise in the data, improved
the quality of the data, and made the data more suitable for the annotation process.
• Tokenisation: Tokenisation is the process of splitting text into small and manageable
pieces. At this stage, text is split into sentences and then into words, and analysis is
performed at the word level. Effective tokenisation is a prerequisite for accurate named
entity recognition, because NER systems operate at the token
level when extracting entities from text.
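The cleaning and tokenisation steps listed above can be sketched with the Python standard library. This is an illustrative approximation rather than Algorithm 1 itself: the regular expressions, the function names `clean_text` and `tokenise`, the set of punctuation kept, and the naive full-stop sentence splitter are all assumptions. A production pipeline would more likely rely on a dedicated HTML parser and an NLP tokeniser.

```python
import html
import re
import unicodedata


def clean_text(raw):
    """Apply the cleaning steps described above to one raw document."""
    # Drop script/style blocks first, then any remaining HTML tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    text = html.unescape(text)
    # Normalise Unicode (e.g. full-width forms) and lower-case the text.
    text = unicodedata.normalize("NFKC", text).lower()
    # Remove symbols such as "!", "@" or "#"; keep word characters,
    # whitespace, and a little punctuation needed for sentence splitting.
    text = re.sub(r"[^\w\s.,:/-]", " ", text)
    # Collapse extra spaces and newlines.
    return re.sub(r"\s+", " ", text).strip()


def tokenise(text):
    """Split cleaned text into sentences, then into word-level tokens."""
    # Naive assumption: a full stop followed by whitespace ends a sentence.
    sentences = re.split(r"(?<=\.)\s+", text)
    return [re.findall(r"\w+(?:[-./]\w+)*|[.,:]", s) for s in sentences if s]


sample = (
    "<p>The APT29   group used #SUNBURST.</p>"
    "<script>alert(1)</script>\nIt targeted agencies!"
)
cleaned = clean_text(sample)
tokens = tokenise(cleaned)
```

The naive splitter would break abbreviations such as "u.s." into separate sentences, which is one reason a trained sentence segmenter is usually preferable in practice.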
Token → Tag
The → O
APT29 → B-ThreatActor
group → I-ThreatActor
used → O
the → O
malware → O
SUNBURST → B-Malware
in → O
a → O
campaign → B-Campaign
targeting → O
U.S. → B-Target
government → I-Target
agencies → I-Target
. → O
This systematic labelling enables the creation of a structured dataset that clearly iden-
tifies and separates key entities in cybersecurity text. The remaining portion of the dataset,
comprising unannotated documents, was utilised during the Knowledge Construction
Module to extract entities using the trained NER model. Such annotated datasets are
essential for training machine learning models to automate the recognition of entities and
relationships in unstructured threat reports.
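To make the labelling scheme concrete, the sketch below groups BIO-tagged tokens back into entity spans, using the example sentence shown above. The function name `bio_to_spans` and the rule that an inconsistent `I-` tag closes the open span are assumptions for illustration, not part of the annotation tooling used in this study.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ("The APT29 group used the malware SUNBURST in a campaign "
          "targeting U.S. government agencies .").split()
tags = ["O", "B-ThreatActor", "I-ThreatActor", "O", "O", "O", "B-Malware",
        "O", "O", "B-Campaign", "O", "B-Target", "I-Target", "I-Target", "O"]
spans = bio_to_spans(tokens, tags)
```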
left and right context of each word in the text simultaneously, and, with this approach, it
understands the texts more contextually and produces numerical representations.
The pre-trained BERT model can extract general entities such as names, places, dates,
etc., from text. In order to extract domain-specific entities from text, it is necessary to
fine-tune the BERT model using a specially designed annotated dataset.
formance of the model is being evaluated. This ensures efficient use of resources and
prevents overfitting the model.
• Gradient Accumulation Steps: Accumulates gradients over multiple batches be-
fore performing an update. This allows effective training with smaller batch sizes,
especially on memory-limited hardware.
• Focal Loss Gamma: Controls the focusing factor in the focal loss formula. A higher
gamma value reduces the loss of well-classified examples, allowing the model to focus
more on hard-to-classify tokens.
• Focal Loss Alpha: Balances the importance of different classes. A higher alpha value
emphasises minority classes, helping to address class imbalance during training.
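The effect of the two focal loss parameters can be seen in a small numerical sketch. It uses the standard focal loss form, FL(p) = −α (1 − p)^γ log(p), where p is the predicted probability of the true class. Applying a single scalar alpha to every token is a simplifying assumption made here; how alpha is assigned per class in the actual training code is not specified in the text.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=0.25):
    """Focal loss for one token, given the probability of its true class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


def cross_entropy(p_true):
    """Plain cross-entropy for one token, for comparison."""
    return -math.log(p_true)


easy, hard = 0.95, 0.30  # hypothetical well-classified and hard tokens
# Ratio of focal loss to cross-entropy equals alpha * (1 - p) ** gamma.
easy_ratio = focal_loss(easy) / cross_entropy(easy)
hard_ratio = focal_loss(hard) / cross_entropy(hard)
```

With gamma = 2.0 and alpha = 0.25, the well-classified token keeps a far smaller fraction of its cross-entropy loss than the hard token, so hard tokens dominate the gradient.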
The pseudo-code for the training process is provided in Algorithm 2.
Then, the relationships between entities are established based on the defined ontology,
and finally, the entities and relationships are represented as triples. These triples are
stored in a structured knowledge base that enables querying and analysis of threat
intelligence information.
The knowledge graph construction process begins with the extraction of entities
using the fine-tuned BERT model. Relationships between these entities are defined based
on a domain-specific ontology, which specifies semantic links such as “isUsedBy” (e.g.,
a malware used by a threat actor) and “isLaunchedBy” (e.g., a campaign launched by a
threat actor). These relationships are validated against predefined ontological rules to
ensure consistency. The entities and relationships are then represented as triples in the
format of “Entity–Relation–Entity” (e.g., “Emotet-isUsedBy-APT28”). These triples are
stored in a Neo4j graph database, enabling efficient querying and visualisation.
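The validation and storage steps described above might be sketched as follows. The rule table, the node labels `Malware`, `Campaign`, and `ThreatActor`, the `name` property, and both function names are assumptions for illustration; only the relation names and the example triple come from the text. The code builds a parameterised Cypher `MERGE` statement as a string and does not connect to a database.

```python
# Assumed ontology rules: relation -> (subject label, object label).
ONTOLOGY_RULES = {
    "isUsedBy": ("Malware", "ThreatActor"),
    "isLaunchedBy": ("Campaign", "ThreatActor"),
}


def is_valid(triple, entity_types):
    """Check a (subject, relation, object) triple against the rules."""
    subject, relation, obj = triple
    expected = ONTOLOGY_RULES.get(relation)
    if expected is None:
        return False
    return (entity_types.get(subject), entity_types.get(obj)) == expected


def to_cypher(triple, entity_types):
    """Build a parameterised Cypher MERGE statement for a valid triple."""
    subject, relation, obj = triple
    # Labels and relation names come from the fixed rule table above;
    # entity names are passed as query parameters, not interpolated.
    query = (
        f"MERGE (s:{entity_types[subject]} {{name: $subject}}) "
        f"MERGE (o:{entity_types[obj]} {{name: $object}}) "
        f"MERGE (s)-[:{relation}]->(o)"
    )
    return query, {"subject": subject, "object": obj}


entity_types = {"Emotet": "Malware", "APT28": "ThreatActor"}
triples = [
    ("Emotet", "isUsedBy", "APT28"),
    ("APT28", "isUsedBy", "Emotet"),  # wrong direction, rejected by the rules
]
valid = [t for t in triples if is_valid(t, entity_types)]
query, params = to_cypher(valid[0], entity_types)
```

Using `MERGE` rather than `CREATE` keeps the graph free of duplicate nodes when the same entity appears in several reports.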
To further analyse the roles of campaigns and malware, relationships are defined
with other entities. For example, the isUsedBy relationship links a malware to the threat
actor that uses it, while the isLaunchedBy relationship links a campaign to the threat actor
that launched it. Campaigns are defined by two relationships. The usesMalware rela-
tionship specifies which malware a campaign uses, while the isAssociatedWithCampaign
relationship specifies which campaigns the threat actors are associated with.
4. Experimental Results
This section presents a detailed evaluation of the proposed approach from different
perspectives. The evaluation process includes the performance analysis of the named entity
recognition model and the examination of the analysis obtained from the generated graph
structure. The results show the system’s ability to systematically process threat information
and the effectiveness of its analysis capabilities.
In the following subsections, the results are presented, starting from the evaluation
of the training module in Section 4.1 to the detailed analysis of the graph-based results in
Section 4.2.
validation performance, and early stopping was applied at 13 epochs to prevent overfitting.
The BERT-based model was specifically fine-tuned for named entity recognition (NER) tasks
in the cybersecurity domain, leveraging its pre-trained language understanding capabilities.
These hyperparameter choices were guided by systematic experimentation to ensure that
the model achieved robust and reliable performance. Table 5 lists the hyperparameters
used during fine-tuning.
Hyperparameter                 Value
Maximum Sequence Length        128
Batch Size                     32
Test Size                      0.2
Hidden Dropout                 0.1
Attention Dropout              0.4
Learning Rate                  3 × 10⁻⁵
Adam Epsilon                   1 × 10⁻⁸
Number of Epochs               20
Max Gradient Norm              1.0
Warmup Ratio                   0.1
Weight Decay                   0.01
Early Stopping Patience        5
Gradient Accumulation Steps    3
Focal Loss Gamma               2.0
Focal Loss Alpha               0.25
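Some of these values interact. With a batch size of 32 and 3 gradient accumulation steps, each optimiser update effectively sees 96 examples, and the warmup ratio determines how many updates are spent ramping up the learning rate. The sketch below works through this arithmetic. The dataset size of 9600 examples is hypothetical, and the linear warmup followed by linear decay is an assumed schedule, since the text does not state which scheduler was used.

```python
import math

BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 3
NUM_EPOCHS = 20
WARMUP_RATIO = 0.1
LEARNING_RATE = 3e-5

effective_batch_size = BATCH_SIZE * GRAD_ACCUM_STEPS  # examples per update


def schedule(num_train_examples):
    """Return (total optimiser updates, warmup updates) for a dataset size."""
    batches_per_epoch = math.ceil(num_train_examples / BATCH_SIZE)
    updates_per_epoch = math.ceil(batches_per_epoch / GRAD_ACCUM_STEPS)
    total_updates = updates_per_epoch * NUM_EPOCHS
    return total_updates, round(total_updates * WARMUP_RATIO)


def lr_at(step, total_updates, warmup_updates):
    """Linear warmup to LEARNING_RATE, then linear decay to zero (assumed)."""
    if step < warmup_updates:
        return LEARNING_RATE * step / max(1, warmup_updates)
    remaining = total_updates - step
    decay_span = max(1, total_updates - warmup_updates)
    return LEARNING_RATE * max(0.0, remaining / decay_span)


total, warmup = schedule(num_train_examples=9600)  # hypothetical dataset size
```

Because early stopping may end training before the configured 20 epochs, the number of updates actually performed can be smaller than `total`.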
The learning curve presented in Figure 4 shows the progression of training and
validation losses across epochs. The training loss rapidly decreases during the initial
epochs, indicating that the model effectively learns the underlying patterns in the training
data. Similarly, the validation loss decreases significantly, stabilising around the 10th
epoch, which suggests the model’s ability to generalise well to unseen data. The relatively
small gap between the training and validation loss curves highlights minimal overfitting,
supported by the implementation of early stopping at the 13th epoch.
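Patience-based early stopping of the kind referred to here can be sketched in a few lines. The monitored quantity (validation loss), the function name, and the loss values are assumptions for illustration; the text does not state which metric triggered the stop, and the numbers below are not the losses from Figure 4.

```python
def early_stopping_epoch(validation_losses, patience=5):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch losses."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(validation_losses), best_epoch


# Hypothetical validation losses, one value per epoch.
losses = [0.90, 0.50, 0.30, 0.25, 0.26, 0.27, 0.26, 0.28, 0.29, 0.30]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=5)
```

In this toy run the best loss occurs at epoch 4 and training stops at epoch 9, after five consecutive epochs without improvement. In practice the weights from the best epoch are usually restored.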
The training process achieved a peak validation F1 score of 0.9357 with validation
accuracy reaching 0.9897 at the 13th epoch. Table 6 summarises the evaluation metrics
for the model’s performance across the entity types. High F1 scores for most classes, such
as 0.97 for threat actors and campaigns and 0.98 for malware, demonstrate the model’s
effectiveness. However, the target class exhibited a relatively lower F1 score of 0.83,
primarily due to lower precision. This indicates potential difficulty in distinguishing
this class from others, warranting further investigation or additional training data for
this class. The achieved F1 score of 96% highlights the robustness and effectiveness of
the proposed model in extracting cybersecurity entities from unstructured data. This
performance surpasses many existing methods, which often struggle with class imbalance
and underrepresented categories. The integration of oversampling and focal loss techniques
played a pivotal role in addressing these challenges, ensuring that minority classes such as
“malware” and “campaign” were accurately identified.
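The link between lower precision and a lower F1 score, as observed for the target class, follows directly from the metric definitions. The sketch below computes precision, recall, and F1 from true positive, false positive, and false negative counts. The counts are hypothetical and chosen only to show the effect; they are not taken from Table 6 or Figure 5.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical class with many false positives and few false negatives.
precision, recall, f1 = precision_recall_f1(tp=90, fp=30, fn=5)
```

Here recall stays high, but the false positives pull precision down to 0.75, and F1, as the harmonic mean of the two, falls with it.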
In comparison to traditional approaches, which typically rely on rule-based or less
context-aware models, the fine-tuned BERT model demonstrated superior contextual
understanding and adaptability to domain-specific terminology. This high F1 score not only
validates the model’s technical soundness but also underscores its practical applicability in
real-world cybersecurity scenarios, where precision and recall are critical for timely threat
detection and mitigation.
By achieving this level of performance, the proposed approach sets a new benchmark
for entity recognition in cyber threat intelligence, paving the way for more reliable and
actionable insights in the field. The relatively lower F1 score for the target class can be
attributed to the smaller number of annotated examples and the inherent diversity of
entities within this class, which makes classification more challenging. Previous research
highlights that increasing the amount of annotated data and improving the consistency of
the annotation process can significantly improve model performance for underrepresented
classes [5]. These findings suggest that expanding the dataset and refining the annotation
guidelines could potentially improve the performance of the target class, and this will be
considered in future work.
The confusion matrix in Figure 5 indicates the strong performance of the model:
most predictions are correct and very few errors are observed. The small number of
misclassifications, six for B-Malware and one for B-Target, reflects the robustness
of the model and is consistent with the high F1 scores in Table 6. This reinforces the
reliability of the model in effectively recognising cyber security assets.
The learning curve, combined with the detailed evaluation metrics, illustrates that the
fine-tuned BERT model successfully captures cybersecurity-specific entities with high accu-
racy and generalisability. These results validate the effectiveness of the training module in
transforming unstructured threat intelligence into structured data for subsequent analysis.
The analysis results indicate that the actor with a node
ID of 5141 has a wide malware portfolio, which suggests high technical capacity and
abundant resources. The group’s use of various malware suggests that it can mount a wide
range of attacks against different target systems and security measures.
Another analysis determined how many different threat actors used each malware.
The “isUsedBy” relationship between malware nodes and ThreatActor nodes was used to
analyse the links between malware and the actors. Table 8 shows the number of unique
threat actor nodes associated with each malware node. These results demonstrate the preva-
lence of malware in the cyber threat ecosystem and its use by different actors. The results
show that the malware with a node ID of 617 is used by five different threat actors and is
the most commonly used malware in the threat reports analysed.
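The counting behind this kind of analysis can be sketched over a plain list of triples. The triples, the entity names, and the function name below are hypothetical; in the proposed system the equivalent aggregation would run as a query against the Neo4j graph rather than in Python.

```python
from collections import defaultdict


def count_distinct_actors(triples):
    """Rank malware by the number of distinct actors linked via isUsedBy."""
    actors_by_malware = defaultdict(set)
    for subject, relation, obj in triples:
        if relation == "isUsedBy":
            # A set ignores repeated mentions of the same malware-actor pair.
            actors_by_malware[subject].add(obj)
    counts = {m: len(actors) for m, actors in actors_by_malware.items()}
    # Sort by descending count, then by name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical triples, including one duplicate and one other relation.
toy_triples = [
    ("MalwareA", "isUsedBy", "ActorX"),
    ("MalwareA", "isUsedBy", "ActorY"),
    ("MalwareA", "isUsedBy", "ActorX"),
    ("MalwareB", "isUsedBy", "ActorX"),
    ("CampaignC", "isLaunchedBy", "ActorX"),
]
ranking = count_distinct_actors(toy_triples)
```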
In another analysis, the target distribution of threat actors was examined. Using the
‘hasTargetedField’ relationship between the ‘ThreatActor’ and ‘target’ nodes, the areas on
which threat actors focus and their attack trends were analysed. According to the results
in Table 9, the target with a node ID of 942 (Google) was the target most frequently
attacked by different threat actors. This distribution shows that threat actors focus on
attacks against technology companies, critical infrastructures, and financial institutions.
Finally, the distribution of the targeted fields of malware was analysed. Using the
‘hasTargetedField’ relationship between ‘malware’ and ‘target’ nodes, the number of targets
attacked by malware was counted. This analysis demonstrates the impact areas and attack
scopes of malware.
According to the results in Table 10, the malware with a node ID value of 2168 attacked
19 different targets, while the malware with a node ID value of 1140 attacked 14 different
targets. A sample graph for the malware with a node ID value of 1140 is shown in Figure 6.
Figure 7. Graph representation of threat actor node 5141 and its central position based on PageRank.
5. Conclusions
This paper presents a novel approach that combines knowledge graphs and advanced
natural language processing techniques for cyber threat intelligence analysis. The proposed
system successfully extracted critical entities such as threat actors, malware, campaigns,
and targets from unstructured security texts and made the relationships between these
entities analysable in a structured knowledge base. In particular, the analysis results showed
that some threat actors have a high level of similarity with certain target domains. These
results show that knowledge graph-based approaches offer a new perspective in cyber
threat intelligence and make visible the relationships and targets that traditional methods
fail to detect. While the results of this study demonstrate the effectiveness of the proposed
approach, certain limitations should be acknowledged. The dataset size, although carefully
curated, may limit the generalisability of the findings to broader cybersecurity contexts.
Expanding the dataset with additional sources and increasing its diversity will be a key
focus in future work. Additionally, testing the model’s performance across different do-
mains and threat scenarios will help validate its robustness and applicability. By addressing
these limitations, we aim to further enhance the reliability and impact of the proposed
approach. The performance of the methods used in the study is supported by both the
overall accuracy of the model and the graph analysis results obtained. In addition
to supporting the understanding of current threats, the system can help predict potential
future targets and can therefore serve as a useful tool in cyber security studies. The
method presented in this study has significant potential for the early prediction of new
attack vectors and threat trends.
In our future work, we plan to test the system with larger datasets and make im-
provements to address threat analysis needs in different sectors. In addition, a more
comprehensive examination of the embedding methods and parameter optimisations can
improve the accuracy of the system. The results of this study show that the proposed
approach can make significant contributions to the field of cyber threat intelligence from
both practical and theoretical perspectives.
Author Contributions: Conceptualisation, D.D.; methodology, D.D. and R.D.; formal analysis,
D.D.; investigation, R.D.; validation, R.D. and D.H.; writing—original draft preparation, D.D.;
writing—review and editing, D.D. and R.D.; supervision, R.D. and D.H. All authors have read and
agreed to the published version of the manuscript.
Data Availability Statement: The data that support the findings of this study are available on request
from the corresponding author.
Acknowledgments: This study is derived from the PhD dissertation entitled “Centralized Monitoring
of Cyber Threat Intelligence Data and Development of New Approaches to Detection of Attacks”
submitted at Inonu University, Graduate School of Nature and Applied Sciences, Department of
Computer Engineering under the supervision of Davut Hanbay and Resul Das.
References
1. Demirol, D.; Das, R.; Hanbay, D. A key review on security and privacy of big data: Issues, challenges, and future research
directions. Signal Image Video Process. 2022, 17, 1335–1343. [CrossRef]
2. Lei, J.; Kong, L. Fundamentals of big data in radio astronomy. In Big Data in Astronomy; Kong, L., Huang, T., Zhu, Y., Yu, S., Eds.;
Elsevier: Amsterdam, The Netherlands, 2020; pp. 29–58. [CrossRef]
3. Gantz, J.; Reinsel, D. Extracting value from chaos. IDC IVIEW 2011, 1142, 1–12.
4. Ahmetoglu, H.; Das, R. A comprehensive review on detection of cyber-attacks: Data sets, methods, challenges, and future
research directions. Internet Things 2022, 20, 100615. [CrossRef]
5. Zhou, Y.; Tang, Y.; Yi, M.; Xi, C.; Lu, H. CTI View: APT Threat Intelligence Analysis System. Secur. Commun. Netw. 2022,
2022, 9875199. [CrossRef]
6. Piplai, A.; Mittal, S.; Joshi, A.; Finin, T.; Holt, J.; Zak, R. Creating Cybersecurity Knowledge Graphs From Malware After Action
Reports. IEEE Access 2020, 8, 211691–211703. [CrossRef]
7. Sarhan, I.; Spruit, M. Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph. Knowl.-Based Syst. 2021, 233, 107524.
[CrossRef]
8. Meta. ThreatExchange: A Threat Intelligence Sharing Platform. 2024. Available online: https://developers.facebook.com/
products/threat-exchange/ (accessed on 13 December 2023).
9. Symantec. Symantec Enterprise Blogs-Threat Intelligence. Available online: https://symantec-enterprise-blogs.security.com/
blogs/threat-intelligence (accessed on 25 January 2023).
10. Barnum, S. Standardising Cyber Threat Intelligence Information with the Structured Threat Information eXpression. 2014. Available
online: https://stixproject.github.io/about/STIX_Whitepaper_v1.1.pdf (accessed on 25 January 2023).
11. MITRE. MAEC-Malware Attribute Enumeration and Characterization. 2022. Available online: https://maecproject.github.io/
(accessed on 25 January 2023).
12. Husari, G.; Al-Shaer, E.; Ahmed, M.; Chu, B.; Niu, X. TTPDrill: Automatic and Accurate Extraction of Threat Actions from
Unstructured Text of CTI Sources. In Proceedings of the 33rd Annual Computer Security Applications Conference, Orlando, FL,
USA, 4–8 December 2017; pp. 103–115. [CrossRef]
13. Jones, C.L.; Bridges, R.A.; Huffer, K.M.T.; Goodall, J.R. Towards a Relation Extraction Framework for Cyber-Security Concepts.
In Proceedings of the 10th Annual Cyber and Information Security Research Conference, Oak Ridge, TN, USA, 7–9 April 2015;
pp. 1–4. [CrossRef]
14. Alves, F.; Bettini, A.; Ferreira, P.M.; Bessani, A. Processing tweets for cybersecurity threat awareness. Inf. Syst. 2021, 95, 101586.
[CrossRef]
15. Kim, E.; Kim, K.; Shin, D.; Jin, B.; Kim, H. CyTIME: Cyber Threat Intelligence ManagEment framework for automatically
generating security rules. In Proceedings of the 13th International Conference on Future Internet Technologies, Seoul, Republic of
Korea, 20–22 June 2018; pp. 1–5. [CrossRef]
16. Zhang, H.; Shen, G.; Guo, C.; Cui, Y.; Jiang, C. EX-Action: Automatically Extracting Threat Actions from Cyber Threat Intelligence
Report Based on Multimodal Learning. Secur. Commun. Netw. 2021, 2021, 5586335:1–5586335:12. [CrossRef]
17. Alam, M.T.; Bhusal, D.; Park, Y.; Rastogi, N. CyNER: A Python Library for Cybersecurity Named Entity Recognition. arXiv 2022,
arXiv:2204.05754. [CrossRef]
18. Zhu, Z.; Dumitras, T. ChainSmith: Automatically Learning the Semantics of Malicious Campaigns by Mining Threat Intelligence
Reports. In Proceedings of the 2018 IEEE European Symposium on Security and Privacy (EuroS&P), London, UK, 24–26 April
2018; pp. 458–472. [CrossRef]
19. Ahmetoglu, H.; Das, R. Analysis of Feature Selection Approaches in Large Scale Cyber Intelligence Data with Deep Learning. In
Proceedings of the 2020 28th Signal Processing and Communications Applications Conference (SIU), Gaziantep, Turkey, 5–7
October 2020; pp. 1–4. [CrossRef]
20. Noor, U.; Anwar, Z.; Amjad, T.; Choo, K.K.R. A machine learning-based FinTech cyber threat attribution framework using
high-level indicators of compromise. Future Gener. Comput. Syst. 2019, 96, 227–242. [CrossRef]
21. Qin, Y.; Shen, G.; Zhao, W.; Chen, Y.; Yu, M.; Jin, X. A network security entity recognition method based on feature template and
CNN-BiLSTM-CRF. Front. Inf. Technol. Electron. Eng. 2019, 20, 872–884. [CrossRef]
22. Bose, A.; Behzadan, V.; Aguirre, C.; Hsu, W.H. A novel approach for detection and ranking of trendy and emerging cyber threat
events in Twitter streams. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks
Analysis and Mining, Vancouver, BC, Canada, 27–30 August 2019; pp. 871–878. [CrossRef]
23. Ma, P.; Jiang, B.; Lu, Z.; Li, N.; Jiang, Z. Cybersecurity named entity recognition using bidirectional long short-term memory with
conditional random fields. Tsinghua Sci. Technol. 2021, 26, 259–265. [CrossRef]
24. Kim, D.; Kim, H.K. Automated Dataset Generation System for Collaborative Research of Cyber Threat Analysis. Secur. Commun.
Netw. 2019, 2019, 6268476. [CrossRef]
25. Zhang, H.; Guo, Y.; Li, T. Multifeature Named Entity Recognition in Information Security Based on Adversarial Learning. Secur.
Commun. Netw. 2019, 2019, 6417407. [CrossRef]
26. Georgescu, T.M.; Iancu, B.; Zurini, M. Named-Entity-Recognition-Based Automated System for Diagnosing Cybersecurity
Situations in IoT Networks. Sensors 2019, 19, 3380. [CrossRef]
27. Sun, T.; Yang, P.; Li, M.; Liao, S. An Automatic Generation Approach of the Cyber Threat Intelligence Records Based on
Multi-Source Information Fusion. Future Internet 2021, 13, 40. [CrossRef]
28. Wu, H.; Li, X.; Gao, Y. An Effective Approach of Named Entity Recognition for Cyber Threat Intelligence. In Proceedings of the
2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China,
12–14 June 2020; IEEE: New York, NY, USA, 2020; pp. 1370–1374. [CrossRef]
29. Simran, K.; Sriram, S.; Vinayakumar, R.; Soman, K.P. Deep Learning Approach for Intelligent Named Entity Recognition of
Cyber Security. In Proceedings of the Advances in Signal Processing and Intelligent Recognition Systems, Trivandrum, India,
18–21 December 2019; Thampi, S.M., Hegde, R.M., Krishnan, S., Mukhopadhyay, J., Chaudhary, V., Marques, O., Piramuthu, S.,
Corchado, J.M., Eds.; Springer: Singapore, 2020; pp. 163–172. [CrossRef]
30. Kim, G.; Lee, C.; Jo, J.; Lim, H. Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network. Int. J.
Mach. Learn. Cybern. 2020, 11, 2341–2355. [CrossRef]
31. Jia, J.; Yang, L.; Wang, Y.; Sang, A. Hyper attack graph: Constructing a hypergraph for cyber threat intelligence analysis. Comput.
Secur. 2025, 149, 104194. [CrossRef]
32. Srivastava, S.; Paul, B.; Gupta, D. Study of Word Embeddings for Enhanced Cyber Security Named Entity Recognition. Procedia
Comput. Sci. 2023, 218, 449–460. [CrossRef]
33. Ahmed, K.; Khurshid, S.K.; Hina, S. CyberEntRel: Joint extraction of cyber entities and relations using deep learning. Comput.
Secur. 2024, 136, 103579. [CrossRef]
34. Satvat, K.; Gjomemo, R.; Venkatakrishnan, V.N. Extractor: Extracting Attack Behavior from Threat Reports. In Proceedings of the
2021 IEEE European Symposium on Security and Privacy (EuroS&P), Vienna, Austria, 6–10 September 2021; IEEE: New York, NY,
USA, 2021; pp. 598–615. [CrossRef]
35. Guo, Y.; Liu, Z.; Huang, C.; Wang, N.; Min, H.; Guo, W.; Liu, J. A framework for threat intelligence extraction and fusion. Comput.
Secur. 2023, 132, 103371. [CrossRef]
36. Jo, H.; Lee, Y.; Shin, S. Vulcan: Automatic extraction and analysis of cyber threat intelligence from unstructured text. Comput.
Secur. 2022, 120, 102763. [CrossRef]
37. Wang, G.; Liu, P.; Huang, J.; Bin, H.; Wang, X.; Zhu, H. KnowCTI: Knowledge-based cyber threat intelligence entity and relation
extraction. Comput. Secur. 2024, 141, 103824. [CrossRef]
38. Syed, Z.; Padia, A.; Finin, T.; Mathews, M.L.; Joshi, A. UCO: A Unified Cybersecurity Ontology. In Proceedings of the AAAI
Workshop: Artificial Intelligence for Cyber Security, Phoenix, AZ, USA, 12 February 2016; Martinez, D.R., Streilein, W.W., Carter,
K.M., Sinha, A., Eds.; AAAI Technical Report; AAAI Press: Washington, DC, USA, 2016; Volume WS-16-03.
39. ESET. WeLiveSecurity. 2023. Available online: https://www.welivesecurity.com (accessed on 9 August 2024).
40. FortiGuard Labs. FortiGuard Labs Threat Research. Available online: https://www.fortinet.com/blog/threat-research (accessed on 9
August 2024).
41. CyberMonitor. APT Cyber Criminal Campaign Collections. 2023. Available online:
https://github.com/CyberMonitor/APT_CyberCriminal_Campagin_Collections (accessed on 9 August 2024).
42. Fenniak, M. The PyPDF2 Library. 2022. Available online: https://pypi.org/project/PyPDF2/ (accessed on 9 August 2024).
43. MITRE Corporation. MITRE ATT&CK Framework. Available online: https://attack.mitre.org (accessed on 14 December 2023).
44. MITRE Corporation. Mitreattack-Python: Python Library for Interacting with the MITRE ATT&CK Framework. 2024. Available
online: https://github.com/mitre-attack/mitreattack-python (accessed on 14 December 2024).
45. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding;
Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [CrossRef]
46. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell.
2018, 42, 318–327. [CrossRef]
47. Sulu, M.; Das, R. Graph visualization of cyber threat intelligence data for analysis of cyber attacks. Balk. J. Electr. Comput. Eng.
2022, 10, 300–306. [CrossRef]