Knowledge Graph on Cybersecurity: A Survey
Maman Sani ABOUBACAR, University of Montpellier, Montpellier, France, maman-sani.aboubacar-djibo@lirmm.fr
Arnaud CASTELLTORT, University of Montpellier, Montpellier, France, castelltort@lirmm.fr
Anne LAURENT, University of Montpellier, Montpellier, France, laurent@lirmm.fr
ABSTRACT
Over the past decade, social networks and discussion forums have become valuable sources of open source cyber security intelligence. However, separating relevant from irrelevant information is very complex, as information comes from a variety of sources and in different formats. Different approaches exist for extracting and representing this information. In this review we present a comparative study of knowledge base approaches that extract relevant cyber security information and represent it in graph form for better understanding and readability of the data. This comparative study reviews the various works on cyber security knowledge base construction, identifying the potential and limitations of each study.
Keywords: Cybersecurity, Information extraction, Ontology, Knowledge Graph.
1 INTRODUCTION
Cyber security is a fairly vast field that is evolving at a significant speed. Analysts and cyber security professionals have a critical need for the latest information to cope with this evolution. This information is usually generated by different tools and sensors, or is available on the Web, in structured or unstructured form. Unifying and organizing this information will give cyber security professionals better visibility and situational awareness.
Data representation approaches based on graphs can be a solution to this problem. Indeed, they allow data to be integrated, organized and represented in a way that is readable and understandable for both machines and analysts. Studies have been conducted on this issue in the field of cyber security [2, 10, 15, 16]. However, few of these studies [2] include a comparative study on the representation of graph-based data on cyber security.
The aim of this paper is to present the various works dealing with the representation and integration of cyber security data. First, we present the need for graph-based approaches, as well as the challenges associated with them. Second, we present the steps involved in creating a knowledge base, as well as the works dealing with this topic. Finally, we present a synthetic comparison of these studies.
The rest of this paper is organized as follows: Section 2 presents the interest of using graphs. Section 3 presents the steps of the construction of a knowledge base. Section 4 presents the studies that have been conducted on the construction of knowledge graphs for cyber security data. Section 5 presents the conclusion and discussions.
2 WHY GRAPHS?
By definition, a graph is a data structure made up of entities linked together by relationships. Entities are represented by nodes and relationships by arcs. A knowledge graph is a semantic knowledge base that describes the semantics of information sources and thus makes their content explicit. The term was introduced by Google in 2012 as part of an effort to improve user experience. The goal is to allow users to resolve their queries without having to navigate to other websites to access critical information [4]. Although knowledge representation is not new, it has gained popularity in recent decades through its use in artificial intelligence applications.
There are several advantages to using these approaches:
• Interdependent nature of the data: entities are often linked to each other and have dependencies;
• Powerful representation: graphs naturally represent interdependencies by introducing links between entities. This makes it possible to capture the correlation between them efficiently;
• Relational nature of the problem areas: the nature of anomalies, vulnerabilities or even attacks can be relational. For example, a vulnerability can affect a piece of software and, at the same time, the whole system.
2.1 Challenges specific to data and cyber security issues
Data challenges such as velocity, volume, variety and quantity also apply to graph-based data:
• Dynamism and scalability: with an explosion of user- and machine-generated data in real time, dynamism and scalability are challenges in graph data management [1];
• Complexity: the available data is rich and complex in terms of content. A single piece of data can contain several pieces of information, and extracting and exploiting all of this information is often quite complex.
In addition to these challenges, there are other challenges specific to cyber security data:
• Lack and noise of labeled data: data labeling is a great challenge when extracting data and representing it in graph form. In cyber security there is a great lack of structured and publicly available data on attacks or incidents, unlike vulnerability data, for which databases such as the NVD (https://nvd.nist.gov/) exist. Machine learning techniques for automatic data labeling exist, but they produce noise or omit certain cyber security concepts [3]. For example, the absence of truly labeled data, i.e. ground truth data, makes it difficult to evaluate techniques for detecting incidents or anomalies related to cyber security;
• Class imbalance: incidents, vulnerabilities and anomalies are often very rare in the collected data. Combined with labeling errors, this means that some classes (concepts) may be under-represented, which can make it difficult for machine learning approaches to predict them. This issue must therefore be carefully taken into account.
3 STEPS OF CREATING KNOWLEDGE GRAPHS
Knowledge graphs can be created using several approaches. In recent decades, however, ontology-based approaches have stood out for the results they provide. This is why, in this review, we present only works that follow this approach. For the creation of a knowledge base, a fairly generic pipeline is most often used. This pipeline is shown in Figure 1.
Figure 1: Knowledge base creation pipeline
3.1 Data sources
Cyber security data is usually generated by different tools and sensors, or is available on the Web, in structured or unstructured form. Indeed, over the last decade, social networks and discussion forums have become valuable sources of cyber security information. With the right tools and methods, these information sources can be identified, explored and then exploited to obtain actionable information about cyber threats.
To this end, various studies have been conducted on the collection of cyber security data. This is the case, for example, of the work of Macdonald et al. [9], which focuses on analyzing the communications of hackers on forums, using automated analysis tools, to identify potential threats against critical infrastructures.
3.2 Named entities extraction
Entity extraction consists in extracting cyber security concepts from annotated data for subsequent exploitation. The process consists of extracting cyber security concepts from the collected data and linking them together by relationships through an ontology, so as to form a knowledge graph on which reasoning can be carried out.
It is also possible to link these concepts to external knowledge bases such as DBpedia or Wikidata in order to enrich the data.
3.3 Ontologies mapping
Ontologies are used in this pipeline to enable the creation of relationships between entities. Indeed, ontologies are semantic data models that define the types of things that exist in a domain and the properties that can be used to describe them. They are an exclusive means of representing and communicating facts and relationships. An ontology consists of a set of classes with attributes and relationships between instances of different classes. Figure 2 presents a STIX [14] cyber security ontology that models the relationship between an attack campaign, the instigator of the campaign, the target of the attack, and the tools used to carry out the attack.
Figure 2: Example of Cyber Security Ontology
3.4 Knowledge base
The knowledge base is a semantic knowledge graph that describes the semantics of information sources and thus makes the content more explicit for analysts and professionals in the field. A knowledge graph is the result of associating the concepts of a domain with a data representation model, here ontologies. In our case, the knowledge graph can be defined as the instantiation of the extracted cyber security concepts in a dedicated cyber security ontology.
4 COMPARATIVE STUDY OF WORK ON THE CONSTRUCTION OF KNOWLEDGE BASES RELATED TO CYBER SECURITY
Several works in the literature have focused on the creation of cyber security knowledge bases using ontologies. This is the case of Undercoffer et al. [16], who introduced the first cyber security ontology, for modeling intrusion detection systems (IDS). Syed et al. [15] proposed the Unified Cyber Security Ontology, an extension of the IDS ontology. This ontology has the advantage of being linked to several external cyber security knowledge bases such as CVE (https://cve.mitre.org/), CWE (https://cwe.mitre.org/), STUCCO (https://github.com/stucco-archive/ontology) and STIX (https://oasis-open.github.io/cti-documentation/).
Systems that create cyber security knowledge graphs, or integrate data into them, have also been proposed in the literature. This is the case, for example, of the recent work of Taneeya et al. [12], which deals
with the extraction of cyber security information from cyber security blogs, its integration into a knowledge base, and reasoning over the data to create new knowledge.
Pingle et al. [11] proposed a system that creates semantic triples from cyber security data, using deep learning approaches to extract possible relationships. The semantic triples generated by the system can be used to make assertions in a cyber security knowledge graph. They can also be extracted from the knowledge graph and exploited by analysts in their decision making on cyber attacks.
Iannacone et al. [5] developed an ontology for a cyber security knowledge graph database. This ontology integrates information from various structured and unstructured data sources and includes all relevant concepts in the field of cyber security.
Shang et al. [13] proposed a framework for integrating cyber security information extracted from texts into a knowledge base. To do so, they created a vulnerability-centered ontology and trained a model for extracting named entities related to cyber security, using rules and statistical models such as conditional random fields.
Mittal et al. proposed CyberTwitter [10], a system that uses Twitter as a data source to study cyber security vulnerabilities published on the social network. After data collection, cyber security concepts are extracted with the Security Vulnerability Concept Extractor (SVCE) [7], a tool they developed for this purpose. The extracted concepts are then represented in RDF format using the Unified Cyber Security Ontology (UCO) [15]. SWRL (Semantic Web Rule Language) rules are subsequently used to reason about the extracted concepts and issue alerts to security analysts.
Jia et al. [6] proposed an approach for building a cyber security knowledge base using inference rules based on a path ranking algorithm. The basic principle of the path ranking algorithm is to use the paths connecting two entities as features to predict the relationship between them. By applying the path ranking algorithm to a given relationship, it is possible to determine whether that relationship exists between the two entities.
4.1 Synthetic comparison of the work presented
Table 1 presents a comparative study of the works presented. For each study, five criteria are considered: the data source, which specifies whether the exploited data comes from a single source or from multiple sources; NER (Named Entity Recognition), which specifies whether the study uses entity extraction approaches; Ontology, whether the study used ontologies to create its knowledge base; External KG, whether the study used external knowledge bases such as DBpedia or Wikidata to enrich its knowledge base; and, finally, the field of application of the study.
From this table, it appears that all the studies presented, except the work of Syed et al. [15] and Undercoffer et al. [16], use named entity extraction approaches to extract cyber security concepts. Indeed, Syed et al. use cyber security concepts that already exist in the literature. It also appears that only the works of Syed et al. and Mittal et al. use external knowledge, linking the extracted concepts to external knowledge bases in order to enrich the data.
However, none of these works has dealt with scalability or with the real-time aspect of the data stored in the knowledge base. In addition, none of them has dealt with verifying the quality of the collected data. Yet in a field such as cyber security, it is essential to ensure the quality of the information fed back to analysts and professionals in the field; otherwise they risk making bad decisions.
5 CONCLUSION
In this review, we have provided a critical overview of the various works on building a knowledge base for cyber security. We have defined the main challenges as well as the different stages of setting up such a knowledge base. We then carried out a comparative study of the different works that have dealt with this issue.
The analysis of the studies presented shows that only the work of Mittal et al. meets all the criteria considered. Indeed, the use of external knowledge allowed the authors to enrich their data and obtain a better result. However, none of these works deals with data quality verification. Since data sources are most often informal (blogs, social networks, etc.), it will be important to take this issue into account. It may be a lead for future work, in addition to knowledge base scalability and real-time data collection. For the latter, the work of Le Sceller et al. [8] can serve as a basis.
REFERENCES
[1] Muhammad U Arshad, Ashish Kundu, Elisa Bertino, Arif Ghafoor, and Chinmay Kundu. 2017. Efficient and scalable integrity verification of data and query results for graph databases. IEEE Transactions on Knowledge and Data Engineering 30, 5 (2017), 866–879.
[2] Carlos Blanco, Joaquin Lasheras, Rafael Valencia-García, Eduardo Fernández-Medina, Ambrosio Toval, and Mario Piattini. 2008. A systematic review and comparison of security ontologies. In 2008 Third International Conference on Availability, Reliability and Security. IEEE, 813–820.
[3] Robert A Bridges, Corinne L Jones, Michael D Iannacone, Kelly M Testa, and John R Goodall. 2013. Automatic labeling for entity extraction in cyber security. arXiv preprint arXiv:1308.4941 (2013).
[4] Google. last access, 24 September 2020. Knowledge Graph. https://fr.wikipedia.org/wiki/Knowledge_Graph.
[5] Michael Iannacone, Shawn Bohn, Grant Nakamura, John Gerth, Kelly Huffer, Robert Bridges, Erik Ferragut, and John Goodall. 2015. Developing an ontology for cyber security knowledge graphs. In Proceedings of the 10th Annual Cyber and Information Security Research Conference. 1–4.
[6] Yan Jia, Yulu Qi, Huaijun Shang, Rong Jiang, and Aiping Li. 2018. A practical approach to constructing a knowledge graph for cybersecurity. Engineering 4, 1 (2018), 53–60.
[7] Ravendar Lal et al. 2013. Information Extraction of Security related entities and concepts from unstructured text. (2013).
[8] Quentin Le Sceller, ElMouatez Billah Karbab, Mourad Debbabi, and Farkhund Iqbal. 2017. Sonar: Automatic detection of cyber security events over the twitter stream. In Proceedings of the 12th International Conference on Availability, Reliability and Security. 1–11.
[9] Mitch Macdonald, Richard Frank, Joseph Mei, and Bryan Monk. 2015. Identifying digital threats in a hacker web forum. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. 926–933.
[10] Sudip Mittal, Prajit Kumar Das, Varish Mulwad, Anupam Joshi, and Tim Finin. 2016. Cybertwitter: Using twitter to generate alerts for cybersecurity threats and vulnerabilities. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 860–867.
[11] Aditya Pingle, Aritran Piplai, Sudip Mittal, Anupam Joshi, James Holt, and Richard Zak. 2019. Relext: Relation extraction using deep learning approaches for cybersecurity knowledge graph improvement. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 879–886.
[12] Taneeya Satyapanich, Francis Ferraro, and Tim Finin. 2020. CASIE: Extracting Cybersecurity Event Information from Text. UMBC Faculty Collection (2020).
[13] Huaijun Shang, Rong Jiang, Aiping Li, and Wei Wang. 2017. A framework to construct knowledge base for cyber security. In 2017 IEEE Second International Conference on Data Science in Cyberspace (DSC). IEEE, 242–248.
[14] STIX. last access, 24 September 2020. Structured Threat Information Expression. https://oasis-open.github.io/cti-documentation/stix/intro.
[15] Zareen Syed, Ankur Padia, Tim Finin, Lisa Mathews, and Anupam Joshi. 2016. UCO: A unified cybersecurity ontology. UMBC Student Collection (2016).
[16] Jeffrey Undercoffer, Anupam Joshi, Tim Finin, John Pinkston, et al. 2003. A target-centric ontology for intrusion detection. In Workshop on Ontologies in Distributed Systems, held at The 18th International Joint Conference on Artificial Intelligence.
Table 1: Comparative table of studies on building a knowledge base for cyber security
Paper                    Data source  NER  Ontology  External KG  Application
Undercoffer et al. [16]  Multiple     -    ✓         -            All cybersecurity domains
Syed et al. [15]         Multiple     -    ✓         ✓            All cybersecurity domains
Taneeya et al. [12]      Blogs        ✓    -         -            Cybersecurity concepts extraction
Pingle et al. [11]       Multiple     ✓    ✓         -            All cybersecurity domains
Iannacone et al. [5]     Multiple     -    ✓         -            All cybersecurity domains
Shang et al. [13]        Multiple     ✓    ✓         -            Vulnerability detection
Jia et al. [6]           Multiple     ✓    ✓         -            Vulnerability detection
Mittal et al. [10]       Multiple     ✓    ✓         ✓            Cybersecurity event detection on Twitter
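As an illustration of the graph view discussed in this survey, Table 1 can itself be read as a very small knowledge graph in which each study is a node linked to the techniques it uses. The Python sketch below is only an illustration under stated assumptions: the triple layout, the predicate names ("source", "application", "uses") and the helper functions to_triples and papers_using are hypothetical choices made for this example and do not come from any of the surveyed systems. The data values are those of Table 1.

```python
# Rows of Table 1: paper -> (data source, NER, ontology, external KG, application).
TABLE_1 = {
    "Undercoffer et al. [16]": ("Multiple", False, True, False, "All cybersecurity domains"),
    "Syed et al. [15]": ("Multiple", False, True, True, "All cybersecurity domains"),
    "Taneeya et al. [12]": ("Blogs", True, False, False, "Cybersecurity concepts extraction"),
    "Pingle et al. [11]": ("Multiple", True, True, False, "All cybersecurity domains"),
    "Iannacone et al. [5]": ("Multiple", False, True, False, "All cybersecurity domains"),
    "Shang et al. [13]": ("Multiple", True, True, False, "Vulnerability detection"),
    "Jia et al. [6]": ("Multiple", True, True, False, "Vulnerability detection"),
    "Mittal et al. [10]": ("Multiple", True, True, True, "Cybersecurity event detection on Twitter"),
}


def to_triples(table):
    """Turn each table row into (subject, predicate, object) triples.

    The predicate names are illustrative assumptions, not a published ontology.
    """
    triples = set()
    for paper, (source, ner, ontology, external_kg, application) in table.items():
        triples.add((paper, "source", source))
        triples.add((paper, "application", application))
        flags = (("NER", ner), ("Ontology", ontology), ("External KG", external_kg))
        for feature, used in flags:
            if used:
                triples.add((paper, "uses", feature))
    return triples


def papers_using(triples, *features):
    """Return, sorted by name, the papers with a 'uses' edge to every given feature."""
    result = None
    for feature in features:
        users = {s for (s, p, o) in triples if p == "uses" and o == feature}
        result = users if result is None else result & users
    return sorted(result or [])


triples = to_triples(TABLE_1)
# Studies that rely on external knowledge bases.
print(papers_using(triples, "External KG"))
# Studies that combine NER, an ontology and external knowledge bases.
print(papers_using(triples, "NER", "Ontology", "External KG"))
```

With the values of Table 1, the first query returns Mittal et al. [10] and Syed et al. [15], and the second returns only Mittal et al. [10], which matches the observations made in Sections 4.1 and 5. A real system would more likely store such triples in RDF against a dedicated ontology; plain Python tuples are used here only to keep the sketch self-contained.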