
Cybersecurity Threat Hunting and Vulnerability Analysis Using a Neo4j Graph Database of Open Source Intelligence

Elijah Pelofske∗1, Lorie M. Liebrock1, and Vincent Urias2

1 New Mexico Cybersecurity Center of Excellence, New Mexico Tech
2 Sandia National Laboratories

arXiv:2301.12013v2 [cs.CR] 7 Oct 2024

Abstract
Open source intelligence is a powerful tool for cybersecurity analysts to gather information both for analysis
of discovered vulnerabilities and for detecting novel cybersecurity threats and exploits. However, the scale of
information that is relevant for information security on the internet is always increasing and is intractable for
analysts to parse comprehensively. Therefore, methods of condensing the available open source intelligence and
automatically developing connections between disparate sources of information are incredibly valuable. In this
research, we present a system which constructs a Neo4j graph database formed by shared connections (shared
sub-string matches) between open source intelligence text including blogs, cybersecurity bulletins, news sites,
antivirus scans, social media posts (such as Reddit and Twitter), and threat reports. These connections are
composed of possible indicators of compromise (IP addresses, domains, hashes, email addresses, phone numbers),
information on known exploits and techniques (CVEs and MITRE ATT&CK Technique IDs), and potential
sources of information on cybersecurity exploits such as Twitter usernames. The construction of the database
of potential IOCs is detailed, including the addition of machine learning and metadata which can be used for
filtering the data for a specific domain (for example, a specific natural language) when needed. Examples of
utilizing the graph database for querying connections between known malicious IOCs and open source intelligence
documents, including threat reports, are shown. We show that this type of relationship querying can allow for
more effective use of open source intelligence for threat hunting, malware family clustering, and vulnerability
analysis. We show three specific examples of interesting connections found in the graph database: the connections
to a known exploited CVE, a known malicious IP address, and a malware hash signature.

1 Introduction
Open source intelligence offers an extraordinary amount of information that a cybersecurity analyst can use for
threat detection, mitigation, and analysis [1–6]. However, open source intelligence contains a large amount of
noise (i.e., irrelevant information) and, most importantly, the scale of the data is too large to be useful in its
raw form. To this end, automating the process of finding indicators of compromise and relevant relationships
between the indicators of compromise has become increasingly important [5, 7, 8]. The central idea utilized in this
study is forming a network of associations between open source intelligence documents and potential indicators of
compromise (IOCs) that exist in the open source intelligence text. The term potential IOC is important because
it specifies that the text is potentially relevant (for example, discussing usage of a new piece of malware) and can
be unstructured natural language text, but there is a pattern match in the data that does fit a particular form (for
example, an IP address or a Common Vulnerabilities and Exposures (CVE) identifier). However, given the nature
of open source text and information, there exist many false positives - e.g., correlations that exist in open text that
do not actually have a semantic reason for happening or are not important for the type of data we wish to extract
(in this case, cybersecurity relevant text).
Graphs are a natural way to express these types of higher order connections and are used in a variety of
cybersecurity contexts [9, 10]. These types of networks, when they are intended for providing semantic meaning
between heterogeneous data types, are also referred to as knowledge graphs [11–13]. In order to create a graph
representation of a large amount of open source intelligence, we utilize Neo4j, which provides a visual interface to
search the graph, allows a number of users to interact with the data, and also provides an efficient query time for
the database to interact with other analysis systems or to simply query and display the raw open source intelligence
(OSint) text documents that are connected to a relevant exploit or IOC. Neo4j graph databases have been used
in other domains for the purpose of storing and querying data structures with complex networks [14–17], including
social network analysis [18] and typhoon disaster knowledge [19]. We utilize Neo4j because it is a reasonable choice
for an existing graph database implementation – in particular, it is efficient, open source, and there exist Python
3 libraries for interacting with the database. Utilizing graphs in order to better evaluate relevant connections that
exist in a large dataset is a subject of considerable interest [20–22].

∗ E-mail: elijah.pelofske@protonmail.com

[Workflow diagram: three input sources (arbitrarily structured text files; the OSint web crawler, aka Spider, including metadata such as time stamps; and structured file antivirus scan data) are cleaned, have metadata extracted, and are deduplicated against all previously processed documents. The documents are then processed in parallel: CTC determines whether the natural language is likely discussing cybersecurity, TRAM and rcATT determine possible MITRE ATT&CK Technique IDs or Tactics, likely natural languages are determined, and regex searches extract potential IOCs from the natural language or structured data. Finally, the documents and metadata, with edges to their potential IOCs, are packaged in the form of a Neo4j transaction and added to the Neo4j graph database of potential cybersecurity IOCs.]

Figure 1: IOC Neo4j graph database construction workflow diagram.

In this study we outline a methodology which consists of collecting and aggregating open source intelligence,
in conjunction with antivirus scan results, and presenting the information contained in this data in the form of
a network or a graph. Specifically, edges represent a connection between a potential IOC and a document. The
challenge is that most open source intelligence text from the web is unstructured; therefore, analyzing the text for
relevant pattern matches is a means to extract the potential IOCs. Here a document is simply a collection of natural
language or data - for example, a single tweet from Twitter could be a document. The knowledge graph that is formed
by the documents, potential IOCs, and vulnerability IDs forms a network with multiple types of relations (edges)
– making the constructed graph database a specific type of multidimensional network. Furthermore, metadata
is included with the node document when constructing the database - several machine learning (ML) algorithms
identify whether the text is discussing cybersecurity and specifically what type of exploit techniques are being
discussed. This type of metadata is important because it can serve as a signal for the quality of the given node
document - meaning that a piece of open source intelligence text gathered from social media will have a higher
confidence of being relevant if the machine learning algorithm tells us it is likely discussing a cybersecurity topic. In
Section 2 we detail the construction of this type of cybersecurity intelligence graph, instantiated as a Neo4j graph
database.
The underlying idea of this study is to create a graph database structure where edges between two nodes are
formed by shared string matches within two open source documents, and each node is a single open source document
(which could be from a variety of sources, such as a threat report or a web page).
Refs. [23, 24] have created similar systems that use automated text mining to extract indicators of compromise.
Our study differs from these studies in several ways. The first is the parsing of entirely unstructured text to be
ingested, and the second is that this very general parsing is applied to a wide variety of data sources. The third is
the variety of extracted datatypes used to create the graph database - this not only includes standard indicators
of compromise, but also datatypes with more general context for known vulnerabilities and tactics (namely, CVE
numbers and MITRE ATT&CK IDs).
In Section 3 we give specific examples of where this type of open source intelligence graph database can provide
useful information and facilitate analysis in cybersecurity. Specifically, we detail examples where indicators of compromise
can be found within the graph database, and we show where connections to other open source documents
can give more information as to how that indicator is important. In particular, in Section 3.4 we show how a shared
PE resource file can be used to link multiple malware samples (likely due to the same developer creating these pieces
of malware) that otherwise would not be connected by standard malware analysis tools. We conclude in Section
4 with a discussion of what this type of approach to open source intelligence analysis can be used for and future
research directions.

2 Methods
In this section we outline the pipeline used to construct the Neo4j database. In Section 2.1 the open source
intelligence data sources are defined, and in Section 2.2 the database construction is outlined.

2.1 Open source intelligence data sources


The open source intelligence documents that are fed into the system broadly fall into three categories; the three
green starting blocks in Figure 1 show these categories. The first type of data is entirely unstructured text -
typically this type of data is simply a threat report or a list of indicators or vulnerabilities. This option allows any
collection of text sourced from any internet discussion to be analyzed for potential IOCs and to be added into the
Neo4j database. Figure 1 was created in Lucidchart1 .
The second type of data comes from a system of web crawlers (the details of this system are outlined in ref.
[25]). The text found by the web crawlers is typically similarly unstructured and can originate from a variety of
sources such as Reddit, Twitter, blogs, cybersecurity bulletins, and news sites. However, the web crawler data also
contains metadata including links that the crawler followed to the current site, time and date information, and
cybersecurity keywords that were found when parsing the site. This data is therefore slightly more structured and
is parsed differently from arbitrary text; specifically, the metadata is kept separate from the text that will
be analyzed for potential IOCs, which is eventually added into the node information in Neo4j.
The third type of data is structured antivirus scans of files that are potentially malware. This data is entirely
machine readable and does not contain natural language information. Therefore, this data is also parsed differently
from the other two data sources. In particular, these antivirus scans include hashes of files and the names of those
files. Therefore, those specific fields (hashes and file names) that are present in the structured data are parsed for
creating edges in Neo4j, but no other natural language analysis or pattern matching for potential IOCs is performed.
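As a minimal illustration of this parsing step (the JSON field names below are hypothetical, since the exact scan schema is not reproduced here), only the hash and file name fields are pulled out of the structured scan data:

import json

# Hypothetical antivirus scan record; the real scan schema may use different field names.
scan_record = json.loads('''{
    "file_name": "sample_installer.exe",
    "hashes": {"md5": "0123456789abcdef0123456789abcdef",
               "sha256": "aa11bb22cc33dd44ee55ff6600112233445566778899aabbccddeeff00112233"},
    "contained_resources": [{"sha256": "ffeeddccbbaa99887766554433221100ffeeddccbbaa99887766554433221100"}]
}''')

def extract_scan_fields(record: dict) -> dict:
    """Collect the file name and hash values that become edge endpoints in Neo4j."""
    fields = {"file_name": [record["file_name"]]} if "file_name" in record else {}
    for algo, value in record.get("hashes", {}).items():
        fields.setdefault(algo, []).append(value)
    for resource in record.get("contained_resources", []):
        for algo, value in resource.items():
            fields.setdefault(algo, []).append(value)
    return fields

print(extract_scan_fields(scan_record))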

2.2 Neo4j Graph Database Construction


Figure 1 details the high level workflow that constructs the Neo4j database of potential IOCs. This entire system
can operate continuously by reading in new data from the independent web crawler infrastructure and then adding
relevant information into the Neo4j database. First, the data is parsed depending on its source, as outlined in
Section 2.1. This input data is referred to as a document because this collection of text all originates from the
same source and is therefore logically linked, which could be helpful for identifying relationships when searching
the database. As an example, a single document could be a piece of text from social media, such as a Reddit
post or a tweet from Twitter. Any metadata that is associated with the input text is parsed at this point, to be
added into the Neo4j database if a node representing this document is created. Importantly, once the raw text
input has been extracted (this includes the JSON structured antivirus file scan data), a SHA256 checksum of the
data is created and this checksum is then checked against the checksums of all of the documents that are in the
Neo4j database; if the document is a duplicate, it is not added to the database. Next, several processes begin
executing in parallel, all of which are designed to extract useful meaning from the natural language input. To
this end, the antivirus file scan data is not parsed using the machine learning algorithms or language detection;
however, the potential IOCs such as filenames and hashes are extracted and used to create edge relationships in the
database. The Cybersecurity Topic Classification (CTC) tool, which is composed of multiple machine learning
algorithms trained to detect cybersecurity vs non-cybersecurity discussions from social media and developer forum
English text sources [26], is executed on the text. The output of this algorithm is simply three states - either it is
likely cybersecurity related text, or it is not, or there was not enough data (e.g., English words) to make a decision.
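A minimal sketch of the checksum-based de-duplication step is shown below; the function and variable names are illustrative rather than the exact implementation used in the pipeline.

import hashlib

def sha256_of(raw_text: str) -> str:
    """Return the SHA256 hex digest of the raw document text."""
    return hashlib.sha256(raw_text.encode("utf-8")).hexdigest()

def is_duplicate(raw_text: str, known_checksums: set) -> bool:
    """Check a new document against the checksums of all previously ingested documents."""
    return sha256_of(raw_text) in known_checksums

# Example usage with a hypothetical in-memory set of checksums; in the real system the
# comparison is made against the documents already stored in the Neo4j database.
known_checksums = {sha256_of("previously ingested document text")}
print(is_duplicate("previously ingested document text", known_checksums))  # True
print(is_duplicate("a brand new threat report", known_checksums))          # False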
The Reports Classification by Adversarial Tactics and Techniques (rcATT) machine learning Python tool [27]
is also executed on the text. This tool gives a list of likely MITRE ATT&CK tactics and techniques2 that were
1 www.lucidchart.com
2 https://attack.mitre.org/

Node type | Count of unique nodes | Count of edges
Document | 2,128,992 |
MD5 Hash | 394,826 | 944,671
SHA1 Hash | 323,321 | 990,817
SHA256 Hash | 642,535 | 1,941,248
SHA512 Hash | 18,339 | 28,112
Malware name | 365 | 117,528
APT name | 457 | 165,741
Email | 85,396 | 953,757
CVE ID | 174,668 | 1,313,206
Twitter username | 174,143 | 402,899
Phone number | 23,756 | 144,397
IP address | 119,699 | 705,386
Domain | 214,720 | 3,116,508
File name | 351,507 | 1,326,480
MITRE ATT&CK Technique ID | 445 | 21,018

Table 1: Neo4j database statistics. The Count of edges column in the table is the number of edges between the
listed node type and Document nodes. For each non-document node type, the edges between document nodes
and the unique nodes for that indicator are always labelled the same as the indicator; for example, CVE nodes are
connected to document nodes by an edge labelled CVE. Those edge types are the counts displayed in the third
column. There are no edges between document nodes, which is why the edge count in that cell is empty.

mentioned in the text. The MITRE ATT&CK framework provides a consistent basis for tracking cybersecurity
techniques [28–33]. The rcATT tool was trained on threat report text and therefore the error rates on non-threat
report documents are expected to be high. The tool is applied uniformly to all documents because it is not necessarily
known a priori what the exact semantic content of the document is (e.g., whether it is a threat report, or a news
report, or entirely non-cybersecurity). It is generally expected that, based on the training data used to create the
machine learning models in the tool, cybersecurity content (e.g., documents where CTC returned True) will have
higher accuracy results in regards to detecting discussions of specific MITRE ATT&CK tactics and techniques.
The Threat Report ATT&CK Mapper (TRAM) machine learning Python tool3 is also executed on the document.
TRAM, similar to rcATT, returns likely MITRE ATT&CK techniques that were mentioned in threat reports.
Therefore, similarly to rcATT, this data will likely have high error rates for natural language text that is quite
different from threat reports, but could be more accurate for cybersecurity related text.
3 https://github.com/center-for-threat-informed-defense/tram/
For the data coming from the web crawlers, the natural language text could be non-English. Having some
signal to indicate when this occurs in the Neo4j database could be useful (for example, if one wants to query only
documents that contain English or Spanish text). The other reason that this signal is important is that all
of the natural language machine learning algorithms vectorize the input text using a very broadly defined English
dictionary - meaning that other languages are not used in these models, so their results would be
very inaccurate and should not be used. Therefore, the Python tool langdetect is also executed on the text and the
resulting language detection information is included in the node metadata.
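As a minimal sketch, this language detection step can be performed with the langdetect package as follows (how the result is attached to the node metadata is simplified here):

from langdetect import detect_langs, LangDetectException

def detect_document_languages(text: str):
    """Return (language code, probability) pairs for the document text, or [] if undecidable."""
    try:
        return [(result.lang, round(result.prob, 3)) for result in detect_langs(text)]
    except LangDetectException:
        # Not enough usable natural language text to make a decision.
        return []

print(detect_document_languages("This threat report describes a new ransomware campaign."))
print(detect_document_languages("Este informe describe una nueva campana de ransomware."))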
Lastly, pattern matches for all potential IOCs are performed on the text. With the exception of the structured
antivirus file scans, where the potential IOCs can be extracted automatically, all of the other
input text can be entirely unstructured. Therefore, simple pattern matching procedures are performed in order
to find potential IOCs. Hashes (md5, sha1, sha256, sha512) are found by searching for high entropy hexadecimal
text that fits within the required character length. File names are found by matching tokenized words (NLTK
[34] was used for most of the tokenization procedures) that have a file ending that matches some standard file
ending (for example, .py for Python). Advanced Persistent Threat (APT) group names and malware names are all
simply pattern matched against tokenized words in the text. Phone numbers, email addresses, IP addresses, Twitter
usernames, and domain names are all found by pattern match searching for known standard formats. Some simple
checks are used to rule out pattern matches which do not fit the expected format of the data type, and in the case
of domain names, the top 1 million (Alexa top 1 million list) most searched domains are removed in order to reduce
noise in the graph. Common Vulnerabilities and Exposures (CVEs) [35] and the unique ID numbers of MITRE
ATT&CK techniques [36] are also pattern matched for; both of these data types also follow a standard format
which can be identified. Each of these pattern matches will correspond to an edge (e.g., a connection) between the
document node that contained this data and a node representing that unique pattern match. This unique pattern
match we broadly call a potential IOC, but it can also simply be a unique ID to track a known vulnerability (for
example, a CVE), or it could be a potentially useful piece of information for connecting two document nodes but
not be a malicious IOC.
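The following regular expressions sketch what a few of these pattern matchers could look like; they are simplified illustrations, and the patterns, entropy checks, and validation rules used in the actual pipeline are more involved.

import re

# Simplified patterns for a few of the extracted datatypes (illustrative only).
PATTERNS = {
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,7}\b", re.IGNORECASE),
    "mitre_technique": re.compile(r"\bT\d{4}(?:\.\d{3})?\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def extract_potential_iocs(text: str) -> dict:
    """Return all pattern matches found in the document text, keyed by datatype."""
    return {name: sorted(set(p.findall(text))) for name, p in PATTERNS.items()}

sample = "CVE-2021-44228 was exploited from 89.101.97.139; sample md5 84c82835a5d21bbcf75a61706d8ab549."
print(extract_potential_iocs(sample))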
Once all of this pattern matching and data processing has been completed, a number of Neo4j transactions are
applied in order to add the new data into the database. All of the processed documents have unique nodes created
which contain several different components. The most important part of the node is the original raw text that was
in the node document. Next, web crawler metadata is present in the nodes which originated from web crawling - here the
metadata includes the link, parent link, time stamps, keywords, and potentially other data such as the checksum of
the raw text. Next, some document nodes are antivirus scans - in these cases there is a large amount of metadata
about the scan that was performed, but typically the data is not natural language. All nodes also have language
detection metadata, if there was natural language text that could be processed. The natural language detection
is useful for filtering nodes which contain only a specific language of interest; however, as expected, most of the
document nodes are English text. The pattern matches that were found in the text, which form the edges in the
graph, are also included as a segment in the node data. Lastly, the machine learning results from CTC, rcATT,
and TRAM are also included as a separate data segment (if there is applicable English text that could be fed to
these ML algorithms). Each of these components of node data can be used to filter for specific nodes which have
specific attributes that are relevant for a specific task. Any documents which have no pattern matches are not
added to the database as they would simply be degree 0 nodes. Next, for all unique pattern matches found in
this set of documents, the system checks whether the corresponding nodes already exist in the database; if they do
not, they are created. This avoids creating duplicate potential IOC or pattern match nodes. Lastly,
edges are formed between the document nodes and the nodes representing the pattern matches found in those
documents. Table 1 shows the counts of nodes and edges in the database as of the writing of this paper. Therefore,
the constructed graph database is always bipartite, where one partition is the node documents (i.e., the sources of
information) and the other partition is the set of nodes representing different potential IOCs or vulnerability IDs
that exist in the node documents. This structure allows users, or algorithms, to query for relationships based on
potential IOCs (e.g., a path connecting two potential IOCs or the network of neighbors associated with a potential
IOC) and then examine the sources of the information within the node documents, including machine learning and
language metadata.
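A minimal sketch of such a transaction, written with the official Neo4j Python driver and Cypher MERGE statements so that repeated indicators map onto a single node, is shown below; the node labels, property names, and credentials are illustrative assumptions rather than the exact schema used in this work.

from neo4j import GraphDatabase

def add_document_with_iocs(tx, doc_id, text, language, ctc_result, iocs):
    """Create the document node (if new), then MERGE one node and one edge per extracted indicator."""
    tx.run(
        "MERGE (d:document {checksum: $doc_id}) "
        "SET d.text = $text, d.language = $language, d.ctc_cybersecurity = $ctc_result",
        doc_id=doc_id, text=text, language=language, ctc_result=ctc_result,
    )
    for ioc_label, values in iocs.items():
        for value in values:
            # Labels and relationship types cannot be parameterized in Cypher,
            # so they are interpolated here (assumed to come from a fixed whitelist).
            tx.run(
                f"MATCH (d:document {{checksum: $doc_id}}) "
                f"MERGE (i:{ioc_label} {{name: $value}}) "
                f"MERGE (d)-[:{ioc_label.upper()}]->(i)",
                doc_id=doc_id, value=value,
            )

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.execute_write(
        add_document_with_iocs,
        doc_id="e3b0c44298fc1c14",  # truncated SHA256 checksum placeholder
        text="Example threat report text mentioning CVE-2021-44228.",
        language="en",
        ctc_result=True,
        iocs={"node_cve": ["CVE-2021-44228"]},
    )
driver.close()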
Note that other types of cybersecurity relevant strings, besides those used in this study, could be searched for
and created as a type of edge in such a graph database. The underlying idea of constructing a graph database
based on extracted strings that match certain patterns can certainly be generalized to other datatypes. The set
of string matches that we searched for in this study serves as a good representative set of cybersecurity relevant
indicators and datatypes.
The computer platform on which the Neo4j database was created in our implementation required sufficient
storage for all of the source documents (which is on the order of hundreds of gigabytes), at least 32 gigabytes
of RAM to operate the database and handle building the database, and sufficient processor cores to perform
multiprocessing when required (this is not a highly parallel processing intensive task and therefore on the order of
8 cores in total is sufficient). Of course, scaling this graph database text mining system to significantly larger
datasets would require more compute resources.
The use of the hash checksums is motivated because these are standard malware signatures. Advanced persistent
threat (APT) names and malware names are used commonly in threat reports and discussions of threat group
activity - and both malware and APT groups can have multiple different names associated with them, which makes
the addition of these known names into the database potentially useful. Email addresses, Twitter usernames,
phone numbers, IP addresses, and domain names are all standard indicators of compromise. A file name can be
a useful indicator of compromise if, for example, a malware file name is unique, but otherwise it can result in a
very high degree node with many connections due to a commonly used file name. CVE IDs and MITRE ATT&CK
Technique IDs are both useful for extracting information from threat reports and vulnerability reports for post
incident analysis, which can then potentially be connected to other interesting regions of the graph.

3 IOC Connections in the Graph Database


In this section, some specific visual examples of the structure of the database indicate what is available to the user
when searching for IOCs or vulnerability IDs. The hard part of cybersecurity analysis is aggregating information,
in this case open source unstructured text, into a meaningful form that can provide a network of related documents
and potential indicators of compromise. Therefore, the set of examples that we show demonstrates
a walkthrough of gradually expanding out from a node to find interesting connections in
the graph. In particular, because the nature of this graph database creates potentially many false positive
correlations (in particular, edges which are not cybersecurity relevant), we illustrate some specific examples that
emulate how a human user would interact with the database. Concretely, there are specific types of indicators
which are significantly more reliable data types - namely data which is unique, such as hash checksums, Twitter
usernames, CVE numbers, and MITRE ATT&CK Technique IDs. The remaining types of extracted data from the
text mining can have much higher false positive edge rates in the database because of the nature of those strings
(for example, IP addresses and phone numbers being sequences of numbers means that unstructured parsed text
from the internet is more likely to contain an incorrect match to one of those datatypes).
In Section 3.1 we show the graph connectivity around an md5 hash of a known malware file from the WannaCry
ransomware. In Section 3.2 we show the graph connectivity surrounding an IP address known to be a command
and control server for the Qakbot trojan. In Section 3.3 we show the graph connectivity around a CVE that is
known to be exploited in the wild.
Section 3.4 shows a specific case where a sha256 hash of a contained resource was in several antivirus scans of
different portable executable malware. This shared contained resource is reasonably unique, and indicates software
re-use during the development of otherwise seemingly unconnected malware samples.
In Section 3.5 we show the distribution of CVE node degrees in the Neo4j database compared to their Common
Vulnerability Scoring System (CVSS) scores, and show that for reputable open source cybersecurity sources (e.g.,
threat reports), there is a weak-to-moderate linear relationship between the two. Lastly, in Section 3.6 the Neo4j
Graph Data Science implementation of PageRank is applied to the graph database, allowing a ranking of the most
influential CVEs across the entire database.

Figure 2: Node and edge coloring legend for potential IOCs.

For the visual graph examples we query the database with the Neo4j browser using the Cypher query language4.
The node and edge colorings encode the following information. Document nodes are large cyan nodes, SHA1 nodes
are magenta, SHA256 nodes are lime, SHA512 nodes are grey, MD5 nodes are purple, file name nodes are teal, email
address nodes are yellow, IP address nodes are orange, malware name nodes are brown, Twitter usernames are
maroon, APT name nodes are lavender, domain name nodes are green, phone number nodes are blue, CVE number
nodes are red, and MITRE ATT&CK technique IDs are beige. The edge coloring matches the node coloring; for
example, an edge connecting a CVE to a document node where it was mentioned will also be colored red. Figure 2
shows the node and edge coloring legend. In order to reduce visual clutter, if there are near duplicate documents
from the web crawlers which have the same connections to a group of nodes, we manually remove all but one of
the duplicate document nodes. Queries for these examples require on the order of seconds of wall clock time
to return.

3.1 Visual Analysis: md5 Malware Hash


In order to show a specific malware hash subgraph of the database, we will perform a query to the Neo4j database
for the md5 hash 84c82835a5d21bbcf75a61706d8ab549. The Cypher language query used to search for this
single md5 node is:
MATCH p=(find:node_md5 {name: '84c82835a5d21bbcf75a61706d8ab549'}) RETURN p
In order to illustrate the utility of the database, we have selected this hash because it is known to be a hash of
a piece of malware. The query results show that there are three unique document nodes that reference this specific
hash. Figure 3 shows this connectivity graph. The content of these three document nodes contains useful context
information. The lower left hand node document is a cybersecurity blog-style website that is detailing a network
4 https://neo4j.com/developer/cypher/

Figure 3: Degree 1 connections associated with the md5 hash 84c82835a5d21bbcf75a61706d8ab549 from the
Neo4j graph database. This simple graph structure shows that within the current database, this hash checksum is
mentioned in exactly 3 open source documents, shown as light blue nodes.

Figure 4: Expanded connections from Figure 3; degree 2 connections out from the md5 hash
84c82835a5d21bbcf75a61706d8ab549. This step now shows that each of the 3 open source documents that
contain this hash have a variety of extracted datatypes, including a SHA-1 hash node and a SHA-256 hash node
that are shared between two of the open source documents.

and security analysis tool, which used this md5 hash as an example for detecting malware using antivirus software.
The lower right hand node document is a manalyzer report on this md5 hash5. The top node document is a 4chan
thread in which a user linked to this manalyzer report within a long discussion thread.
Next, we can query the neighbors of those three document nodes - the connectivity graph for this result is
shown in Figure 4. Notably, there is a SHA1 hash and a SHA256 hash which were both in two of the node documents.
These two hashes and the md5 hash are all checksums of the same file.
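This kind of neighborhood expansion can be expressed as a variable-length path query in Cypher; the following is a sketch using the Python driver and the same assumed node_md5 label as above.

from neo4j import GraphDatabase

# Variable-length path query: every node within two hops of the md5 node of interest.
EXPAND_QUERY = (
    "MATCH p=(m:node_md5 {name: $md5})-[*1..2]-(n) "
    "RETURN p LIMIT 200"
)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(EXPAND_QUERY, md5="84c82835a5d21bbcf75a61706d8ab549"):
        print(record["p"])
driver.close()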
Next, we expand the neighbor relationships for the SHA1 and SHA256 hashes that were both contained
in two of the original document nodes, in order to see if there are relevant connections we can investigate further. This
neighborhood expansion is shown in Figure 5. We see that there are two node documents that are linked to the
SHA256 hash. These two nodes are threat reports. One of these threat reports is also linked to the earlier manalyzer
5 https://manalyzer.org/report/84c82835a5d21bbcf75a61706d8ab549

Figure 5: Expanding the neighbors for two of the hash node connections from Figure 4, which were the only two
shared nodes among the expanded neighborhood of the three original matched open source documents. Only one
of these two shared hash datatype nodes had any further connections – which turned out to be two other open
source documents.

report node by a common filename. However, this filename is simply the command prompt executable (cmd.exe).
These two nodes represent slightly different versions of a threat report titled The Lazarus Constellation, authored
by Avisa Partners. This specific malware hash is associated with the WannaCry ransomware. WannaCry is
ransomware [37–41] that propagated across the world in 2017, targeting computers running Windows OS, primarily
using an exploit known as EternalBlue. Figure 6 shows the neighborhood of connections that the The Lazarus
Constellation threat reports contain, which includes a large number of hashes, CVEs, domains, and filenames that
are all associated with this APT group. This malware hash example shows how the graph database can retrieve
relevant information for an IOC - here we were able to find two other hashes of the same file, a place where the
hash was mentioned on a 4chan discussion board, and finally threat reports which give the larger context of why
these hashes are relevant.

3.2 Visual analysis: Qakbot IP address


Next we consider the connectivity graph around a known malicious IP address, 89.101.97.139, which is the
IP address of a command and control server for the malware known as Qakbot. The connections showing where
this IP address was mentioned in the database are shown in Figure 7. There were three document nodes which
mentioned this IP address. Two of these documents were from GitHub repositories where known IOCs on Qakbot
were published, and the third document is from a pastebin page which also posted Qakbot IOCs. Interestingly,
there are several IP addresses that are common to all, or a subset, of these documents. Note that the malware
name node (the brown degree three node near the center of the graph) is Qakbot. This graph shows
the relevant context around this IP address - namely that it is a command and control server for Qakbot - and the
connectivity graph shows many additional IOCs that should be monitored in association with Qakbot. Importantly,
these graph connections take advantage of multiple (separated) data sources - if we had read only
a single one of these documents, we would not have the aggregated data which includes many more IP address
and hash IOCs. It is clear from this graph that this specific IP address is not a lone command and control server
Figure 6: For this figure, begin by expanding the two threat report node connections that were connected to the
two shared hash nodes in Figure 5. Expanding the neighborhood of these two open source document nodes, which
are threat reports, then revealed a large halo of extracted datatypes and indicators that were mentioned in these
two threat reports. Using this chain of graph edge connections shown through Figures 3, 4, 5, and now 6, this
connects that original md5 hash to this large cluster of other indicators of compromise described in two different
threat reports. This shows a specific example of using the graph database to connect indicators of compromise
together indirectly using network connections.

for Qakbot; rather, it is one of many IP addresses and domains that are connected to this malware.

3.3 Visual Analysis: Known Exploited CVE


Here we show the connectivity graph for all neighboring nodes up to degree 2 away from a CVE that is known to be
exploited in the wild6: CVE-2014-4404. CVE-2014-4404 is a remote code execution exploit on Apple OS X, earlier
iOS versions, and Apple TV. The neighborhood graph is shown in Figure 8. The neighboring nodes connected to this
specific node show the halo of data that is connected to this CVE. In particular, these neighboring nodes show
filenames, other CVEs, Twitter usernames, domain names, hashes, etc., that were mentioned along with this CVE
in various documents. Two of these nodes were threat reports which described the state of various cyberattacks
in 2021 - therefore they broadly discussed a number of different exploits, APT groups, Twitter usernames, and
malware families, with CVE-2014-4404 being one of the more exploited Apple software CVEs. As one would expect,
the CTC tool labelled each of these text documents that are in this halo as being cybersecurity related. This ML
metadata can be used to filter for a subgraph of the graph database which contains only cybersecurity related
(English) natural language.

3.4 Malware hash files: code and resource re-use


The static analysis antivirus [42] scan hash file data nodes that are formed in Neo4j include hashes of extracted
files, functions, or strings contained within the malware (as well as the malware itself). What these additional
hashes allow within the graph database is detection of shared code components, for example shared header files,
6 https://www.cisa.gov/known-exploited-vulnerabilities-catalog

Figure 7: This is the degree 2 neighborhood halo of connected nodes to a known IP address associated with the
Qakbot malware, 89.101.97.139, where the node representing the IP of interest is the orange degree 3 node
near the bottom document node. This halo of connections is very dense, in particular because all three open source
nodes are very high degree, meaning that a large number of datatypes were extracted from those documents when
the database was formed. Notably, there are a number of shared datatype nodes, besides the original malware IP
address - these include potential IOCs of domains and other IP addresses. This type of search scenario illustrates
how a single IP address search can potentially retrieve a large collection of related pieces of information.

strings, or functions. Of course some of these may not be incredibly useful indicators - for example they may be
standard Portable Executable manifest files. However, if there are unique code artifacts in some subset of antivirus
file scans which are known to be malicious, then those could serve as indicators of shared code re-use from a group.
These could even serve as malware signatures for antivirus products. The graph database is good for investigating
these shared resource hashes because it is easy to find where a set of antivirus scans have a shared resource hash
node.
Here we provide a specific example: a SHA256 hash which we can search for in the Neo4j graph database
with the following Cypher syntax:
MATCH p=(find:node_sha256 {name: '84f7c54dc015637a28f06867607c2e0bdd225d10debb1390ff212d91cd2d042b'}) RETURN p
Using FileScan7 , which contains references to this hash, we can find that this is a hash of the following English
ASCII text: BundleInstall BundleInstall. Note that this representation is not necessarily capturing the full
hexadecimal data present in the data segment. The terms bundle install suggest that the source code language is
Ruby, and this text artifact could be a result of packaging Ruby source code into a portable executable.
7 https://www.filescan.io

Figure 8: Shown here are all nodes and edges connected to the node for CVE-2014-4404 up to two degrees away. The
degree 11 red node approximately in the center of the graph is the CVE-2014-4404 node. The 11 light blue nodes
denote 11 open source documents which mention CVE-2014-4404 - all other extracted datatypes from those 11
documents are also shown. Notably, several of these documents mention numerous other CVE numbers, suggesting
they could be security bulletins or repositories of some type. Other datatypes in these associated documents include
hash checksums, potential malware names, APT group names, and a potential email address. All of these linked
datatypes could be IOCs that are linked to CVE-2014-4404.

Within the current Neo4j database, this SHA256 hash node has a degree of 5 (meaning that it is referenced in 5
node documents). Two of these nodes are actually effectively duplicates - they are scans of the same file, meaning
that they have the same file name and associated extracted hashes (such as the hashes of contained resources in
the Portable Executable (PE) file), but were scanned at different times and therefore have slightly different data, so
the direct de-duplication did not remove one of them. Therefore, as with the other Neo4j figures, in order to reduce
clutter, one of these duplicates is removed in the displayed figures. The graph renderings of these nodes and their
degree 1 neighborhood connections are shown in Figure 9.
For the bottom graph in Figure 9 we can examine some of the important information that the nodes
have, such as the malware filename and the hash of the file. These four nodes are known malware samples (this
statement is based on the high proportion of antivirus scans indicating malware) and additional information on
them is available on VirusTotal.
Top document node in Figure 9: the filename is rkinstaller.exe and the full sha256 hash is given in the
VirusTotal link8.
Right document node in Figure 9: no associated filename; the full sha256 hash is given in the VirusTotal
link9.
Bottom document node in Figure 9: the filename is rkinstaller364.exe and the full sha256 hash is given in
the VirusTotal link10.
Left document node in Figure 9: the filename is poinstaller257.exe and the full sha256 hash is given in the
VirusTotal link11.
The notable observation from this data is that this shared portable executable resource hash is a reasonably
unique artifact (meaning that it is not a commonly re-used ASCII text segment in PE development) and was found
8 https://www.virustotal.com/gui/file/5577ce9aa4e4ec2735247c5769f0e84db599825f2d95159b0102f3b30e80b6bb/details
9 https://www.virustotal.com/gui/file/06f11f4a555a4891c93f13f82dc06e8bcedda2a71c8a5e6aa5c18da871f41238/details
10 https://www.virustotal.com/gui/file/f8d11b1e3e027355a11163049b530de4fd67183abd08a691d5d18744653ef575/details
11 https://www.virustotal.com/gui/file/f3efcfc7121f2348deb6f3b5ffde60878d978c25281e67defdc288feaef8b38c/details

Figure 9: This is a graph rendering of the neighboring nodes connected to the sha256 node hash of interest, which
is the degree four green-yellow node in the center of the graph. The neighboring connections of these document
nodes have also been displayed in order to determine if there are any other connections of interest. Each of the
document nodes are antivirus scans of portable executables (PEs). The degree 1 expanded neighbors show that
there is actually another sha256 hash node that is shared by three out of the four document nodes, and there are no
other edges connecting the associated hashes and file names. That sha256 hash is a standard manifest for creating
PEs and is common to a large number of PE samples (both benign-ware and malware) and is therefore not unique
enough to attribute a meaningful connection.

across a small subset of malware samples, some of which also have other shared characteristics such as similar (or
identical) file names. This suggests, with reasonable confidence, that the development of these pieces of malware
is linked in a meaningful way; for example, the same developer could have created these portable executables. This
demonstrates where the graph database construction of OSint allows a user to link together pieces of information
in order to group together seemingly unconnected documents and other potential indicators.

3.5 CVE Degree and CVSS Score


Because this database is constructed in a largely unsupervised manner - i.e., pattern matches are automatically
generated and new data is added to the database without human review - a natural question that arises is whether
the graph structure of the data represents the real world properties of the vulnerabilities or potential IOCs. An easy
example of this that we can numerically compute is the degree of CVE nodes in the database (which corresponds
to how many times that CVE was mentioned in the text from the different data sources) and compare that against
the common vulnerability scoring system (CVSS) scores of CVEs. CVSS scores are intended to approximately
represent the overall severity of the vulnerability [43–46] where a CVSS score of 0 is the lowest severity and 10 is
the maximum severity.
There are two CVSS score versions that we will compare - CVSS version 2 and version 3. Version 3 is the newest
CVSS scoring method, which is intended to be a more accurate rating scale for modern cybersecurity threats.
The relevant question is whether there exists a relationship of increasing degree of the CVE nodes in the graph
database with respect to CVE CVSS score. Intuitively, if the severity of a CVE corresponds to how frequently that
CVE is mentioned in social media, news, and threat reports, then higher CVSS scores will correspond to higher
degree CVE nodes in the Neo4j graph database. The CVSS scores are retrieved from the National Institute of
Standards and Technology National Vulnerability Database (NIST NVD) dataset 12 . To this end, we compute the
12 https://nvd.nist.gov

Figure 10: This graph is a continuation of Figure 9, where further mentions of the filenames and hashes from the
antivirus scans are displayed. One of the sha256 nodes in the bottom document node is also contained in many
other node documents; however, it is a hash of a standard dynamic link library (DLL) manifest file and therefore its
neighbors are not included in this figure to reduce visual clutter. The only other node that was referenced outside of
this small connected portion of the graph was the filename in the top document node, which was rkinstaller.exe.
This filename was also mentioned in a site that listed a large number of filenames known to be malware - that node
document, along with the halo of its connected potential IOCs (most of which are other filenames), is shown in the
upper left hand portion of the figure.

[Figure 11 scatter plots: CVE degree in Neo4j vs CVSS v2 score (left, Pearson correlation coefficient = 0.032) and vs CVSS v3 score (right, Pearson correlation coefficient = 0.007).]

Figure 11: CVSS score (x-axis) version 2 (left) and version 3 (right) vs CVE node degree in the Neo4j indicator
database (y-axis). The right hand figure contains 97,489 datapoints, and the left hand figure contains 150,940 datapoints.
The Pearson correlation coefficient for each dataset is shown in the plot titles. The outlier CVE node in the right
hand plot, which has a degree of 4,926, is CVE-2021-44228 (also known as Log4j).

Pearson correlation coefficient between CVE node degrees and their CVSS scores using scipy in Python 3 [47–50].
Some CVEs do not have a version 2 or a version 3 score, and therefore are not able to be plotted in this dataset.
Figure 11 plots all CVE CVSS scores against CVE node degrees in the Neo4j graph database, which shows there
is not a positive or linear correlation between the CVE node degrees and CVSS scores. This is notable because it
shows that across all of the cybersecurity mentioned natural text that was gathered, there is not a strong correlation
between the rate of CVE mentions (e.g., CVE popularity) and CVSS scores. However, it could be the case that

[Figure 12 scatter plots, for the restricted document subset: CVE degree in Neo4j vs CVSS v2 score (left, Pearson correlation coefficient = 0.189) and vs CVSS v3 score (right, Pearson correlation coefficient = 0.211).]

Figure 12: CVSS score (x-axis) version 2 (left) and version 3 (right) vs CVE node degree in the Neo4j indicator
database (y-axis). The datapoints plotted are based on the degree of the CVE nodes connected to a subset of the
document nodes which are more reputable for the domain of cybersecurity (e.g., threat reports), and CVE nodes
with degree 1 are not considered. Additionally, only CVEs which were released in the time period of the web
scraping upon which the graph database is built are plotted, in order to remove the temporal bias which existed
in Figure 11. The Pearson correlation coefficient is shown in the plot titles. The right hand figure contains 1,898
datapoints, and the left hand figure contains 1,666 datapoints.

more focused cybersecurity documents have a higher CVE CVSS score and Neo4j node degree correlation. It could
also be the case that there is not a strong signal of correlation for CVE nodes which are not commonly mentioned in
news, threat reports, and cybersecurity bulletins - and therefore only considering nodes which have at least a degree
of 2 could remove some noise in the dataset. Lastly, due to the popularity of new CVEs, the web scraping
will necessarily have a temporal bias toward the times during which the spiders were operating.
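A minimal sketch of this correlation computation, assuming the CVE node degrees have already been queried out of Neo4j and paired with their NVD CVSS scores, is:

from scipy.stats import pearsonr

# Hypothetical paired data: Neo4j degree of each CVE node and its CVSS score from NVD.
cve_degrees = [2, 5, 3, 40, 7, 12, 2, 9]
cvss_scores = [4.3, 7.5, 5.0, 9.8, 6.1, 8.8, 3.1, 7.2]

correlation, p_value = pearsonr(cve_degrees, cvss_scores)
print(f"Pearson correlation coefficient = {correlation:.3f} (p = {p_value:.3f})")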
Figure 12 shows the correlation plot for CVE node degrees vs CVSS score, but the Neo4j node degrees are
computed using a restricted set of document sources, namely reputable blogs, information sites, and threat reports
- specifically fireeye13, all threat reports, proofpoint14, exploitdb15, and the Hackernews16. Additionally, CVE
nodes with degree 1 are not included in this computation since low degree nodes do not necessarily provide a
strong signal in regards to how referenced that CVE is. Lastly, the data in Figure 12 are restricted to the years
during which the OSint crawler system was operational [25] because there is an inherent temporal bias in regards
to what web pages are mentioned and scraped. Because only a subset of the CVEs had CVSS v3 scores, there
were fewer data points available for those plots. In Figure 12 we observe that there is a weak to moderate positive
linear correlation between CVSS score and Neo4j degree for the documents that are cybersecurity domain focused.
Interestingly, there is a slightly higher Pearson correlation for CVSS version 3 (0.211) compared to version 2 (0.189).
This shows that taking into account the source of the data, accounting for temporal bias, and not considering degree 1 nodes
reveals a CVSS and CVE node degree correlation, in contrast to the correlations shown for the entire
dataset in Figure 11. There are two potential reasons why the CVSS score - node degree correlations are only
weakly linear in Figure 12:
1. The inherent bias present in the documents from the web crawlers - i.e., a CVE could be more commonly
discussed on social media or news sites because of a reason other than its significance for cybersecurity. This
reason seems to be the most prevalent due to the presence of social media and news content. In essence, the
frequency of mentions of a CVE is more related to what catches attention than to the severity of the vulnerability.
2. The CVSS score does not perfectly reflect the real world severity of a given CVE.

3.6 CVE PageRank


Neo4j has a library called Graph Data Science (GDS) which contains various graph algorithms that can be executed
on a graph that is stored in Neo4j. These graph algorithms could be useful for identifying patterns and clusters
13 https://www.trellix.com/en-us/about/newsroom/stories/threat-labs.html
14 https://www.proofpoint.com/us/blog/threat-insight
15 https://www.exploit-db.com/
16 https://thehackernews.com/

CVE ID | CVE node PageRank score in Neo4j | Vulnerability name and description | In CISA known exploited vulnerability catalog | CVSS score v2 | CVSS score v3
CVE-2021-44228 | 758.1 | Apache Log4j2 Remote Code Execution Vulnerability | Yes | 9.2 | 10.0
CVE-2021-45046 | 113.06 | It was found that the fix to address CVE-2021-44228 in Apache Log4j 2.15.0 was incomplete in certain non-default configurations. | No | 5.1 | 9.0
CVE-2021-34527 | 106.21 | "PrintNightmare" - Microsoft Windows Print Spooler Remote Code Execution Vulnerability | Yes | 9.0 | 8.8
CVE-2017-11882 | 95.64 | Microsoft Office memory corruption vulnerability | Yes | 9.3 | 7.8
CVE-2012-0158 | 86.71 | Microsoft MSCOMCTL.OCX Remote Code Execution Vulnerability | Yes | 9.3 | N/A
CVE-2014-0160 | 81.13 | OpenSSL Information Disclosure Vulnerability ("heartbleed") | Yes | 5.0 | 7.5
CVE-2021-34481 | 73.47 | Windows Print Spooler Elevation of Privilege Vulnerability | No | 4.6 | 7.8
CVE-2021-45105 | 70.92 | Apache Log4j2 versions 2.0-alpha1 through 2.16.0 (excluding 2.12.3 and 2.3.1) did not protect from uncontrolled recursion from self-referential lookups. | No | 4.3 | 5.9
CVE-2021-1675 | 64.43 | Microsoft Windows Print Spooler Remote Code Execution Vulnerability | Yes | 9.3 | 8.8
CVE-2021-40444 | 59.45 | Microsoft MSHTML Remote Code Execution Vulnerability | Yes | 6.8 | 7.8
CVE-2017-0199 | 56.32 | Microsoft Office/WordPad Remote Code Execution Vulnerability with Windows API | Yes | 9.3 | 7.8

Table 2: The top 11 most referenced CVEs ranked by their PageRank score in the Neo4j database, in descending
order. The PageRank computation was performed on all edges, regardless of type, in the database. PageRank scores
are rounded to two decimal places. Note that the Cybersecurity & Infrastructure Security Agency (CISA) known
exploited vulnerabilities catalog is continuously updated - this information is correct at the time this paper is written, but may
change in the future. The PageRank ranking is the most important part of the table, but the numerical PageRank
scores do approximately correspond to the degree of popularity and relevance of the CVE ID. For example,
it is clear that in the current dataset CVE-2021-44228 is significantly more referenced than every other CVE in
the graph database. The intention of including the CVSS scores and whether the vulnerability is currently in the
catalog is to give additional context on the severity of the vulnerability.

in this specific potential IOC graph database. As a simple example of how the structure of the graph yields node
rankings based only on the reference to potential IOCs and vulnerability IDs, Table 2 details the top 11 highest
ranked CVEs in the (undirected) graph database according to the PageRank algorithm. PageRank [51, 52] is an
algorithm, originally designed for search engine ranking, which can be applied to any network data structure to
determine which nodes are the most influential and referenced (here a reference is simply an edge in the network).
All parameters were set to default for the PageRank computation with the exception of maxIterations which was
set to 300 and dampingFactor which was set to 0.75 instead of the typical 0.85. The reasoning for selecting a
smaller damping factor than what is typically used in search engines is that in this specific graph we are interested
in potentially longer range influences on the relevance of nodes.
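A sketch of this PageRank computation, calling the Neo4j Graph Data Science procedures through the Python driver, is shown below; the projected graph name and the returned name property are illustrative assumptions.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Project all node labels and all relationship types into an in-memory graph,
    # treating the relationships as undirected for the PageRank computation.
    session.run(
        "CALL gds.graph.project('osint', '*', {ALL: {type: '*', orientation: 'UNDIRECTED'}})"
    )
    # Run PageRank with the non-default parameters described above and stream the top results.
    result = session.run(
        "CALL gds.pageRank.stream('osint', {maxIterations: 300, dampingFactor: 0.75}) "
        "YIELD nodeId, score "
        "RETURN gds.util.asNode(nodeId).name AS name, score "
        "ORDER BY score DESC LIMIT 11"
    )
    for record in result:
        print(record["name"], round(record["score"], 2))
driver.close()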
The PageRank scores in Table 2 clearly show that the most referenced CVE nodes in the graph database are high
impact and well known CVEs including the heartbleed, Log4j, and PrintNightmare vulnerabilities. The only two

[Figure 13 scatter plots: CVE node PageRank score in Neo4j vs CVSS v2 score (left, Pearson correlation coefficient = 0.149) and vs CVSS v3 score (right, Pearson correlation coefficient = 0.102).]

Figure 13: CVSS score (x-axis) version 2 (left) and version 3 (right) vs CVE node PageRank score in the Neo4j
indicator database (y-axis). The right hand figure contains 715 datapoints, and the left hand figure contains 567
datapoints. The Pearson correlation coefficient is shown in the plot titles.

CVEs which are not currently in the CISA known exploited catalog are follow-up CVEs to the Log4j vulnerability.
All of these CVEs not only are high impact and typically rank high on CVSS scores, but they are also notable
because of how widespread their discussion was throughout the various social media channels that this graph
database is constructed from, in large part because of the wide user base of the software these vulnerabilities exploit.
Another natural question that arises from the PageRank scores is how those scores relate to the CVE CVSS
scores. Table 2 shows that there does seem to be at least some correlation, where the high PageRank score CVEs
have high CVSS scores. Similar to the degree correlation plots in Figures 11 and 12, in Figure 13 CVE node
PageRank scores are plotted against CVSS scores. Because the majority of the OSint crawler data is temporally
biased towards the more recent years of data gathering, the datapoints plotted in Figure 13 are restricted to the
years during which the OSint crawler system was operational [25], the same as in Figure 12. Additionally, in order
to filter only for nodes which have a robust distribution of neighbors in the graph, only points with a PageRank
score of 4.5 or over are plotted. Figure 13 shows that there is a low to medium linear correlation for the relevance,
i.e., the PageRank score, against the CVE CVSS scores. Interestingly, the CVSS v3 scores are more correlated with
the PageRank scores. This could indicate that the CVSS v3 scores more accurately represent the severity of the
CVEs compared to CVSS v2.

4 Discussion and Conclusion


There are primarily two technical challenges that still need to be improved in this data gathering and analysis
pipeline:

1. There are many instances of near duplicate content from the web crawling system. For example, the content
of a web page could be slightly altered from day to day (for example, even if only because the web page content
includes the current date and time); if the web crawlers end up at that same web page multiple times, direct
de-duplication will not remove the content for being a near duplicate. Another example of near duplicate
content that occurs often is that the web crawling will catch a social media conversation (for example, a Reddit
thread) while it is occurring. We want the crawlers to catch this type of conversation because they could save
content that is removed at some point in the future, but it also leads to potentially a large number of near
duplicates of the same social media thread as it evolves over time. The challenge is in quantifying how close
a document is to being a duplicate in order to remove it; in particular, computing a distance metric pairwise
between all documents can be very computationally intensive. A reasonable solution could be to not remove
near duplicates, but instead to take the union of near duplicate documents, and to compute similarity metrics
among close clusters of documents (for example, determined by their relationship in the Neo4j potential IOC
graph) for finding near duplicates; a minimal similarity sketch is given after this list.
2. Reduce the amount of noise in the data set. It is difficult to know a priori which content is relevant, and therefore in this work we err on the side of gathering more information rather than removing information that is potentially not useful. This allows the database to catch interesting edge cases and atypical cybersecurity content, but it comes at the cost of increased noise. However, there are some consistent sources of noise which are almost always irrelevant and can be manually filtered out, for example localhost IP addresses or standard file names from popular programs. Another example of noise in the data set is software version numbers being identified as IP addresses due to the similarity of their formats in text. It is not clear how to reduce the noise in the data set uniformly across all document types and potential IOC types; however, machine learning algorithms which more closely identify relevant information in a piece of text could be used to better filter the data.
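Following the suggestion in item 1, the sketch below compares documents only within a small cluster that already shares potential IOC nodes in the graph, rather than pairwise over the whole corpus. It is a minimal illustration, not the pipeline's implementation; the word-shingle representation and the 0.8 similarity threshold are assumptions chosen for the example.

# Minimal sketch: flag near-duplicate documents within one IOC-linked cluster
# using Jaccard similarity over k-word shingles. Cluster membership is assumed
# to come from the Neo4j graph (documents sharing indicator nodes).
from itertools import combinations

def shingles(text, k=5):
    """Return the set of k-word shingles for a document."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate_pairs(cluster_docs, threshold=0.8):
    """cluster_docs maps document id -> raw text for a single cluster."""
    shingle_sets = {doc_id: shingles(text) for doc_id, text in cluster_docs.items()}
    pairs = []
    for (id_a, set_a), (id_b, set_b) in combinations(shingle_sets.items(), 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs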

There are additional potential indicators and vulnerability tracking IDs that could be extracted from natural language text in future work, such as CWEs (Common Weakness Enumerations).
Aggregating and condensing open source intelligence into a human readable and easily searchable form is an important task given the scale of the data that is available in the form of social media, news, blogs, and threat reports in the cybersecurity space. Here we present one possible way to address this problem by parsing and transforming the open source data into a graph structure where each document can be associated with potential cybersecurity indicators of compromise, other infrastructure, CVEs, or MITRE ATT&CK Techniques. Querying this database by indicator then allows analysts to find the open source intelligence documents connected to that indicator and review their content; the graph database thus reduces the overhead of searching a massive collection of open source documents down to a succinct cloud of relevant documents.
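As a concrete illustration of this workflow, the sketch below retrieves the documents connected to a single indicator using the official Neo4j Python driver. The node labels (Document, Indicator), the value, url, and title properties, and the connection details are assumptions made for the example and may not match the exact schema of the database described in this paper.

# Minimal sketch: retrieve open source documents linked to one indicator.
# Labels, property names, and credentials are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (d:Document)--(i:Indicator {value: $indicator})
RETURN d.url AS url, d.title AS title
LIMIT 25
"""

def documents_for_indicator(indicator):
    with driver.session() as session:
        return [record.data() for record in session.run(QUERY, indicator=indicator)]

# Example: documents mentioning a suspicious IP address (documentation range).
for doc in documents_for_indicator("203.0.113.7"):
    print(doc["url"], "-", doc["title"])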
A good topic for future study is to have human analysts utilize the constructed Neo4j graph database on real-world data in order to quantify the efficacy of the data and of the indicators that have been extracted from the documents.

5 Acknowledgements
Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology &
Engineering Solutions of Sandia, LLC (NTESS), a wholly owned subsidiary of Honeywell International Inc., for
the U.S. Department of Energy’s National Nuclear Security Administration (DOE/NNSA) under contract DE-
NA0003525. The New Mexico Cybersecurity Center of Excellence (NMCCoE) is a statewide Research and Public
Service Project supported center for economic development, education, and research. The authors would like to
thank both SNL and NMCCoE for funding and computing system access and support.

References
[1] Michael Glassman and Min Ju Kang. “Intelligence in the internet age: The emergence and evolution of Open
Source Intelligence (OSINT)”. In: Computers in Human Behavior 28.2 (2012), pp. 673–682. doi: 10.1016/
j.chb.2011.11.014.
[2] João Rafael Gonçalves Evangelista et al. “Systematic literature review to investigate the application of open
source intelligence (osint) with artificial intelligence”. In: Journal of Applied Security Research 16.3 (2021),
pp. 345–369. doi: 10.1080/19361610.2020.1761737.
[3] Robert David Steele. “Open source intelligence”. In: Handbook of Intelligence Studies. Routledge, 2007,
pp. 147–165.
[4] Aritran Piplai et al. “Knowledge enrichment by fusing representations for malware threat intelligence and
behavior”. In: 2020 IEEE International Conference on Intelligence and Security Informatics (ISI). IEEE.
2020, pp. 1–6. doi: 10.1109/ISI49825.2020.9280512.
[5] Peng Gao et al. “A System for Efficiently Hunting for Cyber Threats in Computer Systems Using Threat
Intelligence”. In: 2021 IEEE 37th International Conference on Data Engineering (ICDE). 2021, pp. 2705–
2708. doi: 10.1109/ICDE51399.2021.00309.
[6] Nidhi Rastogi et al. “MALOnt: An Ontology for Malware Threat Intelligence”. In: Deployable Machine
Learning for Security Defense. Ed. by Gang Wang, Arridhana Ciptadi, and Ali Ahmadzadeh. Cham: Springer
International Publishing, 2020, pp. 28–44. isbn: 978-3-030-59621-7.
[7] Onur Catakoglu, Marco Balduzzi, and Davide Balzarotti. “Automatic extraction of indicators of compro-
mise for web applications”. In: Proceedings of the 25th International Conference on World Wide Web. 2016,
pp. 333–343. doi: 10.1145/2872427.2883056.
[8] Yuta Kazato, Yoshihide Nakagawa, and Yuichi Nakatani. “Improving maliciousness estimation of indicator
of compromise using graph convolutional networks”. In: 2020 IEEE 17th Annual Consumer Communications
& Networking Conference (CCNC). IEEE. 2020, pp. 1–7. doi: 10.1109/CCNC46108.2020.9045113.
[9] Ryan Christian et al. “An Ontology-Driven Knowledge Graph for Android Malware”. In: Proceedings of
the 2021 ACM SIGSAC Conference on Computer and Communications Security. CCS ’21. Virtual Event,
Republic of Korea: Association for Computing Machinery, 2021, 2435–2437. isbn: 9781450384544. doi: 10.
1145/3460120.3485353. url: https://doi.org/10.1145/3460120.3485353.
[10] Benjamin Bowman and H. Howie Huang. “Towards Next-Generation Cybersecurity with Graph AI”. In:
SIGOPS Oper. Syst. Rev. 55.1 (2021), 61–67. issn: 0163-5980. doi: 10.1145/3469379.3469386. url: https:
//doi.org/10.1145/3469379.3469386.
[11] Yongfu Wang et al. “The Analysis Method of Security Vulnerability Based on the Knowledge Graph”. In:
2020 the 10th International Conference on Communication and Network Security. ICCNS 2020. Tokyo, Japan:
Association for Computing Machinery, 2021, 135–145. isbn: 9781450389037. doi: 10.1145/3442520.3442535.
url: https://doi.org/10.1145/3442520.3442535.
[12] Shuqin Zhang et al. “Threat Analysis of IoT Security Knowledge Graph Based on Confidence”. In: Emerging
Technologies for Education. Ed. by Weijia Jia et al. Cham: Springer International Publishing, 2021, pp. 254–
264. isbn: 978-3-030-92836-0.
[13] Sharmishtha Dutta et al. “Knowledge Graph for Malware Threat Intelligence”. en. In: (2021). doi: 10.13140/
RG.2.2.27340.95367. url: http://rgdoi.net/10.13140/RG.2.2.27340.95367.
[14] Justin J Miller. “Graph database applications and concepts with Neo4j”. In: Proceedings of the Southern
Association for Information Systems Conference, Atlanta, GA, USA. Vol. 2324. 36. 2013.
[15] José Guia, Valéria Gonçalves Soares, and Jorge Bernardino. “Graph Databases: Neo4j Analysis.” In: ICEIS
(1). 2017, pp. 351–356.
[16] Jaroslav Pokorny. “Graph databases: their power and limitations”. In: Ifip International Conference on Com-
puter Information Systems and Industrial Management. Springer. 2015, pp. 58–69.
[17] Hongcheng Huang and Ziyu Dong. “Research on architecture and query performance based on distributed
graph database Neo4j”. In: 2013 3rd International Conference on Consumer Electronics, Communications
and Networks. IEEE. 2013, pp. 533–536.
[18] Lukasz Warchal. “Using Neo4j graph database in social network analysis”. In: Studia Informatica 33.2A
(2012), pp. 271–279.
[19] Pengcheng Liu et al. “Construction of typhoon disaster knowledge graph based on graph database Neo4j”.
In: 2020 Chinese Control And Decision Conference (CCDC). IEEE. 2020, pp. 3612–3616.
[20] Steven Noel et al. “CyGraph: graph-based analytics and visualization for cybersecurity”. In: Handbook of
Statistics. Vol. 35. Elsevier, 2016, pp. 117–167. doi: 10.1016/bs.host.2016.07.001.

[21] Yan Jia et al. “A practical approach to constructing a knowledge graph for cybersecurity”. In: Engineering
4.1 (2018), pp. 53–60. doi: 10.1016/j.eng.2018.01.004.
[22] Cliff Joslyn et al. “Massive scale cyber traffic analysis: a driver for graph database research”. In: First
International Workshop on Graph Data Management Experiences and Systems. 2013, pp. 1–6. doi: 10.1145/
2484425.2484428.
[23] Maaike H. T. de Boer et al. “Text Mining in Cybersecurity: Exploring Threats and Opportunities”. In:
Multimodal Technologies and Interaction 3.3 (2019). issn: 2414-4088. doi: 10.3390/mti3030062. url: https:
//www.mdpi.com/2414-4088/3/3/62.
[24] Juan Caballero et al. “The Rise of GoodFATR: A Novel Accuracy Comparison Methodology for Indicator
Extraction Tools”. In: Future Generation Computer Systems 144 (July 2023), 74–89. issn: 0167-739X. doi:
10.1016/j.future.2023.02.012. url: http://dx.doi.org/10.1016/j.future.2023.02.012.
[25] Donovan Jenkins, Lorie M. Liebrock, and Vince Urias. “Designing a Modular and Distributed Web Crawler
Focused on Unstructured Cybersecurity Intelligence”. In: 2021 International Carnahan Conference on Secu-
rity Technology (ICCST). 2021, pp. 1–6. doi: 10.1109/ICCST49569.2021.9717379.
[26] Elijah Pelofske, Lorie M Liebrock, and Vincent Urias. “A Robust Cybersecurity Topic Classification Tool”.
In: International Journal of Network Security & Its Applications, V14, N1 (2022). doi: 10.48550/ARXIV.
2109.02473. url: https://arxiv.org/abs/2109.02473.
[27] Valentine Solange Marine Legoy. “Retrieving ATT&CK tactics and techniques in cyber threat reports”. MA
thesis. University of Twente, 2019.
[28] Roger Kwon et al. “Cyber Threat Dictionary Using MITRE ATT&CK Matrix and NIST Cybersecurity
Framework Mapping”. In: 2020 Resilience Week (RWS). IEEE. 2020, pp. 106–112. doi: 10.1109/RWS50334.
2020.9241271.
[29] MITRE ATT&CK. “MITRE ATT&CK”. In: URL: https://attack.mitre.org (2021).
[30] Rawan Al-Shaer, Jonathan M Spring, and Eliana Christou. “Learning the associations of mitre ATT&CK
adversarial techniques”. In: 2020 IEEE Conference on Communications and Network Security (CNS). IEEE.
2020, pp. 1–9. doi: 10.1109/CNS48642.2020.9162207. eprint: 2005.01654.
[31] Aditya Kuppa, Lamine Aouad, and Nhien-An Le-Khac. “Linking CVE’s to MITRE ATT&CK Techniques”.
In: The 16th International Conference on Availability, Reliability and Security. 2021, pp. 1–12.
[32] Md Rayhanur Rahman and Laurie Williams. Investigating co-occurrences of MITRE ATT&CK Techniques.
2022. doi: 10.48550/ARXIV.2211.06495. url: https://arxiv.org/abs/2211.06495.
[33] Md Rayhanur Rahman and Laurie Williams. An investigation of security controls and MITRE ATT&CK
techniques. 2022. doi: 10.48550/ARXIV.2211.06500. url: https://arxiv.org/abs/2211.06500.
[34] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with
the natural language toolkit. O’Reilly Media, Inc., 2009.
[35] Kensuke Sumoto et al. “Automatic labeling of the elements of a vulnerability report CVE with NLP”. In:
2022 IEEE 23rd International Conference on Information Reuse and Integration for Data Science (IRI).
2022, pp. 164–165. doi: 10.1109/IRI54793.2022.00045.
[36] Blake E Strom et al. “Mitre ATT&CK: Design and philosophy”. In: Technical report. The MITRE Corpora-
tion, 2018.
[37] Shou-Ching Hsiao and Da-Yu Kao. “The static analysis of WannaCry ransomware”. In: 2018 20th Interna-
tional Conference on Advanced Communication Technology (ICACT). 2018, pp. 153–158. doi: 10.23919/
ICACT.2018.8323680.
[38] Guohang Lu et al. “A Comprehensive Detection Approach of Wannacry: Principles, Rules and Experiments”.
In: 2020 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (Cy-
berC). 2020, pp. 41–49. doi: 10.1109/CyberC49757.2020.00017.
[39] Da-Yu Kao, Shou-Ching Hsiao, and Raylin Tso. “Analyzing WannaCry Ransomware Considering the
Weapons and Exploits”. In: 2019 21st International Conference on Advanced Communication Technology
(ICACT). 2019, pp. 1098–1107. doi: 10.23919/ICACT.2019.8702049.
[40] Qian Chen and Robert A. Bridges. “Automated Behavioral Analysis of Malware: A Case Study of Wan-
naCry Ransomware”. In: 2017 16th IEEE International Conference on Machine Learning and Applications
(ICMLA). 2017, pp. 454–460. doi: 10.1109/ICMLA.2017.0-119.
[41] Da-Yu Kao and Shou-Ching Hsiao. “The dynamic analysis of WannaCry ransomware”. In: 2018 20th In-
ternational Conference on Advanced Communication Technology (ICACT). 2018, pp. 1–1. doi: 10.23919/
ICACT.2018.8323681.
[42] Katja Hahn and INM Register. “Robust static analysis of portable executable malware”. In: HTWK Leipzig
134 (2014).

[43] Pengsu Cheng et al. “Aggregating CVSS base scores for semantics-rich network security metrics”. In: 2012
IEEE 31st Symposium on Reliable Distributed Systems. IEEE. 2012, pp. 31–40.
[44] Karen Scarfone and Peter Mell. “An analysis of CVSS version 2 vulnerability scoring”. In: 2009 3rd Interna-
tional Symposium on Empirical Software Engineering and Measurement. IEEE. 2009, pp. 516–525.
[45] Atefeh Khazaei, Mohammad Ghasemzadeh, and Vali Derhami. “An automatic method for CVSS score pre-
diction using vulnerabilities description”. In: Journal of Intelligent & Fuzzy Systems 30.1 (2016), pp. 89–
96.
[46] Laurent Gallon and Jean Jacques Bascou. “Using CVSS in attack graphs”. In: 2011 Sixth International
Conference on Availability, Reliability and Security. IEEE. 2011, pp. 59–66.
[47] Pauli Virtanen et al. “SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python”. In: Nature
Methods 17 (2020), pp. 261–272. doi: 10.1038/s41592-019-0686-2.
[48] Student. “Probable error of a correlation coefficient”. In: Biometrika (1908), pp. 302–310.
[49] Charles J Kowalski. “On the effects of non-normality on the distribution of the sample product-moment
correlation coefficient”. In: Journal of the Royal Statistical Society: Series C (Applied Statistics) 21.1 (1972),
pp. 1–12.
[50] Jacob Benesty et al. “Pearson correlation coefficient”. In: Noise reduction in speech processing. Springer, 2009,
pp. 1–4.
[51] Sergey Brin and Lawrence Page. “The anatomy of a large-scale hypertextual web search engine”. In: Computer
networks and ISDN systems 30.1-7 (1998), pp. 107–117. doi: 10.1016/S0169-7552(98)00110-X.
[52] David F Gleich. “PageRank beyond the Web”. In: SIAM Review 57.3 (2015), pp. 321–363. doi: 10.1137/
140976649.
[53] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using
NetworkX. Tech. rep. Los Alamos National Laboratory (LANL), Los Alamos, NM (United States), 2008.

A Portable Executable malware visualization

Figure 14: Visualization of three different PE32 malware samples' byteplot, entropy, and structure. Each of these contains at least one common resource file (detailed in Section 3.4), which was sufficiently unique to link these pieces of malware. From left to right the sha256 checksums of these samples are 34aa24656d5527a5ff1f7eb4ce4e782085618ded3766730c81f8f16a15d7e0ce, 4f0d5a81b8a5bc3f998a0ac7a37db5bad49e1a22173251ed916363375360d5a4, and 8b53201f1914764f384c6ec5a7a5c5ab2924afaf382d2bbe79f68e43e5dfa3ba. From left to right the OriginalFilename for each sample is RKInstaller.exe, RKInstaller.exe, and POInstaller.exe. Note the similar naming scheme to the four samples that are in Neo4j (see Section 3.4). Interestingly, although these samples are not identical (we know this because their hashes are different), the middle and right-hand samples look visually indistinguishable from each other, whereas the left-hand sample is distinguishable from the other two. Overall, the structure of these PEs seems to be very similar, which further indicates that these were likely developed by the same group or person.

Figure 14 shows visualizations of three malware examples which contain a common PE resource artifact which
was identified by its sha256 checksum. These three examples are different from the Neo4j nodes (specifically their
hashes are not the same). These three distinct PE samples are also accessible on VirusTotal 17 18 19 . This suggests
that this particular indicator, while unique and not very common, is likely seen on other static analysis tools beyond
17 https://www.virustotal.com/gui/file/34aa24656d5527a5ff1f7eb4ce4e782085618ded3766730c81f8f16a15d7e0ce
18 https://www.virustotal.com/gui/file/4f0d5a81b8a5bc3f998a0ac7a37db5bad49e1a22173251ed916363375360d5a4
19 https://www.virustotal.com/gui/file/8b53201f1914764f384c6ec5a7a5c5ab2924afaf382d2bbe79f68e43e5dfa3ba

these two datasets. These visualizations were generated using FileScan20 and PortEx21 .

B Large graph visualization


Expanding out the network connections that exist in the Neo4j graph database is difficult to show visually because of the scale of the graphs in terms of edges and nodes. In Figure 15 we nevertheless show two large graph examples, rendered using pygraphistry22 and networkx [53] in python3. The graphistry version used to generate these figures is 2.39.32, and the graph layout algorithm is ForceAtlas2Barnes. The primary observation from Figure 15 is that expanding just a couple of degrees out from an indicator can already result in a very large graph, even though the entire graph database is very sparse. One reason for this is that different types of indicators, names, and vulnerability IDs are naturally mentioned at different rates in the gathered internet text. For example, popular malware names will be very common in cybersecurity text, whereas hashes will generally not be mentioned very frequently. Therefore, it is important to filter down to the types of edges to follow in the database when searching for specific indicators, for example by following only IP and domain edges when searching for server infrastructure connections. It is also useful in these cases to use the CTC metadata and language detection metadata to restrict the search to document nodes which are, with high confidence, English text discussing cybersecurity, in order to retrieve more relevant text. When there is limited information available on an indicator or document, searching through all available network connections (such as in the malware hash cases shown in Sections 3.1 and 3.4) can also be useful.
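The fixed-radius expansions shown in Figure 15 can be approximated with an ego-graph extraction in networkx followed by a pygraphistry upload. The sketch below is illustrative: the graph contents, the seed indicator, and the Graphistry account details are assumptions, and the exact export and rendering steps used for the paper's figures may differ.

# Minimal sketch: extract all nodes within `radius` hops of a seed indicator
# and render the subgraph with pygraphistry. The graph, seed node, and
# Graphistry credentials are illustrative assumptions.
import networkx as nx
import graphistry

# graphistry.register(api=3, username="...", password="...")  # required before plot()

def ego_expansion(G, seed_node, radius):
    """Return the subgraph of all nodes within `radius` hops of seed_node."""
    return nx.ego_graph(G, seed_node, radius=radius)

# Placeholder graph; in practice this is the indicator graph exported from Neo4j.
G = nx.Graph()
G.add_edge("203.0.113.7", "doc-1")
G.add_edge("doc-1", "example.test")
G.add_edge("example.test", "doc-2")

subgraph = ego_expansion(G, seed_node="203.0.113.7", radius=3)
edges = nx.to_pandas_edgelist(subgraph)  # default columns: 'source', 'target'
# plot() uploads the edge list to a Graphistry server and opens the rendering.
graphistry.bind(source="source", destination="target").edges(edges).plot()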

Figure 15: Large graph visualizations which are expansions of the Qakbot IP address graph from Figure 7 in Section 3.2. The degree 3 expansion (left) has 7,862 nodes and 9,244 edges, and the degree 4 expansion (right) has 46,631 nodes and 112,935 edges. Note that the node and edge coloring is selected by the graphistry software and serves to delineate different aspects of the graph, but does not follow the coloring scheme used in the Neo4j browser figures. These extremely large graph renderings show the large scale behavior of this subgraph of the database. Even though the database is overall quite sparse, there are clear clusters of the graph which behave similarly to each other, and there are also clearly very highly connected clusters.

20 https://www.filescan.io
21 https://github.com/struppigel/PortEx
22 https://github.com/graphistry/pygraphistry

