
Computer Networks 234 (2023) 109920

Contents lists available at ScienceDirect

Computer Networks
journal homepage: www.elsevier.com/locate/comnet

Are crowd-sourced CTI datasets ready for supporting anti-cybercrime intelligence?

Mauro Allegretta (a), Giuseppe Siracusano (b), Roberto Gonzalez (b), Marco Gramaglia (a,∗)
(a) University Carlos III of Madrid, Avenida Universidad 30, Leganés, 28911, Madrid, Spain
(b) NEC Laboratories Europe, Kurfürsten-Anlage 36, Heidelberg, 69115, Baden-Württemberg, Germany

ARTICLE INFO

MSC: 0000, 1111
Keywords: CTI, Crowdsourced data

ABSTRACT

Cyber crimes rapidly increased over the past years, with attackers performing large-scale activities, using sophisticated and complex tactics and techniques, that have targeted governments, companies, and even strategic infrastructures. To tackle these attacks, the cyber-security community usually shares Cyber Threat Intelligence (CTI) that includes the collected Indicators of Compromise (IoC) using several open or private sharing platforms. In this paper, we study the informativeness and relevance of the IoCs related to cyber crimes following a major real-world event such as the war in Ukraine, which started in February 2022. To this end, we analyze different kinds of attacks available in a crowd-sourced dataset of Cyber Threat Intelligence (CTI) reports. Our analysis shows that while this data is able to capture major trends such as the ones following major events, the degree of miscellaneous information inside the reports makes it difficult to discern the association of a specific trace unequivocally.

1. Introduction

In the past years, we have witnessed a continuous digitalization of all aspects of society, ranging from the development of e-government solutions to the steep increase in remote working and even the beginnings of an envisioned metaverse. However, as in many other aspects of real life, criminal organizations may try to take control of them. So, these advances came together with an increase in the number and sophistication of cyber-attacks [1].

Cyber Threat Intelligence (CTI) is knowledge and information about cyber threats and threat actors that is shared by security firms, governmental bodies, tech companies, and independent security researchers/experts [2,3], with the intent of detecting and mitigating attacks that happen in cyberspace.

Such knowledge is organized in structured incident reports and collected inside CTI sharing platforms. Accessing and searching for CTI knowledge is usually based on low-level physical evidence attributed to security incidents (Indicators of Compromise, IoCs). That is, a security operator who has to investigate an incident will search CTI sharing platforms to check whether evidence of the attack (e.g., file hashes, IPs, domains, URLs, etc.) is present in previously published IoCs; she will then gather the related information/knowledge and use it for the analysis of the incident.

Several web platforms have appeared to allow the sharing of that information. They allow Internet users (both cybersecurity experts and normal users) to register and add information in a crowdsourced manner. Moreover, they provide a huge amount of free information to all interested users.

However, analyzing such data, especially using Artificial Intelligence (AI) and Machine Learning (ML) based solutions, requires not only that the IoC data is reliable, but also that the associated IoC metadata can be queried and analyzed in a structured way, to use it as a possible label or context information. Unfortunately, this information describing malware family, threat actor, attack target, attacker motivation, etc. is often textual as it is usually human-added. This task is also difficult for automated tools, as malware and threat actor names can have several synonyms depending on the organization/institution compiling the intelligence report or the anti-virus providing the label. This leads to different aliases for the same malicious actors, making it difficult to perform a high-level strategic analysis of the cyber-criminal ecosystem.

Finally, metadata can be continuously appended to the IoC, thus mixing the information related to different incidents. All of this contributes to downgrading the quality of the CTI data. Thus, a motivating question for this work is to understand how this kind of dataset can be

∗ Corresponding author.
E-mail address: mgramagl@it.uc3m.es (M. Gramaglia).

https://doi.org/10.1016/j.comnet.2023.109920
Received 18 February 2023; Received in revised form 5 April 2023; Accepted 5 July 2023
Available online 11 July 2023
1389-1286/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
M. Allegretta et al. Computer Networks 234 (2023) 109920

used to understand how crowd-sourced CTI data can be leveraged to perform anti-cyber-crime actions.

Thus, in this paper we investigate the quality of non-IoC information available on a well-known CTI sharing platform, motivated by the surge of cyber attacks after the beginning of the war in Ukraine, which started on February 24th, 2022, and harvest data from AlienVault, one of the major public sources of structured CTI data. In this process, we collected more than 115 thousand attack reports (or pulses, in the AlienVault jargon), leading to more than 30 million IoCs, that include diverse information such as IP addresses scanning for vulnerabilities, domains used for the command & control of malware, or reports of successful cyber attacks. Then, we analyze this information both temporally and geographically to identify patterns and trends, in this case focusing on the changes that happened before and after February 24th, 2022. Finally, we dive deep into the most common attacks during the analyzed period and describe their behavior, using the gathered data. In more detail, we analyze the quality of the retrieved data in terms of

• specificity, measuring whether the retrieved information can be directly pinpointed to a specific event or whether it also contains data related to other incidents, that is, it is noisy data.
• completeness, measuring whether the available non-IoC data is enough to gather all the information related to an incident or whether there are data left out from the knowledge.

The main contributions of this work are: (i) an assessment of how cyber attackers' behavior can be modeled through crowd-sourced CTI data, (ii) the discussion on the insights gained by the attack analysis, showing how the most popular attacks are leveraging the network for their purposes, and (iii) an analysis of possible sources of inaccuracies in the analyzed data: we found that a non-negligible part of the data is difficult to uniquely assign to a specific attack. The remainder of the paper is organized as follows: in Section 2 we describe the data we used for our analysis, discussing the data models, the employed data harvesting, and pre-processing procedures. Then in Sections 3 and 4, we provide a macro and micro discussion of the trends that can be found in the CTI dataset, analyzing how this data builds towards enriching the knowledge of the attackers. Finally, in Section 5, we discuss the overall quality of the data before concluding the paper in Section 6.

2. Harvesting CTI data

There has been a large effort in recent years on the standardization of sharing mechanisms for CTI data. The most important, STIX [3], is considered the de-facto industrial standard for handling this kind of data. However, industrial datasets often remain within the enterprise that created those reports, or they are shared privately, making it difficult to provide 3rd party analysis using open data.

Indeed, different attempts have been made to extract structured CTI from unstructured sources such as human-readable security reports, cyber security expert blogs, social media posts, or logs in case of machine-produced reports. For example, authors in [4] extract entities from malware reports using Natural Language Processing techniques, to retrieve logical relationships among them. A similar idea is shared by [5,6], whose source of intelligence is obtained from textual security reports. These works are examples of structured representation from text-based sources that are important attempts but require a pre-processing mining phase and further quality assessment of the information extracted. In this paper, we are not aiming at evaluating the quality of STIX as a sharing format per se, but rather at evaluating whether the crowd-sourced information shared using this format is useful or not.

That is because only very recently public sources of structured CTI data (including ones in STIX format) are starting to become available. Motivated by this fact, in this work we analyze the availability and quality of one major provider for crowd-sourced structured CTI data. In the following, we discuss the data source we used for this work and the details related to data harvesting and pre-processing.

2.1. AlienVault OTX

For this work, we use the major source of public CTI data available: AlienVault Open Threat Exchange (OTX).¹ AlienVault OTX is an open platform, powered by AT&T, that provides access to CTI shared by a global community of more than 210K threat researchers and security professionals. Moreover, it enriches the crowd-sourced information by providing additional details available in external services such as the Whois system or VirusTotal.²

While there are other web-based alternatives for the gathering of crowd-sourced CTI data (such as Malware Information Sharing Platform,³ IBM X-Force Exchange,⁴ or Facebook Threat Exchange⁵) we selected AlienVault for its Open API and the cleanness of the data model, which is depicted in Fig. 1.

The information in OTX is crowdsourced, so users can freely register and upload information about CTI attacks in a structured way. The main information elements in OTX are the pulses. Each pulse represents a set of information related to a specific threat, usually recorded during an attack. The information in a pulse can be obtained from very different sources: from automatic ones such as honeypots or web scanners to human-elaborated content such as phishing lists or attacks on specific companies or regions. In addition to this standard data, OTX enriches the information with some automatically generated data such as: (i) the adversary, which stands for the different hacker groups, (ii) the Targeted Countries of the attack, (iii) the used Malware family, and (iv) the industry attacked by the malware.

Attached to each pulse, AlienVault OTX provides detailed information on the specific Indicators of Compromise (IoC or Indicator, in the remainder of the paper) that were used in the incident. The information depends on the specific type of IoC. In OTX they can be of 10 types: IP addresses, Domains, Hostnames (subdomains), Emails, URL / URI, File Hashes, CIDR Rules, File Paths, MUTEX name, and Common Vulnerabilities and Exposures (CVEs). For instance, a URL includes information about the assigned IP address and the information obtained by other sources such as Google Safe Browsing and VirusTotal, while File Hashes include information about the Antivirus that discovered the fingerprint, the Intrusion Detection System signatures, or the context of the host that was trying to execute that file (e.g., the OS or the loaded libraries).

Data in AlienVault does not undergo any vetting or moderating procedure. Users that upload a pulse and the associated IoCs can change or delete details at any point in time, including the updates of the specific details of pulses, and the addition/deletion of IoCs. AlienVault reports the creation and modification date for each Pulse.

2.2. Data gathering and pre-processing

We use the API offered by AlienVault OTX to obtain the data used in this work. It provides access to all the different resources available on the platform such as users, pulses, IoC, etc., although it does not offer an endpoint for downloading all pulses in a given time range of interest. Instead, we started our data gathering by listing all active users in OTX. Thus, we listed the top 2000 contributing users and fetched all the pulses they published. As a result, we obtained 206K pulses of the 225K listed on the OTX homepage. As we discuss in the next sections, the data we gather ranges from Jan 1st, 2020 to Aug 31st, 2022, which we consider a good overview of the activities of cybercrime performed in recent years. These first figures allow us to notice that only a relatively small fraction of the users registered in the portal actively contribute with new pulses on a regular basis, while the

¹ https://otx.alienvault.com/.
² https://www.virustotal.com/.
³ https://www.misp-project.org/.
⁴ https://exchange.xforce.ibmcloud.com/.
⁵ https://developers.facebook.com/docs/threat-exchange/getting-started.


Fig. 1. The AlienVault OTX information data model.
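As an illustration of the data model in Fig. 1, a pulse can be pictured as a record carrying free-text metadata plus a list of typed indicators. The sketch below is only a reading aid: the class names, field names, and type labels (`Pulse`, `Indicator`, `targeted_countries`, `"IPv4"`, and so on) are assumptions chosen to mirror the description in Section 2.1, not the actual schema returned by the OTX API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# The ten indicator families listed in Section 2.1 (labels are illustrative).
INDICATOR_TYPES = {
    "IPv4", "domain", "hostname", "email", "URL",
    "FileHash", "CIDR", "FilePath", "Mutex", "CVE",
}


@dataclass
class Indicator:
    type: str   # one of INDICATOR_TYPES
    value: str  # e.g. an IP address, a domain name, or a hash

    def __post_init__(self):
        if self.type not in INDICATOR_TYPES:
            raise ValueError(f"unknown indicator type: {self.type}")


@dataclass
class Pulse:
    name: str
    author: str
    created: datetime   # OTX reports a creation date per pulse
    modified: datetime  # ... and a modification date
    adversary: str = ""
    tags: List[str] = field(default_factory=list)
    malware_families: List[str] = field(default_factory=list)
    targeted_countries: List[str] = field(default_factory=list)
    industries: List[str] = field(default_factory=list)
    indicators: List[Indicator] = field(default_factory=list)

    def indicators_of_type(self, wanted: str) -> List[Indicator]:
        """Return the indicators of one given type."""
        return [i for i in self.indicators if i.type == wanted]
```

Because pulses are not moderated and can be edited at any time, keeping both timestamps on the record makes it possible to distinguish newly created pulses from updated ones.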

vast majority are registered to the platform to observe the published data.

Together with Pulses, we can download all the indicators' details associated with them. Here, we limit the data gathering just to the indicator types related to the Network Environment: URLs, Domain Names, and IP addresses. This subset of indicator types provides a wider threat-intelligence view on the cyber-attack tactics on a global scale since it is labeled with geographical information including Autonomous System Number, WhoIs tag, passive DNS details, and additional tags about the possible targeted countries and the APT related to the trace. AlienVault also provides additional information in the indicators field, such as the file hashes related to, e.g., a malware execution. While from a Threat Intelligence point of view file hashes are certainly important, when included without any additional detail they were not considered useful for the trend analysis we were interested in. In total, we gathered around 77M indicators for the considered period. This yields 115K pulses associated with 31M indicators.

To overcome the possible issues caused by both the crowdsourced nature of the data (where the manual insertion of text data is an error-prone activity) and the fact that many malware families are often known under slightly different spellings, we pre-process the data to make text filtering more consistent and account for possible misspellings, abbreviations, aliases, and special characters that may have been used to identify them. The lack of consistency in malware definitions is a very well-known problem in the security community. This effect, which is especially related to malware families' names produced by automatic tools such as anti-viruses, hinders the quality of the analysis of CTI data as it is not directly possible to group together attacks related to the same actors. In the literature, there have already been attempts to develop tools able to perform this pre-processing on the anti-virus labels that each vendor independently assigns to malware families: relevant examples are AVclass [7] and Euphony [8]. However, we developed a specific lightweight mechanism to solve this problem.

We do so by matching pulses to our query not only if the fields tags, malware name, and adversary contain one of the targeted names that we want to analyze (e.g., a specific malware name), but also if the Levenshtein similarity string metric (i.e., a metric based on the number of edits needed to turn one string into another) among all the elements of such fields is above 0.7, following a methodology similar to the one described in [9]. We validated the effect of this threshold on the Levenshtein metric on a relevant subset of our dataset. In our specific case, this threshold represented the lowest value that would exclude any noisy string not related to the desired malware name, while still capturing and grouping together similar aliases. In that way, we minimize the effects of glitches in the fields of the input data and can proceed with its analysis as we discuss next.

3. CTI data representativeness

In this section, we assess the quality of crowdsourced CTI data from a high-level perspective, analyzing how the data available in this source is able to capture cyber-attack trends. To this end, we set a ground-truth observation point that we could infer from external sources. So, we focus on the events following the Russian invasion of Ukraine, which started on February 24th, 2022.

It is well known that creating cyber-incident reports related to real-world events is a practice that started a long time ago and became more effective and impactful in recent years. Relevant examples are the cyber-attacks that severely disrupted the Estonian government [10] in 2007, and the WannaCry [11] distribution in 2017, which marked the professionalization of the cyber-criminals.

They have passed from attacking individual non-savvy Internet users to attacking well-protected companies [12] or even governments and strategic infrastructures, potentially producing real harm outside the cyber-world. The attack on the Colonial Pipeline Company [13] that limited access to gas on the US East Coast during May 2021 is a relevant example.

This novel situation has raised the concern of governments, which have started creating their own hacking forces [14,15] either to protect themselves from external attacks or to be able to perform their own attacks following geopolitical interests. For instance, on February 24th, 2022, coinciding with the start of the Russian invasion of Ukraine, a targeted cyber-attack managed to disrupt the service of the communication company Viasat [16,17] in Ukraine (and some neighboring countries).⁶

Thus, it has become clear that cyber and real attacks are being used together more and more, although the effectiveness of the former is sometimes questioned [18]. Nevertheless, we observe the cyber-incident collections starting from February 24th, 2022 as a motivating context for our analysis.

3.1. Spatiotemporal analysis

We start by analyzing the number of pulses published since the beginning of the observed period. At the first inspection of the data, we observe a small number of pulses publishing several thousand indicators in a single day. A more in-depth observation revealed that those pulses are data dumps (typically of blocklists⁷) that occur only once, representing data collected over long periods of time. Since those individual pulses heavily disturb the observation of trends in the data, we decided to filter them out for the temporal analysis. Similar to the techniques designed to detect bot activity in social networks [19,20], we identify the pulses whose contribution distribution does not follow a power law. Using the technique described in Alstott et al. [21] we identify 10K indicators published per active day as the limit to fit in a

⁶ We want to remark that, while the authors of this paper are against all kinds of aggression, the scope of this paper is not to express any political opinion about this event, but rather we want to use it as a motivating scenario for the analysis we perform in this paper.
⁷ https://otx.alienvault.com/pulse/625bbccdff6025314bd00084.
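The alias-matching step of Section 2.2 (substring match on the tags, malware name, and adversary fields, or a Levenshtein similarity above 0.7) could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not say how the similarity is normalised, so the sketch assumes `1 - distance / max(len(a), len(b))`, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1] (assumed normalisation, see lead-in)."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def pulse_matches(target: str, field_values, threshold: float = 0.7) -> bool:
    """True if any field value contains the target or is similar enough to it."""
    t = target.lower()
    return any(t in v.lower() or similarity(t, v) > threshold
               for v in field_values)
```

With this normalisation a one-letter misspelling of a 13-character name stays well above the 0.7 threshold, while unrelated actor names fall below it; where exactly the cut-off should sit is dataset dependent, which is why the text reports validating it on a subset of the data.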


Fig. 2. The applied filtering (left), and the time series (center, right) for the analyzed CTI data.
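The filtering shown in the left panel of Fig. 2 amounts to discarding pulses that publish more than 10K indicators per active day before building the time series. A minimal sketch of that step is given below; the record layout (`n_indicators`, `active_days`, `created`) and the way an "active day" is counted are assumptions made for illustration only.

```python
from collections import Counter

LIMIT = 10_000  # indicators per active day, the threshold reported in Section 3.1


def is_data_dump(n_indicators: int, active_days: int, limit: int = LIMIT) -> bool:
    """Flag a pulse whose publication rate exceeds the limit."""
    return n_indicators / max(active_days, 1) > limit


def pulses_per_month(pulses, limit: int = LIMIT) -> dict:
    """Monthly count of created pulses after removing data dumps.

    `pulses` is an iterable of dicts with hypothetical keys:
    'n_indicators' (int), 'active_days' (int), 'created' (ISO date string).
    """
    counts = Counter()
    for p in pulses:
        if is_data_dump(p["n_indicators"], p["active_days"], limit):
            continue  # one-off blocklist dumps distort the trend
        counts[p["created"][:7]] += 1  # 'YYYY-MM' prefix of the ISO date
    return dict(counts)
```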

Fig. 3. Number of indicators evolution over time by: type (left), targeted industry (center), and attack type (right).

log-normal behavior (Fig. 2(a)). Thus, we filter out all pulses publishing more than that amount of indicators per day.

Fig. 2(b) depicts the number of Pulses, created or modified, since January 1st, 2020 after removing the identified data dumps. A consistent activity can be observed, as both follow a similar pattern over time. However, this steady growth suddenly increases in 2022, with publication and updates of Pulses peaking at 8K and 10K respectively, showing an increase in monitored incidents starting with the beginning of 2022.

To analyze the reliability of the gathered data, we compare it with an external source of CTI data. For this purpose, we analyze an industrial STIX dataset, gathered by a major network equipment vendor, whose records are inserted by professional cybercrime experts. STIX data [22] is intrinsically different from the AlienVault OTX data model, as data are grouped into bundles of IoCs, internally organized as a graph. Indeed, STIX data captures much more information than AlienVault OTX, as it links together different IoCs semantically. In this paper, we use it as a reference for the correctness of the gathered OTX data.

Overall, the STIX dataset covers the same period as the AlienVault OTX one, including 138M bundles and more than 154M IoCs. We thus compare the trends by counting the number of STIX records published over time and comparing them with the published Pulses and IoCs in AlienVault. Besides the differences in absolute numbers, we do not observe the growth experienced in AlienVault OTX, suggesting that in the OTX dataset there could be some kind of replication.

So, in Fig. 2(c) we depict the number of unique indicators (i.e., not considering indicators published by two different pulses), which experience a dramatic increase in 2022, doubling the number of indicators published in the previously most crowded month (Mar. 2021). Comparing this trend with the industrial dataset we notice a similar trend of STIX IoCs published in the first months of 2022.

We now go one step further and analyze the details associated with the indicators in the dataset in Fig. 3. By looking at the kind of created indicators, we indeed observe a behavior consistent with the time series in Fig. 2(c). However, while the number of indicators associated with e-Mail (i.e., usually associated with Phishing attacks) or File Hashes (i.e., associated with viruses) is almost steady during the observed time window, the indicators associated with network attacks are increasing significantly, especially in 2022. Domain, URL, and IP address indicators grow by at least an order of magnitude. In particular, IP addresses increase by three orders of magnitude at the beginning of 2022, showing how malicious network activity has been proliferating in this period.

Similar considerations apply when analyzing other information associated with the pulses and the indicators therein. Although the field is not always filled in, the targeted industries tag (Fig. 3, center plot) allows us to understand the goal of the attacks performed over time. Again, we observe consistent trends, with strategic sectors like Government, Finance, and Technology steadily occupying the top positions among the most popular targets throughout 2021 and in the first months of 2022.

Finally, we analyze the vehicle used by attackers. In the right plot of Fig. 3, we can observe that all of the 6 most popular attack patterns in our dataset are increasing in 2022. A relevant increase happens for Command and Control (C2C) attacks starting in September 2021 and staying relevant in 2022, along with an evident increase in brute force attacks. However, while we observe a very high increase in published indicators for 2022, this is not reflected in the disaggregated data, showing how recent attacks are yet to be classified or are difficult to assign to a well-known MITRE [23] category.

Following the same methodology, we now plot in Fig. 4 the countries that were targeted by the highest number of indicators in our dataset, since January 1st, 2020. Countries are shown in descending order by the number of targeting indicators, showing the fraction of the count of IoCs for the last three years. From the figure, we can see that the most targeted country is the USA. Ukraine ranks in the 9th position for the observed time period but with a substantial contribution of 2022, corroborating the fact that the real and cyber worlds are strongly correlated in these periods, and that this correlation can be captured using crowd-sourced data.

3.2. Adversary analysis

After analyzing trends from a spatiotemporal perspective we dig into the specific Advanced Persistent Threats included in the data. Advanced Persistent Threat (APT) is a term widely used in the cyber-security community to describe dangerous and well-organized attackers, often state-sponsored, that perform orchestrated attacks over a long period of time against a specific set of targets. Those attacks often target large organizations such as financial institutions, corporations, and governments with their operational entities, political parties, etc. APTs distinguish themselves by the sophistication of the techniques used, the technical level of the experts behind the attacks, and well-structured multi-staged tactics often described as a cyber kill-chain, a term borrowed from military jargon [14,15]. APTs perform their operations over


Fig. 4. Percentage of indicators targeting one of the most popular countries by year.
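The per-country breakdown plotted in Fig. 4 can be approximated with a simple aggregation: rank countries by the number of indicators targeting them, then split each country's total by year. The sketch below assumes each indicator has already been reduced to a hypothetical `(country, year)` pair; how the targeted country is derived from a pulse is not modelled here.

```python
from collections import Counter, defaultdict


def country_year_shares(records, top_n: int = 10):
    """Rank countries by indicator count and give the per-year share in percent.

    `records` is an iterable of (country, year) pairs, one per indicator.
    Returns a list of (country, total, {year: percentage}) in descending order.
    """
    by_country = defaultdict(Counter)
    for country, year in records:
        by_country[country][year] += 1

    ranked = sorted(by_country.items(),
                    key=lambda kv: sum(kv[1].values()),
                    reverse=True)[:top_n]

    result = []
    for country, years in ranked:
        total = sum(years.values())
        shares = {y: 100.0 * n / total for y, n in sorted(years.items())}
        result.append((country, total, shares))
    return result
```

A country with few indicators overall but a large 2022 share, as described for Ukraine, shows up in this output as a low rank combined with a high percentage for that year.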

years, alternating phases of infiltration and reconnaissance of the victim with active attacks. They may appear dormant and then wake up leaving traces.

In Fig. 5 we list the temporal evolution of the 6 most popular adversary groups. Contrary to the previous analysis, showing a steady number of indicators over time, only increasing in the last month, we observe a very heterogeneous number of indicators associated over time with the different APTs. The peaks, most likely, correspond to the specific campaigns that each adversary group performed during the past two and a half years. Interestingly, three of the top 6 adversaries have a very high number of indicators in the dataset created in 2022. While one of them is the Lazarus Group [24], an organization backed by the North Korean government which has been active since 2009, the other two can certainly be related to the war in Ukraine. Conti is a ransomware that is believed to be managed by a Russian crime organization [25,26] and Gamaredon [27] is an APT linked to Russia that has been active since 2013. Among known threat actors, we record the activity of Muddy Water [28], an Iranian threat group targeting Middle Eastern state-level strategic sectors such as telecommunications, IT, and oil organizations. A relevant cyber campaign is also the one of Dark Herring [29], an Android Scamware spread through Android applications distributed through official and alternative app stores.

Finally, among the top 6, we also notice a bulk of indicators falling into the Untagged category, with a steep increase during February and March 2022. While this tag is commonly used to mark attacks that could not be attributed to known actors, by looking at the pulses we discovered that 76% of them were published by the IT Army of Ukraine, hacktivists conducting cyber-guerrilla against mainstream Russian government websites.

4. CTI data precision

Following the previous analysis, we now focus on how the attacks are typically executed by the APTs. To this end, we start by dissecting the way one of the most prominent APTs (Gamaredon) works. Finally, we analyze the most prominent tools (Wipers) used in the attacks. With this analysis, we can assess whether the crowd-sourced data is capable of capturing more specific aspects, hence allowing us to model (and possibly counter) the behavior of the attackers.

4.1. Gamaredon

Gamaredon, also known as Armageddon, Shuckworm, Actinium, or Primitive Bear, is an APT attributed to Russia and its governmental entities [30–32]. The first traces of its activity appeared in 2013 when it was known for targeting Ukrainian Government officials and organizations. Since then it has evolved and has gained the attention of the security research community.

Since its inception, Gamaredon has been reported [32] to use remote code execution techniques to download malicious software. In the first versions of the attack, the machines were infected with malicious self-extracting archives that upon extraction would either write batch scripts or automatically install remote desktop clients that were responsible for contacting the C2C servers. Once the connections were established, the software would download other malicious software or send information about the infected systems.

Attacking techniques evolved: the APT now infects the victims with self-developed tools responsible for the establishment of C2C communication, such as Pterodo/Pteranodon. As a first step, the intrusion is usually initiated via spear-phishing emails with attached documents which upon opening load remote document templates with malicious code [30,31,33].

Since the attacking campaigns of Gamaredon typically rely on the spread of malicious code through a C2C infrastructure, we now analyze the infrastructure used by Gamaredon to perform attacks. In our analysis we isolate the indicators related to the Gamaredon Group, focusing on the additional geographical detail that AlienVault OTX provides on the location of IPv4 indicators and, for indicators of type domain, hostname, and URL, relying on the passive DNS logs. As a result, we obtain a historical view of the evolution in time of the DNS logs.

Fig. 6 shows the top 16 most popular Autonomous Systems used by the Gamaredon C2C Network. As observed in Fig. 5, most of the indicators have been discovered in 2022, indicating how this group intensified the attacks very recently. The group has mainly used IP addresses belonging to three autonomous systems in the past years: AS197965 reg.ru, AS9123 timeweb, and AS20437 constant. However, the second one is not used anymore by the Gamaredon group. Moreover, while their main activity is in Russia, we see that they also use resources in other countries. In fact, they rely on cloud providers such as AS20437 and AS14061 (with AS20437 being the most used AS since the war started) that provide an infrastructure spanning several continents. This shows the dynamicity of the attacks performed by the group, showing how they can redeploy C2C software (from timeweb to constant) to counter mitigation techniques based on, e.g., blocklisting.

Finally, we discuss how the infrastructure is used to attack different geographical regions. Fig. 7 shows the countries attacked (right side of the plot) and the country that started the attack (left side of the plot). The group others also includes unknown countries (21% of the attacks and 59% of the targets). We observe how most of the attacks originated either in the USA or Russia, as expected after our AS analysis. Moreover,


Fig. 5. Number of indicators discovered per adversary (APT) over time.

Fig. 6. Number of new Passive DNS A records for Gamaredon per week categorized by their Autonomous System.
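The weekly series in Fig. 6 can be thought of as counting, per Autonomous System, the passive DNS A records that are seen for the first time in each week. The following sketch illustrates that aggregation under assumptions: the input tuple layout `(first_seen, hostname, ip, asn)` is hypothetical, and an A record is treated as "new" the first time its `(hostname, ip)` pair appears.

```python
from collections import Counter
from datetime import date


def weekly_new_records_by_asn(records) -> Counter:
    """Count first-seen A records per ISO week and Autonomous System.

    `records` is an iterable of (first_seen: date, hostname, ip, asn) tuples.
    The result is keyed by ((iso_year, iso_week), asn).
    """
    seen = set()
    counts = Counter()
    for first_seen, hostname, ip, asn in sorted(records):
        key = (hostname, ip)
        if key in seen:
            continue  # only the first appearance of an A record counts as new
        seen.add(key)
        iso = first_seen.isocalendar()
        counts[((iso[0], iso[1]), asn)] += 1
    return counts
```

Grouping by ISO week rather than by calendar month keeps short bursts of newly registered infrastructure visible, which is what makes a move from one provider to another stand out in this kind of plot.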

Fig. 7. Flow of attacks performed by the Gamaredon group: On the left, we represent the countries where an attack was initiated, and on the right, the targeted country.
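A flow diagram such as Fig. 7 is built from counts of (origin country, targeted country) pairs, with infrequent or unknown countries folded into an others group. The sketch below shows one possible aggregation; the input format and the rule for choosing which countries stay visible (`top_n` most frequent on each side) are assumptions, not details given in the text.

```python
from collections import Counter


def attack_flows(pairs, top_n: int = 5) -> Counter:
    """Aggregate (origin, target) pairs, folding rare or unknown countries.

    `pairs` is an iterable of (origin_country, target_country); None means
    the country is unknown and is mapped to the 'others' group.
    """
    pairs = [(o or "others", t or "others") for o, t in pairs]
    top_origins = {c for c, _ in Counter(o for o, _ in pairs).most_common(top_n)}
    top_targets = {c for c, _ in Counter(t for _, t in pairs).most_common(top_n)}

    flows = Counter()
    for o, t in pairs:
        origin = o if o in top_origins else "others"
        target = t if t in top_targets else "others"
        flows[(origin, target)] += 1
    return flows
```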


most of the attacks are targeting Ukraine (with a secondary target in Russia itself). This fact confirms the expectation that Gamaredon activity gives direct support to the military operations started by Russia.

We finish our analysis of the Gamaredon activity by focusing on the vulnerabilities used to perform their attacks. Understanding this is important since the main prevention technique against these kinds of attacks is the proper update of the software. We identified 15 unique Common Vulnerabilities and Exposures (CVE) [34] entries used by the APT. The most used ones are CVE-2017-11882 [35] and CVE-2017-0199 [36], which are found in 76% of the Gamaredon-related pulses and both refer to a vulnerability of MS Office that allows an attacker to run arbitrary code.

4.2. Wipers

If we go one step further and isolate the kind of attacks performed by the Gamaredon group in the most interesting scenarios we identified in the previous sections (i.e., during 2022, targeting Ukraine), we discover that a significant part of them (11.5% of the indicators) have a common ground in the malware families lists. Names such as HermeticWiper, CaddyWiper, or WhisperGate [37–39] all fall into the category of Wipers. They are a class of malware that is used to permanently damage the IT infrastructure of the victim and thus are especially interesting in the context of linking cyber attacks to their real-life effects. Moreover, the AcidRain wiper was used to disrupt satellite communication in Ukraine on the same date as the start of the war.

Fig. 8 shows the evolution of the usage of wipers since 2020. As a first observation, the usage of wipers has rapidly grown since the beginning of 2022, although only a few of them were also used in 2021. This is corroborated by the reported lifespan of the wipers, which is often very short, typically a few months. We attribute this behavior to the fact that, due to their very high damage potential, the vulnerabilities in the systems used by the wipers are rapidly patched after the first malware operation.

Table 1 summarizes the operation of the top 9 wipers appearing in 2022 related to campaigns targeting directly Ukraine. We observe that WhisperGate makes use of the recent vulnerability CVE-2022-26134, exploiting code injection on servers. Furthermore, we observe that some of the top wipers are still based on Windows/Office vulnerabilities that, surprisingly, were discovered and patched in 2013 and 2014. This underlines the importance of always having the most up-to-date version of the employed software.

Table 1
Summary of wipers used during 2022.

Wiper name | Related CVE | First time appeared | Popularity [%]
WhisperGate | CVE-2022-26134 | 2020-12-06 | 24.95
NotPetya | CVE-2021-42392, CVE-2021-3438, CVE-2014-4114, CVE-2013-3906 | 2020-05-10 | 23.53
HermeticWiper | CVE-2021-1675, CVE-2021-34527, CVE-2021-40444 | 2022-02-24 | 13.59
IsaacWiper | – | 2022-03-01 | 13.24
Fox Blade | – | 2021-01-16 | 7.97
BatchWiper | – | 2022-03-26 | 4.42
Caddy Wiper | CVE-2014-4114 | 2022-03-14 | 3.20
Industroyer | CVE-2013-3906 | 2021-04-03 | 0.43
Meteor | CVE-2021-3438 | 2021-07-29 | 2.22
DoubleZero | – | 2022-03-22 | 1.25
HermeticWizard | – | 2022-03-01 | 1.20
Apostole | CVE-2021-3437 | 2021-05-25 | 0.95
RURansom | – | 2021-10-13 | 0.79
WhisperKill | – | 2020-12-06 | 0.78
GermanWiper | – | 2021-07-27 | 0.53
MBR Wiper | CVE-2021-40444 | 2021-07-29 | 0.47
AcidRain | CVE-2016-5674, CVE-2021-4045, CVE-2021-45382, CVE-2022-25075, CVE-2022-28186, CVE-2022-26210 | 2022-04-01 | 0.40
dnWiper | – | 2022-03-12 | 0.04
Terra Wiper | – | 2020-02-01 | 0.01

5. Quality of the CTI data

From the analysis in Sections 2 and 4, we observed that crowd-sourced CTI data can be used to understand where, when, and how cyber crimes are performed. In this section we therefore assess the quality of the data, identifying possible sources of inaccuracy. More specifically, we analyze the specificity and completeness of the data. To this end, we use the Wipers set discussed in Section 4 as an exemplary case study.

5.1. Specificity

We now discuss the specificity of the obtained data. That is, we study whether all the data manually tagged as related to a wiper is actually related to it. Fig. 9 depicts the metrics under study.

First, we measure the co-occurrences of each of the selected wipers' names with any other in the list in Fig. 9(a). If a pulse is associated with a single wiper, we can infer that the information inside the pulse is specific to that wiper. Around 75% of the pulses are tagged with a single wiper name or one of its obvious aliases/misspellings. The rest include more than one wiper name, making the assignment of this information to one of the listed wipers difficult. This effect is exacerbated when we extend the co-occurrence analysis to any other specific name of a given malware outside the wipers domain: all the pulses associated with wipers in our dataset are also associated with another malware.

This aspect is specific to the analyzed AlienVault OTX dataset. If we observe the same metric in the Industrial STIX dataset introduced in Section 3.1, this effect is much more diluted. More than 99% of the bundles marked with a specific Wiper name are marked with one specific Wiper only. If we extend this analysis to bundles that contain both Wipers and other malware, as discussed above, the impact of bundles with more than two of them on the total count is negligible.

We dig deeper into these aspects in Fig. 9(b), which shows the co-occurrence matrix of wipers and other malware in our dataset. We see a dispersion of the co-occurrences, with many Wipers that are associated with several different wipers, non-exclusively. For instance, HermeticWiper and WhisperGate co-occur in 7% of the pulses that are associated with either of the two. This is a common behavior for WhisperGate, which co-occurs in 16% of the pulses. Another remarkable case is NotPetya, which co-occurs in 12% of the pulses in the dataset.

To better understand this matter, in Fig. 9(c) we represent the relations among the different tags. Every node represents a tag and the edges represent the co-occurrence of both tags in the same pulse. The size of the edges is proportional to the number of times two tags appeared together and the size of the nodes to the degree of the node (i.e., the number of pulses where the tag appeared). We can observe several different clusters representing tags associated with similar incidents. For instance, WhisperGate and HermeticWiper cluster together with the Gamaredon APT group and IsaacWiper, and are also related to NotPetya, which, in turn, clusters with Example Party Ticket. Moreover, different ransomware such as Conti or DarkSide cluster together with Cobalt Strike or Terra Loader. This indicates that multiple tags associated with a single pulse may indicate different tools used in an attack, instead of helping in identifying how a single tool is used.
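The co-occurrence metrics discussed in Section 5.1 can be illustrated with a minimal sketch. It assumes each pulse has already been reduced to a set of normalized malware tags; the example pulses are hypothetical, the function names are ours, and the alias/misspelling normalization mentioned in the text is deliberately left out.

```python
from collections import Counter
from itertools import combinations

# Hypothetical pulses, each reduced to its set of (already normalized) malware tags.
pulses = [
    {"HermeticWiper"},
    {"HermeticWiper", "WhisperGate"},
    {"WhisperGate", "NotPetya"},
    {"IsaacWiper"},
]


def cooccurrence(pulse_tags):
    """Return per-tag pulse counts and per-pair co-occurrence counts."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in pulse_tags:
        tag_counts.update(tags)
        # Sorting gives each unordered pair a single canonical key.
        pair_counts.update(combinations(sorted(tags), 2))
    return tag_counts, pair_counts


def single_tag_share(pulse_tags):
    """Fraction of pulses that carry exactly one tag."""
    return sum(1 for tags in pulse_tags if len(tags) == 1) / len(pulse_tags)


def pair_share(pulse_tags, a, b):
    """Among pulses tagged with a or b, the fraction tagged with both."""
    either = [tags for tags in pulse_tags if a in tags or b in tags]
    both = [tags for tags in either if a in tags and b in tags]
    return len(both) / len(either) if either else 0.0


tag_counts, pair_counts = cooccurrence(pulses)
print(single_tag_share(pulses))
print(pair_share(pulses, "HermeticWiper", "WhisperGate"))
```

In this sketch, `pair_counts` plays the role of the edge weights in a tag graph such as Fig. 9(c), and `tag_counts` the number of pulses in which each tag appears.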


Fig. 8. Wipers usage over time.

Fig. 9. Specificity metrics.

5.2. Completeness

Finally, we study the completeness of the manual labeling of wipers. That is, we explore whether there is information related to wipers that is not tagged. To this end, we study the percentage of IoCs inside the non-wiper tagged pulses that are present in the wiper-tagged pulses. Intuitively, if two pulses share the same IoCs, they should be related. Thus, if a pulse without a wiper tag shares a large number of indicators with the pulses that are tagged as a wiper, most probably, it should also be tagged with the wiper tag.

We identified 40K IoCs belonging to any of the pulses tagged as wipers. We refer to this set of IoCs as WiperSet. 94.75% of the pulses not tagged as wiper do not have any common indicator. For the remaining 5% (10K) of pulses without the wiper tag and some relation to the WiperSet, we calculate the percentage 𝛾 of IoCs of each pulse that are present in the WiperSet. Fig. 10 shows the CDF of 𝛾 for all the pulses without the wiper tag and at least one common indicator with WiperSet.

The result shows that most of the pulses are barely related to the WiperSet. 84.3% of the pulses have less than 25% of their indicators included in the WiperSet. However, when we focus on the tail of the distribution, we can see a non-negligible number of pulses with a strong relation to the wiper information. In particular, 7.21% of the pulses have all their indicators included in the WiperSet. This gives a very good indication of the amount of missed tags in the dataset.

6. Conclusions

The impact of cybercrime is moving from affecting only the digital life of citizens to also disturbing their real life. In the past years, cyber-attacks have severely disrupted the operation of governments, companies, and even strategic infrastructures. While major security vendors usually collect this information in either proprietary or unstructured data, only very recently has cybercrime data been made available through open-source portals.

While the openness of CTI data can be a very powerful tool to fight against cybercrime, the available open data is usually crowdsourced and relies on the ability of individuals to provide correct information. In this paper, we analyzed the effectiveness of this crowd-sourced data to capture the real extent of cyber crimes, taking as motivating context the Russian invasion of Ukraine in 2022.

By analyzing this data (also by comparing it with an industrial CTI dataset) we observe that it is possible to capture coarse trends on the spatiotemporal level, allowing us to depict attack trends. Then, we analyzed the information attached to the attack reports, focusing on two use cases. First, the operation of Gamaredon, one of the main hacker groups associated with the Russian state and with major participation in the attacks on Ukraine. Crowd-sourced data allows modeling the behavior of the attacker, as we found that they can distribute malware using remote code execution, changing their infrastructure over time to escape standard protection measures (i.e., IP blocklisting). Then,


Fig. 10. CDF of the percentage 𝛾 of common IoCs for the Pulses that were not associated with any Wiper name.
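The quantity 𝛾 plotted in Fig. 10 can be sketched as follows, assuming each pulse is available as a set of IoC strings. The example pulses and indicator values are hypothetical, and the helper names (`gamma`, `ecdf`) are ours rather than the authors'; this is only one straightforward way to compute the overlap percentage and its empirical CDF.

```python
# Hypothetical pulses, each reduced to its set of IoC strings.
wiper_pulses = [
    {"192.0.2.1", "evil.example", "hash-a"},
    {"hash-b", "198.51.100.7"},
]
other_pulses = [
    {"192.0.2.1", "hash-z"},                    # shares 1 of 2 indicators
    {"hash-a", "hash-b"},                       # shares 2 of 2 indicators
    {"203.0.113.9"},                            # shares nothing
    {"evil.example", "ioc-x", "ioc-y", "ioc-z"},  # shares 1 of 4 indicators
]

# WiperSet: every IoC that appears in at least one wiper-tagged pulse.
wiper_set = set().union(*wiper_pulses)


def gamma(pulse_iocs, reference):
    """Percentage of a pulse's IoCs that are also present in the reference set."""
    if not pulse_iocs:
        return 0.0
    return 100.0 * len(pulse_iocs & reference) / len(pulse_iocs)


gammas = [gamma(p, wiper_set) for p in other_pulses]

# Keep only the pulses with at least one common indicator, as done for Fig. 10.
related = sorted(g for g in gammas if g > 0)


def ecdf(values, x):
    """Empirical CDF: fraction of values that are less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)


no_overlap_share = gammas.count(0.0) / len(gammas)
full_overlap_share = sum(1 for g in related if g == 100.0) / len(related)
print(related, ecdf(related, 25.0), no_overlap_share, full_overlap_share)
```

Pulses with 𝛾 equal to 100 are the candidates for a missing wiper tag, which is the tail of the distribution highlighted in Section 5.2.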

we have analyzed the most popular digital tools used to disrupt the operation of physical systems, namely, the wipers. We found their popularity has rocketed during the beginning of 2022, with several new tools appearing to use existing vulnerabilities in different systems.

Finally, we discuss the quality of the analyzed data, identifying possible sources of inaccuracy. We evaluated the included metadata (which can be used to characterize attacks) in terms of specificity and completeness, showing the difficulty in uniquely pointing metadata to a set of IoCs, and motivating the need for more advanced solutions to properly structure the data before analyzing it.

CRediT authorship contribution statement

Mauro Allegretta: Methodology, Software, Visualization, Investigation, Writing – original draft. Giuseppe Siracusano: Conceptualization, Methodology, Software, Writing – original draft, Supervision. Roberto Gonzalez: Conceptualization, Methodology, Software, Writing – original draft, Supervision. Marco Gramaglia: Conceptualization, Methodology, Software, Writing – original draft, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data that has been used is confidential.

Acknowledgments

The work of UC3M has been supported by the Spanish Ministry of Economic Affairs and Digital Transformation and the European Union-NextGenerationEU through the UNICO 5G I+D project 6G-RIEMANN. The work of NEC Laboratories Europe has been supported by the EU research projects MARSAL (Grant Agreement 101017171) and DESIRE6G (Grant Agreement 101096466).

References

[1] H.S. Lallie, L.A. Shepherd, J.R. Nurse, A. Erola, G. Epiphaniou, C. Maple, X. Bellekens, Cyber security in the age of COVID-19: A timeline and analysis of cyber-crime and cyber-attacks during the pandemic, Comput. Secur. 105 (2021) 102248, http://dx.doi.org/10.1016/j.cose.2021.102248, URL https://www.sciencedirect.com/science/article/pii/S0167404821000729.
[2] A. Ramsdale, S. Shiaeles, N. Kolokotronis, A comparative analysis of cyber-threat intelligence sources, formats and languages, Electronics 9 (5) (2020) 824.
[3] T.D. Wagner, K. Mahbub, E. Palomar, A.E. Abdallah, Cyber threat intelligence sharing: Survey and research directions, Comput. Secur. 87 (2019) 101589, http://dx.doi.org/10.1016/j.cose.2019.101589, URL https://www.sciencedirect.com/science/article/pii/S016740481830467X.
[4] A. Piplai, S. Mittal, A. Joshi, T. Finin, J. Holt, R. Zak, Creating cybersecurity knowledge graphs from malware after action reports, IEEE Access 8 (2020) 211691–211703, http://dx.doi.org/10.1109/ACCESS.2020.3039234.
[5] A. Pingle, A. Piplai, S. Mittal, A. Joshi, J. Holt, R. Zak, Relext: Relation extraction using deep learning approaches for cybersecurity knowledge graph improvement, 2019, arXiv:1905.02497.
[6] S.N. Narayanan, A. Ganesan, K. Joshi, T. Oates, A. Joshi, T. Finin, Early detection of cybersecurity threats using collaborative cognition, in: 2018 IEEE 4th International Conference on Collaboration and Internet Computing, CIC, 2018, pp. 354–363, http://dx.doi.org/10.1109/CIC.2018.00054.
[7] M. Sebastián, R. Rivera, P. Kotzias, J. Caballero, Avclass: A tool for massive malware labeling, in: Research in Attacks, Intrusions, and Defenses: 19th International Symposium, RAID 2016, Paris, France, September 19-21, 2016, Proceedings 19, Springer, 2016, pp. 230–253.
[8] M. Hurier, G. Suarez-Tangil, S.K. Dash, T.F. Bissyandé, Y. Le Traon, J. Klein, L. Cavallaro, Euphony: Harmonious unification of cacophonous anti-virus vendor labels for android malware, in: 2017 IEEE/ACM 14th International Conference on Mining Software Repositories, MSR, IEEE, 2017, pp. 425–435.
[9] L. Yujian, L. Bo, A normalized Levenshtein distance metric, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 1091–1095.
[10] The Economist, War in the fifth domain. Are the mouse and keyboard the new weapons of conflict?, 2010, https://www.economist.com/briefing/2010/07/01/war-in-the-fifth-domain [Online; accessed 11-Jul-2023].
[11] Microsoft Defender Security Research Team, WannaCrypt ransomware worm targets out-of-date systems, 2017, https://www.microsoft.com/security/blog/2017/05/12/wannacrypt-ransomware-worm-targets-out-of-date-systems/?source=mmpc [Online; accessed 11-Jul-2023].
[12] R. Walters, Cyber attacks on US companies since November 2014, Herit. Found. 4487 (2015).
[13] Office of Cybersecurity, Energy Security, and Emergency Response, Colonial pipeline cyber incident, 2021, https://www.energy.gov/ceser/colonial-pipeline-cyber-incident [Online; accessed 11-Jul-2023].
[14] E.M. Hutchins, M.J. Cloppert, R.M. Amin, et al., Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains, Lead. Issues Inf. Warf. Secur. Res. 1 (1) (2011) 80.
[15] A. Ahmad, J. Webb, K.C. Desouza, J. Boorman, Strategically-motivated advanced persistent threat: Definition, process, tactics and a disinformation model of counterattack, Comput. Secur. 86 (2019) 402–418.


[16] Viasat, KA-SAT Network cyber attack overview, 2022, https://www.viasat.com/about/newsroom/blog/ka-sat-network-cyber-attack-overview/ [Online; accessed 11-Jul-2023].
[17] SentinelLABS, AcidRain | A modem wiper rains down on Europe, 2022, https://www.sentinelone.com/labs/acidrain-a-modem-wiper-rains-down-on-europe/ [Online; accessed 11-Jul-2023].
[18] J.A. Lewis, Cyber war and Ukraine, 2022, https://www.csis.org/analysis/cyber-war-and-ukraine [Online; accessed 11-Jul-2023].
[19] L. Muchnik, S. Pei, L.C. Parra, S.D. Reis, J.S. Andrade Jr., S. Havlin, H.A. Makse, Origins of power-law degree distribution in the heterogeneity of human activity in social networks, Sci. Rep. 3 (1) (2013) 1–8.
[20] T. Rastogi, A power law approach to estimating fake social network accounts, 2016, arXiv preprint arXiv:1605.07984.
[21] J. Alstott, E. Bullmore, D. Plenz, Powerlaw: A python package for analysis of heavy-tailed distributions, PLoS One 9 (1) (2014) 1–11, http://dx.doi.org/10.1371/journal.pone.0085777.
[22] OASIS Cyber Threat Intelligence (CTI) TC, STIX, 2017-2022, https://oasis-open.github.io/cti-documentation/ [Online; accessed 10-Oct-2022].
[23] B.E. Strom, A. Applebaum, D.P. Miller, K.C. Nickels, A.G. Pennington, C.B. Thomas, Mitre att&ck: Design and philosophy, in: Technical Report, The MITRE Corporation, 2018.
[24] P. Kálnai, M. Poslušnỳ, Lazarus Group: a mahjong game played with different sets of tiles, in: Virus Bulletin International Conference, 2018.
[25] CISA Gov, Conti ransomware, 2021, https://www.cisa.gov/sites/default/files/publications/AA21-265A-Conti_Ransomware_TLP_WHITE.pdf [Online; accessed 11-Jul-2023].
[26] M. Burgess, Leaked ransomware docs show conti helping putin from the shadows, 2022, https://www.wired.com/story/conti-ransomware-russia/ [Online; accessed 11-Jul-2023].
[27] Unit42, PaloAlto Networks, The gamaredon group toolset evolution, 2017, https://unit42.paloaltonetworks.com/unit-42-title-gamaredon-group-toolset-evolution/ [Online; accessed 11-Jul-2023].
[28] MITRE, Muddy water, 2018, https://attack.mitre.org/groups/G0069/ [Online; accessed 11-Jul-2023].
[29] Zimperium, Financially motivated mobile scamware exceeds 100m installations, 2021, https://blog.zimperium.com/dark-herring-android-scamware-exceeds-100m-installations/ [Online; accessed 11-Jul-2023].
[30] MSTIC (Microsoft Threat Intelligence Center), The gamaredon group toolset evolution, 2022, https://www.microsoft.com/security/blog/2022/02/04/actinium-targets-ukrainian-organizations/ [Online; accessed 11-Jul-2023].
[31] Ukraine SSU, Ssu: on night of full-scale invasion, russia aimed to destroy all cyber defence of ukraine, 2022, https://ssu.gov.ua/en/novyny/u-nich-povnomasshtabnoho-vtorhnennia-rf-voroh-khotiv-znyshchyty-ves-kiberzakhyst-ukrainy-sbu-video [Online; accessed 11-Jul-2023].
[32] A. Kasza, D. Reichel, The gamaredon group toolset evolution, 2017, p. 2017, Retrieved March 1.
[33] Unit42, PaloAlto Networks, Russia-Ukraine cyberattacks (updated): How to protect against related cyberthreats including ddos, HermeticWiper, gamaredon, website defacement, phishing and scams, 2022, https://unit42.paloaltonetworks.com/preparing-for-cyber-impact-russia-ukraine-crisis/ [Online; accessed 11-Jul-2023].
[34] MITRE, CVE, 2022, https://www.cve.org/ [Online; accessed 11-Jul-2023].
[35] MITRE, CVE-2017-11882, 2017, https://www.cve.org/CVERecord?id=CVE-2017-11882 [Online; accessed 11-Jul-2023].
[36] MITRE, CVE-2017-0199, 2017, https://www.cve.org/CVERecord?id=CVE-2017-0199 [Online; accessed 11-Jul-2023].
[37] MSTIC (Microsoft Threat Intelligence Center), Cyber threat activity in Ukraine: analysis and resources, 2022, https://msrc-blog.microsoft.com/2022/02/28/analysis-resources-cyber-threat-activity-ukraine/ [Online; accessed 11-Jul-2023].
[38] Symantec, Ukraine: Disk-wiping attacks precede Russian invasion, 2022, https://symantec-enterprise-blogs.security.com/blogs/threat-intelligence/ukraine-wiper-malware-russia [Online; accessed 11-Jul-2023].
[39] CISA Gov, Alert (AA22-057a): Destructive malware targeting organizations in Ukraine, 2022, https://www.cisa.gov/uscert/ncas/alerts/aa22-057a [Online; accessed 11-Jul-2023].

Mauro Allegretta is a Ph.D. Candidate at University Carlos III of Madrid.

Giuseppe Siracusano is a senior researcher at NEC Laboratories Europe.

Roberto Gonzalez is a Program Manager at NEC Laboratories Europe GmbH.

Marco Gramaglia is a visiting professor at University Carlos III of Madrid.
