Automated YARA Rule Generation

This document summarizes a research paper presented at Botconf 2019 that proposes an approach called YARA-Signator for the automated generation of code-based YARA rules from malware samples. YARA-Signator isolates instruction n-grams that frequently appear within a malware family but not in other families to generate rules. When applied to the Malpedia dataset, YARA-Signator found on average 51.85% of instruction n-grams of length 4 or higher were unique to each family. The rules achieved a high F1 score against malware but caused few false positives on goodware, demonstrating the method's effectiveness at automated YARA rule generation.


THE JOURNAL ON CYBERCRIME & DIGITAL INVESTIGATIONS, VOL. 5, NO. 1, DEC. 2019 BOTCONF 2019 PROCEEDINGS

YARA-Signator: Automated Generation of Code-based YARA Rules
Felix Bilstein¹, Daniel Plohmann¹
¹ Fraunhofer FKIE

This paper was presented at Botconf 2019, Bordeaux, 4-6 December 2019, www.botconf.eu
It is published in the Journal on Cybercrime & Digital Investigations by CECyF, https://journal.cecyf.fr/ojs
It is shared under the CC BY license http://creativecommons.org/licenses/by/4.0/.

Abstract

Effective detection and identification signatures are an important component in the toolkit for malware analysis. The creation of such signatures is still largely a manual task that requires notable experience and knowledge on the side of analysts. In this paper, we present YARA-Signator, an approach for the automated generation of code-based YARA rules. The method is based on the isolation of instruction n-grams that on the one hand appear frequently within a malware family and on the other hand are not found in any other family. Applying YARA-Signator to the Malpedia data set, we show that on average 51.85% of the instruction n-grams of length 4 and higher are only found in the respective family. The rules produced by the system using this data set achieve an overall F1 score of 0.983 and cause only very few false positives in a sanity check against a large goodware data set. YARA-Signator is made available as open source and a periodically updated reference rule set is provided for free through Malpedia.

1 Introduction

Malicious software (malware for short) continues to pose a significant threat to the security and integrity of computer systems. To effectively and rapidly triage malware, analysts make frequent use of a variety of tools and systems. A cornerstone in the initial assessment of suspicious files is the use of syntactic signatures, which have a long-standing tradition in anti-malware efforts. These signatures primarily enable detection and identification of malware families, which helps to speed up analysis procedures by making use of previous knowledge for these families. One of the most important and popular tools in this context is YARA. YARA is a highly efficient pattern matching engine, accompanied by a very accessible rule description language. This has led to YARA becoming a quasi-standard with wide adoption among practitioners and many rules being shared openly or in private threat hunting groups.

However, crafting rules that generalize well while avoiding misclassifications still remains a challenge. This process is often carried out manually, requiring knowledge and experience on the side of the analyst. Effective rules should ideally aim for stable and characteristic elements of malware, similar to the upper regions in the "Pyramid of Pain" [1] when thinking about attackers. One way to interpret this is trying to avoid potentially volatile or easily changed elements such as strings and instead aim for the code itself.

Previous works, e.g. by Blichmann [2] or Zaddach and Graziano [3], have already successfully demonstrated that the automated generation of code-based rules is possible. These approaches are based on the heuristic identification of longest common subsequences (LCS) that isolate code patterns in the form of instruction sequences that are found in all files of the input data. One drawback of the demonstrated approaches is their dependence on proprietary components (such as IDA Pro) and potential limitations in scalability.

In this work, we present our approach YARA-Signator, which follows the underlying idea of the previously presented approaches but transfers it to instruction sequences of fixed size, so-called n-grams. In contrast to the other works, YARA-Signator processes all malware families in a given input data set in parallel. This allows us to execute aggregation operations that have the following benefits. First, we can eliminate code sequences that are found in multiple families, which are most likely instances of shared code that should not become part of signatures, e.g. libraries. Second, by counting and ranking the number of appearances of n-grams in samples of the same family, we achieve an approximation of the LCS identification.

We propose a prototype implementation of YARA-Signator depending only on open source components and apply it to Malpedia [4], a community-curated corpus of cleanly labeled, unpacked malware samples covering more than 1,500 malware families. On this data set, the rules produced by the system achieve an overall F1 score of 0.983 with a high precision of 0.995. We additionally test the rules against a corpus containing 10 TB of benign software, on which 70 out of 992 rules produce a total of 13,879 false positives. While seemingly large, these numbers are driven by very few outliers, as 10 of these rules account for more than 92% of the FPs, showing that the rules are generally very accurate.

In summary, our paper makes the following contributions:

• We present YARA-Signator, a method for the automated generation of code-based YARA signatures.

• Using the disassembly for 992 malware families from the data set Malpedia [4], we show that on average more than 51.85% of instruction n-grams of size 4 and larger are intrinsic for the respective families, i.e. only found in these, and thus serve as good candidates for rule creation.

• We provide an open source implementation of YARA-Signator and make a periodically updated reference rule set for all processable families found in Malpedia publicly available.

The remainder of the paper is structured as follows. We first provide background information to ease the understanding of the proposed methodology and discuss related work to give a thematic overview of the topic. We then introduce our approach YARA-Signator and explain the workflow and components of the system. Afterwards, we examine the general viability of the method by providing a detailed statistical evaluation of the data set. This is followed by an evaluation of the classification performance of the rules generated using this data set and a false positive analysis against a large goodware corpus. We conclude with a discussion of limitations and future work.

2 Background

In this section, we discuss a number of aspects relevant for the understanding of the method proposed in this paper. We first discuss pattern matching and YARA in particular, before giving a short overview of the Intel instruction syntax and the concept of n-grams.

2.1 Pattern Matching and YARA

Pattern matching is a popular methodology that is also widely adopted in the context of threat detection, e.g. when monitoring network traffic or scanning files for malicious content. It typically uses a signature that consists of one or more known patterns associated with a threat to be evaluated against data of interest. In this regard, it is also used to detect or identify malicious software.

Apart from ClamAV [5], YARA [6] has become the de-facto standard for pattern matching in malware analysis. Its syntax is simple yet powerful, which makes it very popular among practitioners. As a result, there are many resources available where detection and identification signatures using the YARA format are shared.

Figure 1 shows an excerpt of a YARA signature. This particular signature is also an example of the automatically generated rules produced by the approach proposed in this paper: YARA-Signator.

All YARA signatures contain at least one mandatory part: a condition that describes what is necessary to trigger a detection when using this rule. This is a logical expression that can optionally address file meta data or content (e.g. filesize as shown in Figure 1) but will typically reference sequences defined in the strings environment. These strings can be defined as a printable character sequence (i.e. a text string), as a hex string, or as a regular expression. Optional keywords can modify the condition under which they evaluate as a match, e.g. ascii or wide for text strings, controlling the encoding for which the strings are defined. A third environment is also possible for YARA signatures: a collection of meta fields. These allow annotating a signature with additional information, such as author names, creation date, or sharing restrictions.

2.2 Intel Instruction Syntax

Intel x86/x64 machine code instructions [7] are variable in length between 1 and 15 bytes and structurally encoded as a sequence of 6 fields:

Legacy Instruction Prefix: An instruction may be prefixed with zero to four instruction behavior modifiers, indicating exclusive use of shared memory (LOCK), conditional instruction repetition (REP), as well as segment, operand size, and address size override switches.

(Prefixed) Opcode: The core of the instruction is a 1- to 3-byte field that defines the actual opcode (cf. Figure 6, which gives a visual overview for all 1-byte opcodes in x86). Under certain circumstances, the opcode can optionally be prefixed, e.g. with a REX prefix when operating under 64-bit and wanting to access extended registers such as R8 to R15.

rule win_citadel_auto {

meta:
author = "Felix Bilstein - yara-signator at cocacoding dot com"
malpedia_reference = "https://malpedia.caad.fkie.fraunhofer.de/details/win.citadel"
malpedia_license = "CC BY-NC-SA 4.0"
malpedia_sharing = "TLP:WHITE"

strings:
$sequence_0 = { 3bfe 7449 ff7508 e8???????? }
// n = 4, score = 3500
// 3bfe | cmp edi, esi
// 7449 | je 0x4b
// ff7508 | push dword ptr [ebp + 8]
// e8???????? |

[...]

$sequence_9 = { 8b0c0e 43 8901 8b470c 8bf3 }
// n = 5, score = 3500
// 8b0c0e | mov ecx, dword ptr [esi + ecx]
// 43 | inc ebx
// 8901 | mov dword ptr [ecx], eax
// 8b470c | mov eax, dword ptr [edi + 0xc]
// 8bf3 | mov esi, ebx

condition:
7 of them and filesize < 1236992
}

Figure 1: Example for a YARA signature targeting the malware family Citadel, automatically generated using YARA-Signator.
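
To make the matching semantics of Figure 1 more concrete, the following is a minimal sketch in Python. It is not the YARA engine and does not use its API. It assumes hex strings that consist only of literal bytes and full-byte `??` wildcards, and the helper names `hex_string_to_regex` and `rule_matches` are hypothetical. The condition is modeled as "at least k of the strings match and the file is smaller than a given size", mirroring `7 of them and filesize < 1236992`.

```python
import re
from typing import List


def hex_string_to_regex(hex_string: str) -> "re.Pattern[bytes]":
    """Translate a hex string with '??' byte wildcards into a bytes regex.

    Assumption: only full-byte wildcards are supported in this sketch.
    """
    compact = "".join(hex_string.split())
    if len(compact) % 2:
        raise ValueError("hex string must consist of full byte pairs")
    parts = []
    for i in range(0, len(compact), 2):
        pair = compact[i:i + 2]
        if pair == "??":
            parts.append(b".")  # wildcard: any single byte
        else:
            parts.append(re.escape(bytes.fromhex(pair)))
    # DOTALL lets '.' also match the newline byte 0x0a
    return re.compile(b"".join(parts), re.DOTALL)


def rule_matches(data: bytes, hex_strings: List[str],
                 min_hits: int, max_filesize: int) -> bool:
    """Model a condition of the form 'min_hits of them and filesize < max_filesize'."""
    hits = sum(1 for h in hex_strings if hex_string_to_regex(h).search(data))
    return hits >= min_hits and len(data) < max_filesize


# Two of the sequences shown in Figure 1
SEQUENCE_0 = "3bfe 7449 ff7508 e8????????"
SEQUENCE_9 = "8b0c0e 43 8901 8b470c 8bf3"

# Hypothetical buffer: sequence_0 with a concrete call offset, padded with NOPs
sample = bytes.fromhex("903bfe7449ff7508e81122334490")
one_hit = rule_matches(sample, [SEQUENCE_0, SEQUENCE_9], 1, 1236992)
two_hits = rule_matches(sample, [SEQUENCE_0, SEQUENCE_9], 2, 1236992)
```

Here the wildcarded call offset lets `SEQUENCE_0` match regardless of the concrete relative address, which is the generalization effect that wildcarding aims for.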

ModR/M: A field that is required for some opcodes. If present, it encodes an extension, which defines which concrete registers or memory addressing mode should be used.

Scale, Index, and Base (SIB): Another field only required for some opcodes. If present, it describes how exactly addresses are calculated and how the displacement may be used in this context.

Displacement: The displacement is a field containing a value of 1, 2, 4, or 8 bytes in length that is used as an offset for the calculation defined by the SIB field (if present). In case the displacement has a length of 8 bytes, no immediate may follow.

Immediate: Some instructions may use an immediate value, which can be 1, 2, 4, or 8 bytes long, depending on what is defined by the instruction or ModR/M field. Similarly, an 8-byte long immediate is mutually exclusive with a displacement.

In the context of this paper, the Displacement and Immediate fields are of special interest. Because both fields may contain concrete addresses and offsets that are very specific to a given compiled program or a result of mapping and memory relocations, it is desirable to replace these concrete values with wildcards in certain cases to achieve signatures with better generalization. Figures 1 and 2 give an example of wildcarding in the context of YARA, in both cases removing the concrete value for an interprocedural, relative-offset call instruction.

2.3 N-gram Structure

N-grams are consecutive subsequences with a fixed length taken from a given sequence of items. The use of n-grams in the context of detection or identification is a common technique in the field of malware analysis [8].

For code-based signatures, the interpretation of items could either mean taking a set number of bytes or instructions, which themselves usually consist of multiple bytes. In our case, we use instructions and Figure 2 provides an example for the derivation of n-grams given a stream of instructions.

There is a list of instructions given on the left hand side in Figure 2. These instructions are sequentially executed by the target architecture and therefore we have to keep this order. For a given length (four in this case), we derive four possible n-grams from the instruction list of length seven.

Figure 2: N-gram derivation from a given sequence of x86 instructions to 4-grams.

3 Related Work

In this section we provide an overview of related work. We focus on three categories: frameworks for autonomous rule generation, tools supporting manual creation of rules, and YARA rule archives.

With regard to full systems for rule generation, Blichmann recently published VxSig [9], a reference implementation for the seminal approach published in his diploma thesis [2]. VxSig allows the automated generation of signatures in the YARA and ClamAV format from sets of previously grouped, similar binaries. Input files are processed using BinExport [10] and BinDiff [10] to locate and isolate common code fragments. BASS [3] is a framework published by Zaddach and Graziano for the automated generation of ClamAV signatures over previously identified malware clusters. As noted by the authors, their method has strong similarities to Blichmann's approach [2] but aims for high scalability and additionally includes a method for filtering out code sequences from known goodware, using Kam1n0 [11]. Roth published yarGen [12], a tool that enables the automated generation of YARA rules based on one or more input files. It can process both strings and code and can optionally include blacklist information from databases, e.g. to further enhance the rule creation procedure by removing all strings that also appear in known goodware. Doman published YaBin [13], a tool that creates YARA signatures from code sequences that are automatically extracted from its input programs. The concept of YaBin is based on a heuristic search for common function prologues, e.g. "55 8B EC" (push ebp; mov ebp, esp) and discrimination against a whitelist of sequences from a collection of non-malicious software (about 100 GB in size). Heuristics for prologues cover the compilers MS Visual C, Borland, and MinGW.

The following projects aim at improving the workflow for manual creation of YARA rules. Yi published Hyara [14], a plugin for IDA Pro and BinaryNinja that allows highlighting code regions and strings that can then be quickly turned into YARA rules. KoreLogic published pat2yara [15], a helper script that converts rule files generated using IDA Pro's FLAIR engine into YARA rules. Ballenthin created "YARA-FN" [16], a script for IDA Pro that creates a YARA rule from all basic blocks of the currently shown function and that is also capable of wildcarding relocations and jump instructions.

Multiple notable public collections of YARA rules exist. The YaraRules Project provides a large collection of community-collected YARA rules that is managed as a GitHub repository [17]. Roth provides an extensive and frequently updated set of free YARA rules called "signature-base" [18], which is the default rule set used by his free scanning tools. Worth offers a curated collection of YARA rules called "Open-Source-YARA-rules" [19] sorted by their creator, covering 126 entities with 1,711 rules. Wesson provides a set of rules via "Project Icewater" [20], which produces rules that are automatically derived based on clustering over similarity.

4 YARA-Signator

In this section we introduce YARA-Signator, our framework for the automated generation of code-based YARA rules. The framework is designed to generate YARA signatures based on a given set of disassembly reports by processing them in multiple steps that we explain in this section. We start by providing a general overview of the approach in Section 4.1 and then present each of the processing steps of our framework in more detail in Section 4.2. The structural details of the implementation, i.e. its components and dependencies, are discussed in Section 4.3.

4.1 Approach

Our approach can be summarized as the task to identify fragments of code that are found only in representatives of the same malware family, while being absent in all others. These specific fragments are by definition characteristic for the respective family and thus should serve as good components for a detection signature. Since malware has to be considered simply a special category of software, we assume that underlying development processes are very similar or comparable to the processes used for regular software. This means that we generally expect most software projects to have a somewhat stable code base that usually does not change too drastically between its versions. We can furthermore take advantage of this assumption by preferably selecting the code fragments that appear in as many versions of the family as possible, which can be seen as a sign of their significance. In consequence, we expect that combinations of such fragments will yield reliable YARA signatures with good potential for generalization.

Since YARA signatures can consist of strings, byte sequences, and regular expressions, we choose byte sequences as they are best suited to represent code fragments. Because code is naturally structured in (machine) instructions, we assume that start and end markers of common code fragments will potentially coincide with instruction borders.

To avoid having to find common code fragments of maximum length (cp. Blichmann [2]), we instead
decide to work on n-grams of instructions. These n-grams are derived from disassembly reports, which serve as the input data format. All n-grams for one family are aggregated into a common pool. Given k malware families, the resulting task of isolating the characteristic code fragments essentially can be understood as a set operation in which for each of the family's n-gram pools all n-grams of the k − 1 other families are removed. Figure 3 illustrates this part of the approach. Following this procedure, the remaining n-grams in each pool are ranked and a selection of candidates is composed into YARA signatures which are then optimized for coverage.

Figure 3: N-gram filtering (diagram labels: N-gram Pool for Family A, Family B, Family C, and Family D; "removed").

Overall, the procedure to automatically produce code-based YARA signatures can be seen as a workflow consisting of four stages:

1. Data ingestion

2. Unification and Filtering

3. Rule Generation

4. Iterative Improvement

During the first step, data ingestion, all reports are parsed by the framework and linearized into n-grams. These n-grams are unified and filtered by performing data aggregations. The goal is to find n-gram candidates with a high coverage for a given malware family that are not overlapping with the code of other families. On the basis of these candidates we apply several filters to find the most suitable candidates. These candidates are written into YARA signatures which are evaluated. Then, we iteratively improve the generated YARA signatures by re-validating them in every step, potentially increasing their precision and coverage. These stages are explained in detail in the following section.

4.2 System Details

In this section we present the different stages of our approach in detail. Figure 4 illustrates the four primary stages of our framework.

Stage 1: 1) Parsing, 2) Linearization, 3) N-gram Generation, 4) Wildcarding
Stage 2: 5) Filtering Step, 6) Ranking System, 7) Overlapping N-gram Detector
Stage 3: 8) YARA Rule Composer, 9) Validator
Stage 4: 10) Iterative Improvement: a) Ranking Step, b) Coverage Engine, c) Overlapping N-gram Detector, d) YARA Rule Composer, e) Validator

Figure 4: YARA-Signator and its processing steps.

Data Ingestion is the first phase of our approach. We process given disassembly reports for a set of malware samples that we want to create YARA signatures for. The malware samples have to be unpacked and pre-clustered beforehand, so that samples are grouped with their respective family. Disassembly reports are parsed and instruction n-grams are extracted.

The second phase operates on the normalized data that we created in the first phase. We filter all duplicate n-grams between families and rank candidates for each malware family. Overlapping n-grams are removed to sanitize the candidate pools for each family.

A first set of YARA signatures is created within the third phase of our approach. After composing the YARA rules, we validate them against the input corpus to evaluate the quality of the generated signatures.

The last phase is an iterative improvement phase where problematic YARA signatures are re-generated and re-evaluated to improve the previously created signatures over time.

4.2.1 Data Ingestion

Parser. As an initial step, disassembly has to be parsed. We use SMDA [21] for this, because it is an open-source recursive disassembler built on top of
Capstone [22]. It is convenient to use, producing JSON files as output, and has the capability to reliably reconstruct and extract code from memory dumps. Note that the approach (and the implementation) is generally independent of the choice of disassembler as it could be trivially adapted to any other data input format.

Linearization. An advanced disassembler will produce fully reconstructed CFGs that divide the identified code into functions. Because YARA operates on byte sequences, we need to flatten the Control Flow Graph into a linear sequence of instructions that resembles how code is encountered in the binary. Using address offsets and the individual instructions' sizes, we can furthermore split the linearized stream into consecutive chunks whenever a gap is encountered.

N-gram Generation. In this step, we produce the data points YARA-Signator actually operates on. We derive all possible n-grams of a pre-defined size from the chunks resulting from Linearization. In the context of this work, we use sizes of 4 to 7 instructions per n-gram based on previous findings [23].

Wildcarding. Code may generally contain absolute virtual addresses or offsets, like memory pointers to code or data. These may even be shifted due to relocations while mapping a binary. Using sequences with these absolute values in place for signature generation could lead to false negatives. To avoid this, we perform additional abstraction and wildcard all occurrences of such pointers using absolute addresses. In fact, we also wildcard all relative references pointing outside the scope of the function we generate n-grams for, e.g. inter-procedural calls and jumps. Additionally, we wildcard immediate values that could be interpreted as addresses within the mapping of the given binary when mapped. For this, we inspect the Displacement and Immediate fields of all instructions (cf. Section 2.2). This procedure is equivalent to the wildcarding applied by Cohen and Havrilla [24] for their technique of creating position-independent code (PIC) hashes of functions. We expect this to additionally benefit rule generation as it may help make signatures more robust against code reordering that can happen due to an author's refactoring or compiler effects.

4.2.2 Unification and Filtering

Filtering Step. This step implements the actual idea, as initially described in the beginning of Section 4.1. By aggregating identical n-grams across all ingested samples and families, we can filter out all n-grams that occur in more than one family. While doing so, we additionally track in how many different samples the family-unique n-grams occur as this will help identify representative n-grams in the next stages. All remaining n-grams are considered potential candidate n-grams for rule generation.

Ranking System. We developed the ranking system to allow flexible configuration by the user. It consists of individual ranking functions that generate a score for a given n-gram. Multiple ranking functions can be chained to incorporate multiple semantics into the overall rating of an n-gram. Example metrics used for ranking in our reference implementation are the number of occurrences in different samples and the types of instruction (e.g. memory-access, or logic/arithmetic) found in the n-gram. After the ranking, the highest-ranked candidates are selected per family (in our configuration, 10).

Overlapping N-gram Detector. When n-grams are aggregated across samples, the information about the relative position of n-grams to each other is lost. As Ranking is applied for individual n-grams in a procedure not considering other n-grams, it may rank several n-grams similarly well due to characteristics they share. This may potentially be a result of them overlapping or even being contained within each other, e.g. ABCDEF and ABCD or ABCDEF and CDEFGH (with each letter representing an instruction at a certain offset). As it may be favorable to have signature contents being spread over the target or at least not being redundant, this stage ensures that no excessive overlapping exists between n-grams selected for the signatures.

4.2.3 Rule Generation

YARA Rule Composer. Given a collection of n-grams, this step uses a rule template to construct a functional YARA rule, updating meta data information such as a date and input data used. In the workflow of the framework, the n-gram candidates per family are used to compose a rule. We also include a filesize cap for each family's rule that is calculated as twice the size of the biggest input sample.

Validator. The Validator performs an evaluation of all rules created against the data they were generated from. The desired result is full coverage with no false positives. However, since rules are only derived from parts of the input binaries (disassembled code), false positives may still occur. Due to the initial selection of n-grams, false negatives may also occur. The resulting evaluation report is used to trigger an Iterative Improvement phase that is applied to all rules that do not have full coverage without false positives yet.

4.2.4 Iterative Improvement

The Iterative Improvement aims at optimizing the YARA signatures through additional rounds of refinement. Every iteration can be controlled independently by using a different configuration for each cycle. One iteration cycle has five different steps: Ranking, Coverage Engine, Overlapping N-gram Detector, YARA Rule Composer, and Validator. All steps are similar to their equivalent described before, except that the Ranking step can be configured for each iteration independently and the Coverage Engine is additionally executed.
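
The N-gram Generation, Filtering Step, and Ranking System described in Sections 4.2.1 and 4.2.2 can be illustrated with a short sketch in Python. This is not the reference implementation, which aggregates in a database. It assumes an in-memory corpus, treats each (already wildcarded) instruction as an opaque string, and ranks only by the number of samples an n-gram occurs in. The names `derive_ngrams`, `family_unique_ngrams`, and `top_candidates` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Ngram = Tuple[str, ...]
# Assumed corpus layout: family -> sample -> list of linearized chunks
Corpus = Dict[str, Dict[str, List[List[str]]]]


def derive_ngrams(instructions: List[str], n: int) -> List[Ngram]:
    """All consecutive n-grams of one linearized chunk, in original order."""
    return [tuple(instructions[i:i + n])
            for i in range(len(instructions) - n + 1)]


def family_unique_ngrams(corpus: Corpus,
                         sizes: Iterable[int] = range(4, 8)
                         ) -> Dict[str, Dict[Ngram, int]]:
    """Keep n-grams seen in exactly one family, with their per-family sample count."""
    sizes = list(sizes)
    families_by_ngram = defaultdict(set)
    samples_by_ngram = defaultdict(set)
    for family, samples in corpus.items():
        for sample, chunks in samples.items():
            for chunk in chunks:
                for n in sizes:
                    for ngram in derive_ngrams(chunk, n):
                        families_by_ngram[ngram].add(family)
                        samples_by_ngram[ngram].add((family, sample))
    unique: Dict[str, Dict[Ngram, int]] = defaultdict(dict)
    for ngram, families in families_by_ngram.items():
        if len(families) == 1:  # drop n-grams shared between families
            (family,) = families
            unique[family][ngram] = len(samples_by_ngram[ngram])
    return dict(unique)


def top_candidates(counts: Dict[Ngram, int], limit: int = 10) -> List[Ngram]:
    """Rank by sample count (descending); ties are broken lexicographically."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [ngram for ngram, _ in ranked[:limit]]


# Toy corpus: each letter stands for one instruction
corpus = {
    "fam_a": {"s1": [list("ABCDE")], "s2": [list("ABCDX")]},
    "fam_b": {"s1": [list("BCDEF")]},
}
unique = family_unique_ngrams(corpus, sizes=[4])
best_for_a = top_candidates(unique["fam_a"], limit=1)
```

In this toy corpus the 4-gram BCDE occurs in both families and is removed, while ABCD occurs in both samples of `fam_a` and therefore ranks first for that family.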

[Diagram labels: yara-signator, smda-reader, java2yara, Capstone Bindings, Postgres Bindings, YARA]

Figure 5: YARA-Signator and its components: the two Java libraries smda-reader and java2yara, and its bindings to PostgreSQL, to the Capstone disassembly engine, and to YARA.
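
The Coverage Engine of Section 4.2.4 treats sample coverage as an instance of Set Cover and uses a greedy approximation. The following is a minimal sketch of that idea in Python, under assumptions: the `coverage` mapping from n-gram to covered samples and the function names are hypothetical, and ties are broken by insertion order, which the paper does not specify. The `harmonic` helper is included only to reproduce the example harmonic numbers given in the text.

```python
from typing import Dict, Hashable, List, Set


def greedy_cover(coverage: Dict[Hashable, Set[str]]) -> List[Hashable]:
    """Greedily pick n-grams until every coverable sample is covered.

    coverage maps an n-gram identifier to the set of samples it matches.
    """
    uncovered: Set[str] = set().union(*coverage.values())
    chosen: List[Hashable] = []
    while uncovered:
        # n-gram with the highest coverage gain on still-uncovered samples
        best = max(coverage, key=lambda g: len(coverage[g] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


def harmonic(n: int) -> float:
    """n-th harmonic number H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


# Toy example: four candidate n-grams covering four samples
coverage = {
    "g1": {"s1", "s2", "s3"},
    "g2": {"s3", "s4"},
    "g3": {"s4"},
    "g4": {"s1"},
}
selection = greedy_cover(coverage)
h100 = round(harmonic(100), 2)
h1000 = round(harmonic(1000), 2)
```

The loop always terminates because `uncovered` only contains samples that at least one n-gram covers, so each round covers at least one additional sample.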

Information about false positives is used to blacklist n-grams from their use in rules. The Coverage Engine is then applied to all rules that did not have optimal coverage yet. Given the information about which n-grams cover which samples, the problem of achieving a minimal coverage of all samples is an instance of the Set Cover problem [25] and in theory NP-hard. We use a greedy approximation [25] that runs in polynomial time and exceeds the optimal solution by a factor of no more than the n-th harmonic number H(n). This suffices for our use case as we look at a few hundred n-grams as input at most (example harmonic numbers: H(100) = 5.19, H(1000) = 7.49). The algorithm iteratively selects an n-gram that achieves the highest coverage gain, i.e. covering additional uncovered samples, until all samples are covered. The Overlapping n-gram Detector again ensures that the coverage is additionally spread over the code. Validation rounds are used to update the blacklist, potentially over several iterations, until a satisfying result in rule output is achieved.

4.3 Implementation

We now discuss the implementation of our approach. We created a framework around our core tool YARA-Signator to provide a full toolchain enabling automated generation of YARA signatures. Figure 5 illustrates the core and relationship of the different modules.

We implemented the library smda-reader as a means for ingesting disassembly reports generated using SMDA [21] as described in Section 4.2.1. Technically, smda-reader parses the reports provided in JSON into Java objects. As of now, we only support SMDA as a disassembler but since the data ingestion is handled through an interface and normalized objects, we are not limited to a single technology with our approach. An adaptation of other third-party software like IDA Pro or objdump as an input provider would be trivial.

Because processing the disassembly for rule generation requires space and we want a performant procedure, we decided to use a database as backend. Given several databases to choose from, we decided to use the relational database PostgreSQL [26]. This database management system natively supports various techniques that can be used to efficiently implement our approach. This includes aggregations for filtering data and a range of performance tuning techniques such as indexing and partitioning. We simply implemented a wrapper to communicate with the database driver to access and persist data.

Since we create YARA signatures, we needed a library to build YARA rules from Java programs. We implemented java2yara as a library with which signatures can be created from a collection of strings and given metadata. As we want to enrich the rules with additional information about the instructions used in the signature strings, we also need a disassembler. Again, having different options to choose from, we went with Capstone [22] because it is open-source and also the basis for SMDA.

Finally, for the validation of the YARA rules we use YARA itself as an external program. The results are parsed from its output and processed by our framework. The scan reports generated this way are used to evaluate the rules against the input data set and are an important element for steering the iterative improvement process.

5 Statistical Analysis

Before we conduct a performance evaluation of the rules produced by YARA-Signator, we first want to get a better understanding of the general viability of signature generation approaches based on code n-grams. For this reason, we perform a statistical analysis of the primary data set used in this study: "Malpedia" [4].

After a short introduction of the data set, we will examine different distributions, e.g. amounts of code found in malware families as well as the individual instructions in the corresponding disassembly reports. We then continue by further analyzing n-gram distributions and uniqueness, which we obtain as an intermediate result in the procedure of applying YARA-Signator on the data set.
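To make the notion of family-unique n-grams concrete before the statistics that follow, here is a minimal Python sketch that counts, for each instruction n-gram, the families it occurs in and derives the unique share per family. It is not the YARA-Signator implementation (which is written in Java on top of PostgreSQL and works on wildcarded instruction bytes); the function names, family names, and toy instruction strings are assumptions made purely for illustration.

```python
from collections import defaultdict


def ngrams(instructions, n):
    """Yield all contiguous n-grams of an instruction sequence as tuples."""
    for i in range(len(instructions) - n + 1):
        yield tuple(instructions[i:i + n])


def family_sets(families, n):
    """Map each n-gram to the set of families in which it occurs."""
    seen = defaultdict(set)
    for family, functions in families.items():
        for function in functions:
            for gram in ngrams(function, n):
                seen[gram].add(family)
    return seen


def unique_share(families, n):
    """Per family: fraction of its distinct n-grams found in no other family."""
    seen = family_sets(families, n)
    share = {}
    for family in families:
        own = [gram for gram, fams in seen.items() if family in fams]
        unique = [gram for gram in own if len(seen[gram]) == 1]
        share[family] = len(unique) / len(own) if own else 0.0
    return share


# Hypothetical toy input: family -> list of functions -> list of instructions.
toy = {
    "win.alpha": [["push ebp", "mov ebp, esp", "sub esp, 8",
                   "xor eax, eax", "pop ebp", "ret"]],
    "win.beta": [["push ebp", "mov ebp, esp", "sub esp, 8",
                  "xor eax, eax", "inc eax", "ret"]],
}
shares = unique_share(toy, n=4)
```

In this toy input both families share the function prologue, so one of the three 4-grams per family is common to both and the remaining two are family-unique.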


5.1 Data Set

Given the design of the approach as described in Section 4, we note that one requirement is that the files of the input data have to be grouped already, e.g. by malware families. A data set that is well suited to test our approach on is the Malpedia [4] corpus, a community-curated malware corpus of (unpacked) reference samples including public analysis references for as many families as possible. In this study, we use Git commit d9bc781 from February 25th, 2020 as a baseline snapshot. At this time, Malpedia consists of 4,469 samples for 1,573 malware families, which accumulate to a total of 8,939 files.

Not all of the files found in Malpedia can be processed by YARA-Signator. Because the framework currently operates on x86/x64 exclusively and we intend to only process unpacked or dumped files, we need to filter the data set before we disassemble the files. This reduces our input data to 3,313 processable samples from 1,150 families.

Among these are still families that consist of non-native code because they are written in other programming languages, e.g. those created using the .NET framework. Filtering out all files that do not fulfil the native-code requirements, we now use the SMDA disassembler [21] to process the input files. Ultimately, this leaves us with disassembly reports for 3,039 samples from 1,022 families. These amount to a total of 4,150 input files because sometimes more than one unpacked or dumped representation is associated with one sample, e.g. because of a 32bit and 64bit payload, or additional modules.

Per Family    Files    Functions    Instructions
Minimum           1            1               2
25%               1          138           7,087
50%               2          412          20,923
75%               3        1,135          52,133
Maximum         121       18,213         931,948
Total         4,150    3,733,355     195,422,329

Table 1: Statistics for the processed input data. Functions and Instructions have been averaged per family before aggregation.

5.2 Disassembly Overview

In this section, we provide a characterization of the disassembled data in general.

We first address bitness. Out of the 4,150 files, 91.78% are 32bit and 8.22% are 64bit. With respect to families, we find that 97 feature 64bit code only, and 74 have code both in 32bit and 64bit, while the 826 remaining families are 32bit only.

In total, SMDA identifies 3,733,355 functions with 195,422,329 instructions in the used portion of the data set. Table 1 provides further insight into how the functions and instructions are distributed across the samples in the families. Interestingly, these numbers are very much in line with our earlier reports about these statistics [23]. We note that the prototypical function has about 8-10 basic blocks and consists of roughly 50 instructions, which is fully sufficient to apply our proposed method on.

We now use the wildcarding method by Cohen and Havrilla [24] as explained in Section 4.2.1. This allows us to determine position-independent code (PIC) hashes for all functions. In our implementation, we use MD5 over the concatenation of all wildcarded instructions in their hexbyte representation, sorted by address. This leaves us with 947,421 unique hashes for functions, out of which 848,783 (89.59%) only appear in one family each. Our number is higher than the 81% reported by Cohen and Havrilla, which is likely explained by their more diverse data set, for which we would expect a wider presence of library code. In any case, this is a positive result as it indicates that we can expect to find significant amounts of code being unique per malware family, which will benefit the generation of rules.

      Mnem     32bit                    64bit
 1    mov      49,890,410 (28.17%)      6,144,638 (39.76%)
 2    push     26,770,256 (15.12%)        274,490  (1.78%)
 3    call     14,704,502  (8.30%)      1,347,419  (8.72%)
 4    pop       8,548,750  (4.83%)        273,385  (1.77%)
 5    cmp       8,060,341  (4.55%)        770,526  (4.99%)
 6    lea       7,570,190  (4.27%)      1,147,978  (7.43%)
 7    add       6,580,883  (3.72%)        557,804  (3.61%)
 8    je        6,371,325  (3.60%)        581,553  (3.76%)
 9    dec       6,064,865  (3.42%)         41,810  (0.27%)
10    test      5,711,807  (3.23%)        585,390  (3.79%)
11    jmp       5,184,997  (2.93%)        541,978  (3.51%)
12    xor       4,934,907  (2.79%)        618,684  (4.00%)
13    jne       4,525,392  (2.55%)        437,352  (2.83%)
14    ret       3,481,393  (1.97%)        246,005  (1.59%)
15    inc       2,595,499  (1.47%)        109,312  (0.71%)
16    sub       2,485,040  (1.40%)        322,980  (2.09%)
17    and       1,863,284  (1.05%)        196,456  (1.27%)
18    movzx     1,577,742  (0.89%)        216,612  (1.40%)
19    or        1,352,995  (0.76%)        112,826  (0.73%)
20    shr         667,087  (0.38%)         69,654  (0.45%)
21    jb          616,242  (0.35%)         58,695  (0.38%)
22    shl         571,271  (0.32%)         59,692  (0.39%)
23    nop         557,776  (0.31%)         57,045  (0.37%)
24    jle         471,338  (0.27%)         40,606  (0.26%)
25    jl          461,572  (0.26%)         33,305  (0.22%)
      Total   171,619,864 (96.91%)     14,846,195 (96.30%)

Table 2: The 25 most prominent instruction mnemonics for 32bit and 64bit.

5.3 Instruction Occurrence Frequencies

We now have a closer look at the individual instructions contained in the disassembly reports. Looking at the 25 most popular instruction mnemonics (cf. Table 2), we can see that they already account for more than 96% of all mnemonics. We further notice significant differences for a number of mnemonics between 32bit and 64bit. First, for 64bit, mov instructions are more frequent by more than 10 percentage points. At the same time, the stack operations push and pop together reach only 3.5%, as opposed to 20% for 32bit. A primary reason for this is that 64bit function calls are often carried out using the __fastcall calling convention, passing arguments via registers instead of using the stack.

Similarly, inc and dec are found far less often for 64bit code. While we did not investigate this in depth, we believe that this is connected to instructions starting with 0x40-0x4F (inc/dec <register> under 32bit) being repurposed as REX prefix under Intel x64 [7]. We have also rendered heatmaps of the first byte instruction distribution in Figure 6, along with a reference for 32bit instructions and their semantic context.

Figure 6: First byte occurrence distribution among the 195,422,329 instructions, separated by bitness. A major difference is the increased use of 0x40-0x4F bytes in 64bit (REX prefix), and reduced use of stack operations.

5.4 N-gram Occurrence Frequencies

After the examination of distribution properties for individual instructions, we now look at sequences of instructions, i.e. n-grams, as used by YARA-Signator. We are interested in two statistics in particular: uniqueness of n-grams across families overall and with respect to the families they originate from. Both of these values provide insight into the general viability of our outlined approach.

                 N-gram size
Occurrences      4         5         6         7
1                84.94%    86.51%    87.47%    88.10%
2                 7.70%     7.09%     6.68%     6.34%
3                 3.06%     2.73%     2.53%     2.39%
4                 1.22%     1.05%     0.96%     0.46%

Table 3: Occurrence counts of n-grams in different malware families.

First, we look at the occurrence frequency of n-grams across families. The results are listed in Table 3. We count a total of 187,800,586 unique n-grams for all lengths combined. With regard to their relative uniqueness, we see that even for instruction n-grams of length 4, already 84.94% of these n-grams appear only in a single family. For two and three families, we note a steep decline with 7.70% and 3.06% respectively, summing up to 95% of all n-grams. As expected, for longer n-grams, these numbers lean even more towards occurrence in a single family only. For n-grams of length 7, the 88.10% of family-unique n-grams are also very close to the observed 89.59% for family-unique PIC function hashes as discussed in Section 5.2. Overall, these statistics are good news as they imply that a vast majority of n-grams can be used for signatures without causing false positives on the data set. However, we do not know yet how these unique n-grams are distributed across families.

Per Family    4          5          6          7
Minimum         0.00%      0.00%      0.00%      0.00%
25%            20.68%     23.60%     25.53%     26.77%
50%            45.21%     51.11%     53.61%     55.88%
75%            68.68%     78.78%     83.97%     87.24%
Maximum       100.00%    100.00%    100.00%    100.00%
Average        45.86%     51.15%     54.23%     56.14%

Table 4: Relative amount of unique n-grams per family.

Therefore, we now inspect the percentage of family-unique n-grams for all families. The results are shown in Table 4. For a total of 5 families (with one sample each), YARA-Signator did not identify unique n-grams. An inspection of these shows that 4 samples were misclassified in Malpedia because of different aliases referring to them in the referenced reporting, while one family was a .NET sample that was not filtered before. For each of the remaining 992 families, a number of n-grams sufficient for rule generation is found. Not only this, for the median family, between 45.21% and 55.88% of n-grams are unique to that family depending on n-gram length. Similar to what was observed before, longer n-grams lead to higher relative uniqueness. Overall, we find that basically every family contains some portions of unique code that can be automatically identified and used to target it in a signature. The average percentage of unique n-grams across all n-gram lengths and families is 51.85%.

Deeper investigation of the results allows us to make more interesting observations. For example, families that stick out with a high n-gram count but low unique n-gram count are win.combojack [27] (520,891 n-grams total but 2.40% unique) and win.shurl0ckr [28] (1,441,625 n-grams but 3.25% unique). Both are compiled with frameworks that make use of excessive static linking, in these cases Delphi and Go respectively.

In a few cases, we observe a similar phenomenon for families compiled with the much more popular MS Visual Studio. Here, we find families that have a lower number of total n-grams but still a low number of unique n-grams. An example would be win.carrotbat [29], a simple downloader used by a threat actor in campaigns targeting Southeast Asia. For this family we count 40,295 n-grams among which 6.5% are considered unique.

On the other end of the spectrum, we can find families such as win.locky [30] and win.nymaim [31]. These families employ custom obfuscation schemes that lead to a high n-gram count out of which the vast majority is also unique. For example, in win.nymaim we find 2,335,906 n-grams out of which 99% are unique across all families.

6 Evaluation

In this section we present the results of our evaluation of the generated YARA signatures. We first explain the data sets used and then describe the rule generation process and structure of produced rules. Afterwards, we discuss the experiments to measure classification performance.

Phase                      Time in hours
Disassembly                            2
Data Ingestion                         8
Filtering                              2
Initial Rule Generation                2
Iterative Optimization                 1
Total                                 15

Table 5: Duration of a full run of YARA-Signator.

6.1 Data Set

In the evaluation, we use the same data set as described in Section 5.1: a snapshot of Malpedia (commit: d9bc781) [4].

However, since previous work [4] showed that almost 80% of the analyzed 443 malware families for Windows in that study had been created using Visual Studio, we take this into account for blacklisting. For this purpose, we use the data provided by the Empty MSVC Project [32]. As the name implies, this project contains empty projects compiled with all available versions of MS Visual C in all major compiler settings (dynamic and static linking, debug and release builds). This way, the resulting programs contain only code that we would expect to be inserted by Visual Studio, which is likely shared and should not become part of YARA signatures. We also add a number of goodware files that we identified as being prone to false positives in previous experiments, among them MFC libraries and network libraries from Internet Explorer and Firefox.

To further test the robustness of the rules produced by YARA-Signator with regard to FPs on benign software, AVAST kindly ran our rules against their corpus containing 10 TB of goodware.

           SeqCount    N-gramLen    SeqLen       SizeCap
Minimum           5            4         4        24,576
25%              10            5        14       188,416
50%              10            6        18       402,432
75%              10            7        23     1,040,384
Maximum         220            7        77    35,323,904

Table 6: Statistics that describe the characteristics of the output rules. SeqLen and SizeCap in bytes.

6.2 Rule Generation

For rule generation, we use a system with the following specifications: Intel i7 with 32GB RAM, an SSD as system drive and an HDD to host the data partition. With respect to the different processing phases described in Section 4.2, we note the processing times as shown in Table 5, taking about 15 hours for the full procedure.

The outcome of this procedure is a rule set for 992 of 997 families that we used as input. For 5 families, no unique n-grams could be identified and thus no rules were generated. We now further characterize these rules; a summary is given in Table 6.

First, we inspect the number of sequences used per rule. Only the rule for win.poisonivy consists of 5 sequences (all others have 8 or more) and the largest is win.isfb with 220 sequences that originate from 121 input files. In fact 763 (78.23%) have exactly 10 sequences, which is a result of the configuration chosen in Section 4.2.2 and the fact that 482 families from the input data are only represented by a single file. The total number of sequences is 12,542, out of which 6,430 (51.27%) contain one or more wildcards.

With regard to the distribution of n-gram lengths within these sequences, we can see that they are skewed towards longer sequences and there are in fact 5,028 n-grams with 7 instructions, which are 40.01% of all sequences. This is primarily caused by the Overlapping Detection, which discards shorter n-grams in favor of longer n-grams containing them. All n-grams combined contain 72,954 instructions, out of which 10,004 (13.71%) are wildcarded.

For sequence lengths (SeqLen), we notice that half the sequences are between 14 and 23 bytes. Raff et al. [33] recently showed in a study on code reuse identification with large n-grams of up to 1024 bytes length that shorter n-grams of n < 32 generally provided better accuracy.

Finally, we look at the values used to cap file sizes as explained in Section 4.2.3. As this value is twice as large as the largest input file per family, we note a range from 24 KB up to 35 MB. The smallest files are very simple downloaders that only consist of a few functions while the largest is win.rms, a remote administration toolkit that was observed being used in targeted intrusions by threat actor TA505.
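The rule characteristics summarized in Table 6 can be pictured with a small generator. The following Python sketch renders a rule from a list of hex-byte sequences (with wildcards written as ??) and applies the file size cap of twice the largest input file described above. It is only an illustration and not java2yara: the function name, the rule and string naming scheme, the example byte sequences, and in particular the hit condition (`min_hits of them`) are assumptions, since the concrete values are configured in Section 4.2 and not restated here.

```python
def build_rule(family, sequences, largest_input_size, min_hits):
    """Render YARA rule text from hex-byte sequences (hypothetical layout)."""
    # YARA identifiers may not contain dots, so win.example becomes win_example.
    name = family.replace(".", "_") + "_auto"
    # Size cap as described in the text: twice the largest input file.
    size_cap = 2 * largest_input_size
    lines = ["rule " + name + " {", "    strings:"]
    for index, sequence in enumerate(sequences):
        lines.append("        $sequence_%d = { %s }" % (index, sequence))
    lines.append("    condition:")
    lines.append("        %d of them and filesize < %d" % (min_hits, size_cap))
    lines.append("}")
    return "\n".join(lines)


# Made-up example sequences; ?? marks wildcarded bytes such as addresses.
example_sequences = [
    "8b 45 ?? 50 e8 ?? ?? ?? ??",
    "55 8b ec 83 ec ?? 53 56",
]
rule_text = build_rule("win.example", example_sequences,
                       largest_input_size=12288, min_hits=2)
print(rule_text)
```

With a largest input file of 12,288 bytes, the sketch yields a cap of 24,576 bytes, which corresponds to the minimum SizeCap listed in Table 6.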


            True     False
Positive    4,035       22
Negative    3,459      115

Table 7: Classification results of running the 992 YARA rules against the input data set. In addition to the 4,035 True Positives, another 1330 hits on files of the respective family were registered.

6.3 Classification Performance

After inspection of the generated rules, we now want to evaluate their performance with regard to detection capabilities.

We first apply the rules against the Malpedia data set with which they were generated. The results are shown in Table 7. Overall, all except 115 files were positively classified, which results in a Recall of 0.972. Interestingly, 1,329 additional files were correctly classified by the respective rule corresponding to their family, which indicates the generalization potential of the used n-gram and wildcarding method. With just 22 false positives, the rules have a very high Precision of 0.995. The overall F1 score is 0.983.

Looking closer at the rules, we find that 977 rules did not produce false positives and 923 rules did not have any false negatives. Combined, 916 rules are considered clean in that they did not cause any misclassifications.

We next investigate these misclassifications in detail. First off, false positives typically have to be considered as a direct result of disassembly errors. If all disassembly was exact, the sequences causing FPs would have been sorted out during the aggregation and filtering stage. The following scenarios can occur. First, if code is disassembled correctly in one family but missed in another, this may result in n-grams that lead to false positives. Otherwise, if disassembly is produced "wrongly" for a family, this may lead to wrong instruction borders and thus n-grams that will still detect the same byte sequences in other families.

With this in mind, we now first focus on the false positives that occurred. For at least 9 out of 22 hits, we assess that they are caused by an actual contextual relationship between the families. For example, the YARA rule for win.isfb detects win.dreambot. Both families are based on the leaked gozi source code [34], with Dreambot e.g. being able to use Tor. Other overlap that is similarly explainable is found e.g. for win.dropshot, win.shapeshift, and win.stonedrill. The rule for win.reactorbot also causes hits in win.rovnix, because this protector/rootkit has been used in conjunction with win.reactorbot and the samples in Malpedia for win.rovnix contain the win.reactorbot payload. One hit is also caused by a binary duplicate stored under two family names in Malpedia (that has since been resolved). For the remaining 13 hits, we could not find a better explanation than disassembly errors and potential library code overlap.

With regard to false negatives, we note that they are also the result of different effects. In the majority of cases, we note that disassembly errors may lead to situations where parts of a sample are missed that could otherwise be used as characteristic sequences for a given family. This naturally causes a situation where not enough sequences for a sample are extracted and incorporated into the rule, causing it to miss the sample. We found that this particularly affects samples that already have a very small number of functions. In a number of cases, we also found that a sample sorted into the wrong family resulted in elimination of many otherwise possible sequences from rules in the filtering stage, leading to an insufficient number of sequences to trigger on the sample. This had the positive side effect that we could optimize the corpus and correct these wrongly classified samples in the data set. In a few cases, we also noticed that this happened in legitimate cases, especially when a family itself is used as a "module" in another family.

6.4 False Positive Analysis

We now conduct an analysis of false positives on a second data set. For this analysis, AVAST kindly ran our rules against their clean data set and provided the detection results back to us. The data set comprises about 10 TB of data and any hits can be safely assumed to be undesired as it only consists of known benign software.

We register a total of 13,879 hits caused by 70 of the 992 rules. While this seems initially like a large number, the distribution is highly skewed. The rule with the most hits alone is responsible for 8,206 hits (59.13%) and targets the ransomware win.scarabey [35]. We analyzed the rule composition and malware, noting that the malware makes extensive use of Application Framework Extensions (AFX), a predecessor of Microsoft Foundation Classes (MFC), typically used to create GUIs. Smaller portions of AFX code fragments are only found in 6 other malware families. Because AFX was not added to the blacklist beforehand, this leaves enough n-grams to be considered "unique" across the malware in Malpedia. However, since lots of benign software also makes use of AFX, this immediately explains the amount of FPs caused by this rule.

Looking at the next rules in the top five, we find 1,258 (9.06%), 957 (6.90%), 914 (6.59%), and 370 (2.67%) hits. Only 6 other rules have more than 100 hits and all of them together are responsible for 92.96% of false positives. For the remainder, there are 31 rules with between 10 and 100 hits, while 28 rules produce fewer than 10 hits.

It is also notable that 4 out of 24 rules for macOS malware produce false positives. This is explained by the fact that with so few families in Malpedia, the expected code elimination effect is generally minimal compared to Windows PE files, and that the blacklist data did not specifically target macOS.
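The classification figures from Section 6.3 and the per-rule hit shares from Section 6.4 can be re-derived from the reported counts. The short Python sketch below does this using the standard definitions of Precision, Recall, and F1; the function and variable names are our own choice for illustration, and the input numbers are taken from Table 7 and the text above.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from Table 7: 4,035 true positives, 22 false positives,
# 115 false negatives.
precision, recall, f1 = precision_recall_f1(tp=4035, fp=22, fn=115)
print(round(precision, 3), round(recall, 3), round(f1, 3))
# prints: 0.995 0.972 0.983

# Share of the 13,879 goodware hits caused by the five rules with most hits.
total_hits = 13879
top_hits = [8206, 1258, 957, 914, 370]
shares = [round(100 * hits / total_hits, 2) for hits in top_hits]
print(shares)
# prints: [59.13, 9.06, 6.9, 6.59, 2.67]
```

The recall denominator, 4,035 + 115 = 4,150, matches the number of input files reported in Section 5.1.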


7 Limitations and Future Work

We now discuss limitations and ideas for future improvements of YARA-Signator.

Right now, YARA-Signator is only capable of ingesting disassembly reports created using SMDA. The primary limitation of this is that only input binaries consisting of x86 and x64 Intel machine code can be processed. Ways to improve this situation would be to provide additional interfaces or a more generic interface for data ingestion. For example, YARA-Signator could provide an interface to parse the output of other disassemblers with wider architecture coverage such as IDA Pro, Ghidra, or radare2. Otherwise, it could also provide an interface to ingest pre-processed data sequences as a full replacement of the current first phase. This would open it to e.g. working on raw byte sequences of arbitrary length.

While the rules already lead to few false positives, further improvement could be achieved by providing a more comprehensive blacklist of code found in benign programs. For further reduction of false negatives, the parameters of the system (n-gram size, n-grams per family and hit condition, ...) could be evaluated to find an optimal configuration.

We also believe that there is room for further research and improvement in the sequence selection process.

8 Conclusion

In this paper we presented YARA-Signator, a framework for the automated generation of code-based YARA rules.

First, we outlined the general idea of isolating characteristic code sequences unique to a family through filtering and data aggregation. We explained the processing stages including n-gram derivation and code wildcarding up to rule creation and iterative improvement, as well as the implementation of the framework in detail.

Next, we performed a statistical analysis of the comprehensive malware corpus Malpedia, showing that the general idea of finding unique code sequences for malware families is very viable. With on average 51.85% of n-grams being unique to a family, this leaves significant amounts of code to pinpoint and base signatures on, e.g. using YARA.

We then used YARA-Signator to produce YARA rules for 992 processable malware families in Malpedia and evaluated the detection performance of these rules against the input data set and a collection of benign software. The results are very positive, with an F1 score of 0.983 against the input data set and just 13,879 false positives from 70 rules (out of which only 11 cause more than 100 FPs) on a goodware corpus spanning more than 10 TB of data.

We provide the code for YARA-Signator as open source via the GitHub repository: https://github.com/fxb-cocacoding/yara-signator, and the frequently updated YARA rules created from the Malpedia corpus can be freely accessed using the REST API: https://malpedia.caad.fkie.fraunhofer.de/api/get/yara/auto/zip.

Acknowledgments:
The authors would like to thank the anonymous reviewers of Botconf for their valuable feedback. We would also like to thank Jakub Křoustek and his team at AVAST for their kind support of running our rules against their goodware data set.

Author details

Felix Bilstein
Fraunhofer FKIE
Zanderstr. 5, 53177 Bonn
felix.bilstein@fkie.fraunhofer.de

Daniel Plohmann
Fraunhofer FKIE
Zanderstr. 5, 53177 Bonn
daniel.plohmann@fkie.fraunhofer.de

References

[1] D. Bianco, “The pyramid of pain.” Blog post: http://detect-respond.blogspot.com/2013/03/the-pyramid-of-pain.html.

[2] C. Blichmann, “Automatisierte Signaturgenerierung für Malware-Stämme,” 2008. Diploma Thesis.

[3] J. Zaddach and M. Graziano, “Bass - bass automated signature synthesizer,” 2017. Github repository: https://github.com/Cisco-Talos/BASS.

[4] D. Plohmann, M. Clauß, S. Enders, and E. Padilla, “Malpedia: a collaborative effort to inventorize the malware landscape,” Proceedings of Botconf, 2017.

[5] L. Gibelli, T. Edvin, T. Kojmnet, A. Wu, and N. Horne, “Clamav - open source anti virus engine,” 2004. Website: https://www.clamav.net/.

[6] V. M. Alvarez, “Yara - the pattern matching swiss knife for malware researchers,” 2014. Website: http://virustotal.github.io/yara/.

[7] I. Intel, “Intel-64 and ia-32 architectures software developer’s manual,” 2013.

[8] R. Edward, Z. Richard, R. Cox, J. Sylvester, P. Yacci, R. Ward, A. Tracy, M. McLean, and C. Nicholas, “An investigation of byte n-gram features for malware classification,” Journal of Computer Virology and Hacking Techniques, 2016.


[9] C. Blichmann, “vxsig - automatically generate av byte signatures from sets of similar binaries.,” 2019. Github repository: https://github.com/google/vxsig.

[10] T. Dullien, E. Ventura, S. Meyer-Eppler, T. Kornau, C. Blichmann, and J. Newger, “Zynamics,” 2004. Website: https://www.zynamics.com/software.html.

[11] S. H. Ding, B. C. Fung, and P. Charland, “Kam1n0: Mapreduce-based assembly clone search for reverse engineering,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, (New York, NY, USA), p. 461–470, Association for Computing Machinery, 2016.

[12] F. Roth, “yarGen,” 2013-12-18. Github repository: https://github.com/Neo23x0/yarGen.

[13] C. Doman, “Yabin,” 2018. Github repository: https://github.com/AlienVault-OTX/yabin.

[14] H. Yi, “Hyara (ida plugin),” 2018. Github repository: https://github.com/hy00un/Hyara.

[15] KoreLogic Security, “Converting ida pat to yara signatures,” 2013. Blog post: https://blog.korelogic.com/blog/2013/11/15/pat2yara.

[16] W. Ballenthin, “Yara-fn,” 2019. Github repository: https://github.com/williballenthin/idawilli/tree/master/scripts/yara_fn.

[17] J. Martin, j0sm1, jovimon, and mmorenog, “Yara rules,” 2018. Github repository: https://github.com/Neo23x0/signature-base/tree/master/yara.

[18] F. Roth, “Yara rules from signature base,” 2018. Github repository: https://github.com/Neo23x0/signature-base/tree/master/yara.

[19] M. Worth, “Open-source-yara-rules,” 2018. Github repository: https://github.com/mikesxrs/Open-Source-YARA-rules.

[20] R. Wesson and SupportIntelligence, “Project icewater,” 2018. Github repository: https://github.com/SupportIntelligence/Icewater.

[21] D. Plohmann, “SMDA - a minimalist recursive disassembler library for x86/64.,” 2018. Github repository: https://github.com/danielplohmann/smda.

[22] N. A. Quynh, “Capstone: Next-gen disassembly framework,” 2014. Website: http://www.capstone-engine.org/BHUSA2014-capstone.pdf.

[23] F. Bilstein, “Automatic generation of code-based yara-signatures,” 2018. Bachelor Thesis.

[24] C. Cohen and J. Havrilla, “Function Hashing for Malicious Code Analysis,” tech. rep., SEI, CMU, 2009.

[25] V. Chvatal, “A greedy heuristic for the set-covering problem,” Math. Oper. Res., vol. 4, p. 233–235, Aug. 1979.

[26] M. Stonebraker, “Postgresql,” 1989. Website: https://www.postgresql.org/.

[27] B. Levene and J. Grunzweig, “Sure, I’ll take that! New ComboJack Malware Alters Clipboards to Steal Cryptocurrency.” Blogpost: https://researchcenter.paloaltonetworks.com/2018/03/unit42-sure-ill-take-new-combojack-malware-alters-clipboards-steal-cryptocurrency/.

[28] T. Micro, “ShurL0ckr Ransomware as a Service Peddled on Dark Web, can Reportedly Bypass Cloud Applications.” Blogpost: https://www.trendmicro.com/vinfo/us/security/news/cybercrime-and-digital-threats/shurl0ckr-ransomware-as-a-service-peddled-on-dark-web-can-reportedly-bypass-cloud-applications.

[29] J. Grunzweig and K. Wilhoit, “The Fractured Block Campaign: CARROTBAT Used to Deliver Malware Targeting Southeast Asia.” Blogpost: https://unit42.paloaltonetworks.com/unit42-the-fractured-block-campaign-carrotbat-malware-used-to-deliver-malware-targeting-southeast-asia/.

[30] M. Talbi, “De-obfuscating Jump Chains with Binary Ninja.” Blogpost: https://thisissecurity.stormshield.com/2018/03/20/de-obfuscating-jump-chains-with-binary-ninja/.

[31] D. Plohmann, “Patchwork: Stitching against malware families with IDA Pro.” Presentation for SPRING2014: https://public.gdatasoftware.com/Web/Landingpages/DE/GI-Spring2014/slides/004_plohmann.pdf.

[32] D. Plohmann, “Empty msvc,” 2019. Github repository: https://github.com/danielplohmann/empty_msvc.

[33] E. Raff, W. Fleming, R. Zak, H. Anderson, B. Finlayson, C. Nicholas, and M. McLean, “Kilograms: Very large n-grams for malware classification,” 2019.

[34] gbrindisi, “Gozi ISFB Sourceccode.” Github repository: https://github.com/gbrindisi/malware/tree/master/windows/gozi-isfb.

[35] A. Ivanov, “Scarabey Ransomware.” Blogpost: https://id-ransomware.blogspot.com/2017/12/scarabey-ransomware.html.
