Automated YARA Rule Generation
YARA-Signator: Automated Generation of Code-based YARA Rules
Felix Bilstein¹, Daniel Plohmann¹
¹ Fraunhofer FKIE
This paper was presented at Botconf 2019, Bordeaux, 4-6 December 2019, www.botconf.eu
It is published in the Journal on Cybercrime & Digital Investigations by CECyF, https://journal.cecyf.fr/ojs
It is shared under the CC BY license http://creativecommons.org/licenses/by/4.0/.
Felix Bilstein, Daniel Plohmann. YARA-Signator: Automated Generation of Code-based YARA Rules 1
THE JOURNAL ON CYBERCRIME & DIGITAL INVESTIGATIONS, VOL. 5, NO. 1, DEC. 2019 BOTCONF 2019 PROCEEDINGS
lel. This allows us to execute aggregation operations that have the following benefits. First, we can eliminate code sequences that are found in multiple families, which are most likely instances of shared code that should not become part of signatures, e.g. libraries. Second, by counting and ranking the number of appearances of n-grams in samples of the same family, we achieve an approximation of the LCS identification.

We propose a prototype implementation of YARA-Signator depending only on open source components and apply it to Malpedia [4], a community-curated corpus of cleanly labeled, unpacked malware samples covering more than 1,500 malware families. On this data set, the rules produced by the system achieve an overall F1 score of 0.983 with a high precision of 0.995. We additionally test the rules against a corpus containing 10 TB of benign software, on which 70 out of 992 rules produce a total of 13,879 false positives. While seemingly large, these numbers are driven by very few outliers, as 10 of these rules account for more than 92% of the FPs, showing that the rules are generally very accurate.

In summary, our paper makes the following contributions:

• We present YARA-Signator, a method for the automated generation of code-based YARA signatures.

• Using the disassembly for 992 malware families from the data set Malpedia [4], we show that on average more than 51.85% of instruction n-grams of size 4 and larger are intrinsic for the respective families, i.e. only found in these, and thus serve as good candidates for rule creation.

• We provide an open source implementation of YARA-Signator and make a periodically updated reference rule set for all processable families found in Malpedia publicly available.

The remainder of the paper is structured as follows. We first provide background information to ease the understanding of the proposed methodology and discuss related work to give a thematic overview of the topic. We then introduce our approach YARA-Signator and explain the workflow and components of the system. Afterwards, we examine the general viability of the method by providing a detailed statistical evaluation of the data set. This is followed by an evaluation of the classification performance of the rules generated using this data set and a false positive analysis against a large goodware corpus. We conclude with a discussion of limitations and future work.

2 Background

In this section, we discuss a number of aspects relevant for the understanding of the method proposed in this paper. We first discuss pattern matching and YARA in particular, before giving a short overview of the Intel instruction syntax and the concept of n-grams.

2.1 Pattern Matching and YARA

Pattern matching is a popular methodology that is also widely adopted in the context of threat detection, e.g. when monitoring network traffic or scanning files for malicious content. It typically uses a signature that consists of one or more known patterns associated with a threat to be evaluated against data of interest. In this regard, it is also used to detect or identify malicious software.

Apart from ClamAV [5], YARA [6] has become the de facto standard for pattern matching in malware analysis. Its syntax is simple yet powerful, which makes it very popular among practitioners. As a result, there are many resources available where detection and identification signatures using the YARA format are shared.

Figure 1 shows an excerpt of a YARA signature. This particular signature is also an example of the automatically generated rules produced by the approach proposed in this paper: YARA-Signator.

All YARA signatures contain at least one mandatory part: a condition that describes what is necessary to trigger a detection when using this rule. This is a logical expression that can optionally address file meta data or content (e.g. filesize as shown in Figure 1) but will typically reference sequences defined in the strings environment. These strings can be defined as a printable character sequence (i.e. a text string), as a hex string, or as a regular expression. Optional keywords can modify the condition under which they evaluate as a match, e.g. ascii or wide for text strings, controlling the encoding for which the strings are defined. A third environment is also possible for YARA signatures: a collection of meta fields. These allow annotating a signature with additional information, such as author names, creation date, or sharing restrictions.

2.2 Intel Instruction Syntax

Intel x86/x64 machine code instructions [7] are variable in length between 1 and 15 bytes and structurally encoded as a sequence of 6 fields:

Legacy Instruction Prefix: An instruction may be prefixed with zero to four instruction behavior modifiers, indicating exclusive use of shared memory (LOCK), conditional instruction repetition (REP), as well as segment, operand size, and address size override switches.

(Prefixed) Opcode: The core of the instruction is a 1- to 3-byte field that defines the actual opcode (cf. Figure 6, which gives a visual overview for all 1-byte opcodes in x86). Under certain circumstances, the opcode can optionally be prefixed, e.g. with a REX prefix when operating under 64-bit and wanting to access extended registers such as R8 to R15.
rule win_citadel_auto {
    meta:
        author = "Felix Bilstein - yara-signator at cocacoding dot com"
        malpedia_reference = "https://malpedia.caad.fkie.fraunhofer.de/details/win.citadel"
        malpedia_license = "CC BY-NC-SA 4.0"
        malpedia_sharing = "TLP:WHITE"
    strings:
        $sequence_0 = { 3bfe 7449 ff7508 e8???????? }
            // n = 4, score = 3500
            //   3bfe                 | cmp     edi, esi
            //   7449                 | je      0x4b
            //   ff7508               | push    dword ptr [ebp + 8]
            //   e8????????           |
        [...]
    condition:
        7 of them and filesize < 1236992
}
Figure 1: Example for a YARA signature targeting the malware family Citadel, automatically generated using YARA-Signator.
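To make the role of the ?? wildcards in Figure 1 more concrete, the following Python sketch emulates how such a hex string could be matched against raw bytes. It is a simplified illustration and not YARA's actual matching engine: the helper names parse_hex_pattern and find_matches are hypothetical, the two byte buffers are made up, and only full-byte tokens and ?? wildcards are handled (no jumps, alternatives, or nibble wildcards).

```python
def parse_hex_pattern(pattern):
    """Turn a YARA-style hex string body into a list of byte values.

    Each wildcard byte ('??') becomes None. Whitespace between tokens is
    ignored, so '3bfe 7449' and '3b fe 74 49' are equivalent.
    """
    compact = "".join(pattern.split())
    if len(compact) % 2 != 0:
        raise ValueError("hex pattern must consist of full bytes")
    tokens = []
    for i in range(0, len(compact), 2):
        pair = compact[i:i + 2]
        tokens.append(None if pair == "??" else int(pair, 16))
    return tokens


def find_matches(pattern, data):
    """Return all offsets in data at which the pattern matches."""
    tokens = parse_hex_pattern(pattern)
    offsets = []
    for start in range(len(data) - len(tokens) + 1):
        window = data[start:start + len(tokens)]
        if all(t is None or t == b for t, b in zip(tokens, window)):
            offsets.append(start)
    return offsets


# $sequence_0 from Figure 1; the 4-byte call target is wildcarded.
SEQUENCE_0 = "3bfe 7449 ff7508 e8????????"

# Two made-up buffers that differ only in the relative call offset.
sample_a = bytes.fromhex("90" "3bfe7449ff7508e8" "11223344" "c3")
sample_b = bytes.fromhex("3bfe7449ff7508e8" "aabbccdd")

print(find_matches(SEQUENCE_0, sample_a))
print(find_matches(SEQUENCE_0, sample_b))
```

Both buffers match although their call targets differ, which is the generalization effect that wildcarding is meant to achieve.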
ModR/M: A field that is required for some opcodes. If present, it encodes an extension, which defines which concrete registers or memory addressing mode should be used.

Scale, Index, and Base (SIB): Another field only required for some opcodes. If present, it will describe how exactly addresses are calculated and how the displacement may be used in this context.

Displacement: The displacement is a field containing a value of 1, 2, 4, or 8 bytes in length that is used as an offset for the calculation defined by the SIB field (if present). In case the displacement has a length of 8 bytes, no immediate may follow.

Immediate: Some instructions may use an immediate value, which can be 1, 2, 4, or 8 bytes long, depending on what is defined by the instruction or ModR/M field. Similarly, an 8-byte long immediate is mutually exclusive with a displacement.

In the context of this paper, the Displacement and Immediate fields are of special interest. Because both fields may contain concrete addresses and offsets that are very specific to a given compiled program or a result of mapping and memory relocations, it is desirable to replace these concrete values with wildcards in certain cases to achieve signatures with better generalization. Figures 1 and 2 give an example of wildcarding in the context of YARA, in both cases removing the concrete value for an interprocedural, relative-offset call instruction.

2.3 N-gram Structure

N-grams are consecutive subsequences with a fixed length taken from a given sequence of items. The use of n-grams in the context of detection or identification is a common technique in the field of malware analysis [8].

For code-based signatures, the interpretation of items could either mean taking a set number of bytes or instructions, which themselves usually consist of multiple bytes. In our case, we use instructions, and Figure 2 provides an example for the derivation of n-grams given a stream of instructions.

There is a list of instructions given on the left hand side in Figure 2. These instructions are sequentially executed by the target architecture and therefore we have to keep this order. For a given length (four in this case), we derive four possible n-grams from the instruction list of length seven.

3 Related Work

In this section we provide an overview of related work. We focus on three categories: frameworks for autonomous rule generation, tools supporting manual creation of rules, and YARA rule archives.

With regard to full systems for rule generation, Blichmann recently published VxSig [9], a reference implementation for the seminal approach published in his diploma thesis [2]. VxSig allows the automated generation of signatures in the YARA and ClamAV format from sets of previously grouped, similar binaries. Input files are processed using BinExport [10] and BinDiff [10] to locate and isolate common code fragments. BASS [3] is a framework published by Zaddach and Graziano for the automated generation of ClamAV signatures over previously identified malware clusters. As noted by the authors, their method has strong sim-
[Figure residue; recoverable labels: Family B, Family C, N-gram Pool, 9) Validator, 10) Iterative Improvement]

1. Data Ingestion

2. Unification and Filtering

3. Rule Generation

4. Iterative Improvement

During the first step, data ingestion, all reports are parsed by the framework and linearized into n-grams. These n-grams are unified and filtered by performing data aggregations. The goal is to find n-gram candidates with a high coverage for a given malware family that are not overlapping with the code of other families. On the basis of these candidates we apply several filters to find the most suitable candidates. These candidates are written into YARA signatures which are evaluated. Then, we iteratively improve the generated YARA signatures by re-validating them in every step, potentially increasing their precision and coverage. These stages are explained in detail in the following section.

In this section we present the different stages of our approach in detail. Figure 4 illustrates the four primary stages of our framework.

Data Ingestion is the first phase of our approach. We process given disassembly reports for a set of malware samples that we want to create YARA signatures for. The malware samples have to be unpacked and pre-clustered beforehand, so that samples are grouped with their respective family. Disassembly reports are parsed and instruction n-grams are extracted.

The second phase operates on the normalized data that we created in the first phase. We filter all duplicate n-grams between families and rank candidates for each malware family. Overlapping n-grams are removed to sanitize the candidate pools for each family.

A first set of YARA signatures is created within the third phase of our approach. After composing the YARA rules, we validate them against the input corpus to evaluate the quality of the generated signatures.

The last phase is an iterative improvement phase where problematic YARA signatures are re-generated and re-evaluated to improve the previously created signatures over time.

Parser. As an initial step, disassembly has to be parsed. We use SMDA [21] for this, because it is an open-source recursive disassembler built on top of
Capstone [22]. It is convenient to use, producing JSON files as output, and has the capability to reliably reconstruct and extract code from memory dumps. Note that the approach (and the implementation) is generally independent from the choice of disassembler as it could be trivially adapted to any other data input format.

Linearization. An advanced disassembler will produce fully reconstructed CFGs that divide the identified code into functions. Because YARA operates on byte sequences, we need to flatten the Control Flow Graph into a linear sequence of instructions that resembles how code is encountered in the binary. Using address offsets and the individual instructions’ sizes, we can furthermore split the linearized stream into consecutive chunks whenever a gap is encountered.

N-gram Generation. In this step, we produce the data points YARA-Signator actually operates on. We derive all possible n-grams of a pre-defined size from the chunks resulting from Linearization. In the context of this work, we use sizes of 4 to 7 instructions per n-gram based on previous findings [23].

Wildcarding. Code may generally contain absolute virtual addresses or offsets, like memory pointers to code or data. These may even be shifted due to relocations while mapping a binary. Using sequences with these absolute values in place for signature generation could lead to false negatives. To avoid this, we perform additional abstraction and wildcard all occurrences of such pointers using absolute addresses. In fact, we also wildcard all relative references pointing outside the scope of the function we generate n-grams for, e.g. inter-procedural calls and jumps. Additionally, we wildcard immediate values that could be interpreted as addresses within the mapping of the given binary. For this, we inspect the Displacement and Immediate fields of all instructions (cf. Section 2.2). This procedure is equivalent to the wildcarding applied by Cohen and Havrilla [24] for their technique of creating position-independent code (PIC) hashes of functions. We expect this to additionally benefit rule generation as it may help make signatures more robust against code reordering that can happen due to an author's refactoring or compiler effects.

4.2.2 Unification and Filtering

Filtering Step. This step implements the actual idea, as initially described in the beginning of Section 4.1. By aggregating identical n-grams across all ingested samples and families, we can filter out all n-grams that occur in more than one family. While doing so, we additionally track in how many different samples the family-unique n-grams occur as this will help identify representative n-grams in the next stages. All remaining n-grams are considered potential candidate n-grams for rule generation.

Ranking System. We developed the ranking system to allow flexible configuration by the user. It consists of individual ranking functions that generate a score for a given n-gram. Multiple ranking functions can be chained to incorporate multiple semantics into the overall rating of an n-gram. Example metrics used for ranking in our reference implementation are the number of occurrences in different samples and the types of instruction (e.g. memory-access, or logic/arithmetic) found in the n-gram. After the ranking, the highest-ranked candidates are selected per family (in our configuration, 10).

Overlapping N-gram Detector. When n-grams are aggregated across samples, the information about the relative position of n-grams to each other is lost. As Ranking is applied to individual n-grams in a procedure not considering other n-grams, it may rank several n-grams similarly well due to characteristics they share. This may potentially be a result of them overlapping or even being contained within each other, e.g. ABCDEF and ABCD or ABCDEF and CDEFGH (with each letter representing an instruction at a certain offset). As it may be favorable to have signature contents spread over the target or at least not redundant, this stage ensures that no excessive overlapping exists between n-grams selected for the signatures.

4.2.3 Rule Generation

YARA Rule Composer. Given a collection of n-grams, this step uses a rule template to construct a functional YARA rule, updating meta data information such as the date and the input data used. In the workflow of the framework, the n-gram candidates per family are used to compose a rule. We also include a filesize cap for each family's rule that is calculated as twice the size of the biggest input sample.

Validator. The Validator performs an evaluation of all rules created against the data they were generated from. The desired result is full coverage with no false positives. However, since rules are only derived from parts of the input binaries (disassembled code), false positives may still occur. Due to the initial selection of n-grams, false negatives may also occur. The resulting evaluation report is used to trigger an Iterative Improvement phase that is applied to all rules that do not have full coverage without false positives yet.

4.2.4 Iterative Improvement

The Iterative Improvement aims at optimizing the YARA signatures through additional rounds of refinement. Every iteration can be controlled independently by using a different configuration for each cycle. One iteration cycle has five different steps: Ranking, Coverage Engine, Overlapping N-gram Detector, YARA Rule Composer, and Validator. All steps are similar to their equivalents described before, except that the Ranking step can be configured for each iteration independently and the Coverage Engine is additionally executed.
Figure 5: YARA-Signator and its components: the two Java libraries smda-reader and java2yara, and its bindings to PostgreSQL, to the Capstone disassembly engine, and to YARA.
Information about false positives is used to blacklist n-grams from their use in rules. The Coverage Engine is then applied to all rules that did not have optimal coverage yet. Given the information about which n-grams cover which samples, the problem of achieving a minimal coverage of all samples is an instance of the Set Cover problem [25] and in theory NP-hard. We use a greedy approximation [25] that runs in polynomial time and exceeds the optimal solution by a factor of no more than the nth harmonic number. This suffices for our use case as we look at a few hundred n-grams as input at most (example harmonic numbers: H(100) = 5.19, H(1000) = 7.49). The algorithm iteratively selects an n-gram that achieves the highest coverage gain, i.e. covering additional uncovered samples, until all samples are covered. The Overlapping N-gram Detector again ensures that the coverage is additionally spread over the code. Validation rounds are used to update the blacklist, and iterations are repeated until a satisfying result in rule output is achieved.

4.3 Implementation

We now discuss the implementation of our approach. We created a framework around our core tool YARA-Signator to provide a full toolchain enabling automated generation of YARA signatures. Figure 5 illustrates the core and relationship of the different modules.

We implemented the library smda-reader as a means for ingesting disassembly reports generated using SMDA [21] as described in Section 4.2.1. Technically, smda-reader parses the reports provided in JSON into Java objects. As of now, we only support SMDA as a disassembler, but since the data ingestion is handled through an interface and normalized objects, we are not limited to a single technology with our approach. An adaptation of other third-party software like IDA Pro or objdump as an input provider would be trivial.

Because processing the disassembly for rule generation requires space and we want a performant procedure, we decided to use a database as backend. Given several databases to choose from, we decided to use the relational database PostgreSQL [26]. This database management system natively supports various techniques that can be used to efficiently implement our approach. This includes aggregations for filtering data and a range of performance tuning techniques such as indexing and partitioning. We simply implemented a wrapper to communicate with the database driver to access and persist data.

Since we create YARA signatures, we needed a library to build YARA rules from Java programs. We implemented java2yara as a library with which signatures can be created from a collection of strings and given meta data. As we want to enrich the rules with additional information about the instructions used in the signature strings, we also need a disassembler. Again, having different options to choose from, we went with Capstone [22] because it is open-source and also the basis for SMDA.

Finally, for the validation of the YARA rules we use YARA itself as an external program. The results are parsed from its output and processed by our framework. The scan reports generated this way are used to evaluate the rules against the input data set and are an important element for steering the iterative improvement process.

5 Statistical Analysis

Before we conduct a performance evaluation of the rules produced by YARA-Signator, we first want to get a better understanding of the general viability of signature generation approaches based on code n-grams. For this reason, we perform a statistical analysis of the primary data set used in this study: Malpedia [4].

After a short introduction of the data set, we will examine different distributions, e.g. amounts of code found in malware families as well as the individual instructions in the corresponding disassembly reports. We then continue by further analyzing n-gram distributions and uniqueness, which we obtain as an intermediate result in the procedure of applying YARA-Signator on the data set.
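The greedy selection used by the Coverage Engine (Section 4.2.4) and the harmonic numbers quoted there can be reproduced with a short Python sketch. This is a generic textbook version of greedy set cover under assumed inputs (a hypothetical mapping from n-gram identifiers to the set of samples each one matches), not the code used in YARA-Signator.

```python
def greedy_cover(coverage, samples):
    """Pick n-grams greedily until every coverable sample is covered.

    coverage maps an n-gram id to the set of samples it matches. In each
    round the n-gram covering the most still-uncovered samples is chosen;
    ties are resolved by the sorted order of the ids.
    """
    uncovered = set(samples)
    chosen = []
    while uncovered:
        best = max(
            sorted(coverage),
            key=lambda gram: len(coverage[gram] & uncovered),
            default=None,
        )
        if best is None or not (coverage[best] & uncovered):
            break  # remaining samples are not matched by any n-gram
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen, uncovered


def harmonic(n):
    """n-th harmonic number H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


# Made-up example: which n-gram matches which sample.
coverage = {
    "ngram_1": {"s1", "s2", "s3"},
    "ngram_2": {"s3", "s4"},
    "ngram_3": {"s4", "s5"},
    "ngram_4": {"s1"},
}
chosen, missed = greedy_cover(coverage, {"s1", "s2", "s3", "s4", "s5"})
print(chosen, missed)
print(round(harmonic(100), 2), round(harmonic(1000), 2))
```

Samples that no n-gram matches are returned separately instead of causing an endless loop, which mirrors the situation where a rule cannot reach full coverage.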
5.1 Data Set

Given the design of the approach as described in Section 4, we note that one requirement is that the files of the input data have to be grouped already, e.g. by malware families. A data set that is well suited to test our approach on is the Malpedia [4] corpus, a community-curated malware corpus of (unpacked) reference samples including public analysis references for as many families as possible. In this study, we use Git commit d9bc781 from February 25th, 2020 as a baseline snapshot. At this time, Malpedia consists of 4,469 samples for 1,573 malware families, which accumulate to a total of 8,939 files.

Not all of the files found in Malpedia can be processed by YARA-Signator. Because the framework currently operates on x86/x64 exclusively and we intend to only process unpacked or dumped files, we need to filter the data set before we disassemble the files. This reduces our input data to 3,313 processable samples from 1,150 families.

Among these are still families that consist of non-native code because they are written in other programming languages, e.g. those created using the .NET framework. Filtering out all files that do not fulfil the native-code requirements, we now use the SMDA disassembler [21] to process the input files. Ultimately, this leaves us with disassembly reports for 3,039 samples from 1,022 families. These amount to a total of 4,150 input files because sometimes more than one unpacked or dumped representation is associated with one sample, e.g. because of a 32bit and 64bit payload, or additional modules.

Per Family   Files   Functions   Instructions
Minimum          1           1              2
25%              1         138          7,087
50%              2         412         20,923
75%              3       1,135         52,133
Maximum        121      18,213        931,948
Total        4,150   3,733,355    195,422,329

Table 1: Statistics for the processed input data. Functions and Instructions have been averaged per family before aggregation.

function has about 8-10 basic blocks and consists of roughly 50 instructions, which is fully sufficient to apply our proposed method on.

We now use the wildcarding method by Cohen and Havrilla [24] as explained in Section 4.2.1. This allows us to determine position-independent code (PIC) hashes for all functions. In our implementation, we use MD5 over the concatenation of all wildcarded instructions in their hexbyte representation, sorted by address. This leaves us with 947,421 unique hashes for functions, out of which 848,783 (89.59%) only appear in one family each. Our number is higher than the 81% reported by Cohen and Havrilla but likely explained by their more diverse data set for which we would expect a wider presence of library code. In any case, this is a positive result as it indicates that we can expect to find significant amounts of code being unique per malware family, which will benefit the generation of rules.

#    Mnem    32bit                   64bit
1    mov     49,890,410 (28.17%)     6,144,638 (39.76%)
2    push    26,770,256 (15.12%)       274,490  (1.78%)
3    call    14,704,502  (8.30%)     1,347,419  (8.72%)
4    pop      8,548,750  (4.83%)       273,385  (1.77%)
5    cmp      8,060,341  (4.55%)       770,526  (4.99%)
6    lea      7,570,190  (4.27%)     1,147,978  (7.43%)
7    add      6,580,883  (3.72%)       557,804  (3.61%)
8    je       6,371,325  (3.60%)       581,553  (3.76%)
9    dec      6,064,865  (3.42%)        41,810  (0.27%)
10   test     5,711,807  (3.23%)       585,390  (3.79%)
11   jmp      5,184,997  (2.93%)       541,978  (3.51%)
12   xor      4,934,907  (2.79%)       618,684  (4.00%)
13   jne      4,525,392  (2.55%)       437,352  (2.83%)
14   ret      3,481,393  (1.97%)       246,005  (1.59%)
15   inc      2,595,499  (1.47%)       109,312  (0.71%)
16   sub      2,485,040  (1.40%)       322,980  (2.09%)
17   and      1,863,284  (1.05%)       196,456  (1.27%)
18   movzx    1,577,742  (0.89%)       216,612  (1.40%)
19   or       1,352,995  (0.76%)       112,826  (0.73%)
20   shr        667,087  (0.38%)        69,654  (0.45%)
21   jb         616,242  (0.35%)        58,695  (0.38%)
22   shl        571,271  (0.32%)        59,692  (0.39%)
23   nop        557,776  (0.31%)        57,045  (0.37%)
24   jle        471,338  (0.27%)        40,606  (0.26%)
25   jl         461,572  (0.26%)        33,305  (0.22%)
     Total  171,619,864 (96.91%)    14,846,195 (96.30%)

Table 2: The 25 most prominent instruction mnemonics for 32bit and 64bit.
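The PIC hashing of functions described in Section 5.1's surrounding text (MD5 over the concatenated, wildcarded instructions sorted by address) can be sketched in Python as follows. This is an illustration under assumptions: the function name pic_hash and the input format (a list of (address, hex string) pairs whose address-dependent bytes have already been replaced by ??) are hypothetical, and the exact byte encoding used in the authors' implementation is not specified in the text.

```python
import hashlib


def pic_hash(instructions):
    """MD5 over the concatenated wildcarded instruction bytes of a function.

    instructions is an iterable of (address, hexbytes) pairs, where hexbytes
    is the instruction's hex representation with wildcarded bytes as '??'.
    Instructions are ordered by address before hashing.
    """
    ordered = sorted(instructions, key=lambda ins: ins[0])
    blob = "".join(hexbytes for _, hexbytes in ordered)
    return hashlib.md5(blob.encode("ascii")).hexdigest()


# The same function mapped at two different base addresses. Only the
# (wildcarded) call target would differ, so both hash identically.
func_at_base_1 = [
    (0x401000, "3bfe"),
    (0x401002, "7449"),
    (0x401004, "ff7508"),
    (0x401007, "e8????????"),
]
func_at_base_2 = [(addr + 0x10000, hexbytes) for addr, hexbytes in func_at_base_1]

assert pic_hash(func_at_base_1) == pic_hash(func_at_base_2)
# Input order does not matter because instructions are sorted by address.
assert pic_hash(list(reversed(func_at_base_1))) == pic_hash(func_at_base_1)
print(pic_hash(func_at_base_1))
```

Counting in how many families each such hash occurs would then give the share of family-unique functions reported above.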
Figure 6: First byte occurrence distribution among the 195,422,329 instructions, separated by bitness. A major difference is the increased use of 0x40-0x4F bytes in 64bit (REX prefix), and reduced use of stack operations.
depth, we believe that this is connected to instructions n-grams can be used for signatures without causing
starting with 0x40-4F (inc/dec <register> under false positives on the data set. However, we do not
32bit) being repurposed as REX prefix under Intel know yet how these unique n-grams are distributed
x64 [7]. We have also rendered heatmaps of the first across families.
byte instruction distribution in Figure 6, along with
a reference for 32bit instructions and their semantic Per Family 4 5 6 7
context. Minimum 0.00% 0.00% 0.00% 0.00%
25% 20.68% 23.60% 25.53% 26.77%
50% 45.21% 51.11% 53.61% 55.88%
75% 68.68% 78.78% 83.97% 87.24%
5.4 N-gram Occurrence Frequencies Maximum 100.00% 100.00% 100.00% 100.00%
Average 45.86% 51.15% 54.23% 56.14%
After the examination of distribution properties for in-
dividual instructions, we now look at sequences of in- Table 4: Relative amount of unique n-grams per family.
structions, i.e. n-grams, as used by YARA-Signator. We
are interested in two statistics particularly: Unique-
ness of n-grams across families overall and with re- Therefore, we now inspect the percentage of
spect to the families they originate from. Both of these family-unique n-grams for all families. The results are
values provide insight in the general viability of our out- shown in Table 4. For a total of 5 families (with one
lined approach. sample each), YARA-Signator did not identify unique
n-grams. An inspection of these shows that 4 sam-
N-gram size
ples were misclassified in Malpedia because of differ-
occurrences
4 5 6 7 ent aliases referring to them in the referenced report-
1 84.94% 86.51% 87.47% 88.10% ing, while one family was a .NET sample that was not
2 7.70% 7.09% 6.68% 6.34% filtered before. For each of the remaining 992 families,
3 3.06% 2.73% 2.53% 2.39%
4 1.22% 1.05% 0.96% 0.46% a number of n-grams sufficient for rule generation is
found. Not only this, for the median family, between
Table 3: Occurrence counts of n-grams in different malware 45.21 and 55.88% of n-grams are unique to that fam-
families. ily depending on n-gram length. Similar to what was
observed before, longer n-grams lead to higher rela-
First, we look at the occurrence frequency of n- tive uniqueness. Overall, we find that basically every
gram uniqueness across families. The results are family contains some portions of unique code that can
listed in Table 3. We count a total of 187,800,586 unique n-grams for all lengths combined. With regard to their relative uniqueness, we see that even for instruction n-grams of length 4, 84.94% of these n-grams already appear in only a single family. For two and three families, we note a steep decline with 7.70% and 3.06% respectively, summing up to more than 95% of all n-grams. As expected, for longer n-grams these numbers lean even more towards occurrence in a single family only. For n-grams of length 7, the share of 88.10% family-unique n-grams is also very close to the 89.59% observed for family-unique PIC function hashes, as discussed in Section 5.2. Overall, these statistics are good news as they imply that a vast majority of

be automatically identified and used to target it in a signature. The average percentage of unique n-grams across all n-gram lengths and families is 51.85%.

Deeper investigation of the results allows us to make more interesting observations. For example, families that stick out with a high total but low unique n-gram count include win.combojack [27] (520,891 n-grams in total, of which 2.40% are unique) and win.shurl0ckr [28] (1,441,625 n-grams, of which 3.25% are unique). Both are compiled with frameworks that make excessive use of static linking, in these cases Delphi and Go respectively.

In a few cases, we observe a similar phenomenon for families compiled with the much more popular MS Visual Studio. Here, we find families that have a lower number of total n-grams but still a low number of unique n-grams. An example would be win.carrotbat [29], a simple downloader used by a threat actor in campaigns targeting Southeast Asia. For this family we count 40,295 n-grams, among which 6.5% are considered unique.

On the other end of the spectrum, we can find families such as win.locky [30] and win.nymaim [31]. These families employ custom obfuscation schemes that lead to a high n-gram count out of which the vast majority is also unique. For example, in win.nymaim we find 2,335,906 n-grams out of which 99% are unique across all families.

6.2 Rule Generation

           SeqCount   N-gramLen   SeqLen      SizeCap
Minimum           5           4        4       24,576
25%              10           5       14      188,416
50%              10           6       18      402,432
75%              10           7       23    1,040,384
Maximum         220           7       77   35,323,904

Table 6: Statistics that describe the characteristics of the output rules. SeqLen and SizeCap in bytes.
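The uniqueness statistics of Section 6.1 come down to counting, for each n-gram, the number of distinct families in which it occurs; n-grams with a count of one are the candidates for family signatures. The following is a minimal sketch of that aggregation, not the implementation used for the paper. The function names, the dictionary layout, and the toy n-gram strings are assumptions made purely for illustration.

```python
from collections import Counter


def family_spread(ngrams_by_family):
    """Map each n-gram to the number of distinct families that contain it."""
    spread = Counter()
    for ngrams in ngrams_by_family.values():
        for ngram in set(ngrams):  # a family counts at most once per n-gram
            spread[ngram] += 1
    return spread


def spread_histogram(spread):
    """Percentage of distinct n-grams that occur in exactly k families."""
    total = len(spread)
    by_k = Counter(spread.values())
    return {k: 100.0 * count / total for k, count in sorted(by_k.items())}


def unique_share_per_family(ngrams_by_family, spread):
    """Percentage of each family's distinct n-grams found in no other family."""
    shares = {}
    for family, ngrams in ngrams_by_family.items():
        distinct = set(ngrams)
        unique = sum(1 for ngram in distinct if spread[ngram] == 1)
        shares[family] = 100.0 * unique / len(distinct) if distinct else 0.0
    return shares


# Hypothetical toy corpus: family name -> instruction n-grams (as strings).
corpus = {
    "win.alpha": ["push ebp|mov ebp, esp", "xor eax, eax|ret", "call A|test eax, eax"],
    "win.beta": ["push ebp|mov ebp, esp", "lea ecx, [eax]|jmp B"],
    "win.gamma": ["push ebp|mov ebp, esp", "xor eax, eax|ret", "inc edx|dec ecx"],
}

spread = family_spread(corpus)
histogram = spread_histogram(spread)
unique_share = unique_share_per_family(corpus, spread)
```

In this toy corpus, the shared function prologue appears in all three families and would be eliminated, while each family keeps exactly one n-gram of its own. Families that link large shared frameworks would show the low unique shares reported for win.combojack and win.shurl0ckr.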
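To make the columns of Table 6 concrete, the sketch below renders a code-based rule from a list of byte sequences. SeqCount corresponds to the number of sequences, SeqLen to the number of bytes in one sequence, and SizeCap to the file size bound in the condition. The rule layout, the `_auto` name suffix, the `$sequence_N` identifiers, the `min_matches` threshold, and the function name `render_rule` are all assumptions made for this example; the byte sequences are invented and do not come from any real rule.

```python
def render_rule(family, sequences, min_matches, size_cap):
    """Render a YARA rule as text from hex sequences (hypothetical layout)."""
    if not 0 < min_matches <= len(sequences):
        raise ValueError("min_matches must be between 1 and len(sequences)")
    # YARA rule identifiers must not contain dots.
    rule_name = family.replace(".", "_") + "_auto"
    lines = ["rule %s {" % rule_name, "    strings:"]
    for index, sequence in enumerate(sequences):
        lines.append("        $sequence_%d = { %s }" % (index, sequence))
    lines.append("    condition:")
    lines.append("        %d of them and filesize < %d" % (min_matches, size_cap))
    lines.append("}")
    return "\n".join(lines)


# Invented example sequences; "??" is the YARA wildcard for one byte.
example_sequences = [
    "8B 45 ?? 50 E8 ?? ?? ?? ??",
    "33 C0 5D C2 04 00",
]

# SeqLen in the sense of Table 6: bytes per sequence, wildcards included.
sequence_lengths = [len(sequence.split()) for sequence in example_sequences]

rule_text = render_rule("win.example", example_sequences, 2, 188416)
print(rule_text)
```

Wildcarding variable bytes, such as call targets and stack offsets, lets a sequence match recompiled variants of the same code. The file size bound limits matching to files of roughly the size seen for the family.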
6.3 Classification Performance

After inspection of the generated rules, we now want to evaluate their performance with regard to detection capabilities.

We first apply the rules against the Malpedia data set with which they were generated. The results are shown in Table 7. Overall, all except 115 files were positively classified, which results in a Recall of 0.972. Interestingly, 1,329 additional files were correctly classified by the respective rule corresponding to their family, which indicates the generalization potential of the used n-gram and wildcarding method. With just 22 false positives, the rules have a very high Precision of 0.995. The overall F1 score is 0.983.

            True    False
Positive   4,035       22
Negative   3,459      115

Table 7: Classification results of running the 992 YARA rules against the input data set. In addition to the 4,035 True Positives, another 1,330 hits on files of the respective family were registered.

Looking closer at the rules, we find that 977 rules did not produce false positives and 923 rules did not have any false negatives. Combined, 916 rules are considered clean in that they did not cause any misclassifications.

We next investigate these misclassifications in detail. First off, false positives typically have to be considered a direct result of disassembly errors. If all disassembly were exact, the sequences causing FPs would have been sorted out during the aggregation and filtering stage. The following scenarios can occur. First, if code is disassembled correctly in one family but missed in another, this may result in n-grams that lead to false positives. Second, if disassembly is produced "wrongly" for a family, this may lead to wrong instruction borders and thus n-grams that will still detect the same byte sequences in other families.

With this in mind, we now first focus on the false positives that occurred. For at least 9 out of 22 hits, we assess that they are caused by an actual contextual relationship between the families. For example, the YARA rule for win.isfb detects win.dreambot. Both families are based on the leaked Gozi source code [34], with Dreambot additionally being able to use Tor, for example. Other overlap that is similarly explainable is found e.g. for win.dropshot, win.shapeshift, and win.stonedrill. The rule for win.reactorbot also causes hits in win.rovnix, because this protector/rootkit has been used in conjunction with win.reactorbot and the samples in Malpedia for win.rovnix contain the win.reactorbot payload. One hit is also caused by a binary duplicate stored under two family names in Malpedia (that has since been resolved). For the remaining 13 hits, we could not find a better explanation than disassembly errors and potential library code overlap.

With regard to false negatives, we note that they are also the result of different effects. In the majority of cases, we note that disassembly errors may lead to situations where parts of a sample are missed that could otherwise be used as characteristic sequences for a given family. This naturally causes a situation where not enough sequences for a sample are extracted and incorporated into the rule, causing it to miss the sample. We found that this particularly affects samples that already have a very small number of functions. In a number of cases, we also found that a sample sorted into the wrong family resulted in elimination of many otherwise possible sequences from rules in the filtering stage, leading to an insufficient number of sequences to trigger on the sample. This had the positive side effect that we could optimize the corpus and correct these wrongly classified samples in the data set. In a few cases, we also noticed that this happened in legitimate cases, especially when a family itself is used as a "module" in another family.

6.4 False Positive Analysis

We now conduct an analysis of false positives on a second data set. For this analysis, AVAST kindly ran our rules against their clean data set and provided the detection results back to us. The data set comprises about 10 TB of data and any hits can be safely assumed to be undesired as it only consists of known benign software.

We register a total of 13,879 hits caused by 70 of the 992 rules. While this initially seems like a large number, the distribution is highly skewed. The rule with the most hits alone is responsible for 8,206 hits (59.13%) and targets the ransomware win.scarabey [35]. We analyzed the rule composition and malware, noting that the malware makes extensive use of Application Framework Extensions (AFX), a predecessor of Microsoft Foundation Classes (MFC), typically used to create GUIs. Smaller portions of AFX code fragments are only found in 6 other malware families. Because AFX was not added to the blacklist beforehand, this leaves enough n-grams to be considered "unique" across the malware in Malpedia. However, since lots of benign software also makes use of AFX, this immediately explains the number of FPs caused by this rule.

Looking at the next rules in the top five, we find 1,258 (9.06%), 957 (6.90%), 914 (6.59%), and 370 (2.67%) hits. Only 6 other rules have more than 100 hits; together, these 11 rules are responsible for 92.96% of false positives. For the remainder, there are 31 rules with between 10 and 100 hits, while 28 rules produce fewer than 10 hits.

It is also notable that 4 out of 24 rules for macOS malware produce false positives. This is explained by the fact that, with so few families in Malpedia, the expected code elimination effect is generally minimal compared to Windows PE files, and that the blacklist data did not specifically target macOS.
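The headline figures of Sections 6.3 and 6.4 can be re-derived from the reported counts. The sketch below is a plain arithmetic check, using the standard definitions of precision, recall, and F1 together with the numbers from Table 7 and the hit counts of the five rules with the most hits on the clean data set. The function and variable names are chosen for this example only.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts as reported in Table 7.
precision, recall, f1 = precision_recall_f1(tp=4035, fp=22, fn=115)

# Section 6.4: hits on the clean data set for the five rules with the most hits.
total_hits = 13879
top_five_hits = [8206, 1258, 957, 914, 370]
top_five_shares = [round(100.0 * hits / total_hits, 2) for hits in top_five_hits]

# Rule buckets from Section 6.4: the top five, 6 more rules with more than
# 100 hits, 31 rules with 10 to 100 hits, and 28 rules with fewer than 10 hits.
rules_with_hits = len(top_five_hits) + 6 + 31 + 28

print(round(precision, 3), round(recall, 3), round(f1, 3))
print(top_five_shares, rules_with_hits)
```

The rounded values agree with the reported Precision of 0.995, Recall of 0.972, and F1 score of 0.983, and the bucket sizes add up to the 70 rules that produced hits.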
[9] C. Blichmann, “vxsig - automatically generate av byte signatures from sets of similar binaries.,” 2019. Github repository: https://github.com/google/vxsig.

[10] T. Dullien, E. Ventura, S. Meyer-Eppler, T. Kornau, C. Blichmann, and J. Newger, “Zynamics,” 2004. Website: https://www.zynamics.com/software.html.

[11] S. H. Ding, B. C. Fung, and P. Charland, “Kam1n0: Mapreduce-based assembly clone search for reverse engineering,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, (New York, NY, USA), p. 461–470, Association for Computing Machinery, 2016.

[12] F. Roth, “yarGen,” 2013-12-18. Github repository: https://github.com/Neo23x0/yarGen.

[13] C. Doman, “Yabin,” 2018. Github repository: https://github.com/AlienVault-OTX/yabin.

[14] H. Yi, “Hyara (ida plugin),” 2018. Github repository: https://github.com/hy00un/Hyara.

[15] KoreLogic Security, “Converting ida pat to yara signatures,” 2013. Blog post: https://blog.korelogic.com/blog/2013/11/15/pat2yara.

[16] W. Ballenthin, “Yara-fn,” 2019. Github repository: https://github.com/williballenthin/idawilli/tree/master/scripts/yara_fn.

[17] J. Martin, j0sm1, jovimon, and mmorenog, “Yara rules,” 2018. Github repository: https://github.com/Neo23x0/signature-base/tree/master/yara.

[18] F. Roth, “Yara rules from signature base,” 2018. Github repository: https://github.com/Neo23x0/signature-base/tree/master/yara.

[19] M. Worth, “Open-source-yara-rules,” 2018. Github repository: https://github.com/mikesxrs/Open-Source-YARA-rules.

[20] R. Wesson and SupportIntelligence, “Project icewater,” 2018. Github repository: https://github.com/SupportIntelligence/Icewater.

[21] D. Plohmann, “SMDA - a minimalist recursive disassembler library for x86/64.,” 2018. Github repository: https://github.com/danielplohmann/smda.

[22] N. A. Quynh, “Capstone: Next-gen disassembly framework,” 2014. Website: http://www.capstone-engine.org/BHUSA2014-capstone.pdf.

[23] F. Bilstein, “Automatic generation of code-based yara-signatures,” 2018. Bachelor Thesis.

[24] C. Cohen and J. Havrilla, “Function Hashing for Malicious Code Analysis,” tech. rep., SEI, CMU, 2009.

[25] V. Chvatal, “A greedy heuristic for the set-covering problem,” Math. Oper. Res., vol. 4, p. 233–235, Aug. 1979.

[26] M. Stonebraker, “Postgresql,” 1989. Website: https://www.postgresql.org/.

[27] B. Levene and J. Grunzweig, “Sure, I’ll take that! New ComboJack Malware Alters Clipboards to Steal Cryptocurrency.” Blogpost: https://researchcenter.paloaltonetworks.com/2018/03/unit42-sure-ill-take-new-combojack-malware-alters-clipboards-steal-cryptocurrency/.

[28] T. Micro, “ShurL0ckr Ransomware as a Service Peddled on Dark Web, can Reportedly Bypass Cloud Applications.” Blogpost: https://www.trendmicro.com/vinfo/us/security/news/cybercrime-and-digital-threats/shurl0ckr-ransomware-as-a-service-peddled-on-dark-web-can-reportedly-bypass-cloud-applications.

[29] J. Grunzweig and K. Wilhoit, “The Fractured Block Campaign: CARROTBAT Used to Deliver Malware Targeting Southeast Asia.” Blogpost: https://unit42.paloaltonetworks.com/unit42-the-fractured-block-campaign-carrotbat-malware-used-to-deliver-malware-targeting-southeast-asia/.

[30] M. Talbi, “De-obfuscating Jump Chains with Binary Ninja.” Blogpost: https://thisissecurity.stormshield.com/2018/03/20/de-obfuscating-jump-chains-with-binary-ninja/.

[31] D. Plohmann, “Patchwork: Stitching against malware families with IDA Pro.” Presentation for SPRING2014: https://public.gdatasoftware.com/Web/Landingpages/DE/GI-Spring2014/slides/004_plohmann.pdf.

[32] D. Plohmann, “Empty msvc,” 2019. Github repository: https://github.com/danielplohmann/empty_msvc.

[33] E. Raff, W. Fleming, R. Zak, H. Anderson, B. Finlayson, C. Nicholas, and M. McLean, “Kilograms: Very large n-grams for malware classification,” 2019.

[34] gbrindisi, “Gozi ISFB Sourceccode.” Github repository: https://github.com/gbrindisi/malware/tree/master/windows/gozi-isfb.

[35] A. Ivanov, “Scarabey Ransomware.” Blogpost: https://id-ransomware.blogspot.com/2017/12/scarabey-ransomware.html.