A Study of Detecting Computer Viruses in Real-Infected Files in the n-gram Representation with Machine Learning Methods

Note: this is an extended version. The LNCS published version is available via SpringerLink.

Thomas Stibor
Fakultät für Informatik
Technische Universität München
thomas.stibor@in.tum.de
Abstract. Machine learning methods have been successfully applied in recent years to detecting new and unseen computer viruses. The viruses were, however, detected in small virus loader files and not in real infected executable files. We created data sets of benign files, virus loader files and real infected executable files and represented the data as collections of n-grams. Histograms of the relative frequencies of the n-gram collections indicate that detecting viruses in real infected executable files with machine learning methods in the n-gram representation is nearly impossible. This statement is underpinned by exploring the n-gram representation from an information-theoretic perspective and empirically by performing classification experiments with machine learning methods.
1 Introduction

Detecting new and unseen viruses with machine learning methods is a challenging problem in the field of computer security. Signature-based detection provides adequate protection against known viruses, when proper virus signatures are available. From a machine learning point of view, signature-based detection rests on a prediction model with no generalization; in other words, no detection beyond the known viruses can be performed. To obtain a form of generalization, current anti-virus systems use heuristics generated by hand.

In recent years, however, machine learning methods have been investigated for detecting unknown viruses [1-5]. Reported results show high and acceptable detection rates. The experiments, however, were performed on skewed data sets, that is, not real infected executable files were considered, but small executable virus loader files. In this paper, we investigate and compare machine learning methods on real infected executable files and virus loader files. Results reveal that real infected executable files, when represented as n-grams, can barely be detected with machine learning methods.
2 Data Sets

In this paper, we focus on computer virus detection in DOS executable files. These files are executable on DOS and some Windows platforms only and can be identified by the leading two-byte signature MZ and the file name suffix .EXE. It is important to note that although DOS platforms are obsolete, the principle of virus infection and detection on DOS platforms can be generalized to other OS platforms and executable files.

In total, three different data sets are created. The benign data set consists of virus-free executable DOS files which were collected from public FTP servers. The collected files consist of DOS games, compilers, editors, etc. and are preprocessed such that no duplicate files exist. Additionally, files that are compressed with executable packers¹ such as lzexe, pklite, diet, etc. are uncompressed. These preprocessing steps are performed to obtain a clean data set which is supposed to induce high detection rates of the considered machine learning methods. In total, we collected 3514 such clean DOS executable files.

DOS executable virus loader files are collected from the VXHeaven website.² The same preprocessing steps are applied, that is, all duplicate files are removed and compressed files are uncompressed. It turned out, however, that some files from VXHeaven are real infected executable files and not virus loader files. These files can be identified by inspecting their content with a disassembler/hex editor and by their large file sizes. Consequently, we removed all files greater than 10 KBytes and collected in total 1555 virus loader files.
For the sake of clarity, the benign data set is denoted as B_files and the virus loader data set as L_files. The data set of real infected executable files, denoted as I_files, is created by performing the following steps:
1:  I_files := ∅
2:  for each file.v in L_files
3:      setup new DOS environment in DOS emulator
4:      randomly determine file.r from B_files
5:      boot DOS and execute file.v and file.r
6:      if (file.r gets infected)
7:          I_files := I_files ∪ {file.r}

Fig. 1. Steps performed to create real infected executable DOS files.
These steps were executed on a GNU/Linux machine by means of a Perl script and a DOS emulator. A DOS emulator is chosen since it is too time consuming to set up a clean DOS environment on a DOS machine whenever a new file gets infected. The statement file.r gets infected at line 6 in Figure 1 denotes the data labeling process. Since only viruses were considered for which signatures are available, no elements in the created data sets are mislabeled. In total, 1215 real infected files were collected.³

¹ Executable compression is frequently used to confuse and hide from virus scanners.
² Available at http://vx.netlux.org/
³ For the sake of verifiability, the three data sets are available at http://www.sec.in.tum.de/stibor/ieaaie2010/
3 Transforming Executable Files to n-grams and Frequency Vectors
Raw executable files cannot serve properly as input data for machine learning methods. Consequently, a suitable representation is required. A common and frequently used representation is the hexdump. A hexdump is a hexadecimal representation of an executable file (or of data in general) and is related to machine instructions, termed opcodes. Figure 2 illustrates the connection between machine instructions, opcodes, and a hexdump. It is important to note that the hexdump representation used and reported in the literature is computed over a complete executable file, which consists of a header, relocation tables and the executable code. The machine instructions are located in the code segment only, which implies that a hexdump obtained over a complete executable file may not be interpreted entirely as machine instructions.
machine instruction    opcode
mov  bx, ds            8CDB
push ebx               53
add  ebx, 2E           83C32E
nop                    90

hexdump: 8CDB5383C32E90

Fig. 2. Relation between machine instructions, opcodes and the resulting hexdump.
In the second preprocessing step, the hexdump is cut into substrings of length n ∈ N, denoted as n-grams. According to the literature, an n-gram can be any set of characters in a string [6]. In terms of detecting viruses in executable files, n-grams are adjacent substrings created by shifting a window of length n by s ∈ N positions over the hexdump. The resulting number of created n-grams depends on the value of n and the window shift length s.
We created for each file an ordered collection⁴ of n-grams by cutting the hexdump from left to right (see Fig. 3). Note that the collection can contain the same n-gram multiple times. The total number of n-grams in the collection, given n and s, is (l − n)/s + 1, where l denotes the hexdump length. In a final step, the collection is transformed into a vector of dimension⁵ d = 16^n, where each component corresponds to the absolute frequency of the n-gram occurring in the collection (cf. Example 1).

⁴ If not otherwise stated, we denote an ordered collection simply as collection.
⁵ A hexadecimal digit can take 16 values.
hexdump: 8CDB5383C32E90

n = 4, s = 1: {8CDB, CDB5, DB53, ..., 32E9, 2E90}
n = 4, s = 2: {8CDB, DB53, 5383, ..., C32E, 2E90}

Fig. 3. Resulting n-grams and collections for n = 4 and different window shift lengths (s = 1 and s = 2).
In the problem domain of text classification, such vectors are named term frequency vectors [7]. We therefore denote such vectors as n-gram frequency vectors.
Example 1. Given the n-gram collection ⟨B, 3, A, F, B, 7, B, B, 9, 7, 0⟩, where n = 1 and s = 1 and hence d = 16. The corresponding vector results in (1, 0, 0, 1, 0, 0, 0, 2, 0, 1, 1, 4, 0, 0, 0, 1), because n-gram 0 occurs one time in the collection, n-gram 1 occurs zero times, and so on.
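The transformation from hexdump to frequency vector is mechanical enough to state compactly. The following Python sketch (our own illustration, not the code used in the experiments) builds the frequency vector for given n and s and reproduces Example 1:

    def ngram_frequency_vector(hexdump, n, s):
        # Dimension d = 16^n: one component per possible hexadecimal n-gram.
        vec = [0] * (16 ** n)
        # Shift a window of length n by s over the hexdump; this yields
        # (l - n)/s + 1 n-grams, where l is the hexdump length.
        for i in range(0, len(hexdump) - n + 1, s):
            # Interpreting the n-gram as a hexadecimal number gives its index.
            vec[int(hexdump[i:i + n], 16)] += 1
        return vec

    # Example 1: n = 1, s = 1, hence d = 16.
    assert ngram_frequency_vector("B3AFB7BB970", 1, 1) == \
        [1, 0, 0, 1, 0, 0, 0, 2, 0, 1, 1, 4, 0, 0, 0, 1]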
4 Creating and Selecting Informative n-grams

For large values of n, the created n-gram frequency vectors can only be processed to a limited extent due to their high dimensionality. In this section we therefore address the problem of creating and selecting the most informative n-grams. More precisely, the following two problems are addressed:

– How to determine the parameters n and s such that the most informative n-grams can be created?
– How to select the most informative n-grams once they are created?

Both problems can be worked out by using arguments from information theory.
Let X be a random variable that represents the outcome of an n-gram and A the set of all n-grams of length n. The entropy is defined as

    H(X) = \sum_{x \in A} P(X = x) \log_2 \frac{1}{P(X = x)}.    (1)

Informally speaking, high entropy implies that X follows a roughly uniform distribution, and n-grams sampled from it are distributed all over the place; consequently, predicting sampled n-grams is hard. In contrast, low entropy implies that X follows a peaked distribution, and n-grams sampled from it are more predictable.
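As a sketch of how these quantities can be computed on an n-gram collection (the function below is illustrative, not the evaluation code used in the experiments), the empirical entropy and its ratio to the maximum entropy H_max = log2(16^n) used in Section 4.2 follow directly from the n-gram counts:

    import math
    from collections import Counter

    def entropy_ratio(ngrams):
        # Empirical distribution P over the n-grams in the collection.
        t = len(ngrams)
        h = -sum((c / t) * math.log2(c / t)
                 for c in Counter(ngrams).values())
        # Maximum entropy: uniform distribution over all 16^n n-grams.
        h_max = math.log2(16 ** len(ngrams[0]))
        return h / h_max  # near 0: peaked, predictable; near 1: uniform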
4.1 Information Loss in Frequency Vectors

Given a hexdump of a file, let C be the resulting n-gram collection and c the created n-gram frequency vector. We are interested in minimizing the amount of information lost when reconstructing the n-gram collection C from c. For the sake of simplicity we consider the case n = s and denote the cardinality of C as t. Assume there are r unique n-grams in C, where each n-gram appears n_i times such that \sum_{i=1}^{r} n_i = t. The collection C can have

    \frac{t!}{n_1! \, n_2! \cdots n_r!} = \binom{t}{n_1, n_2, \ldots, n_r}    (2)

many rearrangements of n-grams. Applying the theorem from information theory on the size of a type class [8] (p. 350, Theorem 11.1.3), it follows that

    \frac{1}{(t+1)^{|\Sigma|}} \, 2^{t H(P)} \le \binom{t}{n_1, n_2, \ldots, n_r} \le 2^{t H(P)},    (3)

where P denotes the empirical probability distribution of the n-grams, H(P) the entropy of distribution P, and Σ the used alphabet. Using n = s, a hexdump of length l yields t = (l − n)/n + 1 = l/n n-grams, and it follows from (3) that 2^{(l/n) H(P)} is an upper bound on the information loss. To keep the upper bound as small as possible, the value of n has to be close to the value of l, and the empirical probability distribution P has to be a peaked distribution rather than a uniform distribution.
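To make the bound in (3) concrete, the following sketch (our own illustration; computed in log2 space to avoid huge factorials) evaluates both sides of the inequality for a given collection:

    import math
    from collections import Counter

    def type_class_bounds(ngrams):
        t = len(ngrams)
        counts = Counter(ngrams).values()
        # log2 of the multinomial coefficient t!/(n_1! ... n_r!) via lgamma.
        log_count = (math.lgamma(t + 1)
                     - sum(math.lgamma(c + 1) for c in counts)) / math.log(2)
        # Entropy H(P) of the empirical n-gram distribution.
        h = -sum((c / t) * math.log2(c / t) for c in counts)
        sigma = 16 ** len(ngrams[0])              # alphabet size |Sigma|
        lower = t * h - sigma * math.log2(t + 1)  # log2 of the left side of (3)
        upper = t * h                             # log2 of the right side of (3)
        return lower, log_count, upper            # lower <= log_count <= upper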
4.2 Entropy of n-gram Collections

To support the statement in Section 4.1 empirically, the entropy values of the created n-gram collections are calculated. The values are set in relation to the maximum possible entropy (uniform distribution), denoted as H_max. An n-gram collection where the corresponding value of H(X)/H_max is close to zero contains more predictive n-grams than one where H(X)/H_max is close to one. The results for different parameter combinations of n and s are listed in Table 1. One can see that longer n-grams are more predictable than short ones. Additionally, one can observe that the window shift length s has to be a multiple of two, because opcodes most frequently occur as two and four byte values. Moreover, one can observe that the results of H(X)/H_max for n-gram collections created from the benign set B_files and the virus infected set I_files have approximately the same values. In other words, discriminating between those two classes is realizable only to a limited extent. This observation is also identifiable in Section 6, where the classification results are presented, and by inspecting Figure 4, which shows histograms of the relative frequencies of the n-grams.

From a practical point of view, however, large values of n induce a computational complexity which is not manageable. To be more precise, the crucial factor that influences the computational complexity is not just the total number of created n-grams (see Table 2), but also the number of unique n-grams, that is, the lower bound of the dimension of the frequency vector. For our data sets we obtained the results listed in Table 3. One can observe that all possible n-grams occurred in the collections. As a consequence, we focus in this paper on n-gram lengths 2 and 3. We tried to run the classification experiments for n = 4, 5, 6, 7, 8 in combination with the dimensionality reduction described in the subsequent section; however, due to the resulting space complexity and lack of memory⁶ no results were obtained.

⁶ A Quad-Core AMD 2.6 GHz machine with 16 GB RAM was used.
4.3 Feature Selection

The purpose of feature selection is to reduce the dimensionality of the data such that the machine learning methods can be applied more efficiently in terms of time and space complexity. Furthermore, feature selection often increases classification accuracy by eliminating noisy features. We apply the mutual information method [7] to our data sets. This method was used in previous work and appears to be practical [5]. A comparison study of other feature selection methods in terms of classifying malicious executables is provided in [9].

Mutual Information Selection: Let U be a random variable that takes values e_ng = 1 (the collection contains n-gram ng) and e_ng = 0 (the collection does not contain ng). Let C be a random variable that takes values e_c = 1 (the collection belongs to class c) and e_c = 0 (the collection does not belong to class c).
The mutual information measures how much information the presence/absence of an n-gram contributes to making the correct classification decision on c. More formally,

    I(U; C) = \sum_{e_{ng} \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(U = e_{ng}, C = e_c) \log_2 \frac{P(U = e_{ng}, C = e_c)}{P(U = e_{ng}) \, P(C = e_c)}.    (4)
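In practice, (4) is estimated from document counts. A minimal sketch, assuming the counts of the 2×2 contingency table of n-gram presence versus class membership are given (the parameter names are ours):

    import math

    def mutual_information(n11, n_gram, n_class, n):
        # n11: files containing the n-gram and belonging to class c;
        # n_gram, n_class: the two marginal counts; n: total number of files.
        cells = (
            (n11, n_gram, n_class),
            (n_gram - n11, n_gram, n - n_class),
            (n_class - n11, n - n_gram, n_class),
            (n - n_gram - n_class + n11, n - n_gram, n - n_class),
        )
        # Each term: (N_ij / N) * log2(N * N_ij / (N_i * N_j)); empty cells
        # contribute zero to the sum.
        return sum((nij / n) * math.log2(n * nij / (ni * nj))
                   for nij, ni, nj in cells if nij > 0)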
        n = 2   n = 3   n = 4
s = 1   0.923   0.896   0.862
        0.924   0.895   0.861
        0.738   0.679   0.628
s = 2   0.889   0.857   0.818
        0.888   0.857   0.816
        0.708   0.649   0.593
s = 3           0.896   0.862
                0.895   0.860
                0.679   0.626
s = 4                   0.816
                        0.813
                        0.590

Table 1. Results of the expression H(X)/H_max for different parameter combinations of n and s. Smaller values indicate that more predictive n-grams can be created. The upper value in each cell denotes the result of n-grams created from the benign set B_files, the middle value from the virus infected set I_files, the lower value from the virus loader set L_files.
        n = 2        n = 3        n = 4
s = 1   378 238 050  378 234 536  378 231 022
        128 293 929  128 292 714  128 291 499
          8 281 043    8 279 488    8 277 933
s = 2   189 120 782  189 117 268  189 117 268
         64 147 572   64 146 357   64 146 357
          4 141 299    4 139 744    4 139 744
s = 3                126 079 338  126 078 183
                      42 764 639   42 764 238
                       2 760 368    2 759 815
s = 4                              94 560 011
                                   32 073 535
                                    2 070 340

Table 2. Number of created n-grams for different n-gram lengths n and window shift lengths s. The upper number in each cell denotes the number of n-grams created from the benign set B_files, the middle number from the virus infected set I_files, the lower number from the virus loader set L_files.
n         2    3     4      5
B_files   256  4096  65536  1048576
L_files   256  4096  65536  1048576
I_files   256  4096  65536  1048576

Table 3. The number of unique n-grams in the three data sets B_files, L_files, I_files. This number determines the dimension of the corresponding frequency vectors.
5 Classification Methods

In this section we briefly introduce the classification methods which are used in the experiments. These methods are frequently applied in the field of machine learning.
5.1 Cosine Similarity Classification

Given two vectors x, y ∈ R^d, the cosine similarity is given as follows:

    \cos\theta = \frac{\langle x, y \rangle}{\|x\| \, \|y\|}    (5)
               = \frac{x_1 y_1 + \ldots + x_d y_d}{\sqrt{x_1^2 + \ldots + x_d^2} \, \sqrt{y_1^2 + \ldots + y_d^2}}.    (6)

The dot product of x and y, denoted as ⟨x, y⟩, has a straightforward geometrical interpretation, namely the length of the projection of x onto the unit vector y/‖y‖. Consequently, (5) represents the cosine of the angle θ between x and y. The result of (5) lies in the interval [−1, 1], where 1 is the result of identical vectors and −1 of totally dissimilar vectors.

For performing classification, the arithmetic mean vector c of each class is determined. A new vector u is assigned to the class with the largest cosine similarity value between c and u.
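A minimal sketch of this nearest-mean rule (a NumPy-based illustration of ours, not the original experiment code):

    import numpy as np

    def cosine_classify(X, y, u):
        # X: 2-D array of n-gram frequency vectors, y: class labels,
        # u: new frequency vector to classify.
        best, best_sim = None, -2.0
        for c in set(y):
            mean = X[np.asarray(y) == c].mean(axis=0)  # class mean vector
            sim = mean @ u / (np.linalg.norm(mean) * np.linalg.norm(u))
            if sim > best_sim:
                best, best_sim = c, sim
        return best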
5.2 Naive Bayesian Classifier

The naive Bayesian classifier is well known in the machine learning community and is popularized by applications such as spam filtering, or text classification in general.

Given feature variables F_1, F_2, ..., F_n and a class variable C, Bayes' theorem states

    P(C \mid F_1, F_2, \ldots, F_n) = \frac{P(C) \, P(F_1, F_2, \ldots, F_n \mid C)}{P(F_1, F_2, \ldots, F_n)}.    (7)

Assuming that each feature F_i is conditionally independent of every other feature F_j for i ≠ j, one obtains

    P(C \mid F_1, F_2, \ldots, F_n) = \frac{P(C) \prod_{i=1}^{n} P(F_i \mid C)}{P(F_1, F_2, \ldots, F_n)}.    (8)

The denominator serves as a scaling factor and can be omitted in the final classification rule

    \operatorname*{argmax}_{c} \; P(C = c) \prod_{i=1}^{n} P(F_i = f_i \mid C = c).    (9)
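Rule (9) is usually evaluated in log space to avoid numerical underflow of the product. A sketch assuming pre-estimated probability tables (the parameter names and table layout are our own choices):

    import math

    def naive_bayes_predict(features, priors, cond):
        # priors[c] = P(C = c); cond[c][i][f] = P(F_i = f | C = c),
        # both assumed to be estimated from training data beforehand.
        return max(priors, key=lambda c: math.log(priors[c])
                   + sum(math.log(cond[c][i][f])
                         for i, f in enumerate(features)))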
5.3 Support Vector Machine

The Support Vector Machine (SVM) [10, 11] is a classification technique which has its foundation in computational geometry, optimization and statistics. Given a sample

    (x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l) \in \mathbb{R}^d \times Y, \quad Y = \{-1, +1\},

and a (nonlinear) mapping into a potentially much higher dimensional feature space F,

    \Phi : \mathbb{R}^d \to F, \quad x \mapsto \Phi(x),

the goal of SVM learning is to find a separating hyperplane with maximal margin in F such that

    y_i (\langle w, \Phi(x_i) \rangle + b) \ge 1, \quad i = 1, \ldots, l,    (10)

where w ∈ F is the normal vector of the hyperplane and b ∈ R the offset. The separation problem between the −1 and +1 class can be formulated in terms of the quadratic convex optimization problem

    \min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i    (11)
    \text{subject to} \quad y_i (\langle w, \Phi(x_i) \rangle + b) \ge 1 - \xi_i,    (12)

where C is a penalty term and ξ_i ≥ 0 are slack variables to relax the hard constraints in (10). Solving the dual form of (11) and (12) leads to the final decision function

    f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i \langle \Phi(x), \Phi(x_i) \rangle + b \right)    (13)
         = \operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i \, k(x, x_i) + b \right),    (14)

where k(·, ·) is a kernel function satisfying Mercer's condition [12] and the α_i are Lagrange multipliers obtained by solving the dual form.

In the performed classification experiments we used the Gaussian kernel k(x, y) = exp(−‖x − y‖²/γ), as it gives the highest classification results compared to other kernels. The hyperparameters γ and C are optimized by means of the built-in validation method of the used R package kernlab [13].
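For readers without an R setup, an analogous configuration can be sketched in Python's scikit-learn (this is not the paper's actual kernlab setup, and the grid values are placeholders):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # RBF kernel exp(-gamma * ||x - y||^2); gamma and C are tuned by
    # cross-validated grid search, analogous to kernlab's validation.
    search = GridSearchCV(SVC(kernel="rbf"),
                          {"C": [0.1, 1, 10, 100],
                           "gamma": [1e-4, 1e-3, 1e-2, 1e-1]},
                          cv=5)
    # search.fit(X_train, y_train)  # frequency vectors and class labels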
6 ROC Analysis and Results

ROC analysis is performed to measure the goodness of the classification results. More specifically, we are interested in the following quantities:

– true positives (TP): number of virus executable examples classified as virus executable,
– true negatives (TN): number of benign executable examples classified as benign executable,
– false positives (FP): number of benign executable examples classified as virus executable,
– false negatives (FN): number of virus executable examples classified as benign executable.

The detection rate (true positive rate), false alarm rate (false positive rate), and overall accuracy of a classifier are then calculated as follows:

    \text{detection rate} = \frac{TP}{TP + FN}, \quad \text{false alarm rate} = \frac{FP}{FP + TN}, \quad \text{overall accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
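The three quantities follow directly from the confusion-matrix counts; a trivial helper (ours) for reference:

    def rates(tp, tn, fp, fn):
        detection = tp / (tp + fn)    # true positive rate
        false_alarm = fp / (fp + tn)  # false positive rate
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        return detection, false_alarm, accuracy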
Moreover, to avoid under/overfitting effects of the classifiers, we performed a K-fold cross-validation. That is, the data set is split into K roughly equal-sized parts. The classification method is trained on K − 1 parts and has to predict the unseen testing part. This is performed for k = 1, 2, ..., K, and the results are combined as the mean value to estimate the generalization error. Since the data sets are large, a 10-fold cross-validation is performed. The classification results are depicted in Tables 4-7, where the standard deviation over the cross-validation folds is also provided.
The classification results evidently show (cf. Tables 4-7) that high detection rates and low false alarm rates can be obtained when discriminating between benign files and virus loader files. This observation supports previously published results [5, 1, 2]. However, for detecting viruses in real infected executables these methods are barely applicable. This observation is not a great surprise, because benign files and real infected files are, from a statistical point of view, nearly indistinguishable (cf. Figure 4, last page).

It is interesting to note that the SVM gives excellent classification results when discriminating between benign and virus loader files, but poor results when discriminating between benign and virus infected files. In terms of the overall accuracy, the SVM still outperforms the Bayes and cosine measure methods.

Additionally, one can observe that selecting 50% of the features by means of the mutual information criterion does not significantly increase or decrease the classification accuracy of the SVM. The Bayes and cosine measure, however, benefit from the feature selection preprocessing step, as a significant increase of the classification accuracy can be observed.
7 Conclusion

We investigated the applicability of machine learning methods for detecting viruses in real infected DOS executable files when using the n-gram representation. For that reason, three data sets were created. The benign data set contains virus-free DOS executable files collected from all over the Internet. For the sake of comparison to previous work, we created a data set which consists of virus loader files collected from the VXHeaven website. Furthermore, we created a new data set which consists of real infected executable files. The data sets were transformed to collections of n-grams and represented as frequency vectors. Due to this transformation step an information loss occurs. We showed empirically, and theoretically with arguments from information theory, that this loss can be minimized by increasing the lengths of the n-grams. As a consequence, however, this results in a computational complexity which is not manageable.

Moreover, we calculated the entropy of the n-gram collections and observed that the benign files and real infected files have nearly identical entropy values. As a consequence, discriminating between those two classes is realizable only to a limited extent. This was also observed by creating histograms of the relative n-gram frequencies.

Furthermore, we performed classification experiments and confirmed that discriminating between those two classes is realizable only to a limited extent. More precisely, the SVM gives unacceptable detection rates. The Bayes and the cosine measure give higher detection rates; however, they also give unacceptable false alarm rates.

In summary, our results cast doubt on the applicability of machine learning methods for detecting viruses in real infected executable files when using the (naive) n-gram representation. However, it has not escaped our notice that learning algorithms for sequential data could be one approach to tackle this problem more effectively. Another promising approach is to learn the behavior of a virus, that is, the sequence of system calls a virus generates. Such behavior can be collected in a secure sandbox and learned with appropriate machine learning methods.
References

1. Schultz, M.G., Eskin, E., Zadok, E., Stolfo, S.J.: Data mining methods for detection of new malicious executables. In: Proceedings of the IEEE Symposium on Security and Privacy, IEEE Computer Society (2001) 38-49
2. Abou-Assaleh, T., Cercone, N., Keselj, V., Sweidan, R.: Detection of new malicious code using n-grams signatures. In: Second Annual Conference on Privacy, Security and Trust (2004) 193-196
3. Reddy, D.K.S., Pujari, A.K.: N-gram analysis for computer virus detection. Journal in Computer Virology 2(3) (2006) 231-239
4. Yoo, I.S., Ultes-Nitsche, U.: Non-signature based virus detection. Journal in Computer Virology 2(3) (2006) 163-186
5. Kolter, J.Z., Maloof, M.A.: Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research 7 (2006) 2721-2744
6. Cavnar, W.B., Trenkle, J.M.: N-gram-based text categorization. In: Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (1994) 161-175
7. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008)
8. Cover, T.M., Thomas, J.A.: Elements of Information Theory. 2nd edn. Wiley-Interscience (2006)
9. Cai, D.M., Gokhale, M., James, J.T.: Comparison of feature selection and classification algorithms in identifying malicious executables. Computational Statistics & Data Analysis 51(6) (2007) 3156-3172
10. Boser, B.E., Guyon, I.M., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT), ACM Press (1992) 144-152
11. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3) (1995) 273-297
12. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press (2001)
13. Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A.: kernlab - an S4 package for kernel methods in R. Journal of Statistical Software 11(9) (2004) 1-20
benign vs. virus loader
s  Method  Detection Rate   False Alarm Rate  Overall Accuracy
1  SVM     0.9499 (0.0191)  0.0272 (0.0075)   0.9654 (0.0059)
   Bayes   0.6120 (0.0459)  0.0710 (0.0110)   0.8315 (0.0194)
   Cosine  0.5883 (0.0337)  0.1239 (0.0123)   0.7873 (0.0151)
2  SVM     0.9531 (0.0120)  0.0261 (0.0091)   0.9674 (0.0083)
   Bayes   0.7267 (0.0244)  0.0692 (0.0136)   0.8684 (0.0122)
   Cosine  0.6385 (0.0255)  0.1211 (0.0190)   0.8050 (0.0194)

benign vs. virus infected
s  Method  Detection Rate   False Alarm Rate  Overall Accuracy
1  SVM     0.0370 (0.0161)  0.0731 (0.0078)   0.6982 (0.0155)
   Bayes   0.6039 (0.0677)  0.6113 (0.0679)   0.4438 (0.0416)
   Cosine  0.5479 (0.1300)  0.5674 (0.0906)   0.4620 (0.0394)
2  SVM     0.0556 (0.0153)  0.0888 (0.0125)   0.6912 (0.0206)
   Bayes   0.6682 (0.0644)  0.6533 (0.0736)   0.4284 (0.0465)
   Cosine  0.5835 (0.1096)  0.5938 (0.0897)   0.4506 (0.0474)

Table 4. Classification results for n-gram length n = 2 and sliding window s = 1, 2 without feature selection.
benign vs. virus loader with feature selection
s  Method  Detection Rate   False Alarm Rate  Overall Accuracy
1  SVM     0.9443 (0.0151)  0.0324 (0.0057)   0.9605 (0.0035)
   Bayes   0.7897 (0.0297)  0.1123 (0.0187)   0.8577 (0.0178)
   Cosine  0.7688 (0.0292)  0.1016 (0.0194)   0.8587 (0.0181)
2  SVM     0.9354 (0.0099)  0.0353 (0.0129)   0.9558 (0.0105)
   Bayes   0.8844 (0.0301)  0.1653 (0.0284)   0.8498 (0.0223)
   Cosine  0.8643 (0.0391)  0.1479 (0.0275)   0.8559 (0.0186)

benign vs. virus infected with feature selection
s  Method  Detection Rate   False Alarm Rate  Overall Accuracy
1  SVM     0.0251 (0.0128)  0.0702 (0.0201)   0.6974 (0.0213)
   Bayes   0.6084 (0.0502)  0.5860 (0.0456)   0.4637 (0.0287)
   Cosine  0.6454 (0.0486)  0.6408 (0.0210)   0.4326 (0.0173)
2  SVM     0.0264 (0.0120)  0.0782 (0.0147)   0.6916 (0.0184)
   Bayes   0.6270 (0.0577)  0.6120 (0.0456)   0.4495 (0.0312)
   Cosine  0.6212 (0.0791)  0.6216 (0.0695)   0.4419 (0.0323)

Table 5. Classification results for n-gram length n = 2 and sliding window s = 1, 2 with selection of 50% of the original features by means of the mutual information method.
benign vs. virus loader
s  Method  Detection Rate   False Alarm Rate  Overall Accuracy
1  SVM     0.9622 (0.0148)  0.0267 (0.0080)   0.9700 (0.0080)
   Bayes   0.7493 (0.0302)  0.0594 (0.0106)   0.8820 (0.0076)
   Cosine  0.6452 (0.0258)  0.1284 (0.0148)   0.8021 (0.0150)
2  SVM     0.9560 (0.0183)  0.0266 (0.0091)   0.9680 (0.0095)
   Bayes   0.8044 (0.0210)  0.0595 (0.0142)   0.8985 (0.0108)
   Cosine  0.6905 (0.0330)  0.1220 (0.0214)   0.8202 (0.0137)
3  SVM     0.9618 (0.0089)  0.0289 (0.0094)   0.9682 (0.0071)
   Bayes   0.7488 (0.0377)  0.0580 (0.0101)   0.8828 (0.0146)
   Cosine  0.6450 (0.0410)  0.1272 (0.0215)   0.8025 (0.0224)

benign vs. virus infected
s  Method  Detection Rate   False Alarm Rate  Overall Accuracy
1  SVM     0.0868 (0.0266)  0.0997 (0.0162)   0.6912 (0.0239)
   Bayes   0.6152 (0.0742)  0.6110 (0.0661)   0.4476 (0.0352)
   Cosine  0.5518 (0.1016)  0.5624 (0.0804)   0.4677 (0.0370)
2  SVM     0.1155 (0.0345)  0.0979 (0.0190)   0.7001 (0.0186)
   Bayes   0.6668 (0.0877)  0.6386 (0.1169)   0.4398 (0.0679)
   Cosine  0.5976 (0.0940)  0.6032 (0.0798)   0.4485 (0.0393)
3  SVM     0.0665 (0.0204)  0.0944 (0.0157)   0.6899 (0.0195)
   Bayes   0.6110 (0.0844)  0.5995 (0.0980)   0.4542 (0.0563)
   Cosine  0.5510 (0.0964)  0.5609 (0.0594)   0.4677 (0.0244)

Table 6. Classification results for n-gram length n = 3 and sliding window s = 1, 2, 3 without feature selection.
benign vs. virus loader with feature selection
s  Method  Detection Rate    False Alarm Rate  Overall Accuracy
1  SVM     0.9619 (0.0173)   0.0334 (0.0086)   0.9648 (0.0066)
   Bayes   0.9281 (0.0195)   0.1702 (0.0242)   0.8597 (0.0182)
   Cosine  0.1645 (0.0240)   0.0065 (0.0035)   0.7391 (0.0231)
2  SVM     0.9620 (0.0203)   0.0320 (0.0082)   0.9660 (0.0107)
   Bayes   0.8205 (0.0358)   0.0701 (0.0151)   0.8962 (0.0200)
   Cosine  0.2645 (0.0277)   0.0196 (0.0044)   0.7605 (0.0202)
3  SVM     0.9549 (0.0092)   0.0325 (0.0073)   0.9636 (0.0064)
   Bayes   0.7362 (0.0368)   0.0429 (0.0094)   0.8891 (0.0126)
   Cosine  0.2306 (0.02646)  0.0096 (0.0034)   0.7573 (0.0128)

benign vs. virus infected with feature selection
s  Method  Detection Rate   False Alarm Rate  Overall Accuracy
1  SVM     0.1057 (0.0281)  0.1024 (0.0184)   0.6944 (0.0196)
   Bayes   0.6380 (0.0659)  0.6159 (0.0829)   0.4482 (0.0500)
   Cosine  0.4688 (0.1429)  0.4473 (0.1491)   0.5307 (0.0733)
2  SVM     0.1406 (0.0372)  0.1010 (0.0220)   0.7037 (0.0276)
   Bayes   0.6643 (0.0844)  0.6287 (0.1157)   0.4455 (0.0657)
   Cosine  0.5913 (0.1283)  0.6035 (0.1265)   0.4447 (0.0627)
3  SVM     0.0611 (0.0185)  0.1001 (0.0165)   0.6842 (0.0253)
   Bayes   0.6075 (0.0762)  0.5668 (0.0823)   0.4764 (0.0520)
   Cosine  0.5713 (0.0954)  0.5557 (0.0711)   0.4757 (0.0325)

Table 7. Classification results for n-gram length n = 3 and sliding window s = 1, 2, 3 with selection of 50% of the original features by means of the mutual information method.
[Figure 4 consists of five histogram panels; the y-axis of each panel shows the relative frequency (0.00 to 0.12), the x-axis the n-grams 00, 01, 02, ..., FE, FF.]

(a) Relative frequencies of n-grams created from benign files for n = 2 and s = 2.
(b) Relative frequencies of n-grams created from virus infected files for n = 2 and s = 2.
(c) Relative frequencies of n-grams created from virus loader files for n = 2 and s = 2. Note that the relative frequency of n-gram 00 is cropped.
(d) Difference of relative frequency between benign files and virus loader files.
(e) Difference of relative frequency between benign files and virus infected files.

Fig. 4. Relative frequencies of n-grams created from benign files, virus infected files and virus loader files. Observe that the relative frequencies of the n-grams created from benign files and virus infected files are nearly indistinguishable. As a consequence, discriminating n-grams created from benign files and virus infected files is nearly infeasible.