Detection of malicious PDF files and
directions for enhancements: A state-of-the-art survey
2015. 06. 01
Hyungjin Im
(imhj9121@seoultech.ac.kr)
Table of Contents
1. Introduction
2. Structure of PDF files
3. Techniques and possible attacks via PDF files
4. Advanced methods for the detection of malicious PDF files
5. Dataset collection and preliminary analysis
6. Our suggested active learning based framework
7. Discussion and conclusions
Introduction
• Since 2009, cyber-attacks against businesses and
organizations have increased
• In 2013, 91% of all organizations were hit by a
cyber-attack
• 9% were the victims of targeted attacks
• Email containing attachments of malicious files has
become an attractive platform by which to initiate cyber-
attacks against organizations
• Existing tools are limited in their ability to detect and
identify the attacks that occur within email
Introduction
• Attackers usually use social engineering in order to encourage the
recipient to open a malicious email, open an attachment, or click a
link
• As most email servers prevent attachments of executable files to
email messages, the non-executable files attached to an email have
played a major role in many recent cyberattacks.
• Users consider non-executable files safer than executables, and
thus, they are less suspicious toward such files received by email
– non-executable files are as dangerous as executable files, since their
readers can contain vulnerabilities
– the most popular file types for targeted attacks in 2008-2009 were PDF
and Microsoft Office files.
Introduction
• An incident aimed at the Israeli Ministry of Defense (IMOD) took
place on January 15, 2014
– it identified an attack in which attackers sent email messages, allegedly
from IMOD, with a malicious PDF file attachment posing as an IMOD
document
– When opened, the PDF file installed a Trojan horse that enabled the
attacker to take control of the computer
– this clearly demonstrates that the existing solutions previously mentioned
are insufficient for detecting and preventing such attacks
• This survey paper presents several significant studies on the detection of
malicious PDF files using machine learning algorithms based on static
and dynamic analysis.
• This paper also outlines a novel Active Learning (AL) framework and
highlights the correlation between the structural incompatibility of
PDF files and their maliciousness.
Structure of PDF files
• A Portable Document Format (PDF) is a formatting language first
conceived by John Warnock, one of the founders of Adobe Systems.
The first version, version 1.0, was introduced in 1993
• Has many functions beyond simple text: it can include images and
other multimedia elements, be password protected, execute
JavaScript, etc
• Supported in all the prominent operating systems for the PC and
mobile platforms
Structure of PDF files
• A PDF file is comprised of four basic parts
– Objects - basic elements in a PDF file
– File Structure - defines how the objects are accessed and how they are
updated.
– Document Structure - defines how objects are logically and hierarchically
organized to reflect the content of a PDF file
– Content Streams - objects that contain instructions which define the appearance
of the page.
Structure of PDF files
• Object
– Indirect objects
• objects referenced by a number
– Direct objects
• objects that are not referenced by a number
– Object types: Boolean, Numeric, String,
Name, Null, Array, Dictionary, Stream
Structure of PDF files
• File Structure
– Header: the first line of a PDF file which specifies the version
number of PDF specification which the document uses. Header
format is “%PDF-[version number]”.
– Body: contains all the PDF objects. The body is used to hold all
of the document's data that is shown to the user.
– Cross reference: a table that records the byte offset of every
indirect object in the file and allows random access to objects,
so the application does not need to read the whole file
to locate a particular object
– Trailer: provides relevant information about how the application
reading the file should find the cross reference table and other
special objects. The trailer also contains information about the
number of revisions made to the document. All PDF readers
should begin reading a file from this section.
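The header and trailer described above can be located with a few lines of Python. This is a minimal sketch, not a full PDF parser: the function name `read_pdf_layout` and the hand-written `SAMPLE` file are assumptions made for illustration, and only the last 1 KiB of the file is searched for the trailer.

```python
import re

def read_pdf_layout(data: bytes) -> dict:
    """Return the header version and the last startxref offset of a PDF byte string."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    # The trailer sits at the end of the file, so only the tail is searched.
    offsets = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "startxref": int(offsets[-1]) if offsets else None,
    }

# A tiny hand-written file, used only for illustration.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n45\n%%EOF\n"
)

layout = read_pdf_layout(SAMPLE)
print(layout["version"], layout["startxref"])
```

In the sample, the number after "startxref" is the byte offset at which the "xref" keyword begins, which is how a reader finds the cross reference table without scanning the whole file.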
Techniques and possible attacks via PDF files
• Protected mode uses the sandbox technique in order to
create an isolated environment for the Acrobat Reader
rendering agent to run while reading a PDF file.
• JavaScript code attack (1/2)
– PDF files can contain client-side JavaScript code for
legitimate purposes including: 3D content, form
validation, and calculations.
– The primary goal of the malicious JavaScript code
inside a PDF file is to exploit a vulnerability in the
PDF viewer in order to divert the normal execution
flow to the embedded malicious JavaScript code
Techniques and possible attacks via PDF files
• JavaScript code attack (2/2)
– performing a heap spraying attack, as
implemented through JavaScript
– Another malicious activity that can be carried
out using JavaScript is downloading an
executable file from the Internet
Techniques and possible attacks via PDF files
• Code obfuscation is legitimately used to prevent reverse
engineering of proprietary applications
• It can also be used by attackers to conceal malicious JavaScript
code from being recognized
Obfuscation techniques and their details:
– Separating malicious code over multiple objects: malicious code is
spread among multiple objects. Code chunks are collected, merged, and
compiled to form a malicious piece of code only during runtime
– Applying filters: filters are used to conceal malicious code
– White space randomization: random white spaces are inserted in the
malicious code in order to evade recognition by signature-based
maliciousness detectors. White spaces do not affect the code since
JavaScript ignores them
Techniques and possible attacks via PDF files
Obfuscation techniques and their details (continued):
– Comment randomization: random comments are inserted in the
malicious code in order to evade recognition by signature-based
maliciousness detectors
– Variable name randomization: variable names are changed randomly in
order to fool signature-based maliciousness detectors
– Integer obfuscation: numbers are represented in a different way. For
example, this can be used to hide a specific memory address
– String obfuscation: strings are changed in order to make it
difficult for a human analyst to understand the code, for example by
splitting a string into several substrings
– Function name obfuscation: the name of the function used, which
can provide a clue about the code's intention, is hidden. This is done
by creating a pointer with a random name to the required function
Techniques and possible attacks via PDF files
Obfuscation techniques and their details (continued):
– Advanced code obfuscation: a string can hold encrypted malicious code. The
decryption process takes place during runtime, just before usage.
Metadata fields and even the document's words can also be used to
store malicious code
– Block randomization: the syntax of the code is changed but not its
action
– Dead code: blocks of code that are not intended to be executed are
inserted
– Pointless code: blocks of code that do not perform anything are
inserted
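To illustrate why white space and comment randomization defeat naive signature matching, the sketch below compares a plain snippet with a cosmetically obfuscated one. The signature string, both snippets, and the `normalize_js` helper are hypothetical; the regex-based normalization is a simplification (it would also strip "//" sequences inside string literals) and is not taken from any of the surveyed tools.

```python
import re

def normalize_js(source: str) -> str:
    """Strip comments and all whitespace so cosmetic randomization no longer matters."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", "", source)                    # line comments
    return re.sub(r"\s+", "", source)                           # all whitespace

SIGNATURE = "eval("  # hypothetical byte-level signature

plain = 'var r = eval(code);'
obfuscated = 'var r = eval /* x9 */ \n\t ( code ) ; // pad'

print(SIGNATURE in plain)                     # the raw signature matches
print(SIGNATURE in obfuscated)                # the raw signature is evaded
print(SIGNATURE in normalize_js(obfuscated))  # matches again after normalization
```

Normalization of this kind only addresses the cosmetic techniques; variable renaming, string splitting, and runtime decryption need token-level or dynamic analysis.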
Techniques and possible attacks via PDF files
• Embedded files attack
– A PDF file can contain other file types inside of it, for example,
HTML, JavaScript, SWF, XLSX, EXE, Microsoft Office files or
even another PDF file
– An attacker can use this functionality in order to embed a
malicious file inside a benign file.
– The PDF viewer will not allow the launching of an embedded
executable file because of its blacklist
Techniques and possible attacks via PDF files
• Mimicry attacks attempt to change a malicious file's
structure and objects so that the file is similar to a benign
file.
– embedding malicious EXE payload into a benign PDF file
– embedding a malicious PDF file into a benign PDF file
– JavaScript injection, in which malicious JavaScript code is
embedded in the PDF file
Techniques and possible attacks via PDF files
• Form submission and URI attack
– Adobe Reader supports the option of
submitting the PDF form from a client to a
specific server using the /SubmitForm
command
– Adobe Reader generates an FDF file from a PDF in
order to send the data to a specified URL. If
the URL belongs to a remote web server, the
server is able to respond. Responses are temporarily
stored in the %APPData% directory and
automatically open in the default web
browser
Advanced methods for the detection of malicious
PDF files
• Taxonomy of academic research on detection methods
of malicious PDF files
Advanced methods for the detection of malicious
PDF files
• Detection methods based on static analysis
– Includes methods aimed at statically analyzing the embedded
JavaScript code inside the PDF files
– Conduct static analysis based on the PDF file's metadata.
– JavaScript analysis
• Both methods apply machine learning algorithms to the tokenized code in
order to build a classification model and classify new, unfamiliar PDF files
after the embedded JavaScript code has been extracted from them
– Metadata analysis
• Analyze a PDF file by examining its metadata
• These approaches share a focus on global or statistical information about
the PDF file's objects and structure, rather than on its actual content
Advanced methods for the detection of malicious
PDF files
• Detection methods based on JavaScript analysis
• Lexical analysis
– Srndic and Laskov introduced PJScan
– One-Class Support Vector Machine (OCSVM), a machine learning method,
is used to automatically construct models from available data for
subsequent classification of new data.
– The feature extraction component makes use of an open source PDF
rendering library called “POPPLER” to search for embedded JavaScript
code in a document
– After the JavaScript code has been found and extracted, a lexical analysis
is performed on it using “Mozilla SpiderMonkey”
Advanced methods for the detection of malicious
PDF files
• Detection methods based on JavaScript analysis
• Clustering
– Vatamanu et al. introduced two different static methods for clustering PDF
files based on tokenization of their embedded JavaScript.
– The first is hierarchical bottom-up clustering and the second is hash table
clustering.
– The clustering methods identify similar scripts that have been
obfuscated using different techniques
– Each file's fingerprint is a set of unique JavaScript tokens and their frequencies
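A token-frequency fingerprint of this kind can be sketched as follows. The tokenizer, the small keyword list, and the choice to collapse all identifiers into one `ID` class and all numbers into `NUM` are assumptions for illustration, not the authors' actual tokenization.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z_$][\w$]*|\d+|\S")
KEYWORDS = {"var", "function", "for", "if", "else", "return", "while", "new"}

def fingerprint(js: str) -> Counter:
    """Count token classes: keywords and punctuation as-is, identifiers as ID, numbers as NUM."""
    counts = Counter()
    for tok in TOKEN.findall(js):
        if tok in KEYWORDS:
            counts[tok] += 1
        elif tok[0].isdigit():
            counts["NUM"] += 1
        elif re.match(r"[A-Za-z_$]", tok):
            counts["ID"] += 1
        else:
            counts[tok] += 1
    return counts

a = fingerprint("var x = 1 + y;")
b = fingerprint("var qzt=42+ k9 ;")  # same script after renaming and re-spacing
print(a == b)
```

Because renamed variables and changed spacing leave the fingerprint unchanged, scripts that differ only in those respects can fall into the same cluster.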
Advanced methods for the detection of malicious
PDF files
• Detection methods based on Metadata analysis
• Keywords analysis
– Maiorca et al. introduced the PDF Malware Slayer (PDFMS), a static
analysis tool which characterizes PDF files according to the set of
embedded keywords and their occurrence
– Consists of two modules: a data retrieval module which retrieves files for
the training and testing phases, and a feature extractor module which
determines the type of features to be used by the classifier
– To retrieve the keywords from the PDF file, the authors used the PDFiD tool
(a Python script)
– The files were characterized by keywords such as /JS, /JavaScript, /Encrypt,
obj, stream, filter, etc.
– Their main contribution is the ability to detect malicious PDF files whether or
not they contain JavaScript code, unlike previously described tools such as
PJScan
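Keyword-occurrence features of the kind described above can be sketched with a simple byte scan. This is not PDFiD itself: the `keyword_counts` function, the subset of keywords, and the sample bytes are assumptions for illustration, and the boundary checks exist only so that "obj" is not counted inside "endobj" or "stream" inside "endstream".

```python
import re

KEYWORDS = [b"/JS", b"/JavaScript", b"/Encrypt", b"obj", b"stream"]

def keyword_counts(data: bytes) -> dict:
    """Count how often each keyword occurs in the raw bytes of a PDF file."""
    counts = {}
    for kw in KEYWORDS:
        # PDF names start with "/", so only bare keywords need a left boundary.
        before = b"" if kw.startswith(b"/") else rb"(?<![A-Za-z])"
        pattern = before + re.escape(kw) + rb"(?![A-Za-z0-9])"
        counts[kw.decode("ascii")] = len(re.findall(pattern, data))
    return counts

sample = (
    b"1 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"2 0 obj\n<< /Length 0 >>\nstream\n\nendstream\nendobj\n"
)
features = keyword_counts(sample)
print(features)
```

The resulting dictionary is the kind of fixed-length feature vector a classifier can be trained on, regardless of whether the file contains JavaScript.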
Advanced methods for the detection of malicious
PDF files
• Detection methods based on Metadata analysis
• Hierarchical structure analysis
– Srndic and Laskov introduced a high performance static method for the
detection of malicious PDF documents which, instead of analyzing
JavaScript or any other content, makes use of essential differences in the
structural properties of malicious and benign PDF files.
– When an attacker injects malicious content into the PDF file, the file
structure inevitably changes.
– The PDF is parsed using the PDF parser, POPPLER. The parser extracts
structural paths from malicious and benign real-world PDF files, which are
used to create the training set.
– Two classification models were trained: an SVM (LibSVM) and a decision tree
(C5.0 inference implementation).
– Their main contribution is a novel technique for the detection of malicious
PDF files based on the difference between the underlying structural
properties of benign and malicious PDF files
Advanced methods for the detection of malicious
PDF files
• Detection methods based on Metadata analysis
• Content metadata analysis
– Smutz and Stavrou presented PDFRate, a framework which is based on
meta-features extracted from a document's content for the detection of
malicious PDF files
– The process is based on the use of a self-implemented reliable parser for
feature extraction, because existing tools are unable to deal with malformed
documents.
– Two data sources were used for the research: the first is the Contagio
dataset collection and the second is based on monitoring the HTTP traffic of a
large university's network.
Advanced methods for the detection of malicious
PDF files
• Detection methods based on Metadata analysis
• Term frequency and entropy analysis (1/3)
– In contrast to the aforementioned approaches, which rely upon a PDF parser's
ability to extract relevant data from objects embedded in the PDF file, the
following study proposes two different detection methods that do not employ a
PDF parser
– Pareek and Eswari introduced two static analysis methods for the detection
of malicious PDF files. The first method is based on entropy, and the second is
based on n-gram term frequency
– The first method uses entropy to measure the uncertainty or
randomness in a given dataset. A file is represented as a set of byte
sequences
– Low entropy of a file is not a strong indicator of maliciousness; however, it
can be a useful feature in combination with other features.
– The second method, the n-gram based approach, takes substrings of a
given large string, where the n-gram units can be words or bytes
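Both features can be computed directly from raw bytes, without a PDF parser. The sketch below uses the standard Shannon entropy formula and byte-level n-grams; the function names and the default of n = 2 are assumptions for illustration rather than the parameters used in the study.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())

def ngram_frequencies(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping byte n-gram in the data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

print(shannon_entropy(b"aaaa"))
print(shannon_entropy(bytes(range(256))))
print(ngram_frequencies(b"abab", 2))
```

A file made of one repeated byte has zero entropy, while a file in which every byte value is equally likely reaches the maximum of 8 bits per byte.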
Advanced methods for the detection of malicious
PDF files
• Detection methods based on Metadata analysis
• Term frequency and entropy analysis (2/3)
– The following two works take a different approach and focus on the
development of a practical network IDS aimed at the detection of
malicious PDFs that pass through that network component.
– The first work presented is that of Kittilsen in which he attempted to
implement an anomaly based network IDS, which employs an SVM
classifier to detect malicious PDF files.
– The IDS uses SNORT, u2boat and tcpflow tools to extract PDF files from
the network stream to the hard drive
– The classification process begins offline after a period of time, and the user
has access to the file in the meantime.
– The author's own pdfextract.py script, written in Python, was used to extract
18 string features from the file and count their occurrences.
Advanced methods for the detection of malicious
PDF files
• Detection methods based on Metadata analysis
• Term frequency and entropy analysis (3/3)
– The second work was presented two years later by Knut Borg as a
continuation of Kittilsen's research
– This thesis focuses on online detection of PDF files, while Kittilsen's thesis
featured offline detection. Kittilsen's proposed IDS extracted PDF files from
the network traffic to the local hard drive and then executed a classification
algorithm to detect maliciousness.
– The thesis concludes that the detection system in its current
form should not be implemented in a real environment because of its many
faults, including the limitations of SNORT
– Due to reasons of insufficient applicability, the last two works described
above will not be listed as solutions in the summary tables presented in the
upcoming section.
Advanced methods for the detection of malicious
PDF files
• Detection methods based on dynamic analysis
• All of the following dynamic analysis methods focus on the
analysis of embedded JavaScript code
• The first sub-category presents studies that statically extract the
JavaScript code and includes three methods.
• Two of these methods, MDScan and PDF Scrutinizer, start with
a static extraction of the embedded JavaScript code from a PDF
file and then execute the extracted code using a JavaScript
engine.
• The third method, ShellOS V1 also appears in the second sub-
category of dynamic extraction as ShellOS V2
• MPScan also belongs to this second subcategory of dynamic
extraction as it extracts the JavaScript code dynamically during
runtime
Advanced methods for the detection of malicious
PDF files
• Detection methods based on dynamic analysis
• Static JavaScript extraction
– MDScan and PDF Scrutinizer rely on a PDF parser that should be capable
of parsing the PDF file, locating the embedded JavaScript, and extracting it
– Tzermias et al introduced the design and implementation of MDScan, a
standalone malicious document scanner which uses both static and
dynamic analysis methods to detect malicious PDF files.
– Then it pulls out the embedded JavaScript code and examines it by actually
running it on a SpiderMonkey JavaScript engine
– String variables in use are analyzed dynamically during execution for
some form of shellcode
– MDScan does not rely on previously known vulnerabilities and thus, is able
to detect malicious PDF documents which exploit unknown vulnerabilities
(zero-day) in PDF readers.
– The benign dataset consisted of 2000 benign PDF files found in Google.
Evaluation results show a TPR of 89% and an FPR of 0%
Advanced methods for the detection of malicious
PDF files
• Detection methods based on dynamic analysis
• Static JavaScript extraction
– Schmitt et al introduced PDF Scrutinizer, a malicious PDF detection and
analysis tool that also uses static and dynamic analysis methods to detect
maliciousness
– It consists of three modules. The first is a parser, which simulates the way
Adobe Reader parses a document
– The second is an action extractor
– The third module consists of an actions executor
– During execution, the libemu library is used to analyze variable values for the
existence of shellcode
– Both static and dynamic heuristics are applied to detect maliciousness
– Static heuristics focus on JavaScript code string analysis to find signatures
of known suspicious, vulnerable, or malicious functions
– Dynamic heuristics focus on the detection of malicious code behavior
Advanced methods for the detection of malicious
PDF files
• Detection methods based on dynamic analysis
• Static JavaScript extraction
– The following study differs from the previous work presented in this section
in several respects.
– First, ShellOS is an operating system
– Second, unlike previous runtime analysis techniques that use software-
based CPU emulation, the proposed framework leverages hardware
virtualization technology
– Finally, it cannot examine a PDF file as a whole, and instead it relies on a host
operating system
Advanced methods for the detection of malicious
PDF files
• Detection methods based on dynamic analysis
• Static JavaScript extraction
– Snow et al presented ShellOS, a framework for the detection of code
injection attacks, based on code analysis during runtime
– ShellOS is a new lightweight operating system kernel designed for efficient
execution of code streams.
– ShellOS runs as a guest under a host operating system using the Kernel-based
Virtual Machine (KVM)
– When shellcode is executed, ShellOS collects useful information, such as
logged function names and parameters.
– The increased analysis performance enables the framework to process
more of the network stream and execute longer code sequences
Advanced methods for the detection of malicious
PDF files
• Detection methods based on dynamic analysis
• Dynamic JavaScript extraction
– Lu et al. introduced MPScan, a technique that integrates static malware
detection and dynamic JavaScript de-obfuscation
– MPScan is composed of two modules: an embedded code extraction
module and a multilevel malware detection module that includes a
shellcode/heap spraying detection component and an opcode signature
matching component that searches for malicious signatures in the
JavaScript opcode
– The extracted code is then evaluated by the static detection module
– Previous methods such as MDScan and PDFphoneyC statically parse the
PDF file and extract JavaScript code and then examine the code
dynamically by running it in the emulated environment of the SpiderMonkey
JavaScript engine
– For the evaluation phase, the authors collected 198 malicious PDF samples
from the Internet and nine malicious PDF samples from the Metasploit
framework.
Advanced methods for the detection of malicious
PDF files
• Advanced methods and coping with existing attacks (1/2)
• Each of the aforementioned analytical approaches has its pros
and cons
• Consequently, a hybrid detection framework meshing static and
dynamic detection techniques could reduce the likelihood of
evasion of the detection mechanism by a malicious PDF.
• With static analysis, the malicious code inside the PDF does not know that
it is being analyzed, because the file is not opened by the PDF reader or by an
emulator.
• The static analysis approaches can be divided roughly into two
groups: the first group analyzes the JavaScript code embedded
inside the PDF in a variety of representations. The second group
relies upon meta-feature based approaches and focuses on the
content and structure of the PDF file
Advanced methods for the detection of malicious
PDF files
• Advanced methods and coping with existing attacks (2/2)
• Looking at the disadvantages, static analysis can be evaded
using code obfuscation
• Whenever machine learning methods based on static analysis
are used for detecting unknown malicious code applications,
there is a question about the capability of the suggested
framework for detecting obfuscated code inside PDF files
• We have also presented studies employing a dynamic analysis
approach for detecting malicious PDF files
• In most of these studies, this approach dynamically runs the
JavaScript code embedded in a PDF file by performing pre-static
analysis of the PDF file in order to extract JavaScript code which
will be analyzed dynamically.
Dataset collection and preliminary analysis
• Acquired a total of 50,908 PDF files, including 45,763 malicious and
5,145 benign files, from four sources
• The malicious PDF files contain several types of malware families such as
viruses, Trojans, and backdoors. We also included obfuscated PDF files.
• Analysis of our large dataset of 50,908 files by the parser shows that most
of the malicious files are not compatible with the PDF file format
specifications
Dataset collection and preliminary analysis
• The incompatibility observed was located at the end of the file, in the
line between the “startxref” and “%%EOF” lines.
• This line should contain a number serving as a reference (offset) to
where the last cross reference table section is located in the file.
• In cases of incompatibility, the number that appears is incorrect.
• The accompanying table includes the number of compatible files (bracketed)
in each of our collected datasets.
• Note that while incompatible benign files were not present in our
dataset, this does not mean that there weren't any incompatible
benign files.
• It might, however, suggest a very low probability of incompatibility
among benign files, which supports our observation
mentioned above
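The compatibility check described above can be sketched as a test of whether the number after "startxref" really points at a cross reference section. This is a minimal sketch under stated assumptions, not the parser used in the survey: the `is_compatible` name and the sample bytes are hypothetical, and accepting an indirect object header at the offset (for cross reference streams) is a choice made here.

```python
import re

def is_compatible(data: bytes) -> bool:
    """Return True if the startxref offset points at a cross reference section."""
    offsets = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    if not offsets:
        return False
    offset = int(offsets[-1])
    target = data[offset:offset + 32]
    # Classic table: the keyword "xref". Cross reference stream: an object header.
    return target.startswith(b"xref") or re.match(rb"\d+\s+\d+\s+obj", target) is not None

BODY = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
TAIL = b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"

good = BODY + TAIL + b"startxref\n%d\n%%%%EOF\n" % len(BODY)
bad = BODY + TAIL + b"startxref\n%d\n%%%%EOF\n" % (len(BODY) + 7)  # wrong offset

print(is_compatible(good), is_compatible(bad))
```

A file with a missing or wrong offset is reported as incompatible, which matches the observation that the incorrect number appears in the line between “startxref” and “%%EOF”.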
Our suggested active learning based
framework
• In this survey we presented many studies that were based on
machine learning approaches and were successfully used to induce
malicious PDF detection models. However, all of them focus on
passive learning
• With passive learning, the induced detection model, however accurate,
quickly becomes obsolete since it is incapable of adaptive
learning and integrating new malicious PDF files
• The detection model must be sustained and updated with newly
labeled, informative PDF files
• In cases in which the PDF files are labeled as malicious by the
human expert, they will be used to update the antivirus tool as well,
which is currently the most common solution for organizations.
Our suggested active learning based
framework
• The PDF files transported over the Internet are collected and
scrutinized within our framework
• Then, the “known files module” filters out all the known benign and
malicious PDF files, including those matched by antivirus signatures
• The unknown PDF files are then checked for their compatibility as
viable PDF files
• The incompatible PDF files are immediately blocked from being
transported into the organizational network
• Since only compatible files are relevant for organizations and
innocent users, just these files are transformed into vector form for
the advanced check
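The filtering order described above can be sketched as a small triage function. Everything named here is hypothetical: the source does not say how the known files module recognizes files, so a SHA-256 lookup is assumed, and the compatibility check is passed in as a callable rather than implemented.

```python
import hashlib
from typing import Callable, Set

def triage(data: bytes,
           known_benign: Set[str],
           known_malicious: Set[str],
           is_compatible: Callable[[bytes], bool]) -> str:
    """Return 'allow', 'block', or 'analyze' for one PDF file."""
    digest = hashlib.sha256(data).hexdigest()  # assumed lookup key
    if digest in known_malicious:
        return "block"
    if digest in known_benign:
        return "allow"
    if not is_compatible(data):
        return "block"    # incompatible files never reach the organization
    return "analyze"      # compatible and unknown: goes on to the advanced check

seen = b"%PDF-1.4 known file"
benign = {hashlib.sha256(seen).hexdigest()}

print(triage(seen, benign, set(), lambda d: True))
print(triage(b"%PDF-1.4 new file", benign, set(), lambda d: False))
print(triage(b"%PDF-1.4 new file", benign, set(), lambda d: True))
```

Only the files that come back as "analyze" need to be transformed into vector form, which is where the reduction in analysis effort comes from.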
Our suggested active learning based
framework
• This framework provides detection solutions for both instances,
whether the malicious file is compatible or not, and it does so more
efficiently than any other solution that exists today.
• The framework uses the insight that most of the malicious files are
incompatible as a first layer of filtering, and not as a detection rule.
• As noted, there is no reason to open an incompatible file, be it
benign or malicious. Therefore, this understanding provides a
significant reduction (~96.5%) in the analysis effort spent on suspected
malicious files.
Our suggested active learning based
framework
• Specifically, JavaScript code attacks, embedded file attacks, and
form submission and URI attacks are the most common attacks
launched via PDF files, and all three are present in our dataset
• Because the dataset is large, representative, and based upon trusted
sources, our conclusion of high incompatibility among malicious files
is empirically well founded
• The PDF files which are compatible and unknown are then
introduced to the detection model which is a classifier induced by
Machine Learning algorithms.
• The Active Learning methods are aimed at efficiently updating the
detection model and antivirus tool in light of the creation of new PDF
files
Our suggested active learning based
framework
• We consider employing several algorithms in order to induce detection
models; one of them is the SVM classification algorithm with the
radial basis function (RBF) kernel in a supervised learning approach
• The kernel's projection into a higher-dimensional space makes the
induced model complex and thus more difficult for an attacker to
understand.
• The detection model scrutinizes PDF files and provides two outputs:
– A classification decision using the SVM classification algorithm
– Distance calculation from the SVM's separating hyperplane using Equation
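Both outputs can be derived from a single signed decision value. The sketch below uses the textbook form of the RBF-kernel SVM decision function, which may differ in notation from the equation the slide refers to; the support vectors, coefficients, bias, and gamma are made-up values, and treating the positive side as malicious is an assumption.

```python
import math
from typing import Sequence

def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float) -> float:
    """K(a, b) = exp(-gamma * ||a - b||^2)"""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def decision_value(x, support_vectors, dual_coefs, bias, gamma) -> float:
    """f(x) = sum_i (alpha_i * y_i) * K(sv_i, x) + b, with dual_coefs[i] = alpha_i * y_i."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

# Made-up model: one benign support vector (-1) and one malicious one (+1).
SVS = [[0.0, 0.0], [2.0, 0.0]]
COEFS = [-1.0, 1.0]

for point in ([2.0, 0.0], [1.0, 0.0], [0.0, 0.0]):
    f = decision_value(point, SVS, COEFS, bias=0.0, gamma=0.5)
    label = "malicious" if f > 0 else "benign"
    print(point, round(f, 4), label)
```

The sign of the value gives the classification decision, and its magnitude is proportional to the distance from the separating hyperplane in the kernel-induced space.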
Our suggested active learning based
framework
• Accordingly, in our context, there are two types of files that may be
considered informative.
• The first type includes PDF files in which the classifier has limited
confidence as to their classification.
– Acquiring them as labeled examples will probably improve the model's detection
capabilities.
– In practical terms, these PDF files will have new features or special combinations
of existing features that should fairly represent their operations and ambience
• The second type of informative file includes those that lie deep
inside the malicious side of the SVM margin and are at a maximal
distance from the separating hyperplane according to Equation
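Given a signed decision value for each unknown file, the two types of informative files can be picked out as sketched below. The function name, the `budget` parameter, the file names, and the convention that positive values lie on the malicious side are all assumptions for illustration.

```python
from typing import Dict, List

def select_informative(scores: Dict[str, float], budget: int) -> Dict[str, List[str]]:
    """scores maps a file name to its signed decision value (positive = malicious side)."""
    # Type 1: files closest to the hyperplane, where the classifier is least confident.
    low_confidence = sorted(scores, key=lambda f: abs(scores[f]))[:budget]
    # Type 2: files deepest inside the malicious side, farthest from the hyperplane.
    malicious_side = [f for f in scores if scores[f] > 0]
    deep_malicious = sorted(malicious_side, key=lambda f: -scores[f])[:budget]
    return {"low_confidence": low_confidence, "deep_malicious": deep_malicious}

scores = {"a.pdf": 0.05, "b.pdf": -1.7, "c.pdf": 2.4, "d.pdf": -0.1, "e.pdf": 0.9}
picked = select_informative(scores, budget=2)
print(picked)
```

The selected files are the ones worth sending to a human expert for labeling, after which they can be added to the training set.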
Our suggested active learning based
framework
• Training: A detection model is trained over an initial training set that
includes both malicious and benign PDF files.
• Detection and updating: For every unknown PDF file that is
transported over the Internet and passes through the framework, the
framework's detection model provides a classification, and its active
learning method provides a rank representing how informative the
file is
• The purpose of this framework is to provide a better solution than
random selection or passive learning employed nowadays
Discussion and conclusions
• This survey aimed to review the methods, techniques, and tools used for the
detection of malicious PDF files
• These PDFs are usually attached to emails that are sent to
organizations in order to perform the initial penetration of an APT
attack; their detection is therefore a significant concern which
requires attention
• One should note that we don't claim that every malicious PDF file is
incompatible
• And therefore, after the incompatibility check within our framework,
we aim at providing a comprehensive static and dynamic analysis
based on advanced Machine Learning algorithms and detection
models
• The framework does not rely upon the fact that most of the
malicious files are incompatible as a detection rule; in the case that an
attacker crafts a malicious PDF as an incompatible file, it will be
filtered out and will not be transported to the organizational network
Discussion and conclusions
• In this survey paper we do not provide an elaborate
segmentation on our dataset and the attacks which
occurred within it
• Based on this survey, the authors propose that the detection
model include a hybrid detection approach that conducts
both static and dynamic analysis
• For the static analysis phase, the key to precise and
sensitive detection is preliminary knowledge of the
primary attack and evasion techniques that could be
used by a PDF file
Discussion and conclusions
• All the extracted features mentioned in this article can be
leveraged by an ensemble of classifiers such that each
classifier will be induced from different sets of features.
• It was shown by Menahem et al. that using an ensemble
of classifiers with different features can significantly
improve detection capabilities
• This detection approach provides a comprehensive
indication of the file's purposes and is robust against
many evasion techniques
• This paper also suggests running each suspicious PDF
file through several versions of Adobe Reader in order to
compare its behavior
Discussion and conclusions
• Future work: while machine learning has been successfully used
to induce malicious PDF detection models, all methods utilizing this
approach focus on passive learning
• Another suggestion pertains to the fact that PDFs are one of the most common
types of files that act as malicious attachments; however, one cannot
ignore the phenomenon of malicious Microsoft Office files attached
to email
• We suggest combining email features (mentioned previously) with
features extracted from attached Microsoft Office files, thus
enhancing the detection of malicious office files as was explained in
reference to the PDF files
Q&A
Thank you for your attention!