Sym Bion
Sym Bion
Concrete Execution
Fabio Gritti, Lorenzo Fontana, Eric Gustafson, Fabio Pagani, Andrea Continella,
Christopher Kruegel, and Giovanni Vigna
University of California, Santa Barbara
{degrigis, lfontana, edg, pagani, conand, chris, vigna}@cs.ucsb.edu
Abstract—Symbolic execution is a powerful technique for routines. These routines, known as models, emulate the effects
exploring programs and generating inputs that drive them into of a specific environment interaction (e.g., syscalls or calls
specific states. However, symbolic execution is also known to to third-party libraries) on the program’s (symbolic) state.
suffer from severe limitations, which prevent its application to This approximation represents another distinct problem for
real-world software. For example, symbolically executing pro- symbolic execution since models must be developed for every
grams requires modeling their interactions with the surrounding
environment (e.g., libraries, operating systems). Unfortunately,
interaction the program has with the environment.
models are usually created manually, introducing considerable For these reasons, the task of symbolically executing a
approximations of the programs behaviors and significant impre- specific portion of an application’s code can be quite chal-
cision in the analysis. In addition, as the complexity of the system lenging. In fact, symbolically executing the program from
under analysis grows, additional models are needed, making this its entry point to the interesting part is often infeasible or
process unsustainable. For these reasons, in this paper we propose
a novel technique that allows interleaving symbolic execution
prohibitively slow. On the other hand, a mere “jump into the
with concrete execution, focusing the symbolic exploration only middle” of the application (as proposed in under-constrained
on interesting portions of code. We call this approach interleaved symbolic execution [8]) introduces the risk of encountering
symbolic execution. The key idea of our approach is to re-use invalid program states caused by missing initialization code or
the concrete environment to run the input program, and then unsatisfied data dependencies.
synchronize the results of the environment interactions with the
symbolic execution engine. As a consequence, our approach does
However, symbolic reasoning about specific portions of a
not make any assumption about such interactions, and it is program’s code is a very useful, and needed task, in various
agnostic with respect to the concrete environment. We implement fields. For example, consider a malicious program that, before
a prototype for this technique, S YMBION, and we demonstrate unpacking the code responsible for the communication with a
its effectiveness by analyzing real-world malware, showing that C&C server, executes some checks to fingerprint the execution
it allows us to effectively skip complex portions of code that do environment [9]–[12]. In this case, even if the interesting code
not need to be analyzed symbolically. is the one that implements the communication with the C&C,
Keywords—Computer Security, Reverse engineering, Malware. there is no other choice than starting to execute the program
inside the symbolic engine from its entry point, and waiting
for the communication routine to be unpacked. However,
this analysis is infeasible because the environmental checks
I. I NTRODUCTION performed by the malware sample might leverage un-modeled
OS interactions, which are required to successfully reach the
Symbolic execution has been shown to provide great ben-
unpacking code.
efits in analyzing programs and understanding their properties
and behaviors, especially because of its ability to generate Previous work proposed different techniques to either nar-
inputs that drive the execution toward interesting program row symbolic execution to a particular portion of code [7],
states [1]–[4]. In particular, in the computer security commu- [13], [14], or avoid the execution of expensive portions of code
nity, this technique has been used to improve the code coverage that are unnecessary at a specific point of the analysis [15].
of fuzzers [2], to reverse engineer malware [5], to discover However, most of these approaches cannot perform a runtime
backdoors in closed-source software [6], and to automatically analysis, i.e., they cannot modify data in memory to direct
generate exploits [2], [7]. Unfortunately, symbolic execution execution through specific branches at runtime [7], [16], [17].
is haunted by the state explosion problem [1], which prevents For instance, after a symbolic analysis, one could be interested
this technique from being effectively applied to real-world in flipping the condition of a branch to drive the concrete pro-
software. In fact, if used without any precaution, symbolic gram execution toward another path. Similarly, tools that are
execution tends to generate an exponentially increasing num- tightly coupled to their dynamic analysis framework cannot be
ber of states that depend on symbolic data. For this reason, used with arbitrary or non-traditional execution environments
most of the uses of symbolic execution need custom-tailored (e.g., embedded devices). For instance, most of the existing
heuristics to drive the exploration and to discard uninteresting tools leverage QEMU [18] for program emulation, but if the
states. Moreover, to further reduce the number of states, the program cannot be emulated (e.g., due to hardware dependen-
interactions of the program under analysis with the surround- cies), the only remaining solution is to run the application on
ing environment are approximated by manually implemented bare-metal. Unfortunately, this solution is often not compatible
with the design of the system used to symbolically analyze the II. M OTIVATION
target program.
When analyzing real-world programs, focusing on a
To address the aforementioned problems, in this paper we smaller portion of code makes the analysis more practical.
introduce interleaved symbolic execution, a novel approach However, two main issues affect symbolic execution engines:
for symbolically executing complex binaries. Our technique Missing Environment Models and Missing Data Dependencies.
provides granular control over the interleaving of concrete and The combination of these two problems often blocks the
symbolic executions of the analyzed program, and removes possibility to study a specific functionality of the program:
the need for implementing models for every environment in- it is not possible to initialize a symbolic execution engine at
teraction. Differently from the concolic testing approach [19], a specific point because of missing data dependencies, nor it
where the goal is to both symbolically and concretely execute is possible to reach that code from the entry point because of
a full trace of the target program to eventually improve missing environment models.
code coverage, we aim at relieving the environment interac-
tion modeling problem by re-using the concrete environment
to symbolically execute only interesting portions of code. A. Missing Environment Models
Nonetheless, the problem we discuss in this paper also af- This issue arises when the models required to simplify
fects concolic testing, and we, therefore, consider interleaved a target program’s execution are either missing or partially
symbolic execution orthogonal to concolic testing. implemented. Depending on the design choices made to de-
We implemented our approach in a tool, S YMBION, that velop a specific symbolic execution engine, this can happen at
uses a modular, programmatically-controlled concrete environ- three different levels: the API-level, the syscall-level, and the
ment where we execute the program under analysis until a instruction-level.
specified point of interest. Afterward, S YMBION synchronizes
API-level. Some symbolic engine implementations provide
the concrete state of the program with the symbolic execution
models for the most used functions (i.e., strcmp, strcpy, etc.)
engine, allowing for further exploration and precise data-flow
included in the C standard library (libc). The calling sites of
reasoning. Finally, leveraging the information gained from
these functions inside the program are then substituted with
the symbolic exploration, the program can then be steered
calls into the corresponding model implementation. While this
toward a chosen branch by concretizing any symbolic data in
choice has the benefit of relieving the symbolic execution
the program’s state and synchronizing its value back into the
engine from executing code related to those functions (i.e.,
concrete environment. By iterating this execution strategy, we
analyses are faster, and propagate symbolic data more accu-
can delegate the execution of the complex portions of code
rately), it puts the burden on the developers, who have to
to the concrete environment, and apply symbolic execution
implement a large number of models to support the analysis
over arbitrary parts of the program. Moreover, our approach is
of complex programs. If a model for a specific function is
agnostic with respect to the concrete environment, providing
missing, there are two possible solutions. The first one is to
support for native targets running in any environment or
let the symbolic engine explore the real code of the function,
architecture (e.g., debugger, emulators, and hardware devices).
while the second one is to skip the function body altogether
We implement S YMBION on top of angr [4], and we and return a symbolic variable at the call site. Unfortunately,
evaluate it by analyzing real-world malware, showing how both solutions can introduce enough complexity to trigger a
interleaved symbolic execution can effectively support ana- state explosion.
lysts when reverse-engineering malicious code. Specifically,
we show how S YMBION can be effectively used, without Syscall-level. Similarly to API-level models, the syscall-level
requiring the modeling of complex environment interactions, to models are designed to mimic the effects of system calls made
detect the usage of a Domain Generation Algorithm (DGA), to a particular OS kernel. The majority of symbolic execution
to reverse-engineer the C&C commands required to trigger engines provide models at this level because it avoids the need
malicious functionality, and to study the evasion techniques to place the entire kernel in the emulated environment. Since
employed by malicious code. system calls have substantial side-effects on the system and the
user-land programs, models at this level must be complete and
In summary, we make the following contributions: precise or the symbolic execution will reach incorrect states.
• We introduce interleaved symbolic execution, a novel Instruction-level. A common technique to support symbolic
approach that fuses concrete and symbolic execution execution on different architectures is to lift a program’s code
by synchronizing the concrete state of a program (i.e., express assembly instructions with a higher-level repre-
under analysis with a symbolic execution engine, sentation). In this way, symbolic execution engines can reason
allowing the concrete execution to be steered by the on this lifted representation and use it to update the program’s
results of the symbolic exploration. state accordingly. Unfortunately, modern CISC architectures,
• We implement our approach in S YMBION, a sys- such as Intel x86, count more than a thousand valid instructions
tem that we release as an open-source project under and most of them have side-effects. This still represents an
angr’s master1 . important challenge for the symbolic execution of intermediate
representations, because instructions can be unsupported, lifted
• We show the effectiveness of S YMBION by presenting incorrectly, or have an incorrect or partial implementation. In
analyses of real-world malware samples. all these cases, invalid program states are generated and the
1 https://github.com/angr/angr analysis results are unreliable.
1 func: 1 void heavily_packed_function(char *config){
2 mov rax, [0x0000555555774000]; 2 evasion_checks(config);
3 add rax, 0x10; 3 /* [ block of packed/encrypted code] */
4 mov [0x0000555555774008], rax; 4 }
5 _loop: 5 char* download_config()
6 mov rcx, [0x0000555555774008]; 6 {
7 dec rcx; 7 HINTERNET hOpen = NULL, hFile = NULL;
8 mov [0x0000555555774008], rcx; 8 DWORD dwBytesRead = 0;
9 jnz _loop; 9 char config[2000];
10 /* [ omitted code ] */
11 hOpen = InternetOpenA();
Fig. 1: Data dependency example 12 hFile = InternetOpenUrlA(hOpen,
13 "www.evildomain.com/conf.cfg");
14 InternetReadFile(hFile, (LPVOID)config, ...);
15 /* [ omitted code ] */
B. Missing Data Dependencies 16 return config;
17 }
Starting the symbolic exploration at arbitrary functions 18 void parse_config(char *malware_config){
allows to skip code not relevant for the analysis or containing 19 int config_len = atoi(malware_config[0])
missing environment models. This operation is allowed by 20 int xor_key_len = atoi(malware_config[1])
many symbolic execution engines, which let their users initial- 21 char xor_key[xor_key_len];
22
ize and start the execution at specific, arbitrary addresses in the
23 for(int i=2, j=0; i<xor_key_len; i++,j++){
program. However, to provide reliable results, this approach 24 xor_key[j]= malware_config[i]
requires forging a valid program state (i.e., the values of the 25 }
registers and the memory) and to correctly express its data de- 26
pendencies. These dependencies are generated when the code 27 for(int i=xor_key_len, i<config_len; i++){
28 malware_config[i]= malware_config[i] ˆ
that we are symbolically executing refers to data produced 29 xor_key[i%xor_key_len]
by the execution of previous code. This makes the option of 30 }
manually providing the initial state an error-prone process. For 31 }
example, as shown in Figure 1, the effect of missing data 32 int main( int argc, char **argv){
dependencies amplifies the problematic aspects of symbolic 33 char *malware_config = download_config();
34 parse_config(malware_config);
execution even more, and generally leads to a state explosion 35 decrypt_function(&heavily_packed_function,
problem. In this example, if the data required for the com- 36 malware_config.decryption_key)
putation is not present at address 0x0000555555774000 37 heavily_packed_function(malware_config)
the symbolic execution engine will store symbolic data at 38 }
0x0000555555774008. This value is later used to control
the loop termination (Line 9), and, since it is an unconstrained Fig. 2: Malware pseudocode example
symbolic variable, it can theoretically execute an unbounded
number of times, leading to a state explosion.
previous routine (parse_config). Moreover, the code in
C. Motivating Example heavily_packed_function is partially packed, and in
Consider the snippet of code in Figure 2 that implements the normal execution flow of the program, is decrypted by the
malicious code. The goal is to symbolically explore the func- unpacking routine decrypt_function. Therefore, starting
tion called heavily_packed_function. We could start the exploration from heavily_packed_function
to symbolically execute the malware from main; however, would cause the analysis to fail because the code that
let us assume that we stumbled into the missing API-level we want to analyze is revealed only after executing
model problem, and that the engine does not have a sym- decrypt_function.
bolic implementation for the WinAPI at Lines 11, 12, and Finally, another possible and commonly used solution to
14 (InternetOpenA, InternetOpenUrlA, and InternetReadFile, these problems is to implement the missing models to enable
respectively). As a result, the download_config function the symbolic execution to safely reach the interesting portion
returns a buffer containing symbolic data, which is then passed of code. However, this approach requires time and effort, and it
as a parameter to parse_config at Line 34. This function is not always scalable if the number of required models is very
uses this buffer to initialize the variable xor_key_len (Line large. Consider, for example, complex programs like browsers,
20) that therefore will be symbolic as well. Unfortunately, media players, or even a kernel; the amount of work necessary
since this variable controls the loop’s termination, this causes to implement the missing models can be unmanageable.
an unlimited number of iterations, and a consequent early state
explosion, that prevents the symbolic executor from reaching
the target function heavily_packed_function. III. A PPROACH
As we previously discussed, a possible way to Our motivating example demonstrates the challenges of
avoid these problems is to start the exploration from applying simple symbolic execution approaches to real mal-
heavily_packed_function itself. However, this ware samples, highlighting the need for techniques that allow
function calls another routine (evasion_check at Line one to effectively explore arbitrary program functions. For
2) that depends on specific data (config) processed by a this reason, in this paper, we propose interleaved symbolic
execution, a technique that aims to enable a precise symbolic complex programs (Table I) across different environments.
exploration of arbitrary functions. The key idea behind this
approach is to symbolically execute a specific target function • Run-time analysis: whether the approach allows users
in its correct context. In practice, we concretely run the to modify the target’s memory at runtime, therefore
program under analysis until the initialization of such context steering the execution toward specific paths.
is complete, and, after that, we synchronize the concrete state • Multi-Targets Support: whether the approach supports
with the symbolic execution engine, replicating the concrete the execution of targets running inside different envi-
environment. Finally, starting from our synchronized state, we ronments (e.g., embedded systems).
bootstrap our symbolic execution analysis. Essentially, our
approach allows for the interleaving of symbolic execution • Programmatic Context Switching: whether the tool
with concrete execution, focusing the benefits of the symbolic provides a programmatic interface to change the
exploration on specific portions of code. More precisely, given execution context of the program from concrete to
a binary program, interleaved symbolic execution is carried out symbolic, and vice-versa.
in three phases. • Instrumentation Independency: whether the approach
is self-contained or strictly relies on a specific envi-
Phase 1: Initial Concrete Execution. We concretely exe- ronment or on another component to analyze the target
cute the input binary from a Concrete Starting Point (CSP) binary (e.g., QEMU, Pin). This makes the approach
to an arbitrary Point of Interest (PoI). Generally, at the less general, as it is limited by the capabilities of the
beginning of the analysis, the CSP is the entry point of the underlying instrumentation layer.
binary. When the program’s execution reaches the PoI, we
stop it in that specific state (i.e., memory and registers). As we can see in Table I, the most similar approach is im-
plemented by Avatar2 . The tool can be used to implement run-
Phase 2: Interleaved Symbolic Execution. Once the con- time analyses and its flexible interface can be used to easily
crete execution reaches our PoI, we synchronize the con- support different execution environments, moreover, Avatar2
crete execution state with the symbolic engine. In practice, does not need specific instrumentation of the target binary to
we instantiate a new symbolic state, we replicate the perform analyses. However, a programmatic context switching
concrete state (i.e., by reconstructing the memory layout is not officially supported by the tool. Triton implements a
and the registers’ values), and we set parts of the state similar approach, but due to its dependency from an instru-
(i.e., memory or registers) as symbolic. For instance, we mentation framework (Pin [20]), it does not have fully multi-
might set as symbolic specific memory areas to uncover targets support, moreover, it lacks an interface to implement
their data dependencies within the next blocks of code. programmatic context switching. Finally, S2 Eand Mayhem,
We then explore the program symbolically until we reach were not conceived to perform these kinds of analyses and
a Target Point (TP). they miss a proper interface to perform run-time analyses
Phase 3: Re-Synced Concrete Execution. When the sym- and programmatic context switching. Table I summarizes the
bolic execution reaches our target point, TP, we collect the comparison of such tools using the described features.
constraints introduced during the exploration and evaluate
them to obtain, for each symbolic variable, a concrete value IV. S YSTEM D ETAILS
that satisfies the collected constraints. Finally, we store
the results of our symbolic exploration into the concrete We implemented S YMBION on top of the multi-platform
process memory to drive the concrete execution from PoI binary analysis framework angr [4]. The software is written
to TP, thus achieving the exploration of our target point. in Python, and its versatility and flexibility are well-suited to
implement and prototype new research ideas.
Note that we can iterate over our three phases by consid-
ering the TP as a new CSP (or a new PoI), therefore further A. Design Requirements
exploring the input program.
One of S YMBION’s goals is to support switching between
Our approach enables the delegation of the execution of concrete and symbolic execution, avoiding expensive transfers
code affected by missing environment models to a concrete
environment, and reconstructs an execution state that allows
for the symbolic execution of the program from the PoI to the TABLE I: Features comparison between S YMBION and related
TP, without being affected by the missing data dependencies works. (3) means the tool has the specified feature. (7) means
issue. Moreover, our system is agnostic with respect to the the tool does not have the specified feature. (∼) means the tool
specific concrete environment and only requires a simple partially implements the specified feature.
interface to control the program’s execution. This allows for
the flexible application of this approach to a broad range Run-time Multi-Targets Programmatic Instrumentation
Analysis Support Context Independency
of scenarios, e.g., running firmware on native hardware and Tool Switching
obtaining its execution state through JTAG debug ports.
S YMBION 3 3 3 3
Comparison with Existing Approaches. We compare S YM - Avatar2 3 3 ∼ 3
BION with the most similar related approaches. In our com- Triton 3 ∼ ∼ 7
parison, we take into account four different features that S2 E 7 7 7 7
Mayhem 7 7 7 7
are needed to support symbolic execution-based analyses of
Concrete environment Symbolic execution engine Symbolic execution engine Concrete environment
mem[W]= 0
mem[X] =sym_var_A mem[W]= 0
CSP PoI mem[X] =sym_var_A CSP
mem[Y]= sym_var_B PoI
mem[Z]= 7 mem[Y]= sym_var_B
mem[Z]= 7
Concrete values
from solver to
reach TP:
mem[X] = 6
TP
TP mem[Y] = 3
B. angr
of state. To implement the state synchronization from the
concrete world to the symbolic world, we need a way to Here we want to discuss the relevant concepts of angr,
stop the program’s execution at a PoI, copy the values of which plays a central role in S YMBION architecture.
all the registers, and import them inside the symbolic engine.
Moreover, we need to efficiently synchronize the memory SimState & SimEngine. angr executes programs by creating
of the concrete state inside the symbolic engine, and take simulated states (SimStates). A SimState is logically a snapshot
care of all the details related to the underlying architecture of a program in a specific state. A SimState memory can be
(e.g., in the x86 architecture we need to synchronize the manipulated to define some data as symbolic. To symbolically
segment registers). We refer to this transition using the no- execute the program, angr receives as input a SimState, and,
tation Concrete→Symbolic and we show an example of this by using an Execution Engine (SimEngine), applies the effects
transition in Figure 3. In Figure 3(a), we show three paths of the program’s instructions on that state to produce new
(P1 , P2 , P3 ) that start from the CSP, end in the PoI, and pass SimStates (called successors).
through a block of complex code. The concrete execution of SimPlugin. A SimPlugin is an object that holds additional
P1 causes the memory of the program at the PoI to contain data about a SimState. Moreover, it implements an interface
0,1,5,7 at addresses W,X,Y,Z, respectively. Now that the PoI to deal with the SimState lifecycle. The easiest example is
is reached, S YMBION automatically imports the concrete state the state.globals plugin that can be used to propagate
into the symbolic engine. To begin our symbolic analysis, information and data from one SimState to its successor.
using S YMBION, we set the memory at addresses X and Y
to be symbolic Figure 3(b) to eventually reason about paths in Simulation Manager & Exploration Technique. A sim-
the program influenced by these memory locations. ulation manager (SimManager) is an interface that affects
how successors are generated by applying search strategies
Now the symbolic engine can explore different paths (Exploration Techniques) to explore the program’s execution
until the TP. Once the TP is reached, a Symbolic→Concrete paths.
transition is started. To implement this transition, S YMBION, SimProcedures. SimProcedures are Python routines that
under the user guidance, applies the memory modifications angr uses to synthesize the effect of syscalls and external
needed in the concrete process to reach the TP. libraries functions on an input SimState. These routines are
imported during the initialization of the target program, and
Figure 4 shows an example of this type of transition. In they are executed during the symbolic execution of the target.
Figure 4(a) we have symbolically reached the TP, and by
asking the constraint solver for possible values for mem[X]
C. Symbion
and mem[Y] we get respectively 6 and 3. In Figure 4(b),
these values are synchronized into the concrete process, which To develop S YMBION, we implemented a new SimEngine
is eventually resumed to reach the TP. At this point, TP can called SimEngineConcrete, a new Exploration Technique
become a CSP state itself, effectively resuming the interleaved named S YMBION, and a new SimPlugin called Concrete.
symbolic execution of the program. Moreover, we introduced a new concept, called ConcreteTar-
get, that exposes an interface that has to be implemented to
Finally, we want to design a versatile interface to decouple support concrete execution in a new environment. An overview
our tool from the execution environment of the target program. of S YMBION’s architecture is shown in Figure 5.
new
Symbolic execution
engine
Concrete
environment
• Resume the execution of the program until a new PoI
SimState is reached, and give the control back to angr.
SimState
A possible implementation of this interface is the The source code of these examples is publicly available at
GDBConcreteTarget, which is used to programmatically https://github.com/degrigis/symbion-use-cases.
control the target program, running on a possibly remote
machine, under a gdbsever. A. Tracking C&C Domains Generation
SimEngineConcrete. The SimEngineConcrete leverages an For this example, we analyze a malicious PE32 binary that
implementation of the ConcreteTarget interface (e.g., the GDB- implements a password stealer distributed by spam emails.
ConcreteTarget) to perform the following steps:
• Modify the concrete process memory with the values TABLE II: Use Cases
provided by the user at specific memory addresses.
Type Family Use case OS MD5
• Modify the concrete process registers with the values
Trojan Symmi Detect DGA Win 221c235bc70586ce4f4def9a147b8735
provided by the user. Bot Bashlite C&C commands Linux 3d257d80963c9c905e883b568f997550
Ransom Credle Evasion Win 53f6f9a0d0867c10841b815a1eea1468
• Set the new PoI by translating them into breakpoints.
1 <BV32 (if ((((0x0 .. __add__(0xfe624e21, Goal. When analyzing a malware sample, a common problem
SystemTimeAsFileTime_0_64[63:32], 0x0 .. (if is the ephemerality of the malicious infrastructure behind it.
(SystemTimeAsFileTime_0_64[31:0] <= This affects most of the dynamic analysis techniques - for
(0x2ac18000 +
SystemTimeAsFileTime_0_64[31:0])) then 0
example when the sample is run in a sandbox - because the
else 1)) ... [ omitted data ] sample does not show its malicious behavior. In this use-case,
we show how S YMBION can circumvent this problem, and how
Fig. 6: Symbolic data in gethostbyname parameters. it can be used to identify which command triggers a specific
functionality (e.g., the ovhflood action).
Challenges. When we run this sample, we can see
Goal. During the first step of manual analysis, we saw that that it attempts to connect to its C&C server at
the sample makes requests to domains possibly generated 185.244.25.213:3437. However, this server is not work-
by a Domain Generation Algorithm (DGA). This technique ing anymore, and thus, simply debugging the binary to under-
allows bots to periodically generate a new C&C domain stand which functionality is triggered by a specific command is
to contact as a means to evade defenses based on domain not trivial. Moreover, this sample creates a new child process
blacklisting. Since the implementation of a DGA is quite to handle every request received from the C&C server. This
often time-dependent, we want to confirm this hypothesis behavior represents a problem for symbolic execution engines
by trying to identify a connection between the WinAPI call because forking introduces considerable complexity (i.e., the
GetSystemTimeAsFileTime at address 0x0040c493 engine must model the presence of another child process).
and the call gethostbyname at address 0x00407f58. Analysis. To start our analysis, we manually identified the rou-
Challenges. The malware is not packed, but during a pre- tine (called echocommand) that implements the dispatcher
liminary reverse-engineering we confirm that, during startup, for the received commands at address 0x407ec4. However,
a lot of complex and superfluous code (e.g., calls to random to reach this part of the code, a particular sequence from
WinAPIs, allocation, and de-allocation of memory) are used to the C&C needs to be received. Given the different challenges
confuse an analyst, and to slow down the analysis performed involved, we break down this analysis in two different steps.
by automated tools. Moreover, the sample includes a heavy First of all, we must understand and identify the sequence that
initialization step during which four different threads compute we need to provide to the binary to reach the echocommand
the address of the main function of the malicious code. This function. Then, when this sequence is uncovered, we can sym-
complex initialization step would be enough to hinder any bolically explore the program to understand which command
approach based on pure symbolic execution of the code, since triggers the specific functionality (ovhflood).
symbolic execution would certainly incur in the state explosion For this reason, we start the analysis by concretely exe-
problem. cuting the malware right after the connection attempt to the
Analysis. By leveraging S YMBION, we let the malware run C&C; after that, we switch control to the symbolic execution
until the initialization step is done, precisely, until the call engine, we modify the memory as if the connection has been
to GetSystemTimeAsFileTime. After that, we perform successful, and we resume the concrete execution until the
a Concrete→Symbolic transition, and we mark the buffer instruction before the call to recvline at 0x40abd3. At
returned from GetSystemTimeAsFileTime as symbolic. this point, we perform another Concrete→Symbolic transition,
Now we let the symbolic engine explore the binary until it and we mark the buffer returned by recvline as symbolic.
calls the function gethostbyname. Finally, when we reach To conclude this stage, we let the symbolic engine explore the
gethostbyname, we inspect the data used as a parameter binary until it calls the echocommand function.
to check if symbolic data flows between the two calls. The malware code execution is split in the following way:
To summarize: • CSP1 = Binary Entry Point (0x40a488).
• CSP1 = Binary Entry Point (0x40f827), • PoI1 = Code right before the recvline
(0x40abd3).
• PoI1 = call to GetSystemTimeAsFileTime
(0x0040c493), • TP1 = First instruction of echocommand
(0x407ec4).
• TP1 = call to gethostbyname (0x00407f58).
Finally, by concretizing the symbolic variable returned by
By implementing this analysis using S YMBION, we can recvline (Figure 7), we can see the string the malware
see that the parameter for the gethostbyname call is indeed expects to receive to eventually start parsing malicious com-
symbolic, and that it depends on the symbolic variable returned mands.
by the GetSystemTimeAsFileTime call (Figure 6).
To begin the second step of our analysis, we simply
B. Tracking Malicious Bot Commands concretize this sequence in the program memory, and resume
the concrete execution until the beginning of echocommand.
In this use-case, we show how S YMBION can be used to Once there, we perform another Concrete→Symbolic tran-
analyze an instance of a Bashlite backdoor (know as Qbot or sition. Finally, we set the command buffer back to being a
LizardStresser). This Linux malware targets IoT devices, and symbolic variable, and symbolically execute the program until
it is used to build botnets exploited for DDoS attacks. we hit the basic block where the malware calls the ovhflood
1 print(next_state.solver.eval(cmd_buffer_symbolic, to the Windows API GetProcessAffinityMask (ad-
2 cast_to=bytes)) dress: 0x7502a889). Here, we trigger a Concrete→Symbolic
3 b’!\xff \n\r\x00’ transition. To symbolically analyze this part of code,
we need to write one simple SimProcedure to hook
Fig. 7: Dumping the sequence of chars that will bring the the GetProcessAffinityMask call2 . In particular, this
execution to the echocommand functionality. model just fills the buffers used to store the call’s results
with symbolic data. Using the constraints collected over
those symbolic variables we will later understand which
1 print(next_state.solver.eval(cmd_buffer_symbolic, values force the malware to show its real behavior. Note
2 cast_to=bytes)) that, without using S YMBION, the amount of SimProce-
3 b’OVH\x00’ dures that a user needs to write to reach the TP would
have been considerably larger. Now, we symbolically exe-
Fig. 8: Command we need to send to the backdoor to activate cute the program until the first instruction of the malicious
the ovhflood functionality. behavior: 0x4214e4. At the end of the exploration, we
ask the constraint solver to concretize the symbolic vari-
ables set by the GetProcessAffinityMask SimPro-
function (address 0x409db2). By solving the constraints for cedure. In particular, we get the value 0x2902b319 for
the new symbolic buffer we can extract the command to reach lpSystemAffinityMask, which reports a value greater
this specific functionality in the program. than 0x2 for the CPU’s number of cores (the sample is
checking CPU features to fingerprint virtual environments).
For this step, we split the malware execution in this way: By concretizing these solutions in the memory of the concrete
• CSP2 = TP1 = First instruction of echocommand process we can observe the execution of the malicious code.
(0x407ec4). We split the malware code execution in the following way:
• PoI2 = CSP2 .
• CSP1 = Malware main function (0x40fae6).
• TP2 = call to ovhflood (0x409db2).
• PoI1 = Call to GetProcessAffinityMask
Note that, in this phase, we do not need any concrete (0x7502a889).
execution from CSP2 to PoI2 , rather we want to symbolically • TP1 = Start of malicious behavior (0x4214e4).
explore the code immediately starting from TP1 (reached in the
previous phase). To do this, we logically define that the new
PoI2 equals to the CSP2 . Unfortunately, since this function VI. D ISCUSSION
is quite complex, we run into the state explosion problem. Program Execution Correctness. S YMBION does not keep
We decided to help angr symbolically explore this piece of track of the constraints needed to reach the multiple TP defined
code by leveraging a technique known as directed symbolic during the analysis. Therefore, since users have full control ca-
execution [22], that basically prunes paths that do not connect pabilities over the target program (they can change the memory
CSP2 (echocommand) and the TP2 (call to ovhflood). at any time in inconsistent ways), it is possible to generate
To conclude our analysis, once the TP2 (call to infeasible program’s executions. The analyses developed on
ovhflood) is reached, we solve the constraints over the top of S YMBION are responsible to handle this problem if the
symbolic buffer containing the command, and we extract the program execution correctness is a strict requirement.
string we need to use to reach this functionality, which, in this Desynchronized Environment Interactions. Any environ-
example, is the string “OVH” (Figure 8). ment interaction made by the concrete process that involves
resources outside of the process memory is not shared with
C. Bypassing Malware Evasion the symbolic engine. For instance, consider a program that
opens a file during the concrete execution, and, after that,
In this example, we analyze a ransomware sample that
a Concrete→Symbolic transition is performed. During the
uses a dynamic anti-analysis check to detect a virtualized
symbolic execution, if a read over that file happens, our
environment.
symbolic engine uses an abstraction for that file that is com-
Goal. First, we manually confirm that the ransomware be- pletely desynchronized from the one in the concrete process.
havior starts at address 0x4214e4. However, this code is The same effect occurs in the case of open sessions with
conditionally executed depending on the results returned by a remote servers or device drivers. It is possible to support a
call to the GetProcessAffinityMask WinAPI. The goal synchronization mechanism on top of S YMBION to update the
of this analysis is to understand which value this WinAPI has symbolic engine’s state when a certain behavior (e.g., a file
to return to eventually trigger the malicious code. is opened) happens, but we consider the development of this
solution outside of the scope of this work.
Challenges. The binary is packed, and therefore applying
symbolic execution from the entry point of this sample would 2 Note that here we use a SimProcedure only to programmatically break the
certainly run into the space explosion problem. execution and set a symbolic buffer, conceptually implementing a breakpoint.
Instead, using S YMBION we do not need to write SimProcedures that model
Analysis. We manually identify the point in the program all the environment interactions required to bring the execution to the state of
where the code is fully unpacked and we spot the call interest.
Concretization Policy. After reaching the TP, S YMBION of the malicious program inside a symbolic execution engine.
performs a Symbolic→Concrete transition. During this pro- S YMBION implements an idea similar to Baldoni et al., but
cess, the symbolic variables need to be concretized, and their in a completely automated way. Moreover, it is possible to
concrete values committed to the concrete process memory. carefully steer the concrete execution from a symbolic analysis
During this process, we need to specify the subsets of variables to bring the execution at a specific point in the code.
to be concretized and committed, and also which concrete
values should be used (a satisfiable SMT expression can have VIII. C ONCLUSIONS
multiple valid solutions, by default, S YMBION, picks the first
solution). We decided to not force the users to concretize all In this paper, we showed how symbolic execution of
the symbolic variables but rather give them the freedom to complex programs is generally problematic because of missing
choose the ones they think are important for the analysis. environment models and missing data dependencies, which
together block the possibility to perform a symbolic analysis
Targets Support. Currently, S YMBION supports analyses of on a target program. To solve this problem, we proposed
binaries on x86 and ARM architectures, and only a GDBTarget interleaved symbolic execution, a novel approach to execute a
has been implemented. Nevertheless, we do not consider this target binary by interleaving concrete and symbolic execution.
an inherent limitation of S YMBION, because supporting more We implemented our approach in S YMBION, a system built
architectures only requires more engineering effort. on top of the angr framework. We demonstrated how this
technique solves the presented issues, and how it supports the
VII. R ELATED W ORK analyses of real-world binaries. As a further demonstration, we
have been publicly notified that S YMBION is currently being
Applying symbolic execution only to specific parts of a used by the community for different research projects, and by
program has been tackled by researchers in the past. a large corporation for internal projects.
Chipounov et al. [14] proposed S2 E, an automated path
explorer that lets users define which part of a program has to IX. ACKNOWLEDGMENTS
be executed symbolically and which part as to be executed con- We would like to thank our reviewers for their valuable
cretely. Cha et al. [7] developed Mayhem, an AEG (automatic comments and inputs to improve our paper.
exploit generation) system that is based on strict cooperation
between a Concrete Executor Client (CEC) that runs the code Sandia National Laboratories is a multimission laboratory
inside a virtual machine, and a Symbolic Executor Server managed and operated by National Technology & Engineering
(SES) that symbolically explores traces provided by the CEC. Solutions of Sandia, LLC, a wholly owned subsidiary of Hon-
Inputs generated by the SES are sent back to the CEC to eywell International Inc., for the U.S. Department of Energy’s
increase the code coverage of the target binary. Differently National Nuclear Security Administration under contract DE-
from Mayhem and S2 E, S YMBION does not work with traces, NA0003525. This material is based upon work supported by
nor defines a priori which parts of the program have to be AFRL under Award No. FA8750-19-C-0003. This material is
symbolically executed. also based on research sponsored by DARPA under agreement
number HR001118C0060. The U.S. Government is authorized
Saudel et al. [13] proposed Triton, a dynamic binary analy- to reproduce and distribute reprints for Governmental purposes
sis framework based on Intel’s Pin [20]. The target program is notwithstanding any copyright notation thereon. The views
instrumented using Pin, and users can use selective symbolic and conclusions contained herein are those of the authors
execution leveraging the API provided by the tool. Fioraldi and should not be interpreted as necessarily representing the
developed angrdbg [23], a library to generate angr’s states official policies or endorsements, either expressed or implied,
from a concrete state of a program under analysis. Specifically, of DARPA, the U.S. Government, or the other sponsors.
angr is imported inside a running instance of gdb, and after
that, it is possible to create a symbolic state given the current R EFERENCES
concrete state of the program. Differently from angrgdb and
[1] C. Cadar and K. Sen, “Symbolic execution for software testing: three
Triton, in S YMBION the symbolic execution engine does not decades later.” Commun. ACM, vol. 56, no. 2, pp. 82–90, 2013.
live inside a debugger, nor it is dependent on a particular [2] N. Stephens, J. Grosen, C. Salls, A. Dutcher, R. Wang, J. Corbetta,
dynamic execution environment (e.g., Pin or gdb). Y. Shoshitaishvili, C. Kruegel, and G. Vigna, “Driller: Augmenting
fuzzing through selective symbolic execution.” in Proceedings of the
Muench et al. [21] developed the multi-target orchestration Network and Distributed System Security Symposium (NDSS), 2016.
platform Avatar2 . This platform allows for the synchronization [3] D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. G. Kang,
of multiple executions and analysis targets, such as emulators, Z. Liang, J. Newsome, P. Poosankam, and P. Saxena, “Bitblaze: A
hardware, and even symbolic execution engines such as angr, new approach to computer security via binary analysis,” in Proceed-
in a target-agnostic way. To accomplish this, the system ings of the International Conference on Information Systems Security.
Springer, 2008.
supports both the forwarding and synchronization of registers
[4] Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino,
and memory between targets. The main focus of Avatar2 is to A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, and G. Vigna,
support the analysis of embedded systems, while S YMBION “SoK: (State of) The Art of War: Offensive Techniques in Binary
provides a programmatic interface to support interleaved sym- Analysis,” in Proceeding of the IEEE Symposium on Security and
bolic execution of complex targets. Privacy (S&P), 2016.
[5] A. Moser, C. Kruegel, and E. Kirda, “Exploring multiple execution
Baldoni et al. [24] applied symbolic execution to real-world paths for malware analysis,” in Proceeding of the IEEE Symposium on
malware by implementing a tool that imports the concrete state Security and Privacy (S&P), 2007.
[6] Y. Shoshitaishvili, R. Wang, C. Hauser, C. Kruegel, and G. Vigna,
“Firmalice-automatic detection of authentication bypass vulnerabilities
in binary firmware.” in Proceedings of the Network and Distributed
System Security Symposium (NDSS), 2015.
[7] S. K. Cha, T. Avgerinos, A. Rebert, and D. Brumley, “Unleashing
mayhem on binary code,” in Proceedings of the IEEE Symposium on
Security and Privacy (S&P), 2012.
[8] D. A. Ramos and D. Engler, “Under-constrained symbolic execution:
Correctness checking for real code,” in Proceedings of the USENIX
Security Symposium, 2015.
[9] R. Paleari, L. Martignoni, G. F. Roglia, and D. Bruschi, “A fistful
of red-pills: How to automatically generate procedures to detect cpu
emulators,” in Proceedings of the USENIX Workshop on Offensive
Technologies (WOOT), 2009.
[10] M. Lindorfer, C. Kolbitsch, and P. Milani Comparetti, “Detecting
environment-sensitive malware,” in Proceedings of the International
Symposium Recent Advances in Intrusion Detection (RAID), 2011.
[11] D. Kirat, G. Vigna, and C. Kruegel, “Barecloud: Bare-metal analysis-
based evasive malware detection,” in Proceedings of the USENIX
Security Symposium, 2014.
[12] M. Polino, A. Continella, S. Mariani, S. D’Alessio, L. Fontana, F. Gritti,
and S. Zanero, “Measuring and defeating anti-instrumentation-equipped
malware,” in Proceedings of the Conference on Detection of Intrusions
and Malware and Vulnerability Assessment (DIMVA), 2017.
[13] F. Saudel and J. Salwan, “Triton: A dynamic symbolic execution frame-
work,” in Symposium sur la sécurité des technologies de l’information
et des communications, SSTIC, France, Rennes, June 3-5 2015. SSTIC,
2015.
[14] V. Chipounov, V. Kuznetsov, and G. Candea, “S2e: A platform for
in-vivo multi-path analysis of software systems,” in ACM SIGARCH
Computer Architecture News. ACM, 2011.
[15] D. Trabish, A. Mattavelli, N. Rinetzky, and C. Cadar, “Chopped sym-
bolic execution,” in Proceedings of the ACM International Conference
on Software Engineering (ICSE), 2018.
[16] C. Cadar, D. Dunbar, D. R. Engler et al., “Klee: Unassisted and auto-
matic generation of high-coverage tests for complex systems programs.”
in OSDI, 2008.
[17] P. Godefroid, M. Y. Levin, and D. Molnar, “Sage: whitebox fuzzing
for security testing,” Communications of the ACM, vol. 55, no. 3, pp.
40–44, 2012.
[18] F. Bellard, “Qemu, a fast and portable dynamic translator.” in USENIX
Annual Technical Conference, FREENIX Track, 2005.
[19] K. Sen, D. Marinov, and G. Agha, “Cute: a concolic unit testing engine
for c,” in ACM SIGSOFT Software Engineering Notes. ACM, 2005.
[20] Intel, “Pin - A Dynamic Binary Instrumentation Tool,” https://software.
intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool,
2019.
[21] M. Muench, D. Nisi, A. Francillon, and D. Balzarotti, “Avatar 2: A
multi-target orchestration platform,” in Proceedings of the Workshop
on Binary Analysis Research (BAR), 2018.
[22] K.-K. Ma, K. Y. Phang, J. S. Foster, and M. Hicks, “Directed symbolic
execution,” in International Static Analysis Symposium. Springer, 2011.
[23] A. Fioraldi, “Symbolic Execution and Debugging Synchronization,”
Bachelor’s thesis, Sapienza University of Rome, October 2018.
[24] R. Baldoni, E. Coppa, D. C. D’Elia, and C. Demetrescu, “Assisting mal-
ware analysis with symbolic execution: A case study,” in International
Conference on Cyber Security Cryptography and Machine Learning.
Springer, 2017.