Replies: 1 comment
-
Update: the decision has been made to use rust and custom serialisation |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Here are the results of my research regarding the interchange format for the communication between the concolic tracer and the fuzzer.
Context
On a very high level, the tracer produces branch constraints, which are essentially expressions in a custom language that closely resembles assembly (SymCC's expressions are close to LLVM IR for example).
These expressions need to be communicated between the tracer and the fuzzer in order for the latter to generate new inputs based on these expressions.
Requirements
Efficiency
There are certainly many ways to represent an expression tree in this language. For simplicity, let's consider a simple variant that has 4 main types of expressions: constants, variables, unary operators and binary operators. Constants and variables are the leaves of the tree and the unary and binary operators make up the structure. A large-ish execution with SymCC generates 100's of millions of tree nodes. Therefore, the efficiency of the format used to represent this data is crucial. At an exemplary 17 bytes per tree node (1 byte tag + 8 bytes identifier + 8 bytes worth of node specific data), we would end up with 1700 MB for 100 million nodes. All this is to point out that choosing a format that 'wastes' 'a couple of bytes' per node is not going to cut it.
Language Compatibility
SymCC (and SymQEMU), with which I want to integrate, are written in C/++. More specifically, the runtime component (which the instrumentation calls into) is written in C/++ and the tracer would extend/replace this runtime component in the original SymCC design. Therefore, the interchange format should be writable from C++, if we want to not introduce a second language into SymCC. If, on the other hand, we would be fine with introducing Rust into SymCC, we could keep the interchange format entirely in Rust, which could greatly simplify the definition and processing of the format, as Rust hast excellent support for serialisation formats with serde. To this point, I have an integration with SymCC that is written in Rust and generates a similar format as envisioned. The integration uses corrosion to link a static rust library into SymCC's runtime component, implementing the relevant parts of the instrumentation-to-runtime interface in Rust. It uses bincode to create a serialisation format that can represent most expression nodes in just 3 bytes or less using variable length integer encoding and a couple of tricks.
I am not sure which path to choose regarding this. Here are some of the pros and cons regarding the choice of integration with SymCC:
Keep it in C/++
Do it in Rust :)
Theoretically, there is an even more dramatic version of introducing Rust which is to reimplement the complete runtime in Rust. This would make the runtime completely independent, which would be really nice in terms of build-setup and integration (the rust code would produce a dynamic library which any SymCC instrumented binary can simply be pointed to, no changes to SymCC necessary). The main aspects which would have to be reimplemented in addition to the tracing itself for this option would be the memory management (should be rather simple) and the hooking of library calls (this may be undesirable, as it would mean not benefitting from potential improvements to the original SymCC hooks in the future). Especially the issues with hooking library calls make this option less appealing to me personally, but its technically an option.
I would be interested in hearing your opinion on this and to decide this together :)
The Format
Library vs. Custom Implementation
Depending on the results of the language compatibility issue, I see the following options: if Rust is chosen, I would forego any cross-language compatible serialisation solutions and go straight to serde with bincode, as it fits the use case pretty much perfectly and is a pleasure to work with.
If C/++ is chosen, there are two options: go with an existing library or define a custom format. For existing libraries I have surveyed protobuf, cap'n'proto, flatbuffers and msgpack.
Existing Library
Custom Format
Tricks
I propose the following tricks to keep the format as efficient as possible:
Implicit Identifiers
Consider the following simplified expression tree (using pre-order-style notation)
Note, that the 3rd node references the first two nodes. It needs to identify those nodes somehow. By serialising the nodes in pre-order-style (or post-order, what matters is that it is consistent), nodes can be identified an implicit increasing identifier. For example, the first node is assigned the identifier 0, the second node 1, etc. This way, nodes do not need to carry an identifier.
Note, that 'by construction' this is the order in which expression nodes are created in SymCC, because they are created when the corresponding instruction is executed. Therefore, this is a serialisation format that requires no re-ordering of expression nodes, which saves memory and time.
Implicit Identifiers and Variable Length Integer Encoding
Consider the previous example once more
Expression nodes that reference other expression nodes typically reference nodes which are 'recent'. Intuitively, a program typically computes values close to their use.
If we additionally consider that small integers can be encoded using fewer bits using Variable Length Integer Encoding (VLIE) (which most binary serialisation libraries support), it makes sense to encode references to other expression nodes relative to the current node.
Example:
Node [2] references node [0] using -2 (since [2-2]==[0]) and [1] using -1 (since [2-1]=[1]).
These two tricks get us to the quoted 3 bytes or less per expression node: 1 Tag (VLIE'd down to 1 byte), 2 Node References (almost always VLIE'd down to 1 byte each) for binary expressions. Unary expressions therefore typically take just 2 bytes. This also helps tremendously with constants, as a lot of them are small (0 and 1 are extremely popular :D).
Open Questions
I would like to discuss the following main questions:
Beta Was this translation helpful? Give feedback.
All reactions