Concol Tracing Interchange Format #156

julihoh · 2021-06-07T08:23:20Z

julihoh
Jun 7, 2021
Collaborator

Here are the results of my research regarding the interchange format for the communication between the concolic tracer and the fuzzer.

Context

On a very high level, the tracer produces branch constraints, which are essentially expressions in a custom language that closely resembles assembly (SymCC's expressions are close to LLVM IR for example).
These expressions need to be communicated between the tracer and the fuzzer in order for the latter to generate new inputs based on these expressions.

Requirements

Efficiency

There are certainly many ways to represent an expression tree in this language. For simplicity, let's consider a simple variant that has 4 main types of expressions: constants, variables, unary operators and binary operators. Constants and variables are the leaves of the tree and the unary and binary operators make up the structure. A large-ish execution with SymCC generates 100's of millions of tree nodes. Therefore, the efficiency of the format used to represent this data is crucial. At an exemplary 17 bytes per tree node (1 byte tag + 8 bytes identifier + 8 bytes worth of node specific data), we would end up with 1700 MB for 100 million nodes. All this is to point out that choosing a format that 'wastes' 'a couple of bytes' per node is not going to cut it.

Language Compatibility

SymCC (and SymQEMU), with which I want to integrate, are written in C/++. More specifically, the runtime component (which the instrumentation calls into) is written in C/++ and the tracer would extend/replace this runtime component in the original SymCC design. Therefore, the interchange format should be writable from C++, if we want to not introduce a second language into SymCC. If, on the other hand, we would be fine with introducing Rust into SymCC, we could keep the interchange format entirely in Rust, which could greatly simplify the definition and processing of the format, as Rust hast excellent support for serialisation formats with serde. To this point, I have an integration with SymCC that is written in Rust and generates a similar format as envisioned. The integration uses corrosion to link a static rust library into SymCC's runtime component, implementing the relevant parts of the instrumentation-to-runtime interface in Rust. It uses bincode to create a serialisation format that can represent most expression nodes in just 3 bytes or less using variable length integer encoding and a couple of tricks.

I am not sure which path to choose regarding this. Here are some of the pros and cons regarding the choice of integration with SymCC:

Keep it in C/++

+ integrates nicely with the rest of SymCC
- requires defining an interchange format that is cross-language compatible
- it's C/++ :(((

Do it in Rust :)

- requires build integration with SymCC (i.e. linking rust code into a cmake project)
+ greatly simplify definition/generation/parsing of interchange format
+ it's rust :)
+ I have a working prototype for this already, so the busy work is mostly done. just needs polish and adaptation to the context of this GSoC.

Theoretically, there is an even more dramatic version of introducing Rust which is to reimplement the complete runtime in Rust. This would make the runtime completely independent, which would be really nice in terms of build-setup and integration (the rust code would produce a dynamic library which any SymCC instrumented binary can simply be pointed to, no changes to SymCC necessary). The main aspects which would have to be reimplemented in addition to the tracing itself for this option would be the memory management (should be rather simple) and the hooking of library calls (this may be undesirable, as it would mean not benefitting from potential improvements to the original SymCC hooks in the future). Especially the issues with hooking library calls make this option less appealing to me personally, but its technically an option.

I would be interested in hearing your opinion on this and to decide this together :)

The Format

Library vs. Custom Implementation

Depending on the results of the language compatibility issue, I see the following options: if Rust is chosen, I would forego any cross-language compatible serialisation solutions and go straight to serde with bincode, as it fits the use case pretty much perfectly and is a pleasure to work with.
If C/++ is chosen, there are two options: go with an existing library or define a custom format. For existing libraries I have surveyed protobuf, cap'n'proto, flatbuffers and msgpack.

Existing Library

+ less effort required than custom format implementation
- tight control of single bits and bytes impossible
- un-rusty. this may be entirely stupid, but these existing libraries do not fit well with rust, which annoys me to no end. the biggest issue is no support for proper enums.

Custom Format

- more implementation overhead (requires implementing write-support in C++ and read support in Rust, which is roughly twice the effort)
+ allows for a maximally efficient implementation

Tricks

I propose the following tricks to keep the format as efficient as possible:

Implicit Identifiers

Consider the following simplified expression tree (using pre-order-style notation)

[a]: Constant(value=0, bits=8)
[b]: Input(offset=13,value=1) // value contains the concrete value during the execution
[c]: Equals(left=a,right=b,taken=false) // referencing the previous two nodes

Note, that the 3rd node references the first two nodes. It needs to identify those nodes somehow. By serialising the nodes in pre-order-style (or post-order, what matters is that it is consistent), nodes can be identified an implicit increasing identifier. For example, the first node is assigned the identifier 0, the second node 1, etc. This way, nodes do not need to carry an identifier.

[0]: Constant(value=0, bits=8)
[1]: Input(offset=13,value=1) 
[2]: Equals(left=0,right=1,taken=false) // referencing the previous two nodes using implicit ids

Note, that 'by construction' this is the order in which expression nodes are created in SymCC, because they are created when the corresponding instruction is executed. Therefore, this is a serialisation format that requires no re-ordering of expression nodes, which saves memory and time.

Implicit Identifiers and Variable Length Integer Encoding

Consider the previous example once more

[0]: Constant(value=0, bits=8)
[1]: Input(offset=13,value=1) 
[2]: Equals(left=0,right=1,taken=false) // referencing the previous two nodes using implicit ids

Expression nodes that reference other expression nodes typically reference nodes which are 'recent'. Intuitively, a program typically computes values close to their use.
If we additionally consider that small integers can be encoded using fewer bits using Variable Length Integer Encoding (VLIE) (which most binary serialisation libraries support), it makes sense to encode references to other expression nodes relative to the current node.
Example:

[0]: Constant(value=0, bits=8)
[1]: Input(offset=13,value=1) 
[2]: Equals(left=-2,right=-1,taken=false) // referencing the previous two nodes relative two our own id

Node [2] references node [0] using -2 (since [2-2]==[0]) and [1] using -1 (since [2-1]=[1]).

These two tricks get us to the quoted 3 bytes or less per expression node: 1 Tag (VLIE'd down to 1 byte), 2 Node References (almost always VLIE'd down to 1 byte each) for binary expressions. Unary expressions therefore typically take just 2 bytes. This also helps tremendously with constants, as a lot of them are small (0 and 1 are extremely popular :D).

Open Questions

I would like to discuss the following main questions:

Choice of language
Library vs. custom

julihoh · 2021-06-07T09:55:08Z

julihoh
Jun 7, 2021
Collaborator Author

Update: the decision has been made to use rust and custom serialisation

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Concol Tracing Interchange Format #156

{{title}}

Replies: 1 comment

{{title}}

Select a reply

Concol Tracing Interchange Format #156

julihoh Jun 7, 2021 Collaborator

Context

Requirements

Efficiency

Language Compatibility

Keep it in C/++

Do it in Rust :)

The Format

Library vs. Custom Implementation

Existing Library

Custom Format

Tricks

Implicit Identifiers

Implicit Identifiers and Variable Length Integer Encoding

Open Questions

Replies: 1 comment

julihoh Jun 7, 2021 Collaborator Author

julihoh
Jun 7, 2021
Collaborator

julihoh
Jun 7, 2021
Collaborator Author