A header-only, freestanding C++20 template for IR-bytecode VM loaders. The whole point is to spin up a new loader without rewriting the same dispatch / decode / decrypt plumbing every single time.
Two pieces of prior art kicked this whole thing off.
The first was RISCy Business on secret.club. It's about embedding a full RISC-V interpreter to execute LLVM-retargeted Windows code as RISC-V bytecode. Pretty neat. The second was the Firebeam VM in Havoc Pro, which applies that same VM-as-loader idea inside a production C2.
Most of the value here isn't in inventing some clever ISA: it's in having a small, embeddable, hardenable execution layer between your bytecode and the host. So I wanted to come at it from a loader-first angle and trade ISA fidelity for simplicity. RISC-V gives you a real toolchain, but at the cost of carrying an interpreter and a CRT shim.
For a loader the bytecode rarely needs to do more than allocate, write, decrypt, jump. A custom IR with fixed-size ops covers that surface in a fraction of the code, and the obfuscation primitives that actually matter (opcode randomization, bytecode encryption at rest, per-op state encryption) port across cleanly. They're properties of the dispatch loop, not the instruction set.
So this template is the dispatch / decode / decrypt skeleton I wished I had
on hand when starting fresh: in "modern" C++, freestanding-friendly, with
everything Windows-specific left as a clearly marked extension point. Drop
in a Handler<> specialization per opcode and you've got a working loader.
A single header (`vm_loader.hpp`) that gives you:

- A fixed-size IR `Op` record and a typed `Vm<...>` dispatcher
- Compile-time per-opcode validation through `Handler<Op>` template specialization
- A 256-entry `constexpr` jump table built at compile time (zero runtime cost on dispatch)
- Three opt-in obfuscation hooks that cost nothing when disabled:
  - opcode randomization (per-build randomized bytecode opcodes)
  - bytecode XOR encryption at rest
  - per-operation context encryption
- A `consteval` Jenkins-OAAT API hash, in case you want it for dynamic resolution
The runnable example is a real shellcode loader, not a MessageBox stand-in.
example_builder.cpp reads example/payload.bin (raw shellcode), XOR-encrypts
both the IR bytecode and the payload, and emits example/embedded.h with two
encrypted blobs inside. example_loader.cpp #includes that header and runs
the classic 5-op pipeline: AllocRegion → WritePayload (encrypted bytes
into the region) → DecryptRegion (XOR in place) → ProtectRX → ExecRegion
(cast to fn ptr and jump). Same shape a production loader uses; the only
thing you have to bring is the payload.
What the template deliberately leaves to you:

- Memory primitives (`VirtualAlloc` / `NtAllocateVirtualMemory`)
- Execution methods (fibers, threadpool, indirect syscalls, …)
- Anti-analysis checks
- Syscall resolution and API hash tables
- Payload encryption

The template covers the VM core. The rest is wired in through `Handler<>` bodies on your side.
```
template/
  vm_loader.hpp        the entire VM
  Makefile             builds the example loader and builder
  compile_flags.txt    clangd config (C++20, freestanding-friendly)
example/
  example_loader.cpp   Windows loader that runs encrypted shellcode via the VM
  example_builder.cpp  matching builder that consumes payload.bin + emits embedded.h
  payload.bin          raw shellcode you drop in (NOT committed)
  embedded.h           generated by the builder, consumed by the loader
```
Four moving parts inside `vm_loader.hpp`:

| Type | Role |
|---|---|
| `vmkit::Op<Opcode>` | Fixed-size operation record |
| `vmkit::Handler<Op>` | Per-opcode behavior; you specialize this |
| `vmkit::OpcodeList<Ops...>` | Pack of opcodes the VM should dispatch |
| `vmkit::Vm<Opcode, Ctx, Cfg, OpcodeList>` | The dispatcher |
```cpp
#include "vm_loader.hpp"

// 1. Define your opcodes (must fit in uint8_t).
enum class MyOp : std::uint8_t { Alloc = 0, Write = 1, Exec = 2 };

// 2. Define your loader's mutable state.
struct MyContext { void* regions[8]; };

// 3. Specialize Handler<> for each opcode.
template <> struct vmkit::Handler<MyOp::Alloc> {
    static void execute(MyContext& ctx, const vmkit::Op<MyOp>& op) noexcept {
        ctx.regions[op.u32[0]] = my_virtual_alloc(op.u64[0]);
    }
};
// ... Write, Exec ...

// 4. Pick a config (or roll your own by inheriting from DefaultConfig).
struct MyCfg : vmkit::DefaultConfig {
    static constexpr bool bytecode_xor_encrypted = true;
    static void decrypt_bytecode(std::span<std::uint8_t> blob,
                                 std::uint32_t seed) noexcept {
        // your in-place XOR / chacha / aes routine here
    }
};

// 5. Run.
vmkit::Vm<MyOp, MyContext, MyCfg,
          vmkit::OpcodeList<MyOp::Alloc, MyOp::Write, MyOp::Exec>> vm;
vm.execute(blob_span, ctx, seed);
```

You need to drop a raw shellcode binary at `example/payload.bin` first, otherwise `make` will refuse with `No rule to make target 'example/payload.bin'`. Anything that's a valid x64 entry point works. Then:
```
make              # build everything (builder -> embedded.h -> loader)
make run-loader   # build + run the loader (executes the shellcode)
make run-builder  # build + run the builder on its own (prints to stdout)
make clean
```

The build chain is: the builder compiles first, then runs against `payload.bin` to generate `example/embedded.h`, then the loader compiles against it. Swap out `payload.bin`, re-run `make`, and the whole thing rebuilds with the new payload baked in.

`make CXX=g++` or `make CXXFLAGS="-std=c++20 -O3"` if you want to override the defaults.
A walk through what actually happens between make and the shellcode running:
1. **Builder side.** `example_builder.cpp` reads `example/payload.bin` (raw shellcode), builds a 5-op `OpT program[]` whose `(size, src_off)` fields reference the payload, encodes each opcode through a forward map (real opcode → randomized byte), and XOR-encrypts both the bytecode and the payload with a 32-byte key derived from a fixed seed (`0xC0FFEE`). It prints a self-contained C++ header with `#pragma once` and four `inline constexpr` symbols: `g_ir_blob`, `g_ir_seed`, `g_ir_payload`, and `g_ir_payload_size`.
2. **Make.** The Makefile redirects the builder's stdout into `example/embedded.h`. If `payload.bin` is missing, Make stops cold; if it changes, `embedded.h` regenerates and the loader rebuilds.
3. **Loader side.** `example_loader.cpp` `#include`s `embedded.h`, copies `g_ir_blob` into a stack-local mutable buffer (since `execute()` decrypts in place), and hands it off to `vm.execute(blob, ctx, g_ir_seed)`.
4. **Execute.** The VM:
   - calls `LoaderConfig::decrypt_bytecode` once (XOR the bytecode with the derived key),
   - reads each `Op`, looks up its randomized opcode byte in `LoaderConfig::opcode_reverse_map`,
   - dispatches through the `constexpr` 256-entry table to the matching `Handler<Op>::execute`.
5. **The pipeline runs.** Five ops: `AllocRegion` (PAGE_READWRITE), `WritePayload` (copies still-encrypted bytes from `g_ir_payload`), `DecryptRegion` (XOR in place using the same derived key), `ProtectRX` (`VirtualProtect` to PAGE_EXECUTE_READ), `ExecRegion` (cast to fn ptr and jump). The shellcode starts running at the end of step 5.
If the builder's forward map and the loader's reverse map drift, or the two `derive_key` implementations disagree on a single byte, the whole thing falls apart immediately: either the bytecode dispatches into `unknown_op`, or the decrypted shellcode is garbage and the `ExecRegion` jump dies. That's the contract this example is testing for you ^^.
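The opcode half of that contract is cheap to pin down at build time. A sketch with hypothetical maps standing in for the example's real ones (the actual maps live in `example_builder.cpp` and the loader's config):

```cpp
#include <array>
#include <cstdint>

// Toy forward map: an affine permutation (odd multiplier => bijection mod 256).
constexpr std::array<std::uint8_t, 256> forward = [] {
    std::array<std::uint8_t, 256> m{};
    for (int i = 0; i < 256; ++i)
        m[i] = static_cast<std::uint8_t>((i * 91 + 7) & 0xFF);
    return m;
}();

// Reverse map built by inverting the forward map.
constexpr std::array<std::uint8_t, 256> reverse = [] {
    std::array<std::uint8_t, 256> r{};
    for (int i = 0; i < 256; ++i)
        r[forward[i]] = static_cast<std::uint8_t>(i);
    return r;
}();

// If this fails, the bytecode would dispatch into unknown_op at runtime.
static_assert([] {
    for (int i = 0; i < 256; ++i)
        if (reverse[forward[i]] != i) return false;
    return true;
}());
```

If builder and loader share a header with both maps, a `static_assert` like this turns the runtime "dispatches into `unknown_op`" failure into a compile error.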
`vmkit::DefaultConfig` exposes three flags. Override the ones you want, leave the rest alone:

| Flag | Effect when true |
|---|---|
| `opcode_randomization` | Decodes each opcode through `opcode_reverse_map[...]` |
| `bytecode_xor_encrypted` | Calls `Cfg::decrypt_bytecode(blob, seed)` once before dispatch |
| `per_op_context_encryption` | Wraps each op with `decrypt_context` / `encrypt_context` |
When a flag is false, the corresponding hook is never instantiated.
No overhead, no symbols, no dead code in the binary. Pretty nice ^^.
```cpp
enum class MyOp : std::uint8_t { /* existing... */ NewThing = 7 };

template <> struct vmkit::Handler<MyOp::NewThing> {
    static void execute(MyContext& ctx, const vmkit::Op<MyOp>& op) noexcept {
        // your logic
    }
};

// Then add it to the OpcodeList. Forget this and your opcode silently no-ops at runtime.
// Forget the Handler specialization and the build dies with a static_assert.
vmkit::Vm<MyOp, MyContext, MyCfg,
          vmkit::OpcodeList</* existing... */ MyOp::NewThing>> vm;
```

The header itself is plain C++20. For an actual loader the typical flag set looks like:

```
-std=c++20 -O2 -ffreestanding -fno-exceptions -fno-rtti -nostdlib++
```

On MSVC: `/std:c++20 /EHs-c- /GR- /kernel` (or hand-tune; `/kernel` implies no-exceptions + no-RTTI anyway).
This part is for anyone who wants to understand how it works, not just how to use it. Feel free to skip if you only need the API.
The classic IR interpreter pattern is one giant switch on the opcode
byte. That works, sure, but every new opcode means editing the switch,
and a missing case is a silent runtime no-op (which is exactly when you
don't want to find out, btw).
vmkit flips that. The `Vm<>` class holds a single static member:

```cpp
static constexpr std::array<Dispatcher, 256> dispatch_table = build_table();
```

`build_table()` is a `constexpr` function that:

1. Initializes all 256 slots to `&unknown_op` (a no-op).
2. For every opcode `Op` listed in `OpcodeList<Ops...>`, sets `dispatch_table[Op] = &dispatch_to<Op>`. `dispatch_to<Op>` is a `static_assert`-guarded thunk that calls `Handler<Op>::execute(ctx, op)`.

Step 2 is a fold expression over the parameter pack, built entirely at compile time:

```cpp
((t[static_cast<std::size_t>(Ops)] = &dispatch_to<Ops>), ...);
```

The result: dispatch is a single indirect call through a table the compiler already knows about. Modern compilers will frequently devirtualize and inline it. There's no runtime registration step, no virtual table, no hash lookup.
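A stripped-down model of the same technique, toy-sized and independent of vmkit's actual code, shows the fold expression populating a table over defaulted slots:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy 4-entry dispatch table built the same way build_table() is described:
// default every slot, then a fold expression fills in the listed opcodes.
enum class Op : std::uint8_t { A = 0, B = 2 };

using Fn = int (*)();
constexpr int unknown() { return -1; }                      // stand-in for unknown_op
template <Op O> constexpr int thunk() { return static_cast<int>(O); }

template <Op... Ops>
constexpr std::array<Fn, 4> build_table() {
    std::array<Fn, 4> t{};
    for (auto& slot : t) slot = &unknown;                   // step 1: default all slots
    ((t[static_cast<std::size_t>(Ops)] = &thunk<Ops>), ...);// step 2: fold over the pack
    return t;
}

constexpr auto table = build_table<Op::A, Op::B>();
static_assert(table[0]() == 0);   // Op::A dispatches to its thunk
static_assert(table[1]() == -1);  // unlisted slot stays unknown
static_assert(table[2]() == 2);   // Op::B dispatches to its thunk
```

The `static_assert`s demonstrate the table is fully formed before the program even runs; nothing registers at startup.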
`Handler<Op>` is a primary template that's deliberately undefined. When you write `template <> struct vmkit::Handler<MyOp::Alloc> { ... }`, you're filling in one slot of a compile-time registry.

The kicker is in `dispatch_to<Op_>`:

```cpp
static_assert(HasHandler<Op_, Ctx, OpType>,
              "vmkit: missing Handler<Op> specialization for a listed opcode");
Handler<Op_>::execute(ctx, op);
```

`HasHandler` is a concept that probes for `Handler<Op>::execute(ctx, op)`. If you list an opcode in `OpcodeList<...>` without specializing `Handler<>` for it, `dispatch_to<Op>` fails to instantiate and the build dies with a clear message. With a switch, that exact same mistake compiles cleanly and silently no-ops at runtime. Which, again, is not when you want to find out about it.
By default, the bytecode opcode byte is the real opcode. With `opcode_randomization = true`, the byte stored in the bytecode is a randomized encoding instead, and the runtime maps it back through a 256-byte reverse table:

```cpp
std::uint8_t raw = static_cast<std::uint8_t>(op.opcode);
if constexpr (Cfg::opcode_randomization) {
    raw = Cfg::opcode_reverse_map[raw];
}
dispatch_table[raw](ctx, op);
```

There are two halves to this:

- **Builder side (forward map):** real opcode → randomized byte. Each build picks a fresh permutation seeded from a config value, so identical IR programs produce different bytecode across builds.
- **Runtime side (reverse map):** randomized byte → real opcode. Embedded as a `constexpr std::array` in the binary.

The two maps are inverses of each other. The example demonstrates a tiny hand-rolled permutation; a real builder would produce a random shuffle keyed off the build seed.
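One way a real builder could produce that keyed shuffle, sketched with an assumed xorshift PRNG (the example's actual hand-rolled permutation is simpler than this):

```cpp
#include <array>
#include <cstdint>

// Tiny xorshift32 PRNG; an assumption for this sketch, not vmkit's code.
constexpr std::uint32_t xorshift(std::uint32_t& s) {
    s ^= s << 13; s ^= s >> 17; s ^= s << 5;
    return s;
}

// Fisher-Yates shuffle of 0..255 driven by the build seed: the forward map.
constexpr std::array<std::uint8_t, 256> make_forward_map(std::uint32_t seed) {
    std::array<std::uint8_t, 256> m{};
    for (int i = 0; i < 256; ++i) m[i] = static_cast<std::uint8_t>(i);
    for (int i = 255; i > 0; --i) {
        const auto j = xorshift(seed) % static_cast<std::uint32_t>(i + 1);
        const auto tmp = m[i]; m[i] = m[j]; m[j] = tmp;
    }
    return m;
}

// The reverse map the loader embeds is just the inverse permutation.
constexpr std::array<std::uint8_t, 256> invert(const std::array<std::uint8_t, 256>& f) {
    std::array<std::uint8_t, 256> r{};
    for (int i = 0; i < 256; ++i) r[f[i]] = static_cast<std::uint8_t>(i);
    return r;
}

constexpr auto fwd = make_forward_map(0xC0FFEE);
constexpr auto rev = invert(fwd);
static_assert(rev[fwd[42]] == 42);  // maps are inverses by construction
```

Change the seed and every opcode byte in the emitted bytecode changes with it, which is exactly the per-build alphabet rotation described above.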
What this buys you: static signatures based on opcode byte sequences become useless, since every build has a different alphabet. What it doesn't buy you: protection against execution-trace analysis or symbolic execution.
`bytecode_xor_encrypted = true` makes `execute()` call `Cfg::decrypt_bytecode(blob, seed)` once before the dispatch loop starts. The header intentionally doesn't ship a crypto routine (that'd be one more recognizable signature), so you wire in whatever transform matches your builder.

The `xor_codec::apply` helper in the header is provided for convenience: a minimal in-place XOR with a key span. For real loaders you'll probably want at least a seed-derived key, ideally a stream cipher (sry, no shortcuts here).
`blob` is `std::span<std::uint8_t>`, mutable on purpose, since the decryption happens in place. If you keep an encrypted copy elsewhere, copy before calling `execute()`.
`per_op_context_encryption = true` adds two calls around each opcode:

```cpp
for (i = 0; i < count; ++i) {
    if (i > 0) Cfg::decrypt_context(ctx, i);
    /* dispatch */
    Cfg::encrypt_context(ctx, i + 1);
}
if (count > 0) Cfg::decrypt_context(ctx, count);
```

The pattern: the loader's mutable state (region pointers, syscall table, exec method, …) lives encrypted in memory between operations. A memory dump captured between two ops shows ciphertext, not pointers. Each op transitions through plaintext for the duration of `Handler<>::execute` and gets re-encrypted right after.
The op index gets fed into the encryption hook, so the key can rotate per op and make static dump analysis even harder.
This is mostly useful against passive memory forensics, btw. Anything actively hooking your handlers will see plaintext.
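A minimal sketch of the two hooks, assuming a simple index-keyed XOR keystream (a placeholder, not the template's code; swap in a real cipher for production):

```cpp
#include <cstddef>
#include <cstdint>

struct MyContext { void* regions[8]; };

// Cheap keystream mixed from the op index so the key rotates per op.
// The constants are arbitrary placeholders for this sketch.
inline std::uint8_t ctx_keystream(std::size_t op_index, std::size_t byte_i) noexcept {
    return static_cast<std::uint8_t>((op_index * 151 + byte_i * 53 + 0xA5) & 0xFF);
}

// XOR the whole context's object representation with the index-keyed stream.
inline void xor_context(MyContext& ctx, std::size_t op_index) noexcept {
    auto* p = reinterpret_cast<std::uint8_t*>(&ctx);
    for (std::size_t i = 0; i < sizeof(ctx); ++i)
        p[i] ^= ctx_keystream(op_index, i);
}

// XOR is its own inverse, so both hooks share the routine; the VM's loop
// guarantees matching indices between encrypt and the following decrypt.
inline void encrypt_context(MyContext& ctx, std::size_t i) noexcept { xor_context(ctx, i); }
inline void decrypt_context(MyContext& ctx, std::size_t i) noexcept { xor_context(ctx, i); }
```

Because `encrypt_context(ctx, i + 1)` in one iteration is undone by `decrypt_context(ctx, i)` of the next, any symmetric transform keyed on the index slots straight into the loop shown above.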
Each obfuscation feature is gated on a `static constexpr bool`:

```cpp
if constexpr (Cfg::bytecode_xor_encrypted) {
    Cfg::decrypt_bytecode(blob, seed);
}
```

This is not a runtime branch. If the flag is false, the entire branch gets discarded at compile time. `decrypt_bytecode` is never even instantiated and contributes zero bytes to the binary. A loader that uses no obfuscation features compiles down to literally the dispatch loop, nothing else.
Compare with `#ifdef`: this gives you the same dead-code elimination, but with template-argument granularity. You can have two `Vm<>` instances in the same binary running different configs, which is something `#ifdef` just can't do.
Three features the design genuinely needs:

- **`auto` non-type template parameters.** Lets `Handler<MyOp::Alloc>` work regardless of whether the underlying enum is `int` or `uint8_t`. Pre-C++17 this would need a template-template wrapper or a macro, neither of which is fun to read.
- **Concepts.** `HasHandler` is the cleanest way to say "this opcode has a valid handler" and produce a diagnostic that's actually readable.
- **`constexpr std::array` member init via lambda.** The `opcode_reverse_map` in the example is built by an immediately-invoked `constexpr` lambda. C++17 could do it via a helper function; C++20 just lets you write it inline.
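The third bullet in code, with a toy affine permutation standing in for the example's actual map:

```cpp
#include <array>
#include <cstdint>

// A static constexpr member built by an immediately-invoked constexpr lambda.
// LoaderCfg and the (i * 167 + 13) permutation are placeholders for this sketch.
struct LoaderCfg {
    static constexpr std::array<std::uint8_t, 256> opcode_reverse_map = [] {
        std::array<std::uint8_t, 256> r{};
        for (int i = 0; i < 256; ++i) {
            // toy forward map: real opcode i encodes to (i * 167 + 13) & 0xFF
            const auto encoded = static_cast<std::uint8_t>((i * 167 + 13) & 0xFF);
            r[encoded] = static_cast<std::uint8_t>(i);  // invert it in place
        }
        return r;
    }();
};

static_assert(LoaderCfg::opcode_reverse_map[13] == 0);  // opcode 0 encodes to 13
```

The lambda runs once at compile time and leaves only the finished 256-byte array in the binary; no helper function pollutes the namespace.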
The header pulls in only `<array>`, `<cstddef>`, `<cstdint>`, and `<span>`. All header-only, no allocators, freestanding-compatible.
```cpp
const auto* ops = reinterpret_cast<const OpType*>(blob.data());
```

Strict-aliasing pedants will recoil. The real-world story, though:

- The bytecode is an array of `OpType`, just delivered as bytes.
- `OpType` is trivially copyable and standard-layout.
- `std::uint8_t` is allowed to alias other types in practice on every compiler that ships a Windows loader.

Real loaders get built with `-fno-strict-aliasing` anyway. If you're targeting something where this genuinely matters, use `std::memcpy` into a stack-local `OpType` per iteration and the optimizer will collapse it.
```cpp
template <typename Opcode, std::size_t U32 = 8, std::size_t U64 = 4>
struct alignas(8) Op {
    Opcode        opcode;
    std::uint32_t u32[U32];
    std::uint64_t u64[U64];
};
```

- `alignas(8)` guarantees the bytecode array is 8-byte aligned, so `u64[]` reads are aligned without per-op shuffling.
- The slot counts are template parameters, not hardcoded. Bump them up if your handlers need more operands per op; bump them down if the defaults are wasteful for your use case.
- All ops in a single `Vm<>` instance are the same size; `Op<Opcode>` is fixed-size. If you need variable-length operands (large strings, payload chunks, …), put them in a side table and reference them by offset. `WritePayload` in the example shows the pattern.
- The opcode enum's underlying type must fit in a byte (0..255). The dispatch table is exactly 256 entries; anything wider doesn't fit.
- `execute()` takes a mutable `std::span` because in-place bytecode decryption rewrites it. Copy first if you want to keep the encrypted form around.
- Unknown opcodes silently no-op. That's by design: easy to drop in junk opcodes for control-flow obfuscation. Replace `unknown_op` in `vm_loader.hpp` with a trap if you'd rather hard-fail.
The template covers the VM core. Production loaders typically build out:
- Real Win32 / NTAPI hooks for alloc / protect / exec
- Indirect syscalls (Hell's gate variants and friends)
- API hashing tables and dynamic resolution
- Anti-analysis checks
- Callstack spoofing
- Multiple execution methods
- Payload encryption
- Transport obfuscation
- Section-entropy padding and PE checksum patching
All of those slot in either as `Handler<>` bodies, as overrides on a custom `Cfg`, or as builder-side preprocessing. None of them require touching the VM core in `vm_loader.hpp`. As always, feel free to bend the template to your use case ^^.
Secret Club: https://secret.club/2023/12/24/riscy-business.html
Infinity Curve: https://infinitycurve.org/products/havoc-professional