cpp-trading-core is an experimental C++17/20 trading core library built as a learning project around
limit order books, market-data-style event processing, and low-latency-friendly design.
The goal is not to build a full production trading system, but to have a small, clean codebase that can be:
- used to practice modern C++ (RAII, STL, containers, iterators),
- extended towards low-latency / HFT-style architectures,
- referenced in interviews as a concrete example of systems-level C++ work.
⚠️ This project is for educational and research purposes only.
It is not production-ready code and not financial advice.
Single-instrument in-memory limit order book

- Separate books for bids and asks.
- Price–time priority within each price level.
- `add_limit_order(Side, Price, Quantity)`:
  - inserts a limit order into the appropriate side,
  - maintains a per-order index for O(1) cancellation by id.
- `cancel(OrderId)`:
  - removes the order from both the price level and the index,
  - cleans up empty price levels.
- `execute_market_order(Side, Quantity)`:
  - for `Buy`, consumes liquidity from the ask book starting from the best ask (lowest price),
  - for `Sell`, consumes liquidity from the bid book starting from the best bid (highest price),
  - respects price–time priority within each level,
  - returns `MatchResult { requested, filled, remaining }`.

Best quotes / book inspection

- `best_bid()` / `best_ask()` return:
  - the price of the best level,
  - the total quantity at that level,
  - a `valid` flag when the side is non-empty.
- `empty()` checks whether the whole book is empty.
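Put together, a rough sketch of the public interface described above could look like the following. This is only an illustration: the concrete integer widths, the `BestQuote` name and the return types are assumptions, not the actual declarations in the header.

```cpp
#include <cstdint>

namespace trading {

using OrderId  = std::uint64_t;   // widths are assumptions
using Price    = std::int64_t;
using Quantity = std::uint64_t;

enum class Side { Buy, Sell };

// Result of a market order: requested, filled and unfilled quantity.
struct MatchResult {
    Quantity requested{0};
    Quantity filled{0};
    Quantity remaining{0};
};

// Assumed name for the value returned by best_bid()/best_ask().
struct BestQuote {
    bool     valid{false};  // false when the side is empty
    Price    price{0};      // price of the best level
    Quantity qty{0};        // total quantity resting at that level
};

class OrderBook {
public:
    // Return types here are guesses for illustration only.
    OrderId     add_limit_order(Side side, Price price, Quantity qty);
    bool        cancel(OrderId id);
    MatchResult execute_market_order(Side side, Quantity qty);

    BestQuote best_bid() const;
    BestQuote best_ask() const;
    bool      empty() const;
};

} // namespace trading
```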
Unit tests

- Tests are written with GoogleTest (fetched via CMake `FetchContent`) and cover:
  - empty book state,
  - basic add-limit behaviour and best bid/ask,
  - market order scenarios:
    - partial and full fills,
    - multiple levels on each side,
    - correct best-level selection (best bid/ask first),
  - cancellation scenarios:
    - cancelling the only order on a side,
    - cancelling one of multiple orders on the same price level,
    - cancelling bids does not affect asks and vice versa,
    - cancelling a non-existing id leaves the book unchanged.
Market data replay tool

- Simple CSV-like event format:
  - `ADD,BUY,price,qty`
  - `ADD,SELL,price,qty`
  - `MKT,BUY,qty`
  - `MKT,SELL,qty`
  - `CANCEL,orderId`
  - lines starting with `#` are treated as comments.
- `trading_replay` app (`app/replay_main.cpp`):
  - reads events from a file,
  - feeds them into `OrderBook`,
  - prints a summary:
    - counts of ADD / MKT / CANCEL,
    - total filled quantity,
    - final best bid/ask.
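To make the event format concrete, here is a minimal sketch of how one line could be parsed. The `Event` struct and `parse_event_line` helper are hypothetical illustrations, not the actual code in `app/replay_main.cpp`.

```cpp
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical event representation for the CSV-like replay format.
struct Event {
    enum class Type { Add, Market, Cancel } type{};
    std::string side;          // "BUY" or "SELL" (empty for CANCEL)
    long long   price{0};      // only meaningful for ADD
    long long   qty_or_id{0};  // qty for ADD/MKT, orderId for CANCEL
};

// Parses "ADD,BUY,100,10", "MKT,SELL,3", "CANCEL,1"; returns nullopt for
// comments (lines starting with '#'), blank lines and malformed input.
std::optional<Event> parse_event_line(const std::string& line) {
    if (line.empty() || line[0] == '#') return std::nullopt;

    std::vector<std::string> fields;
    std::stringstream ss(line);
    for (std::string field; std::getline(ss, field, ',');) fields.push_back(field);
    if (fields.empty()) return std::nullopt;

    Event ev;
    if (fields[0] == "ADD" && fields.size() == 4) {
        ev.type = Event::Type::Add;
        ev.side = fields[1];
        ev.price = std::stoll(fields[2]);
        ev.qty_or_id = std::stoll(fields[3]);
    } else if (fields[0] == "MKT" && fields.size() == 3) {
        ev.type = Event::Type::Market;
        ev.side = fields[1];
        ev.qty_or_id = std::stoll(fields[2]);
    } else if (fields[0] == "CANCEL" && fields.size() == 2) {
        ev.type = Event::Type::Cancel;
        ev.qty_or_id = std::stoll(fields[1]);
    } else {
        return std::nullopt;
    }
    return ev;
}
```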
Microbenchmark harness and order book benchmark app

- Header-only benchmark harness in `include/utils/benchmark.hpp`:
  - batch-based timing to reduce `clock::now()` overhead and OS noise,
  - per-batch ns/op samples with p50, p95, p99,
  - multi-run aggregation: averages mean, p50, p95, p99 across runs.
- `trading_bench_order_book` app (`app/bench_order_book_main.cpp`):
  - benchmarks:
    - `OrderBook::add_limit_order`,
    - `OrderBook::execute_market_order`,
    - and a synthetic `empty_loop` to estimate harness overhead.
  - configurable parameters via CLI:
    - `iterations` (operations per run),
    - `runs` (how many times to repeat the benchmark),
    - `batch_size` (operations per timed batch).
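The core idea behind batch-based timing is to call the clock once per batch instead of once per operation, so the timer cost is amortized over `batch_size` operations. A minimal sketch of that idea (simplified; the real harness in `include/utils/benchmark.hpp` also aggregates percentiles across runs):

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Times `op` in batches and returns one ns/op sample per batch.
// clock::now() is called once per batch, so its overhead is spread
// over `batch_size` operations instead of being paid on every call.
template <typename Op>
std::vector<double> time_in_batches(Op&& op, std::size_t iterations, std::size_t batch_size) {
    using clock = std::chrono::steady_clock;
    std::vector<double> ns_per_op;
    ns_per_op.reserve(iterations / batch_size + 1);

    for (std::size_t done = 0; done < iterations; done += batch_size) {
        const std::size_t n = std::min(batch_size, iterations - done);
        const auto t0 = clock::now();
        for (std::size_t i = 0; i < n; ++i) op();
        const auto t1 = clock::now();
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        ns_per_op.push_back(static_cast<double>(ns) / static_cast<double>(n));
    }
    return ns_per_op;  // p50 / p95 / p99 can be taken from this sample vector
}
```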
Internally the order book uses flat storage for orders plus two price books (bids / asks) that reference orders by index:

```cpp
struct Order {
    OrderId  id{0};
    Side     side{Side::Buy};
    Price    price{0};
    Quantity qty{0};
    bool     active{false};
};

using OrderIndex = std::uint32_t;

struct Level {
    // Indices into the flat `orders_` array.
    std::vector<OrderIndex> indices;
};

using BidBook = std::map<Price, Level, std::greater<Price>>;
using AskBook = std::map<Price, Level, std::less<Price>>;

class OrderBook {
    BidBook bids_;                          // best bid at bids_.begin()
    AskBook asks_;                          // best ask at asks_.begin()
    std::vector<Order> orders_;             // flat storage
    std::vector<OrderIndex> free_indices_;  // reusable slots in `orders_`
    std::unordered_map<OrderId, OrderIndex> id_to_index_;
};
```

Key points:

- Best levels are always at `bids_.begin()` and `asks_.begin()` due to the comparators.
- Each `Level` holds indices into the flat `orders_` array:
  - this keeps per-price queues cache-friendly;
  - cancelled/filled orders are marked inactive and their indices are recycled.
- A hash map `id_to_index_` provides O(1)-ish lookup for cancels.
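To illustrate how the flat storage, the free list and the index map interact, a simplified cancel path might look roughly like this. It is a sketch based on the layout above (it assumes `cancel` is declared on `OrderBook` and needs `<algorithm>`), not the actual member function:

```cpp
// Sketch of cancel() against the layout above: O(1)-ish id lookup,
// erase from the level's index vector, recycle the flat-storage slot.
bool OrderBook::cancel(OrderId id) {
    auto it = id_to_index_.find(id);
    if (it == id_to_index_.end()) return false;        // unknown id: book unchanged

    const OrderIndex idx = it->second;
    Order& order = orders_[idx];

    // Locate the price level on the right side and drop this order's index.
    auto erase_from = [&](auto& book) {
        auto level_it = book.find(order.price);
        auto& indices = level_it->second.indices;
        indices.erase(std::remove(indices.begin(), indices.end(), idx), indices.end());
        if (indices.empty()) book.erase(level_it);      // clean up empty levels
    };
    if (order.side == Side::Buy) erase_from(bids_); else erase_from(asks_);

    order.active = false;                               // mark slot inactive
    free_indices_.push_back(idx);                       // recycle for future orders
    id_to_index_.erase(it);
    return true;
}
```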
On top of that, a unified matching core:

```cpp
template <typename Book, typename PricePredicate>
Quantity match_on_book(Book& book, Quantity qty, PricePredicate&& should_cross);
```

implements both:

- market order execution (`execute_market_order`), and
- the taker leg of incoming limit orders (`match_incoming_limit`), with different price predicates for BUY/SELL and limit prices.
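Conceptually, the matching core walks the opposite book from the best level outwards while the price predicate says the level still crosses, which is why market and limit takers can share one template. A simplified, self-contained sketch of that loop (the real `match_on_book` works on the `Level`/`OrderIndex` layout above; here each level is reduced to a single aggregated quantity to keep the example short):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>

using Price    = std::int64_t;
using Quantity = std::uint64_t;

// Walks `book` from the best level (begin()) outwards while `should_cross(price)`
// holds, consuming quantity level by level. Returns how much was actually filled.
template <typename Book, typename PricePredicate>
Quantity match_on_book(Book& book, Quantity qty, PricePredicate&& should_cross) {
    Quantity filled = 0;
    auto it = book.begin();                             // best level, thanks to the comparator
    while (qty > 0 && it != book.end() && should_cross(it->first)) {
        const Quantity take = std::min(qty, it->second);
        it->second -= take;
        filled     += take;
        qty        -= take;
        it = (it->second == 0) ? book.erase(it) : it;   // drop fully consumed levels
    }
    return filled;
}

int main() {
    // Ask book: best ask first (lowest price).
    std::map<Price, Quantity, std::less<Price>> asks{{105, 5}, {106, 7}};

    // Market BUY for 3: any ask price crosses.
    Quantity f1 = match_on_book(asks, 3, [](Price) { return true; });

    // Incoming BUY limit at 105: only asks priced <= 105 cross.
    Quantity f2 = match_on_book(asks, 10, [](Price p) { return p <= 105; });

    std::cout << "market fill = " << f1 << ", limit taker fill = " << f2 << '\n';  // 3 and 2
    return 0;
}
```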
Let N be the number of price levels and K the number of orders in a given level.

- `add_limit_order`:
  - `std::map::find/emplace` – O(log N),
  - `std::list::push_back` – O(1),
  - `unordered_map` insert – O(1) average.
  - Total: ~O(log N).
- `cancel(OrderId)`:
  - `unordered_map::find` – O(1) average,
  - list erase via iterator – O(1),
  - optional `map::erase` of an empty level – O(log N).
  - Total: ~O(log N) in the worst case, O(1) when the level stays non-empty.
- `execute_market_order(Side, qty)`:
  - walks price levels from best bid/ask outwards:
    - each fully consumed level is erased in O(log N),
    - partially consumed levels stay,
    - within each level, orders are consumed in O(K).
  - Total: proportional to the number of touched orders and levels.
This gives a good base for discussing:
- where the bottlenecks are,
- what could be improved for low-latency (cache-friendly containers, custom allocators, flat maps, etc.).
- CMake ≥ 3.16
- A C++20-capable compiler (GCC, Clang, MSVC)
```sh
mkdir -p build
cd build
cmake .. -DBUILD_TESTING=ON
cmake --build .
```

To run the tests:

```sh
cd build
ctest --output-on-failure
```

GoogleTest is fetched automatically via FetchContent on the first configure step.
Build as above, then from the build directory:
```sh
./trading_replay path/to/events.csv
```

Example event file:

```
# type,side,price,qty_or_id
ADD,BUY,100,10
ADD,SELL,105,5
MKT,BUY,3
CANCEL,1
```
Output includes:
- counts of ADD / MKT / CANCEL events,
- total filled quantity,
- final best bid / best ask.
See docs/performance.md for detailed benchmarks and profiling notes.
The microbenchmark harness lives in include/utils/benchmark.hpp and provides:
- `run_benchmark_with_percentiles_batched(...)`:
  - measures operations in batches to amortize timer overhead,
  - collects per-batch ns/op samples,
  - computes p50 / p95 / p99.
- `run_multi_benchmark(...)`:
  - runs a benchmark multiple times,
  - aggregates mean / p50 / p95 / p99 across runs,
  - prints results in a compact format.

The `trading_bench_order_book` app (`app/bench_order_book_main.cpp`) uses this harness to benchmark:

- an empty loop (to estimate harness overhead),
- `OrderBook::add_limit_order`,
- `OrderBook::execute_market_order`.
From the build directory (Release build):
```sh
./trading_bench_order_book [iterations] [runs] [batch_size]

# Example (defaults):
./trading_bench_order_book
# iterations = 200000, runs = 5, batch_size = 128
```

Environment:

- Linux VPS, single thread
- Release build (`-O3`), C++20
- Command: `./trading_bench_order_book` (iterations = 200000, runs = 5, batch size = 128)
These numbers are a baseline for future optimizations.
`empty_loop` (harness overhead):

| Version | Mean ns/op | p50 ns | p95 ns | p99 ns | Iterations | Runs | Batch size |
|---|---|---|---|---|---|---|---|
| baseline | 0.34 | 0.30 | 0.39 | 0.43 | 200000 | 5 | 128 |
| flat storage | 0.44 | 0.41 | 0.45 | 0.53 | 200000 | 5 | 128 |

`OrderBook::add_limit_order`:

| Version | Mean ns/op | p50 ns | p95 ns | p99 ns | Iterations | Runs | Batch size |
|---|---|---|---|---|---|---|---|
| baseline | 283.70 | 202.89 | 380.54 | 514.32 | 200000 | 5 | 128 |
| flat storage | 242.74 | 199.24 | 350.24 | 467.71 | 200000 | 5 | 128 |

`OrderBook::execute_market_order`:

| Version | Mean ns/op | p50 ns | p95 ns | p99 ns | Iterations | Runs | Batch size |
|---|---|---|---|---|---|---|---|
| baseline | 58.51 | 7.01 | 375.57 | 488.27 | 200000 | 5 | 128 |
| flat storage | 17.52 | 16.74 | 19.74 | 22.33 | 200000 | 5 | 128 |
For a more detailed before/after comparison and additional context see docs/performance.md.
The trading_mt_bench app (app/mt_bench_main.cpp) benchmarks a two-thread pipeline:
- a producer thread:
  - generates synthetic events (ADD / MARKET / CANCEL) using a simple random model,
  - timestamps each event and pushes it into a lock-free SPSC ring buffer (`utils::SpscQueue`),
- a consumer thread:
  - pops events from the queue,
  - applies them to `OrderBook` (`add_limit_order_with_id`, `execute_market_order`, `cancel`),
  - records end-to-end latency for each event from enqueue → processed.

This benchmark focuses on:

- total throughput (events per second),
- end-to-end event latency distribution (p50 / p95 / p99) across the full pipeline.

Note: the current version (mt_bench v1) still uses `std::this_thread::yield()` as a simple backoff strategy in both producer and consumer. This makes the latency numbers sensitive to OS scheduler jitter and represents a pessimistic baseline before tighter spin/backoff and CPU pinning.
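For context, a single-producer / single-consumer ring buffer of the kind described above can be sketched as follows. This is a generic illustration of the idea, not the code in `utils::SpscQueue`, and it assumes exactly one producer thread and one consumer thread:

```cpp
#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

// Minimal bounded SPSC ring buffer: the producer only writes `head_`,
// the consumer only writes `tail_`, so one atomic index per side suffices.
template <typename T>
class SpscRing {
public:
    explicit SpscRing(std::size_t capacity) : buffer_(capacity + 1) {}

    bool try_push(const T& value) {                    // producer thread only
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % buffer_.size();
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        buffer_[head] = value;
        head_.store(next, std::memory_order_release);
        return true;
    }

    std::optional<T> try_pop() {                       // consumer thread only
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return std::nullopt;  // empty
        T value = buffer_[tail];
        tail_.store((tail + 1) % buffer_.size(), std::memory_order_release);
        return value;
    }

private:
    std::vector<T> buffer_;
    std::atomic<std::size_t> head_{0};  // next slot the producer writes
    std::atomic<std::size_t> tail_{0};  // next slot the consumer reads
};
```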
Example command (from build):
```sh
./trading_mt_bench 200000 42
```

Example output:

```
mt_bench: processed 200000 events in 0.0706884 s
throughput: 2.82932e+06 events/s
mean: 353.442 ns/event
Latency (enqueue -> processed):
p50: 23199405 ns
p95: 32158376 ns
p99: 33408312 ns
Final best bid valid=0, price=0, qty=0
Final best ask valid=1, price=101, qty=2
```
Summary (current state, mt_bench v1):
| Mode | Events | Threads | Queue | Throughput (Mev/s) | Mean ns/event | Latency p50 ns | p95 ns | p99 ns |
|---|---|---|---|---|---|---|---|---|
| mt_bench v1 | 200000 | 2 (producer/engine) | SPSC, big buffer | 2.83 | 353 | 23_199_405 | 32_158_376 | 33_408_312 |
| mt_bench v2 | 200000 | 2 (producer/engine) | SPSC, 4096, warmup | 2.62 | 382 | 1_546_276 | 1_655_484 | 1_720_337 |
Planned improvements:

- replace `std::this_thread::yield()` with a tighter spin/backoff strategy, as sketched below (note: with the current implementation, backoff on the VPS actually gave worse results),
- experiment with different build flags (`-O2` / `-O3` / `-march=native`) and compare throughput and latency under the same event flow,
- experiment with different `Level` containers (`std::list` vs `std::deque` / flat structures) and measure their impact on both microbenchmarks and the pipeline.
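One common shape for "tighter spin/backoff" is to spin with a CPU pause for a bounded number of iterations before falling back to `yield()`. A rough, illustrative sketch of that policy (the pause intrinsic and thresholds are assumptions and would need tuning on the target machine, which is exactly why it can behave worse on a shared VPS):

```cpp
#include <cstdint>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>   // _mm_pause
#endif

// Illustrative backoff policy: spin (with a CPU pause) for a while,
// then fall back to yielding to the OS scheduler.
class Backoff {
public:
    void pause() {
        if (spins_ < kMaxSpins) {
            ++spins_;
#if defined(__x86_64__) || defined(_M_X64)
            _mm_pause();                     // hint to the core that we are spin-waiting
#endif
        } else {
            std::this_thread::yield();       // give up the rest of the time slice
        }
    }

    void reset() { spins_ = 0; }             // call after a successful push/pop

private:
    static constexpr std::uint32_t kMaxSpins = 1024;  // tuning knob (assumption)
    std::uint32_t spins_ = 0;
};

// Usage in a producer loop (sketch):
//   Backoff backoff;
//   while (!queue.try_push(event)) backoff.pause();
//   backoff.reset();
```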
We can fetch a real order book snapshot from Bybit via REST and rebuild our OrderBook from it multiple times to measure performance.
Example commands (Release build):
```sh
./trading_bybit_orderbook_snapshot                   # BTCUSDT, limit=50, runs=1000
./trading_bybit_orderbook_snapshot BTCUSDT 50 5000
./trading_bybit_orderbook_snapshot ETHUSDT 100 2000
```

Results on my VPS (Release, single thread):
| Symbol | Depth (bids×asks) | Levels | Runs | ns/snapshot | ns/level |
|---|---|---|---|---|---|
| BTCUSDT | 50×50 | 100 | 1000 | 32,271.4 | 322.7 |
| BTCUSDT | 50×50 | 100 | 5000 | 29,808.1 | 298.1 |
| ETHUSDT | 100×100 | 200 | 2000 | 88,431.6 | 442.2 |
HTTP request time for the snapshot itself is roughly 270–630 ms and is dominated by network latency and Bybit’s API, not by the order book implementation.
These numbers are consistent with the microbenchmarks for OrderBook::add_limit_order and show that the engine can rebuild a 50×50–100×100 depth snapshot in tens of microseconds.
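The rebuild being measured is conceptually simple: walk the snapshot's bid and ask levels and insert one resting limit order per level into a fresh book. A hedged sketch of that inner loop (the `SnapshotLevel` alias, the `trading::Price`/`trading::Quantity` names and the use of one aggregated order per level are illustrative assumptions, not the tool's exact code):

```cpp
#include <utility>
#include <vector>

#include "trading/order_book.hpp"

// Hypothetical representation of one parsed snapshot side: (price, qty) pairs.
using SnapshotLevel = std::pair<trading::Price, trading::Quantity>;

// Rebuilds a fresh OrderBook from parsed snapshot levels. This is the loop
// whose cost is reported as ns/snapshot and ns/level in the table above.
trading::OrderBook rebuild_from_snapshot(const std::vector<SnapshotLevel>& bids,
                                         const std::vector<SnapshotLevel>& asks) {
    trading::OrderBook book;
    for (const auto& [price, qty] : bids)
        book.add_limit_order(trading::Side::Buy, price, qty);    // one aggregated order per level
    for (const auto& [price, qty] : asks)
        book.add_limit_order(trading::Side::Sell, price, qty);
    return book;
}
```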
```cpp
#include "trading/order_book.hpp"

#include <iostream>
#include <string>

int main() {
    using namespace trading;

    OrderBook book;

    // Add some liquidity
    book.add_limit_order(Side::Buy, 100, 10);   // bid: 100 x 10
    book.add_limit_order(Side::Sell, 105, 5);   // ask: 105 x 5

    auto bb = book.best_bid();
    auto ba = book.best_ask();
    std::cout << "Best bid: " << (bb.valid ? std::to_string(bb.price) : "none")
              << " qty=" << bb.qty << "\n";
    std::cout << "Best ask: " << (ba.valid ? std::to_string(ba.price) : "none")
              << " qty=" << ba.qty << "\n";

    // Execute a market buy for quantity 3 – consumes from asks
    auto result = book.execute_market_order(Side::Buy, 3);
    std::cout << "Market buy requested=" << result.requested
              << " filled=" << result.filled
              << " remaining=" << result.remaining << "\n";

    return 0;
}
```

We also measure live WebSocket feed processing for the aggregated orderbook:
- Exchange: Bybit (public WS)
- Symbol: `BTCUSDT`
- Topic: `orderbook.50.BTCUSDT`
- Message types: 1 initial `snapshot` + `delta` updates
- Instrumented in `trading_bybit_ws_orderbook_live`

Example run:

```sh
./trading_bybit_ws_orderbook_live BTCUSDT 1000
```

Output (summarized):
```
Messages: 999 (snapshots=1, deltas=998)
Processing time (handler):
mean: 110310 ns
p50 : 104713 ns
p95 : 161639 ns
p99 : 215189 ns
Data latency (local_now_ms - msg.ts_ms):
mean: 92.38 ms
p50 : 90 ms
p95 : 93 ms
p99 : 115 ms
```
Summary table (baseline run, Release build on VPS):
| Symbol | Depth | Messages | Handler mean ns | p50 ns | p95 ns | p99 ns | Latency mean ms | p50 ms | p95 ms | p99 ms |
|---|---|---|---|---|---|---|---|---|---|---|
| BTCUSDT | 50 | 999 | 110,310 | 104,713 | 161,639 | 215,189 | 92.38 | 90 | 93 | 115 |
Notes:

- The handler includes JSON parsing, applying Bybit deltas to an aggregated level book, and rebuilding `trading::OrderBook` on every message.
- Even with a full rebuild on each `delta`, handler time stays in the ~0.1–0.2 ms range.
- Data latency is dominated by network + exchange processing (~90–115 ms here), so CPU-side processing is only a small fraction of end-to-end latency.
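The delta-application step mentioned above boils down to maintaining a price → quantity map per side: a level with quantity 0 in the delta is removed, any other quantity overwrites the level. A small sketch of that rule (the `LevelMap` alias and `apply_delta_side` helper are illustrative names, not the tool's actual code):

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Price    = std::int64_t;
using Quantity = std::uint64_t;
using LevelMap = std::map<Price, Quantity>;   // aggregated levels for one side

// Bybit-style delta rule: qty == 0 deletes the level, otherwise the level
// is inserted or overwritten with the new aggregated quantity.
void apply_delta_side(LevelMap& side, const std::vector<std::pair<Price, Quantity>>& updates) {
    for (const auto& [price, qty] : updates) {
        if (qty == 0)
            side.erase(price);
        else
            side[price] = qty;
    }
}
```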
This section summarizes the main performance numbers across the different stages of the project.
Config (for all rows):
- iterations: 200,000
- runs: 5
- batch size: 128
- warmup: 20,000
trading_bench_order_book (steady_clock-based), after subtracting empty loop overhead.
| Operation | runs | iters | batch | mean ns/op | p50 ns | p95 ns | p99 ns | Mops/s |
|---|---|---|---|---|---|---|---|---|
| add_limit_order | 5 | 200,000 | 128 | 333.70 | 308.761 | 451.194 | 552.638 | 3.00 |
| execute_market_order | 5 | 200,000 | 128 | 9.62 | 8.523 | 12.971 | 18.625 | 103.90 |
TSC-based variant (trading_bench_order_book_tsc) gives similar values with slightly lower overhead (empty loop ~0.14 ns/op), confirming the order of magnitude above.
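For reference, a TSC-based timer of the kind used by the `_tsc` variant can read the counter with the `__rdtsc()` intrinsic and convert ticks to nanoseconds via a calibration against `steady_clock`. A minimal sketch of that calibration (assumptions: x86-64 with an invariant TSC; this is not the project's actual implementation):

```cpp
#include <chrono>
#include <cstdint>
#include <thread>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>   // __rdtsc
#endif

// Estimates TSC ticks per nanosecond by measuring both clocks over a short sleep.
// Assumes an invariant TSC (constant rate regardless of frequency scaling).
double calibrate_tsc_per_ns() {
    const auto t0 = std::chrono::steady_clock::now();
    const std::uint64_t c0 = __rdtsc();
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    const std::uint64_t c1 = __rdtsc();
    const auto t1 = std::chrono::steady_clock::now();

    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return static_cast<double>(c1 - c0) / ns;
}

// Usage: time a region in ticks, then divide by the calibrated rate.
//   const double tsc_per_ns = calibrate_tsc_per_ns();
//   const std::uint64_t start = __rdtsc();
//   /* ... code under test ... */
//   const double elapsed_ns = (__rdtsc() - start) / tsc_per_ns;
```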
Single-producer / single-consumer pipeline:
- bounded lock-free-ish queue
- explicit warmup
- producer uses `std::this_thread::yield()` when the queue is full

Example run:

```sh
./trading_mt_bench 200000 42
```

Summary:

| Mode | events | throughput (M ev/s) | mean ns/event | p50 latency ns | p95 latency ns | p99 latency ns |
|---|---|---|---|---|---|---|
| SP/SC, bounded queue + warmup + yield() | 200,000 | 2.62 | 382.27 | 1,546,276 | 1,655,484 | 1,720,337 |
Latency here is measured as (dequeue_ts - enqueue_ts) inside the benchmark.
Public REST snapshots for spot orderbook, measured in trading_bybit_orderbook_snapshot.
Config:
- Exchange: Bybit
- Endpoint: `/v5/market/orderbook`
- Depth: `limit` levels per side (bids + asks)
Benchmark summary:
| Symbol | Depth (per side) | runs | total levels (bids+asks) | mean ns/snapshot | mean ns/level |
|---|---|---|---|---|---|
| BTCUSDT | 50 | 5000 | 100 | 29,808.1 | 298.08 |
| ETHUSDT | 100 | 2000 | 200 | 88,431.6 | 442.16 |
This measures building a fresh trading::OrderBook from the JSON snapshot on every run.
WebSocket snapshot (type = snapshot) for the same aggregated orderbook, measured in trading_ws_orderbook_snapshot.
Example run:
```sh
python tools/bybit_ws_orderbook.py | ./trading_ws_orderbook_snapshot BTCUSDT 2000
```

Summary:
| Symbol | Depth (per side) | runs | total levels (bids+asks) | mean ns/snapshot | mean ns/level |
|---|---|---|---|---|---|
| BTCUSDT | 50 | 2000 | 100 | 21,603.7 | 216.04 |
The WS snapshot path is slightly faster per level than the HTTP snapshot path, primarily because we skip the HTTP client overhead and work directly from an already-received JSON message.
For detailed numbers see section 6. Live Bybit WebSocket orderbook. Here is a compact summary table for the baseline run:
| Symbol | Depth | Messages | Handler mean ns | p50 ns | p95 ns | p99 ns | Latency mean ms | p50 ms | p95 ms | p99 ms |
|---|---|---|---|---|---|---|---|---|---|---|
| BTCUSDT | 50 | 999 | 110,310 | 104,713 | 161,639 | 215,189 | 92.38 | 90 | 93 | 115 |
Handler time includes parsing the WS JSON and applying deltas to a simple level book + rebuilding trading::OrderBook on every message. Data latency is local_now_ms - exchange_ts_ms and is dominated by network + exchange processing, not CPU time.
This project is intentionally minimal. Possible extensions:
- More realistic market data replay:
  - timestamps and sequence numbers,
  - basic latency statistics,
  - different event distributions (e.g. bursty flow, quote stuffing).
- More advanced microbenchmarks & profiling:
  - `rdtsc`-based timing,
  - CPU affinity / pinning,
  - comparisons of different data structures (flat maps, arrays, custom pools).
- Multi-symbol support (see the sketch after this list):
  - one `OrderBook` per symbol,
  - a `BookManager` that routes events by symbol id.
- Concurrency:
  - single-writer / multi-reader model,
  - SPSC or MPSC queues between feed handler, book and strategy,
  - later: experiments with lock-free structures.
- Memory & performance:
  - custom allocators / pools for orders and levels,
  - more cache-friendly layouts (e.g. flat containers, struct-of-arrays),
  - hot-path micro-optimizations based on profiling.
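To make the multi-symbol idea concrete, routing could be as simple as a dense array of books keyed by a symbol id. A tiny sketch of what such a `BookManager` might look like (purely illustrative; neither the class nor `SymbolId` exists in the codebase yet, and it assumes `OrderBook` is default-constructible and movable):

```cpp
#include <cstdint>
#include <vector>

#include "trading/order_book.hpp"

using SymbolId = std::uint32_t;   // dense id assigned when a symbol is registered

// Hypothetical router: owns one OrderBook per symbol and dispatches by id.
class BookManager {
public:
    SymbolId add_symbol() {
        books_.emplace_back();
        return static_cast<SymbolId>(books_.size() - 1);
    }

    trading::OrderBook& book(SymbolId id) { return books_[id]; }

    void on_add(SymbolId id, trading::Side side, trading::Price px, trading::Quantity qty) {
        books_[id].add_limit_order(side, px, qty);
    }

    void on_market(SymbolId id, trading::Side side, trading::Quantity qty) {
        books_[id].execute_market_order(side, qty);
    }

private:
    std::vector<trading::OrderBook> books_;   // index == SymbolId
};
```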
Even in its current form, this project is useful as a concrete example of C++ systems code to discuss in interviews (data structures, trade-offs, complexity and further evolution).