August 27, 2025
Infrastructure & Systems Engineering: Complete Crypto HFT Dossier
5) Infrastructure & Systems Engineering — the complete, mathematically-grounded, no-fluff dossier (crypto HFT lens)
You asked for everything: keywords, math, equations, relations, how people misunderstand
things, stepwise examples, and concrete takeaways. Here you go — long, dense, practical,
and written so you can operate, measure, design, and reason about tradeoffs.
I’ll organize this as: design objectives → latency math → networking & OS → hardware choices
(FPGA/CPU/NIC) → clocking/timestamps → software architecture (order path, signers)
→ queueing & capacity planning (math + numeric examples) → monitoring & SLOs →
security & key management → deployment/runbook → common misunderstandings
→ actionable checklist & example latency budget. Where I do arithmetic, I show the
intermediate steps.
A. High-level design goals (what your infra must optimize)
1. Latency — minimize decision-to-wire and roundtrip latency (and its variance).
2. Determinism & jitter control — minimize tail latency (p99/p99.99), not just
mean.
3. Throughput — sustain required transactions-per-second without collapsing queues.
4. Reliability & observability — detect degradations fast, recover automatically.
5. Security — keys, signing, and fail-safe controls must be airtight.
6. Cost / ROI — spend where marginal PnL per microsecond justifies cost.
Everything below maps to these objectives.
B. The latency decomposition (the core equation)
Total one-way latency from your decision to the exchange observing your packet (approx):
L_one-way = L_app + L_sys + L_nic + L_serial + L_prop + L_switch + L_exchange-front.
Round-trip latency (submit → match → ACK) ≈ 2 · L_one-way + L_matching, where L_matching is
the matching engine work time.
Component meanings:
• L_app: app processing (serialize message, risk checks).
• L_sys: OS/kernel overhead (syscall, context switch) unless kernel bypass is used.
• L_nic: NIC queue/driver overhead (interrupt handling, DMA).
• L_serial: serialization on wire = packet_size_bits / link_bps.
• L_prop: propagation delay = distance / signal speed (fiber ≈ 2/3 c).
• L_switch: per-switch/hop latency (microseconds per switch).
• L_exchange-front: front-end ingress processing at the exchange.
Why write it explicitly? Because improving latency means attacking these terms —
not some vague “optimize network”. You must measure and attribute.
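To make the decomposition concrete, here is a minimal Python sketch (all component values are placeholders, not measurements) that sums the terms above and ranks which one dominates:

    # Sketch: sum per-component one-way latencies and attribute the budget.
    # Component names mirror the decomposition above; values are placeholders.

    def one_way_latency_us(components: dict[str, float]) -> float:
        """Sum per-component one-way latencies (microseconds)."""
        return sum(components.values())

    def round_trip_us(one_way_us: float, matching_us: float) -> float:
        """Roundtrip estimate: 2 * one-way + matching-engine work time."""
        return 2.0 * one_way_us + matching_us

    if __name__ == "__main__":
        components = {            # hypothetical numbers; replace with hw-timestamped data
            "app": 2.0, "sys": 1.0, "nic": 1.0, "serial": 0.16,
            "prop": 0.5, "switch": 1.0, "exchange_front": 10.0,
        }
        ow = one_way_latency_us(components)
        print(f"one-way ~ {ow:.2f} us, roundtrip ~ {round_trip_us(ow, 50.0):.2f} us")
        for name, v in sorted(components.items(), key=lambda kv: -kv[1]):
            print(f"  {name:15s} {v:6.2f} us  ({100*v/ow:4.1f}% of one-way)")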
C. Physics of propagation & serialization — numeric, exact, stepwise
Useful constants:
• speed of light in vacuum c ≈ 299,792,458 m/s.
• speed in fiber v ≈ 0.67c ≈ 200,000 km/s (we use 200,000 km/s as a good engineering number).
Example 1 — propagation delay (step-by-step)
Compute one-way propagation for a distance of 5,567 km (approximately NYC to London):
1. distance = 5,567 km
2. speed in fiber v = 200,000 km/s
3. time = distance / speed = 5,567 / 200,000 = 0.027835 s
4. convert to ms: 0.027835 × 1000 = 27.835 ms
So one-way ≈ 27.835 ms, roundtrip ≈ 2 × 27.835 = 55.67 ms.
Example 2 — short distances (intra-DC / colo)
• 100 km → time = 100 / 200000 = 0.0005 s = 0.5 ms one-way.
• 10 km → time = 10 / 200000 = 0.00005 s = 0.05 ms = 50 µs one-way.
Implication: propagation distance dominates all other micro-optimizations once you cross a few
hundred kilometers. Co-location (same data center / POP) collapses the propagation term to
microseconds; cross-region adds tens of milliseconds.
Serialization time (packet on the wire)
Serialization: t_s = packet_size_bits / bandwidth_bps.
Step example: 1500 bytes at 10 Gbps
1. bytes = 1500
2. bits = 1500 × 8 = 12,000 bits
3. bandwidth = 10 Gbps = 10,000,000,000 bits/s
4. ts = 12,000/10,000,000,000 = 0.0000012 s = 1.2 µs
Small packets serialize much faster: 64 bytes → 512 bits → at 10 Gbps → 512 / 10e9 =
5.12e-8 s = 51.2 ns.
Engineer’s note: For microsecond latency budgets, packet size matters — use compact
framing.
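A small Python sketch that reproduces the arithmetic above (the 200,000 km/s fiber figure and the bits-over-bandwidth serialization formula); the distances and packet sizes are the example values, not recommendations:

    # Sketch of the propagation and serialization calculations above.

    FIBER_KM_PER_S = 200_000.0  # engineering approximation (~0.67c)

    def propagation_s(distance_km: float) -> float:
        """One-way propagation delay in fiber, in seconds."""
        return distance_km / FIBER_KM_PER_S

    def serialization_s(packet_bytes: int, bandwidth_bps: float) -> float:
        """Time to put the packet on the wire, in seconds."""
        return (packet_bytes * 8) / bandwidth_bps

    print(propagation_s(5_567) * 1e3)         # ~27.835 ms one-way (approx. NYC to London)
    print(propagation_s(10) * 1e6)            # ~50 us one-way (10 km)
    print(serialization_s(1500, 10e9) * 1e6)  # ~1.2 us (1500 B at 10 Gbps)
    print(serialization_s(64, 10e9) * 1e9)    # ~51.2 ns (64 B at 10 Gbps)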
D. Queueing effects — math, pitfalls, and numbers
Network devices and software are queues. When utilization approaches capacity, delays
explode.
M/M/1 queue (illustrative)
If arrival rate λ (jobs/sec) and service rate µ (jobs/sec) with exponential assumptions:
• Utilization ρ = λ/µ.
• Mean time in system (sojourn): W = 1/(µ − λ) = 1/(µ(1 − ρ)).
Numeric example (10 Gbps, 1500B packets)
Compute the service rate µ (packets/s) for 1500 B packets at 10 Gbps:
1. packet bits = 1500 × 8 = 12,000 bits
2. bandwidth = 10 × 10^9 bits/s
3. µ = 10 × 10^9 / 12,000 = 833,333.33... packets/s
Now pick arrival λ:
• λ = 800,000 pps → ρ = 800,000/833,333.33 = 0.96 → W = 1/(833,333.33 − 800,000) =
1/33,333.33 = 0.00003 s = 30 µs.
• λ = 825,000 pps → ρ = 0.99 → W = 1/(833,333.33 − 825,000) = 1/8,333.33 =
0.00012 s = 120 µs.
• λ = 833,000 pps → ρ ≈ 0.9996 → W = 1/(833,333.33 − 833,000) = 1/333.33 = 0.003
s = 3 ms.
Takeaway: pushing link utilization from 96% to ~99.96% can increase mean latencies by
orders of magnitude. For low-latency trading, keep device utilization well below saturation
(e.g., under 70–80%) and provision headroom.
More realistic: Router/switch queues are not memoryless; you get fat tails. Use
p99/p999 metrics.
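The same numbers in a short Python sketch, useful for sweeping your own λ and µ; it assumes the idealized M/M/1 model above, so treat the outputs as a lower bound on real queueing delay:

    # Sketch of M/M/1 sojourn time W = 1/(mu - lambda); it blows up as rho -> 1.

    def mm1_sojourn_s(lam_pps: float, mu_pps: float) -> float:
        if lam_pps >= mu_pps:
            raise ValueError("unstable queue: lambda must be < mu")
        return 1.0 / (mu_pps - lam_pps)

    mu = 10e9 / (1500 * 8)  # ~833,333 packets/s for 1500 B frames on 10 Gbps
    for lam in (800_000, 825_000, 833_000):
        rho = lam / mu
        print(f"lambda={lam:>7} pps  rho={rho:.4f}  W={mm1_sojourn_s(lam, mu)*1e6:8.1f} us")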
E. OS & NIC: kernel bypass, interrupts, CPU pinning, and why they matter
Key keywords: kernel bypass (DPDK, AF_XDP), NIC hardware timestamping, SR-IOV,
RSS, RPS/XPS, CPU pinning, busy-polling, hugepages, NUMA.
Why the kernel hurts
• syscalls, context switches, interrupts → microsecond-scale overhead per packet. On
commodity stacks, syscall + send can be tens of microseconds (or more under load).
• For microsecond-grade latency you want to avoid interrupt-driven packet processing
and avoid scheduler jitter.
Remedies
• Kernel-bypass frameworks (DPDK, Solarflare Onload, AF_XDP): map NIC
DMA rings into user-space, poll rings, and do zero-copy — removes syscall/context-
switch latencies.
• Busy-polling: spin on CPU for network events to avoid context switching; expensive
on CPU but reduces wake/sleep jitter.
• SR-IOV / VF: partition NIC resources safely across VMs/processes.
• NIC features: hardware timestamping (PTP), TCP offloads, LSO/GSO (large
send offload) — sometimes offloads hurt latency (they batch), so disable features
that increase per-packet latency.
• CPU pinning & NUMA: pin networking threads to cores on same NUMA node
as NIC to avoid cross-node memory latencies.
Practical pattern: dedicate a core (or set of cores) to a tight poll-loop that handles
network IO + order generation. Pre-allocate all memory (no malloc at runtime), use
hugepages to avoid TLB misses, avoid page faults.
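A purely illustrative Python sketch of that pattern (pin to a dedicated core, pre-allocate the buffer, busy-poll instead of blocking); a real hot path would be C with DPDK or Onload, and the core id and port below are placeholders:

    # Illustrative only: shows the shape of pin + pre-allocate + busy-poll.
    # Linux-specific sched_setaffinity call; core 3 and port 9000 are placeholders.

    import os
    import socket

    os.sched_setaffinity(0, {3})          # pin this process to one isolated core
    buf = bytearray(2048)                 # pre-allocated receive buffer (no runtime malloc)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 9000))
    sock.setblocking(False)               # never sleep in the kernel; spin instead

    while True:                           # busy-poll: trades a full core for low, stable wake-up latency
        try:
            n, addr = sock.recvfrom_into(buf)
        except BlockingIOError:
            continue                      # nothing to read; keep spinning
        # ... parse market data from buf[:n], run risk checks, emit order ...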
F. Hardware choices — CPUs, NICs, FPGAs, and where each shines
CPU (x86)
• Flexible and fast for complex decision logic and ML features, but higher (and less
deterministic) latency than an FPGA on a simple fast path.
• Modern CPUs with high single-thread IPC + high clock are good for decision
latency.
NIC
• Choose NICs with:
– hardware timestamping,
– kernel-bypass support,
– programmable flows (XDP),
– low-latency driver stacks (Solarflare, Mellanox/ConnectX)
• Beware NIC features that buffer/aggregate (LSO) — they increase latency.
FPGA / SmartNIC
• FPGAs excel when you need deterministic sub-microsecond processing (packet
filtering, order message shaping, signature pipelines).
• Use-cases: market data decoding (parse binary feed in hardware), order serialization,
pre-signing operations, hardware timestamping, or even order-by-order deterministic
throttling.
• Downside: longer dev cycle and cost.
GPU
• Good for batchable heavy compute (simulations, ML), but not for low-latency
per-message paths.
Rule of thumb: CPU + kernel-bypass NIC for most strategies; add FPGA when you
require sub-50 µs deterministic latency for many replacement cycles.
G. Time synchronization & timestamping (PTP / hardware timestamps)
Why: to measure latency accurately, reconstruct causality across machines, and enforce
correct ordering.
• NTP: ms-level accuracy — not enough.
• PTP (Precision Time Protocol): µs to sub-µs (with hardware timestamping) —
required for timestamping trades and LOB events precisely.
• Hardware timestamping: NICs that timestamp packets in hardware give the
single-source truth for transmit/receive times (crucial for p99 measurements and
latency attribution).
Anti-pattern: trusting OS timestamps alone — they are delayed by the kernel path and subject
to jitter.
H. Order path & software architecture (concrete components)
Minimum components and responsibilities:
1. Strategy process: computes decision on each tick; outputs an order intent.
2. Risk & compliance engine: checks notional limits, shelf-limits, pre-trade risk;
must be fast (often inlined in path).
3. Sequencer / order gateway: converts intent → wire bytes, adds headers, signs
(or passes to signer), and pushes to NIC.
4. Signer / HSM: signs transactions (for on-chain) or signs order messages. Place
signer off the hot path if signing is slow (use pre-sign or hardware signers).
5. Nonce manager (for on-chain): centralized atomic nonce allocator to prevent
double-nonce collisions.
6. Network I/O thread(s): kernel-bypass pollers for send/receive.
7. Ack/reconciliation loop: match submit → exchange ack → update local state;
handle missing ACKs, resends.
8. Order store & state machine: persistent ledger of outstanding orders, fills,
partial fills, cancels.
9. Risk kills & circuit breakers: immediate stops on anomalies.
Key design patterns:
• Keep risk checks cheap and deterministic (simple arithmetic) on the decision
path.
• Offload slow tasks (logging, metrics) to separate async threads using lock-free queues.
• Provide strong idempotency: each order has a unique client order id; resends handle
duplicates.
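A minimal Python sketch of the idempotency and order-state pattern described above; the states and handlers are illustrative, not any exchange's actual protocol:

    # Sketch: unique client order ids + an order store whose handlers are idempotent,
    # so resent submits and duplicate ACKs reconcile cleanly.

    import itertools
    from dataclasses import dataclass

    _seq = itertools.count(1)

    def new_client_order_id(session: str) -> str:
        return f"{session}-{next(_seq)}"   # unique per session; safe to resend

    @dataclass
    class Order:
        cl_ord_id: str
        qty: float
        filled: float = 0.0
        state: str = "PENDING_NEW"         # PENDING_NEW -> ACKED -> PARTIAL/FILLED

    class OrderStore:
        def __init__(self):
            self._orders: dict[str, Order] = {}

        def submit(self, order: Order):
            self._orders.setdefault(order.cl_ord_id, order)   # duplicate submits are no-ops

        def on_ack(self, cl_ord_id: str):
            o = self._orders[cl_ord_id]
            if o.state == "PENDING_NEW":
                o.state = "ACKED"

        def on_fill(self, cl_ord_id: str, qty: float):
            o = self._orders[cl_ord_id]
            o.filled += qty
            o.state = "FILLED" if o.filled >= o.qty else "PARTIAL"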
I. Nonce management & signing for on-chain HFT
For on-chain trading, nonce management is critical.
Problem: parallel transactions need unique nonces; races cause one to fail (revert) or
replace.
Solution pattern: centralized nonce allocator
• Single-threaded service or a lock-protected DB row that returns the next nonce N
for (account, chain).
• Strategy thread asks for nonce, immediately constructs unsigned tx, passes to signer.
• For high throughput, pre-allocate ranges: allocate [N, N+k) and manage locally; on
restart, reconcile.
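A minimal Python sketch of that allocator pattern (a single lock-protected counter per (account, chain), with optional range pre-allocation); persistence and restart reconciliation are intentionally elided:

    # Sketch of a centralized nonce allocator; durability and crash recovery are omitted.

    import threading

    class NonceAllocator:
        def __init__(self, starting: dict[tuple[str, str], int]):
            self._next = dict(starting)        # {(account, chain): next_nonce}
            self._lock = threading.Lock()

        def next_nonce(self, account: str, chain: str) -> int:
            with self._lock:
                n = self._next[(account, chain)]
                self._next[(account, chain)] = n + 1
                return n

        def allocate_range(self, account: str, chain: str, k: int) -> range:
            """Hand out [N, N+k) so a hot strategy thread can consume nonces locally."""
            with self._lock:
                n = self._next[(account, chain)]
                self._next[(account, chain)] = n + k
                return range(n, n + k)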
Signing performance:
• HSMs or KMS can be a bottleneck; pre-sign batches if safe, or use hardware signers
with high TPS.
Security:
• Keys should live in HSM/KMS. Use signing proxies with least privileges, audit logs.
J. Matching engine & exchange front-end notes (how they affect you)
• Matching engines are optimized around a deterministic ordering of arrivals (sequence
numbers, timestamps). Some exchanges accept multi-request gateways; others have
strict TCP/UDP front-ends.
• Backpressure and flow control: exchanges may drop or reject on overload —
monitor rejects and watchdog your outbound queues.
• Sequence numbers & snapshots: reconstruct LOB via snapshot + deltas; detect
seq gaps and resync.
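A small Python sketch of the gap-detection logic; book.apply and request_snapshot are hypothetical stand-ins for your LOB and feed-handler interfaces:

    # Sketch: apply deltas only while sequence numbers are contiguous;
    # on a gap, stop and resync from a fresh snapshot.

    def apply_feed(snapshot_seq: int, deltas, book, request_snapshot):
        expected = snapshot_seq + 1
        for msg in deltas:
            if msg["seq"] < expected:
                continue                  # stale/duplicate delta: already applied
            if msg["seq"] > expected:
                request_snapshot()        # gap detected: resync before trading on this book
                return
            book.apply(msg)               # contiguous: apply the delta
            expected += 1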
K. Observability — metrics you must collect
Latency metrics: p50/p95/p99/p99.9/p99.99 for:
• decision-to-wire,
• wire-to-exchange-receive (if exchange timestamps available),
• submit-to-inclusion,
• cancel latency,
• time-to-fill.
System metrics:
• NIC stats (packets/sec, drops, ring-full),
• Interrupt rates,
• CPU counters (cache misses, context switches),
• PTP offset & drift.
Business metrics:
• per-order slippage,
• fill rates,
• reject rates,
• revenue per microsecond (see ROI model below).
Tooling: eBPF for tracing, hardware NIC counters (ethtool -S), perf, Prometheus +
Grafana.
L. Queueing & capacity planning (practical formulas + example)
You must size threads, NIC queues, and RPC pipes. Use queueing theory as a guide — not
gospel — but the formulas are helpful.
M/M/c (c servers) expected wait (Erlang C) — use it to model a thread pool serving
requests. For heavy load, compute the waiting probability and the number of servers required
(see the sketch below).
Design rule: design to an SLA of p99 latency < target. Start by measuring the mean and
variance of service time; choose the thread-pool size so utilization ρ < 70–80% during peak to
keep p99 in check.
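A Python sketch of the Erlang C calculation for picking c; the arrival and service rates in the example are placeholders:

    # Sketch: Erlang C waiting probability and mean queueing wait for M/M/c.

    from math import factorial

    def erlang_c_wait_prob(lam: float, mu: float, c: int) -> float:
        a = lam / mu                              # offered load (Erlangs)
        rho = a / c
        if rho >= 1:
            return 1.0                            # unstable: everything waits
        top = (a ** c) / (factorial(c) * (1 - rho))
        bottom = sum(a ** k / factorial(k) for k in range(c)) + top
        return top / bottom

    def mm_c_mean_wait_s(lam: float, mu: float, c: int) -> float:
        return erlang_c_wait_prob(lam, mu, c) / (c * mu - lam)   # time in queue, excl. service

    # Placeholder example: 50,000 req/s, 0.1 ms mean service time (mu = 10,000 req/s per thread)
    for c in (6, 7, 8, 10):
        print(c, f"{mm_c_mean_wait_s(50_000, 10_000, c)*1e6:.1f} us queueing wait")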
M. Co-location vs cloud: tradeoffs and numbers
• Co-location (on-exchange POP): minimal propagation; typical one-way intra-DC
≪ 100 µs. Best for lowest-latency HFT. But: expensive (rack space, port fees),
less flexible.
• Cloud (AWS/GCP): easier to deploy and scale; cross-region propagation adds ms.
For many crypto strategies (e.g., MEV research, off-chain arbitrage) the flexibility
may be better; for firm latency arbitrage, colo is usually required.
Simple decision model: Let ∆L = latency advantage in seconds gained by colocating.
Suppose expected incremental PnL per second of latency improvement is α (USD / s
/ unit volume). Colocation cost per year = C. Break-even when α · ∆L · throughput ·
trading days ≥ C. Model it numerically with your own α.
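A tiny Python sketch of that inequality with clearly made-up inputs; plug in your own α, ∆L, throughput, and cost:

    # Sketch of the colocation break-even check; every input value here is made up.
    # alpha: USD per second of latency advantage per unit of daily volume;
    # delta_l_s: latency gained by colocating (s); throughput: volume units per trading day.

    def colo_breaks_even(alpha, delta_l_s, throughput, trading_days, annual_cost_usd) -> bool:
        return alpha * delta_l_s * throughput * trading_days >= annual_cost_usd

    # Hypothetical: 2 ms gained, alpha = $5,000, 400 volume units/day, 250 days, $300k/yr colo
    print(colo_breaks_even(5_000, 0.002, 400, 250, 300_000))   # True: colo pays for itself here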
N. Reliability & safe defaults (must-have controls)
• Pre-trade hard limits (max size/notional per symbol per order).
• Kill switch: global e-stop if (a) RPC lag > threshold, (b) p99 latency spikes, (c)
high reject rate.
• Idempotency: unique client order ids to reconcile resends.
• Backpressure: if downstream is slow, drop non-critical orders or slow strategies
gracefully.
• Position limits and forced hedges.
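A minimal Python sketch of the kill-switch control from the list above; the thresholds are placeholders and the trip is deliberately sticky (manual reset required):

    # Sketch: trip a global e-stop flag when any health signal crosses its threshold;
    # the hot path checks ok_to_send() before every order. Thresholds are placeholders.

    import threading

    class KillSwitch:
        def __init__(self, max_rpc_lag_ms=500, max_p99_us=250, max_reject_rate=0.05):
            self.tripped = threading.Event()
            self.max_rpc_lag_ms = max_rpc_lag_ms
            self.max_p99_us = max_p99_us
            self.max_reject_rate = max_reject_rate

        def check(self, rpc_lag_ms, p99_us, reject_rate):
            if (rpc_lag_ms > self.max_rpc_lag_ms
                    or p99_us > self.max_p99_us
                    or reject_rate > self.max_reject_rate):
                self.tripped.set()        # sticky: requires human reset

        def ok_to_send(self) -> bool:
            return not self.tripped.is_set()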
O. Security & key management (strong guidance)
• Use HSMs or managed KMS with hardware-backed keys. Never store private keys
plain on hot hosts.
• Use signing proxies with limited scopes and audit logs.
• Use multi-sig / guardian setups for high-value custody.
• Rotations, access control, and emergency key compromise procedures.
P. Tests, validation, and chaos engineering
• Unit tests for nonces, order state transitions, reconcilers.
• Replay engine: deterministic replay of market data with actors to validate strategy
and infra behavior.
• Stress tests: generate synthetic events at pps above production to validate
drop/rejection behavior.
• Chaos: simulate RPC disconnects, high latency, packet drops; verify kills and
safe-state transitions.
Q. Where people are usually wrong / misunderstandings
1. “Lowering mean latency is enough.” — It isn’t. Tail latency (p99/p99.99)
usually kills you. Design for tails.
2. “More bandwidth = lower latency.” — Not true: bandwidth reduces serialization
time for large packets but does nothing for propagation and can raise utilization
(increasing queueing delay).
3. “Cloud is almost as fast as colocation.” — For time-critical HFT against a specific
exchange, propagation between the cloud region and the exchange dominates: colo wins.
4. “FPGA always better.” — Only for very simple deterministic, per-packet pro-
cessing; for complex strategies the cost & dev cycle often outweigh benefits.
5. “Pre-signing is free.” — Pre-signing transactions can be risky (stale nonces,
replay); it requires careful nonce management and security controls.
6. “Use NTP, it’s fine.” — NTP’s ms precision is unacceptable for µs-level attribution
or matching engine debugging.
R. ROI / value model (mathematical sketch)
You must justify every infra spend. A simple model:
Let
• ∆t be microseconds of latency improvement,
• V be average dollars you trade per second (exposure),
• β be sensitivity of expected profit to latency (USD per second per microsecond) —
empirically measured.
Approximate incremental annual PnL from improving latency by ∆t:
∆PnL ≈ β · ∆t · active_seconds_per_year.
Estimate β from backtests (regress realized edge vs latency). Use this to prioritize infra
work. If cost of an FPGA + colo > ∆PnL over N-year horizon, don’t build it.
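A Python sketch of the projection; β in the example is a made-up value, and in practice you estimate it by regressing realized edge against measured latency in backtests:

    # Sketch of the ROI projection above; beta here is a placeholder, not a measured value.

    def annual_pnl_gain_usd(beta_usd_per_us_per_s: float, delta_t_us: float,
                            active_seconds_per_year: float) -> float:
        return beta_usd_per_us_per_s * delta_t_us * active_seconds_per_year

    active_s = 6 * 3600 * 250            # 6 trading hours/day, 250 days/year
    print(f"${annual_pnl_gain_usd(1e-3, 20.0, active_s):,.0f} per year from a 20 us improvement")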
S. Example: designing an ultra-low-latency order placer — full budget
Goal: Decision → order visible at exchange in ≤ 100 µs (one-way) in co-located environ-
ment.
Budget table (example numbers; replace with measured values):
Component                                     Target (µs)   Why
App processing (minimal serialization)        2             pre-compute message templates
Kernel-bypass send (DPDK poll)                1             avoid syscalls
NIC DMA & ring                                1             pre-pinned memory
Serialization on wire (small packet, 200 B)   0.16          computed below
Switch hop (1)                                1
Propagation (intra-DC, 100 m)                 0.5
Exchange front-end                            10–50         depends on exchange
Matching engine                               20–100        depends on exchange
Packet serialization (200 bytes at 10Gbps):
1. 200 bytes ×8 = 1,600 bits
2. 10 Gbps = 10,000,000,000 bits/s
3. t = 1,600/10,000,000,000 = 1.6 × 10^-7 s = 0.16 µs
Sum (optimistic): 2 + 1 + 1 + 0.16 + 1 + 0.5 + 10 ≈ 15.66 µs plus the matching engine. With
a matching engine at, say, 50 µs, decision-to-visible is ≈ 65.7 µs one-way and the roundtrip is
≈ 2 × 15.66 + 50 ≈ 81 µs. Realistic numbers diverge by exchange; measure everything.
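The same budget in a few lines of Python, assuming a 50 µs matching-engine time:

    # Sketch that recomputes the example budget: 200 B serialization at 10 Gbps,
    # the optimistic one-way sum, and the roundtrip estimate with an assumed 50 us matching time.

    serial_us = 200 * 8 / 10e9 * 1e6                 # 0.16 us
    budget_us = [2, 1, 1, serial_us, 1, 0.5, 10]     # app, bypass send, NIC, wire, switch, prop, exch front
    one_way = sum(budget_us)                         # ~15.66 us
    matching_us = 50.0
    print(f"one-way ~ {one_way:.2f} us, decision-to-visible ~ {one_way + matching_us:.2f} us, "
          f"roundtrip ~ {2*one_way + matching_us:.2f} us")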
Key engineering tactics:
• Pre-build the binary message and copy to NIC DMA region (zero serialization cost
in app path).
• Use small packets, avoid TLS/extra layers in front path (but balance security needs).
• Time everything with hardware timestamps.
T. Actionable checklist (what to implement first)
1. Measure first: baseline p50/p95/p99 for decision-to-wire and submit→included
using hardware timestamps. You can’t optimize what you don’t measure.
2. Kernel-bypass prototype: run a minimal DPDK-based sender/receiver to exercise
NIC rings; compare latency vs standard sockets.
3. PTP + hardware timestamps: enable NIC hw-ts and sync clocks; validate
offsets < 1 µs.
4. Nonce manager: implement centralized nonce allocator if on-chain; test for race
conditions with replay tests.
5. Monitoring: p50/p95/p99/p999 dashboards, NIC ring fullness, CPU pinning
heatmaps, and PTP offsets.
6. Circuit breakers & backpressure: implement simple kill-switch and safe-state.
7. Replay & stress: replay 2× expected traffic, observe p99 and behavior under
overload.
8. Security: HSM for signing and secure signer proxy.
U. Short list of the most important metrics (for dashboards)
• decision_to_wire_us (p50/p95/p99/p999)
• wire_to_exchange_receive_us (with hw timestamps)
• submit_to_include_ms (p50/p95/p99)
• NIC ring fullness (%), Rx/Tx drops
• Interrupt/sec and context switches/sec on pinned cores
• PTP offset & drift (µs)
• CPU usage per pinned core, cache-miss rates
• Reject & stale-nonce counts
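One way these might be exported, sketched with the third-party prometheus_client library; the metric names follow the list above and the bucket boundaries are placeholders:

    # Sketch: export latency histograms and counters for the dashboard metrics above.
    # Requires the prometheus_client package; bucket edges are placeholders (microseconds).

    from prometheus_client import Counter, Histogram, start_http_server

    US_BUCKETS = (5, 10, 25, 50, 100, 250, 500, 1000, 5000, float("inf"))

    decision_to_wire_us = Histogram(
        "decision_to_wire_us", "Decision-to-wire latency (us)", buckets=US_BUCKETS)
    wire_to_exchange_receive_us = Histogram(
        "wire_to_exchange_receive_us", "Hw-timestamped wire-to-exchange-receive (us)", buckets=US_BUCKETS)
    stale_nonce_total = Counter("stale_nonce_total", "Orders rejected due to stale nonce")

    start_http_server(9100)              # scrape endpoint for Prometheus
    decision_to_wire_us.observe(12.7)    # record one sample (value in us)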
V. Final high-IQ examples & critical cautions
Example — why microbenchmarks mislead
You microbenchmark a CPU send path and get 2 µs. In production, under bursts, context
switches and NIC ring contention add tens to hundreds of µs. Always benchmark under
production-like load, not just cold runs.
Example — serialization vs propagation tradeoff
For cross-continental arbitrage, shaving 10 µs of serialization is irrelevant when propagation
is 30 ms. Focus on co-location or on-chain-specific strategies instead.
Example — the cost/benefit of FPGA
If you expect to win an extra 0.1 basis point per execution from sub-50 µs execution and
you do 1 billion notional/day, that value might justify FPGA + colo. But if your edge is
fragile or low-volume, the FPGA bill won’t pay for itself.
W. Appendix — formulas & quick reference
• Propagation: t = d (km) / 200,000 (km/s), in seconds.
• Serialization: t_s = (bytes × 8) / bandwidth (bps).
• M/M/1 mean time in system: W = 1 / (µ − λ).
• Utilization: ρ = λ/µ.
• Roundtrip estimate: L_RT ≈ 2 · L_one-way + L_matching.
X. Closing / next steps (what I can produce for you immediately)
I can immediately produce any of the following (pick one, I’ll deliver runnable/proof-of-
concept artifacts right now):
1. Latency attribution script: Python skeleton that accepts hw timestamps (tx/rx)
and computes component breakdown, p50/p95/p99, tail attribution.
2. DPDK send/receive minimal example (annotated) and profiling checklist (what
to measure).
3. Nonce manager + signer microservice (Python/Go) with atomic allocation,
pre-allocation, recovery paths, and integration tests.
4. Capacity calculator: interactive notebook that inputs expected arrival distribu-
tions and outputs required cores/NICs and expected p99 latencies under M/M/1 or
M/M/c approximations.
5. Grafana dashboard JSON tuned to the infra metrics above.
Say 1 (or 2, 3, 4, 5) and I’ll give the artifact now.