Ultra-low-latency and high-throughput multiprocess transport stack over SHM and mmap ring buffers on Linux and macOS.
Dr Bill Dally is working on shaving nanoseconds by shrinking the distance data has to travel (on-chip wires, off-chip PHYs, memory-to-compute) and stripping overhead out of the path. System software should be just as serious about stripping copies, wakeups, coordination, and communication distance out of its own path.
myelon borrows its name from the Greek root behind the spinal cord: the signal path between brain and body. Wrapped in myelin, that cord exists to move impulses fast and clean. This crate tries to do the same for processes: give independent OS processes a low-latency fabric over SHM and mmap, with as little copying, waiting, and coordination drag as possible.
myelon is the default crate in this repository. It builds on top of disruptor-mp, which extends LMAX Disruptor's HFT-grade design (lock-free atomics, no syscalls in the hot path, cache-aligned cursors, busyspin wait) to cross-process IPC. myelon keeps the raw ring reachable and adds framing, codecs, typed zero-copy, topology helpers, and layout helpers for low-latency pipelines.
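To make the design vocabulary above concrete, here is a minimal single-process sketch of a ring with cache-aligned cursors, acquire/release atomics, and a busy-spin wait. It is not the `disruptor-mp` API: `SpscRing`, `PaddedCursor`, and `run_demo` are hypothetical names, the 64-byte cache-line size is an assumption, and the real crates place the ring in SHM or an mmap'd file rather than on the heap.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Gives each cursor its own cache line so producer and consumer do not
/// false-share. 64 bytes is an assumption (typical for x86-64).
#[repr(align(64))]
struct PaddedCursor(AtomicU64);

/// Hypothetical single-producer single-consumer ring of `N` slots.
struct SpscRing<const N: usize> {
    slots: [UnsafeCell<u64>; N],
    published: PaddedCursor, // next sequence the producer will write
    consumed: PaddedCursor,  // next sequence the consumer will read
}

// SAFETY: exactly one producer writes a slot before publishing it with a
// Release store, and exactly one consumer reads it after an Acquire load.
unsafe impl<const N: usize> Sync for SpscRing<N> {}

impl<const N: usize> SpscRing<N> {
    fn new() -> Self {
        assert!(N.is_power_of_two());
        Self {
            slots: std::array::from_fn(|_| UnsafeCell::new(0)),
            published: PaddedCursor(AtomicU64::new(0)),
            consumed: PaddedCursor(AtomicU64::new(0)),
        }
    }

    /// Producer side: busy-spin until a slot is free, write, then publish.
    fn publish(&self, value: u64) {
        let seq = self.published.0.load(Ordering::Relaxed);
        while seq - self.consumed.0.load(Ordering::Acquire) >= N as u64 {
            std::hint::spin_loop();
        }
        // SAFETY: the consumer has released this slot (checked above) and
        // cannot observe it again until the Release store below.
        unsafe { *self.slots[(seq as usize) & (N - 1)].get() = value };
        self.published.0.store(seq + 1, Ordering::Release);
    }

    /// Consumer side: busy-spin until the next sequence is published.
    fn consume(&self) -> u64 {
        let seq = self.consumed.0.load(Ordering::Relaxed);
        while self.published.0.load(Ordering::Acquire) <= seq {
            std::hint::spin_loop();
        }
        // SAFETY: the Acquire load above observed the producer's Release
        // store for this sequence, so the slot write is visible.
        let value = unsafe { *self.slots[(seq as usize) & (N - 1)].get() };
        self.consumed.0.store(seq + 1, Ordering::Release);
        value
    }
}

/// Sends 1..=count through an 8-slot ring and returns the consumer's sum.
fn run_demo(count: u64) -> u64 {
    let ring = Arc::new(SpscRing::<8>::new());
    let producer = {
        let ring = Arc::clone(&ring);
        thread::spawn(move || (1..=count).for_each(|v| ring.publish(v)))
    };
    let sum: u64 = (0..count).map(|_| ring.consume()).sum();
    producer.join().unwrap();
    sum
}

fn main() {
    println!("sum = {}", run_demo(1_000));
}
```

Neither side makes a syscall or takes a lock; the only coordination is the pair of cursors, which is the property the cross-process design relies on.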
This repository publishes two crates:
| Crate | Use it when... |
|---|---|
| `disruptor-mp` | You want the raw fixed-size event substrate and you own the wire format. |
| `myelon` | You want one dependency that keeps the raw ring reachable and also gives you framing, codecs, typed zero-copy, topology helpers, and layout helpers. |
disruptor-mp is the Layer 0 substrate. myelon is the broader public transport crate built on top of it.
Both crates do non-trivial unsafe work internally (raw ring slots, SHM mmap, atomic ordering, manual cache-line layout), but the public API is safe Rust with only a small set of documented escape hatches. See Safety & quality gates below.
Throughput mode reports the peak attainable rate plus its latency distribution under self-pressure. CO mode reports coordinated-omission-corrected latency while holding a configured constant offered rate, so tail percentiles reflect real time-to-receipt, not just bench iteration time.
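The difference between bench iteration time and coordinated-omission-corrected latency can be sketched with a small simulation. This is an illustration of the general technique, not the `perf-bench` implementation: `co_latencies` is a hypothetical helper that replays a list of per-operation service times against a fixed offered interval and measures each operation both from its actual start and from its intended start.

```rust
use std::time::Duration;

/// Simulates a serial sender at a fixed offered rate and returns
/// (naive, corrected) latencies per operation.
/// `service` holds how long each operation actually took.
fn co_latencies(interval: Duration, service: &[Duration]) -> (Vec<Duration>, Vec<Duration>) {
    let mut naive = Vec::with_capacity(service.len());
    let mut corrected = Vec::with_capacity(service.len());
    let mut now = Duration::ZERO; // simulated clock

    for (i, &cost) in service.iter().enumerate() {
        // When this operation was scheduled to start at the offered rate.
        let intended = interval * i as u32;
        // It cannot start before the previous operation has finished.
        let actual_start = now.max(intended);
        let done = actual_start + cost;

        naive.push(done - actual_start); // iteration time only
        corrected.push(done - intended); // includes time queued behind a stall
        now = done;
    }
    (naive, corrected)
}

fn main() {
    let us = Duration::from_micros;
    // One 10 µs stall in an otherwise 1 µs workload, offered at one op per 2 µs.
    let (naive, corrected) = co_latencies(us(2), &[us(1), us(10), us(1), us(1)]);
    println!("naive     = {naive:?}");
    println!("corrected = {corrected:?}");
}
```

In this run the naive view records a single slow operation, while the corrected view also charges the two operations that were queued behind the stall (9 µs and 8 µs instead of 1 µs each). That is why tail percentiles differ between the two modes.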
| Topology | shm | mmap |
|---|---|---|
| 1p1c | 332 M ops/s | 239 M ops/s |
| 1p2c | 163 M | 188 M |
| 1p4c | 95 M | 97 M |
| 1p8c | 30 M | 44 M |
Pipelined fan-out, no ack. Per-consumer rate within 0.5% of producer rate. Signal throughput drops as consumers fan out because the publisher contends with each consumer's cursor.
| Mode | Achieved ops/s | P50 | P99 | P99.99 |
|---|---|---|---|---|
| Max throughput | 5.58 M | 130 ns | 240 ns | 2.5 µs |
| CO constant rate @ 1.2 M ops/s | 1.20 M | 188 ns | 282 ns | 13.3 µs |
Producer rate equals consumer rate (single ring, round-trip). The CO row holds 1.2 M ops/s sustained with coordinated-omission-corrected percentiles. mmap variants reach the same throughput at the same P50, with the P99.99 tail 4-5x wider.
| Payload | ops/s | GB/s | P50 | P99 | P99.99 |
|---|---|---|---|---|---|
| 64 B | 4.51 M | 0.29 | 180 ns | 300 ns | 3.02 µs |
| 1 KB | 2.45 M | 2.51 | 360 ns | 610 ns | 3.70 µs |
| 32 KB | 133.6 K | 4.38 | 7.31 µs | 11.41 µs | 24.89 µs |
| 128 KB (multi-frame) | 31.2 K | 4.09 | 31.68 µs | 38.08 µs | 50.24 µs |
Producer rate equals consumer rate (single ring, round-trip). 128 KB fragments across multiple frames; per-message rate drops but per-message bandwidth stays at ~4 GB/s.
| Topology | Payload | Producer ops/s | Per-consumer ops/s | Producer GB/s |
|---|---|---|---|---|
| 1p4c | 1 KB | 9.09 M | 9.07 M | 9.31 |
| 1p4c | 128 KB | 108.9 K | 109.0 K | 14.27 |
| 1p8c | 1 KB | 5.98 M | 5.97 M | 6.13 |
| 1p8c | 128 KB | 88.3 K | 85.5 K | 11.57 |
Each consumer receives every message; per-consumer rate within 0.5-3.2% of producer. Aggregate fan-out scales by N: 1p8c × 128 KB delivers ~92.6 GB/s aggregate across 8 consumers. Broadcast throughput mode doesn't measure per-message RTT (no ack); CO-mode latency under sustained load lives in the crates/perf-bench/ bench output.
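The bandwidth columns and the aggregate figure follow from simple arithmetic, sketched below. The unit conventions are inferred from the table rather than stated in it: the numbers line up when 1 KB is taken as 1024 bytes and 1 GB as 10^9 bytes. `gb_per_s` is a hypothetical helper.

```rust
/// Bytes per second expressed in GB/s (decimal, 1e9 bytes).
/// Assumption: payload sizes in the tables are binary (1 KB = 1024 B).
fn gb_per_s(ops_per_s: f64, payload_bytes: u64) -> f64 {
    ops_per_s * payload_bytes as f64 / 1e9
}

fn main() {
    const KIB: u64 = 1024;
    // Figures copied from the broadcast table: 1p8c, 128 KB payload.
    let producer = gb_per_s(88.3e3, 128 * KIB);
    // Every consumer receives every message, so fan-out multiplies by N.
    let aggregate = producer * 8.0;
    println!("producer  = {producer:.2} GB/s");
    println!("aggregate = {aggregate:.1} GB/s");
}
```

Under those unit assumptions this reproduces the 11.57 GB/s producer figure and the ~92.6 GB/s aggregate quoted above.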
| Batch | Payload | ops/s | P50 | P99 | P99.99 |
|---|---|---|---|---|---|
| 1 | 592 B | 1.89 M | 490 ns | 660 ns | 4.05 µs |
| 64 | 37 KB | 94.2 K | 10.5 µs | 13.7 µs | 22.4 µs |
| 256 | 150 KB | 26.7 K | 37.2 µs | 44.4 µs | 55.0 µs |
ZeroCopyCodec::access reads Archived<T> fields in place. Speedup vs full owned decode: 3.0× at batch=1, 5.5× at batch=64, 5.7× at batch=256.
Measured on AMD Ryzen 7 5800X (8 cores / 16 threads, 4.85 GHz boost, 32 MiB L3), 64 GiB DDR4, Ubuntu 22.04, kernel 6.8 via crates/perf-bench/.
Pingpong throughput at 1 KB payload: myelon-raw vs 11 other in-machine IPC adapters.
Broadcast P99 latency at 1 KB · 4 consumers · 400 K msgs/s sustained (coordinated-omission-corrected).
Pingpong throughput heatmap across the full adapter × payload matrix.
More bench charts (per-layer heatmaps, payload-vs-latency curves, broadcast scaling) live in assets/ and the crates/perf-bench/ bench output.
Start with the crate that matches your data model.

    [dependencies]
    myelon = "0.1.0-alpha.2"

Or, if you only want the raw ring:

    [dependencies]
    disruptor-mp = "0.1.0-alpha.2"

Runnable first-party examples live under examples/demos:

    cargo run --release -p demos --example shm_disruptor
    cargo run --release -p demos --example pingpong

Read next:

- crates/myelon/README.md for the layered transport surface
- crates/disruptor-mp/README.md for the raw substrate
- examples/demos/README.md for the example ladder

crates/
├── disruptor-mp/ # Publishable raw multiprocess substrate.
├── myelon/ # Publishable layered transport crate.
├── myelon-dst/ # Internal deterministic-simulation runner. Inspired by FoundationDB, TigerBeetle, Turso & SlateDB.
├── perf-bench/ # Internal broad transport sweep harness.
└── competitive-bench/ # Internal external-comparison harness.
examples/
├── demos/ # Runnable first-party examples.
└── myelon-pulse-vanity/ # Brand vanity demo shown above.
book/ # mdBook source. Maintained separately from this README.
Top-level workspace commands:
- `make help`
- `make build`
- `make test`
- `make workspace-smoke`
- `make orchestrate-rust`
- `make smoke`
Benchmark crate entry points:
- `make -C crates/perf-bench super-tiny`
- `make -C crates/competitive-bench super-tiny`
Bench binaries are built with --profile competitive (max-perf, panic=abort, stripped — bench-fairness defaults, not production). For production use release or prod-max (release + fat LTO + line-table debug).
- Cross-process Single Producer Single Consumer (SPSC).
- Cross-process Single Producer Multi Consumer (SPMC).
- Cross-process Multi Producer Single Consumer (MPSC).
- Cross-process Multi Producer Multi Consumer (MPMC).
- Communication patterns:
  - Ping-pong: request/response RTT (two SPSC rings).
  - Broadcast: strict fan-out; every consumer sees every event, and the slowest consumer gates the producer.
  - Signal: pipelined fan-out, no ack; maximum throughput.
  - Broadcast + per-rank ping-pong: one SPMC dispatch ring + N SPSC return rings (driver ↔ N worker ranks); the inference-fabric shape.
- Two type-identical backends:
  - POSIX shared memory (`shm_open`).
  - Memory-mapped file (`mmap`).
- Memory-level zero-copy reads (`&E` into the ring slot).
- Wait strategies (`AutoWaitStrategy`):
  - `BusySpin`: pure busy loop, 100% CPU.
  - `BusySpinWithSpinLoopHint`: busy loop with CPU hint via `spin_loop`.
  - `SpinThenYield { spins }`: N spins, then yield to scheduler.
  - `Sleep(Duration)`: sleep for a configured interval.
  - `Block`: efficient blocking, balanced performance / CPU.
- Liveness for gating consumers: producer-side stall detection with cold-path alert, optional hook, and recoverable rejoin.
- Portable shared-memory naming (macOS 31-byte budget enforced).
- Hot-path observability counters:
  - metrics-rs facade (default).
  - Prometheus exporter (`metrics-prometheus`).
  - OpenTelemetry / OTLP exporter (`metrics-otel`).
- Deterministic-simulation hooks behind `RUSTFLAGS="--cfg dst"`.
- HFT-grade deployment tuning:
  - Hugepages-backed SHM segments (2 MiB / 1 GiB pages to cut TLB pressure).
  - Core pinning / `isolcpus` integration in the builder API.
  - NUMA-aware SHM placement (producer, consumer, and segment on the same socket).
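
As an illustration of the spin-then-yield trade-off in the wait-strategy list above, here is a minimal sketch. It is a stand-in written against `std` only, not the crate's `AutoWaitStrategy`: `wait_for` and `demo` are hypothetical names, and the real strategies wait on a cursor in shared memory rather than on an in-process atomic.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical spin-then-yield wait. Polls `cursor` until it reaches
/// `target`: the first `spins` misses use a CPU spin hint, later misses
/// yield to the scheduler. Returns the cursor value that was observed.
fn wait_for(cursor: &AtomicU64, target: u64, spins: u32) -> u64 {
    let mut remaining = spins;
    loop {
        let seen = cursor.load(Ordering::Acquire);
        if seen >= target {
            return seen;
        }
        if remaining > 0 {
            remaining -= 1;
            std::hint::spin_loop(); // busy phase: lowest wake-up latency
        } else {
            thread::yield_now(); // back-off phase: give the core away
        }
    }
}

/// Publishes a cursor value from another thread and waits for it.
fn demo() -> u64 {
    let cursor = Arc::new(AtomicU64::new(0));
    let publisher = {
        let cursor = Arc::clone(&cursor);
        thread::spawn(move || cursor.store(5, Ordering::Release))
    };
    let seen = wait_for(&cursor, 5, 1_000);
    publisher.join().unwrap();
    seen
}

fn main() {
    println!("cursor reached {}", demo());
}
```

Setting `spins` to zero degenerates to a pure yield loop, and never leaving the busy phase corresponds to the busy-spin variants, which is the latency-versus-CPU dial the list describes.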
- Re-exports the raw substrate at type-identical types.
- Framed transport: `&[u8]` payloads in fixed-size frames; payloads larger than one frame fragment across multiple frames (start/end flags + message id let the consumer reassemble).
- Compile-time-fixed frame size: `FixedFrame<N>` / `AlignedFixedFrame<N>` (aligned variant for zero-copy reads). One transport per size class for runtime variation.
- Typed transport (codec encodes `T` → bytes; consumer decodes back into an owned `T`, allocates):
  - bincode.
  - rkyv.
  - flatbuffers.
- Typed zero-copy (consumer reads fields in place via `ZeroCopyCodec::access`; no decode step, no allocation):
  - rkyv (`Archived<T>`).
  - flatbuffers root tables.
- Topology helpers for inference fabrics: rank-scoped request/response, producer-owned startup, attach-time wait-strategy metadata.
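
The fragmentation scheme mentioned in the framed-transport item above (start/end flags plus a message id) can be sketched as follows. The `Frame` layout, field names, and the `fragment` / `reassemble` helpers are assumptions made for illustration; they are not myelon's wire format or API, and a real fixed-size frame would carry a byte array and a length instead of a `Vec<u8>`.

```rust
/// Hypothetical frame layout; not myelon's wire format.
#[derive(Debug, Clone, PartialEq)]
struct Frame {
    message_id: u64,
    start: bool,      // first frame of a message
    end: bool,        // last frame of a message
    payload: Vec<u8>, // at most `max` bytes per frame
}

/// Splits `payload` into frames of at most `max` bytes each.
fn fragment(message_id: u64, payload: &[u8], max: usize) -> Vec<Frame> {
    assert!(max > 0);
    if payload.is_empty() {
        return vec![Frame { message_id, start: true, end: true, payload: Vec::new() }];
    }
    let last = (payload.len() - 1) / max; // index of the final chunk
    payload
        .chunks(max)
        .enumerate()
        .map(|(i, chunk)| Frame {
            message_id,
            start: i == 0,
            end: i == last,
            payload: chunk.to_vec(),
        })
        .collect()
}

/// Reassembles frames that arrived in order on one ring.
/// Returns `None` if the flags or message ids do not form one message.
fn reassemble(frames: &[Frame]) -> Option<Vec<u8>> {
    let first = frames.first()?;
    if !first.start || !frames.last()?.end {
        return None;
    }
    let mut out = Vec::new();
    for (i, f) in frames.iter().enumerate() {
        // Only the first frame may set `start`; only the last may set `end`.
        let flags_ok = (i == 0 || !f.start) && (i + 1 == frames.len() || !f.end);
        if f.message_id != first.message_id || !flags_ok {
            return None;
        }
        out.extend_from_slice(&f.payload);
    }
    Some(out)
}

fn main() {
    let message = vec![7u8; 10];
    let frames = fragment(42, &message, 4);
    println!("{} frames", frames.len());
    assert_eq!(reassemble(&frames), Some(message));
}
```

Unlike this sketch, which copies each chunk, the crate's reads are described as zero-copy; the point here is only how the flags and id delimit one logical message.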
- Runner with fault injection and invariant oracle.
- Verification, report emission, and DST-coverage sweep.
- Pingpong, broadcast, signal, repeatability binaries.
- Layer matrix: raw, framed, typed, codec, typed-zero-copy (all × shm / mmap).
- Throughput and CO-aware fixed-rate measurement modes.
- Tier ladder: `super-tiny`, `simple-smoke`, `smoke`, `quick`, `extensive`.
- Adapters: Crossbeam, Iceoryx2, Rusteron (Aeron), shmipc, ZeroMQ (IPC / IPC-abs / TCP), Boost.MQ, OpenMPI.
- Tier ladder matching `perf-bench`.
- Aggregate report and Pareto-frontier summary per run.
- Python.
- C / C++.
- Zig.
Both crates do non-trivial unsafe work internally (raw ring slots, SHM mmap, atomic ordering, manual cache-line layout). The public API is safe Rust, with a small set of documented escape hatches.
Tier 1: public-surface contract (required for crates.io)
- `myelon` public surface: zero `pub unsafe fn`.
- `disruptor-mp` public surface: four documented `pub unsafe fn` escape hatches in `observability` (`CountersFile::init`, `::attach`, `::from_ptr`, `AggregatorHandle::spawn`); each carries a `# Safety` rustdoc section.
- `unsafe_op_in_unsafe_fn = warn` workspace-wide.
- `clippy::missing_safety_doc = warn` workspace-wide.
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
Tier 2: internal correctness
- DST harness: `assert_always` / `assert_sometimes` + BUGGIFY fault injection, FoundationDB + TigerBeetle-style. Run with `RUSTFLAGS="--cfg dst"`.
- Per-block `// SAFETY:` comment on every internal `unsafe { ... }` block. Currently 12 / 81 blocks ≈ 15% coverage; backfill scheduled for `v0.1.0-alpha.2`.
- `cargo miri test` lane (pointer-math helpers; not the syscall paths).
- AddressSanitizer lane.
- ThreadSanitizer lane.
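
For readers unfamiliar with the two documentation conventions named in Tiers 1 and 2, the sketch below shows a `# Safety` rustdoc section on a `pub unsafe fn`, a `// SAFETY:` comment on each `unsafe` block, and a safe wrapper that discharges the contract. `read_slot` and `read_slot_checked` are hypothetical; they are not among the crates' actual escape hatches.

```rust
/// Reads the `index`-th `u64` from a raw region.
///
/// # Safety
///
/// `base` must be non-null, aligned for `u64`, and valid for reads of at
/// least `index + 1` elements for the duration of the call.
pub unsafe fn read_slot(base: *const u64, index: usize) -> u64 {
    // SAFETY: the caller guarantees `base.add(index)` stays inside a live,
    // aligned allocation (see the `# Safety` section above).
    unsafe { base.add(index).read() }
}

/// Safe wrapper: the bounds check discharges the contract of `read_slot`.
pub fn read_slot_checked(slots: &[u64], index: usize) -> Option<u64> {
    if index >= slots.len() {
        return None;
    }
    // SAFETY: `slots.as_ptr()` is aligned and valid for `slots.len()` reads,
    // and `index < slots.len()` was checked above.
    Some(unsafe { read_slot(slots.as_ptr(), index) })
}

fn main() {
    let slots = [10u64, 20, 30];
    println!("{:?} {:?}", read_slot_checked(&slots, 1), read_slot_checked(&slots, 3));
}
```

The explicit `unsafe { ... }` inside the `unsafe fn` is what `unsafe_op_in_unsafe_fn` asks for, and the comment directly above each block is what `clippy::undocumented_unsafe_blocks` checks.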
Tier 3: enforcement and audit
- `clippy::undocumented_unsafe_blocks = warn` enabled workspace-wide (after Tier 2 backfill).
- Safe wrappers around the four `pub unsafe fn` (`boxed()`, `with_shm_segment()`); the four escape hatches stay as power-user surface.
- `cargo geiger` audit reported.
- Linux: officially supported.
- macOS: officially supported.
- Windows: unsupported.
- LMAX Disruptor by Martin Thompson & team for the original lock-free ring-buffer, single-process, multi-threaded design and the mechanical-sympathy mindset behind it.
- `disruptor-rs` by Nicholas Schultz-Møller for the single-process multi-threaded Rust port that `disruptor-mp` extends.
- vLLM `shm_broadcast.py` by Kaichao You for the SOTA Python shared-memory broadcast fabric used between intra-node inference worker processes.
- Jeff Dean and Dr Bill Dally, Advancing to AI's Next Frontier, NVIDIA GTC 2026 for stating the systems point clearly: at the ultra-low-latency edge of inference, the bulk of the delay is communication latency.
LMAX Disruptor & mechanical sympathy:
- Venkat — The power of mechanical sympathy in software engineering: How LMAX Disruptor is so fast
- LMAX Exchange — Disruptor (read this first)
- Martin Fowler — The LMAX Architecture
- Martin Thompson & Michael Barker — Building HPC fintech handling 100K+ TPS at LMAX (InfoQ)
- Sam Adams — LMAX Exchange Architecture (InfoQ)
- Trisha Gee — Dissecting the Disruptor series: RingBuffer, LocksAreBad, CacheLinePadding, MemoryBarriers, Consumer, Producer, Disruptor 2.0
- Guy Raz Nir — Disruptor (The Edge 2012)
- Martin Thompson — Memory Barriers / Fences
If you use myelon or disruptor-mp in research or downstream work, cite this repository.
Repository:
https://github.com/Venkat2811/myelon
Twitter/X: @venkat_systems
Formal citation metadata lives in CITATION.cff.
BibTeX entries can live in CITATION.bib.
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
Issues, feedback, discussions, and PRs are welcome and appreciated!