[Image: myelon-pulse brand demo for the myelon multiprocess shared-memory transport]

myelon

Ultra-low-latency and high-throughput multiprocess transport stack over SHM and mmap ring buffers on Linux and macOS.

Dr Bill Dally is working on shaving nanoseconds by shrinking the distance data has to travel (on-chip wires, off-chip PHYs, memory-to-compute) and stripping overhead out of the path. System software should be just as serious about stripping copies, wakeups, coordination, and communication distance out of its own path.

myelon borrows its name from the Greek root behind the spinal cord: the signal path between brain and body. Wrapped in myelin, that cord exists to move impulses fast and clean. This crate tries to do the same for processes: give independent OS processes a low-latency fabric over SHM and mmap, with as little copying, waiting, and coordination drag as possible.

myelon is the default crate in this repository. It builds on top of disruptor-mp, which extends LMAX Disruptor's HFT-grade design (lock-free atomics, no syscalls in the hot path, cache-aligned cursors, busyspin wait) to cross-process IPC. myelon keeps the raw ring reachable and adds framing, codecs, typed zero-copy, topology helpers, and layout helpers for low-latency pipelines.
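The design ingredients named above (lock-free atomics, cache-aligned cursors, busy-spin waiting, no syscalls on the hot path) can be illustrated with a minimal single-process sketch. This is not disruptor-mp's implementation or memory layout: `SpscRing`, `PaddedCursor`, and the 64-byte cache-line size are assumptions made for illustration, and the slots are atomics so that the sketch stays in safe Rust.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Gives each cursor its own cache line so the producer and consumer
/// cursors do not false-share. 64 bytes is an assumption; some CPUs use 128.
#[repr(align(64))]
struct PaddedCursor(AtomicU64);

/// Illustrative single-producer single-consumer ring of u64 events.
/// Hypothetical type, not part of disruptor-mp.
pub struct SpscRing<const N: usize> {
    slots: [AtomicU64; N],
    published: PaddedCursor, // number of events written by the producer
    consumed: PaddedCursor,  // number of events read by the consumer
}

impl<const N: usize> SpscRing<N> {
    pub fn new() -> Self {
        assert!(N.is_power_of_two(), "capacity must be a power of two");
        Self {
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
            published: PaddedCursor(AtomicU64::new(0)),
            consumed: PaddedCursor(AtomicU64::new(0)),
        }
    }

    /// Producer side: returns false when the ring is full.
    pub fn try_publish(&self, event: u64) -> bool {
        let seq = self.published.0.load(Ordering::Relaxed);
        if seq - self.consumed.0.load(Ordering::Acquire) == N as u64 {
            return false;
        }
        self.slots[(seq as usize) & (N - 1)].store(event, Ordering::Relaxed);
        // Release publishes the slot write before the new cursor value.
        self.published.0.store(seq + 1, Ordering::Release);
        true
    }

    /// Consumer side: returns None when the ring is empty.
    pub fn try_consume(&self) -> Option<u64> {
        let seq = self.consumed.0.load(Ordering::Relaxed);
        if seq == self.published.0.load(Ordering::Acquire) {
            return None;
        }
        let event = self.slots[(seq as usize) & (N - 1)].load(Ordering::Relaxed);
        // Release tells the producer this slot may be overwritten.
        self.consumed.0.store(seq + 1, Ordering::Release);
        Some(event)
    }

    /// Busy-spin wait with a CPU hint: no syscall, occupies a core until
    /// an event arrives.
    pub fn consume_spinning(&self) -> u64 {
        loop {
            if let Some(event) = self.try_consume() {
                return event;
            }
            std::hint::spin_loop();
        }
    }
}
```

With `SpscRing::<4>::new()`, four calls to `try_publish` succeed and a fifth returns `false` until the consumer drains a slot. A cross-process version would additionally place the cursors and slots in a shared mapping, which this sketch does not attempt.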

This repository publishes two crates:

| Crate | Use it when... |
| --- | --- |
| disruptor-mp | You want the raw fixed-size event substrate and you own the wire format. |
| myelon | You want one dependency that keeps the raw ring reachable and also gives you framing, codecs, typed zero-copy, topology helpers, and layout helpers. |

disruptor-mp is the Layer 0 substrate. myelon is the broader public transport crate built on top of it.

Both crates do non-trivial unsafe work internally (raw ring slots, SHM mmap, atomic ordering, manual cache-line layout), but the public API is safe Rust with only a small set of documented escape hatches. See Safety & quality gates below.

Headline numbers

Throughput mode reports the peak attainable rate plus its latency distribution under self-pressure. CO mode reports coordinated-omission-corrected latency while holding a configured constant offered rate, so tail percentiles reflect real time-to-receipt, not just bench iteration time.
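The difference between the two modes comes down to which timestamp a latency sample is measured from. The sketch below shows the general coordinated-omission correction idea, assuming timestamps in nanoseconds relative to run start and a fixed send interval. The function names are hypothetical, and this is not necessarily how crates/perf-bench/ computes its percentiles.

```rust
/// Uncorrected latency: receive time minus the time the sender actually sent.
/// A stalled sender sends late, so the delay that queued up behind the stall
/// never shows up in the samples.
pub fn uncorrected_ns(actual_send_ns: &[u64], recv_ns: &[u64]) -> Vec<u64> {
    actual_send_ns
        .iter()
        .zip(recv_ns)
        .map(|(&sent, &recv)| recv.saturating_sub(sent))
        .collect()
}

/// CO-corrected latency: receive time minus the time message `i` was
/// scheduled to go out at the constant offered rate (`i * interval_ns`).
pub fn co_corrected_ns(interval_ns: u64, recv_ns: &[u64]) -> Vec<u64> {
    recv_ns
        .iter()
        .enumerate()
        .map(|(i, &recv)| recv.saturating_sub(i as u64 * interval_ns))
        .collect()
}
```

As a worked example, take a 1000 ns interval and a single 5 µs stall on message 1, with actual sends at `[0, 1000, 6000, 6200, 6400]` and receipts at `[200, 6000, 6200, 6400, 6600]`. The uncorrected samples are `[200, 5000, 200, 200, 200]`, while the corrected samples are `[200, 5000, 4200, 3400, 2600]`: the messages that were scheduled during the stall are charged for the time they spent waiting.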

Signal ceiling, consumer scaling (no payload)

| Topology | shm (ops/s) | mmap (ops/s) |
| --- | --- | --- |
| 1p1c | 332 M | 239 M |
| 1p2c | 163 M | 188 M |
| 1p4c | 95 M | 97 M |
| 1p8c | 30 M | 44 M |

Pipelined fan-out, no ack. Per-consumer rate is within 0.5% of the producer rate. Signal throughput falls as consumers are added because the publisher contends with each consumer's cursor.

Raw ping-pong, 1p1c, 64B, shm, busyspin

| Mode | Achieved ops/s | P50 | P99 | P99.99 |
| --- | --- | --- | --- | --- |
| Max throughput | 5.58 M | 130 ns | 240 ns | 2.5 µs |
| CO constant rate @ 1.2 M ops/s | 1.20 M | 188 ns | 282 ns | 13.3 µs |

Producer rate equals consumer rate (single ring, round-trip). The CO row holds 1.2 M ops/s sustained with coordinated-omission-corrected percentiles. mmap variants reach the same throughput at the same P50, with the P99.99 tail 4-5x wider.

Framed ping-pong, payload scaling (1p1c, shm, throughput mode)

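| Payload | ops/s | GB/s | P50 | P99 | P99.99 |
| --- | --- | --- | --- | --- | --- |
| 64 B | 4.51 M | 0.29 | 180 ns | 300 ns | 3.02 µs |
| 1 KB | 2.45 M | 2.51 | 360 ns | 610 ns | 3.70 µs |
| 32 KB | 133.6 K | 4.38 | 7.31 µs | 11.41 µs | 24.89 µs |
| 128 KB (multi-frame) | 31.2 K | 4.09 | 31.68 µs | 38.08 µs | 50.24 µs |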

Producer rate equals consumer rate (single ring, round-trip). 128 KB payloads fragment across multiple frames; the message rate drops, but bandwidth stays at ~4 GB/s.

Framed broadcast, consumer × payload scaling (mmap, throughput mode)

| Topology | Payload | Producer ops/s | Per-consumer ops/s | Producer GB/s |
| --- | --- | --- | --- | --- |
| 1p4c | 1 KB | 9.09 M | 9.07 M | 9.31 |
| 1p4c | 128 KB | 108.9 K | 109.0 K | 14.27 |
| 1p8c | 1 KB | 5.98 M | 5.97 M | 6.13 |
| 1p8c | 128 KB | 88.3 K | 85.5 K | 11.57 |

Each consumer receives every message; per-consumer rate within 0.5-3.2% of producer. Aggregate fan-out scales by N: 1p8c × 128 KB delivers ~92.6 GB/s aggregate across 8 consumers. Broadcast throughput mode doesn't measure per-message RTT (no ack); CO-mode latency under sustained load lives in the crates/perf-bench/ bench output.
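The bandwidth and aggregate figures can be reproduced from the ops/s columns. The sketch below assumes 1 KB = 1024 bytes and 1 GB = 10^9 bytes, which is what the table values imply; the helper names are hypothetical.

```rust
/// Payload bandwidth in GB/s (10^9 bytes per second).
pub fn gb_per_s(ops_per_s: f64, payload_bytes: u64) -> f64 {
    ops_per_s * payload_bytes as f64 / 1e9
}

/// In broadcast every consumer receives every message, so delivered
/// bandwidth is the producer bandwidth multiplied by the consumer count.
pub fn aggregate_gb_per_s(producer_gb_per_s: f64, consumers: u32) -> f64 {
    producer_gb_per_s * f64::from(consumers)
}
```

For the 1p8c × 128 KB row, `gb_per_s(88.3e3, 128 * 1024)` gives about 11.57 GB/s, and `aggregate_gb_per_s` of that value with 8 consumers gives about 92.6 GB/s. The aggregate figure is therefore N times the producer bandwidth; computing it from the slightly lower per-consumer rate (85.5 K ops/s) would give about 89.7 GB/s.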

Typed zero-copy ping-pong (shm, rkyv, throughput mode)

| Batch | Payload | ops/s | P50 | P99 | P99.99 |
| --- | --- | --- | --- | --- | --- |
| 1 | 592 B | 1.89 M | 490 ns | 660 ns | 4.05 µs |
| 64 | 37 KB | 94.2 K | 10.5 µs | 13.7 µs | 22.4 µs |
| 256 | 150 KB | 26.7 K | 37.2 µs | 44.4 µs | 55.0 µs |

ZeroCopyCodec::access reads Archived<T> fields in place. Speedup vs full owned decode: 3.0× at batch=1, 5.5× at batch=64, 5.7× at batch=256.

Measured on AMD Ryzen 7 5800X (8 cores / 16 threads, 4.85 GHz boost, 32 MiB L3), 64 GiB DDR4, Ubuntu 22.04, kernel 6.8 via crates/perf-bench/.

Charts

Pingpong throughput at 1 KB payload: myelon-raw vs 11 other in-machine IPC adapters.

[Chart: Pingpong throughput at 1 KB payload across 12 IPC adapters]

Broadcast P99 latency at 1 KB · 4 consumers · 400 K msgs/s sustained (coordinated-omission-corrected).

[Chart: Broadcast CO P99 latency at 1 KB and 4 consumers under 400 K msgs/s sustained]

Pingpong throughput heatmap across the full adapter × payload matrix.

[Chart: Pingpong throughput heatmap across adapters and payload sizes]

More bench charts (per-layer heatmaps, payload-vs-latency curves, broadcast scaling) live in assets/ and the crates/perf-bench/ bench output.

Quick start

Start with the crate that matches your data model.

[dependencies]
myelon = "0.1.0-alpha.2"

Or, if you only want the raw ring:

[dependencies]
disruptor-mp = "0.1.0-alpha.2"

Runnable first-party examples live under examples/demos:

cargo run --release -p demos --example shm_disruptor
cargo run --release -p demos --example pingpong

Read next:

Repository layout

crates/
├── disruptor-mp/        # Publishable raw multiprocess substrate.
├── myelon/              # Publishable layered transport crate.
├── myelon-dst/          # Internal deterministic-simulation runner. Inspired by FoundationDB, TigerBeetle, Turso & SlateDB.
├── perf-bench/          # Internal broad transport sweep harness.
└── competitive-bench/   # Internal external-comparison harness.

examples/
├── demos/               # Runnable first-party examples.
└── myelon-pulse-vanity/ # Brand vanity demo shown above.

book/                    # mdBook source. Maintained separately from this README.

Validation and benchmarks

Top-level workspace commands:

  • make help
  • make build
  • make test
  • make workspace-smoke
  • make orchestrate-rust
  • make smoke

Benchmark crate entry points:

  • make -C crates/perf-bench super-tiny
  • make -C crates/competitive-bench super-tiny

Bench binaries are built with --profile competitive (max-perf, panic=abort, stripped). These are bench-fairness defaults, not production settings. For production, use the release or prod-max profile (release + fat LTO + line-table debug).

Features

disruptor-mp: raw multiprocess substrate

  • Cross-process Single Producer Single Consumer (SPSC).
  • Cross-process Single Producer Multi Consumer (SPMC).
  • Cross-process Multi Producer Single Consumer (MPSC).
  • Cross-process Multi Producer Multi Consumer (MPMC).
  • Communication patterns:
    • Ping-pong: request/response RTT (two SPSC rings).
    • Broadcast: strict fan-out; every consumer sees every event, slowest gates the producer.
    • Signal: pipelined fan-out, no ack; maximum throughput.
    • Broadcast + per-rank ping-pong: one SPMC dispatch ring + N SPSC return rings (driver ↔ N worker ranks); the inference-fabric shape.
  • Two type-identical backends:
    • POSIX shared memory (shm_open).
    • Memory-mapped file (mmap).
  • Memory-level zero-copy reads (&E into the ring slot).
  • Wait strategies (AutoWaitStrategy):
    • BusySpin: pure busy loop, 100% CPU.
    • BusySpinWithSpinLoopHint: busy loop with CPU hint via spin_loop.
    • SpinThenYield { spins }: N spins, then yield to scheduler.
    • Sleep(Duration): sleep for a configured interval.
    • Block: efficient blocking, balanced performance / CPU.
  • Liveness for gating consumers: producer-side stall detection with cold-path alert, optional hook, and recoverable rejoin.
  • Portable shared-memory naming (macOS 31-byte budget enforced).
  • Hot-path observability counters:
    • metrics-rs facade (default).
    • Prometheus exporter (metrics-prometheus).
    • OpenTelemetry / OTLP exporter (metrics-otel).
  • Deterministic-simulation hooks behind RUSTFLAGS="--cfg dst".
  • HFT-grade deployment tuning:
    • Hugepages-backed SHM segments (2 MiB / 1 GiB pages to cut TLB pressure).
    • Core pinning / isolcpus integration in the builder API.
    • NUMA-aware SHM placement (producer, consumer, and segment on the same socket).
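The wait strategies listed above trade CPU usage for wake-up latency. The following is a minimal sketch of how such strategies might behave while waiting on a cursor. The `Wait` enum and `wait_for` function are hypothetical and are not the crate's `AutoWaitStrategy` API; the `Block` strategy is omitted because it needs a kernel-assisted wake-up mechanism.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::Duration;

/// Hypothetical mirror of the listed strategies; not the crate's type.
pub enum Wait {
    BusySpin,
    BusySpinWithSpinLoopHint,
    SpinThenYield { spins: u32 },
    Sleep(Duration),
}

/// Waits until `cursor` reaches at least `target` and returns the value seen.
pub fn wait_for(cursor: &AtomicU64, target: u64, strategy: &Wait) -> u64 {
    let mut spins_done = 0u32;
    loop {
        let seen = cursor.load(Ordering::Acquire);
        if seen >= target {
            return seen;
        }
        match strategy {
            // Pure busy loop: lowest latency, 100% of one core.
            Wait::BusySpin => {}
            // Same loop, but tells the CPU it is spinning.
            Wait::BusySpinWithSpinLoopHint => std::hint::spin_loop(),
            // Spin a bounded number of times, then give the core away.
            Wait::SpinThenYield { spins } => {
                if spins_done < *spins {
                    spins_done += 1;
                    std::hint::spin_loop();
                } else {
                    thread::yield_now();
                }
            }
            // Cheapest on CPU, slowest to notice new events.
            Wait::Sleep(interval) => thread::sleep(*interval),
        }
    }
}
```

A consumer would call `wait_for(&cursor, next_seq, &strategy)` before reading a slot. Only the two busy-spin variants avoid the scheduler entirely, which is why they suit pinned, isolated cores.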

myelon: layered transport on top of disruptor-mp

  • Re-exports the raw substrate with type-identical types.
  • Framed transport: &[u8] payloads in fixed-size frames; payloads larger than one frame fragment across multiple frames (start/end flags + message id let the consumer reassemble).
  • Compile-time-fixed frame size: FixedFrame<N> / AlignedFixedFrame<N> (aligned variant for zero-copy reads). One transport per size class for runtime variation.
  • Typed transport (codec encodes T → bytes; consumer decodes back into an owned T, allocates):
    • bincode.
    • rkyv.
    • flatbuffers.
  • Typed zero-copy (consumer reads fields in place via ZeroCopyCodec::access; no decode step, no allocation):
    • rkyv (Archived<T>).
    • flatbuffers root tables.
  • Topology helpers for inference fabrics: rank-scoped request/response, producer-owned startup, attach-time wait-strategy metadata.
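The framed-transport bullet above describes fragmentation with start/end flags and a message id. A minimal sketch of that idea follows, assuming a single in-order stream. The `Frame` layout, `fragment`, and `Reassembler` are hypothetical; the crate's `FixedFrame<N>` header and API may differ.

```rust
/// Hypothetical fixed-size frame; N is the payload capacity and must be > 0.
#[derive(Clone, Debug, PartialEq)]
pub struct Frame<const N: usize> {
    pub message_id: u64,
    pub start: bool,
    pub end: bool,
    pub len: usize, // bytes of `payload` in use
    pub payload: [u8; N],
}

/// Splits `msg` into frames carrying at most N payload bytes each.
pub fn fragment<const N: usize>(message_id: u64, msg: &[u8]) -> Vec<Frame<N>> {
    // An empty message still needs one frame to carry the start/end flags.
    let chunks: Vec<&[u8]> = if msg.is_empty() {
        vec![msg]
    } else {
        msg.chunks(N).collect()
    };
    let last = chunks.len() - 1;
    chunks
        .into_iter()
        .enumerate()
        .map(|(i, chunk)| {
            let mut payload = [0u8; N];
            payload[..chunk.len()].copy_from_slice(chunk);
            Frame { message_id, start: i == 0, end: i == last, len: chunk.len(), payload }
        })
        .collect()
}

/// Consumer-side reassembly for one in-order stream of frames.
#[derive(Default)]
pub struct Reassembler {
    current: Option<(u64, Vec<u8>)>,
}

impl Reassembler {
    /// Feeds one frame; returns the whole message once the end frame arrives.
    pub fn push<const N: usize>(&mut self, frame: &Frame<N>) -> Option<Vec<u8>> {
        if frame.start {
            self.current = Some((frame.message_id, Vec::new()));
        }
        let (id, buf) = self.current.as_mut()?;
        if *id != frame.message_id {
            // Frame from a different message: drop the partial one.
            self.current = None;
            return None;
        }
        buf.extend_from_slice(&frame.payload[..frame.len]);
        if frame.end {
            self.current.take().map(|(_, buf)| buf)
        } else {
            None
        }
    }
}
```

With N = 4, a 10-byte message becomes three frames (4 + 4 + 2 bytes), and `push` returns `Some(message)` only on the third. Because N is a const generic, each frame size is a distinct type, which matches the "one transport per size class" note above.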

myelon-dst: internal deterministic-simulation harness

  • Runner with fault injection and invariant oracle.
  • Verification, report emission, and DST-coverage sweep.

perf-bench: internal broad transport sweep harness

  • Pingpong, broadcast, signal, repeatability binaries.
  • Layer matrix: raw, framed, typed, codec, typed-zero-copy (all × shm / mmap).
  • Throughput and CO-aware fixed-rate measurement modes.
  • Tier ladder: super-tiny, simple-smoke, smoke, quick, extensive.

competitive-bench: internal external-comparison harness

  • Adapters: Crossbeam, Iceoryx2, Rusteron (Aeron), shmipc, ZeroMQ (IPC / IPC-abs / TCP), Boost.MQ, OpenMPI.
  • Tier ladder matching perf-bench.
  • Aggregate report and Pareto-frontier summary per run.

Bindings

  • Python.
  • C / C++.
  • Zig.

Safety & quality gates

Both crates do non-trivial unsafe work internally (raw ring slots, SHM mmap, atomic ordering, manual cache-line layout). The public API is safe Rust, with a small set of documented escape hatches.

Tier 1: public-surface contract (required for crates.io)

  • myelon public surface: zero pub unsafe fn.
  • disruptor-mp public surface: four documented pub unsafe fn escape hatches in observability (CountersFile::init, ::attach, ::from_ptr, AggregatorHandle::spawn); each carries a # Safety rustdoc section.
  • unsafe_op_in_unsafe_fn = warn workspace-wide.
  • clippy::missing_safety_doc = warn workspace-wide.
  • cargo clippy --workspace --all-targets -- -D warnings clean.
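As an illustration of what these lints ask for, the sketch below shows a `pub unsafe fn` with a `# Safety` rustdoc section, an explicit `unsafe` block inside it (required once `unsafe_op_in_unsafe_fn` is on), a `// SAFETY:` comment per block, and a safe wrapper. `read_seq` and `read_seq_checked` are hypothetical and are not part of either crate.

```rust
/// Reads a little-endian u64 from the start of a mapped region.
///
/// # Safety
///
/// `ptr` must be valid for reads of 8 bytes for the duration of the call.
/// No alignment is required.
#[warn(unsafe_op_in_unsafe_fn, clippy::missing_safety_doc)]
pub unsafe fn read_seq(ptr: *const u8) -> u64 {
    // SAFETY: the caller guarantees `ptr` is valid for an 8-byte read, and
    // `read_unaligned` places no alignment requirement on it.
    let bytes = unsafe { std::ptr::read_unaligned(ptr as *const [u8; 8]) };
    u64::from_le_bytes(bytes)
}

/// Safe wrapper: the length check discharges the pointer contract.
pub fn read_seq_checked(region: &[u8]) -> Option<u64> {
    if region.len() < 8 {
        return None;
    }
    // SAFETY: `region` is a live slice with at least 8 readable bytes.
    Some(unsafe { read_seq(region.as_ptr()) })
}
```

Callers that hold a slice use `read_seq_checked`; the `unsafe` function stays available for code that only has a raw pointer into a mapping.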

Tier 2: internal correctness

  • DST harness: assert_always / assert_sometimes + BUGGIFY fault injection, FoundationDB + TigerBeetle-style. Run with RUSTFLAGS="--cfg dst".
  • Per-block // SAFETY: comment on every internal unsafe { ... } block. Currently 12 / 81 blocks ≈ 15% coverage; backfill scheduled for v0.1.0-alpha.2.
  • cargo miri test lane (pointer-math helpers; not the syscall paths).
  • AddressSanitizer lane.
  • ThreadSanitizer lane.

Tier 3: enforcement and audit

  • clippy::undocumented_unsafe_blocks = warn enabled workspace-wide (after Tier 2 backfill).
  • Safe wrappers around the four pub unsafe fn (boxed(), with_shm_segment()); the four escape hatches stay as power-user surface.
  • cargo geiger audit reported.

Platform support

  • Linux: officially supported.
  • macOS: officially supported.
  • Windows: unsupported.

Acknowledgements

  • LMAX Disruptor by Martin Thompson & team, for the original single-process, multi-threaded lock-free ring-buffer design and the mechanical-sympathy mindset behind it.
  • disruptor-rs by Nicholas Schultz-Møller, for the single-process, multi-threaded Rust port that disruptor-mp extends.
  • vLLM shm_broadcast.py by Kaichao You, for the SOTA Python shared-memory broadcast fabric used between inference worker processes on a single node.
  • Jeff Dean and Dr Bill Dally, Advancing to AI's Next Frontier, NVIDIA GTC 2026 for stating the systems point clearly: at the ultra-low-latency edge of inference, the bulk of the delay is communication latency.

Resources

LMAX Disruptor & mechanical sympathy:

Built with

Agentic engineering, using: Codex, Claude Code.

Citation

If you use myelon or disruptor-mp in research or downstream work, cite this repository.

Repository: https://github.com/Venkat2811/myelon

Twitter/X: @venkat_systems

Formal citation metadata lives in CITATION.cff. BibTeX entries can live in CITATION.bib.

License

Licensed under either of:

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Issues, feedback, discussions, and PRs are welcome and appreciated!
