Symbiotic is an embedded profiling runtime for Rust. It instruments a program from the inside — hardware counters, kernel tracepoints, memory layout, per-line sample attribution — producing the kind of data that previously required a combination of `perf`, `bpftrace`, `valgrind`, and manual `/proc` parsing. One `symbiotic::init()` call replaces all of them.
The library compiles on stable Rust. Subsystems that require kernel support (eBPF, PMU counters) degrade gracefully when unavailable. Everything is feature-gated; what you don't enable doesn't compile.
Existing Rust profiling tools fall into two categories: external sampling profilers that attach to a running process (`perf record`, `samply`, `flamegraph-rs`), and lightweight timing wrappers that measure wall time. Neither gives you hardware counter attribution at the source-line level, kernel event correlation during a specific code region, or a unified view of what the CPU, memory subsystem, scheduler, and I/O stack were doing during your function call.

Symbiotic closes that gap. A `region!("sort", { data.sort(); })` block captures cycles, instructions, IPC, branch misses, L1D/LLC cache misses, page faults, context switches, futex contention, and I/O bytes — all attributed to that specific block, all in a single run.
**Per-region hardware counters.** RAII guards read six PMU counters (cycles, instructions, branches, branch misses, L1D misses, LLC misses) via diff-on-drop. Nested regions compose correctly. Cost: ~20ns per enter/exit.
**96-counter kernel sensory system.** A BPF program (`sense_hub`) attaches to 58 tracepoints and maintains 96 `u64` counters in a `BPF_F_MMAPABLE` array. Userspace reads are volatile pointer loads — no syscalls, no ring buffer drain. Covers: network state (TCP/UDP/UNIX), file descriptors, I/O byte counts, page faults, RSS breakdown, scheduler stats, futex/lock contention, thread lifecycle, block I/O latency, page cache, writeback pressure, TCP health (retransmits, RTT, congestion), OOM, thermals, MCE.
**Per-line IP sampling with DWARF resolution.** Ten hardware and software events are captured at the instruction-pointer level: cycles, L1D misses (via the perf_event ring buffer), LLC misses, branch misses, DTLB misses, AMD IBS-Op, AMD IBS-Fetch, major page faults, CPU migrations, and alignment faults (via BPF perf_event programs writing to a zero-copy circular buffer). After profiling, IPs are batch-resolved through DWARF via blazesym and aggregated per source line. Output is a `.symbiot` trace file (zstd-compressed JSON).

**Multi-level code view.** For hot functions, the disassembler correlates four representations — Rust source, MIR, LLVM IR, x86 assembly — with per-instruction sample counts. DWARF `.loc` directives, MIR scope chains, and IR debug metadata provide the cross-level mapping.

**eBPF off-CPU, syscall, and lock profiling.** Separate BPF programs track `sched_switch` (off-CPU stacks with blazesym symbolization), per-syscall latency distributions, and futex contention (wait time + call stacks). BTF tracepoints are preferred where available, with automatic fallback to legacy tracepoints.

**Process-wide PMU counters.** `inherit(true)` counters span all threads — rayon workers, thread pools, everything. Provides accurate IPC and branch miss rate for the entire process.

**Interactive report viewer.** A terminal UI (ratatui) with tabs for overview, CPU, cache hierarchy, per-line samples, region tree, sensory state, and analysis. An HTML dashboard (self-contained, offline) provides the same data in a VS Code-style Monokai Pro layout with a timeline swimlane.

**Query server.** HTTP/JSON, gRPC, and Unix socket endpoints expose live metrics while the profiled program runs. The dashboard auto-starts on port 9882.
```rust
use symbiotic::{region, BenchProfiler};

fn main() {
    symbiotic::init(); // loads BPF, enables sensory capture
    let _profiler = BenchProfiler::new("my_workload");

    region!("sort", {
        data.sort_unstable();
    });
    region!("process", {
        expensive_computation(&data);
    });
    // report generated on drop: TUI, HTML dashboard, .symbiot trace
}
```

Annotate functions directly:
```rust
#[symbiotic::profile]
fn hot_function(data: &mut [f64]) -> f64 {
    data.sort_unstable_by(|a, b| a.partial_cmp(b).unwrap());
    data.iter().sum()
}
```

| Flag | What it enables |
|---|---|
| `profiling` | Per-region PMU counters (`region!`, `RegionGuard`) |
| `ebpf` | BPF sensory system + off-CPU/syscall/lock profiling |
| `line-profiler` | Per-line IP sampling + DWARF resolution + `.symbiot` export |
| `disasm` | Multi-level code view (source/MIR/IR/ASM) |
| `tui` | Interactive terminal report viewer |
| `dashboard` | VS Code-style HTML dashboard + live web UI |
| `server` | HTTP/JSON query server |
| `server-grpc` | gRPC transport |
| `pmu` | Raw PMU counter access |
| `jemalloc` | jemalloc allocator statistics |
| `alloc-track` | Per-region allocation counting |
| `flamegraph` | Flamegraph SVG export |
| `full` | Everything above |
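A typical opt-in set in `Cargo.toml` might look like this (the version number is a placeholder, not a real release):

```toml
[dependencies]
# Enable only what you need; unused subsystems are compiled out.
symbiotic = { version = "0.1", features = ["profiling", "ebpf", "tui"] }
```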
- Linux (kernel >= 5.8 for full eBPF support; PMU counters work on older kernels)
- `perf_event_paranoid` <= 1 for hardware counter access, or `CAP_PERFMON`
- `CAP_BPF` + `CAP_PERFMON` for eBPF programs (or root)
- Stable Rust toolchain
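To check whether unprivileged counter access is available on a given machine (the `sysctl` line is the standard kernel knob, shown as a setup hint rather than anything Symbiotic-specific):

```shell
# Values range from -1 to 4; anything <= 1 allows unprivileged
# hardware counter access.
cat /proc/sys/kernel/perf_event_paranoid

# To lower it until the next reboot (requires root):
#   sysctl -w kernel.perf_event_paranoid=1
```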
The following areas are under active development. Items are listed by theme, not priority.
**Ecosystem integration.** Criterion adapter for automatic hardware-counter enrichment of benchmark results. A `cargo-symbiotic` subcommand for one-shot profiling of any binary or test. Integration with `tracing` spans so existing instrumented code gets PMU attribution without source changes.

**CI/CD and regression detection.** Machine-readable profile output (JSON, protobuf) suitable for ingestion by CI systems. Diff mode: compare two `.symbiot` traces and surface regressions in IPC, cache miss rate, or branch misprediction at the source-line level. Threshold-based assertions (`assert_ipc!(region, >= 2.0)`) for performance-gated CI pipelines.

**Profile-guided development.** IDE integration beyond VS Code: per-line hardware counters displayed inline in the editor during development. A persistent profile database across runs for trend analysis. Correlation of profile data with `git blame` to attribute regressions to specific commits.

**Platform and architecture.** Full aarch64 PMU support (ARMv8.1+ SPE for precise memory sampling). A macOS kperf backend for Apple Silicon. A cross-platform fallback mode (wall time + allocation tracking) for environments without PMU access.

**Deeper kernel visibility.** Per-syscall latency histograms with automatic slow-path detection. IRQ and softirq stolen-time attribution to specific code regions. NUMA topology-aware memory placement analysis.

**Allocation profiling.** Integration with jemalloc's `prof` facility for heap flamegraphs. Per-region allocation rate tracking with leak-detection heuristics. Object lifetime analysis for identifying unnecessary clones.
Dual-licensed under MIT and Apache 2.0.