#pprof #tracing

culpert

Per-span heap allocation profiler core for Rust services

2 releases

new 0.1.1 May 15, 2026
0.1.0 May 15, 2026

#75 in Profiling


Used in 3 crates

MIT/Apache

115KB
1.5K SLoC

culpert

Per-span heap allocation profiler for Rust services.

A #[global_allocator] wrapper that attributes every sampled allocation to the span it happened inside, exports pprof-format profiles so the existing tool ecosystem (stock pprof, Speedscope, Pyroscope, Polar Signals) keeps working, and ships a CLI with unbiased per-span reports plus a diff subcommand for CI/PR workflows. Geometric sampling with the Bernstein correction means bytes figures are directly meaningful — no separate "raw vs estimated" columns to interpret.

Three integration paths — pick whichever matches your service:

| Your service uses… | Adapter | Cost |
|---|---|---|
| foundations | culpert-foundations | Zero instrumentation change. Existing #[span_fn] annotations become attribution keys. |
| The tracing crate | culpert-tracing | Compose a tracing_subscriber::Layer into your subscriber stack. Existing #[tracing::instrument] annotations become attribution keys. |
| Neither, or you want attribution independent of trace sampling | culpert::scope + #[culpert::span_fn] | One macro on the functions you want attributed. No external tracer. |

Status: 0.1.0 on crates.io. Pre-1.0 — API may break in minor releases. See plan.md for the design, notes.md for the Phase 0 research verdicts, CHANGELOG.md for release notes, and ROADMAP.md for what's next.

Install

Library:

cargo add culpert                # core
cargo add culpert-foundations    # if your service uses foundations
cargo add culpert-tracing        # if your service uses the tracing crate

culpert-macros is re-exported by culpert (don't depend on it directly).

CLI:

cargo install --locked culpert-cli
# → installs the `culpert` binary into ~/.cargo/bin
culpert --help

# Optional feature for talking to a culpert-archive instance that
# sits behind Cloudflare Access — adds --cf-access-client-id /
# --cf-access-client-secret flags (with CF_ACCESS_CLIENT_ID /
# CF_ACCESS_CLIENT_SECRET env-var fallbacks) to `upload` / `pull`:
cargo install --locked --features cloudflare-access culpert-cli

Quickstart — standalone (#[culpert::span_fn])

The smallest viable setup. No external tracer:

[dependencies]
culpert = "0.1"

use culpert::{Config, LocalSpanContext, TrackingAllocator};
use std::alloc::System;

#[global_allocator]
static GLOBAL: TrackingAllocator<System> = TrackingAllocator::new(System);

fn main() {
    culpert::install(LocalSpanContext::new(), Config::default());

    handle_request();

    let profile = culpert::snapshot();
    let bytes = culpert::pprof::encode_gzipped(&profile).unwrap();
    std::fs::write("/tmp/prof.pb.gz", &bytes).unwrap();

    culpert::shutdown();
}

#[culpert::span_fn("handle_request")]
fn handle_request() {
    // Every sampled allocation in here, transitively, is attributed to
    // span_name = "handle_request".
}

See examples/macros for a runnable version.

Quickstart — with foundations

[dependencies]
culpert              = "0.1"
culpert-foundations  = "0.1"
foundations = { version = "5", default-features = false, features = ["tracing", "telemetry-server"] }

The default-features = false is required — foundations' default jemalloc feature declares its own #[global_allocator] which conflicts with culpert's TrackingAllocator and fails to link.

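use culpert::TrackingAllocator;
use foundations::service_info;
use foundations::telemetry::TelemetryConfig;
use std::alloc::System;

#[global_allocator]
static GLOBAL: TrackingAllocator<System> = TrackingAllocator::new(System);

#[tokio::main]
async fn main() {
    // `settings` is your service's already-loaded foundations settings.
    let driver = foundations::telemetry::init(TelemetryConfig {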
        service_info: &service_info!(),
        settings: &settings.telemetry,
        custom_server_routes: vec![
            culpert_foundations::pprof_route("/debug/alloc/profile"),
        ],
    }).unwrap();

    culpert_foundations::install();

    // ... your existing app, with #[span_fn] annotations as usual.
}

Existing #[foundations::telemetry::tracing::span_fn] annotations become attribution keys for free. See examples/foundations (minimal) and examples/mock-axum (full HTTP service with pprof_route-served profile).

Quickstart — with the tracing crate

[dependencies]
culpert            = "0.1"
culpert-tracing    = "0.1"
tracing            = "0.1"
tracing-subscriber = "0.3"

use culpert::TrackingAllocator;
use std::alloc::System;
use tracing_subscriber::prelude::*;

#[global_allocator]
static GLOBAL: TrackingAllocator<System> = TrackingAllocator::new(System);

fn main() {
    tracing_subscriber::registry()
        .with(culpert_tracing::layer())
        .with(/* your other layers — fmt, OTLP, etc. */)
        .init();

    culpert_tracing::install();

    // ... your existing app, with #[tracing::instrument] annotations as usual.
}

Existing #[tracing::instrument] annotations become attribution keys. See examples/tracing for a runnable version.

What you get

A profile (*.pb.gz) you can either feed to stock pprof or read with the shipped CLI. Output below is from the examples/mock-axum service under load; the same shape works for any of the three integration paths.

Tree report — culpert report <profile>

The default: hierarchical breakdown, with each sub-span nested under its parent. Built from span_parent_id labels emitted by whichever SpanContext was installed.

$ culpert report /tmp/mock-axum.pb.gz

Hierarchical span report (143179 samples, sample rate 4.00 KB/alloc):
  Tree shows span_name groupings under their parents. `bytes` is the
  Bernstein-corrected, unbiased estimate of allocated bytes (see
  CHANGELOG: geometric sampling). Use --flat for a simple sorted table.

vec                                            1.30 GB   70.48%  (self 1.30 GB)
(no span)                                    246.45 MB   13.07%  (self 246.45 MB)
json                                         180.92 MB    9.59%  (self 180.92 MB)
nested                                         3.55 MB    0.19%  (self 170.62 KB)
├─ parse_payload                              59.77 MB    3.17%  (self 59.77 MB)
├─ validate_payload                           59.57 MB    3.16%  (self 59.57 MB)
└─ build_response                            604.00 KB    0.03%  (self 604.00 KB)
strings                                        5.87 MB    0.31%  (self 5.87 MB)
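
As a rough illustration of how a tree like the one above can be assembled from parent links, here is a minimal sketch. The `SpanRow` type, its field names, and the `render` function are hypothetical and are not culpert's actual types; the sketch only assumes that each span carries an id and an optional parent id (as the span_parent_id label suggests) and that the links form a tree.

```rust
use std::collections::HashMap;

/// One span's aggregate after samples are grouped. Field names are
/// illustrative, not culpert's actual types.
struct SpanRow {
    id: u64,
    parent: Option<u64>, // value of the span_parent_id label, if any
    name: &'static str,
    self_bytes: u64,
}

type Children = HashMap<u64, Vec<u64>>;
type ById<'a> = HashMap<u64, &'a SpanRow>;

/// Total for a span: its own bytes plus everything under its children.
/// Assumes the parent links form a tree (no cycles).
fn total_bytes(id: u64, by_id: &ById, children: &Children) -> u64 {
    let below: u64 = children
        .get(&id)
        .into_iter()
        .flatten()
        .map(|kid| total_bytes(*kid, by_id, children))
        .sum();
    by_id[&id].self_bytes + below
}

/// Appends one line for `id`, then recurses into its children.
fn walk(id: u64, depth: usize, by_id: &ById, children: &Children, out: &mut Vec<String>) {
    let row = by_id[&id];
    let total = total_bytes(id, by_id, children);
    out.push(format!("{}{} {} (self {})", "  ".repeat(depth), row.name, total, row.self_bytes));
    for kid in children.get(&id).into_iter().flatten() {
        walk(*kid, depth + 1, by_id, children, out);
    }
}

/// Renders one line per span, children indented under their parent.
fn render(rows: &[SpanRow]) -> Vec<String> {
    let by_id: ById = rows.iter().map(|r| (r.id, r)).collect();
    let mut children = Children::new();
    let mut roots = Vec::new();
    for r in rows {
        match r.parent {
            // A parent that is missing from the profile makes this span a root.
            Some(p) if by_id.contains_key(&p) => children.entry(p).or_default().push(r.id),
            _ => roots.push(r.id),
        }
    }
    let mut out = Vec::new();
    for root in roots {
        walk(root, 0, &by_id, &children, &mut out);
    }
    out
}

fn main() {
    let rows = [
        SpanRow { id: 1, parent: None, name: "nested", self_bytes: 10 },
        SpanRow { id: 2, parent: Some(1), name: "parse_payload", self_bytes: 100 },
        SpanRow { id: 3, parent: Some(1), name: "validate_payload", self_bytes: 50 },
    ];
    for line in render(&rows) {
        println!("{line}");
    }
}
```

In this sketch a parent's total always includes its children, and `self` is what remains attributed to the span directly.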

--flat switches to a sorted-by-bytes table for users who prefer it. The bytes column is the unbiased estimate of total bytes allocated under each span: each underlying sample is weighted by 1 / (1 − exp(−bytes/rate)) (the Bernstein correction for geometric sampling). No raw column is shown: with geometric sampling the corrected value is the only one that means anything.
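
To make the weighting concrete, here is a small sketch of the correction as a pair of functions. It assumes that `bytes` in the formula is the size of the sampled allocation and `rate` is the mean sampling interval in bytes; the function names are illustrative and not part of culpert's API.

```rust
/// Probability that an allocation of `size` bytes is sampled when sample
/// points are spaced with mean `rate` bytes (assumed reading of the formula).
fn sample_probability(size: f64, rate: f64) -> f64 {
    1.0 - (-size / rate).exp()
}

/// Bytes that one sampled allocation stands for: its size multiplied by
/// the weight 1 / (1 - exp(-size / rate)).
fn corrected_bytes(size: f64, rate: f64) -> f64 {
    size / sample_probability(size, rate)
}

fn main() {
    let rate = 4096.0;
    for size in [64.0, 4096.0, 1_048_576.0] {
        println!(
            "size {:>9} B  p = {:.4}  one sample stands for {:.0} B",
            size,
            sample_probability(size, rate),
            corrected_bytes(size, rate)
        );
    }
}
```

Small allocations are rarely sampled, so each sample stands for roughly one sampling interval of bytes; allocations much larger than the rate are almost always sampled, so the weight approaches 1 and the sample counts for its own size.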

Drill into one span — --span <name>

$ culpert report /tmp/mock-axum.pb.gz --span json --top 4

Top callsites within span "json" (sample rate 4.00 KB/alloc):
  callsite                                                  samples         bytes  bytes %
  -------------------------------------------------------  --------  ------------  -------
  alloc::vec::Vec::from_iter::SpecFromIter                   23000      89.84 MB   50.00%
  alloc::fmt::format::{closure}                              23000      89.84 MB   50.00%

Drill into the unattributed bucket — --no-span

Useful for "is this my code's fault, or the runtime's?". Shows the top callsites of allocations that fired outside any span — i.e. tokio runtime work, framework internals, foundations' or tracing's own reporters, or code paths you haven't yet annotated.

$ culpert report /tmp/mock-axum.pb.gz --no-span --top 4

Top callsites in unattributed samples — outside any span:
  callsite                                                  samples         bytes  bytes %
  -------------------------------------------------------  --------  ------------  -------
  <cf_rustracing_jaeger::Tag as Clone>::clone                 8720      34.06 MB   33.74%
  alloc::vec::Vec::append_elements                            6678      26.09 MB   25.84%
  alloc::boxed::Box::new_uninit                               6551      25.59 MB   25.34%
  bytes::bytes_mut::BytesMut::reserve_inner                   2307      16.00 MB   15.85%

Compare two profiles — culpert diff

For CI workflows: diff a "before" and "after" profile by span_name with both an absolute (--threshold-bytes) and a relative (--threshold-pct) gate. --format markdown produces output you can pipe straight into $GITHUB_STEP_SUMMARY:

$ culpert diff before.pb.gz after.pb.gz --format markdown

### culpert: allocation diff

- **before:** `before.pb.gz` — total 2.70 MB (estimated)
- **after:**  `after.pb.gz` — total 5.75 MB (estimated)
- **net Δ:** +3.05 MB (+113.21%)

#### Regressions

| Span | Before | After | Δ | Δ% |
|------|-------:|------:|--:|---:|
| `encode_response` | 1.53 MB | 4.58 MB | +3.05 MB | +200.00% |
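
The two gates can be pictured with a short sketch. How culpert combines --threshold-bytes and --threshold-pct is not spelled out here, so the version below assumes a span counts as a regression only when its growth exceeds both thresholds; the `Gate` type and `is_regression` function are hypothetical names.

```rust
/// Thresholds for flagging a span (hypothetical type, mirroring the
/// --threshold-bytes and --threshold-pct flags).
struct Gate {
    threshold_bytes: u64,
    threshold_pct: f64,
}

/// Assumption: a span regresses only if it grew by at least
/// `threshold_bytes` AND by at least `threshold_pct` percent.
fn is_regression(before: u64, after: u64, gate: &Gate) -> bool {
    if after <= before {
        return false;
    }
    let delta = after - before;
    // A span that did not exist before is treated as infinite relative growth.
    let pct = if before == 0 {
        f64::INFINITY
    } else {
        delta as f64 * 100.0 / before as f64
    };
    delta >= gate.threshold_bytes && pct >= gate.threshold_pct
}

fn main() {
    let gate = Gate { threshold_bytes: 1_048_576, threshold_pct: 10.0 };
    let cases = [
        ("encode_response", 1_600_000u64, 4_800_000u64),
        ("small_span", 1_000_000, 1_050_000),
        ("big_span", 100_000_000, 102_000_000),
    ];
    for (name, before, after) in cases {
        println!("{name}: regression = {}", is_regression(before, after, &gate));
    }
}
```

Under this assumption the absolute gate filters out noise on tiny spans and the relative gate filters out small drifts on very large ones.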

Stock pprof works too

The on-disk format is canonical pprof, so everything in the ecosystem reads it:

pprof -tags        /tmp/mock-axum.pb.gz   # totals grouped by span_name + span_id
pprof -tagfocus="span_name:json" -text /tmp/mock-axum.pb.gz
pprof -http=:8090  /tmp/mock-axum.pb.gz   # interactive flame graph + source view

Persist profiles for CI — culpert upload / culpert pull

For CI workflows you usually want last week's profiles to compare against. The companion culpert-archive Cloudflare Worker stores .pb.gz files keyed by commit SHA; the upload / pull subcommands of culpert-cli are its first-party client. Endpoint and token are picked up from environment variables (CULPERT_ARCHIVE / CULPERT_TOKEN), commit SHA / branch from GITHUB_SHA / GITHUB_REF_NAME so the GitHub Actions step body is short:

# Push the just-captured profile under this commit
culpert upload /tmp/profile.pb.gz

# Pull last build's main-branch profile as a baseline. --allow-missing
# exits 0 (writing nothing) on 404 so the very first main run doesn't
# fail the build.
culpert pull --latest-of main -o /tmp/baseline.pb.gz --allow-missing

# Diff. Markdown to $GITHUB_STEP_SUMMARY, plus a fail-gated text run.
culpert diff /tmp/baseline.pb.gz /tmp/profile.pb.gz \
  --format markdown >> "$GITHUB_STEP_SUMMARY"
culpert diff /tmp/baseline.pb.gz /tmp/profile.pb.gz \
  --threshold-bytes 1048576 --threshold-pct 10   # exit 1 on regression

For drop-in CI use, this repo ships a reusable composite action that wraps the whole flow:

# In your workflow, after you've captured a profile:
- uses: rupert648/culpert/.github/actions/culpert-diff@main
  with:
    archive-url:   ${{ vars.CULPERT_ARCHIVE }}
    archive-token: ${{ secrets.CULPERT_TOKEN }}
    profile:       /tmp/my-service-profile.pb.gz
    culpert-cli:   ./target/release/culpert    # path to the binary

That one block does: culpert info (sanity check in the run log) → culpert pull --latest-of main --allow-missing → culpert diff --format markdown (posted to $GITHUB_STEP_SUMMARY and, on pull_request events, as a sticky PR comment) → culpert upload as the new baseline. Override defaults with the action's inputs (baseline-branch, threshold-bytes, threshold-pct, fail-on-regression, etc.). See .github/actions/culpert-diff/action.yml for the full input schema.

Culpert's own rust.yml profiles example-macros and invokes the same action — that's the worked example. Currently warn-only (fail-on-regression: "false") until enough main-branch runs have accumulated to make gating meaningful.

The worker never parses the pprof bytes — it's dumb storage. Sample attribution, Bernstein correction, threshold logic all run in this CLI. See culpert-archive's README for the deploy recipe and HTTP surface.

Comparison

| | culpert | foundations MemoryProfiler | jemalloc heap prof | dhat | bytehound | heaptrack |
|---|---|---|---|---|---|---|
| Span attribution | yes | no | no | no | no | no |
| Allocator | any GlobalAlloc | jemalloc only | jemalloc only | dhat-rs alloc | linker-injected | LD_PRELOAD |
| Platform | any | Linux only | Linux only | any | Linux | Linux |
| Output | pprof | pprof | pprof | dhat-format | bytehound-format | heaptrack-format |
| Sampling | yes (~512 KiB) | yes (jemalloc) | yes | full-fidelity | full-fidelity | full-fidelity |
| Overhead (typical service) | ~0% off, ~0–5% on | low | low | extreme | high | high |
| Pre-existing instrumentation needed | #[span_fn] | none | none | none | none | none |

The comparison row that matters is span attribution: only culpert produces a profile that answers "which handler allocates most?" without manual stack→handler correlation. That's the entire reason for it. If you don't need per-span attribution, foundations' MemoryProfiler is simpler and battle-tested; culpert and MemoryProfiler can also coexist (independent sample streams) if you want both.

Honest scope limits

What culpert currently does not do:

  • Low-overhead full-fidelity profiling. Sampled is the only mode. Workloads that allocate heavily and do nothing else see meaningful per-alloc cost from the stack-capture path; realistic services with CPU work between allocations see negligible overhead at the default 1-in-512 KiB rate (samples are rare relative to surrounding work). Opt in to StackCaptureStrategy::FramePointer to drop the alloc-heavy overhead by ~91× — see the Overhead section.
  • Async #[culpert::span_fn]. v0.2 ships sync support; the macro emits a compile error on async fn that points to the alternative (use the foundations or tracing adapters for async paths). A ScopedFuture wrapper with explicit parent-capture-on-construction semantics is queued for v0.2.x.
  • Foundations attribution at very low trace sampling rates. The culpert-foundations adapter gates on span_is_sampled(). With foundations' default 100 % sampling this never matters; with low-rate sampling, allocs in unsampled traces land in the (no span) bucket. Workarounds: bump foundations sampling, or use #[culpert::span_fn] for the spans you care about (it doesn't gate on any external tracer's sampling).
  • CPU profiling, lock contention, fragmentation analysis. Wrong product; use pprof-rs / samply for CPU, jemalloc's stats for fragmentation.

Overhead

Measured with Criterion on Apple M-series, release builds. Four configurations: baseline (System global allocator), tracking_off (TrackingAllocator, no profiler installed), tracking_on (BT) (TrackingAllocator + installed profiler at default 1-in-512 KiB using the default Backtrace strategy), and tracking_on (FP) using StackCaptureStrategy::FramePointer.

| Workload | baseline | tracking_off | tracking_on (BT) | tracking_on (FP) |
|---|---:|---:|---:|---:|
| 200 × (alloc + ~1 µs CPU) | 19.2 µs | 17.8 µs | 18.8 µs | 17.9 µs |
| 200 × 64 B allocs (small) | 3.44 µs | 4.12 µs | 3.76 µs | 3.45 µs |
| 50 × 1 MiB allocs | 5.06 µs | 4.30 µs | 584 µs | 6.41 µs |

The first row is the realistic case (allocation interleaved with real work); the others are pure-alloc microbenches that emphasise the per-alloc overhead.

tracking_off adds tens of ns per alloc which disappears under any meaningful CPU work between allocations.

tracking_on (BT) is dominated by backtrace::trace per sampled alloc. On the realistic workload most iterations have no samples (only ~1 in ~10 iterations crosses 512 KiB), so the cost is in the noise. On the 50 × 1 MiB microbench every allocation samples, so each iteration pays 100 backtrace walks — at ~5 µs each on macOS's libunwind that's 500 µs per iteration, which is what dominates.

tracking_on (FP) opts in to StackCaptureStrategy::FramePointer — a tiny load-and-cmp loop over the frame-pointer chain — and brings the dense-sampling case from 584 µs to 6.4 µs (~91× faster). On the realistic workload the difference is within measurement noise because samples are rare relative to the surrounding CPU work.

culpert::install(ctx, Config {
    stack_capture_strategy: StackCaptureStrategy::FramePointer,
    ..Default::default()
});

Frame-pointer capture is x86_64 / aarch64 only; other targets fall back to Backtrace transparently. Requires RUSTFLAGS="-C force-frame-pointers=yes" on Linux x86_64 release builds; macOS aarch64 has frame pointers on by default. On Linux x86_64 without compiled-in frame pointers, backtrace::trace uses DWARF unwinding (slow); FP is expected to be a major win across all workloads on that platform, though we haven't measured it here.

Examples

Four standalone examples in examples/ showing each integration path:

| Example | What it shows | Key wiring |
|---|---|---|
| mock-axum | foundations + axum service with /debug/alloc/profile HTTP endpoint, plus a load script | foundations::telemetry::init + culpert_foundations::pprof_route + #[foundations::span_fn] |
| foundations | minimal foundations wiring without an HTTP server: just init, work, snapshot, write | culpert_foundations::install + #[foundations::span_fn] |
| tracing | tracing crate integration via tracing_subscriber::Layer | culpert_tracing::layer() composed into a Registry + #[tracing::instrument] |
| macros | sampling-independent attribution with no external tracer; also doubles as a BT-vs-FP timing demo (set CULPERT_FP=1) | culpert::LocalSpanContext + #[culpert::span_fn] |

Each writes a profile to /tmp/example-*.pb.gz. View it with cargo run -p culpert-cli --bin culpert -- report <path> or pprof -tags <path>.

Shutdown

Call culpert::shutdown() before letting main return. It flips an internal "we're tearing down" flag so the observer ignores any allocations triggered by TLS destructors during process exit. Without it, certain combinations (notably the tracing crate's Registry on macOS, whose thread_local-based per-thread state allocates during teardown) can deadlock against a recursive std::sync::Mutex acquisition.

culpert::install registers a libc::atexit handler that sets the same flag automatically — that catches typical Unix and Windows teardown paths, but macOS dyld4's TLS finalizers run before atexit, so an explicit culpert::shutdown() is the safe way to make exit deterministic.
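
The teardown flag described above can be pictured with a minimal sketch. This is not culpert's implementation: the `Observer` type and its methods are hypothetical, and the sketch only shows the general pattern of an atomic flag that the allocation hook checks before doing any work.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical observer: counts allocations until shutdown is requested.
struct Observer {
    shutting_down: AtomicBool,
    recorded: AtomicUsize,
}

impl Observer {
    const fn new() -> Self {
        Self {
            shutting_down: AtomicBool::new(false),
            recorded: AtomicUsize::new(0),
        }
    }

    /// Called for every allocation; becomes a no-op once teardown starts,
    /// so allocations made by TLS destructors are ignored.
    fn on_alloc(&self, _size: usize) {
        if self.shutting_down.load(Ordering::Acquire) {
            return;
        }
        self.recorded.fetch_add(1, Ordering::Relaxed);
    }

    /// Flips the "we're tearing down" flag.
    fn shutdown(&self) {
        self.shutting_down.store(true, Ordering::Release);
    }

    fn recorded(&self) -> usize {
        self.recorded.load(Ordering::Relaxed)
    }
}

static OBSERVER: Observer = Observer::new();

fn main() {
    OBSERVER.on_alloc(64); // recorded
    OBSERVER.shutdown();
    OBSERVER.on_alloc(64); // ignored: teardown has started
    println!("recorded = {}", OBSERVER.recorded()); // prints: recorded = 1
}
```

The check is a single atomic load, so it takes no locks and cannot itself run into the recursive-lock problem described above.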

Architecture

A core crate (culpert) does the work in three pieces:

  1. TrackingAllocator<A> — a GlobalAlloc wrapper that observes every alloc. Samples on a per-thread byte countdown.
  2. SpanContext trait — pluggable source of "what's the active span on this thread, right now?". culpert is span-source agnostic.
  3. Profile snapshot — drains every thread's per-thread sample buffer, buckets by (span, callsite), resolves symbols, returns a Profile. culpert::pprof::encode_gzipped writes it as a pprof protobuf.
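
The per-thread byte countdown in step 1 can be sketched as follows. This is an illustrative model, not culpert's code: `ByteCountdown` and the small `Rng` are hypothetical, the interval is drawn from an exponential distribution with mean `rate` (one common way to get geometric-style sampling), and a real allocator hook would keep this state in thread-local storage and avoid allocating on the hot path.

```rust
/// Tiny xorshift64* generator so the sketch needs no external crates.
/// The seed must be non-zero.
struct Rng(u64);

impl Rng {
    /// Uniform value in (0, 1].
    fn next_unit(&mut self) -> f64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let bits = self.0.wrapping_mul(0x2545F4914F6CDD1D) >> 11; // 53 random bits
        (bits as f64 + 1.0) / (1u64 << 53) as f64
    }
}

/// Counts down allocated bytes and fires a sample when the budget runs out.
struct ByteCountdown {
    rate: f64,      // mean bytes between samples
    remaining: f64, // bytes left until the next sample
    rng: Rng,
}

impl ByteCountdown {
    fn new(rate: f64, seed: u64) -> Self {
        let mut rng = Rng(seed);
        let remaining = -rng.next_unit().ln() * rate;
        Self { rate, remaining, rng }
    }

    /// Returns true when this allocation should be sampled.
    fn observe(&mut self, size: usize) -> bool {
        self.remaining -= size as f64;
        if self.remaining > 0.0 {
            return false;
        }
        // Budget exhausted: sample this allocation and draw a fresh interval.
        self.remaining = -self.rng.next_unit().ln() * self.rate;
        true
    }
}

fn main() {
    let mut sampler = ByteCountdown::new(4096.0, 0x9E37_79B9_7F4A_7C15);
    let sampled = (0..1_000_000).filter(|_| sampler.observe(64)).count();
    println!("sampled {sampled} of 1000000 64-byte allocations");
}
```

The common path is one subtraction and one comparison; the random draw only happens when a sample fires.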

Three production SpanContext adapters ship alongside:

  • culpert-foundations reads foundations::telemetry::tracing::rustracing_span() and uses cf-rustracing's span_id (a stable u64 per span) as culpert's identity. Also provides pprof_route() — a TelemetryServerRoute you register at init time.
  • culpert-tracing ships a tracing_subscriber::Layer that captures span metadata at creation and resolves tracing::Span::current().id() on the hot path. Compose into your existing subscriber stack.
  • culpert::LocalSpanContext + #[culpert::span_fn] — sampling-independent attribution via culpert's own thread-local scope stack. No external tracer, no trace-sampling gating. Sync functions only in v0.2; async via ScopedFuture is queued for a follow-up.
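
A thread-local scope stack of the kind the last bullet describes can be sketched in a few lines. The names here (`SCOPES`, `enter`, `ScopeGuard`, `current_span`) are hypothetical and not culpert's API. The sketch also uses a `Vec` for brevity, which a real allocator-side implementation would likely have to replace with fixed-size storage so that pushing a scope never allocates.

```rust
use std::cell::RefCell;

thread_local! {
    /// Names of the spans currently open on this thread, innermost last.
    static SCOPES: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

/// Pops its span from the stack when dropped (RAII).
#[must_use]
struct ScopeGuard;

/// Opens a span on the current thread; it stays active until the guard drops.
fn enter(name: &'static str) -> ScopeGuard {
    SCOPES.with(|s| s.borrow_mut().push(name));
    ScopeGuard
}

impl Drop for ScopeGuard {
    fn drop(&mut self) {
        SCOPES.with(|s| {
            s.borrow_mut().pop();
        });
    }
}

/// The innermost active span on this thread, if any.
fn current_span() -> Option<&'static str> {
    SCOPES.with(|s| s.borrow().last().copied())
}

fn main() {
    println!("{:?}", current_span());
    let _outer = enter("handle_request");
    {
        let _inner = enter("parse_payload");
        println!("{:?}", current_span());
    }
    println!("{:?}", current_span());
}
```

Because the stack lives in thread-local storage and guards are dropped in reverse order of creation, this shape fits synchronous functions naturally; a future that migrates between threads would need something else to carry its span, which is consistent with async support being listed as follow-up work.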

culpert-cli is the report binary. It decodes pprof, groups by span_name label, walks call stacks (skipping capture-machinery frames so the leaf is real user code), and prints the tables shown above.
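
The frame-skipping step can be illustrated with a small sketch. The prefix list below is purely hypothetical (the actual set of frames culpert-cli treats as capture machinery is not listed here), and the stack is assumed to be ordered leaf-first, as in pprof samples.

```rust
/// Hypothetical prefixes for frames that belong to the capture machinery
/// rather than to user code.
const MACHINERY_PREFIXES: &[&str] = &["backtrace::", "culpert::", "__rust_alloc"];

/// Returns the first frame (leaf-first order) that is not capture machinery.
fn leaf_callsite<'a>(stack: &[&'a str]) -> Option<&'a str> {
    stack
        .iter()
        .copied()
        .find(|frame| !MACHINERY_PREFIXES.iter().any(|p| frame.starts_with(*p)))
}

fn main() {
    let stack = [
        "backtrace::trace",
        "culpert::observe",
        "__rust_alloc",
        "alloc::vec::Vec::append_elements",
        "my_service::handler",
    ];
    println!("{:?}", leaf_callsite(&stack));
}
```

With this input the reported callsite is `alloc::vec::Vec::append_elements`, the first frame past the allocator and profiler internals.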

MockSpanContext lets you drive the system in tests without any adapter.

See plan.md for the locked architectural decisions and ROADMAP.md for what's queued next.

License

Dual-licensed under MIT or Apache-2.0.

Dependencies

~6MB
~122K SLoC