Skip to content

Conversation

@nordzilla
Copy link

@nordzilla nordzilla commented Nov 11, 2025

Overview

This is a minimal initial implementation of #349 based off of @blp's commit feldera/feldera@1b448b7.

I've pulled the commit directly into my pull request, so as to preserve the original attributions, and built the crate around the Unix mmap capabilities. Windows is stubbed out, but not yet implemented.

Examples

The README and generated rust docs are full of examples to look at.

Crate Name Reservation

I've also reserved the crate name in good faith that I'd like to release this eventually.

Though, if for some reason this falls through, I'd be happy to relinquish ownership of the crate. I'm not a fan of name squatting without substance or genuine intent.

Current Status

That being said, I'm not ready to release the crate as 0.1.0 yet. This is just the first pull request.

This implementation does the bare minimum that is (mostly) compatible with the current samply implementation (I did have to make one small change that I would prefer to revert in time).

What's implemented:

  • The core public API that I'm fairly committed to, though welcome to feedback.
  • Full Unix support (Linux and macOS).
  • Documentation with links to Firefox Profiler examples.
  • Runnable examples demonstrating synchronous, asynchronous profiling.

Future Work

The following are all considerations that I would like to handle before a 0.1.0 crate release, but aren't necessarily critical functionality.

Windows Support

Windows is currently stubbed out with unimplemented!(). The implementation will need platform-specific equivalents for the Unix mmap-based approach.

Note

Of all the topics, I'm most okay with releasing 0.1.0 without this.
I don't have a Windows dev machine to implement or test this on.

Runtime Detection

I would like to have runtime detection for emitting markers if possible. Perhaps samply itself could set SAMPLY_MARKERS=1 for the process to run, or if the user already has SAMPLY_MARKERS=0 in the current session, then Samply could preserve that.

That would be a lot more ergonomic to use, since markers would either emit if samply is recording, or not if samply is not, and likely not cost much performance compared to the currently implemented compile-time disablement.

Extended Marker Data

I would like to allow samply-markers to output more optional data that can take advantage of runtime marker schemas, but this will require more changes to Samply itself.

Ideally, I'd like to redo the format and parsing of the marker files (which would be a breaking change).

I could do this in a backward-compatible, or perhaps I could make my own format that outputs to samply-marker-{PID}-{TID}.txt.

I'm not sure how important it is that the current format remains intact.

Known Issues

The following are all considerations that will need to be handled before a 0.1.0 crate release. But I don't think they should block merging this PR, creating these as sub-issues, and getting this crate hooked up to CI for testing.

Buffer Flushing

Currently, the marker-file buffers use thread-local storage (TLS) and ensure flushing on drop. This approach doesn't work for long-lived threads, such as those in Rayon's global thread pool. As a result, buffered marker data may never be written out for long-lived threads.

I really don't want to have to mutex-guard my writes to the buffers, but at this point, I can't think of a safe solution that isn't a centralized flush mechanism on the main thread.

Buffer Size

The current implementation has a 64 KiB buffer per thread to write markers, because it performed well in the single not-very-principled hyperfine benchmark that I ran. But 64 KiB feels like a lot. I'll definitely need to do more research and benchmarking into what is an ideal buffer size to have for each thread, both in terms of size and speed.

Ctrl+C Handling

This issue is related to buffer flushing. Whatever solution is developed for the flushing problem will likely be used here as well, but explicit handling of Ctrl+C will still be needed to ensure that buffers are properly flushed before program termination.

Attaching to a Process with Markers

Markers are emitted from the process that is being recorded, and this can make things look akward in the profiler UI when samply is attached to an already-running process. It means that the marker start time for some intervals could be earlier than the profiling start time.

I think the cleanest solution here would be to filter markers as they are added to the JSON, to ensure that any interval markers' start times are equal to or after the profiling start time.

This would require Samply to know the start time of the PID it just attached to. There may be system calls to find this information (I haven't looked too deeply yet), or if we revise the marker file format, we could ensure that every marker file includes a global first time-since-epoch-style timestamp that is guaranteed to be earlier than the first marker. This would allow Samply to reason about the marker timestamp data more easily.


@mstange, perhaps we could collaborate on this in a new issue once this stack is reviewed and (hopefully) landed.

@nordzilla nordzilla marked this pull request as ready for review November 11, 2025 22:08
@blp
Copy link

blp commented Nov 11, 2025

What a gift! Thank you!

(But I haven't looked beyond the commit message.)

@nordzilla
Copy link
Author

nordzilla commented Nov 11, 2025

What a gift! Thank you!

(But I haven't looked beyond the commit message.)

Please give it a try. I'd appreciate your feedback.

It's slightly different than what you implemented, and I've removed the dependency on tracing.

But I'm hoping the documentation generated by cargo doc, as well as in the README, is easy to follow, and that the design/intent is intuitive.

@nordzilla nordzilla force-pushed the samply-markers branch 27 times, most recently from 1a6295a to 2ca0131 Compare November 14, 2025 12:22
@nordzilla nordzilla force-pushed the samply-markers branch 5 times, most recently from 1ba0c8d to 3337731 Compare November 18, 2025 11:45
nordzilla and others added 23 commits November 18, 2025 05:53
This commit adds a new workspace for the `samply-markers`
crate starting at version `0.0.0`. Since this is my first
PR to this project, introducing a new component, I will
bump the version to 0.1.0 once it is fully reviewed,
accepted, and ready.
Source: feldera/feldera@1b448b7
Original author: Ben Pfaff <blp@feldera.com>
File renamed from crates/adapters/src/samply.rs to ./samply-markers/src/unix.rs
This commit runs cargo-fmt on the ported commit from feldera/feldera@1b448b7
to ensure that the code style is up to date with samply's repository.
This commit adds all of the required dependencies in order to
get the ported code from feldera/feldera@1b448b7 to build.

Some of these dependencies may be removed as I update the API
to my vision for the project.
This commit removes all of the code from unix.rs
that is not specific to actually writing to the `mmap`
files on unix systems. This preserves the original
author's attributions.
This commit adds the `SamplyTimestamp` implementation
with a default provider that always returns zero.
This commit adds the `SamplyMarker` implementation
with a default provider that intentionally does not
write the markers to a file.
This commit updates the `unix.rs` implementation
to be compatible with the uniform internal provider
APIs that markers require in order to function.
This commit adds unimplemented stubs for a
Windows provider that would conform to the
uniform internal provider APIs.
This commit adds a `SamplyTimer` abstraction
which is a scoped RAII object that records the
start time it was created and emits an interval
marker when it goes out of scope.
This commit adds a declarative macro wrapper around the
`SamplyMarker` implementation with the intent to be the
primary way of emitting markers.
This commit adds a declarative macro wrapper around the
`SamplyTimer` implementation with the intent to be the
primary way of constructing timer objects to emit intervals.
This commit adds a declarative macro wrapper that will
automatically measure the runtime of a provided sync
or async code block by emitting an intterval marker when
the code block has finished running.
This commit adds a `prelude` module to the `samply-markers`
crate so that users have an easy way to glob import everything
that they need to instrument their code with markers.
This commit adds a top-level doc comment to the `samply-markers` crate.
This commit updates the way Samply adds markers to
use instant marker timing, instead of a span marker
timing, if the start time and end time of the span
are exactly the same.

Admittedly, I don't like this change, and I would like
to revert it once a different marker file format is
potentially supported.

I think that a zero-length interval should be possible
if that is actually what was measured. But as of right
now, I have no way to emit an actual instant marker
given the current file-format restrictions.
This commit adds integreation tests. The tests for
Unix systems are fully fleshed out, ensuring that
markers are correctly written to the expected files.

The for Windows systems currently expect to panic.
This commit adds a project-level README
file to the samply-markers crate.
This commit ensures that .truncate(true) is set when creating marker files to clarify
that existing files should be overwritten. This is unlikely to happen with the temp
directory file format containing the PID and TID, but if it were to happen, it's better
to be safe than sorry. This also fixes the clippy::suspicious_open_options warning.
This commit ensures that the `unix.rs` file correctly handles
emitting markers on macOS. I hadn't realized in previous commits
that the functionality was only working on Linux.

It should work on both systems now.
This commit adds a safety comment to every unsafe block
in the `unix.rs` file, explaining why I blieve the call
is justified.
This commit adds an examples directory with instructions
on how to build, run, and profile the examples shown in
the main README file.
This commit improves the performance of samply-markers
when enabled, based on benchmarks of the Fibonacci example
increased to n = 30 and measured with hyperfine.

The Fibonacci example at n = 30 makes 2,692,536 recursive
calls, emitting the same number of markers. This provides
a great stress test for marker emission overhead.

The configurations are as follows:

- disabled: samply-markers disabled
- enabled-baseline: samply-markers enabled before any performance improvements
- buffered-writes: adds the buffered writes improvement
- itoa-buffers: adds the itoa improvement on top of buffered writes

Linux results:

  Benchmark 1: ./fib30-disabled
    Time (mean ± σ):     128.2 ms ±   0.6 ms    [User: 127.2 ms, System: 0.9 ms]
    Range (min … max):   127.1 ms … 129.6 ms    23 runs

  Benchmark 2: ./fib30-enabled-baseline
    Time (mean ± σ):      3.144 s ±  0.009 s    [User: 0.924 s, System: 2.220 s]
    Range (min … max):    3.134 s …  3.161 s    10 runs

  Benchmark 3: ./fib30-enabled-buffered-writes
    Time (mean ± σ):     522.9 ms ±   1.1 ms    [User: 478.9 ms, System: 43.7 ms]
    Range (min … max):   520.2 ms … 524.1 ms    10 runs

  Benchmark 4: ./fib30-enabled-itoa-buffers
    Time (mean ± σ):     423.1 ms ±   1.8 ms    [User: 377.8 ms, System: 45.1 ms]
    Range (min … max):   419.6 ms … 426.3 ms    10 runs

  Summary
    ./fib30-disabled ran
      3.30 ± 0.02 times faster than ./fib30-enabled-itoa-buffers
      4.08 ± 0.02 times faster than ./fib30-enabled-buffered-writes
     24.53 ± 0.14 times faster than ./fib30-enabled-baseline

macOS results:

  Benchmark 1: ./fib30-disabled
    Time (mean ± σ):      90.7 ms ±   0.4 ms    [User: 89.4 ms, System: 0.9 ms]
    Range (min … max):    90.0 ms …  91.6 ms    31 runs

  Benchmark 2: ./fib30-enabled-baseline
    Time (mean ± σ):      4.384 s ±  0.053 s    [User: 0.578 s, System: 3.783 s]
    Range (min … max):    4.295 s …  4.495 s    10 runs

  Benchmark 3: ./fib30-enabled-buffered-writes
    Time (mean ± σ):     257.4 ms ±   1.8 ms    [User: 237.6 ms, System: 16.9 ms]
    Range (min … max):   255.1 ms … 261.8 ms    11 runs

  Benchmark 4: ./fib30-enabled-itoa-buffers
    Time (mean ± σ):     187.6 ms ±   1.1 ms    [User: 169.9 ms, System: 16.1 ms]
    Range (min … max):   186.3 ms … 190.1 ms    15 runs

  Summary
    ./fib30-disabled ran
      2.07 ± 0.01 times faster than ./fib30-enabled-itoa-buffers
      2.84 ± 0.02 times faster than ./fib30-enabled-buffered-writes
     48.32 ± 0.62 times faster than ./fib30-enabled-baseline

Profiles:

Ubuntu:

* fib30-disabled: https://share.firefox.dev/48id64U
* fib30-enabled-baseline: https://share.firefox.dev/47K8Taf
* fib30-enabled-buffered-writes: https://share.firefox.dev/4oK5Uog
* fib30-enabled-itoa-buffers: https://share.firefox.dev/47ZtGVQ

macOS:

* fib30-disabled: https://share.firefox.dev/3LMmaX2
* fib30-enabled-baseline: https://share.firefox.dev/49TPODF
* fib30-enabled-buffered-writes: https://share.firefox.dev/4oFo1LP
* fib30-enabled-itoa-buffers: https://share.firefox.dev/3K6rLqG

Overall:
- Using buffered writes dramatically reduces overhead.
- Using itoa provides an additional ~19% improvement on Linux and ~27% on macOS.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants