Skip to content

Record kernel samples#736

Open
DaniPopes wants to merge 19 commits into
mstange:mainfrom
DaniPopes:sample-kernel
Open

Record kernel samples#736
DaniPopes wants to merge 19 commits into
mstange:mainfrom
DaniPopes:sample-kernel

Conversation

@DaniPopes

@DaniPopes DaniPopes commented Dec 21, 2025

Copy link
Copy Markdown
Contributor

Kernel samples are not recorded nor decoded during samply record. This PR is mainly two commits:

  • 18611d3: enables PERF_SAMPLE_CALLCHAIN and decodes kernel stack frames from the callchain in the sample events
  • 4ef1638: register kernel and modules during profiler setup, just like in perf, so that Converter is notified of the kernel symbols

Based on #734.

pid: -1,
tid: 0,
address: module.base_address,
length: module.size,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A surprising coincidence, but I have been working on something similar locally. I've found an issue with using the extents from /proc/modules. Details below:

When CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC is enabled, .text and .data goes into separate memory regions. However /proc/modules reports start of .text as the basis but the module_total_size (code). So combining these two values will get you overlapping module ranges, and in samply overlapping module ranges will evict the previous one, causing symbolization to fail.

This probably needs to be taken upstream but until then I think we might need to clamp the module extents so that they don't overlap.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I got confused. On x64, CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC is not enabled, so it's not due to it. The correct explanation seems to be that data symbols precedes .text so using total size is still wrong.

@DaniPopes DaniPopes Dec 22, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed b8822c6, is this what you had in mind?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll test when I'm back at desk.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, the kernel module bases seem to resolve correctly, but symbolization for modules does not work. You probably need to parse module entries from kallsyms as well. Do you plan to do it? Otherwise I can see if I can pick up some of my local changes on top.

https://share.firefox.dev/4pPtAId

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quickly created a module symbol parser in 9eaf3fa. It seems to symbolize correctly on my end.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks, cherry-picked it

Mimics `perf` so the kernel and modules' symbols are
registered in `Converter`.
@DaniPopes DaniPopes marked this pull request as ready for review December 22, 2025 02:49
@mstange

mstange commented Dec 22, 2025

Copy link
Copy Markdown
Owner

I'm curious how you're running samply in a way that gives you useful kernel symbols here. Did you change your /proc/sys/kernel/kptr_restrict setting? Or are you running samply as root?

@ishitatsuyuki

Copy link
Copy Markdown
Contributor

On my end, I have both kptr_restrict and perf_event_paranoid relaxed. That allows reading kallsyms without root.

But in any case root should work.

https://github.com/torvalds/linux/blob/master/kernel/ksyms_common.c

@@ -311,6 +311,10 @@ impl PerfBuilder {
attr.sample_type |= PERF_SAMPLE_STACK_USER;

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also set the EXCLUDE_CALLCHAIN_USER flag bit, because we only want one source of user stack, either the stack bytes (with user-space stack walking) or the call chain (from kernel-space stack walking).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.

Separately, I had an idea that we could use both the user-space stack we get from the kernel and merge it with our own obtained from DWARF to improve accuracy, but I couldn't find an easy way to do this.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err your current patch is setting PERF_ATTR_FLAG_EXCLUDE_CALLCHAIN_KERNEL but I was asking for EXCLUDE_CALLCHAIN_USER. So this isn't really achieving the thing I asked for - having only one source of user stack, either the stack bytes or the user call chain.

Always request PERF_SAMPLE_CALLCHAIN with user frames so the kernel
walks the frame pointer chain at sample time. When DWARF unwinding
is truncated, splice the remaining user FP frames from the kernel's
callchain into the merged stack, giving complete stacks for binaries
compiled with frame pointers.
Comment thread samply/src/linux_shared/converter.rs Outdated
Comment on lines +2103 to +2112
// Find where to splice: match the last real DWARF frame's address in the
// FP callchain, then append everything deeper from the FP walk.
let last_dwarf_addr = self.dwarf[..dwarf_len].last().map(|f| f.address());
let splice_idx = last_dwarf_addr
.and_then(|addr| fp_stack.iter().position(|f| f.address() == addr))
.map(|i| i + 1)
.unwrap_or(fp_stack.len());
if splice_idx < fp_stack.len() {
self.merged.extend_from_slice(&fp_stack[splice_idx..]);
}

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this. With recursion, you may end up with merged stacks that have extra or missing recursive parts. It would be nice to match the two stacks up using the per-frame stack-pointer value, unfortunately the callchain from the kernel doesn't give us the stack pointers, as far as I know.

@mstange mstange left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest looks good. I just think we shouldn't mess with two user stacks here. Can you change this PR so that the user stack comes either from the callchain info or from our dwarf unwinding? Then you don't need the SampleStack struct either. And then we can discuss the user stack stitching in a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants