Record kernel samples#736
Conversation
| pid: -1, | ||
| tid: 0, | ||
| address: module.base_address, | ||
| length: module.size, |
There was a problem hiding this comment.
A surprising coincidence, but I have been working on something similar locally. I've found an issue with using the extents from /proc/modules. Details below:
When CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC is enabled, .text and .data goes into separate memory regions. However /proc/modules reports start of .text as the basis but the module_total_size (code). So combining these two values will get you overlapping module ranges, and in samply overlapping module ranges will evict the previous one, causing symbolization to fail.
This probably needs to be taken upstream but until then I think we might need to clamp the module extents so that they don't overlap.
There was a problem hiding this comment.
Sorry, I got confused. On x64, CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC is not enabled, so it's not due to it. The correct explanation seems to be that data symbols precedes .text so using total size is still wrong.
There was a problem hiding this comment.
I pushed b8822c6, is this what you had in mind?
There was a problem hiding this comment.
Thanks, I'll test when I'm back at desk.
There was a problem hiding this comment.
Okay, the kernel module bases seem to resolve correctly, but symbolization for modules does not work. You probably need to parse module entries from kallsyms as well. Do you plan to do it? Otherwise I can see if I can pick up some of my local changes on top.
There was a problem hiding this comment.
I quickly created a module symbol parser in 9eaf3fa. It seems to symbolize correctly on my end.
There was a problem hiding this comment.
Nice, thanks, cherry-picked it
Mimics `perf` so the kernel and modules' symbols are registered in `Converter`.
f25cd32 to
133b718
Compare
133b718 to
c9de5c9
Compare
|
I'm curious how you're running samply in a way that gives you useful kernel symbols here. Did you change your |
|
On my end, I have both kptr_restrict and perf_event_paranoid relaxed. That allows reading kallsyms without root. But in any case root should work. https://github.com/torvalds/linux/blob/master/kernel/ksyms_common.c |
e01b5c6 to
e31872e
Compare
e31872e to
b8822c6
Compare
| @@ -311,6 +311,10 @@ impl PerfBuilder { | |||
| attr.sample_type |= PERF_SAMPLE_STACK_USER; | |||
There was a problem hiding this comment.
I think we should also set the EXCLUDE_CALLCHAIN_USER flag bit, because we only want one source of user stack, either the stack bytes (with user-space stack walking) or the call chain (from kernel-space stack walking).
There was a problem hiding this comment.
That makes sense.
Separately, I had an idea that we could use both the user-space stack we get from the kernel and merge it with our own obtained from DWARF to improve accuracy, but I couldn't find an easy way to do this.
There was a problem hiding this comment.
Err your current patch is setting PERF_ATTR_FLAG_EXCLUDE_CALLCHAIN_KERNEL but I was asking for EXCLUDE_CALLCHAIN_USER. So this isn't really achieving the thing I asked for - having only one source of user stack, either the stack bytes or the user call chain.
Always request PERF_SAMPLE_CALLCHAIN with user frames so the kernel walks the frame pointer chain at sample time. When DWARF unwinding is truncated, splice the remaining user FP frames from the kernel's callchain into the merged stack, giving complete stacks for binaries compiled with frame pointers.
| // Find where to splice: match the last real DWARF frame's address in the | ||
| // FP callchain, then append everything deeper from the FP walk. | ||
| let last_dwarf_addr = self.dwarf[..dwarf_len].last().map(|f| f.address()); | ||
| let splice_idx = last_dwarf_addr | ||
| .and_then(|addr| fp_stack.iter().position(|f| f.address() == addr)) | ||
| .map(|i| i + 1) | ||
| .unwrap_or(fp_stack.len()); | ||
| if splice_idx < fp_stack.len() { | ||
| self.merged.extend_from_slice(&fp_stack[splice_idx..]); | ||
| } |
There was a problem hiding this comment.
I'm not sure about this. With recursion, you may end up with merged stacks that have extra or missing recursive parts. It would be nice to match the two stacks up using the per-frame stack-pointer value, unfortunately the callchain from the kernel doesn't give us the stack pointers, as far as I know.
mstange
left a comment
There was a problem hiding this comment.
The rest looks good. I just think we shouldn't mess with two user stacks here. Can you change this PR so that the user stack comes either from the callchain info or from our dwarf unwinding? Then you don't need the SampleStack struct either. And then we can discuss the user stack stitching in a separate PR.
Kernel samples are not recorded nor decoded during
samply record. This PR is mainly two commits:perf, so thatConverteris notified of the kernel symbolsBased on #734.