RFC: add optional DWARF user stack unwinding for BCC tools #5519

Open
XinShuichen wants to merge 4 commits into iovisor:master from XinShuichen:bcc-dwarf-upstream-pr

Description

This PR is posted as an RFC to discuss the API shape and DWARF unwind model
before expanding the feature to more BCC tools.

BCC user-stack collection today relies mostly on BPF stack maps, which the
kernel fills by walking frame-pointer chains. That works well when user
binaries preserve frame pointers, but it misses or truncates stacks for the
many optimized userspace binaries and libraries built without them. This is
the long-standing problem described in #1234.

This series adds an optional DWARF user-stack provider backed by libgunwinder.
The default behavior is unchanged. DWARF unwinding is opt-in.

This PR adds:

  • optional libgunwinder discovery through ENABLE_DWARF_UNWINDER, default OFF
  • a small C ABI for managing a DWARF unwinder context and copied unwind results
  • Python bindings in bcc.dwarf
  • reusable BPF snippet helpers for bounded user register/stack snapshots
  • profile.py --dwarf -U
  • capable.py --dwarf -U
  • man page, example, C++ tests, and Python tests

The current BCC provider is intentionally conservative:

  • existing BPF_STACK_TRACE behavior remains unchanged
  • x86_64 BCC provider only in this first slice
  • requires bpf_task_pt_regs
  • no raw task-stack fallback
  • capable.py --dwarf --unique is rejected for now
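
The option restrictions for capable.py (`--dwarf` needs `-U`, and `--dwarf` with `--unique` is rejected) could be enforced roughly as follows. This is a minimal sketch, not the code in this series: the parser below is hypothetical, models only the three flags named in this description, and the long option name `--user-stack` and the error wording are assumptions.

```python
import argparse


def parse_args(argv):
    # Hypothetical parser: only the flags discussed in this PR description.
    # The real capable.py has more options and may validate differently.
    parser = argparse.ArgumentParser(prog="capable.py")
    parser.add_argument("-U", "--user-stack", action="store_true",
                        help="output user stack trace")
    parser.add_argument("--unique", action="store_true",
                        help="do not repeat identical stacks")
    parser.add_argument("--dwarf", action="store_true",
                        help="unwind user stacks with DWARF (opt-in)")
    args = parser.parse_args(argv)

    # DWARF unwinding only applies to user stacks, so -U is required.
    if args.dwarf and not args.user_stack:
        parser.error("--dwarf requires -U")
    # Unique filtering happens in BPF on stack IDs, which DWARF mode
    # does not have until userspace has unwound the sample.
    if args.dwarf and args.unique:
        parser.error("--dwarf cannot be combined with --unique")
    return args
```

`parser.error()` prints a usage message and exits with status 2, so an unsupported combination fails before any BPF program is loaded.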

The x86_64-only BCC provider is not a libgunwinder limitation. libgunwinder and
continue-profiling-agent have x86_64 and arm64 support. arm64 BCC support can be
added later by extending the BPF register mapping and validation.

Related work:

Why this approach

DWARF user-stack unwinding from BPF samples is not just symbolization. The BPF
side captures a bounded user register and stack snapshot, and userspace needs to
unwind that snapshot repeatedly across many processes and shared libraries.

libgunwinder is designed for this model:

  • it unwinds from caller-provided registers and stack bytes
  • it keeps reusable per-process and per-ELF metadata caches
  • it shares CFI/symbol metadata for the same ELF across processes
  • it focuses the hot path on CFI, symbols, and process maps
  • it avoids retaining full perf.data-style raw snapshots for offline DWARF
    processing
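
To make "caller-provided registers and stack bytes" concrete, the sketch below decodes one bounded snapshot in userspace. The record layout, field names, and the 256-byte bound are all hypothetical and chosen only for illustration; the actual event shape in this series may differ.

```python
import struct
from typing import NamedTuple

# Hypothetical bound on copied stack bytes per sample.
MAX_STACK_BYTES = 256
# Hypothetical header: pid (u32), stack_len (u32), ip, sp, bp (u64 each).
HEADER = struct.Struct("<IIQQQ")


class UserSnapshot(NamedTuple):
    pid: int
    ip: int
    sp: int
    bp: int
    stack: bytes


def decode_snapshot(raw: bytes) -> UserSnapshot:
    """Decode one sample into registers plus the copied stack bytes."""
    if len(raw) < HEADER.size:
        raise ValueError("sample shorter than header")
    pid, stack_len, ip, sp, bp = HEADER.unpack_from(raw)
    # Never trust the length field: clamp it to the compile-time bound
    # and to the number of bytes that were actually delivered.
    stack_len = min(stack_len, MAX_STACK_BYTES, len(raw) - HEADER.size)
    stack = raw[HEADER.size:HEADER.size + stack_len]
    return UserSnapshot(pid, ip, sp, bp, stack)
```

A decoded snapshot like this is everything an unwinder of this kind needs as input: it does not attach to or read from the live target process.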

This differs from a libunwind-ptrace style integration. libunwind is useful for
traditional process-oriented unwinding, but whole-machine profiling needs a
global cache model for repeated samples across many PIDs. Without that, common
ELFs such as libc, libstdc++, runtimes, and service libraries are likely to be
loaded and indexed repeatedly per process.
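
The global cache model can be illustrated with a toy cache keyed by ELF identity rather than by PID. This is a sketch of the idea only and says nothing about libgunwinder's internals; the class, the loader callback, and the choice of key (for example a build ID or a device/inode pair) are assumptions.

```python
from typing import Any, Callable, Dict, Hashable


class ElfMetadataCache:
    """Load per-ELF unwind metadata once and share it across processes."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader  # parses CFI/symbols for one ELF path
        self._by_elf: Dict[Hashable, Any] = {}
        self.loads = 0  # number of times the expensive loader ran

    def get(self, elf_key: Hashable, path: str) -> Any:
        # The PID is deliberately not part of the key, so every process
        # that maps the same ELF reuses the same parsed metadata.
        meta = self._by_elf.get(elf_key)
        if meta is None:
            meta = self._loader(path)
            self._by_elf[elf_key] = meta
            self.loads += 1
        return meta


# Three processes mapping the same libc trigger exactly one load.
cache = ElfMetadataCache(lambda path: {"path": path})
for pid in (100, 200, 300):
    cache.get(("build-id", "abc123"), "/usr/lib/libc.so.6")
print(cache.loads)
```

With a per-process cache the loader in this example would run once per PID instead.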

libgunwinder is used by Volcengine continue-profiling-agent (CPA):

CPA is a whole-machine continuous profiling agent. It is designed to run
persistently on production hosts, collect CPU/off-CPU/profile-style stack data
across processes, and keep overhead low enough for fleet-wide deployment. The
same libgunwinder-based DWARF unwind path is used there for repeated global
user-stack unwinding.

The ByteDance internal version of the CPA stack has also been exercised at very
large fleet scale, close to one million machines. This is mentioned only as
engineering stability background for the unwinder design. It is not a substitute
for BCC CI, upstream review, or BCC-side test coverage.

Public CPA benchmark data for the same unwinder design includes:

  • CFI evaluator microbenchmark:
    • 100-entry working set: 11,998,307 frames/s, avg 83.35 ns/frame
    • 1,000-entry working set: 4,227,484 frames/s, avg 236.55 ns/frame
    • 10,000-entry working set: 1,353,664 frames/s, avg 738.74 ns/frame
  • ClickHouse workload dump:
    • stable DWARF throughput: 8.8k to 11.3k unwinds/s
    • stable P99: 189 us to 213 us
  • CPA production C++ bench:
    • stable DWARF throughput: avg 4.41k samples/s, median 4.39k samples/s,
      P95 4.57k samples/s
    • latency: avg 13.1 us, median 12.5 us, P95 20.1 us, P99 22.8 us
    • across roughly 1.39 million stable unwinds, 98.23% completed within
      64 us

These numbers are workload-dependent and are not used as BCC CI assertions.
They are included to explain why libgunwinder is a better fit than a generic
per-process unwinder for repeated whole-machine profiling.
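
As a quick sanity check on the CFI evaluator microbenchmark, the reported average cost per frame should be the reciprocal of the reported throughput. The snippet below uses only the figures quoted above; the variable and function names are illustrative.

```python
# working-set entries -> (frames per second, reported average ns per frame)
cfi_microbench = {
    100: (11_998_307, 83.35),
    1_000: (4_227_484, 236.55),
    10_000: (1_353_664, 738.74),
}


def ns_per_frame(frames_per_second: float) -> float:
    """Average time per frame implied by a throughput figure."""
    return 1e9 / frames_per_second


for entries, (fps, reported_ns) in cfi_microbench.items():
    derived_ns = ns_per_frame(fps)
    print(f"{entries:>6} entries: derived {derived_ns:.2f} ns, "
          f"reported {reported_ns:.2f} ns")
```

All three derived values match the reported averages to two decimal places, so the throughput and latency columns of the microbenchmark are mutually consistent.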

There are still API design questions for BCC:

  • Profile-style tools may want a stack-provider abstraction with synthetic
    userspace stack IDs after unwinding.
  • Hook/event-style tools may want a reusable per-event DWARF sample payload and
    decode helper.
  • Aggregating tools such as off-CPU or stackcount-style tools probably need a
    separate design because DWARF frame identity is known only after userspace
    unwinding.
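
One possible shape for the "synthetic userspace stack IDs" idea is to intern each unwound frame sequence in userspace and aggregate on the resulting ID. The sketch below is a hypothetical illustration of that option, not an API proposed by this series; the class name and the example frame names are made up.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


class SyntheticStackIds:
    """Assign small integer IDs to unwound stacks, after unwinding."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, ...], int] = {}
        self._stacks: List[Tuple[str, ...]] = []

    def intern(self, frames: Sequence[str]) -> int:
        key = tuple(frames)
        stack_id = self._ids.get(key)
        if stack_id is None:
            # First time this exact frame sequence is seen.
            stack_id = len(self._stacks)
            self._ids[key] = stack_id
            self._stacks.append(key)
        return stack_id

    def frames(self, stack_id: int) -> Tuple[str, ...]:
        return self._stacks[stack_id]


# Aggregate samples by synthetic ID, as a profile-style tool might.
ids = SyntheticStackIds()
counts: Counter = Counter()
for unwound in (["main", "work", "leaf_a"],
                ["main", "work", "leaf_b"],
                ["main", "work", "leaf_a"]):
    counts[ids.intern(unwound)] += 1
```

Because the ID exists only after userspace unwinding, this helps userspace aggregation but not in-BPF aggregation or filtering, which is why the off-CPU and stackcount-style cases likely need a separate design.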

This PR keeps the first integration small so the dependency, event shape,
helper requirements, and tool semantics can be reviewed before changing more
tools.

Tests run locally:

  • git diff --check upstream/master..HEAD
  • python3 -m py_compile for affected Python files
  • ctest -R '^test_dwarf_unwind$' with DWARF enabled
  • ctest -R '^test_dwarf_unwind$' with DWARF disabled
  • tests/python/test_dwarf_unwind.py with DWARF enabled
  • tests/python/test_dwarf_unwind.py with DWARF disabled
  • DWARF-related test_tools_smoke.py subset
  • profile.py --dwarf -U -F 1 1 -f live smoke
  • profile.py 1 live smoke
  • capable.py --dwarf -U -v live smoke
  • CPA sym_c workload with profile.py --dwarf -U, producing expected
    c_path_a_* / c_path_b_* user frames

Live DWARF stack output is not made a mandatory CI assertion because it depends
on kernel BPF capabilities, target workload, debug/unwind metadata, and perf
buffer pressure.


Checklist

  • Commit prefix matches changed area (e.g., tools/toolname:, libbpf-tools/toolname:, src/cc:, docs:, build:, tests/python:)
  • Commit body explains why this change is needed

For new tools only

N/A. This PR does not add a new tool. It extends existing profile.py and
capable.py, and updates their existing man pages, example files, and smoke
tests.



Add optional libgunwinder discovery and keep the dependency disabled by
default. When enabled, libbcc exposes a small C ABI for creating an
unwinder context, sampling caller-provided register and stack snapshots,
copying frame metadata, and releasing results.

The adapter initializes libgunwinder only for enabled builds and preserves
default builds with a stub implementation. Add focused C++ tests for
unsupported builds, result ownership, truncation, and enabled adapter
behavior.

Test: pass

Signed-off-by: Zhang Yuchen <zhangyuchen.lcr@bytedance.com>

Add ctypes bindings for the DWARF C ABI and expose a bcc.dwarf wrapper
that copies native results before freeing libbcc-owned memory. The wrapper
keeps context close idempotent and reports errno-backed failures to callers.

Also add reusable BPF snippet builders for bounded DWARF user-stack
samples. These helpers keep tool scripts thin while centralizing register
validation, bpf_task_pt_regs probing, and sample decoding.

Test: pass

Signed-off-by: Zhang Yuchen <zhangyuchen.lcr@bytedance.com>

Add an opt-in --dwarf -U path for user stack profiling. The BPF side
captures bounded register and stack snapshots through the reusable DWARF
provider, while Python unwinds samples with libgunwinder and aggregates the
formatted user stacks.

Keep the existing stack-id path unchanged for default profiling. DWARF mode
rejects unsupported stack selections, requires bpf_task_pt_regs, and reports
perf-buffer loss separately from missed unwinds.

Test: pass

Signed-off-by: Zhang Yuchen <zhangyuchen.lcr@bytedance.com>

Add --dwarf -U support for capable.py using the shared hook-sample DWARF
provider. Kernel stacks continue to use BPF_STACK_TRACE, while user stacks
are decoded from bounded per-event samples in Python.

Reject --dwarf without -U and reject --dwarf with --unique for now, because
unique filtering is currently performed in BPF with stack IDs and DWARF
frame identity is only available after userspace unwinding.

Test: pass

Signed-off-by: Zhang Yuchen <zhangyuchen.lcr@bytedance.com>