bperf captures on-CPU and off-CPU profiling data in a single recording
session using eBPF. It writes a standard perf.data file readable by
unmodified perf report, perf script, and flame graph tools.
Traditional profilers force a choice: perf record -e task-clock sees only
on-CPU activity, while off-CPU tools like offcputime see only blocking time.
Neither answers the question "where does my application spend its wall-clock
time?" bperf captures both, classifies off-CPU reasons (runqueue wait, I/O
wait, interruptible/uninterruptible sleep), and merges everything into a single
wall-clock event.
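The off-CPU classification mentioned above can be sketched as a small decision function. This is only an illustration, not bperf's actual code: the enum labels and the `classify_offcpu` helper are hypothetical, and it assumes the classifier sees the `prev_state` argument of `sched_switch` plus the task's `in_iowait` flag. The task-state constants match the values in the kernel's `include/linux/sched.h`.

```c
#include <stdbool.h>

/* Task state bits, same values as the kernel's include/linux/sched.h. */
#define TASK_RUNNING          0x0000
#define TASK_INTERRUPTIBLE    0x0001
#define TASK_UNINTERRUPTIBLE  0x0002

/* Hypothetical labels; bperf's real identifiers may differ. */
enum offcpu_reason {
    OFFCPU_RUNQUEUE_WAIT,   /* switched out while still runnable */
    OFFCPU_IO_WAIT,         /* blocked with in_iowait set */
    OFFCPU_SLEEP_INT,       /* interruptible sleep */
    OFFCPU_SLEEP_UNINT,     /* uninterruptible sleep */
    OFFCPU_OTHER            /* any state not covered above */
};

/* Assumed inputs: prev_state from sched_switch, in_iowait from the task. */
enum offcpu_reason classify_offcpu(unsigned int prev_state, bool in_iowait)
{
    if (in_iowait)
        return OFFCPU_IO_WAIT;          /* I/O wait wins over the sleep type */
    if (prev_state == TASK_RUNNING)
        return OFFCPU_RUNQUEUE_WAIT;    /* preempted, waiting for a CPU */
    if (prev_state & TASK_UNINTERRUPTIBLE)
        return OFFCPU_SLEEP_UNINT;
    if (prev_state & TASK_INTERRUPTIBLE)
        return OFFCPU_SLEEP_INT;
    return OFFCPU_OTHER;
}
```

Checking `in_iowait` first is a design choice in this sketch: I/O waits usually happen in uninterruptible sleep, so testing the sleep bits first would hide them.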
The concept originates from the bperf paper (OSDI '24, Yonsei University), which introduced blocked samples via kernel patches to Linux. This implementation achieves the same profiling capability using eBPF on stock kernels -- no kernel patches required.
Two-tone wall-clock flamegraph for MySQL (one connection thread).
- On-CPU: perf_event_open(task-clock) samples at a configurable frequency
- Off-CPU: BPF program on tp_btf/sched_switch captures every blocking episode with kernel + user stacks and duration
- Both streams share CLOCK_MONOTONIC timestamps and are merged into a single perf.data with one unified wall-clock event attribute
See DESIGN.md for the full architecture, BPF program details, and perf.data output format.
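Because both streams carry timestamps from the same clock, combining them amounts to a two-way merge. The sketch below illustrates that idea only; `struct sample` and `merge_samples` are hypothetical names, the real records carry more fields, and whether bperf orders records at write time or leaves ordering to perf is not stated here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory sample; real perf.data records carry more fields. */
struct sample {
    uint64_t ts_ns;      /* CLOCK_MONOTONIC timestamp */
    uint64_t weight_ns;  /* sampling period (on-CPU) or blocked duration (off-CPU) */
    int      off_cpu;    /* 0 = on-CPU sample, 1 = off-CPU episode */
};

/* Merge two timestamp-sorted arrays into out[] and return the count written.
 * out must have room for n_on + n_off entries. Ties go to the on-CPU sample. */
size_t merge_samples(const struct sample *on, size_t n_on,
                     const struct sample *off, size_t n_off,
                     struct sample *out)
{
    size_t i = 0, j = 0, k = 0;

    while (i < n_on && j < n_off)
        out[k++] = (on[i].ts_ns <= off[j].ts_ns) ? on[i++] : off[j++];
    while (i < n_on)
        out[k++] = on[i++];
    while (j < n_off)
        out[k++] = off[j++];
    return k;
}
```

The `weight_ns` field hints at how one event can represent wall-clock time: an on-CPU sample stands for one sampling period, while an off-CPU episode stands for its full blocked duration.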
| Requirement | Minimum | Tested With |
|---|---|---|
| Linux kernel | 6.1+ with BTF | 6.8.0-106-generic (aarch64) |
| CONFIG_DEBUG_INFO_BTF | =y | Required for CO-RE / tp_btf |
| CONFIG_BPF_SYSCALL | =y | Required for BPF |
| clang/llvm | 14+ | clang-18, llvm-18 |
| libbpf | 1.0+ | 1.3.0 (libbpf-dev) |
| libelf | any | libelf-dev |
| bpftool | 5.15+ | 7.4.0 |
NOTE: Install the pahole package before reconfiguring and building the Linux kernel; the kernel build uses pahole to generate BTF when CONFIG_DEBUG_INFO_BTF=y.
Architecture: Tested on aarch64 (ARM). Should work on x86_64 without
changes (the Makefile auto-detects uname -m).
Ubuntu/Debian:
sudo apt-get install -y \
clang-18 llvm-18 \
libbpf-dev libelf-dev zlib1g-dev \
linux-tools-common bpftool pahole

If bpftool is not available as a package, it can be built from the kernel
source tree (tools/bpf/bpftool).
The vmlinux.h header provides BTF type definitions for CO-RE. Generate it once from the running kernel:
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

It must be regenerated if you move to a different kernel version.
make -j16

This compiles the BPF program, generates the BPF skeleton, and links the
userspace binary. Output: ./bperf.
bperf record [OPTIONS] [-- command [args...]]
OPTIONS:
-p, --pid <PID> Profile a specific process (all threads if TGID, one thread if TID)
-t, --tid <TID> Alias for -p
-a, --all-cpus System-wide profiling
-F, --freq <HZ> On-CPU sampling frequency [default: 99]
--no-kernel Exclude kernel call chains
--min-block <USEC> Minimum off-CPU duration to record [default: 1]
-d, --duration <SEC> Recording duration [default: until Ctrl-C]
-o, --output <FILE> Output file [default: bperf.data]
--stack-depth <N> Maximum stack depth [default: 127]
--ringbuf-size <MB> BPF ring buffer size [default: 16]
--no-flamegraph Skip SVG flamegraph generation
Root (or CAP_BPF + CAP_PERFMON) is required for BPF and perf_event_open.
# Profile a running thread for 30 seconds
sudo ./bperf record -t 12345 -F 99 -d 30
# System-wide for 10 seconds
sudo ./bperf record -a -d 10 -o system.data
# Launch and profile a command
sudo ./bperf record -- ./my_server --config server.conf
# Skip flamegraph generation
sudo ./bperf record --no-flamegraph -p 12345 -d 10

For processes that spread work across many threads (databases, web servers,
thread pools), profiling the whole process with -p TGID aggregates all
threads into one flamegraph. This can obscure the bottleneck: a worker thread
blocked on I/O is buried among dozens of idle threads, diluting its signal.
Recommendation: identify the thread doing the work you care about and
profile it directly with -t TID. This gives a clean wall-clock view of
exactly that thread's on-CPU and off-CPU activity with no noise from unrelated
threads.
# Find threads of a process and their CPU usage
ps -T -p <PID> -o tid,comm,%cpu --sort=-%cpu | head -20
# Profile the busiest worker thread specifically
sudo ./bperf record -t <TID> -d 30

Whole-process mode (-p TGID) is still useful for getting an overview or when
you do not know which thread to focus on yet.
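A tool can enumerate the threads of a process the same way ps does, by reading /proc/<pid>/task. The sketch below shows that lookup under stated assumptions: `list_tids` is a hypothetical helper and is not taken from bperf's proc.c.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Collect up to max thread IDs of pid from /proc/<pid>/task.
 * Returns the number of TIDs stored, or -1 if the directory cannot be opened. */
int list_tids(pid_t pid, pid_t *tids, int max)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);

    DIR *dir = opendir(path);
    if (!dir)
        return -1;

    int n = 0;
    struct dirent *de;
    while (n < max && (de = readdir(dir)) != NULL) {
        char *end;
        long tid = strtol(de->d_name, &end, 10);
        if (end == de->d_name || *end != '\0')
            continue;               /* skip "." and ".." */
        tids[n++] = (pid_t)tid;
    }
    closedir(dir);
    return n;
}
```

Each returned TID is a value that can be passed to -t; the main thread's TID equals the process ID.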
# Interactive report
perf report -i bperf.data
# Raw event dump
perf script -i bperf.data | head -40

The output SVG flamegraph is written next to the data file (e.g.,
bperf.data.svg). Open it in any browser for an interactive wall-clock
flamegraph with two-tone coloring (red = on-CPU, blue = off-CPU).
Build and run the included test workload:
gcc -O2 -fno-omit-frame-pointer -o test_workload test_workload.c -lm
sudo ./bperf record -F 99 -o bperf.data -- ./test_workload

Expected output:
bperf: profiling command './test_workload' (pid 12345)
bperf: recording... press Ctrl-C to stop
test_workload: 10 rounds of CPU work + sleep
test_workload: done
bperf: stopping...
bperf: on-CPU samples: 48
bperf: off-CPU events: 363
bperf: output written to bperf.data
bperf: generating flamegraph...
bperf: flamegraph written to bperf.data.svg
- Frame pointers: For accurate user-space stacks, compile your target with -fno-omit-frame-pointer. The test workload already uses this flag.
- Kernel symbols: If perf report shows [unknown] for kernel symbols, run echo 0 | sudo tee /proc/sys/kernel/kptr_restrict.
- The bundled flamegraph/ directory contains scripts from Brendan Gregg's FlameGraph project. perl and perf must be installed for flamegraph generation. Note that the bundled scripts are modified to fit this project.
bperf/
├── DESIGN.md # Architecture and perf.data format details
├── Makefile # Build system
├── vmlinux.h # Generated: BTF header for CO-RE
├── test_workload.c # Test program (CPU work + nanosleep cycles)
├── include/
│ ├── bperf_common.h # Shared structs between BPF and userspace
│ └── perf_file.h # perf.data format definitions
├── src/
│ ├── bperf.bpf.c # BPF program (tp_btf/sched_switch handler)
│ ├── bperf.c # CLI entry point, argument parsing
│ ├── record.c / record.h # Recording orchestration
│ ├── oncpu.c / oncpu.h # On-CPU: perf_event_open, mmap ring buffer
│ ├── offcpu.c / offcpu.h # Off-CPU: BPF skeleton loader, ringbuf consumer
│ ├── writer.c / writer.h # perf.data file writer
│ └── proc.c / proc.h # /proc parser (maps, comm, threads)
└── flamegraph/
├── stackcollapse-perf.pl # Collapse perf script output to folded stacks
└── flamegraph.pl # Render folded stacks as interactive SVG
This project builds on ideas and tools from:
- bperf / blocked_samples (OSDI '24, Yonsei University) -- the original blocked-samples profiling concept and kernel implementation (paper & code, kernel patches)
- FlameGraph by Brendan Gregg -- flame graph visualization scripts (github, CDDL-1.0 license)
- libbpf / BPF CO-RE -- the eBPF library and compile-once-run-everywhere framework that makes this work on stock kernels
- perf (Linux tools/perf) -- the standard profiling infrastructure whose file format and tools we target
Code and documentation were written with OpenCode + Claude Opus 4.6.