
USENIX Association

Proceedings of the
FREENIX Track:
2001 USENIX Annual
Technical Conference
Boston, Massachusetts, USA
June 25–30, 2001

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2001 by The USENIX Association All Rights Reserved For more information about the USENIX Association:
Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: office@usenix.org WWW: http://www.usenix.org
Rights to individual papers remain with the author or the author's employer.
Permission is granted for noncommercial reproduction of the work for educational or research purposes.
This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
A Universal Dynamic Trace for Linux and other Operating Systems
Richard Moore - IBM, Linux Technology Centre - richardj_moore@uk.ibm.com

Abstract

Dynamic Probes (DProbes) from IBM [*] is a generic and pervasive system debugging facility that will operate under the most extreme software conditions with minimal system disruption. It permits debugging of some of the most difficult types of software problem, especially those encountered in a production environment that will not readily re-create. It is also an invaluable aid for the developer who has to debug parts of the operating system inaccessible to other technologies. DProbes is a front-end enabler for other debugging technologies, such as crash and core dumps and kernel/user debuggers. It is designed to operate with minimal dependence on the operating system, which affords it the possibility of being ported to other operating systems, especially UNIX [**] variants, but not limited to UNIX as it originated conceptually from Dynamic Trace under OS/2 [*]. This paper describes the latest developments of the DProbes project, in particular its use as a tracing tool with the Linux Trace Toolkit project from Opersys [**]. System dependencies are discussed with an emphasis on portability to other Linux H/W platforms as well as other operating systems.

1. Introduction
Dynamic Probes (DProbes) for Linux <1> was first released in August 2000 and presented at the Annual Linux Showcase in October 2000 <2>. The original functionality was essentially that of an automated kernel debugger. Since then DProbes has been extended considerably. It now interfaces with a number of external debugging agents, for example: the Kernel Debugger <3> and Kernel Crash Dump <4> facilities from Silicon Graphics Inc. (SGI) [**]; the standard user-space core dump and syslog facilities within Linux; and also the Linux Trace Toolkit <5> from Opersys.

The major topics discussed in this paper are:

• Detailed implementation aspects of DProbes that relate to its use as an agent for trace instrumentation under the Linux operating system running on the Intel 32-bit architecture (IA32) <8>.

• Portability considerations across other operating systems running under the Intel 32-bit architecture, in particular UNIX-like operating systems.

• Portability to other processor architectures.

Essentially DProbes has become a driver or enabler for other debugging technologies. Its enabling capability derives from the following key characteristics:

1. There is a mechanism for intercepting execution at arbitrary code locations - this is the probepoint mechanism.

2. Each probepoint has an associated probe handler that allows specific actions to be taken. This is implemented using a low-level Reverse Polish Notation (RPN) language that gives access to kernel and user space memory and to the processor's registers [1].

3. A probe handler terminates in one of three ways:

i. By returning to the probed code seamlessly.

ii. By returning to the probed code via a logging dæmon. A temporary logging buffer is made available for this purpose. This characteristic is exploited to provide a means of instrumenting a module with tracepoints.

iii. By transferring control to an external debugging facility having first removed the probepoint. Whether or not the original code will continue execution is a function of the external facility.

The efficacy of DProbes is further enhanced by the following three design criteria:

1. There is no required interactive user interface for the probe handler[2]. This is intentional - it minimizes the dependency of the probe handler on system interfaces and resources. Thus the probe handler is designed to run as a self-contained interrupt handler. The RPN command interpreter provides recovery from potentially fatal errors without reference to operating system facilities. This criterion gives DProbes its universality since there are very few restrictions on where a probepoint may be placed and when the probe handler may execute. In fact, probepoints are only restricted from being placed in the code path of the probe handler. If such a probepoint were to be defined, DProbes would detect it and silently remove it. Probepoints may therefore be placed in code that runs at task time, interrupt time or during a context switch.
2. The second design criterion was to align a probepoint with a module rather than a storage location. Note that the watchpoint extension, which is described under 7. Dynamic Probes Recent Extensions, deviates from this criterion for reasons explained thereunder. By aligning a probepoint with a module, or to be more precise an offset into a module, the probe becomes independent of incidental circumstances that relate to a module's installation in memory and also the processes under which that module is executing. Thus, code that is shared between processes at different virtual addresses (for example a Linux shared library) has location-independent and process-independent probe definitions. This criterion gives DProbes its independence, since it makes it possible to describe a probe:

i. Independently of an operating system's implementation of module management (clearly the internal implementation needs to understand this);

ii. In canonical terms that relate to a programmer's view of his/her module, which are: independent of whether a module is loaded at the time the probe is defined; and independent of any particular process under which that module executes.

3. Probepoints are inserted into module code paths without the need for source code modifications to that module. Furthermore they may be inserted into any loaded and running code (kernel or user space) or code that is paged out or modules not yet loaded. This mechanism has been described in <2>. In summary, the instruction at the probe location is overlaid with a trapping instruction - under Intel 32-bit architecture (IA32) the int3 instruction is chosen. The original instruction is either single-stepped or emulated after the probe handler executes. This particular criterion gives DProbes its dynamic characteristic. The dynamism refers to its ability to instrument a module with probes on the fly, so to speak. Thus there is absolutely no performance penalty when a probepoint is inactive.

These characteristics therefore provide an elementary universal dynamic tracing capability, which has been further extended in both universality and dynamism by recent enhancements - see the section 7. Dynamic Probes Recent Extensions for details.

2. DProbes as a Tracing Mechanism
This section discusses the implementation of DProbes as a tracing mechanism in detail. We describe first the internal mechanism of the Dynamic Probes Event Handler (DPEH) that enables it to be used as a tracing agent.

DPEH internal details:
DProbes provides various working storage elements for use by the probe handler:

1. Local variable array.
2. Global variable array.
3. A per-processor log buffer.

The latter is intended for use as a staging area for building a trace or log record to be passed synchronously to a logging or tracing dæmon external to DProbes. One log buffer is permanently allocated per processor but the data in each buffer persists only for the duration that a probe handler is active on its respective processor. The RPN command interpreter maintains an internal pointer to the next available location in the buffer, which is reset to the beginning of the buffer on entry to the probe handler. Data is thus always accumulated monotonically and discarded on exit from the probe handler.

The buffer is populated using the log class of RPN instructions. These are defined in two categories: those that copy data directly from the RPN stack and those that use the RPN stack to specify data to be copied from system memory.

The direct category comprises three instructions in the IA32 implementation:

log b,<n>    Log byte
log w,<n>    Log word
log d,<n>    Log double-word

Pop <n> elements from the RPN stack and from each, copy the least significant byte (8-bit integer), word (16-bit integer) or double-word to the log buffer.

The indirect category has four members in the IA32 implementation. Each operates by popping an address followed by a length from the RPN stack. Data, for that length, at that address, is copied to the log buffer. However, the copied data is preceded in the log buffer by a 3-byte prefix that contains a token byte and a length word. The length refers to the length of data that follows the prefix and the token byte to the type of data. This feature enables:

a. A trace or log record to contain variable length data such as arrays whose length is determined dynamically.

b. A formatting utility to operate using a fixed template for variable length data.

The token byte values are defined as follows:

0    binary data logged successfully of length specified by the prefix length word.

1    (ASCII) string data logged successfully of maximum length specified by the prefix length word. The actual length of the data may be less if a terminating NULL byte is encountered within the prefix length. This allows data of arbitrary lengths to be capped, especially in cases where string data has been corrupted.

-1   a fault occurred accessing the data using a flat address. The prefix length is set to 4 and the variable data contains only the (flat) address that caused the fault.

-2   a fault occurred accessing the data using an invalid selector. The prefix length is set to 4 and the variable data contains only the selector for the segment generating the fault.

Note: the latter two tokens may occur in circumstances where the data address was valid. Since the RPN probe handler executes essentially as an interrupt handler, with minimal access to system facilities, it will not be able to recover from otherwise recoverable faults. This is a trade-off between the universality and flexibility of DProbes. See 3. DProbes Event Handler Processing for a further discussion on how such conditions are handled.

Under the IA32 implementation there are 4 RPN instructions used for logging data in memory:

log mrf    Log memory range from flat address.
log mrs    Log memory range from segmented address.

Pop a flat address (log mrf) or a 16-bit offset then a 16-bit selector (log mrs) from the RPN stack. Then pop a length. Reserve space for the 3-byte prefix in the log buffer. Copy data from the address specified for the length specified to the log buffer. If a fault is generated then set the prefix token to -1, the length to 4 and store the fault address. If no fault is generated then set the prefix with a 0 token and length of data logged.

log arf    Log ASCII range from flat address.
log ars    Log ASCII range from segmented address.

Pop a flat address (log arf) or a 16-bit offset then a 16-bit selector (log ars) from the RPN stack. Then pop a length. Reserve space for the 3-byte prefix in the log buffer. Copy data from the address specified up to the length specified or until a NULL terminator byte has been copied. If a fault is generated then set the prefix token to -2, the length to 4 and store the fault address. If no fault is generated then set the prefix with a 1 token and maximum length value popped from the RPN stack.

3. DProbes Event Handler Processing
We turn now to considerations concerning back-end probe event handler processing.

As described in <2>, the DProbes Event Handler (DPEH) needs to execute the original instruction that was replaced with a breakpoint. It does this by single-stepping the original instruction in situ with interrupts disabled[3]. If that instruction faults then we require the operating system to recover and retry the instruction. However, one does not normally wish to have multiple trace records generated for each retry execution of an instruction, especially when that instruction eventually succeeds and will thus appear to the underlying program to have executed only once, and with success. This is achieved by delaying the call to the tracing dæmon until after the original instruction has completed single-step. If a fault is generated then the dæmon is not called and the log buffer is reset. Furthermore the DPEH reinstates the probepoint - int3 instruction under IA32 - and resets interrupt status, saved by the processor when the fault was generated, to indicate the status prior to execution of the probepoint. Finally it returns to the system fault handler to allow normal fault processing to occur. If the system retries the faulting instruction it will unwittingly retry the probepoint instruction. The DPEH will therefore be called for each retry. Only on successful execution of the original instruction will the trace dæmon be called.
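The retry behaviour of the DPEH can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not DProbes source: the names probe_sim, run_probe_handler, single_step, dpeh_hit and execute_probed_instruction are hypothetical, the fault is simulated by a counter, and the logonfault field models the control statement of the same name that this paper describes elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOG_BUF_SIZE 64

/* Hypothetical stand-in for the per-processor log buffer. */
struct log_buffer {
    unsigned char data[LOG_BUF_SIZE];
    size_t next;                 /* offset of the next free byte */
};

/* Hypothetical simulation state for one probepoint. */
struct probe_sim {
    int faults_remaining;        /* single-steps that will still fault */
    int logonfault;              /* non-zero: log even when the step faults */
    int daemon_calls;            /* times the trace daemon was "called" */
    size_t last_logged_len;      /* size of the last record handed over */
};

/* The probe handler always starts from an empty buffer. */
static void run_probe_handler(struct log_buffer *lb)
{
    static const unsigned char rec[] = { 0xde, 0xad, 0xbe, 0xef };

    lb->next = 0;                /* reset on entry to the handler */
    memcpy(lb->data, rec, sizeof rec);
    lb->next = sizeof rec;
    assert(lb->next <= LOG_BUF_SIZE);
}

/* Simulated single-step: returns -1 while faults remain, else 0. */
static int single_step(struct probe_sim *p)
{
    if (p->faults_remaining > 0) {
        p->faults_remaining--;
        return -1;
    }
    return 0;
}

/* One probepoint hit. Returns 1 if the original instruction completed,
 * 0 if it faulted and the system is expected to retry it. */
static int dpeh_hit(struct probe_sim *p, struct log_buffer *lb)
{
    int faulted;

    run_probe_handler(lb);
    faulted = single_step(p) != 0;
    if (faulted && !p->logonfault) {
        lb->next = 0;            /* discard the record, no daemon call */
        return 0;
    }
    p->daemon_calls++;           /* the daemon call is delayed to this point */
    p->last_logged_len = lb->next;
    return !faulted;
}

/* Models the system retrying the instruction until it succeeds;
 * returns how many times the probepoint was hit. */
int execute_probed_instruction(struct probe_sim *p, struct log_buffer *lb)
{
    int hits = 1;

    while (!dpeh_hit(p, lb))
        hits++;
    return hits;
}
```

With two simulated faults and logonfault clear, the probepoint is hit three times but only one record reaches the dæmon; with logonfault set, each of the three hits produces a record.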
It is a requirement for the probe handler to be re-executed for each retry of a faulting instruction. This is because it is quite possible, in the case of a page fault, that the data causing the instruction to fault is also accessed by the probe handler for copying into the log buffer. Only on successful execution of the original instruction would the trace record in this case be complete.

Some instructions generate faults for non-error reasons, for example the IA32 bound instruction. For such instructions it would be desirable to log a trace record despite a fault being generated on single-step. This is now possible by means of the logonfault control statement, which is specified in the header of the RPN file. This feature was added recently to DProbes - see 7. Dynamic Probes Recent Extensions below.

4. DPEH Performance Implications
We have made an initial study of the performance overhead of a probepoint. A more comprehensive performance evaluation is future work. The first set of results are quantitative observations made under the Linux 2.2.12 kernel. We also present some qualitative results taken from real-life usage under OS/2. The conclusions from these OS/2 examples indicate that under most conditions the impact of a probepoint is negligible when active. We concern ourselves only with measuring the effect of the active probepoint, since for inactive probepoints there is no alteration to the code path and therefore a zero overhead.

We estimated the overhead of the DPEH using a 90 MHz Pentium[**] processor[4]. Five experiments were performed:

1. To obtain a base measurement of the time taken to execute a sequence of the following three single cycle instructions in a loop:

loop: dec eax
      nop
      jnz loop

2. To test a null probe handler with only the abort[5] RPN instruction and with the probe placed on the dec eax instruction. Here dec is single-stepped by the DPEH.

3. Using the same probe handler but the probe placed on the nop instruction. Here nop is emulated by the DPEH.

4. Using the same probe location as 2 but a single push eax[6] RPN instruction added to the probe handler.

5. The same probe location as 2 but a single exit[7] RPN instruction in the probe handler.

The results were as follows:

1. One iteration of the three-instruction loop averaged 30ns, each instruction approximately 10ns.

2. One iteration of the loop averaged 16µs. Therefore the minimum overhead of the DPEH is approximately 16µs.

3. One iteration of the loop averaged 8µs. Therefore the cost of the DPEH back-end single-step processing accounts for half the overhead per probe.

4. One iteration of the loop averaged 16µs. push eax therefore has a negligible effect. One might reasonably assume most register and memory based RPN instructions are of a similar overhead.

5. One iteration of the loop averaged 200µs. Most of this is the cost of a printk, which is the default logging method (see 5. The Trace Dæmon Interface below for details on how printk is invoked).

Taken at face value the minimum overhead of the DPEH appears to be of the order of 10^3 times the cost of the probed code (16µs against a 30ns loop iteration). This would certainly be a valid perception if a probe were placed in a tight CPU-bound loop. However, in most applications of DProbes the average number of instructions executed between consecutive executions of the same probepoint outweighs any overhead imposed by the DPEH. This is illustrated by the following qualitative results taken from real-life uses of DProbes (actually Dynamic Trace) under OS/2:

1. Tracepoints [8] on every kernel API entry and exit (circa 500 tracepoints).
The user perception varies from unnoticeable to very slight depending on work load. The system is useable and performs within acceptable norms.

2. Tracepoints on entry and exit to the process context switching code with page table data logged on entry and exit.
No noticeable overhead.

3. Tracepoints on every OS/2 Presentation Manager [*] API entry and exit (circa 500 tracepoints).
A noticeable slowing of GUI response. The GUI was useable.

4. Tracepoints on entry to page allocation, page de-allocation and page fault handling routines in the OS/2 page manager.
No noticeable overhead.

5. Tracepoints on 4000 internal kernel routines.
Very noticeable, however the system was still useable.

Conclusions
While the cost of a probe is not cheap it can be considerably reduced by placing the probe on an emulated instruction such as nop. It can also be reduced by judicious use of logging, by employing conditional logic in the probe handler to avoid unnecessary log events. But foremost, the practical use of DProbes finds probepoints being placed in code paths with a relatively long mean time to iterate. Under these circumstances the overhead is negligible.

5. The Trace Dæmon Interface
The generic requirements for a trace dæmon interface are:

1. To provide a logging API capable of being called from kernel space, while interrupts are disabled, from both a task-time and interrupt-time context.

2. To allow binary data of an arbitrary length to be logged and identified as originating from DProbes.

A number of candidates satisfy these requirements. The default behavior is to invoke the klog dæmon via printk. Other options include directing output through a dedicated asynchronous communications port (com1 or com2). Strictly speaking, using a communications port doesn't necessarily invoke a dæmon unless one thinks of the monitoring system connected to the system running DProbes as a dæmon. And finally, a local tracing dæmon can be invoked to record the log buffer. Use of this option requires a degree of conformance between both DProbes and the tracing facility. We have chosen to use the Linux Trace Toolkit from Opersys <5> as an initial implementation. We will describe a little later a generic interface that is possible to implement by using the Generalised Kernel Hook Interface <6> mechanism.

Logging to the Communications ports or klog:
Logging to both the com1 and com2 communications ports and klog involves converting the log data to an ASCII string of pairs of hexadecimal characters and outputting that to the respective medium. Prior to this we format a record header that contains both constant information and some optional entities that are common to all tracepoints. The most generalised form of the trace header template is as follows:

"DProbes(%d,%d) cpu=%d name=%s pid=%d uid=%d cs=%x eip=%08lx ss=%x esp=%08lx tsc=%08lx:%08lx\n"

Other than the DProbes(...) text item, every other item is optionally present. All but cpu are selectable by the user through parameter switches to the dprobes command; cpu is activated automatically when DProbes is run on a multi-processor system.

The meaning of each constituent header item is as follows:

DProbes(%d,%d)
Displays the major and minor code that identifies the probepoint. Each probepoint is assigned a major and minor identifier. These are not required to be unique, but by convention are chosen to indicate a unique type of probe, for example the exit point(s) of a particular routine. Major and minor codes are intended to be used by a generalised formatter to identify a unique formatting template. See 6. Trace Formatting Interface below.

cpu=%d
Displays the processor id on which the probe was executed. This is suppressed on uniprocessor systems and always displayed on multi-processor systems.

name=%s
Displays the process name taken from the current task structure when the probe was executed.

pid=%d
Displays the process id taken from the current task structure when the probe was executed.

uid=%d
Displays the user id taken from the current task structure when the probe was executed.

cs=%x eip=%08lx
Displays the CS and EIP registers at the probe location. This is sometimes useful in distinguishing individuals of a group of similarly formatted and therefore identical major and minor coded probes. For example, multiple return points from a function.

ss=%x esp=%08lx
Displays the SS and ESP register values when the probe was executed. This can give an indication of the nesting level of a subroutine.

tsc=%08lx:%08lx
Displays the high resolution processor time-stamp counter in seconds and micro-seconds.

The remaining data is output as an ASCII string of hexadecimal characters.

Logging to a Trace Daemon
We chose to use the Linux Trace Toolkit (LTT) <5> from Opersys as the trace recording dæmon. It provides the usual post-processing, formatting and analysis features as well as a dæmon that manages a kernel space trace buffer and a mechanism for off-loading the trace buffer to disk. But most importantly, the Linux Trace Toolkit was conceived as a kernel based static [9] tracing mechanism, capable of having tracepoints placed in both interrupt handlers and code that runs with interrupts disabled. In other words, the conditions under which the DPEH executes. We were able to extend the Linux Trace Toolkit to provide a kernel programming interface that allows data of an arbitrary length to be logged.

The KPI interface to Linux Trace Toolkit's Raw Data interface is shown below:

struct trace_raw {
    uint32_t id;        /* Event ID */
    uint32_t DataSize;  /* Size of data recorded by event */
    void*    Data;      /* Data recorded by event */
};

#define TRACE_RAW(ID, LEN, DATA) \
do { \
    struct trace_raw raw_event; \
    raw_event.id = ID; \
    raw_event.DataSize = LEN; \
    raw_event.Data = DATA; \

event
This is an event identifier defined by LTT. It signifies a binary data record, the format of which is undisclosed to LTT.

id
This is a module identifier returned by LTT when tracepoints for a given module are activated. DProbes calls the LTT trace_create_event() routine when it inserts tracepoints for a given module. This enables LTT to correlate events with a module for the purposes of event analysis. Note: DProbes will call the LTT trace_destroy_event() routine when tracepoints for a module are removed.

DataSize
This is the overall size of the trace record (flags + header + log buffer content).

Data
This is a pointer to the trace record (flags + header + log buffer content).

The logged data is further structured with a header followed by the data from the log buffer. The header comprises a flag double-word followed by one or more binary data items concatenated together. The presence of an item is signified by its corresponding flag bit being set. The following table shows the format of each header item and its corresponding flag setting in the order they appear in the header:

#    flag     type     description
1    0x0001   uint32   major
2    0x0002   uint32   minor
3    0x0004   uint32   cpu
4    0x0008   uint32   pid
5    0x0010   uint32   uid
6    0x0020   uint32   cs
7    0x0040   uint32   eip
8    0x0080   uint32   ss
9    0x0100   uint32   esp
10   0x0200   uint64   tsc
11   0x0400   string   process name

This implementation is specific to LTT, but may be readily adapted to other dæmons by requiring that they support the three interfaces for creating, destroying and logging an event.
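The flag-driven header layout can be sketched as a small packing routine. This is an illustrative sketch, not the DProbes or LTT code: the names hdr_values, put and pack_header and the HDR_* constants are hypothetical, the flag values are taken from the table above, and the host byte order and the NUL-terminated process name are assumptions that the paper does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag bits as listed in the header table. */
enum {
    HDR_MAJOR = 0x0001, HDR_MINOR = 0x0002, HDR_CPU = 0x0004,
    HDR_PID   = 0x0008, HDR_UID   = 0x0010, HDR_CS  = 0x0020,
    HDR_EIP   = 0x0040, HDR_SS    = 0x0080, HDR_ESP = 0x0100,
    HDR_TSC   = 0x0200, HDR_NAME  = 0x0400
};

/* Hypothetical container for the optional header items.
 * u32[0..8] hold major, minor, cpu, pid, uid, cs, eip, ss, esp. */
struct hdr_values {
    uint32_t u32[9];
    uint64_t tsc;
    const char *name;
};

/* Append n bytes to out at *off; returns 0 if they do not fit. */
static int put(unsigned char *out, size_t *off, size_t cap,
               const void *src, size_t n)
{
    if (n > cap - *off)
        return 0;
    memcpy(out + *off, src, n);
    *off += n;
    return 1;
}

/* Write the flag double-word followed by each item whose flag bit is
 * set, in table order. Returns the header size, or 0 on overflow. */
size_t pack_header(uint32_t flags, const struct hdr_values *v,
                   unsigned char *out, size_t cap)
{
    size_t off = 0;
    int bit;

    assert(!(flags & HDR_NAME) || v->name != NULL);
    if (!put(out, &off, cap, &flags, sizeof flags))
        return 0;
    for (bit = 0; bit < 9; bit++) {
        if ((flags & (1u << bit)) &&
            !put(out, &off, cap, &v->u32[bit], sizeof v->u32[bit]))
            return 0;
    }
    if ((flags & HDR_TSC) && !put(out, &off, cap, &v->tsc, sizeof v->tsc))
        return 0;
    if ((flags & HDR_NAME) &&
        !put(out, &off, cap, v->name, strlen(v->name) + 1))
        return 0;
    return off;
}
```

A reader of the record would walk the same flag bits in the same order to locate each item, which is why the order in the table matters.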
Generalised Kernel Hook Interface
The disadvantage of the implementation just described is that DProbes needs to be built for use with LTT and LTT needs to be present in the system before DProbes loads in order to resolve the external references to the three interfaces. We can avoid this problem by using the Generalised Kernel Hook Interface <6> to define hook exit points within DProbes for the three interfaces. An arbitrary trace dæmon would register and arm exit routines for these three hooks when the dæmon loads or is instructed to do so. Because the state of activation of a GKHI hook is transparent to DProbes, it would execute code paths that call the three interfaces (now hooks) without regard to whether a recipient dæmon had armed them. The equivalent hook exit points for each of the three API calls are coded as follows:

trace_event(event, &event_struc);
GKHOOK_2VAR(GKHOOK_DPROBES_LOG_EVENT, event, &event_struc);

event=trace_create_event(name, format, desc);
GKHOOK_4VAR(GKHOOK_DPROBES_CREATE_EVENT, &event, &name, &format, &desc);

rc=trace_destroy_event(event);
GKHOOK_2VAR(GKHOOK_DPROBES_DESTROY_EVENT, &rc, event);

DProbes would notify GKHI of the existence of these three hooks during initialisation by calling GKH_identify.

6. Trace Formatting Interface
Clearly, a hexadecimal format for the trace record is not the most user friendly. Therefore we have proposed a formatting utility in the form of a set of shared library routines that may be called to format individual trace records. The unformatted binary trace record is passed to the formatter and a pointer to the formatted trace record is returned.

The formatter uses text templates with place-holders to format the raw data. For efficiency, templates are cached in memory. The formatting library provides two additional subroutine calls:

1. initialise, where essentially the template directory file is opened, loaded and closed.
2. terminate, where any cached templates are freed.

Note: these two interfaces may be called sequentially, in reverse order, to allow templates to be re-read from disk following an update.

Formatting Template Structure
The template syntax is an extension and simplification of that employed by the OS/2 Trace Formatter, which is a natural thing to do since Dynamic Probes also owes its origin to OS/2's Dynamic Trace facility <7>. This scheme is based on a printf-like formatting template. But as discussed below, we have a requirement to format arrays and binary data (essentially an array of bytes) whose number of elements is only determined at the time a trace record is created. This requirement necessitates deviation from a simple printf template.

By convention a unique major code is assigned per module. Each unique trace record format for a module is assigned a unique minor code within the major code. This allows us to employ one formatting template file per major code. A template directory is employed to cross-reference major code to template file name. The template file needs only to identify the minor code to delimit each template; however, for sanity purposes the major code is coded at the head of the file.

Comments are allowed using c-style comment syntax.

Each formatting statement is of the form
keyword=<value>

Numeric values are allowed to be expressed in decimal and hexadecimal using c-notation.

Strings are quoted using c-notation.

The first statement of the file is:
major=<major code>

Subsequent statements will follow the format:
minor=<minor code>[,]
desc=<"descriptive header text">[,]
fmt=<"template 1">[,]
fmt=<"template 2">[,]
.
.
.

End of file or the next minor keyword delimits the end of the previous template.

The desc statement serves to provide a static text description of the trace event.
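A single keyword=<value> statement of this kind can be split with a few lines of C. The sketch below is only an illustration of the syntax as described; it is not the proposed formatter library. The names stmt and parse_statement are hypothetical, and escape sequences inside quoted strings and c-style comments are deliberately not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of one template statement. */
struct stmt {
    char key[8];        /* "major", "minor", "desc" or "fmt" */
    char value[128];    /* numeric text, or string without its quotes */
    int is_string;      /* non-zero if the value was quoted */
};

/* Parse one "keyword=value[,]" statement.
 * Returns 1 on success and 0 if the line is not a valid statement. */
int parse_statement(const char *line, struct stmt *s)
{
    size_t k = 0, v = 0;

    assert(line != NULL && s != NULL);
    while (isspace((unsigned char)*line))
        line++;
    while (isalpha((unsigned char)*line) && k + 1 < sizeof s->key)
        s->key[k++] = *line++;
    s->key[k] = '\0';
    if (k == 0 || *line != '=')
        return 0;
    line++;

    s->is_string = (*line == '"');
    if (s->is_string) {
        line++;                              /* skip the opening quote */
        while (*line && *line != '"' && v + 1 < sizeof s->value)
            s->value[v++] = *line++;
        if (*line != '"')
            return 0;                        /* unterminated or too long */
    } else {
        while (*line && *line != ',' && !isspace((unsigned char)*line) &&
               v + 1 < sizeof s->value)
            s->value[v++] = *line++;
        if (v == 0)
            return 0;
    }
    s->value[v] = '\0';
    return 1;
}
```

Numeric values in c-notation can then be converted with strtoul(s.value, NULL, 0), which accepts both decimal and 0x-prefixed hexadecimal text.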
Major, minor and desc are mandatory, fmt is optional, encountered. If %p is not specified then % s formats a
however if minor is omitted then only default formatting string until a null is encountered.
will be performed. The data will be treated as binary and
formatted in dump format displaying offsets, hexadecimal %<n>u - format an n-byte unsigned decimal integer with
and ASCII. leading zeros removed.

The fmt statements are used to supply template %<n>x - format and n-byte hexadecimal integer
information for formatting user data in the trace record. including leading zeros.

In general any alphanumeric character found in the fmt % z - format the remainder of the trace record in dump
statement is treated as literal text and copied directly to format (offset, 0x20 hexadecimal bytes separated by
the output buffer. Escape control characters \n and \t are spaces and ASCII equivalent for each 0x20 bytes,
supported. In general the last pair of characters in a repeated for each 0x20 bytes - one per line).
sequence of fmt statements will be \n, however the
formatter will always generate an additional new-line at +00000000 21 22 23 24 25 26 27 28 20 c4 a8
the end of a new trace record. fe ae ef ff bb *abcdefgh .......*
+00000020 21 22 23 24 25 26 27 28 20 c4 a8
fe ae ef ff bb *abcdefgh .......*
Multiple fmt statements for the same minor code are +00000040 21 22 23 24 25 26
concatenated by the formatter, so the user must supply *abcdef*
necessary spacing and new-line characters if the
formatted data is to span more than one line. Placeholders for data
to be extracted and formatted within the template are signified by a
sequence that is prefixed with a % character. Multi-byte control
sequences are terminated by any non-numeric character, since in a
multi-byte control sequence the trailing characters are numeric.

The following control sequences may be specified:

%<n>c - format n bytes as ASCII characters. If the character is in
the range 0x20-0x7f then format the ASCII equivalent character,
otherwise substitute a period.

%<n>d - format an n-byte decimal integer with leading zeros removed.

%<n>f - format an n-byte floating point numeric with leading zeros
removed.

%<n>i - skip n bytes in the unformatted data buffer.

%p - skip the three-byte prefix for variable length data, see the
description of the logging RPN commands under 2. DProbes as a Tracing
Mechanism above. This is used in combination with most other controls
by placing them after p. Controls u and r are excluded from use with
p.

%r - skip the three-byte prefix for variable length data, but use it
as a repetition control, see below. This is used with other controls
or a complex expression following.

%s - format an ASCII string up to the length specified by the %p
prefix, or until a null terminator is

( - begins a complex expression - see below

) - ends a complex expression - see below

Where a <n> qualifier is allowed then its omission defaults to 1.

Processing the 3-byte prefix
%p causes the formatter to skip over the prefix, noting the code and
length. If an error is indicated an error message is formatted.

If %s follows %p and the code is 0x01 then data is formatted up to
the first null character or until the length is exhausted.

If %s follows %p and the code is 0x00 then data is formatted up to
the first null character and any remaining data up to the value of
the length is skipped.

If any other control follows %p then that data is formatted according
to the following control, having skipped the prefix (the error code
being checked first).

%p may be combined with any control other than %r and %u.

%r is used to process the prefix in a similar way to %p, except in
this case it uses the prefix to repeat the control sequence that
follows until data of the length specified by the prefix is
formatted. %r may be combined with any control though it seldom makes
sense to combine it with %p, %s in simple formatting expressions.
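
The prefix handling described above can be illustrated with a short Python sketch. It is not the DProbes formatter. The prefix layout used here (one code byte followed by a 16-bit little-endian length), the little-endian integer order and all function names are assumptions made for the illustration, and the error-code check is left out. The sketch also consumes the whole prefixed field for either code value, which the text states explicitly only for code 0x00.

```python
import struct


def read_prefix(buf: bytes, pos: int):
    """Read a 3-byte prefix and return (code, length, next position).

    Assumed layout: one code byte, then a 16-bit little-endian length.
    """
    code, length = struct.unpack_from("<BH", buf, pos)
    return code, length, pos + 3


def format_chars(chunk: bytes) -> str:
    """%<n>c: bytes in 0x20-0x7f print as ASCII, all others as a period."""
    return "".join(chr(b) if 0x20 <= b <= 0x7F else "." for b in chunk)


def format_prefixed_string(buf: bytes, pos: int):
    """%ps: skip the prefix, then format up to the first null character
    or until the prefix length is exhausted."""
    _code, length, pos = read_prefix(buf, pos)
    field = buf[pos:pos + length]
    text = format_chars(field.split(b"\0", 1)[0])
    # Assumption: the whole field is consumed whatever the code value.
    return text, pos + length


def format_repeated_decimal(buf: bytes, pos: int, n: int):
    """%r followed by an n-byte decimal control: repeat the control until
    the data length given by the prefix has been formatted."""
    _code, length, pos = read_prefix(buf, pos)
    end = pos + length
    values = []
    while pos + n <= end:
        # Little-endian byte order is assumed, as on IA32.
        values.append(str(int.from_bytes(buf[pos:pos + n], "little")))
        pos += n
    return values, end


# A prefixed string (code 0x01, length 5) followed by a prefixed array
# of two 2-byte integers (code 0x00, length 4).
record = b"\x01\x05\x00hi\x00zz" + b"\x00\x04\x00\x01\x00\x02\x01"
text, pos = format_prefixed_string(record, 0)
numbers, pos = format_repeated_decimal(record, pos, 2)
print(text, numbers)
```

In this sketch the repetition count is never stored explicitly: the loop simply stops when the length taken from the prefix has been used up, which is how the text describes %r.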
When controls are combined only one % is specified. For example:

%ps - causes a prefixed string to be processed.

When two data items are to be concatenated then two % signs are
needed. For example:

%4us - formats a 4-byte unsigned decimal integer suffixed with a
character s, whereas
%4u%s - formats a 4-byte unsigned decimal integer concatenated to a
zero terminated string.

%r may be followed by a left parenthesis ( to form a complex
formatting expression, which is completed with a right parenthesis ).
This device allows arrays of structures to be formatted. For example
an array for which each entry contained two words called “function”
and “return code” would be formatted using:

%r(function=0x%2x return code=0x%2x\n)

The result would be (for a length value of 12 in the prefix):

function=0x0000 return code=0x0000
function=0x0000 return code=0x0003
function=0x0002 return code=0x0000

A more complex example where the array is a table of pointers to
strings could be formatted using:

%r(pointer=0x%4x, string='%ps'\n)

The result would be:

pointer=0x801234455, string='this is an example string'
pointer=0x802234455, string='this is another example string'

Within a complex expression the % must be used to prefix groups of
controls.
To format a literal %, ( or ) character an additional prefix % is
required. For example:

%% results in %
%( results in (
%) results in )

Note: there is scope for extending this scheme to cope with
formatting bit masks and conditional formatting and this is something
we plan to do.

7. Dynamic Probes Recent Extensions
Since its original release, Dynamic Probes has been enhanced with a
number of new features which are relevant to tracing. These are
briefly described below:

Watchpoint Probes
This innovation defines a new class of probe that exploits the
hardware watchpoint[10] architecture. Watchpoints are specified by
watch-type, which under IA32 may be Read, Write, Execute or IO; and
address range. Watchpoints are global and not aligned with any
particular module, however symbolic expressions are permitted in the
specification of a watchpoint address. This capability gives DProbes
its ability to trace memory accesses.

Logonfault
This allows the option of logging the contents of a log buffer
whether or not the instruction at the tracepoint generates an
exception during single-step. If the operating system retries the
instruction then multiple events will be logged. This was introduced
to handle two circumstances:

1. where instructions such as bounds generate exceptions as part of
normal execution and the exception is not subject to seamless
recovery by the operating system.

2. when a probe is used for monitoring program efficiency. For
example, by logging all attempted executions of an instruction that
is capable of generating a page-fault. By this means one may glean an
insight into the effects of a particular code path on demand paging.

In both cases it is acceptable to log each execution of the probed
instruction whether or not it is for recovery purposes.

Probe handler exception handling
This capability allows an RPN probe handler to specify a label from
which execution will continue should a fault occur when processing a
log instruction. The 3-byte prefix is optionally generated with the
error code depending on the definition of the exception handler.
Interpretation of the RPN probe handler is allowed to continue.

Call Kmod
This allows an open-ended extension to the RPN command set, by
providing a hook for which any
kernel module may register. The call kmod RPN instruction will give
control to the hook exit routine. GKHI is used to implement this
interface.

8. Porting Considerations:
Because DProbes relies on few operating system interfaces it is
relatively easy to port to other operating systems, especially of the
UNIX variety. Furthermore it is structured in a way that enables it
to be ported to other architectures besides IA32. IBM is currently
working on ports to the zSeries (31-bit and 64-bit) [*] and Intel
64-bit <8> architectures.

Porting to Linux on other processor platforms
The following are the key items to be translated when considering a
port to another processor architecture:

Integer size
The global and local variable array element size is set to the
integer size (in multiples of 8). The same applies to the element
size of the RPN stack. All these dependencies are tied to a single
#define definition.

RPN instruction set
References to processor registers need to be mapped to the new
architecture. Each register push instruction is actually an alias for
the single instruction push r,<n>. The aliases are implemented by the
dprobes command from a table that cross-references register to
register number.

The push byte, word and double-word set may need to be extended to
include a quad-word (64-bit). These will need to be implemented to
produce the correct results for the particular endian characteristic
of the processor.

It is unlikely that a probe handler written for one architecture
would work without modification for another. However this can be
addressed by using a high-level language interface for probe handler
definitions. This would avoid low-level CPU based constructs and have
a good chance of being architecturally independent.[11]

Probepoint implementation
Probepoints are implemented by trapping instruction breakpoints. The
processor architecture must provide an instruction that can be stored
atomically and will cause a privilege-level switch. For example, SVC
255 serves this purpose for IBM zSeries processors. The interrupt
handler for the breakpoint instruction will need to be hooked by the
DPEH.

Single-step
The original instruction at the probepoint needs to be
single-stepped. Such a mechanism must therefore exist for use under
software control. Use of the hardware watchpoint mechanism may be
needed to implement this. Under IBM zSeries, one would use the
Program Event Recording (PER) facility. If no inherent single-step
capability exists then use of additional breakpoint instructions will
be required - this however is an imperfect solution which may
prohibit the specification of probepoints on jump or call
instructions.

Processor exceptions
All processor exceptions that are generated through normal
instruction execution need to be intercepted as part of the
single-step back-end processing. Additionally the sequence from
breakpoint interrupt through to single-step interrupt needs to be
conducted with interrupts disabled in order:

1. to preserve event sequences where an interrupt occurs during the
processing of a probe event
2. to avoid difficulties that arise with recursion through the DPEH.

Under Linux for IA32 this required both the exception 1 and exception
3 trap gates to be converted to interrupt gates.

Processor serialisation
Under a multi-processor environment the single-step optionally needs
to be executed while other processors suspend execution. If this
facility cannot be guaranteed then the -stopcpus switch of the
dprobes command will not be supportable.

Instruction cache serialisation
Because instructions of loaded modules are dynamically altered,
serialisation of the instruction pre-fetch cache may need to be
performed. Under Linux the flush_icache operating system call
achieves this.

Watchpoint implementation
This is more complex and difficult to generalise. In the worst case
scenario, watchpoint probe support will have to be removed. Otherwise
support is a
matter of mapping the watchpoint address and range to the processor
implementation. The DPEH watchpoint event interface will need to hook
the watchpoint interrupt handler. It is likely that a generalised
debug register allocation scheme will be needed along with
adjustments to context switching to ensure registers used for
watchpoints are global to all contexts and can easily co-exist with
other uses of watchpoints within the system[12].

Porting to other operating systems.
There are five key considerations:

Module management
DProbes requires a unique handle by which it can refer to a module
while either loaded or on disk. There needs to be a means of
correlating a virtual storage address in a given context with a
module handle. Under Linux the inode serves this purpose.

Page management
Probepoints need to be re-inserted when a module page is brought into
memory. Under Linux DProbes achieves this by hooking the readpage
address. If the paging mechanism is not used to make the initial load
of a module, as in the case of Linux kernel modules, then module load
and unload will also need to be hooked.

Symbolic support
To support symbolic expressions the expression-analyser in the
dprobes command will need to be adapted to process the module format.
Under Linux DProbes assumes the ELF format, which is common to many
UNIX-like platforms.

Memory management services
Apart from basic allocation and de-allocation functions, DProbes will
require a means of aliasing a physical page with a private writeable
virtual address to be able to store the breakpoint instructions
without causing a fault or a proliferation of privatised pages, which
would be the case where a Copy-on-Write page management scheme is
implemented.

Fault handling
The DPEH needs to intercept faults relating to access violations
before any operating system processing so that they may be silently
handled by the DPEH RPN command interpreter.

9. Where to obtain DProbes and GKHI:
DProbes and GKHI are available from the IBM Linux Technology Centre’s
web page at:
http://oss.software.ibm.com/developerworks/opensource/linux/projects/dprobes

The development team comprises:
Richard J Moore (DProbes Project Lead) - richardj_moore@uk.ibm.com
Bharata B Rao - rbharata@in.ibm.com
Subodh Soni - ssubodh@in.ibm.com
Vamsikrishna Sangavarapu - r1vamsi@in.ibm.com
Suparna Bhattacharya - bsuparna@in.ibm.com

10. References
<1> Dynamic Probes is an open-source project distributed freely under
the GNU GPL from
http://oss.software.ibm.com/developerworks/opensource/linux/projects/dprobes
<2> Dynamic Probes and Generalised Kernel Hooks paper published in
the USENIX Proceedings of the October 2000 Annual Linux Showcase.
<3> The SGI [**] Kernel Debugger is an open-source project from
Silicon Graphics Inc. It may be obtained from:
http://oss.sgi.com/projects/kdb
<4> The SGI Kernel Crash Dump is an open-source project from Silicon
Graphics Inc. It may be obtained from:
http://oss.sgi.com/projects/lkcd
<5> The Linux Trace Toolkit is an open-source project from Opersys,
Montreal. It may be obtained from:
http://www.opersys.com/LTT/
<6> Generalised Kernel Hooks Interface is an open-source project
distributed freely under the GNU GPL from:
http://oss.software.ibm.com/developerworks/opensource/linux/projects/dprobes
<7> OS/2 Trace facilities are described in the OS/2 Debugging
Handbook Volume 3. Order number SBOF 8617 or as an on-line Redbook
under order number SG244640.
<8> IA32 and IA64 are abbreviations for the 32-bit Pentium and 64-bit
Itanium processors of the Intel Corporation [**].

11. Trademarks
[*] IBM, OS/2, zSeries, S/390 and Presentation Manager are trademarks
of the International Business Machines Corporation in the United
States and other countries.
[**] UNIX is a registered trademark of The Open Group in the United
States and other countries.
Intel, Pentium and Itanium are trademarks of the Intel Corporation in
the United States, other countries, or both.
Java is a trademark of Sun Microsystems, Inc. in the United States,
other countries, or both.
Other company, product, and service names may be trademarks or
service marks of others.

12. Notes
[1] An RPN language is used for the following reasons:
a. it allows a simple abstraction of the processor architecture to be
defined to give access to the lowest level resources for minimal
overhead.
b. it provides a basis on which high-level language interfaces can be
defined and be largely architecturally independent. Compare this with
the Java [**] language and its implementation by a Java Virtual
Machine which has an RPN-based virtual machine code.
[2] An interactive interface could always be provided by transferring
control to a debugger such as the SGI kernel debugger.
[3] The reasons for this restrictive behavior are described in <2>.
In summary this is due to the fact that recursion cannot be tolerated
by the DPEH since few system services are available to it, in
particular memory allocation. It would be possible to tolerate a
finite level of recursion using a DPEH state saving stack independent
of the IA32 implemented stack, however, performance and boundary
conditions become complications. The latter in particular, since it
would be difficult to manifest a consistent behavior to the user.
[4] These experiments were subsequently repeated using an Intel
200MHz Pentium processor. The results were consistent with those
obtained earlier using the Intel 90MHz Pentium processor, being
scaled by a factor of approximately 50%.
[5] The abort RPN instruction causes the probe handler to exit
without calling any external logging function.
[6] push eax stores the value of the EAX register on the RPN stack.
The processing by the interpreter for this instruction is similar to
that of most of the RPN instruction set.
[7] The exit RPN instruction causes the probe handler to exit and for
the default external logging function to be called.
[8] A tracepoint is a probepoint used for the purpose of tracing.
[9] Static as opposed to dynamic trace refers here to tracepoints
that are hard coded in program source as opposed to dynamically
inserted at run-time. With static trace there is always an overhead
even when the tracepoint is inactive.
[10] Watchpoints refer to processor implemented breakpoints that
require no code modification. In general they are implemented using
special registers and features of the processor. They normally are
not confined to monitoring execution but also permit memory
references to be monitored. Watchpoints are usually global in nature
being specified by virtual or even physical address location under
some architectures.
[11] The IBM DProbes team is working on a current project to
implement a high-level language preprocessor for DProbes which
generates RPN instructions from a C-like probe definition language.
[12] The IBM DProbes team submitted a Linux kernel patch to the Linux
Kernel Mailing List to achieve this for Linux under IA32.
