Moore
Moore
Proceedings of the
                                   FREENIX Track:
                                 2001 USENIX Annual
                                 Technical Conference
                                      Boston, Massachusetts, USA
                                           June 25–30, 2001
© 2001 by The USENIX Association                All Rights Reserved             For more information about the USENIX Association:
Phone: 1 510 528 8649            FAX: 1 510 548 5738              Email: office@usenix.org              WWW: http://www.usenix.org
                                Rights to individual papers remain with the author or the author's employer.
                  Permission is granted for noncommercial reproduction of the work for educational or research purposes.
              This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
              A Universal Dynamic Trace for Linux and other Operating Systems
          Richard Moore - IBM, Linux Technology Centre - richardj_moore@uk.ibm.com
 -1       a fault occurred accessing the data using a flat      3. DProbes Event Handler Processing
          address. The prefix length is set to 4 and the        We turn now to considerations concerning back-end
          variable data contains only the (flat) address        probe event handler processing.
          that caused the fault.
                                                                As described in <2>, the DProbes Event Handler (DPEH)
 -2       a fault occurred accessing the data using an          needs to execute the original instruction that was
          invalid selector. The prefix length is set to 4 and   replaced with a breakpoint. It does this by
          the variable data contains only the selector for      single-stepping the original instruction in situ with
          the segment generating the fault.                     interrupts disabled[3]. If that instruction faults then we
                                                                require the operating system to recover and retry the
          Note: the latter two tokens may occur in              instruction. However, one does not normally wish to
          circumstances where the data address was              have multiple trace records generated for each retry
          valid. Since the RPN probe handler executes,          execution of an instruction, especially when that
          essentially as an interrupt handler, with minimal     instruction eventually succeeds and will thus appear to
          access to system facilities it will not be able to    the underlying program to have executed only once, and
          recover from otherwise recoverable faults. This       with success. This is achieved by delaying the call to the
          is a trade-off between the universality and           tracing dæmon until after the original instruction has
          flexibility of DProbes. See 3. DProbes Event          completed single-step. If a fault is generated then the
          Handler Processing for a further dis cussion on       dæmon is not called and the log buffer is reset.
          how such conditions are handled.                      Furthermore the DPEH reinstates the probepoint - int3
                                                                instruction under IA32 - and resets interrupt status,
Under the IA32 implementation there are 4 RPN                   saved by the processor when the fault was generated, to
instructions use for logging data in memory:                    indicate the status prior to execution of the probepoint.
                                                                Finally it returns to the system fault handler to allow
log mrf            Log memory range from flat address.          normal fault processing to occur. If the system retries the
log mrs            Log memory range from segmented              faulting instruction it will unwittingly retry the
                   address.                                     probepoint instruction. The DPEH will therefore be
                                                                called for each retry. Only on successful execution of the
      Pop a flat address (log mrf) or a 16-bit offset then a    original instruction will the trace dæmon be called.
      16-bit selector (log mrs) from the RPN stack. Then
It is a requirement for the probe handler to be                 4. Using same probe location as 2 but a single push
re-executed for each retry of a faulting instruction. This          eax[6] RPN instruction added to the probe handler.
is because it is quite possible, in the case of a page fault,
that the data causing the instruction to fault is also
                                                                5. The same probe location as 2 but a single exit[7]
accessed by the probe handler for copying into the log
                                                                    RPN instruction in the probe handler.
buffer. Only on successful execution of the original
instruction would the trace record in this case be
                                                                The results were as follows:
complete.
Some instructions generate faults for non-error reasons,        1. One iteration of the three-instruction loop averaged
for example IA32 bounds instruction. For such                       30ns, each instruction approximately 10ns .
instructions it would be desirable to log a trace record
despite a fault being generated on single-step. This is         2. One iteration of the loop averaged 16µs. Therefore
now possible by means of the logonfault control                     the minimum overhead of the DPEH is approximately
statement, which is specified in the header of the RPN              16µs.
file. This feature was added recently to DProbes - see 7.
Dynamic Probes Recent Extensions below.                         3. One iteration of the loop averaged 8µs. Therefore
                                                                    the cost of the DPEH back-end single-step
4. DPEH Performance Implications                                    processing accounts for half the overhead per
We have made an initial study of the performance                    probe.
overhead of a probepoint. A mo re comprehensive
performance evaluation is future work. The first set of         4. One iteration of the loop averaged 16µs. push eax
results are quantitative observations made under the                therefore has a negligible effect. One might
Linux 2.2.12 kernel. We also present some qualitative               reasonably assume most register and memory based
results taken from real-life usage under OS/2. The                  RPN instructions are of a similar overhead.
conclusions from these OS/2 exa mples indicate that
under mo st conditions the impact of a probepoint is
                                                                5. One iteration of the loop averaged 200µs. Most of
negligible when active. We concern ourselves only with
                                                                    this is the cost of a printk, which is the default
measuring the effect of the active probepoint, since for
                                                                    logging method (see 5. The Trace Dæmon Interface
inactive probepoints there is no alteration to the code
                                                                    below for details on how printk in invoked).
path and therefore a zero overhead.
                                                                Taken at face value the minimum overhead of the DPEH
We estimated the overhead of the DPEH using a 90 MHz
                                                                appears to be of the order of 103. This would certainly be
Pentium[**] processor[4]. Five exp eriments were
                                                                a valid perception if a probe were placed in a tight
performed:
                                                                CPU-bound loop. However, in most applications of
                                                                DProbes the average number of instructions exe cuted
1. To obtain a base measurement of the time taken to            between consecutive executions of the same probepoint
    execute a sequence of the following three single            outweighs any overhead imposed by the DPEH. This is
    cycle instructions in a loop:                               illustrated by the following qualitative results taken from
                                                                real-life uses of DProbes (actually Dynamic Trace) under
    loop: dec eax                                               OS/2:
          nop
          jnz loop
                                                                1. Tracepoints [8] on every kernel API entry and exit
                                                                    (circa 500 tracepoints).
2. To test a null probe handler with only the abort[5]              The user perception varies from unnoticeable to
    RPN instruction and with the probe placed on the                very slight depending on work load. The system is
    dec eax instruction. Here dec is single-stepped by              useable and performs within acceptable norms.
    the DPEH.
                                                                2. Tracepoints on entry and exit to the process context
3. Using the same probe handler but the probe placed                switching code with page table data logged on entry
    on the nop instruction. Here nop in emulated by the             and exit.
    DPEH.                                                           No noticeable overhead.
3. Tracepoints on every OS/2 Presentation Manager            Logging to the Communications ports or klog:
    [*] API entry and exit (circa 500 tracepoints).          Logging to both the com1 and com2 communications
    A noticeable slowing of GUI response. The GUI was        ports and klog involves converting the log data to an
    useable.                                                 ASCII string of pairs of hexadecimal characters and
                                                             outputting that to the respective medium. Prior to this we
                                                             format a record header that contains both constant
4. Tracepoints on entry to page allocation, page
                                                             information and some optional entities that are common
    de-allocation and page fault handling routines in the
                                                             to all tracepoints. The most generalised form of the trace
    OS/2 page manager.
                                                             header template is as follows:
    No noticeable overhead.
Conclusions                                                  Other than the DProbes(...) text item, every other item is
While the cost of a probe is not cheap it can be             optionally present. All but cpu are selectable by the user
considerably reduced by placing the probe on an              through parameter switched to the dprobes command;
emulated instruction such as nop. It can als o be reduced    cpu is activated automatically when DProbes is run from
by judicious use of logging by employing conditional         a multi-processor system.
logic in the probe handler to avoid unnecessary log
events. But foremost, the practical use of DProbes finds     The meaning of each constituent header item is as
probepoints being placed in code paths with a relatively     follows:
long mean time to iterate. Under these circumstances the
overhead is negligible.                                      DProbes(%d,%d)
                                                                 Displays the major and minor code that identifies
5. The Trace Dæmon Interface                                     the probepoint. Each probepoint has assigned a
The generic requirements for a trace dæmon interface             major and minor identifier. These are not required
are:                                                             to be unique, but by convention are chosen to
                                                                 indicate a unique type of probe, for example the exit
1. To provide a logging API capable of being called              point(s) of a particular routine. Major and minor
    from kernel space, while interrupts are disabled, from       codes are intended to be used by a generalised
    both a task-time and interrupt-time context.                 formatter to identify a unique formatting template.
                                                                 See 6. Trace Formatting Interface below
2. To allow binary data of an arbitrary length to be         cpu=% d
    logged and identified as originating from DProbes.           Displays the processor id on which the probe was
                                                                 executed. This is suppressed on uniprocessor
A number of candidates satisfy these requirements. The           systems and always displayed on multi-processor
default behavior is to invoke the klog dæmon via printk.         systems.
Other options include directing output through a
dedicated asynchronous communications port (com1 or          name=%s
com2). Strictly speaking, using a communications port            Displays the process name taken from the current
doesn’t necessarily invoke a dæmon unless one thinks             task structure when the probe was executed.
of the monitoring system connected to the system
running DProbes as a dæmon. And finally, a local tracing     pid=% d
dæmon can be invoked to record the log buffer. Use of             Displays the process id taken from the current task
this option requires a degree of conformance between              structure when the probe was executed.
both DProbes and the tracing facility. We have chosen
to use the Linux Trace Toolkit from Opersys <5> as an        uid=% d
initial implementation. We will describe a little later a        Displays the user id name taken from the current
generic interface that is possible to implement by using         task structure when the probe was executed.
the Generalised Kernel Hook Interface <6> mechanism.
cs=% x eip=%08lx                                                      This an event identifier defined by LTT. It signifies
    Displays the CS and EIP registers at the probe                    a binary data record, the format of which is
    location. This is sometimes useful in distinguishing              undisclosed to LTT.
    individuals of a group of similarly formatted and
    therefore identical major and minor coded probes.       id
    For example, multiple return points from a function.              This a module identifier returned by LTT when
                                                                      tracepoints for a given module are activated.
ss=% x esp=%08lx                                                      DProbes calls the LTT trace_create_event()
    Displays the SS and ESP register values when the                  routine when it inserts tracepoint for a given
    probe was executed. This can give an indication of                module. This enables LTT to correlate events with
    the nesting level of a subroutine.                                a module for the purposes of event analysis. Note:
                                                                      DProbes will call LTT trace_destroy_event()
tsc=%08lx:%08lx                                                       routine when tracepoints for a module are
    Displays the high resolution processor time-stamp                 removed.
    counter in seconds and micro-seconds.
                                                            DataSize
The remaining data is output as an ASCII string of              This is the overall size of the trace record (flags + +
hexadecimal characters.                                         log buffer content).
The fmt statements are used to supply template                %<n>x - format and n-byte hexadecimal integer
information for formatting user data in the trace record.     including leading zeros.
In general any alphanumeric character found in the fmt        % z - format the remainder of the trace record in dump
statement is treated as literal text and copied directly to   format (offset, 0x20 hexadecimal bytes separated by
the output buffer. Escape control characters \n and \t are    spaces and ASCII equivalent for each 0x20 bytes,
supported. In general the last pair of characters in a        repeated for each 0x20 bytes - one per line).
sequence of fmt statements will be \n, however the
formatter will always generate an additional new-line at      +00000000 21 22 23 24 25           26 27 28 20 c4 a8
the end of a new trace record.                                fe ae ef ff bb *abcdefgh           .......*
                                                              +00000020 21 22 23 24 25           26 27 28 20 c4 a8
                                                              fe ae ef ff bb *abcdefgh           .......*
Multiple fmt statements for the same minor code are           +00000040 21 22 23 24 25           26
concatenated by the formatter, so the user must supply                      *abcdef*
necessary spacing and new-line characters if the
formatted data is to span more than one line. Place
                                                              ( - begins a complex expression - see below
holders for data to be extracted and formatted within the
                                                              ) - ends a complex expression - see below
template is signified by a sequence that is prefixed with a
% character. Multi-byte control sequences are
                                                              Where a <n> qualifier is allowed then its omission
terminated by any non-numeric character, since in a
                                                              defaults to 1.
multi-byte control sequence the trailing characters are
numeric.
                                                              Processing the 3-byte prefix
                                                              % p causes the formatter to skip over the prefix, noting
The following control sequences may be specified:
                                                              the code and length. If an error is indicated an error
                                                              message is formatted.
%<n>c - format n-bytes as an ASCII characters. If the
character is in the range 0x20-0x7f then format the ASCII
                                                              If % s follows % p and the code is 0x01 then data is
equivalent character, otherwise substitute a period.
                                                              formatted up to the first null character or until the length
                                                              is exhausted.
%<n>d - format and n-byte decimal integer with leading
zeros removed.
                                                              If % s follows % p and the code is 0x00 then data is
                                                              formatted up to the first null character and any remaining
%<n>f - format an n-byte floating point numeric with
                                                              data up to the value of the length is skipped.
leading zeros removed.
                                                              If any other control follows % p then that data is
%<n>i - skip n bytes in the unformatted data buffer.
                                                              formatted according to the following control, having
                                                              skipped the prefix (the error code being checked first).
% p - skip the three-byte prefix for variable length data,
see the description of the logging RPN commands under
                                                              % p may be combined with any control other then % r
2. DProbes as a Tracing Mechanism above. This is used
                                                              and % u.
in combination with mo st other controls by placing then
after p. Controls u and r are excluded from use with p.
                                                              % r is used to process the prefix in a similar way to % p,
                                                              except in this case it uses the prefix to repeat the control
% r - skip the three byte prefix for variable length data,
                                                              sequence that follows until data of the length specified
but use it as a repetition control, see below. This is used
                                                              by the prefix is formatted. % r may be combined with
with other controls or a complex expression following.
                                                              any control though it seldom makes sense to combine it
                                                              with %p, %s in simple formatting expressions.
% s - format an ASCII string up to the length specified
by the % p prefix, or until a null terminator is
When controls are combined only one % is specified.          Since its original release, Dynamic Probes has been
For example:                                                 enhanced with a number of new features which are
%ps - causes a prefixed string to be processed.              relevant to tracing. These are briefly described below:
When two data items are to be concatenated then two          Watchpoint Probes
% signs are needed. For example:                                 This innovation defines a new class of probe that
                                                                 exploits the hardware watchpoint[10] architecture.
%4us - formats a 4-byte unsigned decimal integer                 Watchpoints are specified by watch-type, which
suffixed with a character s, whereas                             under IA32 may be Read, Write, Execute or IO; and
%4u%s - formats a 4-byte unsigned decimal integer                address range. Watchpoints are global and not
concatenated to a zero terminated string.                        aligned with any particular module, however
                                                                 symbolic expressions are permitted in the
% r may be followed by a left parenthesis ( to form a            specification of a watchpoint address. This
complex formatting expression, which is completed with           capability gives DProbes its ability to trace memory
a right parenthesis ). This device allows arrays of              accesses.
structures to be formatted. For example an array for
which each entry contained two double-words called           Logonfault
“function” and “return code” would be formatted using:           This allows the option of logging the contents of
                                                                 a log buffer whether or not the instruction at the
%r(function=0x%2x return code=0x%2x\n)                           tracepoint generates an exception during
                                                                 single-step. If the operating system retries the
The result would be (for a length value of 12 in the             instruction then mu ltiple events will be logged.
prefix):                                                         This was introduced to handle two circumstances:
Module management
                                                           10. References
    DProbes requires a unique handle by which it can
                                                           <1> Dynamic Probes is an open-source project
    refer to a module while either loaded or on disk.
                                                              distributed freely under the GNU GPL from
    There needs to be a means of correlating a virtual
                                                               http://oss.software.ibm.com/developerworks/openso
    storage address in a given context with a module
                                                               urce/linux/projects/dprobes
    handle. Under Linux the inode serves this purpose.
                                                           <2> Dynamic Probes and Generalised Kernel Hooks
                                                               paper published in the USENIX Proceedings of the
Page management
                                                               October 2000 Annual Linux Showcase.
     Probepoints need to be re-inserted when a module
                                                           <3> The SGI [**] Kernel Debugger is an open-source
     page is brought into memory. Under Linux DProbes
                                                               project from Silicon Graphics Inc.. It may be
     achieves this by hooking the readpage address. If
                                                               obtained from:
     the paging mechanism is not used make the initial
                                                               http://oss.sgi.com/projects/kdb
     load of a module, as in the case of Linux kernel
                                                           <4> The SGI Kernel Crash Dump is an open-source
     modules then module load and unload will also
                                                               project from Silicon Graphics Inc.. It may be
     need to be hooked.
                                                               obtained from:
                                                               http://oss.sgi.com/projects/lkcd
Symbolic support
                                                           <5> The Linux Trace Toolkit is an open-source project
    To support symbolic expressions the
                                                               from Opersys, Motreal. It may be obtained from:
    expression-analyser in the dprobes command will
                                                               http://www.opersys.com/LTT/
    need to be adapted to process the module format.
                                                           <6> Generalised Kernel Hooks Interface is an
    Under Linux DProbes assumes the ELF format,
                                                               open-source project distributed freely under the
    which is common to many UNIX-like platforms.
                                                               GNU GPL from:
                                                                http://oss.software.ibm.com/developerworks/opens
Memory management services
                                                               ource/linux/projects/dprobes
   Apart from basic allocation and de-allocation
                                                           <7> OS/2 Trace facilities are described in the OS/2
   functions, DProbes will require a means of aliasing
                                                               Debugging Handbook Volume 3. Order number
   a physical page with a private writeable virtual
                                                               SBOF 8617 or as an on-line Redbook under order
   address to be able to store the breakpoint
                                                               number SG244640.
   instructions without causing a fault or a
                                                           <8> IA32 and IA64 are abbreviations for the 32-bit
   proliferation of privatised pages, which would be
                                                               Pentium and 64-bit Itanium processors of the Intel
   the case where a Copy-on-Write page management
                                                               Corporation [**].
   scheme is implemented.