MODULE 2: SELECTION OF METRICS AND EVALUATION TECHNIQUES
PERFORMANCE METRICS
Performance metrics are used to evaluate whether a system, unit, piece of equipment, or particular
architecture performs well under a given condition. Computer performance metrics include:
1. Availability
2. Response time
3. Channel capacity
4. Latency
5. Completion time
6. Service time
7. Bandwidth
8. Throughput
9. Relative efficiency
10. Scalability
11. Performance per watt
12. Compression ratio
13. Instruction path length
14. Speedup, etc.
Some of these performance metrics can be explained as follows:
1. Availability: The term availability in computer performance evaluation means the degree
to which a system, subsystem, or equipment is in a specified operable and committable
state at the start of a mission, when the mission is called for at an unknown, i.e., random,
time. Simply put, availability is the proportion of time a system is in a functioning
condition. This is often described as a mission capable rate. Mathematically, it can be
expressed as the ratio of (a) the total time a functional unit is capable of being used
during a given interval to (b) the length of the interval.
Example 1
A unit that is capable of being used 100 hours per week (out of 168 hours, i.e., 24 * 7) would have an
availability of a/b, i.e., 100/168 = 0.5952, which is 59.52% availability. However, no system can
guarantee 100.00% reliability and, as such, no system can assure 100.00% availability.
The simplest representation of availability is the ratio of the expected value of the uptime of a
system to the sum of the expected values of uptime and downtime, i.e.
A = E[Uptime] / (E[Uptime] + E[Downtime])
Example 2
A piece of equipment has a mean time to failure (MTTF) of 81.5 years and a mean time to repair
(MTTR) of 1 hour. Calculate the availability and unavailability of this equipment.
Solution
MTTF in hours = 81.5 * 365 * 24 = 713940
Availability = MTTF / (MTTF + MTTR) = 713940 / 713941 = 0.9999986 = 99.99986%
Unavailability = 1 – Availability = 1 – 0.9999986 = 0.0000014 = 0.00014%
The table below shows the time to fail (TTF) and time to repair (TTR) of a given computer
system, in years.
TTF (years):  3   5   6   8   10
TTR (years):  1   2   4   6   7
Calculate the availability and unavailability of the computer system.
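As a quick sketch (not part of the original notes), the availability formula above can be put into a few lines of Python. It reproduces the numbers of Example 2; the function name is illustrative.

```python
def availability(mttf, mttr):
    """Availability = MTTF / (MTTF + MTTR), with both values in the same time unit."""
    return mttf / (mttf + mttr)

# Example 2: MTTF of 81.5 years, MTTR of 1 hour.
mttf_hours = 81.5 * 365 * 24          # 713940 hours
mttr_hours = 1
a = availability(mttf_hours, mttr_hours)
print(f"Availability   = {a:.7f} = {a * 100:.5f}%")            # 0.9999986 = 99.99986%
print(f"Unavailability = {1 - a:.7f} = {(1 - a) * 100:.5f}%")  # 0.0000014 = 0.00014%
```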
2. Throughput: This is the average rate of successful message delivery over a
communication channel. This is usually measured in bits per second (bits/s or bps), and
sometimes in data packets per second or data packets per time slot. The throughput can
be analyzed mathematically by means of queueing theory, where the load in packets per
time unit is denoted the arrival rate λ and the throughput in packets per time unit is denoted
the departure rate μ. Throughput is essentially synonymous with digital bandwidth
consumption.
Example
Given a 100 Mbit/s Ethernet with a maximum frame size of 1526 bytes (i.e., a maximum 1500-byte
payload + 8-byte preamble + 14-byte header + 4-byte trailer). An additional minimum interframe
gap corresponding to 12 bytes is inserted after each frame. Calculate (i) the maximum channel
utilization and (ii) the maximum throughput or channel efficiency.
Solution
(i) Maximum channel utilization = 1526 / (1526 + 12) x 100% = 99.22%, i.e., a maximum
channel use of 99.22 Mbit/s inclusive of Ethernet data link-layer protocol overhead.
(ii) Maximum throughput or channel efficiency = 1500 / (1526 + 12) x 100% = 97.53%, i.e.,
97.53 Mbit/s exclusive of Ethernet protocol overhead.
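A few lines of Python reproduce these figures; the 100 Mbit/s line rate is the assumption implied by the quoted Mbit/s values.

```python
payload = 1500            # bytes of user data per frame
frame = 1526              # payload + 8-byte preamble + 14-byte header + 4-byte trailer
gap = 12                  # minimum interframe gap, in bytes
line_rate_mbps = 100      # assumed 100 Mbit/s Ethernet

utilization = frame / (frame + gap)       # includes data link-layer overhead
efficiency = payload / (frame + gap)      # excludes all protocol overhead

print(f"Maximum channel utilization: {utilization:.2%} "
      f"({utilization * line_rate_mbps:.2f} Mbit/s)")
print(f"Maximum throughput:          {efficiency:.2%} "
      f"({efficiency * line_rate_mbps:.2f} Mbit/s)")
```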
3. Speedup: This refers to how much faster a parallel algorithm is than a corresponding
sequential algorithm; it is defined by the formula given below.
Speedup is a measure of performance. It measures the ratio between the sequential execution
time and the parallel execution time. Efficiency is a measure of the usage of the computational
capacity. It measures the ratio between performance and the number of resources available to
achieve that performance.
The speedup of a parallel algorithm over a corresponding sequential algorithm is the ratio of the
compute time for the sequential algorithm to the time for the parallel algorithm. If the speedup
factor is n, then we say we have n-fold speedup.
Sp = T1 / Tp
where
p is the number of processors,
T1 is the execution time of the sequential algorithm, and
Tp is the execution time of the parallel algorithm with p processors.
Hence, the performance metric efficiency is defined as:
Ep = Sp / p = T1 / (p * Tp)
Given that the speedup on a quad-core processor system is 20 and the time taken
by a parallel algorithm on the machine is 10 seconds.
Calculate:
(i) The time a sequential algorithm will take on the same machine
(ii) The performance efficiency of the machine
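As a sketch of how these formulas are applied in code, the Python snippet below computes speedup and efficiency from a sequential and a parallel execution time. The values used are illustrative only, not the ones in the exercise above.

```python
def speedup(t_sequential, t_parallel):
    """Sp = T1 / Tp."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, p):
    """Ep = Sp / p = T1 / (p * Tp)."""
    return speedup(t_sequential, t_parallel) / p

# Illustrative values: T1 = 100 s on one processor, Tp = 30 s on p = 4 processors.
t1, tp, p = 100.0, 30.0, 4
print(f"Speedup    Sp = {speedup(t1, tp):.2f}")        # 3.33
print(f"Efficiency Ep = {efficiency(t1, tp, p):.2f}")  # 0.83
```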
4. Response Time: Response time is the time a system or functional unit takes to react to a
given input. In data processing, the response time perceived by the end user is the interval
between:
(a) The instant at which an operator at a terminal enters a request for a response from a
computer and
(b) The instant at which the first character of the response is received at a terminal.
5. Channel Capacity: In information theory, channel capacity is the tightest upper bound
on the amount of information that can be reliably transmitted over a communication
channel. Also, by the noisy-channel coding theorem, the channel capacity of a given
channel is the limiting information rate (in units of information per unit time) that can be
achieved with arbitrarily small error probability.
6. Latency: This is a measure of the time delay experienced in a system; exactly what is
measured depends on the system and on the delay of interest. Latency in a packet-switched network is measured
either one-way (the time from the source sending a packet to the destination receiving it),
or round-trip (the one-way latency from source to destination plus the one-way latency
from the destination back to the source). A good example of how to measure round-trip
latency is ping. Ping is a computer network administration utility used to test the
reachability of a host on an Internet Protocol (IP) network and to measure the round-trip
time for messages sent from the originating host to a destination computer. Thus, ping is
a relatively accurate way of measuring latency.
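As an illustration of measuring round-trip delay from a program (rather than with the ping utility itself), the sketch below times a TCP connection setup in Python. The host name is a placeholder, and connection-setup time is only an approximation of an ICMP round trip, so treat this as a rough demonstration of the idea.

```python
import socket
import time

def tcp_connect_rtt_ms(host, port=80, timeout=2.0):
    """Return the elapsed time (in milliseconds) to open a TCP connection to host:port."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass                                  # connection established; close it right away
    return (time.perf_counter() - start) * 1000

# "example.com" is a placeholder host used purely for illustration.
print(f"Approximate round-trip latency: {tcp_connect_rtt_ms('example.com'):.1f} ms")
```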
7. Bandwidth: The word bandwidth, network bandwidth, data bandwidth, or digital
bandwidth refers to the amount of data that can be transmitted over a network in a certain
amount of time. Bandwidth is measured in bps (bits per second), Kbps (kilobits per
second), Mbps (megabits per second) etc. If the bandwidth of a network is 1 Mbps, it
means that 1 megabit of data can be transmitted over that network in 1 second.
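A small example of how bandwidth figures are used in practice: estimating the ideal time to move a file of a given size over a link of a given bandwidth. The numbers are illustrative, and a real transfer would add protocol overhead on top of this.

```python
def transfer_time_seconds(size_bytes, bandwidth_bps):
    """Ideal transfer time = data size in bits / bandwidth in bits per second."""
    return (size_bytes * 8) / bandwidth_bps

# Illustrative: a 10-megabyte file over a 1 Mbps link.
size = 10 * 1024 * 1024      # bytes
bandwidth = 1_000_000        # 1 Mbps
print(f"Ideal transfer time: {transfer_time_seconds(size, bandwidth):.1f} s")  # about 83.9 s
```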
8. Scalability: Scalability is the ability of a system, network, or process to handle a growing
amount of work in a capable manner, or its ability to be enlarged to accommodate that
growth. For example, it can refer to the ability of a system to increase total
throughput under an increased load when resources (typically hardware) are added. A
system whose performance improves after adding hardware, proportionally to the
capacity added, is said to be a scalable system. A good example of a system that must be
scalable is a search engine.
9. Performance per watt: This is a measure of the energy efficiency of a particular computer
architecture or piece of computer hardware. It measures the rate of computation that can be
delivered by a computer for every watt of power consumed. For example, the early
UNIVAC I computer performed 1,905 operations per second while consuming 125 kW.
Most of the power a computer uses is converted into heat, so a system that takes fewer
watts to do a job will require less cooling to maintain a given operating temperature.
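Using the UNIVAC I figures quoted above, performance per watt is a one-line calculation; the helper function below is purely illustrative.

```python
def performance_per_watt(ops_per_second, power_watts):
    """Rate of computation delivered per watt of power consumed."""
    return ops_per_second / power_watts

# UNIVAC I figures quoted above: 1,905 operations per second at 125 kW.
ppw = performance_per_watt(1905, 125_000)
print(f"UNIVAC I: {ppw:.5f} operations per second per watt")   # about 0.01524
```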
10. Instruction path length: This is the number of machine code instructions required to
execute a section of a computer program. The total path length for the entire program
could be deemed a measure of the algorithm’s performance on a particular computer
hardware. The path length of a simple conditional instruction would normally be
considered as equal to two, one instruction to perform the comparison and another to take
a branch if the particular condition is satisfied.
11. Completion time or Task Completion Time: This is a metric that measures the amount
of time it takes for a user to complete a specific task or set of tasks on a website or
application. It is commonly used to measure the efficiency and usability of a website or
application, as well as user engagement and satisfaction. A shorter time-on-task generally
indicates that a website or application is easy to use and navigate, while a longer time-on-task
may indicate that users are struggling to complete certain tasks and that the site or application
may need to be redesigned or improved.
The formula for calculating Task Completion Time:
Task completion time = (Total time spent on task / Number of successful task
completions)
Where:
“Total time spent on task” is the total amount of time users spend on the task or
set of tasks.
“Number of successful task completions” is the number of times users
successfully completed the task or set of tasks.
For example, if users spent a total of 1000 minutes on a task and successfully completed
it 80 times, the average task completion time would be 12.5 minutes.
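The task completion time formula and the worked example translate directly into a short Python function; the function name is illustrative.

```python
def average_task_completion_time(total_time_on_task, successful_completions):
    """Task completion time = total time spent on task / number of successful completions."""
    if successful_completions == 0:
        raise ValueError("at least one successful completion is required")
    return total_time_on_task / successful_completions

# Example from the text: 1000 minutes of task time and 80 successful completions.
print(average_task_completion_time(1000, 80), "minutes")   # -> 12.5 minutes
```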
UTILITY CLASSIFICATION OF PERFORMANCE METRICS
Depending upon the utility function of a performance metric, it can be categorized into three
classes:
• Higher is Better or HB: System users and system managers prefer higher values of such
metrics. System throughput is an example of an HB metric.
• Lower is Better or LB: System users and system managers prefer smaller values of such
metrics. Response time is an example of an LB metric.
• Nominal is Best or NB: Both high and low values are undesirable. A particular value in the
middle is considered the best. Utilization is an example of an NB characteristic. Very high
utilization is considered bad by the users since their response times are high. Very low utilization
is considered bad by system managers since the system resources are not being used. Some value
in the range of 50 to 75% may be considered best by both users and system managers.
CLASSIFICATION OF PERFORMANCE EVALUATION TECHNIQUES
State-of-the-art high performance microprocessors contain tens of millions of transistors and
operate at frequencies close to 2 GHz. These processors perform several tasks in overlap, employ
significant amounts of speculation and out-of-order execution and other microarchitectural
techniques, and are true marvels of engineering. Designing and evaluating these microprocessors
is a major challenge, especially considering the fact that one second of program execution on
these processors involves several billion instructions, and analyzing one second of execution may
involve dealing with tens of billions of pieces of information.
In general, the design of microprocessors and computer systems involves several steps:
(i) Understanding applications and workloads that the systems will be running
(ii) Innovating potential designs
(iii) Evaluating performance of the candidate designs
(iv) Selecting the best design
Performance evaluation can be classified into performance modeling and performance
measurement, as illustrated in Table 1.5. Performance measurement is possible only if the
system of interest is available for measurement and only if one has access to the parameters of
interest. Performance measurement may further be classified into on-chip hardware monitoring,
off-chip hardware monitoring, software monitoring and micro-coded instrumentation.
Performance modeling on the other hand is typically used when actual systems are not
available for measurement or if the actual systems do not have test points to measure every detail
of interest. Performance modeling may further be classified into simulation modeling and
analytical modeling. Simulation models may further be classified into numerous categories
depending on the mode/level of detail of simulation. Analytical models use probabilistic models,
queueing theory, Markov models or Petri nets.
Table 1.5: A Classification of Performance Evaluation Techniques
FEATURES OF PERFORMANCE MODELING/MEASUREMENT TECHNIQUES
1. They must be accurate
2. They must be non-invasive: The measurement process must not alter the system or
degrade the system's performance.
3. They must not be expensive: Building the performance measurement facility should not
cost a significant amount of time or money.
4. They must be easy to change or extend. Microprocessors and computer systems
constantly undergo changes and it must be easy to extend the modeling/measurement
facility to include the upgraded system.
5. They must not need the source code of applications. If tools and techniques necessitate
source code, it will not be possible to evaluate commercial applications, for which source
code is often not available.
6. They should measure all activity, including kernel and user activity. It is often easy to
build tools that measure only user activity. This was acceptable for traditional scientific
and engineering workloads; however, in database, web server, and Java workloads there
is significant operating system activity, and it is important to build tools that measure
operating system activity as well.
7. They should be capable of measuring a wide variety of applications including those that
use signals, exceptions and DLLs (Dynamically Linked Libraries).
8. They should be user-friendly: Hard-to-use tools are often under-utilized, and they also
result in more user error.
9. They should be fast: If a performance model is very slow, long-running workloads which
take hours to run may take days or weeks to run on the model. If an instrumentation tool
is slow, it can be invasive.
10. Models and tools should handle multiprocessor systems and multithreaded applications.
Dual and quad-processor systems are very common nowadays. Applications are
becoming increasingly multithreaded especially with the advent of Java, and it is
important that the tool handles these.
11. It will be desirable for a performance evaluation technique to be able to evaluate the
performance of systems that are not yet built.
Many of these requirements are often conflicting. For instance, it is difficult for a mechanism
to be fast and accurate. Consider mathematical models. They are fast; however, several
simplifying assumptions go into their creation and often they are not accurate. Similarly, it is
difficult for a tool to be non-invasive and user-friendly. Many users like graphical user
interfaces (GUIs); however, most instrumentation and simulation tools with GUIs are slow and
invasive.
Benchmarks and metrics to be used for performance evaluation have always been interesting
and controversial issues. There have been many improvements in benchmark suites since
1988. Before that, computer performance evaluation was largely done with small benchmarks
such as kernels extracted from applications (e.g., the Lawrence Livermore Loops), the Dhrystone
and Whetstone benchmarks, Linpack, Sorting, the Sieve of Eratosthenes, the 8-queens problem,
the Tower of Hanoi, etc. [1]. The Standard Performance Evaluation Cooperative (SPEC) consortium
and the Transaction Processing Performance Council (TPC), formed in 1988, have made available
several benchmark suites and benchmarking guidelines to improve the quality of benchmarking.
Several state-of-the-art benchmark suites are described in section 4.
Another important issue in performance evaluation is the choice of performance metric. For a
system level designer, execution time and throughput are two important performance metrics.
Execution time is generally the most important measure of performance. Execution time is the
product of the number of instructions, cycles per instruction (CPI) and the clock period. The
throughput of an application is often the more important metric, especially in server systems. In
servers that serve the banking industry, airline industry, or other similar businesses, what is
important is the number of transactions that can be completed in unit time. Such servers,
typically called transaction processing systems, use transactions per minute (tpm) as a
performance metric. MIPS (Millions of Instructions Per Second) and MFLOPS (Millions of
Floating-Point Operations Per Second) have been very popular measures of performance in the
past. Both are very simple and straightforward to understand and hence have been used often,
however, they do not contain all three components of program execution time and hence are
incomplete measures of performance. There are also several low level metrics of interest to
microprocessor designers, in order to help them identify performance bottlenecks and tune
their designs. Cache hit ratios, branch misprediction ratios, number of off-chip memory
accesses, etc., are examples of such measures.
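The execution-time relation mentioned above (instruction count x CPI x clock period) can be sketched as follows; the instruction count, CPI, and clock frequency used here are illustrative values, not measurements of any real processor.

```python
def execution_time(instruction_count, cpi, clock_hz):
    """Execution time = instruction count * cycles per instruction * clock period."""
    clock_period = 1.0 / clock_hz
    return instruction_count * cpi * clock_period

# Illustrative: 2 billion instructions, CPI of 1.5, 2 GHz clock.
print(f"Execution time: {execution_time(2e9, 1.5, 2e9):.2f} s")   # 1.50 s
```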
Another major problem is the issue of reporting performance with a single number. A single
number is easy to understand and easy for the trade press to use. The use of several
benchmarks also makes it necessary to find some kind of mean. Arithmetic Mean,
Geometric Mean and Harmonic Mean are three ways of finding the central tendency of a
group of numbers, however, it should be noted that each of these means should be used in
appropriate conditions depending on the nature of the numbers which need to be
averaged. Simple arithmetic mean can be used to find average execution time from a set of
execution times. Geometric mean can be used to find the central tendency of metrics that are
in the form of ratios (eg: speedup) and harmonic mean can be used to find the central tendency
of measures that are in the form of a rate (eg: throughput).
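The three means can be computed as in the sketch below; the sample values are illustrative and only serve to show which mean goes with which kind of metric.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

execution_times = [12.0, 7.5, 20.0]      # seconds          -> arithmetic mean
speedups = [1.8, 2.5, 3.2]               # ratios           -> geometric mean
throughputs = [120.0, 95.0, 150.0]       # e.g. tpm (rates) -> harmonic mean

print(f"Mean execution time: {arithmetic_mean(execution_times):.2f} s")
print(f"Mean speedup:        {geometric_mean(speedups):.2f}")
print(f"Mean throughput:     {harmonic_mean(throughputs):.2f}")
```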
PERFORMANCE MEASUREMENT TECHNIQUES
Performance measurement is used for understanding systems that are already built or prototyped.
There are two major purposes performance measurement can serve:
(i) To tune this system, or systems to be built
(ii) To tune the application, if the source code and algorithms can still be changed (e.g., open
source software or systems)
Essentially, the process of performance measurement techniques involves:
(i) Understanding the bottlenecks in the system that has been built
(ii) Understanding the applications that are running on the system and the match between
the features of the system and the characteristics of the workload
(iii) Innovating design features that will exploit the workload features
Performance measurement can be done via the following means:
1. Microprocessor on-chip performance monitoring counters
2. Off-chip hardware monitoring
3. Software monitoring
4. Micro-coded instrumentation
ON-CHIP PERFORMANCE MONITORING COUNTERS
All state-of-the-art high performance microprocessors including Intel's Pentium III and Pentium
IV, IBM's POWER 3 and POWER 4 processors, AMD's Athlon, Compaq's Alpha, and Sun's
UltraSPARC processors incorporate on-chip performance monitoring counters which can be
used to understand performance of these microprocessors while they run complex, real-world
workloads. Now, complex run time systems involving multiple software applications can be
evaluated and monitored very closely. All microprocessor vendors nowadays release information
on their performance monitoring counters, although they are not part of the architecture.
OFF-CHIP HARDWARE MEASUREMENT
Instrumentation using hardware means can also be done by attaching off-chip hardware, two
examples of which are described as follows:
SpeedTracer from AMD: AMD developed this hardware tracing platform to aid in the design
of their x86 microprocessors. When an application is being traced, the tracer interrupts the
processor on each instruction boundary. The state of the CPU is captured on each interrupt and
then transferred to a separate control machine where the trace is stored. The trace contains
virtually all valuable pieces of information for each instruction that executes on the processor.
Operating system activity can also be traced. However, tracing in this manner can be invasive,
and may slow down the processor.
Logic Analyzers: Logic analyzers have been used, for example, to analyze 3D graphics workloads
on AMD-K6-2 based systems. Detailed logic analyzer traces are limited by restrictions on trace
size and are typically used for the most important sections of the program under analysis.
SOFTWARE MONITORING
Software monitoring is often performed by utilizing architectural features such as a trap
instruction or a breakpoint instruction on an actual system, or on a prototype. The VAX
processor from Digital (now Compaq) had a T-bit that caused an exception after every
instruction. The primary advantage of software monitoring is that it is easy to do. However,
disadvantages include that the instrumentation can slow down the application. The overhead of
servicing the exception, switching to a data collection process, and performing the necessary
tracing can slow down a program by more than 1000 times. Another disadvantage is that
software monitoring systems typically only handle the user activity.
MICRO-CODED INSTRUMENTATION
Digital (now Compaq) used micro-coded instrumentation to obtain traces on some of its
architectures. This technique lies between trapping information on each instruction using
hardware interrupts (traps) and using software traps. Unlike software monitoring, it can trace all
processes, including the operating system as well as user activity. A familiar example of
performance monitoring carried out by an operating system is the performance section of the
task manager.
PERFORMANCE MODELING
Performance measurement as described in the previous section can be done only if the actual
system or a prototype exists. It is expensive to build prototypes for early-stage evaluation. Hence
one needs to resort to some kind of modeling in order to study systems yet to be built.
Performance modeling can be done using simulation models or analytical models.
SIMULATION MODEL
Simulation has become the de facto performance modeling method in the evaluation of
microprocessor architectures. There are several reasons for this. The accuracy of analytical
models in the past has been insufficient for the type of design decisions computer architects
wish to make (for instance, what kind of caches or branch predictors are needed). Hence cycle
accurate simulation has been used extensively by architects. Simulators model existing or
future machines or microprocessors. They are essentially a model of the system being
simulated, written in a high level computer language such as C or Java, and running on some
existing machine. The machine on which the simulator runs is called the host machine
and the machine being modeled is called the target machine. Such simulators can be
constructed in many ways.
Simulators can be functional simulators or timing simulators. They can be trace driven or
execution driven simulators. They can be simulators of components or of the complete
system. Functional simulators simulate the functionality of the target processor, and in essence
provide a component similar to the one being modeled. The register values of the simulated
machine are available in the equivalent registers of the simulator. In addition to the values, the
simulators also provide performance information in terms of cycles of execution, cache hit
ratios, branch prediction rates, etc. Thus the simulator is a virtual component representing the
microprocessor or subsystem being modeled plus a variety of performance information.
If performance evaluation is the only objective, one does not need to model the functionality.
For instance, a cache performance simulator does not need to actually store values in the
cache; it only needs to store information related to the address of the value being cached. That
information is sufficient to determine a future hit or miss. While it is nice to have the values
as well, a simulator that models functionality in addition to performance is bound to be
slower than a pure performance simulator. Register Transfer Language (RTL) models used
for functional verification may also be used for performance simulations; however, these
models are very slow for performance estimation with real-world workloads and hence are
not discussed further here.
Trace Driven Simulation
Trace-driven simulation consists of a simulator model whose input is modeled as a trace or
sequence of information representing the instruction sequence that would have actually
executed on the target machine. A simple trace driven cache simulator needs a trace consisting
of address values. Depending on whether the simulator is modeling a unified cache or separate
instruction and data caches, the address trace should contain the addresses of instruction and/or
data references.
Cachesim5 and Dinero IV are examples of cache simulators for memory reference traces.
Cachesim5 comes from Sun Microsystems along with their Shade package. Dinero IV [16] is
available from the University of Wisconsin, Madison. These simulators are not timing
simulators. There is no notion of simulated time or cycles, only references. They are not
functional simulators. Data and instructions do not move in and out of the caches. The primary
result of simulation is hit and miss information. The basic idea is to simulate a memory
hierarchy consisting of various caches. The various parameters of each cache can be set
separately (architecture, mapping policies, replacement policies, write policy, statistics).
During initialization, the configuration to be simulated is built up, one cache at a time,
starting with main memory as a special case. After initialization, each reference is fed to the
appropriate top-level cache by a single simple function call. Lower levels of the hierarchy are
handled automatically. One does not need to store a trace while using cachesim5, because
Shade can directly feed the trace into cachesim5.
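To make the idea of a trace-driven cache simulator concrete, here is a minimal Python sketch of a single direct-mapped cache. It is far simpler than Cachesim5 or Dinero IV (no hierarchy, no write or replacement policies to choose), but it shows the essential loop: each address in the trace is mapped to a cache line, and only hit/miss counts are recorded, with no data values stored.

```python
def simulate_direct_mapped_cache(address_trace, cache_size=32 * 1024, line_size=64):
    """Feed an address trace through a direct-mapped cache and count hits and misses."""
    num_lines = cache_size // line_size
    tags = [None] * num_lines           # one stored tag per cache line; no data values kept
    hits = misses = 0
    for addr in address_trace:
        block = addr // line_size       # memory block containing this address
        index = block % num_lines       # cache line the block maps to
        tag = block // num_lines        # high-order bits identifying the block
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag           # replace whatever was in that line
    return hits, misses

# Illustrative trace: three sequential sweeps over a small array; after the first
# sweep the working set fits in the cache, so later references mostly hit.
trace = [0x1000 + 4 * i for i in range(1024)] * 3
hits, misses = simulate_direct_mapped_cache(trace)
print(f"hits={hits} misses={misses} hit ratio={hits / (hits + misses):.2%}")
```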
Trace driven simulation is simple and easy to understand. The simulators are easy to debug.
Experiments are repeatable because the input information is not changing from run to run.
However, trace driven simulation has two major problems:
1. Traces can be prohibitively long if entire executions of some real-world applications
are considered. The storage needed by the traces may be prohibitively large. Trace size
is proportional to the dynamic instruction count of the benchmark.
2. The traces do not represent the actual instruction stream seen by processors that perform
branch prediction. Most trace generators generate traces of only completed or retired
instructions in speculative processors. Hence, they do not contain instructions from the
mispredicted path.
The first problem is typically solved using trace sampling and trace reduction techniques.
Trace sampling is a method to achieve reduced traces. However, the sampling should be
performed in such a way that the resulting trace is representative of the original trace. It may
not be sufficient to periodically sample a program execution, because the locality properties of
the resulting sequence may be widely different from those of the original sequence. Another
technique is to skip tracing for a certain interval, then collect for a fixed interval, and then skip
again. It may also be necessary to allow a warm-up period after the skip interval, to let the
caches and other such structures warm up. Several trace sampling techniques are discussed
by Crowley and Baer. The QPT trace collection system solves the trace size issue by splitting
the tracing process into a trace record generation step and a trace regeneration process. The
trace record has a size similar to the static code size, and the trace regeneration expands it
to the actual full trace upon demand.
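The skip/collect style of trace sampling mentioned above can be sketched as a simple Python generator; the interval lengths and the synthetic trace are illustrative, and this is not a reproduction of any particular published sampling scheme.

```python
def sample_trace(trace, skip=100_000, warmup=10_000, collect=10_000):
    """Yield (is_warmup, reference) pairs: skip a stretch, warm up, collect, and repeat."""
    period = skip + warmup + collect
    for i, reference in enumerate(trace):
        phase = i % period
        if phase < skip:
            continue                             # skipped region: not traced at all
        yield phase < skip + warmup, reference   # True while warming up caches/predictors

# Illustrative use with a synthetic "trace" of one million references.
full_length = 1_000_000
sampled = list(sample_trace(range(full_length)))
measured = sum(1 for is_warmup, _ in sampled if not is_warmup)
print(f"kept {len(sampled)} of {full_length} references "
      f"({measured} measured, {len(sampled) - measured} warm-up)")
```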
The second problem can be solved by reconstructing the mispredicted path. An image
of the instruction memory space of the application is created by one pass through the trace,
and thereafter fetching from this image as opposed to the trace. While 100% of the
mispredicted branch targets may not be in the recreated image, studies show that more than
95% of the targets can be located.
Execution Driven Simulation
There are two meanings in which this term is used by researchers and practitioners. Some
refer to simulators that take program executables as input as execution driven simulators.
These simulators utilize the actual input executable and not a trace. Hence the size of the
input is proportional to the static instruction count and not the dynamic instruction count.
Mispredicted branches can be accurately simulated as well. Thus these simulators solve the
two major problems faced by trace-driven simulators. The widely used Simplescalar simulator
is an example of such an execution driven simulator. With this tool set, the user can simulate
real programs on a range of modern processors and systems, using fast execution-driven
simulation. There is a fast functional simulator and a detailed, out-of-order issue processor that
supports non-blocking caches, speculative execution, and state-of-the-art branch prediction.
Some others consider execution driven simulators to be simulators that rely on actual
execution of parts of code on the host machine (hardware acceleration by the host instead of
simulation). These execution driven simulators do not simulate every individual instruction in
the application. Only the instructions that are of interest are simulated. The remaining
instructions are directly executed by the host computer. This can be done when the
instruction set of the host is the same as that of the machine being simulated. Such simulation
involves two stages. In the first stage or preprocessing, the application program is modified by
inserting calls to the simulator routines at events of interest. For instance, for a memory
system simulator, only memory access instructions need to be instrumented. For other
instructions, the only important thing is to make sure that they get performed and that their
execution time is properly accounted for. The advantage of execution driven simulation is
speed. By directly executing most instructions at the machine's execution rate, the simulator
can operate orders of magnitude faster than cycle by cycle simulators that emulate each
individual instruction. Tango, Proteus and FAST are examples of such simulators.
Complete system simulation
Many execution and trace driven simulators only simulate the processor and memory
subsystem. Neither I/O activity nor operating system activity is handled in simulators like
Simplescalar. But in many workloads, it is extremely important to consider I/O and
operating system activity. Complete system simulators are complete simulation
environments that model hardware components with enough detail to boot and run a full-
blown commercial operating system. The functionality of the processors, memory
subsystem, disks, buses, SCSI/IDE/FC controllers, network controllers, graphics
controllers, CD-ROM, serial devices, timers, etc., is modeled accurately in order to achieve
this. While functionality stays the same, different microarchitectures in the processing
component can lead to different performance. Most of the complete system simulators use
microarchitectural models that can be plugged in and out. For instance, SimOS, a popular
complete system simulator provides a simple pipelined processor model and an aggressive
superscalar processor model. SimOS and SIMICS can simulate uniprocessor and
multiprocessor systems. Table 4 lists popular complete system simulators.
Stochastic Discrete Event Driven Simulation
It is possible to simulate systems in such a way that the input is derived stochastically rather
than as a trace/executable from an actual execution. For instance, one can construct a
memory system simulator in which the inputs are assumed to arrive according to a Gaussian
distribution. Such models can be written in general purpose languages such as C, or using
special simulation languages such as SIMSCRIPT. Languages such as SIMSCRIPT have
several built-in primitives to allow quick simulation of most kinds of common systems.
There are built-in input profiles, resource templates, process templates, queue structures, etc.
to facilitate easy simulation of common systems. An example of the use of event-driven
simulators using SIMSCRIPT may be seen in the performance evaluation of multiple-bus
multiprocessor systems in Kurian et al.
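As a small illustration of stochastic, event-driven simulation in a general purpose language (Python here rather than SIMSCRIPT), the sketch below models a single-server request queue whose inter-arrival and service times are drawn from random distributions; all parameters are illustrative.

```python
import random

def simulate_queue(num_requests=10_000, mean_interarrival=10.0, mean_service=7.0, seed=1):
    """Single-server FIFO queue with randomly drawn inter-arrival and service times."""
    rng = random.Random(seed)
    arrival = 0.0            # arrival time of the current request
    server_free_at = 0.0     # time the server finishes its current request
    total_wait = 0.0
    for _ in range(num_requests):
        arrival += rng.expovariate(1.0 / mean_interarrival)   # next arrival time
        start = max(arrival, server_free_at)                  # wait if the server is busy
        total_wait += start - arrival
        server_free_at = start + rng.expovariate(1.0 / mean_service)
    return total_wait / num_requests

print(f"Average waiting time per request: {simulate_queue():.2f} time units")
```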
Program Profilers
There is a class of tools called software profiling tools, which are similar to simulators and
performance measurement tools. These tools are used to generate traces, to obtain instruction
mix, and a variety of instruction statistics. They can be thought of as software monitoring on
a simulator. They input an executable and decode and analyze each instruction in the
executable. These program profilers can be used as the front end of simulators. A popular
program profiling tool is Shade for the UltraSparc.
Shade
SHADE is a fast instruction-set simulator for execution profiling. It is a simulation and tracing
tool that provides features of simulators and tracers in one tool. Shade analyzes the original
program instructions and cross-compiles them to sequences of instructions that simulate or
trace the original code. Static cross-compilation can produce fast code, but purely static
translators cannot simulate and trace all details of dynamically linked code. One can develop a
variety of 'analyzers' to process the information generated by Shade and create the
performance metrics of interest. For instance, one can use Shade to generate address traces to
feed into a cache analyzer to compute hit-rates and miss rates of cache configurations. The
shade analyzer cachesim5 does exactly this.
Jaba
Jaba is a Java Bytecode Analyzer developed at the University of Texas for tracing Java
programs. While Java programs can be traced using shade to obtain profiles of native
execution, Jaba can yield profiles at the bytecode level. It uses JVM specification 1.1. It
allows the user to gather information about the dynamic execution of a Java application at the
Java bytecode level. It provides information on bytecodes executed, load operations,
branches executed, branch outcomes, etc.
A variety of profiling tools exist for different platforms. In addition to describing the working
of Shade, Cmelik et al. also compare Shade to several other profiling tools for other
platforms. A popular one for the x86 platform is Etch. Conte and Gimarc is a good source of
information for those interested in creating profiling tools.
ANALYTICAL MODEL
Analytical performance models, while not popular for microprocessors, are suitable for the
evaluation of large computer systems. In large systems whose details cannot be modeled
accurately for cycle-accurate simulation, analytical modeling is an appropriate way to obtain
approximate performance metrics. Computer systems can generally be considered as a set of
hardware and software resources and a set of tasks or jobs competing to use the resources.
Multicomputer systems and multiprogrammed systems are examples. Analytical models rely on
probabilistic methods, queuing theory, Markov models, or Petri nets to create a model of the
computer system. Analytical models are cost-effective because they are based on efficient
solutions to mathematical equations. Analytical models do not capture all the detail typically
built into simulation models.