Memory Forwarding: Enabling Aggressive Layout Optimizations by Guaranteeing The Safety of Data Relocation
Despite the high performance potential of many relocation-based optimizations, the key stumbling block which often prevents them from being used in practice is the first step—i.e. guaranteeing correctness. To safely move data, we must guarantee that any future references to the object will find it at its new location. The fundamental problem is that updating the precise set of pointers¹ to a given object requires perfect aliasing information related to that object. In general, computing such precise information is beyond the capabilities of the compiler,² and is even quite difficult for the programmer for large programs. In the face of uncertainty, we must conservatively assume that relocating an object will break the program—no matter how unlikely this may seem in reality—and therefore the optimization cannot be performed.

¹ We use the term "pointers" loosely to refer to any mechanism for generating an address pointing to the object in question.
² This is especially true for heap-allocated objects in languages like C.
There is one mechanism in modern systems which provides a very limited form of safe data relocation: the virtual memory system. The operating system can relocate an entire page of memory in the physical address space without breaking the program by simply copying the page and updating its virtual-to-physical mapping. One cache optimization which exploits this flexibility is page coloring [22], whereby the operating system attempts to avoid mapping conflicts in large off-chip caches. Therefore, by adding a layer of indirection within the memory system, we can move data safely and transparently without any special language or compiler support. Unfortunately, the virtual memory system only provides this flexibility at the granularity of an entire page. To actively create spatial locality within a cache line, we must have this flexibility at a word granularity. However, applying standard virtual memory techniques at such a fine granularity—i.e. setting the page size to be one word—is not a viable solution, due to the enormous overheads that this would involve. (Not only would the number of page table entries and the TLB size grow enormously, but also the cache tags would have to be maintained at a word granularity.) Instead, we propose a completely different solution.
1.1. Our Solution: Memory Forwarding

To give software the flexibility to apply relocation-based data layout optimizations at any time without concern over violating program correctness, we propose a mechanism called memory forwarding which guarantees the safety of relocation at a word granularity. (In our discussion, we define the "word" size to be equal to the size of a pointer.) The basic idea behind memory forwarding is that when we relocate an object, we store its new address in its old location and mark the old location so that hardware recognizes it as a forwarding address. Therefore if the program accidentally accesses the old address, the hardware will automatically forward the reference to the object's new location, thereby always guaranteeing the correct result. Moreover, our scheme only pays the run-time overhead of an extra indirection when it is actually necessary—i.e. when the alternative is to violate program semantics. In the far more common cases of references to non-relocated objects, or references that have been properly updated to point to the new addresses of relocated objects, our scheme imposes no performance overhead. The space overhead of our scheme is also low (a 1.5% fixed memory cost on a 64-bit architecture).

With memory forwarding support, the decision of whether to apply a relocation-based optimization reduces solely to evaluating its cost/benefit performance tradeoffs. In effect, memory forwarding enables software to optimistically speculate that when it relocates an object, it has successfully updated all pointers to that object to point to its new location. If the speculation fails, then there is a recovery cost (i.e. dereferencing the forwarding address), but the execution still proceeds correctly. Therefore, as in all forms of speculation, one is gambling that the speculation is correct often enough that the benefit outweighs the cost. Another feature of our mechanism which helps improve these odds is that software can optionally specify that dereferencing a forwarding address will invoke a user-level trap that enables software to update the offending pointer to point to the object's new address. Hence software can learn from its mistakes to avoid repeating them.
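To make this mechanism concrete, the following pseudo-C sketch (our illustration, not part of the proposal) shows what the hardware conceptually does on every load once forwarding bits are in place; fbit() and mem() are hypothetical helpers denoting the forwarding bit and the contents of the word containing a given address.

    #include <stdint.h>

    typedef uint64_t word;

    /* Hypothetical helpers standing in for hardware state: the forwarding
       bit and the contents of the word at a word-aligned address. */
    extern int  fbit(uint64_t word_addr);
    extern word mem(uint64_t word_addr);

    /* Conceptual behavior of a load from byte address a (8-byte words). */
    word load(uint64_t a) {
        uint64_t base = a & ~7ULL;      /* word-aligned portion */
        uint64_t off  = a &  7ULL;      /* byte offset within the word */
        while (fbit(base)) {            /* word holds a forwarding address */
            a    = mem(base) + off;     /* redirect the access, preserving */
            base = a & ~7ULL;           /*   the byte offset               */
            off  = a &  7ULL;
        }
        return mem(base);               /* normal (non-forwarded) access */
    }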
1.2. Related Work

It is interesting to note that the work which is most closely related to our study occurred well over a decade ago in the context of architectures that directly supported the Lisp programming environment [29, 32]. Performance concerns were quite different back then: main memory was relatively small and expensive, and cache miss latencies were less problematic because the gap between microprocessor and memory speeds was dramatically smaller. (In fact, a number of microprocessor-based systems did not even have caches.) Therefore the primary concern in optimizing memory performance back then was minimizing the overall space requirements of a program, so that it would fit into main memory and avoid paging to disk. Two aspects of the Lisp environment made this challenging: the need to perform automatic garbage collection, and the relative space inefficiency of the ubiquitous list structures in the language. In addition, another aspect of the Lisp language which resulted in specialized hardware support was the need to determine object data types at run-time.

Although the performance goals which inspired specialized hardware and software support in these Lisp machines are quite different from our goal of improving cache performance, there are nonetheless a number of interesting overlaps between our support and some features of these earlier machines. We now discuss the connections between this previous work and our study from three different perspectives: tagged memory, garbage collection, and data layout optimizations.

Tagged Memory: To make objects self-descriptive with respect to their types, a number of Lisp architectures [29, 32] associated a tag with each memory location. As we will see later in Section 2.1, our memory forwarding scheme also requires a form of tagged memory to distinguish forwarding addresses from normal data. A key difference, however, is that the tags in Lisp machines provided much more functionality than in our case, and therefore they required more overhead. For example, the SPUR architecture [32] added eight bits of tag to each 32 bits of memory (a 25% overhead), whereas our scheme only requires one tag bit per 64 bits of memory (a 1.5% overhead) in a modern 64-bit architecture.

A fact that is even more relevant to our study is that a form of memory forwarding (using tagged memory) has appeared in previous Lisp machines, albeit for a very different purpose. The concept of an invisible pointer (which is similar to our forwarding address) was proposed twenty-four years ago by Greenblatt [15], and the Symbolics 3600 [29] used one of its tags to implement a forwarding pointer. The motivation behind these mechanisms was threefold: to enable the insertion of an item into a cdr-coded list [2], to facilitate incremental garbage collection, and to implement overlapping arrays. In contrast, our focus is on improving the cache performance of programs written in C, and therefore none of these issues apply. In essence, what we are doing is taking a very old mechanism and adapting it to a completely new purpose within the context of modern out-of-order superscalar processors.
Garbage Collection: A common feature among these Lisp machines is that they support some form of automatic garbage collection. Garbage collection algorithms involve phases where they identify two classes of data items: those that can be reclaimed, and those that can be relocated. A data item can be reclaimed when it can no longer be accessed through any pointers that are still active, and a data item can be relocated if all pointers to the old location can be updated to point to the new location. In both cases, the key challenge is identifying all pointers which point to the given location. In languages such as Lisp, ML, and Java, where the use of pointers is either restricted or disallowed altogether, one can solve this problem in practice. In contrast, in languages such as C and C++ which do not restrict pointer usage, one generally cannot determine which pointers point to a given object, and therefore automatic garbage collection (and data relocation) is extremely difficult. Finally, it is interesting to note that a form of memory forwarding is used in copying garbage collectors [10, 28], whereby the forwarding addresses are used to preserve data consistency during the distinct phases when collection takes place.
Data Layout Optimizations: An important topic in Lisp research is how to represent list structures compactly. List compaction can be performed either separately or during garbage collection. Most of the list compaction techniques designed for Lisp [1, 2, 16] involve either moving or copying the original list to a new, denser set of locations. As we discussed above, data relocation in Lisp does not pose the safety problems that we encounter in C. However, our memory forwarding support gives us the flexibility to exploit some of these same list compaction techniques—e.g., a technique called list linearization [13]—for the sake of improving spatial locality in C programs.
word. Second, a chunk of data that is relocated must be word-
1.3. Objectives of This Study aligned, so that the alignment of the forwarding address is prede-
This paper makes the following contributions. First, we pro- termined. Note that this still allows us to perform byte-sized loads
pose a solution to the problem of safely relocating data at a fine and stores to forwarded objects—the byte offset into the new lo-
granularity to improve the cache performance of programs written cation is simply assumed to be the same as it was at its original
in languages such as C which do not support garbage collection. address. We consider these first two restrictions to be quite mi-
Although the concept of memory forwarding was proposed over nor, especially given that our only option for safe relocation today
two decades ago in the context of Lisp machines, to the best of our is page-sized, page-aligned chunks of data. Finally, to enable the
knowledge, we are the first to propose that it be adapted to facil- hardware to distinguish forwarding addresses from regular data,
itate a broad class of data layout optimizations to improve cache we attach a one-bit tag (called a forwarding bit) to each word in
performance. Second, we discuss how memory forwarding can be memory. For a 64-bit architecture, this results in a space overhead
implemented within modern out-of-order superscalar processors of only 1.5%, and therefore is reasonably efficient.
(which are quite different from the processors in which other forms Figure 1 shows a simple example of how the memory contents
of forwarding have been implemented in the past). Third, we sug- and forwarding bits are modified upon data relocation. Assume
gest a number of optimizations which can benefit from memory 3 One could imagine creating a more elaborate scheme for compressing
forwarding. Finally, we quantitatively evaluate the benefits and the size of forwarding address pointers (e.g., by restricting the distance
overheads of our scheme by using it to apply a number of differ- between the old and new address to something that fit within its former
ent run-time locality optimizations to a collection of non-numeric size), but this would involve additional complexity and fancier tag storage,
applications running on a modern superscalar processor. so we do not consider such an approach further.
Assume that we have a 64-bit architecture, and that we would like to relocate five 32-bit elements from addresses 0800-0816 to addresses 5800-5816 (these addresses are in decimal notation). Figure 1(a) shows the memory contents and forwarding bits before relocation (note that none of the forwarding bits have been set). To relocate a word, we first copy it to its new location, and then we simultaneously write its new address into its old location and set the corresponding forwarding bit. Figure 1(b) shows the state of memory after the relocation. Notice that to relocate the 32-bit subword at address 0816, we must also relocate the 32-bit subword at address 0824 (which contains the value 5) along with it. After the relocation, a 32-bit load of the subword at address 0804 will be forwarded to address 5804—which is computed by adding the forwarding address (5800) to the byte offset within the word (4)—thereby returning the correct value of -47.

To simplify our discussion throughout the remainder of the paper, we now define two terms which we will use frequently:
Initial address: The address of the first location accessed by a memory reference. For example, in Figure 1(b), the initial address of a write to word 0816 is 0816 itself.

Final address: The address of the last location accessed by a memory reference. For example, in Figure 1(b), the final address of a write to word 0816 is 5816. When data is not forwarded, the final address equals the initial address.
In addition to preserving the correctness of pointer dereferences, another concern with data relocation is preserving the correctness of pointer comparisons. In the presence of memory forwarding, it is possible that two pointers with distinct initial addresses may in fact point to the same object (i.e. share the same final address). Hence to preserve correctness, explicit pointer comparisons⁴ should be performed with respect to their final addresses. Although our memory forwarding hardware does not perform this check automatically, the compiler can easily insert additional instructions (described later in Section 3) to look up the final addresses for these comparisons. We implemented such a compiler pass, and the resulting software overhead is included in our performance results. As we will see later in Section 5, this overhead does not present a problem.

⁴ In reality, we only need to worry about cases where the pointers potentially point to the same type of relocatable object.
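As an illustration of this transformation (our sketch; the paper does not give the exact inserted instruction sequence), a comparison p == q can be rewritten to compare final addresses using the Read_FBit and Unforwarded_Read primitives that Section 3 will introduce, treated here as C functions:

    typedef unsigned long word;
    extern int  Read_FBit(word* a);        /* Figure 3 primitive */
    extern word Unforwarded_Read(word* a); /* Figure 3 primitive */

    /* Hypothetical helper: chase the forwarding chain from p to the end. */
    word* FinalAddress(word* p) {
        while (Read_FBit(p))                 /* forwarding bit set? */
            p = (word*)Unforwarded_Read(p);  /* follow the forwarding address */
        return p;
    }

    /* The compiler rewrites   if (p == q) ...
       into                    if (FinalAddress(p) == FinalAddress(q)) ... */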
2.2. Applications of Memory Forwarding

While the act of dereferencing a forwarding address clearly does not improve performance on its own, the advantage of memory forwarding support is that it enables a wide range of data layout optimizations which can enhance cache performance. Not only are these optimizations useful for mitigating the impact of memory latency, they can also be used to conserve memory bandwidth. We now briefly describe some of these potential optimizations.
Improving Spatial Locality: A straightforward method of actively improving spatial locality is to take data items which are accessed close together in time, but which are scattered sparsely throughout the address space, and pack them into adjacent memory locations. This form of data packing makes cache lines much more effective, and it can potentially reduce the number of capacity, compulsory, and conflict misses. Not only does this improve performance, it can also reduce memory bandwidth consumption, which in turn can help reduce power consumption (which is becoming an increasingly important concern).

A good example of a technique which uses packing to enhance spatial locality is list linearization. As we will see later in Section 5, this technique can offer dramatic performance improvements. List linearization has been an important technique for compacting lists in Lisp programs, and can eliminate as much as half of the space consumption [13]. The idea behind list linearization is to relocate the nodes of a linked list so that they reside in contiguous memory locations. Depending on whether the list structure continues to change over time, the linearization process can be invoked either just once, or else periodically to adapt to the changing structure. Although list linearization can potentially offer large performance gains, it is very difficult to safely use this optimization in practice for general C programs due to the possibility of pointers outside of the linked list itself pointing to list elements. Fortunately, with memory forwarding support, we can apply list linearization at any time without worrying about whether all potential pointers to list elements have been properly updated.

Figure 2 shows an example of list linearization with memory forwarding. Before linearization, the four nodes of the list (i.e. nodes A, B, C, and D) are scattered throughout memory such that they reside in four separate cache lines, as shown in Figure 2(a). List linearization packs the four nodes into a contiguous memory region starting at location 8000, as shown in Figure 2(b). As a result, the four relocated nodes occupy only two cache lines, rather than four, thereby potentially eliminating half of the cache misses due to this list as we continue to revisit it. Note that the forwarding addresses and forwarding bits have been set properly such that we will still maintain correct execution even if a stray pointer accesses a list element at its old address. However, we expect that most accesses to the list will find it directly at its new address, thereby enjoying the enhanced spatial locality.

Figure 2. Example of list linearization with memory forwarding, assuming that cache lines and list elements are 32 and 16 bytes long, respectively. (Addresses are in decimal.)
Increasing Prefetching Effectiveness: The effectiveness of prefetching for non-numeric applications is largely limited by the difficulties in generating prefetch addresses early enough [25]. For example, consider the linked list in Figure 2(a), and assume that we need to prefetch three nodes ahead to hide the entire miss latency. Therefore, we would want to prefetch node D as soon as we arrive at node A. However, the problem is that we do not know the address of node D until we have dereferenced nodes A, B, and C—this is known as the pointer-chasing problem [25]. In contrast, after the list is linearized, we can trivially prefetch node D at node A by simply prefetching the next cache line, thus avoiding any pointer chasing. (We referred to this technique as data linearization prefetching in an earlier publication [25].)
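A sketch of how such a traversal might look in source code (ours; prefetch() stands for whatever non-binding prefetch instruction the machine provides, and the node type is illustrative):

    typedef struct node { struct node* next; /* ... payload ... */ } node;

    extern void prefetch(const void* a);  /* hypothetical prefetch intrinsic */
    extern void process(node* n);         /* application-specific work */

    void visit_all(node* n) {
        while (n != NULL) {
            /* After linearization the nodes are contiguous, so the node
               three links ahead is plain pointer arithmetic rather than
               three dependent loads. */
            prefetch(n + 3);
            process(n);
            n = n->next;   /* still follows the real links */
        }
    }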
Reducing Cache Conflicts: Data copying [23] was originally proposed to reduce conflict misses within tiled (or "blocked") numeric applications. Since a given tile is reused many times after it is brought into the cache, it is particularly problematic if different elements within the tile conflict with each other. To avoid this problem, the data copying optimization first copies a tile to a contiguous set of addresses in a temporary array before using it; since these locations do not conflict with one another, the problem is eliminated. Another technique called data coloring [11] was proposed as a method of reducing conflict misses in pointer-based data structures. The idea is to partition the cache into logically separate regions (or colors). By relocating data structure elements which are accessed close together in time to separate regions of the cache, conflict misses can be avoided. Memory forwarding can help facilitate both copying and coloring techniques by guaranteeing that they are safe.
Reducing False Sharing: In cache-coherent shared-memory multiprocessing systems, false sharing [21, 34] occurs when two or more processors access distinct data items which happen to fall within the same cache line (which is the unit of coherence), and at least one access is a write. False sharing can hurt performance dramatically as the line ping-pongs between processors despite the fact that no real communication is taking place. By relocating those unrelated data items to distinct cache lines, false sharing can be avoided. Memory forwarding would be especially helpful in avoiding false sharing in irregular shared-memory applications, where proving that data items can be safely relocated is difficult.
In summary, memory forwarding enables a broad range of relocation-based optimizations; we have presented just a partial list of such optimizations. We would also like to emphasize that these optimizations are applicable not only to caches but also to the other levels of the memory hierarchy. For example, we can apply data relocation to improve the spatial locality within pages (and hence on disk) for out-of-core applications.
2.3. Performance Issues

A relocation-based optimization will improve overall performance if two conditions hold: (i) the new data layout actually provides better memory performance than the original layout; and (ii) the gain in memory performance outweighs the optimization overhead. This overhead includes the extra execution time due to actually relocating the data, and may also include forwarding overhead if any references actually need to be forwarded after the relocation. While the overhead of relocating the data may seem to be a concern at first glance, our experimental results indicate that it is usually not a problem because relocation is invoked infrequently and modern processors can execute multiple instructions per cycle. In addition, we find that the performance overhead of forwarding is negligible in many cases because most data references are updated properly and do not need to be forwarded. We observe that the real performance concern is ensuring that the reorganized data layout actually delivers higher memory performance than the original layout.

3. Implementation Issues

We now discuss the support that we need from the instruction set, the hardware, and the software to implement memory forwarding in modern superscalar processors.

3.1. Extensions to the Instruction Set Architecture

To exploit memory forwarding, the machine must have some way to manipulate the forwarding information—i.e. the forwarding addresses and the forwarding bits. Rather than taking a purely hardware-based approach, we propose to extend the underlying instruction set architecture (ISA) by adding a few instructions which will allow software to manipulate the forwarding information directly. The advantages of this approach are its programmability and flexibility. In addition, we expect the software overhead to be low since forwarding information changes relatively infrequently.

R = Read_FBit(word* A): Read the forwarding bit of the word at address A into register R.

R = Unforwarded_Read(word* A): Read the value stored in the word at address A into register R, with forwarding disabled. If the word's forwarding bit is set, this is a forwarding address; otherwise this is a regular data value.

*A = Unforwarded_Write(register word R, bit B): Write the value of register R into the word at address A and set the word's forwarding bit to B atomically, with forwarding disabled. (If an instruction of the underlying ISA cannot have three operands, we can have two separate instructions for Unforwarded_Write(R,0) and Unforwarded_Write(R,1).)

Figure 3. Proposed instruction set extensions to support memory forwarding. (C syntax is used to improve readability.)

Figure 3 shows our proposed ISA extensions, which consist of three new instructions. Read_FBit allows software to check whether a given location contains a forwarding address or actual data. Unforwarded_Read and Unforwarded_Write allow software to manipulate memory with the forwarding mechanism disabled. For example, in Figure 1(b), a normal Read (i.e. with the forwarding mechanism enabled) of the word at address 0808 will get the forwarded value of 0, but an Unforwarded_Read of the same word will get 5808, which is the forwarding address. An Unforwarded_Write must change the word and its forwarding bit atomically in order to preserve data consistency.

To demonstrate how software can make use of these new instructions, Figure 4(a) shows two procedures for relocating a data object of size n_words from src to tgt, and then storing tgt as the forwarding address into src. Procedure Relocate() loops until a clear forwarding bit is read so that tgt will be appended at the end of the forwarding chain (if any). Figure 4(b) shows a procedure called ListLinearize() (which we will use frequently later in our experiments) which calls Relocate() to perform list linearization. The parameter head_handle is the address of the list head. Note that the address of the list head (rather than its value) is passed into ListLinearize() because we want to modify the list head to point to the new locations after the relocation is performed. (This effect was illustrated earlier in Figure 2(b), where the value of head is changed to 8000 after the linearization.) By doing so, the next time that the list is accessed via the list head, the new locations will be accessed directly without touching the old locations. Finally, note that in Figure 4(b), the new locations for the relocated nodes are allocated from a pool of contiguous memory, thereby creating spatial locality.
(a) Data Relocation

    // src = address of the object before relocation
    // tgt = address of the object after relocation
    void Relocate(word* src, word* tgt, int n_words) {
        bit relocated;
        // follow the forwarding chain (if any) so that tgt is
        // appended at its end
        relocated = Read_FBit(src);
        while (relocated) {
            src = (word*)Unforwarded_Read(src);
            relocated = Read_FBit(src);
        }
        actualRelocate(src, tgt, n_words);
    }

    void actualRelocate(word* src, word* tgt, int n_words) {
        // relocate each word in the object
        for (; n_words > 0; --n_words) {
            word temp;
            // save the content of src
            temp = Unforwarded_Read(src);
            // set up the forwarding address and forwarding bit
            *src = Unforwarded_Write(tgt, 1);
            // copy the original content of src to tgt
            *tgt = Unforwarded_Write(temp, 0);
            // prepare for the next word
            src++; tgt++;
        }
    }

(b) List Linearization

    // a pool of space for data relocation
    extern char* memory_pool;

    // head_handle = address of the list head pointer
    void ListLinearize(node** head_handle) {
        node **handle, *tgt;
        // start from the list head
        handle = head_handle;
        while (*handle) {
            // grab space from the pool
            tgt = (node*)memory_pool;
            // increment the pool pointer
            memory_pool += sizeof(node);
            // relocate the node pointed-to by handle to the address in tgt
            Relocate(*handle, tgt, sizeof(node)/sizeof(word));
            // append the relocated node to the linearized list
            *handle = tgt;
            // prepare for the next node
            handle = &(tgt->next);
        }
    }

Figure 4. Procedures using the proposed ISA extensions to implement (a) data relocation and (b) list linearization.
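As a usage sketch (ours), under the declarations of Figure 4, linearizing a frequently-traversed list only requires passing the address of its head pointer:

    extern node* work_list;   // head of some frequently-traversed list

    /* Relocate all nodes into the contiguous pool and update the head;
       stray pointers into the old nodes remain safe via forwarding. */
    ListLinearize(&work_list);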
3.2. Hardware Support

We now discuss the hardware modifications necessary to support memory forwarding. The key insight which helps us keep the hardware simple is that references which actually require forwarding are expected to occur rarely (if ever). The forwarding mechanism is simply a safety net which allows us to continue to preserve program correctness in case the unexpected happens. In other words, we can treat forwarding as an exception. We will design the hardware to be fast in the common case—i.e. a normal, non-forwarded reference—and we are less concerned about the performance penalty when forwarding is actually invoked, since we expect it to happen rarely; similar forwarding hardware has been implemented in the past [29, 32]. One difference with our scheme (as discussed earlier) is that we require less tag storage overhead than previous schemes; otherwise, it is quite similar. We now discuss the more novel features of our hardware support in greater detail.

Dereferencing Forwarding Addresses: In the presence of memory forwarding, the data referencing mechanism must be able to follow forwarding chains of arbitrary lengths. More specifically, when a memory word is accessed by a data reference, its forwarding bit is tested. If this bit is set, then the original data address will be replaced by the contents of the word just accessed (which contains a forwarding address), and a new memory access using that address will be issued. This process repeats until a clear forwarding bit is read (we will discuss how cycles might be handled later in Section 3.2), at which point the data reference can proceed as usual. One option is to implement this dereferencing mechanism purely within hardware; another is to implement it using a software-based exception handler (where the exception is triggered by accessing a word with its forwarding bit set). With the ISA extensions that we propose, it would be straightforward for a software handler to chase the forwarding pointer chain. Although the forwarding bit cannot be tested until the memory location is brought into the primary cache, this is no different from the delays associated with checking ECC or parity bits.
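For example, a software dereferencing handler might look like the following sketch (ours; the trap interface is hypothetical), assuming 8-byte words and the Figure 3 primitives treated as C functions:

    #include <stdint.h>

    typedef uint64_t word;
    extern int  Read_FBit(word* a);        /* Figure 3 primitive */
    extern word Unforwarded_Read(word* a); /* Figure 3 primitive */

    /* Invoked when a reference touches a word whose forwarding bit is set;
       returns the final address at which the access should be retried. */
    uint64_t forwarding_handler(uint64_t fault_addr) {
        uint64_t base = fault_addr & ~7ULL;   /* word-aligned portion */
        uint64_t off  = fault_addr &  7ULL;   /* byte offset within the word */
        while (Read_FBit((word*)base))
            base = Unforwarded_Read((word*)base);  /* forwarding address */
        return base + off;    /* final (non-forwarded) address */
    }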
Data Dependence Speculation: One consequence of memory forwarding is that we do not know the final data address of a given reference until the reference is nearly completed. This delayed generation of the final address poses a potential problem in out-of-order superscalar machines. These machines normally allow a load access to proceed before an earlier store, provided that the load and store are to different addresses. If either address is unknown, the conservative approach is to delay the load until both addresses are resolved. With memory forwarding, since the final address of a store is not known until the store actually completes, this delay would cause the conservative approach to never execute a load ahead of an earlier store.

Fortunately, there is a solution to this problem. A technique called data dependence speculation [12, 30] allows a load to speculatively execute before an earlier store, even if the store address is unknown. If it turns out that the load was not dependent on the store, then the speculation succeeds; otherwise, a true dependence has been violated, and the effects of the incorrect speculation must be undone. Recent out-of-order superscalar processors [19, 20, 24] have already implemented some form of data dependence speculation. With support for data dependence speculation, we can speculate that the final address of a reference will be the same as its initial address (i.e. we do not expect the reference to be forwarded), and therefore the delayed final-address generation will not degrade performance in the common case where the reference is not forwarded after all. If forwarding does occur, then our speculation would only be incorrect in the case where the load and store had different initial addresses but the same final address. In our experiments, we observed that incorrect data dependence speculation almost never occurred; hence it appears to be a very effective solution to supporting memory forwarding.
Handling Forwarding Cycles: A forwarding cycle is created when software erroneously inserts an address more than once into a forwarding chain. The hardware must have some mechanism for detecting and breaking forwarding cycles; otherwise, the machine could be stalled forever chasing the forwarding chain. Detecting forwarding cycles accurately is an expensive operation; for each hop, the hardware would have to match the current forwarding address against all previous forwarding addresses dereferenced by the same data reference. Because of this high cost—and also because we expect forwarding cycles to be extremely rare—we would prefer that the hardware instead perform a fast but possibly inaccurate check for a cycle during normal execution, and only perform accurate cycle detection when it is necessary. One possibility is to predetermine a limit on the number of forwarding hops that are allowed for a given data reference. We simply maintain a counter (which could be implemented either in hardware or in software) to keep track of the number of forwarding hops performed so far, and when this count exceeds the limit, we raise an exception. The corresponding software exception handler will then perform an accurate cycle check. If it is a false alarm, then we will reset the counter and resume execution; otherwise, the execution will be aborted.
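A sketch of the accurate check that such an exception handler might run (ours; the paper does not specify the algorithm—here we use the classic two-pointer cycle test rather than remembering every visited address):

    typedef unsigned long word;
    extern int  Read_FBit(word* a);        /* Figure 3 primitive */
    extern word Unforwarded_Read(word* a); /* Figure 3 primitive */

    /* Floyd's cycle test on a forwarding chain: advance "fast" two hops
       and "slow" one hop per iteration; a genuine cycle makes them meet. */
    int forwarding_chain_has_cycle(word* start) {
        word *slow = start, *fast = start;
        while (Read_FBit(fast)) {
            fast = (word*)Unforwarded_Read(fast);   /* hop 1 */
            if (!Read_FBit(fast)) return 0;         /* chain ends: no cycle */
            fast = (word*)Unforwarded_Read(fast);   /* hop 2 */
            slow = (word*)Unforwarded_Read(slow);
            if (slow == fast) return 1;             /* cycle: abort execution */
        }
        return 0;   /* false alarm: reset the hop counter and resume */
    }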
Providing User-Level Traps Upon Forwarding: In addition to the system-level exception handlers which might be provided to support the dereferencing of forwarding addresses and the detection of forwarding cycles, it may also be useful to provide a lightweight user-level trapping mechanism that would be invoked upon accessing a forwarded location. Such a mechanism would be useful for allowing the application to tune its own performance in the following two ways. First, one could write a profiling tool to gather forwarding-related statistics for the purpose of improving the performance of a future execution of the program. For example, one might record which instructions experienced forwarding for the sake of eliminating that forwarding in future runs of the program. Second, a user-level trap handler could be used to optimize away forwarding (and thereby improve performance) on-the-fly. For example, one could write a tool that updates stray pointers on-the-fly to point directly to their correct final addresses, thereby avoiding the need to invoke the forwarding mechanism again. (Note that one must have application-specific knowledge in order to do this.) A mechanism similar to informing memory operation traps [17] could be used for this purpose.
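As an illustration (ours; the trap interface, and how the handler locates the memory slot holding the offending pointer, are hypothetical and application-specific), such a handler could both record statistics and repair the stray pointer so that the forwarding cost is paid only once:

    typedef unsigned long word;
    extern int  Read_FBit(word* a);        /* Figure 3 primitive */
    extern word Unforwarded_Read(word* a); /* Figure 3 primitive */

    static long forwarding_events;   /* profiling statistic */

    /* ptr_slot = location holding the stray pointer that caused the trap */
    void on_forwarding_trap(word** ptr_slot) {
        word* a = *ptr_slot;
        while (Read_FBit(a))                  /* find the final address */
            a = (word*)Unforwarded_Read(a);
        forwarding_events++;                  /* record for profiling */
        *ptr_slot = a;   /* repair: future accesses go direct */
    }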
3.3. Software Support

Having discussed the hardware support for memory forwarding, we now focus on its impact on software.

Initialization of Forwarding Bits: The forwarding bit of a memory word must already be clear when it is used by a program for the first time. To guarantee this, the operating system must perform an Unforwarded_Write(0,0) operation on all words in a region of memory to initialize it before making that memory available to an application.
Deallocating Forwarded Data: When an object is deallocated, all memory reachable via the chain of forwarding addresses for that object should be deallocated as well. A simple way to accomplish this is to create a wrapper memory-deallocation routine which first deallocates all of the memory allocated on the forwarding chain, and then calls the original memory-deallocation routine, which can be either a system-provided procedure (e.g., free() in C and delete() in C++) or a user-defined procedure if the program performs its own memory management.
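A sketch of such a wrapper (ours), assuming the standard C free() and the Figure 3 primitives treated as C functions:

    #include <stdlib.h>

    typedef unsigned long word;
    extern int  Read_FBit(word* a);        /* Figure 3 primitive */
    extern word Unforwarded_Read(word* a); /* Figure 3 primitive */

    /* Free an object together with every location on its forwarding chain. */
    void forwarding_free(word* p) {
        while (Read_FBit(p)) {
            word* next = (word*)Unforwarded_Read(p);  /* forwarding address */
            free(p);                                   /* release old location */
            p = next;
        }
        free(p);   /* release the final (current) location */
    }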
Memory Alignment: Since the minimal granularity of memory forwarding is a word, software must ensure that two different objects which are being relocated to two different destinations do not share the same word, since we cannot store two different forwarding addresses in that same word. In other words, relocatable objects must be word-aligned. Enforcing this alignment can be accomplished either by specifying the alignment to the memory allocator for dynamically-allocated objects, or else by tuning the alignment option in the compiler if some relocatable objects are statically allocated. In some compilers—e.g., the MIPS C compiler that we used in our experiments—aggregate objects are already aligned to word boundaries by default.

Preserving Outcomes of Pointer Comparisons: The compiler is responsible for replacing all pointer comparisons that could be affected by relocation with explicit code to look up and compare final addresses. Pointer analysis techniques [14, 36] can help the compiler avoid inserting these more costly comparisons by ignoring cases where pointers cannot point to relocated objects.
4. Experimental Framework

To evaluate the potential performance benefits of memory forwarding, we modeled it in a modern processor and used it to enable a number of relocation-based optimizations which we applied to a collection of non-numeric applications. We chose to focus on non-numeric applications because compilers are mostly unable to guarantee the safety of data relocation in these applications. The goals of the optimizations that we applied were improving spatial locality and prefetching effectiveness. Since current compiler technology does not support these optimizations (mainly because their safety cannot be proven), we added these optimizations to the applications manually. Table 1 describes the eight applications used in our experiments along with the optimizations that we applied. All applications were run to completion in our simulations.

Table 1. Application characteristics. Note: "Inst. Grad." is the number of instructions actually graduated. The "combined" miss rate is the fraction of loads which suffer misses in both the 16KB D-cache and the 512KB L2 cache, using 32B cache lines. "Space Overhead" is the amount of virtual memory space used for forwarding addresses.

We added our proposed ISA extensions to the underlying MIPS ISA by making use of a few machine instruction sequences that never appear in ordinary programs (e.g., loading a value into a register which is hardwired to the value zero). We modeled the full performance effects of maintaining and dereferencing the forwarding addresses. The "Space Overhead" column shows the amount of virtual memory space for accommodating relocated data; this amount (ranging from 0.5MB to 14.9MB) does not present a problem in modern machines, and the simulation results include the impact of this overhead on performance.

We implemented data dependence speculation in our simulator. An ambiguous data dependence is stored in a table until it is resolved. The simulator also models contention throughout the memory hierarchy (including tag, bank, and bus contention), etc. Table 2 shows the parameters used in our model for the bulk of our experiments. Five line sizes—ranging from 32B to 512B—were used in our experiments, along with five corresponding sets of miss latencies (longer lines have longer transfer times).

Table 2. Simulation parameters.
  Pipeline Parameters
    Issue Width: 4
    Functional Units: 2 Integer, 2 FP, 2 Memory, 2 Branch
    Reorder Buffer Size: 64
    Integer Multiply: 3 cycles
    Integer Divide: 15 cycles
    All Other Integer: 1 cycle
    FP Divide: 15 cycles
    FP Square Root: 20 cycles
    All Other FP: 2 cycles
    Branch Prediction Scheme: 2-bit Counters

We compiled our applications with -O2 optimization using the standard MIPS C compilers and the SUIF compiler [35] under IRIX 5.3. For the experiments which required the insertion of software prefetches into the source code, we used the SUIF compiler; otherwise, the MIPS compiler was used.
Figure 5. Performance of locality optimizations for various cache line sizes (N = not optimized, L = locality optimized). (Y-axis: normalized execution time; the percentage below each pair of bars is the speedup of L over N.)
5. Experimental Results

Figure 5 shows the execution time of the unoptimized (N) and locality-optimized (L) versions of each application for several cache line sizes. Each bar is divided into categories explaining what happened during all potential graduation slots.⁵ The bottom section (busy) is the number of slots in which instructions actually graduate, the top two sections (load stall and store stall) are any non-graduating slots that are immediately caused by the oldest instruction suffering either a load or store miss, and the inst stall section is all other slots where instructions do not graduate. Note that the load stall and store stall sections are only a first-order approximation of the performance loss due to cache stalls, since these delays also exacerbate subsequent data dependence stalls. In addition, there is a percentage in parentheses below each pair of bars representing the speedup of the optimized over the unoptimized case.

Our first observation from Figure 5 is that performance generally degrades when line size increases, especially for the unoptimized cases. This trend is due to a lack of spatial locality in these applications, which means that longer lines offer little performance advantage. Fortunately, our locality optimizations (which are enabled by memory forwarding) improve the spatial locality of these applications significantly. As we see in Figure 5, the optimized cases outperform the unoptimized cases for the same line sizes in all applications except Compress, and the speedups increase along with line size. The performance improvement can be dramatic—with 128B lines, Health and VIS enjoy more than twofold speedups. Among our optimizations, list linearization is particularly powerful since it improves the performance of Health, MST, Radiosity, and VIS substantially. It is interesting to note that in Health, the absolute performance of the optimized cases increases along with line size. This is due to the prefetching benefits of long cache lines after spatial locality is greatly improved. Compress is an exceptional case where the locality gets worse in the optimized case for 32B and 64B lines. We also observe from Figure 5 that the instruction overhead of these locality optimizations is usually low, which suggests that these optimizations could be invoked even more frequently during the execution to further improve the data layout.

While execution time is the most important performance metric, further insight can also be gained by examining the impact on total cache misses. Figure 6(a) shows the number of load D-cache misses in the unoptimized and optimized cases for different line sizes. Each bar is normalized to the N case of the 32B line size, and is divided into two categories indicating how a D-cache miss is serviced. A partial miss is a D-cache miss that combines with an outstanding miss to the same line, and therefore does not necessarily suffer the full miss latency. A full miss, on the other hand, does not combine with any access and therefore suffers the full latency. Figure 6(a) clearly demonstrates that the improved spatial locality offered by locality optimizations reduces the miss count substantially, with more than a 35% reduction in misses in 11 out of the 21 cases (seven applications with three line sizes each). In many cases, both partial misses and full misses are reduced, and hence the total miss penalty decreases accordingly.

Figure 6. Additional performance metrics for the impact of locality optimizations (N = not optimized, L = locality optimized). The y-axes are normalized to the N cases of the 32B line size. Panel (a): number of load D-cache misses, divided into partial misses (misses combined with outstanding misses) and full misses (misses served by the secondary cache). Panel (b): total amount of bandwidth consumed, between the primary caches and the secondary cache and between the secondary cache and memory.

Figure 6(b) shows another useful performance metric: the total amount of bandwidth consumed by our applications. Each bar in Figure 6(b) denotes the total number of bytes transferred between the primary and secondary caches (the bottom section), and the amount transferred between the secondary cache and main memory (the top section). Again, each bar is normalized to the N case of the 32B line size. It is clear from Figure 6(b) that locality optimizations reduce the bandwidth consumption in nearly all cases, and achieve a bandwidth reduction of twofold or more in a few cases. Thus we see that these optimizations deliver not only higher performance, but also reduced bandwidth consumption.

⁵ The number of graduation slots is the issue width (4 in this case) multiplied by the number of cycles. We focus on graduation rather than issue slots to avoid counting speculative operations that are squashed.

5.2. Impact on the Effectiveness of Prefetching

We now turn our attention to the interaction between our locality optimizations and the effectiveness of prefetching. Based on a profile of each application, we added software prefetches for a few static loads that suffer significantly from cache misses. Prefetches are inserted at the earliest points in the program where the prefetch addresses are known (this is done in an identical fashion for both the original and locality optimized cases).
We assume that a single prefetch instruction can prefetch one or more consecutive cache lines (i.e. block prefetching is supported). For both the unoptimized and optimized cases, we experimented with a range of prefetch block sizes, and we report the results with the block size that performed the best for each case.

Figure 7 shows how prefetching performs both with (LP) and without (NP) locality optimizations. For the sake of comparison, the N and L cases from Figure 5 are also included in Figure 7. The cache line size is fixed at 32B. We observe from Figure 7 that the performance of prefetching is improved by locality optimizations in five applications, and two of them (VIS and Health) enjoy speedups of over 40%. We note that four of these five applications operate heavily on linked lists, and previous research [25] has shown that prefetching linked lists—especially those that are short and traversed within small loop bodies—is particularly difficult because of the pointer-chasing problem. As we can see in Figure 7, the list linearization optimization is quite successful in alleviating this problem. With the exception of VIS (which experiences considerable prefetching overhead), in the remaining four out of five applications where the locality is substantially improved, combining locality optimizations and prefetching (LP) performs better than either technique alone (most noticeably in Health). Therefore, it appears that prefetching and our locality optimizations are complementary in nature.

Figure 7. Performance impact of locality optimizations on prefetching (N = not optimized, L = locality optimized, NP = prefetching without locality optimizations, LP = prefetching with locality optimizations).
5.3. Case Studies ing order of the hash index. The optimized layout is shown in
Having studied the overall performance, we now look at the individual applications in more detail.

Health, MST, Radiosity, and VIS: We apply the same locality optimization to all four of these applications: list linearization. The structure of the linked lists used in these applications is modified throughout the program execution, and therefore list linearization is invoked periodically. To make our discussion more concrete, we use VIS as a representative example. VIS is a large application, consisting of more than 150,000 lines of C code. This program makes extensive use of a generic list library which implements many common list operations. Our optimizations are localized within this library. We optimize the locality of list processing as follows. We add a counter field to the head record of each list to count how many insertion or deletion operations have been performed on the list since the last time that the list was linearized. The list linearization procedure ListLinearize()—shown earlier in Figure 4(b)—is invoked whenever the list's counter exceeds a threshold, which was arbitrarily set to 50 in our experiments. The counter is reset after each linearization. Despite the simplicity and usefulness of this optimization, performing it without the support of memory forwarding is dangerous because most functions in this library return pointers to list elements, which can be scattered across any of the over one hundred source files of VIS. The program would behave incorrectly if, after a list is linearized, it is later accessed using a pointer into the middle of the list that existed before the linearization. Fortunately, memory forwarding allows us to simply ignore this hazard, thereby safely resulting in an over twofold performance gain with 128B lines.
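A sketch of this trigger logic (ours; the field and procedure names are illustrative, and ListLinearize() is the procedure of Figure 4(b)):

    typedef struct node node;               /* list node as in Figure 4(b) */
    extern void ListLinearize(node** head_handle);

    #define LINEARIZE_THRESHOLD 50          /* as in our experiments */

    typedef struct {
        node* first;       /* head pointer of the list */
        int   mod_count;   /* insertions/deletions since last linearization */
    } list_head;

    /* Called after every insertion or deletion on the list. */
    void maybe_linearize(list_head* h) {
        if (++h->mod_count > LINEARIZE_THRESHOLD) {
            ListLinearize(&h->first);
            h->mod_count = 0;   /* reset after each linearization */
        }
    }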
Eqntott: The most interesting data structure in Eqntott is a hash table which stores pointers to records of type PTERM. A PTERM record in turn contains a pointer to an array of short integers. The original layout of this data structure is shown in Figure 8(a). We optimize the locality by (i) relocating a PTERM record and its short integer array into a single chunk of memory, and (ii) putting these chunks into contiguous memory locations in increasing order of the hash index. The optimized layout is shown in Figure 8(b). This relocation optimization is invoked only once in the program, immediately after the hash table is constructed.

Figure 8. Locality optimization for Eqntott (objects in the same shaded region are allocated to contiguous memory).

BH: In BH, an octree is constructed and then traversed at each time step of the N-body force calculation. The octree is constructed in a depth-first order, but the traversal order is fairly random. We improve the locality of the traversal by clustering non-leaf nodes of the tree. We do not cluster leaf nodes since they are actually linked together by a list and accessed via list traversals. Subtree clustering [11] attempts to pack nodes of a subtree into a cache line, in the most balanced form. Locality will be improved if the next node to be visited—which can be any of the children of the current node—is already in the current cache line. Figure 9 illustrates this optimization using a binary tree. Figure 9(a) shows the original memory layout of the tree, which was created using a pre-order traversal, and Figure 9(b) shows the memory layout after subtree clustering. Since a non-leaf node in BH is 78B long, we need cache lines of 256B or longer to do meaningful clustering.

Figure 9. Example of the subtree clustering applied to BH (nodes in the same shaded region are in the same cache line).
Compress: The most relevant data structures in Compress are two hash tables, namely htab and codetab, which are implemented using arrays. Indices to htab are computed through hashing, but codetab always shares the same index values as htab. Therefore, spatial locality might be improved if codetab[i] could be next to htab[i] in memory. We achieve this by copying the two tables into a single larger table T such that htab[i] and codetab[i] occupy adjacent elements in T. However, as we have already seen in Figure 5, performance is in fact degraded by this optimization for 32B and 64B lines due to worse locality than in the original code.
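A sketch of the merged layout (ours; the element types are illustrative): after copying, htab[i] and codetab[i] occupy adjacent fields of T[i], so a probe that touches both tables touches one cache line instead of two.

    #include <stdlib.h>

    /* Merged table: entry i holds what used to be htab[i] and codetab[i]. */
    typedef struct {
        long           h;   /* former htab[i]    */
        unsigned short c;   /* former codetab[i] */
    } pair;

    pair* merge_tables(const long* htab, const unsigned short* codetab, int n) {
        pair* T = (pair*)malloc((size_t)n * sizeof(pair));
        for (int i = 0; i < n; i++) {
            T[i].h = htab[i];
            T[i].c = codetab[i];
        }
        return T;   /* probes now index T instead of the two tables */
    }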
Figure 10. Performance results for SMV (N = not optimized, L = locality optimized with realistic forwarding, Perf = locality optimized with perfect forwarding). The line size is fixed at 64B. Panels: (a) execution time; (b) miss count; (c) distribution of number of forwarding hops; (d) average load and store latencies.
5.4. Impact of Forwarding Overhead
In each of the applications that we have studied so far, we were successful enough at updating the appropriate pointers to point to a relocated object's new location that the forwarding mechanism was almost never invoked. (At the same time, we would like to point out that without memory forwarding support, we would not have been able to apply these optimizations because they were not provably safe.) As a result, the performance of dereferencing a forwarding address did not matter in these cases. To quantify the impact of forwarding overhead in a case where it does matter, we now focus on SMV, which is the only application we studied that experiences significant forwarding after data relocation.

SMV is a model checking program which makes extensive use of Binary Decision Diagrams (BDDs) [4]. The BDD nodes are connected both through a hash table and through binary trees. The hash table is organized as an array of buckets pointing to linked lists. Since more cache misses occur during hash table accesses than binary tree accesses, we attempted to improve locality by linearizing the lists stored in the hash table. Unfortunately, since our optimized code is not able to update the tree pointers to point to a relocated object's new address, forwarding does occur whenever relocated BDD nodes are accessed via the tree pointers.

Figure 10 shows our performance results for SMV. In addition to the cases without (N) and with (L) locality optimization, as shown in earlier graphs, we also show a case with locality optimization and perfect forwarding (Perf). We say that memory forwarding is perfect if all references to relocated objects access them directly at their new addresses, and hence no forwarding is actually required. While this latter case is not achievable, it represents a useful bound on performance. As we see in Figure 10(a), the performance of scheme L is degraded by forwarding in two ways. First, the act of dereferencing forwarding addresses incurs extra time. Second, when forwarding occurs, both the old and new locations of relocated data are accessed, thereby degrading cache behavior. Under perfect forwarding these overheads disappear, and the performance does improve. However, the improvement is only marginal due to the fact that we cannot optimize the layout to accelerate both the hash table and tree access patterns.

To provide further insight into the source of the forwarding overhead, Figures 10(b)-(d) present some additional performance metrics. Figure 10(b) shows the impact of the schemes on the number of load and store data cache misses. As we see in this figure, scheme L suffers additional misses relative to the other schemes. Figure 10(c) shows that 7.7% of loads and 1.7% of stores require one forwarding hop under scheme L. Finally, Figure 10(d) shows the average number of CPU cycles needed to complete a load or store under each scheme. Each bar in Figure 10(d) is divided into two sections explaining the reason for the stall. The forwarding section represents the time spent dereferencing forwarding addresses, and the ordinary section includes cache hit and miss latencies. The ordinary sections of scheme L increase due to the cache pollution effects of touching the forwarding pointers, as mentioned earlier. As we see in Figure 10(d), both the latency of dereferencing a forwarding address and its resulting cache pollution effects play significant roles in the overall performance degradation. A profiling tool based on user-level traps (as discussed earlier in Section 3.2) could potentially identify cases such as this where forwarding occurs too frequently.

6. Conclusions

As changes in technology continue to alter the landscape of what constitutes a major performance bottleneck, it is sometimes worth re-examining old architectural ideas that have fallen out of fashion to see whether they can be adapted to serve completely new purposes. In this paper, we have examined such a technique: memory forwarding. Although the original concept was proposed to facilitate garbage collection in early Lisp machines, we have demonstrated that memory forwarding can be adapted to address the entirely modern problem of enhancing cache performance. In addition, we have shown that it is quite feasible to implement this mechanism within modern out-of-order superscalar processors, largely because forwarding can be treated as an exception. By liberating aggressive relocation-based data layout optimizations from concerns over violating program correctness, memory forwarding can enable impressive performance gains: we observe over twofold speedups in two applications. These optimizations are useful not only for hiding memory latency, but also for reducing memory bandwidth consumption. Although one must still exercise caution not to use forwarding carelessly, a user-level trap mechanism can help identify and avoid cases where pointers have not been updated successfully. In summary, memory forwarding is a powerful tool which makes a large class of optimizations that were promising in theory useful in practice. Its applicability extends beyond caches to the rest of the memory hierarchy (e.g., disks), and we advocate that it be supported in future processors.
7. Acknowledgments

We thank Daniel Meneveaux for providing his radiosity program. Chi-Keung Luk is partially supported by a Canadian Commonwealth Fellowship. Todd C. Mowry is partially supported by an Alfred P. Sloan Research Fellowship, and by a Faculty Development Award from IBM.

References

[20] IBM. PowerPC 620 RISC Microprocessor Technical Summary, October 1994.

[21] T. E. Jeremiassen and S. J. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. In Proceedings of PPoPP '95, July 1995.

[22] R. E. Kessler and M. D. Hill. Page placement algorithms for large real-indexed caches. ACM Transactions on Computer Systems, 10(4), November 1992.