0% found this document useful (0 votes)

21 views12 pages

Memory Forwarding: Enabling Aggressive Layout Optimizations by Guaranteeing The Safety of Data Relocation

The document discusses a technique called memory forwarding that enhances cache performance by allowing safe data relocation at runtime. This method creates spatial locality, avoids cache conflicts, and reduces false sharing by adding a layer of indirection in memory, ensuring that data relocations do not break program correctness. Experimental results indicate that memory forwarding can lead to significant performance improvements, with speedups exceeding twofold in certain scenarios.

Uploaded by

Shreyas Grampurohit

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

21 views12 pages

Memory Forwarding: Enabling Aggressive Layout Optimizations by Guaranteeing The Safety of Data Relocation

Uploaded by

Shreyas Grampurohit

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

Memory Forwarding: Enabling Aggressive Layout Optimizations

by Guaranteeing the Safety of Data Relocation

Chi-Keung Luk Todd C. Mowry
Department of Computer Science Computer Science Department
University of Toronto Carnegie Mellon University
Toronto, Canada M5S 3G4 Pittsburgh, PA 15213
luk@eecg.toronto.edu tcm@cs.cmu.edu

Abstract Cache performance depends on two factors: when data items

By optimizing data layout at run-time, we can potentially en- are accessed, and where they exist in the address space. There-
hance the performance of caches by actively creating spatial lo- fore, software-based techniques for improving cache performance
cality, facilitating prefetching, and avoiding cache conflicts and typically do one of two things: they either restructure the compu-
false sharing. Unfortunately, it is extremely difficult to guarantee tation, or else they restructure the data layout. The idea behind
that such optimizations are safe in practice on today’s machines, restructuring the computation is that given a fixed data layout, we
since accurately updating all pointers to an object requires perfect would like to manipulate the ordering of accesses such that multi-
alias information, which is well beyond the scope of the compiler ple accesses to the same data item (or cache line) occur close to-
for languages such as C. To overcome this limitation, we propose a gether in time, thereby enhancing locality [8, 37]. In contrast, the
technique called memory forwarding which effectively adds a new idea behind optimizing the data layout is that given that a set of
layer of indirection within the memory system whenever necessary data items are accessed close together in time in the original com-
to guarantee that data relocation is always safe. Because actual putation, we would like to actively arrange them in the address
forwarding rarely occurs (it exists as a safety net), the mechanism space such that: (i) we create spatial locality by allocating them at
can be implemented as an exception in modern superscalar pro- contiguous addresses (thereby enhancing the effectiveness of long
cessors. Our experimental results demonstrate that the aggres- cache lines and simplifying prefetch address generation); (ii) we
sive layout optimizations enabled by memory forwarding can re- avoid cache conflicts by ensuring that they do not reside in sepa-
sult in significant speedups—more than twofold in some cases—by rate lines which map into the same cache sets; and (iii) we avoid
reducing the number of cache misses, improving the effectiveness false sharing by ensuring that items accessed by different proces-
of prefetching, and conserving memory bandwidth. sors fall within separate cache lines. While both approaches have
received considerable attention in the past, our focus in this study
1. Introduction is on facilitating data layout optimizations.
There are two possibilities for when we manipulate data layout.
As the gap between processor and memory speeds continues to
The first approach—which we call static placement—is to assign
grow, memory latency is becoming a key performance bottleneck.
an object to its optimized address when it is created [6]. The sec-
In addition, there is growing concern that the bandwidth to the off-
ond approach is to move an object (perhaps more than once) after
chip memory hierarchy may also limit performance [5]. The pri-
it has been allocated; we refer to this latter approach as data relo-
mary mechanism for reducing both the off-chip latency and band-
cation (or simply relocation). The advantage of static placement is
width requirements is the on-chip cache hierarchy. While caches
its simplicity. The advantage of relocation, however, is that it can
are an important step toward addressing these problems, they face
adapt to dynamic program behavior. Previous studies have shown
several well-known limitations. For example, multi-word cache
that relocation-based optimizations such as copying [23, 33] and
lines can improve performance by prefetching useful data when
clustering [11] can offer impressive performance gains.
spatial locality is abundant, but they can also waste bandwidth and
displace useful data when it is not. Caches can suffer pathologi- In general, relocation-based data layout optimizations involve
cally bad behavior due to either mapping conflicts or false shar- the following three steps:
ing [21, 34]. Prefetching techniques [9, 25, 31] can potentially 1. Guaranteeing Correctness: Either the programmer or the
hide cache miss latency, but only if they can predict access patterns compiler must prove that relocating the data will never
ahead of time, and only if there is sufficient memory bandwidth. break the program; otherwise, the optimization is unsafe.
A brute force approach of simply making caches larger is not
2. Estimating the Cost/Benefit Tradeoff: The potential opti-
likely to solve these problems, in part because application problem
mization should only be performed if the performance ben-
sizes are also increasing rapidly, and also because cache sizes are
efit is expected to outweigh the overheads involved in relo-
constrained by the requirement of low access latency and by hard-
cating the data. This estimation could be based on some
ware resource limitations. In addition, recent studies [5, 18] have
combination of programmer knowledge, static compiler
shown that the effectiveness of caches is often low because a sig-
analysis, profiling feedback, or run-time information.
nificant fraction of cached data is not reused before it is displaced
from the cache. Hence the first problem to address is managing 3. Generating Relocation Code: Additional code must be in-
existing caches more intelligently to improve their effectiveness. serted to perform the actual data relocation at run-time.

88
1063-6897/99/$10.00 (c) 1999 IEEE
Despite the high performance potential of many relocation- apply a relocation-based optimization reduces solely to evaluating
based optimizations, the key stumbling block which often prevents its cost/benefit performance tradeoffs. In effect, memory forward-
them from being used in practice is the first step—i.e. guaranteeing enables software to optimistically speculate that when it relo-
ing correctness. To safely move data, we must guarantee that any cates an object, it has successfully updated all pointers to that ob-
future references to the object will find it at its new location. The ject to point to its new location. If the speculation fails, then there
fundamental problem is that updating the precise set of pointers1 is a recovery cost (i.e. dereferencing the forwarding address), but
to a given object requires perfect aliasing information related to the execution still proceeds correctly. Therefore, as in all forms of
that object. In general, computing such precise information is be- speculation, one is gambling that the speculation is correct often
yond the capabilities of the compiler,2 and is even quite difficult enough that the benefit outweighs the cost. Another feature of our
for the programmer for large programs. In the face of uncertainty, mechanism which helps improve these odds is that software can
we must conservatively assume that relocating an object will break optionally specify that dereferencing a forwarding address will in-
the program—no matter how unlikely this may seem in reality— voke a user-level trap that enables software to update the offending
and therefore the optimization cannot be performed. pointer to point to the object’s new address. Hence software can
There is one mechanism in modern systems which provides learn from its mistakes to avoid repeating them.
a very limited form of safe data relocation: the virtual memory
system. The operating system can relocate an entire page of mem- 1.2. Related Work
ory in the physical address space without breaking the program It is interesting to note that the work which is most closely re-
by simply copying the page and updating its virtual-to-physical lated to our study occurred well over a decade ago in the context of
mapping. One cache optimization which exploits this flexibility architectures that directly supported the Lisp programming envi-
is page coloring [22], whereby the operating system attempts to ronment [29, 32]. Performance concerns were quite different back
avoid mapping conflicts in large off-chip caches. Therefore, by then: main memory was relatively small and expensive, and cache
adding a layer of indirection within the memory system, we can miss latencies were less problematic because the gap between mi-
move data safely and transparently without any special language croprocessor and memory speeds was dramatically smaller. (In
or compiler support. Unfortunately, the virtual memory system fact, a number of microprocessor-based systems did not even have
only provides this flexibility at the granularity of an entire page. caches.) Therefore the primary concern in optimizing memory
To actively create spatial locality within a cache line, we must have performance back then was minimizing the overall space require-
this flexibility at a word granularity. However, applying standard ments of a program, so that it would fit into main memory and
virtual memory techniques at such a fine granularity—i.e. setting avoid paging to disk. Two aspects of the Lisp environment made
the page size to be one word—is not a viable solution, due to the this challenging: the need to perform automatic garbage collec-
enormous overheads that this would involve. (Not only would the tion, and the relative space inefficiency of the ubiquitous list struc-
number of page table entries and the TLB size grow enormously, tures in the language. In addition, another aspect of the Lisp lan-
but also the cache tags would have to be maintained at a word guage which resulted in specialized hardware support was the need
granularity.) Instead, we propose a completely different solution. to determine object data types at run-time.
Although the performance goals which inspired specialized
1.1. Our Solution: Memory Forwarding hardware and software support in these Lisp machines are quite
To give software the flexibility to apply relocation-based data different from our goal of improving cache performance, there
layout optimizations at any time without concern over violating are nonetheless a number of interesting overlaps between our sup-
program correctness, we propose a mechanism called memory for- port and some features of these earlier machines. We now discuss
warding which guarantees the safety of relocation at a word gran- the connections between this previous work and our study from
ularity. (In our discussion, we define the “word” size to be equal three different perspectives: tagged memory, garbage collection,
to the size of a pointer.) The basic idea behind memory forward- and data layout optimizations.
ing is that when we relocate an object, we store its new address in
Tagged Memory: To make objects self-descriptive with re-
its old location and mark the old location so that hardware recog-
spect to their types, a number of Lisp architectures [29, 32] associ-
nizes it as a forwarding address. Therefore if the program acci-
ated a tag with each memory location. As we will see later in Sec-
dentally accesses the old address, the hardware will automatically
tion 2.1, our memory forwarding scheme also requires a form of
forward the reference to the object’s new location, thereby always
tagged memory to distinguish forwarding addresses from normal
guaranteeing the correct result. Moreover, our scheme only pays
data. A key difference, however, is that the tags in Lisp machines
the run-time overhead of an extra indirection when it is actually
provided much more functionality than in our case, and therefore
necessary—i.e. when the alternative is to violate program seman-
they required more overhead. For example, the SPUR architec-
tics. In the far more common cases of references to non-relocated
ture [32] added eight bits of tag to each 32 bits of memory (a 25%
objects, or references that have been properly updated to point to
overhead), whereas our scheme only requires one tag bit per 64
the new addresses of relocated objects, our scheme imposes no
bits of memory (a 1.5% overhead) in a modern 64-bit architecture.
performance overhead. The space overhead of our scheme is also
A fact that is even more relevant to our study is that a form
low (a 1.5% fixed memory cost on a 64-bit architecture).
of memory forwarding (using tagged memory) has appeared in
With memory forwarding support, the decision of whether to
previous Lisp machines, albeit for a very different purpose. The
1 We use the term “pointers” loosely to refer to any mechanism for gen- concept of an invisible pointer (which is similar to our forwarding
erating an address pointing to the object in question. address) was proposed twenty-four years ago by Greenblatt [15],
2 This is especially true for heap-allocated objects in languages like C. and the Symbolics 3600 [29] used one of its tags to implement

89
1063-6897/99/$10.00 (c) 1999 IEEE
a forwarding pointer. The motivation behind these mechanisms (a) Before data relocation (b) After data relocation
was threefold: to enable the insertion of an item into a cdr-coded byte address forwarding bit byte address forwarding bit

list [2], to facilitate incremental garbage collection, and to imple- 0800 5800 0800 5800
ment overlapping arrays. In contrast, our focus is on improving the 0804
-85 0 5804
? 0 0804 5800 1 5804 -85 0
-47 ? -47
cache performance of programs written in C, and therefore none 0808 5808 0808 5808
0812
0 5812 ? 0 0812 5808 1 5812 0 0
of these issues apply. In essence, what we are doing is taking a 0 0 ? 0
0816 5816 0816 5816
very old mechanism and adapting it to a completely new purpose 0820
-99 0 5820
? 0 0820 5816 1 5820
-99 0
within the context of modern out-of-order superscalar processors. 5 ? 5
0824 5824 0824 5824
old new old new
Garbage Collection: A common feature among these Lisp
machines is that they support some form of automatic garbage col- Figure 1. Example of data relocation with memory forwarding.
(A memory word is 8 bytes, and addresses are in decimal.)
lection. Garbage collection algorithms involve phases where they
identify two classes of data items: those that can be reclaimed, The remainder of this paper is organized as follows. We be-
and those that can be relocated. A data item can be reclaimed gin in Section 2 with an overview of memory forwarding and how
when it can no longer be accessed through any pointers that are it can be used. Section 3 discusses issues related to implement-
still active, and a data item can be relocated if all pointers to the ing memory forwarding in a modern processor. Sections 4 and 5
old location can be updated to point to the new location. In both present our experimental methodology and experimental results,
cases, the key challenge is identifying all pointers which point respectively, to demonstrate the usefulness of the mechanism. Fi-
to the given location. In languages such as Lisp, ML, and Java, nally, we conclude in Section 6.
where the use of pointers is either restricted or disallowed alto-
gether, one can solve this problem in practice. In contrast, in lan- 2. Memory Forwarding
guages such as C and C++ which do not restrict pointer usage, one
We now discuss the basic concepts behind memory forward-
generally cannot determine which pointers point to a given object,
ing, a number of applications of this mechanism, and some issues
and therefore automatic garbage collection (and data relocation) is
related to its performance.
extremely difficult. Finally, it is interesting to note that a form of
memory forwarding is used in copying garbage collectors [10, 28], 2.1. Basic Concepts
whereby the forwarding addresses are used to preserve data con-
Memory forwarding enables aggressive yet safe data reloca-
sistency during the distinct phases when collection takes place.
tion. As we mentioned earlier in Section 1.1, the basic idea is to
Data Layout Optimizations: An important topic in Lisp re- store the new address of an object into its old memory location,
search is how to represent list structures compactly. List com- and to mark this old location as a forwarding address. Whenever
paction can be performed either separately or during garbage col- a forwarding address is accessed, the hardware will automatically
lection. Most of the list compaction techniques designed for dereference that location to find the object at its new location.
Lisp [1, 2, 16] involve either moving or copying the original list to There are three implications of this mechanism in terms of
a new, denser set of locations. As we discussed above, data reloca- memory storage. First, the minimum unit of data that can be re-
tion in Lisp does not pose the safety problems that we encounter in located is the width of a pointer—which we refer to as a “word”
C. However, our memory forwarding support gives us the flexibil- throughout this paper—since otherwise there would not be enough
ity to exploit some of these same list compaction techniques—e.g., space to store the forwarding address.3 Note that it is possible to
a technique called list linearization [13]—for the sake of improv- relocate byte-sized objects—this simply means that enough neigh-
ing spatial locality in C programs. boring bytes must be moved at the same time to comprise an entire
word. Second, a chunk of data that is relocated must be word-
1.3. Objectives of This Study aligned, so that the alignment of the forwarding address is prede-
This paper makes the following contributions. First, we pro- termined. Note that this still allows us to perform byte-sized loads
pose a solution to the problem of safely relocating data at a fine and stores to forwarded objects—the byte offset into the new lo-
granularity to improve the cache performance of programs written cation is simply assumed to be the same as it was at its original
in languages such as C which do not support garbage collection. address. We consider these first two restrictions to be quite mi-
Although the concept of memory forwarding was proposed over nor, especially given that our only option for safe relocation today
two decades ago in the context of Lisp machines, to the best of our is page-sized, page-aligned chunks of data. Finally, to enable the
knowledge, we are the first to propose that it be adapted to facil- hardware to distinguish forwarding addresses from regular data,
itate a broad class of data layout optimizations to improve cache we attach a one-bit tag (called a forwarding bit) to each word in
performance. Second, we discuss how memory forwarding can be memory. For a 64-bit architecture, this results in a space overhead
implemented within modern out-of-order superscalar processors of only 1.5%, and therefore is reasonably efficient.
(which are quite different from the processors in which other forms Figure 1 shows a simple example of how the memory contents
of forwarding have been implemented in the past). Third, we sug- and forwarding bits are modified upon data relocation. Assume
gest a number of optimizations which can benefit from memory 3 One could imagine creating a more elaborate scheme for compressing
forwarding. Finally, we quantitatively evaluate the benefits and the size of forwarding address pointers (e.g., by restricting the distance
overheads of our scheme by using it to apply a number of differ- between the old and new address to something that fit within its former
ent run-time locality optimizations to a collection of non-numeric size), but this would involve additional complexity and fancier tag storage,
applications running on a modern superscalar processor. so we do not consider such an approach further.

90
1063-6897/99/$10.00 (c) 1999 IEEE
that we have a 64-bit architecture, and that we would like to relo- (a) Before list linearization
cate five 32-bit elements from addresses 0800-0816 to addresses head 0 0 0 0
5800-5816 (these addresses are in decimal notation). Figure 1(a) 3000 A B C D
shows the memory contents and forwarding bits before relocation 3000 1000 4000 2000
(note that none of the forwarding bits have been set). To relocate
a word, we first copy it to its new location, and then we simul- (b) After list linearization
taneously write its new address into its old location and set the 1 1 1 1
corresponding forwarding bit at the same time. Figure 1(b) shows 8000 8016 8032 8048
the state of memory after the relocation. Notice that to relocate the 3000 1000 4000 2000
32-bit subword at address 0816, we must also relocate the 32-bit
subword at address 0824 (which contains the value 5) along with head 0 0 0 0
8000 A B C D
it. After the relocation, a 32-bit load of the subword at address
0804 will be forwarded to address 5804—which is computed by 8000 8016 8032 8048
adding the forwarding address (5800) to the byte offset within the Figure 2. Example of list linearization with memory forwarding,
word (4)—thereby returning the correct value of -47. assuming that cache lines and list elements are 32 and 16 bytes
To simplify our discussion throughout the remainder of the pa- long, respectively. (Addresses are in decimal.)
per, we now define two terms which we will use frequently:
performance, it can also reduce memory bandwidth consumption,
Initial address: The address of the first location accessed by a which in turn can help reduce power consumption (which is be-
memory reference. For example, in Figure 1(b), the initial coming an increasingly important concern).
address of a write to word 0816 is 0816 itself. A good example of a technique which uses packing to enhance
Final address: The address of the last location accessed by a spatial locality is list linearization. As we will see later in Sec-
memory reference. For example, in Figure 1(b), the final tion 5, this technique can offer dramatic performance improve-
address of a write to word 0816 is 5816. When data is ments. List linearization has been an important technique for com-
not forwarded, the final address equals the initial address. pacting lists in Lisp programs, and can eliminate as much as half
of the space consumption [13]. The idea behind list linearization is
In addition to preserving the correctness of pointer derefer- to relocate the nodes of a linked list so that they reside in contigu-
ences, another concern with data relocation is preserving the cor- ous memory locations. Depending on whether the list structure
rectness of pointer comparisons. In the presence of memory for- continues to change over time, the linearization process can be
warding, it is possible that two pointers with distinct initial ad- invoked either just once, or else periodically to adapt to the chang-
dresses may in fact point to the same object (i.e. share the same ing structure. Although list linearization can potentially offer large
final address). Hence to preserve correctness, explicit pointer performance gains, it is very difficult to safely use this optimiza-
comparisons4 should be performed with respect to their final ad- tion in practice for general C programs due to the possibility of
dresses. Although our memory forwarding hardware does not per- pointers outside of the linked list itself pointing to list elements.
form this check automatically, the compiler can easily insert ad- Fortunately, with memory forwarding support, we can apply list
ditional instructions (described later in Section 3) to look up the linearization at any time without worrying about whether all po-
final addresses for these comparisons. We implemented such a tential pointers to list elements have been properly updated.
compiler pass, and the resulting software overhead is included in Figure 2 shows an example of list linearization with memory
our performance results. As we will see later in Section 5, this forwarding. Before linearization, the four nodes of the list (i.e.
overhead does not present a problem. nodes A, B, C, and D) are scattered throughout memory such that
2.2. Applications of Memory Forwarding they reside in four separate cache lines, as shown in Figure 2(a).
List linearization packs the four nodes into a contiguous memory
While the act of dereferencing a forwarding address clearly
region starting at location 8000, as shown in Figure 2(b). As a
does not improve performance on its own, the advantage of mem-
result, the four relocated nodes occupy only two cache lines, rather
ory forwarding support is that it enables a wide range of data lay-
than four, thereby potentially eliminating half of the cache misses
out optimizations which can enhance cache performance. Not only
due to this list as we continue to revisit it. Note that the forwarding
are these optimizations useful for mitigating the impact of mem-
addresses and forwarding bits have been set properly such that we
ory latency, they can also be used to conserve memory bandwidth.
will still maintain correct execution even if a stray pointer accesses
We now briefly describe some of these potential optimizations.
a list element at its old address. However, we expect that most
Improving Spatial Locality: A straightforward method of accesses to the list will find it directly at its new address, thereby
actively improving spatial locality is to take data items which are enjoying the enhanced spatial locality.
accessed close together in time, but which are scattered sparsely
throughout the address space, and pack them into adjacent mem-
Increasing Prefetching Effectiveness: The effectiveness of
prefetching for non-numeric applications is largely limited by the
ory locations. This form of data packing makes cache lines much
difficulties in generating prefetch addresses early enough [25]. For
more effective, and it can potentially reduce the number of capac-
example, consider the linked list in Figure 2(a), and assume that
ity, compulsory, and conflict misses. Not only does this improve
we need to prefetch three nodes ahead to hide the entire miss la-
4 In reality, we only need to worry about cases where the pointers po- tency. Therefore, we would want to prefetch node D as soon as
tentially point to the same type of relocatable object. we arrive at node A. However, the problem is that we do not know

91
1063-6897/99/$10.00 (c) 1999 IEEE
the address of node D until we have dereferenced nodes A, B, and R = Read FBit(word* A): Read the forwarding bit of the
C—this is known as the pointer-chasing problem [25]. In con- word at address A into register R.
trast, after the list is linearized, we can trivially prefetch node D R = Unforwarded Read(word* A): Read the value stored
at node A by simply prefetching the next cache line, thus avoid- in the word at address A into register R, with forward-
ing any pointer chasing. (We referred to this technique as data ing disabled. If the word’s forwarding bit is set, this is a
linearization prefetching in an earlier publication [25].) forwarding address; otherwise this is a regular data value.

Reducing Cache Conflicts: Data copying [23] was orig- *A = Unforwarded Write(register word R, bit B ):a
inally proposed to reduce conflict misses within tiled (or Write the value of register R into the word at address A
“blocked”) numeric applications. Since a given tile is reused many and set the word’s forwarding bit to B atomically, with
times after it is brought into the cache, it is particularly problem- forwarding disabled.
atic if different elements within the tile conflict with each other. a If an instruction of the underlying ISA cannot have three operands,
To avoid this problem, the data copying optimization first copies we can have two separate instructions for Unforwarded Write( ,0) R
a tile to a contiguous set of addresses in a temporary array before and Unforwarded Write( ,1). R
using it; since these locations do not conflict with one another, Figure 3. Proposed instruction set extensions to support mem-
the problem is eliminated. Another technique called data color- ory forwarding. (C syntax is used to improve readability.)
ing [11] was proposed as a method of reducing conflict misses in
pointer-based data structures. The idea is to partition the cache into observe that the real performance concern is ensuring that the reor-
logically separate regions (or colors). By relocating data structure ganized data layout actually delivers higher memory performance
elements which are accessed close together in time to separate re- than the original layout.
gions of the cache, conflict misses can be avoided. Memory for-
warding can help facilitate both copying and coloring techniques 3. Implementation Issues
by guaranteeing that they are safe. We now discuss the support that we need from the instruction
set, the hardware, and the software to implement memory forward-
Reducing False Sharing: In cache-coherent shared-memory ing in modern superscalar processors.
multiprocessing systems, false sharing [21, 34] occurs when two
or more processors access distinct data items which happen to fall 3.1. Extensions to the Instruction Set Architecture
within the same cache line (which is the unit of coherence), and To exploit memory forwarding, the machine must have some
at least one access is a write. False sharing can hurt performance way to manipulate the forwarding information—i.e. the forward-
dramatically as the line ping-pongs between processors despite the ing addresses and the forwarding bits. Rather than taking a purely
fact that no real communication is taking place. By relocating hardware-based approach, we propose to extend the underlying in-
those unrelated data items to distinct cache lines, false sharing can struction set architecture (ISA) by adding a few instructions which
be avoided. Memory forwarding would be especially helpful in will allow software to manipulate the forwarding information di-
avoiding false sharing in irregular shared-memory applications, rectly. The advantages of this approach are its programmability
where proving that data items can be safely relocated is difficult. and flexibility. In addition, we expect the software overhead to be
In summary, memory forwarding enables a broad range of low since forwarding information changes relatively infrequently.
relocation-based optimizations; we have presented just a partial Figure 3 shows our proposed ISA extensions, which consist
list of such optimizations. We would also like to emphasize that of three new instructions. Read FBit allows software to check
these optimizations are applicable not only to caches but also to whether a given location contains a forwarding address or actual
the other levels of the memory hierarchy. For example, we can data. Unforwarded Read and Unforwarded Write allow
apply data relocation to improve the spatial locality within pages software to manipulate memory with the forwarding mechanism
(and hence on disk) for out-of-core applications. disabled. For example, in Figure 1(b), a normal Read (i.e. with
the forwarding mechanism enabled) of the word at address 0808
2.3. Performance Issues will get the forwarded value of 0, but an Unforwarded Read of
A relocation-based optimization will improve overall perfor- the same word will get 5808, which is the forwarding address. An
mance if two conditions hold: (i) the new data layout actually pro- Unforwarded Write must change the word and its forwarding
vides better memory performance than the original layout; and (ii) bit atomically in order to preserve data consistency.
the gain in the memory performance outweighs the optimization To demonstrate how software can make use of these new in-
overhead. This overhead includes the extra execution time due structions, Figure 4(a) shows two procedures for relocating a data
to actually relocating the data, and may also include forwarding object of size n words from src to tgt, and then storing tgt as
overhead if any references actually need to be forwarded after the the forwarding address into src. Procedure Relocate() loops
relocation. While the overhead of relocating the data may seem until a clear forwarding bit is read so that tgt will be appended at
to be a concern at first glance, our experimental results indicate the end of the forwarding chain (if any). Figure 4(b) shows a pro-
that it is usually not a problem because relocation is invoked infre- cedure called ListLinearize() (which we will use frequently
quently and modern processors can execute multiple instructions later in our experiments) which calls Relocate() to perform
per cycle. In addition, we find that the performance overhead of list linearization. The parameter head handle is the address of
forwarding is negligible in many cases because most data refer- the list head. Note that the address of the list head (rather than
ences are updated properly and do not need to be forwarded. We its value) is passed into ListLinearize() because we want

92
1063-6897/99/$10.00 (c) 1999 IEEE
other words, we can treat forwarding as an exception. We will de-
sign the hardware to be fast in the common case—i.e. a normal,
(a) Data Relocation non-forwarded reference—and we are less concerned about the
// src = address of the object before relocation
// tgt = address of the object after relocation performance penalty when forwarding is actually invoked, since

f that is rare. Hence a legitimate option is to use a processor’s nor-

// n words = number of words to relocate
void Relocate(word* src, word* tgt, int n words) mal exception handling mechanism to implement forwarding.
boolean relocated;
One hardware requirement that was mentioned earlier in Sec-
f
relocated = Read FBit(src);
while (relocated) tion 2.1 is that we need tagged memory. A number of sys-
// loop until the final address is reached
src = Unforwarded Read(src);
tems which supported tagged memory have been built in the

g
relocated = Read FBit(src); past [29, 32]. One difference with our scheme (as discussed ear-
lier) is that we require less tag storage overhead than previous
g
actualRelocate(src, tgt, n words);
schemes; otherwise, it is quite similar. We now discuss the more

void actualRelocate(word* src, word* tgt, int n words) f novel features of our hardware support in greater detail.

f
// relocate each word in the object Dereferencing Forwarding Addresses: In the presence of
for(; n words > 0; --n words)
word temp; memory forwarding, the data referencing mechanism must be able
// save the content of src to follow forwarding chains of arbitrary lengths. More specifi-
temp = Unforwarded Read(src);
// setup the forwarding address and forwarding bit
cally, when a memory word is accessed by a data reference, its for-
*src = Unforwarded Write(tgt, 1); warding bit is tested. If this bit is set, then the original data address
// copy the original content of src to tgt will be replaced by the contents of the word just accessed (which
*tgt = Unforwarded Write(temp, 0);
// prepare for the next word contains a forwarding address), and a new memory access using

g the forwarding address will be launched. This process repeats un-

src += 1; tgt += 1;

g til a clear forwarding bit is read (we will discuss how cycles might
be handled later in Section 3.2), at which point the data reference
(b) List Linearization can proceed as usual. One option is to implement this dereferenc-
// a pool of space for data relocation ing mechanism purely within hardware; another is to implement it
extern char* memory pool;
using a software-based exception handler (where the exception is
f
// head handle = address of the list head pointer
void ListLinearize(node ** head handle) triggered by accessing a word with its forwarding bit set). With the
node ** handle, *tgt; ISA extensions that we propose, it would be straightforward for a
// start from the list head
software handler to chase the forwarding pointer chain. Although
f
handle = head handle;
while (*handle) the forwarding bit cannot be tested until the memory location is
// grab space from the pool
tgt = (node*)memory pool;
brought into the primary cache, this is no different from the delays
// increment the pool pointer associated with checking ECC or parity bits.
memory pool += sizeof(node);
// relocate the node pointed-to by handle to the address stored in tgt Data Dependence Speculation: One consequence of mem-
Relocate(*handle, tgt, sizeof(node)/sizeof(word));
// append the relocated node to the linearized list
ory forwarding is that we do not know the final data address of a
*handle = tgt; given reference until the reference is nearly completed. This de-
!
// prepare for next node layed generation of the final address poses a potential problem in

g g
handle = &(tgt next);
out-of-order superscalar machines. These machines normally al-
low a load access to proceed before an earlier store, provided that
the load and store are to different addresses. If either address is
Figure 4. Procedures using the proposed ISA extensions to
implement (a) data relocation and (b) list linearization.
unknown, the conservative approach is to delay the load until both
addresses are resolved. With memory forwarding, since the final
to modify the list head to point to the new locations after the re- address of a store is not known until the store actually completes,
location is performed. (This effect was illustrated earlier in Fig- this delay would cause the conservative approach to never execute
ure 2(b), where the value of head is changed to 8000 after the a load ahead of an earlier store.
linearization.) By doing so, the next time that the list is accessed Fortunately, there is a solution to this problem. A technique
via the list head, the new locations will be accessed directly with- called data dependence speculation [12, 30] allows a load to spec-
out touching the old locations. Finally, note that in Figure 4(b), the ulatively execute before an earlier store, even if the store address
new locations for the relocated nodes are allocated from a pool of is unknown. If it turns out that the load was not dependent on
contiguous memory, thereby creating spatial locality. the store, then the speculation succeeds; otherwise, a true depen-
dence has been violated, and the effects of the incorrect specu-
3.2. Hardware Support lation must be undone. Recent out-of-order superscalar proces-
We now discuss the hardware modifications necessary to sup- sors [19, 20, 24] have already implemented some form of data
port memory forwarding. The key insight which helps us keep dependence speculation. With support for data dependence specu-
the hardware simple is that references which actually require for- lation, we can speculate that the final address of a reference will be
warding are expected to occur rarely (if ever). The forwarding the same as its initial address (i.e. we do not expect the reference
mechanism is simply a safety net which allows us to continue to to be forwarded), and therefore the delayed final-address genera-
preserve program correctness in case the unexpected happens. In tion will not degrade performance in the common case where the

93
1063-6897/99/$10.00 (c) 1999 IEEE
reference is not forwarded after all. If forwarding does occur, then in a region of memory to initialize it before making that memory
our speculation would only be incorrect in the case where the load available to an application.
and store had different initial addresses but the same final address.
In our experiments, we observed that incorrect data dependence Deallocating Forwarded Data: When an object is deallo-
speculation almost never occurred; hence it appears to be a very cated, all memory reachable via the chain of forwarding addresses
effective solution to supporting memory forwarding. for that object should be deallocated as well. A simple way to ac-
complish this is to create a wrapper memory-deallocation routine
Handling Forwarding Cycles: A forwarding cycle is cre- which first deallocates all of the memory allocated on the forward-
ated when software erroneously inserts an address more than once ing chain, and then calls the original memory-deallocation routine,
into a forwarding chain. The hardware must have some mech- which can be either a system-provided procedure (e.g., free()
anism for detecting and breaking forwarding cycles; otherwise, in C and delete() in C++) or a user-defined procedure if the
the machine could be stalled forever chasing the forwarding chain. program performs its own memory management.
Detecting forwarding cycles accurately is an expensive operation;
Memory Alignment: Since the minimal granularity of mem-
for each hop, the hardware would have to match the current for-
ory forwarding is a word, software must ensure that two different
warding address against all previous forwarding addresses derefer-
objects which are being relocated to two different destinations do
enced by the same data reference. Because of this high cost—and
not share the same word, since we cannot store two different for-
also because we expect forwarding cycles to be extremely rare—
warding addresses in that same word. In other words, relocatable
we would prefer that the hardware instead perform a fast but pos-
objects must be word-aligned. Enforcing this alignment can be
sibly inaccurate check for a cycle during normal execution, and
accomplished either by specifying the alignment to the memory
only perform accurate cycle detection when it is necessary. One
allocator for dynamically-allocated objects, or else by tuning the
possibility is to predetermine a limit on the number of forward-
alignment option in the compiler if some relocatable objects are
ing hops that are allowed for a given data reference. We simply
statically allocated. In some compilers—e.g., the MIPS C com-
maintain a counter (which could be implemented either in hard-
piler that we used in our experiments—aggregate objects are al-
ware or in software) to keep track of the number of forwarding
ready aligned to word boundaries by default.
hops performed so far, and when this count exceeds the limit, we
raise an exception. The corresponding software exception handler Preserving Outcomes of Pointer Comparisons: The
will then perform an accurate cycle check. If it is a false alarm, compiler is responsible for replacing all pointer comparisons that
then we will reset the counter and resume execution; otherwise, could be affected by relocation with explicit code to look up and
the execution will be aborted. compare final addresses. Pointer analysis techniques [14, 36] can
help the compiler avoid inserting these more costly comparisons
Providing User-Level Traps Upon Forwarding: In addi- by ignoring cases where pointers cannot point to relocated objects.
tion to the system-level exception handlers which might be pro-
vided to support the dereferencing of forwarding addresses and
4. Experimental Framework
the detection of forwarding cycles, it may also be useful to pro-
vide a lightweight user-level trapping mechanism that would be To evaluate the potential performance benefits of memory for-
invoked upon accessing a forwarded location. Such a mechanism warding, we modeled it in a modern processor and used it to en-
would be useful for allowing the application to tune its own per- able a number of relocation-based optimizations which we applied
formance in the following two ways. First, one could write a pro- to a collection of non-numeric applications. We chose to focus on
filing tool to gather forwarding-related statistics for the purpose of non-numeric applications because compilers are mostly unable to
improving the performance of a future execution of the program. guarantee the safety of data relocation in these applications. The
For example, one might record which instructions experienced for- goals of the optimizations that we applied were improving spa-
warding for the sake of eliminating that forwarding in future runs tial locality and prefetching effectiveness. Since current compiler
of the program. Second, a user-level trap handler could be used technology does not support these optimizations (mainly because
to optimize away forwarding (and thereby improve performance) their safety cannot be proven), we added these optimizations to
on-the-fly. For example, one could write a tool that updates stray the applications manually. Table 1 describes the eight applications
pointers on-the-fly to point directly to their correct final addresses, used in our experiments along with the optimizations that we ap-
thereby avoiding the need to invoke the forwarding mechanism plied. All applications were run to completion in our simulations.
again. (Note that one must have application-specific knowledge We added our proposed ISA extensions to the underlying MIPS
in order to do this.) A mechanism similar to informing memory ISA by making use of a few machine instruction sequences that
operation traps [17] could be used for this purpose. never appear in ordinary programs (e.g., loading a value into a reg-
ister which is hardwired to the value zero). We modeled the full
3.3. Software Support performance effects of maintaining and dereferencing the forward-
Having discussed the hardware support for memory forwarding addresses. The “Space Overhead” column shows the amount
ing, we now focus on its impact on software. of virtual memory space for accommodating relocated data; this
amount (ranging from 0.5MB to 14.9MB) does not present a prob-
Initialization of Forwarding Bits: The forwarding bit of a lem in modern machines, and the simulation results include the
memory word must already be clear when it is used by a program impact of this overhead on performance.
for the first time. To guarantee this, the operating system must per- We implemented data dependence speculation in our simu-
form an Unforwarded Write(0,0) operation on all words lator. An ambiguous data dependence is stored in a table until

94
1063-6897/99/$10.00 (c) 1999 IEEE
Table 1. Application characteristics. Note: “Inst. Grad.” is the number of instructions actually graduated. The “combined” miss
rate is the fraction of loads which suffer misses in both the 16KB D-cache and the 512KB L2 cache, using 32B cache lines. “Space
Overhead” is the amount of virtual memory space used for forwarding addresses.

Optimizations Insts. Load Miss Rate Space

Name Description Source Input Data Set Applied Grad. D-Cache Combined Overhead
BH Barnes-Hut’s N-body force Olden [7] 4K bodies Subtree clustering 1472M 2.57% 0.18% 1.7MB
calculation algorithm
Compress Compresses and decompresses SPEC95 A file of 150K Array 546M 10.20% 0.46 % 0.5MB
file in memory characters merging
Eqntott Translation of boolean SPEC92 int pri 3.eqn Packing of hash 1914M 5.22% 0.63% 0.5MB
equations into truth tables table elements
Health Simulation of the Columbian Olden max. level = 5 List 213M 30.66% 16.73% 4.7MB
health care system max. time = 500 linearization
MST Finds the minimum spanning Olden 1K nodes List linearization 302M 8.67% 5.35% 12.0MB
tree of a graph
Radiosity Virtual image rendering using IRISA [27] A scene consisting of 10 List 4552M 3.72% 0.26% 0.6MB
hierarchical radiosity lightly furnished rooms linearization
SMV A symbolic model checker CMU [26] The “dme2.smv” file List 302M 8.78% 3.75% 2.2MB
provided in the package linearization
VIS A verification and synthesis The VIS A reduction of the 8 queens List 432M 12.81% 2.53% 14.9MB
system for finite-state group [3] problem to combinational linearization
hardware systems equivalence checking

Table 2. Simulation parameters. hierarchy (including tag, bank, and bus contention), etc. Table 2
Pipeline Parameters shows the parameters used in our model for the bulk of our experi-
Issue Width 4 ments. Five line sizes—ranging from 32B to 512B—were used in
Functional Units 2 Integer, 2 FP,
2 Memory, 2 Branch our experiments, along with five corresponding sets of miss laten-
Reorder Buffer Size 64 cies (longer lines have longer transfer times).
Integer Multiply 3 cycles We compiled our applications with -O2 optimization using the
Integer Divide 15 cycles
All Other Integer 1 cycle standard MIPS C compilers and the SUIF compiler [35] under
FP Divide 15 cycles IRIX 5.3. For the experiments which required the insertion of soft-
FP Square Root 20 cycles ware prefetches into the source code, we used the SUIF compiler;
All Other FP 2 cycles
Branch Prediction Scheme 2-bit Counters
otherwise, the MIPS compiler was used.

Memory Parameters 5. Experimental Results

Line Sizes 32/64/128/256/512 bytes
I-Cache 16KB, direct-mapped, 2 banks We now present results from our simulation studies. We start
D-Cache 16KB, direct-mapped, 2 banks by evaluating the performance of a number of aggressive locality
Data Victim Buffer 8 32-byte entries optimizations enabled by memory forwarding (which we simply
Miss Handlers (MSHRs) 32 for data and 2 for inst.
Unified S-Cache 512KB, 2-way set-associative, 4 banks refer to as locality optimizations). Next, we study the impact of
Primary-to-Secondary 12/18/30/54/102 cycles (plus these optimizations on prefetching effectiveness. We then examine
Miss Latencies any delays due to contention) the details of individual applications, explaining the optimizations
Primary-to-Memory 75/93/129/201/345 cycles (plus
Miss Latencies any delays due to contention) that we applied to each application. Finally, we study the perfor-
Primary-to-Secondary Bandwidth 16 bytes/cycle mance impact of forwarding overhead for one of the applications.
Secondary-to-Memory Bandwidth 8 bytes/cycle
5.1. Performance of Locality Optimizations
Figure 5 shows the performance of our locality optimizations
the two final data addresses involved in the dependence are deter- for various cache line sizes. Seven of our eight applications are
mined. If the dependence is incorrectly speculated, then the simu- included in Figure 5. We will show the performance of SMV sep-
lator will then re-execute all instructions after (and including) the arately, later in Section 5.4, since it is the only application that is
instruction which had violated the dependence. We replaced the affected by forwarding overhead. For each application in Figure 5,
memory deallocation calls in the applications by calls to our own we show three line sizes, each of which has two cases: the bar on
memory deallocator which first checks whether there is any mem- the left (N) is the original case where no locality optimization is
ory residing in forwarding chains which must be freed. We wrote applied, and the bar on the right (L) is the case with locality opti-
a compiler pass in SUIF [35] that automatically determined which mizations. For all applications except BH, the three line sizes used
pointer comparisons needed to be replaced by final address com- are 32B, 64B, and 128B. For BH, we instead use line sizes of 32B,
parisons. The overhead of executing these replaced comparisons 256B, and 512B, because the optimization applied to BH (subtree
is included in our simulations. clustering) requires a cache line containing at least two tree nodes,
We performed detailed cycle-by-cycle simulations of our appli- and this requires cache lines longer than 128B (this optimization
cations on a dynamically-scheduled, superscalar processor similar will not be turned on for lines shorter than 256B, and that is why
to the MIPS R10000 [38]. Our simulator models the rich details the N and the L bars are identical for the 32B line size in BH).
of the processor including the pipeline, register renaming, the re- Each bar in Figure 5 represents execution time normalized to
order buffer, branch prediction, branching penalties, the memory the N case of the 32B line size, and is broken down into four cat-

95
1063-6897/99/$10.00 (c) 1999 IEEE
244
Normalized Execution Time

240 load stall

220 store stall

200 inst stall

180 busy
|

167
160 159 155
150 149148
|

140
|

127124 126
118 115
120 108 114 109 108 114
|

100100 100 100 95 103 95 97 100 94 100 99 100 98 100 94

100 89 89 90 91
|

79 84
80 70
|

60 55 52
|

40
|

20
|

0
|

N L N L N L N L N L N L N L N L N L N L N L N L N L N L N L N L N L N L N L N L N L
32B 256B 512B 32B 64B 128B 32B 64B 128B 32B 64B 128B 32B 64B 128B 32B 64B 128B 32B 64B 128B
(0%) (3%) (11%) (-8%) (-3%) (1%) (6%) (8%) (12%) (43%) (70%) (123%) (12%) (21%) (27%) (11%) (25%) (63%) (26%) (85%) (161%)
BH Compress Eqntott Health MST Radiosity VIS

Figure 5. Performance of locality optimizations for various cache line sizes (N = not optimized, L = locality optimized).
egories explaining what happened during all potential graduation (a) Number of load D-cache misses
slots.5 The bottom section (busy) is the number of slots when 200 partial misses: misses combined with outstanding misses

|
% of Load D-Cache Misses

146
175 full misses: misses served by the secondary cache

135
|

132
127
instructions actually graduate, the top two sections are any non-

122
150

116
116
|

104
102
100
100

100

100
125

99
98
98
|

92
92
91
graduating slots that are immediately caused by the oldest instruc-

79
100

78
77

74
73
|

63
63
62
61

59
75

49
49
48
|

41
31
50
tion suffering either a load or store miss, and the inst stall section

|
25

|
0
is all other slots where instructions do not graduate. Note that the

|
NL NL NL NL NL NL NL NL NL NL NL NL NL NL NL NL NL NL NL NL NL
32B 256B512B 32B 64B 128B 32B 64B 128B 32B 64B 128B 32B 64B 128B 32B 64B 128B 32B 64B 128B
load stall and store stall sections are only a first-order approxima- BH Compress Eqntott Health MST Radiosity VIS

tion of the performance loss due to cache stalls, since these delays
(b) Total amount of bandwidth consumed
also exacerbate subsequent data dependence stalls. In addition,

1186
1200
there is a percentage in parentheses below each pair of bars rep-
|
% of Bandwidth Consumed 700 between the secondary cache and memory

881

|
1000

498

494
600 between primary caches and the secondary cache

451
resenting the speedup of the optimized over the unoptimized case

396
800 500

372
553

347
|

463
400

279
600

243

226
225
|

for the given line size.

214
203
186
300

175
174

173
168
|

149

149
400

121
111
108
107
|

100

100
200
100
100

95
92

79
Our first observation from Figure 5 is that performance gen-

56
200 100
|

|
0 0
|

|
erally degrades when line size increases, especially for the unop- NL NL NL
32B 256B 512B
NL NL NL
32B 64B 128B
NL NL NL
32B 64B 128B
NL NL NL
32B 64B 128B
NL NL NL
32B 64B 128B
NL NL NL
32B 64B 128B
NL NL NL
32B 64B 128B
BH Compress Eqntott Health MST Radiosity VIS
timized cases. This trend is due to a lack of spatial locality in
these applications, which means that longer lines offer little perfor- Figure 6. Additional performance metrics for the impact of lo-
cality optimizations (N = not optimized, L = locality optimized).
mance advantage. Fortunately, our locality optimizations (which
The y-axes are normalized to the N cases of the 32B line size.
are enabled by memory forwarding) improve the spatial locality
of these application significantly. As we see in Figure 5, the op- essarily suffer the full miss latency. A full miss, on the other hand,
timized cases outperform the unoptimized cases for the same line does not combine with any access and therefore suffers the full la-
sizes in all applications except Compress, and the speedups in- tency. Figure 6(a) clearly demonstrates that the improved spatial
crease along with line size. The performance improvement can locality offered by locality optimizations reduces the miss count
be dramatic—with 128B lines, Health and VIS enjoy more substantially, with more than a 35% reduction in misses in 11 out
than twofold speedups. Among our optimizations, list lineariza- of the 21 cases (seven applications with three line sizes each). In
tion is particularly powerful since it improves the performance of many cases, both partial misses and full misses are reduced, and
Health, MST, Radiosity, and VIS substantially. It is inter- hence the total miss penalty decreases accordingly.
esting to note that in Health, the absolute performance of the Figure 6(b) shows another useful performance metric: the total
optimized cases increases along with line size. This is due to amount of bandwidth consumed by our applications. Each bar in
the prefetching benefits of long cache lines after spatial locality Figure 6(b) denotes the total number of bytes transferred between
is greatly improved. Compress is an exceptional case where the the primary and secondary caches (the bottom section), and the
locality gets worse in the optimized case for 32B and 64B lines. amount transferred between the secondary cache and main mem-
We also observe from Figure 5 that the instruction overhead of ory (the top section). Again, each bar is normalized to the N case
these locality optimizations is usually low, which suggests that of the 32B line size. It is clear from Figure 6(b) that locality op-
these optimizations could be invoked even more frequently dur- timizations reduce the bandwidth consumption in nearly all cases,
ing the execution to further improve the data layout. and achieve a bandwidth reduction of twofold or more in a few
While execution time is the most important performance met- cases. Thus we see that these optimizations deliver not only higher
ric, further insight can also be gained by examining the impact on performance, but also reduced bandwidth consumption.
total cache misses. Figure 6(a) shows the number of load D-cache
misses in the unoptimized and optimized cases for different line 5.2. Impact on the Effectiveness of Prefetching
sizes. Each bar is normalized to the N case of the 32B line size, We now turn our attention to the interaction between our lo-
and is divided into two categories indicating how a D-cache miss cality optimizations and the effectiveness of prefetching. Based
is serviced. A partial miss is a D-cache miss that combines with on a profile of each application, we added software prefetches
an outstanding miss to the same line, and therefore does not nec- for a few static loads that suffer significantly from cache misses.
5 The number of graduation slots is the issue width (4 in this case) mul- Prefetches are inserted at the earliest points in the program where
tiplied by the number of cycles. We focus on graduation rather than issue the prefetch addresses are known (this is done in an identical fash-
slots to avoid counting speculative operations that are squashed. ion for both the original and locality optimized cases). We assume

96
1063-6897/99/$10.00 (c) 1999 IEEE
load stall (a) Original (b) Optimized
Normalized Execution Time

180
|

160 store stall

116
140 inst stall

108

107
|

101

101
100
100

100

100
100

100
120
97

busy

95
95

93
|

90
89

88
86
hash table

82
100

79
|

70
64
80 PTERM
|

60 short integers
|

29
40
|

20 i
|

0
|

N L NP LP N L NP LP N L NP LP N L NP LP N L NP LP N L NP LP N L NP LP
(0%) (-3%) (-8%)(-6%) (6%) (8%) (43%)(118%) (12%)(10%) (11%)(14%) (26%)(42%) i i+1
BH Compress Eqntott Health MST Radiosity VIS
i+1
Figure 7. Performance impact of locality optimizations on
prefetching. (N = not optimized, L = locality optimized, NP
= prefetching without locality optimizations, LP = prefetching
with locality optimizations). Figure 8. Locality optimization for Eqntott (objects in the
same shaded region are allocated to contiguous memory).
that a single prefetch instruction can prefetch one or more consec-
utive cache lines (i.e. block prefetching is supported). For both the
(a) Original (b) Subtrees clustered
unoptimized and optimized cases, we experimented with a range
of prefetch block sizes, and we report the results with the block
size that performed the best for each case.
Figure 7 shows how prefetching performs both with (LP) and
without (NP) locality optimizations. For the sake of comparison,
the N and L cases from Figure 5 are also included in Figure 7. The
cache line size is fixed at 32B. We observe from Figure 7 that the Figure 9. Example of the subtree clustering applied to BH
performance of prefetching is improved by locality optimizations (nodes in the same shaded region are in the same cache line).
in five applications, and two of them (VIS and Health) enjoy that most functions in this library return pointers to list elements,
speedups of over 40%. We note that four of these five applica- which can be scattered across any of the over hundred source files
tions operate heavily on linked lists, and previous research [25] has of VIS. The program behave incorrectly if after a list is linearized,
shown that prefetching linked lists—especially those that are short it is later accessed using a pointer to the middle of the list that
and traversed within small loop bodies—is particularly difficult existed before the linearization. Fortunately, memory forwarding
because of the pointer-chasing problem. As we can see in Figure 7, allows us to simply ignore this hazard, thereby safely resulting in
the list linearization optimization is quite successful in alleviating an over twofold performance gain with 128B lines.
this problem. With the exception of VIS (which experiences con-
siderable prefetching overhead), in the remaining four out of five Eqntott: The most interesting data structure in Eqntott is a
applications where the locality is substantially improved, combin- hash table which stores pointers to a record of type PTERM. A
ing locality optimizations and prefetching (LP) performs better PTERM record in turn contains a pointer to an array of short in-
than either technique alone (most noticeably in Health). There- tegers. The original layout of this data structure is shown in Fig-
fore, it appears that prefetching and our locality optimizations are ure 8(a). We optimize the locality by (i) relocating a PTERM record
complementary in nature. and its short integer array into a single chunk of memory, and (ii)
putting these chunks into contiguous memory locations in increas-
5.3. Case Studies ing order of the hash index. The optimized layout is shown in
Having studied the overall performance, we now look at the Figure 8(b). This relocation optimization is invoked only once in
individual applications in more detail. the program, immediately after the hash table is constructed.
Health, MST, Radiosity, and VIS: We apply the same lo- BH: In BH, an octree is constructed and then traversed at each
cality optimization to all four of these applications: list lineariza- time step of the N-body force calculation. The octree is con-
tion. The structure of the linked lists used in these applications is structed in a depth-first order, but the traversal order is fairly ran-
modified throughout the program execution, and therefore list lin- dom. We improve the locality of the traversal by clustering non-
earization is invoked periodically. To make our discussion more leaf nodes of the tree. We do not cluster leaf nodes since they are
concrete, we use VIS as a representative example. VIS is a large actually linked together by a list and accessed via list traversals.
application, consisting of more than 150,000 lines of C code. This Subtree clustering [11] attempts to pack nodes of a subtree into a
program makes extensive use of a generic list library which imple- cache line, in the most balanced form. Locality will be improved
ments many common list operations. Our optimizations are local- if the next node to be visited—which can be any of the children
ized within this library. We optimize the locality of list processing of the current node—is already in the current cache line. Figure 9
as follows. We add a counter field to the head record of each list illustrates this optimization using a binary tree. Figure 9(a) shows
to count how many insertion or deletion operations have been per- the original memory layout of the tree, which was created using a
formed on the list since the last time that the list was linearized. pre-order traversal, and Figure 9(b) shows the memory layout af-
The list linearization procedure ListLinearize()—shown ter subtree clustering. Since a non-leaf node in BH is 78B long, we
earlier in Figure 4(b)—is invoked whenever the list’s counter ex- need cache lines of 256B or longer to do meaningful clustering.
ceeds a threshold, which was arbitrarily set to 50 in our experi-
ments. The counter is reset after each linearization. Despite the Compress: The most relevant data structures in Compress
simplicity and usefulness of this optimization, performing it with- are two hash tables, namely htab and codetab, which are im-
out the support of memory forwarding is dangerous due to the fact plemented using arrays. Indices to htab are computed through

97
1063-6897/99/$10.00 (c) 1999 IEEE
(a) Execution time (b) Miss count ally required. While this latter case is not achievable, it represents
160 150 a useful bound on performance. As we see in Figure 10(a), the

|
Normalized Execution Time

% of Load/Store D-Cache Misses

164 load stall partial misses
160 store stall 140 full misses
|

|
140 inst stall 121 performance of scheme L is degraded by forwarding in two ways.
|

120

|
busy
120 100 100
|

100 100 93 90 First, the act of dereferencing forwarding addresses incurs extra

|
100 97
|

|
80
time. Second, when forwarding occurs, both the old and new loca-
|

|
60
|

40
tions of relocated data are accessed, thereby degrading cache be-

|
40
|

20 20

|
|

0 0 havior. With perfect forwarding, there is no forwarding overhead

|
N L Perf N L Perf N L Perf
Loads Stores and the performance does improve. However, the improvement is
only marginal due to the fact that we cannot optimize the layout to
(c) Distribution of number (d) Average load and accelerate both the hash table and tree access patterns.
of forwarding hops store latencies To provide further insight into the source of the forwarding
10
|
% of Loads/Stores

Number of Cycles

100 98.3 forwarding

92.3 9
overhead, Figure 10 presents three additional performance met-
|

| 8.3 ordinary
8
|

80
rics. Figure 10(b) shows the impact of the schemes on the num-
|

7
|

6 5.6
|

60 5.2 4.9
|

5 ber of load and store data cache misses. As we see in this figure,
|

4 3.5
|

40 3.2
|

3 scheme L suffers a noticeable increase in misses. Figure 10(c)

20 2
|
|

7.7 1 shows that 7.7% of loads and 1.7% of stores require one forward-
|

1.7
0 0
|

0
Loads
1 0 1
Stores
N L Perf N L Perf ing hop under scheme L. Finally, Figure 10(d) shows the average
Loads Stores
Number of Forwarding Hops number of CPU cycles needed to complete a load or store under
Figure 10. Performance results for SMV. (N = not optimized, each scheme. Each bar in Figure 10(d) is divided into two sec-
L = locality optimized with realistic forwarding, Perf = locality tions explaining the reason for the stall. The forwarding section
optimized with perfect forwarding). The line size is fixed at 64B. represents the time spent dereferencing forwarding addresses, and
the ordinary section includes cache hit and miss latencies. The
hashing, but codetab always shares the same index values
ordinary sections of scheme L increase due to the cache pollution
as htab. Therefore, spatial locality might be improved if
effects of touching the forwarding pointers, as mentioned earlier.
codetab[i] could be next to htab[i] in the memory. We
As we see in Figure 10(d), both the latency of dereferencing a for-
achieve this by copying the two tables into a single larger table T
warding address and its resulting cache pollution effects play sig-
such that htab[i] and codetab[i] occupy adjacent elements
nificant roles in the overall performance degradation. A profiling
in T. However, as we have already seen in Figure 5, performance
tool based on user-level traps (as discussed earlier in Section 3.2)
is in fact degraded by this optimization for 32B and 64B lines due
could potentially identify cases such as this where forwarding oc-
to worse locality than in the original code.
curs too frequently.
5.4. Impact of Forwarding Overhead
In each of the applications that we have studied so far, we were 6. Conclusions
successful enough at updating the appropriate pointers to point to
a relocated object’s new location that the forwarding mechanism As changes in technology continue to alter the landscape of
was almost never invoked. (At the same time, we would like to what constitutes a major performance bottleneck, it is sometimes
point out that without memory forwarding support, we would not worth re-examining old architectural ideas that have fallen out of
have been able to apply these optimizations because they were not fashion to see whether they can be adapted to serve completely
provably safe.) As a result, the performance of dereferencing a new purposes. In this paper, we have examined such a technique:
forwarding address did not matter in these cases. To quantify the memory forwarding. Although the original concept was proposed
impact of forwarding overhead in a case where it does matter, we to facilitate garbage collection in early Lisp machines, we have
now focus on SMV, which is the only application we studied that demonstrated that memory forwarding can be adapted to address
experiences significant forwarding after data relocation. the entirely modern problem of enhancing cache performance. In
SMV is a model checking program which makes extensive use addition, we have shown that it is quite feasible to implement
of Binary Decision Diagrams (BDDs) [4]. The BDD nodes are this mechanism within modern out-of-order superscalar proces-
connected both through a hash table and through binary trees. The sors, largely because forwarding can be treated as an exception.
hash table is organized as an array of buckets pointing to linked By liberating aggressive relocation-based data layout optimiza-
lists. Since more cache misses occur during hash table accesses tions from concerns over violating program correctness, memory
than binary tree accesses, we attempted to improve locality by lin- forwarding can enable impressive performance gains: we observe
earizing the lists stored in the hash table. Unfortunately, since our over twofold speedups in two applications. These optimizations
optimized code is not able to update the tree pointers to point to are useful not only for hiding memory latency, but also for re-
a relocated object’s new address, forwarding does occur whenever ducing memory bandwidth consumption. Although one must still
relocated BDD nodes are accessed via the tree pointers. exercise caution not to use forwarding carelessly, a user-level trap
Figure 10 shows our performance results for SMV. In addition mechanism can help identify and avoid cases where pointers have
to the cases without (N) and with (L) locality optimization, as not been updated successfully. In summary, memory forwarding
shown in earlier graphs, we also show a case with locality opti- is a powerful tool which makes a large class of optimizations that
mization and perfect forwarding (Perf). We say that memory for- were promising in theory useful in practice. Its applicability ex-
warding is perfect if all references to relocated objects access them tends beyond caches to the rest of the memory hierarchy (e.g.,
directly at their new addresses, and hence no forwarding is actu- disks), and we advocate that it be supported in future processors.

98
1063-6897/99/$10.00 (c) 1999 IEEE
7. Acknowledgments [20] IBM. PowerPC 620 Risc Microprocessor Technical Summary, Octo-
ber 1994.
We thank Daniel Meneveaux for providing his radiosity pro-
gram. Chi-Keung Luk is partially supported by a Canadian Com- [21] T.E. Jeremiassen and S.J. Eggers. Reducing false sharing on shared
memory multiprocessors through compile time data transformations.
monwealth Fellowship. Todd C. Mowry is partially supported by
In Proceedings of PPoPP’95, July 1995.
an Alfred P. Sloan Research Fellowship, and by a Faculty Devel-
opment Award from IBM. [22] R. E. Kessler and M. D. Hill. Page placement algorithms for large
real-indexed caches. ACM Trans. on Comp. Sys., 10(4), Nov 1992.

References [23] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance

and optimizations of blocked algorithms. In ASPLOS-IV, pages 63–
[1] H. G. Baker. List processing in real time on a serial computer. Com- 74, April 1991.
mun. ACM, 21(4):280–294, April 1978. [24] D. Leibholz and R. Razdan. The Alpha 21264: A 500 MHz Out-of-
[2] D. G. Bobrow and D. W. Clark. Compact encodings of list structure. Order Execution Microprocessor. In IEEE CompCon’97, February
ACM TOPLAS, 1(2):267–286, October 1979. 1997.
[3] R. K. Brayton, G. D. Hachtel, A. S. Vincentelli, F. Somenzi, A. Aziz, [25] C.-K. Luk and T. C. Mowry. Compiler-based prefetching for recur-
S. Cheng, S. Edwards, S. Khatri, Y. Kukimoto, A. Pardo, S. Qadeer, sive data structures. In ASPLOS-VII, pages 222–233, October 1996.
R. Ranjan, S. Sarwary, T. R. Shilpe, G. Swamy, and T. Villa. VIS: [26] K. L. McMillan. The SMV system. Carnegie-Mellon University, Feb
a system for verification and synthesis. In Proceedings of the 8th 1992.
International Conference on Computer Aided Verification, July 1996.
[27] D. Meneveaux, K. Bouatouch, and E. Maisel. Memory management
[4] R. E. Bryant. Graph-based algorithms for Boolean function manipu- schemes for radiosity computation in complex environment. Techni-
lation. IEEE Trans. on Comp., C-35(8):677–691, 1986. cal Report PI 1097, IRISA/INRIA, 1996.
[5] D. Burger, J. R. Goodman, and A. Kagi. Memory bandwidth lim- [28] M. L. Minsky. A Lisp garbage collector algorithm using serial sec-
itations of future microprocessors. In ISCA’96, pages 78–89, May ondary storage. Technical Report Memo 58 (rev.), Project MAC,
1996. M.I.T., Cambridge, Mass., 1963.
[6] B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data [29] D. A. Moon. Architecture of the symbolics 3600. In ISCA’85, pages
placement. In ASPLOS-VIII, October 1998. 76–83, 1985.
[7] M. C. Carlisle and Anne Rogers. Software caching and computation [30] A. Moshovos, S. Breach, T. Vijaykumar, and G. Sohi. Dynamic spec-
migration in olden. In Proceedings of PPoPP’95, pages 29–38, July ulation and synchronization of data dependences. In ISCA’97, pages
1995. 181–193, June 1997.
[8] S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations [31] T. C. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of
for improving data locality. In ASPLOS-VI, pages 252–262, October a compiler algorithm for prefetching. In ASPLOS-V, pages 62–73,
1994. October 1992.
[9] T.-F. Chen and J.-L. Baer. Effective hardware-based data prefetching [32] G. S. Taylor, P. N. Hilfinger, J. R. Larus, D. A. Patterson, and B. G.
for high-performance processors. IEEE Trans. on Comp., 44(5), May Zorn. Evaluation of the SPUR Lisp architecture. In ISCA’86, pages
1995. 444–452, 1986.
[10] C. J. Cheney. A nonrecursive list compacting algorithm. Commun. [33] O. Temam, E. D. Granston, and W. Jalby. To copy or not to copy: A
ACM, 13(11):677–678, Nov 1970. compile-time technique for assessing when data copying should be
[11] T. M. Chilimbi, J. R. Larus, and M. D. Hill. Tools for cache- used to eliminate cache conflicts. In Proceedings of Supercomput-
ing’93, pages 410–419, November 1993.
conscious data structures. In PLDI’99, May 1999.
[34] J. Torrellas, M. S. Lam, and J. L. Hennessy. False sharing and spatial
[12] G. Chrysos and J. Emer. Memory dependency prediction using store
locality in multiprocessor caches. IEEE Trans. on Comp., 43(6):651–
sets. In ISCA’98, pages 142–153, June 1998.
663, 1994.
[13] D. W. Clark. List structure: measurements, algorithms, and encod-
[35] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M.
ings. PhD thesis, Carnegie-Mellon University, August 1976.
Anderson, S. W.K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall,
[14] R. Ghiya and L. J. Hendren. Is it a Tree, a DAG, or a Cyclic Graph? M. S. Lam, and J. L. Hennessy. SUIF: An infrastructure for research
A shape analysis for heap-directed pointers in C. In Proceedings of on parallelizing and optimizing compilers. ACM SIGPLAN Notices,
the 23rd Annual ACM SIGPLAN-SIGACT Symposium on Principles 29(12), Dec 1994.
of Programming Languages, pages 1–15, January 1996.
[36] R. P. Wilson and M. S. Lam. Efficient context-sensitive pointer anal-
[15] R. Greenblatt. The LISP Machine. Technical Report Working Paper ysis for C programs. In PLDI’95, pages 1–12, June 1995.
79, M.I.T. Artificial Intelligence Laboratory, November 1974.
[37] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In
[16] W. J. Hansen. Compact list representation: Definition, garbage col- PLDI’91, pages 30–44, June 1991.
lection, and system implementation. Commun. ACM, 12(9):499–507,
[38] K. Yeager. The MIPS R10000 superscalar microprocessor. IEEE
September 1969.
Micro, April 1996.
[17] M. Horowitz, M. Martonosi, T. C. Mowry, and M. D. Smith. Inform-
ing memory operations: Providing memory performance feedback in
modern processors. In ISCA’96, pages 260–270, May 1996.
[18] A.S. Huang and J. P. Shen. A limit study of memory requirements
using value reuse profiles. In MICRO-28, pages 71–81, Dec 1995.
[19] D. Hunt. Advanced performance features of the 64-bit PA-8000. In
IEEE CompCon’95, March 1995.

99
1063-6897/99/$10.00 (c) 1999 IEEE

GDC2003 Memory Optimization 18mar03
No ratings yet
GDC2003 Memory Optimization 18mar03
60 pages
Lecture 11. Memory Hierarchy
No ratings yet
Lecture 11. Memory Hierarchy
107 pages
Trabalho Faculdade
No ratings yet
Trabalho Faculdade
5 pages
Data Memory Organization and Optimizations in Application-Specific Systems
No ratings yet
Data Memory Organization and Optimizations in Application-Specific Systems
13 pages
Onur 447 Spring15 Lecture19 High Performance Caches Afterlecture
No ratings yet
Onur 447 Spring15 Lecture19 High Performance Caches Afterlecture
57 pages
Chapter 2 Neede For Guide Line Help From Smiw
No ratings yet
Chapter 2 Neede For Guide Line Help From Smiw
7 pages
Chapter 6
No ratings yet
Chapter 6
37 pages
Cache
No ratings yet
Cache
35 pages
L - 3-AssociativeMapping - Virtual Memory
No ratings yet
L - 3-AssociativeMapping - Virtual Memory
52 pages
Cache PPT
No ratings yet
Cache PPT
38 pages
CS530 Fall2015 Lecture7
No ratings yet
CS530 Fall2015 Lecture7
7 pages
Cache Optimization Techniques
No ratings yet
Cache Optimization Techniques
23 pages
Lec8 Memory
No ratings yet
Lec8 Memory
17 pages
Memory 2
No ratings yet
Memory 2
31 pages
CH10 - Memory Hierarchy
No ratings yet
CH10 - Memory Hierarchy
106 pages
Advanced Computer Architecture
No ratings yet
Advanced Computer Architecture
21 pages
10 Caches
No ratings yet
10 Caches
34 pages
Memory-Management Strategies
No ratings yet
Memory-Management Strategies
70 pages
Coa PPT
No ratings yet
Coa PPT
158 pages
Lecture 5: Memory Hierarchy and Cache Traditional Four Questions For Memory Hierarchy Designers
No ratings yet
Lecture 5: Memory Hierarchy and Cache Traditional Four Questions For Memory Hierarchy Designers
10 pages
Lecture 7
No ratings yet
Lecture 7
21 pages
CA L7a SoftwareTechniquesToImproveCachePerformance
No ratings yet
CA L7a SoftwareTechniquesToImproveCachePerformance
15 pages
CH04 COA10e
No ratings yet
CH04 COA10e
46 pages
Conspect of Lecture 7
No ratings yet
Conspect of Lecture 7
13 pages
Computer Engineering Students
No ratings yet
Computer Engineering Students
17 pages
Memory Management Techniques
No ratings yet
Memory Management Techniques
18 pages
L-4 (Cache Memory)
No ratings yet
L-4 (Cache Memory)
61 pages
Usha Mittal Institute of Technology SNDT Women'S University: MUMBAI - 400049
No ratings yet
Usha Mittal Institute of Technology SNDT Women'S University: MUMBAI - 400049
19 pages
Mapping Functions
No ratings yet
Mapping Functions
23 pages
Cache Memory Architecture Guide
No ratings yet
Cache Memory Architecture Guide
65 pages
Computer Memory Hierarchy Guide
No ratings yet
Computer Memory Hierarchy Guide
7 pages
Os Internal 2 Notes
No ratings yet
Os Internal 2 Notes
30 pages
Lecture - 6 CSE332 NJL
No ratings yet
Lecture - 6 CSE332 NJL
19 pages
Computer Architecture: Cache Memory
No ratings yet
Computer Architecture: Cache Memory
28 pages
Chapter 05 Computer Organization and Design, Fifth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design) 5th Edition
75% (4)
Chapter 05 Computer Organization and Design, Fifth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design) 5th Edition
105 pages
William Stallings Computer Organization and Architecture 9 Edition
No ratings yet
William Stallings Computer Organization and Architecture 9 Edition
46 pages
Unit 3 - Memory Management
No ratings yet
Unit 3 - Memory Management
38 pages
Chapter 05
No ratings yet
Chapter 05
105 pages
CH04 Cache Memory
No ratings yet
CH04 Cache Memory
44 pages
Lec 2
No ratings yet
Lec 2
26 pages
Memory Mar 4
No ratings yet
Memory Mar 4
4 pages
Lec2 PDF
No ratings yet
Lec2 PDF
21 pages
Lecture 3 - Memory Management and Virtual Memory
No ratings yet
Lecture 3 - Memory Management and Virtual Memory
8 pages
4.chap6 Lookahead Vs Look Through Cache
No ratings yet
4.chap6 Lookahead Vs Look Through Cache
23 pages
25 e 50 Beb 5 Aad 8 F 60
No ratings yet
25 e 50 Beb 5 Aad 8 F 60
49 pages
Chapter 6: Memory: - CPU Accesses Memory at Least Once Per Fetch-Execute Cycle: - Memory Is Organized Into A Hierarchy
No ratings yet
Chapter 6: Memory: - CPU Accesses Memory at Least Once Per Fetch-Execute Cycle: - Memory Is Organized Into A Hierarchy
25 pages
CH04 COA10e
No ratings yet
CH04 COA10e
41 pages
Cache Memory & Design Principles
No ratings yet
Cache Memory & Design Principles
47 pages
Chapter 05
No ratings yet
Chapter 05
105 pages
Cache and Caching: Electrical and Electronic Engineering
No ratings yet
Cache and Caching: Electrical and Electronic Engineering
15 pages
ch5 2
No ratings yet
ch5 2
61 pages
10 Cache Memories
No ratings yet
10 Cache Memories
49 pages
Memory Subsystem Notes-1
No ratings yet
Memory Subsystem Notes-1
8 pages
Large and Fast: Exploiting Memory Hierarchy: The Hardware/Software Interface
No ratings yet
Large and Fast: Exploiting Memory Hierarchy: The Hardware/Software Interface
33 pages
Module 5
No ratings yet
Module 5
17 pages
Help 2
No ratings yet
Help 2
102 pages
Unit 4
No ratings yet
Unit 4
72 pages
Cache and Caching: Electrical and Electronic Engineering
No ratings yet
Cache and Caching: Electrical and Electronic Engineering
15 pages
Important Questions in Mobile Communication It2402 It 2402 Subject For Nov
No ratings yet
Important Questions in Mobile Communication It2402 It 2402 Subject For Nov
3 pages
Windows Memory Management
No ratings yet
Windows Memory Management
31 pages
OS45
No ratings yet
OS45
6 pages
#OS Lecture Note 3 Memory Management
No ratings yet
#OS Lecture Note 3 Memory Management
64 pages
Computer Education B.ed
No ratings yet
Computer Education B.ed
31 pages
MC0070
No ratings yet
MC0070
17 pages
Operating System Basics & Features
No ratings yet
Operating System Basics & Features
19 pages
C and C++ Programming Concepts and Data Structures Author(s) :P. S. Subramanyam Publisher: BS Publications/BSP Books ISBN: 9788178002842
No ratings yet
C and C++ Programming Concepts and Data Structures Author(s) :P. S. Subramanyam Publisher: BS Publications/BSP Books ISBN: 9788178002842
31 pages
IT Exam: Computer & Network Tech
No ratings yet
IT Exam: Computer & Network Tech
4 pages
Digital Principles and Computer Organization - CS3351 - Important Questions With Answer - Unit 5 - Memory and IO
No ratings yet
Digital Principles and Computer Organization - CS3351 - Important Questions With Answer - Unit 5 - Memory and IO
13 pages
Memory Hierarchies in Computing
No ratings yet
Memory Hierarchies in Computing
53 pages
Platform Technologies Module 5
No ratings yet
Platform Technologies Module 5
29 pages
Unit III Virtualization
No ratings yet
Unit III Virtualization
31 pages
B.Tech. (CSE) - DS Semester-4
No ratings yet
B.Tech. (CSE) - DS Semester-4
24 pages
Virtual Memory in Operating System - GeeksforGeeks
No ratings yet
Virtual Memory in Operating System - GeeksforGeeks
16 pages
(Almost) Everything You Need To Know For OCR A Level Computer Science (v1.0)
No ratings yet
(Almost) Everything You Need To Know For OCR A Level Computer Science (v1.0)
30 pages
Practice Set DPVM
No ratings yet
Practice Set DPVM
2 pages
CS604 Grand Quiz by Junaid
No ratings yet
CS604 Grand Quiz by Junaid
15 pages
Unit 3.2 Address Spaces, Paging
No ratings yet
Unit 3.2 Address Spaces, Paging
44 pages
C Program for FIFO & LRU Page Replacement
No ratings yet
C Program for FIFO & LRU Page Replacement
7 pages
5.memory Management
No ratings yet
5.memory Management
20 pages
OS Merged
No ratings yet
OS Merged
97 pages
Types of Computer Operations
No ratings yet
Types of Computer Operations
3 pages
CH 02 - OS8e
No ratings yet
CH 02 - OS8e
91 pages
OS Concepts Quiz for CS Students
No ratings yet
OS Concepts Quiz for CS Students
18 pages
VBI Virtual Block Interface - Isca20
No ratings yet
VBI Virtual Block Interface - Isca20
14 pages
Operating System Questions With Their Answers: (Memory Management, Virtual Memory, Processes Synchronization
No ratings yet
Operating System Questions With Their Answers: (Memory Management, Virtual Memory, Processes Synchronization
14 pages
Memory Management
No ratings yet
Memory Management
27 pages
Enterprise IT Troubleshooting Cross Functional IT Problem Solving
No ratings yet
Enterprise IT Troubleshooting Cross Functional IT Problem Solving
399 pages
Memory and I/O System Overview
No ratings yet
Memory and I/O System Overview
89 pages

Memory Forwarding: Enabling Aggressive Layout Optimizations by Guaranteeing The Safety of Data Relocation

Uploaded by

Memory Forwarding: Enabling Aggressive Layout Optimizations by Guaranteeing The Safety of Data Relocation

Uploaded by

Memory Forwarding: Enabling Aggressive Layout Optimizations

by Guaranteeing the Safety of Data Relocation

Abstract Cache performance depends on two factors: when data items

f that is rare. Hence a legitimate option is to use a processor’s nor-

g the forwarding address will be launched. This process repeats un-

Optimizations Insts. Load Miss Rate Space

Memory Parameters 5. Experimental Results

240 load stall

220 store stall

200 inst stall

100100 100 100 95 103 95 97 100 94 100 99 100 98 100 94

for the given line size.

160 store stall

% of Load/Store D-Cache Misses

0 0 havior. With perfect forwarding, there is no forwarding overhead

100 98.3 forwarding

3 scheme L suffers a noticeable increase in misses. Figure 10(c)

References [23] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance

You might also like