Cache Replacement Policies
Akanksha Jain
The University of Texas at Austin
Calvin Lin
The University of Texas at Austin
Morgan & Claypool Publishers
ABSTRACT
This book summarizes the landscape of cache replacement policies for CPU data
caches. Our emphasis is on algorithmic issues, so we start by defining a taxonomy
that places previous policies into two broad categories, which we refer to as Coarse-
Grained policies and Fine-Grained policies. Each of these categories is then subdivided into three more categories, each of which describes a different approach to solving the cache replacement problem. Having defined this taxonomy, we then describe
in more detail each category of the taxonomy, along with summaries of significant
work in each category. We then describe work that considers richer factors, includ-
ing solutions that optimize for metrics beyond cache miss rates, solutions tailored to
multi-core settings, solutions that consider interactions with prefetchers, and solu-
tions that consider new memory technologies. We conclude by discussing trends and
challenges for future work.
KEYWORDS
hardware caches, CPUs, cache replacement policies
Contents
Preface
Acknowledgments
1 Introduction
5 Richer Considerations
5.1 Cost-Aware Cache Replacement
5.1.1 Memory Level Parallelism (MLP)
5.2 Criticality-Driven Cache Optimizations
5.2.1 Critical Cache
5.2.2 Criticality-Aware Multi-Level Cache Hierarchy
5.3 Multi-Core-Aware Cache Management
5.3.1 Cache Partitioning
5.3.2 Shared-Cache-Aware Cache Replacement
5.4 Prefetch-Aware Cache Replacement
5.4.1 Cache Pollution
5.4.2 Deprioritizing Prefetchable Lines
5.5 Cache Architecture-Aware Cache Replacement
5.5.1 Inclusion-Aware Cache Replacement
5.5.2 Compression-Aware Cache Replacement
5.6 New Technology Considerations
5.6.1 NVM Caches
5.6.2 DRAM Caches
6 Conclusions
Author's Biography
Preface
We have written this book for those who wish to understand the state-of-the-
art in cache replacement policies, so this book attempts to organize the space of
solutions and make sense of the many different approaches that have been explored
in the literature. In doing so, we also hope to identify trends and issues that will be
important in the future.
We have intentionally chosen to focus on algorithmic issues, so we do not go
into details about hardware implementation.
We assume that readers have a basic undergraduate understanding of computer
architecture and caches.
Acknowledgments
We thank Aamer Jaleel and the anonymous reviewer for their valuable feedback.
We also thank Margaret Martonosi and Michael Morgan for their encouragement in
writing this book and for helpful guidance and support throughout this process. This
effort was funded in part by NSF Grant CCF-1823546 and a gift from Intel Corpora-
tion through the NSF/Intel Partnership on Foundational Microarchitecture Research,
and it was funded in part by a grant from Huawei Technologies.
CHAPTER 1
Introduction
For decades now, the latency of moving data has greatly exceeded the latency of
executing an instruction, so caches, which both reduce memory latency and reduce
memory traffic, are important components of all modern microprocessors. Because
there is a general tradeoff between the size of a cache and its latency, most micro-
processors maintain a hierarchy of caches, with smaller lower-latency caches being
fed by larger higher-latency caches, which is eventually fed by DRAM. For each of
these caches, effectiveness can be measured by its hit rate, which we define to be s/r,
where s is the number of memory requests serviced by the cache and r is the total
number of memory requests made to the cache.
There are several methods of improving a cache’s hit rate. One method is to
increase the size of the cache, typically at the expense of increased latency. A second
method is to increase the associativity of the cache, which increases the number of
possible cache locations to which a cache line can be mapped. At one extreme, a
direct mapped cache (associativity of 1) maps each cache line to a single location
in the cache. At the other extreme, a fully associative cache allows a cache line to
be placed anywhere in the cache. Unfortunately, power consumption and hardware
complexity both increase as we increase associativity. The third method, which is the
subject of this book, is to choose a good cache replacement policy, which answers
the question, “When a new line is to be inserted into the cache, which line should
be evicted to make space for the new line?"
It may seem strange to write a book on cache replacement, since Laszlo Belady
produced a provably optimal policy over five decades ago. But Belady's policy is
unrealizable because it relies on future knowledge—it evicts the line that will be re-
used furthest in the future. Thus, over the years, researchers have explored several
different approaches to solving the cache replacement problem, typically relying on
heuristics that consider frequency of access, recency of access, and more recently,
prediction techniques.
Moreover, there is the question of how cache replacement policies relate to
dead block predictors, which attempt to predict lines that will no longer be needed
in the cache. We now know that the life cycle of a cache line has multiple deci-
sion points, beginning with the insertion of the line and progressing over time to the
eventual eviction of the line, and we know that there are different techniques for
performing actions at these different decision points. With this view, we argue that
dead block predictors are a special case of cache replacement policies, and we find
that the space of cache replacement policies is quite rich.
Finally, caches are ubiquitous in software systems as well. In fact, the first re-
placement policies were developed for the paging systems of operating systems, and
while there are technological differences between software caches and hardware
caches, we hope that some of the ideas in this book will prove useful for the de-
velopers of software caches, as well.
Scope This book focuses on hardware cache replacement policies for CPU data
caches. And while most of the research that we discuss is performed in the context
of last-level caches, where the benefits of intelligent cache replacement are most
pronounced, the general ideas often apply to other levels of the cache hierarchy.
CHAPTER 2
A Taxonomy of Cache
Replacement Policies
To both organize this book and to organize the many ideas that have been studied
over several decades, we present a taxonomy of solutions to the cache replacement
problem. Our taxonomy is built on the observation that cache replacement policies
solve a prediction problem, where the goal is to predict whether any given line should
be allowed to stay in cache. This decision is re-evaluated at multiple points in a cache
line’s lifetime, which begins when the line is inserted into the cache and ends when
the line is evicted from the cache.
Therefore, in our taxonomy, we first divide cache replacement policies into two
broad categories based on the granularity of their insertion decisions. Policies in the
first category, which we refer to as Coarse-Grained policies, treat all lines identically
when they are inserted into the cache and only differentiate among lines based on
their behavior while they reside in the cache. For example, as a line resides in
the cache, its priority might be increased each time it is reused. By contrast, Fine-
Grained policies distinguish among lines when they are inserted into the cache (in
addition to observing their behavior while they reside in the cache). To make this
distinction at the time of insertion, Fine-Grained policies typically rely on historical
information about cache access behavior. For example, if a Fine-Grained policy has
learned that a particular instruction loads lines that in the past tend to be evicted
without being reused, it can insert those lines with a low priority.
To better understand our taxonomy, it is useful to understand that replacement
policies are typically implemented by associating a small amount of replacement state
with each cache line:
• Insertion Policy: How does the replacement policy initialize the replacement
state of a new line when it is inserted into the cache?
• Promotion Policy: How does the replacement policy update the replacement
state of a line when it hits in the cache?
• Aging Policy: How does the replacement policy update the replacement state
of a line when a competing line is inserted or promoted?
• Eviction Policy: Given a fixed number of choices, which line does the replace-
ment policy evict?
Figure 2.1: The lifetime of a cache line, with its insertion, promotion (on hits), aging (as competing lines are inserted or promoted), and eviction decision points.
With this view of the lifetime of a cache line, we can see that Coarse-Grained
policies treat all cache lines identically on insertion, so they primarily rely on clever
aging and promotion policies to manipulate replacement priorities. By contrast,
Fine-Grained policies employ smarter insertion policies. Of course, Fine-Grained
policies also need appropriate aging and promotion to update replacement state be-
cause not all behavior can be predicted at the time of insertion.
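To make these decision points concrete, the following sketch (in Python, with illustrative class and method names that are not taken from any particular simulator) expresses LRU purely in terms of its insertion, promotion, aging, and eviction policies; other Coarse-Grained policies differ only in how these four hooks are defined.

```python
# A minimal sketch of the four decision points for one cache set.
# Class and method names are illustrative, not from any real simulator.

class LRUPolicy:
    """LRU expressed as insertion, promotion, aging, and eviction policies."""
    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.state = {}                # line address -> recency counter (0 = just used)

    def insert(self, line):            # insertion policy: new lines get MRU state
        self.age()
        self.state[line] = 0

    def promote(self, line):           # promotion policy: a hit resets the line to MRU
        self.age(exclude=line)
        self.state[line] = 0

    def age(self, exclude=None):       # aging policy: every other line gets older
        for other in self.state:
            if other != exclude:
                self.state[other] += 1

    def victim(self):                  # eviction policy: evict the oldest (LRU) line
        return max(self.state, key=self.state.get)

    def access(self, line):
        """Drive the four policies for one access; returns True on a hit."""
        if line in self.state:
            self.promote(line)
            return True
        if len(self.state) >= self.num_ways:
            del self.state[self.victim()]
        self.insert(line)
        return False
```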
We now discuss these two classes in more detail.
Hybrid Policies Since different Coarse-Grained policies work well for differ-
ent cache access patterns, hybrid policies dynamically choose among a few pre-
determined Coarse-Grained policies to adapt to phase changes in a program’s ex-
ecution. In particular, hybrid policies observe the hit rate of a few candidate Coarse-
Grained policies during an evaluation period and use this feedback to choose the
best Coarse-Grained policy for future accesses. Adaptive policies are advantageous
because they help overcome pathologies of individual Coarse-Grained policies.
In Chapter 3, we will see that state-of-the-art hybrid policies modulate between
Coarse-Grained policies that use different insertion priorities, and we note that de-
spite the change in insertion priority over time, such policies are still coarse-grained
because within a time period, all lines are treated identically.
Figure 2.2: A taxonomy of cache replacement policies. Coarse-Grained policies are divided into Recency-Based, Frequency-Based, and Hybrid policies, while Fine-Grained policies are divided into Reuse Distance-Based, Economic Value-Based, and Classification-Based policies.
Figure 2.2 summarizes our complete taxonomy and shows how different classes
of replacement policies address these design factors. In general, the trend, as we
move to the right of the figure, which generally corresponds with newer policies, is
towards finer-grained solutions that use longer histories and that can accommodate a
wide variety of access patterns. The importance of predicting cache behavior at fine
granularity can be gauged by observing that the top four solutions in the recent Cache
Replacement Championship (2017) were all fine-grained. Fine-grained predictions
give state-of-the-art Fine-Grained policies two advantages. First, they allow Fine-
Grained policies to dedicate cache space only to lines that are most likely to benefit
from caching; by contrast, Coarse-Grained policies tend to repeatedly dedicate
cache resources to lines that do not yield any hits. Second, they allow Fine-Grained
policies to dynamically determine access patterns for different groups of lines; by
contrast, Coarse-Grained policies assume that the entire cache follows a uniform
cache access pattern.
Among Fine-Grained policies, the trend is towards using longer histories as the
state-of-the-art Fine-Grained policy Jain and Lin [2016] samples information from
a history that is 8× the size of the cache (see Chapter 4). The use of long histories
allows the policy to detect cache-friendly lines that have long-term reuse, which
would otherwise be obfuscated by a long sequence of intermediate cache-averse
accesses.
CHAPTER 3
Coarse-Grained
Replacement Policies
The cache replacement literature spans 50 years, first in the context of OS page re-
placement and then in the context of hardware caches. During this time, most re-
search has focused on the development of smarter Coarse-Grained policies, which
can perhaps be explained by the simplicity of these policies: Each cache line is asso-
ciated with a small amount of replacement state, which is initialized uniformly for all
newly inserted lines and then manipulated using simple operations as the lines are
reused.
In this chapter, we discuss key advances in Coarse-Grained policies. We di-
vide Coarse-Grained policies into three classes based on the manner in which they
distinguish lines after their insertion into the cache. The first class, which includes
the vast majority of Coarse-Grained policies, uses recency information to order
lines (Recency-Based Policies). The second class instead uses frequency to order
lines (Frequency-Based Policies). Finally, the third class (Hybrid Policies) dynamically
chooses among different Coarse-Grained replacement policies.
One running theme in the design of Coarse-Grained policies is the recognition
of three commonly observed cache access patterns, namely, recency-friendly ac-
cesses, thrashing accesses Denning [1968], and scans Jaleel et al. [2010b]. Recency-
friendly accesses occur when the working set of the application is small enough to
fit in the cache, such that the reuse distance of most memory references is smaller
than the size of the cache. By contrast, thrashing accesses occur when the working
set of the application exceeds the size of the cache, such that the reuse distance of
most memory references is greater than the size of the cache. Finally, scans are a se-
quence of streaming accesses that never repeat. As we will see in this chapter, almost
all Coarse-Grained replacement policies are optimized specifically for these access
patterns or for mixes of these three access patterns.
Accesses:                     A       B       C               A               B               C
Result:                       Miss    Miss    Miss (evict A)  Miss (evict B)  Miss (evict C)  Miss (evict A)
Cache contents after access:  (A, _)  (A, B)  (C, B)          (C, A)          (B, A)          (B, C)
Figure 3.1: An example of thrashing. The LRU policy produces no cache hits when the
access pattern cycles through a working set (3 in this example) that is larger than the cache
capacity (2 in this case).
LRU performs well when there is temporal locality of reference, that is, when
data that was used recently is likely to be reused in the near future. But it performs
poorly for two types of access patterns. First, it can be pessimal when the applica-
tion’s working set size exceeds the cache size, leading to a phenomenon known as
thrashing Denning [1968] (see Figure 3.1). Second, LRU performs poorly in the pres-
ence of scans because it caches more recently used scans at the expense of older lines
that are more likely to be reused.
Accesses:                     A       B       C               A       B               C
Result:                       Miss    Miss    Miss (evict B)  Hit     Miss (evict C)  Miss (evict B)
Cache contents after access:  (A, _)  (A, B)  (A, C)          (A, C)  (A, B)          (A, C)
Figure 3.2: The MRU policy avoids thrashing by caching a portion of the working set. In
this example, the MRU policy is able to cache A even though the working set size exceeds
the cache capacity of 2 lines.
Most Recently Used (MRU) The Most Recently Used Policy addresses thrashing
by evicting new lines to retain old lines. Thus, when the working set is larger than the
cache, it is able to retain a portion of the working set. For example, Figure 3.2 shows
that for the thrashing access pattern shown in Figure 3.1, MRU improves upon LRU’s
hit rate by caching a portion of the working set—in this example never evicting line
A. Table 3.1.1 shows that the MRU policy is identical to LRU, except that it evicts
the line at the MRU position instead of the LRU position.
While MRU is ideal for thrashing accesses, it performs poorly in the presence of
recency-friendly accesses, and it adapts poorly to changes in the application’s work-
ing set, as it is unlikely to cache anything from the new working set.
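The behavior shown in Figures 3.1 and 3.2 can be reproduced with a few lines of simulation. In the sketch below (a minimal single-set model; the "MRU" variant evicts the most recently inserted line and does not reorder on hits, which is the behavior illustrated in Figure 3.2), LRU gets no hits on the cyclic pattern, while the MRU-style policy retains line A.

```python
# A toy single-set simulator for the cyclic pattern of Figures 3.1 and 3.2.

def simulate(accesses, ways, evict_mru):
    recency = []          # least recently used ... most recently used/inserted
    hits = 0
    for line in accesses:
        if line in recency:
            hits += 1
            if not evict_mru:                         # LRU promotes hits to the MRU end
                recency.remove(line)
                recency.append(line)
        else:
            if len(recency) == ways:
                recency.pop(-1 if evict_mru else 0)   # evict the MRU or the LRU end
            recency.append(line)
    return hits

pattern = list("ABCABC")
print("LRU hits:", simulate(pattern, ways=2, evict_mru=False))  # 0 (thrashing)
print("MRU hits:", simulate(pattern, ways=2, evict_mru=True))   # 1 (A stays cached)
```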
Early Eviction LRU (EELRU) The EELRU policy also avoids thrashing Smarag-
dakis et al. [1999]. The main idea is to detect cases where the working set size ex-
ceeds the cache size, at which point a few lines are evicted early. Thus, early eviction
discards a few randomly chosen lines so that the remaining lines can be managed ef-
fectively by the LRU policy. More specifically, EELRU evicts the LRU line when the
working set fits in the cache, but it evicts the eth most recently used line when it
observes that too many lines are being touched in a roughly cyclic pattern that is larger
than main memory.
Figure 3.3 shows that EELRU distinguishes among three regions of the recency
axis. The left endpoint of the recency axis represents the most recently used (MRU)
line, and the right endpoint represents the least recently used (LRU) line. The LRU
memory region consists of the most recently used lines, and positions e and M mark
the beginning of the early eviction region and late eviction region, respectively. On a
miss, the EELRU policy either evicts the line at the least recently used position (late
region) or the line at the eth position (early region).
To decide whether to use early eviction or LRU eviction, EELRU tracks the
number of hits received in each region. If the distribution is monotonically decreas-
ing, then EELRU assumes that there is no thrashing and evicts the LRU line. But if
the distribution shows more hits in the late region than the early region, then EELRU
evicts from the early region, which allows lines from the late region to remain longer
in the cache. Table 3.3 summarizes the operation of EELRU.
Table 3.4 summarizes the operation of Seg-LRU. New lines are inserted at the
MRU position in the probationary segment, and on a cache hit, lines are promoted
to the MRU position in the protected segment. Because the protected segment is
finite, a promotion to the protected segment may force the migration of the LRU line
in the protected segment to the most recently used (MRU) end of the probationary
segment, giving this line another chance to receive a hit before it is evicted from the
probationary segment. Thus, Seg-LRU can adapt to changes in the working set as old
lines eventually get demoted to the probationary segment.
This insight spurred the design of replacement policies with new insertion and
promotion policies. By interpreting recency information in different ways, these poli-
cies can address much broader classes of access patterns than LRU. In this section,
we discuss some notable solutions in this direction. Since applications typically con-
sist of many different access patterns, none of these policies is sufficient on its own
and is typically used as part of a hybrid solution, which we discuss in Section 3.3.
Bimodal Insertion Policy (BIP) Since thrash-resistant policies like LIP (and MRU)
cannot adapt to changes in the working set during phase changes, BIP Qureshi et al.
[2007] modifies LIP such that lines are occasionally (with a low probability) inserted
in the MRU position. BIP maintains the thrash-resistance of LIP because it inserts in
the LRU position most of the time, but it can also adapt to phase changes by occasionally
retaining newer lines. Table 3.6 shows that BIP inserts in the MRU position with a
probability of ε, which is set to be 1/32 or 1/64. An ε value of 1 inserts in the MRU
position (mimicking LRU's insertion policy), and an ε value of 0 inserts in the LRU
position (mimicking LIP's insertion policy). Thus, from an implementation point of
view, BIP's ε parameter unifies all insertion policies that lie at different points in the
spectrum between LRU and MRU insertion.
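A minimal sketch of BIP's insertion decision, assuming a recency stack whose positions run from 0 (LRU) to ways - 1 (MRU); the function name is illustrative.

```python
import random

# Sketch of BIP's insertion decision (epsilon is the bimodal throttle parameter).
# Position 0 is the LRU end of the recency stack; position ways - 1 is the MRU end.

def bip_insertion_position(ways, epsilon=1/32):
    """Insert at MRU with probability epsilon, otherwise at LRU (as LIP does)."""
    return ways - 1 if random.random() < epsilon else 0

# epsilon = 1 degenerates to LRU insertion; epsilon = 0 degenerates to LIP.
```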
Jaleel et al., show that instead of predicting re-reference intervals that lie at
the extreme ends of the RRIP chain, there is great benefit in predicting an interme-
diate re-reference interval, which allows the replacement policy to accommodate
a mixture of different access patterns. In particular, scans—a burst of references to
data that is not reused—disrupt recency-friendly policies, such as LRU, because they
discard the working set of the application without yielding any cache hits. To accom-
modate mixes of recency-friendly accesses and scans, Jaleel et al., propose SRRIP,
which gives incoming lines an intermediate re-reference interval and then promotes
lines to the head of the chain if they are reused. Thus, SRRIP adds scan-resistance
to recency-friendly policies, as it prevents lines with a distant re-reference interval
(scans) from evicting lines with a near re-reference interval.
In the most general case, SRRIP associates an M-bit value per cache block to
store its Rereference Prediction Value (RRPV), but Jaleel et al., find that a 2-bit RRPV
value is sufficient to provide scan-resistance. Table 3.7 summarizes the operations
of SRRIP with a 2-bit RRPV value.
Table 3.7: Static Re-Reference Interval Prediction Policy (SRRIP)
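The following sketch shows one way to implement SRRIP's per-set operations with 2-bit RRPVs, using the hit-promotion behavior described above (reused lines move to the head of the chain); the dictionary-based bookkeeping is illustrative rather than a hardware description.

```python
# Sketch of SRRIP with 2-bit RRPVs (0 = near re-reference, 3 = distant).
# The per-set state is a dict: line -> RRPV.

RRPV_MAX = 3          # 2-bit counter
RRPV_INSERT = 2       # intermediate re-reference interval on insertion

def srrip_access(rrpv, line, ways):
    """Update one set's RRPV state for an access; returns True on a hit."""
    if line in rrpv:
        rrpv[line] = 0                      # promotion: predict near re-reference
        return True
    if len(rrpv) == ways:                   # need a victim
        while not any(v == RRPV_MAX for v in rrpv.values()):
            for l in rrpv:                  # aging: increment all RRPVs until
                rrpv[l] += 1                # some line reaches the distant value
        victim = next(l for l, v in rrpv.items() if v == RRPV_MAX)
        del rrpv[victim]
    rrpv[line] = RRPV_INSERT                # insertion: intermediate interval
    return False
```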
Like LRU, SRRIP thrashes the cache when the re-reference interval of all
blocks is larger than the available cache. To add thrash-resistance to scan-resistant
policies, Jaleel et al., propose Bimodal RRIP (BRRIP), a variant of BIP Qureshi et al.
[2007] that inserts a majority of cache blocks with a distant re-reference interval
prediction (i.e., RRPV of 3), and it infrequently inserts cache blocks with an inter-
mediate re-reference interval prediction (i.e., RRPV of 2). BRRIP’s operations are
summarized in Table 3.8.
Table 3.8: Bimodal Re-Reference Interval Prediction Policy (BRRIP)
Protecting Distance-Based Policy (PDP) The PDP policy Duong et al. [2012] is a
generalization of RRIP as it dynamically estimates a Protecting Distance (PD), and all
cache lines are protected for PD accesses before they can be evicted. The Protecting
Distance is a single value that is used for all lines inserted into the cache, but it is
continually updated based on the application’s dynamic behavior.
To estimate the PD, PDP computes the reuse distance distribution (RDD),
which is the distribution of reuse distances observed within a recent interval of the
program’s execution. Using the RDD, the Protecting Distance is defined to be the
reuse distance that covers a majority of lines in the cache, such that most lines are
reused at the PD or smaller distance. For example, Figure 3.6 shows the RDD for
436.cactusADM, where the PD is set to be 64. The PD is recomputed infrequently
using a small special-purpose processor.
Figure 3.6: Protecting Distance covers a majority of lines (for example, 64 for 436.cactusADM).
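As a rough illustration of how a Protecting Distance might be chosen from an RDD, the sketch below picks the smallest distance that covers a chosen fraction of observed reuses; the coverage threshold and distribution are illustrative, and the actual PDP hardware uses a more detailed benefit estimate computed by its small dedicated processor.

```python
# Sketch: choose a Protecting Distance (PD) from a reuse distance distribution (RDD).

def protecting_distance(rdd, coverage=0.9, max_pd=256):
    """rdd maps reuse distance -> number of observed reuses at that distance."""
    total = sum(rdd.values())
    covered = 0
    for d in sorted(rdd):
        covered += rdd[d]
        if covered / total >= coverage:
            return min(d, max_pd)
    return max_pd

# Example: a distribution concentrated at or below 64 yields a PD of 64,
# loosely mirroring the 436.cactusADM example in Figure 3.6.
rdd = {8: 100, 32: 300, 64: 550, 200: 50}
print(protecting_distance(rdd))   # 64
```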
Figure 3.7: IPV for LRU (top), and the IPV evolved using a genetic algorithm (bottom).
Figure 3.7 shows two sample IPVs, with the first one representing LRU, and the
second one representing a more sophisticated insertion and promotion strategy.
While the generalized notion of IPVs is quite powerful, the number of possible
IPVs grows exponentially (k^(k+1) possible IPVs for a k-way cache), so Jiménez et al.,
use an offline genetic search to evolve good IPVs for the SPEC 2006 benchmarks.
The genetic search yielded the IPV shown in the bottom part of Figure 3.7.
Just as there is no single insertion policy or RRIP policy that works for all work-
loads, the best IPV differs for each workload. Therefore, Jiménez et al., present a
hybrid solution that uses set dueling (described in Section 3.3) to consider multiple
IPVs.
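The sketch below shows how an IPV can drive a recency stack, assuming an encoding in which entry i gives the new position for a line that hits at position i and the final entry gives the insertion position; the tail eviction and the example LRU vector are illustrative.

```python
# Sketch of how an Insertion/Promotion Vector (IPV) drives a recency stack.
# Position 0 is the head of the stack (most protected); position k-1 is the tail.

def apply_ipv(stack, ipv, line):
    """Update one set's recency stack (a list of lines) for an access to 'line'."""
    k = len(ipv) - 1                       # associativity
    if line in stack:
        new_pos = ipv[stack.index(line)]   # promotion target for this hit
        stack.remove(line)
        stack.insert(new_pos, line)
        return True
    if len(stack) == k:
        stack.pop()                        # evict the line at the tail (position k-1)
    stack.insert(ipv[k], line)             # insertion position for a new line
    return False

# The IPV corresponding to LRU inserts at position 0 and promotes every hit to 0.
lru_ipv = [0, 0, 0, 0, 0]                  # example encoding for a 4-way cache
```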
Shepherd Cache The Shepherd Cache mimics Belady’s optimal policy Rajan and
Govindarajan [2007]. Since Belady’s policy requires knowledge of future accesses,
Rajan et al., emulate this future lookahead with the help of an auxiliary cache, called
the Shepherd Cache. In particular, the cache is logically divided into two compo-
nents, the Main Cache (MC) which emulates optimal replacement, and the Shep-
herd Cache (SC) which uses a simple FIFO replacement policy. The SC supports
optimal replacement for the MC by providing a lookahead window. New lines are
initially buffered in the SC, and the decision to replace a candidate from the MC is
deferred until the new line leaves the SC. While the new line is in the SC, information
is gathered about the reuse order of replacement candidates in the MC. For example,
candidates that are reused earlier become less likely candidates for eviction since Be-
lady’s policy evicts the lines that is reused furthest in the future. When the new line
is evicted from the SC (due to other insertions in the SC), a replacement candidate
is chosen by either picking a candidate from the MC that hasn’t been reused within
the lookahead window, or the candidate that was reused last; if all lines in the MC
were reused before the SC line was reused, then the SC line replaces itself. Though
SC and MC are logically separate, Rajan et al., avoid any movement of data from one
component to another by organizing the two logical components as a single physical
structure.
Thus, the Shepherd Cache emulates a superior replacement scheme for lines in
the MC cache by extending their lifetime with the help of the Shepherd Cache. The
tradeoff for Shepherd Cache is that replacement in the MC approaches true optimal-
ity with high lookaheads, and the higher lookahead comes at the cost of a diminished
Main Cache capacity. Unfortunately, subsequent work Jain and Lin [2016] has shown
that to approach the behavior of Belady’s MIN policy, the policy needs a lookahead
of 8× the size of the cache.
Figure 3.8: FBR does not increment frequency counters in the new section.
FBR limits replacement to the least frequently used line in an old section, which consists of lines
that have not been recently accessed (bottom of the LRU stack). The remaining part
of the stack is called the middle section, which gives frequently used lines sufficient
time to build up their frequency values.
Table 3.10 summarizes the operation of the FBR policy.
Least Recently/Frequently Used (LRFU) The LRFU policy Lee et al. [2001]
builds on the observation that the LRU and LFU (Least Frequently Used) policies
represent extreme points of a spectrum of policies that combine recency and fre-
quency information (see Figure 3.9). Using a new metric called Combined Recency
and Frequency (CRF), LRFU explores this spectrum by allowing a flexible tradeoff
between recency and frequency.
Like Frequency-Based policies, LRFU accounts for each past reference to the
block, but unlike Frequency-Based policies, LRFU weighs the relative contribution
of each reference by a weighing function. In particular, LRFU computes for each
block a CRF value, which is the sum of the weighing function F (x) for each past
reference, where x is the distance of the past reference from the current time. Thus,
for emulating purely Frequency-Based policies, the weighing function can give equal
priority to all past references, and for emulating Recency-Based policies, the weigh-
ing function can give high priority to the last reference of the line.
F(x) = (1/p)^(λx)    (3.1)
Table 3.11 summarizes the operation of the LRFU policy for a block b at different
decision points.
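Because the weighing function of Equation 3.1 is exponential, a block's CRF can be maintained incrementally from just two fields: its CRF at the last reference and the time of that reference. The sketch below assumes p = 2 and an illustrative λ; the class and field names are hypothetical.

```python
# Sketch of LRFU's Combined Recency and Frequency (CRF) bookkeeping, using the
# weighing function F(x) = (1/p)^(lambda * x) from Equation 3.1 (here p = 2).

P, LAM = 2.0, 0.1

def weight(x):
    return (1.0 / P) ** (LAM * x)

class LRFUBlock:
    def __init__(self, now):
        self.crf = weight(0)           # the first reference contributes F(0) = 1
        self.last = now

    def reference(self, now):
        # CRF(t) = F(0) + F(t - t_last) * CRF(t_last), since F is exponential
        self.crf = weight(0) + weight(now - self.last) * self.crf
        self.last = now

    def value(self, now):
        """CRF seen at the current time; the block with the smallest value is evicted."""
        return weight(now - self.last) * self.crf

# lambda = 0 weighs all references equally (LFU-like behavior), while larger
# lambda values emphasize the most recent reference (approaching LRU).
```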
Set Dueling builds on the observation that a few randomly chosen sets can accurately
represent the behavior of different replacement policies on the entire cache. Qureshi
et al., mathematically show that for caches with 1-4 MB (1024 to 4096 sets), 64 sets
are enough to capture the behavior of the entire cache. We now discuss Set Dueling
in more detail by discussing two representative policies.
Dynamic Insertion Policy (DIP) The Dynamic Insertion Policy (DIP) combines
the benefit of recency-friendly policies and thrash-resistant policies by dynamically
modulating the insertion positions of incoming lines Qureshi et al. [2007]. In partic-
ular, DIP alternates between the recency-friendly LRU (Table 3.1) and the thrash-
resistant Bimodal Insertion Policy (Table 3.6).
To choose between the two policies, DIP uses Set Dueling to dynamically track
the hit rate of each policy. Figure 3.11 shows that DIP dedicates a few sets to LRU
(sets 0, 5, 10 and 15 in Figure 3.11) and a few sets to BIP (sets 3, 6, 9 and 12 in Fig-
ure 3.11). These dedicated sets are called Set Dueling Monitors (SDMs), while the
remaining sets are called the follower sets. The policy selection (PSEL) saturating
counter determines the winning policy by identifying the SDM that incurs fewer
cache misses. In particular, the PSEL is incremented when the SDM dedicated to LRU
incurs a miss, and it is decremented when the SDM dedicated to BIP incurs a miss
(a k-bit PSEL counter is initialized to 2^(k−1)). The winning policy is identified by the
MSB of the PSEL. If the MSB of PSEL is 0, the follower sets use the LRU policy;
otherwise the follower sets use BIP. Thus, Set Dueling does not require any separate
storage structure other than the PSEL counter.
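A minimal sketch of Set Dueling as used by DIP; the PSEL width, the number of dedicated sets, and the set-to-SDM mapping are illustrative choices rather than the exact parameters of the original design.

```python
# Sketch of DIP-style Set Dueling with a k-bit PSEL counter (here k = 10).
# A few dedicated sets monitor LRU, a few monitor BIP, and the MSB of PSEL
# selects the policy used by all follower sets.

PSEL_BITS = 10
PSEL_MAX = (1 << PSEL_BITS) - 1
psel = 1 << (PSEL_BITS - 1)              # initialized to 2^(k-1)

def sdm_of(set_index):
    group = set_index % 32               # one LRU set and one BIP set per 32 sets
    return "LRU" if group == 0 else "BIP" if group == 1 else "FOLLOWER"

def on_miss(set_index):
    """A miss in a dedicated set nudges PSEL toward the other policy."""
    global psel
    sdm = sdm_of(set_index)
    if sdm == "LRU":
        psel = min(psel + 1, PSEL_MAX)   # LRU is missing: favor BIP
    elif sdm == "BIP":
        psel = max(psel - 1, 0)          # BIP is missing: favor LRU

def policy_for(set_index):
    sdm = sdm_of(set_index)
    if sdm != "FOLLOWER":
        return sdm                       # dedicated sets always use their policy
    return "BIP" if (psel >> (PSEL_BITS - 1)) & 1 else "LRU"
```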
CHAPTER 4
Fine-Grained
Replacement Policies
Fine-Grained policies differentiate cache lines at the time of insertion, and this dif-
ferentiation is typically based on eviction information from previous lifetimes of sim-
ilar cache lines. For example, if a Fine-Grained policy learns that a line was evicted
without being reused in its previous lifetimes, then it can insert the line into the cache
with low priority. By contrast, a Coarse-Grained policy, such as LRU, will evict a line
only after it has migrated from the MRU position to the LRU position, so it forces
the line to reside in the cache for a long period of time—consuming precious cache
space—just to determine that the line should be evicted. Thus, by learning from the
behavior of previous cache lines, Fine-Grained policies can make more effective use
of the cache.
We divide Fine-Grained policies into three broad categories based on the met-
ric they use for predicting insertion priorities. The first category (Section 4.1) con-
sists of solutions that predict expected reuse intervals for incoming lines. The second
category (Section 4.2) consists of solutions that predict just a binary caching decision
(cache-friendly vs. cache-averse). The third category, which is much smaller than
the other two, includes policies Beckmann and Sanchez [2017], Kharbutli and Soli-
hin [2005] that introduce novel prediction metrics.
Fine-Grained solutions have several other design dimensions. First, since it can
be cumbersome to remember the caching behavior of individual lines across mul-
tiple cache lifetimes, these policies learn caching behaviors for groups of lines. For
example, many solutions group lines based on the address (PC) of the instruction that
loaded the line, because lines that are loaded by the same PC tend to have similar
caching behavior. Recent solutions look at more sophisticated ways to group cache
lines Jiménez and Teran [2017], Teran et al. [2016]. A second design dimension is the
amount of history that is used for learning cache behavior.
Fine-Grained replacement policies have roots in two seemingly different con-
texts. One line of work uses prediction to identify dead blocks—blocks that will not
be used before being evicted—that could be re-purposed for other uses. For exam-
ple, one of the earliest motivations for identifying dead blocks was to use them as
prefetch buffers Hu et al. [2002], Lai et al. [2001]. Another motivation was to turn
off cache lines that are dead Abella et al. [2005], Kaxiras et al. [2001]. The second
line of work generalizes hybrid re-reference interval policies Jaleel et al. [2010b] so
that they are more learning based. Despite their different origins, these two lines of
research have converged to conceptually similar solutions.
We now discuss the different classes of Fine-Grained policies.
Figure 4.1: The ETA policy chooses between the line with the longest ETA or the one
with the largest decay.
More concretely, Figure 4.1 shows that the replacement policy has two candi-
dates: (1) the line with the largest ETA (the ETA line), and (2) the line with the largest
Decay time (the LRU line). The line with the larger of the two values is selected for
eviction. Thus, the policy relies on ETA when available and reverts to LRU other-
wise.
• What is the prediction mechanism, and at what granularity are the predictions
being made?
• What is the aging mechanism for ensuring that inaccurate predictions are even-
tually evicted?
4.2.1 SAMPLING BASED DEAD BLOCK PREDICTION (SDBP)
Many studies observe that because a majority of the blocks in the LLC are dead (they
are not reused again before they are evicted), dead block prediction can be used to
guide cache replacement and early bypass of dead blocks Khan et al. [2010], Lai and
Falsafi [2000]. Lai et al., introduce the idea of using dead block predictors to prefetch
data into dead blocks in the L1 Lai and Falsafi [2000]. Their reftrace predictor pre-
dicts that if a trace of instruction addresses leads to the last access for one block, then
the same trace will also lead to the last access for other blocks. To reduce the cost
of maintaining an instruction trace for all cache blocks, Khan et al., introduce the
Sampling Based Dead Block Predictor (SDBP), which samples the caching behavior
of program counters (PCs) to determine whether an incoming block is likely to be
dead Khan et al. [2010]. Future cache accesses from PCs that are known to insert
dead blocks are bypassed so that they do not pollute the cache. Accesses from PCs
that do not insert dead blocks are inserted into the cache using some baseline policy,
namely, a random or LRU replacement policy.
Notably, SDBP learns from a decoupled sampler that is populated using a small
fraction of the cache accesses (see Figure 4.2). If a block is evicted from the sam-
pler without being reused, the corresponding PC is trained negatively; otherwise,
the predictor is trained positively. The decoupled sampler has several advantages.
First, the predictor is trained using only a small sample of all cache accesses, which
leads to a power- and space-efficient predictor design and manageable metadata in
the sampler (PCs are maintained for each sampler entry). Second, the replacement
policy of the sampler does not have to match the replacement policy of the cache.
Khan et al., use the LRU policy for the sampler, and they use random replacement
for the main cache. Finally, the associativity of the sampler is independent of the as-
sociativity of the cache, which allows for a cheaper sampler design. Khan et al., use
a 12-way sampler for a 16-way cache. Table 4.1 summarizes SDBP’s key operations.
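The sketch below captures the spirit of SDBP's decoupled sampler and PC-trained predictor; it uses a single unskewed table, trains on the inserting PC, and picks arbitrary sizes and thresholds, all of which are simplifications of the actual design.

```python
# Sketch of a sampler-based dead block predictor in the spirit of SDBP.
# Evictions without reuse train the PC-indexed counters toward "dead";
# reuses in the sampler train them toward "live".

THRESHOLD = 2

class DeadBlockPredictor:
    def __init__(self):
        self.counters = {}                       # hashed PC -> saturating counter

    def train(self, pc, dead):
        c = self.counters.get(pc, 0)
        self.counters[pc] = max(-4, min(3, c + (1 if dead else -1)))

    def predict_dead(self, pc):
        return self.counters.get(pc, 0) >= THRESHOLD

class SamplerSet:
    """One LRU-managed sampler set that records the inserting PC of each line."""
    def __init__(self, ways, predictor):
        self.ways, self.pred = ways, predictor
        self.lines = []                          # list of (tag, pc), LRU -> MRU

    def access(self, tag, pc):
        for i, (t, p) in enumerate(self.lines):
            if t == tag:
                self.pred.train(p, dead=False)   # reuse: the block was not dead
                self.lines.append(self.lines.pop(i))
                return
        if len(self.lines) == self.ways:
            _, victim_pc = self.lines.pop(0)     # evicted without reuse
            self.pred.train(victim_pc, dead=True)
        self.lines.append((tag, pc))

# On a main-cache miss, predictor.predict_dead(pc) decides whether the incoming
# block should bypass the cache.
```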
Thus, to answer the questions that we outlined at the beginning of this sec-
tion: (1) SDBP learns the caching decisions of an LRU-based sampler; (2) it predicts
dead blocks (cache-averse vs. cache-friendly) using a skewed predictor design, making
these predictions at the granularity of PCs; and (3) SDBP bypasses all incoming
blocks that are predicted to be cache-averse, with the remaining lines being managed
using the baseline replacement policy, so false positives (cache-averse blocks that are
predicted to be cache-friendly) are aged out using the baseline replacement policy,
and false negatives (cache-friendly blocks that are predicted to be cache-averse) do
not get any opportunity to see reuse.
1 A minor difference is that SDBP learns the LRU solution, whereas SHiP learns from SRRIP.
More specifically, SHiP trains a predictor that learns whether a given signature
has near or distant re-reference intervals. A few sets are sampled in the cache to
maintain signatures and train the predictor. On a cache hit in a sampled set, the sig-
nature associated with the line is trained positively to indicate a near re-reference,
and on an eviction of a line that was never reused, the signature is trained negatively
to indicate a distant re-reference. When a new line is inserted, the signature of the
incoming line is used to consult the predictor and determine the re-reference inter-
val of the incoming line (prediction is performed for all accesses, not just the sampled
sets). Once inserted into the cache, lines are managed using a simple RRIP policy (see
Table 4.2).
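A minimal sketch of SHiP-style training and prediction with a signature counter table (SHCT) indexed by a hashed PC signature; the table size, counter width, and the mapping from counter values to insertion RRPVs are illustrative.

```python
# Sketch of SHiP-style training with a Signature History Counter Table (SHCT).
# Sampled sets carry the signature of the inserting PC and train the table;
# every insertion consults the table to pick an insertion RRPV.

SHCT_SIZE = 16384
shct = [1] * SHCT_SIZE                      # small saturating counters

def sig(pc):
    return hash(pc) % SHCT_SIZE

def train_on_hit(pc):
    shct[sig(pc)] = min(3, shct[sig(pc)] + 1)        # near re-reference observed

def train_on_eviction_without_reuse(pc):
    shct[sig(pc)] = max(0, shct[sig(pc)] - 1)        # distant re-reference observed

def insertion_rrpv(pc):
    # Signatures with a zero counter are predicted to have distant re-reference
    # intervals and are inserted with RRPV = 3; others get the SRRIP default of 2.
    return 3 if shct[sig(pc)] == 0 else 2
```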
The choice of signature is critical to SHiP's effectiveness. Jaleel et al., evaluate
a program counter signature (PC), a memory region signature, and an instruction
sequence history signature, and they find that the PC signature performs best.
SHiP builds on DRRIP Jaleel et al. [2010b] to create a Fine-Grained policy.
Whereas DRRIP makes uniform re-reference predictions for all cache lines in an
epoch, SHiP makes finer-grained predictions: It categorizes incoming lines into dif-
ferent groups by associating each reference with a unique signature. Cache lines with
the same signature are assumed to have similar re-reference behavior, but cache lines
with different signatures are allowed to have different re-reference behavior within
the same epoch.
We now answer the questions mentioned at the beginning of this section. (1)
Initially, SHiP learns from SRRIP, but once SHiP’s predictor has been trained, further
training updates come from SHiP’s own reuse and eviction behavior. (2) SHiP uses a
PC-based predictor, where the PC associated with a line is the one that inserted the
line on a cache miss, where each PC is associated with a saturating counter. (3) SHiP
relies on the RRIP policy to age all lines.
4.2.3 HAWKEYE
To avoid the pathologies of heuristic-based solutions, such as LRU, Hawkeye Jain
and Lin [2016] builds off of Belady’s MIN solution Belady [1966]2 , which is intriguing
for two reasons. First, Belady’s MIN is optimal for any sequence of references, so a
MIN-based solution is likely to work for any access pattern. Second, Belady’s MIN
algorithm is an impractical algorithm because it replaces the line that will be reused
furthest in the future; hence, it relies on knowledge of the future.
The key insight behind Hawkeye is that while it is impossible to look into the
future, it is possible to apply Belady’s MIN algorithm to the memory references of the
past. Moreover, if a program’s past behavior is a good predictor of its future behavior,
then by learning the optimal solution for the past, Hawkeye can train a predictor that
should perform well for future accesses.
2 Technically, the Hawkeye policy builds off of a new linear-time algorithm that produces the same result as Belady's MIN policy.
To understand how much history is needed to simulate MIN for past events,
Jain and Lin study the performance of MIN by limiting its window into the future.
Figure 4.3 shows that while MIN needs a long window into the future (8× the size
of the cache for SPECint 2006), it does not need an unbounded window. Thus, to
apply MIN to past events, we would need a history of 8× the size of the cache.
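The idea of applying MIN to past events can be illustrated with a brute-force sketch that labels each access in a recorded trace with OPT's hit or miss outcome; Hawkeye itself uses a far more efficient, bounded-history mechanism (OPTgen) rather than the full lookahead shown here.

```python
# Sketch: applying Belady's MIN to a recorded (past) access trace. On each miss
# with a full cache, the line whose next use is furthest in the future is evicted.

def min_decisions(trace, capacity):
    """Return (address, hit_under_OPT) labels for each access in a past trace."""
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):          # precompute next-use positions
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}                                       # address -> position of next use
    labels = []
    for i, addr in enumerate(trace):
        hit = addr in cache
        labels.append((addr, hit))
        if not hit and len(cache) >= capacity:
            victim = max(cache, key=cache.get)       # reused furthest in the future
            del cache[victim]
        cache[addr] = next_use[i]
    return labels

print(min_decisions(list("ABCABC"), capacity=2))     # OPT achieves 2 hits here
```

In Hawkeye, labels of this kind are used to train a PC-based predictor: PCs whose lines OPT would have cached are treated as cache-friendly, and the rest as cache-averse.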
Figure 4.3: Error compared to an OPT with an infinite view of the future (y-axis, in %) as the view of the future grows from 1× to 16× the cache capacity (x-axis), for Belady's MIN and for LRU.
Lines that are predicted to be cache-friendly are inserted with high priority, and lines that are
predicted to be cache-averse are inserted with low priority (see Table 4.3).
To answer the questions at the beginning of the section: (1) Hawkeye learns
from the optimal caching solution, instead of learning from LRU or SRRIP. (2) Hawk-
eye learns the optimal solution using a PC-based predictor, which is conceptually
similar to the predictors used in SDBP and SHiP. (3) Hawkeye also relies on RRIP’s
aging mechanism to age lines that are inserted with high priority. To correct for in-
accurate predictions, Hawkeye also trains the predictor negatively when a line that
was predicted to be cache-averse is evicted without being reused.
Hawkeye makes an interesting conceptual contribution: It phrases cache re-
placement as a supervised learning problem4 , which is surprising because unlike
branch prediction, where the program execution eventually provides the correct
outcomes of each branch, hardware caches do not provide such labeled data. By
applying the optimal solution to past events, Hawkeye provides labeled data, which
suggests that the field of cache replacement might benefit from the vast research in
supervised learning.
Jiménez and Teran aim to improve predictor accuracy by using better features and
better prediction models Jiménez and Teran [2017], Teran et al. [2016]. For example,
the Perceptron Predictor Teran et al. [2016] uses simple artificial neurons5 Rosen-
blatt [1962] to augment the PC with richer features, such as (1) the history of PCs,
(2) bits from the memory address, (3) a compressed representation of the data, and
(4) the number of times a block has been accessed. Each feature is used to index a
distinct table of saturating counters, which are then summed and compared against
a threshold to generate a binary prediction. A small fraction of accesses are sampled
to update the perceptron predictor using the perceptron update rule: If the predic-
tion is incorrect, or if the sum fails to exceed some magnitude, then the counters are
decremented on an access and incremented on an eviction. Figure 4.5 contrasts the
perceptron predictor (right) with prior PC-based predictors (left).
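A sketch of the perceptron-style organization described above: each feature indexes its own table of small weights, the weights are summed and compared against a threshold, and training follows the perceptron update rule. The features, table sizes, and thresholds here are illustrative.

```python
# Sketch of a perceptron-style reuse predictor with three feature tables.

TABLE_SIZE = 256
TAU = 0          # prediction threshold
THETA = 30       # training threshold (train until |sum| exceeds theta)

tables = [[0] * TABLE_SIZE for _ in range(3)]

def features(pc, addr, pc_history):
    return [hash(pc) % TABLE_SIZE,
            (addr >> 6) % TABLE_SIZE,                # bits of the block address
            hash(tuple(pc_history[-3:])) % TABLE_SIZE]

def predict_bypass(pc, addr, pc_history):
    s = sum(t[i] for t, i in zip(tables, features(pc, addr, pc_history)))
    return s >= TAU, s                               # True = predicted dead/bypass

def train(pc, addr, pc_history, was_reused, s):
    # Train only on a misprediction or a low-confidence sum.
    predicted_dead = s >= TAU
    if predicted_dead == (not was_reused) and abs(s) > THETA:
        return
    for t, i in zip(tables, features(pc, addr, pc_history)):
        t[i] += 1 if not was_reused else -1          # toward "dead" on eviction,
        t[i] = max(-32, min(31, t[i]))               # toward "reused" on an access
```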
The Multiperspective Reuse Predictor Jiménez and Teran [2017] explores an
extensive set of features that captures various properties of the program, producing
a predictor that is informed from multiple perspectives. The features are parameter-
ized with richer information about the LRU stack position of each training input, the
bits of the PC with which each feature is hashed, and the length of the PC history. To-
gether, these parameters create a large feature space that leads to higher prediction
accuracy.
Conceptually, the EAF policy extends the lifetime of cache lines beyond evic-
tion, so that the lifetime of a line starts with its insertion, but it ends when the line
is removed from the bloom filter. With an extended lifetime, it becomes feasible for
EAF to observe reuse for lines with long reuse intervals, which leads to better scan-
resistance and better thrash-resistance, and thus better performance.
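A sketch of an EAF-style check using a Bloom filter over recently evicted block addresses; the filter size, hash count, and reset policy are illustrative.

```python
# Sketch of an Evicted Address Filter (EAF) style check.

FILTER_BITS = 4096
NUM_HASHES = 2
MAX_INSERTIONS = 1024

class EAF:
    def __init__(self):
        self.bits = [False] * FILTER_BITS
        self.inserted = 0

    def _positions(self, addr):
        return [hash((addr, i)) % FILTER_BITS for i in range(NUM_HASHES)]

    def on_eviction(self, addr):
        for p in self._positions(addr):
            self.bits[p] = True
        self.inserted += 1
        if self.inserted >= MAX_INSERTIONS:      # periodically reset the filter,
            self.__init__()                      # ending the extended lifetimes

    def recently_evicted(self, addr):
        return all(self.bits[p] for p in self._positions(addr))

# On a cache miss, a block whose address hits in the EAF is inserted with high
# priority (it was reused soon after eviction); otherwise it is inserted with
# low priority.
```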
The Reuse Detector (ReD) crc [2017], Albericio et al. [2013] proposes similar
ideas as it bypasses any line that does not hit in the LLC or an Address Reuse Table,
which tracks recent cache misses. As a result, ReD only inserts lines in the cache on
their second reuse. To avoid bypassing all lines when they are first seen, ReD also
uses a PC-based predictor to predict lines that are likely to be reused after their first
access.
CHAPTER 5
Richer Considerations
We have until now focused narrowly on the problem of cache replacement, both
in terms of metrics—in which we have focused on cache misses—and in terms of
context—in which we have focused on the cache as an isolated abstract entity. But
of course, cache misses do not translate directly to performance loss, and cache re-
placement policies do not operate in isolation. This chapter broadens our discussion
in both dimensions, first exploring replacement policies that consider the variable
cost of cache misses and then exploring policies that consider the impact of prefetch-
ers, the impact of the cache architecture, and finally the impact of new memory tech-
nologies.
Note that a search tree is required because greedily caching a high-cost block
at any node in the tree might not translate to the best replacement sequence in the
long run. In particular, greedily caching a high-cost block might preclude the caching
of another block that has an even higher cost, or it might displace several low-cost
blocks that together are more costly than the single high-cost block. Thus, the best
replacement decision depends on the behavior of multiple competing lines in the
future, resulting in a large search space of solutions.
Unfortunately, CSOPT’s branch-and-bound approach is exponential in the
number of cache accesses, whereas MIN is linear in the number of cache accesses.
Jeong and Dubois propose heuristics Jeong and Dubois [2006] to prune the search
tree and to reduce the search space. For a few scientific workloads, these heuristics
are shown to make the computation of CSOPT tractable—in the sense that an offline
analysis can complete—but in general, they do not reduce the worst-case complex-
ity.
Figure 5.2 shows the relative cost savings of CSOPT over MIN (y-axis) for the
Barnes Hut benchmark from the SPLASH benchmark suite Jeong and Dubois [2006].
Here cache misses are assumed to have just two possible costs, either high cost or low
cost. The x-axis represents different proportions of high-cost accesses, and the lines
represent different cost ratios between high-cost and low-cost accesses. We see that
the cost savings of CSOPT over MIN increases with higher cost ratios, and they are
most pronounced when the high-cost accesses comprise 20-30% of the total number
of accesses. This trend makes sense because the benefit of CSOPT over MIN grows
when the difference in cost is high and when the percentage of low-cost misses is
high, because in these cases, MIN is more likely to make the wrong choice.
Of course, like MIN, CSOPT is impractical since it requires knowledge of fu-
ture accesses. Therefore, practical solutions for cost-aware cache replacement rely
on intuitive heuristics to (1) identify high-cost cache accesses and (2) preferentially
cache high-cost loads over low-cost loads.
Figure 5.2: Cost savings over MIN with random cost assignments Jeong and Dubois [2006].
The MLP-based cost of a cache miss is defined to be the number of cycles that
the miss spends waiting to be serviced, which can be easily tracked using Miss Status
Handling Registers. For parallel misses, the cycle count is divided equally among all
concurrent misses. Thus, a higher MLP-based cost implies that the line is likely to
result in an isolated miss, so it is more desirable to keep it in the cache. Furthermore,
Qureshi et al., note that the MLP-based cost repeats for consecutive misses, so the
last MLP-based cost of a miss is a good indicator of its MLP-based cost for future
misses.
The LIN replacement policy linearly combines recency and MLP-based cost
and evicts the line with the lowest aggregate cost. In particular, if R(i) is the recency
value of a block i (the highest value denotes the MRU position and lowest value de-
notes the LRU position), and costq (i) is its quantized MLP-based cost, then the LIN
policy will evict the line with the lowest value of R(i) + λ ∗ costq (i), where λ is the
importance of the costq in choosing the victim. A high value of λ indicates that LIN
is likely to retain recent blocks with a high MLP-based cost, and a low value of λ
indicates that LIN is likely to put more emphasis on recency. Qureshi et al., set λ to
4.
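A minimal sketch of LIN's victim selection using the linear combination described above; the field names and example cost values are illustrative, and λ follows the setting of 4 mentioned in the text.

```python
# Sketch of LIN's victim selection: evict the block with the lowest R(i) + lambda * cost_q(i).

LAMBDA = 4

def lin_victim(blocks):
    """blocks: list of dicts with 'recency' (R, 0 = LRU position) and 'cost_q' fields."""
    def combined_cost(b):
        return b["recency"] + LAMBDA * b["cost_q"]
    return min(blocks, key=combined_cost)        # evict the lowest aggregate cost

example_set = [
    {"addr": 0xA0, "recency": 3, "cost_q": 0},   # recent but cheap to re-fetch
    {"addr": 0xB0, "recency": 0, "cost_q": 3},   # old but an isolated, costly miss
]
print(hex(lin_victim(example_set)["addr"]))      # evicts 0xA0 (3 + 0 < 0 + 12)
```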
Finally, because LIN does not always outperform LRU, Qureshi et al., dynami-
cally choose between LRU and LIN by using Set Dueling (see Section 3.3.2) to period-
ically choose between the LIN and LRU policies. Kumar et al., provide an alternate
solution by incorporating post-eviction reuse information into the cost-metric it-
self Arunkumar and Wu [2014]. In particular, their ReMAP policy defines the overall
cost of a line as a linear combination of recency, DRAM latency, and post-eviction
reuse distance, where the post-eviction reuse distance is computed using a bloom
filter for evicted lines.
Locality-Aware Cost-Sensitive Cache Replacement Policy Recent work
builds on Qureshi et al.’s work by (1) defining new cost metrics, and (2) defining
new replacement strategies. For example, the Locality-Aware Cost-Sensitive Cache
Replacement Algorithm (LACS) Kharbutli and Sheikh [2014] estimates the cost of a
cache block by counting the number of instructions issued during a block’s LLC miss.
Intuitively, this definition of cost reflects the processor’s ability to hide miss penalty,
which is similar to MLP. Cache blocks are classified as either low-cost or high-cost,
depending on whether the number of issued instructions is above or below a thresh-
old. For each block, LACS maintains a 2-bit cost value, so that they can represent
both high and low cost with two levels of confidence. Thus, instead of defining a
numeric cost value, LACS uses a binary cost classification.
For its replacement policy, LACS attempts to reserve high-cost blocks in the
cache, but only while their locality is still high (i.e. they have been accessed recently).
In particular, on a cache miss, LACS chooses a low-cost block to evict, so that high-
cost blocks remain in the cache. However, high-cost blocks are aged by decrement-
ing their 2-bit cost value so that they relinquish the reservation if they’ve been in the
cache for too long. Similarly, low-cost blocks are promoted by incrementing their
2-bit cost value so that they are retained in the cache if they show high temporal
locality. Thus, LACS combines the benefits of both locality and cost-based replace-
ment.
Figure 5.4: Critical path in a program’s Data Dependence Graph (DDG) Fields et al.
[2001].
An E-E edge denotes a data dependence, C-C edges denote in-order commit,
and D-D edges denote in-order allocation into the core. The critical path in the graph
is marked with dotted lines, and instructions 1, 2, 4, and 5 are found to be critical.
Nori et al. propose a novel and fast incremental method to learn the critical path us-
ing an optimized representation of the data dependence graph in hardware, which
takes just 3 KB of area. This method is used to enumerate a small set of critical load
instructions (critical PCs).
To optimize critical loads, critical loads are prefetched from the L2 or the LLC
into the L1 cache with the help of Timeliness Aware and Criticality Triggered (TACT)
prefetchers. TACT utilizes the association between the address or data of load in-
structions in the vicinity of the target critical load to trigger the prefetches into L1.
The implication of TACT prefetchers is that critical loads in the LLC can be served
at the latency of the L1 cache, and the L2 cache is no longer necessary and can be
eliminated altogether.
Utility-Based Cache Partitioning (UCP) UCP is built on the insight that LRU does
not work well for shared caches because it tends to allocate the most cache capacity
to the application that issues the most memory requests rather than to the applica-
tion that benefits the most from the cache Qureshi and Patt [2006]. To address this
issue, Qureshi et al., dynamically partition the cache among cores, and they propose
a lightweight runtime mechanism to estimate the partition size for each core. The
key idea behind the partitioning scheme is to allocate a larger cache partition to the
application that is more likely to see an improved hit rate with the extra cache space.
For example, in Figure 5.5, LRU gives equake 7 cache ways even though it does not
see any hit rate improvement beyond 3 ways.
To find the hit rate curve for each competing application, UCP leverages the
stack property of the LRU policy, which says that an access that hits in an LRU-managed
cache containing n ways is guaranteed to also hit if the cache had more than n
ways (assuming that the number of sets remains constant). In particular, UCP deploys
sampled auxiliary tag directories, called Utility Monitors or UMONs, to monitor the
reuse behavior for each application, assuming that it had the entire cache to itself;
for each application, it counts the number of cache hits at each recency position.
The number of hits at each recency position determines the marginal utility of giving
the application one extra way. For example, if for a cache with 2 ways, 25 of 100
accesses miss the cache, 70 hit in the MRU position, and only 5 hit in the LRU po-
sition, then reducing the cache size from two ways to one way will increase the miss
count from 25 to 30. However, reducing the cache size from one way to zero ways
will dramatically increase the miss count from 30 to 100.
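A sketch of marginal-utility-driven partitioning from UMON-style hit counters; the greedy one-way-at-a-time allocation and the example numbers are illustrative simplifications of UCP's Lookahead algorithm.

```python
# Sketch of UCP-style partitioning. position_hits[app][j] is the number of hits
# the application would get in recency position j if it had the whole cache;
# ways are handed out greedily to the application with the highest marginal utility.

def partition(position_hits, total_ways):
    apps = list(position_hits)
    alloc = {a: 0 for a in apps}
    for _ in range(total_ways):
        def marginal(a):
            j = alloc[a]
            hits = position_hits[a]
            return hits[j] if j < len(hits) else 0
        best = max(apps, key=marginal)
        alloc[best] += 1
    return alloc

# Made-up hit curves: appA gains little beyond its first way, appB keeps gaining.
umon = {
    "appA": [70, 5, 3, 2, 1, 1, 1, 1],
    "appB": [40, 35, 30, 25, 20, 15, 10, 5],
}
print(partition(umon, total_ways=8))   # appA gets 1 way, appB gets 7
```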
Thread-Aware DIP The Dynamic Insertion Policy (DIP) Qureshi et al. [2007] dis-
cussed in Section 3.3 uses Set Dueling to modulate between the recency-friendly
LRU policy and the thrash-resistant BIP policy. However, DIP leaves room for im-
provement in shared caches as it does not distinguish between the caching behavior
of individual applications. So if DIP chooses LRU, all applications sharing the cache
will use the LRU policy, which is not desirable when some applications benefit from
the cache and others don’t.
Jaleel et al., propose thread-aware extensions to DIP that are designed to work
well for shared caches Jaleel et al. [2008]. The key idea is to make a choice for
each application individually. However, this idea is complicated by the fact that for
n applications sharing the cache, sampling the behavior of 2^n combinations is pro-
hibitively expensive. To mitigate this overhead, they propose two approaches. The
first approach, called TADIP-Isolated (TADIP-I), learns the insertion policy for each
application independently, assuming that all other applications use the LRU policy.
The second approach, called TADIP-Feedback (TADIP-F), accounts for interaction
among applications by learning the insertion policy for each application, assuming
that all other applications use the insertion policy that currently performs the best
for that application.
RADAR Our discussion so far has focused on multiple programs sharing the last-
level cache. Manivannan et al., instead look at the problem of last-level cache re-
placement for task-parallel programs running on a multi-core system. Their policy,
called RADAR, combines static and dynamic program information to predict dead
blocks for task-parallel programs Manivannan et al. [2016]. In particular, RADAR
builds on task-flow programming models, such as OpenMP, where programmer an-
notations explicitly specify (1) dependences between tasks, and (2) address regions
that will be accessed by each task. The runtime system uses this information in con-
junction with dynamic program behavior to predict regions that are likely to be dead.
Blocks that belong to dead regions are demoted and preferentially evicted from the
cache.
More concretely, RADAR has three variants that combine information from the
programming model and the architecture in different ways. First, the Look-ahead
scheme uses the task data-flow graph to peek into the window of tasks that are going
to be executed soon, and it uses this information to identify regions that are likely to
be accessed in the future and regions that are likely to be dead. Second, the Look-
back scheme tracks per-region access history to predict when the next region access
is likely to occur. Finally, the combined scheme exploits knowledge of future region
accesses and past region accesses to make more accurate predictions.
Demand-MIN Because MIN minimizes the total number of misses without distin-
guishing demand accesses from prefetches, Jain and Lin propose a variant of Be-
lady's MIN, called Demand-MIN, that minimizes demand misses in the presence of
prefetches Jain and Lin [2018]. Unlike MIN, which evicts the line that is reused
furthest in the future, Demand-MIN evicts the line that is prefetched furthest in the
future. More precisely,
Demand-MIN states:
Evict the line that will be prefetched furthest in the future, and if no such line
exists, evict the line that will see a demand request furthest in the future.
Thus, by preferentially evicting lines that can be prefetched in the future,
Demand-MIN accommodates lines that cannot be prefetched. For example, in Fig-
ure 5.7, Demand-MIN recognizes that because line X will be prefetched at time t = 1,
1 MIN correctly handles cache pollution, as inaccurate prefetches are always reused furthest in the future.
line X can be evicted at t = 0, thereby freeing up cache space in the time interval
between t = 0 and t = 1, which can be utilized to cache other demand loads. The re-
duction in demand miss rate can be significant: On a mix of SPEC 2006 benchmarks
running on 4 cores, LRU yields an average MPKI of 29.8, MIN an average of 21.7,
and Demand-MIN an average of 16.9.
Unfortunately, Demand-MIN's increase in demand hit rate comes at the ex-
pense of a larger number of prefetch misses,2 which results in extra prefetch traf-
fic. Thus, MIN and Demand-MIN define the extreme points of a design space, with
MIN minimizing overall traffic on one extreme and Demand-MIN minimizing de-
mand misses on the other.
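A minimal sketch of the Demand-MIN rule as stated above, assuming oracle knowledge of each cached line's next prefetch and next demand times (the dictionaries and helper names are ours):

```python
# Sketch: Demand-MIN victim selection. next_prefetch[line] / next_demand[line]
# give the time of the line's next prefetch / demand access, or None if none.

INF = float("inf")

def demand_min_victim(lines, next_prefetch, next_demand):
    # Evict the line that will be prefetched furthest in the future.
    will_be_prefetched = [l for l in lines if next_prefetch.get(l) is not None]
    if will_be_prefetched:
        return max(will_be_prefetched, key=lambda l: next_prefetch[l])
    # Otherwise, evict the line whose next demand request is furthest in the
    # future (a line with no future demand counts as furthest of all).
    def demand_time(line):
        t = next_demand.get(line)
        return t if t is not None else INF
    return max(lines, key=demand_time)
```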
Design Space Figure 5.8 shows the tradeoff between demand hit rate (x-axis) and
overall traffic (y-axis) for several SPEC benchmarks Jain and Lin [2018]. We see that
different benchmarks will prefer different points in this design space. Benchmarks
such as astar (blue) and sphinx (orange) have lines that are close to horizontal, so
they can enjoy the increase in demand hit rate that Demand-MIN provides while
incurring little increase in memory traffic. By contrast, benchmarks such as tonto
(light blue) and calculix (purple) have vertical lines, so Demand-MIN increases traffic
but provides no improvement in demand hit rate. Finally, the remaining benchmarks
(bwaves and cactus) present less obvious tradeoffs.
To navigate this design space, Flex-MIN picks a point between MIN and
Demand-MIN, such that the chosen point has a good tradeoff between demand hit
rate and traffic. In particular, Flex-MIN is built on the notion of a protected line,
which is a cache line that would be evicted by Demand-MIN but not by Flex-MIN
because it would generate traffic without providing a significant improvement in hit
rate. Thus, Flex-MIN is defined as follows:
2 Prefetch misses are prefetch requests that miss in the cache.
[Plot omitted. Figure 5.8 plots the % increase in memory traffic (y-axis, lower is
better) against the demand hit rate % (x-axis, higher is better) for astar_163B,
sphinx3_883B, tonto_2834B, calculix_2670B, bwaves_1609B, and cactusADM_734B.]
Figure 5.8: With prefetching, replacement policies face a tradeoff between demand hit
rate and prefetcher traffic.
Evict the line that will be prefetched furthest in the future and is not protected.
If no such line exists, default to MIN.
Jain and Lin [2018] define a simple heuristic to identify protected lines. Of course,
unlike MIN and Demand-MIN, Flex-MIN is not optimal in any theoretical sense,
since it is built on a heuristic.
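Building on the previous sketch, Flex-MIN simply removes protected lines from Demand-MIN's candidate set and otherwise falls back to MIN; the protected set itself comes from the heuristic mentioned above, which is not shown here.

```python
# Sketch: Flex-MIN victim selection given a set of protected lines.

INF = float("inf")

def flex_min_victim(lines, next_prefetch, next_demand, protected):
    # Evict the line that will be prefetched furthest in the future and is
    # not protected.
    candidates = [l for l in lines
                  if next_prefetch.get(l) is not None and l not in protected]
    if candidates:
        return max(candidates, key=lambda l: next_prefetch[l])
    # Default to MIN: evict the line whose next reference of any kind
    # (demand or prefetch) is furthest in the future.
    def next_use(line):
        times = [t for t in (next_prefetch.get(line), next_demand.get(line))
                 if t is not None]
        return min(times) if times else INF
    return max(lines, key=next_use)
```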
[Diagram omitted: overview of Harmony. Given the PC and the cache access stream,
FlexMINgen computes Flex-MIN's decisions for the past, and an OPT predictor re-
members those past Flex-MIN decisions to assign an insertion priority to each line
inserted into the last-level cache, with hit/miss feedback from the cache.]
Base Victim Compression Gaur et al. [2016] observe that cache compression and
replacement policies can interact antagonistically. In particular, they find that de-
cisions that favor compressed blocks over uncompressed blocks can force intelligent
replacement policies to change their replacement order, leading to sub-optimal re-
placement decisions; as a result, the compressed cache loses the performance gains
of state-of-the-art replacement policies.
To avoid negative interactions with replacement policies, Gaur et al. introduce
a cache design that guarantees that every line that would have existed in an uncom-
pressed cache is also present in the compressed cache. Their design keeps the data
array unmodified but augments the tag array to associate two tags with each physical
way. Logically, the cache is partitioned into a Baseline cache, which is managed just
like an uncompressed cache, and a Victim cache, which opportunistically caches vic-
tims from the Baseline cache if they can be compressed. This design guarantees a hit
rate at least as high as that of an uncompressed cache, so it retains the benefits of ad-
vanced replacement policies, while simple modifications to the tag array allow it to
leverage the benefits of compression.
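The sketch below captures the structure of this design at the level of one cache set; the compression check and the exact victim-placement rule are our simplifying assumptions, not details from the paper.

```python
# Sketch: a set with `ways` physical data ways and two tags per way. The
# baseline tags are managed by an unmodified replacement policy, so the hit
# rate is never worse than an uncompressed cache; the victim tags track
# baseline victims that are retained opportunistically when they compress.

class BaseVictimSet:
    def __init__(self, ways=4):
        self.base_tags = [None] * ways     # managed like an uncompressed cache
        self.victim_tags = [None] * ways   # best-effort, compressed victims

    def lookup(self, tag):
        # A hit in either tag array is a cache hit.
        return tag in self.base_tags or tag in self.victim_tags

    def install(self, way, new_tag, victim_compresses):
        # The baseline policy chose `way`; its old occupant survives as a
        # victim only if it can be stored in compressed form.
        old_tag = self.base_tags[way]
        self.victim_tags[way] = old_tag if victim_compresses else None
        self.base_tags[way] = new_tag
```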
To identify hot pages, Jiang et al. [2010] use a filter cache (CHOP-FC) that tracks
the access frequency of pages that miss in the L3 and inserts a page in the DRAM
cache only if its access frequency is above a certain threshold (see Figure 5.11). To
avoid missing hot pages that are prematurely evicted from the filter cache, they also
propose a scheme in which the frequency counters in the filter cache are backed up
in main memory (CHOP-MFC). For efficiency, these counters can be added to the
page's page-table entry so that they can be retrieved on-chip on a TLB miss. Finally,
since the requirements on the DRAM cache vary by workload, they propose adaptive
versions of CHOP-FC and CHOP-MFC that turn the filter cache on or off based on
memory utilization.
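As a rough sketch of the filter-cache idea (the capacity, threshold, and eviction choice below are illustrative values, not the paper's), a page is admitted into the DRAM cache only after its miss counter crosses a threshold:

```python
# Sketch: a CHOP-FC-style filter cache that counts accesses to pages that
# miss in the L3 and admits a page into the DRAM cache only once it is "hot".

class FilterCache:
    def __init__(self, capacity=1024, threshold=8):
        self.counters = {}            # page -> access count
        self.capacity = capacity
        self.threshold = threshold

    def should_insert(self, page):
        # Called on an L3 miss to `page` that also misses in the DRAM cache.
        if page not in self.counters and len(self.counters) >= self.capacity:
            # Evict some tracked page; CHOP-MFC would instead back the
            # counter up in main memory so its history is not lost.
            self.counters.pop(next(iter(self.counters)))
        self.counters[page] = self.counters.get(page, 0) + 1
        return self.counters[page] >= self.threshold
```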
CHAPTER 6
Conclusions
In this book, we have summarized and organized research in cache replacement by
defining a new taxonomy of cache replacement policies. While we have not dis-
cussed every policy in the literature, we hope that our taxonomy outlines strategic
shifts in the design of cache replacement solutions. For example, we have shown
that while the first three or four decades of research in cache replacement focused
largely on Coarse-Grained policies, the last decade has seen a notable shift towards
Fine-Grained policies.
[Figure 6.1 omitted: IPC speedup over LRU (%) on single-core and four-core con-
figurations, each measured without a prefetcher and with an L1/L2 prefetcher. The
policies shown include SHiP (baseline), SHiP++, MPPPB, LIME1, ReD, Hawkeye,
and Belady's MIN. Since MIN requires knowledge of the future, the simulation is
run twice to estimate MIN's speedup, with the second simulation using MIN's caching
decisions from the first; it is difficult to simulate MIN on multi-core configurations
because applications can interleave non-deterministically across different runs.]
Cache replacement provides larger benefits on multi-core systems than on single-
core systems: on the 4-core configuration, the winning solution improves perfor-
mance by 9.6% in the absence of prefetching (vs. 5% for the single-core configu-
ration) and by 7.2% in the presence of prefetching (vs. 2.5% for the single-core
configuration).
Trends and Challenges Our comprehensive look at cache replacement policies re-
veals some clear trends, which point to interesting opportunities for future research.
First, Fine-Grained policies, particularly Classification-Based policies, have
been shown to outperform traditional Coarse-Grained policies. The most recent
Cache Replacement Championship publicized the top three finishers among its 15
submissions, and all three used Fine-Grained policies. Of course, Fine-
Grained policies leverage aging mechanisms from Coarse-Grained policies, so one
avenue of future work would be to explore aging schemes customized for Fine-
Grained policies.
Second, Fine-Grained policies learn from past behavior, but their performance is
limited by their prediction accuracy and by their ability to tolerate inaccurate pre-
dictions. Improvements in prediction accuracy are needed to bridge
the gap between practical policies and Belady’s MIN. As we discuss in Section 4.2.3,
state-of-the-art Fine-Grained policies now view cache replacement as a supervised
learning problem, which opens up the possibility of applying machine learning to
cache replacement.
Third, as seen in Figure 6.1, there remains a considerable gap between state-
of-the-art replacement policies and MIN even for single-core configurations, which
suggests that there is still room to improve cache replacement for both single-core
and multi-core configurations.
Fourth, recent advances have blurred the line between dead block predictors
and Fine-Grained cache replacement policies, and we have shown that even though tradi-
tional dead block predictors were designed for a different goal, they fit well in our
cache replacement taxonomy.
Finally, there is room to improve cache replacement policies by considering
richer factors, including (1) cache replacement policies that cooperate with other
components in the system, such as the prefetcher, (2) policies that use performance
goals that go beyond improved cache hit rates, and (3) policies that take into account
the cache architecture and new memory technologies. Unfortunately, accounting for
richer factors within a cache replacement policy remains a challenging problem be-
cause of the significantly larger design spaces and because of the lack of efficient
optimal algorithms to guide the design process. Nevertheless, with the low-hanging
fruit already plucked, these avenues remain promising for future cache replacement
research.
Bibliography
2nd Cache Replacement Championship, 2017. URL http://crc2.ece.tamu.edu/.
Jaume Abella, Antonio González, Xavier Vera, and Michael FP O’Boyle. IATAC: a smart
predictor to turn off L2 cache lines. ACM Transactions on Architecture and Code
Optimization (TACO), 2(1):55–77, 2005.
Jorge Albericio, Pablo Ibáñez, Víctor Viñals, and José M Llabería. The reuse
cache: downsizing the shared last-level cache. In Proceedings of the 46th Annual
IEEE/ACM International Symposium on Microarchitecture, pages 310–321. ACM,
2013.
Akhil Arunkumar and Carole-Jean Wu. Remap: Reuse and memory access cost aware
eviction policy for last level cache management. In 2014 IEEE 32nd International
Conference on Computer Design (ICCD), pages 110–117. IEEE, 2014.
J.-L. Baer and W.-H. Wang. On the inclusion properties for multi-level cache hier-
archies. In Proceedings of the 15th Annual International Symposium on Computer
Architecture, pages 73–80, 1988.
Nathan Beckmann and Daniel Sanchez. Maximizing cache performance under un-
certainty. In 2017 IEEE International Symposium on High Performance Computer
Architecture (HPCA), pages 109–120, 2017.
Bryan Black, Murali Annavaram, Ned Brekelbaum, John DeVale, Lei Jiang, Gabriel H
Loh, Don McCaule, Pat Morrow, Donald W Nelson, Daniel Pantuso, et al. Die
stacking (3D) microarchitecture. In Proceedings of the 39th Annual IEEE/ACM
International Symposium on Microarchitecture, pages 469–479. IEEE Computer
Society, 2006.
Burton H Bloom. Space/time trade-offs in hash coding with allowable errors. Com-
munications of the ACM, 13(7):422–426, 1970.
Chiachen Chou, Aamer Jaleel, and Moinuddin K Qureshi. BEAR: techniques for mit-
igating bandwidth bloat in gigascale DRAM caches. In ACM SIGARCH Computer
Architecture News, volume 43, pages 198–210. ACM, 2015.
Edward Grady Coffman and Peter J Denning. Operating systems theory, volume 973.
Prentice-Hall Englewood Cliffs, NJ, 1973.
Peter J Denning. Thrashing: Its causes and prevention. In Proceedings of the De-
cember 9-11, 1968, fall joint computer conference, part I, pages 915–922. ACM,
1968.
Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero, and Alexan-
der V. Veidenbaum. Improving cache management policies using dynamic reuse
distances. In 45th Annual IEEE/ACM International Symposium on Microarchi-
tecture (MICRO), pages 389–400, 2012.
Priyank Faldu and Boris Grot. Leeway: Addressing variability in dead-block predic-
tion for last-level caches. In Proceedings of the 26th International Conference on
Parallel Architectures and Compilation Techniques, pages 180–193, 2017.
Brian Fields, Shai Rubin, and Rastislav Bodík. Focusing processor policies via
critical-path prediction. In Proceedings of the 28th Annual International Sym-
posium on Computer Architecture, ISCA ’01, pages 74–85. ACM, 2001.
Shinobu Fujita, Hiroki Noguchi, Kazutaka Ikegami, Susumu Takeda, Kumiko No-
mura, and Keiko Abe. Novel memory hierarchy with e-STT-MRAM for near-future
applications. In VLSI Design, Automation and Test (VLSI-DAT), 2017 Interna-
tional Symposium on, pages 1–2. IEEE, 2017.
Hongliang Gao and Chris Wilkerson. A dueling segmented LRU replacement algo-
rithm with adaptive bypassing. In JWAC 2010-1st JILP Workshop on Computer
Architecture Competitions: Cache Replacement Championship, 2010.
Zhigang Hu, Stefanos Kaxiras, and Margaret Martonosi. Timekeeping in the mem-
ory system: predicting and optimizing memory behavior. In Computer Architec-
ture, 2002. Proceedings. 29th Annual International Symposium on, pages 209–
220. IEEE, 2002.
Yasuo Ishii, Mary Inaba, and Kei Hiraki. Access map pattern matching for high per-
formance data cache prefetch. In Journal of Instruction-Level Parallelism, vol-
ume 13, pages 1–24, 2011.
Yasuo Ishii, Mary Inaba, and Kei Hiraki. Unified memory optimizing architecture:
memory subsystem control with a unified predictor. In Proceedings of the 26th
ACM International Conference on Supercomputing, pages 267–278, 2012.
Akanksha Jain and Calvin Lin. Back to the future: Leveraging belady’s algorithm for
improved cache replacement. In Proceedings of the International Symposium on
Computer Architecture (ISCA), June 2016.
Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon Steely,
Jr., and Joel Emer. Adaptive insertion policies for managing shared caches. In 17th
International Conference on Parallel Architectures and Compilation Techniques
(PACT), pages 208–219, 2008.
Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon C Steely Jr, and Joel Emer. Achiev-
ing non-inclusive cache performance with inclusive caches: Temporal locality
aware (TLA) cache management policies. In Proceedings of the 2010 43rd Annual
IEEE/ACM International Symposium on Microarchitecture, pages 151–162. IEEE
Computer Society, 2010a.
Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr, and Joel Emer. High perfor-
mance cache replacement using re-reference interval prediction (RRIP). In Pro-
ceedings of the International Symposium on Computer Architecture (ISCA), pages
60–71, 2010b.
Jaeheon Jeong and Michel Dubois. Cache replacement algorithms with nonuniform
miss costs. IEEE Transactions on Computers, 55(4):353–365, 2006.
Djordje Jevdjic, Stavros Volos, and Babak Falsafi. Die-stacked DRAM caches for
servers: hit ratio, latency, or bandwidth? Have it all with footprint cache. In ACM
SIGARCH Computer Architecture News, volume 41, pages 404–415. ACM, 2013.
Djordje Jevdjic, Gabriel H Loh, Cansu Kaynak, and Babak Falsafi. Unison cache: A
scalable and effective die-stacked DRAM cache. In Proceedings of the 47th Annual
IEEE/ACM International Symposium on Microarchitecture, pages 25–37. IEEE
Computer Society, 2014.
Xiaowei Jiang, Niti Madan, Li Zhao, Mike Upton, Ravishankar Iyer, Srihari Makineni,
Donald Newell, Yan Solihin, and Rajeev Balasubramonian. CHOP: adaptive filter-
based DRAM caching for CMP server platforms. In HPCA-16 2010 The Sixteenth
International Symposium on High-Performance Computer Architecture, pages 1–
12. IEEE, 2010.
Stefanos Kaxiras, Zhigang Hu, and Margaret Martonosi. Cache decay: exploiting
generational behavior to reduce cache leakage power. In Computer Architecture,
2001. Proceedings. 28th Annual International Symposium on, pages 240–251.
IEEE, 2001.
Samira Khan, Yingying Tian, and Daniel A Jiménez. Sampling dead block predic-
tion for last-level caches. In 43rd Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), pages 175–186, 2010.
Mazen Kharbutli and Rami Sheikh. Lacs: A locality-aware cost-sensitive cache re-
placement algorithm. IEEE Transactions on Computers, 63(8):1975–1987, 2014.
Jinchun Kim, Elvira Teran, Paul V Gratz, Daniel A Jiménez, Seth H Pugsley, and Chris
Wilkerson. Kill the program counter: Reconstructing program behavior in the pro-
cessor cache hierarchy. In Proceedings of the Twenty-Second Int’l Conference on
Architectural Support for Programming Languages and Operating Systems (ASP-
LOS), pages 737–749, 2017.
Kunal Korgaonkar, Ishwar Bhati, Huichu Liu, Jayesh Gaur, Sasikanth Manipatruni,
Sreenivas Subramoney, Tanay Karnik, Steven Swanson, Ian Young, and Hong
Wang. Density tradeoffs of non-volatile memory as a replacement for SRAM based
last level cache. In Proceedings of the 45th Annual International Symposium on
Computer Architecture, pages 315–327. IEEE Press, 2018.
An-Chow Lai and Babak Falsafi. Selective, accurate, and timely self-invalidation
using last-touch prediction. In the 27th International Symposium on Computer
Architecture (ISCA), pages 139–148, 2000.
An-Chow Lai, Cem Fide, and Babak Falsafi. Dead-block prediction & dead-block
correlating prefetchers. In Proceedings. 28th Annual International Symposium on
Computer Architecture (ISCA), 2001.
Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H Noh, Sang Lyul Min, Yookun
Cho, and Chong Sang Kim. On the existence of a spectrum of policies that sub-
sumes the least recently used (LRU) and least frequently used (LFU) policies. In
ACM SIGMETRICS Performance Evaluation Review, volume 27, pages 134–143.
ACM, 1999.
Haiming Liu, Michael Ferdman, Jaehyuk Huh, and Doug Burger. Cache bursts: A new
approach for eliminating dead blocks and increasing cache efficiency. In 41st An-
nual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages
222–233, 2008.
Anant Vithal Nori, Jayesh Gaur, Siddharth Rai, Sreenivas Subramoney, and Hong
Wang. Criticality aware tiered cache hierarchy: a fundamental relook at multi-level
cache hierarchies. In 2018 ACM/IEEE 45th Annual International Symposium on
Computer Architecture (ISCA), pages 96–109. IEEE, 2018.
Elizabeth J O’Neil, Patrick E O’Neil, and Gerhard Weikum. The LRU-K page re-
placement algorithm for database disk buffering. In ACM SIGMOD Record, pages
297–306, 1993.
Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Phillip B Gibbons,
Michael A Kozuch, and Todd C Mowry. Exploiting compressed block size as an
indicator of future reuse. In 2015 IEEE 21st International Symposium on High
Performance Computer Architecture (HPCA), pages 51–63. IEEE, 2015.
Moinuddin K Qureshi, Daniel N Lynch, Onur Mutlu, and Yale N Patt. A case for
MLP-aware cache replacement. In Proceedings of the International Symposium
on Computer Architecture (ISCA), pages 167–178, 2006.
Moinuddin K Qureshi, Aamer Jaleel, Yale N Patt, Simon C Steely, and Joel Emer.
Adaptive insertion policies for high performance caching. In Proceedings of the
International Symposium on Computer Architecture (ISCA), pages 381–391, 2007.
Daniel Sanchez and Christos Kozyrakis. Vantage: scalable and efficient fine-grain
cache partitioning. In Proceedings of the International Symposium on Computer
Architecture (ISCA), pages 57–68, 2011.
Somayeh Sardashti, André Seznec, and David A Wood. Yet another compressed
cache: A low-cost yet effective compressed cache. ACM Transactions on Archi-
tecture and Code Optimization (TACO), 13(3):27, 2016.
Vivek Seshadri, Onur Mutlu, Michael A Kozuch, and Todd C Mowry. The evicted-
address filter: A unified mechanism to address both cache pollution and thrashing.
In the 21st Int’l Conference on Parallel Architectures and Compilation Techniques
(PACT), pages 355–366, 2012.
Vivek Seshadri, Samihan Yedkar, Hongyi Xin, Onur Mutlu, Phillip B Gibbons,
Michael A Kozuch, and Todd C Mowry. Mitigating prefetcher-caused pollution
using informed caching policies for prefetched blocks. ACM Transactions on Ar-
chitecture and Code Optimization (TACO), 11(4):51, 2015.
D Shasha and T Johnson. 2Q: a low overhead high performance buffer management
replacement algorithm. In Proceedings of the Twentieth International Conference
on Very Large Databases, Santiago, Chile, pages 439–450, 1994.
Yannis Smaragdakis, Scott Kaplan, and Paul Wilson. EELRU: simple and effective
adaptive page replacement. In ACM SIGMETRICS Performance Evaluation Re-
view, pages 122–133, 1999.
Santhosh Srinath, Onur Mutlu, Hyesoon Kim, and Yale N Patt. Feedback directed
prefetching: Improving the performance and bandwidth-efficiency of hardware
prefetchers. In Proceedings of the 13th International Symposium on High Per-
formance Computer Architecture (HPCA), pages 63–74, 2007.
Srikanth T Srinivasan, R Dz-Ching Ju, Alvin R Lebeck, and Chris Wilkerson. Lo-
cality vs. criticality. In Computer Architecture, 2001. Proceedings. 28th Annual
International Symposium on, pages 132–143. IEEE, 2001.
Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu.
The application slowdown model: Quantifying and controlling the impact of inter-
application interference at shared caches and main memory. In Proceedings of the
48th International Symposium on Microarchitecture, pages 62–75. ACM, 2015.
Masamichi Takagi and Kei Hiraki. Inter-reference gap distribution replacement:
an improved replacement algorithm for set-associative caches. In Proceedings of
the 18th annual international conference on Supercomputing, pages 20–30. ACM,
2004.
Elvira Teran, Zhe Wang, and Daniel A Jiménez. Perceptron learning for reuse predic-
tion. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International
Symposium on, pages 1–12. IEEE, 2016.
HSP Wong, C Ahn, J Cao, HY Chen, SW Fong, Z Jiang, C Neumann, S Qin, J Sohn,
Y Wu, et al. Stanford memory trends, 2016.
Carole-Jean Wu, Aamer Jaleel, Margaret Martonosi, Simon C. Steely, Jr., and
Joel Emer. PACMan: prefetch-aware cache management for high performance
caching. In 44th Annual IEEE/ACM International Symposium on Microarchitec-
ture (MICRO), pages 442–453, 2011b.
Author’s Biography
AKANKSHA JAIN
Akanksha Jain is a Research Associate at The University of Texas at Austin. She re-
ceived her Ph.D. in Computer Science from The University of Texas in August 2016.
In 2009, she received the B.Tech. and M.Tech. degrees in Computer Science and En-
gineering from the Indian Institute of Technology Madras. Her research interests are
in computer architecture, with a particular focus on the memory system and on using
machine learning techniques to improve the design of memory system optimizations.
CALVIN LIN
Calvin Lin is a University Distinguished Teaching Professor of Computer Science at
The University of Texas at Austin. Lin received the BSE in Computer Science from
Princeton University in 1985 (Magna Cum Laude) and the Ph.D. in Computer Sci-
ence from the University of Washington in December, 1992. Lin was a postdoc at
the University of Washington until 1996, when he joined the faculty at Texas. Lin’s
research takes a broad view of how compilers and computer hardware can be used
to improve system performance, system security, and programmer productivity. He
is also Director of UT’s Turing Scholars Honors Program, and when he is not work-
ing, he can be found chasing his two young sons or coaching the UT men’s ultimate
frisbee team.