Lecture 7

This lecture covers advanced cache optimizations in computer architecture, focusing on reducing hit time, miss penalty, and miss rate through techniques such as small and simple caches, way prediction, victim caches, and hardware prefetching. It also introduces CACTI for modeling cache access time and power, and highlights compiler optimizations that can significantly reduce cache misses. Key strategies include pipelining cache access, implementing non-blocking caches, and optimizing data access patterns to improve spatial and temporal locality.


CS-506 Advanced Computer Systems Architecture
Lecture #7
Gul Munir Ujjan
Assistant Professor
CISE Department, NEDUET
Karachi
9 Advanced Cache Optimizations

Reducing hit time
1. Small and simple caches
2. Way prediction
3. Victim caches

Increasing cache bandwidth
4. Pipelined caches
5. Multibanked caches
6. Nonblocking caches

Reducing miss penalty
7. Critical word first

Reducing miss rate
8. Hardware prefetching
9. Compiler optimizations
1. Fast Hit Times via Small and Simple Caches
- Indexing the tag memory and then comparing takes time.
- A small cache helps hit time because a smaller memory takes less time to index.
  - E.g., the L1 caches are the same size across three generations of AMD microprocessors: K6, Athlon, and Opteron.
  - An L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip.
- Simple → direct mapped
  - The tag check can be overlapped with data transmission, since there is no way selection to wait for.
- Access time estimates for 90 nm using the CACTI 4.0 model:
  - Median ratios of access time relative to direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches.
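As a purely illustrative software analogue of the index/compare step, a direct-mapped lookup splits the address into offset, index, and tag; the parameters and names below are assumptions, not taken from the slides:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical parameters: 32 KB direct-mapped cache, 64-byte blocks. */
#define BLOCK_SIZE   64
#define NUM_SETS     (32 * 1024 / BLOCK_SIZE)   /* 512 sets */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[BLOCK_SIZE];
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Returns true on a hit: one index, one tag compare, no way selection. */
bool dm_lookup(uint64_t addr)
{
    uint64_t block = addr / BLOCK_SIZE;    /* drop the block offset   */
    uint64_t index = block % NUM_SETS;     /* which line to index     */
    uint64_t tag   = block / NUM_SETS;     /* remaining high bits     */
    return cache[index].valid && cache[index].tag == tag;
}

Because there is only one possible location, the data can be read speculatively while the single tag compare completes, which is the "overlap tag check with data transmission" point above.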
[Figure: access time (ns) vs. cache size (16 KB to 1 MB) for 1-way, 2-way, 4-way, and 8-way associativity]
CACTI
- CACTI is an integrated model of cache and memory access time, cycle time, area, leakage, and dynamic power.
- By integrating all these models together, one can have confidence that tradeoffs between time, power, and area are all based on the same assumptions and, hence, are mutually consistent.
- CACTI is intended for use by computer architects to better understand the performance tradeoffs inherent in memory system organizations.
- The open-source software can be accessed at https://github.com/Cacti/cacti
Recall: Set-Associative Cache
- N-way set associative: N entries for each cache index.
  - N direct-mapped caches operate in parallel.
- Example: two-way set-associative cache
  - The cache index selects a "set" from the cache.
  - The two tags in the set are compared to the input tag in parallel.
  - Data is selected based on the tag comparison result.
- Disadvantage: time to set the mux.
[Figure: two-way set-associative cache organization. The cache index selects a set; both tags are compared in parallel; the compare results (Sel0/Sel1) drive a mux that selects the hit cache block, and the OR of the compares produces the hit signal.]
L1 Size and Associativity
[Figure: access time vs. cache size and associativity]

L1 Size and Associativity
[Figure: energy per read vs. cache size and associativity]
2. Fast Hit Times via Way Prediction
- How can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
- Way prediction: keep extra bits in the cache to predict the "way" (block within the set) of the next cache access.
  - The multiplexor is set early to select the desired block, and only one tag comparison is performed that clock cycle, in parallel with reading the cache data.
  - On a misprediction, the other blocks are checked for matches in the next clock cycle.
- Prediction accuracy is about 85%.
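A minimal software sketch of the idea, assuming a 2-way set-associative cache with one prediction entry per set (structure names and sizes are illustrative, not the actual hardware):

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 64
#define NUM_SETS   256
#define WAYS       2

typedef struct {
    bool     valid;
    uint64_t tag;
} way_t;

static way_t   sets[NUM_SETS][WAYS];
static uint8_t predicted_way[NUM_SETS];   /* extra prediction bits per set */

/* Returns the cycles the access takes in this simplified model: 1 if the
 * predicted way hits, 2 if another way hits on the next-cycle re-check. */
int wp_access(uint64_t addr, bool *hit)
{
    uint64_t block = addr / BLOCK_SIZE;
    uint64_t index = block % NUM_SETS;
    uint64_t tag   = block / NUM_SETS;

    int p = predicted_way[index];
    if (sets[index][p].valid && sets[index][p].tag == tag) {
        *hit = true;                      /* fast hit: one tag compare */
        return 1;
    }
    for (int w = 0; w < WAYS; w++) {      /* check the other way(s) next cycle */
        if (w != p && sets[index][w].valid && sets[index][w].tag == tag) {
            predicted_way[index] = (uint8_t)w;   /* update the prediction */
            *hit = true;
            return 2;
        }
    }
    *hit = false;                         /* true miss: go to the next level */
    return 2;
}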
3. Victim Caches
- Introduce a victim cache to reduce miss penalty.
- Use a small amount of additional storage near L1 for the most recently evicted blocks.
- Effective for thrashing problems in the L1 cache.
- Look up the victim cache before going to the L2 cache.
- The L1 cache and the victim cache are exclusive.
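A minimal sketch of that lookup order; the sizes, names, and placeholder L1/L2 probes are assumptions for illustration only:

#include <stdint.h>
#include <stdbool.h>

#define VICTIM_ENTRIES 4          /* victim caches are tiny, e.g. 4-16 blocks */

typedef struct {
    bool     valid;
    uint64_t block_addr;
} victim_entry_t;

static victim_entry_t victim[VICTIM_ENTRIES];

/* Placeholder probes standing in for the real L1 and L2 caches. */
static bool l1_lookup(uint64_t block_addr) { (void)block_addr; return false; }
static bool l2_lookup(uint64_t block_addr) { (void)block_addr; return true; }

/* On an L1 miss, probe the victim cache before going to L2. */
bool access_block(uint64_t block_addr)
{
    if (l1_lookup(block_addr))
        return true;                          /* L1 hit */

    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].block_addr == block_addr) {
            /* Victim hit: in real hardware the block is swapped back into L1
             * (keeping the two exclusive); the swap itself is omitted here. */
            victim[i].valid = false;
            return true;
        }
    }
    return l2_lookup(block_addr);             /* victim miss: go to L2 */
}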
4. Increasing Cache Bandwidth by Pipelining
- Pipeline the cache access to maintain bandwidth, at the cost of higher latency.
- Split a cache access into several sub-stages, e.g. indexing, tag read, hit/miss check, and data transfer.
- Pipelining cache access improves bandwidth.
- E.g., Pentium 4 through Core i7 use 4 cycles.
5. Increasing Cache Bandwidth: Non-Blocking Caches
- A blocking cache stalls the pipeline on a cache miss.
- A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss.
  - Requires out-of-order execution.
  - Requires multi-banked memories.
5. Increasing Cache Bandwidth: Non-Blocking Caches (cont.)
- Requires out-of-order execution.
- "Hit under miss" reduces the effective miss penalty by continuing to work during the miss instead of ignoring CPU requests.
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
- Significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses (see the sketch below).
- Requires multiple memory banks (otherwise multiple outstanding misses cannot be supported).
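As a rough illustration of why the controller becomes more complex, here is a sketch of MSHR-style bookkeeping for outstanding misses; the structure is a simplified assumption of mine (real MSHRs also record destination registers, block offsets, etc.):

#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHRS 8     /* maximum outstanding misses supported */

typedef struct {
    bool     valid;
    uint64_t block_addr;   /* block being fetched from the next level */
} mshr_t;

static mshr_t mshr[NUM_MSHRS];

/* Called on a cache miss. Returns:
 *  0  miss merged with an already outstanding miss to the same block
 *  1  new outstanding miss allocated (miss under miss)
 * -1  all MSHRs busy: the cache must stall, as a blocking cache always would */
int handle_miss(uint64_t block_addr)
{
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshr[i].valid && mshr[i].block_addr == block_addr)
            return 0;                         /* secondary miss: merge */
        if (!mshr[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                            /* structural stall */
    mshr[free_slot].valid = true;
    mshr[free_slot].block_addr = block_addr;  /* track the new outstanding miss */
    return 1;
}

While misses are tracked in these entries the cache keeps servicing hits; only when every MSHR is busy does it stall like a blocking cache.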
6. Increasing Cache Bandwidth: Multibanked Caches
- Non-blocking caches require multiple banks.
- Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses.
  - E.g., the Sun T1 ("Niagara") L2 has 4 banks.
  - The ARM Cortex-A8 supports 1-4 banks of L2 cache.
  - The Intel i7 supports 4 banks for L1 and 8 banks for L2.
- Banking works best when accesses naturally spread themselves across banks, so the mapping of addresses to banks affects the behavior of the memory system.
- A simple mapping that works well is "sequential interleaving": spread block addresses sequentially across banks (see the sketch below).
  - E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on.
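A minimal sketch of that mapping, with illustrative constants (the slide's example uses 4 banks; block-level interleaving is assumed):

#include <stdint.h>

#define NUM_BANKS  4
#define BLOCK_SIZE 64

/* Sequential interleaving: spread block addresses round-robin across banks.
 * Block 0 -> bank 0, block 1 -> bank 1, ..., block 4 -> bank 0, ... */
unsigned bank_of(uint64_t addr)
{
    uint64_t block_addr = addr / BLOCK_SIZE;
    return (unsigned)(block_addr % NUM_BANKS);
}

With this mapping, sequential block addresses go to banks 0, 1, 2, 3, 0, 1, ..., so consecutive accesses fall in different banks and can proceed in parallel.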
7. Reducing Miss Penalty: Early Restart and Critical Word First
- Don't wait for the entire block to be loaded before restarting the CPU.
- Early restart: request the words in order; as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
  - Because of spatial locality, the CPU tends to want the next sequential word soon, so the size of the benefit of early restart alone is not clear.
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in.
- Long blocks are more popular today, so critical word first is widely used.
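A minimal sketch of the delivery order under critical word first combined with early restart, assuming an 8-word block (names and sizes are illustrative):

#include <stdio.h>

#define WORDS_PER_BLOCK 8

/* Deliver the words of a missed block starting at the critical (missed)
 * word and wrapping around; the CPU may restart after the first delivery. */
void fill_block_critical_word_first(unsigned critical_word)
{
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++) {
        unsigned w = (critical_word + i) % WORDS_PER_BLOCK;
        printf("deliver word %u%s\n", w,
               i == 0 ? "  <- CPU restarts here (early restart)" : "");
    }
}

For example, fill_block_critical_word_first(5) delivers the words in the order 5, 6, 7, 0, 1, 2, 3, 4, and the CPU resumes as soon as word 5 arrives.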
8. Reducing Misses by Hardware Prefetching of Instructions & Data
- Prefetching relies on having extra memory bandwidth that can be used without penalty.
- Instruction prefetching:
  - Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block.
  - The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer.
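A rough sketch of next-block instruction prefetch into a one-entry stream buffer, as described above; the names and the placeholder memory-fetch stub are assumptions:

#include <stdint.h>
#include <stdbool.h>

static uint64_t stream_buffer_block;      /* block held or being fetched */
static bool     stream_buffer_valid;

/* Placeholder standing in for a real memory fetch. */
static void fetch_from_memory(uint64_t block) { (void)block; }

/* On an instruction-cache miss for `block`: if the stream buffer already
 * holds it, move it into the cache at no memory cost; otherwise fetch the
 * missed block. Either way, prefetch the next sequential block. */
void icache_miss(uint64_t block)
{
    if (stream_buffer_valid && stream_buffer_block == block) {
        stream_buffer_valid = false;      /* prefetch hit: no memory access */
    } else {
        fetch_from_memory(block);         /* demand fetch -> instruction cache */
    }
    fetch_from_memory(block + 1);         /* prefetch -> stream buffer */
    stream_buffer_block = block + 1;
    stream_buffer_valid = true;
}

When the next miss is to block + 1, it hits in the stream buffer and costs no memory access, which is the source of the speedups shown in the figure below.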
[Figure: performance improvement from hardware prefetching on SPECint2000 and SPECfp2000 benchmarks; speedups range from about 1.16 to 1.97]
9. Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% in software on an 8 KB direct-mapped cache with 4-byte blocks.
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses.
  - Use profiling to look at conflicts (using tools they developed).
- Data
  - Merging arrays: improve spatial locality by using a single array of compound elements instead of two arrays.
  - Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
  - Loop fusion: combine two independent loops that have the same looping and some overlapping variables.
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows (a sketch follows the loop fusion example below).
Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.
Loop Interchange Example

/* Before */
for (j = 0; j < 100; j = j+1)
  for (i = 0; i < 5000; i = i+1)
    x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
  for (j = 0; j < 100; j = j+1)
    x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

Two misses per access to a and c in the original vs. one miss per access after fusion; improves temporal locality.
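Blocking Example

The compiler-optimization slide also lists blocking, but no example appears in this deck; the following is the standard blocking (tiling) sketch for matrix multiply x = y * z, written in the same style as the examples above (loop variables are assumed declared, x is initialized to zero, and B is the blocking factor):

/* Before: the inner loops sweep whole rows of y and whole columns of z,
   so for large N neither stays in the cache between reuses */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

/* After: compute on B x B sub-blocks so the touched pieces of y and z are
   reused while they are still cached (x must start at zero) */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < jj+B && j < N; j = j+1) {
        r = 0;
        for (k = kk; k < kk+B && k < N; k = k+1)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }

The blocking factor B is chosen so that the B x B sub-blocks of y and z (plus a row of x) fit in the cache, improving temporal locality.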