Lecture 7

This lecture covers advanced cache optimizations in computer architecture, focusing on reducing hit time, miss penalty, and miss rate through techniques such as small and simple caches, way prediction, victim caches, and hardware prefetching. It also introduces CACTI for modeling cache access time and power, and highlights compiler optimizations that can significantly reduce cache misses. Key strategies include pipelining cache access, implementing non-blocking caches, and optimizing data access patterns to improve spatial and temporal locality.


CS-506 Advanced Computer Systems Architecture
Lecture #7
Gul Munir Ujjan
Assistant Professor
CISE Department, NEDUET
Karachi
9 Advanced Cache Optimizations

Reducing hit time
1. Small and simple caches
2. Way prediction
3. Victim caches

Increasing cache bandwidth
4. Pipelined caches
5. Multibanked caches
6. Nonblocking caches

Reducing miss penalty
7. Critical word first

Reducing miss rate
8. Hardware prefetching
9. Compiler optimizations
1. Fast Hit Times via Small and Simple Caches
- Indexing the tag memory and then comparing takes time.
- A small cache helps hit time because a smaller memory takes less time to index.
  - E.g., the L1 caches are the same size across three generations of AMD microprocessors: K6, Athlon, and Opteron.
  - An L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip.
- Simple → direct mapped
  - The tag check can be overlapped with data transmission, since there is no way selection to wait for.
- Access time estimates for 90 nm using the CACTI 4.0 model:
  - Median ratios of access time relative to direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches.
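As a purely illustrative software analogue of the index/compare step, a direct-mapped lookup splits the address into offset, index, and tag; the parameters and names below are assumptions, not taken from the slides:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical parameters: 32 KB direct-mapped cache, 64-byte blocks. */
#define BLOCK_SIZE   64
#define NUM_SETS     (32 * 1024 / BLOCK_SIZE)   /* 512 sets */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[BLOCK_SIZE];
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Returns true on a hit: one index, one tag compare, no way selection. */
bool dm_lookup(uint64_t addr)
{
    uint64_t block = addr / BLOCK_SIZE;    /* drop the block offset   */
    uint64_t index = block % NUM_SETS;     /* which line to index     */
    uint64_t tag   = block / NUM_SETS;     /* remaining high bits     */
    return cache[index].valid && cache[index].tag == tag;
}

Because there is only one possible location, the data can be read speculatively while the single tag compare completes, which is the "overlap tag check with data transmission" point above.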
[Figure: access time (ns) vs. cache size (16 KB to 1 MB) for 1-way, 2-way, 4-way, and 8-way associativity]
CACTI
- CACTI is an integrated model of cache and memory access time, cycle time, area, leakage, and dynamic power.
- By integrating all these models together, one can have confidence that tradeoffs between time, power, and area are all based on the same assumptions and, hence, are mutually consistent.
- CACTI is intended for use by computer architects to better understand the performance tradeoffs inherent in memory system organizations.
- The open-source software can be accessed at https://github.com/Cacti/cacti
Recall: Set-Associative Cache
- N-way set associative: N entries for each cache index.
  - N direct-mapped caches operate in parallel.
- Example: two-way set-associative cache
  - The cache index selects a "set" from the cache.
  - The two tags in the set are compared to the input tag in parallel.
  - Data is selected based on the tag comparison result.
- Disadvantage: time to set the mux.
[Figure: two-way set-associative cache organization. The cache index selects a set; both tags are compared in parallel; the compare results (Sel0/Sel1) drive a mux that selects the hit cache block, and the OR of the compares produces the hit signal.]
L1 Size and Associativity
[Figure: access time vs. cache size and associativity]

L1 Size and Associativity
[Figure: energy per read vs. cache size and associativity]
2. Fast Hit Times via Way Prediction
- How can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
- Way prediction: keep extra bits in the cache to predict the "way" (block within the set) of the next cache access.
  - The multiplexor is set early to select the desired block, and only one tag comparison is performed that clock cycle, in parallel with reading the cache data.
  - On a misprediction, the other blocks are checked for matches in the next clock cycle.
- Prediction accuracy is about 85%.
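A minimal software sketch of the idea, assuming a 2-way set-associative cache with one prediction entry per set (structure names and sizes are illustrative, not the actual hardware):

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 64
#define NUM_SETS   256
#define WAYS       2

typedef struct {
    bool     valid;
    uint64_t tag;
} way_t;

static way_t   sets[NUM_SETS][WAYS];
static uint8_t predicted_way[NUM_SETS];   /* extra prediction bits per set */

/* Returns the cycles the access takes in this simplified model: 1 if the
 * predicted way hits, 2 if another way hits on the next-cycle re-check. */
int wp_access(uint64_t addr, bool *hit)
{
    uint64_t block = addr / BLOCK_SIZE;
    uint64_t index = block % NUM_SETS;
    uint64_t tag   = block / NUM_SETS;

    int p = predicted_way[index];
    if (sets[index][p].valid && sets[index][p].tag == tag) {
        *hit = true;                      /* fast hit: one tag compare */
        return 1;
    }
    for (int w = 0; w < WAYS; w++) {      /* check the other way(s) next cycle */
        if (w != p && sets[index][w].valid && sets[index][w].tag == tag) {
            predicted_way[index] = (uint8_t)w;   /* update the prediction */
            *hit = true;
            return 2;
        }
    }
    *hit = false;                         /* true miss: go to the next level */
    return 2;
}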
3. Victim Caches
- Introduce a victim cache to reduce miss penalty.
- Use a small amount of additional storage near L1 for the most recently evicted blocks.
- Effective for thrashing problems in the L1 cache.
- Look up the victim cache before going to the L2 cache.
- The L1 cache and the victim cache are exclusive.
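A minimal sketch of that lookup order; the sizes, names, and placeholder L1/L2 probes are assumptions for illustration only:

#include <stdint.h>
#include <stdbool.h>

#define VICTIM_ENTRIES 4          /* victim caches are tiny, e.g. 4-16 blocks */

typedef struct {
    bool     valid;
    uint64_t block_addr;
} victim_entry_t;

static victim_entry_t victim[VICTIM_ENTRIES];

/* Placeholder probes standing in for the real L1 and L2 caches. */
static bool l1_lookup(uint64_t block_addr) { (void)block_addr; return false; }
static bool l2_lookup(uint64_t block_addr) { (void)block_addr; return true; }

/* On an L1 miss, probe the victim cache before going to L2. */
bool access_block(uint64_t block_addr)
{
    if (l1_lookup(block_addr))
        return true;                          /* L1 hit */

    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].block_addr == block_addr) {
            /* Victim hit: in real hardware the block is swapped back into L1
             * (keeping the two exclusive); the swap itself is omitted here. */
            victim[i].valid = false;
            return true;
        }
    }
    return l2_lookup(block_addr);             /* victim miss: go to L2 */
}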
4. Increasing Cache Bandwidth by Pipelining
- Pipeline the cache access to maintain bandwidth, at the cost of higher latency.
- Split a cache access into several sub-stages, e.g. indexing, tag read, hit/miss check, and data transfer.
- Pipelining cache access improves bandwidth.
- E.g., Pentium 4 through Core i7 use 4 cycles.
5. Increasing Cache Bandwidth: Non-Blocking Caches
- A blocking cache stalls the pipeline on a cache miss.
- A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss.
  - Requires out-of-order execution.
  - Requires multi-banked memories.
5. Increasing Cache Bandwidth: Non-Blocking Caches (cont.)
- Requires out-of-order execution.
- "Hit under miss" reduces the effective miss penalty by continuing to work during the miss instead of ignoring CPU requests.
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
- Significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses (see the sketch below).
- Requires multiple memory banks (otherwise multiple outstanding misses cannot be supported).
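As a rough illustration of why the controller becomes more complex, here is a sketch of MSHR-style bookkeeping for outstanding misses; the structure is a simplified assumption of mine (real MSHRs also record destination registers, block offsets, etc.):

#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHRS 8     /* maximum outstanding misses supported */

typedef struct {
    bool     valid;
    uint64_t block_addr;   /* block being fetched from the next level */
} mshr_t;

static mshr_t mshr[NUM_MSHRS];

/* Called on a cache miss. Returns:
 *  0  miss merged with an already outstanding miss to the same block
 *  1  new outstanding miss allocated (miss under miss)
 * -1  all MSHRs busy: the cache must stall, as a blocking cache always would */
int handle_miss(uint64_t block_addr)
{
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshr[i].valid && mshr[i].block_addr == block_addr)
            return 0;                         /* secondary miss: merge */
        if (!mshr[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                            /* structural stall */
    mshr[free_slot].valid = true;
    mshr[free_slot].block_addr = block_addr;  /* track the new outstanding miss */
    return 1;
}

While misses are tracked in these entries the cache keeps servicing hits; only when every MSHR is busy does it stall like a blocking cache.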
6. Increasing Cache Bandwidth: Multibanked Caches
- Non-blocking caches require multiple banks.
- Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses.
  - E.g., the Sun T1 ("Niagara") L2 has 4 banks.
  - The ARM Cortex-A8 supports 1-4 banks of L2 cache.
  - The Intel i7 supports 4 banks for L1 and 8 banks for L2.
- Banking works best when accesses naturally spread themselves across banks, so the mapping of addresses to banks affects the behavior of the memory system.
- A simple mapping that works well is "sequential interleaving": spread block addresses sequentially across banks (see the sketch below).
  - E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on.
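A minimal sketch of that mapping, with illustrative constants (the slide's example uses 4 banks; block-level interleaving is assumed):

#include <stdint.h>

#define NUM_BANKS  4
#define BLOCK_SIZE 64

/* Sequential interleaving: spread block addresses round-robin across banks.
 * Block 0 -> bank 0, block 1 -> bank 1, ..., block 4 -> bank 0, ... */
unsigned bank_of(uint64_t addr)
{
    uint64_t block_addr = addr / BLOCK_SIZE;
    return (unsigned)(block_addr % NUM_BANKS);
}

With this mapping, sequential block addresses go to banks 0, 1, 2, 3, 0, 1, ..., so consecutive accesses fall in different banks and can proceed in parallel.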
7. Reducing Miss Penalty: Early Restart and Critical Word First
- Don't wait for the entire block to be loaded before restarting the CPU.
- Early restart: request the words in order; as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
  - Because of spatial locality, the CPU tends to want the next sequential word soon, so the size of the benefit of early restart alone is not clear.
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in.
- Long blocks are more popular today, so critical word first is widely used.
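A minimal sketch of the delivery order under critical word first combined with early restart, assuming an 8-word block (names and sizes are illustrative):

#include <stdio.h>

#define WORDS_PER_BLOCK 8

/* Deliver the words of a missed block starting at the critical (missed)
 * word and wrapping around; the CPU may restart after the first delivery. */
void fill_block_critical_word_first(unsigned critical_word)
{
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++) {
        unsigned w = (critical_word + i) % WORDS_PER_BLOCK;
        printf("deliver word %u%s\n", w,
               i == 0 ? "  <- CPU restarts here (early restart)" : "");
    }
}

For example, fill_block_critical_word_first(5) delivers the words in the order 5, 6, 7, 0, 1, 2, 3, 4, and the CPU resumes as soon as word 5 arrives.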
8. Reducing Misses by Hardware Prefetching of Instructions & Data
- Prefetching relies on having extra memory bandwidth that can be used without penalty.
- Instruction prefetching:
  - Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block.
  - The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer.
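A rough sketch of next-block instruction prefetch into a one-entry stream buffer, as described above; the names and the placeholder memory-fetch stub are assumptions:

#include <stdint.h>
#include <stdbool.h>

static uint64_t stream_buffer_block;      /* block held or being fetched */
static bool     stream_buffer_valid;

/* Placeholder standing in for a real memory fetch. */
static void fetch_from_memory(uint64_t block) { (void)block; }

/* On an instruction-cache miss for `block`: if the stream buffer already
 * holds it, move it into the cache at no memory cost; otherwise fetch the
 * missed block. Either way, prefetch the next sequential block. */
void icache_miss(uint64_t block)
{
    if (stream_buffer_valid && stream_buffer_block == block) {
        stream_buffer_valid = false;      /* prefetch hit: no memory access */
    } else {
        fetch_from_memory(block);         /* demand fetch -> instruction cache */
    }
    fetch_from_memory(block + 1);         /* prefetch -> stream buffer */
    stream_buffer_block = block + 1;
    stream_buffer_valid = true;
}

When the next miss is to block + 1, it hits in the stream buffer and costs no memory access, which is the source of the speedups shown in the figure below.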
[Figure: performance improvement from hardware prefetching on SPECint2000 and SPECfp2000 benchmarks; speedups range from about 1.16 to 1.97]
9. Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% in software on an 8 KB direct-mapped cache with 4-byte blocks.
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses.
  - Use profiling to look at conflicts (using tools they developed).
- Data
  - Merging arrays: improve spatial locality by using a single array of compound elements instead of two arrays.
  - Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
  - Loop fusion: combine two independent loops that have the same looping and some overlapping variables.
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows (a sketch follows the loop fusion example below).
Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.
Loop Interchange Example

/* Before */
for (j = 0; j < 100; j = j+1)
  for (i = 0; i < 5000; i = i+1)
    x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
  for (j = 0; j < 100; j = j+1)
    x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

Two misses per access to a and c in the original vs. one miss per access after fusion; improves temporal locality.
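Blocking Example

The compiler-optimization slide also lists blocking, but no example appears in this deck; the following is the standard blocking (tiling) sketch for matrix multiply x = y * z, written in the same style as the examples above (loop variables are assumed declared, x is initialized to zero, and B is the blocking factor):

/* Before: the inner loops sweep whole rows of y and whole columns of z,
   so for large N neither stays in the cache between reuses */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

/* After: compute on B x B sub-blocks so the touched pieces of y and z are
   reused while they are still cached (x must start at zero) */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < jj+B && j < N; j = j+1) {
        r = 0;
        for (k = kk; k < kk+B && k < N; k = k+1)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }

The blocking factor B is chosen so that the B x B sub-blocks of y and z (plus a row of x) fit in the cache, improving temporal locality.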