Unit-6 Cache Memory Organization
6.0 INTRODUCTION

Cache memory is a smaller, faster memory which stores copies of data from frequently used main memory locations. A CPU cache is used by the CPU of a computer to reduce the average time to access memory. Thus, after studying this chapter, we come to know how a cache is organised and what the different types of cache memory are.
6.2 IMPORTANT TERMINOLOGY
If the processor wants some data and it is present in the cache memory, then this is called a cache hit. If the referenced data is not present in the cache memory, then this is called a miss. In the case of a miss, the data must be brought from main memory into the cache by the cache control mechanism. Cache performance is measured by the miss rate: the miss rate times the miss penalty measures the delay penalty due to cache misses.
Fig. 6.1 Typical bus structure for cache and main memory. (The CPU chip contains the register file, the ALU and the L1 cache; the L2 cache sits on the cache bus behind the bus interface, which connects through the system bus and I/O bridge to the memory bus and main memory.)
A miss results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss.

The time taken by the mapping process, i.e., the time taken initially to compute a real address from a virtual address, also contributes to the access time.
The effective cycle time of a cache memory (t_eff) is the average of the cache-memory cycle time (t_cache) and the main-memory cycle time (t_main), where the probabilities in the averaging process are the probabilities of hits and misses.

If we consider only READ operations, then a formula for the average cycle time is:

    t_eff = t_cache + (1 - h) x t_main

where h is the probability of a cache hit (sometimes called the hit rate); the quantity (1 - h), which is the probability of a miss, is known as the miss rate.
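To make the formula concrete, here is a minimal sketch (my own illustration, not from the text; the 10 ns and 100 ns cycle times are assumed values):

    # Evaluate t_eff = t_cache + (1 - h) * t_main for a few hit rates.
    # The cycle times below are illustrative assumptions, in nanoseconds.
    def effective_cycle_time(t_cache, t_main, hit_rate):
        """Average read cycle time for a given hit rate h."""
        return t_cache + (1.0 - hit_rate) * t_main

    for h in (0.80, 0.90, 0.95, 0.99):
        print(f"h = {h:.2f}: t_eff = {effective_cycle_time(10, 100, h):.1f} ns")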
The ratio of misses to all accesses is P_miss, the miss rate. Some related measures:

• Hit access time is the number of clock cycles taken to return a cache hit.
• Miss penalty is the number of clock cycles taken to process a cache miss.
• Traffic ratio is the percentage of data words transferred to the bus (toward the next level down) compared to the data words transferred to the CPU (or the next level up); the traffic ratio is an indication of cache effectiveness.
When reading, a cache hit costs little; a cache miss incurs both the cost of fetching the missed line from main memory and the cost of replacing an existing cache line.

Fetch and Write Mechanism

There are three policies for fetching and writing the data into the cache from main memory:
1. Fetch on demand
2. Prefetch data
3. Selective fetch
Fig. 6.2 Fetch and write mechanism (fetch on demand, prefetch data, selective fetch).
1. Fetch on Demand: As the name specifies, fetch on demand means that data is fetched from the main memory only when it is needed or demanded; i.e., a demand fetch cache brings a new memory locality into the cache only when a processor reference is not found in the cache, which is called a miss.
Cache systems in multiprocessor systems need to be designed so that the processor can access main memory directly and bypass the cache; individual locations could be tagged as non-cacheable.

The key characteristic of cache memory is its fast access time; thus, ideally, no time must be wasted while searching for data in the cache. The performance of cache memory is frequently measured in terms of a quantity called the hit ratio:

    h = N_hit / (N_hit + N_miss)

where N_hit is the number of references found in the cache and N_miss is the number of references that miss.
6.3 MEMORY MAPPING TECHNIQUES [MDU: Dec 2007, 2008, 2012]

The transformation of data from main memory to cache memory is referred to as the mapping process. There are three ways of mapping considered for the cache memory. These are:
1. Associative mapping
2. Direct mapping
3. Set-associative mapping
6.3.1 Associative Mapping

In associative (fully associative) mapping, an incoming main memory block can be placed in any cache location; the full address tag is stored with the data, and the stored tags are searched associatively to locate a requested word. (Figure: the processor communicates with the cache through an address buffer and a data buffer, with control lines between the processor, the cache and main memory.)

6.3.2 Direct Mapping
The fully associative cache is expensive to implement because it requires a comparator with each cache location, effectively a special type of memory. In direct mapping, the cache consists of normal high-speed random access memory, and each location in the cache holds the data at an address in the cache given by the lower significant bits of the main memory address. This enables the block to be selected directly from the lower significant bits of the memory address. The remaining higher significant bits of the address are stored in the cache with the data to complete the identification of the cached data.

(Figure: on a read, the stored tag is compared with the tag in the address; if different, main memory is accessed; if the same, the addressed cache location is accessed.)
Figure: Direct mapping example (everything is presented in octal). The 15-bit main memory address is divided into a 9-bit index and a 6-bit tag. The cache is 512 x 12 (addresses 000 to 777 octal) and the main memory is 32K x 12 (addresses 00000 to 77777 octal).
When the memory is referenced, the index is first used to access a word in the cache. Then the tag stored in the accessed word is read and compared with the tag in the address. If the two tags are the same, indicating that the word is the one required, access is made to the addressed cache word. However, if the tags are not the same, indicating that the required word is not in the cache, reference is made to the main memory to find it. For a memory read operation, the word is then transferred into the cache where it is accessed. It is possible to pass the information to the cache and the processor simultaneously, i.e., to read-through the cache, on a miss. The cache location is altered for a write operation. The main memory may be altered at the same time (write-through) or later.

The direct mapping example described here uses a block size of one word. The same organisation but using a block size of 8 words is shown in the figure below. In this case the index field is divided into two parts: the block field and the word field. Since 1 block has 8 words, the bits for the word field = 3 (8 = 2^3), and the remaining bits represent the block field.
Fig. 6.7 Direct mapping cache with a block size of 8 words: the 15-bit address consists of a tag (6 bits), a block field (9 - 3 = 6 bits) and a word field (3 bits).

In this organisation the tag field stored within the cache is common to all 8 words of the same block. Every time a miss occurs, an entire block of 8 words must be transferred from main memory to cache memory.
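To make the field arithmetic concrete, here is a minimal sketch (not from the text) that splits a 15-bit address into the tag, block and word fields of Fig. 6.7:

    # Field widths from Fig. 6.7: 15-bit address = 6-bit tag,
    # 6-bit block field, 3-bit word field (block size 8 = 2**3 words).
    TAG_BITS, BLOCK_BITS, WORD_BITS = 6, 6, 3

    def split_address(addr):
        """Return (tag, block, word) for a 15-bit direct-mapped address."""
        word = addr & ((1 << WORD_BITS) - 1)
        block = (addr >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
        tag = addr >> (WORD_BITS + BLOCK_BITS)
        return tag, block, word

    # Example: octal address 45671 (the value itself is illustrative only).
    tag, block, word = split_address(0o45671)
    print(f"tag={tag:02o} block={block:02o} word={word:o} (octal)")

Because each field is a multiple of 3 bits, the octal digits of the address line up directly with the tag, block and word fields.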
(Figure: direct-mapped cache organisation, showing the processor address divided into tag, index and word fields addressing the cache.)
In direct mapping, blocks with the same index in the main memory will map into the same block in the cache, and hence only blocks with different indices can be in the cache at the same time. A replacement algorithm is unnecessary, since there is only one allowable location for each incoming block. Efficient operation relies on the low probability of lines with the same index being required. However, there are such occurrences, for example, when two data vectors are stored starting at the same index and pairs of elements need to be processed together. To gain the greatest performance, data arrays and vectors need to be stored in a manner which minimizes the conflicts in processing pairs of elements. Fig. 6.7 shows the lower bits of the processor address used to address the cache location directly. It is possible to introduce a mapping function between the address index and the cache index so that they are not the same.
6.3.3 Set-Associative Mapping

The set-associative cache can be considered a compromise between a fully associative cache and a direct-mapped cache. The cache is divided into sets of blocks; a four-way set-associative cache, for example, has four blocks in each set. The number of blocks in a set is known as the associativity or set size. Each block in each set has a stored tag which, together with the index (set address), completes the identification of the block. First, the set is selected by the index bits of the address from the processor; then the stored tags of the selected set are compared with the incoming tag. If a match is found, the corresponding location is accessed; otherwise, as before, an access to main memory is made.

(Figure: the address from the processor is divided into tag and index fields; the index selects one set, whose four Tag-Data pairs are compared against the address tag. If the tags do not match, main memory is accessed; if they match, the word is accessed in the cache.)

Sector Mapping

In sector mapping, the main memory and the cache are each divided into sectors composed of a number of blocks. The tag is stored with the sector; when a referenced sector is not in the cache, the required blocks are transferred into specific block locations within one cache sector. This scheme is used in some microprocessor on-chip caches.
The tag address bits are always chosen to be the most significant bits of the full address, the block address bits are the next significant bits, and the word/byte address bits form the least significant bits. This spreads out consecutive main memory blocks throughout consecutive sets in the cache. This addressing format is known as bit selection and is used by all known systems. In a set-associative cache it would be possible instead to have the set address bits as the most significant bits of the address and the block address bits as the next significant, with the word within the block as the least significant bits, or with the block address bits as the least significant bits and the word within the block as the middle bits.
The association between the stored tags and the incoming tag is done using comparators, one for each associative search, and all the information, tags and data, can be stored in ordinary random access memory. The number of comparators required in the set-associative cache is given by the number of blocks in a set, not the number of blocks in all, as in a fully associative memory. The set can be selected quickly, and all the blocks of the set can be read out simultaneously with the tags before waiting for the tag comparisons to be made. After a tag has been identified, the corresponding block can be selected. The replacement algorithm for set-associative mapping need only consider the lines in one set.
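A minimal software sketch of a set-associative lookup (the geometry below is an assumed example, not taken from the text): the index selects one set and only the tags of that set are compared.

    # 4-way set-associative lookup sketch. NUM_SETS, WAYS and BLOCK_WORDS
    # are illustrative assumptions; each set holds up to 4 (tag, data) pairs.
    NUM_SETS, WAYS, BLOCK_WORDS = 64, 4, 8

    cache = [[] for _ in range(NUM_SETS)]    # each set: list of (tag, data)

    def lookup(addr):
        """Return cached data on a hit, or None on a miss."""
        block = addr // BLOCK_WORDS          # strip the word-in-block bits
        index = block % NUM_SETS             # set address (index bits)
        tag = block // NUM_SETS              # remaining high-order bits
        for stored_tag, data in cache[index]:
            if stored_tag == tag:            # one comparator per way
                return data
        return None                          # miss: go to main memory

    def fill(addr, data):
        """Install a block after a miss, evicting arbitrarily if the set
        is full (victim choice is the subject of the next section)."""
        block = addr // BLOCK_WORDS
        index, tag = block % NUM_SETS, block // NUM_SETS
        ways = cache[index]
        if len(ways) >= WAYS:
            ways.pop(0)                      # placeholder victim choice
        ways.append((tag, data))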
6.4 REPLACEMENT POLICIES

When a miss occurs in a set-associative cache and the set is full, it is necessary to replace one of the tag-data items with a new value. Except for direct mapping, which does not allow a replacement algorithm, the existing block to be removed from the cache is chosen by a replacement algorithm. These replacement policies are implemented in hardware. The most common replacement algorithms used are: random replacement (RR), first-in first-out (FIFO) and least recently used (LRU).
1. Random replacement (RR) algorithm: Under this policy replacement is done randomly, by selecting a candidate item and discarding it to make space when necessary. This algorithm does not require keeping any information about the access history. A true random replacement algorithm would select a block to replace in a totally random order, with no regard to memory references or previous selections; practical random replacement algorithms can approximate this behaviour in one of several ways.

2. FIFO: The first-in first-out replacement algorithm removes the block that has been in the cache for the longest time. The first-in first-out algorithm would naturally be implemented with a first-in first-out queue of block addresses, but can be more easily implemented with counters: only one counter for a fully associative cache, or one counter for each set in a set-associative cache, each with a sufficient number of bits to identify the block.

3. LRU: This algorithm selects for replacement the item that has been least recently used by the CPU; i.e., in LRU, the block which has not been referenced for the longest time is removed from the cache. Only those blocks in the cache are considered. The LRU algorithm is popular for cache systems and can be implemented fully when the number of blocks involved is small. There are several ways the algorithm can be implemented in hardware, such as the use of counters, a register stack, or a reference matrix. LRU is considered an ideal policy, since it most closely corresponds to the concept of temporal locality.
1. Counters: In the counter implementation, a counter is associated with each line. One approach would be to increment each counter at regular intervals and to reset a counter when its line is referenced. Hence the value in each counter indicates the age of the block since it was last referenced, and the block with the largest age is replaced.

2. Register Stack: In the register stack implementation, a set of registers is formed, one for each block in the set to be considered. The most recently used block is identified at the top of the stack and the least recently used block at the bottom. The values are actually held in registers rather than a conventional stack, as both ends and the internal values are accessible. On a reference, the incoming block identification is placed at the top, and under the appropriate conditions the stored values are shifted toward the bottom, one place at a time, until a register is found to hold the same value as the incoming block identification.
.. :
yi[MDU
: DEC 2007]
6.5 CACHE DATA
-
' 4'
' - "f- :
< v a- . ?.
• * AV *
• / ! !• •
r
- •
'
The performance of a cache memory (its miss rate) is determined by the size of the cache. If the size of the cache is small, it can obviously accommodate less data, and thus the miss rate will increase; if the size of the cache is large, the miss rate will be reduced. Smith discussed DTMR in a 1985 ISCA paper. The idea was to use baseline performance data to estimate the effects of changes in a baseline cache structure such as:

• Unified cache
• Demand fetch
• Write-back (copyback)
• LRU replacement
• Fully associative organisation (for large blocks; set-associative for 4-8 byte blocks)

DTMR is the design target miss rate. This is the basic cache data (misses per reference) for each of the integrated data and instruction caches. DTMR represents a reasonable assessment of expected performance for modern processors. DTMR data is based on traces of individual programs run to completion, and it assumes a fully associative cache with LRU replacement and copyback. For caches of other designs, adjustments must be made to the DTMR data. Cache performance depends on workload and machine characteristics, and DTMR is useful to develop intuition.
The DTMR benchmark mix considers a variety of applications and includes operating system effects. Architectures traced include the IBM S/370, M68000, Z8000 and CDC 6400; languages include Fortran and 370 Assembler.

Figure: Effect of application environment. The plot shows DTMR (miss rate) versus cache size (10^1 to 10^5 bytes) for 16-byte blocks and a fully associative organisation, for several application types: 370-fortran, 370-cobol, 370-mvs, 8200-ultrix, vax-lisp, vax-mix and 32016-unix.

Figure: Miss ratio versus cache size (8 KB to 1 MB).
Architectures with denser instruction-set encodings have small instruction working sets, and hence require smaller instruction caches to capture those working sets and minimize the miss rate. Different computer architectures have different instruction working sets; thus different architectures are affected differently by the instruction working set. Data working sets, on the other hand, are not directly affected by the instruction set encoding, but are affected more by the register set organisation and the register allocation policy.
6.6.1 I and D Caches

The first level of the cache hierarchy may be divided into an instruction cache (I-cache) and a data cache (D-cache), each with its own path to main memory; this is the so-called Harvard architecture.

Fig.: Harvard architecture (separate I-cache and D-cache between the processor and main memory).

The Harvard architecture allows the processor to fetch instructions from the instruction cache and data from the data cache simultaneously.
Advantages of the Harvard Architecture

Programs do not in general modify their own instructions. Thus, instruction caches can be designed as read-only devices that do not allow modification of the instructions they contain. An instruction cache can simply discard any blocks that have to be ejected from it without writing them back to the main memory, since the data they contain is guaranteed not to have changed since it was brought into the cache.

Finally, keeping the instruction and data caches separate prevents conflicts between blocks of instructions and data that might map into the same storage locations in a unified cache. Often a system's instruction cache will be significantly smaller (two to four times) than its data cache. This is because the instructions for a program generally take up much less memory than the program's data. The disadvantage is that when a program modifies its own instructions, those instructions are treated as data and are stored in the data cache, not the instruction cache.
Important terms:
• I-CACHE: instruction cache
• D-CACHE (L1): data cache closest to the registers
• S-CACHE (L2): secondary data cache
• T-CACHE (L3): third-level data cache
Split Cache

There are several reasons why split caches are preferred over a unified cache (U-cache) at the first level. Different functional units fetch information from the I-cache and the D-cache: the decoder and scheduler operate with the I-cache, while the integer execution unit (also referred to as the arithmetical and logical unit) and the floating-point unit communicate with the D-cache; there are also load/store units between the caches and the system bus. I-cache and D-cache operate with very low access latencies, so a unified cache would suffer serious performance loss on most tasks when several pipelines of functional units utilise the one cache at once, while adding new access ports to a single cache is a pretty stiff job.

Additionally, a U-cache is affected by other drawbacks. For example, on system exceptions which flush the cache, all contents must be evicted, and it is a regular practice to flush the instructions together with the process data even when only one of them was intended to be flushed. On the other hand, a U-cache allows for a more effective utilisation of a given cache capacity. The I-cache is usually addressed with virtual addresses, although this creates a dependence on the current process context. The B-cache (board-level cache) is most likely to be the last cache level before memory; it is usually implemented in separate memory chips rather than on the processor die.
According to Mitchell:

    Miss rate for architecture 1 / Miss rate for architecture 2
        = Code density for architecture 1 / Code density for architecture 2

In addition to the instruction set architecture, compilers and register allocators also affect the cache performance by affecting the code density.
Sectors and Blocks

Sectors share a tag among blocks: a sector consists of several blocks, the tag is kept once per sector, and each block keeps its own valid/dirty/shared bits.

Figure: Sectors and blocks (Sector 0, Sector 1, ..., each containing Block 0, Block 1, ...; the tag is shared among the blocks of a sector).

Typically the block is the unit of transfer, and one tag can be shared by several blocks, thus exploiting spatial locality. Large blocks need fewer valid/dirty/shared bits and fewer tags, but can waste space and bandwidth if only part of a block is used. The limiting case is 1 block/sector (e.g., a direct-mapped cache with a block size of 1 word): only the needed words are read from memory on a miss and written to memory on a write-back eviction, which gives a low traffic ratio, but it does not exploit burst-mode transfers and is not necessarily the fastest overall design.
How Does Cache Memory Improve System Performance?

Cache memory is extremely fast memory that is built into a computer's central processing unit (CPU), or located next to it on a separate chip. (The disk cache, by contrast, is used to hold the most recently accessed data from the hard disk.) The CPU uses cache memory to store instructions that are repeatedly required to run programs, improving overall system speed. The advantage of cache memory is that the CPU does not have to use the motherboard's system bus for data transfer. Whenever data must be passed through the system bus, the data transfer speed slows to the motherboard's capability; the CPU can process data much faster by avoiding the bottleneck created by the system bus.
When the processor needs to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory.

Each location in each memory has a datum (a cache line), which in different designs ranges in size from 8 to 512 bytes. The size of the cache line is usually larger than the size of the usual access requested by a CPU instruction, which ranges from 1 to 16 bytes. Each location in each memory also has an index, which is a unique number used to refer to that location. The index for a location in main memory is called an address. Each location in the cache has a tag that contains the index of the datum in main memory that has been cached. In a CPU's data cache these entries are called cache lines or cache blocks.
6.6.5 On-Chip Caches

On-chip caches are a popular technique to fight the speed mismatch between memory and CPU. As integrated circuits become denser, designers have more chip area that can be devoted to on-chip caches. Increasing the cache size as the available area increases may not be the best solution, however, since the larger a cache, the longer its access time; thus using cache hierarchies (two or more levels) is a potential solution.
Advantages of Two-Level On-Chip Caching

1. The primary (L1) cache usually is split into separate instruction and data caches to support the instruction and data fetch bandwidths of modern processors. Many programs would benefit from a data cache that is larger than the instruction cache, while others would prefer the opposite. By using a two-level hierarchy on-chip, a majority of the cache lines are dynamically allocated to either instructions or data, as needed.

2. The first-level cache access time is likely to be lower. As existing processors are shrunk to smaller lithographic feature sizes, the die area typically needs to be held constant in order to keep the same number of pads. When processors are initially designed, their on-chip cache access times are matched to their cycle times. If the additional area available due to a process shrink is used to simply extend the first-level cache sizes, the caches will get slower relative to the processor datapath. Instead, if the extra area is used to hold a second-level cache, the primary caches can scale along with the datapath, while additional cache capacity is still added on-chip.
Cache memory is random access memory that a computer microprocessor can access more quickly than it can access regular RAM. As the microprocessor processes data, it looks first in the cache memory, and if it finds the data there (from a previous reading of data), it does not have to do the more time-consuming reading of data from the larger memory.

Cache memory is described in levels of closeness and accessibility to the microprocessor. An L1 cache is on the same chip as the microprocessor. (For example, the PowerPC 601 processor has a 32-kilobyte level-1 cache built into its chip.) L2 is usually a separate static RAM (SRAM) chip. The main RAM is usually a dynamic RAM (DRAM) chip.
To optimize the area available for storing data, the sectored cache minimizes the directory size and affords the best use of area.

If the contents of the low-level cache (L1) are also contained in the higher-level cache (L2), then the cache system is termed inclusive (logical inclusion). The total number of misses that occur in the second-level cache can be determined by assuming that the processor made all of its requests to the 2nd-level cache without the intermediary first-level cache. It is also assumed that the line size in the 2nd-level cache is the same as or larger than in the 1st-level cache; if it were smaller, then loading a line into the first-level cache would simply cause two misses in the second-level cache. So:

    Cache size = set size x number of sets = (line size x associativity) x number of sets

    Number of sets = cache size / (line size x associativity)

    Local miss rate = number of misses experienced by the cache / number of incoming references

Since we do not know the number of references made to L2 from L1, it is difficult to evaluate the L2 cache performance from the local miss rate alone.
Thus:

    Global miss rate = number of L2 misses / number of references made by the processor

Solo miss rate: As the name implies, "solo" means the only or single one; thus the solo miss rate is the miss rate the cache would have if it were the only cache in the system. By the principle of inclusion (the contents of the first level are contained in the second level), the number of L2 misses is the same whether or not the L1 cache is present, so the global miss rate of L2 will be the same as its solo miss rate.
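A small numeric sketch (the counts are assumed, for illustration only) of the three miss-rate definitions:

    # Assumed reference counts for illustration.
    processor_refs = 100_000     # references issued by the processor
    l1_misses      = 4_000       # these become the references seen by L2
    l2_misses      = 1_000

    l1_local  = l1_misses / processor_refs    # for L1, local = global
    l2_local  = l2_misses / l1_misses         # misses / incoming references
    l2_global = l2_misses / processor_refs    # misses / processor references

    # With inclusion, L2 would miss on the same lines even without L1,
    # so its solo miss rate equals its global miss rate.
    print(f"L1 local {l1_local:.3f}, L2 local {l2_local:.3f}, "
          f"L2 global/solo {l2_global:.3f}")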
6.6.6 Warm Cache, Cold Cache, Write Assembly Cache

(a) Warm cache: A cache that already contains the working set of the running process (the data for the user and the system, together with the required system support) is called a warm cache; the process has been running for a long time. Miss-rate data for caches with a lower degree of multiprogramming appears to be consistent with such warm-cache behaviour.

(b) Cold cache: A cold cache means that the cache has been flushed, or that some other task might have "stomped" on the cache; for example, software I/O synchronization might flush/invalidate the cache.

(c) Write assembly cache (WAC): It reduces the write traffic, and it is usually used with a one-chip first-level cache.

Sectored cache: how many bytes are grouped together as a unit? The sectored cache has improved directory-size economy, as discussed above.

On-chip/off-chip:
• on-chip: fast but small
• off-chip: large but slow
6.7 WRITE POLICIES

When a system writes a datum to the cache, it must at some point write that datum to the backing store as well. The timing of this write is controlled by what is known as the write policy. There are two basic writing approaches:

• Write-through: the write is done synchronously both to the cache and to the backing store.
• Write-back (or write-behind): initially, writing is done only to the cache. The write to the backing store is postponed until the cache blocks containing the data are about to be modified or replaced by new content.
A write-back cache keeps track of which of its locations have been written over (these locations are marked dirty). The data in these locations are written back to the backing store only when they are evicted from the cache, an effect referred to as a lazy write. For this reason, a read miss in a write-back cache (which requires a block to be replaced by another) will often require two memory accesses to service: one to write the replaced data from the cache back to the store, and one to retrieve the needed datum. On the other hand, a write-back cache can absorb many changes to a datum before it is finally written back.

Since no actual data are needed on write misses, there are two approaches for situations of write misses:

• Write allocate (fetch on write): the datum at the missed-write location is loaded into the cache, followed by a write-hit operation. In this approach, write misses are similar to read misses.
• No-write allocate (write around): the datum at the missed-write location is not loaded into the cache, and is written directly to the backing store. In this approach, only read misses cause data to be cached.
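A minimal simulation sketch (my own illustration, assuming one-word lines and a dictionary per memory level) contrasting the two usual policy pairings on a write access:

    # One cache 'slot' per address for simplicity; memory is a dict.
    memory, cache, dirty = {}, {}, set()

    def write_through_no_allocate(addr, value):
        """Write-through, no-write allocate: memory is always updated;
        the cache is updated only if the line is already present."""
        memory[addr] = value
        if addr in cache:
            cache[addr] = value

    def write_back_allocate(addr, value):
        """Write-back, write allocate: the line is brought into the
        cache (allocated) and marked dirty; memory is updated later,
        on eviction of the dirty line."""
        cache[addr] = value
        dirty.add(addr)

    def evict(addr):
        """Eviction writes a dirty line back to memory (lazy write)."""
        if addr in dirty:
            memory[addr] = cache[addr]
            dirty.discard(addr)
        cache.pop(addr, None)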
Both write-through and write-back caches may use either of these write-miss policies, but usually they are paired as follows:

• A write-back cache uses write allocate, hoping for subsequent writes (or even reads) to the same location, which is now cached.
• A write-through cache uses no-write allocate. Here, subsequent writes gain no advantage, since they still need to be written directly to the backing store.

Entities other than the cache may change the data in the backing store, in which case the copy in the cache may become stale; alternatively, when the client updates the data in the cache, copies of those data in other caches will become stale. Communication protocols between the cache managers which keep the data consistent are known as coherency protocols.

An important consideration in write-policy selection is the effect of the decision on memory traffic: selecting write-through produces lower memory traffic with small caches, while selecting copyback produces lower traffic with large caches. The memory traffic ratio is used to measure this effect.
r memory traffic*
Locate i
has improved Memory request block
£V
.
Hjr
—
store No No
write to the
e modified/
•
-
\;v the
|
-
'
5
-
na in these
i, an effect
1 a block to
Read data from
* replaced lower memory into Write data Into ; *
c
the cache block lower memory •
a datum in IV
nations of
Return data
> followed
,° isreads
!l n not
etn
Done
F|a.
6.16 A Write
%
-Throu9h cache with No-Write Allocation.
The traffic ratio compares actual memory references (cache-to-memory references) with processor memory references.

Fig. 6.17 A copyback (write-back) cache. (Flow: a memory request is classified as a read or a write; on a miss of either kind, a cache block is located for reuse, and if the chosen block is dirty it is first written back to lower memory before the missed line is handled.)
(a) Write-Through Policy: With a write-through cache, a missed line can be accessed in two ways: (a) starting from the beginning of the line, or (b) starting from the word where the read miss occurred; the latter is called a wraparound load. The miss time consists of the access time of the first word plus the time required to transmit the remaining words of the line across the memory bus, where the line consists of M physical words.

When the missed line is accessed from the start of the line (case (a)), the processing cannot be restarted until the entire line has been transmitted. But in case (b), the processing will start as soon as the first word has been accessed and forwarded to the processor; the remaining words of the line are stored into the cache while the processor resumes processing based upon the initially returned word. In this case both the processor and the memory may simultaneously wish to access the cache, and the memory gets priority to access the cache first. It is possible that another miss will occur while the first one is being processed; in this case the first miss gets priority, that is, the first miss must complete before the second miss can begin its access.

(b) Copyback Policy: In this policy, we must determine whether the line to be replaced is dirty or not. If the line to be replaced is not dirty, then this is the same as the write-through policy above. But if the line is dirty, there must be provision for the replaced line to be written back to the memory. This is done by first selecting the line to be replaced; then, if it is dirty, writing the line back to the memory; and finally bringing the missed line into the cache, resuming the processing when the line has been completely written into the cache. To speed up performance, a write buffer can be used here.

The fastest approach for fetching a line is the nonblocking (or prefetching) cache. This approach is applicable in both the write-through and the write-back policy.
When a write buffer is used, reads must be checked against it, to ensure that a read does not refer to data not yet written back to memory.
When prefetch buffers are free (not occupied), they are filled with the anticipated data, so that a line may already be present when it is requested rather than being fetched on demand.

When a fetched or prefetched line must displace an existing line, the victim is chosen by one of the replacement policies:
(a) LRU: least recently used
(b) FIFO: first in first out
(c) RAND: random replacement
These policies have already been discussed in detail previously.
6.8 WRITE ASSEMBLY CACHE (WAC)

The goal of the write assembly cache is to assemble writes so that they can be transmitted to memory in an orderly way. From the above discussion we know that write-through caches have an advantage over copyback caches: write-through caches require less software management to support memory consistency, as used in multiprocessor configurations. The WAC is an expansion of the write buffer idea; i.e., a WAC is effectively a write buffer with extra circuitry. The WAC offers an intermediate strategy, centralizing the pending memory writes in a single buffer, which reduces the bus traffic; but consistency is not maintained, because writes are now delayed.

Fig. 6.18 Write assembly cache (the CPU is connected to the cache and a write buffer, both in front of main memory).
A WAC exploits the spatial and temporal locality of writes. Examples of write streams with such locality are stores to structs, stores to arrays, and subroutine calls (e.g., a statically allocated scratch variable). Commercial examples include the NCR single-line WAC and multi-line WACs in workstations. When the buffer of a WAC is full, a line (a number of transfer units that compose a line) is returned to memory; the WAC is typically fully associative. Temporal locality plays an important role in letting the WAC absorb repeated writes without otherwise affecting the memory system.
From the foregoing it is clear that deferring individual writes into a batch of writes reduces the resultant write traffic; the system may realize a performance increase upon the initial (typically write) transfer of a data item, and this increase is due to buffering occurring within the caching system.

The portion of a caching protocol where individual writes are deferred to a batch of writes is a form of buffering. The portion of a caching protocol where individual reads are deferred to a batch of reads is also a form of buffering, although this form may negatively impact the performance of at least the initial reads. In practice, caching almost always involves some form of buffering, while strict buffering does not involve caching.
A buffer is a temporary memory location that is traditionally used because CPU instructions cannot directly address data stored in peripheral devices; thus, addressable memory is used as an intermediate stage. Additionally, such a buffer may be feasible when a large block of data is assembled or disassembled (as required by a storage device), or when data may be delivered in a different order than that in which it is produced. Also, a whole buffer of data is usually transferred sequentially (for example, to hard disk), so buffering in itself sometimes increases transfer performance or reduces the variation or jitter of the transfer's latency, as opposed to caching, where the intent is to reduce the latency. These benefits are present even if the buffered data are written to the buffer once and read from the buffer once.
Traffic

The memory traffic generated depends on the cache organisation, for example on whether the cache has a write-allocate policy and on how instructions are fetched.

Fig. 6.19 Instruction traffic and data traffic; data traffic is divided into data read traffic and data write traffic.

Thus the three types of traffic are instruction traffic, data read traffic and data write traffic.
Instruction traffic is determined by two considerations:

(a) I/P fetches are required by each instruction, where I is the average instruction size in bytes and P is the physical word size from the cache to the instruction (I) buffer; generally P >= I.

(b) What is the branch strategy? Branch instructions also add additional instruction traffic, depending upon the branch policy followed. A branch is a sequence of code in a computer program which is conditionally executed depending on how the flow of control is altered at the branching point. A branch instruction policy always creates an additional fetch to the target instruction, and the fetching of the additional instructions also requires I/P fetches. Thus the total instruction traffic per instruction is the sum of I/P and the excess of instructions fetched on branches; this is easily converted into clock cycles for a particular CPU.
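A small sketch of this traffic computation (all parameter values are assumed, for illustration):

    # Assumed parameters: average instruction size I (bytes), physical word
    # size P (bytes), branch frequency, and extra instructions fetched per
    # branch under the assumed branch policy.
    I_bytes, P_bytes = 4.0, 8.0
    branch_freq, extra_instr_per_branch = 0.2, 2.0

    base = I_bytes / P_bytes                    # I/P fetches per instruction
    excess = branch_freq * extra_instr_per_branch * (I_bytes / P_bytes)
    traffic_per_instruction = base + excess
    print(f"instruction traffic = {traffic_per_instruction:.2f} fetches/instr")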
We can also compute the write penalty separately from the read penalty; this may be necessary when the two penalties differ. Treating them as a single quantity yields a useful CPU time formula:

    CPU time = IC x (CPI + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time

where IC is the instruction count.
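In code form (the parameter values below are assumed examples):

    def cpu_time(ic, cpi, accesses_per_instr, miss_rate, miss_penalty, t_cycle):
        """CPU time = IC x (CPI + accesses/instr x miss rate x miss penalty)
        x clock cycle time; miss_penalty is in cycles, t_cycle in seconds."""
        return ic * (cpi + accesses_per_instr * miss_rate * miss_penalty) * t_cycle

    # Assumed example: 1e9 instructions, base CPI 1.0, 1.33 accesses/instr,
    # 2% miss rate, 20-cycle miss penalty, 5 ns cycle (200 MHz).
    print(cpu_time(1e9, 1.0, 1.33, 0.02, 20, 5e-9), "seconds")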
An Example

Compare the performance of a 64 KB unified cache with a split cache having 32 KB data and 16 KB instruction caches. Given:

• The miss penalty for either cache is 100 ns, and the CPU clock runs at 200 MHz (so the miss penalty is 20 cycles).
• Do not forget that the unified cache requires an extra cycle for load and store hits because of the structural conflict.
• Calculate the effect on CPI rather than on the average memory access time.
• Assume the miss rates are: 64 KB unified cache 1.35%; 16 KB instruction cache 0.64%; 32 KB data cache 4.82%.
• Assume a data access occurs once for every 3 instructions, on average.

Solution: The solution is to figure out the penalty to CPI separately for instructions and data.

For the unified cache:
• For instructions, the number of extra clock cycles per instruction is (0 + 1.35% x 20) = 0.27 cycles.
• For data, the penalty is (1 + 1.35% x 20) x (1/3), or 0.42 cycles per instruction.
• The total penalty is 0.69 CPI.

For the split cache:
• For instructions, the penalty is (0 + 0.64% x 20) = 0.13 CPI.
• For data accesses, it is (0 + 4.82% x 20) x (1/3) = 0.32 CPI.
• The total penalty is 0.45 CPI.

In this case the split cache performs better because of the lack of a stall on data accesses.
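The same arithmetic expressed as a short sketch that reproduces the numbers above:

    MISS_PENALTY = 20          # 100 ns at 200 MHz = 20 cycles
    DATA_PER_INSTR = 1 / 3     # one data access every 3 instructions

    def cpi_loss(i_miss, d_miss, unified_hit_stall):
        """CPI penalty = instruction part + data part; the unified cache
        adds a 1-cycle structural stall on every load/store hit."""
        instr = i_miss * MISS_PENALTY
        data = (unified_hit_stall + d_miss * MISS_PENALTY) * DATA_PER_INSTR
        return instr + data

    print(f"unified: {cpi_loss(0.0135, 0.0135, 1):.2f} CPI")   # -> 0.69
    print(f"split:   {cpi_loss(0.0064, 0.0482, 0):.2f} CPI")   # -> 0.45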
6.9 PROCESSOR WITH CACHE

A cache is used by the CPU of the computer to reduce the average time to access memory.

Fig. 6.20 How cache memory works: the CPU registers are backed by the L1, L2 and L3 caches, then by main memory, and finally by the storage device.
Point to Remember: Memory Hierarchy Terminology

1. Hit: The data appears in some block in the faster level (Example: Block X).
   (a) Hit rate: the fraction of memory accesses found in the faster level.
   (b) Hit time: the time to access the faster level, which consists of the RAM access time plus the time to determine hit/miss.

Fig. 6.21 The faster level (Block X) and the slower level (Block Y) of a memory hierarchy between the processor and memory.
2. Miss: If the data is not found in the cache, then this is called a miss, and the data must be brought into the cache from main memory. Cache misses are classified as compulsory, capacity and conflict misses.

Fig. 6.22 Classification of cache misses (compulsory, capacity, conflict).

1. Compulsory: The first access to a block that has never been in the cache. These are also called cold-start misses or first-reference misses. (These misses occur even in an infinite cache.)
2. Capacity: If the cache cannot contain all the blocks needed during execution of a program, blocks are discarded and later retrieved, causing capacity misses.
3. Conflict: In direct-mapped or set-associative caches, misses that occur because too many blocks map to the same set.
The cache memory is used by all processors to access instructions and data, except vector processors: vector processors access the data directly from memory. A vector processor, or array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This is in contrast to a scalar processor, whose instructions operate on single data items.
The addition of a cache to a memory system complicates the analysis of memory traffic.

For copyback systems with write-allocate (CBWA) caches, the requests to memory consist of line read and line write requests. For write-through systems without fetch on write (WTNWA), the requests consist of line read requests and word write requests.

To develop models of memory systems with caches, two basic parameters must be evaluated:
1. T_line_access: the time it takes to access a line in memory.
2. T_busy: the busy time (when memory is busy and the processor/cache is unable to make requests to memory), which gives rise to potential contention.
Consider a pipelined single-processor system using interleaving to support fast line access. Assume the cache has a line size of L physical words (of the bus word size) and the memory uses low-order interleaving of degree m. Now if m <= L, the total time to move a line (for both read and write operations) is:

    T_line_access = Ta + Tc x (L/m - 1) + max(Tbus, Tr) x (L - L/m)

The first word in the line is available after the access time Ta, but the first module is not available again until its cycle time Tc has elapsed; a total of L/m - 1 further module cycles are therefore accounted for in the Tc term. Finally, additional bus cycles are required for the other modules to complete the line transfer.

Now assume a single-module memory system (m = 1) with page mode, where v is the number of fast sequential accesses the module can supply and Tp is the time between each such access; this is the page-mode or FPM mode. Page mode effectively multiplies the interleaving degree by v, so L/m in the formula above becomes L/(m x v), as used in Case 3 of the example below.
Example: Computing T_line_access

Case 1 (no page mode; from the arithmetic, Ta = 200 ns, Tc = 150 ns, L = 8, m = 4, with a 50 ns per-word transfer time):

    T_line_access = Ta + Tc x (L/m - 1) + max(Tbus, Tr) x (L - L/m)
                  = 200 + 150 x ((8/4) - 1) + 50 x (8 - (8/4))
                  = 200 + 150 + 300
                  = 650 ns

Now let Ta = 200 ns, Tc = 150 ns, Tr = 50 ns, Tbus = 25 ns, L = 16, v = 4, m = 2.
Case 3:

    T_line_access = Ta + Tc x (L/(m x v) - 1) + Tbus x (L - L/(m x v))
                  = 200 + 150 x (16/(2 x 4) - 1) + 25 x (16 - 16/(2 x 4))
                  = 200 + 150 + 350
                  = 700 ns
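A sketch of the line-access computation using the parameters of the example (the helper name and the ceiling are my rendering of the formula):

    import math

    def t_line_access(ta, tc, t_word, L, m, v=1):
        """Ta + Tc*(L/(m*v) - 1) + t_word*(L - L/(m*v)); t_word is the
        per-word transfer time used in the example (Tbus, or max(Tbus, Tr))."""
        groups = math.ceil(L / (m * v))      # module (or page-mode) cycles
        return ta + tc * (groups - 1) + t_word * (L - groups)

    print(t_line_access(200, 150, 50, L=8, m=4))        # Case 1 -> 650 ns
    print(t_line_access(200, 150, 25, L=16, m=2, v=4))  # Case 3 -> 700 ns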
Miss time may be different as seen by the cache and by the main memory. Let:
• T_c.miss = the time the processor is idle due to a cache miss.
• T_m.miss = the total time the main memory takes to process a miss.

In the case of a wraparound load, the processor resumes as soon as the first word is returned, so T_c.miss is smaller than the full T_line_access.

If a miss occurs while the memory is still busy (during T_busy), we call the additional delay T_interference:

    T_interference = (expected no. of misses during T_busy) x (delay per miss)

    Expected no. of misses during T_busy = (no. of requests during T_busy) x (prob. of miss)
                                         = T_busy x lambda_p x P_miss

where lambda_p is the processor request rate and P_miss is the miss rate per request. The delay per miss, given a miss during T_busy, is simply estimated as T_busy/2. So:

    T_interference = (T_busy x lambda_p x P_miss) x (T_busy/2)

and the total miss time as seen from the processor is:

    T_miss = T_c.miss + T_interference
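In code (a sketch; the request rate, miss probability, busy time and T_c.miss are assumed values):

    # Assumed values: lambda_p in requests/ns, miss probability, and the
    # memory busy time T_busy in ns.
    lambda_p, p_miss, t_busy = 0.05, 0.04, 120.0

    expected_misses = t_busy * lambda_p * p_miss   # misses arriving in T_busy
    delay_per_miss = t_busy / 2                    # average residual busy time
    t_interference = expected_misses * delay_per_miss

    t_c_miss = 200.0                               # assumed processor idle time
    t_miss = t_c_miss + t_interference
    print(f"T_interference = {t_interference:.1f} ns, T_miss = {t_miss:.1f} ns")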
Fig. 6.24 Processor, cache and main memory during a miss.

For explaining this we consider three cases here.

Case 1: No write buffer and low write traffic.

Fig. 6.25 Processor, cache and memory with no write buffer.
Since the write traffic is also low, writes do not significantly interfere with reads; for the compute-only case the total miss time is simply T_c.miss   ...(1)

Contention is not included here because a single processor is considered; but contention must be considered if there are multiple processors with write-through caches.
(b) Read priority over write on miss: a read miss is given priority over buffered writes. Either all the pending writes in the write buffer are written to memory before the read miss is serviced, or the write buffer is checked on a read miss and, if no match is found, the read is allowed to bypass the writes.
The TLB (translation lookaside buffer) provides the real address by translating the addresses used by the cache.

Fig. 6.27 Mapping of a virtual to a physical address: the virtual address consists of a virtual page number and a page offset; the virtual page number is translated to a physical page number, which together with the page offset forms the physical address.
The page address (the upper bits of the virtual address) is composed of the bits that require translation. Selected virtual address bits address the TLB entries. These are selected (or hashed) from the bits of the virtual address. This avoids too many address collisions, as might occur when both address and data pages have the same, say, "000", low-order page addresses. The size of the virtual address index is equal to log2(t), where t is the number of entries in the TLB divided by the degree of set associativity. When a TLB entry is accessed, a virtual and real translation pair is read from each entry of the selected set. The virtual addresses are compared to the virtual address tag (the virtual address bits that were not used in the index). If a match is found, the corresponding real address is multiplexed to the output of the TLB. Translation and cache access may be sequential, or they may proceed in parallel.
translation and access. ^
Virtual Address
Virtual Page Number Virtual Page Offset
J/irtual Page Number Virtual Page Offset
i ; ;
Translation Translation Cache
I
Cache i
Tag Compare .
I
Compare Tag t
Data
Data
Fig . 6.28 Segmental Translation and
Access. Fig. 6.29 Parallel Translation
and Access.
*
vrt An • r - fc •
*u
1
<
. i «'
.
BN#;: ;
v<
^ #
*v i
6.11 CACHE TERMINOLOGY

1. Cache: The cache holds both instructions and data; there may be a unified cache, or a separate I-cache and D-cache.

Fig. 6.30 Memory fetch and store transactions between the processor, the cache and memory.

2. Cache Hit: When the cache contains the information requested, the transaction is said to be a cache hit.

3. Cache Miss: When the cache does not contain the information requested, the transaction is said to be a cache miss.

4. Cache Consistency: Since the cache is a photo, or copy, of a small piece of main memory, it is important that the cache always reflects what is in main memory. Some common terms are used to describe the process of maintaining consistency; for example, data modified within main memory but not modified in the cache is called stale data.
Points to Remember

• The virtual address is divided into two parts: the virtual page number and the page offset.
• Associative mapping, direct mapping and set-associative mapping are the three basic types of cache mapping.
• Size of cache = set size x number of sets.
• The replacement policies in the case of cache memory are LRU, FIFO and random replacement.
• Local miss rate, global miss rate and solo miss rate are the three types of miss rate.
REVIEW QUESTIONS

Q.6. The access time of a cache memory is 100 ns and that of main memory is 1000 ns. It is estimated that 80% of the memory requests are for read and the remaining 20% are for write. The hit ratio for read accesses is 0.9, and a write-through policy is used.
(a) What is the average access time of the system considering only memory read cycles?
    Hint: (90/100) x 100 + (10/100) x 1100 = 200 ns
(b) What is the average access time of the system for both read and write requests?
(c) What is the hit ratio taking into consideration the write cycles? [0.9 x 0.8 = 0.72]
Q.7. Consider a cache (M1) and a main memory (M2) with the following characteristics:
    M1 = 16K words, 50 ns access time
    M2 = 1M words, 400 ns access time
Assume 8-word cache blocks and a set size of 256 words, with set-associative mapping. Calculate T_eff with a cache hit ratio of 0.95.

Q.8. In a two-level cache system we have an L1 of size 8 KB, 4-way set-associative with 16 B lines, and an L2 of size 64 KB, direct-mapped with 64 B lines and CBWA. Suppose the delay for a miss in L1 that hits in L2 is 3 cycles, and the delay for a miss in L1 that also misses in L2 is 10 cycles; the processor makes 1.5 references/instruction.
(a) What are the L1 and L2 miss rates?
(b) What is the expected CPI loss due to cache misses?
(c) All lines in L1 always reside in L2. Why?

    Hint: Number of sets = cache size / (associativity x line size)
          CPU time/instruction = t_cycle x (CPI + references/instruction x miss rate x miss penalty)
          Miss penalty = T_c.miss / t_cycle