Unit-6 Cache Memory Organization


Cache memory is a smaller, faster memory which stores copies of data from frequently used main memory locations. A CPU cache is used by the CPU of a computer to reduce the average time to access memory. After studying this chapter we come to know how the cache is organised, what the different types of cache memory are, and how data is updated in main memory and in the cache.
6.0 INTRODUCTION

As explained in the chapter "Memory Hierarchy Design", the memory hierarchy consists of various levels. Cache memory can directly communicate with the CPU registers and the main memory. The cache memory is employed in computer systems to compensate for the speed differential between main memory access and the processor logic. Cache is used between the CPU and main memory so that the access time of the cache memory is close to the processor logic clock cycle time, and thus compensates for the speed difference. Locality is a characteristic of most programs, and caches work on the basis of the locality of program behaviour. Three principles are involved here: spatial locality, temporal locality and sequential locality, which are already described in chapter 3.
6.2 IMPORTANT TERMINOLOGY

If the processor wants some data and it is present in the cache memory, this is called a cache hit. If the referenced data is not present in the cache memory, this is called a miss. In the case of a miss, the data must be brought from main memory into the cache by the cache control mechanism. Cache performance is measured by the miss rate; the miss rate times the miss penalty measures the delay penalty due to cache misses.

Fig. 6.1 Typical bus structure for cache and main memory (the CPU chip contains the register file, ALU, L1 cache and bus interface; a cache bus connects to the L2 cache, and the system bus, I/O bridge and memory bus connect to the main memory).

A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss.
A number of factors affecting processor performance are:


1. Access time for a reference found in the cache (a hit) is a property of the cache size and organisation.
2. Access time for a reference not found in the cache (a cache miss) is a property of the memory organisation.
3. Time taken by the mapping process, i.e. the time taken initially to compute a real address from a virtual address.
The effective cycle time of a cache memory (t_eff) is the average of the cache-memory cycle time (t_cache) and the main-memory cycle time (t_main), where the probabilities in the averaging process are the probabilities of hits and misses.
If we consider only READ operations, then a formula for the average cycle time is:
t_eff = t_cache + (1 - h) t_main
where h is the probability of a cache hit (sometimes called the hit rate); the quantity (1 - h), which is the probability of a miss, is known as the miss rate.
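To make the formula concrete, the short Python sketch below computes t_eff for assumed (hypothetical) cycle times and hit rate; the numbers are illustrative only and are not taken from the text.

def effective_cycle_time(t_cache, t_main, hit_rate):
    """Average read cycle time: t_eff = t_cache + (1 - h) * t_main."""
    return t_cache + (1.0 - hit_rate) * t_main

# Illustrative values (assumed): 10 ns cache, 60 ns main memory, 95% hit rate.
print(effective_cycle_time(t_cache=10e-9, t_main=60e-9, hit_rate=0.95))  # 1.3e-08 s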

Notes:
• Hit – a cache access that finds the data resident in the cache memory is called a hit.
• Miss – in case of a miss, a cache access does not find the data resident in the cache, forcing an access to the next layer down in the memory hierarchy.
• Miss ratio is the percentage of misses compared to all accesses = P_miss.
• Traffic ratio is the percentage of data words transferred to the "bus" (for the next level down) compared to data words transferred to the "CPU" (or next level up).
• Caches can act as a "filter" to attenuate bus traffic; the traffic ratio is an indication of their effectiveness.
• Hit access time is the number of clocks to return a cache hit.
• Miss penalty is the number of clocks to process a cache miss (typically, in addition to at least one clock of the hit time).

The performance of the memory can thus be quantified in terms of the hit and miss rates and the cost of hits and misses. A cache hit is a memory access that finds the data in the cache, and a cache miss is one that does not. When reading, the cost of a cache miss (the miss penalty) is the additional cost of replacing a cache line:
(Access time) = (hit cost) + (miss rate) × (miss penalty)
= (fast memory access time) + (miss rate) × (slow memory access time)

Fetch and Write Mechanism
There are three policies for fetching and writing data into the cache from main memory:
1. Fetch on demand
2. Prefetch data
3. Selective fetch

Fig. 6.2 Fetch and write mechanisms (fetch on demand, prefetch data, selective fetch).
1. Fetch on Demand: As the name specifies, fetch on demand means that data is fetched from the main memory only when it is needed or demanded, i.e. a demand fetch cache brings a new memory locality into the cache only when a processor reference is not found in the cache, which is called a miss.
2. Prefetch data: This means data is fetched in advance, i.e. the prefetch cache attempts to anticipate the data that will be requested by the processor and thus prefetches it into the cache.
3. Selective fetch: This is the policy of not fetching certain blocks into the cache, dependent upon some defined criterion, and in these cases using the main memory rather than the cache to hold the information. For example, shared writable data might be especially difficult to maintain, so it is always kept in the main memory and not passed to a cache, for access in multiprocessor systems. Cache systems need to be designed so that the processor can access main memory directly and bypass the cache; individual locations could be tagged as non-cacheable.
An important characteristic of cache memory is its fast access time; thus ideally no time must be wasted while searching for the data in the cache. The performance of cache memory is frequently measured in terms of a quantity called the hit ratio:
Hit ratio = hits / (hits + misses)
6.3 MEMORY MAPPING TECHNIQUES [MDU: Dec 2007, 2008, 2012]

The transformation of data from main memory to cache memory is referred to as the mapping process. Considering the organisation of the cache, there are three types of cache mapping. These are: associative mapping, direct mapping and set-associative mapping.

Fig. 6.3 Typical cache organisation (the processor communicates with the cache through an address buffer and a data buffer, with control signals between the processor and the cache).

6.3.1 Associative Mapping

One way to relate the cached data to the main memory address is to store both the memory address and the data together in the cache. This is called the fully associative mapping approach. A fully associative cache requires the cache to be composed of associative memory holding both the memory address and the data for each cached line. The incoming memory address is simultaneously compared with all stored addresses using the internal logic of the associative memory; if a match is found, the corresponding data is read out. Single words from anywhere within the main memory could be held in the cache, if the associative part of the cache is capable of holding a full address.
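As a minimal software analogy of this fully associative search (the real hardware compares all stored addresses in parallel; a Python dictionary only models the behaviour, not the circuit), consider the sketch below. All names and values are illustrative.

class FullyAssociativeCache:
    """Toy model: every cached word is stored together with its full memory address."""

    def __init__(self):
        self.lines = {}                      # address -> data (address + data per line)

    def read(self, address, main_memory):
        if address in self.lines:            # "compare with all stored addresses"
            return self.lines[address], True     # hit
        data = main_memory[address]          # miss: fetch from main memory
        self.lines[address] = data           # the word may be placed anywhere in the cache
        return data, False

main_memory = {addr: addr * 2 for addr in range(64)}
cache = FullyAssociativeCache()
print(cache.read(5, main_memory))   # (10, False) -> miss on the first access
print(cache.read(5, main_memory))   # (10, True)  -> hit on the second access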

Fig. 6.4 Cache with fully associative mapping (the memory address from the processor is compared with all stored addresses; if the address is not found in the cache, main memory is accessed).


The fully associative mapping cache gives the greatest flexibility of holding combinations of blocks in the cache and the minimum conflict for a given sized cache, but it is also the most expensive, due to the cost of the associative memory. It requires a replacement algorithm to select a block to remove upon a miss, and the algorithm must be implemented in hardware to maintain a high speed of operation. The fully associative cache can only be formed economically with a moderate size capacity. Microprocessors with small internal caches often employ the fully associative mechanism.

6.3.2 Direct Mapping

The fully associative cache is expensive to implement because it requires a comparator with each cache location, effectively a special type of memory. In direct mapping, the cache consists of normal high-speed random access memory, and each location in the cache holds the data at an address in the cache given by the lower significant bits of the main memory address. This enables the block to be selected directly from the lower significant bits of the memory address. The remaining higher significant bits of the address are stored in the cache with the data to complete the identification of the cached data.

Fig. 6.5 Cache with direct mapping (the memory address from the processor is split into tag and index; the index selects a cache entry, the stored tag is read and compared with the address tag, and main memory is accessed if the tags do not match).

Consider an example of a cache memory of size 512 × 12 and a main memory of size 32K × 12, where the CPU can send a 15-bit address. When the CPU sends a 15-bit address to the cache and the address is present in the cache, the 12-bit data is sent to the CPU; otherwise it must be fetched from the main memory and also copied into the cache memory.
In this case (direct mapping) the 15-bit (32K = 2^15) memory address is divided into two parts: an index field and a tag field. The number of bits in the index field is equal to the number of address bits of the cache memory, i.e. here 512 = 2^9, so the index field is 9 bits and the tag field is 15 - 9 = 6 bits.

Fig. 6.6 Direct mapping (the 15-bit address consists of a 6-bit tag and a 9-bit index; the 512 × 12 cache memory is indexed from 000 to 777 octal, and the 32K × 12 main memory is addressed from 00 000 to 77 777 octal).

When the memory is referenced, the index is first used to access a word in the cache. Then the tag stored in the accessed word is read and compared with the tag in the address. If the two tags are the same, indicating that the word is the one required, access is made to the addressed cache word. However, if the tags are not the same, indicating that the required word is not in the cache, reference is made to the main memory to find it. For a memory read operation, the word is then transferred into the cache where it is accessed. It is possible to pass the information to the cache and the processor simultaneously, i.e., to read-through the cache, on a miss. The cache location is altered for a write operation; the main memory may be altered at the same time (write-through) or later.
The direct mapping example described here uses a block size of one word. The same organisation but using a block size of 8 words is shown in the figure below (the resulting address fields are also worked out in the sketch after this paragraph). In this case the index field is divided into two parts: a block field and a word field. Since 1 block has 8 words, the number of bits for the word field = 3 (8 = 2^3) and the remaining bits represent the block field.

Fig. 6.7 Direct-mapped cache with a block size of 8 words (the 15-bit address consists of a 6-bit tag and a 9-bit index; the index is divided into a 6-bit block field and a 3-bit word field).
In this organisation the tag field stored within the cache is common to all 8 words of the same block. Every time a miss occurs, an entire block of 8 words must be transferred from main memory to cache memory.

Fig. 6.8 Direct-mapped cache with a multi-word block (the address is divided into tag, index and word fields; the index selects a cache line holding several words, the stored tag is compared with the address tag, and main memory is accessed if the tags do not match).

In direct mapping, blocks with the same index in the main memory will map into the same block in the cache, and hence only blocks with different indices can be in the cache at the same time. A replacement algorithm is unnecessary, since there is only one allowable location for each incoming block. Efficient operation relies on the low probability of lines with the same index being required. However, there are such occurrences, for example, when two data vectors are stored starting at the same index and pairs of elements need to be processed together. To gain the greatest performance, data arrays and vectors need to be stored in a manner which minimizes the conflicts in processing pairs of elements. Fig. 6.7 shows the lower bits of the processor address used to address the cache location directly. It is possible to introduce a mapping function between the address index and the cache index so that they are not the same.
6.3.3 Set-associative Mapping

In set-associative mapping, the cache is divided into a number of sets, each set holding a number of blocks. The set-associative cache can be considered a compromise between a fully associative cache and a direct-mapped cache. The number of blocks in each set is known as the associativity or set size; a four-way set-associative cache, for example, has four blocks in each set. Each block in each set has a stored tag which, together with the index (set address), completes the identification of the block. First, the set is selected by the index of the address supplied by the processor; then comparators are used to compare the tag of the incoming address with the tags of all the blocks in the selected set. If a match is found, the corresponding location is accessed; otherwise, as before, an access to the main memory is made.
Fig. 6.9 Cache with set-associative mapping (the index of the memory address selects a set; the tags of all the blocks in the set are compared with the address tag, and main memory is accessed if none of the tags match).
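The following sketch models the set-associative lookup in Python (set selection by index, then a tag comparison against every block in the set); the set count, associativity and eviction rule are arbitrary illustrative choices, not prescriptions from the text.

class SetAssociativeCache:
    """Toy k-way set-associative cache with one-word blocks."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]   # each set holds up to `ways` (tag, data)

    def access(self, address, main_memory):
        index = address % self.num_sets             # set selection by the index bits
        tag = address // self.num_sets              # remaining bits form the tag
        for stored_tag, data in self.sets[index]:   # compare the tag with every block in the set
            if stored_tag == tag:
                return data, True                   # hit
        data = main_memory[address]                 # miss: go to main memory
        if len(self.sets[index]) == self.ways:      # set full: evict one block (FIFO here)
            self.sets[index].pop(0)
        self.sets[index].append((tag, data))
        return data, False

memory = {a: a + 100 for a in range(256)}
cache = SetAssociativeCache(num_sets=4, ways=2)
print(cache.access(9, memory))   # miss on the first reference
print(cache.access(9, memory))   # hit on the second reference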

The tag address bits are always chosen to be the most significant bits of the full address, the block address bits are the next significant bits and the word/byte address bits form the least significant bits, as this spreads consecutive main memory blocks throughout consecutive sets in the cache. This addressing format is known as bit selection and is used by all known systems. In a set-associative cache it would be possible to have the set address bits as the most significant bits of the address and the block address bits as the next significant, with the word within the block as the least significant bits, or with the block address bits as the least significant bits and the word within the block as the middle bits.
The association between the stored tags and the incoming tag is done using comparators, one for each block in the set, for the associative search, and all the information, tags and data, can be stored in ordinary random access memory. The number of comparators required in the set-associative cache is given by the number of blocks in a set, not the number of blocks in all, as in a fully associative memory. The set can be selected quickly, and all the blocks of the set can be read out simultaneously with the tags before waiting for the tag comparisons to be made. After a tag has been identified, the corresponding block can be selected.
The replacement algorithm for set-associative mapping need only consider the lines in one set, as the choice of set is predetermined by the index in the address. Hence, with two blocks in each set, for example, only one additional bit is necessary in each set to identify the block to replace.

Sector Mapping
In sector mapping, the main memory and the cache are both divided into sectors; each sector is composed of a number of blocks. Any sector in the main memory can map into any sector in the cache, and a tag is stored with each sector in the cache to identify the main memory sector address. However, a complete sector is not transferred to the cache or back to the main memory as one unit. Instead, individual blocks are transferred as required. On a cache miss, the required block of the sector is transferred into a specific location within one sector: the sector location in the cache is selected, and all the other existing blocks in that sector of the cache are from a previous sector.
Sector mapping might be regarded as a fully associative mapping scheme with valid bits, as in some microprocessor caches. Each block in the fully associative mapped cache corresponds to a sector, and each byte corresponds to a "sector block".
6.4 REPLACEMENT ALGORITHMS [MDU: Dec 2012]

When a miss occurs in a set-associative cache and the set is full, it is necessary to replace one of the tag-data items with a new value. Except for direct mapping, which does not allow a replacement algorithm, the existing block in the cache is chosen by a replacement algorithm. These replacement policies are implemented in hardware. The most common replacement algorithms used are: random replacement, first-in first-out (FIFO) and least recently used (LRU).
1. Random replacement (RR) algorithm: Under this policy, replacement is done randomly by selecting a candidate item and discarding it to make space when necessary. This algorithm does not require keeping any information about the access history. A true random replacement algorithm would select a block to replace in a totally random order, with no regard to memory references or previous selections; practical random replacement algorithms can approximate this behaviour in one of several ways.
2. FIFO: The first-in first-out replacement algorithm removes the block that has been in the cache for the longest time. The first-in first-out algorithm would naturally be implemented with a first-in first-out queue of block addresses, but it can be more easily implemented with counters: only one counter for a fully associative cache, or one counter for each set in a set-associative cache, each with a sufficient number of bits to identify the block.
3. LRU: This algorithm selects for replacement the item that has been least recently used by the CPU, i.e. in LRU, the block which has not been referenced for the longest time is removed from the cache. Only those blocks in the cache are considered. The LRU algorithm is popular for cache systems and can be implemented fully when the number of blocks involved is small. There are several ways the algorithm can be implemented in hardware for a cache, such as the use of counters, a register stack, or a reference matrix. LRU is considered an ideal policy, since it most closely corresponds to the concept of temporal locality (a minimal software sketch of LRU bookkeeping follows this list).
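The sketch below plays the role of the hardware age tracking (counters, register stack or reference matrix) for a single set, using Python's OrderedDict; the set size and block names are arbitrary examples.

from collections import OrderedDict

class LRUSet:
    """One cache set managed with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()           # tag -> data, oldest entry first

    def access(self, tag, fetch):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # mark the block as most recently used
            return self.blocks[tag], True
        if len(self.blocks) == self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used block
        self.blocks[tag] = fetch(tag)
        return self.blocks[tag], False

lru = LRUSet(capacity=2)
load = lambda tag: f"block-{tag}"
lru.access(1, load); lru.access(2, load)
lru.access(1, load)            # touching tag 1 makes tag 2 the LRU block
lru.access(3, load)            # evicts tag 2
print(list(lru.blocks))        # [1, 3]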

Other Types of Replacement Algorithm Implementations
1. Counter implementation: In the counter implementation, a counter is associated with each line. One approach would be to increment each counter at regular intervals and to reset a counter when its line has been referenced. Hence the value in each counter would indicate the age of a block since it was last referenced; the block with the largest age would be replaced.
2. Register stack: In the register stack implementation, a set of registers is formed, one for each block in the set to be considered. The most recently used block is recorded at the "top" of the stack and the least recently used block at the bottom. Actually, the set of values is not a conventional stack, as both ends and the internal values are accessible. When a block is referenced, starting at the top of the stack, the values held in the registers are shifted one place towards the bottom; the value held in one register is passed to the next, until a register is found to hold the same value as the incoming block identification.
6.5 CACHE DATA [MDU: Dec 2007]
The performance of a cache memory (its miss rate) is determined by the size of the cache. If the size of the cache is small, obviously it can accommodate less data, so the chances of a miss increase; and if the size of the cache is large, the miss rate is reduced. Smith (1985) discussed the DTMR in an ISCA paper. The idea was to use baseline performance data to estimate the effects of changes to a baseline cache structure such as:
• Unified cache
• Demand fetch
• Write-back (copyback)
• LRU replacement
• Fully associative for large blocks; 4–8-way set associative for 4–8 byte blocks
DTMR is the design target miss rate. This is the basic cache data (misses per reference) for each cache size, integrated over data and instruction caches. DTMR represents a reasonable assessment of expected performance for modern processors. DTMR data is based on traces of individual programs run to completion, and it assumes a fully associative cache with LRU replacement and copyback. For caches of other designs, adjustments must be made to the DTMR data. Cache performance depends on workload and machine characteristics, and DTMR is useful to develop intuition.
The DTMR benchmark mix considers a variety of applications and includes operating system effects:
• Architectures: IBM S/370 (370 Assembler), M68000, Z8000, CDC 6400
• Languages: Fortran, APL, C, LISP, AI workloads
Suggested use for DTMR:
• DTMR provides a conservative baseline for estimating good cache parameters
• DTMR gives a starting point for detailed simulations

Fig. 6.10 Effect of application environments on DTMR (miss rate versus cache size in bytes, for 16-byte blocks and a fully associative cache; the application environments shown include 370-fortran, 370-cobol, 370-mvs, 8200-ultrix, vax-lisp, vax-mix and 32016-unix), taken from Flynn.


Fig. 6.12 DTMR compared to SPECMARK (miss rate versus cache size from 8 KB to 1 MB, showing the DTMR curve and the SPECMARK average).

There are other factors that affect the cache effectiveness, e.g. the processor architecture and its instruction working set. If an architecture encodes instructions in a denser way, it has a smaller instruction working set, and hence requires a smaller instruction cache to capture that working set and minimize the misses. Different computer architectures have different instruction working sets; thus different architectures are affected differently by the instruction working set. On the other hand, the data working set is not directly affected by instruction set encoding, but is affected more by the register set organisation and the register allocation policy.
6.6 OTHER TYPES OF CACHE

6.6.1 I and D Caches
When the cache memory of the first level is divided into an instruction cache (I-cache) and a data cache (D-cache), this is the so-called Harvard architecture, with a dual bus between the processor and the two caches.

The Harvard architecture allows the processor to fetch instructions from the instruction cache and data from the data cache simultaneously; both caches are backed by the main memory.
Advantages of the Harvard Architecture
Programs do not in general modify their own instructions. Thus, instruction caches can be designed as read-only devices that do not allow modification of the instructions they contain. An instruction cache can simply discard any blocks that have to be ejected without writing them back to the main memory, since the data they contain is guaranteed not to have changed since it was brought into the cache.
Finally, keeping the instruction and data caches separate prevents conflicts between blocks of instructions and data that might map into the same storage locations in a unified cache. Often a system's instruction cache will be significantly smaller (two to four times) than its data cache. This is because the instructions for a program generally take up much less memory than the program's data. The disadvantage is that when a program modifies its own instructions, those instructions are treated as data and are stored in the data cache, not the instruction cache.

6.6.2 Princeton Unified Cache (Unified Cache)
When a cache contains both instructions and data, it is called a unified cache (or Princeton architecture cache). Self-modifying programs are easy to execute in this architecture.
In a unified architecture the instruction and data caches are not separate, and instructions and data that might map into the same storage locations share the one cache. Although the I-cache and D-cache are usually of the same size, it isn't mandatory.
Important terms
I-CACHE: Instruction cache
D-CACHE (L1): Data cache closest to the registers
S-CACHE (L2): Secondary data cache
T-CACHE (L3): Third-level data cache

1. Data from the S-cache has to go through the D-cache to reach the registers.
2. The T-cache is larger than the S-cache, and the S-cache is larger than the D-cache.
3. Not all processors have a T-cache.

6.6.3 Unified Versus Split Caches
This refers to having a single cache or separate caches for data and machine instructions. Split is obviously superior; it reduces thrashing.
Cache Thrashing
Thrashing occurs when frequently used cache lines replace each other. There are three primary causes of thrashing:
1. Instructions and data can conflict, particularly in unified caches.
2. Too many variables, or arrays that are too large, are accessed and do not fit into the cache.
3. Indirect addressing, e.g., sparse matrices. Machine architects can add sets to the associativity; users can buy another vendor's machine. However, neither solution is realistic.

Split Cache
There are several reasons why split caches are preferred over a unified cache (U-cache) on the first level. Different functional units fetch information from the I-cache and the D-cache: the decoder and scheduler operate with the I-cache, while the integer execution unit (also referred to as the arithmetical and logical unit) and the floating point unit communicate with the D-cache. There are also the load/store unit and the system bus controller, which are involved directly in operations with the caches.
The I-cache and D-cache operate with very low access latencies; in order to maintain them low, cache size is sacrificed, and typical values are from 8 Kb to 64 Kb. It is not an easy task to place a larger cache on a silicon die and to assure its proper synchronisation; a size increase would cause a serious performance loss on most tasks. If a particular cache were enlarged while keeping its internal organisation intact, it would inevitably take more time to search in it. Moreover, more pipelines of functional units may utilise a particular cache, and increasing access ports while adding new ones is a pretty stiff job.
Additionally, the U-cache is affected by other drawbacks. For example, all of its contents must be evicted when flushing instructions on system exceptions, and it is a regular practice to flush the I-cache together with process-dependent information when intending to flush a process's virtual address mappings. On the other hand, a U-cache allows for a more effective utilisation of the cache capacity and is usually found on the second level, where it does not have to be as fast as the first level. The B-cache (board-level cache) is most likely to be the last cache level and usually resides on separate chips.
The B-cache is controlled either by the system logic or by the processor. Flushing of the higher cache levels can be used to avoid executing stale instructions: requests that miss are redirected to the next level, and if the requested information has been flushed out previously, fetching it again obviously causes extra traffic for that quantum of information.
6.6.4 Code Density Effects

Cache performance is affected by the instruction set architecture. If we compare a more densely encoded architecture with a less densely encoded one, the more densely encoded architecture captures its program working set in fewer lines than the less densely encoded instruction set. This difference in code density directly affects the miss rates of the two instruction sets. Since only the instruction stream is affected by the encoding, caches that contain only data (D-caches) are generally unaffected by instruction encoding and instruction set code density, because all instruction sets need basically the same data sets for program execution.

Relative code density = (Instruction size × Number of instructions per 100 HLL instructions) / (R/M instruction size × R/M instructions per 100 HLL instructions)

where R/M is the register-memory architecture.
According to Mitchell:
Miss rate for architecture 1 / Miss rate for architecture 2 = Code density for architecture 1 / Code density for architecture 2

In addition to the instruction set architecture, compilers and register allocators also affect cache performance by affecting the code density.
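A small worked sketch of the two ratios above; all of the instruction counts, sizes and the baseline miss rate are invented illustrative values, not measurements from the text.

def relative_code_density(instr_size, instrs_per_100_hll,
                          rm_instr_size, rm_instrs_per_100_hll):
    """Relative code density with respect to a register-memory (R/M) baseline."""
    return (instr_size * instrs_per_100_hll) / (rm_instr_size * rm_instrs_per_100_hll)

# Hypothetical numbers: a 4-byte-instruction ISA vs. a 3-byte R/M baseline.
density = relative_code_density(4, 180, 3, 120)
print(density)                        # 2.0 -> twice the code bytes per 100 HLL statements

# Mitchell's relation: miss rates scale with the code-density ratio.
baseline_miss_rate = 0.02             # assumed miss rate of the R/M architecture
print(baseline_miss_rate * density)   # estimated miss rate of the less dense architecture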
IOUS
Sectors and Blocks
Sectors share a tag among their blocks.

Fig. 6.14 Sectors and blocks (each set holds sectors 0 and 1; within a sector a single tag is shared by the blocks, and each block carries its own valid bit and data words).

As is clear from the figure, the tag of a sector is typically shared by several blocks, thus exploiting spatial locality while keeping the tag directory small. Large blocks need fewer valid/dirty/shared bits, which reduces the cost of the directory.


The limiting case is one block per sector and one word per block (e.g., a direct-mapped cache with a block size of 1 word): single words are read from memory on a miss, and single words are written to memory on a write-back eviction. This keeps the traffic ratio low, but it does not exploit burst-mode transfers and is not necessarily the fastest overall design.
How Does Cache Memory Improve System Performance?
Cache memory is extremely fast memory that is built into a computer's central processing unit (CPU), or located next to it on a separate chip. (The disk cache, by contrast, is used to hold the most recently accessed data from the hard disk.) The CPU uses cache memory to store instructions that are repeatedly required to run programs, improving overall system speed. The advantage of cache memory is that the CPU does not have to use the motherboard's system bus for data transfer; whenever data must be passed through the system bus, the data transfer speed slows to the motherboard's capability. The CPU can process data much faster by avoiding the bottleneck created by the system bus.
When the processor needs to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory.
Each location in each memory has a datum (a cache line), which in different designs ranges in size from 8 to 512 bytes. The size of the cache line is usually larger than the size of the usual access requested by a CPU instruction, which ranges from 1 to 16 bytes. Each location in each memory also has an index, which is a unique number used to refer to that location. The index for a location in main memory is called an address. Each location in the cache has a tag that contains the index of the datum in main memory that has been cached. In a CPU's data cache these entries are called cache lines or cache blocks.
6.6.5 On Chip Caches
(On -chip caches are a popular technique to fight with the speed mismatch between memory and CPU. As
integrated circuits become denser, designers have more chip area that can be devoted to on chip caches.-
To increase the cache sizes as the available area also increases may not be the best solution, however ,
since the largera cache, the longer its access time, thus using cache hierarchies (two or more levels) is

^ ^-
Potential solution ”
:t the cache
Advantages of Two Level On -Chip Caching:
iJTio primary' ( LI ) cache usually is split into separate instruction and data caches to support the
lnstruction and data fetch bandwidths in modern processor Many programs would benefit from
fdata ^
cache that is larger than the instruction cache, while others would prefer the opposite. By
ujgjnwo - level hierarchy on -chip, a majority of the cache lines are dynamically allocated to
_
1 ^^^^ !i in mstrueiions or data. _
cache access time is likely to be lower. As existing processors are shrunk to smaller lithographic
uture sizes, the die area typically needs to be held constant in order to keep the same number

- is^
in
tag
w nlng
simnrched
pads When processors are initially designed , their on -chip cache access
-
first-level cache sizes, the caches will get slower relat ve to
" - Instead, if the extra area is used to hold a second-level cache, the pnmary
3
cPath
. .
t 0 their cycle times. If the additional area available due to a process shrin *
Ply extend the the PNcesso
caches
,

a scale

d
ia
^ CCCSS along with the datapath , while additional cache capacity is still added on -ch p.
JZS «1« * K» «»"
^
- - .
— — -— —-
'

involve the second -level cache


-
Disadvantages of Two Level On -Chip Caching
ivc while keeping
WfhTsecond-level cache can be made set -associative
* * ln“' W

second - level cache will consist of instructions and data which


misses to the first-level cache would also miss in

.
the second -

level
,

}
Cache Memory Organisation

In
.
187

the primary caches direct

’"”'Vi"8

that of the second - level cache, the


flfthe total capacity of the first -level cache is not much smallerarethanalready m the first- level cache. Most
this
level cache can “get in the way” by adding delay between a first - level cache miss
situation , adding a second -
and an off -chip access.

Cache memory is random access memory that a computer microprocessor can access more quickly than it can access regular RAM. As the microprocessor processes data, it looks first in the cache memory, and if it finds the data there (from a previous reading of data), it does not have to do the more time-consuming reading of data from the larger memory.
Cache memory is described in levels of closeness and accessibility to the microprocessor. An L1 cache is on the same chip as the microprocessor (for example, the PowerPC 601 processor has a 32-kilobyte level-1 cache built into its chip). L2 is usually a separate static RAM (SRAM) chip. The main RAM is usually a dynamic RAM (DRAM) chip.
To optimize the area available for storing data, the sectored cache minimizes the directory size and affords the best use of area.
If the contents of the low-level cache (L1) are also contained in the higher-level cache (L2), then the cache system is termed inclusive (logical inclusion). The total number of misses that occur in the second-level cache can be determined by assuming that the processor made all of its requests to the second-level cache without the intermediary first-level cache. It is also assumed that the line size in the second-level cache is the same as or larger than that of the first-level cache; if it were smaller, then loading a line into the first-level cache would simply cause two misses in the second-level cache. So:
Cache size = set size × number of sets = (line size × associativity) × number of sets
Thus, number of sets = cache size / (line size × associativity)
Local miss rate = number of misses experienced by the cache / number of incoming references.
Since we don't know the number of references made to L2 from L1, it is difficult to evaluate the L2 cache performance from its local miss rate alone.
Thus, global miss rate = number of L2 misses / number of references made by the processor.
Solo miss rate: As the name implies ("solo" means the only or single one), the solo miss rate is the miss rate the cache would have if it were the only cache in the system. By the principle of inclusion, the contents of the first level are contained in the second level, which means that the global miss rate of L2 is the same as its solo miss rate, and the number of L2 misses can be found from the solo miss rate.

Fig. 6.15 Levels of cache (the processor is served by the first-level cache L1, which is backed by the second-level cache L2).
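A tiny numeric sketch of the local and global miss rate definitions above; the reference and miss counts are invented for illustration.

processor_refs = 100_000          # references issued by the processor
l1_misses      = 4_000            # these become the incoming references of L2
l2_misses      = 1_000            # misses experienced by L2

local_miss_rate_l2  = l2_misses / l1_misses        # misses / incoming references
global_miss_rate_l2 = l2_misses / processor_refs   # misses / processor references

print(local_miss_rate_l2)    # 0.25
print(global_miss_rate_l2)   # 0.01 (equals the solo miss rate under inclusion)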


6.6.6 Warm Cache, Cold Cache, Write Assembly Cache
(a) Warm cache: A cache in which the process has been running for "a long time", so that it already holds the data for the user and system environment, is called a warm cache. A cache that contains the data required by several processes, together with the required system support, is also called warm; in environments with a lower degree of multiprogramming the data shared between the processes appears to be consistent.
(b) Cold cache: A cold cache means that the cache has been flushed, or some other task might have "stomped on" the cache and its data; the software I/O synchronization might flush/invalidate the cache.

Split/unified cache:
• Split allows specializing each part (instructions vs. data)
• Unified allows the best use of the capacity
Two-level cache: Perhaps an L1 cache on-chip and an L2 cache on-module.
(c) Write assembly cache: It reduces the write traffic and is usually used with an on-chip first-level cache.
Sectored cache: How many bytes are grouped together as a unit? The sectored cache has improved the area effectiveness of on-chip caches.
On-chip/off-chip:
• on-chip: fast but small
• off-chip: large but slow
6.7 WRITE POLICIES

When a system writes a datum to the cache, it must at some point write that datum to the backing store as well. The timing of this write is controlled by what is known as the write policy.
There are two basic writing approaches (a minimal software sketch contrasting them follows this list):
• Write-through – the write is done synchronously both to the cache and to the backing store.
• Write-back (or write-behind) – initially, writing is done only to the cache. The write to the backing store is postponed until the cache blocks containing the data are about to be modified/replaced by new content.
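The essential difference between the two approaches is the dirty-bit bookkeeping, as the purely illustrative sketch below shows; the dictionaries stand in for real hardware structures.

class SimpleCache:
    """Toy single-level cache illustrating write-through vs. write-back."""

    def __init__(self, policy, memory):
        self.policy = policy            # "write-through" or "write-back"
        self.memory = memory            # backing store: address -> value
        self.lines = {}                 # cached data: address -> value
        self.dirty = set()              # addresses modified but not yet written back

    def write(self, address, value):
        self.lines[address] = value
        if self.policy == "write-through":
            self.memory[address] = value        # synchronous write to the backing store
        else:
            self.dirty.add(address)             # defer the write (lazy write)

    def evict(self, address):
        if address in self.dirty:               # write-back: flush dirty data on eviction
            self.memory[address] = self.lines[address]
            self.dirty.discard(address)
        self.lines.pop(address, None)

memory = {0: 0}
wb = SimpleCache("write-back", memory)
wb.write(0, 42)
print(memory[0])    # still 0: the backing store is stale until eviction
wb.evict(0)
print(memory[0])    # 42 after the dirty block is written back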

A write-back cache tracks which of its locations have been written over and marks them as dirty. The data in these locations are written back to the backing store only when that data is evicted from the cache. For this reason, a read miss in a write-back cache (which requires a block to be replaced by another) will often require two memory accesses to service: one to write the replaced data from the cache back to the store, and one to retrieve the needed datum. On the other hand, many changes to a datum in the cache result in only a single write back to the backing store.
Since no actual data are returned to the requester on write operations, a decision needs to be made on write misses: whether or not the data should be loaded into the cache. There are two approaches for handling write misses:
• Write allocate (fetch on write) – the datum at the missed-write location is loaded into the cache, followed by a write-hit operation. In this approach, write misses are similar to read misses.
• No-write allocate (write around) – the datum at the missed-write location is not loaded into the cache, and is written directly to the backing store. In this approach, data are loaded into the cache on read misses only.
Both write-through and write-back policies can use either of these write-miss policies, but usually they are paired in this way:
• A write-back cache uses write allocate, hoping for subsequent writes (or even reads) to the same location, which is now cached.
• A write-through cache uses no-write allocate. Here, subsequent writes have no advantage, since they still need to be written directly to the backing store.
Entities other than the CPU may change the data in the backing store, in which case the copy in the cache may become out-of-date or stale. Alternatively, when the CPU updates the data in the cache, copies of those data in other caches will become stale. Communication protocols between the cache managers which keep the data consistent are known as coherency protocols.
An important consideration in write policy selection is the effect of the decision on memory traffic: selecting write-through generates lower memory traffic with small caches, while selecting copyback (write-back) generates lower memory traffic with large caches. The memory traffic ratio is used to measure this memory traffic.
Fig. 6.16 A write-through cache with no-write allocation (on a read, a hit returns the data, while a miss locates a cache block, reads the data from lower memory into it, and then returns the data; on a write, the data is written into the cache block only on a hit, and is written to lower memory in all cases).
The memory traffic ratio relates the actual memory references (cache to memory) to the processor's memory references.
|y
Memory
request
Note: Wr

' reads , the


the block


) 0
replaces i
Satic elapsed ti
oadvania No- write a>
c
8c s
' nce
C
Write
' Read Request
type
> behind it
not be us
CO
o
C/5
E

»
CO
these kir O
£
$
Yes Yes x>
Cache hit?
Cache hit?
Cl
)

c
i memory traffic <1
CO
o
GO
acting a copyback, Nc No
d to
measure the
' h
H
/ :
Locate a cache : Locate a cache
block to use block to use
V -iv
-

Yes Yes
Is it 'dirt/? Is it 'dirty ?
1
>

Write its previous No Write its previous


No data back to the
data back to the
lower memory . lower memory 6.7.1 S
Read data from Read data from
lower memory into lower memory into % From the
the cache block the cache block -. z found in t
a cache m
v
*» T
U<
Mark the cache Write the new data
Dlock as 'not dirty* into the cache
There
block 1.1
— j
i

2.1

Ret rn data Mark the cache


*
“ block as ‘dirty*

3-

‘V Done

Pig - 6.17 A Write-Back cache with Write Allocation.


Note: Write allocate: When a miss occurs in a cache with write allocate and the block is dirty, the block in the cache is first written to the lower-level memory and then the newly loaded block replaces the old one. Clean blocks can always be directly overwritten, which means that the elapsed time on a miss can differ.
No-write allocate: The block is only modified in the lower level, not in the cache. The idea behind this is that a program or function often ends its operation by storing a result; that result will not be used again soon, so replacing a valid block with it could be a bad idea. The concept works well for these kinds of programs with write-through caches.
Case                           Time
read hit                       tA1
WB write hit                   tA1
WT write hit                   tA2
WT read miss                   tA1 + (B/I) tA2
WB read miss, clean block      tA1 + (B/I) tA2
WB read miss, dirty block      tA1 + 2 (B/I) tA2

                               write allocate           no write allocate
WT write miss                  (B/I) tA2 + tA2          tA2
WB write miss, clean block     tA1 + (B/I) tA2          tA2
WB write miss, dirty block     tA1 + 2 (B/I) tA2        tA2

where tA1 is the access time for the cache, tA2 is the access time for the next lower memory level, B stands for block size and I for data bus width.

6.7.1 Strategies for Line Replacement at Miss Time

From the previous discussion, the meaning of hit and miss is clear: if the CPU wants some data and it is found in the cache, it is called a HIT; otherwise, if the referenced address is not found in the directory, a cache miss occurs. This problem can be overcome in two ways:
1. The missed line must be fetched from the main memory.
2. Use replacement strategies like LRU, FIFO or RAND.
There are certain questions that arise related to this:
1. Data fetching policies
• When do you fetch, and how much?
• Blocking vs. non-blocking caches.
2. Data replacement strategies
• How do you select a victim for replacement?
• LRU
• Random
3. Data storing policies
• When do you store, and how much?
• Write allocation
• Write-through & write-back
• Write buffering
Fetching a line from the main memory: As we know, the write policy of the cache was discussed earlier.
(a) Write-through policy: With a write-through policy, the missed line can be accessed in two ways: (a) starting from the start of the line, or (b) starting from the word where the read miss occurred; the latter is called a wraparound load.
When the missed line is accessed from the start of the line (case (a)), the miss time is the access time of the first word plus the time required to transmit the remaining M - 1 words across the memory bus, where M is the number of physical words in the line; processing cannot be started until the entire line has been transmitted.
But in case (b), processing will start as soon as the first word has been accessed and forwarded to the processor. The remaining words of the line are stored in the cache while the processor resumes processing based upon the initially returned word. In this case both the processor and the line fill may simultaneously wish to access the cache, and the memory (line fill) gets the priority to access the cache first. It is possible that another miss will occur while the first one is being processed; in this case the first miss gets the priority, that is, the first miss must complete before the second miss can begin its access.
(b) Copyback policy: In this policy, we must determine whether the line to be replaced is dirty or not. If the line to be replaced is not dirty, then this is the same as the write-through policy above. But if the line is dirty, there must be provision for the replaced line to be written back to the memory. This is done by first selecting the line to be replaced; then, if it is dirty, writing the line back to the memory; and finally bringing the missed line into the cache, resuming the processing when the line has been completely written into the cache. To speed up the performance, a write buffer can be used here. (A small sketch of the two miss timings follows.)
The fastest approach for fetching a line is the non-blocking cache or prefetching cache. This approach is applicable with both the write-through and the write-back policy.
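A back-of-the-envelope sketch of the two fetch orders described above; the word access/transfer times and the line length M are assumed values, not figures from the text.

M       = 8        # physical words per line (assumption)
t_first = 60e-9    # access time of the first word (assumption)
t_word  = 10e-9    # time to transmit each further word (assumption)

# (a) Fetch from the start of the line: the processor waits for the whole line.
stall_start_of_line = t_first + (M - 1) * t_word

# (b) Wraparound load: processing resumes as soon as the missed word arrives.
stall_wraparound = t_first

print(stall_start_of_line)   # 1.3e-07 s
print(stall_wraparound)      # 6e-08 s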

• Non-blocking means the CPU doesn't stall on a cache miss or write completion.
• "Blocking" caches stall the CPU until the access is completed.
• Non-blocking speeds operation: an out-of-order execution unit can issue multiple reads, and it must be checked whether reads refer to data not yet written.
There are two types of prefetching. In prefetching, not only the missed block is fetched, but also the blocks that are anticipated to be needed in the near future.
The more data that is prefetched, the less the miss delay, but the more cache buffers are occupied by the anticipated data and hence are not free for current requirements.
The replacement strategies that can be used are:
(a) LRU: least recently used
(b) FIFO: first in, first out
(c) RAND: random replacement
These policies have already been discussed in detail previously.
6.8 WRITE ASSEMBLY CACHE (WAC)
The goal of the write assembly cache is to assemble writes so that they can be transmitted to memory in an orderly way. From the above discussion we know that write-through caches have advantages over copyback caches: write-through caches require less software management to support memory consistency as used in multiprocessor configurations. The WAC is an expansion of the write buffer idea, i.e. a WAC is effectively a write buffer with extra circuitry. The WAC offers an intermediate strategy, centralizing the pending memory writes in a single buffer, which reduces the bus traffic; but consistency is not maintained, because writes are now delayed.

Fig. 6.18 Write assembly cache (the CPU writes through the cache and a write buffer/WAC to the main memory).
Note: The WAC exploits the spatial locality of writes – stores to structs, stores to arrays, and, for example, subroutine calls (e.g., statically allocated scratch variables) – as well as the temporal locality of writes. Examples include the NCR single-line WAC and multi-line WACs in workstations.
Suppose the buffer in the WAC is full and has n entries, where n is the number of transfer units that compose a line (fully associative). If a reference s(x) occurs and x is in the WAC, it is returned from the WAC, bypassing and otherwise not affecting the memory system; temporal locality therefore plays an important role.
Thus, from the following it is clear that, to reduce the resultant write traffic, smaller lines are more advantageous than larger ones. The strategy followed for implementing a WAC would be:
Number of lines: minimum
Line size: equal to the transfer size (8 bytes)
Prefetch: none
Associativity: directly mapped
6.8.1 Difference Between Buffer and Cache
Caching realizes a performance increase for transfers of data that is being repeatedly transferred. While a caching system may realize a performance increase upon the initial (typically write) transfer of a data item, this performance increase is due to buffering occurring within the caching system.
With read caches, a data item must have been fetched from its residing location at least once in order for subsequent reads of the data item to realize a performance increase, by virtue of it being able to be fetched from the cache's faster intermediate storage rather than from the data's residing location.
With write caches, a performance increase from writing a data item may be realized upon the first write of the data item, by virtue of the data item immediately being stored to the cache's intermediate storage, deferring the transfer of the data item to its residing storage to a later stage or else having it occur as a background process. Contrary to strict buffering, a caching process must adhere to a (potentially distributed) cache coherency protocol in order to maintain consistency between the cache's intermediate storage and the location where the data resides.
tmge and the location where the data resides.
Buffering, on the other hand, a) jproyides an intermediary for communicating nmre«« whi,h
incapable ofdrrect transfers amongst each other, or else b) ensures amlnimumdata size or representation
reguired by at least one of the communicating processes involved in a transfer
With typical caching implementations, a data item that is read or written for the first time is
effectively being buffered ; and in the case of a write, mostly realizing a performance increase for the
application from where the write originated . Additionally, the portion of a caching protocol where

3 individual writes are deferred to a batch of writes is a form of buffering. The portion of a caching
protocol where individual reads are deferred to a batch of reads is also a form of buffering , although this
iorm may negatively impact the performance of at least the initial reads. In practice, caching almost
always Evolves some form of buffering , while strict buffering does not involve caching. (0
|A buffer is a temporary memory location that is traditionally used because CPU instructions cannot (d)
dir
UM?ctv a(
Mress data stored in peripheral devices. Thus, addressable memory is used as an intermediate (c )

/ Srd ]
-
tiQnally 5Ucb a bu er may be feasible when a large block of data is assembled or disassembled
^
Squired by a storage device ), or when data may be delivered in a different order than that in which
,
i
Pc ^ ,
dUCeCl' Ais0, a whoIe bl ffer of data is usua Iy transferred sequentially (for example to hard
disk ) ° er n itself
For
the variation or jitter of

Present fku ^ ‘,
(he
ir
° S l 8 sometimes increases transfer performance or reduces . These benefits are
^ . ency as opposed to caching where the intent is to reduce the latency (
evcn if the buffered data are written to the buffer once and read from the buffer once. 0
(<

k in

i
depends on rffic is
healed by whether the cache has an
instructions are
a oc - P»W
"2
if
195
Cache Memory Organisation

n larger Traffic

data traffic
Instruction traffic

Q3
Data write traffic
Data read traffic
GO
E
I . While
traffic.
Fig. 6.19 Instruction traffic and data
O
)f a data £
_
read traffic and data write traffic.
once in
Thus three types of traffic are instruction trafficLdata — T3

; able to to

(a) What is the instruction traffic required by each instruction? It is I/P, where I is the average instruction size in bytes and P is the physical word size from the cache to the instruction (I) buffer; generally P >= I.
(b) What is the branch strategy? Branch instructions also add additional instruction traffic depending upon the branch policy followed. A branch is a sequence of code in a computer program which is conditionally executed depending on how the flow of control is altered at the branching point. The branch instruction policy always creates an additional fetch to the target instruction, and the fetching of the additional instructions also requires I/P fetches. Thus the total instruction traffic per instruction is the sum of I/P and the excess instructions fetched on branches:
(I/P) + (excess I traffic on BR (unconditional branch)) + (excess I traffic on BC (conditional branch))
Branching may be to any address in the instruction stream. If we assume the branch target reference starts at the beginning of the reference, then there is no extra traffic caused by BR except for in-line fetches, so:
Excess I traffic on BR = prob(BR) × (T1/P)
where prob(BR) is the probability of an unconditional branch and T1 is the maximum number of in-line instructions fetched before the branch is decoded, and is always >= 0.
; almost instruction fetched before branch is decoded and is always > = 0.
(c ) What is the distribution of data reads and writes.
s cannot
(d ) Is there any use of instructions that create multiple data
rmediate accesses to memory .
sembled (c ) and (d ) parts are explained in next section .
in which
; to hard DATA TRAFFIC

jitter of
lefits are
(a) Load-Store architecture (L/S)
(b) Register-Memory architecture (R/M)
(c) Register-plus-Memory architecture (R + M)
In the first (L/S) architecture there are no memory-to-memory instructions; only the load and store instructions create data traffic to memory. Thus:

Data traffic/I = (data reads + data writes)/I
= (Move instructions per 100 HLL operations + Excess LDM and STM traffic per 100 HLL ops) / (total L/S instructions per 100 HLL operations)
]

where HLL is high-level language, LDM is load multiple and STM is store multiple instructions. If instructions may take more than one operand then we need to know the average number of registers involved in the operation. For data reads LDM is used, and for data writes STM is used.
If we consider only data reads, then:

Data reads/I = [integer loads (LDs per 100 HLL ops) + floating-point reads (LD.Fs per 100 HLL ops) + LDM] / [L/S instructions per 100 HLL ops]

Similarly,

Data writes/I = [integer writes (STs per 100 HLL ops) + floating-point writes (ST.Fs per 100 HLL ops) + STM] / [L/S instructions per 100 HLL ops]
(b) For the second architecture, the Register-Memory architecture (R/M), the same number of reads or writes from memory per HLL operation is assumed and also expected. This increases the data traffic per 100 HLL operations while the instruction count decreases in an R/M architecture. There are two types of memory-to-memory instructions: instructions that involve two source operands, and instructions that involve one source operand (like a Move instruction), which reads the operand from memory and writes it back into memory. Each memory access takes

[operand size in bytes / data width (bytes)] + [((operand size in bytes) - 1) mod data width (bytes)] / data width (bytes)

references.
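Returning to the load-store case, a minimal sketch of the Data reads/I and Data writes/I ratios defined above; the per-100-HLL-operation counts are made-up illustrative values, not measurements from the text:

# Minimal sketch of Data reads/I and Data writes/I for an L/S architecture.
def ls_data_traffic(ld, ld_f, ldm, st, st_f, stm, ls_instructions):
    """All arguments are counts per 100 HLL operations (made-up values)."""
    data_reads_per_I = (ld + ld_f + ldm) / ls_instructions
    data_writes_per_I = (st + st_f + stm) / ls_instructions
    return data_reads_per_I, data_writes_per_I

reads, writes = ls_data_traffic(ld=40, ld_f=10, ldm=5, st=25, st_f=5, stm=2,
                                ls_instructions=150)
print(reads, writes)   # -> about 0.37 data reads and 0.21 data writes per instruction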

Technology dependent cache design considerations


Cache is a small, high-speed memory, usually static RAM (SRAM), that contains the most recently accessed pieces of main memory.
Why is this high speed memory necessary or beneficial ? In today ’s systems, the time it takes to
bring an instruction (or piece of data ) into the processor is very long when compared to the time to
execute the instruction . For example, a typical access time for DRAM is 60 ns. A 100 MHz processor can
execute most instructions in 1 CLK or 10 ns. Therefore a bottleneck forms at the input to the processor. Cache memory helps by decreasing the time it takes to move information to and from the processor. A typical access time for SRAM is 15 ns; therefore cache memory allows small portions of main memory to be accessed up to 4 times faster than DRAM (main memory).
How can such a small piece of high speed memory improve system performance? The theory that explains this is called Locality of Reference. The concept is that at any given time the processor will be accessing memory in a small or localized region of memory. The cache loads this region, allowing the processor to access the memory region faster. This works as long as most of the memory accesses made by the processor are satisfied out of the high speed cache.

Why not replace main memory DRAM with SRAM? SRAM is considerably more expensive than DRAM, and it also consumes more power; so the cache does not replace main memory. There are several other effects that the cache may have on the overall effectiveness of the processor design. For example, the hit rate is a function of cache size, and the bus busy time, and through this the effective memory access time, is also determined by the cache design.
6.8.3 Cache Performance
Average memory access time is a useful measure to evaluate the performance of a memory-hierarchy configuration:

Average memory access time = hit time + miss rate x miss penalty

It tells us how much penalty the memory system imposes on each access (on average), and it can easily be converted into clock cycles for a particular CPU. (For a typical instruction mix, the references are split between instruction references and data references, for example about 25% data references.)

We can also compute the write penalty separately from the read penalty. This may be necessary for two reasons:
• Miss rates are different for each situation.
• Miss penalties are different for each situation.
Treating them as a single quantity yields a useful CPU time formula:

CPU time = IC x (CPI_execution + (Memory accesses/Instruction) x Miss rate x Miss penalty) x Clock cycle time
An Example
Compare the performance of a 64 KB unified cache with a split cache with 32 KB data and 16 KB instruction. Given:
• The miss penalty for either cache is 100 ns, and the CPU clock runs at 200 MHz.
• Don't forget that the cache requires an extra cycle for load and store hits on a unified cache because of the structural conflict.
• Calculate the effect on CPI rather than the average memory access time.
• Assume miss rates are as follows: 64 K unified cache: 1.35%; 16 K instruction cache: 0.64%; 32 K data cache: 4.82%.
• Assume a data access occurs once for every 3 instructions, on average.
Solution: The solution is to figure out the contribution of the miss penalty to CPI separately for instructions and data. The miss penalty of 100 ns at 200 MHz (5 ns per cycle) is 20 clock cycles.
• For the unified cache, the penalty for instruction fetches is (0 + 1.35% x 20) = 0.27 cycles per instruction. For data accesses, the penalty is (1 + 1.35% x 20) x (1/3) = 0.42 cycles per instruction (the extra cycle is the structural stall on load/store hits). The total penalty is 0.69 CPI.
• For the split cache, the instruction penalty is (0 + 0.64% x 20) = 0.13 CPI, and the data penalty is (0 + 4.82% x 20) x (1/3) = 0.32 CPI. The total penalty is 0.45 CPI.
• In this case the split cache performs better because of the lack of a stall on data accesses.
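A short script that reproduces the arithmetic of this example; the 20-cycle penalty is 100 ns divided by the 5 ns cycle of the 200 MHz clock:

# Reproducing the unified-vs-split cache CPI penalties from the example above.
MISS_PENALTY = 20          # cycles: 100 ns / 5 ns per cycle at 200 MHz
DATA_PER_INSTR = 1 / 3     # one data access every 3 instructions

def cpi_penalty(instr_miss, data_miss, unified):
    instr = 0 + instr_miss * MISS_PENALTY
    # a unified cache stalls 1 extra cycle on load/store hits (structural conflict)
    data = ((1 if unified else 0) + data_miss * MISS_PENALTY) * DATA_PER_INSTR
    return instr + data

print(round(cpi_penalty(0.0135, 0.0135, unified=True), 2))    # -> 0.69 CPI
print(round(cpi_penalty(0.0064, 0.0482, unified=False), 2))   # -> 0.45 CPI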
6.9 PROCESSOR WITH CACHE

Cache is used by the CPU of the computer to reduce the average time to access memory.
A
L 1 Cache L 2 Cache

v CP
V
——
jjRegisterJ
I
5
=
fC 0
L 3 Cache

7Y i
aj
c
c
ro
r o

5
-
<
CO
E
CO
O
£
$

SZ O
"

• "T• <v
c
Storage c
co
Device o
co

.
Fig 6.20 How cache memory works.

Point to Remember: Memory Hierarchy Terminology

1. Hit: Data appears in some block in the faster level (Example: Block X).
(a) Hit Rate: Hit rate is the fraction of memory accesses found in the faster level.
(b) Hit Time: Hit time is the time to access the faster level, which consists of: memory access time + time to determine hit/miss.
2. Miss: Data needs to be retrieved from a block in the slower level (Block Y).
(a) Miss Rate = 1 - (Hit Rate).
(b) Miss Penalty: Time to replace a block in the upper level + time to deliver the block to the processor.
(c) Hit Time << Miss Penalty.
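These terms combine into the average memory access time formula of Section 6.8.3; a minimal sketch with made-up timing values:

# AMAT = hit time + miss rate x miss penalty, using the terminology above.
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    return hit_time_ns + miss_rate * miss_penalty_ns

print(average_access_time(hit_time_ns=10, miss_rate=0.02, miss_penalty_ns=100))  # -> 12.0 ns
# Converting to cycles on a 200 MHz CPU (5 ns cycle): 12.0 / 5 = 2.4 cycles per access.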

Fig. 6.21 Blocks are transferred between the faster level (Block X) and the slower level (Block Y) on their way to and from the processor.
If the data is not found in the cache then this is called a miss, and the data must be brought into the cache from main memory. Cache misses are classified into three categories:
1. Compulsory
2. Capacity
3. Conflict

Fig. 6.22 Classification of cache misses: compulsory, capacity and conflict.
1. Compulsory: The first access to a block can never be in the cache. These are also called cold start misses or first reference misses. (Such misses occur even in an infinite cache.)
2. Capacity: If the cache cannot contain all the blocks needed during execution of a program, blocks must be discarded and later retrieved. (Such misses occur even in a fully associative cache.)
3. Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses occur because a block may be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses. (Misses in an N-way set-associative cache of size X.)
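A toy direct-mapped cache simulator (an illustrative sketch, not from the text) makes conflict misses visible: two blocks that map to the same line keep evicting each other even though the cache is far from full:

# Counting misses in a direct-mapped cache; line size, cache size and the trace are made-up.
def count_misses(addresses, num_lines, line_size):
    tags = [None] * num_lines          # one tag per direct-mapped line
    misses = 0
    for addr in addresses:
        block = addr // line_size      # block (line) address
        index = block % num_lines      # which cache line it maps to
        tag = block // num_lines
        if tags[index] != tag:         # miss: cold, capacity or conflict
            misses += 1
            tags[index] = tag
    return misses

# With 16 lines of 64 B, byte addresses 0 and 1024 both map to line 0:
trace = [0, 1024, 0, 1024, 0, 1024]
print(count_misses(trace, num_lines=16, line_size=64))   # -> 6 (2 cold misses, then 4 conflict misses)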

The cache memory is used by all processors to access instructions and data, except vector processors; vector processors access the data directly from memory. (A vector processor, or array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This is in contrast to a scalar processor, whose instructions operate on single data items.)
6.9.1 Fully Blocked, Partially Blocked and Non Blocked Cache


We all know about cache hit and cache miss. There are three types of cache memory interactions: fully blocked, partially blocked and non-blocked.
Table 6.1 Cache Memory Interaction

Fully blocked: In fully blocked cache memory interaction, when a cache miss occurs the processor completely stops processing until the entire line is returned to the cache; then processing resumes.

Partially blocked: The processor does not completely stop processing when a cache miss occurs, but resumes processing after some portion of the line is returned; thus there is a time period when both processor and memory are busy.

Non-blocked: Non-blocking means the cache is not blocked, i.e. a "non-blocking cache" or "lock-free cache" allows the data cache to continue to supply cache hits during a miss. The processor does not stop when a cache miss occurs unless required by a data dependency. "Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests; "hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
The addition of a cache to a memory system complicates the analysis of memory traffic. For copyback systems with write allocate (CBWA) caches, the requests to memory consist of line read requests and line write requests. For write-through systems without fetch on write (WTNWA) caches, the requests consist of line read requests and word write requests.
To develop models of memory systems with caches, two basic parameters must be evaluated:
1. T_line access: the time it takes to access a line in memory.
2. T_busy: the potential contention time (when memory is busy and the processor/cache is able to make requests to memory).

Consider a pipelined single processor system using interleaving to support fast line access. Assume the cache has a line size of L physical words (bus word size) and memory uses low-order interleaving of degree m. Now if m >= L, the total time to move a line (for both read and write operations) is

T_line access = T_a + (L - 1) T_bus

where T_a is the word access time and T_bus is the bus cycle time.

If L > m, a module has to be accessed more than once, so the module cycle time T_c plays a role. If T_c < m(T_bus), the module first used will recover before it is to be used again, so even for L > m:

T_line access = T_a + (L - 1) T_bus

But for L > m and T_c > m(T_bus), the memory cycle time dominates the bus transfer. The line access time now depends on the relationship between T_a and T_c, and we can now use

T_line access = T_a + T_c(⌈L/m⌉ - 1) + T_bus((L - 1) mod m)

The first word in the line is available in T_a, but the module is not available again until T_c. The first access is already accounted for in T_a, so (⌈L/m⌉ - 1) additional module cycles are required. Finally, ((L - 1) mod m) bus cycles are required for the other modules to complete the line transfer.
Assume a single module memory system (m = 1), with page mode and sequential access; assume v is the number of fast sequential accesses and T_v is the time between each such access. Then, for page mode or FPM mode:

T_line access = T_a + T_c(⌈L/v⌉ - 1) + max(T_bus, T_v)(L - ⌈L/v⌉)

With m interleaved modules, L/v is replaced by L/(m·v), so that

T_line access = T_a + T_c(⌈L/(m·v)⌉ - 1) + max(T_bus, T_v/m)(L - ⌈L/(m·v)⌉)

Fig. 6.23 Timing of a fast-page-mode line access: the initial access T_a, the fast accesses separated by T_v, and the T_c idle (recovery) time.

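The line-access formulas above can be checked with a small script; the generalisation of the page-mode formula to m interleaved modules (dividing L by m·v and T_v by m) is an assumption that is consistent with the worked cases that follow:

import math

# Interleaved memory, module cycle time dominating (L > m, Tc > m*Tbus):
def t_line_interleaved(Ta, Tc, Tbus, L, m):
    return Ta + Tc * (math.ceil(L / m) - 1) + Tbus * ((L - 1) % m)

# Fast-page-mode memory with v sequential accesses per row and m modules (m*v per burst assumed):
def t_line_page_mode(Ta, Tc, Tbus, Tv, L, v, m=1):
    bursts = math.ceil(L / (m * v))
    return Ta + Tc * (bursts - 1) + max(Tbus, Tv / m) * (L - bursts)

print(t_line_interleaved(Ta=300, Tc=200, Tbus=50, L=8, m=2))              # -> 950 ns (Case 1)
print(t_line_page_mode(Ta=200, Tc=150, Tbus=50, Tv=40, L=8, v=4, m=1))    # -> 650 ns (Case 2)
print(t_line_page_mode(Ta=200, Tc=150, Tbus=25, Tv=50, L=16, v=4, m=2))   # -> 700 ns (Case 3)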


Example: Computing T_line access

Case 1: Let T_a = 300 ns, T_c = 200 ns, T_bus = 50 ns, L = 8 and m = 2.
Here we have L > m and T_c > m(T_bus), so

T_line access = T_a + T_c(⌈L/m⌉ - 1) + T_bus((L - 1) mod m)
= 300 + 200(4 - 1) + 50(1) = 950 ns

Case 2: Let T_a = 200 ns, T_c = 150 ns, T_v = 40 ns, T_bus = 50 ns, L = 8, v = 4, m = 1.

T_line access = T_a + T_c(⌈L/v⌉ - 1) + max(T_bus, T_v)(L - ⌈L/v⌉)
= 200 + 150((8/4) - 1) + 50(8 - (8/4))
= 200 + 150 + 300
= 650 ns

Case 3: Let T_a = 200 ns, T_c = 150 ns, T_v = 50 ns, T_bus = 25 ns, L = 16, v = 4, m = 2.

T_line access = T_a + T_c(⌈L/(m·v)⌉ - 1) + T_bus(L - ⌈L/(m·v)⌉)
= 200 + 150((16/(2·4)) - 1) + 25(16 - (16/(2·4)))
= 200 + 150 + 350
= 700 ns

6.9.2 Contention Time and Copyback Caches

In a simple copyback cache the processor ceases on a cache miss and does not resume until the dirty line (w = probability of a dirty line) is written back to main memory and the new line is read into the cache. The miss time penalty thus is

T_miss = (1 + w) T_line access

The miss time may be different for the cache and for main memory. Let:
• T_c.miss = time the processor is idle due to a cache miss.
• T_m.miss = total time main memory takes to process a miss.
• T_busy = T_m.miss - T_c.miss: the potential contention time.
• T_busy = 0 for a normal CBWA cache.

Consider a case when the dirty line is written to a write buffer while the new line is read into the cache; when the processor resumes, the dirty line is written back to memory from the buffer. Then

T_m.miss = (1 + w) T_line access
T_c.miss = T_line access

so T_busy = w · T_line access

In case of wrap-around load,

T_busy = (1 + w) T_line access - T_a



If the processor makes another miss during T_busy, we call the additional delay T_interference:

T_interference = (Expected no. of misses during T_busy) x (Delay per miss)

Expected no. of misses during T_busy = (No. of requests during T_busy) x (prob. of miss) = T_busy x X_p x f

where X_p is the processor request rate and f is the miss rate per request. The delay given a miss during T_busy is simply estimated as T_busy/2. So

T_interference = T_busy x X_p x f x (T_busy/2)

and the total miss delay seen from the processor is

T_miss = T_c.miss + T_interference

and the relative processor performance is

Perf_rel = 1/(1 + f x T_miss)
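A minimal sketch chaining these copyback-miss quantities together; the dirty-line probability, request rate, miss rate and timings are made-up illustrative values:

# Copyback cache miss delay including memory-busy interference (a sketch, not the text's program).
def miss_delay(t_line_access, w, x_p, f, wrap_around=False, Ta=0.0):
    """w: prob. of a dirty line; x_p: processor request rate (requests per ns);
    f: miss rate per request; Ta: first-word access time (used for wrap-around load)."""
    t_c_miss = t_line_access                      # processor idle time per miss
    t_m_miss = (1 + w) * t_line_access            # memory busy time per miss
    t_busy = (t_m_miss - Ta) if wrap_around else (t_m_miss - t_c_miss)
    t_interference = t_busy * x_p * f * (t_busy / 2)
    return t_c_miss + t_interference              # total miss delay seen by the processor

print(round(miss_delay(t_line_access=650, w=0.3, x_p=0.01, f=0.02), 1))   # -> 653.8 ns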
When a system writes a datum to the cache, it must at some point write that datum to the backing store as well. The timing of this write is controlled by what is known as the write policy. There are two basic writing approaches:
• Write-through: the write is done synchronously both to the cache and to the backing store.
• Write-back (or write-behind): initially, writing is done only to the cache. The write to the backing store is postponed until the cache blocks containing the data are about to be modified/replaced by new content.
A write-back cache is more complex to implement, since it needs to track which of its locations have been written over, and mark them as dirty for later writing to the backing store.
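A toy sketch (not the text's implementation) of the bookkeeping a write-back cache needs: one dirty bit per line, with the write to the backing store deferred until the line is evicted:

# Toy direct-mapped write-back cache: dirty lines are flushed to memory only on eviction.
class WriteBackCache:
    def __init__(self, num_lines, memory):
        self.lines = {}                      # index -> (tag, data, dirty)
        self.num_lines = num_lines
        self.memory = memory                 # backing store: dict of block -> data

    def write(self, block, data):
        index, tag = block % self.num_lines, block // self.num_lines
        entry = self.lines.get(index)
        if entry and entry[0] != tag and entry[2]:        # evicting a dirty line
            self.memory[entry[0] * self.num_lines + index] = entry[1]
        self.lines[index] = (tag, data, True)             # write only to the cache, mark dirty

memory = {}
cache = WriteBackCache(num_lines=4, memory=memory)
cache.write(0, "A")      # stays in the cache only
cache.write(4, "B")      # maps to the same line, so block 0 is written back now
print(memory)            # -> {0: 'A'}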

6.9.3 Simple Write-Through Caches


Analysis of the write-through cache follows that of the copyback caches. However, there are two differences:
1. There is no dirty line to be written back on a miss, so T_busy is simple: T_busy = T_m.miss - T_c.miss = 0.
2. A write that immediately precedes a miss can delay the line access until memory is available.
The modelling of the memory and write-through cache interaction depends upon:
1. The existence of write buffers.
2. How to deal with write buffer contents on a read miss.
3. The amount of write traffic.
Fig. 6.24 Processor, cache and memory.
For explaining this we are considering three cases here.

Case 1: No write buffer and low write traffic

Fig. 6.25 Processor, cache and memory with no write buffer.




Here, since there is no write buffer and the write traffic is also low, there is only read traffic to consider, since a write cannot cause a miss; a write can only interfere with a read that follows it. Thus we compute only:

T_write interference = prob(write during T_c preceding a miss) x (delay, given a read miss)   ...(1)

prob(write during T_c) = prob(write during any of T_c/Δt cycles)   ...(2)

Let prob(write during Δt) = Prob_w. Then

Prob(no write in any specified cycle) = 1 - Prob_w   ...(3)
Prob(no write during any of T_c/Δt cycles) = (1 - Prob_w)^(T_c/Δt)

Now, prob(a write during any of T_c/Δt cycles) = 1 - (1 - Prob_w)^(T_c/Δt)

Let the delay be T_c/2; thus, putting this in Eq. (1):

T_write interference = [1 - (1 - Prob_w)^(T_c/Δt)] x T_c/2

Here contention is not included because a single processor is considered, but contention must be considered if there are multiple processors with write-through caches.
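A minimal sketch of this Case 1 estimate; the per-cycle write probability and the cache cycle count T_c/Δt below are made-up illustrative values:

# T_write_interference = [1 - (1 - Prob_w)^(Tc/dt)] * Tc/2   (Case 1 estimate, in cycles)
def write_interference(prob_w, tc_cycles):
    """prob_w: probability of a write in any one cycle; tc_cycles: Tc measured in cycles (Tc/dt)."""
    p_write_during_tc = 1 - (1 - prob_w) ** tc_cycles
    return p_write_during_tc * tc_cycles / 2     # expected extra delay per read miss, in cycles

print(round(write_interference(prob_w=0.05, tc_cycles=6), 2))   # -> 0.79 cycles per read miss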

Case 2: No write buffer, but significant write traffic

Again there is no write buffer, so this case is the same as case 1, but here write traffic is involved and it affects the processor performance. Thus case 2 is an extension of case 1. Since there is significant write traffic, write requests are delayed by contention, but read requests are still unaffected by memory contention. Thus

k = k_r + k_w

where k is the total request rate, k_r is the read request rate and k_w is the write request rate. We can say that k is the desired request rate but the achieved request rate is k_a; the achieved rate and the desired rate are different because of the write traffic. Thus

k_a = k_r + k_wa

where k_w and k_wa are different. From this discussion it is clear that the processor request rate (MAPS) is decreased if there is significant write traffic. As MAPS is reduced, MIPS also gets reduced, by a factor of:

k_a/k = (k_r + k_wa)/k
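A tiny sketch of this rate-reduction factor, with made-up request rates:

# MIPS is scaled by the achieved-to-desired request-rate ratio k_a/k = (k_r + k_wa)/k.
def mips_factor(k_r, k_w, k_wa):
    k = k_r + k_w            # desired request rate
    k_a = k_r + k_wa         # achieved request rate (writes delayed by contention)
    return k_a / k

print(mips_factor(k_r=8.0, k_w=2.0, k_wa=1.5))   # -> 0.95, i.e. a 5% MIPS loss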
-
Case 3: Write-through cache with write buffers

Fig. 6.26 Write-through cache with write buffers: processor writes pass through the cache into a write buffer and then to memory.

The write buffer is assumed to be of large size. This means that the write traffic seen by memory is low and hence the delay is also less. The following situations are to be considered.



"204;i
58® •

(W
——
Advanced Computer Architecture .
n M
(a) Read Priority over Write on Miss: on a read miss, either all the pending writes in the write buffer are written to memory before the read miss is handled, or the contents of the write buffer are checked against the missed line address and the read miss precedes the pending writes if no match is found.

6.10 VIRTUAL-TO-REAL TRANSLATION

The TLB (translation lookaside buffer) provides the real addresses used by the cache by translating the virtual addresses into real addresses. Processes deal with virtual memory: they have the illusion that a very large memory is available to them. As we know, the amount of physical memory is limited and is shared by all processes; a process places part of its virtual memory in this physical memory and the rest on disk (called swap space).
The virtual and physical memory are broken up into blocks and pages
respectively.
Fig. 6.27 Mapping of virtual to physical address: the virtual page number is translated to a physical page number, while the page offset (8 kB page size) is unchanged.
The page address (the upper bits of the virtual address) is composed of the bits that require translation. Selected virtual address bits address the TLB entries. These are selected (or hashed) from the bits of the virtual address. This avoids too many address collisions, as might occur when both address and data pages have the same, say, "000," low-order page addresses. The size of the virtual address index is equal to log2(t), where t is the number of entries in the TLB divided by the degree of set associativity. When a TLB entry is accessed, a virtual and real translation pair is read from each entry in the addressed set; the virtual addresses are compared to the virtual address tag (the virtual address bits that were not used in the index). If a match is found, the corresponding real address is multiplexed to the output of the TLB. The translation and the cache access may be sequential (translation then access) or performed in parallel.

Fig. 6.28 Sequential translation and access. Fig. 6.29 Parallel translation and access. (Both split the virtual address into virtual page number and page offset, translate through the TLB, and compare cache tags before delivering data.)

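A simplified set-associative TLB lookup in the spirit of the description above (low-order virtual-page-number bits index the TLB, the remaining bits form the tag); the page size, number of sets and associativity are assumptions for illustration:

# Simplified set-associative TLB lookup (a sketch; field widths and sizes are assumed).
PAGE_OFFSET_BITS = 13            # 8 kB pages, as in Fig. 6.27
SETS, WAYS = 16, 2               # 32 entries, 2-way set associative; index width = log2(16) = 4 bits
OFFSET_MASK = (1 << PAGE_OFFSET_BITS) - 1

tlb = [[None] * WAYS for _ in range(SETS)]   # each entry holds (virtual tag, physical page number)

def tlb_insert(virtual_address, ppn, way=0):
    vpn = virtual_address >> PAGE_OFFSET_BITS
    tlb[vpn % SETS][way] = (vpn // SETS, ppn)

def tlb_lookup(virtual_address):
    vpn = virtual_address >> PAGE_OFFSET_BITS
    index, tag = vpn % SETS, vpn // SETS     # low VPN bits index the TLB, the rest form the tag
    for entry in tlb[index]:
        if entry and entry[0] == tag:        # virtual-tag match: translation found
            return (entry[1] << PAGE_OFFSET_BITS) | (virtual_address & OFFSET_MASK)
    return None                              # TLB miss: the page table must be walked

tlb_insert(0x12345, ppn=0x0ABC)
print(hex(tlb_lookup(0x12345)))              # -> 0x1578345 (page 0xABC, page offset preserved)
print(tlb_lookup(0x99999))                   # -> None (miss)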


6.11 CACHE HITS, MISSES AND CONSISTENCY

1. A cache may hold both instructions and data (a unified cache), or there may be a separate I-cache and D-cache.
2. Cache Hit: When the cache contains the information requested, the transaction is said to be a cache hit.
3. Cache Miss: When the cache does not contain the information requested, the transaction is said to be a cache miss.

Fig. 6.30 Processor fetch, store and read requests served by the cache and memory.
4. Cache Consistency

Since cache is a photo or copy of a small piece of main memory, it is important that the cache always reflects what is in main memory. Some common terms used to describe the process of maintaining cache consistency are:
(a) Snoop: When a cache is watching the address lines for transactions, this is called a snoop. This function allows the cache to see if any transactions are accessing memory it contains within itself.
(b) Snarf: When a cache takes the information from the data lines, the cache is said to have snarfed the data. This function allows the cache to be updated and maintain consistency.
Snoop and snarf are the mechanisms the cache uses to maintain consistency. Two other terms are commonly used to describe inconsistencies in the cache data:
(a) Dirty Data: When data is modified within the cache but not modified in main memory, the data in the cache is called dirty data.
(b) Stale Data: When data is modified within main memory but not modified in the cache, the data in the cache is called stale data.

Points to Remember:
• The virtual address is divided into two parts: page number and line number.
• Direct mapping, associative mapping and set-associative mapping are the three basic types of cache mapping.
• Write-through and write-back are the two policies used for writing into the cache.
• LRU, FIFO and random replacement are the three policies used for line replacement in cache memory.
• Size of cache = set size x number of sets x line size.
• Local miss rate, global miss rate and solo miss rate are the three types of miss rate.




REVIEW QUESTIONS

Q.1. What is the need of using cache memory? Explain in detail.
Q.2. Write a note on mapping and its types.
Q.3. What do you mean by replacement algorithm? Explain various types of replacement algorithms.
Q.4. Write a note on write policies.
Q.5. How is virtual-to-real translation done? Explain.
Q.6. The access time of cache memory is 100 ns and that of main memory is 1000 ns. It is estimated that 80% of memory requests are for read and the remaining 20% are for write. The hit ratio for read accesses is 0.9 and a write-through policy is used.
(a) What is the average access time of the system considering only memory read cycles?
Hint: (90/100) x 100 + (10/100) x 1100 = 220 ns
(b) What is the average access time of the system for both read and write requests?
(c) What is the hit ratio taking into consideration the write cycles? [0.9 x 0.8 = 0.72]
Q.7. Consider a cache (M1) and main memory (M2) with the following characteristics: M1 = 16 k words, 50 ns access time; M2 = 1 M words, 400 ns access time. Assume 8-word cache blocks and a set size of 250 words, with set-associative mapping. Calculate T_eff with a cache hit ratio of 0.95.
Q.8. In a two-level cache system we have L1 of size 8 kB, 4-way set associative with 16 B lines, and L2 of size 64 kB, direct mapped with 64 B lines and CBWA. Suppose a miss in L1 that hits in L2 has a delay of 3 cycles, and a miss in L1 that also misses in L2 has a delay of 10 cycles; the processor makes 1.5 references/instruction.
(a) What are the L1 and L2 miss rates?
(b) What is the expected CPI loss due to cache misses?
(c) All lines in L1 always reside in L2. Why?
Hint: Number of sets = Cache size / (Associativity x line size)
CPU time/Instruction = t_cycle [CPI + (references/instruction) x (miss rate x miss penalty)]
Miss penalty = T_c.miss / t_cycle
