
Figure 18.19 Structure of a combining switch.

processes running on distinct processors. Switching networks that employ combining switches are called combining networks. Section 18.4.1 explains in detail how the fetch&add primitive can be used in a combining network to implement various synchronization tools.
The structure of the combining switch used in the NYU Ultracomputer is depicted in Figure 18.19. The memory requests from Pi and Pj enter two combining queues, one for the Mk and one for the Ml memory block. If the two requests refer to the same memory address, the corresponding combining queue forwards one request to the memory block and places the second request in the associated wait buffer. Replies from memory are directed both to the non-combining queue of the requesting processor and to the associated wait buffer, which tries to match the incoming reply with the stored requests. If a match is found, the wait buffer generates a second reply for the other requesting processor, too.
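As a concrete illustration of how combining preserves the serialization semantics of fetch&add, the following sketch models a single switch merging two requests aimed at the same address: the combined request carries the sum of the increments, and the stored wait-buffer entry lets the switch synthesize the second reply from the single memory response. This is a simplified, illustrative model (all type and function names are assumptions), not the actual Ultracomputer hardware logic.

```c
#include <stdio.h>

/* A fetch&add request arriving at the switch. */
typedef struct {
    int addr;      /* target memory address   */
    int increment; /* value to add            */
    int source;    /* requesting processor id */
} Request;

/* Entry kept in the wait buffer while the combined request is in flight. */
typedef struct {
    int addr;
    int first_increment; /* increment of the request forwarded first */
    int first_source;
    int second_source;
} WaitEntry;

/* Combine two requests to the same address into one memory request. */
Request combine(Request a, Request b, WaitEntry *w) {
    Request merged = { a.addr, a.increment + b.increment, a.source };
    w->addr = a.addr;
    w->first_increment = a.increment;
    w->first_source = a.source;
    w->second_source = b.source;
    return merged;
}

/* When the memory reply (the old value) returns, generate both replies. */
void deliver_replies(int old_value, const WaitEntry *w) {
    /* The first requester sees the original value...                      */
    printf("reply to P%d: %d\n", w->first_source, old_value);
    /* ...the second sees it as if its fetch&add ran just after the first. */
    printf("reply to P%d: %d\n", w->second_source,
           old_value + w->first_increment);
}

int main(void) {
    Request r1 = { 100, 3, 0 }, r2 = { 100, 5, 1 };
    WaitEntry w;
    Request merged = combine(r1, r2, &w);
    printf("forwarded: fetch&add(addr=%d, inc=%d)\n",
           merged.addr, merged.increment);
    deliver_replies(10, &w);  /* memory returned old value 10 */
    return 0;
}
```

The net effect is identical to the two fetch&adds executing one after the other at the memory module, which is exactly the property the combining switch must preserve.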

18.3 Cache coherence

18.3.1 Cache coherence problems

Cache memories are introduced into computers in order to bring data closer to the processor and hence to reduce memory latency. Caches are widely accepted and employed in uniprocessor systems. However, in multiprocessor machines where several processors require a copy of the same memory block, the maintenance of consistency among these copies raises the so-called cache coherence problem, which has three causes:

Sharing of writable data
Process migration
I/O activity.
Assume that D is a shared writable data structure and that processes running on processors Pi and Pj read the value of D into their caches Ci and Cj; hence both caches contain a consistent value of D. If the process on Pi updates D to D', cache Ci will contain the D' value while cache Cj still contains the original D value; that is, the two cached copies become inconsistent. If the process on Pj now reads D, it will not obtain the latest value of D.

Assume now that D is a private writable data structure owned by process A running on processor Pi. If A writes D into D', cache Ci will contain the D' value while the main memory still contains the original D value. If A then migrates to processor Pj (j != i) and performs a read operation on the data structure, the data will be fetched from the main memory, and hence cache Cj will contain the original D value instead of the updated D' value.

Inconsistency can arise in the case of any writable data structure if the I/O processor works directly from the main memory: if the cached data structure D is written into D' by any processor, the I/O system cannot see the change of value for D, since the main memory contains the stale D value.

Obviously, from the point of view of cache coherence, data structures can be divided into three classes:

Read-only data structures, which never cause any cache coherence problem. They can be replicated and placed in any number of caches without any problem.
Shared writable data structures, which are the main source of cache coherence problems.
Private writable data structures, which pose cache coherence problems only in the case of process migration.
There are several techniques to maintain cache coherence for the critical case, that is, shared writable data structures. The applied methods can be divided into two classes:

hardware-based protocols
software-based schemes.

Software-based schemes usually introduce some restrictions on the cachability of data in order to prevent cache coherence problems. First the hardware-based protocols are described in detail, then a short introduction to software-based schemes is given.

18.3.2 Hardware-based protocols


Hardware-based protocols provide general solutions to the problems of cache coherence without any restrictions on the cachability of data. The price of this approach is that shared memory systems must be extended with sophisticated hardware mechanisms to support cache coherence. Hardware-based protocols can be classified according to their:

memory update policy
cache coherence policy
interconnection scheme.

Two types of memory update policy are applied in multiprocessors: write-through and write-back. The write-through policy maintains consistency between the main memory and the caches; that is, when a block is updated in one of the caches it is immediately updated in memory, too. The write-back policy permits the memory to be temporarily inconsistent with the most recently updated cached block; memory is updated eventually, when the modified block in the cache is replaced or invalidated. Figure 18.20(a) shows the effect of the write-through policy, while Figure 18.20(b) illustrates the same cache operation with the write-back policy. The write-through policy leads to unnecessary traffic on the interconnection network in the case of private data and infrequently used shared data. On the other hand, it is more reliable than the write-back scheme, since error detection and recovery features are available only at the main memory. The write-back policy avoids useless interconnection traffic; however, it requires more complex cache controllers, since read references to memory locations that have not yet been updated must be redirected to the appropriate cache.

Figure 18.20 Memory update policies in writing shared data structures. (a) Write-through memory update policy; (b) write-back memory update policy.

The write-through policy is a greedy policy because it updates the memory copy immediately, while the write-back policy is a lazy one with postponed memory update. Similarly, a greedy and a lazy cache coherence policy have been introduced for updating the cached copies of a data structure:

write-update policy (a greedy policy)
write-invalidate policy (a lazy policy).

The key idea of the write-update policy is that whenever a processor updates a cached data structure, it immediately updates all the other cached copies as well. Whether the shared memory copy is updated depends on the memory update policy. The mechanism of the write-update policy is illustrated in Figure 18.21(a). Immediate data migration can cause saturation in the interconnection network.

Figure 18.21 Cache coherence policies in writing shared data structures. (a) Write-update cache coherence policy; (b) write-invalidate cache coherence policy.
Moreover, this traffic is unnecessary in the case of infrequently used shared data. Cache controllers must be able to accept requests not only from their own processor but also from other cache controllers.

In the case of the write-invalidate policy, the updated cache block is not sent immediately to other caches; instead, a simple invalidate command is sent to all other cached copies and to the original version in the shared memory so that they become invalid. If another processor later wants to read the data structure, it is provided by the updating processor. The mechanism of the write-invalidate policy is shown in Figure 18.21(b). The traffic on the interconnection network is significantly reduced compared with the write-update policy, since the invalidate command requires sending only the address of the updated data structure. Furthermore, cache controllers have to accept only the invalidate command from other controllers.
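The bandwidth argument can be made concrete by comparing the payloads of the two consistency commands. The sketch below assumes a 32-byte block and uses illustrative structure names; the point is simply that an update command must carry the whole block while an invalidate command carries only an address.

```c
#include <stdint.h>

#define BLOCK_SIZE 32  /* assumed cache block size in bytes */

/* Write-update: every write to a shared block broadcasts address + data. */
typedef struct {
    uint32_t block_addr;
    uint8_t  data[BLOCK_SIZE];  /* full updated block travels on the network */
} UpdateCommand;                /* ~36 bytes per shared write */

/* Write-invalidate: only the address is broadcast; the data itself moves
 * lazily, on the next read miss of another processor. */
typedef struct {
    uint32_t block_addr;
} InvalidateCommand;            /* ~4 bytes per shared write */
```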
Hardware-based protocols can be further classified into three basic classes depending on the nature of the interconnection network applied in the shared memory system. If the network efficiently supports broadcasting, the so-called snoopy cache protocol can be advantageously exploited. This scheme is typically used in single bus-based shared memory systems where consistency commands (invalidate or update commands) are broadcast via the bus and each cache 'snoops' on the bus for incoming consistency commands.

Large interconnection networks (for example, multistage networks) cannot support broadcasting efficiently and therefore a mechanism is needed that can directly forward consistency commands to those caches that contain a copy of the updated data structure. For this purpose a directory must be maintained for each block of the shared memory to administer the actual location of blocks in the possible caches. This approach is called the directory scheme.

The third approach tries to avoid the application of the costly directory scheme but still provide high scalability. It proposes multiple-bus networks with the application of hierarchical cache coherence protocols that are generalized or extended versions of the single bus-based snoopy cache protocol.

The design space and classification of hardware-based protocols are shown in Figure 18.22. Notice that the three branches of the design space are orthogonal, that is, for any hardware-based protocol these three choices can be used in any combination. When a processor wants to read or write a data block the following cases can occur:

Read/write hit: There is a copy of the block in the processor's cache.


Read/write miss: The required block is missing from the processor's cache.
In describing a cache coherence protocol the following definitions must be given:

Definition of possible states of blocks in caches, memories and directories.
Definition of commands to be performed at various read/write hit/miss actions.
Definition of state transitions in caches, memories and directories according to the commands.
Definition of transmission routes of commands among processors, caches, memories and directories.
Figure 18.22 Design space of hardware-based cache coherence protocols.
In the next sections a detailed description of the cache coherence protocols applied to the different interconnection networks is given. The description of the protocols is based on the definitions listed above.

Snoopy cache protocols

Snoopy cache protocols are very popular in shared bus multiprocessors due to their relative simplicity. They have both write-update and write-invalidate policy versions. Here, only a write-update snoopy cache protocol is explained in detail. Write-invalidate snoopy cache protocols resemble this protocol in many ways and therefore are also easy to understand after studying a write-update protocol. A good survey of several snoopy cache protocols can be found in Archibald and Baer (1986).

In describing a snoopy cache protocol the following definitions must be given:

Definition of possible states of blocks in caches.
Definition of commands to be performed at various read/write hit/miss actions.
Definition of state transitions in caches and memories according to the commands.
The definition of transmission routes of commands can be omitted in snoopy cache protocols since the commands are uniformly broadcast on the shared bus. The example protocol is similar to that applied in the DEC Firefly multiprocessor workstation (Thakkar et al., 1988). The protocol applies both the write-back and the write-through update policies. The former is used for private blocks, the latter for shared blocks.

Definition of possible states of blocks in caches:

Valid-exclusive: The block is the only copy of the memory block. The cache and memory blocks are consistent.
Shared: There are several cached copies of the memory block and all of them are consistent.
Dirty: It is the only copy of the memory block and the memory block is inconsistent.

Definition of commands to be performed at various read/write hit/miss actions:

- Read miss: The snoopy cache controller broadcasts a Read-Blk command on the bus.
(a) If there are shared copies of the data, the containing caches deliver the required copy as shown in Figure 18.23(a), where Pj has a read miss. The resulting state in the requester processor will be shared, too.
(b) If a dirty copy exists, it supplies the necessary copy and updates the main memory, too. All the copies become shared. The mechanism is illustrated in Figure 18.23(b).
(c) If a valid-exclusive copy exists, it supplies the necessary copy and all the copies become shared.
(d) If there is no cache copy of the block, the memory sends the block, which becomes valid-exclusive.
- Write hit:
(a) If the state of the block is valid-exclusive or dirty, the write operation can be performed locally (no command is placed on the bus) and the new state will be dirty (write-back policy).
(b) If the block is shared, the snoopy cache controller broadcasts an Update-Blk command on the bus. The Update-Blk command contains the address and value of the updated data. All copies, including the memory one, will be updated (write-update, write-through policy) and remain shared.
- Write miss: The snoopy cache controller broadcasts a Write-Blk command on the bus. The Write-Blk command contains the address and value of the updated data.
(a) If only the memory contains the block, it will be updated with the new value of the updated data and the updated block is sent to the requester processor, where it becomes valid-exclusive (see Figure 18.24(a)).
(b) If shared copies are available, all copies, including the memory one, will be updated and the updated block is sent to the requester processor as shown in Figure 18.24(b). The resulting state of all copies will be shared.
(c) If a dirty or valid-exclusive copy exists, it will be updated as well as the memory block, and the updated block is sent to the requester processor. The resulting state of all copies will be shared.
- Replacement: If the cache is full when a new block should be written into it, one of the blocks should be removed from the cache in order to replace it with the new block. If the block selected for replacement is a dirty one, the memory should be updated; otherwise no further action is necessary.

Figure 18.23 Read miss in the example protocol.
Figure 18.24 Write miss in the example protocol.

The state transition diagram of the example protocol is shown in Figure 18.25, where P-Read and P-Write are commands (actions) initiated by the associated processor, while Read-Blk, Write-Blk and Update-Blk are consistency commands arriving from the bus and initiated by other caches. The state transition diagram defines how the cache controller should work when a request is given by the associated processor or by other caches through the bus. For example, when a Read-Blk command arrives at a cache block in a Dirty state, the cache controller should modify the state of the block to Shared.
Figure 18.25 State transition graph of the example protocol.

The typical structure of a snoopy cache controller is depicted in Figure 18.26. It is constructed as a finite state machine which realizes the state transition diagram of the snoopy cache protocol, and it comprises three main parts. The snoopy controller implements the snoopy cache protocol by continuously monitoring bus operations through the bus interface. The cache directory contains the state of each block (two bits per block in the case of the example protocol). The directory is often duplicated in order to avoid contention among commands arriving from the associated processor and from the bus. The cache controller is similar to those employed in uniprocessor systems, but it is extended to handle the cache directory, too. It realizes those parts of the state transition diagram of the snoopy cache protocol that are related to processor activity. For example, in the case of a load instruction the cache controller checks whether the requested block is in the cache. If it is, it services the request locally. If not, it passes the request to the snoopy controller, which generates a Read-Blk command on the system bus. When the requested block arrives, the snoopy controller updates the cache and the directory and notifies the cache controller to service the load instruction. When a Read-Blk command arrives on the bus, the snoopy controller checks whether the block is available in the cache. If not, there is no further action. If it is available, the snoopy controller reads the state of the block. In the case of a shared block, it simply places the cache block on the bus. If the requested block is in state Valid-exclusive or Dirty, it changes the state to Shared after placing the block on the bus.

Figure 18.26 Structure of the snoopy cache controller.
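The transitions above are compact enough to be captured as a single next-state function. The sketch below is a simplified model of the example protocol's state transition diagram (Figure 18.25); the enum and function names are illustrative assumptions, and data movement and bus signalling are omitted.

```c
#include <assert.h>

/* Block states of the example protocol, plus Invalid for missing blocks. */
typedef enum { INVALID, VALID_EXCLUSIVE, SHARED, DIRTY } BlockState;

/* P_* commands come from the local processor; the rest are snooped from
 * the bus (issued by other caches). */
typedef enum { P_READ, P_WRITE, READ_BLK, WRITE_BLK, UPDATE_BLK } Command;

BlockState next_state(BlockState s, Command c, int other_copies_exist) {
    switch (c) {
    case P_READ:
        if (s == INVALID)          /* read miss: shared if copies exist */
            return other_copies_exist ? SHARED : VALID_EXCLUSIVE;
        return s;                  /* read hit: state unchanged */
    case P_WRITE:
        if (s == VALID_EXCLUSIVE || s == DIRTY)
            return DIRTY;          /* local write, write-back policy */
        if (s == SHARED)
            return SHARED;         /* Update-Blk broadcast, stays shared */
        return other_copies_exist ? SHARED : VALID_EXCLUSIVE; /* write miss */
    case READ_BLK:                 /* another cache misses on this block */
        return (s == INVALID) ? INVALID : SHARED;
    case WRITE_BLK:
    case UPDATE_BLK:               /* this copy is updated in place */
        return (s == INVALID) ? INVALID : SHARED;
    }
    return s;
}

int main(void) {
    /* A Dirty block snooping a Read-Blk moves to Shared, as in Figure 18.25. */
    assert(next_state(DIRTY, READ_BLK, 1) == SHARED);
    return 0;
}
```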

Directory schemes

Directory schemes selectively send consistency commands only to those caches where a valid copy of the shared data block is stored. In order to do that, a directory entry must be associated with each data block. The directory entry consists of a set of pointers to the caches holding a valid copy of the block. Additionally, a dirty bit specifies whether any of the holding caches has the right to update the associated block of data.

Three main approaches can be distinguished in the realization of directory schemes:

Full-map directory
Limited directory
Chained directory.

These schemes differ in how the structure of the directories is organized. There are two metrics which should be considered in comparing the three approaches:

Memory requirement for maintaining the directory
Number of consistency commands in the protocols.

Obviously, a directory scheme is favourable if it requires less memory space and fewer consistency commands. In the following sections the three main directory schemes are described in detail.

Full-map directory scheme  In the full-map directory scheme each directory entry consists of as many pointers as there are caches in the system. If cache Ci contains a copy of the data block, the pointer Pi points to Ci; otherwise, Pi points to itself. If there is a one-to-one correspondence among the pointers and the caches, a simple flag Fi (implemented as a single bit) is enough to replace a pointer. In this implementation structure, called the presence flag vector, a flag Fi is set true if Ci contains a valid copy; otherwise, Fi is false. Additionally to the presence flags, there is an extra flag called the dirty flag, which is set when one and only one presence bit is set, indicating that the associated cache has permission to update the data block. Besides the presence flag vector stored in the shared memory, two state flags (valid and write-permission) are maintained in each cache for each data block. The first is set if the block is valid, while the second is set if the cache has permission to update the block. The cache coherence protocol is complemented by means of the presence flag vector and these state flags. Stenström (1989) has proposed a modified version of the full-map directory scheme, where the presence flag vector is associated with the cached copy instead of the memory copy. This scheme has the advantages of both reducing the number of protocol messages via the interconnection network and reducing the necessary bit overhead of maintaining presence flag vectors.
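As a sketch of how a full-map entry might be represented, the following code keeps one presence bit per cache and a dirty flag per memory block, and shows a write-invalidate style use of the vector to address consistency commands only to the holders. The structure, sizes and helper names are illustrative assumptions, not a specific machine's design.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CACHES 32            /* assumed system size */

/* Full-map directory entry: one presence bit per cache, plus a dirty flag. */
typedef struct {
    uint32_t presence;           /* bit i set => cache Ci holds a valid copy */
    int      dirty;              /* set when exactly one holder may write    */
} DirEntry;

/* Hypothetical hook for delivering a consistency command to one cache. */
void send_invalidate(int cache_id) {
    printf("invalidate sent to C%d\n", cache_id);
}

/* Writer acquires exclusive access: invalidate all other copies,
 * then record the writer as the single (dirty) holder. */
void acquire_write(DirEntry *e, int writer) {
    for (int i = 0; i < NUM_CACHES; i++)
        if (i != writer && (e->presence & (1u << i)))
            send_invalidate(i);
    e->presence = 1u << writer;
    e->dirty = 1;
}
```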
Limited directory scheme  The objective of introducing the limited directory scheme was to radically reduce the bit overhead compared with the full-map directory scheme. The observed behaviour of cache-based shared memory systems showed that in the vast majority of cases only a limited number of caches contained valid copies of the same data block. Therefore, maintaining a presence flag vector which contains a flag for each cache is superfluous in most cases, and particularly so in highly scalable shared memory systems where hundreds of processors (and therefore caches) are applied. Recognizing this, Agarwal et al. (1988) proposed the use of limited directories.

From a logical point of view, the structure of the directory entries in the shared memory is very similar to that of the full-map scheme. There are two main differences. The number of pointers (k) in a directory entry is much smaller than the number of processors (n) in the shared memory system, and these pointers are dynamically allocated for those caches that contain a copy of the associated data block. As a consequence, simple bits are not enough to represent pointers. In an n-processor system, log2(n) bits are needed for defining a presence pointer; therefore, storing k pointers for each block of memory, the bit overhead of the limited directory scheme is k*log2(n) per block, while in the full-map scheme it is n per block. The bit overhead of the limited directory scheme is less than that of the full-map scheme if

k*log2(n) < n, that is, k < n/log2(n).

As long as the number of valid copies of a data block does not exceed k, the limited directory-based cache coherence protocol is very similar to the full-map scheme. The difference appears when more than k caches require read access to a data block X. When a cache sends a read request to the shared memory and all available pointer places in the limited directory entry are filled with valid pointers, one of the copies of X in the caches must be invalidated and then the corresponding pointer can be replaced with a pointer to the requesting cache. This pointer replacement is called eviction and obviously needs a selection policy, which can be as simple as a pseudo-random eviction policy.
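A quick, illustrative calculation of this inequality for an assumed configuration of n = 256 caches and k = 8 pointers per entry:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    int n = 256;                       /* assumed number of caches   */
    int k = 8;                         /* assumed pointers per entry */
    double ptr_bits = ceil(log2(n));   /* bits per presence pointer  */

    double limited = k * ptr_bits;     /* bit overhead per block     */
    double fullmap = n;                /* one presence bit per cache */

    /* With n = 256 and k = 8: 8 * 8 = 64 bits versus 256 bits per block. */
    printf("limited: %.0f bits/block, full-map: %.0f bits/block\n",
           limited, fullmap);
    printf("limited wins while k < n/log2(n) = %.1f\n", n / ptr_bits);
    return 0;
}
```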
Chained directory scheme  The chained directory scheme is a good compromise between the previous two schemes. It preserves the limited size of the central directory frames and even further restricts the number of presence pointers to one, and yet eviction becomes unnecessary. Dynamic extension of the number of presence pointers is achieved by distributing the extra presence pointers among the caches. Each data block in each cache is extended with a pointer that is used to chain the valid copies of a data block. The positive characteristic of this scheme is that it requires exactly as many presence pointers as there are valid copies of the data block; there are no superfluous presence pointers in the directory entries of the shared memory.

The creation of chained cached copies of shared data blocks is explained by means of Figure 18.27. When the first processor (P0) wants to read data block X, the memory sends a copy of X to C0, along with a special chain termination (CT) pointer, and sets a pointer to C0 in X's directory entry in the shared memory. After receiving a second read request for X from cache C1, the memory sends a copy of X to C1, along with the previously stored pointer to C0, which is now replaced by a pointer to C1 in X's directory entry. Cache C1 simply stores the copy of X and the pointer to C0. Figure 18.27(b) illustrates the state of the shared memory and caches after these activities. By repeatedly performing similar activities, arbitrarily long chains can be dynamically constructed in a demand-driven way.

Suppose now that processor Pn wants to write data block X. In this case cache Cn sends a write request to the memory, which has to send an invalidate request along the presence chain, originating from the pointer stored in X's directory entry in the shared memory. The first invalidate request arrives at C1, which invalidates its copy of X and sends the invalidate request on to cache C0. Since C0 finds a CT pointer, it does not send the invalidate request any further; after invalidating its copy of X it sends an acknowledgement back to the memory. On receiving the acknowledgement from C0, the main memory controller sets X's dirty flag and sets X's presence pointer to Cn. Finally, it sends a copy of X and a write permission to cache Cn, which updates block X, sets the write-permission flag and reactivates processor Pn.
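The invalidation walk along the chain can be sketched as follows. The CT (chain termination) pointer follows the text above, while the array representation and field names are illustrative assumptions.

```c
#include <stdio.h>

#define CT (-1)                   /* chain-termination pointer */

/* Per-cache chain state for one block X. */
typedef struct {
    int next;                     /* next cache holding X, or CT */
    int valid;                    /* does this cache still hold X? */
} ChainEntry;

/* Walk the chain from the directory's head pointer, invalidating every
 * copy, as done when a processor wants to write X. */
void invalidate_chain(ChainEntry caches[], int head) {
    int c = head;
    while (c != CT) {
        caches[c].valid = 0;      /* invalidate this copy  */
        int next = caches[c].next;
        caches[c].next = CT;      /* unlink from the chain */
        c = next;                 /* forward the request   */
    }
    /* The last node found CT and would now acknowledge to memory. */
}

int main(void) {
    ChainEntry caches[3] = {
        { CT, 1 },                /* C0: end of chain  */
        { 0,  1 },                /* C1: points to C0  */
        { CT, 0 }                 /* C2: writer, no copy yet */
    };
    invalidate_chain(caches, 1);  /* directory head points to C1 */
    /* Memory would now set its presence pointer to the writer, C2. */
    printf("C0 valid=%d, C1 valid=%d\n", caches[0].valid, caches[1].valid);
    return 0;
}
```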
Scalable Coherent Interface  A concrete example of the chained directory scheme is the Scalable Coherent Interface (SCI) defined by the IEEE standard 1596-1992. The SCI standard does not define a specific interconnect, rather an interface to the interconnection network. However, the point-to-point nature of the interface and the communication protocols fit well to a simple, uni-directional ring as applied in the Convex Exemplar machine (see Section 18.7.4). The major purpose of SCI is to provide a memory-address-based, cache-coherent communication protocol for creating scalable parallel machines with large numbers of processors.

Figure 18.27 Chained directory scheme.

SCI provides a scalable coherency scheme: a linked list form of distributed directories. The cache coherence mechanism ensures simultaneous linked list modifications by processors in a sharing-list for maximum concurrency. There are no choke points in the protocol, enabling it to scale with the resources in a linear manner. Any number of nodes can share the same memory line, and the size of the directory is proportional to the number of cache blocks. Deletion of nodes from the sharing-list is fully decentralized, not requiring any memory (directory) operation. Although the insertion is centralized, relying on central directory operations, combining list insertion requests is possible.

All SCI communications are organized in packets. The base protocol is a write-back and invalidate type that supports forward progress, delivery, fairness, error detection, and recovery.
The basic SCI organization is based on the so-called sharing-list, in which each coherently cached block is chained up. The head element of the list points to the memory block where the associated memory line is stored, as shown in Figure 18.28. Each block of the memory is extended with the identifier of the head node of the sharing-list (forw_id field) and a 2-bit memory state (mstate field). The cache tags of blocks are extended with two identifiers: one for the previous node (back_id field) and one for the following node (forw_id field) of the sharing-list. Besides those fields, the cache unit contains the cache state (cstate field), too. The first element of the sharing-list uses the mem_id field to identify the memory block.
Four main cache operations are defined on the sharing-list:

Creation
Insertion
Deletion
Reduction to a single node.
Figure 18.28 Structure of sharing-lists in the SCI.
Creation  This is a simplified version of the insertion operation, applied when the associated sharing-list is empty.

Insertion  When a cache miss occurs (at Ci in Figure 18.29), the node sends a prepend request to the memory, which responds with the pointer of the old head node and refreshes its head pointer with the address of the requester. After receiving the response from memory, the new head (if the sharing-list was not empty) notifies the old head node by sending a new-head request. The old head updates its backward pointer and returns its data field. The two transactions are illustrated in Figure 18.29 and the result of the transactions is the sharing-list shown in Figure 18.28.

Figure 18.29 Insertion in a sharing-list.
Deletion  When a node wants to remove its cache line from the sharing-list, it sends two messages. First, it sends an update backward request containing its backward pointer (Pb) to its successor. The successor will update its backward pointer by the value received in the request. The second message, an update forward request, is sent to the predecessor. This message contains the forward pointer of the requester and is used by the predecessor to update its forward pointer. The transactions and their result are illustrated in Figure 18.30.

Figure 18.30 Deletion from a sharing-list. (a) Messages for deletion; (b) structure of the sharing-list after deletion.
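On a doubly linked representation the deletion step reduces to two pointer updates, mirroring the update backward and update forward requests. The forw_id/back_id names follow the SCI fields above; the array representation and NIL encoding are illustrative assumptions.

```c
#define NIL (-1)                  /* no neighbour */

/* Cache tag fields of one node's copy of a line (subset of SCI fields). */
typedef struct {
    int forw_id;                  /* following node in the sharing-list */
    int back_id;                  /* previous node in the sharing-list  */
} SharingTag;

/* Delete node d from the sharing-list: the successor learns d's backward
 * pointer (update backward) and the predecessor learns d's forward
 * pointer (update forward). */
void sharing_list_delete(SharingTag nodes[], int d) {
    int succ = nodes[d].forw_id;
    int pred = nodes[d].back_id;
    if (succ != NIL)
        nodes[succ].back_id = pred;   /* update backward request */
    if (pred != NIL)
        nodes[pred].forw_id = succ;   /* update forward request  */
    nodes[d].forw_id = nodes[d].back_id = NIL;
}
```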
Reduction to a single node  When a cache line is written, all the other elements of the sharing-list must be invalidated. Only the head node has the right to update a cache line and to remove the other elements from the sharing-list. This is the reduction process and it is performed sequentially. The head sends a purge request to the second node of the list. This node returns its forward pointer, that is, the address of the next node in the list. Then, the head node sends a purge request to the third node, and so on until the last node receives the purge request. If a node different from the head wants to execute a write on a shared cache line, it has first to remove its cache line from the sharing-list and then to insert it. As a result the node becomes the head of the sharing-list, with the right to update its cache line.

Several insertion and deletion actions can be executed simultaneously on the same sharing-list. In the case of simultaneous prepend requests, the memory immediately responds in the order the requests arrive. However, the new head that won the competition will delay the other new heads until it has prepended itself to the sharing-list. Special precedence rules are applied in the case of simultaneous deletions among adjacent nodes. The node closer to the tail always has priority and is deleted first. A useful feature of the simultaneous insertions is that they can be combined and, as a result, hot spots can be avoided. An optional extension of the SCI base protocol can optimize the sequential purge operation by replacing it with a parallel one (Gjesing et al., 1992).

Hierarchical cache coherent protocols

Hierarchical cache coherent protocols are applied in multiple bus multiprocessors, which were introduced in Section 18.2.1. One of the natural generalizations of the single bus system is the hierarchical bus system, where single bus clusters or 'nodes' are connected to a higher-level bus via a higher-level cache or 'supercache'. By recursively applying these construction techniques, arbitrarily large networks can be built. The only operational extension is that second- and higher-level caches should maintain system-wide multicache coherency. In order to fulfil this requirement, any memory locations for which there are copies in the lower-level caches must also have copies in the higher-level cache. As a consequence, 'supercaches' should be an order of magnitude larger than the next lower-level caches. However, since they can be implemented with slower, denser dynamic RAMs, like those applied for the main memory, this size requirement is not a problem.
The first hierarchical bus system with cache coherency was proposed by Wilson (1987), who generalized Goodman's 'write-once' cache coherency protocol for the multilevel cache system. To understand how multicache coherency control is achieved in hierarchical bus multiprocessors, consider the operations of the two-level structure shown in Figure 18.31. Assume that processor P2 issues a write command to a data block X which has four copies, in caches C11, C12, C16 and C17. As shown, copies must also be available in the higher-level caches C20 and C22. Goodman's 'write-once' protocol is based on the write-invalidate policy and therefore, after writing X into C12 and C20, all the other copies of X should be invalidated. This is achieved by broadcasting the write command first on low-level bus B10 and then on the higher-level bus B20. The copy of X in C11 is invalidated by the write command on bus B10. When the write command appears on bus B20, second-level caches check whether they have a copy of X. When C22 detects the command and finds a copy of X, it must invalidate its own copy and send an invalidate command down to bus B12, where the copies of X in C16 and C17 must be invalidated. The final result is that only the first- and second-level caches associated with the updating processor (C12 and C20) have copies of X. A subsequent read request issued by another processor (for example, P5) should be broadcast at the lower level first, and since there is no copy at that level it should be propagated up the hierarchy. When C20 detects the read request for X and finds that it has a dirty copy of X, it supplies the dirty copy of X to cache C15 and issues a flush request down to bus B10, where cache C12 will relinquish exclusive access. Notice that second-level caches that have no copies of the written data block work as filters during the invalidate or read operations. In the current example cache C21 did not send any command to bus B11. As a result, clusters where there is no copy of the updated data block are not affected at all by the hierarchical cache coherence protocol.

Figure 18.31 Hierarchical cache coherence.

18.3.3 Software-based protocols

Although hardware-based protocols offer the fastest mechanism for maintaining cache consistency, they introduce a significant extra hardware complexity, particularly in scalable multiprocessors. Software-based approaches represent a good and competitive compromise since they require nearly negligible hardware support and they can lead to the same small number of invalidation misses as the hardware-based protocols. All the software-based protocols rely on compiler assistance. The compiler analyses the program and classifies the variables into four classes:

(1) Read-only
(2) Read-only for any number of processes and read-write for one process
(3) Read-write for one process
(4) Read-write for any number of processes.

Read-only variables can be cached without restrictions. Type 2 variables can be cached only for the processor where the read-write process runs. Since only one process uses type 3 variables, it is sufficient to cache them only for that process. Type 4 variables must not be cached in software-based schemes.
Variables demonstrate different behaviour in different program sections, and hence the program is usually divided into sections by the compiler and the variables are categorized independently in each section. For example, a parallel for-loop is a typical program section. More than that, the compiler generates instructions that control the cache or access the cache explicitly, based on the classification of variables and code segmentation. Typically, at the end of each program section the caches must be invalidated to ensure that the variables are in a consistent state before starting a new section. According to how invalidation is done, two main schemes can be distinguished:
Indiscriminate invalidation
Selective invalidation.

The simplest approach is indiscriminate invalidation, in which the complete cache is invalidated at the end of each program section. This scheme requires only a single hardware mechanism for turning the cache on or off and invalidating it. However, it is very conservative, and invalidates variables that could be used in the same cache in the next program section.
Selective invalidation schemes can be further categorized according to the generation of program sections:

Critical section scheme
Parallel for-loop scheme.

The critical section scheme relies on the assumption that shared read-write variables are always accessed in critical sections guarded by software synchronization tools. These variables are selectively invalidated when the critical section is completed. In order to do that, shared variables belonging to the same critical section are placed on the same page and each page is associated with a one-time identifier. Loading a cache line will also place the associated one-time identifier in the cache. Invalidation is based on the one-time identifiers.
Three parallel for-loop selective invalidation schemes were proposed to eliminate the drawback of indiscriminate invalidation:

Fast selective invalidation
Timestamp scheme
Version control scheme.
The fast selective invalidation scheme (Cheong and Veidenbaum, 1988) relies on a write-through cache policy and introduces three instructions to control cache accesses: Memory-Read, Cache-Read and Cache-Invalidate. Cache-Read is applied to access a variable when it is guaranteed that the variable is stable in the cache; otherwise the Memory-Read command is used. An extra bit, called the change bit, is added to each cache line. The Cache-Invalidate command sets all change bits true. An attempt to read a variable with its change bit set true will lead to a read from memory. As a second effect the change bit is modified to false, and henceforth accesses can be satisfied by the cache.
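A minimal sketch of the change-bit mechanism described above, with illustrative names (the write-through path and cache lookup are omitted):

```c
typedef struct {
    int data;
    int change;                  /* true: cached copy may be stale */
} CacheLine;

/* Cache-Invalidate: mark every line as potentially stale. */
void cache_invalidate(CacheLine cache[], int nlines) {
    for (int i = 0; i < nlines; i++)
        cache[i].change = 1;
}

/* Memory-Read semantics: a set change bit forces a memory read, which
 * refreshes the line and clears the bit so later reads hit the cache. */
int memory_read(CacheLine *line, const int *mem_word) {
    if (line->change) {
        line->data = *mem_word;  /* reload from memory */
        line->change = 0;
    }
    return line->data;
}
```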
The timestamp scheme (Min and Baer, 1989) allocates a clock to each data structure and a timestamp entry to each cache line. The clock associated with a data structure is updated at the end of each program section in which the data structure was updated. The timestamp in the cache directory is set to the value of the corresponding clock+1 when the block is updated in the cache. By comparing the timestamp and the associated clock value, it can be decided whether the variable in the cache is valid (timestamp > clock) or whether it should be loaded from memory. Notice that in the timestamp scheme, invalidations for variables that are local to a processor in two subsequent program sections can be omitted, since the timestamp value of these variables exceeds their clock value.
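The validity test reduces to a single comparison per access. A minimal sketch, assuming one clock per data structure and one timestamp per cache line as described above (names are illustrative):

```c
/* Per-cache-line timestamp; set to clock+1 when the line is updated. */
typedef struct {
    int timestamp;
    int value;
} TimestampedLine;

/* Per-data-structure clock, incremented at the end of each program
 * section in which the structure was modified. */
typedef struct {
    int clock;
} StructureClock;

/* The cached copy is valid while timestamp > clock: no program section
 * has modified the structure since this copy was written. */
int timestamp_valid(const TimestampedLine *line, const StructureClock *c) {
    return line->timestamp > c->clock;
}
```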
A similar method, called the version control scheme, was proposed by Cheong and Veidenbaum (1990) in order to avoid unnecessary invalidations. The main idea of this method is that only one process in a program section is permitted to write a variable, and this writing generates a new version of the variable. When the writing process exits its program section, the new version becomes the current version, which can be used by other processors, too. In order to recognise whether a variable in the cache of a processor is valid, two new data structures are introduced. First, each processor maintains a counter, called the current version number (CVN), for each variable it can use. Second, each cache line is extended with a tag called the birth version number (BVN). The CVN represents the version of the variable that the processor must use, while the BVN represents the particular version to which the cache copy belongs. A cache miss occurs when the CVN is greater than the BVN; otherwise, the variable can be accessed from the cache.
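The CVN/BVN comparison can be sketched in the same style; the names follow the text above, while the function shape is an illustrative assumption:

```c
/* Cache line tagged with the birth version number (BVN) of its copy. */
typedef struct {
    int bvn;
    int value;
} VersionedLine;

/* cvn is the processor's current version number (CVN) for the variable.
 * A cache miss occurs when CVN > BVN; otherwise the cached copy is the
 * version this processor must use. */
int version_read(const VersionedLine *line, int cvn,
                 int memory_value, int *out) {
    if (cvn > line->bvn) {
        *out = memory_value;     /* miss: fetch the current version */
        return 0;
    }
    *out = line->value;          /* hit: cached version is valid */
    return 1;
}
```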
The categorization of software-based cache coherence protocols is shown in Figure 18.32. Software support to reduce the complexity of the hardware necessary to maintain cache coherency is an important active research area, aiming to reduce the hardware cost of scalable parallel computers.

Figure 18.32 Classification of software-based cache coherence protocols.
