Acau 4
Acau 4
Proc() Combining
queue
Wait Mem(k)
buffer
Noncomb.
queue
Proc) Combining
queue
Wait Mem()
buffer
Noncomb.
queue
hardware-based protocols
software-based schemes.
Software-based schemes usually introduce some restrictions on the cach
ability of data in order to prevent cache coherence problems. First the hardware
based protocols are described in detail, then ashort introduction to software-tased
schemes is given.
Pi
Processor Pj
D Store D'
D Cache
D'
(a)
Memory
Pi Processor P
D Store D'
D' Cache D
(b)
Write-through
Figure 18.20 Memory update policies in writing shared data structures. (a)
memory update policy; (b) write-back memory update policy.
Cache coherence 695
the write-back scheme since error detection and recovery features are
reliablethan
atthe main memory. The write-back policy avoids useless intercon-
availableonly ref-
traffic; however, it requires more complex cache controllers since read
nection
memory locations that have not yet been
updated should be redirected to
erencesto
the appropriatecache.
write-through policy is a greedy policy because it updates the memory
The memory
immediately, while the write-back policyis alazy one with postponed
been introduced
copy
Similarly, agreedy and alazy cache coherence policy have
update.
updatingthe cache copies of a data structure:
for
write-update policy (a greedy policy)
write-invalidate policy (a lazy policy)
write-update policy is that whenever a processor updates
The key idea of the updates all the other cached copies as
well.
immediately
acached data
structure, it memory update policy.
Whether the shared memory copy is updated depends on the
illustrated in Figure 18.21(a).
write-update policy is network.
of the
The mechanism migration the interconnection
Immediate data can cause saturation in
Processor Pk
Pi Pi
Store D'
Cache D'
D'
D'
update (D')
(a)
Processor Pk
Pi
Pi
Store D'
D Cache
invalidate (addr(D)
(b)
Invalid data
Memory
ypdate Cache coherence
policy policy
Intercscohnemecetion
Write Write
Write
invalidate update
Mhite
mrough
back
Single bus
Snoopy cache Mutcireictstoargye
protocols Multigle bus
schemes hieracoherence
cache chical
protocols
diFulrect-moriapes Limited
Chained
directoies directories
Centralized Distributed
Figure 18.22 Design space of
hardware-based cache coherence protocols.
Io the next sections a
detailed description of the cache coherence
is given. The descripion of protocolsis
dto the
interconnection networks
bsed on the definitions listed above. protocols
Memory
Pi
Processor Pi
Lohd D
D
Cache
2 Block (D)
Pi Processor Pi
Load D
D Cache
Block (D)
(b) If shared copies are available, all copies, including the memory one,
will be updated and the updated block is sent to the requester proces
SOr as shown in Figure 18.24(b). The resulting state of all copies will
be shared.
(C) If a dirty or valid-exclusive copy exists, it will be updated as well as
the memory block, and the updated block is sent to the requester
processor. The resulting state of all copies will be shared.
-Replacement: If the cache is full when a new block should be written into
1t, one of the blocks should be removed from the cache in order to replace
1t with the new block. If the block selected for replacement is a dirty
the memory should be updated, otherwise no further action is
one,
necessary.
700 Shared Memory MIMD Architectures
Memory
Pi ProcesSor P
Store D'
Cache
(1) Write-Blk(addr(D), D)
a
Memory
Pi Processor Pi
D Store D'
D Cache
Write-Blk(addr(D), D)
P-Write Read-
Read-BlkWrite-Blk
BlkN rite-Blk/Update-2lk
Dirty
P-Read/PWrite
Figure18.25 State transition graph of the
example protocol.
atedprocessorand from the bus. The cache
in uniprocessor systems but it is extended controller
is
to handle thesimilar
cacheto directory,
those employed
too. It
realizesthose parts of the state transition diagram of the snoopy cache protocol that
arerelated to processor activity. For example, in the case of a load
instruction the
cachecontroller checks if the requested block is in the cache. If it is, it services the
requestlocall If not, it passes the request to the snoopy controller which generates
aRead-BIk command on the system bus. When the requested block arives, the
PEi
Processor
controller
Memory Cache
A D Proc.
Snoopy A D
controller Cache
Interface
cache controller.
Figure 18.26 Structure of the snoopy
702 Shared Memorv MIMD
Architectures
snoopy controller updates the cache and the
controller to service the load instruction. When a directory and notifies the cache
bus, the snoopv controller Read-BIk command arrives on the
checks if the block is available
is no further action. If it is available, the in the cache. If not, there
block. case of a shared block. it simply
In snoopy controller reads the state of the
places the cache
requested block is in state Valid-exclusive or Dirty. it block on the bus, If the
after placing the block on the bus. changes the state to Shared
Directory schemes
Director schemes selectively send
where the valid copy of the shared consistency commands only to those caches
data block is stored. In order to do that. a
director entry must be associated with each data block. The
of a set of pointers to the caches holding a valid directory entry consists
copy of the block. Additionally, a
dity bit specifies if anv of the holding caches has the right to update
biock of data. the associated
Three main approaches can be distinguished in the realization of directory
schemes:
Full-map directory
Limited directory
Chained directory.
These schenes differ in how the structure of the directories is organized. There
are two metrics which should be considered in comparing the three approaches:
Memory requirement for maintaining the directory
Number of consistency commands in the protocols.
Full-map directory scheme In the full-map directory scheme each directory entry
consists of as many pointers as there are caches in the system. If cache Cicontains
to itself. If
a copy of the data block, the pointer Pi points to Ci, otherwise, Pi points
simple
there is a one-to-one correspondence among the pointers and the caches, a
flag Fi (implemented as a single bit) is enough to replace a pointer. In this imple
mentation structure, called the presence lag vector, aflag Fi is set true if Ci con
presence flags, there is
tains a valid copy, otherwise, Fi is false. Additionally to the
one and only one presence bit Is
an extra flag called thedirty fiag which is set when update the data block.
permission to
set, indicating that the asSociated cache has memory, two state flags
Besides the presence flag vector stored in the shared each data block. The
in each cache for
(valid and write-permission)are máintained
second is set if the cache has permission to
first is set if the block is valid, while the
Cache conerence 703
ProceSSor PO P1 Pn PO P1 Pn
Cache X, CT Cn X,CT Cn
Read X
Shared
X
mermory
Directory entry
(a)
Figure 18.27 Chained directory scheme.
Cache caherence 705
Creation
Insertion
Deletion
Reduction to a single node.
Nodek
Nodei. Nodej
Memory
Pk
P Pj
Ci
forw_id back id
mstate forw_id
cstate mem_ id
data (64 bits)
data (64 bits)
Nodek
Nodej Memory
Pk Pi
Ck
new-head responses
prepend
Pi
Nodei
Pi Pi Pk
Pi Pk
(b)
from a sharing-list. (a) Messages for deletion; (b) structure of the
Figure 18.30 Deletion sharing-list after deletion.
coherencyproposed
for the multilevel cache system. To 'write-once' cache by
achieved in hierarchical bus understand how multicache coherencycontrol i protocol
multiprocessors,
structure shown in Figure 18.31. Assume that consider the operations of a two-]evel
processor P2 1Ssues a
to a data block X which has
four copies in caches C11, C12, C16write commomd
shown, copies must be available in the and C17 A.
higher-level caches C20
Goodman's 'write-once' protocol is based on the write-invalidateand C22 to
therefore, after writing X into C12 and C20, all the
invalidated. This is achieved by other copies of X policy
should
and
bus B10 and then on the broadcasting the write command first on low-levelbe
higher-level bus B20. The copy of X in C11is invalidated
Main
memory
B20
Write
C20
C21
C22
L Invalidate
C11 C12
C10 C16 C17
X C13 C14 C15
PO P1 P2 P3 P4 P5 P6 P7
18.3.3
Software-based protocols
Although hardware-based protocols offer the fastest mechanisn for maintaining
cache consistency, they introduce a significant extra hardware complexity. particu-
larly inscalable multiprocessors. Software-based approaches represent agood and
competitive compromise since they require nearly negligible hardware support and
to the same small number of invalidation misses as the hardware-based
they can lead
protocols. AlI the software-based protocols rely on compiler assistance. The
classes:
compiler analyses the program and classifies the variables into four
(1) Read-only
(2) Read-only for any number of processes and read-write for one process
(3) Read-write for one process
(4) Read-write for any number of processes.
can
Read-only variables can be cached without restrictions. Type 2 variables
runs. Since only one
be cached only for the processor where the read-write process
only for that process. Type
process uses type 3 variables it is sufficient to cache them
4 variables must not be cached in software-based schemes.
prvgram secions and
Variables demonstrate different behaviour in different
into sections by the compiler and the variables
etne program is usually divided For example, aparallel for-lo0p 1S a
eCategorized independentlyin each section. compiler generates instructions that
Program section. More than that, the the classiication of
cor.lrol the cache or access the cache explicitly based on program section the
variables and code segmentation. Typically, at the end of eacha consistent state
before
the variables are in
caches must be invalidatedto ensurethatinvalidation is done,two main schemes can
starting a new section. According to how
be distinguished:
710 Shared Memory MIMD
Architectures
Indiscriminate invalidation
Selective invalidation.
Software-based protocols
Indiscriminate Selecting
invalidation invalidation
Parallel Cnticai
section
for-loop
based based