Lecture 18:
Snooping vs. Directory Based
Coherency
Professor David A. Patterson
Computer Science 252
Fall 1996
DAP.F96 1
Review: Parallel Framework
Layers:
Programming Model
Communication Abstraction
Interconnection SW/OS
Interconnection HW
Programming Model:
Multiprogramming : lots of jobs, no communication
Shared address space: communicate via memory
Message passing: send and receive messages
Data Parallel: several agents operate on several data
sets simultaneously and then exchange information
globally and simultaneously (shared or message
passing)
Communication Abstraction:
Shared address space: e.g., load, store, atomic swap
Message passing: e.g., send, receive library calls
Debate over this topic (ease of programming, scaling)
=> many hardware designs matched 1:1 to a programming model
DAP.F96 2
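To make the two abstractions concrete, here is a minimal sketch in C (my illustration, not from the slides): the shared-address-space version communicates through an ordinary store and load, while the message-passing version uses explicit MPI send/receive calls (MPI is one of the libraries named later in the lecture).

```c
#include <mpi.h>

/* Shared address space: communication is just a store by one processor
   and a load by another (synchronization handled separately).          */
volatile int shared_value;
void produce_shared(int v) { shared_value = v; }    /* store */
int  consume_shared(void)  { return shared_value; } /* load  */

/* Message passing: communication is an explicit send/receive pair.     */
void produce_msg(int v, int dest_rank)
{
    MPI_Send(&v, 1, MPI_INT, dest_rank, 0, MPI_COMM_WORLD);
}
int consume_msg(int src_rank)
{
    int v;
    MPI_Recv(&v, 1, MPI_INT, src_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return v;
}
```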
Review: Small-Scale MP
Designs
Memory: centralized with uniform access time (UMA) and bus interconnect
Examples: Sun Enterprise 5000, SGI Challenge, Intel SystemPro
DAP.F96 3
Review: Large-Scale MP
Designs
Memory: distributed with nonuniform access time (NUMA) and scalable interconnect (distributed memory)
Examples: T3E (see Ch. 1, Fig. 1-21, page 45 of [CSG96])
(figure labels: 1 cycle, 40 cycles, 100 cycles; low latency; high reliability)
DAP.F96 4
Data Parallel Model
Operations can be performed in parallel on
each element of a large regular data structure,
such as an array
1 Control Processor broadcasts to many PEs
(see Ch. 1, Fig. 1-26, page 51 of [CSG96])
When computers were large, could amortize the control portion over many replicated PEs
Data distributed in each memory
Condition flag per PE so that PEs can skip operations
Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE
Data parallel programming languages lay out data to processors
DAP.F96 5
Data Parallel Model
Vector processors have similar ISAs, but no
data placement restriction
Advancing VLSI led to single-chip FPUs and whole fast processors
SIMD programming model led to Single
Program Multiple Data (SPMD) model
All processors execute identical program
Data parallel programming languages still
useful, do communication all at once:
Bulk Synchronous phases in which all
communicate after a global barrier
DAP.F96 6
Convergence in Parallel Architecture
Complete computers connected to scalable
network via communication assist
(see Ch. 1, Fig. 1-29, page 57 of [CSG96])
Different programming models place different
requirements on communication assist
Shared address space: tight integration with
memory to capture memory events that interact
with others + to accept requests from other nodes
Message passing: send messages quickly and
respond to incoming messages: tag match, allocate
buffer, transfer data, wait for receive posting
Data Parallel: fast global synchronization
HPF: shared-memory, data parallel; PVM, MPI: message passing libraries; both work on many machines, different implementations
DAP.F96 7
Fundamental Issues: Naming
Naming: how to solve large problem fast
what data is shared
how it is addressed
what operations can access data
how processes refer to each other
Choice of naming affects code produced by a compiler: via load, where you just remember an address, or by keeping track of a processor number and a local virtual address
Choice of naming affects replication of data: via load into the cache memory hierarchy, or via SW replication and consistency
DAP.F96 8
Fundamental Issues: Naming
Global physical address space: any processor can generate an address and access it in a single operation
memory can be anywhere: virtual addr. translation
handles it
Global virtual address space: if the address
space of each process can be configured to
contain all shared data of the parallel program
Segmented shared address space: if
locations are named <process number,
address> uniformly for all processes of the
parallel program
DAP.F96 9
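A small sketch of the segmented option in C (my illustration; my_proc and remote_read are hypothetical names): a shared location is named by a <process number, address> pair, and the access path depends on whether the location is local.

```c
#include <stdint.h>

typedef struct {
    int      proc;   /* which process/node holds the location */
    uint32_t addr;   /* address within that process           */
} global_name_t;

extern int my_proc;                               /* this node's number          */
extern int remote_read(int proc, uint32_t addr);  /* hypothetical remote access  */

int shared_load(global_name_t name)
{
    if (name.proc == my_proc)
        return *(int *)(uintptr_t)name.addr;      /* local: ordinary load        */
    return remote_read(name.proc, name.addr);     /* remote: communication path  */
}
```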
Fundamental Issues:
Synchronization
To cooperate, processes must coordinate
Message passing is implicit coordination with
transmission or arrival of data
Shared address => additional operations to
explicitly coordinate: e.g., write a flag, awaken
a thread, interrupt a processor
DAP.F96 10
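A minimal sketch of the "write a flag" coordination mentioned above, using C11 atomics (my choice of mechanism; the slide does not prescribe one): the producer writes the data and then sets the flag, and the consumer spins on the flag before reading.

```c
#include <stdatomic.h>

int        shared_data;      /* data being communicated */
atomic_int ready;            /* coordination flag       */

void producer(void)
{
    shared_data = 42;                                        /* write the data */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* then the flag  */
}

int consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* spin on flag   */
    return shared_data;                                      /* safe to read   */
}
```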
Fundamental Issues:
Latency and Bandwidth
Bandwidth
Need high bandwidth in communication
Cannot scale perfectly, but should stay close
Make limits in network, memory, and processor match
Overhead to communicate is a problem in many machines
Latency
Affects performance, since processor may have to wait
Affects ease of programming, since requires more thought
to overlap communication and computation
Latency Hiding
How can a mechanism help hide latency?
Examples: overlap message send with computation,
prefetch
DAP.F96 11
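One way to hide latency, sketched with MPI's nonblocking send (the overlap idea is the slide's; the specific calls and names are mine): start the send, do independent work, and only then wait for completion.

```c
#include <mpi.h>

void send_and_compute(double *msg, int n, int dest, double *work, int m)
{
    MPI_Request req;

    /* Start the send but do not wait for it...                      */
    MPI_Isend(msg, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

    /* ...overlap the message transfer with independent computation. */
    for (int i = 0; i < m; i++)
        work[i] = 2.0 * work[i] + 1.0;

    /* Block only when the send must be complete.                    */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```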
Small-Scale Shared Memory
Caches serve to:
Increase bandwidth versus bus/memory
Reduce latency of access
Valuable for both private data and shared data
What about cache consistency?
DAP.F96 12
The Problem of Cache Coherency
DAP.F96 13
What Does Coherency Mean?
Informally:
Any read must return the most recent write
Too strict and very difficult to implement
Better:
Any write must eventually be seen by a read
All writes are seen in proper order (serialization)
Two rules to ensure this:
If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart
Writes to a single location are serialized:
seen in one order
Latest write will be seen
Otherwise could see writes in illogical order
(could see older value after a newer value)
DAP.F96 14
CS 252 Administrivia
Next reading is Chapter 8 of CA:AQA 2/e and
Sections 1.1-1.4, Chapter 1 of upcoming book by
Culler, Singh, and Gupta:
www.cs.berkeley.edu/~culler/
Remzi Arpaci will talk Fri. 11/8 on Networks of
Workstations and world record sort
Dr. Dan Lenoski, architect of SGI Origin, talk in
Systems Seminar Thur. 11/14 at 4PM in 306 Soda
Next project review: survey due Mon. 11/11; 20 min.
meetings moved to Fri. 11/15; signup Wed. 11/6
DAP.F96 15
Potential Solutions
Snooping Solution (Snoopy Bus):
Send all requests for data to all processors
Processors snoop to see if they have a copy and respond
accordingly
Requires broadcast, since caching information is at processors
Works well with bus (natural broadcast medium)
Dominates for small scale machines (most of the market)
Directory-Based Schemes
Keep track of what is being shared in one centralized place
Distributed memory => distributed directory (avoids
bottlenecks)
Send point-to-point requests to processors
Scales better than Snoop
Actually existed BEFORE Snoop-based schemes
DAP.F96 16
Basic Snoopy Protocols
Write Invalidate Protocol:
Multiple readers, single writer
Write to shared data: an invalidate is sent to all caches
which snoop and invalidate any copies
Read Miss:
Write-through: memory is always up-to-date
Write-back: snoop in caches to find most recent copy
Write Broadcast Protocol:
Write to shared data: broadcast on bus, processors
snoop, and update copies
Read miss: memory is always up-to-date
Write serialization: bus serializes requests
Bus is single point of arbitration
DAP.F96 17
Basic Snoopy Protocols
Write Invalidate versus Broadcast:
Invalidate requires one transaction per write-run
Invalidate uses spatial locality: one transaction per block
Broadcast has lower latency between write and read
Broadcast: BW (increased) vs. latency (decreased) tradeoff

Name          Protocol type      Memory-write policy                         Machines using
Write Once    Write invalidate   Write back after first write                First snoopy protocol
Synapse N+1   Write invalidate   Write back                                  1st cache-coherent MPs
Berkeley      Write invalidate   Write back                                  Berkeley SPUR
Illinois      Write invalidate   Write back                                  SGI Power and Challenge
Firefly       Write broadcast    Write back private, write through shared    SPARCCenter 2000
MESI          Write invalidate   Write back                                  Pentium, PowerPC
DAP.F96 18
Snoop Cache Variations
Basic Protocol   Berkeley Protocol   Illinois Protocol   MESI Protocol
Exclusive        Owned Exclusive     Private Dirty       Modified (private, ≠ Memory)
Shared           Owned Shared        Private Clean       eXclusive (private, = Memory)
Invalid          Shared              Shared              Shared (shared, = Memory)
                 Invalid             Invalid             Invalid

Owner can update via bus invalidate operation
Owner must write back when replaced in cache
If read sourced from memory, then Private Clean
If read sourced from other cache, then Shared
Can write in cache if held private clean or dirty
DAP.F96 19
An Example Snoopy Protocol
Invalidation protocol, write-back cache
Each block of memory is in one state:
Clean in all caches and up-to-date in memory (Shared)
OR Dirty in exactly one cache (Exclusive)
OR Not in any caches
Each cache block is in one state:
Shared : block can be read
OR Exclusive: cache has the only copy, it's writeable, and dirty
OR Invalid: block contains no data
Read misses: cause all caches to snoop the bus
Writes to a clean line are treated as misses
DAP.F96 20
Snoopy-Cache State Machine-I
State machine for CPU requests, for each cache block:
Invalid -> Shared: CPU read miss; place read miss on bus
Invalid -> Exclusive: CPU write; place write miss on bus
Shared: CPU read hit (no bus action)
Shared -> Shared: CPU read miss; place read miss on bus
Shared -> Exclusive: CPU write; place write miss on bus
Exclusive: CPU read hit or CPU write hit (no bus action)
Exclusive -> Shared: CPU read miss; write back block, place read miss on bus
Exclusive -> Exclusive: CPU write miss; write back cache block, place write miss on bus
DAP.F96 21
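The CPU-request side of this state machine, written out as a sketch in C (the states and bus actions follow the diagram; the helper functions and data layout are hypothetical):

```c
typedef enum { INVALID, SHARED, EXCLUSIVE } block_state_t;

extern void place_read_miss_on_bus(unsigned addr);   /* hypothetical bus helpers        */
extern void place_write_miss_on_bus(unsigned addr);
extern void write_back_victim(void);                 /* write back displaced dirty block */

/* CPU read: hits need no bus traffic; misses fetch the block in Shared state. */
block_state_t cpu_read(block_state_t s, unsigned addr, int hit)
{
    if (hit) return s;                        /* read hit: Shared or Exclusive  */
    if (s == EXCLUSIVE) write_back_victim();  /* miss displaces a dirty block   */
    place_read_miss_on_bus(addr);
    return SHARED;                            /* block arrives read-only        */
}

/* CPU write: only an Exclusive hit avoids the bus; otherwise gain ownership. */
block_state_t cpu_write(block_state_t s, unsigned addr, int hit)
{
    if (s == EXCLUSIVE && hit)  return EXCLUSIVE;    /* write hit on dirty block  */
    if (s == EXCLUSIVE && !hit) write_back_victim(); /* write miss on dirty block */
    place_write_miss_on_bus(addr);                   /* invalidate other copies   */
    return EXCLUSIVE;
}
```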
Snoopy-Cache State Machine-II
State machine for bus requests, for each cache block:
Invalid: no action on bus requests
Shared -> Invalid: write miss for this block
Exclusive -> Invalid: write miss for this block; write back block, abort memory access
Exclusive -> Shared: read miss for this block; write back block, abort memory access
DAP.F96 22
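The bus-request side, as a companion sketch (it reuses block_state_t and the helper style from the previous sketch; supply_block stands for "write back block, abort memory access"):

```c
extern void supply_block(unsigned addr);   /* hypothetical: write back, abort memory access */

/* Another CPU's write miss for this block: give up our copy.                 */
block_state_t snoop_write_miss(block_state_t s, unsigned addr)
{
    if (s == EXCLUSIVE) supply_block(addr); /* we hold the only up-to-date copy */
    return INVALID;                         /* the writer becomes the owner     */
}

/* Another CPU's read miss for this block: a dirty copy must be written back.  */
block_state_t snoop_read_miss(block_state_t s, unsigned addr)
{
    if (s == EXCLUSIVE) { supply_block(addr); return SHARED; }
    return s;                               /* Shared or Invalid: nothing to do */
}
```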
Snoop Cache: State Machine
Extensions:
Fourth state: ownership; clean -> dirty needs an invalidate only (upgrade request), don't read memory (Berkeley Protocol)
Clean exclusive state: no miss for private data on write (MESI Protocol)
Cache supplies data when in shared state: no memory access (Illinois Protocol)
DAP.F96 23
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 24
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1    Excl. A1 10                   WrMs P1 A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 25
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1    Excl. A1 10                   WrMs P1 A1
P1: Read A1           Excl. A1 10
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 26
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1    Excl. A1 10                   WrMs P1 A1
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   WrBk P1 A1 10    A1 10
                                     Shar. A1 10    RdDa P2 A1 10    A1 10
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 27
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1    Excl. A1 10                   WrMs P1 A1
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   WrBk P1 A1 10    A1 10
                                     Shar. A1 10    RdDa P2 A1 10    A1 10
P2: Write 20 to A1    Inv.           Excl. A1 20    WrMs P2 A1       A1 10
P2: Write 40 to A2
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 28
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1    Excl. A1 10                   WrMs P1 A1
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   WrBk P1 A1 10    A1 10
                                     Shar. A1 10    RdDa P2 A1 10    A1 10
P2: Write 20 to A1    Inv.           Excl. A1 20    WrMs P2 A1       A1 10
P2: Write 40 to A2                                  WrMs P2 A2       A1 10
                                     Excl. A2 40    WrBk P2 A1 20    A1 20
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 29
Implementation Complications
Write Races:
Cannot update cache until bus is obtained
Otherwise, another processor may get bus first, and write the
same cache block
Two step process:
Arbitrate for bus
Place miss on bus and complete operation
If miss occurs to block while waiting for bus, handle miss
(invalidate may be needed) and then restart.
Split transaction bus:
Bus transaction is not atomic: can have multiple outstanding
transactions for a block
Multiple misses can interleave, allowing two caches to grab block
in the Exclusive state
Must track and prevent multiple misses for one block
Must support interventions and invalidations
DAP.F96 30
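The two-step sequence for a write race, as a sketch (all helper names are hypothetical): the cache does not change state until it owns the bus, and it rechecks the block after arbitration in case another processor's miss got there first.

```c
extern void arbitrate_for_bus(void);                        /* hypothetical helpers */
extern int  copy_invalidated_while_waiting(unsigned addr);
extern void handle_as_new_miss(unsigned addr);
extern void place_write_miss_on_bus(unsigned addr);
extern void update_cache_state(unsigned addr);

void write_with_arbitration(unsigned addr)
{
    arbitrate_for_bus();                        /* step 1: win the bus first      */
    if (copy_invalidated_while_waiting(addr)) {
        handle_as_new_miss(addr);               /* another CPU wrote this block;  */
        return;                                 /* restart as a full write miss   */
    }
    place_write_miss_on_bus(addr);              /* step 2: place miss on the bus  */
    update_cache_state(addr);                   /* only now update the cache      */
}
```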
Implementing Snooping Caches
Multiple processors must be on the bus, with access to both addresses and data
Add a few new commands to perform coherency,
in addition to read and write
Processors continuously snoop on address bus
If address matches tag, either invalidate or update
DAP.F96 31
Implementing Snooping Caches
Bus serializes writes, getting bus ensures no one else
can perform memory operation
On a miss in a write-back cache, another cache may have the desired copy and it may be dirty, so it must reply
Add extra state bit to cache to determine shared or not
Since every bus transaction checks cache tags, could
interfere with CPU just to check:
solution 1: duplicate set of tags for L1 caches just to allow
checks in parallel with CPU
solution 2: L2 cache that obeys inclusion with L1 cache
DAP.F96 32
Larger MPs
Separate Memory per Processor
Local or Remote access via memory controller
Cache Coherency solution: non-cached pages
Alternative: directory per cache that tracks state of every
block in every cache
Which caches have copies of the block, dirty vs. clean, ...
Info per memory block vs. per cache block?
PLUS: In memory => simpler protocol (centralized/one location)
MINUS: In memory => directory is (memory size) vs. (cache size)
Prevent directory as bottleneck: distribute directory
entries with memory, each keeping track of which Procs
have copies of their blocks
DAP.F96 33
Distributed Directory MPs
DAP.F96 34
Directory Protocol
Similar to Snoopy Protocol: Three states
Shared: ≥ 1 processors have data, memory up-to-date
Uncached: no processor has it; not valid in any cache
Exclusive: 1 processor (owner) has data; memory out-of-date
In addition to cache state, must track which
processors have data when in the shared state
(usually bit vector, 1 if processor has copy)
Keep it simple(r):
Writes to non-exclusive data => write miss
Processor blocks until access completes
Assume messages received and acted upon in order
sent
DAP.F96 35
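A sketch of the per-block directory state this implies (field sizes and names are mine): a state field plus a bit vector with one bit per processor.

```c
#include <stdint.h>

#define NPROCS 64                      /* number of processors (illustrative)    */

typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE_ST } dir_state_t;

typedef struct {                       /* one entry per memory block at its home */
    dir_state_t state;
    uint64_t    sharers;               /* bit p set if processor p has a copy    */
} dir_entry_t;

static inline void add_sharer(dir_entry_t *e, int p)  { e->sharers |=  (1ULL << p); }
static inline void clear_sharers(dir_entry_t *e)      { e->sharers = 0; }
static inline int  is_sharer(const dir_entry_t *e, int p) { return (int)((e->sharers >> p) & 1); }
```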
Directory Protocol
No bus and don't want to broadcast:
interconnect no longer single arbitration point
all messages have explicit responses
Terms:
Local node is the node where a request originates
Home node is the node where the memory location
of an address resides
Remote node is the node that has a copy of a cache
block, whether exclusive or shared
Example messages on next slide:
P = processor number, A = address
DAP.F96 36
Directory Protocol Messages
Message type (source -> destination; message contents):
Read miss (local cache -> home directory; P, A)
  Processor P reads data at address A; send data and make P a read sharer.
Write miss (local cache -> home directory; P, A)
  Processor P writes data at address A; send data and make P the exclusive owner.
Invalidate (home directory -> remote caches; A)
  Invalidate a shared copy at address A.
Fetch (home directory -> remote cache; A)
  Fetch the block at address A and send it to its home directory.
Fetch/Invalidate (home directory -> remote cache; A)
  Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Data value reply (home directory -> local cache; Data)
  Return a data value from the home memory.
Data write-back (remote cache -> home directory; A, Data)
  Write back a data value for address A.
DAP.F96 37
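The same message vocabulary as a C type, for reference (the struct layout is mine; the message names and contents come from the table above).

```c
typedef enum {
    READ_MISS,         /* local cache  -> home directory : P, A    */
    WRITE_MISS,        /* local cache  -> home directory : P, A    */
    INVALIDATE,        /* home dir     -> remote caches  : A       */
    FETCH,             /* home dir     -> remote cache   : A       */
    FETCH_INVALIDATE,  /* home dir     -> remote cache   : A       */
    DATA_VALUE_REPLY,  /* home dir     -> local cache    : data    */
    DATA_WRITE_BACK    /* remote cache -> home directory : A, data */
} msg_type_t;

typedef struct {
    msg_type_t type;
    int        proc;   /* P: requesting processor, when applicable */
    unsigned   addr;   /* A: block address, when applicable        */
    long       data;   /* data value, when applicable              */
} coherence_msg_t;
```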
State Transition Diagram for an
Individual Cache Block in a
Directory Based System
Invalid
Exclusive
Shared
States identical to snoopy case; transactions very similar.
Transitions caused by read misses, write misses, invalidates, data fetch requests.
Generates read miss & write miss messages to home directory.
Write misses that were broadcast on the bus => explicit invalidate & data fetch requests.
DAP.F96 38
State Transition Diagram for the
Directory
Uncached -> Shared: read miss; Data Value Reply, Sharers = {P}
Uncached -> Exclusive: write miss; Data Value Reply, Sharers = {P}
Shared -> Shared: read miss; Data Value Reply, Sharers = Sharers + {P}
Shared -> Exclusive: write miss; send Invalidate to Sharers, Data Value Reply, Sharers = {P}
Exclusive -> Uncached: data write-back (WrBk); Sharers = {}
Exclusive -> Shared: read miss; Fetch from owner, Data Value Reply, Sharers = Sharers + {P}
Exclusive -> Exclusive: write miss; Fetch/Invalidate at owner, Data Value Reply, Sharers = {P}
Same states & structure as the transition diagram for an individual cache block.
2 actions: update of directory state & send messages to satisfy requests.
Tracks all copies of each memory block.
Also indicates actions that update the sharing set, Sharers, as opposed to sending a message.
DAP.F96 39
Example Directory Protocol
Message sent to directory causes two actions:
Update the directory
More messages to satisfy request
Block is in Uncached state: the copy in memory is the
current value; only possible requests for that block are:
Read miss: requesting processor is sent the data from memory & the requestor is made the only sharing node; state of block made Shared.
Write miss: requesting processor is sent the value & becomes the
Sharing node. The block is made Exclusive to indicate that the only
valid copy is cached. Sharers indicates the identity of the owner.
Block is Shared => the memory value is up-to-date:
Read miss: requesting processor is sent back the data from
memory & requesting processor is added to the sharing set.
Write miss: requesting processor is sent the value. All processors
in the set Sharers are sent invalidate messages, & Sharers is set to
identity of requesting processor. The state of the block is made
Exclusive.
DAP.F96 40
Example Directory Protocol
Block is Exclusive: current value of the block is held in
the cache of the processor identified by the set Sharers
(the owner) => three possible directory requests:
Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared
and causes owner to send data to directory, where it is written to
memory & sent back to requesting processor. Identity of
requesting processor is added to set Sharers, which still
contains the identity of the processor that was the owner (since it
still has a readable copy).
Data write-back: owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block is now uncached, and the Sharer set is empty.
Write miss: block has a new owner. A message is sent to old
owner causing the cache to send the value of the block to the
directory from which it is sent to the requesting processor, which
becomes the new owner. Sharers is set to identity of new owner,
and state of block is made Exclusive.
DAP.F96 41
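The three cases above, sketched as the home directory's miss handlers in C (this reuses dir_entry_t, NPROCS, and the helpers from the earlier sketches; the send_* and data_value_reply routines are hypothetical). A data write-back would simply empty Sharers and mark the block Uncached.

```c
extern void data_value_reply(int to_proc, unsigned addr);
extern void send_invalidate(int to_proc, unsigned addr);
extern void send_fetch(int owner, unsigned addr);            /* owner -> Shared  */
extern void send_fetch_invalidate(int owner, unsigned addr); /* owner -> Invalid */

void dir_read_miss(dir_entry_t *e, int p, unsigned addr)
{
    if (e->state == EXCLUSIVE_ST)              /* owner must supply the data first  */
        for (int q = 0; q < NPROCS; q++)
            if (is_sharer(e, q)) send_fetch(q, addr);  /* memory made up-to-date    */
    data_value_reply(p, addr);
    add_sharer(e, p);                          /* requester joins the sharing set   */
    e->state = SHARED_ST;                      /* old owner, if any, stays a sharer */
}

void dir_write_miss(dir_entry_t *e, int p, unsigned addr)
{
    for (int q = 0; q < NPROCS; q++) {
        if (!is_sharer(e, q)) continue;
        if (e->state == EXCLUSIVE_ST)
            send_fetch_invalidate(q, addr);    /* pull data back, invalidate owner  */
        else
            send_invalidate(q, addr);          /* invalidate every shared copy      */
    }
    clear_sharers(e);
    data_value_reply(p, addr);
    add_sharer(e, p);                          /* requester becomes the sole owner  */
    e->state = EXCLUSIVE_ST;
}
```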
Implementing a Directory
We assume operations are atomic, but they are not; reality is much harder; must avoid deadlock when the network runs out of buffers
(see Appendix E)
Optimizations:
read miss or write miss in Exclusive: send data
directly to requestor from owner vs. 1st to memory
and then from memory to requestor
DAP.F96 42
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 43
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1                                  WrMs P1 A1       A1 Ex {P1}
                      Excl. A1 10                   DaRp P1 A1 0                         0
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 44
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1                                  WrMs P1 A1       A1 Ex {P1}
                      Excl. A1 10                   DaRp P1 A1 0                         0
P1: Read A1           Excl. A1 10
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 45
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1                                  WrMs P1 A1       A1 Ex {P1}
                      Excl. A1 10                   DaRp P1 A1 0                         0
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   Ftch P1 A1 10                        10
                                     Shar. A1 10    DaRp P2 A1 10    A1 Shar. {P1,P2}    10
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 46
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1                                  WrMs P1 A1       A1 Ex {P1}
                      Excl. A1 10                   DaRp P1 A1 0                         0
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   Ftch P1 A1 10                        10
                                     Shar. A1 10    DaRp P2 A1 10    A1 Shar. {P1,P2}    10
P2: Write 20 to A1                   Excl. A1 20    WrMs P2 A1                           10
                      Inv.                          Inval. P1 A1     A1 Excl. {P2}       10
P2: Write 40 to A2
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 47
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1                                  WrMs P1 A1       A1 Ex {P1}
                      Excl. A1 10                   DaRp P1 A1 0                         0
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   Ftch P1 A1 10                        10
                                     Shar. A1 10    DaRp P2 A1 10    A1 Shar. {P1,P2}    10
P2: Write 20 to A1                   Excl. A1 20    WrMs P2 A1                           10
                      Inv.                          Inval. P1 A1     A1 Excl. {P2}       10
P2: Write 40 to A2                                  WrMs P2 A2       A2 Excl. {P2}       0
                                                    WrBk P2 A1 20    A1 Unca. {}         20
                                     Excl. A2 40    DaRp P2 A2 0     A2 Excl. {P2}       0
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 48
Miss Rates for Snooping Protocol
4th C: Conflict, Capacity, Compulsory and Coherency
Misses
More processors: increase coherency misses while
decreasing capacity misses since more cache memory
(for fixed problem size)
Cache behavior of five parallel programs:
FFT: Fast Fourier Transform; matrix transposition + computation
LU: factorization of a dense 2D matrix (linear algebra)
Barnes-Hut: n-body algorithm solving galaxy evolution problem
Ocean: simulates influence of eddy & boundary currents on large-scale flow in the ocean; dynamic arrays per grid
Volrend: volume rendering using ray casting
DAP.F96 49
Miss Rates for Snooping Protocol
(chart: miss rate vs. number of processors, 1 to 16, for fft, lu, barnes, ocean, volrend)
Big differences in miss rates among the programs; high capacity misses in some.
Cache size is 64KB, 2-way set associative, with 32B blocks.
Misses in these applications are generated by accesses to data that is potentially shared.
Except for Ocean, data is heavily shared; in Ocean only the boundaries of the subgrids are shared, though the entire grid is treated as a shared data object. Since the boundaries change as we increase the processor count (for a fixed size problem), different amounts of the grid become shared. The anomalous increase in miss rate for Ocean in moving from 1 to 2 processors arises because of conflict misses in accessing the subgrids.
DAP.F96 50
% Misses Caused by Coherency
Traffic vs. # of Processors
(chart: % of misses caused by coherency traffic vs. processor count, 1 to 16, for fft, lu, barnes, ocean, volrend; up to 80% of misses are due to coherency)
% of cache misses caused by coherency transactions typically rises when a fixed size problem is run on more processors.
The absolute number of coherency misses is increasing in all these benchmarks, including Ocean. In Ocean, however, it is difficult to separate out these misses from others, since the amount of sharing of the grid varies with processor count.
Invalidations increase significantly; in FFT, the miss rate arising from coherency misses increases from nothing to almost 7%.
DAP.F96 51
Miss Rates as Increase Cache Size/Processor
(chart: miss rate vs. cache size per processor, 16 to 256 KB, for fft, lu, barnes, ocean, volrend)
Ocean and FFT strongly influenced by capacity misses
Miss rate drops as the cache size is increased, unless the
miss rate is dominated by coherency misses.
The block size is 32B & the cache is 2-way set-associative.
The processor count is fixed at 16 processors.
DAP.F96 52
Miss Rate vs. Block Size
(chart: miss rate vs. block size, 16 to 128 bytes, for fft, lu, barnes, ocean, volrend)
Since a cache block holds multiple words, we may get coherency traffic for unrelated variables in the same block.
False sharing arises from the use of an invalidation-based coherency algorithm. It occurs when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into.
Miss rates mostly fall with increasing block size
DAP.F96 53
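A minimal sketch of false sharing (my example, not from the slides): two threads update logically unrelated counters that happen to fall in the same cache block, so every write invalidates the other processor's copy even though no word is truly shared. Padding each counter out to a block boundary would remove this coherency traffic.

```c
#include <pthread.h>

/* Both counters fit in one 32B block, so they are "falsely" shared. */
struct { long a; long b; } counters;

void *bump_a(void *arg) { (void)arg; for (int i = 0; i < 1000000; i++) counters.a++; return 0; }
void *bump_b(void *arg) { (void)arg; for (int i = 0; i < 1000000; i++) counters.b++; return 0; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, 0, bump_a, 0);
    pthread_create(&t2, 0, bump_b, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}
```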
% Misses Caused by Coherency
Traffic vs. Block Size
(chart: % of misses caused by coherency traffic vs. block size, 16 to 128 bytes, for fft, lu, barnes, ocean, volrend)
FFT communicates data in large blocks & communication adapts to the block size (it is a parameter to the code); makes effective use of large blocks.
Ocean: competing effects that favor different block sizes. Accesses to the boundary of each subgrid: in one direction the accesses match the array layout, taking advantage of large blocks, while in the other dimension they do not match. These two effects largely cancel each other out, leading to an overall decrease in the coherency misses as well as the capacity misses.
Behavior tracks cache size behavior.
FFT: coherence misses reduced faster than capacity misses!
DAP.F96 54
Bus Traffic as Increase Block Size
(chart: bus traffic in bytes per data reference vs. block size, 16 to 128 bytes, for fft, lu, barnes, ocean, volrend)
Bus traffic climbs steadily as the block size is increased; huge increases in bus traffic due to coherency!
The factor of 3 increase in traffic for Ocean is the best argument against larger block sizes.
Remember that our protocol treats ownership misses the same as other misses, slightly increasing the penalty for large cache blocks: in both Ocean and FFT this effect accounts for less than 10% of the traffic.
DAP.F96 55
Miss Rates for Directory
Use larger cache to circumvent longer latencies to directories
(chart: miss rate vs. number of processors, 8 to 64, for fft, lu, barnes, ocean, volrend)
Cache size is 128 KB, 2-way set associative, with 64B blocks (to cover longer latency).
Ocean: only the boundaries of the subgrids are shared. Since the boundaries change as we increase the processor count (for a fixed size problem), different amounts of the grid become shared. The increase in miss rate for Ocean in moving from 32 to 64 processors arises because of conflict misses in accessing small subgrids & coherency misses at 64 processors.
DAP.F96 56
Miss Rates as Increase Cache
Size/Processor for Directory
(chart: miss rate vs. cache size per processor, 32 to 512 KB, for fft, lu, barnes, ocean, volrend)
Miss rate drops as the cache size is increased, unless the miss rate is dominated by coherency misses.
The block size is 64B and the cache is 2-way set-associative. The processor count is fixed at 16 processors.
DAP.F96 57
Block Size for Directory
Assumes 128 KB cache & 64 processors
Large cache size to combat higher memory latencies than snoop
caches
(chart: miss rate vs. block size, 16 to 128 bytes, for fft, lu, barnes, ocean, volrend)
DAP.F96 58
Summary
Caches contain all information on state of
cached memory blocks
Snooping and Directory Protocols similar;
bus makes snooping easier because of
broadcast
Directory has extra data structure to keep
track of state of all cache blocks
Distributing directory => scalable shared
address multiprocessor
DAP.F96 59