Lecture 18:
Snooping vs. Directory Based
Coherency
Professor David A. Patterson
Computer Science 252
Fall 1996
DAP.F96 1
Review: Parallel Framework
Layers:
Programming Model
Communication Abstraction
Interconnection SW/OS
Interconnection HW
Programming Model:
Multiprogramming : lots of jobs, no communication
Shared address space: communicate via memory
Message passing: send and receive messages
Data Parallel: several agents operate on several data
sets simultaneously and then exchange information
globally and simultaneously (shared or message
passing)
Communication Abstraction:
Shared address space: e.g., load, store, atomic swap
Message passing: e.g., send, receive library calls
Debate over this topic (ease of programming, scaling)
=> many hardware designs matched 1:1 to a programming model
DAP.F96 2
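To make the two abstractions concrete, here is a minimal sketch in C (my illustration, not from the slides): the shared-address-space version communicates through an ordinary store and load, while the message-passing version uses explicit MPI send/receive calls (MPI is one of the libraries named later in the lecture).

```c
#include <mpi.h>

/* Shared address space: communication is just a store by one processor
   and a load by another (synchronization handled separately).          */
volatile int shared_value;
void produce_shared(int v) { shared_value = v; }    /* store */
int  consume_shared(void)  { return shared_value; } /* load  */

/* Message passing: communication is an explicit send/receive pair.     */
void produce_msg(int v, int dest_rank)
{
    MPI_Send(&v, 1, MPI_INT, dest_rank, 0, MPI_COMM_WORLD);
}
int consume_msg(int src_rank)
{
    int v;
    MPI_Recv(&v, 1, MPI_INT, src_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return v;
}
```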
Review: Small-Scale MP
Designs
Memory: centralized with uniform access time (UMA) and bus interconnect
Examples: Sun Enterprise 5000, SGI Challenge, Intel SystemPro
DAP.F96 3
Review: Large-Scale MP
Designs
Memory: distributed with nonuniform access time (NUMA) and scalable interconnect (distributed memory)
Examples: T3E (see Ch. 1, Fig. 1-21, page 45 of [CSG96])
(figure labels: 1 cycle, 40 cycles, 100 cycles; low latency; high reliability)
DAP.F96 4
Data Parallel Model
Operations can be performed in parallel on
each element of a large regular data structure,
such as an array
1 Control Processor broadcasts to many PEs
(see Ch. 1, Fig. 1-26, page 51 of [CSG96])
When computers were large, could amortize the control portion over many replicated PEs
Data distributed in each memory
Condition flag per PE so that PEs can skip operations
Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE
Data parallel programming languages lay out data to processors
DAP.F96 5
Data Parallel Model
Vector processors have similar ISAs, but no
data placement restriction
Advancing VLSI led to single-chip FPUs and whole fast processors
SIMD programming model led to Single
Program Multiple Data (SPMD) model
All processors execute identical program
Data parallel programming languages still
useful, do communication all at once:
Bulk Synchronous phases in which all
communicate after a global barrier
DAP.F96 6
Convergence in Parallel Architecture
Complete computers connected to scalable
network via communication assist
(see Ch. 1, Fig. 1-29, page 57 of [CSG96])
Different programming models place different
requirements on communication assist
Shared address space: tight integration with
memory to capture memory events that interact
with others + to accept requests from other nodes
Message passing: send messages quickly and
respond to incoming messages: tag match, allocate
buffer, transfer data, wait for receive posting
Data Parallel: fast global synchronization
HPF: shared-memory, data parallel; PVM, MPI: message passing libraries; both work on many machines, different implementations
DAP.F96 7
Fundamental Issues: Naming
Naming: how to solve large problem fast
what data is shared
how it is addressed
what operations can access data
how processes refer to each other
Choice of naming affects code produced by a compiler: via load, where you just remember an address, or by keeping track of a processor number and a local virtual address
Choice of naming affects replication of data: via load into the cache memory hierarchy, or via SW replication and consistency
DAP.F96 8
Fundamental Issues: Naming
Global physical address space: any processor can generate an address and access it in a single operation
memory can be anywhere: virtual addr. translation
handles it
Global virtual address space: if the address
space of each process can be configured to
contain all shared data of the parallel program
Segmented shared address space: if
locations are named <process number,
address> uniformly for all processes of the
parallel program
DAP.F96 9
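A small sketch of the segmented option in C (my illustration; my_proc and remote_read are hypothetical names): a shared location is named by a <process number, address> pair, and the access path depends on whether the location is local.

```c
#include <stdint.h>

typedef struct {
    int      proc;   /* which process/node holds the location */
    uint32_t addr;   /* address within that process           */
} global_name_t;

extern int my_proc;                               /* this node's number          */
extern int remote_read(int proc, uint32_t addr);  /* hypothetical remote access  */

int shared_load(global_name_t name)
{
    if (name.proc == my_proc)
        return *(int *)(uintptr_t)name.addr;      /* local: ordinary load        */
    return remote_read(name.proc, name.addr);     /* remote: communication path  */
}
```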
Fundamental Issues:
Synchronization
To cooperate, processes must coordinate
Message passing is implicit coordination with
transmission or arrival of data
Shared address => additional operations to
explicitly coordinate: e.g., write a flag, awaken
a thread, interrupt a processor
DAP.F96 10
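A minimal sketch of the "write a flag" coordination mentioned above, using C11 atomics (my choice of mechanism; the slide does not prescribe one): the producer writes the data and then sets the flag, and the consumer spins on the flag before reading.

```c
#include <stdatomic.h>

int        shared_data;      /* data being communicated */
atomic_int ready;            /* coordination flag       */

void producer(void)
{
    shared_data = 42;                                        /* write the data */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* then the flag  */
}

int consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* spin on flag   */
    return shared_data;                                      /* safe to read   */
}
```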
Fundamental Issues:
Latency and Bandwidth
Bandwidth
Need high bandwidth in communication
Cannot scale perfectly, but should stay close
Make limits in network, memory, and processor match
Overhead to communicate is a problem in many machines
Latency
Affects performance, since processor may have to wait
Affects ease of programming, since requires more thought
to overlap communication and computation
Latency Hiding
How can a mechanism help hide latency?
Examples: overlap message send with computation,
prefetch
DAP.F96 11
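One way to hide latency, sketched with MPI's nonblocking send (the overlap idea is the slide's; the specific calls and names are mine): start the send, do independent work, and only then wait for completion.

```c
#include <mpi.h>

void send_and_compute(double *msg, int n, int dest, double *work, int m)
{
    MPI_Request req;

    /* Start the send but do not wait for it...                      */
    MPI_Isend(msg, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

    /* ...overlap the message transfer with independent computation. */
    for (int i = 0; i < m; i++)
        work[i] = 2.0 * work[i] + 1.0;

    /* Block only when the send must be complete.                    */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```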
Small-Scale Shared Memory
Caches serve to:
Increase bandwidth versus bus/memory
Reduce latency of access
Valuable for both private data and shared data
What about cache consistency?
DAP.F96 12
The Problem of Cache Coherency
DAP.F96 13
What Does Coherency Mean?
Informally:
Any read must return the most recent write
Too strict and very difficult to implement
Better:
Any write must eventually be seen by a read
All writes are seen in proper order (serialization)
Two rules to ensure this:
If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart
Writes to a single location are serialized:
seen in one order
Latest write will be seen
Otherwise could see writes in illogical order
(could see older value after a newer value)
DAP.F96 14
CS 252 Administrivia
Next reading is Chapter 8 of CA:AQA 2/e and
Sections 1.1-1.4, Chapter 1 of upcoming book by
Culler, Singh, and Gupta:
www.cs.berkeley.edu/~culler/
Remzi Arpaci will talk Fri. 11/8 on Networks of
Workstations and world record sort
Dr. Dan Lenoski, architect of SGI Origin, talk in
Systems Seminar Thur. 11/14 at 4PM in 306 Soda
Next project review: survey due Mon. 11/11; 20 min.
meetings moved to Fri. 11/15; signup Wed. 11/6
DAP.F96 15
Potential Solutions
Snooping Solution (Snoopy Bus):
Send all requests for data to all processors
Processors snoop to see if they have a copy and respond
accordingly
Requires broadcast, since caching information is at processors
Works well with bus (natural broadcast medium)
Dominates for small scale machines (most of the market)
Directory-Based Schemes
Keep track of what is being shared in one centralized place
Distributed memory => distributed directory (avoids
bottlenecks)
Send point-to-point requests to processors
Scales better than Snoop
Actually existed BEFORE Snoop-based schemes
DAP.F96 16
Basic Snoopy Protocols
Write Invalidate Protocol:
Multiple readers, single writer
Write to shared data: an invalidate is sent to all caches
which snoop and invalidate any copies
Read Miss:
Write-through: memory is always up-to-date
Write-back: snoop in caches to find most recent copy
Write Broadcast Protocol:
Write to shared data: broadcast on bus, processors
snoop, and update copies
Read miss: memory is always up-to-date
Write serialization: bus serializes requests
Bus is single point of arbitration
DAP.F96 17
Basic Snoopy Protocols
Write Invalidate versus Broadcast:
Invalidate requires one transaction per write-run
Invalidate uses spatial locality: one transaction per block
Broadcast has lower latency between write and read
Broadcast: BW (increased) vs. latency (decreased) tradeoff

Name          Protocol type      Memory-write policy                         Machines using
Write Once    Write invalidate   Write back after first write                First snoopy protocol
Synapse N+1   Write invalidate   Write back                                  1st cache-coherent MPs
Berkeley      Write invalidate   Write back                                  Berkeley SPUR
Illinois      Write invalidate   Write back                                  SGI Power and Challenge
Firefly       Write broadcast    Write back private, write through shared    SPARCCenter 2000
MESI          Write invalidate   Write back                                  Pentium, PowerPC
DAP.F96 18
Snoop Cache Variations
Basic Protocol   Berkeley Protocol   Illinois Protocol   MESI Protocol
Exclusive        Owned Exclusive     Private Dirty       Modified (private, ≠ Memory)
Shared           Owned Shared        Private Clean       eXclusive (private, = Memory)
Invalid          Shared              Shared              Shared (shared, = Memory)
                 Invalid             Invalid             Invalid

Owner can update via bus invalidate operation
Owner must write back when replaced in cache
If read sourced from memory, then Private Clean
If read sourced from other cache, then Shared
Can write in cache if held private clean or dirty
DAP.F96 19
An Example Snoopy Protocol
Invalidation protocol, write-back cache
Each block of memory is in one state:
Clean in all caches and up-to-date in memory (Shared)
OR Dirty in exactly one cache (Exclusive)
OR Not in any caches
Each cache block is in one state:
Shared : block can be read
OR Exclusive: cache has the only copy, it's writeable, and dirty
OR Invalid: block contains no data
Read misses: cause all caches to snoop the bus
Writes to a clean line are treated as misses
DAP.F96 20
Snoopy-Cache State Machine-I
State machine for CPU requests, for each cache block:
Invalid -> Shared: CPU read miss; place read miss on bus
Invalid -> Exclusive: CPU write; place write miss on bus
Shared: CPU read hit (no bus action)
Shared -> Shared: CPU read miss; place read miss on bus
Shared -> Exclusive: CPU write; place write miss on bus
Exclusive: CPU read hit or CPU write hit (no bus action)
Exclusive -> Shared: CPU read miss; write back block, place read miss on bus
Exclusive -> Exclusive: CPU write miss; write back cache block, place write miss on bus
DAP.F96 21
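The CPU-request side of this state machine, written out as a sketch in C (the states and bus actions follow the diagram; the helper functions and data layout are hypothetical):

```c
typedef enum { INVALID, SHARED, EXCLUSIVE } block_state_t;

extern void place_read_miss_on_bus(unsigned addr);   /* hypothetical bus helpers        */
extern void place_write_miss_on_bus(unsigned addr);
extern void write_back_victim(void);                 /* write back displaced dirty block */

/* CPU read: hits need no bus traffic; misses fetch the block in Shared state. */
block_state_t cpu_read(block_state_t s, unsigned addr, int hit)
{
    if (hit) return s;                        /* read hit: Shared or Exclusive  */
    if (s == EXCLUSIVE) write_back_victim();  /* miss displaces a dirty block   */
    place_read_miss_on_bus(addr);
    return SHARED;                            /* block arrives read-only        */
}

/* CPU write: only an Exclusive hit avoids the bus; otherwise gain ownership. */
block_state_t cpu_write(block_state_t s, unsigned addr, int hit)
{
    if (s == EXCLUSIVE && hit)  return EXCLUSIVE;    /* write hit on dirty block  */
    if (s == EXCLUSIVE && !hit) write_back_victim(); /* write miss on dirty block */
    place_write_miss_on_bus(addr);                   /* invalidate other copies   */
    return EXCLUSIVE;
}
```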
Snoopy-Cache State Machine-II
State machine for bus requests, for each cache block:
Invalid: no action on bus requests
Shared -> Invalid: write miss for this block
Exclusive -> Invalid: write miss for this block; write back block, abort memory access
Exclusive -> Shared: read miss for this block; write back block, abort memory access
DAP.F96 22
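The bus-request side, as a companion sketch (it reuses block_state_t and the helper style from the previous sketch; supply_block stands for "write back block, abort memory access"):

```c
extern void supply_block(unsigned addr);   /* hypothetical: write back, abort memory access */

/* Another CPU's write miss for this block: give up our copy.                 */
block_state_t snoop_write_miss(block_state_t s, unsigned addr)
{
    if (s == EXCLUSIVE) supply_block(addr); /* we hold the only up-to-date copy */
    return INVALID;                         /* the writer becomes the owner     */
}

/* Another CPU's read miss for this block: a dirty copy must be written back.  */
block_state_t snoop_read_miss(block_state_t s, unsigned addr)
{
    if (s == EXCLUSIVE) { supply_block(addr); return SHARED; }
    return s;                               /* Shared or Invalid: nothing to do */
}
```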
Snoop Cache: State Machine
Extensions:
Fourth state: ownership; clean -> dirty needs an invalidate only (upgrade request), don't read memory (Berkeley Protocol)
Clean exclusive state: no miss for private data on write (MESI Protocol)
Cache supplies data when in shared state: no memory access (Illinois Protocol)
DAP.F96 23
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 24
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1    Excl. A1 10                   WrMs P1 A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 25
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1    Excl. A1 10                   WrMs P1 A1
P1: Read A1           Excl. A1 10
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 26
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1    Excl. A1 10                   WrMs P1 A1
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   WrBk P1 A1 10    A1 10
                                     Shar. A1 10    RdDa P2 A1 10    A1 10
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 27
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1    Excl. A1 10                   WrMs P1 A1
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   WrBk P1 A1 10    A1 10
                                     Shar. A1 10    RdDa P2 A1 10    A1 10
P2: Write 20 to A1    Inv.           Excl. A1 20    WrMs P2 A1       A1 10
P2: Write 40 to A2
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 28
Example
step                  P1             P2             Bus              Memory
P1: Write 10 to A1    Excl. A1 10                   WrMs P1 A1
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   WrBk P1 A1 10    A1 10
                                     Shar. A1 10    RdDa P2 A1 10    A1 10
P2: Write 20 to A1    Inv.           Excl. A1 20    WrMs P2 A1       A1 10
P2: Write 40 to A2                                  WrMs P2 A2       A1 10
                                     Excl. A2 40    WrBk P2 A1 20    A1 20
(P1/P2 columns: state addr value; Bus: action proc addr value; Memory: addr value)
Assumes A1 and A2 map to same cache block
DAP.F96 29
Implementation Complications
Write Races:
Cannot update cache until bus is obtained
Otherwise, another processor may get bus first, and write the
same cache block
Two step process:
Arbitrate for bus
Place miss on bus and complete operation
If miss occurs to block while waiting for bus, handle miss
(invalidate may be needed) and then restart.
Split transaction bus:
Bus transaction is not atomic: can have multiple outstanding
transactions for a block
Multiple misses can interleave, allowing two caches to grab block
in the Exclusive state
Must track and prevent multiple misses for one block
Must support interventions and invalidations
DAP.F96 30
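The two-step sequence for a write race, as a sketch (all helper names are hypothetical): the cache does not change state until it owns the bus, and it rechecks the block after arbitration in case another processor's miss got there first.

```c
extern void arbitrate_for_bus(void);                        /* hypothetical helpers */
extern int  copy_invalidated_while_waiting(unsigned addr);
extern void handle_as_new_miss(unsigned addr);
extern void place_write_miss_on_bus(unsigned addr);
extern void update_cache_state(unsigned addr);

void write_with_arbitration(unsigned addr)
{
    arbitrate_for_bus();                        /* step 1: win the bus first      */
    if (copy_invalidated_while_waiting(addr)) {
        handle_as_new_miss(addr);               /* another CPU wrote this block;  */
        return;                                 /* restart as a full write miss   */
    }
    place_write_miss_on_bus(addr);              /* step 2: place miss on the bus  */
    update_cache_state(addr);                   /* only now update the cache      */
}
```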
Implementing Snooping Caches
Multiple processors must be on the bus, with access to both addresses and data
Add a few new commands to perform coherency,
in addition to read and write
Processors continuously snoop on address bus
If address matches tag, either invalidate or update
DAP.F96 31
Implementing Snooping Caches
Bus serializes writes, getting bus ensures no one else
can perform memory operation
On a miss in a write-back cache, another cache may have the desired copy and it may be dirty, so it must reply
Add extra state bit to cache to determine shared or not
Since every bus transaction checks cache tags, could
interfere with CPU just to check:
solution 1: duplicate set of tags for L1 caches just to allow
checks in parallel with CPU
solution 2: L2 cache that obeys inclusion with L1 cache
DAP.F96 32
Larger MPs
Separate Memory per Processor
Local or Remote access via memory controller
Cache Coherency solution: non-cached pages
Alternative: directory per cache that tracks state of every
block in every cache
Which caches have copies of the block, dirty vs. clean, ...
Info per memory block vs. per cache block?
PLUS: In memory => simpler protocol (centralized/one location)
MINUS: In memory => directory is (memory size) vs. (cache size)
Prevent directory as bottleneck: distribute directory
entries with memory, each keeping track of which Procs
have copies of their blocks
DAP.F96 33
Distributed Directory MPs
DAP.F96 34
Directory Protocol
Similar to Snoopy Protocol: Three states
Shared: ≥ 1 processors have data, memory up-to-date
Uncached: no processor has it; not valid in any cache
Exclusive: 1 processor (owner) has data; memory out-of-date
In addition to cache state, must track which
processors have data when in the shared state
(usually bit vector, 1 if processor has copy)
Keep it simple(r):
Writes to non-exclusive data => write miss
Processor blocks until access completes
Assume messages received and acted upon in order
sent
DAP.F96 35
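A sketch of the per-block directory state this implies (field sizes and names are mine): a state field plus a bit vector with one bit per processor.

```c
#include <stdint.h>

#define NPROCS 64                      /* number of processors (illustrative)    */

typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE_ST } dir_state_t;

typedef struct {                       /* one entry per memory block at its home */
    dir_state_t state;
    uint64_t    sharers;               /* bit p set if processor p has a copy    */
} dir_entry_t;

static inline void add_sharer(dir_entry_t *e, int p)  { e->sharers |=  (1ULL << p); }
static inline void clear_sharers(dir_entry_t *e)      { e->sharers = 0; }
static inline int  is_sharer(const dir_entry_t *e, int p) { return (int)((e->sharers >> p) & 1); }
```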
Directory Protocol
No bus and don't want to broadcast:
interconnect no longer single arbitration point
all messages have explicit responses
Terms:
Local node is the node where a request originates
Home node is the node where the memory location
of an address resides
Remote node is the node that has a copy of a cache
block, whether exclusive or shared
Example messages on next slide:
P = processor number, A = address
DAP.F96 36
Directory Protocol Messages
Message type (source -> destination; message contents):
Read miss (local cache -> home directory; P, A)
  Processor P reads data at address A; send data and make P a read sharer.
Write miss (local cache -> home directory; P, A)
  Processor P writes data at address A; send data and make P the exclusive owner.
Invalidate (home directory -> remote caches; A)
  Invalidate a shared copy at address A.
Fetch (home directory -> remote cache; A)
  Fetch the block at address A and send it to its home directory.
Fetch/Invalidate (home directory -> remote cache; A)
  Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Data value reply (home directory -> local cache; Data)
  Return a data value from the home memory.
Data write-back (remote cache -> home directory; A, Data)
  Write back a data value for address A.
DAP.F96 37
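The same message vocabulary as a C type, for reference (the struct layout is mine; the message names and contents come from the table above).

```c
typedef enum {
    READ_MISS,         /* local cache  -> home directory : P, A    */
    WRITE_MISS,        /* local cache  -> home directory : P, A    */
    INVALIDATE,        /* home dir     -> remote caches  : A       */
    FETCH,             /* home dir     -> remote cache   : A       */
    FETCH_INVALIDATE,  /* home dir     -> remote cache   : A       */
    DATA_VALUE_REPLY,  /* home dir     -> local cache    : data    */
    DATA_WRITE_BACK    /* remote cache -> home directory : A, data */
} msg_type_t;

typedef struct {
    msg_type_t type;
    int        proc;   /* P: requesting processor, when applicable */
    unsigned   addr;   /* A: block address, when applicable        */
    long       data;   /* data value, when applicable              */
} coherence_msg_t;
```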
State Transition Diagram for an
Individual Cache Block in a
Directory Based System
Invalid
Exclusive
Shared
States identical to snoopy case; transactions very similar.
Transitions caused by read misses, write misses, invalidates, data fetch requests.
Generates read miss & write miss messages to home directory.
Write misses that were broadcast on the bus => explicit invalidate & data fetch requests.
DAP.F96 38
State Transition Diagram for the
Directory
Uncached -> Shared: read miss; Data Value Reply, Sharers = {P}
Uncached -> Exclusive: write miss; Data Value Reply, Sharers = {P}
Shared -> Shared: read miss; Data Value Reply, Sharers = Sharers + {P}
Shared -> Exclusive: write miss; send Invalidate to Sharers, Data Value Reply, Sharers = {P}
Exclusive -> Uncached: data write-back (WrBk); Sharers = {}
Exclusive -> Shared: read miss; Fetch from owner, Data Value Reply, Sharers = Sharers + {P}
Exclusive -> Exclusive: write miss; Fetch/Invalidate at owner, Data Value Reply, Sharers = {P}
Same states & structure as the transition diagram for an individual cache block.
2 actions: update of directory state & send messages to satisfy requests.
Tracks all copies of each memory block.
Also indicates actions that update the sharing set, Sharers, as opposed to sending a message.
DAP.F96 39
Example Directory Protocol
Message sent to directory causes two actions:
Update the directory
More messages to satisfy request
Block is in Uncached state: the copy in memory is the
current value; only possible requests for that block are:
Read miss: requesting processor is sent the data from memory & the requestor is made the only sharing node; state of block made Shared.
Write miss: requesting processor is sent the value & becomes the
Sharing node. The block is made Exclusive to indicate that the only
valid copy is cached. Sharers indicates the identity of the owner.
Block is Shared => the memory value is up-to-date:
Read miss: requesting processor is sent back the data from
memory & requesting processor is added to the sharing set.
Write miss: requesting processor is sent the value. All processors
in the set Sharers are sent invalidate messages, & Sharers is set to
identity of requesting processor. The state of the block is made
Exclusive.
DAP.F96 40
Example Directory Protocol
Block is Exclusive: current value of the block is held in
the cache of the processor identified by the set Sharers
(the owner) => three possible directory requests:
Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared
and causes owner to send data to directory, where it is written to
memory & sent back to requesting processor. Identity of
requesting processor is added to set Sharers, which still
contains the identity of the processor that was the owner (since it
still has a readable copy).
Data write-back: owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block is now uncached, and the Sharer set is empty.
Write miss: block has a new owner. A message is sent to old
owner causing the cache to send the value of the block to the
directory from which it is sent to the requesting processor, which
becomes the new owner. Sharers is set to identity of new owner,
and state of block is made Exclusive.
DAP.F96 41
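The three cases above, sketched as the home directory's miss handlers in C (this reuses dir_entry_t, NPROCS, and the helpers from the earlier sketches; the send_* and data_value_reply routines are hypothetical). A data write-back would simply empty Sharers and mark the block Uncached.

```c
extern void data_value_reply(int to_proc, unsigned addr);
extern void send_invalidate(int to_proc, unsigned addr);
extern void send_fetch(int owner, unsigned addr);            /* owner -> Shared  */
extern void send_fetch_invalidate(int owner, unsigned addr); /* owner -> Invalid */

void dir_read_miss(dir_entry_t *e, int p, unsigned addr)
{
    if (e->state == EXCLUSIVE_ST)              /* owner must supply the data first  */
        for (int q = 0; q < NPROCS; q++)
            if (is_sharer(e, q)) send_fetch(q, addr);  /* memory made up-to-date    */
    data_value_reply(p, addr);
    add_sharer(e, p);                          /* requester joins the sharing set   */
    e->state = SHARED_ST;                      /* old owner, if any, stays a sharer */
}

void dir_write_miss(dir_entry_t *e, int p, unsigned addr)
{
    for (int q = 0; q < NPROCS; q++) {
        if (!is_sharer(e, q)) continue;
        if (e->state == EXCLUSIVE_ST)
            send_fetch_invalidate(q, addr);    /* pull data back, invalidate owner  */
        else
            send_invalidate(q, addr);          /* invalidate every shared copy      */
    }
    clear_sharers(e);
    data_value_reply(p, addr);
    add_sharer(e, p);                          /* requester becomes the sole owner  */
    e->state = EXCLUSIVE_ST;
}
```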
Implementing a Directory
We assume operations are atomic, but they are not; reality is much harder; must avoid deadlock when the network runs out of buffers
(see Appendix E)
Optimizations:
read miss or write miss in Exclusive: send data
directly to requestor from owner vs. 1st to memory
and then from memory to requestor
DAP.F96 42
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 43
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1                                  WrMs P1 A1       A1 Ex {P1}
                      Excl. A1 10                   DaRp P1 A1 0                         0
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 44
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1                                  WrMs P1 A1       A1 Ex {P1}
                      Excl. A1 10                   DaRp P1 A1 0                         0
P1: Read A1           Excl. A1 10
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 45
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1                                  WrMs P1 A1       A1 Ex {P1}
                      Excl. A1 10                   DaRp P1 A1 0                         0
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   Ftch P1 A1 10                        10
                                     Shar. A1 10    DaRp P2 A1 10    A1 Shar. {P1,P2}    10
P2: Write 20 to A1
P2: Write 40 to A2
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 46
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1                                  WrMs P1 A1       A1 Ex {P1}
                      Excl. A1 10                   DaRp P1 A1 0                         0
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   Ftch P1 A1 10                        10
                                     Shar. A1 10    DaRp P2 A1 10    A1 Shar. {P1,P2}    10
P2: Write 20 to A1                   Excl. A1 20    WrMs P2 A1                           10
                      Inv.                          Inval. P1 A1     A1 Excl. {P2}       10
P2: Write 40 to A2
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 47
Example
step                  P1             P2             Bus              Directory           Memory
P1: Write 10 to A1                                  WrMs P1 A1       A1 Ex {P1}
                      Excl. A1 10                   DaRp P1 A1 0                         0
P1: Read A1           Excl. A1 10
P2: Read A1                          Shar. A1       RdMs P2 A1
                      Shar. A1 10                   Ftch P1 A1 10                        10
                                     Shar. A1 10    DaRp P2 A1 10    A1 Shar. {P1,P2}    10
P2: Write 20 to A1                   Excl. A1 20    WrMs P2 A1                           10
                      Inv.                          Inval. P1 A1     A1 Excl. {P2}       10
P2: Write 40 to A2                                  WrMs P2 A2       A2 Excl. {P2}       0
                                                    WrBk P2 A1 20    A1 Unca. {}         20
                                     Excl. A2 40    DaRp P2 A2 0     A2 Excl. {P2}       0
(P1/P2: state addr value; Bus: action proc addr value; Directory: addr state {procs}; Memory: value)
A1 and A2 map to the same cache block
DAP.F96 48
Miss Rates for Snooping Protocol
4th C: Conflict, Capacity, Compulsory and Coherency
Misses
More processors: increase coherency misses while
decreasing capacity misses since more cache memory
(for fixed problem size)
Cache behavior of five parallel programs:
FFT: Fast Fourier Transform; matrix transposition + computation
LU: factorization of a dense 2D matrix (linear algebra)
Barnes-Hut: n-body algorithm solving galaxy evolution problem
Ocean: simulates influence of eddy & boundary currents on large-scale flow in the ocean; dynamic arrays per grid
Volrend: volume rendering using ray casting
DAP.F96 49
Miss Rates for Snooping Protocol
(chart: miss rate vs. number of processors, 1 to 16, for fft, lu, barnes, ocean, volrend)
Big differences in miss rates among the programs; high capacity misses in some.
Cache size is 64KB, 2-way set associative, with 32B blocks.
Misses in these applications are generated by accesses to data that is potentially shared.
Except for Ocean, data is heavily shared; in Ocean only the boundaries of the subgrids are shared, though the entire grid is treated as a shared data object. Since the boundaries change as we increase the processor count (for a fixed size problem), different amounts of the grid become shared. The anomalous increase in miss rate for Ocean in moving from 1 to 2 processors arises because of conflict misses in accessing the subgrids.
DAP.F96 50
% Misses Caused by Coherency
Traffic vs. # of Processors
(chart: % of misses caused by coherency traffic vs. processor count, 1 to 16, for fft, lu, barnes, ocean, volrend; up to 80% of misses are due to coherency)
% of cache misses caused by coherency transactions typically rises when a fixed size problem is run on more processors.
The absolute number of coherency misses is increasing in all these benchmarks, including Ocean. In Ocean, however, it is difficult to separate out these misses from others, since the amount of sharing of the grid varies with processor count.
Invalidations increase significantly; in FFT, the miss rate arising from coherency misses increases from nothing to almost 7%.
DAP.F96 51
Miss Rates as Increase Cache Size/Processor
(chart: miss rate vs. cache size per processor, 16 to 256 KB, for fft, lu, barnes, ocean, volrend)
Ocean and FFT strongly influenced by capacity misses
Miss rate drops as the cache size is increased, unless the
miss rate is dominated by coherency misses.
The block size is 32B & the cache is 2-way set-associative.
The processor count is fixed at 16 processors.
DAP.F96 52
Miss Rate vs. Block Size
(chart: miss rate vs. block size, 16 to 128 bytes, for fft, lu, barnes, ocean, volrend)
Since a cache block holds multiple words, we may get coherency traffic for unrelated variables in the same block.
False sharing arises from the use of an invalidation-based coherency algorithm. It occurs when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into.
Miss rates mostly fall with increasing block size
DAP.F96 53
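A minimal sketch of false sharing (my example, not from the slides): two threads update logically unrelated counters that happen to fall in the same cache block, so every write invalidates the other processor's copy even though no word is truly shared. Padding each counter out to a block boundary would remove this coherency traffic.

```c
#include <pthread.h>

/* Both counters fit in one 32B block, so they are "falsely" shared. */
struct { long a; long b; } counters;

void *bump_a(void *arg) { (void)arg; for (int i = 0; i < 1000000; i++) counters.a++; return 0; }
void *bump_b(void *arg) { (void)arg; for (int i = 0; i < 1000000; i++) counters.b++; return 0; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, 0, bump_a, 0);
    pthread_create(&t2, 0, bump_b, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}
```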
% Misses Caused by Coherency
Traffic vs. Block Size
(chart: % of misses caused by coherency traffic vs. block size, 16 to 128 bytes, for fft, lu, barnes, ocean, volrend)
FFT communicates data in large blocks & communication adapts to the block size (it is a parameter to the code); makes effective use of large blocks.
Ocean: competing effects that favor different block sizes. Accesses to the boundary of each subgrid: in one direction the accesses match the array layout, taking advantage of large blocks, while in the other dimension they do not match. These two effects largely cancel each other out, leading to an overall decrease in the coherency misses as well as the capacity misses.
Behavior tracks cache size behavior.
FFT: coherence misses reduced faster than capacity misses!
DAP.F96 54
Bus Traffic as Increase Block Size
(chart: bus traffic in bytes per data reference vs. block size, 16 to 128 bytes, for fft, lu, barnes, ocean, volrend)
Bus traffic climbs steadily as the block size is increased; huge increases in bus traffic due to coherency!
The factor of 3 increase in traffic for Ocean is the best argument against larger block sizes.
Remember that our protocol treats ownership misses the same as other misses, slightly increasing the penalty for large cache blocks: in both Ocean and FFT this effect accounts for less than 10% of the traffic.
DAP.F96 55
Miss Rates for Directory
Use larger cache to circumvent longer latencies to directories
(chart: miss rate vs. number of processors, 8 to 64, for fft, lu, barnes, ocean, volrend)
Cache size is 128 KB, 2-way set associative, with 64B blocks (to cover longer latency).
Ocean: only the boundaries of the subgrids are shared. Since the boundaries change as we increase the processor count (for a fixed size problem), different amounts of the grid become shared. The increase in miss rate for Ocean in moving from 32 to 64 processors arises because of conflict misses in accessing small subgrids & coherency misses at 64 processors.
DAP.F96 56
Miss Rates as Increase Cache
Size/Processor for Directory
(chart: miss rate vs. cache size per processor, 32 to 512 KB, for fft, lu, barnes, ocean, volrend)
Miss rate drops as the cache size is increased, unless the miss rate is dominated by coherency misses.
The block size is 64B and the cache is 2-way set-associative. The processor count is fixed at 16 processors.
DAP.F96 57
Block Size for Directory
Assumes 128 KB cache & 64 processors
Large cache size to combat higher memory latencies than snoop
caches
(chart: miss rate vs. block size, 16 to 128 bytes, for fft, lu, barnes, ocean, volrend)
DAP.F96 58
Summary
Caches contain all information on state of
cached memory blocks
Snooping and Directory Protocols similar;
bus makes snooping easier because of
broadcast
Directory has extra data structure to keep
track of state of all cache blocks
Distributing directory => scalable shared
address multiprocessor
DAP.F96 59