Cache Coherence

The document discusses cache coherence in multiprocessors. It describes snooping and directory-based cache coherence protocols. The snooping protocol uses bus broadcasting to maintain coherence while the directory-based protocol uses a centralized directory to track shared data blocks.

Cache Coherence in Multiprocessors

Dr. Dhaval Shah


Fig. 1. Symmetric multiprocessor (UMA model)
Fig. 2. Distributed-memory multiprocessor (NUMA model)

Different Organizations of SMPs

 Processors and caches on separate extension boards (1980)
   Plugged into the backplane
 Integrated on the main board (1990)
   4 or 6 processors placed per board
 Integrated on the same chip (multi-core) (2000)
   Dual core (IBM, Intel, AMD)
   Quad core
Why not more cores on a chip?
 Clock skew
 Temperature and power dissipation
Multicore for Low Power
 For the same performance, power dissipation is reduced.
 Thread-level parallelism can be exploited to increase the performance of a multicore processor.
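
The low-power claim can be checked against the standard dynamic-power relation P = C * V^2 * f. The sketch below compares one core at full frequency with two cores at half frequency. The capacitance, voltage, and frequency values are illustrative assumptions, not measurements, and the comparison assumes the workload splits perfectly across both cores.

```python
def dynamic_power(c_farads, v_volts, f_hertz):
    """Dynamic switching power: P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hertz

# Hypothetical numbers chosen only for illustration.
C = 1e-9                                  # switched capacitance per core (F)
single = dynamic_power(C, 1.2, 2e9)       # one core at 2 GHz and 1.2 V
# Two cores at 1 GHz each; the lower frequency is assumed to permit 0.9 V.
dual = 2 * dynamic_power(C, 0.9, 1e9)

print(f"single core: {single:.2f} W")
print(f"dual core:   {dual:.2f} W")
```

Under these assumed figures the two slower cores deliver the same aggregate clock rate for less power, because power falls with the square of the supply voltage.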
 Private cache vs. shared cache
Figure: shared cache and private cache organizations

L2 Organizations

 Advantages of a shared L2 cache:
   Efficient dynamic use of space by each core
   Data shared by multiple cores is not replicated
   Every block has a fixed “home”, so the latest copy is easy to find
 Advantages of a private L2 cache:
   Quick access to the private L2 cache
   Private bus to the private L2 cache, so less contention
Shared Memory: Coherence
 Caching shared data:
   Allows migration and replication: a shared block may be held in multiple caches.
   Reduces the latency of accessing shared data.
   Reduces bandwidth demand on the shared memory.
 Copies in the caches of different processors may become inconsistent (for example, under a write-back policy).
   How can cache coherence be enforced?
   How does a processor learn of changes made in another processor's cache?
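
The inconsistency described above can be reproduced with a toy model. This is a minimal sketch, assuming two private write-back caches with no coherence mechanism at all; the class `WriteBackCache` and the address `"X"` are hypothetical names used only for illustration.

```python
class WriteBackCache:
    """A private write-back cache with no coherence mechanism (illustrative)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                      # address -> cached value

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch the block from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value             # write-back: memory is not updated yet


memory = {"X": 1}
p1, p2 = WriteBackCache(memory), WriteBackCache(memory)
p1.read("X")
p2.read("X")                 # both caches now hold X = 1
p1.write("X", 2)             # only P1's copy changes
stale = p2.read("X")         # P2 hits in its own cache and sees the old value
print(p1.read("X"), stale, memory["X"])
```

P1 sees its new value while P2 and memory still hold the old one, which is exactly the situation a coherence protocol has to prevent.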
Possible Solutions
 Software solutions:
   Avoid additional hardware.
   Rely on the compiler and OS to deal with the problem.
   Incur compile-time overhead.
   The compiler analyzes the code to detect which data items may become unsafe for caching.
   Non-cacheable items (shared data) are prevented from being cached.
   Being conservative, this approach does not make effective use of the cache.
 Hardware solutions:
   Allow dynamic recognition of potential inconsistency at run time.
   Make more effective use of caches and perform better than software-based approaches.
   Reduce the software development burden.
 Two basic approaches:
   Snoopy protocol
   Directory protocol
 Snooping protocol:
   Each cache controller “snoops” the bus to find out which data is being used by whom.
 Directory-based protocol:
   Keeps track of the sharing state of each data block in a directory.
   A directory is a centralized register for all memory blocks.
   Allows the coherence protocol to avoid broadcast.
 Snooping coherence on a bus was first described by Goodman (1983).
 Each core snoops the bus to find out which data is being used (updated) by which processor.
 Reduces memory traffic.
 All transmissions on a bus are broadcast.
Fig. 5. Snooping
SNOOPING (CONT.)
 Snooping protocols are basically of two types:
   Write-invalidate: when a local cache block is updated, all remote copies of the block are invalidated.
   Write-update: when a local cache block is updated, the new data block is broadcast to all caches containing a copy of the block so that they can update it.
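
The two policies can be contrasted in a few lines. This is a minimal sketch in which each cache is modeled as a plain dictionary from address to value; the function names and the three-cache setup are assumptions made for illustration, and bus arbitration and memory are left out.

```python
def write_invalidate(caches, writer, addr, value):
    """The writer keeps the new value; every other copy is dropped."""
    for i, cache in enumerate(caches):
        if i != writer:
            cache.pop(addr, None)            # invalidate the remote copy, if any
    caches[writer][addr] = value


def write_update(caches, writer, addr, value):
    """The writer broadcasts the new value to every cache holding the block."""
    caches[writer][addr] = value
    for cache in caches:
        if addr in cache:                    # only existing copies are refreshed
            cache[addr] = value


inv = [{"A": 1}, {"A": 1}, {}]
upd = [{"A": 1}, {"A": 1}, {}]
write_invalidate(inv, 0, "A", 5)
write_update(upd, 0, "A", 5)
print(inv)
print(upd)
```

After the write, the invalidate variant leaves a single valid copy in the writer's cache, while the update variant leaves both existing copies holding the new value.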
WRITE INVALIDATE PROTOCOL
 Handling a write to shared data:
   An invalidate command is sent on the bus; all caches snoop and invalidate any copies they have.
 Handling a read miss:
   Write-through: memory is always up to date.
   Write-back: snooping finds the most recent copy.
Write Invalidate vs Write Update

 Invalidate exploits spatial locality:
   Only one bus transaction is needed for any number of writes to the same block.
   More efficient.
 Write-update (broadcast) has lower latency for writes and reads than invalidate.
 Write-invalidate is the winner:
   It has been adopted in the Pentium 4 and PowerPC.
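
The "one bus transaction for any number of writes" point can be made concrete by counting. The sketch below is a simplified model, assuming one core writes the same shared block repeatedly and no other core touches the block in between; the function name `bus_transactions` is hypothetical.

```python
def bus_transactions(protocol, num_writes):
    """Bus transactions for num_writes writes by one core to one shared block.

    Simplifying assumption: no other core accesses the block in between.
    """
    if num_writes == 0:
        return 0
    if protocol == "invalidate":
        return 1              # the first write invalidates; later writes hit locally
    if protocol == "update":
        return num_writes     # every write is broadcast to the sharers
    raise ValueError(f"unknown protocol: {protocol}")


for n in (1, 4, 16):
    print(n, bus_transactions("invalidate", n), bus_transactions("update", n))
```

The gap grows linearly with the number of writes, which is the bandwidth argument in favor of write-invalidate.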
Example
 Assume:
   Invalidate protocol, write-back cache
   Each block of memory is in one of the following states:
     Modified/Exclusive: the line in the cache has been modified.
     Shared: clean in all caches and up to date in memory; the block can be read.
     Invalid: the data in the block is obsolete and cannot be used.
3 STATE MSI PROTOCOL (CONT.)
 Modified: this is the only valid copy in any cache, and its value is different from that in memory.
 Shared: this is a valid copy, but other caches may also contain it, and its value is the same as in memory.
 Invalid: this copy is out of date and cannot be used.
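
The three states can be tied together with a transition table for a single cache line. This is a sketch of the conventional textbook MSI transitions, not a transcription of the slides' diagrams: the event names `PrRd`, `PrWr`, `BusRd`, and `BusRdX` and the action name `Flush` are assumed labels for processor reads and writes, snooped read and write misses, and a write-back of dirty data.

```python
# (current state, event) -> (next state, bus action issued by this cache)
TRANSITIONS = {
    ("I", "PrRd"):   ("S", "BusRd"),    # read miss: fetch a shared copy
    ("I", "PrWr"):   ("M", "BusRdX"),   # write miss: fetch an exclusive copy
    ("I", "BusRd"):  ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "PrRd"):   ("S", None),       # read hit
    ("S", "PrWr"):   ("M", "BusRdX"),   # write to shared data: invalidate others
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),       # another cache is writing: drop the copy
    ("M", "PrRd"):   ("M", None),       # read hit
    ("M", "PrWr"):   ("M", None),       # write hit
    ("M", "BusRd"):  ("S", "Flush"),    # supply dirty data, keep a shared copy
    ("M", "BusRdX"): ("I", "Flush"),    # supply dirty data, then invalidate
}


def step(state, event):
    """Return (next_state, bus_action) for one cache line."""
    return TRANSITIONS[(state, event)]


state = "I"
for event in ("PrWr", "BusRd", "BusRdX"):
    state, action = step(state, event)
    print(event, "->", state, action)
```

Starting from Invalid, a local write moves the line to Modified, a snooped read demotes it to Shared with a write-back, and a snooped write miss invalidates it.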


Step                 P1                    P2                    Bus                          Memory
                     State    Addr  Val    State    Addr  Val    Action  Proc  Addr  Val      Val
P1 writes 10 to A    Excl     A     10                           Wr Mi   P1    A
P1 reads A           Excl     A     10
P2 reads A                                 Shared   A     -      Rd Mi   P2    A
                     Shared   A     10                           Wr Bk   P1    A     10       10
                                           Shared   A     10     Da Rd   P2    A     10       10
P2 writes 20 to A    Invalid  A     -      Excl     A     20     Wr Mi   P2    A              10

Wr Mi: write miss; Rd Mi: read miss; Wr Bk: write back; Da Rd: data supplied for the read.
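
The trace in the table can be replayed with a small two-cache model. This is a sketch under stated assumptions: a single address, an initial memory value of 0 (the table leaves it blank), and hypothetical helper names `read` and `write`; the state and action labels follow the table.

```python
class Cache:
    def __init__(self, name):
        self.name, self.state, self.value = name, "Invalid", None


def read(cache, others, mem, addr, log):
    if cache.state != "Invalid":
        return cache.value                    # read hit: no bus traffic
    log.append(("Rd Mi", cache.name))
    for other in others:
        if other.state == "Excl":             # the owner writes back, then shares
            mem[addr] = other.value
            other.state = "Shared"
            log.append(("Wr Bk", other.name))
    cache.state, cache.value = "Shared", mem[addr]
    return cache.value


def write(cache, others, mem, addr, value, log):
    if cache.state != "Excl":
        log.append(("Wr Mi", cache.name))
        for other in others:
            if other.state == "Excl":         # dirty copy elsewhere: write it back
                mem[addr] = other.value
                log.append(("Wr Bk", other.name))
            other.state, other.value = "Invalid", None
    cache.state, cache.value = "Excl", value


mem, log = {"A": 0}, []                       # initial memory value is assumed
p1, p2 = Cache("P1"), Cache("P2")
write(p1, [p2], mem, "A", 10, log)            # P1 writes 10 to A
read(p1, [p2], mem, "A", log)                 # P1 reads A (hit)
read(p2, [p1], mem, "A", log)                 # P2 reads A
write(p2, [p1], mem, "A", 20, log)            # P2 writes 20 to A
print(p1.state, p2.state, p2.value, mem["A"])
print(log)
```

The final states match the last row of the table: P1 is Invalid, P2 holds 20 exclusively, and memory still holds 10 because P2's write has not been written back.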
STATE DIAGRAM
Limitations of SMPs
 Centralized resources in the system become a bottleneck, notably the bus.
   The bus must support both normal and coherence traffic.
   As processor speed increases, the number of processors that can be supported decreases.
 How can a designer increase memory bandwidth?
   Use multiple buses or interconnection networks.
   Use multiple physical memory banks.
 A directory keeps the state of every block that may be cached:
   Which caches have copies of the block
   Whether it is dirty
 In a directory-based system, the data being shared is placed in a common directory that maintains coherence between caches.
 The directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache.
 When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
DIRECTORY BASED PROTOCOL

 NUMA computers:
   Messages have long latency.
   Broadcast is also inefficient, since all messages have explicit responses.
 The main memory controller keeps track of:
   Which processors have cached copies of which memory locations.
   On a write: only the processors holding a copy need to be informed, not everyone.
   On a dirty read: forward the request to the owner.
DIRECTORY BASED PROTOCOL (CONT.)

 Shared - One or more processors have the block cached, and the
value in memory is up to date (as well as in all the caches).
 Uncached - No processor has a copy of the cache block.
 Modified - Exactly one processor has a copy of the cache block, and
it has written the block, so the memory copy is out of date. The
processor is called the owner of the block.
 The directory must track which processors have the data when a block is in the shared state.
 This is usually implemented using a bit vector: a bit is 1 if the corresponding processor has a copy.
 Writes to non-exclusive data are treated as write misses.
 The processor blocks until the access completes.
 Assume messages are received and acted upon in the order sent.
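
The bit-vector bookkeeping can be sketched with ordinary integer bit operations. This is an illustrative model only; the class name `DirectoryEntry` and the numbering of processors from 0 are assumptions.

```python
class DirectoryEntry:
    """Sharer tracking for one block: bit i is 1 if processor i has a copy."""

    def __init__(self):
        self.sharers = 0

    def add(self, proc):
        self.sharers |= 1 << proc            # set the processor's bit

    def remove(self, proc):
        self.sharers &= ~(1 << proc)         # clear the processor's bit

    def has_copy(self, proc):
        return bool((self.sharers >> proc) & 1)

    def sharer_list(self, num_procs):
        return [p for p in range(num_procs) if self.has_copy(p)]


entry = DirectoryEntry()
entry.add(0)
entry.add(3)
print(bin(entry.sharers), entry.sharer_list(4))
entry.remove(0)
print(entry.sharer_list(4))
```

On a write, the directory only has to walk the set bits to find the caches that need an invalidation, which is how broadcast is avoided.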
Step                 P1                    P2                    Bus                            Directory             Memory
                     State    Addr  Val    State    Addr  Val    Action      Proc  Addr  Val    Addr  State   Procs   Val
P1 writes 10 to A    Excl     A     10                           Wr Mi       P1    A            A     Excl    P1
P1 reads A           Excl     A     10
P2 reads A                                 Shared   A     -      Rd Mi       P2    A
                     Shared   A     10                           Fetch       P1    A     10                           10
                                           Shared   A     10     Da Rd       P2    A     10     A     Shared  P1,P2   10
P2 writes 20 to A                          Excl     A     20     Wr Mi       P2    A                                  10
                     Invalid                                     Invalidate  P1    A            A     Excl    P2      10
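
The directory trace above can be replayed in the same style as the snooping example. This is a sketch under stated assumptions: a single home directory, an initial memory value of 0, caches modeled as dictionaries from address to a (state, value) pair, and hypothetical helper names `dir_read` and `dir_write`. Read hits never reach the directory, so the "P1 reads A" step produces no message.

```python
class Directory:
    def __init__(self, memory):
        self.memory = memory
        self.state = {}      # addr -> "Uncached" | "Shared" | "Excl"
        self.sharers = {}    # addr -> set of processor names holding a copy


def dir_read(d, caches, proc, addr, log):
    log.append(("Rd Mi", proc))
    if d.state.get(addr, "Uncached") == "Excl":
        (owner,) = d.sharers[addr]                   # exactly one owner
        d.memory[addr] = caches[owner][addr][1]      # fetch dirty data from owner
        caches[owner][addr] = ("Shared", d.memory[addr])
        log.append(("Fetch", owner))
    d.state[addr] = "Shared"
    d.sharers.setdefault(addr, set()).add(proc)
    caches[proc][addr] = ("Shared", d.memory[addr])
    log.append(("Da Rd", proc))


def dir_write(d, caches, proc, addr, value, log):
    log.append(("Wr Mi", proc))
    for other in sorted(d.sharers.get(addr, set()) - {proc}):
        state, old = caches[other][addr]
        if state == "Excl":
            d.memory[addr] = old                     # recover dirty data first
        caches[other][addr] = ("Invalid", None)
        log.append(("Invalidate", other))            # sent to sharers only
    d.state[addr], d.sharers[addr] = "Excl", {proc}
    caches[proc][addr] = ("Excl", value)


memory, log = {"A": 0}, []                           # initial value is assumed
caches = {"P1": {}, "P2": {}}
d = Directory(memory)
dir_write(d, caches, "P1", "A", 10, log)             # P1 writes 10 to A
dir_read(d, caches, "P2", "A", log)                  # P2 reads A
dir_write(d, caches, "P2", "A", 20, log)             # P2 writes 20 to A
print(d.state["A"], d.sharers["A"], memory["A"])
print(log)
```

Unlike the bus version, the invalidation goes only to P1, the one processor the directory lists as a sharer.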
STATE DIAGRAM