Cache Coherence in Multiprocessors
Dr. Dhaval Shah
Fig. 1. Symmetric multiprocessor (UMA model)
Fig. 2. Distributed-memory multiprocessor (NUMA model)
Different Organizations of SMPs
Processors and caches on separate extension boards (1980)
Plugged into the backplane
Integrated on the main board (1990)
4 or 6 processors placed per board
Integrated on the same chip (multi-core) (2000)
Dual core (IBM, Intel, AMD)
Quad core
Why not more cores on chip?
Clock Skew
Temperature/Power dissipation
Multicore for Low Power
For the same performance, power dissipation is reduced.
Thread level parallelism can be exploited to increase
performance of multicore.
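One rough way to see the power argument is the standard dynamic-power relation P = C * V^2 * f. The sketch below compares one core at full clock with two cores at half the clock. The capacitance, voltage and frequency values are illustrative assumptions, not figures from these slides, as is the assumption that the lower clock permits a supply voltage of 0.8 V and that the work parallelizes perfectly across both cores.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic switching power, P = C * V^2 * f (activity factor folded into C)."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical baseline: one core at 1.0 V and 2 GHz.
C = 1e-9  # effective switched capacitance in farads (assumed value)
single = dynamic_power(C, 1.0, 2e9)

# Two cores at half the clock; assume the slower clock allows 0.8 V.
dual = 2 * dynamic_power(C, 0.8, 1e9)

# Both configurations deliver 2e9 cycles per second in aggregate.
power_ratio = dual / single
print(f"single: {single:.2f} W  dual: {dual:.2f} W  ratio: {power_ratio:.2f}")
```

Under these assumed numbers the dual-core configuration dissipates roughly two thirds of the single-core power, because power falls with the square of the supply voltage.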
Private Cache vs. Shared Cache
L2 Organizations
Advantages of a shared L2 cache:
Efficient dynamic use of space by each core
Data shared by multiple cores is not replicated
Every block has a fixed “home”, so the latest copy is easy to find
Advantages of a private L2 cache:
Quick access to the private L2 cache
Private bus to the private L2 cache, so less contention
Shared Memory: Coherence
When shared data are cached:
Caching allows both migration and replication of the data.
Shared data may be replicated in multiple caches.
This reduces the latency of access to shared data.
It also reduces the bandwidth demand on the shared memory.
However, data in the caches of different processors may become inconsistent
(for example, under a write-back policy).
How can cache coherence be enforced?
How does a processor learn of changes in the caches of other processors?
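The inconsistency described above can be reproduced with a small model. The following is a minimal sketch, assuming two private write-back caches with no coherence mechanism at all; the class name `WriteBackCache` and the address `"X"` are hypothetical and chosen only for illustration.

```python
class WriteBackCache:
    """Private write-back cache with no coherence mechanism (illustrative)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # address -> cached value
        self.dirty = set()  # addresses modified locally, not yet written back

    def read(self, addr):
        if addr not in self.lines:       # miss: fetch from shared memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value         # only the local copy is updated
        self.dirty.add(addr)             # memory would be updated on write-back


memory = {"X": 0}
p1, p2 = WriteBackCache(memory), WriteBackCache(memory)

p1.read("X")           # both processors cache X = 0
p2.read("X")
p1.write("X", 42)      # P1 updates its private copy

stale = p2.read("X")   # P2 hits its old copy and never sees the update
print(p1.read("X"), stale, memory["X"])
```

P1 sees its new value while P2 and main memory still hold the old one, which is exactly the situation a coherence protocol has to prevent.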
Possible Solutions
Software solutions:
Avoid additional hardware
Rely on the compiler and OS to deal with the problem
Compile-time overhead
The compiler analyzes the code to detect which data items may
become unsafe for caching
Prevent non-cacheable items (shared data) from being cached
Being conservative, the approach does not make effective use of the cache
Hardware solutions:
Allow dynamic recognition of potential inconsistency at run time
More effective use of caches and better performance than software-based
approaches
Reduce the software development burden
Two basic approaches
Snoopy protocol
Directory protocol
Snooping protocol:
Each cache controller “snoops” the bus to find out which data is being used
by whom.
Directory-based protocol:
Keeps track of the sharing state of each data block in a directory.
A directory is a centralized register for all memory blocks.
Allows the coherence protocol to avoid broadcasts.
Snooping coherence on a bus was first described by Goodman (1983).
Each core snoops the bus to find out which data is being used (updated)
by which processor.
Reduces memory traffic.
All transmissions on a bus are broadcast.
Fig. 5. Snooping
SNOOPING (CONT.)
Snooping protocols are basically of two types:
Write-invalidate: when a local cache block is updated, all remote copies of
the block are invalidated.
Write-update: when a local cache block is updated, the new data block is
broadcast to all caches containing a copy of the block so that they can update it.
WRITE INVALIDATE PROTOCOL
Handling a write to shared data:
An invalidate command is sent on the bus; all caches snoop and invalidate any
copies they have.
Handling a read miss:
Write-through: memory is always up to date.
Write-back: snooping finds the most recent copy.
Write Invalidate vs Write Update
Invalidate exploits spatial locality:
Only one bus transaction for any number of writes to the same block.
More efficient.
Write update (broadcast) has lower latency for writes and reads
compared with invalidate.
Write invalidate is the winner:
It has been adopted in the Pentium 4 and PowerPC.
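The bus-traffic difference between the two policies can be put in numbers. The sketch below is a simplified counting model, not a protocol implementation: it assumes one processor writes the same block repeatedly, at least one other cache holds a copy at the start, and no other processor touches the block in between. The function name `bus_transactions` is hypothetical.

```python
def bus_transactions(protocol, num_writes):
    """Bus transactions caused by num_writes consecutive writes to one block.

    Assumes a single writer, at least one remote copy at the start, and no
    intervening access by any other processor.
    """
    if num_writes == 0:
        return 0
    if protocol == "invalidate":
        # The first write invalidates the remote copies; later writes hit
        # a block the writer already owns and stay off the bus.
        return 1
    if protocol == "update":
        # Every write is broadcast so that sharers can refresh their copies.
        return num_writes
    raise ValueError(f"unknown protocol: {protocol}")


for n in (1, 8, 64):
    print(n, bus_transactions("invalidate", n), bus_transactions("update", n))
```

The gap grows with the number of writes to the block, which is why invalidate benefits when several words of the same block are written in sequence.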
Example
Assume:
Invalidate protocol, write-back cache
Each block of memory is in one of the following states:
Modified (Exclusive): the line in the cache has been modified.
Shared: clean in all caches and up to date in memory; the block can be read.
Invalid: the data in the block is obsolete and cannot be used.
3-STATE MSI PROTOCOL (CONT.)
Modified: this is the only valid copy in any cache, and its value is different
from that in memory.
Shared: this is a valid copy, but other caches may also contain it, and
its value is the same as in memory
Invalid: this copy is out of date and cannot be used.
3-STATE MSI PROTOCOL (CONT.)
                    P1                   P2                   Bus                         Memory
Step                State    Addr  Val   State    Addr  Val   Action    Proc  Addr  Val   Val
P1 writes 10 to A   Excl     A     10                         Wr miss   P1    A
P1 reads A          Excl     A     10
P2 reads A                               Shared   A     -     Rd miss   P2    A
                    Shared   A     10                         Wr back   P1    A     10    10
                                         Shared   A     10    Data rd   P2    A     10    10
P2 writes 20 to A   Invalid  A     -     Excl     A     20    Wr miss   P2    A           10
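The trace above can be replayed with a small simulation. This is a minimal sketch of a write-invalidate MSI snooping protocol, assuming a single shared bus, one-word blocks (so a write miss does not need to fetch the rest of the block), and unlimited cache capacity. The class names `Bus` and `Cache` are hypothetical, and the data-read reply is folded into the read miss rather than logged separately.

```python
M, S, I = "Modified", "Shared", "Invalid"


class Bus:
    def __init__(self):
        self.caches = []
        self.memory = {}
        self.log = []  # (action, processor, address) in bus order

    def broadcast(self, action, source, addr):
        """Put a transaction on the bus; every other cache snoops it."""
        self.log.append((action, source.name, addr))
        for cache in self.caches:
            if cache is not source:
                cache.snoop(action, addr)


class Cache:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.state, self.value = {}, {}
        bus.caches.append(self)

    def read(self, addr):
        if self.state.get(addr, I) == I:              # read miss
            self.bus.broadcast("read miss", self, addr)
            self.value[addr] = self.bus.memory.get(addr)
            self.state[addr] = S
        return self.value[addr]

    def write(self, addr, value):
        if self.state.get(addr, I) != M:              # need exclusive ownership
            self.bus.broadcast("write miss", self, addr)
            self.state[addr] = M
        self.value[addr] = value                      # memory is not updated

    def snoop(self, action, addr):
        state = self.state.get(addr, I)
        if state == M:                                # we hold the only valid copy
            self.bus.memory[addr] = self.value[addr]  # supply it via write-back
            self.bus.log.append(("write back", self.name, addr))
        if action == "write miss" and state != I:
            self.state[addr] = I                      # another cache takes ownership
        elif action == "read miss" and state == M:
            self.state[addr] = S                      # downgrade, keep a clean copy


bus = Bus()
p1, p2 = Cache("P1", bus), Cache("P2", bus)
p1.write("A", 10)
p1.read("A")
p2.read("A")
p2.write("A", 20)
for entry in bus.log:
    print(entry)
print(p1.state["A"], p2.state["A"], bus.memory["A"])
```

At the end P1 holds the block as Invalid, P2 holds it as Modified with value 20, and memory still holds 10, matching the last row of the table.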
STATE DIAGRAM
Limitations of SMPs
Centralized resources in the system, such as the bus, become a bottleneck.
The bus must support both normal and coherence traffic.
As processor speed increases, the number of processors that can
be supported decreases.
How can a designer increase memory bandwidth?
Use multiple buses or interconnection networks.
Use multiple physical banks.
DIRECTORY BASED PROTOCOL
A directory keeps the state of every block that may be cached:
Which caches have copies of the block
Whether it is dirty
In a directory-based system, the data being shared is tracked in a
common directory that maintains coherence between the caches.
The directory acts as a filter through which a processor must ask
permission to load an entry from main memory into its cache.
When an entry is changed, the directory either updates or invalidates
the other caches holding that entry.
NUMA computers:
Messages have long latency.
Broadcast is also inefficient, since all messages have explicit responses.
The main memory controller keeps track of:
Which processors have cached copies of which memory locations.
On a write, only the processors holding a copy need to be informed, not everyone.
On a dirty read, the request is forwarded to the owner.
DIRECTORY BASED PROTOCOL (CONT.)
Shared - One or more processors have the block cached, and the
value in memory is up to date (as well as in all the caches).
Uncached - No processor has a copy of the cache block.
Modified - Exactly one processor has a copy of the cache block, and
it has written the block, so the memory copy is out of date. The
processor is called the owner of the block.
The directory must track which processors have the data when a block is in the shared state.
Usually implemented using a bit vector: a bit is 1 if the corresponding processor has a copy.
Writes to non-exclusive data are treated as write misses.
The processor blocks until the access completes.
Assume messages are received and acted upon in the order sent.
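The bit-vector bookkeeping mentioned above might look as follows. This is a minimal sketch for a single memory block, assuming processors are numbered from 0; the class name `DirectoryEntry` and its methods are hypothetical.

```python
class DirectoryEntry:
    """Sharer bit vector for one memory block (illustrative sketch)."""

    def __init__(self):
        self.sharers = 0  # bit i is 1 if processor i holds a copy

    def add_sharer(self, proc_id):
        self.sharers |= 1 << proc_id

    def remove_sharer(self, proc_id):
        self.sharers &= ~(1 << proc_id)

    def has_copy(self, proc_id):
        return bool(self.sharers & (1 << proc_id))

    def sharer_ids(self):
        """List the processors whose bit is set."""
        return [i for i in range(self.sharers.bit_length())
                if (self.sharers >> i) & 1]


entry = DirectoryEntry()
entry.add_sharer(0)
entry.add_sharer(3)
print(format(entry.sharers, "04b"), entry.sharer_ids())

entry.remove_sharer(0)
print(format(entry.sharers, "04b"), entry.sharer_ids())
```

One bit per processor keeps lookups and updates cheap, at the cost of directory storage that grows with the processor count.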
                    P1                   P2                   Bus                           Directory               Memory
Step                State    Addr  Val   State    Addr  Val   Action      Proc  Addr  Val   Addr  State   Procs     Val
P1 writes 10 to A   Excl     A     10                         Wr miss     P1    A           A     Excl    P1
P1 reads A          Excl     A     10
P2 reads A                               Shared   A     -     Rd miss     P2    A
                    Shared   A     10                         Fetch       P1    A     10                            10
                                         Shared   A     10    Data rd     P2    A     10    A     Shared  P1, P2    10
P2 writes 20 to A                        Excl     A     20    Wr miss     P2    A                                   10
                    Invalid                                   Invalidate  P1    A           A     Excl    P2        10
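The same sequence can be replayed against a directory instead of a bus. This is a minimal sketch, assuming one-word blocks, a single directory, and messages handled in the order sent; the class name `Directory`, the message labels, and the log format are hypothetical. The point to notice is that invalidations and fetches go only to the processors recorded in the directory, never to everyone.

```python
UNCACHED, SHARED, MODIFIED = "Uncached", "Shared", "Modified"


class Directory:
    def __init__(self, procs):
        self.memory = {}
        self.state = {}                       # address -> directory state
        self.sharers = {}                     # address -> set of processor names
        self.caches = {p: {} for p in procs}  # processor -> {address: value}
        self.log = []                         # (message, processor, address)

    def read(self, proc, addr):
        if addr in self.caches[proc]:         # cache hit: no messages
            return self.caches[proc][addr]
        self.log.append(("read miss", proc, addr))
        if self.state.get(addr, UNCACHED) == MODIFIED:
            (owner,) = self.sharers[addr]     # dirty read: forward to the owner
            self.log.append(("fetch", owner, addr))
            self.memory[addr] = self.caches[owner][addr]
        self.state[addr] = SHARED
        self.sharers.setdefault(addr, set()).add(proc)
        self.log.append(("data reply", proc, addr))
        self.caches[proc][addr] = self.memory.get(addr)
        return self.caches[proc][addr]

    def write(self, proc, addr, value):
        holders = self.sharers.get(addr, set())
        exclusive = self.state.get(addr) == MODIFIED and holders == {proc}
        if not exclusive:                     # write to non-exclusive data
            self.log.append(("write miss", proc, addr))
            for other in sorted(holders - {proc}):
                # Only processors listed in the directory are contacted.
                self.log.append(("invalidate", other, addr))
                self.caches[other].pop(addr, None)
            self.state[addr] = MODIFIED
            self.sharers[addr] = {proc}
        self.caches[proc][addr] = value       # memory copy becomes stale


d = Directory(["P1", "P2"])
d.write("P1", "A", 10)
d.read("P1", "A")
d.read("P2", "A")
d.write("P2", "A", 20)
for entry in d.log:
    print(entry)
print(d.state["A"], sorted(d.sharers["A"]), d.memory["A"])
```

The final directory entry records A as Modified with P2 as the owner, while memory still holds 10, as in the last row of the table.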
STATE DIAGRAM