Fully Associative
A memory block can go in any cache slot (many-to-many).
[Figure: memory blocks at decimal addresses 0, 4, 8, ..., 36 can each go to any of the 4 cache slots (slots 0-3).]
Slot selection:
- Check all tags (preferably simultaneously).
- Take a slot with V = 0 (a free slot).
- Otherwise select a slot according to some replacement strategy.
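To make the selection rule concrete, here is a minimal software sketch of a fully associative lookup and placement (the class and method names are ours, not from the slides; real hardware compares all tags in parallel):

class FullyAssociativeCache:
    def __init__(self, num_slots):
        # every slot holds a valid bit V and a tag
        self.slots = [{"valid": False, "tag": None} for _ in range(num_slots)]

    def lookup(self, tag):
        """Check all tags; return the slot index on a hit, None on a miss."""
        for i, slot in enumerate(self.slots):
            if slot["valid"] and slot["tag"] == tag:
                return i
        return None

    def place(self, tag, evict):
        """Prefer a free slot (V = 0); otherwise let the replacement
        strategy 'evict' (e.g. random or LRU) pick a victim slot."""
        for i, slot in enumerate(self.slots):
            if not slot["valid"]:
                self.slots[i] = {"valid": True, "tag": tag}
                return i
        victim = evict(self.slots)
        self.slots[victim] = {"valid": True, "tag": tag}
        return victim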
Set Associative
A memory block goes into a set, and can be placed anywhere within the set (many-to-some).
[Figure: 2-way set associative cache with two sets (0 and 1) of two slots each; memory blocks at decimal addresses 0, 4, 8, ..., 36 map to one of the two sets.]
Slot selection:
- Determine the set from the block address.
- In this set, take a free slot ...
- ... or evict a slot according to some replacement strategy.
Set Associative
Set = ((memory address) DIV (block size)) MOD (number of sets in cache)

Example: In which set does the block located at address 12 (decimal) go?
12 DIV 4 = 3 (block address)
3 MOD 2 = 1 (set 1)
Which slot within the set the block finally goes to depends on occupancy and the replacement strategy.
As with direct mapping, the low-order bits of the block address determine the destination set.
Block address fields: the m-bit block address is subdivided into a tag (the upper m - n bits) and a set index (the lower n bits); the byte offset within the block follows.
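The set formula and the example above can be checked directly; a minimal sketch (function and parameter names are ours):

def set_index(address, block_size, num_sets):
    """Set = ((memory address) DIV (block size)) MOD (number of sets)."""
    block_address = address // block_size
    return block_address % num_sets

# The example from the slide: address 12, block size 4, 2 sets.
assert set_index(12, block_size=4, num_sets=2) == 1   # block 3 -> set 1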
Set Associative
N-way set associative cache
N = number of slots per set, not the number of sets. N is a power of 2; common values are 2, 4, 8.

Extremes:
N = 1: There is only one slot per set, that is, each slot is a set. The set number (and thus the slot) is drawn from the block address. This is the direct mapped cache.
N = B: There is only one set, containing all slots (B = number of blocks in cache = number of slots). This is the fully associative cache.
Set Associative
AMD Opteron Cache
Two-way set associative
Figure from [HP06 p. C-13]
Set Associative
Opteron cache
Cache capacity: 64 kB in 64-byte blocks (1024 slots). The cache is two-way set associative: 512 sets of 2 cache lines each. Hardware: two arrays of 512 cache lines each, that is, each set has one cache line in array 1 and one in array 2. The physical address is 40 bits, divided into a 34-bit block address (subdivided into 25 tag bits and 9 index bits) and a 6-bit byte offset (step 1 in the figure).
Figure: The index selects the set (2^9 = 512), step 2. The two tags of the set are compared against the tag bits; the valid bit must be set for a hit (step 3). On a hit, the corresponding data is delivered from the winning input of a 2:1 multiplexer (step 4) and goes to the Data in port of the CPU. The victim buffer is needed when a cache line has to be written back to main memory (replacement).
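The 25/9/6 field split stated above can be mirrored with a little bit arithmetic; a sketch (the function name and the test address are ours):

def split_opteron_address(paddr):
    """Split a 40-bit physical address into (tag, index, offset)
    for the 64 kB two-way set associative cache with 64-byte blocks:
    25 tag bits | 9 index bits | 6 offset bits."""
    offset = paddr & 0x3F            # low 6 bits: byte within the 64-byte block
    index = (paddr >> 6) & 0x1FF     # next 9 bits: one of 512 sets
    tag = paddr >> 15                # remaining 25 bits
    return tag, index, offset

tag, index, offset = split_opteron_address(0x12_3456_789A)
print(f"tag={tag:#x} index={index} offset={offset}")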
Cache Organization
Where can block 12 go?
[Figure from [HP06 p. C-7]: where can memory block 12 go? Shown for a fully associative, a direct mapped, and a two-way set associative cache of 8 slots.]
Cache Organization
For the previous figure, assume blocks 12 and 20 are used very often. What is the problem?
Fully associative: No special problem. Both blocks can be stored in the cache at the same time.
Direct mapped: Problem! Only one of them can be stored at a time, since both map to the same slot: 12 mod 8 = 20 mod 8 = 4.
Set associative: No special problem. Both blocks can be stored in the same set at the same time.
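A tiny check of the mapping arithmetic, assuming the 8-slot cache of the figure and, for the set associative case, 2 ways (variable names are ours):

slots, ways = 8, 2
sets = slots // ways
for block in (12, 20):
    print(f"block {block}: direct mapped -> slot {block % slots}, "
          f"2-way set associative -> set {block % sets}")
# Both blocks land in slot 4 when direct mapped (conflict), but they
# share set 0, which has two slots, when 2-way set associative (no conflict).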
Cache Organization
Direct Mapped
- Hard allocation (no choice)
- Simple & inexpensive
- No replacement strategy required
- If a process uses two blocks mapping to the same slot, cache misses are high.

Fully Associative
- Full choice
- Expensive searching (hardware) for a free slot
- Replacement strategy required

Set Associative
- Compromise between direct mapped and fully associative
- Some choice
- Replacement strategy required
Replacement Strategies
Strategies for selecting a slot to evict (when necessary)
Random
Victim cache lines are selected randomly: a hardware pseudo-random number generator produces slot numbers.
Least-Recently Used (LRU)
Relies on the temporal locality principle: the least recently used block is hoped to have the smallest likelihood of (re)use. Expensive hardware (a software sketch follows after this list).
First-in, First out (FIFO)
Approximation of LRU by selecting the oldest block (oldest = load time).
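A minimal software model of LRU replacement (illustrative only; real caches implement LRU, or an approximation of it, in hardware):

from collections import OrderedDict

class LRUCache:
    """Software model of an LRU-replaced fully associative cache."""
    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.blocks = OrderedDict()          # block address -> data, oldest first

    def access(self, block):
        if block in self.blocks:             # hit: mark block most recently used
            self.blocks.move_to_end(block)
            return "hit"
        if len(self.blocks) >= self.num_slots:
            self.blocks.popitem(last=False)  # evict the least recently used block
        self.blocks[block] = None            # load the missed block
        return "miss"

cache = LRUCache(2)
print([cache.access(b) for b in [12, 20, 12, 4, 20]])
# ['miss', 'miss', 'hit', 'miss', 'miss'] -- 20 was evicted when 4 was loaded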
Replacement Strategies
Table: data cache misses per 1000 instructions for three replacement strategies and three cache capacities (data collected for the Alpha architecture, block size = 64 bytes; data from [HP06 p. C-10]):

Associativity  Strategy  16 kB   64 kB   256 kB
Two-way        LRU       114.1   103.4   92.2
Two-way        Random    117.3   104.3   92.1
Two-way        FIFO      115.5   103.9   92.5
Four-way       LRU       111.7   102.4   92.1
Four-way       Random    115.1   102.3   92.1
Four-way       FIFO      113.3   103.1   92.5
Eight-way      LRU       109.0    99.7   92.1
Eight-way      Random    111.8   100.5   92.1
Eight-way      FIFO      110.4   100.3   92.5

LRU is best for small caches; there is little difference between the strategies for large caches.
Miss Categories
Compulsory Misses
The very first access to a block can never hit in the cache, so the block must be loaded first. Also called cold-start misses.
Capacity Misses
Owing to the limited capacity of the cache, capacity misses will occur in addition to compulsory misses.
Conflict Misses
In set associative or direct mapped caches, too many blocks may map to the same set (or slot). Also called collision misses.
Coherency Misses
These are owing to cache flushes that keep multiple caches coherent in a multiprocessor. Not considered in this lecture (see the lecture Advanced Computer Architecture).
Cache Optimization
Average memory access time = hit time + miss rate × miss penalty
Reducing miss rate: larger block size, larger cache capacity, higher associativity.
Reducing miss penalty: multi-level caches, read over write.
Reducing hit time: avoiding address translation.
Block Size
The data area gets larger cache lines (but fewer of them); the overall cache capacity remains the same.
+ Reduced miss rate: taking advantage of spatial locality, more accesses will likely go to the same block.
- Increased miss penalty: more bytes have to be fetched from main memory.
- Increased conflict misses: the cache has fewer slots (per set).
- Increased capacity misses: only for small caches. In case of high locality (e.g. repeated accesses to only one byte in a block) the remaining bytes are unused and waste cache capacity.
Common block sizes are 32 ... 128 bytes
Block Size
[Figure: miss rate versus block size, one curve per cache capacity; from HP06 p. C-26.]
Block Size
For the previous figure, assume the memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Assume the hit time to be 1 clock cycle, independent of block size. Which block size has the smallest average memory access time?

Average memory access time = hit time + miss rate × miss penalty

4K cache, 16-byte block: average memory access time = 1 + (8.57 % × 82) = 8.027 clock cycles
4K cache, 32-byte block: average memory access time = 1 + (7.24 % × 84) = 7.082 clock cycles

... and so on for all cache sizes and block sizes (a sketch that reproduces these numbers follows below).
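A sketch of the calculation (function names are ours; the two miss rates are the ones read off the previous figure):

def miss_penalty(block_size):
    """80 cycles overhead, then 16 bytes per 2 cycles (from the example above)."""
    return 80 + 2 * (block_size // 16)

def amat(miss_rate, block_size, hit_time=1):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty(block_size)

# The two cases worked out above:
print(miss_penalty(16), amat(0.0857, 16))   # 82, ~8.027
print(miss_penalty(32), amat(0.0724, 32))   # 84, ~7.082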
Block Size
Average memory access time (in clock cycles) versus block size for four cache capacities; the best (smallest) time per cache capacity is marked green in the original:

Block size  Miss penalty   4K      16K    64K    256K
16          82             8.027   4.231  2.673  1.894
32          84             7.082   3.411  2.134  1.588
64          88             7.160   3.323  1.933  1.449
128         96             8.469   3.659  1.979  1.470
256         112            11.651  4.685  2.288  1.549
Cache Capacity
The cache is enlarged by adding more cache slots.
+ Reduced miss rate, owing to fewer capacity misses.
- Potentially increased hit time, owing to increased complexity.
- Increased hardware & power consumption.
Miss rates for block size 64 bytes (in parentheses: miss rate relative to the next smaller cache):

Cache capacity  Miss rate
4K              7.00 %
16K             2.64 %  (38 %)
64K             1.06 %  (40 %)
256K            0.51 %  (48 %)
Associativity
The higher the associativity, the more slots per set.

+ Reduced miss rate, primarily owing to fewer conflict misses.
- Increased hit time: time is needed for searching the slots of the set.
Rules of Thumb
Eight-way set associative is almost as effective as fully associative. A direct mapped cache with capacity N has about the same miss rate as a two-way set associative cache of capacity N/2.
Common associativities are 1 (direct mapped), 2, 4, 8
Associativity
Miss rate [%] versus degree of associativity for six cache capacities (data from [HP06 p. C-23]):

Cache   1-way  2-way  4-way  8-way
4K      9.8    7.6    7.1    7.1
8K      6.8    4.9    4.4    4.4
16K     4.9    4.1    4.1    4.1
64K     3.7    3.1    3.0    2.9
128K    2.1    1.9    1.9    1.9
512K    0.8    0.7    0.6    0.6
Multi-Level Caches
Building a cache hierarchy.
First-Level Cache (L1): small, high-speed cache, usually located in the CPU.
Second-Level Cache (L2): fast and bigger cache, located close to the CPU (chip set).
Third-Level Cache (L3), optional: separate memory chip between L2 and main memory.
CPU → L1 → L2 → L3 → main memory
Multi-Level Caches
Multi-level caches reduce the average miss penalty because, on a miss, the block can be fetched from the next higher level instead of from main memory.
Distinction between local and global miss rates:

Local miss rate = (number of cache misses) / (number of accesses to this cache), local to a cache (e.g. L1, L2, ...).

Global miss rate = (number of cache misses) / (number of memory references by the CPU), that is, local misses versus global references.
Multi-Level Caches
Example CH.1: Suppose that in 1000 memory references there are 40 misses in L1 and 20 misses in L2. What are the various miss rates?

Local miss rate L1 = global miss rate L1 = 40 / 1000 = 4 %. These 4 % go from L1 to L2.
Local miss rate L2 = 20 / 40 = 50 %.
Global miss rate L2 = 20 / 1000 = 2 %. These 2 % go from L2 to main memory.

The local miss rate of L2 is large because L1 skims the cream of the memory accesses.
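The same bookkeeping in a few lines (variable names are ours):

references = 1000          # memory references by the CPU
l1_misses = 40             # misses in L1 (these accesses go on to L2)
l2_misses = 20             # misses in L2 (these go on to main memory)

local_l1 = l1_misses / references       # L1 is accessed by every reference
local_l2 = l2_misses / l1_misses        # L2 only sees L1's misses
global_l2 = l2_misses / references

print(f"L1 local = global: {local_l1:.0%}")   # 4%
print(f"L2 local: {local_l2:.0%}")            # 50%
print(f"L2 global: {global_l2:.0%}")          # 2%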
Multi-Level Caches
Average memory access time = hit time_L1 + miss rate_L1 × miss penalty_L1

with miss penalty_L1 = hit time_L2 + local miss rate_L2 × miss penalty_L2, hence

Average memory access time = hit time_L1 + miss rate_L1 × (hit time_L2 + local miss rate_L2 × miss penalty_L2)
Multi-Level Caches
Using the miss rates from example CH.1 and the following data: hit time_L1 = 1 clock cycle, hit time_L2 = 10 clock cycles, miss penalty_L2 = 200 clock cycles, the average memory access time is

hit time_L1 + miss rate_L1 × (hit time_L2 + local miss rate_L2 × miss penalty_L2) = 1 + 0.04 × (10 + 0.5 × 200) = 5.4 clock cycles
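A quick cross-check of this calculation (function and parameter names are ours):

def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, penalty_l2):
    """AMAT = hit_L1 + miss_rate_L1 * (hit_L2 + local_miss_rate_L2 * penalty_L2)."""
    return hit_l1 + miss_rate_l1 * (hit_l2 + local_miss_rate_l2 * penalty_l2)

print(amat_two_level(1, 0.04, 10, 0.5, 200))   # 5.4 clock cycles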
Read over Write
Assume a direct mapped write-through cache with 512 slots, and a four-word write buffer that is not checked on a read miss.
SW R3, 512(R0)    ; store word: mem[512] := R3    (cache slot 0)
LW R1, 1024(R0)   ; load word:  R1 := mem[1024]   (cache slot 0)
LW R2, 512(R0)    ; load word:  R2 := mem[512]    (cache slot 0)

Read-after-write hazard: (1) the data in R3 is placed in the write buffer; (2) the first load causes a read miss, and the cache line is discarded; (3) the second load again causes a read miss. If the write buffer has not completed writing R3 to memory by then, the second load will read an incorrect value from mem[512].

[Figure: CPU, cache, write buffer, and memory.]
Read over Write
Solutions to the previous problem, giving reads priority over writes:

- Read misses wait until the write buffer is empty; only then is the required memory block fetched into the cache.
- Check the contents of the write buffer: if the referenced data is not in the buffer, let the read access continue and fetch the block into the cache. The write buffer is flushed later, when the memory system is available (a sketch of this check follows below).

Also applicable to write-back caches: the dirty block is put into a write buffer that allows inspection in case of a read miss, and read misses check the buffer before going directly to memory.
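A toy model of the second solution, inspecting the write buffer on a read miss (all names are ours; real hardware does this with comparators, not a dictionary):

class WriteBuffer:
    """Toy model of a write buffer that can be inspected on read misses."""
    def __init__(self):
        self.pending = {}                  # address -> value awaiting write-back

    def enqueue(self, address, value):
        self.pending[address] = value

    def read_check(self, address):
        """On a read miss, forward pending data instead of reading stale memory."""
        return self.pending.get(address)

buf = WriteBuffer()
buf.enqueue(512, 0xDEAD)                   # SW R3, 512(R0) buffered, not yet in memory
value = buf.read_check(512)                # LW R2, 512(R0) misses in the cache
print(hex(value) if value is not None else "safe to read memory")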
Address Translation
What addresses are cached, virtual or physical addresses?
A fully virtual cache uses logical (virtual) addresses only; a fully physical cache uses physical addresses only.
[Figure: in a virtual cache, the CPU's virtual address goes directly to the cache; only on a miss is it translated (segment tables / page tables / TLB) into a physical address for memory. In a physical cache, the virtual address is first translated, and the physical address is used both to access the cache and, on a miss, memory.]
Address Translation
Fully virtual cache:
+ No address translation time on a hit.
- The cache must hold copies of the protection information (protection info must otherwise be fetched from the page/segment tables).
- Cache flush on process switch: the same virtual address usually refers to different physical addresses in different processes.
- Shared memory: different virtual addresses refer to the same physical address, leading to copies of the same data in the cache.

Fully physical cache:
+ Behaves very well on shared memory accesses.
- Address translation is always needed (time); hits gain nothing regarding address translation.
Address Translation
Solution: get the best from both virtual and physical caches. There are two issues in accessing a cache:

- Indexing the cache, that is, calculating the target set (or slot, with direct mapping).
- Comparing tags, that is, comparing the tag field with (parts of) the block address.

The page offset (the part that is identical in both the virtual and the physical address space) is used to index the cache. In parallel, the virtual part of the address is translated into the physical address and used for tag comparison. Improved hit time. This is a virtually indexed, physically tagged cache.
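A sketch of the address handling in such a cache, assuming 4 kB pages, 64-byte blocks and 64 sets, so that the index fits entirely within the page offset (all constants and names are ours):

PAGE_OFFSET_BITS = 12     # 4 kB pages (assumption)
BLOCK_OFFSET_BITS = 6     # 64-byte blocks (assumption)
INDEX_BITS = 6            # 64 sets, so index + block offset <= page offset bits

def cache_index(vaddr):
    """Index with untranslated page-offset bits -- no TLB needed for this step."""
    return (vaddr >> BLOCK_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

def physical_tag(paddr):
    """The tag comes from the translated (physical) address."""
    return paddr >> PAGE_OFFSET_BITS

# In hardware, cache_index(vaddr) and the TLB lookup proceed in parallel;
# the physical tag is then compared against the tags stored in the indexed set.
vaddr, paddr = 0x0001_2345, 0x0FED_C345   # page offsets match (0x345)
print(cache_index(vaddr), hex(physical_tag(paddr)))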
Address Translation
Virtually indexed, physically tagged cache
[Figure: the CPU issues a virtual address consisting of page number and page offset. The page offset (word offset and index) addresses the cache's tag and data arrays, while in parallel the page number is translated (TLB, page table) into the frame address. The resulting physical address is compared against the stored tags; misses go to the next memory level.]
Cache Optimization
Summary of basic cache optimizations (data from [HP06 p. C-39]); + = improves the factor, - = hurts it, blank = no impact:

Technique               Hit time  Miss penalty  Miss rate  Complexity  Comment
Larger block size                 -             +          0           Trivial
Larger cache capacity   -                       +          1           Widely used for L2
Higher associativity    -                       +          1           Widely used
Multi-level caches                +                        2           Costly hardware; harder if L1 block size differs from L2 block size
Read over write                   +                        1           Widely used
Address translation     +                                  1           Widely used
Exam Computer Architecture (CA)

Date: March 9th, 2007 (09.03.2007)
Time: 8:30 hrs
Location: ST 025/118, Duisburg-Ruhrort!