Associative Memory

Associative memory, set associative memory, cache representations, memory hierarchy, cache

Fully Associative

A memory block can go in any cache slot (many-to-many).


[Figure: memory blocks at addresses 0, 4, 8, ..., 36 (decimal); each block has 4 placement choices among cache slots 0-3]

Slot selection (see the sketch below):
- check all tags (preferably simultaneously)
- take a slot with V = 0 (a free slot)
- otherwise select a slot according to some replacement strategy
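To make the selection rule concrete, here is a minimal software model of a fully associative lookup (a hypothetical Python sketch; class and field names are illustrative, and real hardware compares all tags in parallel rather than in a loop):

class FullyAssociativeCache:
    def __init__(self, num_slots):
        # each slot holds a valid bit V and a tag
        self.slots = [{"valid": False, "tag": None} for _ in range(num_slots)]

    def find_slot(self, tag, choose_victim):
        # 1. check all tags (hardware does this simultaneously)
        for slot in self.slots:
            if slot["valid"] and slot["tag"] == tag:
                return slot                     # hit
        # 2. take a slot with V = 0 (a free slot)
        for slot in self.slots:
            if not slot["valid"]:
                return slot
        # 3. otherwise select a victim per replacement strategy
        return choose_victim(self.slots)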

Set Associative
A memory block goes into a set, and can be placed anywhere within that set (many-to-some).
[Figure: memory blocks at addresses 0, 4, 8, ..., 36 (decimal) mapping into a 2-way set associative cache; the slots are grouped into sets 0 and 1]

Slot selection:
- determine the set from the block address
- in this set, take a free slot ...
- ... or evict a slot according to some replacement strategy

Set Associative
Set = ((memory address) DIV (block size)) MOD (sets in cache)

Example: into which set does the block located at address 12 (decimal) go, given a block size of 4 and 2 sets? 12 DIV 4 = 3 (block address), 3 MOD 2 = 1, so the block goes into set 1.
Which slot within the set the block finally occupies depends on occupancy and the replacement strategy.

As with direct mapping, the low-order bits of the block address determine the destination set.
[Figure: address layout - the m-bit block address is divided into m-n tag bits and n set-index bits, followed by the byte offset]
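The set calculation is plain integer arithmetic; this small Python sketch (the function name is illustrative) reproduces the example above:

def set_index(address, block_size, num_sets):
    block_address = address // block_size    # DIV
    return block_address % num_sets          # MOD

print(set_index(12, 4, 2))    # 12 DIV 4 = 3, 3 MOD 2 = 1  ->  set 1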

Set Associative
N-way set associative cache

N = number of slots per set, not the number of sets. N is a power of 2; common values are 2, 4, 8.

Extremes:
N = 1: there is only one slot per set, that is, each slot is a set. The set number (and thus the slot) is derived from the block address. This is the direct mapped cache.
N = B: there is only one set, containing all slots (B = number of blocks in the cache = number of slots). This is the fully associative cache.

Set Associative

AMD Opteron Cache

[Figure: the two-way set associative AMD Opteron data cache, from HP06 p. C-13]

Set Associative
Opteron cache

Cache capacity: 64 kB in 64-byte blocks (1024 slots). The cache is two-way set associative: 512 sets × 2 cache lines. Hardware: two arrays of 512 cache lines each, that is, each set has one cache line in array 1 and one in array 2. The physical address is 40 bits, divided into a 34-bit block address (subdivided into 25 tag bits and 9 index bits) and a 6-bit byte offset (c).

Figure: the index selects the set (2^9 = 512), see (d). The two tags of the set are compared against the tag bits; the valid bit must be set for a hit (e). On a hit, the corresponding data is delivered via the winning input of a 2:1 multiplexer (f) and goes to the Data-in port of the CPU. The victim buffer is needed when a cache line has to be written back to main memory (replacement).
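The address split can be checked with a few shifts and masks (a Python sketch using the bit widths from the slide; the example address is arbitrary):

OFFSET_BITS = 6     # 64-byte blocks
INDEX_BITS  = 9     # 512 sets

def split_address(paddr):
    # 40-bit physical address -> 25-bit tag, 9-bit index, 6-bit offset
    offset = paddr & ((1 << OFFSET_BITS) - 1)
    index  = (paddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = paddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12_3456_789A)
# index selects one of 2^9 = 512 sets; tag is compared against both ways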


Cache Organization
Where can block 12 go?
[Figure: possible placements of memory block 12 in an 8-slot cache under fully associative, direct mapped, and set associative organization; from HP06 p. C-7]

Cache Organization
For the previous figure, assume blocks 12 and 20 are both used very often. What is the problem?

Fully associative: no special problem. Both blocks can be stored in the cache at the same time.
Direct mapped: problem! Only one of them can be stored at a time, since both map to the same slot: 12 mod 8 = 20 mod 8 = 4.
Set associative: no special problem. Both blocks can be stored in the same set at the same time.


Cache Organization
Direct Mapped
- hard allocation (no choice)
- simple & inexpensive
- no replacement strategy required
- if a process uses two blocks mapping to the same slot, cache misses are high

Fully Associative
- full choice
- expensive (hardware) search for a free slot
- replacement strategy required

Set Associative
- compromise between direct mapped and fully associative
- some choice
- replacement strategy required

Replacement Strategies
Strategies for selecting a slot to evict (when necessary)

Random
Victim cache lines are selected randomly; a hardware pseudo-random number generator produces the slot numbers.

Least-Recently Used (LRU)
Relies on the temporal-locality principle: the least recently used block is hoped to have the smallest likelihood of (re)use. Expensive in hardware. A minimal software sketch follows below.

First-In, First-Out (FIFO)
Approximation of LRU that selects the oldest block (oldest = load time).
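A minimal software sketch of LRU within one set (a hypothetical Python model; hardware tracks recency with per-set use bits or counters instead of an ordered dictionary):

from collections import OrderedDict

class LRUSet:
    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.lines = OrderedDict()          # tag -> block, least recent first

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)     # hit: mark most recently used
            return "hit"
        if len(self.lines) >= self.num_ways:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[tag] = object()          # load the block
        return "miss"

A FIFO set would simply omit the move_to_end call, so the eviction order depends only on load time.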


Replacement Strategies
Data cache misses per 1000 instructions (Alpha architecture, block size = 64 bytes; data from [HP06 p. C-10]):

                 16 kB                 64 kB                256 kB
             LRU    Rand   FIFO    LRU    Rand   FIFO    LRU   Rand   FIFO
Two-way     114.1  117.3  115.5  103.4  104.3  103.9   92.2   92.1   92.5
Four-way    111.7  115.1  113.3  102.4  102.3  103.1   92.1   92.1   92.5
Eight-way   109.0  111.8  110.4   99.7  100.5  100.3   92.1   92.1   92.5

LRU is best for small caches; for large caches there is little difference between the strategies.

Miss Categories
Compulsory Misses
The very first access to a block cannot hit in the cache, so the block must be loaded. Also called cold-start misses.

Capacity Misses
Owing to the limited capacity of the cache, capacity misses will occur in addition to compulsory misses.

Conflict Misses
In set associative or direct mapped caches too many blocks may map to the same set (or slot). Also called collision misses.

Coherency Misses
Due to cache flushes that keep multiple caches coherent in a multiprocessor. Not considered in this lecture (see the lecture Advanced Computer Architecture).

Cache Optimization
Average memory access time = hit time + miss rate × miss penalty

Reducing miss rate:
- larger block size
- larger cache capacity
- higher associativity

Reducing miss penalty:
- multi-level caches
- read over write

Reducing hit time:
- avoiding address translation


Block Size
The data area gets larger cache lines (but fewer lines); the overall cache capacity remains the same.

- Reduced miss rate: takes advantage of spatial locality, since more accesses will likely go to the same block.
- Increased miss penalty: more bytes have to be fetched from main memory.
- Increased conflict misses: the cache has fewer slots (per set).
- Increased capacity misses: only for small caches. When spatial locality is poor (e.g. repeated accesses to only one byte in a block), the remaining bytes are unused and waste cache capacity.

Common block sizes are 32 ... 128 bytes.



Block Size

[Figure: miss rate versus block size for several cache capacities, from HP06 p. C-26]

Block Size
For the previous figure, assume the memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Assume the hit time is 1 clock cycle, independent of block size. Which block size has the smallest average memory access time?

Average memory access time = hit time + miss rate × miss penalty

4K cache, 16-byte blocks: average memory access time = 1 + (8.57 % × 82) = 8.027 clock cycles
4K cache, 32-byte blocks: average memory access time = 1 + (7.24 % × 84) = 7.082 clock cycles
... and so on for all cache sizes and block sizes.
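The remaining table entries follow the same pattern; a short Python sketch (the miss rates are the two values quoted above, the memory-system parameters are those assumed in the example):

def miss_penalty(block_size):
    # 80 cycles overhead + 2 cycles per 16 bytes delivered
    return 80 + 2 * (block_size // 16)

def amat(miss_rate, block_size, hit_time=1):
    return hit_time + miss_rate * miss_penalty(block_size)

print(amat(0.0857, 16))   # 1 + 0.0857 * 82 = 8.027 (4K cache, 16-byte blocks)
print(amat(0.0724, 32))   # 1 + 0.0724 * 84 = 7.082 (4K cache, 32-byte blocks)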

Block Size
Average memory access time (in clock cycles) versus block size, for 4 cache capacities:

Block size   Miss penalty     4K      16K     64K     256K
    16            82         8.027   4.231   2.673   1.894
    32            84         7.082   3.411   2.134   1.588
    64            88         7.160   3.323   1.933   1.449
   128            96         8.469   3.659   1.979   1.470
   256           112        11.651   4.685   2.288   1.549

Best (smallest) access time per cache capacity (marked green in the original): 32-byte blocks for the 4K cache; 64-byte blocks for the 16K, 64K and 256K caches.

Cache Capacity
The cache is enlarged by adding more cache slots.

- Reduced miss rate, owing to fewer capacity misses
- Potentially increased hit time, owing to increased complexity
- Increased hardware effort & power consumption

Miss rates for block size 64 bytes:

Cache capacity     4K       16K      64K      256K
Miss rate        7.00 %   2.64 %   1.06 %   0.51 %

(Each quadrupling of the capacity leaves the miss rate at roughly 38 %, 40 % and 48 % of the previous value.)

Associativity
The higher the associativity, the more slots per set.

- Reduced miss rate, primarily owing to fewer conflict misses
- Increased hit time, owing to the time needed to search the slots of a set

Rules of Thumb:
- Eight-way set associative is almost as effective as fully associative.
- A direct mapped cache of capacity N has about the same miss rate as a two-way set associative cache of capacity N/2.
- Common associativities are 1 (direct mapped), 2, 4, 8.

Associativity
Miss rate [%] versus degree of associativity and cache capacity (data from [HP06 p. C-23]):

Degree    4K     8K     16K    64K    128K   512K
1-way     9.8    6.8    4.9    3.7    2.1    0.8
2-way     7.6    4.9    4.1    3.1    1.9    0.7
4-way     7.1    4.4    4.1    3.0    1.9    0.6
8-way     7.1    4.4    4.1    2.9    1.9    0.6


Associativity

[Figure: hit time versus degree of associativity]

Multi-Level Caches
Building a cache hierarchy.

First-level cache (L1): small, high-speed cache, usually located in the CPU.

Second-level cache (L2): fast and bigger cache, located close to the CPU (chip set).

Third-level cache (L3), optional: separate memory chip between L2 and main memory.

[Figure: CPU - L1 - L2 - L3 - main memory]


Multi-Level Caches
Multi-level caches reduce the average miss penalty because, on a miss, the block can be fetched from the next cache level instead of from main memory.

Distinction between local and global miss rates:

Local miss rate = (number of cache misses) / (number of accesses to this cache)
- local to a cache (e.g. L1, L2, ...)

Global miss rate = (number of cache misses) / (number of memory references by the CPU)
- local misses versus global references

Multi-Level Caches
Example CH.1: Suppose that in 1000 memory references there are 40 misses in L1 and 20 misses in L2. What are the various miss rates?

Local miss rate(L1) = global miss rate(L1) = 40 / 1000 = 4 %. These 4 % go from L1 to L2.
Local miss rate(L2) = 20 / 40 = 50 %.
Global miss rate(L2) = 20 / 1000 = 2 %. These 2 % go from L2 to main memory.

[Figure: CPU - L1 - L2 - main memory]

Local miss rate L2 is large because L1 skims the cream of memory accesses.
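The same numbers in a tiny Python sketch (variable names are illustrative):

references, l1_misses, l2_misses = 1000, 40, 20

local_l1  = l1_misses / references    # L1 sees every CPU reference: 4 %
global_l1 = l1_misses / references    # identical to the local rate for L1
local_l2  = l2_misses / l1_misses     # L2 sees only L1's misses: 50 %
global_l2 = l2_misses / references    # 2 % of all references go to memory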


Multi-Level Caches
Average memory access time = hit time(L1) + miss rate(L1) × miss penalty(L1)

where miss penalty(L1) = hit time(L2) + local miss rate(L2) × miss penalty(L2), so:

Average memory access time = hit time(L1) + miss rate(L1) × (hit time(L2) + local miss rate(L2) × miss penalty(L2))

[Figure: CPU - L1 - L2 - main memory]


Multi-Level Caches
Using the miss rates from example CH.1 and the following data: hit time(L1) = 1 clock cycle, hit time(L2) = 10 clock cycles, miss penalty(L2) = 200 clock cycles, the average memory access time is

hit time(L1) + miss rate(L1) × (hit time(L2) + local miss rate(L2) × miss penalty(L2)) = 1 + 0.04 × (10 + 0.5 × 200) = 5.4 clock cycles
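As a quick check, the same computation as a Python sketch:

hit_l1, miss_rate_l1 = 1, 0.04
hit_l2, local_miss_rate_l2, miss_penalty_l2 = 10, 0.5, 200

miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2   # 110 cycles
amat = hit_l1 + miss_rate_l1 * miss_penalty_l1
print(amat)    # 5.4 clock cycles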


Read over Write


Assume a direct mapped write-through cache with 512 slots, and a four-word write buffer that is not checked on a read miss.

(1) SW R3, 512(R0)    ; store word: mem[512] := R3    (cache slot 0)
(2) LW R1, 1024(R0)   ; load word:  R1 := mem[1024]   (cache slot 0)
(3) LW R2, 512(R0)    ; load word:  R2 := mem[512]    (cache slot 0)

Read-after-write hazard: at (1), the data in R3 is placed in the write buffer. (2) causes a read miss, and the cache line in slot 0 is discarded. (3) again causes a read miss. If the write buffer has not yet completed writing R3 to memory, (3) will read an incorrect (stale) value from mem[512].

[Figure: CPU - cache - write buffer - memory]

Read over Write


Solutions to the previous problem (giving reads priority over writes):

- Read misses wait until the write buffer is empty; thereafter the required memory block is fetched into the cache.
- Check the contents of the write buffer: if the referenced data is not in the buffer, let the read access continue and fetch the block into the cache. The write buffer is flushed later, when the memory system is available. A sketch of this check follows below.

This is also applicable to write-back caches: the dirty block is put into a write buffer that allows inspection in case of a read miss, and read misses check the buffer before going directly to memory.
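A minimal software model of the second solution (a hypothetical Python sketch; the entry layout and names are illustrative):

class WriteBuffer:
    def __init__(self, memory, capacity=4):
        self.memory = memory
        self.capacity = capacity
        self.entries = []                  # (address, data), oldest first

    def add(self, address, data):
        if len(self.entries) >= self.capacity:
            self.drain_one()               # buffer full: retire oldest write
        self.entries.append((address, data))

    def drain_one(self):
        address, data = self.entries.pop(0)
        self.memory[address] = data        # the delayed write to memory

    def lookup(self, address):
        # newest matching entry wins if the address was written twice
        for addr, data in reversed(self.entries):
            if addr == address:
                return data
        return None

def read_miss(address, write_buffer, memory):
    data = write_buffer.lookup(address)    # check the buffer first ...
    if data is not None:
        return data                        # ... forwarding avoids the hazard
    return memory.get(address, 0)          # otherwise fetch from memory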


Address Translation
Which addresses are cached, virtual or physical addresses?

A fully virtual cache uses logical (virtual) addresses only; a fully physical cache uses physical addresses only.

[Figure: virtual cache - the CPU's virtual address indexes the cache directly; on a miss, the virtual address is translated (segment tables / page tables / TLB) into a physical address for memory]

[Figure: physical cache - the CPU's virtual address is first translated into a physical address, which then indexes the cache and, on a miss, memory]

Address Translation
Fully virtual cache:
- No address-translation time on a hit.
- The cache must hold copies of the protection information, which must be fetched from the page/segment tables.
- Cache flush on process switch: the same virtual address usually refers to different physical addresses in different processes.
- Shared memory: different virtual addresses refer to the same physical address, so copies of the same data can end up in the cache.

Fully physical cache:
- Works very well with shared-memory accesses.
- Address translation is always needed (costs time); hits gain no advantage regarding address translation.


Address Translation
Solution: get the best from both virtual and physical caches. There are two issues in accessing a cache:

- Indexing the cache, that is, calculating the target set (or slot, with direct mapping)
- Comparing tags, that is, comparing the tag field with (parts of) the block address

The page offset (the part of the address that is identical in the virtual and physical address spaces) is used to index the cache. In parallel, the virtual part of the address is translated into the physical address and used for the tag comparison. This improves hit time: a virtually indexed, physically tagged cache.
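A sketch of such a lookup (Python; the bit widths and the translate() helper are hypothetical stand-ins for the TLB/page-table hardware — the constraint is that index bits plus block-offset bits fit within the page offset):

PAGE_OFFSET_BITS = 12          # 4 kB pages (assumed)
OFFSET_BITS, INDEX_BITS = 6, 6 # 6 + 6 = 12: index lies inside the page offset

def vipt_lookup(vaddr, cache, translate):
    # index with untranslated bits (identical in virtual and physical address)
    index = (vaddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    candidate_set = cache[index]          # read tags while translating ...
    paddr = translate(vaddr)              # ... in hardware this runs in parallel
    phys_tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    for line in candidate_set:
        if line["valid"] and line["tag"] == phys_tag:
            return line["data"]           # hit: physically tagged comparison
    return None                           # miss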

Address Translation
Virtually indexed, physically tagged cache

[Figure: the CPU's virtual address is split into page number and page offset; the page offset (index + word offset) selects the cache set and its tags/data, while the page number is translated (TLB, page table) into the frame address of the physical address, which is compared against the stored tags; on a miss, the physical address goes to the next memory level]

Cache Optimization
Summary of basic cache optimizations (data from [HP06 p. C-39]):

Technique               Hit time   Miss penalty   Miss rate   Complexity   Comment
Larger block size                       −             +            0       Trivial
Larger cache capacity       −                         +            1       Widely used for L2
Higher associativity        −                         +            1       Widely used
Multi-level caches                      +                          2       Costly hardware; harder if L1 block size ≠ L2 block size
Read over write                         +                          1       Widely used
Address translation         +                                      1       Widely used

+ = improves a factor, − = hurts a factor, blank = no impact

Exam Computer Architecture


Date: March 9th, 2007 (09.03.2007)
Time: 8:30 hrs
Location: ST 025/118, Duisburg-Ruhrort
