COMPUTER ARCHITECTURE CE2013
Faculty of Computer Science and Engineering
Department of Computer Engineering
Vo Tan Phuong
http://www.cse.hcmut.edu.vn/~vtphuong
Chapter 5
Memory
Presentation Outline
 Random Access Memory and its Structure
 Memory Hierarchy and the need for Cache Memory
 The Basics of Caches
 Cache Performance and Memory Stall Cycles
 Improving Cache Performance
 Multilevel Caches
Random Access Memory
 Large arrays of storage cells
 Volatile memory
 Hold the stored data as long as it is powered on
 Random Access
 Access time is practically the same to any data on a RAM chip
 Output Enable (OE) control signal
 Specifies read operation
 Write Enable (WE) control signal
 Specifies write operation
 2^n × m RAM chip: n-bit address and m-bit data
Memory Technology
 Static RAM (SRAM) for Cache
  Requires 6 transistors per bit
  Requires low power to retain bit
 Dynamic RAM (DRAM) for Main Memory
  One transistor + capacitor per bit
  Must be re-written after being read
  Must also be periodically refreshed
  Each row can be refreshed simultaneously
  Address lines are multiplexed
   Upper half of address: Row Access Strobe (RAS)
   Lower half of address: Column Access Strobe (CAS)
Static RAM Storage Cell
 Static RAM (SRAM): fast but expensive RAM
  6-Transistor cell with no static current
  Typically used for caches
  Provides fast access time
 Cell Implementation:
  Cross-coupled inverters store the bit
  Two pass transistors connect the cell to the bit lines
 Row decoder selects the word line
 Pass transistors enable the cell to be read and written
[Figure: typical SRAM cell with word line, Vcc, and complementary bit lines]
Dynamic RAM Storage Cell
 Dynamic RAM (DRAM): slow, cheap, and dense memory
  Typical choice for main memory
 Cell Implementation:
  1-Transistor cell (pass transistor)
  Trench capacitor (stores the bit)
 Bit is stored as a charge on the capacitor
 Must be refreshed periodically
  Because of leakage of charge from the tiny capacitor
 Refreshing is done for all memory rows
  Reading each row and writing it back to restore the charge
[Figure: typical DRAM cell with word line, pass transistor, capacitor, and bit line]
Typical DRAM Packaging
Legend
  Ai: Address bit i
  CAS: Column address strobe
  Dj: Data bit j
  NC: No connection
  OE: Output enable
  RAS: Row address strobe
  WE: Write enable
 24-pin dual in-line package for 16 Mbit = 2^22 × 4 memory
 22-bit address is divided into
  11-bit row address
  11-bit column address
  Interleaved on the same address lines
[Figure: 24-pin DIP pinout; pins 1-12: Vcc, D1, D2, WE, RAS, NC, A10, A0-A3, Vcc; pins 13-24: Vss, A4-A9, OE, CAS, D3, D4, Vss]
Typical Memory Structure
 Row decoder
  Selects the row to read/write, using the r-bit row address
 Cell Matrix
  2D array of tiny memory cells: 2^r × 2^c × m bits
 Row latch holds the selected row: 2^c × m bits
 Column decoder
  Selects the column to read/write, using the c-bit column address
 Sense/Write amplifiers
  Sense & amplify data on read
  Drive bit lines with data in on write
 The same m data lines are used for data in/out
[Figure: memory array organized as row decoder, cell matrix, row latch, column decoder, and sense/write amplifiers]
DRAM Operation
 Row Access (RAS)
  Latch and decode the row address to enable the addressed row
  Small change in voltage detected by the sense amplifiers
  Latch the whole row of bits
  Sense amplifiers drive the bit lines to recharge the storage cells
 Column Access (CAS): read and write operations
  Latch and decode the column address to select m bits
  m = 4, 8, 16, or 32 bits depending on the DRAM package
  On read, send latched bits out to the chip pins
  On write, charge storage cells to the required value
  Can perform multiple column accesses to the same row (burst mode)
Burst Mode Operation
 Row address is latched and decoded
 A read operation causes all cells in a selected row to be read
 Selected row is latched internally inside the SDRAM chip
 Column address is latched and decoded
 Selected column data is placed in the data output register
 Column address is incremented automatically
 Multiple data items are read depending on the block length
 Block Transfer
  Fast transfer of blocks between memory and cache
  Fast transfer of pages between memory and disk
Trends in DRAM
Year Produced | Chip size | Type  | Row access | Column access | Cycle Time (New Request)
1980          | 64 Kbit   | DRAM  | 170 ns     | 75 ns         | 250 ns
1983          | 256 Kbit  | DRAM  | 150 ns     | 50 ns         | 220 ns
1986          | 1 Mbit    | DRAM  | 120 ns     | 25 ns         | 190 ns
1989          | 4 Mbit    | DRAM  | 100 ns     | 20 ns         | 165 ns
1992          | 16 Mbit   | DRAM  | 80 ns      | 15 ns         | 120 ns
1996          | 64 Mbit   | SDRAM | 70 ns      | 12 ns         | 110 ns
1998          | 128 Mbit  | SDRAM | 70 ns      | 10 ns         | 100 ns
2000          | 256 Mbit  | DDR1  | 65 ns      | 7 ns          | 90 ns
2002          | 512 Mbit  | DDR1  | 60 ns      | 5 ns          | 80 ns
2004          | 1 Gbit    | DDR2  | 55 ns      | 5 ns          | 70 ns
2006          | 2 Gbit    | DDR2  | 50 ns      | 3 ns          | 60 ns
2010          | 4 Gbit    | DDR3  | 35 ns      | 1 ns          | 37 ns
2012          | 8 Gbit    | DDR3  | 30 ns      | 0.5 ns        | 31 ns
SDRAM and DDR SDRAM
 SDRAM is Synchronous Dynamic RAM
  Added a clock to the DRAM interface
  SDRAM is synchronous with the system clock
  Older DRAM technologies were asynchronous
  As the system bus clock improved, SDRAM delivered higher performance than asynchronous DRAM
 DDR is Double Data Rate SDRAM
  Like SDRAM, DDR is synchronous with the system clock, but DDR transfers data on both the rising and falling edges of the clock signal
Transfer Rates & Peak Bandwidth
Standard Name | Memory Bus Clock | Millions of Transfers per second | Module Name | Peak Bandwidth
DDR-200       | 100 MHz          | 200 MT/s                         | PC-1600     | 1600 MB/s
DDR-333       | 167 MHz          | 333 MT/s                         | PC-2700     | 2667 MB/s
DDR-400       | 200 MHz          | 400 MT/s                         | PC-3200     | 3200 MB/s
DDR2-667      | 333 MHz          | 667 MT/s                         | PC-5300     | 5333 MB/s
DDR2-800      | 400 MHz          | 800 MT/s                         | PC-6400     | 6400 MB/s
DDR2-1066     | 533 MHz          | 1066 MT/s                        | PC-8500     | 8533 MB/s
DDR3-1066     | 533 MHz          | 1066 MT/s                        | PC-8500     | 8533 MB/s
DDR3-1333     | 667 MHz          | 1333 MT/s                        | PC-10600    | 10667 MB/s
DDR3-1600     | 800 MHz          | 1600 MT/s                        | PC-12800    | 12800 MB/s
DDR4-3200     | 1600 MHz         | 3200 MT/s                        | PC-25600    | 25600 MB/s
 1 Transfer = 64 bits = 8 bytes of data
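Since each transfer moves 8 bytes, the Peak Bandwidth column is just the transfer rate times eight. A minimal sketch of that arithmetic (the function name and the sampled module names are ours, for illustration only):

```python
# Peak bandwidth of a DDR module = transfers per second x 8 bytes per transfer.
def peak_bandwidth_mb_per_s(mega_transfers_per_s):
    """Each transfer moves 64 bits = 8 bytes."""
    return mega_transfers_per_s * 8          # MB/s

for name, mt_s in [("DDR-400", 400), ("DDR3-1600", 1600), ("DDR4-3200", 3200)]:
    print(f"{name}: {peak_bandwidth_mb_per_s(mt_s)} MB/s")
# DDR-400: 3200 MB/s, DDR3-1600: 12800 MB/s, DDR4-3200: 25600 MB/s
```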
DRAM Refresh Cycles
 Refresh cycle is about tens of milliseconds
 Refreshing is done for the entire memory
  Each row is read and written back to restore the charge
  Some of the memory bandwidth is lost to refresh cycles
[Figure: stored cell voltage vs. time; a written 1 decays toward the threshold voltage and is restored to the full voltage by each refresh cycle]
Expanding the Data Bus Width
 Memory chips typically have a narrow data bus
 We can expand the data bus width by a factor of p
 Use p RAM chips and feed the same address to all chips  Use the same Output Enable and Write Enable control signals
 Data width = m × p bits
[Figure: p RAM chips in parallel, all fed the same Address, OE, and WE signals; each chip contributes m data bits]
Next . . .
 Random Access Memory and its Structure
 Memory Hierarchy and the need for Cache Memory
 The Basics of Caches
 Cache Performance and Memory Stall Cycles
 Improving Cache Performance
 Multilevel Caches
Processor-Memory Performance Gap
 CPU performance: 55% per year, slowing down after 2004
 DRAM performance: 7% per year
 The result is a widening processor-memory performance gap
 1980: no cache in microprocessors
 1995: two-level cache on the microprocessor
The Need for Cache Memory
 Widening speed gap between CPU and main memory
  Processor operation takes less than 1 ns
  Main memory requires more than 50 ns to access
 Each instruction involves at least one memory access
  One memory access to fetch the instruction
  A second memory access for load and store instructions
 Memory bandwidth limits the instruction execution rate
 Cache memory can help bridge the CPU-memory gap
 Cache memory is small in size but fast
Typical Memory Hierarchy
 Registers are at the top of the hierarchy
  Typical size < 1 KB
  Access time < 0.5 ns
 Level 1 Cache (8 - 64 KB)
  Access time: ~1 ns
 L2 Cache (512 KB - 8 MB)
  Access time: 3 - 10 ns
 Main Memory (4 - 16 GB)
  Access time: 50 - 100 ns
 Disk Storage (> 200 GB)
  Access time: 5 - 10 ms
 Levels closer to the processor are faster; levels farther away are bigger
[Figure: microprocessor with registers and L1 cache, L2 cache, memory bus, main memory, I/O bus, magnetic or flash disk]
Principle of Locality of Reference
 At any time, only a small set of instructions & data is needed
  Programs access a small portion of their address space
 Temporal Locality (in time)
  If an item is accessed, it will probably be accessed again soon
  Same loop instructions are fetched each iteration
  Same procedure may be called and executed many times
 Spatial Locality (in space)
  Tendency to access contiguous instructions/data in memory
  Sequential execution of instructions
  Traversing arrays element by element
What is a Cache Memory ?
 Small and fast (SRAM) memory technology
  Stores the subset of instructions & data currently being accessed
 Used to reduce the average access time to memory
 Caches exploit temporal locality by
  Keeping recently accessed data closer to the processor
 Caches exploit spatial locality by
  Moving blocks consisting of multiple contiguous words
 Goal is to achieve
  Fast speed of cache memory access
  Balance the cost of the memory system
Cache Memories in the Datapath
 An I-Cache miss or D-Cache miss causes the pipeline to stall
 On a miss, the block address is sent over the interface to the L2 cache or main memory, and the missing instruction block or data block is transferred into the cache
[Figure: pipelined datapath with the I-Cache at the instruction fetch stage and the D-Cache at the memory stage; register file, ALU, and write-back stages in between; block addresses and instruction/data blocks are exchanged with the L2 cache or main memory on misses]
Almost Everything is a Cache !
 In computer architecture, almost everything is a cache!
 Registers: a cache on variables (software managed)
 First-level cache: a cache on the second-level cache
 Second-level cache: a cache on memory
 Memory: a cache on the hard disk
  Stores recent programs and their data
  Hard disk can be viewed as an extension to main memory
 Branch target and prediction buffer
  Cache on branch target and prediction information
Next . . .
 Random Access Memory and its Structure
 Memory Hierarchy and the need for Cache Memory
 The Basics of Caches
 Cache Performance and Memory Stall Cycles
 Improving Cache Performance
 Multilevel Caches
Four Basic Questions on Caches
 Q1: Where can a block be placed in a cache?
  Block placement: Direct Mapped, Set Associative, Fully Associative
 Q2: How is a block found in a cache?
  Block identification: block address, tag, index
 Q3: Which block should be replaced on a miss?
  Block replacement: FIFO, Random, LRU
 Q4: What happens on a write?
  Write strategy: Write Back or Write Through (with Write Buffer)
Block Placement: Direct Mapped
 Block: unit of data transfer between cache and memory
 Direct Mapped Cache:
  A block can be placed in exactly one location in the cache
  In this example: cache index = least significant 3 bits of the memory address
[Figure: direct-mapped cache with 8 blocks (indices 000-111) and a main memory of 32 blocks (addresses 00000-11111); each memory block maps to the cache block selected by its 3 low-order address bits]
Direct-Mapped Cache
 A memory address is divided into | Tag | Index | offset |
  Block address: identifies the block in memory
  Block offset: to access bytes within a block
 A block address is further divided into
  Index: used for direct cache access
  Tag: most-significant bits of the block address
 Index = Block Address mod Number of Cache Blocks
 The tag must also be stored inside the cache
  For block identification
 A valid bit is also required to indicate
  Whether a cache block is valid or not
[Figure: cache array of (V, Tag, Block Data) entries; the stored tag is compared against the address tag to produce Hit, and the selected block supplies Data]
Direct Mapped Cache - cont'd
 Index is used to access the cache block
 Cache hit: the block is stored inside the cache
  Address tag is compared against the stored tag
  If equal and the cache block is valid, then hit
  Otherwise: cache miss
 If the number of cache blocks is 2^n
  n bits are used for the cache index
 If the number of bytes in a block is 2^b
  b bits are used for the block offset
 If 32 bits are used for an address
  32 - n - b bits are used for the tag
 Cache data size = 2^(n+b) bytes
Mapping an Address to a Cache Block
 Consider a direct-mapped cache with 256 blocks
  Block size = 16 bytes
 Example: compute the tag, index, and byte offset of address 0x01FFF8AC (see the sketch below)
 Solution
  The 32-bit address is divided into: 20-bit Tag | 8-bit Index | 4-bit offset
  4-bit byte offset field, because block size = 2^4 = 16 bytes
  8-bit cache index, because there are 2^8 = 256 blocks in the cache
  20-bit tag field
  Byte offset = 0xC = 12 (least significant 4 bits of the address)
  Cache index = 0x8A = 138 (next 8 bits of the address)
  Tag = 0x01FFF (upper 20 bits of the address)
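The same decomposition can be reproduced with a few shifts and masks. A minimal sketch of the worked example above, assuming the 20/8/4-bit field split from this slide:

```python
# Decompose a 32-bit address for a direct-mapped cache with 256 blocks of 16 bytes.
addr = 0x01FFF8AC
offset_bits, index_bits = 4, 8               # 16-byte blocks, 256 blocks

offset = addr & ((1 << offset_bits) - 1)                  # low 4 bits
index = (addr >> offset_bits) & ((1 << index_bits) - 1)   # next 8 bits
tag = addr >> (offset_bits + index_bits)                  # upper 20 bits

print(hex(tag), hex(index), hex(offset))     # 0x1fff 0x8a 0xc
```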
Example on Cache Placement & Misses
 Consider a small direct-mapped cache with 32 blocks
  Cache is initially empty, block size = 16 bytes
  The following memory addresses (in decimal) are referenced: 1000, 1004, 1008, 2548, 2552, 2556
  Map the addresses to cache blocks and indicate whether each is a hit or a miss (see the sketch below)
 Solution:
  The address is divided into: 23-bit Tag | 5-bit Index | 4-bit offset
  1000 = 0x3E8   cache index = 0x1E   Miss (first access)
  1004 = 0x3EC   cache index = 0x1E   Hit
  1008 = 0x3F0   cache index = 0x1F   Miss (first access)
  2548 = 0x9F4   cache index = 0x1F   Miss (different tag)
  2552 = 0x9F8   cache index = 0x1F   Hit
  2556 = 0x9FC   cache index = 0x1F   Hit
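A minimal sketch that replays the reference stream above through a 32-block direct-mapped cache; the small dictionary-based simulation is our own illustration, not part of the slides:

```python
# Direct-mapped cache simulation: 32 blocks, 16-byte blocks (5 index bits, 4 offset bits).
cache = {}                                   # index -> tag of the block currently stored
for addr in [1000, 1004, 1008, 2548, 2552, 2556]:
    index = (addr >> 4) & 0x1F               # next 5 bits after the 4-bit offset
    tag = addr >> 9                          # remaining upper bits
    hit = cache.get(index) == tag
    cache[index] = tag                       # allocate/replace the block on a miss
    print(f"{addr}: index={index:#04x} {'Hit' if hit else 'Miss'}")
```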
Fully Associative Cache
 A block can be placed anywhere in the cache: no indexing
 The address is divided into | Tag | offset | only
 If m blocks exist, then
  m comparators are needed to match the tag
  Cache data size = m × 2^b bytes
[Figure: m-way associative array of (V, Tag, Block Data) entries; all m tags are compared in parallel and a mux selects the hitting block to produce Hit and Data]
Set-Associative Cache
 A set is a group of blocks that can be indexed
 A block is first mapped onto a set
  Set index = Block address mod Number of sets in cache
 If there are m blocks in a set (m-way set associative), then
  m tags are checked in parallel using m comparators
 If 2^n sets exist, then the set index consists of n bits
 Cache data size = m × 2^(n+b) bytes, with 2^b bytes per block (see the sketch below)
  Without counting tags and valid bits
 A direct-mapped cache has one block per set (m = 1)
 A fully-associative cache has one set (2^n = 1, or n = 0)
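A minimal sketch of the set-indexing and sizing formulas above; the concrete cache parameters and the example address are our own assumptions:

```python
# m-way set-associative mapping: set index = block address mod number of sets.
block_bytes, num_sets, ways = 64, 128, 4     # 2^b = 64 bytes, 2^n = 128 sets, m = 4

addr = 0x2A749C00                            # example address (assumed)
block_address = addr // block_bytes
set_index = block_address % num_sets         # which set the block maps to

data_size = ways * num_sets * block_bytes    # m * 2^(n+b) bytes = 32 KB here
print(set_index, data_size)
```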
Set-Associative Cache Diagram
[Figure: m-way set-associative cache; the address is split into Tag | Index | offset; the index selects one set, the m (V, Tag, Block Data) ways of that set are compared in parallel, and a mux selects the hitting way to produce Hit and Data]
Write Policy
 Write Through:
  Writes update both the cache and the lower-level memory
  Cache control bit: only a Valid bit is needed
  Memory always has the latest data, which simplifies data coherency
  Can always discard cached data when a block is replaced
 Write Back:
  Writes update the cache only
  Cache control bits: Valid and Modified bits are required
  Modified cached data is written back to memory when replaced
  Multiple writes to a cache block require only one write to memory
  Uses less memory bandwidth than write-through, and less power
  However, more complex to implement than write-through
Write Miss Policy
 What happens on a write miss?
 Write Allocate:
  Allocate a new block in the cache
  Write miss acts like a read miss: the block is fetched and then updated
 No Write Allocate:
  Send data to lower-level memory
  Cache is not modified
 Typically, write-back caches use write allocate
  Hoping subsequent writes will be captured in the cache
 Write-through caches often use no-write-allocate
  Reasoning: writes must still go to lower-level memory
Write Buffer
 Decouples the CPU write from the memory bus writing
  Permits writes to occur without stall cycles until the buffer is full
 Write-through: all stores are sent to lower-level memory
  Write buffer eliminates processor stalls on consecutive writes
 Write-back: modified blocks are written when replaced
  Write buffer is used for evicted blocks that must be written back
 The address and modified data are written in the buffer
  The write is finished from the CPU perspective
  CPU continues while the write buffer prepares to write memory
  If the buffer is full, the CPU stalls until the buffer has an empty entry
What Happens on a Cache Miss?
 Cache sends a miss signal to stall the processor
 Decide which cache block to allocate/replace
  Only one choice when the cache is direct-mapped
  Multiple choices for a set-associative or fully-associative cache
 Transfer the block from lower-level memory to this cache
  Set the valid bit and the tag field from the upper address bits
 If the block to be replaced is modified, then write it back
  The modified block is moved into a Write Buffer
  Otherwise, the block to be replaced can simply be discarded
 Restart the instruction that caused the cache miss
 Miss Penalty: clock cycles to process a cache miss
Replacement Policy
 Which block should be replaced on a cache miss?
  No selection alternatives for direct-mapped caches
  m blocks per set to choose from for associative caches
 Random replacement
  Candidate blocks are randomly selected
  One counter for all sets (0 to m-1): incremented on every cycle
  On a cache miss, replace the block specified by the counter
 First In First Out (FIFO) replacement
  Replace the oldest block in the set
  One counter per set (0 to m-1): specifies the oldest block to replace
  Counter is incremented on a cache miss
Replacement Policy - cont'd
 Least Recently Used (LRU)
  Replace the block that has been unused for the longest time
  Order blocks within a set from least to most recently used
  Update the ordering of blocks on each cache hit (see the sketch below)
 With m blocks per set, there are m! possible permutations
  m = 2: there are only 2 permutations (a single bit is needed)
  m = 4: there are 4! = 24 possible permutations
  Pure LRU is too costly to implement when m > 2
  LRU approximation is used in practice
 For large m > 4, Random replacement can be as effective as LRU
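One way to picture the LRU bookkeeping is a per-set list ordered from least to most recently used, with a block moved to the most-recently-used end on every hit. This is a minimal software sketch of the idea, not how hardware implements it (hardware uses the approximations mentioned above); the 4-way set size is an assumption:

```python
# LRU bookkeeping for one m-way set: list ordered from least to most recently used.
def access(set_lru, tag, ways=4):
    """Return True on a hit; update recency order; evict the LRU block on a miss."""
    if tag in set_lru:                       # hit: move the tag to the MRU end
        set_lru.remove(tag)
        set_lru.append(tag)
        return True
    if len(set_lru) == ways:                 # set full: evict the least recently used
        set_lru.pop(0)
    set_lru.append(tag)                      # allocate the new block as MRU
    return False

s = []
for t in ["A", "B", "C", "A", "D", "E", "B"]:
    print(t, access(s, t), s)
```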
Comparing Random, FIFO, and LRU
 Data cache misses per 1000 instructions
  10 SPEC2000 benchmarks on an Alpha processor
  Block size of 64 bytes
 LRU and FIFO outperform Random for a small cache
 Little difference between LRU and Random for a large cache
 LRU is expensive for large associativity (# blocks per set)
 Random is the simplest to implement in hardware

Size   | 2-way: LRU / Rand / FIFO | 4-way: LRU / Rand / FIFO | 8-way: LRU / Rand / FIFO
16 KB  | 114.1 / 117.3 / 115.5    | 111.7 / 115.1 / 113.3    | 109.0 / 111.8 / 110.4
64 KB  | 103.4 / 104.3 / 103.9    | 102.4 / 102.3 / 103.1    | 99.7 / 100.5 / 100.3
256 KB | 92.2 / 92.1 / 92.5       | 92.1 / 92.1 / 92.5       | 92.1 / 92.1 / 92.5
Next . . .
 Random Access Memory and its Structure
 Memory Hierarchy and the need for Cache Memory
 The Basics of Caches
 Cache Performance and Memory Stall Cycles
 Improving Cache Performance
 Multilevel Caches
Hit Rate and Miss Rate
 Hit Rate = Hits / (Hits + Misses)
 Miss Rate = Misses / (Hits + Misses)
 I-Cache Miss Rate = miss rate in the Instruction Cache
 D-Cache Miss Rate = miss rate in the Data Cache
 Example (see the sketch below):
  Out of 1000 instructions fetched, 150 missed in the I-Cache
  25% are load-store instructions, and 50 missed in the D-Cache
  What are the I-cache and D-cache miss rates?
  I-Cache Miss Rate = 150 / 1000 = 15%
  D-Cache Miss Rate = 50 / (25% × 1000) = 50 / 250 = 20%
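A minimal sketch of the miss-rate arithmetic in the example above (variable names are ours):

```python
# Miss rates from the example: 1000 instructions fetched, 25% are loads/stores.
instructions = 1000
i_cache_misses = 150
ls_fraction = 0.25
d_cache_misses = 50

i_miss_rate = i_cache_misses / instructions                  # 0.15 = 15%
d_miss_rate = d_cache_misses / (ls_fraction * instructions)  # 50 / 250 = 0.20 = 20%
print(i_miss_rate, d_miss_rate)
```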
Memory Stall Cycles
 The processor stalls on a cache miss
  When fetching instructions from the Instruction Cache (I-cache)
  When loading or storing data into the Data Cache (D-cache)
 Memory stall cycles = Combined Misses × Miss Penalty
  Miss Penalty: clock cycles to process a cache miss
 Combined Misses = I-Cache Misses + D-Cache Misses
  I-Cache Misses = I-Count × I-Cache Miss Rate
  D-Cache Misses = LS-Count × D-Cache Miss Rate
  LS-Count (Load & Store) = I-Count × LS Frequency
 Cache misses are often reported per thousand instructions
Memory Stall Cycles Per Instruction
 Memory Stall Cycles Per Instruction = Combined Misses Per Instruction × Miss Penalty
  Miss Penalty is assumed equal for I-cache & D-cache
  Miss Penalty is assumed equal for Load and Store
 Combined Misses Per Instruction = I-Cache Miss Rate + LS Frequency × D-Cache Miss Rate
 Therefore, Memory Stall Cycles Per Instruction =
  I-Cache Miss Rate × Miss Penalty + LS Frequency × D-Cache Miss Rate × Miss Penalty
Example on Memory Stall Cycles
 Consider a program with the given characteristics
  Instruction count (I-Count) = 10^6 instructions
  30% of instructions are loads and stores
  D-cache miss rate is 5% and I-cache miss rate is 1%
  Miss penalty is 100 clock cycles for the instruction and data caches
  Compute the combined misses per instruction and the memory stall cycles (see the sketch below)
 Combined misses per instruction in the I-Cache and D-Cache
  1% + 30% × 5% = 0.025 combined misses per instruction
  Equal to 25 misses per 1000 instructions
 Memory stall cycles
  0.025 × 100 (miss penalty) = 2.5 stall cycles per instruction
  Total memory stall cycles = 10^6 × 2.5 = 2,500,000
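The same numbers fall out of the formulas from the previous slides. A minimal sketch using the example's parameters:

```python
# Memory stall cycles for the example program.
i_count = 10**6
ls_freq = 0.30
i_miss_rate, d_miss_rate = 0.01, 0.05
miss_penalty = 100

misses_per_instr = i_miss_rate + ls_freq * d_miss_rate   # 0.025
stalls_per_instr = misses_per_instr * miss_penalty       # 2.5 cycles per instruction
total_stalls = i_count * stalls_per_instr                # 2,500,000 cycles
print(misses_per_instr, stalls_per_instr, total_stalls)
```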
CPU Time with Memory Stall Cycles
 CPU Time = I-Count × CPI_MemoryStalls × Clock Cycle
 CPI_MemoryStalls = CPI_PerfectCache + Memory Stall Cycles per Instruction
  CPI_PerfectCache = CPI for an ideal cache (no cache misses)
  CPI_MemoryStalls = CPI in the presence of memory stalls
  Memory stall cycles increase the CPI
Example on CPI with Memory Stalls
 A processor has a CPI of 1.5 without any memory stalls
  Cache miss rate is 2% for instructions and 5% for data
  20% of instructions are loads and stores
  Cache miss penalty is 100 clock cycles for the I-cache and D-cache
 What is the impact on the CPI? (see the sketch below)
 Answer:
  Mem Stalls per Instruction = 0.02 × 100 (instruction) + 0.2 × 0.05 × 100 (data) = 3
  CPI_MemoryStalls = 1.5 + 3 = 4.5 cycles per instruction
  CPI_MemoryStalls / CPI_PerfectCache = 4.5 / 1.5 = 3
  The processor is 3 times slower due to memory stall cycles
  CPI_NoCache = 1.5 + (1 + 0.2) × 100 = 121.5 (a lot worse)
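A minimal sketch of the CPI calculation above:

```python
# Impact of memory stalls on CPI.
cpi_perfect = 1.5
i_miss_rate, d_miss_rate = 0.02, 0.05
ls_freq, miss_penalty = 0.20, 100

stalls_per_instr = i_miss_rate * miss_penalty + ls_freq * d_miss_rate * miss_penalty  # 3
cpi_with_stalls = cpi_perfect + stalls_per_instr             # 4.5
slowdown = cpi_with_stalls / cpi_perfect                     # 3x slower
cpi_no_cache = cpi_perfect + (1 + ls_freq) * miss_penalty    # 121.5
print(cpi_with_stalls, slowdown, cpi_no_cache)
```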
Average Memory Access Time
 Average Memory Access Time (AMAT)
  Time to access a cache, accounting for both hits and misses
  AMAT = Hit time + Miss rate × Miss penalty
 Example: find the AMAT for a cache with
  Cache access time (Hit time) of 1 cycle = 2 ns
  Miss penalty of 20 clock cycles
  Miss rate of 0.05 per access
 Solution (see the sketch below):
  AMAT = 1 + 0.05 × 20 = 2 cycles = 4 ns
  Without the cache, AMAT would equal the Miss penalty = 20 cycles
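A minimal sketch of the AMAT computation above:

```python
# AMAT = Hit time + Miss rate * Miss penalty (all measured in clock cycles here).
hit_time, miss_rate, miss_penalty = 1, 0.05, 20
cycle_ns = 2                                  # 1 cycle = 2 ns in this example

amat_cycles = hit_time + miss_rate * miss_penalty           # 2 cycles
print(amat_cycles, "cycles =", amat_cycles * cycle_ns, "ns") # 4 ns
```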
Next . . .
 Random Access Memory and its Structure
 Memory Hierarchy and the need for Cache Memory
 The Basics of Caches
 Cache Performance and Memory Stall Cycles
 Improving Cache Performance
 Multilevel Caches
Improving Cache Performance
 Average Memory Access Time (AMAT)
  AMAT = Hit time + Miss rate × Miss penalty
  Used as a framework for optimizations
 Reduce the Hit time
  Small and simple caches
 Reduce the Miss Rate
  Larger cache size, higher associativity, and larger block size
 Reduce the Miss Penalty
  Multilevel caches
Small and Simple Caches
 Fast clock rates demand small and simple L1 cache designs
  Hit time is critical: it affects the processor clock cycle
  A small cache reduces the indexing time and hit time
  Indexing a cache represents a time-consuming portion
  Tag comparison also adds to this hit time
 Direct-mapped overlaps the tag check with the data transfer
  An associative cache uses an additional mux and increases hit time
 The size of L1 caches has not increased much
  L1 caches are the same size on the Alpha 21264 and 21364
  Same also on UltraSparc II and III, AMD K6 and Athlon
  Reduced from 16 KB in the Pentium III to 8 KB in the Pentium 4
Classifying Misses - Three Cs
 Conditions under which misses occur:
 Compulsory: the program starts with no blocks in the cache
  Also called cold-start misses
  Misses that would occur even if the cache had infinite size
 Capacity: misses happen because the cache size is finite
  Blocks are replaced and then later retrieved
  Misses that would occur in a fully associative cache of finite size
 Conflict: misses happen because of limited associativity
  Limited number of blocks per set
  Non-optimal replacement algorithm
Classifying Misses - cont'd
 Compulsory misses are independent of cache size
  Very small for long-running programs
 Capacity misses decrease as capacity increases
 Conflict misses decrease as associativity increases
 Data were collected using LRU replacement
[Figure: miss rate (0-14%) vs. cache size (up to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, broken down into compulsory, capacity, and conflict components]
Larger Size and Higher Associativity
 Increasing the cache size reduces capacity misses
  It also reduces conflict misses
  A larger cache size spreads out references to more blocks
 Drawbacks: longer hit time and higher cost
 Larger caches are especially popular as 2nd-level caches
 Higher associativity also improves miss rates
  Eight-way set associative is as effective as fully associative
Larger Block Size
 The simplest way to reduce the miss rate is to increase the block size
  Larger blocks reduce compulsory misses
  However, they increase conflict misses if the cache is small
 64-byte blocks are common in L1 caches
 128-byte blocks are common in L2 caches
[Figure: miss rate vs. block size (16-256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K]
Next . . .
 Random Access Memory and its Structure
 Memory Hierarchy and the need for Cache Memory
 The Basics of Caches
 Cache Performance and Memory Stall Cycles
 Improving Cache Performance
 Multilevel Caches
Multilevel Caches
 The top-level cache should be kept small to keep pace with the processor speed
 Adding another cache level
  Can reduce the memory gap
  Can reduce memory bus loading
 Typical organization: split L1 (I-Cache and D-Cache), a unified L2 cache, and main memory
 Local miss rate
  Number of misses in a cache / Memory accesses to this cache
  Miss Rate_L1 for the L1 cache, and Miss Rate_L2 for the L2 cache
 Global miss rate (see the sketch below)
  Number of misses in a cache / Memory accesses generated by the CPU
  Miss Rate_L1 for the L1 cache, and Miss Rate_L1 × Miss Rate_L2 for the L2 cache
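A minimal sketch of how local and global miss rates combine in a two-level cache; the 4% and 20% miss rates are our own illustrative assumptions, not values from the slides:

```python
# Local vs. global miss rates in a two-level cache (illustrative values).
miss_rate_l1 = 0.04                  # L1 local miss rate (equals its global miss rate)
miss_rate_l2_local = 0.20            # misses in L2 / accesses that reach L2

miss_rate_l2_global = miss_rate_l1 * miss_rate_l2_local   # misses in L2 / CPU accesses
print(miss_rate_l2_global)           # 0.008: 0.8% of all CPU accesses miss in both levels
```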
Power 7 On-Chip Caches [IBM 2010]
 32 KB I-Cache/core and 32 KB D-Cache/core
  3-cycle latency
 256 KB unified L2 cache/core
  8-cycle latency
 32 MB unified shared L3 cache
  Embedded DRAM
  25-cycle latency to the local slice
Multilevel Cache Policies
 Multilevel Inclusion
  L1 cache data is always present in the L2 cache
  A miss in L1 but a hit in L2 copies the block from L2 to L1
  A miss in both L1 and L2 brings a block into both L1 and L2
  A write in L1 causes data to be written in both L1 and L2
  Typically, a write-through policy is used from L1 to L2
  Typically, a write-back policy is used from L2 to main memory
   To reduce traffic on the memory bus
  A replacement or invalidation in L2 must be propagated to L1
Multilevel Cache Policies - cont'd
 Multilevel Exclusion
  L1 data is never found in the L2 cache: prevents wasting space
  A cache miss in L1 but a hit in L2 results in a swap of blocks
  A cache miss in both L1 and L2 brings the block into L1 only
  A block replaced in L1 is moved into L2
  Example: AMD Athlon
 Same or different block size in L1 and L2 caches
  Choosing a larger block size in L2 can improve performance
  However, different block sizes complicate the implementation
  Pentium 4 has 64-byte blocks in L1 and 128-byte blocks in L2