12 Caches Notes
12 Caches Notes
CS 3410
Computer System Organization & Programming
These slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, and Sirer.
Programs 101
C Code MIPS Assembly
int main (int argc, char* argv[ ]) { main: addiu $sp,$sp,-48
int i; sw $31,44($sp)
int m = n; sw $fp,40($sp)
int sum = 0; move $fp,$sp
sw $4,48($fp)
for (i = 1; i <= m; i++) { sw $5,52($fp)
sum += i; la $2,n
} lw $2,0($2)
printf (“...”, n, sum); sw $2,28($fp)
} sw $0,32($fp)
li $2,1
sw $2,24($fp)
Load/Store Architectures: $L2: lw $2,24($fp)
• Read data from memory lw
slt
$3,28($fp)
$2,$3,$2
(put in registers) bne $2,$0,$L3
. . .
• Manipulate it
• Store it back to memory Instructions that read from
or write to memory…
2
1 Cycle Per Stage: the Biggest Lie (So Far)
Code Stored in Memory
(also, data and stack)
compute
jump/branch
targets
A
memory
register ALU
D
file
B
+4
addr
PC din dout
inst
control
M
B
memory
imm
extend
new
forward
pc detect Stack, Data, Code
unit
hazard Stored in Memory
Write-
ctrl
ctrl
ctrl
Instruction Instruction
Fetch Decode Execute Memory Back
5
The Need for Speed
CPU Pipeline
Instruction speeds:
• add,sub,shift: 1 cycle
• mult: 3 cycles
• load/store: 100 cycles
off-chip 50(-70)ns
2(-3) GHz processor 0.5 ns clock
6
The Need for Speed
CPU Pipeline
7
What’s the solution?
Caches !
Level 1
Data $
Level 2 $
Level 1
Insn $
9
What’s the solution?
Caches !
Level 1
Data $
Level 2 $
Level 1
Insn $
What lucky
data gets to go
here?
Intel Pentium 3, 1999 10
Locality Locality Locality
11
Clicker Questions
This highlights the 1
2
total = 0;
for (i = 0; i < n; i++) {
temporal and spatial 3 n--;
locality of data. 4 total += a[i];
5 return total;
Last Called
Speed Dial
Favorites
Contacts
Google/Facebook/email
16
Your life is full of Locality
17
The Memory Hierarchy
Small, Fast
1 cycle,
Registers 128 bytes
4 cycles,
L1 Caches 64 KB
12 cycles,
L2 Cache 256 KB
36 cycles,
L3 Cache 2-20 MB
50-70 ns, Big, Slow
Main Memory 512 MB – 4 GB
5-20 ms
Disk 16GB – 4 TB,
L1 Caches
I$ D$
L2 Cache
L2
L3 Cache
Main
Main Memory
Memory
Disk
Disk
21
Multi-Core Memory Hierarchy
ON CHIP
I$ D$ I$ D$ I$ D$ I$ D$
L2 L2 L2 L2
L3
Main Memory
Disk
22
Memory Hierarchy by the Numbers
CPU clock rates ~0.33ns – 2ns (3GHz-500MHz)
24
MEMORY
16 Byte Memory addr
0000
data
A
0001 B
0010 C
0011 D
load 1100 r1 0100 E
0101 F
0110 G
0111 H
1000 J
• Byte-addressable memory 1001 K
• 4 address bits 16 bytes total 1010 L
1011 M
• b addr bits 2b bytes in memory 1100 N
1101 O
1110 P
1111 Q 25
MEMORY
4-Byte, Direct Mapped Cache addr data
0000 A
CACHE 0001 B
index index data 0010 C
XXXX 00 A Cache entry 0011 D
01 B = row 0100 E
10 C = (cache) line 0101 F
11 D = (cache) block 0110 G
Block Size: 1 byte 0111 H
1000 J
1001 K
Direct mapped: 1010 L
• Each address maps to 1 cache block 1011 M
• 4 entries 2 index bits (2n n bits) 1100 N
Index with LSB: 1101 O
1110 P
• Supports spatial locality 1111 Q 26
Analogy to a Spice Rack
Spice Rack Spice Wall
(Cache) (Memory)
index spice
A
B
C
D
E
F
…
Z
28
MEMORY
4-Byte, Direct Mapped addr data
Cache 0000 A
0001 B
tag|index 0010 C
CACHE
XXXX 0011 D
index tag data 0100 E
00 00 A 0101 F
01 00 B 0110 G
10 00 C 0111 H
11 00 D 1000 J
1001 K
Tag: minimalist label/address 1010 L
address = tag + index 1011 M
1100 N
1101 O
1110 P
1111 Q 29
MEMORY
4-Byte, Direct Mapped addr data
Cache 0000 A
0001 B
0010 C
CACHE
0011 D
index V tag data 0100 E
00 0 00 X 0101 F
01 0 00 X 0110 G
10 0 00 X 0111 H
11 0 00 X 1000 J
1001 K
One last tweak: valid bit 1010 L
1011 M
1100 N
1101 O
1110 P
1111 Q 30
MEMORY
Simulation #1 addr data
of a 4-byte, DM Cache 0000 A
0001 B
tag|index 0010 C
CACHE
XXXX 0011 D
index V tag data 0100 E
00 0 11 X 0101 F
01 0 11 X 0110 G
10 0 11 X 0111 H
11 0 11 X 1000 J
1001 K
load 1100 Miss Lookup: 1010 L
• Index into $ 1011 M
• Check tag 1100 N
• Check valid bit 1101 O
1110 P
1111 Q 31
MEMORY
Simulation #1 addr data
of a 4-byte, DM Cache 0000 A
0001 B
tag|index 0010 C
CACHE
XXXX 0011 D
index V tag data 0100 E
00 1 11 N 0101 F
01 0 xx X 0110 G
10 0 xx X 0111 H
11 0 xx X 1000 J
1001 K
load 1100 Miss Lookup: 1010 L
• Index into $ 1011 M
• Check tag 1100 N
• Check valid bit 1101 O
1110 P
1111 Q 32
MEMORY
Simulation #1 addr data
of a 4-byte, DM Cache 0000 A
0001 B
tag|index 0010 C
CACHE
XXXX 0011 D
index V tag data 0100 E
00 1 11 N 0101 F
01 0 11 X 0110 G
10 0 11 X 0111 H
11 0 11 X 1000 J
1001 K
load 1100 Miss Lookup: 1010 L
... • Index into $ 1011 M
load 1100 Hit!
• Check tag 1100 N
• Check valid bit 1101 O
Awesome! 1110 P
1111 Q 33
Block Diagram
4-entry, direct mapped Cache
tag|index
CACHE
1101
V tag data
2 2 1 00 1111 0000
1 11 1010 0101
0 01 1010 1010 Great!
1 11 0000 0000 Are we done?
2 8
= 1010 0101
data
Hit!
36
MEMORY
Simulation #2: addr data
4-byte, DM Cache 0000 A
0001 B
Clicker: CACHE
0010 C
A) Hit 0011 D
index V tag data 0100 E
B) Miss 00 0 11 X 0101 F
01 0 11 X 0110 G
10 0 11 X 0111 H
11 0 11 X 1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 • Index into $ 1011 M
load 0100 • Check tag 1100 N
load 1100 • Check valid bit 1101 O
1110 P
1111 Q 37
MEMORY
Simulation #2: addr data
4-byte, DM Cache 0000 A
0001 B
0010 C
CACHE
0011 D
index V tag data 0100 E
00 1 11 N 0101 F
01 0 11 X 0110 G
10 0 11 X 0111 H
11 0 11 X 1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 • Index into $ 1011 M
load 0100 • Check tag 1100 N
load 1100 • Check valid bit 1101 O
1110 P
1111 Q 38
MEMORY
Simulation #2: addr data
4-byte, DM Cache 0000 A
0001 B
Clicker: CACHE
0010 C
A) Hit 0011 D
index V tag data 0100 E
B) Miss 00 1 11 N 0101 F
01 0 11 X 0110 G
10 0 11 X 0111 H
11 0 11 X 1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 Miss • Index into $ 1011 M
load 0100 • Check tag 1100 N
load 1100 • Check valid bit 1101 O
1110 P
1111 Q 39
MEMORY
Simulation #2: addr data
4-byte, DM Cache 0000 A
0001 B
tag|index 0010 C
CACHE
XXXX 0011 D
index V tag data 0100 E
00 1 11 N 0101 F
01 1 11 O 0110 G
10 0 11 X 0111 H
11 0 11 X 1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 Miss • Index into $ 1011 M
load 0100 • Check tag 1100 N
load 1100 • Check valid bit 1101 O
1110 P
1111 Q 40
MEMORY
Simulation #2: addr data
4-byte, DM Cache 0000 A
0001 B
Clicker: CACHE
0010 C
A) Hit 0011 D
index V tag data 0100 E
B) Miss 00 1 11 N 0101 F
01 1 11 O 0110 G
10 0 xx X 0111 H
11 0 xx X 1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 Miss • Index into $ 1011 M
load 0100 Miss
• Check tag 1100 N
load 1100 • Check valid bit 1101 O
1110 P
1111 Q 41
MEMORY
Simulation #2: addr data
4-byte, DM Cache 0000 A
0001 B
0010 C
CACHE
0011 D
index V tag data 0100 E
00 1 01 E 0101 F
01 1 11 O 0110 G
10 0 11 X 0111 H
11 0 11 X 1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 Miss • Index into $ 1011 M
load 0100 Miss • Check tag 1100 N
load 1100 • Check valid bit 1101 O
1110 P
1111 Q 42
MEMORY
Simulation #2: addr data
4-byte, DM Cache 0000 A
0001 B
Clicker: CACHE
0010 C
A) Hit 0011 D
index V tag data 0100 E
B) Miss 00 1 01 E 0101 F
01 1 11 O 0110 G
10 0 11 X 0111 H
11 0 11 X 1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 Miss • Index into $ 1011 M
load 0100 Miss • Check tag 1100 N
load 1100 Miss
• Check valid bit 1101 O
1110 P
1111 Q 43
MEMORY
Simulation #2: addr data
4-byte, DM Cache 0000 A
0001 B
tag|index 0010 C
CACHE
XXXX 0011 D
index V tag data 0100 E
00 1 11 N 0101 F
01 1 11 O 0110 G
10 0 11 X 0111 H
11 0 11 X 1000 J
1001 K
load 1100 Miss cold
Disappointed! 1010 L
load 1101 Miss cold 1011 M
load
load
0100
1100
Miss
Miss
cold
1100
1101
N
O
1110 P
1111 Q 44
Reducing Cold Misses
by Increasing Block Size
Leveraging Spatial Locality
45
MEMORY
Increasing Block Size addr
0000
data
A
CACHE 0001 B
offset 0010 C
index V tag data 0011 D
XXXX
00 0 x A | B 0100 E
01 0 x C | D 0101 F
10 0 x E | F 0110 G
11 0 x G | H 0111 H
1000 J
• Block Size: 2 bytes 1001 K
54
MEMORY
8 byte, fully-associative addr data
Cache 0000 A
0001 B
XXXX
0010 C
tag|offset
offset
0011 D
CACHE 0100 E
0101 F
V tag data V tag data V tag data V tag data 0110 G
0 xxx X | X 0 xxx X | X 0 xxx X | X 0 xxx X | X 0111 H
1000 J
What should the offset be? Clicker: 1001 K
What should the index be? A) xxxx 1010 L
B) xxxx 1011 M
What should the tag be? C) xxxx 1100 N
1101 O
D) xxxx
1110 P
E) None 1111 Q 55
MEMORY
Simulation #4: addr data
8-byte, FA Cache 0000 A
0001 B
XXXX
0010 C
tag|offset
0011 D
CACHE 0100 E
0101 F
V tag data V tag data V tag data V tag data 0110 G
0 xxx X | X 0 xxx X | X 0 xxx X | X 0 xxx X | X 0111 H
1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 • Index into $ 1011 M
load 0100 • Check tags 1100 N
load 1100 • Check valid bits 1101 O
1110 P
LRU Pointer 1111 Q 56
MEMORY
Simulation #4: addr data
8-byte, FA Cache 0000 A
0001 B
XXXX
0010 C
tag|offset
0011 D
CACHE 0100 E
0101 F
V tag data V tag data V tag data V tag data 0110 G
1 110 N | O 0 xxx X | X 0 xxx X | X 0 xxx X | X 0111 H
1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 Hit! • Index into $ 1011 M
load 0100 • Check tags 1100 N
load 1100 • Check valid bits 1101 O
1110 P
1111 Q 57
MEMORY
Simulation #4: addr data
8-byte, FA Cache 0000 A
0001 B
XXXX
0010 C
tag|offset
0011 D
CACHE 0100 E
0101 F
V tag data V tag data V tag data V tag data 0110 G
1 110 N | O 0 xxx X | X 0 xxx X | X 0 xxx X | X 0111 H
1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 Hit! • Index into $ 1011 M
load 0100 Miss • Check tags 1100 N
load 1100 • Check valid bits 1101 O
1110 P
LRU Pointer 1111 Q 58
MEMORY
Simulation #4: addr data
8-byte, FA Cache 0000 A
0001 B
XXXX
0010 C
tag|offset
0011 D
CACHE 0100 E
0101 F
V tag data V tag data V tag data V tag data 0110 G
1 110 N | O 1 010 E | F 0 xxx X | X 0 xxx X | X 0111 H
1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 Hit! • Index into $ 1011 M
load 0100 Miss • Check tags 1100 N
load 1100 Hit!
• Check valid bits 1101 O
1110 P
LRU Pointer 1111 Q 59
Pros and Cons of Full Associativity
+ No more conflicts!
+ Excellent utilization!
But either:
Parallel Reads
– lots of reading!
Serial Reads
– lots of waiting
tavg = thit + %miss* tmiss
= 4 + 5% x 100 = 6 + 3% x 100
= 9 cycles = 9 cycles
60
Pros & Cons
Direct Mapped Fully Associative
Tag Size Smaller Larger
SRAM Overhead Less More
Controller Logic Less More
Speed Faster Slower
Price Less More
Scalability Very Not Very
# of conflict misses Lots Zero
Hit Rate Low High
Pathological Cases Common ?
Reducing Conflict Misses
with Set-Associative Caches
Not too conflict-y. Not too slow.
… Just Right!
62
MEMORY
8 byte, 2-way addr data
set associative Cache 0000 A
0001 B
XXXX
XXXX
0010 C
tag||offset
offset
index
0011 D
CACHE 0100 E
index V tag data V tag data 0101 F
0 0 xx E | F 0 xx N | O 0110 G
1 0 xx C | D 0 xx P | Q 0111 H
1000 J
What should the offset be? 1001 K
1010 L
What should the index be? 1011 M
1100 N
What should the tag be? 1101 O
1110 P
1111 Q 63
Clicker Question
5 bit address
XXXXX
2 byte block size
24 byte, 3-Way Set Associative CACHE
index V tag data V tag data V tag data
00 0 ? X | Y 0 ? X’ | Y’ 0 ? X’’ | Y’’
01 0 ? X | Y 0 ? X’ | Y’ 0 ? X’’ | Y’’
10 0 ? X | Y 0 ? X’ | Y’ 0 ? X’’ | Y’’
11 0 ? X | Y 0 ? X’ | Y’ 0 ? X’’ | Y’’
A) 0
B) 1
How many tag bits?
C) 2
D) 3
E) 4 64
Clicker Question
5 bit address
XXXXX
2 byte block size
24 byte, 3-Way Set Associative CACHE
index V tag data V tag data V tag data
00 0 ? X | Y 0 ? X’ | Y’ 0 ? X’’ | Y’’
01 0 ? X | Y 0 ? X’ | Y’ 0 ? X’’ | Y’’
10 0 ? X | Y 0 ? X’ | Y’ 0 ? X’’ | Y’’
11 0 ? X | Y 0 ? X’ | Y’ 0 ? X’’ | Y’’
A) 0
B) 1
How many tag bits?
C) 2
D) 3
E) 4 65
MEMORY
8 byte, 2-way addr data
set associative Cache 0000 A
0001 B
XXXX
0010 C
tag||offset
index
0011 D
CACHE 0100 E
index V tag data V tag data 0101 F
0 0 xx X | X 0 xx X | X 0110 G
1 0 xx X | X 0 xx X | X 0111 H
1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 • Index into $ 1011 M
load 0100 • Check tag 1100 N
load 1100 • Check valid bit 1101 O
1110 P
LRU Pointer 1111 Q 66
MEMORY
8 byte, 2-way addr data
set associative Cache 0000 A
0001 B
XXXX
0010 C
tag||offset
index
0011 D
CACHE 0100 E
index V tag data V tag data 0101 F
0 1 11 N | O 0 xx X | X 0110 G
1 0 xx X | X 0 xx X | X 0111 H
1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 Hit! • Index into $ 1011 M
load 0100 • Check tag 1100 N
load 1100 • Check valid bit 1101 O
1110 P
LRU Pointer 1111 Q 67
MEMORY
8 byte, 2-way addr data
set associative Cache 0000 A
0001 B
XXXX
0010 C
tag||offset
index
0011 D
CACHE 0100 E
index V tag data V tag data 0101 F
0 1 11 N | O 0 xx X | X 0110 G
1 0 xx X | X 0 xx X | X 0111 H
1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 Hit! • Index into $ 1011 M
load 0100 Miss • Check tag 1100 N
load 1100 • Check valid bit 1101 O
1110 P
LRU Pointer 1111 Q 68
MEMORY
8 byte, 2-way addr data
set associative Cache 0000 A
0001 B
XXXX
0010 C
tag||offset
index
0011 D
CACHE 0100 E
index V tag data V tag data 0101 F
0 1 11 N | O 1 01 E | F 0110 G
1 0 xx X | X 0 xx X | X 0111 H
1000 J
1001 K
load 1100 Miss Lookup: 1010 L
load 1101 Hit! • Index into $ 1011 M
load 0100 Miss • Check tag 1100 N
load 1100 • Check valid bit 1101 O
Hit!
1110 P
LRU Pointer 1111 Q 69
Eviction Policies
70
Misses: the Three C’s
71
Miss Rate vs. Block Size
72
Block Size Tradeoffs
• For a given total cache size,
Larger block sizes mean….
– fewer lines
– so fewer tags, less overhead
– and fewer cold misses (within-block “prefetching”)
• But also…
– fewer blocks available (for scattered accesses!)
– so more conflicts
– can decrease performance if working set can’t fit in $
– and larger miss penalty (time to fetch block)
Miss Rate vs. Associativity
74
Clicker Question
75
Clicker Question
76
ABCs of Caches
tavg = thit + %miss* tmiss
+ Associativity:
⬇conflict misses
⬆hit time
+ Block Size:
⬇cold misses
⬆conflict misses
+ Capacity:
⬇capacity misses
⬆hit time
77
Which caches get what properties?
tavg = thit + %miss* tmiss
Design with
Fast
speed in mind
L1 Caches
More Associative
L2 Cache Bigger Block Sizes
Larger Capacity
L3 Cache
Design with miss
Big rate in mind
78
Roadmap
• Things we have covered:
– The Need for Speed
– Locality to the Rescue!
– Calculating average memory access time
– $ Misses: Cold, Conflict, Capacity
– $ Characteristics: Associativity, Block Size, Capacity
• Things we will now cover:
– Cache Figures
– Cache Performance Examples
– Writes
79
2-Way Set Associative Cache (Reading)
Tag Index Offset
V Tag Data V Tag Data
= =
line select
64bytes
word select
32bits 80
hit? data
3-Way Set Associative Cache (Reading)
Tag Index Offset
= = =
line select
64bytes
word select
32bits 81
hit? data
How Big is the Cache?
Tag Index Offset
87
Takeaway
Direct Mapped fast, but low hit rate
Fully Associative higher hit cost, higher hit rate
Set Associative middleground
Memory
Instructions:
0 78
LB $1 M[ 1 ] 1 29
LB $2 M[ 7 ] 2 120
SB $2 M[ 0 ] 3 123
SB $1 M[ 5 ] lru V tag data 4 71
LB $2 M[ 10 ] 1 0 5 150
6 162
SB $1 M[ 5 ] 0 0 7 173
SB $1 M[ 10 ] 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 12 19
Misses: 0 13 200
$2
$3 Hits: 0 14 210
Reads: 0 15 225
Writes: 0
Write-Through (REF 1)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] 2 120
SB $2 M[ 0 ] 3 123
SB $1 M[ 5 ] lru V tag data 4 71
LB $2 M[ 10 ] 0 1 000 78 5 150
29 6 162
SB $1 M[ 5 ] 1 0 7 173
SB $1 M[ 10 ] 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
Misses: 1 13 200
$2
$3 Hits: 0 14 210
Reads: 2 15 225
Writes: 0
Write-Through (REF 2)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] 2 120
SB $2 M[ 0 ] 3 123
SB $1 M[ 5 ] lru V tag data 4 71
LB $2 M[ 10 ] 0 1 000 78 5 150
29 6 162
SB $1 M[ 5 ] 1 0 7 173
SB $1 M[ 10 ] 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
Misses: 1 13 200
$2
$3 Hits: 0 14 210
Reads: 2 15 225
Writes: 0
Write-Through (REF 2)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] 3 123
SB $1 M[ 5 ] lru V tag data 4 71
LB $2 M[ 10 ] 0 1 000 78 5 150
29 6 162
SB $1 M[ 5 ] 1 1 011 162 7 173
SB $1 M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
173
Misses: 2 13 200
$2
$3 Hits: 0 14 210
Reads: 4 15 225
Writes: 0
Write-Through (REF 3)
CLICKER:
(A) HIT Memory
Instructions:
LB $1 M[ 1 ] M (B) MISS 0 78
1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] 3 123
SB $1 M[ 5 ] lru V tag data 4 71
LB $2 M[ 10 ] 1 1 000 78 5 150
29 6 162
SB $1 M[ 5 ] 0 1 011 162 7 173
SB $1 M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
173
Misses: 2 13 200
$2
$3 Hits: 0 14 210
Reads: 4 15 225
Writes: 0
Write-Through (REF 3)
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] lru V tag data 4 71
LB $2 M[ 10 ] 0 1 000 173 5 150
29 6 162
SB $1 M[ 5 ] 1 1 011 162 7 173
SB $1 M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
173
Misses: 2 13 200
$2
$3 Hits: 1 14 210
Reads: 4 15 225
Writes: 1
Write-Through (REF 4)
write-allocate
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V tag data 4 71
LB $2 M[ 10 ] 0 1 000 173 5 150
29 6 162
SB $1 M[ 5 ] 1 1 011
010 7
162
71 173
SB $1 M[ 10 ] 150
173 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
173
Misses: 2 13 200
$2
$3 Hits: 1 14 210
Reads: 4 15 225
Writes: 1
Write-Through (REF 4)
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V tag data 4 71
LB $2 M[ 10 ] 1 1 000 173 5 150
29
29 6 162
SB $1 M[ 5 ] 0 1 010 71 7 173
SB $1 M[ 10 ] 150
29
150 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
173
Misses: 3 13 200
$2
$3 Hits: 1 14 210
Reads: 6 15 225
Writes: 2
Write-Through (REF 5)
CLICKER:
(A) HIT Memory
Instructions:
LB $1 M[ 1 ] M (B) MISS 0 173
1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V tag data 4 71
LB $2 M[ 10 ] 1 1 000 173 5 29
29 6 162
SB $1 M[ 5 ] 0 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
173
Misses: 3 13 200
$2
$3 Hits: 1 14 210
Reads: 6 15 225
Writes: 2
Write-Through (REF 5)
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V tag data 4 71
LB $2 M[ 10 ] M 0 1 101 33 5 29
28 6 162
SB $1 M[ 5 ] 1 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
33
Misses: 4 13 200
$2
$3 Hits: 1 14 210
Reads: 8 15 225
Writes: 2
Write-Through (REF 6)
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V tag data 4 71
LB $2 M[ 10 ] M 0 1 101 33 5 29
28 6 162
SB $1 M[ 5 ] 1 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
33
Misses: 4 13 200
$2
$3 Hits: 1 14 210
Reads: 8 15 225
Writes: 2
Write-Through (REF 6)
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V tag data 4 71
LB $2 M[ 10 ] M 0 1 101 33 5 29
28 6 162
SB $1 M[ 5 ] Hit 1 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
33
Misses: 4 13 200
$2
$3 Hits: 2 14 210
Reads: 8 15 225
Writes: 3
Write-Through (REF 7)
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V tag data 4 71
LB $2 M[ 10 ] M 0 1 101 33 5 29
28 6 162
SB $1 M[ 5 ] Hit 1 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
$1 29 12 19
33
Misses: 4 13 200
$2
$3 Hits: 2 14 210
Reads: 8 15 225
Writes: 3
Write-Through (REF 7)
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V tag data 4 71
LB $2 M[ 10 ] M 0 1 101 29
33 5 29
28 6 162
SB $1 M[ 5 ] Hit 1 1 010 71 7 173
SB $1 M[ 10 ] Hit 29 8 18
9 21
Register File Cache 10 29
33
$0 11 28
$1 29 12 19
33
Misses: 4 13 200
$2
$3 Hits: 3 14 210
Reads: 8 15 225
Writes: 4
Summary: Write Through
Memory
Instructions:
0 78
LB $1 M[ 1 ] 1 29
LB $2 M[ 7 ] 2 120
SB $2 M[ 0 ] 3 123
SB $1 M[ 5 ] lru V d tag data 4 71
LB $2 M[ 10 ] 1 0 5 150
6 162
SB $1 M[ 5 ] 0 0 7 173
SB $1 M[ 10 ] 8 18
9 21
Register File Cache 10 33
$0 11 28
Misses: 0 12 19
$1
$2 Hits: 0 13 200
$3 Reads: 0 14 210
Writes: 0 15 225
Write-Back (REF 1)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] 2 120
SB $2 M[ 0 ] 3 123
SB $1 M[ 5 ] lru V d tag data 4 71
LB $2 M[ 10 ] 0 1 0 000 78 5 150
29 6 162
SB $1 M[ 5 ] 1 0 7 173
SB $1 M[ 10 ] 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 1 12 19
$1
$2 Hits: 0 13 200
$3 Reads: 2 14 210
Writes: 0 15 225
Write-Back (REF 2)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] 2 120
SB $2 M[ 0 ] 3 123
SB $1 M[ 5 ] lru V d tag data 4 71
LB $2 M[ 10 ] 0 1 0 000 78 5 150
29 6 162
SB $1 M[ 5 ] 1 0 7 173
SB $1 M[ 10 ] 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 1 12 19
$1
$2 Hits: 0 13 200
$3 Reads: 2 14 210
Writes: 0 15 225
Write-Back (REF 2)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] 3 123
SB $1 M[ 5 ] lru V d tag data 4 71
LB $2 M[ 10 ] 1 1 0 000 78 5 150
29 6 162
SB $1 M[ 5 ] 0 1 0 011 162 7 173
SB $1 M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 2 12 19
$1
$2 173 Hits: 0 13 200
$3 Reads: 4 14 210
Writes: 0 15 225
Write-Back (REF 3)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] 3 123
SB $1 M[ 5 ] lru V d tag data 4 71
LB $2 M[ 10 ] 1 1 0 000 78 5 150
29 6 162
SB $1 M[ 5 ] 0 1 0 011 162 7 173
SB $1 M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 2 12 19
$1
$2 173 Hits: 0 13 200
$3 Reads: 4 14 210
Writes: 0 15 225
Write-Back (REF 3)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] lru V d tag data 4 71
LB $2 M[ 10 ] 0 1 1 000 173 5 150
29 6 162
SB $1 M[ 5 ] 1 1 0 011 162 7 173
SB $1 M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 2 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 4 14 210
Writes: 0 15 225
Write-Back (REF 4)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] lru V d tag data 4 71
LB $2 M[ 10 ] 0 1 1 000 173 5 150
29 6 162
SB $1 M[ 5 ] 1 1 0 011 162 7 173
SB $1 M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 2 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 4 14 210
Writes: 0 15 225
Write-Back (REF 4)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] lru V d tag data 4 71
LB $2 M[ 10 ] 0 1 1 000 173 5 150
29 6 162
SB $1 M[ 5 ] 1 1 0 011 162 7 173
SB $1 M[ 10 ] 173 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 0 15 225
Write-Back (REF 4)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V d tag data 4 71
LB $2 M[ 10 ] 1 1 1 000 173 5 150
29 6 162
SB $1 M[ 5 ] 0 1 1 010 71 7 173
SB $1 M[ 10 ] 150
29 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 0 15 225
Write-Back (REF 5)
Memory
Instructions:
0 78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V d tag data 4 71
LB $2 M[ 10 ] 1 1 1 000 173 5 150
29 6 162
SB $1 M[ 5 ] 0 1 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 0 15 225
Write-Back (REF 5)
Eviction, WB dirty block
Memory
Instructions:
0 173
78
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V d tag data 4 71
LB $2 M[ 10 ] 1 1 1 000 173 5 150
29 6 162
SB $1 M[ 5 ] 0 1 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 3 12 19
$1
$2 173 Hits: 1 13 200
$3 Reads: 6 14 210
Writes: 2 15 225
Write-Back (REF 5)
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V d tag data 4 71
LB $2 M[ 10 ] M 0 1 0 101 33 5 150
28 6 162
SB $1 M[ 5 ] 1 1 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 4 12 19
$1
$2 33 Hits: 1 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 6)
CLICKER:
(A) HIT Memory
Instructions:
LB $1 M[ 1 ] M (B) MISS 0 173
1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V d tag data 4 71
LB $2 M[ 10 ] M 0 1 0 101 33 5 150
28 6 162
SB $1 M[ 5 ] 1 1 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 4 12 19
$1
$2 33 Hits: 1 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 6)
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V d tag data 4 71
LB $2 M[ 10 ] M 1 1 0 101 33 5 150
28 6 162
SB $1 M[ 5 ] Hit 0 1 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 4 12 19
$1
$2 33 Hits: 2 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 7)
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V d tag data 4 71
LB $2 M[ 10 ] M 1 1 0 101 33 5 150
28 6 162
SB $1 M[ 5 ] Hit 0 1 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 4 12 19
$1
$2 33 Hits: 2 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 7)
Memory
Instructions:
0 173
LB $1 M[ 1 ] M 1 29
LB $2 M[ 7 ] M 2 120
SB $2 M[ 0 ] Hit 3 123
SB $1 M[ 5 ] M lru V d tag data 4 71
LB $2 M[ 10 ] M 0 1 1 101 29 5 150
28 6 162
SB $1 M[ 5 ] Hit 1 1 1 010 71 7 173
SB $1 M[ 10 ] Hit 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 4 12 19
$1
$2 33 Hits: 3 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 8,9)
Cheap subsequent updates!
M
M Memory
Instructions:
0 173
... Hit
1 29
SB $1 M[ 5 ] M 2 120
LB $2 M[ 10 ] M 3 123
SB $1 M[ 5 ] Hit lru V d tag data 4 71
SB $1 M[ 10 ] Hit 0 1 1 101 29 5 150
28 6 162
SB $1 M[ 5 ] 1 1 1 010 71 7 173
SB $1 M[ 10 ] 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 4 12 19
$1
$2 33 Hits: 3 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
Write-Back (REF 8,9)
M
M Memory
Instructions:
0 173
... Hit
1 29
SB $1 M[ 5 ] M 2 120
LB $2 M[ 10 ] M 3 123
SB $1 M[ 5 ] Hit lru V d tag data 4 71
SB $1 M[ 10 ] Hit 0 1 1 101 29 5 150
28 6 162
SB $1 M[ 5 ]Hit 1 1 1 010 71 7 173
SB $1 M[ 10 ]Hit 29 8 18
9 21
Register File Cache 10 33
$0 11 28
29
Misses: 4 12 19
$1
$2 33 Hits: 3 13 200
$3 Reads: 8 14 210
Writes: 2 15 225
How Many Memory References?
Write-back performance
• How many reads?
– Each miss (read or write) reads a block from mem
– 4 misses 8 mem reads
• How many writes?
– Some evictions write a block to mem
– 1 dirty eviction 2 mem writes
– (+ 2 dirty evictions later +4 mem writes)
Write-back vs. Write-through Example
Assume: large associative cache, 16-byte lines
N 4-byte words
• Write-through is slower
– But simpler (memory always consistent)
YOUR CACHE
H
MIND
137
Clicker Question
Choose the best block size for your cache among the
choices given. Assume that integers and pointers are all 4
bytes each and that the scores array is 4-byte aligned.
(a) 1 byte (b) 4 bytes (c) 8 bytes (d) 16 bytes (e) 32 bytes
138
Clicker Question
Choose the best block size for your cache among the
choices given. Assume integers and pointers are 4 bytes.
(a) 1 byte (b) 4 bytes (c) 8 bytes (d) 16 bytes (e) 32 bytes
typedef struct item_t {
int value;
struct item_t *next;
char *name;
} item_t;
int sum = 0;
item_t *curr = list_head;
while (curr != NULL) {
sum += curr->value;
curr = curr->next;
} 139
Clicker Question
Choose the best block size for your cache among the
choices given. Assume integers and pointers are 4 bytes.
(a) 1 byte (b) 4 bytes (c) 8 bytes (d) 16 bytes (e) 32 bytes
typedef struct item_t {
int value;
struct item_t *next;
char *name;
} item_t;
int sum = 0;
item_t *curr = list_head;
while (curr != NULL) {
sum += curr->value;
curr = curr->next;
} 140
By the end of the cache lectures…
•
•
> dmidecode -t cache
Cache Information
A Real Example
• Socket Designation: L1 Cache Microsoft Surfacebook
Configuration: Enabled, Not Socketed, Level 1
•
• Operational Mode: Write Back Dual core
•
•
Location: Internal
Installed Size: 128 kB
Intel i7-6600 CPU @ 2.6 GHz
• Maximum Size: 128 kB (purchased in 2016)
• Supported SRAM Types: Cache Information
• Synchronous Socket Designation: L3 Cache
• Installed SRAM Type: Synchronous Configuration: Enabled, Not Socketed,
Level 3
• Speed: Unknown Operational Mode: Write Back
• Error Correction Type: Parity Location: Internal
• System Type: Unified Installed Size: 4096 kB
• Associativity: 8-way Set-associative Maximum Size: 4096 kB
Supported SRAM Types:
Synchronous
• Cache Information Installed SRAM Type: Synchronous
• Socket Designation: L2 Cache Speed: Unknown
• Configuration: Enabled, Not Socketed, Error Correction Type: Multi-bit ECC
System Type: Unified
• Level 2 Associativity: 16-way Set-associative
• Operational Mode: Write Back
• Location: Internal
• Installed Size: 512 kB
• Maximum Size: 512 kB
• Supported SRAM Types:
• Synchronous
• Installed SRAM Type: Synchronous
• Speed: Unknown
• Error Correction Type: Single-bit ECC
• System Type: Unified
• Associativity: 4-way Set-associative
•
A Real Example
> sudo dmidecode -t cache
• Cache Information
• Configuration: Enabled, Not Socketed, Level 1 Dual-core 3.16GHz Intel
• Operational Mode: Write Back
• Installed Size: 128 KB (purchased in 2011)
• Error Correction Type: None
• Cache Information
• Configuration: Enabled, Not Socketed, Level 2
• Operational Mode: Varies With Memory Address
• Installed Size: 6144 KB
• Error Correction Type: Single-bit ECC
• > cd /sys/devices/system/cpu/cpu0; grep cache/*/*
• cache/index0/level:1
• cache/index0/type:Data
• cache/index0/ways_of_associativity:8
• cache/index0/number_of_sets:64
• cache/index0/coherency_line_size:64
• cache/index0/size:32K
• cache/index1/level:1
• cache/index1/type:Instruction
• cache/index1/ways_of_associativity:8
• cache/index1/number_of_sets:64
• cache/index1/coherency_line_size:64
• cache/index1/size:32K
• cache/index2/level:2
• cache/index2/type:Unified
• cache/index2/shared_cpu_list:0-1
• cache/index2/ways_of_associativity:24
• cache/index2/number_of_sets:4096
• cache/index2/coherency_line_size:64
• cache/index2/size:6144K
A Real Example
• Dual 32K L1 Instruction caches
– 8-way set associative Dual-core 3.16GHz Intel
(purchased in 2009)
– 64 sets
– 64 byte line size
• Dual 32K L1 Data caches
– Same as above
• Single 6M L2 Unified cache
– 24-way set associative (!!!)
– 4096 sets
– 64 byte line size
• 4GB Main memory
• 1TB Disk
Summary
• Memory performance matters!
– often more than CPU performance
– … because it is the bottleneck, and not improving much
– … because most programs move a LOT of data
• Design space is huge
– Gambling against program behavior
– Cuts across all layers:
users programs os hardware
• NEXT: Multi-core processors are complicated
– Inconsistent views of memory
– Extremely complex protocols, very hard to get right