Systems I
Locality and Caching
Topics
Locality of reference
Cache principles
Multi-level caches
Locality
Principle of Locality:
Programs tend to reuse data and instructions near those
they have used recently, or that were recently referenced
themselves.
Temporal locality: Recently referenced items are likely to be
referenced in the near future.
Spatial locality: Items with nearby addresses tend to be
referenced close together in time.
Locality Example:
sum = 0;
for (i = 0; i < n; i++)
sum += a[i];
return sum;
Data
Reference array elements in succession
(stride-1 reference pattern): Spatial locality
Reference sum each iteration: Temporal locality
Instructions
Reference instructions in sequence: Spatial locality
Cycle through loop repeatedly: Temporal locality
2
Locality Example
Claim: Being able to look at code and get a qualitative
sense of its locality is a key skill for a professional
programmer.
Question: Does this function have good locality?
int sumarrayrows(int a[M][N])
{
int i, j, sum = 0;
for (i = 0; i < M; i++)
for (j = 0; j < N; j++)
sum += a[i][j];
return sum;
}
3
Locality Example
Question: Does this function have good locality?
int sumarraycols(int a[M][N])
{
int i, j, sum = 0;
for (j = 0; j < N; j++)
for (i = 0; i < M; i++)
sum += a[i][j];
return sum;
}
Locality Example
Question: Can you permute the loops so that the
function scans the 3-d array a[] with a stride-1
reference pattern (and thus has good spatial
locality)?
int sumarray3d(int a[M][N][N])
{
int i, j, k, sum = 0;
for (i = 0; i < M; i++)
for (j = 0; j < N; j++)
for (k = 0; k < N; k++)
sum += a[k][i][j];
return sum;
}
Memory Hierarchies
Some fundamental and enduring properties of
hardware and software:
Fast storage technologies cost more per byte and have less
capacity.
The gap between CPU and main memory speed is widening.
Well-written programs tend to exhibit good locality.
These fundamental properties complement each other
beautifully.
They suggest an approach for organizing memory and
storage systems known as a memory hierarchy.
6
An Example Memory Hierarchy
Smaller,
faster,
and
costlier
(per byte)
storage
devices
L0:
registers
L1: on-chip L1
cache (SRAM)
L2:
L3:
Larger,
slower,
and
cheaper
(per byte)
storage
devices
L5:
CPU registers hold words retrieved
from L1 cache.
L4:
off-chip L2
cache (SRAM)
L1 cache holds cache lines retrieved
from the L2 cache memory.
L2 cache holds cache lines
retrieved from main memory.
main memory
(DRAM)
Main memory holds disk
blocks retrieved from local
disks.
local secondary storage
(local disks)
Local disks hold files
retrieved from disks on
remote network servers.
remote secondary storage
(distributed file systems, Web servers)
7
Caches
Cache: A smaller, faster storage device that acts as a staging area
for a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy:
For each k, the faster, smaller device at level k serves as a cache
for the larger, slower device at level k+1.
Why do memory hierarchies work?
Programs tend to access the data at level k more often than they
access the data at level k+1.
Thus, the storage at level k+1 can be slower, and thus larger and
cheaper per bit.
Net effect: A large pool of memory that costs as much as the cheap
storage near the bottom, but that serves data to programs at the
rate of the fast storage near the top.
Use combination of small fast memory and big slow memory to
give illusion of big fast memory.
8
Caching in a Memory Hierarchy
Level k:
8
4
10
4
Level k+1:
14
10
Smaller, faster, more expensive
device at level k caches a
subset of the blocks from level k+1
Data is copied between
levels in block-sized transfer
units
10
11
12
13
14
15
Larger, slower, cheaper storage
device at level k+1 is partitioned
into blocks.
General Caching Concepts
14
12
Level
k:
Request
12
14
Cache hit
4*
12
14
12
4*
Level
k+1:
Program needs object d, which is stored
in some block b.
Program finds b in the cache at level
k. E.g., block 14.
Cache miss
Request
12
4
4*
10
11
12
13
14
15
b is not at level k, so level k cache
must fetch it from level k+1.
E.g., block 12.
If level k cache is full, then some
current block must be replaced
(evicted). Which one is the victim?
Placement policy: where can the new
block go? E.g., b mod 4
Replacement policy: which block
should be evicted? E.g., LRU
10
General Caching Concepts
Types of cache misses:
Cold (compulsary) miss
Cold misses occur because the cache is empty.
Conflict miss
Most caches limit blocks at level k+1 to a small subset
(sometimes a singleton) of the block positions at level k.
E.g. Block i at level k+1 must be placed in block (i mod 4) at
level k+1.
Conflict misses occur when the level k cache is large enough,
but multiple data objects all map to the same level k block.
E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Capacity miss
Occurs when the set of active cache blocks (working set) is
larger than the cache.
11
Examples of Caching in the Hierarchy
Cache Type
What Cached
Where Cached
Registers
4-byte word
CPU registers
0 Compiler
TLB
Address
translations
32-byte block
32-byte block
4-KB page
On-Chip TLB
0 Hardware
On-Chip L1
Off-Chip L2
Main memory
Parts of files
Main memory
1 Hardware
10 Hardware
100 Hardware+
OS
100 OS
L1 cache
L2 cache
Virtual
Memory
Buffer cache
Network buffer Parts of files
cache
Browser
Web pages
cache
Web cache
Web pages
Local disk
Local disk
Remote server
disks
Latency
(cycles)
Managed
By
10,000,000 AFS/NFS
client
10,000,000 Web
browser
1,000,000,000 Web proxy
server
12
Cache Memories
Cache memories are small, fast SRAM-based memories
managed automatically in hardware.
Hold frequently accessed blocks of main memory
CPU looks first for data in L1, then in L2, then in main
memory.
Typical bus structure:
CPU chip
register file
L1
cache
cache bus
L2 cache
ALU
system bus memory bus
bus interface
I/O
bridge
main
memory
13
Inserting an L1 Cache Between
the CPU and Main Memory
The tiny, very fast CPU register file
has room for four 4-byte words.
The transfer unit between
the CPU register file and
the cache is a 4-byte block.
line 0
The small fast L1 cache has room
for two 4-word blocks.
line 1
The transfer unit between
the cache and main
memory is a 4-word block
(16 bytes).
block 10
abcd
...
block 21
pqrs
...
block 30
The big slow main memory
has room for many 4-word
blocks.
wxyz
...
14
Multi-Level Caches
Options: separate data and instruction caches, or a
unified cache
Processor
Regs
L1
d-cache
L1
i-cache
size:
speed:
$/Mbyte:
line size:
200 B
3 ns
8-64 KB
3 ns
8B
32 B
larger, slower, cheaper
Unified
L2
Cache
1-4MB SRAM
6 ns
$100/MB
32 B
Memory
128 MB DRAM
60 ns
$1.50/MB
8 KB
disk
30 GB
8 ms
$0.05/MB
15
Intel Pentium Cache Hierarchy
Regs.
L1 Data
1 cycle latency
16 KB
4-way assoc
Write-through
32B lines
L1 Instruction
16 KB, 4-way
32B lines
L2 Unified
128KB--2 MB
4-way assoc
Write-back
Write allocate
32B lines
Main
Memory
Up to 4GB
Processor Chip
16
Find the Caches
IBM Power 5, 2004
17
Summary
Today
Locality: Spatial and Temporal
Cache principles
Multi-level cache hierarchies
Next Time
Cache organization
Replacement and writes
Programming considerations
18