Advanced Computer
Architecture
Dr. Angela C. Sodan
Lecture 8
Overview
Memory hierarchy / caches
Memory Hierarchy
Recall: memory/processor gap is
increasing
Technological background:
DRAM – dynamic, must be (automatically)
refreshed; improved performance with BiCMOS
-> the technology for main memory
SRAM – static, no refreshing, faster, more
expensive
-> the technology for caches
In addition, access speed decreases with size
Memory Hierarchy
Trade-off between cost/size and speed
(and space on chip)
Basic solution: memory hierarchy
Registers
Caches (multilevel, bridging)
Main memory
Other measures:
Memory interleaving
(Multiple) register sets
Review Caches
Which characteristics are important?
Cache is implicit (typically transparent, and all
transfers are automatic)
In contrast to registers/memory, which are
accessed explicitly (load/store)
In speed and size between registers and
memory
Works as a “buffer” to memory (load/store)
Why can this work?
Caches and Locality
Caches – only – work if program has
temporal and/or spatial locality
This is typically the case – but
what does this mean?
Temporal: the same data is accessed
multiple times in short time intervals
e.g. a loop
Spatial: spatially close data is accessed
e.g. a vector/array, sequentially accessed
and instructions, if they do not branch… (small example below)
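As a small illustration (an assumed example, not from the slides), a C loop that shows both kinds of locality at once; array names, sizes and the function name are illustrative only:

  #define N 1000
  double x[N], y[N];
  double scale = 2.0;

  /* Spatial locality: x[i] and y[i] are accessed sequentially, so each
     fetched cache line serves several consecutive iterations.
     Temporal locality: the loop's instructions and the scalar 'scale'
     are reused in every iteration within a short time interval.        */
  void scale_and_add(void) {
      for (int i = 0; i < N; i++)
          y[i] += scale * x[i];
  }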
Caches and Locality (cont.)
Thus it helps to
Keep the recently used data in the cache
Fetch more than one word and even
pipeline memory access
Modern processors heavily depend on the
cache
High hit rates (about 90%) required for
good performance
Review Caches (cont.)
Different approaches
Associative or direct
Associative: stores arbitrarily, “searches”,
complex and expensive, works for all kinds of
accesses
Direct: hashes, potential conflicts, can work
poorly with certain access sequences
In practice, the hybrid set-associative approach is
applied: direct access to a set, but a few associative
slots (ways) per set – still cheap and works very well;
typical is 4-way (see the address-splitting sketch below)
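To make the mechanics concrete, a minimal sketch of how a set-associative cache splits an address into offset, set index and tag; the cache size, line size and associativity are assumed illustration values, not from the lecture:

  #include <stdio.h>

  /* Assumed configuration: 32 KB cache, 4-way set-associative,
     32-byte lines -> 256 sets.                                    */
  #define LINE_SIZE  32
  #define NUM_WAYS   4
  #define CACHE_SIZE (32 * 1024)
  #define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))

  int main(void) {
      unsigned addr   = 0x12345678u;                   /* example address   */
      unsigned offset = addr % LINE_SIZE;              /* byte within line  */
      unsigned set    = (addr / LINE_SIZE) % NUM_SETS; /* direct "hash"     */
      unsigned tag    = addr / (LINE_SIZE * NUM_SETS); /* compared against
                                                          the 4 ways of the
                                                          selected set      */
      printf("offset=%u set=%u tag=0x%x\n", offset, set, tag);
      return 0;
  }

The set index selects one set directly (the cheap, direct part); only the few tags within that set are compared associatively, which is why set-associative caches stay inexpensive yet avoid most conflicts.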
Review Caches (cont.)
Replacement strategy required for
associative and set-associative
Least-Recently Used: applies locality idea
Random: easy and often applied in caches
(Not much difference for larger caches, if the
miss penalty is not too high)
Ensuring data coherence:
Reading is uncritical
Writes need to be done in memory, too
– Store through (write through): every change is immediately written to memory
– Write back: change written to memory only when the line is replaced;
applied in modern processors
Review Caches (cont.)
Entity is cache line (multiple words)
Exploits spatial locality
Longer cache line: good if much locality
Danger: Useless fetch cost, wasting cache space
Shorter cache line: good if little locality
Typical is 32 bytes, more recently sometimes 64 bytes
Larger cache size compensates for longer
cache lines, direct caches and simpler
replacement
The Miss Sources
Conflict misses -> increase associativity,
make cache larger
Capacity misses -> make cache larger
Initial (cold) misses -> make cache lines longer
A Cost Metric
Teff – Effective access time
Teff = h1 t1 + (1-h1) h2 t2 + (1-h1) (1-h2) h3 t3 +…
where hi is the hit rate at level i and ti the access time at level i
Use to calculate average access time or to
configure systems according to
requirements (small example below)
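For example, a small C sketch of the formula above for a two-level cache plus main memory; the hit rates and access times are made-up illustration values, and main memory is treated as the last level (h3 = 1):

  #include <stdio.h>

  int main(void) {
      double h1 = 0.90, t1 = 1.0;    /* L1: 90% hit rate, 1 cycle   (assumed) */
      double h2 = 0.95, t2 = 10.0;   /* L2: 95% hit rate, 10 cycles (assumed) */
      double t3 = 100.0;             /* main memory: 100 cycles     (assumed) */

      /* Teff = h1*t1 + (1-h1)*h2*t2 + (1-h1)*(1-h2)*t3 */
      double teff = h1 * t1
                  + (1.0 - h1) * h2 * t2
                  + (1.0 - h1) * (1.0 - h2) * t3;

      printf("Teff = %.2f cycles\n", teff);   /* 0.90 + 0.95 + 0.50 = 2.35 */
      return 0;
  }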
Some Comments About
Machine Configuration
If you purchase a machine, think about your
requirements (estimates of typical
applications)
E.g. will you have large problems which should fit
into memory or even the cache? How fast does your
disk need to be (or do you want multiple disks)?
This may be a question about how to spend
your available budget…
More later when we talk about performance
Caches are Implicit – Are They
Really?
Answer:
Yes, with respect to semantic correctness –
enough for standard programming
No, with respect to performance
- for high performance computing
Proper loop organization
(user or compiler)
Getting Good Cache Performance
Temporal: partition the computation into small loops
  for (i = 0; i < n; i++)
    for (l = 0; l < k; l++)
      for (j = 0; j < m; j++)
        c[i][l] += a[i][j] * b[j][l];
may re-use row i of a (if it fits into the cache)
Spatial: use vectors and sequential
access if possible
Significant speedup in a bio-computing application
Cache Locality (cont.)
A simple improvement to at least
exploit spatial locality for the second matrix
  for (i = 0; i < n; i++)        /* loop interchange */
    for (j = 0; j < m; j++)
      for (l = 0; l < k; l++)
        c[i][l] += a[i][j] * b[j][l];
Note that this code now accesses lines of the
second matrix in the innermost loop!
can increase performance by up to a factor of e.g.
32/4 = 8 (32-byte cache line, 4-byte elements) if only
1 row/column of B fits into the cache
Cache Locality (cont.)
Sequential access on vector outperforms
Arbitrary access on vector
Linked data structures
(unless they fit completely into the cache)
If multiple related data items
If arbitrary access, put them in a single array
with a struct (see the sketch below)
If sequential access, it does not matter
typedef struct { double real, imag; } COMPLEX;  COMPLEX data[n];
double data_r [n], data_i [n];
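As a sketch of why the array-of-structs layout helps for arbitrary access (the names and sizes are illustrative, not from the slides):

  #define N 1024

  /* Layout 1: one array of structs - real and imag of element k
     sit next to each other, usually in the same cache line.       */
  struct complex { double real, imag; };
  struct complex data[N];

  /* Layout 2: two separate arrays - element k lives in two
     different cache lines, one per array.                         */
  double data_r[N], data_i[N];

  /* With arbitrary (random) access, the struct layout touches one
     cache line per element instead of two.                        */
  double mag_sq_aos(int k) { return data[k].real * data[k].real
                                  + data[k].imag * data[k].imag; }
  double mag_sq_soa(int k) { return data_r[k] * data_r[k]
                                  + data_i[k] * data_i[k]; }

With sequential access both layouts stream through memory line by line, so, as stated above, the choice then matters little.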
Cache Locality (cont.)
The matrix example revisited: further
possible improvement:
Try to keep the second matrix in the cache
to exploit temporal locality
Since the whole matrix most likely is too
large, partition it into submatrices
Cache Locality Advanced: Blocking
[A11 A12]   [B11 B12]   [C11 C12]
[A21 A22] x [B21 B22] = [C21 C22]
Calculate submatrices and combine them:
Cij = Ai1*B1j + Ai2*B2j
Cache Locality Advanced (cont.)
Assume that matrix is n*n,
m is number of blocks per dimension
and that submatrix fits into cache!!!
We have to load A m times instead of
once, but B only m instead of n times
Which gives (2m)*n*n elements instead of
(n+1)*n*n elements to load from memory into
cache
E.g. n=1000, m=2 gives
4 million vs. 1001 million elements (see the blocked-loop sketch below)
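A minimal sketch of such blocking (tiling) for the n*n case, assuming the block size BS is chosen so that the submatrices being worked on fit into the cache; N, BS and the array names are illustration values:

  #define N  1024   /* matrix dimension (illustrative)                   */
  #define BS 64     /* block size; BS*BS doubles should fit in the cache */

  double a[N][N], b[N][N], c[N][N];

  /* Blocked matrix multiplication: for each block triple, a BS*BS
     submatrix of b is reused for BS rows of a (temporal locality),
     and the innermost loop runs along rows of b and c (spatial
     locality with row-major storage).                                */
  void matmul_blocked(void) {
      for (int ii = 0; ii < N; ii += BS)
          for (int jj = 0; jj < N; jj += BS)
              for (int ll = 0; ll < N; ll += BS)
                  for (int i = ii; i < ii + BS; i++)
                      for (int j = jj; j < jj + BS; j++)
                          for (int l = ll; l < ll + BS; l++)
                              c[i][l] += a[i][j] * b[j][l];
  }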
Final Comment
Again, you do not need to change all your
programming
But it is worth it if you write programs which
will run many times and take a lot of
runtime:
imagine you want to test 100 scenarios in a
simulation and each run takes 10 hours vs. 1 hour!