Cache Memory
Cache
• Small amount of fast memory
• Sits between main memory and the CPU
• May be located on CPU chip or
  module
Cache and Main Memory
        LEVELS OF MEMORY
• Level 1 or Register: memory located inside the CPU, holding
  the data that is operated on immediately. Commonly used
  registers include the accumulator, program counter, and
  address register.
• Level 2 or Cache memory: very fast memory with a short
  access time, where frequently used data is stored
  temporarily for faster access.
• Level 3 or Main Memory: the memory the computer is
  currently working on. It is comparatively small, and its
  contents are lost once power is switched off.
• Level 4 or Secondary Memory: external memory that is
  slower than main memory, but where data is stored
  permanently.
Characteristics of Cache Memory
• Cache memory is an extremely fast memory type that acts
  as a buffer between RAM and the CPU.
• Cache Memory holds frequently requested data and
  instructions so that they are immediately available to the
  CPU when needed.
• Cache memory is costlier than main memory or disk
  memory but more economical than CPU registers.
• Cache Memory is used to speed up memory access and
  keep pace with a high-speed CPU.
Cache operation – overview
• CPU requests contents of memory
  location
• Check cache for this data
• If present, get from cache (fast)
• If not present, read required block
  from main memory to cache
• Then deliver from cache to CPU
• Cache includes tags to identify which
  block of main memory is in each
  cache slot
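This lookup sequence can be sketched in a few lines of C. The sketch below is only a simplified software model: it assumes a tiny cache in which each block can live in exactly one slot (the direct mapping discussed later), a flat main_memory array, and it stores the whole block number as the tag (a real cache stores only the bits not implied by the slot index). All names and sizes are illustrative.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4             /* bytes per block                   */
    #define NUM_SLOTS  8             /* tiny cache, for illustration only */
    #define MEM_SIZE   1024          /* tiny main memory                  */

    static uint8_t main_memory[MEM_SIZE];

    typedef struct {
        bool     valid;              /* does this slot hold a block?      */
        uint32_t tag;                /* which memory block it holds       */
        uint8_t  data[BLOCK_SIZE];
    } slot_t;

    static slot_t cache[NUM_SLOTS];

    /* Read one byte: check the cache first, fetch from memory on a miss. */
    uint8_t read_byte(uint32_t addr) {
        uint32_t block = addr / BLOCK_SIZE;
        uint32_t slot  = block % NUM_SLOTS;   /* the block's one slot */

        if (!cache[slot].valid || cache[slot].tag != block) {
            /* Miss: read the whole block from main memory into the cache. */
            memcpy(cache[slot].data, &main_memory[block * BLOCK_SIZE],
                   BLOCK_SIZE);
            cache[slot].tag   = block;
            cache[slot].valid = true;
        }
        /* Hit (or freshly filled slot): deliver the byte from the cache. */
        return cache[slot].data[addr % BLOCK_SIZE];
    }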
Cache Read Operation -
      Flowchart
          Cache Addressing
• Where does cache sit?
   – Between processor and virtual memory management
     unit
   – Between MMU and main memory
• Logical cache (virtual cache) stores data using
  virtual addresses
   – Processor accesses the cache directly, without going
     through the MMU
   – Cache access is faster, since it occurs before MMU
     address translation
   – Virtual addresses use same address space for different
     applications
      • Must flush cache on each context switch
• Physical cache stores data using main memory
  physical addresses
         Size does matter
• Cost
  – More cache is expensive
• Speed
  – More cache is faster (up to a point)
  – Checking cache for data takes time
Cache/Main Memory
    Structure
Typical Cache Organization
     Comparison of Cache Sizes
  Processor        Type                            Year introduced   L1 cache        L2 cache         L3 cache
  IBM 360/85       Mainframe                       1968              16 to 32 KB     —                —
  PDP-11/70        Minicomputer                    1975              1 KB            —                —
  VAX 11/780       Minicomputer                    1978              16 KB           —                —
  IBM 3033         Mainframe                       1978              64 KB           —                —
  IBM 3090         Mainframe                       1985              128 to 256 KB   —                —
  Intel 80486      PC                              1989              8 KB            —                —
  Pentium          PC                              1993              8 KB/8 KB       256 to 512 KB    —
  PowerPC 601      PC                              1993              32 KB           —                —
  PowerPC 620      PC                              1996              32 KB/32 KB     —                —
  PowerPC G4       PC/server                       1999              32 KB/32 KB     256 KB to 1 MB   2 MB
  IBM S/390 G4     Mainframe                       1997              32 KB           256 KB           2 MB
  IBM S/390 G6     Mainframe                       1999              256 KB          8 MB             —
  Pentium 4        PC/server                       2000              8 KB/8 KB       256 KB           —
  IBM SP           High-end server/supercomputer   2000              64 KB/32 KB     8 MB             —
  CRAY MTA         Supercomputer                   2000              8 KB            2 MB             —
  Itanium          PC/server                       2001              16 KB/16 KB     96 KB            4 MB
  SGI Origin 2001  High-end server                 2001              32 KB/32 KB     4 MB             —
  Itanium 2        PC/server                       2002              32 KB           256 KB           6 MB
  IBM POWER5       High-end server                 2003              64 KB           1.9 MB           36 MB
  CRAY XD-1        Supercomputer                   2004              64 KB/64 KB     1 MB             —
        Mapping Function
• Cache of 64 KBytes
• Cache block of 4 bytes
  – i.e. cache is 16K (2^14) lines of 4 bytes
• 16 MBytes main memory
• 24 bit address
  – (2^24 = 16M)
            Direct Mapping
• Each block of main memory maps to only
  one cache line
  – i.e. if a block is in cache, it must be in one
    specific place
• Address is in two parts
• Least Significant w bits identify unique word
• Most Significant s bits specify one memory
  block
• The MSBs are split into a cache line field r
  and a tag of s-r (most significant)
                 Direct Mapping
                Address Structure
   Tag s-r          Line or Slot r          Word w
      8                   14                   2
• 24 bit address
• 2 bit word identifier (4 byte block)
• 22 bit block identifier
   – 8 bit tag (=22-14)
   – 14 bit slot or line
• No two blocks in the same line have the same Tag field
• Check contents of cache by finding line and checking
  Tag
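As a concrete illustration, here is how a 24 bit address from this example splits into the three fields. The field widths (8/14/2) come from the slide above; the struct and function names, and the sample address, are made up for the sketch:

    #include <stdint.h>
    #include <stdio.h>

    /* Field widths from the example: 8 bit tag, 14 bit line, 2 bit word. */
    #define WORD_BITS 2
    #define LINE_BITS 14

    typedef struct { uint32_t tag, line, word; } fields_t;

    /* Split a 24 bit address into tag | line | word. */
    static fields_t split_address(uint32_t addr) {
        fields_t f;
        f.word = addr & ((1u << WORD_BITS) - 1);                /* bits 1..0   */
        f.line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1); /* bits 15..2  */
        f.tag  = addr >> (WORD_BITS + LINE_BITS);               /* bits 23..16 */
        return f;
    }

    int main(void) {
        /* Arbitrary 24 bit address: decodes to tag=16, line=0CE7, word=0. */
        fields_t f = split_address(0x16339C);
        printf("tag=%02X line=%04X word=%X\n",
               (unsigned)f.tag, (unsigned)f.line, (unsigned)f.word);
        return 0;
    }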
         Direct Mapping
         Cache Line Table
  Cache line      Main memory blocks held
  0               0, m, 2m, 3m, …, 2^s - m
  1               1, m+1, 2m+1, …, 2^s - m + 1
  …               …
  m-1             m-1, 2m-1, 3m-1, …, 2^s - 1

  (where m = number of lines in the cache)
Direct Mapping Cache
    Organization
Direct Mapping Example
  Direct Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or
  bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = m = 2^r
• Size of tag = (s – r) bits
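As a quick check, plugging the running example (64 KByte cache, 4 byte blocks, 24 bit address) into these formulas:

    Address length    = s + w = 22 + 2 = 24 bits
    Addressable units = 2^24 = 16M bytes
    Block size        = 2^2 = 4 bytes
    Blocks in memory  = 2^24 / 2^2 = 2^22
    Lines in cache    = m = 2^14 = 16K
    Size of tag       = s - r = 22 - 14 = 8 bits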
Direct Mapping pros & cons
• Simple
• Inexpensive
• Fixed location for given block
  – If a program accesses 2 blocks that map
    to the same line repeatedly, cache
    misses are very high
     Associative Mapping
• A main memory block can load into any
  line of cache
• Memory address is interpreted as tag
  and word
• Tag uniquely identifies block of memory
• Every line’s tag is examined for a
  match
• Cache searching gets expensive
Fully Associative Cache
      Organization
Associative Mapping
     Example
        Associative Mapping
         Address Structure

    Tag 22 bit  |  Word 2 bit

• 22 bit tag stored with each 32 bit block of data
• Compare tag field with tag entry in cache to
  check for hit
• Least significant 2 bits of address identify which
  byte is required from the 32 bit (4 byte) data
  block
• e.g.
    Address    Tag      Data       Cache line
    FFFFFC     3FFFFF   24682468   3FFF
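Since a block can occupy any line, a lookup must compare the incoming tag against the tag of every line. Hardware does this in parallel, with one comparator per line; the sequential loop below is only a software illustration of the same logic, with made-up names and the full 22 bit tag kept per line:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LINES 16384          /* 16K lines, as in the running example */

    typedef struct {
        bool     valid;
        uint32_t tag;                /* 22 bit tag (upper bits of address)   */
        uint8_t  data[4];            /* one 4 byte (32 bit) block            */
    } line_t;

    static line_t cache[NUM_LINES];

    /* Return the index of the matching line, or -1 on a miss. Hardware
       compares all tags simultaneously; software can only loop over them. */
    int lookup(uint32_t addr) {
        uint32_t tag = addr >> 2;    /* drop the 2 bit word field */
        for (int i = 0; i < NUM_LINES; i++)
            if (cache[i].valid && cache[i].tag == tag)
                return i;
        return -1;
    }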
       Associative Mapping
            Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or
  bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = undetermined
• Size of tag = s bits
   Set Associative Mapping
• Cache is divided into a number of sets
• Each set contains a number of lines
• A given block maps to any line in a
  given set
  – e.g. Block B can be in any line of set i
• e.g. 2 lines per set
  – 2 way associative mapping
  – A given block can be in one of 2 lines in
    only one set
   Set Associative Mapping
          Example
• 13 bit set number
• Block number in main memory is
  modulo 2^13
• 000000, 008000, 010000, 018000 …
  map to same set
Two Way Set Associative
  Cache Organization
      Set Associative Mapping
         Address Structure

    Tag 9 bit  |  Set 13 bit  |  Word 2 bit

• Use set field to determine cache set to
  look in
• Compare tag field to see if we have a hit
• e.g.
    Address     Tag   Data       Set number
    1FF 7FFC    1FF   12345678   1FFF
    001 7FFC    001   11223344   1FFF
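A sketch of a two-way set associative lookup with the field widths above (9 bit tag, 13 bit set, 2 bit word): only the two lines of the selected set are compared. All names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define WORD_BITS 2
    #define SET_BITS  13
    #define NUM_SETS  (1u << SET_BITS)   /* 2^13 sets    */
    #define WAYS      2                  /* two-way      */

    typedef struct {
        bool     valid;
        uint32_t tag;                    /* 9 bit tag    */
        uint8_t  data[4];                /* 4 byte block */
    } line_t;

    static line_t cache[NUM_SETS][WAYS];

    /* Return the way that hits within the selected set, or -1 on a miss. */
    int lookup(uint32_t addr) {
        uint32_t set = (addr >> WORD_BITS) & (NUM_SETS - 1);
        uint32_t tag = addr >> (WORD_BITS + SET_BITS);
        for (int way = 0; way < WAYS; way++)
            if (cache[set][way].valid && cache[set][way].tag == tag)
                return way;
        return -1;
    }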
Two Way Set Associative
   Mapping Example
   Set Associative Mapping
          Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or
  bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in set = k
• Number of sets = v = 2^d
• Number of lines in cache = kv = k * 2^d
• Size of tag = (s – d) bits
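Checking these formulas against the two-way example above (s = 22, w = 2, d = 13, k = 2):

    Number of sets    = v = 2^13
    Lines per set     = k = 2
    Lines in cache    = k * 2^d = 2 * 2^13 = 2^14 = 16K
    Size of tag       = s - d = 22 - 13 = 9 bits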
  Replacement Algorithms (1)
        Direct mapping
• No choice
• Each block only maps to one line
• Replace that line
 Replacement Algorithms (2)
 Associative & Set Associative
• Hardware implemented algorithm (speed)
• Least Recently used (LRU)
• e.g. in 2 way set associative
  – Which of the 2 blocks is LRU? (see the
    sketch after this list)
• First in first out (FIFO)
  – replace block that has been in cache longest
• Least frequently used
  – replace block which has had fewest hits
• Random
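For a two-way set associative cache, LRU is particularly cheap to implement: a single USE bit per set, flipped on every access, identifies the least recently used line. A minimal sketch, with assumed names and sizes rather than any real design:

    #include <stdint.h>

    #define SET_BITS 13
    #define NUM_SETS (1u << SET_BITS)

    /* One USE bit per set: which of the two ways was used most recently. */
    static uint8_t mru_way[NUM_SETS];

    /* Call on every hit or fill: mark this way as most recently used. */
    void touch(uint32_t set, int way) {
        mru_way[set] = (uint8_t)way;
    }

    /* On a miss with both ways valid, replace the other (LRU) way. */
    int victim(uint32_t set) {
        return 1 - mru_way[set];
    }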
           Write Policy
• Must not overwrite a cache block
  unless main memory is up to date
• Multiple CPUs may have individual
  caches
• I/O may address main memory
  directly
           Write through
• All writes go to main memory as well as
  cache
• Multiple CPUs can monitor main memory
  traffic to keep local (to CPU) cache up to
  date
• Lots of traffic
• Slows down writes
• Beware of bogus write-through caches!
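A write-through store path might look like the sketch below: the cached copy is updated only if the block is resident, and main memory is updated on every write, which is exactly where the extra traffic comes from. All names and sizes are illustrative, and the tag is again the full block number for simplicity:

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4
    #define NUM_LINES  16384
    #define MEM_SIZE   (1u << 20)

    typedef struct {
        bool     valid;
        uint32_t tag;                 /* full block number, for simplicity */
        uint8_t  data[BLOCK_SIZE];
    } line_t;

    static uint8_t main_memory[MEM_SIZE];
    static line_t  cache[NUM_LINES];

    void write_byte(uint32_t addr, uint8_t value) {
        uint32_t block = addr / BLOCK_SIZE;
        uint32_t line  = block % NUM_LINES;

        /* Update the cached copy only if the block is resident. */
        if (cache[line].valid && cache[line].tag == block)
            cache[line].data[addr % BLOCK_SIZE] = value;

        /* Write-through: main memory is updated on every write, so it
           always holds current data for other CPUs and for I/O.      */
        main_memory[addr] = value;
    }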
                      Line Size
• Retrieve not only desired word but a number of
  adjacent words as well
• Increased block size will increase the hit ratio at first
   – due to the principle of locality
• Hit ratio will decrease as the block becomes even
  bigger
   – Probability of using newly fetched information becomes
     less than the probability of reusing the information
     that is replaced
• Larger blocks
   – Reduce number of blocks that fit in cache
   – Data overwritten shortly after being fetched
   – Each additional word is less local so less likely to be
     needed
• No definitive optimum value has been found
• 8 to 64 bytes seems reasonable
• For HPC systems, 64-byte and 128-byte lines are
  most common
        Multilevel Caches
• High logic density enables caches on
  chip
  – Faster than bus access
  – Frees bus for other transfers
• Common to use both on and off chip
  cache
  – L1 on chip, L2 off chip in static RAM
  – L2 access much faster than DRAM or ROM
  – L2 often uses separate data path
  – L2 may now be on chip
  – Resulting in L3 cache
    • Accessed over the bus, or now also on chip…
     Hit Ratio (L1 & L2)
For 8 KByte and 16 KByte L1 caches
    Unified v Split Caches
• One cache for data and instructions or
  two, one for data and one for
  instructions
• Advantages of unified cache
  – Higher hit rate
    • Balances load between instruction and data fetches
  – Only one cache to design & implement
• Advantages of split cache
  – Eliminates cache contention between
    instruction fetch/decode unit and
    execution unit
    • Important in pipelining