Graphics Processing Unit
(GPU) Memory Hierarchy
    Presented by Vu Dinh and Donald MacIntyre
Agenda
●   Introduction to Graphics Processing
●   CPU Memory Hierarchy
●   GPU Memory Hierarchy
●   GPU Architecture Comparison
    o   NVIDIA
    o   AMD (ATI)
● GPU Memory Performance
● Q&A
Brief Graphics Processing History
● Graphics processing has
  evolved from fixed-function
  hardware pipeline units into
  highly programmable
  pipelined units.
● Over time, tasks have been
  moved from the CPU to the
  GPU.
Timeline
● 1980s
  o   Discrete Transistor-Transistor Logic (TTL) frame
      buffer with graphics processed by CPU
● 1990s
  o   Introduction of GPU pipeline - CPU tasks began to
      be moved to GPU
● 2000s
  o   Introduction of Programmable GPU Pipeline
● 2010s
  o   GPUs become general purpose and are also utilized
      for high-performance parallel computation
Movement of Tasks from CPU to GPU
Introduction to Graphic Processing
CPU Memory Hierarchy
      [Figure: NVIDIA Fermi memory hierarchy]
GPU Memory Hierarchy
    Streaming Multiprocessors (SM) Register Files
● Large, unified register file
   (32,768 × 32-bit registers per SM)
● 16 SMs (128 KB register file per
   SM), 32 cores per SM
   → 2 MB across the chip
● 48 warps (1,536 threads) per SM
   → ~21 registers/thread
● Multi-Banked Memory
● Very high bandwidth (~8,000 GB/s)
● ECC protected
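The sizes above are self-consistent; a minimal host-side sketch of the arithmetic (all constants taken from this slide):

```cuda
#include <cstdio>

int main() {
    const int regs_per_sm    = 32768;   // 32-bit registers per SM
    const int sms            = 16;
    const int threads_per_sm = 1536;    // 48 warps x 32 threads

    printf("Register file per SM: %d KB\n", regs_per_sm * 4 / 1024);        // 128 KB
    printf("Chip-wide total:      %d MB\n", regs_per_sm * 4 * sms >> 20);   // 2 MB
    printf("Registers per thread: %d\n",    regs_per_sm / threads_per_sm);  // ~21
    return 0;
}
```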
GPU Memory Hierarchy (Cont.)
                        Shared/L1 Memory
● Configurable 64KB Memory
● 16 KB shared / 48 KB L1
   OR 48 KB shared / 16 KB L1
● Shared memory is shared by a block's
   threads; L1 is private to each SM
● Shared Memory Multi-Banked
● Very low latency (20-30 cycles)
● High bandwidth (1,000+ GB/s)
● ECC protected
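On the software side, the split is selectable per kernel. A minimal sketch, assuming a shared-memory-heavy kernel (all names are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void stage(const float *in, float *out, int n) {
    __shared__ float tile[12 * 1024];       // 48 KB of shared memory (12K floats)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];   // stage data in fast on-chip memory
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x];
}

int main() {
    // Request the 48 KB shared / 16 KB L1 configuration for this kernel.
    cudaFuncSetCacheConfig(stage, cudaFuncCachePreferShared);
    // ... allocate device buffers and launch stage<<<blocks, 256>>>(...) ...
    return 0;
}
```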
GPU Memory Hierarchy (Cont.)
                    Texture & Constant Cache
● 64 KB read-only constant cache
● 12 KB texture cache
● Texture cache throughput: 739.63 GB/s
● Texture cache hit rate: 94.21%
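A minimal sketch of the read-only constant path (names are illustrative): filter taps live in `__constant__` memory, and because every thread in a warp reads the same `coeffs[k]`, the constant cache can broadcast the value.

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[8];   // hypothetical FIR filter taps

__global__ void fir(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                   // assumes input padded by 8 floats
    float acc = 0.0f;
    for (int k = 0; k < 8; ++k)
        acc += coeffs[k] * in[i + k];     // broadcast read served by the cache
    out[i] = acc;
}

int main() {
    float taps[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    cudaMemcpyToSymbol(coeffs, taps, sizeof(taps));  // fill the constant bank
    return 0;
}
```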
GPU Memory Hierarchy (Cont.)
                           L2 Cache
● 768KB Unified Cache
● Shared among SMs
● ECC protected
● Fast Atomic Memory Operations
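A minimal sketch of the last point: on Fermi, global-memory atomics are performed at the shared L2, which is why a chip-wide histogram like this stays fast and coherent across SMs.

```cuda
__global__ void histogram(const unsigned char *data, unsigned int *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // atomic update, visible chip-wide
}
```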
GPU Memory Hierarchy (Cont.)
                      Main Memory (DRAM)
● Accessed by GPU and CPU
● Six 64-bit DRAM channels
● Up to 6GB GDDR5 Memory
● Higher latency (400-800 cycles)
● Throughput: up to 177 GB/s
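The 177 GB/s peak follows from the bus width and data rate; a sketch of the arithmetic (the ~3.7 GT/s effective GDDR5 rate is an assumption chosen to match the figure above):

```cuda
#include <cstdio>

int main() {
    const double bus_bytes = 6 * 64 / 8.0;  // six 64-bit channels = 48 bytes/transfer
    const double data_rate = 3.696e9;       // effective GDDR5 transfers/s (assumed)
    printf("Peak DRAM bandwidth: %.0f GB/s\n", bus_bytes * data_rate / 1e9);
    return 0;
}
```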
Different GPU Memory Hierarchies
● NVIDIA GeForce GTX 580
● AMD Radeon HD 5870
GPU Memory Architecture NVIDIA - Fermi
● On-board GPU memory →
  high-bandwidth GDDR5, 768 MB
  to 6 GB
● L2 shared cache → 512-768
  KB high bandwidth
● L1 cache → one for each
  streaming multiprocessor
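A minimal sketch of how these sizes can be read back at runtime through the CUDA device-properties API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);                        // device 0
    printf("Global memory: %zu MB\n", p.totalGlobalMem >> 20);
    printf("L2 cache:      %d KB\n",  p.l2CacheSize >> 10);
    printf("Shared/block:  %zu KB\n", p.sharedMemPerBlock >> 10);
    printf("Memory bus:    %d-bit\n", p.memoryBusWidth);
    return 0;
}
```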
GPU Memory Architecture - AMD Ring
● Mid-2000s design, used to
  increase memory bandwidth
● Increasing bandwidth
  requires a wider bus
● Ring bus was an attempt to
  avoid long circuit paths and
  their propagation delays
● Two 512-bit links for true bi-
  directional operation
● Delivered 100 GB/s of
  internal bandwidth
GPU Memory Architecture - AMD Hub
● Ring bus wasted power →
  all nodes got data even if
  they did not need it
● Switched hub approach
  reduces power and latency
  since data is sent point to
  point
● AMD increased the internal
  bus width to 2,048 bits
● Maximum bandwidth was
  192 GB/s
GPU Bandwidth
● High bandwidth to main memory is required to support
  multiple cores
● GPUs have relatively small caches
● GPU memory systems are designed for data throughput, with wide
  memory buses
● Much higher bandwidth than typical CPUs, typically 6 to 8 times
GPU Bandwidth (Cont.)
● Bandwidth Use Techniques
  o   Avoid fetching data whenever possible
       Share/reuse data
       Make use of compression
       Perform math calculations instead of fetching
        data when possible → math calculations are not
        limited by memory bandwidth (see the sketch below)
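A sketch of the last technique, assuming a signal-windowing kernel (names illustrative): the weight is recomputed in registers instead of being fetched from a precomputed table in DRAM.

```cuda
__global__ void applyHannWindow(float *signal, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Recompute the window weight with ALU instructions rather than
        // reading a lookup table: the math is not limited by memory bandwidth.
        float w = 0.5f * (1.0f - __cosf(2.0f * 3.14159265f * i / (n - 1)));
        signal[i] *= w;
    }
}
```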
GPU vs. CPU Bandwidth Growth
GPU Latency
● Large register files
● Dedicated shared memory (configurable)
● Multi-banked memory (see the transpose sketch below)
● Reuse data in dedicated memories
● Focus on parallelism
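A sketch of what multi-banked shared memory means in practice, using the classic tile-transpose trick: padding the tile by one column keeps the threads of a warp on different banks.

```cuda
// Assumes a square matrix, width a multiple of 32, and 32x32 thread blocks.
__global__ void transpose(const float *in, float *out, int width) {
    __shared__ float tile[32][33];     // 33, not 32: padding avoids bank conflicts
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;                     // swap block indices
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}
```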
GPU Latency (Cont.)
                           Latency Hiding
● 1,536 threads per SM (48 warps)
● 32 threads per warp (SIMT)
● ~1,000-cycle memory access stalls
● Switch to another warp to hide
   latency
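A back-of-the-envelope sketch of why that works (the one-warp-instruction-per-cycle issue rate is a simplifying assumption):

```cuda
#include <cstdio>

int main() {
    const int stall_cycles = 1000;  // memory access stall from the slide
    const int warps_per_sm = 48;    // 1,536 resident threads / 32 per warp
    // With enough independent warps, each warp only needs a handful of
    // instructions between loads for the scheduler to cover a full stall.
    printf("Instructions per warp between loads: ~%d\n",
           stall_cycles / warps_per_sm);   // ~20
    return 0;
}
```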
 Future of GPU Memory
 ● New manufacturing process → High Bandwidth Memory
 ● Stacking DRAM dies on top of each other thus allowing
   for close proximity between DRAM and processor
● Allows for very
  high bandwidth
  memory bus
● Due to stacking
  will be harder to
  cool
Q&A
      Thank you!