Dragon Star Summer School
Advanced Topics in Modern VLSI and Architecture
Module 5: Emerging NVM
Yuan Xie
Associate Professor, Department of Computer Science & Engineering, The Pennsylvania State University
www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu
(Some slides are adapted from Dr. Yiran Chen and Dr. Helen Li from Seagate)
What is Nonvolatile Memory?
Non-volatile memory (NVM), or non-volatile storage, is computer memory that can retain the stored information even when not powered. (www.wikipedia.org)
Evolution of Nonvolatile Memory
[Timeline figure: punched card (1864), paper tape (1900s), HDD (1956), floppy disk (1970s), ROM (1971), CD-R (1980), flash memory (1988)]
Success Story of Magnetism: HDD
• 1956: First HDD: RAMAC 305 (IBM). 5MB of data at $50,000; as big as two refrigerators; uses fifty 24-inch platters.
• 1973: First modern "Winchester" HDD: IBM 3340.
• 1979: First 5.25-inch HDD for PCs (Shugart Technology, now Seagate Technology).
• 1982: First drive with more than 1GB of storage: the 1.2GB H-8598, weighing 50kg (Hitachi).
• 1983: First 3.5-inch HDD: RO352 (10MB, Rodime).
• 1988: First 2.5-inch HDD: the 220 (20MB, PrairieTek).
• 1997: Giant magnetoresistive (GMR) heads (IBM).
• 2000: First 15,000-rpm HDD: Cheetah X15 (Seagate).
• 2002: 100 Gbits per square inch areal density (Seagate).
• 2006: First 2.5-inch model to use perpendicular magnetic recording, boosting capacity up to 160GB (Seagate).
• 2006: 1 12G HDD (Seagate).
• 2007: First 1TB hard disk drive (Seagate).
Price Trend of HDD and SSD (NAND)
• $1/GB (2009): SSDs challenge the enterprise market; $0.2/GB (2011): SSDs challenge the laptop market.
[Chart: price trend of HDD vs. SSD (NAND) in the enterprise and laptop markets.]
• NAND cost reduction becomes the challenge after 2011 because of NAND's technology scaling barrier.
• A $99 64GB SSD deal showed up two months ago!
Next Generation Nonvolatile Memory
*Source: IBM
Nonvolatile Memory Candidates
Memory | Data Retention | Cell Factor | Read Time (ns) | Write/Erase Time (ns) | Number of Rewrites | R/W Power | Power Other Than R/W
SRAM | No | 50-120 | 1 | 1 | 10^16 | Low | Leakage current
DRAM | No | 6-10 | 30 | 50 | 10^16 | Low | Refresh power
NOR | Yes | 10 | 10 | 10^5-10^7 | 10^5 | High | None
NAND | Yes | 2-5 | 50 | 10^5-10^6 | 10^5 | High | None
PRAM | Yes | 6-12 | 20-50 | 50-120 | 10^10 | Low | None
MRAM | Yes | 4-20 | 2-20 | 2-20 | 10^15 | Low | None
• R-RAM outperforms NAND in cost (< 1/4x), density (> 2x), and performance.
• STT-RAM is ideal for embedded solutions.
• Competition: IBM = PCM, SEC = PCM/NiO, Toshiba = 3D NAND, Seagate = MRAM, Everspin = MRAM.
*Source: ITRS
MRAM Cells
The cell structure: one transistor and one Magnetic Tunnel Junction (MTJ), i.e., a 1T1J cell.
[Figure: MTJ with a free layer above a reference layer. Anti-parallel alignment gives high resistance (logic "1"); parallel alignment gives low resistance (logic "0").]
3D View
[Figure: the MTJ sensor stack connects to the bit line at the top and, through metal layers M1-M4 and vias, to the drain of the access transistor; the transistor's gate is driven by the word line and its source connects to the source line.]
Read Operation of STT-RAM
[Figure: probability distributions of the low-resistance state (Rmin) and the high-resistance state (Rmax) versus resistance (Ohm, roughly 500-2500); the reference resistance Rref lies between the two distributions and the sensed cell resistance is compared against it.]
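The distribution plot suggests a simple way to model the read operation. Below is a minimal sketch (not from the slides; all numeric parameters are illustrative assumptions) of sensing a bit by comparing the cell resistance against Rref, and of estimating how the overlap of the two distributions turns into read errors.

```python
import random

# Illustrative (assumed) parameters, loosely matching the slide's axis range:
# the low-resistance state (parallel, logic "0") is centered near 1000 Ohm and the
# high-resistance state (anti-parallel, logic "1") near 2000 Ohm, with process
# variation modeled as a Gaussian spread.
R_LOW_MEAN, R_LOW_SIGMA = 1000.0, 80.0     # Rmin distribution (assumption)
R_HIGH_MEAN, R_HIGH_SIGMA = 2000.0, 160.0  # Rmax distribution (assumption)
R_REF = 1500.0                             # reference resistance between the two states


def read_bit(cell_resistance, r_ref=R_REF):
    """Sense-amplifier model: resistance above Rref reads as '1', below as '0'."""
    return 1 if cell_resistance > r_ref else 0


def estimate_read_error_rate(trials=100_000):
    """Monte-Carlo estimate of how often variation pushes a cell across Rref."""
    errors = 0
    for _ in range(trials):
        stored = random.randint(0, 1)
        mean, sigma = (R_LOW_MEAN, R_LOW_SIGMA) if stored == 0 else (R_HIGH_MEAN, R_HIGH_SIGMA)
        if read_bit(random.gauss(mean, sigma)) != stored:
            errors += 1
    return errors / trials


if __name__ == "__main__":
    print(f"estimated read error rate: {estimate_read_error_rate():.2e}")
```

The wider the separation between Rmin and Rmax relative to their spread, the more reliably the sense amplifier can resolve the stored bit.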
MRAM technologies are being improved and attracting more attention.
SRAM vs. MRAM
Metric | 128KB SRAM bank | 512KB MRAM bank
Area (65nm) | 3.66 mm^2 | 3.30 mm^2
Read latency | 2.25 ns | 2.32 ns
Write latency | 2.26 ns | 11.02 ns
Read energy | 0.90 nJ | 0.86 nJ
Write energy | 0.80 nJ | 5.00 nJ
Leakage power | 2.09 W | 0.26 W
MRAM vs. SRAM: high density, fast read, slow write, low read energy, high write energy, low leakage.
Cache configurations: 2MB (16 x 128KB) SRAM cache vs. 8MB (16 x 512KB) MRAM cache.
Pros: low leakage power, high density. Cons: long write latency and large write energy. Can we simply replace SRAM caches with MRAM? (HPCA 2009)
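Using the per-bank numbers in the table above, a quick back-of-the-envelope sketch (my own arithmetic, not from the slides; it ignores read energy and capacity differences) shows where MRAM's leakage savings are cancelled by its extra write energy.

```python
# Per-bank numbers taken from the SRAM vs. MRAM table above.
SRAM_WRITE_NJ, MRAM_WRITE_NJ = 0.80, 5.00     # write energy (nJ)
SRAM_LEAKAGE_W, MRAM_LEAKAGE_W = 2.09, 0.26   # leakage power (W)


def total_power_watts(leakage_w, write_energy_nj, writes_per_sec):
    """Total power = leakage power + dynamic write power (write energy x write rate)."""
    return leakage_w + write_energy_nj * 1e-9 * writes_per_sec


# Break-even write rate: where MRAM's extra write energy cancels its leakage savings.
breakeven = (SRAM_LEAKAGE_W - MRAM_LEAKAGE_W) / ((MRAM_WRITE_NJ - SRAM_WRITE_NJ) * 1e-9)
print(f"break-even write rate: about {breakeven / 1e6:.0f} million writes/s")

for writes_per_sec in (1e8, breakeven, 1e9):
    sram = total_power_watts(SRAM_LEAKAGE_W, SRAM_WRITE_NJ, writes_per_sec)
    mram = total_power_watts(MRAM_LEAKAGE_W, MRAM_WRITE_NJ, writes_per_sec)
    print(f"{writes_per_sec / 1e6:8.0f} Mwrites/s: SRAM {sram:.2f} W, MRAM {mram:.2f} W")
```

With these numbers the break-even point is roughly 440 million writes per second: below that rate the MRAM cache wins on total power, above it the high write energy dominates, which is the effect the observations that follow describe.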
The Baseline 3D Architecture
• Core layer + cache layers; NUCA caches with NoC connections.
[Figure: Layer 1 holds the cores and the cache controller; Layer 2 holds cache banks, each with a router (R). TSVs provide vertical hops between layers and the NoC provides horizontal hops for data migration. (Li et al., ISCA 2006)]
Direct Replacement
• Replace SRAM with MRAM of the same area; the number of banks is kept the same.
• The capacity of the L2 cache quadruples (2MB to 8MB).
[Chart: L2 cache miss rate, SRAM vs. MRAM.]
The L2 cache miss rate is reduced. How about performance?
IPC Comparison (Direct Replacement)
[Chart: IPC, SRAM vs. MRAM.] The last four benchmarks have high write intensities (see Observation 1).
Observation 1 (Direct Replacement)
Replacing SRAM L2 caches directly with MRAM reduces the L2 cache miss rate. However, the long access latency of the MRAM cache hurts performance, and when the write intensity is high it even results in performance degradation.
Direct MRAM replacement may harm performance. How about power consumption?
Power Analysis
(Direct Replacement)
[Chart: total power, SRAM vs. MRAM, normalized to 2M-SRAM-SNUCA, broken down into MRAM dynamic power and MRAM leakage power.]
For some workloads, MRAM dynamic power dominates! (see Observation 2)
Observation 2
Replacing SRAM L2 caches directly with MRAM greatly reduces leakage power. However, when the write intensity is high, dynamic power increases significantly because of the MRAM cache's high write energy. Question: how can we improve the performance and further reduce the power of the MRAM cache?
T1: Read-Preemptive Write Buffer
[Figure: a FIFO write buffer sits between the cores and the MRAM caches. A read request may preempt a write operation that has just begun, but a write that is almost done is allowed to finish before the read data is returned.]
When should a read request be allowed to evict an in-progress write request (the preemptive condition)?
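A minimal behavioral sketch of one possible preemptive condition is shown below; the threshold value and the buffer interface are assumptions for illustration, not the exact policy of the HPCA 2009 work. A write that has only just begun is cancelled and retried later when a read arrives, while a write that is almost done is allowed to finish.

```python
from collections import deque

PREEMPT_THRESHOLD = 0.5  # assumption: writes less than 50% complete may be preempted


class ReadPreemptiveWriteBuffer:
    """Toy FIFO write buffer in front of the MRAM cache that lets reads jump ahead."""

    def __init__(self):
        self.pending_writes = deque()  # (address, data) entries waiting to start
        self.active_write = None       # [address, data, progress in 0.0 .. 1.0]

    def enqueue_write(self, address, data):
        self.pending_writes.append((address, data))

    def tick(self, write_step=0.1):
        """Advance the in-flight write; start the next one when the array is idle."""
        if self.active_write is None and self.pending_writes:
            address, data = self.pending_writes.popleft()
            self.active_write = [address, data, 0.0]
        elif self.active_write is not None:
            self.active_write[2] += write_step
            if self.active_write[2] >= 1.0:
                self.active_write = None  # the write has retired into the MRAM array

    def read(self, address):
        """Serve a read; preempt the active write only if it has just begun."""
        if self.active_write is not None:
            write_address, data, progress = self.active_write
            if address == write_address:
                return data  # forward the buffered data: no preemption needed
            if progress < PREEMPT_THRESHOLD:
                # Cancel the early-stage write and retry it later (read goes first).
                self.pending_writes.appendleft((write_address, data))
                self.active_write = None
            # Otherwise the nearly finished write completes before the read is issued.
        return None  # placeholder for data fetched from the MRAM array
```

The key design choice is the threshold: preempting too aggressively wastes the energy already spent on the partial write, while never preempting makes every read behind a write pay the full MRAM write latency.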
T2: SRAM-MRAM Hybrid L2 Cache
The read-preemptive write buffer hides MRAM's long write latency. We further propose an SRAM-MRAM hybrid cache to reduce the write intensity seen by MRAM.
T2: SRAM-MRAM Hybrid L2 Cache
(Hybrid Structure)
• 31 ways of MRAM + 1 way of SRAM (32 ways in total).
[Figure: MRAM banks on the stacked cache layer, connected through TSVs to the core layer; the SRAM bank sits next to the cores.]
T2: SRAM-MRAM Hybrid L2 Cache
(Reduce Write Operations)
• Reduce data migrations among the MRAM cache banks.
• Migrate frequently written data to the SRAM cache bank (a sketch of this policy follows below).
[Figure: each core has a home region; there are no migrations among MRAM banks, only migration from MRAM to SRAM.]
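The migration policy can be sketched with a simple per-line write counter; the threshold and the bookkeeping below are assumptions for illustration, and the actual design may differ. Once a line in an MRAM bank has been written a few times, it moves into the SRAM bank so that later writes become cheap.

```python
WRITE_MIGRATION_THRESHOLD = 2  # assumption: migrate after this many writes to an MRAM line


class HybridL2:
    """Toy model of the SRAM-MRAM hybrid L2: write-hot lines migrate to the SRAM bank."""

    def __init__(self):
        self.mram_write_counts = {}  # address -> number of writes while in an MRAM bank
        self.sram_lines = set()      # addresses currently held in the SRAM bank

    def on_write(self, address):
        if address in self.sram_lines:
            return "write hits the SRAM bank (cheap)"
        # The write lands in an MRAM bank: count it, and migrate once it is write-hot.
        self.mram_write_counts[address] = self.mram_write_counts.get(address, 0) + 1
        if self.mram_write_counts[address] >= WRITE_MIGRATION_THRESHOLD:
            del self.mram_write_counts[address]
            self.sram_lines.add(address)  # migration from MRAM to SRAM
            return "write hits an MRAM bank, line migrates to the SRAM bank"
        return "write hits an MRAM bank (expensive)"


if __name__ == "__main__":
    l2 = HybridL2()
    for _ in range(3):
        print(l2.on_write(0x1000))  # repeated writes to one line trigger the migration
```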
T2: SRAM-MRAM Hybrid L2 Cache
(Write Intensity: Pure vs. Hybrid)
[Chart: MRAM write intensity, pure MRAM L2 vs. hybrid L2.]
With the hybrid L2 cache, the MRAM write intensities are reduced.
Combine T1 and T2 (IPC Result)
[Chart: IPC comparison; bars include direct replacement and direct replacement with the read-preemptive buffer.]
After adopting T1 and T2, the performance degradation is eliminated and the average IPC increases by 15%.
Combine T1 and T2 (Power Result)
[Chart: total power comparison; bars include 8M-MRAM-DNUCA direct replacement and direct replacement with the read-preemptive buffer.]
After adopting T1 and T2, the dynamic power is reduced and the average total power drops by a further 17%.
Compare DRAM & MRAM caches
Metric | 512KB DRAM | 512KB MRAM
Area | 2.38 mm^2 | 3.30 mm^2
Read latency | 4.966 ns | 2.318 ns
Write latency | 4.966 ns | 11.024 ns
Read energy | 0.689 nJ | 0.858 nJ
Write energy | 0.689 nJ | 4.997 nJ
Leakage power | 1.6 W | 0.255 W
Cache configurations: 8MB (16 x 512KB) DRAM cache vs. 8MB (16 x 512KB) MRAM cache.
We can get better performance with MRAM caches.
Compare power of DRAM and MRAM caches
MRAM has lower power.
Hybrid Cache Architecture with Disparate Memory Technologies (ISCA 2009)
Different memory technologies
[Figure: cell structures: SRAM (6T), DRAM (1T1C), Magnetic RAM (1T1J: one transistor plus one MTJ), Phase Change RAM (one transistor plus one phase-change element).]
Comparisons
Metric | SRAM | MRAM | PRAM*
Density (ratio) | (1) | High (4) | High (16)
Dynamic power | Low | Low for read; High for write | Medium for read; High for write
Leakage power | High | Low | Low
Speed | Fast | Fast for read; Slow for write | Slow for read; Very slow for write
Non-volatility | No | Yes | Yes
Scalability | Yes | Yes | Yes
Endurance | 10^16 | >10^15 | 10^12
*PRAM density assumes four bits per cell.
Takeaways: SRAM suffers high leakage power; the denser NVM technologies reduce the miss rate and lower cache power at the cost of a higher hit latency. A hybrid cache could outperform its single-technology counterpart, especially in power.
Read/Write
Reads and writes:
• Reads and writes have different performance/power implications.
• Different benchmarks show varied read/write behaviors.
• Emerging memories have different read/write features.
Read-write aware Hybrid Cache Architecture (RWHCA) using NVM
RWHCA
Read-write aware Hybrid Cache Architecture (RWHCA) using emerging NVM:
• Made of different memory technologies, and distinguishes reads from writes.
  - Increases the effective cache size under a similar area.
  - Reduces leakage power consumption.
• Read and write exclusive regions in the same cache level.
  - The write region has faster writes and lower write power (SRAM).
• Intra-cache data movement policies (see the sketch below).
  - Place frequently written data in the write region.
  - Reduce power, and possibly improve performance.
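As referenced above, here is a minimal sketch of one possible intra-cache data movement policy; the way counts, the saturating-counter threshold, and the victim choice are assumptions for illustration rather than the exact ISCA 2009 policy.

```python
SWAP_THRESHOLD = 3  # assumption: per-line write-counter value that triggers a swap


class RWHCASet:
    """Toy model of one RWHCA set: a small SRAM write region plus a larger NVM read region."""

    def __init__(self, write_ways=2, read_ways=14):
        self.write_region = [None] * write_ways  # SRAM ways: fast, low-energy writes
        self.read_region = [None] * read_ways    # MRAM/PRAM ways: slow, costly writes
        self.write_counts = [0] * read_ways      # write counter per read-region way

    def write(self, tag):
        if tag in self.write_region:
            return "hit in the write region (SRAM): cheap write"
        if tag in self.read_region:
            way = self.read_region.index(tag)
            self.write_counts[way] += 1
            if self.write_counts[way] >= SWAP_THRESHOLD:
                self._swap_into_write_region(way)
                return "write-hot line swapped into the SRAM write region"
            return "hit in the read region (NVM): costly write"
        # Miss: install the line in the write region first, since it is being written now.
        victim_way = 0  # placeholder replacement choice
        self.write_region[victim_way] = tag
        return "miss: line installed in the write region"

    def _swap_into_write_region(self, read_way):
        """Exchange a write-hot NVM line with a (presumably colder) SRAM line."""
        victim_way = 0  # placeholder: a real policy would pick the least-written SRAM line
        self.read_region[read_way], self.write_region[victim_way] = (
            self.write_region[victim_way],
            self.read_region[read_way],
        )
        self.write_counts[read_way] = 0
```

The swap keeps write-hot lines in SRAM while the bulk of the (mostly read) working set stays in the dense NVM region.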
Methodology
[Figure: three design cases on a chip of cores with private L1s. Baseline: SRAM L2. RWHCA: hybrid L2 with an SRAM write region and an MRAM (or PRAM) read region. 3DRWHCA: hybrid SRAM/MRAM L2 plus a stacked PRAM L3.]
Methodology
Cache parameters: CACTI or modified versions
• SRAM: 1MB, 8 cycles, 0.388 nJ, 1.36 W (45nm)
• MRAM: 4MB, 20/60 cycles, 0.4/2.3 nJ, 0.15 W
• PRAM: 16MB, 40/200 cycles, 0.8/1.5 nJ, 0.3 W
System configuration
• Simulator: IBM Mambo
• Processor: 8-way issue, out-of-order, 4GHz
• L1: 32KB DL1, 32KB IL1, 128B blocks, 4-way, 1 r/w port, 2 cycles
• L2/L3: different for each design case
Workloads
• 30 workloads from SPECINT2006, SPECJBB, NAS, BioPerf, PARSEC, SPLASH2
• Various cache size requirements
RWHCA-result
• SRAM/MRAM RWHCA L2 performance
  - 5% geometric mean performance improvement over the baseline.
  - 3% improvement over the previous DNUCA policy (DNUCA moves a line to a closer bank on each hit, with no distinction between reads and writes).
  - Also achieves better performance than a 3-level SRAM cache (256KB L2 and 1MB L3, similar area).
RWHCA-result
• SRAM/MRAM RWHCA L2 power
  - 55% power reduction over the baseline.
  - Dynamic power includes normal accesses plus swaps; leakage power is much lower.
  - Lower power than DNUCA and the 3-level SRAM cache.
[Chart: four bars per workload: SRAM baseline, DNUCA, RWHCA, 3-level SRAM.]
RWHCA-result
• SRAM/PRAM RWHCA L2 performance
  - 20% performance degradation compared to the baseline.
  - PRAM is not suitable for the L2 cache from the performance perspective because of its long write latency.
  - Its low endurance also makes it unsuitable for a lower-level cache.
Outline
• Introduction and Motivation
• Methodology
• Read-write aware Hybrid Cache Architecture
• 3D Hybrid Cache stacking
• Conclusions
3DRWHCA-configuration
SRAM/MRAM/PRAM 3DRWHCA
• SRAM + MRAM L2
  - Total size: 4MB, of which 256KB is SRAM
  - Write region: SRAM; read region: MRAM
  - SRAM r/w: 6 cycles; MRAM read: 20 cycles, write: 60 cycles
  - Bank number: 16; associativity: 16
  - Block size: 128B; 1 r/w port
• PRAM L3
  - 32MB (core + L1 occupies a similar area to the L2)
  - L3 bank number: 64; associativity: 64
  - Block size: 128B; 1 r/w port
  - Power: scaled from RWHCA
(A quick geometry check on these parameters follows below.)
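As a quick sanity check on these parameters (my own arithmetic, not from the slides), the per-bank capacity and the number of sets per bank follow directly from the stated capacity, bank count, associativity, and block size.

```python
def cache_geometry(capacity_bytes, banks, associativity, block_bytes):
    """Derive the per-bank capacity and the number of sets per bank."""
    bank_bytes = capacity_bytes // banks
    set_bytes = associativity * block_bytes   # one set holds 'associativity' blocks
    return bank_bytes, bank_bytes // set_bytes


# L2: 4MB total, 16 banks, 16-way associative, 128B blocks
l2_bank_bytes, l2_sets = cache_geometry(4 * 1024 ** 2, 16, 16, 128)
print(f"L2: {l2_bank_bytes // 1024} KB per bank, {l2_sets} sets per bank")  # 256 KB, 128 sets

# L3: 32MB PRAM, 64 banks, 64-way associative, 128B blocks
l3_bank_bytes, l3_sets = cache_geometry(32 * 1024 ** 2, 64, 64, 128)
print(f"L3: {l3_bank_bytes // 1024} KB per bank, {l3_sets} sets per bank")  # 512 KB, 64 sets
```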
3DRWHCA-result
• 3DRWHCA performance
  - 16% geometric mean performance improvement over the baseline.
  - 11% improvement over the SRAM/MRAM RWHCA.
3DRWHCA-result
• 3DRWHCA power
  - 10% power reduction over the baseline, even with a PRAM L3.
  - Higher power than RWHCA, but lower power than the 3-level SRAM cache.
[Chart: four bars per workload: SRAM baseline, RWHCA, 3DRWHCA, 3-level SRAM.]
Conclusion
• Emerging NVM is getting mature.
• Will it bring a new impact on computer architecture and system design?