On-chip MRAM as a High-Bandwidth,
Low-Latency Replacement for DRAM Physical Memories
Abstract:
Impediments to main memory performance have traditionally been due to the divergence between processor and memory speed and to the pin bandwidth limitations of modern packaging technologies. In this paper we evaluate a magneto-resistive memory (MRAM)-based hierarchy to address these future constraints. MRAM devices are non-volatile, have the potential to be faster than DRAM and denser than embedded DRAM, and can be integrated into the processor die in layers above those of conventional wiring. We describe basic MRAM device operation, develop detailed models for MRAM banks and layers, and evaluate an MRAM-based memory hierarchy in which all off-chip physical DRAM is replaced by on-chip MRAM. We show that this hierarchy offers extremely high bandwidth, resulting in a 15% improvement in end-program performance over conventional DRAM-based main memory systems. Finally, we compare the MRAM hierarchy to one using a chip-stacked DRAM technology and show that the extra bandwidth of MRAM enables it to outperform this nearer-term technology. We expect that the advantage of MRAM-like technologies will increase with the proliferation of chip multiprocessors due to increased memory bandwidth demands.

Introduction:
Main memory latencies are already hundreds of cycles; processors often spend more than half of their time stalling on L2 misses. Memory latencies will continue to grow, but more slowly over the next decade than in the last, since processor pipelines are nearing their optimal depths. However, off-chip bandwidth will continue to grow as a performance-limiting factor, since the number of transistors on a chip is increasing at a faster rate than the number of chip signaling pins. Left unaddressed, this disparity will limit the scalability of future chip multiprocessors. Larger caches can reduce off-chip bandwidth constraints, but they consume area that could instead be used for processing, limiting the number of useful processors that can be implemented on a single die. In this paper, we evaluate the potential of on-chip magneto-resistive random access memory (MRAM) to solve this set of problems. MRAM is an emerging memory technology that stores information using the magnetic polarity of a thin ferromagnetic layer. This information is read by measuring the current across an MRAM cell, which is determined by the rate of electron quantum tunneling and is in turn affected by the magnetic polarity of the cell.

MRAM cells have many potential advantages. They are non-volatile, and they can be both faster than, and potentially as dense as, DRAM cells. They can be implemented in wiring layers above an active silicon substrate as part of a single chip. Multiple MRAM layers can thus be placed on top of a single die, permitting highly integrated capacities. Most important, the enormous interconnection density of 100,000 vertical wires per square millimeter, assuming the vertical wires have a pitch similar to that of global vias (currently 24 thick and 10 wide), will enable as many as 10,000 wires per addressable bank within the MRAM layer. In this technology, the number of interconnects and the total bandwidth are limited by the pitch of the vertical vias rather than by that of the pads required by conventional packaging technologies.

Unsurprisingly, MRAM devices have several potential drawbacks. They require high power to write, and layers of MRAM devices may interfere with heat dissipation. Furthermore, while MRAM devices have been prototyped, the latency and density of production MRAM cells in contemporary conventional technologies remain unknown. Justifying the investment needed to make MRAMs commercially competitive will require evidence of significant advantages over conventional technologies. One goal of our work is to determine whether MRAM hierarchies show enough potential performance advantage to be worth further exploration.

In this paper, we develop and describe access latency and area models for MRAM banks and layers. Using these models, we simulate a hierarchy that replaces off-chip DRAM physical memories with an on-chip MRAM memory hierarchy. Our MRAM hierarchy breaks a single MRAM layer into a collection of banks, in which the MRAM devices sit between two metal wiring layers, but in which the decoders, wordline drivers, and sense amplifiers reside on the transistor layer, thus consuming chip area. The MRAM banks hold physical pages, and under each MRAM bank resides a small level-2 (L2) cache which caches lines mapped to that MRAM bank. The mapping of physical pages thus determines to which L2 cache bank a line will be mapped. Since some MRAM banks are more expensive to access than others, due to the physical distances across the chip, page placement into MRAM banks can affect performance. An ideal placement policy would:
(1) minimize routing latency by placing frequently accessed pages into MRAM banks close to the processor,
(2) minimize network congestion by placing pages into banks that have the fewest highly accessed pages, and
(3) minimize L2 bank miss rates by distributing hot pages evenly across the MRAM banks.
According to our results, the best page placement policy with MRAM outperforms a conventional DRAM-based hierarchy by 15% across 16 memory-intensive benchmarks. We evaluate several page placement policies and find that, in near-term technology, minimizing L2 miss rates through uniform page distribution outweighs minimizing bank contention or routing delay. That balance will shift as cross-chip routing delays grow in future technologies, and as both wider-issue and CMP processors place a heavier load on the memory subsystem.
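Because pages are placed at a bank granularity, the physical page number alone determines both the MRAM bank and the L2 cache bank that will service an access. The sketch below is our own illustration of that mapping, not code from our simulator; the 8 KB page size and the simple table-based bank assignment are assumptions chosen for concreteness.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 8192u                  /* assumed physical page size */

    /* The OS records, for each physical page frame, which MRAM bank holds it.
     * Because each MRAM bank has its own L2 cache beneath it, this table also
     * fixes the L2 bank that will cache every line of that page.             */
    static uint8_t page_to_bank[1u << 20];   /* indexed by page frame number */

    static unsigned bank_of(uint64_t paddr)
    {
        uint64_t pfn = paddr / PAGE_SIZE;    /* physical frame number        */
        return page_to_bank[pfn];            /* bank chosen at page allocation */
    }

    int main(void)
    {
        /* Example: the OS placed frame 42 in bank 7 when the page was loaded. */
        page_to_bank[42] = 7;

        uint64_t paddr = 42ull * PAGE_SIZE + 0x100;   /* an access into that page */
        printf("address 0x%llx -> MRAM/L2 bank %u\n",
               (unsigned long long)paddr, bank_of(paddr));
        return 0;
    }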
Finally, we compare our MRAM hierarchy against another emerging memory technology called chip-stacked SDRAM. With this technology, a conventional DRAM chip is die-attached to the surface of a logic chip. This procedure enables the two chips to communicate through many more wires than can be found on conventional chip packages or even multi-chip modules (MCMs). Although the higher bandwidth is exploited in a manner similar to that of MRAM, the total I/O count is still substantially lower. Our preliminary evaluation of these two hierarchies shows that the MRAM hierarchy performs best, with the caveat that the parameters used for both are somewhat speculative.

Section 2 describes MRAM device operation and presents a model for the access delay of MRAM banks and layers. Section 3 proposes an MRAM-based memory hierarchy, in which a single MRAM layer is broken into banks, cached by per-bank L2 banks, and connected via a 2-D switched network. Section 4 compares a traditional memory hierarchy, with an off-chip DRAM physical memory, to the MRAM memory hierarchy. Section 5 presents a performance comparison of the MRAM hierarchy to a chip-stacked SDRAM memory hierarchy. Section 6 describes related work, and Section 7 summarizes our conclusions and describes issues for future study.

MRAM Memory Cells and Banks:
Magnetoresistive random access memory (MRAM) is a memory technology that uses the magnetic tunnel junction (MTJ) to store information. The potential of MRAM has improved steadily due to advances in magnetic materials. MRAM uses the magnetization orientation of a thin ferromagnetic material to store information, and a bit can be detected by sampling the difference in electrical resistance between the two polarized states of the MTJ. Current MRAM designs using MTJ material to store data are non-volatile and have unlimited read and write endurance. Along with its advantages of small dimensions and non-volatility, MRAM has the potential to be fabricated on top of a conventional microprocessor, thus providing very high bandwidth. The access time and cell size of MRAM have been shown to be comparable to those of DRAM. Thus, MRAM has attributes that make it competitive with conventional semiconductor memory.

MRAM Cell:
Figure 1 shows the different components of an MRAM cell. The cell is composed of a diode and an MTJ stack, which actually stores the data. The diode acts as a current rectifier and is required for reliable readout operation. The MTJ stack consists of two ferromagnetic layers separated by a thin dielectric barrier. The polarization of one of the magnetic layers is pinned in a fixed direction, while the direction of the other layer can be changed using the direction of current in the bitline. The resistance of the MTJ depends on the relative directions of polarization of the fixed and free layers, and is at its minimum or maximum depending on whether the directions are parallel or anti-parallel. When the polarization is anti-parallel, the electrons experience an increased resistance to tunneling through the MTJ stack. Thus, the information stored in a selected memory cell can be read by comparing its resistance with the resistance of a reference memory cell located along the same wordline. The resistance of the reference memory cell always remains at the minimum level.
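As a concrete illustration of this comparison-based readout, the toy model below decides a bit by comparing a selected cell's resistance against the always-parallel reference cell. The resistance values, the sensing margin, and the choice of which state encodes a '1' are assumptions made for the example, not measured MTJ parameters.

    #include <stdio.h>

    /* Illustrative resistances only; real MTJ values differ. */
    #define R_PARALLEL      1000.0  /* ohms: free layer aligned with the pinned layer */
    #define R_ANTIPARALLEL  1400.0  /* ohms: anti-parallel state tunnels less readily */

    /* Assume the anti-parallel (high-resistance) state encodes a '1'. */
    static double cell_resistance(int stored_bit)
    {
        return stored_bit ? R_ANTIPARALLEL : R_PARALLEL;
    }

    /* The reference cell is always at the minimum (parallel) resistance, so a
     * resistance noticeably above the reference is sensed as a stored '1'.   */
    static int read_bit(double cell_r, double reference_r)
    {
        return cell_r > 1.1 * reference_r;   /* 10% sensing margin (assumed) */
    }

    int main(void)
    {
        double reference = R_PARALLEL;
        printf("stored 0 reads as %d\n", read_bit(cell_resistance(0), reference));
        printf("stored 1 reads as %d\n", read_bit(cell_resistance(1), reference));
        return 0;
    }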
As the data stored in an MRAM cell are non-volatile, MRAMs do not consume any static power. MRAM cells also do not have to be refreshed periodically like DRAM cells. However, the read and write power of MRAM cells differ considerably, as the current required to change the polarity of a cell is almost 8 times that required to read it. A more complete comparison of the physical characteristics of competing memory technologies can be found in the literature.

The MRAM cell consists of a diode, which currently can be fabricated using excimer laser processing on a metal underlayer, and an MTJ stack, which can be fabricated using more conventional lithographic processes. The diode in this architecture must have a large on-to-off conductance ratio to isolate the sense path from the sneak paths. This isolation is achievable using thin-film diodes. Schottky barrier diodes have also been shown to be promising candidates for current rectification in MRAM cells. Thus, MRAM has the potential to be fabricated on top of a conventional microprocessor in the wiring layers. However, the devices required to operate the MRAM cells, namely the decoders, the drivers, and the sense amplifiers, cannot be fabricated in those layers and hence must reside in the active transistor layers below. Thus, a chip with MRAM memory will have an area overhead associated with these devices. The data cells and the diodes themselves do not incur any silicon overhead, since they are fabricated in the metal layers. One of the main challenges for MRAM scalability is cell stability at small feature sizes, as thermal agitation can cause a cell to lose data. However, researchers are already addressing this issue, and techniques have been proposed for improving cell stability down to a 100 nm feature size. IBM and Motorola are already exploring 0.18 um MRAM designs, and researchers at MIT have demonstrated 100 nm x 150 nm prototypes. While there will be challenges in design and manufacture, existing projections indicate that MRAM technology can be scaled and, with enough investment and research, will be competitive with other conventional and emerging memory technologies.

MRAM Bank Design:
Figure 2 shows an MRAM bank composed of a number of MRAM cells located at the intersection of every bitline and wordline. During a read operation, current sources are connected to the bitlines and the selected wordline is pulled low by the wordline driver. Current flows through the cells in the selected wordline, and the magnitude of the current through each cell depends on its relative magnetic polarity. If the ferromagnetic layers have the same polarity, the cell has a lower resistance, and hence more current flows through the cell, reducing the current flowing through the sense amplifier. The current through the sense amplifiers is shown graphically in Figure 2 for the case in which the middle wordline is selected for reading. The bitline associated with the topmost cell experiences a smaller drop in current, as that cell has a higher resistance than the other two cells connected to the selected wordline. This change in current is detected using the sense amplifiers, and the stored data is read out. As the wordline is responsible for sinking the current through a number of cells, the wordline driver must be strong enough to ensure reliable sensing. Alternative sensing schemes with increased sensing reliability have been proposed for MRAM, but they also increase the cell area.
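To make the read operation concrete, the sketch below models one selected wordline: each bitline is driven by a current source, a low-resistance (parallel) cell diverts more of that current, and the sense amplifier therefore sees a larger drop on bitlines holding parallel cells. All numbers are placeholders chosen for illustration, not device measurements behind Figure 2.

    #include <stdio.h>

    #define BITLINES 3          /* three cells on the selected wordline, as in Figure 2 */

    /* Illustrative electrical values only. */
    #define I_SOURCE 1.0e-3     /* amperes supplied by each bitline current source */
    #define V_BIAS   0.5        /* volts across a selected cell                    */
    #define R_LOW    1000.0     /* ohms: parallel (low-resistance) state           */
    #define R_HIGH   1400.0     /* ohms: anti-parallel (high-resistance) state     */

    int main(void)
    {
        /* Polarity of the selected cells: 1 = anti-parallel (assumed to encode '1'). */
        int stored[BITLINES] = { 1, 0, 0 };

        /* Current the sense amplifier would see for a reference (parallel) cell. */
        double i_sense_ref = I_SOURCE - V_BIAS / R_LOW;

        for (int b = 0; b < BITLINES; b++) {
            double r       = stored[b] ? R_HIGH : R_LOW;
            double i_cell  = V_BIAS / r;            /* current diverted through the cell */
            double i_sense = I_SOURCE - i_cell;     /* remainder reaches the sense amp   */
            int    bit     = i_sense > 1.1 * i_sense_ref;  /* smaller drop => '1' (assumed margin) */
            printf("bitline %d: sense current %.3f mA -> bit %d\n", b, i_sense * 1e3, bit);
        }
        return 0;
    }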
MRAM Bank Modeling:
To estimate the access time of an MRAM bank and the area overhead incurred in the transistor layer by the MRAM banks, we developed an area and timing tool by extending CACTI-3.0 and adding MRAM-specific features to the model. In our model, MRAM memory is divided into a number of banks which are independently accessible. The banks in turn are broken into sub-banks to reduce access time. The sub-banks comprising a bank, however, are not independently accessible. Some of the important features we added to model MRAMs include:
1. The area consumed in the transistor layer by the devices required to operate the bank, including decoders, wordline drivers, bitline drivers, and sense amplifiers.
2. The delay due to the vertical wires carrying signals and data between the transistor layer and the MRAM layer.
3. The MRAM capacity for a given die size and MRAM cell size.
4. Multiple layers of MRAM with independent and shared wordlines and bitlines.
We used the 2001 SIA roadmap for technology parameters at the 90 nm node. Given an MRAM bank size and the number of sub-banks in each bank, our tool computes the time to access the MRAM bank by computing the access time of a sub-bank and accounting for the wire delay to reach the farthest sub-bank. To choose the optimal sub-bank size, we looked at designs of modern DRAM chips and made the MRAM sub-bank sizes similar to current DRAM sub-bank sizes. We computed the access latency for various sub-bank configurations using our area and timing model; this latency is shown in Table 1. From the table it is clear that the latency increases substantially once the sub-bank size grows beyond 8 Mb. We therefore fixed 4 Mb as the sub-bank size in our system. Our area and timing tool was then used to compute the delay for banks composed of different numbers of sub-banks. We added a fixed 5 ns latency to the bank latency to account for the MRAM cell access latency, which is half the access time demonstrated in current prototypes. The 4 Mb bank latency was then used in our architectural model, which is described in the next section.

We used a single vertical MRAM layer in our evaluation. Future implementations with multiple MRAM layers will provide a much larger memory capacity, but will also increase the number of vertical buses and the active area overhead if the layers have to be accessed independently. It might be possible to reduce the number of vertical buses and the active area overhead by sharing either the wordlines or the bitlines among the different layers, although sharing bitlines among layers might interfere with reliable sensing of the MRAM cells. Evaluation of multiple-layer MRAM architectures is a topic for future research.
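A minimal sketch of the bank-latency calculation described above is shown below, under stated assumptions: the 4 Mb sub-bank access time, the per-millimeter wire delay, and the sub-bank footprint are placeholder constants standing in for the values our CACTI-3.0 extension produces; only the fixed 5 ns cell-access adder comes from the text.

    #include <math.h>
    #include <stdio.h>

    /* Placeholder inputs; the real values come from the area and timing tool. */
    #define SUBBANK_ACCESS_NS   3.0    /* assumed 4 Mb sub-bank access time      */
    #define WIRE_NS_PER_MM      0.25   /* assumed repeated-wire delay at 90 nm   */
    #define SUBBANK_SIDE_MM     1.0    /* assumed footprint of one 4 Mb sub-bank */
    #define MRAM_CELL_NS        5.0    /* fixed cell-access adder from the text  */

    /* Bank latency = sub-bank access time + wire delay to the farthest sub-bank
     * (sub-banks tiled in a square grid) + the fixed MRAM cell latency.         */
    static double bank_latency_ns(int subbanks_per_bank)
    {
        int side = (int)ceil(sqrt((double)subbanks_per_bank));
        double farthest_mm = 2.0 * side * SUBBANK_SIDE_MM;   /* x + y traversal */
        return SUBBANK_ACCESS_NS + farthest_mm * WIRE_NS_PER_MM + MRAM_CELL_NS;
    }

    int main(void)
    {
        for (int n = 1; n <= 16; n *= 2)
            printf("%2d sub-banks per bank -> %.2f ns\n", n, bank_latency_ns(n));
        return 0;
    }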
A Chip-Level MRAM Memory Hierarchy:
MRAM technology promises a large memory capacity close to the processor. With global wire delays becoming significant, we need a different approach to managing this memory efficiently, to ensure low-latency access in the common case. In this section, we develop a distributed memory architecture for managing MRAM memory and use dynamic page allocation policies to distribute data efficiently across the chip.

Basic MRAM System Architecture:
Our basic MRAM architecture consists of a number of independently accessed MRAM banks distributed across the chip. As described in Section 2, the data stored in the MRAM banks reside in a separate vertical layer above the processor substrate, while the bank controller and the other associated logic required to operate each MRAM bank reside on the transistor layer. The banks are connected through a network that carries request and response packets between the level-1 (L1) cache and each bank controller. Figure 3 shows our proposed architecture, with the processor assumed to be in the center of the network. To cache the data stored in each MRAM bank, we break the SRAM L2 cache into a number of smaller caches and associate each of these smaller caches with an MRAM bank. The SRAM cache associated with each MRAM bank has low latency due to its small size, and it has a high-bandwidth vertical channel to its MRAM bank. Thus, even for large cache lines, the cache can be filled with a single access to the MRAM bank on a miss. Each SRAM cache is fabricated in the active layer below the MRAM bank with which it is associated. The SRAM cache is smaller than the MRAM bank and can thus easily fit under the bank. The decoders, sense amplifiers, and other active devices required to operate an MRAM bank are also present below each MRAM bank. We assume that the MRAM banks occupy 75% of the chip area in the metal layer, and that the SRAM caches and the associated MRAM circuitry occupy 60% of the chip area in the active layer. Each node in the network has an MRAM bank controller that receives requests from the L1 cache and first checks its local L2 cache to see if the data are present. On a cache hit, the data are retrieved from the cache and returned via the network. On a cache miss, the request is sent to the MRAM bank, which returns the data to the controller and also fills its associated L2 cache. We model channel and buffer contention in the network, and we also model contention for the ports associated with each SRAM cache and MRAM bank.
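The per-node request handling just described is sketched below as a small, self-contained model. The 512-byte line size, the direct-mapped 128-set L2 slice, and the streaming demo in main are assumptions chosen for illustration; they are not the configurations evaluated in this paper.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative sizes (assumptions): 512-byte lines, 128-set direct-mapped
     * L2 slice per node.                                                      */
    #define LINE_BYTES  512u
    #define L2_SETS     128u

    struct node {
        uint64_t tag[L2_SETS];
        bool     valid[L2_SETS];
        unsigned l2_hits, l2_misses;
    };

    /* Handle a request from the L1: check the local L2 slice first; on a miss,
     * the MRAM bank supplies the whole line over the wide vertical channel and
     * the slice is filled in a single transfer.                                */
    static void handle_request(struct node *n, uint64_t paddr)
    {
        uint64_t line = paddr / LINE_BYTES;
        unsigned set  = (unsigned)(line % L2_SETS);
        uint64_t tag  = line / L2_SETS;

        if (n->valid[set] && n->tag[set] == tag) {
            n->l2_hits++;                      /* served from the L2 slice     */
        } else {
            n->l2_misses++;                    /* served by the MRAM bank ...  */
            n->valid[set] = true;              /* ... which also fills the L2  */
            n->tag[set]   = tag;
        }
    }

    int main(void)
    {
        struct node n;
        memset(&n, 0, sizeof n);
        for (uint64_t a = 0; a < 64 * 1024; a += 64)
            handle_request(&n, a);             /* simple streaming access demo */
        printf("L2 hits %u, misses %u\n", n.l2_hits, n.l2_misses);
        return 0;
    }

For the streaming pattern in main, the model reports 128 misses and 896 hits, i.e., one MRAM access per 512-byte line touched.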
Factors influencing the MRAM design space:
The cost of accessing data in an MRAM system depends on a number of factors. Since the MRAM banks are distributed and connected by a network, the latency to access a bank depends on the bank's physical location in the network. The access cost also depends on the congestion in the network on the way to the bank, and on the contention in the L2 cache associated with the bank. Understanding the trade-offs among these factors is important to achieving high performance in an MRAM system.

Number of banks: Having a large number of banks in the system increases concurrency and ensures fast hits to the closest banks. However, the network traversal cost to reach the farthest bank also increases with the number of hops. The amount of L2 cache associated with each bank depends on the number of banks in the system: for a fixed total L2 size, having more banks results in a smaller L2 cache associated with each bank, although the latency of each L2 cache is then lower because of its smaller size. Thus, increasing the number of banks reduces the cache and MRAM bank latency (because of the smaller bank size for a fixed total MRAM capacity), while increasing both the potential miss rate in each individual L2 cache and the latency to traverse the network due to the greater number of hops.

Cache Line Size: Because MRAM can provide a high-bandwidth interface to its associated L2 cache, we can use large line sizes in the L2 cache, and a line can potentially be filled with a single access to MRAM on an L2 miss. Large line sizes can increase spatial locality, but they also increase the access time of the cache. Thus, there is a trade-off between increased locality and increased hit latency which determines the optimal line size when bandwidth is not a constraining factor. In addition, the line size affects the number of bytes written back into an MRAM array, which is important because of the substantial power required to perform an MRAM write compared to a read.

Page Placement Policy: The MRAM banks are accessed using physical addresses, and the memory in the banks is allocated at a page granularity. Thus, when a page is loaded, the operating system can assign it to a frame in any MRAM bank, making the choice of bank a page placement policy decision.
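As one way to make the placement trade-offs concrete, the sketch below scores each candidate bank for an incoming hot page using the three objectives from the introduction (routing distance, existing hot pages, and overall occupancy). The weights, the fields of bank_state, and the example values are assumptions made for illustration; they are not the policies evaluated in this paper.

    #include <stdio.h>

    #define NUM_BANKS 16              /* small example; we also study 100-bank systems */

    struct bank_state {
        unsigned hops_from_cpu;       /* routing distance to the processor   */
        unsigned hot_pages;           /* highly accessed pages already there */
        unsigned total_pages;         /* occupancy, for even distribution    */
    };

    /* Lower is better.  The weights are arbitrary illustration values. */
    static double placement_score(const struct bank_state *b)
    {
        return 1.0 * b->hops_from_cpu      /* (1) routing latency                */
             + 2.0 * b->hot_pages          /* (2) bank and network contention    */
             + 0.5 * b->total_pages;       /* (3) spread pages to limit L2 misses */
    }

    static int choose_bank(const struct bank_state banks[])
    {
        int best = 0;
        for (int i = 1; i < NUM_BANKS; i++)
            if (placement_score(&banks[i]) < placement_score(&banks[best]))
                best = i;
        return best;
    }

    int main(void)
    {
        struct bank_state banks[NUM_BANKS];
        for (int i = 0; i < NUM_BANKS; i++) {
            banks[i].hops_from_cpu = 1u + (unsigned)i / 4u;   /* rings of banks          */
            banks[i].hot_pages     = (i < 4) ? 3u : 0u;       /* near banks already busy */
            banks[i].total_pages   = 10u;
        }
        printf("place the new hot page in bank %d\n", choose_bank(banks));
        return 0;
    }

In this example the nearest banks already hold hot pages, so the policy places the page one ring farther out (bank 4), trading a hop of latency for reduced contention.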
MRAM Latency Sensitivity:
To study the sensitivity of our MRAM architecture to MRAM bank latency, we examine the performance of our MRAM system with increasing bank latencies. The mean performance of the system across our set of benchmarks for different bank access latencies is shown in Figure 8. The horizontal line represents the mean IPC for the conventional SDRAM system. As can be seen from this graph, the performance of our architecture is relatively insensitive to MRAM latency, and it breaks even with the SDRAM system only at MRAM latencies larger than the off-chip SDRAM latency. This phenomenon occurs because the higher bandwidth of the on-chip MRAM hierarchy compensates for the increased bank access latency.
Cost of Writes: Writes to MRAM memory consume more power than reads because of the larger current needed to change the polarity of the MRAM cell. Hence, for a low-power design, it might be better to consider an architecture which minimizes the amount of data written back into the MRAM banks from the L2 cache. In Table 4 we show the total number of bytes written back into MRAM memory for a 100-bank configuration with different line sizes. We show only a subset of the benchmarks, as all of the benchmarks show the same trend. From Table 4 we can see that the total volume of data written back increases with increasing line size. Even though the number of writebacks decreases with larger line sizes, the amount of data written back increases, because the decrease in the number of writebacks is more than offset by the increased line size. Thus, there is a power-performance trade-off in an MRAM system, as larger line sizes consume more power but yield better performance. We are currently exploring other mechanisms, such as sub-blocking, to reduce the volume of data written back when long cache lines are employed.
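The effect is simple arithmetic, illustrated below with made-up writeback counts (not the data from Table 4): as long as doubling the line size reduces the number of writebacks by less than a factor of two, the total number of bytes written to MRAM, and hence the write energy, keeps growing.

    #include <stdio.h>

    /* Made-up writeback counts for one hypothetical benchmark. */
    struct config { unsigned line_bytes; unsigned long writebacks; };

    int main(void)
    {
        const struct config cfg[] = {
            { 128, 4000000UL },   /* smaller lines: more writebacks ...        */
            { 256, 2600000UL },   /* ... but fewer than half as many when the  */
            { 512, 1700000UL },   /*     size doubles, so total bytes grow     */
        };

        for (unsigned i = 0; i < sizeof cfg / sizeof cfg[0]; i++) {
            unsigned long long bytes = (unsigned long long)cfg[i].writebacks
                                       * cfg[i].line_bytes;
            printf("line %4u B: %7lu writebacks -> %4llu MB written to MRAM\n",
                   cfg[i].line_bytes, cfg[i].writebacks, bytes >> 20);
        }
        return 0;
    }

With these illustrative counts the totals come to roughly 488 MB, 634 MB, and 830 MB, mirroring the trend the text describes for Table 4.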
Conclusions:
In this paper, we have introduced and examined an emerging memory technology, MRAM, which promises to enable large, high-bandwidth memories. MRAM can be integrated into the microprocessor die, avoiding the pin bandwidth limitations found in conventional off-chip memory systems. We have developed a model for simulating MRAM banks and used it to examine the trade-offs between line size and bank count in order to derive the MRAM organization with the best performance. We break down the components of latency in the memory system and examine the potential of page placement to improve performance. Finally, we have compared MRAM with conventional SDRAM memory systems and with another emerging technology, chip-stacked SDRAM, to evaluate its potential as a replacement for main memory.
Our results show that MRAM systems perform 15% better than conventional SDRAM systems and 30% better than stacked SDRAM systems. An important feature of our memory architecture is that the L2 cache and the MRAM banks are partitioned. This architecture reduces miss conflicts in the L2 cache and provides high bandwidth when multiple L2 caches are accessed simultaneously. We studied MRAM systems with perfect L2 caches and perfect networks to understand where performance was being lost. We found that the penalties of L2 cache conflicts and of network latency had widely varying effects among the benchmarks. These results did show, however, that page allocation policies in the operating system have great potential to improve MRAM performance.

Our work suggests several opportunities for future MRAM research. First, our partitioned MRAM memory system allows page placement policies for a uniprocessor to consider a new variable: proximity to the processor. Allowing pages to migrate dynamically between MRAM partitions may provide additional performance benefits. Second, the energy use of MRAM must be characterized and compared to that of alternative memory technologies. Applications may have quite different energy profiles, given that the energy required to write an MRAM cell is greater than that required to read it. In addition, the L2 cache line size has a strong effect on the amount of data written to the MRAM and may be an important factor in tuning systems to use less energy. Third, since MRAM memory is non-volatile, its impact on system reliability relative to conventional memory should be measured. Finally, our uniprocessor simulation does not take full advantage of the large bandwidth inherent in the partitioned MRAM. We expect that chip multiprocessors will see additional performance gains beyond the uniprocessor model studied in this paper.

References:

[1] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock rate versus IPC: The end of the road for conventional microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 248–259, June 2000.

[2] V. Agarwal, S. W. Keckler, and D. Burger. The effect of technology scaling on microarchitectural structures. Technical Report TR2000-02, Department of Computer Sciences, University of Texas at Austin, Austin, TX, Aug. 2000.

[3] D. Bailey, J. Barton, T. Lasinski, and H. Simon. The NAS parallel benchmarks. Technical Report RNR-91-002 Revision 2, NASA Ames Research Laboratory, Ames, CA, Aug. 1991.

[4] H. Boeve, C. Bruynseraede, J. Das, K. Dessein, G. Borghs, and J. D. Boeck. Technology assessment for the implementation of magnetoresistive elements with semiconductor components in magnetic random access memory (MRAM) architectures. IEEE Transactions on Magnetics, 35:2820–2825, Sep. 1999.
[5] P. N. Brown, R. D. Falgout, and J. E. Jones. Semicoarsening multigrid on distributed memory machines. Technical Report UCRL-JC-130720, Lawrence Livermore National Laboratory, 2000.

[6] R. Chandra, S. Devine, B. Verghese, A. Gupta, and M. Rosenblum. Scheduling and page migration for multiprocessor compute servers. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 12–24, San Jose, California, 1994.

[7] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A performance comparison of contemporary DRAM architectures. In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 222–233, May 1999.

[8] R. Desikan, D. Burger, S. W. Keckler, and T. M. Austin. Sim-alpha: a validated execution-driven Alpha 21264 simulator. Technical Report TR-01-23, Department of Computer Sciences, University of Texas at Austin, 2001.