Lec07 Memory sp17


CS250: VLSI Systems Design
Lecture 7: Memory Technology and Patterns
Spring 2017
John Wawrzynek with James Martin (GSI)
Thanks to John Lazzaro for the slides
Lecture 07, Memory CS250, UC Berkeley Sp17
Memory: Technology and Patterns

- Memory, the 10,000 ft view. Latency, from Steve Wozniak to the power wall.
- How DRAM works. Memory design when low cost per bit is the priority.
- Break
- How SRAM works. The memory technology available on logic dies.
- Memory design patterns. Ways to use SRAM in your project designs.

CS 250 L07: Memory UC Regents S17 © UCB
40% of this ARM CPU is devoted to SRAM cache. But the role of cache in computer design has varied widely over time.

CS 152 L14: Cache I UC Regents Spring 2005 © UCB
1977: DRAM faster than microprocessors
Apple ][ (1977): CPU cycle time 1000 ns, DRAM access time 400 ns.
(Pictured: Steve Wozniak and Steve Jobs)


1980-2003, CPU speed outpaced DRAM ...
Q. How do architects address this gap?
A. Put smaller, faster "cache" memories between CPU and DRAM. Create a "memory hierarchy".
[Chart: performance (1/latency) vs. year, 1980-2005, log scale. CPU improved 60% per year (2X in 1.5 years) until hitting the power wall; DRAM improved 9% per year (2X in 10 years); the gap grew 50% per year.]
Caches: Variable-latency memory ports
Data in upper-level memory is returned to the processor with lower latency; data in the lower level is returned with higher latency.
[Diagram: CPU exchanges data with a small, fast upper-level memory, which exchanges blocks (Blk X, Blk Y) with a large, slow lower-level memory.]


Programs with locality cache well ...
[Plot: memory address (one dot per access) vs. time, showing bands of temporal locality and runs of spatial locality; scattered accesses cache badly.]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
The caching algorithm in one slide

- Temporal locality: Keep most recently accessed data closer to the processor.
- Spatial locality: Move contiguous blocks in the address space to upper levels.
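The two locality rules can be seen in a toy model. This sketch is not from the slides: a minimal direct-mapped cache simulator with invented sizes (64 lines, 16-byte blocks), showing that an access stream with spatial and temporal locality hits far more often than a scattered one.

```python
# Illustrative sketch (not from the lecture): a tiny direct-mapped cache
# model showing why programs with locality cache well.
def hit_rate(addresses, num_lines=64, block_size=16):
    """Simulate a direct-mapped cache; return the fraction of hits."""
    tags = [None] * num_lines          # one tag per cache line
    hits = 0
    for addr in addresses:
        block = addr // block_size     # spatial locality: whole block cached
        line = block % num_lines
        if tags[line] == block:
            hits += 1                  # reuse of a cached block
        else:
            tags[line] = block         # miss: fetch block into the line
    return hits / len(addresses)

sequential = list(range(4096))                            # strong locality
random_ish = [(i * 104729) % 4096 for i in range(4096)]   # poor locality
print(hit_rate(sequential))   # 0.9375: 15 of every 16 accesses hit
print(hit_rate(random_ish))   # far lower: blocks are evicted before reuse
```

With 16-byte blocks, a sequential walk misses once per block and then hits 15 times, so the hit rate is exactly 15/16.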


2005 Memory Hierarchy: Apple iMac G5 (1.6 GHz, $1299.00)

                  Reg    L1 Inst  L1 Data  L2      DRAM    Disk
Size              1K     64K      32K      512K    256M    80G
Latency (cycles)  1      3        3        11      160     1E+07
Managed by:       compiler  --- hardware ---       OS, hardware, application

Goal: Illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
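The hierarchy's payoff can be estimated with the standard average-memory-access-time recurrence. A sketch: only the latencies (3, 11, 160 cycles) come from the table above; the hit rates here are invented for illustration.

```python
# Average Memory Access Time (AMAT) sketch using the iMac G5 latencies
# from the table above. The miss rates are assumed values, not measured.
def amat(hit_time, miss_rate, miss_penalty):
    """Classic AMAT recurrence: hit_time + miss_rate * miss_penalty."""
    return hit_time + miss_rate * miss_penalty

dram = 160                                                  # cycles (table)
l2 = amat(hit_time=11, miss_rate=0.05, miss_penalty=dram)   # L2 backed by DRAM
l1 = amat(hit_time=3,  miss_rate=0.10, miss_penalty=l2)     # L1 backed by L2
print(l1)   # ≈ 4.9 cycles: close to L1 speed despite 160-cycle DRAM
```

Even with these made-up miss rates, the effective latency stays near the L1 hit time, which is the whole point of the hierarchy.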
[Die photo: PowerPC 970FX, 90 nm, 58 M transistors. Visible blocks: L1 instruction cache (64K), L1 data cache (32K), L2 cache (512K), registers (1K).]
Latency: A closer look
Read latency: Time to return first byte of a random access

                  Reg    L1 Inst  L1 Data  L2      DRAM    Disk
Size              1K     64K      32K      512K    256M    80G
Latency (cycles)  1      3        3        11      160     1E+07
Latency (sec)     0.6n   1.9n     1.9n     6.9n    100n    12.5m
Hz                1.6G   533M     533M     145M    10M     80

Architect's latency toolkit:
(1) Parallelism. Request data from N 1-bit-wide memories at the same time. Overlaps latency cost for all N bits. Provides N times the bandwidth. Requests to N memory banks (interleaving) have the potential of N times the bandwidth.
(2) Pipeline memory. If memory has N cycles of latency, issue a request each cycle, and receive the data N cycles later.
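Item (2) of the toolkit can be sketched numerically. This is an illustration, not the lecture's own material: with one request issued per cycle, steady-state throughput is one result per cycle even though each individual access still takes the full latency.

```python
# Sketch of the pipelining tool from the slide (illustrative only):
# issue one request per cycle; each result arrives `latency` cycles later.
def pipelined_finish_times(num_requests, latency):
    """Cycle on which each request's data comes back, one issue per cycle."""
    return [issue_cycle + latency for issue_cycle in range(num_requests)]

times = pipelined_finish_times(num_requests=8, latency=5)
print(times)   # [5, 6, 7, 8, 9, 10, 11, 12]
# 8 results by cycle 12, vs. 8 * 5 = 40 cycles if requests were serialized.
```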
Capacitance and memory



Storing computational state as charge
State is coded as the amount of energy stored by a device (e.g. 1.5V across a capacitor). State is read by sensing the amount of energy.
Problems: noise changes Q (up or down), parasitics leak or source Q. Fortunately, Q cannot change instantaneously, but that only gets us in the ballpark.

CS 250 L07: Memory UC Regents S16 © UCB
How do we fight noise and win?

- Store more energy than we expect from the noise. Q = CV: to store more charge, use a bigger V or make a bigger C. Cost: power, chip size.
- Represent state as charge in ways that are robust to noise. Example: 1 bit per capacitor. Write 1.5 volts on C; to read C, measure V. V > 0.75 volts is a "1", V < 0.75 volts is a "0". Cost: could have stored many bits on that capacitor.
- Correct small state errors that are introduced by noise. Example: read C every 1 ms. Is V > 0.75 volts? Write back 1.5V (yes) or 0V (no). Cost: complexity.
Dynamic Memory Cells



DRAM cell: 1 transistor, 1 capacitor
[Schematic and cross-section: the access nFET connects the "bit line" to the cell capacitor; its gate is the "word line". The capacitor's far plate ties to Vdd. Word line and Vdd run on the "z-axis".]
Why the Vcap values start out at ground: diode leakage current.
A 4 x 4 DRAM array (16 bits) ....



Invented after SRAM, by Robert Dennard



DRAM Circuit Challenge #1: Writing
Driving the bit line to Vdd charges the cell capacitor only to Vdd - Vth. Bad: we store less charge. Why do we not get Vdd?
Ids = k [Vgs - Vth]^2, but the transistor "turns off" when Vgs <= Vth!
Vgs = Vdd - Vc. When Vdd - Vc == Vth, charging effectively stops!
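The square-law equation on the slide can be integrated numerically to watch the cell voltage stall. A rough sketch with made-up device constants; only the Vdd - Vth stall point comes from the slide.

```python
# Numerically integrate the slide's square-law current,
# Ids = k * (Vgs - Vth)**2 (valid only while Vgs > Vth),
# for an nFET charging the cell capacitor. k, C, dt and the voltage
# values are invented for illustration.
vdd, vth = 1.8, 0.45           # volts (assumed)
k, c, dt = 1e-4, 1e-15, 1e-12  # A/V^2, farads, seconds (assumed)

vc = 0.0                       # cell voltage starts at ground
for _ in range(200000):
    vgs = vdd - vc             # gate at Vdd, source follows the cell node
    if vgs <= vth:
        break                  # transistor "turns off": charging stops
    ids = k * (vgs - vth) ** 2
    vc += ids * dt / c         # dV = I * dt / C

print(round(vc, 3))            # ≈ 1.35, i.e. vdd - vth, never 1.8
```

The cell voltage creeps up to Vdd - Vth and then effectively freezes, exactly the Vgs = Vth cutoff the slide describes.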
DRAM Challenge #2: Destructive Reads
[Schematic: raising the word line dumps the stored charge (Vc -> 0) onto a bit line initialized to a low voltage (0 -> Vdd on the cell side).]
Raising the word line removes the charge from every cell it connects to!
DRAMs write back after each read.


DRAM Circuit Challenge #3a: Sensing
Assume Ccell = 1 fF. The bit line may have 2000 nFET drains; assume a bit line C of 100 fF, or 100*Ccell. Ccell holds Q = Ccell*(Vdd - Vth).
When we dump this charge onto the bit line, what voltage do we see?
dV = [Ccell*(Vdd - Vth)] / [100*Ccell] = (Vdd - Vth) / 100 ≈ tens of millivolts!
In practice, scale the array to get a 60mV signal.
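The charge-sharing arithmetic above checks out numerically. A sketch: the 1 fF cell and 100x bit-line capacitance are the slide's assumptions; the Vdd and Vth values here are invented.

```python
# Charge-sharing check for the slide's dV formula (illustrative values).
ccell = 1e-15             # 1 fF cell capacitor (from the slide)
cbit = 100 * ccell        # bit line: ~2000 nFET drains, ~100 fF (slide)
vdd, vth = 1.8, 0.45      # assumed supply and threshold voltages

q = ccell * (vdd - vth)   # charge stored in the cell
dv = q / cbit             # swing after dumping the charge onto the bit line
print(dv * 1000)          # ≈ 13.5 millivolts: "tens of mV", as the slide says
```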
DRAM Circuit Challenge #3b: Sensing
How do we reliably sense a 60mV signal? Compare the bit line against the voltage on a "dummy" bit line whose cells hold no charge, using a "sense amp" (bit line to sense on +, dummy bit line on -).


DRAM Challenge #4: Leakage ...
Parasitic currents (e.g. diode leakage at the access transistor's junctions) leak away the stored charge.
Solution: "Refresh", by rewriting cells at regular intervals (tens of milliseconds).
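Refresh is cheap in time. Using the 64 ms / 8192-cycle refresh figure from the Micron DDR2 datasheet excerpt later in this lecture, the duty cycle works out to well under 1%; the time each refresh occupies the array (about one row cycle, ~55 ns) is an assumption here.

```python
# Refresh overhead sketch. 64 ms / 8192 refresh cycles come from the
# Micron DDR2 datasheet quoted in this lecture; the ~55 ns a refresh
# busies the array (about one row cycle) is an assumed value.
refresh_period = 64e-3    # seconds: every row rewritten within 64 ms
refresh_cycles = 8192     # row-refresh commands per period
t_refresh = 55e-9         # seconds per refresh (assumed)

interval = refresh_period / refresh_cycles           # spacing of refreshes
overhead = refresh_cycles * t_refresh / refresh_period
print(interval * 1e6)     # ≈ 7.8 microseconds between refresh commands
print(overhead * 100)     # ≈ 0.7 percent of time spent refreshing
```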
DRAM Challenge #5: Cosmic Rays ...
The cell capacitor holds 25,000 electrons (or less). Cosmic rays that constantly bombard us can release the charge!
Solution: Store extra bits to detect and correct random bit flips (ECC).
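The ECC idea can be shown with the smallest classic code. A minimal illustration, not a production DRAM ECC: real ECC modules use wider single-error-correct, double-error-detect codes (e.g. 72 stored bits protecting 64), but a Hamming(7,4) code already corrects any single flipped bit.

```python
# Hamming(7,4) sketch of the slide's ECC idea: extra parity bits let us
# detect and correct one random bit flip (e.g. from a cosmic-ray hit).
def encode(d):                       # d: list of 4 data bits
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                # each parity covers an overlapping
    p2 = d1 ^ d3 ^ d4                # subset of the data bits
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def correct(c):                      # c: received 7-bit codeword
    c = list(c)
    s = 0                            # syndrome = binary position of error
    for pos in range(1, 8):
        if c[pos - 1]:
            s ^= pos
    if s:                            # nonzero syndrome: flip the bad bit
        c[s - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]  # recover d1, d2, d3, d4

word = [1, 0, 1, 1]
cw = encode(word)
cw[4] ^= 1                           # a "cosmic ray" flips one stored bit
print(correct(cw) == word)           # True: the flip was repaired
```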
DRAM Challenge #6: Yield
If one bit is bad, do we throw the chip away?
Solution: Add extra bit lines (i.e. 80 when you only need 64), used for "sparing". During testing, find the bad bit lines, and use high current to burn away "fuses" put on the chip to remove them.


DRAM Challenge #7: Scaling
Each generation of IC technology, we shrink the width and length of the cell.
Problem 1: If Ccell and drain capacitances scale together, the number of bits per bit line stays constant:
dV ≈ 60 mV = [Ccell*(Vdd - Vth)] / [100*Ccell]
Problem 2: Vdd may need to scale down too! The number of electrons per cell shrinks.
Solution: Constant innovation of cell capacitors!
Poly-diffusion Ccell is ancient history
[Original planar cell, schematic and cross-section: bit line, word line, capacitor to Vdd; word line and Vdd run on the "z-axis".]
Early replacement: “Trench” capacitors



Final generation of trench capacitors
The companies that kept scaling trench capacitors for commodity DRAM chips went out of business.
Samsung 90nm stacked capacitor bitcell.
DRAM: the field for material and process innovation
[Cross-section image credit: Arabinda Das]
Samsung 30nm
[From IEEE Journal of Solid-State Circuits, vol. 48, no. 1, January 2013, p. 168, and Arabinda Das]
In the labs: Vertical cell transistors ...
Ki-Whan Song, Jin-Young Kim, Jae-Man Yoon, Sua Kim, Huijung Kim, et al., "A 31 ns Random Cycle VCAT-Based 4F² DRAM With Manufacturability and Enhanced Cell Efficiency," IEEE Journal of Solid-State Circuits, vol. 45, no. 4, April 2010, p. 880.
Fig. 1: (a) Cross section of the surrounding-gate vertical channel access transistor (VCAT); (b) schematic diagram of the VCAT-based 4F² DRAM cell array.
512Mb: x4, x8, x16 DDR2 SDRAM Features
DDR2 SDRAM: MT47H128M4 (32 Meg x 4 x 4 banks), MT47H64M8 (16 Meg x 8 x 4 banks), MT47H32M16 (8 Meg x 16 x 4 banks)

Features:
- VDD = +1.8V ±0.1V, VDDQ = +1.8V ±0.1V
- JEDEC-standard 1.8V I/O (SSTL_18-compatible)
- Differential data strobe (DQS, DQS#) option
- 4n-bit prefetch architecture
- Duplicate output strobe (RDQS) option for x8
- DLL to align DQ and DQS transitions with CK
- 4 internal banks for concurrent operation
- Programmable CAS latency (CL)
- Posted CAS additive latency (AL)
- WRITE latency = READ latency - 1 tCK
- Selectable burst lengths: 4 or 8
- Adjustable data-output drive strength
- 64ms, 8192-cycle refresh
- On-die termination (ODT)
- Industrial (IT) and automotive (AT) temperature options
- RoHS-compliant
- Supports JEDEC clock jitter specification

Options: configuration (128 Meg x 4, 64 Meg x 8, 32 Meg x 16); FBGA package, Pb-free or lead solder (84-ball, 8mm x 12.5mm for x16; 60-ball, 8mm x 10mm for x4/x8); cycle time (2.5ns @ CL = 5, DDR2-800, -25E; 2.5ns @ CL = 6, DDR2-800, -25; 3.0ns @ CL = 4, DDR2-667, -3E; 3.0ns @ CL = 5, DDR2-667, -3; 3.75ns @ CL = 4, DDR2-533, -37E); self refresh (standard or low-power); operating temperature (commercial, 0°C to 85°C).
[DRAM core: word lines select a "row"; bit lines carry the "columns".]
People buy DRAM for the bits. "Edge" circuits are overhead. So, we amortize the edge circuits over big arrays.


A "bank" of 128 Mb (512Mb chip -> 4 banks)
A 13-bit row address drives a 1-of-8192 decoder to select one of 8192 rows; each row spans 16384 columns, for 134,217,728 usable bits (the tester found good bits in a bigger array). In reality, the 16384 columns are divided into 64 smaller arrays.
16384 bits are delivered by the sense amps; column logic selects the requested bits and sends them off the chip.
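The bank arithmetic on this slide is worth checking. This sketch uses only the slide's own numbers.

```python
# Check the bank arithmetic from the slide (all numbers from the slide):
# a 512 Mb chip has 4 banks; each bank is 8192 rows x 16384 columns.
rows, cols, banks = 8192, 16384, 4

bits_per_bank = rows * cols
print(bits_per_bank)                          # 134217728 usable bits
print(bits_per_bank * banks == 512 * 2**20)   # True: 4 banks make 512 Mb

# A 13-bit row address is exactly enough for the 1-of-8192 decoder:
print(2**13)                                  # 8192
```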


Recall DRAM Challenge #3b: Sensing
How do we reliably sense a 60mV signal? Compare the bit line against the voltage on a "dummy" bit line whose cells hold no charge, using a "sense amp".


"Sensing" is row read into sense amps
Slow! This 2.5ns period DRAM (400 MT/s) can do row reads at only 55 ns (18 MHz). DRAM has high latency to first bit out. A fact of life.
[Same bank diagram: 13-bit row address, 1-of-8192 decoder, 8192 rows x 16384 columns; 16384 bits delivered by the sense amps; selected bits sent off the chip.]


An ill-timed refresh may add to latency
Recall Challenge #4: parasitic currents leak away charge, so cells are refreshed by rewriting them at regular intervals (tens of milliseconds). A read that arrives while the bank is busy refreshing must wait.
Latency versus bandwidth
What if we want all of the 16384 bits? In the row access time (55 ns) we can do 22 transfers at 400 MT/s. With a 16-bit chip bus, 22 x 16 = 352 bits << 16384. Now the row access time looks fast! Thus, the push to faster DRAM interfaces.
[Same bank diagram as before.]
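The slide's latency-versus-bandwidth arithmetic, spelled out. All numbers come from the slide (55 ns row access, 400 MT/s, 16-bit bus).

```python
# The slide's arithmetic: how many bits can a x16 interface move during
# one 55 ns row access at 400 MT/s (2.5 ns per transfer)?
row_time = 55e-9          # seconds per row access (from the slide)
transfer_time = 2.5e-9    # seconds per transfer at 400 MT/s
bus_width = 16            # bits per transfer (x16 part)

transfers = round(row_time / transfer_time)
print(transfers)                   # 22 transfers fit in one row access
print(transfers * bus_width)       # 352 bits moved, vs. 16384 bits sensed
```

The interface drains only ~2% of what the sense amps already hold, which is why chips expose fast column accesses and faster interfaces.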
DRAM latency/bandwidth chip features
Columns: Design the right interface for CPUs to request the subset of a column of data they wish: 16384 bits delivered by the sense amps, with selected bits sent off the chip.
Interleaving: Design the right interface to the 4 memory banks on the chip, so several row requests run in parallel (Bank 1 | Bank 2 | Bank 3 | Bank 4).


Off-chip interface for the Micron part ...
A clocked bus: 200 MHz clock, with data transfers on both edges (DDR). DRAM is controlled via commands (READ, WRITE, REFRESH, ...), with synchronous data output.
Note! This example is best-case: to access a new row, a slow ACTIVE command must run before the READ.
If a READ command is registered at clock edge n, and the CL is m clocks, the data will be available nominally coincident with clock edge n + m (assuming AL = 0). The posted-CAS feature allows the READ command to be issued prior to tRCD (MIN) by delaying the internal command to the DDR2 SDRAM by AL clocks; AL is described further in Posted CAS Additive Latency (AL) (page 78).
[Timing diagrams: READ bursts with CL = 3 and CL = 4, both with AL = 0, showing CK/CK#, command, DQS/DQS#, and DQ.]
Opening a row before reading ...
[Figure 52: Bank Read with Auto Precharge. An ACT command (row address RA, bank x) is followed by a READ (column n) with AL = 1, CL = 3; tRCD = 15 ns, tRP = 15 ns, with tRAS and tRC spanning the sequence. DQS/DQ cases show tAC and tDQSCK minimums and maximums.]
55 ns between row opens.
However, we can read columns quickly
[512Mb x4/x8/x16 DDR2 SDRAM, consecutive READ bursts at RL = 3 and RL = 4: a READ to Bank, Col n followed tCCD later by a READ to Bank, Col b.]
Note: This is a "normal read" (not Auto-Precharge). Both READs are to the same bank, but different columns.
Why can we read columns quickly?
Column reads select from the 16384 bits already held in the sense amps.
[Same bank diagram: 13-bit row address, 1-of-8192 decoder, 8192 rows x 16384 columns; selected bits sent off the chip.]


Interleave: Access all 4 banks in parallel
[Figure 43: Multibank Activate Restriction. Back-to-back ACT/READ pairs to banks a, b, c, d, limited by tRRD (MIN) and tFAW (MIN). Note: DDR2-533 (-37E, x4 or x8), tCK = 3.75ns, BL = 4, AL = 3, CL = 4, tRRD (MIN) = 7.5ns, tFAW (MIN) = 37.5ns.]
Interleaving: Design the right interface to the 4 memory banks on the chip, so several row requests run in parallel. Can also do other commands on banks concurrently.
Only part of a bigger story ...
[Figure 5: 32 Meg x 16 Functional Block Diagram, from the Micron datasheet: control logic with command decode and mode registers, refresh counter, row-address MUX and latch, 1-of-8192 row decoders, four banks each holding an 8,192 x 256 x 64 memory array with sense amplifiers, I/O gating and DM mask logic, column decoder and column-address counter/latch, read latch and MUX driving DQ0-DQ15 through DLL-timed output drivers, write FIFO and receivers, DQS generators (UDQS/LDQS), and ODT control.]


Only part of a bigger story ...
[Figure 2: Simplified State Diagram: initialization sequence, idle (all banks precharged), refreshing, self refreshing, precharge power-down, activating, bank active, active power-down, reading, writing, and precharging, linked by commands (ACT, READ, READ A, WRITE, WRITE A, PRE, PRE_A, REFRESH, SR, (E)MRS) and CKE transitions.]
[Figure 34: MR Definition: mode-register address bits program the burst length (4 or 8), burst type (sequential or interleaved), CAS latency (3 to 7), test mode, DLL reset, write recovery, and power-down exit mode; BA bits select the mode register (MR) or extended mode registers (EMR, EMR2, EMR3). The low-order column address bits select the starting location within the programmed burst, which applies to both reads and writes.]
DRAM controllers: reorder requests
[Figure 1: eight memory references labeled (Bank, Row, Column). DRAM operations: P = bank precharge (3 cycle occupancy), A = row activation (3 cycle occupancy), C = column access (1 cycle occupancy). (A) Without access scheduling the references take 56 DRAM cycles; (B) with access scheduling, 19 DRAM cycles.]
Figure 1. Time to complete a series of memory references without (A) and with (B) access reordering.
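The reordering win in Figure 1 can be mimicked with a toy model. A sketch using the figure's cycle costs (P = 3, A = 3, C = 1), not the paper's actual scheduler; this model runs all banks on one serial timeline, so it reproduces the in-order 56 cycles but not the bank-overlapped 19.

```python
# Toy DRAM scheduling model with the figure's costs:
# precharge P = 3 cycles, activate A = 3, column access C = 1.
# A reference to an already-open row needs only the column access.
P, A, C = 3, 3, 1

def total_cycles(refs):
    """refs: list of (bank, row, col); banks share one serial timeline."""
    open_row = {}                 # bank -> currently open row
    cycles = 0
    for bank, row, col in refs:
        if open_row.get(bank) != row:
            cycles += P + A       # close the old row, open the new one
            open_row[bank] = row
        cycles += C
    return cycles

refs = [(0,0,0),(0,1,0),(0,0,1),(0,1,3),(1,0,0),(1,1,1),(1,0,1),(1,1,2)]
print(total_cycles(refs))          # 56: matches the figure's in-order case
print(total_cycles(sorted(refs)))  # 32 here; overlapping banks gets to 19
```

Grouping references by bank and row turns most accesses into open-row hits; overlapping the two banks' precharges and activations, as the figure does, saves the rest.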
Without scheduling, each reference performs a precharge, a row access, and a column access for a total of seven cycles per reference, or 56 cycles for all eight references. If we reschedule these operations as shown in Figure 1B, they can be performed in 19 cycles. To allow the available column accesses, the cached row must be written back to the memory array by an explicit operation (bank precharge) which prepares the bank for a subsequent activation. An overview of several different modern DRAM types and organizations, along with a performance comparison ...
From: Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens, "Memory Access Scheduling."
Memory Packaging



From DRAM chip to DIMM module ...
[Figure 2: Functional Block Diagram, Raw Card A Non-Parity: DDR2 SDRAM chips U1-U12 with their DQ/DQS/DM nets, a PLL distributing CK0, a register, and an SPD EEPROM.]
Each RAM chip is responsible for 8 lines of the 64-bit data bus (U5 holds the check bits). Commands are sent to all 9 chips, qualified by per-chip select lines.
MacBook Air ... too thin to use DIMMs
Top: Core i5, CPU + DRAM controller. Bottom: 4GB DRAM soldered to the main board.

CS 250 L10: Memory UC Regents Fall 2013 © UCB


Original iPad (2010) "Package-in-Package"
[Cut-away side view: two 128MB SDRAM dies stacked on the Apple A4 SoC.]
Dies connect using bond wires and solder balls ...
3-D memory stack: Gbit stacked DRAM with TSV
[Paper excerpt: "A 3D Stacked Memory Integrated on a Logic Device Using SMAFTI Technology," Yoichiro Kurita, Satoshi Matsui, Nobuaki Takahashi, Koji Soejima, et al. (NEC Electronics, Oki Electric Industry, and Elpida Memory), 2007 Electronic Components and Technology Conference. Prototype: two 512 Mbit DRAM strata (10.7 mm x 13.3 mm dies thinned to 50 µm, 1,560 TSVs per DRAM die) stacked on a 17.5 mm x 17.5 mm, 200 µm thick, 0.18 µm CMOS logic die with 3,497 bumps, connected through the "FTI" interposer at 50 µm via pitch; 33 mm x 33 mm package with 520 BGA pins on 1 mm pitch, silicon lid, molded resin. The paper reports CFD-simulated thermal resistances for bare, underfilled, lid-attached, and heat-sink variants, and notes that with a 9:1 logic-to-DRAM power ratio, a 2-strata stacked DRAM module will not exceed twice the power of a single DRAM, because only the accessed stratum is activated. The reference list covers the SMAFTI development papers (MES 2005/2006, Mate 2006/2007, ECTC 2006, and an IEDM paper on 4 Gbit stacked DRAM with 3 Gbps data transfer).]
Break

Static Memory Circuits
Dynamic memory: the circuit remembers for a fraction of a second.
Static memory: the circuit remembers as long as the power is on.
Non-volatile memory: the circuit remembers for many years, even if power is off.
Recall DRAM cell: 1 T + 1 C
[Cell schematic: the word line ("row") gates the access transistor between the bit line ("column") and the storage capacitor to Vdd.]
Idea: Store each bit with its complement

[Figure: the "row" line gates two pass transistors driving complementary
bit lines x and x-bar. The two stored states are (y, y-bar) = (Gnd, Vdd)
and (y, y-bar) = (Vdd, Gnd).]

Why? We can use the redundant representation to compensate
for noise and leakage.
CS 250 L07: Memory UC Regents S17 © UCB


Case #1: y = Gnd, y-bar = Vdd ...

[Figure: with this state stored, one cell transistor sources current
(Isd) on the Vdd side while the other sinks current (Ids) on the Gnd
side, reinforcing the stored values.]

Case #2: y = Vdd, y-bar = Gnd ...

[Figure: the mirror image of Case #1: the current directions swap
sides.]
CS 250 L07: Memory UC Regents S17 © UCB


Combine both cases to complete circuit: "cross-coupled inverters"

[Figure: two inverters, each driving the other's input, storing x and
x-bar. The inverter thresholds (Vth) restore the levels on y and y-bar,
so noise is rejected as long as it stays below threshold.]
CS 250 L07: Memory UC Regents S17 © UCB
SRAM Challenge #1: It's so big!

SRAM area is 6X-10X DRAM area, at the same technology generation ...

Why? The cell has both transistor types, so it needs both Vdd and Gnd
rails; it has more contacts, more devices, and two bit lines. Its
"capacitors" are usually just the parasitic capacitance of the wires
and transistors.
CS 250 L07: Memory UC Regents S17 © UCB
Challenge #2: Writing is a "fight"

When the word line goes high, the bit lines "fight" with the cell
inverters to "flip the bit" -- and must win quickly!
Solution: tune the W/L of the cell and driver transistors.

[Figure: initial cell state (Vdd, Gnd) while the bit lines drive the
opposite values (Gnd, Vdd).]
CS 250 L07: Memory UC Regents S16 © UCB
Challenge #3: Preserving state on read

When the word line goes high on a read, the cell inverters must drive
the large bit-line capacitance quickly, to preserve the state held on
the cell's small internal capacitances.

[Figure: cell state (Vdd, Gnd) driving the two bit lines, each a big
capacitor.]
CS 250 L07: Memory UC Regents S16 © UCB


Adding More Ports

[Figure: a cell with two word lines (WordlineA, WordlineB), each gating
its own differential bit-line pair (BitA/BitA-bar, BitB/BitB-bar),
giving two differential read-or-write ports. An optional single-ended
read port adds one more wordline and a buffered read bitline.]

Lecture 9, Memory 15 CS250, UC Berkeley, Fall 2012

SRAM array: like DRAM, but non-destructive

Architects specify the number of rows and columns.
Word and bit lines slow down as the array grows larger!

Typical SRAM organization: 16-word x 4-bit.

[Figure: 16 rows of SRAM cells (Word 0 ... Word 15), 4 columns wide.
An address decoder turns A0-A3 into one-hot word lines. Each column
has a write driver (gated by WrEn, fed by Din), a precharger, and a
sense amp producing Dout. Parallel data I/O lines run along the
columns. Slide: CS152 / Kubiatowicz, ©UCB Spring 2004.]

Q: Which is longer: word line or bit line?
Add muxes to the sense-amp outputs to select a subset of bits.
How could we pipeline this memory?


CS 250 L07: Memory UC Regents S17 © UCB
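The 16-word x 4-bit organization above can be sketched behaviorally. The Python class below is a toy model of ours (not course code): indexing stands in for the one-hot address decoder, writes model the drivers overpowering the cell, and reads are non-destructive.

```python
class SRAMArray:
    """Behavioral model of a 16-word x 4-bit SRAM array.

    The address decoder is modeled by plain list indexing; a real
    array decodes the address into a one-hot word-line vector.
    """
    def __init__(self, words=16, bits=4):
        self.mem = [[0] * bits for _ in range(words)]

    def access(self, addr, wr_en=False, din=None):
        # One word line goes high: exactly one row is selected.
        row = self.mem[addr]
        if wr_en:
            # Write drivers overpower the cell inverters ("the fight").
            row[:] = din
            return None
        # Sense amps read the bit lines; the read is non-destructive.
        return list(row)

sram = SRAMArray()
sram.access(3, wr_en=True, din=[1, 0, 1, 1])
assert sram.access(3) == [1, 0, 1, 1]   # read back
assert sram.access(0) == [0, 0, 0, 0]   # other rows untouched
```

Reading twice returns the same data, which is the "non-destructive" property that distinguishes SRAM from DRAM.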


Building Larger Memories

[Figure: a grid of bit-cell leaf arrays, with decoders ("Dec") shared
between horizontally adjacent arrays and I/O circuitry shared between
vertically adjacent arrays.]

Large arrays are constructed by tiling multiple leaf arrays, sharing
decoders and I/O circuitry (e.g., a sense amp is attached to the
arrays above and below it).

Leaf arrays are limited to 128-256 bits per row/column due to the RC
delay of wordlines and bitlines, and also to reduce power by only
activating the selected sub-bank.

In larger memories, delay and energy are dominated by the I/O wiring.
Lecture 9, Memory 14 CS250, UC Berkeley, Fall 2012
SRAM vs DRAM, pros and cons
Big win for DRAM
DRAM has a 6-10X density advantage
at the same technology generation.

SRAM advantages
SRAM has deterministic latency:
its cells do not need to be refreshed.
SRAM is much faster: transistors
drive bitlines on reads.
SRAM easy to design in logic
fabrication process (and premium
logic processes have SRAM add-ons)
CS 250 L07: Memory UC Regents S17 © UCB
Flip Flops Revisited

CS 250 L10: Memory UC Regents Fall 2013 © UCB


Recall: Static RAM cell (6 Transistors)

[Figure: the cross-coupled inverter pair storing x and x-bar, with
noise margins set by the inverter thresholds (Vth).]
CS 250 L10: Memory UC Regents Fall 2013 © UCB
Recall: Positive edge-triggered flip-flop

[Figure: a D flip-flop built from two transparent latches in series,
clocked on opposite phases (clk and clk'); the first latch samples D,
the second drives Q. Slide: EECS150, Spring 2003.]

A flip-flop "samples" right before the edge, and then "holds" the value.
Setup time results from the delay through the first latch.
Clock-to-Q delay results from the delay through the second latch.

16 transistors: makes an SRAM cell look compact!
What do we get for the 10 extra transistors?
Clocked logic semantics.
CS 250 L07: Memory UC Regents S16 © UCB
Small Memories from Stdcell Latches

[Figure: a write address decoder and a read address decoder flank an
array of transparent-low latches; synthesized combinational logic
forms the read port, with an optional read output latch.]

Data is held in transparent-low latches; write by clocking the
selected latch.

Add additional ports by replicating the read and write port logic
(multiple write ports need a mux in front of each latch).
Expensive to add many ports.
Lecture 9, Memory 6 CS250, UC Berkeley, Fall 2012


Synthesized, custom, and SRAM-based register files, 40nm

For small register files, logic synthesis is competitive.
Not clear if the SRAM data points include area for register file
control, etc.

[Figure 3 (Bhupesh Dasila): area vs. capacity curves for synthesized
logic, SRAMs, and a register-file compiler, shown for a 1-port,
32-bit-wide SRAM. Using the raw area data, the physical implementation
team can get a more accurate area estimation early in the RTL
development stage for floorplanning purposes.]
Memory Design
Patterns

Lecture 9, Memory 18 CS250, UC Berkeley, Fall 2012


When register files get big, they get slow.
Even worse: adding ports slows them down as O(N^2) ...

[Figure: a 32-entry x 32-bit register file. A write-address decoder
(sel(ws)) enables one register's flip-flop bank, with WE gating writes
of wd; R0 is the constant 0. Two read-port muxes (sel(rs1), sel(rs2))
select rd1 and rd2 from the 32 register outputs.]

Why? The number of loads on each Q grows as O(N), and the wire length
to the port mux grows as O(N).
CS 250 L07: Memory UC Regents S16 © UCB
True Multiport Example: Itanium-2 Regfile

[Die photo: Intel Itanium-2 register file. Fetzer et al., IEEE Journal
of Solid-State Circuits, vol. 37, no. 11, 2002.]
Lecture 9, Memory 21 CS250, UC Berkeley, Fall 2012


True Multiport Memory

Problem: Require simultaneous read and write access by multiple
independent agents to a shared common memory.
Solution: Provide separate read and write ports to each bit cell for
each requester.
Applicability: Where unpredictable access latency to the shared
memory cannot be tolerated.
Consequences: High area, energy, and delay cost for a large number of
ports. Must define behavior when multiple writes arrive on the same
cycle to the same word (e.g., prohibit, provide priority, or combine
writes).

Lecture 9, Memory 20 CS250, UC Berkeley, Fall 2012
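The "must define same-cycle write behavior" consequence can be illustrated with a toy cycle model. This is a sketch of ours, not any particular chip's policy; the function name and the "lowest port number wins" rule are illustrative assumptions.

```python
def multiport_write_cycle(mem, writes, policy="priority"):
    """Apply one cycle of writes [(port, addr, data), ...] to mem.

    policy="priority": the lowest-numbered port wins a same-address
    conflict. policy="prohibit": a conflict raises an error.
    """
    winners = {}                       # addr -> (port, data)
    for port, addr, data in writes:
        if addr in winners:
            if policy == "prohibit":
                raise ValueError(f"write conflict at address {addr}")
            # Keep the write from the lower-numbered (higher-priority) port.
            if port < winners[addr][0]:
                winners[addr] = (port, data)
        else:
            winners[addr] = (port, data)
    for addr, (_, data) in winners.items():
        mem[addr] = data

mem = [0] * 8
multiport_write_cycle(mem, [(1, 3, 0xAA), (0, 3, 0x55), (2, 5, 0x77)])
assert mem[3] == 0x55 and mem[5] == 0x77   # port 0 won the conflict
```

A real design would fix one policy in hardware; modeling it in software first makes the choice explicit before RTL is written.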


Crossbar networks: many CPUs sharing cache banks

[Die photo: "Implementation of an 8-core, 64-thread, power-efficient
SPARC server on a chip."]

Sun Niagara II: 8 cores, 4MB L2, 4 DRAM channels.
Each DRAM channel: 50 GB/s read, 25 GB/s write bandwidth.
Crossbar bandwidth: 270 GB/s total (read + write),
also shared by an I/O port (not shown).

CS 250 L07: Memory UC Regents S16 © UCB
Banked Multiport Memory
Problem: Require simultaneous read and write access by multiple
independent agents to a large shared common memory.
Solution: Divide memory capacity into smaller banks, each of which has
fewer ports. Requests are distributed across banks using a fixed hashing
scheme. Multiple requesters arbitrate for access to same bank/port.
Applicability: Requesters can tolerate variable latency for accesses.
Accesses are distributed across address space so as to avoid “hotspots”.
Consequences: Requesters must wait arbitration delay to determine if
request will complete. Have to provide interconnect between each
requester and each bank/port. Can have greater, equal, or lesser number of
banks*ports/bank compared to total number of external access ports.

Lecture 9, Memory 23 CS250, UC Berkeley, Fall 2012
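The fixed hashing scheme and per-bank arbitration above can be sketched in a few lines. This is our toy model: low-order address interleaving for the hash, and a simple "first request in list wins, losers retry" arbiter.

```python
def bank_of(addr, n_banks=4):
    # Fixed hashing scheme: low-order interleaving.
    return addr % n_banks

def schedule(requests, n_banks=4):
    """One cycle of bank arbitration.

    requests: list of (port, addr). Each bank grants at most one
    request per cycle; losing requesters must retry next cycle.
    """
    granted, retry, busy = [], [], set()
    for port, addr in requests:
        b = bank_of(addr, n_banks)
        if b in busy:
            retry.append((port, addr))   # lost arbitration this cycle
        else:
            busy.add(b)
            granted.append((port, addr))
    return granted, retry

# Addresses 0 and 4 both hash to bank 0, so port 1 must wait.
g, r = schedule([(0, 0), (1, 4), (2, 5)])
assert g == [(0, 0), (2, 5)] and r == [(1, 4)]
```

This also shows why "hotspots" hurt: if every requester targets the same bank, throughput collapses to one grant per cycle no matter how many banks exist.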


Banked Multiport Memory
Port A Port B

Arbitration and Crossbar

Bank 0 Bank 1 Bank 2 Bank 3

Lecture 9, Memory 24 CS250, UC Berkeley, Fall 2012


Cached Multiport Memory
Problem: Require simultaneous read and write access by multiple
independent agents to a large shared common memory.
Solution: Provide each access port with a local cache of recently touched
addresses from common memory, and use a cache coherence protocol to
keep the cache contents in sync.
Applicability: Request streams have significant temporal locality, and
limited communication between different ports.
Consequences: Requesters will experience variable delay depending on
access pattern and operation of cache coherence protocol. Tag overhead in
both area, delay, and energy/access. Complexity of cache coherence
protocol.

Lecture 9, Memory 29 CS250, UC Berkeley, Fall 2012
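The cache coherence idea above can be sketched with the simplest possible protocol: write-through with invalidation of peer caches. This is a sketch of ours only; real protocols (e.g., MESI) track per-line states and avoid broadcasting every write.

```python
class CachedPort:
    """One access port's local cache over a shared memory (dict)."""
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}
        self.peers = []        # other ports' caches to keep in sync

    def read(self, addr):
        if addr not in self.cache:        # miss: variable latency
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]           # hit: fast, local

    def write(self, addr, data):
        self.memory[addr] = data          # write through to common memory
        self.cache[addr] = data
        for peer in self.peers:           # invalidate stale copies
            peer.cache.pop(addr, None)

mem = {0: 10, 1: 20}
a, b = CachedPort(mem), CachedPort(mem)
a.peers, b.peers = [b], [a]
assert b.read(0) == 10
a.write(0, 99)            # b's cached copy is invalidated ...
assert b.read(0) == 99    # ... so b re-fetches the new value
```

The `peer.cache.pop` broadcast is exactly the "complexity of cache coherence protocol" cost the pattern warns about, in miniature.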


Cached Multiport Memory
Port A Port B

Cache A Cache B

Arbitration and Interconnect

Common Memory

Lecture 9, Memory 30 CS250, UC Berkeley, Fall 2012


ARM CPU

The arbiter and interconnect on the last slide are how the two caches
on this chip share access to DRAM.
CS 152 L14: Cache I UC Regents Spring 2005 © UCB
Stream-Buffered Multiport Memory
Problem: Require simultaneous read and write access by multiple
independent agents to a large shared common memory, where each
requester usually makes multiple sequential accesses.
Solution: Organize memory to have a single wide port. Provide each
requester with an internal stream buffer that holds width of data returned/
consumed by each memory access. Each requester can access own stream
buffer without contention, but arbitrates with others to read/write stream
buffer from memory.
Applicability: Requesters make mostly sequential requests and can
tolerate variable latency for accesses.
Consequences: Requesters must wait arbitration delay to determine if
request will complete. Have to provide stream buffers for each requester.
Need sufficient access width to serve aggregate bandwidth demands of all
requesters, but wide data access can be wasted if not all used by requester.
Have to specify memory consistency model between ports (e.g., provide
stream flush operations).
Lecture 9, Memory 26 CS250, UC Berkeley, Fall 2012
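A stream buffer on the read side can be modeled in a few lines. This is our sketch: `wide_mem` is a list of wide "lines" of `width` words each, and only a whole line can be fetched per arbitration slot.

```python
class StreamReader:
    """Read port with a stream buffer over a wide single-port memory."""
    def __init__(self, wide_mem, width=4):
        self.wide_mem, self.width = wide_mem, width
        self.buf, self.buf_line = [], None
        self.fetches = 0                   # arbitration events

    def read(self, addr):
        line = addr // self.width
        if line != self.buf_line:          # buffer miss: go to memory
            self.buf = list(self.wide_mem[line])
            self.buf_line = line
            self.fetches += 1
        return self.buf[addr % self.width]  # hit: no contention

wide_mem = [[0, 1, 2, 3], [4, 5, 6, 7]]
rd = StreamReader(wide_mem)
assert [rd.read(a) for a in range(8)] == list(range(8))
assert rd.fetches == 2   # 8 sequential reads cost only 2 wide accesses
```

Sequential access amortizes one wide fetch over `width` reads; a strided or random pattern would fetch a full line per read and waste most of the wide access, which is the "wasted width" consequence in the pattern.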
Stream-Buffered Multiport Memory

Stream Buffer A Port A


Stream Buffer B Port B
Arbitration

Wide Memory

Lecture 9, Memory 27 CS250, UC Berkeley, Fall 2012


Replicated-State Multiport Memory

Problem: Require simultaneous read and write access by multiple


independent agents to a small shared common memory. Cannot tolerate
variable latency of access.
Solution: Replicate storage and divide read ports among replicas. Each
replica has enough write ports to keep all replicas in sync.
Applicability: Many read ports required, and variable latency cannot be
tolerated.
Consequences: Potential increase in latency between some writers and
some readers.

Lecture 9, Memory 31 CS250, UC Berkeley, Fall 2012
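The replication idea (every write goes to all copies, read ports are divided among copies, as in the Alpha 21264's clustered register file) can be sketched as a toy model; the class name and the port-to-copy split are our illustrative choices.

```python
class ReplicatedRegfile:
    """Two replicas; each takes all writes, read ports split between them."""
    def __init__(self, n=32):
        self.copies = [[0] * n, [0] * n]

    def write(self, addr, data):
        for copy in self.copies:       # every write updates both replicas
            copy[addr] = data

    def read(self, port, addr):
        # Read ports 0-1 use copy 0; read ports 2-3 use copy 1.
        return self.copies[port // 2][addr]

rf = ReplicatedRegfile()
rf.write(5, 42)
assert rf.read(0, 5) == 42 and rf.read(3, 5) == 42  # all ports agree
```

Each replica needs only half the read ports, so its bit cells stay small and fast; the price is doubled storage and write energy, plus the possible writer-to-reader latency the pattern notes.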


Replicated-State Multiport Memory
Write Port 0 Write Port 1

Copy 0 Copy 1

Example: Alpha 21264


Read Ports Regfile clusters
Lecture 9, Memory 32 CS250, UC Berkeley, Fall 2012
Flash Memory

Intel Micron 8 GB NAND flash device, 2 bit per cell, 25 nm minimum feature, 16.5 mm by 10.1 mm.

CS 152 L14: Cache I UC Regents Spring 2005 © UCB
Device physics ...

NAND Flash Memory

CS 250 L10: Memory UC Regents Spring 2017 © UCB


The physics of non-volatile memory

[Figure: a MOSFET (Vg, Vd, Vs, n+ diffusions in p- substrate) with two
stacked gates separated by dielectric. The middle gate is not
connected: the "floating gate".]

1. Electrons "placed" on the floating gate stay
there for many years (ideally).
2. 10,000 electrons on the floating gate shift the
transistor threshold by 2V.
3. In a memory array, shifted transistors
hold "0", unshifted hold "1".
CS 250 L10: Memory UC Regents Fall 2013 © UCB
Moving electrons on/off floating gate

[Figure: a high drain voltage injects "hot electrons" onto the floating
gate; a high gate voltage "tunnels" electrons off of it.]

1. Hot electron injection and tunneling produce
tiny currents, thus writes are slow.
2. High voltages damage the floating gate.
Too many writes and a bit goes "bad".
CS 250 L10: Memory UC Regents Fall 2013 © UCB
Architecture ...

NAND Flash Memory

CS 250 L10: Memory UC Regents Spring 2017 © UCB


Flash: Disk Replacement

Chip "remembers" for 10 years.
Presents memory to the CPU as a set of pages.

Page format: 2048 Bytes (user data) + 64 Bytes (meta data)

1GB Flash: 512K pages
2GB Flash: 1M pages
4GB Flash: 2M pages
CS 250 L10: Memory UC Regents Fall 2013 © UCB
Reading a Page ...

Samsung K9WAG08U1A: 8-bit data or address bus (bi-directional),
33 MB/s read bandwidth.

[Timing diagram:
Page address in: 175 ns.
First byte out: 10,000 ns.
Clock out page bytes: 52,800 ns.]
CS 250 L10: Memory UC Regents Fall 2013 © UCB
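A quick sanity check of the bandwidth figure: summing the three phases and dividing into the (2048 + 64)-byte page reproduces roughly the 33 MB/s on the slide.

```python
# Effective read bandwidth of one page, from the timing on the slide.
t_addr  = 175e-9      # page address in
t_first = 10_000e-9   # array access until first byte out
t_clock = 52_800e-9   # clocking out (2048 + 64) bytes, 25 ns per byte
page_bytes = 2048 + 64

bw = page_bytes / (t_addr + t_first + t_clock)   # bytes per second
print(round(bw / 1e6, 1), "MB/s")
```

Counting only the 2048 user bytes gives about 32.5 MB/s; counting the full 2112-byte transfer gives about 33.5 MB/s. Either way the clock-out phase dominates, so the bus rate, not the array access, sets the bandwidth.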
Where Time Goes

[Figure 1: K9K8G08U0A functional block diagram. Row buffers/latches and
decoders take A12-A30; column buffers/latches and decoders take A0-A11.
The NAND flash array is (2,048 + 64) bytes x 524,288 pages, read
through a data register and sense amps into the I/O buffers, under a
command register and control logic with a high-voltage generator.]

Page address in: 175 ns.
First byte out: 10,000 ns.
Clock out page bytes: 52,800 ns.
CS 250 L10: Memory UC Regents Fall 2013 © UCB
Writing a Page ...

A page lives in a block of 64 pages:
1GB Flash: 8K blocks
2GB Flash: 16K blocks
4GB Flash: 32K blocks

To write a page:
1. Erase all pages in the block
(cannot erase just one page). Time: 1,500,000 ns.
2. May program each page individually,
exactly once. Time: 200,000 ns per page.

Block lifetime: 100,000 erase/program cycles.

CS 250 L10: Memory
UC Regents Fall 2013 © UCB
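The erase-before-program discipline can be captured in a toy model. The class below is our illustrative sketch (names and structure are not Samsung's interface): pages are write-once until the whole block is erased, and the erase counter models wear-out.

```python
class FlashBlock:
    """Erase-before-program model of one 64-page flash block."""
    PAGES = 64
    LIFETIME = 100_000        # erase/program cycles before wear-out

    def __init__(self):
        self.pages = [None] * self.PAGES   # None = erased
        self.erase_count = 0

    def erase(self):                       # ~1,500,000 ns, whole block
        self.erase_count += 1
        if self.erase_count > self.LIFETIME:
            raise RuntimeError("block worn out")
        self.pages = [None] * self.PAGES

    def program(self, page, data):         # ~200,000 ns, one page
        if self.pages[page] is not None:
            raise RuntimeError("page already programmed; erase block first")
        self.pages[page] = data

blk = FlashBlock()
blk.program(0, b"hello")
try:
    blk.program(0, b"again")   # write-once: must erase the block first
    raise AssertionError("should have failed")
except RuntimeError:
    pass
blk.erase()
blk.program(0, b"again")       # legal after a block erase
assert blk.pages[0] == b"again"
```

Updating one page in place therefore really costs a block erase plus reprogramming, which is why flash controllers remap writes to fresh blocks instead.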
Block Failure

Even when new, not all blocks work!
1GB: 8K blocks, 160 may be bad.
2GB: 16K blocks, 220 may be bad.
4GB: 32K blocks, 640 may be bad.

During factory testing, Samsung writes good/bad info for each block in
the meta data bytes: 2048 Bytes (user data) + 64 Bytes (meta data).

After an erase/program, the chip can say "write failed", and the block
is now "bad". The OS must recover (migrate the bad block's data to a
new block). Bits can also go bad "silently" (!!!).

CS 250 L10: Memory
UC Regents Fall 2013 © UCB
Flash controllers: Chips or Verilog IP ...
Flash memory controller manages write lifetime
management, block failures, silent bit errors ...

Software sees a “perfect” disk-like storage device.


CS 250 L10: Memory UC Regents Fall 2013 © UCB
Recall: iPod 2005 ...
Flash Flash
memory controller

CS 250 L10: Memory UC Regents Fall 2013 © UCB
