VLSI Design I
CMOS Sequential Logic
Clocking Strategies
Today’s handouts:
(1) Lecture Slides
MicroLab, VLSI-10 (1/21)
JMM v1.2
Sequential Logic
Use #1: Get better utilization from
idle combinational logic blocks.
Pipeline the system so that new
computations start before the old ones
complete. Add registers to keep
computations separate.
Use #2: Convert parallel operations
to a sequence of (faster, smaller)
serial operations.
[Figure: an 8-bit parallel block (A x B giving C) next to a 1-bit serial block (A + B giving C)]
Use #3: Need to process a
sequence of inputs and want to
reuse the same hardware (finite
state machine).
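Use #2 can be illustrated with a small behavioral sketch: a single 1-bit full adder plus a carry register, reused over 8 clock cycles, in place of an 8-bit parallel adder. The function name `serial_add` and the LSB-first bit order are assumptions made for this illustration; they do not come from the slides.

```python
def serial_add(a: int, b: int, width: int = 8) -> tuple[int, int]:
    """Add two unsigned integers one bit per 'clock cycle', LSB first.

    Models one 1-bit full adder whose carry-out is stored in a register
    and fed back as carry-in on the next cycle.
    Returns (sum modulo 2**width, final carry).
    """
    carry = 0          # the carry register, cleared before the first cycle
    result = 0
    for i in range(width):               # one loop iteration = one clock cycle
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        s = a_bit ^ b_bit ^ carry        # full-adder sum output
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))  # carry output
        result |= s << i                 # shift the sum bit into the result
    return result, carry


print(serial_add(100, 55))   # (155, 0)
print(serial_add(200, 100))  # (44, 1): 300 does not fit in 8 bits
```

The hardware is roughly one eighth the size of the parallel version, at the cost of needing 8 cycles per result.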
Latches and Flip-Flops
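Level-sensitive latch (inputs D and G, output Q):
Q follows D while G is active; Q is stable otherwise.
Edge-sensitive flip-flop (inputs D and clk, output Q):
Q takes its value from D at the clock edge; Q is stable otherwise.
[Figure: symbols and timing waveforms for both elements]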
A static latch will hold data while G is inactive, however long
that may be. A dynamic latch will hold data while G is
inactive, but only “for a while”, after which the saved value
may decay.
Do static latches dissipate static power?
How long is “for a while”?
Which one should I use?
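The difference between level-sensitive and edge-sensitive behavior can be sketched with two idealized models that process sampled waveforms. This is a zero-delay illustration only: the function names, the active-high G, and the rising-edge trigger are assumptions, and setup/hold effects are ignored.

```python
def simulate_latch(d_wave, g_wave, q0=0):
    """Level-sensitive latch: Q follows D while G is 1, holds while G is 0."""
    q, out = q0, []
    for d, g in zip(d_wave, g_wave):
        if g:
            q = d
        out.append(q)
    return out


def simulate_flipflop(d_wave, clk_wave, q0=0):
    """Edge-sensitive flip-flop: Q takes D only on a 0 -> 1 transition of clk."""
    q, prev_clk, out = q0, 0, []
    for d, clk in zip(d_wave, clk_wave):
        if clk and not prev_clk:   # rising edge detected
            q = d
        prev_clk = clk
        out.append(q)
    return out


clk = [0, 1, 1, 1, 0, 0, 1, 1]
d   = [1, 1, 0, 1, 0, 0, 0, 1]
print(simulate_latch(d, clk))     # [0, 1, 0, 1, 1, 1, 0, 1]
print(simulate_flipflop(d, clk))  # [0, 1, 1, 1, 1, 1, 0, 0]
```

While the clock is high the latch output tracks every change on D, whereas the flip-flop output changes only at the two rising edges.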
Latch Timing Constraints #1
[Figure: latch a -> CLa -> latch b -> CLb -> latch, all gated from CLK; the timing diagram marks t1a, t2b, and the hold (H) and setup (S) windows around the CLK edges]
Do I have to check ALL these constraints?
t1a = tmqa + tmda > thb
t1b = tmqb + tmdb > tha
t2a = tqa + tda < tc0 - tsb
t2b = tqb + tdb < tc1 - tsa
th = hold time
ts = setup time
tm = min delay from invalid input to invalid output
td = max delay from valid input to valid output for comb. logic
tq = max delay from G to Q
tc0 = low period of clock cycle tc
Latch Timing Constraints #2
[Timing diagram: t1a, t2b, and the hold (H) and setup (S) windows around the CLK edges]
t1a = tmqa + tmda > thb
t1b = tmqb + tmdb > tha
t2a = tqa + tda < tc0 - tsb
t2b = tqb + tdb < tc1 - tsa
Questions for latch-based designs:
- how much time for useful work (i.e. for combinational logic
delay)?
tda + tdb < tc - 2(ts + tq)
(sum of the two t2 constraints, with tc = tc0 + tc1 and identical latches)
- what is the maximal clock frequency?
- does it help to guarantee a minimum tm, for example, by requiring
a minimum number of gates in each cloud?
- Suppose the maximum clock skew is tSKEW. How does that affect
the equations above? Clock skew measures the difference in
arrival of CLK at two cascaded latches (not necessarily any two
latches!).
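The four latch constraints and the resulting logic budget can be checked mechanically. The following is a minimal sketch under stated assumptions: all times are integers in picoseconds, the dictionary keys and function names are hypothetical, and `min_cycle` simply rearranges the summed inequality, which assumes the cycle can be split between the two phases as needed.

```python
def check_latch_pair(a, b, tc0, tc1):
    """Evaluate the four constraints for latch a -> CLa -> latch b -> CLb.

    a and b are dicts with keys tmq, tmd, tq, td, th, ts (hypothetical names
    mirroring the slide symbols). tc0 and tc1 are the two clock phases.
    """
    return {
        "t1a > thb": a["tmq"] + a["tmd"] > b["th"],
        "t1b > tha": b["tmq"] + b["tmd"] > a["th"],
        "t2a < tc0 - tsb": a["tq"] + a["td"] < tc0 - b["ts"],
        "t2b < tc1 - tsa": b["tq"] + b["td"] < tc1 - a["ts"],
    }


def logic_budget(tc, ts, tq):
    """Upper bound on tda + tdb for identical latches: tc - 2(ts + tq)."""
    return tc - 2 * (ts + tq)


def min_cycle(tda, tdb, ts, tq):
    """Cycle time that tc must exceed, from tda + tdb < tc - 2(ts + tq)."""
    return tda + tdb + 2 * (ts + tq)


stage_a = dict(tmq=100, tmd=200, tq=300, td=3000, th=150, ts=200)
stage_b = dict(tmq=100, tmd=50, tq=300, td=4000, th=150, ts=200)
print(check_latch_pair(stage_a, stage_b, tc0=5000, tc1=5000))
print(logic_budget(tc=10000, ts=200, tq=300))   # 9000
print(min_cycle(3000, 4000, ts=200, tq=300))    # 8000
```

In this example cloud b has a very short minimum delay, so the hold check `t1b > tha` fails, which is the situation the "minimum number of gates" question is pointing at.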
Static Latches
Basic idea: Want storage node to be isolated from whatever
user does to Q.
Need gain around this loop to make latch static.
Would like fast CLK-to-Q, small setup and zero hold times.
[Figure: basic latch with inputs D and CLK (mux inputs 0 and 1) and output Q fed back around the loop]
Obvious implementation:
[Figure: latch schematics with D, CLK, CLKN, and Q]
Oops… feedback not isolated from Q. Could add additional
output inverters...
Good! Input goes only to FET gates.
Should we buffer CLK 0, 1 or 2 times?
Latch Timing
[Figure: latch schematic with internal nodes 1 and 2, clocked by CLK]
setup time = how long D input has to be stable
before CLK transition.
hold time = how long D input has to be stable
after CLK transition.
[Timing diagram: ts and th marked around the CLK edge]
So, what node should we use to measure
setup and hold times? And what should we measure?
Other time of interest: CLK-to-Q
Dynamic Latches
Suppose in the interest of speed we were
willing to give up the “static guarantee”
and take our chances with dynamic latches,
i.e., remove feedback path...
[Figure: dynamic latch with input D, output Q, and clock CLK]
Figure annotations:
- Eliminate when Q fanout is small (1)
- Can combine other logic with inverter
- CLK: local or global clock inverter?
Can we do without the CLK inverter too?
DEC did without on the 21064 but put it back in for the 21164.
[Figure: latch variants with D, Q, CLK, and CLKN]
Delete the PFET driven by CLKN and then add an
NFET driven by CLK in Q’s pulldown path to
handle what happens when D goes from 1 to 0.
Single-Phase Clocked Systems
RTL #1:
[Figure: chain of three D/Q registers, each clk input driven from CLK]
latch #2:
[Figure: chain of three D/Q latches, each G input driven from CLK]
Simplest clocking methodology is to use a single clock in conjunction
with a register. Clocks are generated with global clock buffers.
CLK and its complement are generated locally.
[Figure: clk-in driving clk and its complement; buffers necessary for large loads]
Clock Skew
[Figure: chain of three D/Q registers; CLK reaches successive clk inputs through delay elements]
- if a clock net is heavily loaded, there might be a race
between clock and data -> clock skew
- special attention must be paid when designing the clock
tree. CAD tools are able to design balanced clock trees.
- two methods to avoid clock skew:
[Figure: (1) register chain with a latch stage and a CLK delay; (2) register chain with CLK and its delay entering from the opposite end]
Two-Phase Clocked Systems
[Figure: chain of three D/Q latches with G inputs driven by PHI1 and PHI2; waveforms phi1 and phi2 labeled “non-overlapping two phase clocks”]
- a problem in single-phase clocked systems is the
generation and distribution of nearly perfect overlapping
clocks.
- in two-phase clocked systems this is solved by
non-overlapping clocks
- non-overlapping clocks can be generated with latch
structures
[Figure: clk feeding two ≥1 gates that produce phi1 and phi2]
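A unit-delay simulation shows why a latch-like structure yields non-overlapping phases. The sketch below assumes the common arrangement of two cross-coupled NOR gates, one fed by clk and the other by its complement; that topology, the function name, and the unit gate delay are assumptions for illustration, since the slide's schematic is not recoverable from the text.

```python
def nonoverlap_clocks(clk_wave):
    """Simulate phi1 = NOR(clk, phi2) and phi2 = NOR(not clk, phi1).

    Both gates update together from the previous outputs, which models
    one unit of gate delay per time step.
    Returns a list of (phi1, phi2) pairs, one per input sample.
    """
    phi1, phi2 = 0, 0
    out = []
    for clk in clk_wave:
        new_phi1 = int(not (clk or phi2))
        new_phi2 = int(not ((not clk) or phi1))
        phi1, phi2 = new_phi1, new_phi2
        out.append((phi1, phi2))
    return out


waves = nonoverlap_clocks([0, 0, 0, 1, 1, 1, 0, 0, 0])
print(waves)
# The two phases are never high at the same time:
print(all(not (p1 and p2) for p1, p2 in waves))  # True
```

Each phase can rise only after the other has fallen, so every clk transition produces a short interval in which both phases are low.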
Clock Distribution
Two main techniques for clock distribution exist:
- a single large buffer (see Alpha processor)
- a distributed clock tree approach
[Figure: clk distributed to a stack of n-bit datapaths; delays have to match between stages]
- there is no such thing as a design-free clocking
strategy in today’s high-performance processes
- clock buffers should be surrounded by power pads
due to their large power consumption
[Figure: clk driver placed between vdd and gnd pads]
Phase Locked Loop Clock Technique
Phase locked loops (PLL) are used to generate
internal clocks on chips for two main reasons:
- to synchronize the internal clock of a chip with an
external clock
- to operate the internal clock at a higher rate than
the external clock input
[Figure: two chips compared, one with a PLL; labels: clock, clock route, dclk, dclk+dpad, data out]
Flip-flops (registers)
Using alternating positive and negative dynamic latches with
a single clock gives great speed and small area, but…
- lots of worries about clock skew
- must balance logic delays to minimize wastage
- need latch size checks (check optimizations!)
What about those of us who don’t have buildings full of
engineers to sweat the details? Use D-flip-flops and
address all the problems once!
[Figure: D flip-flop built from a master latch and a slave latch (D, G, Q) driven by CLK; waveforms for D, CLK, and Q]
Flip-flop Implementations
Obvious implementation:
[Figure: flip-flop schematic with D, CLK, and Q]
Use “jamb” latches to lighten CLK load:
“Weak” feedback inverters
(long n and p) get overridden
[Figure: jamb-latch flip-flop with D, CLK, and Q]
Flip-Flop Timing
[Figure: register -> CLa -> register, both clocked from CLK; the timing diagram marks t1 and t2]
t1 = tmq + tmda > th
t2 = tq + tda < tc - ts
Questions for register-based designs:
- how much time for useful work (i.e. for combinational logic
delay)?
- does it help to guarantee a minimum tm? How about designing
registers so that
tmq > th?
- Suppose the maximum clock skew is tSKEW. How does that affect
the equations above?
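The two register constraints can be evaluated numerically, including one conventional way to account for skew. This is a sketch under stated assumptions: times are integers in picoseconds, the function names are hypothetical, and skew is taken in its worst case, so that it is added to the hold requirement and subtracted from the time available for setup.

```python
def check_register_stage(tmq, tmd, tq, td, th, ts, tc, tskew=0):
    """Return (hold_ok, setup_ok) for register -> logic -> register.

    hold:  tmq + tmd > th + tskew
    setup: tq + td  < tc - ts - tskew
    With tskew = 0 these reduce to the t1 and t2 constraints.
    """
    hold_ok = tmq + tmd > th + tskew
    setup_ok = tq + td < tc - ts - tskew
    return hold_ok, setup_ok


def min_cycle(tq, td, ts, tskew=0):
    """Cycle time that tc must exceed for the setup constraint to hold."""
    return tq + td + ts + tskew


params = dict(tmq=100, tmd=100, tq=300, td=4000, th=150, ts=200, tc=5000)
print(check_register_stage(**params))             # (True, True)
print(check_register_stage(**params, tskew=100))  # (False, True)
print(min_cycle(tq=300, td=4000, ts=200))         # 4500
```

In this example 100 ps of skew turns a passing hold check into a violation while the setup check still passes. Note that a hold violation cannot be fixed by slowing the clock, because tc does not appear in the hold constraint.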
Dynamic Flip-Flops
I’ll have the Christer Svensson
special please!
[Figure: dynamic flip-flop schematic with CLK, internal nodes 1 and 2, and output QN]
CLK is low:
- node 1 follows not(D)
- node 2 pulled up
- QN is “floating” with its old value
CLK is high:
- node 2 = “0” if node 1 = “1”,
otherwise it stays “1”
⇒ node 2 = not(node 1) shortly after CLK↑
- QN = not(node 2) ⇒ stable soon after CLK↑
- node 1 can be pulled down if D goes to “0” (capacitive
coupling), but node 2 won’t change!
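The node-by-node description above can be turned into a small behavioral model. This is a sketch that encodes only the rules listed on the slide: the class name, the initial node values, and the `n1_disturbed_low` flag (standing in for node 1 being pulled down while CLK is high) are assumptions, and no analog effects such as charge decay are modeled.

```python
from dataclasses import dataclass


@dataclass
class DynamicFlipFlop:
    n1: int = 1   # node 1
    n2: int = 1   # node 2 (pulled up while CLK is low)
    qn: int = 1   # output QN, holds its old value while floating

    def step(self, clk: int, d: int, n1_disturbed_low: bool = False) -> int:
        if clk == 0:
            self.n1 = 1 - d          # node 1 follows not(D)
            self.n2 = 1              # node 2 pulled up
            # QN floats and keeps its old value
        else:
            if n1_disturbed_low:
                self.n1 = 0          # node 1 can only fall while CLK is high
            if self.n1 == 1:
                self.n2 = 0          # node 2 discharges when node 1 is 1
            # otherwise node 2 keeps its value (nothing pulls it up here)
            self.qn = 1 - self.n2    # QN = not(node 2)
        return self.qn


ff = DynamicFlipFlop()
ff.step(clk=0, d=0)                                # node 1 = 1, node 2 = 1
print(ff.step(clk=1, d=0))                         # 1, i.e. QN = not(D)
print(ff.step(clk=1, d=0, n1_disturbed_low=True))  # still 1: node 2 holds
```

Since node 2 can only fall while CLK is high and node 1 can only fall as well, a late drop on node 1 cannot change node 2, which is why QN stays stable after the rising edge.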
Static Timing Analysis
Do I have to check ALL the constraints?
Yup, for every pair of connected register/latches AND for all
possible data values!
We need a CAD tool: static timing analyzer. Here’s how
it works:
Step 1: “Level-ize” all signal nodes.
Start by assigning all register outputs and top-level inputs a
level of 0. For all other gates: levelOUTPUT =
max(levelINPUT) + 1.
Step 2: Compute min/max signal delays.
For each successive node level, compute min and max time for
all nodes on that level (see next slide for details). This is a
“data independent” computation. Might need case analysis to
avoid false paths.
Step 3: Check setup and hold constraints
Use min times of register inputs to check hold time. Use max
times and tCLK to check setup time or use max time + tSETUP
to determine min tCLK.
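The three steps can be sketched for a tiny gate network. This is an illustrative model, not a description of any particular tool: the data layout (a dict of gate inputs, per-gate min/max delays, and source arrival times) and all names are assumptions, times are integers in picoseconds, the network is assumed to be acyclic, and false-path analysis is omitted.

```python
def levelize(gates, sources):
    """Step 1: sources get level 0, each gate gets max(input levels) + 1."""
    level = {s: 0 for s in sources}

    def visit(node):
        if node not in level:
            level[node] = max(visit(i) for i in gates[node]) + 1
        return level[node]

    for g in gates:
        visit(g)
    return level


def arrival_times(gates, delays, sources):
    """Step 2: data-independent min/max arrival time of every node."""
    level = levelize(gates, sources)
    tmin = {s: t[0] for s, t in sources.items()}
    tmax = {s: t[1] for s, t in sources.items()}
    for g in sorted(gates, key=level.get):      # process level by level
        dmin, dmax = delays[g]
        tmin[g] = min(tmin[i] for i in gates[g]) + dmin
        tmax[g] = max(tmax[i] for i in gates[g]) + dmax
    return tmin, tmax


def check_register_input(node, tmin, tmax, th, ts, tclk):
    """Step 3: hold uses the min time, setup uses the max time and tCLK."""
    return {"hold_ok": tmin[node] >= th,
            "setup_ok": tmax[node] + ts <= tclk,
            "min_tclk": tmax[node] + ts}


# Register outputs q1, q2 become valid between 100 and 300 ps after the clock.
sources = {"q1": (100, 300), "q2": (100, 300)}
gates = {"g1": ["q1", "q2"], "g2": ["g1", "q2"]}   # g2 feeds a register input
delays = {"g1": (200, 500), "g2": (100, 400)}      # (min, max) per gate

tmin, tmax = arrival_times(gates, delays, sources)
print(levelize(gates, sources))   # {'q1': 0, 'q2': 0, 'g1': 1, 'g2': 2}
print(tmin["g2"], tmax["g2"])     # 200 1200
print(check_register_input("g2", tmin, tmax, th=150, ts=200, tclk=1500))
```

The earliest path to g2 bypasses g1, so the hold check is driven by a different path than the setup check.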
Stage Delay Computation
Look at each gate and use knowledge of input timing and rise/fall
timing to compute earliest and latest time output could change for
both rising and falling output transitions.
[Figure: clocked stage with input D (IN), output OUT, clocks CLK and CLKN, internal nodes 1 and 2 with capacitances C1 and C2, and output load COUT]
D ↑ ⇒ OUT ↓
min ⇒ node 1 = 0V, fast
max ⇒ node 1 = VDD, slow
D ↓ ⇒ OUT ↑
min ⇒ node 2 = VDD, fast
max ⇒ node 2 = 0V, slow
Other transitions:
CLK ↑, CLK ↓, CLKN ↑, CLKN ↓
Use Penfield-Rubinstein model to compute
td,in-out = sum(Ri * Ci) over all nodes “i” in the stage, where Ri is
total “effective resistance” to power rail and Ci is non-zero if node
capacitor needs to be charged/discharged. Multiply by derating
factor to account for rise/fall time of input.
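The sum-of-RC delay estimate can be written out in a few lines. This is a sketch with made-up values: the resistances, capacitances, derating factor, and function name are all hypothetical, and the `switches` flag stands in for "Ci is non-zero if the node capacitor needs to be charged/discharged".

```python
def stage_delay(nodes, derating=1.0):
    """Estimate stage delay as derating * sum(Ri * Ci).

    nodes: iterable of (r_ohms, c_farads, switches) tuples, where r_ohms is
    the total effective resistance from that node to the power rail and
    switches says whether the node capacitance changes state.
    """
    total = sum(r * c for r, c, switches in nodes if switches)
    return derating * total


# Worst-case style example: the internal node must also be discharged.
slow_case = [
    (10e3, 20e-15, True),   # internal node: 10 kOhm to rail, 20 fF
    (5e3, 50e-15, True),    # output node: 5 kOhm to rail, 50 fF
]
# Best-case style example: the internal node is already at its final value.
fast_case = [
    (10e3, 20e-15, False),
    (5e3, 50e-15, True),
]

print(stage_delay(slow_case, derating=1.2))  # about 5.4e-10 s
print(stage_delay(fast_case, derating=1.2))  # about 3.0e-10 s
```

The difference between the two cases is what produces distinct min and max delays for the same output transition.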
Coming Up...
Next topic…
Finite state machines: state diagrams, state
minimization, state assignment, logic and PLA
implementations.
Readings for next time…
Weste:
- Sections 5.5 thru 5.5.6 (latch, FF)
- 5.5.8 thru 5.5.11 (clock strategy)
- 5.5.15 and 5.5.16 (clock strategy)
Selfstudy…
Weste:
- PLL section 9.3.5.3
Exercises: VLSI-10
Ex vlsi10.1 (difficulty: easy): calculate peak current
and power consumption of a 100MHz clock driver
with rise and fall times of 1ns driving 30k register
bits at 100fF each with Vdd=3.3V
Result: Ipeak=9.9A, Pd=2.18 Watt
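The arithmetic can be reproduced in a few lines. This sketch assumes the usual first-order formulas, Ipeak = C * Vdd / t_edge for a linear ramp and Pd = C * Vdd^2 * f for dynamic power; the slide does not state which formulas were intended, and the variable names are illustrative.

```python
n_bits = 30_000        # register bits driven by the clock
c_bit = 100e-15        # load per bit in farads (100 fF)
vdd = 3.3              # supply voltage in volts
f_clk = 100e6          # clock frequency in hertz
t_edge = 1e-9          # rise/fall time in seconds

c_total = n_bits * c_bit              # total clock load: 3 nF
i_peak = c_total * vdd / t_edge       # I = C * dV/dt during an edge
p_dyn = c_total * vdd ** 2 * f_clk    # P = C * Vdd^2 * f (assumed model)

print(f"C = {c_total * 1e9:.1f} nF")
print(f"Ipeak = {i_peak:.1f} A")
print(f"Pd = {p_dyn:.2f} W")
```

The peak current agrees with the stated 9.9 A. Under the assumed power formula the result is about 3.27 W at 100 MHz; the stated 2.18 W would match the same formula at roughly 66.7 MHz, so the intended power model or frequency may differ from this sketch.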