0% found this document useful (0 votes)
59 views40 pages

ACA UNIT-2 Kai Hwang

This document discusses pipelining and superscalar techniques in processor design, covering linear and nonlinear pipeline models, asynchronous and synchronous operations, and various scheduling techniques. It details the performance analysis of instruction, arithmetic, and memory-access pipelines, including concepts like clocking, speedup, efficiency, and throughput. Additionally, it explores the complexities of dynamic pipelines and the challenges of scheduling tasks within them.

Uploaded by

ashishkumari
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
59 views40 pages

ACA UNIT-2 Kai Hwang

This document discusses pipelining and superscalar techniques in processor design, covering linear and nonlinear pipeline models, asynchronous and synchronous operations, and various scheduling techniques. It details the performance analysis of instruction, arithmetic, and memory-access pipelines, including concepts like clocking, speedup, efficiency, and throughput. Additionally, it explores the complexities of dynamic pipelines and the challenges of scheduling tasks within them.

Uploaded by

ashishkumari
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 40

PM !

l|¢G-NH-‘ Hllffitvoponm

— —

Pipelining and Superscalar


Techniques
This chapter dis with pipelining and superscalar design in processor developn1ent.We begin with a
discussion of conventional linear pipelines and analyze their performance!-\ generalized pipeline model
is introduced to include nonlinear interstage connections. Collision-free scheduling techniques are
described for performing dynamic functions.
Specific techniques for building instruction pipelines. arithmetic pipelines. and memory-access
pipelines are presented. The discussion includes instruction pnefetching. internal data forwarding.
software interlocking. hardware scoreboardirrg. hazard avoidance. branch handling. and instruction-
issuing techniques. Both static and multifunctional arithmetic pipelines are designed. Superscalar design
techniques are studied along with performance analysis.

IJHEAR PIPEIJNE PROCESSORS


1 A linear pipcl'r'ne pro;-c.s.sor is a cascade of proccssing stages which arc lincatly connected to
perforrn a fixed function over a stream of data flowing from one end to the other. in modem
computers. linear pipelines arc applied for instruction execution, arithmetic computation, and mcrnory-access
operations.

6.1.1 Asynchronous and Synchronous Models


A linear pipeline processor is consiructecl with J1: processing stages. External inputs (operands) are fed into
the pipeline at the first stage 8|. The processed rcsulls are passed from stage S. to stage S.-+1. for all F = 1, 2,...
It - l. The final result emerges from thc pipeline at the last stage 3;,
Depending on the control of data flow along the pipeline. we model linear pipelines in two categories:
n.s_1-'rrc-in'ormn.s and s_i-'nc-rrrr;Irrons'.

Asynchronous Model As shown in Fig. 6.1a. data flow between adjacent stages in an asynchronous
pipeline is conlrolled by a handshaking protocol. When stage S; is ready to transmit, it sends a rt-mlr-' signal to
stage 5‘.-+|. Aflcr stage 5‘,., receives the incoming dala, it returns an tat-krron-'fc't'.igt= signal to 5‘.-.
Asynchronous pipelines are useful in designing communication channels in message-passing multicom-
pulers whore pipelinod wormhole routing is practiced [see Chapter 9). Asynchronous pipelines may have a
variable throughput rate. Different amounts of delay may be experienced in different stages.
l
WM MIT cI 0' lb‘ Hfiitim
- =- |r.\.m*\
.- _ _
115' i I Advwioed Computer Architecture

Input é Q Ijjjjjjf q Ou‘lp-u'l


File cl
Ready 52 SI: Ready
Aek W; “Kain Ack

[a] An asynchronous pipeline moclel


L L L L L

|m;,u Output

O I

Ce H H nil H l_|
|- T + rm -l=*l~
(hi A synchronous plpellne model
in Tlrne [clo-cit cycles]
1 2 3 4
Qqitlons:
III
We i=0-WM
S3
L = Latch

rm= Maxlimm stage delay


= Latch delay
54 El =Ackno»wlnclge signal.

[cl Reseryatlorl table of a four-stage linear pipeline

Fig.-ii.1 ‘Mo models of linear pipeline mill: and she corresponding reserva1:lc.in table

Syn-ehmnuu: Model Synchronous pipelines are illustrated in Fig. l5.lb. Cloclced latches are used to
interface between stages. The latches are made with master-slave fiip-flops, which can isolate inputs from
outputs. Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously.
The pipeline stages are combinational logic circuits. It is desired to have approximately equal delays
in all stages. These delays dctenriinc the clock period and thus the speed of the pipeline. Unless otherwise
specified, only synchronous pipelines are studied in this book.
The utilization pattern of successive stages in a synchronous pipeline is specified by a re'servafion rnbie.
For a linear pipeline, the utilization follows the diagonal streamline pattern shown in Fig. 6.lc. This table
is essentially :1 space-time diagram depicting the precedence relationship in using the pipeline stages. For a
ii-stage linear pipeline, it clock cycles are needed for data to flow through the pipeline.
Successive tasks or operations are initiated one per cycle to enter the pipeline. Once the pipeline is filled
up, one result emerges fi'om the pipeline for each additional cycle. This throughput is sustained only if the
successive tasks arc independent of each other.
,.,,,,-.,,,,,...,,.,,._.,,,_,,.,,,,,,,, _ m
6.1.1 Clocking a.ndTirnir|g Control
The c-Ind: r-_i-do r of a pipeline is determined below. Let r,- be the time delay of the circuitry in stage S, and d
the time delay ofa latch, as shown in Fig. 6.11:.
Clock Cycle and Throughput Denote the moxinmm stage r.lein__i-' as rm, and we can write tas

r=max{r,}f+n'= r,,,+n' (5.1)


At the rising edge of the clock pulse. the data is latched to the master flip-flops of each latch register. The
clock pulse has a width equal to o‘. In general, rm P-P of by one to two orders of magnitude. This implies that
the maximum stage delay r,,, dominates the clock period.
The ;Jipefim'_fi'eq1ienc__v is defined as the inverse of the clock period:
l
f~ ? (*5-3)
If one result is expected to come out ofthe pipeline per cycle, _,r"r-cpresents the mrtrimum Ifirougfipult of the
pipeline. Depending on the initiation rate of successive tasks entering the pipeline, the arnml rhrmighpur of
the pipeline may be lower than _,l.' This is because more than one clock cycle has elapsed between successive
task initiations.
Clock Slmwing Ideally, we expect the clock pulses to arrive at all stages (latches) at the same time.
However. due to a problem known as dock .§‘.l1’l1--‘ing. the same clock pulse may arrive at different stages with
a time oi"Tset ofs. l..»t:tr,m,_, be the time delay of the longest logic path within a stage and rm that of the shortest
logic path within a stage.
To avoid a race in two successive stages. we must choose r,,, 2 rm, + s and n‘ 5 rm,-,, — .s. These constraints
translate into the following bounds on the clock period when clock skew takes effect:

51+ f.I:rr¢r.t + S S r S rm + -rnrin _ 5'

In the ideal case s = 0, r,,,,,_, = r,,,, and r,,,,-,, = d. Thus, we have r = rm + d, consistent with the definition in
Eq. 6.1 without the effect of clock skewing.

6.1 .3 Speedup, Efficiency, and Throughput


Ideally, a linear pipeline of it stages can process ri tasks in k — (n — 1) clock cycles, where It cycles are needed
to complete the execution of the very first task and the remaining H — 1 tasks require n — 1 cycles. Thus the
total time required is
r,, = [k + in - 1)] r (5.4)
where ris the clock Consideran equivalent-function nonpipelinetl processor which has aflow-through
deiqi-' of icr. The amount of time it takes to execute n tasks on this nonpipelin-ed processor is T| = nkr.
Speedup Factor The speedup factor of a ii"-stage pipeline over an equivalent non pipelined processor is
defined as
S‘: TL: nkr = nit (6.5)
11 It-r+(n—l]r k+[n—l]
I15 i Advanced Computer Architecture

Note 6.1‘ Pipelined versus non-pipelined pmceuor:


If each pipeline stage has a stage delay of T, then clearly an in5l'fl.l¢'tlon passing through .1: pipeline
stages in a processor secs a tonal latency ofkt. New suppose we also have a non-pipclined processor
for the same instruction set, using the same technology. This non-pipelined processor need not present
ti latency of kr to every instruction. because it does not have Ir separate stages for an instruction to
pass through. Since the non-pipclined processor would have a more compact hardware design, we can
expect that the average latency seen by instructions on this processor will be smaller than kr.
In other words, the advantage of a pipelined processor lies in its instruction throughput; in terms
of instmction latency, the non-pipelined version can in fact be expected do better. However. for the
comparative analysis here, we have assumed that the instruction latency on the non-pipelined version
is also kr. This is a simplification which does not change substantially the conclusion reached.

I»)
B Example 6.1 Pipeline speedup versus stream length
The maximurn speedup is .51 —> Jr as n —> =~=. This maximum speedup is very difficult to achieve because of
data dependences between successive tasks (instructions). program branches, interrupts. and other factors to
be studied in subsequent sections.
Figure 6.2a plots the speedup factor as a function of n, the number of tasks (operations or instructions]
performed by the pipeline. For small values ofn. the speedup can be very poor. The smallest value of 51 is 1
when H = 1.
The larger the number Ir of subdivided pipeline stages. the higher the potential speedup performance.
When n = 64, an eight-stage pipeline has a speedup value of ?.l and a four-stage pipeline has a speedup of
3.1 However, the number of pipeline stages cannot increase indefinitely due to practical constraints on costs,
control complexity. circuit implementation, and packaging limitations. Furthermore, the stream length n also
affects the speedup; the longer the better in using a pipeline.

Optirnal Number of Stage: In practice, most pipelining is staged at the functional level with 2 5 Ir 5 15.
Very few pipelines are designed to exceed I0 stages in real computers. The optimal choice of the number of
pipeline stages sht:-u.ld be able to maximize the perfomtancefeost ratio for the target processing load.
Let r be the total time required for a nonpipelined sequential program of a given function. To execute the
same prngrim on a Ir-stage pipeline with an equal flow-through delay r. one needs a clock period ofp = ril-
+ ti where if is the latch delay. Thus, the pipeline has a maximum throughput off= Up = 1.*{n'.t' + rt). The total
pipeline cost is roughly estimated by c + kh. whcrc c covers the cost of all logic stages and h represents thc
cost of each latch. A pipeline git-ijformrvir-eat-as! ratio [PCR] has been defined by Lemon {I 9T3):
PCR = _-L = ml (as)
c + Hi (Mk + d)(c + Ha)
,;,,,.1,,,,,,W,5,,,,._,,,£_,,,,,T.,
_ __ H‘.-|-Mcliruw Hl'lI:'|>¢r.-moi-r~ I
_ m
Sn
10 .................................................................................. ......__

I: = 10 stages

SpeedupFact-o4cnon
» k = 6 stages

2
1 I I I I I I I I ,.
1 2 4 B 1 6 32 64 128 n
No. of operations
[a] Speedup factor as a function of the number of oporatlms [Eq. 6.5]

— Peak

PeroformanceCostRat
- k
6"Q_____
No. of stages.-
[Op-timaij
[b1 Dptl rral nurrbor of pip-oil no stages [Eqs. 6.6 and B.?j|

Fig. L2 Speedup factors and the optirml l'Il.lfi'lbEl‘ ofpdp-elm stages bra flnear ppeiine unit

Figure 6.21:: plots the PCR as a function ofk. The peak of ihe PCR nun-c corresponds to an optimal choice
for the number of desired pipeline stages:
I-C
Ir.-,= 5 (6-7)
when: r is the total fl0w—111r0ugh delay of lho pipeline. Thus the total stage cost c, thc latch delay d. and thc
latch cost fa must he considered to achieve the optimal value £1-,.
Efllcfency and Throughput Thc ejjficienqy Ek of a linear It-stage pipclinc is dcfilmd as

.5‘ = Z’;
5k = _.£ (53)
J: k +(n — 1] '
Obviously, the cfiicicncy approachcs I whcn n —> =-1, and a luwccr bound on E‘: is 1H: when n = 1. The
pi']'J€Hm.* rhrmighpur II,‘ is defined as the number oi" tasks {operations} performed per unit lime:
= n = nf
H‘ [k+{n—l)]t k+(n—l) (63)
F|'>r'MfGJ'|Ili' N“ I'm-tiqtrlrlitt
III N _ Advanced Comptrtetwfircititscture

The rrrrtrimrmi rhmrig!i;Jurf occurs when E; —> l as N —> W. This coincides with thc speedup definition
given in Chapter 3. Note that Iik = E; _f= E11 r= Silk r. Other rclcvant factors of instruction pipclincs will bc
discussed in Chapters 12 and 13.

NONLINEAR PIPELINE PROCESSORS


2
A uf_t-name‘: pipeiine can be reconfigured to perform variable functions at different times. The
traditional linear pipelines are static pipelines because they are used to perform fixed functions.
Adynamic pipeline allows fcedforward and feedback connections in addition to the streamline connections.
For this reason, some authors call such a structun: a rioriiinertr pipeline.

6.1.1 Reservation and Latency Analysis


In a static pipeline, it is relatively easy to partition a given function into s sequence of linearly ordered
subfunctions. However, function partitioning in a dynamic pipeline becomes quite involved because the
pipclinc stages arc intcrconncctcd with loops in addition to streamline cottncctions.
A multifimction dynamic pipclinc is shown in Fig. 6.3a. This pipclint: has three stages. Bcsidcs thc
srrernniine r-onriecrions from S| to S; and from S; to S3, there is a_fiv=d_fom'rtrrf cviriner-rion from S | to S3 and
two_fi*¢=r2’brtr-I: £‘t'J.li.flr_’£‘flit'J.ri.§‘ from S3 to S2 and from S3 to 5-'|.
These feodfortvard and feedback connections make the scheduling of successive events into the pipeline
a nontrivial task. With thcsc connections, thc output of thc pipclinc is not necessarily from thc last stage. In
fact, following different datallow patterns, one can use the same pipeline to evaluate different functions.
Output it
I-1-I O|.lP'i-ItY
[a] A trree-stage pipeline

_L »s
—r-
I\J U3
—I- Time
1 2 3 4 5 6
51 IIII-"’Hen --t Hm IIII
Sass 52 IIIIII Sass S2 IIIII
-it IIIII as III
[bi Fits-servation table for fun-ctien JC [cl Reservation table for function Y

FIg.6.3 A dynamic pipeline with feed forward and izeelbaeit connections fir two different functions

Reservation Table: The reservation table for a static linear pipeline is trivial in the sense that dataflow
follows a linear streamline. The rr'st'rvrn'ion mink’ for a dynamic pipeline becomes more interesting because
a nonlinear pattern is followed. Given a pipeline configuration, multiple reservation tables can be generated
for the evaluation of differmt functions.
Two reservation tables are given in Figs. 6.31:: and 6.3c-, corresponding to a function X and a function Y,
respectively. Each function evaluation is specified by one reservation table. A static pipeline is specified by a
single reservation table. A dynamic pipeline may be specified by more than one reservation table.
,.,,,e.-.,i,,g,,.,..5.,,._.,,,._,,.,,T...,~, _ m
Each reservation table displays the time~space flow ofdata through the pipeline for one function evaluation.
Different functions follow different paths through the pipeline.
The number of columns in a reservation table is ealled the evni'uorion rune of a given function. For
example, the function X requires eight clock cycles to evaluate, and function Y requires six cycles, as shown
in Figs. 6.3b and 6.3c, respectively.
A pipeline initiation table corresponds to each fisnction evaluation. All initiations to a static pipe-line use
the same reservation table. On the other han-ti, a dynatnie pipeline may allow different initiations to follow a
mix of reservation tables. The checltmarks in each row of the reservation table correspond to the time instants
(cycles) that a particular stage will be used.
There may be multiple checltmarks in a row, which means repeated usage of the same stage in diifenent
cycles. Contiguous cheelcrnarks in a row sirriply imply the extended usage of a stage over more than one
cycle. Multiple cheekmarks in a column mean that multiple stages need to he used in parallel during a
particular clock cycle.
Latency Analysis The number of time units [clock cycles] between two initiations of a pipeline is the
hireriey between them. Latency values must be nonnegative integers. A latency ofIt nieans that mo initiations
are separated by it clock cycles. Any attempt by two or more initiations to use the same pipeline stage at the
same time will eause a eol'1r'sion.
A collision implies resource conflicts between two initiations in the pipeline. Therefore, all collisions must
be avoided in scheduling a sequence of pipeline initiations. Some latencies will cause collisions, and some
will not. Latcncies that cause collisions are ealled_,forbi.dden latencies. In using the pipeline in Fig. 6.3 to
evaluate the function X, latencies 2 and 5 are forbidden, as illustrated in Fig. 6.4.

—I-Time
1 2 3 it 5 6 1' 3 9 10 11
51 X1 X2 Ks 9'11 X4 K1-*2 K213
SHQGS S2 X1 X1, X2 X2, X3 X3. X4 ill.‘ I I I

53 X1 x1- X2 x1- K2» Ks x2~ is K-it


[aj Collision with scheduling latency 2

—I-Tu-no
123456?BQ-1011
51 K1 K1-X2 X1
Seuss 52 X1 X1 K2 K2 '''
53 *1 X1 K1 K2 K2
[ti-J Collision with eehoeluling latency 5

Fig.6.-1 Collisions with forbidden hteneies 1 and 5 in using the pipeline in Fig. 6.3 to evaluate the function X

The ith initiation is denoted as X; in Fig. 6.4. With latency 2, initiations X, and X; collide in stage 2 at time
4. At time T. these initiations collide in stage 3. Similarly, other collisions are shown at times 5. 6, 8. ..., etc.
I ' I Ifllli t'r>rrIq|r_.r.I|n*\ _
Z31 i Advwiced Computerflrrchitecture

The collisio.n patterns for latency 5 are shown in Fig. 6.4b, where X, and X2 are scheduled 5 clock cycles
apart. Their first collision occurs at time 6.
To detect a forbidden latency, one needs simply to cheek the distance between any two checkmarks in the
same row of the reservation table. l-‘or example, the distance between the first mark and the second mark in
row S, in Fig. 6.31:: is 5, implying that 5 is a forbidden latency.
Similarly, latencies 2, 4, S. and T are all seen to be forbidden from inspecting the same reservation table.
From the reservation table in Fig. 6.3c, we discover the iiirbirlden latencies 2 and 4 for function Y. A inrentjt-'
seqrieitce is a sequence ofpermissible nonforhidden latencies between successive task initiations.
A fflfs'HC_\-‘ ct-‘cit’ is a latency sequence which repeats the same suhsequence (cycle) indefinitely. l-"ig1.|ret:i.5
illustrates latency cycles in using the pipeline in Fig. 6.3 to evaluate the function X without causing acollision.
For example, the latency cycle l_'l, S] represents the infinite latency sequence 1, B, l, B, l, 8, This implies
that successive initiations ofnew tasks are separated by one cycle and eight cycles alternately.
]-ii Cycle repeats it-l-ii--~
123456T8Q-10111213-1415161?13192D21
513912 x'lx'2x1x’2x3)(4 Ksxttxsxrtxsxe

53 9"-1 9"-2 X1 ‘K1 K2 9'13 x-rt *3 is


[a] Latency cycle [1, B] = 1. 3. 1. B. 1. B. with anavorege latency of4.5

C terep-eats
t-
123455T3‘Q‘ll]"l1‘l2‘l3‘l4‘l5‘l51i"1iB‘l'Q2D2‘l
-11° + -+- -+~
S‘|x‘l x2 ‘K1‘K3x1x2xdx2x3x5X3‘K4x'6xdx5x?X5E
s2 it, if-r"12 Kzxs Xaxtt Xaxs Ksxe Xsxr "'
53 X1 X1x2x1x2xsX2’(3x4X3X4KsxaXs’(exsXs
{bj|Latencycyclo{3)=3,3,3,3,...,-withan average latency-of3

F Cyelerop-nets 1+‘ + _-

12 3 4 5 E T B‘Q1l]"l1‘l2‘l3‘l4—‘l5‘l51i"‘l3‘lQ'2D2‘l
5111 x'lx2x1 xzxsxz Xsxax3
52 X1 3'11 x2 >12 x3 x3 x4 ...
53 it, x, Jt-1 x2 x2 X2 x3 x3 x3 X3
[c] Latency cycle {6} = E, 6, E, 6, with an average istency of 6

Fig.6.! Three valid latency cydes icr the evaluation offunerion X

The r'nr+:r¢rge i'nIcrr-t‘__t' of a latency cycle is obtained by dividing the sum of all latencies by the number of
latencies along the cycle. The latency cycle { I , 8) thus has an average latency of (1 + 3).f2 = 4.5. A consmnr
cjveie is a latency cycle which contains only one latency value. Cycles {3} and (6) in Figs. 6.5h and 6.5c are
both constant cycles. The average latency of a constant cycle is simply the latency itself. ln the next section,
we describe how to obtain these latency cycles systematically.
...,,,-,,-,,,,.,..,,,,,,,_,.,.,.,,,,, _ ,3,
6.2.1 Collision-Free Scheduling
When scheduling events in a nonlinear pipeline, the main objective is to obtain the shortest average latency
between initiations witliout causing collisions. In what follows, we present a systematic method for achieving
such collision-free scheduling.
We study below c0ii'r'.sion vectors. stare o’r'agrorns_ .sirig')'e c'_t-‘cf-es. growl:-' c_vc*!cs', and minimal at-'cragt>
interior {MAL}. This pipeline design theory was originally developed by Davidson [I97 I) and his students.
Collision ‘Factor: By examining the reservation table, one can distinguish the set of permissible latencies
from the set of forbidden latencies. For a reservation table with n cohnnns, the ninrimum_;‘orbiriricn inrency
in E n - 1. The pennissible latency p should be as small as possible. The choice is made in the range 1 E p 5
ml.
A permissible latency of p = I corresponds to the ideal case. In theory, a latency of I can always be
achieved in a static pipeline which follows a linear (diagonal or streamlined) reservation table as shown in
Fig. 6.lc.
The combined Set of permissible and forbidden latencies can be easily displayed by a collision vector,
which is an in-bit binary vector C = (C,,,t_".,,, | ...t_".3C|). The value of C, = 1 if laicncy ieauses a collision
and C, = 0 if latcncy i is permissible. Note that it is always true that C,,, — 1, corresponding to the maximum
forbidden latency.
For the two reservation tables in Fig. 6.3, the collision vector Cy = {101 I010) is obtained for fiinction X,
and Cy = [_llIIltl_] for function Y. From l:__\;', we can immediately tell that latencies T, S, 4, and 2 are forbidden
and latencies 6, 3. and l are permissible. Similarly, 4 and 2 are forbidden latencies and 3 and I are permissible
latencies for fimction Y.
State Diogmm: From the above collision vector, one can consn-uct a stoic .n‘ingr.nm spccifying thc
permissible state transitions among successive initiations. The collision vector, like Cy above, corresponds to
thc inirirti smrc of the pipeline at time 1 and thus is called an iflifititl c'oi'i"isi0n vector. Lerp be a permissible
latency within thc range lip E in — 1.
The nctr state ofthc pipclinc at time r + p is obtained with the assistance of an m-bit right shift register as
in Fig. 6.6a. The initial collision vector C is initially loaded into lite register. The register is then Shifted to the
right. Each l—bit shift corresponds to an increase in the latency by l. ‘When a CI bit emerges from the right end
aftcr ,o shifts, it means p is a permissible latency. Likewise, a 1 bit being shifted out means a collision, and
thus the corresponding latency should be forbidden.
Logical ll enters fiom the lei’: end of the shift register. 'I'he next state after p shifts is thus obtained by
bitwise-{llling the initial collision vector with the shifted register contents. For example, from thc initial
state Cy — (l0ll010}, the next state (l l I I l 1 1) is reached after one right shifi ofthe register, and the next state
(I01 10 I I} is reached after three shifts or six shifts.

Ir)
éjj Example 6.2 The state transition diagram for a pipeline unit
A snrrc ri‘ingr'arn is obtained in Fig. t'i.t'ib for function X. Frorn the initial state [101 lill 0}, only three outgoing
transitions are possible, corresponding to the three permissible latencies 6, 3, and I in the initial collision
vector. Similarly, from state (till I011], one reaches the same state afier either three shifts or six shifts.
I 1 I Ifllli l'm'rIq|r_.\.I|n*\ _
Z35 i Advwiced Computerhrdsitecture

When the number of shifts is m + i or gneater, all transitions are redirected back to the initial state. For
example, aficr eight or more (denoted as Bl] shifts, the neat state must be the initial state, regardless of which
state the transition starts from. in l-‘lg. 6.60, a state diagram is obtained for the reservation table in 1-"ig. 6.3c
using a 4-bit shift register. Once the initial collision vector is determined, the corresponding state diagram is
uniquely determined.

{Cm CD4, _ _ _ _ _ __ C1j|= lnltialeotilelon vector

I I I

in

"fl-" ‘ ‘ III I “D" safe


*1" collision
000

[aj State trans ltlen using an malt right shift register, where n is the rnaxirnurn forblddeniateney


_; $ _L _L Q‘ _L Q‘
i B
-5. |= _|t ¢

3 * 3+ 5+ 5+

at 1 s 1*
1n11n11 1111111 i
_; ID _; _; 1111
Q) .
(bi State diagram for function it
0
[0] State diagram for function Y

Fig. 6.6 Two stane diagralns obtainecl from the two reservation table: in Fig. 6.3,relp-eetzively

The ll’s and 1’s in the present state, say at time I, of a state diagram indicate the permissible and forbidden
latcneics, respectively, at time r. The hitwise Olling of the shifted versicn of the present state with thc initial
collision vector is meant to prevent collisions finrn future initiations starting at time r + l and onward.
Thus the state diagram covers all permissible state transitions that avoid collisions. All latencies equal to
or greater than m are permissible. This implies that collisions can always be avoided if evenm are scheduled
far apart (will! latencies ofm+). However, sueh long latencies are not tolerable from the viewpoint ofpipeline
throughput.
Greedy Cyeie: From the state diagram, we can determine optimal latency cycles which result in the MAL.
There are infinitely many latency eyeles one ean traee from the state diagram. For example, (I, S), (I, S,
6, ii), (3), (6), [3, ii), (3, 6, 3'] ..., are legitimate cycles traced fi'om the state diagram in Fig. 15.6b. Among these
cycles, only sim_rJ1'e .1:-__t-ales are of interest.
...,,,,-,,,-,,,,,.,,,,.,,,,,.,,.,,,,, _ ,3,
A simple cycle is a latency cycle in which each state appears only once. In the state diagram in Fig. tit‘-b.
only -[3], {G}, (8), (1, 3), (3, B}, and [6, E) are simple cycles. The cycle [1, S, I5, E) is not simple because it
travels through the state (ltll 1010') twice. Similarly, the cycle (3, 6, 3, 8, 15) is not simple because it repeats
the state (1 [ll I 01 ll three times.
Some ofthe simple rye let are grccdi-' r_lc!cs. A greedy cycle is one whose edges are all made with minimum
latencies from their respective starting states. For example, in Fig. 6.6b the cyelcs [1, 8) and (3) are greedy
cycles. Greedy cycles in Fig. 6.6c are (1, 5) and (3). Such cycles must first he simple, and their average
latencies must be lower than those of other simple cycles. The greedy cycle [1, 8) in Fig. 15.6b has an average
latency of(l + S]-'2 = 4.5, which is lower than that ofthe simple cycle [6, S) = (6 + 8}l'2 = T. The greedy cycle
(3) has a constant latency which equals the MAL for evaluating fimction X without causing a collision.
The i'viAL in l-‘lg. 6.6-c is 3, corresponding to either of the two greedy cycles. The minimum-latency edges
in the state diagrams are marked with asterisks.
At least one of the greedy cycles will lead to the MAL. The collision-free scheduling of pipeline events
is thus reduced to finding greedy cycles fi'om the set of simple cycles. The greedy cycle yielding the It-'LAI.. is
the final choice.

6.1.3 Pipeline Schedule Optimization


An optimization technique based on the MAL is given below. The idea is to insert noncompute delay stages
into the original pipeline. This will modify the reservation table, resulting in a new collision vector and an
improved state diagram. The purpose is to yield an optimal latency cycle, which is absolutely the shortest.
Bounds on the MAL in 1972, Shar determined the following bounds on the rnirrimol at-‘wag-e icrrencji-'
(MAL) achievable by any control strategy on a statically reconfigured pipeline executing a given reservation
table:
{l _| Thc MAL is lower-hounded by the maxirnum numhcr of chcclcrnarlrs in any row of thc reservation
table.
{2} Thc MAL is lower than or equal to thc average latency of any greedy cycle in the state diagram.
{3} Thc average latency ofany greedy cycle is upper-bountlctl by the number of] ‘s in the initial collision
vector plus l. This is also an upper botmtl on the MAL.

Interested readers may refer to Shar (I 972} or find proofs of these bounds in Kogge (I981). These results
suggest that thc optimal latency cycle must he selected firom one of the lowest greedy cycles. However,
a greedy cycle is not suflicient to guarantee the optimality of the MAL. The lower botmd guarantees the
optimality. For example, the MAL = 3 for both function "X and iiinction Y and has met the lower bound of
3 fiom their respective reservation iahlcs.
From Fig. 6.iSb, the upper hound on the ltll_A.L for function X is equal to 4 e l = 5, a rather loose bound.
On the other hand, 1-‘ig. 6.6:: shows a rather tight upper bound of 2 + 1 = 3 on the MAL. Therefore, all greedy
cycles for function Y lead to the optimal latency value of 3, which cannot be lowered further.
To optimize the MAL, one needs to find the lower bound by modifying the reservation table. The approach
is to reduce the maximum number of chcckrnarks in any row. The modified reservation table must preserve
the original function being evaluated. Patel and Davidson ['19l'6) have suggested the use of noncompute
delay stages to increase pipeline performance with a shorter MAL. Their technique is described below.
Z15 i ' Advanced Computerhrdiitecture

Delay Insertion The purpose of delay insertion is to modify the reservation table, yielding a new collision
vector. This leads to a modified state diagram, which may produce greedy cycles meeting the lower hound
on the MAI...
Before delay insertion, the three-stage pipeline in Fig. 6.7a is Specified by the I‘BSe1'\'ntion table in Fig,
6.Tb. This table leads to :1 collision vector C = {lfil 1}, corresponding to forbidden latencies 1, 2, and 4. The
corresponding state diagram (Fig. 6.'.|'c] contains only one self-reflecting state with a greedy cycle of latency
3 equal to the MAL.
Based on the given reservation table, the maximum number of eheclcmarks in any row is 2. Therefore, the
MAL = 3 so obtained in Fig. 6.']"e is not optimal.

3?) Example 6.3 Inserting noncompute delays to reduce


the MAL
To insert a noncompute stage D, afier stage S3 will delay both X, and X2 operations one cycle beyond time
4. To insert yet another noncomputc stage D; after the second usage of.S'| will delay the operation X2 by
another cycle.
These delay operations, as grouped in l-‘lg. 6.Tb, result in a new pipeline configuration in Fig. 6.8a. Both
delay elements Dl and D3 are inserted as extra stages. as shown in Fig. 6.Sb with an enlarged reservation table
haying3+2=5rowsund5+2='FcoIurnns.

Output

H I.-I

2
- 52 I 53

{a1 Athros-stage pipeline

—h "limo
1 2 3 4 5

S» III» MM t» at
II! L. _ ,. _ __ .. .. ._ .. ,. _. ._ _ _ _ --

[h] Ftosoniation tah-lo and operations being delayed [c] State transition diagram with MAL = 3

Fig.8.?‘ Apipilne with a minimum areragc la1:eneyol'3

In total, the operation X, has been delayed one cycle fi'om time 4 to time 5 and the operation X’; has been
delayed two cycles from time 5 to time 1'. All remaining operations (marked as X in Fig. 6.3b) are unchanged.
This new tahle leads to a new collision vector (100010) and a modified state diagrain in Fig. 6.Se_
...,,,,,-,,,,,.,,,.,,,_,,,,.,.,, _ 2,,

°""’"‘ I-IE!
we
[a] Insertion of two nonoornpute delay stages

—I- Time 4 r
—L l\J L0 -F

51 III . l .
Stages 5
IIIII"-
2
s, !!E__
IIIIe IIE~ -I
I 9

near D1
M”%IIIIIEI mum
[b] Mo-dfiad reservation tabto [ct Modified state diagram with a reduced M!-Y-.L= [1 + 31t2 = 2

Fig. 6.8 lrecrrlon oftwo delay stages to obtain an op1:lma.l MAL for the pipeline in Fig.6.?

This diagram displays a greedy cycle (1, 3), resulting in a reduced lvL-kl. = (1 + 3]:‘2 = 2. The delay
insertion thus improves the pipeline perfonnance. yielding a lower hound for the MAL.

Pipeline T'hmughprrt This is essentially the initiation rate or thc average number of task initiations per
clock cycle. if N tasks are initiated within n pipeline cycles, then the irririrrrrbrr rare or pipeline rirmugfrprn‘
is measured as N.-‘n. This rate is determined primarily by the inverse of the ltd.-"LL adapted. Therefore, the
scheduling strategy docs affect the pipeline performance.
In general, the shorter the adapted MAL, the higher the throughput that can he expected. The highest
achievable throughput is one task initiation per cycle, when the MAL equals 1 since 1 5 MAL 5 the shortest
latency of any greedy cycle. Unless the MAL is reduced to I, the pipeline throughput becomes a fiaction.

Pipeline Efficiency Another important measure is p|‘pei|'ne qfieiency. The percentage of time that each
pipeline stage is used over asuflireiently long series oftask initiations is the srnge ririttznrrhn. The accumulated
rate of all stage utilizations dctcrrnincs the pipeline cflicicncy.
Let us reexamine latency cycle [3] in Fig. 6.5b. Within each latency cycle of three cloc-ls cycles, there are
two pipeline stages, S, and S3, which are completely and continuously utilized after time 6. The pipeline stage
S3 is used for two cycles and is idle for one cycle.
Therefore, the entire pipeline can be considered St‘? = 33.3% cffieicnt for latency cycle [3]-. On the other
hand, the pipeline is only 14L‘-£7 = 51.8% efficient for latency cycle (1, 8) and 8116 = 50% eflicient for latency
cycle (6), as illustrated in Figs. 6.5a and 6.51:, respectively. Note that none of the three stages is fully utilized
with respect to two initiation cycles.
The pipeline throughput and pipeline effieiency are related to each other. Higher throughput results from
a shorter latency cycle. Higher eflicieney implies less idle time for pipeline stages. The above example
demonstrates that higher throughput also accompanies higher efficiency. Other examples however may show
FM Mtfirnw H'IlI:'n.-rq|i;uin1'
24“ N _ flidvursced Computer Architecture

a contrary conclusion. The relationship between the two measures is a Function of the reservation table and
of the initiation cycle adopted.
At least one stage of the pipeline should be fully (100%) utilized at the steady state in any acceptable
initiation cycle; otherwise, the pipeline capability has not been fully explored. ln such cases, the initiation
cycle may not be optimal and another initiation cycle should be examined for improvement.

INSTRUCTION PIPELINE DESIGN


1 A stream of instructions can be executed by a pipeline in an overlapped manner. We describe
below instruction pipelines for CISC and RISC scalar processors. Topics to be studied include
instruction prefetehing, data forwarding, hazard avoidance, interlocking for resolving data dependences,
dynamic instruction scheduling, and branch handling techniques for improving pipelined processor
perfomiance. Further discussion on instruction level parallelism will be found in Chapter 12.

6.3.1 Instruction Execution Phases


A typical instruction execution consists of a sequence of operations, including instruction fetch, decode,
operand fetch, execute, and write-hack phases. These phases are ideal for overlapped execution on a linear
pipeline.
Pip-elined Instruction Pmcessing A typical inslniction pipeline is depicted in Fig. 6.9. The fizrch .smgc (F)
fetches instructions from a cache memory, ideally one per cycle. The tint-ode stage (D) reveals the instruction
function to he performed and identifies the resources needed. Resources include general-purpose registers.
buses, and fimctional units. The issue stage (I) reserves resources. The operands are also read from registers
during the issue stage.
The instructions are executed in one or several ere:-nre stages (E). Three execute stages are shown in
Fig. 6.9a. The last wrircbacl: stage (W) is used to write results into the registers. Memory load or store
operations are treated as part of execution. Figure 6.9 shows the flow of machine instructions through a
typical pipeline. These eight instructions are for pipelined execution of the high-level language statements
X = Y + Z and A = B >< C. Herc we have assum-od that form‘ and store instructions take four execution clock
cycles, while floating-point mid and nmIn;n{i- operations take three cycles.
The above timing assumptions represent typical values found in an older CLSC processor. in many RJSC
processors, fewer clock cycles are needed. On the other hand. Cray l required ll cycles for a load and a
floating-point addition took six. With in-order i.nstIuct_ion issuing, if an instruction is blocked from issuing
due to a data or resource dependence, all instructions following it are blocked.
Figure 6.91:: illustrates the issue of instructions following the original program order. The shaded boxes
correspond to idle cycles when instruc-tion issues are blocked due to resource latency or conflicts or due to
data dependences. The first two Joan‘ instructions issue on consecutive cycles. The nah’ is dependent on both
feats and must wait three cycles before the data (Y and Z) are loaded in.
Similarly, the store of the sum to memory location X must wait three cycles for the nniri’ to finish due to a
flow dependence. There are similar blockages during the calculation of A. The total time required is 1? clock
cycles. This time is measured beginning at cycle 4 when the first instruction starts execution until cycle 2|]
,.,,,..-W,-,.g..,....,,,E,,..,,,,,.C,,, _ H,
when the last instruction starts execution. This timing measure eliminates the undue effects of the pipeline
“st.srtup" or “draining” delays.
Figure 6.9c shows an improved timing after the instruction issuing order is changed to eliminate
unnecessary delays due to dependence. The idea is to issue all four toad’ operations in the beginning. Both the
acidand mu1rrplfi- instructions are blocked fewer cycles due to this data prefctching. The reordering should not
change the end results. The time required is being reduced to ll cycles, measured from cycle 4 to cycle I4.

Fetch Do-node lss no Exec ute Execute Execute


F D I E E E W

[a]/it seven-stage Instruction pipeline

or Tlrno
1 2 3 4 5 6 7' B Q10111.2113141516171319-2t}.5!12223
R1<-Mem[‘t'J
R2 t- lu'ls|'r|[Zj
R3~e—[R1]+[R2)
lu1nm[x]s— [R3]
R44-Mom[B]
R5<- Mom[C]
R61:-[R1t)'[R5J
lulem[A]<t-[R6] 1l_l

[bl In-order Instruction issuing

or Time
12 3 4 5 6 T B 91D11121314~151B1T
“ts Marti" nnnn
R2R4 e4- "Bull
Mom[E-J
nnnnn
F
R5 \t— Mom[C)
R3 <- {Ft1]+[R2J
HE
HE MINE
R6 <- [R4)‘[R5]
lulern[x]<- [R31
HEHE E
HEIIEIHE EHEEHfllil EEEHE
Me~v~=<~ ea nnnn
[cl Rnorelorod Instruction issuing

Fig. 6.9 Plpelln-ed execution ofX = Y + Z andfii = B :-< C (Courtesy efjarn-in Smlel'i;reprlri1:ed with pcrrnlml-on
from IEEE Con1purnr,july 1939}

Ir)
Cg Example 6.4 The MIPS R4000 instruction pipeline
The MIPS R4000 was a pipelined I54-bit processor using separate instruction and data caches and an eight-
stage pipeline for executing register-based instructions. As illustrated in Fig. tS.lU, the processor pipeline
design was targeted to achieve an execution rate approaching one instruction per cycle.
24‘: Ii ifldvwrcedfiompmerhichitedure

¥1 ti I-U1'1 rn>< ca "l'| 0tn -1n WB

Iflfitflflbfl n|=: Data me Is: Instruction seen 3.


cache '
mew lfldmmf; cs: Data second RF: Register flte
tag check EX: Eiosmtlm TC: Tag check
Iflfiwflbfl IF: |I'|S1l'LI§‘lb|'t first ws: Wrtte back
traretatbn |"$W*3"°"
; disco-do
Read:
mg|sm.__; ALU ,_,_{D-cache
me ioperallcm access]
' Data
address
s '5'
traristaticm

I
IPi
E’
--. - .
CF

I R

[a] R4000 pip-silnestagae


Master
etnckeyets
| i Eightdteep
I |FI|sIRFIExInFInsITcIwsI
I I|FI|sIRFIExIOFInsIrcIwa|
I I F |.|c§..|§...F |55_,i_l3!Fl$ I7? W5.
PIP*""* n
I IF E .F-‘.F |E51I15.F. I53
I IF I IS IRFIEZKIDFIDSITCIWBI
IIF I |sIRF|E>c|nFIns|"rc|wsI
I ||= I |s|RF|ExInF|ns|'rcIwsI
i‘ CUl1'B1'lCP‘U cycle
[I1] R4000 Insimetion overlapping In p-Ipsrtine

Fig. 6.10 The srehlreceure of ‘H14! MIPS IHOUIU Insrrucrien pipeline {Coureesy nfMIPS Cornpuner Sysrenu]

Thc execution of each R4000 i11st111ction consisted of eight major shcps as sumrnarizod in Fig, 6. lfla. Each
of these steps required approximately one clock cycle. The instruction and data rneniory references are split
across two stages. The single-cycle ALU stage took slightly more time than each ofthe cache access stages.
The overlapped execution of successive instructions is shown in Fig. l5.l0h. This pipeline operated
cfficienfly because diFfCl'BIll CPU resources, such as address and bus access, ALU operations, register
accesses, and so on, were utilized simultaneously on a noninterfering basis.
The internal pipeline clock rate (IUD MI-lz) of the R4000 was twice the external input or master clock
...,...-...,.,........,,,._.,.,..,......,.. _ ,,_,_
frequency. Figure 6.l0h shows the optimal pipeline movement. completing one instruction every intemal
clock cycle. Load and branch instructions introduce extra delays.

6.3.1 Mechanisms for Instruction Pipelining


We introduce instruction buffers and describe the use of cacheing, collision avoidance, multiple functional
units. register tagging, and intemal forwarding to smooth pipeline flow and to remove bottlenecks and
unnecessary memory access operations.
Preflstclr Buffer: Three types of buffers can be used to match the instruction fetch rate to the pipeline
consumption rate. ln one memory-access time, a block of consecuti ve instluctions are fetched into a prefetch
bufier as illustrated in Fig. 6.11. The block access can be achieved using interleaved memory modules or
using a cache to shorten the effective memory-access time as demonstrated in the MIPS R4000.
S-equon t

li_ll
nstmetlons Indicated by program counts-r
. Buffer 1
= .Bu er2

at Ta “fiat 1
Target Buffer 2
uni
Instruction Pipeline

I/s
.t\ tlons from braneltad to-cations

Fig.6.11 The use ofseqoemclal and target hofizrs

Sequential instructions are loaded into a pair of.scqrrcnri.ni' brrflers for in-sequence pipelining. Instructions
from a branch target are loaded into a pair of airgyr brg{*i=r.~r for out-of-sequence pipelining. Both buffers
operate in a first—in-first-out fashion. These buffers become part of the pipeline as additional stages.
A conditional branch instruction causes both sequential buffers and target buffers to fill with instructions.
After the branch condition is checked, appropriate instructions are taken from one of the two buifers, and
instructions in the other buffer are discarded. Within each pair, one can use one buffer to load instructions
from memory and use another buffer to feed instructions into the pipeline. The two buffers in each pair
altcmatc to prevent a collision between instructions flowing into and out of the pipeline.
A third type ofprcfetch buflier is known as afonp inrrflbr. This buffer holds sequential instructions contained
in a small loop. The loop buffers are maintained by the fetch stage of the pipeline. Pre-fetched instructions in
the loop body will be executed repeatedly until all iterations complete execution. The loop buffer operates in
two steps. First, it contains instructions sequentially ahead ofthe current instruction. This saves the instruction
fetch time from memory. Second, it recognises when the target ofa branch falls within the loop boundary. In
this case, unnecessary memory accesses can be avoided if the target instruction is already in the loop buffer.
The CDC 6600 and Cray 1 made use of loop buffers.
Multiple Functional Unit: Sometimes a certain pipeline stage becomes the bottleneck. This stage
corresponds to the row with the maximum number of checkmarks in the reservation table. This bottleneck
problem can be alleviated by using rnultiplc copies of the same stage simultaneously. This leads to the use of
multiple execution units in a pipelined processor design (Fig. 612}.
Par MIGIITLH HI" l'mrJI||r_.u|i¢\
Z44 i Aduwrced 'l:-Da'TlPl.lIvl!l'-"||!Ctl't|lIvECfl.rIE

Instruction from Memory

Ftsglster
Instruction Fetch Unit _ _ _Flis _ I

Tag |
Deon-do and Issue Units
|_ ______l A S
' II

ll‘l1l
§;i?JI§‘“"l*’f~l lrfl rfl lRf| R..'§::i.t
Um
Functional
at FU we

Fig. 6.12
s 1 ts. 1
A pipelined processor with multiple functional u-nlrs and dlsrrlhuind reservation srarlons sup-ported
by tagging {Comte-sy of G.Sd'ti;roprlnned with permission from IEEE Tmnsucrlons on Cempurerglrhrch
1990}

Solii (1990) used a model architecture for a pipelined scalar processor containing multiple functional units
(Fig. 6.12). In order to resolve data or resource dependences among the successive instructions entering the
pipeline, the msermrion .smrr'on.s (RS) are used with each functional unit. Operations wait in the RS until
their data dependences have been resolved. Each RS is uniquely identified by a mg. which is monitored by
a rag unit.
The tag unit keeps checking the tags from all currently used registers or RSs_ This register tagging
technique allows the hardware to resolve conflicts between source and destination registers assigned for
multiple instructions. Besides resolving conflicts, the RSs also serve as buffers to interface the pipelined
functional units with the decode and issue units. The multiple functional units opemte in parallel, once the
dependences are resolved. This alleviates the bottleneck in the execution stages of the instruction pipeline.
Internal Dara Forwarding The throughput of a pipelined processor can be further improved with internal
data forwarding among multiple functional units. In some eases, some memory-access operations can be
replaced by register transfer operations. The idea is described in Fig. 6.13.
A store-ioadforwarding is shown in Fig. 6.13s in which the loan‘ tJ].:k_'r'flItirJn (LI) R2. Ml frflm II'1BII101'1~'
to register R2 ean be replaced by the move operation (MOVE R2, RI) from register R1 to register R2.
Since register transfer is faster than memory access, this data forwarding will reduce memory traflle and
thus results in a shorter execution time. Similarly. load-loadfonvar-ding {Fig. 6.1315) eliminates the second
,.,,,.,-,,,-,,,,,,,,5,,,,.,,,,,,,,,,.,,,, _ 2,5
loud operation [LEI R2, M) and replaces it with the mow operation {MOVE R2, RI 1. Further discussion on
operand forwarding will be continued in Chapter 12.

M M M M
Access Aoooss Access Aces-ss
Unit Unit Unit Urit
R1 R2 R1 R2 R1 R2 R1 R2
— — _ — — _ _ —

STD M, R1 LD R2, M STD M, R1 MOVE R2, R1 LD R1, M LD R, M2 LD R1, M MOVE R2, R1


{a) Store-load forwarding to) Load-toad forwarding

Fig.-5.13 Internal ohm forwarding by rephclng memory~aooess operations with register transfer operations

I»)
lg Example 6.5 Implementing the dot-product operation with
internal data forwarding between a multiply
unit and an add unit
One can feed the output of a multiplier directly to the input of an adder (Fig. 6.14) for implementing the
following dot-product operation:

S = 2.-1, >< rt, (0.10)


1 |
Without internal data forwarding between the two functional units. the three instructions must be
sequentially executed in a looping structure [Fig. ti.I4a). With data forwarding, the output of thc multiplier is
fed directly into the input register R4 of the adder (Fig. 6.l4b). At the same time, the output of the multiplier
is also routed to register R3. internal data forwarding between the two functional units thus reduces the total
execution time through t.he pipelinod processor.

Hazard Avoidance The rerrd and it-rim of shared variables by different instructions in a pipeline may lead
to different results if these instructions are executed out of order. As illustrated in Fig. 6.15. three types of
logic .Fm:.nrr.fs are possible.
Consider two instructions land .1. Instruction J is assumed to logically follow instruction l according to
program order. If the actual execution order of these two instructions violates the program order. incorrect
results may be read or written, thereby producing hazards.
Hazards should be prevented before these instructions enter tl'|c pipeline, sueh as by holding instruction J
until the dependence on instruction 1 is resolved. We use the notation DU) and Rtli for the domain and range
of an instruction I.
The domain contains the input ser {such as operands in registers or in memory) to be used by instruction
I. The range corresponds to the onrpnr sor of instruction I. Listed below are conditions under which possible
hazards can occur:
246 ii ifldvwrcedfiumplmerahdnitedum
3; bl"

R1 2 2 R2

a 0 l1:R3<-{R1j|'{R2j
R3 51
‘Q l2:R4<-[R31
R4 i i
l3:R'54-4_‘R5)+[R4]

[a] Without data forwarding

3; bl"

R1 Z Z R2
Cr

@ 0:1
"“““‘“""“"
té :R4<- [R1]'{R2)
[5 : R5<- [R4] + [R5]
R3 1 1 R4 Z
[’1an-cl 5 can he emecmed
simuitaneously with lntamai
data fomardhg.

[bi with lntamzn zhta forwarding

Fig. 6.14 lrmernal dam 5:.-rwarding for lrr|piemen1:ing the dot-pm-duct operadm

[wrltejl Wnte

w [Read] [Write] w

[a] Read-afwr-Write [RAWJ hazard [t-J Write-after-Wrke [WAJHJ hazard

@ [Raadj
-
m

W M»
[c] Wrhe-after-Read [WAR] hazard

Fig. L15 Ftvsslble hazards between read and write up-eratlons in an ll'|5tfl.lCtlGfl pip-eHne{hstruc|:I-a|1 I is ahead
of HEDIIHCHOHJ in program order]
...,,“-.,t,.g.,.,‘,5.,,,._.,,,:_,,.,..,.c,.. _ M
R(I} A D(J] qr d for RAW hazard
Rm rt rr(.n qr n for waw hazard (5.111
D{1)n RU) e‘= d for WAR hazard
These conditionsare necessary but not sufficient. This means the hazard may not appear even ifone ormore
of the conditions exist. The RAW hazard corresponds to the flow dependence, WAR to the antidependenee,
and WAW to the output dependence introduced in Secfiou 2.}. The occurrence of a logic hazard depends on
the order in which the two instructions are executed. Chapter 12 discusses techniques to handle sueh hazards.

6.3.3 Dynamic Instruction Scheduling


In this section, we describe three methods for scheduling instructions through an instruction pipeline. The
static sc!redr.u'r'ng scheme is supported by an optimizing compiler. Qt-'mrrrrie scheratrfing is achieved using a
technique such as Tomasulo’s register-tagging scheme built in the IBM 3l50f9I , or the sea.-ebmmtng scheme
built in the CDC 6600 processor.
Static Scheduling Data dependences in tl sequence of instructions create interlocked relationships among
them. interlocking can he resolved through a compiler-based static scheduling approach. A compiler or a
postprocessor can be used to increase the separation between interlocked instructions.
Consider the execution of the following code fiagment. The mu!ripr'_1-‘ instruction cannot be initiated until
the prcc-cdirlg land is complete. This data dependence will stall the pipeline for three clock cycles since the
two fonds overlap by one cycle.

Stage delay: Instruction:


2 cycles Add R0. R1 FRO <— (R0) + (RI )1‘
I cycle Move R1. R5 {RI e— {R5]r'
2 cycles Load R2, lvl{rr} IRZ 4- [Memory (|'I]]r’
2 cycles Load R3. MLB) IR! <— [Memory (,3))r'
3 cycles Multiply R2, R3 IRE <— {R2} >< (R31!

The two lmrriv. since they are independent of the odd and move. can be moved ahead to increase the
spacing between them and the rrruln}u.'__v instruction. The following program is obtained after this modification:
Load R2, M(rr) 2 to 3 cycles
Load R3, M (B) 2 cycles due to overlapping
Add R0, RI 2 cycles
Move RI, R5 l cycle
Multiply R2, R3 3 cycles

Through this code reanangement, the data dependences and program semantics are preserved, and the
m1rlripl_v can be initiated without delay. While the operands are being loaded from memory cells rr and )3 into
registers R2 and R3, the two instructions mid and mot-'e consu.rne three cycles and thus pipeline stalling is
avoided.
Ff» M nf Grow Hill t'4.\m;rwrn1
243 i Advanced Comprrterwfirtltitecture

Tomasulo’: Ji'l.lgo.r'ithm This hardware dependence-resolution scheme was first implemented with multiple
floating-point units of the IBM 36{lf9I processor. The hardware platform is abstracted ir1 Fig. 6.12. For thc
Model 9] processor, three RSs were used in a floating-point adder and two pairs in a floating—point multiplier.
The scheme resolved resource conflicts as well as data dependences using register tagging to allocate or
deallocatc the source and destination registers.
An issued instruction whose operands are not available is forwarded to an RS associated with the finnctional
unit it will use. It waits until its data dependences have been resolved and its operands become available.
The dependence is resolved by monitoring the result bus (called eormrrmr dam fins in Model 91). ‘When all
operands for an instruction are available, it is dispatched to the functional unit for execution. All working
registers are tagged. ll‘ a source register is busy when an instruction reaches the issue stage, the tag for t.he
source register is forwarded to an RS. “ho the register data becomes available, it also reaches the RS which
has the same tag.

I»)
El Example 6.6 Tomasulo's algorithm for dynamic
instruction scheduling
Tomasulo's algorithm was applied to work with processors having a few floating-point registers. In the ease
of Model 91, only four registers were available. liigure 6.1641 shows a minimum-register machine code for
computing X = Y + Z and A = H >< C. The pipeline liming with Tomc.sulo‘s algorithm appears in Fig. 6.16-b.
Herc, the total execution time is 13 cycles, cotmtirtg fiorn cycle 4 to cycle 15 by ignoring the pipeline stm11.rp
and draining times.
Memory is treated as a special functional unit. When an instruction has completed execution, the result
(along with its tag} appears on t.he result bus. The registers as well as the RSs monitor the result bus and
update their contents (and ready/busy bits) when a matching tag is found. Details of the algorithm can be
found in the original paper by Tomasulo [I967].

-iv Time
12 s ti 5 s 1 a o1o11121s1=t1s1s1r1a1a
R1 r-Mom[Y]|
R2 <-Mem[Z]
R3 4-[R1i+[R2]
Mernifx] s—[R3] F |-<
R1 <-Mem[B] -1
R2 <-Mem[Cj|
R3 <-[R1]'[R2]
Morn[A] t-[R3] "rr U -5
[a] Minimum-register machine code lb) Tho pipeline schedule

Fig. $.16 Dynamic lnsrru-ecion scheduling usir|gTornasr.|lo‘s algorithm on the processor in Fig. 6.11 (Co urresy of
James Smith; reprinted with pmnissi-on from IEEE Cornprrtofluly 1989'}

CDC Scar-eboor-ding The CDC 6600 was anearly high-performance compute-rthatused dynamic instruction
scheduling hardware. Figure d.l'Ia shows a CDC 6600-like processor, in which multiple fttnetional units
,=,,....,..-,.g....,5.,,,......,.......,.. _ M
appeared as multiple execution pipelines. Parallel units allowed instructions to complete out of the original
program order. The processor had instruction buffers for each execution unit. Instructions were issued to
available functional units regardless of whether register input data was available.

Wilto-

Instruction Wilto-
fmch Deco-do Issue Exec ute n. Exiecuto back
D I E E
F W

I/' -
I
I

\\ Exec ute I" Exiecuia vii:


E E W

(a] A CDC B60-Osllke processor

irflmo
12 3 4 5 6 7 B 91fl1112131¢151fi171B19
R1 4-Mem[Y] E
R2 <-Mem[Z]
R3 <-[Ft1}+[R2i
Msm[x] <-[R3]
R4 <-Mom[BJ
R5 <-fl.-'lom[-C]
R6 4-[R4-j|'[R5j|
lu1em(Aj| <-[RB]

[ls] T he Improved schedu lo from Flg. B.9b

Fig.-i.1'I' Hardware secret:-carding for dynainic instruction scheduling (Courtesy ofjames Srn+d1;reprh11:od with
permission from IEEE Colnputeltjuly 1989]

Thc instruction would then wait in a buffer for its data to bc produced by other instructions. To control
the correct routing of data between execution units and registers, the CDC 6600 used a centralized control
unit known as the scorehmrn‘. This unit kept track of the registers needed by instructions waiting for the
various functional units. Vflicn all registers had valid data, the scoreboard enabled thc instruction cxccution.
Similarly, when a functional unit finished, it signaled the scoreboard to release the resources.

V)
g Example 6.1 Pipelined operations using hardware
scoreboarding on the CDC 6600-like
processor flames Smith, 1989)
Figure 6.l?b shows the pipeline schedule based on scoreboard issue logic. The schedule corresponds to the
F?» Mtfiruw Hit |:'i».-rqtwtnw
I50 i Advanced Computer Architecture

execution of the same machine code for X = Y + Z and A = B × C. The pipeline latencies are the same as those resulting from Tomasulo's algorithm. The add instruction is issued to its functional unit before its registers are ready. It then waits for its input register operands.
The scoreboard routes the register values to the adder unit when they become available. In the meantime, the issue stage is not blocked, so other instructions can bypass the blocked add. It takes 13 clock cycles to perform the operations. Details of the CDC scoreboarding can be found in the book by Thornton (1970).

The scoreboard is a centralized control logic which keeps track of the status of registers and multiple functional units. When functional units generate new results, some data dependences can be resolved and thus a higher degree of parallelism can be explored with scoreboarding. Scoreboarding in later microprocessors like the MC88000 used forwarding logic and register tagging. In a way, scoreboarding implements a kind of data-driven mechanism to achieve efficient computations.
Dynamic instruction scheduling was implemented only in high-end mainframes or supercomputers in the past. Most microprocessors used static scheduling. But the trend has changed over the last two decades: RISC and superscalar processors are today built with hardware support for dynamic scheduling at runtime. Significant trace-driven data are needed to optimize the pipelined processor design. Toward this goal, processor and compiler designers have to work together to achieve an efficient design. Multiple-issue instruction pipelines, which are much more complicated than single-issue instruction pipelines, will be studied in Section 6.5.
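As a rough software sketch of scoreboard issue logic, the fragment below tracks per-register ready bits and per-unit busy bits, issuing an instruction to a free unit even when its operands are not yet ready. The data structures are assumptions made for illustration; the real CDC 6600 scoreboard also tracked WAR/WAW hazards, which are omitted here.

    # Simplified scoreboard issue logic: an instruction is issued to a free
    # functional unit even if its operands are not yet ready; it then waits
    # at the unit while the scoreboard forwards register values as they
    # become available. WAR/WAW handling of the real CDC 6600 is omitted.

    ready = {r: True for r in ("R1", "R2", "R3", "R4", "R5", "R6")}
    unit_free = {"load": True, "add": True, "multiply": True}

    def try_issue(unit, dst, srcs):
        if not unit_free[unit]:
            return False              # structural hazard: issue stalls
        unit_free[unit] = False       # unit now holds the waiting instruction
        ready[dst] = False            # destination register marked busy
        return True                   # instruction waits at the unit for srcs

    def operands_ready(srcs):
        return all(ready[r] for r in srcs)   # checked each cycle by the unit

    def finish(unit, dst):
        ready[dst] = True             # scoreboard releases the result register
        unit_free[unit] = True        # and frees the functional unit

    try_issue("add", "R3", ["R1", "R2"])    # issues even if R1/R2 still loading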

6.3.4 Branch Handling Techniques


The performance of pipelined processors is limited by data dependences and branch instructions. In previous sections, we have studied the effects of data dependence. In this subsection, we study the effects of branching. Various branching strategies are reviewed. The evaluation of branching strategies can be performed either on a specific pipeline architecture using trace data, or by applying analytic models. We provide below a simple performance analysis. For a more detailed treatment of the subject, readers are referred to the book Branch Strategy Taxonomy and Performance Models by Harvey Cragon (1992).

Effect of Branching Three basic terms are introduced below for the analysis of branching effects: The action of fetching a nonsequential or remote instruction after a branch instruction is called branch taken. The instruction to be executed after a branch taken is called a branch target. The number of pipeline cycles wasted between a branch taken and the fetching of its branch target is called the delay slot, denoted by b. In general, 0 ≤ b ≤ k − 1, where k is the number of pipeline stages.
When a branch is taken, all the instructions following the branch in the pipeline become useless and will be drained from the pipeline. This implies that a branch taken causes the pipeline to be flushed, losing a number of useful cycles.
These terms are illustrated in Fig. 6.18, where a branch taken causes I_b+1 through I_b+k−1 to be drained from the pipeline. Let p be the probability of a conditional branch instruction in a typical instruction stream and q the probability of a successfully executed conditional branch instruction (a branch taken). Typical values of p = 20% and q = 60% have been observed in some programs.
The penalty paid by branching is equal to pqnbτ because each branch taken costs bτ extra pipeline cycles. Based on Eq. 6.4, we thus obtain the total execution time of n instructions, including the effect of branching, as follows:

T_eff = kτ + (n − 1)τ + pqnbτ
Modifying Eq. 6.9, we define the following effective pipeline throughput with the influence of branching:

H_eff = n / [kτ + (n − 1)τ + pqnbτ] = nf / (k + n − 1 + pqnb)  (6.12)

When n → ∞, the tightest upper bound on the effective pipeline throughput is obtained when b = k − 1:

H*_eff = f / [pq(k − 1) + 1]  (6.13)

When p = q = 0 (no branching), the above bound approaches the maximum throughput f = 1/τ, same as in Eq. 6.2. Suppose p = 0.2, q = 0.6, and b = k − 1 = 7. We define the following performance degradation factor:

D = (f − H*_eff) / f = 1 − 1/[pq(k − 1) + 1] = pq(k − 1) / [pq(k − 1) + 1] = 0.84/1.84 = 0.46  (6.14)

The above analysis implies that the pipeline performance can be degraded by 46% with branching when the instruction stream is sufficiently long. This analysis demonstrates the high degree of performance degradation caused by branching in an instruction pipeline.
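The degradation analysis is easy to reproduce numerically. The short sketch below simply evaluates Eqs. 6.12 through 6.14; the function names are ours, not part of the original treatment.

    # Evaluate the effective throughput and degradation factor of Eqs. 6.12-6.14.

    def effective_throughput(n, k, p, q, b, f=1.0):
        # H_eff = n*f / (k + n - 1 + p*q*n*b), where f = 1/tau
        return n * f / (k + n - 1 + p * q * n * b)

    def degradation(p, q, k):
        # D = pq(k-1) / (pq(k-1) + 1), the n -> infinity bound with b = k - 1
        x = p * q * (k - 1)
        return x / (x + 1)

    print(degradation(0.2, 0.6, 8))   # 0.84/1.84 = 0.4565..., i.e. about 46%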
[Figure 6.18: (a) a k-stage pipeline executing an instruction flow; (b) an instruction stream containing a branch taken, where a delay slot of length k − 1 separates the branch taken I_b from its branch target I_t in the new instruction flow.
Captions: I_b = branch taken; I_t = branch target; k = no. of pipeline stages; τ = clock cycle (stage delay); b = delay slot size]

Fig. 6.18 The decision of a branch taken at the last stage of an instruction pipeline causes b ≤ k − 1 previously loaded instructions to be drained from the pipeline

Branch Prediction Branches can be predicted either statically, based on branch code types, or dynamically, based on branch history during program execution. The probability of branch with respect to a particular branch instruction type can be used to predict branch. This requires collecting the frequency and probabilities of branch taken and branch types across a large number of program traces. Such a static branch strategy may not be very accurate.

The static prediction direction (taken or not taken) can even be wired into the processor. According to past experience, the best performance is given by predicting taken. This results from the fact that most conditional branch instructions are taken in program execution. The wired-in static prediction cannot be changed once committed to the hardware. However, the scheme can be modified to allow the compiler to select the direction of each branch on a semi-static prediction basis.
A dynamic branch strategy works better because it uses recent branch history to predict whether or not the branch will be taken next time when it occurs. To be accurate, one may need to use the entire history of the branch to predict the future choice. This is infeasible to implement. Therefore, most dynamic prediction is determined with limited recent history, as illustrated in Fig. 6.19.

[Figure 6.19: (a) branch target buffer organization, with each entry holding the branch instruction address, branch prediction statistics, and the branch target address; (b) a typical state transition diagram tracking the last two branch outcomes.
Captions: T = taken branch; N = not-taken branch; NN = last two branches not taken; NT = last branch not taken and previous taken; TT = both last two branches taken; TN = last branch taken and previous not taken]

Fig. 6.19 Branch history buffer and a state transition diagram used in dynamic branch prediction (Courtesy of Lee and Smith, IEEE Computer, 1984)

Cragon (1992) classified dynamic branch strategies into three major classes: One class predicts the branch direction based upon information found at the decode stage. The second class uses a cache to store target addresses at the stage where the effective address of the branch target is computed. The third scheme uses a cache to store target instructions at the fetch stage.
Dynamic prediction demands additional hardware to keep track of the past behavior of the branch instructions at run time. The amount of history recorded should be relatively small. Otherwise, the prediction logic becomes too costly to implement.
Lee and Smith (1984) suggested the use of a branch target buffer (BTB) to implement branch prediction (Fig. 6.19a). The BTB is used to hold recent branch information including the address of the branch target used. The address of the branch instruction locates its entry in the BTB.
A state transition diagram (Fig. 6.19b) has been used by Lee and Smith for tracking the last two outcomes at each branch instruction in a given program. The BTB entry contains the information which will guide the prediction. Prediction information is updated upon completion of the current branch.
The BTB can be extended to store not only the branch target address but also the target instruction itself, in order to allow zero delay in converting conditional branches to unconditional branches. The taken (T) and not-taken (N) labels in the state diagram correspond to actual program behavior. Further discussion on this topic will be found in Chapter 12.
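One common concrete realization of this two-outcome history scheme is a 2-bit saturating counter stored with each BTB entry, so that the prediction flips only after two consecutive mispredictions. The sketch below is an illustrative model, not any specific processor's design; the dictionary stands in for the associative BTB hardware, and all names are ours.

    # A two-bit saturating-counter predictor per branch address, one common
    # realization of the Lee-Smith two-outcome history scheme. A dictionary
    # models the BTB; real hardware uses a small associative table.

    btb = {}   # branch_addr -> [2-bit counter 0..3, target_addr]

    def predict(addr):
        entry = btb.get(addr)
        if entry is None:
            return None                 # no history: predict not-taken
        counter, target = entry
        return target if counter >= 2 else None   # states 2,3 predict taken

    def update(addr, taken, target):
        counter, _ = btb.get(addr, [1, target])
        # Saturate at 0 and 3; one wrong outcome does not flip the prediction.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        btb[addr] = [counter, target]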
Delayed Branches Examining the branch penalty, we realize that the branch penalty would be reduced significantly if the delay slot could be shortened or minimized to a zero penalty. The purpose of delayed branches is to make this possible, as illustrated in Fig. 6.20.
The idea was originally used to reduce the branching penalty in coding microinstructions. A delayed branch of d cycles allows at most d − 1 useful instructions to be executed following the branch taken. The execution of these instructions should be independent of the outcome of the branch instruction. Otherwise, a zero branching penalty cannot be achieved.

[Figure 6.20: pipeline timing diagrams of delayed branches.
(a) A delayed branch for 2 cycles when the branch condition is resolved at the decode stage
(b) A delayed branch for 3 cycles when the branch condition is resolved at the execute stage
(c) A delayed branch for 4 cycles when the branch condition is resolved at the store stage]

Fig. 6.20 The concept of delayed branch by moving independent instructions or NOP fillers into the delay slot of a four-stage pipeline

The technique is similar to that used for software interlocking. NOPs can be used as fillers if needed. The probability of moving one instruction (d = 2 in Fig. 6.20a) into the delay slot is greater than 0.6, that of moving two instructions (d = 3 in Fig. 6.20b) is about 0.2, and that of moving three instructions (d = 4 in Fig. 6.20c) is less than 0.1, according to some program trace results.

Example 6.8 A delayed branch with code motion into a delay slot
Code motion across branches can be used to achieve a delayed branch, as illustrated in Fig. 6.21. Consider the execution of a code fragment in Fig. 6.21a. The original program is modified by moving the useful instruction I1 into the delay slot after the branch instruction I3.

(a) Original program:           (b) Moving useful instructions into the delay slot:
I1. Load   R1, A                I2. Dec    R3, 1
I2. Dec    R3, 1                I3. BrZero R3, I5
I3. BrZero R3, I5               I1. Load   R1, A
I4. Add    R2, R4               I4. Add    R2, R4
I5. Sub    R5, R6               I5. Sub    R5, R6
I6. Store  R5, B                I6. Store  R5, B

Fig. 6.21 Code motion across a branch to achieve a delayed branch with a reduced penalty to pipeline performance

In case the branch is not taken, the execution of the modified program produces the same results as the original program. In case the branch is taken in the modified program, execution of the delayed instructions I1 and I5 is needed anyway.
In general, data dependence between instructions moving across the branch and the remaining instructions being scheduled must be analyzed. Since instruction I1 is independent of the remaining instructions, leaving it in the delay slot will not create logic hazards or data dependences.

Sometimes NOP fillers can be inserted in the delay slot if no useful instructions can be found. However, inserting NOP fillers does not save any cycles in the delayed branch operation. From the above analysis, one can conclude that delayed branching may be more effective in short instruction pipelines with about four stages. Delayed branching has been built into some RISC processors, including the MIPS R4000 and Motorola MC88110.

ARITHMETIC PIPELINE DESIGN


Pipelining techniques can be applied to speed up numerical arithmetic computations. We start with a review of arithmetic principles and standards. Then we consider arithmetic pipelines with fixed functions.
A fixed-point multiply pipeline design and the MC68040 floating-point unit are used as examples to illustrate the design techniques involved. A multifunction arithmetic pipeline is studied with the TI-ASC arithmetic processor as an example.

6.4.1 Computer Arithmetic Principles


In a digital computer, arithmetic is performed with finite precision due to the use of fixed-size memory words or registers. Fixed-point or integer arithmetic offers a fixed range of numbers that can be operated upon. Floating-point arithmetic operates over a much increased dynamic range of numbers.
In modern processors, fixed-point and floating-point arithmetic operations are very often performed by separate hardware on the same processor chip.
Finite precision implies that numbers exceeding the limit must be truncated or rounded to provide a precision within the number of significant bits allowed. In the case of floating-point numbers, exceeding the exponent range means error conditions, called overflow or underflow. The Institute of Electrical and Electronics Engineers (IEEE) has developed standard formats for 32- and 64-bit floating-point numbers known as the IEEE 754 Standard. This standard has been adopted for most of today's computers.
Fixed-Point Operations Fixed-point numbers are represented internally in machines in sign-magnitude, one's complement, or two's complement notation. Most computers use the two's complement notation because of its unique representation of all numbers (including zero). One's complement notation introduces a second zero representation called dirty zero.
Add, subtract, multiply, and divide are the four primitive arithmetic operations. For fixed-point numbers, the add or subtract of two n-bit integers (or fractions) produces an n-bit result with at most one carry-out.
The multiplication of two n-bit numbers produces a 2n-bit result which requires the use of two memory words or two registers to hold the full-precision result.
The division of an n-bit number by another may create an arbitrarily long quotient and a remainder. Only an approximate result is expected in fixed-point division with rounding or truncation. However, one can expand the precision by using a 2n-bit dividend and an n-bit divisor to yield an n-bit quotient.

Flnnringflininr Number: A floating-point number Xis represented by apair (m. e], when: m is thc ninnnsm
(o1'_,r'r.ocrion) and c is thc r.’.1]'J(J!'lr.’Nf with an implied brisc {or r.:idr'Jt',l'. The algebraic value is represented as X =
m >< r”. The sign of X can be embedded in the mantissa.

Example 6.9 The IEEE 754 floating-point standard

A 32-bit floating-point number is specified in the IEEE 754 Standard as follows:

[Figure: 32-bit format — bit 0: sign s; bits 1 through 8: exponent e; bits 9 through 31: mantissa m]

A binary base is assumed with r = 2. The 8-bit exponent e field uses an excess-127 code. The dynamic range of e is (−127, 128), internally represented as [0, 255]. The sign s and the 23-bit mantissa field m form a 25-bit sign-magnitude fraction, including an implicit or "hidden" 1 bit to the left of the binary point. Thus the complete mantissa actually represents the value 1.m in binary.
This hidden bit is not stored with the number. If 0 < e < 255, then a nonzero normalized number represents the following algebraic value:

X = (−1)^s × 2^(e−127) × (1.m)  (6.15)

When e = 255 and m ≠ 0, a not-a-number (NaN) is represented. NaNs can be caused by dividing a zero by a zero or taking the square root of a negative number, among many other nondeterminate cases. When e = 255 and m = 0, an infinite number X = (−1)^s ∞ is represented. Note that +∞ and −∞ are represented differently.
When e = 0 and m ≠ 0, the number represented is X = (−1)^s × 2^(−126) × (0.m). When e = 0 and m = 0, a zero is represented as X = (−1)^s × 0. Again, +0 and −0 are possible.
The 64-bit (double-precision) floating point can be defined similarly using an excess-1023 code in the exponent field and a 52-bit mantissa field. A number which is nonzero, finite, non-NaN, and normalized, has the following value:

X = (−1)^s × 2^(e−1023) × (1.m)  (6.16)

Special rules are given in the standard to handle overflow or underflow conditions. Interested readers may check the published IEEE standards for details.
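The field layout of Eq. 6.15 can be checked directly on any machine with IEEE 754 arithmetic. The sketch below unpacks the sign, biased exponent, and fraction of a 32-bit number using Python's standard struct module and reassembles the value; the function name is ours.

    import struct

    def decode_ieee754_single(x):
        # Reinterpret the 32-bit pattern of a float: 1-bit sign,
        # 8-bit excess-127 exponent, 23-bit fraction (hidden 1 not stored).
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        s = bits >> 31
        e = (bits >> 23) & 0xFF
        m = bits & 0x7FFFFF
        if 0 < e < 255:                 # normalized: (-1)^s * 2^(e-127) * 1.m
            return (-1) ** s * 2.0 ** (e - 127) * (1 + m / 2.0 ** 23)
        if e == 0:                      # denormalized: (-1)^s * 2^(-126) * 0.m
            return (-1) ** s * 2.0 ** (-126) * (m / 2.0 ** 23)
        return float("nan") if m else (-1) ** s * float("inf")

    print(decode_ieee754_single(-6.5))  # -6.5, i.e. -1.625 x 2^2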
Floating-Point Operations The four primitive arithmetic operations are defined below for a pair of floating-point numbers represented by X = (mx, ex) and Y = (my, ey). For clarity, we assume ex ≤ ey and base r = 2.

X + Y = (mx × 2^(ex−ey) + my) × 2^ey  (6.17)
X − Y = (mx × 2^(ex−ey) − my) × 2^ey  (6.18)
X × Y = (mx × my) × 2^(ex+ey)  (6.19)
X ÷ Y = (mx ÷ my) × 2^(ex−ey)  (6.20)
The above equations clearly identify the number of arithmetic operations involved in each floating-point function. These operations can be divided into two halves: One half is for exponent operations such as comparing their relative magnitudes or adding/subtracting them; the other half is for mantissa operations, including four types of fixed-point operations.
Floating-point units are ideal for pipelined implementation. The two halves of the operations demand almost twice as much hardware as that required in a fixed-point unit. Arithmetic shifting operations are needed for equalizing the two exponents before their mantissas can be added or subtracted.
Shifting a binary fraction m to the right k places corresponds to the weighting m × 2^(−k), and shifting k places to the left corresponds to m × 2^k. In addition, normalization of a floating-point number also requires left shifts to be performed.
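The alignment, add, and normalize steps just described map naturally onto software. The sketch below is a simplified model assuming positive, normalized 24-bit integer mantissas; it handles only the carry-out case of normalization and omits rounding.

    def fp_add(mx, ex, my, ey):
        # Add (mx, ex) and (my, ey), where each value is m * 2**e and m is a
        # positive integer mantissa. Mirrors the three floating-point adder
        # stages: align, add, normalize.
        if ex < ey:                  # stage 1: compare exponents; make X larger
            mx, ex, my, ey = my, ey, mx, ex
        my >>= (ex - ey)             # alignment shift (loses low bits, as in hardware)
        m, e = mx + my, ex           # stage 2: fraction addition (CPA in hardware)
        while m >= (1 << 24):        # stage 3: renormalize on carry-out,
            m >>= 1                  # assuming a 24-bit normalized mantissa
            e += 1
        return m, e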
Elementary Functions Elementary functions include trigonometric, exponential, logarithmic, and other transcendental functions. Truncated polynomials or power series can be used to evaluate the elementary functions, such as sin x, ln x, e^x, cosh x, tan⁻¹ x, √x, etc. Interested readers may refer to the book by Hwang (1979) for details of computer arithmetic functions and their hardware implementation.
It should be noted that computer arithmetic can be implemented by hardwired logic circuitry as well as by table lookup using fast memory. Frequently used constants and special function values can also be generated by table lookup.

6.4.2 Static Arithmetic Pipelines


Most of to-day’s arithmetic pipelines are designed to perform fixed firnctions. These m-irhnwri'r:-’logir- rmirs
(.-'tLUs) perform fixed-point and floating-point operations separately. The fixed-point unit is also called the
integer |.m.it. The floating-point unit can be built either as part of the central processor or on a separate
coprocessor.
These arithmetic units perfonn scalar operations involving one pair of operands at a time. The pipelining
in scalar arithmetic pipelines is controlled by sofizware loops. Vector arithmetic units can be designed with
pipeline hardware directly under firmware or hardwired control.
Scalar and vector arithmetic pipelines differ mainly in the areas of register files and control mechanisms
involved. Vector hardware pipelines are often built as add-on options to a scalar processor or as an attached
processor driven by a control processor. Both scalar and vector processors are used in modern supercomputers.
Arithmetic Pipeline Stages Depending on the function to be implemented, different pipeline stages in an arithmetic unit require different hardware logic. Since all arithmetic operations (such as add, subtract, multiply, divide, squaring, square rooting, logarithm, etc.) can be implemented with the basic add and shifting operations, the core arithmetic stages require some form of hardware to add and to shift.
For example, a typical three-stage floating-point adder includes a first stage for exponent comparison and equalization which is implemented with an integer adder and some shifting logic; a second stage for fraction addition using a high-speed carry lookahead adder; and a third stage for fraction normalization and exponent readjustment using a shifter and another addition logic.
Arithmetic or logical shifts can be easily implemented with shift registers. High-speed addition requires either the use of a carry-propagation adder (CPA) which adds two numbers and produces an arithmetic sum as shown in Fig. 6.22a, or the use of a carry-save adder (CSA) to add three input numbers and produce one sum output and a carry output as exemplified in Fig. 6.22b.
In a CPA, the carries generated in successive digits are allowed to propagate from the low end to the high end, using either ripple carry propagation or some carry lookahead technique.
In a CSA, the carries are not allowed to propagate but instead are saved in a carry vector. In general, an n-bit CSA is specified as follows: Let X, Y, and Z be three n-bit input numbers, expressed as X = (x_n−1, x_n−2, ..., x1, x0) and so on. The CSA performs bitwise operations simultaneously on all columns of digits to produce two n-bit output numbers, denoted as S^b = (0, S_n−1, S_n−2, ..., S1, S0) and C = (C_n, C_n−1, ..., C1, 0).
Note that the leading bit of the bitwise sum S^b is always a 0, and the tail bit of the carry vector C is always a 0. The input-output relationships are expressed below:

S_i = x_i ⊕ y_i ⊕ z_i  (6.21)
C_i+1 = x_i y_i ∨ y_i z_i ∨ z_i x_i  (6.22)

[Figure 6.22: (a) An n-bit carry-propagate adder (CPA), which allows either carry propagation or applies the carry-lookahead technique, producing S = A + B; (b) An n-bit carry-save adder (CSA), where S^b is the bitwise sum of X, Y, and Z, and C is a carry vector generated without carry propagation between digits, so that S = S^b + C = X + Y + Z]

Fig. 6.22 Distinction between a carry-propagate adder (CPA) and a carry-save adder (CSA)

for r‘ - 0.1, 2, .. ..n — 1, where 6' is the exclusive OR and v is the logical OR operation. Note thatthe arithmetic
sum of three input numbers, i.e., S = X+ Y + Z, is obtained by the two output numbers, i.e., S = +
C, using a CPA. We use the CPA and CSAS to implement the pipeline stages of a fixed-point multiply unit
as follows.
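Equations 6.21 and 6.22 are easy to verify with a few lines of bitwise code: for any three inputs, the bitwise sum plus the shifted carry vector equals the arithmetic sum. The sketch below is an illustrative check, with names of our own choosing.

    def carry_save_add(x, y, z):
        # Eq. 6.21: bitwise sum; Eq. 6.22: carry vector, shifted one place left.
        sb = x ^ y ^ z                               # S_i = x_i XOR y_i XOR z_i
        c = ((x & y) | (y & z) | (z & x)) << 1       # C_{i+1} = x_i y_i OR y_i z_i OR z_i x_i
        return sb, c

    x, y, z = 0b1011, 0b0110, 0b1101
    sb, c = carry_save_add(x, y, z)
    assert sb + c == x + y + z        # a CPA would perform this final addition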
Multiply Pipeline Design Consider as an example the multiplication of two 8-bit integers A × B = P, where P is the 16-bit product. This fixed-point multiplication can be written as the summation of eight partial products as shown below: P = A × B = P0 + P1 + P2 + ... + P7, where × and + are arithmetic multiply and add operations, respectively.
[Layout: the binary long multiplication of A by B, showing the eight shifted partial products P0 through P7, whose sum is the 16-bit product P]
Note that the partial product Pj is obtained by multiplying the multiplicand A by the jth bit of B and then shifting the result j bits to the left for j = 0, 1, 2, ..., 7. Thus Pj is (8 + j) bits long with j trailing zeros. The summation of the eight partial products is done with a Wallace tree of CSAs plus a CPA at the final stage, as shown in Fig. 6.23.
The first stage (S1) generates all eight partial products, ranging from 8 bits to 15 bits, simultaneously. The second stage (S2) is made up of two levels of four CSAs, and it essentially merges eight numbers into four numbers ranging from 13 to 15 bits. The third stage (S3) consists of two CSAs, and it merges four numbers from S2 into two 16-bit numbers. The final stage (S4) is a CPA, which adds up the last two numbers to produce the final product P.
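The stage structure of Fig. 6.23 can be mimicked in software: generate the eight shifted partial products (S1), reduce them three at a time with CSAs (S2 and S3), and finish with one carry-propagate addition (S4). The sketch below is an illustrative model of the reduction, not a gate-level design; it redefines the carry_save_add helper so it stands alone.

    def carry_save_add(x, y, z):
        # One CSA level: bitwise sum plus a left-shifted carry vector.
        return x ^ y ^ z, ((x & y) | (y & z) | (z & x)) << 1

    def multiply_8bit(a, b):
        # S1: generate eight partial products, P_j = (bit j of B) * (A << j).
        pp = [((b >> j) & 1) * (a << j) for j in range(8)]
        # S2/S3: Wallace-style reduction, three numbers -> two per CSA,
        # repeated until only two numbers remain.
        while len(pp) > 2:
            sb, c = carry_save_add(pp[0], pp[1], pp[2])
            pp = pp[3:] + [sb, c]
        # S4: final carry-propagate addition yields the 16-bit product.
        return pp[0] + pp[1]

    assert multiply_8bit(0b10110101, 0b10010011) == 0b10110101 * 0b10010011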

[Figure: the four-stage multiply pipeline — S1 generates the eight partial products with multiplier recoding logic, S2 and S3 reduce them through levels of CSAs, and S4 is the final CPA producing P = A × B.
Captions: CSA = carry-save adder; CPA = carry-propagate adder]

Fig. 6.23 A pipeline unit for fixed-point multiplication of 8-bit integers (The number along each line indicates the line width.)

For a maximum width of 16 bits, the CPA is estimated to need four gate levels of delay. Each level of the CSA can be implemented with a two-gate-level logic. The delay of the first stage (S1) also involves two gate levels. Thus all the pipeline stages have an approximately equal amount of delay.

The matching of stage delays is crucial to the determination of the number of pipeline stages, as well as the clock period (Eq. 6.1). If the delay of the CPA stage can be further reduced to match that of a single CSA level, then the pipeline can be divided into six stages with a clock rate twice as fast. The basic concepts can be extended to operands with a larger number of bits, as we see in the example below.

Example 6.10 The floating-point unit in the Motorola MC68040

Figure 6.24 shows the design of a pipelined floating-point unit built as an on-chip feature in the Motorola MC68040 processor.
[Figure: the three-stage MC68040 floating-point unit — a mantissa section (64-bit registers, a 64 × 8 multiplier, a 67-bit add unit, and a 67-bit barrel shifter) and an exponent section (a 16-bit bus, an exponent adder, and an incrementer) connected across stages 1 through 3]

Fig. 6.24 Pipelined floating-point unit of the Motorola MC68040 processor (Courtesy of Motorola, Inc., 1991)

This arithmetic pipeline has three stages. The mantissa section and exponent section are essentially two
,.,,,,-.,,-,,,,.,,5,,,,._.,,,,,,.,,T.,,,, _ H,
separate pipelines. The mantissa section can perform floating-point add or multiply operations, either single-precision (32 bits) or double-precision (64 bits).
In the mantissa section, stage 1 receives input operands and returns with computation results; 64-bit registers are used in this stage. Note that all three stages are connected to two 64-bit data buses. Stage 2 contains the array multiplier (64 × 8) which must be repeatedly used to carry out a long multiplication of the two mantissas.
The 67-bit adder performs the addition/subtraction of two mantissas; the barrel shifter is used for normalization. Stage 3 contains registers for holding results before they are loaded into the register file in stage 1 for subsequent use by other instructions.
On the exponent side, a 16-bit bus is used between stages. Stage 1 has an exponent adder for comparing the relative magnitude of two exponents. The result of stage 1 is used to equalize the exponents before mantissa addition can be performed. Therefore, a shift count (from the output of the exponent adder) is sent to the barrel shifter for mantissa alignment.
After normalization of the final result (getting rid of leading zeros), the exponent needs to be readjusted in stage 3 using another adder. The final value of the resulting exponent is fed from the register in stage 3 to the register file in stage 1, ready for subsequent usage.

Convergence Division One technique for division involves repeated multiplications. Mantissa division is carried out by a convergence method. This convergence division obtains the quotient Q = M/D of two normalized fractions 0.5 ≤ M < D < 1 in two's complement notation by performing two sequences of chain multiplications as follows:

Q = (M × R1 × R2 × ... × Rk) / (D × R1 × R2 × ... × Rk)

where the successive multipliers

R_i = 1 + δ^(2^(i−1)) for i = 1, 2, ..., k and δ = 1 − D

The purpose is to choose R_i such that the denominator D^(k) = D × R1 × R2 × ... × Rk → 1 for a sufficient number of k iterations, and then the resulting numerator M × R1 × R2 × ... × Rk → Q.
Note that the multiplier R_i can be obtained by finding the two's complement of the previous chain product D^(i−1) = D × R1 × ... × R_i−1 = 1 − δ^(2^(i−1)), because 2 − D^(i−1) = R_i. The reason why D^(i) → 1 for large k is that

D^(i) = D(1 + δ)(1 + δ^2)(1 + δ^4) ... (1 + δ^(2^(i−1)))
      = (1 − δ)(1 + δ)(1 + δ^2)(1 + δ^4) ... (1 + δ^(2^(i−1)))
      = 1 − δ^(2^i)  for i = 1, 2, ..., k  (6.23)

Since 0 < δ = 1 − D ≤ 0.5, δ^(2^i) → 0 as i becomes sufficiently large, say, i = k for some k; thus D^(k) = 1 − δ^(2^k) → 1 for large k. The end result is

Q = M × (1 + δ) × (1 + δ^2) × ... × (1 + δ^(2^(k−1)))  (6.24)

The above two sequences of chain multiplications are carried out alternately between the numerator and denominator through the pipeline stages. To summarize, in this technique division is carried out by repeated multiplications. Thus divide and multiply can share the same hardware pipeline.
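The iteration is easy to trace in software. The sketch below carries out the chain multiplications of Eqs. 6.23 and 6.24 for two normalized fractions; the function name and the default iteration count are our own choices.

    def convergence_divide(m, d, k=5):
        # Approximate Q = M/D for 0.5 <= M < D < 1 by repeated multiplication.
        # Each multiplier R_i = 2 - D^(i-1) drives the denominator toward 1
        # (Eq. 6.23), so the numerator converges to the quotient (Eq. 6.24).
        num, den = m, d
        for _ in range(k):
            r = 2.0 - den            # two's complement of the chain product
            num *= r
            den *= r                 # den = 1 - delta**(2**i) after i steps
        return num

    print(convergence_divide(0.6, 0.8))   # ~0.75, since 0.6/0.8 = 0.75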

Example 6.11 The IBM 360/Model 91 floating-point unit design
In the history of building scientific computers, the IBM 360 Model 91 was certainly a milestone. Many of the pipeline design features introduced in previous sections were implemented in this machine. Therefore, it is worth the effort to examine the architecture of Model 91. In particular, we describe how floating-point add and multiply/divide operations were implemented in this machine.
As shown in Fig. 6.25, the floating-point execution unit in Model 91 consisted of two separate functional pipelines: the add unit and the multiply/divide unit, which could be used concurrently. The former was a two-stage pipeline, and the latter was a six-stage pipeline.
[Figure: the Model 91 floating-point execution unit — a floating-point operation stack, floating-point buffers (FLB), and floating-point registers (FLR) feed reservation stations in front of the two-stage add unit and the six-stage multiply/divide unit; results return over the common data bus (CDB) or go to the store data buffers.
Captions: CDB = common data bus; RS = reservation station, each identified by a unique tag number; CSA = carry-save adder; CPA = carry-propagate adder]

Fig. 6.25 The IBM 360 Model 91 floating-point unit (Courtesy of IBM Corporation, 1967)
The floating-point operation stack was a kind of prefetch buffer holding eight floating-point instructions for subsequent execution through the two functional pipelines. The floating-point buffers were used to input operands.
Operands may also come from the floating-point registers, which were connected via the common data bus to the output bus. Results from the two functional units could be sent back to the memory via the store data buffers, or they could be routed back to the FLR or to the reservation stations at the input ends.
The add unit allowed three pairs of operands to be loaded into three reservation stations. Only one pair could be used at a time. The other two pairs held operands for subsequent use. The use of these reservation stations made the add unit behave like three virtual functional units.
Similarly, the two pairs at the input end of the multiply/divide unit made it behave like two virtual units. Internal data forwarding in Model 91 was accomplished using source tags on all registers and reservation stations. Divide was implemented in Model 91 based on the convergence method.
Every source of an input operand was uniquely identified with a 4-bit tag. Every destination of an input operand had an associated tag register that held the tag naming the source of data if the destination was busy. Through this register tagging technique, operands/results could be directly passed among the virtual functional units. This forwarding significantly cut down the data flow time between them.
Dynamic scheduling logic was built into Model 91 using Tomasulo's algorithm to resolve the data dependence problem. Either the add unit or the multiply/divide unit could execute an operation using operands from one of the reservation stations.
Under Tomasulo's algorithm, data dependences are preserved by copying source tags when the sources are busy. When data is generated by a source, it passes its identification and the data onto the common data bus. Awaiting destinations continuously monitor the bus in a tag watch.
When the source tag matches, the destination takes in the data from the bus. Other variations of Tomasulo's algorithm can be made to store the source tags within the destinations, to use a special tag (such as 0000) to indicate nonbusy registers/buffers, or to use direct-mapped tags to avoid associative hardware.

Besides the IBM 360/370, the CDC 6600/7600 also implemented convergence division. It took two pipeline cycles to perform the floating-point add, six cycles to multiply, and 18 cycles to divide in the IBM System/360 Model 91, due to the five iterations involved in the convergence division process.

6.4.3 Multifunctional Arithmetic Pipelines


Static arithmetic pipelines are designed to perform a fixed function and are thus called unifunctional. When a pipeline can perform more than one function, it is called multifunctional. A multifunctional pipeline can be either static or dynamic. Static pipelines perform one function at a time, but different functions can be performed at different times. A dynamic pipeline allows several functions to be performed simultaneously through the pipeline, as long as there are no conflicts in the shared usage of pipeline stages. In this section, we study a static multifunctional pipeline which was designed into the TI Advanced Scientific Computer (ASC).

Example 6.12 The TI/ASC arithmetic processor design
There were four pipeline arithmetic units built into the TI-ASC system, as shown in Fig. 6.26. The instruction-processing unit handled the fetching and decoding of instructions. There were a large number of working registers in the processor, which also controlled the operations of the memory buffer unit and of the arithmetic units.
There were two sets of operand buffers, (X, Y, Z) and (X', Y', Z'), in each arithmetic unit. X', X, Y', and Y were used for input operands, and Z' and Z were used to output results. Note that intermediate results could be also routed from Z-registers to either X- or Y-registers. Both processor and memory buffers accessed the main memory for instructions and operands/results, respectively.
Each pipeline arithmetic unit had eight stages, as shown in Fig. 6.27a. The PAU was a static multifunction pipeline which could perform only one function at a time. Figure 6.27a shows all the possible interstage connections for performing arithmetic, logical, shifting, and data conversion functions.

[Figure: the TI-ASC system — an instruction processing unit (IPU) with instruction buffer and registers, a main memory, a memory buffer unit (MBU) supplying operands, and four pipeline arithmetic units (PAU)]

Fig. 6.26 The architecture of the TI Advanced Scientific Computer (ASC) (Courtesy of Texas Instruments, Inc.)
Both fixed-point and floating-point arithmetic functions could be performed by this pipeline. The PAU also supported vector in addition to scalar arithmetic operations. It should be noted that different functions required different pipeline stages and different interstage connection patterns.

[Figure: the eight PAU stages — S1 input, S2 exponent subtract, S3 align, S4 fraction add, S5 normalize, S6 fraction multiply, S7 accumulate, S8 output — with panels (a) pipeline stages and interconnections, (b) fixed-point multiplication, and (c) floating-point dot product]

Fig. 6.27 The multiplication arithmetic pipeline of the TI Advanced Scientific Computer and the interstage connections of two representative functions (Shaded stages are unutilized)

For example, fixed-point multiplication required the use of only segments S1, S6, S7, and S8, as shown in Fig. 6.27b. On the other hand, the floating-point dot product function, which performs the dot product operation between two vectors, required the use of all segments with the complex connections shown in Fig. 6.27c. This dot product was implemented by essentially the following accumulated summation of a sequence of multiplications through the pipeline:

Z ← A_i × B_i + Z  (6.25)

where the successive operands (A_i, B_i) were fed through the X- and Y-buffers, and the accumulated sums through the Z-buffer recursively.
The entire pipeline could perform the multiply (×) and the add (+) in a single flow through the pipeline. The two levels of buffer registers isolated the loading and fetching of operands to or from the PAU, respectively, as in the concept of using a pair of prefetch buffers described in Fig. 6.11.
Even though the TI-ASC is no longer in production, the system provided a unique design for multifunction arithmetic pipelines. Today, most supercomputers implement arithmetic pipelines with dedicated functions for much simplified control circuitry and faster operations.
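In software terms, Eq. 6.25 reduces to one multiply-accumulate per element pair streamed through the pipeline, as the trivial model below shows (the function name is ours).

    # Software model of Eq. 6.25: each (A_i, B_i) pair flows once through the
    # pipeline, and the accumulated sum recirculates through the Z-buffer.
    def dot_product(A, B):
        z = 0.0
        for a, b in zip(A, B):
            z = a * b + z        # Z <- A_i * B_i + Z
        return z

    print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 32.0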

SUPERSCALAR PIPELINE DESIGN


Pipeline Design Parameters Some parameters used in designing the scalar base processor and superscalar processor are summarized in Table 6.1 for the pipeline processors to be studied below. All pipelines discussed are assumed to have k stages.
The pipeline cycle for the scalar base processor is assumed to be 1 time unit, called the base cycle. We defined the instruction issue rate, issue latency, and simple operation latency in Section 4.1.1. The instruction-level parallelism (ILP) is the maximum number of instructions that can be simultaneously executed in the pipeline.
For the base processor, all of these parameters have a value of 1. All processor types are designed relative to the base processor. The ILP is needed to fully utilize a given pipeline processor.
Table 6.1 Design Parameters for Pipeline Processors

Machine type                          Scalar base machine        Superscalar machine
                                      of k pipeline stages       of degree m
Machine pipeline cycle                1 (base cycle)             1
Instruction issue rate                1                          m
Instruction issue latency             1                          1
Simple operation latency              1                          1
ILP to fully utilize the pipeline     1                          m

Note: All timing is relative to the base cycle for the scalar base machine. ILP: Instruction-level parallelism.
We study below the structure of superscalar pipelines, the data dependence problem, the factors causing pipeline stalling, and multi-instruction issuing mechanisms for achieving parallel pipelining operations. For a superscalar machine of degree m, m instructions are issued per cycle and the ILP should be m in order to fully utilize the pipeline. As a matter of fact, the scalar base processor can be considered a degenerate case of a superscalar processor of degree 1.
Superscalar Pipeline Structure In an m-issue superscalar processor, the instruction decoding and execution resources are increased to form effectively m pipelines operating concurrently. At some pipeline stages, the functional units may be shared by multiple pipelines.
