ACA UNIT-2 Kai Hwang
Asynchronous Model As shown in Fig. 6.1a, data flow between adjacent stages in an asynchronous pipeline is controlled by a handshaking protocol. When stage S_i is ready to transmit, it sends a ready signal to stage S_{i+1}. After stage S_{i+1} receives the incoming data, it returns an acknowledge signal to S_i.
    Asynchronous pipelines are useful in designing communication channels in message-passing multicomputers where pipelined wormhole routing is practiced (see Chapter 9). Asynchronous pipelines may have a variable throughput rate. Different amounts of delay may be experienced in different stages.
[Fig. 6.1 Two models of linear pipeline units and the corresponding reservation table: (a) an asynchronous pipeline model with ready/acknowledge handshaking between stages, (b) a synchronous pipeline model with clocked latches (L = latch) between stages, (c) the reservation table of a linear pipeline, a diagonal streamline of checkmarks with time (clock cycles) on one axis and stages on the other.]
Synchronous Model Synchronous pipelines are illustrated in Fig. 6.1b. Clocked latches are used to interface between stages. The latches are made with master-slave flip-flops, which can isolate inputs from outputs. Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously.
    The pipeline stages are combinational logic circuits. It is desired to have approximately equal delays in all stages. These delays determine the clock period and thus the speed of the pipeline. Unless otherwise specified, only synchronous pipelines are studied in this book.
The utilization pattern of successive stages in a synchronous pipeline is specified by a reservation table. For a linear pipeline, the utilization follows the diagonal streamline pattern shown in Fig. 6.1c. This table is essentially a space-time diagram depicting the precedence relationship in using the pipeline stages. For a k-stage linear pipeline, k clock cycles are needed for data to flow through the pipeline.
    Successive tasks or operations are initiated one per cycle to enter the pipeline. Once the pipeline is filled up, one result emerges from the pipeline for each additional cycle. This throughput is sustained only if the successive tasks are independent of each other.
6.1.1 Clocking and Timing Control
The clock cycle τ of a pipeline is determined below. Let τ_i be the time delay of the circuitry in stage S_i and d the time delay of a latch, as shown in Fig. 6.1b.

Clock Cycle and Throughput Denote the maximum stage delay as τ_m. We can then write the clock cycle as

    τ = max{τ_i} + d = τ_m + d     (6.1)

The pipeline frequency is the inverse of the clock cycle, f = 1/τ (Eq. 6.2). Allowing for a clock skew s between stages, in the ideal case s = 0, τ_max = τ_m, and τ_min = d. Thus, we have τ = τ_m + d, consistent with the definition in Eq. 6.1 without the effect of clock skewing.
Example 6.1 Pipeline speedup versus stream length
The speedup of a k-stage linear pipeline over an equivalent nonpipelined processor executing n tasks is S_k = kn/(k + (n − 1)) (Eq. 6.5). The maximum speedup is S_k → k as n → ∞. This maximum speedup is very difficult to achieve because of data dependences between successive tasks (instructions), program branches, interrupts, and other factors to be studied in subsequent sections.
    Figure 6.2a plots the speedup factor as a function of n, the number of tasks (operations or instructions) performed by the pipeline. For small values of n, the speedup can be very poor. The smallest value of S_k is 1 when n = 1.
The larger the number k of subdivided pipeline stages, the higher the potential speedup performance. When n = 64, an eight-stage pipeline has a speedup value of 7.1 and a four-stage pipeline has a speedup of 3.7. However, the number of pipeline stages cannot increase indefinitely due to practical constraints on costs, control complexity, circuit implementation, and packaging limitations. Furthermore, the stream length n also affects the speedup; the longer the better in using a pipeline.
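As a quick check on these numbers, the following Python sketch evaluates Eq. 6.5 directly (the chosen values of k and n are illustrative; the formula gives about 7.2 for k = 8 and n = 64, close to the value quoted above):

    def speedup(k: int, n: int) -> float:
        # Eq. 6.5: S_k = k*n / (k + (n - 1))
        return k * n / (k + (n - 1))

    for k in (4, 8):
        for n in (1, 8, 64, 1_000_000):
            print(f"k={k}  n={n:>7}  S_k = {speedup(k, n):.2f}")
    # S_k = 1 when n = 1, and S_k -> k as n -> infinity;
    # for k = 8, n = 64 the formula gives about 7.2.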
Optimal Number of Stages In practice, most pipelining is staged at the functional level with 2 ≤ k ≤ 15. Very few pipelines are designed to exceed 10 stages in real computers. The optimal choice of the number of pipeline stages should maximize the performance/cost ratio for the target processing load.
Let t be the total time required for a nonpipelined sequential program of a given function. To execute the same program on a k-stage pipeline with an equal flow-through delay t, one needs a clock period of p = t/k + d, where d is the latch delay. Thus, the pipeline has a maximum throughput of f = 1/p = 1/(t/k + d). The total pipeline cost is roughly estimated by c + kh, where c covers the cost of all logic stages and h represents the cost of each latch. A pipeline performance/cost ratio (PCR) has been defined by Larson (1973):

    PCR = f/(c + kh) = 1/[(t/k + d)(c + kh)]     (6.6)
[Fig. 6.2 Speedup factors and the optimal number of pipeline stages for a linear pipeline unit: (a) speedup factor as a function of the number of operations n (Eq. 6.5), plotted for k = 6 and k = 10 stages; (b) performance/cost ratio as a function of the number of stages k, peaking at the optimal choice (Eqs. 6.6 and 6.7).]
Figure 6.2b plots the PCR as a function of k. The peak of the PCR curve corresponds to an optimal choice for the number of desired pipeline stages:

    k_0 = sqrt(t·c/(d·h))     (6.7)

where t is the total flow-through delay of the pipeline. Thus the total stage cost c, the latch delay d, and the latch cost h must be considered to achieve the optimal value k_0.
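The following sketch compares the analytic optimum of Eq. 6.7 against a brute-force scan of Eq. 6.6; the parameter values t, d, c, and h are hypothetical, chosen only so that the optimum lands at k_0 = 10:

    import math

    def pcr(k, t, d, c, h):
        # Eq. 6.6: PCR = f / (c + k*h) = 1 / ((t/k + d) * (c + k*h))
        return 1.0 / ((t / k + d) * (c + k * h))

    t, d, c, h = 64.0, 4.0, 100.0, 16.0      # hypothetical parameter values
    k0 = math.sqrt(t * c / (d * h))          # Eq. 6.7: k0 = sqrt(t*c / (d*h))
    best_k = max(range(1, 33), key=lambda k: pcr(k, t, d, c, h))
    print(f"analytic k0 = {k0:.1f}, best integer k = {best_k}")   # both 10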
Efficiency and Throughput The efficiency E_k of a linear k-stage pipeline is defined as

    E_k = S_k/k = n/(k + (n − 1))     (6.8)

Obviously, the efficiency approaches 1 when n → ∞, and a lower bound on E_k is 1/k when n = 1. The pipeline throughput H_k is defined as the number of tasks (operations) performed per unit time:

    H_k = n/{[k + (n − 1)]τ} = nf/(k + (n − 1))     (6.9)
The maximum throughput f occurs when E_k → 1 as n → ∞. This coincides with the speedup definition given in Chapter 3. Note that H_k = E_k·f = E_k/τ = S_k/(kτ). Other relevant factors of instruction pipelines will be discussed in Chapters 12 and 13.
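A short sketch of Eqs. 6.8 and 6.9, using an assumed clock period of 10 ns, verifies the identity H_k = E_k · f numerically:

    def efficiency(k, n):
        # Eq. 6.8: E_k = S_k / k = n / (k + (n - 1))
        return n / (k + (n - 1))

    def throughput(k, n, tau):
        # Eq. 6.9: H_k = n / ((k + (n - 1)) * tau) = E_k * f
        return n / ((k + (n - 1)) * tau)

    tau = 10e-9                      # assumed 10 ns clock period, f = 100 MHz
    k = 8
    for n in (1, 8, 64, 10_000):
        Ek = efficiency(k, n)
        Hk = throughput(k, n, tau)
        print(f"n={n:>6}  E_k = {Ek:.3f}  H_k = {Hk/1e6:8.2f} M tasks/s")
    # E_k has lower bound 1/k at n = 1 and approaches 1 (H_k -> f) as n grows.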
[Fig. 6.3 A dynamic pipeline with feedforward and feedback connections for two different functions: (a) a three-stage pipeline, (b) reservation table for function X spanning eight clock cycles, (c) reservation table for function Y spanning six clock cycles.]
Reservation Tables The reservation table for a static linear pipeline is trivial in the sense that dataflow follows a linear streamline. The reservation table for a dynamic pipeline becomes more interesting because a nonlinear pattern is followed. Given a pipeline configuration, multiple reservation tables can be generated for the evaluation of different functions.
    Two reservation tables are given in Figs. 6.3b and 6.3c, corresponding to a function X and a function Y, respectively. Each function evaluation is specified by one reservation table. A static pipeline is specified by a single reservation table. A dynamic pipeline may be specified by more than one reservation table.
Each reservation table displays the time-space flow of data through the pipeline for one function evaluation. Different functions follow different paths through the pipeline.
    The number of columns in a reservation table is called the evaluation time of a given function. For example, the function X requires eight clock cycles to evaluate, and function Y requires six cycles, as shown in Figs. 6.3b and 6.3c, respectively.
    A pipeline initiation table corresponds to each function evaluation. All initiations to a static pipeline use the same reservation table. On the other hand, a dynamic pipeline may allow different initiations to follow a mix of reservation tables. The checkmarks in each row of the reservation table correspond to the time instants (cycles) in which a particular stage will be used.
    There may be multiple checkmarks in a row, which means repeated usage of the same stage in different cycles. Contiguous checkmarks in a row simply imply the extended usage of a stage over more than one cycle. Multiple checkmarks in a column mean that multiple stages need to be used in parallel during a particular clock cycle.
Latency Analysis The number of time units (clock cycles) between two initiations of a pipeline is the latency between them. Latency values must be nonnegative integers. A latency of k means that two initiations are separated by k clock cycles. Any attempt by two or more initiations to use the same pipeline stage at the same time will cause a collision.
    A collision implies resource conflicts between two initiations in the pipeline. Therefore, all collisions must be avoided in scheduling a sequence of pipeline initiations. Some latencies will cause collisions, and some will not. Latencies that cause collisions are called forbidden latencies. In using the pipeline in Fig. 6.3 to evaluate the function X, latencies 2 and 5 are forbidden, as illustrated in Fig. 6.4.
[Fig. 6.4 Collisions with forbidden latencies 2 and 5 in using the pipeline in Fig. 6.3 to evaluate the function X: (a) collision with scheduling latency 2, (b) collision with scheduling latency 5; each panel is a time-space diagram marking the cycles in which the initiations X1, X2, ... occupy stages S1 through S3.]
The ith initiation is denoted as X_i in Fig. 6.4. With latency 2, initiations X1 and X2 collide in stage 2 at time 4. At time 7, these initiations collide in stage 3. Similarly, other collisions are shown at times 5, 6, 8, ..., etc.
The collision patterns for latency 5 are shown in Fig. 6.4b, where X1 and X2 are scheduled 5 clock cycles apart. Their first collision occurs at time 6.
    To detect a forbidden latency, one needs simply to check the distance between any two checkmarks in the same row of the reservation table. For example, the distance between the first mark and the second mark in row S1 in Fig. 6.3b is 5, implying that 5 is a forbidden latency.
    Similarly, latencies 2, 4, 5, and 7 are all seen to be forbidden from inspecting the same reservation table. From the reservation table in Fig. 6.3c, we discover the forbidden latencies 2 and 4 for function Y. A latency sequence is a sequence of permissible nonforbidden latencies between successive task initiations; a short sketch of the row-distance check follows.
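The sketch below applies this row-distance check in Python. The reservation table for function X is reconstructed here from the textual description of Fig. 6.3b (the figure itself is garbled in this copy, so the exact marks are an assumption consistent with the stated forbidden latencies):

    from itertools import combinations

    # Reservation table for function X (stage -> cycles used), reconstructed
    # from the text's description of Fig. 6.3b -- an assumption.
    X = {"S1": {1, 6, 8}, "S2": {2, 4}, "S3": {3, 5, 7}}

    def forbidden_latencies(table):
        # distance between any two checkmarks in the same row
        return {b - a
                for marks in table.values()
                for a, b in combinations(sorted(marks), 2)}

    print(sorted(forbidden_latencies(X)))   # [2, 4, 5, 7]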
A latency cycle is a latency sequence which repeats the same subsequence (cycle) indefinitely. Figure 6.5 illustrates latency cycles in using the pipeline in Fig. 6.3 to evaluate the function X without causing a collision. For example, the latency cycle (1, 8) represents the infinite latency sequence 1, 8, 1, 8, 1, 8, .... This implies that successive initiations of new tasks are separated by one cycle and eight cycles alternately.
[Fig. 6.5 Latency cycles for evaluating the function X with the pipeline in Fig. 6.3: (a) latency cycle (1, 8) = 1, 8, 1, 8, ..., with an average latency of 4.5; (b) latency cycle (3) = 3, 3, 3, ..., with an average latency of 3; (c) latency cycle (6) = 6, 6, 6, ..., with an average latency of 6. Each panel marks the repeating portion of the schedule over cycles 1 through 21.]
The average latency of a latency cycle is obtained by dividing the sum of all latencies by the number of latencies along the cycle. The latency cycle (1, 8) thus has an average latency of (1 + 8)/2 = 4.5. A constant cycle is a latency cycle which contains only one latency value. Cycles (3) and (6) in Figs. 6.5b and 6.5c are both constant cycles. The average latency of a constant cycle is simply the latency itself. In the next section, we describe how to obtain these latency cycles systematically.
6.2.1 Collision-Free Scheduling
When scheduling events in a nonlinear pipeline, the main objective is to obtain the shortest average latency between initiations without causing collisions. In what follows, we present a systematic method for achieving such collision-free scheduling.
    We study below collision vectors, state diagrams, simple cycles, greedy cycles, and the minimal average latency (MAL). This pipeline design theory was originally developed by Davidson (1971) and his students.
Collision Vectors By examining the reservation table, one can distinguish the set of permissible latencies from the set of forbidden latencies. For a reservation table with n columns, the maximum forbidden latency m ≤ n − 1. The permissible latency p should be as small as possible. The choice is made in the range 1 ≤ p ≤ m − 1.
    A permissible latency of p = 1 corresponds to the ideal case. In theory, a latency of 1 can always be achieved in a static pipeline which follows a linear (diagonal or streamlined) reservation table as shown in Fig. 6.1c.
The combined set of permissible and forbidden latencies can be easily displayed by a collision vector, which is an m-bit binary vector C = (C_m C_{m−1} ... C_2 C_1). The value of C_i = 1 if latency i causes a collision and C_i = 0 if latency i is permissible. Note that it is always true that C_m = 1, corresponding to the maximum forbidden latency.
    For the two reservation tables in Fig. 6.3, the collision vector C_X = (1011010) is obtained for function X, and C_Y = (1010) for function Y. From C_X, we can immediately tell that latencies 7, 5, 4, and 2 are forbidden and latencies 6, 3, and 1 are permissible. Similarly, 4 and 2 are forbidden latencies and 3 and 1 are permissible latencies for function Y.
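Given the forbidden-latency sets, the collision vectors follow mechanically, as this minimal sketch shows:

    def collision_vector(forbidden, m):
        # C = (C_m ... C_1): C_i = 1 iff latency i causes a collision
        return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))

    print(collision_vector({2, 4, 5, 7}, 7))   # 1011010  (C_X)
    print(collision_vector({2, 4}, 4))         # 1010     (C_Y)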
State Diagrams From the above collision vector, one can construct a state diagram specifying the permissible state transitions among successive initiations. The collision vector, like C_X above, corresponds to the initial state of the pipeline at time 1 and thus is called an initial collision vector. Let p be a permissible latency within the range 1 ≤ p ≤ m − 1.
The next state of the pipeline at time t + p is obtained with the assistance of an m-bit right shift register as in Fig. 6.6a. The initial collision vector C is initially loaded into the register. The register is then shifted to the right. Each 1-bit shift corresponds to an increase in the latency by 1. When a 0 bit emerges from the right end after p shifts, it means p is a permissible latency. Likewise, a 1 bit being shifted out means a collision, and thus the corresponding latency should be forbidden.
    Logical 0 enters from the left end of the shift register. The next state after p shifts is thus obtained by bitwise-ORing the initial collision vector with the shifted register contents. For example, from the initial state C_X = (1011010), the next state (1111111) is reached after one right shift of the register, and the next state (1011011) is reached after three shifts or six shifts.
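The shift-and-OR transition rule can be stated in a few lines of Python; the states below reproduce the transitions just described for C_X:

    CX = 0b1011010      # initial collision vector for function X

    def next_state(state, p):
        # shift right by p (zeros enter from the left), then OR with the
        # initial collision vector -- the rule of Fig. 6.6a
        return (state >> p) | CX

    print(bin(next_state(CX, 1)))   # 0b1111111 (latency 1)
    print(bin(next_state(CX, 3)))   # 0b1011011 (latency 3)
    print(bin(next_state(CX, 6)))   # 0b1011011 (latency 6)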
Example 6.2 The state transition diagram for a pipeline unit
A state diagram is obtained in Fig. 6.6b for function X. From the initial state (1011010), only three outgoing transitions are possible, corresponding to the three permissible latencies 6, 3, and 1 in the initial collision vector. Similarly, from state (1011011), one reaches the same state after either three shifts or six shifts.
When the number of shifts is m + 1 or greater, all transitions are redirected back to the initial state. For example, after eight or more (denoted as 8+) shifts, the next state must be the initial state, regardless of which state the transition starts from. In Fig. 6.6c, a state diagram is obtained for the reservation table in Fig. 6.3c using a 4-bit shift register. Once the initial collision vector is determined, the corresponding state diagram is uniquely determined.
[Fig. 6.6 Two state diagrams obtained from the two reservation tables in Fig. 6.3, respectively: (a) state transition using an m-bit right shift register, where m is the maximum forbidden latency; (b) state diagram for function X, with states 1011010, 1111111, and 1011011; (c) state diagram for function Y. Minimum-latency edges are marked with asterisks.]
The 0's and 1's in the present state, say at time t, of a state diagram indicate the permissible and forbidden latencies, respectively, at time t. The bitwise ORing of the shifted version of the present state with the initial collision vector is meant to prevent collisions from future initiations starting at time t + 1 and onward.
    Thus the state diagram covers all permissible state transitions that avoid collisions. All latencies equal to or greater than m are permissible. This implies that collisions can always be avoided if events are scheduled far apart (with latencies of m+). However, such long latencies are not tolerable from the viewpoint of pipeline throughput.
Greedy Cycles From the state diagram, we can determine optimal latency cycles which result in the MAL. There are infinitely many latency cycles one can trace from the state diagram. For example, (1, 8), (1, 8, 6, 8), (3), (6), (3, 8), (3, 6, 3), ..., are legitimate cycles traced from the state diagram in Fig. 6.6b. Among these cycles, only simple cycles are of interest.
A simple cycle is a latency cycle in which each state appears only once. In the state diagram in Fig. 6.6b, only (3), (6), (8), (1, 8), (3, 8), and (6, 8) are simple cycles. The cycle (1, 8, 6, 8) is not simple because it travels through the state (1011010) twice. Similarly, the cycle (3, 6, 3, 8, 6) is not simple because it repeats the state (1011011) three times.
    Some of the simple cycles are greedy cycles. A greedy cycle is one whose edges are all made with minimum latencies from their respective starting states. For example, in Fig. 6.6b the cycles (1, 8) and (3) are greedy cycles. Greedy cycles in Fig. 6.6c are (1, 5) and (3). Such cycles must first be simple, and their average latencies must be lower than those of other simple cycles. The greedy cycle (1, 8) in Fig. 6.6b has an average latency of (1 + 8)/2 = 4.5, which is lower than that of the simple cycle (6, 8), whose average latency is (6 + 8)/2 = 7. The greedy cycle (3) has a constant latency which equals the MAL for evaluating function X without causing a collision.
    The MAL in Fig. 6.6c is 3, corresponding to either of the two greedy cycles. The minimum-latency edges in the state diagrams are marked with asterisks.
    At least one of the greedy cycles will lead to the MAL. The collision-free scheduling of pipeline events is thus reduced to finding greedy cycles from the set of simple cycles. The greedy cycle yielding the MAL is the final choice.
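The following sketch automates this search: it enumerates the states reachable from the initial collision vector, follows the minimum-latency edge out of each state until a cycle closes, and reports the best average latency found. It is a simplified illustration of the procedure, not the book's algorithm; for C_X = (1011010) it finds the greedy cycle (3) and hence MAL = 3:

    def permissible(state, m):
        # latencies 1..m whose bit is 0, plus m+1 (the "(m+1)+" return edge)
        return [p for p in range(1, m + 1) if not (state >> (p - 1)) & 1] + [m + 1]

    def step(cv, state, p, m):
        return cv if p > m else (state >> p) | cv

    def greedy_cycle_from(cv, start, m):
        # follow the minimum permissible latency until a state repeats
        state, seen, path = start, {}, []
        while state not in seen:
            seen[state] = len(path)
            p = min(permissible(state, m))
            path.append(p)
            state = step(cv, state, p, m)
        return path[seen[state]:]

    def minimal_average_latency(cv, m):
        states, frontier = {cv}, [cv]
        while frontier:                      # all states reachable from cv
            s = frontier.pop()
            for p in permissible(s, m):
                t = step(cv, s, p, m)
                if t not in states:
                    states.add(t)
                    frontier.append(t)
        cycles = [greedy_cycle_from(cv, s, m) for s in states]
        return min(sum(c) / len(c) for c in cycles)

    print(minimal_average_latency(0b1011010, 7))   # 3.0 for function X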
By inspection of the reservation table, the MAL is bounded below by the maximum number of checkmarks in any row, and bounded above by the number of 1's in the initial collision vector plus 1. Interested readers may refer to Shar (1972) or find proofs of these bounds in Kogge (1981). These results suggest that the optimal latency cycle must be selected from one of the lowest greedy cycles. However, a greedy cycle alone is not sufficient to guarantee the optimality of the MAL; the lower bound guarantees the optimality. For example, the MAL = 3 for both function X and function Y, which meets the lower bound of 3 from their respective reservation tables.
    From Fig. 6.6b, the upper bound on the MAL for function X is equal to 4 + 1 = 5, a rather loose bound. On the other hand, Fig. 6.6c shows a rather tight upper bound of 2 + 1 = 3 on the MAL. Therefore, all greedy cycles for function Y lead to the optimal latency value of 3, which cannot be lowered further.
To optimize the MAL, one needs to lower this bound by modifying the reservation table. The approach is to reduce the maximum number of checkmarks in any row. The modified reservation table must preserve the original function being evaluated. Patel and Davidson (1976) have suggested the use of noncompute delay stages to increase pipeline performance with a shorter MAL. Their technique is described below.
Delay Insertion The purpose of delay insertion is to modify the reservation table, yielding a new collision vector. This leads to a modified state diagram, which may produce greedy cycles meeting the lower bound on the MAL.
    Before delay insertion, the three-stage pipeline in Fig. 6.7a is specified by the reservation table in Fig. 6.7b. This table leads to a collision vector C = (1011), corresponding to forbidden latencies 1, 2, and 4. The corresponding state diagram (Fig. 6.7c) contains only one self-reflecting state with a greedy cycle of latency 3 equal to the MAL.
    Based on the given reservation table, the maximum number of checkmarks in any row is 2. Therefore, the MAL = 3 so obtained in Fig. 6.7c is not optimal.
[Fig. 6.7 The pipeline before delay insertion: (a) a three-stage pipeline, (b) reservation table (five clock cycles) and the operations being delayed, (c) state transition diagram with MAL = 3.]
In total, the operation X1 has been delayed one cycle from time 4 to time 5, and the operation X2 has been delayed two cycles from time 5 to time 7. All remaining operations (marked as X in Fig. 6.8b) are unchanged. This new table leads to a new collision vector (100010) and a modified state diagram in Fig. 6.8c.
                                                 °""’"‘                    I-IE!
                                    we
                                                 [a] Insertion of two nonoornpute delay stages
                         —I- Time                                                 4 r
                    —L   l\J   L0   -F
            51 III                                                         .        l                          .
   Stages   5
            IIIII"-
             2
            s, !!E__
                   IIIIe IIE~            -I
                                                                                           I                            9
   near     D1
  M”%IIIIIEI                                                                            mum
                 [b] Mo-dfiad reservation tabto                [ct Modified state diagram with a reduced M!-Y-.L= [1 + 31t2 = 2
Fig. 6.8 lrecrrlon oftwo delay stages to obtain an op1:lma.l MAL for the pipeline in Fig.6.?
This diagram displays a greedy cycle (1, 3), resulting in a reduced MAL = (1 + 3)/2 = 2. The delay insertion thus improves the pipeline performance, yielding a MAL that meets the lower bound.
Pipeline Throughput This is essentially the initiation rate or the average number of task initiations per clock cycle. If N tasks are initiated within n pipeline cycles, then the initiation rate or pipeline throughput is measured as N/n. This rate is determined primarily by the inverse of the MAL adopted. Therefore, the scheduling strategy does affect the pipeline performance.
    In general, the shorter the adopted MAL, the higher the throughput that can be expected. The highest achievable throughput is one task initiation per cycle, when the MAL equals 1, since 1 ≤ MAL ≤ the shortest latency of any greedy cycle. Unless the MAL is reduced to 1, the pipeline throughput becomes a fraction.
Pipeline Efficiency Another important measure is pipeline efficiency. The percentage of time that each pipeline stage is used over a sufficiently long series of task initiations is the stage utilization. The accumulated rate of all stage utilizations determines the pipeline efficiency.
    Let us reexamine latency cycle (3) in Fig. 6.5b. Within each latency cycle of three clock cycles, there are two pipeline stages, S1 and S3, which are completely and continuously utilized after time 6. The pipeline stage S2 is used for two cycles and is idle for one cycle.
    Therefore, the entire pipeline can be considered 8/9 = 88.9% efficient for latency cycle (3). On the other hand, the pipeline is only 14/27 = 51.8% efficient for latency cycle (1, 8) and 8/16 = 50% efficient for latency cycle (6), as illustrated in Figs. 6.5a and 6.5c, respectively. Note that none of the three stages is fully utilized with respect to the latter two initiation cycles.
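The stage utilizations for the constant cycle (3) can be checked with the short sketch below; the reservation table for X is again the reconstruction assumed earlier, and the modulo folding relies on initiations repeating every three cycles in the steady state (for non-constant cycles such as (1, 8), the counts depend on the measurement window used in the figure):

    # Function X table as reconstructed earlier (an assumption).
    X = {"S1": {1, 6, 8}, "S2": {2, 4}, "S3": {3, 5, 7}}
    period = 3                      # constant latency cycle (3)

    for stage, marks in X.items():
        # with one initiation every 3 cycles, steady-state usage of a stage
        # folds onto the residues of its marks modulo the period
        busy = {t % period for t in marks}
        print(stage, f"{100 * len(busy) / period:.1f}% utilized")
    # S1: 100%, S2: 66.7%, S3: 100% -> overall 8/9 = 88.9% efficient.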
The pipeline throughput and pipeline efficiency are related to each other. Higher throughput results from a shorter latency cycle. Higher efficiency implies less idle time for pipeline stages. The above example demonstrates that higher throughput also accompanies higher efficiency. Other examples, however, may show a contrary conclusion. The relationship between the two measures is a function of the reservation table and of the initiation cycle adopted.
    At least one stage of the pipeline should be fully (100%) utilized at the steady state in any acceptable initiation cycle; otherwise, the pipeline capability has not been fully explored. In such cases, the initiation cycle may not be optimal and another initiation cycle should be examined for improvement.
[Fig. 6.9 Pipelined execution of X = Y + Z and A = B × C (Courtesy of James Smith; reprinted with permission from IEEE Computer, July 1989). The machine code is R1 ← Mem(Y), R2 ← Mem(Z), R3 ← (R1) + (R2), Mem(X) ← (R3), R4 ← Mem(B), R5 ← Mem(C), R6 ← (R4) × (R5), Mem(A) ← (R6). Panel (a) shows in-order instruction issuing over 23 cycles; panel (c) shows reordered instruction issuing completing in 17 cycles.]
Example 6.4 The MIPS R4000 instruction pipeline
The MIPS R4000 was a pipelined 64-bit processor using separate instruction and data caches and an eight-stage pipeline for executing register-based instructions. As illustrated in Fig. 6.10, the processor pipeline design was targeted to achieve an execution rate approaching one instruction per cycle.
[Fig. 6.10 The architecture of the MIPS R4000 instruction pipeline (Courtesy of MIPS Computer Systems).]
The execution of each R4000 instruction consisted of eight major steps, as summarized in Fig. 6.10a. Each of these steps required approximately one clock cycle. The instruction and data memory references are split across two stages. The single-cycle ALU stage took slightly more time than each of the cache access stages.
    The overlapped execution of successive instructions is shown in Fig. 6.10b. This pipeline operated efficiently because different CPU resources, such as address and bus access, ALU operations, register accesses, and so on, were utilized simultaneously on a noninterfering basis.
    The internal pipeline clock rate (100 MHz) of the R4000 was twice the external input or master clock frequency. Figure 6.10b shows the optimal pipeline movement, completing one instruction every internal clock cycle. Load and branch instructions introduce extra delays.
[Fig. 6.11 The use of prefetch buffers: instructions indicated by the program counter flow into Sequential Buffers 1 and 2, instructions from branched locations flow into Target Buffers 1 and 2, and both pairs feed the instruction pipeline.]
Sequential instructions are loaded into a pair of sequential buffers for in-sequence pipelining. Instructions from a branch target are loaded into a pair of target buffers for out-of-sequence pipelining. Both buffers operate in a first-in-first-out fashion. These buffers become part of the pipeline as additional stages.
    A conditional branch instruction causes both sequential buffers and target buffers to fill with instructions. After the branch condition is checked, appropriate instructions are taken from one of the two buffers, and instructions in the other buffer are discarded. Within each pair, one buffer can be used to load instructions from memory while the other feeds instructions into the pipeline. The two buffers in each pair alternate to prevent a collision between instructions flowing into and out of the pipeline.
A third type of prefetch buffer is known as a loop buffer. This buffer holds sequential instructions contained in a small loop. The loop buffers are maintained by the fetch stage of the pipeline. Prefetched instructions in the loop body will be executed repeatedly until all iterations complete execution. The loop buffer operates in two steps. First, it contains instructions sequentially ahead of the current instruction. This saves the instruction fetch time from memory. Second, it recognizes when the target of a branch falls within the loop boundary. In this case, unnecessary memory accesses can be avoided if the target instruction is already in the loop buffer. The CDC 6600 and Cray 1 made use of loop buffers.
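A toy sketch of the sequential/target buffer policy is given below; the buffer names and instruction labels are hypothetical, and real hardware overlaps filling and draining rather than working on whole lists:

    from collections import deque

    sequential = deque(["I1", "I2", "I3"])   # fall-through instructions
    target = deque(["T1", "T2", "T3"])       # instructions from the branch target

    def resolve_branch(taken):
        # keep one buffer as the pipeline's instruction source, discard the other
        keep, drop = (target, sequential) if taken else (sequential, target)
        drop.clear()
        return keep

    for insn in resolve_branch(taken=True):
        print("issue", insn)                 # issues T1, T2, T3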
Multiple Functional Units Sometimes a certain pipeline stage becomes the bottleneck. This stage corresponds to the row with the maximum number of checkmarks in the reservation table. This bottleneck problem can be alleviated by using multiple copies of the same stage simultaneously. This leads to the use of multiple execution units in a pipelined processor design (Fig. 6.12).
[Fig. 6.12 A pipelined processor with multiple functional units and distributed reservation stations supported by tagging (Courtesy of G. Sohi; reprinted with permission from IEEE Transactions on Computers, March 1990). An instruction fetch unit feeds decode and issue units, which dispatch operations through tagged reservation stations (RS) to the functional units and the register file.]
Sohi (1990) used a model architecture for a pipelined scalar processor containing multiple functional units (Fig. 6.12). In order to resolve data or resource dependences among the successive instructions entering the pipeline, reservation stations (RSs) are used with each functional unit. Operations wait in the RS until their data dependences have been resolved. Each RS is uniquely identified by a tag, which is monitored by a tag unit.
    The tag unit keeps checking the tags from all currently used registers or RSs. This register tagging technique allows the hardware to resolve conflicts between source and destination registers assigned for multiple instructions. Besides resolving conflicts, the RSs also serve as buffers to interface the pipelined functional units with the decode and issue units. The multiple functional units operate in parallel, once the dependences are resolved. This alleviates the bottleneck in the execution stages of the instruction pipeline.
Internal Data Forwarding The throughput of a pipelined processor can be further improved with internal data forwarding among multiple functional units. In some cases, memory-access operations can be replaced by register transfer operations. The idea is described in Fig. 6.13.
Store-load forwarding is shown in Fig. 6.13a, in which the load operation (LD R2, M) from memory to register R2 can be replaced by the move operation (MOVE R2, R1) from register R1 to register R2. Since register transfer is faster than memory access, this data forwarding will reduce memory traffic and thus result in a shorter execution time. Similarly, load-load forwarding (Fig. 6.13b) eliminates the second load operation (LD R2, M) and replaces it with the move operation (MOVE R2, R1). Further discussion on operand forwarding will be continued in Chapter 12.
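The sketch below expresses both forwarding rules as a peephole pass over a toy instruction list; the tuple encoding is invented for illustration, and only adjacent instruction pairs are examined:

    def forward(prog):
        # replace a load that immediately follows a store/load of the same
        # memory location with a register move (adjacent pairs only)
        out = []
        for ins in prog:
            if out and ins[0] == "LOAD":
                prev = out[-1]
                if prev[0] == "STORE" and prev[1] == ins[2]:
                    ins = ("MOVE", ins[1], prev[2])   # store-load forwarding
                elif prev[0] == "LOAD" and prev[2] == ins[2]:
                    ins = ("MOVE", ins[1], prev[1])   # load-load forwarding
            out.append(ins)
        return out

    # ("STORE", M, src) means Mem[M] <- src; ("LOAD", dst, M) means dst <- Mem[M]
    print(forward([("STORE", "M", "R1"), ("LOAD", "R2", "M")]))
    # -> [('STORE', 'M', 'R1'), ('MOVE', 'R2', 'R1')]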
[Fig. 6.13 Internal data forwarding by replacing memory-access operations with register transfer operations: (a) store-load forwarding, (b) load-load forwarding; in each case the memory access between registers R1 and R2 is bypassed by a direct register transfer.]
Example 6.5 Implementing the dot-product operation with internal data forwarding between a multiply unit and an add unit
One can feed the output of a multiplier directly to the input of an adder (Fig. 6.14) for implementing the following dot-product operation:

    s = Σ a_i × b_i, summed over i = 1, 2, ..., n
Hazard Avoidance The read and write of shared variables by different instructions in a pipeline may lead to different results if these instructions are executed out of order. As illustrated in Fig. 6.15, three types of logic hazards are possible.
    Consider two instructions I and J. Instruction J is assumed to logically follow instruction I according to program order. If the actual execution order of these two instructions violates the program order, incorrect results may be read or written, thereby producing hazards.
    Hazards should be prevented before these instructions enter the pipeline, such as by holding instruction J until the dependence on instruction I is resolved. We use the notation D(I) and R(I) for the domain and range of an instruction I.
    The domain contains the input set (such as operands in registers or in memory) to be used by instruction I. The range corresponds to the output set of instruction I. Listed below are conditions under which possible hazards can occur:
[Fig. 6.14 Internal data forwarding for implementing the dot-product operation. Without forwarding: I1: R3 ← (R1) × (R2); I2: R4 ← (R3); I3: R5 ← (R5) + (R4). With forwarding: I4: R4 ← (R1) × (R2); I5: R5 ← (R4) + (R5); I4 and I5 can be executed simultaneously with internal data forwarding.]
[Fig. 6.15 Possible hazards between read and write operations in an instruction pipeline (instruction I is ahead of instruction J in program order): (a) read-after-write (RAW) hazard, (b) write-after-write (WAW) hazard, (c) write-after-read (WAR) hazard.]
    R(I) ∩ D(J) ≠ ∅ for RAW hazard
    R(I) ∩ R(J) ≠ ∅ for WAW hazard     (6.11)
    D(I) ∩ R(J) ≠ ∅ for WAR hazard
These conditions are necessary but not sufficient. This means the hazard may not appear even if one or more of the conditions exist. The RAW hazard corresponds to the flow dependence, WAR to the antidependence, and WAW to the output dependence introduced in Section 2.1. The occurrence of a logic hazard depends on the order in which the two instructions are executed. Chapter 12 discusses techniques to handle such hazards. A set-based check of these conditions is sketched below.
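The following sketch tests the three conditions of Eq. 6.11 on explicit domain and range sets (the example instructions are hypothetical):

    def hazards(D_I, R_I, D_J, R_J):
        # I precedes J; a nonempty intersection flags a *possible* hazard
        # (the conditions of Eq. 6.11 are necessary, not sufficient)
        return {"RAW": bool(R_I & D_J),    # flow dependence
                "WAW": bool(R_I & R_J),    # output dependence
                "WAR": bool(D_I & R_J)}    # antidependence

    # I: R3 <- (R1) + (R2);  J: R4 <- (R3) * (R5)
    print(hazards(D_I={"R1", "R2"}, R_I={"R3"},
                  D_J={"R3", "R5"}, R_J={"R4"}))   # RAW only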
The two loads, since they are independent of the add and move, can be moved ahead to increase the spacing between them and the multiply instruction. The following program is obtained after this modification:

    Load      R2, M(α)     2 to 3 cycles
    Load      R3, M(β)     2 cycles due to overlapping
    Add       R0, R1       2 cycles
    Move      R1, R5       1 cycle
    Multiply  R2, R3       3 cycles

Through this code rearrangement, the data dependences and program semantics are preserved, and the multiply can be initiated without delay. While the operands are being loaded from memory cells α and β into registers R2 and R3, the two instructions add and move consume three cycles and thus pipeline stalling is avoided.
Tomasulo's Algorithm This hardware dependence-resolution scheme was first implemented with multiple floating-point units of the IBM 360/91 processor. The hardware platform is abstracted in Fig. 6.12. For the Model 91 processor, three RSs were used in a floating-point adder and two pairs in a floating-point multiplier. The scheme resolved resource conflicts as well as data dependences using register tagging to allocate or deallocate the source and destination registers.
    An issued instruction whose operands are not available is forwarded to an RS associated with the functional unit it will use. It waits until its data dependences have been resolved and its operands become available. The dependence is resolved by monitoring the result bus (called the common data bus in the Model 91). When all operands for an instruction are available, it is dispatched to the functional unit for execution. All working registers are tagged. If a source register is busy when an instruction reaches the issue stage, the tag for the source register is forwarded to an RS. When the register data becomes available, it also reaches the RS which has the same tag.
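The tag-matching mechanism can be caricatured in a few lines of Python. This is a deliberately simplified sketch of the waiting-and-broadcast idea, not the IBM 360/91 design; the class and tag names are invented:

    class ReservationStation:
        def __init__(self, tag, sources):
            self.tag = tag
            self.src = sources            # each entry: a value, or a tag string

        def ready(self):
            return all(not isinstance(s, str) for s in self.src)

    # R2 is busy: it holds the tag of the station that will produce it
    regs = {"R1": 1.5, "R2": "rs1"}
    rs2 = ReservationStation("rs2", [regs["R1"], regs["R2"]])

    def broadcast(tag, value, stations, regs):
        # result bus: every waiting RS and register with a matching tag
        # latches the broadcast value
        for rs in stations:
            rs.src = [value if s == tag else s for s in rs.src]
        for r, v in regs.items():
            if v == tag:
                regs[r] = value

    print(rs2.ready())                 # False: still waiting on tag 'rs1'
    broadcast("rs1", 3.0, [rs2], regs)
    print(rs2.ready(), rs2.src)        # True [1.5, 3.0]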
Example 6.6 Tomasulo's algorithm for dynamic instruction scheduling
Tomasulo's algorithm was applied to work with processors having a few floating-point registers. In the case of the Model 91, only four registers were available. Figure 6.16a shows a minimum-register machine code for computing X = Y + Z and A = B × C. The pipeline timing with Tomasulo's algorithm appears in Fig. 6.16b. Here, the total execution time is 13 cycles, counting from cycle 4 to cycle 15 and ignoring the pipeline startup and draining times.
    Memory is treated as a special functional unit. When an instruction has completed execution, the result (along with its tag) appears on the result bus. The registers as well as the RSs monitor the result bus and update their contents (and ready/busy bits) when a matching tag is found. Details of the algorithm can be found in the original paper by Tomasulo (1967).
[Fig. 6.16 Dynamic instruction scheduling using Tomasulo's algorithm on the processor in Fig. 6.12 (Courtesy of James Smith; reprinted with permission from IEEE Computer, July 1989): (a) the minimum-register machine code R1 ← Mem(Y), R2 ← Mem(Z), R3 ← (R1) + (R2), Mem(X) ← (R3), R1 ← Mem(B), R2 ← Mem(C), R3 ← (R1) × (R2), Mem(A) ← (R3); (b) the pipeline schedule over cycles 1–19.]
CDC Scoreboarding The CDC 6600 was an early high-performance computer that used dynamic instruction scheduling hardware. Figure 6.17a shows a CDC 6600-like processor, in which multiple functional units appeared as multiple execution pipelines. Parallel units allowed instructions to complete out of the original program order. The processor had instruction buffers for each execution unit. Instructions were issued to available functional units regardless of whether register input data was available.
[Fig. 6.17 Hardware scoreboarding for dynamic instruction scheduling (Courtesy of James Smith; reprinted with permission from IEEE Computer, July 1989): (a) a CDC 6600-like processor with stages for instruction fetch (F), decode (D), issue (I), execute (E), and writeback (W); (b) the pipeline schedule over cycles 1–19 for the code R1 ← Mem(Y), R2 ← Mem(Z), R3 ← (R1) + (R2), Mem(X) ← (R3), R4 ← Mem(B), R5 ← Mem(C), R6 ← (R4) × (R5), Mem(A) ← (R6).]
The instruction would then wait in a buffer for its data to be produced by other instructions. To control the correct routing of data between execution units and registers, the CDC 6600 used a centralized control unit known as the scoreboard. This unit kept track of the registers needed by instructions waiting for the various functional units. When all registers had valid data, the scoreboard enabled the instruction execution. Similarly, when a functional unit finished, it signaled the scoreboard to release the resources.
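A toy sketch of the scoreboard's register-ready bookkeeping follows; the encoding is hypothetical and omits functional-unit status, which a real scoreboard also tracks:

    ready = {"R1": False, "R2": False, "R4": False}
    waiting = [("add", ["R1", "R2"], "R4")]    # (unit, sources, destination)

    def on_register_write(reg):
        ready[reg] = True
        for entry in list(waiting):
            unit, srcs, dst = entry
            if all(ready[s] for s in srcs):    # scoreboard enables execution
                waiting.remove(entry)
                print(f"{unit} executes and writes {dst}")
                ready[dst] = True

    on_register_write("R1")   # add still waits on R2
    on_register_write("R2")   # add executes and writes R4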
Example 6.7 Pipelined operations using hardware scoreboarding on the CDC 6600-like processor (James Smith, 1989)
Figure 6.17b shows the pipeline schedule based on scoreboard issue logic. The schedule corresponds to the execution of the same machine code for X = Y + Z and A = B × C. The pipeline latencies are the same as those resulting from Tomasulo's algorithm. The add instruction is issued to its functional unit before its registers are ready. It then waits for its input register operands.
    The scoreboard routes the register values to the adder unit when they become available. In the meantime, the issue stage is not blocked, so other instructions can bypass the blocked add. It takes 13 clock cycles to perform the operations. Details of the CDC scoreboarding can be found in the book by Thornton (1970).
The scoreboard is a centralized control logic which keeps track of the status of registers and multiple functional units. When functional units generate new results, some data dependences can be resolved and thus a higher degree of parallelism can be explored with scoreboarding. Scoreboarding in later microprocessors like the MC88000 used forwarding logic and register tagging. In a way, scoreboarding implements a kind of data-driven mechanism to achieve efficient computations.
    Dynamic instruction scheduling was implemented only in high-end mainframes or supercomputers in the past. Most microprocessors used static scheduling. But the trend has changed over the last two decades. RISC and superscalar processors are today built with hardware support for dynamic scheduling at runtime. Significant trace-driven data are needed to optimize the pipelined processor design. Toward this goal, processor and compiler designers have to work together to achieve an efficient design. Multiple-issue instruction pipelines, which are much more complicated than single-issue instruction pipelines, will be studied in Section 6.5.
Effect of Branching Three basic terms are introduced below for the analysis of branching effects: The action of fetching a nonsequential or remote instruction after a branch instruction is called branch taken. The instruction to be executed after a branch taken is called a branch target. The number of pipeline cycles wasted between a branch taken and the fetching of its branch target is called the delay slot, denoted by b. In general, 0 ≤ b ≤ k − 1, where k is the number of pipeline stages.
   When a branch is taken, all the instructions following the branch in the pipeline become useless and will be drained from the pipeline. This implies that a branch taken causes the pipeline to be flushed, losing a number of useful cycles.
   These terms are illustrated in Fig. 6.18, where a branch taken causes I_{b+1} through I_{b+k−1} to be drained from the pipeline. Let p be the probability of a conditional branch instruction in a typical instruction stream and q the probability of a successfully executed conditional branch instruction (a branch taken). Typical values of p = 20% and q = 60% have been observed in some programs.
   The penalty paid by branching is equal to pqnbτ, because each branch taken costs b extra pipeline cycles. Based on Eq. 6.4, we thus obtain the total execution time of n instructions, including the effect of branching, as follows:

    T_eff = kτ + (n − 1)τ + pqnbτ    (6.11)
   Modifying Eq. 6.9, we define the following effective pipeline throughput with the influence of branching:

    H_eff = n / T_eff = nf / (k + n − 1 + pqnb)    (6.12)

   When n → ∞, the tightest upper bound on the effective pipeline throughput is obtained when b = k − 1:

    H*_eff = f / (pq(k − 1) + 1)    (6.13)
   When p = q = 0 (no branching), the above bound approaches the maximum throughput f = 1/τ, same as in Eq. 6.2. Suppose p = 0.2, q = 0.6, and b = k − 1 = 7. We define the following performance degradation factor:

    D = (f − H*_eff)/f = 1 − 1/(pq(k − 1) + 1) = pq(k − 1)/(pq(k − 1) + 1) = 0.84/1.84 = 0.46    (6.14)
   The above analysis implies that the pipeline performance can be degraded by 46% with branching when the instruction stream is sufficiently long. This analysis demonstrates the high degree of performance degradation caused by branching in an instruction pipeline.
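Since Eqs. 6.11 through 6.14 are simple closed forms, they are easy to check numerically. The sketch below (with τ = 1 and the worked values p = 0.2, q = 0.6, k = 8 assumed from the text) reproduces the 46% degradation figure.

```python
# A minimal numeric check of the branching-penalty formulas in the text.

def effective_throughput(n, k, p, q, b, tau=1.0):
    t_eff = k * tau + (n - 1) * tau + p * q * n * b * tau   # Eq. 6.11
    return n / t_eff                                        # Eq. 6.12

k, p, q, b = 8, 0.2, 0.6, 7
f = 1.0                                 # maximum throughput 1/tau
h_bound = f / (p * q * (k - 1) + 1)     # Eq. 6.13, the n -> infinity bound
D = 1 - h_bound / f                     # Eq. 6.14, degradation factor

print(effective_throughput(10**6, k, p, q, b))   # approaches h_bound ~ 0.543
print(D)                                         # ~ 0.4565, i.e. about 46%
```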
[Figure: (a) an instruction stream flowing into a k-stage pipeline; (b) an instruction stream containing a branch taken I_b, which flushes the following instructions and starts a new instruction flow at the branch target I_t after a delay slot of length up to k − 1. Captions: I_b = branch taken; I_t = branch target; k = no. of pipeline stages; τ = clock cycle (stage delay); b = delay slot size.]
Fig. 6.18 The decision of a branch taken at the last stage of an instruction pipeline causes b ≤ k − 1 previously loaded instructions to be drained from the pipeline
Branch Prediction Branches can be predicted either statically, based on branch code types, or dynamically, based on branch history during program execution. The probability of a branch with respect to a particular branch instruction type can be used to predict the branch. This requires collecting the frequency and probabilities of branch taken and branch types across a large number of program traces. Such a static branch strategy may not be very accurate.
   The static prediction direction (taken or not taken) can even be wired into the processor. According to past experience, the best performance is given by predicting taken. This results from the fact that most conditional branch instructions are taken in program execution. The wired-in static prediction cannot be changed once committed to the hardware. However, the scheme can be modified to allow the compiler to select the direction of each branch on a semi-static prediction basis.
   A dynamic branch strategy works better because it uses recent branch history to predict whether or not the branch will be taken next time when it occurs. To be accurate, one may need to use the entire history of the branch to predict the future choice. This is infeasible to implement. Therefore, most dynamic prediction is determined with limited recent history, as illustrated in Fig. 6.19.
[Figure: (a) branch target buffer organization, with one entry per branch holding the branch instruction address, the branch prediction statistics, and the branch target address; (b) a typical state diagram over the last two branch outcomes. Captions: T = taken branch; N = not-taken branch; NN = last two branches not taken; NT = last branch not taken, previous taken; TT = both last two branches taken; TN = last branch taken, previous not taken.]
Fig. 6.19 Branch history buffer and a state transition diagram used in dynamic branch prediction (Courtesy of Lee and Smith, IEEE Computer, 1984)
   Cragon (1992) classified dynamic branch strategies into three major classes: one class predicts the branch direction based upon information found at the decode stage; the second class uses a cache to store target addresses at the stage where the effective address of the branch target is computed; the third scheme uses a cache to store target instructions at the fetch stage.
   Dynamic prediction demands additional hardware to keep track of the past behavior of the branch instructions at run time. The amount of history recorded should be relatively small. Otherwise, the prediction logic becomes too costly to implement.
   Lee and Smith (1984) suggested the use of a branch target buffer (BTB) to implement branch prediction (Fig. 6.19a). The BTB is used to hold recent branch information, including the address of the branch target used. The address of the branch instruction locates its entry in the BTB.
   A state transition diagram (Fig. 6.19b) has been used by Lee and Smith for tracking the last two outcomes at each branch instruction in a given program. The BTB entry contains the information which will guide the prediction. Prediction information is updated upon completion of the current branch.
   The BTB can be extended to store not only the branch target address but also the target instruction itself, in order to allow zero delay in converting conditional branches to unconditional branches. The taken (T) and not-taken (N) labels in the state diagram correspond to actual program behavior. Further discussion on this topic will be found in Chapter 12.
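The BTB plus two-outcome history can be mimicked with a small dictionary-based model. This is a sketch only: the prediction rule used here (predict taken unless the last two outcomes were both not-taken) is one plausible reading of the state diagram, not necessarily the exact Lee and Smith state assignment, and the addresses are invented.

```python
# Minimal sketch of a branch target buffer indexed by branch address,
# keeping the last two outcomes per branch as in Fig. 6.19b.

class BTB:
    def __init__(self):
        self.table = {}                     # branch address -> (history, target)

    def predict(self, addr):
        history, target = self.table.get(addr, ("TT", None))
        taken = history != "NN"             # NN is the only not-taken state here
        return taken, target

    def update(self, addr, taken, target):
        history, _ = self.table.get(addr, ("TT", None))
        history = history[1] + ("T" if taken else "N")   # shift in newest outcome
        self.table[addr] = (history, target)

btb = BTB()
for outcome in [True, True, False, False, True]:
    pred, tgt = btb.predict(0x400)          # hypothetical branch address
    btb.update(0x400, outcome, 0x480)       # hypothetical target address
    print(f"predicted {'T' if pred else 'N'}, actual {'T' if outcome else 'N'}")
```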
Delayed Branches Examining the branch penalty, we realize that the branch penalty would be reduced significantly if the delay slot could be shortened or minimized to a zero penalty. The purpose of delayed branches is to make this possible, as illustrated in Fig. 6.20.
   The idea was originally used to reduce the branching penalty in coding microinstructions. A delayed branch of d cycles allows at most d − 1 useful instructions to be executed following the branch taken. The execution of these instructions should be independent of the outcome of the branch instruction. Otherwise, a zero branching penalty cannot be achieved.
[Figure: pipeline schedules showing delayed branches of increasing length. (a) A delayed branch for 2 cycles when the branch condition is resolved at the decode stage (1 delay instruction); (b) a delayed branch for 3 cycles when the branch condition is resolved at the execute stage (2 delay instructions); (c) a delayed branch for 4 cycles when the branch condition is resolved at the store stage (3 delay instructions).]
Fig. 6.20 The concept of delayed branch by moving independent instructions or NOP fillers into the delay slot of a four-stage pipeline
   The technique is similar to that used for software interlocking. NOPs can be used as fillers if needed. The probability of moving one instruction (d = 2 in Fig. 6.20a) into the delay slot is greater than 0.6, that of moving two instructions (d = 3 in Fig. 6.20b) is about 0.2, and that of moving three instructions (d = 4 in Fig. 6.20c) is less than 0.1, according to some program trace results.
Example 6.8  A delayed branch with code motion into a delay slot
Code motion across branches can be used to achieve a delayed branch, as illustrated in Fig. 6.21. Consider the execution of a code fragment in Fig. 6.21a. The original program is modified by moving the useful instruction I1 into the delay slot after the branch instruction I3.
[Figure: (a) the original program; (b) the modified program with useful instructions moved into the delay slot.]
Fig. 6.21 Code motion across a branch to achieve a delayed branch with a reduced penalty to pipeline performance
   in case the branch is not taken, the execution of the modified program produces the same results as the
original program. In case the branch is taken in the modified program, execution of the delayed instructions
Il and I5 is needed anyway.
   In general, data dependence between instructions moving across the branch and the remaining instructions being scheduled must be analyzed. Since instruction I1 is independent of the remaining instructions, leaving it in the delay slot will not create logic hazards or data dependences.
   Sometimes NOP fillers can be inserted in the delay slot if no useful instructions can be found. However, inserting NOP fillers does not save any cycles in the delayed branch operation. From the above analysis, one can conclude that delayed branching may be more effective in short instruction pipelines with about four stages. Delayed branching has been built into some RISC processors, including the MIPS R4000 and Motorola MC88110.
Floating-Point Numbers A floating-point number X is represented by a pair (m, e), where m is the mantissa (or fraction) and e is the exponent with an implied base (or radix). The algebraic value is represented as X = m × r^e. The sign of X can be embedded in the mantissa.
[Figure: the 32-bit floating-point word. Bit 0 holds the sign s, bits 1 through 8 hold the exponent e, and bits 9 through 31 hold the mantissa m.]
   A binary base is assumed with r = 2. The 8-bit exponent e field uses an excess-127 code. The dynamic range of e is (−127, 128), internally represented as [0, 255]. The sign s and the 23-bit mantissa field m form a 25-bit sign-magnitude fraction, including an implicit or "hidden" 1 bit to the left of the binary point. Thus the complete mantissa actually represents the value 1.m in binary.
   This hidden bit is not stored with the number. If 0 < e < 255, then a nonzero normalized number represents the following algebraic value:

    X = (−1)^s × 2^(e − 127) × (1.m)    (6.15)
   When e = 255 and m ≠ 0, a not-a-number (NaN) is represented. NaNs can be caused by dividing a zero by a zero or taking the square root of a negative number, among many other nondeterminate cases. When e = 255 and m = 0, an infinite number X = (−1)^s × ∞ is represented. Note that +∞ and −∞ are represented differently.
   When e = 0 and m ≠ 0, the number represented is denormalized, with value X = (−1)^s × 2^(−126) × (0.m). When e = 0 and m = 0, a zero is represented as X = (−1)^s × 0. Again, +0 and −0 are possible.
   The 64-bit (double-precision) floating point can be defined similarly using an excess-1023 code in the exponent field and a 52-bit mantissa field. A number which is nonzero, finite, non-NaN, and normalized has the following value:

    X = (−1)^s × 2^(e − 1023) × (1.m)    (6.16)

   Special rules are given in the standard to handle overflow or underflow conditions. Interested readers may check the published IEEE standards for details.
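Equations 6.15 and 6.16 can be verified directly by unpacking the bit fields of a machine float. A minimal sketch using Python's standard struct module for the 32-bit format follows; the helper name is ours.

```python
# Decode the 1/8/23-bit fields of a 32-bit float and apply Eq. 6.15.
import struct

def decode_single(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31                       # sign bit
    e = (bits >> 23) & 0xFF              # 8-bit excess-127 exponent
    m = bits & 0x7FFFFF                  # 23-bit mantissa field
    if 0 < e < 255:                      # normalized: (-1)^s * 2^(e-127) * 1.m
        return (-1) ** s * 2.0 ** (e - 127) * (1 + m / 2 ** 23)
    if e == 0:                           # denormalized: 2^(-126) * 0.m (zero if m == 0)
        return (-1) ** s * 2.0 ** (-126) * (m / 2 ** 23)
    return float("nan") if m else (-1) ** s * float("inf")   # e == 255

print(decode_single(6.5))                # 6.5: s = 0, e = 129, 1.m = 1.625
```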
Floating-Point Operations The four primitive arithmetic operations are defined below for a pair of floating-point numbers represented by X = (m_x, e_x) and Y = (m_y, e_y). For clarity, we assume e_x ≤ e_y and base r = 2:

    X + Y = (m_x × 2^(e_x − e_y) + m_y) × 2^(e_y)    (6.17)
    X − Y = (m_x × 2^(e_x − e_y) − m_y) × 2^(e_y)    (6.18)
    X × Y = (m_x × m_y) × 2^(e_x + e_y)    (6.19)
    X ÷ Y = (m_x ÷ m_y) × 2^(e_x − e_y)    (6.20)

The add and subtract operations require that the fraction of the smaller-exponent operand be aligned before the fractions are combined and the result normalized; multiply and divide operate on the fractions and exponents independently.
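As a quick illustration of Eq. 6.17, the sketch below adds two (mantissa, exponent) pairs by aligning the smaller-exponent fraction, adding, and renormalizing. Guard bits and rounding, which a real unit must handle, are deliberately ignored here.

```python
# Floating-point add on (mantissa, exponent) pairs with r = 2; fractions
# are modeled as Python floats with magnitude in [0.5, 1).

def fp_add(mx, ex, my, ey):
    if ex > ey:                          # ensure ex <= ey, as assumed in the text
        mx, ex, my, ey = my, ey, mx, ex
    m = mx * 2.0 ** (ex - ey) + my       # align X's fraction, then add (Eq. 6.17)
    e = ey
    while abs(m) >= 1.0:                 # postnormalize the result fraction
        m, e = m / 2.0, e + 1
    while 0 < abs(m) < 0.5:
        m, e = m * 2.0, e - 1
    return m, e

print(fp_add(0.75, 3, 0.5, 4))           # 6 + 8 = 14 -> (0.875, 4)
```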
[Figure: (a) an n-bit carry-propagate adder (CPA), which adds two n-bit numbers A and B into one sum S, either propagating carries or applying the carry-lookahead technique; (b) an n-bit carry-save adder (CSA), where S^b is the bitwise sum of X, Y, and Z, and C is a carry vector generated without carry propagation between digits.]
Fig. 6.22 Distinction between a carry-propagate adder (CPA) and a carry-save adder (CSA)
   A CSA receives three n-bit input numbers X, Y, and Z and outputs an n-bit bitwise sum vector S^b and a carry vector C, defined bitwise by

    S_i^b = X_i ⊕ Y_i ⊕ Z_i and C_{i+1} = X_i Y_i ∨ Y_i Z_i ∨ X_i Z_i, with C_0 = 0,
for i = 0, 1, 2, ..., n − 1, where ⊕ is the exclusive OR and ∨ is the logical OR operation. Note that the arithmetic sum of three input numbers, i.e., S = X + Y + Z, is obtained by adding the two output numbers, i.e., S = S^b + C, using a CPA. We use the CPA and CSAs to implement the pipeline stages of a fixed-point multiply unit as follows.
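Because Python integers behave as arbitrary-width bit vectors, one CSA step and the final CPA step can be checked in a few lines; the example operands are arbitrary.

```python
# One carry-save addition step followed by the carry-propagate add.
# The bitwise operators ^, & and | implement the CSA equations above.

def csa(x, y, z):
    sb = x ^ y ^ z                             # bitwise sum, no carry propagation
    c = ((x & y) | (y & z) | (x & z)) << 1     # carry vector, shifted one position
    return sb, c

x, y, z = 0b1011, 0b0110, 0b1101               # 11, 6, 13
sb, c = csa(x, y, z)
assert sb + c == x + y + z                     # the CPA step: S = S^b + C
print(bin(sb), bin(c), sb + c)                 # 0b0 0b11110 30
```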
Multiply Pipeline Design Consider as an example the multiplication of two 8-bit integers A × B = P, where P is the 16-bit product. This fixed-point multiplication can be written as the summation of eight partial products as shown below: P = A × B = P_0 + P_1 + P_2 + ... + P_7, where × and + are arithmetic multiply and add operations, respectively.
[Figure: the binary long-multiplication layout for A × B, listing the eight shifted partial products P_0 through P_7 and their 16-bit sum P.]
   Note that the partial product P_j is obtained by multiplying the multiplicand A by the jth bit of B and then shifting the result j bits to the left, for j = 0, 1, 2, ..., 7. Thus P_j is (8 + j) bits long with j trailing zeros. The summation of the eight partial products is done with a Wallace tree of CSAs plus a CPA at the final stage, as shown in Fig. 6.23.
   The first stage (S1) generates all eight partial products, ranging from 8 bits to 15 bits, simultaneously. The second stage (S2) is made up of two levels of four CSAs, and it essentially merges eight numbers into four numbers ranging from 13 to 15 bits. The third stage (S3) consists of two CSAs, and it merges four numbers from S2 into two 16-bit numbers. The final stage (S4) is a CPA, which adds up the last two numbers to produce the final product P.
[Figure: the four-stage multiply pipeline. Stage S1 contains the multiplier recoding logic that generates the eight partial products; stage S2 contains two levels of four CSAs; stage S3 contains two CSAs; stage S4 is the final 16-bit CPA producing P = A × B. Captions: CSA = carry-save adder; CPA = carry-propagate adder.]
Fig. 6.23 A pipeline unit for fixed-point multiplication of 8-bit integers (The number along each line indicates the line width.)
   For a maximum width of 16 bits, the CPA is estimated to need four gate levels of delay. Each level of the CSA can be implemented with two-gate-level logic. The delay of the first stage (S1) also involves two gate levels. Thus all the pipeline stages have an approximately equal amount of delay.
   The matching of stage delays is crucial to the determination of the number of pipeline stages, as well as the clock period (Eq. 6.1). If the delay of the CPA stage can be further reduced to match that of a single CSA level, then the pipeline can be divided into six stages with a clock rate twice as fast. The basic concepts can be extended to operands with a larger number of bits, as we see in the example below.
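The four-stage organization of Fig. 6.23 can be imitated in software. In the sketch below, the 8 → 4 → 2 reduction is done by repeatedly applying one CSA to three remaining numbers, a simple sequential stand-in for the two-level tree of the figure; the function names are ours.

```python
# Stage S1: generate the eight shifted partial products.
# Stages S2/S3: reduce eight numbers to two with carry-save adders.
# Stage S4: one final carry-propagate addition.

def csa(x, y, z):
    return x ^ y ^ z, ((x & y) | (y & z) | (x & z)) << 1

def multiply_8bit(a, b):
    # Pj = (jth bit of B) * A, shifted j positions left
    pp = [((b >> j) & 1) * (a << j) for j in range(8)]
    while len(pp) > 2:                   # each CSA turns 3 numbers into 2
        sb, c = csa(pp[0], pp[1], pp[2])
        pp = pp[3:] + [sb, c]
    return pp[0] + pp[1]                 # final CPA

assert multiply_8bit(181, 147) == 181 * 147
print(multiply_8bit(181, 147))           # 26607, the 16-bit product
```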
Example 6.10  The floating-point unit in the Motorola MC68040

[Figure: the three-stage floating-point unit. Stage 1 contains the register file and a 64-bit × 8-bit multiplier array; stage 2 contains a 67-bit adder and a 67-bit barrel shifter; stage 3 contains the result registers. A separate exponent section runs alongside the mantissa section.]
Fig. 6.24 Pipelined floating-point unit of the Motorola MC68040 processor (Courtesy of Motorola Inc., 1991)
   This arithmetic pipeline has three stages. The mantissa section and exponent section are essentially two
separate pipelines. The mantissa section can perform floating-point add or multiply operations, either single-precision (32 bits) or double-precision (64 bits).
   In the mantissa section, stage 1 receives input operands and returns with computation results; 64-bit registers are used in this stage. Note that all three stages are connected to two 64-bit data buses. Stage 2 contains the array multiplier (64 × 8) which must be repeatedly used to carry out a long multiplication of the two mantissas.
   The 67-bit adder performs the addition/subtraction of two mantissas, and the barrel shifter is used for normalization. Stage 3 contains registers for holding results before they are loaded into the register file in stage 1 for subsequent use by other instructions.
   On the exponent side, a 16-bit bus is used between stages. Stage 1 has an exponent adder for comparing the relative magnitude of two exponents. The result of stage 1 is used to equalize the exponents before mantissa addition can be performed. Therefore, a shift count (from the output of the exponent adder) is sent to the barrel shifter for mantissa alignment.
    After normalization of the final result (getting rid of leading zeros), the exponent needs to be readjusted in
stage 3 using another adder. The final value of the resulting exponent is fed from the register in stage 3 to the
register file in stage 1, ready for subsequent usage.
Convergence Division One technique for division involves repeated multiplications. Mantissa division is carried out by a convergence method. This convergence division obtains the quotient Q = M/D of two normalized fractions 0.5 ≤ M < D < 1 in two's complement notation by performing two sequences of chain multiplications as follows:

    Q = (M × R_1 × R_2 × ... × R_k) / (D × R_1 × R_2 × ... × R_k)
where the successive multipliers are

    R_i = 1 + δ^(2^(i−1)) for i = 1, 2, ..., k, and δ = 1 − D
   The purpose is to choose R_i such that the denominator D^(k) = D × R_1 × R_2 × ... × R_k → 1 for a sufficient number k of iterations; the resulting numerator M × R_1 × R_2 × ... × R_k then approaches Q.
   Note that the multiplier R_i can be obtained by finding the two's complement of the previous chain product D^(i−1) = D × R_1 × ... × R_{i−1} = 1 − δ^(2^(i−1)), because 2 − D^(i−1) = R_i. The reason why D^(i) → 1 for large k is that

    D^(i) = (1 − δ)(1 + δ)(1 + δ^2)(1 + δ^4) ... (1 + δ^(2^(i−1)))
          = (1 − δ^2)(1 + δ^2)(1 + δ^4) ... (1 + δ^(2^(i−1)))
          = 1 − δ^(2^i) for i = 1, 2, ..., k    (6.23)

   Since 0.5 ≤ D < 1, we have δ = 1 − D ≤ 0.5, so δ^(2^i) → 0 as i becomes sufficiently large, say i = k for some k; thus D^(k) = 1 − δ^(2^k) ≈ 1 for large k. The end result is

    Q = M × (1 + δ) × (1 + δ^2) × ... × (1 + δ^(2^(k−1)))    (6.24)
   The above two sequences of chain multiplications are carried out alternately between the numerator and
denominator through the pipeline stages. To summarize, in this technique division is carried out by repeated
multiplications. Thus divide and multiply can share the same hardware pipeline.
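The convergence iteration is short enough to sketch directly. In the example below, the values M = 0.6, D = 0.75, and k = 5 iterations are arbitrary choices used to show the quadratic convergence of the denominator chain product toward 1.

```python
# Convergence division Q = M/D for normalized fractions 0.5 <= M < D < 1,
# using only multiplications. Each multiplier R_i is the two's complement
# (2 - x) of the current denominator chain product.

def convergence_divide(m, d, k=5):
    num, den = m, d
    for _ in range(k):           # each iteration squares delta = 1 - D
        r = 2.0 - den            # R_i = 2 - D^(i-1) = 1 + delta^(2^(i-1))
        num *= r
        den *= r                 # den -> 1 - delta^(2^i), converging to 1
    return num

m, d = 0.6, 0.75
print(convergence_divide(m, d))  # ~0.8 after 5 iterations
print(m / d)                     # 0.8 for comparison
```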
Example 6.11  The IBM 360 Model 91 floating-point unit design
In the history of building scientific computers, the IBM 360 Model 91 was certainly a milestone. Many of the pipeline design features introduced in previous sections were implemented in this machine. Therefore, it is worth the effort to examine the architecture of Model 91. In particular, we describe how floating-point add and multiply/divide operations were implemented in this machine.
   As shown in Fig. 6.25, the floating-point execution unit in Model 91 consisted of two separate functional pipelines: the add unit and the multiply/divide unit, which could be used concurrently. The former was a two-stage pipeline, and the latter was a six-stage pipeline.
[Figure: the floating-point execution unit, fed from the storage unit and the instruction unit through a floating-point operation stack, floating-point buffers (FLB), and floating-point registers (FLR). The two-stage add unit and the six-stage multiply/divide unit take operands from reservation stations and return results over the common data bus, with store data buffers leading back to the storage unit. Captions: CDB = common data bus; RS = reservation station, each identified by a unique tag number; CSA = carry-save adder; CPA = carry-propagate adder.]
Fig. 6.25 The IBM 360 Model 91 floating-point unit (Courtesy of IBM Corporation, 1967)
   The floating-point operation stack was a kind of prefetch buffer holding eight floating-point instructions
for subsequent execution through the two functional pipelines. The floating-point buffers were used to input
operands.
   Operands could also come from the floating-point registers, which were connected via the common data bus to the output bus. Results from the two functional units could be sent back to the memory via the store data buffers, or they could be routed back to the FLR or to the reservation stations at the input ends.
   The add unit allowed three pairs of operands to be loaded into three reservation stations. Only one pair could be used at a time. The other two pairs held operands for subsequent use. The use of these reservation stations made the add unit behave like three virtual functional units.
   Similarly, the two pairs at the input end of the multiply/divide unit made it behave like two virtual units. Internal data forwarding in Model 91 was accomplished using source tags on all registers and reservation stations. Divide was implemented in Model 91 based on the convergence method.
   Every source of an input operand was uniquely identified with a 4-bit tag. Every destination of an input operand had an associated tag register that held the tag naming the source of data if the destination was busy. Through this register tagging technique, operands/results could be directly passed among the virtual functional units. This forwarding significantly cut down the data flow time between them.
   Dynamic scheduling logic was built into Model 91 using Tomasulo's algorithm to resolve the data dependence problem. Either the add unit or the multiply/divide unit could execute an operation using operands from one of the reservation stations.
   Under Tomasulo's algorithm, data dependences are preserved by copying source tags when the sources are busy. When data is generated by a source, it passes its identification and the data onto the common data bus. Awaiting destinations continuously monitor the bus in a tag watch.
   When the source tag matches, the destination takes in the data from the bus. Other variations of Tomasulo's algorithm can be made to store the source tags within the destinations, to use a special tag (such as 0000) to indicate nonbusy registers/buffers, or to use direct-mapped tags to avoid associative hardware.
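The tag-watch mechanism can be caricatured in a few lines. In this sketch the class name, the 4-bit tag values, and the broadcast data are invented for illustration; only the match-and-capture behavior corresponds to the description above.

```python
# Destinations that are waiting on a busy source hold that source's tag
# and snoop the common data bus until a broadcast tag matches.

class Destination:
    def __init__(self, waiting_tag):
        self.waiting_tag = waiting_tag   # tag copied at issue time
        self.value = None

    def snoop(self, tag, data):
        # Every destination monitors the CDB; a tag match captures the data.
        if self.waiting_tag == tag and self.value is None:
            self.value = data
            self.waiting_tag = None

# Two reservation-station operands; only the first waits on tag 0b0011.
dests = [Destination(0b0011), Destination(0b0101)]
cdb_broadcast = (0b0011, 42.0)           # producer puts (tag, result) on the bus
for dst in dests:
    dst.snoop(*cdb_broadcast)
print([d.value for d in dests])          # [42.0, None]: only the match fires
```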
   Besides the IBM 360/370, the CDC 6600/7600 also implemented convergence division. It took two pipeline cycles to perform the floating-point add, six cycles to multiply, and 18 cycles to divide in the IBM System/360 Model 91, due to the five iterations involved in the convergence division process.
Example 6.12  The TI-ASC arithmetic processor design
There were four pipeline arithmetic units built into the TI-ASC system, as shown in Fig. 6.26. The instruction-processing unit handled the fetching and decoding of instructions. There were a large number of working registers in the processor, which also controlled the operations of the memory buffer unit and of the arithmetic units.
   There were two sets of operand buffers, {X, Y, Z} and {X′, Y′, Z′}, in each arithmetic unit. X′, X, Y′, and Y were used for input operands, and Z′ and Z were used to output results. Note that intermediate results could also be routed from Z-registers to either X- or Y-registers. Both processor and memory buffers accessed the main memory for instructions and operands/results, respectively.
   Each pipeline arithmetic unit had eight stages, as shown in Fig. 6.27a. The PAU was a static multifunction pipeline which could perform only one function at a time. Figure 6.27a shows all the possible interstage connections for performing arithmetic, logical, shifting, and data conversion functions.
[Figure: the ASC system organization. The instruction-processing unit (IPU) fetches and decodes instructions through an instruction buffer and a set of working registers; the memory buffer unit (MBU) moves operands and results between the main memory and the four pipeline arithmetic units (PAUs).]
Fig. 6.26 The architecture of the TI Advanced Scientific Computer (ASC) (Courtesy of Texas Instruments, Inc.)
   Both fixed-point and floating-point arithmetic functions could be performed by this pipeline. The PAU also supported vector in addition to scalar arithmetic operations. It should be noted that different functions required different pipeline stages and different interstage connection patterns.
[Figure: the eight-stage PAU. The stages are: S1 input; S2 exponent subtract; S3 align; S4 fraction add; S5 normalize; S6 fraction multiply; S7 accumulate; S8 output. Panel (a) shows all possible interstage connections, panel (b) the connections used for fixed-point multiplication (R = A × B), and panel (c) the connections used for the floating-point dot product (R = Σ A_i × B_i).]
Fig. 6.27 The multiplication arithmetic pipeline of the TI Advanced Scientific Computer and the interstage connections of two representative functions (shaded stages are unutilized)
   For example, fixed-point multiplication required the use of only segments S1, S6, S7, and S8, as shown in Fig. 6.27b. On the other hand, the floating-point dot product function, which performs the dot product operation between two vectors, required the use of all segments with the complex connections shown in Fig. 6.27c. This dot product was implemented by essentially the following accumulated summation of a sequence of multiplications through the pipeline:

    Z ← Z + A_i × B_i
where the successive operands (A_i, B_i) were fed through the X- and Y-buffers, and the accumulated sums through the Z-buffer recursively.
   The entire pipeline could perform the multiply (×) and the add (+) in a single flow through the pipeline. The two levels of buffer registers isolated the loading and fetching of operands to or from the PAU, respectively, as in the concept of using a pair of prefetch buffers described in Fig. 6.11.
   Even though the TI-ASC is no longer in production, the system provided a unique design for multifunction arithmetic pipelines. Today, most supercomputers implement arithmetic pipelines with dedicated functions for much simplified control circuitry and faster operations.
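Functionally, the dot-product flow of Fig. 6.27c reduces to the recurrence Z ← Z + A_i × B_i. A minimal sequential sketch (the operand values are invented) makes the roles of the X-, Y-, and Z-buffers explicit.

```python
# Accumulated multiply-add as performed by the ASC pipeline: one multiply
# and one add per element pair, with the running sum recirculated.

def dot_product(a_stream, b_stream):
    z = 0.0                                     # Z-buffer holds the running sum
    for a_i, b_i in zip(a_stream, b_stream):    # X- and Y-buffers feed operands
        z = z + a_i * b_i                       # Z <- Z + A_i * B_i per flow
    return z

print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 32.0
```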
Machine type                         | Scalar base machine of k pipeline stages | Superscalar machine of degree m
Machine pipeline cycle               | 1 (base cycle)                           | 1
Instruction issue rate               | 1                                        | m
Instruction issue latency            | 1                                        | 1
Simple operation latency             | 1                                        | 1
ILP to fully utilize the pipeline    | 1                                        | m

Note: All timing is relative to the base cycle of the scalar base machine. ILP: instruction-level parallelism.
   We study below the structure of superscalar pipelines, the data dependence problem, the factors causing pipeline stalling, and multi-instruction issuing mechanisms for achieving parallel pipelining operations. For a superscalar machine of degree m, m instructions are issued per cycle and the ILP should be m in order to fully utilize the pipeline. As a matter of fact, the scalar base processor can be considered a degenerate case of a superscalar processor of degree 1.
Superscalar Pipeline Structure In an m-issue superscalar processor, the instruction decoding and execution resources are increased to form effectively m pipelines operating concurrently. At some pipeline stages, the functional units may be shared by multiple pipelines.