MODULE V
Instruction Pipeline Design, Arithmetic Pipeline Design – Superscalar Pipeline Design
    The phases of instruction execution are ideal for overlapped execution on a linear pipeline; each phase may require
    one or more clock cycles to execute, depending on the instruction type and the processor/memory architecture.
   Decode stage – reveals the function to be performed by the instruction and identifies the resources needed.
    Resources include registers, buses, functional units, etc.
        Figure above shows the flow of machine instructions through a typical pipeline. These eight
         instructions are for pipelined execution of the high-level language statements
         X = Y + Z and A = B * C.
        Here we have assumed that load and store instructions take four execution clock
         cycles, while floating-point add and multiply operations take three cycles.
   Figure (b) above illustrates the issue of instructions following the original program order. The shaded
    boxes correspond to idle cycles when instruction issue is blocked due to resource latencies or conflicts
    or due to data dependences.
   The first two load instructions issue on consecutive cycles.
   The add is dependent on both loads and must wait three cycles before the data (Y and Z) are loaded in.
   Similarly, the store of the sum to memory location X must wait three cycles for the add to finish due to a
    flow dependence.
 There are similar blockages during the calculation of A. The total time required is 18 clock
 cycles, measured from cycle 4, when the first instruction starts execution, to cycle 21, when the last
 instruction starts execution. This timing measure eliminates the undue effects of the pipeline
 "startup" or "draining" delays.
 Figure (c) above shows an improved timing after the instruction issuing order is changed to eliminate
 unnecessary delays due to dependence. The idea is to issue all four load operations in the beginning. Both
 the add and multiply instructions are blocked for fewer cycles due to this data prefetching. The reordering
 should not change the end results. The time required is reduced to 11 cycles, measured from cycle 4 to
 cycle 14.
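To make the issue-timing argument concrete, here is a small Python sketch (not from the original notes) of an in-order issue model using the latencies assumed above: loads/stores take 4 cycles, floating-point add/multiply take 3. The model is simpler than the figure's multi-stage pipeline, so its absolute cycle counts differ, but it shows the same effect: issuing all four loads first shortens the schedule.

```python
# Sketch: in-order single-issue timing with stalls on flow dependences.
# Assumed latencies (from the example above): load/store = 4, FP add/mul = 3.
# Each instruction is (name, dest, sources, latency).

def schedule(instrs):
    ready = {}                # register/location -> cycle its value is ready
    issue = 0                 # cycle at which the previous instruction issued
    for name, dest, srcs, lat in instrs:
        # issue at least one cycle after the previous instruction,
        # and only once all source operands are available
        issue = max([issue + 1] + [ready.get(r, 0) for r in srcs])
        ready[dest] = issue + lat
        print(f"  {name:<14} issues at cycle {issue}")
    return max(ready.values())

# X = Y + Z and A = B * C in original program order
prog = [("LD  R1,Y", "R1", [], 4),
        ("LD  R2,Z", "R2", [], 4),
        ("ADD R3,R1,R2", "R3", ["R1", "R2"], 3),
        ("ST  X,R3", "X", ["R3"], 4),
        ("LD  R4,B", "R4", [], 4),
        ("LD  R5,C", "R5", [], 4),
        ("MUL R6,R4,R5", "R6", ["R4", "R5"], 3),
        ("ST  A,R6", "A", ["R6"], 4)]

reordered = [prog[i] for i in (0, 1, 4, 5, 2, 3, 6, 7)]  # all loads first

print("original finishes at", schedule(prog))        # later finish
print("reordered finishes at", schedule(reordered))  # earlier finish
```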
Prefetch buffers – three types are used:
        1. Sequential buffer
        2. Target buffer
        3. Loop buffer
In a single memory access time, a block of consecutive instructions is fetched into a prefetch buffer.
   -       Sequential instructions are loaded into a pair of sequential buffers for in-sequence pipelining.
   -       Instructions from a branch target are loaded into a pair of target buffers for out-of-sequence
           pipelining.
   -       Both buffer pairs operate in FIFO (first-in, first-out) fashion.
       -   A conditional branch instruction causes both the sequential buffers and the target buffers to fill with
           instructions. After the branch condition is checked, the appropriate instructions are taken from one of
           the two buffers, and the instructions in the other buffer are discarded.
       -   Within each buffer pair, one buffer may be used to load instructions from memory while the other
           feeds instructions into the pipeline; this prevents collisions between instructions flowing into and
           out of the pipeline.
       -   Loop buffer – holds sequential instructions within a loop; it is maintained by the fetch stage.
       -   Prefetched instructions in the loop body will be executed repeatedly until all iterations complete execution.
   The loop buffer operates in two steps:
   -       First, it contains instructions sequentially ahead of the current instruction, thus saving instruction
           fetch time from memory.
   -       Second, it recognizes when the target of a branch falls within the loop boundary. In this case, unnecessary
           memory accesses can be avoided if the target instruction is already in the loop buffer.
- Internal data forwarding: memory access operations can be replaced by register transfer operations to save time.
 -     Store-Load Forwarding: a load operation (LD R2, M) that immediately follows a store (ST M, R1) to the
       same memory location M can be replaced by a move operation (MOVE R2, R1) from register R1 to R2, since
       a register transfer is faster than a memory access and also reduces memory traffic, resulting in
       shorter execution time.
 -     Load-Load Forwarding: given two loads from the same location (LD R1, M followed by LD R2, M), we can
       eliminate the second load operation and replace it with a move operation (MOVE R2, R1).
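As a small illustration (using a hypothetical three-field instruction encoding, not from the notes), a peephole pass implementing the two forwarding rules above might look like this:

```python
# Toy peephole pass illustrating store-load and load-load forwarding.
# Instructions are tuples: ("ST", mem, reg), ("LD", reg, mem), ("MOVE", dst, src).

def forward(code):
    out = []
    for ins in code:
        prev = out[-1] if out else None
        if ins[0] == "LD" and prev:
            op, reg, mem = ins
            # store-load: ST M,R1 ; LD R2,M  ->  ST M,R1 ; MOVE R2,R1
            if prev[0] == "ST" and prev[1] == mem:
                out.append(("MOVE", reg, prev[2]))
                continue
            # load-load: LD R1,M ; LD R2,M  ->  LD R1,M ; MOVE R2,R1
            if prev[0] == "LD" and prev[2] == mem:
                out.append(("MOVE", reg, prev[1]))
                continue
        out.append(ins)
    return out

print(forward([("ST", "M", "R1"), ("LD", "R2", "M")]))
# [('ST', 'M', 'R1'), ('MOVE', 'R2', 'R1')]
```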
 Example: implementing the dot-product operation with internal data forwarding between a multiply unit
 and an add unit.
 One can feed the output of the multiplier directly to the input of the adder (Fig. 6.14) for
 implementing the following dot-product operation:
        s = Σ_{i=1}^{n} a_i × b_i
 Without internal data forwarding between the two functional units, the three instructions must be
 executed sequentially in a looping structure (Fig. a below).
 With data forwarding, the output of the multiplier is fed directly into the input register R4 of the
 adder (Fig. b). At the same time, the output of the multiplier is also routed to register R3. Internal
    data forwarding between the two functional units thus reduces the total execution time through
    the pipelined processor.
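A back-of-the-envelope sketch of why forwarding helps, with assumed per-operation cycle counts (the real figures depend on the pipeline):

```python
# Sketch: per-iteration latency of s = s + a[i] * b[i], with assumed latencies
# multiply = 2 cycles, register move = 1 cycle, add = 2 cycles.

MUL, MOVE, ADD = 2, 1, 2

# Without forwarding: MUL writes R3, MOVE copies R3 to the adder input R4,
# then ADD executes -- three sequential operations per iteration.
without = MUL + MOVE + ADD

# With forwarding: the multiplier output is routed straight into R4 (and to
# R3 at the same time), so the explicit MOVE disappears from the loop.
with_fw = MUL + ADD

print(without, with_fw)   # 5 4 cycles per iteration
```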
-    A certain pipeline stage may become a bottleneck – this stage corresponds to the row with the maximum
     number of checkmarks in the reservation table.
-    The bottleneck can be relieved by using multiple copies of the same stage simultaneously, i.e., by the
     use of multiple execution units in a pipelined processor design.
    FIG – A pipelined processor with multiple FU’s and distributed RS’s supported by Tagging.
-       In order to resolve data or resource dependences among successive instructions entering the pipeline,
        Reservation Stations (RSs) are used with each functional unit.
-       Operands wait in the RS until their data dependences are resolved.
-       Each RS is uniquely identified by a tag, which is monitored by a tag unit.
-       The tag unit keeps checking the tags from all currently used registers or RSs.
-       RSs also serve as buffers to interface the pipelined functional units with the decode and issue unit.
-       The multiple functional units operate in parallel once the dependences are resolved.
    2.4 Hazard Avoidance
    Qn: Define hazards with respect to a pipeline. Describe the various categories of hazards.
    -   The read and write of shared variables by different instructions in a pipeline may lead to different results
        if these instructions are executed out of order.
    -   Three types of logic hazards are possible, as shown in the figure below:
    -    Consider instructions I and J, where J logically follows I according to program order.
    -    If the actual execution order of these two instructions violates the program order, incorrect
         results may be read or written, thereby producing hazards.
    -    Hazards should be prevented before these instructions enter the pipeline, e.g., by holding instruction
         J until the dependence on instruction I is resolved.
    -    D is the domain of an instruction – its input set, e.g. D(I) and D(J).
    -    R is the range of an instruction – its output set, e.g. R(I) and R(J).
    -    The three hazard conditions (see the sketch after this list):
                 RAW (flow) hazard:    R(I) ∩ D(J) ≠ ∅
                 WAR (anti) hazard:    D(I) ∩ R(J) ≠ ∅
                 WAW (output) hazard:  R(I) ∩ R(J) ≠ ∅
         Example: I: a = b + c and J: e = a + d give R(I) ∩ D(J) = {a} ≠ ∅, a RAW hazard.
-       Resolution of hazard conditions can be checked by special hardware while instructions are being loaded into
        the prefetch buffer.
-       A special tag bit can be used with each operand register to indicate safe or hazard-prone operands.
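A minimal sketch of the domain/range check (set names follow D and R above; the encoding is illustrative):

```python
# Sketch: detecting the three hazard types from instruction domains (inputs)
# and ranges (outputs), per the set conditions above.

def hazards(dom_i, rng_i, dom_j, rng_j):
    """I logically precedes J; returns the hazards reordering would cause."""
    found = []
    if rng_i & dom_j:
        found.append("RAW (flow)")      # J reads what I writes
    if dom_i & rng_j:
        found.append("WAR (anti)")      # J writes what I reads
    if rng_i & rng_j:
        found.append("WAW (output)")    # both write the same location
    return found

# I: a = b + c        J: e = a + d
print(hazards(dom_i={"b", "c"}, rng_i={"a"},
              dom_j={"a", "d"}, rng_j={"e"}))   # ['RAW (flow)']
```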
    Two techniques for dynamic instruction scheduling:
                       1. Tomasulo's Algorithm
2. CDC scoreboarding
STATIC Scheduling
          Considering the execution of the code, the multiply instruction cannot be started until the preceding
           load is complete.
          This stalls the pipeline for 3 clock cycles (since the 2 loads overlap by one cycle).
          The 2 loads are independent of the add and move instructions, so we can move them ahead to increase
           the spacing between them and the multiply instruction.
          Through this code rearrangement, data dependences and program semantics are preserved, and the
           multiply can be started without delay.
          Compiler-based software interlocking is cheaper to implement and flexible to apply.
      If a source register is busy when an instruction reaches the issue stage, the tag for the source register
       is forwarded to an RS.
      When the register becomes available, the tag can signal availability.
      This value is copied into all reservation stations that have a matching tag; operand forwarding is thus
       achieved with the use of tags.
      All destinations which require a data value receive it in the same clock cycle over the common data
       bus, by matching stored operand tags with source tag sent over the bus.
      The fig above shows a functional unit connected to common data bus with three reservation
       stations provided on it.
      When the needed operand values are available in reservation station, the functional unit can initiate
       the required operation in the next clock cycle.
      At the time of instruction issue, the reservation station is filled out with the operation code (op).
          o If an operand value is available in a programmable register, it is transferred to the
              corresponding source operand field in the reservation station. The instruction waits until its
              data dependences are resolved and its operands become available. Dependence is resolved by
              monitoring the result bus; when all operands of an instruction are available, it is dispatched
              to the functional unit for execution.
          o If the operand value is not available at the time of issue, the corresponding source tag (t1
              and/or t2) is copied into the reservation station. The source tag identifies the source of the
              required operand. As soon as the required operand is available at its source – typically the
              output of a functional unit – the data value is forwarded over the common data bus along with the
              source tag (a toy sketch of this tag-matching broadcast follows).
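A toy sketch of that broadcast step (the v1/t1/v2/t2 field names are illustrative, not the Model 91's actual encoding): a tagged result on the common data bus fills every waiting operand slot whose tag matches, in one step.

```python
# Sketch: reservation-station entries wait on source tags; a result broadcast
# on the common data bus fills every matching operand field at once.

class RSEntry:
    def __init__(self, op, v1=None, t1=None, v2=None, t2=None):
        self.op = op
        self.v1, self.t1 = v1, t1   # operand value or pending source tag
        self.v2, self.t2 = v2, t2

    def ready(self):
        return self.t1 is None and self.t2 is None

def broadcast(stations, tag, value):
    """Result (tag, value) appears on the common data bus."""
    for rs in stations:
        if rs.t1 == tag:
            rs.v1, rs.t1 = value, None
        if rs.t2 == tag:
            rs.v2, rs.t2 = value, None

# ADD waits for the load tagged "LD1"; MUL already has both operands.
stations = [RSEntry("ADD", t1="LD1", v2=3.0),
            RSEntry("MUL", v1=2.0, v2=5.0)]
broadcast(stations, "LD1", 7.0)
print([(rs.op, rs.ready()) for rs in stations])  # [('ADD', True), ('MUL', True)]
```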
             Tomasulo's algorithm was applied to work with processors having only a few floating-point
              registers. In the case of the Model 91, only four such registers were available.
             Fig. a below shows a minimum-register machine code for computing X = Y + Z and A = B
              * C. The pipeline timing with Tomasulo's algorithm appears in Fig. b.
             Here, the total execution time is 13 cycles, counting from cycle 4 to cycle 15 and ignoring
              the pipeline startup and draining times.
       Memory is treated as a special functional unit. When an instruction has completed execution, the
       result (along with its tag) appears on the result bus. The registers as well as the RSs monitor the
       result bus and update their contents (and ready/busy bits) when a matching tag is found.
      The figure above shows a CDC 6600-like processor that uses dynamic instruction scheduling hardware.
      Here the multiple functional units appear as multiple execution pipelines.
      Parallel units allow instructions to complete out of the original program order.
      The processor had instruction buffers for each execution unit.
      Instructions are issued to available FUs regardless of whether register input data are available.
      To control the correct routing of data between execution units and registers, the CDC 6600 used a
       centralized control unit known as the scoreboard.
      The scoreboard kept track of the registers needed by instructions waiting for the various functional units.
      When all registers have valid data, the scoreboard enables the instruction execution.
      When a FU finishes, it signals the scoreboard to release the resources (a toy sketch follows).
      The scoreboard is thus centralized control logic which keeps track of the status of the registers and of the
       multiple functional units.
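A toy sketch of that bookkeeping (register-validity and unit-busy flags only; the real CDC 6600 scoreboard tracked considerably more state):

```python
# Sketch: a scoreboard tracking which registers hold valid data and which
# functional units are busy; an instruction starts only when its sources are
# valid and its unit is free.

class Scoreboard:
    def __init__(self, regs, units):
        self.valid = {r: True for r in regs}   # register -> data valid?
        self.free = {u: True for u in units}   # functional unit -> free?

    def can_issue(self, unit, srcs):
        return self.free[unit] and all(self.valid[r] for r in srcs)

    def start(self, unit, dest):
        self.free[unit] = False
        self.valid[dest] = False               # result pending

    def finish(self, unit, dest):
        self.free[unit] = True                 # unit signals completion
        self.valid[dest] = True                # result now valid

sb = Scoreboard(regs=["R1", "R2", "R3"], units=["ADD"])
sb.start("ADD", dest="R3")                     # R3 now pending
print(sb.can_issue("ADD", srcs=["R3"]))        # False: unit busy, R3 invalid
sb.finish("ADD", dest="R3")
print(sb.can_issue("ADD", srcs=["R3"]))        # True
```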
          In the meantime, the issue stage is not blocked, so other instructions can bypass the blocked
       add.
          It takes 13 clock cycles to perform the operations.
Here we discuss the effects of branching on the performance of pipelined processors.
       TERMS:
            Branch taken: the action of fetching a nonsequential or remote instruction after a branch
             instruction is called branch taken.
            Branch target: the instruction to be executed after a branch taken is called the branch target.
            Delay slot: the number of pipeline cycles wasted between a branch taken and its branch target is
             called the delay slot, denoted by b, where
             0 ≤ b ≤ k − 1 (k = number of pipeline stages)
EFFECT OF BRANCHING
         When a branch is taken, all instructions following the branch in the pipeline become useless and
          will be drained from the pipeline. Thus a branch taken causes the pipeline to be flushed, losing a
          number of pipeline cycles.
Calculating the total execution time for n instructions, including the effect of branching (T_eff):
        T_eff = k + (n − 1) + pqnb        (in clock cycles)
where p is the probability that an instruction is a branch, q is the probability that a branch is taken, and
b is the branch penalty (delay slot) in cycles.
With the worst-case delay slot b = k − 1 and a sufficiently long instruction stream (n → ∞), the effective
pipeline throughput approaches
        H*eff = f / (pq(k − 1) + 1)
where f is the clock frequency. (For example, with p = 0.2, q = 0.6 and k = 8, H*eff = f/1.84 ≈ 0.54f,
         i.e., pipeline performance degrades by 46% with branching when the instruction stream is
         sufficiently long. The above analysis demonstrates the high degree of performance degradation
         caused by branching in an instruction pipeline.)
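Plugging the example numbers into the formulas above (a quick numeric check, assuming b = k − 1):

```python
# Sketch: effect of branching on pipeline throughput, using the formulas above
# with the illustrative values p = 0.2, q = 0.6, k = 8 (so b = k - 1 = 7).

def t_eff(n, k, p, q, b):
    """Total execution time in clock cycles for n instructions."""
    return k + (n - 1) + p * q * n * b

k, p, q = 8, 0.2, 0.6
b = k - 1
n = 100_000                       # a sufficiently long instruction stream

cycles = t_eff(n, k, p, q, b)
ideal = k + (n - 1)               # same stream with no branching
print(f"throughput relative to ideal: {ideal / cycles:.2f}")   # ~0.54
```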
                    Branches are predicted either statically, based on branch code types, or dynamically, based
                     on the branch history during program execution.
                    Static prediction: the frequency and probabilities of branch taken and of branch types across
                     a large number of program traces are used to predict a branch.
                    Such a static branch strategy may not be very accurate.
                    The static prediction direction (taken or not taken) can even be wired into the processor.
                    The wired-in static prediction cannot be changed once committed to the hardware.
                   Dynamic branch prediction requires additional hardware to keep track of the past behavior of
                    the branch instructions at run time.
                   Branch Target Buffers (BTBs) are used to implement branch prediction – the BTB holds recent
                    branch information, including the address of the branch target used. The address of the branch
                    instruction locates its entry in the BTB.
                   A state transition diagram is used for backtracking the last 2 branches in a given
                    program. The BTB entry contains the backtracking information which guides the prediction.
                   The BTB can store not only the branch target address but also the target instruction itself and
                    a few successor instructions, in order to allow zero delay in converting conditional branches to
                    unconditional branches.
DELAYED BRANCHES
      The branch penalty can be reduced if the delay slot is shortened or minimized to zero.
      The purpose of delayed branches is to make this possible, as illustrated in the figure below.
      A delayed branch of d cycles allows at most d − 1 useful instructions to be executed following a
       branch taken.
      The execution of these instructions should be independent of the outcome of the branch instruction.
    The probability of moving one instruction (d = 2 in Fig. a) into the delay slot is greater than 0.6,
    the probability of moving two instructions (d = 3 in Fig. b) is about 0.2,
    and that of moving three instructions (d = 4 in Fig. c) is less than 0.1.
    From the above analysis, one can conclude that delayed branching may be more effective in short
     instruction pipelines with about four stages.
           In case the branch is not taken, execution of the modified program produces the same result as the
            original program.
           In case the branch is taken in the modified program (Fig. b), execution of the delayed
            instructions I1 and I5 is needed anyway.
   •       In a digital computer, arithmetic is performed with finite precision due to the use of fixed-size
           memory words or registers.
   •       Fixed-point or integer arithmetic offers a fixed range of numbers that can be operated upon.
           Floating-point arithmetic operates over a much increased dynamic range of numbers.
   •       In modern processors, fixed-point and floating-point arithmetic operations are very often performed
           by separate hardware on the same processor chip.
FIXED-POINT OPERATIONS
   •       Most computers use the two’s complement notation because of its unique representation of all
           numbers (including zero).
• Add, subtract, multiply and divide are the four primitive arithmetic operations.
     •   For fixed-point numbers, the add or subtract of two n-bit integers (or fractions) produces an n-bit
         result with at most one carry-out.
     •   The multiplication of two n-bit numbers produces a 2n-bit result which requires the use of two
         memory words or two registers to hold the full-precision result.
A binary base is assumed, with r = 2. The 8-bit exponent field e uses an excess-127 code. The dynamic
range of e is (−127, 128), internally represented as (0, 255). The sign s and the 23-bit mantissa field m form
a 25-bit sign-magnitude fraction, including an implicit or "hidden" 1 bit to the left of the binary point. Thus
the complete mantissa actually represents the value 1.m in binary.
This hidden bit is not stored with the number. If 0 < e < 255, then a nonzero normalized number represents
the following algebraic value:
        X = (−1)^s × 2^(e−127) × (1.m)
The four primitive arithmetic operations are defined below for a pair of floating-point numbers represented
by X = (m_x, e_x) and Y = (m_y, e_y). For clarity, we assume e_x ≤ e_y:
        X + Y = (m_x × 2^(e_x − e_y) + m_y) × 2^(e_y)
        X − Y = (m_x × 2^(e_x − e_y) − m_y) × 2^(e_y)
        X × Y = (m_x × m_y) × 2^(e_x + e_y)
        X ÷ Y = (m_x ÷ m_y) × 2^(e_x − e_y)
The above equations clearly identify the number of arithmetic operations involved in each floating-point
function. These operations can be divided into two halves: One half is for exponent operations such as
comparing their relative magnitudes or adding/subtracting them; the other half is for mantissa operations,
including four types of fixed-point operations. Arithmetic shifting operations are needed for equalizing the two
exponents before their mantissas can be added or subtracted.
The arithmetic/logic units (ALUs) perform fixed-point and floating-point operations separately.
These arithmetic units perform scalar operations involving one pair of operands at a time, as well as vector
operations. Scalar and vector arithmetic pipelines differ mainly in the areas of register files and control
mechanisms involved.
         Vector hardware pipelines are often built as add-on options to a scalar processor or as an
          attached processor driven by a control processor.
For example, a typical three stage floating-point adder includes the following stages:-
        1. first stage for exponent comparison and equalization which is implemented with an integer adder
           and some shifting logic;
        2. a second stage for fraction addition using a high-speed carry lookahead adder;
        3. and a third stage for fraction normalization and exponent readjustment using a shifter and
           other addition logic. (Fraction normalization means shifting the mantissa and adjusting the
           exponent so that the leading fraction digit is nonzero.)
           E.g.:  X = 0.9504 × 10^3
                  Y = 0.8200 × 10^2
                                  After Step 1 (equalize exponents):  X = 0.9504 × 10^3
                                                                      Y = 0.0820 × 10^3
                                  Step 2 (add fractions):             Result = 1.0324 × 10^3
                                  Step 3 (normalize, readjust):       Result = 0.10324 × 10^4
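A minimal sketch of the three stages on base-10 (mantissa, exponent) pairs, matching the worked example above; real hardware operates in binary, with a carry-lookahead adder in stage 2:

```python
# Sketch: the three floating-point adder stages on decimal (mantissa, exponent)
# pairs, following the X + Y example above.

def fp_add(x, y):
    (mx, ex), (my, ey) = x, y
    # Stage 1: exponent comparison and equalization (shift the smaller operand)
    if ex < ey:
        mx, ex = mx / 10 ** (ey - ex), ey
    else:
        my, ey = my / 10 ** (ex - ey), ex
    # Stage 2: fraction addition
    m = mx + my
    # Stage 3: normalization and exponent readjustment
    while m >= 1.0:
        m, ex = m / 10, ex + 1
    while 0 < m < 0.1:
        m, ex = m * 10, ex - 1
    return m, ex

print(fp_add((0.9504, 3), (0.8200, 2)))   # ~ (0.10324, 4)
```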
         In general, an n-bit CSA is specified as follows: let X, Y, and Z be three n-bit input numbers,
          expressed as X = (x_{n−1}, x_{n−2}, …, x_1, x_0) and so on. The CSA performs bitwise operations
          simultaneously on all columns of digits to produce two n-bit output numbers, the bitwise sum
          S^b = (0, s_{n−1}, s_{n−2}, …, s_1, s_0) and the carry vector C = (c_n, c_{n−1}, …, c_1, 0).
Note that the leading bit of the bitwise sum S^b is always 0 and the tail bit of the carry vector C is always 0.
The input-output relationships are expressed below:
        s_i = x_i ⊕ y_i ⊕ z_i
        c_{i+1} = x_i y_i ∨ y_i z_i ∨ z_i x_i        for i = 0, 1, …, n − 1
The arithmetic sum of the three input numbers, S = X + Y + Z, is obtained by adding the two output numbers,
i.e. S = S^b + C, using a CPA.
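A small sketch of the CSA relations (word-level Python rather than gate-level logic; the loop mirrors the s_i and c_{i+1} equations above):

```python
# Sketch: an n-bit carry-save add implementing s_i and c_{i+1} above,
# followed by a carry-propagate add (here plain +) to form S = X + Y + Z.

def csa(x, y, z, n):
    sb, c = 0, 0
    for i in range(n):
        xi, yi, zi = (x >> i) & 1, (y >> i) & 1, (z >> i) & 1
        sb |= (xi ^ yi ^ zi) << i                            # bitwise sum s_i
        c |= ((xi & yi) | (yi & zi) | (zi & xi)) << (i + 1)  # carry c_{i+1}
    return sb, c

sb, c = csa(9, 5, 3, n=4)
print(sb, c, sb + c)   # 15 2 17  -> equals 9 + 5 + 3
```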
We use the CPA and CSA to implement the pipeline stages of a fixed- point unit as follows:
Multiply Pipeline Design: consider as an example the multiplication of two 8-bit integers A × B = P,
where P is the 16-bit product. This fixed-point multiplication can be written as the summation of eight
partial products as shown below: P = A × B = P0 + P1 + P2 + … + P7, where × and + are arithmetic
multiply and add operations, respectively.
The summation of the eight partial products is done with a Wallace tree of CSAs plus a CPA at the final
stage, as shown in Fig. 6.23.
      The first stage (S1) generates all eight partial products, ranging from 8 bits to 15 bits
       simultaneously.
       The second stage (S2) is made up of two levels of four CSAs, and it essentially merges eight
        numbers into four numbers ranging from 12 to 15 bits.
       The third stage (S3) consists of two CSAs, and it merges the four numbers from S2 into two 16-bit
        numbers.
       The final stage (S4) is a CPA, which adds up the last two numbers to produce the final product P
        (see the sketch below).
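A compact sketch of the four stages (word-level Python; a hardware Wallace tree wires the CSAs in fixed levels rather than looping):

```python
# Sketch: 8-bit multiply as eight shifted partial products reduced with
# carry-save adders (Wallace-tree style) and a final carry-propagate add.

def csa(x, y, z):
    """Carry-save add: three numbers in, bitwise sum and carry vector out."""
    return x ^ y ^ z, ((x & y) | (y & z) | (z & x)) << 1

def multiply(a, b, n=8):
    # S1: generate the eight partial products P0..P7
    pp = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    # S2/S3: levels of CSAs merge 8 numbers -> 4 -> ... -> 2
    while len(pp) > 2:
        nxt = []
        while len(pp) >= 3:
            s, c = csa(pp.pop(), pp.pop(), pp.pop())
            nxt += [s, c]
        pp = nxt + pp
    # S4: final carry-propagate addition produces the 16-bit product
    return sum(pp)

print(multiply(37, 213) == 37 * 213)   # True
```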
Fig. 6.24 below shows the design of a pipelined floating-point unit built as an on-chip feature of the Motorola
MC68040 processor.
             o Stage 3 contains registers for holding results before they are loaded into the register file in
                 stage 1 for subsequent use by other instructions.
Convergence Division
One technique for division involves repeated multiplications. Mantissa division is carried out by a
convergence method. This convergence division obtains the quotient Q = M/D of two normalized fractions
0.5 ≤ M < D < 1 in two's complement notation by performing two sequences of chain multiplications as
follows:
        Q = (M × R1 × R2 × … × Rk) / (D × R1 × R2 × … × Rk)
where the multipliers are chosen as R_i = 1 + δ^(2^(i−1)) with δ = 1 − D. Since D = 1 − δ, the denominator
D × R1 × … × Rk = 1 − δ^(2^k) converges to 1, and the numerator then converges to the quotient Q.
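A quick numeric sketch of the convergence (the multiplier choice R_i = 1 + δ^(2^(i−1)) is realized by squaring δ on each step):

```python
# Sketch: convergence division Q = M / D for 0.5 <= M < D < 1, using the chain
# multiplications above; each step squares the error term delta = 1 - D.

def convergence_divide(m, d, steps=5):
    delta = 1.0 - d
    for _ in range(steps):
        r = 1.0 + delta          # next multiplier R_i
        m, d = m * r, d * r      # multiply numerator and denominator by R_i
        delta = delta * delta    # denominator is now 1 - delta^2
    return m                     # d has converged to ~1, so m ~ Q

print(convergence_divide(0.55, 0.8), 0.55 / 0.8)   # both ~0.6875
```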
   Static arithmetic pipelines are designed to perform a fixed function and are thus called unifunctional.
   When a pipeline can perform more than one function, it is called multifunctional.
      o A multifunctional pipeline can be either static or dynamic .
      o Static pipelines perform one function at a time, but different functions can be performed at
           different times.
      o A dynamic pipeline allows several functions to be performed simultaneously through the pipeline
           as long as there are no conflicts in the shared usage of pipeline stages.
An example is the static multifunctional pipeline which was designed into the TI Advanced Scientific Computer (ASC).
 There were four pipeline arithmetic units built into the TI-ASC system, as shown in Fig. 6.26 below.
   The instruction processing unit handled the fetching and decoding of instructions.
      o There were a large number of working registers in the processor which also controlled the
          operations of the memory buffer unit and of the arithmetic units.
    There were two sets of operand buffers, {X, Y, Z} and {X', Y', Z'}, in each arithmetic unit.
       o X, X', Y and Y' were used for input operands, and Z and Z' were used to output results.
       o Note that intermediate results could also be routed from Z-registers to either X- or Y-registers.
   Both processor and memory buffers accessed the main memory for instructions and operands/results,
    respectively.
    Each pipeline arithmetic unit had eight stages, as shown in Fig. 6.27a below. The PAU was a static
     multifunction pipeline which could perform only one function at a time. Figure 6.27a shows all the
     possible interstage connections for performing arithmetic, logical, shifting, and data conversion functions.
    Both fixed-point and floating-point arithmetic functions could be performed by this pipeline. The PAU
     also supported vector arithmetic operations in addition to scalar ones. It should be noted that different
     functions required different pipeline stages and different interstage connection patterns.
For example, fixed-point multiplication required the use of only segments S1, S6, S7 and S8, as shown
in Fig. 6.27b. On the other hand, the floating-point dot-product function, which performs the dot-product
operation between two vectors, required the use of all segments with the complex connections shown in
Fig. 6.27c. This dot product was implemented by essentially the following accumulated summation of a
sequence of multiplications through the pipeline:
        Z ← A_i × B_i + Z
where the successive operands (Ai, Bi) were fed through the X and Y -buffers, and the accumulated sums
through the Z-buffer recursively.
Even though the Tl-ASC is no longer in production, the system provided a unique design for multifunction
arithmetic pipelines. Today, most supercomputers implement arithmetic pipelines with dedicated functions
for much simplified control circuitry and faster operations.
Pipeline Design Parameters: Some parameters used in designing the scalar base processor and superscalar
processor are summarized in Table 6.1 for the pipeline processors to be studied below. All pipelines discussed are
assumed to have k stages.
The pipeline cycle for the scalar base processor is assumed to be 1 time unit, called the base cycle. We
define the instruction issue rate, issue latency, and simple operation latency. The instruction-level
parallelism is the maximum number of instructions that can be simultaneously executed in the
pipeline.
For the base processor, all of these parameters have a value of 1. All processor types are designed relative
to the base processor. The ILP is needed to fully utilize a given pipeline processor.
In an m-issue superscalar processor, the instruction decoding and execution resources are increased to form
effectively m pipelines operating concurrently. At some pipeline stages, the functional units may be shared
by multiple pipelines.
This resource-shared multiple-pipeline structure is illustrated by a design example in Fig. 6.28a. In this
design, the processor can issue two instructions per cycle if there is no resource conflict and no data
dependence problem. There are essentially two pipelines in the design. Both pipelines have four
processing stages labeled fetch, decode, execute, and store, respectively.
Each pipeline essentially has its own fetch unit, decode unit and store unit. The two instruction streams
flowing through the two pipelines are retrieved from a single source stream (the I-cache). The fan-out from
a single instruction stream is subject to resource constraints and a data dependence relationship among the
successive instructions.
For simplicity, we assume that each pipeline stage requires one cycle, except the execute stage which
may require a variable number of cycles. Four functional units, multiplier, adder, logic unit, and load
unit, are available for use in the execute stage. These functional units are shared by the two pipelines
on a dynamic basis. The multiplier itself has three pipeline stages, the adder has two stages, and the
others each have only one stage.
The two store units (S1 and S2) can be dynamically used by the two pipelines, depending on availability
at a particular cycle. There is a lookahead window with its own fetch and decode logic. The lookahead
window is used for instruction lookahead in case out-of-order instruction issue is desired to achieve
better pipeline throughput.
It requires complex logic to schedule multiple pipelines simultaneously, especially when the instructions
are retrieved from the same source. The aim is to avoid pipeline stalling and minimize pipeline idle time.
Data dependence: in Fig. 6.28b, a dependence graph is drawn to indicate the relationships among the instructions.
Because the register content in R1 is loaded by I1 and then used by I2, we have flow dependence I1 → I2.
To schedule instructions through one or more pipelines, these data dependences must not be violated (e.g., Fig.
6.28); otherwise, erroneous results may be produced.
Pipeline Stalling
This is a problem which may seriously lower pipeline utilization. Proper scheduling
avoids pipeline stalling. The problem exists in both scalar and superscalar processors. However, it is more
serious in a superscalar pipeline. Pipeline stalling can be caused by data dependences or by resource
conflicts among instructions already in the pipeline or about to enter the pipeline. We use an example
to illustrate the conditions causing pipeline stalling.
Consider the scheduling of two instruction pipelines in a two-issue superscalar processor. Figure 6.29a
shows the case of no data dependence on the left and flow dependence (I1 → I2) on the right. Without data
dependence, all pipeline stages are utilised without idling.
With dependence, instruction I2 entering the second pipeline must wait for two cycles (shaded time slots)
before entering the execution stages. This delay may also pass to the next instruction I4 entering the
pipeline.
In Fig. 6.29b, we show the effect of branching (instruction I2). A delay slot of four cycles results from a
branch taken by I2 at cycle 5. Therefore, both pipelines must be flushed before the target instructions I3 and
I4 can enter the pipelines from cycle 6. Here, delayed branch or other amending actions are not taken.
In Fig. 6.29c, we show a combined problem involving both resource conflict and data dependence.
Instructions I1 and I2 need to use the same functional unit, and the dependence I2 → I4 also exists.
The net effect is that I2 must be scheduled one cycle behind, because the two pipeline stages (e1 and e2) of
the same functional unit must be used by I1 and I2 in an overlapped fashion. For the same reason, I3 is also
delayed by one cycle. Instruction I4 is delayed by two cycles due to the flow dependence on I2. The shaded
boxes in all the timing charts correspond to idle stages.
Instruction issue and completion policies are critical to superscalar processor performance. Three
scheduling policies are introduced below.
   1. In order issue & In order completion : Instructions are issued in program order and instructions
       are completed in program order.
   2. In order issue & Out of order completion : Instructions are issued in program order and
       instructions are completed not in program order.
   3. Out of order issue & Out of order completion : Program order is violated in instruction issue and
       completion
   In-order issue may result in either in-order or out-of-order completion. In-order issue is easier to
   implement but may not yield the optimal performance. The purpose of out-of-order issue and completion
   is to improve performance.
These three scheduling policies are illustrated in Fig 6.30 by the execution of the example program in Fig
6.28b on the dual-pipeline hardware in Fig 6.28a. It is demonstrated that performance can be improved from
an in-order to out-of-order schedule. Not all programs can be scheduled out of order.
In-Order Issue
Figure 6.30a shows a schedule for the six instructions being issued in program order I1, I2, …, I6. Pipeline 1
receives I1, I3, and I5, and pipeline 2 receives I2, I4, and I6 in three consecutive cycles. Due to I1 → I2, I2
has to wait one cycle to use the data loaded in by I1.
I3 is delayed one cycle for the same adder used by I2. I6 has to wait for the result of I5 before it can enter
the multiplier stages. In order to maintain in-order completion, I5 is forced to wait for two cycles to come
out of pipeline 1. In total, nine cycles are needed and five idle cycles (shaded boxes) are observed.
In Fig. 6.30b, out-of-order completion is allowed even when in-order issue is practiced. The only difference
between this out-of-order schedule and the in-order schedule is that I5 is allowed to complete ahead of I3
and I4, which are totally independent of I5. The total execution time does not improve; however, the
pipeline utilization rate does.
Only three idle cycles are observed. Note that in Figs. 6.30a and 6.30b, we did not use the lookahead
window. In order to shorten the total execution time, the window can be used to reorder the instruction
issues.
Out-of-Order Issue
By using the lookahead window, instruction I5 can be decoded in advance because it is independent of all
the other instructions. The six instructions are issued in three cycles as shown: I5 is fetched and decoded by
the window, while I3 and I4 are decoded concurrently.
It is followed by issuing I6 and I1 at cycle 2, and I2 at cycle 3. Because the issue is out of order, the
completion is also out of order as shown in Fig. 6.30c. Now, the total execution time has been reduced to
seven cycles with no idle stages during the execution of these six instructions.
The in-order issue and completion is the simplest one to implement. It is rarely used today even in a
conventional scalar processor due to some unnecessary delays in maintaining program order. However, in a
multiprocessor environment, this policy is still attractive. Allowing out-of-order completion can be found in
both scalar and superscalar processors.
Output dependence and antidependence are the two relations preventing out-of-order completion. Out of
order issue gives the processor more freedom to exploit parallelism, and thus pipeline efficiency is
enhanced. It should be noted that multiple-pipeline scheduling is an NP-complete problem. Optimal
scheduling is very expensive to obtain.
Simple data dependence checking, a small lookahead window, and scoreboarding mechanisms are needed,
along with an optimizing compiler, to exploit instruction parallelism in a superscalar processor.
Motorola 88110 Architecture
The Motorola 88110 was an early superscalar RISC processor. It combined the three-chip set of one CPU
(88100) chip and two cache (88200) chips into a single-chip implementation, with additional
improvements. The 88110 employed advanced techniques for exploiting instruction-level parallelism,
including superscalar instruction issue, out-of-order instruction completion, speculative execution, dynamic
instruction rescheduling, and two on-chip caches. The unit also supported demanding graphics and digital
signal processing applications.
The 88110 employed a symmetrical superscalar instruction dispatch unit which dispatched two instructions
each clock cycle into an array of 10 concurrent units. It allowed out-of-order instruction completion and
some out-of-order instruction issue, and branch prediction with speculative execution past branches.
The instruction set of the 88110 extended that of the 88100 in integer and floating-point operations. It
added a new set of capabilities to support 3-D color graphics image rendering. The 88110 had separate,
independent instruction and data paths, along with split caches for instructions and data. The instruction
cache was 8K-byte, 2-way set-associative with 128 sets, two blocks per set, and 32 bytes (8
instructions) per block. The data cache resembled the instruction cache.
The 88110 employed the MESI cache coherence protocol. A write-invalidate procedure guaranteed that at
most one processor on the bus had a modified copy of any cache block at any time. The 88110 was implemented
with 1.3 million transistors in a 299-pin package and driven by a 50 MHz clock.
Superscalar Performance
Qn: What is the performance of a superscalar processor?
To compare the relative performance of a superscalar processor with that of a scalar base machine, we
estimate the ideal execution time of N independent instructions through the pipeline.
The base scalar machine executes the N instructions in T(1,1) = k + N − 1 base cycles, while the m-issue
superscalar machine needs T(m,1) = k + (N − m)/m cycles. The ideal speedup of the superscalar machine over
the base machine is therefore
        S(m,1) = T(1,1) / T(m,1) = m(N + k − 1) / (N + m(k − 1))
As N → ∞, the speedup S(m,1) → m.
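A quick check of the formula (the k and N values below are illustrative):

```python
# Sketch: ideal superscalar speedup S(m,1) from the formula above, for a
# k = 4 stage pipeline and a few instruction-stream lengths N.

def speedup(m, n, k):
    return m * (n + k - 1) / (n + m * (k - 1))

k = 4
for n in (16, 256, 4096):
    print(n, round(speedup(2, n, k), 2))   # approaches m = 2 as N grows
```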
As illustrated in Fig. 6.31, the DEC Alpha was a 64-bit superscalar processor. The design emphasized speed,
multiple-instruction issue, multiprocessor applications, software migration from VAX/VMS and MIPS/OS, and a
long list of usable features. The clock rate was 150 MHz with the first chip implementation.
Unlike others, the Alpha architecture had thirty-two 64-bit integer registers and thirty-two 64-bit
floating point registers. The integer pipeline had 7 stages, and the floating-point pipeline had 10
stages. All Alpha instructions were 32 bits. The first Alpha implementation issued two instructions per
cycle, with a larger number of issues in later implementations. Pipeline timing hazards, load delay slots, and
branch delay slots were all minimized by hardware support. The Alpha was designed to support fast
multiprocessor interlocking and interrupts. The processor was designed to have a 300-MIPS peak and a
150-Mflops peak at 150 MHz.