Introduction to Computer Architecture
(Parallel and Pipeline processors)
                KR Chowdhary
               Professor & Head
       Email: kr.chowdhary@gmail.com
         webpage: krchowdhary.com
    Department of Computer Science and Engineering
          MBM Engineering College, Jodhpur
              November 22, 2013
Classification of Parallel Processors
     ◮   One classification, by M. J. Flynn, considers the organization of a
         computer system by the number of instruction and data streams that
         can be manipulated simultaneously.
     ◮   The sequence of instructions read from memory constitutes an
         instruction stream.
     ◮   The operations performed on the data in the processor constitute a
         data stream.
     ◮   Parallel processing may occur in the instruction stream, in the data
         stream, or in both.
     1. Single-instruction single-data stream (SISD): Instructions are
        executed sequentially.
     2. Single-instruction multiple-data streams (SIMD): All processors
        receive the same instruction from the control unit, but operate on
        different sets of data.
     3. Multiple-instruction single-data streams (MISD): Of theoretical
        interest only, as no practical organization can be constructed using
        it.
     4. Multiple-instruction multiple-data streams (MIMD): Several programs
        can execute at the same time. Most multiprocessors come in this
        category.
Flynn’s taxonomy of computer architecture (1966)
   Figure 1: SISD and SIMD architectures. SISD: the control unit sends an
   instruction stream to a single processor P, which exchanges a data stream
   with memory M. SIMD: one control unit (CU) broadcasts a single instruction
   stream to processors P1 . . . Pn, each exchanging its own data stream with
   its own memory M1 . . . Mn; the program is loaded from a front end.

   Figure 2: MIMD architecture. Each of the n control units issues its own
   instruction stream to its own processor, which exchanges a data stream
   with its own memory.
   One class of parallel processing that does not fit into this classification
   is pipeline processing.
Parallel processing
         Parallel Processing:
     ◮   Increasing speed by doing many things in parallel.
     ◮   Let P be a sequential processor executing a task T sequentially. If T
         is partitioned into n subtasks T1, T2, . . . , Tn of approximately the
         same size, then a machine P′ (say) having n processors can be
         programmed so that all the subtasks of T execute in parallel.
     ◮   Then P′ executes up to n times faster than P.
     ◮   A CPU failure is fatal to a sequential processor, but not necessarily
         to a parallel processor.
     ◮   Some applications of parallel computers (processors) are:
          1. Expert systems for AI
          2. Fluid flow analysis
          3. Seismic data analysis
          4. Long-range weather forecasting
          5. Computer-assisted tomography
          6. Nuclear reactor modeling
          7. Visual image processing
          8. VLSI design
     ◮   The typical characteristics of parallel computing are: a vast amount
         of computation, floating-point arithmetic, and a vast number of
         operands.
Computational Models
    ◮   We assume that a given computation can be divided into concurrent tasks
        for execution on the multiprocessor.
     ◮   Equal Duration Model: A given task can be divided into n equal
         subtasks, each of which can be executed by one processor. If ts is
         the execution time of the whole task using a single processor, then
         the time taken by each processor to execute its subtask is
         tm = ts/n. Since, according to this model, all processors execute
         their subtasks simultaneously, the time taken to execute the whole
         task is also tm = ts/n.
     ◮   The speedup factor of a parallel system can be defined as the ratio
         of the time taken by a single processor to solve a given problem
         instance to the time taken by a parallel system of n processors to
         solve the same problem instance.
                         S(n) = speedup factor
                              = ts / tm
                              = ts / (ts/n) = n
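
     ◮   As a concrete illustration of the equal-duration model, the sketch
         below splits one task (summing an array) into n equal subtasks, one
         per POSIX thread. The array size, thread count, and names are
         illustrative assumptions, not from the slides.

          /* Equal-duration model sketch: one task, NPROC equal subtasks,
           * one subtask per thread. Compile with: cc -pthread sum.c */
          #include <pthread.h>
          #include <stdio.h>

          #define N     1000000          /* total task size (illustrative) */
          #define NPROC 4                /* number of "processors"         */

          static double a[N];
          static double partial[NPROC];  /* one partial result per subtask */

          static void *subtask(void *arg) {
              long id = (long)arg;
              long lo = id * (N / NPROC), hi = lo + N / NPROC;
              double s = 0.0;
              for (long i = lo; i < hi; i++)
                  s += a[i];             /* each subtask takes about ts/n  */
              partial[id] = s;
              return NULL;
          }

          int main(void) {
              pthread_t t[NPROC];
              double total = 0.0;
              for (long i = 0; i < N; i++) a[i] = 1.0;
              for (long i = 0; i < NPROC; i++)
                  pthread_create(&t[i], NULL, subtask, (void *)i);
              for (long i = 0; i < NPROC; i++) {
                  pthread_join(t[i], NULL);
                  total += partial[i];   /* combine the n partial results  */
              }
              printf("sum = %.0f\n", total);  /* expect 1000000 */
              return 0;
          }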
Pipelining
      ◮   A typical example of parallel processing is a one-dimensional array
          of processors: there are n identical processors P1 . . . Pn, each
          having its own local memory. The processors communicate by message
          passing (send/receive).
                       Figure 3: Pipeline processing. Processors P1, P2, . . . , Pn
                       are connected in a linear array and communicate by serial
                       I/O operations (send/receive).
      ◮ In total, n operations go on in parallel.
      ◮ A pipeline constitutes a sequence of processing circuits, called
        segments or stages.
      ◮ An m-stage pipeline has the same throughput as m separate units.
                      Figure 4: Pipeline segments. Segment i consists of a buffer
                      register Ri feeding a computing element Ci (Ri: buffer
                      register, Ci: computing element); the output of the last
                      segment Cm is latched in register R.
Pipeline Processors
     1. Instruction pipeline: the transfer of instructions through the various
        stages of the CPU during the instruction cycle: fetch, decode,
        execute. Thus, there can be three different instructions in different
        stages of execution: one being fetched, the previous one being
        decoded, and the one before that being executed.
    2. Arithmetic pipeline: The data is computed through different stages.
                      Figure 5: Instruction and data pipeline examples. Top: a
                      3-segment instruction pipeline (segment 1: instruction
                      fetch, segment 2: instruction decode, segment 3:
                      instruction execute) that takes instruction addresses in
                      and delivers instruction results. Bottom: a 4-segment
                      floating-point adder computing X + Y for X = (XM, XE) and
                      Y = (YM, YE); segment 1 compares exponents, segment 2
                      aligns the mantissas, segment 3 adds the mantissas, and
                      segment 4 normalizes the result.
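
      ◮   The four adder segments above can be sketched in C using
          frexp/ldexp to expose the mantissa and exponent of a double; the fp
          struct and the input values are illustrative assumptions (link with
          -lm).

          /* Sketch of the 4-segment floating-point addition: compare
           * exponents, align mantissas, add mantissas, normalize. */
          #include <stdio.h>
          #include <math.h>

          typedef struct { double m; int e; } fp;  /* m in [0.5, 1) for x > 0 */

          int main(void) {
              double x = 3.25, y = 10.5;
              fp X, Y;
              X.m = frexp(x, &X.e);
              Y.m = frexp(y, &Y.e);

              int e = X.e > Y.e ? X.e : Y.e;      /* seg 1: compare exponents */
              double xm = ldexp(X.m, X.e - e);    /* seg 2: align mantissas   */
              double ym = ldexp(Y.m, Y.e - e);
              double m = xm + ym;                 /* seg 3: add mantissas     */
              int adj;
              m = frexp(m, &adj);                 /* seg 4: normalize result  */
              printf("%g\n", ldexp(m, e + adj));  /* prints 13.75             */
              return 0;
          }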
Pipe-lining Example
      ◮   Consider an example: compute Ai ∗ Bi + Ci for i = 1, 2, 3, 4, 5. The
          pipeline uses registers R1–R5, a multiplier, and an adder unit.
              Figure 6: A segment comprising registers and computing elements.
              R1 and R2 feed the multiplier, whose output is latched in R3; R4
              holds Ci; the adder's output is latched in R5.
          R1 ← Ai , R2 ← Bi        ; input Ai and Bi
          R3 ← R1 ∗ R2 , R4 ← Ci   ; multiply, and input Ci
          R5 ← R3 + R4             ; add Ci to the product
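
      ◮   The register transfers above can be played out directly in C: each
          loop iteration below acts as one clock pulse, and the stages are
          updated in reverse order so that every stage reads the values
          latched on the previous pulse. This is a sketch of the behaviour,
          not of the hardware; the sample values of Ai, Bi, Ci are assumed.

          /* 3-segment pipeline for Ai*Bi + Ci: one iteration per clock pulse. */
          #include <stdio.h>

          int main(void) {
              double A[] = {1, 2, 3, 4, 5}, B[] = {6, 7, 8, 9, 10},
                     C[] = {1, 1, 1, 1, 1};
              int n = 5, k = 3;
              double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;

              for (int t = 1; t <= k + (n - 1); t++) {  /* 7 clock pulses */
                  /* reverse order: each stage reads last pulse's values  */
                  if (t >= k)
                      R5 = R3 + R4;                             /* segment 3 */
                  if (t >= 2 && t <= n + 1)
                      { R3 = R1 * R2; R4 = C[t - 2]; }          /* segment 2 */
                  if (t <= n)
                      { R1 = A[t - 1]; R2 = B[t - 1]; }         /* segment 1 */
                  if (t >= k)
                      printf("pulse %d: R5 = %g\n", t, R5);
              }
              return 0;
          }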
Pipe-lining Example
   Table 1: Computation of expression Ai ∗ Bi + Ci in space and time in 3-stage pipeline.
                 Clock pulse    Segment 1    Segment 2     Segment 3
                 no.            R1 , R2      R3 , R4       R5
                 1              A1 , B1      −, −          −
                 2              A2 , B2      A1 ∗ B1 , C1  −
                 3              A3 , B3      A2 ∗ B2 , C2  A1 ∗ B1 + C1
                 4              A4 , B4      A3 ∗ B3 , C3  A2 ∗ B2 + C2
                 5              A5 , B5      A4 ∗ B4 , C4  A3 ∗ B3 + C3
                 6              −, −         A5 ∗ B5 , C5  A4 ∗ B4 + C4
                 7              −, −         −, −          A5 ∗ B5 + C5
      ◮   Any operation that can be decomposed into a sequence of
          sub-operations of about the same complexity can be implemented by a
          pipeline processor.
      ◮   Consider a k-segment pipeline with clock cycle time tp sec., on
          which a total of n tasks (T1 , T2 , . . . , Tn ) are to be executed.
      ◮   T1 requires time k·tp sec. The remaining n − 1 tasks emerge from the
          pipeline at the rate of one task per clock cycle and are completed
          in time (n − 1)·tp sec, so the total number of clock cycles required
          is k + (n − 1).
      ◮   For k = 3 segments and n = 5 tasks this is 3 + (5 − 1) = 7, as is
          clear from Table 1.
Computational Models
     ◮   Consider a non-pipelined unit that performs the same operation and
         takes time tu to complete each task. The total time for n tasks is
         n·tu. The speedup for k segments and clock period tp is:

                         S(n) = n·tu / ((k + (n − 1))·tp)              (1)

     ◮   For a large number of tasks, n ≫ k − 1, so k + n − 1 ≈ n and

                         S(n) = n·tu / (n·tp)                          (2)
                              = tu / tp                                (3)
     ◮   Instruction pipelining is similar to the use of an assembly line in
         a manufacturing plant.
     ◮   An instruction's execution is broken into many steps, which
         indicates the scope for pipelining.
     ◮   Pipelining requires registers to store data between stages.
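
     ◮   A quick numeric check of equations (1)–(3). It assumes,
         illustratively, that the non-pipelined unit needs the full k stages
         for each task, i.e. tu = k·tp, so S(n) approaches tu/tp = k for
         large n.

          /* Pipeline speedup, eq. (1): S(n) = n*tu / ((k + n - 1) * tp). */
          #include <stdio.h>

          double speedup(int k, int n, double tu, double tp) {
              return (n * tu) / ((k + n - 1) * tp);
          }

          int main(void) {
              int k = 3;            /* pipeline segments                     */
              double tp = 1.0;      /* clock period (illustrative units)     */
              double tu = k * tp;   /* assumed: non-pipelined time per task  */
              printf("n = 5:    S = %.3f\n", speedup(k, 5, tu, tp));
              /* 15/7 = 2.143 */
              printf("n = 1000: S = %.3f\n", speedup(k, 1000, tu, tp));
              /* 3000/1002 = 2.994, approaching tu/tp = 3 */
              return 0;
          }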
Computational Models
         Parallel computation with serial section model:
     ◮   It is assumed that a fraction f of a given task (computation) cannot
         be divided into concurrent subtasks. The remaining fraction (1 − f)
         is assumed to be divisible (for example, f may correspond to data
         input).
    ◮   The time required to execute the task on n processors is:
                         tm = f·ts + (1 − f)·(ts/n)                    (4)
    ◮   The speedup is therefore,
                         S(n) = ts / (f·ts + (1 − f)·(ts/n))           (5)
                              = n / (1 + (n − 1)·f)                    (6)
     ◮   So, S(n) is primarily determined by the fraction of the code that
         cannot be divided.
     ◮   If the task is completely serial (f = 1), then no speedup can be
         achieved even by a parallel processor.
     ◮   For n → ∞,
                         S(n) = 1/f                                    (7)
         which is the maximum speedup.
Computational Models
     ◮   The improvement in performance (speed) of a parallel algorithm over
         a sequential one is limited not by the number of processors but by
         the fraction of the algorithm (code) that cannot be parallelized
         (Amdahl's law).
    ◮   Considering the communication overhead:
                         S(n) = ts / (f·ts + (1 − f)·(ts/n) + tc)      (8)
                              = n / (f·(n − 1) + 1 + n·(tc/ts))        (9)
    ◮   For n → ∞,
                         S(n) = n / (f·(n − 1) + 1 + n·(tc/ts))       (10)
                              = 1 / (f + tc/ts)                       (11)
     ◮   Thus, S(n) depends also on the communication overhead tc.
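
     ◮   Equations (6) and (9)–(11) can be checked numerically; the serial
         fraction f = 0.05 and overhead ratio tc/ts = 0.01 below are
         illustrative assumptions.

          /* Amdahl's law, eq. (6), and its communication-overhead form,
           * eq. (9); f = serial fraction, r = tc/ts. */
          #include <stdio.h>

          double amdahl(double n, double f) {
              return n / (1 + (n - 1) * f);
          }

          double amdahl_comm(double n, double f, double r) {
              return n / (f * (n - 1) + 1 + n * r);
          }

          int main(void) {
              double f = 0.05, r = 0.01;
              printf("n=100, no overhead: S = %.2f\n", amdahl(100, f));
              /* 100/5.95 = 16.81 */
              printf("limit n->inf:       S = %.2f\n", 1 / f);
              /* eq. (7): 20.00 */
              printf("n=100, tc/ts=0.01:  S = %.2f\n", amdahl_comm(100, f, r));
              /* 100/6.95 = 14.39 */
              printf("limit, eq. (11):    S = %.2f\n", 1 / (f + r));
              /* 16.67 */
              return 0;
          }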
Pipe-lining Processors
   Instruction Pipe-lining: the typical stages of the pipeline are:
     1. FI (fetch instruction)
     2. DI (decode Instruction)
     3. CO (calculate operands)
     4. FO (fetch operands)
     5. EI (execute instruction)
     6. WO (write operands)
Instruction Pipe-lining
      ◮    Nine different instructions are to be executed
       ◮    The six-stage pipeline can reduce the execution time for 9
            instructions from 54 time units to 14 time units.
   time units →
     Instruc.   1    2     3    4    5     6        7        8        9        10           11   12   13    14
       In1      FI   DI   CO   FO   EI    WO
       In2           FI   DI   CO   FO    EI        WO
       In3                FI   DI   CO    FO        EI     WO
       In4                     FI   DI    CO        FO     EI        WO
       In5                          FI    DI        CO     FO        EI       WO
       In6                                FI        DI     CO        FO       EI        WO
       In7                                          FI     DI        CO       FO        EI       WO
       In8                                                 FI        DI       CO        FO       EI   WO
       In9                                                           FI       DI        CO       FO   EI   WO
       ◮    The diagram assumes that each instruction goes through all 6
            stages of the pipeline.
       ◮    But, for example, a load instruction does not need WO.
       ◮    It is also assumed that there are no memory conflicts, even
            though, for example, FI, FO, and WO all require memory access
            (possibly together).
       ◮    The value may be in cache, or FO/WO may be null.
       ◮    The six stages may not be of equal duration, and a conditional
            branch or interrupt instruction may invalidate several fetches.
       ◮    After which stage should the pipeline check for a conditional
            branch/interrupt?
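
       ◮    The diagram can be generated mechanically, since in the ideal
            case (no stalls) instruction i simply enters stage s at time unit
            i + s; a sketch:

          /* Prints the ideal 6-stage timing diagram: instruction i enters
           * stage s at time unit i + s (no hazards or stalls assumed). */
          #include <stdio.h>

          int main(void) {
              const char *stage[] = {"FI", "DI", "CO", "FO", "EI", "WO"};
              int n = 9, k = 6;
              for (int i = 0; i < n; i++) {
                  printf("In%d:", i + 1);
                  for (int t = 0; t < i; t++) printf("   ");  /* idle slots */
                  for (int s = 0; s < k; s++) printf(" %s", stage[s]);
                  printf("\n");
              }
              printf("total time units: %d\n", k + n - 1);    /* 6 + 8 = 14 */
              return 0;
          }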
Factors in Instruction Pipe-lining
      ◮   There is overhead in each stage of the pipeline for
          buffer-to-buffer data movement.
      ◮   The amount of control logic needed to handle memory/register
          dependencies increases with the size of the pipeline.
      ◮   It takes time for the buffers to operate.
      ◮   Pipeline hazards occur when the pipeline, or a portion of it,
          stalls:
           1. Resource hazards: two or more instructions in the pipeline
              require the same resource, say the ALU or a register (also
              called structural hazards).
           2. Data hazards: a conflict in memory/register access.
           3. Control hazards (also called branch hazards): a wrong decision
              in branch prediction.
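
      ◮   A small illustration of a data hazard, specifically a
          read-after-write (RAW) dependency; the register names are
          illustrative:

          /* RAW (read-after-write) data hazard, in pseudo-assembly:
           *
           *   ADD R1, R2, R3   ; R1 <- R2 + R3
           *   SUB R4, R1, R5   ; reads R1 before ADD has written it back
           *
           * Without operand forwarding, SUB must stall until ADD completes
           * its write stage. The equivalent C dependency chain: */
          int raw_example(int r2, int r3, int r5) {
              int r1 = r2 + r3;   /* producer                               */
              int r4 = r1 - r5;   /* consumer: needs r1 before it can start */
              return r4;
          }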
Vector Processing
      ◮   In many computational applications, a problem can be formulated in
          terms of vectors and matrices. Processing these with a special
          computer is called vector processing.
      ◮   A vector is V = [V1 V2 V3 . . . Vn ]. The element Vi is represented
          as V [i]. A program for adding two vectors A and B of length 100,
          to produce vector C , is given below.
      ◮   Scalar concept:

           for(i=0; i < 100; i++)
            c[i]=b[i]+a[i];

      ◮   In machine language we write it as:

                 mvi  i, 0      ; initialize index
           loop: read a[i]
                 read b[i]
                 add            ; c[i] = a[i] + b[i]
                 store c[i]
                 inc  i         ; i = i + 1
                 cmp  i, 100
                 jnz  loop

      ◮   This loop accesses the arrays A and B, and only the counter needs
          to be updated. A vector-processing computer eliminates the repeated
          fetching and decoding of the loop instructions: they are fetched
          once and decoded once, but executed 100 times. This allows the
          operation to be specified simply as:

           C (1 : 100) = A(1 : 100) + B(1 : 100)
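
      ◮   On current hardware the same idea appears as SIMD instructions. A
          sketch using the standard x86 SSE2 intrinsics (this illustrates the
          principle on a modern CPU; it is not the slides' vector machine):

          /* C(0:99) = A(0:99) + B(0:99) with SSE2: one _mm_add_pd adds two
           * doubles at once, so the loop body runs 50 times instead of 100. */
          #include <emmintrin.h>
          #include <stdio.h>

          int main(void) {
              double a[100], b[100], c[100];
              for (int i = 0; i < 100; i++) { a[i] = i; b[i] = 2 * i; }

              for (int i = 0; i < 100; i += 2) {
                  __m128d va = _mm_loadu_pd(&a[i]);   /* two elements of A */
                  __m128d vb = _mm_loadu_pd(&b[i]);   /* two elements of B */
                  _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));
              }
              printf("c[99] = %g\n", c[99]);          /* 99 + 198 = 297 */
              return 0;
          }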
Vector Processing
      ◮   A vector instruction includes the initial addresses of the
          operands, the length of the vectors, and the operation to be
          performed, all in one composite instruction. The addition is done
          with a pipelined floating-point adder. It is possible to design the
          vector processor to store all operands in registers in advance.
      ◮   It can be applied in matrix multiplication, for [l × m] × [m × n].
RISC v/s CISC
     ◮   CISC (Complex Instruction Set Computer):
          1. Complex programs motivated complex and powerful HLLs. This
             produced a semantic gap between HLLs and machine languages, and
             required more effort in compiler construction.
          2. The attempt to reduce the semantic gap and simplify compiler
             construction motivated more powerful instruction sets (CISC:
             Complex Instruction Set Computing).
          3. CISC provides better support for HLLs.
          4. Fewer instructions per program, thus smaller program size, less
             memory, and faster access.
     ◮   RISC (Reduced Instruction Set Computer):
          1. A large number of general-purpose registers, and use of compiler
             technology to optimize register usage.
          2. Register-to-register operations (advantage?)
          3. Simple addressing modes.
          4. A limited and simple instruction set (one instruction per
             machine cycle).
          5. Advantage of simple instructions?
          6. Optimizing the pipeline.
Exercises
     1. Design a pipeline configuration to carry out the task to compute:
                                       (Ai + Bi )/(Ci + Di )
      2. Construct a pipeline to add 100 floating-point numbers, i.e., find
         the result of x1 + x2 + · · · + x100 .
     3. 3.1 List the advantages of designing a floating point processor in the form of a
             k-segment pipeline rather than a k-unit parallel processor.
          3.2 A floating-point pipeline has four segments S1 , S2 , S3 , S4 ,
              whose delays are 100, 90, 100, and 110 nanoseconds,
              respectively. What is the pipeline's maximum throughput in
              MFLOPS?
      4. It is frequently argued that large (super)computers are approaching
         their performance limits, and that future advances in large
         computers will depend on interconnecting large numbers of
         inexpensive computers. List the arguments for and against.
      5. List the features to be added to sequential programming languages in
         order to use them on a large interconnection of small, inexpensive
         computers.
      6. Let S1 , S2 , . . ., Sk denote a sequence of k operations in a
         program. Suppose that executing these operations on a uniprocessor
         produces the same results regardless of the order of execution of
         the k Si 's. Show that this does not imply that the Si 's are
         parallelizable on a multiprocessor.
Exercises
      7. Prove that the general problem of determining whether two program
         segments S1 , S2 are parallelizable is undecidable, by showing that
         a solution to this problem implies a solution to the halting problem
         for Turing machines. (Hint: assume that an algorithm A to determine
         parallelizability exists, and consider applying A to a program
         containing the statement: if Turing machine T halts after at most n
         steps then S1 else S2 .)
Bibliography
        M. Morris Mano, “Computer System Architecture”, 3rd Edition, Pearson,
        2006.
        William Stallings, “Computer Organization and Architecture”, 8th
        Edition, Pearson, 2010.
        John P. Hayes, “Computer Architecture and Organization”, 2nd Edition,
        McGraw-Hill International Edition, 1988.